Starting from:

$25

SoP - Community Structure in Networks - Final Project

1             Introduction
In complex networks, a network is said to have community structure if the nodes of the network can be grouped into groups of nodes with dense connections internally, and sparser connections between groups.

In this project you will implement an algorithm for detecting community structures (or clusters) in a network. The ability to detect such groups is of significant importance. For example, partitioning a protein-protein interaction network into clusters can provide a modular view of the network, with different groups of nodes performing different functions within the cell.

This document will first describe the mathematical basis for your algorithm, and then describe the code requirements and implementation. There will not be many implementation details, and no tester is provided – implementation and correctness are up to you. You will be graded for code modularity, design, readability, and performance.

2             Division Algorithm
This section describes the mathematical basis for your project. It does not describe your code, but rather the mathematical foundations your code will be based upon, as well as the algorithms you should implement and use.

We represent a network by a graph G = (V,E), and let A be the adjacency matrix of G:

(

                                                                       1      i and j are neighbors, (i,j) ∈ E

A =

                                                                       0   otherwise

We assume ∀i : Aii = 0.

A given group of vertices in a network is considered a community if the number of edges within the group is significantly more than expected (by chance). We define the modularity of a group as the number of edges within the group minus the expected number of edges in a random graph with the same degrees.

Formally, let ki be the degree of vertex i in G, and let M = Pi ki. The expected number of edges between i and . Let S be a sub-group of vertices, S ⊆ V , we define the modularity of S as:

 

Our goal in this project is thus to find a division that maximizes the modularity, that is, a division of the graph into groups such that the modularity Q, the sum of all group modularities, is close to maximal.

2.1           Dividing Into Two
Let us initially focus on a good division of the network into just two communities.

For a particular division of G into two groups, for each vertex i let si = +1 if it belongs to the first group and si = −1 if it belongs to the second group. Observe that:

 and j are in the same group, si = sj otherwise

Thus we can express the modularity as:

 

Since Pi,j Aij = Pi ki = M:

 

Let s = {s1,...,sn} be a column vector, and let B be the modularity matrix, . We can rewrite Q as:

 

Our goal is to find a vector s, a division of G, that maximizes the modularity Q.

Note that B is a symmetric matrix, thus it is diagonalizable with n eigenvalues. In each row and column its elements sum to zero, thus B has the eigenvector (1,...,1) with a corresponding eigenvalue 0.

Let β1,...,βn be the eigenvalues of B such that β1 ≥ ··· ≥ βn, and let u1,...,un be the corresponding normalized eigenvectors. For all vectors with a fixed norma c, the vector cu1 achieves the maximum value for Q (which is c2β1). However, recall we look for a vector s whose elements are ±1. Thus, we need to make s as close to being parallel to u1 as possible, which is equivalent to maximizing the dot product . The maximum is achieve by setting si = +1 if the element i of u1 is positive and si = −1 otherwise.

In other words, we need to obtain the leading eigenvector of B, i.e., the eigenvector of B with the largest eigenvalue. Then, vertices whose corresponding elements are positive go in one group, and the rest of the vertices go in the other group.

 

It is possible for there to be no positive eigenvalues of B. In this case, the leading eigenvector is the vector (1,...,1), corresponding to all vertices in a single group together. This is the correct result: in this case, the algorithm is telling us that there is no division of the network that results in a positive modularity. The modularity of an undivided network is zero, which is the best that can be achieved.

Additionally, it could happen that although the leading eigenvalue is positive, sT Bs is non-positive, i.e., the division represented by s has a worse modularity than a trivial division into one group.

The flow of the algorithm is as follows:

 

Algorithm 1 Divide a network into two modules

 

1.    Compute leading eigenpair u1 and β1 of the modularity matrix B

2.    if (β1 ≤ 0):

2.1.    The network is indivisible

3.    Compute s = {s1,...,sn} where si ∈ {+1,−1}, according to u1

4.    if (sT Bs ≤ 0):

4.1.    The network is indivisible

5.    return a division into two groups according to s

 

2.2           Dividing Into Modularity Groups
In the previous section we described a method that divides a network into two groups. Many networks, however, contain more than two communities, thus we would like to find good divisions of networks into a larger number of groups.

The standard approach to this problem, which we will adopt, is repeated division into two. We divide the network into two parts, then divide each of those parts, and so forth.

To divide a group, we cannot use the same method since it must take into consideration the entire graph, instead we need to look at the additional change ∆Q to the modularity. Suppose we divide a group g of size ng in two, the change depends on pairs of elements in g. Prior to the division, the contribution of pairs in g to Q was Pi,j∈g Bij. Let s be a vector of size ng with ±1 elements representing a division of g into two groups, and define δij as follows:

(

                                                                                                 1     if i = j

δij =

                                                                                                 0   otherwise

It follows that:

 

Let B[g] be the symmetric  sub-matrix of B corresponding to the rows and columns of g. Let fig = Pj∈g B[g]ij. Let B[g] be a symmetric ng × ng matrix with elements:

Bb[g]ij = B[g]ij − δijfig

We can rewrite ∆Q as:

 

This equation has the same form as before, thus we can apply the same algorithm to maximize ∆Q, according to the leading eigenpair of Bb[g].

Thus, Algorithm 1 can be updated to any group g as follows:

 

Algorithm 2 Divide a group into two

 

1.    Compute leading eigenpair u1 and β1 of the modularity matrix Bb[g]

2.    if (β1 ≤ 0):

2.1.    The group g is indivisible

3.    Compute s = {s1,...,sn} where si ∈ {+1,−1}, according to u1

4.    if (sT Bb[g]s ≤ 0):

4.1.    The group g is indivisible

5.    return a division of g into two groups according to s

 

Note that if g is the entire graph, then ∀i : fig = 0 and thus , allowing us to use Algorithm 2 instead of Algorithm 1 in all cases.

Finally, we can use Algorithm 2 to repeatedly divide the module into groups, as described in the following algorithm:

 

Algorithm 3 Divide a network into modularity groups

 

1. Start with a trivial division into one group: P ← {{1,...,n}} 2. Start with an empty output set of groups: O ← {}

3.    Repeat until P is empty:

3.1. Remove a group g from P 3.2. Divide g into g1,g2 with Algorithm 2

3.3.    if either g1 or g2 is of size 0:

3.3.1.     Add g to O

3.4.    else:

3.4.1.     Add to O: any group (g1 and/or g2) of size 1

3.4.2.     Add to P: any group (g1 and/or g2) larger than 1

4.    Output the division given by O

 

2.3           Modularity Maximization
In this section we describe a method to further optimize a division of into two groups. You should supplement this method into the previous algorithms.

Suppose {g1,g2} is an initial division of g into two groups. Our goal is to optimize this division further. To achieve this, find a vertex v that, when moved the other group, will give the biggest increase in modularity of the complete network, or the smallest decrease if no increase is possible, and move it to the other group.

Repeat this process with the constraint that each vertex may only be moved once, until all vertices have been moved. Once done, from all the states of division into two groups during the operation, find the state that has the maximal modularity, and start again from this state.

We repeat the entire process iteratively until no further improvement is found, i.e., until the maximal state is the current one.

3             Leading Eigenpair
In order to find the leading eigenpair of a diagonalizable matrix we will use power iteration.

Power iteration was described in the previous assignments, and uses matrix-vector multiplications to find the dominant eigenvector, the eigenvector with the largest absolute eigenvalue.

Let k be the number of iterations until we stop, such that bk is our approximation of the dominant eigenvector, an approximation of the corresponding dominant eigenvalue can be found as follows:

 

3.1           Matrix Shifting
Power iteration approximates the dominant eigenpair; however, in the division algorithm we search for the leading eigenpair. That is, we search for the eigenpair with the largest eigenvalue, while power iteration provides us with the eigenpair with the largest absolute eigenvalue.

In order to correct this, we will use matrix shifting to ensure the eigenpair we find is the leading eigenpair.

Let C be a matrix with the eigenvalues {λ1,...,λn}, and let the 1-norm of C be the sum of its largest column:  

Let C0 = ||C||1 ·I +C, then C0 has the same eigenvectors as C and the eigenvalues of C0 are {λ1 +||C||1,...,λn +||C||1}. Thus, the dominant eigenpair of C0 is (λ1 +||C||1,v1), and the leading eigenpair of C is (λ1,v1).

4             Implementation
Your program should be named cluster.

It receives two command-line arguments. The 1st is an input filename, and the 2nd is an output filename. The input of your program is a network (a graph), and the output is a list of groups (the division).

Use the algorithms described in order to divide the network into modularity groups. Optimize your code with the tools you have learned. For example, can the modularity matrix B (and  ]) be represented as a sparse matrix?

4.1           File Format
The input and output files are both binary files consisting only of integers.

Recall that binary files are not ”human-readable” and cannot be edited in a text editor. There are no lines, whitespace, or other separators, rather these files consist only of a stream of bytes. The size of each value in the file is 4, the size of int on most machines, including Nova. This can be determined in code by sizeof(int).

Below, we describe the input and output file formats. Sample files are provided to you. Use these samples to check your file formats, not the correctness of your code!

4.1.1          Input File
The first value represents the number of nodes in the network, n = |V |.

The second value represents the number of edges of the first node, i.e., k1. It is followed by the k1 indices of its neighbors, in increasing order.

The next value is k2, followed by the k2 indices of the neighbors of the second node, then k3 and its k3 neighbors, and so on until node n.

4.1.2          Output File
The first value represents the number of groups in the division.

The second value represents the number of nodes in the first group, followed by the indices of the nodes in the group, in increasing order.

The next value is the number of nodes in the second group, followed by the indices of the nodes in group, then the number of nodes and indices of nodes in the third group, and so on until the last group.


More products