$30
Assignment Description
Graph structural clustering is a common data analysis task to cluster vertices by their edge connections in the graph. SCAN (Structural Clustering Algorithm for Networks) [1] is such an algorithm that clusters vertices based on a structural similarity measure. The algorithm is efficient in both computation and memory.
Figure 1: Illustration of data structure.
In our implementation, a graph G = (V,E) is stored in a CSR structure as shown in Fig. 1: where one array Neighbors stores the neighbors of all vertices, and the other array Offsets stores the offset of each vertex into the Neighbors array.
Please note that: (1) The length of the array Offsets is |V | + 1), with the last element of this array points to the end of Neighbors. (2) Accordingly, the length of Neighbors is the number of edges plus 1 (|E| + 1).
The SCAN algorithm, shown in Algorithm 1, can be divided into two stages: (1) find pivot vertices using the structural similarity measure, (2) expand clusters starting from the pivot vertices in the depth-first order.
Assignment 2
Input : a graph G = (V,E), the structural similarity threshold , and the minimal cluster size µ
Output: the total number of clusters in G, and the cluster ID of each vertex
1 main()
3 4
5
6
7
8
9
10
11
12
13
14 15
16
17
18
19
20
21
22
24 25
26
27
28
29
Input and output
In this assignment, you will implement a pthread version of the SCAN algorithm.
In the assignment folder:
• clustering_sequential.cpp is the sequential version for your reference. You can check the clustering result and compare its running time.
• clustering.h provides some utility funictions.
• main.cpp contains the main function of the pthread version.
• clustering_pthread_skeleton.cpp is the code skeleton for your work. Your task is to complete the following function which returns the array cluster_result:
int *scan(float epsilon, int mu, int num_threads, int num_vs, int num_es, int
*nbr_offs, int *nbrs);
• The datasets and results folders contain the datasets and clustering results respectively.
Here is the output of executing the algorithm on ./sequential ../../data/test1/1.txt 0.7
3:
Screen output:
Elapsed Time: 0.000016516 s Number of clusters: 2
Result file output (sequential.txt):
2
-1 -1 -1 -1 -1 -1 6 6 -1 -1 -1 -1 -1 13 -1 13 -1 -1
2 in the first line is the number of clusters. The second line 6 13 is the cluster IDs (-1 if not in any cluster) of all vertices in order. Please note that in the parallel setting, we set the cluster ID be the lowest vertex ID of the pivots in the cluster. Make sure your file output follows the same format.