CS314 Project 3: Efficient Parallel Graph Matching

1.1          Graph Matching Problem
Given a graph G = (V,E), where V is the set of vertices (also called nodes) and E ⊆ [V]² is the set of edges (unordered vertex pairs). A matching M in G is a set of pairwise non-adjacent edges, i.e. no two edges in M share a common vertex. A vertex is matched if it is an endpoint of one of the edges in the matching, and unmatched if it does not belong to any edge in the matching. Fig. 1 shows examples of possible matchings for a given graph.

A maximum matching is a matching in which the total weight of the edges is maximized. In Fig. 1, (c) is a maximum matching, with a total edge weight of 7; matchings (a) and (b) have total weights of 3 and 2, respectively.




Figure 1: Graph Matching Examples

1.2          Parallel Graph Matching
Most well-known matching algorithms, such as the blossom algorithm, are embarrassingly sequential and hard to parallelize. In this project, we adopt a handshaking-based algorithm that is amenable to parallelization and a good fit for GPUs.

In the handshaking-based algorithm, every vertex v extends a hand to the neighbor sitting on the maximum-weight edge incident to v. If two vertices extend hands to each other, they shake hands, and the edge between them is added to the matching. An example is shown in Fig. 2(b): node A extends a hand to D since edge (A,D) has the largest weight among all edges incident to node A; nodes C and F shake hands because they extend hands to each other.



Figure 2: One-Way Handshaking Matching Example

It is possible that multiple incident edges of a node share the maximum weight. In this project, we let the algorithm pick the neighbor with the smallest vertex index. For example, in Fig. 2(b), among the maximum-weight neighbors of vertex E, we pick vertex B since it has the smallest index (in alphabetical order) among all of E's neighbors connected by an edge of the maximum weight 4.

The handshaking algorithm needs to run one or more passes. A one-pass handshaking checks all nodes once and only once. At the end of every pass, we remove all matched nodes and check whether any unmatched nodes remain. If any remaining unmatched nodes are connected to each other, another pass of handshaking must be performed. We repeat this until no more edges can be added to the matching. Fig. 2 shows two passes of handshaking.

The handshaking algorithm is highly data-parallel, since each vertex is processed independently and the find-maximum-weight-neighbor step involves only reads to shared data, no writes. It is a greedy algorithm that attempts to maximize the total weight of the matching.

2            Finding Maximum-weight Neighbor
In the handshaking algorithm, finding the maximum-weight neighbor of each node is nontrivial to parallelize. In this project, you will implement two GPU kernel functions that together identify the maximum-weight neighbor of each node in the graph. You may implement other components of the handshaking algorithm as well; however, this is not required. See Section 6 for the extra-credit functionality if you are interested.

2.1          Data Structure


src:    0  0  1  1  1  2  2  3  3  3  4  4
dst:    1  4  0  2  3  1  3  1  2  4  0  3
weight: 3  2  3  1  4  1  5  4  5  3  2  3

Figure 3: A graph encoded as an edge list

The graph is encoded as an edge list consisting of three arrays: src, dst, and weight, such that src[n] is the source node of the n-th edge, dst[n] is the destination node of the n-th edge, and weight[n] is the weight of the n-th edge. The graph is undirected, so if src[n]=x and dst[n]=y, then there exists an edge m such that src[m]=y, dst[m]=x, and weight[m]=weight[n].

An example of this representation is shown in Fig. 3. The example graph has five vertices and a total of six undirected edges, hence twelve entries in the edge list. The edge list is arranged such that edges with the same source node are placed together, and edges with the same source node are sorted by destination index. The relative order of the edges is important; please do not modify it.

We have provided graph I/O functions for you. The code given to you reads and parses graphs stored in matrix format. After a graph is parsed, the three arrays src[ ], dst[ ], and weight[ ] contain the graph information, and pointers to them are stored in the GraphData struct, which looks like this:

struct GraphData {
    int numNodes;
    int numEdges;
    int * src;
    int * dst;
    int * weight;
};
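
For concreteness, here is how the Fig. 3 graph would populate this struct. This is a hand-built illustration only; in the project, the arrays are filled by the provided parser, not by you.

/* The Fig. 3 edge list packed into a GraphData (illustration only). */
int src[]    = {0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4};
int dst[]    = {1, 4, 0, 2, 3, 1, 3, 1, 2, 4, 0, 3};
int weight[] = {3, 2, 3, 1, 4, 1, 5, 4, 5, 3, 2, 3};
struct GraphData graph = {5, 12, src, dst, weight};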

We use the global arrays strongNeighbor[ ] and strongNeighbor_gpu[ ] to store the maximum-weight-neighbor results: strongNeighbor[i] stores the maximum-weight neighbor of node i as computed on the CPU, and strongNeighbor_gpu[i] stores the same as computed on the GPU. In the end the two should be identical, except that one is allocated in CPU memory and the other in GPU memory. For the example graph of Fig. 3, the result would be strongNeighbor=[1,3,3,2,3].

The main function for the required component of the project is located in src/mainStrong.cu. It calls the write_match_result function, which writes the results into the specified output file. The output contains one entry per node in the graph. For example, if strongNeighbor=[4,2,1,4,0], it will output those five numbers, in that order, separated by line breaks.

There are two GPU kernel functions that you are required to implement. Their arguments (inputs and outputs) are all described in the file gpuHeaders.cuh, along with the extra-credit functions. Your implementations should go inside the gpu_required directory, which contains a separate file for each function. The two functions are described in Section 2.2 and Section 2.3, respectively.

2.2          Parallel Segment-scan
To find the maximum-weight neighbor, we use a parallel segment-scan. We first define segment-scan: let P be an input array of segment IDs, A an input array of data elements, and O the output array. Given an arbitrary reduce function f, the result of the segment-scan is

O[j] = f(A[i], A[i+1], ..., A[j]) such that i = min{x | 0 ≤ x ≤ j, P[x] = P[j]}

For example, if we use the max function, which returns the largest of its arguments, as the reduce function, then each output element is the maximum of all elements so far in its segment. In other words,

O[j] = max(A[i], A[i+1], ..., A[j]) such that i = min{x | 0 ≤ x ≤ j, P[x] = P[j]}

For example, suppose P=[0,0,0,1,2,2,3,3,3,3] and A=[6,1,9,2,7,4,2,8,3,9]. Then the output of a segment-scan for maximum values is O=[6,6,9,2,7,7,2,8,8,9]. Note that in this segment-scan example, the last element of each segment holds the maximum of the entire segment.
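
As a point of reference, the same max segment-scan is easy to express sequentially. The following is a minimal CPU sketch, not part of the provided code:

/* Sequential max segment-scan over segment IDs P and data A.
   O[j] is the running maximum of A within the segment containing j. */
void segmentScanMax(const int *P, const int *A, int *O, int n) {
    for (int j = 0; j < n; j++) {
        if (j > 0 && P[j] == P[j - 1] && O[j - 1] > A[j])
            O[j] = O[j - 1];  /* carry the segment maximum forward */
        else
            O[j] = A[j];      /* first element of a segment, or a new max */
    }
}

Running it on the P and A above reproduces exactly the O listed above.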

In this project, we divide the edge list into segments such that each segment contains all edges with the same source vertex. In Fig. 3 and Fig. 5, each segment is given a different color to make them visually distinct. Here indices 0-1 (inclusive) form the first segment, indices 2-4 form the second segment, and so on.

Each edge is associated with a weight, and we want to find the edge with the maximum weight within each segment. In Fig. 5, the strongestDst array shows the result of this segment-scan on the graph from Fig. 3.

P[ ] = [0, 0, 0, 1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4]
(seg1 = indices 0-2, seg2 = index 3, seg3 = indices 4-9, seg4 = indices 10-11, seg5 = indices 12-15)

Figure 4: Example of a segment scan

A parallel segment-scan can be performed with logarithmic time complexity, as Fig. 4 illustrates. In Fig. 4, array P marks the segments and array A contains the input data we operate on. Array B contains the output of the first iteration, and arrays C and D contain the outputs of the second and third iterations, respectively.

During each iteration of the parallel segment-scan, each (independent) task picks two elements separated by a stride s and checks whether they are in the same segment; if so, it compares the two elements and stores the maximum in the appropriate location of the output array. A parallel segment-scan may involve multiple iterations: the first iteration uses stride s = 1, and the stride doubles at every iteration.

In the example of Fig. 4, one thread is in charge of updating one data element of the output array at a time. In the first iteration, a thread in charge of the i-th output element B[i] does the following: if P[i-1] = P[i], it compares A[i] with A[i-1] and writes B[i] = max(A[i], A[i-1]); if P[i] != P[i-1], it directly copies A[i] to B[i]. In the second iteration, a thread operates on elements with a stride of 2, taking the previous output B as its input: if P[i] = P[i-2], it writes C[i] = max(B[i], B[i-2]); if P[i] != P[i-2], then C[i] = B[i]. The stride keeps doubling until it is large enough that any two compared elements lie in different segments. The example in Fig. 4 demonstrates the idea, taking three iterations in total. Be aware that we have only described the idea of the segment-scan algorithm; we have not discussed how to handle boundary cases, such as when an element is the first element of the array. You have to handle the boundary cases in your code.

The kernel function you need to implement performs only one iteration of the segment-scan. It is given the stride (referred to as 'distance' in the code) as an argument, so it knows which two elements to compare. To avoid race conditions, the input and output are separate arrays in each iteration. Two arrays in total suffice, since we can swap the two pointers at the end of each iteration. You do not have to worry about allocating the input or output arrays.
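
The surrounding driver loop is provided for you; in rough outline it behaves like the host-side sketch below. All names and parameters here are illustrative assumptions, not the exact provided code.

#include <utility>  // std::swap

/* Illustrative host-side driver (hypothetical names throughout): launches
   one scan iteration at a time, doubling the stride and swapping the
   ping-pong buffers, until an iteration changes nothing. */
void runStrongestNeighborScan(int *src, int *dstA, int *dstB,
                              int *weightA, int *weightB,
                              int *madeChanges_gpu, int numEdges,
                              int numBlocks, int threadsPerBlock) {
    int distance = 1, madeChanges;
    do {
        cudaMemset(madeChanges_gpu, 0, sizeof(int));
        strongestNeighborScan_gpu<<<numBlocks, threadsPerBlock>>>(
            src, dstA, dstB, weightA, weightB,
            madeChanges_gpu, distance, numEdges);
        cudaMemcpy(&madeChanges, madeChanges_gpu, sizeof(int),
                   cudaMemcpyDeviceToHost);
        std::swap(dstA, dstB);         // last output becomes next input
        std::swap(weightA, weightB);
        distance *= 2;
    } while (madeChanges != 0);
}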

You can let a thread be in charge of updating one element of the output array at a time. A thread might need to handle more than one data element, so you might need a loop within the kernel. See Section 2.4 for more details.

Recall that the maximum-weight neighbor is the one with the greatest weight. In the event of a tie, the neighbor with the smaller vertex ID should be treated as stronger. For example, if source node 1 has two possible destinations, nodes 2 and 3, both with weight 4, then node 2 is the stronger neighbor since 2 < 3. The edge-list ordering supports this tie-break: edges are sorted by source node first, and within a segment by destination node in ascending order, so the earlier of two equal-weight elements always carries the smaller destination ID. Your comparison therefore needs to respect ties, e.g. by preferring the element with the smaller destination ID when the two weights are equal.

Note that our implementation requires operating on two output data arrays in conjunction: an array where the ID of the maximum-weight neighbor is stored, and an array where the maximum weight itself is stored, each ending up in the last element of its segment. Your implementation needs to account for that.

The first function you need to implement, which performs one iteration of this segment scan, is declared below:

__global__ void strongestNeighborScan_gpu(int * src, int * oldDst, int * newDst, int * oldWeight, int * newWeight, int * madeChanges, int distance, int numEdges)

In this kernel function, the arrays src, oldDst, and oldWeight are the inputs, and newDst and newWeight are the outputs. Each of these arrays contains numEdges elements. Additionally, there is an output called madeChanges, a pointer to a single integer, used for detecting the termination condition, i.e. when to stop the segment-scan iterations. A thread should set (*madeChanges) to 1 if and only if oldDst[i] != newDst[i] for any of the data items i that it handles. The program given to you uses this variable to determine whether the segment-scan process should terminate; if it is not set properly, the result may be an infinite loop.

For example, suppose we have src=[0,0,1,2,2,2,3,3], oldDst=[2,3,3,0,1,3,0,2], and oldWeight=[3,1,1,3,1,2,1,2], and are using a stride of distance=1. Then the result of the kernel will be that newDst=[2,2,3,0,0,3,0,2], newWeight=[3,3,1,3,3,2,1,2], and *madeChanges=1.
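
One possible shape for this kernel is sketched below, assuming the grid-stride thread mapping of Section 2.4 and the tie-break described above. Treat it as a hedged starting point, not the reference solution.

/* Sketch of one segment-scan iteration (not the official solution).
   Each thread walks the edge list with a grid-stride loop and compares
   element i with its earlier partner at i - distance. */
__global__ void strongestNeighborScan_gpu(int * src, int * oldDst,
        int * newDst, int * oldWeight, int * newWeight,
        int * madeChanges, int distance, int numEdges)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int numThreads = blockDim.x * gridDim.x;
    for (int i = tid; i < numEdges; i += numThreads) {
        int j = i - distance;              // earlier partner element
        int d = oldDst[i], w = oldWeight[i];
        // The partner wins if it is in the same segment and strictly
        // heavier, or equally heavy with a smaller destination ID.
        if (j >= 0 && src[j] == src[i] &&
            (oldWeight[j] > w || (oldWeight[j] == w && oldDst[j] < d))) {
            d = oldDst[j];
            w = oldWeight[j];
        }
        newDst[i] = d;
        newWeight[i] = w;
        if (d != oldDst[i])
            *madeChanges = 1;              // benign race: every writer stores 1
    }
}

You can check this sketch against the example above; it reproduces the listed newDst, newWeight, and *madeChanges values.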

2.3          Pack Maximum-weight Neighbors
After the previous step, the maximum-weight neighbor of each segment is stored in the last element of that segment in the output array(s); most other elements of these arrays contain no useful information. We want to collate the useful parts, producing an output array that has one element per segment and stores only the maximum-weight-neighbor information.

The kernel function you need to implement is declared as follows:

__global__ void collateSegments_gpu(int * src, int * scanResult, int * output, int numEdges)

This function should identify the last element of every segment and pack these elements into the output array. The inputs src and scanResult represent, respectively, the source nodes of the edge list and the segment-scan result (the destination-index part). Each thread is in charge of one edge of the edge list at a time, and needs to determine whether the element it is handling is the last one in its segment. To do that, a thread in charge of the i-th element compares src[i] with src[i+1]; if they are not equal, the i-th element is the last one in its segment (the final element of the edge list is always the last element of its segment). Once a thread identifies that it is handling the last element of a segment, say element x, it stores scanResult[x] into output[src[x]].

For example, suppose we have source array src=[0,0,1,2,2,2,3,3] and the strongest destinations scanResult=[2,3,0,0,1,1,0,2]. Then the result of collateSegments would be output=[3,0,1,2].
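
Under the same grid-stride mapping, a minimal sketch of this kernel, again a hedged starting point rather than the reference solution, could look like:

/* Sketch of segment collation (not the official solution). The i-th
   edge ends a segment when it is the last edge overall or the next
   edge has a different source node. */
__global__ void collateSegments_gpu(int * src, int * scanResult,
                                    int * output, int numEdges)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int numThreads = blockDim.x * gridDim.x;
    for (int i = tid; i < numEdges; i += numThreads) {
        if (i == numEdges - 1 || src[i] != src[i + 1])
            output[src[i]] = scanResult[i];
    }
}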

2.4          Thread Assignment
It is important to map tasks to threads in a reasonable manner. Furthermore, although GPU kernels can be launched with a very large number of threads, we may not want to do so, because doing so incurs significant kernel launch overhead. In this project, we do not guarantee that the number of threads allocated for a kernel equals the number of tasks to perform; your code within a kernel needs to account for that.

For example, suppose we are working on an array with 1024 elements but have only 256 threads. Each thread should then work on four elements. To ensure memory coalescing, the first thread should handle array[0], array[256], array[512], and array[768], while the second thread handles array[1], array[257], array[513], and array[769], and so on. This mapping distributes the work evenly across threads while allowing the GPU to combine adjacent memory requests made by adjacent threads.

We use one-dimensional grids and one-dimensional thread blocks, making it straightforward to calculate both a thread's ID and the total number of threads launched for the kernel. Each thread can query the total number of threads using the expression blockDim.x * gridDim.x.

Also note that the number of tasks to process is not necessarily a multiple of the number of threads; your code needs condition checks to account for that. For both kernels you implement, the number of tasks equals the number of edges, as indicated by the numEdges argument of both functions.
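
Putting this section's points together, the standard pattern is a grid-stride loop, as already used in the sketches above. A generic version, with a placeholder kernel name and body:

/* Generic grid-stride loop: works for any numTasks, coalesces adjacent
   accesses, and handles task counts that are not a multiple of the
   thread count. */
__global__ void example_kernel(int * data, int numTasks)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread ID
    int numThreads = blockDim.x * gridDim.x;          // total threads
    for (int i = tid; i < numTasks; i += numThreads) {
        // process task i here
    }
}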
