NTHU Homework 3: All-Pairs Shortest Path


1 GOAL

This assignment asks you to solve the all-pairs shortest path problem with CPU threads, and then to accelerate the program further with CUDA using the Blocked Floyd-Warshall algorithm. Along the way, you will see how powerful GPUs can be. Finally, we encourage you to explore different optimization strategies to earn performance points.

2 REQUIREMENTS

● In this assignment, you are asked to implement 3 versions of programs that solve the all-pairs shortest path problem.

■ CPU version (hw3-1)
◆ You are required to use threading to parallelize the computation in your program.
◆ You can choose any threading library or framework you like (pthread, std::thread, OpenMP, Intel TBB, etc).
◆ You can choose any algorithm to solve the problem.
◆ You must implement the shortest path algorithm yourself. (Do not use libraries to solve the problem. Ask TA if unsure).

■ Single-GPU version (hw3-2)
◆ Should be optimized to get the performance points (20%).

■ Multi-GPU version (hw3-3)
◆ Must use 2 GPUs. A single-GPU version is not accepted and will get 0 for both the correctness and performance scores in hw3-3 (even if you get AC on the scoreboard).

3 BLOCKED FLOYD-WARSHALL ALGORITHM

Given a $V \times V$ matrix $W = [w(i, j)]$, where $w(i, j) \ge 0$ represents the distance (weight of the edge) from vertex $i$ to vertex $j$ in a directed graph with $V$ vertices, we define a $V \times V$ matrix $D = [d(i, j)]$, where $d(i, j)$ denotes the shortest-path distance from vertex $i$ to vertex $j$. Let $D^{(k)} = [d^{(k)}(i, j)]$ be the result in which all the intermediate vertices are in the set $\{0, 1, 2, \ldots, k-1\}$.

We define $d^{(k)}(i, j)$ as the following:

$$
d^{(k)}(i, j) =
\begin{cases}
w(i, j) & \text{if } k = 0; \\
\min\big(d^{(k-1)}(i, j),\; d^{(k-1)}(i, k-1) + d^{(k-1)}(k-1, j)\big) & \text{if } k \ge 1.
\end{cases}
$$

The matrix $D^{(V)} = [d^{(V)}(i, j)]$ gives the answer to the all-pairs shortest path problem.

In the blocked all-pairs shortest path algorithm, we partition $D$ into $\lceil V/B \rceil \times \lceil V/B \rceil$ blocks of $B \times B$ submatrices. The number $B$ is called the blocking factor. For instance, in figure 1, we divide a $6 \times 6$ matrix into $3 \times 3$ submatrices (or blocks) by $B = 2$.

Figure 1: Divide a matrix by B = 2

The blocked version of the Floyd-Warshall algorithm will perform $\lceil V/B \rceil$ rounds, and each round is divided into 3 phases. It performs $B$ iterations in each phase.
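As a point of comparison before any blocking, the recurrence for $d^{(k)}(i, j)$ can be sketched as the classic triple loop below. This is only a minimal sequential baseline, not a required or official implementation. The function name `floyd_warshall`, the row-major `std::vector<int>` layout, and the constant `kInf` are assumptions made for illustration; the value of `kInf` mirrors the "no path" value from the output specification.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// "No path" marker (2^30 - 1). The sum of two such values still fits in a
// signed 32-bit int, so the relaxation below cannot overflow.
constexpr int kInf = (1 << 30) - 1;

// Hypothetical helper: in-place, unblocked Floyd-Warshall on a row-major
// V x V matrix. On entry dist holds w(i, j) with 0 on the diagonal and kInf
// where there is no edge; on return it holds the shortest-path distances.
void floyd_warshall(std::vector<int>& dist, int V) {
    assert(V >= 0 && dist.size() == static_cast<std::size_t>(V) * V);
    const std::size_t n = static_cast<std::size_t>(V);
    for (std::size_t k = 0; k < n; ++k) {          // allow vertex k as an intermediate
        for (std::size_t i = 0; i < n; ++i) {
            for (std::size_t j = 0; j < n; ++j) {
                dist[i * n + j] = std::min(dist[i * n + j],
                                           dist[i * n + k] + dist[k * n + j]);
            }
        }
    }
}
```

Iteration `k` of the outer loop here corresponds to step $k + 1$ of the recurrence, since the code numbers intermediate vertices from 0.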

A block is identified by its index $(I, J)$, where $0 \le I, J < \lceil V/B \rceil$. The block with index $(I, J)$ is denoted by $D^{(k)}_{(I,J)}$. In the following explanation, we assume $V = 6$ and $B = 2$. The execution flow is described step by step as follows:

● Phase 1: self-dependent blocks.

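In the $k$-th round, the first phase is to compute the $B \times B$ pivot block $D^{(k \cdot B)}_{(k-1,k-1)}$.

For instance, in the 1st round, $D^{(2)}_{(0,0)}$ is computed as follows:

$$
\begin{aligned}
d^{(1)}(0, 0) &= \min\big(d^{(0)}(0, 0),\; d^{(0)}(0, 0) + d^{(0)}(0, 0)\big) \\
d^{(1)}(0, 1) &= \min\big(d^{(0)}(0, 1),\; d^{(0)}(0, 0) + d^{(0)}(0, 1)\big) \\
d^{(1)}(1, 0) &= \min\big(d^{(0)}(1, 0),\; d^{(0)}(1, 0) + d^{(0)}(0, 0)\big) \\
d^{(1)}(1, 1) &= \min\big(d^{(0)}(1, 1),\; d^{(0)}(1, 0) + d^{(0)}(0, 1)\big) \\
d^{(2)}(0, 0) &= \min\big(d^{(1)}(0, 0),\; d^{(1)}(0, 1) + d^{(1)}(1, 0)\big) \\
d^{(2)}(0, 1) &= \min\big(d^{(1)}(0, 1),\; d^{(1)}(0, 1) + d^{(1)}(1, 1)\big) \\
d^{(2)}(1, 0) &= \min\big(d^{(1)}(1, 0),\; d^{(1)}(1, 1) + d^{(1)}(1, 0)\big) \\
d^{(2)}(1, 1) &= \min\big(d^{(1)}(1, 1),\; d^{(1)}(1, 1) + d^{(1)}(1, 1)\big)
\end{aligned}
$$

Note that the result of $d^{(2)}$ depends on the result of $d^{(1)}$ and therefore cannot be computed in parallel with the computation of $d^{(1)}$.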

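● Phase 2: pivot-row and pivot-column blocks.

In the $k$-th round, it computes all $D^{(k \cdot B)}_{(h,k-1)}$ and $D^{(k \cdot B)}_{(k-1,h)}$ where $h \ne k - 1$.

The result of the pivot-row / pivot-column blocks depends on the result in phase 1 and itself.

For instance, in the 1st round, the result of $D^{(2)}_{(0,2)}$ depends on $D^{(2)}_{(0,0)}$ and $D^{(0)}_{(0,2)}$:

$$
\begin{aligned}
d^{(1)}(0, 4) &= \min\big(d^{(0)}(0, 4),\; d^{(2)}(0, 0) + d^{(0)}(0, 4)\big) \\
d^{(1)}(0, 5) &= \min\big(d^{(0)}(0, 5),\; d^{(2)}(0, 0) + d^{(0)}(0, 5)\big) \\
d^{(1)}(1, 4) &= \min\big(d^{(0)}(1, 4),\; d^{(2)}(1, 0) + d^{(0)}(0, 4)\big) \\
d^{(1)}(1, 5) &= \min\big(d^{(0)}(1, 5),\; d^{(2)}(1, 0) + d^{(0)}(0, 5)\big) \\
d^{(2)}(0, 4) &= \min\big(d^{(1)}(0, 4),\; d^{(2)}(0, 1) + d^{(1)}(1, 4)\big) \\
d^{(2)}(0, 5) &= \min\big(d^{(1)}(0, 5),\; d^{(2)}(0, 1) + d^{(1)}(1, 5)\big) \\
d^{(2)}(1, 4) &= \min\big(d^{(1)}(1, 4),\; d^{(2)}(1, 1) + d^{(1)}(1, 4)\big) \\
d^{(2)}(1, 5) &= \min\big(d^{(1)}(1, 5),\; d^{(2)}(1, 1) + d^{(1)}(1, 5)\big)
\end{aligned}
$$

● Phase 3: other blocks.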

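In the $k$-th round, it computes all $D^{(k \cdot B)}_{(h_1,h_2)}$ where $h_1, h_2 \ne k - 1$.

The result of these blocks depends on the result from phase 2 and itself.

For instance, in the 1st round, the result of $D^{(2)}_{(1,2)}$ depends on $D^{(2)}_{(1,0)}$ and $D^{(2)}_{(0,2)}$:

$$
\begin{aligned}
d^{(1)}(2, 4) &= \min\big(d^{(0)}(2, 4),\; d^{(2)}(2, 0) + d^{(2)}(0, 4)\big) \\
d^{(1)}(2, 5) &= \min\big(d^{(0)}(2, 5),\; d^{(2)}(2, 0) + d^{(2)}(0, 5)\big) \\
d^{(1)}(3, 4) &= \min\big(d^{(0)}(3, 4),\; d^{(2)}(3, 0) + d^{(2)}(0, 4)\big) \\
d^{(1)}(3, 5) &= \min\big(d^{(0)}(3, 5),\; d^{(2)}(3, 0) + d^{(2)}(0, 5)\big) \\
d^{(2)}(2, 4) &= \min\big(d^{(1)}(2, 4),\; d^{(2)}(2, 1) + d^{(2)}(1, 4)\big) \\
d^{(2)}(2, 5) &= \min\big(d^{(1)}(2, 5),\; d^{(2)}(2, 1) + d^{(2)}(1, 5)\big) \\
d^{(2)}(3, 4) &= \min\big(d^{(1)}(3, 4),\; d^{(2)}(3, 1) + d^{(2)}(1, 4)\big) \\
d^{(2)}(3, 5) &= \min\big(d^{(1)}(3, 5),\; d^{(2)}(3, 1) + d^{(2)}(1, 5)\big)
\end{aligned}
$$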

Figure 2: The 3 phases of the blocked FW algorithm in the first round.

Figure 3: The computations of $D^{(2)}_{(0,2)}$, $D^{(2)}_{(1,2)}$ and their dependencies in the first round.

Figure 4: In this particular example where 𝑉 = 6 and 𝐵 = 2, we will require ⌈𝑉/𝐵⌉ = 3 rounds.
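The three phases described in this section can be sketched sequentially on the CPU as follows. This is a minimal illustration of the round and phase structure under stated assumptions, not the reference solution and not a GPU implementation. The names `relax_block` and `blocked_floyd_warshall`, the row-major layout, and the 0-based round index `r` (the text's $k$-th round corresponds to `r = k - 1`) are all hypothetical choices. Partial edge blocks are handled by clamping to `V` rather than by padding the matrix.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Relax every cell of block (bi, bj) through the (up to) B pivot vertices of
// round r. The k loop is outermost, so iteration k + 1 sees the results of
// iteration k, as phase 1 requires.
static void relax_block(std::vector<int>& d, int V, int B, int r, int bi, int bj) {
    const int k_end = std::min((r + 1) * B, V);
    const int i_end = std::min((bi + 1) * B, V);
    const int j_end = std::min((bj + 1) * B, V);
    const std::size_t n = static_cast<std::size_t>(V);
    for (int k = r * B; k < k_end; ++k)
        for (int i = bi * B; i < i_end; ++i)
            for (int j = bj * B; j < j_end; ++j)
                d[i * n + j] = std::min(d[i * n + j], d[i * n + k] + d[k * n + j]);
}

// Hypothetical sequential driver: ceil(V / B) rounds, 3 phases per round.
// Entries are assumed to be at most 2^30 - 1 so that sums fit in an int.
void blocked_floyd_warshall(std::vector<int>& d, int V, int B) {
    assert(B > 0 && V >= 0 && d.size() == static_cast<std::size_t>(V) * V);
    const int rounds = (V + B - 1) / B;
    for (int r = 0; r < rounds; ++r) {
        // Phase 1: the self-dependent pivot block.
        relax_block(d, V, B, r, r, r);
        // Phase 2: pivot-row and pivot-column blocks.
        for (int h = 0; h < rounds; ++h) {
            if (h == r) continue;
            relax_block(d, V, B, r, r, h);
            relax_block(d, V, B, r, h, r);
        }
        // Phase 3: all remaining blocks.
        for (int h1 = 0; h1 < rounds; ++h1)
            for (int h2 = 0; h2 < rounds; ++h2)
                if (h1 != r && h2 != r) relax_block(d, V, B, r, h1, h2);
    }
}
```

Within phase 2 and within phase 3, the blocks do not depend on one another, which is what makes those phases natural candidates for one thread block per matrix block on a GPU.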

4 RUN YOUR PROGRAMS

● Command line specification

# CPU
srun -N1 -n1 -cCPUS ./hw3-1 INPUTFILE OUTPUTFILE

# Single-GPU
srun -N1 -n1 --gres=gpu:1 ./hw3-2 INPUTFILE OUTPUTFILE

# Multi-GPU
srun -N1 -n1 -c2 --gres=gpu:2 ./hw3-3 INPUTFILE OUTPUTFILE

○ CPUS: Number of CPUs, specified by TA.

○ INPUTFILE: The pathname of the input file. Your program should read the input graph from this file.

○ OUTPUTFILE: The pathname of the output file. Your program should output the shortest path distances to this file.

● Input specification

○ The input is a directed graph with non-negative edge distances.

○ The input file is a binary file containing 32-bit integers. You can use the int type in C/C++.

○ The first two integers are the number of vertices (V) and the number of edges (E).

○ Then, there are E edges. Each edge consists of 3 integers:

1. source vertex id ($src_i$)
2. destination vertex id ($dst_i$)
3. edge weight ($w_i$)

○ The values of vertex indexes & edge indexes start at 0.

○ The ranges for the input are:

● $2 \le V \le 6000$ (CPU)

● $2 \le V \le 40000$ (Single-GPU)

● (Multi-GPU)

●

● $0 \le src_i, dst_i < V$

● $src_i \ne dst_i$

● if $src_i = src_j$ then $dst_i \ne dst_j$ (there will not be repeated edges)

● $0 \le w_i \le 1000$

Here’s an example:

| offset | type | decimal value | description |
|---|---|---|---|
| 0000 | 32-bit integer | 3 | # vertices (V) |
| 0004 | 32-bit integer | 6 | # edges (E) |
| 0008 | 32-bit integer | 0 | src id for edge 0 |
| 0012 | 32-bit integer | 1 | dst id for edge 0 |
| 0016 | 32-bit integer | 3 | edge 0’s distance |
| 0020 | 32-bit integer | | src id for edge 1 |
| … | … | … | … |
| 0076 | 32-bit integer | | edge 5’s distance |
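Reading this format into a dense distance matrix might look like the sketch below. It is a minimal example under several assumptions that the specification does not state: the file is taken to use the host's byte order, and the names `Graph`, `read_graph`, and `kNoPath` are hypothetical. Unconnected pairs are initialized to $2^{30} - 1$, the "no path" value defined in the output specification, and the diagonal to 0.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// "No path" marker used by the output format: 2^30 - 1.
constexpr std::int32_t kNoPath = (1 << 30) - 1;

// Hypothetical container: V and a row-major V x V distance matrix.
struct Graph {
    int V;
    std::vector<std::int32_t> dist;
};

// Hypothetical reader. Assumes the file stores 32-bit integers in the same
// byte order as the machine running this code.
Graph read_graph(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);

    std::int32_t header[2];                       // V, then E
    in.read(reinterpret_cast<char*>(header), sizeof header);
    if (!in) throw std::runtime_error("truncated header in " + path);
    const int V = header[0];
    const int E = header[1];
    assert(V >= 0 && E >= 0);

    const std::size_t n = static_cast<std::size_t>(V);
    Graph g{V, std::vector<std::int32_t>(n * n, kNoPath)};
    for (std::size_t i = 0; i < n; ++i) g.dist[i * n + i] = 0;

    // Read all (src, dst, weight) triples with a single call.
    std::vector<std::int32_t> edges(static_cast<std::size_t>(E) * 3);
    in.read(reinterpret_cast<char*>(edges.data()),
            static_cast<std::streamsize>(edges.size() * sizeof(std::int32_t)));
    if (!in) throw std::runtime_error("truncated edge list in " + path);

    for (std::size_t e = 0; e < static_cast<std::size_t>(E); ++e) {
        const std::int32_t src = edges[3 * e];
        const std::int32_t dst = edges[3 * e + 1];
        assert(src >= 0 && src < V && dst >= 0 && dst < V);
        g.dist[static_cast<std::size_t>(src) * n + dst] = edges[3 * e + 2];
    }
    return g;
}
```

Reading the whole edge list in one call, rather than one integer at a time, tends to matter once the I/O portion of the run time is measured for large inputs.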
● Output specification

○ The output file is also in binary format.

○ For an input file with $V$ vertices, you should output an output file containing $V^2$ integers.

○ The first $V$ integers should be the shortest path distances starting from vertex 0: $dist(0, 0), dist(0, 1), dist(0, 2), \ldots, dist(0, V - 1)$; then the following $V$ integers would be the shortest path distances starting from vertex 1: $dist(1, 0), dist(1, 1), dist(1, 2), \ldots, dist(1, V - 1)$; and so on, totaling $V^2$ integers.

○ $dist(i, j) = 0$ where $i = j$.

○ If there is no valid path between $i \to j$, please output with: $dist(i, j) = 2^{30} - 1 = 1073741823$.

Example output file:

| offset | type | decimal value | description |
|---|---|---|---|
| 0000 | 32-bit integer | 0 | $dist(0, 0)$ |
| 0004 | 32-bit integer | ? | $dist(0, 1)$ |
| 0008 | 32-bit integer | ? | $dist(0, 2)$ |
| … | … | … | … |
| $4V^2 - 8$ | 32-bit integer | ? | $dist(V - 1, V - 2)$ |
| $4V^2 - 4$ | 32-bit integer | 0 | $dist(V - 1, V - 1)$ |
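Writing the result in this layout can be sketched as a single bulk write of the row-major matrix. The function name `write_distances` is hypothetical, and the sketch assumes the matrix already holds 0 on the diagonal and 1073741823 for unreachable pairs, and that the expected byte order is the host's (the specification does not say).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical writer: dumps a row-major V x V matrix as V * V 32-bit
// integers, so dist(i, j) lands at byte offset 4 * (i * V + j).
void write_distances(const std::string& path,
                     const std::vector<std::int32_t>& dist, int V) {
    assert(V >= 0 && dist.size() == static_cast<std::size_t>(V) * V);
    std::ofstream out(path, std::ios::binary);
    if (!out) throw std::runtime_error("cannot open " + path);
    out.write(reinterpret_cast<const char*>(dist.data()),
              static_cast<std::streamsize>(dist.size() * sizeof(std::int32_t)));
    if (!out) throw std::runtime_error("write failed for " + path);
}
```

Because the in-memory layout matches the file layout, no per-element conversion is needed before the write.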

5 REPORT

Answer the questions below. We recommend using the same section numbering as listed here.

1. Implementation

a. Which algorithm do you choose in hw3-1?

b. How do you divide your data in hw3-2, hw3-3?

c. What’s your configuration in hw3-2, hw3-3? And why? (e.g. blocking factor, #blocks, #threads)

d. How do you implement the communication in hw3-3?

e. Briefly describe your implementations in diagrams, figures or sentences.

2. Profiling Results (hw3-2)

Provide the profiling results of the following metrics on the biggest kernel of your program using NVIDIA profiling tools (see the NVIDIA Profiler Guide).

○ occupancy

○ sm efficiency

○ shared memory load/store throughput

○ global load/store throughput

3. Experiment & Analysis

a. System Spec

If you didn’t use our hades server for the experiments, please show the CPU, RAM, disk of the system.

b. Blocking Factor (hw3-2)

Observe what happens with different blocking factors, and plot the trend in terms of integer GOPS and global/shared memory bandwidth. (You can get the information from profiling tools or manually. You might want to check nvprof and the Metrics Reference.)

Figure 5: Example chart of performance and global memory bandwidth trend w.r.t. blocking factor

Note:

To run nvprof on hades with flags like --metrics, please run on the slurm partition prof, e.g.

srun -p prof -N1 -n1 --gres=gpu:1 nvprof --metrics gld_throughput ./hw3-2 /home/pp21/share/hw3-2/cases/c01.1 c01.1.out

c. Optimization (hw3-2)

Describe any optimizations you made after porting the algorithm to the GPU, using sentences and charts. Here are some techniques you can implement:

■ Coalesced memory access

■ Shared memory

■ Handle bank conflict

■ CUDA 2D alignment

■ Occupancy optimization

■ Large blocking factor

■ Reduce communication

■ Streaming

Figure 6: Example chart of performance optimization

d. Weak scalability (hw3-3)

Observe the weak scalability of the multi-GPU implementation.

e. Time Distribution (hw3-2)

Analyze the time spent in the following parts of your program w.r.t. input size:

● computing

● communication

● memory copy (H2D, D2H)

● I/O

f. Others

Additional charts with explanation and studies. The more, the better.

4. Experience & conclusion

a. What have you learned from this homework?

b. Feedback (optional)
