1. Examples
The file hello.cu contains a CUDA implementation of Hello World.
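A minimal sketch of such a program is shown below. The file name hello.cu matches the assignment, but the kernel body and launch configuration are illustrative assumptions, not the exact contents of the provided file.

// hello.cu -- minimal CUDA Hello World (illustrative sketch, not the provided file)
#include <cstdio>

__global__ void hello_kernel()
{
    // Each thread prints its block and thread index.
    printf("Hello World from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main()
{
    hello_kernel<<<2, 4>>>();    // launch 2 blocks of 4 threads each
    cudaDeviceSynchronize();     // wait for the kernel so device-side printf output is flushed
    return 0;
}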
1. Login to HPC
2. Set up the CUDA toolchain:
module purge
module load gcc/8.3.0 cuda/10.1.243
3. Compile
nvcc -O3 -arch=sm_60 hello.cu
(CUDA 10.1 no longer supports sm_20; sm_60 targets the P100 GPUs used below. Adjust -arch to match the GPU you are allocated.)
4. Run
srun -n1 --gres=gpu:1 -t1 ./a.out
The -t option specifies the run-time limit; requesting a small limit helps your job get scheduled earlier. For more information on srun options, see man srun.
5. Profile (optional)
srun -n1 --gres=gpu:p100:1 --partition=debug nvprof ./a.out
6. Allocate a machine
salloc -n1 --gres=gpu:1 --mem=16G -t10
# After the allocation, you will be logged in to the machine and have
# 10 minutes to perform multiple operations
./a.out
# edit, compile, and run again without waiting for a new allocation
./a.out
./a.out
2. (100 points) Implement 1024 × 1024 matrix multiplication using the following two approaches (kernel sketches for both follow the list below).
• Approach 1 (unoptimized implementation using global memory only) :
– Name this program ‘mm1.cu’
– The value of each element of A is 1
– The value of each element of B is 2
– Thread block configuration: 16 × 16
– Grid configuration: 64 × 64
– After computation, print the value of C[451][451]
• Approach 2 (block matrix multiplication using shared memory) :
– Name this program ‘mm2.cu’
– The value of each element of A is 1
– The value of each element of B is 2
– Thread block configuration: 32 × 32
– Grid configuration: 32 × 32
– More details of this algorithm can be found in the paper ‘Matrix Multiplication with CUDA’.
– After computation, print the value of C[451][451]
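For reference, the following is a minimal sketch of an Approach 1 kernel and host driver under the configuration above (1024 × 1024 matrices, 16 × 16 thread blocks, 64 × 64 grid). The kernel name matmul_global and the use of float elements are assumptions, and error checking is omitted.

// mm1.cu (sketch) -- naive matrix multiplication using global memory only
#include <cstdio>
#include <cstdlib>

#define N 1024

__global__ void matmul_global(const float *A, const float *B, float *C)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    for (int k = 0; k < N; ++k)
        sum += A[row * N + k] * B[k * N + col];   // every operand read from global memory
    C[row * N + col] = sum;
}

int main()
{
    size_t bytes = (size_t)N * N * sizeof(float);
    float *hA = (float *)malloc(bytes), *hB = (float *)malloc(bytes), *hC = (float *)malloc(bytes);
    for (int i = 0; i < N * N; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }   // A = 1, B = 2 everywhere

    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);   // 16 x 16 threads per block
    dim3 grid(64, 64);    // 64 x 64 blocks covers all 1024 x 1024 elements
    matmul_global<<<grid, block>>>(dA, dB, dC);
    cudaDeviceSynchronize();

    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    printf("C[451][451] = %f\n", hC[451 * N + 451]);   // each C element is 1*2 summed 1024 times = 2048
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}

Approach 2 replaces the inner loop with shared-memory tiles. The sketch below assumes 32 × 32 tiles matching the 32 × 32 thread blocks; the kernel name matmul_shared is again an assumption. The host code is the same as in mm1.cu except for the launch configuration.

// mm2.cu (sketch) -- block (tiled) matrix multiplication using shared memory
#define N 1024
#define TILE 32

__global__ void matmul_shared(const float *A, const float *B, float *C)
{
    __shared__ float tileA[TILE][TILE];
    __shared__ float tileB[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    // Walk over the tiles of A (along the row) and B (along the column).
    for (int t = 0; t < N / TILE; ++t) {
        tileA[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        tileB[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();                      // wait until the tile is fully loaded

        for (int k = 0; k < TILE; ++k)
            sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        __syncthreads();                      // wait before the tile is overwritten
    }
    C[row * N + col] = sum;
}
// Host code as in mm1.cu, but launched with:
//   dim3 block(32, 32); dim3 grid(32, 32);
//   matmul_shared<<<grid, block>>>(dA, dB, dC);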
Measure the execution time of the kernels of Approach 1 and Approach 2, respectively; one way to time a kernel is sketched below.
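A common way to measure kernel execution time is with CUDA events, as in this sketch (variable and kernel names follow the mm1.cu sketch above and are assumptions).

// Timing a kernel with CUDA events (sketch)
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
matmul_global<<<grid, block>>>(dA, dB, dC);   // or matmul_shared<<<grid, block>>>(...)
cudaEventRecord(stop);
cudaEventSynchronize(stop);                   // block until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);       // elapsed time in milliseconds
printf("kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);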