1 Overview
As promised in Assignment 1, you are to construct a parallel C program that implements HighLife (a modification to the Game of Life) with the twist of using ***only*** 1-D arrays in CUDA. Additionally, you will run this C program on the AiMOS supercomputer at the CCI in parallel using a single CUDA-enabled GPU.
Note, this is an individual assignment and is not a group assignment.
1.1 Review of HighLife Specification
HighLife is an example of a Cellular Automaton where the universe is a two-dimensional orthogonal grid of square cells (with WRAP AROUND FOR THIS ASSIGNMENT), each of which is in one of two possible states, ALIVE or DEAD. Every cell interacts with its eight neighbors, which are the cells that are horizontally, vertically, or diagonally adjacent. At each step in time, the following transitions occur at each and every cell:
Any LIVE cell with FEWER THAN two (2) LIVE neighbors DIES, as if caused by underpopulation.
Any LIVE cell with two (2) OR three (3) LIVE neighbors LIVES on to the next generation.
Any LIVE cell with MORE THAN three (3) LIVE neighbors DIES, as if by over-population.
Any DEAD cell with EXACTLY three (3) or six (6) LIVE neighbors becomes a LIVE cell, as if by reproduction/birth.
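For reference, these four rules collapse into a single state function. A minimal sketch follows (the helper name hl_next_state is purely illustrative and is not part of any provided template):

/* Illustrative helper: returns the next state of one cell given its
   current state (0 or 1) and its live-neighbor count (0-8). */
static inline unsigned char hl_next_state(unsigned char alive, int liveNeighbors)
{
    if (alive)
        return (unsigned char)(liveNeighbors == 2 || liveNeighbors == 3); /* survival */
    return (unsigned char)(liveNeighbors == 3 || liveNeighbors == 6);     /* birth (HighLife B36/S23) */
}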
The world size and initial pattern are determined by arguments to your program. A template for your program will be provided and more details are below. The first generation is created by applying the above rules to every cell in the seed—births and deaths occur simultaneously, and the discrete moment at which this happens is sometimes called a “tick” or iteration. The rules continue to be applied repeatedly to create further generations. The number of generations will also be an argument to your program. Note, an iteration starts with Cell(0,0) and ends with Cell(N − 1, N − 1) in the serial case.
1.2 Re-using Your Solution and C-template from Assignment 1:
Recall from Assignment 1, the world size and initial pattern are determined by arguments to your program. The first generation is created by applying the above rules to every cell in the seed—births and deaths occur simultaneously, and the discrete moment at which this happens is sometimes called a “tick” or iteration. The rules continue to be applied repeatedly to create further generations. The number of generations will also be an argument to your program.
1.3 Template and Implementation Details
Re-use the provided template from Assignment 1, highlife.c, or your own solution to Assignment 1. Here, you must re-write/extend that implementation as highlife.cu and support parallel CUDA processing of the world grid. Here is what you should look out for:
new headers:
#include <cuda.h>
#include <cuda_runtime.h>
init routines: Replace all the calloc calls with cudaMallocManaged. For example:
cudaMallocManaged( &g_data, (g_dataLength * sizeof(unsigned char)));
will allocate the right amount of space for the g_data pointer in the template. Also, make sure you correctly ZERO out all the data elements, as cudaMallocManaged will not do this for you like calloc does. A sketch is shown below.
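As a minimal sketch (assuming the template's g_data / g_resultData pointers and g_dataLength variable; the function name here is illustrative), the allocation and zeroing might look like:

#include <cuda.h>
#include <cuda_runtime.h>
#include <string.h>

unsigned char *g_data = NULL;        /* current world */
unsigned char *g_resultData = NULL;  /* next world */
size_t g_dataLength = 0;             /* worldWidth * worldHeight */

static void HL_initAllZeros(size_t worldWidth, size_t worldHeight)
{
    g_dataLength = worldWidth * worldHeight;

    /* Unified (managed) memory is accessible from both host and device. */
    cudaMallocManaged(&g_data, g_dataLength * sizeof(unsigned char));
    cudaMallocManaged(&g_resultData, g_dataLength * sizeof(unsigned char));

    /* Unlike calloc, cudaMallocManaged does not zero the allocation. */
    memset(g_data, 0, g_dataLength * sizeof(unsigned char));
    memset(g_resultData, 0, g_dataLength * sizeof(unsigned char));
}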
HL_kernelLaunch: this function is called from main and looks like:
bool HL_kernelLaunch(unsigned char** d_data, unsigned char** d_resultData, size_t worldWidth, size_t worldHeight, size_t iterationsCount, ushort threadsCount)
This function computes the world via a CUDA kernel (described below) and swaps the new world with the previous world to be ready for the next iteration. It should invoke the HL_kernel CUDA kernel function, which will advance the whole world one step or iteration. Last, it will need to invoke cudaDeviceSynchronize() between iterations, before HL_swap and prior to returning to the main routine. A sketch follows.
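One possible shape for this function is sketched below (the grid-size calculation and the inline pointer swap standing in for HL_swap are assumptions; adapt them to your own template):

bool HL_kernelLaunch(unsigned char** d_data, unsigned char** d_resultData, size_t worldWidth, size_t worldHeight, size_t iterationsCount, ushort threadsCount)
{
    size_t worldSize = worldWidth * worldHeight;
    size_t blocksCount = (worldSize + threadsCount - 1) / threadsCount; /* ceiling division */

    for (size_t i = 0; i < iterationsCount; i++)
    {
        /* Advance the whole world by one iteration. */
        HL_kernel<<<blocksCount, threadsCount>>>(*d_data, worldWidth, worldHeight, *d_resultData);

        /* Wait for the GPU to finish this generation before swapping worlds. */
        cudaDeviceSynchronize();

        /* Swap current and next worlds for the following iteration. */
        unsigned char* tmp = *d_data;
        *d_data = *d_resultData;
        *d_resultData = tmp;
    }
    return true;
}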
HL_kernel: this is the main CUDA kernel function. This function will look like:
__global__ void HL_kernel(const unsigned char* d_data, unsigned int worldWidth, unsigned int worldHeight, unsigned char* d_resultData)
This function will iterate starting with index = blockIdx.x * blockDim.x + threadIdx.x and increment by index += blockDim.x * gridDim.x, provided index < worldWidth * worldHeight.
Using the current thread index, you’ll need to compute the x0, x1, x2, y0, y1 and y2 offsets as you did in Assignment 1 using the worldWidth and worldHeight variables. Note, you don’t need to use any shared memory. Once you have those offset values, count the number of alive cells among the 8 neighbors in the world, as in the sketch below.
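A sketch of this kernel, using modulo arithmetic for the wrap-around (the exact offset scheme is up to you; reuse your Assignment 1 approach if you prefer):

__global__ void HL_kernel(const unsigned char* d_data, unsigned int worldWidth, unsigned int worldHeight, unsigned char* d_resultData)
{
    /* size_t avoids overflow of worldWidth * worldHeight for the largest worlds. */
    size_t worldSize = (size_t)worldWidth * worldHeight;

    /* Grid-stride loop: each thread handles every (blockDim.x * gridDim.x)-th cell. */
    for (size_t index = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         index < worldSize;
         index += (size_t)blockDim.x * gridDim.x)
    {
        size_t x = index % worldWidth;
        size_t y = index / worldWidth;

        /* Row offsets (y0 above, y1 current, y2 below) and column offsets
           (x0 left, x2 right), all with wrap-around. */
        size_t y0 = ((y + worldHeight - 1) % worldHeight) * worldWidth;
        size_t y1 = y * worldWidth;
        size_t y2 = ((y + 1) % worldHeight) * worldWidth;
        size_t x0 = (x + worldWidth - 1) % worldWidth;
        size_t x2 = (x + 1) % worldWidth;

        /* Count alive cells among the 8 neighbors. */
        unsigned int aliveCells = d_data[y0 + x0] + d_data[y0 + x] + d_data[y0 + x2]
                                + d_data[y1 + x0] +                  d_data[y1 + x2]
                                + d_data[y2 + x0] + d_data[y2 + x] + d_data[y2 + x2];

        /* HighLife rule: born on 3 or 6 neighbors, survives on 2 or 3. */
        d_resultData[index] = ((aliveCells == 3)
                            || (aliveCells == 6 && !d_data[y1 + x])
                            || (aliveCells == 2 && d_data[y1 + x])) ? 1 : 0;
    }
}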
To compile and execute the highlife program, do:
modules: module load xl_r spectrum-mpi cuda
compile: nvcc -g -G -gencode arch=compute_70,code=sm_70 highlife.cu -o highlife. This invokes the nvcc compiler (with GCC as the host compiler) with host (-g) and device (-G) debugging enabled. For performance runs, remove the -g and -G options and replace them with -O3.
Allocate one compute-node: See salloc command below in the “Running on AiMOS” section.
For this assignment, there are the following 6 patterns:
Pattern 0: World is ALL zeros.
Pattern 1: World is ALL ones.
Pattern 2: Streak of 10 ones in about the middle of the World.
Pattern 3: Ones at the corners of the World.
Pattern 4: “Spinner” pattern at corners of the World.
Pattern 5: Replicator pattern starting in the middle.
2 Running on AiMOS
Please follow the steps below:
Login to the CCI landing pad (ccni.rpi.edu) using SSH and your CCI account and PIN/Token/password information. For example, ssh SPNRcaro@blp03.ccni.rpi.edu.
Login to AiMOS front end by doing ssh PCPAyourlogin@dcsfen01.
(Do one time only if you did not do it for Assignment 1). Setup ssh-keys for passwordless login between compute nodes via ssh-keygen -t rsa and then cp ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys.
Load modules: run module load xl_r spectrum-mpi cuda. This puts the correct C compiler along with MPI (not used here) and CUDA in your path, as well as all needed libraries, etc.
Compile code on the front end by issuing nvcc -g -G -gencode arch=compute_70,code=sm_70 highlife.cu -o highlife for debug, or nvcc -O3 -gencode arch=compute_70,code=sm_70 highlife.cu -o highlife for optimized code/performance.
Get a single node allocation by issuing: salloc -N 1 --gres=gpu:1 -t 30 which will allocate a single compute node using only 1 GPU for 30 mins. The max time for the class is 30 mins per job. Your salloc command will return once you’ve been granted a node. Normally, it’s been immediate but if the system is full of jobs you may have to wait for some period of time.
Use the squeue command to find the dcsXYZ node name (e.g., dcs24).
SSH to that compute node, for example, ssh dcs24. You should be at a Linux prompt on that compute node.
Issue the run command for HighLife. For example, ./highlife 4 64 2 32 will run HighLife using pattern 4 with a world size of 64x64 for 2 iterations using a 32-thread blocksize.
If you are done with a node early, please exit the node and cancel the job with scancel JOBID, where the JOBID can be found using the squeue command.
3 Parallel Performance Analysis and Report
First, make sure you disable any “world” output from your program. The worlds produced in these performance tests will be too large for effective human analysis without additional post-processing of the data. Using the time command, execute your program across the following configurations and collect the total execution time for each run. Use the “Replicator” pattern #5 for all runs below.
• Blocksize is 8 threads. Execute over world height/width of 1024, 2048, 4096, 8192, 16384, 32768 and 65536 for 1024 iterations each.
• Blocksize is 16 threads. Execute over world height/width of 1024, 2048, 4096, 8192, 16384, 32768 and 65536 for 1024 iterations each.
• Blocksize is 32 threads. Execute over world height/width of 1024, 2048, 4096, 8192, 16384, 32768 and 65536 for 1024 iterations each.
• Blocksize is 64 threads. Execute over world height/width of 1024, 2048, 4096, 8192, 16384, 32768 and 65536 for 1024 iterations each.
• Blocksize is 128 threads. Execute over world height/width of 1024, 2048, 4096, 8192, 16384, 32768 and 65536 for 1024 iterations each.
• Blocksize is 256 threads. Execute over world height/width of 1024, 2048, 4096, 8192, 16384, 32768 and 65536 for 1024 iterations each.
For each world height/width, plot the execution time in seconds (y-axis) against each thread blocksize (x-axis). Note you will have 7 plots. Determine, for each world size, which thread configuration yields the fastest execution time. Why do you think that is the case? What did you notice when the blocksize was less than or equal to the CUDA hardware “warp size” of 32 threads?
Use the time command to support your reasoning. Include the output from a time ./highlife <args> run in your report.
Last, determine which thread and worldsize configuration yields the fastest “cell updates per second” rate. For example, a world of 1024² that runs for 1024 iterations will have 1024³ cell updates. So the “cell updates per second” is 1024³ divided by the total execution time in seconds for that configuration. For this, please create a table of your data (6 blocksizes × 7 world sizes = 42 entries), one for each thread blocksize / worldsize configuration with its corresponding “cell updates per second” rate.
Additionally, please provide instructions on how to run your highlife program if it differs from using 3 arguments. For example, you might find it easier to run all your jobs if you provide the thread blocksize as a 4th argument and modify the argc, argv processing code in main accordingly, as sketched below.
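If you do add a 4th argument, the argument processing in main might look roughly like this (the argument order and variable names here are assumptions, not part of the provided template):

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char* argv[])
{
    if (argc != 5)
    {
        fprintf(stderr, "Usage: %s pattern worldSize iterations threadsCount\n", argv[0]);
        return EXIT_FAILURE;
    }

    unsigned int pattern        = (unsigned int)atoi(argv[1]);
    size_t worldSize            = (size_t)atoi(argv[2]);  /* world is worldSize x worldSize */
    size_t iterations           = (size_t)atoi(argv[3]);
    unsigned short threadsCount = (unsigned short)atoi(argv[4]);

    /* ... initialize the world for the given pattern, call HL_kernelLaunch(...), etc. ... */
    (void)pattern; (void)worldSize; (void)iterations; (void)threadsCount; /* silence unused warnings in this sketch */
    return EXIT_SUCCESS;
}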
You can use any graphing software so long as your report is in a PDF file. Please make sure to label each figure and put a proper caption that describes what the figure/plot is.