ECE408 Objective Solution


Implement a kernel that performs reduction of a 1D list. The reduction should give the sum of the list. You should implement the improved kernel discussed in week 4. Your kernel should be able to handle input lists of arbitrary length. However, for simplicity, you can assume that the input list will be at most 2048 x 65535 elements so that it can be handled by only one kernel launch. The boundary condition can be handled by filling the identity value (0 for sum) into the shared memory of the last block when the length is not a multiple of the thread block size. Further assume that the reduction sums of each section generated by individual blocks will be summed up by the CPU.
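The week 4 material is not reproduced here, so the following is only a minimal sketch of one common form of the improved reduction kernel: each thread loads two elements into shared memory, and the stride is halved each step so the active threads stay contiguous. The kernel name `total`, the `float` element type, and `BLOCK_SIZE = 1024` (chosen so that one block covers 2048 elements, matching the 2048 x 65535 limit) are assumptions, not part of the lab text.

```cuda
#include <cuda_runtime.h>

// Assumed threads per block; each block reduces 2 * BLOCK_SIZE input elements.
#define BLOCK_SIZE 1024

// Hypothetical kernel name and signature. Each block sums one section of
// 2 * BLOCK_SIZE elements and writes a single partial sum to output[blockIdx.x].
__global__ void total(const float *input, float *output, int len) {
  __shared__ float partialSum[2 * BLOCK_SIZE];

  unsigned int t = threadIdx.x;
  unsigned int start = 2 * blockIdx.x * BLOCK_SIZE;
  unsigned int n = (unsigned int)len;

  // Load two elements per thread. Positions past the end of the list are
  // filled with the identity value for sum (0).
  partialSum[t] = (start + t < n) ? input[start + t] : 0.0f;
  partialSum[BLOCK_SIZE + t] =
      (start + BLOCK_SIZE + t < n) ? input[start + BLOCK_SIZE + t] : 0.0f;

  // Tree reduction in shared memory: the stride halves every iteration, so
  // threads 0..stride-1 are the only ones doing work.
  for (unsigned int stride = BLOCK_SIZE; stride >= 1; stride >>= 1) {
    __syncthreads();
    if (t < stride) {
      partialSum[t] += partialSum[t + stride];
    }
  }

  // Thread 0 holds the section sum after the last iteration.
  if (t == 0) {
    output[blockIdx.x] = partialSum[0];
  }
}
```

With this layout a list of `len` elements needs `ceil(len / (2 * BLOCK_SIZE))` blocks, and only the last block ever pads with zeros.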
Prerequisites
Before starting this lab, make sure that:
Instruction
Edit the code in the code tab to perform the following:
allocate device memory
copy host memory to device
initialize thread block and kernel grid dimensions
invoke CUDA kernel
copy results from device to host
deallocate device memory
implement the improved reduction routine
use shared memory to reduce the number of global accesses, handle the boundary conditions when loading input list elements into the shared memory
implement a CPU loop to perform final reduction based on the sums of sections generated by the thread blocks
Instructions about where to place each part of the code are demarcated by the //@@ comment lines.
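The host-side steps listed above could be sketched roughly as follows. This is not the lab's template code: the function name `reduceOnDevice`, the `ReduceKernel` function-pointer type, and `BLOCK_SIZE = 1024` are assumptions made for illustration, and the reduction kernel is passed in as a parameter so that the sketch does not depend on any particular kernel definition. Error checking on the CUDA calls is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

// Assumed threads per block; each block is expected to reduce
// 2 * BLOCK_SIZE input elements into one partial sum.
#define BLOCK_SIZE 1024

// Hypothetical signature for the reduction kernel (a __global__ function).
typedef void (*ReduceKernel)(const float *input, float *output, int len);

// Sums hostInput[0..len-1] using the given kernel plus a final CPU loop.
float reduceOnDevice(ReduceKernel kernel, const float *hostInput, int len) {
  if (len <= 0) return 0.0f;

  // One output element (section sum) per thread block.
  int numBlocks = (len + 2 * BLOCK_SIZE - 1) / (2 * BLOCK_SIZE);
  float *hostOutput = (float *)malloc(numBlocks * sizeof(float));
  float *deviceInput = NULL;
  float *deviceOutput = NULL;

  // Allocate device memory.
  cudaMalloc((void **)&deviceInput, len * sizeof(float));
  cudaMalloc((void **)&deviceOutput, numBlocks * sizeof(float));

  // Copy host memory to device.
  cudaMemcpy(deviceInput, hostInput, len * sizeof(float),
             cudaMemcpyHostToDevice);

  // Initialize thread block and kernel grid dimensions.
  dim3 dimGrid(numBlocks, 1, 1);
  dim3 dimBlock(BLOCK_SIZE, 1, 1);

  // Invoke the CUDA kernel and wait for it to finish.
  kernel<<<dimGrid, dimBlock>>>(deviceInput, deviceOutput, len);
  cudaDeviceSynchronize();

  // Copy results from device to host.
  cudaMemcpy(hostOutput, deviceOutput, numBlocks * sizeof(float),
             cudaMemcpyDeviceToHost);

  // CPU loop: final reduction over the per-block section sums.
  float sum = 0.0f;
  for (int i = 0; i < numBlocks; ++i) {
    sum += hostOutput[i];
  }

  // Deallocate device and host memory.
  cudaFree(deviceInput);
  cudaFree(deviceOutput);
  free(hostOutput);
  return sum;
}
```

Passing the kernel as a function pointer is only a convenience for this sketch; in the lab itself the kernel would normally be called by name at the launch site.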
