Implement a kernel that performs a sum reduction of a 1D list. You should implement the improved kernel discussed in week 4. Your kernel should be able to handle input lists of arbitrary length. However, for simplicity, you can assume that the input list will contain at most 2048 x 65535 elements, so that it can be handled by a single kernel launch. The boundary condition can be handled by filling the identity value (0 for sum) into the shared memory of the last block when the length is not a multiple of the thread block size. Further assume that the partial sums generated by the individual blocks, one per section, will be summed up by the CPU.

Prerequisites

Before starting this lab, make sure that:

Instruction

Edit the code in the code tab to perform the following:

- allocate device memory
- copy host memory to device
- initialize thread block and kernel grid dimensions
- invoke CUDA kernel
- copy results from device to host
- deallocate device memory
- implement the improved reduction routine
- use shared memory to reduce the number of global accesses, and handle the boundary conditions when loading input list elements into the shared memory
- implement a CPU loop to perform the final reduction based on the sums of sections generated by the thread blocks

The //@@ comment lines mark where to place each part of the code.
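The steps listed above can be sketched roughly as follows. This is a minimal illustration, not the lab's reference solution: the names `BLOCK_SIZE`, `CHECK`, `reduceSum`, and `sumOnGpu` are hypothetical, the block size of 1024 threads is an assumption chosen so that each block covers a 2048-element section, and the kernel assumes the "improved" variant means two loads per thread with a stride that halves each step. The provided skeleton and its //@@ markers may organize the code differently.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Assumption: 1024 threads per block, so each block reduces 2048 elements.
#define BLOCK_SIZE 1024

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
  do {                                                                \
    cudaError_t err = (call);                                         \
    if (err != cudaSuccess) {                                         \
      fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));   \
      exit(EXIT_FAILURE);                                             \
    }                                                                 \
  } while (0)

// Each block sums one section of 2 * BLOCK_SIZE elements and writes the
// result to output[blockIdx.x]. Must be launched with BLOCK_SIZE threads.
__global__ void reduceSum(const float *input, float *output, unsigned int len) {
  __shared__ float partial[2 * BLOCK_SIZE];

  unsigned int t = threadIdx.x;
  unsigned int start = 2 * blockIdx.x * blockDim.x;

  // Two loads per thread; out-of-range slots get the identity value 0.
  partial[t] = (start + t < len) ? input[start + t] : 0.0f;
  partial[blockDim.x + t] =
      (start + blockDim.x + t < len) ? input[start + blockDim.x + t] : 0.0f;

  // Halve the stride each step so the active threads stay contiguous.
  for (unsigned int stride = blockDim.x; stride > 0; stride >>= 1) {
    __syncthreads();
    if (t < stride) partial[t] += partial[t + stride];
  }

  if (t == 0) output[blockIdx.x] = partial[0];
}

// Host-side driver: returns the sum of hostInput[0 .. len-1].
float sumOnGpu(const float *hostInput, unsigned int len) {
  if (len == 0) return 0.0f;

  // One block per section of 2 * BLOCK_SIZE elements, rounded up.
  unsigned int numBlocks = (len + 2 * BLOCK_SIZE - 1) / (2 * BLOCK_SIZE);

  // Allocate device memory.
  float *deviceInput = nullptr;
  float *deviceOutput = nullptr;
  CHECK(cudaMalloc((void **)&deviceInput, len * sizeof(float)));
  CHECK(cudaMalloc((void **)&deviceOutput, numBlocks * sizeof(float)));

  // Copy host memory to device.
  CHECK(cudaMemcpy(deviceInput, hostInput, len * sizeof(float),
                   cudaMemcpyHostToDevice));

  // Initialize grid and block dimensions, then invoke the kernel.
  dim3 dimGrid(numBlocks, 1, 1);
  dim3 dimBlock(BLOCK_SIZE, 1, 1);
  reduceSum<<<dimGrid, dimBlock>>>(deviceInput, deviceOutput, len);
  CHECK(cudaGetLastError());
  CHECK(cudaDeviceSynchronize());

  // Copy the per-block sums back to the host.
  std::vector<float> partialSums(numBlocks);
  CHECK(cudaMemcpy(partialSums.data(), deviceOutput,
                   numBlocks * sizeof(float), cudaMemcpyDeviceToHost));

  // Deallocate device memory.
  CHECK(cudaFree(deviceInput));
  CHECK(cudaFree(deviceOutput));

  // Final reduction on the CPU over the per-block sums.
  float total = 0.0f;
  for (unsigned int i = 0; i < numBlocks; ++i) total += partialSums[i];
  return total;
}
```

With this sketch, a call such as `sumOnGpu(data, 5000)` launches three blocks; the third block pads its unused shared-memory slots with 0, so the length does not need to be a multiple of the section size. Placing `__syncthreads()` at the top of the loop also covers the initial loads, which avoids a separate barrier before the first iteration.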