$25
Matrix multiplication is a common operation in many real world applications such as wireless communication, image processing, video analysis and processing, computer games and graphics, computational finance, cryptography, and simulation of many physical and chemical processes, among others. If we have two 𝑛 ×𝑛 matrices A and B, then their product C = A*B is also a square 𝑛 ×𝑛 matrix whose entry 𝑐!" in column 𝑖 and row 𝑗 is given by the equation n
cij =Xaikbkj
k=1
For example, if 𝑛 = 8, then
Matrix multiplication on a conventional computer can be slow because typically processors have only one Arithmetic Logic Unit (ALU). Each element of the 8x8 product matrix requires 15 arithmetic operations on top of any address calculations needed to access the right memory contents and thus a full matrix multiply can take many clock cycles.
Implementing matrix multiplication in hardware allows us to take advantage of parallelism and high memory bandwidth to improve the performance significantly. The core computation in matrix multiplication is multiplying two numbers and adding it to a running sum (called multiplyaccumulate or MAC). So, in this laboratory exercise we will start with the implementation of a MAC in hardware and then use that MAC to realize two implementations of matrix multiplication and evaluate their performance.
You don’t have to implement the hardware on the DE10 board in this lab but you should still make sure that your designs synthesize properly and estimate their hardware resources.
Task 1 : Implement a MAC
Figure 1: Block Diagram for a Multiply Accumulator
One of the most fundamental building blocks for digital signal processing is the Multiply Accumulator (MAC) shown in Figure 1. The operation of a MAC is as follows: Given inputs 𝑥 and 𝑦 and accumulator register 𝑠, the operation
𝑠 ← 𝑠 +𝑥𝑦
is performed every clock cycle. With this operation, one can see how each element of C can be calculated over a number of clock cycles.
Write a Verilog program to implement the MAC module. Your module should take two 8-bit signed 2’s complement inputs and use a 19-bit signed accumulator to produce the result. It is important to be able to clear the MAC when each entry in C is calculated. Therefore, you should also include an input signal macc_clear that when asserted assigns the product of the inputs directly to the output as shown in the block diagram.
You may use the operator “ * “ in Verilog to instantiate a multiplier and “ + “ to instantiate an adder.
Write a testbench using the skeleton code provided to verify your design. Your testbench does not need to automatically verify your results but should thoroughly test your design. Show the output waveforms and resource utilization on the FPGA to your TA.
Task 2 Matrix Multiply with One MAC
Figure 2: High-level block diagram for Task 2.
You will now design a module to multiply two 8x8 matrices, implementing the function C = A * B by instantiating only one of the MAC modules you have designed. Access one element from each of the input RAMs at a time and use your MAC in such a way that one entry of the product matrix C is calculated about every 8 or 9 clock cycles. This algorithm is very similar to how you would multiply two matrices by hand. Remember that you must access matrix B by columns while matrix A needs to be accessed by rows.
The input and output signals for your module are given in the table below.
Signal Name
Signal Type
Description
clk
Input
Clock
start
Input
Start signal. Your module should not begin
calculation until this signal is asserted high
reset
Input
Reset signal. Should reset your module to its default waiting state. Note, this should NOT clear the contents of any RAM.
done
Output
Completion signal. This signal should be ‘0’ while your module is running, and ‘1’ when it has completed the multiplication.
clock_count[10:0]
Output
Number of elapsed clock cycles. This signal should be set to zero when the module begins calculation and increment by 1 every clock cycle until the multiplication is complete, at which point it stops incrementing.
In order to test your module successfully with the provided testbench, you should make sure to do following:
• Initialize the contents of the RAMs associated with the A and B matrices to the contents of “ram_a_init.txt” and “ram_b_init.txt” respectively. Use the $readmemb() instruction inside of an initial block to do this.
• You should write the output RAM as its own module and instantiate it in your top-level design. It is very important that when you instantiate the RAM, you give it the instance name “RAMOUTPUT” and that when you write module, the memory vector is named “mem”. Make “mem” a signed variable so ModelSim interprets its contents as 2’s complement numbers. These steps are important so the testbench accesses the correct signals.
Make the following assumptions:
• Matrix A and Matrix B are 8x8 matrices stored in column major order in two independent RAM modules.
• The entries of matrices A and B are 8-bit signed numbers.
• Matrix C is an 8x8 matrix stored in a RAM in column major order.
• The entries of matrix C are 19-bit signed numbers.
Verify your module with the provided testbench, making only minimal changes to the testbench if necessary so signal and module names are consistent. Demonstrate the simulation to your TA and record the number of clock cycles the computation took.
Task 3 Matrix Multiply with Two MACs
Figure 3: Example architecture for performing matrix multiplication using two MACs.
Performing matrix multiplication the way we did in Task 2 does not exploit any parallelism and thus does not offer much speedup over what we might be able to achieve on a simple processor. Since the M9K memory blocks are dual ported, we can read memory contents from two separate addresses each clock cycle. We can take advantage of this and parallelize our matrix multiplication hardware. One way you might consider doing this is to read one column from B and two rows from A at a time. This technique means that two entries of C can be calculated in parallel.
Modify your module from Task 2 in the following ways:
A) Dual port just one of the input matrix memory RAMs. You can reference for help in implementing a dual port RAM. Remember that values will only be read from this RAM, making it really a ROM.
https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/max10/ug_m10_memory.pdf
B) Instantiate a total of two MAC modules designed in Task 1.
C) Make any necessary changes to your control logic to facilitate calculating multiple entries in the product matrix at a time.
Assume that your output RAM cannot be dual ported. Thus, it might be necessary to buffer values computed by the MACs so they can be written to memory one at a time.
Verify your design using the same testbench used in Task 2. Show your simulation results to your TA and record the number of clock cycles that the computation took.
Task 4
Optimize your design in Task III to have better throughput (MAC/s) by adopting parallelism in terms of MAC usage and pipelining the datapath.
This task will be graded relatively, which means each group will get the points out of 150 based on their designs and corresponding throughput.
To receive any points for this task you should have a fully synthesized and functional design.
Extra Credit
Sparse matrix vector Multiplication (SpMV)
In many emerging applications the matrices are very sparse, which means most of the entries in the matrices are zero. The straightforward algorithm that we are using thus far is extremely inefficient for sparse matrices since we are wasting time and power and memory bandwidth/capacity storing and multiplying by zeros. See
https://en.wikipedia.org/wiki/Sparse_matrix#Compressed_sparse_row_(CSR,_CRS_or_Yale_for mat) for examples of how sparse matrices can be stored. Implement a Sparse parallel Matrix Vector multiplier based on the CSR format – assume the matrix is 16x16 and vector is 16x1 and assuming the sparsity is 25% which means only 4 elements per row are non-zero (100 points).
Lab Requirements
A. Task 1
1. Write a synthesizable model for a MAC.
2. Write a testbench to verify the operation of the circuit.
3. Verify your design through functional simulation. Demonstrate your simulation to your TA and have them sign a verification sheet when simulation is successful.
4. Estimate the resources usage – the number of flip-flops, logic elements, memory blocks, embedded multipliers etc.
B. Task 2
1. Design a circuit and write the Verilog to perform matrix multiplication of two 8x8 matrices by instantiating a single MAC.