CS51200 LeNet Accelerator Engine

Objective
In this homework, you will implement a CNN accelerator for the quantized LeNet model.

Model architecture
In Homework 1 and 2, we implemented a quantized LeNet model with 8-bit parameters (e.g., filter weights, input activations, and output activations). In this homework, you will implement an accelerator engine in Verilog that computes all five layers of LeNet. The model architecture is the same as in Homework 1 and 2.

Testbench & SRAM
1.     A testbench, ./sim/lenet_tb.v, is provided for validation. It loads the weight and pattern files into the weight SRAM and activation SRAM, respectively, and then checks the activation SRAM, where you should store your computation results. The testbench provides a simple result comparison for each layer's activations. You may modify it to enhance its debugging capability, but note that we will use the original testbench to judge your design.

2.     We provide two dual-port SRAMs in ./sim/sram_model: one is the weight SRAM, the other is the activation SRAM. Each dual-port SRAM can perform two 32-bit accesses to different addresses in the same clock cycle.

3.     Figure 1 shows the layout of the weight SRAM. For the Conv1 and Conv2 weight layers, only five 8-bit weights are stored in every two addresses, so all five weights can be read in one clock cycle using the two ports. Since the Conv3 layer can be treated as a fully-connected layer, the Conv3, FC1, and FC2 weight layers store four 8-bit weights in each address. For the FC2 biases, one 32-bit bias is stored in each address.

  

Figure 1: Layout of weight SRAM.

 

4.     Figure 2 shows the layout of the activation SRAM. The testbench loads the input feature map into offsets 0 to 255 of this SRAM; each address holds four 8-bit inputs. For the Conv1 layer, you should place fourteen 8-bit quantized activations in every four addresses, as Figure 2 shows; this arrangement makes the computation of the following Conv2 layer easier. For the Conv2, Conv3, and FC1 layers, you should place four 8-bit quantized values in every address for the fully-connected computation. In the last layer, FC2, the rescaling and clipping are omitted since they do not affect the classification result: you should store the 32-bit activations without rescaling and clipping.

  

Figure 2: Layout of the activation SRAM.

5.     You should use the parameter.zip (with image, weights, biases, quantization scales, and golden data) generated and submitted in Homework 2 to validate your design; the data must be pre-processed to match the formats in Figures 1 and 2 (i.e., you have to write your own tool to convert the format; a small packing sketch is given after this list). It is recommended to understand the testbench before writing your own pre-processing code. You may use C++, Python, or any other programming language.

Note: Refer to the Appendix for further discussion on the data arrangement in SRAMs.

Note: You should put proper quantization scale factors in the testbench. 

6.     The TAs will use unreleased patterns to evaluate your homework.
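As a starting point for the conversion tool mentioned in item 5, the following is a minimal Python sketch of the byte packing used throughout the SRAM layouts; the least-significant-byte-first ordering is inferred from the examples in Appendix Tables 2 and 4, and the helper name pack4 is only illustrative.

```python
def pack4(vals):
    """Pack up to four signed 8-bit values into one 32-bit HEX word.
    The first value goes into the least-significant byte (see Appendix
    Table 2); missing values become zero padding bytes."""
    word = 0
    for i, v in enumerate(vals[:4]):
        word |= (v & 0xFF) << (8 * i)
    return f"{word:08X}"

# Examples taken from Appendix Table 2 (c1.conv.weight.csv lines 0-4):
print(pack4([-56, -22, -9, 38]))  # 26F7EAC8
print(pack4([15]))                # 0000000F (three zero padding bytes)
```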

Design
1.     A template ./hdl/lenet.v is provided. DO NOT change the I/O declaration.

2.     Input and output delay constraints are added in the synthesis script. Make sure to add flip-flops after the input data ports (except for the clk port) and before the output data ports, as shown in Figure 3.

  

Figure 3: I/O-registered design style.

 

3.     Figure 4 illustrates possible data reuse in the convolution and max-pooling operations of CONV1 and CONV2. Typically, four 5x5 convolution operations generate a 2x2 block of activations, and the following max-pooling operation reduces them to one result (see Figure 4). You may consider loading 6x6 ifmaps to improve data reusability. Note that the dual-port SRAM allows you to load two 32-bit words (i.e., eight 8-bit ifmap values) at the same time, so it is possible to process Columns 2 to 7 while dealing with Columns 0 to 5 for further improvement. A short sketch of this reuse pattern is given after the I/O table below.

   Figure 4: Data reuse for convolution operations.

4.     I/O port descriptions

| Signal | Width | I/O | Description |
|---|---|---|---|
| clk | 1 | Input | Clock. |
| rst_n | 1 | Input | Active-low reset. |
| compute_start | 1 | Input | Single-cycle active-high pulse. The testbench uses this signal to inform the engine to start computing. |
| compute_finish | 1 | Output | You should assert a single-cycle active-high pulse to inform the testbench that the computation is finished. |
| scale_CONV1 | 32 | Input | Quantization scale factor for CONV1. |
| scale_CONV2 | 32 | Input | Quantization scale factor for CONV2. |
| scale_CONV3 | 32 | Input | Quantization scale factor for CONV3. |
| scale_FC1 | 32 | Input | Quantization scale factor for FC1. |
| scale_FC2 | 32 | Input | Quantization scale factor for FC2. |
| sram_weight_wea0 | 4 | Output | Byte-write enable of port 0 in the weight SRAM; each bit enables one byte. E.g., 4'b1001 means RAM[addr][31:24] = wdata[31:24] and RAM[addr][7:0] = wdata[7:0], leaving RAM[addr][23:8] untouched. Refer to the SRAM behavioral code for details. |
| sram_weight_addr0 | 16 | Output | Read/write address of port 0 in the weight SRAM. |
| sram_weight_wdata0 | 32 | Output | Write data of port 0 in the weight SRAM. |
| sram_weight_rdata0 | 32 | Input | Read data of port 0 in the weight SRAM. |
| sram_weight_wea1 | 4 | Output | Byte-write enable of port 1 in the weight SRAM; each bit enables one byte. |
| sram_weight_addr1 | 16 | Output | Read/write address of port 1 in the weight SRAM. |
| sram_weight_wdata1 | 32 | Output | Write data of port 1 in the weight SRAM. |
| sram_weight_rdata1 | 32 | Input | Read data of port 1 in the weight SRAM. |
| sram_act_wea0 | 4 | Output | Byte-write enable of port 0 in the activation SRAM; each bit enables one byte. |
| sram_act_addr0 | 16 | Output | Read/write address of port 0 in the activation SRAM. |
| sram_act_wdata0 | 32 | Output | Write data of port 0 in the activation SRAM. |
| sram_act_rdata0 | 32 | Input | Read data of port 0 in the activation SRAM. |
| sram_act_wea1 | 4 | Output | Byte-write enable of port 1 in the activation SRAM; each bit enables one byte. |
| sram_act_addr1 | 16 | Output | Read/write address of port 1 in the activation SRAM. |
| sram_act_wdata1 | 32 | Output | Write data of port 1 in the activation SRAM. |
| sram_act_rdata1 | 32 | Input | Read data of port 1 in the activation SRAM. |
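As mentioned in item 3 above, the following minimal Python sketch only illustrates the window reuse of Figure 4 for a single input channel and a single filter; accumulation over input channels, bias addition, and requantization are omitted, and the function name is illustrative.

```python
import numpy as np

def conv_then_pool_tile(ifmap_tile, kernel):
    """One 6x6 ifmap tile feeds four overlapping 5x5 convolution windows;
    their 2x2 results are reduced by a single 2x2 max-pooling operation
    (Figure 4). Single channel only; bias and requantization omitted."""
    assert ifmap_tile.shape == (6, 6) and kernel.shape == (5, 5)
    partial = np.empty((2, 2), dtype=np.int32)
    for dy in range(2):          # four window positions inside the tile
        for dx in range(2):
            window = ifmap_tile[dy:dy + 5, dx:dx + 5].astype(np.int32)
            partial[dy, dx] = int((window * kernel.astype(np.int32)).sum())
    return int(partial.max())    # 2x2 max-pooling -> one output value
```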
 

Simulation & Synthesis
1.     RTL behavior simulation

Before synthesis, make sure the result of the RTL behavioral simulation is correct. Figure 5 shows the simulation message when the validation is passed. You should modify the file sim_rtl.f with the proper file paths.

➢ RTL simulation command:

cd sim/
make sim

  

Figure 5: Simulation pass message

2.     Synthesis  

Please refer to the Spyglass tutorial to ensure that your design is synthesizable. You may modify the Verilog file names and clock period (cycle) in syn/synthesis.tcl accordingly.

➢ Synthesis command:

cd syn/
dc_shell -f synthesis.tcl

The synthesis script also produces timing and area reports. Make sure the timing slack is MET.

  

Figure 6: Design Compiler timing report screenshot

 

3.     Gate-level simulation

Before doing gate-level simulation, please modify the clock period (CYCLE) in sim/lenet_tb.v to match the synthesis clock period. You may add 1 to 3 ns to the synthesis clock period if you encounter setup-time violations. For example, if you set the synthesis clock constraint to 10 ns, you may set the clock period in the testbench to 12 ns to prevent setup-time violations in gate-level simulation.

 

➢ Gate-level simulation command:

cd sim/
make syn


           

Appendix
Pipeline architecture: 

You may want to optimize your design for better performance. Pipelining is a common technique to improve throughput by allowing a higher clock rate. For example, the combinational circuit between flip-flops in Figure 8 can be partitioned into two or more pipeline stages. The clock rate can then be higher, at the cost of a larger hardware area (i.e., additional flip-flops between stages).

  

Figure 8: Simple pipeline concept.

 

Data arrangement in SRAMs:  

 

You need to prepare three pattern files for the Verilog simulation (please trace the testbench code to see where to place them). The first one is weights.csv, which consists of all quantized weights. You need to process and integrate the weight files in parameters/weights of HW2 (also refer to Figure 1, Table 1, and Table 2). Note: weights.csv contains 15760 lines; each line holds one 32-bit HEX value (in ASCII format, for $readmemh()).

 

Table 1: Mapping of weights.csv and original quantized weight files

| Line # of weights.csv | Original filename |
|---|---|
| 0 – 59 | c1.conv.weight.csv |
| 60 – 1019 | c3.conv.weight.csv |
| 1020 – 13019 | c5.conv.weight.csv |
| 13020 – 15539 | f6.fc.weight.csv |
| 15540 – 15749 | output.fc.weight.csv |
| 15750 – 15759 | output.fc.bias.csv |
 

Table 2: Weight files data comparison

weights.csv vs c1.conv.weight.csv:

| weights.csv line # | Data (HEX) | c1.conv.weight.csv line # | Data (DEC) |
|---|---|---|---|
| 0 | 26F7EAC8 | 0 | -56 |
| | | 1 | -22 |
| | | 2 | -9 |
| | | 3 | 38 |
| 1 | 0000000F | 4 | 15 |
| | | BLANK | 0 |
| | | BLANK | 0 |
| | | BLANK | 0 |

weights.csv vs c5.conv.weight.csv:

| weights.csv line # | Data (HEX) | c5.conv.weight.csv line # | Data (DEC) |
|---|---|---|---|
| 1020 | FB020203 | 0 | 3 |
| | | 1 | 2 |
| | | 2 | 2 |
| | | 3 | -5 |
| 1021 | FFFE00FB | 4 | -5 |
| | | 5 | 0 |
| | | 6 | -2 |
| | | 7 | -1 |

weights.csv vs output.fc.bias.csv:

| weights.csv line # | Data (HEX) | output.fc.bias.csv line # | Data (DEC) |
|---|---|---|---|
| 15750 | 0000003A | 0 | 58 |
| 15751 | FFFFFFAD | 1 | -83 |
| 15752 | FFFFFF50 | 2 | -176 |
| 15753 | 0000000D | 3 | 13 |
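As one possible starting point for the pre-processing tool, the sketch below reproduces the Conv1/Conv2 weight packing of Figure 1 and Table 2 (five 8-bit weights per two 32-bit words) using the pack4() helper shown earlier; the function name and its input format (a flat list of decimal weights, as read from c1.conv.weight.csv) are assumptions.

```python
def conv1_conv2_weight_lines(weights):
    """Convert a flat list of quantized Conv1/Conv2 weights into
    weights.csv lines: each group of five 8-bit weights occupies two
    32-bit words, the second padded with three zero bytes (Table 2)."""
    assert len(weights) % 5 == 0
    lines = []
    for i in range(0, len(weights), 5):
        group = weights[i:i + 5]
        lines.append(pack4(group[0:4]))  # weights 0..3
        lines.append(pack4(group[4:5]))  # weight 4 + three zero bytes
    return lines

# First five values of c1.conv.weight.csv, as listed in Table 2:
print(conv1_conv2_weight_lines([-56, -22, -9, 38, 15]))
# ['26F7EAC8', '0000000F']  -> weights.csv lines 0 and 1
```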
 

The second one is golden00.csv, which consists of quantized activations. You need to process and integrate the files in parameters/activations/img0 (also refer to Figure 2, Table 3, and Table 4).

Note: golden00.csv contains 753 lines; each line holds one 32-bit HEX value.

 

Table 3: Mapping of golden00.csv and original activation files

| Line # of golden00.csv | Original filename |
|---|---|
| 0 – 255 | c1/input.csv |
| 256 – 591 | c3/input.csv |
| 592 – 691 | c5/input.csv |
| 692 – 721 | f6/input.csv |
| 722 – 742 | output/input.csv |
| 743 – 752 | output/output.csv |
 

Table 4: Activation files data comparison

 
golden00.csv vs c1/input.csv:

| golden00.csv line # | Data (HEX) | c1/input.csv line # | Data (DEC) |
|---|---|---|---|
| 65 | D48F8080 | 260 | -128 |
| | | 261 | -128 |
| | | 262 | -113 |
| | | 263 | -44 |
| 66 | DE111B29 | 264 | 41 |
| | | 265 | 27 |
| | | 266 | 17 |
| | | 267 | -34 |

golden00.csv vs c3/input.csv:

| golden00.csv line # | Data (HEX) | c3/input.csv line # | Data (DEC) |
|---|---|---|---|
| 324 | 52250800 | 238 | 0 |
| | | 239 | 8 |
| | | 240 | 37 |
| | | 241 | 82 |
| 325 | 6E6F6F6E | 242 | 110 |
| | | 243 | 111 |
| | | 244 | 111 |
| | | 245 | 110 |
| 326 | 14576B6D | 246 | 109 |
| | | 247 | 107 |
| | | 248 | 87 |
| | | 249 | 20 |
| 327 | 0000576B | 250 | 107 |
| | | 251 | 87 |
| | | BLANK | 0 |
| | | BLANK | 0 |

golden00.csv vs output/output.csv:

| golden00.csv line # | Data (HEX) | output/output.csv line # | Data (DEC) |
|---|---|---|---|
| 743 | FFFFA07C | 0 | -24452 |
| 744 | FFFFD10A | 1 | -12022 |
| 745 | FFFFD82A | 2 | -10198 |
| 746 | FFFFD645 | 3 | -10683 |
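For the Conv1 output section of golden00.csv (lines 256 – 591), the sketch below packs each group of fourteen consecutive activations into four 32-bit words, matching the c3/input.csv example in Table 4; the function name is illustrative, and the grouping is assumed to follow consecutive lines of c3/input.csv, fourteen at a time. It reuses the pack4() helper from the earlier sketch.

```python
def conv1_activation_lines(values):
    """Pack Conv1 output activations: every fourteen consecutive 8-bit
    values occupy four 32-bit words, the last word of each group padded
    with two zero bytes (Figure 2, Table 4)."""
    assert len(values) % 14 == 0
    lines = []
    for g in range(0, len(values), 14):
        group = values[g:g + 14]
        for i in range(0, 14, 4):
            lines.append(pack4(group[i:i + 4]))  # last chunk holds 2 values
    return lines

# c3/input.csv lines 238-251, as listed in Table 4:
group = [0, 8, 37, 82, 110, 111, 111, 110, 109, 107, 87, 20, 107, 87]
print(conv1_activation_lines(group))
# ['52250800', '6E6F6F6E', '14576B6D', '0000576B']  -> golden00.csv lines 324-327
```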
 

The last one is image00.csv, which is similar to golden00.csv, but it only contains the image part. You need to process input.csv in parameters/activations/img0/c1 (also refer to Figure 2, Table 5, and Table 6).

Note: image00.csv contains 256 lines; each line holds one 32-bit HEX value.

 

Table 5: Mapping of image00.csv and original activation files

| Line # of image00.csv | Original filename |
|---|---|
| 0 – 255 | c1/input.csv |
 

Table 6: Image files data comparison

| image00.csv line # | Data (HEX) | c1/input.csv line # | Data (DEC) |
|---|---|---|---|
| 65 | D48F8080 | 260 | -128 |
| | | 261 | -128 |
| | | 262 | -113 |
| | | 263 | -44 |
| 66 | DE111B29 | 264 | 41 |
| | | 265 | 27 |
| | | 266 | 17 |
| | | 267 | -34 |
 
