CAA- Homework 3 Solved

Programming 
In this homework, we will use Verilog to implement a simple ALU, FPU and CPU.

We use Icarus Verilog to run the simulation, and we use gtkwave to check waveform. We will score your implementations under these settings.

Folder structure for this homework:

HW3
|-- ALU                              // Part 1
|   |-- alu.f                        <-- specify the files you use
|   |-- codes                        <-- put all the *.v here
|   |   `-- alu.v
|   `-- test_alu                     <-- run the test
|       |-- testbench.v              <-- test for correctness
|       `-- testcases/               <-- testcases
|           `-- generate.cpp         <-- used to generate testcases
|-- FPU                              // Part 2
|   |-- fpu.f
|   |-- codes
|   |   `-- fpu.v
|   `-- test_fpu
|       |-- testbench.v
|       `-- testcases/
|           `-- generate.cpp
`-- CPU                              // Part 3
    |-- cpu.f
    |-- codes
    |   |-- instruction_memory.v     <-- instruction memory with access latency
    |   |-- data_memory.v            <-- data memory with access latency
    |   `-- cpu.v
    `-- test_cpu
        |-- testbench.v
        `-- testcases/
            |-- generate.s
            `-- generate.cpp

ALU(30%)
The ALU spec is as follows:

Signal     | I/O    | Width | Functionality
i_clk      | Input  | 1     | Clock signal
i_rst_n    | Input  | 1     | Active low asynchronous reset
i_data_a   | Input  | 32    | Input data A; signed or unsigned depending on the i_inst signal
i_data_b   | Input  | 32    | Input data B; signed or unsigned depending on the i_inst signal
i_inst     | Input  | 4     | Instruction signal selecting the function to be performed
i_valid    | Input  | 1     | High for one clock cycle when i_data_a and i_data_b are valid
o_data     | Output | 32    | Calculation result
o_overflow | Output | 1     | Overflow signal
o_valid    | Output | 1     | Should be high for one cycle when your result is valid
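As a rough illustration of the one-cycle valid handshake in the table above, the sketch below registers a result and raises o_valid for one cycle per accepted input. The module name alu_output_reg, the i_result port and the one-cycle latency are assumptions made for illustration; the latency the testbench expects is not stated here.

```verilog
// Hypothetical output stage: names other than the spec's ports are assumptions.
module alu_output_reg (
    input             i_clk,
    input             i_rst_n,
    input             i_valid,
    input      [31:0] i_result,   // assumed: combinational result computed elsewhere
    output reg [31:0] o_data,
    output reg        o_valid
);
    // Active low asynchronous reset, as required by the spec.
    always @(posedge i_clk or negedge i_rst_n) begin
        if (!i_rst_n) begin
            o_data  <= 32'd0;
            o_valid <= 1'b0;
        end else begin
            // o_valid follows i_valid one cycle later, so a one-cycle
            // input pulse yields a one-cycle output pulse.
            o_valid <= i_valid;
            if (i_valid)
                o_data <= i_result;
        end
    end
endmodule
```

A multi-cycle design (for example a sequential multiplier) would keep the same idea but delay the o_valid pulse until the result is ready.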
The test environment is as follows:

You are asked to implement the following functions in ALU:

i_inst | Function     | Description
4'd3   | Signed Max   | max(i_data_a, i_data_b) (signed)
4'd4   | Signed Min   | min(i_data_a, i_data_b) (signed)
4'd8   |              |
4'd9   | Unsigned Min | min(i_data_a, i_data_b) (unsigned)
4'd10  | And          | i_data_a & i_data_b
4'd11  | Or           | i_data_a | i_data_b
4'd12  | Xor          | i_data_a ^ i_data_b
4'd13  | BitFlip      | ~i_data_a
4'd14  | BitReverse   | Bit reverse i_data_a
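A minimal sketch of how a few rows of the function table could be written as combinational logic. The module name alu_select_ops and the o_result port are hypothetical, and only instruction codes that appear in the table are used; everything else falls through to a default of 0.

```verilog
// Hypothetical helper: combinational datapath for a subset of the table.
module alu_select_ops (
    input      [31:0] i_data_a,
    input      [31:0] i_data_b,
    input      [3:0]  i_inst,
    output reg [31:0] o_result
);
    // Bit reverse: bit k of the output is bit 31-k of the input.
    reg [31:0] reversed;
    integer k;
    always @(*) begin
        for (k = 0; k < 32; k = k + 1)
            reversed[k] = i_data_a[31 - k];
    end

    always @(*) begin
        case (i_inst)
            // $signed makes the comparison two's complement.
            4'd3:    o_result = ($signed(i_data_a) > $signed(i_data_b)) ? i_data_a : i_data_b;
            4'd4:    o_result = ($signed(i_data_a) < $signed(i_data_b)) ? i_data_a : i_data_b;
            // Plain vectors compare as unsigned.
            4'd9:    o_result = (i_data_a < i_data_b) ? i_data_a : i_data_b;
            4'd13:   o_result = ~i_data_a;   // BitFlip
            4'd14:   o_result = reversed;    // BitReverse
            default: o_result = 32'd0;       // codes not sketched here
        endcase
    end
endmodule
```

The only difference between the signed and unsigned comparisons is the $signed cast, so 32'hFFFFFFFF is the smaller operand under code 4'd4 and the larger one under code 4'd9.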
More details:

We will compare the output data and overflow signal with the provided answers.
For signed 32-bit integer Add, Sub, Mul, Max, Min:
    These are two-input functions.
    The overflow signal only needs to be considered when Add, Sub or Mul is performed. For Max and Min, set the output overflow signal to 0.
    We will not compare the returned data with the provided answer when overflow happens.
For unsigned 32-bit integer Add, Sub, Mul, Max, Min:
    Same criteria as the signed operations.
For Xor, And, Or, BitFlip, and BitReverse:
    Set the output overflow signal to 0 when these functions are performed.
    Xor, And and Or are two-input functions.
    BitFlip and BitReverse are one-input functions, so i_data_b can be ignored.
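The overflow rules above could be sketched as follows, using the conventional definitions: a signed result that does not fit in 32-bit two's complement, or an unsigned result that does not fit in 32 bits. Whether the provided answers use exactly these definitions is an assumption. The module name overflow_flags and its port names are hypothetical, and unsigned Sub is left out because its overflow convention is not stated here.

```verilog
// Hypothetical helper: one overflow flag per operation, conventional definitions.
module overflow_flags (
    input  [31:0] a,
    input  [31:0] b,
    output        add_s_ovf,   // signed Add
    output        sub_s_ovf,   // signed Sub (a - b)
    output        mul_s_ovf,   // signed Mul
    output        add_u_ovf,   // unsigned Add
    output        mul_u_ovf    // unsigned Mul
);
    wire [31:0] sum  = a + b;
    wire [31:0] diff = a - b;

    // Add: operands share a sign but the sum has the other sign.
    assign add_s_ovf = (a[31] == b[31]) && (sum[31] != a[31]);
    // Sub: operands differ in sign and the result's sign differs from a.
    assign sub_s_ovf = (a[31] != b[31]) && (diff[31] != a[31]);

    // Signed Mul: the 64-bit product fits in 32 bits only if bits 63..31
    // are all copies of the sign bit.
    wire signed [63:0] prod_s = $signed(a) * $signed(b);
    assign mul_s_ovf = (prod_s[63:31] != {33{1'b0}}) &&
                       (prod_s[63:31] != {33{1'b1}});

    // Unsigned Add: carry out of bit 31.
    wire [32:0] sum_u = {1'b0, a} + {1'b0, b};
    assign add_u_ovf = sum_u[32];

    // Unsigned Mul: any bit set in the upper half of the product.
    wire [63:0] prod_u = {32'd0, a} * {32'd0, b};
    assign mul_u_ovf = |prod_u[63:32];
endmodule
```

For example, with a = 32'h7FFFFFFF and b = 1 the signed Add flag is set while the unsigned Add flag is not.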
FPU
The FPU spec is as follows:

Signal   | I/O    | Width | Functionality
i_clk    | Input  | 1     | Clock signal
i_rst_n  | Input  | 1     | Active low asynchronous reset
i_data_a | Input  | 32    | Single precision floating point a
i_data_b | Input  | 32    | Single precision floating point b
i_inst   | Input  | 1     | Instruction signal selecting the function to be performed
i_valid  | Input  | 1     | High for one clock cycle when i_data_a and i_data_b are valid
o_data   | Output | 32    | Calculation result
o_valid  | Output | 1     | Should be high for one cycle when your result is valid
The test environment is as follows:

You are asked to implement the following functions in the FPU:

i_inst | Function | Description
1'd0   | Add      | i_data_a + i_data_b (single precision floating point)
1'd1   | Mul      | i_data_a * i_data_b (single precision floating point)
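Both functions start by splitting each 32-bit operand into its IEEE-754 single precision fields: 1 sign bit, 8 exponent bits with a bias of 127, and 23 fraction bits. The sketch below shows that step only; the module name fp_unpack and its output names are hypothetical, and it assumes normal numbers so the hidden leading 1 can be restored unconditionally.

```verilog
// Hypothetical helper: split a single precision value into its fields.
module fp_unpack (
    input  [31:0] i_data,
    output        o_sign,
    output [7:0]  o_exp,    // biased exponent (bias 127)
    output [23:0] o_sig     // significand with the hidden 1 restored
);
    assign o_sign = i_data[31];
    assign o_exp  = i_data[30:23];
    // Valid for normal numbers only: denormals have no hidden 1.
    assign o_sig  = {1'b1, i_data[22:0]};
endmodule
```

For 32'h3FC00000 (1.5) this gives sign 0, exponent 8'd127 and significand 24'hC00000.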
Floating point:

More details:

We will compare the output data with the provided answers.
Follow the IEEE-754 single precision floating point format.
The inputs will not be denormal numbers, infinities, or NaNs, nor will the calculated result.
Simple testcases:
    During the computation, the operand with the smaller exponent is shifted; keep the full precision until rounding. The rounding mode is the default, round to nearest even. I find this pdf useful to explain the rounding and the GRS bits.
    The testcases may be simple enough that rounding does not matter.
You may want to refer to the diagram described in class to get a better idea of how to implement the FPU.
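Round to nearest even with guard, round and sticky (GRS) bits can be sketched as a small combinational block. The module name round_nearest_even and its ports are hypothetical. The sketch assumes the significand has already been normalized, so that the guard bit is the first discarded bit, the round bit the second, and the sticky bit the OR of everything below.

```verilog
// Hypothetical helper: round a 24-bit significand to nearest, ties to even.
module round_nearest_even (
    input  [23:0] i_sig,     // kept bits, hidden 1 included
    input         i_guard,   // first discarded bit
    input         i_round,   // second discarded bit
    input         i_sticky,  // OR of all remaining discarded bits
    output [24:0] o_sig      // bit 24 is set when rounding carries out
);
    // Round up when the discarded part is more than half an LSB,
    // or exactly half and the kept LSB is odd (tie goes to even).
    wire round_up = i_guard & (i_round | i_sticky | i_sig[0]);
    assign o_sig = {1'b0, i_sig} + {24'd0, round_up};
endmodule
```

When bit 24 of o_sig is set, the significand has overflowed to 2.0, so the caller would shift it right by one and increment the exponent.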

Grading:

There are 10 test cases each for add and mul, 20 test cases in total.
0% for each test case
CPU(40%)
In this section, you are asked to implement a CPU that supports a subset of the basic RV64I instructions. The CPU spec is as follows:

Signal         | I/O    | Width | Functionality
i_clk          | Input  | 1     | Clock signal
i_rst_n        | Input  | 1     | Active low asynchronous reset
i_i_valid_inst | Input  | 1     | One-cycle signal when the instruction from instruction memory is ready
i_i_inst       | Input  | 32    | 32-bit instruction from instruction memory
i_d_valid_data | Input  | 1     | One-cycle signal when the data from data memory is ready (used when ld happens)
i_d_data       | Input  | 64    | 64-bit data from data memory (used when ld happens)
o_i_valid_addr | Output | 1     | One-cycle signal when the pc-address is ready to be sent to instruction memory (fetch the instruction)
o_i_addr       | Output | 64    | 64-bit address to instruction memory (fetch the instruction)
o_d_data       | Output | 64    | 64-bit data to data memory (used when sd happens)
o_d_addr       | Output | 64    | 64-bit address to data memory (used when ld or sd happens)
o_d_MemRead    | Output | 1     | One-cycle signal telling data memory that the current mode is reading (used when ld happens)
o_d_MemWrite   | Output | 1     | One-cycle signal telling data memory that the current mode is writing (used when sd happens)
o_finish       | Output | 1     | Stop signal when EOF happens
The provided instruction memory is as follows:

Signal  | I/O    | Width | Functionality
i_clk   | Input  | 1     | Clock signal
i_rst_n | Input  | 1     | Active low asynchronous reset
i_valid | Input  | 1     | Signal that tells the pc-address from the cpu is ready
i_addr  | Input  | 64    | 64-bit address from the cpu
o_valid | Output | 1     | Valid when the instruction is ready
o_inst  | Output | 32    | 32-bit instruction to the cpu
And the provided data memory is as follows:

Signal     | I/O    | Width | Functionality
i_clk      | Input  | 1     | Clock signal
i_rst_n    | Input  | 1     | Active low asynchronous reset
i_data     | Input  | 64    | 64-bit data that will be stored (used when sd happens)
i_addr     | Input  | 64    | 64-bit target address to write to or read from (used when ld or sd happens)
i_MemRead  | Input  | 1     | One-cycle signal that sets the current mode to reading
i_MemWrite | Input  | 1     | One-cycle signal that sets the current mode to writing
o_valid    | Output | 1     | One-cycle signal telling that the data is ready (used when ld happens)
o_data     | Output | 64    | 64-bit data from data memory (used when ld happens)
The test environment is as follows:

We will only test the instructions highlighted in the red boxes in the figures below.

And one more instruction to be implemented is

i_inst                               | Function | Description
32'b11111111111111111111111111111111 | Stop     | Stop and set o_finish to 1
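Detecting the stop instruction amounts to comparing the fetched word against all ones while the instruction is valid. The sketch below is one way to do it. The module name stop_detect is hypothetical, the port names i_i_valid_inst and i_i_inst are the CPU spec names with underscores restored (an assumption), and holding o_finish high once set is also an assumption, since the spec only calls it a stop signal.

```verilog
// Hypothetical helper: raise o_finish when the all-ones stop word is fetched.
module stop_detect (
    input         i_clk,
    input         i_rst_n,
    input         i_i_valid_inst,
    input  [31:0] i_i_inst,
    output reg    o_finish
);
    always @(posedge i_clk or negedge i_rst_n) begin
        if (!i_rst_n)
            o_finish <= 1'b0;
        // Only look at the instruction bus while it is flagged valid.
        else if (i_i_valid_inst && (i_i_inst == 32'hFFFF_FFFF))
            o_finish <= 1'b1;   // assumed: stays high until reset
    end
endmodule
```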
More details:

The instruction_memory.v, data_memory.v and testbench.v files should not be modified.
We will compare the contents of the mem[1024] array in data_memory.v with the provided answers to check for correctness.
The data memory module has 1024 bytes of memory, and the instruction memory has 16x32 bits. No invalid access to instruction or data memory will be involved in the testcases, so there is no need to handle these cases.
All the arithmetic operations here are unsigned, including A + B and A + imm, and there is no need to deal with overflow.
You may notice that there is latency when accessing the memory:
    For instruction memory, when i_valid is set, the instruction memory will stall for 5 cycles and then return the instruction to the cpu.
    For data memory, when i_MemRead or i_MemWrite is set, the data memory will stall for 7 cycles in both cases and then return the data to the cpu or write the data to memory.
    The latency comes from freezing the module for a certain number of cycles, as shown below.
You may want to refer to the block diagram of the cpu from the slides or textbook to get a better idea of how to implement the cpu. Notice that the diagram provided here is a single-cycle cpu, while in this homework there is additional latency in accessing memory that needs to be considered.
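Because of the memory latency described above, a single-cycle datapath needs some control that requests an instruction and then waits for it. A minimal sketch of the fetch side as a three-state machine follows. The module name fetch_ctrl, the i_done input and the PC_STEP parameter are hypothetical; the port names are the CPU spec names with underscores restored; branches are not handled; and a PC step of 4 bytes per instruction is an assumption about how the provided instruction memory is addressed.

```verilog
// Hypothetical fetch controller: request, wait for the memory, then execute.
module fetch_ctrl #(
    parameter PC_STEP = 4              // assumed byte-addressed instructions
) (
    input             i_clk,
    input             i_rst_n,
    input             i_i_valid_inst,  // instruction memory says the word is ready
    input             i_done,          // assumed: current instruction has finished
    output reg        o_i_valid_addr,
    output reg [63:0] o_i_addr
);
    localparam S_REQ = 2'd0, S_WAIT = 2'd1, S_EXEC = 2'd2;
    reg [1:0] state;

    always @(posedge i_clk or negedge i_rst_n) begin
        if (!i_rst_n) begin
            state          <= S_REQ;
            o_i_valid_addr <= 1'b0;
            o_i_addr       <= 64'd0;
        end else begin
            o_i_valid_addr <= 1'b0;    // default keeps the request pulse one cycle wide
            case (state)
                S_REQ: begin           // send the current PC once
                    o_i_valid_addr <= 1'b1;
                    state          <= S_WAIT;
                end
                S_WAIT: begin          // hold still while the memory is stalled
                    if (i_i_valid_inst)
                        state <= S_EXEC;
                end
                S_EXEC: begin          // advance the PC when execution is done
                    if (i_done) begin
                        o_i_addr <= o_i_addr + PC_STEP;
                        state    <= S_REQ;
                    end
                end
                default: state <= S_REQ;
            endcase
        end
    end
endmodule
```

Waiting on the valid signal rather than counting 5 cycles keeps the controller independent of the exact latency. A load could be handled the same way, with extra states that pulse o_d_MemRead and wait for i_d_valid_data.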

Grading:

There are 8 testcases. 5% each. (eof, store, load, add, sub, and, or, xor, andi, ori, xori, slli, srli, bne, beq)
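For the instruction names listed in the grading line above, decoding starts by slicing the 32-bit word into the standard RISC-V fields and immediates. The sketch below follows the base RV64I encoding; the module name rv_fields and its port names are hypothetical. Immediates are sign-extended to 64 bits as the ISA defines them, which gives the same bit pattern for the addition whether the result is then read as signed or unsigned.

```verilog
// Hypothetical helper: slice an RV64I instruction word into fields.
module rv_fields (
    input  [31:0] i_inst,
    output [6:0]  o_opcode,
    output [4:0]  o_rd,
    output [2:0]  o_funct3,
    output [4:0]  o_rs1,
    output [4:0]  o_rs2,
    output [6:0]  o_funct7,
    output [63:0] o_imm_i,   // I-type: andi, ori, xori, ld (slli/srli use the low 6 bits)
    output [63:0] o_imm_s,   // S-type: sd
    output [63:0] o_imm_b    // B-type: beq, bne (byte offset, bit 0 is always 0)
);
    assign o_opcode = i_inst[6:0];
    assign o_rd     = i_inst[11:7];
    assign o_funct3 = i_inst[14:12];
    assign o_rs1    = i_inst[19:15];
    assign o_rs2    = i_inst[24:20];
    assign o_funct7 = i_inst[31:25];

    // Bit 31 is the sign bit of every immediate format.
    assign o_imm_i = {{52{i_inst[31]}}, i_inst[31:20]};
    assign o_imm_s = {{52{i_inst[31]}}, i_inst[31:25], i_inst[11:7]};
    assign o_imm_b = {{51{i_inst[31]}}, i_inst[31], i_inst[7],
                      i_inst[30:25], i_inst[11:8], 1'b0};
endmodule
```

As a check, the word 32'h00000463 is beq x0, x0, 8, so it decodes to opcode 7'b1100011 with o_imm_b equal to 8.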
Report 
Write a report describing how you implemented the ALU, FPU, and CPU.

You can draw some block diagrams to show your execution method more clearly.

"draw.io" is suggested as an easy way to draw the block diagrams.
