In this homework, we examine cache effects using rocket-chip. You can either build rocket-chip yourself or use the provided Docker image:
docker pull ntuca2020/hw5    # size ~ 8.28G
docker run --name=test -it ntuca2020/hw5
cd /root
ls
Folder structure for this homework:
emulator/                          // link to rocket-chip emulator
|-- benchmarks/                    // link to riscv-tests benchmark
|   |-- Makefile                   // compile all benchmarks
|   |-- qsort/                     // qsort benchmark folder
|   |-- qsort.riscv                // riscv executable
|   |-- qsort.riscv.dump           // objdump riscv executable
|   |-- mt-matmul/                 // mt-matmul benchmark
|   |-- mt-matmul.riscv            // riscv executable
|   |-- mt-matmul.riscv.dump       // objdump riscv executable
|   |-- mt-matmul_4/               // for part2
|   |   `-- matmul.c               <-- need to be handed in
|   |-- mt-matmul_4.riscv          // riscv executable
|   |-- mt-matmul_4.riscv.dump     // objdump riscv executable
|   |-- ...
|   `-- common
|       |-- ...                    // other benchmarks
|       `-- crt.S                  // specify number of cores available
|-- system/                        // link to rocket-chip system
|   |-- test.scala                 // first part SoC settings
|   |-- HW5.scala                  <-- used for matrix multiplication and need to be handed in
|   `-- *.scala                    // other default scala settings
|-- build.sh                       // build all settings
|-- test.sh                        // test all settings
|-- spike_test.sh                  // can test on spike first
|-- Config1                        // Configuration1
|-- generated-src_Config1
|   |-- ...                        // Layout, RTL, mappings, dts, etc, for Config1
`-- Makefile                       // Build the configuration
Part 1: Observing cache behavior (17%)
Run test.sh and fill in the cycle count for each benchmark under each configuration in the following table (Table 1).
                        dhrystone  median  multiply  qsort  rsort  towers  vvadd
Configuration 1   (1)
Configuration 2   (1)
Configuration 3
Configuration 4
Configuration 5
Configuration 6   (4)
Configuration 7   (4)
Configuration 8
Configuration 9
Configuration 10
Configuration 11
Configuration 12  (5)
Configuration 13  (5)
Table 1: Benchmark on different configurations
Answer the following questions. Answers should be based on your observations of the cache configurations and the program behavior.
• Why are (1) the same or different?
• Why are (2) the same or different?
• Why are (3) the same or different?
• Why are (4) the same or different?
• Why are (5) the same or different?
• Read pmp.c in /root/emulator/benchmarks/pmp. What is this program trying to do, and how does it accomplish it?
• Change the number of cores available in the crt.S file (line 125) in /root/emulator/benchmarks/common and recompile the mt-matmul program (for this question, the matrix size is 32x32).
– Report the cycle count of Configuration 17 on 1 core, Configuration 19 on 2 cores, and Configuration 20 on 4 cores.
– Does the cycle count decrease linearly with the number of cores? Explain why or why not.
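One way to make "linearly" concrete is to compare the measured speedup against the core count. The following is a minimal sketch in C; the helper names speedup and efficiency are hypothetical and not part of the homework files, and the cycle counts must come from your own runs.

```c
#include <assert.h>
#include <stdint.h>

/* Speedup of an n-core run relative to the 1-core run.
   Perfectly linear scaling would give a speedup equal to n. */
double speedup(uint64_t cycles_1core, uint64_t cycles_ncore)
{
    assert(cycles_ncore > 0);
    return (double)cycles_1core / (double)cycles_ncore;
}

/* Parallel efficiency: speedup divided by the core count.
   1.0 means perfectly linear scaling; lower values indicate overhead
   such as synchronization, memory contention, or uneven work split. */
double efficiency(uint64_t cycles_1core, uint64_t cycles_ncore, unsigned ncores)
{
    assert(ncores > 0);
    return speedup(cycles_1core, cycles_ncore) / (double)ncores;
}
```

If the efficiency stays close to 1.0 as the core count grows, the decrease is roughly linear; a falling efficiency is the thing to explain in your answer.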
Part 2: Cache and matrix multiplication
In this part, we revisit matrix multiplication. You are asked to implement 64x64 matrix multiplication on a 4-core system with a 128-B L1-D$ and a 128-B L1-I$ (no L2). The cache size is fixed, so you can only change the set and way settings of the L1 caches.
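With the capacity fixed, the number of sets and the number of ways trade off against each other through the relation capacity = sets x ways x block size. The sketch below illustrates that relation in C; the function names are hypothetical, and the block size is left as a parameter because it depends on your rocket-chip configuration.

```c
#include <assert.h>
#include <stddef.h>

/* Number of sets for a cache of `capacity` bytes with `ways` ways and
   `block_bytes` bytes per block. Returns 0 if the capacity is not an
   exact multiple of ways * block_bytes. */
size_t cache_sets(size_t capacity, size_t ways, size_t block_bytes)
{
    assert(ways > 0 && block_bytes > 0);
    size_t bytes_per_set = ways * block_bytes;
    if (capacity % bytes_per_set != 0)
        return 0;
    return capacity / bytes_per_set;
}

/* Set index that a byte address maps to: the block number modulo the
   number of sets. */
size_t cache_set_index(size_t addr, size_t sets, size_t block_bytes)
{
    assert(sets > 0 && block_bytes > 0);
    return (addr / block_bytes) % sets;
}
```

Two addresses with the same set index compete for the ways of that set, so more ways reduce conflict misses while fewer sets make more addresses share a set.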
Change the dataset in /root/emulator/benchmarks/mt-matmul/mt-matmul.c to the 64x64 one (dataset2.h). The cache setting is specified in /root/emulator/system/HW5.scala, and you can build the simulator using
make -j8 CONFIG=freechips.rocketchip.system.HW5Config
in /root/emulator.
The matrix multiplication program is located at /root/emulator/benchmarks/mt-matmul/matmul.c. Each thread enters this function with its thread id and local storage (128KB) and exits once its task is finished. You may want to look at the files under mt-matmul/ and common/.
Consider both the distribution of the workload across threads and the cache behavior when you implement matrix multiplication. Your score is based on the cycle count obtained with your HW5.scala and matmul.c.
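As a starting point for thinking about workload distribution, the sketch below splits the rows of the result matrix into contiguous bands, one per core. It is an illustration under assumptions, not the expected solution: the function name matmul_band, its parameter list, and the data_t element type are hypothetical, so check matmul.c and the dataset header for the actual entry point and types.

```c
#include <assert.h>
#include <stddef.h>

typedef int data_t; /* assumption: element type; check the dataset header */

/* Compute this core's band of rows of C = A * B for n x n row-major
   matrices. Rows are split into contiguous bands so that no two cores
   write the same row of C. The i-k-j loop order makes the innermost
   loop walk one row of B and one row of C contiguously. */
void matmul_band(size_t coreid, size_t ncores, size_t n,
                 const data_t *A, const data_t *B, data_t *C)
{
    assert(ncores > 0 && coreid < ncores);
    size_t row_begin = coreid * n / ncores;
    size_t row_end = (coreid + 1) * n / ncores;

    for (size_t i = row_begin; i < row_end; i++) {
        for (size_t j = 0; j < n; j++)
            C[i * n + j] = 0;
        for (size_t k = 0; k < n; k++) {
            data_t a = A[i * n + k]; /* reused across the whole inner loop */
            for (size_t j = 0; j < n; j++)
                C[i * n + j] += a * B[k * n + j];
        }
    }
}
```

Whether this loop order and partitioning suit the cache depends on the set and way settings you choose in HW5.scala, so it is worth measuring alternatives such as blocking or a different split.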