Data cache in Pipelined CPU (OkkyunWoo)
• Uses a blocking data cache instead of a "magic memory"
Data flow
• When there is a cache miss, the cache requests data from memory
• When a dirty cache line is evicted, it should be written back to memory
  *Acknowledgement from the memory is implicit in our lab

Signals to / from the cache
• To the cache: addr[31:0], din[31:0], is_input_valid, mem_rw
• From the cache: is_ready, dout[31:0], is_hit, is_output_valid

Input signals to the data cache
• addr: memory address that the CPU wants to read or write
• mem_rw: access type (0: read, 1: write)
• din: data that the CPU writes to the cache (for stores)
• addr and din should be used only when is_input_valid is true
• din should be used only when mem_rw is 1

Output signals from the data cache
• is_ready indicates the status of the cache
  - True if the cache is ready to accept a request
  - False if the cache is busy serving a prior request; it cannot accept a new request, so a LD/ST would be stalled
  - In the 5-stage pipeline, is_ready will always be true when a LD/ST enters the MEM stage
• is_output_valid indicates whether dout and is_hit are valid
  - dout: data read from the cache (for loads)
  - is_hit indicates whether a cache hit occurred
  - When the outputs from the cache are valid, LD/ST instructions can use them and continue execution (see the behavioral sketch below)
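A small behavioral sketch in C may help to see the handshake order. Everything below is a hypothetical model, not the lab's Verilog: cache_if_t, cache_step(), and mem_stage_access() are made-up names, and the placeholder cache always answers in one step with a hit. Only the protocol it exercises comes from the slide: wait for is_ready, present the request while is_input_valid is high, then wait for is_output_valid before consuming dout and is_hit.

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical C model of the cache interface (signal names taken from the slide). */
typedef struct {
    /* driven by the CPU (MEM stage) */
    uint32_t addr, din;
    bool     is_input_valid;
    bool     mem_rw;                 /* 0: read, 1: write */
    /* driven by the cache */
    uint32_t dout;
    bool     is_ready, is_hit, is_output_valid;
} cache_if_t;

/* Placeholder one-step model: always ready, answers every request as a hit.
 * A real model (or the Verilog) would hold tag/data arrays and a miss FSM. */
static void cache_step(cache_if_t *c) {
    c->is_output_valid = c->is_input_valid;   /* respond to the request seen this step */
    c->is_hit          = c->is_input_valid;
    c->dout            = 0;                   /* placeholder read data */
    c->is_ready        = true;
}

/* How a LD/ST in the MEM stage uses the handshake: wait for is_ready, present the
 * request while is_input_valid is high, then stall until is_output_valid before
 * consuming dout / is_hit (for stores the returned value can be ignored). */
uint32_t mem_stage_access(cache_if_t *c, uint32_t addr,
                          uint32_t store_data, bool is_store) {
    while (!c->is_ready)             /* cache busy with a prior request: stall */
        cache_step(c);

    c->addr           = addr;        /* valid only while is_input_valid is true */
    c->mem_rw         = is_store;
    c->din            = store_data;  /* meaningful only when mem_rw == 1 */
    c->is_input_valid = true;
    cache_step(c);
    c->is_input_valid = false;

    while (!c->is_output_valid)      /* on a miss the cache keeps the LD/ST waiting */
        cache_step(c);

    return c->dout;                  /* dout and is_hit are valid now */
}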
Cache structure
• The size of a cache line (block) is 16 bytes (four 32-bit words)
• [Figure: the address is split into tag, set index, block offset, and 4 B (byte) offset fields; each set contains one or more ways (Way 0, Way 1, ...), and each way holds one 16 B line as four 4 B words (a C sketch of this address mapping follows below)]
• Asynchronous read
  - valid, data, is_hit
• Synchronous write
  - Writes to the cache line (from both the CPU and memory) should be synchronous
• Write-back, write-allocate
  - Read data from the memory if a write miss occurs
• Replacement policy
  - Choose any way except for the MRU way
• Structure
  - Choose between direct-mapped or set-associative (extra point), but not fully-associative
  - Size: 256 bytes (data bank)
  - You are free to define the # of ways and # of sets
• Each cache line should have:
  - Valid bit
  - Dirty bit
  - Bits for replacement
  - ...
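To make the address breakdown concrete, here is a minimal C sketch of the lookup. It assumes one legal configuration rather than a mandated one: direct-mapped, 256 B data bank, 16 B lines, hence 16 sets. The type and function names are illustrative only, and the field widths change if you choose a different number of sets or ways.

#include <stdint.h>
#include <stdbool.h>

/* Assumed configuration (one legal choice, not required by the lab):
 * direct-mapped, 256 B data bank, 16 B lines -> 16 sets.
 * Address split: [31:8] tag | [7:4] set index | [3:2] word offset | [1:0] byte offset */
#define LINE_BYTES     16
#define NUM_SETS       (256 / LINE_BYTES)    /* 16 */
#define WORDS_PER_LINE (LINE_BYTES / 4)      /* 4  */

typedef struct {
    bool     valid, dirty;
    uint32_t tag;
    uint32_t data[WORDS_PER_LINE];
} cache_line_t;

static cache_line_t cache[NUM_SETS];

/* Software model of the lookup: returns true on a hit and writes the word to *dout. */
bool cache_read(uint32_t addr, uint32_t *dout) {
    uint32_t word_off = (addr >> 2) & (WORDS_PER_LINE - 1);
    uint32_t set_idx  = (addr >> 4) & (NUM_SETS - 1);
    uint32_t tag      = addr >> 8;              /* remaining upper bits */

    cache_line_t *line = &cache[set_idx];
    if (line->valid && line->tag == tag) {      /* hit: asynchronous read in hardware */
        *dout = line->data[word_off];
        return true;
    }
    return false;                               /* miss: fetch the line from memory,
                                                   writing it back first if dirty */
}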
Matrix data layout
• Memory layout of the matrix (row-major order)
  - Assume each element of the matrix is 4 B
  - Assume the cache line size is 16 B
• [Figure: matrix elements stored at consecutive addresses 0x00, 0x04, 0x08, 0x0c, 0x10, 0x14, 0x18, 0x1c; each 16 B cache line therefore holds four consecutive elements of a row]

Naïve implementation
• Is this cache-friendly? No. Why?
  - The inner loop walks one of the matrices column-wise, so consecutive accesses land in different cache lines and the other elements of each fetched line are evicted before they can be reused (see the C sketch below)
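For concreteness, here is a hypothetical C-level version of the naïve algorithm; the lab's actual test program may differ in size and language, and N and the function name are made up. The point is the access pattern: with row-major storage, the innermost loop strides down a column of B, so each iteration touches a different 16 B line and the remaining elements of every fetched line go unused before eviction.

#define N 16   /* example matrix dimension (hypothetical; the lab's size may differ) */

/* Naïve matmul: C = A * B, all matrices row-major.
 * The innermost loop walks B column-wise (stride N * 4 bytes), so each
 * B[k][j] access lands in a different 16 B line, and the other three
 * elements of that line are rarely reused before the line is evicted. */
void matmul_naive(const int A[N][N], const int B[N][N], int C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            int sum = 0;
            for (int k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];   /* B accessed with a large stride */
            C[i][j] = sum;
        }
}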
Tiled implementation
• Is this cache-friendly? If yes, why?
  - Reuse data (in the cache) as much as possible within each tile
  - The tile size is set to the cache line size (see the C sketch below)
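Below is a hypothetical C-level sketch of one common tiling scheme, again not the lab's actual code: with TILE = 4 elements (4 × 4 B = 16 B, i.e. one cache line), each line that is brought in for a tile is reused several times before it can be evicted. It assumes C has been zero-initialized.

#define N    16     /* example matrix dimension (hypothetical) */
#define TILE 4      /* 4 elements x 4 B = 16 B, i.e. one cache line */

/* Tiled matmul: operate on TILE x TILE sub-blocks so every 16 B line that is
 * brought into the cache is reused TILE times before eviction.
 * C must be zero-initialized before the call. */
void matmul_tiled(const int A[N][N], const int B[N][N], int C[N][N]) {
    for (int ii = 0; ii < N; ii += TILE)
        for (int jj = 0; jj < N; jj += TILE)
            for (int kk = 0; kk < N; kk += TILE)
                /* multiply the A[ii..][kk..] and B[kk..][jj..] tiles */
                for (int i = ii; i < ii + TILE; i++)
                    for (int k = kk; k < kk + TILE; k++) {
                        int a = A[i][k];
                        for (int j = jj; j < jj + TILE; j++)
                            C[i][j] += a * B[k][j];   /* unit stride within the tile */
                    }
}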
Submission
• Blocking data cache
  - Direct-mapped (no extra credit)
  - N-way associative cache (full extra credit, +3)
• You need to follow the rules described in lab_guide.pdf
• The design of the cache
  - Direct-mapped or associative cache
  - Analyze the cache hit ratio
  - If you implement an associative cache, compare it with the direct-mapped cache
  - Explain your replacement policy
• Naïve matmul vs. optimized matmul
  - Why is the cache hit ratio different between the two matmul algorithms?
  - What happens to the cache hit ratio if you change the # of sets and # of ways?

Submission
• Implementation file format
  - .zip file name: Lab5_{team_num}_{student1_id}_{student2_id}.zip
  - Contents of the zip file (only *.v):
    - cpu.v
    - ...
  - Do not include top.v, InstMemory.v, DataMemory.v, RegisterFile.v, and CLOG2.v
• Report file format
  - Lab5_{team_num}_{student1_id}_{student2_id}.pdf