CPEN412 - Microcomputer Systems Design - Assignment 4

You can only do this assignment if you have Assignment 2 working completely, i.e. you have replaced the Sram with Dram and got the 68k and debug monitor to run from Dram.

 

MAKE SURE you back up your current design before starting this assignment, as we will do some performance comparisons with it at the end.

 

Introduction and Background to Lab

We saw in Lectures 9-11 that since the 1980s CPUs have become dramatically faster than main memory based on Dram technology. In fact it would be pointless running a multi-gigahertz processor directly from Dram today, as it would incur so many wait states that the performance would be appalling.

For this reason most CPUs have some kind of high speed SRam based cache memory that provides a “local high speed copy” of the most recently accessed Dram data. That is, when the CPU accesses an address in Dram (which will be slow), a copy of that data is also stored in the cache memory.

 

You can see how various “levels” of cache are utilised in a modern multi-core processor such as the Intel i7 shown below, with its multiple levels of cache memory. In fact a large proportion of silicon within a chip is given over to Cache memory, just so that the core processor can be supplied with instructions and data fast enough to keep it going.

 

 

 

Our basic 68k soft core processor only runs at 25 MHz because that is the maximum speed we could get from the Dram without incurring wait states. 

 

OK – we could have squeezed a little bit more speed out of it by pushing the envelope of the Dram timing parameters; most are pessimistic anyway and allow quite considerable latitude. The danger here of course is that boards and systems will fail when you move to volume production.

 

Remember also that increasing the clock speed of a Dram chip does not necessarily lead to faster access to the 1st item of data (only subsequent items when bursting and pipelining data). Raising clock speed would simply mean that our dram controller would have to introduce more clock delays (to build up the specified access time) before the 1st item of data became available. 

 

Most modern CPUs such as the i7 will also have separate caches for both data and instructions so that they do not compete for storage. Other CPUs may adopt a unified cache where a single cache holds both instructions and data.

 

How does a cache work? Let’s suppose the CPU asks for an instruction from main (Dram) memory. Initially that instruction will not be in the cache, so it will have to be read in from memory and given to the CPU. At the same time, a copy of that instruction could also be stored in the cache memory. If sometime in the future the CPU were to fetch and execute that same instruction again, the cache controller could give the CPU that instruction without incurring the slow Dram access.

 

The likelihood of this occurring is significant due to temporal and spatial locality of data (https://en.wikipedia.org/wiki/Locality_of_reference). This can be explained as follows:

Temporal locality: Given that many programs spend their time executing within small loops, there is a strong probability that an instruction or variable accessed once would be accessed again fairly soon.

 

Spatial locality: Likewise given the mostly sequential nature of program execution, if an instruction or variable is fetched from main (dram) memory there is a strong probability that the next ‘n’ instructions (or variables in the case of say an array) will also be accessed shortly afterwards.

 

A good cache controller will thus try to predict what the CPU will access next and bring that data (instruction code or variables) into the cache from main memory in anticipation of it being needed. Of course if it guessed correctly and the CPU accesses information now stored in the cache (a cache hit) then a significant improvement in performance will be seen. If it did not guess correctly and the CPU attempts to access information not in the cache (a miss) then there is no harm done, but that information will have to be fetched from dram which will cause the CPU to wait.

 

A block diagram of the cache controller is given below. You can see it sits between the CPU and the main memory (dram) controller. 

[Figure: block diagram of the cache controller sitting between the CPU and the main (Dram) memory controller]
 
It seems obvious, then, that the larger the cache memory, the more information it can hold and the higher the probability of a cache “hit” somewhere down the line while the program is executing. However, because the cache memory is much smaller than the main memory, a cache controller has to make choices about which data to cache. It may have to eject some existing data (not used recently) in order to make space in the cache for new data needed now.

 

Cache Performance

Note that a cache controller cannot predict with 100% accuracy what information the CPU will need next; it can only make an educated guess. If, for instance, a subroutine is called, then the next instruction could come from an unrelated address, which violates spatial locality. More to the point, the cache “hit” rate depends to a large extent upon the program code: its size, the time it spends in tight loops, what subroutines are called and what data is fetched and stored. It’s possible for the performance of a cache to be very good for some sections of a program’s execution and poor for others.

 

Cache “thrashing” can occur when the cache is constantly ejecting old information in order to make way for new information, leading to poor performance that could be as bad as (or worse than) having no cache at all.
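As a rough illustration of thrashing, the C sketch below counts misses for a sequence of reads. It assumes the direct mapped geometry built later in Part A (32 lines of 16 bytes, so the line index is address bits [8:4] and the tag is bits [31:9]); the function name count_misses is made up for this example and is not part of the lab files.

```c
#include <stddef.h>
#include <stdint.h>

/* Count the misses for a sequence of read addresses on a direct mapped
   cache with 32 lines of 16 bytes (assumed geometry, see Part A). */
unsigned count_misses(const uint32_t *addrs, size_t n)
{
    uint32_t tags[32];
    int      valid[32] = {0};          /* every line starts out invalid */
    unsigned misses = 0;

    for (size_t i = 0; i < n; i++) {
        uint32_t index = (addrs[i] >> 4) & 31u;   /* address bits [8:4]  */
        uint32_t tag   = addrs[i] >> 9;           /* address bits [31:9] */

        if (!valid[index] || tags[index] != tag) {
            /* miss: eject whatever was in the line and refill it */
            valid[index] = 1;
            tags[index]  = tag;
            misses++;
        }
    }
    return misses;
}
```

Alternating reads from two addresses that are 512 bytes apart (e.g. 0x0000 and 0x0200) land on the same line with different tags, so every access is a miss. Two addresses inside the same line miss only once.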

 

Rationale

You probably met the concept of a cache in CPEN 211 and maybe again in CPEN 311. If you took CPEN 411 you will have met it in more detail, but none of those courses asked you to design one, experiment with different types and sizes of cache, and witness firsthand the impact on performance.

 

In this lab we will equip our 68k processor with a Cache controller based on a unified instruction and data design. The storage for the cache will be made up of SRam present within the FPGA. Our design will be based around a single-level cache, as there is no benefit to having level 2 or 3 caches when the memory speed for all of them is the same (you might as well just make a larger level 1 cache).

 

Because our 68k can only run directly from dram at 25 MHz (dram is relatively slow when accessed randomly), the use of a cache (which relies on faster burst read/write operation on dram) will allow us to increase the clock speed of our 68k to perhaps as high as 45 MHz for greater performance (that is the limit of the fitter’s ability to lay it out on the FPGA and still have it work).

We can experiment with changing the cache’s size and type to see the effect on performance for a typical program. It also opens up the possibility of having dual 68k cores in our system (each with their own cache), sharing common dram memory due to the increased bandwidth afforded by burst dram accesses (see lecture 19/20).

 

Ideally we would like separate data and instruction caches, since the two types of cache can be optimised slightly differently, but the soft core 68k processor does not provide the necessary signals required to determine if the current memory access being made is for an instruction op-code fetch or for data/variables. 

 

The original 68k did have those signals in the form of the function code pins FC2-FC0. This small limitation means we will need to create one combined (unified) instruction and data cache. This is not optimal, as instructions and data (i.e. variables) will have to compete for space in the cache, but it should nevertheless provide a worthwhile improvement in speed.

Cache Design: There are several ways to design a cache, but perhaps the simplest to implement is the Direct Mapped Cache. Better performance can be obtained through the use of a set associative cache. We will build both types of cache and compare their performance. Make sure you read through the lecture notes before going further. 

 


Note that this lab is in two parts (A & B) with different submission dates for each, so please check the Canvas calendar for each date.

PART A - Building a 32 Line x 8 word - 512 Byte Direct Mapped Cache Controller

The architecture for this type of cache is shown below (notice how the CPU address lines are split into word, index and tag address). Note also that the validity and tag data, as well as the cache lines, are created out of memory. We will build up this data path in stages and then add the state machine to control the cache.
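To make the address split concrete, here is a small C sketch of how a 68k address divides into word, index and tag for the 32 line x 8 word cache. The bit positions follow the rest of this assignment (word address = bits [3:1], a 5 bit index, a 23 bit tag taken from bits [31:9]); the function names are invented for this example and do not appear in the lab files.

```c
#include <stdint.h>

#define WORD_BITS   3u   /* 8 words per line */
#define INDEX_BITS  5u   /* 32 lines         */

/* Word within the cache line: address bits [3:1] (bit 0 selects the byte). */
uint32_t cache_word(uint32_t addr)
{
    return (addr >> 1) & ((1u << WORD_BITS) - 1u);
}

/* Cache line number: address bits [8:4]. */
uint32_t cache_index(uint32_t addr)
{
    return (addr >> 4) & ((1u << INDEX_BITS) - 1u);
}

/* Tag stored in the Tag memory: address bits [31:9], i.e. 23 bits. */
uint32_t cache_tag(uint32_t addr)
{
    return addr >> 9;
}
```

For example, address 0x08060002 gives word 1, index 0 and tag 0x40300.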

 

 


Let’s Start: 

Open your 68k system design from the last assignment (remember to make a backup). To make the memory for the Valid, Tag and Data, bring up the IP Catalog in Quartus from the Tools menu.

 

Making the Validity Bit memory block: Using the 1-Port Ram create memory called “Valid_Data” that is 1 bit wide with 32 locations and no registered ‘q’ output (no flip-flop on the data output). Make sure you tick the box to create the Quartus symbol (.bsf file) for the new memory (in the last step of the wizard). Now paste the symbol into a new schematic file “Valid.bdf” and wire up with pins and names matching those below. Then create a new symbol for the Valid.bdf schematic file (from the file menu).

 

Making the Tag memory: Using the 1-Port Ram, create memory called “Tag_Data” that is 23 bits wide with 32 locations and no registered ‘q’ output. Make sure you create the Quartus symbol (.bsf file) for the new memory. Now paste the symbol into a new schematic file “TagMemory.bdf” and wire up with pins and pin names as shown below, then create a new symbol for the TagMemory.bdf file.

 

 

Making the Cache’s Data (line) memory: Using the 1-Port Ram, create memory called “CacheData” that is 8 bits wide with 256 locations (32 lines x 8 words per line) and no registered ‘q’ output. Make sure you create the Quartus symbol (.bsf file) for the new memory. Now paste the symbol into a new schematic file “CacheDataMemory.bdf” and wire up as shown below. Notice that the data memory is organised as 2 individually accessible bytes (upper and lower), each built from a 256 x 8 bit memory, which gives the 16 bytes per line. Now create a new symbol for the CacheDataMemory.bdf file.

 

 

 

 

Note: Although you cannot see it in the diagram below, the signal/bus shown here is named Address[7..0]. You can set that by right-clicking on the bus/wire and entering the name into the properties box.

 

 

 

Creating the “Hit Comparator”. 

Create a new VHDL file “AddressComparator.vhd” to compare the tag data (a 23 bit address) with the 68k’s 23 bit address. Here is the simple VHDL. You can rewrite it in Verilog if you wish, as it is trivial.

 

LIBRARY ieee;
USE ieee.std_logic_1164.all;

-- Only std_logic_1164 is needed: "=" is predefined for std_logic_vector

entity AddressComparator is
    Port (
        AddressBus : in  Std_logic_vector(22 downto 0);  -- tag bits of the 68k address
        TagData    : in  Std_logic_vector(22 downto 0);  -- tag read from the Tag memory
        Hit_H      : out Std_logic                       -- '1' when the two tags match
    );
end;

architecture bhvr of AddressComparator is
begin
    -- Purely combinational compare of the stored tag with the address tag
    process(AddressBus, TagData)
    begin
        if (AddressBus = TagData) then
            Hit_H <= '1';
        else
            Hit_H <= '0';
        end if;
    end process;
end;

 

 

Now create a symbol for the comparator.


Creating the Cache Controller: On Canvas you will find the outline for the design of the incomplete cache controller written in Verilog. This is a state machine design (like the dram controller we wrote in a previous assignment). It controls the Cache and decides when to read from cache or dram memory etc. Copy the code for the cache controller into a new Verilog File called “M68kCacheController.v” and make sure it is added to your project. The pseudo code for the state machine is contained at the end of this assignment. You will have to turn this into real Verilog to complete the assignment. 

 

Save the file and create a symbol for it. It should look like this.

 

This cache controller sits between the 68k and the Dram Controller, so most of the signals on the LHS connect to the 68k, and most of those on the right connect to the Dram Controller and to the various bits of cache memory and the comparator we made earlier.

 

Create a new schematic file (.bdf) called DramCache and paste symbols for all the things we have designed so far onto it. The finished circuit should look like that below. If you zoom in, you should be able to see the names and connections. The tri-state buffer at the top of the circuit can be found elsewhere in the 68k system design (e.g. look at the circuit for the ROM made in Assignment 1), so copy and paste it. Save the design and create a symbol for the schematic.

 

Creating a New Dram Controller: Later, when we have finished our Cache design, we are eventually going to raise the clock speed of the 68k to 45MHz (just about the fastest we could get it to run at on the FPGA of the DE1). For this reason, we will need a new Dram Controller that works with the higher clock speeds. 

 

Remember Drams have fixed timing parameters expressed in ns, so our dram controller has to build up those time delays as multiples of a known clock period. If we raise the dram clock from 50 to 90 MHz then the design of the dram controller will also need to change.
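The arithmetic behind this can be sketched in C: a delay specified in ns has to be rounded up to a whole number of clock periods, and that number changes with the clock frequency. The function name and the 20 ns figure used in the example are hypothetical, chosen only to show the effect; check the real parameters in the Dram data sheet.

```c
/* Smallest whole number of clock periods that covers a delay of t_ns
   when the clock runs at f_mhz.
   period_ns = 1000 / f_mhz, so clocks = ceil(t_ns * f_mhz / 1000). */
unsigned clocks_for_delay(unsigned t_ns, unsigned f_mhz)
{
    return (t_ns * f_mhz + 999u) / 1000u;
}
```

With this, a (hypothetical) 20 ns parameter needs 1 clock at 50 MHz but 2 clocks at 90 MHz, which is why the state machine in the controller has to change when the clock is raised.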

 

More importantly, we need to change the dram controller so that it programs the Dram’s “Mode” register to perform 8 location sequential bursts during a read (but no burst during a write) – we discussed the mode register in the previous Dram controller assignment. This has been done in the dram controller already.



 
Now we are going to bring the Cache Controller and the new Dram Controller together, so create a new schematic file and paste symbols for the two circuits onto it and wire up as shown below. This is now our complete Cache and Dram Controller circuit. Save the file as “CachedDramController.bdf” and create a symbol for it.

 

 Notice in the schematic below, how the dram cache controller “grabs” the data directly from the dram. It does not go via the dram controller. In fact the dram controller is there just to enforce read/write timing, Dram initialisation via the mode register and refreshing of the dram.

 
Now go back to the top level schematic of our 68k system and replace the symbol for the existing dram controller with the new “CachedDramController” created above. It should look like this. Don’t connect the bottom 6 signals on the lower r.h.s. to pins; they are for debugging (should you need them). Adding extra pins might mean the fitter runs out of pins, and that causes problems.


Important: remember to translate the pseudo HDL for the cache controller discussed earlier into real Verilog. 

 

Increasing the 68k Clock speed to 45 MHz

The whole purpose of adding the cache was to increase the clock speed of the 68k so it would run faster, i.e. at 45 MHz, so let’s do that now. The new Dram controller has been designed to meet the critical timings associated with a 90 MHz Dram and Dram Controller clock, so it will not work unless those clocks are changed.

 

On the top level schematic, bring out a new clock signal called Clock50Mhz. This drives the baud rate generator (and graphics controller) in our design, so we have to stick to that frequency.

  

 

Now adjust the clock and phase relationships of the phase locked loop (PLL called ClockGen) as shown above and below. You will notice the CPU clock has been increased to 45 MHz and the Dram Memory and Controller clocks have been increased to 90 MHz (i.e. double the CPU clock speed to maintain a constant phase relationship between them).  The 90 MHz and 90 MHz Inverted clocks connect to our Cache Controller “clock” and “clock_inverted” signals.

 

To make the adjustments, double click the PLL symbol called “ClockGen” and make these changes in the MegaWizard. Note we need to change some CLOCK and PHASE timing for some of them to account for critical time delays along PCB traces on the DE1 board and to meet set-up and hold requirements at the Dram chips.

 

In particular, you will notice that with outclk_3, the inverted 90 Mhz clock that drives the dram controller, it’s necessary to modify its phase shift from the previous 180 degrees to something like 230-250 degrees as shown below. This is due to the fact that we have doubled their clock frequencies and time delays on the PCB of the DE1 board need to be considered.

 

Hopefully you won’t have to experiment with this, but if you cannot get the assignment to work and you’ve checked everything else, experiment with increasing/decreasing this phase shift in steps of 10 degrees. As a last resort, drop the CPU clock frequency to 40 MHz and the two Dram controller clocks to 80 MHz (out of phase by 180 degrees). The Dram controller HDL code should not need changing. Other than that, follow the remaining instructions.

 

Note relevant to above: Part of the problem with the “free” version of Quartus (as opposed to the licensed one you can buy) is that some features are disabled. One particularly important “disabled” feature, critical to creating high speed designs and re-usable libraries of components, is the ability to “lock down” the resultant synthesis, layout and timing associated with a component/sub-module so that it is fixed thereafter. That is the “incremental compilation” feature, which recompiles only the components/modules that have changed.

 

With the free version of Quartus, each time you compile your project, the whole design is re-synthesised and laid out on the FPGA. This can be a problem when new hardware is suddenly added/changed as it can result in previously working designs suddenly failing due to changes in their layout and critical/tight timing. 


Remember the Dram controller runs on the inverted clock, while the Dram memory itself runs on the non-inverted clock. The phase difference between them accounts for the time delay between rising edges of these clocks, e.g. at 90 MHz the clock period is 11.1 ns, so a phase shift of 180 degrees on outclk_3 means this clock is delayed by 5.55 ns relative to outclk_2 (which has no phase shift).

Aim for somewhere between 220 and 250 degrees phase shift for this clock, i.e. start with say 230. If the system springs to life but proves “flaky” when running the memory test, entering data at the keyboard or downloading a program, try a different value. This is as much due to the Quartus “fitter” as timing skew.
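If you want to see what a given phase shift means in time, the conversion is just a fraction of the clock period. A minimal C sketch (the function name is made up for this example):

```c
/* Delay in ns produced by a phase shift of `degrees` on a clock of f_mhz.
   One full period (360 degrees) lasts 1000 / f_mhz ns. */
double phase_delay_ns(double degrees, double f_mhz)
{
    double period_ns = 1000.0 / f_mhz;
    return period_ns * degrees / 360.0;
}
```

At 90 MHz, 180 degrees works out to about 5.56 ns and 230 degrees to about 7.1 ns, so each 10 degree step moves the clock edge by roughly 0.3 ns.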
 
   

 

Click finish and when asked, update the PLL symbol on the top level schematic. 

 

Now rename the signals coming from the PLL using the names given above, e.g. Clock45Mhz, Clock90Mhz etc.

 


Lastly, the IIC, SPI and Canbus Controllers in our design run off a signal called Clock25Mhz. Change the signal name shown below to Clock45Mhz to match the signal name coming from the PLL, then re-compile the design.

 

Simulating the results: 

 

You don’t have to do this (unless you are having trouble and need to debug it), but it’s interesting to see a simulation of what happens inside the cache controller. In the simulation below, I’ve brought out the extra “debugging and simulation” signals on the CachedDramController to top level pins in Quartus to allow simulation. I also changed my cstart_V4.0.asm code to perform some reads and writes to Dram addresses after a reset, to force hits and misses and so exercise the cache. I then recompiled and added the extra signals to the “.vwf” vector waveform file before re-simulation.

 

Note that bringing out the extra debug signals MAY cause the download of the “.sof” file to the DE1 to fail (the DE1 green download light stays on, but it should still work when you press DE1/reset). This is because when Quartus runs out of physical pins on the FPGA to assign to signals, it starts using reserved ones associated with downloading; this is what causes the programmer to report a fail and the green light to stay on. Just don’t convert it to a .JIC file and program it into the DE1.

 

 Here’s the result. You can see that the first time the 68k attempts to read a word from location 0x08060002, it generates a miss (i.e. Hit_H = ‘0’) and the access time is stretched from the normal 4 clocks to 10 clocks.

 

              

 

This triggers a burst read from Dram. You can see the “word address” of the cache counting from 0 to 7 as it stores the dram data away. Eventually data from word address ‘1’ (corresponding to the cache line entry for address 0x08060002) is fetched and given back to the 68k along with a Dtack_L at the end. Later a byte read from address 0x08060003 is performed, resulting in a cache HIT from the same cache line, and the 68k gets the data (from the cache) and a Dtack_L without delay in 4 clock cycles.
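The hit/miss behaviour seen in the simulation can be mimicked with a few lines of C. This is only a software model of the Valid and Tag memories for the 32 line cache, written under the assumptions used in this assignment (index = address bits [8:4], tag = bits [31:9], reads fill a line, writes invalidate it). The names cache_read and cache_write are invented for the example; the real controller is the Verilog state machine.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_LINES 32u

static bool     valid[NUM_LINES];   /* models the Valid memory (all invalid after reset) */
static uint32_t tags[NUM_LINES];    /* models the Tag memory                              */

/* Model a 68k read. Returns true on a hit. On a miss the whole line is
   (notionally) burst filled from Dram, so the tag is stored and the
   line is marked valid. */
bool cache_read(uint32_t addr)
{
    uint32_t index = (addr >> 4) & (NUM_LINES - 1u);
    uint32_t tag   = addr >> 9;

    if (valid[index] && tags[index] == tag)
        return true;

    valid[index] = true;
    tags[index]  = tag;
    return false;
}

/* Model a 68k write. Written data is not cached, so the line at that
   index is simply marked invalid. */
void cache_write(uint32_t addr)
{
    valid[(addr >> 4) & (NUM_LINES - 1u)] = false;
}
```

Starting from reset, a read of 0x08060002 misses and a following read of 0x08060003 hits, matching the waveform described above. A write to the same line then forces the next read to miss again.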

 

Taking Measurements to Determine CPU Speed-up with Cache

One of the purposes of this assignment is to measure the performance increase of adding the cache. In a perfect world, increasing the CPU clock from 25 to 45 MHz should yield a speedup of x1.8. In practice this will not be achievable due to misses etc, but we should get nearer to that, especially as we increase the size of the cache (see later). Remember however that cache performance depends heavily upon the nature of the program we run to measure it. 
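The two numbers you will be comparing can be written down as a short C sketch. The 25 and 45 MHz clocks come from this assignment; the function names are made up for the example, and the run times you pass in are whatever you measure.

```c
/* Best case speed-up from the clock change alone (no misses at all). */
double ideal_speedup(double f_old_mhz, double f_new_mhz)
{
    return f_new_mhz / f_old_mhz;
}

/* Measured speed-up: run time on the old system divided by run time on
   the new one, both in the same units (e.g. seconds). */
double measured_speedup(double t_old, double t_new)
{
    return t_old / t_new;
}
```

ideal_speedup(25.0, 45.0) gives the x1.8 upper bound quoted above. Your measured figure is expected to be lower, and how much lower tells you how often the cache misses.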

 

Using your original 25 MHz design, i.e. the one you got working before starting this assignment, create, compile, load and run the program called SpeedTest.c. An IDE-68k project for it can be found on Canvas. Essentially the program looks like this. You can see it contains both instructions (the compiled C code) and variables (the large 2D arrays).

 

#include <stdio.h>

/* Init_RS232() is not defined in this listing; it is assumed to come
   from the rest of the IDE-68k project. */

int a[100][100], b[100][100], c[100][100];
int i, j, k, sum;

int main(void)
{
    Init_RS232();

    printf("\n\nStart.....");
    for (i = 0; i < 50; i++) {
        printf("%d ", i);
        for (j = 0; j < 50; j++) {
            sum = 0;
            /* the same statement is repeated 20 times to make a long loop body */
            for (k = 0; k < 50; k++) {
                sum = sum + b[i][k] * b[k][j] + a[i][k] * c[i][j];
                sum = sum + b[i][k] * b[k][j] + a[i][k] * c[i][j];
                sum = sum + b[i][k] * b[k][j] + a[i][k] * c[i][j];
                sum = sum + b[i][k] * b[k][j] + a[i][k] * c[i][j];
                sum = sum + b[i][k] * b[k][j] + a[i][k] * c[i][j];
                sum = sum + b[i][k] * b[k][j] + a[i][k] * c[i][j];
                sum = sum + b[i][k] * b[k][j] + a[i][k] * c[i][j];
                sum = sum + b[i][k] * b[k][j] + a[i][k] * c[i][j];
                sum = sum + b[i][k] * b[k][j] + a[i][k] * c[i][j];
                sum = sum + b[i][k] * b[k][j] + a[i][k] * c[i][j];
                sum = sum + b[i][k] * b[k][j] + a[i][k] * c[i][j];
                sum = sum + b[i][k] * b[k][j] + a[i][k] * c[i][j];
                sum = sum + b[i][k] * b[k][j] + a[i][k] * c[i][j];
                sum = sum + b[i][k] * b[k][j] + a[i][k] * c[i][j];
                sum = sum + b[i][k] * b[k][j] + a[i][k] * c[i][j];
                sum = sum + b[i][k] * b[k][j] + a[i][k] * c[i][j];
                sum = sum + b[i][k] * b[k][j] + a[i][k] * c[i][j];
                sum = sum + b[i][k] * b[k][j] + a[i][k] * c[i][j];
                sum = sum + b[i][k] * b[k][j] + a[i][k] * c[i][j];
                sum = sum + b[i][k] * b[k][j] + a[i][k] * c[i][j];
            }
            c[i][j] = sum;
        }
    }
    printf("\n\nDone.....");
    return 0;
}

 

 

Step 1: 

Measure the time it takes to run the program on the 25 MHz – NON-cached version of the 68k System you had before starting this assignment.

 

Step 2: Now measure the time it takes to run on the 45 MHz (or 40Mhz if you could not get the 45Mhz version to compile reliably – there should be no grade penalty for that) version of the 68k system with the 32 line/512 Byte cache. Remember, this caches instructions and data during reads, but not during writes (i.e. it implements the write around cache write policy.) 

 

What is the “speed up” achieved with the 32 line/512 byte cache design?
Explain your results and demonstrate your working system to the TA.
 


 

This last step requires some changes to our cache design which you are asked to work out. Re-design/wire the cache controller to increase the number of cache lines from 32 lines to 512 lines still with 8 words per line. That is, make the cache 16x bigger, i.e. an 8K byte (total) cache. 

 

This will require that the “Index” bus increases from 5 to 9 bits and the Tag reduces from 23 to 19 bits, as more of the 68k’s address lines become part of the Index and fewer become part of the Tag (see below).
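The effect of changing the number of index bits can be checked with a short C sketch before touching the hardware. It assumes the same layout as Part A (address bits [3:0] select the byte within a 16 byte line, the index sits directly above them and the tag is whatever remains of the 32 bit address); all function names are made up for this example.

```c
#include <stdint.h>

/* Total data storage in bytes: lines x 8 words x 2 bytes per word. */
uint32_t cache_bytes(uint32_t lines)
{
    return lines * 8u * 2u;
}

/* Line number: the index_bits address bits directly above bits [3:0]. */
uint32_t index_of(uint32_t addr, unsigned index_bits)
{
    return (addr >> 4) & ((1u << index_bits) - 1u);
}

/* Tag: every address bit above the index. */
uint32_t tag_of(uint32_t addr, unsigned index_bits)
{
    return addr >> (4u + index_bits);
}

/* Width of the tag in bits for a 32 bit address. */
unsigned tag_width(unsigned index_bits)
{
    return 32u - 4u - index_bits;
}
```

With 5 index bits this gives a 512 byte cache and a 23 bit tag; with 9 index bits it gives 8192 bytes and a 19 bit tag, matching the figures above.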

 

 

 

 

In addition, the widths and number of locations of the various memory blocks we created earlier will also need to be changed using the IP catalog Wizard and the symbols for those memory components will need updating and re-wiring etc. 

 

The Address Comparator circuit that generates the “hit” will also need minor changes to bus widths and finally some minor changes to the Verilog for the Cache Controller will also be needed (e.g. the widths of various signal buses and the constants we assign to them).

 

Don’t forget to modify the code in the states responsible for “invalidating the cache” so that more lines are invalidated after a reset.

 

What is the speed up achieved with this bigger cache design?
 

Pseudo code for the State Machine of the Direct Mapped Cache Controller.

 

Some of the states (e.g. invalidating the cache) are already included in the Verilog file on Canvas, so add these extra states and translate the pseudo-code into real Verilog. Try to understand what this is doing by thinking about how the cache should work.

 

-----------------------------------------------

-- Main IDLE state: 

-----------------------------------------------

                        Otherwise if we are in the Idle state {                                      

                                    if AS_L is active and DramSelect68_H  is active {

                                                if the 68k's access is a read, i.e. WE_L is high {

                                                            activate UDS and LDS to the Dram Controller to grab both bytes from Cache or Dram regardless of what 68k asks

                                                            Next state = CheckForCacheHit                                                  

                                                }

                                                else {              -- must be a write, so write the 68k data to Dram and invalidate the line as we don’t cache written data

                                                            if(ValidBitIn_H  is active) {

                                                                        Set ValidBitOut_H to invalid

                                                                        Activate ValidBit_WE_L to perform the write to the Valid memory in the cache. This occurs  on next clock edge 

                                                            }

                                                            Activate DramSelectFromCache_L to zero to start the Dram controller to perform the write a.s.a.p.

                                                            Next state = WriteDataToDram to perform the write

                                                }

                                    }

                        }

 

------------------------------------------------------------------------------------------------------------------------------------------------------------------

-- State to Check if we have a Cache HIT during a read. If so give data to 68k from cache or if not, generate a burst fill 

-----------------------------------------------------------------------------------------------------------------------------------------------------------------

 

                        Otherwise if we are in the CheckForCacheHit state {                      

                                    Keep activating UDS and LDS to Dram/Cache Memory controller to grab both bytes

                                    

                                    -- at this point, the Tag and Valid block will have clocked in the CPU address and output their Valid and Tag address

                                    -- to the comparator so we can see if the cache has a hit or not

                                    

                                    If CacheHit_H is active and the ValidBitIn_H is active {                   -- give the 68k the data from the cache

                                                -- remember by default  DataBusOutTo68k is set to DataBusInFromCache,                                                                                                     

                                                -- so get the data from the Cache corresponding to the CPU address we are reading from 

 

                                                Set WordAddress to AddressBusInFrom68k [3:1]                -- give the cache line the correct 3 bit word address specified by 68k

                                                Activate the DtackTo68k_L signal

                                                Next state = WaitForEndOfCacheRead;

                                    }

                                    Otherwise {                                                              -- we don't have the data Cached so get it from the Dram and Cache data and address

                                                activate DramSelectFromCache_L  signal                                -- start the Dram controller to perform the read a.s.a.p.

                                                Next state =  ReadDataFromDramIntoCache;

                                    }

                        }

 

----------------------------------------------------------------------------------------------------------------------------------

-- Got a Cache hit, so give the 68k the Cache data then wait for the 68k to end bus cycle 

---------------------------------------------------------------------------------------------------------------------------------

 

                        Otherwise if we are in the WaitForEndOfCacheRead state {                     

                                    Keep activating UDS and LDS to Dram/Cache Memory controller to grab both bytes

                                    

                                    --remember by default  DataBusOutTo68k is set to DataBusInFromCache,

 

                                    Set WordAddress to AddressBusInFrom68k bits [3:1]                    -- give the cache line the correct 3 bit address specified by 68k

                                    Activate the DtackTo68k_L signal

                                    

                                    If AS_L is active       {

                                                Next state = WaitForEndOfCacheRead                        -- stay in this state until AS_L deactivated

                                    }

                        }           

 

------------------------------------------------------------------------------------------------------------------------------------------------------------------------

-- Start of operation to Read from Dram State : Remember that CAS latency is 2 clocks before 1st item of burst data appears

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------

 

                        Otherwise if we are in the ReadDataFromDramIntoCache state {

                                    Set Next state =  ReadDataFromDramIntoCache                                --   unless overridden below

                                    

                                    -- we need to wait for a valid CAS signal to be presented to the Dram by the Dram controller.

                                    -- we can’t just look at CAS, since a refresh also drives CAS low

                                    

                                    If CAS_Dram_L is active and RAS_Dram_L is INactive  {                  -- a read and not a refresh

                                                Go to new state CASDelay1 ;                                                       -- move to next state to wait 2 clock periods (CAS latency)

                                    }

 

                                    -- Keep Kicking the Dram controller to perform a burst read and fill a Line in the cache

                                    Activate the DramSelectFromCache_L signal                                     -- keep reading from Dram

                                    Deactivate DtackTo68k_L  signal                                                            -- no dtack to 68k until burst fill complete

 

                                    -- Because we are burst filling a line of cache from Dram, we have to store the TAG (i.e. the 68k's m.s.bits of address bus)

                                    -- into the Tag Cache to mark the fact that we will have the data at that address and move on to next state to get Dram data

 

                                    -- By Default:  TagDataOut set to AddressBusInFrom68k(31 downto 9);

                                    Activate TagCache_WE_L signal                         -- write the 68k's address with each clock as long as we are in this state

                                    

                                    -- we also have to set the Valid bit in the Valid Memory to indicate line in the cache is now valid

                                    Activate ValidBitOut_H signal                             --  Make Cache Line Valid

                                    Activate ValidBit_WE_L signal                            -- Write the above Valid Bit

                                    

                                    -- perform a Dram WORD READ(i.e. 16 bits), even if 68k is only reading a BYTE so we get both bytes as cache word is 16 bits wide

                                    -- By Default : Address bus to Dram is already set to the 68k's address bus by default

                                    -- By Default: AS_L, WE_L to Dram are already set to 68k's equivalent by default

 

                                    Keep activating UDS and LDS to Dram/Cache Memory controller to grab both bytes

                        }

                                                                        

---------------------------------------------------------------------------------------

-- Wait for 1st CAS clock (latency)

---------------------------------------------------------------------------------------

                                    

                        Otherwise if we are in the CASDelay1 state {                                               -- 1st clock of CAS latency (CAS has already gone low)

                                    Keep activating UDS and LDS to Dram/Cache Memory controller to grab both bytes

                                    

                                    -- By Default : Address bus to Dram is already set to the 68k's address bus by default

                                    -- By Default: AS_L, WE_L to Dram are already set to 68k's equivalent by default

 

                                    Keep activating DramSelectFromCache_L                                          -- keep reading from Dram

                                    Deactivate DtackTo68k_L  signal                                                            -- no dtack to 68k until burst fill complete

 

                                    Next state = CASDelay2 ;                                                  -- go and wait for 2nd CAS clock latency

                        }                       

---------------------------------------------------------------------------------------

-- Wait for 2nd CAS Clock Latency

---------------------------------------------------------------------------------------

                                    

                        Otherwise if we are in the CASDelay2 state {                                               -- 2nd clock of CAS latency

                                    Keep activating UDS and LDS to Dram/Cache Memory controller to grab both bytes

                                    

                                    -- By Default : Address bus to Dram is already set to the 68k's address bus by default

                                    -- By Default: AS_L, WE_L to Dram are already set to 68k's equivalent by default

 

                                    Keep activating DramSelectFromCache_L                                          -- keep reading from Dram

                                    Deactivate DtackTo68k_L  signal                                                            -- no dtack to 68k until burst fill complete

 

                                    Activate BurstCounterReset_L   signal                                                  -- reset the counter to supply 3 bit burst address to Cache memory

                                    Next state = BurstFill ;                                                                   

                        }

 

----------------------------------------------------------------------------------------------------------------------------

-- Start of burst fill from Dram into Cache (data should be available at Dram in this  state)

---------------------------------------------------------------------------------------------------------------------------

                        

                        Otherwise if we are in the BurstFill state {                                                     -- store each word of the burst from Dram into the cache

                                    Keep activating UDS and LDS to Dram/Cache Memory controller to grab both bytes

                                    

                                    -- By Default : Address bus to Dram is already set to the 68k's address bus by default

                                    -- By Default: AS_L, WE_L to Dram are already set to 68k's equivalent by default

 

                                    Keep activating DramSelectFromCache_L  signal                              -- keep reading from Dram

                                    Deactivate DtackTo68k_L signal                                                             -- no dtack to 68k until burst fill complete

 

                                    -- burst counter should now be 0 when we first enter this state, as reset was synchronous and will count with each clock

                                    If BurstCounter = 8  {                                                                                 -- if we have read 8 words, it's time to stop

                                                Next state = EndBurstFill;

                                    }

                                    else {

                                                Set WordAddress to cache memory to lowest 3 bits of BurstCounter

                                                

                                                -- By Default: Index address to cache Memory is bits [8:4] of the 68ks address bus for a 32 line cache

                                                

                                                Activate DataCache_WE_L to store  next word from Dram into data Cache on next clock edge

                                                Next state = BurstFill                                                         -- stay in this state until counter reaches 8 above

                                    }

                        }            
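
The BurstFill state above relies on a counter that is reset synchronously in CASDelay2 and then counts with each clock. A minimal Verilog sketch of such a counter is given here. Only BurstCounterReset_L and BurstCounter are names taken from the pseudo-HDL; the module name, the Clock port and the 4 bit width are assumptions made for illustration, and your project may already provide an equivalent counter on the schematic.

```verilog
// Sketch only: free running burst counter with synchronous, active low reset.
// Module name, Clock port and the 4 bit width are assumptions.
module BurstCounterSketch (
    input  wire       Clock,
    input  wire       BurstCounterReset_L,   // activated in CASDelay2
    output reg  [3:0] BurstCounter           // 4 bits so that the value 8 can be tested
);
    always @(posedge Clock) begin
        if (BurstCounterReset_L == 1'b0)
            BurstCounter <= 4'd0;                   // counter reads 0 on entry to BurstFill
        else
            BurstCounter <= BurstCounter + 4'd1;    // counts with each clock
    end
endmodule
```

With this arrangement BurstCounter[2:0] supplies word addresses 0 to 7 on the eight clocks spent writing in BurstFill, and the value 8 marks the end of the burst.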

---------------------------------------------------------------------------------------

-- End Burst fill

---------------------------------------------------------------------------------------

                        Otherwise if we are in the EndBurstFill state {                                                          -- burst complete, so give the data to the 68k

                                    Deactivate DramSelectFromCache_L   signal                                                  -- deactivate Dram controller

                                    Activate DtackTo68k_L signal                                                                             -- give dtack to 68k until end of 68k's bus cycle

                                    

                                    Keep activating UDS and LDS to Dram/Cache Memory controller to grab both bytes

 

                                    -- get the data from the Cache corresponding to the REAL 68k address we are reading from

                                    Set WordAddress (to cache memory) to AddressBusInFrom68k bits [3:1]

                                    Set DataBusOutTo68k to DataBusInFromCache;                                            -- get data from the Cache and give to cpu

 

                                    -- now wait for the 68k to terminate the read by removing either AS_L or DRamSelect68k_H                              

                                    if AS_L is INactive or DramSelect68k_H is INactive { 

                                                Next state = IDLE;                                                                                       -- go to Idle state ending the Dram access

                                    }           

                                    else    {

                                                Next state = EndBurstFill                                                                          -- else stay in this state

                                    }

                        }

-------------------------------------------------------------------------------------------------------------------------------------------------------

-- Write Data to Dram State (no Burst)

--------------------------------------------------------------------------------------------------------------------------------------------------------

                        Otherwise if we are in the WriteDataToDram state {                                                          -- if we are writing data to Dram

                                    Set AddressBusOutToDramController  to AddressBusInFrom68k;                        -- override lower 3 bits

                                    

                                    -- Data Bus out to Dram is already set to 68k's data bus out by default

                                    -- By Default: AS_L, WE_L to Dram are already set to 68k's equivalent by default

                                    

                                    Keep Activating  DramSelectFromCache_L  signal                                                    -- keep kicking the Dram controller to perform the write

                                    Set  DtackTo68k_L  =  DtackFromDram_L;                                                                    -- give the 68k the dtack from the Dram controller

                                    

                                    -- now wait for the 68k to terminate the write by removing either AS_L or DRamSelect68k_H

                                    if AS_L is INactive or DramSelect68k_H is INactive { 

                                                Next state = IDLE;                                                                                       -- go to Idle state ending the Dram access

                                    }

                                    else    {

                                                Next state = WriteDataToDram                                                               -- else stay in this state until the 68k finishes the write                 

                                    }

                        }
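
To show how the pseudo-HDL above might map onto Verilog, here is a sketch of two of the states (WaitForEndOfCacheRead and BurstFill) written as a combinational next-state block. It is not the course template. The module name, the state encodings, the port widths, the default values and the UDS/LDS output names are all assumptions made for illustration; the remaining signal names come from the pseudo-HDL.

```verilog
// Sketch only: two states of the direct mapped cache controller.
// State encodings, widths, defaults and UDS/LDS port names are assumptions.
module CacheStateSketch (
    input  wire [3:0]  CurrentState,
    input  wire        AS_L,
    input  wire [31:0] AddressBusInFrom68k,
    input  wire [3:0]  BurstCounter,           // counts 0..8 during a burst
    output reg  [3:0]  NextState,
    output reg  [2:0]  WordAddress,
    output reg         DtackTo68k_L,
    output reg         DramSelectFromCache_L,
    output reg         DataCache_WE_L,
    output reg         UDS_DramController_L,   // assumed name
    output reg         LDS_DramController_L    // assumed name
);
    localparam Idle                  = 4'd0;   // assumed encodings
    localparam WaitForEndOfCacheRead = 4'd2;
    localparam BurstFill             = 4'd6;
    localparam EndBurstFill          = 4'd7;

    always @(*) begin
        // Defaults first, so every output is driven on every path
        // and no latches are inferred (default values are assumptions).
        NextState             = Idle;
        WordAddress           = 3'b000;
        DtackTo68k_L          = 1'b1;
        DramSelectFromCache_L = 1'b1;
        DataCache_WE_L        = 1'b1;
        UDS_DramController_L  = 1'b1;
        LDS_DramController_L  = 1'b1;

        case (CurrentState)
            WaitForEndOfCacheRead: begin
                UDS_DramController_L = 1'b0;               // grab both bytes
                LDS_DramController_L = 1'b0;
                WordAddress  = AddressBusInFrom68k[3:1];   // word within the line
                DtackTo68k_L = 1'b0;                       // give dtack to the 68k
                if (AS_L == 1'b0)
                    NextState = WaitForEndOfCacheRead;     // stay until AS_L is removed
            end

            BurstFill: begin
                UDS_DramController_L  = 1'b0;
                LDS_DramController_L  = 1'b0;
                DramSelectFromCache_L = 1'b0;              // keep reading from Dram
                if (BurstCounter == 4'd8)
                    NextState = EndBurstFill;              // 8 words read, stop
                else begin
                    WordAddress    = BurstCounter[2:0];
                    DataCache_WE_L = 1'b0;                 // store next word in the cache
                    NextState      = BurstFill;
                end
            end

            default: ;   // remaining states omitted from this sketch
        endcase
    end
endmodule
```

Assigning every output a default at the top of the block mirrors the "By Default" comments in the pseudo-HDL: each state then only has to mention the signals it overrides.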


 



3.       Translating the pseudo-HDL for the cache controller into real Verilog and demonstrating the benchmarks, via a video, on a system where the CPU runs at 45 MHz with a 32 line cache                    25%

                                                                                                                                                                

Note the grades add up to 40%. The other 60% comes from Lab4 Part B later


PART B - Building an 8 Line x 8 word - 512 Byte 4-Way Set Associative cache

Note Part B has a later submission date than Part A, so refer to the Canvas Calendar for dates/times.

Step 1: Make a backup of your direct mapped cache from Part A so that we can compare the speed of that vs. the one we are about to build. The architecture for a 4-way associative cache is shown below (note this one shows 32 lines but ours will only have 8 lines) and the operation of the LRU bits is described in lecture 18. 

 

 

First let’s make the memory for the Valid, Tag and Data. Bring up the IP Catalog in Quartus from the Tools menu. 

Let’s create a 4-way set associative cache (Note the cache above is a 2Kbyte device with 32 lines and a 23 bit Tag address – Ours will initially implement 8 lines and a 25 bit tag address, i.e. 0.5K bytes of cache).
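
To make the numbers above concrete, here is a small Verilog sketch of how a 68k address could be split for the 8 line cache. The 25 bit tag on bits [31:7] and the word address on bits [3:1] both appear in the controller pseudo-HDL later in this handout. The 3 bit line index on bits [6:4] is an inference from "8 lines" (the 32 line direct mapped cache used bits [8:4]), and the module and port names are hypothetical.

```verilog
// Sketch only: address fields for the 8 line, 4-way set associative cache.
// Module name, port names and the [6:4] index range are assumptions.
module AssociativeAddressFields (
    input  wire [31:0] AddressBusInFrom68k,
    output wire [24:0] Tag,            // compared against the stored tag
    output wire [2:0]  Index,          // selects 1 of 8 lines
    output wire [2:0]  WordAddress     // selects 1 of 8 words in the line
);
    assign Tag         = AddressBusInFrom68k[31:7];
    assign Index       = AddressBusInFrom68k[6:4];
    assign WordAddress = AddressBusInFrom68k[3:1];
    // Bit 0 is not used here: byte selection is handled by UDS/LDS.
endmodule
```

The field widths add up as expected: 25 + 3 + 3 + 1 = 32 address bits.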

 

 

Creating the Valid Bit Memory.

Make a new single port RAM memory using the IP Catalog (as we did before for the direct mapped cache) and call it “Valid_Bit_Associative”. This will hold the “Valid” bit for each of the 8 lines in the cache, that is, a RAM with 8 x 1 bit. Note that 8 lines x 4 sets/blocks x 16 bytes per line equals 512 bytes.

Create a new schematic, place the memory and other wires/gates as shown below and save the schematic with the name “ValidBit_Associative.bdf”. Now create a symbol for this circuit.

 

Creating the Tag Memory.

Make a new single port RAM memory device called “Tag_Data_Associative” to hold the 25 bit “Tag” address for a line in the cache. Make it 8 x 25 bits, one entry for each of the 8 lines in the cache.

 

Create a new schematic, place the memory and other wires as shown above and save the schematic with the name “TagMemory_Associative.bdf”. Now create a symbol for the above circuit.

 

Creating the Cache Data Memory.

Make a new single port RAM memory device called “CachedData_Associative” to hold the data in a line of the cache. Make it 64 locations x 8 bits for an 8 line cache. Each copy holds 8 lines of 8 bytes; two copies in parallel give 8 lines of 8 words.

Create a new schematic, place two copies of the memory and other wires/gates as shown below and save the schematic with the name “CacheData_Associative.bdf”. Now create a symbol for this circuit.

 

 

Creating a Cache Set (Block).

Create a new schematic, place the above items onto it and wire them up as shown below. You will need to write the code for the Comparator and make a symbol for it. The VHDL is given below.

Save the schematic with the name “AssociativeCache_Set.bdf”. Now create a symbol for this circuit.

 

 

VHDL code for the comparator (save this as AddressComparator_Associative.vhd and create a symbol for it). You can rewrite it in Verilog if you wish, as it is quite easy.

 

LIBRARY ieee;

USE ieee.std_logic_1164.all;

use ieee.std_logic_arith.all; 

use ieee.std_logic_unsigned.all; 

 

entity AddressComparator_Associative is

   port (

       AddressBus   : in std_logic_vector(24 downto 0) ;

       TagData      : in std_logic_vector(24 downto 0) ;

       

       Hit_H         : out std_logic

   );

end ;

 

 

architecture bhvr of AddressComparator_Associative is

begin

   process(AddressBus, TagData)

   begin

       if(AddressBus = TagData) then

          Hit_H <= '1';

       else

          Hit_H <= '0' ;

       end if ;

   end process ;

end;
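
If you would rather use Verilog, the comparator above could be written along the following lines. This is a sketch that keeps the same module and port names as the VHDL so the symbol stays interchangeable; it is not an official course file.

```verilog
// Sketch only: Verilog equivalent of the VHDL address comparator above.
module AddressComparator_Associative (
    input  wire [24:0] AddressBus,   // 25 bit tag portion of the 68k address
    input  wire [24:0] TagData,      // 25 bit tag read from the tag memory
    output wire        Hit_H         // 1 when the two tags match
);
    assign Hit_H = (AddressBus == TagData);
endmodule
```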

 

Creating the Memory for the LRU bits

Our cache will use the Pseudo Least Recently Used eviction algorithm when replacing data in a line/block so we need some memory to hold 3 state bits (for a 4 way association). Make a new single port RAM memory device called “PLRU_Bits” to hold the least recently used bits for a line of the cache. Make the RAM with 8 x 3 bits. 

Create a new schematic, place the memory and other wires/gates as shown below and save the schematic with the name “PseudoLRU_Bits.bdf”. Now create a symbol for this circuit.

 

  

 


Creating a 4-Way Data Mux

We need a giant mux to select the data from 1 of the 4 blocks in our cache (see the previous architecture), based on which block contains the data that the CPU wants to read. 

 

Create a new VHDL file and copy the following code to it. Save the file as CacheDataMux.vhd and create a new symbol for it. Again, you can use Verilog if you wish.

 

LIBRARY ieee;

USE ieee.std_logic_1164.all;

USE ieee.std_logic_unsigned.all;

USE ieee.std_logic_arith.all;

 

entity CacheDataMux is

   Port (

          ValidHit0_H, ValidHit1_H,ValidHit2_H, ValidHit3_H : in std_logic;

          Block0_In    : in std_logic_vector(15 downto 0);         

          Block1_In    : in std_logic_vector(15 downto 0);         

          Block2_In    : in std_logic_vector(15 downto 0);         

          Block3_In    : in std_logic_vector(15 downto 0);         

 

          DataOut       : out std_logic_vector(15 downto 0)

   );

end ;

 

architecture bhvr of CacheDataMux is

begin

   process(ValidHit0_H, ValidHit1_H, ValidHit2_H, ValidHit3_H, Block0_In, Block1_In, Block2_In, Block3_In)

   begin

       if(ValidHit0_H = '1') then

          DataOut <= Block0_In;

       elsif(ValidHit1_H = '1') then

          DataOut <= Block1_In;

       elsif(ValidHit2_H = '1') then

          DataOut <= Block2_In;

       else

          DataOut <= Block3_In;

       end if;

   end process;

end ;
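
For those using Verilog, the mux above could be sketched as follows. It keeps the same module and port names and the same priority order as the VHDL (block 0 first, block 3 as the fall-through), and is offered as an illustration rather than an official course file.

```verilog
// Sketch only: Verilog equivalent of the VHDL CacheDataMux above.
module CacheDataMux (
    input  wire        ValidHit0_H, ValidHit1_H, ValidHit2_H, ValidHit3_H,
    input  wire [15:0] Block0_In,
    input  wire [15:0] Block1_In,
    input  wire [15:0] Block2_In,
    input  wire [15:0] Block3_In,
    output reg  [15:0] DataOut
);
    always @(*) begin
        if (ValidHit0_H)
            DataOut = Block0_In;
        else if (ValidHit1_H)
            DataOut = Block1_In;
        else if (ValidHit2_H)
            DataOut = Block2_In;
        else
            DataOut = Block3_In;   // block 3 is the default, as in the VHDL
    end
endmodule
```

As in the VHDL, ValidHit3_H is a port but does not appear in the logic, so the synthesis tool may report it as unused.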

 


OK we’ve built the memory and other building blocks. Let’s go wire them together in a 4 Way – set associative cache.

 

The image below is the full cache controller. Yes – it’s too small to see, so you can find a copy of this schematic diagram on Canvas, it’s called “AssociativeDramCache.bdf”. Add it to your project. If you have used the names as directed above it should all be correct, otherwise, replace the symbols/signals with the named ones you created above.

 

  

 

Notice the 4 blocks/sets of the cache, the LRU bits and the 4-way data mux. If you click on any of these symbols, you should be able to expose the lower level designs. If for some reason the links don’t work to the lower levels, delete the symbols on the schematic above and paste your own symbol that you created earlier.

 

Important: Now make a symbol for the above BIG schematic.

 

Note the Verilog for the main cache controller state machine (the big symbol at the top) is also on Canvas. This will need to be completed, just like we did for the direct mapped cache controller, and added to the project. You will also need to create the symbol for the controller from the Verilog code.

 

 

 

Bringing the cache controller together with the Dram Controller

Just as we did for the direct mapped controller, we now need to bring our associative controller together with the dram controller.

 

 Create a new schematic and paste the symbol for the “AssociativeDramCache” onto it along with the cache enabled dram controller we used for the direct mapped cache (the same dram controller will work for both kinds of cache). It should look like this. 

 

 

 

Save this schematic with the name “AssociativeCachedDramController.bdf”. Now create a symbol for this schematic.

 


On the top level of our 68k system, replace the direct mapped cache controller with the new associative cache controller as shown below. Remember we still have to write Verilog for the cache controller state machine. Check the wiring from the symbol to the pins (not visible) on the right that go to the dram memory, just to make sure (you don’t, for example, want to mix up CAS with RAS, or any other signal name for that matter).

Verilog for the set associative cache controller state machine (add this to existing code)

 

//////////////////////////////////////////////-

// Main IDLE state: 

//////////////////////////////////////////////-

 

          Otherwise if we are in the Idle state {                   

              if AS_L is active and DramSelect68k_H is active {

              

                   // update LRU bits 

                   // first we have to read LRU bits into the controller based on the selected Line 

                   // (which is based on CPU address)

                   

                   Activate LRUBits_Load_H                                // Load LRU bits for the line

                   

                   // if the 68k's access is a read

                   if WE_L is high {                                                     

                        activate UDS and LDS to the Dram Controller to grab both bytes from Cache or Dram regardless of what 68k asks

                        NextState = CheckForCacheHit;

                   }

                   

                   else {   // must be a 68k write                                                                                           

                        

// if we are writing, and data is already in the cache (a hit), we should invalidate that block/line

// so set the ValidBitOut_H to 0 in preparation for a write to the Valid bit if cache hit occurs

                        

                        ValidBitOut_H     = 0; 

 

                        if (any of the 4 ValidHit_H[3..0] bits are 1) // (indicating a hit for the block)

                             Activate the single corresponding ValidBit_WE_L to invalidate that line

 

                        

// writes bypass the cache so start the dram controller to perform the write

                        

                        Activate DramSelectFromCache_L

                        NextState = WriteDataToDram;

                   }

              }

          }

          

////////////////////////////////////////////////////////////////////////////////////////////////////

// Check if we have a “read” Cache HIT. If so give data to 68k or if not, go generate a burst fill

// update the Least Recently Used Bits (LRUBits)

////////////////////////////////////////////////////////////////////////////////////////////////////

 

          otherwise if we are in the CheckForCacheHit state {     // we are looking for a Cache hit

              activate UDS and LDS to grab both bytes from Cache or Dram regardless of what 68k asks

     

              

// if any Block for the Set produces a valid cache hit, i.e. we found the data we are after.

// test each of the 4 blocks to see if one of them has both a cache hit and a valid bit set. 

// That will identify the block containing the data we can use and give to the cpu

 

              if any of the ValidHit_H[3..0] bits reports a valid hit {

              

 

// if we have the data in the Cache give it to the 68k and return to idle state

// remember defaults:DataBusOutTo68k = DataBusInFromCache,AddressBusOutToDram = AddressBusInFrom68k, 

// also remember the cache block DATA MUX is automatically set to the block producing the valid Hit

                   

// use bits [3..1] of the 68k address bus (the lowest 3 word-address bits) to select the correct word in the line to give to 68k

// give the 68k a Dtack and then wait for the end of the 68k read 

                   

                   WordAddress = AddressBusInFrom68k[3..1];             

                   Activate DtackTo68k_L ; 

                   NextState = WaitForEndOfCacheRead;

                                                

//

// Having now read an item from the cache, we need to update the LRU bits (in case we need to evict

// in the future). Algorithm based on https://people.cs.clemson.edu/~mark/464/p_lru.txt

//

                   

                   if LRUBits[0] and LRUBits[1] are both 0

                        set 3 bit LRUBits_Out to {LRUBits[2] concated with binary 11};

                   

                   else if LRUBits[0] is 0 and LRUBits[1] is 1 

                        set 3 bit LRUBits_Out to {LRUBits[2] concated with binary 01} ;

                   

                   else if LRUBits[0] is 1 and LRUBits[2] is 0

                        set 3 bit LRUBits_Out to {1 concated with LRUBits[1] concated with 0};

                   

                   else 

                        set 3 bit LRUBits_Out to {0 concated with LRUBits[1] concated with 0};

 

// Update/Write new LRU bits back to cache

                   

                   Activate LRU_WE_L ;    

              }

              

// if no hit, then get data from dram and update LRU bits

 

              else {        

 

                   Activate DramSelectFromCache_L;

 

// use the LRU bits to figure out which block in the line to replace

// then update the LRU bits and save the replacement number for later

// algorithm based on https://people.cs.clemson.edu/~mark/464/p_lru.txt

 

                   if LRUBits[0] and LRUBits[1] are both 0 {

                        Set 2 bit ReplaceBlockNumberData to binary 00;      // use block 0

                        Set 3 bit LRUBits_Out to {LRUBits[2] concated with binary 11} ;

                   }

                        

                   else if LRUBits[0] is 0 and LRUBits[1] is 1 {

                        Set 2 bit ReplaceBlockNumberData to binary 01;      // use block 1

                        Set 3 bit LRUBits_Out to {LRUBits[2] concated with binary 01} ;

                   }

                        

                   else if LRUBits[0] is 1 and LRUBits[2] is 0 {

                        Set 2 bit ReplaceBlockNumberData to binary 10;      // use block 2

                        Set 3 bit LRUBits_Out to {1 concated with LRUBits[1] concated with 0} ;

                   }

                        

                   else {

                        Set 2 bit ReplaceBlockNumberData to binary 11;      // use block 3

                        Set 3 bit LRUBits_Out to {0 concated with LRUBits[1] concated with 0} ;

                   }

 

// now write back the LRU bits and save the replacement block number for next state                                

                   Activate LRU_WE_L ;

                   Activate LoadReplacementBlockNumber_H ;

                                 

                   NextState = ReadDataFromDramIntoCache;

              }

          }
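
The replacement rules in the miss branch of the CheckForCacheHit state above can be sketched in Verilog as a small combinational block. The port names LRUBits, ReplaceBlockNumberData and LRUBits_Out are taken from the pseudo-HDL; the module name is hypothetical, and in the real controller this logic would sit inside the state machine rather than in a separate module.

```verilog
// Sketch only: pseudo-LRU victim selection and LRU bit update for one line,
// following the four cases of the miss branch in the pseudo-HDL above.
module PLRU_ReplaceSketch (
    input  wire [2:0] LRUBits,                 // current 3 LRU bits for the line
    output reg  [1:0] ReplaceBlockNumberData,  // block to be replaced
    output reg  [2:0] LRUBits_Out              // updated LRU bits to write back
);
    always @(*) begin
        if (LRUBits[0] == 1'b0 && LRUBits[1] == 1'b0) begin
            ReplaceBlockNumberData = 2'b00;                    // use block 0
            LRUBits_Out = {LRUBits[2], 2'b11};
        end
        else if (LRUBits[0] == 1'b0 && LRUBits[1] == 1'b1) begin
            ReplaceBlockNumberData = 2'b01;                    // use block 1
            LRUBits_Out = {LRUBits[2], 2'b01};
        end
        else if (LRUBits[0] == 1'b1 && LRUBits[2] == 1'b0) begin
            ReplaceBlockNumberData = 2'b10;                    // use block 2
            LRUBits_Out = {1'b1, LRUBits[1], 1'b0};
        end
        else begin
            ReplaceBlockNumberData = 2'b11;                    // use block 3
            LRUBits_Out = {1'b0, LRUBits[1], 1'b0};
        end
    end
endmodule
```

As a check, starting from LRUBits = 000 and feeding LRUBits_Out back in on four successive misses to the same line gives the states 000, 011, 110, 101 and back to 000, replacing blocks 0, 2, 1 and 3 in that order, so each block of the line is filled once before any is evicted.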

 

//////////////////////////////////////////////////////////////////////////////////////////////-

// Got a Cache hit, so give the 68k the Cache data now then wait for the 68k to end bus cycle 

//////////////////////////////////////////////////////////////////////////////////////////////-

 

          Otherwise if we are in the WaitForEndOfCacheRead state {         

              activate UDS and LDS  to grab both bytes from Cache or Dram regardless of what 68k asks

              

// remember defaults:DataBusOutTo68k = DataBusInFromCache,AddressBusOutToDram = AddressBusInFrom68k, 

// default NextState is Idle;

              

// keep using the lowest 3 word-address bits [3..1] of the 68k address bus to select the correct word 

// in the line to give to 68k. Keep giving the 68k a Dtack and then wait for the end of the 68k read 

          

              WordAddress       = AddressBusInFrom68k[3..1];       

              Activate DtackTo68k_L;

              

              if AS_L is still low 

                   NextState = WaitForEndOfCacheRead;     // stay here if 68k still completing access, 

                                                              // else return to default IDLE state

          }

              

////////////////////////////////////////////////////////////////////////////////////////////////

// Didn't get a cache hit during read so start operation to Read from Dram State : 

// Remember that CAS latency is 2 clocks before 1st item of burst data appears

////////////////////////////////////////////////////////////////////////////////////////////////

 

// perform a Dram WORD READ(i.e. 16 bit), even if 68k is only reading a BYTE 

// so we get both bytes as cache word is 16 bits wide

// Address bus to Dram is already set to the 68k's address bus by default

// AS_L, WE_L are already set to 68k's equivalent by default

 

          Otherwise if we are in the ReadDataFromDramIntoCache state {

              activate UDS and LDS  to grab both bytes from Cache or Dram regardless of what 68k asks

 

// Kick start the Dram controller to perform a burst read and fill a Line in the cache

// and stay in this state until a dram read command issued

              

              Activate DramSelectFromCache_L;                  // keep kicking Dram controller

              

              NextState = ReadDataFromDramIntoCache ;         

              if CAS_Dram_L is 0 and RAS_Dram_L is 1          // if "read" command (not "refresh")

                   NextState = CASDelay1 ;                      // move to next state

 

// Store the 68k's address bus in the Cache Tag to mark the fact we have the data at that address 

// and move on to next state to get Dram data

// By Default: TagDataOut set to AddressBusInFrom68k(31..7);          // tag is 25 bits

              

              set ValidBitOut_H to 1;                           // output “valid” signal

              

// identify which block we are going to store the new data in based on the LRU bits 

 

              if 2 bit ReplaceBlockNumber is binary 00 {

                   Activate TagCache_WE_L[0];         // issue write signal to Tag block 0

                   Activate ValidBit_WE_L[0];         // issue write signal to Valid block 0

              }

              

              else if 2 bit ReplaceBlockNumber is binary 01 {

                   Activate TagCache_WE_L[1];         // issue write signal to Tag block 1

                   Activate ValidBit_WE_L[1];         // issue write signal to Valid block 1

              }

              

              else if 2 bit ReplaceBlockNumber is binary 10 {

                   Activate TagCache_WE_L[2];         // issue write signal to Tag block 2

                   Activate ValidBit_WE_L[2];         // issue write signal to Valid block 2

              }

              

              else {

                   Activate TagCache_WE_L[3];         // issue write signal to Tag block 3

                   Activate ValidBit_WE_L[3];         // issue write signal to Valid block 3

              }

          }
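
The four-way if/else above amounts to decoding the 2 bit block number into one active low write enable per block. A compact Verilog sketch of that decode follows. TagCache_WE_L, ValidBit_WE_L and ReplaceBlockNumber are names from the pseudo-HDL; the module name and the WriteActive_H qualifier (standing in for "we are in the ReadDataFromDramIntoCache state") are hypothetical.

```verilog
// Sketch only: decode the replacement block number into one-of-four
// active low write enables. WriteActive_H is an assumed qualifier signal.
module BlockWriteEnableDecode (
    input  wire [1:0] ReplaceBlockNumber,
    input  wire       WriteActive_H,
    output reg  [3:0] TagCache_WE_L,
    output reg  [3:0] ValidBit_WE_L
);
    always @(*) begin
        TagCache_WE_L = 4'b1111;     // default: no block is written
        ValidBit_WE_L = 4'b1111;
        if (WriteActive_H) begin
            TagCache_WE_L[ReplaceBlockNumber] = 1'b0;   // write tag of chosen block
            ValidBit_WE_L[ReplaceBlockNumber] = 1'b0;   // write valid bit of chosen block
        end
    end
endmodule
```

Either form should synthesize to the same logic; the explicit if/else in the pseudo-HDL is easier to follow, while the indexed form avoids repeating the same two assignments four times.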

                             

//////////////////////////////////////////////////////////////////////////////////////-

// Wait for 1st CAS clock (latency)

//////////////////////////////////////////////////////////////////////////////////////-

              

          Otherwise if we are in the CASDelay1 state {        

              activate UDS and LDS  to grab both bytes from Cache or Dram regardless of what 68k asks

              

              Activate DramSelectFromCache_L;             // keep reading from Dram

              NextState = CASDelay2 ;                      // go and wait for 2nd CAS clock latency

          }

          

//////////////////////////////////////////////////////////////////////////////////////-

// Wait for 2nd CAS Clock Latency

//////////////////////////////////////////////////////////////////////////////////////-

              

          Otherwise if we are on the CASDelay2 state {        

              activate UDS and LDS  to grab both bytes from Cache or Dram regardless of what 68k asks

              

              Activate DramSelectFromCache_L;             // keep reading from Dram

 

// reset the burst counter to supply 3 bit burst address 0-7 to Cache memory           

 

              Activate BurstCounterReset_L;               

              NextState = BurstFill ;                           

          }

 

////////////////////////////////////////////////////////////////////////////////////////////-

// Start of burst fill from Dram into Cache (data should be available at Dram in this  state)

////////////////////////////////////////////////////////////////////////////////////////////-

          

          Otherwise if we are on the BurstFill state {        

              activate UDS and LDS  to grab both bytes from Cache or Dram regardless of what 68k asks

          

              Activate DramSelectFromCache_L;        // keep reading from Dram

 

// burst counter should now be 0 when we first enter this state, as reset was synchronous

              

               NextState = BurstFill ;                 // assume we are staying in this state

              if BurstCounter equals 8                // if we have read 8 words, it's time to stop

                   NextState = EndBurstFill;

              

              else {

 

// Use burst counter to supply the 3 bit address to the data Cache

 

                   WordAddress = BurstCounter[2..0];                    

                   if 2 bit ReplaceBlockNumber is binary 00

                        activate DataCache_WE_L[0];             // write data signal to block 0

                   else if 2 bit ReplaceBlockNumber is binary 01

                        activate DataCache_WE_L[1];             // write data signal to block 1

                   else if 2 bit ReplaceBlockNumber is binary 10

                        activate DataCache_WE_L[2];             // write data signal to block 2

                   else

                        activate DataCache_WE_L[3];             // write data signal to block 3

              }

          }

              

//////////////////////////////////////////////////////////////////////////////////////-

// End Burst fill state and give the CPU the data from the cache

//////////////////////////////////////////////////////////////////////////////////////-

          

          Otherwise if we are on the EndBurstFill state {     

              activate UDS and LDS  to grab both bytes from Cache or Dram regardless of what 68k asks

              

              set DramSelectFromCache_L to 1;        // deactivate Dram controller

              Activate DtackTo68k_L;                   // give dtack to 68k until end of 68k's bus cycle

 

// get the data from the Cache corresponding to the REAL CPU address we are reading from

          

              WordAddress           = AddressBusInFrom68k[3..1];  

              DataBusOutTo68k       = DataBusInFromCache; // give data to cpu

 

// now wait for the 68k to terminate the read, by removing either AS_L or DramSelect68k_H

 

              if AS_L is 1 OR DramSelect68k_H is 0 

                   NextState = Idle ;             // go to Idle state, ending the Dram access

              else

                   NextState = EndBurstFill ;    // else stay here

          }

          

//////////////////////////////////////////////-

// Write Data to Dram State (no Burst)

//////////////////////////////////////////////-

 

          Otherwise if we are in the WriteDataToDram state {       // if we are writing data to Dram

              

              AddressBusOutToDramController = AddressBusInFrom68k;

              

// Data Bus to Dram is already set to 68k's data bus out by default

// AS_L, WE_L, UDS_L and LDS_L are already set to 68k's equivalent by default

              

              Activate DramSelectFromCache_L;        // keep kicking the Dram controller to write

               DtackTo68k_L = DtackFromDram_L;        // give the 68k the Dram controller's dtack

              

// now wait for the 68k to terminate the write, by removing either AS_L or DramSelect68k_H

 

              if AS_L is 1 OR DramSelect68k_H is 0

                   NextState = Idle ;             // go to Idle state ending the Dram access

              else

                   NextState = WriteDataToDram;       // else stay here until the 68k finishes the write  

          }        

     }

End // end of HDL file
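
The read-miss path in the pseudo-HDL above (CASDelay1, CASDelay2, BurstFill, EndBurstFill) can be checked on paper with a small behavioural model before you write any Verilog. The Python below is only a sketch of that sequence, not HDL and not the required solution. The function names are hypothetical, and it assumes the burst counter advances once per clock while the controller sits in the BurstFill state, which the pseudo-HDL implies but does not state.

```python
# Behavioural model (not HDL) of the read-miss path in the pseudo-HDL.
# Signal names follow the pseudo-HDL; the function names are illustrative only.

WAYS = 4            # 4-way set associative
WORDS_PER_LINE = 8  # 8 x 16-bit words per cache line


def one_hot_active_low(block_number, ways=WAYS):
    """Return a write-enable vector with only the chosen block driven low (0)."""
    return [0 if way == block_number else 1 for way in range(ways)]


def simulate_read_miss(replace_block_number):
    """Return one (state, WordAddress, DataCache_WE_L) tuple per clock of a line fill."""
    idle_we = [1] * WAYS                       # all write enables inactive (high)
    trace = [
        ("CASDelay1", None, idle_we),          # wait for 1st CAS clock
        ("CASDelay2", None, idle_we),          # wait for 2nd CAS clock, reset counter
    ]
    burst_counter = 0                          # synchronous reset took effect
    while burst_counter < WORDS_PER_LINE:
        word_address = burst_counter & 0b111   # BurstCounter[2..0]
        trace.append(("BurstFill", word_address,
                      one_hot_active_low(replace_block_number)))
        burst_counter += 1                     # ASSUMPTION: counts once per clock
    # Counter now equals 8: one more BurstFill clock with no write, then exit.
    trace.append(("BurstFill", None, idle_we))
    trace.append(("EndBurstFill", None, idle_we))
    return trace


if __name__ == "__main__":
    for state, word_address, we_l in simulate_read_miss(replace_block_number=2):
        print(state, word_address, we_l)
```

Running it for ReplaceBlockNumber = 2 shows eight BurstFill clocks with word addresses 0 to 7 and only DataCache_WE_L[2] low, followed by one BurstFill clock in which nothing is written and then EndBurstFill.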

 

 

 

Finally: Making an 8 way set associative cache with 128 lines

This last step requires some extensive changes to our cache design, which you are asked to work out for yourselves. As a stepping stone, redesign and rewire the cache controller to increase the number of cache lines to 128, still with 8 words per line and still with 4-way associativity (that is, a cache which is 128 * 4 * 8 * 2 = 8 Kbytes in size). Having checked that this works, finally make it 8-way set associative, which makes the total cache 16 Kbytes in size.
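
The size arithmetic above, and the address field widths that follow from it, can be sanity-checked with a few lines of Python. This is a sketch only: the function name is hypothetical, and it assumes a 32-bit address (as in the pseudo-HDL, where the tag is AddressBusInFrom68k bits 31..7) and power-of-two line counts.

```python
def cache_geometry(lines, ways, words_per_line=8, bytes_per_word=2, address_bits=32):
    """Return the total size and address field widths for a set associative cache."""
    line_bytes = words_per_line * bytes_per_word
    offset_bits = (line_bytes - 1).bit_length()   # log2(line_bytes) for powers of two
    index_bits = (lines - 1).bit_length()         # log2(lines) for powers of two
    tag_bits = address_bits - index_bits - offset_bits
    return {
        "size_bytes": lines * ways * line_bytes,
        "offset_bits": offset_bits,
        "index_bits": index_bits,
        "tag_bits": tag_bits,
    }


if __name__ == "__main__":
    for lines, ways in [(8, 4), (128, 4), (128, 8)]:
        print(lines, "lines,", ways, "way:", cache_geometry(lines, ways))
```

For the existing 8-line, 4-way cache this gives 512 bytes with a 3-bit index and a 25-bit tag, matching the pseudo-HDL. With 128 lines the index grows to 7 bits and the tag shrinks to 21 bits, giving 8192 bytes at 4-way and 16384 bytes at 8-way.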

 

As with the direct mapped cache, this will require that the “Index” bus increases in width and the Tag reduces, as more of the 68k’s address lines become part of the Index and fewer remain part of the Tag. There will be several other minor changes, in the hit and mux circuit for example, but the biggest changes will be in the size and use of the LRU bits used in the cache eviction policy. Study Lecture 18, pages 9-14 for insight into how this works and look for ways to map this efficiently to Verilog with if-else statements. It will initially require that you generate a table similar to the one on slide 13 but using 7 LRU bits rather than 3.
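
One common way to get from 3 LRU bits at 4-way to 7 LRU bits at 8-way is a binary-tree pseudo-LRU, where each bit points towards the half of the ways that was used less recently. The Python sketch below models that scheme so you can generate and check your own table. It is an assumption that this matches the lecture: the bit numbering and the meaning of 0 and 1 on the slides may differ, and the function names are hypothetical.

```python
def plru_touch(bits, way, ways=8):
    """Update the tree bits after an access to `way` so they point away from it."""
    levels = (ways - 1).bit_length()           # 3 levels for 8 ways, 2 for 4 ways
    node = 0                                   # start at the root of the tree
    for level in reversed(range(levels)):
        direction = (way >> level) & 1         # 0 = left subtree, 1 = right subtree
        bits[node] = 1 - direction             # point at the other subtree
        node = 2 * node + 1 + direction        # descend towards the accessed way


def plru_victim(bits, ways=8):
    """Follow the tree bits from the root to find the way to replace."""
    levels = (ways - 1).bit_length()
    node = 0
    way = 0
    for _ in range(levels):
        direction = bits[node]                 # ASSUMPTION: 0 = go left, 1 = go right
        way = (way << 1) | direction
        node = 2 * node + 1 + direction
    return way


if __name__ == "__main__":
    lru_bits = [0] * 7                         # 7 bits for one 8-way cache line
    for _ in range(8):                         # eight consecutive misses
        victim = plru_victim(lru_bits)
        print("replace block", victim, "LRU bits before update", lru_bits)
        plru_touch(lru_bits, victim)
```

Each access changes only the three bits on the path from the root to the accessed block, which is why the Verilog reduces to a short if-else chain per block rather than a full 128-entry table. The same functions with ways=4 and a 3-bit list model the 4-way case.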

 

 

 

Grades for 4-Way Set Associative Cache (PART B): 

 

1.       Creating the memory, wiring up the circuit                                                                           10%

2.       Benchmarking the speed of the following 3 systems 

a.       Your original 25MHz with no cache

b.       Your 45MHz system with 512 byte 4-way, 8-line SA cache 

c.       Your 45MHz system with 16Kbyte 8-way, 128-line SA cache                          5%

3.       Translating the pseudo-HDL for the cache controller into real Verilog and demonstrating via the video, the benchmarking of the system where the CPU runs at 45 MHz with a 512 byte, 4-way, 8-line set associative cache                                                                    20%

4.       Translating the pseudo-HDL for the cache controller into real Verilog and demonstrating via the video, the benchmarking of the system where the CPU runs at 45 MHz with a 16Kbyte, 8-way, 128-line set associative cache                                                                        25%
