2. RowHammer [30 points]
2.1. RowHammer Properties
Determine whether each of the following statements is true or false.
(a) Cells in a DRAM with a smaller technology node are more vulnerable to RowHammer.
1. True 2. False
(b) Cells which have shorter retention times are especially vulnerable to RowHammer.
1. True 2. False
(c) The vulnerability of cells in a victim row to RowHammer depends on the data stored in the victim row.
1. True 2. False
(d) The vulnerability of cells in a victim row to RowHammer depends on the data stored in the aggressor row.
1. True 2. False
(e) RowHammer-induced errors are mostly repeatable.
1. True 2. False
2.2. RowHammer Mitigations
One research group rigorously investigated the relationship between the DRAM refresh rate and RowHammer-induced errors using real DRAM modules from three major memory manufacturers: A, B, and C. The following figure shows the characterization results of the most RowHammer-vulnerable module of each DRAM manufacturer A, B, and C (A23, B11, and C19, respectively). As shown in the figure below, we can achieve an effectively zero probability of RowHammer-induced errors when we reduce the refresh interval to 8 ms. This is because the probability of RowHammer-induced errors then becomes lower than that of a random DRAM circuit fault.
As a leading computer architect of a company, you are asked to design an efficient RowHammer mitigation technique based on the characterization results.
(a) A junior engineer suggests simply reducing the refresh interval from the default 64 ms to 8 ms. How will the bank utilization U and the DRAM energy consumption E of all refresh operations change relative to the current system?
(b) A DRAM manufacturer releases a higher-capacity version of the same DRAM module (4x the total capacity). After rigorous reverse-engineering, you determine that the manufacturer doubles both (1) the number of rows in a bank and (2) the number of banks in a module, without changing any other aspects (e.g., row size, bus rate, and timing parameters). Your company considers upgrading the current system with the higher-capacity DRAM module and asks your opinion. In the current system, the bank utilization of refresh operations is 0.05 when the refresh interval is 64 ms. Would reducing the refresh interval to 8 ms still be viable (in terms of RowHammer protection capability and performance overheads) in a module with 2x the number of rows per bank and 2x the number of banks per module? How about in a module with 4x the number of rows per bank and 4x the number of banks per module?
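As a sanity check for this part, the scaling can be sketched numerically. The assumption (not stated explicitly in the problem) is that refresh bank utilization scales linearly with the number of rows refreshed per interval and inversely with the refresh interval; the function and its name are illustrative, anchored only to the given U = 0.05 at 64 ms.

```python
# Sketch: how refresh bank utilization scales with refresh interval and row count.
# Assumption: utilization ~ (rows refreshed per interval) / (refresh interval),
# with per-row refresh cost unchanged across module generations.

U_BASE = 0.05      # given: bank utilization at a 64 ms refresh interval
T_BASE = 64.0      # ms

def refresh_utilization(interval_ms, row_scale=1.0):
    """Estimated bank utilization of refresh operations."""
    return U_BASE * (T_BASE / interval_ms) * row_scale

u_8ms    = refresh_utilization(8)               # 8x more refreshes -> 0.4
u_8ms_2x = refresh_utilization(8, row_scale=2)  # 2x rows per bank  -> 0.8
u_8ms_4x = refresh_utilization(8, row_scale=4)  # 4x rows per bank  -> 1.6

print(u_8ms, u_8ms_2x, u_8ms_4x)
```

A utilization above 1.0 means refresh alone would need more than all of the bank's time, so under this sketch the 8 ms interval is not viable for the 4x-rows module.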
(c) Due to significant overheads of reducing the refresh interval, your team decides to find other RowHammer mitigation techniques with lower overhead. The junior engineer proposes a counter-based approach as follows:
In this approach, the memory controller maintains a counter for each row R, which increments when an adjacent row is activated, and resets when Row R is activated or refreshed. If the value of a row R’s counter exceeds a threshold value, T, the memory controller activates row R and resets its respective counter.
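The proposed scheme can be sketched as a tiny simulation. The threshold value, the row numbers, and the ±1-row adjacency model below are illustrative assumptions, not part of the specification.

```python
# Sketch of the counter-based mitigation: one counter per row, incremented when
# an adjacent row is activated, reset when the row itself is activated or
# refreshed; exceeding threshold T triggers a mitigating activation.

from collections import defaultdict

T = 3                    # illustrative threshold; the real T is derived in part (d)
counters = defaultdict(int)
mitigations = []         # victim rows the controller preventively activated

def activate(row):
    counters[row] = 0                    # activation resets the row's own counter
    for neighbor in (row - 1, row + 1):  # physically adjacent rows
        counters[neighbor] += 1
        if counters[neighbor] > T:
            mitigations.append(neighbor) # controller activates the victim row
            counters[neighbor] = 0       # ...and resets its counter

for _ in range(8):       # hammer row 10 repeatedly
    activate(10)

print(mitigations)  # rows 9 and 11 each get refreshed after every T+1 hammers
```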
Here are some specifications on the current memory system:
• Interface: DDR3-1333
• CL = 7
• tRCD = 13.5 ns
• tRAS = 35 ns
• tWR = 15 ns
• tRP = 13.5 ns
• tRFC = 120 ns
• Configuration: A per-channel memory controller deals with 2 ranks, each of which has 8 banks. The number of rows per bank is 2^15. Each row in one bank is 8 KiB.
Are the given specifications enough to implement the proposed approach? If no, list the specifications additionally required.
(d) Suppose that all the necessary specifications for implementing the proposed approach are known. Calculate the maximum value of T that guarantees the same level of security against RowHammer attacks as adopting an 8 ms refresh interval.
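One common way to bound T, sketched here as an assumption rather than the official solution: an attacker is limited by the activate-to-activate time tRC = tRAS + tRP from the part (c) specifications, so a victim row protected by an 8 ms refresh interval can be disturbed by at most 8 ms / tRC aggressor activations before it is refreshed. T must not exceed that count.

```python
import math

tRAS = 35.0e-9           # s, from the given specifications
tRP  = 13.5e-9           # s
tRC  = tRAS + tRP        # minimum activate-to-activate interval: 48.5 ns

window = 8e-3            # s: refresh interval known to be safe against RowHammer

# Maximum activations an attacker can issue to one aggressor row within 8 ms.
max_activations = math.floor(window / tRC)
print(max_activations)  # 164948
```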
(e) Calculate the number of bits required for counters in each memory controller. How does it change when the number of rows per bank and the number of banks per chip are doubled?
(g) You recall PARA (Probabilistic Adjacent Row Activation), which was proposed in the first RowHammer paper. Here is how a PARA-enabled memory controller works (for more details, see Section 8.2 in the paper [1]):
Whenever a row is closed, the controller flips a biased coin with a probability p of turning up heads, where p << 1. If the coin turns up heads, the controller opens one of the row's two adjacent rows, choosing either one with equal probability (p/2 each). Due to its probabilistic nature, PARA does not guarantee that the adjacent rows will always be refreshed in time. Hence, PARA cannot prevent disturbance errors with absolute certainty. However, its parameter p can be set so that disturbance errors occur at an extremely low probability, many orders of magnitude lower than the failure rates of other system components (e.g., more than 1% of hard-disk drives fail every year).
Suppose that the probability of experiencing an error for PARA in 64 ms is 1.9×10−22 when p = 0.001 at the worst operating conditions. How can we estimate the probability of experiencing an error in one year based on that value? Show your work. (If you just put an answer, you will get no points for this problem.)
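One way to reason, shown as a sketch: treat each 64 ms window as an independent trial with the given failure probability, count the windows in a year, and apply the complement rule; because the per-window probability is tiny, 1 - (1 - p)^N is numerically indistinguishable from N x p. A non-leap 365-day year is assumed.

```python
p64 = 1.9e-22                        # given failure probability per 64 ms window
seconds_per_year = 365 * 24 * 3600   # 31,536,000 s
windows = seconds_per_year / 64e-3   # ~4.93e8 windows of 64 ms per year

# Exact: P(>=1 error) = 1 - (1 - p64)**windows; since p64 * windows << 1,
# the union bound p64 * windows is an excellent approximation.
p_year = p64 * windows
print(p_year)  # ~9.36e-14
```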
3. Processing in Memory: Ambit [200 points]
3.1. In-DRAM Bitmap Indices I
Recall that in class we discussed Ambit, which is a DRAM design that can greatly accelerate bulk bitwise operations by providing the ability to perform bitwise AND/OR of two rows in a subarray.
One real-world application that can benefit from Ambit’s in-DRAM bulk bitwise operations is the database bitmap index, as we also discussed in the lecture. By using bitmap indices, we want to run the following query on a database that keeps track of user actions: “How many unique users were active every week for the past w weeks?” Every week, each user is represented by a single bit. If the user was active a given week, the corresponding bit is set to 1. The total number of users is u.
We assume the bits corresponding to one week are all in the same row. If u is greater than the total number of bits in one row (the row size is 8 kilobytes), more rows in different subarrays are used for the same week. We assume that all weeks corresponding to the users in one subarray fit in that subarray. We would like to compare two possible implementations of the database query:
• CPU-based implementation: This implementation reads the bits of all u users for the w weeks. For each user, it ANDs the bits corresponding to the past w weeks. Then, it performs a bit-count operation to compute the final result.
Since this operation is very memory-bound, we simplify the estimation of the execution time as the time needed to read all bits for the u users in the last w weeks. The memory bandwidth that the CPU can exploit is X bytes/s.
• Ambit-based implementation: This implementation takes advantage of the bulk AND operations of Ambit. In each subarray, we reserve one Accumulation row and one Operand row (besides the control rows that are needed for the regular operation of Ambit). Initially, all bits in the Accumulation row are set to 1. Any row can be moved to the Operand row by using RowClone (recall that RowClone is a mechanism that enables very fast copying of a row to another row in the same subarray). trc and tand are the latencies (in seconds) of RowClone’s copy and Ambit’s AND, respectively.
Since Ambit does not support bit-count operations inside DRAM, the final bit-count is still executed on the CPU. We consider that the execution time of the bit-count operation is negligible compared to the time needed to read all bits from the Accumulation rows by the CPU.
(a) What is the total number of DRAM rows that are occupied by u users and w weeks?
(b) What is the throughput in users/second of the Ambit-based implementation?
(d) What is the maximum w for the CPU implementation to be faster than the Ambit-based implementation? Assume u is a multiple of the row size.
3.2. In-DRAM Bitmap Indices II
You have been hired to accelerate ETH’s student database. After profiling the system for a while, you found out that one of the most executed queries is to "select the hometown of the students that are from Switzerland and speak German". The attributes hometown, country, and language are encoded using a four-byte binary representation. The database has 32768 (2^15) entries, and each attribute is stored contiguously in memory. The database management system executes the following query:
    bool position_hometown[entries];
    for(int i = 0; i < entries; i++){
        if(students.country[i] == "Switzerland" && students.language[i] == "German"){
            position_hometown[i] = true;
        }
        else{
            position_hometown[i] = false;
        }
    }
(a) You are running the above code on a single-core processor. Assume that:
• Your processor has an 8 MiB direct-mapped cache, with a cache line of 64 bytes.
• A hit in this cache takes one cycle and a miss takes 100 cycles for both load and store operations.
• All load/store operations are serialized, i.e., the latency of multiple memory requests cannot be overlapped.
• The starting addresses of students.country, students.language, and position_hometown are 0x05000000, 0x06000000, 0x07000000, respectively.
• The execution time of a non-memory instruction is zero (i.e., we ignore its execution time).
How many cycles are required to run the query? Show your work.
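A quick way to sanity-check the count is to simulate the direct-mapped cache. Two simplifying assumptions of this sketch: both attribute loads execute every iteration (no && short-circuiting), and a bool occupies 1 byte. Note that the three base addresses are 16 MiB apart, a multiple of the 8 MiB cache size, so country[i] and language[i] map to the same cache set and evict each other on every iteration, while the position_hometown lines mostly survive.

```python
# Direct-mapped cache simulation for the query loop above.
CACHE_SIZE = 8 * 2**20        # 8 MiB
LINE = 64                     # bytes per cache line
NUM_LINES = CACHE_SIZE // LINE
HIT, MISS = 1, 100            # cycles

tags = {}                     # set index -> tag currently cached
cycles = misses = 0

def access(addr):
    global cycles, misses
    line = addr // LINE
    idx, tag = line % NUM_LINES, line // NUM_LINES
    if tags.get(idx) == tag:
        cycles += HIT
    else:
        tags[idx] = tag       # fill on miss (loads and stores alike)
        cycles += MISS
        misses += 1

for i in range(2**15):
    access(0x05000000 + 4 * i)   # load students.country[i]  (4 bytes)
    access(0x06000000 + 4 * i)   # load students.language[i] (4 bytes)
    access(0x07000000 + i)       # store position_hometown[i] (1-byte bool)

print(misses, cycles)
```

Under these assumptions every country and language access misses (they conflict in the same sets), while position_hometown misses only during the first 16 iterations (where it also conflicts) plus once per new 64-entry line afterwards.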
(b) Recall that in class we discussed Ambit, which is a DRAM design that can greatly accelerate bulk bitwise operations by providing the ability to perform bitwise AND/OR/XOR of two rows in a subarray. Ambit works by issuing back-to-back ACTIVATE (A) and PRECHARGE (P) operations. For example, to compute AND, OR, and XOR operations, Ambit issues the sequence of commands described in the table below (e.g., AAP(X,Y ) represents double row activation of rows X and Y followed by a precharge operation, AAAP(X,Y,Z) represents triple row activation of rows X, Y , and Z followed by a precharge operation). In those instructions, Ambit copies the source rows Di and Dj to auxiliary rows (Bi). Control rows Ci dictate which operation (AND/OR) Ambit executes. The DRAM rows with dual-contact cells (i.e., rows DCCi) are used to perform the bitwise NOT operation on the data stored in the row. Basically, copying a source row to DCCi flips all bits in the source row and stores the result in both the source row and DCCi. Assume that:
• The DRAM row size is 8 Kbytes.
• An ACTIVATE command takes 50 cycles to execute.
• A PRECHARGE command takes 20 cycles to execute.
• DRAM has a single memory bank.
• The syntax of an Ambit operation is: bbop_[and/or/xor] destination, source_1, source_2.
• Addresses 0x08000000 and 0x09000000 are used to store partial results.
• The rows at addresses 0x0A000000 and 0x0B000000 store the codes for "Switzerland" and "German", respectively, in each four bytes throughout the entire row.
Dk = Di AND Dj:
    AAP (Di, B0)
    AAP (Dj, B1)
    AAP (C0, B2)
    AAAP (B0, B1, B2)
    AAP (B0, Dk)

Dk = Di OR Dj:
    AAP (Di, B0)
    AAP (Dj, B1)
    AAP (C1, B2)
    AAAP (B0, B1, B2)
    AAP (B0, Dk)

Dk = Di XOR Dj:
    AAP (Di, B0)
    AAP (Dj, B1)
    AAP (Di, DCC0)
    AAP (Dj, DCC1)
    AAP (C0, B2)
    AAAP (B0, DCC1, B2)
    AAP (C0, B2)
    AAAP (B1, DCC0, B2)
    AAP (C1, B2)
    AAAP (B0, B1, B2)
    AAP (B0, Dk)
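With ACTIVATE = 50 cycles and PRECHARGE = 20 cycles, each AAP costs 2·50 + 20 = 120 cycles and each AAAP costs 3·50 + 20 = 170 cycles, so the per-operation cost of the command sequences above can be tallied as a sketch:

```python
A, P = 50, 20          # cycles per ACTIVATE / PRECHARGE
AAP  = 2 * A + P       # 120 cycles
AAAP = 3 * A + P       # 170 cycles

# Command counts taken from the AND/OR/XOR sequences listed above.
cost = {
    "and": 4 * AAP + 1 * AAAP,   # 650 cycles
    "or":  4 * AAP + 1 * AAAP,   # 650 cycles
    "xor": 8 * AAP + 3 * AAAP,   # 1470 cycles
}
print(cost)
```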
i) The following code aims to execute the query "select the hometown of the students that are from Switzerland and speak German" in terms of Boolean operations to make use of Ambit. Fill in the blank boxes such that the algorithm produces the correct result. Show your work.
    for(int i = 0; i < _____; i++){
        bbop_____ 0x08000000, 0x05000000 + i*8192, 0x0A000000;
        bbop_____ 0x09000000, 0x06000000 + i*8192, 0x0B000000;
        bbop_____ 0x07000000 + i*8192, 0x08000000, 0x09000000;
    }
ii) How much speedup does Ambit provide over the baseline processor when executing the same query? Show your work.
4. In-DRAM Bit Serial Computation [200 points]
Recall that in class, we discussed Ambit, which is a DRAM design that can greatly accelerate bulk bitwise operations by providing the ability to perform bitwise AND/OR of two rows in a subarray and NOT of one row. Since Ambit is logically complete, it is possible to implement any other logic gate (e.g., XOR). To be able to implement arithmetic operations, bit shifting is also necessary. There is no way of shifting bits in DRAM with a conventional layout, but it can be done with a bit-serial layout, as Figure 1 shows. With such a layout, it is possible to perform bit-serial arithmetic computations in Ambit.
Figure 1. In-DRAM bit-serial layout for array A, which contains five 4-bit elements. DRAM cells in the same bitline contain the bits of an array element: A[i]_j represents bit j of element i.
We want to evaluate the potential performance benefits of using Ambit for arithmetic computations by implementing a simple workload, the element-wise addition of two arrays. Listing 1 shows a sequential code for the addition of two input arrays A and B into output array C.
Listing 1. Sequential CPU implementation of element-wise addition of arrays A and B.
for(int i = 0; i < num_elements; i++){
C[i] = A[i] + B[i];
}
We compare two possible implementations of the element-wise addition of two arrays: a CPU-based and an Ambit-based implementation. We make two assumptions. First, we use the most favorable layout for each implementation (i.e., conventional layout for CPU, and bit-serial layout for Ambit). Second, both implementations can operate on array elements of any size (i.e., bits/element):
• CPU-based implementation: This implementation reads elements of A and B from memory, adds them, and writes the resulting elements of C into memory.
Since the computation is simple and regular, we can use a simple analytical performance model for the execution time of the CPU-based implementation: t_cpu = K * num_operations + num_bytes / M, where K represents the cost per arithmetic operation, num_bytes is the total number of bytes read from and written to DRAM, and M is the DRAM bandwidth. (Note: num_operations should include only the operations for the array addition.)
• Ambit-based implementation: This implementation assumes a bit-serial layout for arrays A, B, and C. It performs additions in a bit-serial manner, which requires only XOR, AND, and OR operations, as shown by the 1-bit full adder in Figure 2.
Figure 2. 1-bit full adder.
Ambit implements these operations by issuing back-to-back ACTIVATE (A) and PRECHARGE (P) operations. For example, to compute AND, OR, and XOR operations, Ambit issues the sequence of commands described in Table 1, where AAP(X,Y ) represents double row activation of rows X and Y followed by a precharge operation, and AAAP(X,Y,Z) represents triple row activation of rows X, Y, and Z followed by a precharge operation.
In those instructions, Ambit copies the source rows Di and Dj to auxiliary rows (Bi). Control rows Ci dictate which operation (AND/OR) Ambit executes. The DRAM rows with dual-contact cells (i.e., rows DCCi) are used to perform the bitwise NOT operation on the data stored in the row. Basically, the NOT operation copies a source row to DCCi, flips all bits of the row, and stores the result in both the source row and DCCi. Assume that:
– The DRAM row size is 8 Kbytes.
– An ACTIVATE command takes 20 ns to execute.
– A PRECHARGE command takes 10 ns to execute.
– DRAM has a single memory bank.
– The syntax of an Ambit operation is: bbop_[and/or/xor] destination, source_1, source_2.
– The rows at addresses 0x00700000, 0x00800000, and 0x00900000 are used to store partial results. Initially, they contain all zeroes.
– The rows at addresses 0x00A00000, 0x00B00000, and 0x00C00000 store arrays A, B, and C, respectively.
– These are all byte addresses. All these rows belong to the same DRAM subarray.
Table 1. Sequences of ACTIVATE and PRECHARGE operations for the execution of Ambit’s AND, OR, and XOR.
Dk = Di AND Dj:
    AAP (Di, B0)
    AAP (Dj, B1)
    AAP (C0, B2)
    AAAP (B0, B1, B2)
    AAP (B0, Dk)

Dk = Di OR Dj:
    AAP (Di, B0)
    AAP (Dj, B1)
    AAP (C1, B2)
    AAAP (B0, B1, B2)
    AAP (B0, Dk)

Dk = Di XOR Dj:
    AAP (Di, B0)
    AAP (Dj, B1)
    AAP (Di, DCC0)
    AAP (Dj, DCC1)
    AAP (C0, B2)
    AAAP (B0, DCC1, B2)
    AAP (C0, B2)
    AAAP (B1, DCC0, B2)
    AAP (C1, B2)
    AAAP (B0, B1, B2)
    AAP (B0, Dk)
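With ACTIVATE = 20 ns and PRECHARGE = 10 ns, the sequences in Table 1 give AND/OR and XOR latencies that can be combined into a per-bit full-adder cost. The gate decomposition used here (2 XOR, 2 AND, 1 OR per bit, with carry = A·B + Cin·(A xor B)) is one possible mapping of Figure 2, stated as an assumption of this sketch:

```python
A_NS, P_NS = 20, 10            # ns per ACTIVATE / PRECHARGE
AAP  = 2 * A_NS + P_NS         # 50 ns
AAAP = 3 * A_NS + P_NS         # 70 ns

T_AND = T_OR = 4 * AAP + AAAP  # 270 ns (4 AAPs + 1 AAAP, from Table 1)
T_XOR = 8 * AAP + 3 * AAAP     # 610 ns (8 AAPs + 3 AAAPs)

# One full-adder bit mapped to 2 XOR + 2 AND + 1 OR operations.
T_BIT = 2 * T_XOR + 2 * T_AND + T_OR   # 2030 ns per bit position

# In the bit-serial layout, one 8 KB row holds one bit of 65536 elements,
# so every Ambit operation processes 65536 elements at once.
ROW_BITS = 8 * 1024 * 8

def throughput_ops(bits_per_element):
    """Additions per second for the bit-serial Ambit sketch."""
    return ROW_BITS / (bits_per_element * T_BIT * 1e-9)

print(T_XOR, T_BIT, throughput_ops(4))
```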
(a) For the CPU-based implementation, you want to obtain K and M. To this end, you run two experiments. In the first experiment, you run your CPU code for the element-wise array addition for 65,536 4-bit elements and measure tcpu = 100 us. In the second experiment, you run the STREAM-Copy benchmark for 102,400 4-bit elements and measure tcpu = 10 us. The STREAM-Copy benchmark simply copies the contents of one input array A to an output array B. What are the values of K and M?
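Assuming the model t_cpu = K·num_operations + num_bytes/M and that 4-bit elements are packed two per byte, the two experiments yield two equations that can be solved as follows; the variable names are illustrative:

```python
# Experiment 2 (STREAM-Copy): no arithmetic, so t = bytes_moved / M.
copy_bytes = 102400 * 0.5 * 2   # read A + write B, 0.5 bytes per 4-bit element
M = copy_bytes / 10e-6          # 1.024e10 bytes/s

# Experiment 1 (element-wise addition): t = K * num_ops + bytes_moved / M.
add_bytes = 65536 * 0.5 * 3     # read A, read B, write C
t_mem = add_bytes / M           # 9.6e-6 s of memory traffic
K = (100e-6 - t_mem) / 65536    # ~1.38e-9 s per addition

print(M, K)
```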
(b) Write the code for the Ambit-based implementation of the element-wise addition of arrays A and B. The resulting array is C.
(c) Compute the maximum throughput (in arithmetic operations per second, OPS) of the Ambit-based implementation as a function of the element size (i.e., bits/element).
(d) Determine the element sizes (in bits) for which the CPU-based implementation is faster than the Ambit-based implementation. (Note: Use the same array size as in the previous part.)
5. Caching vs. Processing-in-Memory [200 points]
We are given the following piece of code that makes accesses to integer arrays A and B. The size of each element in both A and B is 4 bytes. The base address of array A is 0x00001000, and the base address of B is 0x00008000.
    movi R1, #0x1000        // Store the base address of A in R1
    movi R2, #0x8000        // Store the base address of B in R2
    movi R3, #0
Outer_Loop:
    movi R4, #0
    movi R7, #0
Inner_Loop:
    add R5, R3, R4          // R5 = R3 + R4
    ld  R5, [R1, R5]        // R5 = Memory[R1 + R5], load 4 bytes
    ld  R6, [R2, R4]        // R6 = Memory[R2 + R4]
    mul R5, R5, R6          // R5 = R5 * R6
    add R7, R7, R5          // R7 += R5
    inc R4                  // R4++
    bne R4, #2, Inner_Loop  // If R4 != 2, jump to Inner_Loop
    st  [R1, R3], R7        // Memory[R1 + R3] = R7
    inc R3                  // R3++
    bne R3, #16, Outer_Loop // If R3 != 16, jump to Outer_Loop
You are running the above code on a single-core processor. For now, assume that the processor does not have caches. Therefore, all load/store instructions access the main memory, which has a fixed 50-cycle latency, for both read and write operations. Assume that all load/store operations are serialized, i.e., the latency of multiple memory requests cannot be overlapped. Also assume that the execution time of a non-memory-access instruction is zero (i.e., we ignore its execution time).
(a) What is the execution time of the above piece of code in cycles? Show your work.
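A sketch of the counting for this part: the inner loop issues two loads per iteration and runs twice per outer iteration, and the outer loop issues one store and runs 16 times; only these memory accesses cost cycles.

```python
MEM_LATENCY = 50             # cycles per serialized memory access
OUTER, INNER = 16, 2         # loop trip counts from the bne conditions

loads  = OUTER * INNER * 2   # ld R5 and ld R6 in every inner iteration
stores = OUTER               # st once per outer iteration
cycles = (loads + stores) * MEM_LATENCY

print(loads, stores, cycles)  # 64 loads, 16 stores -> 80 * 50 = 4000 cycles
```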
(b) Assume that a 128-byte private cache is added to the processor core in the next-generation processor. The cache block size is 8 bytes. The cache is direct-mapped. On a hit, the cache services both read and write requests in 5 cycles. On a miss, the main memory is accessed and the access fills an 8-byte cache line in 50 cycles. Assuming that the cache is initially empty, what is the new execution time on this processor with the described cache? Show your work.
(c) You are not satisfied with the performance after implementing the described cache. To do better, you consider utilizing a processing unit that is available close to the main memory. This processing unit can directly interface to the main memory with a 10-cycle latency, for both read and write operations. How many cycles does it take to execute the same program using the in-memory processing units? Show your work. (Assume that the in-memory processing unit does not have a cache, and the memory accesses are serialized like in the processor core. The latency of the non-memory-access operations is ignored.)
(d) Your friend now suggests that, by changing the cache capacity of the single-core processor (in part (b)), she could provide performance as good as the system that utilizes the memory processing unit (in part (c)). Is she correct? What is the minimum cache capacity required for the single-core processor to match the performance of the program running on the memory processing unit? Show your work.
(e) What other changes could be made to the cache design to improve the performance of the single-core processor on this program?
[1] Yoongu Kim, Ross Daly, Jeremie Kim, Chris Fallin, Ji Hye Lee, Donghyuk Lee, Chris Wilkerson, Konrad Lai, and Onur Mutlu, "Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors," in Proceedings of the 41st Annual International Symposium on Computer Architecture (ISCA), 2014. https://people.inf.ethz.ch/omutlu/pub/dram-row-hammer_isca14.pdf