$25
Exercise 1: Use of the time Command
The time command which is built-in to many Unix shells is simple to use: simply pre x a normal shell command by the word time. For example:
$ time ls ~/cs220/labs/lab11/files coverage int-search matmul-cache parity
real 0m0.002s user 0m0.000s sys 0m0.000s
You should get output with a very similar format. If not, verify that you are running the bash shell. Make sure you are not running the time program (available as /usr/bin/time) which produces output in a di erent format.
As mentioned in the background section, the command reports the amount of real or elapsed time (AKA wall-clock time), amount of time spent in kernel space and amount of time spent in user space.
Try timing a command which produces more interesting output:
$ time sleep 5
Note the large di erence between the elapsed time and the other times.
Time a command which searches a large directory:
$ time wc -l ‘find /etc -type f 2/dev/null‘ 2/dev/null | tail
[The command within the backquotes find’s all les (-type f) within the /etc directory, discarding any errors (2/dev/null). Those results are counted using wc -l (the results of the backquotes find are treated as though they were typed in to the wc -l command), with errors again being discarded. The last portion of the results are displayed using the tail command.]
Notice here that the system time is non-trivial and measures the time used for starting and synchronizing the multiple sub-processes involved in the above command.
Run it again; you may get a much faster result if your system has cached the necessary les in its kernel bu ers.
Try timing a couple more non-trivial shell commands.
1.3.3 Exercise 2: Speeding Up Code using Assembly
In the previous lab we explored using the parity ag (available on most architectures) to evaluate the parity of a word. In this exercise, we will use time to evaluate how much of a performance boost this technique provides.
Change over to the parity directory and look at the les: main.c is simply a driver program which uses its command-line arguments to gure out how many parities it should evaluate over random integers; parity-c.c evaluates parity using the ones-counting technique covered earlier in the course and nally parity-s.c uses inline assembly.
Build all programs by typing make. Then run:
$ ./parity-c -d 10 followed by:
./parity-s -d 10
The -d ag is a debug ag and prints out the randomly generated integers (limited to one byte) and the computed parity. Both of them should produce the same results even though the rst program uses the counting-ones technique to evaluate parity whereas the second program evaluates parity by accessing the parity ag using inline assembly.
Now time the two programs (with the -d ags turned o ). As both programs are pretty fast, you may need to increase the number of tests to around 100 million or so, depending on the system you are running on. If your results are similar to mine, you should observe that the version which uses inline assembly is about twice as fast as the version which is written in pure C. Also observe that almost all the time is spent in user space as the program uses minimal kernel services.
1.3.4 Exercise 3: Observing Cache E ects
Change over to the matmul-cache directory. The main.c le is a simple driver program which accesses the command-line arguments to get the size of a (square) matrix and the number of tests to run. It generates two random double matrices and then calls a matrix multiply routine to multiply them together.
Look at the simple-matmul.c le: it contains the classic code for matrix multiplication. Recall that matrices in C are laid out by row; hence in the inner-most loop
for (int k = 0; k < n; k++) { c[i][j] += a[i][k]*b[k][j];
} the entries in a[][] are being accessed sequentially in memory within a row, but the entries in b[][] are not being accessed sequentially in memory (in fact, they are being accessed by column so that successive accesses will have wildly di erent memory addresses).
Recall that modern computer systems use a hierarchy of memory systems with a small amount of expensive high-speed cache memory and a larger amount of cheaper (but slower) main memory. Assuming that the program has good locality of reference the overall memory speed should be close to that of the high-speed cache at overall cost close to the cost of the main memory.
Unfortunately, the columnar access to b[][] in the inner-loop of the matrix multiply destroys locality of reference. If the matrix is su ciently large, then it is likely that successive columns of b[][] are not in the cache and hence the system will need to fetch the b[][] entries from the slow main memory resulting in poor overall performance.
However, there is a simple x which avoids the problem. Simply transpose the matrix b[][] before the multiply begins and then within the inner loop, instead of accessing b[][] column-wise access the transposed matrix row-wise. The row-wise access should take care of the cache problems.
Look at the transpose-matmul.c le and notice the changes within matrix_multiply(). However, the matrix_transpose() function has been left incomplete:
please add in the code to make its meet the given speci cation.
First make sure that your x is correct: build all the programs by typing make. Then simply run the simple matrix multiply on a small matrix using something like:
$ ./simple-matmul 4 1
The result matrix should be printed out.
Then run the transposed matrix multiply:
$ ./transpose-matmul 4 1
If your matrix_transpose() function is correct, the results should match. If not, iterate until you x that function.
Now run:
$ time ./simple-matmul 1500 1 followed by
$ time ./transpose-matmul 1500 1
Depending on your system, you may or may not see that the second solution is faster.
However, as you increase the size of the matrix, you should de nitely see the di erence in performance:
$ for s in ‘seq 1000 500 2500‘; \
do \
echo -n "*** Size $s: simple"; time ./simple-matmul $s 1 ; echo ; \
echo -n "*** Size $s: transpose"; time ./transpose-matmul $s 1 ; echo ; \ done
You should de nitely see a large di erence in performance as the matrix size increases.
1.3.5 Exercise 4: Algorithmic Changes Win
Change over to the int-search directory. The driver program in main.c creates a random integer array of size speci ed by the rst command-line argument and runs a number of searches for elements in that array speci ed by the second command-line argument.
Two implementations of the search are provided:
1. An incomplete linear search implementation is provided. It should search sequentially through the array looking for the speci ed element: if found, it should return its index, if not found (the entire array has been searched) it should return a negative value.
If there are n elements in the array, on average n/2 elements will be compared for a successful search but n elements must be compared if the search proves ultimately unsuccessful.
2. A complete binary search implementation is provided. This provides identical functionality to the linear search but uses the bsearch() library function discussed in class.
If there are n elements in the array, a maximum of dlog2ne comparisons are necessary for searching for an element.
As n increases, the di erence in performance can be dramatic.
To measure the increase, rst add in the code for the linear search in linearsearch.c. Build all the programs by typing make. Test your linear search on a small example by typing:
$ ./linear-search 100 1
If it fails, you should get an assertion failure.
Now time both linear search and binary search:
$ time ./linear-search 2000 2000
$ time ./binary-search 2000 2000
You should see that the binary search is dramatically faster.
1.3.6 Exercise 5: Ensuring Test Coverage
Change over to the coverage directory and build the coverage program by typing make. The coverage program is a silly program which reads 3 int’s a, b and c at a time from standard input and then calls a compute() function which computes some function compute(a, b, c). The code for compute() contains a lot of conditional code.
The program has been compiled with special options (see the Makefile). When compiled, it will produce a coverage.gcno binary le which records information about the control ow in the program. Each run of the program generates a binary le coverage.gcda which contains information about which lines were executed during that run.
Run the program by typing ./coverage, provide 3 integers like 200 2000 2999 and then type ^D to indicate EOF. You should see the generated le in your directory.
Now run the program gcov on the generated le: i.e. gcov coverage.gcda. It should create a le coverage.c.gcov which is the source le with each line annotated with a count of the number of times that line was executed. Look at coverage.c.gcov to see your results.
Now look at coverage.c and generate su cient test data to ensure that every line in compute() is covered by your test data. Because of the regular branching structure within compute() it should be clear that 8 sets of a, b, c values should be su cient.
Run gcov to validate that your tests do indeed exercise each line of compute().