18-645: How to Write Fast Code, spring 2008

Assignment 2

Due date: Thu Feb. 7th, 6:00pm by email.
Send your completed assignment to: schellap+18645-assign2 at andrew dot cmu dot edu

Questions: Make use of TA office hours, or email the TAs or Prof. Pueschel

Instructions (read carefully)

  • For implementation exercises, you must submit the source code you created and used for the experiments.
  • Submit all other requested results (plots, tables, other results, text) in one file (.ps or .pdf preferred; MS Word format accepted).
  • For plots/benchmarks, be concise, but provide necessary information and always briefly discuss the plot and draw conclusions.
  • When compiling, ensure that you use optimization flags.

Exercises

  1. (Microarchitectural parameters, 15 pts) Determine and create a table for the following microarchitectural parameters of:
    1. The computer for which /proc/cpuinfo is listed here.
    2. Your computer (or a particular computer in the ECE cluster).
    Parameters to be determined:
    1. CPU-core manufacturer, model name
    2. Number of CPU cores on chip
    3. CPU-core frequency
    4. L1-D, L1-I cache: Size, line size, associativity
    5. L2 cache: Size, line size, associativity
    6. Maximum theoretical throughput of floating point additions (in flop/s - floating point operations per second).
    7. Maximum theoretical throughput of floating point multiplications (in flop/s)
    8. Calculated maximum theoretical peak performance (sum of the above two).
    Tips:
    On Unix/Linux systems, typing 'cat /proc/cpuinfo' in a shell will give you adequate information about your CPU for you to be able to find an appropriate manual for it on the manufacturer's website (typically AMD or Intel). The manufacturer's website will contain information about the on-chip caches (typically, L1 and L2).

    The following software tool measures microarchitectural parameters by running experiments. Using it is optional, but you might find it useful:
    Calibrator http://homepages.cwi.nl/~manegold/Calibrator/calibrator.shtml

    Exercise 1 in section 3.4 (page 31) of the Tutorial might help you in answering the questions about theoretical peak performance.
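
    Once the parameters are collected, the peak-performance entries are simple arithmetic. The sketch below scripts that calculation; all the numeric values are hypothetical placeholders, not answers for any particular machine:

    ```c
    #include <stdio.h>

    /* Hypothetical example values -- substitute the numbers you
       determined for your own machine. */
    #define NUM_CORES      2      /* CPU cores on chip                 */
    #define FREQ_GHZ       2.4    /* core frequency in GHz             */
    #define ADDS_PER_CYCLE 1      /* FP additions per core per cycle   */
    #define MULS_PER_CYCLE 1      /* FP multiplications per core/cycle */

    int main(void) {
        double add_gflops = NUM_CORES * FREQ_GHZ * ADDS_PER_CYCLE;
        double mul_gflops = NUM_CORES * FREQ_GHZ * MULS_PER_CYCLE;
        printf("add peak:   %.1f Gflop/s\n", add_gflops);
        printf("mul peak:   %.1f Gflop/s\n", mul_gflops);
        printf("total peak: %.1f Gflop/s\n", add_gflops + mul_gflops);
        return 0;
    }
    ```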

  2. (MMM benchmarking, 20 pts) The standard matrix multiplication kernel performs the following operation: C = AB + C, where A, B, C are matrices of compatible sizes. We provide a C source file and a C header file that times this kernel using different methods under Windows and Linux (for x86 compatibles).
    1. Inspect and understand the code.
    2. Determine the exact number of additions and multiplications performed by the compute() function in mmm.c of the code.
    3. Using your computer (or one in the cluster, if possible not too loaded), compile and run the code. Ensure you get consistent timings from at least two different timers. You need to set up the timer for your machine by setting the number of runs (NUM_RUNS) and the machine frequency (FREQUENCY).
    4. Then, for all square matrices of sizes n between 50 and 400, in increments of 50, create a plot for each of the following quantities (one plot per quantity, with n on the x-axis):
      • runtime.
      • performance (in mflop/s).
      • Using the data from exercise 1, percentage of the peak performance reached.
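
    The provided mmm.c is not reproduced here; the sketch below shows, assuming a standard row-major triple-loop kernel, how the runtime and Mflop/s numbers for the plots can be obtained. clock() stands in for the two timers you are asked to use:

    ```c
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Sketch of a standard triple-loop MMM kernel, C = A*B + C.
       The actual compute() in the provided mmm.c may differ. */
    static void compute(double *A, double *B, double *C, int n) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    C[i*n + j] += A[i*n + k] * B[k*n + j];
    }

    int main(void) {
        for (int n = 50; n <= 400; n += 50) {
            double *A = calloc((size_t)n * n, sizeof *A);
            double *B = calloc((size_t)n * n, sizeof *B);
            double *C = calloc((size_t)n * n, sizeof *C);

            clock_t t0 = clock();
            compute(A, B, C, n);
            double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

            double flops = 2.0 * n * n * n;  /* n^3 adds + n^3 mults */
            printf("n=%3d  runtime=%.4f s  perf=%.1f Mflop/s\n",
                   n, secs, secs > 0 ? flops / secs / 1e6 : 0.0);

            free(A); free(B); free(C);
        }
        return 0;
    }
    ```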
  3. (Microbenchmarks: Mathematical functions, 25pts) Determine the runtime (in cycles) of the following computations (x, y are doubles) as accurately as possible (you should reuse the timing setup from question 2):
    1. y = x
    2. y = 7.12x
    3. y = x + 7.12
    4. y = sin(x), x in {0.0, 0.2, 4.1, 170.32}
    5. y = log(x), x in {0.001, 1.00001, 10.65, 2762.32}
    6. y = exp(x), x in {-1.234e-17, 0.101, 3.72, 1.234e25}

    There are a total of 15 runtimes. Create a table. Explain the results.

    The benchmark setup should be as follows:

    1. Allocate two double vectors x[N] and y[N] and initialize every x[i] to one of the values from above.
    2. Use:
      for(i=0; i<N; i++)
        y[i] = f(x[i]);

      to compute y[i] = f(x[i]), with f() being one of the functions above, and time this for-loop.
    3. Choose N such that all data easily fits into L1 cache but there are adequate iterations to get a reasonable timing measurement.
    4. Use the x86 time stamp counter provided to you in class.
    To accurately measure these very short computations, use the following guidelines:
    1. Only time the actual work; leave everything else (initializations, timing-related computations, etc.) outside the timing loop.
    2. Use the C preprocessor to produce a parameterized implementation to easily check different parameters.
    3. You may have to run your for(N) loop multiple times to obtain reasonable timing accuracy.
    4. You may have to take the minimum across multiple such measurements to obtain stable results.
    5. You must perform a warm up for the experiment: variables where you store the timing results, the timed routine and the data vectors should all be loaded into the L1 cache, since cache misses might result in inaccurate timing results.
    6. Alignment of your data vectors on cache line sizes or page sizes can influence the run.
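
    Putting these guidelines together, a microbenchmark harness might look like the sketch below. The rdtsc wrapper is a common GCC/Clang inline-assembly version, not necessarily the timer handed out in class, and N, REPS, and TRIALS are illustrative values:

    ```c
    #include <stdio.h>
    #include <stdint.h>

    #define N      256       /* 2 * 256 * 8 B = 4 KB, fits easily in L1 */
    #define REPS   1000      /* repeat the loop for timing accuracy     */
    #define TRIALS 10        /* take the minimum over several trials    */
    #define F(x)   ((x) + 7.12)  /* swap in sin, log, exp, etc.         */

    /* Read the x86 time stamp counter (GCC/Clang extended asm). */
    static inline uint64_t rdtsc(void) {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    int main(void) {
        static double x[N], y[N];
        for (int i = 0; i < N; i++) x[i] = 0.2;  /* one input value */

        /* warm-up: bring code and data into the L1 cache */
        for (int i = 0; i < N; i++) y[i] = F(x[i]);

        uint64_t best = UINT64_MAX;
        for (int t = 0; t < TRIALS; t++) {
            uint64_t t0 = rdtsc();
            for (int r = 0; r < REPS; r++)
                for (int i = 0; i < N; i++)
                    y[i] = F(x[i]);
            uint64_t cycles = rdtsc() - t0;
            if (cycles < best) best = cycles;
        }
        /* printing y[0] also keeps the compiler from discarding y;
           an aggressive optimizer may still hoist the REPS loop, so a
           real harness should defeat that (e.g., with volatile). */
        printf("%.2f cycles per evaluation (y[0]=%g)\n",
               (double)best / REPS / N, y[0]);
        return 0;
    }
    ```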
  4. (Performance and runtime, 10 pts) Consider a Core 2 Duo running at 3 GHz.
    1. You learned in class that one processor core can execute up to one addition and one multiplication per cycle (scalar doubles). What is the resulting theoretical floating point peak performance?
    2. Assume an MMM implementation of cost 2n^3 flops for n x n matrices reaches 80% of peak, i.e., an efficiency of e(n) = 0.8 across all sizes n. What is the runtime as a function of n? What is the largest size n that can be handled in one millisecond?
    3. The efficiency e(n) of an implementation usually varies with n. Assume the efficiency follows the formula
      e(n) = 0.8 - exp(-n/A)
      for some constant A. Assuming this implementation reaches an efficiency of 0.5 (50% of peak) at size 32 x 32, determine A and the largest size n that can be handled in one millisecond.
  5. (Floyd-Warshall algorithm, 10 pts) A (positively) weighted, directed graph G with n vertices is represented by an n x n matrix D. The Floyd-Warshall algorithm computes the shortest paths between all pairs of vertices of the graph as follows:

    for k = 1:n
       for i = 1:n
          for j = 1:n
             D[i][j] = min( D[i][j], D[i][k] + D[k][j] )
    1. What does this code remind you of?
    2. Determine an appropriate cost measure for this algorithm.
    3. Compute the cost based on this measure.
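
    For reference, the 1-based pseudocode above translates directly into C (with 0-based indices); the 4-vertex graph below is a made-up example, with INF standing in for "no edge":

    ```c
    #include <stdio.h>

    #define V   4
    #define INF 1e9

    /* Floyd-Warshall: D[i][j] starts as the weight of edge i -> j
       (INF if absent, 0 on the diagonal) and ends as the shortest
       path distance from i to j. */
    static void floyd_warshall(double D[V][V]) {
        for (int k = 0; k < V; k++)
            for (int i = 0; i < V; i++)
                for (int j = 0; j < V; j++)
                    if (D[i][k] + D[k][j] < D[i][j])
                        D[i][j] = D[i][k] + D[k][j];
    }

    int main(void) {
        double D[V][V] = {
            {0,   3,   INF, 7},
            {8,   0,   2,   INF},
            {5,   INF, 0,   1},
            {2,   INF, INF, 0},
        };
        floyd_warshall(D);
        for (int i = 0; i < V; i++) {
            for (int j = 0; j < V; j++) printf("%4.0f", D[i][j]);
            printf("\n");
        }
        return 0;
    }
    ```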
  6. (Cache miss analysis, 15 pts + 10 extra pts) Consider a 2-way set associative cache with a cache size of 32KB, a cacheline size of 32B, and a FIFO (First In, First Out) replacement policy (this means if one of the two candidate cachelines has to be replaced, it will be the one that was first brought into the cache).

    Consider two single-precision floating point arrays (single precision float = 4B), A and B, with n elements each, where the arrays are much larger than the cache and n is a multiple of the cache size. Further, you may assume that A and B are both fully cache-aligned, i.e., A[0] and B[0] map to the first position in the first cacheline.

    Now consider the following pseudocode snippet:

    for(i from 0 to n-1)
      A[i] = A[i] + B[f(i)]


    where f(i) is an index mapping function that reads B at a stride of 8. (If for example, B was 16 elements long, then reading it at stride 8 would result in this access pattern: f(i) = 0,8,1,9,2,10,3,11,4,12,5,13,6,14,7,15).
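
    The assignment defines f only by this example; one closed form consistent with the n = 16 pattern (an assumption, not the official definition) is:

    ```c
    #include <stdio.h>

    /* One way to write the stride-8 index mapping: step through B at
       stride 8, wrapping around and advancing the start offset each
       time the end is passed. Reproduces the example pattern for n = 16. */
    static int f(int i, int n) {
        return (8 * i) % n + (8 * i) / n;
    }

    int main(void) {
        for (int i = 0; i < 16; i++)
            printf("%d%s", f(i, 16), i < 15 ? "," : "\n");
        return 0;
    }
    ```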

    Assume an empty cache for each part of this exercise.
    1. (Disregard the code snippet for this part) What is the expected number of cache misses incurred by streaming once completely through array A alone in sequential order, given the cache parameters above?
    2. (Disregard the code snippet for this part) What is the expected number of cache misses incurred by streaming once completely through array B alone at stride of 8 given the cache parameters above?
    3. How many cache misses is the given pseudocode snippet expected to incur? (Assume, for simplicity, that index variables are not cached).
    4. (10 extra points) Rewrite the code (without changing the semantics, i.e., overall computation) to reduce the number of cache misses as much as possible. (Assume, for simplicity, that index variables are not cached).
  7. (5 pts) How many hours did it take you to complete this assignment?