Instructions (read carefully)
 For implementation exercises, submit the source code you created and used for the experiments.
 Submit all other requested results (plots, tables, other results, text) in a single file (.ps or .pdf preferred; MS Word format accepted).
 For plots/benchmarks, be concise but provide the necessary information; always briefly discuss each plot and draw conclusions.
 When compiling, ensure that you use compiler optimization flags (e.g., -O2 or -O3 with gcc).
Exercises
 (Microarchitectural parameters, 15 pts) Determine the following microarchitectural parameters and collect them in a table for:
 The computer for which /proc/cpuinfo is listed here.
 Your computer (or of a particular computer in the ECE cluster):
Parameters to be determined:
 CPU core manufacturer and model name
 Number of CPU cores on the chip
 CPU core frequency
 L1D, L1I cache: Size, line size, associativity
 L2 cache: Size, line size, associativity
 Maximum theoretical throughput of floating point additions (in flop/s, i.e., floating point operations per second).
 Maximum theoretical throughput of floating point multiplications (in flop/s).
 Calculated maximum theoretical peak performance (sum of the above two).
Tips:
On Unix/Linux systems, typing 'cat /proc/cpuinfo' in a shell will give you enough information about your CPU to find an appropriate manual for it on the manufacturer's website (typically AMD or Intel). The manufacturer's website will contain information about the on-chip caches (typically L1 and L2).
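As a quick start (Linux only; exact field names vary by kernel version and CPU vendor), the relevant fields can be pulled out directly:

```shell
# CPU vendor/model and number of logical processors (Linux)
grep -m1 "model name" /proc/cpuinfo
grep -c "^processor" /proc/cpuinfo
# current core frequency as reported by the kernel
# (may differ from the nominal frequency due to frequency scaling)
grep -m1 "cpu MHz" /proc/cpuinfo
# cache parameters, where the C library exposes them
getconf -a 2>/dev/null | grep -i cache | head
```

Note that /proc/cpuinfo lists logical processors, so hyperthreading can inflate the apparent core count; the vendor manual remains the authoritative source for cache organization.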
The following software tool measures microarchitectural parameters by running experiments. Using it is optional, but you might find it useful:
Calibrator http://homepages.cwi.nl/~manegold/Calibrator/calibrator.shtml
Exercise 1 in section 3.4 (page 31) of the Tutorial might help you in answering the questions about theoretical peak performance.
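For the throughput and peak-performance entries, the standard relation (assuming you know how many floating point additions and multiplications each core can issue per cycle) is:

```latex
\pi_{\text{peak}} \;=\; n_{\text{cores}} \times f \times \left(\text{adds/cycle} + \text{mults/cycle}\right) \quad [\text{flop/s}]
```

For instance, a hypothetical 2-core CPU at 3 GHz issuing one addition and one multiplication per cycle per core would reach 2 x 3e9 x 2 = 12 Gflop/s.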
 (MMM benchmarking, 20 pts) The standard matrix multiplication kernel performs the operation C = A*B + C, where A, B, C are matrices of compatible sizes. We provide a C source file and a C header file that time this kernel using different methods under Windows and Linux (on x86 compatibles).
 Inspect and understand the code.
 Determine the exact number of additions and multiplications performed by the compute() function in mmm.c.
 Using your computer (or one in the cluster, if possible not too loaded), compile and run the code. Ensure you get consistent timings from at least two different timers. You need to set up the timer for your machine by setting the number of runs (NUM_RUNS) and the machine frequency (FREQUENCY).
 Then, for all square matrices of sizes n between 50 and 400, in increments of 50, create a plot of each of the following quantities (one plot per quantity, with n on the x-axis):
 runtime.
 performance (in mflop/s).
 Using the data from exercise 1, the percentage of peak performance reached.
 (Microbenchmarks: mathematical functions, 25 pts) Determine the runtime (in cycles) of the following computations (x, y are doubles) as accurately as possible (you should reuse the timing setup from exercise 2):
 y = x
 y = 7.12x
 y = x + 7.12
 y = sin(x), x in {0.0, 0.2, 4.1, 170.32}
 y = log(x), x in {0.001, 1.00001, 10.65, 2762.32}
 y = exp(x), x in {1.234e17, 0.101, 3.72, 1.234e25}
There are a total of 15 runtimes. Create a table. Explain the results.
The benchmark setup should be as follows:
 Allocate two vectors of doubles, x[N] and y[N], and initialize all x[i] to one of the values from above.
 Use:
for(i=0; i<N; i++)
y[i] = f(x[i]);
to compute y[i] = f(x[i]), with f() being one of the functions above, and time this for-loop.
 Choose N such that all data easily fits into the L1 cache but there are enough iterations to get a reasonable timing measurement.
 Use the x86 time stamp counter provided to you in class.
To accurately measure these very short computations, use the following guidelines:
 Only time the actual work; leave everything else (initializations, timing-related computations, etc.) outside the timing loop.
 Use the C preprocessor to produce a parameterized implementation to easily check different parameters.
 You may have to run your for(N) loop multiple times to obtain reasonable timing accuracy.
 You may have to take the minimum across multiple such measurements to obtain stable results.
 You must perform a warm-up for the experiment: the variables where you store the timing results, the timed routine, and the data vectors should all be loaded into the L1 caches before timing, since cache misses might result in inaccurate timing results.
 Alignment of your data vectors to cache line or page boundaries can influence the run.
 (Performance and runtime, 10 pts) Consider a Core 2 Duo running at 3 GHz.
 You learned in class that one processor core can execute up to one addition and one multiplication per cycle (scalar doubles). What is the resulting theoretical floating point peak performance?
 Assume an MMM implementation of cost 2n^3 flops for n x n matrices reaches 80% of peak, i.e., an efficiency of e(n) = 0.8 across all sizes n. What is the runtime as a function of n? What is the largest size n that can be handled in one millisecond?
 The efficiency e(n) of an implementation usually varies with n. Assume the efficiency follows the formula
e(n) = 0.8 exp(-n/A)
for some constant A. Assuming this implementation reaches an efficiency of 0.5 (50% of peak) at size 32 x 32, determine A and the largest size n that can be handled in one millisecond.
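For both parts, runtime, cost, and efficiency are tied together by the following relation (writing P for the peak performance in flop/s and using the stated cost of 2n^3 flops):

```latex
t(n) \;=\; \frac{2n^3}{e(n)\,P}
```

The largest n that fits a time budget T is then obtained by solving 2n^3 <= e(n) P T with the respective e(n).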
 (Floyd-Warshall algorithm, 10 pts) A (positively) weighted, directed graph G with n vertices is represented by an n x n matrix D. The Floyd-Warshall algorithm computes the shortest paths between all pairs of vertices of the graph as follows:
for k = 1:n
for i = 1:n
for j = 1:n
D[i][j] = min( D[i][j], D[i][k] + D[k][j] )
 What does this code remind you of?
 Determine an appropriate cost measure for this algorithm.
 Compute the cost based on this measure.

(Cache miss analysis, 15 pts + 10 extra pts) Consider a 2-way set associative cache with a cache size of 32 KB, a cache line size of 32 B, and a FIFO (first in, first out) replacement policy (this means that if one of the two candidate cache lines has to be replaced, it will be the one that was brought into the cache first).
Consider two single-precision floating point arrays (single-precision float = 4 B), A and B, each with n elements, where n is much larger than the cache and a multiple of the cache size. Further, you may assume that A and B are both fully cache-aligned, i.e., A[0] and B[0] map to the first position of the first cache line.
Now consider the following pseudocode snippet:
for (i from 0 to n-1)
    A[i] = A[i] + B[f(i)]
where f(i) is an index mapping function that reads B at a stride of 8. (For example, if B were 16 elements long, then reading it at stride 8 would result in this access pattern: f(i) = 0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15.)
Assume an empty cache for each part of this exercise.

 (Disregard the code snippet for this part.) What is the expected number of cache misses incurred by streaming once completely through array A alone in sequential order, given the cache parameters above?
 (Disregard the code snippet for this part.) What is the expected number of cache misses incurred by streaming once completely through array B alone at a stride of 8, given the cache parameters above?
 How many cache misses is the given pseudocode snippet expected to incur? (Assume, for simplicity, that index variables are not cached).
 (10 extra points) Rewrite the code (without changing the semantics, i.e., overall computation) to reduce the number of cache misses as much as possible. (Assume, for simplicity, that index variables are not cached).
 (5 pts) How many hours did it take you to complete this assignment?
