Instructions (read carefully)
 For implementation exercises you have to submit the working code that goes along with the experiments.
 All other requested results (plots, tables, other results, text) submit in one word, ps, or pdf file.
 For plots/benchmarks follow the guidelines presented in lecture 5 where applicable (no need for huge amounts of text, but provide necessary information and always briefly discuss the plot and draw a conclusion).
 When compiling use good optimization flags
Exercises
 (Microarchitectural parameters, 30pts) The following software tools measure microarchitectural parameters. Download and run them. Organize those that were mentioned in class (includes parameters related to caches and execution units) into a table organized as: vertically the tools, horizontally the type of parameter
 MOB http://steamboat.cs.ucsb.edu/mob/
 Calibrator http://homepages.cwi.nl/~manegold/Calibrator/calibrator.shtml
 lmbench http://sourceforge.net/projects/lmbench
 Optional: XRay http://iss.cs.cornell.edu/Software/XRay.aspx
 (miniMMM, 40pts) You may use any code from http://www.cs.drexel.edu/~jjohnson/wi04/cs680/assignments/assign1p.html (from a class taught by Jeremy Johnson, CS, Drexel University). You should use the code and method for timing from that website (in timeblockmm.c), but may modify it (e.g., increase number of repetitions for accurate measurements). This exercise builds on lecture 6 and on the paper "Is search really necessary to generate highperformance BLAS?" by Yotov et al. which you received as a hard copy. The goal of this exercise is to implement a fast miniMMM to multiply two square NB x NB matrices (NB is a parameter), which is the used within an MMM in 3.
 (By definition) Implement by definition (triple loop) using the ijk loop order. This is your code0.
 (Blocking) Block into micro MMMs with MU = NU = 2, KU=1. The inner triple loop has the kij order. Unroll (by hand) the innermost i and jloop such that you have alternately adds and mults and do scalar replacement. This is your code1.
 (Unrolling) Unroll the innermost kloop by a factor of 2 and 4 (KU=2 and 4, doubles and quadruples the loop body) again doing scalar replacement. Assume that 4 divides NB. This gives you code2 and code3.
 (Performance plot, search for best block size NB) Determine the L1 cache size C1 (in doubles, i.e., 8B units) of your computer. Measure the performance (in Mflops) of your four codes for all NB with 16 <= NB <= min(80, sqrt(C1)) with 4 divides NB. Create a plot: xaxis shows NB, yaxis shows performance (so there will be 4 lines in it). Discuss the plot including: Which NB and which code yields the maximum performance? What is the percentage of peak performance in this case?
 Does it improve if in the best code so far you switch the outermost loop order from ijk to jik?
 (MMM, 30pts) Implement an MMM for multiplying two square n x n matrices assuming NB divides n, blocked into NB x NB blocks using your best miniMMM code from exercise 2. Create a performance plot comparing this code and code0 (by definition) above for an interesting range of sizes n (up to sizes where the matrices do not fit into the L2 cache). xaxis shows n; yaxis performance in Mflops. Discuss the plot.
 (Extra credit, up to 20pts) If you went through all steps in 2. and 3., you'll receive extra points for any other optimization or interesting experiment that you perform. Examples include skewing (as explained in lecture 6), blocking in addition for the second cache, evaluating the effect of loop index ordering, or anything else you can think of. Create proper plots with brierf discussion.
