Instructions (read carefully)
 For implementation exercises, submit the working code that goes with the experiments.
 Submit all other requested results (plots, tables, text) in one Word, PS, or PDF file. Make sure you answer all the questions posed. Questions 3 and 4 may also be submitted handwritten (bring them to Srinivas in PH B10, to me in PH B16, or to Carol Patterson in PH B15).
 For plots/benchmarks follow the guidelines presented in lecture 5 where applicable (no need for huge amounts of text, but provide necessary information and always briefly discuss the plot and draw a conclusion).
 When compiling, use good optimization flags (e.g., -O3).
Exercises
 (Maximal performance program)
 Determine the peak performance of your machine (excluding vector instructions).
 Write a program that achieves a floating-point performance as high as possible. The program does not need to compute anything of importance, but you must follow these rules:
 For the operations count, count only floating-point instructions (additions and multiplications).
 Do not use vector instructions
 There should be no obvious way of simplifying the computation (otherwise the compiler may do it); you should actually look at the assembly code to check whether all ops are still there
 Make sure you use the result you compute afterwards in the code (but outside the timing bracket), otherwise the compiler may optimize the computation away
Explain the design decisions behind your program and report the percentage of peak performance you achieve. Be ambitious (can you get above 90%? 95%?).

(Microbenchmarks: mathematical functions; requires a Pentium-compatible machine) Determine the runtime (in cycles) of the following computations (x, y are doubles) as accurately as possible:
 y := x
 y := 7.12*x
 y := x + 7.12
 y := sin(x),   x in {0.0, 0.2, 4.1, 170.32}
 y := log(x),   x in {0.001, 1.00001, 10.65, 2762.32}
 y := exp(x),   x in {1.234e17, 0.101, 3.72, 1.234e25}
This means a total of 15 runtimes. Try to explain the results. The benchmark setup should be as follows:
 Allocate two vectors double x[N] and y[N] and initialize all x[i] to be one of the values from above.
 Use
for (i=0; i<N; i++) y[i] = f(x[i]); 
to compute y[i] := f(x[i]), with f() being one of the functions above, and time this for-loop. Choose N such that all data easily fits into the L1 cache but there are enough iterations to obtain a reasonable amount of work.
 Use the x86 time stamp counter via the interface provided by rdtsc.h.
To accurately measure these very short computations, use the following hints:
 Only time the actual work, leave everything else (any initialization or timing related computation, etc.) outside the timing loop.
 Use the C preprocessor to produce a parameterized implementation to easily check different parameters.
 You may have to run your for(N) loop multiple times to obtain reasonable accuracy.
 You may have to take the minimum across multiple such measurements to obtain stable results. Thus, you might end up with three nested loops.
 You have to warm up the experiment: the variables where you store the timing results, the timed routine, and the data vectors all have to be in L1 cache.
 Alignment of your data vectors on cache line or page boundaries can influence the runtime significantly.
 Using CPUID to serialize the CPU before reading the time stamp counter, as explained in the Intel manual, produces a considerable amount of overhead and can be omitted in this setting.
 (Cost radix-2 FFT) Determine the arithmetic cost of a radix-2 Cooley-Tukey FFT for a DFT of 2-power size n = 2^k. The cost measure is (A(n), M(n)), where A(n) is the number of complex additions/subtractions and M(n) is the number of multiplications by a complex constant not equal to 1 or -1. Is this a good cost measure when considering implementation?
 (Extra credit: Cost arbitrary radix FFT) Show that the arithmetic cost (as defined and measured in the previous exercise) of the Cooley-Tukey FFT for a DFT of 2-power size n = 2^k is independent of the chosen recursion strategy. (Hint: use induction.)
