5.a) This implementation of Floyd-Warshall inlvoves 3 nested loops in a manner similar to the naive implementation of a matrix-multiplication. However, the operations themselves are different and the loop order is important.

5.b) The code mainly uses 2 arithmetic operations: additions and minimums.  Thus we expected you to choose number of additions and number of minimums as a cost measure. Most of you noticed that a minimum is actually implemented using a substraction and a comparison to zero. Therefore number of additions and substractions, number of comparisons could have been another cost measure.  However, as comparisons usually imply branches and mispredicted branches are expensive, we only gave partial credit to the ones who chose additions and substractions as a cost measure.

5.c) There are n3 additions and n3 minimums in the code. Equivalently, there are 2n3 additions and n3 comparisons.

6.a) There would be n/8 cache misses: there are 8 elements per cache line, and bringing in an element as a result of a miss would bring along 7 other neighboring elements which will all result in hits.

c) There would be n cache misses for B: since b is read in strides of 8, each access to b would result in a miss. Since n is much larger than the cache size, there is no opportunity for reuse.

d) The pseudocode results in a total of (n+(n/8)) misses. This is the sum of the misses of A and the misses of B. Note that the key point here is that the misses of A and the misses of B can simply be added only because the cache is 2-way set associative, and there would be no conflict misses. If it were not for this fact, a simple addition would not suffice - a more complex analysis would be necessary.

e) The key observation here is that the elements of B that are brought in *as neighbors* upon a cache miss, never have an opportunity to get used, thus resulting in n misses to B. Theoretically, if we could somehow use all these neighboring elements before they were evicted, we could reduce the number of misses to B to that of A, which is n/8.

In this case, there happens to be an easy way to do so. The idea is similar to blocking, and takes advantage of neighbor-use: whenever we get a bunch of elements of B into the cache, we attempt to use these before they get evicted.

This idea is easily implemented in the following manner: we notice that after the first 8 access to A, one line of A (8 elements, from A[0] to A[7]) is in cache, while 8 lines of B (64 elements, from A[0] to A[63]) are in cache.

Instead of moving on to access A[9] at this point, we instead access A[h] such that A[h] requires summation with B[1]. A[h+1] is then added to B[9], and so forth until A[h+7] is added with B[57].

Next, we bring in A[g] such that A[g] requires summation with B[2], and repeat the pattern.

This pattern is repeated until all 64 elements of B (B[0] through B[63]) are used up. Note that A[h], A[g] etc. all map to the same spot in the cache (same line, same alignment) that A[0] did.

Here is the code that does this:

for(int i=0; i <(N/8); i+=8)
  for(int j=0; j<N; j+=(N/8)) {
    for (int k=0; k<8; k++) {
      int d = i+j+k;
      A[d] += B[f(d)];
      // Print out the indices of A, B for our comfort:
      printf("%4d, %4d\n", d, L(N, STRIDE, d));
    }
    printf("\n");
  }

Where L(N, STRIDE, d) implements the function f(i) mentioned in the question.