Due Friday, October 23, 1998 at 3:00 PM
This is a simulation homework that looks at the benefits of multi-level caching and an example of the effects of context switching on cache performance. You will also see an example of how to instrument a program using Atom, as well as how to run a multi-level Dinero simulation. Both problems require use of an Alpha workstation.
The homework-specific files you will need are in /afs/ece/class/ece548/www/hw/lab4. There are subdirectories with example scripts for both problems, and you are welcome to make use of them. In order to run correctly, the scripts must be copied to your own directory, and there must be a symbolic link to the lab4 directory in the same directory from which you run the script:

ln -s /afs/ece/class/ece548/www/hw/lab4

Note that the scripts are representative, and may not necessarily do everything required to answer the lab questions.
Problem 1: Multi-Level Caching
In this problem you'll analyze the performance of a program with respect to various single- and multi-level cache configurations. The program you'll be instrumenting is madd.c, a simple matrix addition program. Please note that the point of this program is to have something that is relatively simple to understand but still has interesting behavior -- so while it is possible to optimize this program's performance by rearranging loops and eliminating the second call to madd(), that is not the point of this exercise. (Your chance to optimize performance on "real" code comes in the next lab...) All terminology is Cragon-based (sectors+blocks) rather than Dinero-based. Atom is used to instrument only the procedure madd(), so you don't need to worry about the caching effects of any other part of the program.
1. Compile and atomize madd.c and report dinero simulation results for the three cache configurations below. For each case show the dinero command line, data read miss rate, data write miss rate, and total bus usage. The first case is a straightforward write-through cache with sector size=block size of 4 words. The second case is a write-back cache having otherwise similar parameters. The third case is an L2-like unified cache.
- dinero -i8K -d8K -a4 -ww -An -b32 -W8 -B16
- dinero -i8K -d8K -a4 -wc -Aw -b32 -W8 -B16
- dinero -u512K -a4 -wc -Aw -b32 -W8 -B16
2. For the results in part 1 above, answer the following questions.
3. Use the following simplified timing model to compute the effective memory access time for data accesses in the three examples from part 1, showing the results in a table. (We're going to ignore instruction memory for the purposes of illustrating a point.) Which option is the fastest? This timing model has specifically been simplified to ignore any data dependencies or other fine-grain timing effects; instead it is more of a "plumbing" model dealing with gross data flow rates. This problem asks you to consider two ways in which you might be limited in each of the three configurations: memory access latency or bus bandwidth. Either total effective memory access time (for the whole program) or average effective memory access time is fine.
- Cache hit to an 8K cache takes 1 clock
- Cache hit to a 512K cache takes 8 clocks
- Cache miss takes cache hit time + 99 clocks for reads
- All write operations take the same amount of time as a cache hit. This is because they run concurrently using write buffers.
- The total bandwidth available on the bus is limited to 1 bus cycle per 33 clocks (if one assumes that a blocking cache is used, this timing model means that it takes 3 "bus cycles" worth of time to process a read miss, but the write buffers can access the bus with 100% efficiency). So the effective access time might be limited by traffic ratio rather than read miss ratios. (In other words, if you need too many bus cycles you'll be limited by bus bandwidth, but otherwise you'll be limited by read latencies.)
4. Model a two-level cache with the parameters below. Note that the L1 cache can now have a 32-byte bus instead of the 16-byte bus used in earlier 8K cache examples because it is connected to an L2 cache instead of to a memory bus. Report the dinero command line, data read miss rate, data write miss rate, and total bus usage for each cache as you did for part 1 of this problem. How much better (in clocks, not as a ratio) is this composite 2-level cache performance than the best case from part 3 of this problem? Use the simplified timing model below.
L1 cache: dinero -i8K -d8K -a4 -ww -An -b32 -W8 -B32
L2 cache: dinero -u512K -a4 -wc -Aw -b32 -W8 -B16
- Cache hit to an L1 cache takes 1 clock
- Cache read miss on L1 cache takes 9 clocks (1 clock to access L1 cache, and 8 clocks to an assumed hit on L2 cache)
- Cache read miss on L2 cache takes 99 clocks in addition to the 9 clocks already assessed for the L1 read miss.
- All write operations take one clock. This is because they run concurrently using write buffers.
- The total bandwidth available on the bus coming out of the L2 cache is limited to 1 bus cycle per 33 clocks. So the effective access time might be limited by traffic ratio rather than read miss ratios. (In other words, if your traffic ratio is too high you'll be limited by bus bandwidth, but if there is enough time to do all the write bus transactions while processing read misses you'll be limited by read latencies.) We will ignore any bus constraints between the L1 and L2 cache, so only L2 bus usage is of interest.
Problem 2: Context Switching
In this problem you'll simulate what happens when multiple programs alternate in accessing cache memory. Although this problem does not include measurements of direct context switching overhead (e.g., register saves and restores), it does give a feel for the indirect context switching costs caused by inter-task cache conflict misses. This problem assumes that you have a symbolic link in place to the lab4 subdirectory as in problem 1, and uses three traces in the lab4/traces directory. There is nothing special about these traces -- they were arbitrarily selected from the H&P trace distributions.
1. Run the program mtask.c while varying the tasking frequency and cache size, then feed the outputs to dinero. mtask takes a number of accesses and a list of files as input parameters. It reads the specified number of memory accesses from the first file, then from the second file, and so on, wrapping around back to the first file, and prints them to standard output. This gives a simple approximation of a multitasking environment. (mtask also alters the high bits of each address to reduce the possibility that different programs will accidentally get cache hits on memory locations belonging to other programs; virtual memory effects are ignored.) Report your results in a table for the following experimental conditions:
- Three trace files fed to mtask in this order:
  lab4/traces/pasc.000.din
  lab4/traces/spic.000.din
  lab4/traces/mul2.000.din
- Dinero executed with block size of 8, associativity of 4, and all other parameters defaulted. Take data points for unified cache of sizes: 1K, 4K, 16K, and 64K bytes.
- Take sets of data points for the above cache sizes for tasking intervals of 10, 100, 1000, 10000, and 100000 (this is the mtask "number of accesses" parameter).
2. Plot a graph with the results from part 1 having four curves -- one per cache size. Put miss ratio on a logarithmic Y axis and number of accesses between switches on a logarithmic X axis. Be sure to label the curves and the axes.
3. For the data you have plotted there will be "break points" where the miss ratio improves noticeably for any given cache size once the number of memory accesses between switches rises above a certain point. Look at your graph and state an approximate relationship for how big the cache should be in this situation in order to be to the right of the break point on the performance curve. Detailed numerical analysis is not required -- just an approximate relationship with one digit of precision. Of course it should be understood that this isn't a general rule, just a result for this particular situation.
18-548/15-548 home page.