Due Friday, October 9, 1998 at 3:00 PM
This lab takes you through a simulation-based exploration of the effects of cache parameters on performance for a benchmark program. In this case the benchmark is cjpeg, which performs image compression and outputs files in JPEG format. cjpeg is a multimedia application, although it has fairly good overall cache behavior compared to some. (If you're curious about JPEG, there is a FAQ entry available on it.) Note that there are two reasons to perform such simulations: system design and performance improvement of the application.
The homework-specific files you will need are in /afs/ece/class/ece548/www/hw/lab3. You must create a symbolic link to that directory from your working directory for the command lines in this homework to work properly:
ln -s /afs/ece/class/ece548/www/hw/lab3
You are welcome to use and adapt any shell or perl scripts you find in
that directory to help you with the homework.
You are strongly urged to get an early start on this lab. Running simulations is a crucial part of being a computer architect, and thinking about what the simulations are doing (rather than simply punching through a set of parameters) will help you understand the concepts we are covering in class. You may want to set up a file using the "at" facility ("man at" will give more info) and run your simulations overnight, testing them first to make sure they're likely to "go".
Problem 1. Compare with H&P Traces
In this problem we perform a cursory comparison of cjpeg to three other programs in the H&P trace collection. You'll get to see whether this image compression program "looks" any different from so-called general purpose computing.
The H&P traces are fixed in size and content, but you will be able to generate traces on-the-fly for cjpeg because it has been "atom-ized". In other words, an instrumentation package called Atom has been used to augment the cjpeg program with printf statements that, on every instruction fetch, data load, and data store, emit trace information suitable for use by dinero. You'll be using Atom yourself in a coming homework, but for now we've done that for you.
1. In order to use cjpeg you'll need an image file, which for
this homework must be in .gif format. You have an image assigned to you in
the directory lab3/gif
(using a rather obvious naming and image content scheme). Please don't
modify the file in any way before using it. Save this image to your working
directory and run cjpeg on it (where "fname" is the file name
you got from the archive):
lab3/cjpeg -outfile fname.jpg fname.gif
This will translate the .gif file into a .jpg file. Load the .jpg file
into Netscape or otherwise view it in some program and print it out (B&W
printing is fine). Clearly annotate it with the filename and the file
sizes of both the .gif and .jpg version (a .jpg file is typically less
than half the size, but this varies significantly depending on the
picture).
2. Let's get some idea of what cjpeg is doing to perform this
compression. cjpeg.atom is the instrumented version of the
program that produces a full dinero trace file. Run this instrumented
version through dinero and
report the instruction miss ratio, data read miss ratio, data write miss
ratio, and overall traffic ratio using the following command line (for
this and all other problems, use the graphics image selected in part 1 of
this problem; the command below should be entered as a single
line).
lab3/cjpeg.atom -outfile fname.jpg fname.gif | dinero -i8K -d8K -b16
-W8 -B8 -z1000000000
Cut & paste the dinero result from this command line; circle and
label the requested 4 pieces of information. If you have even the
slightest doubt whether you are looking at the correct numbers ask for
help from the course staff. (Note: don't try to redirect the trace output from
cjpeg.atom to a file in your working directory -- it is likely
to be up to a GB in size.)
3. Now let's compare this to other traces. Run the same dinero
parameters on the following three dinero input files in the homework
directory: cc1.din, spice.din, tex.din. Create two tables indicating for
each of the three programs and cjpeg:
Table 1: number of instructions, number of reads, number of writes, total
demands (sum of all demand references)
Table 2: overall miss ratio, instruction miss ratio, data read miss
ratio, data write miss ratio, traffic ratio.
You'll want to use command lines for the other traces that look slightly
different, since these are 32-bit programs with 4-byte accesses, and are
in general pretty short traces. In particular, use:
dinero -i8K -d8K -b16 < lab3/cc1.din
dinero -i8K -d8K -b16 < lab3/spice.din
dinero -i8K -d8K -b16 < lab3/tex.din
Compared to the three H&P traces, how well behaved is cjpeg (what is
the rank of its cache performance among the four cases in your tables)?
Problem 2. Block Size
The classical way to run a cache experiment is to pick a starting point as we did above and then do sensitivity analysis to see which parameters matter and which don't. In this problem we'll look at block size.
In order to keep run times tractable, there is a slightly differently instrumented version of the jpeg program available called cjpeg.atomd, which differs in that it does not include instruction accesses in the output trace. Use that for the following problems just as you used cjpeg.atom above. The results can still be considered "real", because you can think of this as an experiment to determine what memory access patterns you'd see if you were building an ASIC to do this operation -- only data accesses.
It is recommended that if there is room on the local scratch volume you run cjpeg.atomd once and save the results in a file on /scratch to feed to dinero. BUT, please be sure to erase this temporary file when you're done to make room for others. The file should be on the order of 100 MB in size; the scratch volumes hold 500 MB to 1 GB depending on the machine. NOTE: scratch volume files are not backed up, and are automatically deleted after 24 hours or so.
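The save-once, reuse, then delete workflow described above could be sketched as a few shell functions. This is only a sketch under stated assumptions: the function names (trace_path, save_trace, remove_trace) and the per-user file naming are hypothetical, and SCRATCH_DIR simply defaults to the /scratch volume named in this handout.

```shell
# Hypothetical helpers for keeping one data-only trace on the scratch volume.
# SCRATCH_DIR defaults to /scratch; override it if your machine differs.
SCRATCH_DIR=${SCRATCH_DIR:-/scratch}

trace_path() {
    # One trace file per user and image, e.g. /scratch/<user>.<fname>.din
    printf '%s/%s.%s.din\n' "$SCRATCH_DIR" "${USER:-user}" "$1"
}

save_trace() {
    # Run the data-only instrumented program once and keep its trace output.
    lab3/cjpeg.atomd -outfile "$1.jpg" "$1.gif" > "$(trace_path "$1")"
}

remove_trace() {
    # Free the scratch space for other users when you are done.
    rm -f "$(trace_path "$1")"
}

# Typical use (fname is your assigned image name):
#   save_trace fname
#   dinero -i8K -d8K -b16 -W8 -B8 -z1000000000 < "$(trace_path fname)"
#   remove_trace fname
```

Putting the user name in the file name is just one way to avoid colliding with other students' traces on a shared scratch volume.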
1. Run dinero while varying the block size from 8 bytes to 4096 bytes in
increments of a factor of two (i.e., 8, 16, 32, 64, 128, 256, 512,
1024, 2048, 4096). Tabulate the results as block size, data miss ratio,
and traffic ratio (note that there will be no instructions in the trace
you use for this). Other than block size, use the dinero command line
below, which makes sure you use the whole trace. Obviously you'll have to
pipe your data file or the cjpeg.atomd output into dinero for
this to work.
dinero -i8K -d8K -b16 -W8 -B8 -z1000000000
2. Plot the block size results with block size on a logarithmic X axis and both ratios on the same Y axis (use the Y axis that works best to show the data). Which two or three block sizes look attractive and why? Which type of miss (which of the three "C"s) dominates at the right hand side of your graph (block size 4096)?
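For the block-size sweep in part 1, one option is to generate the ten command lines with a loop rather than typing them by hand. The sketch below only prints the commands; the function name block_sweep_commands is hypothetical, and the trace file name in the usage comment is a placeholder for wherever you saved your cjpeg.atomd output.

```shell
# Print one dinero command line per block size, from 8 to 4096 bytes,
# doubling each time. Everything except -b matches the base command line.
block_sweep_commands() {
    bs=8
    while [ "$bs" -le 4096 ]; do
        echo "dinero -i8K -d8K -b$bs -W8 -B8 -z1000000000"
        bs=$((bs * 2))
    done
}

# Possible use, assuming the trace was saved to a file named trace.din
# (a placeholder name):
#   block_sweep_commands | while read -r cmd; do
#       echo "== $cmd"
#       $cmd < trace.din
#   done > block_sweep.out
```

Printing the commands first makes it easy to check the parameter list before committing to a long overnight run.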
Problem 3. Associativity
Now let's look at associativity, still using the data-only trace from cjpeg.atomd as in Problem 2.
1. Run dinero while varying the associativity, using values of {1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16, 18}. Note that you will have to round the cache size up or down by a block or more so that the number of blocks in the cache is a multiple of the associativity. The goal is not to pick a cache size that is convenient to physically realize, but rather to do an approximate apples-to-apples comparison of associativity with essentially identical cache sizes (i.e., cache size should not vary from 8192 by more than a few bytes, and in many cases will not be a power of two; as an example, for a 3-way set associative cache you would use -d8208 because that holds an integral number of 16-byte blocks and is divisible by three). Use the same command line as in Problem 2 (dinero -i8K -d8K -b16 -W8 -B8 -z1000000000), but vary the associativity. Tabulate the resultant cache size, associativity, overall data miss ratio, and traffic ratio.
2. Plot the associativity results with associativity on a linear X axis and both ratios on the same graph using a Y axis that makes sense. If it were physically realizable (i.e., ignore the fact that you may not have a number of sets that is a power of 2, and therefore addressing the memory array would be painful), would you prefer the simulated system with 3-way set associativity or the one with 4-way set associativity? Why?
3. Run another simulation with a 4-way set associative cache just under 8KB instead of exactly 8KB in size (i.e., -d8128 -i8128). This cache is a little smaller, and yet it has a better miss ratio. What insight does this give you about the shape of the curve you just plotted in part 2 of this problem? (i.e., make a brief statement about program behavior with respect to how it is accessing the cache.)
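The cache-size rounding in part 1 is simple arithmetic: one set occupies associativity times block size bytes, and the cache size must be a whole number of sets. A minimal sketch of that calculation follows; the function names nearest_cache_size and assoc_table are hypothetical, and choosing the size nearest to 8192 is one reasonable reading of the rounding rule rather than the only one.

```shell
# Hypothetical helper: cache size in bytes nearest to a target that holds
# a whole number of sets, where one set is assoc * block bytes.
nearest_cache_size() {
    assoc=$1
    block=${2:-16}      # block size in bytes (the base command line uses -b16)
    target=${3:-8192}   # target cache size in bytes (8K)
    unit=$((assoc * block))
    sets=$(( (target + unit / 2) / unit ))   # integer round-to-nearest
    echo $((sets * unit))
}

# Print "associativity size" for every associativity the problem asks for.
assoc_table() {
    for a in 1 2 3 4 5 6 7 8 10 12 14 16 18; do
        echo "$a $(nearest_cache_size "$a")"
    done
}
```

For 3-way this gives 8208, matching the example in part 1. At the higher associativities the nearest valid size drifts further from 8192 (18-way gives 8064), so record the size you actually used in your table. The flag that sets associativity is not shown in this handout; standard Dinero III uses -a, but confirm that against the usage message of the course's dinero before relying on it.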
Problem 4. Write Policies
Now let's look at write policies, and see if they matter.
1. Again run data-only simulations that test all four combinations of write-through/write-back and write-allocation/write-no-allocation given the same base command line from Problem 2 (dinero -i8K -d8K -b16 -W8 -B8 -z1000000000). Show the overall data miss ratio and traffic ratio data in a table. Assuming equal implementation cost, which combination is best?
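The four write-policy combinations can be enumerated with two nested loops. The sketch below prints the command lines rather than running them. The policy flags are an assumption: in standard Dinero III, -w selects the write policy (w for write-through, c for copy-back) and -A selects the allocation policy (w for write-allocate, n for no-write-allocate). The course's dinero may differ, so check its usage message first. The function name write_policy_commands is hypothetical.

```shell
# Print the four dinero command lines for Problem 4.
# ASSUMED flags (standard Dinero III; verify for the course's build):
#   -ww write-through     -wc copy-back (write-back)
#   -Aw write-allocate    -An no-write-allocate
write_policy_commands() {
    for wp in w c; do
        for alloc in w n; do
            echo "dinero -i8K -d8K -b16 -W8 -B8 -z1000000000 -w$wp -A$alloc"
        done
    done
}
```

As with the other sweeps, each printed line still needs the data-only trace piped or redirected into it.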