The First MEMOCODE HW/SW Co-design Contest

https://memocode07.ece.cmu.edu/contest.html

Entries Due Online: March 19, 2007

1. Overview

Matrix-matrix multiplication (MMM) is a linear algebra operation familiar to everyone. This design challenge emphasizes your HW/SW co-design methodology’s ability to explore algorithmic alternatives, to fine-tune performance, and to create quickly a HW/SW co-designed/optimized implementation.

The implementation of a basic triple-loop square MMM is given below.  The first three arguments A, B and C are pointers to the square N-by-N multiplicand, multiplier and product matrices, respectively.  The fourth argument N specifies the number of rows/columns in the square matrices.  A, B and C are row-major.  The implementation below assumes C is zero initialized. This implementation performs N3 multiplications and N3 additions.

void mmmKernel(Number* A, Number* B, Number* C, int N) {

int i, j, k;

for (j = 0; j < N; j++)

for (i = 0; i < N; i++)

for (k = 0; k < N; k++)

C[i*N+j] = C[i*N+j] + A[i*N+k]*B[k*N+j] ;

}

The above straightforward implementation has poor data reuse locality, which becomes a performance concern when the entire data set cannot be kept in a nearby fast memory.  This can occur in a software scenario when the data set cannot fit in the cache of the processor; this can also occur in a HW/SW co-execution scenario when the data set cannot fit within the hardware accelerator. In both scenarios, the data set must be transferred frequently and redundantly between the nearby fast memory and a slower backing-store (e.g., DRAM).

To improve data locality, a “blocked” MMM breaks down the computation into a series of small MMMs of NB-by-NB regions in the original N-by-N multiplicand and multiplier matrices. Blocked MMM also performs N3 multiplications and N3 additions but has better data locality (requiring fewer data transfers between the fast near-by memory and the slower backing store).  A blocked MMM implementation is given below with an additional argument, NB, the blocking size.  This implementation assumes N is an integer multiple of NB.  Again, this implementation assumes C is zero initialized. This implementation assumes the existence of a new MMM kernel (which could be a basic triple-loop implementation) to multiply the sub-blocks.  For complete details, please see the reference software-only implementation released with the contest.

void mmmBlocked(Number* A, Number* B, Number *C, int N, int NB) {

int j, i, k;

for (j = 0; j < N; j += NB)

for (i = 0; i < N; i += NB)

for (k = 0; k < N; k += NB)

mmmKernel(&(A[i*N+k]), &(B[k*N+j]), &(C[i*N+j]), N, NB);

}

Starting from the reference software-only blocked MMM implementation (see submission instructions), you are to develop a HW/SW co-designed optimized implementation that partitions the computation onto a processor running software and the hardware datapath to be implemented on an FPGA fabric.   For starters, mmmKernel( ) is a good candidate to accelerate in HW, and the choice of NB is a tunable parameter.  You might want to explore non-square blocking. You can also consider more effective data management to take advantage of on-FPGA memory capacity.  Your implementation should be optimized for 1024-by-1024 and also 256-by-256 sized matrices.  Your implementation must be able to support 2-power matrix sizes, N=64, 128, 256, 512, and 1024, without reprogramming the FPGA fabric.

Like the reference implementation, your design must support 16-bit fixed-point, complex data values.  That is, each matrix element is 32-bits: the more significant 16 bits are used for the real component and the less significant 16 bits are used for the imaginary component.  The multiplier matrix (B) and the product matrix (C) use the same 16-bit 2’s-complement fixed-point format in the real and the imaginary component.  The multiplicand matrix (A) uses a 16-bit 2’s-complement fixed-point formant with 14 bits for fraction.  You can assume the multiplicand matrix contains real and imaginary values that are between 1 and -1 inclusive.  You can also assume the test cases would not cause overflows in the reference software-only implementation.

3. Implementation Platform

You may use any HW and SW design methodology at your disposal.  Formal methods are encouraged but not required.

You may use any FPGA development platform at your disposal.  The FPGA platform must permit a processor running software (e.g., the embedded PowerPC405 in a Xilinx Virtex-II Pro FPGA or a Xilinx Microblaze soft-core) to communicate with the hardware accelerator datapath on an FPGA fabric.  The processor and/or the FPGA should have access to off-chip memory sufficient to hold the multiplicand, multiplier and product matrices, up to 1024-by-1024.  In your implementation, the multiplicand matrix and the multiplier matrix must start from the off-chip memory, and the product matrix must complete in the off-chip memory.

The reference platform supported by the contest is the Xilinx XUP development board (http://www.digilentinc.com/Products/Detail.cfm?Prod=XUPV2P&Nav1=Products&Nav2=Programmable) with 512MB of DRAM.  For those of you using XUP, we provide a reference EDK project for the Xilinx XUP board that implements a basic interface library to enable communication between the software running on the 300MHz embedded PowerPC405 and the 100MHz FPGA fabric through the DSOCM interface (150+MB/sec bandwidth).  The interface is based on two circular memory FIFOs, one for each direction of data communication between the hardware and the software.  You may develop or acquire any other interfacing scheme you prefer.

4. Contest Judging

The contest will be judged in two parts.

First, an objective element of judging is based on the combined execution time (wall-clock) of a 1024-by-1024 MMM and a 256-by-256 MMM.  Second, an subjective element of judging is based on the “elegance” of the solutions.

The execution time will normalized for platform differences.  The effective execution time will be

Timeeffective     = (Timemeasured N=1024 + 64*Timemeasured N=256) * fCPU-speed * fFPGA-capacity * fFPGA-speed

where

fCPU-speed         = min{ 1 , Time{SW-only, N=1024, no blocking, 300MHz PPC in XUP}/Time{SW-only, N=1024 no blocking, your CPU}}

fFPGA-capacity  = min{ 1 , capacity-in-gates{Your FPGA} / capacity-in-gates{XC2VP30}}     /* based on vendor data for the part you use */

fFPGA-speed       = min{ 1 , operating-frequency{Your FPGA} / 100MHz }

Notice, this normalization penalizes platforms that are more capable than the XUP but deliberately does not benefit less capable platforms.  Some factors that the normalization calculation does not take into account are 1. HW/SW communication bandwidth and latency,  and 2. off-chip memory bandwidth and latency. This normalization is not perfect; hence “gaming the system” is fair game. In the case an entry is based on an extraordinary FPGA platform, the judging panel reserves the right to determine a fair normalization on a case-by-case basis.

You need to prepare a brief documentation describing 1. your FPGA platform, 2. your design methodology, 3. the organization of your design and its theory of operation, 4. a brief analysis of its performance (e.g., where are time spent and where are the bottlenecks).  A panel of judges will assess subjectively the elegance of the solution and rank the entries accordingly.   The decisions of the judges are final. (Documentations in the form of PowerPoint slides are perfectly acceptable. Keep in mind, the judges’ subjective sense of elegance is likely to be influenced by the quality of the documentation.)

The entries are ranked overall by the value (RankTime-effective+Rankelegance).  In the case of a tie, Timeeffective is the tie-breaker.  Top three finishers will have the opportunity to present their design and design methodology in MEMOCODE’07’s technical program.  All entrants are invited to submit a poster for Memocdoe’s Poster Session. We will offer a cash prize for the winning design (from the top three) selected at the conference.

We rely on the honor systems for initially reporting the performance data required for Timeeffective calculation, i.e.,

Timemeasured, N=1024, Timemeasured, N=256

Time{SW-only, N=1024, no blocking, your CPU}   /* Reference implementation compiled with –O an –DNDEBUG */

The winning designs must work correctly. In the case you should come in top three, we will need to arrange verification of the correctness of your design and the performance data.

4. Eligibility and Submission Instructions

This contest is open to industry and academic institutions (faculty and/or students).  A team may include both industry and academic members.  There is no limit on the size of a team.  There is no limit on the number of teams per institution.  Each person can participate in only one team.

If you are interested, please email forrest@ece.ucsb.edu with your contact information (team member names and affiliation).  We will respond by sending you the SW-only MMM reference design (portable C) and the XUP HW/SW interface design (C and Verilog in an EDK project).  There are no obligations associated with requesting the reference designs. You have until March 19, 2007 to submit your design submission (design source + documentation) in a tar or zip file via http://www.ece.cmu.edu/tools/upload/ (use memocode07@ece.cmu.edu as the recipient address).   Although you have until March 19, we estimate the design task to be approximately a one-week effort.