Systems Scientists: Tze Meng Low
Post-Docs: Nikos Alachiotis, Qi Guo, Tze Meng Low, Kaushik Vaidyanathan
PhD Students: Berkin Akin, Thom Popovici, Fazle Sadi, Ekin Sumbul, Richard Veras, Guanglin Xu
Engineers: Brian Duff, Jason Larkin
Alumni: Aliaksei Sandryhaila, Qiuling Zhu
The DARPA PERFECT Program develops technology to achieve power efficiency of 75 GFLOPS/W at 7nm for DoD-relevant applications. Our approach is to deploy algorithm/architecture co-synthesis that leverages 3D chip stacking technology. We are targeting the PERFECT goal of 75 GFLOPS/W from three complementary angles:
• Hardware/software co-synthesis based on the SPIRAL technology.
• Memory-side Accelerators, a highly configurable architecture for near memory computing.
• Low power accelerators based on the logic-in-memory and 3DIC technology.
The core idea of our strategy is that reaching the PERFECT goal requires leveraging domain knowledge, automatic design space exploration, and low-power technologies such as 2.5D and 3D interconnection/stacking and logic/memory integration, which become possible at sub-20nm technology nodes. Further, as evidenced by its hardware and software design tools, CMU recognizes the need for industry-grade tools to enable integration with the PERFECT architecture, as well as the need for eventual technology transition.
Our major accomplishments in Phase 1 of PERFECT were all demonstrated on a simulated 3DIC system (4 layers of 2Gbit DRAM and 1 layer of logic at the 32nm technology node, running at 1 GHz to 1.3 GHz). The major tangible accomplishments are three near-DRAM accelerators, each demonstrated in detailed full-system (DRAM and logic) simulation (Figures 3 and 4):
• 2D FFT performing at 40 GFLOPS/W, up to 8k x 8k single-precision floating-point. Data resides in DRAM before and after computation. 1D FFT and 3D FFT were also demonstrated.
• Polar formatting SAR at 71 GFLOPS/W, up to 8k x 8k, with 91 dB PSNR compared to a Matlab double-precision “gold standard” implementation of SAR with FFT-based interpolation.
• Sparse matrix-matrix multiplication at 1 GFLOPS/W, demonstrated for matrices of the University of Florida sparse matrix collection. This is a 100x speed-up and a 1000x improvement in power efficiency compared to the state-of-the-art (Intel and Nvidia).
These accelerators are components of the DPA architecture and the key to reaching the PERFECT goal of 75 GFLOPS/W at 7nm with DPA. CMU uses an elaborate setup of industry-standard simulation tools to obtain the 32nm performance and power estimates. All steps in the simulation were provided to the TAV. CMU also performed thermal simulations and addressed power distribution to establish that the designs can be built with current or emerging 3D technology.
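The accelerator figures above are quoted in two units: GFLOPS/W for power efficiency and dB PSNR for accuracy against a double-precision reference. The following is a minimal sketch of how such numbers are conventionally computed, not the project's actual evaluation flow. The function names are hypothetical, and the choice of the reference's largest magnitude as the PSNR peak is an assumption, since the text does not state which peak definition or flop-counting convention was used.

```python
import math


def gflops_per_watt(flop_count, seconds, watts):
    """Power efficiency: floating-point operations per joule, scaled to 1e9."""
    if seconds <= 0 or watts <= 0:
        raise ValueError("seconds and watts must be positive")
    return flop_count / (seconds * watts) / 1e9


def psnr_db(reference, result):
    """PSNR in dB of `result` against `reference`.

    Both arguments are equal-length sequences of real or complex samples.
    Assumption: the peak is the largest magnitude in the reference.
    """
    if len(reference) != len(result) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all samples (abs() handles complex values).
    mse = sum(abs(r - x) ** 2 for r, x in zip(reference, result)) / len(reference)
    if mse == 0:
        return math.inf  # identical outputs
    peak = max(abs(r) for r in reference)
    if peak == 0:
        raise ValueError("reference is all zeros; PSNR is undefined")
    return 10 * math.log10(peak ** 2 / mse)
```

For instance, a kernel that performs 80e9 floating-point operations in one second at 2 W would score 40 GFLOPS/W under this definition; the inputs in that example are illustrative, not measured values.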
Design and Simulation Tools
Our results were obtained using our tools for design space exploration. For all of these kernels, we demonstrated the capability to generate and evaluate many design points with various trade-offs. This enables users to reach a target for one metric (e.g., at least 1 TFLOPS or no more than 40 W) while optimizing the other metric.
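The selection step described above, constraining one metric and optimizing the other, can be sketched as a simple filter over evaluated design points. This is an illustration only and does not reflect the interface of the actual tools: the `DesignPoint` class, both function names, and the numbers in the example are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DesignPoint:
    """One evaluated design (hypothetical record, not the tools' own format)."""
    name: str
    gflops: float  # sustained throughput
    watts: float   # power consumption


def fastest_within_power(points, max_watts):
    """Highest-throughput design that stays within the power cap, or None."""
    feasible = [p for p in points if p.watts <= max_watts]
    return max(feasible, key=lambda p: p.gflops, default=None)


def lowest_power_meeting(points, min_gflops):
    """Lowest-power design that meets the throughput target, or None."""
    feasible = [p for p in points if p.gflops >= min_gflops]
    return min(feasible, key=lambda p: p.watts, default=None)


# Illustrative design points only; these are not measured results.
candidates = [
    DesignPoint("A", gflops=600.0, watts=15.0),
    DesignPoint("B", gflops=1050.0, watts=30.0),
    DesignPoint("C", gflops=1200.0, watts=38.0),
    DesignPoint("D", gflops=1600.0, watts=60.0),
]

print(fastest_within_power(candidates, max_watts=40.0).name)    # C
print(lowest_power_meeting(candidates, min_gflops=1000.0).name)  # B
```

With these made-up candidates, a 40 W cap selects design C, while a 1 TFLOPS target selects design B. Both functions return `None` when no design point satisfies the constraint.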