
DARPA PERFECT: 



Energy Efficient High Performance through Application-Specific Processor/Program Co-Synthesis (HR00111320007) 



Franz Franchetti (PI), José M. F. Moura (Co-PI), James C. Hoe (Co-PI), Larry Pileggi (Co-PI), Mike Franusich (Co-PI)
PostDocs: Yusuf Adibelli, Nikos Alachiotis, Qi Guo, Tze Meng Low, Kaushik Vaidyanathan
PhD Students: Thom Popovici, Fazle Sadi, Kuntal Shah, Richard Veras, Guanglin Xu
Engineers: Brian Duff, Jason Larkin
Alumni: Berkin Akin, Aliaksei Sandryhaila, Ekin Sumbul, Qiuling Zhu 



Approach
The DARPA PERFECT Program develops technology to achieve a power efficiency of 75 GFLOPS/W at 7nm for DoD-relevant applications. Our approach is to deploy algorithm/architecture co-synthesis that leverages 3D chip stacking technology. We are targeting the PERFECT goal of 75 GFLOPS/W from three complementary angles: 

• Hardware/software co-synthesis based on the SPIRAL technology.
• Memory-side accelerators, a highly configurable architecture for near-memory computing.
• Low-power accelerators based on logic-in-memory and 3DIC technology.


The core idea of our strategy is that reaching the PERFECT goal requires leveraging domain knowledge, automatic design space exploration, and low-power technologies such as 2.5D and 3D interconnection/stacking and logic/memory integration, which become possible at sub-20nm technology nodes. Further, as evidenced by its hardware and software design tools, CMU recognizes the need for industry-grade tools to enable integration with the PERFECT architecture and the need for eventual technology transition. 







Results
Our major accomplishments in Phase 1 of PERFECT are all demonstrated on a simulated 3DIC system (4 layers of 2 Gbit DRAM and 1 layer of logic at the 32nm technology node, running at 1 GHz to 1.3 GHz). The major tangible accomplishments are three near-DRAM accelerators, demonstrated in detailed full-system (DRAM and logic) simulation (Figures 3 and 4): 

• 2D FFT performing at 40 GFLOPS/W, up to 8k x 8k single-precision floating-point. Data resides in DRAM before and after computation. 1D FFT and 3D FFT were also demonstrated.
• Polar formatting SAR at 71 GFLOPS/W, up to 8k x 8k, with 91 dB PSNR compared to a Matlab double-precision “gold standard” implementation of SAR with FFT-based interpolation.
• Sparse matrix-matrix multiplication at 1 GFLOPS/W, demonstrated for matrices from the University of Florida sparse matrix collection. This is a 100x speedup and a 1000x improvement in power efficiency compared to the state of the art (Intel and Nvidia).
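
The PSNR figure quoted for the SAR accelerator compares an accelerator output against a higher-precision reference. The source does not say exactly how PSNR was defined for this comparison, so the following is only a minimal sketch under stated assumptions: the function name `psnr_db` is hypothetical, images are flattened to sequences of real or complex samples, and the peak is taken to be the largest magnitude in the reference.

```python
import math


def psnr_db(reference, test):
    """Peak signal-to-noise ratio in dB between two equally long sample sequences.

    Assumption (not from the source): the peak value is the largest
    magnitude found in the reference. Works for real or complex samples.
    """
    if len(reference) != len(test) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all samples (squared magnitude of the difference).
    mse = sum(abs(r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs: no measurable noise
    peak = max(abs(r) for r in reference)
    return 10.0 * math.log10(peak ** 2 / mse)


# Illustrative data only: a reference with peak magnitude 1 and a copy
# that is off by 1e-3 everywhere, which gives roughly 60 dB.
reference = [1.0, 0.5, -0.25, 0.0]
approximate = [r + 1e-3 for r in reference]
print(round(psnr_db(reference, approximate), 3))
```

Each additional 20 dB corresponds to a tenfold reduction in RMS error relative to the peak, so under this definition 91 dB would mean an RMS error of roughly 3e-5 of the peak value.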


These accelerators are components of the DPA architecture and are key to reaching the PERFECT goal of 75 GFLOPS/W at 7nm with DPA. CMU uses an elaborate setup of industry-standard simulation tools to obtain the 32nm performance and power estimates. All steps in the simulation were provided to the TAV. CMU also performed thermal simulations and addressed power distribution to establish that the designs can be built with current or emerging 3D technology. 
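
To give a feel for what a GFLOPS/W figure means in terms of energy, the following sketch converts an efficiency number into joules per transform. It assumes the common 5·N·log2(N) operation count for a complex FFT of N points; the source does not state which counting convention was used, and the function names are hypothetical.

```python
import math


def fft2d_flops(n):
    """Nominal operation count for an n x n complex 2D FFT.

    Assumption (not from the source): the conventional 5*N*log2(N)
    count with N = n*n total points.
    """
    total_points = n * n
    return 5.0 * total_points * math.log2(total_points)


def energy_joules(flops, gflops_per_watt):
    """Energy for a given operation count; 1 GFLOPS/W equals 1e9 operations per joule."""
    return flops / (gflops_per_watt * 1e9)


flops = fft2d_flops(8192)          # 8k x 8k transform
print(flops)                       # 8724152320.0
print(energy_joules(flops, 40.0))  # about 0.218 J at 40 GFLOPS/W
print(energy_joules(flops, 75.0))  # about 0.116 J at 75 GFLOPS/W
```

Under this counting convention, one 8k x 8k transform is about 8.7 GFLOP, so the energy per transform scales inversely with the efficiency figure.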











Design and Simulation Tools
Our results were obtained using our tools for design space exploration. For all of these kernels, we demonstrated the capability to generate and evaluate many design points with different trade-offs, so that users can meet a target for one metric (e.g., at least 1 TFLOPS or no more than 40 W) while optimizing the other metric. 
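
The selection step described above, fixing a target for one metric and optimizing the other, can be sketched as a simple constrained search over evaluated design points. This is not the project's tool flow: the `DesignPoint` type, the function names, and all numbers below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DesignPoint:
    name: str
    gflops: float  # simulated performance
    watts: float   # simulated power


def best_under_power_cap(points, max_watts):
    """Highest-performance point whose power does not exceed max_watts (None if none qualifies)."""
    feasible = [p for p in points if p.watts <= max_watts]
    return max(feasible, key=lambda p: p.gflops, default=None)


def lowest_power_meeting(points, min_gflops):
    """Lowest-power point that delivers at least min_gflops (None if none qualifies)."""
    feasible = [p for p in points if p.gflops >= min_gflops]
    return min(feasible, key=lambda p: p.watts, default=None)


# Made-up design points, not results from the project.
POINTS = [
    DesignPoint("A", 400.0, 12.0),
    DesignPoint("B", 900.0, 30.0),
    DesignPoint("C", 1100.0, 38.0),
    DesignPoint("D", 1500.0, 55.0),
    DesignPoint("E", 1020.0, 36.0),
]

print(best_under_power_cap(POINTS, 40.0).name)    # C
print(lowest_power_meeting(POINTS, 1000.0).name)  # E
```

The two queries mirror the examples in the text: "no more than 40 W" maximizes performance under a power cap, while "at least 1 TFLOPS" minimizes power subject to a performance floor.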







Publications
B. Akin, F. Franchetti, J. C. Hoe
Data Reorganization in Memory Using 3D-stacked DRAM
42nd International Symposium on Computer Architecture (ISCA), 2015.
H. E. Sumbul, K. Vaidyanathan, Q. Zhu, F. Franchetti, L. Pileggi
A Synthesis Methodology for Application-Specific Logic-in-Memory Designs
Proceedings of the 52nd Design Automation Conference (DAC), 2015.
D. A. Popovici, F. Russell, K. Wilkinson, C.-K. Skylaris, P. H. J. Kelly, F. Franchetti
Generating Optimized Fourier Interpolation Routines for Density Functional Theory Using SPIRAL
29th International Parallel & Distributed Processing Symposium (IPDPS), 2015.
Q. Guo, N. Alachiotis, B. Akin, F. Sadi, G. Xu, T. M. Low, L. Pileggi, J. Hoe, F. Franchetti
3D-Stacked Memory-Side Acceleration: Accelerator and System Design
2nd Workshop on Near Data Processing (WONDP) in conjunction with the 47th International Symposium on Microarchitecture (MICRO-47), 2014.
B. Akin, F. Franchetti, J. C. Hoe
Understanding the Design Space of DRAM-optimized Hardware FFT Accelerators
Proceedings of the 25th International Conference on Application-Specific Systems, Architectures and Processors (ASAP) 2014, pages 248-255.
B. Akın, F. Franchetti, J. C. Hoe
HAMLeT: Hardware Accelerated Memory Layout Transform within 3D-stacked DRAM
Proceedings of High Performance Extreme Computing Conference (HPEC) 2014.
Best Paper Award
F. Sadi, B. Akin, D. T. Popovici, J. C. Hoe, L. Pileggi, F. Franchetti
Algorithm/Hardware Co-optimized SAR Image Reconstruction with 3D-stacked Logic in Memory
Proceedings of High Performance Extreme Computing Conference (HPEC) 2014.
Rising Stars Session
Q. Zhu, C. R. Berger, E. L. Turner, L. Pileggi, F. Franchetti
Local Interpolation-based Polar Format SAR: Algorithm, Hardware Implementation and Design Automation
The Journal of Signal Processing Systems. Springer, 2013, VLSI1782R2.
B. Akin, F. Franchetti, J. Hoe
FFTs with Near-Optimal Memory Access Through Block Data Layouts
Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014.
Q. Zhu, B. Akin, H. E. Sumbul, F. Sadi, J. Hoe, L. Pileggi, F. Franchetti
A 3D-Stacked Logic-in-Memory Accelerator for Application-Specific Data Intensive Computing
Proceedings of IEEE International 3D Systems Integration Conference (3DIC) 2013, pages 1-7.
Q. Zhu, H. E. Sumbul, F. Sadi, J. Hoe, L. Pileggi, F. Franchetti
Accelerating Sparse Matrix-Matrix Multiplication with 3D-Stacked Logic-in-Memory Hardware
IEEE High Performance Extreme Computing Conference (HPEC), 2013, pages 1-6.
Best Paper Award





