Reliable Processors and Systems
This research investigates the impact of soft-error tolerance on future deep-submicron microprocessor designs and evaluates different options for achieving the desired level of protection against soft errors. This research effort is supported in part by NSF through a CAREER Award.

The TRUSS Project (Total Reliability Using Scalable Servers) develops a reliable, available, and serviceable (RAS) hardware platform based on a distributed cluster of commodity blade servers. The goal of the project is to leverage the cost-effectiveness of commodity processor and memory modules in a reliable server design that achieves both performance and cost scalability. This research effort is supported in part by NSF through an ITR Award and by Intel. (Go to the TRUSS Project Page.)
- Students
- Jared Smolens (PhD Thesis)
- Brian Gold (advised by Babak Falsafi)
- Jangwoo Kim (advised by Babak Falsafi)
- Publications
- Chip-Level Redundancy in Distributed Shared-Memory Multiprocessors. B. T. Gold, B. Falsafi, and J. C. Hoe. Pacific Rim International Symposium on Dependable Computing (PRDC), November 2009.
- OpenSPARC: An Open Platform for Hardware Reliability Experimentation. I. Parulkar, A. Wood, J. C. Hoe, B. Falsafi, S. V. Adve and J. Torrellas. Fourth Workshop on Silicon Errors in Logic - System Effects (SELSE), April 2008. (pdf)
- Multi-bit Error Tolerant Caches Using Two-Dimensional Error Coding. J. Kim, N. Hardavellas, K. Mai, B. Falsafi and J. C. Hoe. International Symposium on Microarchitecture (MICRO), December 2007. (pdf)
- PAI: A Lightweight Mechanism for Single-Node Memory Recovery in DSM Servers. J. Kim, J. C. Smolens, B. Falsafi and J. C. Hoe. Pacific Rim International Symposium on Dependable Computing (PRDC), December 2007. (pdf)
- Detecting Emerging Wearout Faults. J. C. Smolens, B. T. Gold, J. C. Hoe, B. Falsafi, and K. Mai. Third Workshop on Silicon Errors in Logic - System Effects (SELSE), April 2007. (pdf)
- Reunion: Complexity-Effective Multicore Redundancy. J. C. Smolens, B. T. Gold, B. Falsafi, and J. C. Hoe. International Symposium on Microarchitecture (MICRO), December 2006. (pdf)
- TRUSS: Reliable, Scalable Server Architecture. B. T. Gold, J. C. Smolens, J. Kim, E. S. Chung, V. Liaskovitis, E. Nurvitadhi, B. Falsafi, J. C. Hoe, and A. G. Nowatzyk. IEEE Micro, Volume 25, Number 6, November/December 2005. (pdf)
- Fingerprinting: Bounding Soft-Error-Detection Latency and Bandwidth. J. C. Smolens, B. T. Gold, J. Kim, B. Falsafi, J. C. Hoe, and A. G. Nowatzyk. IEEE Micro, Volume 24, Number 6, November/December 2004. (pdf) (Note: Top Picks version of the ASPLOS 2004 paper.)
- Efficient Resource Sharing in Concurrent Error Detecting Superscalar Microarchitectures. J. C. Smolens, J. Kim, J. C. Hoe, and B. Falsafi. International Symposium on Microarchitecture (MICRO), November 2004. (pdf)
- Fingerprinting: Bounding Soft-Error Detection Latency and Bandwidth. J. C. Smolens, B. T. Gold, J. Kim, B. Falsafi, J. C. Hoe, and A. G. Nowatzyk. International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 2004. (pdf)
- Dual Use of Superscalar Datapath for Transient-Fault Detection and Recovery. J. Ray, J. C. Hoe and B. Falsafi. International Symposium on Microarchitecture (MICRO), December 2001. (pdf)
- Thesis
- Fingerprinting: Hash-Based Error Detection in Microprocessors. Jared Smolens, PhD, ECE/CMU, December 2007. (pdf)