
News

Apple releases a Rowhammer patch
Our Research in the Media
Lecture Videos & Course Materials
Here is a summary paper covering our recent memory systems research, as of January 2015:
- Onur Mutlu, Justin Meza, and Lavanya Subramanian,
  "The Main Memory System: Challenges and Opportunities"
  Invited Article in Communications of the Korean Institute of Information Scientists and Engineers (KIISE), 2015.
Here is a summary talk (video, pptx, pdf) and paper covering some of our recent memory systems research, presented at MemCon 2013 and IMW 2013. The full reference is:
- Onur Mutlu,
  "Memory Scaling: A Systems Architecture Perspective"
  Technical talk at MemCon 2013 (MEMCON), Santa Clara, CA, August 2013. Slides (pptx) (pdf) Video
I taught "Scalable Memory Systems" at the HiPEAC Summer School from July 14 to 20, 2013, in Fiuggi, Italy. You can find course slides and videos here.
I chaired the Technical Program Committee of MICRO 2012. Here are my Program Chair's Message and Program Chair's Remarks I presented at the conference. Here are the videos of the lightning session and the first keynote talk.

Some Recent Research Results

Recent Summary of Our Memory Systems Research (as of January 2015) -- The following paper summarizes our research in memory systems, as of January 2015.
- Onur Mutlu, Justin Meza, and Lavanya Subramanian,
  "The Main Memory System: Challenges and Opportunities"
  Invited Article in Communications of the Korean Institute of Information Scientists and Engineers (KIISE), 2015.
Brief Summary of Our Main Memory Systems Research (as of August 2013) -- The following talk (both video and slides) and the associated paper summarize our research in memory scaling, as of August 2013.
- Onur Mutlu,
  "Memory Scaling: A Systems Architecture Perspective"
  Technical talk at MemCon 2013 (MEMCON), Santa Clara, CA, August 2013. Slides (pptx) (pdf) Video
January 2012 -- Scalable, Energy-Efficient Memory Systems: Our ICAC 2011 paper, "Memory Power Management via Dynamic Voltage/Frequency Scaling", demonstrates that memory systems provisioned for high performance on memory-intensive applications are often overprovisioned for the many other applications that do not require as much memory bandwidth. Running memory at a lower frequency has minimal impact on the performance of these applications and also allows the operating voltage to be reduced, which significantly reduces memory system power and thus increases energy efficiency. We demonstrate a dynamic voltage/frequency scaling approach that observes memory bandwidth at runtime and scales the memory frequency and voltage with this bandwidth demand. Significantly, we evaluate this approach on a real server platform by adjusting memory controller timing registers in the Intel Nehalem to replicate the effect of dynamically adjustable memory frequency. Combined with an analytical model for power savings, we show that memory power can be reduced by 10.4% on average (20.5% max in one workload) with only a 0.17% impact on performance. You can view our slides here: pptx, pdf.
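As a rough illustration of bandwidth-driven frequency selection, here is a minimal sketch in Python. It is not the policy from the paper: the frequency steps, the bandwidth thresholds, and the names `pick_frequency` and `run_epochs` are all assumptions chosen for illustration.

```python
# Hypothetical memory frequency steps (MT/s) and bandwidth thresholds (GB/s).
# Neither list is taken from the paper; they only illustrate the idea.
FREQ_STEPS = [800, 1066, 1333]
THRESHOLDS_GBPS = [2.0, 4.0]  # below 2.0 -> 800, below 4.0 -> 1066, else 1333


def pick_frequency(bandwidth_gbps: float) -> int:
    """Return the lowest frequency step whose threshold covers the demand."""
    for freq, limit in zip(FREQ_STEPS, THRESHOLDS_GBPS):
        if bandwidth_gbps < limit:
            return freq
    return FREQ_STEPS[-1]  # demand exceeds every threshold: run at full speed


def run_epochs(bandwidth_trace):
    """Pick a frequency for each epoch from its observed bandwidth."""
    return [pick_frequency(bw) for bw in bandwidth_trace]


print(run_epochs([0.5, 3.0, 10.0]))
```

A real controller would also lower the voltage along with the frequency and would likely add hysteresis so that it does not oscillate between steps.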

December 2011 -- Scalable Memory Systems: Our latest work on memory interference handling, "Reducing Memory Interference in Multicore Systems via Application-Aware Memory Channel Partitioning", was presented at MICRO 2011. You can view our slides here.

Inter-application interference at the main memory is a major impediment to individual application and system performance. Many past works, including ours, have addressed this problem by application-aware request reordering in the memory controller. This paper presents a fundamentally different approach: application-aware Memory Channel Partitioning (MCP). The key idea of MCP is to map the data of badly-interfering applications to different memory channels. MCP performs slightly better than the current best memory request scheduling policy while requiring no changes to the memory controller. We also observe that inter-application interference can be mitigated even further by combining memory channel partitioning with request scheduling. We propose an Integrated Memory Partitioning and Scheduling (IMPS) mechanism that improves system performance over the current best memory request scheduler while incurring minimal hardware complexity.
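The channel-partitioning idea can be sketched as follows. This is a simplified illustration, not the MCP algorithm itself: grouping applications by memory intensity (misses per kilo-instruction) and row-buffer hit rate, the threshold values, the round-robin channel split, and the names `classify` and `partition_channels` are all assumptions.

```python
# Hypothetical application groups; each non-empty group gets its own channels.
GROUP_ORDER = ["low_intensity", "low_rbl", "high_rbl"]


def classify(mpki, row_hit_rate, mpki_thresh=10.0, hit_thresh=0.5):
    """Assign an application to a group (thresholds are illustrative)."""
    if mpki < mpki_thresh:
        return "low_intensity"
    return "high_rbl" if row_hit_rate >= hit_thresh else "low_rbl"


def partition_channels(apps, num_channels):
    """apps: name -> (mpki, row_hit_rate). Returns name -> list of channels."""
    groups = {}
    for name, (mpki, row_hit_rate) in apps.items():
        groups.setdefault(classify(mpki, row_hit_rate), []).append(name)
    present = [g for g in GROUP_ORDER if g in groups]
    if not present:
        return {}
    if num_channels < len(present):
        raise ValueError("need at least one channel per group")
    # Deal channels round-robin so that groups never share a channel.
    channels = {g: [] for g in present}
    for ch in range(num_channels):
        channels[present[ch % len(present)]].append(ch)
    return {name: channels[g] for g in present for name in groups[g]}


apps = {"a": (1.0, 0.9), "b": (50.0, 0.9), "c": (50.0, 0.1)}
print(partition_channels(apps, 4))
```

In a real system the mapping would be enforced by the operating system's page allocator, which is why no memory controller change is needed.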

October 2011 -- Efficient Cache Management: Caches are critical to performance in modern microprocessors/systems. Unfortunately, not all blocks inserted into the cache are reused later, which substantially reduces the performance benefit of a cache. We propose a new mechanism, VTS-cache, that predicts how likely a missed block is to be reused if it is inserted into the cache, and uses this prediction to decide at what position in the cache the block should be inserted. VTS-cache uses the recency of eviction of a block to predict its future reuse behavior. We provide a practical, low-cost implementation of VTS-cache that does not modify the existing cache structure. Our technical report, "Improving Cache Performance using Victim Tag Stores," describes the mechanism and shows that VTS-cache outperforms five state-of-the-art proposals.
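The insertion decision can be illustrated with a toy model of a single cache set. This is a minimal sketch under several assumptions, not the implementation from the technical report: the victim tag store is modeled as a plain bounded FIFO, the set uses LRU replacement, and the class name `VTSCacheSet` is hypothetical.

```python
from collections import deque


class VTSCacheSet:
    """Toy single-set cache with a victim tag store (VTS) of evicted tags."""

    def __init__(self, ways, vts_size):
        self.ways = ways
        self.blocks = []                   # index 0 = LRU end, last = MRU end
        self.vts = deque(maxlen=vts_size)  # FIFO of recently evicted tags

    def access(self, tag):
        """Return True on a hit, False on a miss."""
        if tag in self.blocks:
            self.blocks.remove(tag)
            self.blocks.append(tag)        # promote to MRU on a hit
            return False if False else True
        # Miss: a tag still in the VTS was evicted recently, so predict reuse.
        predicted_reuse = tag in self.vts
        if predicted_reuse:
            self.vts.remove(tag)
        if len(self.blocks) == self.ways:
            self.vts.append(self.blocks.pop(0))  # evict LRU, remember its tag
        if predicted_reuse:
            self.blocks.append(tag)        # insert at the MRU position
        else:
            self.blocks.insert(0, tag)     # insert at the LRU position
        return False


s = VTSCacheSet(ways=2, vts_size=4)
for t in ["A", "B", "C", "B", "B"]:
    print(t, "hit" if s.access(t) else "miss")
```

In this sketch, a block that misses for the first time is inserted at the LRU position and is evicted quickly unless it is reused, while a block that returns soon after eviction is inserted at the MRU position.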
October 2011 -- Energy-efficient Communication Substrates: We are designing and evaluating new on-chip interconnects that provide high performance with very simple router hardware and low energy and area overhead. In our recent technical report, "A High-Performance Hierarchical Ring On-Chip Interconnect with Low-Cost Routers," we show that an on-chip hierarchy of rings provides nearly the same performance as a baseline high-performance mesh network, while using very simple ring routers and minimal buffering. The key insight is to use a high-bandwidth global ring to join several smaller local rings, and to connect the rings with simple transfer or "bridge" routers. The global ring allows quick cross-chip journeys and alleviates the interference seen in both meshes and single-ring designs. Our technical report provides solutions to ensure forward progress in such a network (i.e., avoid livelock and deadlock), and demonstrates with synthesis-based hardware modeling that such designs are practical.

October 2011 -- Energy-efficient Communication Substrates: We are investigating ways to further improve the performance and energy efficiency of bufferless deflection-based on-chip interconnects. Our recent technical report, "MinBD: A Minimally-Buffered Deflection Router Approaching Conventional Buffered-Router Performance," presents a design that comes within 4.6% of the performance of conventional interconnects with large virtual-channel buffers. It does so by using primarily deflection routing to handle contention, with only a small buffer per router to assist under high load. We show that by addressing several simple yet important bottlenecks in earlier bufferless deflection routers (using our CHIPPER work in HPCA 2011 as an example), bufferless or minimally-buffered deflection routers show very good performance and energy efficiency. Finally, this report shows that our router design principles are applicable to high-radix routers as well, and that such routers can still provide energy savings in networks where performance degradation is already addressed by data-locality mapping techniques.
September 2011 -- Heterogeneous Main Memory with New Technologies: We are designing heterogeneous main memory systems that consist of multiple memory technologies, e.g., Phase Change Memory (PCM) and DRAM, to achieve high energy efficiency and overcome DRAM technology scaling challenges. We have developed a new way of managing data placement in such a memory system with the goal of achieving the best characteristics of both PCM and DRAM technologies. The main idea of our work is to dynamically identify data that frequently miss in the row buffer and place them in DRAM, while placing data that mostly hit in the row buffer in PCM. The key insight behind this approach is that data which generally hit in the row buffer can take advantage of the large memory capacity that PCM has to offer, and still be accessed as quickly as if the data were placed in DRAM. Our technical report, "Row Buffer Locality-Aware Data Placement in Hybrid Memories," describes our mechanism and results in detail.
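The placement rule described above can be sketched in a few lines. This is only an illustration under assumed details: the per-page miss counter, the `miss_threshold` value, and the function name `place_pages` are hypothetical and are not taken from the technical report.

```python
from collections import Counter


def place_pages(accesses, miss_threshold=2):
    """Decide a placement for each page from its row buffer behavior.

    accesses: iterable of (page, row_buffer_hit) pairs.
    Returns a dict mapping page -> "DRAM" or "PCM".
    """
    misses = Counter()
    pages = set()
    for page, row_buffer_hit in accesses:
        pages.add(page)
        if not row_buffer_hit:
            misses[page] += 1
    # Pages that often miss in the row buffer go to DRAM; the rest stay in PCM.
    return {
        page: ("DRAM" if misses[page] >= miss_threshold else "PCM")
        for page in pages
    }


trace = [("p0", False), ("p0", False), ("p1", True), ("p1", True), ("p1", False)]
print(place_pages(trace))
```

A real mechanism would update these decisions periodically at runtime and would account for the cost of migrating a page between the two memories.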
February 2011 -- Energy-efficient Communication Substrates: Our latest work on efficient router design, "CHIPPER: A Low-Complexity Bufferless Deflection Router," appeared at HPCA 2011. Paper (pdf) Slides (pptx)
We are designing energy-efficient communication substrates to enable the scaling of a parallel multiprocessor to a large number of nodes under a given power budget. To this end, we are examining the design of very efficient routers. The paper presents a simple bufferless deflection router that is competitive in operating frequency with buffered routers. It solves two key issues in deflection router design, livelock freedom and packet reassembly, with simple mechanisms.
December 2010 -- Scalable Memory Controllers: Our latest memory scheduling algorithm, "Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior," appeared at MICRO 2010. Paper (pdf) Slides (pptx)
This work was selected as one of the Top Picks in Computer Architecture of 2010 by IEEE Micro. Top Picks paper
Memory schedulers in multi-core systems should carefully schedule memory requests from different threads to ensure high system performance and fast, fair progress of each thread. The paper provides an application-aware memory access scheduling algorithm that maximizes system throughput and fairness at the same time, outperforming all previous algorithms in both metrics. The main idea is to dynamically divide threads into two separate clusters (latency-sensitive and bandwidth-sensitive) and employ different memory request scheduling policies in each cluster such that the needs of different kinds of threads are served separately.
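The clustering step can be illustrated with a short sketch. It shows one plausible way to split threads into two clusters and is not the algorithm as specified in the paper: sorting by misses per kilo-instruction (MPKI), the cumulative bandwidth rule, the `cluster_thresh` value, and the name `cluster_threads` are assumptions made for illustration.

```python
def cluster_threads(threads, cluster_thresh=0.2):
    """Split threads into latency-sensitive and bandwidth-sensitive clusters.

    threads: dict mapping name -> (mpki, bandwidth_usage).
    Threads are visited from least to most memory-intensive and added to the
    latency-sensitive cluster while that cluster's total bandwidth stays within
    cluster_thresh of the overall bandwidth usage.
    """
    total_bw = sum(bw for _, bw in threads.values())
    latency_sensitive, bandwidth_sensitive = [], []
    used = 0.0
    for name, (mpki, bw) in sorted(threads.items(), key=lambda kv: kv[1][0]):
        fits = used + bw <= cluster_thresh * total_bw
        if not bandwidth_sensitive and fits:
            latency_sensitive.append(name)
            used += bw
        else:
            bandwidth_sensitive.append(name)
    return latency_sensitive, bandwidth_sensitive


threads = {"t0": (0.5, 1.0), "t1": (1.0, 1.0), "t2": (20.0, 40.0), "t3": (30.0, 58.0)}
print(cluster_threads(threads))
```

Each cluster would then be scheduled with its own policy, for example strict prioritization of the latency-sensitive cluster, which this sketch does not model.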
December 2010 -- General-purpose Graphics Processors: We have devised new microarchitectural techniques to overcome performance loss due to branch divergence and memory latency in GPU architectures, described in "Improving GPU Performance via Large Warps and Two-Level Warp Scheduling."
The paper describes two main ideas: 1) large warps, from whose active threads SIMD-width-sized warps are dynamically created, thereby improving functional unit utilization in the presence of branch divergence, and 2) two-level warp scheduling, which batches warps and prioritizes one batch at a time: when a batch is stalled on memory, another batch is likely to be in its computation phase, so memory stalls are tolerated.
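The second idea, two-level warp scheduling, can be sketched as a selection function. This is a simplified illustration rather than the hardware scheduler from the paper: the batch layout, the `stalled` set, and the name `pick_warp` are assumptions.

```python
def pick_warp(batches, active, stalled):
    """Pick the next warp to issue under a two-level policy.

    batches: list of lists of warp ids (one inner list per batch).
    active:  index of the batch that currently has priority.
    stalled: set of warp ids waiting on memory.
    Returns (batch_index, warp_id), or (active, None) if every warp is stalled.
    """
    n = len(batches)
    for offset in range(n):
        idx = (active + offset) % n    # try the active batch first, then others
        for warp in batches[idx]:
            if warp not in stalled:
                return idx, warp
    return active, None


batches = [[0, 1], [2, 3]]
print(pick_warp(batches, active=0, stalled=set()))
print(pick_warp(batches, active=0, stalled={0, 1}))
```

When every warp in the prioritized batch is waiting on memory, the function falls through to the next batch, which models how one batch's computation can overlap another batch's memory stalls.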
Our research received several recognitions in 2011, 2010, and 2009:
