News on Some Recent Research Results
- January 2012 -- Scalable, Energy-Efficient Memory Systems: Our
ICAC 2011 paper, "Memory Power
Management via Dynamic Voltage/Frequency Scaling",
demonstrates that memory systems which are provisioned for high
performance with memory-intensive applications are often overkill
for many other applications which do not require as much memory
bandwidth. Running memory at a lower frequency has a minimal impact
on the performance of these applications, and also allows for an
operating voltage reduction, which significantly reduces memory
system power and thus increases energy efficiency. We demonstrate a
dynamic voltage/frequency scaling approach to increasing memory
system energy efficiency which observes memory bandwidth at runtime
and scales the memory frequency and voltage with this bandwidth
demand. Significantly, we evaluate this approach on a real server platform,
using memory controller timing registers in the Intel Nehalem to
replicate the effect of dynamically adjustable memory frequency.
Combined with an analytical model for power savings, we show that
memory power can be reduced by 10.4% on average (20.5% max in one
workload) with only a 0.17% impact on performance. You can view our slides
here: pptx, pdf.
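
To make the control idea above concrete, here is a minimal sketch of a bandwidth-driven frequency selector. It is an illustration only, not the algorithm evaluated in the paper: the frequency steps, the bandwidth limits, and the names `select_memory_frequency` and `plan_epochs` are all assumptions chosen for the example.

```python
# Illustrative sketch only; the frequency steps and bandwidth limits
# below are hypothetical and are not taken from the paper.
FREQ_STEPS_MHZ = [800, 1066, 1333]   # available memory frequencies, low to high
BW_LIMITS_GBPS = [2.0, 4.0]          # use step i while demand stays below limit i


def select_memory_frequency(measured_bw_gbps):
    """Return the lowest frequency step whose bandwidth limit covers the demand."""
    for freq, limit in zip(FREQ_STEPS_MHZ, BW_LIMITS_GBPS):
        if measured_bw_gbps < limit:
            return freq
    return FREQ_STEPS_MHZ[-1]  # high demand: run memory at full speed


def plan_epochs(bandwidth_samples_gbps):
    """Choose a frequency for each epoch from the bandwidth observed in it."""
    return [select_memory_frequency(bw) for bw in bandwidth_samples_gbps]
```

For example, `plan_epochs([0.5, 3.0, 9.0])` picks the lowest step for the lightly loaded epoch and the highest step for the bandwidth-hungry one. A real controller would also lower the voltage along with the frequency, which this sketch does not model.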
- December 2011 -- Scalable Memory
Systems: Our latest work on memory interference handling,
"Reducing Memory Interference in
Multicore Systems via Application-Aware Memory Channel
Partitioning", was presented at MICRO 2011. You can view our
slides here.
Inter-application interference at the main memory is a major impediment to
individual application and system performance. Many past works, including ours,
have addressed this problem through application-aware request reordering in the memory controller. This
paper presents a fundamentally different approach to this
problem: application-aware Memory Channel Partitioning (MCP). The key idea of
MCP is to map the data of badly-interfering applications to different memory
channels. MCP performs slightly better than the current best memory request
scheduling policy while involving no changes to the memory controller. We also
observe that inter-application interference can be mitigated even better with a
combination of memory channel partitioning and request scheduling. We propose an
Integrated Memory Partitioning and Scheduling (IMPS) mechanism that improves
system performance over the current best memory request scheduler, while
incurring minimal hardware complexity.
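
The key idea of mapping badly-interfering applications to different channels can be sketched as follows. This is a simplified illustration under stated assumptions, not the MCP algorithm from the paper: it assumes a single hypothetical per-application memory intensity metric, a fixed `threshold` separating light from heavy applications, and a proportional split of channels between the two groups.

```python
def partition_channels(intensity, num_channels, threshold):
    """Map each application to the list of memory channels it may use.

    intensity: dict of application name -> memory intensity (hypothetical metric).
    threshold: assumed cutoff separating low- from high-intensity applications.
    """
    low = [app for app, x in intensity.items() if x < threshold]
    high = [app for app, x in intensity.items() if x >= threshold]
    all_channels = list(range(num_channels))

    # With only one group (or one channel) there is nothing to separate.
    if not low or not high or num_channels < 2:
        return {app: all_channels for app in intensity}

    # Give the low-intensity group a share of channels proportional to its
    # share of total intensity, but always at least one channel per group.
    total = sum(intensity.values())
    low_share = sum(intensity[app] for app in low) / total
    n_low = min(num_channels - 1, max(1, round(low_share * num_channels)))

    low_channels = all_channels[:n_low]
    high_channels = all_channels[n_low:]
    mapping = {app: low_channels for app in low}
    mapping.update({app: high_channels for app in high})
    return mapping
```

In a real system the operating system would enforce such a mapping through page allocation, which is why no memory controller changes are needed; the sketch only computes the mapping.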
- October 2011 -- Efficient Cache
Management: Caches are critical to performance in modern
microprocessors and systems. Unfortunately, not all blocks inserted
into the cache are reused later, which significantly degrades the
performance benefit of the cache. We propose a new mechanism,
VTS-cache, to predict how likely it is that a missed block will
be reused if it is inserted into the cache and use this
prediction to decide at what location in the cache the block
should be inserted. VTS-cache uses the recency of eviction of a
block to predict its future reuse behavior. We provide a
practical, low-cost implementation of VTS-cache, without
modifying the existing cache structure. Our technical
report, "Improving
Cache Performance using Victim Tag Stores," describes the
mechanism and shows that VTS-cache outperforms five
state-of-the-art proposals.
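
A minimal sketch of the insertion decision described above, for a single cache set. It is illustrative only: it assumes that a missed block found in the victim tag store (that is, recently evicted) is predicted to be reused and inserted at the most-recently-used position, while any other missed block is inserted at the least-recently-used position. The class name `VTSCacheSet` and the plain FIFO used for the tag store are assumptions of this example, not the implementation in the report.

```python
from collections import OrderedDict, deque


class VTSCacheSet:
    """One LRU-managed cache set plus a small victim tag store (illustrative)."""

    def __init__(self, ways, vts_entries):
        self.ways = ways
        self.blocks = OrderedDict()           # first key = LRU, last key = MRU
        self.vts = deque(maxlen=vts_entries)  # FIFO of recently evicted tags

    def access(self, tag):
        """Return True on a hit, False on a miss (the block is then inserted)."""
        if tag in self.blocks:
            self.blocks.move_to_end(tag)      # promote to MRU on a hit
            return True

        # Predict reuse from recency of eviction before the store is updated.
        predicted_reuse = tag in self.vts
        if predicted_reuse:
            self.vts.remove(tag)

        if len(self.blocks) >= self.ways:
            victim, _ = self.blocks.popitem(last=False)  # evict the LRU block
            self.vts.append(victim)                      # remember its tag

        self.blocks[tag] = None               # new key lands at the MRU end
        if not predicted_reuse:
            self.blocks.move_to_end(tag, last=False)     # demote to LRU
        return False
```

With this policy, a block that is touched once and never again leaves the set quickly, while a block that returns shortly after being evicted is protected on its second insertion.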
- October 2011 -- Energy-efficient Communication
Substrates: We are designing and evaluating new on-chip
interconnects that provide high performance with very
simple router hardware and low energy and area overhead. In our
recent technical
report,
"A High-Performance Hierarchical Ring On-Chip Interconnect with
Low-Cost Routers," we show that a hierarchy of rings on-chip
provides nearly the same performance as a baseline high-performance
mesh network, while using very simple ring routers and minimal
buffering. The key insight is to use a high-bandwidth global ring to
join several smaller local rings, and connect rings with simple
transfer or "bridge" routers. The global ring allows quick cross-chip
journeys and alleviates interference seen in both meshes and
single-ring designs. Our technical report provides solutions to ensure
forward progress in such a network (i.e., avoid livelock and
deadlock), and demonstrates with synthesis-based hardware modeling
that such designs are practical.
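
As a rough illustration of why a global ring shortens cross-chip journeys, the sketch below compares hop counts in a single ring against a two-level hierarchy. The topology details are assumptions made for the example and are not taken from the report: rings are bidirectional, each local ring has one bridge at position 0, nodes are numbered consecutively ring by ring, and crossing a bridge costs no extra hops.

```python
def ring_distance(a, b, size):
    """Shortest hop count between positions a and b on a bidirectional ring."""
    d = abs(a - b) % size
    return min(d, size - d)


def single_ring_hops(src, dst, total_nodes):
    """Hops between two nodes when all nodes sit on one large ring."""
    return ring_distance(src, dst, total_nodes)


def hierarchical_hops(src, dst, nodes_per_ring, num_rings):
    """Hops in a two-level hierarchy: local rings joined by one global ring."""
    src_ring, src_pos = divmod(src, nodes_per_ring)
    dst_ring, dst_pos = divmod(dst, nodes_per_ring)
    if src_ring == dst_ring:
        return ring_distance(src_pos, dst_pos, nodes_per_ring)
    # Travel to the local bridge, cross the global ring, then reach the target.
    return (ring_distance(src_pos, 0, nodes_per_ring)
            + ring_distance(src_ring, dst_ring, num_rings)
            + ring_distance(0, dst_pos, nodes_per_ring))
```

With 64 nodes arranged as 8 local rings of 8 nodes, `hierarchical_hops(4, 36, 8, 8)` is 12, whereas `single_ring_hops(4, 36, 64)` is 32.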
- October 2011 -- Energy-efficient
Communication Substrates: We are investigating ways to
improve the performance and energy efficiency of bufferless
deflection-based on-chip interconnects further. Our recent technical
report
"MinBD: A Minimally-Buffered Deflection Router Approaching
Conventional Buffered-Router Performance" presents a design that
comes within 4.6% of the performance of conventional
interconnects with large virtual-channel buffers. It relies
primarily on deflection routing to handle contention, with only a small
buffer per router to assist under high load. We show that by addressing
several simple yet important bottlenecks in earlier bufferless
deflection routers (using our CHIPPER work in HPCA 2011 as an
example), bufferless or minimally-buffered deflection routers show
very good performance and energy efficiency. Finally, this report
shows that our router design principles are applicable to high-radix
routers as well, and that such routers can still provide energy
savings in networks where performance degradation is already
addressed by data-locality mapping techniques.
- September 2011 -- Heterogeneous Main
Memory with New Technologies: We are designing heterogeneous
main memory systems that consist of multiple memory technologies,
e.g., Phase Change Memory (PCM) and DRAM, to achieve high energy
efficiency and overcome DRAM technology scaling challenges. We have
developed a new way of managing data placement in such a memory system
with the goal of achieving the best characteristics of both PCM and
DRAM technologies. The main idea of our work is to dynamically
identify data that frequently miss in the row buffer and place them
in DRAM, while placing data that do not in PCM. The key insight behind this
approach is that data which generally hit in the row buffer can take
advantage of the large memory capacity that PCM has to offer, and
still be accessed as quickly as if the data were placed in DRAM. Our
technical
report, "Row
Buffer Locality-Aware Data Placement in Hybrid Memories,"
describes our mechanism and results in detail.
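
The placement idea can be sketched as a per-page counter of row buffer misses. This is a simplified illustration, not the mechanism from the report: the class name `RowBufferAwarePlacer`, the fixed `miss_threshold`, and the page-granularity bookkeeping are assumptions of the example, and the sketch ignores DRAM capacity limits and migration cost.

```python
from collections import Counter


class RowBufferAwarePlacer:
    """Track row buffer misses per page and pick DRAM or PCM (illustrative)."""

    def __init__(self, miss_threshold):
        self.miss_threshold = miss_threshold  # assumed tuning knob
        self.row_buffer_misses = Counter()
        self.in_dram = set()

    def record_access(self, page, row_buffer_hit):
        """Update statistics for one memory access to the given page."""
        if row_buffer_hit:
            return  # row buffer hits are served quickly in either technology
        self.row_buffer_misses[page] += 1
        if self.row_buffer_misses[page] >= self.miss_threshold:
            self.in_dram.add(page)  # frequent misses: keep the page in DRAM

    def location(self, page):
        """Pages default to PCM; only frequently missing pages live in DRAM."""
        return "DRAM" if page in self.in_dram else "PCM"
```

A page that mostly hits in the row buffer never reaches the threshold and stays in PCM, where it benefits from the larger capacity.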
- February 2011 -- Energy-efficient
Communication Substrates: Our latest work on efficient router
design, "CHIPPER: A
Low-Complexity Bufferless Deflection Router," appeared at HPCA
2011. Paper
(pdf) Slides (pptx)
We are designing energy-efficient communication
substrates to enable the scaling of a parallel multiprocessor to a
large number of nodes under a given power budget. To this end, we are
examining the design of very efficient routers. This paper designs a
simple bufferless deflection router that is competitive in operating
frequency with buffered routers. It solves two key issues in
deflection router design, livelock freedom and packet reassembly, with
simple mechanisms.
- December 2010 -- Scalable Memory
Controllers: Our latest memory scheduling
algorithm, "Thread Cluster Memory
Scheduling: Exploiting Differences in Memory Access Behavior"
appeared at MICRO 2010. Paper
(pdf) Slides (pptx)
This work was selected as one of the Top Picks in Computer Architecture of 2010 by IEEE Micro. Top Picks paper
Memory schedulers in multi-core systems should carefully schedule memory requests from different threads to ensure high system performance and fast, fair progress of each thread. The paper provides an application-aware memory access scheduling algorithm that maximizes system throughput and fairness at the same time, outperforming all previous algorithms in both metrics. The main idea is to dynamically divide threads into two separate clusters (latency-sensitive and bandwidth-sensitive) and employ different memory request scheduling policies in each cluster such that the needs of different kinds of threads are served separately.
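
The clustering step can be sketched as below. This is an illustration under assumptions rather than the algorithm from the paper: it uses a single hypothetical per-thread bandwidth usage metric and an assumed knob, `cluster_fraction`, that caps the share of total bandwidth the latency-sensitive cluster may consume. The per-cluster scheduling policies are not modeled.

```python
def cluster_threads(bandwidth_usage, cluster_fraction):
    """Split threads into (latency_sensitive, bandwidth_sensitive) lists.

    bandwidth_usage: dict of thread id -> measured bandwidth usage
                     (hypothetical units).
    cluster_fraction: assumed fraction of total bandwidth usage that the
                      latency-sensitive cluster is allowed to account for.
    """
    budget = cluster_fraction * sum(bandwidth_usage.values())
    latency_sensitive, bandwidth_sensitive = [], []
    used = 0.0
    # Visit the least memory-intensive threads first.
    for thread in sorted(bandwidth_usage, key=bandwidth_usage.get):
        usage = bandwidth_usage[thread]
        if used + usage <= budget:
            latency_sensitive.append(thread)
            used += usage
        else:
            bandwidth_sensitive.append(thread)
    return latency_sensitive, bandwidth_sensitive
```

For example, `cluster_threads({"t0": 1, "t1": 2, "t2": 40, "t3": 57}, 0.1)` places the two light threads in the latency-sensitive cluster and the two heavy threads in the bandwidth-sensitive cluster.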
- December 2010 -- General-purpose Graphics
Processors: We have devised new microarchitectural
techniques to overcome performance loss due to branch divergence
and memory latency in GPU architectures, described
in "Improving
GPU Performance via Large Warps and Two-Level Warp
Scheduling."
The paper describes two main ideas: 1) forming large warps
and dynamically creating SIMD-width-sized sub-warps from the active
threads of each large warp, thereby improving functional unit
utilization in the presence of branch divergence; and 2) two-level
warp scheduling, which batches warps and prioritizes one batch at a
time: when a batch is stalled on memory, another batch is likely
to be in its computation phase, thereby tolerating memory stalls.
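
The first idea, packing the active threads of a large warp into SIMD-width sub-warps, can be sketched as follows. The sketch is illustrative and rests on an assumption not stated above: each thread is tied to a fixed SIMD lane, so a sub-warp may take at most one active thread per lane. The function name `form_subwarps` and the row/lane layout of the active mask are choices made for this example.

```python
def form_subwarps(active_mask):
    """Pack active threads of a large warp into SIMD-width sub-warps.

    active_mask: list of rows; active_mask[row][lane] is True when the thread
                 at that row and SIMD lane is active.
    Returns a list of sub-warps. Each sub-warp lists, per lane, the row of the
    thread chosen for that lane, or None if the lane stays idle.
    """
    if not active_mask:
        return []
    simd_width = len(active_mask[0])
    # For every lane, queue up the rows that hold an active thread.
    pending = [
        [row for row, lanes in enumerate(active_mask) if lanes[lane]]
        for lane in range(simd_width)
    ]
    subwarps = []
    while any(pending):  # stop once every lane's queue is empty
        subwarps.append([queue.pop(0) if queue else None for queue in pending])
    return subwarps
```

With three partially active rows such as `[[True, False, True, False], [False, True, False, False], [True, True, False, False]]`, the active threads fit into two sub-warps instead of occupying three issue slots.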
- Our research received several recognitions in 2011, 2010, and 2009:
- IEEE Computer Society TCCA Young Computer Architect Award, 2011.
- ASPLOS 2010 Best Paper Award (ACM International Conference on Architectural Support for Programming Languages and Operating Systems), 2010. paper
- VTS 2010 Best Paper Award (IEEE VLSI Test Symposium), 2011. paper
- HPCA 2010 Best Paper Session Selection (IEEE International Symposium on High-Performance Computer Architecture), 2010. paper
- MICRO 2010 paper on memory scheduling selected to IEEE Micro's "Top Picks from Computer Architecture Conferences" issue, 2011. paper
- ISCA 2010 paper on on-chip network packet scheduling selected to IEEE Micro's "Top Picks from Computer Architecture Conferences" issue, 2011. paper
- ISCA 2010 paper on staged execution on multi-core systems selected to IEEE Micro's "Top Picks from Computer Architecture Conferences" issue, 2011. paper
- ISCA 2009 paper on using Phase Change Memory as main memory selected as CACM Research Highlight, Special Invited Feature Article in the Communications of the ACM, 2010. paper
- National Science Foundation CAREER Award, 2010.
- ISCA 2009 paper on using Phase Change Memory as main memory selected to IEEE Micro's "Top Picks from Computer Architecture Conferences" issue, 2010. paper
- ASPLOS 2009 paper on critical section acceleration selected to IEEE Micro's "Top Picks from Computer Architecture Conferences" issue, 2010. paper
- HPCA 2009 Best Paper Session Selection (IEEE International Symposium on High-Performance Computer Architecture), 2009. paper