# Reliability (and Security) Issues of DRAM and NAND Flash Scaling

Onur Mutlu omutlu@gmail.com http://users.ece.cmu.edu/~omutlu/

HPCA Memory Reliability Workshop March 13, 2016



SAFARI

#### Limits of Charge Memory

- Difficult charge placement and control
  - Flash: floating gate charge
  - DRAM: capacitor charge, transistor leakage
- Reliable sensing, data retention, and charge control become more difficult as charge storage unit size reduces





#### DRAM Scaling Issues

- DRAM RowHammer Problem
- Some Other DRAM Reliability Studies

#### NAND Flash Scaling Issues

- Some NAND Flash Reliability Studies
- Read Disturb Errors in NAND Flash Memory
- Summary and Discussion

### The DRAM Scaling Problem

- DRAM stores charge in a capacitor (charge-based memory)
  - Capacitor must be large enough for reliable sensing
  - Access transistor should be large enough for low leakage and high retention time
  - Scaling beyond 40-35nm (2013) is challenging [ITRS, 2009]



As a DRAM cell becomes smaller, it becomes more vulnerable






Repeatedly opening and closing a row enough times within a refresh interval induces **disturbance errors** in adjacent rows in **most real DRAM chips you can buy today** 

<u>Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors</u> (Kim et al., ISCA 2014)

## Most DRAM Modules Are Vulnerable



**B** company









| Up to                          | Up to                      | Up to                          |
|--------------------------------|----------------------------|--------------------------------|
| **1.0×10<sup>7</sup>** errors  | 2.7×10<sup>6</sup> errors  | **3.3×10<sup>5</sup>** errors  |

Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors, (Kim et al., ISCA 2014)

### Recent DRAM Is More Vulnerable



All modules from 2012–2013 are vulnerable

### A Simple Program Can Induce Many Errors



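    loop:
      mov (X), %eax    # read address X
      mov (Y), %ebx    # read address Y
      clflush (X)      # evict X from the cache
      clflush (Y)      # evict Y from the cache
      mfence           # order the flushes before the next reads
      jmp loop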



Download from: <a href="https://github.com/CMU-SAFARI/rowhammer">https://github.com/CMU-SAFARI/rowhammer</a>

### Observed Errors in Real Systems

| CPU Architecture          | Errors | Access-Rate |
|---------------------------|--------|-------------|
| Intel Haswell (2013)      | 22.9K  | 12.3M/sec   |
| Intel Ivy Bridge (2012)   | 20.7K  | 11.7M/sec   |
| Intel Sandy Bridge (2011) | 16.1K  | 11.6M/sec   |
| AMD Piledriver (2012)     | 59     | 6.1M/sec    |

- A real reliability & security issue
- In a more controlled environment, we can induce as many as ten million disturbance errors


Kim+, "Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors," ISCA 2014.

#### One Can Take Over an Otherwise-Secure System

#### Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors

Abstract. Memory isolation is a key property of a reliable and secure computing system — an access to one memory address should not have unintended side effects on data stored in other addresses. However, as DRAM process technology

# Project Zero

Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors (Kim et al., ISCA 2014)

News and updates from the Project Zero team at Google

Exploiting the DRAM rowhammer bug to gain kernel privileges (Seaborn, 2015)

Monday, March 9, 2015


### RowHammer Security Attack Example

- "Rowhammer" is a problem with some recent DRAM devices in which repeatedly accessing a row of memory can cause bit flips in adjacent rows (Kim et al., ISCA 2014).
  - Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors (Kim et al., ISCA 2014)
- We tested a selection of laptops and found that a subset of them exhibited the problem.
- We built two working privilege escalation exploits that use this effect.
  - Exploiting the DRAM rowhammer bug to gain kernel privileges (Seaborn, 2015)
- One exploit uses rowhammer-induced bit flips to gain kernel privileges on x86-64 Linux when run as an unprivileged userland process.
- When run on a machine vulnerable to the rowhammer problem, the process was able to induce bit flips in page table entries (PTEs).
- It was able to use this to gain write access to its own page table, and hence gain read-write access to all of physical memory.
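To make the page-table attack concrete, the following minimal Python sketch shows how a single flipped bit in the frame-number field of a page table entry redirects a mapping to a different physical frame. It assumes the standard x86-64 layout for 4 KB pages (frame number in bits 12 to 51); the example PTE value and the flipped bit position are hypothetical, and the sketch only manipulates integers, not real memory.

```python
# Illustrative only: models a 64-bit x86-64 page table entry (PTE) as an integer.
PAGE_SHIFT = 12  # 4 KB pages
PFN_MASK = ((1 << 52) - 1) & ~((1 << PAGE_SHIFT) - 1)  # bits 12..51


def frame_number(pte: int) -> int:
    """Return the physical frame number stored in a PTE."""
    return (pte & PFN_MASK) >> PAGE_SHIFT


def flip_bit(value: int, bit: int) -> int:
    """Model a disturbance error by inverting one bit."""
    return value ^ (1 << bit)


# Hypothetical PTE: frame 0x12345 with flags present | writable | user (0x7).
pte_before = (0x12345 << PAGE_SHIFT) | 0x7
# Hypothetical flipped bit that falls inside the frame-number field.
pte_after = flip_bit(pte_before, 20)

print(hex(frame_number(pte_before)), hex(frame_number(pte_after)))
```

In this sketch the flags are untouched while the entry now points at another frame. If that frame happens to hold one of the process's own page tables, the process can rewrite its mappings, which is the effect the exploit relies on.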

#### Security Implications



It's like breaking into an apartment by repeatedly slamming a neighbor's door until the vibrations open the door you were after

### Selected Readings on RowHammer

- Our first detailed study: Rowhammer analysis and solutions
  - Yoongu Kim, Ross Daly, Jeremie Kim, Chris Fallin, Ji Hye Lee, Donghyuk Lee, Chris Wilkerson, Konrad Lai, and Onur Mutlu,
     "Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors"
     Proceedings of the <u>41st International Symposium on Computer Architecture</u> (ISCA), Minneapolis, MN, June 2014. [Slides (pptx) (pdf)] [ Lightning Session Slides (pptx) (pdf)] [Source Code and Data]
- Our Source Code to Induce Errors in Modern DRAM Chips
  - https://github.com/CMU-SAFARI/rowhammer
- Google Project Zero's Attack to Take Over a System (March 2015)
  - Exploiting the DRAM rowhammer bug to gain kernel privileges (Seaborn, 2015)
  - <u>https://github.com/google/rowhammer-test</u>
- Remote RowHammer Attacks via JavaScript (July 2015)
  - http://arxiv.org/abs/1507.06955
  - <u>https://github.com/IAIK/rowhammerjs</u>

# **Root Causes of Disturbance Errors**

- Cause 1: Electromagnetic coupling
  - Toggling the wordline voltage briefly increases the voltage of adjacent wordlines
  - Slightly opens adjacent rows → charge leakage
- Cause 2: Conductive bridges
- Cause 3: Hot-carrier injection

#### Confirmed by at least one manufacturer

### Experimental DRAM Testing Infrastructure



Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors (Kim et al., ISCA 2014)

Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case (Lee et al., HPCA 2015)

AVATAR: A Variable-Retention-Time (VRT) Aware Refresh for DRAM Systems (Qureshi et al., DSN 2015)

An Experimental Study of Data Retention Behavior in Modern DRAM Devices: Implications for Retention Time Profiling Mechanisms (Liu et al., ISCA 2013)

<u>The Efficacy of Error Mitigation Techniques for DRAM Retention Failures: A Comparative Experimental Study</u> (Khan et al., SIGMETRICS 2014)




### Experimental Infrastructure (DRAM)




Kim+, "Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors," ISCA 2014.

### RowHammer Characterization Results

1. Most Modules Are at Risk
2. Errors vs. Vintage
3. Error = Charge Loss
4. Adjacency: Aggressor & Victim
5. Sensitivity Studies
6. Other Results in Paper
7. Solution Space

Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors, (Kim et al., ISCA 2014)


# Access Interval (Aggressor)



*Note: For three modules with the most errors (only first bank)* 

*Less frequent accesses → Fewer errors* 

# Refresh Interval



Note: Using three modules with the most errors (only first bank)

*More frequent refreshes → Fewer errors* 





Errors affected by data stored in other cells

# **Naive Solutions**

#### 1. Throttle accesses to same row

- Limit access-interval: ≥500ns
- Limit number of accesses: $\leq 128\text{K}$ (= 64ms/500ns)

#### 2. Refresh more frequently

- Shorten refresh-interval by $\sim 7\times$

Both naive solutions introduce significant overhead in performance and power
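The numbers on this slide follow from simple arithmetic, sketched below in Python. The 64 ms refresh interval, the 500 ns access interval, and the factor of 7 come from the slide; the function names are illustrative.

```python
# Arithmetic behind the two naive solutions (values taken from the slide).
REFRESH_INTERVAL_NS = 64_000_000  # 64 ms refresh interval
MIN_ACCESS_INTERVAL_NS = 500      # throttled access interval


def max_accesses_per_window(refresh_ns: int, access_ns: int) -> int:
    """Upper bound on accesses to one row within one refresh interval."""
    return refresh_ns // access_ns


def shortened_refresh_ms(refresh_ms: float, factor: float) -> float:
    """Refresh interval after refreshing `factor` times more often."""
    return refresh_ms / factor


print(max_accesses_per_window(REFRESH_INTERVAL_NS, MIN_ACCESS_INTERVAL_NS))
print(round(shortened_refresh_ms(64.0, 7.0), 2))
```

The first value is 128000 accesses (the 128K limit), and the second is roughly 9.14 ms, which illustrates why refreshing about seven times more often is costly in both performance and power.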

### Apple's Patch for RowHammer

#### https://support.apple.com/en-gb/HT204934

Available for: OS X Mountain Lion v10.8.5, OS X Mavericks v10.9.5

Impact: A malicious application may induce memory corruption to escalate privileges

Description: A disturbance error, also known as Rowhammer, exists with some DDR3 RAM that could have led to memory corruption. This issue was mitigated by increasing memory refresh rates.

CVE-ID

CVE-2015-3693 : Mark Seaborn and Thomas Dullien of Google, working from original research by Yoongu Kim et al (2014)

HP and Lenovo released similar patches

# **Our Solution**

- PARA: <u>Probabilistic Adjacent Row Activation</u>
- Key Idea
  - After closing a row, we activate (i.e., refresh) one of its neighbors with a low probability: p = 0.005
- Reliability Guarantee
  - When p=0.005, errors in one year:  $9.4 \times 10^{-14}$
  - By adjusting the value of p, we can provide an arbitrarily strong protection against errors
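The key idea can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the controller design from the paper: each row close picks one of the two neighbors with equal probability, so a given victim is refreshed with probability p/2 per close, and the hammer threshold of 100,000 activations is a hypothetical value chosen for illustration. The sketch does not attempt to reproduce the per-year error figure quoted above.

```python
import math
import random


def prob_no_refresh(p: float, activations: int) -> float:
    """Probability that a given neighbor is never refreshed during
    `activations` consecutive closes of the aggressor row, assuming each
    close refreshes that neighbor with probability p / 2."""
    return math.exp(activations * math.log1p(-p / 2))


class ParaController:
    """Stateless PARA sketch: no counters, only a random draw per row close."""

    def __init__(self, p: float, rng: random.Random):
        self.p = p
        self.rng = rng

    def on_row_close(self, row: int) -> list[int]:
        """Return the neighbor rows to refresh (usually none).
        Array boundaries are ignored in this sketch."""
        if self.rng.random() < self.p:
            return [row + self.rng.choice((-1, 1))]
        return []


ctrl = ParaController(p=0.005, rng=random.Random(0))
refreshes = sum(len(ctrl.on_row_close(1000)) for _ in range(100_000))
print(refreshes, prob_no_refresh(0.005, 100_000))
```

Because the controller keeps no per-row state, its cost does not grow with the number of rows, and the chance that a victim goes unrefreshed shrinks exponentially with the number of aggressor activations.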

# Advantages of PARA

- PARA refreshes rows infrequently
  - Low power
  - Low performance-overhead
    - Average slowdown: 0.20% (for 29 benchmarks)
    - Maximum slowdown: 0.75%
- PARA is stateless
  - Low cost
  - Low complexity
- PARA is an effective and low-overhead solution to prevent disturbance errors

#### More on RowHammer Analysis

#### Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors

Yoongu Kim<sup>1</sup> Ross Daly<sup>\*</sup> Jeremie Kim<sup>1</sup> Chris Fallin<sup>\*</sup> Ji Hye Lee<sup>1</sup> Donghyuk Lee<sup>1</sup> Chris Wilkerson<sup>2</sup> Konrad Lai Onur Mutlu<sup>1</sup> <sup>1</sup>Carnegie Mellon University <sup>2</sup>Intel Labs

#### **RowHammer: Reliability Analysis and Security Implications**

Yoongu Kim<sup>1</sup>, Ross Daly, Jeremie Kim<sup>1</sup>, Chris Fallin, Ji Hye Lee<sup>1</sup>, Donghyuk Lee<sup>1</sup>, Chris Wilkerson<sup>2</sup>, Konrad Lai, and Onur Mutlu<sup>1</sup> <sup>1</sup>Carnegie Mellon University <sup>2</sup>Intel Labs

#### Future of Main Memory

- DRAM is becoming less reliable  $\rightarrow$  more vulnerable
- Due to difficulties in DRAM scaling, unexpected types of failures may appear
- And, they may already be slipping into the field
  - Read disturb errors (Rowhammer)
  - Retention errors
  - Read errors, write errors
  - ...

These failures can also pose security vulnerabilities

#### Analysis of Retention Failures [ISCA'13]

#### An Experimental Study of Data Retention Behavior in Modern DRAM Devices:

#### Implications for Retention Time Profiling Mechanisms

Jamie Liu (jamiel@alumni.cmu.edu), Ben Jaiyen (bjaiyen@alumni.cmu.edu), Yoongu Kim (yoonguk@ece.cmu.edu), and Onur Mutlu (onur@cmu.edu): Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15213

Chris Wilkerson (chris.wilkerson@intel.com): Intel Corporation, 2200 Mission College Blvd., Santa Clara, CA 95054

### Can We Exploit the DRAM Retention Time Profile?



# 128-256ms

### Two Challenges to Retention Time Profiling

- Challenge 1: Data Pattern Dependence (DPD)
  - Retention time of a DRAM cell depends on its value and the values of cells nearby it
  - When a row is activated, all bitlines are perturbed simultaneously





### Two Challenges to Retention Time Profiling

- Challenge 2: Variable Retention Time (VRT)
  - Retention time of a DRAM cell changes randomly over time
    - a cell alternates between multiple retention time states
  - Leakage current of a cell changes sporadically due to a charge trap in the gate oxide of the DRAM cell access transistor
  - When the trap becomes occupied, charge leaks more readily from the transistor's drain, leading to a short retention time
    - Called *Trap-Assisted Gate-Induced Drain Leakage*
  - This process appears to be a random process [Kim+, IEEE TED'11]
  - Worst-case retention time depends on a random process
    → need to find the worst case despite this

#### The Efficacy of Error Mitigation Techniques for DRAM Retention Failures: A Comparative Experimental Study

Samira Khan<sup>†</sup>\* (samirakhan@cmu.edu), Donghyuk Lee<sup>†</sup> (donghyuk1@cmu.edu), Yoongu Kim<sup>†</sup> (yoongukim@cmu.edu), Alaa R. Alameldeen\* (alaa.r.alameldeen@intel.com), Chris Wilkerson\* (chris.wilkerson@intel.com), Onur Mutlu<sup>†</sup> (onur@cmu.edu)

<sup>†</sup>Carnegie Mellon University \*Intel Labs

### Online Profiling of DRAM In the Field



without disturbing the system and applications

#### Multi-Rate Refresh with Online Profiling & ECC

 Moinuddin Qureshi, Dae Hyun Kim, Samira Khan, Prashant Nair, and Onur Mutlu,
 "AVATAR: A Variable-Retention-Time (VRT) Aware Refresh for DRAM Systems"
 Proceedings of the
 45th Annual IEEE/IFIP International Conference on Dependable
 Systems and Networks (DSN), Rio de Janeiro, Brazil, June 2015.
 [Slides (pptx) (pdf)]

#### AVATAR: A Variable-Retention-Time (VRT) Aware Refresh for DRAM Systems

Moinuddin K. Qureshi<sup>†</sup>, Dae-Hyun Kim<sup>†</sup>, Samira Khan<sup>‡</sup>, Prashant J. Nair<sup>†</sup>, Onur Mutlu<sup>‡</sup>

<sup>†</sup>Georgia Institute of Technology {*moin, dhkim, pnair6*}@*ece.gatech.edu* <sup>‡</sup>Carnegie Mellon University {*samirakhan, onur*}@*cmu.edu*

#### **ARCHITECTURE MODEL FOR CELL UNDER VRT**



Two key parameters:

Active-VRT Pool (AVP): How many VRT cells in this period?

Active-VRT Injection (AVI): How many new (previously undiscovered) cells became weak in this period?

Model has two parameters: AVP and AVI

#### AVATAR

Insight: Avoid forming Active VRT Pool $\rightarrow$ Upgrade on ECC error

Observation: Rate of VRT >> Rate of soft error (50x-2500x)



**AVATAR mitigates VRT by breaking up the Active-VRT Pool**
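The "upgrade on ECC error" insight can be sketched as a small piece of bookkeeping. The Python below is a simplified illustration, not the mechanism as implemented in the paper: the class name, the two-rate split, and the row numbers are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MultiRateRefresh:
    """Tracks which rows use the fast refresh rate; all others use the slow one."""
    fast_rows: set[int] = field(default_factory=set)

    def initial_profile(self, weak_rows: set[int]) -> None:
        # Rows found weak by retention profiling start at the fast rate.
        self.fast_rows |= weak_rows

    def on_ecc_corrected_error(self, row: int) -> None:
        # Assumed AVATAR-style upgrade: a corrected error in a slow-rate row is
        # treated as a newly active VRT cell, so the row moves to the fast rate.
        self.fast_rows.add(row)

    def rate(self, row: int) -> str:
        return "fast" if row in self.fast_rows else "slow"


refresh = MultiRateRefresh()
refresh.initial_profile({3, 7})
refresh.on_ecc_corrected_error(12)  # hypothetical VRT failure caught by ECC
print([refresh.rate(r) for r in (3, 7, 12, 13)])
```

Upgrading a row as soon as ECC corrects an error in it keeps newly weak cells from accumulating at the slow rate, which is how the pool of active VRT cells is prevented from forming.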

#### **AVATAR: TIME TO FAILURE**

System: Four channels, each with 8GB DIMM



#### **AVATAR increases time to failure to 10s of years**

\* We include the effect of soft error in the above lifetime analysis (details in the paper)

#### **ENERGY DELAY PRODUCT**



#### AVATAR reduces EDP, with significant reduction at higher-capacity nodes

### Large-Scale Failure Analysis of DRAM Chips

 Analysis and modeling of memory errors found in all of Facebook's server fleet

 Justin Meza, Qiang Wu, Sanjeev Kumar, and Onur Mutlu, "Revisiting Memory Errors in Large-Scale Production Data Centers: Analysis and Modeling of New Trends from the Field" Proceedings of the 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), Rio de Janeiro, Brazil, June 2015. [Slides (pptx) (pdf)] [DRAM Error Model]

Revisiting Memory Errors in Large-Scale Production Data Centers: Analysis and Modeling of New Trends from the Field

Justin Meza Qiang Wu\* Sanjeev Kumar\* Onur Mutlu

Carnegie Mellon University \* Facebook, Inc.

# DRAM Reliability Reducing



Chip density (Gb)

# Recap: The DRAM Scaling Problem

#### **DRAM Process Scaling Challenges**

#### Refresh

- Difficult to build high-aspect-ratio cell capacitors, decreasing cell capacitance

THE MEMORY FORUM 2014

### Co-Architecting Controllers and DRAM to Enhance DRAM Process Scaling

Uksong Kang, Hak-soo Yu, Churoo Park, \*Hongzhong Zheng, \*\*John Halbert, \*\*Kuljit Bains, SeongJin Jang, and Joo Sun Choi



Samsung Electronics, Hwasung, Korea / \*Samsung Electronics, San Jose / \*\*Intel

### How Do We Solve The Problem?



software/hardware/device cooperation

### Exploiting Memory Error Tolerance with Hybrid Memory Systems



Heterogeneous-Reliability Memory [DSN 2014]

On Microsoft's Web Search workload:

- Reduces server hardware cost by 4.7%
- Achieves single-server availability target of 99.90%



#### DRAM Scaling Issues

- DRAM RowHammer Problem
- Some Other DRAM Reliability Studies

#### NAND Flash Scaling Issues

- Some NAND Flash Reliability Studies
- Read Disturb Errors in NAND Flash Memory
- Summary and Discussion

# Evolution of NAND Flash Memory



Seaung Suk Lee, "Emerging Challenges in NAND Flash Technology", Flash Summit 2011 (Hynix)

- Flash memory is widening its range of applications
  - Portable consumer devices, laptop PCs and enterprise servers

# Flash Challenges: Reliability and Endurance



E. Grochowski et al., "Future technology challenges for NAND flash and HDD products", Flash Memory Summit 2012

### NAND Flash Memory is Increasingly Noisy



### Future NAND Flash-based Storage Architecture



Our Goals:

- Build reliable error models for NAND flash memory
- Design efficient reliability mechanisms based on the model

### NAND Flash Error Model





#### **Experimentally characterize and model dominant errors**

Cai et al., "Error Patterns in MLC NAND Flash Memory: Measurement, Characterization, and Analysis", DATE 2012





Goals:

- Understand error mechanisms and develop reliable predictive models for MLC NAND flash memory errors
- Develop efficient error management techniques to mitigate errors and improve flash reliability and endurance

Approach:

- Solid experimental analyses of errors in real MLC NAND flash memory → drive the understanding and models
- Understanding, models and creativity → drive the new techniques

### Experimental Testing Platform





[Cai+, FCCM 2011, DATE 2012, ICCD 2012, DATE 2013, ITJ 2013, ICCD 2013, SIGMETRICS 2014, HPCA 2015, DSN 2015]

NAND Daughter Board

Cai et al., FPGA-based Solid-State Drive prototyping platform, FCCM 2011.

# NAND Flash Usage and Error Model



## Methodology: Error and ECC Analysis

- Characterized errors and error rates of 3x and 2y-nm MLC NAND flash using an experimental FPGA-based platform
  - [Cai+, DATE'12, ICCD'12, DATE'13, ITJ'13, ICCD'13, SIGMETRICS'14]

- Quantified Raw Bit Error Rate (RBER) at a given P/E cycle
  - Raw Bit Error Rate: Fraction of erroneous bits without any correction
- Quantified error correction capability (and area and power consumption) of various BCH-code implementations
  - Identified how much RBER each code can tolerate
    - $\rightarrow$  how many P/E cycles (flash lifetime) each code can sustain
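The link between RBER and the strength of an error-correcting code can be illustrated with a simple binomial model. The Python sketch below is not the methodology of the cited papers: it assumes independent bit errors, and the codeword length, correction strengths, and target failure probability are hypothetical values picked for illustration.

```python
import math


def codeword_failure_prob(rber: float, n_bits: int, t: int) -> float:
    """P(more than t bit errors in an n_bits codeword), assuming independent
    bit errors at rate rber (binomial model)."""
    if rber <= 0.0:
        return 0.0
    if rber >= 1.0:
        return 1.0 if n_bits > t else 0.0
    ok = 0.0
    for k in range(t + 1):
        # Work in log space so large binomial coefficients do not overflow.
        log_pmf = (math.log(math.comb(n_bits, k))
                   + k * math.log(rber)
                   + (n_bits - k) * math.log1p(-rber))
        ok += math.exp(log_pmf)
    return max(0.0, 1.0 - ok)


def max_tolerable_rber(n_bits: int, t: int, target: float) -> float:
    """Largest RBER (by bisection) whose failure probability stays <= target."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if codeword_failure_prob(mid, n_bits, t) <= target:
            lo = mid
        else:
            hi = mid
    return lo


# Hypothetical parameters: 8192-bit codeword, two correction strengths.
weak = max_tolerable_rber(8192, t=8, target=1e-9)
strong = max_tolerable_rber(8192, t=40, target=1e-9)
print(weak, strong)
```

A stronger code tolerates a higher RBER, and since RBER grows with P/E cycles, it also sustains more P/E cycles before errors become uncorrectable. Because the tail is computed as one minus a sum, this sketch is only meaningful for targets well above floating-point rounding error.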



- Four types of errors [Cai+, DATE 2012]
- Caused by common flash operations
  - Read errors
  - Erase errors
  - Program (interference) errors
- Caused by flash cell losing charge over time
  - Retention errors
    - Whether an error happens depends on required retention time
    - Especially problematic in MLC flash because threshold voltage window to determine stored value is smaller

### **Observations:** Flash Error Analysis





- Raw bit error rate increases exponentially with P/E cycles
- Retention errors are dominant (>99% for 1-year ret. time)
- Retention errors increase with retention time requirement

Cai et al., Error Patterns in MLC NAND Flash Memory, DATE 2012.



 Yu Cai, Erich F. Haratsch, Onur Mutlu, and Ken Mai, "Error Patterns in MLC NAND Flash Memory: Measurement, Characterization, and Analysis" Proceedings of the Design, Automation, and Test in Europe Conference (DATE), Dresden, Germany, March 2012. Slides (ppt)

### **Error Patterns in MLC NAND Flash Memory: Measurement, Characterization, and Analysis**

Yu Cai<sup>1</sup>, Erich F. Haratsch<sup>2</sup>, Onur Mutlu<sup>1</sup> and Ken Mai<sup>1</sup> <sup>1</sup>Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA <sup>2</sup>LSI Corporation, 1110 American Parkway NE, Allentown, PA <sup>1</sup>{yucai, onur, kenmai}@andrew.cmu.edu, <sup>2</sup>erich.haratsch@lsi.com



 Yu Cai, Gulay Yalcin, Onur Mutlu, Erich F. Haratsch, Adrian Cristal, Osman Unsal, and Ken Mai, <u>"Flash Correct-and-Refresh: Retention-Aware Error Management for Increased Flash Memory Lifetime"</u> *Proceedings of the* <u>30th IEEE International Conference on Computer Design</u> (ICCD), Montreal, Quebec, Canada, September 2012. <u>Slides (ppt) (pdf)</u>

### Flash Correct-and-Refresh: Retention-Aware Error Management for Increased Flash Memory Lifetime

Yu Cai<sup>1</sup>, Gulay Yalcin<sup>2</sup>, Onur Mutlu<sup>1</sup>, Erich F. Haratsch<sup>3</sup>, Adrian Cristal<sup>2</sup>, Osman S. Unsal<sup>2</sup> and Ken Mai<sup>1</sup> <sup>1</sup>DSSC, Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA <sup>2</sup>Barcelona Supercomputing Center, C/Jordi Girona 29, Barcelona, Spain <sup>3</sup>LSI Corporation, 1110 American Parkway NE, Allentown, PA <sup>1</sup>{yucai, omutlu, kenmai}@ece.cmu.edu, <sup>2</sup>{gulay.yalcin, adrian.cristal, osman.unsal}@bsc.es, <sup>3</sup>erich.haratsch@lsi.com

# Threshold Voltage Modeling

 Yu Cai, Erich F. Haratsch, Onur Mutlu, and Ken Mai, "Threshold Voltage Distribution in MLC NAND Flash Memory: Characterization, Analysis and Modeling" Proceedings of the Design, Automation, and Test in Europe Conference (DATE), Grenoble, France, March 2013. Slides (ppt)

Threshold Voltage Distribution in MLC NAND Flash Memory: Characterization, Analysis, and Modeling

> Yu Cai<sup>1</sup>, Erich F. Haratsch<sup>2</sup>, Onur Mutlu<sup>1</sup> and Ken Mai<sup>1</sup> <sup>1</sup>DSSC, Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA <sup>2</sup>LSI Corporation, 1110 American Parkway NE, Allentown, PA <sup>1</sup>{yucai, onur, kenmai}@andrew.cmu.edu, <sup>2</sup>erich.haratsch@lsi.com

### Program Interference Modeling



 Yu Cai, Onur Mutlu, Erich F. Haratsch, and Ken Mai, "Program Interference in MLC NAND Flash Memory: Characterization, Modeling, and Mitigation" Proceedings of the <u>31st IEEE International Conference on Computer Design</u> (ICCD), Asheville, NC, October 2013. <u>Slides (pptx) (pdf)</u> <u>Lightning Session Slides (pdf)</u>

#### Program Interference in MLC NAND Flash Memory: Characterization, Modeling, and Mitigation

Yu Cai<sup>1</sup>, Onur Mutlu<sup>1</sup>, Erich F. Haratsch<sup>2</sup> and Ken Mai<sup>1</sup> 1. Data Storage Systems Center, Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 2. LSI Corporation, San Jose, CA yucaicai@gmail.com, {omutlu, kenmai}@andrew.cmu.edu



 Yu Cai, Gulay Yalcin, Onur Mutlu, Erich F. Haratsch, Osman Unsal, Adrian Cristal, and Ken Mai, "Neighbor-Cell Assisted Error Correction for MLC NAND Flash Memories" Proceedings of the <u>ACM International Conference on Measurement and Modeling of Computer Systems</u> (SIGMETRICS), Austin, TX, June 2014. <u>Slides (ppt) (pdf)</u>

#### Neighbor-Cell Assisted Error Correction for MLC NAND Flash Memories

Yu Cai<sup>1</sup>, Gulay Yalcin<sup>2</sup>, Onur Mutlu<sup>1</sup>, Erich F. Haratsch<sup>4</sup>, Osman Unsal<sup>2</sup>, Adrian Cristal<sup>2,3</sup>, and Ken Mai<sup>1</sup> <sup>1</sup>Electrical and Computer Engineering Department, Carnegie Mellon University <sup>2</sup>Barcelona Supercomputing Center, Spain <sup>3</sup>IIIA – CSIC – Spain National Research Council <sup>4</sup>LSI Corporation yucaicai@gmail.com, {omutlu, kenmai}@ece.cmu.edu, {gulay.yalcin, adrian.cristal, osman.unsal}@bsc.es

### Data Retention Analysis & Recovery Flas



### Data Retention in MLC NAND Flash Memory: Characterization, Optimization, and Recovery

Yu Cai, Yixin Luo, Erich F. Haratsch<sup>\*</sup>, Ken Mai, Onur Mutlu Carnegie Mellon University, <sup>\*</sup>LSI Corporation yucaicai@gmail.com, yixinluo@cs.cmu.edu, erich.haratsch@lsi.com, {kenmai, omutlu}@ece.cmu.edu



#### DRAM Scaling Issues

- DRAM RowHammer Problem
- Some Other DRAM Reliability Studies
- NAND Flash Scaling Issues
  - Some NAND Flash Reliability Studies
  - Read Disturb Errors in NAND Flash Memory
- Summary and Discussion

### Read Disturb Errors in Flash Memory





- Presented at IEEE/IFIP DSN 2015 Conference in June 2015.
- Full paper for details:
  - Yu Cai, Yixin Luo, Saugata Ghose, Erich F. Haratsch, Ken Mai, and Onur Mutlu, <u>"Read Disturb Errors in MLC NAND Flash Memory: Characterization and Mitigation"</u> *Proceedings of the* <u>45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks</u> (DSN), Rio de Janeiro, Brazil, June 2015.
  - http://users.ece.cmu.edu/~omutlu/pub/flash-read-disturberrors\_dsn15.pdf

# **Executive Summary**



- **Read disturb errors** limit flash memory lifetime today
  - Apply a high pass-through voltage ( $V_{pass}$ ) to multiple pages on a read
  - Repeated application of  $V_{pass}$  can alter stored values in unread pages
- We characterize read disturb on real NAND flash chips
  - Slightly lowering V<sub>pass</sub> greatly reduces read disturb errors
  - Some flash cells are more prone to read disturb
- Technique 1: Mitigate read disturb errors online
  - $V_{pass}$ Tuning dynamically finds and applies a lowered $V_{pass}$ per block
  - Flash memory lifetime improves by 21%
- Technique 2: Recover after failure to prevent data loss
  - *Read Disturb Oriented Error Recovery* (RDR) selectively corrects cells more susceptible to read disturb errors
  - Reduces raw bit error rate (RBER) by up to 36%

# Outline



- Background (Problem and Goal)
- Key Experimental Observations
- Mitigation: V<sub>pass</sub> Tuning
- Recovery: Read Disturb Oriented Error Recovery
- Conclusion


## NAND Flash Memory Background



### Flash Cell Array





# Flash Cell





Floating Gate Transistor (Flash Cell)






#### Read Disturb Problem: "Weak Programming" Effect






Read disturb errors: Reading from one page can alter the values stored in other unread pages

# Goal: Mitigate and Recover Read Disturb Errors

# Outline



- Background (Problem and Goal)
- Key Experimental Observations
- Mitigation: V<sub>pass</sub> Tuning
- Recovery: Read Disturb Oriented Error Recovery
- Conclusion

# Methodology



- FPGA-based flash memory testing platform [Cai+, FCCM '11]



- Real 20- to 24-nm MLC NAND flash chips
- 0 to 1M read disturbs
- 0 to 15K Program/Erase Cycles (PEC)

# **Experimental Infrastructure**





[Cai+, DATE 2012, ICCD 2012, DATE 2013, ITJ 2013, ICCD 2013, SIGMETRICS 2014, HPCA 2015, DSN 2015, MSST 2015]

NAND Daughter Board

## Read Disturb Effect on V<sub>th</sub> Distribution



## **Other Experimental Observations**

- Lower threshold voltage states are affected more by read disturb
- Wear-out increases read disturb effect

## Key Observation 1: Slightly lowering V<sub>pass</sub> greatly reduces read disturb errors



Fig. 11. Raw bit error rate vs. read disturb count for different  $V_{pass}$  values, for flash memory under 8K P/E cycles of wear.

#### **Percentage of Vpass Reduction**

# Outline



- Background (Problem and Goal)
- Key Experimental Observations
- Mitigation: V<sub>pass</sub> Tuning
- Recovery: Read Disturb Oriented Error Recovery
- Conclusion

# Read Disturb Mitigation: V<sub>pass</sub> Tuning

- Key Idea: Dynamically find and apply a lowered  $V_{\text{pass}}$
- Trade-off for lowering V<sub>pass</sub>
  - + Allows more read disturbs
  - − Induces more read errors





# Utilizing the Unused ECC Capability



1. ECC provisioned for high retention "age"
2. Unused ECC capability can be used to fix read errors
3. Unused ECC capability decreases over retention age

Dynamically adjust $V_{\text{pass}}$ so that read errors fully utilize the unused ECC capability

# V<sub>pass</sub> Reduction Trade-Off Summary

- Today: Conservatively set V<sub>pass</sub> to a high voltage
  - − Accumulates more read disturb errors at the end of each refresh interval
  - + No read errors
- Idea: Dynamically adjust V<sub>pass</sub> to unused ECC capability
  - + Minimize read disturb errors
  - Control read errors to be tolerable by ECC
  - If read errors exceed ECC capability, read again with a higher $V_{\text{pass}}$ to correct read errors

# V<sub>pass</sub> Tuning Steps



- Perform once for each block every day:
  1. Estimate unused ECC capability (using retention age)
  2. Aggressively reduce V<sub>pass</sub> until read errors exceed ECC capability
  3. Gradually increase $V_{pass}$ until read errors become just less than ECC capability
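The three steps above can be sketched as a short search loop. The Python below is an illustrative model only: V<sub>pass</sub> is expressed in hypothetical integer steps, the step sizes are arbitrary, and `synthetic_read_errors` stands in for measurements that a real controller would take from the flash block. The fallback of re-reading with a higher V<sub>pass</sub> is not modeled.

```python
from typing import Callable


def estimate_unused_ecc(ecc_capability: int, retention_errors: int) -> int:
    """Step 1: correction capability left after expected retention errors."""
    return max(0, ecc_capability - retention_errors)


def tune_vpass(v_max: int, v_min: int, unused_ecc: int,
               read_errors_at: Callable[[int], int],
               coarse: int = 8, fine: int = 1) -> int:
    """Steps 2 and 3, with V_pass expressed in integer steps."""
    v = v_max
    # Step 2: lower V_pass aggressively until read errors exceed unused ECC.
    while v - coarse >= v_min and read_errors_at(v) <= unused_ecc:
        v -= coarse
    # Step 3: raise V_pass gradually until read errors drop just below it.
    while v < v_max and read_errors_at(v) >= unused_ecc:
        v += fine
    return v


def synthetic_read_errors(v: int) -> int:
    """Hypothetical block: 3 extra read errors per step below V_pass = 480."""
    return max(0, 480 - v) * 3


unused = estimate_unused_ecc(ecc_capability=40, retention_errors=20)
tuned = tune_vpass(v_max=512, v_min=400, unused_ecc=unused,
                   read_errors_at=synthetic_read_errors)
print(tuned)
```

With these made-up numbers the search settles at 474, the lowest setting whose extra read errors (18) still fit within the 20 bits of unused ECC capability.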



# Evaluation of V<sub>pass</sub> Tuning

- 19 real workload I/O traces
- Assume 7-day refresh period
- Similar methodology as before to determine acceptable  $V_{pass}$  reduction

- Overhead for a 512 GB flash drive:
  - 128 KB storage overhead for per-block $V_{\text{pass}}$ setting and worst-case page
  - 24.34 sec/day average V<sub>pass</sub> Tuning overhead

# V<sub>pass</sub> Tuning Lifetime Improvements



Average lifetime improvement: 21.0%



# Outline



- Background (Problem and Goal)
- Key Experimental Observations
- Mitigation: V<sub>pass</sub> Tuning
- Recovery: Read Disturb Oriented Error Recovery
- Conclusion

## **Read Disturb Resistance**









#### Read Disturb Oriented Error Recovery (RDR)

- Triggered by an uncorrectable flash error
  - Back up all valid data in the faulty block
  - Disturb the faulty page 100K more times
  - Compare $V_{th}$ values before and after read disturb
  - Select cells susceptible to flash errors ($V_{ref} - \sigma < V_{th} < V_{ref} + \sigma$)
  - Predict among these susceptible cells
    - Cells with larger $V_{th}$ shifts are disturb-prone $\rightarrow$ lower $V_{th}$ state
    - Cells with smaller $V_{th}$ shifts are disturb-resistant $\rightarrow$ higher $V_{th}$ state
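The classification step of RDR can be sketched as follows. This Python example is a simplification under stated assumptions: the susceptibility window is applied to the threshold voltage measured before the extra read disturbs, `shift_threshold` is a hypothetical cutoff separating disturb-prone from disturb-resistant cells, and all voltage values are made up.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    vth_before: float  # threshold voltage before the extra read disturbs
    vth_after: float   # threshold voltage after the extra read disturbs


def rdr_predict(cells, v_ref, sigma, shift_threshold):
    """For each cell return 'lower', 'higher', or None (not susceptible)."""
    predictions = []
    for c in cells:
        if not (v_ref - sigma < c.vth_before < v_ref + sigma):
            predictions.append(None)  # far from V_ref: keep the value as read
            continue
        shift = c.vth_after - c.vth_before
        # Disturb-prone cells (large shift) are predicted to belong to the
        # lower state; disturb-resistant cells to the higher state.
        predictions.append("lower" if shift > shift_threshold else "higher")
    return predictions


cells = [Cell(1.95, 2.05), Cell(2.05, 2.06), Cell(3.00, 3.00)]
result = rdr_predict(cells, v_ref=2.0, sigma=0.2, shift_threshold=0.05)
print(result)
```

Only cells near the read reference voltage are reclassified, so the remaining errors are left for ECC to correct.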

## **RDR Evaluation**





- Reduces total error counts by up to 36% @ 1M read disturbs
- ECC can be used to correct the remaining errors

# Outline



- Background (Problem and Goal)
- Key Experimental Observations
- Mitigation: V<sub>pass</sub> Tuning
- Recovery: Read Disturb Oriented Error Recovery
- Conclusion

# **Executive Summary**



- **Read disturb errors** limit flash memory lifetime today
  - Apply a high pass-through voltage ( $V_{pass}$ ) to multiple pages on a read
  - Repeated application of  $V_{pass}$  can alter stored values in unread pages
- We characterize read disturb on real NAND flash chips
  - Slightly lowering V<sub>pass</sub> greatly reduces read disturb errors
  - Some flash cells are more prone to read disturb
- Technique 1: Mitigate read disturb errors online
  - $V_{pass}$ Tuning dynamically finds and applies a lowered $V_{pass}$ per block
  - Flash memory lifetime improves by 21%
- Technique 2: Recover after failure to prevent data loss
  - Read Disturb Oriented Error Recovery (RDR) selectively corrects cells more susceptible to read disturb errors
  - Reduces raw bit error rate (RBER) by up to 36%

## More on Flash Read Disturb Errors



 Yu Cai, Yixin Luo, Saugata Ghose, Erich F. Haratsch, Ken Mai, and Onur Mutlu,
 "Read Disturb Errors in MLC NAND Flash Memory: Characterization and Mitigation"
 Proceedings of the
 <u>45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks</u> (DSN), Rio de Janeiro, Brazil, June 2015.

#### Read Disturb Errors in MLC NAND Flash Memory: Characterization, Mitigation, and Recovery

Yu Cai, Yixin Luo, Saugata Ghose, Erich F. Haratsch\*, Ken Mai, Onur Mutlu Carnegie Mellon University, \*Seagate Technology yucaicai@gmail.com, {yixinluo, ghose, kenmai, onur}@cmu.edu

## Large-Scale Flash SSD Error Analysis

- First large-scale field study of flash memory errors
- Justin Meza, Qiang Wu, Sanjeev Kumar, and Onur Mutlu, "A Large-Scale Study of Flash Memory Errors in the Field" Proceedings of the <u>ACM International Conference on Measurement and Modeling of</u> <u>Computer Systems (SIGMETRICS)</u>, Portland, OR, June 2015. [Slides (pptx) (pdf)] [Coverage at ZDNet] [Coverage on The Register] [Coverage on TechSpot] [Coverage on The Tech Report]

#### A Large-Scale Study of Flash Memory Failures in the Field

Justin Meza Carnegie Mellon University meza@cmu.edu Qiang Wu Facebook, Inc. qwu@fb.com Sanjeev Kumar Facebook, Inc. skumar@fb.com Onur Mutlu Carnegie Mellon University onur@cmu.edu

# A few SSDs cause most errors



Normalized SSD number



## Agenda

#### DRAM Scaling Issues

- DRAM RowHammer Problem
- Some Other DRAM Reliability Studies

#### NAND Flash Scaling Issues

- Some NAND Flash Reliability Studies
- Read Disturb Errors in NAND Flash Memory

Summary and Discussion

## Summary

- DRAM and Flash Scaling Challenges are real and critical
  - They lead to many reliability (and security) challenges
- We need to understand various reliability issues with both
  - Small-scale experimental studies (FPGA-based testing platforms)
  - Large-scale experimental studies (data centers and clusters)
- We need to innovate at all levels
  - DRAM and Flash architecture and controllers
  - Hardware, software, devices
- There are many problems to solve
  - Industry-academia cooperation is much needed and welcome




# Ramulator: A Fast and Extensible DRAM Simulator [IEEE Comp Arch Letters'15]

## Ramulator Motivation

- DRAM and Memory Controller landscape is changing
- Many new and upcoming standards
- Many new controller designs
- A fast and easy-to-extend simulator is very much needed

| Segment | DRAM Standards & Architectures |
|---|---|
| Commodity | DDR3 (2007) [14]; DDR4 (2012) [18] |
| Low-Power | LPDDR3 (2012) [17]; LPDDR4 (2014) [20] |
| Graphics | GDDR5 (2009) [15] |
| Performance | eDRAM [28], [32]; RLDRAM3 (2011) [29] |
| 3D-Stacked | WIO (2011) [16]; WIO2 (2014) [21]; MCDRAM (2015) [13]; HBM (2013) [19]; HMC1.0 (2013) [10]; HMC1.1 (2014) [11] |
| Academic | SBA/SSA (2010) [38]; Staged Reads (2012) [8]; RAIDR (2012) [27]; SALP (2012) [24]; TL-DRAM (2013) [26]; RowClone (2013) [37]; Half-DRAM (2014) [39]; Row-Buffer Decoupling (2014) [33]; SARP (2014) [6]; AL-DRAM (2015) [25] |

Table 1. Landscape of DRAM-based memory

### Ramulator

- Provides out-of-the-box support for many DRAM standards:
  - DDR3/4, LPDDR3/4, GDDR5, WIO1/2, HBM, plus new proposals (SALP, AL-DRAM, TL-DRAM, RowClone, and SARP)
- ~2.5X faster than the fastest open-source simulator
- Modular and extensible to different standards

| Simulator (clang -O3) | Cycles (10<sup>6</sup>): Random | Cycles (10<sup>6</sup>): Stream | Runtime (sec.): Random | Runtime (sec.): Stream | Req/sec (10<sup>3</sup>): Random | Req/sec (10<sup>3</sup>): Stream | Memory (MB) |
|---|---|---|---|---|---|---|---|
| Ramulator | 652 | 411 | 752 | 249 | 133 | 402 | 2.1 |
| DRAMSim2 | 645 | 413 | 2,030 | 876 | 49 | 114 | 1.2 |
| USIMM | 661 | 409 | 1,880 | 750 | 53 | 133 | 4.5 |
| DrSim | 647 | 406 | 18,109 | 12,984 | 6 | 8 | 1.6 |
| NVMain | 666 | 413 | 6,881 | 5,023 | 15 | 20 | 4,230.0 |

Table 3. Comparison of five simulators using two traces
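The "~2.5X" claim can be cross-checked against the runtimes in Table 3: for the Random trace, the next-fastest simulator (USIMM) takes 1,880 s against Ramulator's 752 s. A small sketch of that arithmetic follows; the dictionary and function names are illustrative, and the numbers are copied from the table.

```python
# Runtime in seconds as (Random trace, Stream trace), copied from Table 3.
runtime_sec = {
    "Ramulator": (752, 249),
    "DRAMSim2": (2030, 876),
    "USIMM": (1880, 750),
    "DrSim": (18109, 12984),
    "NVMain": (6881, 5023),
}


def speedup_over(baseline, simulator="Ramulator"):
    """Return (random, stream) runtime ratios of baseline over simulator."""
    base_random, base_stream = runtime_sec[baseline]
    sim_random, sim_stream = runtime_sec[simulator]
    return base_random / sim_random, base_stream / sim_stream


for name in runtime_sec:
    if name != "Ramulator":
        random_ratio, stream_ratio = speedup_over(name)
        print(f"{name}: {random_ratio:.1f}x (Random), {stream_ratio:.1f}x (Stream)")
```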

#### Case Study: Comparison of DRAM Standards

| Standard          | Rate<br>(MT/s) | Timing<br>(CL-RCD-RP) | Data-Bus<br>(Width×Chan.) | Rank-per-Chan | BW<br>(GB/s) |
|-------------------|----------------|-----------------------|---------------------------|---------------|--------------|
| DDR3              | 1,600          | 11-11-11              | 64-bit $\times$ 1         | 1             | 11.9         |
| DDR4              | 2,400          | 16-16-16              | 64-bit $\times$ 1         | 1             | 17.9         |
| SALP <sup>†</sup> | 1,600          | 11-11-11              | 64-bit $\times$ 1         | 1             | 11.9         |
| LPDDR3            | 1,600          | 12-15-15              | 64-bit $\times$ 1         | 1             | 11.9         |
| LPDDR4            | 2,400          | 22-22-22              | 32-bit $\times$ 2*        | 1             | 17.9         |
| GDDR5 [12]        | 6,000          | 18-18-18              | 64-bit $\times$ 1         | 1             | 44.7         |
| HBM               | 1,000          | 7-7-7                 | 128-bit $\times$ 8*       | 1             | 119.2        |
| WIO               | 266            | 7-7-7                 | 128-bit $\times$ 4*       | 1             | 15.9         |
| WIO2              | 1,066          | 9-10-10               | 128-bit $\times$ 8*       | 1             | 127.2        |
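The BW column appears to follow from rate × bus width × channel count, expressed in units of 2<sup>30</sup> bytes per second. That unit is our reading of the numbers, not something the table states. A short sketch under that assumption (function and dictionary names are illustrative):

```python
def peak_bandwidth(rate_mt_s, bus_width_bits, channels):
    """Peak bandwidth in units of 2**30 bytes per second (assumed unit)."""
    bytes_per_second = rate_mt_s * 1e6 * (bus_width_bits // 8) * channels
    return bytes_per_second / 2**30


# (rate in MT/s, data-bus width in bits, channels), copied from the table.
standards = {
    "DDR3": (1600, 64, 1),
    "DDR4": (2400, 64, 1),
    "LPDDR4": (2400, 32, 2),
    "GDDR5": (6000, 64, 1),
    "HBM": (1000, 128, 8),
    "WIO": (266, 128, 4),
}

for name, config in standards.items():
    print(f"{name}: {peak_bandwidth(*config):.1f}")
```

Rows whose nominal rate is itself a rounded value may differ from the table in the last digit.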


#### Ramulator Paper and Source Code

- Yoongu Kim, Weikun Yang, and <u>Onur Mutlu</u>,
  "Ramulator: A Fast and Extensible DRAM Simulator" <u>IEEE Computer Architecture Letters</u> (CAL), March 2015.
   [Source Code]
- Source code is released under the liberal MIT License
  <u>https://github.com/CMU-SAFARI/ramulator</u>

#### More Detail on DRAM Errors

#### Memory Errors in Facebook Fleet

 Analysis and modeling of memory errors found in all of Facebook's server fleet

 Justin Meza, Qiang Wu, Sanjeev Kumar, and <u>Onur Mutlu</u>, <u>"Revisiting Memory Errors in Large-Scale Production Data</u> <u>Centers: Analysis and Modeling of New Trends from the Field"</u> *Proceedings of the* <u>45th Annual IEEE/IFIP International Conference on Dependable</u> <u>Systems and Networks</u> (DSN), Rio de Janeiro, Brazil, June 2015. [Slides (pptx) (pdf)] [DRAM Error Model]

Revisiting Memory Errors in Large-Scale Production Data Centers: Analysis and Modeling of New Trends from the Field

Justin Meza Qiang Wu\* Sanjeev Kumar\* Onur Mutlu

Carnegie Mellon University \* Facebook, Inc.

#### New reliability trends

- Error/failure occurrence: Errors follow a *power-law distribution*, and a large number of errors occur due to *sockets/channels*
- Technology scaling: We find that *newer* cell fabrication technologies have *higher failure rates*
- Architecture & workload: Chips per DIMM, transfer width, and workload type (not necessarily CPU/memory utilization) affect reliability
- Modeling errors: We have made publicly available a **statistical model** for assessing server memory reliability
- Page offlining at scale: *First large-scale study* of page offlining; real-world *limitations* of technique

### Server error rate



Month

### Memory error distribution

Normalized device number

#### Large Scale Field Analysis of Flash Memory Errors

#### SSD Error Analysis of Facebook Systems

- First large-scale field study of flash memory errors
- Justin Meza, Qiang Wu, Sanjeev Kumar, and <u>Onur Mutlu</u>, "A Large-Scale Study of Flash Memory Errors in the Field" Proceedings of the <u>ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS)</u>, Portland, OR, June 2015. [Slides (pptx) (pdf)] [Coverage at ZDNet] [Coverage on The Register] [Coverage on TechSpot] [Coverage on The Tech Report]

#### A Large-Scale Study of Flash Memory Failures in the Field

Justin Meza Carnegie Mellon University meza@cmu.edu Qiang Wu Facebook, Inc. qwu@fb.com Sanjeev Kumar Facebook, Inc. skumar@fb.com Onur Mutlu Carnegie Mellon University onur@cmu.edu


### A few SSDs cause most errors



Normalized SSD number




### Summary

SSD lifecycle: *Early detection* lifecycle period distinct from hard disk drive lifecycle.



#### Storage lifecycle background: the bathtub curve for disk drives






### Use data written to flash to examine SSD lifecycle

(time-independent utilization metric)

#### 720GB, 1 SSD 720GB, 2 SSDs







#### SSD lifecycle

*Early detection* lifecycle period distinct from hard disk drive lifecycle.







#### 720GB, 1 SSD; 720GB, 2 SSDs

Average temperature (°C)



### Summary

- Read disturbance: We **do not** observe the effects of *read disturbance* errors in the field.
- Temperature: **Throttling SSD usage** helps mitigate temperature-induced errors.
- Access pattern dependence: We quantify the effects of the *page cache* and *write amplification* in the field.

### More on SSD Error Analysis in the Field

- First large-scale field study of flash memory errors
- Justin Meza, Qiang Wu, Sanjeev Kumar, and <u>Onur Mutlu</u>, "A Large-Scale Study of Flash Memory Errors in the Field" Proceedings of the <u>ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS)</u>, Portland, OR, June 2015. [Slides (pptx) (pdf)] [Coverage at ZDNet] [Coverage on The Register] [Coverage on TechSpot] [Coverage on The Tech Report]


### NAND Flash Memory Readings



## Errors in Flash Memory (I)

#### 1. <u>Retention noise study and management</u>

1) Yu Cai, Gulay Yalcin, Onur Mutlu, Erich F. Haratsch, Adrian Cristal, Osman Unsal, and Ken Mai, <u>Flash Correct-and-Refresh: Retention-Aware Error Management for Increased Flash Memory Lifetime</u>, ICCD 2012.
2) Yu Cai, Yixin Luo, Erich F. Haratsch, Ken Mai, and Onur Mutlu, <u>Data Retention in MLC NAND Flash Memory: Characterization, Optimization and Recovery</u>, HPCA 2015.
3) Yixin Luo, Yu Cai, Saugata Ghose, Jongmoo Choi, and Onur Mutlu, <u>WARM: Improving NAND Flash Memory Lifetime with Write-hotness Aware Retention Management</u>, MSST 2015.

#### 2. Flash-based SSD prototyping and testing platform

4) Yu Cai, Erich F. Haratsch, Mark McCartney, Ken Mai, <u>FPGA-based solid-state drive prototyping platform</u>, FCCM 2011.



## Errors in Flash Memory (II)

#### 3. Overall flash error analysis

5) Yu Cai, Erich F. Haratsch, Onur Mutlu, and Ken Mai, <u>Error Patterns in MLC NAND Flash Memory: Measurement, Characterization, and Analysis</u>, DATE 2012.
6) Yu Cai, Gulay Yalcin, Onur Mutlu, Erich F. Haratsch, Adrian Cristal, Osman Unsal, and Ken Mai, <u>Error Analysis and Retention-Aware Error Management for NAND Flash Memory</u>, ITJ 2013.

#### 4. Program and erase noise study

7) Yu Cai, Erich F. Haratsch, Onur Mutlu, and Ken Mai, <u>Threshold Voltage Distribution in MLC NAND Flash Memory: Characterization, Analysis and Modeling</u>, DATE 2013.



## Errors in Flash Memory (III)

#### 5. Cell-to-cell interference characterization and tolerance

8) Yu Cai, Onur Mutlu, Erich F. Haratsch, and Ken Mai, <u>Program Interference in MLC NAND Flash Memory: Characterization, Modeling, and Mitigation</u>, ICCD 2013.
9) Yu Cai, Gulay Yalcin, Onur Mutlu, Erich F. Haratsch, Osman Unsal, Adrian Cristal, and Ken Mai, <u>Neighbor-Cell Assisted Error Correction for MLC NAND Flash Memories</u>, SIGMETRICS 2014.

#### 6. Read disturb noise study

10) Yu Cai, Yixin Luo, Saugata Ghose, Erich F. Haratsch, Ken Mai, and Onur Mutlu, <u>Read Disturb Errors in MLC NAND Flash Memory: Characterization and Mitigation</u>, DSN 2015.

#### 7. Flash errors in the field

11) Justin Meza, Qiang Wu, Sanjeev Kumar, and Onur Mutlu, <u>A Large-Scale Study of Flash Memory Errors in the Field</u>, SIGMETRICS 2015.

### More Detail on Flash Error Analysis

 Yu Cai, Gulay Yalcin, Onur Mutlu, Erich F. Haratsch, Adrian Cristal, Osman Unsal, and Ken Mai,
"Error Analysis and Retention-Aware Error Management for NAND Flash Memory" Intel Technology Journal (ITJ) Special Issue on Memory Resiliency, Vol. 17, No. 1, May 2013.

Intel® Technology Journal | Volume 17, Issue 1, 2013

ERROR ANALYSIS AND RETENTION-AWARE ERROR MANAGEMENT FOR NAND FLASH MEMORY



## Error Analysis and Management for MLC NAND Flash Memory

Onur Mutlu onur@cmu.edu

(joint work with Yu Cai, Gulay Yalcin, Erich Haratsch, Ken Mai, Adrian Cristal, Osman Unsal)

August 7, 2014 Flash Memory Summit 2014, Santa Clara, CA









### Executive Summary

- Problem: MLC NAND flash memory reliability/endurance is a key challenge for satisfying future storage systems' requirements
- Our Goals: (1) Build reliable error models for NAND flash memory via experimental characterization, (2) Develop efficient techniques to improve reliability and endurance
- This talk provides a "flash" summary of our recent results published in the past 3 years:
  - Experimental error and threshold voltage characterization [DATE'12&13]
  - Retention-aware error management [ICCD'12]
  - Program interference analysis and read reference V prediction [ICCD'13]
  - Neighbor-assisted error correction [SIGMETRICS'14]





- Background, Motivation and Approach
- Experimental Characterization Methodology
- Error Analysis and Management
  - Characterization Results
  - Retention-Aware Error Management
  - Threshold Voltage and Program Interference Analysis
  - Read Reference Voltage Prediction
  - Neighbor-Assisted Error Correction
- Summary

### Evolution of NAND Flash Memory



Seaung Suk Lee, "Emerging Challenges in NAND Flash Technology", Flash Summit 2011 (Hynix)

- Flash memory is widening its range of applications
  - Portable consumer devices, laptop PCs and enterprise servers

### Flash Challenges: Reliability and Endurance



E. Grochowski et al., "Future technology challenges for NAND flash and HDD products", Flash Memory Summit 2012

### NAND Flash Memory is Increasingly Noisy



### Future NAND Flash-based Storage Architecture



Our Goals:

- Build reliable error models for NAND flash memory
- Design efficient reliability mechanisms based on the model

### NAND Flash Error Model





#### **Experimentally characterize and model dominant errors**

- Cai et al., "Error Patterns in MLC NAND Flash Memory: Measurement, Characterization, and Analysis", **DATE 2012**
- Cai et al., "Threshold Voltage Distribution in MLC NAND Flash Memory: Characterization, Analysis, and Modeling", **DATE 2013**
- Cai et al., "Program Interference in MLC NAND Flash Memory: Characterization, Modeling, and Mitigation", **ICCD 2013**
- Cai et al., "Neighbor-Cell Assisted Error Correction in MLC NAND Flash Memories", **SIGMETRICS 2014**
- Cai et al., "Flash Correct-and-Refresh: Retention-Aware Error Management for Increased Flash Memory Lifetime", **ICCD 2012**
- Cai et al., "Error Analysis and Retention-Aware Error Management for NAND Flash Memory", **ITJ 2013**



Goals:

- Understand error mechanisms and develop reliable predictive models for MLC NAND flash memory errors
- Develop efficient error management techniques to mitigate errors and improve flash reliability and endurance
- Approach:
  - Solid experimental analyses of errors in real MLC NAND flash memory → drive the understanding and models
  - Understanding, models and creativity → drive the new techniques





- Background, Motivation and Approach
- Experimental Characterization Methodology
- Error Analysis and Management
  - Main Characterization Results
  - Retention-Aware Error Management
  - Threshold Voltage and Program Interference Analysis
  - Read Reference Voltage Prediction
  - Neighbor-Assisted Error Correction
- Summary

### Experimental Testing Platform





[Cai+, FCCM 2011, DATE 2012, ICCD 2012, DATE 2013, ITJ 2013, ICCD 2013, SIGMETRICS 2014]

NAND Daughter Board

Cai et al., FPGA-based Solid-State Drive prototyping platform, FCCM 2011.

## NAND Flash Usage and Error Model



### Methodology: Error and ECC Analysis

- Characterized errors and error rates of 3x-nm and 2y-nm MLC NAND flash using an experimental FPGA-based platform
  - [Cai+, DATE'12, ICCD'12, DATE'13, ITJ'13, ICCD'13, SIGMETRICS'14]

- Quantified Raw Bit Error Rate (RBER) at a given P/E cycle
  - Raw Bit Error Rate: Fraction of erroneous bits without any correction
- Quantified error correction capability (and area and power consumption) of various BCH-code implementations
  - Identified how much RBER each code can tolerate
    - $\rightarrow$  how many P/E cycles (flash lifetime) each code can sustain
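To illustrate how a code's correction capability translates into a tolerable RBER, the sketch below assumes independent bit errors and a code that corrects up to `t` errors per codeword (as a BCH code does). The codeword length, `t`, and the target failure probability are hypothetical parameters, not values from the study.

```python
from math import exp, lgamma, log


def uncorrectable_prob(n_bits, t, rber):
    """P(more than t bit errors in an n_bits codeword), assuming i.i.d. errors.

    The binomial tail is summed in log space to avoid overflow/underflow.
    """
    if rber <= 0.0:
        return 0.0
    if rber >= 1.0:
        return 1.0 if t < n_bits else 0.0
    log_p, log_q = log(rber), log(1.0 - rber)
    tail = 0.0
    for k in range(t + 1, n_bits + 1):
        log_term = (lgamma(n_bits + 1) - lgamma(k + 1) - lgamma(n_bits - k + 1)
                    + k * log_p + (n_bits - k) * log_q)
        tail += exp(log_term)
    return min(tail, 1.0)


def max_tolerable_rber(n_bits, t, target, iters=60):
    """Largest RBER (found by bisection on [0, 0.5]) whose uncorrectable
    probability stays at or below target."""
    lo, hi = 0.0, 0.5
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if uncorrectable_prob(n_bits, t, mid) <= target:
            lo = mid
        else:
            hi = mid
    return lo
```

Combined with a measured RBER-versus-P/E-cycle curve, the tolerable RBER gives the number of P/E cycles a code can sustain; that curve comes from characterization and is not modeled here.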



- Four types of errors [Cai+, DATE 2012]
- Caused by common flash operations
  - Read errors
  - Erase errors
  - Program (interference) errors
- Caused by flash cell losing charge over time
  - Retention errors
    - Whether an error happens depends on required retention time
    - Especially problematic in MLC flash because threshold voltage window to determine stored value is smaller






### **Observations:** Flash Error Analysis





- Raw bit error rate increases exponentially with P/E cycles
- Retention errors are dominant (>99% for 1-year ret. time)
- Retention errors increase with retention time requirement

Cai et al., Error Patterns in MLC NAND Flash Memory, DATE 2012.

### Retention Error Mechanism





Electron loss from the floating gate causes retention errors

- Cells with more programmed electrons suffer more from retention errors
- Threshold voltage is more likely to shift by one window than by multiple

### Retention Error Value Dependency



 Cells with more programmed electrons tend to suffer more from retention noise (i.e. 00 and 01)



 Yu Cai, Erich F. Haratsch, Onur Mutlu, and Ken Mai, "Error Patterns in MLC NAND Flash Memory: Measurement, Characterization, and Analysis" Proceedings of the Design, Automation, and Test in Europe Conference (DATE), Dresden, Germany, March 2012. Slides (ppt)








- Key Observations:
  - Retention errors are the dominant source of errors in flash memory [Cai+ DATE 2012][Tanakamaru+ ISSCC 2011]
    → limit flash lifetime as they increase over time
  - Retention errors can be corrected by "refreshing" each flash page periodically

#### Key Idea:

- Periodically read each flash page,
- Correct its errors using "weak" ECC, and
- Either remap it to a new physical page or reprogram it in-place,
- Before the page accumulates more errors than ECC can correct
- Optimization: Adapt refresh rate to endured P/E cycles
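The key idea can be sketched as one refresh pass over a toy page model. This is an illustration under simplifying assumptions, not the FCR implementation: `FlashPage`, `fcr_refresh`, and the counters are hypothetical names, and the ECC is reduced to "succeeds if at most `ecc_capability` bits are wrong".

```python
from dataclasses import dataclass, field


@dataclass
class FlashPage:
    error_bits: set = field(default_factory=set)  # bit positions currently in error
    reprogram_count: int = 0                      # in-place reprograms so far
    remap_count: int = 0                          # remaps to a new physical page


def fcr_refresh(pages, ecc_capability, in_place):
    """Run one refresh pass. Returns the indices of pages that had already
    accumulated more errors than the ECC can correct."""
    uncorrectable = []
    for index, page in enumerate(pages):
        if len(page.error_bits) > ecc_capability:
            uncorrectable.append(index)   # the refresh came too late for this page
            continue
        page.error_bits.clear()           # read the page and correct it with ECC
        if in_place:
            page.reprogram_count += 1     # reprogram the corrected data in place
        else:
            page.remap_count += 1         # write the corrected data to a new page
    return uncorrectable
```

For example, with `ecc_capability=4`, a page holding two errors is cleaned and reprogrammed, while a page holding five errors is reported as uncorrectable.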

### FCR: Two Key Questions



- How to refresh?
  - Remap a page to another one
  - Reprogram a page (in-place)
  - Hybrid of remap and reprogram
- When to refresh?
  - Fixed period
  - Adapt the period to retention error severity



In-place reprogramming:

- Pro: No remapping needed $\rightarrow$ no additional erase operations
- Con: Increases the occurrence of program errors

### Normalized Flash Memory Lifetime



Lifetime of FCR much higher than lifetime of stronger ECC









Adaptive-rate refresh: <1.8% energy increase until daily refresh is triggered
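One way to picture adaptive-rate refresh is as choosing, for the current wear level, the longest refresh period that still keeps the expected retention errors within what the ECC can correct. The sketch below assumes a caller-supplied error predictor; `choose_refresh_period`, `predict_errors`, and the candidate periods are hypothetical and do not come from the paper.

```python
def choose_refresh_period(pe_cycles, candidate_days, predict_errors, ecc_capability):
    """Return the longest candidate period (in days) for which the predicted
    error count at the given P/E cycle count stays within ecc_capability.

    predict_errors(pe_cycles, days) is a caller-supplied model (hypothetical).
    Returns None when even the shortest candidate period is not sufficient.
    """
    for days in sorted(candidate_days, reverse=True):
        if predict_errors(pe_cycles, days) <= ecc_capability:
            return days
    return None
```

With this shape, a lightly worn block refreshes rarely or not at all, and the refresh rate (and its energy cost) rises only as wear makes retention errors more severe.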



 Yu Cai, Gulay Yalcin, Onur Mutlu, Erich F. Haratsch, Adrian Cristal, Osman Unsal, and Ken Mai,
"Flash Correct-and-Refresh: Retention-Aware Error Management for Increased Flash Memory Lifetime" Proceedings of the
30th IEEE International Conference on Computer Design (ICCD), Montreal, Quebec, Canada, September 2012.
Slides (ppt) (pdf)








- How does threshold voltage (Vth) distribution of different programmed states change over flash lifetime?
- Can we model it accurately and predict the Vth changes?
- Can we build mechanisms that can correct for Vth changes? (thereby reducing read error rates)

### Threshold Voltage Distribution Model



Gaussian distribution with additive white noise

- As P/E cycles increase ...
  - Distribution shifts to the right
  - Distribution becomes wider

### Threshold Voltage Distribution Model

- Vth distribution can be modeled with ~95% accuracy as a Gaussian distribution with additive white noise
- Distortion in Vth over P/E cycles can be modeled and predicted as an exponential function of P/E cycles
  - With more than 95% accuracy
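To see why a right shift and a widening of the distributions raise the read error rate, the sketch below models two adjacent states as plain Gaussians (the additive white noise term is omitted) and computes the misread probability at a fixed read reference voltage. All means, standard deviations, and the reference value are hypothetical, in arbitrary units.

```python
from math import erf, sqrt


def normal_cdf(x, mean, std):
    """Cumulative distribution function of a Gaussian."""
    return 0.5 * (1.0 + erf((x - mean) / (std * sqrt(2.0))))


def misread_prob(v_ref, lower, upper):
    """Probability of misreading a cell that is in one of two adjacent states.

    lower and upper are (mean, std) pairs of the two Vth distributions.
    Both states are assumed equally likely (an assumption of this sketch).
    """
    p_lower_read_high = 1.0 - normal_cdf(v_ref, *lower)
    p_upper_read_low = normal_cdf(v_ref, *upper)
    return 0.5 * (p_lower_read_high + p_upper_read_low)


# Fresh cells versus worn cells (shifted right and wider), same reference.
fresh = misread_prob(1.5, (1.0, 0.2), (2.0, 0.2))
worn = misread_prob(1.5, (1.2, 0.3), (2.2, 0.3))
print(fresh < worn)
```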

### More Detail on Threshold Voltage Model

 Yu Cai, Erich F. Haratsch, Onur Mutlu, and Ken Mai, "Threshold Voltage Distribution in MLC NAND Flash Memory: Characterization, Analysis and Modeling" Proceedings of the <u>Design, Automation, and Test in Europe Conference</u> (DATE), Grenoble, France, March 2013. Slides (ppt)

### Program Interference Errors



- When a cell is being programmed, voltage level of a neighboring cell changes (unintentionally) due to parasitic capacitance coupling
  - $\rightarrow$  can change the data value stored
- Also called program interference error
- Causes neighboring cell voltage to increase (shift right)
- Once retention errors are minimized, these errors can become dominant

### How Current Flash Cells are Programmed

Programming 2-bit MLC NAND flash memory in two steps 



### Basics of Program Interference





### Traditional Model for Vth Change





Traditional model for victim cell threshold voltage change

$$\Delta V_{victim} = \left(2C_x \Delta V_x + C_y \Delta V_y + 2C_{xy} \Delta V_{xy}\right) / C_{total}$$

Not accurate and requires knowledge of coupling caps!
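Written as code, the traditional model makes its dependence on the coupling capacitances explicit: every argument prefixed `c_` below must be known in advance. The function name and the example values are hypothetical.

```python
def delta_v_victim_traditional(c_x, c_y, c_xy, c_total, dv_x, dv_y, dv_xy):
    """Victim-cell Vth change under the traditional capacitive-coupling model.

    c_x, c_y, c_xy: coupling capacitances to the x, y, and diagonal neighbors.
    dv_x, dv_y, dv_xy: Vth changes of those neighbors.
    """
    return (2 * c_x * dv_x + c_y * dv_y + 2 * c_xy * dv_xy) / c_total


# Example with made-up values: all couplings 1, total capacitance 10,
# every neighbor shifted by 1 unit.
print(delta_v_victim_traditional(1, 1, 1, 10, 1, 1, 1))
```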



 Develop a new, more accurate and easier to implement model for program interference

#### Idea:

- Empirically characterize and model the effect of neighbor cell Vth changes on the Vth of the victim cell
- Fit neighbor Vth change to a linear regression model and find the coefficients of the model via empirical measurement

$$\Delta V_{victim}(n,j) = \sum_{y=j-K}^{j+K} \sum_{x=n+1}^{n+M} \alpha(x,y) \Delta V_{neighbor}(x,y) + \alpha_0 V_{victim}^{before}(n,j)$$

Can be measured

#### Developing a New Model via Empirical Measurement

Feature extraction for V<sub>th</sub> changes based on characterization

- Threshold voltage changes on aggressor cell
- Original state of victim cell
- Enhanced linear regression model

$$\Delta V_{victim}(n,j) = \sum_{y=j-K}^{j+K} \sum_{x=n+1}^{n+M} \alpha(x,y) \Delta V_{neighbor}(x,y) + \alpha_0 V_{victim}^{before}(n,j)$$

$$Y = X\alpha + \varepsilon \quad (\text{vector expression})$$

Maximum likelihood estimation of the model coefficients

$$\underset{\alpha}{\operatorname{arg\,min}}(\|X \times \alpha - Y\|_{2}^{2} + \lambda \|\alpha\|_{1})$$
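The objective above is an L1-regularized least-squares problem. One standard way to solve it is coordinate descent with soft thresholding, sketched below in plain Python. This is not necessarily the solver used in the paper. Each row of `X` would hold the neighbor Vth changes plus the victim's prior Vth, and `Y` the measured victim Vth changes; the function names are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_fit(X, Y, lam, n_iters=200):
    """Minimize ||X a - Y||_2^2 + lam * ||a||_1 by cyclic coordinate descent.

    X is a list of rows (lists of floats), Y a list of floats.
    """
    n_features = len(X[0])
    alpha = [0.0] * n_features
    residual = list(Y)                      # Y - X @ alpha, with alpha = 0
    columns = [[row[j] for row in X] for j in range(n_features)]
    for _ in range(n_iters):
        for j, col in enumerate(columns):
            z = sum(c * c for c in col)
            if z == 0.0:
                continue                    # an all-zero feature carries no information
            # Correlate column j with the residual that excludes feature j.
            rho = sum(c * (r + c * alpha[j]) for c, r in zip(col, residual))
            new_value = soft_threshold(rho, lam / 2.0) / z
            delta = new_value - alpha[j]
            if delta != 0.0:
                residual = [r - c * delta for c, r in zip(col, residual)]
                alpha[j] = new_value
    return alpha
```

Larger values of `lam` drive more coefficients to exactly zero, which matches the intent of keeping only the neighbors that matter.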

### Effect of Neighbor Voltages on the Victim



- Immediately-above cell interference is dominant
- Immediately-diagonal neighbor is the second dominant
- Far neighbor cell interference exists
- Victim cell's Vth has negative effect on interference

Cai et al., Program Interference in MLC NAND Flash Memory, ICCD 2013

# New Model for Program Interference Flash Memory



Cai et al., Program Interference in MLC NAND Flash Memory, ICCD 2013

# Model Accuracy





### Many Other Results in the Paper



 Yu Cai, Onur Mutlu, Erich F. Haratsch, and Ken Mai, "Program Interference in MLC NAND Flash Memory: Characterization, Modeling, and Mitigation" Proceedings of the <u>31st IEEE International Conference on Computer Design</u> (ICCD), Asheville, NC, October 2013. Slides (pptx) (pdf)

Lightning Session Slides (pdf)








- So, what can we do with the model?
- Goal: Mitigate the effects of voltage shifts caused by program interference

### Optimum Read Reference for Flash Memory



There exists an optimal read reference voltage

 Predictable if the statistics (i.e. mean, variance) of threshold voltage distributions are characterized and modeled
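Given the mean and standard deviation of two adjacent states, the optimal read reference can be located numerically. The sketch below assumes Gaussian states with equal prior probability and uses a simple grid search between the two means; the function names and example parameters are hypothetical.

```python
from math import erf, sqrt


def _cdf(x, mean, std):
    """Cumulative distribution function of a Gaussian."""
    return 0.5 * (1.0 + erf((x - mean) / (std * sqrt(2.0))))


def optimal_read_reference(lower, upper, steps=1000):
    """Grid-search the read reference voltage between the two state means
    that minimizes the misread probability.

    lower and upper are (mean, std) pairs; equal priors are assumed.
    """
    (mean_lo, std_lo), (mean_hi, std_hi) = lower, upper
    best_v, best_err = mean_lo, float("inf")
    for i in range(steps + 1):
        v = mean_lo + (mean_hi - mean_lo) * i / steps
        err = 0.5 * ((1.0 - _cdf(v, mean_lo, std_lo)) + _cdf(v, mean_hi, std_hi))
        if err < best_err:
            best_v, best_err = v, err
    return best_v
```

With equal standard deviations the optimum sits at the midpoint of the two means; when one state is narrower, the optimum moves toward that state.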

#### Optimum Read Reference Voltage Prediction



- Vth shift learning (done every ~1k P/E cycles)
  - Program sample cells with known data pattern and test Vth
  - Program aggressor neighbor cells and test victim Vth after interference
  - Characterize the mean shift in Vth (i.e., program interference noise)
- Optimum read reference voltage prediction
  - Default read reference voltage + Predicted mean Vth shift by model
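The two steps above reduce to a small amount of arithmetic. In the sketch below the mean shift is taken directly from sample-cell measurements, which stands in for the model's prediction; the function names and the sample values are hypothetical.

```python
from statistics import mean


def learn_mean_shift(vth_before, vth_after):
    """Mean Vth shift of the sample (victim) cells, measured before and
    after the aggressor neighbor cells are programmed."""
    return mean(after - before for before, after in zip(vth_before, vth_after))


def predicted_read_reference(default_v_ref, mean_shift):
    """Optimum read reference = default read reference + predicted mean shift."""
    return default_v_ref + mean_shift


# Made-up measurements in arbitrary voltage units.
shift = learn_mean_shift([1.0, 1.1, 0.9], [1.1, 1.2, 1.0])
print(predicted_read_reference(1.5, shift))
```

Since the learning step runs only about every 1k P/E cycles, its cost is amortized over many reads.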

#### Effect of Read Reference Voltage Prediction



Read reference voltage prediction reduces raw BER (by 64%) and increases the P/E cycle lifetime (by 30%)

### More on Read Reference Voltage Prediction

 Yu Cai, Onur Mutlu, Erich F. Haratsch, and Ken Mai, "Program Interference in MLC NAND Flash Memory: Characterization, Modeling, and Mitigation" Proceedings of the <u>31st IEEE International Conference on Computer Design</u> (ICCD), Asheville, NC, October 2013. <u>Slides (pptx) (pdf)</u> Lightning Session Slides (pdf)










 Develop a better error correction mechanism for cases where ECC fails to correct a page



- Immediate neighbor cell has the most effect on the victim cell when programmed
- A single set of read reference voltages is used to determine the value of the (victim) cell
- The set of read reference voltages is determined based on the *overall threshold voltage distribution of all cells* in flash memory

#### New Observations [Cai+ SIGMETRICS'14]

- Vth distributions of cells with different-valued immediate-neighbor cells are significantly different
  - Because neighbor value affects the amount of Vth shift
- Corollary: If we know the value of the immediate-neighbor, we can find a more accurate set of read reference voltages based on the "conditional" threshold voltage distribution

Cai et al., Neighbor-Cell Assisted Error Correction for MLC NAND Flash Memories, SIGMETRICS 2014.

### Secrets of Threshold Voltage Distributions



### If We Knew the Immediate Neighbor ...

Then, we could choose a different read reference voltage to more accurately read the "victim" cell

# Overall vs Conditional Reading





- Using the optimum read reference voltage based on the overall distribution leads to more errors
- Better to use the optimum read reference voltage based on the conditional distribution (i.e., value of the neighbor)
  - Conditional distributions of two states are farther apart from each other

#### Measurement Results





Raw BER of conditional reading is much smaller than that of overall reading

## Idea: Neighbor Assisted Correction (NAC)

- Read a page with the read reference voltages based on overall Vth distribution (same as today) and buffer it
- If ECC fails:
  - Read the immediate-neighbor page
  - Re-read the page using the read reference voltages corresponding to the voltage distribution assuming a particular immediate-neighbor value
  - Replace the buffered values of the cells that have that particular immediate-neighbor value with the re-read values
  - Apply ECC again
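The steps above can be sketched as a single read routine. The callbacks `read_page`, `read_neighbor`, and `ecc_decode`, as well as the shape of `conditional_vrefs`, are hypothetical stand-ins for controller operations; this is an outline of the flow, not the authors' implementation.

```python
def neighbor_assisted_read(read_page, read_neighbor, ecc_decode,
                           overall_vrefs, conditional_vrefs):
    """Read a page, falling back to neighbor-assisted correction on ECC failure.

    read_page(vrefs)   -> list of cell values read with the given references
    read_neighbor()    -> list of immediate-neighbor cell values
    ecc_decode(values) -> (ok, corrected_values)
    conditional_vrefs  -> list of (neighbor_value, vrefs) in priority order
    Returns the corrected data, or None if every attempt fails.
    """
    buffered = list(read_page(overall_vrefs))    # normal read, same as today
    ok, data = ecc_decode(buffered)
    if ok:
        return data
    neighbor = read_neighbor()                   # only needed after an ECC failure
    for neighbor_value, vrefs in conditional_vrefs:
        reread = read_page(vrefs)                # re-read tuned for this neighbor value
        for i, value in enumerate(neighbor):
            if value == neighbor_value:
                buffered[i] = reread[i]          # trust the re-read for these cells only
        ok, data = ecc_decode(buffered)
        if ok:
            return data
    return None
```

Because the extra reads happen only after an ECC failure, the common case costs the same as a normal read.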

# Neighbor Assisted Correction Flow





- Trigger neighbor-assisted reading only when ECC fails
- Read neighbor values and use corresponding read reference voltages in a prioritized order until ECC passes

### Lifetime Extension with NAC





### Performance Analysis of NAC



No performance loss within nominal lifetime and with reasonable (1%) ECC fail rates





 Yu Cai, Gulay Yalcin, Onur Mutlu, Erich F. Haratsch, Osman Unsal, Adrian Cristal, and Ken Mai, "Neighbor-Cell Assisted Error Correction for MLC NAND Flash Memories" Proceedings of the ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), Austin, TX, June 2014. Slides (ppt) (pdf)






### Executive Summary



- Problem: MLC NAND flash memory reliability/endurance is a key challenge for satisfying future storage systems' requirements
- We are: (1) Building reliable error models for NAND flash memory via experimental characterization, (2) Developing efficient techniques to improve reliability and endurance
- This talk provided a "flash" summary of our recent results published in the past 3 years:
  - Experimental error and threshold voltage characterization [DATE'12&13]
  - Retention-aware error management [ICCD'12]
  - Program interference analysis and read reference V prediction [ICCD'13]
  - Neighbor-assisted error correction [SIGMETRICS'14]

# Readings (I)



 Yu Cai, Erich F. Haratsch, Onur Mutlu, and Ken Mai, <u>"Error Patterns in MLC NAND Flash Memory: Measurement, Characterization, and Analysis"</u> *Proceedings of the <u>Design, Automation, and Test in Europe Conference</u> (DATE), Dresden, Germany, March 2012. <u>Slides (ppt)</u>*

Yu Cai, Gulay Yalcin, Onur Mutlu, Erich F. Haratsch, Adrian Cristal, Osman Unsal, and Ken Mai, "Flash Correct-and-Refresh: Retention-Aware Error Management for Increased Flash Memory Lifetime" Proceedings of the <u>30th IEEE International Conference on Computer Design</u> (**ICCD**), Montreal, Quebec, Canada, September 2012. <u>Slides (ppt)</u> (pdf)

Yu Cai, Erich F. Haratsch, Onur Mutlu, and Ken Mai, <u>"Threshold Voltage Distribution in MLC NAND Flash Memory: Characterization, Analysis and Modeling"</u> *Proceedings of the <u>Design, Automation, and Test in Europe Conference</u> (DATE), Grenoble, France, March 2013. <u>Slides (ppt)</u>*

# Readings (II)



Yu Cai, Gulay Yalcin, Onur Mutlu, Erich F. Haratsch, Adrian Cristal, Osman Unsal, and Ken Mai, "Error Analysis and Retention-Aware Error Management for NAND Flash Memory" Intel Technology Journal (ITJ) Special Issue on Memory Resiliency, Vol. 17, No. 1, May 2013.

- Yu Cai, Onur Mutlu, Erich F. Haratsch, and Ken Mai, "Program Interference in MLC NAND Flash Memory: Characterization, Modeling, and Mitigation" Proceedings of the <u>31st IEEE International Conference on Computer Design (ICCD)</u>, Asheville, NC, October 2013. <u>Slides (pptx) (pdf) Lightning Session Slides (pdf)</u>
- Yu Cai, Gulay Yalcin, Onur Mutlu, Erich F. Haratsch, Osman Unsal, Adrian Cristal, and Ken Mai, "Neighbor-Cell Assisted Error Correction for MLC NAND Flash Memories" Proceedings of the <u>ACM International Conference on Measurement and Modeling of Computer Systems</u> (SIGMETRICS), Austin, TX, June 2014. Slides (ppt) (pdf)





All are available at

http://users.ece.cmu.edu/~omutlu/projects.htm



### Related Videos and Course Materials

- Computer Architecture Lecture Videos on Youtube
  - https://www.youtube.com/playlist?list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ
- Computer Architecture Course Materials
  - http://www.ece.cmu.edu/~ece447/s13/doku.php?id=schedule
- Advanced Computer Architecture Course Materials
  - http://www.ece.cmu.edu/~ece740/f13/doku.php?id=schedule
- Advanced Computer Architecture Lecture Videos on Youtube
  - <u>https://www.youtube.com/playlist?list=PL5PHm2jkkXmgDN1PLwOY\_tGtUlynnyV6D</u>



# Thank you.

#### Feel free to email me with any questions & feedback

<u>onur@cmu.edu</u> <u>http://users.ece.cmu.edu/~omutlu/</u>



# Error Analysis and Management for MLC NAND Flash Memory

Onur Mutlu onur@cmu.edu

(joint work with Yu Cai, Gulay Yalcin, Erich F. Haratsch, Ken Mai, Adrian Cristal, Osman Unsal)

August 7, 2014 Flash Memory Summit 2014, Santa Clara, CA





