# FIRM: Fair and HIgh-PerfoRmance Memory Control for Persistent Memory Systems

Jishen Zhao, Onur Mutlu, Yuan Xie

## **New Design Opportunity with NVMs**

- Databases, file systems, key-value stores (in-memory data structures can immediately become permanent)
- New use case of NVM: concurrently running two types of applications, persistent and non-persistent [Kannan+ HPCA'14, Liu+ ASPLOS'14, Meza+ WEED'14]

## Focus of Our Work: Memory Controller Design

**Fair and High-Performance Memory Control**

## Why Another Memory Control Scheme?

#### **Memory Controller**

Determines which requests can be sent on the memory bus to be serviced.

This talk addresses two questions:

- Why conventional memory control schemes are inefficient in persistent memory systems
- How to design fair and high-performance memory control in this new scenario

#### Conventional memory control schemes

| Assumptions | Design choices |
| --- | --- |
| 1. Reads are on the critical path of application execution (application execution is read-dependent) | 1. Prioritize reads over writes |
| 2. Applications are usually read-intensive | 2. Delay writes until they overflow the write queue (infrequent write queue drains) |

These assumptions no longer hold in persistent memory, which needs to support data persistence (data consistency when the system suddenly crashes or loses power).

Mechanisms: multiversioning and write-order control

## Implication of Multiversioning

Multiversioning mechanisms:

- Logging
- Copy-on-write

The two versions are not updated at the same time. This significantly increases write traffic: two writes for each data update.

Effect on assumption 2: "Applications are usually read-intensive" becomes "Persistent applications are usually write-intensive".

## **Implication of Write-order Control**

The two versions are not supposed to be updated at the same time. Writes are issued in order:

$$A' = \{A'_{1}, A'_{2}, A'_{3}\}$$

$$A = \{A_{1}, A_{2}, A_{3}\}$$

They may be reordered by caches and memory controllers (MCs), for example:

$$A_2, A'_2, A'_1, A_3, \ldots$$

Write-order control restricts the ordering of writes arriving at the memory. This makes application execution write-dependent: subsequent writes, reads, and computation can all depend on a previously issued write [Volos+ ASPLOS'11, Coburn+ ASPLOS'11, Condit+ SOSP'09, Venkataraman+ FAST'11].

Effect on assumption 1: "Reads are on the critical path of application execution (application execution is read-dependent)" becomes "Writes are also on the critical execution path (application execution is write-dependent)".

#### **Assumptions in persistent memory vs. conventional design choices**

| Assumptions | Design choices |
| --- | --- |
| 1. Writes are also on the critical execution path | 1. Prioritize reads over writes |
| 2. Persistent applications are usually write-intensive | 2. Delay writes until they overflow the write queue |

Result: unfairness and performance degradation.

Frequent write queue drains: the memory controller frequently stalls reads to drain the write queue, and frequently switches between servicing reads and servicing writes.

#### **Bus Turnaround Overhead**

*tRTW ~ 7.5 ns, tWTR ~ 15 ns* [Kim+ ISCA'12]

[Figure: read and write requests to NVM over time. When the write queue overflows, a write queue drain is forced by stalling reads. The timeline distinguishes bus cycles wasted on bus turnarounds (tRTW and tWTR) from bus cycles that perform memory accesses.]

[Slide callouts, garbled in extraction: "90%", "300%", "90% energy"]

## Low Bank-level Parallelism (BLP)

Writes in persistent memory have low bank-level parallelism (BLP). Reads are stalled for a long time while the bus is servicing writes to persistent memory.

#### **Assumptions**

1. Writes are also on the critical execution path
2. Persistent applications are usually write-intensive
3. Writes of persistent applications have low BLP

## **Design Principles**

- Persistence-Aware Memory Scheduling: minimizing write queue drains and bus turnarounds, while ensuring fairness
- Persistent Write Striding: increasing BLP of writes to persistent memory to fully utilize memory bandwidth

## **Persistence-Aware Memory Scheduling**

Minimizing write queue drains and bus turnarounds, while ensuring fairness

#### **Batch**

A batch is a group of requests in the read queue or the write queue that are:

- From the same source
- To the same NVM row
- In the same R/W direction

Problem: when to switch between servicing read batches and write batches

- Switching rarely: low bus turnaround overhead, but risks frequent write queue drains
- Switching often: less likely to starve reads and writes, but higher bus turnaround overhead

Key idea 1: balance the amount of time spent in continuously servicing reads and writes

Key idea 2: the time to service read batches and write batches is JUST long enough. Pick the choice with the shortest times.

## **Persistent Write Striding**

Increasing BLP of persistent writes to fully utilize memory bandwidth

Key idea: stride writes by an offset to remap them to different banks

Benefit: no bookkeeping required

[Figure: a log that spans multiple banks in BA-NVM. Requests to Log0, Log1, and Log2 are remapped by stride + offset to different banks (Bank 2 and Bank 3 are labeled) and serviced in parallel.]

## **Experimental Setup**

#### Simulator

McSimA+ [Ahn+ ISPASS'13] (modified)

#### Configuration

- Four-core processor, eight threads
- Private caches: L1/L2, SRAM, 64KB/256KB per core
- Shared last-level cache: L3, SRAM, 2MB per core
- Main memory: STT-MRAM DIMM (8GB)

#### Benchmarks

- 7 non-persistent applications
- 3 persistent applications

#### Metrics

- Weighted speedup and maximum slowdown

### **Performance**

### **Fairness**

#### **Maximum Slowdown**

#### **Bus Turnaround Overhead**

(The worst case of previous designs: 17%)

#### **Other Results**

Sensitivity study on various:

- NVM latencies
- Row-buffer sizes
- Number of threads

## **Summary**

## Thank you!
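
## **Supplementary Sketch: Batches and Bus Turnaround Cost**

The batch definition (same source, same NVM row, same R/W direction) and the bus turnaround timings (tRTW ~ 7.5 ns, tWTR ~ 15 ns) from the scheduling slides can be illustrated with a small sketch. This is not the FIRM scheduler itself: the `Request` fields, the function names, and the choice to group only *consecutive* matching requests are assumptions made for illustration. It only shows why frequent switching between reads and writes wastes bus time.

```python
from dataclasses import dataclass
from itertools import groupby

# Timing values quoted on the bus turnaround slide [Kim+ ISCA'12].
T_RTW_NS = 7.5   # paid when the bus switches from reads to writes
T_WTR_NS = 15.0  # paid when the bus switches from writes to reads


@dataclass(frozen=True)
class Request:
    source: int     # hypothetical ID of the issuing thread or application
    row: int        # hypothetical NVM row index
    is_write: bool  # R/W direction


def form_batches(queue):
    """Group consecutive requests sharing source, row, and direction.

    Grouping only consecutive requests is an assumption of this sketch.
    """
    def key(r):
        return (r.source, r.row, r.is_write)
    return [list(group) for _, group in groupby(queue, key=key)]


def turnaround_ns(schedule):
    """Total bus turnaround time paid by servicing requests in this order."""
    total = 0.0
    for prev, cur in zip(schedule, schedule[1:]):
        if prev.is_write != cur.is_write:
            total += T_RTW_NS if cur.is_write else T_WTR_NS
    return total


read = Request(source=0, row=5, is_write=False)
write = Request(source=1, row=9, is_write=True)

interleaved = [read, write, read, write, read]
grouped = [read, read, read, write, write]

batch_sizes = [len(b) for b in form_batches(grouped)]  # [3, 2]
cost_interleaved = turnaround_ns(interleaved)          # 45.0 ns
cost_grouped = turnaround_ns(grouped)                  # 7.5 ns
```

With the same five requests, the interleaved order pays four turnarounds (45 ns) while the grouped order pays one (7.5 ns). Servicing longer runs in one direction lowers this overhead, at the cost of making the other direction wait, which is the trade-off the scheduling slides describe.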