# Orchestrated Scheduling and Prefetching for GPGPUs

Adwait Jog, Onur Kayiran, Asit Mishra, Mahmut Kandemir, Onur Mutlu, Ravi Iyer, Chita Das

- Multithreading: Parallelize your code! Launch more threads!
- Caching: Improve replacement policies
- Main Memory: Improve memory scheduling policies
- **Prefetching**: Improve the prefetcher (look deep into the future, if you can!)

Is the warp scheduler aware of these techniques?

[Figure: an **aware warp scheduler** surrounded by multithreading, caching, main memory, and **prefetching**. Labeled prior work: Two-level Scheduling, MICRO'11; Cache-Conscious Scheduling, MICRO'12; Thread-Block-Aware Scheduling (OWL), ASPLOS'13]

# **Our Proposal**

Prefetch-Aware Warp Scheduler

- Goals:
  - Make a simple prefetcher more capable
  - Improve system performance by orchestrating scheduling and prefetching mechanisms
- 25% average IPC improvement over
  - Prefetching + Conventional Warp Scheduling Policy
- 7% average IPC improvement over
  - Prefetching + Best Previous Warp Scheduling Policy

# Outline

- Proposal
- Background and Motivation
- Prefetch-aware Scheduling
- Evaluation
- Conclusions

# High-Level View of a GPU

# Warp Scheduling Policy

- Equal scheduling priority
  - Round-Robin (RR) execution
- Problem: Warps stall roughly at the same time

# Accessing DRAM ...

[Figure labels: Bank 1, Bank 2; High Row Buffer Locality]

# Warp Scheduler Perspective (Summary)

| Warp Scheduler | Forms Multiple Warp Groups? | DRAM Bandwidth Utilization: Bank-Level Parallelism | DRAM Bandwidth Utilization: Row Buffer Locality |
|------------------|-----------------------------|-----------------------------------------------------|--------------------------------------------------|
| Round-Robin (RR) | ✗ | | |
| Two-Level (TL) | | ✗ | |

# **Evaluating RR and TL schedulers**

#### (2) Prefetching: Improve DRAM Bandwidth Utilization

# Challenge: Designing a Prefetcher

#### **Our Goal**

Keep the prefetcher simple, yet get the performance benefits of a sophisticated prefetcher.

To this end, we will design a prefetch-aware warp scheduling policy.

Why?
Simple prefetching does not improve performance with existing scheduling policies.

# Let's Try...

# Simple Prefetching with TL scheduling

# Warp Scheduler Perspective (Summary)

| Warp Scheduler | Forms Multiple Warp Groups? | Simple Prefetcher Friendly? | DRAM Bandwidth Utilization: Bank-Level Parallelism | DRAM Bandwidth Utilization: Row Buffer Locality |
|------------------|-----------------------------|-----------------------------|-----------------------------------------------------|--------------------------------------------------|
| Round-Robin (RR) | ✗ | ✗ | | |
| Two-Level (TL) | | ✗ | ✗ | |

#### **Our Goal**

[Figure labels: Sophisticated Prefetcher; Prefetch-Aware (PA) Warp Scheduler; Simple Prefetcher]

# Prefetch-aware (PA) warp scheduling

- Non-consecutive warps are associated with one group
- See the paper for the generalized algorithm of the PA scheduler

[Figure label: **Group 1**]

#### Simple Prefetching with PA scheduling

The reasoning behind non-consecutive warp grouping is that groups can prefetch for each other using a simple prefetcher (for example, green warps can prefetch for red warps).

# Simple Prefetching with PA scheduling

#### DRAM Bandwidth Utilization

- 18% increase in bank-level parallelism
- 24% decrease in row buffer locality

[Figure labels: Bank 1, Bank 2; High Bank-Level Parallelism; High Row Buffer Locality]

# Warp Scheduler Perspective (Summary)

| Warp Scheduler | Forms Multiple Warp Groups? | Simple Prefetcher Friendly? | DRAM Bandwidth Utilization: Bank-Level Parallelism | DRAM Bandwidth Utilization: Row Buffer Locality |
|---------------------|-----------------------------|-----------------------------|-----------------------------------------------------|--------------------------------------------------|
| Round-Robin (RR) | ✗ | ✗ | | |
| Two-Level (TL) | | ✗ | ✗ | ✓ |
| Prefetch-Aware (PA) | | | | (with prefetching) |

## **Evaluation Methodology**

- Evaluated on GPGPU-Sim, a cycle-accurate GPU simulator
- Baseline Architecture
  - 30 SMs, 8 memory controllers, crossbar connected
  - 1300MHz, SIMT Width = 8, Max. 1024 threads/core
  - 32 KB L1 data cache, 8 KB Texture and Constant Caches
  - L1 Data Cache Prefetcher, GDDR3@1100MHz
- Applications Chosen from:
  - Mapreduce Applications
  - Rodinia: Heterogeneous Applications
  - Parboil: Throughput Computing Focused Applications
  - NVIDIA CUDA SDK: GPGPU Applications

# Spatial Locality Detector based Prefetching

# Improving Prefetching Effectiveness

#### Performance Evaluation

#### Conclusions

- Existing warp schedulers in GPGPUs cannot take advantage of simple prefetchers
  - Consecutive warps have good spatial locality, and can prefetch well for each other
  - But existing schedulers schedule consecutive warps close together in time → prefetches are too late
- We proposed prefetch-aware (PA) warp scheduling
  - Key idea: place consecutive warps into different groups
  - Enables a simple prefetcher to be timely, since warps in different groups are scheduled at separate times
- Evaluations show that PA warp scheduling improves performance over combinations of conventional (RR) and the best previous (TL) warp scheduling and prefetching policies
  - Better orchestrates warp scheduling and prefetching decisions

#### **THANKS!**

# **QUESTIONS?**

## **BACKUP**

## Effect of Prefetch-aware Scheduling

[Figure: percentage of DRAM requests (averaged over group) with 1 miss, 2 misses, or 3-4 misses to a macro-block]

# Working (With Two-Level Scheduling)

# Working (With Prefetch-Aware Scheduling)

# Effect on Row Buffer locality

24% decrease in row buffer locality over TL

#### Effect on Bank-Level Parallelism

18% increase in bank-level parallelism over TL

## Simple Prefetching + RR scheduling

## Simple Prefetching with TL scheduling

# CTA-Assignment Policy (Example)

[Figure labels: **Multi-threaded CUDA Kernel**; SIMT Core-1; **SIMT Core-2**]
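
## Warp Grouping Sketch (TL vs. PA)

The key idea stated in the conclusions, placing consecutive warps into different groups, can be illustrated with a small Python sketch. The slides defer the generalized PA algorithm to the paper, so the modulo mapping below is only an assumption chosen for illustration, and the function names (`two_level_groups`, `prefetch_aware_groups`, `consecutive_pairs_split`) are hypothetical. The sketch contrasts a TL-style grouping, where consecutive warps share a group, with a PA-style grouping, and counts how many consecutive warp pairs end up in different groups.

```python
def two_level_groups(num_warps, group_size):
    """TL-style grouping (assumed): consecutive warps share a group."""
    return [list(range(start, min(start + group_size, num_warps)))
            for start in range(0, num_warps, group_size)]


def prefetch_aware_groups(num_warps, group_size):
    """PA-style grouping (assumed modulo mapping, not the paper's algorithm):
    consecutive warps are placed in different groups."""
    num_groups = -(-num_warps // group_size)  # ceiling division
    groups = [[] for _ in range(num_groups)]
    for warp in range(num_warps):
        groups[warp % num_groups].append(warp)
    return groups


def consecutive_pairs_split(groups, num_warps):
    """Count pairs (w, w+1) whose two warps sit in different groups."""
    group_of = {w: g for g, members in enumerate(groups) for w in members}
    return sum(group_of[w] != group_of[w + 1] for w in range(num_warps - 1))


tl = two_level_groups(8, 4)
pa = prefetch_aware_groups(8, 4)
print("TL groups:", tl, "split pairs:", consecutive_pairs_split(tl, 8))
print("PA groups:", pa, "split pairs:", consecutive_pairs_split(pa, 8))
```

With 8 warps and a group size of 4, the TL-style grouping splits only one consecutive pair (warps 3 and 4), while the PA-style grouping splits all seven. Under the assumption that warp `w + 1` touches the cache block next to the one warp `w` touches, every demand access by one group then leaves time for a simple next-block prefetch to arrive before the other group is scheduled.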