
Orchestrated Scheduling and Prefetching for GPGPUs

Adwait Jog, Onur Kayiran, Asit Mishra, Mahmut Kandemir, Onur Mutlu, Ravi Iyer, Chita Das

[Figure: The latency-tolerance stack of a GPGPU — Multi-threading, Caching, Prefetching, Main Memory — annotated with how each layer is typically improved.]

- Multi-threading: parallelize your code! Launch more threads!
- Caching: improve replacement policies
- Prefetching: improve the prefetcher (look deep into the future, if you can!)
- Main Memory: improve memory scheduling policies

Is the Warp Scheduler aware of these techniques?

[Figure: The same stack — Multi-threading, Caching, Prefetching, Main Memory — annotated with warp schedulers that are aware of these techniques: Cache-Conscious Scheduling (MICRO'12), Two-Level Scheduling (MICRO'11), Thread-Block-Aware Scheduling (OWL, ASPLOS'13). A prefetching-aware warp scheduler? That is this work's question.]

Our Proposal

- Prefetch-Aware Warp Scheduler
- Goals:
  - Make a simple prefetcher more capable
  - Improve system performance by orchestrating scheduling and prefetching mechanisms
- 25% average IPC improvement over Prefetching + Conventional Warp Scheduling Policy
- 7% average IPC improvement over Prefetching + Best Previous Warp Scheduling Policy

Outline

- Proposal
- Background and Motivation
- Prefetch-aware Scheduling
- Evaluation
- Conclusions

High-Level View of a GPU

[Figure: Streaming Multiprocessors (SMs), each with a warp scheduler, ALUs, L1 caches, and a prefetcher. Threads are grouped into warps (W), and warps into Cooperative Thread Arrays (CTAs), or thread blocks. The SMs connect through an interconnect to a shared L2 cache and DRAM.]

Warp Scheduling Policy

- Equal scheduling priority
  - Round-Robin (RR) execution
- Problem: warps stall roughly at the same time

[Figure: RR timeline. All eight warps W1-W8 run Compute Phase (1) together, then issue their DRAM requests D1-D8 together; the SIMT core stalls until the data return, and only then do all warps enter Compute Phase (2).]

Two-Level (TL) Scheduling

[Figure: TL timeline. Warps are split into Group 1 (W1-W4) and Group 2 (W5-W8). Group 1 runs its Compute Phase (1) and issues DRAM requests D1-D4; while those are in flight, Group 2 computes and issues D5-D8. Group 1's Compute Phase (2) overlaps Group 2's memory accesses, saving cycles relative to RR.]
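To make the TL mechanism concrete, here is a minimal C++ sketch of two-level warp selection, assuming the policy's simplest form as depicted above: round-robin within an active group, switching groups when every warp in the active group is stalled on memory. This is an illustrative sketch, not the authors' GPGPU-Sim implementation; the names (`Warp`, `pick_warp_tl`) are invented for this example.

```cpp
#include <cstdio>
#include <vector>

// Illustrative warp state: a warp is ready unless it waits on memory.
struct Warp { int id; bool stalled_on_memory; };

// Two-level (TL) selection sketch: round-robin inside the active group;
// move to the next group only when the whole active group is stalled.
int pick_warp_tl(std::vector<std::vector<Warp>>& groups, size_t& active) {
    for (size_t g = 0; g < groups.size(); ++g) {
        size_t cand = (active + g) % groups.size();
        for (Warp& w : groups[cand]) {
            if (!w.stalled_on_memory) {
                active = cand;   // stay on (or switch to) this group
                return w.id;
            }
        }
    }
    return -1;                   // everyone is stalled: the SIMT core idles
}

int main() {
    // Group 1 = {W1..W4} just issued its loads; Group 2 = {W5..W8} is ready.
    std::vector<std::vector<Warp>> groups = {
        {{1, true}, {2, true}, {3, true}, {4, true}},
        {{5, false}, {6, false}, {7, false}, {8, false}},
    };
    size_t active = 0;
    printf("TL schedules W%d while Group 1 waits on DRAM\n",
           pick_warp_tl(groups, active));
}
```

Round-robin (RR) is the degenerate case of this sketch with a single group containing all warps: every warp computes, then every warp stalls together.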

Accessing DRAM ...

[Figure: Memory addresses X, X+1, X+2, X+3 map to Bank 1, and Y, Y+1, Y+2, Y+3 map to Bank 2. With all warps W1-W8 running together (RR), both banks are accessed concurrently: high bank-level parallelism and high row buffer locality. With TL groups, Group 1 (W1-W4) touches only Bank 1 while Bank 2 is idle for a period, and vice versa for Group 2 (W5-W8): low bank-level parallelism, though still high row buffer locality.]
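This bank behavior follows from the DRAM address mapping: consecutive cache lines fall into the same row of the same bank, so one group's accesses concentrate on a single bank. Below is a minimal sketch under an assumed toy mapping (four lines per row, two banks); the simulated GDDR3 mapping in the paper is more involved, and the constants here are illustrative only.

```cpp
#include <cstdio>

// Toy DRAM mapping for illustration only: consecutive cache lines share
// one row of one bank; row-sized chunks interleave across banks.
const unsigned kLinesPerRow = 4;   // X, X+1, X+2, X+3 share one row
const unsigned kBanks = 2;

unsigned bank_of(unsigned line) { return (line / kLinesPerRow) % kBanks; }
unsigned row_of(unsigned line)  { return  line / kLinesPerRow / kBanks; }

int main() {
    // On the slide Y is an arbitrary distant address; here it is simply
    // the next row-sized chunk so that it lands in the other bank.
    unsigned X = 0, Y = 4;
    for (unsigned i = 0; i < 4; ++i)
        printf("X+%u -> bank %u row %u | Y+%u -> bank %u row %u\n",
               i, bank_of(X + i), row_of(X + i),
               i, bank_of(Y + i), row_of(Y + i));
    // All X+i hit bank 0's open row (row buffer locality); all Y+i hit
    // bank 1. Issuing only Group 1's X-stream leaves bank 1 idle: low
    // bank-level parallelism.
}
```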

Warp Scheduler Perspective (Summary)

Warp Scheduler   | Forms Multiple Warp Groups? | Bank-Level Parallelism | Row Buffer Locality
Round-Robin (RR) | ✖                           | ✔                      | ✔
Two-Level (TL)   | ✔                           | ✖                      | ✔

(The last two columns together determine DRAM bandwidth utilization.)

Evaluating RR and TL Schedulers

[Figure: IPC improvement factor with a perfect L1 cache for SSC, PVC, KMN, SPMV, BFSR, FFT, SCP, BLK, FWT, JPEG, and GMEAN under Round-Robin (RR) and Two-Level (TL) scheduling. Geometric-mean improvement factors: 2.20X (RR) and 1.88X (TL).]

Can we further reduce this gap? Via prefetching?

(1) Prefetching: Saves More Cycles

[Figure: (A) RR and (B) TL timelines with an ideal prefetcher. Under TL, while Group 1's demands D1-D4 are in flight, prefetch requests P5-P8 fetch Group 2's data; Compute Phase (2) of Group 2 can start as soon as its Compute Phase (1) ends, saving cycles beyond what TL alone achieves.]

(2) Prefetching: Improves DRAM Bandwidth Utilization

[Figure: With TL alone, Group 1 (W1-W4) keeps Bank 1 busy with X..X+3 while Bank 2 is idle for a period. With prefetching, prefetch requests for Y..Y+3 keep Bank 2 busy at the same time: no idle period, high bank-level parallelism, and high row buffer locality.]

Challenge: Designing a Prefetcher

[Figure: While Group 1 (W1-W4) demands addresses around X, a prefetcher must predict the unrelated address Y that Group 2 (W5-W8) will need. X and Y can be arbitrarily far apart, so only a sophisticated prefetcher could predict Y from X.]

Our Goal

- Keep the prefetcher simple, yet get the performance benefits of a sophisticated prefetcher.
- To this end, we design a prefetch-aware warp scheduling policy.

But a simple prefetcher does not improve performance with existing scheduling policies. Why?

Simple Prefetching + RR Scheduling

[Figure: RR timeline with a simple next-line prefetcher. Because all warps issue their demands at essentially the same time, prefetch P2 overlaps with demand D2 and P4 overlaps with D4 (late prefetches). No cycles are saved over plain RR.]

Simple Prefetching + TL Scheduling

[Figure: TL timeline with a simple next-line prefetcher. Within a group, consecutive warps issue consecutive demands, so P2 overlaps with D2 and P4 with D4 (late prefetches); the same happens for Group 2 (D5/P6, D7/P8). No cycles are saved over plain TL.]

Let's Try...

[Figure: A simple prefetcher that, on a demand to X, prefetches X + 4. Under TL, when Group 1 (W1-W4) demands X..X+3 in Bank 1, the prefetcher issues requests past X + 3, but Group 2 actually needs Y..Y+3: X + 4 may not be equal to Y, so the prefetches (UP1-UP4) are useless, and Bank 2 is still idle for a period.]

Simple Prefetching with TL Scheduling

[Figure: Timeline with the X + 4 prefetcher. The useless prefetches U5-U8 do not serve Group 2's demands D5-D8, so no cycles are saved over plain TL.]

Warp Scheduler Perspective (Summary)

Warp Scheduler   | Forms Multiple Warp Groups? | Simple-Prefetcher Friendly? | Bank-Level Parallelism | Row Buffer Locality
Round-Robin (RR) | ✖                           | ✖                           | ✔                      | ✔
Two-Level (TL)   | ✔                           | ✖                           | ✖                      | ✔

Our Goal

- Keep the prefetcher simple, yet get the performance benefits of a sophisticated prefetcher.
- A simple prefetcher does not improve performance with existing scheduling policies.
- Our approach: Simple Prefetcher + Prefetch-Aware (PA) Warp Scheduler ≈ Sophisticated Prefetcher

Prefetch-Aware (PA) Warp Scheduling

Non-consecutive warps are associated with one group.

[Figure: Three groupings of warps W1-W8 over memory addresses X..X+3 (Bank 1) and Y..Y+3 (Bank 2). Round-Robin: all warps in one group. Two-Level: Group 1 = {W1, W2, W3, W4}, Group 2 = {W5, W6, W7, W8}. Prefetch-Aware: Group 1 = {W1, W3, W5, W7}, Group 2 = {W2, W4, W6, W8}.]

See paper for the generalized algorithm of the PA scheduler.
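A minimal sketch of the grouping contrast follows; the paper's generalized algorithm has more detail, and this only shows the interleaving idea (the names `group_tl` and `group_pa` are invented for this example).

```cpp
#include <cstdio>
#include <vector>

// TL: consecutive warps per group -> {W1..W4}, {W5..W8}.
std::vector<std::vector<int>> group_tl(int warps, int ngroups) {
    std::vector<std::vector<int>> g(ngroups);
    int per = warps / ngroups;
    for (int w = 0; w < warps; ++w) g[w / per].push_back(w + 1);
    return g;
}

// PA: non-consecutive warps per group -> {W1,W3,W5,W7}, {W2,W4,W6,W8}.
std::vector<std::vector<int>> group_pa(int warps, int ngroups) {
    std::vector<std::vector<int>> g(ngroups);
    for (int w = 0; w < warps; ++w) g[w % ngroups].push_back(w + 1);
    return g;
}

int main() {
    auto print = [](const char* name, std::vector<std::vector<int>> g) {
        printf("%s:", name);
        for (auto& grp : g) {
            printf(" {");
            for (int w : grp) printf(" W%d", w);
            printf(" }");
        }
        printf("\n");
    };
    print("TL", group_tl(8, 2));
    print("PA", group_pa(8, 2));
}
```

Interleaving by `w % ngroups` is what puts consecutive warps, which access consecutive cache lines, into different groups, and that is exactly what lets one group's demand stream cover the other group's addresses with a next-line prefetcher.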

Simple Prefetching with PA Scheduling

[Figure: Group 1 = {W1, W3, W5, W7} demands X, X+2 (Bank 1) and Y, Y+2 (Bank 2); a simple next-line prefetcher (X → X + 1) prefetches X+1, X+3, Y+1, Y+3 on behalf of Group 2 = {W2, W4, W6, W8}.]

The reasoning behind non-consecutive warp grouping is that the groups can prefetch for each other with a simple prefetcher: one group's demands trigger next-line prefetches of exactly the lines the other group will need.

Simple Prefetching with PA Scheduling (Continued)

[Figure: When Group 2 = {W2, W4, W6, W8} later demands X+1, X+3, Y+1, Y+3, those lines were already prefetched by Group 1's accesses: cache hits!]
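A minimal sketch of this interaction, assuming the next-line prefetcher shown on the slide (on a demand to line L, also fetch L + 1); the cache model and addresses (X = 0, Y = 100) are illustrative, not from the paper.

```cpp
#include <cstdio>
#include <set>

// Toy cache: the set of resident cache-line addresses.
std::set<unsigned> cache;

// Simple next-line prefetcher: a demand for line L also fetches L + 1.
void demand_access(unsigned line) {
    bool hit = cache.count(line) > 0;
    printf("line %u: %s\n", line, hit ? "hit (prefetched earlier)" : "miss");
    cache.insert(line);
    cache.insert(line + 1);   // the prefetch
}

int main() {
    // PA Group 1 = {W1,W3,W5,W7} demands X, X+2, Y, Y+2 (X = 0, Y = 100).
    for (unsigned l : {0u, 2u, 100u, 102u}) demand_access(l);
    // PA Group 2 = {W2,W4,W6,W8} later demands X+1, X+3, Y+1, Y+3:
    // all hits, because Group 1's accesses prefetched them.
    for (unsigned l : {1u, 3u, 101u, 103u}) demand_access(l);
}
```

Running this prints four misses for Group 1's demands and four hits for Group 2's, mirroring the "Cache Hits!" slide above.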

Simple Prefetching with PA Scheduling (Timeline)

[Figure: (A) RR and (B) PA timelines. Under PA, Group 1's demands D1, D3, D5, D7 trigger prefetch requests P2, P4, P6, P8 for Group 2, so Compute Phase (2) of Group 2 can start without waiting on DRAM: cycles are saved even over TL.]

DRAM Bandwidth Utilization

[Figure: Under PA with the simple next-line prefetcher, Group 1's demands (X, X+2, Y, Y+2) and the resulting prefetches (X+1, X+3, Y+1, Y+3) keep both banks busy: high bank-level parallelism and high row buffer locality.]

- 18% increase in bank-level parallelism
- 24% decrease in row buffer locality

Warp Scheduler Perspective (Summary)

Warp Scheduler      | Forms Multiple Warp Groups? | Simple-Prefetcher Friendly? | Bank-Level Parallelism | Row Buffer Locality
Round-Robin (RR)    | ✖                           | ✖                           | ✔                      | ✔
Two-Level (TL)      | ✔                           | ✖                           | ✖                      | ✔
Prefetch-Aware (PA) | ✔                           | ✔                           | ✔                      | ✔ (with prefetching)

Outline

- Proposal
- Background and Motivation
- Prefetch-aware Scheduling
- Evaluation
- Conclusions

Evaluation Methodology

- Evaluated on GPGPU-Sim, a cycle-accurate GPU simulator
- Baseline architecture:
  - 30 SMs, 8 memory controllers, crossbar connected
  - 1300 MHz, SIMT width = 8, max. 1024 threads/core
  - 32 KB L1 data cache, 8 KB texture and constant caches
  - L1 data cache prefetcher, GDDR3 @ 1100 MHz
- Applications chosen from:
  - MapReduce applications
  - Rodinia: heterogeneous applications
  - Parboil: throughput-computing-focused applications
  - NVIDIA CUDA SDK: GPGPU applications

Spatial Locality Detector Based Prefetching

[Figure: A macro-block consists of four cache lines X, X+1, X+2, X+3. When some lines are demanded (D), the prefetcher issues prefetches (P) for the macro-block's lines that have not been demanded. D = demand, P = prefetch.]

The prefetch-aware scheduler improves the effectiveness of this simple prefetcher. See paper for more details.
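A minimal sketch of the macro-block idea, simplified to "on a demand, prefetch the macro-block's remaining lines"; the actual detector in the paper has more conditions, and the names here (`on_demand`, `kMacroLines`) are invented for this example.

```cpp
#include <cstdio>
#include <set>

const unsigned kMacroLines = 4;   // lines X..X+3 form one macro-block
std::set<unsigned> in_cache;      // lines resident via demand or prefetch

// Spatial-locality-detector sketch: a demand to a macro-block triggers
// prefetches for the block's lines that are not yet in the cache.
void on_demand(unsigned line) {
    in_cache.insert(line);
    unsigned base = line - line % kMacroLines;
    for (unsigned l = base; l < base + kMacroLines; ++l)
        if (in_cache.insert(l).second)
            printf("demand to line %u -> prefetch line %u\n", line, l);
}

int main() {
    on_demand(0);    // demand X:   prefetch X+1, X+2, X+3
    on_demand(102);  // demand Y+2: prefetch Y, Y+1, Y+3 (with Y = 100)
}
```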

Improving Prefetching Effectiveness

[Figure: Three bar charts comparing RR+Prefetching, TL+Prefetching, and PA+Prefetching. Prefetch accuracy: 85%, 89%, 90%. Fraction of late prefetches: 89%, 86%, 69%. Reduction in L1D miss rates: 2%, 4%, 16%.]

Performance Evaluation

[Figure: IPC normalized to RR scheduling for SSC, PVC, KMN, SPMV, BFSR, FFT, SCP, BLK, FWT, JPEG, and GMEAN under RR+Prefetching, TL, TL+Prefetching, Prefetch-Aware (PA), and PA+Prefetching. Geometric means: 1.01, 1.16, 1.19, 1.20, 1.26, respectively.]

- 25% IPC improvement over Prefetching + RR warp scheduling policy (commonly used)
- 7% IPC improvement over Prefetching + TL warp scheduling policy (best previous)

See paper for additional results.

Conclusions

- Existing warp schedulers in GPGPUs cannot take advantage of simple prefetchers
  - Consecutive warps have good spatial locality, and can prefetch well for each other
  - But existing schedulers schedule consecutive warps close by in time → prefetches arrive too late
- We proposed prefetch-aware (PA) warp scheduling
  - Key idea: assign consecutive warps to different groups
  - This enables a simple prefetcher to be timely, since warps in different groups are scheduled at separate times
- Evaluations show that PA warp scheduling improves performance over combinations of conventional (RR) and the best previous (TL) warp scheduling and prefetching policies
  - It better orchestrates warp scheduling and prefetching decisions

Thanks! Questions?

BACKUP

Effect of Prefetch-Aware Scheduling

[Figure: Percentage of DRAM requests (averaged over a group) with 1 miss, 2 misses, and 3-4 misses to a macro-block, under Two-Level and Prefetch-Aware scheduling. TL sees many high-spatial-locality requests (3-4 misses to the same macro-block); under PA, those requests are recovered by prefetching.]

Working (With Two-Level Scheduling)

[Figure: Two macro-blocks, X..X+3 and Y..Y+3. Under TL, all eight lines are fetched by demands (D): high-spatial-locality requests, none served by prefetching.]

Working (With Prefetch-Aware Scheduling)

[Figure: Under PA, Group 1 demands (D) half the lines of macro-blocks X..X+3 and Y..Y+3, and the prefetcher fetches (P) the other half: high-spatial-locality requests served by prefetching.]

[Figure: When Group 2 later demands (D) its lines of the two macro-blocks, they are already resident: cache hits.]

Effect on Row Buffer Locality

[Figure: Row buffer locality for SSC, PVC, KMN, SPMV, BFSR, FFT, SCP, BLK, FWT, JPEG, and AVG under TL, TL+Prefetching, PA, and PA+Prefetching.]

24% decrease in row buffer locality over TL.

Effect on Bank-Level Parallelism

[Figure: Bank-level parallelism for SSC, PVC, KMN, SPMV, BFSR, FFT, SCP, BLK, FWT, JPEG, and AVG under RR, TL, and PA.]

18% increase in bank-level parallelism over TL.

Simple Prefetching + RR Scheduling

[Figure: Under RR, all warps W1-W8 access memory addresses X..X+3 (Bank 1) and Y..Y+3 (Bank 2) together, so both banks stay busy.]

Simple Prefetching with TL Scheduling

[Figure: Under TL, Group 1 (W1-W4) accesses X..X+3 in Bank 1 while Bank 2 is idle for a period; then Group 2 (W5-W8) accesses Y..Y+3 in Bank 2 while Bank 1 is idle.]

CTA-Assignment Policy (Example)

[Figure: The CTAs of a multi-threaded CUDA kernel (CTA-1..CTA-4) are distributed across SIMT cores, each with its own warp scheduler, ALUs, and L1 caches — e.g., CTA-1 and CTA-2 to SIMT Core-1, CTA-3 and CTA-4 to SIMT Core-2.]
