



© Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated)

Cache Conscious Wavefront Scheduling, T. Rogers, M. O'Connor, and T. Aamodt

MICRO 2012

(2)

Goal

•  Understand the relationship between schedulers (warp/wavefront) and locality behaviors
   ◦  Distinguish between inter-wavefront and intra-wavefront locality

•  Design a scheduler to match the number of scheduled wavefronts to the L1 cache size
   ◦  Working set of the scheduled wavefronts fits in the cache
   ◦  Emphasis on intra-wavefront locality


(3)

Reference Locality

[Figure: locality taxonomy – intra-thread and inter-thread locality; intra-wavefront and inter-wavefront locality]

•  Scheduler decisions can affect locality

•  Need non-oblivious (to locality) schedulers

Figure from T. Rogers, M. O'Connor, T. Aamodt, "Cache Conscious Wavefront Scheduling," MICRO 2012

(4)

Key Idea

Executing Wavefronts Ready Wavefronts

Issue?

•  Impact of issuing new wavefronts on the intra-warp locality of executing wavefronts
   ◦  Footprint in the cache
   ◦  When will a new wavefront cause thrashing?

Intra-wavefront footprints in the cache


(5)

Concurrency vs. Cache Behavior

Round Robin Scheduler

Less than max concurrency

Adding more wavefronts to hide latency is traded off against creating more long-latency memory references due to thrashing

Figure from T. Rogers, M. O'Connor, T. Aamodt, "Cache Conscious Wavefront Scheduling," MICRO 2012

(6)

Additions to the GPU Microarchitecture

Control issue of wavefronts

[SM pipeline diagram: I-Fetch → Decode → I-Buffer → Issue → RF/PRF → scalar pipelines → D-Cache → Writeback; pending warps]

Keep track of locality behavior on a per-wavefront basis

Feedback


(7)

Intra-Wavefront Locality

•  Scalar thread traverses edges

•  Edges stored in successive memory locations

•  One thread vs. many

•  Intra-thread locality leads to intra-wavefront locality

•  #wavefronts determined by the wavefronts' reference footprint and the cache size

•  Scheduler attempts to find the right #wavefronts

(8)

Static Wavefront Limiting

[SM pipeline diagram: I-Fetch → Decode → I-Buffer → Issue → RF/PRF → scalar pipelines → D-Cache → Writeback; pending warps]

•  Limit the #wavefronts based on working set and cache size

•  Based on profiling?

•  Current schemes allocate #wavefronts based on resource consumption, not effectiveness of utilization

•  Seek to shorten the re-reference interval
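A minimal sketch of the issue gate implied by static wavefront limiting, assuming a per-kernel limit obtained from profiling; the type and field names here are illustrative, not the paper's hardware:

    #include <cstdint>
    #include <vector>

    // Static Wavefront Limiting (sketch): a per-kernel limit, chosen offline by
    // profiling the working set against the L1 data cache size, caps how many
    // wavefronts the issue stage may pick from.  Wavefront IDs are assumed to be
    // assigned in launch (age) order.
    struct SWLGate {
        uint32_t maxSchedulable;          // profiled limit for this kernel

        // A wavefront may issue only if it is among the 'maxSchedulable' oldest
        // wavefronts still resident on this SM; younger wavefronts stay pending,
        // which shortens the re-reference interval of the scheduled ones.
        bool mayIssue(uint32_t wavefrontId,
                      const std::vector<uint32_t>& residentIdsSortedByAge) const {
            for (uint32_t i = 0; i < maxSchedulable && i < residentIdsSortedByAge.size(); ++i)
                if (residentIdsSortedByAge[i] == wavefrontId)
                    return true;
            return false;
        }
    };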


(9)

Cache Conscious Wavefront Scheduling

•  Basic steps
   ◦  Keep track of the lost locality of each wavefront
   ◦  Control the number of wavefronts that can be issued

•  Approach
   ◦  List of wavefronts sorted by lost-locality score
   ◦  Stop issuing the wavefronts with the least lost locality

[SM pipeline diagram: I-Fetch → Decode → I-Buffer → Issue → RF/PRF → scalar pipelines → D-Cache → Writeback; pending warps]

(10)

Keeping Track of Locality

[SM pipeline diagram: I-Fetch → Decode → I-Buffer → Issue → RF/PRF → scalar pipelines → D-Cache → Writeback; pending warps]

Keep track of the associated wavefronts: victim tags are indexed by WID, and victims indicate "lost locality"

Update the lost-locality score on a victim tag hit

Figure from T. Rogers, M. O'Connor, T. Aamodt, "Cache Conscious Wavefront Scheduling," MICRO 2012


(11)

Updating Estimates of Locality

[SM pipeline diagram: I-Fetch → Decode → I-Buffer → Issue → RF/PRF → scalar pipelines → D-Cache → Writeback; pending warps]

Wavefronts sorted by lost locality score

Update wavefront lost locality score on victim hit

Note which are allowed to issue

Changes based on #wavefronts

Figure from T. Rogers, M. O'Connor, T. Aamodt, "Cache Conscious Wavefront Scheduling," MICRO 2012

(12)

Estimating the Feedback Gain

Drop the scores when there are no VTA hits

VTA hit – increase the score

Prevent issue

LLDS = (VTA_Hits_Total / Inst_Issued_Total) × K_Throttle × Cum_LLS_Cutoff

Scale the gain based on the percentage of the cutoff

Figure from T. Rogers, M. O'Connor, T. Aamodt, "Cache Conscious Wavefront Scheduling," MICRO 2012
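A compact sketch of the scoring loop described on the last three slides, assuming a small per-warp victim tag array (VTA) and the LLDS gain as reconstructed above; structure names, the VTA size, and the decay step are illustrative rather than the paper's exact hardware:

    #include <algorithm>
    #include <cstdint>
    #include <deque>
    #include <vector>

    // CCWS scoring (sketch).  Each warp keeps a small victim tag array (VTA) of
    // tags it recently evicted from L1.  A miss that hits in the warp's own VTA
    // means its locality was lost to interference, so its lost-locality score
    // (LLS) is bumped by LLDS; scores decay when no VTA hits occur.  The warps
    // holding the smallest scores are the first to be de-scheduled when the
    // cumulative score exceeds the cutoff.
    struct WarpLocalityState {
        std::deque<uint64_t> victimTags;   // per-warp VTA (FIFO of evicted tags)
        double lls = 0.0;                  // lost-locality score
    };

    struct CCWSScoring {
        std::vector<WarpLocalityState> warps;
        uint64_t vtaHitsTotal = 0, instIssuedTotal = 1;   // avoid divide-by-zero
        double kThrottle = 1.0, cumLLSCutoff = 1.0;
        static constexpr size_t kVTAEntries = 8;          // illustrative size

        void onEviction(uint32_t wid, uint64_t tag) {
            auto& vta = warps[wid].victimTags;
            vta.push_back(tag);
            if (vta.size() > kVTAEntries) vta.pop_front();
        }

        void onMiss(uint32_t wid, uint64_t tag) {
            auto& w = warps[wid];
            for (uint64_t t : w.victimTags)
                if (t == tag) {                            // lost locality detected
                    ++vtaHitsTotal;
                    double llds = (double)vtaHitsTotal / (double)instIssuedTotal
                                  * kThrottle * cumLLSCutoff;   // slide's LLDS gain
                    w.lls += llds;
                    return;
                }
        }

        void onCycle() {                                   // scores decay toward zero
            for (auto& w : warps) w.lls = std::max(0.0, w.lls - 1.0);
        }
    };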


(13)

Schedulers

•  Loose Round Robin (LRR)

•  Greedy-Then-Oldest (GTO)
   ◦  Execute one wavefront until it stalls, then switch to the oldest

•  2LVL-GTO
   ◦  Two-level scheduler with GTO instead of RR

•  Best SWL

•  CCWS
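A minimal sketch of the GTO selection rule listed above (the greedy warp is kept until it stalls, then the oldest ready warp is chosen); the function signature and ready-list representation are assumptions for illustration:

    #include <cstdint>
    #include <vector>

    // Greedy-Then-Oldest (GTO) selection (sketch).  'readyWarpIdsByAge' is
    // assumed to be ordered oldest-first; 'lastIssued' is -1 when there is no
    // greedy candidate.
    int selectGTO(const std::vector<uint32_t>& readyWarpIdsByAge, int lastIssued) {
        // Greedy: stick with the previously issued warp if it is still ready.
        for (uint32_t wid : readyWarpIdsByAge)
            if ((int)wid == lastIssued)
                return lastIssued;
        // Oldest: otherwise pick the oldest warp that is ready to issue.
        if (!readyWarpIdsByAge.empty())
            return (int)readyWarpIdsByAge.front();
        return -1;                         // nothing ready this cycle
    }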

(14)

Performance

Figure from T. Rogers, M. O'Connor, T. Aamodt, "Cache Conscious Wavefront Scheduling," MICRO 2012


(15)

Some General Observations

•  Performance largely determined by
   ◦  Emphasis on the oldest wavefronts
   ◦  Distribution of references across cache lines – the extent of the lack of coalescing

•  GTO works well for this reason → it prioritizes older wavefronts

•  LRR touches too much data (too little reuse) to fit in the cache

(16)

Locality Behaviors

Figure from T. Rogers, M. O'Connor, T. Aamodt, "Cache Conscious Wavefront Scheduling," MICRO 2012


(17)

Summary

•  Dynamic tracking of relationship between wavefronts and working sets in the cache

•  Modify scheduling decisions to minimize interference in the cache

•  Tunables: need profile information to create stable operation of the feedback control

© Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated)

Divergence-Aware Warp Scheduling, T. Rogers, M. O'Connor, and T. Aamodt

MICRO 2013


(19)

Goal

•  Understand the relationship between schedulers (warp/wavefront) and locality behaviors
   ◦  Distinguish between inter-wavefront and intra-wavefront locality

•  Design a scheduler to match the number of scheduled wavefronts to the L1 cache size
   ◦  Working set of the scheduled wavefronts fits in the cache
   ◦  Emphasis on intra-wavefront locality

•  Differs from CCWS in being proactive
   ◦  Deeper look at what happens inside loops

(20)

Key Idea

•  Manage the relationship between control divergence, memory divergence and scheduling

Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013


(21)

Key Idea (2)

while (i < C[tid+1])

[Figure annotations: warp 0 and warp 1 reach a divergent branch; warp 0 fills the cache with 4 references – delay warp 1; intra-thread locality; when there is available room in the cache, schedule warp 1]

Use warp 0's behavior to predict the interference due to warp 1

Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013

(22)

Goal: make the performance of the simpler, portable version equivalent to the GPU-optimized version

Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013


(23)

Additions to the GPU Microarchitecture

Control issue of wavefronts

[SM pipeline diagram: I-Fetch → Decode → I-Buffer/Scoreboard → Issue → RF/PRF → scalar pipelines → Memory Coalescer → D-Cache → Writeback; pending warps]

(24)

Observation

•  Bulk of the accesses in a loop come from a few static load instructions

•  Bulk of the locality in (these) applications is intra-loop

Loops


(25)

Distribution of Locality

Bulk of the locality comes from a few static loads in loops

Hint: can we keep data from the last iteration?

Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013

(26)

A Solution

•  Prediction mechanisms for locality across iterations of a loop

•  Schedule such that data fetched in one iteration is still present at next iteration

•  Combine with control flow divergence (how much of the footprint needs to be in the cache?)

Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013


(27)

Classification of Dynamic Loads

•  Group static loads into equivalence classes → they reference the same cache line

•  Identify these groups by repetition ID •  Prediction for each load by compiler or hardware

[Figure: load classification under control-flow divergence – converged, somewhat diverged, diverged]

Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
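A sketch of the equivalence-class grouping sketched above, under the later slide's assumption that loads sharing a base address and an offset within the same cache line repeat each iteration; the field names, cache line size, and key choice are illustrative:

    #include <cstdint>
    #include <map>
    #include <utility>
    #include <vector>

    // Grouping static loads into equivalence classes (sketch).  Loads whose
    // address expressions share a base register and whose constant offsets fall
    // in the same cache line are assumed to touch the same line each loop
    // iteration, so they share one "repetition ID" and are counted once when
    // predicting the footprint.
    struct StaticLoad {
        uint32_t pc;
        uint32_t baseReg;      // register holding the base address
        int32_t  offset;       // constant byte offset
        bool     divergent;    // profiled/detected memory divergence
    };

    constexpr int kLineBytes = 128;

    std::map<std::pair<uint32_t, int32_t>, std::vector<uint32_t>>
    classifyLoads(const std::vector<StaticLoad>& loads) {
        // Key = (base register, cache-line-aligned offset) -> PCs in that class.
        std::map<std::pair<uint32_t, int32_t>, std::vector<uint32_t>> classes;
        for (const auto& ld : loads)
            classes[{ld.baseReg, ld.offset / kLineBytes}].push_back(ld.pc);
        return classes;
    }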

(28)

Predicting a Warp's Cache Footprint

[Figure annotations: entering the loop body – create the footprint prediction; some threads exit the loop (taken/not-taken) – the predicted footprint drops; exiting the loop – reinitialize the prediction]

•  Predict the locality usage of static loads
   ◦  Not all loads increase the footprint

•  Combine with control divergence to predict the footprint

•  Use the footprint to throttle (or not throttle) warp issue


(29)

Principles of Operation

•  A prefix sum of each warp's cache footprint is used to select the warps that can be issued (see the sketch below)

EffCacheSize = kAssocFactor × TotalNumLines

•  Scaling back from a fully associative cache
•  Empirically determined

Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
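A minimal sketch of the prefix-sum selection described above, assuming footprints are expressed in cache lines and warps are visited in scheduler priority order; details such as how zero-footprint warps are handled are simplified:

    #include <cstdint>
    #include <vector>

    // Warp selection by footprint prefix sum (sketch).  A warp may issue only if
    // the running sum of predicted footprints still fits in the effective cache
    // size EffCacheSize = kAssocFactor * TotalNumLines, which scales the real
    // capacity back from a fully associative ideal.
    std::vector<bool> selectWarps(const std::vector<uint32_t>& predictedFootprintLines,
                                  uint32_t totalNumLines, double kAssocFactor) {
        double effCacheSize = kAssocFactor * totalNumLines;
        std::vector<bool> mayIssue(predictedFootprintLines.size(), false);
        double prefix = 0.0;
        for (size_t w = 0; w < predictedFootprintLines.size(); ++w) {
            prefix += predictedFootprintLines[w];
            mayIssue[w] = (prefix <= effCacheSize);   // warps past the cutoff are de-scheduled
        }
        return mayIssue;
    }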

(30)

Principles of Operation (2)

•  Profile static load instructions
   ◦  Are they divergent?
   ◦  Loop repetition ID
      -  Assume all loads with the same base address and an offset within the same cache line are repeated each iteration

Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013


(31)

Prediction Mechanisms

•  Profiled Divergence-Aware Scheduling (Profiled DAWS)
   ◦  Uses offline profile results to dynamically determine de-scheduling decisions

•  Detected Divergence-Aware Scheduling (Detected DAWS)
   ◦  Behaviors derived at run time drive de-scheduling decisions
      -  Loops that exhibit intra-warp locality
      -  Static loads are characterized as divergent or convergent

(32)

Extensions for DAWS

[SM pipeline diagram: I-Fetch → Decode → I-Buffer → Issue → RF/PRF → scalar pipelines → D-Cache → Writeback; "+" marks the structures added for DAWS]


(33)

Operation: Tracking

[Figure annotations: the per-warp prediction table is the basis for throttling; it holds profile-based information and the #active lanes; one entry per warp issue slot; entries are created/removed at loop begin/end]

Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013

(34)

Operation: Prediction

Sum the contributions of the static loads in this loop (sketched in code below):
-  Add #active-lanes cache lines for divergent loads
-  Add 2 for converged loads
-  Count loads in the same equivalence class only once (unless divergent)

•  Generally only consider de-scheduling warps in loops
   ◦  Since most of the activity is there

•  Can be extended to non-loop regions by associating non-loop code with the next loop

Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
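A sketch of the footprint-prediction rule above; the struct layout and the exact treatment of divergent loads within an equivalence class are illustrative simplifications:

    #include <cstdint>
    #include <set>
    #include <vector>

    // Per-warp footprint prediction (sketch): a divergent load adds one cache
    // line per active lane, a converged load adds 2 lines, and loads in the same
    // equivalence class (same repetition ID) are counted only once unless they
    // are divergent.
    struct LoopLoadInfo {
        uint32_t repetitionId;   // equivalence class of the static load
        bool     divergent;      // memory-divergence classification
    };

    uint32_t predictFootprintLines(const std::vector<LoopLoadInfo>& loadsInLoop,
                                   uint32_t activeLanes) {
        uint32_t lines = 0;
        std::set<uint32_t> countedClasses;
        for (const auto& ld : loadsInLoop) {
            if (ld.divergent) {
                lines += activeLanes;                       // one line per active lane
            } else if (countedClasses.insert(ld.repetitionId).second) {
                lines += 2;                                 // converged: count class once
            }
        }
        return lines;
    }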


(35)

Operation: Nested Loops

for (i = 0; i < limitx; i++) {
    ..
    for (j = 0; j < limity; j++) {
        ..
    }
    ..
}

•  On entry, update the prediction to that of the inner loop

•  On re-entry, predict based on the inner-loop predictions

•  On exit, do not clear the prediction

De-scheduling of warps is determined by inner-loop behaviors!

•  Re-used predictions are based on the inner-most loops, which is where most of the data re-use is found

(36)

Detected DAWS: Prediction

A sampling warp (one with >2 active threads) characterizes each loop

•  Detect both memory divergence and intra-loop repetition at run time

•  Fill PCLoad entries based on run-time information

•  Use profile information to start

Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013


(37)

Detected DAWS: Classification

Increment or decrement the counter depending on #memory accesses for a load

Create equivalence classes of loads (checking the PCs)

Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013

(38)

Performance

Little to no degradation

Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013


(39)

Performance

Significant intra-warp locality

SPMV-scalar normalized to the best SPMV-vector

Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013

(40)

Summary

•  If we can characterize warp level memory reference locality, we can use this information to minimize interference in the cache through scheduling constraints

•  Proactive scheme outperforms reactive management

•  Understand interactions between memory divergence and control divergence


© Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated)

OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance, A. Jog et al., ASPLOS 2013

(42)

Goal

•  Understand memory effects of scheduling from deeper within the memory hierarchy

•  Minimize idle cycles induced by stalling warps waiting on memory references


(43)

Off-chip Bandwidth is Critical!

Percentage of total execution cycles wasted waiting for the data to come back from DRAM

[Bar chart: per-application percentage of execution cycles wasted waiting on DRAM, grouped into Type-1 and Type-2 GPGPU applications; Type-1 applications average 55%, overall AVG 32%]

Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013

(44)

Source of Idle Cycles

•  Warps stalled waiting for a memory reference
   ◦  Cache miss
   ◦  Service at the memory controller
   ◦  Row buffer miss in DRAM
   ◦  Latency in the network (not addressed in this paper)

•  The last-warp effect

•  The last-CTA effect

•  Lack of multiprogrammed execution
   ◦  One (small) kernel at a time


(45)

Impact of Idle Cycles

Figure from A. Jog et al., "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013

High-Level View of a GPU

[Diagram: SIMT cores, each with a scheduler, ALUs, and L1 caches, running threads grouped into warps and Cooperative Thread Arrays (CTAs); the cores connect through an interconnect to a shared L2 cache and DRAM]

Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013


(47)

CTA-Assignment Policy (Example)

[Diagram: a multi-threaded CUDA kernel with CTA-1 through CTA-4 distributed across SIMT Core-1 and SIMT Core-2 (two CTAs per core), each core having its own warp scheduler, ALUs, and L1 caches]

Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013

(48)

Organizing CTAs Into Groups

•  Set the minimum number of warps per group equal to the #pipeline stages
   ◦  Same philosophy as the two-level warp scheduler

•  Use the same CTA grouping/numbering across SMs?

[Diagram: SIMT Core-1 and SIMT Core-2, each with a warp scheduler, ALUs, and L1 caches, holding the CTAs grouped for scheduling]

Figure from A. Jog et al., "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013


(49)

Warp Scheduling Policy

•  All launched warps on a SIMT core have equal priority
   ◦  Round-robin execution

•  Problem: many warps stall at long-latency operations at roughly the same time

[Timeline: all warps compute with equal priority, then all send their memory requests at roughly the same time, so the SIMT core stalls until data returns]

Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013

(50)

Solution

[Timeline: warps are divided into CTA-aware groups; while one group waits on its memory requests, another group computes, overlapping memory latency with computation and saving cycles relative to pure round-robin]

•  Form warp-groups (Narasiman, MICRO'11)
•  CTA-aware grouping
•  Group switch is round-robin

Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013


(51)

Two Level Round Robin Scheduler

[Diagram: CTAs 0–15 are partitioned into four groups (Group 0–Group 3); the scheduler round-robins across groups and round-robins among the warps within the selected group]

Agnostic to when pending misses are satisfied

(52)

Key Idea

Executing Warps (EW), Stalled Warps (SW), Cache-Resident Warps (CW)

•  Relationship between EW, SW, and CW?

•  When EW ≠ CW, we get interference

•  Principle – optimize reuse: seek to make EW = CW


(53)

Objective 1: Improve Cache Hit Rates

[Timeline: without switching, CTAs 1, 3, 5, and 7 all run within time T and thrash the cache; with prioritization, the scheduler switches back to CTA 1 when its data arrives, so only 3 CTAs run within time T]

Fewer CTAs accessing the cache concurrently → less cache contention

Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013

(54)

Reduction in L1 Miss Rates

•  Limited benefits for cache-insensitive applications

•  What is happening deeper in the memory system?

[Bar chart: normalized L1 miss rates for SAD, SSC, BFS, KMN, IIX, SPMV, BFSR, and the average under Round-Robin, CTA-Grouping, and CTA-Grouping-Prioritization; the chart highlights reductions of 8% and 18%]

Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013


(55)

The Off-Chip Memory Path

[Diagram: compute units CU 0–15 and CU 16–31 connect through memory controllers (MCs) to multiple off-chip GDDR5 channels]

Access patterns?

Ordering and buffering?

(56)

Inter-CTA Locality

[Diagram: two SIMT cores, each with a warp scheduler, ALUs, and L1 caches, running CTAs 1–4 and sharing multiple DRAM channels]

How do CTAs interact at the memory controller and in DRAM?


(57)

Impact of the Memory Controller

•  Memory scheduling policies
   ◦  Optimize bandwidth vs. memory latency

•  Impact of row buffer access locality

•  Cache lines?

(58)

Row Buffer Locality


(59)

CTA Data Layout (A Simple Example)

[Diagram: a data matrix in row-major order is divided into four CTA tiles (CTA 1 and CTA 2 on top, CTA 3 and CTA 4 below); in the DRAM layout, the matrix rows map to rows in Banks 1–4, so consecutively numbered CTAs touch the same DRAM rows]

Average percentage of consecutive CTAs (out of total CTAs) accessing the same row = 64%

Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
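A small, self-contained illustration of the layout effect described above, under simplifying assumptions (a 4x4 matrix, one matrix row per DRAM row, matrix row r mapped to bank r, and 2x2 CTA tiles); this is not the actual GPU address mapping, only a reconstruction of the slide's example:

    #include <cstdio>
    #include <set>

    // Consecutively numbered CTAs own side-by-side tiles that span the same
    // matrix rows, so they end up touching the same DRAM rows/banks: CTA 1 and
    // CTA 2 both touch banks 1 and 2, CTA 3 and CTA 4 both touch banks 3 and 4.
    int main() {
        const int tileRows[4][2] = { {0, 1}, {0, 1}, {2, 3}, {2, 3} };  // matrix rows per CTA
        for (int cta = 0; cta < 4; ++cta) {
            std::set<int> banks;
            for (int r : tileRows[cta]) banks.insert(r);                // row r -> bank r
            std::printf("CTA %d touches banks:", cta + 1);
            for (int b : banks) std::printf(" %d", b + 1);
            std::printf("\n");
        }
        return 0;
    }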

(60)

Implications of High CTA-Row Sharing

[Diagram: warps from the prioritized CTAs on SIMT Core-1 and SIMT Core-2 send their requests through the L2 cache to DRAM; because consecutive CTAs share rows, the requests concentrate on a few banks/rows while the other banks sit idle]

Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013


[Diagram: queuing all requests to Row-1 of Bank-1 and Row-2 of Bank-2 gives high row locality but low bank-level parallelism; spreading the same requests across both banks gives lower row locality but higher bank-level parallelism]

Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013

(62)

Some Additional Details

•  Spread references from multiple CTAs (on multiple SMs) across the row buffers of distinct banks

•  Do not use the same CTA group prioritization across SMs
   ◦  Play the odds

•  What happens with applications with unstructured, irregular memory access patterns?


Objective 2: Improving Bank-Level Parallelism

[Diagram: with different CTA prioritization orders on SIMT Core-1 and SIMT Core-2, the warps' requests spread across Banks 1–4 rather than concentrating on a few rows]

11% increase in bank-level parallelism

14% decrease in row buffer locality

Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013

Objective 3: Recovering Row Locality

[Diagram: memory-side prefetching fetches the remaining lines of an open row into the L2 cache, so later requests from SIMT Core-1 and SIMT Core-2 hit in L2]

Memory Side Prefetching

L2 Hits!

Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013


Memory-Side Prefetching

•  Prefetch the so-far-unfetched cache lines of an already open row into the L2 cache, just before the row is closed

•  What to prefetch?
   ◦  Sequentially prefetch the cache lines that were not accessed by demand requests
   ◦  More sophisticated schemes are left as future work

•  When to prefetch?
   ◦  Opportunistic in nature
   ◦  Option 1: prefetching stops as soon as a demand request arrives for another row (demands are always critical)
   ◦  Option 2: give more time for prefetching, and make demands wait if there are not many (demands are NOT always critical)

Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
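A minimal sketch of the Option-1 prefetch decision above, assuming the controller tracks which lines of the open row were already fetched on demand; the structure names and granularity are illustrative:

    #include <cstdint>
    #include <set>

    // Memory-side prefetch decision (sketch).  Just before an open DRAM row is
    // closed, the memory controller streams the row's not-yet-demanded cache
    // lines into the L2, and stops as soon as a demand for another row arrives.
    struct OpenRowState {
        uint64_t rowId;
        uint32_t linesPerRow;
        std::set<uint32_t> linesDemanded;     // line indices already fetched on demand
    };

    // Returns the next line index to prefetch from the open row, or -1 when the
    // row is exhausted or a demand request for a different row is pending.
    int nextPrefetchLine(const OpenRowState& row, bool demandForOtherRowPending) {
        if (demandForOtherRowPending) return -1;              // demands are critical
        for (uint32_t line = 0; line < row.linesPerRow; ++line)
            if (!row.linesDemanded.count(line))
                return (int)line;                             // prefetch this line into L2
        return -1;                                            // whole row already in L2
    }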

IPC Results (Normalized to Round-Robin)

[Bar chart: normalized IPC for the Type-1 applications (SAD, PVC, SSC, BFS, MUM, CFD, KMN, SCP, FWT, IIX, SPMV, JPEG, BFSR, SC, FFT, SD2, WP, PVR, BP) and their average, for Objective 1, Objectives (1+2), Objectives (1+2+3), and a Perfect-L2 configuration; average gains are 25%, 31%, and 33%, with Perfect-L2 at 44%]

•  Within 11% of a perfect L2

Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013


(67)

Summary

•  Coordinated scheduling across SMs, CTAs, and warps

•  Consideration of effects deeper in the memory system

•  Coordinating warp residence in the core with the presence of corresponding lines in the cache

© Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated)

CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration in GPGPU Workloads

S.-Y. Lee, A. Arunkumar, and C.-J. Wu

ISCA 2015


(69)

Goal

•  Reduce warp divergence and hence increase throughput

•  The key is the identification of critical (lagging) warps

•  Manage resources and scheduling decisions to speed up the execution of critical warps thereby reducing divergence

(70)

Review: Resource Limits on Occupancy

[Diagram: the kernel distributor hands thread blocks to SMs; per-SM warp schedulers, warp contexts, the register file, and the L1/shared memory each limit the number of resident threads and thread blocks, and also shape locality effects. SM – Stream Multiprocessor; SP – Stream Processor]


(71)

Evolution of Warps in TB

•  Coupled lifetimes of the warps in a TB
   ◦  Start at the same time
   ◦  Synchronization barriers
   ◦  Kernel exit (an implicit synchronization barrier)

Completed warps

Region where latency hiding is less effective

Figure from P. Xiang et al., "Warp-Level Divergence in GPUs: Characterization, Impact, and Mitigation," HPCA 2014

(72)

Warp Criticality Problem

[Diagram: GPU organization – host CPU, kernel distributor, SMX scheduler, and per-SMX warp schedulers, warp contexts, registers, and L1/shared memory. Annotations: available registers represent spatial underutilization; registers still allocated to completed warps in the TB represent temporal underutilization; the last warp keeps the TB's resources alive while the completed warps' contexts sit idle]

Manage resources and schedules around critical warps

Temporal & spatial underutilization


(73)

Warp Execution Time Disparity

•  Branch divergence, interference in the memory system, scheduling policy, and workload imbalance

Figure from S. Y. Lee, et. al, “CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration in GPGPU Workloads,” ISCA 2015

(74)

Branch Divergence

Warp-to-warp variation in dynamic instruction count (without branch divergence)

Intra-warp branch divergence

Example: traversal over constant-node-degree graphs


(75)

Impact of the Memory System

Executing Warps

Intra-wavefront footprints in the cache

•  Could have been re-used
•  Too slow!

Figure from S. Y. Lee, et. al, “CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration in GPGPU Workloads,” ISCA 2015

(76)

Impact of Warp Scheduler

•  Amplifies the critical warp effect

Figure from S. Y. Lee, et. al, “CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration in GPGPU Workloads,” ISCA 2015


(77)

Criticality Predictor Logic

m instructions

n instructions

•  Non-divergent branches can generate large differences in dynamic instruction counts across warps (e.g., m>>n)

•  Update CPL counter on branch: estimate dynamic instruction count

•  Update CPL counter on instruction commit

(78)

CPL Calculation

m instructions vs. n instructions (disparity between warps)

nCriticality = nInstr × w.CPIavg + nStall

where nInstr is the dynamic-instruction-count disparity between warps, w.CPIavg is the average warp CPI, and nStall counts inter-instruction memory stall cycles
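A sketch of a per-warp criticality predictor following the formula above, with the update points the previous slide lists (at branches and at instruction commit); field names and update granularity are illustrative, not the paper's exact CPL hardware:

    #include <cstdint>

    // Criticality Predictor Logic (sketch).  Per warp, the predictor tracks an
    // estimate of dynamic instruction disparity (bumped at branches, reduced as
    // instructions commit) and accumulated inter-instruction memory stall
    // cycles; criticality combines the two via the warp's average CPI.
    struct WarpCriticality {
        int64_t nInstrDisparity = 0;   // estimated dynamic instruction-count disparity
        int64_t nStallCycles    = 0;   // inter-instruction memory stall cycles
        double  cpiAvg          = 1.0; // average CPI observed for this warp

        void onBranch(int64_t estimatedExtraInstrs) { nInstrDisparity += estimatedExtraInstrs; }
        void onCommit()                              { nInstrDisparity -= 1; }
        void onMemStall(int64_t cycles)              { nStallCycles += cycles; }

        // nCriticality = nInstr * w.CPIavg + nStall
        double criticality() const { return nInstrDisparity * cpiAvg + nStallCycles; }
    };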


(79)

Scheduling Policy

•  Select warp based on criticality

•  Execute it until no more instructions are available
   ◦  A form of GTO

•  Critical warps get higher priority and a larger time slice
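A minimal sketch of the selection rule above, reusing the predictor sketched earlier: stay on the chosen warp greedily, and when it can no longer issue, pick the ready warp with the highest criticality. The function signature is an assumption for illustration:

    #include <cstdint>
    #include <vector>

    // Criticality-aware warp selection (sketch).  'criticality[w]' would come
    // from the per-warp predictor; 'ready[w]' reports whether warp w can issue.
    int selectCritical(const std::vector<double>& criticality,
                       const std::vector<bool>& ready, int lastIssued) {
        if (lastIssued >= 0 && ready[lastIssued])
            return lastIssued;                        // greedy: stay on the critical warp
        int best = -1;
        for (size_t w = 0; w < criticality.size(); ++w)
            if (ready[w] && (best < 0 || criticality[w] > criticality[best]))
                best = (int)w;
        return best;                                  // highest-criticality ready warp
    }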

(80)

Behavior of Critical Warp References

Figure from S. Y. Lee, et. al, “CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration in GPGPU Workloads,” ISCA 2015


(81)

Criticality Aware Cache Prioritization

•  Prediction: critical warp
•  Prediction: re-reference interval
•  Both used to manage the cache footprint

Figure from S. Y. Lee, et. al, “CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration in GPGPU Workloads,” ISCA 2015

(82)

Integration into GPU Microarchitecture

Figure from S. Y. Lee, et. al, “CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration in GPGPU Workloads,” ISCA 2015

Criticality Prediction Cache Management


(83)

Performance

Periodic computation of accuracy for critical warps

Due in part to low miss rates

(84)

Summary

•  Warp divergence leads to some lagging warps → critical warps

•  Expose the performance impact of critical warps → throughput reduction

•  Coordinate scheduler and cache management to reduce warp divergence