
(1)

Introduction to Control Divergence

Lectures Slides and Figures contributed from sources as noted

(2)

Objective

• Understand the occurrence of control divergence and the concept of thread reconvergence (also described as branch divergence and thread divergence)

• Cover a basic thread reconvergence mechanism – PDOM; set up discussion of further optimizations and advanced techniques

• Explore one approach for mitigating the performance degradation due to control divergence – dynamic warp formation

(3)

Reading

• W. Fung, I. Sham, G. Yuan, and T. Aamodt, “Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware,” ACM TACO, June 2009

• W. Fung and T. Aamodt, “Thread Block Compaction for Efficient SIMT Control Flow,” International Symposium on High Performance Computer Architecture, 2011

• M. Rhu and M. Erez, “CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures,” ISCA 2012

• Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013

(4)

Handling Branches

• CUDA Code:

if (…) { … }   (true for some threads)
else { … }     (true for others)

• What if threads take different branches?

• Branch Divergence!

[Figure: a warp of four threads T T T T splitting into taken and not-taken paths]
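The taken / not-taken split above can be sketched in plain Python (an illustrative simulation, not hardware; the values and threshold are made-up stand-ins for per-thread data):

```python
# Illustrative only: per-lane predicate evaluation for a 4-thread warp
# (values and threshold are hypothetical stand-ins for per-thread data).
def branch_masks(values, predicate):
    """Return (taken, not_taken) active masks, one bit per lane."""
    taken = [1 if predicate(v) else 0 for v in values]
    return taken, [1 - b for b in taken]

taken, not_taken = branch_masks([15, 3, 7, 22], lambda k: k > 10)
print(taken, not_taken)   # [1, 0, 0, 1] [0, 1, 1, 0]
```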

(5)

Branch Divergence

• Occurs within a warp

• Branches lead to serialization of branch-dependent code

Performance issue: low warp utilization

if (…)
  { … }
else
  { … }

[Figure: idle threads in the inactive path of each branch; reconvergence after the else]

(6)

Branch Divergence

Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt

[Figure: a warp of four threads (Thread1–Thread4) sharing a common PC, and the example CFG: A → {B, F}; B → {C, D}; C, D → E; E, F → G]

• Different threads follow different control flow paths through the kernel code

• Thread execution is (partially) serialized: only the subset of threads that follow the same path executes in parallel

(7)

Basic Idea

• Split: partition a warp into two mutually exclusive thread subsets, each branching to a different target; identify the subsets with two activity masks, effectively two warps

• Join: merge two subsets of a previously split warp, reconverging the mutually exclusive sets of threads

• Orchestrate the correct execution for nested branches

• Note the long history of such techniques in SIMD processors (see background in Fung et al.)

[Figure: a warp of threads T1–T4 with a common PC and activity mask 0011]
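A minimal Python sketch of the split/join idea using activity masks (illustrative only; a 4-lane warp is assumed):

```python
# Split/join with activity masks (illustrative; 4-lane warp).
def split_warp(active, taken):
    """Partition one active mask into taken and fall-through subsets."""
    t = [a & b for a, b in zip(active, taken)]
    f = [a & (1 - b) for a, b in zip(active, taken)]
    return t, f

def join_warps(m1, m2):
    """Reconverge two disjoint subsets of the original warp."""
    assert all(a & b == 0 for a, b in zip(m1, m2)), "subsets must be disjoint"
    return [a | b for a, b in zip(m1, m2)]

t, f = split_warp([1, 1, 1, 1], [0, 0, 1, 1])
print(t, f, join_warps(t, f))   # [0, 0, 1, 1] [1, 1, 0, 0] [1, 1, 1, 1]
```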

(8)

Thread Reconvergence

• Fundamental problem: merge threads with the same PC. How do we sequence the execution of threads? This can affect the ability to reconverge

• Question: When can threads productively reconverge?

• Question: When is the best time to reconverge?

[CFG: A → {B, F}; B → {C, D}; C, D → E; E, F → G]

(9)

Dominator

• Node d dominates node n if every path from the entry node to n must go through d

[CFG: A → {B, F}; B → {C, D}; C, D → E; E, F → G]

(10)

Immediate Dominator

• Node d immediately dominates node n if d dominates n and no other node dominates n between d and n

[CFG: A → {B, F}; B → {C, D}; C, D → E; E, F → G]

(11)

Post Dominator

• Node d post-dominates node n if every path from node n to the exit node must go through d

[CFG: A → {B, F}; B → {C, D}; C, D → E; E, F → G]

(12)

Immediate Post Dominator

• Node d immediately post-dominates node n if d post-dominates n and no other node post-dominates n between d and n

[CFG: A → {B, F}; B → {C, D}; C, D → E; E, F → G]
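The definitions above can be computed mechanically. A small Python sketch (assuming the example CFG's edges as listed in the comment) derives post-dominator sets by iterative dataflow and extracts the immediate post-dominator:

```python
# Post-dominator computation for the example CFG (assumed edges:
# A->{B,F}, B->{C,D}, C->E, D->E, E->G, F->G; exit node = G).
succ = {'A': ['B', 'F'], 'B': ['C', 'D'], 'C': ['E'], 'D': ['E'],
        'E': ['G'], 'F': ['G'], 'G': []}
nodes = list(succ)

# Iterative dataflow: pdom(n) = {n} | intersection of pdom(s) over successors s.
pdom = {n: set(nodes) for n in nodes}
pdom['G'] = {'G'}                      # the exit post-dominates only itself
changed = True
while changed:
    changed = False
    for n in nodes:
        if n == 'G':
            continue
        new = {n} | set.intersection(*(pdom[s] for s in succ[n]))
        if new != pdom[n]:
            pdom[n], changed = new, True

def ipdom(n):
    """Immediate post-dominator: the strict post-dominator closest to n."""
    strict = pdom[n] - {n}
    return next(d for d in strict if pdom[d] == strict)

print(ipdom('B'), ipdom('A'))   # E is B's reconvergence point; G is A's
```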

Baseline: PDOM

Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt

[Figure: example CFG annotated with active masks: A/1111, B/1111, C/1001, D/0110, E/1111, G/1111]

Reconvergence stack snapshots (each entry: Reconv. PC, Next PC, Active Mask; TOS = top of stack):

Initially:          (--, A, 1111) ← TOS
After A:            (--, B, 1111) ← TOS
Branch at B:        (--, E, 1111)  (E, D, 0110)  (E, C, 1001) ← TOS
C reaches E (pop):  (--, E, 1111)  (E, D, 0110) ← TOS
D reaches E (pop):  (--, E, 1111) ← TOS
After E:            (--, G, 1111) ← TOS

Execution order (time →): A (1111), B (1111), C (1001), D (0110), E (1111), G (1111)

(14)

Stack Entry

[Figure: the same PDOM reconvergence-stack example as the previous slide]

• A stack entry is a specification of a group of active threads that will execute that basic block

• The naturally nested structure of control flow lends itself to stack-based serialization
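A toy Python simulation of the stack behavior described above (an assumed simplification of the hardware: one basic block executes per step, with branch targets and masks hard-coded for the running example):

```python
# Toy PDOM stack simulation for the running example: 4-thread warp,
# divergent branch at B whose iPDOM is E (block layout as on the slides).
ipdom_of = {'B': 'E'}
next_block = {'A': 'B', 'C': 'E', 'D': 'E', 'E': 'G', 'G': None}
branch_sides = {'B': [('D', [0, 1, 1, 0]), ('C', [1, 0, 0, 1])]}

def run(start, mask):
    stack = [(None, start, mask)]    # entries: (reconv PC, next PC, mask)
    trace = []
    while stack:
        rpc, pc, m = stack[-1]
        trace.append((pc, m))
        if pc in branch_sides:                       # divergent branch
            stack[-1] = (rpc, ipdom_of[pc], m)       # reconvergence entry
            for tgt, tmask in branch_sides[pc]:
                stack.append((ipdom_of[pc], tgt, tmask))
        else:
            nxt = next_block[pc]
            if nxt is None or nxt == rpc:            # reached RPC or exit: pop
                stack.pop()
            else:
                stack[-1] = (rpc, nxt, m)
    return trace

for pc, m in run('A', [1, 1, 1, 1]):
    print(pc, m)   # A, B, C, D, E, G with masks 1111/1111/1001/0110/1111/1111
```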

(15)

More Complex Example

From W. Fung, I. Sham, G. Yuan, and T. Aamodt, “Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware,” ACM TACO, June 2009

• Stack-based implementation for nested control flow: stack entry RPC set to the IPDOM

• Reconvergence at the immediate post-dominator of the branch

(16)

Implementation

[Figure: SIMT core pipeline: I-Fetch, Decode, I-Buffer (pending warps), Issue, RF, scalar pipelines, D-Cache (all hit?), Writeback. From GPGPU-Sim documentation, http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual#SIMT_Cores]


• GPGPU-Sim model:
  Implement the per-warp stack at the issue stage
  Acquire the active mask and PC from the TOS
  Scoreboard check prior to issue
  Register writeback updates the scoreboard and the ready bit in the instruction buffer
  When RPC = Next PC, pop the stack

• Implications for instruction fetch?

(17)

Implementation (2)

• warpPC (next instruction) is compared to the reconvergence PC

• On a branch: the reconvergence PC can be stored as part of the branch instruction; the branch unit has NextPC, TargetPC, and the reconvergence PC to update the stack

• On reaching a reconvergence point: pop the stack and continue fetching from the NextPC of the next entry on the stack

(18)

Can We Do Better?

• Warps are formed statically

• Key idea of dynamic warp formation: maintain a pool of warps and show how they can be merged

• At a high level what are the requirements

[CFG: A → {B, F}; B → {C, D}; C, D → E; E, F → G]

(19)

Compaction Techniques

• Can we reform warps so as to increase utilization?

• Basic idea: compaction. Reform warps with threads that follow the same control-flow path, increasing the utilization of warps

• Two basic types of compaction techniques:

• Inter-warp compaction: group threads from different warps

• Intra-warp compaction: group threads within a warp, changing the effective warp size
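The reformation step can be sketched in Python (illustrative; real hardware must also respect lane assignments, covered later):

```python
# Inter-warp compaction sketch (illustrative; ignores lane constraints,
# which the hardware must also respect).
from collections import defaultdict

def compact(threads, warp_size=4):
    """threads: list of (tid, pc). Regroup tids that share a PC into warps."""
    by_pc = defaultdict(list)
    for tid, pc in threads:
        by_pc[pc].append(tid)
    warps = []
    for pc, tids in by_pc.items():
        for i in range(0, len(tids), warp_size):
            warps.append((pc, tids[i:i + warp_size]))
    return warps

# Four half-empty warps on two paths compact into one full warp per path.
threads = [(0, 'C'), (3, 'C'), (5, 'D'), (6, 'D'),
           (4, 'C'), (7, 'C'), (1, 'D'), (2, 'D')]
print(compact(threads))   # [('C', [0, 3, 4, 7]), ('D', [5, 6, 1, 2])]
```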

(20)

Inter-Warp Thread Compaction Techniques

Lectures Slides and Figures contributed from sources as noted

(21)

Goal

[Figure: warps 0–5 each split between the taken and not-taken paths of an if-else. Merge threads across warps?]

(22)

Reading

• W. Fung, I. Sham, G. Yuan, and T. Aamodt, “Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware,” ACM TACO, June 2009

• W. Fung and T. Aamodt, “Thread Block Compaction for Efficient SIMT Control Flow,” International Symposium on High Performance Computer Architecture, 2011

DWF: Example

Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt

[Figure: CFG with per-warp active masks for warps x and y: A x/1111 y/1111; B x/1110 y/0011; C x/1000 y/0010; D x/0110 y/0001; F x/0001 y/1100; E x/1110 y/0011; G x/1111 y/1111]

Baseline (per-warp stack), time →: A A B B C C D D E E F F G G A A
Dynamic warp formation, time →: A A B B C D E E F G G A A

A new warp is created from scalar threads of both warp x and warp y executing at basic block D

(24)

How Does This Work?

• Criteria for merging: same PC; complementary sets of active threads in each warp (recall: many warps per thread block all execute the same code)

• What information do we need to merge two warps? Thread IDs and PCs

• Ideally how would you find/merge warps?
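The merge test can be sketched in Python (illustrative; the masks here double as lane-occupancy vectors):

```python
# DWF merge test (illustrative): two warp fragments merge when they are at
# the same PC and their lane-occupancy masks do not overlap.
def can_merge(pc_a, mask_a, pc_b, mask_b):
    return pc_a == pc_b and all(a & b == 0 for a, b in zip(mask_a, mask_b))

def merge(mask_a, mask_b):
    return [a | b for a, b in zip(mask_a, mask_b)]

print(can_merge('D', [0, 1, 1, 0], 'D', [0, 0, 0, 1]))   # True: complementary
print(can_merge('D', [0, 1, 1, 0], 'D', [0, 1, 0, 0]))   # False: lane 1 conflict
print(merge([0, 1, 1, 0], [0, 0, 0, 1]))                 # [0, 1, 1, 1]
```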

DWF: Microarchitecture Implementation

Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt

[Figure: DWF microarchitecture: I-Cache, Decode, per-lane register files and ALUs indexed by (TID, Reg#), thread scheduler with PC-Warp LUT and warp pool, warp allocator, warp update registers (taken / not-taken), issue logic, commit/writeback. The PC-Warp LUT assists in aggregating threads; occupancy vectors (OCC) identify occupied and available lanes and point to the warp being formed.]

• Warps formed dynamically in the warp pool

• After commit, check the PC-Warp LUT and either merge into or allocate a newly forming warp

DWF: Microarchitecture Implementation

Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt

[Figure: worked example on the DWF microarchitecture. Warps X (threads 1 2 3 4) and Y (threads 5 6 7 8) reach the branch A: BEQ R2, B. Threads from both warps that take the branch are merged in the warp pool, e.g. forming warp Z (5 2 3 8) at B with no lane conflict, while the remaining threads form a warp at C.]

(27)

Resource Usage

• Ideally we would like a small number of unique PCs in progress at a time, to minimize overhead

• Warp divergence will increase the number of unique PCs: mitigate via warp scheduling

• Scheduling policies:
  FIFO
  Program counter: address variation as a measure of divergence
  Majority/Minority: most common path vs. helping stragglers
  Post-dominator (catch up)

(28)

Hardware Consequences

• Expose the implications that warps have in the base design: implications for register file access lead to lane-aware DWF

• Register bank conflicts

From W. Fung, I. Sham, G. Yuan, and T. Aamodt, “Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware,” ACM TACO, June 2009

(29)

Relaxing Implications of Warps

• Thread swizzling: essentially remap work to threads so as to create more opportunities for DWF; requires a deep understanding of algorithm behavior and data sets

• Lane swizzling in hardware: provide limited connectivity between register banks and lanes, avoiding full crossbars

(30)

Summary

• Control flow divergence is a fundamental performance limiter for SIMT execution

• Dynamic warp formation is one way to mitigate these effects; we will look at several others

• Must balance a complex set of effects: memory behaviors, synchronization behaviors, scheduler behaviors

Thread Block Compaction W. Fung and T. Aamodt

HPCA 2011

(32)

Goal

• Overcome some of the disadvantages of dynamic warp formation: impact of scheduling, breaking of implicit synchronization, reduction of memory coalescing opportunities

Wilson Fung, Tor Aamodt Thread Block Compaction 33


DWF Pathologies: Starvation

• Majority scheduling: best performing; prioritize the largest group of threads with the same PC

• Starvation: LOWER SIMD efficiency!

• Other warp scheduler? Tricky: variable memory latency

B: if (K > 10)
C:    K = 10;
   else
D:    K = 0;
E: B = C[tid.x] + K;

[Figure: majority scheduling runs C 1 2 7 8 and C 5 -- 11 12, then E 1 2 7 8 and E 5 -- 11 12, while warps D 9 6 3 4 and D -- 10 -- -- starve for 1000s of cycles]

(34)

DWF Pathologies: Extra Uncoalesced Accesses

• Coalesced memory access = memory SIMD: a 1st-order CUDA programmer optimization

• Not preserved by DWF: E: B = C[tid.x] + K;

No DWF: warps E 1 2 3 4, E 5 6 7 8, E 9 10 11 12 → #Acc = 3 (segments 0x100, 0x140, 0x180)
With DWF: warps E 1 2 7 12, E 9 6 3 8, E 5 10 11 4 → #Acc = 9

The L1 cache absorbs the redundant memory traffic (L1$ port conflict)
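The access counts above can be reproduced with a small Python sketch (assumptions: 4-thread warps and one coalesced segment per aligned group of 4 consecutive thread IDs, matching the 0x100 / 0x140 / 0x180 segments):

```python
# Reproducing the slide's access counts (assumed: 4-thread warps, and one
# coalesced segment per aligned group of 4 consecutive thread IDs).
def transactions(warps):
    """Total memory transactions: distinct segments touched per warp."""
    return sum(len({(tid - 1) // 4 for tid in warp}) for warp in warps)

static_warps = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
dwf_warps    = [[1, 2, 7, 12], [9, 6, 3, 8], [5, 10, 11, 4]]
print(transactions(static_warps), transactions(dwf_warps))   # 3 9
```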

(35)

DWF Pathologies: Implicit Warp Sync.

• Some CUDA applications depend on the lockstep execution of “static warps”

Warp 0: threads 0 … 31
Warp 1: threads 32 … 63
Warp 2: threads 64 … 95

From W. Fung, I. Sham, G. Yuan, and T. Aamodt, “Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware,” ACM TACO, June 2009

(36)

Performance Impact

From W. Fung and T. Aamodt, “Thread Block Compaction for Efficient SIMT Control Flow,” International Symposium on High Performance Computer Architecture, 2011

(37)

Thread Block Compaction

 Block-wide reconvergence stack: regroup threads within a block

 Better reconvergence stack: likely convergence; converge before the immediate post-dominator

 Robust: avg. 22% speedup on divergent CUDA apps, no penalty on others

[Figure: three per-warp stacks (each entry: PC, RPC, AMask) merge into one block-wide stack for Thread Block 0 (E -- 1111 1111 1111; D E 0011 0100 1100; C E 1100 1011 0011), from which compacted warps issue: C warps X and Y, D warps U and T, E warps 0, 1, 2]

GPU Microarchitecture

[Figure: GPU microarchitecture: SIMT cores (SIMT front end: fetch, decode, schedule, branch; SIMD datapath; memory subsystem: SMem, L1 D$, Tex $, Const $; done (warp ID)) connected through an interconnection network to memory partitions, each with a last-level cache bank and an off-chip DRAM channel. More details in the paper.]

(39)

Observation

 Compute kernels usually contain divergent and non-divergent (coherent) code segments

 Coalesced memory accesses usually occur in coherent code segments, where DWF provides no benefit

[Figure: execution alternates static warps in coherent segments with dynamic warps after divergence; warps reset to the static arrangement at the reconvergence point, where coalesced loads/stores resume]

(40)

Thread Block Compaction

 Barrier @ branch/reconvergence pt.: all available threads arrive at the branch; insensitive to warp scheduling

 Run a thread block like a warp: the whole block moves between coherent and divergent code; a block-wide stack tracks execution paths and reconvergence

 Warp compaction: regrouping with all available threads; if there is no divergence, this gives the static warp arrangement

 Together these target the DWF pathologies: starvation, implicit warp sync., extra uncoalesced memory accesses

(41)

Thread Block Compaction

Block-wide stack:

PC  RPC  Active Threads
A   --   1 2 3 4 5 6 7 8 9 10 11 12
D   E    -- -- 3 4 -- 6 -- -- 9 10 -- --
C   E    1 2 -- -- 5 -- 7 8 -- -- 11 12
E   --   1 2 3 4 5 6 7 8 9 10 11 12

A: K = A[tid.x];
B: if (K > 10)
C:    K = 10;
   else
D:    K = 0;
E: B = C[tid.x] + K;

[Figure: static warps A 1 2 3 4, A 5 6 7 8, A 9 10 11 12 diverge; the divergent threads are compacted into warps C 1 2 7 8, C 5 -- 11 12 and D 9 6 3 4, D -- 10 -- --; at E execution returns to the static arrangement E 1 2 3 4, E 5 6 7 8, E 9 10 11 12]

(42)

Thread Block Compaction

 Barrier at every basic block?! (idle pipeline) Switch to warps from other thread blocks

 Multiple thread blocks run on a core: already done in most CUDA applications

[Figure: blocks 0, 1, and 2 interleave branch/warp-compaction phases with execution phases over time, hiding one another's barriers]

(43)

High Level View

• DWF: warp broken down every cycle and threads in a warp shepherded into a new warp (LUT and warp pool)

• TBC: warps broken down at potentially divergent points and threads compacted across the thread block

(44)

Microarchitecture Modification

 Per-warp stack → block-wide stack

 I-Buffer + TIDs → warp buffer: stores the dynamic warps

 New unit: thread compactor, which translates the active mask into compacted dynamic warps

 More detail in paper

[Figure: pipeline with Fetch, Valid[1:N], I-Cache, Decode, block-wide stack (branch target PC, active mask, predicate), thread compactor, warp buffer, scoreboard, issue, register file, ALUs, MEM, done (WID)]

(45)

Microarchitecture Modification (2)

[Figure: the thread compactor produces the number of compacted warps and the TIDs of the compacted warps]

From W. Fung and T. Aamodt, “Thread Block Compaction for Efficient SIMT Control Flow,” International Symposiumomn High Performance Computer Architecture, 2011

(46)

Operation

• Warp 2 arrives first, creating 2 target entries

• Subsequent warps update the active mask

[Figure: all threads mapped to the same lane form that lane's candidate list; when the pending-warp counter reaches zero, compaction proceeds and a priority encoder picks one thread mapped to each lane]

From W. Fung and T. Aamodt, “Thread Block Compaction for Efficient SIMT Control Flow,” International Symposiumomn High Performance Computer Architecture, 2011

(47)

Thread Compactor

 Convert the active mask from the block-wide stack into thread IDs in the warp buffer: an array of priority encoders

 Active mask for C (RPC = E): 1 2 -- -- | 5 -- 7 8 | -- -- 11 12
 Per-lane candidates: {1, 5} {2} {7, 11} {8, 12}
 Each P-Enc picks one thread per lane per cycle → compacted warps C 1 2 7 8 and C 5 -- 11 12 into the warp buffer

(48)
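The per-lane priority-encoder behavior can be sketched in Python (an illustrative model, not the RTL):

```python
# Thread compactor sketch: one priority encoder per lane emits at most one
# thread ID per cycle, turning per-lane candidate lists into compacted warps.
def compact_lanes(lane_tids):
    """lane_tids: one candidate list per lane. Returns the emitted warps,
    with None marking a lane that had no candidate left (an idle lane)."""
    lanes = [list(t) for t in lane_tids]
    warps = []
    while any(lanes):
        warps.append([lane.pop(0) if lane else None for lane in lanes])
    return warps

# Candidates for path C in the running example: lanes {1,5} {2} {7,11} {8,12}.
print(compact_lanes([[1, 5], [2], [7, 11], [8, 12]]))
# [[1, 2, 7, 8], [5, None, 11, 12]]
```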

Likely-Convergence

 Immediate post-dominator: conservative; all paths from a divergent branch must merge there, but the path reaching it may be rarely taken

 Convergence can happen earlier: when any two of the paths merge

 Extended reconvergence stack to exploit this; TBC: 30% speedup for ray tracing

while (i < K) {
   X = data[i];
A: if ( X == 0 )
B:    result[i] = Y;
C: else if ( X == 1 )
D:    break;
E: i++;
}
F: return result[i];    (F = iPDom of A)

 Details in paper

(49)

Likely-Convergence (2)

 NVIDIA uses a break instruction for loop exits; that handles the last example

 Our solution: likely-convergence points

 This paper: only used to capture loop breaks

[Table: stack extended with LPC and LPos fields; entries (PC, RPC, LPC, LPos, active threads) such as (F, --, --, --, 1 2 3 4), (B, F, E, 1, 1), (C, F, E, 1, 2 3 4), (D, F, E, 1, 3 4); matching entries at E merge: convergence!]

(50)

Likely-Convergence (3)

Check: merge inside the stack

From W. Fung and T. Aamodt, “Thread Block Compaction for Efficient SIMT Control Flow,” International Symposiumomn High Performance Computer Architecture, 2011

(51)

Likely-Convergence (4)

Applies to both the per-warp stack (PDOM) and thread block compaction (TBC); enables more thread grouping for TBC. Side effect: reduces stack usage in some cases

(52)

Evaluation

 Simulation: GPGPU-Sim (2.2.1b), modeling a Quadro FX5800, plus L1 & L2 caches

 21 benchmarks: all of the original GPGPU-Sim benchmarks, the Rodinia benchmarks, and other important applications:
   Face Detection from VisBench (UIUC)
   DNA sequencing (MUMMER-GPU++)
   Molecular dynamics simulation (NAMD)
   Ray tracing from NVIDIA Research

(53)

Experimental Results

• 2 benchmark groups: COHE = non-divergent CUDA applications, DIVG = divergent CUDA applications

[Figure: IPC relative to the per-warp-stack baseline (0.6–1.3). DWF suffers serious slowdown from its pathologies; TBC shows no penalty on COHE and a 22% speedup on DIVG]

(54)

Effect on Memory Traffic

• TBC does still generate some extra uncoalesced memory accesses: C 1 2 7 8 and C 5 -- 11 12 touch segments 0x100, 0x140, 0x180, so #Acc = 4, but the 2nd access will hit the L1 cache

• No change to overall memory traffic in/out of a core

[Figure: normalized memory stalls (0% to 300%) for TBC, DWF, and baseline, and memory traffic normalized to baseline (0.6 to 1.2) for TBC-AGE and TBC-RRB, across DIVG benchmarks (BFS2, FCDT, HOTSP, LPS, MUM, MUMpp, NAMD, NVRT) and COHE benchmarks (AES, BACKP, CP, DG, HRTWL, LIB, LKYT, MGST, NNC, RAY, STMCL, STO, WP); one case reaches 2.67×]

(55)

Thread Block Compaction

Conclusion

• Thread Block Compaction: addressed some key challenges of DWF; one significant step closer to reality

• Benefits from advancements on the reconvergence stack: likely-convergence points; extensible, can integrate other stack-based proposals

CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures
M. Rhu and M. Erez

ISCA 2012

(57)

Goals

• Improve the performance of inter-warp compaction techniques

• Predict when branches diverge Borrow philosophy from branch prediction

• Use prediction to apply compaction only when it is beneficial

(58)

Issues with Thread Block Compaction

[CFG: A → {B, F}; B → {C, D}; C, D → E; E, F → G]

• TBC: warps broken down at potentially divergent points and threads compacted across the thread block (implicit barrier across warps to collect compaction candidates)

• Barrier synchronization overhead cannot always be hidden

• When it works, it works well

(59)

Divergence Behavior: A Closer Look

[Figure: per-branch divergence behavior over time; the loop executes a fixed number of times, and one branch is compaction-ineffective]

Figure from M. Rhu and M. Erez, “CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures,” ISCA 2012

(60)

Compaction-Resistant Branches

Control flow graph

Figure from M. Rhu and M. Erez, “CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures,” ISCA 2012

(61)

Compaction-Resistant Branches(2)

Control flow graph

Figure from M. Rhu and M. Erez, “CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures,” ISCA 2012

(62)

Impact of Ineffective Compaction

Threads shuffled around with no performance improvement

However, can lead to increased memory divergence!

Figure from M. Rhu and M. Erez, “CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures,” ISCA 2012

(63)

Basic Idea

• Only stall and compact when there is a high probability of compaction success

• Otherwise allow warps to bypass the (implicit) barrier

• Compaction adequacy predictor! Think branch prediction!

(64)

Example: TBC vs. CAPRI

Bypassing enables increased overlap of memory references

Figure from M. Rhu and M. Erez, “CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures,” ISCA 2012

(65)

CAPRI: Example

• No divergence: no stalling

• First divergence: stall and initialize history; all other warps will now stall

• Divergence with history available: predict and update the prediction

• All other warps follow (one prediction per branch)

• CAPT updated

Figure from M. Rhu and M. Erez, “CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures,” ISCA 2012

(66)

The Predictor

• Prediction uses the active masks of all warps: need to understand what could have happened

• Actual compaction only uses the warps that actually stalled

• The minimum provides the maximum compaction ability, i.e., the number of compacted warps

• Update the history predictor accordingly

[Figure: counting the available threads in each lane]

Figure from M. Rhu and M. Erez, “CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures,” ISCA 2012
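An adequacy test in this spirit can be sketched in Python (an assumed simplification, not the paper's exact predictor: compaction is deemed adequate only if the per-lane candidate counts allow fewer compacted warps than input warps):

```python
# Compaction-adequacy test in the spirit of CAPRI (assumed simplification).
def compacted_warp_count(masks):
    """Max over lanes of how many candidate warps occupy that lane."""
    return max(sum(m[lane] for m in masks) for lane in range(len(masks[0])))

def adequate(masks):
    """Compaction helps only if it yields fewer warps than it consumed."""
    return compacted_warp_count(masks) < len(masks)

print(adequate([[1, 1, 0, 0], [0, 0, 1, 1]]))   # True: 2 warps compact to 1
print(adequate([[1, 1, 0, 0], [1, 1, 0, 0]]))   # False: aligned divergence
```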

(67)

Behavior

Figure from M. Rhu and M. Erez, “CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures,” ISCA 2012

(68)

Impact of Implicit Barriers

• Idle cycle count helps us understand the negative effects of implicit barriers in TBC

Figure from M. Rhu and M. Erez, “CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures,” ISCA 2012

(69)

Summary

• The synchronization overhead of thread block compaction can introduce performance degradation

• Some branches diverge more than others

• Apply TBC judiciously: predict when it is beneficial

• Effectively predict when inter-warp compaction is effective

M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013

(71)

Goals

• Understand the limitations of compaction techniques and proximity to ideal compaction

• Provide mechanisms to overcome these limitations and approach ideal compaction rates

(72)

Limitations of Compaction

Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013

(73)

Mapping Threads to Lanes: Today

[Figure: a grid of thread blocks; the threads of Block (1,1) (Thread (0,0,0) … (1,0,3)) are linearized and partitioned into Warp 0 and Warp 1, then assigned to SIMD lanes by modulo assignment; each lane has a scalar pipeline and a register file slice RF0 … RFn-1]

• Linearization of thread IDs

• Modulo assignment to lanes

(74)

Data Dependent Branches

• Data-dependent control flow is less likely to produce lane conflicts

Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013

(75)

Programmatic Branches

• Programmatic branches can be correlated to lane assignments (with modulo assignment)

• Program variables that operate like constants across threads can produce correlated branching behaviors

Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013

(76)

P-Branches vs. D-Branches

P-branches are the problem!

Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013

(77)

Compaction Opportunities

Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013

• Lane reassignment can improve compaction opportunities

(78)

Aligned Divergence

• Threads mapped to a lane tend to evaluate (programmatic) predicates the same way; empirically, this is rarely exhibited for input- or data-dependent control flow

• Compaction cannot help in the presence of lane conflicts

• Performance of compaction mechanisms depends on both divergence patterns and lane conflicts

• We need to understand the impact of lane assignment

(79)

Impact of Lane Reassignment

Goal: Improve “compactability”

Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013

(80)

Random Permutations

• Does not always work well; works well on average

• Better understanding of programs can lead to better permutations choices

Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013

(81)

Mapping Threads to Lanes: New

[Figure: warps 0 … N-1 (SIMD width = 8) mapped onto the lanes, each lane with a scalar pipeline and a register file slice RF0 … RFn-1]

What criteria do we use for lane assignment?

(82)

Balanced Permutation

 Each lane has a single instance of a logical thread from each warp

 Even warps: permutation within a half warp

 Odd warps: swap the upper and lower halves

Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013

(83)

Balanced Permutation (2)

Logical TID 0 in each warp is now assigned a different lane

Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013

(84)

Characteristics

• Vertical Balance: Each lane only has logical TIDs of distinct threads in a warp

• Horizontal balance: Logical TID x in all of the warps is bound to different lanes

• This works when CTAs have fewer warps than the SIMD width: why?

• Note that random permutations achieve this only on average
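One way to get both balance properties is an XOR-based permutation; this is a sketch in the spirit of the paper's scheme, not its exact even/odd half-warp rule (assumes warp count ≤ SIMD width):

```python
# XOR lane permutation sketch (assumption: a stand-in for the paper's
# balanced permutation; valid when warp count <= SIMD width).
def permute(warp_id, width=8):
    """Physical lane for each logical lane of the given warp."""
    return [lane ^ warp_id for lane in range(width)]

warps = [permute(w) for w in range(8)]

# Vertical balance: each warp's mapping is a bijection over the lanes.
assert all(sorted(w) == list(range(8)) for w in warps)
# Horizontal balance: logical lane 0 lands in a different physical lane per warp.
assert len({w[0] for w in warps}) == 8
print(warps[1])   # [1, 0, 3, 2, 5, 4, 7, 6]
```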

(85)

Impact on Memory Coalescing

• Modern GPUs do not require ordered requests

• Coalescing can occur across a set of requests speicific lane assignments do not affect coalescing behavior

• Increase is L1 miss rate offset by benefits of compaction

Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013

(86)

Speedup of Compaction

• Can improve the compaction rate for divergence due to programmatic branches, which form the majority

Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013

(87)

Compaction Rate vs. Utilization

Distinguish between compaction rate and utilization!

Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013

(88)

Application of Balanced Permutation

[Figure: SIMT core pipeline: I-Fetch, Decode, I-Buffer (pending warps), Issue, RF, scalar pipelines, D-Cache, Writeback]

• Permutation is applied when the warp is launched

• Maintained for the life of the warp

• Does not affect the baseline compaction mechanism

• Enable/disable SLP to preserve target-specific, programmer-implemented optimizations

(89)

Summary

• Structural hazards limit the performance improvements from inter-warp compaction

• Program behaviors produce correlated lane assignments today

• Remapping threads to lanes enables extension of compaction opportunities

(90)

Summary: Inter-Warp Compaction

[Figure: the example CFG alongside a design-space diagram spanning program properties, thread blocks, resource management, scope, and microarchitecture]

• Co-design of applications, resource management, software, and microarchitecture is recommended