Warp-Aware Trace Scheduling for GPUs
James Jablin (Brown)
Thomas Jablin (UIUC)
Onur Mutlu (CMU)
Maurice Herlihy (Brown)
Historical Trends in GFLOPS: CPUs vs. GPUs
[Figure: theoretical GFLOP/s (y-axis, 0 to 3250) by year (x-axis, 2001 to 2012). Series: NVIDIA GPU Single-Precision FP (GeForce 5800, GeForce 6800 Ultra, GeForce 7800 GTX, GeForce 8800 GTX, GeForce 280 GTX, GeForce 480 GTX, GeForce 580 GTX, GeForce 680 GTX) and Intel CPU Single-Precision FP (Northwood, Prescott, Woodcrest, Harpertown, Bloomfield, Westmere, Sandy Bridge).]
Reproduced from NVIDIA CUDA C Programming Guide (Version 5.0)
Performance Pitfalls
Control flow can negatively affect performance.
Pipeline Stall - an execution delay in an instruction pipeline to resolve a dependency
Hardware: CPU versus GPU
[Figure: side-by-side block diagrams of a CPU and a GPU, each built from Control, ALU, Cache, and DRAM blocks.]
Reproduced from NVIDIA CUDA C Programming Guide (Version 5.0)
[Figure: animated instruction pipeline diagram comparing execution with and without branch prediction. Pipeline stages: Fetch, Decode, Execute, Write. Waiting and completed instructions are plotted against the clock cycle: cycles 0 to 8 with branch prediction and cycles 0 to 10 without branch prediction, where a pipeline stall (bubble) appears.]
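The cost of a pipeline stall can be estimated with a small cycle-count model. This is a minimal sketch, not the model behind the slides. It assumes an in-order, four-stage pipeline (Fetch, Decode, Execute, Write) that issues one instruction per cycle, and it assumes a branch is resolved at the end of Execute, so an unpredicted branch inserts two bubble cycles. The function name `total_cycles` and the instruction counts are hypothetical.

```python
STAGES = ("Fetch", "Decode", "Execute", "Write")


def total_cycles(num_instructions, num_branches=0, stall_per_branch=0):
    """Cycles needed to complete all instructions in an in-order pipeline.

    The first instruction takes len(STAGES) cycles; every later instruction
    completes one cycle after its predecessor. Each branch that has to be
    resolved before the next fetch adds `stall_per_branch` bubble cycles.
    """
    if num_instructions == 0:
        return 0
    fill_and_drain = len(STAGES) + (num_instructions - 1)
    bubbles = num_branches * stall_per_branch
    return fill_and_drain + bubbles


# Hypothetical workload: five instructions, one of which is a branch.
with_prediction = total_cycles(5, num_branches=1, stall_per_branch=0)
# Assumption: the branch resolves at the end of Execute, so two fetch
# slots are lost while the pipeline waits for the outcome.
without_prediction = total_cycles(5, num_branches=1, stall_per_branch=2)
print(with_prediction, without_prediction)
```

Under these assumptions a correctly predicted branch costs nothing, while each unresolved branch costs a fixed number of bubbles, so the penalty grows linearly with the number of branches.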
Warp Divergence - threads within a warp take different paths, and the different execution paths are serialized
Warp Divergence Example
[Figure: animated control flow graph with basic blocks A, B, C, and D. Annotations mark where the warp diverges ("Warp Divergence!") and where it reconverges ("Warp Reconverges!"). The execution sequence listed in the final frame is A, B, D.]
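The serialization in the example above can be sketched by tracking which threads are active in each basic block. This is a simplified model, not the hardware's actual reconvergence mechanism. It assumes a diamond-shaped control flow graph in which A branches to either B or C and both paths rejoin at D; the edges are not recoverable from the slide, so that shape is an assumption, and `run_warp` is a hypothetical helper.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def run_warp(takes_b):
    """Return the (block, active thread count) sequence one warp executes.

    takes_b[i] is True when thread i branches from A to B and False when it
    branches from A to C. Assumed CFG: A -> {B, C} -> D.
    """
    all_threads = set(range(len(takes_b)))
    b_threads = {t for t in all_threads if takes_b[t]}
    c_threads = all_threads - b_threads

    schedule = [("A", len(all_threads))]
    # The two sides of the branch are issued one after the other, each with
    # only its own threads active. A side with no threads is skipped.
    for name, active in (("B", b_threads), ("C", c_threads)):
        if active:
            schedule.append((name, len(active)))
    # All threads are active again once the paths rejoin at D.
    schedule.append(("D", len(all_threads)))
    return schedule


uniform = run_warp([True] * WARP_SIZE)
divergent = run_warp([t % 2 == 0 for t in range(WARP_SIZE)])
print(uniform)
print(divergent)
```

When every thread takes the same side, the warp executes three blocks; when the threads split, it executes four, and B and C each run with only part of the warp active.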
Warp-Aware Trace Scheduling
Schedule instructions across basic block boundaries to expose additional ILP while managing and optimizing warp divergence.
Origins: Microcode Trace Scheduling
...generalizing local and disparate vertical-to-horizontal microcode compaction
Step                  Description
1. Trace Selection    partition basic blocks into regions
2. Trace Formation    facilitate local scheduling, potentially adding nodes and edges
3. Local Scheduling   schedule instructions within each region
[Figure: animated control flow graph with basic blocks A through L. Edges are annotated with weights between 1 and 100, and the final frame partitions the blocks into Trace #1, Trace #2, and Trace #3.]
Annotate CFG
- dynamic profiling
- static branch prediction
while there are unvisited nodes
    loop
        Find the next unvisited node with highest edge weight
        Add node to trace
    end loop
end while
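The selection loop above can be sketched as a greedy partition of a weighted control flow graph. This is a simplified illustration, not the authors' implementation: it seeds each trace at the unvisited block with the heaviest incident edge and grows the trace forward only, along the heaviest edge to an unvisited successor. The function `select_traces` and the four-block example graph are hypothetical and are not the graph from the slides.

```python
def select_traces(nodes, edges):
    """Greedily partition CFG nodes into traces.

    nodes: list of basic block names (list order breaks ties).
    edges: dict mapping (source, destination) to an edge weight obtained
           from profiling or static branch prediction.
    """
    unvisited = list(nodes)
    traces = []

    def heaviest_incident_edge(node):
        weights = [w for (src, dst), w in edges.items() if node in (src, dst)]
        return max(weights, default=0)

    while unvisited:
        # Seed a new trace at the unvisited node with the heaviest edge.
        current = max(unvisited, key=heaviest_incident_edge)
        trace = []
        while current is not None:
            trace.append(current)
            unvisited.remove(current)
            # Follow the heaviest edge to a successor not yet in any trace.
            successors = [
                (w, dst)
                for (src, dst), w in edges.items()
                if src == current and dst in unvisited
            ]
            current = max(successors, key=lambda p: p[0])[1] if successors else None
        traces.append(trace)
    return traces


# Hypothetical diamond-shaped CFG with a hot path through B.
example_nodes = ["A", "B", "C", "D"]
example_edges = {("A", "B"): 90, ("A", "C"): 10, ("B", "D"): 90, ("C", "D"): 10}
example_traces = select_traces(example_nodes, example_edges)
print(example_traces)
```

Every block ends up in exactly one trace, so the rarely executed block C forms a trace of its own once the hot path has been claimed.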
[Figure: control flow graph of basic blocks BB0, BB1, BB2, and BB3 with edge weights 10, 90, 100, and 100, shown before and after scheduling. The listings give each block's PTX with its slide line numbers.]

Before

BB0 (lines 10-13, 21-22)
    ...
    mov.u32 %r11, %ctaid.y;
    add.s32 %r12, %r8, 1;
    mov.u32 %r1, %tid.y;
    mov.u32 %r2, %tid.x;
    ...
    setp.ne.s32 %p2, %r2, 0;
    @%p2 bra BB2;

BB1 (lines 34-38)
    ...
    mul.wide.s32 %rd13, %r4, 4;
    add.s64 %rd14, %rd2, %rd13;
    cvt.s64.s32 %rd7, %r6;
    add.s64 %rd8, %rd6, %rd7;
    bra.uni BB3;

BB2 (lines 39-44)
BB2:
    ld.shared.f32 %f5, [%rd3];
    ld.shared.f32 %f6, [%rd3];
    mul.f32 %f7, %f5, %f6;
    st.shared.f32 [%r2], %f7;
    bra.uni BB3;

BB3 (lines 45-53)
BB3:
    mul.wide.s32 %rd15, %r3, 4;
    add.s64 %rd12, %rd1, %rd15;
    shl.b64 %rd16, %rd4, 6;
    mov.u64 %rd17, __param_0;
    add.s64 %rd18, %rd17, %rd16;
    mul.wide.s32 %rd19, %r2, 4;
    add.s64 %rd2, %rd18, %rd19;
    ld.global.f32 %f4, [%rd2];
    ...

After

BB0 (lines 10-13, 21, 48-53, 22)
    ...
    mov.u32 %r11, %ctaid.y;
    add.s32 %r12, %r8, 1;
    mov.u32 %r1, %tid.y;
    mov.u32 %r2, %tid.x;
    ...
    setp.ne.s32 %p2, %r2, 0;
    // lines 48-53, moved up from BB3 ahead of the branch
    shl.b64 %rd16, %rd4, 6;
    mov.u64 %rd17, __param_0;
    add.s64 %rd18, %rd17, %rd16;
    mul.wide.s32 %rd19, %r2, 4;
    add.s64 %rd2, %rd18, %rd19;
    ld.global.f32 %f4, [%rd2];
    @%p2 bra BB2;

BB1 (lines 34-38)
    ...
    mul.wide.s32 %rd13, %r4, 4;
    add.s64 %rd14, %rd2, %rd13;
    cvt.s64.s32 %rd7, %r6;
    add.s64 %rd8, %rd6, %rd7;
    bra.uni BB3;

BB2 (lines 39-44)
BB2:
    ld.shared.f32 %f5, [%rd3];
    ld.shared.f32 %f6, [%rd3];
    mul.f32 %f7, %f5, %f6;
    st.shared.f32 [%r2], %f7;
    bra.uni BB3;

BB3 (lines 45-47)
BB3:
    mul.wide.s32 %rd15, %r3, %rd8;
    add.s64 %rd12, %rd1, %rd15;
    ...
Comparing Speedup and IPC Using Dynamic Profiling
[Figure: bar chart of kernel speedup (axis 0.95× to 1.40×) and instructions executed per cycle (IPC, axis 1.00 to 1.40) for each benchmark kernel, plus GEOMEAN and HARMEAN.
Benchmarks: backprop, bfs, cfd, heartwall, hotspot, kmeans, lavaMD, leukocyte, lud, mummergpu, nn, nw, particlefilter (f), particlefilter (n), pathfinder, srad, streamcluster.
Kernels: bpnn_layerforward_CUDA, Kernel, Kernel2, cuda_compute_flux, kernel, calculate_temp, invert_mapping, kmeansPoint, kernel_gpu_cuda, dilate_kernel, IMGVF_kernel, lud_diagonal, mummergpuKernel, printKernel, euclid, needle_cuda_shared_1, needle_cuda_shared_2, find_index_kernel, likelihood_kernel, kernel, dynproc_kernel, srad_cuda_1, srad_cuda_2, kernel_compute_cost.]
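The GEOMEAN and HARMEAN summary bars correspond to the geometric and harmonic means of the per-kernel values. The following is a minimal sketch of the two formulas only; the speedup list is hypothetical and does not reproduce any measurement from the chart.

```python
import math


def geomean(values):
    """Geometric mean: the n-th root of the product of n positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def harmean(values):
    """Harmonic mean: n divided by the sum of the reciprocals."""
    return len(values) / sum(1.0 / v for v in values)


# Hypothetical per-kernel speedups, for illustration only.
speedups = [1.00, 1.10, 1.25]
print(round(geomean(speedups), 3), round(harmean(speedups), 3))
```

For positive values the harmonic mean never exceeds the geometric mean, which never exceeds the arithmetic mean, so the harmonic mean is the most conservative of the three summaries.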
Backup Slides
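Speculation models: Restricted [6], General [6], Boosting [36], Deviant (GPU)

Scheduling Restrictions
    Restricted [6]: Legal and Safe
    General [6]: Legal
    Boosting [36]: none
    Deviant (GPU): excludes texture, shared, and constant memory operations and all store instructions

Hardware Support
    Restricted [6]: none
    General [6]: non-trapping instructions
    Boosting [36]: shadow register file, shadow store buffer, and support for re-executing instructions
    Deviant (GPU): none

Exception Handling for Speculative Instructions
    Restricted [6]: prohibited
    General [6]: ignored
    Boosting [36]: supported
    Deviant (GPU): absent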
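The scheduling restriction listed for the Deviant (GPU) model can be sketched as a filter over PTX instructions. This is an illustrative approximation, not the authors' implementation. The helper `may_speculate` is hypothetical, and mapping the restriction onto opcode text (`st` for stores, `tex` for texture fetches, the `.shared` and `.const` state-space qualifiers) is an assumption. It checks only the memory restriction, not data dependences or whether an instruction is a branch.

```python
def may_speculate(ptx_instruction):
    """Return True if the instruction passes the memory restriction.

    Excluded (assumed mapping to PTX text): all stores, texture fetches,
    and any operation on the shared or constant state space.
    """
    tokens = ptx_instruction.strip().split()
    # Skip a guard predicate such as "@%p2".
    if tokens and tokens[0].startswith("@"):
        tokens = tokens[1:]
    if not tokens:
        return False

    parts = tokens[0].rstrip(";").split(".")
    base, qualifiers = parts[0], parts[1:]
    if base == "st":      # all store instructions
        return False
    if base == "tex":     # texture memory operations
        return False
    if "shared" in qualifiers or "const" in qualifiers:
        return False      # shared and constant memory operations
    return True


# Instructions taken from the Before/After PTX listing.
moved = [
    "shl.b64 %rd16, %rd4, 6;",
    "mov.u64 %rd17, __param_0;",
    "add.s64 %rd18, %rd17, %rd16;",
    "mul.wide.s32 %rd19, %r2, 4;",
    "add.s64 %rd2, %rd18, %rd19;",
    "ld.global.f32 %f4, [%rd2];",
]
print(all(may_speculate(i) for i in moved))
print(may_speculate("ld.shared.f32 %f5, [%rd3];"))
```

Under this filter the six instructions that appear in BB0 in the After listing all qualify, while the shared-memory loads and the store in BB2 do not.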
GPU Programming Model
[Figure: timeline of CPU and GPU execution. Host code runs on the CPU and device code runs on the GPU, with cyclic communication between the CPU and the GPU. The device code executes as a grid of blocks: Block (0,0), Block (1,0), Block (0,1), Block (1,1). Block (0,1) expands into Thread (0,0), Thread (1,0), Thread (2,0), Thread (0,1), Thread (1,1), Thread (2,1).]
Characterizing the Grid, Blocks, and Threads
[Figure: a grid of gridDim.x by gridDim.y blocks. Each block is identified by (blockIdx.x, blockIdx.y) and contains blockDim.x by blockDim.y threads. Each thread is identified by (threadIdx.x, threadIdx.y).]
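The built-in variables above combine into a global thread position and a warp number. The sketch below mirrors that arithmetic on the host in plain Python rather than in a kernel. The helper names and the `Dim` tuple are hypothetical; the 3 by 2 block shape follows the six threads drawn in the programming-model figure.

```python
from collections import namedtuple

Dim = namedtuple("Dim", ["x", "y"])
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def global_thread_coords(block_idx, block_dim, thread_idx):
    """Position of a thread within the whole grid."""
    return Dim(
        block_idx.x * block_dim.x + thread_idx.x,
        block_idx.y * block_dim.y + thread_idx.y,
    )


def linear_thread_id(thread_idx, block_dim):
    """Row-major thread ID within its block: x + y * blockDim.x."""
    return thread_idx.y * block_dim.x + thread_idx.x


def warp_id(thread_idx, block_dim):
    """Warps are formed from consecutive linear thread IDs."""
    return linear_thread_id(thread_idx, block_dim) // WARP_SIZE


# Thread (2,1) of Block (0,1), with 3 x 2 threads per block.
coords = global_thread_coords(Dim(0, 1), Dim(3, 2), Dim(2, 1))
print(coords, linear_thread_id(Dim(2, 1), Dim(3, 2)))
```

Because warps are cut from the linear thread ID, a branch on `threadIdx.y` in a block narrower than 32 threads can still split a warp.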
Warp Divergence Examples
Assuming one block of 128 threads...
Example                        Divergence?
if (threadIdx.x < 32) { }      NO
if (threadIdx.x > 15) { }      YES
if (threadIdx.x > 65) { }      YES
if (blockIdx.x > 1) { }        NO
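The answers above can be checked by evaluating each condition for every thread of a 128-thread block and testing whether any 32-thread warp sees both outcomes. This is a host-side sketch in Python, not device code; the helper `diverges` is hypothetical, and the predicates take the thread index and block index as plain integers.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def diverges(predicate, block_threads=128, block_idx_x=0):
    """Return True if any warp in the block sees both branch outcomes.

    predicate(thread_idx_x, block_idx_x) stands in for the `if` condition.
    """
    for start in range(0, block_threads, WARP_SIZE):
        end = min(start + WARP_SIZE, block_threads)
        outcomes = {bool(predicate(t, block_idx_x)) for t in range(start, end)}
        if len(outcomes) > 1:
            return True
    return False


results = {
    "threadIdx.x < 32": diverges(lambda t, b: t < 32),
    "threadIdx.x > 15": diverges(lambda t, b: t > 15),
    "threadIdx.x > 65": diverges(lambda t, b: t > 65),
    "blockIdx.x > 1": diverges(lambda t, b: b > 1),
}
for condition, answer in results.items():
    print(condition, "YES" if answer else "NO")
```

A condition aligned to a multiple of 32, or one that depends only on the block index, gives every thread in a warp the same outcome, so no warp has to serialize two paths.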