7/29/2019 JoeStam__ConvolutionSoup
1/43
NVIDIA Confidential
Convolution Soup: A case study in CUDA optimization
Joe Stam, [email protected]
Optimization
GPUs are very fast
BUT: poor programming can lead to hideous performance
Squeezing out the most speed takes a bit of expertise
A convolution case study
We'll use the simple example of a 5x5 convolution to illustrate optimization strategies and their effects
Basic 5x5 convolution
8-bit monochrome data
Generalized non-separable case
No special border handling
Dynamic coefficients stored in constant memory
Benchmarks on a 2048 x 2048 image
GeForce 8800 GT (G92)
What to optimize?
GMEM Coalescing
GMEM Bandwidth
Occupancy: # of threads running on an SM
Limited by registers, SMEM, 8 blocks maximum, 768 threads maximum (1024 on GT200)
More threads running allows more latency hiding!
SMEM bank conflicts
LMEM usage
Compute instructions: inlining, __umul24() intrinsics, fast math
Last thing to optimize; compute power is evolving faster than memory bandwidth
Coalescing GMEM: often the most important optimization
A coordinated read by a half-warp (16 threads)
A contiguous region of global memory:
64 bytes - each thread reads a word: int, float, ...
128 bytes - each thread reads a double-word: int2, float2, ...
256 bytes - each thread reads a quad-word: int4, float4, ...
Additional restrictions:
Starting address for a region must be a multiple of the region size
The k-th thread in a half-warp must access the k-th element in a block being read
Exception: not all threads must be participating
Predicated access, divergence within a half-warp
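As a sketch (hypothetical kernel names, not from the deck), the rules above come down to whether thread k of a half-warp touches word k of an aligned segment:

```cuda
// Sketch: coalesced vs. uncoalesced global reads on G8x/G9x hardware.
// Assumes `in` was allocated with cudaMalloc, so it is segment-aligned.
__global__ void coalescedRead(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];       // thread k reads word k: one 64-byte transaction per half-warp
}

__global__ void stridedRead(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i * 2];   // stride-2: the half-warp spans 128 bytes and the
                          // k-th-thread/k-th-word rule is broken, so each
                          // read is issued separately on this hardware
}
```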
Coalesced Access: Reading floats (32-bit)
[Diagram: threads t0..t15 reading consecutive addresses 128-192; shown both with all threads participating and with some threads not participating]
Uncoalesced Access: Reading floats (32-bit)
[Diagram: permuted access by threads within the half-warp]
[Diagram: misaligned starting address (not a multiple of 64), threads t0..t15 starting at address 132]
No simple one-size-fits-all strategy. Must experiment!
Tools
Look at the .cubin to find register, SMEM, and LMEM usage
Use the --keep compiler option
PTX: GPU assembly
Rarely write it, but always look at it!
Use --keep to write it
To show interleaved source code:
--opencc-options -LIST:source=on
Visual Profiler
Occupancy Calculator Spreadsheet
Memory vs. Compute
Kernel time is frequently dominated by memory bandwidth.
With enough compute, memory latencies can be hidden by thread swapping.
Write kernels with compute commented out to examine the effects.
Example 1: Naïve GMEM
Warning: Do not try this at home!
__global__ void NaiveGlobalConvolutionKernel(unsigned char *img_in, unsigned char *img_out,
                                             unsigned int width, unsigned int height,
                                             unsigned int pitch, float scale)
{
    unsigned int X = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
    unsigned int Y = __umul24(blockIdx.y, blockDim.y) + threadIdx.y;

    if (X > 1 && X < width - 2 && Y > 1 && Y < height - 2)
    {
        int sum = 0;
        int kidx = 0;
        for (int i = -2; i <= 2; i++)        // (body reconstructed from context:
            for (int j = -2; j <= 2; j++)    //  5x5 loop over constant-memory
                                             //  coefficients, name hypothetical)
                sum += gpu_kernel[kidx++] * img_in[__umul24(Y + i, pitch) + X + j];
        img_out[__umul24(Y, pitch) + X] = (unsigned char)(sum * scale);
    }
}
Results
148 ms!
Nothing is coalesced
8-bit memory accesses are very inefficient
"You are the weakest link. Goodbye!"
Example 2: Simple Textures
Texture hardware provides cached access to GMEM; no worries about coalescing reads
Results: 7.8 ms
But:
0.600 ms additional time to copy the image into a cudaArray for textures
Still using 8-bit writes back to GMEM
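A minimal sketch of the idea, using the G92-era texture reference API (names hypothetical; the cudaArray copy and texture binding are assumed to happen on the host, and coefficients are elided):

```cuda
// Sketch: fetch 8-bit pixels through the texture cache instead of GMEM loads.
// tex2D needs no coalescing and handles out-of-range reads per the texture's
// addressing mode, so no explicit apron bounds test is required.
texture<unsigned char, 2, cudaReadModeElementType> texImg;

__global__ void TexConvolutionKernel(unsigned char *img_out, unsigned int width,
                                     unsigned int height, unsigned int pitch)
{
    unsigned int X = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int Y = blockIdx.y * blockDim.y + threadIdx.y;
    if (X >= width || Y >= height) return;

    int sum = 0;
    for (int i = -2; i <= 2; i++)
        for (int j = -2; j <= 2; j++)
            sum += tex2D(texImg, (float)(X + j), (float)(Y + i));  // coefficients elided
    img_out[Y * pitch + X] = (unsigned char)min(sum / 25, 255);    // 8-bit write: uncoalesced
}
```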
Example 2a: Textures
Using 32-bit writes improves bandwidth & coalesces
Results: 6.3 ms, ~25% faster
Example 3: Texture into SMEM
Use textures to fetch memory, avoiding coalescing issues
Store the tile in SMEM so each pixel is read only once
Results:
3.3 ms!
Memory read/write only: 1.1 ms
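A sketch of this staging pattern (hypothetical names; apron loads and coefficients elided for brevity, 16x16 block assumed):

```cuda
// Sketch: stage a 16x16 tile plus its 2-pixel apron in SMEM via texture
// fetches, sync, then convolve entirely out of SMEM.
#define TILE  16
#define APRON 2
texture<unsigned char, 2, cudaReadModeElementType> texImg;

__global__ void TexToSmemConvolution(unsigned char *img_out, unsigned int pitch)
{
    __shared__ unsigned char tile[TILE + 2 * APRON][TILE + 2 * APRON];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    // Each thread fetches its own pixel; threads near the block edge would
    // also fetch apron pixels (those extra loads are elided here).
    tile[threadIdx.y + APRON][threadIdx.x + APRON] = tex2D(texImg, (float)x, (float)y);
    __syncthreads();

    int sum = 0;
    for (int i = 0; i < 5; i++)
        for (int j = 0; j < 5; j++)
            sum += tile[threadIdx.y + i][threadIdx.x + j];  // coefficients elided
    img_out[y * pitch + x] = (unsigned char)min(sum / 25, 255);
}
```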
Example 3: Continued
Unfortunately, textures use a lot of registers (25 in this kernel). This reduces occupancy.
A 16x16 block size limits us to 1 block / SM (on G92), thus the entire block is stalled during the __syncthreads()
A 16x8 block allows 2 blocks / SM
Surprisingly little performance improvement (3.2 ms)
A 16x20 block maximizes threads
Also little performance improvement (3.2 ms)
Memory bandwidth goes up slightly because of fewer reads from kernel apron overlap
Example 4: Texture with floats
Use texture hardware to promote the image to float.
Uses 4x the SMEM, but requires no data type conversion until the write
Results: 6.5 ms. Oops, bad idea
May be useful for cases where f32 compute is needed
Example 5: GMEM to SMEM
4 pixels / thread
Try a naïve implementation first: shift reads left & up by the apron amount, then read an additional 4-pixel strip to the right.
All loads are uncoalesced!
Results:
3.6 ms
Memory only: 1.6 ms
Slightly worse than using textures
Example 6: GMEM to SMEM, Strict Coalescing
Process 4 pixels / thread for 32-bit reads
Read an image tile plus the apron into SMEM
For a 16x16 block size, read 72x16 pixels into SMEM
[Diagram: 72-pixel-wide by 18-row SMEM region, apron pixels surrounding the image tile]
Convolution SMEM Reads
Step 1: All threads read the center top portion into memory
Step 2: Threads with threadIdx.y < 2 read the bottom two rows
Step 3: Threads with threadIdx.x == 15 read the left-top apron pixels
Step 5: Threads with threadIdx.x == 0 read the top-right apron pixels
Step 6: Threads with threadIdx.x == 0 && threadIdx.y < 2 read the bottom-right apron pixels
Example 6 Cont.
Results (16x16 block, 2 blocks / SM):
3.4 ms
Memory only: 1.2 ms
Note: Texture is still slightly better, even with all this work!
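The core of the read pattern can be sketched as follows (hypothetical names; only Step 1 shown, the apron steps are elided). Because each thread loads one 32-bit word of 4 packed pixels, a half-warp reads one contiguous, aligned 64-byte segment:

```cuda
// Sketch: strictly coalesced tile load for the 72x16-pixel SMEM region
// (18 words wide at 4 packed 8-bit pixels per word). Assumes a 16x16 block
// and an image pitch measured in 32-bit words.
#define SMEM_WORDS_X 18   // 72 pixels / 4 pixels per word

__global__ void CoalescedTileLoad(const unsigned int *img_in, unsigned int word_pitch)
{
    __shared__ unsigned int tile[16][SMEM_WORDS_X];

    int wordX = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's 32-bit word
    int y     = blockIdx.y * blockDim.y + threadIdx.y;

    // Step 1: all threads read the central 16 words of their row, coalesced.
    // Columns 0 and 17 (the left/right apron words) are filled by edge
    // threads in the later steps, which are elided here.
    tile[threadIdx.y][threadIdx.x + 1] = img_in[y * word_pitch + wordX];
    __syncthreads();
    // ... convolution out of `tile` ...
}
```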
Example 6: Effect of Block Size
1200 bytes of SMEM per block
11 registers
16x16 = 2 blocks / SM
16x8 = 5 blocks / SM
16x8 results: 3.5 ms; no benefit (probably because of increased overlap)
16x20 results: 3.4 ms; again no real benefit
Example 6: A Couple of Pathological Cases
A non-multiple-of-16 block width results in non-coalesced access and wasted threads
19x13 block results:
5.1 ms: a 50% slowdown
2.4 ms memory only
Change the image pitch to break coalescing
Unaligned pitch results:
4.3 ms
2.4 ms memory only
Example 7: 128-bit Read/Write to GMEM
Reading data in 128-bit words is faster than 32-bit words (64-bit is also good)
Same amount of data, but fewer transactions to the memory controller
Read data as int4s, cast to char, process 16 bytes / thread
Results:
4.8 ms
0.4 ms memory only
What happened? Memory is way faster, but compute is SLOW
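The read side can be sketched like this (hypothetical kernel name): one 128-bit transaction delivers 16 packed pixels per thread, a 4x reduction in memory transactions versus 32-bit loads.

```cuda
// Sketch: 128-bit loads of packed 8-bit pixels. `word4_pitch` is the image
// pitch measured in int4 (16-byte) units.
__global__ void Load128(const int4 *img_in, unsigned int word4_pitch)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    int4 v = img_in[y * word4_pitch + x];                 // single 128-bit transaction
    const unsigned char *p = (const unsigned char *)&v;   // 16 packed pixels
    // ... each thread now processes p[0] .. p[15] ...
    (void)p;
}
```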
The Answer:
SMEM bank conflicts cause warp serialization
Banked SMEM Architecture
Many threads accessing memory
Therefore, memory is divided into banks
Essential to achieve high bandwidth
Each bank can service one address per cycle
A memory can service as many simultaneous accesses as it has banks
Multiple simultaneous accesses to a bank result in a bank conflict
Conflicting accesses are serialized
[Diagram: SMEM divided into 16 banks, Bank 0 through Bank 15]
Bank Addressing Examples
No bank conflicts: linear addressing, stride == 1
No bank conflicts: random 1:1 permutation
[Diagram: threads 0-15 mapping one-to-one onto banks 0-15]
Bank Addressing Examples
2-way bank conflicts: linear addressing, stride == 2
8-way bank conflicts: linear addressing, stride == 8
[Diagram: threads 0-15 mapping onto banks with 2-way and 8-way (x8) conflicts]
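These stride cases can be written out directly. On G8x/G9x there are 16 banks and successive 32-bit words live in successive banks, so the bank is simply the word index mod 16 (hypothetical kernel, reads kept live only for illustration):

```cuda
// Sketch: bank = wordIndex % 16 on G8x/G9x (16 banks, 32-bit wide).
__global__ void strideExamples(float *out)
{
    __shared__ float s[256];
    int t = threadIdx.x;        // consider t = 0..15, one half-warp
    s[t] = (float)t;
    __syncthreads();

    float a = s[t];             // stride 1: banks 0..15, conflict-free
    float b = s[t * 2];         // stride 2: banks {0,2,...,14} each hit twice -> 2-way conflict
    float c = s[t * 8];         // stride 8: banks {0,8} each hit 8 times -> 8-way conflict
    out[t] = a + b + c;
}
```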
Example 8: 128-bit, Resolve Bank Conflicts
Have each thread process every 4th 32-bit word
Intermediate results are stored in SMEM
Need to shrink the block size, since this uses more SMEM
Results:
2.9 ms!
Memory only: 0.4 ms
Example 10: 128-bit, Unroll Inner Loop
Mostly memory focused until now
All code used fast math where possible (e.g. __umul24)
For the 2-D convolution, the compiler only unrolled the inner loop
(#pragma unroll didn't work for the outer loop)
Results:
2.5 ms
0.4 ms memory only
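A sketch of what unrolling looks like for a fixed 5x5 kernel (hypothetical helper and coefficient-array names; trip counts are compile-time constants, so writing the inner loop out removes its index arithmetic and branch overhead):

```cuda
// Sketch: 5x5 convolution with the inner loop unrolled by hand and
// coefficients assumed to live in constant memory.
__constant__ float coeff[25];

__device__ float convolve5x5(const unsigned char tile[20][20], int tx, int ty)
{
    float sum = 0.0f;
    #pragma unroll
    for (int j = 0; j < 5; j++)
    {
        // inner loop written out explicitly
        sum += tile[ty + j][tx + 0] * coeff[j * 5 + 0];
        sum += tile[ty + j][tx + 1] * coeff[j * 5 + 1];
        sum += tile[ty + j][tx + 2] * coeff[j * 5 + 2];
        sum += tile[ty + j][tx + 3] * coeff[j * 5 + 3];
        sum += tile[ty + j][tx + 4] * coeff[j * 5 + 4];
    }
    return sum;
}
```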
Summary
Convolution Approaches Comparison (G92, 2048x2048 image; times collected from the preceding slides, memory-only figures from the compute-commented-out kernels):

Algorithm                                               Time (ms)   Memory only (ms)
Naive global convolution                                148         -
Simple texture, 1-byte convolution                      7.8         -
Simple texture, 4-byte convolution                      6.3         -
Texture-to-SMEM, 16x16 block                            3.3         1.1
Texture-to-SMEM, 16x8 block (2 blocks/SM)               3.2         -
Texture-to-SMEM, 16x20 block                            3.2         -
Texture-to-SMEM, float                                  6.5         -
Global-to-SMEM (naive shifted reads)                    3.6         1.6
Global-to-SMEM, strict coalescing, 16x16 block          3.4         1.2
Global-to-SMEM, strict coalescing, 16x8 block           3.5         -
Global-to-SMEM, strict coalescing, 16x20 block          3.4         -
Global-to-SMEM, broken coalescing, 19x13 block          5.1         2.4
Global-to-SMEM, broken coalescing (wrong pitch)         4.3         2.4
Global-to-SMEM, 128-bit r/w                             4.8         0.4
Global-to-SMEM, 128-bit r/w, bank optimized             2.9         0.4
Global-to-SMEM, 128-bit r/w, bank optimized, unrolled   2.5         0.4
GT200 Optimization
Coalescing buffers greatly improve the performance, especially in non-optimal situations
[Chart: Convolution Approaches Comparison, GT200; time (ms), 0-10 scale, for the same algorithms (Naive Global, Simple Texture, Texture-to-SMEM, Texture-to-SMEM float, Global-to-SMEM variants, Global-to-SMEM 128-bit); legend: Compute + Memory vs. Memory Only]
GT200 Observations
The hideous case isn't so bad: 2.5x slower than the best case, vs. 60x slower on G92
128-bit is still best
A complex coalesced pattern to fill SMEM is actually slightly slower than just a simple shift
A few more comparisons
[Chart: Best & Worst Case Algorithms; time (ms), log scale 1-1000; hideous case vs. best case on G86, G92, and GT200]
32 vs. 128-bit Access
Note: using the best 32-bit method
[Chart: 32 vs. 128-bit processing; time (ms), 0-6 scale; 32-bit, 32-bit memory-only, 128-bit, and 128-bit memory-only, on G92 and GT200]
Broken Coalescing
[Chart: Effect of Wrong Pitch; time (ms), 0-5 scale; correct coalescing vs. wrong pitch, full kernel and memory-only, on G92 and GT200]