
Convolution Soup
A case study in CUDA optimization
Joe Stam
[email protected]


    Optimization

    GPUs are very fast

BUT poor programming can lead to hideous performance

Squeezing out the most speed takes a bit of expertise


    A convolution case study

We'll use the simple example of a 5x5 convolution to illustrate optimization strategies and their effects.

Basic 5x5 convolution
8-bit data, monochrome
Generalized non-separable case
No special border handling
Dynamic coefficients use constant memory (see the sketch below)
Benchmarks on a 2048 x 2048 image
GeForce 8800 GT (G92)
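As a concrete sketch of the constant-memory setup (the array name, its int element type, and the host-side upload are assumptions for illustration, not from the slides):

    #include <cuda_runtime.h>

    __constant__ int d_coeffs[25];   // 5x5 dynamic coefficients (assumed int,
                                     // scaled by the float 'scale' at the end)

    void setConvolutionCoefficients(const int *h_coeffs)
    {
        // Upload the 25 coefficients before launching the kernel; constant
        // memory is cached and broadcasts efficiently when all threads read
        // the same element, as every thread does here.
        cudaMemcpyToSymbol(d_coeffs, h_coeffs, 25 * sizeof(int));
    }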


What to optimize?

GMEM Coalescing
GMEM Bandwidth
Occupancy: # of threads running on an SM
Limited by registers, SMEM, 8 blocks maximum, 768 threads maximum (1024 on GT200)
More threads running allows more latency hiding!
SMEM Bank Conflicts
LMEM usage
Compute instructions: inlining, __umul24() intrinsics, fast math
Last thing to optimize; compute power is evolving faster than memory bandwidth


Coalescing GMEM

Often the most important optimization

A coordinated read by a half-warp (16 threads)
A contiguous region of global memory:
64 bytes - each thread reads a word: int, float, ...
128 bytes - each thread reads a double-word: int2, float2, ...
256 bytes - each thread reads a quad-word: int4, float4, ...

Additional restrictions:
Starting address for a region must be a multiple of the region size
The kth thread in a half-warp must access the kth element in a block being read
Exception: not all threads must be participating
Predicated access, divergence within a half-warp
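A minimal sketch of what these rules ask for (names illustrative): consecutive threads touching consecutive 32-bit words of an aligned region.

    __global__ void coalescedCopy(const float *in, float *out, int n)
    {
        // Thread k of each half-warp reads the kth word of a 64-byte
        // aligned segment: one coalesced transaction per half-warp.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                  // predicated tail: non-participating
            out[i] = in[i];         // threads do not break coalescing
    }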


Coalesced Access: Reading floats (32-bit)

[Figure: threads t0..t15 reading consecutive 32-bit words at addresses 128-188; access coalesces both when all threads participate and when some threads do not participate]


Uncoalesced Access: Reading floats (32-bit)

[Figure: two broken patterns: permuted access by threads within a half-warp, and a misaligned starting address (not a multiple of 64)]


No simple one-size-fits-all strategy. Must experiment!


    Tools

Look at the .cubin to find register, SMEM, and LMEM usage
Use the -keep compiler option


PTX: GPU assembly

Rarely write it, but always look at it!
Use -keep to write it
To show interleaved source code:
--opencc-options -LIST:source=on


    Visual Profiler


    Occupancy Calculator Spreadsheet


    Memory vs. Compute

Kernel time is frequently dominated by memory bandwidth.

With enough compute, memory latencies can be hidden by thread swapping.

Write kernels with the compute commented out to examine the effects; a sketch of such a memory-only variant follows.
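For instance, a hedged sketch of a memory-only variant (hypothetical names), timed against the full kernel to separate memory time from compute time:

    __global__ void convMemoryOnly(const unsigned char *img_in, unsigned char *img_out,
                                   unsigned int width, unsigned int height,
                                   unsigned int pitch)
    {
        // Same loads and stores as the real kernel, with the convolution
        // arithmetic stubbed out: the measured time approximates pure
        // memory traffic.
        unsigned int X = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
        unsigned int Y = __umul24(blockIdx.y, blockDim.y) + threadIdx.y;
        if (X < width && Y < height)
            img_out[__umul24(Y, pitch) + X] = img_in[__umul24(Y, pitch) + X];
    }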


Example 1: Naïve GMEM

Warning: Do not try this at home!

    __global__ void NaiveGlobalConvolutionKernel(unsigned char *img_in, unsigned char *img_out,
                                                 unsigned int width, unsigned int height,
                                                 unsigned int pitch, float scale)
    {
        unsigned int X = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
        unsigned int Y = __umul24(blockIdx.y, blockDim.y) + threadIdx.y;

        if (X > 1 && X < width - 2 && Y > 1 && Y < height - 2)
        {
            int sum = 0;
            int kidx = 0;
            for (int i = -2; i <= 2; i++)     // the listing is cut off here in the
            {                                 // source; the loop body below is a
                for (int j = -2; j <= 2; j++) // reconstruction of the obvious 5x5
                {                             // accumulation (coefficient array
                                              // name assumed, per the earlier sketch)
                    sum += d_coeffs[kidx++] * img_in[__umul24(Y + i, pitch) + X + j];
                }
            }
            img_out[__umul24(Y, pitch) + X] = (unsigned char)(sum * scale);
        }
    }


Results

148 ms!
Nothing is coalesced
8-bit memory accesses are very inefficient

You are the weakest link. Goodbye!


    Example 2: Simple Textures

Texture hardware provides cached access to GMEM, no worries about coalescing reads

Results: 7.8 ms
But:
0.600 ms additional time to copy the image into a cudaArray for textures
Still using 8-bit writes back to GMEM
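A sketch of this texture path using the legacy texture-reference API of that hardware generation (texture and kernel names are illustrative; d_coeffs is the constant-memory array sketched earlier, and texImg is bound to the cudaArray on the host with cudaBindTextureToArray):

    texture<unsigned char, 2, cudaReadModeElementType> texImg;

    __global__ void SimpleTextureConvolution(unsigned char *img_out,
                                             unsigned int width, unsigned int height,
                                             unsigned int pitch, float scale)
    {
        unsigned int X = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
        unsigned int Y = __umul24(blockIdx.y, blockDim.y) + threadIdx.y;
        if (X < width && Y < height)
        {
            int sum = 0, kidx = 0;
            for (int i = -2; i <= 2; i++)
                for (int j = -2; j <= 2; j++)
                    // tex2D goes through the texture cache: no coalescing
                    // rules to satisfy, and out-of-range reads are clamped.
                    sum += d_coeffs[kidx++] * tex2D(texImg, (float)(X + j), (float)(Y + i));
            img_out[__umul24(Y, pitch) + X] = (unsigned char)(sum * scale);  // 8-bit write
        }
    }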


    Example 2a: Textures

Using 32-bit writes improves bandwidth and coalesces

Results: 6.3 ms, ~25% faster
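A sketch of the 32-bit write (illustrative, building on the texture sketch above): each thread computes four adjacent output pixels and stores them as one uchar4.

    __device__ unsigned char convolvePixel(int x, int y, float scale)
    {
        // Per-pixel 5x5 sum, as in the previous kernels.
        int sum = 0, kidx = 0;
        for (int i = -2; i <= 2; i++)
            for (int j = -2; j <= 2; j++)
                sum += d_coeffs[kidx++] * tex2D(texImg, (float)(x + j), (float)(y + i));
        return (unsigned char)(sum * scale);
    }

    __global__ void TextureConvolution32bitWrite(uchar4 *img_out, unsigned int width4,
                                                 unsigned int height, unsigned int pitch4,
                                                 float scale)
    {
        unsigned int X = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;  // in uchar4 units
        unsigned int Y = __umul24(blockIdx.y, blockDim.y) + threadIdx.y;
        if (X < width4 && Y < height)
        {
            uchar4 r;
            r.x = convolvePixel(4 * X + 0, Y, scale);
            r.y = convolvePixel(4 * X + 1, Y, scale);
            r.z = convolvePixel(4 * X + 2, Y, scale);
            r.w = convolvePixel(4 * X + 3, Y, scale);
            img_out[__umul24(Y, pitch4) + X] = r;   // one coalesced 32-bit store
        }
    }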


    Example 3: Texture into SMEM

Use textures to fetch memory to avoid coalescing issues
Store the tile in SMEM so all pixels are read only once

Results:
3.3 ms!
Memory read/write only: 1.1 ms
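A hedged sketch of the scheme (sizes for a 16x16 block and a 2-pixel apron; names illustrative): the texture unit fills the tile without coalescing concerns, then the convolution runs entirely out of SMEM.

    #define RADIUS 2    // 5x5 kernel -> 2-pixel apron

    __global__ void TexToSmemConvolution(unsigned char *img_out, unsigned int width,
                                         unsigned int height, unsigned int pitch,
                                         float scale)
    {
        __shared__ unsigned char tile[16 + 2 * RADIUS][16 + 2 * RADIUS];

        int X = __umul24(blockIdx.x, 16) + threadIdx.x;
        int Y = __umul24(blockIdx.y, 16) + threadIdx.y;

        // Cooperatively fetch the tile plus apron through the texture unit;
        // each source pixel is read only once per block.
        for (int ty = threadIdx.y; ty < 16 + 2 * RADIUS; ty += 16)
            for (int tx = threadIdx.x; tx < 16 + 2 * RADIUS; tx += 16)
                tile[ty][tx] = tex2D(texImg,
                                     (float)(__umul24(blockIdx.x, 16) + tx - RADIUS),
                                     (float)(__umul24(blockIdx.y, 16) + ty - RADIUS));
        __syncthreads();

        if (X < width && Y < height)
        {
            int sum = 0, kidx = 0;
            for (int i = -RADIUS; i <= RADIUS; i++)
                for (int j = -RADIUS; j <= RADIUS; j++)
                    sum += d_coeffs[kidx++]
                         * tile[threadIdx.y + RADIUS + i][threadIdx.x + RADIUS + j];
            img_out[__umul24(Y, pitch) + X] = (unsigned char)(sum * scale);
        }
    }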


Example 3: Continued

Unfortunately, textures use a lot of registers (25 in this kernel). This reduces occupancy.

16x16 block size limits us to 1 block / SM (on G92), thus the entire block is stalled during the __syncthreads()
16x8 block allows 2 blocks / SM; surprisingly little performance improvement (3.2 ms)
16x20 block maximizes threads; also little performance improvement (3.2 ms)
Memory bandwidth goes up slightly because of fewer reads from kernel apron overlap


Example 4: Texture with floats

Use texture hardware to promote the image to float.
Uses 4x the SMEM, but requires no data type conversion until the write

Results: 6.5 ms. Oops... bad idea
May be useful for cases where f32 compute is needed


Example 5: GMEM to SMEM

4 pixels / thread
Try the naïve implementation first: shift reads left & up by the apron amount, then read an additional 4-pixel strip to the right.
All loads are uncoalesced!

Results:
3.6 ms
Memory only: 1.6 ms
Slightly worse than using textures


Example 6: GMEM to SMEM, Strict Coalescing

Process 4 pixels / thread for 32-bit reads
Read an image tile plus the apron into SMEM
For a 16x16 block size, read 72x16 pixels into SMEM

[Figure: SMEM region roughly 72 pixels wide (18 rows shown): central image tile plus surrounding apron pixels]


    Convolution SMEM Reads

    Step 1: All threads read center top portion into memory


    Convolution SMEM Reads

Step 2: Threads with threadIdx.y < 2 read the bottom two rows


    Convolution SMEM Reads

Step 3: Threads with threadIdx.x == 15 read the left-top apron pixels


    Convolution SMEM Reads

Step 5: Threads with threadIdx.x == 0 read the top-right apron pixels


    Convolution SMEM Reads

Step 6: Threads with threadIdx.x == 0 && threadIdx.y < 2 read the bottom-right apron pixels
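A hedged reconstruction of this load sequence (geometry per the 72-pixel-wide tile figure; interior blocks assumed, with no border handling, matching the deck). The trick behind the edge steps: the word one to the left of the block's 16-word segment is element 15 of the previous aligned segment, so the strict "kth thread reads the kth element" rule means threadIdx.x == 15 must fetch it; symmetrically, the word to the right is element 0, fetched by threadIdx.x == 0.

    __global__ void StrictCoalescedLoad(const unsigned int *img32, unsigned int *out32,
                                        unsigned int pitch32)   // pitch in 32-bit words
    {
        __shared__ unsigned int tile[18][18];   // 72x18 pixels held as 32-bit words

        int tx = threadIdx.x, ty = threadIdx.y;          // 16x16 block
        int row0 = __umul24(blockIdx.y, 16) - 2;         // top apron row
        int colC = __umul24(blockIdx.x, 16) + tx;        // this thread's center word

        // Step 1: all threads read the center/top 16x16-word region.
        tile[ty][tx + 1] = img32[__umul24(row0 + ty, pitch32) + colC];
        // Step 2: threads with ty < 2 read the bottom two rows.
        if (ty < 2)
            tile[16 + ty][tx + 1] = img32[__umul24(row0 + 16 + ty, pitch32) + colC];
        // Threads with tx == 15 fetch the left apron column.
        if (tx == 15) {
            tile[ty][0] = img32[__umul24(row0 + ty, pitch32) + colC - 16];
            if (ty < 2)
                tile[16 + ty][0] = img32[__umul24(row0 + 16 + ty, pitch32) + colC - 16];
        }
        // Threads with tx == 0 fetch the right apron column.
        if (tx == 0) {
            tile[ty][17] = img32[__umul24(row0 + ty, pitch32) + colC + 16];
            if (ty < 2)
                tile[16 + ty][17] = img32[__umul24(row0 + 16 + ty, pitch32) + colC + 16];
        }
        __syncthreads();

        // The 5x5 accumulation over 'tile' would go here, as in the earlier
        // kernels; this copy-out simply keeps the loads observable.
        out32[__umul24(row0 + 2 + ty, pitch32) + colC] = tile[ty + 2][tx + 1];
    }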


    Example 6 Cont.

Results (16x16 block, 2 blocks / SM):
3.4 ms
Memory only: 1.2 ms

Note: Texture is still slightly better, even with all this work!


Example 6: Effect of Block Size

1200 bytes of SMEM per block
11 registers
16x16 = 2 blocks / SM
16x8 = 5 blocks / SM

16x8 results: 3.5 ms; no benefit (probably because of increased overlap)
16x20 results: 3.4 ms; again no real benefit


Example 6: A Couple of Pathological Cases

A non-multiple-of-16 block width results in non-coalesced access and wasted threads
19x13 block results:
5.1 ms (50% slower)
2.4 ms memory only

Change the image pitch to break coalescing
Unaligned pitch results:
4.3 ms
2.4 ms memory only


    Example 7: 128-bit Read/Write to GMEM

Reading data in 128-bit words is faster than 32-bit words (64-bit is also good)
Same amount of data, but fewer transactions to the memory controller
Read data as int4s, cast to char, process 16 bytes / thread

Results:
4.8 ms
0.4 ms memory only

What happened? Memory is way faster, but compute is SLOW
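A sketch of the 128-bit access pattern (illustrative copy-through; the real kernel does the 5x5 work in between):

    __global__ void Process128(const int4 *img_in, int4 *img_out,
                               unsigned int width16, unsigned int height,
                               unsigned int pitch16)   // pitch in int4 units
    {
        unsigned int X = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
        unsigned int Y = __umul24(blockIdx.y, blockDim.y) + threadIdx.y;
        if (X < width16 && Y < height)
        {
            int4 v = img_in[__umul24(Y, pitch16) + X];   // one 128-bit load
            unsigned char *p = (unsigned char *)&v;      // view as 16 pixel bytes
            for (int b = 0; b < 16; b++)                 // per-pixel work happens
                p[b] = (unsigned char)(255 - p[b]);      // here (placeholder invert)
            img_out[__umul24(Y, pitch16) + X] = v;       // one 128-bit store
        }
    }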


The Answer:

SMEM bank conflicts cause warp serialization


Banked SMEM Architecture

Many threads accessing memory
Therefore, memory is divided into banks
Essential to achieve high bandwidth

Each bank can service one address per cycle
A memory can service as many simultaneous accesses as it has banks

Multiple simultaneous accesses to a bank result in a bank conflict
Conflicting accesses are serialized

[Figure: SMEM banks 0 through 15]


Bank Addressing Examples

No bank conflicts: linear addressing, stride == 1
No bank conflicts: random 1:1 permutation

[Figure: threads 0..15 each mapping to a distinct bank 0..15]


Bank Addressing Examples

2-way bank conflicts: linear addressing, stride == 2
8-way bank conflicts: linear addressing, stride == 8

[Figure: stride-2 access maps pairs of threads to the same bank; stride-8 access maps eight threads (x8) onto each of banks 0 and 8]
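The strides above, written out as a sketch against the 16-bank SMEM of this generation (bank = word index mod 16):

    __global__ void bankPatterns(float *out)
    {
        __shared__ float smem[256];
        int tid = threadIdx.x;                  // one 256-thread block assumed
        smem[tid] = (float)tid;                 // stride-1 fill: conflict-free
        __syncthreads();

        float a = smem[tid];                    // stride 1: banks 0..15, no conflict
        float b = smem[(2 * tid) % 256];        // stride 2: 2-way conflict, 2x serialized
        float c = smem[(8 * tid) % 256];        // stride 8: 8-way conflict, 8x serialized
        out[tid] = a + b + c;                   // keep the reads live
    }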


Example 8: 128-bit, Resolve Bank Conflicts

Have each thread process every 4th 32-bit word
Intermediate results are stored in SMEM
Need to shrink the block size, since this uses more SMEM

Results:
2.9 ms!
Memory only: 0.4 ms
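An illustrative sketch of that strategy (not the deck's actual kernel; block size and indexing are assumptions): the four words of each thread's int4 go to SMEM, and the processing pass strides by the thread count so a half-warp's 16 simultaneous accesses land in 16 distinct banks.

    __global__ void bankFriendly128(const int4 *in, int4 *out)  // element count assumed
    {                                                           // a multiple of 64
        __shared__ unsigned int words[4 * 64];    // 64 threads x 4 words each

        int tid = threadIdx.x;                    // 64-thread block assumed
        int4 v = in[__umul24(blockIdx.x, 64) + tid];   // one 128-bit read
        words[4 * tid + 0] = (unsigned int)v.x;   // scatter (this store still
        words[4 * tid + 1] = (unsigned int)v.y;   // conflicts, but runs once;
        words[4 * tid + 2] = (unsigned int)v.z;   // the hot loop below does not)
        words[4 * tid + 3] = (unsigned int)v.w;
        __syncthreads();

        // Conflict-free processing: thread t works on words t, t+64, t+128,
        // t+192, so the 16 simultaneous accesses of a half-warp fall in 16
        // distinct banks (one way to realize the strided scheme).
        for (int k = 0; k < 4; k++)
            words[tid + 64 * k] = ~words[tid + 64 * k];   // placeholder byte-wise invert
        __syncthreads();

        int4 r;                                   // repack and write 128 bits
        r.x = (int)words[4 * tid + 0];  r.y = (int)words[4 * tid + 1];
        r.z = (int)words[4 * tid + 2];  r.w = (int)words[4 * tid + 3];
        out[__umul24(blockIdx.x, 64) + tid] = r;
    }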


Example 10: 128-bit, Unroll Inner Loop

Mostly memory focused until now
All code used fast math where possible (e.g. __umul24)
For the 2-D convolution, the compiler only unrolled the inner loop
(#pragma unroll didn't work for the outer loop)

Results:
2.5 ms
0.4 ms memory only
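If the outer loop must be unrolled despite the pragma not taking, one hedged option is to write it out by hand (helper names illustrative; d_coeffs and an SMEM tile as in the earlier sketches):

    __device__ int convolveRow5(const unsigned char *row, const int *c)
    {
        // One fully unrolled 5-tap row of the 5x5 kernel.
        return c[0] * row[0] + c[1] * row[1] + c[2] * row[2]
             + c[3] * row[3] + c[4] * row[4];
    }

    __device__ int convolve5x5(const unsigned char *tile, int rowPitch)
    {
        // Outer loop unrolled by hand: five explicit row sums.
        return convolveRow5(tile + 0 * rowPitch, &d_coeffs[ 0])
             + convolveRow5(tile + 1 * rowPitch, &d_coeffs[ 5])
             + convolveRow5(tile + 2 * rowPitch, &d_coeffs[10])
             + convolveRow5(tile + 3 * rowPitch, &d_coeffs[15])
             + convolveRow5(tile + 4 * rowPitch, &d_coeffs[20]);
    }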


    Summary

Convolution Approaches Comparison

[Figure: bar chart, time (ms, 0-10) per algorithm, with Compute and Memory Only series: Naive Global Convolution; Simple Texture 1-byte Convolution; Simple Texture 4-byte Convolution; Texture-to-SMEM 16x16 block; Texture-to-SMEM 16x8 (2 blocks/SM) block; Texture-to-SMEM 16x20 block; Texture-to-SMEM float; Global-to-SMEM float; Global-to-SMEM Strict Coalesced 16x16 block; Global-to-SMEM Strict Coalesced 16x8 block; Global-to-SMEM Strict Coalesced 16x20 block; Global-to-SMEM Break Coalescing 19x13 block; Global-to-SMEM Break Coalescing with wrong pitch; Global-to-SMEM 128-bit r/w; Global-to-SMEM 128-bit r/w bank-optimized; Global-to-SMEM 128-bit r/w bank-optimized unrolled]


    GT200 optimization

Coalescing buffers greatly improve performance, especially in non-optimal situations

[Figure: Convolution Approaches Comparison, GT200: bar chart, time (ms, 0-10), with Compute+Memory and Memory Only series, for Naive Global, Simple Texture 4-byte, Texture-to-SMEM, Texture-to-SMEM float, the Global-to-SMEM variants, and Global-to-SMEM 128-bit]


GT200 Observations

The hideous case isn't so bad: 2.5x slower than the best case, vs. 60x slower on G92
128-bit is still best
The complex coalesced pattern to fill SMEM is actually slightly slower than just a simple shift


    A few more comparisons

Best & Worst Case Algorithms

[Figure: bar chart, time (ms, log scale, 1-1000), hideous case vs. best case on G86, G92, and GT200]


    32 vs. 128-bit Access

*Note: Using the best 32-bit method

[Figure: 32 vs. 128-bit processing: bar chart, time (ms, 0-6), for 32-bit, 32-bit memory-only, 128-bit, and 128-bit memory-only on G92 and GT200]


    Broken Coalescing

[Figure: Effect of Wrong Pitch: bar chart, time (ms, 0-5), for correct coalescing, correct coalescing memory-only, wrong pitch, and wrong pitch memory-only, on G92 and GT200]