
Convolution Soup
A case study in CUDA optimization
Joe Stam
[email protected]


    Optimization

    GPUs are very fast

BUT poor programming can lead to hideous performance

Squeezing out the most speed takes a bit of expertise


    A convolution case study

We'll use the simple example of a 5x5 convolution to illustrate optimization strategies and their effects.

Basic 5x5 convolution
8-bit data, monochrome
Generalized non-separable case
No special border handling
Dynamic coefficients use constant memory (see the sketch below)
Benchmarks on a 2048 x 2048 image
GeForce 8800 GT (G92)
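As a concrete sketch of the constant-memory setup (the array name, its int element type, and the host-side upload are assumptions for illustration, not from the slides):

    #include <cuda_runtime.h>

    __constant__ int d_coeffs[25];   // 5x5 dynamic coefficients (assumed int,
                                     // scaled by the float 'scale' at the end)

    void setConvolutionCoefficients(const int *h_coeffs)
    {
        // Upload the 25 coefficients before launching the kernel; constant
        // memory is cached and broadcasts efficiently when all threads read
        // the same element, as every thread does here.
        cudaMemcpyToSymbol(d_coeffs, h_coeffs, 25 * sizeof(int));
    }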


What to optimize?

GMEM Coalescing
GMEM Bandwidth
Occupancy: # of threads running on an SM
Limited by registers, SMEM, 8 blocks maximum, 768 threads maximum (1024 on GT200)
More threads running allows more latency hiding!
SMEM Bank Conflicts
LMEM usage
Compute instructions: inlining, __umul24() intrinsics, fast math
Last thing to optimize; compute power is evolving faster than memory bandwidth


Coalescing GMEM

Often the most important optimization

A coordinated read by a half-warp (16 threads)
A contiguous region of global memory:
64 bytes - each thread reads a word: int, float, ...
128 bytes - each thread reads a double-word: int2, float2, ...
256 bytes - each thread reads a quad-word: int4, float4, ...

Additional restrictions:
Starting address for a region must be a multiple of the region size
The kth thread in a half-warp must access the kth element in a block being read
Exception: not all threads must be participating
Predicated access, divergence within a half-warp
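A minimal sketch of what these rules ask for (names illustrative): consecutive threads touching consecutive 32-bit words of an aligned region.

    __global__ void coalescedCopy(const float *in, float *out, int n)
    {
        // Thread k of each half-warp reads the kth word of a 64-byte
        // aligned segment: one coalesced transaction per half-warp.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                  // predicated tail: non-participating
            out[i] = in[i];         // threads do not break coalescing
    }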


Coalesced Access: Reading floats (32-bit)

[Figure: threads t0..t15 reading consecutive 32-bit words at addresses 128-188; access coalesces both when all threads participate and when some threads do not participate]


Uncoalesced Access: Reading floats (32-bit)

[Figure: two broken patterns: permuted access by threads within a half-warp, and a misaligned starting address (not a multiple of 64)]


No simple one-size-fits-all strategy. Must experiment!


    Tools

Look at the .cubin to find register, SMEM, and LMEM usage
Use the -keep compiler option


PTX: GPU assembly

Rarely write it, but always look at it!
Use -keep to write it
To show interleaved source code:
--opencc-options -LIST:source=on


    Visual Profiler


    Occupancy Calculator Spreadsheet


    Memory vs. Compute

Kernel time is frequently dominated by memory bandwidth.

With enough compute, memory latencies can be hidden by thread swapping.

Write kernels with the compute commented out to examine the effects; a sketch of such a memory-only variant follows.
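For instance, a hedged sketch of a memory-only variant (hypothetical names), timed against the full kernel to separate memory time from compute time:

    __global__ void convMemoryOnly(const unsigned char *img_in, unsigned char *img_out,
                                   unsigned int width, unsigned int height,
                                   unsigned int pitch)
    {
        // Same loads and stores as the real kernel, with the convolution
        // arithmetic stubbed out: the measured time approximates pure
        // memory traffic.
        unsigned int X = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
        unsigned int Y = __umul24(blockIdx.y, blockDim.y) + threadIdx.y;
        if (X < width && Y < height)
            img_out[__umul24(Y, pitch) + X] = img_in[__umul24(Y, pitch) + X];
    }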


Example 1: Naïve GMEM

Warning: Do not try this at home!

    __global__ void NaiveGlobalConvolutionKernel(unsigned char *img_in, unsigned char *img_out,
                                                 unsigned int width, unsigned int height,
                                                 unsigned int pitch, float scale)
    {
        unsigned int X = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
        unsigned int Y = __umul24(blockIdx.y, blockDim.y) + threadIdx.y;

        if (X > 1 && X < width - 2 && Y > 1 && Y < height - 2)
        {
            int sum = 0;
            int kidx = 0;
            for (int i = -2; i <= 2; i++)     // the listing is cut off here in the
            {                                 // source; the loop body below is a
                for (int j = -2; j <= 2; j++) // reconstruction of the obvious 5x5
                {                             // accumulation (coefficient array
                                              // name assumed, per the earlier sketch)
                    sum += d_coeffs[kidx++] * img_in[__umul24(Y + i, pitch) + X + j];
                }
            }
            img_out[__umul24(Y, pitch) + X] = (unsigned char)(sum * scale);
        }
    }


Results

148 ms!
Nothing is coalesced
8-bit memory accesses are very inefficient

You are the weakest link. Goodbye!


    Example 2: Simple Textures

Texture hardware provides cached access to GMEM, no worries about coalescing reads

Results: 7.8 ms
But:
0.600 ms additional time to copy the image into a cudaArray for textures
Still using 8-bit writes back to GMEM
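A sketch of this texture path using the legacy texture-reference API of that hardware generation (texture and kernel names are illustrative; d_coeffs is the constant-memory array sketched earlier, and texImg is bound to the cudaArray on the host with cudaBindTextureToArray):

    texture<unsigned char, 2, cudaReadModeElementType> texImg;

    __global__ void SimpleTextureConvolution(unsigned char *img_out,
                                             unsigned int width, unsigned int height,
                                             unsigned int pitch, float scale)
    {
        unsigned int X = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
        unsigned int Y = __umul24(blockIdx.y, blockDim.y) + threadIdx.y;
        if (X < width && Y < height)
        {
            int sum = 0, kidx = 0;
            for (int i = -2; i <= 2; i++)
                for (int j = -2; j <= 2; j++)
                    // tex2D goes through the texture cache: no coalescing
                    // rules to satisfy, and out-of-range reads are clamped.
                    sum += d_coeffs[kidx++] * tex2D(texImg, (float)(X + j), (float)(Y + i));
            img_out[__umul24(Y, pitch) + X] = (unsigned char)(sum * scale);  // 8-bit write
        }
    }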


    Example 2a: Textures

Using 32-bit writes improves bandwidth and coalesces

Results: 6.3 ms, ~25% faster
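A sketch of the 32-bit write (illustrative, building on the texture sketch above): each thread computes four adjacent output pixels and stores them as one uchar4.

    __device__ unsigned char convolvePixel(int x, int y, float scale)
    {
        // Per-pixel 5x5 sum, as in the previous kernels.
        int sum = 0, kidx = 0;
        for (int i = -2; i <= 2; i++)
            for (int j = -2; j <= 2; j++)
                sum += d_coeffs[kidx++] * tex2D(texImg, (float)(x + j), (float)(y + i));
        return (unsigned char)(sum * scale);
    }

    __global__ void TextureConvolution32bitWrite(uchar4 *img_out, unsigned int width4,
                                                 unsigned int height, unsigned int pitch4,
                                                 float scale)
    {
        unsigned int X = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;  // in uchar4 units
        unsigned int Y = __umul24(blockIdx.y, blockDim.y) + threadIdx.y;
        if (X < width4 && Y < height)
        {
            uchar4 r;
            r.x = convolvePixel(4 * X + 0, Y, scale);
            r.y = convolvePixel(4 * X + 1, Y, scale);
            r.z = convolvePixel(4 * X + 2, Y, scale);
            r.w = convolvePixel(4 * X + 3, Y, scale);
            img_out[__umul24(Y, pitch4) + X] = r;   // one coalesced 32-bit store
        }
    }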


    Example 3: Texture into SMEM

Use textures to fetch memory to avoid coalescing issues
Store the tile in SMEM so all pixels are read only once

Results:
3.3 ms!
Memory read/write only: 1.1 ms
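A hedged sketch of the scheme (sizes for a 16x16 block and a 2-pixel apron; names illustrative): the texture unit fills the tile without coalescing concerns, then the convolution runs entirely out of SMEM.

    #define RADIUS 2    // 5x5 kernel -> 2-pixel apron

    __global__ void TexToSmemConvolution(unsigned char *img_out, unsigned int width,
                                         unsigned int height, unsigned int pitch,
                                         float scale)
    {
        __shared__ unsigned char tile[16 + 2 * RADIUS][16 + 2 * RADIUS];

        int X = __umul24(blockIdx.x, 16) + threadIdx.x;
        int Y = __umul24(blockIdx.y, 16) + threadIdx.y;

        // Cooperatively fetch the tile plus apron through the texture unit;
        // each source pixel is read only once per block.
        for (int ty = threadIdx.y; ty < 16 + 2 * RADIUS; ty += 16)
            for (int tx = threadIdx.x; tx < 16 + 2 * RADIUS; tx += 16)
                tile[ty][tx] = tex2D(texImg,
                                     (float)(__umul24(blockIdx.x, 16) + tx - RADIUS),
                                     (float)(__umul24(blockIdx.y, 16) + ty - RADIUS));
        __syncthreads();

        if (X < width && Y < height)
        {
            int sum = 0, kidx = 0;
            for (int i = -RADIUS; i <= RADIUS; i++)
                for (int j = -RADIUS; j <= RADIUS; j++)
                    sum += d_coeffs[kidx++]
                         * tile[threadIdx.y + RADIUS + i][threadIdx.x + RADIUS + j];
            img_out[__umul24(Y, pitch) + X] = (unsigned char)(sum * scale);
        }
    }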


Example 3: Continued

Unfortunately, textures use a lot of registers (25 in this kernel). This reduces occupancy.

16x16 block size limits us to 1 block / SM (on G92), thus the entire block is stalled during the __syncthreads()
16x8 block allows 2 blocks / SM; surprisingly little performance improvement (3.2 ms)
16x20 block maximizes threads; also little performance improvement (3.2 ms)
Memory bandwidth goes up slightly because of fewer reads from kernel apron overlap


Example 4: Texture with floats

Use texture hardware to promote the image to float.
Uses 4x the SMEM, but requires no data type conversion until the write

Results: 6.5 ms. Oops... bad idea
May be useful for cases where f32 compute is needed


Example 5: GMEM to SMEM

4 pixels / thread
Try the naïve implementation first: shift reads left & up by the apron amount, then read an additional 4-pixel strip to the right.
All loads are uncoalesced!

Results:
3.6 ms
Memory only: 1.6 ms
Slightly worse than using textures


Example 6: GMEM to SMEM, Strict Coalescing

Process 4 pixels / thread for 32-bit reads
Read an image tile plus the apron into SMEM
For a 16x16 block size, read 72x16 pixels into SMEM

[Figure: SMEM region roughly 72 pixels wide (18 rows shown): central image tile plus surrounding apron pixels]


    Convolution SMEM Reads

    Step 1: All threads read center top portion into memory


    Convolution SMEM Reads

Step 2: Threads with threadIdx.y < 2 read the bottom two rows


    Convolution SMEM Reads

Step 3: Threads with threadIdx.x == 15 read the left-top apron pixels


    Convolution SMEM Reads

Step 5: Threads with threadIdx.x == 0 read the top-right apron pixels


    Convolution SMEM Reads

Step 6: Threads with threadIdx.x == 0 && threadIdx.y < 2 read the bottom-right apron pixels
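A hedged reconstruction of this load sequence (geometry per the 72-pixel-wide tile figure; interior blocks assumed, with no border handling, matching the deck). The trick behind the edge steps: the word one to the left of the block's 16-word segment is element 15 of the previous aligned segment, so the strict "kth thread reads the kth element" rule means threadIdx.x == 15 must fetch it; symmetrically, the word to the right is element 0, fetched by threadIdx.x == 0.

    __global__ void StrictCoalescedLoad(const unsigned int *img32, unsigned int *out32,
                                        unsigned int pitch32)   // pitch in 32-bit words
    {
        __shared__ unsigned int tile[18][18];   // 72x18 pixels held as 32-bit words

        int tx = threadIdx.x, ty = threadIdx.y;          // 16x16 block
        int row0 = __umul24(blockIdx.y, 16) - 2;         // top apron row
        int colC = __umul24(blockIdx.x, 16) + tx;        // this thread's center word

        // Step 1: all threads read the center/top 16x16-word region.
        tile[ty][tx + 1] = img32[__umul24(row0 + ty, pitch32) + colC];
        // Step 2: threads with ty < 2 read the bottom two rows.
        if (ty < 2)
            tile[16 + ty][tx + 1] = img32[__umul24(row0 + 16 + ty, pitch32) + colC];
        // Threads with tx == 15 fetch the left apron column.
        if (tx == 15) {
            tile[ty][0] = img32[__umul24(row0 + ty, pitch32) + colC - 16];
            if (ty < 2)
                tile[16 + ty][0] = img32[__umul24(row0 + 16 + ty, pitch32) + colC - 16];
        }
        // Threads with tx == 0 fetch the right apron column.
        if (tx == 0) {
            tile[ty][17] = img32[__umul24(row0 + ty, pitch32) + colC + 16];
            if (ty < 2)
                tile[16 + ty][17] = img32[__umul24(row0 + 16 + ty, pitch32) + colC + 16];
        }
        __syncthreads();

        // The 5x5 accumulation over 'tile' would go here, as in the earlier
        // kernels; this copy-out simply keeps the loads observable.
        out32[__umul24(row0 + 2 + ty, pitch32) + colC] = tile[ty + 2][tx + 1];
    }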


    Example 6 Cont.

Results (16x16 block, 2 blocks / SM):
3.4 ms
Memory only: 1.2 ms

Note: Texture is still slightly better, even with all this work!


Example 6: Effect of Block Size

1200 bytes of SMEM per block
11 registers
16x16 = 2 blocks / SM
16x8 = 5 blocks / SM

16x8 results: 3.5 ms; no benefit (probably because of increased overlap)
16x20 results: 3.4 ms; again no real benefit


Example 6: A Couple of Pathological Cases

A non-multiple-of-16 block width results in non-coalesced access and wasted threads
19x13 block results:
5.1 ms (50% slower)
2.4 ms memory only

Change the image pitch to break coalescing
Unaligned pitch results:
4.3 ms
2.4 ms memory only


    Example 7: 128-bit Read/Write to GMEM

Reading data in 128-bit words is faster than 32-bit words (64-bit is also good)
Same amount of data, but fewer transactions to the memory controller
Read data as int4s, cast to char, process 16 bytes / thread

Results:
4.8 ms
0.4 ms memory only

What happened? Memory is way faster, but compute is SLOW
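A sketch of the 128-bit access pattern (illustrative copy-through; the real kernel does the 5x5 work in between):

    __global__ void Process128(const int4 *img_in, int4 *img_out,
                               unsigned int width16, unsigned int height,
                               unsigned int pitch16)   // pitch in int4 units
    {
        unsigned int X = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
        unsigned int Y = __umul24(blockIdx.y, blockDim.y) + threadIdx.y;
        if (X < width16 && Y < height)
        {
            int4 v = img_in[__umul24(Y, pitch16) + X];   // one 128-bit load
            unsigned char *p = (unsigned char *)&v;      // view as 16 pixel bytes
            for (int b = 0; b < 16; b++)                 // per-pixel work happens
                p[b] = (unsigned char)(255 - p[b]);      // here (placeholder invert)
            img_out[__umul24(Y, pitch16) + X] = v;       // one 128-bit store
        }
    }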


The Answer:

SMEM bank conflicts cause warp serialization


Banked SMEM Architecture

Many threads accessing memory
Therefore, memory is divided into banks
Essential to achieve high bandwidth

Each bank can service one address per cycle
A memory can service as many simultaneous accesses as it has banks

Multiple simultaneous accesses to a bank result in a bank conflict
Conflicting accesses are serialized

[Figure: SMEM banks 0 through 15]


Bank Addressing Examples

No bank conflicts: linear addressing, stride == 1
No bank conflicts: random 1:1 permutation

[Figure: threads 0..15 each mapping to a distinct bank 0..15]


Bank Addressing Examples

2-way bank conflicts: linear addressing, stride == 2
8-way bank conflicts: linear addressing, stride == 8

[Figure: stride-2 access maps pairs of threads to the same bank; stride-8 access maps eight threads (x8) onto each of banks 0 and 8]
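The strides above, written out as a sketch against the 16-bank SMEM of this generation (bank = word index mod 16):

    __global__ void bankPatterns(float *out)
    {
        __shared__ float smem[256];
        int tid = threadIdx.x;                  // one 256-thread block assumed
        smem[tid] = (float)tid;                 // stride-1 fill: conflict-free
        __syncthreads();

        float a = smem[tid];                    // stride 1: banks 0..15, no conflict
        float b = smem[(2 * tid) % 256];        // stride 2: 2-way conflict, 2x serialized
        float c = smem[(8 * tid) % 256];        // stride 8: 8-way conflict, 8x serialized
        out[tid] = a + b + c;                   // keep the reads live
    }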


Example 8: 128-bit, Resolve Bank Conflicts

Have each thread process every 4th 32-bit word
Intermediate results are stored in SMEM
Need to shrink the block size, since this uses more SMEM

Results:
2.9 ms!
Memory only: 0.4 ms
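An illustrative sketch of that strategy (not the deck's actual kernel; block size and indexing are assumptions): the four words of each thread's int4 go to SMEM, and the processing pass strides by the thread count so a half-warp's 16 simultaneous accesses land in 16 distinct banks.

    __global__ void bankFriendly128(const int4 *in, int4 *out)  // element count assumed
    {                                                           // a multiple of 64
        __shared__ unsigned int words[4 * 64];    // 64 threads x 4 words each

        int tid = threadIdx.x;                    // 64-thread block assumed
        int4 v = in[__umul24(blockIdx.x, 64) + tid];   // one 128-bit read
        words[4 * tid + 0] = (unsigned int)v.x;   // scatter (this store still
        words[4 * tid + 1] = (unsigned int)v.y;   // conflicts, but runs once;
        words[4 * tid + 2] = (unsigned int)v.z;   // the hot loop below does not)
        words[4 * tid + 3] = (unsigned int)v.w;
        __syncthreads();

        // Conflict-free processing: thread t works on words t, t+64, t+128,
        // t+192, so the 16 simultaneous accesses of a half-warp fall in 16
        // distinct banks (one way to realize the strided scheme).
        for (int k = 0; k < 4; k++)
            words[tid + 64 * k] = ~words[tid + 64 * k];   // placeholder byte-wise invert
        __syncthreads();

        int4 r;                                   // repack and write 128 bits
        r.x = (int)words[4 * tid + 0];  r.y = (int)words[4 * tid + 1];
        r.z = (int)words[4 * tid + 2];  r.w = (int)words[4 * tid + 3];
        out[__umul24(blockIdx.x, 64) + tid] = r;
    }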


Example 10: 128-bit, Unroll Inner Loop

Mostly memory focused until now
All code used fast math where possible (e.g. __umul24)
For the 2-D convolution, the compiler only unrolled the inner loop
(#pragma unroll didn't work for the outer loop)

Results:
2.5 ms
0.4 ms memory only
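If the outer loop must be unrolled despite the pragma not taking, one hedged option is to write it out by hand (helper names illustrative; d_coeffs and an SMEM tile as in the earlier sketches):

    __device__ int convolveRow5(const unsigned char *row, const int *c)
    {
        // One fully unrolled 5-tap row of the 5x5 kernel.
        return c[0] * row[0] + c[1] * row[1] + c[2] * row[2]
             + c[3] * row[3] + c[4] * row[4];
    }

    __device__ int convolve5x5(const unsigned char *tile, int rowPitch)
    {
        // Outer loop unrolled by hand: five explicit row sums.
        return convolveRow5(tile + 0 * rowPitch, &d_coeffs[ 0])
             + convolveRow5(tile + 1 * rowPitch, &d_coeffs[ 5])
             + convolveRow5(tile + 2 * rowPitch, &d_coeffs[10])
             + convolveRow5(tile + 3 * rowPitch, &d_coeffs[15])
             + convolveRow5(tile + 4 * rowPitch, &d_coeffs[20]);
    }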


    Summary

Convolution Approaches Comparison

[Figure: bar chart, time (ms, 0-10) per algorithm, with Compute and Memory Only series: Naive Global Convolution; Simple Texture 1-byte Convolution; Simple Texture 4-byte Convolution; Texture-to-SMEM 16x16 block; Texture-to-SMEM 16x8 (2 blocks/SM) block; Texture-to-SMEM 16x20 block; Texture-to-SMEM float; Global-to-SMEM float; Global-to-SMEM Strict Coalesced 16x16 block; Global-to-SMEM Strict Coalesced 16x8 block; Global-to-SMEM Strict Coalesced 16x20 block; Global-to-SMEM Break Coalescing 19x13 block; Global-to-SMEM Break Coalescing with wrong pitch; Global-to-SMEM 128-bit r/w; Global-to-SMEM 128-bit r/w bank-optimized; Global-to-SMEM 128-bit r/w bank-optimized unrolled]


    GT200 optimization

Coalescing buffers greatly improve performance, especially in non-optimal situations

[Figure: Convolution Approaches Comparison, GT200: bar chart, time (ms, 0-10), with Compute+Memory and Memory Only series, for Naive Global, Simple Texture 4-byte, Texture-to-SMEM, Texture-to-SMEM float, the Global-to-SMEM variants, and Global-to-SMEM 128-bit]


GT200 Observations

The hideous case isn't so bad: 2.5x slower than the best case, vs. 60x slower on G92
128-bit is still best
The complex coalesced pattern to fill SMEM is actually slightly slower than just a simple shift


    A few more comparisons

Best & Worst Case Algorithms

[Figure: bar chart, time (ms, log scale, 1-1000), hideous case vs. best case on G86, G92, and GT200]


    32 vs. 128-bit Access

*Note: Using the best 32-bit method

[Figure: 32 vs. 128-bit processing: bar chart, time (ms, 0-6), for 32-bit, 32-bit memory-only, 128-bit, and 128-bit memory-only on G92 and GT200]


    Broken Coalescing

[Figure: Effect of Wrong Pitch: bar chart, time (ms, 0-5), for correct coalescing, correct coalescing memory-only, wrong pitch, and wrong pitch memory-only, on G92 and GT200]