
High-Performance GPU Programming for Deep Learning

7 April 2016 Scott Gray

Nervana Systems

MAKING MACHINES SMARTER.™


High-Performance GPU kernels for deep learning


• Fast matrix multiply for small minibatches

• Direct convolution leveraging GEMM advances

• Even faster convolution with Winograd


GEMM: Basics


C = AB
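
In deep learning, this GEMM usually multiplies a layer's weights by a minibatch of activations, so one dimension of C is the batch size N. A minimal NumPy sketch of that framing (the 3072x3072 layer size is borrowed from the benchmark slides below; everything else is illustrative):

```python
import numpy as np

# GEMM: C (M x N) = A (M x K) @ B (K x N).
# Illustrative shapes: a 3072 -> 3072 layer applied to a minibatch of 32,
# so B holds one activation column per sample.
M, K, N = 3072, 3072, 32
A = np.random.randn(M, K).astype(np.float32)   # weights
B = np.random.randn(K, N).astype(np.float32)   # minibatch activations
C = A @ B                                      # tall, skinny result: 3072 x 32
print(C.shape)
```

With a small minibatch, C is a narrow matrix, which is why the tile shapes on the next slides matter.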


GEMM: Memory Load


Figure: memory-load patterns, with threads performing outer-product loads either contiguous or strided; shown for a single tile and for batched GEMM.
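
The load pattern comes from the outer-product formulation of GEMM: a tile of C is accumulated as a sum of rank-1 updates, one per step along K, so at each step the threads only need one column slice of A and one row slice of B (loaded contiguously or strided). A NumPy sketch of that view, with assumed toy sizes:

```python
import numpy as np

M, K, N = 32, 8, 32                     # one small tile, short K slice (assumed)
A = np.random.randn(M, K).astype(np.float32)
B = np.random.randn(K, N).astype(np.float32)

# Accumulate the tile as K rank-1 (outer-product) updates: each step consumes
# one column of A and one row of B -- the slices the threads of a tile
# cooperatively load each iteration.
C = np.zeros((M, N), dtype=np.float32)
for k in range(K):
    C += np.outer(A[:, k], B[k, :])

assert np.allclose(C, A @ B, atol=1e-4)
```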


GEMM: Tile sizes

Figure: threads and shared-memory loads for a 32x32 GEMM tile, a 32x64 GEMM tile, and 32x32 batched GEMM tiles.
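
A rough model of what a tile size like 32x32 means: each thread block owns one tile of C and marches down K, staging small slices of A and B (the shared-memory loads above) at each step. A NumPy sketch of the blocking, using the 32x32 tile from the slide and an assumed K slice of 8:

```python
import numpy as np

M, K, N = 128, 64, 96                   # assumed problem sizes
TM, TN, TK = 32, 32, 8                  # output tile and K slice
A = np.random.randn(M, K).astype(np.float32)
B = np.random.randn(K, N).astype(np.float32)
C = np.zeros((M, N), dtype=np.float32)

# The outer two loops enumerate thread blocks (one per output tile); the
# inner loop is each block's march down K, where every (TM x TK) and
# (TK x TN) slice stands in for a shared-memory load.
for i in range(0, M, TM):
    for j in range(0, N, TN):
        for k in range(0, K, TK):
            C[i:i+TM, j:j+TN] += A[i:i+TM, k:k+TK] @ B[k:k+TK, j:j+TN]

assert np.allclose(C, A @ B, atol=1e-3)
```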


hGEMM Results - NN


Chart: Nx3072x3072 NN op. GFLOPS (0 to 6000) vs batch size N (32, 64, 96, 128); series: Nervana 32x32, cuBLAS 128x64.


hGEMM Results - TN


Chart: Nx3072x3072 TN op. GFLOPS (0 to 6000) vs batch size N (32, 64, 96, 128); series: Nervana 32x32, cuBLAS 128x64.


Direct convolution is still relevant


• Striding

• Odd-size filters

• Placeholder until a faster algorithm can be implemented

• Often faster for a single image (N = 1) or for the first layer, where C is small


Direct convolution: implementation details


• Batched GEMM for efficient transpose and higher occupancy (see the sketch after this list)

• Compound outer-product block remapping

• Square-wave pattern for P,Q block mapping

• Slicing: shared-memory lookup + integer division

• N-contiguous vs C-contiguous layouts

• Single P,Q vs tiled P,Q

• Bprop as an upside-down fprop

• Update-specific optimizations
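
One way to read the batched-GEMM and slicing bullets: direct convolution can be lowered to a matrix multiply by gathering each filter-sized input window into a column, where recovering the c, r, s offsets from a flat index is exactly the integer-division work mentioned above. A NumPy im2col-style sketch of the idea, with assumed toy sizes; this illustrates the lowering, not the actual kernel's layout:

```python
import numpy as np

# Assumed sizes: N images, C input channels, K filters, RxS filter, HxW input.
N, C, K, H, W, R, S = 2, 3, 4, 6, 6, 3, 3
P, Q = H - R + 1, W - S + 1            # output size (no padding, stride 1)
x = np.random.randn(N, C, H, W).astype(np.float32)
f = np.random.randn(K, C, R, S).astype(np.float32)

# Gather every (C, R, S) input window into a column: (C*R*S, N*P*Q).
cols = np.empty((C * R * S, N * P * Q), dtype=np.float32)
col = 0
for n in range(N):
    for p in range(P):
        for q in range(Q):
            cols[:, col] = x[n, :, p:p+R, q:q+S].ravel()
            col += 1

# Fprop becomes a single GEMM: (K, C*R*S) @ (C*R*S, N*P*Q).
y = (f.reshape(K, -1) @ cols).reshape(K, N, P, Q).transpose(1, 0, 2, 3)

# Check against a direct sliding-window computation.
win = np.lib.stride_tricks.sliding_window_view(x, (R, S), axis=(2, 3))
ref = np.einsum('ncpqrs,kcrs->nkpq', win, f)
assert np.allclose(y, ref, atol=1e-4)
```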


Winograd: input transform


Figure: input feature map, read as overlapping 4x4 tiles with stride 2.

• Input transform (sketched below)

• 2D Winograd is a nested product of 1D transforms

• Transforms can be simplified to remove zeros
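
A sketch of the input transform for F(2x2, 3x3) Winograd: each overlapping 4x4 input tile d (stride 2, so neighbors share two rows or columns) becomes V = B^T d B, the nested product of a 1D transform applied along rows and then columns. B^T below is the standard F(2x2, 3x3) matrix from Lavin and Gray's Winograd paper; its 0 and +-1 entries are why the transform simplifies to adds and subtracts:

```python
import numpy as np

# 1D input-transform matrix B^T for F(2x2, 3x3) (Lavin & Gray, 2015).
Bt = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float32)

x = np.random.randn(8, 8).astype(np.float32)   # one feature map (toy size)

# Extract overlapping 4x4 tiles with stride 2 and transform each:
# V = B^T d B, i.e. the 1D transform nested over rows then columns.
tiles = [Bt @ x[i:i+4, j:j+4] @ Bt.T
         for i in range(0, x.shape[0] - 2, 2)
         for j in range(0, x.shape[1] - 2, 2)]
V = np.stack(tiles)                             # (num_tiles, 4, 4)
print(V.shape)
```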


Winograd: filter transform


• Filter transform (sketched below)

• Same as the input transform but with different coefficients

• Transform each feature map independently
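
The filter transform has the same nested form, U = G g G^T, with the 4x3 matrix G in place of B^T; each 3x3 filter is transformed independently per (filter, channel) pair. A NumPy sketch with assumed toy sizes:

```python
import numpy as np

# Filter-transform matrix G for F(2x2, 3x3) (Lavin & Gray, 2015).
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=np.float32)

K, C = 4, 3                                     # toy filter/channel counts
g = np.random.randn(K, C, 3, 3).astype(np.float32)

# U = G g G^T maps each 3x3 filter to a 4x4 tile, one (k, c) pair at a time.
U = np.einsum('ij,kcjl,ml->kcim', G, g, G)      # (K, C, 4, 4)
print(U.shape)
```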


Winograd: batched GEMM

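This is the step the transforms set up: the per-tile elementwise products, accumulated over channels C, regroup at each transform point (i, j) of the 4x4 tile into an independent (K x C) @ (C x tiles) matrix multiply, 16 in all, which run as one batched GEMM. A NumPy sketch with assumed shapes:

```python
import numpy as np

C, K, T = 8, 16, 9                      # channels, filters, input tiles (assumed)
U = np.random.randn(4, 4, K, C).astype(np.float32)   # transformed filters
V = np.random.randn(4, 4, C, T).astype(np.float32)   # transformed input tiles

# One independent (K x C) @ (C x T) GEMM per transform point -- 16 in all.
# On the GPU these run as a single batched GEMM over the 4x4 grid.
M = np.einsum('ijkc,ijct->ijkt', U, V)               # (4, 4, K, T)
print(M.shape)
```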


Winograd: output transform


Figure: output feature map.

• Output transform (sketched below)

• Same form as the input and filter transforms

• Transform back to pixel space to obtain each 2x2 output tile
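
Putting the three transforms together for one tile: Y = A^T ((G g G^T) * (B^T d B)) A, with * elementwise, collapses the 4x4 accumulator back to a 2x2 output tile in pixel space. A minimal end-to-end check against direct computation (matrices again from the Lavin and Gray paper; note that, as usual in deep learning, "convolution" here is cross-correlation):

```python
import numpy as np

# F(2x2, 3x3) transform matrices (Lavin & Gray, 2015).
Bt = np.array([[1, 0, -1, 0], [0, 1, 1, 0],
               [0, -1, 1, 0], [0, 1, 0, -1]], dtype=np.float32)
G  = np.array([[1, 0, 0], [0.5, 0.5, 0.5],
               [0.5, -0.5, 0.5], [0, 0, 1]], dtype=np.float32)
At = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=np.float32)

d = np.random.randn(4, 4).astype(np.float32)   # one 4x4 input tile
g = np.random.randn(3, 3).astype(np.float32)   # one 3x3 filter

# Winograd: transform, multiply elementwise, transform back to pixel space.
Y = At @ ((G @ g @ G.T) * (Bt @ d @ Bt.T)) @ At.T   # 2x2 output tile

# Direct sliding-window cross-correlation over the same tile.
ref = np.array([[(d[i:i+3, j:j+3] * g).sum() for j in range(2)]
                for i in range(2)], dtype=np.float32)
assert np.allclose(Y, ref, atol=1e-3)
print(Y)
```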


Performance: VGG

Chart: VGG fp32, totals by operation. Algorithmic speedup (0 to 2) vs batch size (64, 32, 16, 8, 4, 2, 1); series: Winograd fp32 fprop, bprop, update vs cuDNN fp32 fprop, bprop, update.


Performance: Alexnet convolutional layers


Chart: Alexnet totals. Algorithmic speedup (0 to 2) vs batch size (128, 64, 32, 16, 8, 4); series: Nervana fp16, Nervana fp32, cuBLAS fp16, cuBLAS fp32.


Compounding


Compounding inside of GEMM and conv for free (see the sketch below):

• alpha / beta

• bias

• relu, prelu, tanh, …

• bprop relu, …

• bprop bias

• batchnorm mean
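
A sketch of what the compounded epilogue computes: because the output tile is already in registers at the end of the GEMM, scaling by alpha/beta, adding a bias, applying an activation, or accumulating a statistic adds no extra memory traffic. In NumPy terms (the fusion itself happens inside the kernel; shapes assumed):

```python
import numpy as np

M, K, N = 64, 64, 32                    # assumed layer and minibatch sizes
A = np.random.randn(M, K).astype(np.float32)
B = np.random.randn(K, N).astype(np.float32)
C = np.random.randn(M, N).astype(np.float32)
bias = np.random.randn(M, 1).astype(np.float32)
alpha, beta = 1.0, 0.5

# Compounded GEMM: on the GPU, everything after the matmul is fused into
# the kernel's output stage, so these extra ops come for free.
out = np.maximum(alpha * (A @ B) + beta * C + bias, 0.0)   # relu epilogue

# The same pass can also emit statistics, e.g. a batchnorm mean per row.
mean = out.mean(axis=1, keepdims=True)
```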


Summary


• Nervana has the fastest tools for deep learning

• neon with state-of-the-art Maxwell kernels

• Nervana Cloud with multi-GPU training

• Watch for Nervana Engine, our deep learning processor