
Implementing Deep Learning using cuDNN

Yeha Lee (이예하)

VUNO Inc.


Contents

1. Deep Learning Review
2. Implementation on GPU using cuDNN
3. Optimization Issues
4. Introduction to VUNO-Net

1. Deep Learning Review

Brief History of Neural Network

Eras: Golden Age, Dark Age (“AI Winter”)

• 1943, Electronic Brain (S. McCulloch, W. Pitts): adjustable weights; weights are not learned
• 1957, Perceptron (F. Rosenblatt): learnable weights and threshold
• 1960, ADALINE (B. Widrow, M. Hoff)
• 1969, XOR Problem (M. Minsky, S. Papert)
• 1986, Multi-layered Perceptron with Backpropagation (D. Rumelhart, G. Hinton, R. Williams): forward activity, backward error; solution to nonlinearly separable problems; big computation, local optima and overfitting
• 1995, SVM (V. Vapnik, C. Cortes): limitations of learning prior knowledge; kernel function requires human intervention
• 2006, Deep Neural Network with Pretraining (G. Hinton, R. Salakhutdinov): hierarchical feature learning

Machine/Deep Learning is eating the world!


Building Blocks

• Restricted Boltzmann machine
• Auto-encoder
• Deep belief network
• Deep Boltzmann machines
• Generative stochastic networks
• Convolutional neural networks
• Recurrent neural networks

Convolutional Neural Networks

• Receptive Field (Hubel & Wiesel, 1959)
• Neocognitron (Fukushima, 1980)

• LeNet-5 (Yann LeCun, 1998)
• AlexNet (Alex Krizhevsky et al., 2012)
• Clarifai (Matt Zeiler and Rob Fergus, 2013)

ImageNet Challenge

• Detection/Classification and Localization
• 1.2M images for training / 50K for validation
• 1000 classes

Team (date)          Error rate   Remark
Human                5.1%         Karpathy
Google (2014/8)      6.67%        Inception
Baidu (2015/1)       5.98%        Data Augmentation
Microsoft (2015/2)   4.94%        SPP, PReLU, Weight Initialization
Google (2015/2)      4.82%        Batch Normalization
Baidu (2015/5)       4.58%        Cheating


Network

• Softmax Layer (Output)
• Fully Connected Layer
• Pooling Layer
• Convolution Layer

Layer

• Input / Output
• Weights
• Neuron activation

VGG (K. Simonyan and A. Zisserman, 2015)

Neural Network

Forward Pass: Conv Layer → FC Layer → Softmax Layer
Backward Pass: Softmax Layer → FC Layer → Conv Layer

Convolution Layer - Forward

Input (3x3):
x1 x4 x7
x2 x5 x8
x3 x6 x9

Filter (2x2):
w1 w3
w2 w4

Output (2x2):
y1 y3
y2 y4

Convolution Layer - Forward

How to evaluate the convolution layer efficiently?

• There are several ways to implement convolutions efficiently

1. Lower the convolutions into a matrix multiplication (cuDNN)
2. Use the Fast Fourier Transform to compute the convolution (cuDNN v3)
3. Compute the convolutions directly (cuda-convnet)
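The first option, lowering a convolution into a matrix multiplication, can be sketched on the CPU as below. This is only an illustrative reference, not cuDNN's implementation: the function names `im2col` and `conv_as_matmul` are hypothetical, the image is assumed to be single-channel and row-major, and stride 1 with no padding and cross-correlation (no filter flip) are assumed.

```c
#include <assert.h>

/* Lower one single-channel h x w image into a matrix whose columns are the
 * flattened kh x kw patches (stride 1, no padding). cols must hold
 * (kh*kw) * (out_h*out_w) floats and is stored row-major. */
void im2col(const float *img, int h, int w, int kh, int kw, float *cols)
{
    assert(kh <= h && kw <= w);
    int out_h = h - kh + 1, out_w = w - kw + 1;
    int n_patches = out_h * out_w;
    for (int ki = 0; ki < kh; ++ki)
        for (int kj = 0; kj < kw; ++kj) {
            int row = ki * kw + kj;            /* one row per filter tap */
            for (int oi = 0; oi < out_h; ++oi)
                for (int oj = 0; oj < out_w; ++oj)
                    cols[row * n_patches + oi * out_w + oj] =
                        img[(oi + ki) * w + (oj + kj)];
        }
}

/* out[j] = sum_r filt[r] * cols[r][j]: a (1 x k) by (k x n_patches) product.
 * With several filters this becomes a full matrix-matrix multiplication. */
void conv_as_matmul(const float *filt, const float *cols,
                    int k, int n_patches, float *out)
{
    for (int j = 0; j < n_patches; ++j) {
        float acc = 0.0f;
        for (int r = 0; r < k; ++r)
            acc += filt[r] * cols[r * n_patches + j];
        out[j] = acc;
    }
}
```

For a 3x3 input and a 2x2 filter, `cols` is a 4x4 matrix and the output has 4 elements. The lowered matrix duplicates overlapping input pixels, which is the memory cost paid for being able to reuse a fast GEMM.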


Fully Connected Layer - Forward

[Figure: inputs x1, x2, x3 fully connected to outputs y1, y2 through weights w11, w12, w13, w21, w22, w23]

• Matrix calculation is very fast on GPU
• cuBLAS library

Softmax Layer - Forward

[Figure: softmax layer with inputs x1, x2 and outputs y1, y2]

Issues

1. How to evaluate the denominator efficiently?
2. Numerical instability when xk is too large or too small

Fortunately, the softmax layer is supported by cuDNN

Learning on Neural Network

Update network (i.e. weights) to minimize loss function

• Popular loss functions for classification
  • Sum-of-squared error
  • Cross-entropy error
• Necessary criterion

Learning on Neural Network

• Due to the complexity of the loss, a closed-form solution is usually not possible
• Use an iterative approach: gradient descent, Newton’s method
• How to evaluate the gradient: error backpropagation

Softmax Layer - Backward

[Figure: softmax layer with inputs x1, x2 and outputs y1, y2]

Loss function: cross-entropy error

• Errors back-propagated from the softmax layer can be calculated directly from its outputs
• Subtract 1 from the output value at the target index
• This can be done efficiently using GPU threads
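As a minimal CPU sketch of the "subtract 1 at the target index" rule, assuming a cross-entropy loss on top of the softmax and a single integer target label (the function name `softmax_xent_backward` is hypothetical):

```c
#include <assert.h>

/* Error at the softmax input: delta[k] = y[k] - t[k], where t is the one-hot
 * target. Only the entry at the target index differs from y. */
void softmax_xent_backward(const float *y, int n, int target, float *delta)
{
    assert(target >= 0 && target < n);
    for (int k = 0; k < n; ++k)
        delta[k] = y[k];
    delta[target] -= 1.0f;
}
```

On a GPU each sample in the mini-batch is independent, so one thread per sample (or per output element) can apply the same update in parallel.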


Fully Connected Layer - Backward

[Figure: inputs x1, x2, x3 fully connected to outputs y1, y2 through weights w11, w12, w13, w21, w22, w23]

[Equations: forward pass, error, and gradient]

• Matrix calculation is very fast on GPU
• Element-wise multiplication can be done efficiently using GPU threads
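The slide's equations did not survive extraction, so the following is a sketch of the standard fully connected backward pass rather than the author's exact formulation. It assumes a layer y = relu(W x) with W stored row-major as n_out x n_in; the name `fc_backward` and the choice of ReLU are assumptions made for illustration.

```c
#include <assert.h>

/* Backward pass of y = relu(W x).
 * dy : error arriving from the next layer (n_out values)
 * dw : gradient with respect to W (n_out x n_in, row-major)
 * dx : error propagated to the previous layer (n_in values) */
void fc_backward(const float *w, const float *x, const float *y,
                 const float *dy, int n_in, int n_out,
                 float *dw, float *dx)
{
    assert(n_in > 0 && n_out > 0);
    for (int i = 0; i < n_in; ++i)
        dx[i] = 0.0f;
    for (int j = 0; j < n_out; ++j) {
        /* element-wise multiplication by the activation derivative */
        float delta = dy[j] * (y[j] > 0.0f ? 1.0f : 0.0f);
        for (int i = 0; i < n_in; ++i) {
            dw[j * n_in + i] = delta * x[i];      /* outer product delta x^T */
            dx[i] += w[j * n_in + i] * delta;     /* W^T delta */
        }
    }
}
```

Over a mini-batch, both the outer product and the W^T product become matrix-matrix multiplications, which is where a library such as cuBLAS would be used on the GPU.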

Convolution Layer - Backward

[Figure: same 3x3 input (x1..x9), 2x2 filter, and 2x2 output (y1..y4) as in the forward pass]

Forward (filter):
w1 w3
w2 w4

Error (flipped filter):
w4 w2
w3 w1

Convolution Layer - Backward

[Figure: Gradient of the 2x2 filter (w1..w4), computed from the 3x3 input (x1..x9) and the output error]

• The error and gradient can be computed with a convolution scheme
• cuDNN supports this operation

2. Implementation on GPU using cuDNN

Introduction to cuDNN

cuDNN is a GPU-accelerated library of primitives for deep neural networks

• Convolution forward and backward
• Pooling forward and backward
• Softmax forward and backward
• Neuron activations forward and backward:
  • Rectified linear (ReLU)
  • Sigmoid
  • Hyperbolic tangent (TANH)
• Tensor transformation functions

Introduction to cuDNN (version 2)

cuDNN's convolution routines aim for performance competitive with the fastest GEMM

• Lowering the convolutions into a matrix multiplication (Sharan Chetlur et al., 2015)

Introduction to cuDNN

Benchmarks

https://github.com/soumith/convnet-benchmarks
https://developer.nvidia.com/cudnn

Goal

Learning VGG model using cuDNN

• Data Layer
• Convolution Layer
• Pooling Layer
• Fully Connected Layer
• Softmax Layer

Preliminary

Common data structure for Layer

• Device memory & tensor descriptors for input/output data & error
• A tensor descriptor defines the dimensions of data

float *d_input, *d_output, *d_inputDelta, *d_outputDelta;
cudnnTensorDescriptor_t inputDesc;
cudnnTensorDescriptor_t outputDesc;

cuDNN initialize & release

cudaSetDevice( deviceID );
cudnnCreate( &cudnn_handle );

cudnnDestroy( cudnn_handle );
cudaDeviceReset();

Data Layer

create & set Tensor Descriptor

cudnnCreateTensorDescriptor();
cudnnSetTensor4dDescriptor();

cudnnSetTensor4dDescriptor(
    outputDesc,
    CUDNN_TENSOR_NCHW,
    CUDNN_DATA_FLOAT,
    sampleCnt,
    channels,
    height,
    width
);

Example: 2 images (3x3x2)

             sample #1    sample #2
channel #1    1  2  3     19 20 21
              4  5  6     22 23 24
              7  8  9     25 26 27
channel #2   10 11 12     28 29 30
             13 14 15     31 32 33
             16 17 18     34 35 36
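The numbering in this example follows the CUDNN_TENSOR_NCHW layout: width varies fastest, then height, then channel, then sample. A small helper makes the index arithmetic explicit; the name `nchw_offset` is hypothetical and the function only illustrates the layout, it is not part of cuDNN.

```c
#include <assert.h>
#include <stddef.h>

/* Linear offset of element (n, c, h, w) in a densely packed NCHW tensor
 * with C channels, H rows and W columns per sample. */
size_t nchw_offset(int n, int c, int h, int w, int C, int H, int W)
{
    assert(n >= 0 && c >= 0 && c < C && h >= 0 && h < H && w >= 0 && w < W);
    return (((size_t)n * C + c) * H + h) * W + w;
}
```

In the 2-image example, the value stored at each position is its offset plus one: the first element of sample #2, channel #1 is at offset 18 (value 19), and the centre of sample #1, channel #2 is at offset 13 (value 14).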

Convolution Layer

1. Initialization

1.1 create & set Filter Descriptor

cudnnCreateFilterDescriptor(&filterDesc);
cudnnSetFilter4dDescriptor(…);

1.2 create & set Conv Descriptor

cudnnCreateConvolutionDescriptor(&convDesc);
cudnnSetConvolution2dDescriptor(…);

1.3 create & set output Tensor Descriptor

cudnnGetConvolution2dForwardOutputDim(…);
cudnnCreateTensorDescriptor(&dstTensorDesc);
cudnnSetTensor4dDescriptor();

1.4 Get Convolution Algorithm

cudnnGetConvolutionForwardAlgorithm(…);
cudnnGetConvolutionForwardWorkspaceSize(…);

Convolution Layer

1.1 Set Filter Descriptor

cudnnSetFilter4dDescriptor(
    filterDesc,
    CUDNN_DATA_FLOAT,
    filterCnt,
    input_channelCnt,
    filter_height,
    filter_width
);

Example: 2 Filters (2x2x2)

             Filter #1   Filter #2
channel #1    1  2        9 10
              3  4       11 12
channel #2    5  6       13 14
              7  8       15 16

Convolution Layer

1.2 Set Convolution Descriptor


CUDNN_CROSS_CORRELATION

Convolution Layer

1.3 Set output Tensor Descriptor

• n, c, h and w indicate the output dimensions
• A tensor descriptor defines the dimensions of data
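The output height and width follow from the input size, filter size, padding, and stride. The sketch below shows the usual formula for a convolution without dilation; the helper name `conv_out_dim` is hypothetical, and in real code the values would be queried from cuDNN with cudnnGetConvolution2dForwardOutputDim rather than recomputed by hand.

```c
#include <assert.h>

/* Output size along one spatial dimension:
 * out = (in + 2 * pad - filter) / stride + 1 (integer division). */
int conv_out_dim(int in, int filter, int pad, int stride)
{
    assert(stride > 0 && pad >= 0 && in + 2 * pad >= filter);
    return (in + 2 * pad - filter) / stride + 1;
}
```

For example, a 3x3 input with a 2x2 filter, no padding and stride 1 gives a 2x2 output, and a 224x224 input with a 3x3 filter, padding 1 and stride 1 keeps its 224x224 size. The output n equals the number of samples and the output c equals the number of filters.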

Convolution Layer

1.4 Get Convolution Algorithm

[Slide code highlights: inputDesc, outputDesc, CUDNN_CONVOLUTION_FWD_PREFER_FASTEST]

Convolution Layer

2. Forward Pass

2.1 Convolution

cudnnConvolutionForward(…);

2.2 Activation

cudnnActivationForward(…);

Convolution Layer

2.1 Convolution

[Slide code highlights: (inputDesc, d_input), (outputDesc, d_output)]

Convolution Layer

2.2 Activation

Activation modes: sigmoid, tanh, ReLU

[Slide code highlights: (outputDesc, d_output)]

Convolution Layer

3. Backward Pass

3.1 Activation Backward

cudnnActivationBackward(…);

• Errors back-propagated from layer l+1 (d_outputDelta) are multiplied by the derivative of the activation

3.2 Calculate Gradient

cudnnConvolutionBackwardFilter(…);

3.3 Error Backpropagation

cudnnConvolutionBackwardData(…);

• See slide 22 (Convolution Layer - Backward)

Convolution Layer

3.1 Activation Backward

[Slide code highlights: (outputDesc, d_output), (outputDesc, d_outputDelta), (outputDesc, d_output)]


Convolution Layer

3.2 Calculate Gradient

[Slide code highlights: (inputDesc, d_input), (filterDesc, d_filterGradient)]

Convolution Layer

3.3 Error Backpropagation

[Slide code highlights: (inputDesc, d_inputDelta)]

Pooling Layer / Softmax Layer

Pooling Layer

1. Initialization

cudnnCreatePoolingDescriptor(&poolingDesc);
cudnnSetPooling2dDescriptor(…);

2. Forward Pass

cudnnPoolingForward(…);

3. Backward Pass

cudnnPoolingBackward(…);

Softmax Layer

Forward Pass

cudnnSoftmaxForward(…);

3. Optimization Issues

Optimization

Learning Very Deep ConvNet

We know that a deep ConvNet can be trained without pre-training

• weight sharing
• sparsity
• Rectifier unit

But, “With fixed standard deviations very deep models have difficulties to converge” (Kaiming He et al., 2015)

• e.g. random initialization from a Gaussian distribution with 0.01 std
• >8 convolution layers

Optimization

VGG (K. Simonyan and A. Zisserman, 2015)

GoogLeNet (C. Szegedy et al., 2014)

Optimization

Initialization of Weights for Rectifier (Kaiming He et al., 2015)

• The variance of the response in each layer
• Sufficient condition that the gradient is not exponentially large/small
• Standard deviation for initialization: $\sqrt{2/n}$, where $n = (\text{spatial filter size})^2 \times (\text{filter Cnt})$

Optimization

Case study

3x3 filter

The filter number   Std
64                  0.059
128                 0.042
256                 0.029
512                 0.021

• When using 0.01, the std of the gradient propagated from conv10 to conv2 becomes very small: error vanishing

Speed

Data loading & model learning

• Reducing data loading and augmentation time
• Data provider thread (dp_thread)
• Model learning thread (worker_thread)

readData() {
    if (is_dp_thread_running) pthread_join(dp_thread)
    …
    if (is_data_remain) pthread_create(dp_thread)
}

readData();
for (…) {
    readData();
    pthread_create(worker_thread)
    …
    pthread_join(worker_thread)
}

Speed

Multi-GPU

• Data parallelization vs. model parallelization
  • Distribute the model, use the same data: Model Parallelism
  • Distribute the data, use the same model: Data Parallelism
• Data parallelization & gradient average
  • One of the easiest ways to use multiple GPUs
  • The result is the same as with a single GPU

Parallelism

Data Parallelization
• The good: Easy to implement
• The bad: Cost of sync increases with the number of GPUs

Model Parallelization
• The good: Larger network can be trained
• The bad: Sync is necessary in all layers

Parallelism


Mixing Data Parallelization and Model Parallelization

(Krizhevsky, 2014)

Speed

Data Parallelization & Gradient Average

Single GPU: data (256) → Forward Pass → Backward Pass → Gradient → Update
Two GPUs: data1 (128) and data2 (128), each → Forward Pass → Backward Pass → Gradient; then Average → Update
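The averaging step can be sketched on the host as below. This is an assumption-laden illustration: the name `average_gradients` is hypothetical, the per-GPU gradients are assumed to have already been copied into one contiguous buffer of num_gpus x n floats, and each GPU is assumed to process an equally sized share of the mini-batch.

```c
#include <assert.h>

/* Average per-GPU gradients element by element.
 * grads holds num_gpus blocks of n floats each (GPU 0 first). */
void average_gradients(const float *grads, int num_gpus, int n, float *avg)
{
    assert(num_gpus > 0 && n >= 0);
    for (int i = 0; i < n; ++i) {
        float sum = 0.0f;
        for (int g = 0; g < num_gpus; ++g)
            sum += grads[g * n + i];
        avg[i] = sum / (float)num_gpus;
    }
}
```

If each GPU computes the mean gradient over its own 128 samples, the average of the two means equals the mean over all 256 samples, which is why the update matches the single-GPU case up to floating-point rounding.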


4. Introduction to VUNO-Net

The Team


VUNO-Net

Theano, Caffe, Cuda-ConvNet, Pylearn, RNNLIB, DL4J, Torch, VUNO-Net

VUNO-Net

Features

• Structure: Convolution, LSTM, MD-LSTM (2D), Pooling, Spatial Pyramid Pooling, Fully Connection, Concatenation
• Output: Softmax, Regression, Connectionist Temporal Classification
• Learning: Multi-GPU Support, Batch Normalization, Parametric Rectifier, Initialization for Rectifier, Stochastic Gradient Descent, Dropout, Data Augmentation

[Legend: Open Source, VUNO-Net Only]

Performance

The state-of-the-art performance at image & speech

• Speech Recognition (TIMIT*): VUNO 84.3%, WR1 82.2%
• Image Classification (CIFAR-100): VUNO 82.1%, WR2 79.3%, Kaggle Top 1 (2014)

* TIMIT is one of the most popular benchmark datasets for the speech recognition task (Texas Instruments - MIT)
WR1 (World Record) - “Speech Recognition with Deep Recurrent Neural Networks”, Alex Graves, ICASSP (2013)
WR2 (World Record) - “kaggle competition: https://www.kaggle.com/c/cifar-10”

Application

We’ve achieved record-breaking performance on medical image analysis.

Application

Whole Lung Quantification

normal vs. Emphysema

• Sliding window: pixel-level classification
• But, context information is more important
• Ongoing works: MD-LSTM, Recurrent CNN

Application

Example #1
[Figure panels: Original Image (CT), VUNO, Gold Standard]

Application

Example #2
[Figure panels: Original Image (CT), VUNO, Gold Standard]

Visualization

SVM Features (22 features)

Histogram, Gradient, Run-length, Co-occurrence matrix, Cluster analysis, Top-hat transformation

Visualization

Activation of top hidden layer


200 hidden nodes

Visualization


We Are Hiring!!

• Algorithm Engineer
• CUDA Programmer
• Application Developer
• Staff Member
• Business Developer

Q&A


Thank You
