Neural Network on CUDA: Restricted Boltzmann Machine
Daniel Ly, Volodymyr Paprotski, Danny Yen


Page 1:

Neural Network on CUDA

Restricted Boltzmann Machine

Daniel Ly
Volodymyr Paprotski

Danny Yen

Page 2:

Outline

● Restricted Boltzmann Machine background
● CUDA Implementation

– Matrix Library

– Random Number Generator

– Sigmoid Function

● Results
● Discussion

Page 3:

Background

● Neural networks are a computational paradigm inspired by biological nervous systems

● Based on the idea of having many simple computational units, called nodes, each of which maps an input to an output

● Interesting behaviour emerges from massive networks of interconnected nodes

Page 4:

Background

● Artificial neural networks are capable of learning

● Given a data set, automated learning rules can be applied to achieve a desired behaviour

● Artificial neural networks are applied to a growing number of applications

● Pattern classification
● Computer vision
● Signal processing

Page 6:

RBM Theory

Visible Layer

v_i

Binary state {0,1}

Page 7:

RBM Theory

Visible Layer

v_i

h_j

Hidden Layer

Binary state {0,1}

Page 8:

RBM Theory

Visible Layer

v_i

h_j

Hidden Layer

Connections/Weights: w_{i,j} ∈ ℝ

Page 9:

RBM Theory – Alternating Gibbs Sampling

Page 10:

RBM Theory – Alternating Gibbs Sampling

Load data vector in visible layer

Page 11:

RBM Theory – Alternating Gibbs Sampling

Generate energies

E_j = ∑_i w_{i,j} v_i

Page 12:

RBM Theory – Alternating Gibbs Sampling

Determine node state through transfer function

h_j = f(E_j)

(f is the sigmoid transfer function covered later, f(E) = 1 / (1 + e^(−E)); the binary state h_j is set to 1 with probability f(E_j))

Page 16:

RBM Theory – Alternating Gibbs Sampling

Reconstruct visible nodes

E_i = ∑_j w_{i,j} h_j
v_i = f(E_i)

Page 19:

RBM Theory – Alternating Gibbs Sampling

Update weights (simplified)

w_{i,j} = w_{i,j} ± ε v_i h_j
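As a rough sketch (not the authors' kernel), the simplified rule maps naturally onto one thread per weight. The kernel name, row-major layout and sign argument below are assumptions; sign = +1 for the data-driven phase and -1 for the reconstruction-driven phase of contrastive divergence.

```
// Hypothetical sketch: apply w[i][j] += sign * eps * v[i] * h[j] over an
// nV x nH weight matrix stored row-major.
__global__ void updateWeights(float* w, const float* v, const float* h,
                              int nV, int nH, float eps, float sign)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // hidden index
    int i = blockIdx.y * blockDim.y + threadIdx.y;   // visible index
    if (i < nV && j < nH)
        w[i * nH + j] += sign * eps * v[i] * h[j];   // w_{i,j} ± ε v_i h_j
}
```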

Page 26:

RBM Theory – Alternating Gibbs Sampling

Repeat cycle
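For reference, one pass of the cycle just described fits in a few lines of plain C++ (compilable as host code under nvcc). This is only an illustrative sketch of the algorithm, not the authors' baseline; the function names, row-major layout and the use of probabilities in the final hidden pass are assumptions.

```
#include <cmath>
#include <cstdlib>
#include <vector>

static float sigmoid(float e)      { return 1.0f / (1.0f + std::exp(-e)); }
static float sampleBinary(float p) { return (std::rand() / (float)RAND_MAX) < p ? 1.0f : 0.0f; }

// One CD-1 step: w is nV x nH (row-major), v holds the binary data vector.
void cd1Step(std::vector<float>& w, const std::vector<float>& v,
             int nV, int nH, float eps)
{
    std::vector<float> h(nH), vRec(nV), hRec(nH);

    // Up pass: E_j = sum_i w_{i,j} v_i, then h_j = sample(f(E_j))
    for (int j = 0; j < nH; ++j) {
        float e = 0.0f;
        for (int i = 0; i < nV; ++i) e += w[i * nH + j] * v[i];
        h[j] = sampleBinary(sigmoid(e));
    }
    // Down pass: E_i = sum_j w_{i,j} h_j, reconstruct v_i
    for (int i = 0; i < nV; ++i) {
        float e = 0.0f;
        for (int j = 0; j < nH; ++j) e += w[i * nH + j] * h[j];
        vRec[i] = sampleBinary(sigmoid(e));
    }
    // Second up pass on the reconstruction (probabilities kept here)
    for (int j = 0; j < nH; ++j) {
        float e = 0.0f;
        for (int i = 0; i < nV; ++i) e += w[i * nH + j] * vRec[i];
        hRec[j] = sigmoid(e);
    }
    // Simplified update: w_{i,j} += eps * (v_i h_j - v'_i h'_j)
    for (int i = 0; i < nV; ++i)
        for (int j = 0; j < nH; ++j)
            w[i * nH + j] += eps * (v[i] * h[j] - vRec[i] * hRec[j]);
}
```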

Page 27:

CUDA Implementation

● The project was divided into three kernels
– Matrix Operations

– Random Number Generation

– Sigmoid Function

Page 28:

Matrix Operations

● Node states, energies and weights can be represented as matrices

● Computation is dominated by matrix operations
● Matrix libraries were based on the examples in the SDK
– Additional optimization was achieved
– Extra care was taken to ensure memory accesses were coalesced and bank conflicts were avoided

Page 29:

Matrix Addition

C = ε(A + B)
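A minimal sketch of what such an element-wise kernel can look like; the library interface is not shown in the slides, so the name, flattened indexing and row-major layout are assumptions.

```
// C = eps * (A + B), one thread per element
__global__ void matAddScale(float* C, const float* A, const float* B,
                            float eps, int n)        // n = rows * cols
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        C[idx] = eps * (A[idx] + B[idx]);  // consecutive threads hit consecutive
                                           // addresses, so accesses coalesce
}
```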

Page 32:

Matrix Transpose

B = A^T
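A sketch in the style of the SDK transpose the slides build on: a tile is staged in shared memory so that both the read from A and the write to B are coalesced, and the +1 padding avoids shared-memory bank conflicts. The 16x16 tile size and names are assumptions.

```
#define TILE 16

// B = A^T for a row-major rows x cols matrix A
__global__ void matTranspose(float* B, const float* A, int rows, int cols)
{
    __shared__ float tile[TILE][TILE + 1];           // +1 pad breaks bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;         // column in A
    int y = blockIdx.y * TILE + threadIdx.y;         // row in A
    if (x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = A[y * cols + x];   // coalesced read
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;        // column in B (= row in A)
    int ty = blockIdx.x * TILE + threadIdx.y;        // row in B (= column in A)
    if (tx < rows && ty < cols)
        B[ty * rows + tx] = tile[threadIdx.x][threadIdx.y]; // coalesced write
}
```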

Page 35:

Matrix Multiply

C = BA
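For illustration only, a naive one-thread-per-output-element version of C = BA is shown below; the SDK-derived library the slides describe uses a tiled shared-memory multiply instead, which is omitted here for brevity. Matrix shapes and names are assumptions.

```
// B is m x k, A is k x n, C is m x n, all row-major
__global__ void matMul(float* C, const float* B, const float* A,
                       int m, int k, int n)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < m && col < n) {
        float acc = 0.0f;
        for (int p = 0; p < k; ++p)
            acc += B[row * k + p] * A[p * n + col];  // dot product of a row of B
        C[row * n + col] = acc;                      // with a column of A
    }
}
```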

Page 40:

Random Number Generator

● Mersenne Twister
– Pros
● Uses bitwise operations
● Long period (2^19937 − 1)
● High-dimensional equidistribution
● Efficient memory usage
– Cons
● Iterative
● Insecure – after N outputs, it's predictable

Page 41:

CUDA Implementation

● Launch many Mersenne Twisters simultaneously
– MT_RNG_COUNT = threads * blocks

● Initialization computes the per-thread configuration
– dcmt 0.4 library

● Can be time-consuming
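The one-generator-per-thread layout can be sketched as follows; a simple xorshift update stands in for the dcmt-configured Mersenne Twister (which is not reproduced here), and the strided output indexing is an assumption chosen to keep writes coalesced.

```
// Stand-in generator; each thread owns one state word, seeded non-zero on the host.
__device__ unsigned int xorshift32(unsigned int& s)
{
    s ^= s << 13; s ^= s >> 17; s ^= s << 5;
    return s;
}

__global__ void randomUniform(float* out, unsigned int* state, int nPerThread)
{
    int tid      = blockIdx.x * blockDim.x + threadIdx.x;  // one generator per thread
    int nThreads = gridDim.x * blockDim.x;                 // plays the role of MT_RNG_COUNT
    unsigned int s = state[tid];
    for (int k = 0; k < nPerThread; ++k)
        out[k * nThreads + tid] = xorshift32(s) * (1.0f / 4294967296.0f); // uniform [0,1)
    state[tid] = s;                                        // persist state for the next launch
}
```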

Page 42:

Sigmoid Function

● Road blocks
– Number Representation
● Fixed point
● Floating point

– Approximations

– Input and Output are floats

● 100% parallel: one-to-one data-to-output map
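Because the map is one-to-one, the native floating-point variant reduces to a trivial element-wise kernel; this sketch uses the fast __expf() intrinsic mentioned later, and the kernel name is an assumption.

```
// out[i] = 1 / (1 + e^(-in[i])), one thread per element
__global__ void sigmoidNative(float* out, const float* in, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = 1.0f / (1.0f + __expf(-in[idx]));
}
```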

Page 43:

Sigmoid Implementations

● Native floating point function
● Broken line approximation (9 segments)
● Second order approximation
● Precalculated texture (256 values)

[Figure: output of the Native, Linear, Second-order and Texture sigmoid implementations plotted over x = −10 … 10]
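As an illustration of the precalculated-table idea, the sketch below does a nearest-entry lookup into 256 stored sigmoid values in constant memory; the slides' version samples the table through a texture instead, and the clamping range, names and constant-memory placement here are assumptions.

```
#define TABLE_SIZE    256
#define SIGMOID_RANGE 10.0f                  // assumed clamping range

__constant__ float sigmoidTable[TABLE_SIZE]; // filled once from the host with
                                             // 1/(1+expf(-x)) via cudaMemcpyToSymbol
__global__ void sigmoidLookup(float* out, const float* in, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    // Map x in [-SIGMOID_RANGE, SIGMOID_RANGE] onto a table index and clamp.
    float t = (in[idx] + SIGMOID_RANGE) * (TABLE_SIZE / (2.0f * SIGMOID_RANGE));
    int   i = min(max(__float2int_rd(t), 0), TABLE_SIZE - 1);
    out[idx] = sigmoidTable[i];
}
```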

Page 44:

Results

● Software baseline
– Optimized Sequential C++ code
– Compiled with g++ version 4.1.2
● Flags = -O3
– 2.83 GHz Intel Core 2 Quad, 6 MB L2 cache
– 4 GB DDR2 RAM

● CUDA implementation
– GTX 280

Page 45:

Results

● RBM Properties
– 512 Nodes in Visible Layer

– 512 Nodes in Hidden Layer

– 256k Single-Precision Floating Point Weights

● Metrics
– Performance Ratio
– Error rate was not measured
● Different random number implementations give different results

Page 46:

Results

Computation Time

Type            CPU (ms)     GPU (ms)
Addition        0.93         0.05
Transpose       1.19         0.45
Multiply        249.10       1.50
256k Rand       6.02         0.59
24M Rand        540.89       8.18
Float           2158.36      1.80
Texture         2158.36      3.10
Lin. Approx     2158.36      1.92
2nd Order       2158.36      1.92
RBM Total       25661.60     389.80

Page 47:

Results

GPU vs CPU Speed-up

Type            Speed-up
Addition        17.19
Transpose       2.66
Multiply        166.4
256k Rand       10.26
24M Rand        66.09
Float           1199.33
Texture         696.99
Lin. Approx     1123.68
2nd Order       1122.81
RBM Total       65.83

Page 48:

Discussion

● End result: the program is much more efficient
● Programming evidence

– >25 individual hours of debugging code

– Memcpy in/out produces different results!

– Adding dummy Memcpy changes behaviour!

– One/Two consecutive prints change behaviour!

– Removed compiler optimizations

– Peppered code with cudaThreadSynchronize()

– Conclusion: parallelization bugs introduced by the compiler

Page 49:

Conclusion

● CUDA implementations are well suited for neural network applications
– A 65-fold speed-up was achieved for the 512x512 network

● The CUDA language could be more mature and bug-free

● Further optimization could still be gained from profiling and integrating the code

Page 50:

Performance

● For 262144 numbers
– Compared to CPU implementation of MT
● ~10.26x speed up
– Compared to optimized RNG on CPU
● ~1.67x speed up

● For 24 million numbers
– Compared to CPU implementation of MT
● ~66.09x speed up
– Compared to optimized RNG on CPU
● ~11.19x speed up

Page 51:

Sigmoid Performance

● Native __expf() is the fastest method
● Texture slowest
● Normalized to 1000 kernel calls
● Sigmoid of 24M numbers

Implementation            Time (ms)    Speed-up
CPU                       2158.36      N/A
Float                     1.7996       1199.33
Texture                   3.0967       696.99
Linear Approximation      1.9208       1123.68
Second Order              1.9223       1122.81