Basic Math for Neural Networks
Xiaodong Cui
IBM T. J. Watson Research Center, Yorktown Heights, NY 10598
Spring, 2017
Outline
• Feedforward neural networks
• Forward propagation
• Neural networks as universal approximators
• Back propagation
• Jacobian
• Vanishing gradient problem
Feedforward Neural Networks
[Figure: a feedforward network with an input layer x, hidden layers, and an output layer; each unit computes an activation a and an output z = h(a), with +1 bias units feeding each layer]
a_i^{(l)} = \sum_j w_{ij}^{(l)} z_j^{(l-1)} + b_i^{(l)}

z_i^{(l)} = h_i^{(l)}\big(a_i^{(l)}\big)
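Written out in code, one layer of forward propagation is a matrix-vector product plus a bias, followed by an elementwise nonlinearity. A minimal Torch sketch (the layer sizes and the choice of sigmoid are illustrative):

    require 'torch'

    local n_prev, n_cur = 4, 3               -- illustrative layer sizes
    local W = torch.randn(n_cur, n_prev)     -- weights w_ij of layer l
    local b = torch.randn(n_cur)             -- biases b_i
    local z_prev = torch.randn(n_prev)       -- previous layer's outputs z^(l-1)

    local a = torch.mv(W, z_prev) + b        -- a^(l)_i = sum_j w_ij z^(l-1)_j + b_i
    local z = torch.sigmoid(a)               -- z^(l)_i = h(a^(l)_i), h = sigmoid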
Activation Functions – Sigmoid

Also known as the logistic function:

h(a) = \frac{1}{1 + e^{-a}}

h'(a) = \frac{e^{-a}}{(1 + e^{-a})^2} = h(a)\big[1 - h(a)\big]
[Figure: the sigmoid h(a) (f) and its derivative (df/dx) plotted over a ∈ [−10, 10]; h(a) ranges over (0, 1)]
Activation Functions – Hyperbolic Tangent (tanh)
h(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}}

h'(a) = \frac{4}{(e^{a} + e^{-a})^2} = 1 - \big[h(a)\big]^2
[Figure: tanh h(a) (f) and its derivative (df/dx) plotted over a ∈ [−5, 5]; h(a) ranges over (−1, 1)]
Activation Functions – Rectified Linear Unit (ReLU)
h(a) = \max(0, a) =
\begin{cases}
a & \text{if } a > 0, \\
0 & \text{if } a \le 0.
\end{cases}

h'(a) =
\begin{cases}
1 & \text{if } a > 0, \\
0 & \text{if } a \le 0.
\end{cases}
[Figure: ReLU h(a) (f) and its derivative (df/dx) plotted over a ∈ [−3, 3]]
Activation Functions – Softmax
Also known as the normalized exponential function, a generalization of the logistic function to multiple classes:

h(a_k) = \frac{e^{a_k}}{\sum_j e^{a_j}}

\frac{\partial h(a_k)}{\partial a_j} = \frac{\partial h(a_k)}{\partial \log h(a_k)} \cdot \frac{\partial \log h(a_k)}{\partial a_j} = h(a_k)\big(\delta_{kj} - h(a_j)\big)

Why "softmax"?

\max\{x_1, \cdots, x_n\} \le \log\big(e^{x_1} + \cdots + e^{x_n}\big) \le \max\{x_1, \cdots, x_n\} + \log n
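The bound follows in one line: write m = \max\{x_1, \cdots, x_n\}, then bound the sum below by its largest term and above by n copies of it:

e^{m} \le e^{x_1} + \cdots + e^{x_n} \le n\,e^{m} \quad\Longrightarrow\quad m \le \log\big(e^{x_1} + \cdots + e^{x_n}\big) \le m + \log n

So the log-sum-exp behaves as a smooth ("soft") maximum, and its gradient with respect to x_k is exactly the softmax.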
Loss Functions
• Regression
• Classification
Regression: Least Square Error

Let \hat{y}_{nk} and y_{nk} be the output and target, respectively. For regression, the typical loss function is the least square error (LSE):

L(\hat{y}, y) = \frac{1}{2} \sum_n \|\hat{y}_n - y_n\|^2 = \frac{1}{2} \sum_n \sum_k (\hat{y}_{nk} - y_{nk})^2

where n is the sample index and k is the dimension index. The derivative of LSE with respect to each dimension of each sample:

\frac{\partial L(\hat{y}, y)}{\partial \hat{y}_{nk}} = \hat{y}_{nk} - y_{nk}
Classification: Cross-Entropy (CE)

Suppose there are two (discrete) probability distributions p and q; the "distance" between the two distributions can be measured by the Kullback-Leibler divergence:

D_{KL}(p \,\|\, q) = \sum_k p_k \log \frac{p_k}{q_k}

For classification, the targets y_n are given:

y_n = \{y_{n1}, \cdots, y_{nk}, \cdots, y_{nK}\}

The predictions after the softmax layer are posterior probabilities after normalization:

\hat{y}_n = \{\hat{y}_{n1}, \cdots, \hat{y}_{nk}, \cdots, \hat{y}_{nK}\}

Now measure the "distance" between these two distributions, summed over all samples:

\sum_n D_{KL}(y_n \,\|\, \hat{y}_n) = \sum_n \sum_k y_{nk} \log \frac{y_{nk}}{\hat{y}_{nk}} = \sum_n \sum_k y_{nk} \log y_{nk} - \sum_n \sum_k y_{nk} \log \hat{y}_{nk} = -\sum_n \sum_k y_{nk} \log \hat{y}_{nk} + C

where C = \sum_n \sum_k y_{nk} \log y_{nk} does not depend on the predictions.
Classification: Cross-Entropy (CE)

Cross-entropy as the loss function:

L(\hat{y}, y) = -\sum_n \sum_k y_{nk} \log \hat{y}_{nk}

Typically, targets are given in the form of 1-of-K coding:

y_n = \{0, \cdots, 1, \cdots, 0\}

where

y_{nk} =
\begin{cases}
1 & \text{if } k = k_n, \\
0 & \text{otherwise.}
\end{cases}

It follows that the cross-entropy has an even simpler form:

L(\hat{y}, y) = -\sum_n \log \hat{y}_{n k_n}
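As a quick numeric illustration (the posteriors are invented for the example): a sample whose correct class k_n receives posterior \hat{y}_{n k_n} = 0.2 contributes

-\log \hat{y}_{n k_n} = -\log 0.2 \approx 1.61

to the loss, while a confident correct prediction of 0.9 contributes only -\log 0.9 \approx 0.11.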
Classification: Cross-Entropy (CE)

The derivative of the composite function of cross-entropy and softmax:

L(\hat{y}, y) = \sum_n E_n = \sum_n \Big( -\sum_i y_{ni} \log \hat{y}_{ni} \Big), \qquad \hat{y}_{ni} = \frac{e^{a_{ni}}}{\sum_j e^{a_{nj}}}

\frac{\partial E_n}{\partial a_{nk}} = \frac{\partial}{\partial a_{nk}} \Big( -\sum_i y_{ni} \log \hat{y}_{ni} \Big) = -\sum_i y_{ni} \frac{\partial \log \hat{y}_{ni}}{\partial a_{nk}}

= -\sum_i y_{ni} \frac{\partial \log \hat{y}_{ni}}{\partial \hat{y}_{ni}} \frac{\partial \hat{y}_{ni}}{\partial a_{nk}} = -\sum_i \frac{y_{ni}}{\hat{y}_{ni}} \frac{\partial \hat{y}_{ni}}{\partial a_{nk}}

= -\sum_i \frac{y_{ni}}{\hat{y}_{ni}} \cdot \hat{y}_{ni} (\delta_{ik} - \hat{y}_{nk}) = -\sum_i y_{ni} (\delta_{ik} - \hat{y}_{nk})

= -\sum_i y_{ni} \delta_{ik} + \hat{y}_{nk} \sum_i y_{ni} = -y_{nk} + \hat{y}_{nk} = \hat{y}_{nk} - y_{nk}

using the softmax derivative \partial \hat{y}_{ni} / \partial a_{nk} = \hat{y}_{ni} (\delta_{ik} - \hat{y}_{nk}) from the earlier slide and \sum_i y_{ni} = 1.
What is the derivative for an arbitrary differentiable output function? This is left as an exercise.
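The result \partial E_n / \partial a_{nk} = \hat{y}_{nk} - y_{nk} is easy to confirm numerically. A minimal Torch7 sketch (the dimension K and the target class are invented; it uses the nn package's LogSoftMax + ClassNLLCriterion, which together implement softmax + cross-entropy):

    require 'nn'

    local K = 5
    local a = torch.randn(K)                 -- activations a_nk
    local target = 3                         -- target class k_n

    local softmax = nn.LogSoftMax()
    local ce = nn.ClassNLLCriterion()

    local logy = softmax:forward(a)
    ce:forward(logy, target)
    -- backprop through the composite to get dE/da
    local grad = softmax:backward(a, ce:backward(logy, target))

    local ref = torch.exp(logy)              -- yhat
    ref[target] = ref[target] - 1            -- yhat - y with a 1-of-K target
    print((grad - ref):abs():max())          -- ~0 up to rounding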
A DNN Example In Torch
Algorithm 1: Definition of a DNN with 3 hidden layers

function create_model()
    -- MODEL
    local n_inputs  = 460
    local n_outputs = 3000
    local n_hidden  = 1024
    local model = nn.Sequential()
    model:add(nn.Linear(n_inputs, n_hidden))
    model:add(nn.Sigmoid())
    model:add(nn.Linear(n_hidden, n_hidden))
    model:add(nn.Sigmoid())
    model:add(nn.Linear(n_hidden, n_hidden))
    model:add(nn.Sigmoid())
    model:add(nn.Linear(n_hidden, n_outputs))
    model:add(nn.LogSoftMax())

    -- LOSS FUNCTION
    local criterion = nn.ClassNLLCriterion()

    return model:cuda(), criterion:cuda()
end
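A sketch of how the model and criterion might be used for one SGD step (the batch size, random data, and learning rate are invented for illustration; requires the nn and cunn packages):

    require 'nn'
    require 'cunn'

    local model, criterion = create_model()
    local params, gradParams = model:getParameters()

    local inputs  = torch.randn(16, 460):cuda()              -- a mini-batch
    local targets = torch.LongTensor(16):random(3000):cuda() -- class labels

    -- forward: average negative log-likelihood of the batch
    local outputs = model:forward(inputs)
    local loss = criterion:forward(outputs, targets)

    -- backward: accumulate gradients, then one SGD update
    gradParams:zero()
    model:backward(inputs, criterion:backward(outputs, targets))
    params:add(-0.01, gradParams)                             -- w <- w - eta * grad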
Neural Networks As Universal Approximators

Universal Approximation Theorem [1][2][3]: Let ϕ(·) be a nonconstant, bounded, and monotonically increasing continuous function. Let I_n represent the n-dimensional unit cube [0, 1]^n and C(I_n) the space of continuous functions on I_n. Then, given any function f ∈ C(I_n) and ϵ > 0, there exist an integer N and real constants a_i, b_i ∈ ℝ and w_i ∈ ℝ^n, where i = 1, 2, ⋯, N, such that one may define

F(x) = \sum_{i=1}^{N} a_i \phi\big(w_i^T x + b_i\big)

as an approximate realization of the function f (where f is independent of ϕ); that is,

|F(x) - f(x)| < \epsilon

for all x ∈ I_n.
[1] Cybenko, G. (1989). "Approximation by Superpositions of a Sigmoidal Function," Math. Control Signals Systems, 2, 303-314.
[2] Hornik, K., Stinchcombe, M., and White, H. (1989). "Multilayer Feedforward Networks are Universal Approximators," Neural Networks, 2(5), 359-366.
[3] Hornik, K. (1991). "Approximation Capabilities of Multilayer Feedforward Networks," Neural Networks, 4(2), 251-257.
Neural Networks As Universal Classifiers

Extend f(x) to decision functions of the form [1]:

f(x) = j \iff x \in P_j, \quad j = 1, 2, \cdots, K

where the P_j partition A_n into K disjoint measurable subsets and A_n is a compact subset of ℝ^n.

Arbitrary decision regions can be arbitrarily well approximated by continuous feedforward neural networks with only a single internal, hidden layer and any continuous sigmoidal nonlinearity!

[1] Cybenko, G. (1989). "Approximation by Superpositions of a Sigmoidal Function," Math. Control Signals Systems, 2, 303-314.
[2] Huang, W. Y. and Lippmann, R. P. (1988). "Neural Nets and Traditional Classifiers," in Neural Information Processing Systems (Denver 1987), D. Z. Anderson, Editor. American Institute of Physics, New York, 387-396.
Back Propagation
[Figure: a chain of units j → i → k with weights w_{ij} (from j into i) and w_{ki} (from i into k), feeding into the loss L]

How do we compute the gradient of the loss function L with respect to a weight w_{ij}?

\frac{\partial L}{\partial w_{ij}}

D. E. Rumelhart, G. E. Hinton and R. J. Williams (1986). "Learning representations by back-propagating errors," Nature, vol. 323, 533-536.
Back Propagation
a_i^{(l)} = \sum_j w_{ij}^{(l)} z_j^{(l-1)}, \qquad z_i^{(l)} = h_i^{(l)}\big(a_i^{(l)}\big)

\frac{\partial L}{\partial w_{ij}^{(l)}} = \frac{\partial L}{\partial a_i^{(l)}} \frac{\partial a_i^{(l)}}{\partial w_{ij}^{(l)}}

It follows that

\frac{\partial a_i^{(l)}}{\partial w_{ij}^{(l)}} = z_j^{(l-1)}

Define

\delta_i^{(l)} \triangleq \frac{\partial L}{\partial a_i^{(l)}}

The \delta_i are often referred to as "errors". By the chain rule,

\delta_i^{(l)} = \frac{\partial L}{\partial a_i^{(l)}} = \sum_k \frac{\partial L}{\partial a_k^{(l+1)}} \frac{\partial a_k^{(l+1)}}{\partial a_i^{(l)}} = \sum_k \delta_k^{(l+1)} \frac{\partial a_k^{(l+1)}}{\partial a_i^{(l)}}
Back Propagation
a_k^{(l+1)} = \sum_i w_{ki}^{(l+1)} h\big(a_i^{(l)}\big)

therefore

\frac{\partial a_k^{(l+1)}}{\partial a_i^{(l)}} = w_{ki}^{(l+1)} h'\big(a_i^{(l)}\big)

It follows that

\delta_i^{(l)} = h'\big(a_i^{(l)}\big) \sum_k w_{ki}^{(l+1)} \delta_k^{(l+1)}
which tells us that the errors can be evaluated recursively.
Back Propagation: An Example
• A neural network with multiple layers
• Softmax output layer
• Cross-entropy loss function
• Forward propagation:
  ◦ push the input x_n through the network to get the activations and outputs of the hidden layers:
    a_i^{(l)} = \sum_j w_{ij}^{(l)} z_j^{(l-1)}, \qquad z_i^{(l)} = h_i^{(l)}\big(a_i^{(l)}\big)
• Back propagation (a code sketch follows this list):
  ◦ start from the output layer:
    \delta_k^{(L)} = \hat{y}_{nk} - y_{nk}
  ◦ back-propagate the errors from the output layer all the way down to the input layer:
    \delta_i^{(l)} = h'\big(a_i^{(l)}\big) \sum_k w_{ki}^{(l+1)} \delta_k^{(l+1)}
  ◦ evaluate the gradients:
    \frac{\partial L}{\partial w_{ij}^{(l)}} = \delta_i^{(l)} z_j^{(l-1)}
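A minimal sketch of these steps in Torch for a single hidden layer (sigmoid hidden units, softmax + cross-entropy output; the sizes and data are invented, and biases are omitted to match the slide's equations):

    require 'torch'

    local n_in, n_hid, n_out = 4, 8, 3
    local W1 = torch.randn(n_hid, n_in) * 0.1
    local W2 = torch.randn(n_out, n_hid) * 0.1
    local x = torch.randn(n_in)
    local y = torch.zeros(n_out); y[2] = 1      -- 1-of-K target

    -- forward propagation
    local a1 = torch.mv(W1, x)                  -- a^(1)
    local z1 = torch.sigmoid(a1)                -- z^(1) = h(a^(1))
    local a2 = torch.mv(W2, z1)                 -- output activations
    local yhat = torch.exp(a2 - a2:max())       -- softmax, numerically stabilized
    yhat:div(yhat:sum())

    -- back propagation
    local d2 = yhat - y                         -- delta^(L) = yhat - y
    local h1p = z1 - torch.cmul(z1, z1)         -- h'(a) = z(1 - z) for the sigmoid
    local d1 = torch.cmul(h1p, torch.mv(W2:t(), d2))  -- delta^(1)

    -- gradients dL/dW^(l) = delta^(l) (z^(l-1))^T, as outer products
    local gW2 = torch.ger(d2, z1)
    local gW1 = torch.ger(d1, x)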
Gradient Checking

In practice, when you implement the gradient of a network or a module, one way to check the correctness of your implementation is gradient checking against a (symmetric) finite difference:

\frac{\partial L}{\partial w_{ij}} \approx \frac{L(w_{ij} + \epsilon) - L(w_{ij} - \epsilon)}{2\epsilon}

Check the "closeness" of the two gradients (your own implementation g_1 and the one from the above difference g_2) via the relative error

\frac{\|g_1 - g_2\|}{\|g_1\| + \|g_2\|}
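A minimal sketch of this check in Torch7, spot-checking a few randomly picked weights against the finite difference (the model, criterion, input x, and target t are assumed to exist, e.g. from the earlier create_model example):

    local params, gradParams = model:getParameters()

    -- analytic gradient via backprop
    gradParams:zero()
    local output = model:forward(x)
    criterion:forward(output, t)
    model:backward(x, criterion:backward(output, t))

    -- numeric gradient for a handful of randomly picked weights
    local eps = 1e-4
    for _ = 1, 10 do
      local i = torch.random(1, params:nElement())
      local w0 = params[i]
      params[i] = w0 + eps
      local Lp = criterion:forward(model:forward(x), t)
      params[i] = w0 - eps
      local Lm = criterion:forward(model:forward(x), t)
      params[i] = w0                          -- restore the weight
      local g_num = (Lp - Lm) / (2 * eps)
      print(i, gradParams[i], g_num)          -- the two should agree closely
    end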
Jacobian Matrix

[Figure: a (sub-)network mapping inputs x_i to outputs y_k]

• measures the sensitivity of the outputs with respect to the inputs of a (sub-)network
• can be used as a modular operation for error backprop in a larger network
Jacobian Matrix

The Jacobian matrix can be computed using a similar back-propagation procedure:

\frac{\partial y_k}{\partial x_i} = \sum_l \frac{\partial y_k}{\partial a_l} \frac{\partial a_l}{\partial x_i}

where the a_l are the activations of the units with immediate connections to the inputs x_i. Hence

\frac{\partial y_k}{\partial x_i} = \sum_l w_{li} \frac{\partial y_k}{\partial a_l}

Analogous to the error propagation, define

\delta_{kl} \triangleq \frac{\partial y_k}{\partial a_l}

and it can be computed recursively:

\delta_{kl} = \frac{\partial y_k}{\partial a_l} = \sum_j \frac{\partial y_k}{\partial a_j} \frac{\partial a_j}{\partial a_l} = \sum_j \frac{\partial y_k}{\partial a_j} \big[w_{jl} h'(a_l)\big] = \Big( \sum_j \delta_{kj} w_{jl} \Big) h'(a_l)

where j sweeps all the units connected to unit l through the weights w_{jl}.
What is the Jacobian matrix of the softmax outputs with respect to the inputs? This is left as an exercise.
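With nn modules, each row of the Jacobian can be obtained by one backward pass per output unit, feeding in a 1-of-K gradOutput. A minimal Torch7 sketch (the model and input x are assumed given; updateGradInput backprops to the inputs without touching the weight gradients):

    local y = model:forward(x)
    local K, n = y:nElement(), x:nElement()
    local J = torch.Tensor(K, n)
    local gradOutput = torch.zeros(K)
    for k = 1, K do
      gradOutput:zero()
      gradOutput[k] = 1                       -- pick out output unit k
      -- row k of J: dy_k / dx_i for all inputs i
      J[k]:copy(model:updateGradInput(x, gradOutput))
    end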
Vanishing Gradients
function   | nonlinearity                           | gradient
-----------|----------------------------------------|---------------------------------
sigmoid    | x = 1/(1 + e^{-a})                     | ∂L/∂a = x(1 - x) ∂L/∂x
tanh       | x = (e^{a} - e^{-a})/(e^{a} + e^{-a})  | ∂L/∂a = (1 - x^2) ∂L/∂x
softsign   | x = a/(1 + |a|)                        | ∂L/∂a = (1 - |x|)^2 ∂L/∂x

What's the problem? Since |x| ≤ 1, each multiplying factor above is at most 1, so

|∂L/∂a| ≤ |∂L/∂x|

Nonlinearity causes vanishing gradients.
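A tiny Torch7 demonstration of the effect (the depth of 10 is arbitrary): a unit gradient pushed back through a stack of sigmoids shrinks by a factor of at most 1/4 per layer.

    require 'nn'

    local model = nn.Sequential()
    for _ = 1, 10 do model:add(nn.Sigmoid()) end

    local x = torch.randn(1)
    model:forward(x)
    -- backprop a unit gradient through all 10 sigmoids
    local grad = model:backward(x, torch.ones(1))
    print(grad[1])   -- tiny: each layer multiplies by x(1 - x) <= 0.25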