Basic Math for Neural Networks
Xiaodong Cui
IBM T. J. Watson Research Center, Yorktown Heights, NY 10598
Spring, 2017
Outline
• Feedforward neural networks
• Forward propagation
• Neural networks as universal approximators
• Back propagation
• Jacobian
• Vanishing gradient problem
Feedforward Neural Networks
[Figure: a feedforward network with an input layer x, hidden layers, and an output layer; each unit computes an activation a and an output z = h(a), with +1 bias units feeding each layer]
a_i^{(l)} = \sum_j w_{ij}^{(l)} z_j^{(l-1)} + b_i^{(l)}

z_i^{(l)} = h_i^{(l)}\big(a_i^{(l)}\big)
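Written out in code, one layer of forward propagation is a matrix-vector product plus a bias, followed by an elementwise nonlinearity. A minimal Torch sketch (the layer sizes and the choice of sigmoid are illustrative):

    require 'torch'

    local n_prev, n_cur = 4, 3               -- illustrative layer sizes
    local W = torch.randn(n_cur, n_prev)     -- weights w_ij of layer l
    local b = torch.randn(n_cur)             -- biases b_i
    local z_prev = torch.randn(n_prev)       -- previous layer's outputs z^(l-1)

    local a = torch.mv(W, z_prev) + b        -- a^(l)_i = sum_j w_ij z^(l-1)_j + b_i
    local z = torch.sigmoid(a)               -- z^(l)_i = h(a^(l)_i), h = sigmoid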
Activation Functions – Sigmoid

Also known as the logistic function:

h(a) = \frac{1}{1 + e^{-a}}

h'(a) = \frac{e^{-a}}{(1 + e^{-a})^2} = h(a)\big[1 - h(a)\big]
[Figure: the sigmoid h(a) (f) and its derivative (df/dx) plotted over a ∈ [−10, 10]; h(a) ranges over (0, 1)]
Activation Functions – Hyperbolic Tangent (tanh)
h(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}}

h'(a) = \frac{4}{(e^{a} + e^{-a})^2} = 1 - \big[h(a)\big]^2
[Figure: tanh h(a) (f) and its derivative (df/dx) plotted over a ∈ [−5, 5]; h(a) ranges over (−1, 1)]
Activation Functions – Rectified Linear Unit (ReLU)
h(a) = \max(0, a) =
\begin{cases}
a & \text{if } a > 0, \\
0 & \text{if } a \le 0.
\end{cases}

h'(a) =
\begin{cases}
1 & \text{if } a > 0, \\
0 & \text{if } a \le 0.
\end{cases}
[Figure: ReLU h(a) (f) and its derivative (df/dx) plotted over a ∈ [−3, 3]]
Activation Functions – Softmax
Also known as the normalized exponential function, a generalization of the logistic function to multiple classes:

h(a_k) = \frac{e^{a_k}}{\sum_j e^{a_j}}

\frac{\partial h(a_k)}{\partial a_j} = \frac{\partial h(a_k)}{\partial \log h(a_k)} \cdot \frac{\partial \log h(a_k)}{\partial a_j} = h(a_k)\big(\delta_{kj} - h(a_j)\big)

Why "softmax"?

\max\{x_1, \cdots, x_n\} \le \log\big(e^{x_1} + \cdots + e^{x_n}\big) \le \max\{x_1, \cdots, x_n\} + \log n
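The bound follows in one line: write m = \max\{x_1, \cdots, x_n\}, then bound the sum below by its largest term and above by n copies of it:

e^{m} \le e^{x_1} + \cdots + e^{x_n} \le n\,e^{m} \quad\Longrightarrow\quad m \le \log\big(e^{x_1} + \cdots + e^{x_n}\big) \le m + \log n

So the log-sum-exp behaves as a smooth ("soft") maximum, and its gradient with respect to x_k is exactly the softmax.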
Loss Functions
• Regression
• Classification
Regression: Least Square Error

Let \hat{y}_{nk} and y_{nk} be the output and target, respectively. For regression, the typical loss function is the least square error (LSE):

L(\hat{y}, y) = \frac{1}{2} \sum_n \|\hat{y}_n - y_n\|^2 = \frac{1}{2} \sum_n \sum_k (\hat{y}_{nk} - y_{nk})^2

where n is the sample index and k is the dimension index. The derivative of LSE with respect to each dimension of each sample:

\frac{\partial L(\hat{y}, y)}{\partial \hat{y}_{nk}} = \hat{y}_{nk} - y_{nk}
Classification: Cross-Entropy (CE)

Suppose there are two (discrete) probability distributions p and q; the "distance" between the two distributions can be measured by the Kullback-Leibler divergence:

D_{KL}(p \,\|\, q) = \sum_k p_k \log \frac{p_k}{q_k}

For classification, the targets y_n are given:

y_n = \{y_{n1}, \cdots, y_{nk}, \cdots, y_{nK}\}

The predictions after the softmax layer are posterior probabilities after normalization:

\hat{y}_n = \{\hat{y}_{n1}, \cdots, \hat{y}_{nk}, \cdots, \hat{y}_{nK}\}

Now measure the "distance" between these two distributions, summed over all samples:

\sum_n D_{KL}(y_n \,\|\, \hat{y}_n) = \sum_n \sum_k y_{nk} \log \frac{y_{nk}}{\hat{y}_{nk}} = \sum_n \sum_k y_{nk} \log y_{nk} - \sum_n \sum_k y_{nk} \log \hat{y}_{nk} = -\sum_n \sum_k y_{nk} \log \hat{y}_{nk} + C

where C = \sum_n \sum_k y_{nk} \log y_{nk} does not depend on the predictions.
Classification: Cross-Entropy (CE)

Cross-entropy as the loss function:

L(\hat{y}, y) = -\sum_n \sum_k y_{nk} \log \hat{y}_{nk}

Typically, targets are given in the form of 1-of-K coding:

y_n = \{0, \cdots, 1, \cdots, 0\}

where

y_{nk} =
\begin{cases}
1 & \text{if } k = k_n, \\
0 & \text{otherwise.}
\end{cases}

It follows that the cross-entropy has an even simpler form:

L(\hat{y}, y) = -\sum_n \log \hat{y}_{n k_n}
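As a quick numeric illustration (the posteriors are invented for the example): a sample whose correct class k_n receives posterior \hat{y}_{n k_n} = 0.2 contributes

-\log \hat{y}_{n k_n} = -\log 0.2 \approx 1.61

to the loss, while a confident correct prediction of 0.9 contributes only -\log 0.9 \approx 0.11.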
Classification: Cross-Entropy (CE)

The derivative of the composite function of cross-entropy and softmax:

L(\hat{y}, y) = \sum_n E_n = \sum_n \Big( -\sum_i y_{ni} \log \hat{y}_{ni} \Big), \qquad \hat{y}_{ni} = \frac{e^{a_{ni}}}{\sum_j e^{a_{nj}}}

\frac{\partial E_n}{\partial a_{nk}} = \frac{\partial}{\partial a_{nk}} \Big( -\sum_i y_{ni} \log \hat{y}_{ni} \Big) = -\sum_i y_{ni} \frac{\partial \log \hat{y}_{ni}}{\partial a_{nk}}

= -\sum_i y_{ni} \frac{\partial \log \hat{y}_{ni}}{\partial \hat{y}_{ni}} \frac{\partial \hat{y}_{ni}}{\partial a_{nk}} = -\sum_i \frac{y_{ni}}{\hat{y}_{ni}} \frac{\partial \hat{y}_{ni}}{\partial a_{nk}}

= -\sum_i \frac{y_{ni}}{\hat{y}_{ni}} \cdot \hat{y}_{ni} (\delta_{ik} - \hat{y}_{nk}) = -\sum_i y_{ni} (\delta_{ik} - \hat{y}_{nk})

= -\sum_i y_{ni} \delta_{ik} + \hat{y}_{nk} \sum_i y_{ni} = -y_{nk} + \hat{y}_{nk} = \hat{y}_{nk} - y_{nk}

using the softmax derivative \partial \hat{y}_{ni} / \partial a_{nk} = \hat{y}_{ni} (\delta_{ik} - \hat{y}_{nk}) from the earlier slide and \sum_i y_{ni} = 1.
What is the derivative for an arbitrary differentiable output function? This is left as an exercise.
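The result \partial E_n / \partial a_{nk} = \hat{y}_{nk} - y_{nk} is easy to confirm numerically. A minimal Torch7 sketch (the dimension K and the target class are invented; it uses the nn package's LogSoftMax + ClassNLLCriterion, which together implement softmax + cross-entropy):

    require 'nn'

    local K = 5
    local a = torch.randn(K)                 -- activations a_nk
    local target = 3                         -- target class k_n

    local softmax = nn.LogSoftMax()
    local ce = nn.ClassNLLCriterion()

    local logy = softmax:forward(a)
    ce:forward(logy, target)
    -- backprop through the composite to get dE/da
    local grad = softmax:backward(a, ce:backward(logy, target))

    local ref = torch.exp(logy)              -- yhat
    ref[target] = ref[target] - 1            -- yhat - y with a 1-of-K target
    print((grad - ref):abs():max())          -- ~0 up to rounding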
A DNN Example In Torch
Algorithm 1: Definition of a DNN with 3 hidden layers

function create_model()
    -- MODEL
    local n_inputs  = 460
    local n_outputs = 3000
    local n_hidden  = 1024
    local model = nn.Sequential()
    model:add(nn.Linear(n_inputs, n_hidden))
    model:add(nn.Sigmoid())
    model:add(nn.Linear(n_hidden, n_hidden))
    model:add(nn.Sigmoid())
    model:add(nn.Linear(n_hidden, n_hidden))
    model:add(nn.Sigmoid())
    model:add(nn.Linear(n_hidden, n_outputs))
    model:add(nn.LogSoftMax())

    -- LOSS FUNCTION
    local criterion = nn.ClassNLLCriterion()

    return model:cuda(), criterion:cuda()
end
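A sketch of how the model and criterion might be used for one SGD step (the batch size, random data, and learning rate are invented for illustration; requires the nn and cunn packages):

    require 'nn'
    require 'cunn'

    local model, criterion = create_model()
    local params, gradParams = model:getParameters()

    local inputs  = torch.randn(16, 460):cuda()              -- a mini-batch
    local targets = torch.LongTensor(16):random(3000):cuda() -- class labels

    -- forward: average negative log-likelihood of the batch
    local outputs = model:forward(inputs)
    local loss = criterion:forward(outputs, targets)

    -- backward: accumulate gradients, then one SGD update
    gradParams:zero()
    model:backward(inputs, criterion:backward(outputs, targets))
    params:add(-0.01, gradParams)                             -- w <- w - eta * grad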
Neural Networks As Universal Approximators

Universal Approximation Theorem [1][2][3]: Let ϕ(·) be a nonconstant, bounded, and monotonically increasing continuous function. Let I_n represent the n-dimensional unit cube [0, 1]^n and C(I_n) the space of continuous functions on I_n. Then, given any function f ∈ C(I_n) and ϵ > 0, there exist an integer N and real constants a_i, b_i ∈ ℝ and w_i ∈ ℝ^n, where i = 1, 2, ⋯, N, such that one may define

F(x) = \sum_{i=1}^{N} a_i \phi\big(w_i^T x + b_i\big)

as an approximate realization of the function f (where f is independent of ϕ); that is,

|F(x) - f(x)| < \epsilon

for all x ∈ I_n.
[1] Cybenko, G. (1989). "Approximation by Superpositions of a Sigmoidal Function," Math. Control Signals Systems, 2, 303-314.
[2] Hornik, K., Stinchcombe, M., and White, H. (1989). "Multilayer Feedforward Networks are Universal Approximators," Neural Networks, 2(5), 359-366.
[3] Hornik, K. (1991). "Approximation Capabilities of Multilayer Feedforward Networks," Neural Networks, 4(2), 251-257.
Neural Networks As Universal Classifiers

Extend f(x) to decision functions of the form [1]:

f(x) = j \iff x \in P_j, \quad j = 1, 2, \cdots, K

where the P_j partition A_n into K disjoint measurable subsets and A_n is a compact subset of ℝ^n.

Arbitrary decision regions can be arbitrarily well approximated by continuous feedforward neural networks with only a single internal, hidden layer and any continuous sigmoidal nonlinearity!

[1] Cybenko, G. (1989). "Approximation by Superpositions of a Sigmoidal Function," Math. Control Signals Systems, 2, 303-314.
[2] Huang, W. Y. and Lippmann, R. P. (1988). "Neural Nets and Traditional Classifiers," in Neural Information Processing Systems (Denver 1987), D. Z. Anderson, Editor. American Institute of Physics, New York, 387-396.
Back Propagation
[Figure: a chain of units j → i → k with weights w_{ij} (from j into i) and w_{ki} (from i into k), feeding into the loss L]

How do we compute the gradient of the loss function L with respect to a weight w_{ij}?

\frac{\partial L}{\partial w_{ij}}

D. E. Rumelhart, G. E. Hinton and R. J. Williams (1986). "Learning representations by back-propagating errors," Nature, vol. 323, 533-536.
Back Propagation
a_i^{(l)} = \sum_j w_{ij}^{(l)} z_j^{(l-1)}, \qquad z_i^{(l)} = h_i^{(l)}\big(a_i^{(l)}\big)

\frac{\partial L}{\partial w_{ij}^{(l)}} = \frac{\partial L}{\partial a_i^{(l)}} \frac{\partial a_i^{(l)}}{\partial w_{ij}^{(l)}}

It follows that

\frac{\partial a_i^{(l)}}{\partial w_{ij}^{(l)}} = z_j^{(l-1)}

Define

\delta_i^{(l)} \triangleq \frac{\partial L}{\partial a_i^{(l)}}

The \delta_i are often referred to as "errors". By the chain rule,

\delta_i^{(l)} = \frac{\partial L}{\partial a_i^{(l)}} = \sum_k \frac{\partial L}{\partial a_k^{(l+1)}} \frac{\partial a_k^{(l+1)}}{\partial a_i^{(l)}} = \sum_k \delta_k^{(l+1)} \frac{\partial a_k^{(l+1)}}{\partial a_i^{(l)}}
Back Propagation
a_k^{(l+1)} = \sum_i w_{ki}^{(l+1)} h\big(a_i^{(l)}\big)

therefore

\frac{\partial a_k^{(l+1)}}{\partial a_i^{(l)}} = w_{ki}^{(l+1)} h'\big(a_i^{(l)}\big)

It follows that

\delta_i^{(l)} = h'\big(a_i^{(l)}\big) \sum_k w_{ki}^{(l+1)} \delta_k^{(l+1)}
which tells us that the errors can be evaluated recursively.
Back Propagation: An Example
• A neural network with multiple layers
• Softmax output layer
• Cross-entropy loss function
• Forward propagation:
  ◦ push the input x_n through the network to get the activations and outputs of the hidden layers:
    a_i^{(l)} = \sum_j w_{ij}^{(l)} z_j^{(l-1)}, \qquad z_i^{(l)} = h_i^{(l)}\big(a_i^{(l)}\big)
• Back propagation (a code sketch follows this list):
  ◦ start from the output layer:
    \delta_k^{(L)} = \hat{y}_{nk} - y_{nk}
  ◦ back-propagate the errors from the output layer all the way down to the input layer:
    \delta_i^{(l)} = h'\big(a_i^{(l)}\big) \sum_k w_{ki}^{(l+1)} \delta_k^{(l+1)}
  ◦ evaluate the gradients:
    \frac{\partial L}{\partial w_{ij}^{(l)}} = \delta_i^{(l)} z_j^{(l-1)}
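A minimal sketch of these steps in Torch for a single hidden layer (sigmoid hidden units, softmax + cross-entropy output; the sizes and data are invented, and biases are omitted to match the slide's equations):

    require 'torch'

    local n_in, n_hid, n_out = 4, 8, 3
    local W1 = torch.randn(n_hid, n_in) * 0.1
    local W2 = torch.randn(n_out, n_hid) * 0.1
    local x = torch.randn(n_in)
    local y = torch.zeros(n_out); y[2] = 1      -- 1-of-K target

    -- forward propagation
    local a1 = torch.mv(W1, x)                  -- a^(1)
    local z1 = torch.sigmoid(a1)                -- z^(1) = h(a^(1))
    local a2 = torch.mv(W2, z1)                 -- output activations
    local yhat = torch.exp(a2 - a2:max())       -- softmax, numerically stabilized
    yhat:div(yhat:sum())

    -- back propagation
    local d2 = yhat - y                         -- delta^(L) = yhat - y
    local h1p = z1 - torch.cmul(z1, z1)         -- h'(a) = z(1 - z) for the sigmoid
    local d1 = torch.cmul(h1p, torch.mv(W2:t(), d2))  -- delta^(1)

    -- gradients dL/dW^(l) = delta^(l) (z^(l-1))^T, as outer products
    local gW2 = torch.ger(d2, z1)
    local gW1 = torch.ger(d1, x)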
Gradient Checking

In practice, when you implement the gradient of a network or a module, one way to check the correctness of your implementation is gradient checking against a (symmetric) finite difference:

\frac{\partial L}{\partial w_{ij}} \approx \frac{L(w_{ij} + \epsilon) - L(w_{ij} - \epsilon)}{2\epsilon}

Check the "closeness" of the two gradients (your own implementation g_1 and the one from the above difference g_2) via the relative error

\frac{\|g_1 - g_2\|}{\|g_1\| + \|g_2\|}
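A minimal sketch of this check in Torch7, spot-checking a few randomly picked weights against the finite difference (the model, criterion, input x, and target t are assumed to exist, e.g. from the earlier create_model example):

    local params, gradParams = model:getParameters()

    -- analytic gradient via backprop
    gradParams:zero()
    local output = model:forward(x)
    criterion:forward(output, t)
    model:backward(x, criterion:backward(output, t))

    -- numeric gradient for a handful of randomly picked weights
    local eps = 1e-4
    for _ = 1, 10 do
      local i = torch.random(1, params:nElement())
      local w0 = params[i]
      params[i] = w0 + eps
      local Lp = criterion:forward(model:forward(x), t)
      params[i] = w0 - eps
      local Lm = criterion:forward(model:forward(x), t)
      params[i] = w0                          -- restore the weight
      local g_num = (Lp - Lm) / (2 * eps)
      print(i, gradParams[i], g_num)          -- the two should agree closely
    end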
Jacobian Matrix

[Figure: a (sub-)network mapping inputs x_i to outputs y_k]

• measures the sensitivity of the outputs with respect to the inputs of a (sub-)network
• can be used as a modular operation for error backprop in a larger network
Jacobian Matrix

The Jacobian matrix can be computed using a similar back-propagation procedure:

\frac{\partial y_k}{\partial x_i} = \sum_l \frac{\partial y_k}{\partial a_l} \frac{\partial a_l}{\partial x_i}

where the a_l are the activations of the units with immediate connections to the inputs x_i. Hence

\frac{\partial y_k}{\partial x_i} = \sum_l w_{li} \frac{\partial y_k}{\partial a_l}

Analogous to the error propagation, define

\delta_{kl} \triangleq \frac{\partial y_k}{\partial a_l}

and it can be computed recursively:

\delta_{kl} = \frac{\partial y_k}{\partial a_l} = \sum_j \frac{\partial y_k}{\partial a_j} \frac{\partial a_j}{\partial a_l} = \sum_j \frac{\partial y_k}{\partial a_j} \big[w_{jl} h'(a_l)\big] = \Big( \sum_j \delta_{kj} w_{jl} \Big) h'(a_l)

where j sweeps all the units connected to unit l through the weights w_{jl}.
What is the Jacobian matrix of the softmax outputs with respect to the inputs? This is left as an exercise.
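With nn modules, each row of the Jacobian can be obtained by one backward pass per output unit, feeding in a 1-of-K gradOutput. A minimal Torch7 sketch (the model and input x are assumed given; updateGradInput backprops to the inputs without touching the weight gradients):

    local y = model:forward(x)
    local K, n = y:nElement(), x:nElement()
    local J = torch.Tensor(K, n)
    local gradOutput = torch.zeros(K)
    for k = 1, K do
      gradOutput:zero()
      gradOutput[k] = 1                       -- pick out output unit k
      -- row k of J: dy_k / dx_i for all inputs i
      J[k]:copy(model:updateGradInput(x, gradOutput))
    end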
Vanishing Gradients
function   | nonlinearity                           | gradient
-----------|----------------------------------------|---------------------------------
sigmoid    | x = 1/(1 + e^{-a})                     | ∂L/∂a = x(1 - x) ∂L/∂x
tanh       | x = (e^{a} - e^{-a})/(e^{a} + e^{-a})  | ∂L/∂a = (1 - x^2) ∂L/∂x
softsign   | x = a/(1 + |a|)                        | ∂L/∂a = (1 - |x|)^2 ∂L/∂x

What's the problem? Since |x| ≤ 1, each multiplying factor above is at most 1, so

|∂L/∂a| ≤ |∂L/∂x|

Nonlinearity causes vanishing gradients.
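A tiny Torch7 demonstration of the effect (the depth of 10 is arbitrary): a unit gradient pushed back through a stack of sigmoids shrinks by a factor of at most 1/4 per layer.

    require 'nn'

    local model = nn.Sequential()
    for _ = 1, 10 do model:add(nn.Sigmoid()) end

    local x = torch.randn(1)
    model:forward(x)
    -- backprop a unit gradient through all 10 sigmoids
    local grad = model:backward(x, torch.ones(1))
    print(grad[1])   -- tiny: each layer multiplies by x(1 - x) <= 0.25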