Upload
others
View
18
Download
0
Embed Size (px)
Citation preview
Intro to Neural Networks and Deep Learning
Jack LanchantinDr. Yanjun Qi
1
UVA CS 6316
Neurons1-Layer Neural Network
Multi-layer Neural Network
Loss Functions
Backpropagation
Nonlinearity Functions
NNs in Practice2
3
x1
x2
x3
W1
w3
W2
x ŷ
ewx+b
1 + ewx+b
Logistic Regression
Sigmoid Function(aka logistic, logit, “S”, soft-step)
4
P(Y=1|x) =
x
ez
1 + ez
Expanded Logistic Regression
x1
x2
x3
Σ
+1
z
z = wT x + b
y = sigmoid(z) =5
px11xp
p = 3
1x1
w1
w2
w3
b1SummingFunction
SigmoidFunction
Multiply by weights
ŷ = P(Y=1|x,w)
Input x
“Neuron”
x1
x2
x3
Σ
+1
z
6
w1
w2
w3
b1
Neurons
7http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf
Neuron
x1
x2
x3
Σ z ŷ
8
ez
1 + ez
z = wT x
ŷ = sigmoid(z) =px11xp1x1
From here on, we leave out bias for simplicity
Input x
w1
w2
w3
“Block View” of a Neuron
x *
Dot Product Sigmoid
w z
9
y
Input output
Dot Product SigmoidInput output
x *w z ŷ
parameterized block
ez
1 + ez
z = wT x
ŷ = sigmoid(z) =px11xp1x1
Neuron Representation
Σ
10
The linear transformation and nonlinearity together is typically considered a single neuron
ŷ
x1
x2
x3
x
w1
w2
w3
Neuron Representation
*w
11
ŷx
The linear transformation and nonlinearity together is typically considered a single neuron
Neurons
1-Layer Neural NetworkMulti-layer Neural Network
Loss Functions
Backpropagation
Nonlinearity Functions
NNs in Practice12
13
x1
x2
x3
W1
w3
W2
x ŷ
1-Layer Neural Network (with 4 neurons)
x1
x2
x3
Σ
Σ
Σ
Σ
Input x Output ŷ
14
ŷ1
ŷ2
ŷ3
ŷ4
Wz
Linear Sigmoid
1 layer
matrixvector
1-Layer Neural Network (with 4 neurons)
x1
x2
x3
Σ
Σ
Σ
Σ
Input x
15
d = 4
p = 3
Wz
Output ŷ
ez
1 + ez
z =WT x
ŷ = sigmoid(z) =px1dxpdx1
dx1 dx1
Element-wise on vector z
ŷ1
ŷ2
ŷ3
ŷ4
1-Layer Neural Network (with 4 neurons)
x1
x2
x3
Input x
16
ez
1 + ez
z =WT x
ŷ = sigmoid(z) =px1dxpdx1
dx1 dx1
d = 4
p = 3
W Σ
Σ
Σ
Σ
Output ŷ
ŷ1
ŷ2
ŷ3
ŷ4
Element-wise on vector z
“Block View” of a Neural Network
x *
Dot Product Sigmoid
w z
17
y
Input output
Dot Product SigmoidInput output
x *W z ŷ
W is now a matrix
ez
1 + ez
z =WT x
ŷ = sigmoid(z) =px1dxpdx1
dx1 dx1
z is now a vector
Neurons
1-Layer Neural Network
Multi-layer Neural NetworkLoss Functions
Backpropagation
Nonlinearity Functions
NNs in Practice18
19
x1
x2
x3
W1
w3
W2
x ŷ
Multi-Layer Neural Network(Multi-Layer Perceptron (MLP) Network)
20
Outputlayer
x1
x2
x3
x ŷ
W1
w2
Hiddenlayer
weight subscript represents layer number
2-layer NN
Multi-Layer Neural Network (MLP)
21
1st hidden layer
2nd hiddenlayer
Outputlayer
x1
x2
x3
x ŷ
3-layer NN
W1
w3
W2
Multi-Layer Neural Network (MLP)
22
z1 =WT xh1 = sigmoid(z1)z2 =WT h1 h2 = sigmoid(z2)z3 =wT h2 ŷ = sigmoid(z3)
x1
x2
x3
h1 h2
W1
w3
W2
x ŷ
1
2
3
hidden layer 1 output
Multi-Class Output MLP
x1
x2
x3
x ŷ
23
h1 h2
z1 =WT xh1 = sigmoid(z1)z2 =WT h1 h2 = sigmoid(z2)z3 =WT h2 ŷ = sigmoid(z2)
1
2
3
W1
w3
W2
“Block View” Of MLP
x y
1st hidden layer
2nd hidden layer
Output layer
24
*W1
*W2
*W3
z1 z2 z3h1 h2
“Deep” Neural Networks (i.e. > 1 hidden layer)
25Researchers have successfully used 1000 layers to train an object classifier
Neurons
1-Layer Neural Network
Multi-layer Neural Network
Loss FunctionsBackpropagation
Nonlinearity Functions
NNs in Practice26
27
x1
x2
x3
W1
w3
W2
x ŷ E (ŷ)
ŷ = P(y=1|X,W)
28
x1
x2
x3
x
Binary Classification Loss
E= loss = - logP(Y = ŷ | X= x ) = - y log(ŷ) - (1 - y) log(1-ŷ)
W1
w3
W2
this example is for a single sample x
true output
29
x1
x2
x3
Σ
W1
w3
W2
x
Regression Loss
E = loss= ( y - ŷ )212
ŷ
true output
z1
z2
z3
30
x1
x2
x3
x
Σ
Σ
Σ
ŷ1
ŷ2
ŷ3
Multi-Class Classification Loss
“Softmax” function. Normalizing function which converts each class output to a probability.
E = loss = - yj ln ŷjΣj = 1...K
= P( ŷi = 1 | x )
W1 W3
W2
ŷi
“0” for all except true class
K = 3
Neurons
1-Layer Neural Network
Multi-layer Neural Network
Loss Functions
BackpropagationNonlinearity Functions
NNs in Practice31
32
x1
x2
x3
W1
w3
W2
x ŷ E (ŷ)
Training Neural Networks
33
How do we learn the optimal weights WL for our task??● Gradient descent:
LeCun et. al. Efficient Backpropagation. 1998
WL(t+1) = WL(t) - E WL(t)
But how do we get gradients of lower layers?● Backpropagation!
○ Repeated application of chain rule of calculus○ Locally minimize the objective○ Requires all “blocks” of the network to be differentiable
x ŷ
W1w3
W2
34
Backpropagation Intro
http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf
35
Backpropagation Intro
http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf
36
Backpropagation Intro
http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf
37
Backpropagation Intro
http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf
38
Backpropagation Intro
http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf
39
Backpropagation Intro
http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf
40
Backpropagation Intro
http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf
41
Backpropagation Intro
http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf
42
Backpropagation Intro
http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf
43
Backpropagation Intro
http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf
44
Backpropagation Intro
http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf
45
Backpropagation Intro
http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf
46
Backpropagation Intro
Tells us: by increasing x by a scale of 1, we decrease f by a scale of 4
http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf
47
ŷ = P(y=1|X,W)
x1
x2
x3
W1
w2
x
Backpropagation(binary classification example)
Example on 1-hidden layer NN for binary classification
48
Backpropagation(binary classification example)
x y*W1
*w2
z1z2h1
49
x y*W1
*w2
z1z2h1
E = loss =
Backpropagation(binary classification example)
Gradient Descent to Minimize loss: Need to find these!
50
Backpropagation(binary classification example)
x y*W1
*w2
z1z2h1
51
Backpropagation(binary classification example)
= ??
= ??
x y*W1
*w2
z1z2h1
52
Backpropagation(binary classification example)
= ??
= ??
Exploit the chain rule!
x y*W1
*w2
z1z2h1
53
Backpropagation(binary classification example)
x y*W1
*w2
z1z2h1
54
Backpropagation(binary classification example)
chain rule
x y*W1
*w2
z1z2h1
55
Backpropagation(binary classification example)
x y*W1
*w2
z1z2h1
56
Backpropagation(binary classification example)
x y*W1
*w2
z1z2h1
57
Backpropagation(binary classification example)
x y*W1
*w2
z1z2h1
58
Backpropagation(binary classification example)
x y*W1
*w2
z1z2h1
59
Backpropagation(binary classification example)
x y*W1
*w2
z1z2h1
60
Backpropagation(binary classification example)
x y*W1
*w2
z1z2h1
61
Backpropagation(binary classification example)
x y*W1
*w2
z1z2h1
62
Backpropagation(binary classification example)
already computed
x y*W1
*w2
z1z2h1
63
“Local-ness” of Backpropagation
fx y
“local gradients”activations
gradients
http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf
64
“Local-ness” of Backpropagation
fx y
“local gradients”activations
gradients
http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf
x
65
Example: Sigmoid Block
sigmoid(x) = (x)
http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf
Deep Learning = Concatenation of Differentiable Parameterized Layers (linear & nonlinearity functions)
x y
1st hidden layer
2nd hidden layer
Output layer
66
*W1
*W2
*W3
z1 z2 z3h1 h2
Want to find optimal weights W to minimize some loss function E!
Backprop Whiteboard Demo
x1
x2
1
Σ
Σ
ŷ
67
w1 z1w2
w3w4
b1
b2
z2
h1
h2
Σ
w5
w6
1b3
z1 = x1w1 + x2w3 + b1z2 = x1w2 + x2w4 + b2
h1 =exp(z1)
1 + exp(z1)exp(z2)
1 + exp(z2)h2 =
ŷ = h1w5 + h2w6 + b3
E = ( y - ŷ )2
f1
f2
f3
f4
w(t+1) = w(t) - E w(t)
E w = ??
Neurons
1-Layer Neural Network
Multi-layer Neural Network
Loss Functions
Backpropagation
Nonlinearity FunctionsNNs in Practice
68
Nonlinearity Functions (i.e. transfer or activation functions)
x1
x2
x3
Σ
SummingFunction
SigmoidFunction
w1
w2
w3
+1
b1
z
x﹡w
Multiply by weights
69
Nonlinearity Functions (i.e. transfer or activation functions)
x﹡w
70https://en.wikipedia.org/wiki/Activation_function#Comparison_of_activation_functions
Name Plot Equation Derivative ( w.r.t x )
Nonlinearity Functions (i.e. transfer or activation functions)
x﹡w
71https://en.wikipedia.org/wiki/Activation_function#Comparison_of_activation_functions
Name Plot Equation Derivative ( w.r.t x )
Nonlinearity Functions (aka transfer or activation functions)
x﹡w
72https://en.wikipedia.org/wiki/Activation_function#Comparison_of_activation_functions
Name Plot Equation Derivative ( w.r.t x )
usually works best in practice
Neurons
1-Layer Neural Network
Multi-layer Neural Network
Loss Functions
Backpropagation
Nonlinearity Functions
NNs in Practice73
74
Neural Net Pipeline
x ŷ
W1w3
1. Initialize weights2. For each batch of input x samples S:
a. Run the network “Forward” on S to compute outputs and lossb. Run the network “Backward” using outputs and loss to compute gradientsc. Update weights using SGD (or a similar method)
3. Repeat step 2 until loss convergence
W2
E
75
Non-Convexity of Neural Nets
In very high dimensions, there exists many local minimum which are about the same.
Pascanu, et. al. On the saddle point problem for non-convex optimization 2014
76
Building Deep Neural Nets
http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf
fx
y
77
Building Deep Neural Nets
“GoogLeNet” for Object Classification
78
Block Example Implementation
http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf
79
Advantage of Neural Nets
As long as it’s fully differentiable, we can train the model to automatically learn features for us.