Louis Monier (@louis_monier)
https://www.linkedin.com/in/louismonier
Gregory Renard (@redo)
https://www.linkedin.com/in/gregoryrenard
Deep Learning Basics: Go Deep or Go Home
Class 1 - Q1 - 2016
Supervised Learning
[Figure: a training set of labeled images (cat, hat, dog, cat, mug, ...); given a new image, the trained model outputs the prediction "dog (89%)". Training takes days; testing (prediction) takes milliseconds.]
Single Neuron
[Figure: a single neuron. Each input (e.g. xpos, ypos) is multiplied by a weight (w1, w2, w3) and summed: a linear combination (e.g. 3.4906). A nonlinearity squashes that score (e.g. to 0.9981), which drives the choice "red". Example inputs: 3.45, 1.21, -0.314, 1.59, 2.65.]
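The neuron above can be sketched in a few lines of Python. The sigmoid used here as the nonlinearity, and the weights and bias, are illustrative choices, not values from the slide:

```python
import math

def neuron(inputs, weights, bias):
    # Linear combination: multiply each input by its weight and sum.
    score = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Nonlinearity: squash the score into (0, 1) with a sigmoid.
    return 1.0 / (1.0 + math.exp(-score))

# Illustrative call: two inputs (xpos, ypos) with made-up weights.
choice_score = neuron([3.45, 1.21], [0.9, 0.4], bias=-0.2)
```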
More Neurons, More Layers: Deep Learning
[Figure: a deep network. An input layer (input1, input2, input3) feeds hidden layers that detect increasingly abstract features (circle, iris, eye, face), ending in an output layer: "89% likely to be a face".]
How to find the best values for the weights?
[Figure: error plotted as a function of a parameter w; from any point on the curve (a, b), the gradient tells you which direction is downhill.]
Define error = |expected - computed|²
Find the parameters that minimize the average error.
The gradient of the error with respect to w points uphill.
Update: step w against the gradient, i.e. take a step downhill.
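A minimal sketch of that update rule for a one-parameter model y = w·x with squared error; the numbers and learning rate are illustrative:

```python
# Gradient descent on error(w) = (expected - w*x)^2.
# d(error)/dw = -2*x*(expected - w*x); stepping against the
# gradient ("downhill") moves w toward the minimum.
def step(w, x, expected, lr=0.1):
    grad = -2 * x * (expected - w * x)
    return w - lr * grad

w = 0.0
for _ in range(50):
    w = step(w, x=1.0, expected=3.0)
# w walks downhill toward 3.0, where the error is zero
```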
Backpropagation: Assigning Blame
[Figure: a small network with inputs input1, input2, input3; first-layer weights w1, w2, w3 and bias b produce an intermediate prediction p = 0.3; second-layer weights w1', w2', w3' and bias b' produce the output p' = 0.89; label = 1, loss = 0.49.]
1) For the loss to go down by 0.1, p must go up by 0.05. For p to go up by 0.05, w1 must go up by 0.09. For p to go up by 0.05, w2 must go down by 0.12. ...
2) For the loss to go down by 0.1, p' must go down by 0.01.
3) For p' to go down by 0.01, w2' must go up by 0.04.
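This blame assignment is the chain rule applied backward through the network. A scalar sketch with made-up numbers (not the slide's), assuming sigmoid activations and squared error:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# A tiny one-path network: x -> (w, sigmoid) -> p -> (w2, sigmoid) -> p2.
# Backpropagation applies the chain rule from the loss backward,
# assigning each weight its share of the blame.
x, w, w2, label = 1.0, 0.5, -0.3, 1.0

p = sigmoid(w * x)        # hidden activation
p2 = sigmoid(w2 * p)      # prediction
loss = (label - p2) ** 2  # squared error

dloss_dp2 = -2 * (label - p2)        # how the loss reacts to p2
dp2_dw2 = p2 * (1 - p2) * p          # local gradient at the output neuron
dp2_dp = p2 * (1 - p2) * w2          # how p2 reacts to p
dp_dw = p * (1 - p) * x              # local gradient at the hidden neuron

grad_w2 = dloss_dp2 * dp2_dw2        # blame for the output weight
grad_w = dloss_dp2 * dp2_dp * dp_dw  # blame flows back to the first weight
```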
Stochastic Gradient Descent (SGD) and Minibatch
It’s too expensive to compute the gradient on all inputs to take a step.
We prefer a quick approximation.
Use a small random sample of inputs (minibatch) to compute the gradient.
Apply the gradient to all the weights.
Welcome to SGD, much more efficient than regular GD!
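The recipe above can be sketched for a one-parameter linear model; the dataset, learning rate, and batch size are illustrative choices:

```python
import random

# One epoch of minibatch SGD for a 1-D model y = w*x with squared error.
# Regular GD would average the gradient over all inputs before each step;
# SGD takes a cheap, noisy step per small random minibatch instead.
def sgd_epoch(data, w, lr=0.1, batch_size=4):
    random.shuffle(data)  # random sampling: shuffle once, then slice
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        # Gradient of (y - w*x)^2 averaged over the minibatch only.
        grad = sum(-2 * x * (y - w * x) for x, y in batch) / len(batch)
        w -= lr * grad  # apply the step to the weight
    return w

data = [(k / 10, 2.0 * k / 10) for k in range(1, 21)]  # truth: y = 2x
w = 0.0
for _ in range(40):
    w = sgd_epoch(data, w)
# w converges close to the true slope 2.0
```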
Usually things would get intense at this point...
Chain rule
Partial derivatives
Jacobian matrix
Hessian
Sum over paths
Local gradient
Credit: Richard Socher, Stanford class CS224d
Train. Validate. Test.
Training set
Validation set
Cross-validation
Testing set
[Figure: the data split into four folds (1 2 3 4); in cross-validation, each fold in turn serves as the validation set while the others train. Shuffle the training set between passes; one full pass over it is one epoch.]
Never ever, ever, ever, ever, ever, ever use the testing set for training!
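The split-and-never-touch-the-test-set discipline can be sketched as follows; the 60/20/20 fractions and the helper name are illustrative choices, not from the slides:

```python
import random

def split_data(examples, val_frac=0.2, test_frac=0.2, seed=0):
    # Shuffle once, then carve out the validation and testing sets.
    # The testing set must never be used for training.
    rng = random.Random(seed)
    examples = examples[:]  # don't mutate the caller's list
    rng.shuffle(examples)
    n_test = int(len(examples) * test_frac)
    n_val = int(len(examples) * val_frac)
    test = examples[:n_test]
    val = examples[n_test:n_test + n_val]
    train = examples[n_test + n_val:]
    return train, val, test

train, val, test = split_data(list(range(100)))
```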
Tensors are not scary
Number: 0D
Vector: 1D
Matrix: 2D
Tensor: 3D, 4D, … array of numbers
[Figure: a photo as a 3264 x 2448 x 3 tensor; each pixel is three numbers, e.g. r=0.45, g=0.84, b=0.76.]
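Those shapes can be sketched in plain Python; the `shape` helper and the tiny 2 x 2 stand-in image are illustrative (a real photo would be 3264 x 2448 x 3):

```python
# Tensors are just n-dimensional arrays of numbers.
number = 3.14                  # 0-D
vector = [3.45, 1.21, -0.314]  # 1-D, shape (3,)
matrix = [[1, 2], [3, 4]]      # 2-D, shape (2, 2)

# An RGB image is a 3-D tensor: height x width x 3 channels.
# A tiny 2 x 2 image, one [r, g, b] triple per pixel:
image = [[[0.45, 0.84, 0.76], [0.45, 0.84, 0.76]],
         [[0.45, 0.84, 0.76], [0.45, 0.84, 0.76]]]

def shape(t):
    # Shape of a regular (non-ragged), non-empty nested-list tensor.
    if not isinstance(t, list):
        return ()
    return (len(t),) + shape(t[0])
```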
Fully-Connected Networks
Per layer, every input connects to every neuron.
[Figure: a deep network with input layer (input1, input2, input3), hidden layers (circle, iris, eye, face), and output "89% likely to be a face".]
Recurrent Neural Networks (RNN)
Appropriate when inputs are sequences.
We'll cover them in detail when we work on text.
[Figure: an RNN unrolled over time steps t=0 ... t=4; the words of "this is not a pipe" are fed in one per step.]
Convolutional Neural Networks (ConvNets)
Appropriate for image tasks, but not limited to them.
We'll cover them in detail in the next session.
Hyperparameters
Activations: a zoo of nonlinear functions
Initializations: Distribution of initial weights. Not all zeros.
Optimizers: driving the gradient descent
Objectives: comparing a prediction to the truth
Regularizers: forcing the function we learn to remain “simple”
...and many more
Softmax
SOFTMAX turns raw scores into probabilities:
x1 = 6.78   →  dog = 0.92536
x2 = 4.25   →  cat = 0.07371
x3 = -0.16  →  car = 0.00089
x4 = -3.5   →  hat = 0.00003
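A minimal softmax that reproduces the slide's numbers; the max-subtraction trick is a standard numerical-stability refinement, not shown on the slide:

```python
import math

def softmax(scores):
    # Subtract the max score for numerical stability, exponentiate,
    # then normalize so the outputs sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([6.78, 4.25, -0.16, -3.5])
# probs is approximately [0.92536, 0.07371, 0.00089, 0.00003]
```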
Various algorithms for driving Gradient Descent
Tricks to speed up SGD.
Learn the learning rate.
Credit: Alex Radford
Cross-entropy Loss
SOFTMAX turns the raw scores into probabilities, which are compared to the label (truth):
x1 = 6.78   →  dog = 0.92536   if label is dog: loss = 0.078
x2 = 4.25   →  cat = 0.07371   if label is cat: loss = 2.607
x3 = -0.16  →  car = 0.00089   if label is car: loss = 7.024
x4 = -3.5   →  hat = 0.00003   if label is hat: loss = 10.414
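The per-class losses above are just negative log probabilities; a minimal sketch using the slide's numbers:

```python
import math

def cross_entropy(probs, true_index):
    # The loss is the negative log of the probability the model assigned
    # to the true class: confident-and-right is cheap,
    # confident-and-wrong is very expensive.
    return -math.log(probs[true_index])

probs = [0.92536, 0.07371, 0.00089, 0.00003]  # dog, cat, car, hat
loss_if_dog = cross_entropy(probs, 0)  # ~0.078
loss_if_hat = cross_entropy(probs, 3)  # ~10.414
```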
Overfitting
[Figure: three fits of the same data. Too simple: might not predict well. Just right. Overfitting: rote memorization; will not generalize well.]
Regularization: Avoiding Overfitting
Idea: keep the functions “simple” by constraining the weights.
Loss = Error(prediction, truth) + L(keep solution simple)
L2 keeps all parameters medium-sized.
L1 drives many parameters to exactly zero.
L1+L2 is sometimes used.
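The combined loss can be written directly; the coefficients l1 and l2 are illustrative knobs controlling how hard each penalty pushes:

```python
def regularized_loss(error, weights, l1=0.0, l2=0.0):
    # Total loss = data error + penalties that keep the solution "simple".
    # The L2 term shrinks every weight a little (medium-sized weights);
    # the L1 term pushes many weights to exactly zero.
    l1_penalty = l1 * sum(abs(w) for w in weights)
    l2_penalty = l2 * sum(w * w for w in weights)
    return error + l1_penalty + l2_penalty
```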
Regularization: Dropout
At every step during training, ignore the output of a random fraction p of the neurons.
p = 0.5 is a good default.
[Figure: a network with inputs input1, input2, input3; during training a random subset of neurons is dropped.]
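A sketch of one layer's dropout; the 1/(1-p) rescaling of survivors is the common "inverted dropout" variant, a refinement not spelled out on the slide:

```python
import random

def dropout(activations, p=0.5, rng=random):
    # During training, zero each neuron's output with probability p and
    # scale the survivors by 1/(1-p) so the expected activation is
    # unchanged. At test time, skip this step entirely.
    return [0.0 if rng.random() < p else a / (1.0 - p) for a in activations]
```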