Neural Networks: Optimization & Regularization

Shan-Hung Wu
[email protected]

Department of Computer Science, National Tsing Hua University, Taiwan

Machine Learning
Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 1 / 60
Outline

1 Optimization
  Momentum & Nesterov Momentum
  AdaGrad & RMSProp
  Batch Normalization
  Continuation Methods & Curriculum Learning

2 Regularization
  Weight Decay
  Data Augmentation
  Dropout
  Manifold Regularization
  Domain-Specific Model Design
Challenges

An NN is a complex function:
$$\hat{y} = f(x;\Theta) = f^{(L)}(\cdots f^{(1)}(x; W^{(1)}) \cdots ; W^{(L)})$$

Given a training set $\mathbb{X}$, our goal is to solve:
$$\arg\min_{\Theta} C(\Theta) = \arg\min_{\Theta} -\log P(\mathbb{X}\,|\,\Theta) = \arg\min_{\Theta} \sum_i -\log P(y^{(i)}\,|\,x^{(i)},\Theta) = \arg\min_{\Theta} \sum_i C^{(i)}(\Theta) = \arg\min_{W^{(1)},\cdots,W^{(L)}} \sum_i C^{(i)}(W^{(1)},\cdots,W^{(L)})$$

What are the challenges of solving this problem with SGD?
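As a concrete sketch of the per-example cost, here is the softmax negative log-likelihood in NumPy; the logits and labels are made-up values, and the network producing the logits is omitted:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def nll(logits, y):
    """Per-example cost C^(i) = -log P(y^(i) | x^(i), Theta)."""
    return -np.log(softmax(logits)[y])

# The total cost C(Theta) is the sum of the per-example costs
logits = [np.array([2.0, 0.5, -1.0]), np.array([0.1, 0.2, 3.0])]
labels = [0, 2]
total_cost = sum(nll(z, y) for z, y in zip(logits, labels))
```

Note that `nll` approaches zero as the correct class's logit dominates, but never reaches it, which is the "lacks global minima" issue discussed below.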
Non-Convexity

The loss function $C^{(i)}$ is non-convex:
- SGD may stop at local minima or saddle points
- Prior to the success of SGD (roughly 2012), NN cost surfaces were generally believed to have many non-convex structures
- However, studies [2, 4] show that SGD seldom encounters critical points when training a large NN
Ill-Conditioning

The loss $C^{(i)}$ may be ill-conditioned (in terms of $\Theta$):
- Due to, e.g., the dependency between the $W^{(k)}$'s at different layers
- SGD makes slow progress in valleys and on plateaus
Lacks Global Minima

The loss $C^{(i)}$ may lack a global minimum point. E.g., for multiclass classification:
- $P(y\,|\,x,\Theta)$ is provided by a softmax function
- $C^{(i)}(\Theta) = -\log P(y^{(i)}\,|\,x^{(i)},\Theta)$ can become arbitrarily close to zero (if example $i$ is classified correctly), but never actually reaches zero
- SGD may proceed along a direction forever
- Initialization is important
Training 101

Before training a feedforward NN, remember to standardize (z-normalize) the input:
- Prevents dominating features
- Improves conditioning

When training, remember to:
1 Initialize all weights to small random values
  - Breaks the "symmetry" between different units so they are not updated in the same way
  - Biases $b^{(k)}$ may be initialized to zero (or to small positive values for ReLUs to prevent too much saturation)
2 Early-stop if the validation error does not continue decreasing
  - Prevents overfitting
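A minimal NumPy sketch of the z-normalization step; the feature values are illustrative:

```python
import numpy as np

def standardize(X, eps=1e-8):
    """Z-normalize each feature (column) of X to zero mean and unit variance."""
    mu = X.mean(axis=0)       # per-feature mean
    sigma = X.std(axis=0)     # per-feature standard deviation
    return (X - mu) / (sigma + eps)   # eps guards against constant features

# Without standardization, the second feature would dominate the loss surface
X = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [3.0, 3000.0]])
X_std = standardize(X)
```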
Momentum

Update rule in SGD:
$$\Theta^{(t+1)} \leftarrow \Theta^{(t)} - \eta g^{(t)}, \quad \text{where } g^{(t)} = \nabla_{\Theta} C(\Theta^{(t)})$$
- Gets stuck at local minima or saddle points

Momentum: repeat the movement $v^{(t)}$ made in the last iteration, corrected by the negative gradient:
$$v^{(t+1)} \leftarrow \lambda v^{(t)} - (1-\lambda)\, g^{(t)}$$
$$\Theta^{(t+1)} \leftarrow \Theta^{(t)} + \eta v^{(t+1)}$$
- $v^{(t)}$ is a moving average of $-g^{(t)}$
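The two momentum updates can be sketched in a few lines of NumPy; the toy objective $C(\theta) = \frac{1}{2}\theta^2$ and the hyperparameter values are illustrative:

```python
import numpy as np

def momentum_step(theta, v, grad, lr=0.1, lam=0.9):
    """One momentum update: v is a moving average of the negative gradient."""
    v = lam * v - (1 - lam) * grad    # v^(t+1) <- lambda v^(t) - (1-lambda) g^(t)
    theta = theta + lr * v            # Theta^(t+1) <- Theta^(t) + eta v^(t+1)
    return theta, v

# Minimize C(theta) = 0.5 * theta^2, whose gradient is theta itself
theta, v = np.array([5.0]), np.array([0.0])
for _ in range(200):
    theta, v = momentum_step(theta, v, grad=theta)
```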
Nesterov Momentum

Repeat the movement $v^{(t)}$ made in the last iteration, corrected by the lookahead negative gradient:
$$\tilde{\Theta}^{(t+1)} \leftarrow \Theta^{(t)} + \eta v^{(t)}$$
$$v^{(t+1)} \leftarrow \lambda v^{(t)} - (1-\lambda)\, \nabla_{\Theta} C(\tilde{\Theta}^{(t+1)})$$
$$\Theta^{(t+1)} \leftarrow \Theta^{(t)} + \eta v^{(t+1)}$$
- Faster convergence to a minimum
- Not helpful for NNs that lack minima
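The same toy setup with the lookahead correction; the only change from plain momentum is where the gradient is evaluated (hyperparameters are again illustrative):

```python
import numpy as np

def nesterov_step(theta, v, grad_fn, lr=0.1, lam=0.9):
    """One Nesterov update: the gradient is evaluated at the lookahead point."""
    lookahead = theta + lr * v                    # Theta~ = Theta^(t) + eta v^(t)
    v = lam * v - (1 - lam) * grad_fn(lookahead)  # correct v with lookahead gradient
    theta = theta + lr * v
    return theta, v

# Minimize C(theta) = 0.5 * theta^2 again (gradient: theta)
theta, v = np.array([5.0]), np.array([0.0])
for _ in range(200):
    theta, v = nesterov_step(theta, v, grad_fn=lambda t: t)
```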
Where Does SGD Spend Its Training Time?

1 Detouring around a saddle point of high cost
  - Remedy: better initialization
2 Traversing a relatively flat valley
  - Remedy: an adaptive learning rate
SGD with Adaptive Learning Rates

- Smaller learning rate $\eta$ along a steep direction: prevents overshooting
- Larger learning rate $\eta$ along a flat direction: speeds up convergence

How?
AdaGrad

Update rule:
$$r^{(t+1)} \leftarrow r^{(t)} + g^{(t)} \odot g^{(t)}$$
$$\Theta^{(t+1)} \leftarrow \Theta^{(t)} - \frac{\eta}{\sqrt{r^{(t+1)}}} \odot g^{(t)}$$
- $r^{(t+1)}$ accumulates the squared gradients along each axis
- The division and square root are applied to $r^{(t+1)}$ elementwise

We have
$$\frac{\eta}{\sqrt{r^{(t+1)}}} = \frac{\eta}{\sqrt{t+1}} \odot \frac{1}{\sqrt{\frac{1}{t+1} r^{(t+1)}}} = \frac{\eta}{\sqrt{t+1}} \odot \frac{1}{\sqrt{\frac{1}{t+1} \sum_{i=0}^{t} g^{(i)} \odot g^{(i)}}}$$
1 Smaller learning rate along all directions as $t$ grows
2 Larger learning rate along more gently sloped directions
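A sketch of the AdaGrad rule on an ill-conditioned quadratic bowl; the small `eps` in the denominator is a common implementation detail not shown in the update rule above, and the objective and hyperparameters are illustrative:

```python
import numpy as np

def adagrad_step(theta, r, grad, lr=0.5, eps=1e-8):
    """One AdaGrad update with elementwise division and square root."""
    r = r + grad * grad                            # accumulate squared gradients
    theta = theta - lr / (np.sqrt(r) + eps) * grad
    return theta, r

# C(theta) = 0.5 * (100 theta_0^2 + theta_1^2): steep along axis 0, flat along axis 1
grad_fn = lambda t: np.array([100.0, 1.0]) * t
theta, r = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(500):
    theta, r = adagrad_step(theta, r, grad_fn(theta))
```

Despite the 100x difference in steepness, both coordinates make comparable progress, since each axis's step is normalized by its own accumulated gradient magnitude.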
Limitations

The optimal learning rate along a direction may change over time.
In AdaGrad, $r^{(t+1)}$ accumulates squared gradients from the beginning of training, which results in premature adaptivity.
RMSProp

RMSProp changes the gradient accumulation in $r^{(t+1)}$ into a moving average:
$$r^{(t+1)} \leftarrow \lambda r^{(t)} + (1-\lambda)\, g^{(t)} \odot g^{(t)}$$
$$\Theta^{(t+1)} \leftarrow \Theta^{(t)} - \frac{\eta}{\sqrt{r^{(t+1)}}} \odot g^{(t)}$$

A popular algorithm, Adam (short for adaptive moments) [7], is a combination of RMSProp and Momentum:
$$v^{(t+1)} \leftarrow \lambda_1 v^{(t)} - (1-\lambda_1)\, g^{(t)}$$
$$r^{(t+1)} \leftarrow \lambda_2 r^{(t)} + (1-\lambda_2)\, g^{(t)} \odot g^{(t)}$$
$$\Theta^{(t+1)} \leftarrow \Theta^{(t)} + \frac{\eta}{\sqrt{r^{(t+1)}}} \odot v^{(t+1)}$$
with some bias corrections for $v^{(t+1)}$ and $r^{(t+1)}$.
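A sketch of Adam including the bias corrections mentioned above; here `v` averages the gradient itself and is subtracted, which is the standard sign convention and equivalent to the slide's. The hyperparameters follow common defaults but are illustrative:

```python
import numpy as np

def adam_step(theta, v, r, grad, t, lr=0.1, lam1=0.9, lam2=0.999, eps=1e-8):
    """One Adam update (t starts at 1 for the bias corrections)."""
    v = lam1 * v + (1 - lam1) * grad          # moving average of the gradient
    r = lam2 * r + (1 - lam2) * grad * grad   # moving average of the squared gradient
    v_hat = v / (1 - lam1 ** t)               # bias-corrected first moment
    r_hat = r / (1 - lam2 ** t)               # bias-corrected second moment
    theta = theta - lr * v_hat / (np.sqrt(r_hat) + eps)
    return theta, v, r

# Minimize C(theta) = 0.5 * theta^2 (gradient: theta)
theta, v, r = np.array([5.0]), np.zeros(1), np.zeros(1)
for t in range(1, 1001):
    theta, v, r = adam_step(theta, v, r, grad=theta, t=t)
```

The bias corrections matter early on: at $t = 1$, both moving averages are scaled by $(1-\lambda)$, so dividing by $1-\lambda^t$ recovers the right magnitude.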
Training Deep NNs I

So far, we have modified the optimization algorithm to better train the model.
Can we instead modify the model to ease the optimization task?
What are the difficulties in training a deep NN?
Training Deep NNs II

The cost $C(\Theta)$ of a deep NN is usually ill-conditioned due to the dependency between the $W^{(k)}$'s at different layers.

As a simple example, consider a deep NN for $x, y \in \mathbb{R}$:
$$y = f(x) = x\, w^{(1)} w^{(2)} \cdots w^{(L)}$$
- A single unit at each layer
- A linear activation function and no bias in each unit

The output $y$ is a linear function of $x$, but not of the weights. The curvature of $f$ with respect to any two weights $w^{(i)}$ and $w^{(j)}$, $i \neq j$, is
$$\frac{\partial^2 f}{\partial w^{(i)}\, \partial w^{(j)}} = x \prod_{k \neq i,j} w^{(k)}$$
- Very small if $L$ is large and $w^{(k)} < 1$ for $k \neq i,j$
- Very large if $L$ is large and $w^{(k)} > 1$ for $k \neq i,j$
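A quick numeric check of how this product of the remaining weights behaves as depth grows; the depth and weight values are arbitrary:

```python
import numpy as np

# Curvature w.r.t. two weights: x times the product of the other L-2 weights
L, x = 50, 1.0
w_small = np.full(L - 2, 0.9)   # remaining weights all < 1
w_large = np.full(L - 2, 1.1)   # remaining weights all > 1

curv_small = x * np.prod(w_small)   # vanishes as L grows
curv_large = x * np.prod(w_large)   # explodes as L grows
```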
Training Deep NNs III

The ill-conditioned $C(\Theta)$ makes gradient-based optimization algorithms (e.g., SGD) inefficient.
Let $\Theta = [w^{(1)}, w^{(2)}, \cdots, w^{(L)}]^{\top}$ and $g^{(t)} = \nabla_{\Theta} C(\Theta^{(t)})$.
In gradient descent, we get $\Theta^{(t+1)}$ by $\Theta^{(t+1)} \leftarrow \Theta^{(t)} - \eta g^{(t)}$, based on the first-order Taylor approximation of $C$:
- Each gradient component $g^{(t)}_i = \frac{\partial C}{\partial w^{(i)}}(\Theta^{(t)})$ is calculated individually, by fixing $C(\Theta^{(t)})$ in the other dimensions (the $w^{(j)}$'s, $j \neq i$)
- However, $g^{(t)}$ updates $\Theta^{(t)}$ in all dimensions simultaneously in the same iteration
- $C(\Theta^{(t+1)})$ is guaranteed to decrease only if $C$ is linear at $\Theta^{(t)}$
- Wrong assumption: $\Theta^{(t+1)}_i$ will decrease $C$ even if the other $\Theta^{(t+1)}_j$'s are updated simultaneously

Second-order methods?
- Time-consuming
- Do not take higher-order effects into account

Can we change the model to make this assumption not-so-wrong?
Batch Normalization I

$$y = f(x) = x\, w^{(1)} w^{(2)} \cdots w^{(L)}$$

Why not standardize each hidden activation $a^{(k)}$, $k = 1, \cdots, L-1$ (as we standardized $x$)?
We have $y = a^{(L-1)} w^{(L)}$. When $a^{(L-1)}$ is standardized, $g^{(t)}_L = \frac{\partial C}{\partial w^{(L)}}(\Theta^{(t)})$ is more likely to decrease $C$:
- If $x \sim \mathcal{N}(0,1)$, then still $a^{(L-1)} \sim \mathcal{N}(0,1)$, no matter how $w^{(1)}, \cdots, w^{(L-1)}$ change
- Changes in the other dimensions proposed by the $g^{(t)}_i$'s, $i \neq L$, can be zeroed out

Similarly, if $a^{(k-1)}$ is standardized, $g^{(t)}_k = \frac{\partial C}{\partial w^{(k)}}(\Theta^{(t)})$ is more likely to decrease $C$.
Batch Normalization II

How do we standardize $a^{(k)}$ at training and test time? We can standardize the input $x$ because we see multiple examples.
During training, we see a minibatch of activations $a^{(k)} \in \mathbb{R}^M$ ($M$ the batch size).
Batch normalization [6]:
$$a^{(k)}_i \leftarrow \frac{a^{(k)}_i - \mu^{(k)}}{\sigma^{(k)}}, \;\forall i$$
- $\mu^{(k)}$ and $\sigma^{(k)}$ are the mean and standard deviation of the activations across the examples in the minibatch
- At test time, $\mu^{(k)}$ and $\sigma^{(k)}$ can be replaced by running averages collected during training
- Readily extends to NNs with multiple neurons at each layer
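A minimal sketch of the batch statistics and the running averages; the learnable scale-and-shift parameters of the full batch-normalization method are omitted here, and the momentum value, layer width, and batch statistics are illustrative:

```python
import numpy as np

def batch_norm_train(a, running, momentum=0.99, eps=1e-5):
    """Normalize a minibatch of activations; update running stats for test time."""
    mu, var = a.mean(axis=0), a.var(axis=0)   # per-unit batch statistics
    running["mu"] = momentum * running["mu"] + (1 - momentum) * mu
    running["var"] = momentum * running["var"] + (1 - momentum) * var
    return (a - mu) / np.sqrt(var + eps)

def batch_norm_test(a, running, eps=1e-5):
    """At test time, use the running averages instead of batch statistics."""
    return (a - running["mu"]) / np.sqrt(running["var"] + eps)

rng = np.random.default_rng(0)
running = {"mu": np.zeros(3), "var": np.ones(3)}
batch = rng.normal(10.0, 4.0, size=(32, 3))   # M = 32 examples, 3 hidden units
normalized = batch_norm_train(batch, running)
```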
Standardizing Nonlinear Units

How to standardize a nonlinear unit a^(k) = act(z^(k))?
We can still zero out the effects from other layers by normalizing z^(k)
Given a minibatch of z^(k) ∈ R^M:
  ẑ^(k)_i = (z^(k)_i − µ^(k)) / σ^(k), ∀i
A hidden unit now looks like: [figure omitted]

Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 24 / 60
Expressiveness I

The weights W^(k) at each layer are easier to train now
  The "wrong assumption" of gradient-based optimization is made valid
But at the cost of expressiveness: normalizing a^(k) or z^(k) limits the output range of a unit
Observe that there is no need to insist that ẑ^(k) have zero mean and unit variance
  We only care about whether it is "fixed" when calculating the gradients for other layers

Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 25 / 60
Expressiveness II

During training time, we can introduce two parameters γ and β and back-propagate through
  γ ẑ^(k) + β
to learn their best values
Question: γ and β can be learned to invert ẑ^(k) to get z^(k), so what's the point?
  Since ẑ^(k) = (z^(k) − µ^(k)) / σ^(k), setting γ = σ^(k) and β = µ^(k) gives γ ẑ^(k) + β = σ^(k) ẑ^(k) + µ^(k) = z^(k)
The weights W^(k), γ, and β are now easier to learn with SGD

Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 26 / 60
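A small sketch of this learnable rescaling (the concrete values below are chosen for illustration only):

```python
import numpy as np

# A minibatch of pre-activations z for one unit
z = np.array([2.0, 4.0, 6.0, 8.0])
mu, sigma = z.mean(), z.std()
z_hat = (z - mu) / sigma                 # standardized pre-activations

# Learnable parameters, trained by back-propagating through gamma*z_hat + beta
gamma, beta = 1.0, 0.0                   # initial values keep the output standardized
out = gamma * z_hat + beta

# If SGD drives gamma -> sigma and beta -> mu, the layer recovers z exactly,
# so no expressiveness is lost by the normalization:
z_recovered = sigma * z_hat + mu
```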
Parameter Initialization

Initialization is important
How to better initialize Θ^(0)?
1 Train an NN multiple times with random initial points, and then pick the best
2 Design a series of cost functions such that a solution to one is a good initial point of the next
  Solve the "easy" problem first, and then a "harder" one, and so on

Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 28 / 60
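Strategy 1 can be sketched with a toy one-dimensional cost; the cost function and hyperparameters here are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def cost(theta):                          # a toy non-convex cost
    return np.sin(3.0 * theta) + theta ** 2

def train(theta0, lr=0.01, steps=300):    # plain gradient descent from theta0
    theta = theta0
    for _ in range(steps):
        grad = 3.0 * np.cos(3.0 * theta) + 2.0 * theta
        theta -= lr * grad
    return theta

# Strategy 1: several random initial points, keep the best final solution
finals = [train(rng.uniform(-2.0, 2.0)) for _ in range(10)]
best = min(finals, key=cost)
```

Different starts settle in different basins; keeping the lowest-cost result mitigates bad initializations at the price of several training runs.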
Continuation Methods I

Continuation methods: construct easier cost functions by smoothing the original cost function:
  C̃(Θ) = E_{Θ̃ ∼ N(Θ, σ²)} [C(Θ̃)]
In practice, we sample several Θ̃'s to approximate the expectation
Assumption: some non-convex functions become approximately convex when smoothed

Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 29 / 60
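The expectation above can be approximated by Monte-Carlo sampling, as the slide suggests; the wiggly cost function below is a made-up example:

```python
import numpy as np

rng = np.random.default_rng(0)

def cost(theta):
    # A cost with many small wiggles on top of a convex bowl
    return np.abs(theta) + 0.5 * np.sin(8.0 * theta)

def smoothed_cost(theta, sigma=0.5, n_samples=2000):
    """Monte-Carlo estimate of E_{t ~ N(theta, sigma^2)}[cost(t)]."""
    t = rng.normal(theta, sigma, size=n_samples)
    return cost(t).mean()

c = smoothed_cost(0.0)   # the wiggles largely average out under smoothing
```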
Continuation Methods II

Problems?
  The cost function might not become convex, no matter how much it is smoothed
  Designed to deal with local minima; not very helpful for NNs whose main difficulties are not local minima

Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 30 / 60
Curriculum Learning

Curriculum learning (or shaping) [1]: make the cost function easier by increasing the influence of simpler examples
  E.g., by assigning them larger weights in the new cost function
  Or by sampling them more frequently
How to define "simple" examples?
  Face image recognition: front view (easy) vs. side view (hard)
  Sentiment analysis for movie reviews: 0-/5-star reviews (easy) vs. 1-/2-/3-/4-star reviews (hard)
Learn simple concepts first, then learn more complex concepts that depend on these simpler concepts
  Just like how humans learn
  Knowing the principles, we are less likely to explain an observation using special (but wrong) rules

Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 31 / 60
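A sketch of the sampling variant: bias minibatch sampling toward easy examples early in training. The difficulty scores here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical difficulty scores for 6 training examples (0 = easy, 1 = hard)
difficulty = np.array([0.1, 0.9, 0.2, 0.8, 0.1, 0.7])

# Early in training, favor easy examples: weight each example
# inversely to its difficulty, then normalize into probabilities
weights = 1.0 - difficulty
probs = weights / weights.sum()

# Draw one minibatch of 4 indices; easy examples appear more often.
# As training progresses, probs can be annealed back toward uniform.
batch_idx = rng.choice(len(difficulty), size=4, p=probs)
```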
Regularization

The goal of an ML algorithm is to perform well not just on the training data, but also on new inputs
Regularization: techniques that reduce the generalization error (but not necessarily the training error) of an ML algorithm
  By expressing a preference for a simpler model
  By providing different perspectives on how to explain the training data
  By encoding prior knowledge

Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 33 / 60
Regularization in Deep Learning I

I have big data; do I still need to regularize my NN?
  After all, the excess error is dominated by the optimization error (time)
Generally, yes!
For "hard" problems, the true data-generating process is almost certainly outside the model family
  E.g., problems in the image, audio-sequence, and text domains
  The true generating process essentially involves simulating the entire universe
In these domains, the best-fitting model (the one with the lowest generalization error) is usually a larger model regularized appropriately

Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 34 / 60
Regularization in Deep Learning II

For "easy" problems, regularization may be necessary to make the problems well defined
For example, when applying logistic regression to a linearly separable dataset:
  argmax_w log ∏_i P(y^(i) | x^(i); w) = argmax_w log ∏_i σ(wᵀx^(i))^{y^(i)} [1 − σ(wᵀx^(i))]^{1−y^(i)}
If a weight vector w is able to achieve perfect classification, so is 2w
  Furthermore, 2w gives a higher likelihood
  Without regularization, SGD will continually increase w's magnitude
A deep NN is likely to be able to separate a dataset and has a similar issue

Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 35 / 60
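The scaling issue can be checked numerically on a tiny separable dataset; the data and the `log_likelihood` helper below are illustrative:

```python
import numpy as np

def log_likelihood(w, X, y):
    p = 1.0 / (1.0 + np.exp(-X @ w))          # sigma(w^T x) per example
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# A linearly separable 1-D dataset: negative x -> class 0, positive x -> class 1
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

w = np.array([1.0])                     # already classifies every example correctly
ll_w = log_likelihood(w, X, y)
ll_2w = log_likelihood(2.0 * w, X, y)   # doubling w raises the likelihood further
```

Since scaling w up always increases the likelihood on separable data, an unregularized optimizer never stops growing the weights.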
Weight Decay

To add norm penalties:
  argmin_Θ C(Θ) + α Ω(Θ)
Ω can be, e.g., the L¹- or L²-norm
Should we penalize Ω(W), Ω(W^(k)), the rows Ω(W^(k)_{i,:}), or the columns Ω(W^(k)_{:,j})?
Limiting the column norms Ω(W^(k)_{:,j}), ∀j,k, is preferred [5]
  Prevents any one hidden unit from having very large weights and z^(k)_j

Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 37 / 60
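The difference between a matrix-wide penalty and per-column norms can be seen on a small example (the weight values are illustrative):

```python
import numpy as np

# Weights of one layer; column j collects the incoming weights of hidden unit j
W = np.array([[3.0, 0.1],
              [4.0, 0.2]])

omega_frob = np.sum(W ** 2)             # one penalty for the whole matrix
col_norms = np.linalg.norm(W, axis=0)   # one L2 norm per hidden unit

# Per-column penalties expose that unit 0 (norm 5.0) has very large
# incoming weights, which a single matrix-wide penalty would average away.
```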
Explicit Weight Decay I

Explicit norm constraints:
  argmin_Θ C(Θ) subject to Ω(Θ) ≤ R
To solve this problem, we can use projected SGD:
  At each step t, update Θ^(t+1) as in SGD
  If Θ^(t+1) falls outside the feasible set, project it back to the tangent space (edge) of the feasible set
Advantage?
  Prevents dead units that do not contribute much to the behavior of the NN due to too-small weights
  Explicit constraints do not push the weights toward the origin

Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 38 / 60
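The reprojection step can be sketched as a max-norm constraint applied to each column after a plain SGD update (the function name and the radius R are illustrative):

```python
import numpy as np

def project_columns(W, R):
    """Rescale any column whose L2 norm exceeds R back onto the edge of
    the feasible set; columns already inside the norm ball are untouched."""
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    scale = np.minimum(1.0, R / np.maximum(norms, 1e-12))
    return W * scale

W = np.array([[3.0, 0.3],
              [4.0, 0.4]])       # column norms: 5.0 and 0.5
W_proj = project_columns(W, R=1.0)
```

Note the projection only shrinks columns that violate the constraint; small weights are never pushed further toward the origin, unlike a penalty term.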
Explicit Weight Decay II

Also prevents instability due to a large learning rate
  Reprojection clips the weights and improves numeric stability
Hinton et al. [5] recommend using
  explicit constraints + reprojection + a large learning rate
to allow rapid exploration of the parameter space while maintaining numeric stability

Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 39 / 60
Data Augmentation

Theoretically, the best way to improve the generalizability of a model is to train it on more data
For some ML tasks, it is not hard to create new fake data
  In classification, we can generate new (x, y) pairs by transforming an example input x^(i) while keeping the same y^(i)
  E.g., scaling, translating, rotating, or flipping images (the x^(i)'s)
  Very effective in image object recognition and speech recognition tasks
Caution: do not apply transformations that would change the correct class!
  E.g., in OCR tasks, avoid horizontal flips for 'b' and 'd', and 180° rotations for '6' and '9'

Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 41 / 60
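The label-preserving transforms above are one-liners in NumPy; the 3×3 "image" here is a toy stand-in for real data:

```python
import numpy as np

img = np.arange(9).reshape(3, 3)          # a tiny toy "image"

flipped = np.fliplr(img)                  # horizontal flip
rotated = np.rot90(img, k=2)              # 180-degree rotation
shifted = np.roll(img, shift=1, axis=1)   # 1-pixel horizontal translation

# Each transform yields a new training input with the same label y --
# unless the transform itself changes the class ('b' vs. 'd', '6' vs. '9').
```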
Noise and Adversarial Data

NNs are not very robust to perturbations of the input (the x^(i)'s)
  Noise [11]
  Adversarial points [3]
How can we improve the robustness?

Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 42 / 60
Noise Injection

We can train an NN with artificial random noise applied to the x^(i)'s
Why does noise injection work?
Recall that the analytic solution of ridge regression is
  w = (XᵀX + αI)⁻¹ Xᵀy
In this case, weight decay = adding variance (noise) to the input
More generally, noise injection makes the function f locally constant
  The cost function C becomes insensitive to small variations in the weights
  It finds solutions that are not merely minima, but minima surrounded by flat regions

Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 43 / 60
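The analytic solution can be checked directly; the example below also shows the shrinkage that the added αI term causes (the data and the α value are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data with known weights plus a little label noise
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

def ridge(X, y, alpha):
    """Analytic ridge solution w = (X^T X + alpha I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

w_ols = ridge(X, y, alpha=0.0)      # ordinary least squares
w_reg = ridge(X, y, alpha=10.0)     # weight decay shrinks the solution
```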
Variants

We can also inject noise into the hidden representations [8]
  Highly effective, provided that the magnitude of the noise is carefully tuned
Batch normalization, in addition to simplifying optimization, offers a regularization effect similar to noise injection
  It injects noise from the examples in a minibatch into an activation a^(k)_j
How about injecting noise into the outputs (the y^(i)'s)?
  Already done in probabilistic models

Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 44 / 60
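Injecting noise into a hidden representation is a one-line change at training time; the layer, weights, and noise level below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def hidden_layer(x, W, noise_std=0.0):
    """A ReLU layer; at training time, optionally add Gaussian noise
    to the hidden activations as a regularizer (noise_std=0 at test time)."""
    a = np.maximum(0.0, x @ W)
    if noise_std > 0.0:
        a = a + rng.normal(0.0, noise_std, size=a.shape)
    return a

x = np.ones((1, 2))
W = np.array([[0.5, -1.0],
              [0.5,  2.0]])
clean = hidden_layer(x, W)                  # deterministic activations
noisy = hidden_layer(x, W, noise_std=0.1)   # perturbed at training time
```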
Outline
1 Optimization
    Momentum & Nesterov Momentum
    AdaGrad & RMSProp
    Batch Normalization
    Continuation Methods & Curriculum Learning
2 Regularization
    Weight Decay
    Data Augmentation
    Dropout
    Manifold Regularization
    Domain-Specific Model Design
Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 45 / 60
Ensemble Methods

Ensemble methods can improve generalizability by offering different explanations of X
    Voting: reduces the variance of predictions if the voters are independent
    Bagging: resamples X to make the voters less dependent
    Boosting: increases the confidence (margin) of predictions, if not overfitting
Ensemble methods in deep learning?
    Voting: train multiple NNs
    Bagging: train multiple NNs, each with a resampled X
        GoogLeNet [10], winner of ILSVRC'14, is an ensemble of 6 NNs
        Very time-consuming to ensemble a large number of NNs
Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 46 / 60
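A minimal voting sketch (the logits below are made-up illustrative numbers, not from any trained model): average the per-voter class probabilities, then take the argmax.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical logits from 3 independently trained classifiers,
# for 2 examples and 3 classes (illustrative numbers).
logits = np.array([
    [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]],   # voter 1
    [[1.8, 0.7, 0.2], [0.1, 0.4, 1.6]],   # voter 2
    [[2.2, 0.4, 0.0], [0.3, 1.7, 0.2]],   # voter 3
])

# Voting: average class probabilities across voters, then argmax.
avg_probs = softmax(logits).mean(axis=0)
prediction = avg_probs.argmax(axis=-1)
print(prediction)  # -> [0 1]
```

On the second example the voters disagree (two favor class 1, one favors class 2); averaging the probabilities resolves the disagreement.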
Dropout I

Dropout: feature-based bagging
    Resamples the input as well as the latent features
    With parameter sharing among the voters
SGD training: each time a minibatch is loaded, randomly sample a binary mask to apply to all input and hidden units
    Each unit has probability α of being included (a hyperparameter)
        Typically, 0.8 for input units and 0.5 for hidden units
    Different minibatches are used to train different parts of the NN
Similar to bagging, but much more efficient
    No need to retrain the unmasked units (parameters are shared)
    Exponential number of voters
Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 47 / 60
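The per-minibatch masking step can be sketched as follows (a toy two-layer forward pass; the shapes, weights, and keep probabilities are assumptions for illustration, not the slides' exact network):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_mask(shape, keep_prob, rng):
    """Binary mask: each unit is kept independently with probability keep_prob."""
    return (rng.random(shape) < keep_prob).astype(float)

def forward_train(x, W1, W2, rng, keep_input=0.8, keep_hidden=0.5):
    """Training-time forward pass: a fresh mask is sampled per minibatch."""
    x = x * dropout_mask(x.shape, keep_input, rng)   # drop input units
    h = np.maximum(0.0, x @ W1)                      # ReLU hidden layer
    h = h * dropout_mask(h.shape, keep_hidden, rng)  # drop hidden units
    return h @ W2

x = rng.normal(size=(4, 10))            # minibatch of 4 examples
W1 = 0.1 * rng.normal(size=(10, 32))
W2 = 0.1 * rng.normal(size=(32, 3))
y_hat = forward_train(x, W1, W2, rng)
print(y_hat.shape)  # (4, 3)
```

Each call samples a new mask, so successive minibatches train different sub-networks of the same shared parameters.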
Dropout II

How to vote to make a final prediction?
Mask sampling:
    1 Randomly sample some (typically 10~20) masks
    2 For each mask, apply it to the trained NN and get a prediction
    3 Average the predictions
Weight scaling:
    Make a single prediction using the NN with all units
    But the weights going out of each unit are multiplied by α
    Heuristic: each unit then outputs the same expected value as in training
Which works better is problem-dependent
Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 48 / 60
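The two inference rules can be compared on a single linear layer, where weight scaling matches the expectation over masks exactly; for deep nonlinear nets it remains a heuristic. A sketch with made-up weights:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=8)          # one input example (illustrative)
W = rng.normal(size=(8, 3))     # weights of a single linear layer
alpha = 0.5                     # keep probability

# Weight scaling: one pass with all units, scaled by alpha.
y_scaled = (alpha * x) @ W

# Mask sampling: average predictions over many random masks
# (far more than the typical 10~20, to expose the expectation).
K = 20000
preds = np.zeros(3)
for _ in range(K):
    mask = (rng.random(8) < alpha).astype(float)
    preds += (mask * x) @ W
y_sampled = preds / K

print(np.round(y_scaled, 2), np.round(y_sampled, 2))  # nearly identical
```

For this linear layer E[(mask * x) @ W] = (alpha * x) @ W exactly; nonlinearities between layers are what make weight scaling only approximate in practice.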
Dropout III

Dropout improves generalization beyond ensembling
For example, in face image recognition:
    If there is a unit that detects a nose
    Dropping the unit encourages the model to learn a mouth (or the nose again) in another unit
Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 49 / 60
Outline
1 Optimization
    Momentum & Nesterov Momentum
    AdaGrad & RMSProp
    Batch Normalization
    Continuation Methods & Curriculum Learning
2 Regularization
    Weight Decay
    Data Augmentation
    Dropout
    Manifold Regularization
    Domain-Specific Model Design
Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 50 / 60
Manifolds I

One way to improve the generalizability of a model is to incorporate prior knowledge
In many applications, data of the same class concentrate around one or more low-dimensional manifolds
    A manifold is a topological space that is locally linear
Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 51 / 60
Manifolds II
For each point x on a manifold, we have its tangent space spannedby tangent vectors
These local directions specify how one can change x infinitesimally while staying on the manifold
Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 52 / 60
Tangent Prop

How to incorporate the manifold prior into a model?
Suppose we have the tangent vectors {v^(i,j)}_j for each example x^(i)
Tangent Prop [9] trains an NN classifier f with the cost penalty:

    Ω[f] = Σ_{i,j} (∇_x f(x^(i))⊤ v^(i,j))²

    This makes f locally constant along the tangent directions
How to obtain the {v^(i,j)}_j?
    Manually specified based on domain knowledge
        Images: scaling, translating, rotating, flipping, etc.
    Or learned automatically (to be discussed later)
Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 53 / 60
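The penalty can be sketched with finite differences standing in for automatic differentiation (the toy classifier and tangent vectors are my own illustration): a tangent direction along which f is invariant contributes zero penalty, while a direction that changes f is penalized.

```python
import numpy as np

def tangent_penalty(f, X, V, eps=1e-4):
    """Tangent Prop penalty: sum over examples i and tangent vectors j of the
    squared directional derivative of f along v^(i,j), approximated here by
    central finite differences (a sketch, not an autodiff implementation)."""
    total = 0.0
    for x, tangents in zip(X, V):
        for v in tangents:
            d = (f(x + eps * v) - f(x - eps * v)) / (2 * eps)
            total += d ** 2
    return total

# Toy classifier f(x) = sigmoid(w . x); any direction orthogonal to w
# leaves f unchanged, so it contributes (near) zero penalty.
w = np.array([1.0, 0.0])
f = lambda x: 1.0 / (1.0 + np.exp(-w @ x))

X = [np.array([0.5, -1.0])]
V = [[np.array([0.0, 1.0]),     # orthogonal to w: f is invariant, no penalty
      np.array([1.0, 0.0])]]    # along w: f changes, so this is penalized

print(tangent_penalty(f, X, V))
```

Adding this term to the training cost pushes the gradient of f to be orthogonal to the supplied tangent directions, which is exactly the "locally constant along the manifold" prior.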
Outline
1 Optimization
    Momentum & Nesterov Momentum
    AdaGrad & RMSProp
    Batch Normalization
    Continuation Methods & Curriculum Learning
2 Regularization
    Weight Decay
    Data Augmentation
    Dropout
    Manifold Regularization
    Domain-Specific Model Design
Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 54 / 60
Domain-Specific Prior Knowledge

If done right, incorporating domain-specific prior knowledge into a model is a highly effective way to improve generalizability
    A better f that "makes sense"
    May also simplify the optimization problem
Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 55 / 60
Word2vec

Weight tying leads to a simpler model
Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 56 / 60
Convolutional Neural Networks
Locally connected neurons for pattern detection at different locations
Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 57 / 60
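Local connectivity with shared weights is exactly what a convolution computes. A minimal 1-D sketch (following the cross-correlation convention used by most DL libraries; the signal and kernel are illustrative):

```python
import numpy as np

def conv1d_valid(x, k):
    """'Valid' 1-D convolution (really cross-correlation, as in most DL
    libraries): the same small kernel slides over every location, so one
    set of shared weights detects the same pattern at all positions."""
    n, m = len(x), len(k)
    return np.array([x[i:i + m] @ k for i in range(n - m + 1)])

x = np.array([0.0, 1.0, 2.0, 1.0, 0.0, 0.0])  # input signal
k = np.array([1.0, 2.0, 1.0])                  # a small "pattern detector"
print(conv1d_valid(x, k))  # -> [4. 6. 4. 1.]; peaks where the pattern appears
```

Each output needs only m weights regardless of the input length, which is the parameter sharing that makes CNNs a domain-specific prior for translation-invariant patterns.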
Reference I
[1] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41–48. ACM, 2009.

[2] Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In AISTATS, 2015.

[3] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

[4] Ian J. Goodfellow, Oriol Vinyals, and Andrew M. Saxe. Qualitatively characterizing neural network optimization problems. arXiv preprint arXiv:1412.6544, 2014.
Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 58 / 60
Reference II
[5] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

[6] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

[7] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[8] Ben Poole, Jascha Sohl-Dickstein, and Surya Ganguli. Analyzing noise in autoencoders and deep networks. arXiv preprint arXiv:1406.1831, 2014.
Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 59 / 60
Reference III
[9] Patrice Simard, Bernard Victorri, Yann LeCun, and John S. Denker. Tangent prop: a formalism for specifying selected invariances in an adaptive network. In NIPS, volume 91, pages 895–903, 1991.

[10] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.

[11] Yichuan Tang and Chris Eliasmith. Deep networks for robust visual recognition. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 1055–1062, 2010.
Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 60 / 60