Training and Generalization
o Overfitting, Underfitting, and Goldilocks Fitting
o Training, Validation, and Testing Data Sets
o Model Order, Model Capacity, Generalization Loss
Training and Generalization
§ Goal:
– Learn the "true relationship" from training data pairs $\{(y_k, x_k)\}_{k=0}^{N-1}$:
  $y_{\text{true}} = f_\theta(x) + \text{error}$
– What we learn needs to generalize beyond the training data.
§ Key parameters:
– $P$ = Model Order = number of parameters = dimension of $\theta \in \Re^P$
– $M \times N$ = # training points = (dimension of $y$) $\times$ (# of training pairs)
§ Key issues:
– If $P \gg M \times N$: Model order is too high, and there is a tendency to overfit.
– If $P \ll M \times N$: Model order is too low, and there is a tendency to underfit.
Overfitting
§ Training data
§ Overfitting
– Model order too high
– Doesn't generalize well
[Figure: training points $(x_k, y_k)$ and an overfit high-order model in the $x$–$y$ plane]
Underfitting
§ Training data
§ Underfitting
– Model order too low
– Doesn't generalize well
[Figure: training points $(x_k, y_k)$ and an underfit low-order model in the $x$–$y$ plane]
Goldilocks Fitting
§ Training data
§ Best fitting
– Model order "just right"
– Best generalization
[Figure: training points $(x_k, y_k)$ and a well-fit model in the $x$–$y$ plane]
Partitioning of Labeled Data
§ Let $(y_k, x_k)$ for $k \in S = \{0, \cdots, N-1\}$ be the full data set.
– $x_k$ is the input data.
– $y_k$ is the label or "ground truth" data.
§ Typically, we randomly partition* the data into three subsets:
– $S_T$ is the training data
– $S_V$ is the validation data
– $S_E$ is the testing (evaluation) data
*Note that "partition" means $S = S_T \cup S_V \cup S_E$ and $\emptyset = S_T \cap S_V = S_T \cap S_E = S_V \cap S_E$.
§ For each partition, we define a loss function:
  $L_T(\theta) = \frac{1}{|S_T|} \sum_{k \in S_T} \left\| y_k - f_\theta(x_k) \right\|^2$
  $L_V(\theta) = \frac{1}{|S_V|} \sum_{k \in S_V} \left\| y_k - f_\theta(x_k) \right\|^2$
  $L_E(\theta) = \frac{1}{|S_E|} \sum_{k \in S_E} \left\| y_k - f_\theta(x_k) \right\|^2$
Roles of Data
§ Training data:
– Only data used to train the model:
  $\theta^* = \arg\min_\theta L_T(\theta) = \arg\min_\theta \frac{1}{|S_T|} \sum_{k \in S_T} \left\| y_k - f_\theta(x_k) \right\|^2$
§ Validation data:
– Used to compare models of different order.
§ Testing data:
– Used for final evaluation of model performance.
Loss Function Convergence
§ Loss vs. iterations of gradient-based optimization
• Notice:
– As training continues, the model overfits to the training data.
– Best to stop training when $L_V$ is at a minimum.
– Model order is too high, but early termination of training can help fix the problem.
[Figure: loss vs. iteration number; $L_T$ (training loss) keeps decreasing while $L_V$ (validation loss) reaches a minimum at the best stopping point; the gap between them is the generalization loss]
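A minimal sketch of early termination, assuming gradient descent on a deliberately high-order polynomial least-squares fit; the data, split, learning rate, and patience threshold are all illustrative.

import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 30))
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(30)
A = np.vander(x, 12)                                   # model order deliberately too high
A_t, y_t = A[::2], y[::2]                              # training half
A_v, y_v = A[1::2], y[1::2]                            # validation half

theta = np.zeros(A.shape[1])
best_loss, best_theta, patience = np.inf, theta, 0
for it in range(100000):
    grad = 2 * A_t.T @ (A_t @ theta - y_t) / len(y_t)  # gradient of L_T
    theta -= 1e-2 * grad
    L_V = np.mean((y_v - A_v @ theta) ** 2)            # validation loss
    if L_V < best_loss:
        best_loss, best_theta, patience = L_V, theta.copy(), 0
    else:
        patience += 1
        if patience > 500:                             # L_V stopped improving: terminate
            break
print(f"stopped at iteration {it}; best validation loss {best_loss:.3f}")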
Loss vs. Model Order vs. # Training Pairs
§ Loss vs. model order
§ Loss vs. # of training pairs
[Figures: loss vs. $P$ = model order, where $L_T$ (training loss) decreases with $P$ while $L_V$ (validation loss) has a minimum at the best model order/capacity; and loss vs. $N$ = # training pairs]
More training data is always better, but slower.
What are !" and !# telling you?§ Model order/model capacity may be too low…
§ Model order/model capacity may be too high…
Iteration #
Loss
$% - validation loss
$& - training loss
Large
Iteration #
Loss
$% - validation loss
$& - training loss
Small
Never Test on Training Data!
§ Never report training loss, $L_T$, as your ML system accuracy!
– This is like doing a homework problem after you have seen the solution.
– The network has "memorized" the answers.
§ Don't even report validation loss, $L_V$, as your ML system accuracy.
– It is also biased, because you tuned the model order and hyperparameters on it.
§ Only report testing loss, $L_E$, as your ML system accuracy.
– This data is sequestered to ensure it gives an unbiased estimate of loss.
[Figure: loss vs. $N$ = # training pairs, contrasting a curve labeled "This is great" with one labeled "But this is terrible"]
Solutions to Parameter Overfitting
1. Early termination
2. Regularization
– $L^1$ and $L^2$ weight regularization
– Loss function is modified to be
  $\tilde{L}(\theta) = L(\theta) + \lambda R(\theta)$
– Larger $\lambda$ ⇒ less overfitting
3. Dropout method: next slide
[Figures: loss vs. iteration number with $L_T$ (training loss), $L_V$ (validation loss), and the best stopping point; and loss vs. $1/\lambda$ (⇐ more regularization, less regularization ⇒) with the best regularization at the minimum of $L_V$]
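A minimal sketch of $L^2$ weight regularization under these definitions; the random data, mean-squared data-fit loss, and $\lambda = 0.1$ are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
y = rng.standard_normal(50)

def regularized_grad(theta, lam):
    """Gradient of L~(theta) = L(theta) + lam * ||theta||^2 for L = mean squared error."""
    return 2 * A.T @ (A @ theta - y) / len(y) + 2 * lam * theta

theta = np.zeros(20)
for _ in range(2000):
    theta -= 0.05 * regularized_grad(theta, lam=0.1)   # larger lam => smaller weights
print(np.linalg.norm(theta))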
The Dropout Method*
• Drop nodes with probability $p \approx 0.2$ (done independently for each batch during training).
• Scale all node outputs by the retention probability $1 - p$:
– To compute loss for validation and test
– During inference
[Figures: a network with randomly dropped nodes during training; the full network with scaled outputs during validation, testing, and inference]
*Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, "Dropout: A Simple Way to Prevent Neural Networks from Overfitting", Journal of Machine Learning Research, vol. 15, no. 56, pp. 1929–1958, 2014.
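A minimal numpy sketch of this scheme, assuming $p$ is the drop probability: nodes are zeroed independently during training, and all outputs are scaled by $1 - p$ for validation, testing, and inference. The batch and layer sizes are illustrative.

import numpy as np

rng = np.random.default_rng(0)
p = 0.2                                      # drop probability

def layer_train(h):
    """Training: zero each node independently with probability p (redrawn per batch)."""
    mask = rng.random(h.shape) >= p
    return h * mask

def layer_eval(h):
    """Validation/testing/inference: keep every node, scale outputs by 1 - p."""
    return h * (1.0 - p)

h = rng.standard_normal((32, 100))           # a batch of hidden-node outputs
print(layer_train(h).mean(), layer_eval(h).mean())   # means agree in expectation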
Stochastic Gradient Descent
o Batches and Epochs
o Learning Rate and Momentum
o Weight initialization and regularization
o Dropout method
Batches, Epochs, and Stochastic Gradient Descent (SGD)
§ Partition the training set into randomized batches:
  $\{1, \cdots, N\} = \bigcup_{j=1}^{J} B_j$, where $N_b = |B_j|$ = (# of samples per batch)
– For each batch you compute a separate gradient:
  $\nabla L(\theta; B_j)$ ⇐ gradient for the $j$th batch of training data

Stochastic Gradient Descent (SGD)
Repeat until converged {            ⇐ one epoch
  Repeat $j = 1$ to $J$ {            ⇐ one batch
    $d \leftarrow -\nabla L(\theta; B_j)$
    $\theta \leftarrow \theta + \alpha d$
  }
}
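A minimal numpy sketch of this loop for a linear least-squares model; the data, step size $\alpha$, and batch size $N_b$ are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
N, P = 1024, 10
X = rng.standard_normal((N, P))
y = X @ rng.standard_normal(P) + 0.1 * rng.standard_normal(N)

theta, alpha, N_b = np.zeros(P), 0.05, 32
for epoch in range(20):                      # repeat until converged
    perm = rng.permutation(N)                # re-randomize batches each epoch
    for j in range(N // N_b):                # one pass over all batches = one epoch
        B = perm[j * N_b:(j + 1) * N_b]      # indices of batch B_j
        d = -2 * X[B].T @ (X[B] @ theta - y[B]) / N_b   # d = -grad L(theta; B_j)
        theta += alpha * d
print(np.mean((X @ theta - y) ** 2))         # final training loss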
Theoretical Analysis of SGD
§ Assume simple sampling (sampling with replacement)
– $g_k = \nabla L_k(\theta)$ = (gradient from the $k$th training sample)
– Each sample, $g_{k_i}$, is i.i.d. with $p(g)$ = (the empirical distribution of $g_0, \cdots, g_{N-1}$)

True gradient: $g = \frac{1}{N} \sum_{k=0}^{N-1} g_k$
Batch gradient: $\hat{g} = \frac{1}{N_b} \sum_{i=0}^{N_b-1} g_{k_i}$

• Then
  $\hat{g} = g + \frac{W}{\sqrt{N_b}}$
(batch gradient = true gradient + noise), where $E[W] = 0$ and
  $\mathrm{Var}(W) = \frac{1}{N} \sum_{k=0}^{N-1} (g_k - g)(g_k - g)^T$
[Figure: the per-sample gradients $g_0, \cdots, g_{N-1}$, from which a batch $g_{k_0}, \cdots, g_{k_{N_b-1}}$ is drawn by random sampling with replacement]
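A small numpy check (not from the slides) that the batch-gradient noise power shrinks like $1/N_b$ under sampling with replacement; the synthetic per-sample gradients are an illustrative assumption.

import numpy as np

rng = np.random.default_rng(0)
N = 10000
g_k = rng.standard_normal((N, 5))            # per-sample gradients g_k
g = g_k.mean(axis=0)                         # true gradient

for N_b in [8, 32, 128]:
    idx = rng.integers(0, N, size=(2000, N_b))         # 2000 batches, with replacement
    g_hat = g_k[idx].mean(axis=1)                      # batch gradients
    noise = np.mean(np.sum((g_hat - g) ** 2, axis=1))  # E ||g_hat - g||^2
    print(f"N_b={N_b:4d}: noise power = {noise:.4f}  (should scale ~ 1/N_b)")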
Effect of Batch Size on SGD
§ Then we have that:
– As $N_b \to \infty$ (batch size goes up) ⇒ noise decreases
– As $N_b$ decreases (batch size goes down) ⇒ noise increases
[Figure: same sampling diagram as the previous slide, with $\hat{g} = g + \frac{W}{\sqrt{N_b}}$ (batch gradient = true gradient + noise)]
Effect of Gradient Noise: Exploration
§ $\hat{g} = g + \frac{W}{\sqrt{N_b}}$ (batch gradient = true gradient + noise)
[Figures: a smooth loss function and a bumpy loss function $L(\theta)$, with their negative gradients $-\nabla L(\theta)$ and noisy versions $-\hat{\nabla} L(\theta)$, labeled from "very good" to "very bad"; on a bumpy function, gradient noise helps the iterates escape shallow local minima, i.e., noise aids exploration]
Effect of Gradient Noise: Exploitation
§ $\hat{g} = g + \frac{W}{\sqrt{N_b}}$ (batch gradient = true gradient + noise)
[Figure: a smooth loss function $L(\theta)$ where following the true gradient is "perfect" but the noisy gradient is "not so good"; near a smooth minimum, gradient noise makes the iterates hunt around the minimum, i.e., noise hurts exploitation]
SGD Issues
§ Why does SGD work?
– The gradient for a small batch is much faster to compute and almost as good as the full gradient.
– If $N = 10{,}000$ and $N_b = |B_j| = 32$, then one iteration of SGD is approximately $\frac{10{,}000}{32} \approx 312$ times faster than one iteration of GD.
§ Batch size
– Larger batches: less "noise" in gradient ⇒
  • Worse: slower updates; less exploration.
  • Better: better local convergence.
– Smaller batches: more "noise" in gradient ⇒
  • Worse: hunts around the local minimum.
  • Better: faster updates; better exploration.
§ Patch size:
– Many algorithms train on image "patches".
– Apocryphal: smaller patches speed training. Not true!
§ Step size $\alpha$:
– Too large ⇒ hunts around the local minimum.
– Too small ⇒ slow convergence.
Momentum
§ SGD with momentum
– $\alpha$ is the step size, and $\beta$ is the momentum, typically with $\beta = 0.9$
– Interpretation:
  • $\theta$ is like position
  • $v$ is like velocity
  • Friction = $1 - \beta$

Stochastic Gradient Descent (SGD) with momentum
init $v \leftarrow 0$
Repeat until converged {            ⇐ for each epoch
  Repeat $j = 0$ to $J - 1$ {        ⇐ for each batch
    $d \leftarrow -\nabla L(\theta; B_j)$
    $v \leftarrow \beta v + \alpha d$    ⇐ momentum term
    $\theta \leftarrow \theta + v$
  }
}
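The same linear least-squares sketch as before, extended with the momentum term; $\alpha$, $\beta$, and the data remain illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
N, P = 1024, 10
X = rng.standard_normal((N, P))
y = X @ rng.standard_normal(P)

alpha, beta, N_b = 0.01, 0.9, 32             # step size, momentum, batch size
theta, v = np.zeros(P), np.zeros(P)          # position and velocity
for epoch in range(20):                      # for each epoch
    perm = rng.permutation(N)
    for j in range(N // N_b):                # for each batch
        B = perm[j * N_b:(j + 1) * N_b]
        d = -2 * X[B].T @ (X[B] @ theta - y[B]) / N_b
        v = beta * v + alpha * d             # momentum term (friction = 1 - beta)
        theta = theta + v
print(np.mean((X @ theta - y) ** 2))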
Interpretation of Momentum
§ Special case of impulsive input: If $g_n = \delta_n$
– Then
  $\theta_n = \theta_0 + \frac{\alpha}{1 - \beta}\left(1 - \beta^n\right)$

Momentum
init $v \leftarrow 0$
Repeat $n = 0$ to $N_t - 1$ {
  $v \leftarrow \beta v + \alpha \delta_n$
  $\theta \leftarrow \theta + v$
}

[Figure: impulse input $\delta_n$ and the resulting position $\theta_n$ and velocity $v_n$; $\theta_n$ rises toward the asymptotic value $\frac{\alpha}{1-\beta}$ with time constant $-1/\log\beta$; friction = $1 - \beta$]
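For example, with $\alpha = 0.1$ and $\beta = 0.9$, the asymptotic position change is $\alpha/(1-\beta) = 1.0$, and the time constant is $-1/\log 0.9 \approx 9.5$ iterations, so the response reaches about $1 - 1/e \approx 63\%$ of its final value after roughly ten updates.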
Nesterov Momentum*
§ SGD with Nesterov momentum
– $\alpha$ is the step size, and $\beta$ is the momentum, typically with $\beta \approx 0.9$
– Intuition:
  • Even if $d = 0$, we have that $\theta_{n+1} = \theta_n + \beta v$ because of momentum.
  • So compute the gradient at $\theta + \beta v$.

Stochastic Gradient Descent (SGD) with Nesterov momentum
init $v \leftarrow 0$
Repeat until converged {
  Repeat $j = 0$ to $J - 1$ {
    $d \leftarrow -\nabla L(\theta + \beta v; B_j)$    ⇐ Nesterov gradient
    $v \leftarrow \beta v + \alpha d$
    $\theta \leftarrow \theta + v$
  }
}

*Yu. E. Nesterov, "A method of solving a convex programming problem with convergence rate O(1/k²)", Doklady AN SSSR (translated as Soviet Math. Dokl.), vol. 269, no. 3, pp. 543–547, 1983.
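The momentum sketch above needs only a one-line change for Nesterov momentum: evaluate the gradient at the look-ahead point $\theta + \beta v$. Everything else (data, $\alpha$, $\beta$, $N_b$) stays illustrative.

import numpy as np

rng = np.random.default_rng(0)
N, P = 1024, 10
X = rng.standard_normal((N, P))
y = X @ rng.standard_normal(P)

alpha, beta, N_b = 0.01, 0.9, 32
theta, v = np.zeros(P), np.zeros(P)
for epoch in range(20):
    perm = rng.permutation(N)
    for j in range(N // N_b):
        B = perm[j * N_b:(j + 1) * N_b]
        look = theta + beta * v                          # look-ahead point
        d = -2 * X[B].T @ (X[B] @ look - y[B]) / N_b     # Nesterov gradient
        v = beta * v + alpha * d
        theta = theta + v
print(np.mean((X @ theta - y) ** 2))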
ADAM (Adaptive Moment Estimation)*
§ SGD with ADAM optimization = Momentum + Preconditioning
– Typical parameters: $\alpha = 0.001$; $\beta_1 = 0.9$; $\beta_2 = 0.999$; $\epsilon = 10^{-8}$
– All operations on vectors are element-wise.

ADAM Optimization
init $m \leftarrow 0$; $v \leftarrow 0$; $t \leftarrow 0$
Repeat until converged {
  Repeat $j = 0$ to $J - 1$ {
    $t \leftarrow t + 1$
    $g \leftarrow -\nabla L(\theta; B_j)$
    $m \leftarrow \beta_1 m + (1 - \beta_1)\, g$    ⇐ momentum
    $v \leftarrow \beta_2 v + (1 - \beta_2)\, g^2$    ⇐ average squared gradient
    $\hat{m} \leftarrow m / (1 - \beta_1^t)$    ⇐ hack to make it work better at the start
    $\hat{v} \leftarrow v / (1 - \beta_2^t)$
    $\theta \leftarrow \theta + \alpha \left(\sqrt{\hat{v}} + \epsilon\right)^{-1} \hat{m}$    ⇐ normalizes by RMS of gradient
  }
}

*Diederik P. Kingma and Jimmy Ba, "Adam: A Method for Stochastic Optimization", The 3rd International Conference for Learning Representations (ICLR), San Diego, 2015.
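A minimal numpy sketch of one ADAM step as written above (element-wise on vectors, with $g$ the negative gradient); the toy quadratic objective is an illustrative assumption.

import numpy as np

def adam_step(theta, m, v, t, g, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update; g is the negative gradient, all operations element-wise."""
    t += 1
    m = beta1 * m + (1 - beta1) * g              # momentum
    v = beta2 * v + (1 - beta2) * g ** 2         # average squared gradient
    m_hat = m / (1 - beta1 ** t)                 # bias corrections for the start
    v_hat = v / (1 - beta2 ** t)
    theta = theta + alpha * m_hat / (np.sqrt(v_hat) + eps)  # RMS-normalized step
    return theta, m, v, t

# Minimize L(theta) = ||theta - 3||^2, so the negative gradient is -2*(theta - 3).
theta, m, v, t = np.zeros(4), np.zeros(4), np.zeros(4), 0
for _ in range(5000):
    theta, m, v, t = adam_step(theta, m, v, t, -2 * (theta - 3.0))
print(theta)                                     # approaches [3, 3, 3, 3]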