Training and Generalization - Purdue University


Page 1: Training and Generalization

Training and Generalization
o Overfitting, Underfitting, and Goldilocks Fitting
o Training, Validation, and Testing Data Sets
o Model Order, Model Capacity, Generalization Loss

Page 2: Training and Generalization

Training and Generalization

§ Goal:
– Learn the “true relationship” from training data pairs $\{(y_k, x_k)\}_{k=0}^{N-1}$, where
$$y_{\text{train}} = f(x) + \text{error}$$
– What we learn needs to generalize beyond the training data.

§ Key parameters:
– $P$ = Model Order = number of parameters = dimension of $\theta \in \Re^P$
– $M \times N$ = # of training points = (dimension of $y$) $\times$ (# of training pairs)

§ Key issues (illustrated by the sketch below):
– If $P \gg M \times N$: Model order is too high, and there is a tendency to overfit.
– If $P \ll M \times N$: Model order is too low, and there is a tendency to underfit.
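To make the trade-off concrete, here is a minimal NumPy sketch with made-up data (none of it from the slides): polynomial models of increasing order $P$ are fit to noisy samples of a smooth function, and as $P$ grows relative to the amount of training data the training loss keeps dropping while the validation loss typically gets worse.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of a "true relationship" y = sin(2*pi*x) + error (made-up data)
N = 30
x_train = rng.uniform(0.0, 1.0, N)
y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.standard_normal(N)
x_val = rng.uniform(0.0, 1.0, N)
y_val = np.sin(2 * np.pi * x_val) + 0.1 * rng.standard_normal(N)

# Sweep the model order P = number of polynomial coefficients
for P in [2, 4, 8, 16]:
    coeffs = np.polyfit(x_train, y_train, deg=P - 1)   # P free parameters
    train_loss = np.mean((y_train - np.polyval(coeffs, x_train)) ** 2)
    val_loss = np.mean((y_val - np.polyval(coeffs, x_val)) ** 2)
    print(f"P={P:2d}  training loss={train_loss:.4f}  validation loss={val_loss:.4f}")
```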

Page 3: Overfitting

Overfitting
§ Training data
§ Overfitting
– Model order too high
– Doesn’t generalize well

[Figure: training data points $(x_k, y_k)$ plotted as $y$ vs. $x$, and a high-order fitted curve that follows the training points but does not generalize well.]

Page 4: Underfitting

Underfitting
§ Training data
§ Underfitting
– Model order too low
– Doesn’t generalize well

[Figure: training data points $(x_k, y_k)$ plotted as $y$ vs. $x$, and a low-order fitted curve that is too simple to follow the data.]

Page 5: Goldilocks Fitting

Goldilocks Fitting
§ Training data
§ Best fitting
– Model order “just right”
– Best generalization

[Figure: training data points $(x_k, y_k)$ plotted as $y$ vs. $x$, and a fitted curve whose model order is “just right”.]

Page 6: Partitioning of Labeled Data

Partitioning of Labeled Data
§ Let $(y_k, x_k)$ for $k \in S = \{0, \cdots, N-1\}$ be the full set of data.
– $x_k$ is the input data.
– $y_k$ is the label or “ground truth” data.

§ Typically, we randomly partition* the data into three subsets:
– $S_T$ is the training data
– $S_V$ is the validation data
– $S_E$ is the testing (evaluation) data

*Note that “partition” means $S = S_T \cup S_V \cup S_E$ and $\emptyset = S_T \cap S_V = S_T \cap S_E = S_V \cap S_E$.

§ For each partition, we define a loss function (computed in the sketch below):
$$L_T(\theta) = \frac{1}{|S_T|} \sum_{k \in S_T} \left\| y_k - f_\theta(x_k) \right\|^2$$
$$L_V(\theta) = \frac{1}{|S_V|} \sum_{k \in S_V} \left\| y_k - f_\theta(x_k) \right\|^2$$
$$L_E(\theta) = \frac{1}{|S_E|} \sum_{k \in S_E} \left\| y_k - f_\theta(x_k) \right\|^2$$
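A minimal NumPy sketch of this partitioning and of the three loss functions (the data, the linear model $f_\theta$, and the 70/15/15 split below are made-up assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy labeled data set: inputs x_k and labels y_k from a noisy linear relationship (made up)
N = 1000
x = rng.standard_normal((N, 3))
y = x @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(N)

# Random partition of the index set S = {0, ..., N-1} into S_T, S_V, S_E
perm = rng.permutation(N)
S_T, S_V, S_E = perm[:700], perm[700:850], perm[850:]

def loss(theta, S):
    """(1/|S|) * sum_{k in S} || y_k - f_theta(x_k) ||^2, with a linear model f_theta."""
    return np.mean((y[S] - x[S] @ theta) ** 2)

# Fit theta on the training partition only, then evaluate all three losses
theta = np.linalg.lstsq(x[S_T], y[S_T], rcond=None)[0]
print("L_T =", loss(theta, S_T), " L_V =", loss(theta, S_V), " L_E =", loss(theta, S_E))
```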

Page 7: Roles of Data

Roles of Data

§ Training data:
– The only data used to train the model:
$$\theta^* = \arg\min_\theta L_T(\theta) = \arg\min_\theta \frac{1}{|S_T|} \sum_{k \in S_T} \left\| y_k - f_\theta(x_k) \right\|^2$$

§ Validation data:
– Used to compare models of different order.

§ Testing data:
– Used for the final evaluation of model performance.

Page 8: Loss Function Convergence

Loss Function Convergence
§ Loss vs. iterations of gradient-based optimization

• Notice:
– As training continues, the model becomes overfit to the training data.
– It is best to stop training when the validation loss $L_V$ is at a minimum (see the sketch below).
– Here the model order is too high, but early termination of training can help fix the problem.

[Figure: training loss $L_T$ and validation loss $L_V$ vs. iteration number. The best stopping point is where $L_V$ reaches its minimum; the gap between $L_V$ and $L_T$ is the generalization loss.]
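A minimal sketch of early termination on a made-up, overparameterized least-squares problem (none of the data or constants below come from the slides): training continues only while the validation loss $L_V$ keeps improving.

```python
import numpy as np

rng = np.random.default_rng(0)

# Overparameterized toy problem (P > |S_T|), so the model can overfit (made-up data)
N, P = 40, 60
x = rng.standard_normal((N, P))
y = x[:, :3] @ np.array([1.0, -2.0, 0.5]) + 0.5 * rng.standard_normal(N)
idx = rng.permutation(N)
S_T, S_V = idx[:30], idx[30:]

def L(theta, S):                      # squared loss on one partition
    return np.mean((y[S] - x[S] @ theta) ** 2)

theta = np.zeros(P)
alpha, patience = 0.01, 100
best_theta, best_val, best_it = theta.copy(), np.inf, 0
for it in range(10000):
    grad = -2.0 / len(S_T) * x[S_T].T @ (y[S_T] - x[S_T] @ theta)   # gradient of L_T
    theta = theta - alpha * grad
    val = L(theta, S_V)
    if val < best_val:                # track the minimum of the validation loss
        best_theta, best_val, best_it = theta.copy(), val, it
    elif it - best_it >= patience:    # L_V has stopped improving: terminate early
        break

print(f"best L_V = {best_val:.4f} at iteration {best_it} (stopped at iteration {it})")
print(f"L_T at the early-stopped theta = {L(best_theta, S_T):.4f}")
```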

Page 9: Loss vs. Model Order vs. # Training Pairs

Loss vs. Model Order vs. # Training Pairs
§ Loss vs. model order
[Figure: training loss $L_T$ and validation loss $L_V$ vs. the model order $P$; the best model order/capacity is at the minimum of $L_V$.]

§ Loss vs. # of training pairs
[Figure: loss vs. the number of training pairs $N$.]
More training data is always better, but slower.

Page 10: What are $L_T$ and $L_V$ Telling You?

What are $L_T$ and $L_V$ telling you?
§ Model order/model capacity may be too low…
[Figure: $L_T$ and $L_V$ vs. iteration number, with both losses leveling off at a large value.]

§ Model order/model capacity may be too high…
[Figure: $L_T$ and $L_V$ vs. iteration number, with the training loss $L_T$ becoming small while the validation loss $L_V$ remains much larger.]

Page 11: Never Test on Training Data!

Never Test on Training Data!
§ Never report the training loss, $L_T$, as your ML system accuracy!
– This is like doing a homework problem after you have seen the solution.
– The network has “memorized” the answers.

§ Don’t even report the validation loss, $L_V$, as your ML system accuracy.
– This is also biased by the fact that you tuned your model-order parameters on it.

§ Only report the testing loss, $L_E$, as your ML system accuracy.
– This data is sequestered to ensure it gives an unbiased estimate of the loss.

[Figure: loss vs. the number of training pairs $N$, with annotations “This is great” and “But this is terrible”.]

Page 12: Solutions to Parameter Overfitting

Solutions to Parameter Overfitting
1. Early termination
[Figure: $L_T$ and $L_V$ vs. iteration number, with the best stopping point at the minimum of $L_V$.]

2. Regularization
– $L_1$ and $L_2$ weight regularization
– The loss function is modified to be (see the sketch below)
$$\tilde{L}(\theta) = L(\theta) + \lambda R(\theta)$$
– Larger $\lambda$ ⇒ less overfitting
[Figure: $L_T$ and $L_V$ vs. $1/\lambda$ (more regularization to the left, less regularization to the right); the best regularization is at the minimum of $L_V$.]

3. Dropout method: next slide
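A minimal sketch of $L_2$ weight regularization on a made-up least-squares problem (data and constants are illustrative assumptions): with $R(\theta) = \|\theta\|^2$, the regularizer simply adds $2\lambda\theta$ to the gradient, and larger $\lambda$ pulls the weights toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up regression problem with more parameters than strictly needed
N, P = 50, 30
x = rng.standard_normal((N, P))
y = x[:, :3] @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(N)

def fit_l2(lam, iters=2000, alpha=0.01):
    """Gradient descent on the regularized loss  L~(theta) = L(theta) + lam * ||theta||^2."""
    theta = np.zeros(P)
    for _ in range(iters):
        grad_L = -2.0 / N * x.T @ (y - x @ theta)   # gradient of the data-fit term L(theta)
        grad_R = 2.0 * lam * theta                  # gradient of R(theta) = ||theta||^2
        theta -= alpha * (grad_L + grad_R)
    return theta

for lam in [0.0, 0.01, 0.1, 1.0]:
    theta = fit_l2(lam)
    print(f"lambda={lam:5.2f}   ||theta|| = {np.linalg.norm(theta):.3f}")
```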

Page 13: The Dropout Method*

The Dropout Method*
• During training, drop nodes with probability $p \approx 0.2$, done independently for each batch.
• Scale all node outputs by the retention probability (see the sketch below):
– To compute the loss for validation and test
– During inference

[Figure, left: dropout applied during training, independently for each batch. Right: behavior during validation, testing, and inference, where outputs are scaled instead.]

*Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, Journal of Machine Learning Research, vol. 15, no. 56, pp. 1929–1958, 2014.
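A minimal NumPy sketch of the two regimes described above (the array size and $p$ are illustrative assumptions; note that many modern frameworks instead use “inverted dropout”, which scales by $1/(1-p)$ during training and leaves inference unscaled):

```python
import numpy as np

rng = np.random.default_rng(0)

p_drop = 0.2                      # probability of dropping a node

def dropout_train(h):
    """Training: zero each activation independently with probability p_drop (per batch)."""
    mask = rng.random(h.shape) >= p_drop
    return h * mask

def dropout_eval(h):
    """Validation / testing / inference: keep all nodes, scale by the retention probability."""
    return h * (1.0 - p_drop)

h = rng.standard_normal(8)        # stand-in for one layer's activations
print("train:", dropout_train(h))
print("eval :", dropout_eval(h))  # matches the training output in expectation
```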

Page 14: Stochastic Gradient Descent

Stochastic Gradient Descent
o Batches and Epochs
o Learning Rate and Momentum
o Weight Initialization and Regularization
o Dropout Method

Page 15: Batches, Epochs, and Stochastic Gradient Descent (SGD)

Batches, Epochs, and Stochastic Gradient Descent (SGD)

§ Partition the training set into randomized batches $B_b$, $b = 1, \cdots, B$, with $S_T = \bigcup_{b=1}^{B} B_b$.
– For each batch you compute a separate gradient:
$$\nabla L(\theta; B_b) \Leftarrow \text{gradient for the } b\text{-th batch of training data}$$

Stochastic Gradient Descent (SGD):
Repeat until converged {                 ← one epoch per pass through the inner loop
    Repeat b = 1 to B {                  ← one batch per inner iteration
        d ← −∇L(θ; B_b)
        θ ← θ + αd
    }
}

$N_b = |B_b|$ = (# of samples per batch)

(A runnable sketch of this loop follows below.)
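A minimal NumPy sketch of the SGD loop on a made-up least-squares problem (the data, batch size, and step size below are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up least-squares problem: y_k = x_k . theta_true + noise
N, P = 10_000, 5
x = rng.standard_normal((N, P))
theta_true = rng.standard_normal(P)
y = x @ theta_true + 0.1 * rng.standard_normal(N)

N_b = 32                # samples per batch
alpha = 0.05            # step size
theta = np.zeros(P)

for epoch in range(5):                                   # one epoch = one pass over all batches
    order = rng.permutation(N)                           # re-randomize the batches each epoch
    for start in range(0, N, N_b):                       # one iteration per batch
        idx = order[start:start + N_b]
        resid = y[idx] - x[idx] @ theta
        d = 2.0 / len(idx) * x[idx].T @ resid            # d = -gradient of the batch loss
        theta = theta + alpha * d                        # SGD update
    print(f"epoch {epoch}: full training loss = {np.mean((y - x @ theta) ** 2):.5f}")
```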

Page 16: Theoretical Analysis of SGD

Theoretical Analysis of SGD
§ Assume simple sampling (sampling with replacement):
$$g_k = \nabla L_k(\theta) = (\text{gradient from the } k\text{-th training sample})$$
– Each sampled gradient, $g_{k_i}$, is i.i.d. with distribution $P(g = g_k) = \frac{1}{N}$, i.e., uniform over the $N$ per-sample gradients.

[Figure: the per-sample gradients $g_0, g_1, \cdots, g_{N-1}$ are randomly sampled with replacement to form $g_{k_0}, g_{k_1}, \cdots, g_{k_{N_b-1}}$.]

True gradient:
$$g = \frac{1}{N} \sum_{k=0}^{N-1} g_k$$
Batch gradient:
$$\hat{g} = \frac{1}{N_b} \sum_{i=0}^{N_b-1} g_{k_i}$$

• Then
$$\hat{g} = g + \frac{W}{\sqrt{N_b}}$$
(batch gradient = true gradient + noise), where
$$E[W] = 0, \qquad \mathrm{Var}(W) = \frac{1}{N} \sum_{k=0}^{N-1} (g_k - g)(g_k - g)^t$$
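A quick empirical check of this decomposition on a made-up problem (everything below is an illustrative assumption, not from the slides): the batch gradient's deviation from the full gradient shrinks roughly like $1/\sqrt{N_b}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-sample gradients for a toy least-squares problem (made-up data)
N, P = 10_000, 5
x = rng.standard_normal((N, P))
y = x @ rng.standard_normal(P) + 0.1 * rng.standard_normal(N)
theta = np.zeros(P)

g_k = -2.0 * x * (y - x @ theta)[:, None]     # g_k = gradient of the k-th sample's loss, shape (N, P)
g_true = g_k.mean(axis=0)                     # true (full) gradient

for N_b in [8, 32, 128, 512]:
    # Empirical error of the batch gradient over many batches sampled with replacement
    errs = []
    for _ in range(500):
        idx = rng.integers(0, N, size=N_b)    # simple sampling (with replacement)
        g_hat = g_k[idx].mean(axis=0)
        errs.append(np.linalg.norm(g_hat - g_true))
    print(f"N_b={N_b:4d}  mean ||g_hat - g|| = {np.mean(errs):.4f}  (shrinks like 1/sqrt(N_b))")
```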

Page 17: Effect of Batch Size on SGD

Effect of Batch Size on SGD

§ Then we have that:
– As $N_b \to \infty$ (batch size goes up) ⇒ the noise decreases.
– As $N_b \to 0$ (batch size goes down) ⇒ the noise increases.

[Figure: as on the previous slide, the per-sample gradients $g_0, \cdots, g_{N-1}$ are randomly sampled with replacement to form the batch.]

True gradient:
$$g = \frac{1}{N} \sum_{k=0}^{N-1} g_k$$
Batch gradient:
$$\hat{g} = \frac{1}{N_b} \sum_{i=0}^{N_b-1} g_{k_i}$$
$$\hat{g} = g + \frac{W}{\sqrt{N_b}}$$
(batch gradient = true gradient + noise)

Page 18: Effect of Gradient Noise: Exploration

Effect of Gradient Noise: Exploration

[Figure: the loss plotted vs. $\theta$ for a smooth function and for a bumpy function, comparing steps along the true gradient and along the noisy batch gradient. Annotations: on the smooth function the true gradient is “very good” and the batch gradient is “good”; on the bumpy function the batch gradient is “better” while the true gradient is “very bad”, because the gradient noise helps the iterates explore past small local features.]

$$\hat{g} = g + \frac{W}{\sqrt{N_b}}$$
(batch gradient = true gradient + noise)

Page 19: Effect of Gradient Noise: Exploitation

Effect of Gradient Noise: Exploitation

[Figure: a smooth loss function near its minimum. Here the true gradient is “perfect”, while the noisy batch gradient is “not so good”, because the noise makes the iterates hunt around the minimum.]

$$\hat{g} = g + \frac{W}{\sqrt{N_b}}$$
(batch gradient = true gradient + noise)

Page 20: SGD Issues

SGD Issues
§ Why does SGD work?
– The gradient for a small batch is much faster to compute and almost as good as the full gradient.
– If $N = 10{,}000$ and $N_b = |B_b| = 32$, then one iteration of SGD is approximately $\frac{10{,}000}{32} \approx 312$ times faster to compute than one iteration of full-gradient descent (GD).

§ Batch size
– Larger batches: less “noise” in the gradient ⇒
• Worse: slower updates; less exploration.
• Better: better local convergence.
– Smaller batches: more “noise” in the gradient ⇒
• Worse: hunts around the local minimum.
• Better: faster updates; better exploration.

§ Patch size
– Many algorithms train on image “patches”.
– Apocryphal: smaller patches speed training. Not true!

§ Step size $\alpha$
– Too large ⇒ hunts around the local minimum.
– Too small ⇒ slow convergence.

Page 21: Momentum

Momentum
§ SGD with momentum
– $\alpha$ is the step size, and $\beta$ is the momentum, typically with $\beta = 0.9$.
– Interpretation:
• $\theta$ is like a position
• $v$ is like a velocity
• Friction $= 1 - \beta$

Stochastic Gradient Descent (SGD) with momentum:
init v ← 0
Repeat until converged {                 ← for each epoch
    Repeat b = 0 to B − 1 {              ← for each batch
        d ← −∇L(θ; B_b)
        v ← βv + αd                      ← momentum term
        θ ← θ + v
    }
}

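A minimal NumPy sketch of SGD with momentum on the same kind of made-up least-squares problem used earlier (data and constants are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up least-squares problem
N, P = 10_000, 5
x = rng.standard_normal((N, P))
y = x @ rng.standard_normal(P) + 0.1 * rng.standard_normal(N)

alpha, beta, N_b = 0.05, 0.9, 32       # step size, momentum, batch size
theta = np.zeros(P)
v = np.zeros(P)                         # velocity, initialized to zero

for epoch in range(5):                                               # for each epoch
    order = rng.permutation(N)
    for start in range(0, N, N_b):                                   # for each batch
        idx = order[start:start + N_b]
        d = 2.0 / len(idx) * x[idx].T @ (y[idx] - x[idx] @ theta)    # d = -gradient of the batch loss
        v = beta * v + alpha * d                                     # momentum term
        theta = theta + v                                            # position update
    print(f"epoch {epoch}: training loss = {np.mean((y - x @ theta) ** 2):.5f}")
```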

Page 23: Interpretation of Momentum

Interpretation of Momentum
§ Special case of an impulsive input: if $d_n = \delta_n$ (a unit impulse at $n = 0$), then
$$\theta_n = \theta_0 + \frac{\alpha}{1-\beta}\left(1 - \beta^n\right)$$

Momentum (impulse response):
init v ← 0
Repeat n = 0 to B − 1 {
    v ← βv + αδ_n
    θ ← θ + v
}

[Figure, left: the impulsive input $d_n = \delta_n$ vs. $n$. Right: the response $\theta_n$ vs. $n$, which rises toward the asymptotic value $\frac{\alpha}{1-\beta}$ with time constant $-1/\log\beta$, reaching $\frac{\alpha}{1-\beta}\left(1 - \frac{1}{e}\right)$ after one time constant.]

(A quick numerical check of this formula follows below.)
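A short numerical check of the impulse-response behavior (a sketch using made-up values of $\alpha$ and $\beta$): iterating v ← βv + αδ_n, θ ← θ + v drives θ toward α/(1 − β).

```python
import numpy as np

alpha, beta = 0.1, 0.9
theta, v = 0.0, 0.0

for n in range(100):
    d = 1.0 if n == 0 else 0.0        # impulsive input d_n = delta_n
    v = beta * v + alpha * d
    theta = theta + v

print("asymptotic value alpha/(1-beta) =", alpha / (1 - beta))
print("theta after 100 iterations      =", theta)
print("time constant -1/log(beta)      =", -1.0 / np.log(beta), "iterations")
```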

Page 24: Intuition

Intuition

[Figure: physical intuition for momentum: $\theta_n$ plays the role of a position, $v_n$ plays the role of a velocity, the friction is $1 - \beta$, and the time constant is $-1/\log\beta$.]

Page 25: Nesterov Momentum*

Nesterov Momentum*
§ SGD with Nesterov momentum
– $\alpha$ is the step size, and $\beta$ is the momentum, typically with $\beta \approx 0.9$.
– Intuition:
• Even if $d_n = 0$, we have that $\theta_{n+1} = \theta_n + \beta v_n$ because of the momentum.
• So compute the gradient at $\theta + \beta v$.

Stochastic Gradient Descent (SGD) with Nesterov momentum:
init v ← 0
Repeat until converged {
    Repeat b = 0 to B − 1 {
        d ← −∇L(θ + βv; B_b)             ← Nesterov gradient
        v ← βv + αd
        θ ← θ + v
    }
}

*Yu. E. Nesterov, “A method of solving a convex programming problem with convergence rate O(1/k^2)”, Doklady AN SSSR (translated as Soviet Math. Dokl.), vol. 269, no. 3, pp. 543–547, 1983.
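A minimal NumPy sketch of the Nesterov variant on the same kind of made-up least-squares problem (data and constants are illustrative assumptions); the only change from plain momentum is that the gradient is evaluated at the look-ahead point θ + βv.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up least-squares problem
N, P = 10_000, 5
x = rng.standard_normal((N, P))
y = x @ rng.standard_normal(P) + 0.1 * rng.standard_normal(N)

alpha, beta, N_b = 0.05, 0.9, 32
theta, v = np.zeros(P), np.zeros(P)

for epoch in range(5):
    order = rng.permutation(N)
    for start in range(0, N, N_b):
        idx = order[start:start + N_b]
        lookahead = theta + beta * v                                        # evaluate at theta + beta*v
        d = 2.0 / len(idx) * x[idx].T @ (y[idx] - x[idx] @ lookahead)       # Nesterov gradient
        v = beta * v + alpha * d
        theta = theta + v
    print(f"epoch {epoch}: training loss = {np.mean((y - x @ theta) ** 2):.5f}")
```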

Page 26: ADAM (Adaptive Moment Estimation)*

ADAM (Adaptive Moment Estimation)*
§ SGD with ADAM optimization = momentum + preconditioning
– Typical parameters: $\alpha = 0.001$; $\beta_1 = 0.9$; $\beta_2 = 0.999$; $\epsilon = 10^{-8}$
– All vector operations below are element-wise.

ADAM Optimization:
init m ← 0; v ← 0
init n ← 0
Repeat until converged {
    Repeat b = 0 to B − 1 {
        n ← n + 1
        d ← −∇L(θ; B_b)
        m ← β₁ m + (1 − β₁) d            ← momentum (first moment)
        v ← β₂ v + (1 − β₂) d²           ← average squared gradient (second moment)
        m̂ ← m / (1 − β₁^n)               ← bias correction: a hack to make it work better at the start
        v̂ ← v / (1 − β₂^n)
        θ ← θ + α m̂ / (√v̂ + ε)           ← normalizes the step by the RMS of the gradient
    }
}

*Diederik P. Kingma and Jimmy Ba, “Adam: A Method for Stochastic Optimization”, The 3rd International Conference for Learning Representations (ICLR), San Diego, 2015.

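A minimal NumPy sketch of these ADAM updates on a made-up least-squares problem (the data and the number of epochs are illustrative assumptions; the hyperparameters are the typical values listed above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up least-squares problem
N, P = 10_000, 5
x = rng.standard_normal((N, P))
y = x @ rng.standard_normal(P) + 0.1 * rng.standard_normal(N)

alpha, beta1, beta2, eps, N_b = 0.001, 0.9, 0.999, 1e-8, 32
theta = np.zeros(P)
m = np.zeros(P)                       # momentum (first moment)
v = np.zeros(P)                       # average squared gradient (second moment)
n = 0

for epoch in range(10):
    order = rng.permutation(N)
    for start in range(0, N, N_b):
        idx = order[start:start + N_b]
        n += 1
        d = 2.0 / len(idx) * x[idx].T @ (y[idx] - x[idx] @ theta)   # d = -gradient of the batch loss
        m = beta1 * m + (1 - beta1) * d
        v = beta2 * v + (1 - beta2) * d**2                          # element-wise square
        m_hat = m / (1 - beta1**n)                                  # bias correction
        v_hat = v / (1 - beta2**n)
        theta = theta + alpha * m_hat / (np.sqrt(v_hat) + eps)      # normalize by RMS of gradient
    print(f"epoch {epoch}: training loss = {np.mean((y - x @ theta) ** 2):.5f}")
```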