

Regularization in Neural Networks

Sargur Srihari, [email protected]


Topics in Neural Net Regularization
• Definition of regularization
• Methods
  1. Limiting capacity: number of hidden units
  2. Norm penalties: L2 and L1 regularization
  3. Early stopping
• Invariant methods
  4. Data set augmentation
  5. Tangent propagation


Definition of Regularization
• Generalization error
  • Performance on inputs not previously seen; also called the test error
• Regularization is:
  • "any modification to a learning algorithm to reduce its generalization error but not its training error"
  • Reduce generalization error even at the expense of increasing training error
• E.g., limiting model capacity is a regularization method


Goals of Regularization
• Some goals of regularization:
  1. Encode prior knowledge
  2. Express a preference for a simpler model
  3. Make an underdetermined problem determined


Need for Regularization

• Generalization: prevent over-fitting
• Occam's razor
• Bayesian point of view: regularization corresponds to prior distributions on model parameters


Best Model: one properly regularized

• The best-fitting model is not obtained by finding the right number of parameters
• Instead, the best-fitting model is a large model that has been regularized appropriately
• We review several strategies for creating such a large, deep, regularized model


Limiting no. of hidden units

• The number of input/output units is determined by the dimensionality of the data
• The number of hidden units M is a free parameter
  • Adjusted to get the best predictive performance
• A possible approach is to choose M to balance under-fitting and over-fitting


Effect of No. of Hidden Units

Sinusoidal Regression Problem

Two-layer network trained on 10 data points
M = 1, 3 and 10 hidden units
Minimizing a sum-of-squares error function using conjugate gradient descent
Generalization error is not a simple function of M due to the presence of local minima in the error function


Validation Set to fix no. of hidden units

(Figure: sum-of-squares test error vs. number of hidden units M)

• 30 random starts for each M: 30 points in each column of the graph
• Overall best validation-set performance occurred at M = 8
• Plot such a graph using random starts and different numbers of hidden units M, and choose the specific solution having the smallest generalization error (see the sketch after this list)
• There are other ways to control the complexity of a neural network in order to avoid over-fitting
• An alternative approach is to choose a relatively large value of M and then control complexity by adding a regularization term
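A minimal sketch of this selection-by-validation procedure, using scikit-learn's MLPRegressor on synthetic sinusoidal data (the library choice, the data-generation details, and the use of L-BFGS in place of conjugate gradients are assumptions, not part of the slides):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)

# Synthetic sinusoidal regression data (10 training points, as in the slide)
X_train = rng.uniform(0, 1, size=(10, 1))
y_train = np.sin(2 * np.pi * X_train).ravel() + rng.normal(scale=0.1, size=10)
X_val = rng.uniform(0, 1, size=(100, 1))
y_val = np.sin(2 * np.pi * X_val).ravel() + rng.normal(scale=0.1, size=100)

best = None
for M in range(1, 11):                      # candidate numbers of hidden units
    for start in range(30):                 # 30 random restarts per M
        net = MLPRegressor(hidden_layer_sizes=(M,), activation='tanh',
                           solver='lbfgs', alpha=0.0, max_iter=5000,
                           random_state=start)
        net.fit(X_train, y_train)
        err = mean_squared_error(y_val, net.predict(X_val))
        if best is None or err < best[0]:
            best = (err, M, start)

print("Best validation error %.4f at M=%d (restart %d)" % best)
```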


Parameter Norm Penalty
• Generalization error is not a simple function of M
  • Due to the presence of local minima
• Need to control capacity to avoid over-fitting
• Alternatively, choose a large M and control complexity by adding a regularization term
• The simplest regularizer is weight decay:

  \tilde{E}(w) = E(w) + \frac{\lambda}{2} w^T w

• Effective model complexity is determined by the choice of λ
• The regularizer is equivalent to a Gaussian prior over the weights w of the 2-layer neural network


Weight Decay

• The name weight decay is due to the following

• To prevent overfitting, every time we update a weight w using the gradient ∇J with respect to w, we also subtract λw from it.

• This gives the weights a tendency to decay towards zero, hence the name.
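As a minimal illustration (not from the slides), a plain gradient-descent step on the weight-decay-regularized error looks like this; eta and lam are assumed names for the learning rate and regularization coefficient:

```python
import numpy as np

def sgd_step_with_weight_decay(w, grad_E, eta, lam):
    """One gradient step on the regularized error E(w) + (lam/2) w^T w.

    The extra -eta*lam*w term shrinks the weights toward zero at every
    update, which is why the method is called weight decay.
    """
    return w - eta * (grad_E + lam * w)

# Example: with no data gradient the weights simply decay geometrically
w = np.array([1.0, -2.0])
for _ in range(3):
    w = sgd_step_with_weight_decay(w, grad_E=np.zeros_like(w), eta=0.1, lam=0.5)
print(w)   # each step multiplies w by (1 - eta*lam) = 0.95
```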


Effect of Neural Net Regularization


https://medium.com/@shay.palachy/understanding-the-scaling-of-l%C2%B2-regularization-in-the-context-of-neural-networks-e3d25f8b50db?fbclid=IwAR17rKfnM2JnkWBOZFvQg6ftl0k52diyEra_66UCwI18Gtf7nFd6l3l4VMY


Invariance to Transformation

• Simple weight decay has a shortcoming: it is not invariant to scaling
  • E.g., same building, different views and scales


Linear Transformation

https://mathinsight.org/determinant_linear_transformation


Ex: Two-layer MLP
• Simple weight decay is inconsistent with certain scaling properties of network mappings
• To show this, consider a multi-layer perceptron network with two layers of weights and linear output units
• Set of inputs x = {x_i} and outputs y = {y_k}
• Activations of hidden units in the first layer have the form

  z_j = h\left( \sum_i w_{ji} x_i + w_{j0} \right)

• Activations of output units are

  y_k = \sum_j w_{kj} z_j + w_{k0}


Linear Transformations of i/o Variables

• If we perform a linear transformation of the input data

  x_i \rightarrow \tilde{x}_i = a x_i + b

• Then the mapping performed by the network is unchanged by making a corresponding linear transformation of the weights and biases from the inputs to the hidden units:

  w_{ji} \rightarrow \tilde{w}_{ji} = \frac{1}{a} w_{ji} \quad \text{and} \quad w_{j0} \rightarrow \tilde{w}_{j0} = w_{j0} - \frac{b}{a} \sum_i w_{ji}

• A similar linear transformation of the outputs of the network

  y_k \rightarrow \tilde{y}_k = c y_k + d

  is achieved by transforming the second-layer weights and biases as

  w_{kj} \rightarrow \tilde{w}_{kj} = c\, w_{kj} \quad \text{and} \quad w_{k0} \rightarrow \tilde{w}_{k0} = c\, w_{k0} + d
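A quick numerical check of this claim (a sketch with an assumed random tanh network, not code from the slides): transforming the inputs as x → ax + b and compensating with the first-layer weight and bias transformations above leaves the hidden activations, and hence the outputs, unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, K = 3, 5, 2                      # input, hidden, output dimensions
x  = rng.normal(size=D)
W1 = rng.normal(size=(M, D)); b1 = rng.normal(size=M)   # first-layer weights/biases
W2 = rng.normal(size=(K, M)); b2 = rng.normal(size=K)   # second-layer weights/biases

def forward(x, W1, b1, W2, b2):
    z = np.tanh(W1 @ x + b1)           # z_j = h(sum_i w_ji x_i + w_j0)
    return W2 @ z + b2                 # y_k = sum_j w_kj z_j + w_k0

a, b = 2.5, -0.7                       # arbitrary linear transformation of the inputs
x_tilde  = a * x + b
W1_tilde = W1 / a                          # w_ji -> w_ji / a
b1_tilde = b1 - (b / a) * W1.sum(axis=1)   # w_j0 -> w_j0 - (b/a) * sum_i w_ji

y  = forward(x,       W1,       b1,       W2, b2)
y2 = forward(x_tilde, W1_tilde, b1_tilde, W2, b2)
print(np.allclose(y, y2))              # True: the mapping is unchanged
```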


Desirable invariance of regularizer
• Suppose we train two different networks
  • First network: trained using the original data x = {x_i}, y = {y_k}
  • Second network: input and/or target variables are transformed by one of the linear transformations

  x_i \rightarrow \tilde{x}_i = a x_i + b \quad \text{and/or} \quad y_k \rightarrow \tilde{y}_k = c y_k + d

• Consistency requires that we should obtain equivalent networks that differ only by a linear transformation of the weights:

  for the first layer \quad w_{ji} \rightarrow \frac{1}{a} w_{ji} \quad \text{and/or for the second layer} \quad w_{kj} \rightarrow c\, w_{kj}


Simple weight decay fails invariance
• A regularizer should have this property
  • Otherwise it arbitrarily favors one solution over another
• Simple weight decay

  \tilde{E}(w) = E(w) + \frac{\lambda}{2} w^T w

  does not have this property
  • It treats all weights and biases on an equal footing, while the resulting w_{ji} and w_{kj} should be treated differently
  • Consequently the two networks will have different weights, violating invariance
• We therefore look for a regularizer invariant under the linear transformations


Regularizer invariance
• The regularizer should be invariant to re-scaling of weights and shifts of biases
• Such a regularizer is

  \frac{\lambda_1}{2} \sum_{w \in W_1} w^2 + \frac{\lambda_2}{2} \sum_{w \in W_2} w^2

  • where W_1 is the set of weights in the first layer and W_2 is the set of weights in the second layer
• This regularizer remains unchanged under the weight transformations provided the parameters are rescaled using

  \lambda_1 \rightarrow a^{1/2} \lambda_1 \quad \text{and} \quad \lambda_2 \rightarrow c^{-1/2} \lambda_2

• We have seen before that weight decay is equivalent to a Gaussian prior; so what is the prior equivalent to this regularizer?


Bayesian linear regression

• Prior over weights:

  p(w \mid \alpha) = \mathcal{N}(w \mid 0, \alpha^{-1} I)

• Likelihood:

  p(t \mid X, w, \beta) = \prod_{n=1}^{N} \mathcal{N}\left( t_n \mid w^T \phi(x_n), \beta^{-1} \right)

• Posterior:

  p(w \mid t) \propto \prod_{n=1}^{N} \mathcal{N}\left( t_n \mid w^T \phi(x_n), \beta^{-1} \right) \mathcal{N}(w \mid 0, \alpha^{-1} I)

• Log of posterior is

  \ln p(w \mid t) = -\frac{\beta}{2} \sum_{n=1}^{N} \left\{ t_n - w^T \phi(x_n) \right\}^2 - \frac{\alpha}{2} w^T w + \text{const}

• Thus maximization of the posterior is equivalent to weight decay, i.e., minimizing the sum-of-squares error with an additional quadratic regularization term w^T w with λ = α/β:

  E(w) = \frac{1}{2} \sum_{n=1}^{N} \left\{ t_n - w^T \phi(x_n) \right\}^2 + \frac{\lambda}{2} w^T w

(Figure: Gaussian prior p(w | α) ~ N(0, 1/α) over the parameters (w_0, w_1) of the model y(x, w) = w_0 + w_1 x)
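A small numerical sketch of this equivalence (the polynomial basis and synthetic data are assumptions): the MAP weights under the Gaussian prior coincide with the weight-decay (ridge) solution when λ = α/β.

```python
import numpy as np

rng = np.random.default_rng(1)
N, deg = 20, 3
x = rng.uniform(-1, 1, N)
t = np.sin(np.pi * x) + rng.normal(scale=0.2, size=N)
Phi = np.vander(x, deg + 1, increasing=True)   # polynomial basis phi(x_n)

alpha, beta = 2.0, 25.0                        # prior and noise precisions
lam = alpha / beta                             # equivalent weight-decay coefficient

# MAP: maximize ln p(w|t)  <=>  minimize beta/2 * ||t - Phi w||^2 + alpha/2 * w^T w
w_map   = np.linalg.solve(beta * Phi.T @ Phi + alpha * np.eye(deg + 1), beta * Phi.T @ t)
# Regularized least squares (weight decay) with lambda = alpha / beta
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(deg + 1), Phi.T @ t)

print(np.allclose(w_map, w_ridge))             # True
```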


Equivalent prior for neural network
• Regularizer invariant to rescaled weights / shifted biases:

  \frac{\lambda_1}{2} \sum_{w \in W_1} w^2 + \frac{\lambda_2}{2} \sum_{w \in W_2} w^2

  • where W_1: weights of the first layer, W_2: weights of the second layer
• It corresponds to a prior of the form

  p(w \mid \alpha_1, \alpha_2) \propto \exp\left( -\frac{\alpha_1}{2} \sum_{w \in W_1} w^2 - \frac{\alpha_2}{2} \sum_{w \in W_2} w^2 \right)

  • α_1 and α_2 are hyper-parameters
• This is an improper prior which cannot be normalized
  • Leads to difficulties in selecting regularization coefficients and in model comparison within the Bayesian framework
• Instead, include separate priors for the biases with their own hyper-parameters


Separate priors for biases
• We can illustrate the effect of the resulting four hyper-parameters:
  • α_1^b, precision of the Gaussian distribution of first-layer biases
  • α_1^w, precision of the Gaussian distribution of first-layer weights
  • α_2^b, precision of the Gaussian distribution of second-layer biases
  • α_2^w, precision of the Gaussian distribution of second-layer weights


Effect of hyperparameters on i/o

• Priors are governed by the four hyper-parameters α_1^b, α_1^w, α_2^b, α_2^w (precisions of the Gaussians over first-layer biases, first-layer weights, second-layer biases, and second-layer weights)
• Network with a single input (x value range: -1 to +1), a single linear output (y value range: -60 to +40), and 12 hidden units with tanh activation functions
• For each hyper-parameter setting, draw five samples from the prior over the weights and plot the resulting input-output network functions (the five samples correspond to five colors)
• Observe: the vertical scale is controlled by α_2^w, the horizontal scale by α_1^w
(Figure: sampled network functions plotted as output vs. input for different hyper-parameter settings)
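A rough sketch of how such prior samples can be generated (the plotting details and the unit precisions used here are assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt

def sample_network_function(x, a1w, a1b, a2w, a2b, M=12, rng=None):
    """Draw one network from the Gaussian priors and evaluate it on inputs x.

    a1w, a1b: precisions of first-layer weight and bias priors
    a2w, a2b: precisions of second-layer weight and bias priors
    """
    if rng is None:
        rng = np.random.default_rng()
    W1 = rng.normal(scale=a1w ** -0.5, size=(M, 1))
    b1 = rng.normal(scale=a1b ** -0.5, size=(M, 1))
    W2 = rng.normal(scale=a2w ** -0.5, size=(1, M))
    b2 = rng.normal(scale=a2b ** -0.5)
    return (W2 @ np.tanh(W1 * x + b1) + b2).ravel()

x = np.linspace(-1, 1, 200)
rng = np.random.default_rng(0)
for _ in range(5):                      # five samples, one curve each
    plt.plot(x, sample_network_function(x, a1w=1.0, a1b=1.0, a2w=1.0, a2b=1.0, rng=rng))
plt.xlabel("input"); plt.ylabel("output"); plt.show()
```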


3. Early Stopping
• An alternative to regularization for controlling complexity
• Error measured with an independent validation set shows an initial decrease and then an increase
• Training is stopped at the point of smallest error on the validation data
• Effectively limits network complexity
(Figure: training-set error and validation-set error vs. iteration step)
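A minimal sketch of the early-stopping rule; train_step and validation_error are hypothetical caller-supplied helpers, not functions defined in the slides:

```python
def train_with_early_stopping(w, train_step, validation_error, max_iters=10000, patience=20):
    """Stop when the validation error has not improved for `patience` steps.

    train_step(w) -> updated weights; validation_error(w) -> scalar error.
    Both are assumed to be supplied by the caller.
    """
    best_w, best_err, since_best = w, validation_error(w), 0
    for _ in range(max_iters):
        w = train_step(w)
        err = validation_error(w)
        if err < best_err:
            best_w, best_err, since_best = w, err, 0
        else:
            since_best += 1
            if since_best >= patience:   # validation error has started to rise
                break
    return best_w, best_err              # weights at the point of smallest validation error
```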


Interpreting the effect of Early Stopping
• Consider a quadratic error function
  • Axes in weight space are parallel to the eigenvectors of the Hessian
• In the absence of weight decay, the weight vector starts at the origin and proceeds towards w_ML
• Stopping at a point w̃ is similar to weight decay with

  \tilde{E}(w) = E(w) + \frac{\lambda}{2} w^T w

• The effective number of parameters in the network grows during training
(Figure: contours of constant error; w_ML represents the minimum, w̃ the early-stopping solution)


Invariances
• In classification there is a need for predictions to be invariant to transformations of the input
• Example:
  • A handwritten digit should be assigned the same classification irrespective of its position in the image (translation) and its size (scale)
• Such transformations produce significant changes in the raw data (e.g., pixel intensities, or nonlinear time warping along the time axis in speech recognition), yet the classifier must produce the same output


Simple Approach for Invariance

• Use a large sample set in which all transformations are present
  • E.g., for translation invariance, examples of objects in many different positions
• Impractical
  • The number of different samples grows exponentially with the number of transformations
• Seek alternative approaches for the adaptive model to exhibit the required invariances


Approaches to Invariance
1. Training set augmented
  • Transform training patterns for desired invariances, e.g., shift each image into different positions
2. Regularization term added to the error function
  • Penalize changes in output when the input is transformed
  • Leads to tangent propagation
3. Invariance built into pre-processing
  • By extracting features invariant to transformations
4. Build invariance into the structure of the network (CNN)
  • Local receptive fields and shared weights


Approach 1: Transform each input

• Synthetically warp each handwritten digit image before presentation to the model
• Easy to implement but computationally costly
(Figure: original image and warped images, generated by random displacements Δx, Δy ∈ (0,1) at each pixel, then smoothed by convolutions of width 0.01, 30 and 60 respectively)
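A sketch of one such warp using SciPy's ndimage routines (the displacement range, smoothing width, and toy image are assumptions):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def random_warp(image, sigma=3.0, alpha=8.0, rng=None):
    """Warp a 2-D image with a smoothed random displacement field.

    Random displacements are drawn per pixel, smoothed by a Gaussian of
    width sigma, scaled by alpha, and used to resample the image.
    """
    if rng is None:
        rng = np.random.default_rng()
    dx = gaussian_filter(rng.uniform(-1, 1, image.shape), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, image.shape), sigma) * alpha
    rows, cols = np.meshgrid(np.arange(image.shape[0]),
                             np.arange(image.shape[1]), indexing='ij')
    coords = [rows + dy, cols + dx]
    return map_coordinates(image, coords, order=1, mode='reflect')

# Example: warp a synthetic 28x28 "digit"
img = np.zeros((28, 28)); img[8:20, 12:16] = 1.0
warped = random_warp(img)
```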


Approach 2: Tangent Propagation
• Regularization can be used to encourage invariance to transformations, via the technique of tangent propagation
• Consider the effect of a transformation on input x_n
• A 1-D continuous transformation parameterized by ξ applied to x_n sweeps a manifold M in the D-dimensional input space
(Figure: two-dimensional input space showing the effect of a continuous transformation with a single parameter ξ)
• Let the vector resulting from acting on x_n by this transformation be denoted by s(x_n, ξ), defined so that s(x, 0) = x. Then the tangent to the curve M is given by the directional derivative τ = ∂s/∂ξ, and the tangent vector at the point x_n is given by

  \tau_n = \left. \frac{\partial s(x_n, \xi)}{\partial \xi} \right|_{\xi = 0}


Tangent Propagation as Regularization
• Under a transformation of the input vector, the network output vector will change
• The derivative of output k with respect to ξ is given by

  \left. \frac{\partial y_k}{\partial \xi} \right|_{\xi=0} = \sum_{i=1}^{D} \frac{\partial y_k}{\partial x_i} \left. \frac{\partial x_i}{\partial \xi} \right|_{\xi=0} = \sum_{i=1}^{D} J_{ki} \tau_i

  • where J_{ki} is the (k, i) element of the Jacobian matrix J
• The result is used to modify the standard error function, to encourage local invariance in the neighborhood of the data points:

  \tilde{E} = E + \lambda \Omega

  where λ is a regularization coefficient and

  \Omega = \frac{1}{2} \sum_n \sum_k \left( \left. \frac{\partial y_{nk}}{\partial \xi} \right|_{\xi=0} \right)^2 = \frac{1}{2} \sum_n \sum_k \left( \sum_{i=1}^{D} J_{nki} \tau_{ni} \right)^2
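A numerical sketch of Ω for a single data point and a rotation transformation, with the Jacobian-tangent product Jτ approximated by a finite difference of the network output along τ (the small random network and step size are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, K = 2, 6, 3
W1 = rng.normal(size=(M, D)); b1 = rng.normal(size=M)
W2 = rng.normal(size=(K, M)); b2 = rng.normal(size=K)

def net(x):
    return W2 @ np.tanh(W1 @ x + b1) + b2

def s(x, xi):
    """Continuous transformation: rotate the 2-D input by angle xi."""
    c, s_ = np.cos(xi), np.sin(xi)
    return np.array([c * x[0] - s_ * x[1], s_ * x[0] + c * x[1]])

x = rng.normal(size=D)
eps = 1e-5
tau = (s(x, eps) - x) / eps                 # tangent vector tau = ds/dxi at xi = 0

# Directional derivative dy/dxi = J tau, approximated by finite differences
dy_dxi = (net(x + eps * tau) - net(x)) / eps

Omega_n = 0.5 * np.sum(dy_dxi ** 2)         # this point's contribution to Omega
print(Omega_n)
```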


Tangent vector from finite differences

(Figure: original image x; tangent vector τ corresponding to a small clockwise rotation; adding a small contribution from the tangent vector to the image, x + ετ; and the true rotated image for comparison)
• In a practical implementation the tangent vector τ_n is approximated using finite differences, by subtracting the original vector x_n from the corresponding vector obtained after transformation with a small value of ξ, and then dividing by ξ
• A related technique, tangent distance, is used to build invariance properties into distance-based methods such as nearest-neighbor classifiers
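A sketch of this finite-difference approximation for a small rotation, using scipy.ndimage.rotate (the angle, toy image, and ε are assumptions):

```python
import numpy as np
from scipy.ndimage import rotate

def rotation_tangent(image, xi_degrees=1.0):
    """Finite-difference tangent vector for a small clockwise rotation.

    tau ~ (s(x, xi) - x) / xi, where s rotates the image by xi degrees.
    """
    rotated = rotate(image, -xi_degrees, reshape=False, order=1)  # clockwise rotation
    return (rotated - image) / xi_degrees

img = np.zeros((28, 28)); img[8:20, 12:16] = 1.0   # toy image
tau = rotation_tangent(img)
eps = 5.0
approx_rotated = img + eps * tau     # adding a small contribution from the tangent vector
```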


Equivalence of Approaches 1 and 2

• Expanding the training set is closely related to tangent propagation

• Training with small transformations of the original input vectors, together with a sum-of-squares error function, can be shown to be equivalent to the tangent propagation regularizer
