
Model Selection and Validation



Page 1: Model Selection and Validation

Model Selection and Validation

“All models are wrong; some are useful.”

George E. P. Box

Some slides were taken from:
• J. C. Spall: Modeling Considerations and Statistical Information
• J. Hinton: Preventing overfitting
• Bei Yu: Model Assessment

Page 2: Model Selection and Validation

Overfitting

• The training data contains information about the regularities in the mapping from input to output. But it also contains noise:
  – The target values may be unreliable.
  – There is sampling error: there will be accidental regularities just because of the particular training cases that were chosen.
• When we fit the model, it cannot tell which regularities are real and which are caused by sampling error.
  – So it fits both kinds of regularity.
  – If the model is very flexible, it can model the sampling error really well. This is a disaster.

Page 3: Model Selection and Validation

A simple example of overfitting

• Which model do you believe?
  – The complicated model fits the data better.
  – But it is not economical.
• A model is convincing when it fits a lot of data surprisingly well.
  – It is not surprising that a complicated model can fit a small amount of data.

Page 4: Model Selection and Validation

Generalization

• The objective of learning is to achieve good generalization to new cases; otherwise, just use a look-up table.
• Generalization can be defined as a mathematical interpolation or regression over a set of training points:

[Figure: training points and an interpolating curve f(x) plotted against the input x]

Page 5: Model Selection and Validation

Generalization

• Over-training is the equivalent of over-fitting a set of data points with a curve that is too complex.
• Occam’s Razor (1300s, English logician): “plurality should not be assumed without necessity.”
• The simplest model that explains the majority of the data is usually the best.

Page 6: Model Selection and Validation

Generalization

Preventing over-training:
• Use a separate test or tuning set of examples.
• Monitor error on the test set as the network trains.
• Stop network training just prior to the over-fit error occurring: early stopping or tuning.
• The number of effective weights is reduced.
• Most new systems have automated early-stopping methods.

Page 7: Model Selection and Validation

Generalization

Weight decay: an automated method of effective-weight control.
• Adjust the backpropagation error function to penalize the growth of unnecessary weights:

  E = (1/2) Σ_j (t_j − o_j)² + (λ/2) Σ_{i,j} w_ij²

  where λ = weight-cost parameter.
• Each weight w_ij is decayed by an amount proportional to its magnitude (the penalty contributes −λ·w_ij to the update); weights that are not reinforced by the data decay toward 0.
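A minimal numpy sketch of this penalized error and its gradient, assuming a single linear layer o = W·x purely for illustration (the toy layer, names, and learning rate are not from the slides):

```python
import numpy as np

def decayed_error(t, o, W, lam):
    """E = 1/2 * sum_j (t_j - o_j)^2 + lam/2 * sum_{i,j} w_ij^2."""
    return 0.5 * np.sum((t - o) ** 2) + 0.5 * lam * np.sum(W ** 2)

def penalized_gradient(data_grad, W, lam):
    """The weight-cost term adds lam * w_ij to each gradient entry,
    so every update shrinks w_ij by an amount proportional to its size."""
    return data_grad + lam * W

# toy example: one linear layer o = W @ x
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))
x, t = rng.normal(size=3), np.array([1.0, -1.0])
o = W @ x
data_grad = np.outer(o - t, x)                 # gradient of 1/2 * ||t - o||^2 w.r.t. W
W -= 0.1 * penalized_gradient(data_grad, W, lam=0.01)
print(decayed_error(t, W @ x, W, lam=0.01))
```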

Page 8: Model Selection and Validation

Formal Model Definition

• Assume the model z = h(x, θ) + v, where z is the output, h(·) is some function, x is the input, v is noise, and θ is the vector of model parameters.
• A fundamental goal is to take n data points and estimate θ, forming the estimate θ̂_n.

Page 9: Model Selection and Validation

Model Error Definition

• Given a data set (x_i, y_i), i = 1, …, n.
• Given a model output h(x, θ̂_n), where θ̂_n is taken from some family of parameters, the sum of squared errors (SSE, MSE) is

  Σ_i [y_i − h(x_i, θ̂_n)]²

• The likelihood is

  Π_i P(h(x_i, θ̂_n) | x_i)
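A brief sketch of both criteria for a generic parametric model h(x, θ). The Gaussian form of the per-point density is an added assumption, since the slide leaves P(·|x_i) unspecified:

```python
import numpy as np

def sse(y, x, h, theta):
    """Sum of squared errors: sum_i (y_i - h(x_i, theta))^2."""
    return np.sum((y - h(x, theta)) ** 2)

def log_likelihood(y, x, h, theta, sigma2):
    """log prod_i P(y_i | x_i, theta), assuming y_i = h(x_i, theta) + N(0, sigma2) noise."""
    resid = y - h(x, theta)
    return -0.5 * np.sum(resid ** 2 / sigma2 + np.log(2 * np.pi * sigma2))

# example: affine model h(x, theta) = theta[0] + theta[1] * x
h = lambda x, theta: theta[0] + theta[1] * x
x = np.linspace(0.0, 1.0, 20)
y = 2.0 + 3.0 * x + 0.1 * np.random.default_rng(1).normal(size=20)
theta = np.array([2.0, 3.0])
print(sse(y, x, h, theta), log_likelihood(y, x, h, theta, sigma2=0.01))
```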

Page 10: Model Selection and Validation

The error surface, as a function of the model parameters, can look like this:

Page 11: Model Selection and Validation


Error surface can also look like this

Which one is better?

Page 12: Model Selection and Validation

Properties of the error surfaces

• The first surface is rough, so a small change in parameter space can lead to a large change in error.
• Due to the steepness of the surface, a minimum can be found, although a gradient-descent optimization algorithm can get stuck in local minima.
• The second surface is very smooth, so a large change in the parameter set does not lead to much change in model error.
• In other words, generalization performance is expected to be similar to performance on a test set.

Page 13: Model Selection and Validation

Parameter stability

• Finer detail: while the surface is very smooth, it is impossible to get to the true minimum.
• This suggests that models that penalize on smoothness may be misleading.
• Breiman (1992) has shown that even in simple problems with simple nonlinear models, the degree of generalization depends strongly on the stability of the parameters.

Page 14: Model Selection and Validation

Bias-Variance Decomposition

• Assume: Y = f(X) + ε, with ε ~ N(0, σ²).
• Bias-variance decomposition at a test point x₀:

  Err(x₀) = E[(Y − f̂(x₀))² | X = x₀]
          = σ² + [E f̂(x₀) − f(x₀)]² + E[f̂(x₀) − E f̂(x₀)]²
          = σ² + Bias²(f̂(x₀)) + Var(f̂(x₀))

• k-NN:

  Err(x₀) = σ² + [f(x₀) − (1/k) Σ_{l=1}^{k} f(x_(l))]² + σ²/k

• Linear fit f̂_p(x₀) = x₀ᵀβ̂, with h(x₀) = X(XᵀX)⁻¹x₀:

  Err(x₀) = σ² + [f(x₀) − E f̂_p(x₀)]² + ‖h(x₀)‖² σ²

  – Ridge regression: same form, with h(x₀) = X(XᵀX + λI)⁻¹x₀.

Page 15: Model Selection and Validation

Bias-Variance Decomposition

• The MSE of the model at a fixed x can be decomposed as:

  E{[h(x, θ̂_n) − E(z|x)]² | x}
    = E{[h(x, θ̂_n) − E(h(x, θ̂_n))]² | x} + [E(h(x, θ̂_n)) − E(z|x)]²
    = variance at x + (bias at x)²

  where the expectations are computed w.r.t. θ̂_n.
• The above implies:
  – Model too simple ⇒ high bias / low variance
  – Model too complex ⇒ low bias / high variance
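The decomposition can be checked numerically. The following sketch is an illustrative toy (the sine target, noise level, and polynomial degrees are assumptions, not from the slides): it repeatedly redraws a training set, refits two models, and estimates the squared bias and variance of the prediction at a fixed point x₀.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)          # assumed true regression function
x0, sigma, n, reps = 0.3, 0.3, 30, 2000

for degree in (1, 10):                        # too-simple vs. very flexible model
    preds = np.empty(reps)
    for r in range(reps):
        x = rng.uniform(0, 1, n)
        y = f(x) + sigma * rng.normal(size=n)
        coefs = np.polyfit(x, y, degree)      # least-squares polynomial fit
        preds[r] = np.polyval(coefs, x0)
    bias2 = (preds.mean() - f(x0)) ** 2
    variance = preds.var()
    print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {variance:.4f}")
```

The simple model comes out with high bias and low variance, the flexible one with low bias and high variance, matching the rule of thumb above.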

Page 16: Model Selection and Validation

Bias-Variance Tradeoff in Model Selection in a Simple Problem

Page 17: Model Selection and Validation

Model Selection

• The bias-variance tradeoff provides a conceptual framework for determining a good model.
  – The bias-variance tradeoff itself is not directly useful in practice.
• There are many methods for the practical determination of a good model:
  – AIC, Bayesian selection, cross-validation, minimum description length, V-C dimension, etc.
• All methods are based on a tradeoff between fitting error (high variance) and model complexity (low bias).
• Cross-validation is one of the most popular model-fitting methods.

Page 18: Model Selection and Validation

Cross-Validation

• Cross-validation is a simple, general method for comparing candidate models.
  – Other specialized methods may work better in specific problems.
• Cross-validation uses the training set of data.
• It does not work on some pathological distributions.
• The method is based on iteratively partitioning the full set of training data into training and test subsets.
• For each partition, estimate the model from the training subset and evaluate the model on the test subset.
• Select the model that performs best over all test subsets.

Page 19: Model Selection and Validation

Division of Data for Cross-Validation with Disjoint Test Subsets

Page 20: Model Selection and Validation

Typical Steps for Cross-Validation

Step 0 (initialization): Determine the size of the test subsets and the candidate model. Let i be the counter for the test subset being used.
Step 1 (estimation): For the i-th test subset, let the remaining data be the i-th training subset. Estimate θ from this training subset.
Step 2 (error calculation): Based on the estimate of θ from Step 1 (i-th training subset), calculate the MSE (or other measure) with the data in the i-th test subset.
Step 3 (new training/test subset): Update i to i + 1 and return to Step 1. Form the mean of the MSEs once all test subsets have been evaluated.
Step 4 (new model): Repeat Steps 1 to 3 for the next model. Choose the model with the lowest mean MSE as the best.
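A compact numpy sketch of Steps 0–4 with disjoint test subsets. The candidate models are polynomial fits of different degrees, which is an illustrative choice rather than anything prescribed by the slides:

```python
import numpy as np

def cross_validate(x, y, degrees, n_folds=5, seed=0):
    """Return the mean test MSE for each candidate polynomial degree."""
    idx = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(idx, n_folds)              # Step 0: disjoint test subsets
    mean_mse = {}
    for d in degrees:                                  # Step 4: loop over candidate models
        fold_mse = []
        for i in range(n_folds):
            test = folds[i]                            # i-th test subset
            train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
            coefs = np.polyfit(x[train], y[train], d)  # Step 1: estimate on training subset
            resid = y[test] - np.polyval(coefs, x[test])
            fold_mse.append(np.mean(resid ** 2))       # Step 2: MSE on test subset
        mean_mse[d] = np.mean(fold_mse)                # Step 3: mean over all subsets
    return mean_mse                                    # best model: min(mean_mse, key=mean_mse.get)
```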

Page 21: Model Selection and Validation

Numerical Illustration of Cross-Validation (Example 13.4 in ISSO)

• Consider a true system corresponding to a sine function of the input with additive normally distributed noise.
• Consider three candidate models:
  – linear (affine) model
  – 3rd-order polynomial
  – 10th-order polynomial
• Suppose 30 data points are available, divided into 5 disjoint test subsets.
• Based on the RMS error (equivalent to MSE) over the test subsets, the 3rd-order polynomial is preferred.
• See the following plot.
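A sketch of this experiment, reusing the cross_validate helper defined after the cross-validation steps above; the noise level and input range are assumptions, since the exact constants of Example 13.4 are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0.0, 1.0, 30)                              # 30 available data points
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=30)      # sine system + Gaussian noise

# 5 disjoint test subsets; candidates: degree 1 (affine), 3, and 10
scores = cross_validate(x, y, degrees=(1, 3, 10), n_folds=5)
rms = {d: float(np.sqrt(m)) for d, m in scores.items()}
print(rms)   # the 3rd-order polynomial typically achieves the lowest RMS test error
```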

Page 22: Model Selection and Validation

Numerical Illustration (cont’d): Relative Fits for 3 Models with Low-Noise Observations

Page 23: Model Selection and Validation

Standard Approach to Model Selection

• Optimize the likelihood or mean squared error concurrently with a complexity penalty.
• Some penalties: norm of the weight vector, smoothness, number of terminal leaves (in CART), variance weights, cross-validation, etc.
• Spend most of the computational time on optimizing the parameter solution via sophisticated gradient-descent methods or even global-minimum-seeking methods.
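As a concrete instance of “fit error plus complexity penalty”, here is a hedged sketch of ridge regression, where the penalty is the squared norm of the weight vector and the penalized objective has a closed-form minimizer (the function names and data are illustrative):

```python
import numpy as np

def penalized_objective(X, y, b, lam):
    """Fitting error plus a squared-norm complexity penalty."""
    return np.sum((y - X @ b) ** 2) + lam * np.sum(b ** 2)

def ridge_fit(X, y, lam):
    """Minimizer of ||y - Xb||^2 + lam * ||b||^2 (no iterative descent needed)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, 0.0, 0.0, 2.0, 0.0]) + 0.1 * rng.normal(size=50)
b_hat = ridge_fit(X, y, lam=1.0)
print(penalized_objective(X, y, b_hat, lam=1.0))
```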

Page 24: Model Selection and Validation

Alternative Approach

• MDL-based model selection (covered later).

Page 25: Model Selection and Validation


Model Complexity

Page 26: Model Selection and Validation

Preventing Overfitting

• Use a model that has the right capacity:
  – enough to model the true regularities;
  – not enough to also model the spurious regularities (assuming they are weaker).
• Standard ways to limit the capacity of a neural net:
  – Limit the number of hidden units.
  – Limit the size of the weights.
  – Stop the learning before it has time to over-fit.

Page 27: Model Selection and Validation

Limiting the size of the weights

• Weight decay involves adding an extra term to the cost function that penalizes the squared weights:

  C = E + (λ/2) Σ_i w_i²

• This keeps weights small unless they have big error derivatives:

  ∂C/∂w_i = ∂E/∂w_i + λ·w_i

  so when ∂C/∂w_i = 0,  w_i = −(1/λ) ∂E/∂w_i.

Page 28: Model Selection and Validation

The effect of weight decay

• It prevents the network from using weights that it does not need.
  – This can often improve generalization a lot.
  – It helps to stop the network from fitting the sampling error.
  – It makes a smoother model in which the output changes more slowly as the input changes.
• If the network has two very similar inputs, it prefers to put half the weight on each (w/2 and w/2) rather than all the weight on one (w and 0).

Page 29: Model Selection and Validation

Model selection

• How do we decide which limit to use and how strong to make the limit?
  – If we use the test data, we get an unfair prediction of the error rate we would get on new test data.
  – Suppose we compared a set of models that gave random results: the best one on a particular dataset would do better than chance, but it won’t do better than chance on another test set.
• So use a separate validation set to do model selection.

Page 30: Model Selection and Validation

Using a validation set

• Divide the total dataset into three subsets:
  – Training data is used for learning the parameters of the model.
  – Validation data is not used for learning but is used for deciding what type of model and what amount of regularization work best.
  – Test data is used to get a final, unbiased estimate of how well the network works. We expect this estimate to be worse than on the validation data.
• We could then re-divide the total dataset to get another unbiased estimate of the true error rate.
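A minimal sketch of this three-way split; the 50%/25%/25% proportions follow the “typical split” quoted later in these slides, and everything else is illustrative:

```python
import numpy as np

def train_val_test_split(n, fractions=(0.5, 0.25, 0.25), seed=0):
    """Return disjoint index arrays for the training, validation and test sets."""
    idx = np.random.default_rng(seed).permutation(n)
    n_train = int(fractions[0] * n)
    n_val = int(fractions[1] * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = train_val_test_split(1000)
# fit parameters on train_idx, choose model type / regularization on val_idx,
# and report the final unbiased error estimate on test_idx (used only once).
```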

Page 31: Model Selection and Validation

Early stopping

• If we have lots of data and a big model, it’s very expensive to keep re-training it with different amounts of weight decay.
• It is much cheaper to start with very small weights and let them grow until the performance on the validation set starts getting worse (but don’t get fooled by noise!).
• The capacity of the model is limited because the weights have not had time to grow big.
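A self-contained sketch of early stopping on a toy problem (gradient descent for linear regression starting from zero weights); the patience rule is one common way to avoid being fooled by a noisy validation curve, not something prescribed by the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
w_true = np.concatenate([rng.normal(size=5), np.zeros(15)])
y = X @ w_true + 0.5 * rng.normal(size=200)
Xtr, ytr, Xva, yva = X[:100], y[:100], X[100:], y[100:]

w = np.zeros(20)                          # start with very small (zero) weights
best_w, best_err, patience, bad = w.copy(), np.inf, 10, 0
for step in range(5000):
    grad = Xtr.T @ (Xtr @ w - ytr) / len(ytr)
    w -= 0.01 * grad                      # weights grow gradually from zero
    val_err = np.mean((Xva @ w - yva) ** 2)
    if val_err < best_err:                # keep the best weights seen so far
        best_w, best_err, bad = w.copy(), val_err, 0
    else:
        bad += 1
        if bad >= patience:               # validation error has stopped improving
            break
print(step, round(best_err, 3))
```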

Page 32: Model Selection and Validation

Why early stopping works

• When the weights are very small, every hidden unit is in its linear range.
  – So a net with a large layer of hidden units is linear.
  – It has no more capacity than a linear net in which the inputs are directly connected to the outputs!
• As the weights grow, the hidden units start using their non-linear ranges, so the capacity grows.

[Figure: network diagram of inputs feeding a hidden layer feeding the outputs]

Page 33: Model Selection and Validation

Model Assessment and Selection

• Loss function and error rate
• Bias, variance and model complexity
• Optimization
• AIC (Akaike Information Criterion)
• BIC (Bayesian Information Criterion)
• MDL (Minimum Description Length)

Page 34: Model Selection and Validation

Key Methods to Estimate Prediction Error

• Estimate the optimism, then add it to the training error rate:

  Êrr_in = err̄ + ôp

• AIC: choose the model with the smallest AIC:

  AIC = err̄ + 2·(d/N)·σ̂_ε²

• BIC: choose the model with the smallest BIC:

  BIC = (N/σ̂_ε²)·[err̄ + (log N)·(d/N)·σ̂_ε²]
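A sketch of the two criteria as written above, taking err̄ to be the training MSE; how σ̂_ε² is estimated is left open on the slide, so it is simply passed in here, and the candidate numbers are made up for illustration:

```python
import numpy as np

def aic(train_mse, d, n, sigma2):
    """AIC = err_bar + 2 * (d / N) * sigma2_hat."""
    return train_mse + 2.0 * d / n * sigma2

def bic(train_mse, d, n, sigma2):
    """BIC = (N / sigma2_hat) * (err_bar + log(N) * (d / N) * sigma2_hat)."""
    return n / sigma2 * (train_mse + np.log(n) * d / n * sigma2)

# toy comparison: d = number of parameters -> training MSE (made-up numbers)
candidates = {2: 0.90, 4: 0.42, 11: 0.40}
n, sigma2 = 30, 0.40
best_aic = min(candidates, key=lambda d: aic(candidates[d], d, n, sigma2))
best_bic = min(candidates, key=lambda d: bic(candidates[d], d, n, sigma2))
print(best_aic, best_bic)
```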

Page 35: Model Selection and Validation

Model Assessment and Selection

• Model selection: estimating the performance of different models in order to choose the best one.
• Model assessment: having chosen the model, estimating its prediction error on new data.

Page 36: Model Selection and Validation

Approaches

• Data-rich:
  – data split: train-validation-test
  – typical split: 50%-25%-25% (how?)
• Data-insufficient:
  – analytical approaches: AIC, BIC, MDL, SRM
  – efficient sample re-use approaches: cross-validation, bootstrapping

Page 37: Model Selection and Validation


Model Complexity

Page 38: Model Selection and Validation


Bias-Variance Tradeoff

Page 39: Model Selection and Validation

Summary

• Cross-validation: a practical way to estimate model error.
• Model estimation should be done with a penalty.
• Once the best model is chosen, estimate it on the whole data set, or average the models fitted on the cross-validated data.

Page 40: Model Selection and Validation

Loss Functions

• Continuous response:
  – squared error
  – absolute error
• Categorical response:
  – 0-1 loss
  – log-likelihood
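A short sketch of the four losses named above; the log-likelihood loss is written as −2·log p̂_g(x) to match the training-error formula on the next slide, and the (n, K) probability array is an assumed input format:

```python
import numpy as np

def squared_error(y, f_hat):
    return (y - f_hat) ** 2

def absolute_error(y, f_hat):
    return np.abs(y - f_hat)

def zero_one_loss(g, g_hat):
    return (g != g_hat).astype(float)

def log_likelihood_loss(g, p_hat):
    """-2 * log p_hat_g(x): p_hat is an (n, K) array of class probabilities,
    g holds the true class indices in {0, ..., K-1}."""
    return -2.0 * np.log(p_hat[np.arange(len(g)), g])
```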

Page 41: Model Selection and Validation

Error Functions

• Training error: the average loss over the training sample.
  – Continuous response:  err̄ = (1/N) Σ_{i=1}^{N} L(y_i, f̂(x_i))
  – Categorical response: err̄ = −(2/N) Σ_{i=1}^{N} log p̂_{g_i}(x_i)
• Generalization error: the expected prediction error over an independent test sample.
  – Continuous response:  Err = E[L(Y, f̂(X))]
  – Categorical response: Err = E[L(G, Ĝ(X))]

Page 42: Model Selection and Validation

Detailed Decomposition for the Linear Model Family

• Average squared bias decomposition:

  Ave[Bias²] = Ave[Model Bias]² + Ave[Estimation Bias]²

  With β* = argmin_β E[(f(X) − Xᵀβ)²]:

  E_{x₀}[f(x₀) − E f̂_α(x₀)]² = E_{x₀}[f(x₀) − x₀ᵀβ*]² + E_{x₀}[x₀ᵀβ* − E x₀ᵀβ̂_α]²

• The estimation-bias term is 0 for LLSF (linear least squares) and > 0 for ridge regression, where it is traded off against variance.