
201ab Quantitative methods
Resampling: Cross Validation

Ed Vul

Resampling

Using our existing data to generate possible samples and thus obtain a sampling distribution:

- Bootstrap: of a statistic, for confidence intervals.
- Randomization: under the null, for NHST.
- Cross-validation: for prediction.

The problem: overfitting

[Figure: observed data points, y vs. x]

The problem: overfitting

[Figure: 9th-order polynomial fit to the 10 data points, y vs. x]

- Complex models can fit weird patterns.
- They will fit noise, not just signal.
- Fitting noise yields terrible prediction performance, even though the “fit” to observed data looks very good (see the sketch below).
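A minimal sketch of the point, using simulated data rather than the slide's actual data (the data-generating line below is my own):

# Simulate 10 noisy points and fit a 9th-order polynomial: with 10
# coefficients for 10 points, the "fit" to the observed data is perfect,
# but the curve is just chasing noise.
set.seed(1)
toy = data.frame(x = seq(0, 1, length.out = 10))
toy$y = 3 * toy$x + rnorm(10)
M9 = lm(data = toy, y ~ poly(x, 9))
sum(resid(M9)^2)                    # essentially 0: perfect in-sample fit
predict(M9, data.frame(x = 0.55))   # interpolated predictions can be wild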

Overfitting yields worse prediction error

[Figure: fitted polynomial and observed data, y vs. x]


The problem: overfitting

We want to...

- know how well our model will predict new data, not just how well it fits observed data/noise.
- pick models that will predict new data well, and not overfit.

But we obviously have not yet seen future data.

Solution: Hold out / validation data

- Use part of existing data as though we have not seen it: split the data into two sets:
  - training: used to fit the model
  - test (“holdout”): used to evaluate the model
- Doing this once is OK if we have a lot of data, so both training and test sets can be big even after the split.
- With little data we will have lots of variability in evaluation.

Cross-validation

We will do the hold-out process a bunch of times on the same data to try to reduce noise in our test-set performance.

This gives us a better estimate of prediction accuracy for the model class (but not any one particular set of parameter values!).

Hold-out: example

library(dplyr)  # for %>%, mutate(), filter()

dat <- dat %>%
  mutate(use_as = ifelse((1:n()) %% 2 == 1, 'train', 'test'))

training_data = dat %>%
  filter(use_as == 'train')

test_data = dat %>%
  filter(use_as == 'test')

    x      y  use_as
 0.00  -0.21  train
 0.11  -0.93  test
 0.22  -0.93  train
 0.33   0.65  test
 0.44  -1.06  train
 0.56   0.11  test
 0.67   2.40  train
 0.78   1.29  test
 0.89   0.99  train
 1.00   0.94  test

Hold-out: example

- Fit model on training data:
  M = lm(data = training_data, y ~ poly(x, 3))
- Generate predictions on test data:
  prediction = predict(M, test_data)
- Measure prediction error. Here: as sum of squared errors.
  sum((test_data$y - prediction)^2)

## [1] 7.616142

Train vs. test performance as a function of complexity

poly.order  train.SSE   test.SSE
         1       2.52       5.86
         2       1.88      23.67
         3       0.94     587.04
         4       0.00   15690.46

Note: 10 total data points, splitting into 5 train, 5 test. Over and over again. (More on this later.)
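A sketch of how a table like this might be computed on the toy data (the slide's exact procedure, number of repetitions, and seed aren't shown, so the numbers will differ):

# For each polynomial order, repeatedly split the 10 points into 5 train /
# 5 test, fit on the training half, and average SSE on both halves.
sse_by_order = sapply(1:4, function(ord){
  res = replicate(100, {
    train_idx = sample(nrow(dat), 5)
    train = dat[train_idx, ]
    test  = dat[-train_idx, ]
    M = lm(data = train, y ~ poly(x, ord))
    c(train = sum(resid(M)^2),
      test  = sum((test$y - predict(M, test))^2))
  })
  rowMeans(res)
})
t(sse_by_order)  # rows: poly order 1-4; columns: mean train and test SSE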

Leave-one-out cross-validation

Run the hold-out procedure n times for n data points: each time, use one data point as the test data and the remaining n-1 data points as training.

Leave-one-out cross-validation

n = nrow(dat)
loo_error = rep(NA, n)
for(i in 1:n){
  training_data = dat[(1:n)[-i], ]
  test_data = dat[i, ]
  M = lm(data = training_data, y ~ poly(x, 3))
  prediction = predict(M, test_data)
  loo_error[i] = (test_data$y - prediction)^2
}

Leave-one-out cross-validation

[Figure: histogram of the leave-one-out squared errors (count vs. error)]

mean(loo_error)

## [1] 0.9759489

Varieties of cross-validation

- Repeated random sub-sampling (suitable for larger sample sizes and replicates)
- Leave k out (LOO: k=1): exhaustive, for small sample sizes
- K-fold (LOO: k=n); see the sketch below

For both fitting and evaluation:
- Nested cross-validation

There are lots of varieties of error/fit measures depending on what you are after.
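K-fold isn't demonstrated elsewhere in these slides, so here is a minimal sketch on the toy data (K = 5 is an arbitrary choice of mine):

# Assign each row to one of K folds, then hold out one fold at a time.
K = 5
n = nrow(dat)
fold = sample(rep(1:K, length.out = n))
fold_sse = rep(NA, K)
for(f in 1:K){
  train = dat[fold != f, ]
  test  = dat[fold == f, ]
  M = lm(data = train, y ~ poly(x, 3))
  fold_sse[f] = sum((test$y - predict(M, test))^2)
}
sum(fold_sse)  # total cross-validated SSE across folds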

Larger-scale example: data

## Rows: 251
## Columns: 14
## $ bf.percent <dbl> 12.3, 6.1, 25.3, 10.4, 28.7, 20.9, 19.2, 12.4, 4.1, 11.7...
## $ age        <dbl> 23, 22, 22, 26, 24, 24, 26, 25, 25, 23, 26, 27, 32, 30, ...
## $ weight     <dbl> 154.25, 173.25, 154.00, 184.75, 184.25, 210.25, 181.00, ...
## $ height     <dbl> 67.75, 72.25, 66.25, 72.25, 71.25, 74.75, 69.75, 72.50, ...
## $ neck       <dbl> 36.2, 38.5, 34.0, 37.4, 34.4, 39.0, 36.4, 37.8, 38.1, 42...
## $ chest      <dbl> 93.1, 93.6, 95.8, 101.8, 97.3, 104.5, 105.1, 99.6, 100.9...
## $ abdomen    <dbl> 85.2, 83.0, 87.9, 86.4, 100.0, 94.4, 90.7, 88.5, 82.5, 8...
## $ hip        <dbl> 94.5, 98.7, 99.2, 101.2, 101.9, 107.8, 100.3, 97.1, 99.9...
## $ thigh      <dbl> 59.0, 58.7, 59.6, 60.1, 63.2, 66.0, 58.4, 60.0, 62.9, 63...
## $ knee       <dbl> 37.3, 37.3, 38.9, 37.3, 42.2, 42.0, 38.3, 39.4, 38.3, 41...
## $ ankle      <dbl> 21.9, 23.4, 24.0, 22.8, 24.0, 25.6, 22.9, 23.2, 23.8, 25...
## $ bicep      <dbl> 32.0, 30.5, 28.8, 32.4, 32.2, 35.7, 31.9, 30.5, 35.9, 35...
## $ forearm    <dbl> 27.4, 28.9, 25.2, 29.4, 27.7, 30.6, 27.8, 29.0, 31.1, 30...
## $ wrist      <dbl> 17.1, 18.2, 16.6, 18.2, 17.7, 18.8, 17.7, 18.8, 18.2, 19...

Large-scale example: Models

# assuming the body-fat data are in `dat`, as in the following slides
lm.model  = lm(data = dat, bf.percent ~ .)
svr.model = e1071::svm(data = dat, bf.percent ~ ., cross = 0)
lm2.model = lm(data = dat, bf.percent ~ polym(age, weight, height, neck,
                                              chest, abdomen, hip, thigh,
                                              knee, ankle, bicep, forearm,
                                              wrist, raw = T, degree = 2))
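An aside (hedged, from my reading of the e1071 documentation): the cross= argument in the svm() call above is the library's built-in K-fold cross-validation, with 0 disabling it; a positive value asks svm() to cross-validate internally and report the result through summary(). For example:

# assumption: for regression models, e1071 reports the K-fold mean squared
# error in summary() when cross > 0
svr.cv = e1071::svm(data = dat, bf.percent ~ ., cross = 10)
summary(svr.cv)  # includes the 10-fold cross-validation MSE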

Leave-50-out random sub-sampling

RMSE = function(true_y, predicted_y){
  sqrt(mean((predicted_y - true_y)^2))
}

n = nrow(dat)
k = 50
repetitions = 100
errors = rep(NA, repetitions)
for(i in 1:repetitions){
  test_idx = sort(sample(n, k, replace = F))
  train_idx = (1:n)[-test_idx]
  test_dat = dat[test_idx, ]
  train_dat = dat[train_idx, ]
  M = lm(data = train_dat, bf.percent ~ .)
  pred_y = predict(M, test_dat)
  errors[i] = RMSE(test_dat$bf.percent, pred_y)
}
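The results on the next slide compare train and test error for three model classes, while the loop above records only test error for the plain lm. Here is a sketch of one way the fuller comparison might be run, reusing the objects defined above (the slide's actual code isn't shown; the model list and bookkeeping are my own):

# Fit lm, lm2, and svr on the same random split each repetition and record
# both train and test RMSE (one way to produce the figure that follows).
fit_fns = list(
  lm  = function(d) lm(bf.percent ~ ., data = d),
  lm2 = function(d) lm(bf.percent ~ polym(age, weight, height, neck, chest,
                                          abdomen, hip, thigh, knee, ankle,
                                          bicep, forearm, wrist,
                                          raw = T, degree = 2), data = d),
  svr = function(d) e1071::svm(bf.percent ~ ., data = d)
)
results = data.frame()
for(i in 1:repetitions){
  test_idx = sort(sample(n, k, replace = F))
  test_dat  = dat[test_idx, ]
  train_dat = dat[-test_idx, ]
  for(m in names(fit_fns)){
    M = fit_fns[[m]](train_dat)
    results = rbind(results, data.frame(
      model = m,
      train.err = RMSE(train_dat$bf.percent, predict(M, train_dat)),
      test.err  = RMSE(test_dat$bf.percent,  predict(M, test_dat))))
  }
}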

Leave-50-out random sub-sampling: Results

[Figure: log10(MSE) by model (lm, lm2, svr), split by type (train.err vs. test.err)]

Resampling: Cross-validation

Goal: estimate prediction accuracy/error on future data without actually having data from the future.

Strategy: Repeat many times:

- Split existing data into training and test set.
- Fit model to training set, evaluate error on test set.

Resampling: Bootstrap

Goal: quantify sampling error in some statistic to get confidence intervals.

Strategy:

- Generate new hypothetical samples of the same size as the existing sample by resampling from it (with replacement!).
- Calculate the statistic on each sample to obtain many samples of the sampling distribution of the statistic.
- Use that to get confidence intervals via the quantile function (see the sketch below).
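A minimal sketch in R, with a placeholder sample and the mean as the statistic (both choices are mine, not the slides'):

# Bootstrap percentile CI for the mean of a numeric sample `x`.
x = rnorm(50)  # placeholder data; substitute your own sample
boot_stat = replicate(10000, mean(sample(x, replace = TRUE)))
quantile(boot_stat, c(0.025, 0.975))  # 95% percentile confidence interval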

Resampling: Randomization

Goal: test a null hypothesis that some structure/regularity does not exist in the data.

Strategy:

- Define a statistic to measure the structure.
- Define a shuffling (sampling without replacement) process to destroy only that structure.
- Repeat many times: statistic(shuffle(data)) to obtain many samples of the distribution of the statistic under the null.
- Calculate a p value from those samples (see the sketch below).
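A minimal sketch in R, with placeholder data and the correlation as the structure-measuring statistic (both choices are mine):

# Randomization test for an x-y relationship: shuffling y destroys only
# the x-y pairing, giving the null distribution of the correlation.
x = rnorm(50)                      # placeholder data
y = 0.3 * x + rnorm(50)            # placeholder data
observed = cor(x, y)
null_stat = replicate(10000, cor(x, sample(y)))
mean(abs(null_stat) >= abs(observed))  # two-tailed p value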

Resampling recap

Randomization: Shuffle data to obtain the sampling distribution of a statistic under the null, and thus test a null hypothesis.

Bootstrapping: Resample current data to obtain the sampling distribution of a statistic, and thus get a confidence interval.

Cross-validation: Subsample existing data into training and test sets to estimate prediction performance on unseen data.

Resampling: beware

You have lots of responsibility here: there is lots of room to make a mistake, and to only check for / catch it when the mistake is unfavorable.

Questions?
