
Page 1:

201ab Quantitative methods
Resampling: Cross Validation

Ed Vul

Page 2:

Resampling

Using our existing data to generate possible samples and thus obtain a sampling distribution:

- Bootstrap: of a statistic, for confidence intervals.
- Randomization: under the null, for NHST.
- Cross-validation: for prediction.
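The three techniques differ mainly in how they resample. A minimal sketch of each move in R (the vector x is a made-up stand-in for your data):

x = rnorm(20)                                # stand-in sample (made up)
sample(x, length(x), replace = TRUE)         # bootstrap: resample with replacement
sample(x, length(x), replace = FALSE)        # randomization: a shuffle
idx = sample(length(x), length(x) / 2)       # cross-validation: train/test split
train = x[idx]; test = x[-idx]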

Page 3:

The problem: overfitting

[Figure: scatter plot of the 10 data points, y vs. x (x from 0 to 1).]

Page 4:

The problem: overfitting

9th order polynomial for 10 data points

[Figure: a 9th-order polynomial passing through all 10 data points, y vs. x.]

- Complex models can fit weird patterns.
- They will fit noise, not just signal.
- Fitting noise yields terrible prediction performance, even though the "fit" to observed data looks very good.
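A minimal sketch of the problem (the data-generating line is invented for illustration). A 9th-order polynomial has as many parameters as we have points, so its training error goes to zero regardless of the noise:

x = seq(0, 1, length.out = 10)
y = 2 * x + rnorm(10)                  # invented noisy data
simple  = lm(y ~ poly(x, 1))
complex = lm(y ~ poly(x, 9))           # one parameter per data point
sum(residuals(simple)^2)               # leaves the noise unexplained
sum(residuals(complex)^2)              # ~0: interpolates every point, noise included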

Page 5:

Overfitting yields worse prediction error

[Figure: the overfit polynomial plotted against the data, y vs. x.]

Page 6:

Overfitting yields worse prediction error

[Figure: the overfit polynomial plotted against the data, y vs. x.]

Page 7:

The problem: overfitting

We want to...

- know how well our model will predict new data, not just how well it fits observed data/noise.
- pick models that will predict new data well, and not overfit.

But we obviously have not yet seen future data.

Page 8:

Solution: Hold out / validation data

- Use part of the existing data as though we have not seen it, by splitting the data into two sets:
  training: used to fit the model
  test ("holdout"): used to evaluate the model
- Doing this once is OK if we have a lot of data, so both the training and test sets can be big even after the split.
- With little data, we will have lots of variability in the evaluation.

Page 9:

Cross-validation

We will do the hold-out process a bunch of times on the same data to try to reduce noise in our test-set performance.

This gives us a better estimate of prediction accuracy for the modelclass (but not any one particular set of parameter values!).

Page 10:

Hold-out: example

dat <- dat %>%   # alternate rows into train and test
  mutate(use_as = ifelse((1:n()) %% 2 == 1, 'train', 'test'))

training_data = dat %>% filter(use_as == 'train')
test_data = dat %>% filter(use_as == 'test')

x     y      use_as
0.00  -0.21  train
0.11  -0.93  test
0.22  -0.93  train
0.33   0.65  test
0.44  -1.06  train
0.56   0.11  test
0.67   2.40  train
0.78   1.29  test
0.89   0.99  train
1.00   0.94  test

Page 11:

Hold-out: example

- Fit the model on the training data:

  M = lm(data = training_data, y ~ poly(x, 3))

- Generate predictions on the test data:

  prediction = predict(M, test_data)

- Measure prediction error, here as the sum of squared errors:

  sum((test_data$y - prediction)^2)

## [1] 7.616142

Page 12:

Train vs test performance as function of complexity

poly.order  train.SSE  test.SSE
1           2.52       5.86
2           1.88       23.67
3           0.94       587.04
4           0.00       15690.46

Note: 10 total data points, splitting into 5 train, 5 test, over and over again. (More on this later.)
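A sketch of how a table like this could be generated; the repetition count and the averaging over splits are my assumptions, not the slide's code:

reps = 100
train.SSE = test.SSE = rep(0, 4)
for(r in 1:reps){
  idx = sample(10, 5)                          # random 5/5 split
  for(p in 1:4){
    M = lm(data = dat[idx, ], y ~ poly(x, p))
    train.SSE[p] = train.SSE[p] + sum(residuals(M)^2) / reps
    test.SSE[p] = test.SSE[p] +
      sum((dat$y[-idx] - predict(M, dat[-idx, ]))^2) / reps
  }
}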

Page 13:

Leave-one-out cross-validation

Run the hold-out procedure n times for n data points. Each time, use one data point as the test data and the remaining n-1 data points as training.

Page 14:

Leave-one-out cross-validation

n = nrow(dat)
loo_error = rep(NA, n)
for(i in 1:n){
  training_data = dat[(1:n)[-i], ]   # all points except i
  test_data = dat[i, ]               # point i alone
  M = lm(data = training_data, y ~ poly(x, 3))
  prediction = predict(M, test_data)
  loo_error[i] = (test_data$y - prediction)^2
}

Page 15:

Leave-one-out cross-validation

[Figure: histogram of the n leave-one-out squared errors (count vs. error).]

mean(loo_error)

## [1] 0.9759489
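If you would rather not hand-roll the loop, the boot package's cv.glm() computes the same leave-one-out estimate; this is a cross-check, not what the slides use (a gaussian glm is equivalent to lm here):

library(boot)
G = glm(data = dat, y ~ poly(x, 3))   # gaussian glm, same fit as lm
cv = cv.glm(dat, G, K = nrow(dat))    # K = n folds: leave-one-out
cv$delta[1]                           # raw LOO mean squared prediction error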

Page 16:

Varieties of cross-validation

- Repeated random sub-sampling (suitable for larger sample sizes and replicates)
- Leave-k-out (k = 1 gives leave-one-out): exhaustive, for small sample sizes
- K-fold (K = n gives leave-one-out)

For both fitting and evaluation: nested cross-validation.

There are lots of varieties of error/fit measures, depending on what you are after.
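For reference, a minimal K-fold sketch on the toy data (K = 5 is my choice): randomly assign each row to a fold, then hold out each fold in turn:

K = 5
n = nrow(dat)
fold = sample(rep(1:K, length.out = n))   # random fold assignment
fold_error = rep(NA, K)
for(f in 1:K){
  M = lm(data = dat[fold != f, ], y ~ poly(x, 3))
  pred = predict(M, dat[fold == f, ])
  fold_error[f] = mean((dat$y[fold == f] - pred)^2)
}
mean(fold_error)                          # cross-validated MSE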

Page 17:

Larger-scale example: data

## Rows: 251
## Columns: 14
## $ bf.percent <dbl> 12.3, 6.1, 25.3, 10.4, 28.7, 20.9, 19.2, 12.4, 4.1, 11.7...
## $ age        <dbl> 23, 22, 22, 26, 24, 24, 26, 25, 25, 23, 26, 27, 32, 30, ...
## $ weight     <dbl> 154.25, 173.25, 154.00, 184.75, 184.25, 210.25, 181.00, ...
## $ height     <dbl> 67.75, 72.25, 66.25, 72.25, 71.25, 74.75, 69.75, 72.50, ...
## $ neck       <dbl> 36.2, 38.5, 34.0, 37.4, 34.4, 39.0, 36.4, 37.8, 38.1, 42...
## $ chest      <dbl> 93.1, 93.6, 95.8, 101.8, 97.3, 104.5, 105.1, 99.6, 100.9...
## $ abdomen    <dbl> 85.2, 83.0, 87.9, 86.4, 100.0, 94.4, 90.7, 88.5, 82.5, 8...
## $ hip        <dbl> 94.5, 98.7, 99.2, 101.2, 101.9, 107.8, 100.3, 97.1, 99.9...
## $ thigh      <dbl> 59.0, 58.7, 59.6, 60.1, 63.2, 66.0, 58.4, 60.0, 62.9, 63...
## $ knee       <dbl> 37.3, 37.3, 38.9, 37.3, 42.2, 42.0, 38.3, 39.4, 38.3, 41...
## $ ankle      <dbl> 21.9, 23.4, 24.0, 22.8, 24.0, 25.6, 22.9, 23.2, 23.8, 25...
## $ bicep      <dbl> 32.0, 30.5, 28.8, 32.4, 32.2, 35.7, 31.9, 30.5, 35.9, 35...
## $ forearm    <dbl> 27.4, 28.9, 25.2, 29.4, 27.7, 30.6, 27.8, 29.0, 31.1, 30...
## $ wrist      <dbl> 17.1, 18.2, 16.6, 18.2, 17.7, 18.8, 17.7, 18.8, 18.2, 19...

Page 18:

Large-scale example: Models

lm.model = lm(data = dat, bf.percent ~ .)
svr.model = e1071::svm(data = dat, bf.percent ~ ., cross = 0)
lm2.model = lm(data = dat,
               bf.percent ~ polym(age, weight, height, neck, chest,
                                  abdomen, hip, thigh, knee, ankle,
                                  bicep, forearm, wrist,
                                  raw = T, degree = 2))

Page 19:

Leave-50-out random sub-sampling

RMSE = function(true_y, predicted_y){
  sqrt(mean((predicted_y - true_y)^2))
}

n = nrow(dat)
k = 50
repetitions = 100
errors = rep(NA, repetitions)
for(i in 1:repetitions){
  test_idx = sort(sample(n, k, replace = F))  # hold out 50 random rows
  train_idx = (1:n)[-test_idx]
  test_dat = dat[test_idx, ]
  train_dat = dat[train_idx, ]
  M = lm(data = train_dat, bf.percent ~ .)
  pred_y = predict(M, test_dat)
  errors[i] = RMSE(test_dat$bf.percent, pred_y)
}
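The next page compares all three models on both training and test error. A sketch of how the loop above extends to do that; the bookkeeping is my reconstruction, not the slide's code (lm2 would be added to the list the same way):

models = list(
  lm  = function(d) lm(data = d, bf.percent ~ .),
  svr = function(d) e1071::svm(data = d, bf.percent ~ ., cross = 0)
)
train.err = test.err = matrix(NA, repetitions, length(models),
                              dimnames = list(NULL, names(models)))
for(i in 1:repetitions){
  test_idx = sort(sample(n, k, replace = F))
  train_dat = dat[-test_idx, ]
  test_dat = dat[test_idx, ]
  for(m in names(models)){
    M = models[[m]](train_dat)                # fit each model on the same split
    train.err[i, m] = RMSE(train_dat$bf.percent, predict(M, train_dat))
    test.err[i, m] = RMSE(test_dat$bf.percent, predict(M, test_dat))
  }
}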

Page 20:

Leave-50-out random sub-sampling: Results

[Figure: log10(MSE) by model (lm, lm2, svr), separately for train.err and test.err.]

Page 21:

Resampling: Cross-validation

Goal: estimate prediction accuracy/error on future data without actually having data from the future.

Strategy: Repeat many times:

- Split existing data into training and test sets.
- Fit the model to the training set; evaluate error on the test set.

Page 22:

Resampling: Bootstrap

Goal: quantify sampling error in some statistic to get confidence intervals.

Strategy:

- Generate new hypothetical samples of the same size as the existing sample by resampling from it (with replacement!).
- Calculate the statistic on each sample to obtain many samples from the sampling distribution of the statistic.
- Use that to get confidence intervals via the quantile function.
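A minimal sketch of that strategy for the mean of some vector x (x and the 10,000 repetitions are stand-ins):

boot_means = replicate(10000, mean(sample(x, length(x), replace = TRUE)))
quantile(boot_means, c(0.025, 0.975))   # 95% bootstrap confidence interval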

Page 23:

Resampling: Randomization

Goal: test a null hypothesis that some structure/regularity does not exist in the data.

Strategy:

- Define a statistic to measure the structure.
- Define a shuffling (sampling without replacement) process to destroy only that structure.
- Repeat many times: statistic(shuffle(data)), to obtain many samples from the distribution of the statistic under the null.
- Calculate a p value from those samples.
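A minimal sketch for a two-group difference in means (the data frame d with columns y and group is hypothetical):

stat = function(d) mean(d$y[d$group == 'A']) - mean(d$y[d$group == 'B'])
observed = stat(d)
null_stats = replicate(10000, {
  shuffled = d
  shuffled$group = sample(shuffled$group)   # destroys only the group-y link
  stat(shuffled)
})
mean(abs(null_stats) >= abs(observed))      # two-tailed p value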

Page 24:

Resampling recap

Randomization: shuffle data to obtain the sampling distribution of a statistic under the null, and thus test a null hypothesis.

Bootstrapping: resample current data to obtain the sampling distribution of a statistic, and thus get a confidence interval.

Cross-validation: subsample existing data into training and test sets to estimate prediction performance on unseen data.

Page 25:

Resampling: beware

You have lots of responsibility here. There is lots of room to make a mistake, and a natural tendency to check for, and catch, a mistake only when it is unfavorable to you.

Page 26:

Questions?