Geoff Gordon—Machine Learning—Fall 2013
Accuracy & confidence
• Most of course so far: estimating stuff from data
• Today: how much do we trust our estimates?
• Last week: one answer to this question
‣ prove ahead of time that training set estimate of prediction error will have accuracy ϵ w/ probability 1–δ
‣ had to handle two issues:
‣ limited data ⇒ can’t get exact error of single model
‣ selection bias ⇒ we pick a “lucky” model rather than the right one
error is just one thing we could estimate from data; tail bounds are just one way to get accuracy of estimate
Selection bias
CDF of max of n samples of N(μ=2, σ2=1) [representing error estimates for n models]
[Figure: CDF curves for n = 1, 4, 30]
each sample represents estimate of accuracy for a single model we’re evaluating (true accuracy = 2)
with 4 samples, only 2^-4 = .0625 chance of max below μ
w/ 30 samples, typical values are near upper 2.5% quantile of a single sample (need 28 samples to have 50% chance of max≥μ+1.96)
just 4 models means almost 95% chance of thinking we found a positive effect when there is none
>> % code used to generate the figure
>> zs = -2:.05:4;
>> ps = (1+erf(zs/sqrt(2)))/2;                                 % standard normal CDF
>> plot(zs+2, ps, zs+2, ps.^4, zs+2, ps.^30, 'linewidth', 2);  % CDF of max of n samples = (CDF)^n
>> vertline(2);                                                % not built-in: local helper that draws a vertical line at x = 2
>> legend({'n=1', 'n=4', 'n=30'}, 'location', 'nw'); set(gca, 'fontsize', 24)
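A quick numeric check of the “28 samples” claim above (a small sketch of my own, not from the original notes; it just evaluates 1 − 0.975^n for a few n):

% P(max of n error estimates >= mu + 1.96), i.e. exceeding the upper 2.5% quantile of one sample
p_single = 1 - (1 + erf(1.96/sqrt(2)))/2;    % ~0.025
for n = [1 4 27 28 30]
    p_max = 1 - (1 - p_single)^n;            % CDF of the max is (single-sample CDF)^n
    fprintf('n = %2d: P(max >= mu+1.96) = %.3f\n', n, p_max);
end
% n = 28 is the smallest n for which this probability reaches 0.5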
Overfitting
• Overfitting = selection bias when fitting complex models to little/noisy data
‣ to limit overfitting: limit noise in data, get more data, simplify model class
• Today: not trying to limit overfitting
‣ instead, try to evaluate accuracy of selected model (and recursively, accuracy of our accuracy estimate)
‣ can lead to detection of overfitting
complex models == large model classes
What is accuracy?
• Simple problem: estimate μ and σ² for a Gaussian from samples x1, x2, …, xN ~ Normal(μ, σ²)
typical estimator: sample mean xbar = sum_i x_i / N

E(xbar) = E(sum_i x_i / N) = sum_i E(x_i) / N   [linearity of expectation]
        = N μ / N = μ

I.e., the sample mean is *unbiased*; bias = E(statistic) - parameter

V(xbar) = E((xbar - E(xbar))^2) = E((sum_i x_i / N - μ)^2)   [pretend μ = 0 for simplicity]
        = E((sum_i x_i)^2) / N^2
        = E(sum_i sum_j x_i x_j) / N^2
        = E(sum_i x_i^2) / N^2   [cross terms vanish by independence]
        = N σ^2 / N^2 = σ^2 / N
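A small simulation (my own check, not from the notes; μ, σ, and N below are arbitrary choices) that verifies both facts numerically:

% check E(xbar) = mu and V(xbar) = sigma^2/N by simulation
mu = 2; sigma = 1; N = 25; reps = 100000;
xbars = mean(mu + sigma*randn(N, reps), 1);   % one sample mean per column
fprintf('mean of xbar: %.4f   (mu = %.4f)\n', mean(xbars), mu);
fprintf('var of xbar:  %.5f   (sigma^2/N = %.5f)\n', var(xbars), sigma^2/N);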
Bias vs. variance vs. residual
• Mean squared prediction error: predict xN+1
‣ E((x̄ − xN+1)²)
E((xbar - x_{N+1})^2)   [= prediction error]
  = E(((xbar - mu) - (x_{N+1} - mu))^2)
  = E((xbar - mu)^2 - 2 (xbar - mu)(x_{N+1} - mu) + (x_{N+1} - mu)^2)
    [E(product of indep 0-mean vars) = 0]
  = E((xbar - mu)^2 + (x_{N+1} - mu)^2)
  = E((xbar - mu)^2) + σ^2
  = E(((xbar - E(xbar)) - (mu - E(xbar)))^2) + σ^2
  = E((xbar - E(xbar))^2) - 2 E((xbar - E(xbar))(mu - E(xbar))) + (mu - E(xbar))^2 + σ^2
    [E(xbar - E(xbar)) = 0 by linearity]
  = E((xbar - E(xbar))^2) + (mu - E(xbar))^2 + σ^2
  = V(xbar) + bias^2 + σ^2
  = bias^2 + variance + residual^2 = (estimation error)^2 + residual^2
this decomposition holds for squared error of any prediction
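A quick simulation (my own check, with arbitrary μ, σ, N) of the decomposition for the sample mean, where bias = 0, so the prediction error should be σ²/N + σ²:

% prediction error of xbar for a new draw x_{N+1}: should be sigma^2/N + sigma^2
mu = 2; sigma = 1; N = 25; reps = 100000;
x     = mu + sigma*randn(N, reps);     % columns = independent training samples
xnew  = mu + sigma*randn(1, reps);     % one new observation per repetition
xbars = mean(x, 1);
fprintf('simulated E((xbar - xnew)^2): %.4f\n', mean((xbars - xnew).^2));
fprintf('sigma^2/N + sigma^2:          %.4f\n', sigma^2/N + sigma^2);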
Bias-variance tradeoff
• Can’t do much about residual, so we’re mostly concerned w/ estimation error = bias² + variance
• Can trade bias v. variance to some extent: e.g., always estimate 0 ⇒ variance=0, but bias big
• Cramér-Rao bound on estimation error:
Cramér-Rao bound: if \hat\theta is an estimator of \theta, with bias E(\hat\theta - \theta) = b(\theta), then (under mild conditions)
  E((\hat\theta - \theta)^2) \geq b(\theta)^2 + (1 + b’(\theta))^2 / I(\theta)
where I(\theta) = Fisher information (positive; describes how hard the estimation problem is; high information = easy problem)
note: b = 0 means the bound is 1/I(\theta)
note: if b’ < 0, a biased estimator can beat an unbiased one
wikipedia page gives useful proofs: http://en.wikipedia.org/wiki/Cram%C3%A9r%E2%80%93Rao_bound
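For the running Gaussian-mean example the bound can be worked out explicitly (a standard calculation, not on the slide): with σ² known, the Fisher information of N i.i.d. samples is N/σ², so any unbiased estimator of μ has mean squared error at least σ²/N, which the sample mean attains.

% Fisher information for the mean of N(\mu,\sigma^2), \sigma^2 known, N i.i.d. samples:
\log p(x_{1:N} \mid \mu) = -\frac{1}{2\sigma^2} \sum_{i=1}^N (x_i - \mu)^2 + \mathrm{const}
I(\mu) = -E\!\left[ \frac{\partial^2}{\partial\mu^2} \log p(x_{1:N} \mid \mu) \right] = \frac{N}{\sigma^2}
% Cramér-Rao with b(\mu) = 0:  E((\hat\mu - \mu)^2) \ge 1/I(\mu) = \sigma^2/N, attained by \bar{x}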
Prediction error v. estimation error
• Several ways to get at accuracy
‣ prediction error (bias² + var + residual²)
‣ talks only about predictions
‣ estimation error (bias² + var)
‣ same; tries to concentrate on error due to estimation
‣ parameter error
‣ talks about parameters rather than predictions
‣ in simple case, numerically equal to estimation error
‣ but only makes sense if our model class is right
E((μ − μ̂)²)
pred error: E[(x - \hat x)^2] for a new observation x
estimation error: focus on this, since we can’t reduce the residual anyway
Evaluating accuracy
• In N(μ, σ²) example, we were able to derive bias, variance, and residual from first principles
• In general, have to estimate prediction error, estimation error, or model error from data
• Holdout data, tail bounds, normal theory (use CLT & tables of normal dist’n), and today’s topics: crossvalidation & bootstrap
Goal: estimate sampling variability
• We’ve computed something from our sample
‣ classification error rate, a parameter vector, mean squared prediction error, …
‣ for simplicity, a single number (e.g., ith component of weight vector)
‣ t = f(x1, x2, …, xN)
• How much would t vary if we had taken a different sample?
• For concreteness: f = sample mean (an estimate of population mean)
t = the number we estimated
f = the estimation procedure
x_1, …, x_N: the sample
sample mean: bias = 0, var = σ²/N
Gold standard: new samples
• Get M independent data sets
• Run our computation M times: t1, t2, … tM
‣ tj = f(x^j_1, …, x^j_N)
• Look at distribution of tj
‣ mean, variance, upper and lower 2.5% quantiles, …
• A tad wasteful of data…
M indep sets: x^j_1 thru x^j_N, for j = 1..M
tj = f(x^j_1..N)
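A minimal sketch of this gold standard (my own illustration; it assumes we really can draw M fresh datasets, here from an arbitrary Gaussian, with f = sample mean):

% gold standard: M independent datasets, recompute t = f(dataset) each time
M = 1000; N = 200; mu = 1.5; sigma = 1;
ts = zeros(M, 1);
for j = 1:M
    xj = mu + sigma*randn(N, 1);   % a fresh, independent dataset
    ts(j) = mean(xj);              % f = sample mean in this example
end
ts_sorted = sort(ts);
fprintf('mean of t_j:  %.4f\n', mean(ts));
fprintf('stdev of t_j: %.4f\n', sqrt(var(ts)));
fprintf('2.5%% / 97.5%% quantiles: %.4f, %.4f\n', ...
    ts_sorted(round(0.025*M)), ts_sorted(round(0.975*M)));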
Crossvalidation & bootstrap
• CV and bootstrap: approximate the gold standard, but cheaper—spend computation instead of data
• Work for nearly arbitrarily complicated models
• Typically tighter than tail bounds, but involve difficult-to-verify approximations/assumptions
• Basic idea: surrogate samples
‣ Rearrange/modify x1, …, xN to build each “new” sample
• Getting something from nothing? (hence name)
cv&boot: appropriate for complicated learners where getting tight theory is hard; also get tighter by taking averages in situations that are closer to reality (e.g. taking advantage of correlations among learners)
might repeat 10 times, 1000 times or 10k times -- depends on how much computation we can afford
something from nothing: no, taking advantage of difficult-to-verify assumptions
For example
[Figure: a sample drawn from a two-component Gaussian mixture (histogram) and the mixture density; true mean μ = 1.5, sample estimate μ̂ = 1.6136]
true variance of a single sample: E(x^2) - E(x)^2 = mu.^2*w' + sig^2 - 1.5^2 = 1.3600
true stdev of muhat: sqrt(1.36/N) = .0825
>> % code used to plot the (unnormalized) mixture density
>> zs = -2:.05:4; sig = .6; mu = [-.5 2]; w = [.2 .8];
>> p1 = w(1) * exp(-0.5*(zs-mu(1)).^2/sig^2);   % component 1 (weight .2, mean -.5)
>> p2 = w(2) * exp(-0.5*(zs-mu(2)).^2/sig^2);   % component 2 (weight .8, mean 2)
>> plot(zs, p1+p2, 'linewidth', 2)
Basic bootstrap
• Treat x1…xN as our estimate of true distribution
• To get a new sample, draw N times from this estimate (with replacement)
• Do this M times
‣ each original xi is part of many samples (on average 1–1/e of them, about 63%, since P(xi missing from a given resample) = (1 – 1/N)^N ≈ 1/e)
‣ each sample contains many repeated values (a single xi selected multiple times)
Basic bootstrap

[Figure: histogram of the original sample (μ̂ = 1.6136) and three bootstrap resamples (μ̂ = 1.6909, 1.6059, 1.6507)]
Repeat 100k times: est. stdev of \hat\mu = 0.0818; compare to true stdev, .0825
got 3 sig figs w/ 100k reps; to get 1 sig fig, need only ~10 reps
>> % bootstrap: xs = original sample (n values); resample it with replacement k times
>> k = 100000; muhats = zeros(k, 1);
>> for j = 1:k; idx = randi(n, n, 1); xx = xs(idx); muhats(j) = mean(xx); end
>> sqrt(var(muhats))
What can go wrong?
• Convergence is only asymptotic (large original sample)
‣ here: what if original sample hits mostly the larger mode?
• Original sample might not be i.i.d.
‣ unmeasured covariate
answer: will badly underestimate variance (original sample is more compact than actual distribution)
chance of this gets higher as original sample gets smaller (e.g., 10% chance that n=10 hits *only* larger mode)
not iid: e.g., suppose we measure yields of 100 plots of a new feed corn variety. Now suppose those are 10 plots on each of 10 farms, or 50 plots on each of 2 farms; the unmeasured covariate is which farm.
2-farm case: what if both happen to be farms w/ higher than average yield? -- bootstrap will underestimate variance again
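A sketch of the first failure mode (my own illustration: the “unlucky” original sample is forced to come only from the larger mode, so the bootstrap cannot see the other one and understates the variability):

% failure mode: original sample happens to hit only the larger mode
mu_modes = [-0.5 2]; w = [0.2 0.8]; sig = 0.6; n = 10;
xs = mu_modes(2) + sig*randn(n, 1);            % unlucky sample: larger mode only
k = 10000; muhats = zeros(k, 1);
for j = 1:k
    muhats(j) = mean(xs(randi(n, n, 1)));      % ordinary bootstrap resample
end
true_var = (mu_modes.^2)*w' + sig^2 - 1.5^2;   % = 1.36, as computed earlier
fprintf('bootstrap stdev of muhat: %.3f\n', sqrt(var(muhats)));
fprintf('true stdev of muhat:      %.3f\n', sqrt(true_var/n));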
Types of errors
• “Conservative” estimate of uncertainty: tends to be high (too uncertain)
• “Optimistic” estimate of uncertainty: tends to be low (too certain)
“conservative” (good) vs. “optimistic” (bad); both of the above failure modes can lead to optimism
Should we worry?
• New drug: mean outcome 1.327 [higher is better]
‣ old one: outcome 1.242
• Bootstrap underestimates σ: estimate is .04
‣ true σ = .08
• Tell investors: new drug better than old one
• Enter Phase III trials—cost $millions
•Whoops, it isn’t better after all…
Blocked resampling
• Partial fix for one issue (original sample not i.i.d.)
• Divide sample into blocks that tend to share the unmeasured covariates, and resample blocks
‣ e.g., time series: break up into blocks of adjacent times
‣ assumes unmeasured covariates change slowly
‣ e.g., matrix: break up by rows or columns
‣ assumes unmeasured covariates are associated with rows or columns (e.g., user preferences in Netflix)
time series: could also use GP bootstrap or parametric bootstrap
issue: need enough blocks to resample, else variance is very high
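A minimal sketch of blocked resampling for the feed-corn example from the earlier notes (the data and effect sizes below are made up; the point is that whole farms, not individual plots, get resampled):

% blocked bootstrap: 100 plots = 10 farms x 10 plots; unmeasured covariate = farm
n_farms = 10; plots_per_farm = 10;
farm_effect = randn(1, n_farms);                       % shared within each farm
yields = repmat(farm_effect, plots_per_farm, 1) + 0.3*randn(plots_per_farm, n_farms);
k = 10000; muhats = zeros(k, 1);
for j = 1:k
    farms = randi(n_farms, n_farms, 1);                % resample farms (blocks), not plots
    muhats(j) = mean(reshape(yields(:, farms), [], 1));
end
fprintf('blocked-bootstrap stdev of mean yield: %.3f\n', sqrt(var(muhats)));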
Further reading
• http://bcs.whfreeman.com/ips5e/content/cat_080/pdf/moore14.pdf
• Hesterberg et al. (2005). “Bootstrap methods and permutation tests.” In Moore & McCabe, Introduction to the Practice of Statistics.
Cross-validation
• Used to estimate classification error, RMSE, or similar error measure of an algorithm
• Surrogate sample: exactly the same as x1, …, xN except for train-test split
• k-fold CV:
‣ randomly permute x1, … xN
‣ split into folds: first N/k samples, second N/k samples, …
‣ train on k–1 folds, measure error on remaining fold
‣ repeat k times, with each fold being holdout set once
f = function from whole sample to single number = train model on k-1 folds then evaluate error on remaining one
CV uses the sample-splitting idea twice: first, split into train & validation; second, repeat to estimate variability. Only the second is approximated.
k = N: leave-one-out CV (LOOCV)
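A minimal sketch of k-fold CV (my own example: the “model” is just the training-fold mean and the loss is squared error, so f maps the whole sample to one CV error number):

% k-fold CV: estimate mean squared prediction error of the sample mean
xs = 1.5 + randn(200, 1);                 % made-up data; any i.i.d. sample works
N = numel(xs); k = 10;
perm = randperm(N);                       % randomly permute the sample
fold = ceil((1:N) / (N/k));               % fold label for each permuted position
errs = zeros(k, 1);
for f = 1:k
    test_idx  = perm(fold == f);          % held-out fold
    train_idx = perm(fold ~= f);          % remaining k-1 folds
    model = mean(xs(train_idx));          % "train": here, just the sample mean
    errs(f) = mean((xs(test_idx) - model).^2);
end
fprintf('CV estimate of prediction error: %.3f\n', mean(errs));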
Cross-validation: caveats
• Original sample might not be i.i.d.
• Size of surrogate sample is wrong:
‣ want to estimate error we’d get on a sample of size N
‣ actually use samples of size N(k–1)/k
• Failure of i.i.d., even if original sample was i.i.d. (the k surrogate train/test splits overlap, so they aren’t independent of each other)
two of these are potentially optimistic; middle one is conservative (but usually pretty small effect)