Geoff Gordon—Machine Learning—Fall 2013
Accuracy & confidence
• Most of course so far: estimating stuff from data
• Today: how much do we trust our estimates?
• Last week: one answer to this question
‣ prove ahead of time that training set estimate of prediction error will have accuracy ϵ w/ probability 1–δ
‣ had to handle two issues:
‣ limited data ⇒ can’t get exact error of single model
‣ selection bias ⇒ we pick a “lucky” model rather than the right one
error is just one thing we could estimate from data; tail bounds are just one way to get accuracy of estimate
Selection bias
CDF of max of n samples of N(μ=2, σ2=1) [representing error estimates for n models]
[Figure: CDF curves for n = 1, 4, 30]
each sample represents estimate of accuracy for a single model we’re evaluating (true accuracy = 2)
with 4 samples, only 2^-4 = .0625 chance of max below μ
w/ 30 samples, typical values are near upper 2.5% quantile of a single sample (need 28 samples to have 50% chance of max≥μ+1.96)
just 4 models means almost 95% chance of thinking we found a positive effect when there is none
>> % code used to generate the figure
>> zs = -2:.05:4;
>> ps = (1+erf(zs/sqrt(2)))/2;                                 % standard normal CDF
>> plot(zs+2, ps, zs+2, ps.^4, zs+2, ps.^30, 'linewidth', 2);  % CDF of max of n samples = (CDF)^n
>> vertline(2);                                                % not built-in: local helper that draws a vertical line at x = 2
>> legend({'n=1', 'n=4', 'n=30'}, 'location', 'nw'); set(gca, 'fontsize', 24)
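A quick numeric check of the “28 samples” claim above (a small sketch of my own, not from the original notes; it just evaluates 1 − 0.975^n for a few n):

% P(max of n error estimates >= mu + 1.96), i.e. exceeding the upper 2.5% quantile of one sample
p_single = 1 - (1 + erf(1.96/sqrt(2)))/2;    % ~0.025
for n = [1 4 27 28 30]
    p_max = 1 - (1 - p_single)^n;            % CDF of the max is (single-sample CDF)^n
    fprintf('n = %2d: P(max >= mu+1.96) = %.3f\n', n, p_max);
end
% n = 28 is the smallest n for which this probability reaches 0.5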
Overfitting
• Overfitting = selection bias when fitting complex models to little/noisy data
‣ to limit overfitting: limit noise in data, get more data, simplify model class
• Today: not trying to limit overfitting
‣ instead, try to evaluate accuracy of selected model (and recursively, accuracy of our accuracy estimate)
‣ can lead to detection of overfitting
complex models == large model classes
What is accuracy?
• Simple problem: estimate μ and σ² for a Gaussian from samples x1, x2, …, xN ~ Normal(μ, σ²)
typical estimator: sample mean xbar = sum_i x_i / N

E(xbar) = E(sum_i x_i / N) = sum_i E(x_i) / N   [linearity of expectation]
        = N μ / N = μ

I.e., the sample mean is *unbiased*; bias = E(statistic) - parameter

V(xbar) = E((xbar - E(xbar))^2) = E((sum_i x_i / N - μ)^2)   [pretend μ = 0 for simplicity]
        = E((sum_i x_i)^2) / N^2
        = E(sum_i sum_j x_i x_j) / N^2
        = E(sum_i x_i^2) / N^2   [cross terms vanish by independence]
        = N σ^2 / N^2 = σ^2 / N
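A small simulation (my own check, not from the notes; μ, σ, and N below are arbitrary choices) that verifies both facts numerically:

% check E(xbar) = mu and V(xbar) = sigma^2/N by simulation
mu = 2; sigma = 1; N = 25; reps = 100000;
xbars = mean(mu + sigma*randn(N, reps), 1);   % one sample mean per column
fprintf('mean of xbar: %.4f   (mu = %.4f)\n', mean(xbars), mu);
fprintf('var of xbar:  %.5f   (sigma^2/N = %.5f)\n', var(xbars), sigma^2/N);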
Bias vs. variance vs. residual
• Mean squared prediction error: predict xN+1
‣ E((x̄ − xN+1)²)
E((xbar - x_{N+1})^2)   [= prediction error]
  = E(((xbar - mu) - (x_{N+1} - mu))^2)
  = E((xbar - mu)^2 - 2 (xbar - mu)(x_{N+1} - mu) + (x_{N+1} - mu)^2)
    [E(product of indep 0-mean vars) = 0]
  = E((xbar - mu)^2 + (x_{N+1} - mu)^2)
  = E((xbar - mu)^2) + σ^2
  = E(((xbar - E(xbar)) - (mu - E(xbar)))^2) + σ^2
  = E((xbar - E(xbar))^2) - 2 E((xbar - E(xbar))(mu - E(xbar))) + (mu - E(xbar))^2 + σ^2
    [E(xbar - E(xbar)) = 0 by linearity]
  = E((xbar - E(xbar))^2) + (mu - E(xbar))^2 + σ^2
  = V(xbar) + bias^2 + σ^2
  = bias^2 + variance + residual^2 = (estimation error)^2 + residual^2
this decomposition holds for squared error of any prediction
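A quick simulation (my own check, with arbitrary μ, σ, N) of the decomposition for the sample mean, where bias = 0, so the prediction error should be σ²/N + σ²:

% prediction error of xbar for a new draw x_{N+1}: should be sigma^2/N + sigma^2
mu = 2; sigma = 1; N = 25; reps = 100000;
x     = mu + sigma*randn(N, reps);     % columns = independent training samples
xnew  = mu + sigma*randn(1, reps);     % one new observation per repetition
xbars = mean(x, 1);
fprintf('simulated E((xbar - xnew)^2): %.4f\n', mean((xbars - xnew).^2));
fprintf('sigma^2/N + sigma^2:          %.4f\n', sigma^2/N + sigma^2);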
Bias-variance tradeoff
• Can’t do much about residual, so we’re mostly concerned w/ estimation error = bias² + variance
• Can trade bias v. variance to some extent: e.g., always estimate 0 ⇒ variance=0, but bias big
• Cramér-Rao bound on estimation error:
Cramér-Rao bound: if \hat\theta is an estimator of \theta, with bias E(\hat\theta - \theta) = b(\theta), then (under mild conditions)
  E((\hat\theta - \theta)^2) \geq b(\theta)^2 + (1 + b’(\theta))^2 / I(\theta)
where I(\theta) = Fisher information (positive; describes how hard the estimation problem is; high information = easy problem)
note: b = 0 means the bound is 1/I(\theta)
note: if b’ < 0, a biased estimator can beat an unbiased one
wikipedia page gives useful proofs: http://en.wikipedia.org/wiki/Cram%C3%A9r%E2%80%93Rao_bound
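For the running Gaussian-mean example the bound can be worked out explicitly (a standard calculation, not on the slide): with σ² known, the Fisher information of N i.i.d. samples is N/σ², so any unbiased estimator of μ has mean squared error at least σ²/N, which the sample mean attains.

% Fisher information for the mean of N(\mu,\sigma^2), \sigma^2 known, N i.i.d. samples:
\log p(x_{1:N} \mid \mu) = -\frac{1}{2\sigma^2} \sum_{i=1}^N (x_i - \mu)^2 + \mathrm{const}
I(\mu) = -E\!\left[ \frac{\partial^2}{\partial\mu^2} \log p(x_{1:N} \mid \mu) \right] = \frac{N}{\sigma^2}
% Cramér-Rao with b(\mu) = 0:  E((\hat\mu - \mu)^2) \ge 1/I(\mu) = \sigma^2/N, attained by \bar{x}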
Prediction error v. estimation error
• Several ways to get at accuracy
‣ prediction error (bias² + var + residual²)
‣ talks only about predictions
‣ estimation error (bias² + var)
‣ same; tries to concentrate on error due to estimation
‣ parameter error
‣ talks about parameters rather than predictions
‣ in simple case, numerically equal to estimation error
‣ but only makes sense if our model class is right
E((μ − μ̂)²)
pred error: E[(x - \hat x)^2] for a new observation x
estimation error: focus on this, since we can’t reduce the residual anyway
Evaluating accuracy
• In N(μ, σ²) example, we were able to derive bias, variance, and residual from first principles
• In general, have to estimate prediction error, estimation error, or model error from data
• Holdout data, tail bounds, normal theory (use CLT & tables of normal dist’n), and today’s topics: crossvalidation & bootstrap
Goal: estimate sampling variability
• We’ve computed something from our sample
‣ classification error rate, a parameter vector, mean squared prediction error, …
‣ for simplicity, a single number (e.g., ith component of weight vector)
‣ t = f(x1, x2, …, xN)
• How much would t vary if we had taken a different sample?
• For concreteness: f = sample mean (an estimate of population mean)
t = the number we estimated
f = the estimation procedure
x_1, …, x_N: the sample
sample mean: bias = 0, var = σ²/N
Gold standard: new samples
• Get M independent data sets
• Run our computation M times: t1, t2, … tM
‣ tj = f(x^j_1, …, x^j_N)
• Look at distribution of tj
‣ mean, variance, upper and lower 2.5% quantiles, …
• A tad wasteful of data…
M indep sets: x^j_1 thru x^j_N, for j = 1..M
tj = f(x^j_1..N)
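A minimal sketch of this gold standard (my own illustration; it assumes we really can draw M fresh datasets, here from an arbitrary Gaussian, with f = sample mean):

% gold standard: M independent datasets, recompute t = f(dataset) each time
M = 1000; N = 200; mu = 1.5; sigma = 1;
ts = zeros(M, 1);
for j = 1:M
    xj = mu + sigma*randn(N, 1);   % a fresh, independent dataset
    ts(j) = mean(xj);              % f = sample mean in this example
end
ts_sorted = sort(ts);
fprintf('mean of t_j:  %.4f\n', mean(ts));
fprintf('stdev of t_j: %.4f\n', sqrt(var(ts)));
fprintf('2.5%% / 97.5%% quantiles: %.4f, %.4f\n', ...
    ts_sorted(round(0.025*M)), ts_sorted(round(0.975*M)));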
Crossvalidation & bootstrap
• CV and bootstrap: approximate the gold standard, but cheaper—spend computation instead of data
• Work for nearly arbitrarily complicated models
• Typically tighter than tail bounds, but involve difficult-to-verify approximations/assumptions
• Basic idea: surrogate samples
‣ Rearrange/modify x1, …, xN to build each “new” sample
• Getting something from nothing? (hence name)
cv&boot: appropriate for complicated learners where getting tight theory is hard; also get tighter by taking averages in situations that are closer to reality (e.g. taking advantage of correlations among learners)
might repeat 10 times, 1000 times or 10k times -- depends on how much computation we can afford
something from nothing: no, taking advantage of difficult-to-verify assumptions
For example
[Figure: a sample drawn from a two-component Gaussian mixture (histogram) and the mixture density; true mean μ = 1.5, sample estimate μ̂ = 1.6136]
true variance of a single sample: E(x^2) - E(x)^2 = mu.^2*w' + sig^2 - 1.5^2 = 1.3600
true stdev of muhat: sqrt(1.36/N) = .0825
>> % code used to plot the (unnormalized) mixture density
>> zs = -2:.05:4; sig = .6; mu = [-.5 2]; w = [.2 .8];
>> p1 = w(1) * exp(-0.5*(zs-mu(1)).^2/sig^2);   % component 1 (weight .2, mean -.5)
>> p2 = w(2) * exp(-0.5*(zs-mu(2)).^2/sig^2);   % component 2 (weight .8, mean 2)
>> plot(zs, p1+p2, 'linewidth', 2)
Basic bootstrap
• Treat x1…xN as our estimate of true distribution
• To get a new sample, draw N times from this estimate (with replacement)
• Do this M times
‣ each original xi is part of many samples (on average 1–1/e of them, about 63%, since P(xi missing from a given resample) = (1 – 1/N)^N ≈ 1/e)
‣ each sample contains many repeated values (a single xi selected multiple times)
Basic bootstrap

[Figure: histogram of the original sample (μ̂ = 1.6136) and three bootstrap resamples (μ̂ = 1.6909, 1.6059, 1.6507)]
Repeat 100k times: est. stdev of \hat\mu = 0.0818; compare to true stdev, .0825
got 3 sig figs w/ 100k reps; to get 1 sig fig, need only ~10 reps
>> % bootstrap: xs = original sample (n values); resample it with replacement k times
>> k = 100000; muhats = zeros(k, 1);
>> for j = 1:k; idx = randi(n, n, 1); xx = xs(idx); muhats(j) = mean(xx); end
>> sqrt(var(muhats))
What can go wrong?
• Convergence is only asymptotic (large original sample)
‣ here: what if original sample hits mostly the larger mode?
• Original sample might not be i.i.d.
‣ unmeasured covariate
answer: will badly underestimate variance (original sample is more compact than actual distribution)
chance of this gets higher as original sample gets smaller (e.g., 10% chance that n=10 hits *only* larger mode)
not iid: e.g., suppose we measure yields of 100 plots of a new feed corn variety. Now suppose those are 10 plots on each of 10 farms, or 50 plots on each of 2 farms; the unmeasured covariate is which farm.
2-farm case: what if both happen to be farms w/ higher than average yield? -- bootstrap will underestimate variance again
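A sketch of the first failure mode (my own illustration: the “unlucky” original sample is forced to come only from the larger mode, so the bootstrap cannot see the other one and understates the variability):

% failure mode: original sample happens to hit only the larger mode
mu_modes = [-0.5 2]; w = [0.2 0.8]; sig = 0.6; n = 10;
xs = mu_modes(2) + sig*randn(n, 1);            % unlucky sample: larger mode only
k = 10000; muhats = zeros(k, 1);
for j = 1:k
    muhats(j) = mean(xs(randi(n, n, 1)));      % ordinary bootstrap resample
end
true_var = (mu_modes.^2)*w' + sig^2 - 1.5^2;   % = 1.36, as computed earlier
fprintf('bootstrap stdev of muhat: %.3f\n', sqrt(var(muhats)));
fprintf('true stdev of muhat:      %.3f\n', sqrt(true_var/n));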
Types of errors
• “Conservative” estimate of uncertainty: tends to be high (too uncertain)
• “Optimistic” estimate of uncertainty: tends to be low (too certain)
“conservative” (good) vs. “optimistic” (bad); both of the above failure modes can lead to optimism
Should we worry?
• New drug: mean outcome 1.327 [higher is better]
‣ old one: outcome 1.242
• Bootstrap underestimates σ: estimate is .04
‣ true σ = .08
• Tell investors: new drug better than old one
• Enter Phase III trials—cost $millions
•Whoops, it isn’t better after all…
Blocked resampling
• Partial fix for one issue (original sample not i.i.d.)
• Divide sample into blocks that tend to share the unmeasured covariates, and resample blocks
‣ e.g., time series: break up into blocks of adjacent times
‣ assumes unmeasured covariates change slowly
‣ e.g., matrix: break up by rows or columns
‣ assumes unmeasured covariates are associated with rows or columns (e.g., user preferences in Netflix)
time series: could also use GP bootstrap or parametric bootstrap
issue: need enough blocks to resample, else variance is very high
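A minimal sketch of blocked resampling for the feed-corn example from the earlier notes (the data and effect sizes below are made up; the point is that whole farms, not individual plots, get resampled):

% blocked bootstrap: 100 plots = 10 farms x 10 plots; unmeasured covariate = farm
n_farms = 10; plots_per_farm = 10;
farm_effect = randn(1, n_farms);                       % shared within each farm
yields = repmat(farm_effect, plots_per_farm, 1) + 0.3*randn(plots_per_farm, n_farms);
k = 10000; muhats = zeros(k, 1);
for j = 1:k
    farms = randi(n_farms, n_farms, 1);                % resample farms (blocks), not plots
    muhats(j) = mean(reshape(yields(:, farms), [], 1));
end
fprintf('blocked-bootstrap stdev of mean yield: %.3f\n', sqrt(var(muhats)));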
Further reading
• http://bcs.whfreeman.com/ips5e/content/cat_080/pdf/moore14.pdf
• Hesterberg et al. (2005). “Bootstrap methods and permutation tests.” In Moore & McCabe, Introduction to the Practice of Statistics.
Cross-validation
• Used to estimate classification error, RMSE, or similar error measure of an algorithm
• Surrogate sample: exactly the same as x1, …, xN except for train-test split
• k-fold CV:
‣ randomly permute x1, … xN
‣ split into folds: first N/k samples, second N/k samples, …
‣ train on k–1 folds, measure error on remaining fold
‣ repeat k times, with each fold being holdout set once
f = function from whole sample to single number = train model on k-1 folds then evaluate error on remaining one
CV uses the sample-splitting idea twice: first, split into train & validation; second, repeat to estimate variability. Only the second is approximated.
k = N: leave-one-out CV (LOOCV)
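A minimal sketch of k-fold CV (my own example: the “model” is just the training-fold mean and the loss is squared error, so f maps the whole sample to one CV error number):

% k-fold CV: estimate mean squared prediction error of the sample mean
xs = 1.5 + randn(200, 1);                 % made-up data; any i.i.d. sample works
N = numel(xs); k = 10;
perm = randperm(N);                       % randomly permute the sample
fold = ceil((1:N) / (N/k));               % fold label for each permuted position
errs = zeros(k, 1);
for f = 1:k
    test_idx  = perm(fold == f);          % held-out fold
    train_idx = perm(fold ~= f);          % remaining k-1 folds
    model = mean(xs(train_idx));          % "train": here, just the sample mean
    errs(f) = mean((xs(test_idx) - model).^2);
end
fprintf('CV estimate of prediction error: %.3f\n', mean(errs));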
Cross-validation: caveats
• Original sample might not be i.i.d.
• Size of surrogate sample is wrong:
‣ want to estimate error we’d get on a sample of size N
‣ actually use samples of size N(k–1)/k
• Failure of i.i.d., even if original sample was i.i.d. (the k surrogate train/test splits overlap, so they aren’t independent of each other)
two of these are potentially optimistic; middle one is conservative (but usually pretty small effect)