Regression III
Lecture 5: Resampling Techniques
Dave Armstrong
University of Western Ontario
Department of Political Science
Department of Statistics and Actuarial Science (by courtesy)
e: [email protected]
w: www.quantoid.net/teachicpsr/regression3/
1 / 64
Goals of the Lecture
• Introduce the general idea of resampling - techniques to resample from the original data
  • Bootstrapping
  • Cross-validation
• An extended example of bootstrapping local polynomial regression models.
2 / 64
Resampling: An Overview
• Resampling techniques sample from the original dataset
• Some of the applications of these methods are:
  • To compute standard errors and confidence intervals (either when we have small sample sizes, dubious distributional assumptions, or for a statistic that does not have an easily derivable asymptotic standard error)
  • Subset selection in regression
  • Handling missing data
  • Selection of degrees of freedom in nonparametric regression (especially GAMs)
• For the most part, this lecture will discuss resampling techniques in the context of computing confidence intervals and hypothesis tests for regression analysis.
3 / 64
Resampling and Regression: A Caution
• There is no need whatsoever for bootstrapping in regression analysis if the OLS assumptions are met
  • In such cases, OLS estimates are unbiased and maximally efficient.
• There are situations, however, where we cannot satisfy the assumptions and thus other methods are more helpful
  • Robust regression (such as MM-estimation) often provides better estimates than OLS in the presence of influential cases, but only has reliable SEs asymptotically.
  • Local polynomial regression is often "better" (in the RSS sense) in the presence of non-linearity, but because of the unknown df, only has an approximate sampling distribution.
• Cross-validation is particularly helpful for validating models and choosing model fitting parameters
4 / 64
Bootstrapping: General Overview
• If we assume that a random variable X or statistic has a particular population value, we can study how a statistical estimator computed from samples behaves
• We don't always know, however, how a variable or statistic is distributed in the population
  • For example, there may be a statistic for which standard errors have not been formulated (e.g., imagine we wanted to test whether two additive scales have significantly different levels of internal consistency - Cronbach's α doesn't have an exact sampling distribution)
  • Another example is the impact of missing data on a distribution - we don't know how the missing data differ from the observed data
• Bootstrapping is a technique for estimating standard errors and confidence intervals (sets) without making assumptions about the distributions that give rise to the data
5 / 64
Bootstrapping: General Overview (2)
• Assume that we have a sample of size n for which we require more reliable standard errors for our estimates
  • Perhaps n is small, or alternatively, we have a statistic for which there is no known sampling distribution
• The bootstrap provides one "solution"
  • Take several new samples from the original sample, calculating the statistic each time
  • Calculate the average and standard error (and maybe quantiles) from the empirical distribution of the bootstrap samples
  • In other words, we find a standard error based on sampling (with replacement) from the original data
• We apply principles of inference similar to those employed when sampling from the population
  • The population is to the sample as the sample is to the bootstrap samples
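The steps above can be sketched in a few lines of base R; the data here are simulated purely for illustration:

```r
set.seed(42)
x <- rnorm(30, mean = 10, sd = 3)   # toy data, an assumption for illustration
R <- 2000
# resample with replacement and recompute the statistic each time
boot.means <- replicate(R, mean(sample(x, replace = TRUE)))
sd(boot.means)                       # bootstrap standard error of the mean
quantile(boot.means, c(.025, .975))  # a simple percentile interval
```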
6 / 64
Bootstrapping: General Overview (3)
There are several variants of the bootstrap. The two we are most interested in are:
1. Nonparametric bootstrap
  • No underlying population distribution is assumed
  • Most commonly used method
2. Parametric bootstrap
  • Assumes that the statistic has a particular parametric form (e.g., normal)
7 / 64
Bootstrapping the Mean
• Imagine, unrealistically, that we are interested in finding the confidence interval for the mean of a sample of only 4 observations
• Specifically, assume that we are interested in the difference in income between husbands and wives
• We have four cases, with the following mean differences (in $1000's): 6, -3, 5, 3, for a mean of 2.75 and a standard deviation of 4.031
• From classical theory, we can calculate the standard error:

  SE = S_x / √n = 4.031129 / √4 = 2.015564

• Now we'll compare the confidence interval to the one calculated using bootstrapping
8 / 64
Defining the Random Variable
• The first thing that bootstrapping does is estimate the population distribution of Y from the four observations in the sample
• In other words, the random variable Y* is defined:

  Y*   p*(Y*)
   6    0.25
  -3    0.25
   5    0.25
   3    0.25

• The mean of Y* is then simply the mean of the sample:

  E*(Y*) = Σ Y* p*(Y*) = 2.75 = Ȳ
9 / 64
The Sample as the Population (1)
• We now treat the sample as if it were the population, and resample from it
• In this case, we take all possible samples with replacement, meaning that we take nⁿ = 4⁴ = 256 different samples
• Since we found all possible samples, the mean of these samples is simply the original mean
• We then determine the standard error of Ȳ from these samples

  SE*(Ȳ) = √( Σ_{b=1}^{nⁿ} (Ȳ*_b − Ȳ)² / nⁿ ) = 1.74553

We now adjust for the sample size:

  SE(Ȳ) = √( n / (n − 1) ) × SE*(Ȳ*) = √(4/3) × 1.74553 = 2.015564
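The exhaustive calculation can be checked directly in R, using the slide's four observations:

```r
y <- c(6, -3, 5, 3)
n <- length(y)
# enumerate all n^n = 256 resamples (with replacement) by index
idx <- as.matrix(expand.grid(rep(list(1:n), n)))
boot.means <- apply(idx, 1, function(i) mean(y[i]))
se.star <- sqrt(sum((boot.means - mean(y))^2) / n^n)  # 1.74553
se.adj  <- sqrt(n / (n - 1)) * se.star                # 2.015564 = sd(y)/sqrt(n)
```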
10 / 64
The Sample as the Population (2)
• In this example, because we used all possible resamples of our sample, the bootstrap standard error (2.015564) is exactly the same as the original standard error
• This approach can be used for statistics for which we do not have standard error formulas, or we have small sample sizes
• In summary, the following analogies can be made to sampling from the population:
  • Bootstrap observations → original observations
  • Bootstrap mean → original sample mean
  • Original sample mean → unknown population mean μ
  • Distribution of the bootstrap means → unknown sampling distribution of the mean from the original sample
11 / 64
Characteristics of the Bootstrap Statistic
• The bootstrap sampling distribution around the original estimate of the statistic T is analogous to the sampling distribution of T around the population parameter θ
• The average of the bootstrapped statistics is simply:

  T̄* = E(T*) ≈ Σ_{b=1}^{R} T*_b / R

  where R is the number of bootstraps
• The bias of T can be seen as its deviation from the bootstrap average (i.e., it estimates T − θ):

  B* = T̄* − T

• The estimated bootstrap variance of T* is:

  V̂(T*) = Σ_{b=1}^{R} (T*_b − T̄*)² / (R − 1)
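Both quantities are easy to compute from a vector of bootstrap replicates; the statistic and data below are arbitrary choices for illustration:

```r
set.seed(1)
x <- rexp(50)                  # toy data
T.hat <- mean(x, trim = 0.1)   # the statistic T (here a 10% trimmed mean)
t.star <- replicate(1000, mean(sample(x, replace = TRUE), trim = 0.1))
bias.hat <- mean(t.star) - T.hat                                 # B* = T-bar* - T
v.hat <- sum((t.star - mean(t.star))^2) / (length(t.star) - 1)   # V(T*)
```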
12 / 64
Bootstrapping with Larger Samples
• The larger the sample, the more effort it is to calculate the bootstrap estimates
  • With large sample sizes, the possible number of bootstrap samples nⁿ gets very large and impractical (e.g., it would take a long time to calculate 1000¹⁰⁰⁰ bootstrap samples)
• Typically we want to take somewhere between 1000 and 2000 bootstrap samples in order to find a confidence interval of a statistic
• After calculating the standard error, we can easily find the confidence interval. Three methods are commonly used:
  1. Normal theory intervals
  2. Percentile intervals
  3. Bias-corrected percentile intervals
13 / 64
Evaluating Confidence Intervals
Accuracy:
• How quickly do coverage errors go to zero?
  • Prob{θ < T_lo} = α and Prob{θ > T_up} = α
• Errors go to zero at a rate of:
  • 1/n (second-order accurate)
  • 1/√n (first-order accurate)

Transformation respecting:
• For any monotone transformation of θ, φ = m(θ), can we obtain the right confidence interval on φ with the confidence intervals on θ mapped by m()? E.g.,

  [φ_lo, φ_up] = [m(θ_lo), m(θ_up)]
14 / 64
Bootstrap Confidence Intervals: Normal Theory Intervals
• Many statistics are asymptotically normally distributed
• Therefore, in large samples, we may be able to use a normality assumption to characterize the bootstrap distribution. E.g.,

  T* ~ N(T, se²)

  where se is √V̂(T*)
• This approach works well for the bootstrap confidence interval, but only if the bootstrap sampling distribution is approximately normally distributed
• In other words, it is important to look at the distribution before relying on the normal theory interval
15 / 64
Bootstrap Confidence Intervals: Percentile Intervals
• Uses percentiles of the bootstrap sampling distribution to find the end-points of the confidence interval
• If G is the CDF of T*, then we can find the 100(1 − α)% confidence interval with:

  [T_{%,lo}, T_{%,up}] = [G⁻¹(α), G⁻¹(1 − α)]

• The (1 − 2α) percentile interval can be approximated with:

  [T_{%,lo}, T_{%,up}] ≈ [T*_B^(α), T*_B^(1−α)]

  where T*_B^(α) and T*_B^(1−α) are the ordered (B) bootstrap replicates such that 100α% of them fall below the former and 100α% of them fall above the latter.
• These intervals do not assume a normal distribution, but they do not perform well unless we have a large original sample and at least 1000 bootstrap samples
16 / 64
Bootstrap Confidence Intervals: Bias-Corrected, Accelerated
Percentile Intervals (BCa)
• The BCa CI adjusts the confidence intervals for bias due to small samples by employing a normalizing transformation through two correction factors.
• This is also a percentile interval, but the percentiles are not necessarily the ones you would think.
• Using strict percentile intervals,

  [T_lo, T_up] ≈ [T*_B^(α), T*_B^(1−α)]

• Here,

  [T_lo, T_up] ≈ [T*_B^(α₁), T*_B^(α₂)]

• α₁ ≠ α and α₂ ≠ (1 − α₁), where

  α₁ = Φ( z₀ + (z₀ + z^(α)) / (1 − a(z₀ + z^(α))) )

  α₂ = Φ( z₀ + (z₀ + z^(1−α)) / (1 − a(z₀ + z^(1−α))) )
17 / 64
Bias correction: z0
z₀ = Φ⁻¹( #{T*_b ≤ T} / B )

• This just gives the inverse of the normal CDF for the proportion of bootstrap replicates less than T.
• Note that if #{T*_b ≤ T}/B = 0.5, then z₀ = 0.
• If T is unbiased, the proportion will be close to 0.5, meaning that the correction is close to 0.
18 / 64
Acceleration: a
a = Σᵢ ( T̄₍.₎ − T₍ᵢ₎ )³ / ( 6 [ Σᵢ ( T̄₍.₎ − T₍ᵢ₎ )² ]^{3/2} )

• T₍ᵢ₎ is the calculation of the original estimate T with each observation i jackknifed out in turn.
• T̄₍.₎ is Σᵢ T₍ᵢ₎ / n
• The acceleration constant corrects the bootstrap sample for skewness.
19 / 64
Evaluating Confidence Intervals

                            Normal      Percentile   BCa
Accuracy                    1st order   1st order    2nd order
Transformation-respecting   No          Yes          Yes
20 / 64
Number of BS Samples
Now, we can revisit the idea about the number of bootstrap samples.
• The real question here is how precise do you want the estimated quantities to be.
• Often, we're trying to estimate the 95% confidence interval (so the 2.5th and 97.5th percentiles of the distribution).
• Estimating a quantity in the tail of a distribution with precision takes more observations.
21 / 64
Simple Simulation: Bootstrapping the Mean (Lower CI)
22 / 64
Bootstrapping the Median
The median doesn't have an exact sampling distribution (though there is a normal approximation). Let's consider bootstrapping the median to get a sense of its sampling variability.
set.seed(123)
n <- 25
x <- rchisq(n,1,1)
meds <- rep(NA, 1000)
for(i in 1:1000){
  meds[i] <- median(x[sample(1:n, n, replace=T)])
}
median(x)
## [1] 1.206853
summary(meds)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2838 0.6693 1.2069 1.2897 1.7855 3.7398
23 / 64
Make CIs
• Normal Theory CI:

se <- sd(meds)
bias <- mean(meds) - median(x)
norm.ci <- median(x) - bias + qnorm((1+.95)/2)*se*c(-1,1)
• Percentile CI:

med.ord <- meds[order(meds)]
pct.ci <- med.ord[c(25,975)]
pct.ci <- quantile(meds, c(.025,.975))
• BCa CI:

zhat <- qnorm(mean(meds < median(x)))
medi <- sapply(1:n, function(i) median(x[-i]))
Tdot <- mean(medi)
ahat <- sum((Tdot-medi)^3)/(6*sum((Tdot-medi)^2)^(3/2))
zalpha <- 1.96
z1alpha <- -1.96
a1 <- pnorm(zhat + ((zhat + zalpha)/(1-ahat*(zhat + zalpha))))
a2 <- pnorm(zhat + ((zhat + z1alpha)/(1-ahat*(zhat + z1alpha))))
bca.ci <- quantile(meds, c(a1, a2))
a1 <- floor(a1*1000)
a2 <- ceiling(a2*1000)
bca.ci <- med.ord[c(a2, a1)]
24 / 64
Looking at CIs
mat <- rbind(norm.ci, pct.ci, bca.ci)
rownames(mat) <- c("norm", "pct", "bca")
colnames(mat) <- c("lower", "upper")
round(mat, 3)

##       lower upper
## norm -0.046 2.294
## pct   0.473 2.829
## bca   0.473 2.211
25 / 64
With boot package
library(boot)
set.seed(123)
med.fun <- function(x, inds){
  med <- median(x[inds])
  med
}
boot.med <- boot(x, med.fun, R=1000)
boot.ci(boot.med)

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = boot.med)
##
## Intervals :
## Level      Normal              Basic
## 95%   (-0.107,  2.331 )   (-0.415,  1.940 )
##
## Level     Percentile            BCa
## 95%   ( 0.473,  2.829 )   ( 0.473,  2.211 )
## Calculations and Intervals on Original Scale
26 / 64
BS Difference in Correlations

We can bootstrap the difference in correlations as well.

set.seed(123)
library(carData)  # provides the Mroz data
data(Mroz)
cor.fun <- function(dat, inds){
  tmp <- dat[inds, ]
  cor1 <- with(tmp[which(tmp$hc == "no"), ],
               cor(age, lwg, use="complete"))
  cor2 <- with(tmp[which(tmp$hc == "yes"), ],
               cor(age, lwg, use="complete"))
  cor2-cor1
}
boot.cor <- boot(Mroz, cor.fun, R=2000, strata=Mroz$hc)
27 / 64
Bootstrap Confidence Intervals
boot.ci(boot.cor)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 2000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = boot.cor)
##
## Intervals :
## Level      Normal               Basic
## 95%   (-0.0401,  0.2369 )   (-0.0394,  0.2342 )
##
## Level     Percentile             BCa
## 95%   (-0.0351,  0.2385 )   (-0.0350,  0.2386 )
## Calculations and Intervals on Original Scale
28 / 64
Bootstrapping Regression Models
There are two ways to do this:
• Random-X bootstrapping
  • The regressors are treated as random
  • Thus, we select bootstrap samples directly from the observations and calculate the statistic for each bootstrap sample
• Fixed-X bootstrapping
  • The regressors are treated as fixed - implies that the regression model fit to the data is "correct"
  • The fitted values of Y are then the expectation of the bootstrap
  • We attach a random error (usually resampled from the residuals) to each Ŷ, which produces the fixed-X bootstrap sample Y*_b
  • To obtain bootstrap replications of the coefficients, we regress Y*_b on the fixed model matrix for each bootstrap sample
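The fixed-X steps above can be sketched by hand in base R; the model and the built-in mtcars data are illustrative choices, not the slides' example:

```r
set.seed(123)
fit  <- lm(mpg ~ wt + hp, data = mtcars)  # illustrative model
X    <- model.matrix(fit)                 # the fixed design matrix
e    <- resid(fit)
yhat <- fitted(fit)
beta.star <- t(replicate(500, {
  y.star <- yhat + sample(e, replace = TRUE)  # attach resampled residuals
  coef(lm.fit(X, y.star))                     # regress Y*_b on the fixed X
}))
apply(beta.star, 2, sd)  # fixed-X bootstrap standard errors
```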
29 / 64
WVS Data
• secpay: Imagine two secretaries, of the same age, doing practically the same job. One finds out that the other earns considerably more than she does. The better paid secretary, however, is quicker, more efficient and more reliable at her job. In your opinion, is it fair or not fair that one secretary is paid more than the other?
• gini: Observed inequality captured by the GINI coefficient at the country level.
• democrat: Coded 1 if the country maintained a "partly free" or "free" rating on Freedom House's Freedom in the World measure continuously from 1980 to 1995.
30 / 64
[Figure: scatterplots of secpay against gini, distinguishing non-democracies from democracies; three panels, the last on a restricted secpay range]
31 / 64
Bootstrapping Regression: Example (1)
• We fit a linear model of secpay on the interaction of gini and democracy.

dat <- read.csv("http://quantoid.net/files/reg3/weakliem.txt", header=T)
dat <- dat[-c(25,49), ]
dat <- dat[complete.cases(
  dat[,c("secpay", "gini", "democrat")]), ]
mod <- lm(secpay ~ gini*democrat, data=dat)
summary(mod)

##
## Call:
## lm(formula = secpay ~ gini * democrat, data = dat)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.155604 -0.047674  0.004133  0.046637  0.179825
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)    1.059234   0.059679  17.749  < 2e-16 ***
## gini          -0.004995   0.001516  -3.294  0.00198 **
## democrat      -0.486071   0.088182  -5.512 1.86e-06 ***
## gini:democrat  0.010840   0.002476   4.378 7.53e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07118 on 43 degrees of freedom
## Multiple R-squared: 0.5176, Adjusted R-squared: 0.484
## F-statistic: 15.38 on 3 and 43 DF, p-value: 6.087e-07
32 / 64
Fixed-X Bootstrapping
• If we are relatively certain that the model is the "right" model (i.e., it is properly specified), then this makes sense.

library(car)  # Boot() is car's wrapper around boot::boot
boot.reg1 <- Boot(mod, R=500, method="residual")
33 / 64
Fixed-X CIs
perc.ci <- sapply(1:4, function(x){
  boot.ci(boot.reg1, type=c("perc"), index = x)$percent
})[4:5,]
bca.ci <- sapply(1:4, function(x){
  boot.ci(boot.reg1, type=c("bca"), index = x)$bca
})[4:5,]
colnames(bca.ci) <- colnames(perc.ci) <- names(coef(mod))
rownames(bca.ci) <- rownames(perc.ci) <- c("lower", "upper")
perc.ci

##       (Intercept)         gini   democrat gini:democrat
## lower   0.9427854 -0.008099226 -0.6507048   0.005767702
## upper   1.1808116 -0.002214939 -0.3211883   0.015383294

bca.ci

##       (Intercept)         gini   democrat gini:democrat
## lower   0.9398972 -0.008080956 -0.6448495   0.005651383
## upper   1.1769527 -0.002060481 -0.3079673   0.015282963
34 / 64
Random-X resampling (1)
• Random-X resampling is also referred to as observation resampling
• It selects B bootstrap samples (i.e., resamples) of the observations, fits the regression for each one, and determines the standard errors from the bootstrap distribution
• As with the fixed-X case, this is done fairly simply in R using the Boot function from the car package (a wrapper around boot).
35 / 64
Random-X Resampling (2)
boot.reg2 <- Boot(mod, R=500, method="case")
perc.ci <- sapply(1:4, function(x){
  boot.ci(boot.reg2, type=c("perc"), index = x)$percent
})[4:5,]
bca.ci <- sapply(1:4, function(x){
  boot.ci(boot.reg2, type=c("bca"), index = x)$bca
})[4:5,]
# change row and column names of output
colnames(bca.ci) <- colnames(perc.ci) <- names(coef(mod))
rownames(bca.ci) <- rownames(perc.ci) <- c("lower", "upper")
# print output
perc.ci

##       (Intercept)         gini   democrat gini:democrat
## lower   0.9373337 -0.007712521 -0.6668725   0.004825119
## upper   1.1586946 -0.001223048 -0.2716579   0.015987613

bca.ci

##       (Intercept)         gini   democrat gini:democrat
## lower   0.9598786 -0.008503679 -0.6943984   0.005732083
## upper   1.1854891 -0.001937200 -0.3025729   0.016770086
36 / 64
What if we Retained the Outliers?
37 / 64
Bootstrap Distribution

n <- c("Intercept", "GINI", "Democrat", "Gini*Democrat")
par(mfrow=c(2,2))
for(j in 1:4){
  plot(density(boot.reg1$t[,j]), xlab="T*",
       ylab="Density", main = n[j])
  lines(density(boot.reg2$t[,j]), lty=2)
}
38 / 64
The Bootstrap Distribution
[Figure: densities of the bootstrap replicates T* for the Intercept, GINI, Democrat, and Gini*Democrat coefficients, fixed-X solid and random-X dashed]
39 / 64
Jackknife-after-Bootstrap
• The jackknife-after-bootstrap provides a diagnostic of the bootstrap by allowing us to examine what would happen to the density if particular cases were deleted
• Given that we know that there are outliers, this diagnostic is important
• The jackknife-after-bootstrap plot is produced using the jack.after.boot function in the boot package. Once again, the index=2 argument asks for the results for the slope coefficient in the model
jack.after.boot(boot.reg1, index=2)
40 / 64
Jackknife-after-Bootstrap (2)
• The jack.after.boot plot is constructed as follows:
  • The horizontal axis represents the standardized jackknife value (the standardized value of the difference with that observation taken out)
  • The vertical axis represents various quantiles of the bootstrap statistic
  • Horizontal lines in the graph represent the bootstrap distribution at the various quantiles
  • Case numbers are labeled at the bottom of the graph so that each observation can be identified
41 / 64
R code for Jack After Boot
par(mfrow=c(2,2))
jack.after.boot(boot.reg1, index=1)
jack.after.boot(boot.reg1, index=2)
jack.after.boot(boot.reg1, index=3)
jack.after.boot(boot.reg1, index=4)

par(mfrow=c(2,2))
jack.after.boot(boot.reg2, index=1)
jack.after.boot(boot.reg2, index=2)
jack.after.boot(boot.reg2, index=3)
jack.after.boot(boot.reg2, index=4)
42 / 64
[Figure: jackknife-after-bootstrap plots for the four coefficients from the fixed-X bootstrap (boot.reg1)]
43 / 64
[Figure: jackknife-after-bootstrap plots for the four coefficients from the random-X bootstrap (boot.reg2)]
44 / 64
Fixed versus random
Why do the samples differ?

• Fixed-X resampling enforces the assumption that errors are randomly distributed by resampling the residuals from a common distribution
• As a result, if the model is not specified correctly (i.e., there is un-modeled nonlinearity, heteroskedasticity or outliers) these attributes do not carry over to the bootstrap samples
  • The effects of outliers were clear in the random-X case, but not with the fixed-X bootstrap
45 / 64
Bootstrapping the Loess Curve
We could also use bootstrapping to get the Loess curve. First, let's try the fixed-X resampling:

library(foreign)
dat <- read.dta("http://quantoid.net/files/reg3/jacob.dta")
rescale01 <- function(x) (x - min(x, na.rm=T)) / diff(range(x, na.rm=T))  # rescale to [0,1]
seq.range <- function(x) seq(min(x), max(x), length=250)
dat$perot <- rescale01(dat$perot)
dat$chal <- rescale01(dat$chal_vote)
s <- seq.range(dat$perot)
df <- data.frame(perot = s)
lo.mod <- loess(chal ~ perot,
                data=dat, span=.75)
resids <- lo.mod$residuals
yhat <- lo.mod$fitted
lo.fun <- function(dat, inds){
  boot.e <- resids[inds]
  boot.y <- yhat + boot.e
  tmp.lo <- loess(boot.y ~ perot,
                  data=dat, span=.75)
  preds <- predict(tmp.lo, newdata=df)
  preds
}
boot.perot1 <- boot(dat, lo.fun, R=1000)
46 / 64
Bootstrapping the Loess Curve (2)
Now, random-X resampling
lo.fun2 <- function(dat, inds){
  tmp.lo <- loess(chal ~ perot,
                  data=dat[inds,], span=.75)
  preds <- predict(tmp.lo, newdata=df)
  preds
}
boot.perot2 <- boot(dat, lo.fun2, R=1000)
47 / 64
Make Confidence Intervals
out1 <- sapply(1:250, function(i) boot.ci(boot.perot1,
                                          index=i, type="perc")$perc)
out2 <- sapply(1:250, function(i) boot.ci(boot.perot2,
                                          index=i, type="perc")$perc)
preds <- predict(lo.mod, newdata=df, se=T)
lower <- preds$fit - 1.96*preds$se
upper <- preds$fit + 1.96*preds$se
ci1 <- t(out1[4:5, ])
ci2 <- t(out2[4:5, ])
ci3 <- cbind(lower, upper)
plot(s, preds$fit, type = "l",
     ylim=range(c(ci1, ci2, ci3)))
poly.x <- c(s, rev(s))
polygon(poly.x, y=c(ci1[,1], rev(ci1[,2])),
        col=rgb(1,0,0,.25), border=NA)
polygon(poly.x, y=c(ci2[,1], rev(ci2[,2])),
        col=rgb(0,1,0,.25), border=NA)
polygon(poly.x, y=c(ci3[,1], rev(ci3[,2])),
        col=rgb(0,0,1,.25), border=NA)
legend("topleft",
       c("Random-X", "Fixed-X", "Analytical"),
       fill = rgb(c(1,0,0), c(0,1,0), c(0,0,1),
                  c(.25,.25,.25)), inset=.01)
48 / 64
Confidence Interval Graph
[Figure: loess fit with shaded 95% confidence bands - Random-X, Fixed-X, and Analytical]
49 / 64
When Might Bootstrapping Fail?
1. Incomplete data: since we are trying to estimate a population distribution from the sample data, we must assume that the missing data are not problematic. It is acceptable, however, to use bootstrapping if multiple imputation is used beforehand
2. Dependent data: bootstrapping assumes independence when sampling with replacement. It should not be used if the data are dependent
3. Outliers and influential cases: if obvious outliers are found, they should be removed or corrected before performing the bootstrap (especially for random-X). We do not want the simulations to depend crucially on particular observations
50 / 64
Cross-Validation (1)
• If no two observations have the same Y, a p-variable model fit to p + 1 observations will fit the data precisely
  • This implies, then, that discarding much of a dataset can lead to better fitting models
  • Of course, this will lead to biased estimators that are likely to give quite different predictions on another dataset (generated with the same DGP)
• Model validation allows us to assess whether the model is likely to predict accurately on future observations or observations not used to develop this model
• Three major factors that may lead to model failure are:
  1. Over-fitting
  2. Changes in measurement
  3. Changes in sampling
• External validation involves retesting the model on new data collected at a different point in time or from a different population
• Internal validation (or cross-validation) involves fitting and evaluating the model carefully using only one sample
51 / 64
Cross-Validation (2)
• A basic, but powerful, tool for statistical model building
• Cross-validation is similar to bootstrapping in that it resamples from the original data
• The basic form involves randomly dividing the sample into two subsets:
  • The first subset of the data (screening sample) is used to select or estimate a statistical model
  • The second subset is then used to test the findings
• Can be helpful in avoiding capitalizing on chance and over-fitting the data - i.e., findings from the first subset may not always be confirmed by the second subset
• Cross-validation is often extended to use several subsets (either a preset number chosen by the researcher or leave-one-out cross-validation)
52 / 64
Cross-Validation (3)
• The data are split into k subsets (usually 3 ≤ k ≤ 10), but we can also use leave-one-out cross-validation
• Each of the subsets is left out in turn, with the regression run on the remaining data
• Prediction error is then calculated as the sum of the squared errors:

  RSS = Σ (Yᵢ − Ŷᵢ)²

• We choose the model with the smallest average "error":

  MSE = Σ (Yᵢ − Ŷᵢ)² / n

• We could also look to the model with the largest average R²
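A hand-rolled version of this procedure, using an illustrative model on the built-in mtcars data:

```r
set.seed(99)
k <- 5
# randomly assign each observation to one of k folds
folds <- sample(rep(1:k, length.out = nrow(mtcars)))
cv.err <- sapply(1:k, function(j) {
  fit  <- lm(mpg ~ wt + hp, data = mtcars[folds != j, ])  # fit on k-1 folds
  pred <- predict(fit, newdata = mtcars[folds == j, ])    # predict held-out fold
  mean((mtcars$mpg[folds == j] - pred)^2)                 # fold-j squared error
})
mean(cv.err)  # the k-fold estimate of MSE
```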
53 / 64
Cross-Validation (4)
How many observations should I leave out from each fit?
• There is no rule on how many cases to leave out, but Efron (1983) suggests that grouped cross-validation (with approximately 10% of the data left out each time) is better than leave-one-out cross-validation

Number of repetitions
• Harrell (2001: 93) suggests that one may need to leave 1/10 of the sample out 200 times to get accurate estimates

Cross-validation does not validate the complete sample
• External validation, on the other hand, validates the model on a new sample
• Of course, limitations in resources usually prohibit external validation in a single study
54 / 64
Cross-Validation in R
• Cross-validation is done easily using the cv.glm function in the boot package

dat <- read.csv("http://www.quantoid.net/files/reg3/weakliem.txt",
                header=T)
dat <- dat[-c(25,49), ]
mod1 <- glm(secpay ~ poly(gini, 3)*democrat, data=dat)
mod2 <- glm(secpay ~ gini*democrat, data=dat)
deltas <- NULL
for(i in 1:25){
  deltas <- rbind(deltas, c(
    cv.glm(dat, mod1, K=5)$delta,
    cv.glm(dat, mod2, K=5)$delta)
  )
}
colMeans(deltas)
## [1] 0.008985782 0.008295404 0.005647222 0.005528784
55 / 64
Cross-validating Span in Loess
We could use cross-validation to tell us something about the span in our Loess model.

1. First, split the sample into K groups (usually 10).
2. For each of the k = 10 groups, estimate the model on the other 9 and get predictions for the omitted group's observations. Do this for each of the 10 subsets in turn.
3. Calculate the CV error: (1/n) Σ (yᵢ − ŷᵢ)²
4. Potentially, do this lots of times and average across the CV error.
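The steps above could be wrapped in a function; the course code calls such a function cv.lo2, whose definition is not shown on these slides, so the version below is a hypothetical sketch:

```r
# hypothetical stand-in for cv.lo2: K-fold CV error for a loess fit
# at a given span
cv.lo <- function(span, form, data, K = 10) {
  folds <- sample(rep(1:K, length.out = nrow(data)))
  errs <- sapply(1:K, function(j) {
    fit <- loess(form, data = data[folds != j, ], span = span,
                 control = loess.control(surface = "direct"))  # allow extrapolation
    pred <- predict(fit, newdata = data[folds == j, ])
    obs  <- model.response(model.frame(form, data[folds == j, ]))
    mean((obs - pred)^2, na.rm = TRUE)                         # fold-j CV error
  })
  mean(errs)
}
```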
set.seed(1)
n <- 400
x <- 0:(n-1)/(n-1)
f <- 0.2*x^11*(10*(1-x))^6+10*(10*x)^3*(1-x)^10
y <- f + rnorm(n, 0, sd = 2)
tmp <- data.frame(y=y, x=x)
lo.mod <- loess(y ~ x, data=tmp, span=.75)
56 / 64
Minimizing CV Criterion Directly
best.span <- optimize(cv.lo2, c(.05, .95), form=y ~ x, data=tmp,
  numiter=5, K=10)
best.span
## $minimum
## [1] 0.2271807
##
## $objective
## [1] 3.912508
There is also a canned function in fANCOVA that optimizes the spanvia AICc or GCV.
library(fANCOVA)
best.span2 <- loess.as(tmp$x, tmp$y, criterion="gcv")
best.span2$pars$span
## [1] 0.1483344
57 / 64
The Curve
[Figure: scatterplot of the simulated (x, y) data with the fitted loess curves; the legend distinguishes the CV-chosen span from the GCV-chosen span.]
58 / 64
Manually Cross-Validating λ in the Y-J Transform
Sometimes, optimizing the cross-validation criterion fails.
• Randomness in the CV procedure can produce a function that has several local minima.
• You could force the same random split at every evaluation by hand-coding the CV, but this might not be the best idea.
• If optimization of the CV criterion fails, you could always do it manually.
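The "same random split" idea in the second bullet can be sketched by drawing the fold assignment once, outside the criterion function, so that repeated evaluations are deterministic (the data and fold scheme here are illustrative assumptions):

```r
# Sketch: fix the fold assignment once so every evaluation of the
# CV criterion uses the same split (illustrative simulated data)
set.seed(123)
d <- data.frame(x = runif(60))
d$y <- sin(2 * pi * d$x) + rnorm(60, sd = 0.3)
folds <- sample(rep(1:5, length.out = nrow(d)))  # drawn once, up front

cv_err <- function(span){
  errs <- sapply(1:5, function(k){
    fit <- loess(y ~ x, data = d[folds != k, ], span = span,
                 control = loess.control(surface = "direct"))
    mean((d$y[folds == k] - predict(fit, d[folds == k, ]))^2)
  })
  mean(errs)
}

# With folds held fixed, the criterion depends only on span,
# so repeated calls with the same span return the same value
cv_err(0.5)
```

This removes the local minima caused by resampling noise, at the cost of tying the result to one particular split, which is why it may not be the best idea.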
59 / 64
Example
cvoptim_yj <- function(pars, form, data, trans.vars, K=5, numiter=10){
  require(boot)
  require(VGAM)
  form <- as.character(form)
  for(i in 1:length(trans.vars)){
    form <- gsub(trans.vars[i],
      paste("yeo.johnson(", trans.vars[i], ",", pars[i], ")", sep=""),
      form)
  }
  form <- as.formula(paste0(form[2], form[1], form[3]))
  m <- glm(form, data, family=gaussian)
  d <- lapply(1:numiter, function(x) cv.glm(data, m, K=K))
  mean(sapply(d, function(x) x$delta[1]))
}

lams <- seq(0, 2, by=.1)
s <- sapply(lams, function(x) cvoptim_yj(x,
  form = prestige ~ income + education + women,
  data = Prestige, trans.vars = "income", K=3))
plot(s ~ lams, type="l", xlab=expression(lambda), ylab="CV Error")
60 / 64
Figure
[Figure: CV error plotted against λ over the range 0 to 2.]
61 / 64
Summary and Conclusions
• Resampling techniques are powerful tools for estimating standard errors from small samples or when the statistics we are using do not have easily determined standard errors
• Bootstrapping involves taking a “new” random sample (with replacement) from the original data
• We then calculate bootstrap standard errors and statistical tests from the average of the statistic across the bootstrap samples
• Jackknife resampling takes new samples of the data by omitting each case individually and recalculating the statistic each time
• Cross-validation randomly splits the sample into groups, fitting the model on some groups and comparing its predictions to the observed values in the held-out group
62 / 64
Readings
Today: Resampling and Cross-validation
* Fox (2008) Chapter 21
* Stone (1974)
- Efron and Tibshirani (1993)
- Davison and Hinkley (1997)
- Ronchetti, Field and Blanchard (1997)
63 / 64
Davison, Anthony C. and D.V. Hinkley. 1997. Bootstrap Methods and Their Applications. Cambridge:Cambridge University Press.
Efron, Bradley and Robert Tibshirani. 1993. An Introduction to the Bootstrap. New York: Chapman &Hall.
Fox, John. 2008. Applied Regression Analysis and Generalized Linear Models, 2nd edition. ThousandOaks, CA: Sage, Inc.
Ronchetti, Elvezio, Christopher Field and Wade Blanchard. 1997. “Robust Linear Model Selection byCross-validation.” Journal of the American Statistical Association 92(439):1017–1023.
Stone, Mervyn. 1974. “Cross-validation and Assessment of Statistical Predictions.” Journal of the Royal Statistical Society, Series B 36(2):111–147.
64 / 64