Lecture 23: Transformations and Weighted Least Squares
Pratheepa Jeganathan
11/13/2019
Recap
- What is a regression model?
- Descriptive statistics – graphical
- Descriptive statistics – numerical
- Inference about a population mean
- Difference between two population means
- Some tips on R
- Simple linear regression (covariance, correlation, estimation, geometry of least squares)
- Inference on the simple linear regression model
- Goodness of fit of regression: analysis of variance
- F-statistics
- Residuals
- Diagnostic plots for simple linear regression (graphical methods)
Recap
- Multiple linear regression
  - Specifying the model
  - Fitting the model: least squares
  - Interpretation of the coefficients
  - Matrix formulation of multiple linear regression
- Inference for multiple linear regression
  - T-statistics revisited
  - More F-statistics
  - Tests involving more than one β
- Diagnostics – more on graphical methods and numerical methods
  - Different types of residuals
  - Influence
  - Outlier detection
  - Multiple comparison (Bonferroni correction)
  - Residual plots:
    - partial regression (added variable) plot
    - partial residual (residual plus component) plot
Recap
- Adding qualitative predictors
  - Qualitative variables as predictors in the regression model
  - Adding interactions to the linear regression model
  - Testing for equality of the regression relationship in various subsets of a population
- ANOVA
  - All qualitative predictors
  - One-way layout
  - Two-way layout
Transformation
Outline
- We have been working with linear regression models so far in the course.
- Some models are nonlinear, but can be transformed to a linear model (CH Chapter 6).
- We will also see that transformations can sometimes stabilize the variance, making constant variance a more reasonable assumption (CH Chapter 6).
- Finally, we will see how to correct for unequal variance using a technique called weighted least squares (WLS) (CH Chapter 7).
Bacterial colony decay (CH Chapter 6.3, Page 167)
- Here is a simple dataset showing the number of bacteria alive in a colony, n_t, as a function of time t.
- A simple linear regression model is clearly not a very good fit.

bacteria.table = read.table('http://stats191.stanford.edu/data/bacteria.table',
                            header=T)
head(bacteria.table)

##   t N_t
## 1 1 355
## 2 2 211
## 3 3 197
## 4 4 166
## 5 5 142
## 6 6 106
Fitting (Bacterial colony decay)
(Figure: scatter plot of N_t versus t.)
Diagnostics (Bacterial colony decay)
(Figure: four diagnostic panels for the linear fit – Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage – with observations 1, 8, and 15 flagged.)
Exponential decay model
- Suppose the expected number of cells grows like

  E(n_t) = n_0 e^{β_1 t},   t = 1, 2, 3, ...

- If we take logs of both sides,

  log E(n_t) = log n_0 + β_1 t.

- A reasonable (?) model:

  log n_t = log n_0 + β_1 t + ε_t,   ε_t ~ IID N(0, σ²).
bacteria.log.lm = lm(log(N_t) ~ t, bacteria.table)
df = cbind(bacteria.table,
           lm.fit = fitted(bacteria.lm),
           lm.log.fit = exp(fitted(bacteria.log.lm)))
p = p +
  geom_abline(intercept = bacteria.lm$coef[1],
              slope = bacteria.lm$coef[2],
              col = "red") +
  geom_line(data = df,
            aes(x = t, y = lm.log.fit),
            color = "green")
(Figure: bacteria data with the linear fit (red) and the back-transformed exponential fit (green).)
Diagnostics
par(mfrow=c(2,2))
plot(bacteria.log.lm, pch=23, bg='orange')
(Figure: four diagnostic panels for the log-scale fit – Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage – with observations 2, 7, and 10 flagged.)
Logarithmic transformation
- This model is slightly different from the original model: by Jensen's inequality,

  E(log n_t) ≤ log E(n_t),

  but the two may be approximately equal.

- If ε_t ~ N(0, σ²) then

  n_t = n_0 · γ_t · e^{β_1 t}.

- γ_t = e^{ε_t} is called a log-normal(0, σ²) random variable.
Linearizing regression function
- We see that an exponential growth or decay model can be made (approximately) linear.
- Here are a few other models that can be linearized:
  - y = αx^β: use ỹ = log(y), x̃ = log(x)
  - y = αe^{βx}: use ỹ = log(y)
  - y = x/(αx − β): use ỹ = 1/y, x̃ = 1/x
  - More in the textbook.
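As a quick illustration of the first linearization (a sketch on simulated data, not part of the course datasets; all variable names here are made up), the power-law model y = αx^β can be fit by ordinary least squares after taking logs of both y and x:

```r
# Simulate a power law y = alpha * x^beta with multiplicative noise
set.seed(1)
alpha = 2
beta = 0.5
x = 1:50
y = alpha * x^beta * exp(rnorm(50, sd = 0.05))

# Linearize: log(y) = log(alpha) + beta * log(x) + error
power.lm = lm(log(y) ~ log(x))
alpha.hat = exp(coef(power.lm)[1])  # back-transform the intercept
beta.hat = coef(power.lm)[2]
```

Back-transforming the intercept should roughly recover α = 2, and the slope estimates β = 0.5 directly.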
Caveats
- Just because the expected value linearizes doesn't mean that the errors behave correctly.
- In some cases, this can be corrected using weighted least squares (more later).
- Constant variance and normality assumptions should still be checked.
Stabilizing variance
- Sometimes, a transformation can turn non-constant variance errors into "close to" constant variance. This is another situation in which we might consider a transformation.
- Example: by the "delta rule", if

  Var(Y) = σ² E(Y)

  then

  Var(√Y) ≈ σ²/4.

- In practice, we might not know which transformation is best. Box-Cox transformations offer a tool to find a "best" transformation.
http://en.wikipedia.org/wiki/Power_transform
Delta rule
The following approximation is ubiquitous in statistics.

- Taylor series expansion:

  f(Y) = f(E(Y)) + f′(E(Y))(Y − E(Y)) + ...

- Taking the variance of both sides yields:

  Var(f(Y)) ≈ f′(E(Y))² · Var(Y)
Delta rule
- So, for our previous example:

  Var(√Y) ≈ Var(Y) / (4 · E(Y))

- Another example:

  Var(log(Y)) ≈ Var(Y) / E(Y)².
Caveats
- Just because a transformation makes the variance constant doesn't mean the regression function is still linear (or even that it was linear)!
- The models are approximations, and once a model is selected our standard diagnostics should be used to assess adequacy of fit.
- It is possible to have non-constant variance, but the variance-stabilizing transformation may destroy linearity of the regression function.
  - Solution: try weighted least squares (WLS).
Weighted Least Squares (CH Chapter 7)
Correcting for unequal variance: weighted least squares
- We will now see an example in which there seems to be strong evidence for variance that changes based on Region.
- After observing this, we will create a new model that attempts to correct for this and come up with better estimates.
- Correcting for unequal variance, as we describe it here, generally requires a model for how the variance depends on observable quantities.
Correcting for unequal variance: weighted least squares (CH Chapter 7.4, Page 197)
Variable   Description
Y          Per capita education expenditure by state
X1         Per capita income in 1973 by state
X2         Proportion of population under 18
X3         Proportion in urban areas
Region     Region of the country in which the state is located
education.table = read.table('http://stats191.stanford.edu/data/education1975.table',
                             header=T)
education.table$Region = factor(education.table$Region)
education.lm = lm(Y ~ X1 + X2 + X3, data=education.table)
summary(education.lm)
Diagnostics
par(mfrow=c(2,2))
plot(education.lm)
(Figure: diagnostic panels for education.lm – Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage – with observations 7, 10, and 49 flagged.)
Diagnostics
- There is an outlier; let's drop the outlier and refit.

boxplot(rstandard(education.lm) ~ education.table$Region,
        col=c('red', 'green', 'blue', 'yellow'))
(Figure: boxplots of rstandard(education.lm) by Region, showing one large outlier and unequal spread across the four regions.)
Fit a model without the outlier
keep.subset = (education.table$STATE != 'AK')
education.noAK.lm = lm(Y ~ X1 + X2 + X3,
                       subset=keep.subset,
                       data=education.table)
Fit a model without the outlier
summary(education.noAK.lm)
par(mfrow=c(2,2))
plot(education.noAK.lm)
(Figure: diagnostic panels for education.noAK.lm – Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage – with observations 7, 10, and 15 flagged.)
Diagnostics (refitted model)
par(mfrow=c(1,1))
boxplot(rstandard(education.noAK.lm) ~ education.table$Region[keep.subset],
        col=c('red', 'green', 'blue', 'yellow'))
(Figure: boxplots of rstandard(education.noAK.lm) by Region; the spreads still differ across regions.)
Re-weighting observations
- If you have a reasonable guess of the variance as a function of the predictors, you can use this to re-weight the data.
- Hypothetical example:

  Y_i = β_0 + β_1 X_i + ε_i,   ε_i ~ N(0, σ² X_i²).

- Setting Ỹ_i = Y_i/X_i and X̃_i = 1/X_i, the model becomes

  Ỹ_i = β_0 X̃_i + β_1 + γ_i,   γ_i ~ N(0, σ²).
Weighted Least Squares

- Fitting this model is equivalent to minimizing

  Σ_{i=1}^n (1/X_i²) (Y_i − β_0 − β_1 X_i)²

- Weighted least squares:

  SSE(β, w) = Σ_{i=1}^n w_i (Y_i − β_0 − β_1 X_i)²,   w_i = 1/X_i².

- In general, weights should be like:

  w_i = 1/Var(ε_i).

- Our education expenditure example assumes

  w_i = W_{Region[i]}.
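A minimal check of the equivalence claimed above (a sketch on simulated data; the variable names are hypothetical): WLS with weights w_i = 1/X_i² gives exactly the same coefficients as OLS on the transformed variables Ỹ_i = Y_i/X_i and X̃_i = 1/X_i, with the roles of intercept and slope swapped.

```r
set.seed(1)
x = runif(50, 1, 10)
y = 1 + 2*x + rnorm(50, sd = x)   # Var(eps_i) proportional to x_i^2

# WLS with w_i = 1 / x_i^2
wls = lm(y ~ x, weights = 1/x^2)

# Equivalent OLS on the transformed model: y/x = beta0 * (1/x) + beta1 + gamma
ols.t = lm(I(y/x) ~ I(1/x))

# The WLS intercept (beta0) matches the slope of ols.t, and vice versa
diff0 = abs(coef(wls)["(Intercept)"] - coef(ols.t)[2])
diff1 = abs(coef(wls)["x"] - coef(ols.t)["(Intercept)"])
```

The two fits minimize the same sum of squares, so the differences are zero up to numerical error.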
Common weighting schemes
- If you have a qualitative variable, then it is easy to estimate the weight within groups (our example today).
- "Often"

  Var(ε_i) = Var(Y_i) = V(E(Y_i))

- Many non-Gaussian (non-normal) models behave like this: logistic and Poisson regression.
What if we didn’t re-weight?
- Our (ordinary) least squares estimator with design matrix X is

  β̂ = β̂_OLS = (X^T X)^{-1} X^T Y = β + (X^T X)^{-1} X^T ε.

- Our model says that ε ~ N(0, σ²V) with V = diag(X_1², ..., X_n²), so

  E[(X^T X)^{-1} X^T ε] = E[(X^T X)^{-1} X^T ε | X] = 0.

  So the OLS estimator is unbiased.

- The variance of β̂_OLS is

  Var((X^T X)^{-1} X^T ε) = σ² (X^T X)^{-1} X^T V X (X^T X)^{-1}.
Two-stage procedure
- Suppose we have a hypothesis about the weights, i.e. they are constant within Region, or they are something like

  w_i^{-1} = Var(ε_i) = α_0 + α_1 X_{i1}².

- We pre-whiten:
  1. Fit the model using OLS (ordinary least squares) to get an initial estimate β̂_OLS.
  2. Use predicted values from this model to estimate w_i.
  3. Refit the model using WLS (weighted least squares).
  4. If needed, iterate the previous two steps.
Example (two-stage procedure)
- Let's use w_i^{-1} = Var(ε_i).

# Weight vector for each observation
educ.weights = 0 * education.table$Y
for (region in levels(education.table$Region)) {
  # remove the outlier Alaska
  subset.region = (education.table$Region[keep.subset] == region)
  educ.weights[subset.region] = 1.0 /
    (sum(resid(education.noAK.lm)[subset.region]^2) /
       sum(subset.region))
}
Example (two-stage procedure)
- Weights for the observations in each Region:

unique(educ.weights)
## [1] 0.0006891263 0.0004103443 0.0040090885 0.0010521628
Weighted least squares regression
- Here is our new model.
- Note that the scale of the estimates is unchanged.
- Numerically the estimates are similar.
- What changes most is the Std. Error column.

education.noAK.weight.lm = lm(Y ~ X1 + X2 + X3,
                              weights=educ.weights,
                              subset=keep.subset,
                              data=education.table)
Weighted least squares regression
summary(education.noAK.weight.lm)
Diagnostics
par(mfrow=c(2,2))
plot(education.noAK.weight.lm)
(Figure: diagnostic panels for education.noAK.weight.lm – Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage – with observations 42, 45, and 47 flagged.)
Diagnostics
- Let's look at the boxplot again. It looks better, but perhaps not perfect.

par(mfrow=c(1,1))
boxplot(resid(education.noAK.weight.lm, type='pearson') ~
          education.table$Region[keep.subset],
        col=c('red', 'green', 'blue', 'yellow'))
Diagnostics
(Figure: boxplots of the Pearson residuals from education.noAK.weight.lm by Region; the spreads are now much more comparable.)
Unequal variance: effects on inference
- So far, we have just mentioned that things may have unequal variance, but not thought about how it affects inference.
- In general, if we ignore unequal variance, our estimates of variance are not very good. The covariance has the "sandwich form" we saw above:

  Cov(β̂_OLS) = (X^T X)^{-1} (X^T W^{-1} X) (X^T X)^{-1},

  with W = diag(1/σ_1², ..., 1/σ_n²).
- **If our Std. Error is incorrect, so are our conclusions based on t-statistics!**
- In the education expenditure data example, correcting for weights seemed to make the t-statistics larger. **This will not always be the case!**
Here W is an n by n diagonal matrix.
Unequal variance: effects on inference
- Weighted least squares estimator:

  β̂_WLS = (X^T W X)^{-1} X^T W Y

- If we have the correct weights, then

  Cov(β̂_WLS) = (X^T W X)^{-1}.
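A minimal numerical check of the estimator formula (simulated data; the variable names are hypothetical): computing β̂_WLS = (X^T W X)^{-1} X^T W Y by hand agrees with `lm(..., weights = w)`.

```r
set.seed(1)
n = 40
x = runif(n, 1, 5)
y = 1 + 2*x + rnorm(n, sd = x)
w = 1 / x^2                       # weights assumed known up to sigma^2

# Closed-form WLS estimator: (X^T W X)^{-1} X^T W Y
X = cbind(1, x)
W = diag(w)
beta.wls = solve(t(X) %*% W %*% X, t(X) %*% W %*% y)

# Same fit via lm with a weights argument
fit = lm(y ~ x, weights = w)
max.diff = max(abs(beta.wls - coef(fit)))
```

The two coefficient vectors agree up to numerical error, confirming that `lm`'s `weights` argument implements exactly this estimator.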
Efficiency
- The efficiency of an unbiased estimator of β is 1/variance.
- Estimators can be compared by their efficiency: the more efficient, the better.
- The other reason to correct for unequal variance (besides getting valid inference) is efficiency.
Illustrative example
- Suppose

  Z_i = μ + ε_i,   ε_i ~ N(0, i² · σ²),   1 ≤ i ≤ n.

- Three unbiased estimators of μ:

  μ̂_1 = (1/n) Σ_{i=1}^n Z_i

  μ̂_2 = (Σ_{i=1}^n i^{-2} Z_i) / (Σ_{i=1}^n i^{-2})

  μ̂_3 = (Σ_{i=1}^n i^{-1} Z_i) / (Σ_{i=1}^n i^{-1})
Illustrative example
- The estimator μ̂_2 will always have lower variance, hence tighter confidence intervals.
- The estimator μ̂_3 has incorrect weights, but they are "closer" to correct than the naive mean's weights, which assume each observation has equal variance.
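These claims can be made exact: for independent Z_i, Var(Σ w_i Z_i / Σ w_i) = Σ w_i² Var(Z_i) / (Σ w_i)². Plugging in the three sets of weights for n = 20 and σ = 1 gives closed-form standard errors (a small side computation, not from the original slides) that closely match the simulated standard deviations reported below.

```r
n = 20
sigma = 1
i = 1:n

# mu.hat1: equal weights       -> Var = sigma^2 * sum(i^2) / n^2
se1 = sigma * sqrt(sum(i^2)) / n
# mu.hat2: optimal w_i = i^-2  -> Var = sigma^2 / sum(i^-2)
se2 = sigma / sqrt(sum(1/i^2))
# mu.hat3: w_i = i^-1          -> Var = sigma^2 * n / (sum(1/i))^2
se3 = sigma * sqrt(n) / sum(1/i)

round(c(unweighted = se1, weighted = se2, suboptimal = se3), 3)
```

These come out to about 2.68, 0.79, and 1.24, in line with the simulation.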
Illustrative example
ntrial = 1000    # how many trials will we be doing?
nsample = 20     # how many points in each trial
sd = c(1:20)     # how the standard deviation changes with i
mu = 2.0

get.sample = function() {
  return(rnorm(nsample)*sd + mu)
}

unweighted.estimate = numeric(ntrial)
weighted.estimate = numeric(ntrial)
suboptimal.estimate = numeric(ntrial)
Illustrative example
- Let's simulate a number of experiments and compare the three estimates.

for (i in 1:ntrial) {
  cur.sample = get.sample()
  unweighted.estimate[i] = mean(cur.sample)
  weighted.estimate[i] = sum(cur.sample/sd^2) / sum(1/sd^2)
  suboptimal.estimate[i] = sum(cur.sample/sd) / sum(1/sd)
}
Illustrative example
- Compute SE(μ̂_i):

data.frame(mean(unweighted.estimate),
           sd(unweighted.estimate))

## mean.unweighted.estimate. sd.unweighted.estimate.
## 1                1.991227                2.590985

data.frame(mean(weighted.estimate),
           sd(weighted.estimate))

## mean.weighted.estimate. sd.weighted.estimate.
## 1              2.014456            0.7788919

data.frame(mean(suboptimal.estimate),
           sd(suboptimal.estimate))

## mean.suboptimal.estimate. sd.suboptimal.estimate.
## 1                 2.016207               1.227501
Illustrative example
(Figure: "Comparison of sampling distribution of the estimators" – density estimates for the optimal, suboptimal, and mean estimators; the optimal (weighted) estimator is the most concentrated around μ = 2.)
Reference
- CH Chapter 6 and Chapter 7
- Lecture notes of Jonathan Taylor: http://statweb.stanford.edu/~jtaylo/