ISyE 7401 Advanced Statistical Modeling Spring 2013
HW #3 (due in class on February 6, Wednesday)
(There are 4 questions, and please look at both sides)
Problem 1. Suppose that each setting $x_i$ of the independent variable in a simple least squares problem is duplicated, yielding two independent observations $Y_{i1}, Y_{i2}$. Is it true that the least squares estimates of the intercept and slope can be found by doing a regression of the mean response of each pair of duplicates, $\bar Y_i = (Y_{i1} + Y_{i2})/2$, on the $x_i$'s? Why or why not? Explain.
Answer: The answer is a resounding "Yes" for the point estimates of the intercept and slope, but the answer for the confidence intervals of the intercept and slope is "No" (because the two fits estimate $\sigma^2$, the error variance, differently). There are a couple of ways to prove that the point estimates of the intercept and slope are the same. The first is to compare the SSE's. Note that for the full data set, we want to find $b_0$ and $b_1$ to minimize
$$\mathrm{RSS}_1 = \sum_{i=1}^n \Big[\big(y_{i1} - (b_0 + b_1 x_i)\big)^2 + \big(y_{i2} - (b_0 + b_1 x_i)\big)^2\Big],$$
whereas the second linear regression finds the $b_0$ and $b_1$ that minimize $\mathrm{RSS}_2 = \sum_{i=1}^n \big(\bar y_i - (b_0 + b_1 x_i)\big)^2$. Since $\bar y_i = (y_{i1} + y_{i2})/2$, it is straightforward to show that
$$\mathrm{RSS}_1 - 2\,\mathrm{RSS}_2 = \sum_{i=1}^n \Big[\big(y_{i1} - (b_0 + b_1 x_i)\big)^2 + \big(y_{i2} - (b_0 + b_1 x_i)\big)^2 - 2\Big(\frac{y_{i1} + y_{i2}}{2} - (b_0 + b_1 x_i)\Big)^2\Big] = \sum_{i=1}^n \Big[y_{i1}^2 + y_{i2}^2 - \frac{(y_{i1} + y_{i2})^2}{2}\Big] = \sum_{i=1}^n \frac{(y_{i1} - y_{i2})^2}{2}$$
does not depend on $b_0$ and $b_1$. Hence, the pair $(b_0, b_1)$ that minimizes one of the RSS's also minimizes the other, and thus these two linear regression methods lead to the same solutions. The second method is to prove the equality directly. For the original full data set, the least squares estimates are
$$\hat\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar x)(Y_{i1} - \bar y) + \sum_{i=1}^n (x_i - \bar x)(Y_{i2} - \bar y)}{2\sum_{i=1}^n (x_i - \bar x)^2} = \frac{\sum_{i=1}^n (x_i - \bar x)\big(\frac{Y_{i1} + Y_{i2}}{2} - \bar y\big)}{\sum_{i=1}^n (x_i - \bar x)^2},$$
$$\hat\beta_0 = \bar y - \hat\beta_1 \bar x, \quad \text{with } \bar y = \frac{\sum_{i=1}^n (Y_{i1} + Y_{i2})}{2n},$$
while the least squares estimates for the new regression model are
$$\hat\beta_1^* = \frac{\sum_{i=1}^n (x_i - \bar x)\big(\frac{Y_{i1} + Y_{i2}}{2} - \bar y\big)}{\sum_{i=1}^n (x_i - \bar x)^2}, \qquad \hat\beta_0^* = \bar y - \hat\beta_1^* \bar x.$$
Thus $\hat\beta_1^* = \hat\beta_1$ and so $\hat\beta_0^* = \hat\beta_0$. Therefore, the two methods are equivalent. You can also easily extend this proof to multiple linear regression.
https://www.coursehero.com/file/7917926/hw03sol/
This study resource was shared via CourseHero.com
It is important to note that these two linear regressions lead to different estimators of $\sigma^2$, the error variance, and thus different confidence intervals for the intercept and slope. To see this, the full data set has $2n$ observations and thus $s_1^2 = \hat\sigma^2_{\mathrm{full}} = \frac{\mathrm{RSS}_1}{2n-2}$, whereas the second linear regression has $n$ observations and thus $s_2^2 = \hat\sigma^2 = \frac{\mathrm{RSS}_2}{n-2}$. Plugging in the relationship between $\mathrm{RSS}_1$ and $\mathrm{RSS}_2$, we have
$$s_1^2 - s_2^2 = -\frac{\mathrm{RSS}_2}{(n-1)(n-2)} + \frac{1}{2(n-1)} \sum_{i=1}^n \frac{(y_{i1} - y_{i2})^2}{2},$$
and thus $s_1^2 \neq s_2^2$ in general.
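Both claims — identical point estimates and the identity $\mathrm{RSS}_1 - 2\,\mathrm{RSS}_2 = \sum_i (y_{i1}-y_{i2})^2/2$ — are easy to verify numerically. The following is a quick Python sketch with made-up duplicate observations (the course's code is in R; this is only an illustrative check, and the data are hypothetical):

```python
# Fit ordinary least squares by the textbook formulas.
def ols(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1          # (intercept, slope)

x = [1.0, 2.0, 3.0, 4.0]
y1 = [2.1, 3.9, 6.2, 8.1]                # first observation at each x_i
y2 = [1.9, 4.3, 5.8, 7.7]                # duplicate observation at each x_i

# Regression on the full duplicated data (2n points) vs. on the pair means.
b0_full, b1_full = ols(x + x, y1 + y2)
ymean = [(a + b) / 2 for a, b in zip(y1, y2)]
b0_avg, b1_avg = ols(x, ymean)
assert abs(b0_full - b0_avg) < 1e-9 and abs(b1_full - b1_avg) < 1e-9

def rss(x, y, b0, b1):
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

rss1 = rss(x + x, y1 + y2, b0_full, b1_full)
rss2 = rss(x, ymean, b0_avg, b1_avg)
identity = sum((a - b) ** 2 for a, b in zip(y1, y2)) / 2
assert abs(rss1 - 2 * rss2 - identity) < 1e-9   # RSS_1 - 2 RSS_2 = sum (y_i1 - y_i2)^2 / 2
```

Since the identity holds for any $(b_0, b_1)$, it holds in particular at the common minimizer, which is exactly why the two fits agree.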
Problem 2. Suppose that $n$ points are to be placed in the interval $[-1, 1]$ for fitting the model
$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i,$$
where the $\epsilon_i$'s are independent with common mean 0 and common variance $\sigma^2$. How should the $x_i$'s be chosen in order to minimize $\mathrm{Var}(\hat\beta_1)$?

(Hints: This is one of the earliest examples in the field of optimal design of experiments. Note that this question essentially asks us to choose the $x_i$'s (in the interval $[-1, 1]$) so as to maximize $h(x_1, \ldots, x_n) = \sum_{i=1}^n (x_i - \bar x)^2 = \sum_{i=1}^n x_i^2 - n \bar x^2$. For each $i$, fixing the other $x_j$'s, is $h$ a quadratic polynomial of $x_i$? When will $x_i$ maximize $h$? Finally, suppose $m$ of the $x_i$'s are equal to $-1$; then the remaining $n - m$ of the $x_i$'s should be $1$. Which $m$ maximizes $h$? Your answers may depend on whether $n$ is even or odd.)
Answer: Recall that $\mathrm{Var}(\hat\beta_1) = \frac{\sigma^2}{S_{xx}} = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2}$. Hence, minimizing $\mathrm{Var}(\hat\beta_1)$ is equivalent to maximizing $h(x_1, \ldots, x_n) = \sum_{i=1}^n (x_i - \bar x)^2 = \sum_{i=1}^n x_i^2 - n \bar x^2$. Following the hints, for each $i$, fixing the other $x_j$'s (note that $\bar x$ also depends on $x_i$), we have
$$h = \sum_{i=1}^n x_i^2 - n\Big(\frac{\sum_{i=1}^n x_i}{n}\Big)^2 = \Big(1 - \frac{1}{n}\Big) x_i^2 - \frac{2}{n}\, x_i \sum_{j \neq i} x_j + \sum_{j \neq i} x_j^2 - \frac{1}{n}\Big(\sum_{j \neq i} x_j\Big)^2,$$
which is a quadratic polynomial of $x_i$ with leading coefficient $1 - \frac{1}{n} > 0$ when $n \geq 2$. Such a convex quadratic is maximized at an endpoint of the interval, so the optimal choices of the $x_i$'s must be $\pm 1$. Suppose $m$ of the $x_i$'s are equal to $-1$; then the remaining $n - m$ of the $x_i$'s should be $1$, and thus
$$h = n - n\Big(\frac{2m - n}{n}\Big)^2 = n - \frac{(2m - n)^2}{n},$$
which is maximized at $m = n/2$ if $n$ is even, and $m = (n \pm 1)/2$ if $n$ is odd.
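As a sanity check, here is a small Python sketch comparing $h$ for the optimal half-at-$-1$, half-at-$+1$ design against two hypothetical alternatives (the designs below are made up for illustration):

```python
# h(x_1, ..., x_n) = sum (x_i - xbar)^2; larger h means smaller Var(beta1_hat).
def h(xs):
    xbar = sum(xs) / len(xs)
    return sum((x - xbar) ** 2 for x in xs)

n = 10
endpoint_design = [-1.0] * (n // 2) + [1.0] * (n - n // 2)   # m = n/2 at each endpoint
equispaced = [-1 + 2 * i / (n - 1) for i in range(n)]        # evenly spread over [-1, 1]
lopsided = [-1.0] * 2 + [1.0] * 8                            # endpoints, but wrong split

assert h(endpoint_design) == n          # h = n - (2m - n)^2 / n = n when m = n/2
assert h(endpoint_design) > h(equispaced)
assert h(endpoint_design) > h(lopsided)
```

This matches the formula above: any design that moves points off the endpoints, or splits them unevenly, strictly decreases $h$.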
Problem 3. The dataset teengamb (available in the R library faraway — you need to install this library first) concerns a study of teenage gambling in Britain. Fit a regression model with the expenditure on gambling as the response and the sex (coded as male=0 and female=1), status, income, and verbal score as predictors. Present the output and answer the following questions.
(a) What percentage of variation in the response is explained by these predictors?
Answer: 52.67%, which is just the value of R2.
(b) Which observation has the largest (positive) residual? Give the case number.
Answer: The 24th observation.
(c) Compute the mean and median of the residuals.
Answer: The mean of the residuals is 0 and the median is -1.4514.
(d) Compute the correlation of the residuals with the fitted values.
Answer: 0.
(e) Compute the correlation of the residuals with the income.
Answer: 0.
(f) For all other predictors held constant, what would be the difference in predicted expenditure on gambling for a male compared to a female?

Answer: When the variable "sex" goes from 0 to 1, the change in the predicted expenditure on gambling would be (1 − 0) × (−22.1183) = −22.1183, where −22.1183 is the regression coefficient of "sex".
(g) Which variables are statistically significant?
Answer: "sex" and "income" are significant at the 5% level.
(h) Predict the amount that a male with average (given these data) status, income and verbal score would gamble, along with an appropriate 95% CI. Repeat the prediction for a male with maximal values (for this data) of status, income, and verbal score. Which CI is wider and why is this expected?
Answer: Note that in the multiple linear regression $Y_{n \times 1} = X_{n \times p}\,\beta_{p \times 1} + \epsilon$, for a given new data point $x_{\mathrm{new}}$, the point prediction is $\hat y^* = x_{\mathrm{new}}^T \hat\beta$. There are two kinds of confidence intervals involving $\hat y^*$. The first is the so-called $100(1-\alpha)\%$ prediction interval on the future observation, which is given by
$$x_{\mathrm{new}}^T \hat\beta \pm t_{\alpha/2,\,n-p}\, \hat\sigma \sqrt{1 + x_{\mathrm{new}}^T (X^T X)^{-1} x_{\mathrm{new}}}.$$
The second is the so-called $100(1-\alpha)\%$ confidence interval on the mean response $y^*$:
$$x_{\mathrm{new}}^T \hat\beta \pm t_{\alpha/2,\,n-p}\, \hat\sigma \sqrt{x_{\mathrm{new}}^T (X^T X)^{-1} x_{\mathrm{new}}}.$$
The two intervals differ only in the term "1" (the variance of the future noise) inside the square root. In this problem, the prediction interval seems more appropriate, though we also give full credit if you provide the latter intervals (but take points off if you misunderstood them). For a male with average values, the predicted amount of gambling is 28.2425 with a 95% prediction interval $[-18.5154, 75.0004]$ (the 95% confidence interval on the mean gambling is $[18.7828, 37.7023]$). Likewise, for a male with maximal values, the predicted amount of gambling is 71.3079 with a 95% prediction interval $[17.0659, 125.5500]$ (the corresponding 95% confidence interval on the mean gambling is $[42.2324, 100.3835]$). The interval for a male with maximal values is wider because that point lies far from the center of the observed data, which inflates $x_{\mathrm{new}}^T (X^T X)^{-1} x_{\mathrm{new}}$ and hence the (prediction) variance.
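The widening effect is easiest to see in the one-predictor case, where $x_{\mathrm{new}}^T (X^T X)^{-1} x_{\mathrm{new}}$ reduces to the leverage $h_{\mathrm{new}} = \frac{1}{n} + \frac{(x_{\mathrm{new}} - \bar x)^2}{S_{xx}}$. A short Python sketch with hypothetical design points (not the teengamb data) illustrates how both intervals grow away from the data center:

```python
import math

# Hypothetical one-predictor design: n = 20 equally spaced points.
x = [float(v) for v in range(1, 21)]
n = len(x)
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)

def leverage(x_new):
    # x_new^T (X^T X)^{-1} x_new for simple linear regression.
    return 1 / n + (x_new - xbar) ** 2 / sxx

h_center = leverage(xbar)    # predicting at the average predictor value
h_edge = leverage(max(x))    # predicting at the maximal predictor value

# CI half-width scales with sqrt(h), PI half-width with sqrt(1 + h).
assert h_edge > h_center                                   # farther out => larger leverage
assert math.sqrt(1 + h_edge) > math.sqrt(1 + h_center)     # so the PI widens at the edge
assert math.sqrt(1 + h_edge) > math.sqrt(h_edge)           # and the PI is always wider than the CI
```

The same mechanism drives the teengamb intervals: the maximal-value male is far from the predictor means, so his leverage, and hence both intervals, are larger.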
Figure 1: Non-constant variance of the original linear model — diagnostic plots for lm(Lab ~ Field): "Residuals vs Fitted" and "Scale−Location" panels (residuals and standardized residuals against fitted values), both showing non-constant variance.
(i) Fit a model with just income as a predictor and use an F-test to compare it to the full model.
Answer: The p-value of the F-test is 0.01177, so we reject the reduced model at the 5% level: income alone is not adequate.
Problem 4. Researchers at the National Institute of Standards and Technology (NIST) collected pipeline data (available in the R library faraway) on ultrasonic measurements of the depths of defects in the Alaska pipeline in the field. The depths of the defects were then remeasured in the laboratory. These measurements were performed in six different batches. It turns out that this batch effect is not significant and so can be ignored in the analysis that follows. The laboratory measurements are more accurate than the in-field measurements, but more time consuming and expensive. We want to develop a regression equation for correcting the in-field measurements.
(a) Fit a regression model Lab ∼ Field. Check for nonconstant variance.
Answer: See the regression summary in the appendix as well as Figure 1. Obviously the variance is not constant. We can also do a non-constant variance score test; the p-value is 5.35e-08, which indicates significantly non-constant variance.
(b) We wish to use weights to account for the nonconstant variance. Here we split the range of Field into 12 groups of size nine (except for the last group, which has only eight values). Within each group, we compute the variance of Lab as varlab and the mean of Field as meanfield. Supposing pipeline is the name of your data frame, the following R code will make the needed computations:
> i <- order(pipeline$Field)
> npipe <- pipeline[i,]
> ff <- gl(12, 9)[-108]
> meanfield <- unlist(lapply(split(npipe$Field, ff), mean))
> varlab <- unlist(lapply(split(npipe$Lab, ff), var))
Suppose we guess that the variance in the response is linked to the predictor in the following way:
$$\mathrm{var}(\mathrm{Lab}) = a_0\, \mathrm{Field}^{a_1}.$$
Regress log(varlab) on log(meanfield) to estimate $a_0$ and $a_1$. (You might or might not choose to ignore the last point, as the last group has only eight values.) Use this to determine appropriate weights in a WLS fit of Lab on Field. Show the regression summary.
Answer: By regressing log(varlab) on log(meanfield), we get estimates $\log(a_0) = -0.3538$ and $a_1 = 1.1244$. So the weights in a WLS fit of Lab on Field should be $w_i = \dfrac{1}{e^{-0.3538}\, \mathrm{Field}_i^{1.1244}}$. See the regression summary in the appendix (model3).
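For concreteness, the weighting scheme can be sketched outside R as well. The Python snippet below uses the coefficient estimates from the log-log regression above with hypothetical Field/Lab values (the actual analysis uses R's lm(..., weights=) as in the appendix):

```python
import math

log_a0, a1 = -0.3538, 1.1244                 # estimates from the log-log regression
field = [10.0, 25.0, 40.0, 60.0, 85.0]       # hypothetical Field depths
lab = [11.0, 29.0, 47.0, 75.0, 102.0]        # hypothetical Lab depths

# WLS weight w_i = 1 / var(Lab_i) = 1 / (a0 * Field_i^a1).
w = [1 / (math.exp(log_a0) * f ** a1) for f in field]

# Weighted least squares for Lab = b0 + b1 * Field via the weighted normal equations.
sw = sum(w)
xw = sum(wi * f for wi, f in zip(w, field)) / sw     # weighted mean of Field
yw = sum(wi * y for wi, y in zip(w, lab)) / sw       # weighted mean of Lab
b1 = sum(wi * (f - xw) * (y - yw) for wi, f, y in zip(w, field, lab)) \
     / sum(wi * (f - xw) ** 2 for wi, f in zip(w, field))
b0 = yw - b1 * xw

assert w[0] > w[-1]   # deeper (noisier) defects get smaller weight
# At the WLS solution, the weighted residuals sum to zero.
assert abs(sum(wi * (y - b0 - b1 * f) for wi, f, y in zip(w, field, lab))) < 1e-9
```

Down-weighting the high-variance (large-Field) observations is exactly what stabilizes the fit here.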
(c) An alternative to weighting is transformation. Find transformations on Lab and/or Field so that in the transformed scale the relationship is approximately linear with constant variance. You may restrict your choice of transformation to square root, log, and inverse.
Answer: We regress log(Lab) on log(Field); see model4 in the appendix. The p-value is less than 2.2e-16 and $R^2 = 0.9337$, which is close to 1, so the two transformed variables exhibit a good linear relationship. Moreover, the non-constant variance score test has a p-value of 0.394, which is not significant at the 5% level.
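The variance-stabilizing effect of the log transform can be illustrated with simulated multiplicative noise. This Python sketch uses seeded synthetic data, not the pipeline measurements:

```python
import math
import random
import statistics

random.seed(0)
field = [random.uniform(5, 90) for _ in range(500)]
# Multiplicative error: Lab = Field * exp(noise), so raw-scale spread grows
# with Field while log-scale spread stays constant.
lab = [f * math.exp(random.gauss(0, 0.2)) for f in field]

pairs = sorted(zip(field, lab))
low, high = pairs[:250], pairs[250:]       # small-Field half vs large-Field half

def spread(group, transform):
    # Std. dev. of residuals around the identity line on the transformed scale.
    return statistics.stdev(transform(y) - transform(f) for f, y in group)

sd_raw_low, sd_raw_high = spread(low, lambda v: v), spread(high, lambda v: v)
sd_log_low, sd_log_high = spread(low, math.log), spread(high, math.log)

assert sd_raw_high > 1.5 * sd_raw_low   # raw scale: variance fans out with Field
assert sd_log_high < 1.3 * sd_log_low   # log scale: roughly constant variance
```

This mirrors what the ncvTest results show: highly significant non-constant variance before the transform, none after it.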
Appendix: R Code
## Problem 3
##
> require(faraway)
> model<-lm(gamble~.,data=teengamb)
> summary(model)
Call:
lm(formula = gamble ~ ., data = teengamb)
Residuals:
Min 1Q Median 3Q Max
-51.082 -11.320 -1.451 9.452 94.252
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.55565 17.19680 1.312 0.1968
sex -22.11833 8.21111 -2.694 0.0101 *
status 0.05223 0.28111 0.186 0.8535
income 4.96198 1.02539 4.839 1.79e-05 ***
verbal -2.95949 2.17215 -1.362 0.1803
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 22.69 on 42 degrees of freedom
Multiple R-squared: 0.5267, Adjusted R-squared: 0.4816
F-statistic: 11.69 on 4 and 42 DF, p-value: 1.815e-06
>
> round(vcov(model), digits=2)
(Intercept) sex status income verbal
(Intercept) 295.73 -72.73 -2.40 -9.89 -15.18
sex -72.73 67.42 1.27 2.47 -3.54
status -2.40 1.27 0.08 0.10 -0.32
income -9.89 2.47 0.10 1.05 -0.05
verbal -15.18 -3.54 -0.32 -0.05 4.72
>
> which.max(residuals(model))
24
24
> mean(residuals(model))
[1] 1.240143e-16
> median(residuals(model))
[1] -1.451392
> cor(residuals(model),fitted(model))
[1] 6.247412e-17
> cor(residuals(model),teengamb$income)
[1] -3.961603e-17
> m1<-data.frame(sex=0,status=mean(teengamb$status),
income=mean(teengamb$income),verbal=mean(teengamb$verbal))
> predict(model,m1,interval="confidence")
fit lwr upr
1 28.24252 18.78277 37.70227
>
> predict(model,m1,interval="prediction")
fit lwr upr
1 28.24252 -18.51536 75.00039
> m2<-data.frame(sex=0,status=max(teengamb$status),
income=max(teengamb$income),verbal=max(teengamb$verbal))
> predict(model,m2,interval="confidence")
fit lwr upr
1 71.30794 42.23237 100.3835
>
> predict(model,m2,interval="prediction")
fit lwr upr
1 71.30794 17.06588 125.55
> model2<-lm(gamble~income,data=teengamb)
> anova(model,model2)
Analysis of Variance Table
Model 1: gamble ~ sex + status + income + verbal
Model 2: gamble ~ income
Res.Df RSS Df Sum of Sq F Pr(>F)
1 42 21624
2 45 28009 -3 -6384.8 4.1338 0.01177 *
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
###################
#### Problem 4:
> model<-lm(Lab~Field,data=pipeline)
> summary(model)
Call:
lm(formula = Lab ~ Field, data = pipeline)
Residuals:
Min 1Q Median 3Q Max
-21.985 -4.072 -1.431 2.504 24.334
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.96750 1.57479 -1.249 0.214
Field 1.22297 0.04107 29.778 <2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 7.865 on 105 degrees of freedom
Multiple R-squared: 0.8941, Adjusted R-squared: 0.8931
F-statistic: 886.7 on 1 and 105 DF, p-value: < 2.2e-16
> plot(model,which=c(1,3))
> require(car)
> ncvTest(model)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 29.58568 Df = 1 p = 5.349868e-08
> model2<-lm(log(varlab)~log(meanfield))
> summary(model2)
Call:
lm(formula = log(varlab) ~ log(meanfield))
Residuals:
Min 1Q Median 3Q Max
-2.2038 -0.6729 0.1656 0.7205 1.1891
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.3538 1.5715 -0.225 0.8264
log(meanfield) 1.1244 0.4617 2.435 0.0351 *
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 1.018 on 10 degrees of freedom
Multiple R-squared: 0.3723, Adjusted R-squared: 0.3095
F-statistic: 5.931 on 1 and 10 DF, p-value: 0.03513
> weight<-1/(exp(model2$coefficients[1])*pipeline$Field^model2$coefficients[2])
> model3<-lm(Lab~Field,data=pipeline,weights=weight)
> summary(model3)
Call:
lm(formula = Lab ~ Field, data = pipeline, weights = weight)
Weighted Residuals:
Min 1Q Median 3Q Max
-2.0826 -0.8102 -0.3189 0.6212 3.4429
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.49436 0.90707 -1.647 0.102
Field 1.20828 0.03488 34.637 <2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 1.169 on 105 degrees of freedom
Multiple R-squared: 0.9195, Adjusted R-squared: 0.9188
F-statistic: 1200 on 1 and 105 DF, p-value: < 2.2e-16
> model4<-lm(log(Lab)~log(Field),data=pipeline)
> summary(model4)
Call:
lm(formula = log(Lab) ~ log(Field), data = pipeline)
Residuals:
Min 1Q Median 3Q Max
-0.40212 -0.11853 -0.03092 0.13424 0.40209
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.06849 0.09305 -0.736 0.463
log(Field) 1.05483 0.02743 38.457 <2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 0.1837 on 105 degrees of freedom
Multiple R-squared: 0.9337, Adjusted R-squared: 0.9331
F-statistic: 1479 on 1 and 105 DF, p-value: < 2.2e-16
> ncvTest(model4)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 0.7266744 Df = 1 p = 0.3939633