
hw03sol


ISyE 7401 Advanced Statistical Modeling Spring 2013

HW #3 (due in class on February 6, Wednesday)

(There are 4 questions, and please look at both sides)

Problem 1. Suppose that each setting $x_i$ of the independent variable in a simple least squares problem is duplicated, yielding two independent observations $Y_{i1}, Y_{i2}$. Is it true that the least squares estimates of the intercept and slope can be found by doing a regression of the mean response of each pair of duplicates, $\bar Y_i = (Y_{i1} + Y_{i2})/2$, on the $x_i$'s? Why or why not? Explain.

Answer: The answer is a resounding “Yes” for the point estimates of the intercept and slope, but the answer for the confidence intervals of the intercept and slope is “no” (because of the difference in estimating $\sigma^2$, the error variance). There are a couple of ways to prove that the point estimates of the intercept and slope are the same. The first is to look at the SSE's. Note that for the full data set, we want to find $b_0$ and $b_1$ to minimize

$$RSS_1 = \sum_{i=1}^n \Big[\big(y_{i1} - (b_0 + b_1 x_i)\big)^2 + \big(y_{i2} - (b_0 + b_1 x_i)\big)^2\Big],$$

whereas the second linear regression is to find $b_0$ and $b_1$ that minimize $RSS_2 = \sum_{i=1}^n \big(\bar y_i - (b_0 + b_1 x_i)\big)^2$. Since $\bar y_i = (y_{i1} + y_{i2})/2$, it is straightforward to show that

$$RSS_1 - 2\,RSS_2 = \sum_{i=1}^n \Big[\big(y_{i1} - (b_0 + b_1 x_i)\big)^2 + \big(y_{i2} - (b_0 + b_1 x_i)\big)^2 - 2\Big(\frac{y_{i1}+y_{i2}}{2} - (b_0 + b_1 x_i)\Big)^2\Big]$$
$$= \sum_{i=1}^n \Big[y_{i1}^2 + y_{i2}^2 - \frac{(y_{i1}+y_{i2})^2}{2}\Big] = \sum_{i=1}^n \frac{(y_{i1} - y_{i2})^2}{2}$$

does not depend on $b_0$ and $b_1$. Hence, the pair $(b_0, b_1)$ that minimizes one of the RSS's also minimizes the other, and thus the two linear regressions lead to the same solutions.

The second method is to prove the result directly. For the original full data set, the least squares estimates are

$$\hat\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar x)(Y_{i1} - \bar y) + \sum_{i=1}^n (x_i - \bar x)(Y_{i2} - \bar y)}{2\sum_{i=1}^n (x_i - \bar x)^2} = \frac{\sum_{i=1}^n (x_i - \bar x)\big(\frac{Y_{i1}+Y_{i2}}{2} - \bar y\big)}{\sum_{i=1}^n (x_i - \bar x)^2},$$
$$\hat\beta_0 = \bar y - \hat\beta_1 \bar x, \qquad \text{with } \bar y = \frac{\sum_{i=1}^n (Y_{i1} + Y_{i2})}{2n},$$

while the least squares estimates for the new regression model are

$$\hat\beta_1^* = \frac{\sum_{i=1}^n (x_i - \bar x)\big(\frac{Y_{i1}+Y_{i2}}{2} - \bar y\big)}{\sum_{i=1}^n (x_i - \bar x)^2}, \qquad \hat\beta_0^* = \bar y - \hat\beta_1^* \bar x.$$

Thus $\hat\beta_1^* = \hat\beta_1$ and so $\hat\beta_0^* = \hat\beta_0$. Therefore, the two methods are equivalent. You can also easily extend this proof to multiple linear regression.


It is important to note that these two linear regressions lead to different estimators of $\sigma^2$, the error variance, and thus the confidence intervals of the intercept and slope are different. To see this, the full data set has $2n$ observations, so $s_1^2 = \hat\sigma^2_{\text{full}} = \frac{RSS_1}{2n-2}$, whereas the second linear regression has $n$ observations, so $s_2^2 = \hat\sigma^2 = \frac{RSS_2}{n-2}$. Plugging in the relationship between $RSS_1$ and $RSS_2$, we have
$$s_1^2 - s_2^2 = -\frac{RSS_2}{(n-1)(n-2)} + \frac{1}{2(n-1)}\sum_{i=1}^n \frac{(y_{i1} - y_{i2})^2}{2},$$
and thus $s_1^2 \neq s_2^2$ in general.
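The equivalence of the point estimates, and the difference in the residual variance estimates, can also be checked numerically. Below is a minimal R sketch (not part of the original solution); the variable names and the true coefficients are arbitrary choices for illustration.

set.seed(1)
n  <- 20
x  <- runif(n, -1, 1)
y1 <- 2 + 3 * x + rnorm(n)            # first replicate at each x_i
y2 <- 2 + 3 * x + rnorm(n)            # second replicate at each x_i
fit_full <- lm(c(y1, y2) ~ rep(x, 2)) # regression on all 2n observations
fit_mean <- lm((y1 + y2) / 2 ~ x)     # regression of the pair means on x
coef(fit_full) - coef(fit_mean)       # essentially zero: identical point estimates
c(sigma(fit_full), sigma(fit_mean))   # different estimates of sigma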

Problem 2. Suppose that n points are to be placed in the interval [−1, 1] for fitting the model

$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i,$$

where the $\epsilon_i$'s are independent with common mean 0 and common variance $\sigma^2$. How should the $x_i$'s be chosen in order to minimize $\mathrm{Var}(\hat\beta_1)$?

(Hints: This is one of the earliest examples in the field of optimal design of experiments. Note that this question essentially asks us to choose the $x_i$'s (in the interval $[-1, 1]$) so as to maximize $h(x_1, \ldots, x_n) = \sum_{i=1}^n (x_i - \bar x)^2 = \sum_{i=1}^n x_i^2 - n\bar x^2$. For each $i$, fixing the other $x_j$'s, is $h$ a quadratic polynomial in $x_i$? When will $x_i$ maximize $h$? Finally, suppose $m$ of the $x_i$'s are equal to $-1$; then the remaining $n - m$ of the $x_i$'s should be $1$. Which $m$ maximizes $h$? Your answers may depend on whether $n$ is even or odd.)

Answer: Recall that $\mathrm{Var}(\hat\beta_1) = \frac{\sigma^2}{S_{xx}} = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2}$. Hence, minimizing $\mathrm{Var}(\hat\beta_1)$ is equivalent to maximizing $h(x_1, \ldots, x_n) = \sum_{i=1}^n (x_i - \bar x)^2 = \sum_{i=1}^n x_i^2 - n\bar x^2$. Following the hints, for each $i$, fixing the other $x_j$'s (note that $\bar x$ also depends on $x_i$), we have
$$h = \sum_{i=1}^n x_i^2 - n\Big(\frac{\sum_{i=1}^n x_i}{n}\Big)^2 = \Big(1 - \frac{1}{n}\Big)x_i^2 - \frac{2}{n}\,x_i\sum_{j \neq i} x_j + \sum_{j \neq i} x_j^2 - \frac{1}{n}\Big(\sum_{j \neq i} x_j\Big)^2,$$
which is a quadratic polynomial in $x_i$ with leading coefficient $1 - \frac{1}{n} > 0$ when $n \geq 2$, so for fixed $x_j$'s it is maximized at an endpoint of $[-1, 1]$. Thus the optimal choices of the $x_i$'s must be the endpoints $\pm 1$. Suppose $m$ of the $x_i$'s are equal to $-1$; then the remaining $n - m$ of the $x_i$'s should be $1$, and thus
$$h = n - n\Big(\frac{(n-m) - m}{n}\Big)^2 = n - \frac{(n - 2m)^2}{n},$$
which is maximized at $m = n/2$ if $n$ is even, and at $m = (n \pm 1)/2$ if $n$ is odd.
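The optimal-design conclusion can also be checked numerically. Below is a minimal R sketch (not part of the original solution; the helper function and design vectors are illustrative names) comparing $S_{xx} = \sum_i (x_i - \bar x)^2$ for the endpoint design against an equally spaced design; since $\mathrm{Var}(\hat\beta_1) = \sigma^2 / S_{xx}$, a larger $S_{xx}$ means a smaller slope variance.

Sxx <- function(x) sum((x - mean(x))^2)       # the quantity h to be maximized
n <- 10
x_endpoints <- rep(c(-1, 1), each = n / 2)    # m = n/2 points at each end
x_equispace <- seq(-1, 1, length.out = n)     # an alternative design
Sxx(x_endpoints)   # 10, the maximum attainable value (= n) on [-1, 1]
Sxx(x_equispace)   # about 4.07, so Var(beta1-hat) is roughly 2.5 times larger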

Problem 3. The dataset teengamb (available in the R library faraway — you need to install this library first) concerns a study of teenage gambling in Britain. Fit a regression model with the expenditure on gambling as the response and the sex (coded as male=0 and female=1), status, income, and verbal score as predictors. Present the output and answer the following questions.

(a) What percentage of variation in the response is explained by these predictors?

Answer: 52.67%, which is just the value of $R^2$ in the regression summary.

(b) Which observation has the largest (positive) residual? Give the case number.

Answer: The 24th observation.

(c) Compute the mean and median of the residuals.

Answer: The mean of the residuals is 0 (numerically about $1.2 \times 10^{-16}$; it is exactly 0 because the model includes an intercept) and the median is $-1.4514$.


(d) Compute the correlation of the residuals with the fitted values.

Answer: 0 (numerically about $6.2 \times 10^{-17}$); the residuals are orthogonal to the fitted values by the normal equations.

(e) Compute the correlation of the residuals with the income.

Answer: 0 (numerically about $-4.0 \times 10^{-17}$); the residuals are orthogonal to every predictor included in the model, including income.

(f) For all other predictors held constant, what would be the difference in predicted expenditure on gambling for a male compared to a female?

Answer: When the variable “sex” goes from 0 (male) to 1 (female), the change in the predicted expenditure on gambling is $(1 - 0) \times (-22.1183) = -22.1183$, where $-22.1183$ is the regression coefficient of “sex”. Equivalently, holding the other predictors fixed, a male is predicted to spend 22.1183 more on gambling than a female.

(g) Which variables are statistically significant?

Answer: “sex” and “income” are significant at the 5% level (p-values 0.0101 and $1.79 \times 10^{-5}$, respectively).

(h) Predict the amount that a male with average (given these data) status, income and verbal score would gamble along with an appropriate 95% CI. Repeat the prediction for a male with maximal values (for this data) of status, income, and verbal score. Which CI is wider and why is this expected?

Answer: Note that in the multiple linear regression $Y_{n\times 1} = X_{n\times p}\beta_{p\times 1} + \epsilon$, for a given new data point $x_{new}$, the point prediction is $\hat y^* = x_{new}^T\hat\beta$. There are two kinds of confidence intervals involving $\hat y^*$. The first is the so-called $100(1-\alpha)\%$ prediction interval for a future observation, which is given by
$$x_{new}^T\hat\beta \pm t_{\alpha/2,\,n-p}\,\hat\sigma\sqrt{1 + x_{new}^T(X^TX)^{-1}x_{new}}.$$
The second is the so-called $100(1-\alpha)\%$ confidence interval for the mean response,
$$x_{new}^T\hat\beta \pm t_{\alpha/2,\,n-p}\,\hat\sigma\sqrt{x_{new}^T(X^TX)^{-1}x_{new}}.$$
These two intervals differ only in the term “1” (the variance of the future noise) inside the square root. In this problem, the prediction interval seems more appropriate, though we also give full credit if you provide the latter intervals (but take points off if you misunderstood them). For a male with average values, the predicted amount of gambling is 28.2425 with a 95% prediction interval $[-18.5154, 75.0004]$ (for a male with average values, the 95% confidence interval for the mean gambling is $[18.7828, 37.7023]$). Likewise, for a male with maximal values, the predicted amount of gambling is 71.3079 with a 95% prediction interval $[17.0659, 125.5500]$ (the corresponding 95% confidence interval for the mean gambling is $[42.2324, 100.3835]$). The interval for a male with maximal values is wider because $x_{new}$ then lies at the edge of the observed data, so $x_{new}^T(X^TX)^{-1}x_{new}$ is larger, which inflates the prediction variance.
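As a check on the formula above, the prediction interval can be computed by hand and compared with R's predict(). The following is a minimal R sketch (not part of the original solution); the names x_new, s, and se_pi are illustrative, and m1 refers to the “average male” data frame defined in the appendix.

require(faraway)
model <- lm(gamble ~ ., data = teengamb)
x_new <- c(1, 0, mean(teengamb$status), mean(teengamb$income),
           mean(teengamb$verbal))                 # intercept, sex = 0, averages
X     <- model.matrix(model)
s     <- summary(model)$sigma                     # hat sigma
fit   <- sum(x_new * coef(model))                 # point prediction
se_pi <- s * sqrt(1 + drop(t(x_new) %*% solve(crossprod(X)) %*% x_new))
fit + c(-1, 1) * qt(0.975, df.residual(model)) * se_pi
# should reproduce predict(model, m1, interval = "prediction"): [-18.52, 75.00]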


Figure 1: Non-constant variance of the original linear model. (Diagnostic plots for lm(Lab ~ Field): Residuals vs Fitted and Scale-Location.)

(i) Fit a model with just income as a predictor and use an F-test to compare it to the full model.

Answer: The p-value of the F-test is 0.01177, so at the 5% level we reject the reduced model; the reduced model (income only) is not adequate compared with the full model.
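For reference, the F statistic can be reproduced by hand from the RSS values in the anova() output in the appendix:
$$F = \frac{(RSS_{reduced} - RSS_{full})/(45 - 42)}{RSS_{full}/42} = \frac{(28009 - 21624)/3}{21624/42} \approx \frac{2128.3}{514.9} \approx 4.13,$$
which matches the reported $F = 4.1338$ on $(3, 42)$ degrees of freedom with p-value 0.01177.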

Problem 4. Researchers at the National Institute of Standards and Technology (NIST) collected pipeline data (available in the R library faraway) on ultrasonic measurements of the depths of defects in the Alaska pipeline in the field. The depths of the defects were then remeasured in the laboratory. These measurements were performed in six different batches. It turns out that this batch effect is not significant and so can be ignored in the analysis that follows. The laboratory measurements are more accurate than the in-field measurements, but more time consuming and expensive. We want to develop a regression equation for correcting the in-field measurements.

(a) Fit a regression model Lab ∼ Field. Check for nonconstant variance.

Answer: See Figure 1 above as well as the regression output in the appendix. Obviously the variance is not constant. We can also do a non-constant variance score test; the p-value is 5.35e-08, which indicates significantly non-constant variance.

(b) We wish to use weights to account for the nonconstant variance. Here we split the range of Field into 12 groups of size nine (except for the last group, which has only eight values). Within each group, we compute the variance of Lab as varlab and the mean of Field as meanfield. Suppose pipeline is the name of your data frame; the following R code will make the needed computations:

> i <- order(pipeline$Field)

> npipe <- pipeline[i,]


> ff <- gl(12, 9)[-108]

> meanfield <- unlist(lapply(split(npipe$Field, ff), mean))

> varlab <- unlist(lapply(split(npipe$Lab, ff), var))

Suppose we guess that the variance in the response is linked to the predictor in the following way:
$$\mathrm{var}(\mathrm{Lab}) = a_0\,\mathrm{Field}^{a_1}.$$
Regress log(varlab) on log(meanfield) to estimate $a_0$ and $a_1$. (You might or might not choose to ignore the last point, as the last group has only eight values.) Use this to determine appropriate weights in a WLS fit of Lab on Field. Show the regression summary.

Answer: By regressing log(varlab) on log(meanfield), we get the estimates $\log(a_0) = -0.3538$ and $a_1 = 1.1244$. So the weights in a WLS fit of Lab on Field should be $w_i = 1/\big(e^{-0.3538}\,\mathrm{Field}_i^{1.1244}\big)$. See the regression summary in the appendix (model3).
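In more detail, taking logs of the assumed variance model gives $\log \mathrm{var}(\mathrm{Lab}) = \log a_0 + a_1 \log \mathrm{Field}$, which is linear in $\log \mathrm{Field}$, so the intercept and slope of the regression of log(varlab) on log(meanfield) estimate $\log a_0$ and $a_1$. Since WLS weights should be inversely proportional to the response variance, $w_i = 1/\widehat{\mathrm{var}}(\mathrm{Lab}_i) = 1/\big(e^{-0.3538}\,\mathrm{Field}_i^{1.1244}\big)$, which is exactly the weight vector constructed in the appendix code for model3.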

(c) An alternative to weighting is transformation. Find transformations on Lab and/or Field so that in the transformed scale the relationship is approximately linear with constant variance. You may restrict your choice of transformation to square root, log and inverse.

Answer: We regress log(Lab) on log(Field); see the results in the appendix (model4). The p-value is less than 2.2e-16 and $R^2 = 0.9337$, which is close to 1, so the two transformed variables exhibit a good linear relationship. Moreover, the non-constant variance score test has a p-value of 0.394, so there is no evidence of non-constant variance at the 5% level.

Appendix: R Code

## Problem 3

##

> require(faraway)

> model<-lm(gamble~.,data=teengamb)

> summary(model)

Call:

lm(formula = gamble ~ ., data = teengamb)

Residuals:

Min 1Q Median 3Q Max

-51.082 -11.320 -1.451 9.452 94.252

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 22.55565 17.19680 1.312 0.1968

sex -22.11833 8.21111 -2.694 0.0101 *

status 0.05223 0.28111 0.186 0.8535

income 4.96198 1.02539 4.839 1.79e-05 ***


verbal -2.95949 2.17215 -1.362 0.1803

---

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Residual standard error: 22.69 on 42 degrees of freedom

Multiple R-squared: 0.5267, Adjusted R-squared: 0.4816

F-statistic: 11.69 on 4 and 42 DF, p-value: 1.815e-06

>

> round(vcov(model), digits=2)

(Intercept) sex status income verbal

(Intercept) 295.73 -72.73 -2.40 -9.89 -15.18

sex -72.73 67.42 1.27 2.47 -3.54

status -2.40 1.27 0.08 0.10 -0.32

income -9.89 2.47 0.10 1.05 -0.05

verbal -15.18 -3.54 -0.32 -0.05 4.72

>

> which.max(residuals(model))

24

24

> mean(residuals(model))

[1] 1.240143e-16

> median(residuals(model))

[1] -1.451392

> cor(residuals(model),fitted(model))

[1] 6.247412e-17

> cor(residuals(model),teengamb$income)

[1] -3.961603e-17

> m1<-data.frame(sex=0,status=mean(teengamb$status),

income=mean(teengamb$income),verbal=mean(teengamb$verbal))

> predict(model,m1,interval="confidence")

fit lwr upr

1 28.24252 18.78277 37.70227

>

> predict(model,m1,interval="prediction")

fit lwr upr

1 28.24252 -18.51536 75.00039


> m2<-data.frame(sex=0,status=max(teengamb$status),

income=max(teengamb$income),verbal=max(teengamb$verbal))

> predict(model,m2,interval="confidence")

fit lwr upr

1 71.30794 42.23237 100.3835

>

> predict(model,m2,interval="prediction")

fit lwr upr

1 71.30794 17.06588 125.55

> model2<-lm(gamble~income,data=teengamb)

> anova(model,model2)

Analysis of Variance Table

Model 1: gamble ~ sex + status + income + verbal

Model 2: gamble ~ income

Res.Df RSS Df Sum of Sq F Pr(>F)

1 42 21624

2 45 28009 -3 -6384.8 4.1338 0.01177 *

---

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

###################

#### Problem 4:

> model<-lm(Lab~Field,data=pipeline)

> summary(model)

Call:

lm(formula = Lab ~ Field, data = pipeline)

Residuals:

Min 1Q Median 3Q Max

-21.985 -4.072 -1.431 2.504 24.334

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -1.96750 1.57479 -1.249 0.214

Field 1.22297 0.04107 29.778 <2e-16 ***

---

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Residual standard error: 7.865 on 105 degrees of freedom


Multiple R-squared: 0.8941, Adjusted R-squared: 0.8931

F-statistic: 886.7 on 1 and 105 DF, p-value: < 2.2e-16

> plot(model,which=c(1,3))

> require(car)

> ncvTest(model)

Non-constant Variance Score Test

Variance formula: ~ fitted.values

Chisquare = 29.58568 Df = 1 p = 5.349868e-08

> model2<-lm(log(varlab)~log(meanfield))

> summary(model2)

Call:

lm(formula = log(varlab) ~ log(meanfield))

Residuals:

Min 1Q Median 3Q Max

-2.2038 -0.6729 0.1656 0.7205 1.1891

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -0.3538 1.5715 -0.225 0.8264

log(meanfield) 1.1244 0.4617 2.435 0.0351 *

---

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Residual standard error: 1.018 on 10 degrees of freedom

Multiple R-squared: 0.3723, Adjusted R-squared: 0.3095

F-statistic: 5.931 on 1 and 10 DF, p-value: 0.03513

> weight<-1/(exp(model2$coefficients[1])*pipeline$Field^model2$coefficients[2])

> model3<-lm(Lab~Field,data=pipeline,weights=weight)

> summary(model3)

Call:

lm(formula = Lab ~ Field, data = pipeline, weights = weight)

Weighted Residuals:

Min 1Q Median 3Q Max

-2.0826 -0.8102 -0.3189 0.6212 3.4429

Coefficients:

Estimate Std. Error t value Pr(>|t|)


(Intercept) -1.49436 0.90707 -1.647 0.102

Field 1.20828 0.03488 34.637 <2e-16 ***

---

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Residual standard error: 1.169 on 105 degrees of freedom

Multiple R-squared: 0.9195, Adjusted R-squared: 0.9188

F-statistic: 1200 on 1 and 105 DF, p-value: < 2.2e-16

> model4<-lm(log(Lab)~log(Field),data=pipeline)

> summary(model4)

Call:

lm(formula = log(Lab) ~ log(Field), data = pipeline)

Residuals:

Min 1Q Median 3Q Max

-0.40212 -0.11853 -0.03092 0.13424 0.40209

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -0.06849 0.09305 -0.736 0.463

log(Field) 1.05483 0.02743 38.457 <2e-16 ***

---

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Residual standard error: 0.1837 on 105 degrees of freedom

Multiple R-squared: 0.9337, Adjusted R-squared: 0.9331

F-statistic: 1479 on 1 and 105 DF, p-value: < 2.2e-16

> ncvTest(model4)

Non-constant Variance Score Test

Variance formula: ~ fitted.values

Chisquare = 0.7266744 Df = 1 p = 0.3939633
