
Page 1: Lecture 6 Notes

Lecture 6 Notes

• Note: I will e-mail homework 2 tonight. It will be due next Thursday.

• The Multiple Linear Regression model (Chapter 4.1)

• Inferences from multiple regression analysis (Chapter 4.2)

• In multiple regression analysis, we consider more than one independent variable x_1, …, x_K. We are interested in the conditional mean of y given x_1, …, x_K.

Page 2: Lecture 6 Notes

Automobile Example

• A team charged with designing a new automobile is concerned about the gas mileage that can be achieved. The design team is interested in two things:

(1) Which characteristics of the design are likely to affect mileage?

(2) A new car is planned to have the following characteristics: weight – 4000 lbs, horsepower – 200, cargo – 18 cubic feet, seating – 5 adults. Predict the new car’s gas mileage.

• The team has available information about gallons per 1000 miles and four design characteristics (weight, horsepower, cargo, seating) for a sample of cars made in 1989. Data is in car89.JMP.

Page 3: Lecture 6 Notes

Multivariate Correlations

             GP1000MHwy   Weight(lb)   Horsepower    Cargo   Seating
GP1000MHwy       1.0000       0.7097       0.6157   0.3405    0.2599
Weight(lb)       0.7097       1.0000       0.7509   0.1816    0.3499
Horsepower       0.6157       0.7509       1.0000  -0.0548   -0.0914
Cargo            0.3405       0.1816      -0.0548   1.0000    0.4894
Seating          0.2599       0.3499      -0.0914   0.4894    1.0000

7 rows not used due to missing values.

Scatterplot Matrix

[Pairwise scatterplots of GP1000MHwy, Weight(lb), Horsepower, Cargo, and Seating]

Page 4: Lecture 6 Notes

Best Single Predictor

• To obtain the correlation matrix and pairwise scatterplots, click Analyze, Multivariate Methods, Multivariate.

• If we use simple linear regression with each of the four independent variables, which provides the best predictions?
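As a supplement (not part of the original notes), the same correlation matrix and scatterplot matrix can be produced outside JMP; the minimal Python sketch below assumes the car89.JMP data have been exported to a CSV file named car89.csv with the column names shown.

```python
import pandas as pd
import matplotlib.pyplot as plt

cols = ["GP1000MHwy", "Weight(lb)", "Horsepower", "Cargo", "Seating"]
df = pd.read_csv("car89.csv")[cols].dropna()  # JMP likewise dropped 7 rows with missing values

print(df.corr())                # pairwise correlation matrix
pd.plotting.scatter_matrix(df)  # pairwise scatterplots
plt.show()
```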

Page 5: Lecture 6 Notes

Best Single Predictor

• Answer: The simple linear regression that has the highest R^2 gives the best predictions, because recall that

  R^2 = 1 − SSE/SST

• Weight gives the best predictions of GP1000MHwy based on simple linear regression.

• But we can obtain better predictions by using more than one of the independent variables.
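One quick way to verify this claim is to fit each single-predictor regression and compare R^2 values; a minimal sketch, assuming the DataFrame df from the earlier sketch:

```python
import statsmodels.api as sm

y = df["GP1000MHwy"]
for x in ["Weight(lb)", "Horsepower", "Cargo", "Seating"]:
    r2 = sm.OLS(y, sm.add_constant(df[x])).fit().rsquared
    print(f"{x:12s} R^2 = {r2:.4f}")
# For simple regression R^2 equals the squared correlation, so Weight(lb)
# should win with roughly 0.7097^2 = 0.50 (see the correlation matrix).
```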

Page 6: Lecture 6 Notes

Multiple Linear Regression Model

• Multiple linear regression model:

  E(Y | x_1, …, x_K) = μ_{y|x_1,…,x_K} = β_0 + β_1 x_1 + … + β_K x_K

  y_i = β_0 + β_1 x_{i1} + β_2 x_{i2} + … + β_K x_{iK} + e_i

• Assumptions about the disturbances e_i:

  – The expected value of the disturbances is zero for each (x_{i1}, …, x_{iK}): E(e_i | x_{i1}, …, x_{iK}) = 0.

  – The variance of each e_i is equal to σ_e^2, i.e., Var(e_i | x_{i1}, …, x_{iK}) = σ_e^2.

  – The e_i are normally distributed.

  – The e_i are independent.
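To make the assumptions concrete, here is an illustrative simulation (not from the lecture; all numbers are made up) that generates data satisfying them and checks that least squares recovers the coefficients:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, K = 200, 2
X = rng.normal(size=(n, K))          # two independent variables
beta = np.array([2.0, 1.5, -0.5])    # hypothetical beta_0, beta_1, beta_2
e = rng.normal(scale=1.0, size=n)    # i.i.d. normal errors: mean 0, constant variance
y = beta[0] + X @ beta[1:] + e       # the multiple linear regression model

print(sm.OLS(y, sm.add_constant(X)).fit().params)  # estimates close to (2.0, 1.5, -0.5)
```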

Page 7: Lecture 6 Notes

Point Estimates for Multiple Linear Regression Model

• We use the same least squares procedure as for simple linear regression.

• Our estimates b_0, …, b_K of β_0, …, β_K are the coefficients that minimize the sum of squared prediction errors:

  (b_0, …, b_K) = argmin over b_0*, …, b_K* of Σ_{i=1}^n (y_i − b_0* − b_1* x_{i1} − … − b_K* x_{iK})^2

• The resulting prediction equation is ŷ = b_0 + b_1 x_1 + … + b_K x_K.

• Least Squares in JMP: Click Analyze, Fit Model, put the dependent variable into Y and add the independent variables to the construct model effects box.
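Outside JMP, the same least squares fit can be obtained with statsmodels; a sketch assuming the df loaded earlier (the JMP output on the next slide is the reference):

```python
import statsmodels.api as sm

X = sm.add_constant(df[["Weight(lb)", "Horsepower", "Cargo", "Seating"]])
fit = sm.OLS(df["GP1000MHwy"], X).fit()
print(fit.summary())  # coefficients, standard errors, t ratios, R^2, RMSE
```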

Page 8: Lecture 6 Notes

Response GP1000MHwy, Whole Model

[Actual by Predicted plot]

Summary of Fit
  RSquare                       0.589015
  RSquare Adj                   0.573208
  Root Mean Square Error        3.542778
  Mean of Response              37.33359
  Observations (or Sum Wgts)    109

Analysis of Variance
  Source     DF   Sum of Squares   Mean Square   F Ratio   Prob > F
  Model       4        1870.7788       467.695   37.2627     <.0001
  Error     104        1305.3330        12.551
  C. Total  108        3176.1118

Parameter Estimates
  Term         Estimate    Std Error   t Ratio   Prob>|t|
  Intercept    19.100521   2.098478      9.10      <.0001
  Weight(lb)   0.0040877   0.001203      3.40      0.0010
  Horsepower   0.0426999   0.01567       2.73      0.0075
  Cargo        0.0533      0.013787      3.87      0.0002
  Seating      0.0268912   0.428283      0.06      0.9501

[Residual by Predicted plot: GP1000MHwy residual vs. GP1000MHwy predicted]

Page 9: Lecture 6 Notes

Root Mean Square Error

• Estimate of σ_e:

  s_e = sqrt( Σ_{i=1}^n (y_i − ŷ_i)^2 / (n − (K+1)) )

• s_e = Root Mean Square Error in JMP.

• For simple linear regression of GP1000MHwy on Weight, s_e = 3.86. For multiple linear regression of GP1000MHwy on weight, horsepower, cargo, seating, s_e = 3.54.
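For concreteness, the RMSE formula above can be computed by hand; a sketch assuming the fit object from the earlier statsmodels example:

```python
import numpy as np

resid = fit.resid
n, K = len(resid), 4
s_e = np.sqrt((resid ** 2).sum() / (n - (K + 1)))
print(s_e)                     # approximately 3.54, JMP's Root Mean Square Error
print(np.sqrt(fit.mse_resid))  # statsmodels' built-in equivalent
```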

Page 10: Lecture 6 Notes

Residuals and Root Mean Square Errors

• Residual for observation i = prediction error for observation i:

  Y_i − Ŷ_i = Y_i − Ê(Y | X_1 = x_{i1}, …, X_K = x_{iK}) = Y_i − (b_0 + b_1 x_{i1} + … + b_K x_{iK})

• Root mean square error = typical size of the absolute value of the prediction error.

• As with the simple linear regression model, if the multiple linear regression model holds:

  – About 95% of the observations will be within two RMSEs of their predicted value.

• For the car data, about 95% of the time the actual GP1000M will be within 2 × 3.54 = 7.08 GP1000M of the predicted GP1000M of the car based on the car's weight, horsepower, cargo and seating.
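The 95% rule is easy to check empirically; a sketch assuming fit and s_e from the earlier examples:

```python
# Fraction of cars whose residual is within two RMSEs of zero
within = (fit.resid.abs() <= 2 * s_e).mean()
print(f"{within:.0%} of cars are within 2 RMSEs of their predicted GP1000M")
```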

Page 11: Lecture 6 Notes

Inferences about Regression Coefficients

• Confidence intervals: A (1 − α)·100% confidence interval for β_k is

  b_k ± t_{α/2} · s_{b_k}

  Degrees of freedom for t equals n − (K+1). The standard error of b_k, s_{b_k}, is found on the JMP output.

• Hypothesis test:

  H_0: β_k = β_k*
  H_a: β_k ≠ β_k*

  Test statistic: t = (b_k − β_k*) / s_{b_k}

  Decision rule for test: Reject H_0 if t ≥ t_{α/2} or t ≤ −t_{α/2}. The p-value for testing H_0: β_k = 0 is printed in the JMP output under Prob>|t|.
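These intervals and tests are also available directly from the fitted statsmodels model; a sketch assuming the fit object from before:

```python
print(fit.conf_int(alpha=0.05))  # 95% confidence intervals for each coefficient
print(fit.tvalues)               # t ratios for H0: beta_k = 0
print(fit.pvalues)               # two-sided p-values (JMP's Prob>|t|)
```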

Page 12: Lecture 6 Notes

Inference Examples

• Find a 95% confidence interval for β_Horsepower.

• Is seating of any help in predicting gas mileage once horsepower, weight and cargo have been taken into account? Carry out a test at the 0.05 significance level.
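A worked sketch of the first question using the Page 8 output (the critical value t_{.025, 104} ≈ 1.983 is taken from a t table, so treat it as an assumption):

  b_Horsepower ± t_{.025} · s_{b_Horsepower} = 0.0427 ± 1.983 × 0.01567 ≈ (0.0116, 0.0738)

For the second question, the Prob>|t| for Seating on the Page 8 output is 0.9501 > 0.05, so we would fail to reject H_0: β_Seating = 0 at the 0.05 level.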

Page 13: Lecture 6 Notes

Partial Slopes vs. Marginal Slopes

• Multiple Linear Regression Model:

  μ_{y|x_1,…,x_K} = β_0 + β_1 x_1 + … + β_K x_K

• The coefficient β_k is a partial slope. It indicates the change in the mean of y that is associated with a one-unit increase in x_k while holding all other variables x_1, …, x_{k−1}, x_{k+1}, …, x_K fixed.

• A marginal slope is obtained when we perform a simple regression with only one X, ignoring all other variables. Consequently the other variables are not held fixed.

Page 14: Lecture 6 Notes

Simple Linear Regression: Bivariate Fit of GP1000MHwy By Seating

[Scatterplot of GP1000MHwy vs. Seating with fitted line]

Parameter Estimates
  Term        Estimate    Std Error   t Ratio   Prob>|t|
  Intercept   30.829816   2.277905     13.53      <.0001
  Seating     1.3022488   0.442389      2.94      0.0040

Multiple Linear Regression: Response GP1000MHwy, Whole Model

Parameter Estimates
  Term         Estimate    Std Error   t Ratio   Prob>|t|
  Intercept    19.100521   2.098478      9.10      <.0001
  Weight(lb)   0.0040877   0.001203      3.40      0.0010
  Cargo        0.0533      0.013787      3.87      0.0002
  Seating      0.0268912   0.428283      0.06      0.9501
  Horsepower   0.0426999   0.01567       2.73      0.0075
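In code, the contrast between the two Seating slopes looks like this; a sketch assuming the df from earlier:

```python
import statsmodels.api as sm

marginal = sm.OLS(df["GP1000MHwy"], sm.add_constant(df[["Seating"]])).fit()
partial = sm.OLS(df["GP1000MHwy"],
                 sm.add_constant(df[["Weight(lb)", "Horsepower", "Cargo", "Seating"]])).fit()
print(marginal.params["Seating"])  # ~1.30: marginal slope, other variables not held fixed
print(partial.params["Seating"])   # ~0.03: partial slope, other variables held fixed
```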

Page 15: Lecture 6 Notes

Partial Slopes vs. Marginal Slopes Example

• In order to evaluate the benefits of a proposed irrigation scheme in a certain region, suppose that the relation of yield Y to rainfall R is investigated over several years.

• Data is in rainfall.JMP.

Page 16: Lecture 6 Notes

Bivariate Fit of Yield By Total Spring Rainfall

[Scatterplot of Yield vs. Total Spring Rainfall with linear fit]

Linear Fit: Yield = 76.666667 − 1.6666667 · Total Spring Rainfall

Summary of Fit
  RSquare                       0.027778
  RSquare Adj                   -0.13426
  Root Mean Square Error        13.94433
  Mean of Response              60
  Observations (or Sum Wgts)    8

Parameter Estimates
  Term                    Estimate    Std Error   t Ratio   Prob>|t|
  Intercept               76.666667   40.5546       1.89      0.1076
  Total Spring Rainfall   -1.666667   4.025382     -0.41      0.6932

Page 17: Lecture 6 Notes

Bivariate Fit of Average Spring Temperature By Total Spring Rainfall

[Scatterplot of Average Spring Temperature vs. Total Spring Rainfall]

Higher rainfall is associated with lower temperature.

Page 18: Lecture 6 Notes

Multiple Linear Regression: Response Yield

Parameter Estimates
  Term                         Estimate    Std Error   t Ratio   Prob>|t|
  Intercept                    -144.7619   55.8499      -2.59      0.0487
  Total Spring Rainfall        5.7142857   2.680238      2.13      0.0862
  Average Spring Temperature   2.952381    0.692034      4.27      0.0080

Rainfall is estimated to be beneficial once temperature is held fixed.

Multiple regression provides a better picture of the benefits of an irrigation scheme because temperature would be held fixed in an irrigation scheme.
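The sign flip in the rainfall example can be reproduced the same way; a sketch assuming rainfall.JMP has been exported to rainfall.csv with the column names below:

```python
import pandas as pd
import statsmodels.api as sm

rain = pd.read_csv("rainfall.csv")
y = rain["Yield"]
marginal = sm.OLS(y, sm.add_constant(rain[["Total Spring Rainfall"]])).fit()
partial = sm.OLS(y, sm.add_constant(
    rain[["Total Spring Rainfall", "Average Spring Temperature"]])).fit()
print(marginal.params["Total Spring Rainfall"])  # ~ -1.67: rainfall looks harmful marginally
print(partial.params["Total Spring Rainfall"])   # ~ +5.71: beneficial with temperature held fixed
```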