


Page 1: © 2005 Thomson/South-Western Slide 1 · 2009-01-05 · 2 © 2005 Thomson/South-Western Slide 4 The equation that describes how the mean value of y is related to x1, x2, . . . xp


Slide 1 © 2005 Thomson/South-Western

Slides Prepared by JOHN S. LOUCKS

St. Edward's University

Slide 2

Chapter 15: Multiple Regression

Multiple Regression Model
Least Squares Method
Multiple Coefficient of Determination
Model Assumptions
Testing for Significance
Using the Estimated Regression Equation for Estimation and Prediction
Qualitative Independent Variables

Slide 3

Multiple Regression Model

The equation that describes how the dependent variable y is related to the independent variables x1, x2, . . . , xp and an error term is called the multiple regression model.

y = β0 + β1x1 + β2x2 + . . . + βpxp + ε

where:
β0, β1, β2, . . . , βp are the parameters, and
ε is a random variable called the error term


Slide 4

Multiple Regression Equation

The equation that describes how the mean value of y is related to x1, x2, . . . , xp is called the multiple regression equation.

E(y) = β0 + β1x1 + β2x2 + . . . + βpxp

Slide 5

Estimated Multiple Regression Equation

A simple random sample is used to compute sample statistics b0, b1, b2, . . . , bp that are used as the point estimators of the parameters β0, β1, β2, . . . , βp.

The estimated multiple regression equation is:

ŷ = b0 + b1x1 + b2x2 + . . . + bpxp

Slide 6

Estimation Process

Multiple Regression Model
y = β0 + β1x1 + β2x2 + . . . + βpxp + ε

Multiple Regression Equation
E(y) = β0 + β1x1 + β2x2 + . . . + βpxp

Unknown parameters are β0, β1, β2, . . . , βp

Sample Data:
x1   x2   . . .   xp    y
 .    .            .    .
 .    .            .    .

Estimated Multiple Regression Equation
ŷ = b0 + b1x1 + b2x2 + . . . + bpxp

Sample statistics are b0, b1, b2, . . . , bp

b0, b1, b2, . . . , bp provide estimates of β0, β1, β2, . . . , βp


Slide 7

Least Squares Method

Least Squares Criterion

min Σ(yi - ŷi)²

Computation of Coefficient Values

The formulas for the regression coefficients b0, b1, b2, . . . , bp involve the use of matrix algebra. We will rely on computer software packages to perform the calculations.
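The slides leave the computation to software, but as a rough illustration of what such a package solves, here is a minimal pure-Python sketch that forms the normal equations (X′X)b = X′y for the programmer-salary data shown on the later slides and solves them by Gauss-Jordan elimination. The helper name `solve_normal_equations` is ours, not part of any package mentioned in the slides.

```python
# Sketch: least squares for the programmer-salary data (20 observations
# from the data slide) by solving the normal equations (X'X) b = X'y.
exper = [4, 7, 1, 5, 8, 10, 0, 1, 6, 6, 9, 2, 10, 5, 6, 8, 4, 6, 3, 3]
score = [78, 100, 86, 82, 86, 84, 75, 80, 83, 91,
         88, 73, 75, 81, 74, 87, 79, 94, 70, 89]
salary = [24.0, 43.0, 23.7, 34.3, 35.8, 38.0, 22.2, 23.1, 30.0, 33.0,
          38.0, 26.6, 36.2, 31.6, 29.0, 34.0, 30.1, 33.9, 28.2, 30.0]

# Design matrix X with a leading column of 1s for the intercept
X = [[1.0, x1, x2] for x1, x2 in zip(exper, score)]

def solve_normal_equations(X, y):
    """Solve (X'X) b = X'y with naive Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    for col in range(k):
        # Partial pivoting for numerical stability
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        pivot = A[col][col]
        A[col] = [a / pivot for a in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]

b0, b1, b2 = solve_normal_equations(X, salary)
print(round(b0, 3), round(b1, 3), round(b2, 3))  # → 3.174 1.404 0.251
```

The printed values match the coefficients reported in the Excel output on the later slides.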

Slide 8

Multiple Regression Model

Example: Programmer Salary Survey

A software firm collected data for a sample of 20 computer programmers. A suggestion was made that regression analysis could be used to determine if salary was related to the years of experience and the score on the firm's programmer aptitude test.

The years of experience, score on the aptitude test, and corresponding annual salary ($1000s) for a sample of 20 programmers are shown on the next slide.

Slide 9

Multiple Regression Model

Exper.  Score  Salary      Exper.  Score  Salary
  4       78    24.0          9      88    38.0
  7      100    43.0          2      73    26.6
  1       86    23.7         10      75    36.2
  5       82    34.3          5      81    31.6
  8       86    35.8          6      74    29.0
 10       84    38.0          8      87    34.0
  0       75    22.2          4      79    30.1
  1       80    23.1          6      94    33.9
  6       83    30.0          3      70    28.2
  6       91    33.0          3      89    30.0


Slide 10

Multiple Regression Model

Suppose we believe that salary (y) is related to the years of experience (x1) and the score on the programmer aptitude test (x2) by the following regression model:

y = β0 + β1x1 + β2x2 + ε

where:
y = annual salary ($1000s)
x1 = years of experience
x2 = score on programmer aptitude test

Slide 11

Solving for the Estimates of β0, β1, β2

Input Data:
x1    x2     y
 4    78    24
 7   100    43
 .     .     .
 3    89    30

→  Computer Package for Solving Multiple Regression Problems  →

Least Squares Output:
b0 = , b1 = , b2 = , R2 = , etc.

Slide 12

Solving for the Estimates of β0, β1, β2

Excel Worksheet (showing partial data entered)

      A            B                 C           D
 1    Programmer   Experience (yrs)  Test Score  Salary ($K)
 2    1            4                 78          24.0
 3    2            7                 100         43.0
 4    3            1                 86          23.7
 5    4            5                 82          34.3
 6    5            8                 86          35.8
 7    6            10                84          38.0
 8    7            0                 75          22.2
 9    8            1                 80          23.1

Note: Rows 10-21 are not shown.


Slide 13

Excel’s Regression Dialog Box

Solving for the Estimates of β0, β1, β2

Slide 14

Solving for the Estimates of β0, β1, β2

Excel's Regression Equation Output

      A            B          C          D         E
39                 Coeffic.   Std. Err.  t Stat    P-value
40    Intercept    3.17394    6.15607    0.5156    0.61279
41    Experience   1.4039     0.19857    7.0702    1.9E-06
42    Test Score   0.25089    0.07735    3.2433    0.00478

Note: Columns F-I are not shown.

Slide 15

Estimated Regression Equation

SALARY = 3.174 + 1.404(EXPER) + 0.251(SCORE)

Note: Predicted salary will be in thousands of dollars.
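The estimated equation can be applied directly. As a small sketch, here is the point prediction for a hypothetical programmer with 4 years of experience and an aptitude-test score of 78 (the inputs are illustrative, not from the slides):

```python
# Point prediction from the estimated regression equation on this slide.
b0, b1, b2 = 3.174, 1.404, 0.251
exper, score = 4, 78                  # illustrative inputs
pred = b0 + b1 * exper + b2 * score   # predicted salary in $1000s
print(round(pred, 3))                 # → 28.368, i.e. about $28,368
```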


Slide 16

Interpreting the Coefficients

In multiple regression analysis, we interpret each regression coefficient as follows: bi represents an estimate of the change in y corresponding to a 1-unit increase in xi when all other independent variables are held constant.

Slide 17

Interpreting the Coefficients

b1 = 1.404

Salary is expected to increase by $1,404 for each additional year of experience (when the score on the programmer aptitude test is held constant).

Slide 18

Interpreting the Coefficients

b2 = 0.251

Salary is expected to increase by $251 for each additional point scored on the programmer aptitude test (when years of experience is held constant).


Slide 19

Multiple Coefficient of Determination

Relationship Among SST, SSR, SSE

SST = SSR + SSE

Σ(yi - ȳ)² = Σ(ŷi - ȳ)² + Σ(yi - ŷi)²

where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error

Slide 20

Multiple Coefficient of Determination

Excel's ANOVA Output

      A           B    C          D          E          F
33    ANOVA
34                df   SS         MS         F          Significance F
35    Regression  2    500.3285   250.1643   42.76013   2.32774E-07
36    Residual    17   99.45697   5.85041
37    Total       19   599.7855

SSR (500.3285) and SST (599.7855) appear in the SS column.

Slide 21

Multiple Coefficient of Determination

R2 = SSR/SST

R2 = 500.3285/599.7855 = .83418


Slide 22

Adjusted Multiple Coefficient of Determination

Ra2 = 1 - (1 - R2)(n - 1)/(n - p - 1)

Ra2 = 1 - (1 - .834179)(20 - 1)/(20 - 2 - 1) = .814671
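Both statistics can be reproduced from the ANOVA sums of squares. A small sketch using the values reported on these slides:

```python
# R² and adjusted R² from the ANOVA output (SSR, SST) on the earlier slide.
SSR, SST = 500.3285, 599.7855
n, p = 20, 2                     # observations, independent variables
r2 = SSR / SST
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(r2, 6), round(adj_r2, 6))  # → 0.834179 0.814671
```

Adjusted R² penalizes the fit for each added independent variable, which is why it is slightly smaller than R².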

Slide 23

Adjusted Multiple Coefficient of Determination

Excel's Regression Statistics

      A                    B
24    SUMMARY OUTPUT
26    Regression Statistics
27    Multiple R           0.913334059
28    R Square             0.834179103
29    Adjusted R Square    0.814670762
30    Standard Error       2.418762076
31    Observations         20

Slide 24

Assumptions About the Error Term ε

The error ε is a random variable with mean of zero.

The variance of ε, denoted by σ2, is the same for all values of the independent variables.

The values of ε are independent.

The error ε is a normally distributed random variable reflecting the deviation between the y value and the expected value of y given by β0 + β1x1 + β2x2 + . . . + βpxp.


Slide 25

Testing for Significance

In simple linear regression, the F and t tests provide the same conclusion.

In multiple regression, the F and t tests have different purposes.

Slide 26

Testing for Significance: F Test

The F test is used to determine whether a significant relationship exists between the dependent variable and the set of all the independent variables.

The F test is referred to as the test for overall significance.

Slide 27

Testing for Significance: t Test

If the F test shows an overall significance, the t test is used to determine whether each of the individual independent variables is significant.

A separate t test is conducted for each of the independent variables in the model.

We refer to each of these t tests as a test for individual significance.


Slide 28

Testing for Significance: F Test

Hypotheses

H0: β1 = β2 = . . . = βp = 0
Ha: One or more of the parameters is not equal to zero.

Test Statistic

F = MSR/MSE

Rejection Rule

Reject H0 if p-value < α or if F > Fα, where Fα is based on an F distribution with p d.f. in the numerator and n - p - 1 d.f. in the denominator.
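For the programmer-salary example on the following slides, the test statistic can be computed directly from the ANOVA sums of squares. A sketch using the slide's values:

```python
# F statistic for overall significance, from the ANOVA values in the slides.
SSR, SSE = 500.3285, 99.45697
p, n = 2, 20
MSR = SSR / p              # mean square due to regression
MSE = SSE / (n - p - 1)    # mean square due to error (17 d.f.)
F = MSR / MSE
print(round(F, 2))         # → 42.76, well above F.05 = 3.59 at (2, 17) d.f.
```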

Slide 29

F Test for Overall Significance

Hypotheses

H0: β1 = β2 = 0
Ha: One or both of the parameters is not equal to zero.

Rejection Rule

For α = .05 and d.f. = 2, 17: F.05 = 3.59
Reject H0 if p-value < .05 or F > 3.59

Slide 30

F Test for Overall Significance

Excel's ANOVA Output

      A           B    C          D          E          F
33    ANOVA
34                df   SS         MS         F          Significance F
35    Regression  2    500.3285   250.1643   42.76013   2.32774E-07
36    Residual    17   99.45697   5.85041
37    Total       19   599.7855

The Significance F value is the p-value used to test for overall significance.


Slide 31

F Test for Overall Significance

Test Statistic

F = MSR/MSE = 250.16/5.85 = 42.76

Conclusion

p-value < .05, so we can reject H0. (Also, F = 42.76 > 3.59.)

Slide 32

Testing for Significance: t Test

Hypotheses

H0: βi = 0
Ha: βi ≠ 0

Test Statistic

t = bi / s_bi

Rejection Rule

Reject H0 if p-value < α or if t < -tα/2 or t > tα/2, where tα/2 is based on a t distribution with n - p - 1 degrees of freedom.

Slide 33

t Test for Significance of Individual Parameters

Hypotheses

H0: βi = 0
Ha: βi ≠ 0

Rejection Rule

For α = .05 and d.f. = 17, t.025 = 2.11
Reject H0 if p-value < .05 or if |t| > 2.11


Slide 34

t Test for Significance of Individual Parameters

Excel's Regression Equation Output

      A            B          C          D         E
39                 Coeffic.   Std. Err.  t Stat    P-value
40    Intercept    3.17394    6.15607    0.5156    0.61279
41    Experience   1.4039     0.19857    7.0702    1.9E-06
42    Test Score   0.25089    0.07735    3.2433    0.00478

Note: Columns F-I are not shown.

The t statistic and p-value in the Experience row are used to test for the individual significance of "Experience".

Slide 35

t Test for Significance of Individual Parameters

Excel's Regression Equation Output

      A            B          C          D         E
39                 Coeffic.   Std. Err.  t Stat    P-value
40    Intercept    3.17394    6.15607    0.5156    0.61279
41    Experience   1.4039     0.19857    7.0702    1.9E-06
42    Test Score   0.25089    0.07735    3.2433    0.00478

Note: Columns F-I are not shown.

The t statistic and p-value in the Test Score row are used to test for the individual significance of "Test Score".

Slide 36

t Test for Significance of Individual Parameters

Test Statistics

t = b1/s_b1 = 1.4039/.1986 = 7.07

t = b2/s_b2 = .25089/.07735 = 3.24

Conclusions

Reject both H0: β1 = 0 and H0: β2 = 0. Both independent variables are significant.
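The two test statistics come straight from dividing each coefficient by its standard error, using the values from the Excel output. A small sketch:

```python
# t statistics for individual significance: t = b_i / s_b_i,
# coefficients and standard errors from the Excel output slides.
coeffs = {"Experience": (1.4039, 0.19857),
          "Test Score": (0.25089, 0.07735)}
t_crit = 2.11   # t.025 with 17 degrees of freedom, from the earlier slide

t_stats = {name: b / s_b for name, (b, s_b) in coeffs.items()}
for name, t in t_stats.items():
    verdict = "reject H0" if abs(t) > t_crit else "fail to reject H0"
    print(name, round(t, 2), verdict)   # both exceed 2.11 → reject H0
```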


Slide 37

Testing for Significance: Multicollinearity

The term multicollinearity refers to the correlation among the independent variables.

When the independent variables are highly correlated (say, |r| > .7), it is not possible to determine the separate effect of any particular independent variable on the dependent variable.

Slide 38

Testing for Significance: Multicollinearity

Every attempt should be made to avoid including independent variables that are highly correlated.

If the estimated regression equation is to be used only for predictive purposes, multicollinearity is usually not a serious problem.

Slide 39

Using the Estimated Regression Equation for Estimation and Prediction

The procedures for estimating the mean value of y and predicting an individual value of y in multiple regression are similar to those in simple regression.

We substitute the given values of x1, x2, . . . , xp into the estimated regression equation and use the corresponding value of ŷ as the point estimate.


Slide 40

Using the Estimated Regression Equation for Estimation and Prediction

The formulas required to develop interval estimates for the mean value of y and for an individual value of y are beyond the scope of the textbook.

Software packages for multiple regression will often provide these interval estimates.

Slide 41

Qualitative Independent Variables

In many situations we must work with qualitative independent variables such as gender (male, female), method of payment (cash, check, credit card), etc.

For example, x2 might represent gender, where x2 = 0 indicates male and x2 = 1 indicates female.

In this case, x2 is called a dummy or indicator variable.

Slide 42

Qualitative Independent Variables

Example: Programmer Salary Survey

As an extension of the problem involving the computer programmer salary survey, suppose that management also believes that the annual salary is related to whether the individual has a graduate degree in computer science or information systems.

The years of experience, the score on the programmer aptitude test, whether the individual has a relevant graduate degree, and the annual salary ($1000s) for each of the sampled 20 programmers are shown on the next slide.


Slide 43

Qualitative Independent Variables

Exper.  Score  Degr.  Salary      Exper.  Score  Degr.  Salary
  4       78    No     24.0          9      88    Yes    38.0
  7      100    Yes    43.0          2      73    No     26.6
  1       86    No     23.7         10      75    Yes    36.2
  5       82    Yes    34.3          5      81    No     31.6
  8       86    Yes    35.8          6      74    No     29.0
 10       84    Yes    38.0          8      87    Yes    34.0
  0       75    No     22.2          4      79    No     30.1
  1       80    No     23.1          6      94    Yes    33.9
  6       83    No     30.0          3      70    No     28.2
  6       91    Yes    33.0          3      89    No     30.0

Slide 44

Estimated Regression Equation

ŷ = b0 + b1x1 + b2x2 + b3x3

where:
ŷ = annual salary ($1000s)
x1 = years of experience
x2 = score on programmer aptitude test
x3 = 0 if individual does not have a graduate degree,
     1 if individual does have a graduate degree

x3 is a dummy variable

Slide 45

Qualitative Independent Variables

Excel's Regression Statistics

      A                    B
24    SUMMARY OUTPUT
26    Regression Statistics
27    Multiple R           0.920215239
28    R Square             0.846796085
29    Adjusted R Square    0.818070351
30    Standard Error       2.396475101
31    Observations         20


Slide 46

Qualitative Independent Variables

Excel's ANOVA Output

      A           B    C          D          E          F
33    ANOVA
34                df   SS         MS         F          Significance F
35    Regression  3    507.896    169.2987   29.47866   9.41675E-07
36    Residual    16   91.88949   5.743093
37    Total       19   599.7855

Slide 47

Qualitative Independent Variables

Excel's Regression Equation Output

      A             B          C          D         E
39                  Coeffic.   Std. Err.  t Stat    P-value
40    Intercept     7.94485    7.3808     1.0764    0.2977
41    Experience    1.14758    0.2976     3.8561    0.0014
42    Test Score    0.19694    0.0899     2.1905    0.04364
43    Grad. Degr.   2.28042    1.98661    1.1479    0.26789

Note: Columns F-I are not shown.

Grad. Degr. is not significant: its p-value (.26789) is greater than α = .05.
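One way to read the dummy coefficient: holding experience and test score fixed, b3 is the estimated salary difference associated with a graduate degree. A sketch using the coefficients from the Excel output above (the specific inputs are illustrative, and `predict` is our own helper name):

```python
# Effect of the dummy variable x3 in the extended model: the predicted
# difference between degree and no degree equals b3, other inputs fixed.
b0, b1, b2, b3 = 7.94485, 1.14758, 0.19694, 2.28042

def predict(exper, score, grad):
    """grad = 1 if graduate degree, 0 otherwise (dummy variable)."""
    return b0 + b1 * exper + b2 * score + b3 * grad

diff = predict(5, 80, 1) - predict(5, 80, 0)
print(round(diff, 5))  # → 2.28042, about $2,280 (not statistically significant)
```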

Slide 48

Qualitative Independent Variables

Excel's Regression Equation Output

      A             B          F           G          H            I
39                  Coeffic.   Low. 95%    Up. 95%    Low. 95.0%   Up. 95.0%
40    Intercept     7.94485    -7.701739   23.5914    -7.7017385   23.591436
41    Experience    1.14758    0.516695    1.77847    0.51669483   1.7784686
42    Test Score    0.19694    0.00635     0.38752    0.00634964   0.3875243
43    Grad. Degr.   2.28042    -1.931002   6.49185    -1.9310017   6.4918494

Note: Columns C-E are hidden.


Slide 49

More Complex Qualitative Variables

If a qualitative variable has k levels, k - 1 dummy variables are required, with each dummy variable being coded as 0 or 1.

For example, a variable with levels A, B, and C could be represented by x1 and x2 values of (0, 0) for A, (1, 0) for B, and (0, 1) for C.

Care must be taken in defining and interpreting the dummy variables.

Slide 50

More Complex Qualitative Variables

For example, a variable indicating level of education could be represented by x1 and x2 values as follows:

Highest Degree   x1   x2
Bachelor's        0    0
Master's          1    0
Ph.D.             0    1
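A minimal sketch of this coding scheme (the dictionary and variable names are illustrative):

```python
# k - 1 = 2 dummy variables for a 3-level qualitative variable,
# using the coding from the table above.
coding = {"Bachelor's": (0, 0), "Master's": (1, 0), "Ph.D.": (0, 1)}

observed = ["Master's", "Bachelor's", "Ph.D."]
x1x2 = [coding[d] for d in observed]
print(x1x2)  # → [(1, 0), (0, 0), (0, 1)]
```

Note that Bachelor's, the level coded (0, 0), serves as the baseline: each dummy coefficient measures the difference from that level.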

Slide 51

End of Chapter 15