
© 2005 Thomson/South-Western

Slides Prepared by John S. Loucks

St. Edward’s University


Chapter 15: Multiple Regression

Multiple Regression Model
Least Squares Method
Multiple Coefficient of Determination
Model Assumptions
Testing for Significance
Using the Estimated Regression Equation for Estimation and Prediction
Qualitative Independent Variables


The equation that describes how the dependent variable y is related to the independent variables x1, x2, . . . , xp and an error term is called the multiple regression model.

Multiple Regression Model

y = β0 + β1x1 + β2x2 + . . . + βpxp + ε

where: β0, β1, β2, . . . , βp are the parameters, and ε is a random variable called the error term


The equation that describes how the mean value of y is related to x1, x2, . . . xp is called the multiple regression equation.

Multiple Regression Equation

E(y) = β0 + β1x1 + β2x2 + . . . + βpxp


A simple random sample is used to compute sample statistics b0, b1, b2, . . . , bp that are used as the point estimators of the parameters β0, β1, β2, . . . , βp.

Estimated Multiple Regression Equation

The estimated multiple regression equation is:

ŷ = b0 + b1x1 + b2x2 + . . . + bpxp


Estimation Process

Multiple Regression Model: y = β0 + β1x1 + β2x2 + . . . + βpxp + ε

Multiple Regression Equation: E(y) = β0 + β1x1 + β2x2 + . . . + βpxp

Unknown parameters are β0, β1, β2, . . . , βp

Sample Data: x1, x2, . . . , xp, y

ŷ = b0 + b1x1 + b2x2 + . . . + bpxp

Estimated MultipleRegression Equation

Sample statistics are b0, b1, b2, . . . , bp

b0, b1, b2, . . . , bp provide estimates of β0, β1, β2, . . . , βp


Least Squares Method

Least Squares Criterion

min Σ(yi − ŷi)²

Computation of Coefficient Values

The formulas for the regression coefficients b0, b1, b2, . . . , bp involve the use of matrix algebra. We will rely on computer software packages to perform the calculations.
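In practice, software solves the least squares problem in one call. A minimal sketch, assuming NumPy is available (the data below are made-up for illustration, not the textbook's): the normal-equations solution b = (X'X)⁻¹X'y is exactly what `np.linalg.lstsq` computes.

```python
import numpy as np

# Made-up exact data for illustration: y = 1 + 2*x1 + 3*x2 with no error term.
x1 = np.array([0.0, 1.0, 0.0, 1.0, 2.0])
x2 = np.array([0.0, 0.0, 1.0, 1.0, 1.0])
y  = np.array([1.0, 3.0, 4.0, 6.0, 8.0])

# Design matrix with an intercept column; lstsq minimizes sum of (yi - yhat_i)^2.
X = np.column_stack([np.ones_like(x1), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.round(b, 6))  # recovers b0 = 1, b1 = 2, b2 = 3
```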


Multiple Regression Model

Example: Programmer Salary Survey

A software firm collected data for a sample of 20 computer programmers. A suggestion was made that regression analysis could be used to determine if salary was related to the years of experience and the score on the firm's programmer aptitude test.

The years of experience, score on the aptitude test, and corresponding annual salary ($1000s) for a sample of 20 programmers are shown on the next slide.


Exper.  Score  Salary    Exper.  Score  Salary
  4       78    24.0       9       88    38.0
  7      100    43.0       2       73    26.6
  1       86    23.7      10       75    36.2
  5       82    34.3       5       81    31.6
  8       86    35.8       6       74    29.0
 10       84    38.0       8       87    34.0
  0       75    22.2       4       79    30.1
  1       80    23.1       6       94    33.9
  6       83    30.0       3       70    28.2
  6       91    33.0       3       89    30.0

Multiple Regression Model
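The full fit can be reproduced from the 20 observations above; a sketch assuming NumPy (the reference coefficient values come from the Excel output shown later in the chapter):

```python
import numpy as np

# The 20 programmer observations (Exper., Score, Salary in $1000s).
exper  = [4, 7, 1, 5, 8, 10, 0, 1, 6, 6, 9, 2, 10, 5, 6, 8, 4, 6, 3, 3]
score  = [78, 100, 86, 82, 86, 84, 75, 80, 83, 91,
          88, 73, 75, 81, 74, 87, 79, 94, 70, 89]
salary = [24, 43, 23.7, 34.3, 35.8, 38, 22.2, 23.1, 30, 33,
          38, 26.6, 36.2, 31.6, 29, 34, 30.1, 33.9, 28.2, 30]

# Least squares fit with an intercept column.
X = np.column_stack([np.ones(20), exper, score])
b, *_ = np.linalg.lstsq(X, np.asarray(salary), rcond=None)

print(np.round(b, 5))  # approximately [3.17394, 1.40390, 0.25089]
```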


Multiple Regression Model

Suppose we believe that salary (y) is related to the years of experience (x1) and the score on the programmer aptitude test (x2) by the following regression model:

y = β0 + β1x1 + β2x2 + ε

where:
y = annual salary ($1000s)
x1 = years of experience
x2 = score on programmer aptitude test


Solving for the Estimates of β0, β1, β2

Input Data
x1    x2    y
 4    78   24
 7   100   43
 .     .    .
 3    89   30

      ↓

Computer Package for Solving Multiple Regression Problems

      ↓

Least Squares Output: b0, b1, b2, R2, etc.


Excel Worksheet (showing partial data entered)

     A           B                 C           D
1  Programmer  Experience (yrs)  Test Score  Salary ($K)
2      1             4               78         24.0
3      2             7              100         43.0
4      3             1               86         23.7
5      4             5               82         34.3
6      5             8               86         35.8
7      6            10               84         38.0
8      7             0               75         22.2
9      8             1               80         23.1

Note: Rows 10-21 are not shown.

Solving for the Estimates of β0, β1, β2


Excel’s Regression Dialog Box

Solving for the Estimates of β0, β1, β2


Excel’s Regression Equation Output

              Coeffic.   Std. Err.   t Stat   P-value
Intercept     3.17394    6.15607     0.5156   0.61279
Experience    1.4039     0.19857     7.0702   1.9E-06
Test Score    0.25089    0.07735     3.2433   0.00478

Note: Columns F-I are not shown.

Solving for the Estimates of β0, β1, β2


Estimated Regression Equation

SALARY = 3.174 + 1.404(EXPER) + 0.251(SCORE)

Note: Predicted salary will be in thousands of dollars.
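Plugging values into the estimated equation gives a point estimate. For instance, for a hypothetical programmer with 4 years of experience and an aptitude score of 78:

```python
# Coefficients from the estimated regression equation on the slide.
b0, b1, b2 = 3.174, 1.404, 0.251
exper, score = 4, 78   # hypothetical inputs for illustration

salary_k = b0 + b1 * exper + b2 * score  # predicted salary in $1000s
print(round(salary_k, 3))  # 28.368, i.e. about $28,368
```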


Interpreting the Coefficients

In multiple regression analysis, we interpret each regression coefficient as follows:

bi represents an estimate of the change in y corresponding to a 1-unit increase in xi when all other independent variables are held constant.


b1 = 1.404

Salary is expected to increase by $1,404 for each additional year of experience (when the variable score on programmer aptitude test is held constant).

Interpreting the Coefficients


b2 = 0.251

Salary is expected to increase by $251 for each additional point scored on the programmer aptitude test (when the variable years of experience is held constant).

Interpreting the Coefficients


Multiple Coefficient of Determination

Relationship Among SST, SSR, SSE

where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error

SST = SSR + SSE

Σ(yi − ȳ)² = Σ(ŷi − ȳ)² + Σ(yi − ŷi)²
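The identity can be checked numerically with the sums of squares from the Excel ANOVA output shown on the next slide:

```python
# SST = SSR + SSE, using the sums of squares from the Excel ANOVA output.
ssr, sse, sst = 500.3285, 99.45697, 599.7855

print(round(ssr + sse, 4))  # 599.7855 (within rounding)
```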


Excel’s ANOVA Output

ANOVA
             df    SS          MS         F          Significance F
Regression    2   500.3285    250.1643   42.76013    2.32774E-07
Residual     17    99.45697     5.85041
Total        19   599.7855

SSR = 500.3285; SST = 599.7855

Multiple Coefficient of Determination


Multiple Coefficient of Determination

R2 = SSR/SST

R2 = 500.3285/599.7855 = .83418


Adjusted Multiple Coefficient of Determination

Ra2 = 1 − (1 − R2)(n − 1)/(n − p − 1)

Ra2 = 1 − (1 − .834179)(20 − 1)/(20 − 2 − 1) = .814671
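Both statistics can be reproduced from the ANOVA sums of squares; a quick check:

```python
# Reproducing R2 and adjusted R2 from the ANOVA sums of squares.
ssr, sst = 500.3285, 599.7855
n, p = 20, 2                      # 20 observations, 2 independent variables

r2 = ssr / sst
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(round(r2, 6), round(r2_adj, 6))  # 0.834179 0.814671
```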


Excel’s Regression Statistics

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.913334059
R Square            0.834179103
Adjusted R Square   0.814670762
Standard Error      2.418762076
Observations        20

Adjusted Multiple Coefficient of Determination


Assumptions About the Error Term ε

The error ε is a random variable with mean of zero.

The variance of ε, denoted by σ², is the same for all values of the independent variables.

The values of ε are independent.

The error ε is a normally distributed random variable reflecting the deviation between the y value and the expected value of y given by β0 + β1x1 + β2x2 + . . . + βpxp.


Testing for Significance

In simple linear regression, the F and t tests provide the same conclusion.

In multiple regression, the F and t tests have different purposes.


Testing for Significance: F Test

The F test is used to determine whether a significant relationship exists between the dependent variable and the set of all the independent variables.

The F test is referred to as the test for overall significance.


Testing for Significance: t Test

If the F test shows an overall significance, the t test is used to determine whether each of the individual independent variables is significant.

A separate t test is conducted for each of the independent variables in the model.

We refer to each of these t tests as a test for individual significance.


Testing for Significance: F Test

Hypotheses
H0: β1 = β2 = . . . = βp = 0
Ha: One or more of the parameters is not equal to zero.

Test Statistic
F = MSR/MSE

Rejection Rule
Reject H0 if p-value < α or if F > Fα, where Fα is based on an F distribution with p d.f. in the numerator and n − p − 1 d.f. in the denominator.


F Test for Overall Significance

Hypotheses
H0: β1 = β2 = 0
Ha: One or both of the parameters is not equal to zero.

Rejection Rule
For α = .05 and d.f. = 2, 17: F.05 = 3.59
Reject H0 if p-value < .05 or F > 3.59


Excel’s ANOVA Output

ANOVA
             df    SS          MS         F          Significance F
Regression    2   500.3285    250.1643   42.76013    2.32774E-07
Residual     17    99.45697     5.85041
Total        19   599.7855

F Test for Overall Significance

p-value used to test for overall significance


F Test for Overall Significance

Test Statistic
F = MSR/MSE = 250.16/5.85 = 42.76

Conclusion
p-value < .05, so we can reject H0. (Also, F = 42.76 > 3.59.)
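The F statistic follows directly from the ANOVA table (MSR = SSR/p, MSE = SSE/(n − p − 1)); a quick check:

```python
# F statistic from the ANOVA table: MSR = SSR/p, MSE = SSE/(n - p - 1).
msr = 500.3285 / 2
mse = 99.45697 / 17

f = msr / mse
print(round(f, 2))  # 42.76, well above the critical value F.05 = 3.59
```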


Testing for Significance: t Test

Hypotheses
H0: βi = 0
Ha: βi ≠ 0

Test Statistic
t = bi/sbi

Rejection Rule
Reject H0 if p-value < α or if t < −tα/2 or t > tα/2, where tα/2 is based on a t distribution with n − p − 1 degrees of freedom.


t Test for Significance of Individual Parameters

Hypotheses
H0: βi = 0
Ha: βi ≠ 0

Rejection Rule
For α = .05 and d.f. = 17, t.025 = 2.11
Reject H0 if p-value < .05 or if |t| > 2.11


Excel’s Regression Equation Output

              Coeffic.   Std. Err.   t Stat   P-value
Intercept     3.17394    6.15607     0.5156   0.61279
Experience    1.4039     0.19857     7.0702   1.9E-06
Test Score    0.25089    0.07735     3.2433   0.00478

Note: Columns F-I are not shown.

t Test for Significance of Individual Parameters

t statistic and p-value used to test for the individual significance of “Experience”


Excel’s Regression Equation Output

              Coeffic.   Std. Err.   t Stat   P-value
Intercept     3.17394    6.15607     0.5156   0.61279
Experience    1.4039     0.19857     7.0702   1.9E-06
Test Score    0.25089    0.07735     3.2433   0.00478

Note: Columns F-I are not shown.

t Test for Significance of Individual Parameters

t statistic and p-value used to test for the individual significance of “Test Score”


t Test for Significance of Individual Parameters

Test Statistics

t = b1/sb1 = 1.4039/.1986 = 7.07

t = b2/sb2 = .25089/.07735 = 3.24

Conclusions
Reject both H0: β1 = 0 and H0: β2 = 0. Both independent variables are significant.
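Each t statistic is just the coefficient estimate divided by its standard error from the Excel output; a quick check:

```python
# Each t statistic is the coefficient estimate divided by its standard error.
t_exper = 1.4039 / 0.19857   # Experience
t_score = 0.25089 / 0.07735  # Test Score

print(round(t_exper, 2), round(t_score, 2))  # 7.07 3.24, both exceed 2.11
```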


Testing for Significance: Multicollinearity

The term multicollinearity refers to the correlation among the independent variables.

When the independent variables are highly correlated (say, |r| > .7), it is not possible to determine the separate effect of any particular independent variable on the dependent variable.


Testing for Significance: Multicollinearity

Every attempt should be made to avoid including independent variables that are highly correlated.

If the estimated regression equation is to be used only for predictive purposes, multicollinearity is usually not a serious problem.


Using the Estimated Regression Equation for Estimation and Prediction

The procedures for estimating the mean value of y and predicting an individual value of y in multiple regression are similar to those in simple regression.

We substitute the given values of x1, x2, . . . , xp into the estimated regression equation and use the corresponding value of ŷ as the point estimate.


Using the Estimated Regression Equation for Estimation and Prediction

The formulas required to develop interval estimates for the mean value of y and for an individual value of y are beyond the scope of the textbook.

Software packages for multiple regression will often provide these interval estimates.


Qualitative Independent Variables

In many situations we must work with qualitative independent variables such as gender (male, female), method of payment (cash, check, credit card), etc.

For example, x2 might represent gender where x2 = 0 indicates male and x2 = 1 indicates female.

In this case, x2 is called a dummy or indicator variable.


Qualitative Independent Variables

Example: Programmer Salary Survey

As an extension of the problem involving the computer programmer salary survey, suppose that management also believes that the annual salary is related to whether the individual has a graduate degree in computer science or information systems.

The years of experience, the score on the programmer aptitude test, whether the individual has a relevant graduate degree, and the annual salary ($1000s) for each of the sampled 20 programmers are shown on the next slide.


Exper.  Score  Degr.  Salary    Exper.  Score  Degr.  Salary
  4       78    No     24.0       9       88    Yes    38.0
  7      100    Yes    43.0       2       73    No     26.6
  1       86    No     23.7      10       75    Yes    36.2
  5       82    Yes    34.3       5       81    No     31.6
  8       86    Yes    35.8       6       74    No     29.0
 10       84    Yes    38.0       8       87    Yes    34.0
  0       75    No     22.2       4       79    No     30.1
  1       80    No     23.1       6       94    Yes    33.9
  6       83    No     30.0       3       70    No     28.2
  6       91    Yes    33.0       3       89    No     30.0

Qualitative Independent Variables


Estimated Regression Equation

ŷ = b0 + b1x1 + b2x2 + b3x3

where:
ŷ = annual salary ($1000s)
x1 = years of experience
x2 = score on programmer aptitude test
x3 = 0 if individual does not have a graduate degree, 1 if individual does have a graduate degree

x3 is a dummy variable.


Excel’s Regression Statistics

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.920215239
R Square            0.846796085
Adjusted R Square   0.818070351
Standard Error      2.396475101
Observations        20

Qualitative Independent Variables


Excel’s ANOVA Output

ANOVA
             df    SS          MS         F          Significance F
Regression    3   507.896     169.2987   29.47866    9.41675E-07
Residual     16    91.88949     5.743093
Total        19   599.7855

Qualitative Independent Variables


Excel’s Regression Equation Output

              Coeffic.   Std. Err.   t Stat   P-value
Intercept     7.94485    7.3808      1.0764   0.2977
Experience    1.14758    0.2976      3.8561   0.0014
Test Score    0.19694    0.0899      2.1905   0.04364
Grad. Degr.   2.28042    1.98661     1.1479   0.26789

Note: Columns F-I are not shown.

Grad. Degr. is not significant (p-value = .26789 > .05).

Qualitative Independent Variables
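Holding experience and test score fixed, the fitted equation attributes a difference of b3 to the degree (though, as noted, this coefficient is not statistically significant). A sketch with hypothetical inputs:

```python
# Hypothetical comparison: two programmers with 5 years of experience and a
# test score of 80 who differ only in holding a graduate degree (x3).
b0, b1, b2, b3 = 7.94485, 1.14758, 0.19694, 2.28042
exper, score = 5, 80

without_degree = b0 + b1 * exper + b2 * score   # x3 = 0
with_degree    = without_degree + b3            # x3 = 1

print(round(with_degree - without_degree, 5))  # 2.28042, i.e. about $2,280
```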


              Coeffic.   Low. 95%    Up. 95%
Intercept     7.94485    -7.70174    23.59144
Experience    1.14758     0.51669     1.77847
Test Score    0.19694     0.00635     0.38752
Grad. Degr.   2.28042    -1.93100     6.49185

Excel’s Regression Equation Output

Qualitative Independent Variables


More Complex Qualitative Variables

If a qualitative variable has k levels, k − 1 dummy variables are required, with each dummy variable being coded as 0 or 1.

For example, a variable with levels A, B, and C could be represented by x1 and x2 values of (0, 0) for A, (1, 0) for B, and (0, 1) for C.

Care must be taken in defining and interpreting the dummy variables.


More Complex Qualitative Variables

For example, a variable indicating level of education could be represented by x1 and x2 values as follows:

Highest Degree   x1   x2
Bachelor's        0    0
Master's          1    0
Ph.D.             0    1
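This k − 1 coding is easy to express as a lookup; a minimal sketch mirroring the table above:

```python
# k - 1 dummy coding for a 3-level qualitative variable (k = 3 -> 2 dummies),
# mirroring the Highest Degree table above.
coding = {
    "Bachelor's": (0, 0),
    "Master's":   (1, 0),
    "Ph.D.":      (0, 1),
}

x1, x2 = coding["Master's"]
print(x1, x2)  # 1 0
```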


End of Chapter 15
