Slide 1 © 2005 Thomson/South-Western
Slides Prepared by JOHN S. LOUCKS
St. Edward’s University
Slide 2
Chapter 15: Multiple Regression
Multiple Regression Model
Least Squares Method
Multiple Coefficient of Determination
Model Assumptions
Testing for Significance
Using the Estimated Regression Equation for Estimation and Prediction
Qualitative Independent Variables
Slide 3
The equation that describes how the dependent variable y is related to the independent variables x1, x2, . . . , xp and an error term is called the multiple regression model.
Multiple Regression Model
y = β0 + β1x1 + β2x2 + . . . + βpxp + ε
where: β0, β1, β2, . . . , βp are the parameters, and
ε is a random variable called the error term
Slide 4
The equation that describes how the mean value of y is related to x1, x2, . . . xp is called the multiple regression equation.
Multiple Regression Equation
E(y) = β0 + β1x1 + β2x2 + . . . + βpxp
Slide 5
A simple random sample is used to compute sample statistics b0, b1, b2, . . . , bp that are used as the point estimators of the parameters β0, β1, β2, . . . , βp.
Estimated Multiple Regression Equation
The estimated multiple regression equation is:

ŷ = b0 + b1x1 + b2x2 + . . . + bpxp
Slide 6
Estimation Process
Multiple Regression Model: y = β0 + β1x1 + β2x2 + . . . + βpxp + ε
Multiple Regression Equation: E(y) = β0 + β1x1 + β2x2 + . . . + βpxp
Unknown parameters areβ0, β1, β2, . . . , βp
Sample Data: observations on x1, x2, . . . , xp and y
ŷ = b0 + b1x1 + b2x2 + . . . + bpxp
Estimated MultipleRegression Equation
Sample statistics areb0, b1, b2, . . . , bp
b0, b1, b2, . . . , bp provide estimates of β0, β1, β2, . . . , βp
Slide 7
Least Squares Method
Least Squares Criterion
min Σ(yi − ŷi)²
Computation of Coefficient Values
The formulas for the regression coefficients b0, b1, b2, . . . , bp involve the use of matrix algebra. We will rely on computer software packages to perform the calculations.
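The matrix-algebra solution alluded to here is b = (XᵀX)⁻¹Xᵀy. A minimal sketch under hypothetical data (NumPy's least-squares solver stands in for the textbook's software package; the data are constructed so the true coefficients are known):

```python
import numpy as np

# Hypothetical data generated exactly from y = 1 + 2*x1 + 3*x2,
# so the recovered coefficients are known in advance.
X = np.array([[1.0, 0.0, 0.0],   # leading column of 1s gives the intercept b0
              [1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
y = np.array([1.0, 3.0, 4.0, 6.0])

# Least-squares solution of min sum (y_i - yhat_i)^2;
# equivalent to b = (X'X)^(-1) X'y when X'X is invertible.
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(b, 6))  # b0, b1, b2
```

In practice the package handles this internally; the point is only that the coefficients solve the least squares criterion above.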
Slide 8
Multiple Regression Model

Example: Programmer Salary Survey

A software firm collected data for a sample of 20 computer programmers. A suggestion was made that regression analysis could be used to determine if salary was related to the years of experience and the score on the firm's programmer aptitude test.

The years of experience, score on the aptitude test, and corresponding annual salary ($1000s) for a sample of 20 programmers are shown on the next slide.
Slide 9
Exper.  Score  Salary      Exper.  Score  Salary
4       78     24          9       88     38
7       100    43          2       73     26.6
1       86     23.7        10      75     36.2
5       82     34.3        5       81     31.6
8       86     35.8        6       74     29
10      84     38          8       87     34
0       75     22.2        4       79     30.1
1       80     23.1        6       94     33.9
6       83     30          3       70     28.2
6       91     33          3       89     30
Multiple Regression Model
Slide 10
Suppose we believe that salary (y) is related to the years of experience (x1) and the score on the programmer aptitude test (x2) by the following regression model:
Multiple Regression Model
y = β0 + β1x1 + β2x2 + ε

where:
y = annual salary ($1000s)
x1 = years of experience
x2 = score on programmer aptitude test
Slide 11
Solving for the Estimates of β0, β1, β2
Input Data:
x1   x2   y
4    78   24
7    100  43
.    .    .
3    89   30

Computer Package for Solving Multiple Regression Problems

Least Squares Output:
b0 = ..., b1 = ..., b2 = ..., R2 = ..., etc.
Slide 12
Excel Worksheet (showing partial data entered)
Programmer  Experience (yrs)  Test Score  Salary ($K)
1           4                 78          24.0
2           7                 100         43.0
3           1                 86          23.7
4           5                 82          34.3
5           8                 86          35.8
6           10                84          38.0
7           0                 75          22.2
8           1                 80          23.1
Note: Rows 10-21 are not shown.
Solving for the Estimates of β0, β1, β2
Slide 13
Excel’s Regression Dialog Box
Solving for the Estimates of β0, β1, β2
Slide 14
Excel’s Regression Equation Output
            Coeffic.  Std. Err.  t Stat  P-value
Intercept   3.17394   6.15607    0.5156  0.61279
Experience  1.4039    0.19857    7.0702  1.9E-06
Test Score  0.25089   0.07735    3.2433  0.00478
Note: Columns F-I are not shown.
Solving for the Estimates of β0, β1, β2
Slide 15
Estimated Regression Equation
SALARY = 3.174 + 1.404(EXPER) + 0.251(SCORE)
Note: Predicted salary will be in thousands of dollars.
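As a check on the Excel output, the same coefficients can be recovered with an ordinary least-squares fit. A sketch (sample data as transcribed from the earlier slide; NumPy in place of Excel):

```python
import numpy as np

# Sample data transcribed from the slides (x1 = experience,
# x2 = aptitude-test score, y = salary in $1000s).
exper = [4, 7, 1, 5, 8, 10, 0, 1, 6, 6, 9, 2, 10, 5, 6, 8, 4, 6, 3, 3]
score = [78, 100, 86, 82, 86, 84, 75, 80, 83, 91,
         88, 73, 75, 81, 74, 87, 79, 94, 70, 89]
salary = [24, 43, 23.7, 34.3, 35.8, 38, 22.2, 23.1, 30, 33,
          38, 26.6, 36.2, 31.6, 29, 34, 30.1, 33.9, 28.2, 30]

# Design matrix with a leading column of 1s for the intercept.
X = np.column_stack([np.ones(len(exper)), exper, score])
b, *_ = np.linalg.lstsq(X, np.array(salary), rcond=None)
b0, b1, b2 = b
print(round(b0, 3), round(b1, 3), round(b2, 3))  # ≈ 3.174, 1.404, 0.251

# Point estimate for a programmer with 4 years of experience and a score of 78.
pred = b0 + b1 * 4 + b2 * 78
```

The point estimate is in thousands of dollars, matching the note above.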
Slide 16
Interpreting the Coefficients
In multiple regression analysis, we interpret each regression coefficient as follows:

bi represents an estimate of the change in y corresponding to a 1-unit increase in xi when all other independent variables are held constant.
Slide 17
Salary is expected to increase by $1,404 for each additional year of experience (when the score on the programmer aptitude test is held constant).
b1 = 1.404
Interpreting the Coefficients
Slide 18
Salary is expected to increase by $251 for each additional point scored on the programmer aptitude test (when the years of experience is held constant).
b2 = 0.251
Interpreting the Coefficients
Slide 19
Multiple Coefficient of Determination
Relationship Among SST, SSR, SSE
where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error
SST = SSR + SSE
Σ(yi − ȳ)² = Σ(ŷi − ȳ)² + Σ(yi − ŷi)²
Slide 20
Excel’s ANOVA Output
ANOVA
            df   SS         MS        F         Significance F
Regression  2    500.3285   250.1643  42.76013  2.32774E-07
Residual    17   99.45697   5.85041
Total       19   599.7855
SSR = 500.3285, SST = 599.7855
Multiple Coefficient of Determination
Slide 21
Multiple Coefficient of Determination
R2 = SSR/SST

R2 = 500.3285/599.7855 = .83418
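The computation above can be sketched directly from the ANOVA sums of squares (values taken from the earlier output):

```python
# Sums of squares from the ANOVA output
SSR = 500.3285
SSE = 99.45697
SST = 599.7855

# SST = SSR + SSE (up to rounding in the printed output)
assert abs((SSR + SSE) - SST) < 1e-3

R2 = SSR / SST
print(round(R2, 5))  # 0.83418
```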
Slide 22
Adjusted Multiple Coefficient of Determination
Ra2 = 1 − (1 − R2)(n − 1)/(n − p − 1)

Ra2 = 1 − (1 − .834179)(20 − 1)/(20 − 2 − 1) = .814671
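The adjusted R² arithmetic works out as follows (a quick check of the figure above):

```python
# Adjusted R^2 compensates for the number of independent variables p:
# Ra2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)
R2 = 0.834179
n, p = 20, 2
Ra2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)
print(round(Ra2, 6))  # 0.814671
```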
Slide 23
Excel’s Regression Statistics
Regression Statistics
Multiple R          0.913334059
R Square            0.834179103
Adjusted R Square   0.814670762
Standard Error      2.418762076
Observations        20
Adjusted Multiple Coefficient of Determination
Slide 24
The variance of ε, denoted by σ², is the same for all values of the independent variables.
The error ε is a normally distributed random variable reflecting the deviation between the y value and the expected value of y given by β0 + β1x1 + β2x2 + . . . + βpxp.
Assumptions About the Error Term ε
The error ε is a random variable with mean of zero.
The values of ε are independent.
Slide 25
In simple linear regression, the F and t tests provide the same conclusion.
Testing for Significance
In multiple regression, the F and t tests have different purposes.
Slide 26
Testing for Significance: F Test
The F test is referred to as the test for overall significance.

The F test is used to determine whether a significant relationship exists between the dependent variable and the set of all the independent variables.
Slide 27
A separate t test is conducted for each of the independent variables in the model.

If the F test shows an overall significance, the t test is used to determine whether each of the individual independent variables is significant.

Testing for Significance: t Test

We refer to each of these t tests as a test for individual significance.
Slide 28
Testing for Significance: F Test
Hypotheses:
H0: β1 = β2 = . . . = βp = 0
Ha: One or more of the parameters is not equal to zero.

Test Statistic:
F = MSR/MSE

Rejection Rule:
Reject H0 if p-value < α or if F > Fα, where Fα is based on an F distribution with p d.f. in the numerator and n - p - 1 d.f. in the denominator.
Slide 29
F Test for Overall Significance
Hypotheses:
H0: β1 = β2 = 0
Ha: One or both of the parameters is not equal to zero.

Rejection Rule:
For α = .05 and d.f. = 2, 17: F.05 = 3.59
Reject H0 if p-value < .05 or F > 3.59
Slide 30
Excel’s ANOVA Output
ANOVA
            df   SS         MS        F         Significance F
Regression  2    500.3285   250.1643  42.76013  2.32774E-07
Residual    17   99.45697   5.85041
Total       19   599.7855
F Test for Overall Significance
p-value used to test for overall significance
Slide 31
F Test for Overall Significance
Test Statistic:
F = MSR/MSE = 250.16/5.85 = 42.76

Conclusion:
p-value < .05, so we can reject H0. (Also, F = 42.76 > 3.59.)
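The F computation and the critical-value comparison above can be sketched as (values taken from the ANOVA output and the slide's table lookup):

```python
# Overall F test from the ANOVA output
MSR = 250.1643        # = SSR / p = 500.3285 / 2
MSE = 5.85041         # = SSE / (n - p - 1) = 99.45697 / 17
F = MSR / MSE
print(round(F, 2))    # 42.76

F_crit = 3.59         # F.05 with (2, 17) d.f., from the slide's table lookup
assert F > F_crit     # reject H0: beta1 = beta2 = 0
```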
Slide 32
Testing for Significance: t Test
Hypotheses:
H0: βi = 0
Ha: βi ≠ 0

Test Statistic:
t = bi/sbi

Rejection Rule:
Reject H0 if p-value < α or if t < -tα/2 or t > tα/2, where tα/2 is based on a t distribution with n - p - 1 degrees of freedom.
Slide 33
t Test for Significance of Individual Parameters
Hypotheses:
H0: βi = 0
Ha: βi ≠ 0

Rejection Rule:
For α = .05 and d.f. = 17, t.025 = 2.11.
Reject H0 if p-value < .05 or if |t| > 2.11.
Slide 34
Excel’s Regression Equation Output
            Coeffic.  Std. Err.  t Stat  P-value
Intercept   3.17394   6.15607    0.5156  0.61279
Experience  1.4039    0.19857    7.0702  1.9E-06
Test Score  0.25089   0.07735    3.2433  0.00478
Note: Columns F-I are not shown.
t Test for Significance of Individual Parameters
t statistic and p-value used to test for the individual significance of “Experience”
Slide 35
Excel’s Regression Equation Output
            Coeffic.  Std. Err.  t Stat  P-value
Intercept   3.17394   6.15607    0.5156  0.61279
Experience  1.4039    0.19857    7.0702  1.9E-06
Test Score  0.25089   0.07735    3.2433  0.00478
Note: Columns F-I are not shown.
t Test for Significance of Individual Parameters
t statistic and p-value used to test for the individual significance of “Test Score”
Slide 36
t Test for Significance of Individual Parameters
Test Statistics:
t1 = b1/sb1 = 1.4039/.19857 = 7.07
t2 = b2/sb2 = .25089/.07735 = 3.24
Conclusions: Reject both H0: β1 = 0 and H0: β2 = 0. Both independent variables are significant.
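The two t statistics above are just the coefficient estimates divided by their standard errors, compared with the critical value from the earlier slide:

```python
# Individual t tests: t = b_i / s_bi, compared with t.025 = 2.11 (17 d.f.)
t1 = 1.4039 / 0.19857     # Experience
t2 = 0.25089 / 0.07735    # Test Score
t_crit = 2.11

print(round(t1, 2), round(t2, 2))  # 7.07 3.24
assert abs(t1) > t_crit and abs(t2) > t_crit  # both significant
```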
Slide 37
Testing for Significance: Multicollinearity
The term multicollinearity refers to the correlation among the independent variables.

When the independent variables are highly correlated (say, |r| > .7), it is not possible to determine the separate effect of any particular independent variable on the dependent variable.
Slide 38
Testing for Significance: Multicollinearity
Every attempt should be made to avoid including independent variables that are highly correlated.

If the estimated regression equation is to be used only for predictive purposes, multicollinearity is usually not a serious problem.
Slide 39
Using the Estimated Regression Equation for Estimation and Prediction
The procedures for estimating the mean value of y and predicting an individual value of y in multiple regression are similar to those in simple regression.

We substitute the given values of x1, x2, . . . , xp into the estimated regression equation and use the corresponding value of ŷ as the point estimate.
Slide 40
Using the Estimated Regression Equation for Estimation and Prediction
Software packages for multiple regression will often provide these interval estimates.

The formulas required to develop interval estimates for the mean value of y and for an individual value of y are beyond the scope of the textbook.
Slide 41
In many situations we must work with qualitative independent variables such as gender (male, female), method of payment (cash, check, credit card), etc.

For example, x2 might represent gender, where x2 = 0 indicates male and x2 = 1 indicates female.
Qualitative Independent Variables
In this case, x2 is called a dummy or indicator variable.
Slide 42
As an extension of the problem involving the computer programmer salary survey, suppose that management also believes that the annual salary is related to whether the individual has a graduate degree in computer science or information systems.

The years of experience, the score on the programmer aptitude test, whether the individual has a relevant graduate degree, and the annual salary ($1000s) for each of the sampled 20 programmers are shown on the next slide.
Qualitative Independent Variables
Example: Programmer Salary Survey
Slide 43
Exper.  Score  Degr.  Salary      Exper.  Score  Degr.  Salary
4       78     No     24          9       88     Yes    38
7       100    Yes    43          2       73     No     26.6
1       86     No     23.7        10      75     Yes    36.2
5       82     Yes    34.3        5       81     No     31.6
8       86     Yes    35.8        6       74     No     29
10      84     Yes    38          8       87     Yes    34
0       75     No     22.2        4       79     No     30.1
1       80     No     23.1        6       94     Yes    33.9
6       83     No     30          3       70     No     28.2
6       91     Yes    33          3       89     No     30
Qualitative Independent Variables
Slide 44
Estimated Regression Equation
ŷ = b0 + b1x1 + b2x2 + b3x3

where:
y = annual salary ($1000s)
x1 = years of experience
x2 = score on programmer aptitude test
x3 = 0 if individual does not have a graduate degree, 1 if individual does have a graduate degree
x3 is a dummy variable
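Coding the dummy variable is mechanical. A minimal sketch (illustrative Yes/No values for the first few programmers from the data slide):

```python
# A dummy (indicator) variable codes a two-level qualitative variable as 0/1.
# Illustrative values for the first few programmers (graduate degree: No/Yes).
degree = ["No", "Yes", "No", "Yes", "Yes"]

x3 = [1 if d == "Yes" else 0 for d in degree]
print(x3)  # [0, 1, 0, 1, 1]
```

The resulting 0/1 column enters the regression exactly like any other independent variable.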
Slide 45
Excel’s Regression Statistics
Regression Statistics
Multiple R          0.920215239
R Square            0.846796085
Adjusted R Square   0.818070351
Standard Error      2.396475101
Observations        20
Qualitative Independent Variables
Slide 46
Excel’s ANOVA Output
ANOVA
            df   SS        MS        F         Significance F
Regression  3    507.896   169.2987  29.47866  9.41675E-07
Residual    16   91.88949  5.743093
Total       19   599.7855
Qualitative Independent Variables
Slide 47
Excel’s Regression Equation Output
             Coeffic.  Std. Err.  t Stat  P-value
Intercept    7.94485   7.3808     1.0764  0.2977
Experience   1.14758   0.2976     3.8561  0.0014
Test Score   0.19694   0.0899     2.1905  0.04364
Grad. Degr.  2.28042   1.98661    1.1479  0.26789
Note: Columns F-I are not shown.
Note: Grad. Degr. is not significant (p-value = .26789 > .05).
Qualitative Independent Variables
Slide 48
Note: Columns C-E are hidden.
             Coeffic.  Low. 95%    Up. 95%
Intercept    7.94485   -7.701739   23.5914
Experience   1.14758   0.516695    1.77847
Test Score   0.19694   0.00635     0.38752
Grad. Degr.  2.28042   -1.931002   6.49185
Excel’s Regression Equation Output
Qualitative Independent Variables
Slide 49
More Complex Qualitative Variables
If a qualitative variable has k levels, k - 1 dummy variables are required, with each dummy variable being coded as 0 or 1.

For example, a variable with levels A, B, and C could be represented by x1 and x2 values of (0, 0) for A, (1, 0) for B, and (0, 1) for C.
Care must be taken in defining and interpreting thedummy variables.
Slide 50
For example, a variable indicating level of education could be represented by x1 and x2 values as follows:
More Complex Qualitative Variables
Highest Degree   x1   x2
Bachelor's       0    0
Master's         1    0
Ph.D.            0    1
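The coding table can be sketched as a simple mapping (the three-row sample below is hypothetical, for illustration only):

```python
# k = 3 education levels require k - 1 = 2 dummy variables (x1, x2).
coding = {"Bachelor's": (0, 0), "Master's": (1, 0), "Ph.D.": (0, 1)}

# Hypothetical sample of highest degrees, mapped to dummy-variable pairs.
sample = ["Master's", "Bachelor's", "Ph.D."]
dummies = [coding[level] for level in sample]
print(dummies)  # [(1, 0), (0, 0), (0, 1)]
```

Note that Bachelor's is the baseline level (both dummies 0), so each coefficient is interpreted relative to it.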
Slide 51
End of Chapter 15