jacob-little
Multiple Regression
Multiple Regression Model
Least Squares Method
Multiple Coefficient of Determination
Model Assumptions
Testing for Significance
Using the Estimated Regression Equation for Estimation and Prediction
Qualitative Independent Variables
Residual Analysis
Multiple Regression
In this chapter we continue our study of regression analysis by considering situations involving two or more independent variables.
This subject area, called multiple regression analysis, enables us to consider more factors and thus obtain better estimates than are possible with simple linear regression.
Multiple Regression Model
The equation that describes how the dependent variable \(y\) is related to the independent variables \(x_1, x_2, \dots, x_p\) and an error term is:
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon \]
where:
\(\beta_0, \beta_1, \beta_2, \dots, \beta_p\) are the parameters,
\(\varepsilon\) is a random variable called the error term,
\(p\) is the number of independent variables.
Multiple Regression Equation
The equation that describes how the mean value of \(y\) is related to \(x_1, x_2, \dots, x_p\) is called the multiple regression equation.
\[ E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p \]
Estimated Multiple Regression Equation
A simple random sample is used to compute sample statistics \(b_0, b_1, b_2, \dots, b_p\) that are used as the point estimators of the parameters \(\beta_0, \beta_1, \beta_2, \dots, \beta_p\).
The estimated multiple regression equation is:
\[ \hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_p x_p \]
Estimation Process
Multiple Regression Model: \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon\)
Multiple Regression Equation: \(E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p\)
Unknown parameters are \(\beta_0, \beta_1, \beta_2, \dots, \beta_p\)
Sample Data: observed values of \(x_1, x_2, \dots, x_p\) and \(y\)
Estimated Multiple Regression Equation: \(\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_p x_p\)
Sample statistics are \(b_0, b_1, b_2, \dots, b_p\)
\(b_0, b_1, b_2, \dots, b_p\) provide estimates of \(\beta_0, \beta_1, \beta_2, \dots, \beta_p\)
Least Squares Method
Least Squares Criterion
\[ \min \sum (y_i - \hat{y}_i)^2 \]
where:
\(y_i\) = observed value of the dependent variable for the ith observation
\(\hat{y}_i\) = estimated value of the dependent variable for the ith observation
Example: Programmer Salary Survey
Multiple Regression Model
A software firm collected data for a sample of 20 computer programmers. A suggestion was made that regression analysis could be used to determine if salary was related to the years of experience and the score on the firm’s programmer aptitude test.
The years of experience, score on the aptitude test, and corresponding annual salary ($1000s) for a sample of 20 programmers are shown on the next slide.
Multiple Regression Model
Exper.  Score  Salary      Exper.  Score  Salary
4       78     24          9       88     38
7       100    43          2       73     26.6
1       86     23.7        10      75     36.2
5       82     34.3        5       81     31.6
8       86     35.8        6       74     29
10      84     38          8       87     34
0       75     22.2        4       79     30.1
1       80     23.1        6       94     33.9
6       83     30          3       70     28.2
6       91     33          3       89     30
Multiple Regression Model
Suppose we believe that salary (\(y\)) is related to the years of experience (\(x_1\)) and the score on the programmer aptitude test (\(x_2\)) by the following regression model:
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon \]
where
\(y\) = annual salary ($1000)
\(x_1\) = years of experience
\(x_2\) = score on programmer aptitude test
Solving for the Estimates of \(\beta_0\), \(\beta_1\), \(\beta_2\)
Input Data:
x1   x2    y
4    78    24
7    100   43
.    .     .
.    .     .
3    89    30
Computer Package for Solving Multiple Regression Problems
Least Squares Output:
b0 =
b1 =
b2 =
R2 =
etc.
Download ProgramerSalary.xlsx
Solving for the Estimates of \(\beta_0\), \(\beta_1\), \(\beta_2\)
Excel Worksheet (showing partial data)
     A            B                  C            D
1    Programmer   Experience (yrs)   Test Score   Salary ($K)
2    1            4                  78           24.0
3    2            7                  100          43.0
4    3            1                  86           23.7
5    4            5                  82           34.3
6    5            8                  86           35.8
7    6            10                 84           38.0
8    7            0                  75           22.2
9    8            1                  80           23.1
Note: Rows 10-21 are not shown.
Using Excel’s Regression Tool
Performing the Regression Analysis
Step 1 Click the Data tab on the Ribbon
Step 2 In the Analysis group, click Data Analysis
Step 3 Choose Regression from the list of Analysis Tools
Solving for the Estimates of \(\beta_0\), \(\beta_1\), \(\beta_2\)
Excel’s Regression Dialog Box
Solving for the Estimates of \(\beta_0\), \(\beta_1\), \(\beta_2\)
Excel’s Regression Equation Output
     A            B          C           D        E
39                Coeffic.   Std. Err.   t Stat   P-value
40   Intercept    3.17394    6.15607     0.5156   0.61279
41   Experience   1.4039     0.19857     7.0702   1.9E-06
42   Test Score   0.25089    0.07735     3.2433   0.00478
Note: Columns F-I are not shown.
Estimated Regression Equation
SALARY = 3.174 + 1.404(EXPER) + 0.251(SCORE)
Note: Predicted salary will be in thousands of dollars.
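The least squares fit above can be sketched without Excel. The following is a minimal illustration, assuming the 20 observations listed on the data slide and solving the normal equations \((X'X)b = X'y\) in plain Python. The helper names `solve_linear_system` and `fit_least_squares` are hypothetical and are not part of the slides or of Excel.

```python
# Programmer salary data from the slides:
# (experience in years, aptitude test score, salary in $1000s)
DATA = [
    (4, 78, 24.0), (7, 100, 43.0), (1, 86, 23.7), (5, 82, 34.3), (8, 86, 35.8),
    (10, 84, 38.0), (0, 75, 22.2), (1, 80, 23.1), (6, 83, 30.0), (6, 91, 33.0),
    (9, 88, 38.0), (2, 73, 26.6), (10, 75, 36.2), (5, 81, 31.6), (6, 74, 29.0),
    (8, 87, 34.0), (4, 79, 30.1), (6, 94, 33.9), (3, 70, 28.2), (3, 89, 30.0),
]


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b] so row operations act on both sides.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_least_squares(rows):
    """Return [b0, b1, ..., bp] minimizing sum((y_i - yhat_i)^2)."""
    design = [[1.0] + list(r[:-1]) for r in rows]  # leading 1 for the intercept
    y = [r[-1] for r in rows]
    k = len(design[0])
    # Normal equations: (X'X) b = X'y
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


b0, b1, b2 = fit_least_squares(DATA)
print(f"SALARY = {b0:.3f} + {b1:.3f}(EXPER) + {b2:.3f}(SCORE)")
```

Up to rounding, the printed coefficients should agree with the Excel output. Solving the normal equations directly is adequate for a small, well-conditioned problem like this one; statistical packages generally use more numerically stable factorizations.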
Interpreting the Coefficients
In multiple regression analysis, we interpret each regression coefficient as follows: \(b_i\) represents an estimate of the change in \(y\) corresponding to a 1-unit increase in \(x_i\) when all other independent variables are held constant.
Interpreting the Coefficients
\(b_1 = 1.404\)
Salary is expected to increase by $1,404 for each additional year of experience (when the variable score on programmer aptitude test is held constant).
Interpreting the Coefficients
\(b_2 = 0.251\)
Salary is expected to increase by $251 for each additional point scored on the programmer aptitude test (when the variable years of experience is held constant).
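The "held constant" interpretation can be checked numerically. This is a small sketch, assuming the rounded coefficients from the estimated regression equation. The function name `predicted_salary` and the example inputs (4 years, score 78) are illustrative choices only.

```python
def predicted_salary(exper, score):
    """Estimated salary in $1000s using the rounded coefficients from the slides."""
    return 3.174 + 1.404 * exper + 0.251 * score


base = predicted_salary(exper=4, score=78)
one_more_year = predicted_salary(exper=5, score=78)   # test score held constant
one_more_point = predicted_salary(exper=4, score=79)  # experience held constant

# Differences are in $1000s, so 1.404 corresponds to $1,404 and 0.251 to $251.
print(round(one_more_year - base, 3))
print(round(one_more_point - base, 3))
```

Because the model is linear, the same differences result from any starting values of experience and score.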
Multiple Coefficient of Determination
Relationship Among SST, SSR, SSE
SST = SSR + SSE
\[ \sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2 \]
where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error
Multiple Coefficient of Determination
Excel’s ANOVA Output
     A            B    C          D          E          F
33   ANOVA
34                df   SS         MS         F          Significance F
35   Regression   2    500.3285   250.1643   42.76013   2.32774E-07
36   Residual     17   99.45697   5.85041
37   Total        19   599.7855
SSR = 500.3285 (Regression row, SS column); SST = 599.7855 (Total row, SS column)
Multiple Coefficient of Determination
R2 = SSR/SST
R2 = 500.3285/599.7855 = .83418
Adjusted Multiple Coefficient of Determination
Adding independent variables, even ones that are not statistically significant, causes the prediction errors to become smaller, thus reducing the sum of squares due to error, SSE.
Because SSR = SST – SSE, when SSE becomes smaller, SSR becomes larger, causing R2 = SSR/SST to increase.
The adjusted multiple coefficient of determination compensates for the number of independent variables in the model.
Adjusted Multiple Coefficient of Determination
\[ R_a^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1} \]
\[ R_a^2 = 1 - (1 - .834179)\,\frac{20 - 1}{20 - 2 - 1} = .814671 \]
where:
\(n\) is the number of observations
\(p\) is the number of independent variables
After adjusting for the number of independent variables, we see that 81.47% of the variability in salary can be explained by the regression.
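As a quick arithmetic check, both coefficients of determination can be recomputed from the ANOVA sums of squares. This is a minimal sketch; the variable names are illustrative and the inputs are the SSR and SST values shown in the Excel output.

```python
ssr = 500.3285   # sum of squares due to regression
sst = 599.7855   # total sum of squares
n = 20           # number of observations
p = 2            # number of independent variables

r_squared = ssr / sst
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

print(f"R2 = {r_squared:.6f}")
print(f"Adjusted R2 = {adjusted_r_squared:.6f}")
```

The factor (n - 1)/(n - p - 1) exceeds 1 whenever p > 0, so the adjusted value is always below R2 and the gap widens as more independent variables are added.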
Adjusted Multiple Coefficient of Determination
Excel’s Regression Statistics
     A                     B
24   SUMMARY OUTPUT
26   Regression Statistics
27   Multiple R            0.913334059
28   R Square              0.834179103
29   Adjusted R Square     0.814670762
30   Standard Error        2.418762076
31   Observations          20
\[ \text{Multiple R} = \sqrt{R^2} \]
\[ \text{Standard Error} = \sqrt{MSE} \]
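The two square-root relationships can be verified from values already reported in the output. This sketch assumes MSE = SSE/(n - p - 1), which matches the Residual row of the ANOVA table (99.45697/17 = 5.85041); the variable names are illustrative.

```python
import math

sse = 99.45697           # sum of squares due to error (ANOVA Residual row)
n, p = 20, 2             # observations, independent variables
r_squared = 0.834179103  # R Square from the regression statistics

mse = sse / (n - p - 1)  # mean square error
standard_error = math.sqrt(mse)
multiple_r = math.sqrt(r_squared)

print(f"MSE = {mse:.5f}")
print(f"Standard Error = {standard_error:.6f}")
print(f"Multiple R = {multiple_r:.6f}")
```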
Assumptions About the Error Term
The error \(\varepsilon\) is a random variable with mean of zero.
The variance of \(\varepsilon\), denoted by \(\sigma^2\), is the same for all values of the independent variables.
The values of \(\varepsilon\) are independent.
The error \(\varepsilon\) is a normally distributed random variable reflecting the deviation between the \(y\) value and the expected value of \(y\) given by \(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p\).
Testing for Significance
In simple linear regression, the F and t tests provide the same conclusion.
In multiple regression, the F and t tests have different purposes.
Testing for Significance: F Test
The F test is referred to as the test for overall significance.
The F test is used to determine whether a significant relationship exists between the dependent variable and the set of all the independent variables.
Testing for Significance: t Test
If the F test shows an overall significance, the t test is used to determine whether each of the individual independent variables is significant.
A separate t test is conducted for each of the independent variables in the model.
We refer to each of these t tests as a test for individual significance.
Testing for Significance: F Test
Hypotheses
\(H_0: \beta_1 = \beta_2 = \dots = \beta_p = 0\)
\(H_a\): One or more of the parameters is not equal to zero.
Test Statistics
F = MSR/MSE
Rejection Rule
Reject \(H_0\) if p-value < \(\alpha\) or if \(F > F_\alpha\), where \(F_\alpha\) is based on an F distribution with \(p\) d.f. in the numerator and \(n - p - 1\) d.f. in the denominator.
\(p\) is the number of independent variables.
F Test for Overall Significance
Hypotheses
\(H_0: \beta_1 = \beta_2 = 0\)
\(H_a\): One or both of the parameters is not equal to zero.
Rejection Rule
For \(\alpha = .05\) and d.f. = 2, 17; \(F_{.05} = 3.59\)
p-value approach: Reject \(H_0\) if p-value < .05
Critical value approach: Reject \(H_0\) if F > 3.59
Critical value = FINV(0.05,2,17)
F Test for Overall Significance
Excel’s ANOVA Output
     A            B    C          D          E          F
33   ANOVA
34                df   SS         MS         F          Significance F
35   Regression   2    500.3285   250.1643   42.76013   2.32774E-07
36   Residual     17   99.45697   5.85041
37   Total        19   599.7855
p-value used to test for overall significance: Significance F = 2.32774E-07
p-value = FDIST(42.76013,2,17) = 2.32774E-07, where 2 is the numerator’s DF and 17 is the denominator’s DF
F Test for Overall Significance
Test Statistics
F = MSR/MSE = 250.16/5.85 = 42.76
Conclusion
p-value < .05, so we can reject \(H_0\). (Also, F = 42.76 > 3.59.)
The regression includes at least one significant independent variable.
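The overall F test can be reproduced from the ANOVA sums of squares. The sketch below relies on a special case: when the numerator has exactly 2 degrees of freedom, the upper-tail probability of the F distribution has the closed form \((1 + 2F/d_2)^{-d_2/2}\), so no statistics library is needed. This shortcut does not apply to other numerator degrees of freedom, where a general routine such as Excel's FDIST is required. The function names are hypothetical.

```python
ssr, sse = 500.3285, 99.45697  # sums of squares from the ANOVA table
n, p = 20, 2                   # observations, independent variables
alpha = 0.05

msr = ssr / p                  # mean square due to regression
mse = sse / (n - p - 1)        # mean square error
f_stat = msr / mse


def f_upper_tail_df1_2(f, df2):
    """P(F > f) for an F distribution with 2 numerator d.f. and df2 denominator d.f."""
    return (1 + 2 * f / df2) ** (-df2 / 2)


def f_critical_df1_2(a, df2):
    """Value F_a with upper-tail area a, again only for 2 numerator d.f."""
    return (df2 / 2) * (a ** (-2 / df2) - 1)


p_value = f_upper_tail_df1_2(f_stat, n - p - 1)
f_crit = f_critical_df1_2(alpha, n - p - 1)

print(f"F = {f_stat:.2f}, critical value = {f_crit:.2f}, p-value = {p_value:.3e}")
print("Reject H0" if p_value < alpha else "Do not reject H0")
```

Both approaches lead to the same decision here: the p-value is far below .05 and F exceeds the critical value.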
Testing for Significance: t Test
Hypotheses
\(H_0: \beta_i = 0\)
\(H_a: \beta_i \neq 0\)
Test Statistics
\[ t = \frac{b_i}{s_{b_i}} \]
Rejection Rule
p-value approach: Reject \(H_0\) if p-value < \(\alpha\)
Critical value approach: Reject \(H_0\) if \(t < -t_{\alpha/2}\) or \(t > t_{\alpha/2}\)
where \(t_{\alpha/2}\) is based on a t distribution with \(n - p - 1\) degrees of freedom.
\(p\) is the number of independent variables.
t Test for Significance of Individual Parameters
Hypotheses
\(H_0: \beta_i = 0\)
\(H_a: \beta_i \neq 0\)
Rejection Rule
For \(\alpha = .05\) and d.f. = 17, \(t_{.025} = 2.11\)
d.f. = n - p - 1 = 20 - 2 - 1 = 17
p-value approach: Reject \(H_0\) if p-value < \(\alpha\)
Critical value approach: Reject \(H_0\) if \(t < -t_{\alpha/2}\) or \(t > t_{\alpha/2}\)
Critical values are \(t_{.025}\) = TINV(0.025*2,17) = 2.11 and \(-t_{.025}\) = -2.11
t Test for Significance of Individual Parameters
Excel’s Regression Equation Output
     A            B          C           D        E
39                Coeffic.   Std. Err.   t Stat   P-value
40   Intercept    3.17394    6.15607     0.5156   0.61279
41   Experience   1.4039     0.19857     7.0702   1.9E-06
42   Test Score   0.25089    0.07735     3.2433   0.00478
Note: Columns F-I are not shown.
The t statistic and p-value in the Experience row are used to test for the individual significance of “Experience”.
p-value for Experience = TDIST(7.07,17,2) = 1.9E-06
t Test for Significance of Individual Parameters
Excel’s Regression Equation Output
     A            B          C           D        E
39                Coeffic.   Std. Err.   t Stat   P-value
40   Intercept    3.17394    6.15607     0.5156   0.61279
41   Experience   1.4039     0.19857     7.0702   1.9E-06
42   Test Score   0.25089    0.07735     3.2433   0.00478
Note: Columns F-I are not shown.
The t statistic and p-value in the Test Score row are used to test for the individual significance of “Test Score”.
p-value for Test Score = TDIST(3.2433,17,2) = 0.00478
t Test for Significance of Individual Parameters
Test Statistics
\[ t = \frac{b_1}{s_{b_1}} = \frac{1.4039}{.1986} = 7.07 \]
\[ t = \frac{b_2}{s_{b_2}} = \frac{.25089}{.07735} = 3.24 \]
p-values
p-value for \(\beta_1\) = 1.9E-06
p-value for \(\beta_2\) = 0.00478
Critical values
\(t_{.025} = 2.11\)
\(-t_{.025} = -2.11\)
Conclusions
Reject both \(H_0: \beta_1 = 0\) and \(H_0: \beta_2 = 0\).
Both independent variables are significant.
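The critical value approach for the individual t tests amounts to a ratio and a comparison. This is a minimal sketch that takes the coefficients and standard errors from the Excel output and the critical value 2.11 from the slides as given, rather than computing it from a t distribution. The dictionary layout and variable names are illustrative.

```python
# (coefficient b_i, standard error s_bi) from Excel's regression output
estimates = {
    "Experience": (1.4039, 0.19857),
    "Test Score": (0.25089, 0.07735),
}
t_critical = 2.11  # t_.025 with 17 d.f., taken from the slides

t_stats = {}
for name, (b, s_b) in estimates.items():
    t = b / s_b
    t_stats[name] = t
    # Two-tailed test: reject H0 when t falls in either tail.
    reject = t < -t_critical or t > t_critical
    decision = "reject H0" if reject else "do not reject H0"
    print(f"{name}: t = {t:.2f} -> {decision}")
```

Small differences from Excel's t Stat column in the later decimals are expected, because the inputs here are the rounded values displayed in the output.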
Qualitative Independent Variables
In many situations we must work with qualitative independent variables such as gender (male, female), method of payment (cash, check, credit card), etc.
For example, \(x_2\) might represent gender where \(x_2 = 0\) indicates male and \(x_2 = 1\) indicates female.
In this case, \(x_2\) is called a dummy or indicator variable.
For a qualitative independent variable with k categories, we create k-1 dummy variables.
Qualitative Independent Variables
Example: Programmer Salary Survey
As an extension of the problem involving the computer programmer salary survey, suppose that management also believes that the annual salary is related to whether the individual has a graduate degree in computer science or information systems.
The years of experience, the score on the programmer aptitude test, whether the individual has a relevant graduate degree, and the annual salary ($1000) for each of the sampled 20 programmers are shown on the next slide.
Qualitative Independent Variables
Exper.  Score  Degr.  Salary      Exper.  Score  Degr.  Salary
4       78     No     24          9       88     Yes    38
7       100    Yes    43          2       73     No     26.6
1       86     No     23.7        10      75     Yes    36.2
5       82     Yes    34.3        5       81     No     31.6
8       86     Yes    35.8        6       74     No     29
10      84     Yes    38          8       87     Yes    34
0       75     No     22.2        4       79     No     30.1
1       80     No     23.1        6       94     Yes    33.9
6       83     No     30          3       70     No     28.2
6       91     Yes    33          3       89     No     30
Estimated Regression Equation
\[ \hat{y} = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 \]
where:
\(y\) = annual salary ($1000)
\(x_1\) = years of experience
\(x_2\) = score on programmer aptitude test
\(x_3\) = 0 if individual does not have a graduate degree, 1 if individual does have a graduate degree
\(x_3\) is a dummy variable
Use ProgramerSalary.xlsx (Qualitative IV-2Cat tab)
Estimated Regression Equation
Excel Formula Worksheet (showing data)
1    Programmer   Experience (years)   Test Score   Grad. Degree   Salary ($000)
2    1            4                    78           0              24.0
3    2            7                    100          1              43.0
4    3            1                    86           0              23.7
5    4            5                    82           1              34.3
6    5            8                    86           1              35.8
7    6            10                   84           1              38.0
8    7            0                    75           0              22.2
9    8            1                    80           0              23.1
Note: Rows 10-21 are not shown.
Estimated Regression Equation
significant
not significant
\[ \widehat{\mathit{Salary}} = 7.94 + 1.14\,\mathit{Exp} + 0.19\,\mathit{Score} + 2.28\,\mathit{DummyDeg} \]
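One way to read the dummy-variable term is as a shift in the predicted salary for otherwise identical programmers. This is a small sketch using the rounded coefficients from the equation above; the function name and the example inputs (5 years, score 80) are illustrative assumptions.

```python
def predicted_salary_with_degree(exper, score, has_degree):
    """Estimated salary in $1000s; has_degree is the 0/1 dummy variable."""
    dummy = 1 if has_degree else 0
    return 7.94 + 1.14 * exper + 0.19 * score + 2.28 * dummy


without_degree = predicted_salary_with_degree(exper=5, score=80, has_degree=False)
with_degree = predicted_salary_with_degree(exper=5, score=80, has_degree=True)

# The gap equals the dummy coefficient regardless of experience and score.
print(round(with_degree - without_degree, 2))
```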
Categorical Independent Variables
Excel’s ANOVA Output
     A            B    C          D          E         F
33   ANOVA
34                df   SS         MS         F         Significance F
35   Regression   3    507.8960   169.2987   29.4787   9.41675E-07
36   Residual     16   91.8895    5.7431
37   Total        19   599.7855
R2 = 507.896/599.7855 = .8468
\[ R_a^2 = 1 - (1 - .8468)\,\frac{20 - 1}{20 - 3 - 1} = .8181 \]
Categorical Independent Variables
Excel’s Regression Statistics
     A                     B
24   SUMMARY OUTPUT
26   Regression Statistics
27   Multiple R            0.920215239
28   R Square              0.846796085
29   Adjusted R Square     0.818070351
30   Standard Error        2.396475101
31   Observations          20
Previously, R Square = .8342
Previously, Adjusted R Square = .815 (essentially no improvement)
More Complex Qualitative Variables
For example, a variable indicating level of education could be represented by \(x_1\) and \(x_2\) values as follows:
Highest Degree   x1   x2
Bachelor’s       0    0
Master’s         1    0
Ph.D.            0    1
Use ProgramerSalary.xlsx (Qualitative IV-3Cat tab)
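The k-1 coding rule can be sketched as a small helper that maps a category to its dummy values. The function name `dummy_code` and the convention that the first listed category is the baseline (all zeros) are assumptions made for this illustration; they reproduce the table above when Bachelor's is listed first.

```python
def dummy_code(value, categories):
    """Encode value as k-1 indicator variables; categories[0] is the baseline."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    # One indicator per non-baseline category, in the order listed.
    return [1 if value == category else 0 for category in categories[1:]]


DEGREES = ["Bachelor's", "Master's", "Ph.D."]  # k = 3 categories -> 2 dummies

for degree in DEGREES:
    x1, x2 = dummy_code(degree, DEGREES)
    print(f"{degree}: x1 = {x1}, x2 = {x2}")
```

Using k-1 rather than k indicators avoids redundancy with the intercept: the baseline category is represented by all dummies being zero.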
More Complex Qualitative Variables
Use ProgramerSalary.xlsx (Qualitative IV-3Cat tab)
significant
not significant
Residual Analysis
For simple linear regression the residual plot against \(\hat{y}\) and the residual plot against \(x\) provide the same information.
In multiple regression analysis it is preferable to use the residual plot against \(\hat{y}\) to determine if the model assumptions are satisfied.
Standardized Residual Plot Against \(\hat{y}\)
Standardized residuals are frequently used in residual plots for purposes of:
• Identifying outliers (typically, standardized residuals < -2 or > +2)
• Providing insight about the assumption that the error term \(\varepsilon\) has a normal distribution
The computation of the standardized residuals in multiple regression analysis is too complex to be done by hand.
Excel’s Regression tool can be used.
Standardized Residual Plot Against \(\hat{y}\)
Excel Value Worksheet
     A             B             C              D
29   RESIDUAL OUTPUT
31   Observation   Predicted Y   Residuals      Standard Residuals
32   1             27.89626052   -3.89626052    -1.771706896
33   2             37.95204323   5.047956775    2.295406016
34   3             26.02901122   -2.32901122    -1.059047572
35   4             32.11201403   2.187985973    0.994920596
36   5             36.34250715   -0.54250715    -0.246688757
Note: Rows 37-51 are not shown.
Observation 2 has a standardized residual > 2 and is therefore an outlier.
Remember to check the standardized residual box in the Regression tool dialog box.
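The outlier rule of thumb is easy to apply once the standardized residuals are available. This is a minimal sketch using only the five rows visible in the worksheet above; the remaining observations are not shown on the slide and are not included. The threshold of 2 follows the "typically" guideline stated earlier, and the variable names are illustrative.

```python
# Observation number -> standardized residual (rows shown in the worksheet)
standard_residuals = {
    1: -1.771706896,
    2: 2.295406016,
    3: -1.059047572,
    4: 0.994920596,
    5: -0.246688757,
}
THRESHOLD = 2  # flag residuals below -2 or above +2

outliers = [obs for obs, r in standard_residuals.items() if abs(r) > THRESHOLD]
print("Possible outliers:", outliers)
```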
Standardized Residual Plot Against \(\hat{y}\)
Excel’s Standardized Residual Plot
[Figure: Standardized Residual Plot. Vertical axis: Standard Residuals (-2 to 3); horizontal axis: Predicted Salary (0 to 50). One point is labeled as an outlier.]