Multiple Regression I
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Chapter 4
Multiple Regression Analysis
(Part 1)

Terry Dielman
Applied Regression Analysis: A Second Course in Business and Economic Statistics, fourth edition
4.1 Using Multiple Regression
In Chapter 3, the method of least squares was used to describe the relationship between a dependent variable y and an explanatory variable x.

Here we extend that to two or more predictor variables, using an equation of the form:
ŷ = b0 + b1x1 + b2x2 + ... + bkxk
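As a sketch of what the software does at this step, here is the fit in Python with NumPy; the small data set is invented for illustration (it is not MEDDICORP4), and minimizing the sum of squared residuals is exactly what `np.linalg.lstsq` performs:

```python
import numpy as np

# Invented illustration data: one response y, two predictors x1 and x2
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y = np.array([5.1, 6.9, 11.2, 12.8, 17.0])

# Design matrix with a leading column of 1s for the intercept b0
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least squares chooses b0, b1, b2 to minimize the sum of squared residuals
b, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ b            # fitted values: y-hat = b0 + b1*x1 + b2*x2
residuals = y - y_hat
print(b.round(3), residuals.round(3))
```

With an intercept in the model, the residuals always sum to zero, which is a quick sanity check on the fit.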
Basic Exploration
In Chapter 3 our main graphic tool was the X-Y scatter plot.

Exploratory graphics are a bit harder to produce here because they need to be multidimensional.

Even if there were just two x variables, a 3-D display is needed.
Estimation of Coefficients
We want an equation of the form:

ŷ = b0 + b1x1 + b2x2 + ... + bkxk

As before we use least squares. The coefficients b0, b1, b2, ..., bk are determined by minimizing the sum of squared residuals.
Formulae are Very Complex

Can show exact formula when k=1 (simple regression). Refer to Section 3.1.

Few texts show the formulae for k=2 (the simplest of multiple regressions).

Appendix D shows the formula in matrix notation.

This is totally a computer problem.
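For readers who look at Appendix D, the matrix-notation solution b = (X'X)⁻¹X'y can be verified numerically. This sketch assumes NumPy and uses randomly generated stand-in data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: n = 30 observations, k = 2 predictors plus an intercept column
X = np.column_stack([np.ones(30), rng.normal(size=30), rng.normal(size=30)])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.1, size=30)

# Normal equations (X'X) b = X'y; solve() is numerically safer than inverting
b = np.linalg.solve(X.T @ X, X.T @ y)

# The general-purpose least-squares routine gives the same coefficients
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b.round(3), b_lstsq.round(3))
```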
Example 4.1 Meddicorp Sales

n = 25 sales territories
Y = Sales ($1000s) in each territory
X1 = Advertising ($100s) in the territory
X2 = Amount of bonuses ($100s) paid to salespersons in the territory

Data set MEDDICORP4
Plots and Correlation

[Scatter plots of SALES (900-1700) against ADV (400-600) and against BONUS (240-340).]

Correlations

        SALES   ADV
ADV     0.900
BONUS   0.568   0.419
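A correlation table like the one above takes a single call in NumPy. The sales, adv, and bonus arrays below are invented stand-ins for illustration, not the MEDDICORP4 values:

```python
import numpy as np

# Stand-in arrays for illustration only (not the MEDDICORP4 data)
sales = np.array([963., 893., 1057., 1183., 1419., 1547., 1580.])
adv   = np.array([374., 408., 447., 484., 532., 566., 602.])
bonus = np.array([230., 236., 241., 266., 283., 318., 304.])

# Rows/columns in the order given: SALES, ADV, BONUS; 1s on the diagonal
corr = np.corrcoef([sales, adv, bonus])
print(corr.round(3))
```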
3D Graphics

[3-D scatter plot of SALES (900-1700) against ADV (400-600) and BONUS (240-340), titled "Meddicorp Sales".]
Minitab Regression Output

The regression equation is
SALES = - 516 + 2.47 ADV + 1.86 BONUS

Predictor       Coef   SE Coef      T      P
Constant      -516.4     189.9  -2.72  0.013
ADV           2.4732    0.2753   8.98  0.000
BONUS         1.8562    0.7157   2.59  0.017

S = 90.75   R-Sq = 85.5%   R-Sq(adj) = 84.2%
Analysis of Variance
Source          DF       SS      MS      F      P
Regression       2  1067797  533899  64.83  0.000
Residual Error  22   181176    8235
Total           24  1248974
3D Surface Graph

[3-D surface of EstSales (900-1600) over Advert (400-600) and Bonus (240-340), titled "Estimated Meddicorp Sales".]

The regression equation is
SALES = - 516 + 2.47 ADV + 1.86 BONUS
Interpretation of Coefficients

Recall that sales is in $1000s and advertising and bonus in $100s.

If advertising is held fixed, sales increase $1860 for each $100 of bonus paid.

If bonus were fixed, sales increase $2470 for each $100 spent on ads.

The regression equation is
SALES = - 516 + 2.47 ADV + 1.86 BONUS
4.2 Inferences From a Multiple Regression Analysis

In general, the population regression equation involving K predictors is:

μ(y | x1, x2, ..., xK) = β0 + β1x1 + β2x2 + ... + βKxK

This says the mean value of y at a given set of x values is a point on the surface described by the terms on the right-hand side of the equation.
4.2.1 Assumptions Concerning the Population Regression Line

An alternative way of writing the relationship is:

yi = β0 + β1x1i + β2x2i + ... + βKxKi + ei

where i denotes the ith observation and ei denotes a random error or disturbance (deviation from the mean).

We make certain assumptions about the ei.
Assumptions

1. We expect the average disturbance ei to be zero, so the regression line passes through the average value of y.
2. The ei have constant variance σe².
3. The ei are normally distributed.
4. The ei are independent.
Inferences

The assumptions allow inferences about the population relationship to be made from a sample equation.

The first inferences considered are those about the individual population coefficients β1, β2, ..., βK.

Chapter 6 examines what happens when the assumptions are violated.
4.2.2 Inferences about the Population Regression Coefficients

If we wish to make an estimate of the effect of a change in one of the x variables on y, use the interval:

bj ± t sbj

This refers to the jth of the K regression coefficients; sbj is the standard error of bj. The multiplier t is selected from the t-distribution with n-K-1 degrees of freedom.
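This interval is easy to compute once the estimate and its standard error are read off the output. A sketch using SciPy for the t multiplier, plugged with the Meddicorp ADV coefficient (n = 25, K = 2):

```python
from scipy import stats

def coef_interval(b_j, se_bj, n, K, level=0.95):
    """Interval b_j +/- t * s_bj, with t from the t-distribution on n-K-1 df."""
    t = stats.t.ppf(1 - (1 - level) / 2, df=n - K - 1)
    return b_j - t * se_bj, b_j + t * se_bj

# ADV coefficient from the Minitab output: 2.4732 with standard error 0.2753
lo, hi = coef_interval(2.4732, 0.2753, n=25, K=2)
print(round(lo, 3), round(hi, 3))  # roughly 1.902 to 3.044
```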
Tests About the Coefficients

A test about the marginal effect of xj on y may be obtained from:

H0: βj = βj*
Ha: βj ≠ βj*

where βj* is some specific value that is relevant for the jth coefficient.
Test Statistic

The test would be performed by using the standardized test statistic:

t = (bj - βj*) / sbj

The most common form of this test is for the parameter to be 0. In this case the test statistic is just the estimate divided by its standard error.
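A sketch of the statistic in Python (SciPy supplies the p-value), using the ADV line of the Meddicorp output with βj* = 0:

```python
from scipy import stats

def t_test_coef(b_j, se_bj, beta_star, n, K):
    """t = (b_j - beta*) / s_bj with a two-sided p-value on n-K-1 df."""
    t = (b_j - beta_star) / se_bj
    p = 2 * stats.t.sf(abs(t), df=n - K - 1)
    return t, p

# H0: beta_ADV = 0, using the estimate and standard error from the output
t, p = t_test_coef(2.4732, 0.2753, beta_star=0.0, n=25, K=2)
print(round(t, 2), round(p, 4))  # t about 8.98, p essentially 0
```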
Example 4.2 Meddicorp (Continued)

Refer again to the portion of the regression output about the individual regression coefficients:

Predictor       Coef   SE Coef      T      P
Constant      -516.4     189.9  -2.72  0.013
ADV           2.4732    0.2753   8.98  0.000
BONUS         1.8562    0.7157   2.59  0.017

This lists the estimates, their standard errors and the ratio of the estimates to their standard errors.
Tests For Effect of Advertising

To see if an increase in advertising expenditure affects sales, we can test:

H0: βADV = 0 (An increase in advertising has no effect on sales)
Ha: βADV ≠ 0 (Sales do change when advertising increases)

The df are n-K-1 = 25-2-1 = 22. At a 5% significance level, the critical point from the t-table is 2.074.
Test Result

From the output we get:

t = (2.4732 - 0)/.2753 = 8.98

This is above the critical value of 2.074, so we reject H0.

Note that we could also make use of the p-value (.000) for the test.
One-Sided Test on Bonus

We can modify the test to make it one-sided:

H0: βBONUS = 0 (Increased bonuses do not affect sales)
Ha: βBONUS > 0 (Sales increase when bonuses are higher)

At a 5% significance level, the (one-sided) critical point is 1.717.
One-Sided Test Result

From the output we get:

t = 1.8562/.7157 = 2.59, which is > 1.717

We reject H0, but this time make a more specific conclusion.

The listed p-value (.017) is for a two-sided test. For our one-sided test, cut it in half.
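The halving rule can be checked directly; this sketch assumes SciPy and recomputes the BONUS statistic, taking the upper-tail area for the one-sided alternative:

```python
from scipy import stats

# BONUS line of the output: estimate 1.8562, standard error 0.7157, 22 df
t = 1.8562 / 0.7157
p_one_sided = stats.t.sf(t, df=22)   # upper-tail area for Ha: beta > 0
p_two_sided = 2 * p_one_sided        # what Minitab lists (about .017)
print(round(t, 2), round(p_one_sided, 4), round(p_two_sided, 4))
```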
Interval Effect of Advertising

Recall that sales are measured in $1000s and ADV in $100s.

badv = 2.4732 and has standard error = .2753

2.4732 ± 2.074(.2753) = 2.4732 ± .5709 = 1.902 to 3.044

Each $100 spent on advertising returns $1902 to $3044 in sales.
4.3 Assessing the Fit of the Model

Recall how we partitioned the variation in the previous chapter:

SST = Total variation in the sample of Y values

This splits into two components, SSE and SSR:

SSE = Error or unexplained variation
SSR = Variation explained by the Yhat function
4.3.1 The ANOVA Table and R²

These are the same statistics we briefly examined in simple regression.

They are perhaps more important here because they measure how well all the variables in the equation work together.

S = 90.75   R-Sq = 85.5%   R-Sq(adj) = 84.2%

Analysis of Variance

Source          DF       SS      MS      F      P
Regression       2  1067797  533899  64.83  0.000
Residual Error  22   181176    8235
Total           24  1248974
R² – a Universal Measure of Fit

R² = SSR / SST = proportion of variation explained by the regression equation.

If multiplied by 100, interpret as a percentage.

If there is only one x, R² is the square of its correlation with y.

For multiple regression, R² is the square of the correlation between the Y values and the Y-hat values.
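That last fact can be demonstrated numerically. A sketch with NumPy on invented data: the ANOVA definition 1 - SSE/SST matches the squared correlation between y and y-hat when the model has an intercept:

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented data with two predictors and an intercept
X = np.column_stack([np.ones(40), rng.normal(size=40), rng.normal(size=40)])
y = X @ np.array([0.5, 1.5, -1.0]) + rng.normal(size=40)

b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b

sst = np.sum((y - y.mean()) ** 2)       # total variation
sse = np.sum((y - y_hat) ** 2)          # unexplained variation
r2_anova = 1 - sse / sst                # = SSR / SST
r2_corr = np.corrcoef(y, y_hat)[0, 1] ** 2
print(round(r2_anova, 6), round(r2_corr, 6))  # the two agree
```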
For our example

S = 90.75   R-Sq = 85.5%   R-Sq(adj) = 84.2%

Analysis of Variance

Source          DF       SS      MS      F      P
Regression       2  1067797  533899  64.83  0.000
Residual Error  22   181176    8235
Total           24  1248974

R² = 1067797 / 1248974 = .85494

85.5% of the variation in sales in the 25 territories is explained by the different levels of advertising and bonus.
Adjusted R²

If there are many predictor variables to choose from, the best R² is always obtained by throwing them all in the model.

Some of these predictors could be insignificant, suggesting they contribute little to the model's R².

Adjusted R² is a way to balance the desire for high R² against the desire to include only important variables.
Computation

The "adjustment" is for the number of variables in the model.

R²adj = 1 - [SSE/(n-K-1)] / [SST/(n-1)]

Although regular R² may decrease when you remove a variable, the adjusted version may actually increase if that variable did not have much significance.
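Applying the formula to the Meddicorp ANOVA table (n = 25, K = 2) recovers the R-Sq(adj) value printed by Minitab:

```python
# Sums of squares from the ANOVA table above
sse, sst = 181176, 1248974
n, K = 25, 2

r2 = 1 - sse / sst                                  # regular R^2
r2_adj = 1 - (sse / (n - K - 1)) / (sst / (n - 1))  # adjusted version
print(round(r2, 3), round(r2_adj, 3))  # about 0.855 and 0.842
```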
4.3.2 The F Statistic

Since R² is so high, you would certainly think that the model contains significant predictive power.

In other problems it is perhaps not so obvious. For example, would an R² of 20% show any prediction ability at all?

We can test for the predictive power of the entire model using the F statistic.
F Tests

Generally these compare two sources of variation.

F = V1/V2 and has two df parameters.

Here V1 = SSR/K has K df, and V2 = SSE/(n-K-1) has n-K-1 df.
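Computed from the Meddicorp ANOVA table (n = 25, K = 2), this reproduces the F printed by Minitab:

```python
# Sums of squares from the ANOVA table
ssr, sse = 1067797, 181176
n, K = 25, 2

v1 = ssr / K             # regression mean square, K numerator df
v2 = sse / (n - K - 1)   # error mean square, n-K-1 denominator df
f = v1 / v2
print(round(f, 2))  # 64.83, matching the output
```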
F Tables

Usually you will see several pages of these: one or two pages at each specific level of significance (.10, .05, .01).

[Table layout: the value of F at a specific significance level, with numerator d.f. across the columns and denominator d.f. down the rows.]
F Test Hypotheses

H0: β1 = β2 = ... = βK = 0 (None of the Xs help explain Y)
Ha: Not all βs are 0 (At least one X is useful)

H0: R² = 0 is an equivalent hypothesis.
F test for our example

S = 90.75   R-Sq = 85.5%   R-Sq(adj) = 84.2%

Analysis of Variance

Source          DF       SS      MS      F      P
Regression       2  1067797  533899  64.83  0.000
Residual Error  22   181176    8235
Total           24  1248974

F = 533899 / 8235 = 64.83 has p-value = 0.000

From tables, F(2,22,.05) = 3.44 and F(2,22,.01) = 5.72

This confirms that R² = 85.5% is not near zero.
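The table lookups can be reproduced in code; this sketch assumes SciPy, whose ppf method inverts the F distribution's CDF:

```python
from scipy import stats

# Critical values of F with 2 numerator and 22 denominator df
f_05 = stats.f.ppf(0.95, dfn=2, dfd=22)   # 5% significance level
f_01 = stats.f.ppf(0.99, dfn=2, dfd=22)   # 1% significance level
print(round(f_05, 2), round(f_01, 2))  # about 3.44 and 5.72
```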