Multiple Regression I
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Chapter 4
Multiple Regression Analysis
(Part 1)

Terry Dielman
Applied Regression Analysis: A Second Course in Business and Economic Statistics, fourth edition
4.1 Using Multiple Regression
In Chapter 3, the method of least squares was used to describe the relationship between a dependent variable y and an explanatory variable x.

Here we extend that to two or more predictor variables, using an equation of the form:
ŷ = b0 + b1x1 + b2x2 + ... + bkxk
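As a sketch of what the software does at this step, here is the fit in Python with NumPy; the small data set is invented for illustration (it is not MEDDICORP4), and minimizing the sum of squared residuals is exactly what `np.linalg.lstsq` performs:

```python
import numpy as np

# Invented illustration data: one response y, two predictors x1 and x2
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y = np.array([5.1, 6.9, 11.2, 12.8, 17.0])

# Design matrix with a leading column of 1s for the intercept b0
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least squares chooses b0, b1, b2 to minimize the sum of squared residuals
b, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ b            # fitted values: y-hat = b0 + b1*x1 + b2*x2
residuals = y - y_hat
print(b.round(3), residuals.round(3))
```

With an intercept in the model, the residuals always sum to zero, which is a quick sanity check on the fit.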
Basic Exploration
In Chapter 3 our main graphic tool was the X-Y scatter plot.

Exploratory graphics are a bit harder to produce here because they need to be multidimensional.

Even if there were just two x variables, a 3-D display is needed.
Estimation of Coefficients
We want an equation of the form:

ŷ = b0 + b1x1 + b2x2 + ... + bkxk

As before we use least squares. The coefficients b0, b1, b2, ..., bk are determined by minimizing the sum of squared residuals.
Formulae are Very Complex

Can show exact formula when k=1 (simple regression). Refer to Section 3.1.

Few texts show the formulae for k=2 (the simplest of multiple regressions).

Appendix D shows the formula in matrix notation.

This is totally a computer problem.
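For readers who look at Appendix D, the matrix-notation solution b = (X'X)⁻¹X'y can be verified numerically. This sketch assumes NumPy and uses randomly generated stand-in data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: n = 30 observations, k = 2 predictors plus an intercept column
X = np.column_stack([np.ones(30), rng.normal(size=30), rng.normal(size=30)])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.1, size=30)

# Normal equations (X'X) b = X'y; solve() is numerically safer than inverting
b = np.linalg.solve(X.T @ X, X.T @ y)

# The general-purpose least-squares routine gives the same coefficients
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b.round(3), b_lstsq.round(3))
```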
Example 4.1 Meddicorp Sales

n = 25 sales territories
Y = Sales ($1000s) in each territory
X1 = Advertising ($100s) in the territory
X2 = Amount of bonuses ($100s) paid to salespersons in the territory

Data set MEDDICORP4
Plots and Correlation

[Scatter plots of SALES (900-1700) against ADV (400-600) and against BONUS (240-340).]

Correlations

        SALES   ADV
ADV     0.900
BONUS   0.568   0.419
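A correlation table like the one above takes a single call in NumPy. The sales, adv, and bonus arrays below are invented stand-ins for illustration, not the MEDDICORP4 values:

```python
import numpy as np

# Stand-in arrays for illustration only (not the MEDDICORP4 data)
sales = np.array([963., 893., 1057., 1183., 1419., 1547., 1580.])
adv   = np.array([374., 408., 447., 484., 532., 566., 602.])
bonus = np.array([230., 236., 241., 266., 283., 318., 304.])

# Rows/columns in the order given: SALES, ADV, BONUS; 1s on the diagonal
corr = np.corrcoef([sales, adv, bonus])
print(corr.round(3))
```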
3D Graphics

[3-D scatter plot of SALES (900-1700) against ADV (400-600) and BONUS (240-340), titled "Meddicorp Sales".]
Minitab Regression Output

The regression equation is
SALES = - 516 + 2.47 ADV + 1.86 BONUS

Predictor       Coef   SE Coef      T      P
Constant      -516.4     189.9  -2.72  0.013
ADV           2.4732    0.2753   8.98  0.000
BONUS         1.8562    0.7157   2.59  0.017

S = 90.75   R-Sq = 85.5%   R-Sq(adj) = 84.2%
Analysis of Variance
Source          DF       SS      MS      F      P
Regression       2  1067797  533899  64.83  0.000
Residual Error  22   181176    8235
Total           24  1248974
3D Surface Graph

[3-D surface of EstSales (900-1600) over Advert (400-600) and Bonus (240-340), titled "Estimated Meddicorp Sales".]

The regression equation is
SALES = - 516 + 2.47 ADV + 1.86 BONUS
Interpretation of Coefficients

Recall that sales is in $1000s and advertising and bonus in $100s.

If advertising is held fixed, sales increase $1860 for each $100 of bonus paid.

If bonus were fixed, sales increase $2470 for each $100 spent on ads.

The regression equation is
SALES = - 516 + 2.47 ADV + 1.86 BONUS
4.2 Inferences From a Multiple Regression Analysis

In general, the population regression equation involving K predictors is:

μ(y | x1, x2, ..., xK) = β0 + β1x1 + β2x2 + ... + βKxK

This says the mean value of y at a given set of x values is a point on the surface described by the terms on the right-hand side of the equation.
4.2.1 Assumptions Concerning the Population Regression Line

An alternative way of writing the relationship is:

yi = β0 + β1x1i + β2x2i + ... + βKxKi + ei

where i denotes the ith observation and ei denotes a random error or disturbance (deviation from the mean).

We make certain assumptions about the ei.
Assumptions

1. We expect the average disturbance ei to be zero, so the regression line passes through the average value of y.
2. The ei have constant variance σe².
3. The ei are normally distributed.
4. The ei are independent.
Inferences

The assumptions allow inferences about the population relationship to be made from a sample equation.

The first inferences considered are those about the individual population coefficients β1, β2, ..., βK.

Chapter 6 examines what happens when the assumptions are violated.
4.2.2 Inferences about the Population Regression Coefficients

If we wish to make an estimate of the effect of a change in one of the x variables on y, use the interval:

bj ± t sbj

This refers to the jth of the K regression coefficients; sbj is the standard error of bj. The multiplier t is selected from the t-distribution with n-K-1 degrees of freedom.
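This interval is easy to compute once the estimate and its standard error are read off the output. A sketch using SciPy for the t multiplier, plugged with the Meddicorp ADV coefficient (n = 25, K = 2):

```python
from scipy import stats

def coef_interval(b_j, se_bj, n, K, level=0.95):
    """Interval b_j +/- t * s_bj, with t from the t-distribution on n-K-1 df."""
    t = stats.t.ppf(1 - (1 - level) / 2, df=n - K - 1)
    return b_j - t * se_bj, b_j + t * se_bj

# ADV coefficient from the Minitab output: 2.4732 with standard error 0.2753
lo, hi = coef_interval(2.4732, 0.2753, n=25, K=2)
print(round(lo, 3), round(hi, 3))  # roughly 1.902 to 3.044
```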
Tests About the Coefficients

A test about the marginal effect of xj on y may be obtained from:

H0: βj = βj*
Ha: βj ≠ βj*

where βj* is some specific value that is relevant for the jth coefficient.
Test Statistic

The test would be performed by using the standardized test statistic:

t = (bj - βj*) / sbj

The most common form of this test is for the parameter to be 0. In this case the test statistic is just the estimate divided by its standard error.
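A sketch of the statistic in Python (SciPy supplies the p-value), using the ADV line of the Meddicorp output with βj* = 0:

```python
from scipy import stats

def t_test_coef(b_j, se_bj, beta_star, n, K):
    """t = (b_j - beta*) / s_bj with a two-sided p-value on n-K-1 df."""
    t = (b_j - beta_star) / se_bj
    p = 2 * stats.t.sf(abs(t), df=n - K - 1)
    return t, p

# H0: beta_ADV = 0, using the estimate and standard error from the output
t, p = t_test_coef(2.4732, 0.2753, beta_star=0.0, n=25, K=2)
print(round(t, 2), round(p, 4))  # t about 8.98, p essentially 0
```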
Example 4.2 Meddicorp (Continued)

Refer again to the portion of the regression output about the individual regression coefficients:

Predictor       Coef   SE Coef      T      P
Constant      -516.4     189.9  -2.72  0.013
ADV           2.4732    0.2753   8.98  0.000
BONUS         1.8562    0.7157   2.59  0.017

This lists the estimates, their standard errors and the ratio of the estimates to their standard errors.
Tests For Effect of Advertising

To see if an increase in advertising expenditure affects sales, we can test:

H0: βADV = 0 (An increase in advertising has no effect on sales)
Ha: βADV ≠ 0 (Sales do change when advertising increases)

The df are n-K-1 = 25-2-1 = 22. At a 5% significance level, the critical point from the t-table is 2.074.
Test Result

From the output we get:

t = (2.4732 - 0)/.2753 = 8.98

This is above the critical value of 2.074, so we reject H0.

Note that we could also make use of the p-value (.000) for the test.
One-Sided Test on Bonus

We can modify the test to make it one-sided:

H0: βBONUS = 0 (Increased bonuses do not affect sales)
Ha: βBONUS > 0 (Sales increase when bonuses are higher)

At a 5% significance level, the (one-sided) critical point is 1.717.
One-Sided Test Result

From the output we get:

t = 1.8562/.7157 = 2.59, which is > 1.717

We reject H0, but this time make a more specific conclusion.

The listed p-value (.017) is for a two-sided test. For our one-sided test, cut it in half.
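The halving rule can be checked directly; this sketch assumes SciPy and recomputes the BONUS statistic, taking the upper-tail area for the one-sided alternative:

```python
from scipy import stats

# BONUS line of the output: estimate 1.8562, standard error 0.7157, 22 df
t = 1.8562 / 0.7157
p_one_sided = stats.t.sf(t, df=22)   # upper-tail area for Ha: beta > 0
p_two_sided = 2 * p_one_sided        # what Minitab lists (about .017)
print(round(t, 2), round(p_one_sided, 4), round(p_two_sided, 4))
```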
Interval Effect of Advertising

Recall that sales are measured in $1000s and ADV in $100s.

badv = 2.4732 and has standard error = .2753

2.4732 ± 2.074(.2753) = 2.4732 ± .5709 = 1.902 to 3.044

Each $100 spent on advertising returns $1902 to $3044 in sales.
4.3 Assessing the Fit of the Model

Recall how we partitioned the variation in the previous chapter:

SST = Total variation in the sample of Y values

This splits into two components, SSE and SSR:

SSE = Error or unexplained variation
SSR = Variation explained by the Yhat function
4.3.1 The ANOVA Table and R²

These are the same statistics we briefly examined in simple regression.

They are perhaps more important here because they measure how well all the variables in the equation work together.

S = 90.75   R-Sq = 85.5%   R-Sq(adj) = 84.2%

Analysis of Variance

Source          DF       SS      MS      F      P
Regression       2  1067797  533899  64.83  0.000
Residual Error  22   181176    8235
Total           24  1248974
R² – a Universal Measure of Fit

R² = SSR / SST = proportion of variation explained by the regression equation.

If multiplied by 100, interpret as a percentage.

If there is only one x, R² is the square of its correlation with y.

For multiple regression, R² is the square of the correlation between the Y values and the Y-hat values.
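That last fact can be demonstrated numerically. A sketch with NumPy on invented data: the ANOVA definition 1 - SSE/SST matches the squared correlation between y and y-hat when the model has an intercept:

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented data with two predictors and an intercept
X = np.column_stack([np.ones(40), rng.normal(size=40), rng.normal(size=40)])
y = X @ np.array([0.5, 1.5, -1.0]) + rng.normal(size=40)

b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b

sst = np.sum((y - y.mean()) ** 2)       # total variation
sse = np.sum((y - y_hat) ** 2)          # unexplained variation
r2_anova = 1 - sse / sst                # = SSR / SST
r2_corr = np.corrcoef(y, y_hat)[0, 1] ** 2
print(round(r2_anova, 6), round(r2_corr, 6))  # the two agree
```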
For our example

S = 90.75   R-Sq = 85.5%   R-Sq(adj) = 84.2%

Analysis of Variance

Source          DF       SS      MS      F      P
Regression       2  1067797  533899  64.83  0.000
Residual Error  22   181176    8235
Total           24  1248974

R² = 1067797 / 1248974 = .85494

85.5% of the variation in sales in the 25 territories is explained by the different levels of advertising and bonus.
Adjusted R²

If there are many predictor variables to choose from, the best R² is always obtained by throwing them all in the model.

Some of these predictors could be insignificant, suggesting they contribute little to the model's R².

Adjusted R² is a way to balance the desire for high R² against the desire to include only important variables.
Computation

The "adjustment" is for the number of variables in the model.

R²adj = 1 - [SSE/(n-K-1)] / [SST/(n-1)]

Although regular R² may decrease when you remove a variable, the adjusted version may actually increase if that variable did not have much significance.
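Applying the formula to the Meddicorp ANOVA table (n = 25, K = 2) recovers the R-Sq(adj) value printed by Minitab:

```python
# Sums of squares from the ANOVA table above
sse, sst = 181176, 1248974
n, K = 25, 2

r2 = 1 - sse / sst                                  # regular R^2
r2_adj = 1 - (sse / (n - K - 1)) / (sst / (n - 1))  # adjusted version
print(round(r2, 3), round(r2_adj, 3))  # about 0.855 and 0.842
```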
4.3.2 The F Statistic

Since R² is so high, you would certainly think that the model contains significant predictive power.

In other problems it is perhaps not so obvious. For example, would an R² of 20% show any prediction ability at all?

We can test for the predictive power of the entire model using the F statistic.
F Tests

Generally these compare two sources of variation.

F = V1/V2 and has two df parameters.

Here V1 = SSR/K has K df, and V2 = SSE/(n-K-1) has n-K-1 df.
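Computed from the Meddicorp ANOVA table (n = 25, K = 2), this reproduces the F printed by Minitab:

```python
# Sums of squares from the ANOVA table
ssr, sse = 1067797, 181176
n, K = 25, 2

v1 = ssr / K             # regression mean square, K numerator df
v2 = sse / (n - K - 1)   # error mean square, n-K-1 denominator df
f = v1 / v2
print(round(f, 2))  # 64.83, matching the output
```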
F Tables

Usually you will see several pages of these: one or two pages at each specific level of significance (.10, .05, .01).

[Table layout: the value of F at a specific significance level, with numerator d.f. across the columns and denominator d.f. down the rows.]
F Test Hypotheses

H0: β1 = β2 = ... = βK = 0 (None of the Xs help explain Y)
Ha: Not all βs are 0 (At least one X is useful)

H0: R² = 0 is an equivalent hypothesis.
F test for our example

S = 90.75   R-Sq = 85.5%   R-Sq(adj) = 84.2%

Analysis of Variance

Source          DF       SS      MS      F      P
Regression       2  1067797  533899  64.83  0.000
Residual Error  22   181176    8235
Total           24  1248974

F = 533899 / 8235 = 64.83 has p-value = 0.000

From tables, F(2,22,.05) = 3.44 and F(2,22,.01) = 5.72

This confirms that R² = 85.5% is not near zero.
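The table lookups can be reproduced in code; this sketch assumes SciPy, whose ppf method inverts the F distribution's CDF:

```python
from scipy import stats

# Critical values of F with 2 numerator and 22 denominator df
f_05 = stats.f.ppf(0.95, dfn=2, dfd=22)   # 5% significance level
f_01 = stats.f.ppf(0.99, dfn=2, dfd=22)   # 1% significance level
print(round(f_05, 2), round(f_01, 2))  # about 3.44 and 5.72
```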