1 4. Multiple Regression I ECON 251 Research Methods


4. Multiple Regression I

ECON 251

Research Methods


In this section, we extend the simple linear regression model, and allow for any number (k) of independent variables. This should yield a better model in most cases.

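$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon$

Here $y$ is the dependent variable, $x_1, \dots, x_k$ are the independent variables, $\beta_0, \dots, \beta_k$ are the coefficients, and $\varepsilon$ is the random error variable.

We add Adjusted $R^2$ to our model assessment tools. Because of the complexity of the calculations, we will rely exclusively on the computer to do our model estimation.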

Basic Multiple Regression Model


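The simple linear regression model allows for one independent variable, "x":

$y = \beta_0 + \beta_1 x + \varepsilon$

The multiple linear regression model allows for more than one independent variable:

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$

Note how the straight line $y = \beta_0 + \beta_1 x$ becomes a plane, $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$.

[Figure: the regression plane plotted against axes X1, X2, and Y]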

Basic Multiple Regression Model


One of the most important aspects of regression analysis is verifying that our results are not being impacted by assumption violations or “other dangers.” That is why we return to this important topic. In this section, we will be looking for solutions to instances where we encounter problems.

Recall our list of "Assumption Violations & Other Dangers":

• The error term ($\varepsilon$) is properly distributed, which means:
1. The probability distribution of $\varepsilon$ is normal, with a mean of 0.
2. The standard deviation of $\varepsilon$ is $\sigma_\varepsilon$, a constant, for all values of x.
3. The errors associated with different values of y are all independent.

• Other assumptions that, when violated, can threaten the usefulness of your results include:
4. No unnecessary outliers
5. No serious multicollinearity

Regression Diagnostics


Assumptions #1 and #2 – Remedying Violations

We discussed both assumptions in the last section, as well as how to detect violations of them by visually inspecting graphs.

Nonnormality or heteroscedasticity can be remedied using transformations on the y variable.

The transformations can improve the linear relationship between the dependent variable and the independent variables.

Many computer software systems allow us to make the transformations easily.


» $y' = \ln y$ (for y > 0)
  - Use when $\sigma_\varepsilon$ increases with y, or
  - Use when the error distribution is positively skewed

» $y' = y^2$
  - Use when $\sigma_\varepsilon^2$ is proportional to E(y), or
  - Use when the error distribution is negatively skewed

» $y' = y^{1/2}$ (for y > 0)
  - Use when $\sigma_\varepsilon^2$ is proportional to E(y)

» $y' = 1/y$
  - Use when $\sigma_\varepsilon^2$ increases significantly when y increases beyond some value.

A brief list of transformations
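
As a minimal sketch of how software applies the transformations listed above, the Python snippet below maps a list of y values to the transformed values y'. The function name `transform_y` and the `kind` labels are illustrative assumptions, not part of the course material.

```python
import math

def transform_y(y_values, kind):
    """Apply one of the four listed transformations to each y value.

    `kind` is an illustrative label: "log", "square", "sqrt", or "reciprocal".
    """
    transforms = {
        "log": math.log,                  # y' = ln y, requires y > 0
        "square": lambda y: y ** 2,       # y' = y^2
        "sqrt": math.sqrt,                # y' = y^(1/2), the slides restrict it to y > 0
        "reciprocal": lambda y: 1.0 / y,  # y' = 1/y, requires y != 0
    }
    if kind not in transforms:
        raise ValueError(f"unknown transformation: {kind}")
    return [transforms[kind](y) for y in y_values]

# Two scores (18 and 23) that appear in the quiz example of this section.
log_scores = transform_y([18, 23], "log")
print([round(v, 3) for v in log_scores])
```

The regression is then re-estimated with y' in place of y, and the diagnostics are checked again.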


Example – Quiz Score

A statistics professor wanted to know whether the time limit affects the scores on a quiz. A random sample of 100 students was split into 5 groups. Each student wrote a quiz, but each group was given a different time limit. See data below.

Time (minutes)   40   45   50   55   60
Score            20   24   26   30   32
                 23   26   25   32   31
                 .    .    .    .    .
                 .    .    .    .    .

Analyze these results, and include diagnostics.


The model tested: $\text{SCORE} = \beta_0 + \beta_1\,\text{TIME} + \varepsilon$

Regression Statistics
Multiple R       0.8625395
R Square         0.7439744
Adj. R Square    0.7413619
Standard Error   2.3046094
Observations     100

ANOVA
             df   SS       MS          F             Sig. F
Regression   1    1512.5   1512.5      284.7742555   9.41548E-31
Residual     98   520.5    5.3112245
Total        99   2033

             Coeffs   S.E.        t Stat      P-value
Intercept    -2.2     1.6458203   -1.336719   0.18440922
Time         0.55     0.0325921   16.875256   9.41548E-31

[Figure: histogram of the standardized residuals]

The errors seem to be _______ distributed.

There is ________ linear relationship between time and score.

This model is ______ and provides a ______ fit.

Example – Quiz Score
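
The summary figures in this output can be reproduced from the ANOVA sums of squares alone. The sketch below assumes the standard definitions (R Square = SSR/SST, Adjusted R Square = 1 - MSE/(SST/(n-1)), Standard Error = sqrt(MSE), F = MSR/MSE); the function name `fit_statistics` is a made-up helper for illustration.

```python
import math

def fit_statistics(ss_regression, ss_residual, n, k):
    """Derive summary fit statistics from the ANOVA sums of squares.

    n is the number of observations, k the number of independent variables.
    """
    ss_total = ss_regression + ss_residual
    ms_regression = ss_regression / k
    ms_residual = ss_residual / (n - k - 1)
    return {
        "r_square": ss_regression / ss_total,
        "adj_r_square": 1 - ms_residual / (ss_total / (n - 1)),
        "standard_error": math.sqrt(ms_residual),
        "f": ms_regression / ms_residual,
    }

# Sums of squares from the SCORE-on-TIME output (n = 100, k = 1).
stats = fit_statistics(1512.5, 520.5, n=100, k=1)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

Adjusted R Square penalizes each added independent variable through the n - k - 1 divisor, which is why it is the preferred figure when comparing models with different numbers of variables.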


The standard error of estimate seems to __________ with the predicted value of y.

Two transformations are used to remedy this problem:
1. $y' = \ln y$
2. $y' = 1/y$

Example – Quiz Score

[Figure: Standard Residuals vs Predicted Score]


Let us see what happens when a transformation is applied.

[Figure: Score vs. Time] The original data, where "Score" is a function of "Time".

[Figure: Ln Score vs. Time] The modified data, where "LnScore" is a function of "Time".

For example, the points (40, 18) and (40, 23) in the original data become (40, 2.89) and (40, 3.135) in the modified data, because ln 18 = 2.89 and ln 23 = 3.135.

Example – Quiz Score


The new regression analysis and the diagnostics are:

The model tested: $\text{LnScore} = \beta'_0 + \beta'_1\,\text{TIME} + \varepsilon'$

SUMMARY OUTPUT

Regression Statistics
Multiple R     0.8783
R Square       0.771412
Adj R Sq       0.769079
Std Error      0.084437
Observations   100

ANOVA
             df   SS         MS         F          Sig F
Regression   1    2.357901   2.357901   330.7181   3.58E-33
Residual     98   0.698705   0.00713
Total        99   3.056606

             Coeffs     S.E.       t Stat     P-value
Intercept    2.129582   0.0603     35.31632   1.51E-57
Time         0.021716   0.001194   18.18566   3.58E-33

Predicted LnScore = 2.1295 + 0.0217 Time

This model is _______ and provides a ________ fit.

Example – Quiz Score


[Figure: histogram of the standardized residuals]

The errors seem to be _________ distributed.

[Figure: Standard Residuals vs Predicted LnScore]

The standard error still changes with the predicted y, but the change is _______ than before.

Example – Quiz Score


Example – Quiz Score

How do we use the modified model to predict?

Let TIME = 55 minutes:

$\text{LnScore} = 2.1295 + 0.0217 \times \text{Time} = 2.1295 + 0.0217 \times 55 = 3.323$

To find the predicted score, take the antilog. Using the unrounded coefficients from the output gives $\text{LnScore} = 3.32396$, so:

$\text{antilog}_e(3.32396) = e^{3.32396} = 27.770$

If 55 minutes is given for the quiz, we expect the score to be 27.770.

Find the predicted score if 50 minutes are given for the quiz.
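
The prediction steps above can be sketched in Python as follows. The coefficients are the unrounded values from the LnScore output; the function name `predict_score` is an illustrative choice.

```python
import math

# Coefficients copied from the LnScore regression output (unrounded).
INTERCEPT = 2.129582
SLOPE_TIME = 0.021716

def predict_score(time_minutes):
    """Predict ln(score) from the fitted line, then take the antilog e^x."""
    ln_score = INTERCEPT + SLOPE_TIME * time_minutes
    return math.exp(ln_score)

print(round(predict_score(55), 2))
```

Carrying the unrounded coefficients through matters here: rounding LnScore to 3.323 before taking the antilog shifts the predicted score in the second decimal place.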


For Time = 50:

Example – Quiz Score


Multicollinearity exists when independent variables included in the same regression are linearly related to one another.

Multicollinearity nearly always exists. We will (somewhat arbitrarily) consider it serious if the absolute value of the correlation coefficient between two independent variables exceeds 0.8.

Example – House Price

• A real estate agent believes that a house selling price can be predicted using the house size, number of bedrooms, and lot size.

• A random sample of 100 houses was drawn and data recorded.

• Analyze the relationship among the four variables.

Price    Bedrooms   H Size   Lot Size
124100   3          1290     3900
218300   4          2080     6600
117800   3          1250     3750
.        .          .        .
.        .          .        .

Assumption #5 Violation – Serious Multicollinearity


The proposed model is

$\text{PRICE} = \beta_0 + \beta_1\,\text{BEDROOMS} + \beta_2\,\text{H-SIZE} + \beta_3\,\text{LOTSIZE} + \varepsilon$

Excel solution:

Regression Statistics
Multiple R     0.74833
R Square       0.5599977
Adj R Square   0.5462477
Std Error      25022.708
Observations   100

ANOVA
             df   SS            MS            F           Sig F
Regression   3    76501718347   25500572782   40.726898   4.56894E-17
Residual     96   60109046053   626135896.4
Total        99   1.36611E+11

            Coeffs      S.E.          t Stat         P-value
Intercept   37717.595   14176.74195   2.660526279    0.0091448
Bedrooms    2306.0808   6994.19244    0.329713665    0.7423347
H Size      74.296806   52.97857934   1.402393325    0.1640233
Lot Size    -4.363783   17.0240013    -0.256331212   0.7982436

The model is ____, but no variable is significantly related to the selling price!

Example – House Price


However,
• when regressing the price on each independent variable alone, it is found that each variable is strongly related to the selling price.
• Multicollinearity is the source of this problem.

           Price      Bedrooms   H Size     Lot Size
Price      1
Bedrooms   0.645411   1
H Size     0.747762   0.846454   1
Lot Size   0.740874   0.83743    0.993615   1

Multicollinearity inflates the standard errors of the coefficients, $s_{b_i}$:
• Bringing t-stats closer to zero and insignificance.
• The coefficients cannot be interpreted as "slopes".

Example – House Price
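
The 0.8 rule of thumb can be applied mechanically to the correlation matrix above. The following is a small sketch, with the correlations typed in from that matrix; the helper name `serious_pairs` is an assumption made for illustration.

```python
# Pairwise correlations copied from the matrix in this example.
correlations = {
    ("Price", "Bedrooms"): 0.645411,
    ("Price", "H Size"): 0.747762,
    ("Price", "Lot Size"): 0.740874,
    ("Bedrooms", "H Size"): 0.846454,
    ("Bedrooms", "Lot Size"): 0.83743,
    ("H Size", "Lot Size"): 0.993615,
}

THRESHOLD = 0.8      # the (somewhat arbitrary) cutoff for "serious"
DEPENDENT = "Price"  # correlations with the dependent variable are not a concern

def serious_pairs(corr, dependent, threshold):
    """Return pairs of independent variables whose |r| exceeds the threshold."""
    return sorted(
        pair for pair, r in corr.items()
        if dependent not in pair and abs(r) > threshold
    )

flagged = serious_pairs(correlations, DEPENDENT, THRESHOLD)
for a, b in flagged:
    print(f"{a} / {b}: r = {correlations[(a, b)]}")
```

Only correlations among the independent variables are screened; a high correlation between an independent variable and Price is what we hope to find.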


Correcting for Multicollinearity: Get rid of one of the variables that is a duplicate (here Lot Size, whose correlation with House Size is 0.99), and re-estimate the model.

With this done, $R^2$ is still nearly as high as in your first model, but the p-value for "Bedrooms" remains high. Next, estimate the model with only "House Size" as a variable.

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.74812872
R Square            0.55969658
Adjusted R Square   0.55061816
Standard Error      24901.9079
Observations        100

ANOVA
             df   SS            MS         F             Significance F
Regression   2    76460577656   3.82E+10   61.65131344   5.27199E-18
Residual     97   60150186744   6.2E+08
Total        99   1.36611E+11

            Coefficients   Standard Error   t Stat     P-value       Lower 95%     Upper 95%
Intercept   36976.7497     13812.00479      2.677146   0.008719726   9563.758882   64389.741
Bedrooms    2414.08233     6947.786241      0.347461   0.728998069   -11375.3424   16203.507
H Size      61.008184      10.86414729      5.615552   1.86613E-07   39.445871     82.570497

Example – House Price


SUMMARY OUTPUT

Regression Statistics
Multiple R          0.74776237
R Square            0.55914857
Adjusted R Square   0.55465008
Standard Error      24789.9443
Observations        100

ANOVA
             df   SS            MS         F             Significance F
Regression   1    76385713057   7.64E+10   124.2971108   3.96657E-19
Residual     98   60225051343   6.15E+08
Total        99   1.36611E+11

            Coefficients   Standard Error   t Stat     P-value       Lower 95%     Upper 95%
Intercept   40066.3887     10521.4367       3.808072   0.000244249   19186.94039   60945.837
H Size      64.2034306     5.758743264      11.14886   3.96657E-19   52.77539219   75.631469

Note that $R^2$ is nearly as high as in the original model, while adjusted $R^2$ is actually higher than before. The F-test for overall validity of the model is fine, and the t-test for your independent variable is also fine. This is your final model.

Now: Estimate sale price for a house with 3 bedrooms, 2000 sq ft of house on a lot of 5,000 sq ft. Compare results of final model with original model.

Example – House Price
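
One way to carry out this kind of comparison is to evaluate both fitted equations for the same house. The sketch below uses the coefficients from the original three-variable output and from the final House Size output; the helper name `predict_price` and the dictionary layout are illustrative assumptions.

```python
def predict_price(coefficients, house):
    """Evaluate a fitted linear model: intercept plus coefficient * value."""
    price = coefficients["Intercept"]
    for name, value in house.items():
        # A variable that is not in the model contributes nothing.
        price += coefficients.get(name, 0.0) * value
    return price

# Coefficients copied from the regression outputs in this example.
original_model = {
    "Intercept": 37717.595,
    "Bedrooms": 2306.0808,
    "H Size": 74.296806,
    "Lot Size": -4.363783,
}
final_model = {"Intercept": 40066.3887, "H Size": 64.2034306}

house = {"Bedrooms": 3, "H Size": 2000, "Lot Size": 5000}

print(f"Original model: {predict_price(original_model, house):,.2f}")
print(f"Final model:    {predict_price(final_model, house):,.2f}")
```

The two point estimates are close to each other, which is consistent with the nearly identical $R^2$ values: dropping the redundant variables costs little predictive power while making the remaining coefficient interpretable.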


Non-independence of errors is common with time series data. When it exists in time series data, it is referred to as autocorrelation.

Detection:
• run a regression
• save residuals
• plot residuals against time
• if you see a pattern, your regression may have an autocorrelation problem

Assumptions #3 Violation – Non-Independence of Errors


Positive first order autocorrelation occurs when consecutive residuals tend to be similar.

[Figure: Positive first order autocorrelation, shown as y vs. Time and Residuals vs. Time]

Negative first order autocorrelation occurs when consecutive residuals tend to markedly differ.

[Figure: Negative first order autocorrelation, shown as y vs. Time and Residuals vs. Time]

Autocorrelation
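
The slides rely on a visual check of the residual plot. As a complementary sketch (an addition, not part of the course's method), the sample first order autocorrelation of the residuals summarizes the same pattern in one number: clearly positive when neighbours are similar, clearly negative when they alternate. Both residual series below are invented purely for illustration.

```python
def lag1_autocorrelation(residuals):
    """Sample first order autocorrelation of a residual series."""
    mean = sum(residuals) / len(residuals)
    centered = [e - mean for e in residuals]
    # Products of each residual with the one immediately before it.
    numerator = sum(a * b for a, b in zip(centered[1:], centered[:-1]))
    denominator = sum(e * e for e in centered)
    return numerator / denominator

# Hypothetical residuals, ordered by time.
drifting = [-3, -2, -2, -1, 0, 1, 2, 2, 3]   # consecutive residuals are similar
alternating = [2, -2, 3, -3, 2, -2, 3, -3]   # consecutive residuals differ markedly

print(round(lag1_autocorrelation(drifting), 3))
print(round(lag1_autocorrelation(alternating), 3))
```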


How does the weather affect the sales of lift tickets in a ski resort?

Data on ticket sales for the past 20 years, along with the total snowfall and the average temperature during Christmas week in each year, were collected.

The model hypothesized was

$\text{TICKETS} = \beta_0 + \beta_1\,\text{SNOWFALL} + \beta_2\,\text{TEMPERATURE} + \varepsilon$

Regression analysis yielded the following results:

Example – Lift Ticket


Regression Statistics
Multiple R     0.346453
R Square       0.12003
Adj R Square   0.016504
Std Error      1711.676
Observations   20

ANOVA
             df   SS          MS          F          Sig. F
Regression   2    6793798.2   3396899.1   1.159416   0.337271
Residual     17   49807214    2929836.1
Total        19   56601012

              Coeffs     S.E.        t Stat      P-value
Intercept     8308.011   903.7285    9.1930391   5.24E-08
Snowfall      74.59325   51.574829   1.4463111   0.166276
Temperature   -8.75374   19.704359   -0.444254   0.662462

The model seems to be very poor:

• The fit is _______ (R-square = 0.12),
• It is _________ (Signif. F = 0.34)
• No variable is ___________ to Sales

Example – Lift Ticket


[Figure: Residual vs. predicted y] The error variance is constant.

[Figure: Residual over time] The errors are ___ independent.

[Figure: histogram of the standardized residuals] The error distribution

Example – Lift Ticket


The autocorrelation has occurred over time. Therefore, a time-dependent variable added to the model may correct the problem.

The modified regression model:

$\text{TICKETS} = \beta_0 + \beta_1\,\text{SNOWFALL} + \beta_2\,\text{TEMPERATURE} + \beta_3\,\text{YEARS} + \varepsilon$

• Are all the required conditions for this model met?
• How good is the fit of this model?
• Is the model useful?
• Which variables are linearly related to ticket sales and which ones are not?

Example – Lift Ticket
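
To see why a time variable can rescue a model whose errors follow a trend, here is a minimal sketch on a made-up series (it is not the lift ticket data). The dependent variable is built from one regressor `x1` plus a steady yearly increase; the model is fitted with and without `year` and the two R-square values are compared. The helper `ols_r_square` is an illustrative pure-Python least squares routine, not a library function.

```python
def ols_r_square(columns, y):
    """Fit y on the given regressor columns plus an intercept; return R-square.

    Solves the normal equations (X'X)b = X'y by Gauss-Jordan elimination.
    """
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Augmented system [X'X | X'y].
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    beta = [A[r][p] / A[r][r] for r in range(p)]
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
    mean_y = sum(y) / n
    ss_total = sum((v - mean_y) ** 2 for v in y)
    ss_residual = sum((v - f) ** 2 for v, f in zip(y, fitted))
    return 1 - ss_residual / ss_total

# Hypothetical data: 20 periods, an arbitrary regressor, and a strong time trend.
year = list(range(1, 21))
x1 = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3, 2, 3, 8, 4]
noise = [1 if t % 2 == 0 else -1 for t in year]
y = [100 + 5 * x + 20 * t + e for x, t, e in zip(x1, year, noise)]

r2_without = ols_r_square([x1], y)       # trend is left in the residuals
r2_with = ols_r_square([x1, year], y)    # trend is absorbed by the year term
print(f"without year: {r2_without:.3f}, with year: {r2_with:.3f}")
```

Without the time variable the trend ends up in the residuals, which both depresses the fit and makes consecutive residuals similar; including it removes that pattern.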


SUMMARY OUTPUT (Ticket Sales with "Years" as 3rd independent variable)

Regression Statistics
Multiple R          0.860801945
R Square            0.740979989
Adjusted R Square   0.692413737
Standard Error      957.2354341
Observations        20

ANOVA
             df   SS            MS            F             Significance F
Regression   3    41940217.38   13980072.46   15.25709636   5.93379E-05
Residual     16   14660794.82   916299.6764
Total        19   56601012.2

              Coefficients   Standard Error   t Stat        P-value       Lower 95%      Upper 95%
Intercept     5965.587635    631.2517853      9.450409128   5.9958E-08    4627.393932    7303.781
Snowfall      70.1830592     28.85142183      2.432568475   0.027101034   9.020790993    131.3453
Temperature   -9.23280225    11.01970866      -0.8378445    0.41445918    -32.59353576   14.12793
Years         229.969972     37.13208681      6.19329512    1.28864E-05   151.2534822    308.6865

The fit of this model is _____ ($R^2$ = 0.74).

The model is _____. Significance F = 5.93E-5.

All the required conditions ______ for this model.

TEMPERATURE is ________ related to ticket sales.

SNOWFALL and YEARS ________ related to ticket sales.
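
The t Stat and 95% interval columns of this output follow directly from each coefficient and its standard error. The sketch below assumes the usual formulas (t = coefficient / standard error, interval = coefficient ± t-critical × standard error) and hard-codes 2.1199 as the two-tailed 5% critical value of t with 16 degrees of freedom; the function name is illustrative.

```python
T_CRITICAL = 2.1199  # two-tailed 5% critical value, t distribution, 16 d.f.

# name: (coefficient, standard error), copied from the output above.
coefficients = {
    "Snowfall": (70.1830592, 28.85142183),
    "Temperature": (-9.23280225, 11.01970866),
    "Years": (229.969972, 37.13208681),
}

def t_stat_and_interval(coef, std_error, t_critical):
    """Return the t statistic and the lower/upper confidence limits."""
    t_stat = coef / std_error
    margin = t_critical * std_error
    return t_stat, coef - margin, coef + margin

for name, (coef, se) in coefficients.items():
    t, lower, upper = t_stat_and_interval(coef, se, T_CRITICAL)
    excludes_zero = lower > 0 or upper < 0
    print(f"{name}: t = {t:.3f}, 95% CI = ({lower:.2f}, {upper:.2f}), "
          f"excludes zero: {excludes_zero}")
```

A 95% interval that excludes zero corresponds to a p-value below 0.05 for that coefficient, so the two columns always tell the same story.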