4. Multiple Regression I
ECON 251
Research Methods
In this section, we extend the simple linear regression model to allow for any number (k) of independent variables. This should yield a better model in most cases.

y = β0 + β1x1 + β2x2 + … + βkxk + ε

where y is the dependent variable, x1, …, xk are the independent variables, β0, …, βk are the coefficients, and ε is the random error variable.

We add Adjusted R² to our model assessment tools. Because of the complexity of the calculations, we will rely exclusively on the computer to do our model estimation.
Basic Multiple Regression Model
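As the slides note, the estimation itself is left to the computer. As an illustrative sketch (not from the slides), the multiple regression coefficients can be estimated by ordinary least squares with NumPy; all data below are made up:

```python
import numpy as np

# Hypothetical data: y is built from known coefficients plus random error,
# matching the form y = b0 + b1*x1 + b2*x2 + e.
rng = np.random.default_rng(0)
n = 100
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 10, n)
y = 5.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(0, 0.5, n)

# Design matrix: a column of ones for the intercept, then the regressors.
X = np.column_stack([np.ones(n), x1, x2])

# Ordinary least squares estimates of b0, b1, b2.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # should be close to [5.0, 2.0, -1.5]
```

With k regressors the only change is adding more columns to the design matrix.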
The simple linear regression model allows for one independent variable, "x":
y = β0 + β1x + ε

The multiple linear regression model allows for more than one independent variable:
y = β0 + β1x1 + β2x2 + ε

[3-D plot of y against X1 and X2]
Note how the straight line becomes a plane:
y = β0 + β1x1 + β2x2
Basic Multiple Regression Model
One of the most important aspects of regression analysis is verifying that our results are not being impacted by assumption violations or “other dangers.” That is why we return to this important topic. In this section, we will be looking for solutions to instances where we encounter problems.
Recall Our List of "Assumption Violations & Other Dangers":
• The error (ε) term is properly distributed. Which means:
1. The probability distribution of ε is normal, with a mean of 0.
2. The standard deviation of ε is a constant (σε) for all values of x.
3. The set of errors associated with different values of y are all independent.
• Other assumptions that, when violated, can threaten the usefulness of your results include:
4. No unnecessary outliers
5. No serious multicollinearity
Regression Diagnostics
Assumptions #1 and #2 – Remedying Violations
We discussed both assumptions in the last section, as well as how to detect violations of them by visual inspection of graphs.
Nonnormality or heteroscedasticity can be remedied using transformations on the y variable.
The transformations can improve the linear relationship between the dependent variable and the independent variables.
Many computer software systems allow us to make the transformations easily.
A brief list of transformations:

» y′ = ln y (for y > 0)
  ― Use when σε increases with y, or
  ― Use when the error distribution is positively skewed
» y′ = y²
  ― Use when σε² is proportional to E(y), or
  ― Use when the error distribution is negatively skewed
» y′ = y^(1/2) (for y > 0)
  ― Use when σε² is proportional to E(y)
» y′ = 1/y
  ― Use when σε² increases significantly when y increases beyond some value
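Each transformation in the list is just an elementwise function of y. A minimal sketch, using hypothetical score values (not from the slides):

```python
import numpy as np

# Hypothetical quiz scores; ln, square root, and reciprocal require y > 0.
y = np.array([20.0, 24.0, 26.0, 30.0, 32.0])

y_ln = np.log(y)      # y' = ln y
y_sq = y ** 2         # y' = y^2
y_root = np.sqrt(y)   # y' = y^(1/2)
y_inv = 1.0 / y       # y' = 1/y
```

The transformed y′ then replaces y as the dependent variable, and the regression is re-run.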
Example – Quiz Score
A statistics professor wanted to know whether the time limit affects the scores on a quiz. A random sample of 100 students was split into 5 groups. Each student wrote a quiz, but each group was given a different time limit. See the data below.
Time:   40  45  50  55  60
Score:  20  24  26  30  32
        23  26  25  32  31
        ..  ..  ..  ..  ..

Analyze these results, and include diagnostics.
[Histogram of the standardized residuals]
Regression Statistics
Multiple R      0.8625395
R Square        0.7439744
Adj. R Square   0.7413619
Standard Error  2.3046094
Observations    100

ANOVA
            df   SS      MS         F            Sig. F
Regression  1    1512.5  1512.5     284.7742555  9.41548E-31
Residual    98   520.5   5.3112245
Total       99   2033

            Coeffs  S.E.       t Stat     P-value
Intercept   -2.2    1.6458203  -1.336719  0.18440922
Time        0.55    0.0325921  16.875256  9.41548E-31
The model tested: SCORE = β0 + β1TIME + ε
The errors seem to be _______ distributed.
There is ________ linear relationship between time and score.
This model is ______ and provides a ______ fit.
Example – Quiz Score
The standard error of estimate seems to __________ with the predicted value of y.

Two transformations are used to remedy this problem:
1. y′ = ln y
2. y′ = 1/y
Example – Quiz Score
Standard Residuals vs Predicted Score
[Scatter plot of standardized residuals vs. predicted score]
Let us see what happens when a transformation is applied.

[Scatter plot: the original data, where "Score" is a function of "Time"]
[Scatter plot: the modified data, where "LnScore" is a function of "Time"]

For example, the point (40, 23) becomes (40, 3.135) since ln 23 = 3.135, and the point (40, 18) becomes (40, 2.89) since ln 18 = 2.89.
Example – Quiz Score
The new regression analysis and the diagnostics are:
The model tested: LnScore = β′0 + β′1TIME + ε′

SUMMARY OUTPUT
Regression Statistics
Multiple R    0.8783
R Square      0.771412
Adj R Sq      0.769079
Std Error     0.084437
Observations  100

ANOVA
            df   SS        MS        F         Sig F
Regression  1    2.357901  2.357901  330.7181  3.58E-33
Residual    98   0.698705  0.00713
Total       99   3.056606

            Coeffs    S.E.      t Stat    P-value
Intercept   2.129582  0.0603    35.31632  1.51E-57
Time        0.021716  0.001194  18.18566  3.58E-33
Predicted LnScore = 2.1295 + 0.0217 Time

This model is _______ and provides a ________ fit.
Example – Quiz Score
The errors seem to be _________ distributed.

[Histogram of standardized residuals for the transformed model]

The standard error still changes with the predicted y, but the change is _______ than before.
Example – Quiz Score
Standard Residuals vs Predicted LnScore
[Scatter plot of standardized residuals vs. predicted LnScore]
Example – Quiz Score
How do we use the modified model to predict?

Let TIME = 55 minutes:
LnScore = 2.1295 + 0.0217 × Time = 2.1295 + 0.0217 × (55) = 3.323

To find the predicted score, take the antilog:
antilog_e 3.323 = e^3.323 = 27.770

If 55 minutes are given for the quiz, we expect the score to be 27.770.
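The antilog step can be wrapped in a small helper. A sketch using the coefficients from the regression output above (`predict_score` is a made-up name):

```python
import numpy as np

# Coefficients from the fitted model: LnScore = 2.129582 + 0.021716*Time.
b0, b1 = 2.129582, 0.021716

def predict_score(time_limit):
    """Predict the quiz score by taking the antilog of the fitted LnScore."""
    ln_score = b0 + b1 * time_limit
    return float(np.exp(ln_score))

print(round(predict_score(55), 2))  # 27.77, matching the slide
```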
Find the predicted score if 50 minutes are given for the quiz.
Example – Quiz Score
Multicollinearity exists when independent variables included in the same regression are linearly related to one another.

Multicollinearity nearly always exists. We will (somewhat arbitrarily) consider it serious if the absolute value of the correlation coefficient between a pair of independent variables exceeds 0.8.
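Checking for serious multicollinearity amounts to inspecting the correlation matrix of the regressors. A sketch on made-up data (all names and numbers hypothetical), where `x3` is nearly a linear function of `x2`:

```python
import numpy as np

# Hypothetical regressors; x3 is almost a multiple of x2, so that pair
# should trip the |r| > 0.8 rule of thumb.
rng = np.random.default_rng(1)
x1 = rng.normal(3, 1, 100)
x2 = rng.normal(1500, 300, 100)
x3 = 3.0 * x2 + rng.normal(0, 100, 100)

corr = np.corrcoef([x1, x2, x3])
serious = np.abs(corr) > 0.8  # flag pairs with serious multicollinearity
print(serious[1, 2])  # True: x2 and x3 are strongly correlated
```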
Example – House Price
• A real estate agent believes that a house's selling price can be predicted using the house size, number of bedrooms, and lot size.
• A random sample of 100 houses was drawn and data recorded.
• Analyze the relationship among the four variables.
Price    Bedrooms  H Size  Lot Size
124100   3         1290    3900
218300   4         2080    6600
117800   3         1250    3750
...
Assumption #5 Violation – Serious Multicollinearity
Regression Statistics
Multiple R    0.74833
R Square      0.5599977
Adj R Square  0.5462477
Std Error     25022.708
Observations  100

ANOVA
            df   SS           MS           F          Sig F
Regression  3    76501718347  25500572782  40.726898  4.56894E-17
Residual    96   60109046053  626135896.4
Total       99   1.36611E+11

            Coeffs     S.E.         t Stat        P-value
Intercept   37717.595  14176.74195  2.660526279   0.0091448
Bedrooms    2306.0808  6994.19244   0.329713665   0.7423347
H Size      74.296806  52.97857934  1.402393325   0.1640233
Lot Size    -4.363783  17.0240013   -0.256331212  0.7982436
The proposed model is:
PRICE = β0 + β1BEDROOMS + β2H-SIZE + β3LOTSIZE + ε
• Excel solution

The model is ____, but no variable is significantly related to the selling price!
Example – House Price
However, when regressing the price on each independent variable alone, it is found that each variable is strongly related to the selling price. Multicollinearity is the source of this problem.

          Price     Bedrooms  H Size    Lot Size
Price     1
Bedrooms  0.645411  1
H Size    0.747762  0.846454  1
Lot Size  0.740874  0.83743   0.993615  1
Multicollinearity inflates the standard errors of the coefficients (Sbi):
• bringing t-stats closer to zero and insignificance
• the coefficients cannot be interpreted as "slopes"
Example – House Price
Correcting for Multicollinearity: get rid of one of the variables that is a duplicate, and re-estimate the model.

Here, "Lot Size" (the near-duplicate of "House Size") is dropped first. Given the resulting R², nearly as high as in the first model, and the still-high p-value for "Bedrooms", estimate the model with only "House Size" as a variable.
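To see why dropping a near-duplicate regressor costs so little fit, here is a sketch on made-up house-price-style data (all numbers hypothetical): R² with and without the redundant column barely differ.

```python
import numpy as np

# Hypothetical data echoing the example: lot size is almost an exact
# multiple of house size, so it adds virtually no new information.
rng = np.random.default_rng(2)
n = 100
h_size = rng.uniform(1000, 2500, n)
lot_size = 3.0 * h_size + rng.normal(0, 100, n)
price = 40000 + 64.0 * h_size + rng.normal(0, 25000, n)

def r_squared(X, y):
    """R^2 of an OLS fit of y on X (with an intercept added)."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

r2_full = r_squared(np.column_stack([h_size, lot_size]), price)
r2_drop = r_squared(h_size, price)
print(r2_full - r2_drop)  # a very small loss of fit
```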
SUMMARY OUTPUT
Regression Statistics
Multiple R         0.74812872
R Square           0.55969658
Adjusted R Square  0.55061816
Standard Error     24901.9079
Observations       100

ANOVA
            df   SS           MS        F            Significance F
Regression  2    76460577656  3.82E+10  61.65131344  5.27199E-18
Residual    97   60150186744  6.2E+08
Total       99   1.36611E+11

           Coefficients  Standard Error  t Stat    P-value      Lower 95%    Upper 95%
Intercept  36976.7497    13812.00479     2.677146  0.008719726  9563.758882  64389.741
Bedrooms   2414.08233    6947.786241     0.347461  0.728998069  -11375.3424  16203.507
H Size     61.008184     10.86414729     5.615552  1.86613E-07  39.445871    82.570497
Example – House Price
SUMMARY OUTPUT
Regression Statistics
Multiple R         0.74776237
R Square           0.55914857
Adjusted R Square  0.55465008
Standard Error     24789.9443
Observations       100

ANOVA
            df   SS           MS        F            Significance F
Regression  1    76385713057  7.64E+10  124.2971108  3.96657E-19
Residual    98   60225051343  6.15E+08
Total       99   1.36611E+11

           Coefficients  Standard Error  t Stat    P-value      Lower 95%    Upper 95%
Intercept  40066.3887    10521.4367      3.808072  0.000244249  19186.94039  60945.837
H Size     64.2034306    5.758743264     11.14886  3.96657E-19  52.77539219  75.631469
Note that R² is nearly as high as in the original model, but adjusted R² is actually higher than before. The F-test for the overall validity of the model is fine, and the t-test for your independent variable is also fine. This is your final model.

Now: estimate the sale price for a house with 3 bedrooms, 2,000 sq ft of house on a lot of 5,000 sq ft. Compare the results of the final model with the original model.
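One way to set up the comparison is to code both fitted equations, using the coefficients from the two outputs above (the function names here are made up):

```python
# Coefficients copied from the two regression outputs above.
def price_original(bedrooms, h_size, lot_size):
    # Original model: all three regressors.
    return (37717.595 + 2306.0808 * bedrooms
            + 74.296806 * h_size - 4.363783 * lot_size)

def price_final(h_size):
    # Final model: house size only.
    return 40066.3887 + 64.2034306 * h_size

# The house from the question: 3 bedrooms, 2000 sq ft house, 5000 sq ft lot.
print(round(price_original(3, 2000, 5000), 2))
print(round(price_final(2000), 2))
# The two predictions differ by roughly $3,000.
```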
Example – House Price
This condition is common with time series data. When it exists in time series data, it is referred to as autocorrelation.
Detection:
• run a regression
• save the residuals
• plot the residuals against time
• if you see a pattern, your regression may have an autocorrelation problem
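Beyond the visual check, a common numeric diagnostic (not covered in these slides) is the Durbin-Watson statistic, sketched here with NumPy:

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic: values near 2 suggest no first-order
    autocorrelation; near 0, positive; near 4, negative."""
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))

# A smooth, trending residual series (consecutive residuals similar).
smooth = np.sin(np.linspace(0, 3, 20))
# A residual series that flips sign every period.
alternating = np.array([1.0, -1.0] * 10)

print(durbin_watson(smooth))       # well below 2
print(durbin_watson(alternating))  # close to 4 (exactly 3.8 here)
```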
Assumptions #3 Violation – Non-Independence of Errors
[Plots of residuals against time]

Positive first order autocorrelation occurs when consecutive residuals tend to be similar.
Negative first order autocorrelation occurs when consecutive residuals tend to markedly differ.
Autocorrelation
How does the weather affect the sales of lift tickets in a ski resort?
Data on ticket sales for the past 20 years, along with the total snowfall and the average temperature during Christmas week in each year, were collected.

The model hypothesized was:
TICKETS = β0 + β1SNOWFALL + β2TEMPERATURE + ε
Regression analysis yielded the following results:
Example – Lift Ticket
Regression Statistics
Multiple R    0.346453
R Square      0.12003
Adj R Square  0.016504
Std Error     1711.676
Observations  20

ANOVA
            df   SS         MS         F         Sig. F
Regression  2    6793798.2  3396899.1  1.159416  0.337271
Residual    17   49807214   2929836.1
Total       19   56601012

             Coeffs    S.E.       t Stat     P-value
Intercept    8308.011  903.7285   9.1930391  5.24E-08
Snowfall     74.59325  51.574829  1.4463111  0.166276
Temperature  -8.75374  19.704359  -0.444254  0.662462
The model seems to be very poor:
• The fit is _______ (R-square = 0.12)
• It is _________ (Signif. F = 0.33)
• No variable is ___________ to Sales
Example – Lift Ticket
[Scatter plot: residual vs. predicted y]
The error variance is constant.

[Scatter plot: residual over time]
The errors are ___ independent.

[Histogram: the error distribution]
Example – Lift Ticket
The autocorrelation has occurred over time. Therefore, a time-dependent variable added to the model may correct the problem.

The modified regression model:
TICKETS = β0 + β1SNOWFALL + β2TEMPERATURE + β3YEARS + ε

Are all the required conditions for this model met?
How good is the fit of this model?
Is the model useful?
Which variables are linearly related to ticket sales and which ones are not?
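A sketch of this fix on made-up data (all coefficients and names hypothetical): generate a yearly series with a built-in time trend, then include YEARS as a regressor so the trend is absorbed into the model rather than left in the errors.

```python
import numpy as np

# Hypothetical 20-year series with a time trend in ticket sales.
rng = np.random.default_rng(3)
n = 20
years = np.arange(1, n + 1, dtype=float)
snowfall = rng.uniform(20, 80, n)
tickets = 6000 + 70.0 * snowfall + 230.0 * years + rng.normal(0, 900, n)

# Include YEARS as a regressor so the time trend is modeled explicitly.
X = np.column_stack([np.ones(n), snowfall, years])
beta, *_ = np.linalg.lstsq(X, tickets, rcond=None)
print(beta)  # estimates of the intercept, snowfall, and years coefficients
```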
Example – Lift Ticket
SUMMARY OUTPUT (Ticket Sales with "Years" as 3rd independent variable)
Regression Statistics
Multiple R         0.860801945
R Square           0.740979989
Adjusted R Square  0.692413737
Standard Error     957.2354341
Observations       20

ANOVA
            df   SS           MS           F            Significance F
Regression  3    41940217.38  13980072.46  15.25709636  5.93379E-05
Residual    16   14660794.82  916299.6764
Total       19   56601012.2

             Coefficients  Standard Error  t Stat       P-value      Lower 95%     Upper 95%
Intercept    5965.587635   631.2517853     9.450409128  5.9958E-08   4627.393932   7303.781
Snowfall     70.1830592    28.85142183     2.432568475  0.027101034  9.020790993   131.3453
Temperature  -9.23280225   11.01970866     -0.8378445   0.41445918   -32.59353576  14.12793
Years        229.969972    37.13208681     6.19329512   1.28864E-05  151.2534822   308.6865
The fit of this model is _____ (R² = 0.74).
The model is _____ (Significance F = 5.93E-5).
All the required conditions ______ for this model.
TEMPERATURE is ________ related to ticket sales.
SNOWFALL and YEARS ________ related to ticket sales.