Class 23
• The most over-rated statistic
• The four assumptions
• The most important hypothesis test yet
• Using yes/no variables in regressions

Adjusted R-square (Pg 9-12, Pfeifer note)
Hours: 2, 4.17, 4.42, 4.75, 4.83, 6.67, 7, 7.08, 7.17, 7.17, 10, 12, 12.5, 13.67, 15.08

Hours (descriptive statistics)
Mean                7.900667
Standard Error      1.003487
Median              7.08
Mode                7.17
Standard Deviation  3.886488
Sample Variance     15.10479
Kurtosis            -0.75506
Skewness            0.524811
Range               13.08
Minimum             2
Maximum             15.08
Sum                 118.51
Count               15
Our better method of forecasting hours would use a mean of 7.9 and standard deviation of 3.89 (and the t-distribution with 14 dof)
The sample variance (15.1) is the variation in Hours that regression will try to explain.
Our better method of forecasting hours for job A would use a mean of 10.51 and standard deviation of 2.77 (and the t-distribution with 13 dof)
The squared standard error (2.77² ≈ 7.69) is the variation in Hours that regression leaves unexplained.
MSF    Hours
26     2
34.2   4.17
29     4.42
34.3   4.75
85.9   4.83
143.2  6.67
85.5   7
140.6  7.08
140.6  7.17
40.4   7.17
101    10
239.7  12
179.3  12.5
126.5  13.67
140.8  15.08
SUMMARY OUTPUT

Regression Statistics
Multiple R         0.7260033
R Square           0.5270808
Adjusted R Square  0.4907024
Standard Error     2.7735959
Observations       15

ANOVA
            df
Regression   1
Residual    13
Total       14

            Coefficients
Intercept   3.312316
MSF         0.0444895
Ŷ = 3.31 + 0.044 × 157.3 = 10.51
Adjusted R-square (Pg 9-12, Pfeifer note)
• Adjusted R-square is the percentage of variation explained.
• The initial variation is s² = 15.1.
• The variation left unexplained (after using MSF in a regression) is (standard error)² = 7.69.
• Adjusted R-square = (s² − standard error²)/s²
• Adjusted R-square = (15.1 − 7.69)/15.1 = 0.49
• The regression using MSF explained 49% of the variation in Hours.
• The "adjusting" happened in the calculation of s and the standard error.
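The arithmetic above can be checked in a couple of lines, using the full-precision values from the Excel output:

```python
# Adjusted R-square from the two variance estimates on the slide:
# the initial variation s^2 (sample variance of Hours) and the
# unexplained variation (regression standard error, squared).
s2 = 15.10479            # sample variance of Hours (n = 15)
se = 2.7735959           # regression standard error (13 dof)

adj_r2 = (s2 - se**2) / s2   # fraction of variation explained
print(round(adj_r2, 4))      # 0.4907, matching "Adjusted R Square"
```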
Adjusted R-square (Pg 9-12, Pfeifer note)
From the Pfeifer note (figure): adjusted R-square moves inversely with the standard error —
• Adj R-square = 0.0 when standard error = s (nothing explained)
• Adj R-square = 0.5 when the standard error is between s and 0
• Adj R-square = 1.0 when standard error = 0 (everything explained)
Why Pfeifer says R² is over-rated
• There is no standard for how large it should be.
  – In some situations an adjusted R² of 0.05 would be FANTASTIC. In others, an adjusted R² of 0.96 would be DISAPPOINTING.
• It has no real use.
  – Unlike the standard error, which is needed to make probability forecasts.
• It is usually redundant.
  – When comparing models, lower standard errors mean higher adjusted R².
  – The correlation coefficient (which shares the same sign as b) ≈ the square root of adjusted R².
The Coal Pile Example
• The firm needed a way to estimate the weight of a coal pile (based on its dimensions)
W    D   h   d
56   20  10  15
93   25  10  20
161  30  12  24
31   15  12  10
70   20  14  13
76   20  14  13
375  40  16  32
34   15  14   8
45   20   8  16
58   20  10  15
SUMMARY OUTPUT
Regression Statistics
Multiple R         0.986792416
R Square           0.973759272
Adjusted R Square  0.960638908
Standard Error     20.56622179
Observations       10

ANOVA
            df
Regression   3
Residual     6
Total        9

            Coefficients
Intercept   -294.6954733
D           -15.12016461
h           23.02366255
d           27.62139918
96% of the variation in W is explained by this regression.
We just used MULTIPLE regression.
The Coal Pile Example
• Engineer Bob calculated the Volume of each pile and used simple regression…
100% of the variation in W is explained by this regression. The standard error went from 20.6 to 2.8!!!
W    Vol
56   2421.64
93   3992.44
161  6898.94
31   1492.26
70   3038.44
76   3038.44
375  16353.04
34   1499.06
45   2044.13
58   2421.64
SUMMARY OUTPUT
Regression Statistics
Multiple R         0.999673782
R Square           0.99934767
Adjusted R Square  0.999266129
Standard Error     2.808218162
Observations       10

ANOVA
            df
Regression   1
Residual     8
Total        9

            Coefficients
Intercept   0.668821408
Vol         0.022970159
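The slides don't show Bob's volume formula, but the Vol column matches the volume of a conical frustum, V = πh(D² + Dd + d²)/12, applied to each pile's dimensions. A sketch, assuming that is what Bob computed:

```python
import math

def frustum_volume(D, h, d):
    """Volume of a conical frustum with base diameter D,
    height h, and top diameter d (assumed shape of a coal pile)."""
    return math.pi * h * (D**2 + D*d + d**2) / 12

# Check against the first, second, and largest piles in the table.
print(round(frustum_volume(20, 10, 15), 2))   # 2421.64
print(round(frustum_volume(25, 10, 20), 2))   # 3992.44
print(round(frustum_volume(40, 16, 32), 2))   # 16353.04
```

Collapsing three correlated dimensions into the one physically meaningful predictor is why the simple regression beats the multiple regression here.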
The Four Assumptions
• Linearity
• Independence
  – The n observations were sampled independently from the same population.
• Homoskedasticity
  – All Y's given X share a common σ.
• Normality
  – The probability distribution of Y│X is normal.
  – Errors are normal. The Y's don't have to be.
[Diagnostic plots: a scatter plot of Y vs. X, a plot of residuals vs. fitted values, and a histogram of residual frequencies.]
Sec 5 of Pfeifer note; Sec 12.4 of EMBS
The four assumptions
• Linearity
• Independence (all 15 points count equally)
• Homoskedasticity
• Normality
Hypotheses
• H0: P = 0.5 (LTT, wunderdog)
• H0: Independence (supermarket job and response, treatment and heart attack, light and myopia, tosser and outcome)
• H0: μ = 100 (IQ)
• H0: μM = μF (heights, weights, batting averages)
• H0: μcompact = μmid = μlarge (displacement)
P 13 of Pfeifer note; Sec 12.5 of EMBS
H0: b=0
• b = 0 means X and Y are independent.
  – In this way it's like the chi-squared independence test… for numerical variables.
• b = 0 means don't use X to forecast Y.
  – Don't put X in the regression equation.
• b = 0 means just use Ȳ (the sample mean) to forecast Y.
• b = 0 means the "true" adjusted R-square is zero.
Testing b=0 is EASY!!!
• H0: μ = 100
  – t = (x̄ − 100)/(s/√n)
  – P-value from the t.dist with n−1 dof
• H0: b = 0
  – t = (b̂ − 0)/(standard error of the coefficient)
  – P-value from the t.dist with n−2 dof
            Coefficients  Standard Error  t Stat  P-value  Lower 95%  Upper 95%
Intercept   3.3123        1.4021          2.3624  0.0344   0.2832     6.3414
MSF         0.0445        0.0117          3.8064  0.0022   0.0192     0.0697

"Standard Error" here is the standard error of the coefficient b̂; "t Stat" is the t-stat to test b = 0; "P-value" is the 2-tailed p-value.
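The t Stat and the 95% interval for MSF follow directly from the coefficient and its standard error; 2.1604 is the two-tailed 95% t value for 13 dof:

```python
b_hat = 0.0444895        # MSF coefficient
se_b = 0.0117            # its standard error (rounded in the table)
t_crit = 2.1604          # t value for 95% confidence, 13 dof

t_stat = (b_hat - 0) / se_b        # ≈ 3.80 (the table's 3.8064 uses the unrounded se)
lower = b_hat - t_crit * se_b      # ≈ 0.0192
upper = b_hat + t_crit * se_b      # ≈ 0.0698 (table shows 0.0697 from unrounded inputs)
print(round(t_stat, 2), round(lower, 4), round(upper, 4))
```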
Using Yes/No variables in Regression
Car  Class    Displacement  Fuel Type  Hwy MPG
1    Midsize  3.5           R          28
2    Midsize  3             R          26
3    Large    3             P          26
4    Large    3.5           P          25
.    .        .             .          .
.    .        .             .          .
58   Compact  6             P          20
59   Midsize  2.5           R          30
60   Midsize  2             R          32

(Class and Fuel Type are categorical; Displacement and Hwy MPG are numerical; n = 60.)

Sec 8 of Pfeifer note; Sec 13.7 of EMBS
Does MPG "depend" on fuel type?
Fuel type (yes/no) and MPG (numerical)
• Un-stack the data so there are two columns of MPG data.
• Data Analysis → t-Test: Two-Sample Assuming Equal Variances
t-Test: Two-Sample Assuming Equal Variances
                              P         R
Mean                          24.33333  27.70833
Variance                      12.4      9.519928
Observations                  36        24
Pooled Variance               11.2579
Hypothesized Mean Difference  0
df                            58
t Stat                        -3.81704
P(T<=t) one-tail              0.000165
t Critical one-tail           1.671553
P(T<=t) two-tail              0.000331
t Critical two-tail           2.001717
H0: μP = μR
Or
H0: μP – μR = 0
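The t Stat can be reproduced from the summary statistics alone, using the pooled-variance formula:

```python
import math

mean_p, var_p, n_p = 24.33333, 12.4, 36       # Premium
mean_r, var_r, n_r = 27.70833, 9.519928, 24   # Regular

# Pooled variance: dof-weighted average of the two sample variances.
sp2 = ((n_p - 1) * var_p + (n_r - 1) * var_r) / (n_p + n_r - 2)
t_stat = (mean_p - mean_r) / math.sqrt(sp2 * (1/n_p + 1/n_r))

print(round(sp2, 4))     # 11.2579, the Pooled Variance line
print(round(t_stat, 5))  # -3.81704, the t Stat line
```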
Using Yes/No variables in Regression
1. Convert the categorical variable into a 1/0 DUMMY variable.
   – Use an IF statement to do this.
   – It won't matter which category is assigned 1 and which 0.
   – It doesn't even matter which two numbers you assign to the two categories (regression will adjust).
2. Regress MPG (numerical) on DUMMY (1/0 numerical)
3. Test H0: b=0 using the regression output.
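Step 2 with a 1/0 dummy reproduces the two group means exactly: the intercept is the mean of the D=0 group and the slope is the difference in means. A sketch using only the nine visible rows of the MPG data (so the numbers differ from the full-sample output, which uses all 60 cars):

```python
# Dummy-variable regression by hand: with x in {0, 1}, OLS gives
# intercept = mean of the D=0 group, slope = difference in group means.
dprem = [0, 0, 1, 1, 1, 1, 1, 0, 0]            # R=0, P=1 (visible rows only)
mpg   = [28, 26, 26, 25, 21, 25, 20, 30, 32]

n = len(dprem)
xbar = sum(dprem) / n
ybar = sum(mpg) / n
slope = (sum((x - xbar) * (y - ybar) for x, y in zip(dprem, mpg))
         / sum((x - xbar) ** 2 for x in dprem))
intercept = ybar - slope * xbar

mean_r = sum(y for x, y in zip(dprem, mpg) if x == 0) / dprem.count(0)
mean_p = sum(y for x, y in zip(dprem, mpg) if x == 1) / dprem.count(1)
print(intercept, slope)        # equals (mean_r, mean_p - mean_r)
```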
Using Yes/No variables in Regression

Fuel Type  Dprem  Hwy MPG
R          0      28
R          0      26
P          1      26
P          1      25
.          .      .
P          1      21
P          1      25
P          1      20
R          0      30
R          0      32
SUMMARY OUTPUT
Regression Statistics
Adj R Square 0.1870
Standard Error 3.3553
Observations 60
ANOVA
df SS MS F Sig F
Regression 1 164.025 164.025 14.570 3.306E-04
Residual 58 652.958 11.258
Total 59 816.983
Coeff Std Error t Stat P-value
Intercept 27.708 0.6849 40.4564 3.321E-44
Dprem -3.375 0.8842 -3.8170 3.306E-04
Regression with one Dummy variable
When D = 0, Ŷ = 27.708
When D = 1, Ŷ = 27.708 − 3.375 = 24.333
For Regular (D = 0), the forecast is 27.7.
For Premium (D = 1), the forecast is 24.3.
H0: μP = μR
or H0: μP – μR = 0
or H0: b = 0
What we learned today
• We learned about "adjusted R-square"
  – The most over-rated statistic of all time.
• We learned the four assumptions required to use regression to make a probability forecast of Y│X.
  – And how to check each of them.
• We learned how to test H0: b=0.
  – And why this is such an important test.
• We learned how to use a yes/no variable in a regression.
  – Create a dummy variable.