Class 23: The most over-rated statistic, the four assumptions, the most important hypothesis test yet, using yes/no variables in regressions




Page 1:

Class 23

• The most over-rated statistic
• The four assumptions
• The most important hypothesis test yet
• Using yes/no variables in regressions

Page 2:

Adjusted R-square (Pg 9-12, Pfeifer note)

Hours
2, 4.17, 4.42, 4.75, 4.83, 6.67, 7, 7.08, 7.17, 7.17, 10, 12, 12.5, 13.67, 15.08

Descriptive statistics for Hours:
Mean               7.900667
Standard Error     1.003487
Median             7.08
Mode               7.17
Standard Deviation 3.886488
Sample Variance    15.10479
Kurtosis           -0.75506
Skewness           0.524811
Range              13.08
Minimum            2
Maximum            15.08
Sum                118.51
Count              15

Our better method of forecasting hours would use a mean of 7.9 and a standard deviation of 3.89 (and the t-distribution with 14 dof).

The sample variance (15.1) is the variation in Hours that regression will try to explain.
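The slide's mean-and-t-distribution forecast can be sketched in Python (stdlib only; the multiplier t_{0.975,14} ≈ 2.145 comes from a t table, and the sqrt(1 + 1/n) factor is a common refinement for forecasting a new observation, not shown on the slide):

```python
import math

# Sample statistics for Hours (from the slide)
mean, sd, n = 7.900667, 3.886488, 15

# 95% probability forecast interval for a new job's hours, using the
# t-distribution with n-1 = 14 dof. The sqrt(1 + 1/n) term accounts for
# estimation error in the sample mean (an assumption-labeled refinement).
t_975 = 2.145  # t_{0.975, 14} from the t table
half = t_975 * sd * math.sqrt(1 + 1 / n)
print(mean - half, mean + half)
```

This is the "no regression" baseline that the MSF regression on the next slides tries to beat.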

Page 3:

Our better method of forecasting hours for job A would use a mean of 10.51 and a standard deviation of 2.77 (and the t-distribution with 13 dof).

The squared standard error is the variation in Hours that regression leaves unexplained.

MSF    Hours
26     2
34.2   4.17
29     4.42
34.3   4.75
85.9   4.83
143.2  6.67
85.5   7
140.6  7.08
140.6  7.17
40.4   7.17
101    10
239.7  12
179.3  12.5
126.5  13.67
140.8  15.08

SUMMARY OUTPUT

Regression Statistics
Multiple R        0.7260033
R Square          0.5270808
Adjusted R Square 0.4907024
Standard Error    2.7735959
Observations      15

ANOVA
df: Regression 1, Residual 13, Total 14

Coefficients
Intercept 3.312316
MSF       0.0444895

Ŷ = 3.31 + 0.0445 × 157.3 = 10.51
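The regression output above can be reproduced with a short script (a sketch assuming NumPy is available; the slides use Excel's Data Analysis tool):

```python
import numpy as np

# MSF and Hours data from the slide (15 jobs)
msf = np.array([26, 34.2, 29, 34.3, 85.9, 143.2, 85.5, 140.6,
                140.6, 40.4, 101, 239.7, 179.3, 126.5, 140.8])
hours = np.array([2, 4.17, 4.42, 4.75, 4.83, 6.67, 7, 7.08,
                  7.17, 7.17, 10, 12, 12.5, 13.67, 15.08])

# Simple least-squares regression of Hours on MSF
slope, intercept = np.polyfit(msf, hours, 1)
print(intercept, slope)  # matches the Excel output: 3.312316 and 0.0444895
```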

Adjusted R-square (Pg 9-12, Pfeifer note)

Page 4:

• Adjusted R-square is the percentage of variation explained.
• The initial variation is s2 = 15.1.
• The variation left unexplained (after using MSF in a regression) is (standard error)2 = 7.69.
• Adjusted R-square = (s2 − standard error2)/s2
• Adjusted R-square = (15.1 − 7.69)/15.1 = 0.49
• The regression using MSF explained 49% of the variation in hours.
• The "adjusted" happened in the calculation of s and standard error.
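The arithmetic in these bullets is easy to check directly:

```python
# Adjusted R-square from the two variance estimates on the slide:
# s^2 is the sample variance of Hours (14 dof) and se^2 is the squared
# regression standard error (13 dof) -- both already "adjusted" by dof.
s2 = 15.10479           # sample variance of Hours
se2 = 2.7735959 ** 2    # squared regression standard error

adj_r2 = (s2 - se2) / s2
print(round(adj_r2, 4))  # 0.4907, matching the regression output
```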

Adjusted R-square (Pg 9-12, Pfeifer note)

Page 5:

From the Pfeifer note: Adj R-square = 0.0 corresponds to standard error = s; Adj R-square = 1.0 corresponds to standard error = 0 (Adj R-square = 0.5 lies in between).

Page 6:

Why Pfeifer says R2 is over-rated
• There is no standard for how large it should be.
  – In some situations an adjusted R2 of 0.05 would be FANTASTIC. In others, an adjusted R2 of 0.96 would be DISAPPOINTING.
• It has no real use.
  – Unlike the standard error, which is needed to make probability forecasts.
• It is usually redundant.
  – When comparing models, lower standard errors mean higher adjusted R2.
  – The correlation coefficient (which shares the same sign as b) ≈ the square root of adjusted R2.

Page 7:

The Coal Pile Example

• The firm needed a way to estimate the weight of a coal pile (based on its dimensions)

W    D   h   d
56   20  10  15
93   25  10  20
161  30  12  24
31   15  12  10
70   20  14  13
76   20  14  13
375  40  16  32
34   15  14  8
45   20  8   16
58   20  10  15

SUMMARY OUTPUT

Regression Statistics
Multiple R        0.986792416
R Square          0.973759272
Adjusted R Square 0.960638908
Standard Error    20.56622179
Observations      10

ANOVA
df: Regression 3, Residual 6, Total 9

Coefficients
Intercept -294.6954733
D          -15.12016461
h           23.02366255
d           27.62139918

96% of the variation in W is explained by this regression.

We just used MULTIPLE regression.
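The same multiple regression can be sketched as follows (assuming NumPy; the data are the W, D, h, d columns from the table above):

```python
import numpy as np

# Coal-pile data from the slide: weight W and dimensions D, h, d
W = np.array([56, 93, 161, 31, 70, 76, 375, 34, 45, 58], dtype=float)
X = np.array([
    [20, 10, 15], [25, 10, 20], [30, 12, 24], [15, 12, 10],
    [20, 14, 13], [20, 14, 13], [40, 16, 32], [15, 14, 8],
    [20, 8, 16], [20, 10, 15],
], dtype=float)

# Multiple regression: prepend an intercept column, then solve least squares
A = np.column_stack([np.ones(len(W)), X])
coef, *_ = np.linalg.lstsq(A, W, rcond=None)
print(coef)  # intercept, then the coefficients on D, h, d
```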

Page 8:

The Coal Pile Example

• Engineer Bob calculated the Volume of each pile and used simple regression…

100% of the variation in W is explained by this regression.

Standard error went from 20.6 to 2.8!!!

W    Vol
56   2421.64
93   3992.44
161  6898.94
31   1492.26
70   3038.44
76   3038.44
375  16353.04
34   1499.06
45   2044.13
58   2421.64

SUMMARY OUTPUT

Regression Statistics
Multiple R        0.999673782
R Square          0.99934767
Adjusted R Square 0.999266129
Standard Error    2.808218162
Observations      10

ANOVA
df: Regression 1, Residual 8, Total 9

Coefficients
Intercept 0.668821408
Vol       0.022970159
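As a side note, the Vol column is consistent with the volume of a conical frustum with bottom diameter D, top diameter d, and height h, V = πh(D² + Dd + d²)/12. This reconstruction is an inference from the numbers, not something stated on the slide:

```python
import math

# Assumption: Engineer Bob's Vol column matches the conical-frustum formula
#   V = pi * h * (D^2 + D*d + d^2) / 12
def frustum_volume(D, h, d):
    return math.pi * h * (D**2 + D*d + d**2) / 12.0

# Spot-check against the first two rows of the slide's W/Vol table
print(round(frustum_volume(20, 10, 15), 2))  # 2421.64
print(round(frustum_volume(25, 10, 20), 2))  # 3992.44
```

Building the right single predictor from domain knowledge is what let simple regression beat the three-variable multiple regression.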

Page 9:

The Four Assumptions

• Linearity
• Independence
  – The n observations were sampled independently from the same population.
• Homoskedasticity
  – All Y's given X share a common σ.
• Normality
  – The probability distribution of Y│X is normal.
  – Errors are normal. Y's don't have to be.

[Figures: scatterplot of Y vs. X; residuals vs. fitted values; histogram of residuals (frequency)]

Sec 5 of Pfeifer note; Sec 12.4 of EMBS
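The residual checks behind those plots can be sketched numerically (assuming NumPy; same 15-job MSF/Hours data as before):

```python
import numpy as np

# Residual diagnostics for the Hours-on-MSF regression
msf = np.array([26, 34.2, 29, 34.3, 85.9, 143.2, 85.5, 140.6,
                140.6, 40.4, 101, 239.7, 179.3, 126.5, 140.8])
hours = np.array([2, 4.17, 4.42, 4.75, 4.83, 6.67, 7, 7.08,
                  7.17, 7.17, 10, 12, 12.5, 13.67, 15.08])

slope, intercept = np.polyfit(msf, hours, 1)
fitted = intercept + slope * msf
resid = hours - fitted

# Least-squares residuals sum to (numerically) zero; plotting resid vs
# fitted and a histogram of resid is how the slide eyeballs linearity,
# homoskedasticity, and normality.
print(abs(resid.sum()) < 1e-8)
print(np.sqrt((resid**2).sum() / (len(hours) - 2)))  # regression standard error
```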

Page 10:

Our better method of forecasting hours for job A would use a mean of 10.51 and standard deviation of 2.77 (and the t-distribution with 13 dof)

MSF    Hours
26     2
34.2   4.17
29     4.42
34.3   4.75
85.9   4.83
143.2  6.67
85.5   7
140.6  7.08
140.6  7.17
40.4   7.17
101    10
239.7  12
179.3  12.5
126.5  13.67
140.8  15.08

SUMMARY OUTPUT

Regression Statistics
Multiple R        0.7260033
R Square          0.5270808
Adjusted R Square 0.4907024
Standard Error    2.7735959
Observations      15

ANOVA
df: Regression 1, Residual 13, Total 14

Coefficients
Intercept 3.312316
MSF       0.0444895

Ŷ = 3.31 + 0.0445 × 157.3 = 10.51

The four assumptions:
• Linearity
• Independence (all 15 points count equally)
• Homoskedasticity
• Normality

Sec 5 of Pfeifer note; Sec 12.4 of EMBS

Page 11:

Hypotheses

• H0: P = 0.5 (LTT, wunderdog)
• H0: Independence (supermarket job and response, treatment and heart attack, light and myopia, tosser and outcome)
• H0: μ = 100 (IQ)
• H0: μM = μF (heights, weights, batting average)
• H0: μcompact = μmid = μlarge (displacement)

P 13 of Pfeifer note; Sec 12.5 of EMBS

Page 12:

H0: b=0

• b=0 means X and Y are independent.
  – In this way it's like the chi-squared independence test…for numerical variables.
• b=0 means don't use X to forecast Y.
  – Don't put X in the regression equation.
• b=0 means just use the sample mean ȳ to forecast Y.
• b=0 means the "true" adj R-square is zero.

P 13 of Pfeifer note; Sec 12.5 of EMBS

Page 13:

Testing b=0 is EASY!!!

• H0: μ = 100
  – t = (x̄ − 100)/(s/√n)
  – P-value from the t.dist with n−1 dof
• H0: b = 0
  – t = (b − 0)/(se of coefficient)
  – P-value from t.dist using n−2 dof
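A sketch of this test using the rounded values from the coefficient table (MSF coefficient 0.0445, standard error 0.0117, n − 2 = 13 dof; assuming SciPy for the t-distribution):

```python
from scipy import stats

# Testing H0: b = 0 for the MSF coefficient
b, se_b, dof = 0.0445, 0.0117, 13

t_stat = (b - 0) / se_b                      # about 3.80, as on the slide
p_value = 2 * stats.t.sf(abs(t_stat), dof)   # two-tailed p-value, about 0.002
print(t_stat, p_value)
```

With a p-value of roughly 0.002 we reject H0: b = 0 and keep MSF in the equation.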

MSF    Hours
26     2
34.2   4.17
29     4.42
34.3   4.75
85.9   4.83
143.2  6.67
85.5   7
140.6  7.08
140.6  7.17
40.4   7.17
101    10
239.7  12
179.3  12.5
126.5  13.67
140.8  15.08

           Coefficients  Standard Error  t Stat  P-value  Lower 95%  Upper 95%
Intercept  3.3123        1.4021          2.3624  0.0344   0.2832     6.3414
MSF        0.0445        0.0117          3.8064  0.0022   0.0192     0.0697

The standard error of the coefficient

The t-stat to test b=0.

The 2-tailed p-value.

b̂ (the coefficient estimate)

P 13 of Pfeifer note; Sec 12.5 of EMBS

Page 14:

Using Yes/No variables in Regression

Car  Class    Displacement  Fuel Type  Hwy MPG
1    Midsize  3.5           R          28
2    Midsize  3             R          26
3    Large    3             P          26
4    Large    3.5           P          25
.    .        .             .          .
.    .        .             .          .
58   Compact  6             P          20
59   Midsize  2.5           R          30
60   Midsize  2             R          32

(Class: categorical; Displacement: numerical; Fuel Type: categorical; Hwy MPG: numerical)

n=60. Sec 8 of Pfeifer note; Sec 13.7 of EMBS

Does MPG "depend" on fuel type?

Page 15:

Fuel type (yes/no) and MPG (numerical)

• Un-stack the data so there are two columns of MPG data.
• Data Analysis, t-Test: Two-Sample.

t-Test: Two-Sample Assuming Equal Variances

                              P          R
Mean                          24.33333   27.70833
Variance                      12.4       9.519928
Observations                  36         24
Pooled Variance               11.2579
Hypothesized Mean Difference  0
df                            58
t Stat                        -3.81704
P(T<=t) one-tail              0.000165   0.999835
t Critical one-tail           1.671553
P(T<=t) two-tail              0.000331
t Critical two-tail           2.001717

Sec 8 of Pfeifer note; Sec 13.7 of EMBS

H0: μP = μR

Or

H0: μP – μR = 0
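Excel's equal-variances t-test can be reproduced from the summary statistics alone (a sketch assuming SciPy):

```python
import math
from scipy import stats

# Summary statistics from the slide's t-test output
mean_p, var_p, n_p = 24.33333, 12.4, 36
mean_r, var_r, n_r = 27.70833, 9.519928, 24

# Pooled variance and t statistic (equal-variances form)
sp2 = ((n_p - 1) * var_p + (n_r - 1) * var_r) / (n_p + n_r - 2)
t_stat = (mean_p - mean_r) / math.sqrt(sp2 * (1 / n_p + 1 / n_r))
p_two_tail = 2 * stats.t.sf(abs(t_stat), n_p + n_r - 2)
print(sp2, t_stat, p_two_tail)  # 11.2579, -3.817, 0.000331 as on the slide
```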

Page 16:

Using Yes/No variables in Regression

1. Convert the categorical variable into a 1/0 DUMMY variable.
   – Use an IF statement to do this.
   – It won't matter which category is assigned 1 and which 0.
   – It doesn't even matter which two numbers you assign to the two categories (regression will adjust).

2. Regress MPG (numerical) on DUMMY (1/0 numerical).

3. Test H0: b=0 using the regression output.

Sec 8 of Pfeifer note; Sec 13.7 of EMBS
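The three steps can be sketched as follows (assuming NumPy; the seven-car dataset here is hypothetical, not the slide's 60 cars). With a 0/1 dummy, the intercept equals the mean of the 0 group and the slope equals the difference between group means:

```python
import numpy as np

# Hypothetical mini-example of the dummy-variable steps
fuel = np.array(["R", "R", "R", "P", "P", "P", "P"])
mpg = np.array([30.0, 28.0, 29.0, 25.0, 24.0, 26.0, 25.0])

dummy = (fuel == "P").astype(float)       # step 1: the IF-statement step
slope, intercept = np.polyfit(dummy, mpg, 1)  # step 2: regress MPG on DUMMY

print(intercept)          # mean MPG of the 0 (R) group
print(intercept + slope)  # mean MPG of the 1 (P) group
```

Step 3 would then read the t-stat and p-value for the dummy's coefficient off the regression output, exactly as in the H0: b=0 test.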

Page 17:

Using Yes/No variables in Regression

Fuel Type  Dprem  Hwy MPG
R          0      28
R          0      26
P          1      26
P          1      25
.          .      .
P          1      21
P          1      25
P          1      20
R          0      30
R          0      32

SUMMARY OUTPUT

Regression Statistics
Adj R Square   0.1870
Standard Error 3.3553
Observations   60

ANOVA
            df  SS       MS       F       Sig F
Regression  1   164.025  164.025  14.570  3.306E-04
Residual    58  652.958  11.258
Total       59  816.983

            Coeff    Std Error  t Stat    P-value
Intercept   27.708   0.6849     40.4564   3.321E-44
Dprem       -3.375   0.8842     -3.8170   3.306E-04

Sec 8 of Pfeifer note; Sec 13.7 of EMBS

Page 18:

Regression with one Dummy variable

When D=0, Ŷ = 27.708

When D=1, Ŷ = 27.708 - 3.375 = 24.333

For Regular, Ŷ = 27.7

For premium, Ŷ = 24.3

H0: μP = μR

Or

H0: μP – μR = 0

Or

H0: b = 0

Page 19:

What we learned today

• We learned about "adjusted R square".
  – The most over-rated statistic of all time.
• We learned the four assumptions required to use regression to make a probability forecast of Y│X.
  – And how to check each of them.
• We learned how to test H0: b=0.
  – And why this is such an important test.
• We learned how to use a yes/no variable in a regression.
  – Create a dummy variable.