Class 23
• The most over-rated statistic
• The four assumptions
• The most important hypothesis test yet
• Using yes/no variables in regressions

Adjusted R-square (Pg 9-12, Pfeifer note)
Hours: 2, 4.17, 4.42, 4.75, 4.83, 6.67, 7, 7.08, 7.17, 7.17, 10, 12, 12.5, 13.67, 15.08

Hours (descriptive statistics)
Mean                7.900667
Standard Error      1.003487
Median              7.08
Mode                7.17
Standard Deviation  3.886488
Sample Variance     15.10479
Kurtosis            -0.75506
Skewness            0.524811
Range               13.08
Minimum             2
Maximum             15.08
Sum                 118.51
Count               15
Our better method of forecasting hours would use a mean of 7.9 and standard deviation of 3.89 (and the t-distribution with 14 dof)
The sample variance (15.1) is the variation in Hours that regression will try to explain.
Our better method of forecasting hours for job A would use a mean of 10.51 and standard deviation of 2.77 (and the t-distribution with 13 dof)
The squared standard error (2.77² ≈ 7.69) is the variation in Hours that regression leaves unexplained.
MSF    Hours
26     2
34.2   4.17
29     4.42
34.3   4.75
85.9   4.83
143.2  6.67
85.5   7
140.6  7.08
140.6  7.17
40.4   7.17
101    10
239.7  12
179.3  12.5
126.5  13.67
140.8  15.08
SUMMARY OUTPUT

Regression Statistics
Multiple R         0.7260033
R Square           0.5270808
Adjusted R Square  0.4907024
Standard Error     2.7735959
Observations       15

ANOVA
            df
Regression   1
Residual    13
Total       14

            Coefficients
Intercept   3.312316
MSF         0.0444895
Ŷ = 3.31 + 0.044 × 157.3 = 10.51
Adjusted R-square (Pg 9-12, Pfeifer note)
• Adjusted R-square is the percentage of variation explained.
• The initial variation is s² = 15.1.
• The variation left unexplained (after using MSF in a regression) is (standard error)² = 7.69.
• Adjusted R-square = (s² − standard error²)/s²
• Adjusted R-square = (15.1 − 7.69)/15.1 = 0.49
• The regression using MSF explained 49% of the variation in Hours.
• The "adjusting" happened in the calculation of s and the standard error.
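The arithmetic above can be checked in a couple of lines, using the full-precision values from the Excel output:

```python
# Adjusted R-square from the two variance estimates on the slide:
# the initial variation s^2 (sample variance of Hours) and the
# unexplained variation (regression standard error, squared).
s2 = 15.10479            # sample variance of Hours (n = 15)
se = 2.7735959           # regression standard error (13 dof)

adj_r2 = (s2 - se**2) / s2   # fraction of variation explained
print(round(adj_r2, 4))      # 0.4907, matching "Adjusted R Square"
```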
Adjusted R-square (Pg 9-12, Pfeifer note)
From the Pfeifer note (figure): adjusted R-square moves inversely with the standard error —
• Adj R-square = 0.0 when standard error = s (nothing explained)
• Adj R-square = 0.5 when the standard error is between s and 0
• Adj R-square = 1.0 when standard error = 0 (everything explained)
Why Pfeifer says R² is over-rated
• There is no standard for how large it should be.
  – In some situations an adjusted R² of 0.05 would be FANTASTIC. In others, an adjusted R² of 0.96 would be DISAPPOINTING.
• It has no real use.
  – Unlike the standard error, which is needed to make probability forecasts.
• It is usually redundant.
  – When comparing models, lower standard errors mean higher adjusted R².
  – The correlation coefficient (which shares the same sign as b) ≈ the square root of adjusted R².
The Coal Pile Example
• The firm needed a way to estimate the weight of a coal pile (based on its dimensions)
W    D   h   d
56   20  10  15
93   25  10  20
161  30  12  24
31   15  12  10
70   20  14  13
76   20  14  13
375  40  16  32
34   15  14   8
45   20   8  16
58   20  10  15
SUMMARY OUTPUT
Regression Statistics
Multiple R         0.986792416
R Square           0.973759272
Adjusted R Square  0.960638908
Standard Error     20.56622179
Observations       10

ANOVA
            df
Regression   3
Residual     6
Total        9

            Coefficients
Intercept   -294.6954733
D           -15.12016461
h           23.02366255
d           27.62139918
96% of the variation in W is explained by this regression.
We just used MULTIPLE regression.
The Coal Pile Example
• Engineer Bob calculated the Volume of each pile and used simple regression…
100% of the variation in W is explained by this regression. The standard error went from 20.6 to 2.8!!!
W    Vol
56   2421.64
93   3992.44
161  6898.94
31   1492.26
70   3038.44
76   3038.44
375  16353.04
34   1499.06
45   2044.13
58   2421.64
SUMMARY OUTPUT
Regression Statistics
Multiple R         0.999673782
R Square           0.99934767
Adjusted R Square  0.999266129
Standard Error     2.808218162
Observations       10

ANOVA
            df
Regression   1
Residual     8
Total        9

            Coefficients
Intercept   0.668821408
Vol         0.022970159
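The slides don't show Bob's volume formula, but the Vol column matches the volume of a conical frustum, V = πh(D² + Dd + d²)/12, applied to each pile's dimensions. A sketch, assuming that is what Bob computed:

```python
import math

def frustum_volume(D, h, d):
    """Volume of a conical frustum with base diameter D,
    height h, and top diameter d (assumed shape of a coal pile)."""
    return math.pi * h * (D**2 + D*d + d**2) / 12

# Check against the first, second, and largest piles in the table.
print(round(frustum_volume(20, 10, 15), 2))   # 2421.64
print(round(frustum_volume(25, 10, 20), 2))   # 3992.44
print(round(frustum_volume(40, 16, 32), 2))   # 16353.04
```

Collapsing three correlated dimensions into the one physically meaningful predictor is why the simple regression beats the multiple regression here.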
The Four Assumptions
• Linearity
• Independence
  – The n observations were sampled independently from the same population.
• Homoskedasticity
  – All Y's given X share a common σ.
• Normality
  – The probability distribution of Y│X is normal.
  – Errors are normal. The Y's don't have to be.
[Diagnostic plots: a scatter plot of Y vs. X, a plot of residuals vs. fitted values, and a histogram of residual frequencies.]
Sec 5 of Pfeifer note; Sec 12.4 of EMBS
The four assumptions
• Linearity
• Independence (all 15 points count equally)
• Homoskedasticity
• Normality
Hypotheses
• H0: P = 0.5 (LTT, wunderdog)
• H0: Independence (supermarket job and response, treatment and heart attack, light and myopia, tosser and outcome)
• H0: μ = 100 (IQ)
• H0: μM = μF (heights, weights, batting averages)
• H0: μcompact = μmid = μlarge (displacement)
P 13 of Pfeifer note; Sec 12.5 of EMBS
H0: b=0
• b = 0 means X and Y are independent.
  – In this way it's like the chi-squared independence test… for numerical variables.
• b = 0 means don't use X to forecast Y.
  – Don't put X in the regression equation.
• b = 0 means just use Ȳ (the sample mean) to forecast Y.
• b = 0 means the "true" adjusted R-square is zero.
Testing b=0 is EASY!!!
• H0: μ = 100
  – t = (x̄ − 100)/(s/√n)
  – P-value from the t.dist with n−1 dof
• H0: b = 0
  – t = (b̂ − 0)/(standard error of the coefficient)
  – P-value from the t.dist with n−2 dof
            Coefficients  Standard Error  t Stat  P-value  Lower 95%  Upper 95%
Intercept   3.3123        1.4021          2.3624  0.0344   0.2832     6.3414
MSF         0.0445        0.0117          3.8064  0.0022   0.0192     0.0697

"Standard Error" here is the standard error of the coefficient b̂; "t Stat" is the t-stat to test b = 0; "P-value" is the 2-tailed p-value.
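The t Stat and the 95% interval for MSF follow directly from the coefficient and its standard error; 2.1604 is the two-tailed 95% t value for 13 dof:

```python
b_hat = 0.0444895        # MSF coefficient
se_b = 0.0117            # its standard error (rounded in the table)
t_crit = 2.1604          # t value for 95% confidence, 13 dof

t_stat = (b_hat - 0) / se_b        # ≈ 3.80 (the table's 3.8064 uses the unrounded se)
lower = b_hat - t_crit * se_b      # ≈ 0.0192
upper = b_hat + t_crit * se_b      # ≈ 0.0698 (table shows 0.0697 from unrounded inputs)
print(round(t_stat, 2), round(lower, 4), round(upper, 4))
```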
Using Yes/No variables in Regression
Car  Class    Displacement  Fuel Type  Hwy MPG
1    Midsize  3.5           R          28
2    Midsize  3             R          26
3    Large    3             P          26
4    Large    3.5           P          25
.    .        .             .          .
.    .        .             .          .
58   Compact  6             P          20
59   Midsize  2.5           R          30
60   Midsize  2             R          32

(Class and Fuel Type are categorical; Displacement and Hwy MPG are numerical; n = 60.)

Sec 8 of Pfeifer note; Sec 13.7 of EMBS
Does MPG "depend" on fuel type?
Fuel type (yes/no) and MPG (numerical)
• Un-stack the data so there are two columns of MPG data.
• Data Analysis → t-Test: Two-Sample Assuming Equal Variances
t-Test: Two-Sample Assuming Equal Variances
                              P         R
Mean                          24.33333  27.70833
Variance                      12.4      9.519928
Observations                  36        24
Pooled Variance               11.2579
Hypothesized Mean Difference  0
df                            58
t Stat                        -3.81704
P(T<=t) one-tail              0.000165
t Critical one-tail           1.671553
P(T<=t) two-tail              0.000331
t Critical two-tail           2.001717
H0: μP = μR
Or
H0: μP – μR = 0
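The t Stat can be reproduced from the summary statistics alone, using the pooled-variance formula:

```python
import math

mean_p, var_p, n_p = 24.33333, 12.4, 36       # Premium
mean_r, var_r, n_r = 27.70833, 9.519928, 24   # Regular

# Pooled variance: dof-weighted average of the two sample variances.
sp2 = ((n_p - 1) * var_p + (n_r - 1) * var_r) / (n_p + n_r - 2)
t_stat = (mean_p - mean_r) / math.sqrt(sp2 * (1/n_p + 1/n_r))

print(round(sp2, 4))     # 11.2579, the Pooled Variance line
print(round(t_stat, 5))  # -3.81704, the t Stat line
```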
Using Yes/No variables in Regression
1. Convert the categorical variable into a 1/0 DUMMY variable.
   – Use an IF statement to do this.
   – It won't matter which category is assigned 1 and which 0.
   – It doesn't even matter which two numbers you assign to the two categories (regression will adjust).
2. Regress MPG (numerical) on DUMMY (1/0 numerical)
3. Test H0: b=0 using the regression output.
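Step 2 with a 1/0 dummy reproduces the two group means exactly: the intercept is the mean of the D=0 group and the slope is the difference in means. A sketch using only the nine visible rows of the MPG data (so the numbers differ from the full-sample output, which uses all 60 cars):

```python
# Dummy-variable regression by hand: with x in {0, 1}, OLS gives
# intercept = mean of the D=0 group, slope = difference in group means.
dprem = [0, 0, 1, 1, 1, 1, 1, 0, 0]            # R=0, P=1 (visible rows only)
mpg   = [28, 26, 26, 25, 21, 25, 20, 30, 32]

n = len(dprem)
xbar = sum(dprem) / n
ybar = sum(mpg) / n
slope = (sum((x - xbar) * (y - ybar) for x, y in zip(dprem, mpg))
         / sum((x - xbar) ** 2 for x in dprem))
intercept = ybar - slope * xbar

mean_r = sum(y for x, y in zip(dprem, mpg) if x == 0) / dprem.count(0)
mean_p = sum(y for x, y in zip(dprem, mpg) if x == 1) / dprem.count(1)
print(intercept, slope)        # equals (mean_r, mean_p - mean_r)
```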
Using Yes/No variables in Regression

Fuel Type  Dprem  Hwy MPG
R          0      28
R          0      26
P          1      26
P          1      25
.          .      .
P          1      21
P          1      25
P          1      20
R          0      30
R          0      32
SUMMARY OUTPUT
Regression Statistics
Adj R Square 0.1870
Standard Error 3.3553
Observations 60
ANOVA
df SS MS F Sig F
Regression 1 164.025 164.025 14.570 3.306E-04
Residual 58 652.958 11.258
Total 59 816.983
Coeff Std Error t Stat P-value
Intercept 27.708 0.6849 40.4564 3.321E-44
Dprem -3.375 0.8842 -3.8170 3.306E-04
Regression with one Dummy variable
When D = 0, Ŷ = 27.708
When D = 1, Ŷ = 27.708 − 3.375 = 24.333
For Regular (D = 0), the forecast is 27.7.
For Premium (D = 1), the forecast is 24.3.
H0: μP = μR
or H0: μP – μR = 0
or H0: b = 0
What we learned today
• We learned about "adjusted R-square"
  – The most over-rated statistic of all time.
• We learned the four assumptions required to use regression to make a probability forecast of Y│X.
  – And how to check each of them.
• We learned how to test H0: b=0.
  – And why this is such an important test.
• We learned how to use a yes/no variable in a regression.
  – Create a dummy variable.