Bivariate analysis


Page 1: Bivariate

Bivariate analysis

Page 2: Bivariate

The Multiple Regression Model

Idea: Examine the linear relationship between 1 dependent variable (Y) and 2 or more independent variables (Xi)

Multiple Regression Model with k Independent Variables:

Yi = β0 + β1X1i + β2X2i + ... + βkXki + εi

where β0 is the Y-intercept, β1 ... βk are the population slopes, and εi is the random error

Page 3: Bivariate

Assumptions of Regression

Use the acronym LINE:

• Linearity
– The underlying relationship between X and Y is linear

• Independence of Errors
– Error values are statistically independent

• Normality of Error
– Error values (ε) are normally distributed for any given value of X

• Equal Variance (Homoscedasticity)
– The probability distribution of the errors has constant variance

Page 4: Bivariate

Regression Statistics
Multiple R         0.998368
R Square           0.996739
Adjusted R Square  0.995808
Standard Error     1.350151
Observations       28

ANOVA

            df   SS        MS        F         Significance F
Regression   6   11701.72  1950.286  1069.876  5.54E-25
Residual    21   38.28108  1.822908
Total       27   11740

r2 = SSR / SST = 11701.72 / 11740 = 0.996739

99.674% of the variation in Y is explained by the independent variables
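As a check, the r2 figure can be recomputed directly from the sums of squares in the ANOVA table above; a minimal sketch in Python:

```python
# r^2 = SSR / SST, using the sums of squares from the ANOVA output
SSR = 11701.72          # regression sum of squares
SSE = 38.28108          # residual (error) sum of squares
SST = SSR + SSE         # total sum of squares (11740, as reported)
r_squared = SSR / SST   # ≈ 0.996739
```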

Page 5: Bivariate

Adjusted r2

• r2 never decreases when a new X variable is added to the model
– This can be a disadvantage when comparing models

• What is the net effect of adding a new variable?
– We lose a degree of freedom when a new X variable is added
– Did the new X variable add enough explanatory power to offset the loss of one degree of freedom?

Page 6: Bivariate

• Shows the proportion of variation in Y explained by all X variables adjusted for the number of X variables used

(where n = sample size, k = number of independent variables)

– Penalizes excessive use of unimportant independent variables

– Smaller than r2

– Useful when comparing models

Adjusted r2:

r2adj = 1 - (1 - r2)(n - 1) / (n - k - 1)
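Plugging the output's values (n = 28, k = 6, r2 = 0.996739) into this formula reproduces the Adjusted R Square reported earlier; a quick check:

```python
# Adjusted r^2 = 1 - (1 - r^2) * (n - 1) / (n - k - 1)
n, k = 28, 6     # sample size and number of independent variables
r2 = 0.996739
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)   # ≈ 0.995808
```

Note that r2_adj is always below r2, and the gap widens as more variables are added for a fixed sample size.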

Page 7: Bivariate

Error and coefficients relationship

• b1 = Covar(y, x) / Varp(x)

Stddevp  419.28571  1103.4439  115902.4    1630165.82  36245060.6  706538.59  195.9184
Covar    662.14286  6862.5     25621.4286  120976.786  16061.643   257.1429
b1       0.6000694  0.059209   0.01571707  0.00333775  0.0227329   1.3125
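The slope formula b1 = Covar(y, x) / Varp(x) (population covariance over population variance) can be sketched directly; the data below are illustrative, not the case values:

```python
# b1 = population covariance(x, y) / population variance(x)
def slope_b1(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    covar = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    varp = sum((a - mx) ** 2 for a in x) / len(x)
    return covar / varp

# Illustrative data: y = 3 + 2x exactly, so the slope should be 2
b1 = slope_b1([1, 2, 3, 4], [5, 7, 9, 11])
```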

Page 8: Bivariate

Is the Model Significant?

• F Test for Overall Significance of the Model

• Shows if there is a linear relationship between all of the X variables considered together and Y

• Use F-test statistic

• Hypotheses: H0: β1 = β2 = … = βk = 0 (no linear relationship)

H1: at least one βi ≠ 0 (at least one independent variable affects Y)

Page 9: Bivariate

F Test for Overall Significance

• Test statistic:

F = MSR / MSE = (SSR / k) / (SSE / (n - k - 1))

where F has k (numerator) and (n - k - 1) (denominator) degrees of freedom
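Applying this to the ANOVA output shown earlier (SSR = 11701.72, SSE = 38.28108, n = 28, k = 6) reproduces the reported F value; a minimal check:

```python
# F = MSR / MSE = (SSR / k) / (SSE / (n - k - 1))
SSR, SSE = 11701.72, 38.28108
n, k = 28, 6
MSR = SSR / k             # ≈ 1950.286
MSE = SSE / (n - k - 1)   # ≈ 1.822908
F = MSR / MSE             # ≈ 1069.876
```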

Page 10: Bivariate

Case discussion

Page 11: Bivariate

Multiple Regression Assumptions

Assumptions:
• The errors are normally distributed
• Errors have a constant variance
• The model errors are independent

Errors (residuals) from the regression model:

ei = Yi - Ŷi

Page 12: Bivariate

Error terms and coefficient estimates

• Once we think of the error term as a random variable, it becomes clear that the estimates of b1, b2, ... (as distinguished from their true values) are also random variables, because the estimates generated by the least-squares (SSE) criterion depend on the particular value of ε drawn by nature for each individual in the data set.

Page 13: Bivariate

Statistical Inference and Goodness of fit

• The parameter estimates are themselves random variables, dependent upon the random variables e.

• Thus, each estimate can be thought of as a draw from some underlying probability distribution, the nature of that distribution as yet unspecified.

• If we assume that the error terms e are all drawn from the same normal distribution, it is possible to show that the parameter estimates have a normal distribution as well.

Page 14: Bivariate

T Statistic and P value

• t = (b1 - mean of b1) / standard deviation of b1

Discussion: can you set up the hypothesis that the mean of b1 equals the b1 estimate and carry out the t test?

Page 15: Bivariate

Are Individual Variables Significant?

• Use t tests of individual variable slopes

• Shows if there is a linear relationship between the variable Xj and Y

• Hypotheses:

– H0: βj = 0 (no linear relationship)

– H1: βj ≠ 0 (linear relationship does exist between Xj and Y)

Page 16: Bivariate

Are Individual Variables Significant?

H0: βj = 0 (no linear relationship)

H1: βj ≠ 0 (linear relationship does exist between xj and y)

Test Statistic:

t = (bj - 0) / Sbj

(df = n - k - 1)

Page 17: Bivariate

           Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%  Lower 95.0%  Upper 95.0%
Intercept  -59.0661      11.28404        -5.23448  3.45E-05  -82.5325   -35.5996   -82.5325     -35.5996
OFF        -0.00696      0.04619         -0.15068  0.881663  -0.10302   0.089097   -0.10302     0.089097
BAR        0.041988      0.005271        7.966651  8.81E-08  0.031028   0.052949   0.031028     0.052949
YNG        0.002716      0.000999        2.717326  0.012904  0.000637   0.004794   0.000637     0.004794
VEH        0.00147       0.000265        5.540878  1.69E-05  0.000918   0.002021   0.000918     0.002021
INV        -0.00274      0.001336        -2.05135  0.052914  -0.00552   3.78E-05   -0.00552     3.78E-05
SPD        -0.2682       0.068418        -3.92009  0.000786  -0.41049   -0.12592   -0.41049     -0.12592

with n – (k+1) degrees of freedom
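Each t Stat in the table is simply the coefficient divided by its standard error; checking the BAR row with the values from the output:

```python
# t = (b_j - 0) / S_bj for the BAR coefficient
b_bar, se_bar = 0.041988, 0.005271
t_bar = b_bar / se_bar   # ≈ 7.966, matching the t Stat column
```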

Page 18: Bivariate

Confidence Interval Estimate for the Slope

• Confidence interval for the population slope βj

bj ± t(n-k-1) Sbj, where t has (n - k - 1) d.f.

Example: Form a 95% confidence interval for the effect of changes in Bars on fatal accidents:

0.041988 ± (2.079614)(0.005271), so the interval is (0.031028, 0.052949)

(This interval does not contain zero, so BAR has a significant effect on accidents)
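The interval arithmetic can be verified directly from the values in the example:

```python
# 95% CI: b_j ± t_(n-k-1) * S_bj, using the BAR values from the example
b, se = 0.041988, 0.005271
t_crit = 2.079614         # t value with 21 d.f. at 95% confidence
lower = b - t_crit * se   # ≈ 0.031028
upper = b + t_crit * se   # ≈ 0.052949
significant = lower > 0   # the interval excludes zero
```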

Page 19: Bivariate

           Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%
Intercept  -59.0661      11.28404        -5.23448  3.45E-05  -82.5325   -35.5996
OFF        -0.00696      0.04619         -0.15068  0.881663  -0.10302   0.089097
BAR        0.041988      0.005271        7.966651  8.81E-08  0.031028   0.052949
YNG        0.002716      0.000999        2.717326  0.012904  0.000637   0.004794
VEH        0.00147       0.000265        5.540878  1.69E-05  0.000918   0.002021
INV        -0.00274      0.001336        -2.05135  0.052914  -0.00552   3.78E-05
SPD        -0.2682       0.068418        -3.92009  0.000786  -0.41049   -0.12592

Page 20: Bivariate

Using Dummy Variables

• A dummy variable is a categorical explanatory variable with two levels:
– yes or no, on or off, male or female
– coded as 0 or 1

• Regression intercepts are different if the variable is significant

• Assumes equal slopes for other variables
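A small sketch of the idea with hypothetical coefficients (not from the case data): the dummy D shifts the intercept while the slope on X stays the same for both groups.

```python
# Yhat = b0 + b1*X + b2*D, where D is a 0/1 dummy variable
b0, b1, b2 = 10.0, 2.0, 5.0   # hypothetical fitted values

def yhat(x, d):
    return b0 + b1 * x + b2 * d

# Intercepts: 10 when D = 0, 15 when D = 1; slope is 2 in both groups
```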

Page 21: Bivariate

Interaction Between Independent Variables

• Hypothesizes interaction between pairs of X variables
– Response to one X variable may vary at different levels of another X variable

• Contains a cross-product term:

Ŷ = b0 + b1X1 + b2X2 + b3X3
  = b0 + b1X1 + b2X2 + b3(X1X2)

Page 22: Bivariate

Effect of Interaction

• Given: Y = β0 + β1X1 + β2X2 + β3X1X2 + ε

• Without the interaction term, the effect of X1 on Y is measured by β1

• With the interaction term, the effect of X1 on Y is measured by β1 + β3X2

• The effect changes as X2 changes

Page 23: Bivariate

Interaction Example

Suppose X2 is a dummy variable and the estimated regression equation is

Ŷ = 1 + 2X1 + 3X2 + 4X1X2

X2 = 1: Ŷ = 1 + 2X1 + 3(1) + 4X1(1) = 4 + 6X1

X2 = 0: Ŷ = 1 + 2X1 + 3(0) + 4X1(0) = 1 + 2X1

Slopes are different if the effect of X1 on Y depends on the X2 value
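The two cases above can be evaluated directly from the estimated equation:

```python
# Yhat = 1 + 2*X1 + 3*X2 + 4*X1*X2, the estimated equation from the example
def yhat(x1, x2):
    return 1 + 2 * x1 + 3 * x2 + 4 * x1 * x2

# When X2 = 0 the line is 1 + 2*X1; when X2 = 1 it is 4 + 6*X1
```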

Page 24: Bivariate

Residual Analysis

• The residual for observation i, ei, is the difference between its observed and predicted value:

ei = Yi - Ŷi

• Check the assumptions of regression by examining the residuals
– Examine for linearity assumption
– Evaluate independence assumption
– Evaluate normal distribution assumption
– Examine for constant variance for all levels of X (homoscedasticity)

• Graphical Analysis of Residuals
– Can plot residuals vs. X
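Computing residuals is an element-wise subtraction of predicted from observed values; the numbers below are illustrative, not from the case:

```python
# e_i = Y_i - Yhat_i for each observation
y_obs = [10.0, 12.0, 15.0]   # illustrative observed values
y_hat = [9.5, 12.5, 15.0]    # illustrative predicted values
residuals = [yo - yh for yo, yh in zip(y_obs, y_hat)]
```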

Page 25: Bivariate

Residual Analysis for Independence

[Residual plots vs. X: a systematic pattern in the residuals indicates the errors are not independent; a random scatter indicates independence]

Page 26: Bivariate

Residual Analysis for Equal Variance

[Residual plots vs. x: a funnel shape in the residuals indicates non-constant variance; an even band indicates constant variance]

Page 27: Bivariate

Linear vs. Nonlinear Fit

A linear fit does not give random residuals; a nonlinear fit gives random residuals

[Residual plots vs. X: the linear fit to curved data leaves a systematic pattern in the residuals; the nonlinear fit leaves a random scatter]

Page 28: Bivariate

Quadratic Regression Model

Quadratic models may be considered when the scatter diagram takes on one of the following shapes:

Yi = β0 + β1X1i + β2X1i² + εi

β1 = the coefficient of the linear term
β2 = the coefficient of the squared term

[Scatter shapes vs. X1: β1 < 0, β2 > 0; β1 > 0, β2 > 0; β1 < 0, β2 < 0; β1 > 0, β2 < 0 — β2 > 0 opens upward, β2 < 0 opens downward]
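A quick sketch showing how the sign of β2 sets the curvature (hypothetical coefficients, not fitted values):

```python
# Yhat = b0 + b1*X + b2*X^2; b2 > 0 opens upward, b2 < 0 opens downward
def yhat(x, b0, b1, b2):
    return b0 + b1 * x + b2 * x ** 2

# The second difference of Yhat equals 2*b2, so its sign reveals curvature
curvature_up = yhat(2, 0, 1, 1) - 2 * yhat(1, 0, 1, 1) + yhat(0, 0, 1, 1)
curvature_down = yhat(2, 0, 1, -1) - 2 * yhat(1, 0, 1, -1) + yhat(0, 0, 1, -1)
```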