Diagnostics of Linear Regression
Junhui Qian
October 27, 2014
Basics
Nonlinearity
Correlation
Homoscedasticity
The Objectives
I After estimating a model, we should always perform diagnostics on the model. In particular, we should check whether the assumptions we made are valid.
I For OLS estimation, we should usually check:
I Is the relationship between x and y linear?
I Are the residuals serially uncorrelated?
I Are the residuals uncorrelated with explanatory variables? (endogeneity)
I Does homoscedasticity hold?
Residuals
I Residuals are unobservable. But they can be estimated:
ûi = yi − x′i β̂.
I Using matrix language,
û = (I − PX )Y .
I If β̂ is close to β, then ûi is close to ui .
I Let ŷi = x′i β̂; we call ŷi the "fitted value".
I Then the explained variable can be decomposed into
yi = ŷi + ûi .
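A minimal NumPy sketch of these identities (the data are simulated and all names are illustrative): the residuals û = (I − PX )Y are orthogonal to the regressors, and y decomposes exactly into ŷ + û.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one regressor
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)      # simulated DGP

# OLS estimate and the projection matrix P_X = X (X'X)^{-1} X'
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
P_X = X @ np.linalg.inv(X.T @ X) @ X.T

y_fit = X @ beta_hat               # fitted values  y_hat = X beta_hat
u_hat = (np.eye(n) - P_X) @ y      # residuals      u_hat = (I - P_X) y

# The decomposition y = y_hat + u_hat holds exactly,
# and residuals are orthogonal to the regressors: X' u_hat = 0.
print(np.allclose(y, y_fit + u_hat))     # True
print(np.allclose(X.T @ u_hat, 0))       # True
```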
Variations
I SST (total sum of squares)
SST ≡ ∑ᵢ₌₁ⁿ (yi − ȳ)² = Y′(I − Pι)Y .
I SSE (explained sum of squares)
SSE ≡ ∑ᵢ₌₁ⁿ (ŷi − ȳ)² = Y′(PX − Pι)Y .
I SSR (sum of squared residuals)
SSR ≡ ∑ᵢ₌₁ⁿ ûi² = Y′(I − PX )Y .
I We have SST = SSE + SSR.
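The decomposition can be checked numerically; a small sketch with simulated data (note the identity SST = SSE + SSR requires an intercept in the regression):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept included
y = X @ np.array([0.5, 1.5]) + rng.normal(size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
y_fit = X @ beta_hat
u_hat = y - y_fit

SST = np.sum((y - y.mean()) ** 2)       # total sum of squares
SSE = np.sum((y_fit - y.mean()) ** 2)   # explained sum of squares
SSR = np.sum(u_hat ** 2)                # sum of squared residuals

print(np.isclose(SST, SSE + SSR))       # True
```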
Goodness of Fit
I R² of the regression:
R² ≡ SSE/SST = 1 − SSR/SST.
I R² is the fraction of the sample variation in y that is explained by x. We have 0 ≤ R² ≤ 1.
I R² does NOT validate a model. A high R² only says that y is predictable with the information in x. In social science, this is generally not the case.
I If additional regressors are added to a model, R² will (weakly) increase.
I The adjusted R², denoted R̄², is designed to penalize the number of regressors:
R̄² = 1 − [SSR/(n − 1 − k)]/[SST/(n − 1)].
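A sketch of both quantities on simulated data (illustrative names; the adjusted R² is computed via the equivalent form R̄² = 1 − (1 − R²)(n − 1)/(n − 1 − k)):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)
y = 0.5 + 1.5 * x + rng.normal(size=n)

def r_squared(X, y):
    """R^2 = 1 - SSR/SST for an OLS fit of y on X."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    ssr = np.sum((y - X @ beta) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1 - ssr / sst

X1 = np.column_stack([np.ones(n), x])
X2 = np.column_stack([X1, rng.normal(size=n)])   # add an irrelevant regressor

r2_1, r2_2 = r_squared(X1, y), r_squared(X2, y)
adj_r2_1 = 1 - (1 - r2_1) * (n - 1) / (n - 1 - 1)  # k = 1 regressor
adj_r2_2 = 1 - (1 - r2_2) * (n - 1) / (n - 1 - 2)  # k = 2 regressors

print(r2_2 >= r2_1)   # R^2 never falls when a regressor is added
```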
Figure : Scatter plots of Y versus X illustrating a low R² fit (left) and a high R² fit (right).
Residual Plots
We can plot
I Residuals
I Residuals versus Fitted Value
I Residuals versus Explanatory Variables
Any pattern in residual plots suggests nonlinearity or endogeneity.
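To illustrate, a simulated example in the spirit of the figure's quadratic DGP: fitting a linear model to a quadratic relationship leaves a systematic pattern, which shows up as a strong correlation between the residuals and x².

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
x = rng.normal(size=n)
y = 0.2 + x + 0.5 * x**2 + rng.normal(size=n)   # nonlinear DGP

X = np.column_stack([np.ones(n), x])            # misspecified linear model
beta = np.linalg.lstsq(X, y, rcond=None)[0]
u_hat = y - X @ beta

# Residuals are (numerically) orthogonal to x itself, but the omitted
# curvature makes them correlate strongly with x^2.
corr = np.corrcoef(u_hat, x**2)[0, 1]
print(abs(corr) > 0.4)
```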
Figure : Residual Plots (panels: Regression; Residuals; Residuals vs. Regressor; Residuals vs. Fitted Value). DGP: y = 0.2 + x + 0.5x² + u
Partial Residual Plots
I To see whether there exists nonlinearity in a regressor, say the j-th explanatory variable xj, we can plot
û + β̂j xj versus xj ,
where û is the residual from the full model.
I Partial residual plots may help us find the true (nonlinear) functional form of xj .
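A sketch of the construction on simulated data (the DGP mirrors the quadratic example used in the figures; names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
x = rng.normal(size=n)
z = rng.normal(size=n)
y = 0.2 + x + 0.5 * z + z**2 + rng.normal(size=n)   # g(z) = z^2 is omitted

X = np.column_stack([np.ones(n), x, z])             # misspecified: linear in z
beta = np.linalg.lstsq(X, y, rcond=None)[0]
u_hat = y - X @ beta

# Partial residual for z: u_hat + beta_z * z. Plotted against z it
# traces out the shape of beta_z * z + g(z).
partial = u_hat + beta[2] * z

# The quadratic shape shows up as a strong correlation with z^2.
corr = np.corrcoef(partial, z**2)[0, 1]
print(abs(corr) > 0.5)
```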
Partial Residual Plots: Example
Suppose the true model is
y = β0 + β1x + β2z + g(z) + u,
where g(z) is a nonlinear function. We mistakenly estimate:
y = β0 + β1x + β2z + u.
If we plot β̂2z + û versus z, we will most likely be able to detect the nonlinearity in g(z).
Figure : Partial Residual Plot (partial residual versus Z). DGP: y = 0.2 + x + 0.5z + z² + u
The iid Assumption
I The CLR assumption dictates that residuals should be iid.
I It is generally difficult to determine whether a given set of observations comes from the same distribution.
I If there is a natural order of the observations (e.g., time),then we may check whether the residuals are correlated.
I If there is correlation, then the iid assumption is violated.
Residuals with Time
I When we deal with a time series regression, for example,
πt = β0 + β1mt + ut ,
where πt is the inflation rate and mt is the growth rate of the money supply, both indexed by time t.
I Now the “natural order” is time, and a time series plot of theestimated residual contains information.
Residual Plots
We can plot:
I Residuals over time
I Residuals vs. the previous residual
I Correlogram
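The correlogram is built from sample autocorrelations; a minimal sketch with simulated AR(1) residuals (helper names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
T = 2000

def simulate_ar1(alpha, T, rng):
    """Simulate u_t = alpha * u_{t-1} + eps_t."""
    u = np.zeros(T)
    eps = rng.normal(size=T)
    for t in range(1, T):
        u[t] = alpha * u[t - 1] + eps[t]
    return u

def acf(u, lag):
    """Sample autocorrelation of u at the given lag."""
    u = u - u.mean()
    return np.dot(u[lag:], u[:-lag]) / np.dot(u, u)

u_none = simulate_ar1(0.0, T, rng)     # alpha = 0: no correlation
u_strong = simulate_ar1(0.95, T, rng)  # alpha = 0.95: strong correlation

print(abs(acf(u_none, 1)) < 0.1)   # near zero for iid residuals
print(acf(u_strong, 1) > 0.8)      # near 0.95 under strong correlation
```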
Figure : Residuals over time: ut = αut−1 + εt , α = 0, 0.5, 0.95, from top to bottom.
Figure : Residuals vs. the previous residual: ut = αut−1 + εt , α = 0, 0.5, 0.95, from left to right.
Figure : Correlograms: ut = αut−1 + εt , α = 0, 0.5, 0.95, from left to right.
Durbin-Watson Test
I The Durbin-Watson test is the standard formal test for independence or, more precisely, for non-correlation.
I It assumes an AR(1) model for ut : ut = αut−1 + εt .
I The null hypothesis is H0 : α = 0.
I The test statistic is
DW = [∑ₜ₌₂ᵀ (ût − ût−1)²] / [∑ₜ₌₁ᵀ ût²].
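A sketch of the statistic on simulated residuals (NumPy; names are illustrative). For iid residuals DW is close to 2; under strong positive AR(1) correlation it falls well below 1:

```python
import numpy as np

rng = np.random.default_rng(6)
T = 500

def durbin_watson(u):
    """DW = sum_{t=2}^T (u_t - u_{t-1})^2 / sum_{t=1}^T u_t^2."""
    return np.sum(np.diff(u) ** 2) / np.sum(u ** 2)

u_iid = rng.normal(size=T)      # no autocorrelation
u_pos = np.zeros(T)             # strong positive AR(1): alpha = 0.9
eps = rng.normal(size=T)
for t in range(1, T):
    u_pos[t] = 0.9 * u_pos[t - 1] + eps[t]

print(abs(durbin_watson(u_iid) - 2) < 0.3)  # DW near 2 for iid residuals
print(durbin_watson(u_pos) < 1.0)           # DW well below 2 under positive correlation
```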
Durbin-Watson Test
I DW ∈ [0, 4].
I DW = 2 indicates no autocorrelation.
I If DW is substantially less than 2, there is evidence of positive serial correlation. As a rough rule of thumb, if DW is less than 1.0, there may be cause for alarm.
I Small values of DW indicate that successive error terms are, on average, close in value to one another, i.e., positively correlated.
I Large values of DW indicate that successive error terms are, on average, very different in value from one another, i.e., negatively correlated.
Fixing Correlation
I It is most likely that the model is misspecified.
I The usual practices are:
I Add more explanatory variables
I Add more lags of the existing explanatory variables
Homoscedasticity
I If var(ui |x) = σ², we call the model “homoscedastic”. If not, we call it “heteroscedastic”.
I If homoscedasticity does not hold, but CLR Assumptions 1-4 still hold, the OLS estimator is still unbiased and consistent. However, OLS is no longer BLUE.
I We can detect heteroscedasticity by plotting the residuals vs. the regressors.
I For simple regressions, we can look at the regression line.
I And we can formally test for homoscedasticity:
I White test
I Breusch-Pagan test
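A sketch of the Breusch-Pagan idea on simulated data (this is the LM form, n·R² from the auxiliary regression of the squared residuals on the regressors; names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1000
x = rng.uniform(0.5, 2.5, size=n)
X = np.column_stack([np.ones(n), x])

# Heteroscedastic DGP: error variance grows with x (u_i = x_i * eps_i)
y = 0.5 + 0.5 * x + x * rng.normal(size=n)

beta = np.linalg.lstsq(X, y, rcond=None)[0]
u2 = (y - X @ beta) ** 2                       # squared OLS residuals

# Auxiliary regression of u^2 on the regressors; under H0 (homoscedasticity)
# LM = n * R^2 of this regression is asymptotically chi2(k), k = 1 here.
gamma = np.linalg.lstsq(X, u2, rcond=None)[0]
ssr_aux = np.sum((u2 - X @ gamma) ** 2)
sst_aux = np.sum((u2 - u2.mean()) ** 2)
LM = n * (1 - ssr_aux / sst_aux)

print(LM > 3.84)   # rejects homoscedasticity at the 5% level (chi2(1) critical value)
```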
Figure : Heteroscedasticity. Residuals vs. X (left) and the regression line (right). DGP: yi = β0 + 0.5xi + xiεi .
Fixing Heteroscedasticity
I Use a different specification for the model (different variables,or perhaps non-linear transformations of the variables).
I Use GLS (Generalized Least Square).
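A sketch of GLS when the skedastic function is known (here Var(u|x) = x², so weighting each observation by 1/x restores homoscedastic errors; simulated data, illustrative names):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 2000
x = rng.uniform(0.5, 2.5, size=n)
X = np.column_stack([np.ones(n), x])
y = 0.5 + 0.5 * x + x * rng.normal(size=n)   # Var(u|x) = x^2

# GLS with known weights: divide each observation by its error
# standard deviation (here sigma_i = x_i), then run OLS on the
# transformed model, whose errors are iid.
w = 1.0 / x
beta_gls = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)[0]
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]   # still consistent, but not BLUE

print(np.allclose(beta_gls, [0.5, 0.5], atol=0.25))
```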