Statistics 203: Introduction to Regression and Analysis of Variance
Multiple Linear Regression: Diagnostics
Jonathan Taylor


  • Today
  • Spline models
  • What are the assumptions?
  • Problems in the regression function
  • Partial residual plot
  • Added-variable plot
  • Problems with the errors
  • Outliers & Influence
  • Dropping an observation
  • Different residuals
  • Crude outlier detection test
  • Bonferroni correction for multiple comparisons
  • DFFITS
  • Cook's distance
  • DFBETAS

    - p. 2/16

    Today

  • Splines + other bases.
  • Diagnostics


    - p. 3/16

    Spline models

  • Splines are piecewise polynomial functions, i.e. on an interval between knots (t_i, t_{i+1}) the spline f(x) is polynomial, but the coefficients change from interval to interval.

  • Example: cubic spline with knots at t_1 < t_2 < \dots < t_h:

    f(x) = \sum_{j=0}^{3} \beta_{0j} x^j + \sum_{i=1}^{h} \beta_i (x - t_i)_+^3

    where

    (x - t_i)_+ = \begin{cases} x - t_i & \text{if } x - t_i \ge 0 \\ 0 & \text{otherwise.} \end{cases}

  • Here is an example.
  • Conditioning problem again: B-splines are used to keep the model subspace the same but make the design less ill-conditioned.
  • Other bases one might use: Fourier (sin and cos waves); wavelets (space/time localized bases for functions).
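The truncated power basis above can be written down directly; a minimal numpy sketch (the knot locations and test signal are illustrative, not from the slides):

```python
import numpy as np

def truncated_power_basis(x, knots):
    """Cubic-spline design matrix in the truncated power basis:
    columns 1, x, x^2, x^3, then (x - t_i)_+^3 for each knot t_i."""
    x = np.asarray(x, dtype=float)
    cols = [x**j for j in range(4)]
    for t in knots:
        cols.append(np.clip(x - t, 0.0, None) ** 3)
    return np.column_stack(cols)

# Fit a cubic spline to a toy signal by least squares.
x = np.linspace(0, 1, 200)
y = np.sin(2 * np.pi * x)
X = truncated_power_basis(x, knots=[0.25, 0.5, 0.75])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
```

In practice one fits in a B-spline basis (e.g. R's bs()) for exactly the conditioning reason above: the truncated power columns become nearly collinear as knots accumulate, even though both bases span the same model subspace.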


    - p. 4/16

    What are the assumptions?

  • What is the full model for a given design matrix X?

    Y_i = \beta_0 + \beta_1 X_{i1} + \dots + \beta_{p-1} X_{i,p-1} + \varepsilon_i

  • Errors: \varepsilon \sim N(0, \sigma^2 I).
  • What can go wrong?
    - Regression function can be wrong: missing predictors, nonlinearity.
    - Assumptions about the errors can be wrong.
    - Outliers & influential observations: both in predictors and observations.


    - p. 5/16

    Problems in the regression function

  • The true regression function may have higher-order nonlinear terms, e.g. X_1^2, or even interactions X_1 \cdot X_2.
  • How to fix? Difficult in general; we will look at two plots: added-variable plots and partial residual plots.


    - p. 6/16

    Partial residual plot

  • For 1 \le j \le p-1, let

    e_{ij} = e_i + \hat\beta_j X_{ij}.

  • Can help to determine whether the variance depends on X_j, and to spot outliers.
  • If there is a non-linear trend, it is evidence that a linear term is not sufficient.
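A partial residual plot comes for free from a single fit; a sketch on simulated data (the design and coefficients are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta                       # ordinary residuals

# Partial residuals for predictor j: e_ij = e_i + beta_j * X_ij.
j = 1
partial_resid = e + beta[j] * X[:, j]
# Plot partial_resid against X[:, j]; because e is orthogonal to the
# columns of X, the least-squares slope through this cloud is beta_j.
```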


    - p. 7/16

    Added-variable plot

  • For 1 \le j \le p-1, let H_{(j)} be the hat matrix with this predictor deleted. Plot

    (I - H_{(j)})Y \quad \text{vs.} \quad (I - H_{(j)})X_j.

  • The plot should be linear and the slope should be \beta_j. Why?

    Y = X_{(j)}\beta_{(j)} + \beta_j X_j + \varepsilon
    (I - H_{(j)})Y = (I - H_{(j)})X_{(j)}\beta_{(j)} + \beta_j (I - H_{(j)})X_j + (I - H_{(j)})\varepsilon
    (I - H_{(j)})Y = \beta_j (I - H_{(j)})X_j + (I - H_{(j)})\varepsilon

    since (I - H_{(j)}) annihilates the columns of X_{(j)}.

  • Also can be helpful for detecting outliers.
  • If there is a non-linear trend, it is evidence that a linear term is not sufficient.
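The projection algebra above can be checked numerically; a sketch with a made-up two-predictor design:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x1, x2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

def hat(M):
    """Hat (projection) matrix onto the column space of M."""
    return M @ np.linalg.solve(M.T @ M, M.T)

# Added-variable plot for x2: residualize both y and x2 against the
# remaining columns, then plot one set of residuals against the other.
R = np.eye(n) - hat(X[:, :2])        # I - H_(j), with x2 deleted
y_res, x_res = R @ y, R @ x2
slope = (x_res @ y_res) / (x_res @ x_res)   # slope of the plot
```

The slope recovered here equals the coefficient on x2 from the full fit, which is exactly the claim the derivation above makes.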


    - p. 8/16

    Problems with the errors

  • Errors may not be normally distributed. We will look at a QQ plot for a graphical check. This may not affect inference in large samples.
  • Variance may not be constant. Transformations can sometimes help correct this. Non-constant variance affects our estimates of SE(\hat\beta), which can change t and F statistics substantially!
  • Graphical checks of non-constant variance: added-variable plots, partial residual plots, fitted vs. residual plots.
  • Errors may not be independent. This can seriously affect our estimates of SE(\hat\beta).


    - p. 9/16

    Outliers & Influence

  • Some residuals may be much larger than others, which can affect the overall fit of the model. This may be evidence of an outlier: a point where the model fits very poorly. This can be caused by many factors, and such points should not be automatically deleted from the dataset.
  • Even if an observation does not have a large residual, it can exert a strong influence on the regression function.
  • General strategy to measure influence: for each observation, drop it from the model and measure how much the model changes.


    - p. 10/16

    Dropping an observation

  • A subscript (i) indicates that the i-th observation was not used in fitting the model.
  • For example: \hat Y_{j(i)} is the regression function evaluated at the j-th observation's predictors, BUT the coefficients (\hat\beta_{0(i)}, \dots, \hat\beta_{p-1,(i)}) were fit after deleting the i-th row of data.
  • Basic idea: if \hat Y_{j(i)} is very different from \hat Y_j (using all the data), then observation i is an influential point for determining \hat Y_j.


    - p. 11/16

    Different residuals

  • Ordinary residuals: e_i = Y_i - \hat Y_i.
  • Standardized residuals: r_i = e_i / s(e_i) = e_i / (\hat\sigma \sqrt{1 - H_{ii}}), where H is the hat matrix. (rstandard in R)
  • Studentized residuals: t_i = e_i / (\hat\sigma_{(i)} \sqrt{1 - H_{ii}}) \sim t_{n-p-1}. (rstudent in R)
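All three kinds of residuals come from a single fit via the hat matrix; a numpy sketch on simulated data, using the standard identity that expresses t_i through r_i so no leave-one-out refits are needed:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)                                  # leverages H_ii
e = y - H @ y                                   # ordinary residuals
s2 = e @ e / (n - p)                            # sigma-hat^2 (MSE)
r = e / np.sqrt(s2 * (1 - h))                   # standardized (rstandard)
# Studentized residuals re-estimate sigma without observation i;
# equivalently t_i = r_i * sqrt((n - p - 1) / (n - p - r_i^2)).
t = r * np.sqrt((n - p - 1) / (n - p - r**2))   # studentized (rstudent)
```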


    - p. 12/16

    Crude outlier detection test

  • If a studentized residual is large, the observation may be an outlier.
  • Problem: if n is large and we threshold at t_{1-\alpha/2,\, n-p-1}, we will get many "outliers" by chance even if the model is correct.
  • Solution: Bonferroni correction, threshold at t_{1-\alpha/(2n),\, n-p-1}.
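With scipy's t quantiles the two cutoffs are one line each; a sketch (the values of n, p, alpha are illustrative):

```python
from scipy.stats import t as t_dist

def outlier_cutoffs(n, p, alpha=0.05):
    """Naive vs. Bonferroni-corrected cutoffs for studentized residuals."""
    naive = t_dist.ppf(1 - alpha / 2, df=n - p - 1)
    bonferroni = t_dist.ppf(1 - alpha / (2 * n), df=n - p - 1)
    return naive, bonferroni

# With n = 100 observations the per-test cutoff is near 2, while the
# Bonferroni cutoff is considerably larger.
naive, bonf = outlier_cutoffs(n=100, p=3, alpha=0.05)
```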


    - p. 13/16

    Bonferroni correction for multiple comparisons

  • If we are doing many t (or other) tests, say m > 1, we can control the overall false positive rate at \alpha by testing each one at level \alpha/m.
  • Proof:

    P(\text{at least one false positive})
    = P\left( \bigcup_{i=1}^m \left\{ |T_i| \ge t_{1-\alpha/(2m),\, n-p-2} \right\} \right)
    \le \sum_{i=1}^m P\left( |T_i| \ge t_{1-\alpha/(2m),\, n-p-2} \right)
    = \sum_{i=1}^m \frac{\alpha}{m} = \alpha.


    - p. 14/16

    DFFITS

  •

    DFFITS_i = \frac{\hat Y_i - \hat Y_{i(i)}}{\hat\sigma_{(i)} \sqrt{H_{ii}}}

  • This quantity measures how much the regression function changes at the i-th observation when the i-th observation is deleted.
  • For small/medium datasets: a value of 1 or greater is considered suspicious. For large datasets: a value of 2\sqrt{p/n}.
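DFFITS need not be computed by n refits: the standard identity DFFITS_i = t_i \sqrt{H_{ii}/(1 - H_{ii})} gives all of them from one fit. A sketch on simulated data:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)
e = y - H @ y
s2 = e @ e / (n - p)
r = e / np.sqrt(s2 * (1 - h))
t = r * np.sqrt((n - p - 1) / (n - p - r**2))   # studentized residuals

# DFFITS_i = t_i * sqrt(h_i / (1 - h_i)): no leave-one-out refits needed.
dffits = t * np.sqrt(h / (1 - h))
flagged = np.abs(dffits) > 2 * np.sqrt(p / n)   # large-dataset rule of thumb
```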


    - p. 15/16

Cook's distance

  •

    D_i = \frac{\sum_{j=1}^n (\hat Y_j - \hat Y_{j(i)})^2}{p\, \hat\sigma^2}

  • This quantity measures how much the entire regression function changes when the i-th observation is deleted.
  • Should be comparable to F_{p,\, n-p}: if the "p-value" of D_i is 50 percent or more, then the i-th point is likely influential: investigate this point further.
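Cook's distance also has a closed form from a single fit, D_i = r_i^2 H_{ii} / (p (1 - H_{ii})), with r_i the standardized residual; a sketch on simulated data:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)
e = y - H @ y
s2 = e @ e / (n - p)
r = e / np.sqrt(s2 * (1 - h))                   # standardized residuals

# D_i = r_i^2 * h_i / (p * (1 - h_i)) equals the leave-one-out sum
# sum_j (yhat_j - yhat_{j(i)})^2 / (p * sigma-hat^2) without n refits.
cooks_d = r**2 * h / (p * (1 - h))
```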


    - p. 16/16

    DFBETAS

  •

    DFBETAS_{j(i)} = \frac{\hat\beta_j - \hat\beta_{j(i)}}{\sqrt{\hat\sigma^2_{(i)}\, (X^T X)^{-1}_{jj}}}

  • This quantity measures how much the coefficient \hat\beta_j changes when the i-th observation is deleted.
  • For small/medium datasets: a value of 1 or greater is suspicious. For large datasets: a value of 2/\sqrt{n}.
  • Here is an example.
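All n \times p DFBETAS values follow from the rank-one update \hat\beta - \hat\beta_{(i)} = (X^T X)^{-1} x_i\, e_i / (1 - H_{ii}), again with no refitting; a sketch on simulated data:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)     # leverages H_ii
beta = XtX_inv @ X.T @ y
e = y - X @ beta
s2 = e @ e / (n - p)
s2_loo = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)   # sigma^2_(i)

# beta - beta_(i) = XtX_inv @ x_i * e_i / (1 - h_i) (rank-one update),
# so every DFBETAS_{j(i)} comes from one fit; dfbetas has shape (p, n).
delta = (XtX_inv @ X.T) * (e / (1 - h))
dfbetas = delta / np.sqrt(s2_loo * np.diag(XtX_inv)[:, None])
```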
