Statistics 203: Introduction to Regression and Analysis of Variance
Multiple Linear Regression: Diagnostics
Jonathan Taylor
Outline:
- Today
- Spline models
- What are the assumptions?
- Problems in the regression function
- Partial residual plot
- Added-variable plot
- Problems with the errors
- Outliers & Influence
- Dropping an observation
- Different residuals
- Crude outlier detection test
- Bonferroni correction for multiple comparisons
- DFFITS
- Cook's distance
- DFBETAS
Today
- Splines + other bases
- Diagnostics
Spline models
- Splines are piecewise polynomial functions: on each interval between knots (t_i, t_{i+1}) the spline f(x) is a polynomial, but the coefficients change from interval to interval.
- Example: a cubic spline with knots at t_1 < t_2 < \dots < t_h:

      f(x) = \sum_{j=0}^{3} \beta_{0j} x^j + \sum_{i=1}^{h} \beta_i (x - t_i)_+^3

  where

      (x - t_i)_+ = \begin{cases} x - t_i & \text{if } x - t_i \geq 0 \\ 0 & \text{otherwise.} \end{cases}

- Here is an example.
- Conditioning problem again: B-splines are used to keep the model subspace the same but make the design less ill-conditioned.
- Other bases one might use: Fourier (sin and cos waves); wavelets (space/time-localized basis functions).
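The truncated power basis above is easy to build directly. A minimal numpy sketch (the function name, knot locations, and evaluation grid are illustrative, not from the slides):

```python
import numpy as np

def cubic_spline_basis(x, knots):
    # Columns 1, x, x^2, x^3 give the global cubic part;
    # each knot t contributes one truncated power column (x - t)_+^3.
    x = np.asarray(x, dtype=float)
    cols = [x**j for j in range(4)]
    cols += [np.where(x >= t, (x - t)**3, 0.0) for t in knots]
    return np.column_stack(cols)

x = np.linspace(0, 1, 50)
B = cubic_spline_basis(x, knots=[0.3, 0.7])   # 4 + h = 6 columns
```

Regressing Y on the columns of B fits the spline by ordinary least squares; as the slide notes, a B-spline basis spanning the same subspace is better conditioned in practice.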
What are the assumptions?
n What is the full model for a given design matrix X ?
Yi = 0 + 1Xi1 + + pXi,p1 + in Errors N(0, 2I).n What can go wrong?
u Regression function can be wrong missing predictors,nonlinear.
u Assumptions about the errors can be wrong.u Outliers & influential observations: both in predictors and
observations.
Problems in the regression function
- The true regression function may have higher-order non-linear terms, e.g. X_1^2, or even interactions X_1 \cdot X_2.
- How to fix this? Difficult in general; we will look at two plots: added-variable plots and partial residual plots.
Partial residual plot
- For 1 \leq j \leq p - 1, let

      e_{ij} = e_i + \widehat{\beta}_j X_{ij}.

- Plotting e_{ij} against X_{ij} can help to determine whether the variance depends on X_j, and to spot outliers.
- If there is a non-linear trend, it is evidence that a linear term is not sufficient.
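As a concrete illustration, the partial residuals follow directly from an ordinary least squares fit. A sketch on simulated data (all names and the simulated coefficients are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 predictors
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

# Partial residuals for predictor j: e_{ij} = e_i + beta_hat_j * X_{ij}
j = 1
partial_resid = resid + beta_hat[j] * X[:, j]
# Plotting partial_resid against X[:, j] shows the fitted linear
# contribution of X_j plus the residual noise around it.
```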
Added-variable plot
- For 1 \leq j \leq p - 1, let H_{(j)} be the hat matrix of the model with this predictor deleted. Plot

      (I - H_{(j)}) Y   vs.   (I - H_{(j)}) X_j.

- The plot should be linear and the slope should be \beta_j. Why?

      Y = X_{(j)} \beta_{(j)} + \beta_j X_j + \varepsilon
      (I - H_{(j)}) Y = (I - H_{(j)}) X_{(j)} \beta_{(j)} + \beta_j (I - H_{(j)}) X_j + (I - H_{(j)}) \varepsilon
      (I - H_{(j)}) Y = \beta_j (I - H_{(j)}) X_j + (I - H_{(j)}) \varepsilon

  since (I - H_{(j)}) X_{(j)} = 0.

- Also helpful for detecting outliers.
- If there is a non-linear trend, it is evidence that a linear term is not sufficient.
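The slope claim can be checked numerically: residualize both Y and X_j on the remaining predictors, then regress one on the other. A numpy sketch on simulated data (names and coefficients are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

j = 2
X_minus = np.delete(X, j, axis=1)                      # design with X_j removed
H_minus = X_minus @ np.linalg.solve(X_minus.T @ X_minus, X_minus.T)
ry = y - H_minus @ y                                   # (I - H_(j)) Y
rx = X[:, j] - H_minus @ X[:, j]                       # (I - H_(j)) X_j

slope = (rx @ ry) / (rx @ rx)            # least-squares slope of ry on rx
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
# slope matches the full-model coefficient for X_j
```

This is the Frisch-Waugh-Lovell result behind the added-variable plot: the two residualized vectors carry exactly the information the full fit uses for \beta_j.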
Problems with the errors
- Errors may not be normally distributed. We will look at a QQ-plot as a graphical check. Non-normality may not affect inference much in large samples.
- The variance may not be constant. Transformations can sometimes correct this. Non-constant variance affects our estimates of SE(\widehat{\beta}), which can change t and F statistics substantially!
- Graphical checks for non-constant variance: added-variable plots, partial residual plots, fitted-vs-residual plots.
- Errors may not be independent. This can seriously affect our estimates of SE(\widehat{\beta}).
Outliers & Influence
- Some residuals may be much larger than others, which can affect the overall fit of the model. This may be evidence of an outlier: a point where the model fits very poorly. Outliers can be caused by many factors, and such points should not be automatically deleted from the dataset.
- Even if an observation does not have a large residual, it can exert a strong influence on the regression function.
- General strategy for measuring influence: for each observation, drop it from the model and measure how much the model changes.
Dropping an observation
- A subscript (i) indicates that the i-th observation was not used in fitting the model.
- For example, \widehat{Y}_{j(i)} is the regression function evaluated at the j-th observation's predictors, BUT with coefficients (\widehat{\beta}_{0,(i)}, \dots, \widehat{\beta}_{p-1,(i)}) fit after deleting the i-th row of the data.
- Basic idea: if \widehat{Y}_{j(i)} is very different from \widehat{Y}_j (fit using all the data), then observation i is an influential point for determining \widehat{Y}_j.
Different residuals
- Ordinary residuals: e_i = Y_i - \widehat{Y}_i.
- Standardized residuals: r_i = e_i / (\widehat{\sigma} \sqrt{1 - H_{ii}}), where H is the hat matrix. (rstandard in R)
- Studentized residuals: t_i = e_i / (\widehat{\sigma}_{(i)} \sqrt{1 - H_{ii}}) \sim t_{n-p-1}. (rstudent in R)
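All three residuals can be computed from the hat matrix, and \widehat{\sigma}_{(i)} has a closed form, so no refitting is needed. A numpy sketch on simulated data (names and the leave-one-out identity used are standard, but not spelled out on the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)        # hat matrix
h = np.diag(H)
e = y - H @ y                                # ordinary residuals
sigma2 = e @ e / (n - p)

r = e / np.sqrt(sigma2 * (1 - h))            # standardized (rstandard)

# Leave-one-out variance via the standard identity, equivalent to
# refitting the model without observation i:
sigma2_loo = ((n - p) * sigma2 - e**2 / (1 - h)) / (n - p - 1)
t = e / np.sqrt(sigma2_loo * (1 - h))        # studentized (rstudent)
```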
Crude outlier detection test
- If a studentized residual is large, the observation may be an outlier.
- Problem: if n is large and we threshold at t_{1-\alpha/2, n-p-1}, we will flag many outliers by chance even if the model is correct.
- Solution: Bonferroni correction, i.e. threshold at t_{1-\alpha/(2n), n-p-1}.
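The two thresholds can be compared directly. A sketch using scipy.stats (assumed available; the values of n, p, and \alpha are illustrative):

```python
import numpy as np
from scipy import stats

n, p = 100, 4
alpha = 0.05
df = n - p - 1

# Uncorrected threshold: each of the n studentized residuals is tested
# at level alpha, so roughly alpha * n flags occur by chance alone.
t_naive = stats.t.ppf(1 - alpha / 2, df)

# Bonferroni: test each residual at level alpha / n instead.
t_bonf = stats.t.ppf(1 - alpha / (2 * n), df)
# t_bonf > t_naive, so far fewer observations are flagged by chance.
```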
Bonferroni correction for multiple comparisons
- If we are doing many t (or other) tests, say m > 1, we can control the overall false positive rate at \alpha by testing each one at level \alpha/m.
- Proof:

      P(at least one false positive)
        = P\left( \bigcup_{i=1}^{m} \{ |T_i| \geq t_{1-\alpha/(2m), n-p-2} \} \right)
        \leq \sum_{i=1}^{m} P\left( |T_i| \geq t_{1-\alpha/(2m), n-p-2} \right)
        = \sum_{i=1}^{m} \frac{\alpha}{m} = \alpha.
DFFITS
- DFFITS_i = \dfrac{\widehat{Y}_i - \widehat{Y}_{i(i)}}{\widehat{\sigma}_{(i)} \sqrt{H_{ii}}}
- This quantity measures how much the regression function changes at the i-th observation when the i-th observation is deleted.
- For small/medium datasets, a value of 1 or greater is considered suspicious; for large datasets, a value of 2\sqrt{p/n}.
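DFFITS has a closed form in the residuals and leverages, so the n leave-one-out fits are not actually needed. A numpy sketch on simulated data with one perturbed observation (names, seed, and the perturbation are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)
y[0] += 8.0                                  # make observation 0 influential

H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)
e = y - H @ y
sigma2 = e @ e / (n - p)
sigma2_loo = ((n - p) * sigma2 - e**2 / (1 - h)) / (n - p - 1)

# DFFITS_i = (Yhat_i - Yhat_{i(i)}) / (sigma_(i) sqrt(H_ii)); using
# Yhat_i - Yhat_{i(i)} = h_i e_i / (1 - h_i), this simplifies to:
dffits = e * np.sqrt(h) / (np.sqrt(sigma2_loo) * (1 - h))
```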
Cook's distance

- D_i = \dfrac{\sum_{j=1}^{n} (\widehat{Y}_j - \widehat{Y}_{j(i)})^2}{p \, \widehat{\sigma}^2}
- This quantity measures how much the entire regression function changes when the i-th observation is deleted.
- Should be comparable to F_{p, n-p}: if the p-value of D_i is 50 percent or more, then the i-th point is likely influential and should be investigated further.
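Cook's distance likewise collapses to a function of the residual and leverage, and each D_i can be checked against an actual leave-one-out refit. A numpy sketch on simulated data (names illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)
e = y - H @ y
sigma2 = e @ e / (n - p)

# D_i = sum_j (Yhat_j - Yhat_{j(i)})^2 / (p sigma^2), which reduces to
# a function of e_i and h_i alone:
D = e**2 * h / (p * sigma2 * (1 - h)**2)
```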
DFBETAS
- DFBETAS_{j(i)} = \dfrac{\widehat{\beta}_j - \widehat{\beta}_{j(i)}}{\sqrt{\widehat{\sigma}^2_{(i)} \, (X^T X)^{-1}_{jj}}}
- This quantity measures how much the j-th coefficient changes when the i-th observation is deleted.
- For small/medium datasets, a value of 1 or greater is suspicious; for large datasets, a value of 2/\sqrt{n}.
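The leave-one-out coefficients follow from a rank-one downdate of (X^T X)^{-1}, so DFBETAS also needs no refitting. A numpy sketch on simulated data (names illustrative; the downdate identity is standard but not stated on the slides):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
H = X @ XtX_inv @ X.T
h = np.diag(H)
e = y - H @ y
sigma2 = e @ e / (n - p)
sigma2_loo = ((n - p) * sigma2 - e**2 / (1 - h)) / (n - p - 1)
beta_hat = XtX_inv @ X.T @ y

# beta_(i) = beta_hat - (X^T X)^{-1} x_i e_i / (1 - h_i)   (rank-one downdate)
i, j = 0, 1
beta_loo = beta_hat - XtX_inv @ X[i] * e[i] / (1 - h[i])
dfbetas = (beta_hat[j] - beta_loo[j]) / np.sqrt(sigma2_loo[i] * XtX_inv[j, j])
```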
n Here is an example.