Statistics 203: Introduction to Regression and Analysis of Variance
Multiple Linear Regression: Diagnostics
Jonathan Taylor


  • Today
  • Spline models
  • What are the assumptions?
  • Problems in the regression function
  • Partial residual plot
  • Added-variable plot
  • Problems with the errors
  • Outliers & Influence
  • Dropping an observation
  • Different residuals
  • Crude outlier detection test
  • Bonferroni correction for multiple comparisons
  • DFFITS
  • Cook's distance
  • DFBETAS

    - p. 2/16

    Today

  • Splines + other bases.
  • Diagnostics


    - p. 3/16

    Spline models

  • Splines are piecewise polynomial functions, i.e. on an interval between knots (t_i, t_{i+1}) the spline f(x) is polynomial, but the coefficients change from interval to interval.

  • Example: cubic spline with knots at t_1 < t_2 < \dots < t_h:

    f(x) = \sum_{j=0}^{3} \beta_{0j} x^j + \sum_{i=1}^{h} \beta_i (x - t_i)_+^3

    where

    (x - t_i)_+ = \begin{cases} x - t_i & \text{if } x - t_i \ge 0 \\ 0 & \text{otherwise.} \end{cases}

  • Here is an example.
  • Conditioning problem again: B-splines are used to keep the model subspace the same but make the design less ill-conditioned.
  • Other bases one might use: Fourier (sin and cos waves); wavelets (space/time localized bases for functions).
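The truncated power basis above can be written down directly; a minimal numpy sketch (the knot locations and test signal are illustrative, not from the slides):

```python
import numpy as np

def truncated_power_basis(x, knots):
    """Cubic-spline design matrix in the truncated power basis:
    columns 1, x, x^2, x^3, then (x - t_i)_+^3 for each knot t_i."""
    x = np.asarray(x, dtype=float)
    cols = [x**j for j in range(4)]
    for t in knots:
        cols.append(np.clip(x - t, 0.0, None) ** 3)
    return np.column_stack(cols)

# Fit a cubic spline to a toy signal by least squares.
x = np.linspace(0, 1, 200)
y = np.sin(2 * np.pi * x)
X = truncated_power_basis(x, knots=[0.25, 0.5, 0.75])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
```

In practice one fits in a B-spline basis (e.g. R's bs()) for exactly the conditioning reason above: the truncated power columns become nearly collinear as knots accumulate, even though both bases span the same model subspace.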


    - p. 4/16

    What are the assumptions?

  • What is the full model for a given design matrix X?

    Y_i = \beta_0 + \beta_1 X_{i1} + \dots + \beta_{p-1} X_{i,p-1} + \varepsilon_i

  • Errors: \varepsilon \sim N(0, \sigma^2 I).
  • What can go wrong?
    - Regression function can be wrong: missing predictors, nonlinearity.
    - Assumptions about the errors can be wrong.
    - Outliers & influential observations: both in predictors and observations.


    - p. 5/16

    Problems in the regression function

  • The true regression function may have higher-order nonlinear terms, e.g. X_1^2, or even interactions X_1 \cdot X_2.
  • How to fix? Difficult in general; we will look at two plots: added-variable plots and partial residual plots.


    - p. 6/16

    Partial residual plot

  • For 1 \le j \le p-1, let

    e_{ij} = e_i + \hat\beta_j X_{ij}.

  • Can help to determine whether the variance depends on X_j, and to spot outliers.
  • If there is a non-linear trend, it is evidence that a linear term is not sufficient.
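A partial residual plot comes for free from a single fit; a sketch on simulated data (the design and coefficients are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta                       # ordinary residuals

# Partial residuals for predictor j: e_ij = e_i + beta_j * X_ij.
j = 1
partial_resid = e + beta[j] * X[:, j]
# Plot partial_resid against X[:, j]; because e is orthogonal to the
# columns of X, the least-squares slope through this cloud is beta_j.
```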


    - p. 7/16

    Added-variable plot

  • For 1 \le j \le p-1, let H_{(j)} be the hat matrix with this predictor deleted. Plot

    (I - H_{(j)})Y \quad \text{vs.} \quad (I - H_{(j)})X_j.

  • The plot should be linear and the slope should be \beta_j. Why?

    Y = X_{(j)}\beta_{(j)} + \beta_j X_j + \varepsilon
    (I - H_{(j)})Y = (I - H_{(j)})X_{(j)}\beta_{(j)} + \beta_j (I - H_{(j)})X_j + (I - H_{(j)})\varepsilon
    (I - H_{(j)})Y = \beta_j (I - H_{(j)})X_j + (I - H_{(j)})\varepsilon

    since (I - H_{(j)}) annihilates the columns of X_{(j)}.

  • Also can be helpful for detecting outliers.
  • If there is a non-linear trend, it is evidence that a linear term is not sufficient.
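The projection algebra above can be checked numerically; a sketch with a made-up two-predictor design:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x1, x2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

def hat(M):
    """Hat (projection) matrix onto the column space of M."""
    return M @ np.linalg.solve(M.T @ M, M.T)

# Added-variable plot for x2: residualize both y and x2 against the
# remaining columns, then plot one set of residuals against the other.
R = np.eye(n) - hat(X[:, :2])        # I - H_(j), with x2 deleted
y_res, x_res = R @ y, R @ x2
slope = (x_res @ y_res) / (x_res @ x_res)   # slope of the plot
```

The slope recovered here equals the coefficient on x2 from the full fit, which is exactly the claim the derivation above makes.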


    - p. 8/16

    Problems with the errors

  • Errors may not be normally distributed. We will look at a QQ plot for a graphical check. This may not affect inference in large samples.
  • Variance may not be constant. Transformations can sometimes help correct this. Non-constant variance affects our estimates of SE(\hat\beta), which can change t and F statistics substantially!
  • Graphical checks of non-constant variance: added-variable plots, partial residual plots, fitted vs. residual plots.
  • Errors may not be independent. This can seriously affect our estimates of SE(\hat\beta).


    - p. 9/16

    Outliers & Influence

  • Some residuals may be much larger than others, which can affect the overall fit of the model. This may be evidence of an outlier: a point where the model fits very poorly. This can be caused by many factors, and such points should not be automatically deleted from the dataset.
  • Even if an observation does not have a large residual, it can exert a strong influence on the regression function.
  • General strategy to measure influence: for each observation, drop it from the model and measure how much the model changes.


    - p. 10/16

    Dropping an observation

  • A subscript (i) indicates that the i-th observation was not used in fitting the model.
  • For example: \hat Y_{j(i)} is the regression function evaluated at the j-th observation's predictors, BUT the coefficients (\hat\beta_{0(i)}, \dots, \hat\beta_{p-1,(i)}) were fit after deleting the i-th row of data.
  • Basic idea: if \hat Y_{j(i)} is very different from \hat Y_j (using all the data), then observation i is an influential point for determining \hat Y_j.


    - p. 11/16

    Different residuals

  • Ordinary residuals: e_i = Y_i - \hat Y_i.
  • Standardized residuals: r_i = e_i / s(e_i) = e_i / (\hat\sigma \sqrt{1 - H_{ii}}), where H is the hat matrix. (rstandard in R)
  • Studentized residuals: t_i = e_i / (\hat\sigma_{(i)} \sqrt{1 - H_{ii}}) \sim t_{n-p-1}. (rstudent in R)
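All three kinds of residuals come from a single fit via the hat matrix; a numpy sketch on simulated data, using the standard identity that expresses t_i through r_i so no leave-one-out refits are needed:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)                                  # leverages H_ii
e = y - H @ y                                   # ordinary residuals
s2 = e @ e / (n - p)                            # sigma-hat^2 (MSE)
r = e / np.sqrt(s2 * (1 - h))                   # standardized (rstandard)
# Studentized residuals re-estimate sigma without observation i;
# equivalently t_i = r_i * sqrt((n - p - 1) / (n - p - r_i^2)).
t = r * np.sqrt((n - p - 1) / (n - p - r**2))   # studentized (rstudent)
```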


    - p. 12/16

    Crude outlier detection test

  • If a studentized residual is large, the observation may be an outlier.
  • Problem: if n is large and we threshold at t_{1-\alpha/2,\, n-p-1}, we will get many "outliers" by chance even if the model is correct.
  • Solution: Bonferroni correction, threshold at t_{1-\alpha/(2n),\, n-p-1}.
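With scipy's t quantiles the two cutoffs are one line each; a sketch (the values of n, p, alpha are illustrative):

```python
from scipy.stats import t as t_dist

def outlier_cutoffs(n, p, alpha=0.05):
    """Naive vs. Bonferroni-corrected cutoffs for studentized residuals."""
    naive = t_dist.ppf(1 - alpha / 2, df=n - p - 1)
    bonferroni = t_dist.ppf(1 - alpha / (2 * n), df=n - p - 1)
    return naive, bonferroni

# With n = 100 observations the per-test cutoff is near 2, while the
# Bonferroni cutoff is considerably larger.
naive, bonf = outlier_cutoffs(n=100, p=3, alpha=0.05)
```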


    - p. 13/16

    Bonferroni correction for multiple comparisons

  • If we are doing many t (or other) tests, say m > 1, we can control the overall false positive rate at \alpha by testing each one at level \alpha/m.
  • Proof:

    P(\text{at least one false positive})
    = P\left( \bigcup_{i=1}^m \left\{ |T_i| \ge t_{1-\alpha/(2m),\, n-p-2} \right\} \right)
    \le \sum_{i=1}^m P\left( |T_i| \ge t_{1-\alpha/(2m),\, n-p-2} \right)
    = \sum_{i=1}^m \frac{\alpha}{m} = \alpha.


    - p. 14/16

    DFFITS

  •

    DFFITS_i = \frac{\hat Y_i - \hat Y_{i(i)}}{\hat\sigma_{(i)} \sqrt{H_{ii}}}

  • This quantity measures how much the regression function changes at the i-th observation when the i-th observation is deleted.
  • For small/medium datasets: a value of 1 or greater is considered suspicious. For large datasets: a value of 2\sqrt{p/n}.
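DFFITS need not be computed by n refits: the standard identity DFFITS_i = t_i \sqrt{H_{ii}/(1 - H_{ii})} gives all of them from one fit. A sketch on simulated data:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)
e = y - H @ y
s2 = e @ e / (n - p)
r = e / np.sqrt(s2 * (1 - h))
t = r * np.sqrt((n - p - 1) / (n - p - r**2))   # studentized residuals

# DFFITS_i = t_i * sqrt(h_i / (1 - h_i)): no leave-one-out refits needed.
dffits = t * np.sqrt(h / (1 - h))
flagged = np.abs(dffits) > 2 * np.sqrt(p / n)   # large-dataset rule of thumb
```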


    - p. 15/16

Cook's distance

  •

    D_i = \frac{\sum_{j=1}^n (\hat Y_j - \hat Y_{j(i)})^2}{p\, \hat\sigma^2}

  • This quantity measures how much the entire regression function changes when the i-th observation is deleted.
  • Should be comparable to F_{p,\, n-p}: if the "p-value" of D_i is 50 percent or more, then the i-th point is likely influential: investigate this point further.
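Cook's distance also has a closed form from a single fit, D_i = r_i^2 H_{ii} / (p (1 - H_{ii})), with r_i the standardized residual; a sketch on simulated data:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)
e = y - H @ y
s2 = e @ e / (n - p)
r = e / np.sqrt(s2 * (1 - h))                   # standardized residuals

# D_i = r_i^2 * h_i / (p * (1 - h_i)) equals the leave-one-out sum
# sum_j (yhat_j - yhat_{j(i)})^2 / (p * sigma-hat^2) without n refits.
cooks_d = r**2 * h / (p * (1 - h))
```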


    - p. 16/16

    DFBETAS

  •

    DFBETAS_{j(i)} = \frac{\hat\beta_j - \hat\beta_{j(i)}}{\sqrt{\hat\sigma^2_{(i)}\, (X^T X)^{-1}_{jj}}}

  • This quantity measures how much the coefficient \hat\beta_j changes when the i-th observation is deleted.
  • For small/medium datasets: a value of 1 or greater is suspicious. For large datasets: a value of 2/\sqrt{n}.
  • Here is an example.
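All n \times p DFBETAS values follow from the rank-one update \hat\beta - \hat\beta_{(i)} = (X^T X)^{-1} x_i\, e_i / (1 - H_{ii}), again with no refitting; a sketch on simulated data:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)     # leverages H_ii
beta = XtX_inv @ X.T @ y
e = y - X @ beta
s2 = e @ e / (n - p)
s2_loo = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)   # sigma^2_(i)

# beta - beta_(i) = XtX_inv @ x_i * e_i / (1 - h_i) (rank-one update),
# so every DFBETAS_{j(i)} comes from one fit; dfbetas has shape (p, n).
delta = (XtX_inv @ X.T) * (e / (1 - h))
dfbetas = delta / np.sqrt(s2_loo * np.diag(XtX_inv)[:, None])
```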
