Topic 18: Model Selection and Diagnostics


Page 1: Topic 18: Model Selection and Diagnostics

Topic 18: Model Selection and Diagnostics

Page 2: Topic 18: Model Selection and Diagnostics

Variable Selection

• We want to choose a “best” model that is a subset of the available explanatory variables

• Two separate problems:

  – How many explanatory variables should we use (i.e., subset size)

  – Given the subset size, which variables should we choose

Page 3: Topic 18: Model Selection and Diagnostics

KNNL Example

• Page 350, Section 9.2
• Y: survival time of patient (liver operation)
• X's (explanatory variables) are:
  – Blood clotting score
  – Prognostic index
  – Enzyme function test
  – Liver function test

Page 4: Topic 18: Model Selection and Diagnostics

KNNL Example cont.

• n = 54 patients

• Start with the usual plots and descriptive statistics

• Time-to-event data is often heavily skewed and typically transformed with a log

Page 5: Topic 18: Model Selection and Diagnostics

Data

Data a1;
  infile 'U:\.www\datasets512\CH09TA01.txt' delimiter='09'x;
  input blood prog enz liver age gender alcmod alcheavy surv lsurv;
run;

Notes: the file is tab delimited (delimiter='09'x); alcmod and alcheavy are dummy variables for alcohol use; lsurv = ln(surv).

Page 6: Topic 18: Model Selection and Diagnostics

Obs  blood  prog  enz  liver  age  gender  alcmod  alcheavy  surv  lsurv
  1    6.7    62   81   2.59   50       0       1         0   695  6.544
  2    5.1    59   66   1.70   39       0       0         0   403  5.999
  3    7.4    57   83   2.16   55       0       0         0   710  6.565
  4    6.5    73   41   2.01   48       0       0         0   349  5.854
  5    7.8    65  115   4.30   45       0       0         1  2343  7.759
  6    5.8    38   72   1.42   65       1       1         0   348  5.852


Pages 7–10: Topic 18: Model Selection and Diagnostics (figure-only slides; images not reproduced)
Page 11: Topic 18: Model Selection and Diagnostics

Log Transform of Y

• Recall that the regression model does not require Y to be Normally distributed

• In this case, the transform reduces the influence of the long right tail and often stabilizes the variance of the residuals

Page 12: Topic 18: Model Selection and Diagnostics

Scatterplots

proc corr plot=matrix;
  var blood prog enz liver;
run;

proc corr plot=scatter;
  var blood prog enz liver;
  with lsurv;
run;

Pages 13–17: Topic 18: Model Selection and Diagnostics (figure-only slides; images not reproduced)
Page 18: Topic 18: Model Selection and Diagnostics

Correlation Summary

Pearson Correlation Coefficients, N = 54
Prob > |r| under H0: Rho=0

        blood    prog     enz      liver
lsurv   0.24619  0.46994  0.65389  0.64926
        0.0727   0.0003   <.0001   <.0001

Page 19: Topic 18: Model Selection and Diagnostics

The Two Problems in Variable Selection

1. To determine an appropriate subset size

   – Might use adjusted R2, Cp, MSE, PRESS, AIC, SBC (BIC)

2. To determine the best model of this fixed size

   – Might use R2

Page 20: Topic 18: Model Selection and Diagnostics

Adjusted R2

• R2 by its construction is guaranteed to increase with p

  – SSE cannot increase with an additional X, and SSTO stays constant

• Adjusted R2 uses degrees of freedom to account for p

  R2a = 1 − [(n − 1)/(n − p)]·(SSE/SSTO) = 1 − MSE/MSTO

Page 21: Topic 18: Model Selection and Diagnostics

Adjusted R2

• Want to find model that maximizes • Since MSTO will remain constant for a given

data set – Depends only on Y

• Equivalent information to MSE• Thus could also find choice of model that

minimizes MSE• Details on pages 354-356

2aR
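As a quick numeric illustration (a sketch not taken from the slides, in Python rather than the deck's SAS, with hypothetical SSE/SSTO values), adding a nearly useless X raises R2 but can lower R2a:

```python
# Sketch: R2 always rises as predictors are added; adjusted R2 need not.
# R2a = 1 - ((n - 1)/(n - p)) * SSE/SSTO = 1 - MSE/MSTO

def r2(sse, ssto):
    return 1 - sse / ssto

def adjusted_r2(sse, ssto, n, p):
    return 1 - ((n - 1) / (n - p)) * (sse / ssto)

n, ssto = 54, 12.0   # hypothetical values; n matches the surgical example
# Adding one more X drops SSE only from 3.00 to 2.99 but costs a df:
print(r2(3.00, ssto), adjusted_r2(3.00, ssto, n, p=4))
print(r2(2.99, ssto), adjusted_r2(2.99, ssto, n, p=5))  # R2 up, R2a down
```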

Page 22: Topic 18: Model Selection and Diagnostics

Cp Criterion

• The basic idea is to compare subset models with the full model

• A subset model is good if there is not substantial “bias” in the predicted values (relative to the full model)

• Looks at the ratio of total mean squared error and the true error variance

• See pages 357–359 for details

Page 23: Topic 18: Model Selection and Diagnostics

Cp Criterion

Cp = SSEp/MSE(Full) − (n − 2p)

  – SSEp is based on a specific choice of p − 1 explanatory variables

  – MSE(Full) is based on the full set of variables

  – Select the full set and Cp = (n − p) − (n − 2p) = p
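A minimal check of that identity, as a Python sketch (the MSE value is hypothetical; n = 54 as in the surgical example):

```python
# Sketch: Mallows' Cp = SSE_p / MSE(full) - (n - 2p).
def mallows_cp(sse_p, mse_full, n, p):
    return sse_p / mse_full - (n - 2 * p)

n = 54            # cases, as in the surgical example
p_full = 5        # intercept + 4 explanatory variables
mse_full = 2.0    # hypothetical full-model MSE

# For the full model SSE = (n - p) * MSE(full), so Cp collapses to p:
sse_full = (n - p_full) * mse_full
print(mallows_cp(sse_full, mse_full, n, p_full))  # -> 5.0
```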

Page 24: Topic 18: Model Selection and Diagnostics

Use of Cp

• p is the number of regression coefficients, including the intercept

• A model is good according to this criterion if Cp ≤ p

• Rule: pick the smallest model for which Cp is smaller than p, or pick the model that minimizes Cp, provided the minimum Cp is smaller than p

Page 25: Topic 18: Model Selection and Diagnostics

SBC (BIC) and AIC

Criterion based on the log(likelihood) plus a penalty for more complexity

• AIC – minimize  n·log(SSEp/n) + 2p

• SBC – minimize  n·log(SSEp/n) + p·log(n)
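As a Python sketch (SSE values hypothetical): with n = 54, SBC's per-parameter penalty log(54) ≈ 3.99 exceeds AIC's 2, so the two criteria can disagree:

```python
import math

# Sketch: AIC and SBC as on the slide; smaller is better for both.
def aic(sse_p, n, p):
    return n * math.log(sse_p / n) + 2 * p

def sbc(sse_p, n, p):
    return n * math.log(sse_p / n) + p * math.log(n)

n = 54
# Model A: p = 4, SSE = 3.30.  Model B: p = 5, SSE = 3.10.
print(aic(3.30, n, 4), aic(3.10, n, 5))  # AIC prefers the larger model B
print(sbc(3.30, n, 4), sbc(3.10, n, 5))  # SBC prefers the smaller model A
```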

Page 26: Topic 18: Model Selection and Diagnostics

Other approaches

• PRESS (prediction SS)

  – For each case i, delete the case and predict Yi using a model based on the other n − 1 cases

  – Look at the SS for observed minus predicted

  – Want to minimize the PRESS
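For linear regression the deletion never has to be done literally: the deleted residual is di = ei/(1 − hii), so PRESS = Σdi². A Python sketch with toy data (not the KNNL data):

```python
# Sketch: PRESS for a simple linear regression, using d_i = e_i/(1 - h_ii)
# instead of refitting the model n times.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.2, 3.9, 5.2, 9.0]   # toy data; last point is off-trend
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

press = 0.0
sse = 0.0
for xi, yi in zip(x, y):
    e = yi - (b0 + b1 * xi)                # ordinary residual
    h = 1.0 / n + (xi - xbar) ** 2 / sxx   # leverage of this case
    sse += e ** 2
    press += (e / (1 - h)) ** 2            # deleted residual, squared

print(sse, press)   # PRESS is never smaller than SSE
```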

Page 27: Topic 18: Model Selection and Diagnostics

Variable Selection

• Additional proc reg model statement options useful in variable selection:

  – INCLUDE=n forces the first n explanatory variables into all models

  – BEST=n limits the output to the best n models of each subset size or in total

  – START=n limits output to models that include at least n explanatory variables

Page 28: Topic 18: Model Selection and Diagnostics

Variable Selection

• Step type procedures

–Forward selection (Step up)

–Backward elimination (Step down)

–Stepwise (forward selection with a backward glance)

• Very popular, but much better search techniques (such as BEST=) are now available

Page 29: Topic 18: Model Selection and Diagnostics

Ordering models of the same subset size

• Use R2 or SSE

• This approach can lead us to consider several models that give approximately the same predicted values

• May need to apply knowledge of the subject matter to make a final selection

• Not that important if prediction is the key goal

Page 30: Topic 18: Model Selection and Diagnostics

Proc Reg

proc reg data=a1;
  model lsurv=blood prog enz liver/ selection=rsquare cp aic sbc b best=3;
run;

Page 31: Topic 18: Model Selection and Diagnostics

Selection Results

Number in
Model    R-Square   C(p)       AIC         SBC
1        0.4276     66.4889    -103.8269   -99.84889
1        0.4215     67.7148    -103.2615   -99.28357
1        0.2208     108.5558   -87.1781    -83.20011
2        0.6633     20.5197    -130.4833   -124.51634
2        0.5995     33.5041    -121.1126   -115.14561
2        0.5486     43.8517    -114.6583   -108.69138
3        0.7573     3.3905     -146.1609   -138.20494
3        0.7178     11.4237    -138.0232   -130.06723
3        0.6121     32.9320    -120.8442   -112.88823
4        0.7592     5.0000     -144.5895   -134.64461

Page 32: Topic 18: Model Selection and Diagnostics

Selection Results: Parameter Estimates

Number in
Model    Intercept   blood     prog      enz       liver
1        5.26426     .         .         0.01512   .
1        5.61218     .         .         .         0.29819
1        5.56613     .         0.01367   .         .
2        4.35058     .         0.01412   0.01539   .
2        5.02818     .         .         0.01073   0.20945
2        4.54623     0.10792   .         0.01634   .
3        3.76618     0.09546   0.01334   0.01645   .
3        4.40582     .         0.01101   0.01261   0.12977
3        4.78168     0.04482   .         0.01220   0.16360
4        3.85195     0.08368   0.01266   0.01563   0.03216

Page 33: Topic 18: Model Selection and Diagnostics

Proc Reg

proc reg data=a1;
  model lsurv=blood prog enz liver/ selection=cp aic sbc b best=3;
run;

Page 34: Topic 18: Model Selection and Diagnostics

Selection Results

Number in
Model    C(p)      R-Square   AIC         SBC
3        3.3905    0.7573     -146.1609   -138.20494
4        5.0000    0.7592     -144.5895   -134.64461
3        11.4237   0.7178     -138.0232   -130.06723

3 11.4237 0.7178 -138.0232 -130.06723

WARNING: “selection=cp” just lists the models in order of lowest C(p), regardless of whether each model is good or not

Page 35: Topic 18: Model Selection and Diagnostics

How to Choose with C(p)

1. Want small C(p)

2. Want C(p) near p

In the original paper, it was suggested to plot C(p) versus p and consider the smallest model that satisfies these criteria

Determining “near” can be somewhat subjective

Page 36: Topic 18: Model Selection and Diagnostics

Proc Reg

proc reg data=a1 outest=b1;
  model lsurv=blood prog enz liver/ selection=rsquare cp aic sbc b;
run;
quit;

symbol1 v=circle i=none;
symbol2 v=none i=join;
proc gplot data=b1;
  plot _Cp_*_P_ _P_*_P_ / overlay;
run;

Creates data set with estimates & criteria

Page 37: Topic 18: Model Selection and Diagnostics

Mallows C(p)

[Figure: Mallows C(p), on a scale of 0 to 150, plotted against the number of parameters in the model (2 to 5), with the annotation “Start to approach C(p)=p line here”]

Page 38: Topic 18: Model Selection and Diagnostics

Model Validation

• Since the data were used to generate the parameter estimates, you would expect the model to predict those same Y's well

• Want to check model predictive ability for a separate data set

• Various techniques of cross validation (data split, leave one out) are possible

Page 39: Topic 18: Model Selection and Diagnostics

Regression Diagnostics

• Partial regression plots
• Studentized deleted residuals
• Hat matrix diagonals
• Dffits, Cook's D, DFBETAS
• Variance inflation factor
• Tolerance

Page 40: Topic 18: Model Selection and Diagnostics

KNNL Example

• Page 386, Section 10.1

• Y is amount of life insurance

• X1 is average annual income

• X2 is a risk aversion score

• n = 18 managers

Page 41: Topic 18: Model Selection and Diagnostics

Read in the data set

data a1;
  infile '../data/ch10ta01.txt';
  input income risk insur;
run;

Page 42: Topic 18: Model Selection and Diagnostics (figure-only slide; image not reproduced)
Page 43: Topic 18: Model Selection and Diagnostics

Partial regression plots

• Also called added variable plots or adjusted variable plots

• One plot for each Xi

Page 44: Topic 18: Model Selection and Diagnostics

Partial regression plots

• These plots show the strength of the marginal relationship between Y and Xi in the full model

• They can also detect:

  – Nonlinear relationships

  – Heterogeneous variances

  – Outliers

Page 45: Topic 18: Model Selection and Diagnostics

Partial regression plots

• Consider the plot for X1:

  – Use the other X's to predict Y

  – Use the other X's to predict X1

  – Plot the residuals from the first regression vs the residuals from the second regression
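A Python sketch of these three steps with toy numbers (not the insurance data): by the Frisch–Waugh result, the slope through the paired residuals equals the coefficient of X1 in the full multiple regression.

```python
# Sketch: added-variable (partial regression) plot coordinates for X1.
y  = [3.0, 5.0, 4.0, 7.0, 8.0, 9.0, 6.0, 10.0]   # toy response
x1 = [1.0, 2.0, 2.0, 3.0, 4.0, 5.0, 3.0, 6.0]    # toy predictor of interest
x2 = [2.0, 1.0, 3.0, 2.0, 4.0, 3.0, 5.0, 4.0]    # the "other" predictor
n = len(y)

def resid_simple(v, u):
    """Residuals from the simple regression of v on u (with intercept)."""
    ub, vb = sum(u) / len(u), sum(v) / len(v)
    slope = (sum((a - ub) * (b - vb) for a, b in zip(u, v))
             / sum((a - ub) ** 2 for a in u))
    return [b - (vb + slope * (a - ub)) for a, b in zip(u, v)]

ey = resid_simple(y, x2)    # step 1: Y adjusted for the other X's
e1 = resid_simple(x1, x2)   # step 2: X1 adjusted for the other X's

# Step 3: the plot is ey versus e1; its least-squares slope:
slope_avp = sum(a * b for a, b in zip(e1, ey)) / sum(a * a for a in e1)

# Coefficient of X1 in the full regression of Y on X1, X2 (normal equations):
m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
s11 = sum((a - m1) ** 2 for a in x1)
s22 = sum((b - m2) ** 2 for b in x2)
s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
b1_full = (s22 * s1y - s12 * s2y) / (s11 * s22 - s12 ** 2)

print(slope_avp, b1_full)   # the two agree up to rounding
```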

Page 46: Topic 18: Model Selection and Diagnostics

The partial option with proc reg and plots=

proc reg data=a1 plots=partialplot;
  model insur=income risk /partial;
run;

Page 47: Topic 18: Model Selection and Diagnostics

Output

Analysis of Variance

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2   173919           86960         542.33    <.0001
Error             15   2405.1476        160.3431
Corrected Total   17   176324

Root MSE        12.66267    R-Square   0.9864
Dependent Mean  134.44444   Adj R-Sq   0.9845
Coeff Var       9.41851

Page 48: Topic 18: Model Selection and Diagnostics

Output

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Tolerance
Intercept    1   -205.71866           11.39268         -18.06    <.0001     .
income       1   6.28803              0.20415          30.80     <.0001     0.93524
risk         1   4.73760              1.37808          3.44      0.0037     0.93524

Page 49: Topic 18: Model Selection and Diagnostics

Output

• The partial option on the model statement in proc reg generates graphs in the output window

• These are OK for some purposes, but we prefer better-looking plots

• To generate these plots, we follow the regression steps outlined earlier and use gplot, or use plots=partialplot

Page 50: Topic 18: Model Selection and Diagnostics

Partial regression plots

*partial regression plot for risk;
proc reg data=a1;
  model insur risk = income;
  output out=a2 r=resins resris;
run;

symbol1 v=circle i=sm70;
proc gplot data=a2;
  plot resins*resris;
run;

Page 51: Topic 18: Model Selection and Diagnostics

The plot for risk

Page 52: Topic 18: Model Selection and Diagnostics

Partial plot for income (code not shown)

Page 53: Topic 18: Model Selection and Diagnostics (figure-only slide; image not reproduced)
Page 54: Topic 18: Model Selection and Diagnostics

Residual plot (vs risk)

proc reg data=a1;
  model insur=risk income;
  output out=a2 r=resins;
run;

symbol1 v=circle i=sm70;
proc sort data=a2; by risk;
proc gplot data=a2;
  plot resins*risk;
run;

Page 55: Topic 18: Model Selection and Diagnostics

Residuals vs Risk

Page 56: Topic 18: Model Selection and Diagnostics

Residual plot (vs income)

proc sort data=a2; by income;
proc gplot data=a2;
  plot resins*income;
run;

Page 57: Topic 18: Model Selection and Diagnostics

Residuals vs Income

Page 58: Topic 18: Model Selection and Diagnostics (figure-only slide; image not reproduced)
Page 59: Topic 18: Model Selection and Diagnostics

[Figure: residuals plotted against income (25 to 80) for the fitted model insur = -205.72 + 6.288 income + 4.7376 risk; N 18, Rsq 0.9864, AdjRsq 0.9845, RMSE 12.663]

Page 60: Topic 18: Model Selection and Diagnostics

[Figure: residuals plotted against risk (1 to 10) for the fitted model insur = -205.72 + 6.288 income + 4.7376 risk; N 18, Rsq 0.9864, AdjRsq 0.9845, RMSE 12.663]

Page 61: Topic 18: Model Selection and Diagnostics

Other “Residuals”

• There are several versions of residuals

  1. Our usual residuals: ei = Yi − Ŷi

  2. Studentized residuals: ei* = ei / sqrt(MSE(1 − hii))

     • Studentized means dividing ei by its standard error

     • They are distributed t(n − p) (≈ Normal)

Page 62: Topic 18: Model Selection and Diagnostics

Other “Residuals”

– Studentized deleted residuals:

  • Delete case i and refit the model

  • Compute the predicted value for case i using this refitted model

  • Compute the “studentized residual”

  • We don't do this literally, but this is the concept

Page 63: Topic 18: Model Selection and Diagnostics

Studentized Deleted Residuals

• We use the notation (i) to indicate that case i has been deleted from the model fit computations

• di = Yi − Ŷi(i) is the deleted residual

• It turns out that di = ei/(1 − hii)

• Also Var(di) = Var(ei)/(1 − hii)² = MSE(i)/(1 − hii)

• ti = ei / sqrt(MSE(i)(1 − hii)) is the studentized deleted residual
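The "no refit needed" idea can be sketched in Python with toy data (not the insurance example), using the computational form ti = ei·sqrt((n − p − 1)/(SSE(1 − hii) − ei²)):

```python
import math

# Sketch: studentized deleted residuals without refitting (toy data).
# t_i = e_i * sqrt((n - p - 1) / (SSE * (1 - h_ii) - e_i**2))
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.0, 2.9, 4.1, 5.0, 6.1, 6.9, 8.0, 12.0]   # last case is off-trend
n, p = len(x), 2   # simple linear regression: intercept + slope

xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
h = [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]
sse = sum(ei ** 2 for ei in e)

t = [ei * math.sqrt((n - p - 1) / (sse * (1 - hi) - ei ** 2))
     for ei, hi in zip(e, h)]
print(t)   # the off-trend last case stands out
```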

Page 64: Topic 18: Model Selection and Diagnostics

Residuals

• When we examine the residuals, regardless of version, we are looking for

–Outliers

–Non-normal error distributions

– Influential observations

Page 65: Topic 18: Model Selection and Diagnostics

The r option and studentized residuals

proc reg data=a1;
  model insur=income risk/r;
run;

Page 66: Topic 18: Model Selection and Diagnostics

Output

Obs   Student Residual
1     -1.206
2     -0.910
3      2.121
4     -0.363
5     -0.210

Page 67: Topic 18: Model Selection and Diagnostics

The influence option and studentized deleted residuals

proc reg data=a1;
  model insur=income risk /influence;
run;

Page 68: Topic 18: Model Selection and Diagnostics

Output

Obs   Residual    RStudent
1     -14.7311    -1.2259
2     -10.9321    -0.9048
3      24.1845     2.4487
4      -4.2780    -0.3518
5      -2.5522    -0.2028
6      10.3417     1.0138

Page 69: Topic 18: Model Selection and Diagnostics

Hat matrix diagonals

• hii is a measure of how much Yi is contributing to the prediction of Ŷi

• Ŷi = hi1Y1 + hi2Y2 + hi3Y3 + …

• hii is sometimes called the leverage of the ith observation

• It is a measure of the distance between the X values for the ith case and the means of the X values

Page 70: Topic 18: Model Selection and Diagnostics

Hat matrix diagonals

• 0 ≤ hii ≤ 1

• Σhii = p

• A large value of hii suggests that the ith case is distant from the center of all X's

• The average value is p/n

• Values far from this average point to cases that should be examined carefully
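These properties are easy to verify for simple linear regression, where hii = 1/n + (xi − x̄)²/Sxx. A Python sketch with toy x values:

```python
# Sketch: leverages for simple linear regression (toy x values):
# h_ii = 1/n + (x_i - xbar)^2 / Sxx; they lie in [0, 1] and sum to p.
x = [10.0, 12.0, 13.0, 15.0, 16.0, 40.0]   # last value far from the rest
n, p = len(x), 2                            # intercept + slope

xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
h = [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]

print(h)               # the distant last case has by far the largest h_ii
print(sum(h), p / n)   # sum of leverages = p; average leverage = p/n
```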

Page 71: Topic 18: Model Selection and Diagnostics

Influence option gives hat diagonals

Obs   Hat Diag H
1     0.0693
2     0.1006
3     0.1890
4     0.1316
5     0.0756

Page 72: Topic 18: Model Selection and Diagnostics

DFFITS

• A measure of the influence of case i on Ŷi (a single case)

• Thus, it is closely related to hii

• It is a standardized version of the difference between Ŷi computed with and without case i

• Concern if greater than 1 for small data sets, or greater than 2·sqrt(p/n) for large data sets
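A Python sketch with toy data (not the insurance example): DFFITSi = ti·sqrt(hii/(1 − hii)), where ti is the studentized deleted residual; with small n, the |DFFITS| > 1 rule flags the influential case.

```python
import math

# Sketch: DFFITS_i = t_i * sqrt(h_ii / (1 - h_ii)) for a simple linear
# regression (toy data), where t_i is the studentized deleted residual.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.0, 2.9, 4.1, 5.0, 6.1, 6.9, 8.0, 12.0]   # last case is off-trend
n, p = len(x), 2

xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
h = [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]
sse = sum(ei ** 2 for ei in e)

dffits = [ei * math.sqrt((n - p - 1) / (sse * (1 - hi) - ei ** 2))
          * math.sqrt(hi / (1 - hi))
          for ei, hi in zip(e, h)]
print(dffits)   # only the off-trend case exceeds the small-data cutoff of 1
```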

Page 73: Topic 18: Model Selection and Diagnostics

Cook’s Distance

• A measure of the influence of case i on all of the Ŷ's (all the cases)

• It is a standardized version of the sum of squares of the differences between the predicted values computed with and without case i

• Compare with F(p, n − p)

• Concern if the distance is above the 50th percentile of that distribution
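A Python sketch with toy data, using the standard computational form Di = ei²·hii / (p·MSE·(1 − hii)²):

```python
# Sketch: Cook's distance D_i = e_i^2 * h_ii / (p * MSE * (1 - h_ii)^2)
# for a simple linear regression (toy data).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0]
y = [1.2, 2.1, 2.9, 4.2, 5.0, 2.0]   # last case: high leverage AND off-trend
n, p = len(x), 2

xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
h = [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]
mse = sum(ei ** 2 for ei in e) / (n - p)

D = [ei ** 2 * hi / (p * mse * (1 - hi) ** 2) for ei, hi in zip(e, h)]
print(D)   # the influential last case dominates
```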

Page 74: Topic 18: Model Selection and Diagnostics

DFBETAS

• A measure of the influence of case i on each of the regression coefficients

• It is a standardized version of the difference between the regression coefficient computed with and without case i

• Concern if DFBETA is greater than 1 in small data sets, or greater than 2/sqrt(n) for large data sets

Page 75: Topic 18: Model Selection and Diagnostics

Variance Inflation Factor

• The VIF is related to the variance of the estimated regression coefficients

• We calculate it for each explanatory variable

• One suggested rule is that a value of 10 or more for VIF indicates excessive multicollinearity

Page 76: Topic 18: Model Selection and Diagnostics

Tolerance

• TOL = 1 − R2k, where R2k is the squared multiple correlation obtained in a regression where all the other explanatory variables are used to predict Xk

• TOL = 1/VIF

• Described in a comment on page 410
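With exactly two explanatory variables, R2k reduces to the squared correlation between them, which is why both variables share a single tolerance. A Python sketch with hypothetical data:

```python
import math

# Sketch: with two predictors, R2_k is the squared correlation r^2
# between them, so TOL = 1 - r^2 and VIF = 1/TOL for both variables.
x1 = [45.0, 50.0, 55.0, 60.0, 65.0, 70.0]   # hypothetical predictor 1
x2 = [5.0, 6.5, 6.0, 8.0, 7.5, 9.0]         # hypothetical predictor 2
n = len(x1)

m1, m2 = sum(x1) / n, sum(x2) / n
s11 = sum((a - m1) ** 2 for a in x1)
s22 = sum((b - m2) ** 2 for b in x2)
s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
r = s12 / math.sqrt(s11 * s22)

tol = 1 - r ** 2
vif = 1 / tol
print(tol, vif)   # TOL near 0 (VIF large) would signal collinearity
```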

Page 77: Topic 18: Model Selection and Diagnostics

Output

Variable    Tolerance
Intercept   .
income      0.93524
risk        0.93524

Page 78: Topic 18: Model Selection and Diagnostics

Full diagnostics

proc reg data=a1;
  model insur=income risk /r partial influence tol;
  id income risk;
  plot r.*(income risk);
run;

Page 79: Topic 18: Model Selection and Diagnostics

Plot statement inside Reg

• Can generate several plots within proc reg

• Need to know the symbol names

• Available in Table 1 once you click on the plot command inside the REG syntax

  – r. represents the usual residuals

  – rstudent. represents the deleted residuals

  – p. represents the predicted values

Page 80: Topic 18: Model Selection and Diagnostics

Last slide

• We went over KNNL Chapters 9 and 10

• We used program topic18.sas to generate the output