
Topic 18: Model Selection and Diagnostics

Variable Selection

• We want to choose a “best” model that is a subset of the available explanatory variables
• Two separate problems
1. How many explanatory variables should we use (i.e., subset size)?
2. Given the subset size, which variables should we choose?

KNNL Example

• Page 350, Section 9.2
• n = 54 patients / cases
• Y : survival time (liver operation)
• X’s (explanatory variables) are
– Blood clotting score
– Prognostic index
– Enzyme function test
– Liver function test

KNNL Example cont.

• We start with the usual plots and descriptive statistics
• Note that time-to-event / survival data are often heavily skewed and typically transformed with a log prior to model fitting

Ln Transform of Y

• Recall that the regression model requires Y|X to be Normally distributed, not Y
• Better to look at residuals
• With data like these, the transform reduces the influence of the long right tail and stabilizes the variance of the residuals

Data

data a1;
  infile 'U:\.www\datasets512\CH09TA01.txt'
         delimiter='09'x;   /* tab delimited */
  input blood prog enz liver age gender
        alcmod alcheavy     /* dummy variables for alcohol use */
        surv lsurv;         /* lsurv = ln(surv) */
run;

Obs  blood  prog  enz  liver  age  gender  alcmod  alcheavy  surv  lsurv
1    6.7    62    81   2.59   50   0       1       0         695   6.544
2    5.1    59    66   1.70   39   0       0       0         403   5.999
3    7.4    57    83   2.16   55   0       0       0         710   6.565
4    6.5    73    41   2.01   48   0       0       0         349   5.854
5    7.8    65    115  4.30   45   0       0       1         2343  7.759
6    5.8    38    72   1.42   65   1       1       0         348   5.852

[Figure: plot of the survival-time data; annotation: long right tail]

Generate scatterplots

proc corr plot=matrix;
  var blood prog enz liver;
run;

proc corr plot=scatter;
  var blood prog enz liver;
  with lsurv;
run;

Correlation Summary

Pearson Correlation Coefficients, N = 54
Prob > |r| under H0: Rho=0

        blood    prog     enz      liver
lsurv   0.24619  0.46994  0.65389  0.64926
        0.0727   0.0003   <.0001   <.0001

The Two Problems in Variable Selection

1. To determine an appropriate subset size

– Might use adjusted R2, Cp, MSE, PRESS, AIC, SBC (BIC)
2. To determine best model of a fixed size
– Might use R2

Adjusted R2

• R2 by its construction can never decrease as p increases
– SSE cannot increase with additional X and SSTO is constant
• Adjusted R2 uses df to account for changes in p

$$R_a^2 = 1 - \frac{n-1}{n-p}\,\frac{SSE_p}{SSTO} = 1 - \frac{MSE_p}{MSTO}$$

Adjusted R2

• Want to find the model that maximizes $R_a^2$
• Since MSTO remains constant for a given data set, this is equivalent to finding the model that minimizes MSE
• Details on pages 354-356
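As a rough illustration of the adjusted R2 formula, the sketch below plugs in the R-Square values that the selection output later in these notes reports for the best three-variable and the four-variable models (n = 54). The function name `adjusted_r2` is an illustrative assumption, and the formula is rewritten using SSE/SSTO = 1 - R2.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 = 1 - ((n - 1) / (n - p)) * (1 - R^2).

    p counts all regression coefficients, including the intercept.
    """
    return 1.0 - (n - 1) / (n - p) * (1.0 - r2)


n = 54
# R-Square values copied from the selection=rsquare output in these notes
three_var = adjusted_r2(0.7573, n, p=4)  # blood, prog, enz
four_var = adjusted_r2(0.7592, n, p=5)   # blood, prog, enz, liver

print(round(three_var, 4), round(four_var, 4))
```

Although the four-variable model has the larger R2, the three-variable model comes out ahead on adjusted R2, which is the kind of trade-off the df correction is meant to expose.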

Cp Criterion

• The basic idea is to compare subset models with a full model
• A subset model is “good” if there is not substantial bias in the predicted values relative to the full model
• Looks at the ratio of total mean squared error to the true error variance
• See pages 357-359 for details

Cp Criterion

$$C_p = \frac{SSE_p}{MSE(\text{Full})} - (n - 2p)$$

• SSE_p is based on a specific choice of p-1 variables
• MSE(Full) is based on all the variables
• Consider the full set: SSE/MSE(Full) = n-p, so Cp = (n-p)-(n-2p) = p
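The Cp formula can be checked against the selection output later in these notes without knowing SSE directly. Because SSE_p = SSTO(1 - R2_p) and MSE(Full) = SSTO(1 - R2_full)/(n - P), SSTO cancels in the ratio. The sketch below uses that identity; the function name `cp_from_r2` is an assumption, and small discrepancies from the printed C(p) values are expected because the R-Square values are rounded to four decimals.

```python
def cp_from_r2(r2_sub, p, r2_full, p_full, n):
    """Mallows' Cp written in terms of R^2 values.

    SSE_p / MSE(Full) = (n - P) * (1 - R2_p) / (1 - R2_full), since SSTO cancels.
    p and p_full count coefficients including the intercept.
    """
    ratio = (n - p_full) * (1.0 - r2_sub) / (1.0 - r2_full)
    return ratio - (n - 2 * p)


n, p_full, r2_full = 54, 5, 0.7592

cp_full = cp_from_r2(r2_full, p_full, r2_full, p_full, n)  # full model: Cp = p
cp_three = cp_from_r2(0.7573, 4, r2_full, p_full, n)       # blood, prog, enz
cp_one = cp_from_r2(0.4276, 2, r2_full, p_full, n)         # enz only

print(cp_full, round(cp_three, 2), round(cp_one, 2))
```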

Use of Cp

• p is the number of regression coefficients including the intercept

• A model is good according to this criterion if Cp ≤ p

• Rule: Pick the smallest model for which Cp is smaller than p or pick the model that minimizes Cp, provided the Cp is not much larger than p

SBC (BIC) and AIC

Criteria based on log(likelihood) plus a penalty for more complexity

• AIC: minimize $n \log\left(\frac{SSE_p}{n}\right) + 2p$
• SBC: minimize $n \log\left(\frac{SSE_p}{n}\right) + p \log(n)$
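The two criteria share the same fit term and differ only in the penalty, so SBC = AIC + p(log n - 2). The following sketch, with illustrative function names, encodes both formulas and uses that relationship to reproduce the SBC value printed in the selection output later in these notes from the corresponding AIC value.

```python
import math


def aic(sse, n, p):
    """AIC_p = n * log(SSE_p / n) + 2p (natural log)."""
    return n * math.log(sse / n) + 2 * p


def sbc(sse, n, p):
    """SBC_p = n * log(SSE_p / n) + p * log(n)."""
    return n * math.log(sse / n) + p * math.log(n)


def sbc_from_aic(aic_value, n, p):
    """The fit term is shared, so only the penalty changes."""
    return aic_value + p * (math.log(n) - 2)


# AIC for the best three-variable model (p = 4, n = 54), copied from the output
sbc_three = sbc_from_aic(-146.1609, n=54, p=4)
print(round(sbc_three, 4))
```

Since log(n) exceeds 2 once n is 8 or more, SBC charges more per parameter than AIC and tends to favor smaller models.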

Other approaches

• PRESS (prediction SS)
– For each case i, delete the case and predict Yi using the model fitted to the other n-1 cases
– Look at the SS for observed minus predicted
– Want to minimize the PRESS
– Appears to require n regressions, but it does not
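To see why n separate regressions are not needed, the sketch below computes PRESS two ways for a simple linear regression: by literally refitting without each case, and by the shortcut d_i = e_i/(1 - h_ii) given later in these notes. The data are the income and insurance values as printed (rounded) in the Chapter 10 output of these notes, and a one-predictor model is used purely to keep the example short; the helper `fit_line` is an illustrative assumption.

```python
# income (x) and insurance (y) as printed in the output tables of these notes
x = [45.0, 57.2, 26.9, 66.3, 41.0, 73.0, 79.4, 52.8, 55.9,
     38.1, 35.8, 75.8, 37.4, 54.4, 46.2, 46.1, 30.4, 39.1]
y = [91, 162, 11, 240, 73, 311, 316, 154, 164,
     54, 53, 326, 55, 130, 112, 91, 14, 63]


def fit_line(xs, ys):
    """Least-squares intercept and slope for one predictor."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((a - xbar) ** 2 for a in xs)
    b1 = sum((a - xbar) * (b - ybar) for a, b in zip(xs, ys)) / sxx
    return ybar - b1 * xbar, b1


# Brute force: delete case i, refit, predict case i
press_brute = 0.0
for i in range(len(x)):
    b0, b1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    press_brute += (y[i] - (b0 + b1 * x[i])) ** 2

# Shortcut: one fit, then deleted residual = e_i / (1 - h_ii)
n = len(x)
b0, b1 = fit_line(x, y)
xbar = sum(x) / n
sxx = sum((a - xbar) ** 2 for a in x)
press_short = 0.0
for xi, yi in zip(x, y):
    e = yi - (b0 + b1 * xi)
    h = 1 / n + (xi - xbar) ** 2 / sxx  # leverage for simple regression
    press_short += (e / (1 - h)) ** 2

print(press_brute, press_short)
```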

Variable Selection in SAS

• Additional proc reg model statement options useful in variable selection
– INCLUDE=n forces the first n explanatory variables into all models
– BEST=n limits the output to the best n models of each subset size or total
– START=n limits output to models that include at least n explanatory variables

Variable Selection

• Step-type procedures
– Forward selection (Step up)
– Backward elimination (Step down)
– Stepwise (forward selection with a backward glance)
• Very popular, but we now have much better search techniques like BEST

2. Ordering models of the same subset size

• Use R2 or SSE / MSE or F*
• This approach can lead us to consider several models that give us approximately the same predicted values

• May need to apply knowledge of the subject matter to make a final selection

• Not that important if prediction is the key goal

Proc Reg Code

proc reg data=a1;
  model lsurv=blood prog enz liver
        / selection=rsquare cp aic sbc b best=3;
run;

Selection Results

Number in
Model      R-Square  C(p)      AIC        SBC
1          0.4276    66.4889   -103.8269  -99.84889
1          0.4215    67.7148   -103.2615  -99.28357
1          0.2208    108.5558  -87.1781   -83.20011
2          0.6633    20.5197   -130.4833  -124.51634
2          0.5995    33.5041   -121.1126  -115.14561
2          0.5486    43.8517   -114.6583  -108.69138
3          0.7573    3.3905    -146.1609  -138.20494
3          0.7178    11.4237   -138.0232  -130.06723
3          0.6121    32.9320   -120.8442  -112.88823
4          0.7592    5.0000    -144.5895  -134.64461

Selection Results

Number in  Parameter Estimates
Model      Intercept  blood    prog     enz      liver
1          5.26426    .        .        0.01512  .
1          5.61218    .        .        .        0.29819
1          5.56613    .        0.01367  .        .
2          4.35058    .        0.01412  0.01539  .
2          5.02818    .        .        0.01073  0.20945
2          4.54623    0.10792  .        0.01634  .
3          3.76618    0.09546  0.01334  0.01645  .
3          4.40582    .        0.01101  0.01261  0.12977
3          4.78168    0.04482  .        0.01220  0.16360
4          3.85195    0.08368  0.01266  0.01563  0.03216

Proc Reg Code

proc reg data=a1;
  model lsurv=blood prog enz liver
        / selection=cp aic sbc b best=3;
run;

Selection Results

Number in
Model      C(p)     R-Square  AIC        SBC
3          3.3905   0.7573    -146.1609  -138.20494
4          5.0000   0.7592    -144.5895  -134.64461
3          11.4237  0.7178    -138.0232  -130.06723

WARNING: “selection=cp” just lists the models in order based on lowest C(p), regardless of whether it is good or not

How to Choose with C(p)

1. Want small C(p)
2. Want C(p) near p

In original paper, it was suggested to plot C(p) versus p and consider the smallest model that satisfies these criteria

Can be somewhat subjective when determining “near”
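One way to make the rule concrete is to take the best model of each size from the selection output in these notes and pick the smallest one with C(p) ≤ p. The sketch below does exactly that; the strict cutoff C(p) ≤ p is only one reading of “near” and is an assumption of this example, as is the tuple layout.

```python
# (p, C(p), variables) for the best model of each size, copied from the output;
# p counts coefficients including the intercept
models = [
    (2, 66.4889, "enz"),
    (3, 20.5197, "prog enz"),
    (4, 3.3905, "blood prog enz"),
    (5, 5.0000, "blood prog enz liver"),
]

# Keep models that satisfy C(p) <= p, then take the smallest of them
good = [m for m in models if m[1] <= m[0]]
chosen = min(good, key=lambda m: m[0])

print(chosen)
```

Here the full model trivially satisfies C(p) = p, so the interesting question is whether any smaller model also qualifies; the three-variable model does.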

Proc Reg

proc reg data=a1 outest=b1;   /* outest= creates data set with estimates & criteria */
  model lsurv=blood prog enz liver
        / selection=rsquare cp aic sbc b;
run; quit;

symbol1 v=circle i=none;
symbol2 v=none i=join;
proc gplot data=b1;
  plot _Cp_*_P_ _P_*_P_ / overlay;
run;

[Figure: C(p) versus p with the C(p) = p reference line; annotation marks where the points start to approach the line]

Model Validation

• Since the data were used to generate the parameter estimates, you’d expect the model to predict the fitted Y’s well

• Should check model predictive ability for a separate data set if available

• Various techniques of cross validation (data split, leave one out) are possible if only one data set available

Additional Multiple Regression Diagnostics

• Partial regression plots
• Studentized deleted residuals
• Hat matrix diagonals
• Dffits, Cook’s D, DFBETAS
• Variance inflation factor
• Tolerance

KNNL Example

• Page 386, Section 10.1
• Y is amount of life insurance
• X1 is average annual income
• X2 is a risk aversion score
• n = 18 managers

Read in the data set

data a1;
  infile '../data/ch10ta01.txt';
  input income risk insur;
run;

Partial regression plots

• Also called added variable plots or adjusted variable plots

• One plot for each Xi

Partial regression plots

• These plots show the strength of the marginal relationship between Y and Xi in the full model (recall partial correlation)
• They can also detect
– Nonlinear relationships
– Heterogeneous variances
– Outliers

Partial regression plots

• Consider the plot for X1
– Use the other X’s to predict Y
– Use the other X’s to predict X1
– Plot the residuals from the first regression vs the residuals from the second regression
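The construction above can be sketched numerically. Using the income, risk, and insurance values as printed (rounded) in the output tables of these notes, the code below forms the two sets of residuals for X1 = income and confirms that the slope through them equals the income coefficient from the two-predictor fit on the same numbers. The helper names are illustrative assumptions. Because the printed incomes are rounded, the slope should be close to, but need not exactly match, the income estimate in the SAS output.

```python
income = [45.0, 57.2, 26.9, 66.3, 41.0, 73.0, 79.4, 52.8, 55.9,
          38.1, 35.8, 75.8, 37.4, 54.4, 46.2, 46.1, 30.4, 39.1]
risk = [6, 4, 5, 7, 5, 10, 1, 8, 6, 4, 6, 9, 5, 2, 7, 4, 3, 5]
insur = [91, 162, 11, 240, 73, 311, 316, 154, 164,
         54, 53, 326, 55, 130, 112, 91, 14, 63]


def centered_cross(u, v):
    """Sum of (u - ubar)(v - vbar)."""
    ubar, vbar = sum(u) / len(u), sum(v) / len(v)
    return sum((a - ubar) * (b - vbar) for a, b in zip(u, v))


def residuals(xs, ys):
    """Residuals from the simple regression of ys on xs."""
    b1 = centered_cross(xs, ys) / centered_cross(xs, xs)
    b0 = sum(ys) / len(ys) - b1 * sum(xs) / len(xs)
    return [b - (b0 + b1 * a) for a, b in zip(xs, ys)]


e_y = residuals(risk, insur)    # Y with the other X (risk) removed
e_x1 = residuals(risk, income)  # X1 with the other X (risk) removed

# Slope of the partial regression plot (both residual sets have mean zero)
slope = sum(a * b for a, b in zip(e_x1, e_y)) / sum(a * a for a in e_x1)

# Income coefficient from the two-predictor normal equations, for comparison
s11, s22 = centered_cross(income, income), centered_cross(risk, risk)
s12 = centered_cross(income, risk)
s1y, s2y = centered_cross(income, insur), centered_cross(risk, insur)
b_income = (s22 * s1y - s12 * s2y) / (s11 * s22 - s12 ** 2)

print(slope, b_income)
```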

The partial option with proc reg

proc reg data=a1;
  model insur=income risk
        / partial;
run;

Output

Analysis of Variance

Source           DF  Sum of Squares  Mean Square  F Value  Pr > F
Model            2   173919          86960        542.33   <.0001
Error            15  2405.1476       160.3431
Corrected Total  17  176324

Root MSE        12.66267   R-Square  0.9864
Dependent Mean  134.44444  Adj R-Sq  0.9845
Coeff Var       9.41851

Output

Parameter Estimates

Variable   DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept  1   -205.71866          11.39268        -18.06   <.0001
income     1   6.28803             0.20415         30.80    <.0001
risk       1   4.73760             1.37808         3.44     0.0037

[Figures: partial regression plots; annotation on the first: curvilinear relationship; annotation on the second: can also see that here]

Other Residuals

• There are several versions of residuals
1. Our usual residuals
• $e_i = Y_i - \hat{Y}_i$
2. Studentized residuals
• $e_i^* = \dfrac{e_i}{\sqrt{MSE\,(1 - h_{ii})}}$
• Studentized means dividing by its standard error
• Are almost distributed t(n-p)
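As a quick check of the studentized residual formula, the sketch below reproduces the Std Error Residual and Student Residual for case 7 from numbers printed elsewhere in these notes (residual 17.8373, Hat Diag H 0.6225, MSE 160.3431). The function name is an assumption.

```python
import math


def studentized_residual(e, h, mse):
    """e_i divided by its estimated standard error sqrt(MSE * (1 - h_ii))."""
    return e / math.sqrt(mse * (1.0 - h))


mse = 160.3431                    # from the ANOVA table
e7, h7 = 17.8373, 0.6225          # residual and leverage for case 7

se7 = math.sqrt(mse * (1.0 - h7))            # "Std Error Residual"
r7 = studentized_residual(e7, h7, mse)       # "Student Residual"

print(round(se7, 3), round(r7, 3))
```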

Studentized deleted Residual

– Delete case i and refit the model
– Compute the predicted value for case i using this refitted model
– Compute the “studentized residual”
– Don’t do this literally but this is the concept
– Results in t-distributed residuals

Studentized Deleted Residuals

• We use the notation “(i)” to indicate that case i has been deleted from the model fit computations

• $d_i = Y_i - \hat{Y}_{i(i)}$ is the deleted residual

• Turns out $d_i = e_i/(1-h_{ii})$

• Also $Var(d_i) = Var(e_i)/(1-h_{ii})^2$, which is estimated by $MSE_{(i)}/(1-h_{ii})$

• $t_i = \dfrac{e_i}{\sqrt{MSE_{(i)}\,(1 - h_{ii})}}$
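The deleted quantities can be obtained from the full fit alone. The sketch below assumes the standard identity (n - p) MSE = (n - p - 1) MSE(i) + e_i²/(1 - h_ii), which is not stated in these notes, to get MSE(i) without refitting, and then reproduces the RStudent value for case 7 from the printed residual, leverage, and error sum of squares.

```python
import math


def deleted_residual(e, h):
    """d_i = e_i / (1 - h_ii)."""
    return e / (1.0 - h)


def studentized_deleted(e, h, sse, n, p):
    """t_i = e_i / sqrt(MSE_(i) * (1 - h_ii)), with MSE_(i) from the full fit.

    Assumes SSE = (n - p - 1) * MSE_(i) + e_i^2 / (1 - h_ii).
    """
    mse_deleted = (sse - e * e / (1.0 - h)) / (n - p - 1)
    return e / math.sqrt(mse_deleted * (1.0 - h))


sse, n, p = 2405.1476, 18, 3      # error SS, cases, coefficients
e7, h7 = 17.8373, 0.6225          # residual and leverage for case 7

d7 = deleted_residual(e7, h7)
t7 = studentized_deleted(e7, h7, sse, n, p)

print(round(d7, 2), round(t7, 4))
```

The deleted residual for case 7 is much larger than the ordinary residual because the high leverage pulls the fitted line toward that case.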

Using Residuals

• When we examine the residuals, regardless of version, we are looking for
– Outliers
– Non-normal error distributions
– Influential observations

The r option and studentized residuals

proc reg data=a1;
  model insur=income risk / r;
run;

Output

Output Statistics

Columns: Obs, income, risk, Dependent Variable, Predicted Value, Std Error Mean Predict, Residual, Std Error Residual, Student Residual, Cook's D

1 45.0 6 91 105.7311 3.3332 -14.7311 12.216 -1.206 0.036

2 57.2 4 162 172.9321 4.0172 -10.9321 12.009 -0.910 0.031

3 26.9 5 11 -13.1845 5.5052 24.1845 11.403 2.121 0.349

4 66.3 7 240 244.2780 4.5932 -4.2780 11.800 -0.363 0.007

5 41.0 5 73 75.5522 3.4815 -2.5522 12.175 -0.210 0.001

6 73.0 10 311 300.6583 7.4898 10.3417 10.210 1.013 0.184

7 79.4 1 316 298.1627 9.9907 17.8373 7.780 2.293 2.889

8 52.8 8 154 163.9763 4.5985 -9.9763 11.798 -0.846 0.036

9 55.9 6 164 174.3084 3.2470 -10.3084 12.239 -0.842 0.017

10 38.1 4 54 52.9440 4.0148 1.0560 12.009 0.088 0.000

11 35.8 6 53 48.0699 4.3886 4.9301 11.878 0.415 0.008

12 75.8 9 326 313.5272 6.9287 12.4728 10.599 1.177 0.197

13 37.4 5 55 53.1919 3.8909 1.8081 12.050 0.150 0.001

14 54.4 2 130 145.6744 5.7973 -15.6744 11.258 -1.392 0.171

15 46.2 7 112 117.8634 3.9171 -5.8634 12.042 -0.487 0.008

16 46.1 4 91 103.2985 3.5257 -12.2985 12.162 -1.011 0.029

17 30.4 3 14 -0.5636 5.3985 14.5636 11.454 1.271 0.120

18 39.1 5 63 63.5798 3.6886 -0.5798 12.114 -0.048 0.000

Cook’s Distance

• A measure of the influence of case i on all of the $\hat{Y}_i$’s (all the cases)
• It is a standardized version of the sum of squares of the differences between the predicted values computed with and without case i
• Compare with F(p,n-p)
• Concern if distance above 50%-tile
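Cook's distance does not require refitting either. The sketch below assumes the usual closed form D_i = e_i² h_ii / (p · MSE · (1 - h_ii)²), which these notes do not write out, and checks it against the Cook's D column for cases 7 and 3 using the printed residuals and leverages.

```python
def cooks_distance(e, h, mse, p):
    """D_i = (e_i^2 / (p * MSE)) * h_ii / (1 - h_ii)^2 (assumed closed form)."""
    return (e * e / (p * mse)) * h / (1.0 - h) ** 2


mse, p = 160.3431, 3

d7 = cooks_distance(17.8373, 0.6225, mse, p)   # case 7
d3 = cooks_distance(24.1845, 0.1890, mse, p)   # case 3

print(round(d7, 3), round(d3, 3))
```

Case 3 has the larger raw residual, but case 7 has far more influence because of its leverage.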

The influence option and studentized deleted residuals

proc reg data=a1;
  model insur=income risk / influence;
run;


Columns: Obs, income, risk, Residual, RStudent, Hat Diag H, Cov Ratio, DFFITS, DFBETAS (Intercept, income, risk)

1 45.0 6 -14.7311 -1.2259 0.0693 0.9732 -0.3345 -0.1179 0.1245 -0.1107

2 57.2 4 -10.9321 -0.9048 0.1006 1.1532 -0.3027 -0.0395 -0.1470 0.1723

3 26.9 5 24.1845 2.4487 0.1890 0.5205 1.1821 0.9594 -0.9871 0.1436

4 66.3 7 -4.2780 -0.3518 0.1316 1.3794 -0.1369 0.0770 -0.0821 -0.0410

5 41.0 5 -2.5522 -0.2028 0.0756 1.3189 -0.0580 -0.0394 0.0286 0.0011

6 73.0 10 10.3417 1.0138 0.3499 1.5296 0.7437 -0.5298 0.3048 0.5125

7 79.4 1 17.8373 2.7483 0.6225 0.8930 3.5292 -0.3649 2.6598 -2.6751

8 52.8 8 -9.9763 -0.8371 0.1319 1.2237 -0.3263 0.0816 0.0254 -0.2452

9 55.9 6 -10.3084 -0.8336 0.0658 1.1384 -0.2212 0.0308 -0.0672 -0.0366

10 38.1 4 1.0560 0.0850 0.1005 1.3653 0.0284 0.0238 -0.0138 -0.0092

11 35.8 6 4.9301 0.4033 0.1201 1.3502 0.1490 0.0863 -0.1057 0.0536

12 75.8 9 12.4728 1.1933 0.2994 1.3128 0.7801 -0.5820 0.4495 0.4096

13 37.4 5 1.8081 0.1451 0.0944 1.3521 0.0468 0.0348 -0.0294 0.0014

14 54.4 2 -15.6744 -1.4415 0.2096 1.0274 -0.7423 -0.2706 -0.2656 0.6269

15 46.2 7 -5.8634 -0.4742 0.0957 1.2966 -0.1543 -0.0164 0.0532 -0.0953

16 46.1 4 -12.2985 -1.0120 0.0775 1.0788 -0.2934 -0.1810 0.0258 0.1424

17 30.4 3 14.5636 1.3004 0.1818 1.0677 0.6129 0.5803 -0.3608 -0.2577

18 39.1 5 -0.5798 -0.0462 0.0849 1.3434 -0.0141 -0.0101 0.0080 -0.0001

Hat matrix diagonals

• hii is a measure of how much Yi is contributing to the prediction of $\hat{Y}_i$
• $\hat{Y}_i = h_{i1}Y_1 + h_{i2}Y_2 + h_{i3}Y_3 + \dots$
• hii is sometimes called the leverage of the ith observation
• It is a measure of the distance between the X values for the ith case and the means of the X values

Hat matrix diagonals

• 0 ≤ hii ≤ 1
• Σ(hii) = p
• Large value of hii suggests that the ith case is distant from the center of all X’s
• The average value is p/n
• Values far from this average point to cases that should be examined carefully
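These properties can be checked on the Hat Diag H column printed in these notes. The sketch below confirms that the leverages sum to p = 3 (up to rounding) and flags cases above 2p/n. The 2p/n cutoff is a common rule of thumb assumed here for illustration; the notes themselves only say “far from this average”.

```python
# Hat Diag H for the 18 cases, copied from the influence output
hat = [0.0693, 0.1006, 0.1890, 0.1316, 0.0756, 0.3499, 0.6225, 0.1319, 0.0658,
       0.1005, 0.1201, 0.2994, 0.0944, 0.2096, 0.0957, 0.0775, 0.1818, 0.0849]

n, p = len(hat), 3
average = p / n          # average leverage
cutoff = 2 * p / n       # assumed rule of thumb, not from the notes

# Observation numbers (1-based) whose leverage exceeds the cutoff
flagged = [i + 1 for i, h in enumerate(hat) if h > cutoff]

print(round(sum(hat), 4), round(average, 4), flagged)
```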



DFFITS

• A measure of the influence of case i on $\hat{Y}_i$ (a single case)
• Thus, it is closely related to hii
• It is a standardized version of the difference between $\hat{Y}_i$ computed with and without case i
• Concern if greater than 1 for small data sets or greater than $2\sqrt{p/n}$ for large data sets
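The link between DFFITS and leverage can be made explicit. The sketch below assumes the standard relation DFFITS_i = t_i · sqrt(h_ii / (1 - h_ii)), where t_i is the studentized deleted residual (RStudent), and checks it against the printed DFFITS values for three cases. It also evaluates the large-sample cutoff for this data set, although with n = 18 the small-sample cutoff of 1 is the relevant one.

```python
import math


def dffits(t, h):
    """DFFITS_i = t_i * sqrt(h_ii / (1 - h_ii)) (assumed standard relation)."""
    return t * math.sqrt(h / (1.0 - h))


# obs: (RStudent, Hat Diag H), copied from the influence output
cases = {3: (2.4487, 0.1890), 6: (1.0138, 0.3499), 7: (2.7483, 0.6225)}
values = {obs: dffits(t, h) for obs, (t, h) in cases.items()}

n, p = 18, 3
large_sample_cutoff = 2 * math.sqrt(p / n)
concern = sorted(obs for obs, v in values.items() if abs(v) > 1)

print({k: round(v, 4) for k, v in values.items()}, round(large_sample_cutoff, 4))
```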

DFBETAS

• A measure of the influence of case i on each of the regression coefficients
• It is a standardized version of the difference between the regression coefficient computed with and without case i
• Concern if DFBETAS greater than 1 in small data sets or greater than $2/\sqrt{n}$ for large data sets
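Applying the small-data-set rule to a few rows of the DFBETAS output in these notes might look like the sketch below. Only three cases are copied in to keep it short, and the dictionary layout is an illustrative assumption.

```python
import math

names = ("Intercept", "income", "risk")

# obs: DFBETAS for (Intercept, income, risk), copied from the influence output
dfbetas = {
    3: (0.9594, -0.9871, 0.1436),
    7: (-0.3649, 2.6598, -2.6751),
    14: (-0.2706, -0.2656, 0.6269),
}

n = 18
large_sample_cutoff = 2 / math.sqrt(n)  # shown for reference; n = 18 is small

# Small-data-set rule: concern if |DFBETAS| > 1
flagged = {}
for obs, vals in dfbetas.items():
    hits = [name for name, v in zip(names, vals) if abs(v) > 1]
    if hits:
        flagged[obs] = hits

print(flagged, round(large_sample_cutoff, 3))
```

Case 3 comes close on the intercept and income coefficients but stays just under 1, while case 7 clearly moves both slope estimates.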

Variance Inflation Factor

• The VIF is related to the variance of the estimated regression coefficients
• We calculate it for each explanatory variable
• One suggested rule is that a value of 10 or more for VIF indicates excessive multicollinearity

Tolerance

• $TOL = 1 - R_k^2$, where $R_k^2$ is the squared multiple correlation obtained in a regression where all other explanatory variables are used to predict Xk
• TOL = 1/VIF
• Described in comment on p 410
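A small sketch of the TOL and VIF relationship, using the tolerance printed for income and risk in the output of these notes. The function names are illustrative assumptions. It also shows that the VIF ≥ 10 rule corresponds to R²k ≥ 0.9.

```python
def tolerance(r2_k):
    """TOL = 1 - R_k^2."""
    return 1.0 - r2_k


def vif(r2_k):
    """VIF = 1 / (1 - R_k^2) = 1 / TOL."""
    return 1.0 / (1.0 - r2_k)


# Tolerance 0.93524 for income and risk implies this R_k^2
r2_k = 1.0 - 0.93524
vif_income = vif(r2_k)

# R_k^2 at which the "VIF of 10 or more" rule starts to fire
vif_at_point_nine = vif(0.9)

print(round(vif_income, 5), round(vif_at_point_nine, 2))
```

With only two explanatory variables, R²k is just the squared correlation between them, which is why income and risk share the same tolerance and VIF.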

Output

Parameter Estimates

Variable   DF  Parameter Estimate  Standard Error  t Value  Pr > |t|  Tolerance  Variance Inflation
Intercept  1   -205.71866          11.39268        -18.06   <.0001    .          0
income     1   6.28803             0.20415         30.80    <.0001    0.93524    1.06925
risk       1   4.73760             1.37808         3.44     0.0037    0.93524    1.06925

Full diagnostics

proc reg data=a1;
  model insur=income risk
        / r partial influence tol;
  id income risk;
  plot rstudent.*(income risk);
run;

Plot statement inside Reg

• Can generate several plots within Proc Reg
• Need to know symbol names
• Available in Table 1 once you click on plot command inside REG syntax
– r. represents usual residuals
– rstudent. represents deleted resids
– p. represents predicted values

Last slide

• We went over KNNL Chapters 9 and 10
• We used program topic18.sas to generate the output
