
Xuhua Xia

Stepwise Regression

• Y may depend on many independent variables.
• How to find a subset of X’s that best predict Y?
• There are several criteria (e.g., adjusted R², AIC, BIC, likelihood ratio test, etc.) for model selection and many algorithms for including or excluding X’s in the model: forward selection, backward elimination, stepwise regression, etc.
• With the availability of statistical packages, stepwise regression is now most commonly used.

[Figure: Y with six candidate independent variables, X1 to X6]


A Data Set for Multiple Regression

Measurements on men involved in a physical fitness course at N. C. State University. Fitness is typically measured by oxygen intake rate (oxy), which is difficult (at least cumbersome when one is exercising oneself) to measure. The study goal is to develop an equation to predict oxy based on exercise tests rather than on oxygen consumption measurements.

The dataset has 31 observations. The variables in the data set are:

• age (in years)
• weight (in kg)
• oxy (oxygen intake rate, ml per kg body weight per minute)
• runtime (time to run 1.5 miles, in minutes)
• rstpulse (heart rate while resting)
• runpulse (heart rate while running, at the same time when oxygen rate was measured)
• maxpulse (maximum heart rate recorded while running)


   oxy  age weight runtime rstpulse runpulse maxpulse
44.609   44  89.47   11.37       62      178      182
59.571   42  68.15    8.17       40      166      172
45.681   40  75.98   11.95       70      176      180
60.055   38  81.87    8.63       48      170      186
44.754   45  66.45   11.12       51      176      176
49.156   49  81.42    8.95       44      180      185
46.774   48  91.63   10.25       48      162      164
46.08    54  79.38   11.17       62      156      165
45.118   51  67.25   11.08       48      172      172
50.545   57  59.08    9.93       49      148      155
47.467   52  82.78   10.5        53      170      172
45.313   40  75.07   10.07       62      185      185
49.874   38  89.02    9.22       55      178      180
49.091   43  81.19   10.85       64      162      170
50.541   44  73.03   10.13       45      168      168
47.273   47  79.15   10.6        47      162      164
40.836   51  69.63   10.95       57      168      172
50.388   49  73.37   10.08       67      168      168
45.441   52  76.32    9.63       48      164      166
39.203   54  91.63   12.88       44      168      172
48.673   49  76.32    9.4        56      186      188
54.297   44  85.84    8.65       45      156      168
44.811   47  77.45   11.63       58      176      176
39.442   44  81.42   13.08       63      174      176
37.388   45  87.66   14.03       56      186      192
51.855   54  83.12   10.33       50      166      170
46.672   51  77.91   10          48      162      168
39.407   57  73.37   12.63       58      174      176
54.625   50  70.87    8.92       48      146      155
45.79    51  73.71   10.47       59      186      188
47.92    48  61.24   11.5        52      170      176


Correlation matrix

(First line of each row: Pearson r; second line: the associated p-value.)

                age    weight       oxy   runtime  rstpulse  runpulse  maxpulse
age         1.00000  -0.23354  -0.30459   0.18875  -0.16410  -0.33787  -0.43292
                       0.2061    0.0957    0.3092    0.3777    0.0630    0.0150

weight     -0.23354   1.00000  -0.16275   0.14351   0.04397   0.18152   0.24938
             0.2061              0.3817    0.4412    0.8143    0.3284    0.1761

oxy        -0.30459  -0.16275   1.00000  -0.86219  -0.39936  -0.39797  -0.23674
             0.0957    0.3817              <.0001    0.0260    0.0266    0.1997

runtime     0.18875   0.14351  -0.86219   1.00000   0.45038   0.31365   0.22610
             0.3092    0.4412    <.0001              0.0110    0.0858    0.2213

rstpulse   -0.16410   0.04397  -0.39936   0.45038   1.00000   0.35246   0.30512
             0.3777    0.8143    0.0260    0.0110              0.0518    0.0951

runpulse   -0.33787   0.18152  -0.39797   0.31365   0.35246   1.00000   0.92975
             0.0630    0.3284    0.0266    0.0858    0.0518              <.0001

maxpulse   -0.43292   0.24938  -0.23674   0.22610   0.30512   0.92975   1.00000
             0.0150    0.1761    0.1997    0.2213    0.0951    <.0001
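The p-values that accompany the correlations above can be reproduced by hand. The sketch below assumes they come from the usual t test for a Pearson correlation with n - 2 degrees of freedom; the function name `cor_p_value` is hypothetical and only the values of r and n = 31 are taken from the slides.

```r
# Two-sided p-value for a Pearson correlation r based on n observations,
# using the t distribution with n - 2 degrees of freedom.
cor_p_value <- function(r, n) {
  t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
  2 * pt(-abs(t_stat), df = n - 2)
}

p_age_oxy <- cor_p_value(-0.30459, 31)   # age vs oxy
p_run_oxy <- cor_p_value(-0.86219, 31)   # runtime vs oxy

print(round(p_age_oxy, 4))   # should land close to the 0.0957 in the matrix
print(p_run_oxy < 0.0001)    # reported as <.0001 in the matrix
```

With only 31 observations, a correlation of about -0.30 is not significant at the 0.05 level, whereas the runtime correlation of -0.86 clearly is.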


Scatterplot matrix


rcorr in Hmisc

> print(rmat)
           oxy   age weight runtime rstpulse runpulse maxpulse
oxy       1.00 -0.30  -0.16   -0.86    -0.40    -0.40    -0.24
age      -0.30  1.00  -0.23    0.19    -0.16    -0.34    -0.43
weight   -0.16 -0.23   1.00    0.14     0.04     0.18     0.25
runtime  -0.86  0.19   0.14    1.00     0.45     0.31     0.23
rstpulse -0.40 -0.16   0.04    0.45     1.00     0.35     0.31
runpulse -0.40 -0.34   0.18    0.31     0.35     1.00     0.93
maxpulse -0.24 -0.43   0.25    0.23     0.31     0.93     1.00

P
         oxy    age    weight runtime rstpulse runpulse maxpulse
oxy             0.0957 0.3817 0.0000  0.0260   0.0266   0.1997
age      0.0957        0.2061 0.3092  0.3777   0.0630   0.0150
weight   0.3817 0.2061        0.4412  0.8143   0.3284   0.1761
runtime  0.0000 0.3092 0.4412         0.0110   0.0858   0.2213
rstpulse 0.0260 0.3777 0.8143 0.0110           0.0518   0.0951
runpulse 0.0266 0.0630 0.3284 0.0858  0.0518            0.0000
maxpulse 0.1997 0.0150 0.1761 0.2213  0.0951   0.0000

Backward elimination

Start:  AIC=58.16
oxy ~ age + weight + runtime + rstpulse + runpulse + maxpulse

           Df Sum of Sq    RSS    AIC
- rstpulse  1     0.571 129.41 56.299
<none>                  128.84 58.162
- weight    1     9.911 138.75 58.459
- maxpulse  1    26.491 155.33 61.958
- age       1    27.746 156.58 62.208
- runpulse  1    51.058 179.90 66.510
- runtime   1   250.822 379.66 89.664

Step:  AIC=56.3
oxy ~ age + weight + runtime + runpulse + maxpulse

           Df Sum of Sq    RSS    AIC
<none>                  129.41 56.299
- weight    1      9.52 138.93 56.499
- maxpulse  1     26.83 156.23 60.139
- age       1     27.37 156.78 60.247
- runpulse  1     52.60 182.00 64.871
- runtime   1    320.36 449.77 92.917

<none> in the first table is the current model, i.e., without eliminating rstpulse.

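$$AIC(p) = \ln(SSE(p)) + \frac{2(p+1)}{n}$$

$$BIC(p) = \ln(SSE(p)) + \frac{(p+1)\ln(n)}{n}$$

where p is the number of IVs in the model and n is the number of observations.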
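The AIC column printed by step() can be reproduced from the RSS column. For least-squares fits, step() reports AIC on the scale n·log(RSS/n) + 2k, where k is the number of estimated coefficients. Other presentations scale the criterion differently, which shifts the values but not the ranking of models fitted to the same data. The helper name `aic_step` below is hypothetical; the RSS values are copied from the step() output in these slides.

```r
# AIC as reported by step() for a least-squares fit:
#   n * log(RSS / n) + 2 * k, with k = number of estimated coefficients.
aic_step <- function(rss, n, k) n * log(rss / n) + 2 * k

n <- 31
aic_full    <- aic_step(128.84, n, k = 7)  # six IVs plus intercept
aic_reduced <- aic_step(129.41, n, k = 6)  # after dropping rstpulse
aic_null    <- aic_step(851.38, n, k = 1)  # intercept only (oxy ~ 1)

print(round(c(full = aic_full, reduced = aic_reduced, null = aic_null), 2))
```

These should agree with the 58.16, 56.30 and 104.70 shown in the output, up to rounding of RSS. Passing `k = log(n)` to step() replaces the penalty of 2 per coefficient with log(n), which gives the BIC (SBC) version of the search.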

Forward addition

Start:  AIC=104.7
oxy ~ 1

           Df Sum of Sq    RSS     AIC
+ runtime   1    632.90 218.48  64.534
+ rstpulse  1    135.78 715.60 101.313
+ runpulse  1    134.84 716.54 101.354
+ age       1     78.99 772.39 103.681
<none>                  851.38 104.699
+ maxpulse  1     47.72 803.67 104.911
+ weight    1     22.55 828.83 105.867

Step:  AIC=64.53
oxy ~ runtime

           Df Sum of Sq    RSS    AIC
+ age       1   17.7656 200.72 63.905
+ runpulse  1   15.3621 203.12 64.274
<none>                  218.48 64.534
+ maxpulse  1    1.5674 216.91 66.311
+ weight    1    1.3236 217.16 66.346
+ rstpulse  1    0.1301 218.35 66.516

Step:  AIC=63.9
oxy ~ runtime + age

           Df Sum of Sq    RSS    AIC
+ runpulse  1    39.885 160.83 59.037
+ maxpulse  1    14.885 185.83 63.516
<none>                  200.72 63.905
+ weight    1     5.605 195.11 65.027
+ rstpulse  1     2.641 198.07 65.494

Step:  AIC=59.04
oxy ~ runtime + age + runpulse

           Df Sum of Sq    RSS    AIC
+ maxpulse  1   21.9007 138.93 56.499
<none>                  160.83 59.037
+ weight    1    4.5958 156.24 60.139
+ rstpulse  1    0.4901 160.34 60.943

Step:  AIC=56.5
oxy ~ runtime + age + runpulse + maxpulse

Rows above <none>: IVs whose addition will improve the model (lower AIC).

Rows below <none>: IVs whose addition will make it worse (higher AIC).

R Functions

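library(Hmisc)

# myD: data frame holding the fitness data
# Correlations and scatterplot matrix
cor(myD, method="pearson")                        # or method="spearman"
pairs(~age+weight+runtime+rstpulse+runpulse+maxpulse+oxy, data=myD)
rmat <- rcorr(as.matrix(myD), type="pearson")     # or type="spearman"
rmat
print(rmat[1], digits=5)

# Multiple regression with all IVs
fit <- lm(oxy~age+weight+runtime+rstpulse+runpulse+maxpulse, data=myD)
anova(fit)
summary(fit)

# Backward elimination
full.model <- lm(oxy~age+weight+runtime+rstpulse+runpulse+maxpulse, data=myD)
best.model <- step(full.model, direction="backward")

# Forward addition
min.model <- lm(oxy~1, data=myD)
best.model <- step(min.model, direction="forward",
                   scope=~age+weight+runtime+rstpulse+runpulse+maxpulse)

# Confidence and prediction intervals for new observations
new <- data.frame(...)   # specify values of the IVs here
predict(fit, new, interval="confidence")
predict(fit, new, interval="prediction")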

Two R functions compute Pearson correlations: cor in the base distribution does not provide the associated p-values, whereas rcorr in the Hmisc package does.
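The step() calls listed above can be tried without the fitness data. The following is a minimal sketch that assumes the built-in mtcars data as a stand-in, with mpg as the DV and an arbitrary choice of five IVs; none of these variables come from the slides. It also shows direction="both", which is not in the listing above.

```r
# Stand-in data (an assumption for illustration): built-in mtcars, DV = mpg.
full.model <- lm(mpg ~ wt + hp + disp + qsec + drat, data = mtcars)
min.model  <- lm(mpg ~ 1, data = mtcars)

# Backward elimination: start from the full model and drop IVs.
back.model <- step(full.model, direction = "backward", trace = 0)

# Forward addition: start from the intercept-only model; scope gives the
# largest model that may be reached.
fwd.model <- step(min.model, direction = "forward", trace = 0,
                  scope = ~ wt + hp + disp + qsec + drat)

# Stepwise ("both"): an IV may be added or dropped at every step.
both.model <- step(min.model, direction = "both", trace = 0,
                   scope = list(lower = ~ 1,
                                upper = ~ wt + hp + disp + qsec + drat))

print(formula(back.model))
print(formula(fwd.model))
print(formula(both.model))
```

Setting trace = 0 suppresses the step-by-step tables; leave it at its default to see output in the same layout as the backward and forward slides.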

Package leaps

Package leaps includes a function leaps that offers two more criteria for model selection:
1) adjusted R²
2) Mallows' Cp (which is used less frequently now)

The input to leaps is not a data frame but a vector for the DV and a matrix for the IVs:

x <- as.matrix(myD)
DV <- x[,1]      # oxy is the first column of myD
IV <- x[,2:7]    # the six IVs
library(leaps)
solR2a <- leaps(IV, DV, names=names(myD)[2:7], method="adjr2")
solCp <- leaps(IV, DV, names=names(myD)[2:7], method="Cp")

leaps evaluates all linear models

$which
    age weight runtime rstpulse runpulse maxpulse
1 FALSE  FALSE    TRUE    FALSE    FALSE    FALSE
1 FALSE  FALSE   FALSE     TRUE    FALSE    FALSE
1 FALSE  FALSE   FALSE    FALSE     TRUE    FALSE
1  TRUE  FALSE   FALSE    FALSE    FALSE    FALSE
1 FALSE  FALSE   FALSE    FALSE    FALSE     TRUE
1 FALSE   TRUE   FALSE    FALSE    FALSE    FALSE
2  TRUE  FALSE    TRUE    FALSE    FALSE    FALSE
2 FALSE  FALSE    TRUE    FALSE     TRUE    FALSE
2 FALSE  FALSE    TRUE    FALSE    FALSE     TRUE
2 FALSE   TRUE    TRUE    FALSE    FALSE    FALSE
2 FALSE  FALSE    TRUE     TRUE    FALSE    FALSE
2  TRUE  FALSE   FALSE    FALSE     TRUE    FALSE
2  TRUE  FALSE   FALSE     TRUE    FALSE    FALSE
2 FALSE  FALSE   FALSE    FALSE     TRUE     TRUE
2  TRUE  FALSE   FALSE    FALSE    FALSE     TRUE
2 FALSE  FALSE   FALSE     TRUE     TRUE    FALSE
3  TRUE  FALSE    TRUE    FALSE     TRUE    FALSE
3 FALSE  FALSE    TRUE    FALSE     TRUE     TRUE
3  TRUE  FALSE    TRUE    FALSE    FALSE     TRUE
3  TRUE   TRUE    TRUE    FALSE    FALSE    FALSE
3  TRUE  FALSE    TRUE     TRUE    FALSE    FALSE
3 FALSE  FALSE    TRUE     TRUE     TRUE    FALSE
3 FALSE   TRUE    TRUE    FALSE     TRUE    FALSE
3 FALSE   TRUE    TRUE    FALSE    FALSE     TRUE
3 FALSE  FALSE    TRUE     TRUE    FALSE     TRUE
3 FALSE   TRUE    TRUE     TRUE    FALSE    FALSE
……

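The best model is the one with the greatest adjusted R², or with a Cp closest to the total number of IVs (6 in our case). A more common Cp rule favours a small Cp that is close to the number of coefficients in the candidate model itself.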
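What an all-subsets search does can be sketched in base R by brute force. This is only an illustration: leaps itself uses a branch-and-bound search and, by default, reports the 10 best subsets of each size, which is why 43 models (6 + 10 + 10 + 10 + 6 + 1) appear in its output for six IVs. The sketch assumes the built-in mtcars data as a stand-in, with four arbitrarily chosen IVs.

```r
# Stand-in data (an assumption for illustration): built-in mtcars, DV = mpg.
dv  <- "mpg"
ivs <- c("wt", "hp", "disp", "qsec")

# Enumerate every non-empty subset of the IVs.
subsets <- unlist(
  lapply(seq_along(ivs), function(k) combn(ivs, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit each candidate model and record its adjusted R-squared.
adjr2 <- sapply(subsets, function(vars) {
  fit <- lm(reformulate(vars, response = dv), data = mtcars)
  summary(fit)$adj.r.squared
})

best <- which.max(adjr2)
print(length(subsets))   # 2^4 - 1 = 15 candidate models
print(subsets[[best]])   # IVs in the model with the greatest adjusted R-squared
print(adjr2[best])
```

Brute force is fine for a handful of IVs, but the number of subsets doubles with every additional IV, which is the reason dedicated packages avoid fitting every model explicitly.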

The next two slides show results of evaluation

Model evaluation: adjusted R²

$size
 [1] 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 6 6
[39] 6 6 6 6 7

$size is the number of coefficients in each model (IVs plus the intercept).

$adjr2
 [1]  0.734531140  0.130502041  0.129362176  0.061492963  0.023495780
 [6] -0.007080873  0.747407426  0.744382656  0.727022568  0.726715841
[11]  0.725213882  0.331423675  0.250289567  0.238663728  0.207123289
[16]  0.180390053  0.790104958  0.788876048  0.757477967  0.745367336
[21]  0.741499364  0.735442756  0.735365598  0.717949838  0.716918697
[26]  0.716790424  0.811713247  0.788260638  0.787518105  0.782696332
[31]  0.781231246  0.753335728  0.750114011  0.740422515  0.725675821
[36]  0.707129090  0.817602176  0.804437584  0.781067331  0.779299361
[41]  0.746441303  0.464879114  0.810839895

Given the set of adjusted r2 values for the 43 alternative models, which one is the maximum?

maxAdjR2 <- max(solR2a$adjr2)                    # find the max adjusted R²
bestModelInd <- match(maxAdjR2, solR2a$adjr2)    # find the index of the max adjusted R²
solR2a$which[bestModelInd,]                      # show the best model

bestModelInd <- which.max(solR2a$adjr2)          # alternative: find the index of the max adjusted R² directly
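The question can be answered directly from the printed output. The sketch below copies the $adjr2 values from the slide into a plain vector (so it does not need leaps or the data) and rebuilds $size with rep(); the variable names are chosen for illustration.

```r
# $adjr2 copied from the leaps output on this slide.
adjr2 <- c(
   0.734531140, 0.130502041, 0.129362176, 0.061492963, 0.023495780,
  -0.007080873, 0.747407426, 0.744382656, 0.727022568, 0.726715841,
   0.725213882, 0.331423675, 0.250289567, 0.238663728, 0.207123289,
   0.180390053, 0.790104958, 0.788876048, 0.757477967, 0.745367336,
   0.741499364, 0.735442756, 0.735365598, 0.717949838, 0.716918697,
   0.716790424, 0.811713247, 0.788260638, 0.787518105, 0.782696332,
   0.781231246, 0.753335728, 0.750114011, 0.740422515, 0.725675821,
   0.707129090, 0.817602176, 0.804437584, 0.781067331, 0.779299361,
   0.746441303, 0.464879114, 0.810839895
)
# $size: 6 models with 2 coefficients, 10 with 3, 10 with 4, 10 with 5,
# 6 with 6, and the full model with 7.
size <- rep(2:7, times = c(6, 10, 10, 10, 6, 1))

best <- which.max(adjr2)
best_by_match <- match(max(adjr2), adjr2)   # the two-step version gives the same index

print(best)          # 37
print(size[best])    # 6 coefficients, i.e. five IVs plus the intercept
print(adjr2[best])   # 0.817602176
```

The winner is a five-IV model, narrowly ahead of the best four-IV model (index 27, adjusted R² 0.8117) and the full model (index 43, 0.8108).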


leaps output for Cp: 2

$size
 [1] 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 6 6
[39] 6 6 6 6 7

$size is the number of coefficients in each model (IVs plus the intercept).

$Cp
 [1]  13.698840 106.302108 106.476860 116.881841 122.707162 127.394846
 [7]  12.389449  12.837184  15.406872  15.452274  15.674598  73.964510
[13]  85.974204  87.695093  92.363796  96.320923   6.959627   7.135037
[19]  11.616680  13.345306  13.897406  14.761903  14.772916  17.258776
[25]  17.405958  17.424267   4.879958   8.103512   8.205573   8.868324
[31]   9.069700  12.903931  13.346755  14.678848  16.705777  19.255019
[37]   5.106275   6.846150   9.934837  10.168497  14.511122  51.723275
[43]   7.000000

Given the set of Cp values for the 43 alternative models, which one is closest to 6?

solCp
bestModelInd <- which(abs(solCp$Cp-6)==min(abs(solCp$Cp-6)))
solCp$which[bestModelInd,]

This leads to a model that is suboptimal based on AIC or adjusted R².
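The same lookup can be traced on the printed values. The sketch below copies $Cp from the slide into a plain vector, applies the "closest to 6" rule, and, for comparison only, also locates the smallest Cp; the variable names are chosen for illustration.

```r
# $Cp copied from the leaps output on this slide.
cp <- c(
   13.698840, 106.302108, 106.476860, 116.881841, 122.707162, 127.394846,
   12.389449,  12.837184,  15.406872,  15.452274,  15.674598,  73.964510,
   85.974204,  87.695093,  92.363796,  96.320923,   6.959627,   7.135037,
   11.616680,  13.345306,  13.897406,  14.761903,  14.772916,  17.258776,
   17.405958,  17.424267,   4.879958,   8.103512,   8.205573,   8.868324,
    9.069700,  12.903931,  13.346755,  14.678848,  16.705777,  19.255019,
    5.106275,   6.846150,   9.934837,  10.168497,  14.511122,  51.723275,
    7.000000
)
size <- rep(2:7, times = c(6, 10, 10, 10, 6, 1))

closest_to_6 <- which.min(abs(cp - 6))   # the rule used on this slide
lowest_cp    <- which.min(cp)            # for comparison: smallest Cp overall

print(c(index = closest_to_6, size = size[closest_to_6]))   # 38, 6
print(c(index = lowest_cp, size = size[lowest_cp]))         # 27, 5
```

Index 38 is a five-IV model with adjusted R² 0.8044, below the 0.8176 of index 37, which illustrates why the slide calls the choice suboptimal. The smallest Cp (index 27) is a four-IV model.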


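Criteria used in model selection

• Ra² (adjusted R²):

$$R_a^2 = 1 - \frac{n-1}{n-m-1}\left(1 - R^2\right)$$

  where n is the number of observations and m is the number of IVs in the model.
• Cp
• SBC (BIC)
• AIC
• Significance level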
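As a quick check of the adjusted R² criterion, the value for the full six-IV model can be recomputed from numbers in the step() output: the full-model RSS (128.84) and the RSS of the intercept-only model (851.38), which is the total sum of squares. The sketch assumes the usual definition with n observations and m IVs; the function name `adj_r2` is hypothetical.

```r
# Adjusted R-squared from R-squared, n observations and m IVs.
adj_r2 <- function(r2, n, m) 1 - (n - 1) / (n - m - 1) * (1 - r2)

rss_full <- 128.84   # RSS of the full model (backward elimination, Start)
ss_total <- 851.38   # RSS of oxy ~ 1 (forward addition, Start)

r2_full  <- 1 - rss_full / ss_total
adj_full <- adj_r2(r2_full, n = 31, m = 6)

print(round(c(R2 = r2_full, adjR2 = adj_full), 4))
```

The result should agree, up to rounding of the RSS values, with the last entry of $adjr2 in the leaps output (0.810839895), which belongs to the full model.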

Burnham, K. P. and D. R. Anderson. 2002 Model selection and multimodel inference: a practical information-theoretic approach. 2nd ed. Springer. (Best book on model selection)