Chapter 9: Variable Selection and Model Building
In this chapter, we will talk about:
• Variable selection and the model-building problem,
• Several criteria for the evaluation of subset regression models,
• All possible regressions procedure,
• Backward elimination procedure,
• Forward selection procedure,
• Stepwise regression procedure.
In most practical problems, the analyst has a rather large pool of possible candidate regressors, of which only a few are likely to be important. Finding an appropriate subset of regressors for the model is often called the variable selection problem.
Building a regression model that includes only a subset of the available regressors involves two conflicting objectives:
(a) We would like the model to include as many regressors as possible so that the information content in these factors can influence the predicted value of y.
(b) We want the model to include as few regressors as possible because the variance of the prediction ŷ increases as the number of regressors increases.
By deleting variables from the model, we may improve the precision of the parameter estimates of the retained variables, even when some of the deleted variables are not negligible. This is also true for the variance of a predicted response. Deleting variables, however, potentially introduces bias into the estimates of the coefficients of the retained variables and of the response. Over-fitting a model (including variables with truly zero regression coefficients in the population) will not bias the estimates of the population regression coefficients, provided the usual regression assumptions are met. We must, however, ensure that over-fitting does not introduce harmful collinearity. The basic steps for variable selection are as follows:
(a) Specify the maximum model to be considered.
(b) Specify a criterion for selecting a model.
(c) Specify a strategy for selecting variables.
(d) Conduct the specified analysis.
(e) Evaluate the Validity of the model chosen. (Validity of a model is discussed in Chapter 10.)
Step 1: Specifying the Maximum Model: The maximum model is defined to be the largest model (the one having the most predictor variables) considered at any point in the process of model selection.
The particular sample of data to be analyzed imposes certain constraints on the choice of the maximum model. The most basic constraint is that the error degrees of freedom must be positive. Therefore, n - p = n - (k + 1) > 0, or equivalently n > p = k + 1, where n is the number of observations and k is the number of predictors, giving p = k + 1 regression coefficients (including the intercept). In general, we would like to have large error degrees of freedom. (This means that the smaller the sample size, the smaller the maximum model should be.) The question then arises as to how many degrees of freedom are needed. The weakest requirement is n - (k + 1) > 10. Another suggested rule of thumb for regression is to have at least 5 (or 10) observations per predictor, that is, n > 5k (or n > 10k).
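These rules of thumb can be checked mechanically. A minimal sketch (the function name and defaults are illustrative, not from the text):

```python
def sample_size_ok(n, k, per_predictor=10):
    """Check the rules of thumb for a maximum model with k predictors:
    the error degrees of freedom n - (k + 1) should exceed 10, and there
    should be at least per_predictor observations per predictor."""
    error_df = n - (k + 1)
    return error_df > 10 and n > per_predictor * k
```

For example, with n = 50 observations a maximum model with k = 3 predictors passes both checks, while k = 10 does not.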
Step 2: Specifying a Criterion for Selecting a Model: There are several criteria that can be used to evaluate subset regression models. The criterion that we use for model selection should certainly be related to the intended use of the model. Let SS_R(p) and SS_Res(p) denote the regression sum of squares and the residual sum of squares, respectively, for a regression model with p terms, that is, p - 1 regressors and an intercept term β_0.
(a) F-Test Statistic: Another reasonable criterion for selecting the best model is the F-test statistic for comparing the full and reduced models. The F-statistic is

    F = { [SS_R(k + 1) - SS_R(p)] / (k + 1 - p) } / { SS_Res(k + 1) / (n - k - 1) }

This statistic may be compared to an F-distribution with k + 1 - p and n - k - 1 degrees of freedom. If F is not significant, we can use the smaller (p - 1 variable) model.
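To make the comparison concrete, the sketch below computes this F-statistic for the Hald cement data of Example 2 below (the classic data set from Hald, 1952), comparing the full four-regressor model with the reduced model containing only x1 and x2. The helper name ss_res is illustrative:

```python
import numpy as np

# Hald cement data (Example 2)
y = np.array([78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7,
              72.5, 93.1, 115.9, 83.8, 113.3, 109.4])
X = np.array([[7, 26, 6, 60], [1, 29, 15, 52], [11, 56, 8, 20],
              [11, 31, 8, 47], [7, 52, 6, 33], [11, 55, 9, 22],
              [3, 71, 17, 6], [1, 31, 22, 44], [2, 54, 18, 22],
              [21, 47, 4, 26], [1, 40, 23, 34], [11, 66, 9, 12],
              [10, 68, 8, 12]], dtype=float)
n, k = len(y), X.shape[1]

def ss_res(cols):
    """Residual SS for the model with an intercept plus the regressors in cols."""
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    r = y - Z @ beta
    return float(r @ r)

p = 3                                  # reduced model: intercept + x1 + x2
ss_full = ss_res([0, 1, 2, 3])         # SS_Res(k + 1), about 47.86
ss_red = ss_res([0, 1])                # SS_Res(p), about 57.90
F = ((ss_red - ss_full) / (k + 1 - p)) / (ss_full / (n - k - 1))
# F is about 0.84 on (2, 8) df: dropping x3 and x4 is not significant
```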
(b) Coefficient of Determination: A measure of the adequacy of a regression model that has been widely used is the coefficient of determination R^2. Let R^2_p denote the coefficient of determination for a p-term subset model. Then

    R^2_p = SS_R(p) / SS_T = 1 - SS_Res(p) / SS_T
R^2_p increases as p increases and is maximum when p = k + 1. Therefore, the analyst uses this criterion by adding regressors to the model up to the point where an additional variable provides only a small increase in R^2_p.
Let

    R^2_0 = 1 - (1 - R^2_{k+1})(1 + d_{α,k,n-k-1}),  where  d_{α,k,n-k-1} = k F_{α,k,n-k-1} / (n - k - 1)

and R^2_{k+1} is the value of R^2 for the full model. Any subset of regressor variables producing an R^2 greater than R^2_0 is called an R^2-adequate (α) subset (that is, its R^2 is not significantly different from R^2_{k+1}).
Example 1: Suppose that we want to investigate how weight (WGT) varies with height (HGT) and age (AGE) for children with a particular kind of nutritional deficiency. The dependent variable here is Y = WGT, and the two basic independent variables are X_1 = HGT and X_2 = AGE. The WGT, HGT, and AGE for a random sample of 12 children who attend a certain clinic are given in Example 3 of Chapter 3.
For the full model (HGT, AGE, and AGE2), k = 3, n = 12, and R^2 = 0.7803. With F_{0.05,3,8} = 4.07,

    d_{0.05,3,8} = 3(4.07)/8 = 1.5263
    R^2_0 = 1 - (1 - 0.7803)(1 + 1.5263) = 0.4450
(c) Residual Mean Square: A third criterion to consider in selecting the best model is the estimated error variance for the p-term (that is, (p - 1)-variable) model, namely

    MS_Res(p) = SS_Res(p) / (n - p)

Because SS_Res(p) always decreases as p increases, MS_Res(p) initially decreases, then stabilizes, and eventually may increase. Advocates of the MS_Res(p) criterion will plot MS_Res(p) versus p and base the choice of p on one of the following:

1. the minimum MS_Res(p);
2. the value of p such that MS_Res(p) is approximately equal to MS_Res for the full model; or
3. a value of p near the point where the smallest MS_Res(p) turns upward.

Note that the subset regression model that minimizes MS_Res(p) will also maximize R^2_{Adj,p}.
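For the Hald data of Example 2 below, the MS_Res(p) criterion can be applied by brute force over all subsets. This sketch (helper names illustrative) recovers the SAS result that x1, x2, x4 gives the smallest residual mean square, 5.3303:

```python
import numpy as np
from itertools import combinations

# Hald cement data (Example 2)
y = np.array([78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7,
              72.5, 93.1, 115.9, 83.8, 113.3, 109.4])
X = np.array([[7, 26, 6, 60], [1, 29, 15, 52], [11, 56, 8, 20],
              [11, 31, 8, 47], [7, 52, 6, 33], [11, 55, 9, 22],
              [3, 71, 17, 6], [1, 31, 22, 44], [2, 54, 18, 22],
              [21, 47, 4, 26], [1, 40, 23, 34], [11, 66, 9, 12],
              [10, 68, 8, 12]], dtype=float)
n = len(y)

def ms_res(cols):
    """MS_Res(p) = SS_Res(p) / (n - p) for the model with an intercept
    plus the regressors in cols (0-based indices into X)."""
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    r = y - Z @ beta
    return float(r @ r) / (n - Z.shape[1])

subsets = [c for size in range(1, 5) for c in combinations(range(4), size)]
best = min(subsets, key=ms_res)        # (0, 1, 3), i.e. x1, x2, x4
```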
(d) Mallows's C_p Statistic: Another candidate for a selection criterion involving SS_Res(p) is Mallows's C_p:

    C_p = SS_Res(p) / MS_Res(k + 1) - n + 2p

The C_p criterion helps us decide how many variables to put in the best model, since C_p achieves a value of approximately p if MS_Res(p) is roughly equal to MS_Res(k + 1).
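A sketch of the C_p computation on the Hald data of Example 2 below (helper names illustrative); it reproduces C_p = 2.6782 for the x1, x2 subset reported in the SAS output:

```python
import numpy as np

# Hald cement data (Example 2)
y = np.array([78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7,
              72.5, 93.1, 115.9, 83.8, 113.3, 109.4])
X = np.array([[7, 26, 6, 60], [1, 29, 15, 52], [11, 56, 8, 20],
              [11, 31, 8, 47], [7, 52, 6, 33], [11, 55, 9, 22],
              [3, 71, 17, 6], [1, 31, 22, 44], [2, 54, 18, 22],
              [21, 47, 4, 26], [1, 40, 23, 34], [11, 66, 9, 12],
              [10, 68, 8, 12]], dtype=float)
n = len(y)

def ss_res(cols):
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    r = y - Z @ beta
    return float(r @ r)

sigma2 = ss_res([0, 1, 2, 3]) / (n - 5)   # MS_Res(k + 1) from the full model

def mallows_cp(cols):
    p = len(cols) + 1                     # number of terms, intercept included
    return ss_res(cols) / sigma2 - n + 2 * p
```

Note that C_p for the full model is always exactly p = k + 1 (here, 5), so only the subset models carry information.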
(e) PRESS: One can select the subset regression model based on a small value of PRESS. While PRESS has intuitive appeal, particularly for the prediction problem, it is not a simple function of the residual sum of squares, and developing an algorithm for variable selection based on this criterion is not straightforward. This statistic is, however, potentially useful for discriminating between alternative models.
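PRESS can nevertheless be computed from a single fit, since the i-th PRESS residual equals e_i / (1 - h_ii), where h_ii is the i-th diagonal element of the hat matrix. A sketch on the full Hald model (Example 2 below), with the identity checked against explicit leave-one-out refits:

```python
import numpy as np

# Hald cement data (Example 2)
y = np.array([78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7,
              72.5, 93.1, 115.9, 83.8, 113.3, 109.4])
X = np.array([[7, 26, 6, 60], [1, 29, 15, 52], [11, 56, 8, 20],
              [11, 31, 8, 47], [7, 52, 6, 33], [11, 55, 9, 22],
              [3, 71, 17, 6], [1, 31, 22, 44], [2, 54, 18, 22],
              [21, 47, 4, 26], [1, 40, 23, 34], [11, 66, 9, 12],
              [10, 68, 8, 12]], dtype=float)
n = len(y)
Z = np.column_stack([np.ones(n), X])            # full model: intercept + x1..x4

beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
e = y - Z @ beta                                # ordinary residuals
h = np.diag(Z @ np.linalg.pinv(Z.T @ Z) @ Z.T)  # hat-matrix diagonal
press = float(np.sum((e / (1 - h)) ** 2))

# the same quantity by brute-force leave-one-out refitting
loo = 0.0
for i in range(n):
    keep = np.arange(n) != i
    b, *_ = np.linalg.lstsq(Z[keep], y[keep], rcond=None)
    loo += float((y[i] - Z[i] @ b) ** 2)
```

Because every h_ii is positive, PRESS always exceeds the ordinary residual sum of squares.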
SAS Output:

Number in              Adjusted
Model      R-Square    R-Square    C(p)      MSE         Variables in Model
1          0.6630      0.6293      4.2682    29.93275    HGT
1          0.5926      0.5519      6.8310    36.18571    AGE
1          0.5876      0.5464      7.0138    36.63180    AGE2
----------------------------------------------------------------------------------
2          0.7800      0.7311      2.0097    21.71415    HGT AGE
2          0.7764      0.7267      2.1398    22.06667    HGT AGE2
2          0.5927      0.5022      8.8275    40.19683    AGE AGE2
----------------------------------------------------------------------------------
3          0.7803      0.6978      4.0000    24.39869    HGT AGE AGE2
Step 3: Specifying a Strategy for Selecting Variables:
(a) All possible regression procedure: The all possible regression procedure requires that we fit each possible regression equation associated with each possible combination of the k independent variables.
Example 2 (The Hald Cement Data): Hald (1952) presents data concerning the heat evolved in calories per gram of cement (Y) as a function of the amount of each of four ingredients in the mix: tricalcium aluminate (X_1), tricalcium silicate (X_2), tetracalcium alumino ferrite (X_3), and dicalcium silicate (X_4). The data are shown in Example 1 of Chapter 10.
For the full model, k = 4, n = 13, and R^2 = 0.9824. With F_{0.05,4,8} = 3.84,

    d_{0.05,4,8} = 4(3.84)/8 = 1.92
    R^2_0 = 1 - (1 - 0.9824)(1 + 1.92) = 0.9486
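The cutoff computation is just arithmetic; a minimal sketch:

```python
# R^2-adequate cutoff for the Hald data: k = 4, n = 13, full-model
# R^2 = 0.9824, and the table value F_{0.05,4,8} = 3.84.
k, n, r2_full, f_crit = 4, 13, 0.9824, 3.84
d = k * f_crit / (n - k - 1)          # d_{alpha,k,n-k-1} = 4(3.84)/8 = 1.92
r2_0 = 1 - (1 - r2_full) * (1 + d)    # approximately 0.9486
```

Comparing against the subset table below, the two-variable models x1 x2 (0.9787) and x1 x4 (0.9725), and every three-variable model, are R^2-adequate at α = 0.05.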
SAS Output: Hald Cement Data, Y on X1, X2, X3, and X4

The REG Procedure
Number of Observations Read    13
Number of Observations Used    13

Number in              Adjusted
Model      R-Square    R-Square    C(p)        MSE          Variables in Model
1          0.6745      0.6450      138.7308    80.35154     x4
1          0.6663      0.6359      142.4864    82.39421     x2
1          0.5339      0.4916      202.5488    115.06243    x1
1          0.2859      0.2210      315.1543    176.30913    x3
----------------------------------------------------------------------------------
2          0.9787      0.9744      2.6782      5.79045      x1 x2
2          0.9725      0.9670      5.4959      7.47621      x1 x4
2          0.9353      0.9223      22.3731     17.57380     x3 x4
2          0.8470      0.8164      62.4377     41.54427     x2 x3
2          0.6801      0.6161      138.2259    86.88801     x2 x4
2          0.5482      0.4578      198.0947    122.70721    x1 x3
----------------------------------------------------------------------------------
3          0.9823      0.9764      3.0182      5.33030      x1 x2 x4
3          0.9823      0.9764      3.0413      5.34562      x1 x2 x3
3          0.9813      0.9750      3.4968      5.64846      x1 x3 x4
3          0.9728      0.9638      7.3375      8.20162      x2 x3 x4
----------------------------------------------------------------------------------
4          0.9824      0.9736      5.0000      5.98295      x1 x2 x3 x4
If we assume that the intercept term β_0 is included in all equations, then if there are k candidate regressors, there are 2^k total equations to be estimated and examined. Therefore, the number of equations to be examined increases rapidly as the number of candidate regressors increases.
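A sketch of the all-possible-regressions enumeration for the Hald data (k = 4, hence 2^4 = 16 equations, including the intercept-only model); helper names are illustrative:

```python
import numpy as np
from itertools import combinations

# Hald cement data (Example 2)
y = np.array([78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7,
              72.5, 93.1, 115.9, 83.8, 113.3, 109.4])
X = np.array([[7, 26, 6, 60], [1, 29, 15, 52], [11, 56, 8, 20],
              [11, 31, 8, 47], [7, 52, 6, 33], [11, 55, 9, 22],
              [3, 71, 17, 6], [1, 31, 22, 44], [2, 54, 18, 22],
              [21, 47, 4, 26], [1, 40, 23, 34], [11, 66, 9, 12],
              [10, 68, 8, 12]], dtype=float)
n = len(y)
sst = float(((y - y.mean()) ** 2).sum())       # corrected total SS

def r_squared(cols):
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    r = y - Z @ beta
    return 1 - float(r @ r) / sst

# all 2^4 = 16 possible equations (the empty tuple is the intercept-only model)
models = [c for size in range(5) for c in combinations(range(4), size)]
n_models = len(models)
```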
(b) Backward Elimination Procedure: We begin with a model that includes all k candidate regressors. Then the partial F-statistic is computed for each regressor as if it were the last variable to enter the model. The smallest of these partial F-statistics is compared with a pre-selected value, F_OUT; if the smallest partial F value is less than F_OUT, that regressor is removed from the model. Now a regression model with k - 1 regressors is fit, the partial F-statistics for this new model are calculated, and the procedure is repeated. The backward elimination algorithm terminates when the smallest partial F value is not less than the pre-selected cutoff value F_OUT.

Example 1 (Cont.):
Backward Elimination: Step 0 All Variables Entered: R-Square = 0.7803 and C(p) = 4.0000
Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               3         693.06046      231.02015       9.47    0.0052
Error               8         195.18954       24.39869
Corrected Total    11         888.25000

Variable     Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept               3.43843          33.61082       0.25535       0.01    0.9210
HGT                     0.72369           0.27696     166.58195       6.83    0.0310
AGE                     2.77687           7.42728       3.41051       0.14    0.7182
AGE2                   -0.04171           0.42241       0.23786       0.01    0.9238
Backward Elimination: Step 1 Variable AGE2 Removed: R-Square = 0.7800 and C(p) = 2.0097
Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2         692.82261      346.41130      15.95    0.0011
Error               9         195.42739       21.71415
Corrected Total    11         888.25000

Variable     Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept               6.55305          10.94483       7.78416       0.36    0.5641
HGT                     0.72204           0.26081     166.42975       7.66    0.0218
AGE                     2.05013           0.93723     103.90008       4.78    0.0565
All variables left in the model are significant at the 0.1000 level.
Summary of Backward Elimination

Step    Variable Removed    Number Vars In    Partial R-Square    Model R-Square    C(p)      F Value    Pr > F
1       AGE2                2                 0.0003              0.7800            2.0097    0.01       0.9238
(c) Forward Selection Procedure: The procedure begins with the assumption that there are no regressors in the model other than the intercept. An effort is made to find an optimal subset by inserting regressors into the model one at a time. At each step, the regressor having the highest partial correlation with y (or equivalently the largest partial F-statistic given the other regressors already in the model) is added to the model if its partial F-statistic exceeds the pre-selected entry level F_IN.

Example 1 (Cont.):
Variable    R^2       F-value    P-value
HGT         0.6630    19.67      0.0013
AGE         0.5926    14.55      0.0034
AGE2        0.5876    14.25      0.0036
SAS Output:

Forward Selection: Step 1  Variable HGT Entered: R-Square = 0.6630 and C(p) = 4.2682
Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               1         588.92252      588.92252      19.67    0.0013
Error              10         299.32748       29.93275
Corrected Total    11         888.25000

Variable     Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept               6.18985          12.84875       6.94681       0.23    0.6404
HGT                     1.07223           0.24173     588.92252      19.67    0.0013
Forward Selection: Step 2 Variable AGE Entered: R-Square = 0.7800 and C(p) = 2.0097
Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2         692.82261      346.41130      15.95    0.0011
Error               9         195.42739       21.71415
Corrected Total    11         888.25000

Variable     Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept               6.55305          10.94483       7.78416       0.36    0.5641
HGT                     0.72204           0.26081     166.42975       7.66    0.0218
AGE                     2.05013           0.93723     103.90008       4.78    0.0565
No other variable met the 0.1000 significance level for entry into the model.
Summary of Forward Selection

Step    Variable Entered    Number Vars In    Partial R-Square    Model R-Square    C(p)      F Value    Pr > F
1       HGT                 1                 0.6630              0.6630            4.2682    19.67      0.0013
2       AGE                 2                 0.1170              0.7800            2.0097    4.78       0.0565
(d) Stepwise Regression Procedure: Stepwise regression is a modified version of forward selection that permits re-examination, at every step, of the variables incorporated in the model in previous steps. A variable that entered at an early stage may become superfluous at a later stage because of its relationship with other variables subsequently added to the model.
Example 1 (Cont.): In the forward selection procedure, HGT was added to the model first, and then AGE. Now, before testing whether (AGE)^2 should be added to the model, we find F(HGT | AGE) = 7.66, which exceeds F_{0.1,1,9} = 3.36. Thus we do not remove HGT from the model. Next we check whether we should add (AGE)^2. The answer is no.
Example 2 (Cont.):

Backward Elimination: Step 0  All Variables Entered: R-Square = 0.9824 and C(p) = 5.0000
Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               4        2667.89944      666.97486     111.48    <.0001
Error               8          47.86364        5.98295
Corrected Total    12        2715.76308

Variable     Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept              62.40537          70.07096       4.74552       0.79    0.3991
x1                      1.55110           0.74477      25.95091       4.34    0.0708
x2                      0.51017           0.72379       2.97248       0.50    0.5009
x3                      0.10191           0.75471       0.10909       0.02    0.8959
x4                     -0.14406           0.70905       0.24697       0.04    0.8441
Backward Elimination: Step 1 Variable x3 Removed: R-Square = 0.9823 and C(p) = 3.0182
Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               3        2667.79035      889.26345     166.83    <.0001
Error               9          47.97273        5.33030
Corrected Total    12        2715.76308

Variable     Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept              71.64831          14.14239     136.81003      25.67    0.0007
x1                      1.45194           0.11700     820.90740     154.01    <.0001
x2                      0.41611           0.18561      26.78938       5.03    0.0517
x4                     -0.23654           0.17329       9.93175       1.86    0.2054
Backward Elimination: Step 2 Variable x4 Removed: R-Square = 0.9787 and C(p) = 2.6782
Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2        2657.85859     1328.92930     229.50    <.0001
Error              10          57.90448        5.79045
Corrected Total    12        2715.76308

Variable     Parameter Estimate    Standard Error    Type II SS     F Value    Pr > F
Intercept              52.57735           2.28617     3062.60416     528.91    <.0001
x1                      1.46831           0.12130      848.43186     146.52    <.0001
x2                      0.66225           0.04585     1207.78227     208.58    <.0001
All variables left in the model are significant at the 0.1000 level.
Summary of Backward Elimination

Step    Variable Removed    Number Vars In    Partial R-Square    Model R-Square    C(p)      F Value    Pr > F
1       x3                  3                 0.0000              0.9823            3.0182    0.02       0.8959
2       x4                  2                 0.0037              0.9787            2.6782    1.86       0.2054
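The elimination path in the SAS output above can be reproduced directly. A sketch (helper names illustrative; for simplicity it uses a single fixed cutoff F_OUT = 3.29, roughly F_{0.10,1,9}, rather than recomputing the critical value at each step as SAS does, which here happens not to change the path):

```python
import numpy as np

# Hald cement data (Example 2)
y = np.array([78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7,
              72.5, 93.1, 115.9, 83.8, 113.3, 109.4])
X = np.array([[7, 26, 6, 60], [1, 29, 15, 52], [11, 56, 8, 20],
              [11, 31, 8, 47], [7, 52, 6, 33], [11, 55, 9, 22],
              [3, 71, 17, 6], [1, 31, 22, 44], [2, 54, 18, 22],
              [21, 47, 4, 26], [1, 40, 23, 34], [11, 66, 9, 12],
              [10, 68, 8, 12]], dtype=float)
n = len(y)

def ss_res(cols):
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    r = y - Z @ beta
    return float(r @ r)

def backward_eliminate(cols, f_out=3.29):
    cols = list(cols)
    while cols:
        ssr = ss_res(cols)
        ms = ssr / (n - len(cols) - 1)
        # partial F for each regressor, as if it were the last to enter
        f = {j: (ss_res([c for c in cols if c != j]) - ssr) / ms for j in cols}
        j_min = min(f, key=f.get)
        if f[j_min] >= f_out:
            break                      # nothing else falls below F_OUT
        cols.remove(j_min)             # drop the weakest regressor
    return cols
```

Running backward_eliminate([0, 1, 2, 3]) removes x3 first, then x4, leaving x1 and x2, exactly as in the summary table above.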
Forward Selection: Step 1
Variable x4 Entered: R-Square = 0.6745 and C(p) = 138.7308
Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               1        1831.89616     1831.89616      22.80    0.0006
Error              11         883.86692       80.35154
Corrected Total    12        2715.76308

Variable     Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept             117.56793           5.26221         40108     499.16    <.0001
x4                     -0.73816           0.15460    1831.89616      22.80    0.0006
Forward Selection: Step 2 Variable x1 Entered: R-Square = 0.9725 and C(p) = 5.4959
Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2        2641.00096     1320.50048     176.63    <.0001
Error              10          74.76211        7.47621
Corrected Total    12        2715.76308

Variable     Parameter Estimate    Standard Error    Type II SS     F Value    Pr > F
Intercept             103.09738           2.12398          17615    2356.10    <.0001
x1                      1.43996           0.13842      809.10480     108.22    <.0001
x4                     -0.61395           0.04864     1190.92464     159.30    <.0001
Forward Selection: Step 3 Variable x2 Entered: R-Square = 0.9823 and C(p) = 3.0182
Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               3        2667.79035      889.26345     166.83    <.0001
Error               9          47.97273        5.33030
Corrected Total    12        2715.76308

Variable     Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept              71.64831          14.14239     136.81003      25.67    0.0007
x1                      1.45194           0.11700     820.90740     154.01    <.0001
x2                      0.41611           0.18561      26.78938       5.03    0.0517
x4                     -0.23654           0.17329       9.93175       1.86    0.2054
No other variable met the 0.1000 significance level for entry into the model.
Summary of Forward Selection

Step    Variable Entered    Number Vars In    Partial R-Square    Model R-Square    C(p)       F Value    Pr > F
1       x4                  1                 0.6745              0.6745            138.731    22.80      0.0006
2       x1                  2                 0.2979              0.9725            5.4959     108.22     <.0001
3       x2                  3                 0.0099              0.9823            3.0182     5.03       0.0517
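The forward path above (x4, then x1, then x2) can likewise be reproduced. A sketch under the same simplifying assumption of a fixed entry cutoff F_IN = 3.29 (helper names illustrative):

```python
import numpy as np

# Hald cement data (Example 2)
y = np.array([78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7,
              72.5, 93.1, 115.9, 83.8, 113.3, 109.4])
X = np.array([[7, 26, 6, 60], [1, 29, 15, 52], [11, 56, 8, 20],
              [11, 31, 8, 47], [7, 52, 6, 33], [11, 55, 9, 22],
              [3, 71, 17, 6], [1, 31, 22, 44], [2, 54, 18, 22],
              [21, 47, 4, 26], [1, 40, 23, 34], [11, 66, 9, 12],
              [10, 68, 8, 12]], dtype=float)
n = len(y)

def ss_res(cols):
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    r = y - Z @ beta
    return float(r @ r)

def forward_select(candidates, f_in=3.29):
    pool, model = list(candidates), []
    while pool:
        base = ss_res(model)               # SS_Res of the current model
        f = {}
        for j in pool:
            trial = model + [j]
            ssr = ss_res(trial)
            f[j] = (base - ssr) / (ssr / (n - len(trial) - 1))
        j_best = max(f, key=f.get)
        if f[j_best] < f_in:
            break                          # no candidate clears F_IN
        model.append(j_best)
        pool.remove(j_best)
    return model
```

Running forward_select([0, 1, 2, 3]) enters the regressors in the order x4, x1, x2 and then stops, matching the SAS summary.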
Stepwise Selection: Step 1 Variable x4 Entered: R-Square = 0.6745 and C(p) = 138.7308
Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               1        1831.89616     1831.89616      22.80    0.0006
Error              11         883.86692       80.35154
Corrected Total    12        2715.76308

Variable     Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept             117.56793           5.26221         40108     499.16    <.0001
x4                     -0.73816           0.15460    1831.89616      22.80    0.0006
Stepwise Selection: Step 2 Variable x1 Entered: R-Square = 0.9725 and C(p) = 5.4959
Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2        2641.00096     1320.50048     176.63    <.0001
Error              10          74.76211        7.47621
Corrected Total    12        2715.76308

Variable     Parameter Estimate    Standard Error    Type II SS     F Value    Pr > F
Intercept             103.09738           2.12398          17615    2356.10    <.0001
x1                      1.43996           0.13842      809.10480     108.22    <.0001
x4                     -0.61395           0.04864     1190.92464     159.30    <.0001
Stepwise Selection: Step 3 Variable x2 Entered: R-Square = 0.9823 and C(p) = 3.0182
Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               3        2667.79035      889.26345     166.83    <.0001
Error               9          47.97273        5.33030
Corrected Total    12        2715.76308

Variable     Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept              71.64831          14.14239     136.81003      25.67    0.0007
x1                      1.45194           0.11700     820.90740     154.01    <.0001
x2                      0.41611           0.18561      26.78938       5.03    0.0517
x4                     -0.23654           0.17329       9.93175       1.86    0.2054
Stepwise Selection: Step 4 Variable x4 Removed: R-Square = 0.9787 and C(p) = 2.6782
Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2        2657.85859     1328.92930     229.50    <.0001
Error              10          57.90448        5.79045
Corrected Total    12        2715.76308

Variable     Parameter Estimate    Standard Error    Type II SS     F Value    Pr > F
Intercept              52.57735           2.28617     3062.60416     528.91    <.0001
x1                      1.46831           0.12130      848.43186     146.52    <.0001
x2                      0.66225           0.04585     1207.78227     208.58    <.0001
All variables left in the model are significant at the 0.1000 level. No other variable met the 0.1000 significance level for entry into the model.
Summary of Stepwise Selection

Step    Variable Entered    Variable Removed    Number Vars In    Partial R-Square    Model R-Square    C(p)       F Value    Pr > F
1       x4                                      1                 0.6745              0.6745            138.731    22.80      0.0006
2       x1                                      2                 0.2979              0.9725            5.4959     108.22     <.0001
3       x2                                      3                 0.0099              0.9823            3.0182     5.03       0.0517
4                           x4                  2                 0.0037              0.9787            2.6782     1.86       0.2054
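Stepwise selection combines a forward entry step with a backward check of everything already in the model. A sketch under the same simplifying assumption of fixed cutoffs F_IN = F_OUT = 3.29 (helper names illustrative); on the Hald data it follows the path above: x4 in, x1 in, x2 in, then x4 out:

```python
import numpy as np

# Hald cement data (Example 2)
y = np.array([78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7,
              72.5, 93.1, 115.9, 83.8, 113.3, 109.4])
X = np.array([[7, 26, 6, 60], [1, 29, 15, 52], [11, 56, 8, 20],
              [11, 31, 8, 47], [7, 52, 6, 33], [11, 55, 9, 22],
              [3, 71, 17, 6], [1, 31, 22, 44], [2, 54, 18, 22],
              [21, 47, 4, 26], [1, 40, 23, 34], [11, 66, 9, 12],
              [10, 68, 8, 12]], dtype=float)
n = len(y)

def ss_res(cols):
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    r = y - Z @ beta
    return float(r @ r)

def stepwise(candidates, f_in=3.29, f_out=3.29):
    pool, model = list(candidates), []
    while True:
        # forward step: candidate with the largest partial F
        f = {}
        for j in pool:
            trial = model + [j]
            ssr = ss_res(trial)
            f[j] = (ss_res(model) - ssr) / (ssr / (n - len(trial) - 1))
        if not f or max(f.values()) < f_in:
            break
        j = max(f, key=f.get)
        model.append(j)
        pool.remove(j)
        # backward check: re-examine every variable already in the model
        while True:
            ssr = ss_res(model)
            ms = ssr / (n - len(model) - 1)
            g = {i: (ss_res([c for c in model if c != i]) - ssr) / ms
                 for i in model}
            i_min = min(g, key=g.get)
            if g[i_min] >= f_out:
                break
            model.remove(i_min)        # a variable entered earlier drops out
            pool.append(i_min)
    return model
```

Running stepwise([0, 1, 2, 3]) ends with x1 and x2, the same final model the SAS stepwise summary reports.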
Strategy to Select the Best Regression Equation:
[Flowchart: fit the full model → perform residual analysis → do we need a transformation? If yes, transform the data and refit; if no, perform all possible regressions → select models for further analysis → make recommendations.]
To select the best regression equation, carry out the following steps:
1) Fit the largest model possible to the data.
2) Perform a thorough analysis of this model.
3) Determine if a transformation of the response or of some of the regressors is necessary.
4) Determine if all possible regressions is feasible.
• If all possible regressions is feasible, perform all possible regressions, using such criteria as Mallows's C_p, adjusted R^2, and the PRESS statistic to rank the best subset models.
• If all possible regressions is not feasible, use backward, forward, and stepwise selection techniques to generate the largest model for which all possible regressions is feasible. Then perform all possible regressions as outlined above.
5) Compare and contrast the best models recommended by each criterion.
6) Perform thorough analyses of the "best models" (usually three to five models).
7) Explore the need for further transformations.
8) Discuss with the subject-matter experts the relative advantages and disadvantages of the final set of models.