Chapter 9: Variable Selection and Model Building
In this chapter, we will talk about:
• Variable selection and the model-building problem,
• Several criteria for the evaluation of subset regression models,
• All possible regressions procedure,
• Backward elimination procedure,
• Forward selection procedure,
• Stepwise regression procedure.
In most practical problems, the analyst has a rather large pool of possible candidate regressors, of which only a few are likely to be important. Finding an appropriate subset of regressors for the model is often called the variable selection problem.
Building a regression model that includes only a subset of the available regressors involves two conflicting objectives:
(a) We would like the model to include as many regressors as possible so that the information content in these factors can influence the predicted value of y.
(b) We want the model to include as few regressors as possible because the variance of the prediction ŷ increases as the number of regressors increases.
By deleting variables from the model, we may improve the precision of the parameter estimates of the retained variables, even when some of the deleted variables are not negligible. This is also true for the variance of a predicted response. Deleting variables, however, potentially introduces bias into the estimates of the coefficients of the retained variables and of the response. Over-fitting a model (including variables with truly zero regression coefficients in the population) will not bias the estimates of the population regression coefficients, provided the usual regression assumptions are met. We must, however, ensure that over-fitting does not introduce harmful collinearity. The basic steps for variable selection are as follows:
(a) Specify the maximum model to be considered.
(b) Specify a criterion for selecting a model.
(c) Specify a strategy for selecting variables.
(d) Conduct the specified analysis.
(e) Evaluate the Validity of the model chosen. (Validity of a model is discussed in Chapter 10.)
Step 1: Specifying the Maximum Model: The maximum model is defined to be the largest model (the one having the most predictor variables) considered at any point in the process of model selection.
The particular sample of data to be analyzed imposes certain constraints on the choice of the maximum model. The most basic constraint is that the error degrees of freedom must be positive. Therefore, n - p = n - (k + 1) > 0, or equivalently n > p = k + 1, where n is the number of observations and k is the number of predictors, giving p = k + 1 regression coefficients (including the intercept). In general, we would like to have large error degrees of freedom. (This means that the smaller the sample size, the smaller the maximum model should be.) The question then arises as to how many degrees of freedom are needed. The weakest requirement is n - (k + 1) > 10. Another suggested rule of thumb for regression is to have at least 5 (or 10) observations per predictor, that is, n > 5k (or n > 10k).
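These rules of thumb can be checked mechanically. A minimal sketch (the function name and defaults are illustrative, not from the text):

```python
def sample_size_ok(n, k, per_predictor=10):
    """Check the rules of thumb for a maximum model with k predictors:
    the error degrees of freedom n - (k + 1) should exceed 10, and there
    should be at least per_predictor observations per predictor."""
    error_df = n - (k + 1)
    return error_df > 10 and n > per_predictor * k
```

For example, with n = 50 observations a maximum model with k = 3 predictors passes both checks, while k = 10 does not.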
Step 2: Specifying a Criterion for Selecting a Model: There are several criteria that can be used to evaluate subset regression models. The criterion that we use for model selection should certainly be related to the intended use of the model. Let SS_R(p) and SS_Res(p) denote the regression sum of squares and the residual sum of squares, respectively, for a regression model with p terms, that is, p - 1 regressors and an intercept term β_0.
(a) F-Test Statistic: Another reasonable criterion for selecting the best model is the F-test statistic for comparing the full and reduced models. The F-statistic is

    F = { [SS_R(k + 1) - SS_R(p)] / (k + 1 - p) } / { SS_Res(k + 1) / (n - k - 1) }

This statistic may be compared to an F-distribution with k + 1 - p and n - k - 1 degrees of freedom. If F is not significant, we can use the smaller (p - 1 variable) model.
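To make the comparison concrete, the sketch below computes this F-statistic for the Hald cement data of Example 2 below (the classic data set from Hald, 1952), comparing the full four-regressor model with the reduced model containing only x1 and x2. The helper name ss_res is illustrative:

```python
import numpy as np

# Hald cement data (Example 2)
y = np.array([78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7,
              72.5, 93.1, 115.9, 83.8, 113.3, 109.4])
X = np.array([[7, 26, 6, 60], [1, 29, 15, 52], [11, 56, 8, 20],
              [11, 31, 8, 47], [7, 52, 6, 33], [11, 55, 9, 22],
              [3, 71, 17, 6], [1, 31, 22, 44], [2, 54, 18, 22],
              [21, 47, 4, 26], [1, 40, 23, 34], [11, 66, 9, 12],
              [10, 68, 8, 12]], dtype=float)
n, k = len(y), X.shape[1]

def ss_res(cols):
    """Residual SS for the model with an intercept plus the regressors in cols."""
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    r = y - Z @ beta
    return float(r @ r)

p = 3                                  # reduced model: intercept + x1 + x2
ss_full = ss_res([0, 1, 2, 3])         # SS_Res(k + 1), about 47.86
ss_red = ss_res([0, 1])                # SS_Res(p), about 57.90
F = ((ss_red - ss_full) / (k + 1 - p)) / (ss_full / (n - k - 1))
# F is about 0.84 on (2, 8) df: dropping x3 and x4 is not significant
```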
(b) Coefficient of Determination: A measure of the adequacy of a regression model that has been widely used is the coefficient of determination R^2. Let R^2_p denote the coefficient of determination for a p-term subset model. Then

    R^2_p = SS_R(p) / SS_T = 1 - SS_Res(p) / SS_T
R^2_p increases as p increases and is maximum when p = k + 1. Therefore, the analyst uses this criterion by adding regressors to the model up to the point where an additional variable provides only a small increase in R^2_p.
Let

    R^2_0 = 1 - (1 - R^2_{k+1})(1 + d_{α,k,n-k-1}),  where  d_{α,k,n-k-1} = k F_{α,k,n-k-1} / (n - k - 1)

and R^2_{k+1} is the value of R^2 for the full model. Any subset of regressor variables producing an R^2 greater than R^2_0 is called an R^2-adequate (α) subset (that is, its R^2 is not significantly different from R^2_{k+1}).
Example 1: Suppose that we want to investigate how weight (WGT) varies with height (HGT) and age (AGE) for children with a particular kind of nutritional deficiency. The dependent variable here is Y = WGT, and the two basic independent variables are X_1 = HGT and X_2 = AGE. The WGT, HGT, and AGE for a random sample of 12 children who attend a certain clinic are given in Example 3 of Chapter 3.
For the full model (HGT, AGE, and AGE2), k = 3, n = 12, and R^2 = 0.7803. With F_{0.05,3,8} = 4.07,

    d_{0.05,3,8} = 3(4.07)/8 = 1.5263
    R^2_0 = 1 - (1 - 0.7803)(1 + 1.5263) = 0.4450
(c) Residual Mean Square: A third criterion to consider in selecting the best model is the estimated error variance for the p-term (that is, (p - 1)-variable) model, namely

    MS_Res(p) = SS_Res(p) / (n - p)

Because SS_Res(p) always decreases as p increases, MS_Res(p) initially decreases, then stabilizes, and eventually may increase. Advocates of the MS_Res(p) criterion will plot MS_Res(p) versus p and base the choice of p on one of the following:

1. the minimum MS_Res(p);
2. the value of p such that MS_Res(p) is approximately equal to MS_Res for the full model; or
3. a value of p near the point where the smallest MS_Res(p) turns upward.

Note that the subset regression model that minimizes MS_Res(p) will also maximize R^2_{Adj,p}.
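For the Hald data of Example 2 below, the MS_Res(p) criterion can be applied by brute force over all subsets. This sketch (helper names illustrative) recovers the SAS result that x1, x2, x4 gives the smallest residual mean square, 5.3303:

```python
import numpy as np
from itertools import combinations

# Hald cement data (Example 2)
y = np.array([78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7,
              72.5, 93.1, 115.9, 83.8, 113.3, 109.4])
X = np.array([[7, 26, 6, 60], [1, 29, 15, 52], [11, 56, 8, 20],
              [11, 31, 8, 47], [7, 52, 6, 33], [11, 55, 9, 22],
              [3, 71, 17, 6], [1, 31, 22, 44], [2, 54, 18, 22],
              [21, 47, 4, 26], [1, 40, 23, 34], [11, 66, 9, 12],
              [10, 68, 8, 12]], dtype=float)
n = len(y)

def ms_res(cols):
    """MS_Res(p) = SS_Res(p) / (n - p) for the model with an intercept
    plus the regressors in cols (0-based indices into X)."""
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    r = y - Z @ beta
    return float(r @ r) / (n - Z.shape[1])

subsets = [c for size in range(1, 5) for c in combinations(range(4), size)]
best = min(subsets, key=ms_res)        # (0, 1, 3), i.e. x1, x2, x4
```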
(d) Mallows's C_p Statistic: Another candidate for a selection criterion involving SS_Res(p) is Mallows's C_p:

    C_p = SS_Res(p) / MS_Res(k + 1) - n + 2p

The C_p criterion helps us decide how many variables to put in the best model, since C_p achieves a value of approximately p if MS_Res(p) is roughly equal to MS_Res(k + 1).
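A sketch of the C_p computation on the Hald data of Example 2 below (helper names illustrative); it reproduces C_p = 2.6782 for the x1, x2 subset reported in the SAS output:

```python
import numpy as np

# Hald cement data (Example 2)
y = np.array([78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7,
              72.5, 93.1, 115.9, 83.8, 113.3, 109.4])
X = np.array([[7, 26, 6, 60], [1, 29, 15, 52], [11, 56, 8, 20],
              [11, 31, 8, 47], [7, 52, 6, 33], [11, 55, 9, 22],
              [3, 71, 17, 6], [1, 31, 22, 44], [2, 54, 18, 22],
              [21, 47, 4, 26], [1, 40, 23, 34], [11, 66, 9, 12],
              [10, 68, 8, 12]], dtype=float)
n = len(y)

def ss_res(cols):
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    r = y - Z @ beta
    return float(r @ r)

sigma2 = ss_res([0, 1, 2, 3]) / (n - 5)   # MS_Res(k + 1) from the full model

def mallows_cp(cols):
    p = len(cols) + 1                     # number of terms, intercept included
    return ss_res(cols) / sigma2 - n + 2 * p
```

Note that C_p for the full model is always exactly p = k + 1 (here, 5), so only the subset models carry information.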
(e) PRESS: One can select the subset regression model based on a small value of PRESS. While PRESS has intuitive appeal, particularly for the prediction problem, it is not a simple function of the residual sum of squares, and developing an algorithm for variable selection based on this criterion is not straightforward. This statistic is, however, potentially useful for discriminating between alternative models.
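PRESS can nevertheless be computed from a single fit, since the i-th PRESS residual equals e_i / (1 - h_ii), where h_ii is the i-th diagonal element of the hat matrix. A sketch on the full Hald model (Example 2 below), with the identity checked against explicit leave-one-out refits:

```python
import numpy as np

# Hald cement data (Example 2)
y = np.array([78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7,
              72.5, 93.1, 115.9, 83.8, 113.3, 109.4])
X = np.array([[7, 26, 6, 60], [1, 29, 15, 52], [11, 56, 8, 20],
              [11, 31, 8, 47], [7, 52, 6, 33], [11, 55, 9, 22],
              [3, 71, 17, 6], [1, 31, 22, 44], [2, 54, 18, 22],
              [21, 47, 4, 26], [1, 40, 23, 34], [11, 66, 9, 12],
              [10, 68, 8, 12]], dtype=float)
n = len(y)
Z = np.column_stack([np.ones(n), X])            # full model: intercept + x1..x4

beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
e = y - Z @ beta                                # ordinary residuals
h = np.diag(Z @ np.linalg.pinv(Z.T @ Z) @ Z.T)  # hat-matrix diagonal
press = float(np.sum((e / (1 - h)) ** 2))

# the same quantity by brute-force leave-one-out refitting
loo = 0.0
for i in range(n):
    keep = np.arange(n) != i
    b, *_ = np.linalg.lstsq(Z[keep], y[keep], rcond=None)
    loo += float((y[i] - Z[i] @ b) ** 2)
```

Because every h_ii is positive, PRESS always exceeds the ordinary residual sum of squares.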
SAS Output:

Number in              Adjusted
Model      R-Square    R-Square    C(p)      MSE         Variables in Model
1          0.6630      0.6293      4.2682    29.93275    HGT
1          0.5926      0.5519      6.8310    36.18571    AGE
1          0.5876      0.5464      7.0138    36.63180    AGE2
----------------------------------------------------------------------------------
2          0.7800      0.7311      2.0097    21.71415    HGT AGE
2          0.7764      0.7267      2.1398    22.06667    HGT AGE2
2          0.5927      0.5022      8.8275    40.19683    AGE AGE2
----------------------------------------------------------------------------------
3          0.7803      0.6978      4.0000    24.39869    HGT AGE AGE2
Step 3: Specifying a Strategy for Selecting Variables:
(a) All possible regression procedure: The all possible regression procedure requires that we fit each possible regression equation associated with each possible combination of the k independent variables.
Example 2 (The Hald Cement Data): Hald (1952) presents data concerning the heat evolved in calories per gram of cement (Y) as a function of the amount of each of four ingredients in the mix: tricalcium aluminate (X_1), tricalcium silicate (X_2), tetracalcium alumino ferrite (X_3), and dicalcium silicate (X_4). The data are shown in Example 1 of Chapter 10.
For the full model, k = 4, n = 13, and R^2 = 0.9824. With F_{0.05,4,8} = 3.84,

    d_{0.05,4,8} = 4(3.84)/8 = 1.92
    R^2_0 = 1 - (1 - 0.9824)(1 + 1.92) = 0.9486
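The cutoff computation is just arithmetic; a minimal sketch:

```python
# R^2-adequate cutoff for the Hald data: k = 4, n = 13, full-model
# R^2 = 0.9824, and the table value F_{0.05,4,8} = 3.84.
k, n, r2_full, f_crit = 4, 13, 0.9824, 3.84
d = k * f_crit / (n - k - 1)          # d_{alpha,k,n-k-1} = 4(3.84)/8 = 1.92
r2_0 = 1 - (1 - r2_full) * (1 + d)    # approximately 0.9486
```

Comparing against the subset table below, the two-variable models x1 x2 (0.9787) and x1 x4 (0.9725), and every three-variable model, are R^2-adequate at α = 0.05.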
SAS Output: Hald Cement Data, Y on X1, X2, X3, and X4

The REG Procedure
Number of Observations Read    13
Number of Observations Used    13

Number in              Adjusted
Model      R-Square    R-Square    C(p)        MSE          Variables in Model
1          0.6745      0.6450      138.7308    80.35154     x4
1          0.6663      0.6359      142.4864    82.39421     x2
1          0.5339      0.4916      202.5488    115.06243    x1
1          0.2859      0.2210      315.1543    176.30913    x3
----------------------------------------------------------------------------------
2          0.9787      0.9744      2.6782      5.79045      x1 x2
2          0.9725      0.9670      5.4959      7.47621      x1 x4
2          0.9353      0.9223      22.3731     17.57380     x3 x4
2          0.8470      0.8164      62.4377     41.54427     x2 x3
2          0.6801      0.6161      138.2259    86.88801     x2 x4
2          0.5482      0.4578      198.0947    122.70721    x1 x3
----------------------------------------------------------------------------------
3          0.9823      0.9764      3.0182      5.33030      x1 x2 x4
3          0.9823      0.9764      3.0413      5.34562      x1 x2 x3
3          0.9813      0.9750      3.4968      5.64846      x1 x3 x4
3          0.9728      0.9638      7.3375      8.20162      x2 x3 x4
----------------------------------------------------------------------------------
4          0.9824      0.9736      5.0000      5.98295      x1 x2 x3 x4
If we assume that the intercept term β_0 is included in all equations, then if there are k candidate regressors, there are 2^k total equations to be estimated and examined. Therefore, the number of equations to be examined increases rapidly as the number of candidate regressors increases.
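A sketch of the all-possible-regressions enumeration for the Hald data (k = 4, hence 2^4 = 16 equations, including the intercept-only model); helper names are illustrative:

```python
import numpy as np
from itertools import combinations

# Hald cement data (Example 2)
y = np.array([78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7,
              72.5, 93.1, 115.9, 83.8, 113.3, 109.4])
X = np.array([[7, 26, 6, 60], [1, 29, 15, 52], [11, 56, 8, 20],
              [11, 31, 8, 47], [7, 52, 6, 33], [11, 55, 9, 22],
              [3, 71, 17, 6], [1, 31, 22, 44], [2, 54, 18, 22],
              [21, 47, 4, 26], [1, 40, 23, 34], [11, 66, 9, 12],
              [10, 68, 8, 12]], dtype=float)
n = len(y)
sst = float(((y - y.mean()) ** 2).sum())       # corrected total SS

def r_squared(cols):
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    r = y - Z @ beta
    return 1 - float(r @ r) / sst

# all 2^4 = 16 possible equations (the empty tuple is the intercept-only model)
models = [c for size in range(5) for c in combinations(range(4), size)]
n_models = len(models)
```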
(b) Backward Elimination Procedure: We begin with a model that includes all k candidate regressors. Then the partial F-statistic is computed for each regressor as if it were the last variable to enter the model. The smallest of these partial F-statistics is compared with a pre-selected value, F_OUT; if the smallest partial F value is less than F_OUT, that regressor is removed from the model. Now a regression model with k - 1 regressors is fit, the partial F-statistics for this new model are calculated, and the procedure is repeated. The backward elimination algorithm terminates when the smallest partial F value is not less than the pre-selected cutoff value F_OUT.

Example 1 (Cont.):
Backward Elimination: Step 0 All Variables Entered: R-Square = 0.7803 and C(p) = 4.0000
Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               3         693.06046      231.02015       9.47    0.0052
Error               8         195.18954       24.39869
Corrected Total    11         888.25000

Variable     Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept               3.43843          33.61082       0.25535       0.01    0.9210
HGT                     0.72369           0.27696     166.58195       6.83    0.0310
AGE                     2.77687           7.42728       3.41051       0.14    0.7182
AGE2                   -0.04171           0.42241       0.23786       0.01    0.9238
Backward Elimination: Step 1 Variable AGE2 Removed: R-Square = 0.7800 and C(p) = 2.0097
Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2         692.82261      346.41130      15.95    0.0011
Error               9         195.42739       21.71415
Corrected Total    11         888.25000

Variable     Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept               6.55305          10.94483       7.78416       0.36    0.5641
HGT                     0.72204           0.26081     166.42975       7.66    0.0218
AGE                     2.05013           0.93723     103.90008       4.78    0.0565
All variables left in the model are significant at the 0.1000 level.
Summary of Backward Elimination

Step    Variable Removed    Number Vars In    Partial R-Square    Model R-Square    C(p)      F Value    Pr > F
1       AGE2                2                 0.0003              0.7800            2.0097    0.01       0.9238
(c) Forward Selection Procedure: The procedure begins with the assumption that there are no regressors in the model other than the intercept. An effort is made to find an optimal subset by inserting regressors into the model one at a time. At each step, the regressor having the highest partial correlation with y (or equivalently the largest partial F-statistic given the other regressors already in the model) is added to the model if its partial F-statistic exceeds the pre-selected entry level F_IN.

Example 1 (Cont.):
Variable    R^2       F-value    P-value
HGT         0.6630    19.67      0.0013
AGE         0.5926    14.55      0.0034
AGE2        0.5876    14.25      0.0036
SAS Output:

Forward Selection: Step 1  Variable HGT Entered: R-Square = 0.6630 and C(p) = 4.2682
Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               1         588.92252      588.92252      19.67    0.0013
Error              10         299.32748       29.93275
Corrected Total    11         888.25000

Variable     Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept               6.18985          12.84875       6.94681       0.23    0.6404
HGT                     1.07223           0.24173     588.92252      19.67    0.0013
Forward Selection: Step 2 Variable AGE Entered: R-Square = 0.7800 and C(p) = 2.0097
Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2         692.82261      346.41130      15.95    0.0011
Error               9         195.42739       21.71415
Corrected Total    11         888.25000

Variable     Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept               6.55305          10.94483       7.78416       0.36    0.5641
HGT                     0.72204           0.26081     166.42975       7.66    0.0218
AGE                     2.05013           0.93723     103.90008       4.78    0.0565
No other variable met the 0.1000 significance level for entry into the model.
Summary of Forward Selection

Step    Variable Entered    Number Vars In    Partial R-Square    Model R-Square    C(p)      F Value    Pr > F
1       HGT                 1                 0.6630              0.6630            4.2682    19.67      0.0013
2       AGE                 2                 0.1170              0.7800            2.0097    4.78       0.0565
(d) Stepwise Regression Procedure: Stepwise regression is a modified version of forward selection that permits re-examination, at every step, of the variables incorporated in the model in previous steps. A variable that entered at an early stage may become superfluous at a later stage because of its relationship with other variables subsequently added to the model.
Example 1 (Cont.): In the forward selection procedure, HGT was added to the model first, and then AGE. Now, before testing whether (AGE)^2 should be added to the model, we find F(HGT | AGE) = 7.66, which exceeds F_{0.1,1,9} = 3.36. Thus we do not remove HGT from the model. Next we check whether we should add (AGE)^2. The answer is no.
Example 2 (Cont.):

Backward Elimination: Step 0  All Variables Entered: R-Square = 0.9824 and C(p) = 5.0000
Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               4        2667.89944      666.97486     111.48    <.0001
Error               8          47.86364        5.98295
Corrected Total    12        2715.76308

Variable     Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept              62.40537          70.07096       4.74552       0.79    0.3991
x1                      1.55110           0.74477      25.95091       4.34    0.0708
x2                      0.51017           0.72379       2.97248       0.50    0.5009
x3                      0.10191           0.75471       0.10909       0.02    0.8959
x4                     -0.14406           0.70905       0.24697       0.04    0.8441
Backward Elimination: Step 1 Variable x3 Removed: R-Square = 0.9823 and C(p) = 3.0182
Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               3        2667.79035      889.26345     166.83    <.0001
Error               9          47.97273        5.33030
Corrected Total    12        2715.76308

Variable     Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept              71.64831          14.14239     136.81003      25.67    0.0007
x1                      1.45194           0.11700     820.90740     154.01    <.0001
x2                      0.41611           0.18561      26.78938       5.03    0.0517
x4                     -0.23654           0.17329       9.93175       1.86    0.2054
Backward Elimination: Step 2 Variable x4 Removed: R-Square = 0.9787 and C(p) = 2.6782
Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2        2657.85859     1328.92930     229.50    <.0001
Error              10          57.90448        5.79045
Corrected Total    12        2715.76308

Variable     Parameter Estimate    Standard Error    Type II SS     F Value    Pr > F
Intercept              52.57735           2.28617     3062.60416     528.91    <.0001
x1                      1.46831           0.12130      848.43186     146.52    <.0001
x2                      0.66225           0.04585     1207.78227     208.58    <.0001
All variables left in the model are significant at the 0.1000 level.
Summary of Backward Elimination

Step    Variable Removed    Number Vars In    Partial R-Square    Model R-Square    C(p)      F Value    Pr > F
1       x3                  3                 0.0000              0.9823            3.0182    0.02       0.8959
2       x4                  2                 0.0037              0.9787            2.6782    1.86       0.2054
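The elimination path in the SAS output above can be reproduced directly. A sketch (helper names illustrative; for simplicity it uses a single fixed cutoff F_OUT = 3.29, roughly F_{0.10,1,9}, rather than recomputing the critical value at each step as SAS does, which here happens not to change the path):

```python
import numpy as np

# Hald cement data (Example 2)
y = np.array([78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7,
              72.5, 93.1, 115.9, 83.8, 113.3, 109.4])
X = np.array([[7, 26, 6, 60], [1, 29, 15, 52], [11, 56, 8, 20],
              [11, 31, 8, 47], [7, 52, 6, 33], [11, 55, 9, 22],
              [3, 71, 17, 6], [1, 31, 22, 44], [2, 54, 18, 22],
              [21, 47, 4, 26], [1, 40, 23, 34], [11, 66, 9, 12],
              [10, 68, 8, 12]], dtype=float)
n = len(y)

def ss_res(cols):
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    r = y - Z @ beta
    return float(r @ r)

def backward_eliminate(cols, f_out=3.29):
    cols = list(cols)
    while cols:
        ssr = ss_res(cols)
        ms = ssr / (n - len(cols) - 1)
        # partial F for each regressor, as if it were the last to enter
        f = {j: (ss_res([c for c in cols if c != j]) - ssr) / ms for j in cols}
        j_min = min(f, key=f.get)
        if f[j_min] >= f_out:
            break                      # nothing else falls below F_OUT
        cols.remove(j_min)             # drop the weakest regressor
    return cols
```

Running backward_eliminate([0, 1, 2, 3]) removes x3 first, then x4, leaving x1 and x2, exactly as in the summary table above.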
Forward Selection: Step 1
Variable x4 Entered: R-Square = 0.6745 and C(p) = 138.7308
Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               1        1831.89616     1831.89616      22.80    0.0006
Error              11         883.86692       80.35154
Corrected Total    12        2715.76308

Variable     Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept             117.56793           5.26221         40108     499.16    <.0001
x4                     -0.73816           0.15460    1831.89616      22.80    0.0006
Forward Selection: Step 2 Variable x1 Entered: R-Square = 0.9725 and C(p) = 5.4959
Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2        2641.00096     1320.50048     176.63    <.0001
Error              10          74.76211        7.47621
Corrected Total    12        2715.76308

Variable     Parameter Estimate    Standard Error    Type II SS     F Value    Pr > F
Intercept             103.09738           2.12398          17615    2356.10    <.0001
x1                      1.43996           0.13842      809.10480     108.22    <.0001
x4                     -0.61395           0.04864     1190.92464     159.30    <.0001
Forward Selection: Step 3 Variable x2 Entered: R-Square = 0.9823 and C(p) = 3.0182
Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               3        2667.79035      889.26345     166.83    <.0001
Error               9          47.97273        5.33030
Corrected Total    12        2715.76308

Variable     Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept              71.64831          14.14239     136.81003      25.67    0.0007
x1                      1.45194           0.11700     820.90740     154.01    <.0001
x2                      0.41611           0.18561      26.78938       5.03    0.0517
x4                     -0.23654           0.17329       9.93175       1.86    0.2054
No other variable met the 0.1000 significance level for entry into the model.
Summary of Forward Selection

Step    Variable Entered    Number Vars In    Partial R-Square    Model R-Square    C(p)       F Value    Pr > F
1       x4                  1                 0.6745              0.6745            138.731    22.80      0.0006
2       x1                  2                 0.2979              0.9725            5.4959     108.22     <.0001
3       x2                  3                 0.0099              0.9823            3.0182     5.03       0.0517
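The forward path above (x4, then x1, then x2) can likewise be reproduced. A sketch under the same simplifying assumption of a fixed entry cutoff F_IN = 3.29 (helper names illustrative):

```python
import numpy as np

# Hald cement data (Example 2)
y = np.array([78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7,
              72.5, 93.1, 115.9, 83.8, 113.3, 109.4])
X = np.array([[7, 26, 6, 60], [1, 29, 15, 52], [11, 56, 8, 20],
              [11, 31, 8, 47], [7, 52, 6, 33], [11, 55, 9, 22],
              [3, 71, 17, 6], [1, 31, 22, 44], [2, 54, 18, 22],
              [21, 47, 4, 26], [1, 40, 23, 34], [11, 66, 9, 12],
              [10, 68, 8, 12]], dtype=float)
n = len(y)

def ss_res(cols):
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    r = y - Z @ beta
    return float(r @ r)

def forward_select(candidates, f_in=3.29):
    pool, model = list(candidates), []
    while pool:
        base = ss_res(model)               # SS_Res of the current model
        f = {}
        for j in pool:
            trial = model + [j]
            ssr = ss_res(trial)
            f[j] = (base - ssr) / (ssr / (n - len(trial) - 1))
        j_best = max(f, key=f.get)
        if f[j_best] < f_in:
            break                          # no candidate clears F_IN
        model.append(j_best)
        pool.remove(j_best)
    return model
```

Running forward_select([0, 1, 2, 3]) enters the regressors in the order x4, x1, x2 and then stops, matching the SAS summary.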
Stepwise Selection: Step 1 Variable x4 Entered: R-Square = 0.6745 and C(p) = 138.7308
Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               1        1831.89616     1831.89616      22.80    0.0006
Error              11         883.86692       80.35154
Corrected Total    12        2715.76308

Variable     Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept             117.56793           5.26221         40108     499.16    <.0001
x4                     -0.73816           0.15460    1831.89616      22.80    0.0006
Stepwise Selection: Step 2 Variable x1 Entered: R-Square = 0.9725 and C(p) = 5.4959
Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2        2641.00096     1320.50048     176.63    <.0001
Error              10          74.76211        7.47621
Corrected Total    12        2715.76308

Variable     Parameter Estimate    Standard Error    Type II SS     F Value    Pr > F
Intercept             103.09738           2.12398          17615    2356.10    <.0001
x1                      1.43996           0.13842      809.10480     108.22    <.0001
x4                     -0.61395           0.04864     1190.92464     159.30    <.0001
Stepwise Selection: Step 3 Variable x2 Entered: R-Square = 0.9823 and C(p) = 3.0182
Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               3        2667.79035      889.26345     166.83    <.0001
Error               9          47.97273        5.33030
Corrected Total    12        2715.76308

Variable     Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept              71.64831          14.14239     136.81003      25.67    0.0007
x1                      1.45194           0.11700     820.90740     154.01    <.0001
x2                      0.41611           0.18561      26.78938       5.03    0.0517
x4                     -0.23654           0.17329       9.93175       1.86    0.2054
Stepwise Selection: Step 4 Variable x4 Removed: R-Square = 0.9787 and C(p) = 2.6782
Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2        2657.85859     1328.92930     229.50    <.0001
Error              10          57.90448        5.79045
Corrected Total    12        2715.76308

Variable     Parameter Estimate    Standard Error    Type II SS     F Value    Pr > F
Intercept              52.57735           2.28617     3062.60416     528.91    <.0001
x1                      1.46831           0.12130      848.43186     146.52    <.0001
x2                      0.66225           0.04585     1207.78227     208.58    <.0001
All variables left in the model are significant at the 0.1000 level. No other variable met the 0.1000 significance level for entry into the model.
Summary of Stepwise Selection

Step    Variable Entered    Variable Removed    Number Vars In    Partial R-Square    Model R-Square    C(p)       F Value    Pr > F
1       x4                                      1                 0.6745              0.6745            138.731    22.80      0.0006
2       x1                                      2                 0.2979              0.9725            5.4959     108.22     <.0001
3       x2                                      3                 0.0099              0.9823            3.0182     5.03       0.0517
4                           x4                  2                 0.0037              0.9787            2.6782     1.86       0.2054
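Stepwise selection combines a forward entry step with a backward check of everything already in the model. A sketch under the same simplifying assumption of fixed cutoffs F_IN = F_OUT = 3.29 (helper names illustrative); on the Hald data it follows the path above: x4 in, x1 in, x2 in, then x4 out:

```python
import numpy as np

# Hald cement data (Example 2)
y = np.array([78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7,
              72.5, 93.1, 115.9, 83.8, 113.3, 109.4])
X = np.array([[7, 26, 6, 60], [1, 29, 15, 52], [11, 56, 8, 20],
              [11, 31, 8, 47], [7, 52, 6, 33], [11, 55, 9, 22],
              [3, 71, 17, 6], [1, 31, 22, 44], [2, 54, 18, 22],
              [21, 47, 4, 26], [1, 40, 23, 34], [11, 66, 9, 12],
              [10, 68, 8, 12]], dtype=float)
n = len(y)

def ss_res(cols):
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    r = y - Z @ beta
    return float(r @ r)

def stepwise(candidates, f_in=3.29, f_out=3.29):
    pool, model = list(candidates), []
    while True:
        # forward step: candidate with the largest partial F
        f = {}
        for j in pool:
            trial = model + [j]
            ssr = ss_res(trial)
            f[j] = (ss_res(model) - ssr) / (ssr / (n - len(trial) - 1))
        if not f or max(f.values()) < f_in:
            break
        j = max(f, key=f.get)
        model.append(j)
        pool.remove(j)
        # backward check: re-examine every variable already in the model
        while True:
            ssr = ss_res(model)
            ms = ssr / (n - len(model) - 1)
            g = {i: (ss_res([c for c in model if c != i]) - ssr) / ms
                 for i in model}
            i_min = min(g, key=g.get)
            if g[i_min] >= f_out:
                break
            model.remove(i_min)        # a variable entered earlier drops out
            pool.append(i_min)
    return model
```

Running stepwise([0, 1, 2, 3]) ends with x1 and x2, the same final model the SAS stepwise summary reports.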
Strategy to Select the Best Regression Equation:
[Flowchart: fit the full model → perform residual analysis → do we need a transformation? If yes, transform the data and refit; if no, perform all possible regressions → select models for further analysis → make recommendations.]
To select the best regression equation, carry out the following steps:
1) Fit the largest model possible to the data.
2) Perform a thorough analysis of this model.
3) Determine if a transformation of the response or of some of the regressors is necessary.
4) Determine if all possible regressions is feasible.
• If all possible regressions is feasible, perform all possible regressions, using such criteria as Mallows's C_p, adjusted R^2, and the PRESS statistic to rank the best subset models.
• If all possible regressions is not feasible, use backward, forward, and stepwise selection techniques to generate the largest model for which all possible regressions is feasible. Then perform all possible regressions as outlined above.
5) Compare and contrast the best models recommended by each criterion.
6) Perform thorough analyses of the "best models" (usually three to five models).
7) Explore the need for further transformations.
8) Discuss with the subject-matter experts the relative advantages and disadvantages of the final set of models.