
Chapter 1

Logistic Regression Tree Analysis

1.1 Introduction

Logistic regression is a technique for modeling the probability of an event in terms of one or more predictor variables. Consider, for example, a data set on tree damage during a severe thunderstorm over 477,000 acres of the Boundary Waters Canoe Area Wilderness in northeastern Minnesota on July 4, 1999 (R package alr4 [1]). Observations from 3666 trees were collected, including, for each tree, whether it was blown down (Y = 1) or not (Y = 0), its trunk diameter D in centimeters, its species S, and the local intensity L of the storm, as measured by the fraction of damaged trees in its vicinity.

Let p = P(Y = 1) denote the probability that a tree is blown down. Linear logistic regression approximates the logit function logit(p) = log(p/(1 − p)) by a function of the predictor variables that is linear in the unknown parameters. A simple linear logistic regression model is logit(p) = log(p/(1 − p)) = β0 + β1X, where X is the only predictor variable. That is, p = {1 + exp(−β0 − β1X)}⁻¹. In general, given k predictor variables X1, . . . , Xk, a multiple linear logistic regression model has the form logit(p) = β0 + ∑_{j=1}^{k} βjXj. The parameters β0, β1, . . . , βk are typically estimated by maximizing the likelihood. Let n denote the sample size and let (xi1, . . . , xik, yi) denote the values of (X1, . . . , Xk, Y) for the ith observation (i = 1, . . . , n). Treating each yi as the outcome of an independent Bernoulli random variable with success probability pi, the likelihood function is

∏_{i=1}^{n} pi^{yi} (1 − pi)^{1−yi} = exp{∑_i yi (β0 + ∑_j βj xij)} / ∏_i {1 + exp(β0 + ∑_j βj xij)}.

The maximum likelihood estimates are the values of (β0, β1, . . . , βk) that maximize this function.
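To make the estimation step concrete, the following sketch maximizes this Bernoulli likelihood numerically (equivalently, it minimizes the negative log-likelihood). It is an illustration only, not part of the chapter's analysis; the simulated data and function names are invented for the example.

import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(beta, X, y):
    # Negative Bernoulli log-likelihood for logit(p) = X @ beta.
    eta = X @ beta                                      # beta0 + sum_j betaj * xij
    return -(y @ eta - np.logaddexp(0.0, eta).sum())    # -[sum yi*eta_i - sum log(1 + exp(eta_i))]

def fit_logistic(X, y):
    # Maximum likelihood estimates for a multiple linear logistic regression model.
    X = np.column_stack([np.ones(len(y)), X])           # prepend the intercept column
    start = np.zeros(X.shape[1])
    return minimize(neg_log_likelihood, start, args=(X, y), method="BFGS").x

# Simulated single-predictor example (invented numbers, similar in spirit to model (1.1)).
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=1000)
y = rng.binomial(1, 1 / (1 + np.exp(-(-2.0 + 4.4 * x))))
print(fit_logistic(x.reshape(-1, 1), y))                # roughly [-2.0, 4.4]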


Figure 1.1: Estimated probability of blowdown, as a function of local storm intensity L, computed from a simple linear logistic regression model using L as predictor

1.2 Fitting logistic regression models

Fitting a simple linear logistic regression model to the data using L yields

logit(p) = −1.999 + 4.407L (1.1)

with estimated p-function shown in Fig. 1.1. The stronger the local storm intensity, the higher the chance that a tree is blown down. The boxplots in Figure 1.2 show that the distributions of D are skewed. Cook and Weisberg [2] transformed D to log(D) and obtained the model

logit(p) = −4.792 + 1.749 log(D) (1.2)

which suggests that larger trees are less likely to survive a storm than smaller ones. If both log(D) and L are used, the model becomes

logit(p) = −6.677 + 1.763 log(D) + 4.42L. (1.3)

The relative constancy of the coefficients of L and log(D) in equations (1.1)–(1.3) is due to the weak correlation of 0.168 between the two variables. If the interaction L log(D) is included, the model changes to

logit(p) = −4.341 + 0.891 log(D)− 1.482L+ 2.235L log(D). (1.4)
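Models (1.1)–(1.4) are ordinary GLM fits and can be reproduced with standard software. The sketch below is a hedged Python version using statsmodels; it assumes the blowdown data have been exported to a file blowdown.csv with columns y, D, L, and S, which are hypothetical names, not part of the chapter.

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical export of the alr4 blowdown data: y (1 = blown down), D, L, S.
df = pd.read_csv("blowdown.csv")

m1 = smf.glm("y ~ L", data=df, family=sm.families.Binomial()).fit()              # (1.1)
m2 = smf.glm("y ~ np.log(D)", data=df, family=sm.families.Binomial()).fit()      # (1.2)
m3 = smf.glm("y ~ np.log(D) + L", data=df, family=sm.families.Binomial()).fit()  # (1.3)
m4 = smf.glm("y ~ np.log(D) * L", data=df, family=sm.families.Binomial()).fit()  # (1.4)

for label, m in zip(["(1.1)", "(1.2)", "(1.3)", "(1.4)"], [m1, m2, m3, m4]):
    print(label, m.params.round(3).to_dict(), "residual deviance:", round(m.deviance, 1))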

So far, species S has been excluded from the models.


Figure 1.2: Boxplots of trunk diameter D by species S. The dotted line marks the median value of 14 for D.

As in least squares regression, a categorical variable having m distinct values may be represented by (m − 1) indicator variables, U1, . . . , U(m−1), each taking value 0 or 1. The variables for species are shown in Table 1.1, which uses the “set-to-zero constraint” that sets all the indicator variables to 0 for the first species (aspen). A model that assumes the same slope coefficients for all species but gives each a different intercept term is

logit(p) = −5.997 + 1.581 log(D) + 4.629L

− 2.243U1 + 0.0002U2 + 0.167U3 − 2.077U4

+ 1.040U5 − 1.724U6 − 1.796U7 − 0.003U8. (1.5)
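A categorical predictor with the set-to-zero coding of Table 1.1 is what most GLM software produces automatically for a factor. Continuing the hypothetical data frame from the previous sketch, a model of the form (1.5) could be fitted as follows; C(S) uses the first species (aspen) as the reference level, matching Table 1.1.

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("blowdown.csv")     # hypothetical export, as before

# C(S) expands S into eight 0/1 indicator columns with species "A" (aspen)
# as the reference level, i.e. the set-to-zero constraint of Table 1.1.
m5 = smf.glm("y ~ np.log(D) + L + C(S)", data=df,
             family=sm.families.Binomial()).fit()
print(m5.params.round(3))

# The same indicators can be built explicitly if needed:
U = pd.get_dummies(df["S"], prefix="U", drop_first=True)   # U_BA, ..., U_RP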

How well do models (1.1)–(1.5) fit the data? One popular way to assess fit is by means of significance tests based on the residual deviance and its degrees of freedom (df); see, e.g., [3, p. 96] for the definitions. The residual deviance is analogous to the residual sum of squares in least squares regression. For model (1.5), the residual deviance is 3259 with 3655 df. We can evaluate the fit of this model by comparing its residual deviance against that of a larger one, such as the 27-parameter model

logit(p) = β0 + β1 log(D) + β2L + ∑_{j=1}^{8} γjUj + ∑_{j=1}^{8} β1jUj log(D) + ∑_{j=1}^{8} β2jUjL    (1.6)

that allows the slope coefficients for log(D) and L to vary with S. It has a residual deviance of 3163 with 3639 df.


Table 1.1: Indicator variable coding for the species variable S

Species             U1  U2  U3  U4  U5  U6  U7  U8
A (aspen)            0   0   0   0   0   0   0   0
BA (black ash)       1   0   0   0   0   0   0   0
BF (balsam fir)      0   1   0   0   0   0   0   0
BS (black spruce)    0   0   1   0   0   0   0   0
C (cedar)            0   0   0   1   0   0   0   0
JP (jack pine)       0   0   0   0   1   0   0   0
PB (paper birch)     0   0   0   0   0   1   0   0
RM (red maple)       0   0   0   0   0   0   1   0
RP (red pine)        0   0   0   0   0   0   0   1

If model (1.5) fits the data well, the difference between its residual deviance and that of model (1.6) is approximately distributed as a chi-squared random variable with df equal to the difference in df of the two models. The difference in deviance here is 3259 − 3163 = 96, which is too large to be explained by a chi-squared distribution with 3655 − 3639 = 16 df.
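This comparison is a standard analysis-of-deviance (likelihood ratio) test; the quoted numbers can be checked directly with a short sketch:

from scipy.stats import chi2

deviance_diff = 3259 - 3163                 # model (1.5) versus model (1.6)
df_diff = 3655 - 3639
p_value = chi2.sf(deviance_diff, df_diff)   # upper-tail probability
print(deviance_diff, df_diff, p_value)      # 96, 16, p is about 2e-13, so model (1.5) is rejected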

Rejection of model (1.5) does not necessarily imply that model (1.6) is satisfactory. To find out, it may be compared with a larger one, such as the 28-parameter model

logit(p) = β0 + β1 log(D) + β2L + β3L log(D) + ∑_{j=1}^{8} γjUj + ∑_{j=1}^{8} β1jUj log(D) + ∑_{j=1}^{8} β2jUjL    (1.7)

that includes an interaction between L and log(D). This has a residual deviance of 3121 with 3638 df. Therefore model (1.6) is rejected, because its residual deviance differs from that of (1.7) by 42 but their dfs differ only by 1. With this procedure, each of models (1.1) through (1.6) is rejected when compared against the next larger model in the sequence.

Another way to select a model employs a function such as AIC, which is the residual deviance plus two times the number of estimated parameters. AIC tries to balance deviance against model complexity (see, e.g., [4, p. 234]), although it tends to over-fit the data; that is, it often chooses a large model. In this data set, if we apply AIC to the set of all models up to third order, it chooses the largest, namely, the 36-parameter model



logit(p) = β0 + β1 log(D) + β2L + ∑_{j=1}^{8} γjUj + β3L log(D) + ∑_{j=1}^{8} β1jUj log(D) + ∑_{j=1}^{8} β2jUjL + ∑_{j=1}^{8} δjUjL log(D).    (1.8)

Models (1.7) and (1.8) are hard to display graphically. Plotting the estimated p-function as in Fig. 1.1 is impossible if a model has more than one predictor variable. This problem is exacerbated by the fact that model complexity typically increases with increase in sample size and number of predictors. Interpretation of the estimated coefficients is frequently futile then, because the coefficients typically do not remain the same from one model to another, due to multicollinearity among the terms. For example, the coefficient for L is 4.424, −1.482, and 4.629 in models (1.3), (1.4), and (1.5), respectively.

To deal with this problem, [2] used a partial one-dimensional model (POD) that employs a linear function of log(D) and L as predictor variable. They found that if the observations for balsam fir (BF) and black spruce (BS) are excluded, the model logit(p) = β0 + Z + ∑_j γjUj, with Z = 0.78 log(D) + 4.1L, fits the other species quite well. Now the estimated p-function can be plotted as shown in Fig. 1.3. The graph is not as simple as that in Fig. 1.1, however, because Z is a linear combination of two variables. To include species BF and BS, [2] settled on the larger model

logit(p) = β0 + Z + ∑_{j=1}^{9} γjUj + (θ1 I_BF + θ2 I_BS) log(D) + (φ1 I_BF + φ2 I_BS) L    (1.9)

which contains separate coefficients (θj, φj) for BF and BS. Here I(·) denotes the indicator function, i.e., I_A = 1 if the species is A, and I_A = 0 otherwise. The model cannot be displayed graphically for species BF and BS because it uses Z, log(D), and L as predictors.
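Because Z enters the POD model with coefficient 1, one way to reproduce a fit of this form for the seven remaining species is to pass Z as an offset to a GLM. The sketch below uses the same hypothetical data frame as the earlier examples; it illustrates the model form, not the authors' code.

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("blowdown.csv")                      # hypothetical export, as before
sub = df[~df["S"].isin(["BF", "BS"])].copy()          # drop balsam fir and black spruce
sub["Z"] = 0.78 * np.log(sub["D"]) + 4.1 * sub["L"]   # the POD linear combination

# logit(p) = beta0 + Z + sum_j gamma_j U_j, with Z entering as an offset (slope fixed at 1).
pod = smf.glm("y ~ C(S)", data=sub, family=sm.families.Binomial(),
              offset=sub["Z"]).fit()
print(pod.params.round(3))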

1.3 Logistic regression trees

A logistic regression tree model offers an alternative solution that simultaneously retains the graphical advantage of simple models and the prediction accuracy of more complex ones. It works by recursively partitioning the data set and fitting a simple or multiple logistic regression model to each partition. As a result, the partitions can be displayed as a decision tree [5].


Figure 1.3: Estimated probability of blowdown for seven species, excluding balsam fir (BF) and black spruce (BS), according to model (1.9), plotted against Z = 0.78 log(D) + 4.1L; the curves are labeled by species (JP, RP, A, PB, RM, C, BA).

For example, Figure 1.4 shows a simple linear logistic regression tree constructed by the GUIDE algorithm [6, 7], such that a logistic regression model using the best single linear predictor is fitted in each terminal node. Beside each intermediate node is a condition stipulating that an observation goes to the left subnode if and only if the condition is satisfied. Below each terminal node are the sample size (in italics), the proportion of blown down trees, and the name of the best linear predictor variable. An asterisk beside the name of a linear predictor indicates that its t-statistic is greater than 2 in absolute value. The split at the root node (labeled 1) sends observations to node 2 if and only if S is A, BS, JP, or RP. Terminal node 5 consists of all the observations from the JP and RP species. The proportion (0.82) of trees blown down there is the highest of all terminal nodes. Species A and BS trees with diameters greater than 9.75 cm (node 9) have the second highest blowdown proportion of 0.67. Variable L is the best linear predictor in all terminal nodes except nodes 13 and 15, where D is the best linear predictor. The main advantage of using one linear predictor in each node is that the fitted p-functions can be displayed graphically, as shown in Figure 1.5.

The logistic regression tree in Fig. 1.4 may be considered a different kind of POD model from that envisioned in [2]. Whereas the word “partial” in POD refers to model (1.9) being one-dimensional if restricted to certain parts of the data (species in this example), in a tree model it refers to partitions of the predictor space. In addition, whereas “one-dimensional” refers to Z being a linear combination of log(D) and L in (1.9), in a tree model the predictor in each node, being one of the variables, is trivially one-dimensional.


Figure 1.4: GUIDE simple linear logistic regression tree for P(blowdown). At each split, an observation goes to the left branch if and only if the condition is satisfied. Sample size (in italics), proportion of blowdowns, and name of regressor variable are printed beneath each terminal node. Green and yellow terminal nodes have L and D, respectively, as best linear predictor. An asterisk beside the name of a predictor indicates that its absolute t-statistic is greater than 2. The structure is: node 1 splits on S = A, BS, JP, RP; node 2 splits on S = A, BS, with node 4 split on D ≤ 9.75 into terminal nodes 8 (L*, 0.24, 233) and 9 (L*, 0.67, 1173), and terminal node 5 (L*, 0.82, 551); node 3 splits on S = BF, with node 6 split on L ≤ 0.33 into terminal nodes 12 (L*, 0.19, 297) and 13 (D*, 0.49, 362), and node 7 split on L ≤ 0.50 into terminal nodes 14 (L*, 0.09, 776) and 15 (D*, 0.35, 274).

Figure 1.5: Estimated logistic regression functions in the terminal nodes of the tree in Fig. 1.4. Left panel: nodes 13 and 15, plotted against trunk diameter D; right panel: nodes 5, 8, 9, 12, and 14, plotted against storm intensity L.


GUIDE is a classification and regression tree algorithm with origins in the FACT [8], SUPPORT [9], QUEST [10], CRUISE [11, 12], and LOTUS [13] methods; see [14]. All of them split the data set recursively, choosing a single X variable to split on at each step. If X is an ordinal variable, the split typically has the form s = {X ≤ c}, where c is a constant. If X is a categorical variable, the split has the form s = {X ∈ ω}, where ω is a subset of the values taken by X. For least squares regression trees, many algorithms, such as AID [15], CART [16] and M5 [17], choose s to minimize the total sum of squared residuals of the regression models fitted to the two data subsets formed by s. Though seemingly innocuous, this approach is fundamentally flawed because it is biased toward choosing X variables that allow more splits. To see this, suppose that X is an ordinal variable having m distinct values. Then there are (m − 1) ways to split the data along the X axis, with each split s = {X ≤ c} being such that c is the midpoint between two consecutively ordered distinct values of X. This creates a selection bias toward X variables with large m. In the current example, variable L has 709 unique values but D has only 87. Hence L has eight times as many opportunities as D to split the data. The bias is worse if there are high-level categorical variables, because a categorical variable having m categorical values permits (2^(m−1) − 1) splits of the form s = {X ∈ ω}. For example, variable S permits (2^(9−1) − 1) = 255 splits, which is almost three times as many splits as D allows. Publication [18] seems to be the earliest warning that the bias can yield misleading conclusions about the effects of the variables.
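The split counts quoted above follow from simple counting and are easy to verify:

def n_ordinal_splits(m):
    # An ordinal variable with m distinct values allows m - 1 threshold splits.
    return m - 1

def n_categorical_splits(m):
    # A categorical variable with m values allows 2^(m-1) - 1 subset splits.
    return 2 ** (m - 1) - 1

print(n_ordinal_splits(709), n_ordinal_splits(87))   # L: 708 splits, D: 86 splits
print(n_categorical_splits(9))                       # S: 255 splits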

GUIDE gets rid of the bias by using a two-step approach to split selection. First, it uses significance tests to select X. Then it searches for c or ω for the selected X. For linear regression trees, this is achieved by fitting a linear model to the data in the node and using a contingency table chi-squared test of the association between grouped values of each predictor variable and the signs of the residuals. If X is ordinal, the groups are intervals between certain order statistics; if it is categorical, the groups are the categorical levels themselves. The X variable having the smallest chi-squared p-value is selected. Repeating this procedure recursively produces a large binary tree that is pruned to minimize a cross-validation estimate of prediction mean squared error [16].
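The sketch below illustrates the flavor of this selection step for a least-squares regression tree: cross-tabulate the signs of the node-model residuals against grouped values of each predictor and pick the variable with the smallest chi-squared p-value. It is only a schematic of the idea, not the GUIDE implementation, and all function names are invented.

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def variable_pvalue(x, resid_sign, bins=4):
    # Chi-squared p-value for association between residual signs and grouped x.
    if x.dtype == object or str(x.dtype) == "category":
        groups = x.astype(str)                            # categorical: levels are the groups
    else:
        groups = pd.qcut(x, q=bins, duplicates="drop")    # ordinal: quartile intervals
    table = pd.crosstab(resid_sign, groups)
    stat, pval, dof, _ = chi2_contingency(table)
    return pval

def select_split_variable(X_df, y, fitted):
    # `fitted` holds the node model's fitted values; pick the predictor whose
    # grouped values are most strongly associated with the residual signs.
    resid_sign = np.where(y - fitted > 0, "+", "-")
    pvals = {name: variable_pvalue(X_df[name], resid_sign) for name in X_df.columns}
    return min(pvals, key=pvals.get), pvals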

Let p̂(x) denote the estimated value of P(Y = 1 | X = x). The split variable selection method needs to be modified for logistic regression, because the residual y − p̂(x) is positive if y = 1 and negative if y = 0, irrespective of the value of p̂(x). Therefore the residual signs provide no information on the adequacy of p̂(x). A first attempt at a solution was proposed in [19], where the residuals y − p̂(x) are replaced with “pseudo-residuals” p̄(x) − p̂(x), with p̄(x) being a weighted average of the y values in a neighborhood of x. Its weaknesses are sensitivity to the choice of weights and neighborhoods, and difficulty in specifying the neighborhoods if the dimension of the predictor space is large or if there are missing values. LOTUS uses a trend-adjusted chi-squared test [20, 21] that effectively replaces p̄(x) with a linear estimate.

For logistic regression, GUIDE uses the average from an ensemble of least-squares GUIDE regression trees (called a “GUIDE forest”) to form the pseudo-residuals for variable selection. The main steps are as follows.


1. Fit a least-squares GUIDE forest [22] to the data to obtain a preliminary estimate p̃(x) of p(x) for each observed x. Random forest [23] is not an adequate substitute for GUIDE forest if the data contain missing values.

2. Beginning with the root node, carry out the following steps on the data in each node, stopping only if the number of observations is less than a pre-specified threshold.

(a) Temporarily impute the missing values in the ordinal predictor variables by replacing them with their node sample means.

(b) Fit a simple or multiple linear logistic regression model to the imputed data in the node. If a simple linear logistic model is desired, fit a model to each linear predictor variable in turn to find the one with the smallest residual deviance. Let p̂(x) denote the estimated value of p(x) from the fitted model.

(c) Revert imputed values in step (2a) to their original missing state.

(d) For each ordinal X variable, let q1 ≤ q2 ≤ q3 denote its sample quartiles at the node and define the categorical variable V = ∑_{j=1}^{3} I(X > qj). Define V = X if X is a categorical variable. Add an extra “missing” category to V if X has missing values.

(e) Form a contingency table for each X variable using the signs of p̃(x) − p̂(x) as rows and the values of V as columns. Find the chi-squared statistic χ²_ν for the test of independence between rows and columns.

(f) Let G_ν(x) denote the distribution function of a chi-squared variable with ν df and let ε = 2 × 10⁻⁶. Convert each χ²_ν to its equivalent one-df χ²_1 value as follows.

i. If ε < G_ν(χ²_ν) < 1 − ε, define χ²_1 = G_1⁻¹(G_ν(χ²_ν)).

ii. Otherwise, use two applications of the Wilson-Hilferty approximation [24] to avoid dealing with very small or large p-values, as follows. Define

W1 = {√(2χ²_ν) − √(2ν − 1) + 1}² / 2,
W2 = max(0, [7/9 + √ν {(χ²_ν/ν)^(1/3) − 1 + 2/(9ν)}]³).

Then the approximate one-df chi-squared value is

χ²_1 = W2 if χ²_ν < ν + 10√(2ν),
χ²_1 = (W1 + W2)/2 if χ²_ν ≥ ν + 10√(2ν) and W2 < χ²_ν,
χ²_1 = W1 otherwise.

(A code sketch of this conversion appears after the list.)

This procedure is an improvement over an earlier one-step approximation in [7]. Tables 1.2 and 1.3 show the contingency tables and corresponding chi-squared statistics for Species, Intensity, and Diameter at the root node of the tree in Figure 1.4.


Table 1.2: Chi-squared test for Species, with Wilson-Hilferty χ²_1 value

            A    BA   BF   BS    C   JP   PB   RM   RP
p̃ > p̂    413     0  239  673    0  501    0    2   47
p̃ ≤ p̂     23    75  420  297  355    1  497  121    2

χ²_8 = 2125, χ²_1 = 1942

Table 1.3: Chi-squared tests for Intensity and Diameter, with quartile intervals Q1–Q4 and Wilson-Hilferty χ²_1 values

            Intensity              Diameter
           Q1   Q2   Q3   Q4     Q1   Q2   Q3   Q4
p̃ > p̂   327  418  527  603     14  543  637  681
p̃ ≤ p̂   595  493  390  313    933  424  281  153

Intensity: χ²_3 = 195, χ²_1 = 171.  Diameter: χ²_3 = 1378, χ²_1 = 1314.

(g) Let X∗ be the variable with the largest value of χ²_1.

i. If X∗ is ordinal, let s be a split of the form {X∗ = NA}, {X∗ ≤ c} ∪ {X∗ = NA}, or {X∗ ≤ c} ∩ {X∗ ≠ NA}, where NA is the missing value code.

ii. If X∗ is categorical, let s be a split of the form {X∗ ∈ ω}, where ω is a proper subset of the values (including NA) of X∗.

(h) For each split s, apply steps (2a) and (2b) to the data in the left and right subnodes induced by s and let dL(s) and dR(s) be their respective residual deviances.

(i) Select the split s that minimizes dL(s) + dR(s).

3. After splitting stops, prune the tree with the CART cost-complexity method [16] to obtain a nested sequence of subtrees.

4. Use the CART cross-validation method to estimate the prediction deviance of each subtree.

5. Select the smallest subtree whose estimated prediction deviance is within a half standard error of the minimum.
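The conversion in step 2(f) can be written compactly. The sketch below uses scipy for the exact inversion of step (f)i and the double Wilson-Hilferty approximation of step (f)ii; under these assumptions, the printed values should reproduce the one-df statistics in Tables 1.2 and 1.3 up to rounding.

import numpy as np
from scipy.stats import chi2

EPS = 2e-6

def one_df_equivalent(x, nu):
    # Convert a chi-squared statistic x with nu df to an equivalent 1-df value.
    g = chi2.cdf(x, nu)
    if EPS < g < 1 - EPS:                 # step (f)i: exact inversion
        return chi2.ppf(g, 1)
    # step (f)ii: double Wilson-Hilferty approximation for extreme p-values
    w1 = (np.sqrt(2 * x) - np.sqrt(2 * nu - 1) + 1) ** 2 / 2
    w2 = max(0.0, (7 / 9 + np.sqrt(nu) * ((x / nu) ** (1 / 3) - 1 + 2 / (9 * nu))) ** 3)
    if x < nu + 10 * np.sqrt(2 * nu):
        return w2
    return (w1 + w2) / 2 if w2 < x else w1

# Root-node statistics of Tables 1.2 and 1.3:
print(round(one_df_equivalent(2125, 8)))   # Species:   about 1942
print(round(one_df_equivalent(195, 3)))    # Intensity: about 171
print(round(one_df_equivalent(1378, 3)))   # Diameter:  about 1314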

Figure 1.6 shows the LOTUS tree for the current data. MOB [25] is another algorithm that can construct logistic regression trees, but for simple linear logistic tree models, it requires the linear predictor to be pre-specified and to be the same in all terminal nodes. Figure 1.7 shows the MOB tree with L as the common linear predictor. Figure 1.8 compares the values of p̂(x) from a GUIDE forest of 500 trees, model (1.9), and the simple linear logistic GUIDE, LOTUS, and MOB models. Although there are clear differences in the values of p̂(x) between GUIDE, LOTUS, and MOB, they seem to compare similarly against (1.9) and GUIDE forest.


Figure 1.6: LOTUS simple linear logistic regression model for P(blowdown). At each split, an observation goes to the left branch if and only if the condition is satisfied. Sample size (in italics), proportion of blowdowns, and name of regressor variable (if any) are printed below the nodes. Green and yellow terminal nodes have L and D, respectively, as best linear predictor.

Figure 1.9 shows the corresponding results where LOTUS fits the multiple linear logistic regression model logit(p) = β0 + β1D + β2L, and GUIDE and MOB fit logit(p) = β0 + β1D + β2L + ∑_{j=1}^{8} γjUj in each terminal node. (LOTUS does not convert categorical variables to indicator variables to serve as regressors.) The correlations among the p̂(x) values are much higher.

1.4 Missing values and cyclic variables

The U.S. National Highway Traffic Safety Administration has been evaluating vehicle safety by performing crash tests with dummy occupants since 1972 (ftp://www.nhtsa.dot.gov/ges). We use 3310 records of test crashes where the test dummy is in the driver’s seat to show how GUIDE deals with missing values and cyclic variables. Each record gives the severity of head injury (HIC) sustained by the dummy and the values of 102 variables describing the vehicle, the test environment, and the dummy itself. The response variable is Y = 1 if HIC > 1000 (the threshold for severe head injury), and Y = 0 otherwise. Forty-six predictor variables are ordinal, six are cyclic, and fifty are categorical.

Three features in the data make model building particularly challenging. The first is missing data values. Methods not designed to deal with missing values require them to be imputed first, but imputation can be extraordinarily difficult if there are many missing values (see [22]). All the ordinal and cyclic variables and some categorical variables here have missing values. Missing values in categorical variables do not require special treatment, however, because they are given their own “NA” category in GUIDE. Table 1.4 gives the names and numbers of missing values of some ordinal and cyclic variables (see [26] for definitions of other variables).


Figure 1.7: MOB simple linear logistic regression model with L pre-specified as the common linear predictor in all nodes. At each split, an observation goes to the left branch if and only if the condition is satisfied. Sample sizes (in italics) are printed below the nodes.

Figure 1.8: Comparison of fitted values p̂ of the Cook-Weisberg model (1.9) and GUIDE forest versus the simple linear logistic regression tree models (GUIDE tree, LOTUS, and MOB).


Figure 1.9: Comparison of fitted values p̂ of the Cook-Weisberg model (1.9) and GUIDE forest versus the multiple linear logistic regression tree models (GUIDE tree, LOTUS, and MOB).

Table 1.4: Names and numbers of missing values of some predictor variables in the crash-test data

Variable  Description                                                        Missing
BARDIA    diameter of pole barrier                                           2807
BARRIG    rigid or deformable barrier                                        1
BARSHP    barrier shape (21 types)                                           0
CURBWT    weight of vehicle without passengers and cargo                     2854
CARANG    angle between surface of rollover test cart and ground             991
IMPANG    angle between axis of vehicle 2 and axis of vehicle 1 or barrier
          (0 degrees means perpendicular to barrier)                         4
IMPPNT    point of impact on side of vehicle 2 by vehicle 1                  1693
CLSSPD    closing speed: relative velocity of approach of two centers of
          gravity before contact                                             2
VEHSPD    resultant speed of vehicle before impact                           1
RSTFRT    front occupant restraint system; “0” indicates none
YEAR      vehicle model year                                                 4


Figure 1.10: Boxplots of closing speed (CLSSPD) by barrier rigidity (BARRIG: deformable, rigid, unknown) for the whole sample, with box width proportional to the square root of the sample size

GUIDE deals with missing values in two ways. For split selection, it sends all missing values in the selected ordinal or cyclic variable either to the left or to the right subnode, depending on which minimizes the sum of the residual deviances in the two subnodes. Hence no imputation is carried out here. To fit a logistic regression model to a node, however, GUIDE imputes missing values in the selected predictor variable with its node mean.
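A rough sketch of how these two rules can be combined when evaluating one candidate split: the missing values are tried on each side, and the side giving the smaller total residual deviance wins. The deviance helper fits a simple linear logistic model with node-mean imputation, as described above; all names, including the column names, are illustrative and not GUIDE's.

import numpy as np
import pandas as pd
import statsmodels.api as sm

def node_deviance(node_df, response, predictor):
    # Residual deviance of a simple linear logistic model fitted to one subnode,
    # imputing missing predictor values with the node mean.
    x = node_df[[predictor]].fillna(node_df[predictor].mean())
    X = sm.add_constant(x)
    return sm.GLM(node_df[response], X, family=sm.families.Binomial()).fit().deviance

def evaluate_split_with_missing(node_df, split_var, threshold, response="y", fit_var="YEAR"):
    # Compare sending missing values left versus right for the split {split_var <= threshold}.
    below = node_df[split_var] <= threshold          # NaN compares as False
    missing = node_df[split_var].isna()
    totals = {}
    for side in ("left", "right"):
        go_left = (below | missing) if side == "left" else below
        totals[side] = (node_deviance(node_df[go_left], response, fit_var)
                        + node_deviance(node_df[~go_left], response, fit_var))
    return min(totals, key=totals.get), totals

# e.g. evaluate_split_with_missing(crash_df, "CLSSPD", 55.75) on a hypothetical crash_df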

A second challenging feature is the existence of cyclic variables that are angles with periods of 360 degrees. These variables are traditionally transformed to sines and cosines, but splits on each of the latter are not as complete as splits on the angles themselves. The problem is more difficult if the variable has missing values. If we use sine and cosine transformations, do we impute the angles and then compute the sines and cosines of the imputed angle values, or do we impute the sines and cosines separately? GUIDE avoids imputation entirely by using cyclic variables for node splitting only. If a cyclic variable is selected to split a node, the split takes the form of a sector “X ∈ (θ1, θ2)”, where θ1 and θ2 are angles, and missing X values are sent to the left or right subnode in the same fashion as for non-cyclic variables.
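A sector split on a cyclic variable only needs a membership test that respects the 360-degree period; a minimal sketch (the angle conventions here are assumptions, not the chapter's):

def in_sector(angle, theta1, theta2):
    # True if angle (in degrees) falls in the sector (theta1, theta2), period 360.
    # The sector runs from theta1 to theta2 and may wrap past 360 (e.g. 350 to 20).
    a = angle % 360
    t1, t2 = theta1 % 360, theta2 % 360
    if t1 <= t2:
        return t1 < a < t2
    return a > t1 or a < t2                  # wrapped sector

# Root-node split of Fig. 1.11: IMPANG in (269, 287); missing angles would be routed
# to whichever subnode gives the smaller total deviance, as for ordinal variables.
print(in_sector(275, 269, 287), in_sector(45, 269, 287))   # True False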

The third challenging feature is that, apparently by design, high-speed crash tests are more often carried out on deformable barriers and low-speed tests more often on rigid barriers. This is evident from the boxplots of CLSSPD by BARRIG in Fig. 1.10, where half of the tests with deformable barriers are above closing speeds of 60 mph, but less than one quarter of those with rigid barriers are above 60 mph. Presumably, crashes into rigid barriers are not performed at high speeds because the outcomes are predictable, but this confounds the effects of CLSSPD and BARRIG.



Figure 1.11: GUIDE simple linear logistic regression tree for the crash-test data. At each split, an observation goes to the left branch if and only if the condition is satisfied. The symbol ‘≤∗’ stands for ‘≤ or missing’. Sample size (in italics) is printed beside each node; the proportion of cases with Y = 1 and the sign and name of the best linear predictor are printed beneath the nodes. An asterisk beside the name of a regressor indicates that its absolute t-statistic is greater than 2. Terminal nodes: 2 (433, 0.08, −YEAR*), 24 (107, 0.28, +CLSSPD*), 25 (259, 0.03, −YEAR*), 26 (166, 0.44, +CLSSPD), 27 (311, 0.34, −CLSSPD), and 7 (2000, 0.01, −YEAR*).

We say that X is an “s” variable if it can be used to split the nodes and an “f” variable if it can be used to fit logistic regression node models. To limit the amount of imputation in this example, we restrict ordinal variables with more than 20 percent missing values to serve as s variables only. There are 28 such ordinal variables, with numbers of missing values ranging from 532 to 3271 (total sample size is 3276). The remaining 18 ordinal variables are allowed to serve as both s and f variables. Cyclic and categorical variables are used for node splitting only.

Figure 1.11 shows the GUIDE logistic regression tree where a simple linear logistic regression is fitted in each node. Only 5 variables appear in the model. The root node is split on impact angle IMPANG, which is the angle between the longitudinal axis of the vehicle carrying the test dummy and the longitudinal axis of another vehicle or a barrier, measured clockwise. Observations with IMPANG between 269 and 287 degrees (driver-side impact) go to node 2, where vehicle model YEAR is the best linear predictor. The negative sign attached to YEAR in the figure indicates that it has a negative slope. Eight percent of the 433 tests in node 2 produced severe head injury. Observations with missing IMPANG or values outside the sector (269, 287) go to node 3, where they are split on RSTFRT (type of restraint system on front occupants).


Observations with RSTFRT = 0 (no front restraints) go to node 6; otherwise, they go to node 7, where one percent have severe head injury and YEAR is again the best linear predictor. Node 6 is split on CLSSPD, the closing speed. Observations with CLSSPD ≤ 55.75 or missing go to node 12, where they are split on YEAR. (The asterisk in the inequality symbol ≤∗ in the figure indicates that missing values go left; if there is no asterisk, missing values go right.) Observations with YEAR ≤ 1981 or missing go to node 24, where 28 percent have severe head injury and CLSSPD is the best linear predictor; otherwise, they go to node 25, where 3 percent have severe head injury and YEAR is the best linear predictor. Finally, observations with CLSSPD > 55.75 at node 6 go to node 13, where they are split on BARSHP (barrier shape). Observations with BARSHP = FLB (flat barrier) go to node 26, where 44 percent of them have severe head injury and CLSSPD is the best linear predictor; those for which BARSHP is not FLB go to node 27, where 34 percent have severe head injury and CLSSPD is again the best linear predictor.

The tree shows that nodes 26 and 27 have the highest proportions of severe head injury, at 44 and 34 percent, respectively. These two nodes have in common that IMPANG is not in (269, 287), RSTFRT = 0, and CLSSPD > 55.75. This shows that high speed and absence of restraints make a dangerous combination. Nodes 7 and 25 have the lowest proportions of severe head injury, at 1 and 3 percent, respectively. Node 7 has IMPANG not in (269, 287) and RSTFRT ≠ 0. Node 25 has IMPANG not in (269, 287), RSTFRT = 0, CLSSPD ≤ 55.75 or missing, and YEAR > 1981. This suggests that non-driver-side impact, presence of a restraint system, low closing speed, and vehicles built after 1981 are factors that contribute to low rates of severe head injury.

In Fig. 1.11, an asterisk is attached to the name of a linear predictor variable if its absolute t-statistic exceeds 2.0. Thus, the linear predictors in nodes 2, 7, 24, and 25 are possibly statistically significant but those in nodes 26 and 27 are not. (The t-statistics obtained from the node logistic regression models are biased low because they do not account for uncertainty due to split selection; for a more accurate bootstrap-based solution, see [27, 28].) Figure 1.12 plots the fitted logistic regression curves in the terminal nodes of the tree where the predictor variable t-statistics are larger than 2.0 in absolute value. The proportion of crashes with severe head injury in each plot is indicated by a dotted line.

Given the confounding between CLSSPD and BARRIG observed in Fig. 1.10 for the whole sample, it is natural to wonder if the confounding persists in node 24, where CLSSPD is the best linear predictor. (CLSSPD is not significant in nodes 26 and 27 even without adjusting for variable selection.) The plot of CLSSPD by BARRIG for node 24 in Figure 1.13 shows that the confounding has practically disappeared, as all but 4 of the 107 crash tests in the node are against rigid barriers.


Figure 1.12: Fitted logistic regression curves in the terminal nodes of Fig. 1.11 where the predictor variable has an absolute t-statistic greater than 2.0; horizontal dotted lines show the sample proportion of severe injury in each node.

Figure 1.13: Plot of closing speed (CLSSPD) by barrier rigidity (BARRIG) in node 24 of Fig. 1.11


1.5 Conclusion

Logistic regression is a technique for estimating the probability of an event in terms of the values of one or more predictor variables. Often, if there are missing values among the predictor variables, they are imputed first. Otherwise, the observations or variables containing the missing values are deleted. Neither solution is appealing. In practice, finding a logistic regression model with good prediction accuracy is seldom automatic; it usually requires careful selection of variables, consideration of transformations, and estimation of the accuracy of multiple models. Even when a model with good accuracy is found, interpretation of the regression coefficients is not straightforward if there are two or more predictor variables.

A logistic regression tree is a piecewise logistic regression model, with the pieces obtained by recursively partitioning the space of predictor variables. Consequently, if there is no over-fitting, it may be expected to possess higher prediction accuracy than a one-piece logistic regression model. Recursive partitioning has two advantages over a search of all partitions: it is computationally efficient and it allows the partitions to be displayed as a decision tree. At a minimum, a logistic regression tree can serve as an informal goodness-of-fit test of whether a one-piece logistic model is adequate for the whole sample. A non-trivial pruned tree would indicate that a one-piece logistic model has lower prediction accuracy, possibly due to unaccounted interactions or nonlinearities among the variables. Ideally, a good tree-growing and pruning algorithm would automatically account for the overlooked effects, making it unnecessary to specify interaction and higher-order terms. It would also allow the models in the terminal nodes to be as simple as desired (such as fitting a single linear predictor in each node).

The tree pruning method is very important for prediction accuracy. Many methods adopt the AIC-type approach of selecting the tree that minimizes the sum of the residual deviance and a multiple, K, of the number of terminal nodes. Because no single value of K works well for all data sets [16], the main advantage of this approach is computational speed. Our experience indicates that it is inferior to the CART pruning approach, which uses cross-validation to estimate prediction accuracy.

Despite a binary decision tree being intuitive to interpret, a poor split selection method can yield misleading conclusions. One common cause is selection bias. The greedy approach used by CART and many other algorithms is known to prefer variables that permit more splits of the data over variables that allow fewer splits. Consequently, it is hard to know if a variable is chosen because it genuinely has the greatest effect at the node or because it simply allows more ways to partition the data there. LOTUS and GUIDE remove the bias by selecting variables with chi-squared tests. At the time this article is written, GUIDE is the only tree algorithm that can deal with cyclic variables and with two or more missing value codes [22]. The GUIDE software and manual may be obtained from www.stat.wisc.edu/~loh/guide.html.


1.6 Acknowledgment

This chapter was completed while the author was visiting the National Chung Cheng and National Tsing Hua Universities, Taiwan, under the auspices of the National Center for Theoretical Sciences.


Bibliography

[1] S. Weisberg: Applied Linear Regression, Fourth edn. (Wiley, Hoboken, NJ 2014)

[2] R. D. Cook, S. Weisberg: Partial one-dimensional regression models, American Statistician 58, 110–116 (2004)

[3] A. Agresti: An Introduction to Categorical Data Analysis (Wiley, New York 1996)

[4] J. M. Chambers, T. J. Hastie: Statistical Models in S (Wadsworth, Pacific Grove 1992)

[5] W.-Y. Loh: Classification and regression trees, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 1, 14–23 (2011) doi:10.1002/widm.8

[6] W.-Y. Loh: Regression trees with unbiased variable selection and interaction detection, Statistica Sinica 12, 361–386 (2002)

[7] W.-Y. Loh: Improving the precision of classification trees, Annals of Applied Statistics 3, 1710–1737 (2009)

[8] W.-Y. Loh, N. Vanichsetakul: Tree-structured classification via generalized discriminant analysis (with discussion), Journal of the American Statistical Association 83, 715–728 (1988)

[9] P. Chaudhuri, M.-C. Huang, W.-Y. Loh, R. Yao: Piecewise-polynomial regression trees, Statistica Sinica 4, 143–167 (1994)

[10] W.-Y. Loh, Y.-S. Shih: Split selection methods for classification trees, Statistica Sinica 7, 815–840 (1997)

[11] H. Kim, W.-Y. Loh: Classification trees with unbiased multiway splits, Journal of the American Statistical Association 96, 589–604 (2001)

[12] H. Kim, W.-Y. Loh: Classification trees with bivariate linear discriminant node models, Journal of Computational and Graphical Statistics 12, 512–530 (2003)

[13] K.-Y. Chan, W.-Y. Loh: LOTUS: An algorithm for building accurate and comprehensible logistic regression trees, Journal of Computational and Graphical Statistics 13, 826–852 (2004)

[14] W.-Y. Loh: Fifty years of classification and regression trees (with discussion), International Statistical Review 34, 329–370 (2014)

[15] J. N. Morgan, J. A. Sonquist: Problems in the analysis of survey data, and a proposal, Journal of the American Statistical Association 58, 415–434 (1963)

[16] L. Breiman, J. H. Friedman, R. A. Olshen, C. J. Stone: Classification and Regression Trees (Wadsworth, Belmont, California 1984)

[17] J. R. Quinlan: Learning with continuous classes. In: Proceedings of AI'92 Australian National Conference on Artificial Intelligence (World Scientific, Singapore 1992) pp. 343–348

[18] P. Doyle: The use of Automatic Interaction Detector and similar search procedures, Operational Research Quarterly 24, 465–467 (1973)

[19] P. Chaudhuri, W.-D. Lo, W.-Y. Loh, C.-C. Yang: Generalized regression trees, Statistica Sinica 5, 641–666 (1995)

[20] W. G. Cochran: Some methods of strengthening the common χ² tests, Biometrics 10, 417–451 (1954)

[21] P. Armitage: Tests for linear trends in proportions and frequencies, Biometrics 11, 375–386 (1955)

[22] W.-Y. Loh, J. Eltinge, M. J. Cho, Y. Li: Classification and regression trees and forests for incomplete data from sample surveys, Statistica Sinica 29, 431–453 (2019)

[23] L. Breiman: Random forests, Machine Learning 45(1), 5–32 (2001)

[24] E. B. Wilson, M. M. Hilferty: The distribution of chi-square, Proceedings of the National Academy of Sciences of the United States of America 17, 684–688 (1931)

[25] A. Zeileis, T. Hothorn, K. Hornik: Model-based recursive partitioning, Journal of Computational and Graphical Statistics 17, 492–514 (2008)

[26] NHTSA: Test Reference Guide Version 5, https://one.nhtsa.gov/Research/Databases-and-Software/NHTSA-Test-Reference-Guides (2014)

[27] W.-Y. Loh, H. Fu, M. Man, V. Champion, M. Yu: Identification of subgroups with differential treatment effects for longitudinal and multiresponse variables, Statistics in Medicine 35, 4837–4855 (2016)

[28] W.-Y. Loh, M. Man, S. Wang: Subgroups from regression trees with adjustment for prognostic effects and post-selection inference, Statistics in Medicine 38, 545–557 (2019)