NONPARAMETRIC REGRESSION WHEN
ESTIMATING THE PROBABILITY OF SUCCESS
Rand R. Wilcox
Dept of Psychology
University of Southern California
August 21, 2010
ABSTRACT
For the random variables Y, X1, . . . , Xp, where Y is binary, let M(x1, . . . , xp) = P(Y = 1|(X1, . . . , Xp) = (x1, . . . , xp)). The paper compares four smoothers aimed at estimating M(x1, . . . , xp), three of which can be used when p > 1. Evidently there are no published comparisons of smoothers when p > 1 and Y is binary, and there are no published results on how the four estimators considered here compare. One of the estimators is based on an approach described in Hosmer and Lemeshow (1989, p. 85), which is limited to p = 1. A simple modification of this estimator (called method E3 in the paper) is proposed that can be used when p > 1. No estimator dominated in terms of mean squared error and bias, and for p = 1 the differences among three of the estimators, in terms of mean squared error and bias, are not particularly striking. But for p > 1, differences among the estimators are magnified, with method E3 performing relatively well. An estimator based on the running interval smoother performs about as well as E3, but for general use E3 is found to be preferable. An estimator studied by Signorini and Jones (2004) is not recommended, particularly when p > 1.
keywords: logistic regression, kernel estimators, smoothers.
1 Introduction
Consider a random sample (Yi, Xi1, . . . Xip) from some unknown multivariate distribution
(i = 1, . . . , n), where Yi is binary. As is well known, a fundamental goal that is commonly
encountered is estimating
M(x1, . . . , xp) = P (Y = 1|(X1, . . . Xp) = (x1, . . . , xp)). (1)
More broadly, there is the issue of understanding the nature of the association between Y
and (X1, . . . Xp). From basic principles, using least squares regression is unsatisfactory, one
reason being that the estimate of M(x1, . . . xp) can be substantially smaller than 0 or greater
than 1. A commonly used strategy for dealing with this problem is to assume
P (Y = 1|X = x) = F (x′β), (2)
where F is some strictly increasing cumulative distribution function and β is a vector of
unknown parameters. The best-known choice for F is
F (t) = exp(t)/(1 + exp(t)),
which yields the usual logistic regression model. That is, assume that
P (Y = 1|X = x) = exp(β0 + β1x1 + · · · + βpxp)/(1 + exp(β0 + β1x1 + · · · + βpxp)). (3)
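For reference, the model in (3) is easily fit in R, the language in which the functions mentioned at the end of this paper are written. A minimal sketch, where the data frame dat and the variables y, x1 and x2 are hypothetical names:

# Fit the usual logistic regression model (3) by maximum likelihood.
# dat is a hypothetical data frame with a binary y and predictors x1, x2.
fit <- glm(y ~ x1 + x2, family = binomial, data = dat)
summary(fit)                             # coefficient estimates and p-values
phat <- predict(fit, type = "response")  # estimates of P(Y = 1 | X = x)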
In the broader context where Y is not necessarily binary, a general concern with para-
metric regression models is that the particular model chosen might not provide a reasonably
accurate approximation of the true regression surface. Of course, one can attempt to deal
with this problem by introducing functions of the predictors into the model. For example,
replace the model Y = β0 + β1X with Y = β0 + β1X + β2X². But this is not always satisfactory, which has prompted a wide range of nonparametric estimators that are often called
smoothers (e.g. Efromovich, 1999; Eubank, 1999; Fan and Gijbels, 1996; Fox, 2001; Green
and Silverman, 1993; Gyorfi et al., 2002; Hardle, 1990; Hastie and Tibshirani, 1990).
For the situation where Y is binary, a simple strategy is to use a smoother aimed at
estimating the mean of Y, given (X1, . . . , Xp), but for some smoothers this approach is unsatisfactory. Examples are the method LOESS developed by Cleveland (1979) and the kernel
smoother derived by Fan (1993). In effect, these methods use weighted least squares to
estimate the conditional mean of Y , which can yield estimates of M(x1, . . . xp) substantially
smaller than 0 or larger than 1. There are, however, smoothers that avoid this problem,
some of which have been compared when p = 1 (Signorini & Jones, 2004). An approach not
discussed by Signorini and Jones is described by Hosmer and Lemeshow (1989, p. 85), which
they attribute to general theoretical results derived by Kay and Little (1987). Yet another
viable method is the running interval smoother (Wilcox, 2005, sections 11.4.4 and 11.4.8),
but there are no results on how this method compares to other estimators in terms of mean
squared error and bias. And evidently there are no results comparing smoothers for binary
outcomes when p > 1. The goal in this paper is to fill this gap.
Section 2 describes the estimators to be compared. One of these estimators is based
on a slight modification of the estimator in Hosmer and Lemeshow (1989, p. 85), which
extends their estimator to p > 1 predictors. Section 3 reports simulation results and section
4 illustrates the practical advantages of using a smoother rather than relying solely on the
basic logistic regression model.
2 Description of the Estimators
Method E1
The first estimator stems from Signorini and Jones (2004). Let nj be the number of Y
values equal to j (j = 0, 1). Let fj be the probability density function of (X1, . . . Xp) given
that Y = j. Then an estimate of M(x1, . . . , xp) is

M̂(x1, . . . , xp) = n1f̂1(x1, . . . , xp)/{n1f̂1(x1, . . . , xp) + n0f̂0(x1, . . . , xp)}, (4)

where f̂j is some estimate of fj. Signorini and Jones consider the general situation where f̂j is some kernel density estimator. Motivated by results summarized in Silverman (1986), here an adaptive kernel density estimator is used.
First consider p = 1. Briefly, let f̃(Xi) be an initial estimate of f(Xi). Here, f̃(Xi) is based on the expected frequency curve in Wilcox (2005, section 3.2.3). Let
log g = (1/n) Σ log f̃(Xi),

and

λi = (f̃(Xi)/g)^(−a),
where a is a sensitivity parameter satisfying 0 ≤ a ≤ 1. Based on comments by Silverman
(1986), a = .5 is used. Then the adaptive kernel estimate of f is taken to be
f̂(x) = (1/n) Σ (hλi)^(−1) K{(x − Xi)/(hλi)},
where
K(t) = (3/4)(1 − t²/5)/√5 for |t| < √5, and K(t) = 0 otherwise,
is the Epanechnikov kernel, and following Silverman (1986, pp. 47–48), the span is
h = 1.06A/n^(1/5),

where

A = min(s, IQR/1.34),
s is the standard deviation, and IQR is the interquartile range. (Here, IQR is computed via
the ideal fourths; see for example Wilcox, 2005, section 3.12.5.)
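To make the steps concrete, the following is a minimal R sketch of the resulting estimate of M via (4), with two stated simplifications: the pilot estimate f̃ is taken to be a fixed-span Epanechnikov kernel estimate rather than the expected frequency curve, and IQR is R's default interquartile range rather than the ideal fourths. The function names are hypothetical (the author's bkreg implements the actual method).

akerd <- function(X, pts, a = .5) {
  # Adaptive kernel density estimate with the Epanechnikov kernel.
  n <- length(X)
  A <- min(sd(X), IQR(X)/1.34)
  h <- 1.06 * A / n^(1/5)
  epan <- function(t) ifelse(abs(t) < sqrt(5), .75 * (1 - t^2/5)/sqrt(5), 0)
  # Pilot estimate at each sample point (positive, since each point counts itself).
  f0 <- sapply(X, function(x) mean(epan((x - X)/h)/h))
  g <- exp(mean(log(f0)))    # geometric mean of the pilot estimates
  lam <- (f0/g)^(-a)         # local bandwidth factors
  sapply(pts, function(x) mean(epan((x - X)/(h*lam))/(h*lam)))
}

e1 <- function(X, Y, pts) {
  # Estimate M at the points in pts via equation (4).
  n1 <- sum(Y == 1); n0 <- sum(Y == 0)
  f1 <- akerd(X[Y == 1], pts)
  f0 <- akerd(X[Y == 0], pts)
  n1*f1/(n1*f1 + n0*f0)
}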
For p > 1, first rescale the p marginal distributions. More precisely, let
xiℓ = Xiℓ/min(sℓ, IQRℓ/1.34),

where sℓ and IQRℓ are, respectively, the standard deviation and interquartile range based on X1ℓ, . . . , Xnℓ, ℓ = 1, . . . , p. If x′x < 1, the multivariate Epanechnikov kernel is

Ke(x) = (p + 2)(1 − x′x)/(2cp);

otherwise Ke(x) = 0. The quantity cp is the volume of the unit p-sphere: c1 = 2, c2 = π, and for p > 2, cp = 2πc_{p−2}/p. The estimate of the probability density function is
f̂(x) = (1/n) Σ (hλi)^(−p) Ke{(x − xi)/(hλi)},
where, following Silverman (1986, p. 86), the span is taken to be
h = A(p)n^(−1/(p+4)),

where A(1) = 1.77, A(2) = 2.78, and for p > 2,

A(p) = {8p(p + 2)(p + 4)(2√π)^p / ((2p + 1)cp)}^(1/(p+4)).
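The constants are easily computed; a small R sketch taking the recursion for cp and the expression for A(p) directly from the text (hspan is a hypothetical name):

cp <- function(p) {
  # Volume of the unit p-sphere: c1 = 2, c2 = pi, cp = 2*pi*c_{p-2}/p.
  if (p == 1) 2 else if (p == 2) pi else 2*pi*cp(p - 2)/p
}
Ap <- function(p) {
  # Span constant: A(1) and A(2) as given above; otherwise the general formula.
  if (p == 1) return(1.77)
  if (p == 2) return(2.78)
  (8*p*(p + 2)*(p + 4)*(2*sqrt(pi))^p / ((2*p + 1)*cp(p)))^(1/(p + 4))
}
hspan <- function(p, n) Ap(p) * n^(-1/(p + 4))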
Method E2
The second estimator is based on a slight modification of an approach described in Hosmer
and Lemeshow (1989), which, unlike the other estimators considered here, is limited to p = 1.
Let xi = (Xi−M)/MADN, where M is the usual median based on X1, . . . , Xn, MAD is the
median of |X1 − M |, . . . , |Xn − M | and MADN is MAD/.6745. Under normality, MADN
estimates the population variance, but unlike the sample variance, MADN is resistant to
outliers. In particular, its breakdown point is .5, the highest possible value, where the
breakdown point of an estimator refers to the minimum proportion of points that must be
altered so that its value can be made arbitrarily large or small. Of course, the median has
a breakdown point of .5 as well.
Let

wi = Ih e^(−(xi−x)²),

where Ih = 1 if |xi − x| < h and Ih = 0 otherwise. The estimate of m(x) is taken to be

m̂(x) = Σ wiYi / Σ wi. (5)
Hosmer and Lemeshow use the same weights except that they do not standardize the Xi values; they simply use (Xi − x)². This is unsatisfactory, however, because a change in scale can substantially alter the weights, resulting in a highly inaccurate estimate of m(x). The indicator function Ih plays the role of a span. If Xi is sufficiently far from x, it is given no weight when estimating m(x). Experience with other smoothers, where Y is not necessarily binary, suggests using h = 1. But the simulation results in section 3 suggest that h = 2 is a somewhat better choice for general use.
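A minimal R sketch of method E2, written under the assumption (left implicit above) that the point of interest is standardized in the same way as the data:

e2 <- function(X, Y, x, h = 2) {
  # mad() in R returns MAD/.6745 by default, that is, MADN.
  z  <- (X - median(X))/mad(X)
  z0 <- (x - median(X))/mad(X)     # standardize the point of interest
  w  <- (abs(z - z0) < h) * exp(-(z - z0)^2)
  sum(w * Y)/sum(w)                # equation (5); NaN if no Xi is within h of x
}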
Method E3
Method E2 is limited to p = 1. A simple and seemingly natural generalization that allows
p ≥ 1 is to use an approach that has obvious similarities to the running interval smoother
(e.g., Wilcox, 2005, section 11.4.4), which corresponds to method E4 outlined below. In
particular, again use (5), but with the weights based on a robust analog of Mahalanobis
distance where the usual covariance matrix is replaced by a measure of scatter that is rela-
tively insensitive to outliers. Here, the minimum volume ellipsoid (MVE) estimator is used
(e.g. Rousseeuw & Leroy, 1987), which also has the highest possible breakdown point, .5.
Roughly, the MVE estimator searches for the ellipsoid containing half of the data that has
the smallest volume. Based on this subset of the data, the mean and covariance matrix are
computed which yields a robust Mahalanobis distance for each of the n observed points.
These distances are used to determine which points are "good", meaning that the squared
distances do not exceed the .975 quantile of a chi-squared distribution with p degrees of
freedom. Here, as is typically done, the measures of location and scatter are recomputed
using the good points only, which are henceforth called the MVE estimates. There are many
other multivariate robust measures of scatter (e.g., Wilcox, 2005, chapter 6); perhaps some other choice has practical value for the situation at hand, but this is not pursued here.
Let S be the MVE measure of scatter and let

di = (Xi − x)′S^(−1)(Xi − x).

Now the weights are given by

wi = Ih e^(−di),

and again m(x) is computed with (5). Here, Ih = 1 if √di < h and Ih = 0 otherwise. Note that for p = 1, the MVE estimate of scatter will differ somewhat from the estimate based on MAD/.6745. So for p = 1, E2 and E3 can yield slightly different results.
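A rough R sketch of method E3, using cov.rob from the MASS package for the raw MVE fit and coding the chi-squared reweighting step explicitly; logSM is the author's implementation, so the version below is only illustrative:

library(MASS)

e3 <- function(X, Y, x, h = 2) {
  X <- as.matrix(X)
  p <- ncol(X)
  raw  <- cov.rob(X, method = "mve")           # raw MVE center and scatter
  d2   <- mahalanobis(X, raw$center, raw$cov)  # squared robust distances
  good <- d2 <= qchisq(.975, df = p)           # flag the "good" points
  S    <- cov(X[good, , drop = FALSE])         # reweighted scatter estimate
  di   <- mahalanobis(X, x, S)                 # squared distance of each Xi from x
  w    <- (sqrt(di) < h) * exp(-di)
  sum(w * Y)/sum(w)                            # equation (5)
}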
Method E4
The final method is based on what is known as the running interval smoother (e.g.
Wilcox, 2005, section 11.4.4). First consider p = 1. The point x is said to be close to Xi if
|Xi − x| ≤ k ×MADN,
where MADN=MAD/.6745. So for normal distributions, x is considered to be close to Xi if
x is within k standard deviations of Xi. Let
N(x) = {i : |Xi − x| ≤ k × MADN}.
So N(x) indexes the set of all Xi values that are close to x. The estimate of m(x) is just the
proportion of Yi values equal to 1 for which i is an element of N(x). That is, use all of the Yi
values for which Xi is close to x. Following Wilcox (2005), the running interval smoother can
be generalized to more than one predictor by replacing MADN with the minimum volume ellipsoid estimate of scatter, S, and by again measuring the distance between x and Xi with di, as was done in method E3. Now m(x) is the mean of the Yi values for which √di < k. From Wilcox (2005, section 11.4.4), a seemingly good choice for k is 0.8 or 1. But the simulation results in section 3 suggest that k = 1.2 performs relatively well.
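For p = 1 the running interval smoother amounts to a few lines of R; a minimal sketch (for p > 1 the indicator would instead be based on the robust distances di used by method E3):

e4 <- function(X, Y, x, k = 1.2) {
  madn  <- mad(X)                  # MAD/.6745
  close <- abs(X - x) <= k * madn  # the set N(x)
  mean(Y[close])                   # proportion of Y = 1 among the close points
}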
3 Simulation Results
Simulations were used to assess the bias and mean squared error of the four estimators in
section 2. Bias was measured with
U = (1/(Bn)) Σ_b Σ_i (m̂b(Xi) − m(Xi)),

where the sums are over b = 1, . . . , B and i = 1, . . . , n, B is the number of replications used in the simulations, and m̂b(Xi) is the estimate of m(Xi) based on the bth replication. Here, B = 1000 was used. Mean squared error was measured with

V = (1/(Bn)) Σ_b Σ_i (m̂b(Xi) − m(Xi))².
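In code, with a B-by-n matrix of estimates, both summaries are immediate; mhat and mtrue are hypothetical names:

# mhat[b, i] is the estimate of m(Xi) from replication b;
# mtrue holds the true values m(X1), ..., m(Xn).
err <- sweep(mhat, 2, mtrue)   # subtract each true value from its column
U <- mean(err)                 # estimated bias
V <- mean(err^2)               # estimated mean squared error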
For p = 1, the true regression line was taken to have the form

m(x) = exp(β1x + β2x²)/(1 + exp(β1x + β2x²)), (6)

where the choices for the regression coefficients were (β1, β2) = (0, 0), (0, 1), (1, 0) and (1, 1). For p = 2,

m(x) = exp(β1x1 + β2x1²)/(1 + exp(β1x1 + β2x1²))

was used. Some simulations were run when

m(x) = exp(β1(x1 + x2) + β2(x1² + x2²))/(1 + exp(β1(x1 + x2) + β2(x1² + x2²))),

but no additional insights were obtained, so for brevity the results are not reported.
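For instance, one replication for p = 1 under (6) can be generated along the following lines, shown here for the coefficient pair (β1, β2) = (1, 1):

b1 <- 1; b2 <- 1
n <- 40
x <- rnorm(n)   # or g-and-h values, as described next
m <- exp(b1*x + b2*x^2)/(1 + exp(b1*x + b2*x^2))
y <- rbinom(n, 1, m)   # binary responses with P(Y = 1 | x) = m(x)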
For p = 1, four distributions were used to generate X values: normal, symmetric and
heavy-tailed, asymmetric and relatively light-tailed, and asymmetric and heavy-tailed. More
precisely, X values were generated from a g-and-h distribution. To elaborate, let Z be a
standard normal random variable. Then
W = {(exp(gZ) − 1)/g} exp(hZ²/2), if g > 0,
W = Z exp(hZ²/2), if g = 0,
has a g-and-h distribution, where g and h are parameters that determine the first four moments (Hoaglin, 1985). The standard normal distribution corresponds to g = h = 0. The case g = 0 corresponds to a symmetric distribution, and as g increases, skewness increases as well. The parameter h determines tail thickness. As h increases, the g-and-h distribution becomes more heavy-tailed, roughly meaning that the probability of generating an outlier increases. The four distributions considered here are (g, h) = (0, 0), (0, .5), (.5, 0) and (.5, .5). Table 1 summarizes the skewness (κ1) and kurtosis (κ2) for the g-and-h distributions used in the simulations. When h > 1/k, E(X − µ)^k is not defined and the corresponding entry in Table 1 is left blank.
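A sketch of a generator for these X values, following the transformation above (rgh is a hypothetical name):

rgh <- function(n, g = 0, h = 0) {
  # Transform standard normals to the g-and-h distribution.
  Z <- rnorm(n)
  if (g > 0) ((exp(g*Z) - 1)/g) * exp(h*Z^2/2)
  else Z * exp(h*Z^2/2)
}
x <- rgh(40, g = .5, h = .5)   # the asymmetric, heavy-tailed case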
For p > 1, the marginal distributions of X were taken to be one of the four g-and-h
distributions just described, with the correlation among all variables set equal to ρ = 0 or
.5. (The R function rmul, which belongs to the library of R functions in Wilcox, 2005, was
used to generate values for X as just described.) Altering the correlation did impact the
estimated mean squared error, but the relative merits of the estimators were unaltered, so
only results for ρ = 0 are reported.

Table 1: Some properties of the g-and-h distribution.

  g     h     κ1    κ2
 0.0   0.0   0.00  3.00
 0.0   0.5   0.00    —
 0.5   0.0   1.75   8.9
 0.5   0.5     —     —
Table 2 shows the estimated mean squared error when p = 1 and n = 40. No single esti-
mator dominates and in general there seems to be little separating the methods. When using
method E4, it seems that using k = 1.2 is slightly preferable to using k = .8. Methods E2,
E3 and E4 had bias estimates that ranged between −.009 and .009. The same was generally
true for method E1, but in some situations the estimate was less than −.02, suggesting that
it is the least satisfactory of the estimators considered.
Table 3 shows the estimated mean squared error when p = 2. Now there are more marked
differences among the estimators compared to p = 1. In general, method E3 with h = 2
gave the best results. Method E4 with k = 1.2 is a close second. There are situations where
E1 performs about as well as the other methods. But there are instances where this is not the case, and no situations were found where it provides a substantial advantage, which suggests that it be excluded from consideration. Again, bias was found to be negligible when using methods E2, E3 and E4. Bias was generally low when using method E1, but in some situations the estimated bias was less than −.04.
Table 2: Estimated mean squared error, n = 40, p = 1

  g    h   β1  β2    E1     E2            E3            E4
                          h=1    h=2    h=1    h=2    k=.8   k=1.2
 0.0  0.0   0   0   .020   .017   .012   .020   .014   .020   .014
            1   0   .017   .014   .014   .015   .014   .016   .015
            0   1   .016   .015   .016   .014   .015   .016   .019
            1   1   .016   .015   .017   .015   .015   .016   .020
 0.0  0.5   0   0   .032   .039   .030   .035   .026   .038   .029
            1   0   .017   .015   .014   .015   .014   .016   .016
            0   1   .016   .015   .015   .014   .014   .015   .019
            1   1   .016   .015   .017   .014   .015   .016   .021
 0.5  0.0   0   0   .021   .020   .016   .025   .020   .023   .016
            1   0   .017   .014   .013   .016   .014   .016   .016
            0   1   .017   .014   .013   .015   .014   .015   .015
            1   1   .016   .012   .013   .013   .012   .013   .015
 0.5  0.5   0   0   .032   .035   .027   .039   .031   .039   .029
            1   0   .019   .015   .015   .016   .014   .017   .017
            0   1   .016   .014   .015   .014   .014   .015   .018
            1   1   .016   .014   .016   .014   .014   .015   .020
Some additional simulations were run with p = 4. Methods E3 and E4 again had low
bias, but method E1 was relatively unsatisfactory in some situations. Again method E3 with
h = 2 was found to be generally best in terms of mean squared error.
4 An Illustration
Kyphosis is a postoperative spinal deformity. Hastie and Tibshirani (1990) report data on
the presence or absence of kyphosis versus the age of the patient, in months, the number of
vertebrae involved in the spinal operation, and a variable called start, which is the beginning
of the range of vertebrae involved. Figure 1 shows an approximation of the regression surface
using method E3 based on age and the variable start. (Method E4 gives similar results.)
Figure 2 shows an estimate of the regression surface using the standard logistic regression
model given by equation (3). As is evident, Figures 1 and 2 give a strikingly different
perception of the association. Using the standard logistic regression model, age has a p-value of .107 and the variable start has a p-value of .0002. However, if a quadratic term for age is included in the model, the linear and quadratic terms for age have p-values of .018 and .036, respectively.
Table 3: Estimated mean squared error, n = 40, p = 2

  g    h   β1  β2    E1     E3            E4
                          h=1    h=2    k=.8   k=1.2
 0.0  0.0   0   0   .061   .061   .036   .057   .045
            1   0   .053   .046   .026   .044   .036
            0   1   .060   .042   .026   .041   .035
            1   1   .063   .043   .029   .042   .035
 0.0  0.5   0   0   .076   .090   .077   .094   .069
            1   0   .047   .063   .038   .053   .044
            0   1   .067   .047   .036   .046   .042
            1   1   .070   .048   .035   .049   .043
 0.5  0.0   0   0   .061   .073   .048   .070   .057
            1   0   .058   .053   .031   .051   .038
            0   1   .067   .047   .034   .046   .040
            1   1   .064   .045   .031   .043   .038
 0.5  0.5   0   0   .074   .095   .071   .092   .080
            1   0   .051   .055   .038   .054   .045
            0   1   .069   .048   .038   .048   .044
            1   1   .073   .049   .038   .049   .044
Figure 1: An approximation of the regression surface using the kyphosis data and method E3 (axes: Age and Start).
Figure 2: An approximation of the regression surface, using the kyphosis data, assuming the usual logistic regression model given by (3) (axes: Age and Start).
If the variable start is replaced by the number of vertebrae involved,
again a smooth gives a noticeably different perception of the association compared to the
usual logistic regression model.
5 Concluding Remarks
In summary, no method dominated and situations were found where each method compares
well to the others. But in terms of choosing a method for routine use, the results reported
here suggest method E3 or perhaps E4. Method E2 performs reasonably well but is limited
to p = 1. When p = 1, there appears to be little reason for preferring E2 over E3 and
E4. Method E1 compares very well to the other methods, in terms of both mean squared
error and bias, when p = 4 and m(x) = .5. But its bias can be relatively high, it never demonstrated a striking advantage when p < 4, and the other three estimators have relatively low bias in all situations considered, so method E1 seems to be the least
satisfactory estimator. Perhaps its performance can be improved by replacing the adaptive
kernel estimator used here with some other kernel density estimator, but this remains to be
determined.
Some additional simulations were run where larger values for the spans, h and k, were
used. This did not improve any of the estimators considered. When using method E3, for
example, increasing the span to h = 2 or even h = 5 had a negligible effect on the estimated
mean squared error. However, when using method E4, some caution is needed when choosing
the span. If the span is too large, curvature, and indeed a true association, can be missed.
And if the span is too small, a ragged regression line (or surface) can result. When using
method E3, altering the span h can again affect the smoothness of the resulting graph. But
as the span increases, it appears that true curvature is still detected, which is not necessarily
the case when using method E4.
R functions for applying the methods in this paper are available from the author. The
functions bkreg, logrsm, logSM and rplot.bin perform methods E1, E2, E3 and E4, respectively.
References
Cleveland, W. S. (1979). Robust locally weighted regression and smoothing
scatterplots. Journal of the American Statistical Association, 74,
829–836.
Efromovich, S. (1999). Nonparametric Curve Estimation: Methods, Theory and
Applications. New York: Springer-Verlag.
Eubank, R. L. (1999). Nonparametric Regression and Spline Smoothing.
New York: Marcel Dekker.
Fan, J. (1993). Local linear smoothers and their minimax efficiencies.
The Annals of Statistics, 21, 196–216.
Fan, J. & Gijbels, I. (1996). Local Polynomial Modeling and Its Applications.
Boca Raton, FL: CRC Press.
Fox, J. (2001). Multiple and Generalized Nonparametric Regression.
Thousand Oaks, CA: Sage.
Green, P. J. & Silverman, B. W. (1993). Nonparametric Regression and Generalized
Linear Models: A Roughness Penalty Approach. Boca Raton, FL: CRC Press.
Gyorfi, L., Kohler, M., Krzyzak, A. & Walk, H. (2002).
A Distribution-Free Theory of Nonparametric Regression. New York:
Springer Verlag.
Hardle, W. (1990). Applied Nonparametric Regression. Econometric
Society Monographs No. 19, Cambridge, UK:
Cambridge University Press.
Hastie, T. J. & Tibshirani, R. J. (1990). Generalized Additive Models.
New York: Chapman and Hall.
Hoaglin, D. C. (1985). Summarizing shape numerically: The g-and-h distributions. In D. Hoaglin, F. Mosteller & J. Tukey (Eds.), Exploring Data Tables, Trends, and Shapes (pp. 461–515). New York: Wiley.
Hosmer, D. W. & Lemeshow, S. (1989). Applied Logistic Regression.
New York: Wiley.
Kay, R. & Little, S. (1987). Transformation of the explanatory variables in
the logistic regression model for binary data. Biometrika, 74, 495–501.
Rousseeuw, P. J. & Leroy, A. M. (1987). Robust Regression & Outlier
Detection. New York: Wiley.
Signorini, D. F. & Jones, M. C. (2004). Kernel estimators for univariate
binary regression. Journal of the American Statistical Association, 99,
119–126.
Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis.
New York: Chapman and Hall.
Wilcox, R. R. (2005). Introduction to Robust Estimation and Hypothesis
Testing, 2nd Ed. San Diego, CA: Academic Press.