NONPARAMETRIC REGRESSION WHEN
ESTIMATING THE PROBABILITY OF SUCCESS
Rand R. Wilcox
Dept of Psychology
University of Southern California
August 21, 2010
ABSTRACT
For the random variables Y, X1, . . . , Xp, where Y is binary, let M(x1, . . . , xp) = P(Y = 1|(X1, . . . , Xp) = (x1, . . . , xp)). The paper compares four smoothers aimed at estimating M(x1, . . . , xp), three of which can be used when p > 1. Evidently there are no published comparisons of smoothers when p > 1 and Y is binary, and there are no published results on how the four estimators considered here compare. One of the estimators is based on an approach described in Hosmer and Lemeshow (1989, p. 85), which is limited to p = 1. A simple modification of this estimator (called method E3 in the paper) is proposed that can be used when p > 1. No estimator dominated in terms of mean squared error and bias, and for p = 1 the differences among three of the estimators, in terms of mean squared error and bias, are not particularly striking. But for p > 1, differences among the estimators are magnified, with method E3 performing relatively well. An estimator based on the running interval smoother performs about as well as E3, but for general use E3 is found to be preferable. An estimator studied by Signorini and Jones (2004) is not recommended, particularly when p > 1.
keywords: logistic regression, kernel estimators, smoothers.
1 Introduction
Consider a random sample (Yi, Xi1, . . . Xip) from some unknown multivariate distribution
(i = 1, . . . , n), where Yi is binary. As is well known, a fundamental goal that is commonly
encountered is estimating
M(x1, . . . , xp) = P (Y = 1|(X1, . . . Xp) = (x1, . . . , xp)). (1)
More broadly, there is the issue of understanding the nature of the association between Y
and (X1, . . . Xp). From basic principles, using least squares regression is unsatisfactory, one
reason being that the estimate of M(x1, . . . xp) can be substantially smaller than 0 or greater
than 1. A commonly used strategy for dealing with this problem is to assume
P (Y = 1|X = x) = F (x′β), (2)
where F is some strictly increasing cumulative distribution function and β is a vector of
unknown parameters. The best-known choice for F is
F (t) = exp(t)/(1 + exp(t)),
which yields the usual logistic regression model. That is, assume that
P (Y = 1|X = x) = exp(β0 + β1x1 + · · · + βpxp)/(1 + exp(β0 + β1x1 + · · · + βpxp)). (3)
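For reference, the model in (3) is easily fit in R, the language in which the functions mentioned at the end of this paper are written. A minimal sketch, where the data frame dat and the variables y, x1 and x2 are hypothetical names:

# Fit the usual logistic regression model (3) by maximum likelihood.
# dat is a hypothetical data frame with a binary y and predictors x1, x2.
fit <- glm(y ~ x1 + x2, family = binomial, data = dat)
summary(fit)                             # coefficient estimates and p-values
phat <- predict(fit, type = "response")  # estimates of P(Y = 1 | X = x)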
In the broader context where Y is not necessarily binary, a general concern with para-
metric regression models is that the particular model chosen might not provide a reasonably
accurate approximation of the true regression surface. Of course, one can attempt to deal
with this problem by introducing functions of the predictors into the model. For example,
replace the model Y = β0 + β1X with Y = β0 + β1X + β2X². But this is not always satisfactory, which has prompted a wide range of nonparametric estimators that are often called
smoothers (e.g. Efromovich, 1999; Eubank, 1999; Fan and Gijbels, 1996; Fox, 2001; Green
and Silverman, 1993; Gyorfi et al., 2002; Hardle, 1990; Hastie and Tibshirani, 1990).
For the situation where Y is binary, a simple strategy is to use a smoother aimed at
estimating the mean of Y, given (X1, . . . , Xp), but for some smoothers this approach is unsatisfactory. Examples are the method LOESS developed by Cleveland (1979) and the kernel
smoother derived by Fan (1993). In effect, these methods use weighted least squares to
estimate the conditional mean of Y , which can yield estimates of M(x1, . . . xp) substantially
smaller than 0 or larger than 1. There are, however, smoothers that avoid this problem,
some of which have been compared when p = 1 (Signorini & Jones, 2004). An approach not
discussed by Signorini and Jones is described by Hosmer and Lemeshow (1989, p. 85), which
they attribute to general theoretical results derived by Kay and Little (1987). Yet another
viable method is the running interval smoother (Wilcox, 2005, sections 11.4.4 and 11.4.8),
but there are no results on how this method compares to other estimators in terms of mean
squared error and bias. And evidently there are no results comparing smoothers for binary
outcomes when p > 1. The goal in this paper is to fill this gap.
Section 2 describes the estimators to be compared. One of these estimators is based
on a slight modification of the estimator in Hosmer and Lemeshow (1989, p. 85), which
extends their estimator to p > 1 predictors. Section 3 reports simulation results and section
4 illustrates the practical advantages of using a smoother rather than relying solely on the
basic logistic regression model.
2 Description of the Estimators
Method E1
The first estimator stems from Signorini and Jones (2004). Let nj be the number of Y
values equal to j (j = 0, 1). Let fj be the probability density function of (X1, . . . Xp) given
that Y = j. Then an estimate of M(x1, . . . , xp) is

M̂(x1, . . . , xp) = n1f̂1(x1, . . . , xp)/{n1f̂1(x1, . . . , xp) + n0f̂0(x1, . . . , xp)}, (4)

where f̂j is some estimate of fj. Signorini and Jones consider the general situation where f̂j is some kernel density estimator. Motivated by results summarized in Silverman (1986), here an adaptive kernel density estimator is used.
First consider p = 1. Briefly, let f̃(Xi) be an initial estimate of f(Xi). Here, f̃(Xi) is based on the expected frequency curve in Wilcox (2005, section 3.2.3). Let
log g = (1/n) Σ log f̃(Xi),

and

λi = (f̃(Xi)/g)^(−a),
where a is a sensitivity parameter satisfying 0 ≤ a ≤ 1. Based on comments by Silverman
(1986), a = .5 is used. Then the adaptive kernel estimate of f is taken to be
f̂(x) = (1/n) Σ (hλi)^(−1) K{(x − Xi)/(hλi)},
where
K(t) = (3/4)(1 − t²/5)/√5 for |t| < √5, and K(t) = 0 otherwise,
is the Epanechnikov kernel, and following Silverman (1986, pp. 47–48), the span is
h = 1.06A/n^(1/5),

where

A = min(s, IQR/1.34),
s is the standard deviation, and IQR is the interquartile range. (Here, IQR is computed via
the ideal fourths; see for example Wilcox, 2005, section 3.12.5.)
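To make the steps concrete, the following is a minimal R sketch of the resulting estimate of M via (4), with two stated simplifications: the pilot estimate f̃ is taken to be a fixed-span Epanechnikov kernel estimate rather than the expected frequency curve, and IQR is R's default interquartile range rather than the ideal fourths. The function names are hypothetical (the author's bkreg implements the actual method).

akerd <- function(X, pts, a = .5) {
  # Adaptive kernel density estimate with the Epanechnikov kernel.
  n <- length(X)
  A <- min(sd(X), IQR(X)/1.34)
  h <- 1.06 * A / n^(1/5)
  epan <- function(t) ifelse(abs(t) < sqrt(5), .75 * (1 - t^2/5)/sqrt(5), 0)
  # Pilot estimate at each sample point (positive, since each point counts itself).
  f0 <- sapply(X, function(x) mean(epan((x - X)/h)/h))
  g <- exp(mean(log(f0)))    # geometric mean of the pilot estimates
  lam <- (f0/g)^(-a)         # local bandwidth factors
  sapply(pts, function(x) mean(epan((x - X)/(h*lam))/(h*lam)))
}

e1 <- function(X, Y, pts) {
  # Estimate M at the points in pts via equation (4).
  n1 <- sum(Y == 1); n0 <- sum(Y == 0)
  f1 <- akerd(X[Y == 1], pts)
  f0 <- akerd(X[Y == 0], pts)
  n1*f1/(n1*f1 + n0*f0)
}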
For p > 1, first rescale the p marginal distributions. More precisely, let
xiℓ = Xiℓ/min(sℓ, IQRℓ/1.34),

where sℓ and IQRℓ are, respectively, the standard deviation and interquartile range based on X1ℓ, . . . , Xnℓ, ℓ = 1, . . . , p. If x′x < 1, the multivariate Epanechnikov kernel is

Ke(x) = (p + 2)(1 − x′x)/(2cp);

otherwise Ke(x) = 0. The quantity cp is the volume of the unit p-sphere: c1 = 2, c2 = π, and for p > 2, cp = 2πc_{p−2}/p. The estimate of the probability density function is
f̂(x) = (1/n) Σ (hλi)^(−p) Ke{(x − xi)/(hλi)},
where, following Silverman (1986, p. 86), the span is taken to be
h = A(p)n^(−1/(p+4)),

where A(1) = 1.77, A(2) = 2.78, and for p > 2,

A(p) = {8p(p + 2)(p + 4)(2√π)^p / ((2p + 1)cp)}^(1/(p+4)).
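The constants are easily computed; a small R sketch taking the recursion for cp and the expression for A(p) directly from the text (hspan is a hypothetical name):

cp <- function(p) {
  # Volume of the unit p-sphere: c1 = 2, c2 = pi, cp = 2*pi*c_{p-2}/p.
  if (p == 1) 2 else if (p == 2) pi else 2*pi*cp(p - 2)/p
}
Ap <- function(p) {
  # Span constant: A(1) and A(2) as given above; otherwise the general formula.
  if (p == 1) return(1.77)
  if (p == 2) return(2.78)
  (8*p*(p + 2)*(p + 4)*(2*sqrt(pi))^p / ((2*p + 1)*cp(p)))^(1/(p + 4))
}
hspan <- function(p, n) Ap(p) * n^(-1/(p + 4))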
Method E2
The second estimator is based on a slight modification of an approach described in Hosmer
and Lemeshow (1989), which, unlike the other estimators considered here, is limited to p = 1.
Let xi = (Xi−M)/MADN, where M is the usual median based on X1, . . . , Xn, MAD is the
median of |X1 − M |, . . . , |Xn − M | and MADN is MAD/.6745. Under normality, MADN
estimates the population variance, but unlike the sample variance, MADN is resistant to
outliers. In particular, its breakdown point is .5, the highest possible value, where the
breakdown point of an estimator refers to the minimum proportion of points that must be
altered so that its value can be made arbitrarily large or small. Of course, the median has
a breakdown point of .5 as well.
Let

wi = Ih e^(−(xi−x)²),

where Ih = 1 if |xi − x| < h and Ih = 0 otherwise. The estimate of m(x) is taken to be

m̂(x) = Σ wiYi / Σ wi. (5)
Hosmer and Lemeshow use the same weights except that they do not standardize the Xi values; they simply use (Xi − x)². This is unsatisfactory, however, because a change in scale can substantially alter the weights, resulting in a highly inaccurate estimate of m(x). The indicator function Ih plays the role of a span. If Xi is sufficiently far from x, it is given no weight when estimating m(x). Experience with other smoothers, where Y is not necessarily binary, suggests using h = 1. But the simulation results in section 3 suggest that h = 2 is a somewhat better choice for general use.
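A minimal R sketch of method E2, written under the assumption (left implicit above) that the point of interest is standardized in the same way as the data:

e2 <- function(X, Y, x, h = 2) {
  # mad() in R returns MAD/.6745 by default, that is, MADN.
  z  <- (X - median(X))/mad(X)
  z0 <- (x - median(X))/mad(X)     # standardize the point of interest
  w  <- (abs(z - z0) < h) * exp(-(z - z0)^2)
  sum(w * Y)/sum(w)                # equation (5); NaN if no Xi is within h of x
}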
Method E3
Method E2 is limited to p = 1. A simple and seemingly natural generalization that allows
p ≥ 1 is to use an approach that has obvious similarities to the running interval smoother
(e.g., Wilcox, 2005, section 11.4.4), which corresponds to method E4 outlined below. In
particular, again use (5), but with the weights based on a robust analog of Mahalanobis
distance where the usual covariance matrix is replaced by a measure of scatter that is rela-
tively insensitive to outliers. Here, the minimum volume ellipsoid (MVE) estimator is used
(e.g. Rousseeuw & Leroy, 1987), which also has the highest possible breakdown point, .5.
Roughly, the MVE estimator searches for the ellipsoid containing half of the data that has
the smallest volume. Based on this subset of the data, the mean and covariance matrix are
computed which yields a robust Mahalanobis distance for each of the n observed points.
These distances are used to determine which points are "good", meaning that the squared
distances do not exceed the .975 quantile of a chi-squared distribution with p degrees of
freedom. Here, as is typically done, the measures of location and scatter are recomputed
using the good points only, which are henceforth called the MVE estimates. There are many
other multivariate robust measures of scatter (e.g., Wilcox, 2005, chapter 6); perhaps some other choice has practical value for the situation at hand, but this is not pursued here.
Let S be the MVE measure of scatter and let

di = (Xi − x)′S^(−1)(Xi − x).

Now the weights are given by

wi = Ih e^(−di),

and again m(x) is computed with (5). Here, Ih = 1 if √di < h and Ih = 0 otherwise. Note that for p = 1, the MVE estimate of scatter will differ somewhat from the estimate based on MAD/.6745. So for p = 1, E2 and E3 can yield slightly different results.
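A rough R sketch of method E3, using cov.rob from the MASS package for the raw MVE fit and coding the chi-squared reweighting step explicitly; logSM is the author's implementation, so the version below is only illustrative:

library(MASS)

e3 <- function(X, Y, x, h = 2) {
  X <- as.matrix(X)
  p <- ncol(X)
  raw  <- cov.rob(X, method = "mve")           # raw MVE center and scatter
  d2   <- mahalanobis(X, raw$center, raw$cov)  # squared robust distances
  good <- d2 <= qchisq(.975, df = p)           # flag the "good" points
  S    <- cov(X[good, , drop = FALSE])         # reweighted scatter estimate
  di   <- mahalanobis(X, x, S)                 # squared distance of each Xi from x
  w    <- (sqrt(di) < h) * exp(-di)
  sum(w * Y)/sum(w)                            # equation (5)
}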
Method E4
The final method is based on what is known as the running interval smoother (e.g.
Wilcox, 2005, section 11.4.4). First consider p = 1. The point x is said to be close to Xi if
|Xi − x| ≤ k ×MADN,
where MADN=MAD/.6745. So for normal distributions, x is considered to be close to Xi if
x is within k standard deviations of Xi. Let
N(x) = {i : |Xi − x| ≤ k × MADN}.
So N(x) indexes the set of all Xi values that are close to x. The estimate of m(x) is just the
proportion of Yi values equal to 1 for which i is an element of N(x). That is, use all of the Yi
values for which Xi is close to x. Following Wilcox (2005), the running interval smoother can
be generalized to more than one predictor by replacing MADN with the minimum volume ellipsoid estimate of scatter, S, and by again measuring the distance between x and Xi with di, as was done in method E3. Now m(x) is the mean of the Yi values for which √di < k. From Wilcox (2005, section 11.4.4), a seemingly good choice for k is 0.8 or 1. But the simulation results in section 3 suggest that k = 1.2 performs relatively well.
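For p = 1 the running interval smoother amounts to a few lines of R; a minimal sketch (for p > 1 the indicator would instead be based on the robust distances di used by method E3):

e4 <- function(X, Y, x, k = 1.2) {
  madn  <- mad(X)                  # MAD/.6745
  close <- abs(X - x) <= k * madn  # the set N(x)
  mean(Y[close])                   # proportion of Y = 1 among the close points
}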
3 Simulation Results
Simulations were used to assess the bias and mean squared error of the four estimators in
section 2. Bias was measured with
U = (1/(Bn)) Σ_b Σ_i (m̂b(Xi) − m(Xi)),

where the sums are over b = 1, . . . , B and i = 1, . . . , n, B is the number of replications used in the simulations, and m̂b(Xi) is the estimate of m(Xi) based on the bth replication. Here, B = 1000 was used. Mean squared error was measured with

V = (1/(Bn)) Σ_b Σ_i (m̂b(Xi) − m(Xi))².
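In code, with a B-by-n matrix of estimates, both summaries are immediate; mhat and mtrue are hypothetical names:

# mhat[b, i] is the estimate of m(Xi) from replication b;
# mtrue holds the true values m(X1), ..., m(Xn).
err <- sweep(mhat, 2, mtrue)   # subtract each true value from its column
U <- mean(err)                 # estimated bias
V <- mean(err^2)               # estimated mean squared error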
For p = 1, the true regression line was taken to have the form

m(x) = exp(β1x + β2x²)/(1 + exp(β1x + β2x²)), (6)

where the choices for the regression coefficients were (β1, β2) = (0, 0), (0, 1), (1, 0) and (1, 1). For p = 2,

m(x) = exp(β1x1 + β2x1²)/(1 + exp(β1x1 + β2x1²))

was used. Some simulations were run when

m(x) = exp(β1(x1 + x2) + β2(x1² + x2²))/(1 + exp(β1(x1 + x2) + β2(x1² + x2²))),

but no additional insights were obtained, so for brevity the results are not reported.
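For instance, one replication for p = 1 under (6) can be generated along the following lines, shown here for the coefficient pair (β1, β2) = (1, 1):

b1 <- 1; b2 <- 1
n <- 40
x <- rnorm(n)   # or g-and-h values, as described next
m <- exp(b1*x + b2*x^2)/(1 + exp(b1*x + b2*x^2))
y <- rbinom(n, 1, m)   # binary responses with P(Y = 1 | x) = m(x)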
For p = 1, four distributions were used to generate X values: normal, symmetric and
heavy-tailed, asymmetric and relatively light-tailed, and asymmetric and heavy-tailed. More
precisely, X values were generated from a g-and-h distribution. To elaborate, let Z be a
standard normal random variable. Then
W = {(exp(gZ) − 1)/g} exp(hZ²/2), if g > 0,
W = Z exp(hZ²/2), if g = 0,
has a g-and-h distribution, where g and h are parameters that determine the first four moments (Hoaglin, 1985). The standard normal distribution corresponds to g = h = 0. The case g = 0 corresponds to a symmetric distribution, and as g increases, skewness increases as well. The parameter h determines tail thickness. As h increases, the g-and-h distribution becomes more heavy-tailed, roughly meaning that the probability of generating an outlier increases. The four distributions considered here are (g, h) = (0, 0), (0, .5), (.5, 0) and (.5, .5). Table 1 summarizes the skewness (κ1) and kurtosis (κ2) for the g-and-h distributions used in the simulations. When h > 1/k, E(X − µ)^k is not defined and the corresponding entry in Table 1 is left blank.
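A sketch of a generator for these X values, following the transformation above (rgh is a hypothetical name):

rgh <- function(n, g = 0, h = 0) {
  # Transform standard normals to the g-and-h distribution.
  Z <- rnorm(n)
  if (g > 0) ((exp(g*Z) - 1)/g) * exp(h*Z^2/2)
  else Z * exp(h*Z^2/2)
}
x <- rgh(40, g = .5, h = .5)   # the asymmetric, heavy-tailed case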
For p > 1, the marginal distributions of X were taken to be one of the four g-and-h
distributions just described, with the correlation among all variables set equal to ρ = 0 or
.5. (The R function rmul, which belongs to the library of R functions in Wilcox, 2005, was
used to generate values for X as just described.) Altering the correlation did impact the
estimated mean squared error, but the relative merits of the estimators were unaltered, so
only results for ρ = 0 are reported.

Table 1: Some properties of the g-and-h distribution.

  g     h     κ1    κ2
 0.0   0.0   0.00  3.00
 0.0   0.5   0.00    —
 0.5   0.0   1.75   8.9
 0.5   0.5     —     —
Table 2 shows the estimated mean squared error when p = 1 and n = 40. No single esti-
mator dominates and in general there seems to be little separating the methods. When using
method E4, it seems that using k = 1.2 is slightly preferable to using k = .8. Methods E2,
E3 and E4 had bias estimates that ranged between −.009 and .009. The same was generally
true for method E1, but in some situations the estimate was less than −.02, suggesting that
it is the least satisfactory of the estimators considered.
Table 3 shows the estimated mean squared error when p = 2. Now there are more marked
differences among the estimators compared to p = 1. In general, method E3 with h = 2
gave the best results. Method E4 with k = 1.2 is a close second. There are situations where
E1 performs about as well as the other methods. But there are instances where this is not the case, and no situations were found where it provides a substantial advantage, which suggests that it be excluded from consideration. Again, bias was found to be negligible when using methods E2, E3 and E4. Bias was generally low when using method E1, but in some situations the estimated bias was less than −.04.
Table 2: Estimated mean squared error, n = 40, p = 1

  g    h   β1  β2    E1     E2            E3            E4
                          h=1    h=2    h=1    h=2    k=.8   k=1.2
 0.0  0.0   0   0   .020   .017   .012   .020   .014   .020   .014
            1   0   .017   .014   .014   .015   .014   .016   .015
            0   1   .016   .015   .016   .014   .015   .016   .019
            1   1   .016   .015   .017   .015   .015   .016   .020
 0.0  0.5   0   0   .032   .039   .030   .035   .026   .038   .029
            1   0   .017   .015   .014   .015   .014   .016   .016
            0   1   .016   .015   .015   .014   .014   .015   .019
            1   1   .016   .015   .017   .014   .015   .016   .021
 0.5  0.0   0   0   .021   .020   .016   .025   .020   .023   .016
            1   0   .017   .014   .013   .016   .014   .016   .016
            0   1   .017   .014   .013   .015   .014   .015   .015
            1   1   .016   .012   .013   .013   .012   .013   .015
 0.5  0.5   0   0   .032   .035   .027   .039   .031   .039   .029
            1   0   .019   .015   .015   .016   .014   .017   .017
            0   1   .016   .014   .015   .014   .014   .015   .018
            1   1   .016   .014   .016   .014   .014   .015   .020
Some additional simulations were run with p = 4. Methods E3 and E4 again had low
bias, but method E1 was relatively unsatisfactory in some situations. Again method E3 with
h = 2 was found to be generally best in terms of mean squared error.
4 An Illustration
Kyphosis is a postoperative spinal deformity. Hastie and Tibshirani (1990) report data on
the presence or absence of kyphosis versus the age of the patient, in months, the number of
vertebrae involved in the spinal operation, and a variable called start, which is the beginning
of the range of vertebrae involved. Figure 1 shows an approximation of the regression surface
using method E3 based on age and the variable start. (Method E4 gives similar results.)
Figure 2 shows an estimate of the regression surface using the standard logistic regression
model given by equation (3). As is evident, Figures 1 and 2 give a strikingly different
perception of the association. Using the standard logistic regression model, age has a p-value of .107 and the variable start has a p-value of .0002. However, if a quadratic term for age is included in the model, the linear and quadratic terms for age have p-values of .018 and .036, respectively.
Table 3: Estimated mean squared error, n = 40, p = 2

  g    h   β1  β2    E1     E3            E4
                          h=1    h=2    k=.8   k=1.2
 0.0  0.0   0   0   .061   .061   .036   .057   .045
            1   0   .053   .046   .026   .044   .036
            0   1   .060   .042   .026   .041   .035
            1   1   .063   .043   .029   .042   .035
 0.0  0.5   0   0   .076   .090   .077   .094   .069
            1   0   .047   .063   .038   .053   .044
            0   1   .067   .047   .036   .046   .042
            1   1   .070   .048   .035   .049   .043
 0.5  0.0   0   0   .061   .073   .048   .070   .057
            1   0   .058   .053   .031   .051   .038
            0   1   .067   .047   .034   .046   .040
            1   1   .064   .045   .031   .043   .038
 0.5  0.5   0   0   .074   .095   .071   .092   .080
            1   0   .051   .055   .038   .054   .045
            0   1   .069   .048   .038   .048   .044
            1   1   .073   .049   .038   .049   .044
Figure 1: An approximation of the regression surface using the kyphosis data and method E3 (axes: Age and Start).
Figure 2: An approximation of the regression surface, using the kyphosis data, assuming the usual logistic regression model given by (3) (axes: Age and Start).
If the variable start is replaced by the number of vertebrae involved,
again a smooth gives a noticeably different perception of the association compared to the
usual logistic regression model.
5 Concluding Remarks
In summary, no method dominated and situations were found where each method compares
well to the others. But in terms of choosing a method for routine use, the results reported
here suggest method E3 or perhaps E4. Method E2 performs reasonably well but is limited
to p = 1. When p = 1, there appears to be little reason for preferring E2 over E3 and
E4. Method E1 compares very well to the other methods, in terms of both mean squared
error and bias, when p = 4 and m(x) = .5. But its bias can be relatively high, it never demonstrated a striking advantage when p < 4, and the other three estimators have relatively low bias in all situations considered, so method E1 seems to be the least
satisfactory estimator. Perhaps its performance can be improved by replacing the adaptive
kernel estimator used here with some other kernel density estimator, but this remains to be
determined.
Some additional simulations were run where larger values for the spans, h and k, were
used. This did not improve any of the estimators considered. When using method E3, for
example, increasing the span to h = 2 or even h = 5 had a negligible effect on the estimated
mean squared error. However, when using method E4, some caution is needed when choosing
the span. If the span is too large, curvature, and indeed a true association, can be missed.
And if the span is too small, a ragged regression line (or surface) can result. When using
method E3, altering the span h can again affect the smoothness of the resulting graph. But
as the span increases, it appears that true curvature is still detected, which is not necessarily
the case when using method E4.
R functions for applying the methods in this paper are available from the author. The
functions bkreg, logrsm, logSM and rplot.bin perform methods E1, E2, E3 and E4, respectively.
References
Cleveland, W. S. (1979). Robust locally weighted regression and smoothing
scatterplots. Journal of the American Statistical Association, 74,
829–836.
Efromovich, S. (1999). Nonparametric Curve Estimation: Methods, Theory and
Applications. New York: Springer-Verlag.
Eubank, R. L. (1999). Nonparametric Regression and Spline Smoothing.
New York: Marcel Dekker.
Fan, J. (1993). Local linear smoothers and their minimax efficiencies.
The Annals of Statistics, 21, 196–216.
Fan, J. & Gijbels, I. (1996). Local Polynomial Modeling and Its Applications.
Boca Raton, FL: CRC Press.
Fox, J. (2001). Multiple and Generalized Nonparametric Regression.
Thousand Oaks, CA: Sage.
Green, P. J. & Silverman, B. W. (1993). Nonparametric Regression and Generalized
Linear Models: A Roughness Penalty Approach. Boca Raton, FL: CRC Press.
Gyorfi, L., Kohler, M., Krzyzak, A. & Walk, H. (2002).
A Distribution-Free Theory of Nonparametric Regression. New York:
Springer Verlag.
Hardle, W. (1990). Applied Nonparametric Regression. Econometric
Society Monographs No. 19, Cambridge, UK:
Cambridge University Press.
Hastie, T. J. & Tibshirani, R. J. (1990). Generalized Additive Models.
New York: Chapman and Hall.
Hoaglin, D. C. (1985). Summarizing shape numerically: The g-and-h distributions. In D. Hoaglin, F. Mosteller & J. Tukey (Eds.), Exploring Data Tables, Trends, and Shapes (pp. 461–515). New York: Wiley.
Hosmer, D. W. & Lemeshow, S. (1989). Applied Logistic Regression.
New York: Wiley.
Kay, R. & Little, S. (1987). Transformation of the explanatory variables in
the logistic regression model for binary data. Biometrika, 74, 495–501.
Rousseeuw, P. J. & Leroy, A. M. (1987). Robust Regression & Outlier
Detection. New York: Wiley.
Signorini, D. F. & Jones, M. C. (2004). Kernel estimators for univariate
binary regression. Journal of the American Statistical Association, 99,
119–126.
Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis.
New York: Chapman and Hall.
Wilcox, R. R. (2005). Introduction to Robust Estimation and Hypothesis
Testing, 2nd Ed. San Diego, CA: Academic Press.