
ESTIMATION OF INTEGRATED SQUARED DENSITY DERIVATIVES

by

Brian Kent Aldershof

A dissertation submitted to the faculty of The University of North Carolina at Chapel Hill in

partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department

of Statistics.

Chapel Hill

1991

Advisor

Reader

Reader


BRIAN KENT ALDERSHOF. Estimation of Integrated Squared Density Derivatives

(under the direction of J. Steven Marron)

ABSTRACT

The dissertation research examines smoothing estimates of integrated squared density

derivatives. The estimators discussed are derived by substituting a kernel density estimate into

the functional being estimated. A basic estimator is derived and then many modifications of it

are explored. A set of similar bias-reduced estimators based on jackknife techniques, higher

order kernels, and other modifications are compared. Many of these bias reduction techniques

are shown to be equivalent. A computationally more efficient estimator based on binning is

presented. The proper way to bin is established so that the binned estimator has the same

asymptotic MSE convergence rate as the basic estimator.

Asymptotic results are evaluated by using exact calculations based on Gaussian mixture

densities. It is shown that in some cases the asymptotic results can be quite misleading, while

in others they approximate truth acceptably well.

A set of estimators is presented that relies on estimating similar functionals of higher

derivatives. It is shown that there is an optimal number of functionals that should be

estimated, but that this number depends on the density and the sample size. In general, the

number of functionals estimated is itself a smoothing parameter. These results are explored

through asymptotic calculations and some simulation studies.


ACKNOWLEDGEMENTS

I am very grateful for the encouragement, support, and guidance of my advisor Dr. J.

Steven Marron. His insights and intuition led to many of the results here. His patient

encouragement helped me through rough times. Thanks, Steve.

I am grateful to the people who supported me and my family throughout my years in

Graduate School. In particular, thanks to my mother who always helped out. Thanks also to

my in-laws for their support.

Most of all, I want to thank my wife and daughter. My family has always been loving

and supportive despite Graduate School poverty and uncertainty. Welcome to the world, Nick.


TABLE OF CONTENTS

Page

LIST OF TABLES vi

LIST OF FIGURES vii

Chapter

I. Introduction and Literature Review

1. Introduction 1

2. Literature Review 3

II. Diagonal Terms

1. Introduction 13

2. Bias Reduction 14

3. Mean Squared Error Reduction 15

4. Computation 23

5. Stepped Estimators 24

III. Bias Reduction

1. Introduction 25

2. Notation 26

3. Higher Order Kernel Estimators 26

4. D-Estimators 27

5. Generalized Jackknife Estimators 28

6. Higher Order Generalized Jackknife Estimators 29

7. Relationships Among Bias Reduction Estimators 30

8. Theorems 32

9. Example 34

10. Proofs 38

IV. Computation

1. Introduction 48

2. Notation 49

3. The Histogram Binned Estimator 50

4. Computation of θ̂_m(h, n, K) 53

5. Generalized Bin Estimator 54

6. Proofs 57


V. Asymptotics and Exact Calculations

1. Introduction 69

2. Comparison of Asymptotic and Exact Risks 69

3. Exact MSE Calculations 73

4. Examples 77

5. Proofs 80

VI. Estimability of θ_m and m

1. Introduction 84

2. Asymptotic Calculations 84

3. Exact MSE Calculations 86

VII. The One-Step Estimator

1. Introduction 92

2. Assumptions and Notation 93

3. Results 94

4. Figures 100

5. Conclusions 101

6. Proofs 106

7. C_m(t) and calculating the skewness of θ̂_m(h, n, K) 115

VIII. The K-Step Estimator

1. Introduction 118

2. Assumptions and Notation 120

3. Results 121

4. Simulations 125

5. Conclusions 127

6. Proofs 132

Appendix A viii


Table 2.1:

Table 2.2:

Table 6.1a:

Table 6.1b:

LIST OF TABLES

Exact Asymptotic Values of MSE/θ₂² 21

"Plug-in" Values of MSE/θ₂² 22

Values of N_{1/2}(m) for m = 0, ..., 5; Distns 1-8 89

Values of N_{1/2}(m) for m = 0, ..., 5; Distns 9-15 90


Figure 3.1:

Figure 3.2a:

Figure 3.2b:

Figure 3.2c

Figure 5.1:

Figure 5.2:

Figure 6.1a:

Figure 6.1b:

Figure 7.1a:

Figure 7.1b:

Figure 7.2a:

Figure 7.2b:

Figure 7.3a:

Figure 7.3b:

Figure 7.4a:

Figure 7.4b:

Figure 8.1a:

Figure 8.1b:

Figure 8.2a:

Figure 8.2b:

Figure 8.3a:

Figure 8.3b:

Figure 8.4:

LIST OF FIGURES

Equivalences of Bias Reduction Techniques 31

D-estimator kernels 36

Jackknife kernels 36

Second-order kernel 37

MSE vs log(Bandwidth) (Distn #4; Sample Size = 1000) 79

MSE vs log(Bandwidth) (Distn #11; Sample Size = 1000) 79

N_tol(0) vs tolerance 91

N_tol(1) vs tolerance 91

MSE vs Bandwidth (Distn #2; Sample Size = 250) 103

MSE vs Bandwidth (Distn #2; Sample Size = 1000) 103

MSE vs log(Bandwidth) (Distn #2; Sample Size = 250) 104

MSE vs log(Bandwidth) (Distn #2; Sample Size = 1000) 104

MSE vs Bandwidth (Distn #2; Sample Size = 250) 105

MSE vs Bandwidth (Distn #2; Sample Size = 1000) 105

C2(T) vs T (Distn #1; Sample Size = 250) 117

C2(T) vs T (Distn #1; Sample Size = 1000) 117

Theta-hat densities (Distn #6; Squared 1st Derivative; SS = 100; 100 Samples) 128

Theta-hat densities (Distn #6; Squared 1st Derivative; SS = 100; 100 Samples) 128

Theta-hat densities (Distn #3; Squared 1st Derivative; SS = 100; 100 Samples) 129

Bandwidth densities (Distn #3; Squared 1st Derivative; SS = 100; 100 Samples) 129

MSE vs Step (Distn #6; Sample Size = 100) 130

MSE vs Step (Distn #6; Sample Size = 500) 130

MSE vs Step (Distn #3; Sample Size = 100) 131


Chapter I: Introduction and Literature Review

1. Introduction

This research discusses a class of estimators of the functional θ_m = ∫ (f^(m)(x))² dx for f a probability density function. The estimators discussed in this dissertation are of the form:

θ̂_m = (-1)^m n^{-2} Σ_{i≠j} K_h^{(2m)}(X_i - X_j) + D        (1.1)

for some D which may or may not be a function of the data. The goal of the research is to

discuss the behavior of these estimators in a variety of settings and to provide guidelines for

computing them.

Chapter II discusses possible choices of D given in (1.1). D can be chosen to reduce bias

simply by including the diagonal terms of the double sum, thereby making θ̂ necessarily positive.

In many settings, this estimator performs better than a "leave-out-the-diagonals" version with

D = O. A possibly better estimator chosen to reduce MSE is also given although with reasonable

sample sizes this did not perform as well as hoped.

Chapter III discusses three strategies for reducing bias in θ̂. The "leave-in-the-diagonals" estimator is some improvement over the "no-diagonals" estimator because it reduces

bias with some choice of bandwidth. More sophisticated strategies for reducing bias can improve

the estimator even more (at least with a sufficiently large sample size). The three strategies

explored are jackknifing, higher order kernels, and special choices of D (as in (1.1)). Some

equivalences between these techniques are given.


Chapter IV discusses computation of the estimators using a strategy called "binning".

Binning is used to reduce the number of computationally intensive kernel evaluations. An

optimal binning strategy is given in this chapter.

Chapter V discusses asymptotic calculations. Many of the guidelines presented in this

dissertation and in many areas of density estimation rely on asymptotic results. This chapter

provides some comparison of these results to exact results. A tool used to do this is exact

calculations based on Gaussian mixtures. Theorems required to do these calculations are

presented here.

Chapter VI provides some guidelines about the relative difficulty of estimating θ_m compared to θ_{m+k}. Both asymptotic and exact calculations are used to show that it gets much more difficult to estimate θ_m for greater m. It is also shown that the asymptotic calculations

become increasingly misleading in this context as m increases.

Chapter VII discusses a "one-step" estimator for choosing the bandwidth of θ̂_m. The

estimator is based on estimating θ_{m+1} first and using this estimate to find an "optimal"

bandwidth for θ̂_m. The results given here suggest that there may be some advantages to using

the one-step estimator in that it might be easier to choose a near-optimal bandwidth. The

penalty for using this estimator is that it has a greater minimum MSE than the standard

estimator.

Chapter VIII discusses a generalization of the one-step estimator called the "k-step"

estimator. In the "k-step" estimator all the functionals (Jm+!' .•., (Jm+k are estimated and used

to provide a bandwidth for Bm . It is shown there that with a finite sample size and a "plug-in"

bandwidth, MSE(Bm ) decreases for the first couple of steps and ultimately increases without

bound.


The remainder of this chapter is a literature review of previous research on the topic.

2. Literature Review.

The dissertation research concerns the study of the estimation of functionals of the form

θ_m(f) = ∫ (f^(m)(x))² dx. Although there are several ways of estimating functionals of this

type, only kernel-type estimators will be investigated here. A kernel estimator f̂_n(x) of f(x) is

defined as

f̂_n(x) = n^{-1} Σ_{i=1}^{n} K_{h_n}(x - X_i),

where K_h(x) = (1/h) K(x/h), K is bounded and symmetric with ∫ K(u) du = 1, h_n is the

bandwidth, and n is the sample size. Estimators of θ_m(f) that will be studied here will be based

on this type of kernel estimator although other possibilities are discussed below.
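The definition above translates directly into code. Below is a minimal Python sketch of f̂_n; the Gaussian kernel, the bandwidth h = 0.4, and the simulated sample are illustrative choices of mine, not ones prescribed by the dissertation:

```python
import math
import random

def gauss_kernel(u):
    # standard Gaussian kernel K(u): bounded, symmetric, integrates to 1
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, data, h):
    # f_hat_n(x) = n^{-1} * sum_i K_h(x - X_i), with K_h(u) = (1/h) K(u/h)
    n = len(data)
    return sum(gauss_kernel((x - xi) / h) for xi in data) / (n * h)

random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(500)]
fhat_at_zero = kde(0.0, sample, h=0.4)  # true f(0) for N(0,1) is about 0.3989
```

Evaluating kde over a grid of x values traces out the whole density estimate.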

2.1) Estimating θ_0(f) = ∫ f²(x) dx.

Estimating θ_0(f) is an important problem in calculating the asymptotic relative efficiencies (ARE) of non-parametric, rank-based statistics. For example, consider testing H_0: G(x) = F(x) vs. H_1: G(x) = F(x - θ), where θ > 0, F and G are unknown, and F has density f and variance σ². The ARE of the Wilcoxon test to the t-test is 12σ²θ_0(f)² (Hodges and Lehmann, 1956).

Lehmann, 1956). The same result holds in the more general analysis of variance problem for the

ARE of the Kruskal-Wallis test relative to the standard F-test. The problem of estimating θ_0(f)

naturally arises from forming a data-based estimate of these ARE's.

θ_0(f) can also be used in computing the L₂ distance between two densities. In this setting, estimation of θ_0(f) has arisen in pattern recognition (Patrick and Fischer, 1969), data


reduction (Fukunaga and Mantock, 1984), and other areas (Pawlak, 1986).

A straightforward estimator of θ_0(f) is derived by squaring and integrating an estimate of f. Bhattacharya and Roussas (1969) proposed θ*_0(f̂_n) = ∫ f̂_n²(x) dx as an estimator of θ_0(f), where f̂_n(x) is a kernel density estimator. Substitution of f̂_n(x) gives

θ*_0(f̂_n) = n^{-2} Σ_i Σ_j (K_{h_n} * K_{h_n})(X_i - X_j),

where * denotes convolution. Ahmad (1976) proved that with suitably chosen bandwidths, θ*_0(f̂_n) → θ_0(f) with probability 1 and established a rate of convergence.

Another estimator of θ_0(f) is motivated by noting that:

∫ f²(x) dx = E(f(X)) = ∫ f(x) dF(x).

An estimator of θ_0(f) proposed by Schuster (1974) is θ_0(f̂_n) = ∫ f̂_n(x) dF_n(x), where F_n(x) is the empirical distribution function. Substitution gives

θ_0(f̂_n) = n^{-2} Σ_i Σ_j K_{h_n}(X_i - X_j).

Notice that θ*_0(f̂_n) can always be expressed as θ_0(f̂_n). For kernel functions, θ_0(f̂_n) uses K_{h_n} and θ*_0(f̂_n) uses K_{h_n} * K_{h_n}. A simple change of variables shows that K_h * K_h = (K * K)_h, so the class of estimators θ*_0(f̂_n) is the class of θ_0(f̂_n) restricted to kernels which are convolutions. If K is a Gaussian density, θ_0(f̂_n) and θ*_0(f̂_n) are identical except for different bandwidths. Since they are the same estimators, it seems that θ_0(f̂_n) and θ*_0(f̂_n) should be equally good estimators of θ_0(f). Indeed, Schuster (1974) showed that θ_0(f̂_n) and θ*_0(f̂_n) converge to θ_0(f) at the same rate. Ahmad (1976) proved that under some conditions (fewer than used by Schuster) θ_0(f̂_n) is


strongly consistent and asymptotically normal.
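The remark that θ*_0(f̂_n) is θ_0(f̂_n) with a convolved kernel can be checked numerically. For a Gaussian kernel, K_h * K_h is again Gaussian with scale sqrt(2)·h, so the ∫ f̂_n² estimator at bandwidth h coincides with Schuster's double sum at bandwidth sqrt(2)·h. A small Python sketch; the sample and bandwidth are illustrative choices:

```python
import math
import random

def gauss(u, s):
    # Gaussian density with mean 0 and standard deviation s
    return math.exp(-0.5 * (u / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

def theta0_hat(data, h):
    # Schuster's form: n^{-2} sum_i sum_j K_h(X_i - X_j), Gaussian K
    n = len(data)
    return sum(gauss(xi - xj, h) for xi in data for xj in data) / n ** 2

def theta0_star_hat(data, h):
    # int f_hat^2 = n^{-2} sum_i sum_j (K_h * K_h)(X_i - X_j); for a
    # Gaussian kernel K_h * K_h is Gaussian with scale sqrt(2) * h
    return theta0_hat(data, math.sqrt(2.0) * h)

random.seed(1)
sample = [random.gauss(0.0, 1.0) for _ in range(200)]
t_schuster = theta0_hat(sample, 0.5)      # true theta_0 = 1/(2 sqrt(pi)) ~ 0.282
t_squared = theta0_star_hat(sample, 0.5)  # same statistic at bandwidth sqrt(2)*0.5
```

The two estimators differ only through their effective bandwidths, which is exactly Schuster's observation.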

Dimitriev and Tarasenko (1974) arrive at a similar estimator by considering U-statistics. The resulting "quasi-U-statistic" for θ_0(f) is a kernel double sum over the pairs (X_i, Y_j), where {X_i} and {Y_j} are possibly different samples from density f. Dimitriev and Tarasenko establish mean square error (MSE) convergence of the estimator, its asymptotic unbiasedness, and calculate its asymptotic variance, 4[∫ f³(x) dx - (∫ f²(x) dx)²].

Schweder (1975) studied small-sample properties of θ_0(f̂_n), including bandwidth selection and choice of kernel. Schweder suggested using a bandwidth that eliminates low order terms in the Taylor expansion of the bias. This bandwidth is

h_n = [W(n - 1)]^{-1/3},   where   W = (∫ (f'(x))² dx)(∫ u² K(u) du).

The optimal choice of kernel is not as clear, although Schweder suggests using the rectangular density because it is computationally simple. Schweder also points out that the diagonal terms in θ_0(f̂_n), i.e., those with i = j, are not data-dependent and thus could be treated separately. He suggests the estimator θ̃ where

θ̃ = n^{-2} Σ_{i≠j} K_{h_n}(X_i - X_j) + D,

and D is chosen to eliminate the first-order term in the asymptotic expansion of the bias. Schweder


does not explore this any further, suggesting it is only useful for n large. He suggests two data-based estimators of W. The first is to rewrite f(x) = δ^{-1} f₁(δ^{-1}x) and estimate δ using the interquartile range. A Pearson curve is then fitted to f₁(x) and the integral of its squared derivative calculated analytically or numerically. The second is a two-stage estimator in which ∫ (f'(x))² dx is estimated using another kernel estimator. Of course, this second estimator

requires another bandwidth selection but Schweder's simulation results suggest that a fixed

bandwidth gives reasonable results.

Cheng and Serfling (1981) investigate estimation of a wider class of functionals that includes θ_0(f). The estimator they propose allows for a wide class of density estimators including the kernel-type. They establish strong convergence of θ_0(f̂_n) to θ_0(f) with rate O(n^{-1/2}(log n)^{1/2}) for sufficiently smooth f with suitable choice of kernel and bandwidth.

Aubuchon and Hettmansperger (1984) use small-sample simulations to compare θ_0(f̂_n) with an estimator similar to Schweder's θ̃ with the D-type adjustment. Although Schweder felt that the D adjustment was only useful for large n, Aubuchon and Hettmansperger's simulations suggest otherwise. Even with sample sizes as small as n = 10 they find the D-estimator is superior to θ_0(f̂_n). The D-type adjustment requires an estimate of ∫ (f'(x))² dx. For this estimation, they use a data-based estimate of a scale parameter but then they assume a known distribution (e.g., Gaussian). They calculate the integral of its squared derivative directly.

Koul, Sievers, and McKean (1987) motivate and study a particular kernel estimator for θ_0(f). If ε, η are i.i.d. with distribution function (d.f.) F, then the d.f. of |ε - η| is

H(y) = ∫_{-∞}^{+∞} {F(x + y) - F(x - y)} dF(x).

Hence, if F has density f, then the density of H at 0 is 2θ_0(f). Substituting the empirical


distribution function for F leads to

H_n(y) = n^{-2} Σ_i Σ_j I(|X_i - X_j| ≤ y),

where I(A) is the indicator function of event A. An estimator θ̂′ of θ_0(f) is derived by choosing h_n near 0 and

θ̂′ = H_n(h_n) / (2h_n).

Rewriting gives

θ̂′ = n^{-2} Σ_i Σ_j K_{h_n}(X_i - X_j),

where K(u) is the uniform density function over the interval [-1, 1]. This is, of course, simply θ_0(f̂_n) with a uniform kernel. They study this estimator in the context of estimating functionals of a residual density from a regression problem. Since they must deal with non-i.i.d. residuals, their results for θ̂′ are not very strong.

Hall and Marron (1987) investigate estimators similar to θ*_0(f̂_n) and θ_0(f̂_n). The estimators they propose are identical except that the "diagonal" terms K_{h_n} * K_{h_n}(0) and K_{h_n}(0) are omitted, since they do not depend on the data. They establish MSE convergence of their estimators to θ_0(f) at the parametric rate O(n^{-1}) with sufficient smoothness of f. A more complete discussion of Hall and Marron's results is in section III.
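Hall and Marron's modification is easy to state in code: drop the i = j terms and renormalize. A Python sketch for m = 0 with a Gaussian kernel; the sample and bandwidth are illustrative, and normalizing by the n(n - 1) retained terms is my choice here (an n^{-2} normalization differs only by the factor (n - 1)/n):

```python
import math
import random

def gauss(u, s):
    # Gaussian density with mean 0 and standard deviation s
    return math.exp(-0.5 * (u / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

def theta0_leave_out(data, h):
    # omit the i = j "diagonal" terms, which equal K_h(0) and carry no
    # information about f; average the n(n-1) remaining kernel evaluations
    n = len(data)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += gauss(data[i] - data[j], h)
    return total / (n * (n - 1))

random.seed(4)
sample = [random.gauss(0.0, 1.0) for _ in range(150)]
est = theta0_leave_out(sample, 0.5)  # true theta_0 = 1/(2 sqrt(pi)) ~ 0.282
```

With two observations the estimator reduces to a single kernel evaluation at their difference.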

Ritov and Bickel (1987) establish the semiparametric information bound for the estimation of θ_0(f). They show that there is no rate which can be achieved uniformly over certain classes of distributions. Further, for any sequence of estimates {θ̂_k} there exists a distribution F (with density f) such that n^γ(θ̂_k - θ_0(f)) does not converge to 0 for any γ > 0.


Bickel and Ritov (1988) investigate a complicated "one-step" estimator based on kernel estimators. They examine the smoothness requirements on f for estimation of θ_0(f) that allow convergence of their estimator at rate O(n^{-1/2}). They establish smoothness conditions for f that allow their estimator to achieve the semiparametric information bound and prove that there is no efficient estimator for f less smooth. A more complete discussion of Bickel and Ritov's results is in section 2.3.

There are several other approaches to estimating θ_0(f) that will not be investigated here. One approach is based on orthogonal polynomials. If {φ_i(x)} is an orthonormal basis for L², then it can be shown that an estimator for θ_0(f) is

θ̂_0 = Σ_{i=1}^{q(n)} â_i²,   â_i = n^{-1} Σ_{j=1}^{n} φ_i(X_j),

for q(n) → ∞ as n → ∞.

This approach is discussed in Ahmad (1979), Pawlak (1986), and Prakasa-Rao (1983). Another approach to estimating θ_0(f) is based on spacings of order statistics. Define the estimator T_{m,n} as a normalized sum of reciprocals of the m-spacings X_{(j+m)} - X_{(j)}, where n is the sample size, 1 ≤ m ≤ n, and X_{(j)} is the jth order statistic. Then T_{m,n} is a consistent estimator of θ_0(f). This approach is discussed in van Es (1988), Hall (1982), and Miura (1985). Another approach is based on Wilcoxon confidence intervals. Let T(θ) be the


Wilcoxon signed rank statistic, i.e.,

T(θ) = [n(n+1)/2]^{-1} Σ_{i≤j} I(X_i + X_j > 2θ),

where I(A) is the indicator of event A. If θ_L and θ_U are the endpoints of the Wilcoxon confidence interval, then an estimator based on the reciprocal of the interval length θ_U - θ_L is a consistent estimator of θ_0(f) for symmetric f. Aubuchon and Hettmansperger (1984) show that this approach is asymptotically equivalent to the kernel-based method. Further results can be found in Lehmann (1963).

2.2) Estimating θ_2(f) = ∫ (f''(x))² dx

Estimating θ_2(f) is an important problem in bandwidth selection for kernel density estimates. A sensible approach to choosing the bandwidth h_n is to choose it to minimize the mean integrated squared error (MISE), where MISE = E∫(f̂_n - f)². Under some conditions on K and f, MISE is minimized for

h_n ~ n^{-1/5} [ ∫ K²(x) dx / ( {∫ x² K(x) dx}² ∫ (f''(x))² dx ) ]^{1/5}   (Parzen, 1962).

Hence, estimation of θ_2(f) is important for selecting a data-based bandwidth that is asymptotically optimal. Of course, θ_2(f) may be of interest simply as a measure of "curviness" of a distribution.
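For orientation, the minimizer above collapses to a closed form if one takes K Gaussian and pretends f is N(0, σ²): then ∫ K² = 1/(2√π), ∫ x²K(x) dx = 1, and θ_2(f) = 3/(8√π σ⁵), which yields the familiar normal-reference rule h ≈ 1.06 σ n^{-1/5}. A Python sketch; the normal-reference assumption is mine, used only for illustration:

```python
import math

def amise_bandwidth_normal_ref(sigma, n):
    # h^5 = R(K) / ( mu2(K)^2 * theta_2(f) * n )  (Parzen's formula), with
    # Gaussian kernel: R(K) = int K^2 = 1/(2 sqrt(pi)), mu2(K) = int x^2 K = 1,
    # and normal reference: theta_2(N(0, sigma^2)) = 3 / (8 sqrt(pi) sigma^5)
    r_k = 1.0 / (2.0 * math.sqrt(math.pi))
    theta2 = 3.0 / (8.0 * math.sqrt(math.pi) * sigma ** 5)
    return (r_k / (theta2 * n)) ** 0.2

h = amise_bandwidth_normal_ref(1.0, 100)  # (4/3)^{1/5} * 100^{-1/5} ~ 0.42
```

The bandwidth scales linearly in σ and shrinks like n^{-1/5}, as the asymptotic formula requires.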


Unlike the case of θ_0(f), there has not been much work devoted strictly to estimation of θ_2(f). An exception is Wertz (1981), but his estimator is based on orthogonal series rather than kernels. Much of the above work is generalized in Hall and Marron (1987) and Bickel and Ritov (1988) to θ_m(f) for arbitrary m, and their results are discussed in section III.

2.3) Estimating θ_m(f) = ∫ (f^(m)(x))² dx, m arbitrary.

Most of the work on estimation of θ_m(f) is motivated by the previous two cases (m = 0 and m = 2). There is usually little effort in extending mathematical results for "m = 2" to "m finite". θ_m(f) for m > 2 does arise in the asymptotic investigations of some of the estimators that will be studied here. The major difficulty in estimating θ_m(f) is that as m becomes larger, estimation of θ_m(f) requires more data for the same level of performance.

Hall and Marron (1987) generalize θ*_0(f̂_n) and θ_0(f̂_n) (discussed above) to arbitrary m. The estimators they propose and investigate are

θ*_m(f̂_n) = (-1)^m n^{-2} Σ_{i≠j} (K_{h_n} * K_{h_n})^{(2m)}(X_i - X_j)

and

θ_m(f̂_n) = (-1)^m n^{-2} Σ_{i≠j} K_{h_n}^{(2m)}(X_i - X_j).

Notice that these estimators omit the "diagonal terms", i.e., the terms with i = j. These terms do not depend on the data, so they add a component to the estimator depending only on the kernel shape and bandwidth. The effect of omitting these terms will be discussed below.
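For concreteness, the leave-out estimator can be coded for m = 2 with a Gaussian kernel, using K_h^{(4)}(u) = h^{-5} K^{(4)}(u/h) and K^{(4)}(z) = (z⁴ - 6z² + 3)K(z). A Python sketch; the sample, the bandwidth, and the normalization by the n(n - 1) retained terms are my illustrative choices:

```python
import math
import random

def phi(z):
    # standard Gaussian density
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def phi4(z):
    # fourth derivative of the Gaussian kernel: phi''''(z) = (z^4 - 6 z^2 + 3) phi(z)
    return (z ** 4 - 6.0 * z * z + 3.0) * phi(z)

def theta2_hat(data, h):
    # (-1)^m sum_{i != j} K_h^{(2m)}(X_i - X_j) / (n(n-1)) with m = 2,
    # so (-1)^m = 1 and K_h^{(4)}(u) = h^{-5} K^{(4)}(u / h)
    n = len(data)
    total = sum(phi4((data[i] - data[j]) / h)
                for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1) * h ** 5)

random.seed(2)
sample = [random.gauss(0.0, 1.0) for _ in range(300)]
est = theta2_hat(sample, 0.7)  # true theta_2 for N(0,1) is 3/(8 sqrt(pi)) ~ 0.2116
```

Note that theta2_hat([0.0, 1.0], 1.0) reduces to phi4(1.0), which is negative: with the diagonals omitted, an estimate of a nonnegative functional can itself be negative, a point taken up in Chapter II.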

Hall and Marron investigate the MSE convergence of θ*_m(f̂_n) and θ_m(f̂_n) to θ_m(f). The rate of convergence depends on m and the smoothness of f. The density f has smoothness p > 0 if there is a constant M so that

|f^(l)(x) - f^(l)(y)| ≤ M|x - y|^α

for all x and y, 0 ≤ α ≤ 1, and p = l + α. Further, for K a density function, define ν = min(p - m, 2). Hall and Marron establish that for ν ≤ 2m + 1/2 the MSE convergence rate is O(n^{-4ν/(2ν + 4m + 1)}) for a suitable choice of bandwidth. For ν > 2m + 1/2, the MSE convergence rate is O(n^{-1}). For some kernels (not necessarily densities) and sufficiently smooth f, Hall and Marron provide the best exponents of convergence. Under similar smoothness requirements, they also provide the best constants for kernel-type estimates. These rates are not

requirements, they also provide the best constants for kernel-type estimates. These rates are not

quite as fast as the convergence rates given in Bickel and Ritov (1988) for their estimators. The

constants may be useful in bandwidth selection and will be discussed in detail below.

Bickel and Ritov (1988) suggest a new, complicated "one-step" estimator, θ̂_m, for θ_m(f) that is apparently an improvement over earlier estimators. They calculate the semiparametric information bounds for the estimation of θ_m(f). Under smoothness (Hölder) conditions similar to those in Hall and Marron (1987), they show that their estimator is √n-consistent and efficient, i.e., achieves the information bound. Suppose f has smoothness p as defined above and θ_m(f) is to be estimated. Bickel and Ritov require a kernel of order max(m, l - m) + 1, where l is as defined above. Under these conditions (writing J_m(F) for the semiparametric information):

i) If p ≥ 2m + 1/4,

a) θ̂_m is √n-consistent.

b) n J_m(F) E(θ̂_m - θ_m(f))² → 1.

c) L( (n J_m(F))^{1/2} (θ̂_m - θ_m(f)) ) → N(0, 1) for J_m(F) < ∞.

ii) If m < p < 2m + 1/4,

a) E(θ̂_m - θ_m(f))² = O(n^{-2γ}), where γ = 4(p - m)/(1 + 4p).

Further, they show that if p < 2m + 1/4 then their estimator achieves the best possible MSE convergence rate. The proof that their estimator achieves the best possible convergence rate is the main result of the paper. It is a surprising and important result that there should be a changepoint in the convergence rate at p = 2m + 1/4.


Chapter II: Diagonal Terms

1. Introduction.

The estimators discussed in this dissertation are of the form:

θ̂_m = (-1)^m n^{-2} Σ_{i≠j} K_h^{(2m)}(X_i - X_j) + D        (2.1)

for some D which may or may not be a function of the data. Notice that the "diagonal" terms are omitted (although they may be included in D), so there are n(n - 1) terms in the summation.

The Hall and Marron estimators use D = 0. The Sheather-Jones estimators use D = n^{-1}h^{-2m-1}(-1)^m K^{(2m)}(0). Notice that this estimator is simply the full double sum

θ̂_m = (-1)^m n^{-2} Σ_i Σ_j K_h^{(2m)}(X_i - X_j).

With an appropriate choice of the bandwidth, the Sheather and Jones estimators converge in MSE faster than the Hall and Marron estimators because their D eliminates bias. Schweder

(1975) uses a data-based D aimed directly at eliminating bias for estimating θ_0(f).
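The claim that the Sheather-Jones choice of D simply restores the diagonal terms can be verified directly: with the n^{-2} normalization, adding D = n^{-1}h^{-2m-1}(-1)^m K^{(2m)}(0) to the off-diagonal sum reproduces the full double sum. A Python check for m = 0 with a Gaussian kernel; the small data set is illustrative:

```python
import math

def k_h(u, h):
    # Gaussian K_h(u) = (1/h) K(u/h)
    return math.exp(-0.5 * (u / h) ** 2) / (h * math.sqrt(2.0 * math.pi))

def theta0_no_diag(data, h):
    # n^{-2} times the off-diagonal double sum (n(n-1) terms)
    n = len(data)
    return sum(k_h(data[i] - data[j], h)
               for i in range(n) for j in range(n) if i != j) / n ** 2

def theta0_full(data, h):
    # the full n^2-term double sum, i.e. the "leave-in-the-diagonals" estimator
    n = len(data)
    return sum(k_h(xi - xj, h) for xi in data for xj in data) / n ** 2

data = [0.3, 1.1, 2.0, 2.4]
h = 0.5
D = k_h(0.0, h) / len(data)  # m = 0 diagonal term: n^{-1} h^{-1} K(0)
lhs = theta0_no_diag(data, h) + D
rhs = theta0_full(data, h)   # identical: the n diagonals contribute n K_h(0) / n^2
```

The identity holds for any data set and bandwidth, since the n diagonal terms each equal K_h(0).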

Unfortunately, the Sheather-Jones work does not resolve the issue of whether or not to

include the diagonals in the estimator. It seems that there are two philosophical issues central

to this discussion. First, the diagonal terms are non-stochastic. Since the diagonal terms

depend only on the choice of bandwidth and kernel (i.e., not the data) it seems that they cannot

be a useful component of the estimator. It turns out that this position is just barely tenable,

but it seems fairly convincing at first glance. Second, including the diagonal terms forces the

estimator to be positive. This may not seem compelling; squaring the estimator or taking its

absolute value also make it positive. There is even some sense to allowing the estimator to be

negative. A negative value implies that the estimator does not have enough data to get a

reliable estimate of θ_m. A strictly positive estimator may provide a false sense of security in what may be a nonsense estimate based on an impossibly small sample size (see Chapter VI for a

discussion of reasonable sample sizes). The benefits of a positive estimator stem more from

practical than philosophical considerations.

The practical issues involved in deciding whether or not to include the diagonals are

investigated throughout this dissertation, although usually in some other context. The purpose

of this chapter is to introduce and summarize many of those results. The diagonals decision is

the first choice a statistician must make in calculating a kernel estimate of Om' Even though

many of the results discussed here are not proven until later chapters, this choice justifies

including the results so early. The practical issues that will be addressed are bias reduction,

mean squared error, computation, and "stepped" estimation.

2. Bias Reduction.

If the positivity issue is ignored, then the reason for including the diagonal terms is that

with some clever choice of bandwidth including them eliminates some bias. Sheather and Jones

show that this causes a reduction in the asymptotic MSE of the estimator, compared to the

leave-out-the-diagonals estimator. Basically, with a proper choice of bandwidth the diagonals

equal the first term in the Taylor expansion of the bias. This bandwidth is not so small that

the estimator is overwhelmed by variance, so the MSE is reduced.

Minus the positivity issue, this argument is not an adequate reason for using the leave-in-the-diagonals estimator. Bias reduction is fairly well-studied. If the only goal is to reduce

bias, then the limited approach of Sheather and Jones cannot be the best. It will be shown in a


later chapter that bias can be reduced (without affecting variance too much) by using a higher-order kernel (i.e., a symmetric kernel that is negative over part of its range), by using a more sophisticated choice of D as in (2.1), or by "jackknifing". Any of these techniques can be used

to eliminate arbitrarily many terms in the Taylor expansion of bias. Unfortunately, with any of

these techniques the estimator is no longer necessarily positive.

3. Mean Squared Error Reduction.

Of course, a better goal than reducing bias is to reduce MSE. The Sheather-Jones

bandwidth is chosen only to reduce bias, although it clearly reduces MSE. An improved

estimator is suggested by choosing a data-based D to minimize the asymptotic MSE. The

following analysis suggests that choosing D this way results in an improved MSE convergence

rate.

Define θ̂_m as in (2.1). Extensions of calculations in Hall and Marron (1987) show that, for a kernel of order r,

AMSE(h) = C₁ n^{-1} + C₂ n^{-2} h^{-4m-1} + (D - C₃ h^r)²,

where C₁, C₂, and C₃ are constants depending on K, m, and f.


In the three following cases, h_AMSE is the asymptotically MSE-optimal bandwidth. AMSE is simply the dominant terms in the expansion of MSE with the remainders disregarded. Case 1 is the Hall and Marron estimator, Case 2 is the Sheather and Jones estimator, and Case 3 is the general D.

Case 1: D = 0 (Hall and Marron)

h_AMSE = [ (4m + 1) C₂ / (2r C₃² n²) ]^{1/(2r + 4m + 1)}

AMSE(h_AMSE) = C₁ n^{-1} + C₂ n^{-2} h_AMSE^{-4m-1} + (C₃ h_AMSE^r)² = C₁ n^{-1} + A₁ n^{-4r/(2r + 4m + 1)},

where A₁ is a constant depending on K, m, and f. Hence, for θ̂_m(f̂_n) with m = 0, AMSE = C₁ n^{-1}, and for a second-order kernel (r = 2) with m = 2, AMSE = A₁ n^{-8/13}.

Case 2: (Sheather and Jones)

To eliminate the first term in the expansion of the bias, choose h₀ so that

D = n^{-1} h₀^{-2m-1} (-1)^m K^{(2m)}(0) = C₃ h₀².

So

h₀ = [ (-1)^m K^{(2m)}(0) / (C₃ n) ]^{1/(2m + 3)}.

Notice that h₀ is not exactly the minimizer of AMSE. The real h_AMSE for m ≥ 1 would minimize

C₂ n^{-2} h^{-4m-1} + ( n^{-1} h^{-2m-1} (-1)^m K^{(2m)}(0) - C₃ h² )².

However, let h_AMSE = Cn^α. Then the squared bias is at least O(n^{-4/(2m+3)}) and this minimum occurs at α = -1/(2m + 3). Substitution shows that the variance is O(n^{-5/(2m+3)}) at this minimum. Hence, h₀ ~ h_AMSE as n → ∞. Finding the actual value of h_AMSE requires finding the real root of a (4m + 6)th order polynomial whose coefficients must be estimated. Since this is difficult, we will say that h_AMSE ≡ h₀ and disregard the difference for a fixed value of n.

Substituting h₀ into the expression for AMSE gives

AMSE(h_AMSE) = C₁ n^{-1} + A₂ n^{-5/(2m+3)},

where A₂ is a constant depending on K, m, and f. Hence, for θ̂_m(f̂_n) with m = 0, AMSE = C₁ n^{-1}. For θ̂_m(f̂_n) with m = 2, AMSE = A₂ n^{-5/7}.


Notice that for estimating θ_0(f) the convergence rate of the Hall and Marron estimator is the same as that of the Sheather and Jones estimator. For estimating θ_m(f), m ≥ 1, the Sheather and Jones estimator has a faster convergence rate. In particular, for m = 2, n^{-5/7} is a faster convergence rate than n^{-8/13}.

Case 3: For general D,

Choose D = C₃h² to eliminate the first term in the bias, and then choose h to minimize the remaining terms

C₁ n^{-1} + C₂ n^{-2} h^{-4m-1} + C₄ h⁸.

This gives:

h_AMSE = [ (4m + 1) C₂ / (8 C₄ n²) ]^{1/(4m + 9)}

AMSE(h_AMSE) = C₁ n^{-1} + C₂ n^{-2} h_AMSE^{-4m-1} + C₄ h_AMSE⁸ = C₁ n^{-1} + A₃ n^{-16/(4m + 9)},

where A₃ is a constant depending on K, r, and f. Hence, for θ̂_m(f̂_n) with m = 0, AMSE = C₁ n^{-1}. For θ̂_m(f̂_n) with m = 2, AMSE = A₃ n^{-16/17}.

Notice that for estimating ()o(f) the convergence rates of this estimator is the same as

that of the Hall/Marron estimator and the Sheather/Jones estimator. For estimating ()m(f),


m ≥ 1, this estimator has a faster convergence rate than the other two. In particular, for m = 2, n^{-16/17} is a faster convergence rate than n^{-5/7} or n^{-8/13}. Some easy algebra shows for m ≥ 1:

0 < 8/(4m + 5) < 5/(2m + 3) < 16/(4m + 9) < 1.

For symmetric kernels and m > 0, the Hall and Marron estimator has the slowest MSE

convergence, followed by the Sheather and Jones estimator, and then the estimator using D

chosen to minimize the asymptotic MSE.

Despite the asymptotic results above, in practical estimation problems the D-estimator

does not always give better results than other kernel estimators. D must be estimated or

approximated and this added level of difficulty may overwhelm any asymptotic gains in

practical settings with even seemingly large sample sizes. As with all asymptotics, it is difficult

to predict when the sample size is large enough so that the asymptotic result applies.

A useful tool for gauging how well asymptotics work in fixed sample size cases is exact

MSE calculations with Normal mixture densities. Over any finite range, Normal mixture densities can be shown to be dense with respect to various norms in the space of all continuous

densities. Hence, most practical estimation problems can be approximated by using a Normal

mixture density as a test case. Details are provided in chapter V.
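The exact calculations rest on the fact that products of Gaussian densities integrate in closed form: ∫ N(x; μ₁, σ₁²) N(x; μ₂, σ₂²) dx = N(μ₁ - μ₂; 0, σ₁² + σ₂²). A Python sketch computing the exact θ_0 of a two-component mixture, cross-checked by numerical integration; the mixture below is illustrative, not one of the fifteen densities in Appendix A:

```python
import math

def gauss(u, s):
    # Gaussian density with mean 0 and standard deviation s
    return math.exp(-0.5 * (u / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

def theta0_normal_mixture(weights, means, sds):
    # exact int f^2 for f = sum_k w_k N(mu_k, sd_k^2), using the identity
    # int N(x; m1, s1^2) N(x; m2, s2^2) dx = N(m1 - m2; 0, s1^2 + s2^2)
    total = 0.0
    for w1, m1, s1 in zip(weights, means, sds):
        for w2, m2, s2 in zip(weights, means, sds):
            total += w1 * w2 * gauss(m1 - m2, math.hypot(s1, s2))
    return total

w, mu, sd = [0.5, 0.5], [0.0, 3.0], [1.0, 0.5]
exact = theta0_normal_mixture(w, mu, sd)

# numerical cross-check by a fine Riemann sum over a wide interval
step = 16.0 / 20000
xs = [-8.0 + step * i for i in range(20001)]
numeric = sum((w[0] * gauss(x - mu[0], sd[0]) + w[1] * gauss(x - mu[1], sd[1])) ** 2
              for x in xs) * step
```

The same pairwise-convolution device extends to θ_m, since derivatives of Gaussians are Hermite polynomials times Gaussians.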

Table 2.1 provides some insight into how well the three estimators discussed above

perform. Fifteen Normal mixture densities are given in Appendix A. These densities seem to

be representative of a large class of densities that are likely to be encountered in a practical

estimation setting. For each density, the goal was to estimate θ₂. In each case, D and h_AMSE

were calculated by assuming that the density was known. Of course, this is not realistic but still

provides some insights. The MSE's given are exact, based on calculations as described above.


The results suggest that the Sheather-Jones D estimator is approximately as good as the AMSE

minimizing D estimator, even with sample sizes as large as 1000. In most cases, both perform

significantly better than the leave-out-the-diagonals estimator. In cases 11-14, the leave-out-the-diagonals version outperforms the other two estimators. The reason for this is that the true errors are not effectively modelled by the asymptotics. The true values for θ₃ are extremely large for these distributions, resulting in bandwidths that are highly undersmoothing.

Table 2.2 investigates the estimators in more realistic settings. The functionals in both

D and h_AMSE must be estimated. A reasonable approach to this is to assume the density is N(0, σ²), estimate σ², and calculate the functionals based on this density. I call these "plug-in"

estimators. Table 2.2 gives exact MSE's for plug-in estimators. The results suggest that the

added complication of estimating more functionals makes the D-estimators less attractive. The

two leave-in-the-diagonals estimators still behaved similarly. There was much less consistency in

comparing the leave-ins with the leave-out. In some cases, the leave-ins were much better (e.g., #1, 2, 9). In some cases, they were about the same (e.g., #3, 10, 11). In some cases, the leave-ins were somewhat worse (e.g., #5, 7). Note that many of the ratios for "spiky" distributions are slightly less than 1. This happens because for these distributions the plug-in bandwidths are grossly oversmoothing, resulting in $\hat\theta_2 / \theta_2 \approx 0$, so $\mathrm{MSE}/\theta_2^2 \approx [\theta_2 - 0]^2/\theta_2^2 = 1$.

The conclusion from these calculations is that the asymptotics suggest that the leave-in-the-diagonals estimators are more efficient. This efficiency seems to be difficult to realize unless

the distribution is very simple or known, or the sample sizes are very large. Asymptotic MSE

convergence rates alone do not seem to be convincing reasons to use either type of estimator.


Table 2.1: Exact Asymptotic Values

Note: Sample size = 1000.

The Hall and Marron estimator is the "leave-out-the-diagonals" estimator.

The Sheather and Jones estimator is the "leave-in-the-diagonals" estimator.

The AMSE minimizing D-estimator has a stochastic constant added.

$\mathrm{MSE}/\theta_2^2$

Distn Hall/Marron Sheather/Jones AMSE minimizing D

1 0.092 0.025 0.025

2 0.124 0.037 0.039

3 0.422 0.304 0.241

4 0.210 0.076 0.078

5 0.098 0.026 0.026

6 0.186 0.052 0.056

7 0.101 0.025 0.026

8 0.314 0.130 0.136

9 0.469 0.321 0.289

10 0.205 0.087 0.072

11 3.472 22.346 8.726

12 1.705 6.039 2.825

13 4.620 31.850 14.197

14 1.527 6.060 2.376

15 0.471 0.582 0.284

Table 2.2: "Plug-in" Values

Note: Sample size = 1000.

The Hall and Marron estimator is the "leave-out-the-diagonals" estimator.

The Sheather and Jones estimator is the "leave-in-the-diagonals" estimator.

The AMSE minimizing D-estimator has a stochastic constant added.

$\mathrm{MSE}/\theta_2^2$

Distn Hall/Marron Sheather/Jones AMSE minimizing D

1 0.092 0.025 0.025

2 0.118 0.022 0.021

3 0.965 0.977 0.978

4 0.988 0.994 0.994

5 0.717 0.809 0.813

6 0.201 0.238 0.244

7 0.115 0.201 0.207

8 0.569 0.613 0.617

9 0.392 0.077 0.075

10 0.996 0.998 0.998

11 1.000 1.000 1.000

12 0.999 0.999 0.999

13 1.000 1.000 1.000

14 0.999 0.999 0.999

15 0.949 0.958 0.958

3. Computation:

Calculating $\hat\theta_m$ exactly requires $O(n^2)$ kernel evaluations. Since kernel evaluations are usually computationally difficult, for large $n$ it is desirable to use an approximation to $\hat\theta_m$ that is easier to compute. One such approximation discussed in chapter IV is the histogram binned estimator. Suppose that the range of $X$ is partitioned into $\nu$ equal width bins called $I_1, \dots, I_\nu$. The midpoint of bin $I_a$ is called $c_a$. The estimator is given by:

where $c(X) \equiv c_a$ for $X \in I_a$. It will be shown in chapter IV that the binned estimator requires only $O(\nu)$ kernel evaluations and that for reasonable values of $\nu$ the approximation is very good. Nearly all the simulations done in this dissertation were done using a histogram binned estimator because of these benefits.
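The recoding step of histogram binning can be sketched in a few lines; this is an illustrative implementation (the bin count and data are hypothetical), not the code used in the dissertation:

```python
import numpy as np

def histogram_bin(data, nu):
    """Histogram binning: recode each observation to its bin midpoint.

    Returns the recoded data and the bin counts n_a (a = 1, ..., nu)."""
    lo, hi = float(data.min()), float(data.max())
    b = (hi - lo) / nu                                    # bin width
    idx = np.minimum(((data - lo) / b).astype(int), nu - 1)
    midpoints = lo + (idx + 0.5) * b                      # c(X) for each X
    counts = np.bincount(idx, minlength=nu)
    return midpoints, counts

rng = np.random.default_rng(0)
x = rng.normal(size=500)
recoded, counts = histogram_bin(x, nu=100)
print(int(counts.sum()))  # 500: each observation lands in exactly one bin
```

Once the counts $n_a$ are in hand, the estimator needs only one kernel evaluation per distinct bin separation, which is the source of the $O(\nu)$ cost.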

Of course, in the binned estimator there is again the choice of whether or not to include

the diagonal terms. Suppose the diagonal terms are omitted. Then:

Let each bin be of width $b$, and let $n_q$ be the number of observations in bin $q$. Some algebra and counting shows that:

$$\hat\theta = (-1)^m \left\{ \sum_{i=0}^{\nu-1} K_h^{(2m)}(ib) \left[ \frac{\sum_{|q-r|=i} n_q n_r}{n(n-1)} \right] - \frac{K_h^{(2m)}(0)}{n-1} \right\}.$$


An estimator that eliminates non-stochastic terms would now drop the last term, giving:

However, the same counting and algebra done in reverse shows that this is equal to:

the binned estimator with the diagonals left in.

This may not contribute greatly to the discussion of whether or not to include the

diagonals in the unbinned estimator, but it seems to resolve it in the case of the binned

estimator. The leave-out-the-diagonals estimator may be negative and it includes a non-

stochastic term. The leave-in-the-diagonals estimator is positive and has no non-stochastic

terms.

4. "Stepped" estimators.

As shown in section 2, the asymptotically optimal bandwidths for estimating $\theta_m$ are all functions of $\theta_{m+1}$. In chapters VII and VIII, multi-step estimators will be discussed in which $\theta_{m+k}$ is estimated to arrive at an estimated optimal bandwidth for $\theta_m$. In this setting, the

positivity of the leave-in-the-diagonals estimator is absolutely necessary. Results about these

kinds of estimators will be discussed in depth in these later chapters.


Chapter III: Bias Reduction

1. Introduction.

From Hall and Marron (1987), the estimator defined in (2.1) with D = 0 is biased. The

Sheather and Jones estimator discussed in chapter II adds a component which cancels some of

this bias. The result is an estimator with a faster MSE convergence rate. The natural question

is whether some further bias reduction might eliminate more bias and result in even faster MSE

convergence rates (or, more modestly, with the same MSE convergence rate but a better

constant).

The Sheather-Jones approach is basically to add something to the estimator to cancel a

term in the Taylor expansion of the bias. Of course, the obvious extension is to add something

different to cancel the first several terms in the bias. These estimators will be called D-estimators.

Three approaches to reducing bias in kernel estimates are higher order kernels, D-estimates, and the generalized jackknife. This section will examine each of these approaches and establish the connections between them. The generalized jackknife and the higher order kernel estimates are equivalent. A D-estimate is a higher order kernel estimate. Under some conditions (stated below), a higher order kernel or generalized jackknife estimate can be written as a D-estimate.

2. Notation.

In the discussion below, define $\hat\theta_m(h, n, K)$ as follows:

$$\hat\theta_m(h, n, K) = (-1)^m\, n^{-1} (n-1)^{-1} \sum_{i \neq j} \sum K_h^{(2m)}(X_i - X_j)$$

where $h$ is the bandwidth, $n$ is the sample size, $K$ is symmetric and bounded, and $\int K(u)\,du = 1$. When $h$, $n$, and $K$ are clear from the context or irrelevant then some or all may be omitted from $\hat\theta_m(h, n, K)$. The kernel function $K$ has order $2r$ when:

$$\int u^j K(u)\,du = \begin{cases} 1 & j = 0 \\ 0 & j = 1, \dots, 2r-1 \\ C & j = 2r \end{cases}$$

where $C > 0$. A probability density function is an order 2 kernel.

The functional being estimated, $\int (f^{(m)})^2$, will be denoted $\theta_m(f)$.
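The moment conditions defining kernel order are easy to check numerically. A small sketch, assuming a standard Gaussian kernel (which, being a symmetric density, has order 2):

```python
import numpy as np

def kernel_moment(K, j, lim=10.0, n=200001):
    """Numerically approximate the j-th moment of a kernel K on [-lim, lim]."""
    u = np.linspace(-lim, lim, n)
    return float(np.trapz(u ** j * K(u), u))

phi = lambda u: np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)  # standard Gaussian density
print(round(kernel_moment(phi, 0), 6))       # 1.0  (j = 0)
print(round(abs(kernel_moment(phi, 1)), 6))  # 0.0  (j = 1)
print(round(kernel_moment(phi, 2), 6))       # 1.0  (j = 2): nonzero, so order 2
```

The same routine applied to a candidate higher order kernel would show its low-order even moments vanishing as well.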

3. Higher order kernel estimators.

A "higher order" kernel is a kernel with order greater than 2. It is well known that the MSE for a

kernel density estimate decreases with increasing kernel order (Bartlett, 1963). The higher order

kernel estimate reduces the MSE by reducing the bias of the estimate. Calculations similar to

those in Hall and Marron (1987) show that if m + 2N derivatives of f exist

Hence the bias is $O(h^{2r})$ for $K$ a kernel of order $2r$.


In kernel density estimation, this smaller MSE has a price. Since a higher order kernel

must have negative values, the interpretation of the density estimate as a moving weighted average is obscured. For estimating $\theta_m(f)$, an added problem is that $\hat\theta_m(h, n, K)$ may be negative for higher order $K$ even though $\theta_m(f)$ is positive. Nevertheless, the lower MSE may be

attractive.
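As a concrete illustration of such a kernel (a standard textbook example, not one used in the text), the Gaussian-based kernel $K_4(u) = \tfrac{1}{2}(3 - u^2)\phi(u)$ integrates to one, has vanishing second moment, a nonzero fourth moment (negative here), and necessarily dips below zero:

```python
import numpy as np

phi = lambda u: np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
K4 = lambda u: 0.5 * (3.0 - u ** 2) * phi(u)   # a classical order-4 kernel

u = np.linspace(-10, 10, 200001)
print(round(float(np.trapz(K4(u), u)), 6))                # 1.0: j = 0 moment
print(round(abs(float(np.trapz(u ** 2 * K4(u), u))), 6))  # 0.0: j = 2 moment vanishes
print(round(float(np.trapz(u ** 4 * K4(u), u)), 6))       # -3.0: nonzero j = 4 moment
print(bool(K4(3.0) < 0))                                  # True: negative in the tails
```

The negativity for $|u| > \sqrt{3}$ is exactly what obscures the moving-weighted-average interpretation discussed above.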

4. D-estimators.

A straightforward technique for reducing bias is to subtract an estimate of the bias from

the estimate of $\theta_m(f)$. This allows a probability density kernel to be used, preserving the moving average interpretation of the density estimate. If $K$ is a density, then the first order bias term of $\hat\theta_m(K)$ is

To reduce the bias in $\hat\theta_m(K)$, this bias term may be estimated and subtracted from $\hat\theta_m(K)$.

Define

(3.2)

The former will be called a D-estimator.

The disadvantage of this technique is that another estimation problem has been added

to an already difficult problem. Instead of selecting one kernel and bandwidth, two are required.

A more general version of a D-estimator may include terms to eliminate higher order bias terms

and these add still more estimation problems.


5. Generalized jackknife estimators.

Let $\hat\theta_m(h_1, n_1, K_1)$ and $\hat\theta_m(h_2, n_2, K_2)$ be distinct estimates of $\theta_m(f)$. For simplicity

suppose that KI and K2 are densities, although this is not required in the general case. Define

Using the expression for bias given in (3.1),

The first-order bias term is eliminated if

$$R = \frac{h_1^2 \int u^2 K_1(u)\,du}{h_2^2 \int u^2 K_2(u)\,du}.$$

The estimator $G(\hat\theta_m(h_1, n_1, K_1), \hat\theta_m(h_2, n_2, K_2), R)$ with $R$ chosen this way is called a

generalized jackknife estimator (for a similar discussion of density estimation see Schucany and

Sommers, 1977).

The difficulty with the generalized jackknife is similar to the difficulty with the D-

estimator; the improved bias costs another estimation problem. It is also not clear how the two

estimators should be chosen in relation to each other.
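The bias-cancelling choice of $R$ can be illustrated with a toy computation. The sketch below assumes the standard Schucany–Sommers combination $G = (\hat\theta_1 - R\hat\theta_2)/(1 - R)$, which the text leaves implicit:

```python
import numpy as np

def jackknife_combine(theta1, theta2, h1, h2, mu2_K1, mu2_K2):
    """Combine two estimates so their first-order O(h^2) bias terms cancel.

    Assumes the Schucany-Sommers form G = (theta1 - R*theta2)/(1 - R)
    with R = h1^2 * mu2(K1) / (h2^2 * mu2(K2))."""
    R = (h1 ** 2 * mu2_K1) / (h2 ** 2 * mu2_K2)
    return (theta1 - R * theta2) / (1.0 - R), R

# Toy check: estimates sharing the bias structure theta + c*h_i^2 combine exactly.
theta, c = 2.0, 5.0
h1, h2 = 0.1, 0.3
t1, t2 = theta + c * h1 ** 2, theta + c * h2 ** 2
G, R = jackknife_combine(t1, t2, h1, h2, mu2_K1=1.0, mu2_K2=1.0)
print(round(G, 12))  # 2.0: the h^2 bias terms cancel
```

With real data the two estimates do not share an exact $h^2$ bias, so the cancellation is only of the leading term, as the text describes.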


For a discussion of how the generalized jackknife relates to the more common

"pseudovalue" jackknife see Gray and Schucany (1972).

6. Higher order generalized jackknife estimators.

Since combining two distinct estimators provides a new estimator with smaller bias, combining more than two should further reduce the bias. Define $\hat\theta_i \equiv \hat\theta_m(h_i, n_i, K_i)$ where the $\hat\theta_i$ are distinct; then

where $a_i = h_i^2 \int u^2 K_i(u)\,du$. This motivates the higher order generalized jackknife estimator

where

$$G[\hat\theta_1, \dots, \hat\theta_r] = \frac{|\Theta|}{|A|} \tag{3.3}$$

and

$$\Theta = \begin{pmatrix} \hat\theta_1 & \hat\theta_2 & \cdots & \hat\theta_r \\ a_{11} & a_{12} & \cdots & a_{1r} \\ \vdots & \vdots & & \vdots \\ a_{r-1,1} & a_{r-1,2} & \cdots & a_{r-1,r} \end{pmatrix}
\qquad
A = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ a_{11} & a_{12} & \cdots & a_{1r} \\ \vdots & \vdots & & \vdots \\ a_{r-1,1} & a_{r-1,2} & \cdots & a_{r-1,r} \end{pmatrix}.$$

An application of theorem 4.1 in Gray and Schucany (1972) shows that if


then $\mathrm{Bias}(G_m[\hat\theta_1, \dots, \hat\theta_r]) = O[(\max(h_1, \dots, h_r))^{2r+2}]$. The order of the higher order generalized jackknife is $r$, the number of estimators used to produce it.
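The determinant ratio in (3.3) is directly computable. A sketch (the matrix entries below are hypothetical, chosen so the $r = 2$ case reduces to the two-estimator jackknife of the previous section):

```python
import numpy as np

def higher_order_jackknife(theta_hats, a):
    """Order-r generalized jackknife as a ratio of determinants, following (3.3).

    theta_hats: length-r array of estimates; a: (r-1) x r array of the a_{ij}."""
    r = len(theta_hats)
    num = np.vstack([theta_hats, a])   # first row: the estimates
    den = np.vstack([np.ones(r), a])   # first row: ones
    return np.linalg.det(num) / np.linalg.det(den)

# With r = 2 this reduces to (t1*a12 - t2*a11) / (a12 - a11).
t = np.array([2.05, 2.45])
a = np.array([[0.01, 0.09]])  # hypothetical a_{1i} = h_i^2 * mu2(K_i)
print(round(higher_order_jackknife(t, a), 9))  # 2.0
```

Expanding the numerator determinant along its first row shows why this is a linear combination of the $\hat\theta_i$ with weights that annihilate the first $r - 1$ bias terms.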

7. Relationships among the bias reduction estimators.

These bias reduction estimators are all aimed at eliminating at least one term in the

expansion of the bias. In many cases, one bias reduction estimator may be rewritten in the form

of a different type of bias reduction estimator. For example, a generalized jackknife estimator is

always a higher order kernel estimator for the right choice of kernel. The remainder of this

chapter will describe these relationships.

The relationships for estimation of $\theta_m(f)$ are as follows:

• A generalized jackknife estimator with order $r$ is a kernel estimator with order $q$ for $q/2 \geq r$.

• A kernel estimator with order $2r$ is a generalized jackknife estimator with order $r$.

• A D-estimator is a generalized jackknife estimator and a kernel estimator with order $r = 4$.

• A kernel estimator with order $r = 4$ is a D-estimator under some regularity conditions for the kernel.

• A generalized jackknife estimator of order $r = 2$ is a D-estimator under some regularity conditions on the estimators.

The relationships are shown graphically in figure 3.1.


Figure 3.1: Equivalences of Bias Reduction Techniques

[Diagram: the Generalized Jackknife of order $r$, the Higher Order Kernel of order $q$, and the D-estimator ($r = 2$ case), connected by arrows indicating the equivalences listed above; some arrows hold only with the stated conditions.]

8. Theorems.

The first two theorems show the equivalence of the generalized jackknife and the higher

order kernel estimators.

Theorem 3.1: A generalized jackknife estimator of order r is a higher order kernel estimator of

order $q$, where $q/2 \geq r$, if each $\hat\theta_i$ in the generalized jackknife estimator is based on all $n$

observations.

Proof: The proofs of all theorems in this chapter are given in Section 10.

For an example of a higher order kernel estimate of degree $q$ where $q/2 > r$, where $r$ is the degree of the equivalent jackknife estimator, see Schucany and Sommers (1977). The

example they give is for density estimation but it is easily extended.

Theorem 3.2: A higher order kernel estimator of order 2r is a generalized jackknife estimator of

order r.

The next theorem shows that a D-estimator is a fourth order kernel estimator, and by the previous two theorems, a second order generalized jackknife estimator. The theorem

provides the construction of the fourth order kernel. A construction of the equivalent

generalized jackknife kernels is given in the proof of theorem 3.2.


Theorem 3.3: A D-estimator, $\hat\theta_m(K_1, K_2, D)$, which is defined in (3.2), is a kernel estimator of order $r = 4$ if $x K_2(x),\, x^2 K_2'(x) \to 0$ as $|x| \to \infty$, $\int u^4 K_1(u)\,du < \infty$, and $\int u^4 K_2^{(2)}(u)\,du < \infty$.

Then

is an order 4 kernel and

Theorem 3.4 and Corollary 3.4.1 show that a fourth order kernel (or generalized

jackknife) estimator is a D-estimator under some conditions on the kernel. It is possible that a

theorem with weaker tail conditions could be proved. The "ultimately negative" condition in

Corollary 3.4.1 is somewhat irksome. The conditions for the generalized jackknife estimators are

omitted. To determine whether a generalized jackknife estimator can be written as a D-estimator is an easy matter of examining the corresponding fourth order kernel.

Definition: Kernel $K^*(x)$ has a sign change at $t_0$ when $K^*(x) < 0$ in an $\epsilon$-neighborhood of $t_0$ for $x < t_0$ and $K^*(x) > 0$ in an $\epsilon$-neighborhood of $t_1$ for $x > t_1$. If $t_0 \neq t_1$, then $K^*(x) = 0$ on $[t_0, t_1]$.

Notation: $F_K(x)$ is the c.d.f. of density $K$. $G^+$ and $G^-$ are the positive and negative parts, respectively, of $G$.


Theorem 3.4: If $K^*(x)$ has order 4, has 2 sign changes, and $K^*(x)$ is $o(x^{-3})$ as $|x| \to \infty$, then

where

In the above construction, a single bandwidth is used for both kernels. Since a

D-estimator may have different bandwidths for each kernel, a more general result is available

than the theorem.

Corollary 3.4.1: If $K^*(x)$ has order 4, is ultimately negative and $o(x^{-3})$ as $|x| \to \infty$, and $K^*(x) > 0$ in a neighborhood of 0, then $\hat\theta(h, n, K^*)$ can be written as a D-estimator.

9. Example.

The following example shows one estimate written as a D-estimate, a generalized

jackknife estimate, and a fourth order kernel estimate. The functional being estimated is $\int f^2$ where $f$ is the Normal mixture distribution #2 given in Appendix A.

The D-estimator kernels are shown in figure 3.2a. The D-estimator is $\hat\theta_m(K_1, K_2, D)$


where $D = D(h_1, h_2, n, K_1, K_2)$, $h_1 = 0.189$, $h_2 = 0.347$, $n = 250$, and $K_1$, $K_2$ are standard

Gaussian densities. The bandwidths are MSE optimal.

The corresponding generalized jackknife kernels are shown in figure 3.2b. The jackknife

estimator is $G_m[\hat\theta_1, \hat\theta_2, R]$ where $\hat\theta_i$ is the kernel estimator using density $K_i$. In this instance, $R = 0.13$. Notice that by itself $\hat\theta_2$ would be a terrible estimator since it weights distant points more than local points in estimating $f$.

The corresponding fourth order kernel is shown in figure 3.2c. The small value of R in

the generalized jackknife estimate is reflected in the small dip below the x-axis. In fact, the

overall impact of the bias reduction methods can be assessed by comparing the K-kernel in the

D-estimator with the fourth order kernel. The two kernels are nearly the same, so here the bias reduction cannot have much effect.


Figure 3.2a: D-estimator kernels

Figure 3.2b: Jackknife kernels

Figure 3.2c: Fourth-order kernel

10. Proofs.

Proof of theorem 3.1:

Define OJOl' "', ,q, A, aij as in section V. Then OJOl' "', Or] = 0m(1, n, %)

IKIwhere 9G = iAl and

K=

and

To show that $\mathcal{K}$ is of order $q$ for $q/2 \geq r$, note

$$\int u^{2(r-1)} \mathcal{K}(u)\,du = \frac{1}{|A|} \begin{vmatrix} \int u^{2(r-1)} K_1 & \int u^{2(r-1)} K_2 & \cdots & \int u^{2(r-1)} K_r \\ a_{11} & a_{12} & \cdots & a_{1r} \\ \vdots & \vdots & & \vdots \\ a_{r-1,1} & a_{r-1,2} & \cdots & a_{r-1,r} \end{vmatrix} = \frac{1}{|A|} \begin{vmatrix} a_{r-1,1} & a_{r-1,2} & \cdots & a_{r-1,r} \\ a_{11} & a_{12} & \cdots & a_{1r} \\ \vdots & \vdots & & \vdots \\ a_{r-1,1} & a_{r-1,2} & \cdots & a_{r-1,r} \end{vmatrix} = 0,$$

because row 1 and row $r$ are identical. $\square$


Proof of theorem 3.2:

Define $G[\hat\theta_1, \dots, \hat\theta_r]$, $A$, $a_{ij}$ as in section 6.

Let $K^*(x)$ be a kernel of order $2r$. Select $\{k_i(x)\}$ so $K^*(x) = \sum_{i=1}^{r} (-1)^{i-1} k_i(x)$ where $k_i(x) \geq 0$ and the $k_i(x)$ are bounded.

Define $K_i(x) = k_i(x) / \int k_i(u)\,du$, so $K_i(x)$ is a density.

Since $\int K^*(u)\,du = 1$, $\sum_{i=1}^{r} (-1)^{i-1} \int k_i(u)\,du = 1$.

Since $K^*(x)$ is of order $2r$, for $1 \leq k \leq r-1$,

$$0 = \int u^{2k} K^*(u)\,du.$$


So we have the system of equations:

$$\begin{pmatrix} 1 & 1 & \cdots & 1 \\ a_{11} & a_{12} & \cdots & a_{1r} \\ \vdots & \vdots & & \vdots \\ a_{r-1,1} & a_{r-1,2} & \cdots & a_{r-1,r} \end{pmatrix} \begin{pmatrix} \int k_1 \\ -\int k_2 \\ \vdots \\ (-1)^{r-1} \int k_r \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix}, \qquad \text{or} \qquad A \begin{pmatrix} \int k_1 \\ -\int k_2 \\ \vdots \\ (-1)^{r-1} \int k_r \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix}.$$

Using Cramer's rule,

$$\int k_i = \frac{1}{|A|} \begin{vmatrix} a_{11} & \cdots & a_{1, i-1} & a_{1, i+1} & \cdots & a_{1r} \\ \vdots & & \vdots & \vdots & & \vdots \\ a_{r-1,1} & \cdots & a_{r-1, i-1} & a_{r-1, i+1} & \cdots & a_{r-1,r} \end{vmatrix} \equiv \frac{|A_{(i)}|}{|A|}.$$

Hence

$$K^*(x) = \sum_{i=1}^{r} (-1)^{i-1} k_i(x) = \sum_{i=1}^{r} (-1)^{i-1} \left( \int k_i(u)\,du \right) K_i(x) = \frac{1}{|A|} \sum_{i=1}^{r} (-1)^{i-1} |A_{(i)}| K_i(x) = \frac{1}{|A|} \begin{vmatrix} K_1 & K_2 & \cdots & K_r \\ a_{11} & a_{12} & \cdots & a_{1r} \\ \vdots & \vdots & & \vdots \\ a_{r-1,1} & a_{r-1,2} & \cdots & a_{r-1,r} \end{vmatrix}.$$

So using this kernel gives the estimator $G[\hat\theta_1, \dots, \hat\theta_r]$, which is a generalized jackknife estimator of order $r$ as defined in (3.3). $\square$

Proof of theorem 3.3: Define a D-estimator


So it needs to be shown that

$$K^*(h_1, h_2, x) = K_{1, h_1}(x) - \frac{h_1^2}{2} \left\{ \int u^2 K_1(u)\,du \right\} K_{2, h_2}^{(2)}(x)$$

is an order 4 kernel.

$$\int K^*(h_1, h_2, x)\,dx = 1$$

since $\int K_{2, h_2}^{(2)}(x)\,dx = K_{2, h_2}'(x) \big|_{-\infty}^{+\infty} = 0$ by the statement of the theorem.

$$\int x\, K^*(h_1, h_2, x)\,dx = 0$$

by symmetry.

$$\int x^2 K^*(h_1, h_2, x)\,dx = 0,$$

because integration by parts and the given conditions show $\int x^2 K_{2, h_2}^{(2)}(x)\,dx = 2$. That $\int x^4 K^*(h_1, h_2, x)\,dx$ is finite and nonzero is an immediate consequence of the statement of the theorem.

Hence, $K^*$ is an order 4 kernel. $\square$

Proof of theorem 3.4:


To complete the proof it must be shown that $D_2(x, h)$ is the second derivative of a probability density function. Define

i. $K(x, h) \geq 0$.

$$K(x, h) = C \int\!\!\int \frac{(K^*(u))^+}{h \int (K^*(t))^+\,dt}\,du\,dy,$$

where $C > 0$ and the integral does not depend on $h$.


Since $K_h^*(x)$ has two sign changes, there must exist $x_0 < 0$ so that $F_{K_1}(x_0) = 0$ and $F_{K_2}(x_0) = \tfrac{1}{2}$. By symmetry, $F_{K_1}(0) = F_{K_2}(0) = \tfrac{1}{2}$. Since $F_{K_1}(x)$ and $F_{K_2}(x)$ are monotone, $F_{K_2}(x) - F_{K_1}(x) \geq 0$ for all $x \leq 0$, so

$$\int_{-\infty}^{x} \left[ F_{K_2}(y) - F_{K_1}(y) \right] dy \geq 0 \quad \forall\, x \leq 0.$$

Since $K_1(x, h)$ and $K_2(x, h)$ are symmetric, $F_{K_i}(x) = 1 - F_{K_i}(-x)$. Hence,

$$\begin{aligned}
\int_{-x}^{+x} \left[ F_{K_2}(y) - F_{K_1}(y) \right] dy
&= \int_{-x}^{0} \left[ (1 - F_{K_2}(-y)) - (1 - F_{K_1}(-y)) \right] dy + \int_{0}^{+x} \left[ F_{K_2}(y) - F_{K_1}(y) \right] dy \\
&= \int_{-x}^{0} \left[ F_{K_1}(-y) - F_{K_2}(-y) \right] dy + \int_{0}^{+x} \left[ F_{K_2}(y) - F_{K_1}(y) \right] dy \\
&= \int_{0}^{+x} \left[ F_{K_1}(y) - F_{K_2}(y) \right] dy + \int_{0}^{+x} \left[ F_{K_2}(y) - F_{K_1}(y) \right] dy \\
&= 0.
\end{aligned}$$

So for $x \geq 0$,

ii. $K(x, h)$ is integrable.

By symmetry, it suffices to show $\int_{-\infty}^{0} \int_{-\infty}^{y} \int_{-\infty}^{u} K_2(w)\,dw\,du\,dy < \infty$. Since $K^*(x)$ is $o(x^{-3})$, $K_2(x, h)$ is $o(x^{-3})$. Since $K^*(x)$ is an order 4 kernel, $\int_{-\infty}^{x} K_2(w)\,dw < \infty$. Two applications of integration by parts using $K_2(x, h)$ being $o(x^{-3})$ proves the result.

iii. $\int_{-\infty}^{+\infty} K(x, h)\,dx = 1$.

By Hall and Marron (1987) and integration by parts,

$$E[\hat\theta(h, n, K^*)] - \theta = \frac{h^4}{24} \left( \int (f''(x))^2\,dx \right) \left( \int u^4 K^*(u)\,du \right) + o(h^4).$$

$$E[\hat\theta(h, n, K_1) - D(h, n, K)] - \theta = -\frac{h^2}{2} \left( \int (f'(x))^2\,dx \right) \left( \int u^2 K_1(u, h)\,du \right) - E[D(h, n, K)] + o(h^2).$$

Since $E[\hat\theta(h, n, K^*)] = E[\hat\theta(h, n, K_1) - D(h, n, K)]$, matching terms in the expansion gives

$$E[D(h, n, K)] = -\frac{h^2}{2} \left( \int (f'(x))^2\,dx \right) \left( \int u^2 K_1(u, h)\,du \right) + o(h^2).$$

So by the definition of $D$,

$$E[K_h^{(2)}(X_i - X_j)] = -\left( \int (f'(x))^2\,dx \right) + o(1).$$


However from Hall and Marron (1987),

$$E[K_h^{(2)}(X_i - X_j)] = -\int\!\!\int K(u, h)\, f'(x)\, f'(x - hu)\,dx\,du = -\left[ \left( \int K(u, h)\,du \right) \left( \int (f'(x))^2\,dx \right) + o(1) \right].$$

This implies $\int K(u, h)\,du = 1$.

This completes the proof of the theorem. $\square$

Proof of Corollary 3.4.1:

Suppose $K^*(x) \leq 0$ for $|x| > a$ and $a$ is a sign change point. Further, suppose $K^*(x) \geq 0$ for $|x| < b$ and $b$ is a sign change point.

Let $K^{**}(x) = \frac{b}{a} [K^*]^+\!\left( \frac{b}{a} x \right) - [K^*]^-(x)$.

Then $K^{**}(x)$ meets the conditions of the previous theorem and may be written as a D-estimator according to the construction given there. So, using the same definitions as in the theorem, if

$$\hat\theta(h, n, K^{**}) = \hat\theta(h, n, K_1') - D(h, n, D_2')$$

then

$$\hat\theta(h, n, K^*) = \hat\theta\!\left( \frac{bh}{a}, n, K_1' \right) - D(h, n, D_2'). \qquad \square$$


Chapter IV: Computation

1. Introduction.

The basic estimator, defined in (2.1), is a double sum requiring $O(n^2)$ kernel

evaluations. For reasonable values of n, the computation time for the estimator is "feasible".

For many of the applications discussed in this dissertation, however, the computation time

required for straightforward calculation is a burden. For example, the "k-step" estimators discussed in Chapter VIII require $O(k \times n^2)$ kernel evaluations to calculate directly. Even on a

relatively fast computer, simulations which require many repetitions of this operation can take

weeks to finish.

This chapter discusses a computation strategy called "binning" to reduce the computation time by approximating the estimator. The object of binning is to distribute the data initially to $\nu$ "bins". The estimator is computed on the "binned" data. The pay-off is that the estimator now requires only $O(\nu)$ kernel evaluations.

The simplest binning is called "histogram" binning. The range of the data is divided

into $\nu$ equal width bins. Each observation is recoded to equal the midpoint of the bin in which

it lies. The estimator is calculated using the recoded data instead of the original data.

A more complicated type of binning is called "generalized" binning. The range of the

data is similarly partitioned into $\nu$ bins as in histogram binning. The new data consists of a

kernel density estimate at each bin midpoint. A special case discussed below is "linear" binning

in which the kernel for the generalized binning is a triangular kernel.

There are two questions that are addressed in this chapter: "How should the data be

binned?" and "How much error is introduced by the binning?". These questions are answered

with asymptotic calculations. In general, bin widths slightly smaller than the bandwidth seem

to give good results and require much less computation time.

2. Notation:

In the discussion below, assume that there are $\nu$ bins of width $b$. The bins are called $I_1, \dots, I_\nu$. The midpoint of bin $I_a$ will be denoted $c_a$ and $c(x) \equiv c_a$ if $x$ is in bin $a$.

Define $\theta_m$ as follows:

Define $\hat\theta_m(h, n, K)$ as follows:

$$\hat\theta_m(h, n, K)$$

where $h$ is the bandwidth, $n$ is the sample size, and $K$ is a symmetric, bounded probability density function. Define $\tilde\theta_m(h, n, K)$ as follows:

$$\tilde\theta_m(h, n, K) \tag{4.1}$$

where $g$ is some function of the observations. When $h$, $n$, and $K$ are clear from the context or irrelevant then some or all may be omitted from $\hat\theta_m(h, n, K)$ or $\tilde\theta_m(h, n, K)$.


3. The histogram binned estimator.

The histogram binned estimator is the most straightforward type of binning. That is:

$$\tilde\theta_m(h, n, K)$$

where $c(X) \equiv c_a$ if $X \in I_a$ and $c(X) \equiv 0$ otherwise. In the notation of (4.1):

where $n_a$ is the number of observations in bin $a$.

Lemma 4.1 provides the asymptotic variance and squared bias for $\tilde\theta_m(h, n, K)$ as $h, b \to 0$; $n \to \infty$; $nh, nb \to \infty$. It is similar to lemma 3.1 in Hall and Marron (1987).

Lemma 4.1: For $f$ a density with a continuous $(m+1)$st derivative vanishing at $\pm\infty$, and $K$ a symmetric infinitely differentiable kernel:

i. $\{E[\tilde\theta_m(h, n, K) - \theta_m]\}^2 = \theta_{m+1}^2 \left[ \dfrac{h^4}{4} \left\{ \displaystyle\int u^2 K(u)\,du \right\}^2 + \dfrac{b^4}{144} \right] + o(b^4) + o(h^4)$.


ii. $\mathrm{Var}[\tilde\theta_m(h, n, K)] =$

where:

Theorem 4.1 suggests how the bin width should be chosen relative to the bandwidth.

Theorem 4.1:

If $b = h^\alpha$, then for $\alpha \geq 1$, $\tilde\theta_m(h, n, K)$ converges to $\theta_m$ at the same rate as $\hat\theta_m(h, n, K)$, the unbinned estimator.

Proof: The proofs of all theorems in this chapter are given in Section 6.

Theorem 4.2 provides the asymptotically MSE optimal bandwidth (and bin width) for

b = h. By theorem 4.1, b must be at least this small. Decreasing b further does not affect the

rate of convergence but does add computation time, so theorem 4.2 provides an "optimal" bin

width.


Theorem 4.2:

For $b = h$, the minimum MSE is achieved by:

(a) For $m \geq 1$,

where

Then:

$$E[\tilde\theta_m(h, n, K) - \theta_m]^2 = \frac{(4m+5)\, C_2}{4(4m+1)} \left[ \frac{(4m+1)\, C_1\, n^{-2}}{C_2} \right]^{4/(4m+5)} + o\!\left( n^{-8/(4m+5)} \right).$$

(b) For $m = 0$, any $h$ such that $h\, n^{1/(4m+1)} \to \infty$.

Then:


4. Computation of $\tilde\theta_m(h, n, K)$.

In the previous two sections it was established that $\hat\theta_m(h, n, K)$ is a more accurate estimator of $\theta_m$ than $\tilde\theta_m(h, n, K)$. However, $\hat\theta_m(h, n, K)$ requires $O(n^2)$ computationally difficult kernel evaluations. Suppose, as above, there are $\nu$ bins of width $b$ called $I_1, \dots, I_\nu$. Further, let $n_a$ observations fall in bin $I_a$. Then:

$$\tilde\theta_m(h, n, K) = (-1)^m \left\{ \sum_{i=0}^{\nu-1} K_h^{(2m)}(ib) \left[ \frac{\sum_{|q-r|=i} n_q n_r}{n(n-1)} \right] - \frac{K_h^{(2m)}(0)}{n-1} \right\} \tag{4.2}$$

Because the bins are equally spaced, computing $\tilde\theta_m(h, n, K)$ requires only $O(\nu)$ kernel evaluations.

If $\nu$ is large, some care must be taken in computing the double sum. For $\nu > n$, most of the $n_q$ will be 0 or 1 and this should certainly be used to speed computation.
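Formula (4.2) translates directly into code. The sketch below is an illustrative implementation (the autocorrelation-of-counts shortcut and the demo parameters are assumptions of mine, not taken from the dissertation), shown for $m = 0$ with a Gaussian kernel so that the target is $\theta_0 = \int f^2$:

```python
import numpy as np

def binned_theta(data, m, h, nu, Kh_2m):
    """Histogram-binned estimator following (4.2): O(nu) kernel evaluations.

    Kh_2m(u) must return the 2m-th derivative of the scaled kernel K_h at u."""
    n = len(data)
    lo, hi = float(data.min()), float(data.max())
    b = (hi - lo) / nu
    idx = np.minimum(((data - lo) / b).astype(int), nu - 1)
    counts = np.bincount(idx, minlength=nu).astype(float)      # n_a
    # lag-i products: acf[i] = sum_q n_q * n_{q+i}
    acf = np.correlate(counts, counts, mode="full")[nu - 1:]
    acf[1:] *= 2.0                                             # |q - r| = i counts both orders
    total = np.dot(Kh_2m(b * np.arange(nu)), acf) / (n * (n - 1))
    total -= Kh_2m(0.0) / (n - 1)                              # remove the diagonal terms
    return (-1.0) ** m * total

# Demo for m = 0 (theta_0 = integral of f^2) with a Gaussian kernel.
rng = np.random.default_rng(1)
x = rng.normal(size=2000)
h = 0.4
Kh = lambda u: np.exp(-0.5 * (np.asarray(u) / h) ** 2) / (h * np.sqrt(2 * np.pi))
print(binned_theta(x, m=0, h=h, nu=400, Kh_2m=Kh))  # near 1/(2*sqrt(pi)) ~ 0.28
```

Only the $\nu$ values $K_h^{(2m)}(0), K_h^{(2m)}(b), \dots, K_h^{(2m)}((\nu-1)b)$ are ever evaluated, which is the point of the equal-spacing argument above.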


5. Generalized bin estimator.

The computational form of $\tilde\theta_m(h, n, K)$ given in (4.2) suggests a more general approximation. As $b \to 0$:

Thus a more general binning scheme is to replace $n_q$ by $n b \hat{f}(c_q)$ in (4.2), where $\hat{f}$ is a kernel density estimator. Hence,

Some simplifying assumptions are needed. The only binning strategies worth considering

are those that reduce the number of kernel evaluations. First, define L on a compact interval

and let $b/w = \gamma$. Further, define:

(Note that for $\gamma = 1$, $L_\gamma \!*\! L$ reduces to the usual convolution for symmetric $L$.)


Lemma 4.2: Let $\tilde\theta_m(h, n, K)$ be a binned estimator with binning kernel $L$ and binning bandwidth $w$. Further, let $L$ be non-zero on a compact interval and $b/w = \gamma < \infty$. If $f$ is sufficiently smooth and $h, b, w \to 0$ then:

i. $\{E[\tilde\theta_m(h, n, K) - \theta_m]\}^2 =$

ii. $\mathrm{Var}[\tilde\theta_m(h, n, K)] =$

Notice that lemma 4.2 reduces to the same result as lemma 4.1 when $L$ is the uniform kernel on $[-1, 1]$ and $w = \tfrac{1}{2} b$.


Special Case: Linearly weighted binning.

An improvement on the histogram binning for kernel density estimation is linearly

weighted binning (Jones and Lotwick, 1983). For an observation $X \in (m_i, m_{i+1}]$, the probability mass $n^{-1}$ associated with $X$ is divided between the two bin midpoints $m_i$ and $m_{i+1}$ by assigning weight $n^{-1} b^{-1} (m_{i+1} - X)$ at $m_i$ and $n^{-1} b^{-1} (X - m_i)$ at $m_{i+1}$. Jones (1989) points out that this is the same as binning with a triangular kernel on $[-1, 1]$ with $w = b$. In this case,

which implies that the triangular kernel gives slightly more bias than the histogram binning.

The first terms in the asymptotic expansion of the variance are:

implying that the variance of the triangular kernel is nearly the same as the histogram binner.
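The proximity-weight assignment just described can be sketched as follows; this is an illustrative implementation (the grid and data are hypothetical), not the dissertation's code:

```python
import numpy as np

def linear_bin(data, midpoints):
    """Linearly weighted binning: split each observation's n^-1 mass between the
    two neighbouring midpoints in proportion to proximity (triangular kernel)."""
    b = midpoints[1] - midpoints[0]
    w = np.zeros(len(midpoints))
    pos = np.clip((data - midpoints[0]) / b, 0.0, len(midpoints) - 1 - 1e-12)
    left = pos.astype(int)
    frac = pos - left
    np.add.at(w, left, 1.0 - frac)   # weight proportional to (m_{i+1} - X)/b at m_i
    np.add.at(w, left + 1, frac)     # weight proportional to (X - m_i)/b at m_{i+1}
    return w / len(data)

mids = np.arange(-4.0, 4.05, 0.1)
rng = np.random.default_rng(2)
w = linear_bin(rng.normal(size=1000), mids)
print(round(float(w.sum()), 10))  # 1.0: total mass preserved
```

Unlike histogram binning, each observation contributes to two midpoints, so the binned "counts" are no longer integers, but the total mass is still exactly one.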


6. Proofs.

Proof of lemma 4.1:

Lemma 4.1.1: As $b \to 0$,

$$\int_{I_q} f(x)\,dx \int_{I_r} f(y)\,dy = b^2 f(c_q) f(c_r) + \frac{b^4}{24} \left( f''(c_q) f(c_r) + f''(c_r) f(c_q) \right) + o(b^4).$$

$$\int_{I_q} f(x)\,dx \int_{I_r} f(y)\,dy \int_{I_s} f(z)\,dz =$$

The proof of lemma 4.1.1 follows from a Taylor expansion of $f(x)$ around the bin midpoints. $\square$

i. Expectations:

$$E[\tilde\theta_m(h, n, K)] = E\left[ (-1)^m \left\{ \sum_{i=0}^{\nu-1} K_h^{(2m)}(ib) \left[ \frac{\sum_{|q-r|=i} n_q n_r}{n(n-1)} \right] - \frac{K_h^{(2m)}(0)}{n-1} \right\} \right]$$


Using lemma 4.1.1: $E[\tilde\theta_m(h, n, K)] =$

Using Hall and Marron (1987):

$$= \theta_m + \theta_{m+1} \left[ \frac{h^2}{2} \left\{ \int u^2 K(u)\,du \right\} + \frac{b^2}{12} \right] + o(b^2) + o(h^2).$$

ii. Variances:

Lemma 4.1.2: Let $n^{(a)} = n(n-1) \times \cdots \times (n-a+1)$. For $q \neq r \neq s \neq t$:

$$\mathrm{Var}(n_q^2) = E(n_q^4) - E(n_q^2) E(n_q^2) = 4n^{(3)} p_q^3 + 6n^{(2)} p_q^2 - 4n^3 p_q^4 + (10n^2 - 6n) p_q^4 - 4n^{(2)} p_q^3 - n p_q^2 + n p_q,$$

$$\mathrm{Cov}(n_q^2, n_r^2) = E(n_q^2 n_r^2) - E(n_q^2) E(n_r^2) = -4n^3 p_q^2 p_r^2 + (10n^2 - 6n) p_q^2 p_r^2 - 2n^{(2)} p_q^2 p_r - 2n^{(2)} p_q p_r^2 - n p_q p_r,$$

$$\mathrm{Cov}(n_q^2, n_q n_r) = E(n_q^3 n_r) - E(n_q^2) E(n_q n_r) = 2n^{(3)} p_q^2 p_r - 4n^3 p_q^3 p_r + (10n^2 - 6n) p_q^3 p_r - 2n^{(2)} p_q^2 p_r + n^{(2)} p_q p_r,$$

$$\mathrm{Var}(n_q n_r) = E(n_q^2 n_r^2) - \left[ E(n_q n_r) \right]^2 = n^{(3)} p_q^2 p_r + n^{(3)} p_q p_r^2 + n^{(2)} p_q p_r + (10n^2 - 6n) p_q^2 p_r^2 - 4n^3 p_q^2 p_r^2,$$

$$\mathrm{Cov}(n_q^2, n_s n_r) = E(n_q^2 n_s n_r) - E(n_q^2) E(n_s n_r) = -4n^3 p_q^2 p_r p_s + (10n^2 - 6n) p_q^2 p_r p_s - 2n^{(2)} p_q p_r p_s,$$

$$\mathrm{Cov}(n_q n_r, n_s n_t) = E(n_q n_r n_s n_t) - E(n_q n_r) E(n_s n_t) = -4n^3 p_q p_r p_s p_t + (10n^2 - 6n) p_q p_r p_s p_t.$$

Proof of lemma 4.1.2:

All the identities are proved using $E\left[ n_{q_1}^{(a_1)} \times \cdots \times n_{q_k}^{(a_k)} \right] = n^{\left( \sum_{i=1}^{k} a_i \right)} p_{q_1}^{a_1} \times \cdots \times p_{q_k}^{a_k}$ and tedious algebra. $\square$


Lemma 4.1.3: As $b, h \to 0$; $nb, nh \to \infty$; $b/h \to \gamma$:

Proof of lemma 4.1.3:

Lemma 4.1.4: Let:

Then as $b, h \to 0$; $nh, nb \to \infty$; $b/h \to \gamma$:


Proof of lemma 4.1.4:

$$= \sum_{q=1}^{\nu} \sum_{r=1}^{\nu} \sum_{s=1}^{\nu} K_h^{(2m)}(c_q - c_r)\, K_h^{(2m)}(c_r - c_s) \int_{I_q} f(x)\,dx \int_{I_r} f(y)\,dy \int_{I_s} f(z)\,dz$$

$$= \iiint K_h^{(2m)}(x - y)\, K_h^{(2m)}(y - z)\, f(x) f(y) f(z)\,dx\,dy\,dz$$
$$\quad + \frac{b^2}{24} \iiint K_h^{(2m)}(x - y)\, K_h^{(2m)}(y - z) \left( f''(x) f(y) f(z) + f''(y) f(x) f(z) + f''(z) f(x) f(y) \right) dx\,dy\,dz + \cdots$$

Using integration by parts:

Now return to proof of Lemma 4.1:


Combining terms and substituting using lemma 4.1.1 gives:

$$= \frac{1}{n^2 (n-1)^2} \Big\{ 4n^{(3)} \sum_{q=1}^{\nu} \sum_{r=1}^{\nu} \sum_{s=1}^{\nu} K_h^{(2m)}(c_q - c_r)\, K_h^{(2m)}(c_r - c_s)\, p_q p_r p_s - \cdots \Big\}$$


Using lemmas 4.1.2 and 4.1.3:

Proof of theorem 4.1 and theorem 4.2:

Immediate from lemma 4.1.

Proof of lemma 4.2:

i. $E[\tilde\theta_m(h, n, K)] =$


For the first term:

$$= \frac{n-1}{n} \times$$


For the second term,

As $b \to 0$:

Part i of lemma 4.2 follows directly. $\square$

For part ii of lemma 4.2:


Further,

$$= \left[ \iint K_h^{(2m)}(x - y)\, f(x) f(y)\,dx\,dy + O(w) \right]^2$$


$$= \sum_{q=1}^{\nu} \sum_{r=1}^{\nu} \sum_{s=1}^{\nu} \sum_{t=1}^{\nu} b^4\, K_h^{(2m)}(c_q - c_r)\, K_h^{(2m)}(c_s - c_t) \left( w^{-2} f(c_q) f(c_t)\, L\!*\!L(\gamma[r - t])\, L\!*\!L(\gamma[s - q]) + o(w^{-2}) \right)$$

$$= \gamma^2 \sum_{i=-\infty}^{\infty} \sum_{j=-\infty}^{\infty} L\!*\!L(\gamma i)\, L\!*\!L(\gamma j) \sum_{q} \sum_{t} b^2\, K_h^{(2m)}(c_q - c_t - ib)\, K_h^{(2m)}(c_q + jb - c_t) \left( f(c_q) f(c_t) + o(1) \right)$$

$$= \gamma^2 K_h^{(2m)}(0) \sum_{i=-\infty}^{\infty} \sum_{j=-\infty}^{\infty} L^2\!*\!L(\gamma i, \gamma j) \left( \theta_m + o(1) \right).$$


Substituting and combining terms gives:

$\square$


Chapter V: Asymptotics and Exact Calculations

1. Introduction:

A primary tool for studying smoothing estimators is asymptotics. The behavior of the

estimator is studied by examining its asymptotic behavior as the sample size, n, goes to infinity

and the bandwidth, h, goes to O. Since the bandwidth should not go to 0 too quickly, the

requirement that $nh \to \infty$ is usually added.

An important question always raised by asymptotics is how large $n$ must be before the asymptotics approximate the true behavior well. The goal of the estimators discussed

in this research is generally minimizing their MSE. Usually, this is approached by minimizing

an asymptotic MSE. This chapter investigates how well the asymptotic MSE approximates the

true MSE.

2. Comparison of Asymptotic and Exact Risks.

The asymptotic variance and squared bias of $\hat\theta_m$ (the leave-out-the-diagonals estimator) are calculated in Hall and Marron (1987). They show that:

$$\mathrm{Var}(\hat\theta_m) = \{2n^{-2} + o(n^{-2})\}\, \mathrm{Var}[K_h^{(2m)}(X_1 - X_2)] + \{4n^{-1} + o(n^{-1})\}\, \mathrm{Cov}[K_h^{(2m)}(X_1 - X_2),\, K_h^{(2m)}(X_2 - X_3)].$$

So,

$$\mathrm{Var}(\hat\theta_m) = \{2n^{-2} + o(n^{-2})\} \left[ E_f\!\left( K_h^{(2m)2} \!*\! f(X) \right) - \left( E_f (K_h^{(2m)} \!*\! f)(X) \right)^2 \right] + \{4n^{-1} + o(n^{-1})\}\, \mathrm{Var}_f\!\left[ (K_h^{(2m)} \!*\! f)(X) \right].$$

By letting $n \to \infty$, $h \to 0$, and $nh \to \infty$ they show that the asymptotic values are given by:

The first question that arises in studying the relationship between the true value and the

asymptotic values is whether one is always larger than the other. In most settings, the AMSE is

larger than the MSE (see figures 5.1 and 5.2 for some examples). ABIAS² is always greater than BIAS². Unfortunately, AVAR is not always greater than VAR. As will be shown, the difficulty

occurs when the 2mth derivative of the density is "smoother" than the kernel. In this case,

AVAR will underestimate VAR.

For the Bias component, the next theorem resolves the issue.

Theorem 5.1: $\mathrm{ABIAS}^2 \geq \mathrm{BIAS}^2$.

Proof: The proofs of all theorems in this chapter are given in Section 5.


For the Variance component, however:

$$\mathrm{Var}(\hat\theta_m) = \{2n^{-2} + o(n^{-2})\}\, \mathrm{Var}[K_h^{(2m)}(X_1 - X_2)] + \{4n^{-1} + o(n^{-1})\}\, \mathrm{Cov}[K_h^{(2m)}(X_1 - X_2),\, K_h^{(2m)}(X_2 - X_3)]$$

For small bandwidths and m > 0 the first component is much more significant than the

second. Theorem 5.2 shows that this component of the variance is less than its asymptotic

value.

For the second component, however:

$$\mathrm{Cov}[K_h^{(2m)}(X_1 - X_2),\, K_h^{(2m)}(X_2 - X_3)] = E[K_h^{(2m)}(X_1 - X_2)\, K_h^{(2m)}(X_2 - X_3)] - E^2[K_h^{(2m)}(X_1 - X_2)]$$

$$= \iiint K(u) K(v)\, f(x)\, f^{(2m)}(x - hu)\, f^{(2m)}(x + hv)\,dx\,du\,dv - \left( \iint K(u)\, f^{(m)}(x)\, f^{(m)}(x - hu)\,dx\,du \right)^2$$

The asymptotic value of the covariance is:


These expressions suggest that if the convolution of the kernel and the $2m$th derivative of $f$ is "smoother", i.e., wiggles up and down less than the $2m$th derivative of $f$ alone, then the asymptotic covariance will be greater than the covariance. Usually, this will be the case. The $2m$th derivative of $f$ is likely to be filled with "peaks and valleys". Convolving it with a kernel fills in these "peaks and valleys" so the convolution will be less variable. However, suppose a standard normal kernel is used and $f(x) = -\tfrac{3}{4} x^2 + \tfrac{3}{4}$ for $x \in (-1, 1)$ and 0 elsewhere. Then,

while:

Of course, this only happens because $f^{(2m)}(x)$ is "smoother" than the kernel over the range where $f(x) > 0$.

In the exact calculations described below, the asymptotic variance was always greater than the

true variance. The exact calculations were done with Normal mixture densities. The

derivatives of the Normal mixture densities are continuous but are not as smooth as a Normal

kernel. It seems reasonable that for some class of densities the asymptotic variance is always greater

than the true variance, but I did not find an illuminating way of characterizing this class.

Hence,

Conjecture: For f in some smoothness class relative to the kernel, AVAR ≥ VAR.

In all the examples studied, h_MSE > h_AMSE. Intuitively, the asymptotic variance

approximates the true variance better than the asymptotic squared bias approximates the


squared bias. The asymptotic squared bias is much greater than the squared bias. The

asymptotic variance is only slightly greater than the true variance. In trading off asymptotic

variance and squared bias, h_AMSE favors variance more than h_MSE to avoid the inflated

asymptotic squared bias. Again, I do not have a general proof of this, so:

Conjecture: Under reasonable conditions, h_MSE > h_AMSE.

3. Exact MSE calculations.

The asymptotics have several shortcomings. As with all asymptotics, it is difficult to

know how close the limiting values are to actual values using finite sample sizes and realistic

bandwidths. The asymptotics may require tremendously large sample sizes before they are a

reasonable approximation to truth. Secondly, the asymptotics contain constants that are

difficult to comprehend.

Another approach is to compute exact MSE's for a class of distributions that is

representative of a wide range of distributions. As will be shown below, if f is a mixture of

Gaussian densities (or even derivatives of Gaussian densities) and K is a Gaussian kernel, then it

is possible to compute exact values for MSE(θ̂_{m,h}).

a. Lemmas and Notation:

Several lemmas are needed...

Lemma 5.1: For σ_1, σ_2 > 0, and r_1, r_2 = 0, 1, 2, ...
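The full statement of this lemma did not survive reproduction here, but its simplest special case (r_1 = r_2 = 0) is the classical Gaussian product formula, ∫ φ_{σ1}(x − μ1) φ_{σ2}(x − μ2) dx = φ_{√(σ1²+σ2²)}(μ1 − μ2). The sketch below checks that special case numerically; it is a hedged illustration, not the lemma as stated.

```python
import math

def phi(x, mu=0.0, sigma=1.0):
    """Gaussian density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def product_integral(mu1, s1, mu2, s2, lo=-30.0, hi=30.0, steps=120000):
    """Midpoint-rule approximation of the integral of the product of two
    Gaussian densities; used only to check the closed form below."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        total += phi(x, mu1, s1) * phi(x, mu2, s2)
    return total * h

# Closed form: the product integrates to a single Gaussian density
# evaluated at the difference of the means.
numeric = product_integral(0.3, 1.0, -0.5, 2.0)
exact = phi(0.3 - (-0.5), 0.0, math.sqrt(1.0**2 + 2.0**2))
```

Identities of this type (extended to derivatives via Hermite polynomials) are what make the exact calculations in this chapter possible.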


Lemma 5.2: For σ_1, σ_2 > 0, and r_1, r_2 = 0, 1, 2, ...

Lemma 5.3: For σ_j > 0,

Lemma 5.4: For σ_j > 0, and r_j = 0, 1, 2, ...

where μ̃_j = μ_j − μ̄,

H_r(x) is the rth Hermite polynomial,

OF(n) is the odd factorial of n,

OF(n) = 0 if n is odd.


The variance of θ̂_{m,h} involves integrals like those in Lemma 5.4. It is not very

enlightening to write the complete sum each time the integral appears in the variance. However,

the sum is straightforward to program, so the integral can be computed easily and exactly. Let

∫ ∏_{j=1}^{n} φ_{σ_j}^(r_j)(x − μ_j) dx = I_n(r̄, μ̄, σ̄),

where

r̄ = (r_1, r_2, ..., r_n), μ̄ = (μ_1, μ_2, ..., μ_n), σ̄ = (σ_1, σ_2, ..., σ_n).

Then:

Lemma 5.5:

where

Lemma 5.6:

r̄ = (2m, 2m, 0)

μ̄ = (0, 0, μ_j − μ_l)

σ̄ = (h, h, √(σ_j² + σ_l²)),

where μ̄_G = (μ_j, μ_j, μ_l)

σ̄_G = (√(σ_j² + h²), √(σ_l² + h²), σ_l)


b. Theorems:

The lemmas above allow MSE(θ̂_{m,h}) to be computed for f a Gaussian mixture density.

For the theorems in this section suppose that f is a mixture of k Gaussian densities, i.e.,

Theorem 5.3:

Theorem 5.4:

(n- 2)+4 n(n -1)

where μ_{ij}, σ_{ij}, r̄, μ̄_{ijl}, σ̄_{ijl}, and σ*_{ij} are as defined above.

Theorem 5.5:

Of course, MSE(θ̂_{m,h}) follows easily from the theorems.
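A flavor of these exact calculations in the simplest case (m = 0, f the standard Gaussian, K a Gaussian kernel; a hedged illustration, not one of the theorems above): the double expectation of the kernel functional estimator reduces, via the Gaussian convolution identity, to E[θ̂_0] = 1/(2√(π(1 + h²))), while θ_0 = 1/(2√π), so the exact bias is available in closed form for every bandwidth.

```python
import math

def exact_mean_theta0_hat(h):
    """Exact E[theta0_hat] for f = N(0,1) and a Gaussian kernel: the double
    expectation collapses to one Gaussian density evaluated at 0 with
    variance 2 + 2h^2, i.e. 1 / (2 * sqrt(pi * (1 + h^2)))."""
    return 1.0 / (2.0 * math.sqrt(math.pi * (1.0 + h * h)))

theta0 = 1.0 / (2.0 * math.sqrt(math.pi))  # true value of the functional

# The exact (negative) bias shrinks to 0 as h -> 0, with no asymptotics needed.
biases = [exact_mean_theta0_hat(h) - theta0 for h in (1.0, 0.5, 0.25, 0.1)]
```

This is the kind of "exact versus asymptotic" comparison the figures in the next section display for mixture densities and general m.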


4. Examples.

Figure 5.1 shows a typical example. The graph shows the asymptotic MSE and the true

variance, squared bias, and MSE of θ̂_2 for a wide range of bandwidths. The risks were

standardized by dividing them by (θ_2)². The Normal mixture density used is distribution #4

given in Appendix A. This distribution is not particularly spiky, and presents a relatively easy

estimation problem in this context. Several features are worth noting:

• For every bandwidth, AMSE > MSE.

• The asymptotic variance approximates the true variance much better than the asymptotic

squared bias approximates the true squared bias.

• h_AMSE is fairly close to h_MSE. Further, MSE(h_AMSE) is not much greater than

MSE(h_MSE).

Figure 5.2 is identical to Figure 5.1 except Distribution #11 in Appendix A is used

instead. Distribution #11 is very spiky and presents a terrible estimation problem. Several

features distinguish this problem from the much easier problem posed by Distribution #4:

• h_AMSE is not very close to h_MSE. The intuition for this is that h_AMSE is too heavily

impacted by trying to estimate the curvature in the spikes.

• MSE(h_AMSE) is much greater than MSE(h_MSE).

• The asymptotic squared bias is a terrible approximation to the true squared bias.


Figure 5.1: MSE vs log(Bandwidth) (Dstn #4; Sample Size = 1000). [Graph not reproduced; legend: AMSE, MSE, Sq. Bias, Variance.]

Figure 5.2: MSE vs log(Bandwidth) (Dstn #11; Sample Size = 1000). [Graph not reproduced; legend: AMSE, MSE, Sq. Bias, Variance; h_AMSE and h_MSE marked on the bandwidth axis.]

5. Proofs.

Proof of Theorem 5.1:

From Hall and Marron (1987),

Using the integral form of Taylor's remainder theorem:

So by a change of variables,

Integrating by parts,

So by Cauchy-Schwarz:

|Bias| ≤ h² θ_{m+1} ∫ u² K(u) du = |ABias|.

Proof of Theorem 5.2:


So by a direct application of Cauchy-Schwarz:

Proof of Lemma 5.1: This is Theorem 4 in Aldershof et al. (1990).

Proof of Lemma 5.2: This is Corollary 4.4 in Aldershof et al. (1990).

Proof of Lemma 5.3: This is Theorem 3 in Aldershof et al. (1990).

Proof of Lemma 5.4: This is Theorem 5.1 in Aldershof et al. (1990).

Proof of Lemma 5.5:

From Hall and Marron (1987),

Using Lemma 5.2:


Proof of Lemma 5.6:

From Hall and Marron (1987),

= ∫∫∫ h^{−4m} K̄^(2m)(u) K̄^(2m)(v) f(x + hu) f(x) f(x − hv) dx du dv.

Using Lemma 5.1:

Proof of Theorem 5.3:

From Hall and Marron (1987),


Using Lemma 5.1:

Using Lemma 5.2:

Proof of Theorem 5.4:

The proof follows easily from the lemmas. First,

The result then follows from substitution and counting terms.

Proof of Theorem 5.5: Follows directly from Lemma 5.2.


Chapter VI: Estimability of θ_m

1. Introduction.

The object of this section is to answer the question "How much more difficult is it to

estimate θ_{m+1} than θ_m?". Intuitively, it seems that θ_m should become increasingly difficult to

estimate as m grows larger. In other words, the larger m is, the larger the sample size that

should be required to estimate θ_m with some fixed accuracy. In general, integration smooths a

function and differentiation makes it more jagged. The more jagged a function is, the more

data is required to resolve its features. The purpose of this section is to clarify this

intuition by providing it with some mathematical basis and quantification.

2. Asymptotics.

a. Assumptions.

In this section assume that the kernel K is a probability density function. We also need

to assume some smoothness on the density f. Specifically, f has smoothness of order p when a

constant M > 0 exists so that for all x, y and 0 < α ≤ 1:

|f^(l)(x) − f^(l)(y)| ≤ M |x − y|^α,

where p = l + α. Assume that p ≥ m + 2.

b. Results.

By Hall and Marron (1987), the minimum asymptotic MSE is given by:

For m = 0:

For m> 0:

Hence,

For m = 0:

For m> 0:

inf_h {MSE(θ̂_{m,h})} = O(n^{−8/(4m+5)}).

These asymptotic results give an indication of how much larger the estimation error is

when the sample size remains fixed as m increases. Obviously, as m increases the infimum of

the MSE increases.

Another angle is to allow the sample size to vary but fix the estimation error. An

intuitive measure of estimability is:


In other words, Ntol is the smallest sample size necessary so that, for some bandwidth, the

standardized MSE is less than tol (tol stands for "tolerance"). Using the above equations it is

easy to see that (suppose here that m > 0):

N_tol(m) ≈ C(m, f) (tol)^{−(4m+5)/8},

where C(m, f) is a constant depending on m and the density f. The asymptotic expression for

N_tol provides one answer to the question "How much more difficult is it to estimate θ_{m+1} than

θ_m?". The difficulty with this answer is that the constant terms are complicated. It does make clear,

however, that if tol is very small (i.e., very accurate estimation is required) it is much harder to

estimate θ_{m+1} than θ_m.

3. Exact MSE calculations.

a. Introduction.

The asymptotic results are useful only in a general way. The practical question seems

to be: "If θ_{m+1} is estimated instead of θ_m, how much larger must the sample be?". The ideal

answer is provided by N_tol(m), but this will never be known in realistic settings.

Exact calculations based on methods discussed in Chapter V provide some insights here.

N_tol(m) can be calculated exactly for mixtures of Gaussian distributions. The results are

discussed below.


c. Results.

Appendix A shows fifteen Gaussian mixture densities. For each density, N_{1/2}(m) was

computed for m = 0, 1, ..., 5 using both exact and asymptotic calculations. Recall that:

The values are shown in Tables 6.1a and 6.1b. Several features are worth noting:

i) N_{1/2}(m) is a strictly monotone increasing function of m.

ii) The asymptotic calculations always result in a higher value of N_{1/2}(m) than the exact

calculations.

This is simply a consequence of AMSE(θ̂_{m,h}) > MSE(θ̂_{m,h}) for these distributions and a

Gaussian kernel. Perhaps more interesting is how much worse the asymptotic calculation gets

as m grows larger. The asymptotic value seems relatively close to the exact value for m ≤ 2, but

is much too large for m > 2. Interestingly, the ratios of the asymptotic values to exact values

are fairly constant across distributions for fixed m.
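Taking the m = 5 rows of Table 6.1a at face value, the near-constancy of the asymptotic/exact ratio across distributions can be summarized directly (values copied from the table; distribution 6 is the visible outlier):

```python
from statistics import mean, stdev

# Asymptotic/Exact ratios for m = 5, distributions 1-8 (Table 6.1a).
ratios_m5 = [4.69, 4.83, 4.41, 4.51, 4.68, 3.68, 4.59, 4.58]

center = mean(ratios_m5)   # roughly 4.5
spread = stdev(ratios_m5)  # small relative to the center
```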

iii) The "spiky" distributions (distributions 11-14) have very large values of N1/ 2(m).

Lots of data is required to resolve these complicated features. Nevertheless, the high

values for N1/ 2( 1) are disconcerting. Estimating the squared first derivative seems as if it

should be relatively simple, particularly with such a large tolerance. The high values of N1/ 2(1)

demonstrate that this is not always the case.


iv) The standardization process leads to some counterintuitive results.

For example, N_{1/2}(m) is greater for distribution 13 than distribution 11 for m > 1 (the

"spikes" for x < 0 are the same). Intuitively, it seems that distribution 11 is "spikier" than

distribution 13, so it should require more data to estimate its derivatives. This is half-correct.

Distribution 11 is "spikier" than distribution 13 in that θ_m for m > 0 is greater for distribution

11 than for distribution 13. However, we need approximately the same amount of data for x < 0

to resolve the spikes in both distributions. Our standardization process implies that half of our

data in estimating θ_m for distribution 13 is wasted in the positive sector. That is, data in the

positive sector is much more useful in reducing the ratio of the MSE to (θ_m)² for distribution 11

than for distribution 13.

It is useful to ask how N_tol changes with tol. Figures 6.1a and 6.1b show plots of

ln(N_tol(0)) and ln(N_tol(1)), respectively, against tol for distributions 6 and 12. The differences

between ln(N_tol(x)) for each distribution remained remarkably constant. Note also that as

tol → 0, ln(N_tol(x)) → ∞ at an ever increasing rate.


Table 6.1a: Values of N_{1/2}(m) for m = 0, ..., 5; Distns 1-8.

Distn   m     Exact   Asymptotic   Asymptotic/Exact
  1     0         2            3       1.50
  1     1         4            7       1.75
  1     2        22           37       1.68
  1     3       133          327       2.46
  1     4      1073         3786       3.53
  1     5     11553        54171       4.69
  2     0         2            3       1.50
  2     1         5            9       1.80
  2     2        35           63       1.80
  2     3       297          674       2.27
  2     4      2460         8947       3.64
  2     5     29221       141185       4.83
  3     0         3            5       1.67
  3     1        34           59       1.74
  3     2       320          689       2.15
  3     3      3110         8289       2.67
  3     4     30426       107160       3.52
  3     5    348161      1536700       4.41
  4     0         3            4       1.33
  4     1        21           25       1.19
  4     2        98          156       1.59
  4     3       580         1374       2.37
  4     4      4619        15910       3.44
  4     5     50513       227682       4.51
  5     0         2            3       1.50
  5     1         5            7       1.40
  5     2        25           42       1.68
  5     3       154          369       2.40
  5     4      1220         4274       3.50
  5     5     13071        61160       4.68
  6     0         2            3       1.50
  6     1        10           22       2.20
  6     2        95          126       1.33
  6     3       344          611       1.78
  6     4      1446         3853       2.66
  6     5      8912        32760       3.68
  7     0         2            3       1.50
  7     1         7            9       1.29
  7     2        23           50       2.17
  7     3       185          497       2.69
  7     4      1667         5569       3.34
  7     5     15514        71216       4.59
  8     0         2            3       1.50
  8     1        18           40       2.22
  8     2       219          356       1.63
  8     3      1338         3004       2.25
  8     4      9797        33952       3.47
  8     5    105636       483443       4.58

Table 6.1b: Values of N_{1/2}(m) for m = 0, ..., 5; Distns 9-15.

Distn   m       Exact   Asymptotic   Asymptotic/Exact
  9     0           2            3       1.50
  9     1          20           57       2.85
  9     2         419          931       2.22
  9     3        4256        11348       2.67
  9     4       40267       149398       3.71
  9     5      546667      2294200       4.20
 10     0           3            5       1.67
 10     1          45           51       1.13
 10     2         105          190       1.81
 10     3         536         2101       3.92
 10     4        9756        40012       4.10
 10     5      148653       572513       3.85
 11     0           3            7       2.33
 11     1        5746         9347       1.63
 11     2       27299        59104       2.17
 11     3      187695       522987       2.79
 11     4     1642625      6059200       3.69
 11     5    18611679     86712000       4.66
 12     0           3            6       2.00
 12     1         303          998       3.29
 12     2        5840        15408       2.64
 12     3       57442       175577       3.06
 12     4      583877      2196100       3.76
 12     5     6853345     32165000       4.69
 13     0           3            6       2.00
 13     1        5334        11770       2.21
 13     2       46939       101970       2.17
 13     3      326783       910542       2.79
 13     4     2860989     10552000       3.69
 13     5    32415676    151000000       4.66
 14     0           3            8       2.67
 14     1         261          799       3.06
 14     2        4547        12473       2.74
 14     3       49026       148424       3.03
 14     4      491216      1838700       3.74
 14     5     5703710     26772000       4.69
 15     0           3            7       2.33
 15     1          82          143       1.74
 15     2         443         1074       2.42
 15     3        3960        11519       2.91
 15     4       38465       127983       3.33
 15     5      353857      1577100       4.46

Figure 6.1a: ln(N_tol(0)) vs tolerance (distributions 6 and 12). [Graph not reproduced.]

Figure 6.1b: ln(N_tol(1)) vs tolerance (distributions 6 and 12). [Graph not reproduced.]

Chapter VII: The "One-Step" Estimator

1. Introduction.

Hall and Marron (1987) provide the asymptotic MSE optimal bandwidth for estimating

θ_m using the "leave-out-the-diagonals" estimator for m ≥ 1. This bandwidth is given below in

Theorem 7.1. The difficulty is that the bandwidth is a function of θ_0 and θ_{m+1}, both of which

are presumably unknown. It seems that for practical estimation problems, this "optimal"

bandwidth is of little help.

This chapter discusses a "one-step" estimator in which a kernel estimator of θ_{m+1} is

used to estimate the optimal bandwidth for θ_m. Of course, the estimator of θ_{m+1} also requires a

bandwidth, so it is apparently no improvement over the "no-step" estimator. It will be shown

that the "one-step" estimator can be less sensitive to this choice of bandwidth. That is,

deviations from the optimal bandwidth do not cause large deviations in its MSE compared to

the "no-step" estimator. The cost of this is that with optimal choices of bandwidth the

"one-step" estimator has higher MSE than the "no-step" estimator.

The "one-step" estimator provides a compelling reason to calculate a "leave-in-the

..diagonals" estimator. If the kernel estimate of 8m+! is negative, the "one-step" estimator is a

kernel estimator with a bandwidth that is a fractional root of a negative number. To avoid this,

the kernel estimator for 8m+! used below is the "leave-in-the diagonals" estimator since it is

always positive. The kernel estimator for 8m below is the "leave-out-the-diagonals" estimator.

The "leave-in-the diagonals" estimator is discussed in chapter VIII.

Since θ_0 is much easier to estimate than θ_m for m > 0, in this chapter θ_0 will be

considered known. A complete demonstration and explanation of this is found in Chapter VI.

Hall and Marron (1987) show that the bandwidth selection problem for θ_0 is also much easier

than for θ_m. For moderately smooth f, any sequence of bandwidths {h_n} such that

n^{−1/(4m+1)} < h_n < n^{−1/4}, for m the order of the kernel, provides the asymptotically optimal MSE

convergence rate for θ̂_0. Since there is no unknown functional needed in this sequence, θ_0 has no

"one-step" estimator. All the results in this chapter are valid only for estimating θ_m for m ≥ 1.

2. Assumptions and Notation.

Let:

where K̄ is a convolution density function. Some technical problems are avoided by assuming

that θ_m is bounded away from 0. This follows from any number of conditions, including f

having compact support, among many others. In

practical problems, θ_m is usually very large so the assumption is not too important for practical

results.

A "leave-out-the-diagonals" estimator will be denoted:


Several functionals of the kernel are abbreviated. They are:


Many of the following results are asymptotic results. An asymptotic quantity is denoted

with an "A", e.g., AMSE denotes "asymptotic MSE". An asymptotic quantity will always be

equal to actual quantities with remainders set equal to zero.

3. Results.

Theorem 7.1 is given by Hall and Marron (1987) and serves as the motivation for the

chapter. The "optimal" bandwidth is difficult to use because it requires two unknown

functionals, θ_0 and θ_{m+1}. A "one-step" estimator uses a kernel estimate of each of these

unknown functionals to estimate the optimal bandwidth.

Theorem 7.1:

For m ~ 1, the "no-step" estimator achieves its minimum MSE by taking

{

2}1/(4m+5)h _ (8m+ 2)80Km n" ( -2/(4m+S))0- 2 +0 n .

8m +!
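Read as a plug-in recipe, the theorem's bandwidth is a one-liner; K_m is a kernel-dependent constant from Hall and Marron (1987) and is treated here as a given input (a hedged sketch, not their exact constant):

```python
def no_step_bandwidth(n, m, theta0, theta_m_plus_1, Km):
    """h0 = { (8m+2) * theta0 * Km / (theta_{m+1}**2 * n**2) } ** (1/(4m+5))."""
    return ((8 * m + 2) * theta0 * Km /
            (theta_m_plus_1 ** 2 * n ** 2)) ** (1.0 / (4 * m + 5))

# The rate n**(-2/(4m+5)) is visible directly:
# doubling n shrinks h0 by the factor 2**(-2/(4m+5)).
h_a = no_step_bandwidth(1000, 1, 0.28, 0.21, 1.0)
h_b = no_step_bandwidth(2000, 1, 0.28, 0.21, 1.0)
```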

Proof: The proofs of all theorems in this chapter are given in Section 6.

Since θ_0 is easier to estimate than θ_m for m > 0, it will be considered known. That is,

some estimate (presumably a kernel estimate) should be used, but the estimation error will be

ignored. So let


The focus of the chapter will be the performance of ĥ_0 and θ̂_m(ĥ_0). Define θ̂*_m ≡ θ̂_m(ĥ_0)

to be the "one-step" estimator. Lemma 7.1 is needed to describe the MSE of θ̂*_m.

Lemma 7.1: If θ_0 is known and m ≥ 1, then for n → ∞, h → 0, h > n^{−1/2m}:

For a discussion of the properties of e_m(t) see Section 7.

Theorem 7.2 is a direct consequence of Lemma 7.1. The theorem shows that ĥ_0 is a

biased estimator for h_0 and quantifies the asymptotic rate of convergence of ĥ_0 to h_0.


Theorem 7.2: If θ_0 is known, then for n → ∞, h → 0, h > n^{−1/2m}:

Lemmas 7.2 and 7.3 describe the squared bias and variance of θ̂*_m. Of course, these are

necessary for describing MSE(θ̂*_m). Notice that the bias looks similar to the bias for the

known optimal bandwidth except for the term involving the e-function. The variance is also

similar except a term that looks like bias is added on, and it contains an additional e-function.

Lemma 7.2: If θ_0 is known, then for n → ∞, h → 0, h > n^{−1/2m}:

Lemma 7.3: If θ_0 is known, then for n → ∞, h → 0, h > n^{−1/2m}:

Theorem 7.3 describes the MSE for θ̂*_m. Theorem 7.3 is the main result of the chapter.

Theorem 7.3: If θ_0 is known, then for n → ∞, h → 0, h > n^{−1/2m}:


Note:

Theorem 7.3 provides the motivation for using the one-step estimator. As long as h is

chosen so that MSE(θ̂_{m+1}(h)) is relatively small, MSE(θ̂*_m) is equal to the optimal value for

AMSE(θ̂_m) plus a small, positive remainder. Note that the "one-step" MSE is larger than the

optimal MSE for a kernel estimator of θ_m on two counts. First, since the "one-step" bandwidth

estimates an asymptotic MSE optimal bandwidth (rather than the actual MSE optimal

bandwidth), the best that can be hoped for is that the "one-step" MSE is close to the infimum

of the asymptotic MSE. By Chapter V, this is greater than the infimum of the MSE. Second,

the remainder term is positive and, in fact, unbounded for terrible choices of h.

For optimal bandwidths, the one-step estimator has higher MSE than the no-step

estimator. However, notice that the remainder term resembles the squared bias of the no-step

estimator at its asymptotic MSE optimal bandwidth, i.e.,

Remainder: (h_0^4 / 4) (∫ u² K(u) du)² MSE(θ̂_{m+1}(h))

Squared Bias: (h_0^4 / 4) (∫ u² K(u) du)² [θ_{m+1}]²

For a reasonable choice of h (and large n) we should have MSE(θ̂_{m+1}(h)) < [θ_{m+1}]². An

indication of how large n must be is given by Chapter VI. Hence, in MSE, the "one-step"

estimator resembles the "no-step" estimator at its asymptotically optimal bandwidth plus some

added bias.


Corollary 7.3.1: For any fixed bandwidth h:

MSE(θ̂*_m(h)) = O(n^{−8/(4m+5)}).

Corollary 7.3.1 is a consistency result. For a fixed bandwidth, increasing the sample

size causes the MSE to go to zero. In contrast, for a fixed bandwidth the "no-step" estimator is

not consistent since its bias is unaffected by sample size.

Corollary 7.3.1 seems startling. A consistent estimator is nearly always preferred over a

non-consistent estimator. This corollary seems to imply that the "one-step" estimator is a much

better estimator than the "no-step" estimator. Some intuition may help clarify the corollary. Notice that

the "one-step" bandwidth is given by:

ĥ_0 = constant × (n θ̂_{m+1})^{−2/(4m+5)}.

So as n → ∞, ĥ_0 = ĥ_0(n) = O(n^{−2/(4m+5)}). The "one-step" estimator is consistent because of this

explicit tie between the bandwidth and the sample size. The "no-step" estimator is similarly

consistent for any sequence of bandwidths {h_n} such that h_n = O(n^{−2/(4m+5)}).
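The whole one-step pipeline can be sketched under the same hedged conventions as before (Gaussian kernel, K_m treated as a supplied constant, θ_0 treated as known): estimate θ_{m+1} with a leave-in-the-diagonals estimator at a pilot bandwidth, plug it into the bandwidth formula, then estimate θ_m.

```python
import math
import random

def hermite(r, x):
    """Probabilists' Hermite polynomial He_r(x) by the usual recursion."""
    if r == 0:
        return 1.0
    h_prev, h = 1.0, x
    for k in range(1, r):
        h_prev, h = h, x * h - k * h_prev
    return h

def kbar_deriv(r, u):
    """r-th derivative of the N(0, 2) density (Gaussian K convolved with itself)."""
    s = math.sqrt(2.0)
    z = u / s
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (-1.0) ** r * hermite(r, z) * phi / s ** (r + 1)

def theta_hat(data, order, h, leave_in):
    """Kernel estimate of theta_order; sign (-1)**order from integration by parts."""
    n = len(data)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if leave_in or i != j:
                total += kbar_deriv(2 * order, (data[i] - data[j]) / h)
    denom = (n * n if leave_in else n * (n - 1)) * h ** (2 * order + 1)
    return (-1.0) ** order * total / denom

def one_step(data, m, pilot_h, theta0, Km):
    n = len(data)
    t_next = theta_hat(data, m + 1, pilot_h, leave_in=True)  # always positive
    h0 = ((8 * m + 2) * theta0 * Km /
          (t_next ** 2 * n ** 2)) ** (1.0 / (4 * m + 5))
    return theta_hat(data, m, h0, leave_in=False)

random.seed(11)
xs = [random.gauss(0.0, 1.0) for _ in range(300)]
theta0_known = 1.0 / (2.0 * math.sqrt(math.pi))  # exact theta_0 for N(0,1)
est = one_step(xs, m=1, pilot_h=0.7, theta0=theta0_known, Km=1.0)
```

The leave-in estimate of θ_{m+1} is nonnegative by construction (it is ∫(f̂^(m+1))²), so the fractional root in the bandwidth is always well defined, which is exactly the point made in the introduction to this chapter.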

Corollary 7.3.2 provides some guidance for bandwidth selection.

Corollary 7.3.2: If θ_0 is known, then for n → ∞ and h → 0, AMSE(θ̂*_m) is minimized by

which is also the bandwidth that minimizes AMSE(θ̂_{m+1}).


The most interesting aspect of Corollary 7.3.2 is that AMSE(θ̂*_m) is minimized by the

same bandwidth that minimizes AMSE(θ̂_{m+1}). This is a nice optimality property. The

optimal AMSE bandwidth for θ_{m+1} is also optimal for estimating the bandwidth for θ_m.

Corollary 7.3.3 is best motivated by looking at Figures 7.1a and 7.1b. These figures

show plots of standardized AMSE(θ̂_1(h)) and AMSE(θ̂*_1(h)) for Normal mixture distribution #2

in Appendix A. The difference between the optimal bandwidths along the X-axis is not

meaningful. In θ̂_1(h), h is used in the kernel estimate of θ_1. In θ̂*_1(h), h is used in the kernel

estimate of θ_2. The difference along the Y-axis between the minimum AMSE's is important.

This difference can be viewed as the cost of using the optimal "one-step" estimator instead of

the optimal "no-step" estimator. Corollary 7.3.3 shows that the optimal "one-step" estimator

has AMSE which converges to the AMSE of the optimal "no-step" estimator and quantifies the

rate.

Corollary 7.3.3: For known θ_0, if h_0, h_1 are chosen as in Theorems 7.1 and 7.3 respectively, then

as n → ∞ and h → 0:

Notice that the rate given in Corollary 7.3.3 is:

See Figures 7.3a and 7.3b for a depiction of Corollary 7.3.3.


4. Figures.

Figures 7.1a and 7.1b show √AMSE(θ̂_1)/θ_1 plotted against h for n = 250 and 1000,

respectively (θ̂_1 here means just an estimator of θ_1). Both "no-step" and "one-step" estimators

are shown. At large values of h, AMSE(θ̂_1) is mostly due to bias. AMSE(θ̂_1) for small values

of h is mostly due to variance. The figures clearly demonstrate the limited effect bandwidth

selection error has on asymptotic bias of the "one-step" estimator for a wide range of

bandwidths. The asymptotic bias of the "one-step" estimator decreases for increasing sample

sizes, unlike the bias of the "no-step" estimator. It is also clear that bandwidth selection error

has a large effect on asymptotic variance for both estimators. It is not easy to compare the

variances of the "no-step" and "one-step" estimators on these figures, however.

Figures 7.2a and 7.2b show √AMSE(θ̂_1)/θ_1 plotted against log(h) for n = 250 and 1000,

respectively. The effect of bandwidth selection error on asymptotic variance is much more

evident in these figures than in Figures 7.1a and 7.1b. Choosing h too small is much worse for

the "one-step" estimator than for the "no-step" estimator. The following table summarizes the

asymptotic magnitudes of the variance and bias for each estimator.

             Variance    Squared Bias
"no-step"
"one-step"

Figures 7.1 - 7.2 present a compelling argument for oversmoothing to get ĥ_0, the

"one-step" bandwidth.


Figures 7.3a and 7.3b show √AMSE(θ̂_1)/(θ_1 × min AMSE(θ̂⁰_1)) plotted against h for

n = 250 and 1000, respectively, where θ̂⁰_1 is the "no-step" estimator. The purpose of these pictures is to show that

the relative difference between the minimum AMSE's for the "no-step" and "one-step"

estimators decreases with increasing sample size. The figures illustrate Corollary 7.3.3.

Figures 7.4a and 7.4b show e_m(t) for m = 2, f a standard Normal density, K a Gaussian

kernel, and several values of n, t, and h. The values of t chosen were important in some of the

theorems above. Notice that as the sample size increases and the bandwidths decrease, e_m(t)

approaches 0.

5. Conclusions.

The "one-step" estimator has several appealing properties. First, its MSE decreases

with sample size for any bandwidth. In other words, no matter how terrible a bandwidth is

selected, gathering more data improves the estimator. In contrast, a terrible bandwidth

selection for a "no-step" estimator may overwhelm the estimator with bias that will never go

away, regardless of the sample size. Second, even with optimal choices of bandwidths the "one-

step" is not much worse than the no-step. Finally, the "one-step" estimator is much less

affected by an oversmoothing bandwidth than the "no-step" estimator.

The disadvantage of the "one-step" estimator is that it is highly affected by

undersmoothing bandwidths. The "one-step" also involves more computing time.

A reasonable approach to choosing the "one-step" bandwidth is to use an oversmoothing


bandwidth based on minimum values of ∫(f^(m))² as given in Terrell (1990) or Terrell & Scott

(1985). These bandwidths are the largest bandwidths that could possibly be optimal based on

some measure of sample variability. This approach seems a bit extreme, so bandwidths based

on ∫(φ^(m))² for φ a Gaussian p.d.f. may be better and will still likely be oversmoothing.
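For a Gaussian reference, ∫(φ^(m))² has a standard closed form, which makes such reference bandwidths easy to compute; a sketch assuming φ is the N(0, σ²) density:

```python
import math

def gaussian_theta(m, sigma=1.0):
    """theta_m = int (phi^(m))^2 for phi the N(0, sigma^2) density:
    (2m)! / ( m! * (2*sigma)**(2m+1) * sqrt(pi) )."""
    return math.factorial(2 * m) / (
        math.factorial(m) * (2.0 * sigma) ** (2 * m + 1) * math.sqrt(math.pi))

# Known special cases for sigma = 1:
# theta_0 = 1/(2 sqrt(pi)), theta_1 = 1/(4 sqrt(pi)), theta_2 = 3/(8 sqrt(pi)).
```

The rapid growth of this quantity in m is another face of the estimability results of Chapter VI: higher-order squared-derivative functionals are simply larger and harder targets.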

The over-smoothing approach illustrates the real advantage of the "one-step" estimator.

It is easy to choose an oversmoothing bandwidth. If the oversmoothing bandwidth performs

almost as well as the optimal bandwidth then the bandwidth selection problem disappears.


Figure 7.1a: MSE vs Bandwidth (Dstn #2; Sample size = 250). [Graph of √MSE/θ_1 not reproduced; curves: 0-step, 1-step.]

Figure 7.1b: MSE vs Bandwidth (Dstn #2; Sample size = 1000). [Graph not reproduced.]

Figure 7.2a: MSE vs log(Bandwidth) (Dstn #2; Sample size = 250). [Graph not reproduced.]

Figure 7.2b: MSE vs log(Bandwidth) (Dstn #2; Sample size = 1000). [Graph not reproduced.]

Figure 7.3a: MSE vs Bandwidth (Dstn #2; Sample size = 250). [Graph of √MSE/(θ_1 × min AMSE(θ̂⁰_1)) not reproduced.]

Figure 7.3b: MSE vs Bandwidth (Dstn #2; Sample size = 1000). [Graph not reproduced.]

6. Proofs

Proof of Theorem 7.1: This is Theorem 2 in Hall and Marron (1987).

A lemma is necessary before proving Lemma 7.1.

Let:

Proof of Lemma 7.1.1:

By assumption, θ_m, and hence μ, is bounded below by some positive ε.

So,

Since the "diagonals" are simply a constant shift, the skewness of θ̃_m is the same as that of θ̂_m. To

ease some calculations we will calculate the third central moment of θ̃_m instead of θ̂_m,

where μ = E(θ̃_m) now.


So let:

Also, let P_n(y) = n!/(n − y)!.

By Hall and Marron (1987),


μ = θ_m + o(1).

Combining these with similar calculations shows:

E_3 = μ h^{−4m−1} (∫ f²)(∫ (K̄^(2m))²) + o(h^{−4m−1}).

E_4 = ∫ (K̄^(2m))³ f + o(1).

E_5 = h^{−6m−1} (∫ f²)(∫ (K̄^(2m))²) + o(h^{−6m−1}).


A laborious counting project (see Section 7) shows that:

E(θ̃_m)³ = n^{−6} {P_n(6) × E_1 + 12 P_n(5) × E_2 + 6 P_n(4) × E_3 + 8 P_n(4) × E_4 + 24 P_n(4) × E_5

+ 24 P_n(3) × E_6 + 4 P_n(2) × E_7 + 8 P_n(3) × E_8}.

Substitution shows that:

Combining all these gives:

So for m ≥ 1, h > n^{−1/2m}:


Proof of Lemma 7.1:

By a Taylor expansion about μ:

where ρ is between θ̂_m and μ.

Using Lemma 7.1.1 to bound the remainder term:


Proof of Theorem 7.2: A direct application of Lemma 7.1.

Several lemmas are required for proving lemmas 7.2 and 7.3.

Lemma 7.2.1: If θ_0 is known, then for n → ∞ and h → 0:

E(ĥ_0^{−4m−1}) = h_0^{−4m−1} (1 + e_{m+1}((8m+2)/(4m+5))).

Proof:


= (C_m n^{−2})^{(−4m−1)/(4m+5)} (θ_{m+1})^{(8m+2)/(4m+5)} (1 + e_{m+1}((8m+2)/(4m+5)))

= h_0^{−4m−1} (1 + e_{m+1}((8m+2)/(4m+5))).

Lemma 7.2.2: If θ_0 is known, then for n → ∞ and h → 0:

Proof: Similar to Lemma 7.2.1.

Lemma 7.2.3: If θ_0 is known, then for n → ∞ and h → 0:

Proof: Similar to Lemma 7.2.1.

Proof of Lemma 7.2:


E[Bias(θ̂_m(ĥ_0))] = E[ĥ_0² θ_{m+1}/2] + o(n^{−4/(4m+5)})

Using Lemma 7.2.2:

= (θ_{m+1}/2) h_0² (1 + e_{m+1}(−4/(4m+5))) + o(n^{−4/(4m+5)}).

Proof of Lemma 7.3:

= Var[(θ_{m+1}/2) ĥ_0²] + E[2n^{−2} θ_0 K_m ĥ_0^{−4m−1}] + o(n^{−8/(4m+5)})

Using Lemmas 7.2.1 - 7.2.3:

= 2n^{−2} θ_0 K_m {h_0^{−4m−1} (1 + e_{m+1}((8m+2)/(4m+5)))}

Proof of Theorem 7.3:

The first part of theorem 7.3 follows directly from lemmas 7.2 and 7.3.


For the second part of Theorem 7.3:

2n^{−2} θ_0 K_m h_0^{−4m−1} e_{m+1}((8m+2)/(4m+5)) + (θ_{m+1}²/4) h_0^4 e_{m+1}(−8/(4m+5)) =

By substituting h_0:

By expanding e_{m+1}:

= h_0^4 ((4m+5)/4) AMSE(θ̂_{m+1}(h))

Proof of Corollary 7.3.1: Follows directly from the proof of Theorem 7.3.

Proof of Corollary 7.3.2: Follows directly from the proof of Theorem 7.3.

Proof of Corollary 7.3.3: Follows directly from the proof of Theorem 7.3.


7. e_m(t) and calculating the skewness of θ̂_m.

e_m(t) is a function of m, t, n, h, the density f, and the kernel K, so describing it is

difficult. Perhaps the best way to gain some insight into e_m(t) is to look at its derivation to see

that:

Hence, for large t, e_m(t) resembles t² × the standardized MSE. For small t, it is more like t × the

difference between Bias and MSE. Figures 7.4a and 7.4b show e_m(t) for m = 2, f a standard

Normal density, K a Gaussian kernel, and several values of n, t, and h.

b) Calculating the skewness of θ̂_m.

The terms needed are given by the list below. The number of similar terms is given in the right-hand column.

P_n(a) is notation for n!/(n − a)!. As a check, the sum of the number of terms in the list is:

which is equal to [n(n − 1)]³.


Term:                                                              Given by:   Number:

E{K̄_h^(2m)(X_1 − X_2) K̄_h^(2m)(X_3 − X_4) K̄_h^(2m)(X_5 − X_6)}      E_1       P_n(6)

E{K̄_h^(2m)(X_1 − X_2) K̄_h^(2m)(X_3 − X_4) K̄_h^(2m)(X_1 − X_5)}      E_2       12 P_n(5)

E{K̄_h^(2m)(X_1 − X_2) K̄_h^(2m)(X_3 − X_4) K̄_h^(2m)(X_3 − X_4)}      E_3       6 P_n(4)

E{K̄_h^(2m)(X_1 − X_2) K̄_h^(2m)(X_1 − X_3) K̄_h^(2m)(X_1 − X_4)}      E_4       8 P_n(4)

E{K̄_h^(2m)(X_1 − X_2) K̄_h^(2m)(X_1 − X_3) K̄_h^(2m)(X_2 − X_4)}      E_5       24 P_n(4)

E{K̄_h^(2m)(X_1 − X_2) K̄_h^(2m)(X_1 − X_2) K̄_h^(2m)(X_1 − X_3)}      E_6       24 P_n(3)

E{K̄_h^(2m)(X_1 − X_2) K̄_h^(2m)(X_1 − X_2) K̄_h^(2m)(X_1 − X_2)}      E_7       4 P_n(2)

E{K̄_h^(2m)(X_1 − X_2) K̄_h^(2m)(X_1 − X_3) K̄_h^(2m)(X_2 − X_3)}      E_8       8 P_n(3)
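The counting coefficients in this list (partly illegible in the source and reconstructed here) can be checked mechanically: summed against the falling factorials P_n(a) = n!/(n − a)!, they must reproduce [n(n − 1)]³, the total number of ordered triples of ordered pairs.

```python
import math

def P(n, a):
    """Falling factorial P_n(a) = n!/(n-a)!: ordered a-tuples of distinct indices."""
    return math.factorial(n) // math.factorial(n - a) if a <= n else 0

def total_terms(n):
    # Coefficients of E_1..E_8 as given in the list: grouped by the number
    # of distinct indices (6, 5, 4, 3, 2) appearing in the triple of pairs.
    return (P(n, 6) + 12 * P(n, 5) + (6 + 8 + 24) * P(n, 4)
            + (24 + 8) * P(n, 3) + 4 * P(n, 2))

# total_terms(n) == (n * (n - 1)) ** 3 for every n >= 2.
```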


Figure 7.4a: e_2(t) vs t (Distn #1; Sample size = 250). [Graph not reproduced; curves for t = −2/9, −4/9, 10/9.]

Figure 7.4b: e_2(t) vs t (Distn #1; Sample size = 1000). [Graph not reproduced; curves for t = −2/9, −4/9, 10/9.]

Chapter VIII: The "K-Step" Estimator

1. Introduction.

In the previous chapter, a "one-step" estimator was introduced. The "one-step" estimator is a kernel estimator of θ_m with a bandwidth containing an estimate of θ_{m+1}. The estimate of θ_{m+1} is another kernel estimator depending on some bandwidth.

The obvious extension of the "one-step" estimator is the "k-step" estimator. Suppose that θ_m is to be estimated. The AMSE optimal bandwidth for this estimation problem is a function of θ_{m+1}. The "one-step" estimator substitutes a smoothing estimate of θ_{m+1} into this optimal bandwidth. This bandwidth is used to provide a smoothing estimate of θ_m. Of course, this process can be extended so that the initial functional estimated is θ_{m+k}. This is used to estimate θ_{m+k−1}, which is used to estimate θ_{m+k−2}, and so on. This is the "k-step" estimator.
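The recursion just described can be sketched in a few lines. This is an illustrative sketch only, not the thesis's implementation: the Gaussian-kernel derivatives via the Hermite recurrence are standard, but the bandwidth constant c standing in for the kernel-dependent factor of the stepping bandwidth is a placeholder assumption.

```python
import math
import random

def hermite(r, u):
    # probabilists' Hermite polynomial He_r(u): He_{n+1} = u He_n - n He_{n-1}
    h0, h1 = 1.0, u
    if r == 0:
        return h0
    for n in range(1, r):
        h0, h1 = h1, u * h1 - n * h0
    return h1

def theta_hat(x, h, m):
    # leave-in-the-diagonals kernel estimate of theta_m = int (f^(m))^2;
    # evaluating the pairwise Gaussian kernel at scale h*sqrt(2) makes this
    # exactly int (f_hat^(m))^2 for bandwidth h, hence always positive
    n, s = len(x), h * math.sqrt(2.0)
    tot = 0.0
    for xi in x:
        for xj in x:
            d = (xi - xj) / s
            # (2m)-th derivative of the standard normal density at d,
            # up to the 1/sqrt(2 pi) factor applied below
            tot += hermite(2 * m, d) * math.exp(-0.5 * d * d)
    tot /= math.sqrt(2.0 * math.pi)
    return (-1.0)**m * tot / (n**2 * s**(2*m + 1))

def k_step_theta(x, m, k, c=0.5):
    # start from a Gaussian plug-in bandwidth aimed at theta_(m+k), then
    # step down: each theta_(m+j) estimate feeds the bandwidth used for
    # theta_(m+j-1); c is a placeholder, not the thesis's constant
    n = len(x)
    mu = sum(x) / n
    sigma = math.sqrt(sum((v - mu)**2 for v in x) / (n - 1))
    order = m + k
    h = (2.0 / ((2*order + 1) * n))**(1.0 / (2*order + 3)) * sigma
    for j in range(k, 0, -1):
        t = theta_hat(x, h, m + j)
        h = (c / (t * n))**(1.0 / (2*(m + j - 1) + 3))
    return theta_hat(x, h, m)
```

For a standard normal sample, the true θ₁ = 1/(4√π) ≈ 0.141; the sketch lands in that neighborhood, though with the placeholder constant no accuracy is claimed.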

The "one-step" estimator discussed in chapter VII was a leave-out-the-diagonals estimator. The estimator used to estimate the bandwidth was a leave-in-the-diagonals estimator. An obvious question is how the "one-step" estimator discussed in the previous chapter compares to a "one-step" in which both estimators are leave-in-the-diagonals estimators. This "one-step" estimator is discussed in this chapter as a special case, although the bandwidth used here is stochastic rather than selected as it was in chapter VII.

By the previous chapter, the "Sheather-Jones" optimal bandwidth for estimating θ_m is given by:

Recall that the "Sheather-Jones" bandwidth is not the asymptotic MSE optimal

bandwidth, but merely the bandwidth that eliminates the first term in the Taylor expansion of

the bias.

Of course, the process of "stepping" must stop somewhere, so there is some initial bandwidth decision to be made. It seems reasonable to expect that "stepping" should mitigate the impact of a poor initial bandwidth. One possible choice for an initial bandwidth is to "plug in" the Gaussian density into the expression for the AMSE optimal bandwidth. Notice that if a Normal kernel is used:

h = { 2 / [(2m+1) n] }^{1/(2m+3)} σ,

which is slightly less than n^{−1/(2m+3)} σ.
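For concreteness, the plug-in starting bandwidth can be computed directly. This short check — an illustration added here, assuming the formula h = {2/[(2m+1)n]}^{1/(2m+3)} σ — confirms that the factor {2/(2m+1)}^{1/(2m+3)} is below one for m ≥ 1, so h sits just below n^{−1/(2m+3)} σ:

```python
def plugin_bandwidth(n, m, sigma):
    # Gaussian plug-in start bandwidth: h = {2 / [(2m+1) n]}^{1/(2m+3)} * sigma
    return (2.0 / ((2*m + 1) * n))**(1.0 / (2*m + 3)) * sigma

for m in (1, 2, 3):
    for n in (100, 500):
        h = plugin_bandwidth(n, m, sigma=1.5)
        # strictly below the bare rate n^{-1/(2m+3)} sigma for m >= 1
        assert 0.0 < h < n**(-1.0 / (2*m + 3)) * 1.5
```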


The "k-step" estimator discussed in this chapter uses the "plug-in" bandwidth as its initial bandwidth. This makes the discussion a bit easier than in the previous chapter since here, the estimator relies only on the integer k, rather than on a bandwidth.

2. Assumptions and Notation.

Define:

(5.1)

where θ̂_{m+1} is an estimator of θ_{m+1}. If

then h_m is a "plug-in" or "0-step" bandwidth and we will write h_m ≡ h_m^0. However, if θ̂_{m+1} ≡ θ̂_{m+1}^k is a "k-step" kernel estimator, then h_m ≡ h_m^{k+1} is a "(k+1)-step" bandwidth.

In this chapter, θ̂_m is a "leave-in-the-diagonals" estimator of θ_m, so it is always positive. To avoid some technical problems, we assume that it is bounded away from 0 a.s. This is not a large assumption because we usually worry about θ̂_m being too large rather than too small. In any case, θ̂_m is bounded away from 0 a.s. if any measure of sample variability is bounded a.s.



3. Results.

a. Bias of θ̂_m^k.

Theorem 8.1 shows that the convergence rate of the asymptotic bias decreases for two steps, but thereafter remains constant.

Theorem 8.1:

Let θ̂_m^k be the "k-step" estimator. Define:

In other words, β(θ̂_m) is the rate of convergence of E θ̂_m to θ_m. Then the following table provides the asymptotic bias reduction rate for "stepping".

k      β(θ̂_m^k)

0      −2/(2m+3)

1      −2 × ( 1/(2m+3) + 1/(2m+5) )

≥2     −4/(2m+3)
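A quick numerical check of these exponents (added here for illustration; the k = 1 entry is taken as −2(1/(2m+3) + 1/(2m+5))) confirms that each of the first two steps strictly improves the bias rate for every m:

```python
def bias_exponent(m, k):
    # asymptotic bias rate beta(theta_hat_m^k) from the table above
    if k == 0:
        return -2.0 / (2*m + 3)
    if k == 1:
        return -2.0 * (1.0/(2*m + 3) + 1.0/(2*m + 5))
    return -4.0 / (2*m + 3)   # no further gain beyond two steps

for m in range(0, 10):
    b0, b1, b2 = (bias_exponent(m, k) for k in (0, 1, 2))
    # more negative exponent = faster convergence
    assert b2 < b1 < b0
    assert bias_exponent(m, 5) == b2   # rate is constant for k >= 2
```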


Proof: The proofs of all theorems in this chapter are given in Section 6.

The convergence rates given in the table suggest that for bias, most of the gains from

"stepping" occur in the first step. There are no gains in convergence rates at all after the second


step. The intuition for this is fairly clear. The bandwidth is selected so that the diagonal terms cancel the next term in the Taylor expansion of the bias. If this term in the bias is cancelled by the diagonals, then the bias is O(h⁴). The key to achieving this convergence rate is to have the diagonals and the next term of the bias converge at a faster rate. The theorem shows that two steps are needed for this rate. Since h = O(n^{−1/(2m+3)}), the best rate that can be achieved is O(n^{−4/(2m+3)}).

Convergence rates do not tell the entire story. Several simulations were conducted to examine the distribution of θ̂_m^k for several values of k and m and several different distributions. The results suggest that the bias grows ever more positive with further stepping. This is somewhat disturbing. It was hoped that with some "reasonable" initial bandwidth, the MSE would decrease with increased stepping. This did not happen in the simulations. Some estimated densities of θ̂_m^k are shown in figures 8.1a and 8.2a. These are described in more detail in section 4.

The following theorem and corollary show that if K is a Normal kernel and f is a Normal mixture density, the bias continues to grow as k → ∞.

Theorem 8.2:

If K is a Normal kernel and θ_m < c₁ Γ(2m) c₂^{−m}, then for any fixed n:

Bias[θ̂_m^k] → ∞.

It is shown in Section 6 that if f is a Normal mixture density then θ_m < c₁ Γ(2m) c₂^{−m}, so Corollary 8.2.1 follows.


Corollary 8.2.1:

If K is a Normal kernel and f is a Normal mixture density, then for any fixed n:

Bias[θ̂_m^k] → ∞.

The significance of this result is that the MSE of the "k-step" estimator must also go to infinity as k increases. This obviously implies that continued stepping is not a good plan. The theorem can probably be made more general, although this was not explored. The corollary is a counter-example to the hypothesis that continued stepping may be a good plan. This is formalized in the following corollary.

Corollary 8.2.2:

If K is a Normal kernel and f is a Normal mixture density, then for any fixed n:

MSE[θ̂_m^k] → ∞.

Stepping primarily affects the difference between the diagonal term and the next term in the expansion of the bias. This difference can be decomposed into a positive and a negative term. Stepping causes the positive term to grow, basically by making the bandwidth smaller. The negative term represents the bias of θ̂_{m+k} multiplied by (k − 1) bandwidths. The negative term gets smaller by this multiplication.

The divergent series in the proof of Theorem 8.2 provides some intuition into how the bias behaves. Notice the series is something like Σ_i i!(1/n)^i. When n is large, the series seems to converge for the first few steps. Notice also that the larger n is, the more steps must be taken to cancel the negative term. Hence, with very large n, "stepping" more than 2 steps may continue to reduce bias. The intuition here is that with a tremendous sample size it is possible to estimate many of the θ_i's productively.

b. Variance of θ̂_m^k.

"Stepping" does not affect the order of convergence of the variance.

Theorem 8.3:

Let θ̂_m^k be the "k-step" estimator. Then, for any k:

Once again, convergence rates are not the complete story. It seems reasonable to expect

that with each step taken more variance is added to the final estimator. This intuition turns

out to be correct and it is formalized in the following theorem.

Theorem 8.4:

Let θ̂_m^k be the "k-step" estimator. Then Var(θ̂_m^k) is an increasing function of k.

c. MSE of θ̂_m^k.

Mimicking Theorem 8.1, define:

So γ(θ̂_m) is the rate of convergence of MSE[θ̂_m] to 0. Combining Theorems 8.1 and 8.3 shows that the MSE convergence rate is given by the following table:

k      γ(θ̂_m^k)

0      −4/(2m+3)

≥1     −5/(2m+3)
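The table reflects a small piece of arithmetic worth making explicit: for k ≥ 1 the squared-bias exponent −4/(2m+3) − 4/(2m+5) is always more negative than the variance exponent −5/(2m+3), so the variance dominates the MSE; for k = 0 the squared bias dominates. A check added here for illustration:

```python
def mse_exponent(m, k):
    # MSE rate = slower (less negative) of squared-bias and variance rates
    if k == 0:
        sq_bias = -4.0 / (2*m + 3)
    elif k == 1:
        sq_bias = -4.0/(2*m + 3) - 4.0/(2*m + 5)
    else:
        sq_bias = -8.0 / (2*m + 3)
    variance = -5.0 / (2*m + 3)
    return max(sq_bias, variance)

for m in range(0, 10):
    assert mse_exponent(m, 0) == -4.0 / (2*m + 3)   # bias-dominated
    assert mse_exponent(m, 1) == -5.0 / (2*m + 3)   # variance-dominated
    assert mse_exponent(m, 4) == -5.0 / (2*m + 3)   # constant for k >= 1
```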

4. Simulations.

Simulations provide some useful insights into the behavior of the k-step estimators. The asymptotic calculations imply that the MSE convergence rate of the estimators improves for the first step. The bias convergence rate improves for the first two steps. However, both bias and variance eventually grow without bound with further stepping. An aim of the simulations is to investigate, in several practical cases, the optimal number of steps to minimize the MSE.

The simulations were done on distributions 1-6, 10, and 11. For each simulation, θ₁ was estimated. In all cases, all the estimators from the "0-step" to the "20-step" were computed. All simulations were done using the same seed and 100 samples. Sample sizes of 100 and 500 were investigated. Only simulations from distributions 3 and 6 are presented here as they seem to be representative of all the conclusions discussed below.

Results that confirm the theoretical calculations are:

• The biggest gains in estimation occur during the first two steps. Thereafter, the estimator either got worse or improved only slightly with each step.

• The larger the sample size, the greater the number of steps that should be taken.

• The bias and variance of the estimator both eventually grow with increased stepping.


Results not apparent from the theoretical calculations are:

• The final bandwidth computed became more variable and had a mean that steadily approached 0 with increased stepping.

• The more non-Normal the distribution (i.e., the more different the sequence of θ_i from that of the Normal distribution), the more steps were required to reach the optimal.

Figure 8.1a shows an estimated density of θ̂₁^k for distribution 6. The density at the far left is the "0-step" and steps increase to the right. It is clear that the estimates get increasingly variable with increasing k. Further, the means of the distributions initially get closer to the true value and then keep increasing past it. The largest gains are found in the first two steps.

Figure 8.1b shows the estimated densities of the bandwidth used in the final estimate of θ₁. The density at the far right is the "0-step" density and steps increase to the left. The mean of the density steadily approached 0. The variance of the density also increased.

Figures 8.2a and 8.2b are similar to 8.1a and 8.1b except that density 3 is used rather than density 6. Density 3 is much more non-Normal and this difference is reflected in the θ̂₁ and bandwidth densities. It takes many steps before E θ̂₁^k is less than θ₁. Further, the expected value of the bandwidth does not become less than the optimal bandwidth even after twenty steps. The insight is that the optimal bandwidth is much less than the plug-in bandwidth from the Normal, so it takes many steps to reach the optimal.

Figures 8.3a and 8.3b are plots of the standardized MSE for distribution 6 and sample sizes of 100 and 500. The graphs show that with a sample size of 100, the MSE is minimized with 1 step. With a sample size of 500, the minimum MSE is attained after 2 steps. This increased number of steps with increased sample size coincides with the theoretical result discussed earlier. Suppose we reason that the estimation is likely to improve for the first two steps but then get worse sometime thereafter, so just take two steps. These figures suggest that this reasoning works well for distribution 6, which is nearly Normal in some sense.

Figure 8.4 gives the counterpoint to the above reasoning. Figure 8.4 is a plot similar to figure 8.3 except that distribution 3 is shown, only with a sample size of 100. The figure shows that the minimum MSE is attained after 11 steps. Since the distribution is so non-Normal, many steps must be taken to achieve the optimal. What is even more striking from figure 8.4 is that it makes clear that the number of steps taken is simply a smoothing parameter. Increased stepping causes the bandwidth to become smaller (in a stochastic way) and results in less smoothing. A "spiky" distribution such as distribution 3 requires many steps to recover the "spike". A smoother distribution such as distribution 6 does not require so many.

5. Conclusions.

Summarizing the above results, the "0-step" estimator has a bandwidth just bad enough so that its squared bias converges to zero slower than its variance. Recall from an earlier chapter that if h_AMSE is known then the MSE convergence rate is due entirely to variance. This again becomes the case after only a single step.

The bad news is that stepping eventually causes the MSE to increase without bound (with a finite sample size). Stepping once gives a faster MSE convergence rate. Stepping twice causes a faster convergence rate of the squared bias. The simulations indicate that the more non-Normal the density function is, the more steps should be taken.


Figure 8.1a: θ̂₁ densities (Distn #6; squared 1st derivative; SS = 100; 100 samples). [Plot of estimated densities of ln(θ̂₁^k), k = 0, …, 20; the 0-step density is at the far left.]

Figure 8.1b: Bandwidth densities (Distn #6; squared 1st derivative; SS = 100; 100 samples). [Plot of estimated densities of ln(ĥ); the 0-step density is at the far right.]

Figure 8.2a: θ̂₁ densities (Distn #3; squared 1st derivative; SS = 100; 100 samples). [Plot of estimated densities of ln(θ̂₁^k), k = 0, …, 20.]

Figure 8.2b: Bandwidth densities (Distn #3; squared 1st derivative; SS = 100; 100 samples). [Plot of estimated densities of ln(ĥ).]

Figure 8.4: MSE vs. Step (Distn #3; sample size = 100). [Plot of standardized MSE, squared bias, and variance against number of steps, 0-20.]

Figure 8.3a: MSE vs. Step (Distn #6; sample size = 100). [Plot of standardized MSE, squared bias, and variance against number of steps.]

Figure 8.3b: MSE vs. Step (Distn #6; sample size = 500). [Plot of standardized MSE, squared bias, and variance against number of steps.]

6. Proofs.

Proof of Theorem 8.1: First, a lemma is necessary.

Lemma 8.1: Define β(θ̂_m) as in Theorem 8.1, and let h be as in (5.1). Then:

Proof of lemma 8.1:

Notice first that θ̂_m^k is simply a kernel estimator with a stochastic bandwidth, i.e., θ̂_m^k = θ̂_m(ĥ) where ĥ has a distribution depending on k, m, and the observations. To simplify the notation let ĥ ≡ h_m.

So if f has R + m derivatives:

Substituting for ĥ gives:

where c₁ is a positive constant depending only on K and m. We assume that θ̂_{m+1} > b, for


some positive bound b. Hence,

The lemma follows immediately.

Proof of Theorem 8.1:

If k = 1, then Bias(θ̂_{m+1}) = Bias(θ̂_{m+1}^0) = O(n^{−2/(2m+5)}) by the above, so β(θ̂_{m+1}^0) = −2/(2m+5). If k = 2, then Bias(θ̂_{m+1}) = Bias(θ̂_{m+1}^1) = O(n^{−2/(2m+5)} × n^{−2/(2m+7)}) by the above. Since the exponent gained at each further step is smaller in magnitude than 2/(2m+5) for k ≥ 2, the difference between the diagonals term and the first term in the expansion of the bias is not as large as the next term in the expansion. Hence,

Proof of Theorem 8.2: Several lemmas are necessary.

Lemma 8.2:

Let S_α = Σ_{k=0}^{n} (−1)^{k+i} k^α (i + k)! (2n − i − k)! / [(n − k)! k!]².

Then   α = 0:  S_α = 1

       α = 1:  S_α = −n² + (2n + 1) i

       α = 2:  S_α = n²(n − 1)² − n(2n + 1)(n − 1) i + n(2n + 1) i²

Proof of lemma 8.2: Prudnikov et al. (1986).


Lemma 8.3:

Let T_σ = Σ_{t=0}^{n} t^σ C(n, t)².

Then   σ = 0:  T_σ = C(2n, n)

       σ = 1:  T_σ = (n/2) C(2n, n)

       σ = 2:  T_σ = n² C(2n − 2, n − 1)

Proof of lemma 8.3: Prudnikov et al. (1986).
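The three identities — T_σ = Σ_t t^σ C(n,t)² with T_0 = C(2n,n), T_1 = (n/2)C(2n,n), T_2 = n²C(2n−2,n−1) — are easy to confirm numerically; a check added here for illustration:

```python
from math import comb

def T(n, sigma):
    # T_sigma = sum over t of t^sigma * C(n, t)^2
    return sum(t**sigma * comb(n, t)**2 for t in range(n + 1))

for n in range(1, 15):
    assert T(n, 0) == comb(2*n, n)
    assert 2 * T(n, 1) == n * comb(2*n, n)           # T_1 = (n/2) C(2n, n)
    assert T(n, 2) == n**2 * comb(2*n - 2, n - 1)
```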

Lemma 8.4:

Lemma 8.4:

Q_m = Σ_{i=0}^{m} Σ_{j=0}^{m} (−1)^{i+j} C(m, i) C(m, j) C(m+2, i+1) C(m+2, j+1) / C(2m+4, i+j+2) = (m+2)² / [2(2m+1)(2m+3)].

Proof of lemma 8.4:

Substituting l = i + 1, s = j + 1, n = m + 2 and expanding the binomial coefficients expresses Q_{n−2} as a double sum over l and s to which Lemma 8.2 applies.


Applying Lemma 8.2 to the inner sum over s leaves a single sum over l whose summand is a quadratic polynomial in l. A laborious calculation using Lemma 8.3 then shows that:

Q_{n−2} = n² / [2(2n − 1)(2n − 3)].

The result follows by substitution.
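The closed form for Q_m — as reconstructed here, Q_m = (m+2)²/[2(2m+1)(2m+3)] — can be confirmed exactly with rational arithmetic; a check added for illustration:

```python
from math import comb
from fractions import Fraction

def Q(m):
    # the double sum of Lemma 8.4, computed exactly
    return sum(Fraction((-1)**(i + j) * comb(m, i) * comb(m, j)
                        * comb(m + 2, i + 1) * comb(m + 2, j + 1),
                        comb(2*m + 4, i + j + 2))
               for i in range(m + 1) for j in range(m + 1))

for m in range(1, 8):
    assert Q(m) == Fraction((m + 2)**2, 2 * (2*m + 1) * (2*m + 3))
```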

Lemma 8.5: For f a β(m + 2, m + 2) density defined on [0, 1],

θ_m = ( (2m+3)! / (m+2)! )² × 1/(2m+5) × (m+2)² / [2(2m+1)(2m+3)].

Proof of lemma 8.5:

θ_m(β) = ∫ (f^{(m)})² = ( Γ(2m+4) / [Γ(m+2)Γ(m+2)] )² ∫ [ (d^m/dx^m) x^{m+1}(1−x)^{m+1} ]² dx

= ( Γ(2m+4) / [Γ(m+2)Γ(m+2)] )² ∫ [ Σ_{i=0}^{m} C(m, i) (−1)^{m−i} ((m+1)!)² / [(m+1−i)!(i+1)!] x^{m+1−i}(1−x)^{i+1} ]² dx


= ( Γ(2m+4) / [Γ(m+2)Γ(m+2)] )² ×

Σ_{i=0}^{m} Σ_{j=0}^{m} C(m, i) C(m, j) (−1)^{i+j} ((m+1)!)⁴ / [(m+1−i)!(i+1)!(m+1−j)!(j+1)!] ∫ x^{2m+2−i−j}(1−x)^{i+j+2} dx,

where the integral equals (2m+2−i−j)!(i+j+2)! / (2m+5)!. Collecting the factorials into binomial coefficients gives:

= ( Γ(2m+4) / [Γ(m+2)Γ(m+2)] )² ((m+1)!)² / [(m+2)²(2m+5)] Σ_{i=0}^{m} Σ_{j=0}^{m} (−1)^{i+j} C(m, i) C(m, j) C(m+2, i+1) C(m+2, j+1) / C(2m+4, i+j+2)

Using lemma 8.4:

= ( Γ(2m+4) / Γ(m+2) )² (m+2)^{−2} (2m+5)^{−1} (m+2)² / [2(2m+1)(2m+3)]     (8.1)

:

Note: The previous lemma is of some independent interest because it is needed to provide the "maximal smoothing" bandwidth for θ̂_m.
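Since f is a polynomial on [0, 1], the lemma can be confirmed by exact symbolic integration. This check — added for illustration, using the closed form as reconstructed here, θ_m = ((2m+3)!/(m+2)!)² (2m+5)^{−1} (m+2)²/[2(2m+1)(2m+3)] — differentiates the Beta(m+2, m+2) density m times and integrates the square with rational arithmetic:

```python
from fractions import Fraction
from math import comb, factorial

def theta_beta(m):
    # exact integral of (f^(m))^2 for f the Beta(m+2, m+2) density on [0, 1]
    c = Fraction(factorial(2*m + 3), factorial(m + 1)**2)  # Gamma(2m+4)/Gamma(m+2)^2
    # coefficients of x^(m+1) (1-x)^(m+1), ascending powers
    poly = [Fraction(0)] * (2*m + 3)
    for r in range(m + 2):
        poly[m + 1 + r] = Fraction((-1)**r * comb(m + 1, r))
    poly = [c * a for a in poly]
    # m-th derivative
    for _ in range(m):
        poly = [k * poly[k] for k in range(1, len(poly))]
    # integrate the square over [0, 1]
    theta = Fraction(0)
    for a_deg, a in enumerate(poly):
        for b_deg, b in enumerate(poly):
            theta += a * b / (a_deg + b_deg + 1)
    return theta

def theta_closed(m):
    # reconstructed closed form of Lemma 8.5
    r = Fraction(factorial(2*m + 3), factorial(m + 2))
    return r * r / (2*m + 5) * Fraction((m + 2)**2, 2 * (2*m + 1) * (2*m + 3))

for m in range(1, 5):
    assert theta_beta(m) == theta_closed(m)
```

For m = 1 both sides equal 120/7, which can also be verified by integrating (f′)² for f(x) = 30x²(1−x)² by hand.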

Lemma 8.6: For f a continuous, differentiable density function, Σ_{i=0}^{k} θ_{m+i} aⁱ diverges as k → ∞ for any a.

i=O

Proof of lemma 8.6: From Terrell (1990), for a fixed standard deviation θ_m is minimized by the β(m+2, m+2) family. Define σ_β² to be the variance of the β(m+2, m+2) density which is non-zero only on [0, 1], i.e.,

σ_β² = 1 / [4(2m+5)].


Hence among densities with variance σ², the minimum value of θ_m is

min(θ_m) = (σ_β/σ)^{2m+1} θ_m(β),

where θ_m(β) is given in (8.1). So, using the lemma,

min(θ_m) = (σ_β/σ)^{2m+1} ( (2m+3)!/(m+2)! )² (2m+5)^{−1} (m+2)² / [2(2m+1)(2m+3)]

= ( 1/(2σ) )^{2m+1} ( (2m+3)!/(m+2)! )² (2m+5)^{−m−3/2} (m+2)² / [2(2m+1)(2m+3)].

Now Stirling's Formula shows that

min(θ_m) ≈ (1/8) e^{−1} ( 1/(2eσ) )^{2m+1} (2m+3)^{4m+7} / [ (2m+5)^{m+3/2} (m+2)^{2m+4} ]

> (1/8) e^{−1} ( 1/(2eσ) )^{2m+1} (8m)^m,

which grows faster than geometrically in m, so the series Σ θ_{m+i} aⁱ diverges.

Proof of Theorem 8.2:


By a Taylor expansion:

Similarly,

and in general this equals:

Next, we show the second term diverges. Substituting the bound θ_{m+i} < c₁ Γ(2m+2i) c₂^{−m−i} into the expansion yields a series of the form c₄ Σ_i Γ(2m+2i) (c₃/n)^i, where c₃ and c₄ depend only on m and f. This series diverges as a consequence of Lemma 8.6.

Proof of Corollary 8.2.1:

If f is a Normal mixture density then f(x) = Σ_{i=1}^{t} p_i φ_{σ_i}(x − μ_i), so:

By Cauchy-Schwarz:

The corollary follows immediately.

Proof of Corollary 8.2.2:

This is a direct consequence of Corollary 8.2.1.

Proof of Theorem 8.3:

Regarding the "k-step" estimator as a standard kernel estimator with a stochastic bandwidth gives:

For the first term,

So,

E[ Var(θ̂_m(ĥ) | ĥ) ] = C n^{−5/(2m+3)} E[ θ̂_{m+1}^{(4m+1)/(2m+3)} ].

Since E[ θ̂_{m+1}^{(4m+1)/(2m+3)} ] = O(1) for any kernel estimator, the first term is O(n^{−5/(2m+3)}).

For the second term,

Var[ E(θ̂_m(ĥ) | ĥ) ] ≤ Var[ θ̂_{m+1} ĥ² ] + Var[ O(ĥ⁴) ]

= Var[ 2(−1)^m K^{(2m)}(0) n^{−1} θ̂_{m+1}^{2/(2m+3)} ] + O(n^{−8/(2m+3)})

= C n^{−4/(2m+3)} { E[ θ̂_{m+1}^{4/(2m+3)} ] − [ E θ̂_{m+1}^{2/(2m+3)} ]² } + O(n^{−8/(2m+3)}).

If the bandwidth used in θ̂_{m+1} is O(n^{−1/(2m+5)}), then Lemma 7.1 in the previous chapter implies:

E[ θ̂_{m+1}^{4/(2m+3)} ] − [ E θ̂_{m+1}^{2/(2m+3)} ]² = O(n^{−2/(2m+5)}).

Hence, the second term is of smaller order than the first. The theorem follows by combining the two terms.


Proof of Theorem 8.4:

As in the proof of theorem 8.3:

Var(θ̂_m^k) ≥ C n^{−5/(2m+3)} E[ θ̂_{m+1}^{(4m+1)/(2m+3)} ]

for some positive C. But by Theorem 8.2, E[ θ̂_{m+1}^{(4m+1)/(2m+3)} ] → ∞ as k → ∞.


APPENDIX A: GAUSSIAN MIXTURE DENSITIES

Note: The figures in Appendix A were taken from Marron and Wand (1991).


REFERENCES

I. A. Ahmad (1976), "On the asymptotic properties of a functional of a probability density," Scand. Actuarial J., vol. 4, pp. 176-181.

I. A. Ahmad (1979), "Strong consistency of density estimation by orthogonal series methods for dependent variables with application," Ann. Inst. Stat. Math., pt. A, vol. 31, pp. 279-288.

B. K. Aldershof, J. S. Marron, B. U. Park, and M. P. Wand (1990), "Facts about the Gaussian probability density function," To appear.

J. C. Aubuchon and T. P. Hettmansperger (1984), "A note on the estimation of the integral of f²(x)," J. of Stat. Plan. and Inf., vol. 9, pp. 321-331.

M. S. Bartlett (1963), "Statistical estimation of density functions," Sankhya, vol. 25A, pp. 245-254.

G. K. Bhattacharyya and G. G. Roussas (1969), "Estimation of a certain functional of a probability density function," Skand. Aktuarietidskr., vol. 3-4, pp. 201-206.

P. J. Bickel and Y. Ritov (1988), "Estimating integrated squared density derivatives: sharp best order convergence estimates," Sankhya, vol. 50, ser. A, pt. 3, pp. 381-393.

K. F. Cheng and R. J. Serfling (1981), "On estimation of a class of efficacy-related parameters," Scand. Actuarial J., vol. 9, pp. 83-92.

Y. G. Dimitriev and F. P. Tarasenko (1974), "On a class of nonparametric estimates of nonlinear functionals of a density," Theory of Prob. Appl., vol. 19, pp. 390-394.

V. A. Epanechnikov (1969), "Nonparametric estimation of a multivariate probability density," Theory of Prob. Appl., vol. 14, pp. 153-158.

K. Fukunaga and J. M. Mantock (1984), "Nonparametric data reduction," IEEE Trans. Patt. Anal. Mach. Intell., vol. PAMI-6, pp. 115-118.

T. Gasser, H. G. Müller, and V. Mammitzsch (1985), "Kernels for nonparametric curve estimation," J. Roy. Statist. Soc., vol. B47, pp. 238-252.

H. L. Gray and W. R. Schucany (1972), The Generalized Jackknife Statistic, Marcel Dekker, New York.

P. Hall (1982), "Limit theorems for estimators based on inverse of spacings of order statistics," Ann. Prob., vol. 10, pp. 992-1003.

P. Hall and J. S. Marron (1987), "Estimation of integrated squared density derivatives," Stat. & Prob. Let., vol. 6, pp. 109-115.

J. Hodges and E. Lehmann (1956), "The efficiency of some nonparametric competitors of the t-test," Ann. Math. Stat., vol. 27, pp. 324-335.

M. C. Jones (1989), "Discretized and interpolated kernel density estimates," Jour. of Am. Statist. Ass., vol. 84, pp. 733-741.

M. C. Jones and H. W. Lotwick (1983), "On the errors involved in computing the empirical characteristic function," Journal of Statistical Computation and Simulation, vol. 17, pp. 133-149.

H. L. Koul, G. L. Sievers, and J. McKean (1987), "An estimator of the scale parameter for the rank analysis of linear models under general score functions," Scand. J. Statist., vol. 14, pp. 131-141.

E. L. Lehmann (1963), "Nonparametric confidence intervals for a shift parameter," Ann. Math. Stat., vol. 34, pp. 1507-1512.

T. Lissack and K. S. Fu (1976), "Error estimation in pattern recognition via L1-distance between posterior density functions," IEEE Trans. Info. Th., vol. IT-22, pp. 34-45.

J. S. Marron (1988), "Automatic smoothing parameter selection," Emp. Econ., vol. 13, pp. 187-208.

J. S. Marron and M. P. Wand (1991), "Exact mean integrated squared error," Ann. Statist., To appear.

A. Miura (1985), "Spacing estimation of the asymptotic variance of rank estimators of location," Proceedings of the Indian Statistical Institute Golden Jubilee International Conference on Statistics: Applications and New Directions, pp. 391-404.

H. G. Müller (1984), "Smooth optimum kernel estimators of regression curves, densities, and modes," Ann. Statist., vol. 12, pp. 766-774.

E. Parzen (1962), "On estimation of a probability density and mode," Ann. Math. Statist., vol. 33, pp. 1065-1076.

M. Pawlak (1986), "On nonparametric estimation of a functional of a probability density," IEEE Trans. Info. Th., vol. IT-32, no. 1, pp. 79-84.

B. L. S. Prakasa-Rao (1983), Nonparametric Functional Estimation, Academic Press, New York.

A. P. Prudnikov, Yu. A. Brychkov, and O. I. Marichev (1986), Integrals and Series, Vol. 1, Gordon and Breach, New York.

Y. Ritov and P. J. Bickel (1987), "Achieving information bounds in non and semi-parametric models," To appear in Ann. Stat.

W. R. Schucany and J. P. Sommers (1977), "Improvement of kernel type density estimators," Journal of the American Statistical Association, vol. 72, pp. 420-423.

E. F. Schuster (1974), "On the rate of convergence of an estimate of a functional of a probability density," Scand. Actuarial J., vol. 2, pp. 103-107.

T. Schweder (1975), "Window estimation of the asymptotic variance of rank estimators of location," Scand. J. Statist., vol. 2, pp. 113-126.

D. W. Scott and S. J. Sheather (1985), "Kernel density estimation with binned data," Comm. in Statist. - Theor. and Meth., vol. 14, pp. 1353-1359.

D. W. Scott and G. R. Terrell (1987), "Biased and unbiased cross-validation in density estimation," Jour. of Am. Statist. Ass., vol. 82, pp. 1131-1146.

S. J. Sheather and M. C. Jones (1989), "Reliable data-based bandwidth selection for kernel density estimation, with emphasis on a successful 'non-cross-validatory' approach," Unpublished manuscript.

G. R. Terrell (1990), "The maximal smoothing principle in density estimation," Jour. of Am. Statist. Ass., vol. 85, pp. 470-477.

B. van Es (1988), "Estimating functionals related to a density by a class of statistics based on spacings," To appear.

W. Wertz (1981), "A remark on the estimation of functionals of a probability density," Probl. Contr. Inform. Theory, vol. 10, pp. 279-285.

TABLE 1

Parameters for 15 example normal mixture densities.

Density

#1 Gaussian: N(0, 1)

#2 Skewed Unimodal: (1/5)N(0, 1) + (1/5)N(1/2, (2/3)²) + (3/5)N(13/12, (5/9)²)

#3 Strongly Skewed: Σ_{l=0}^{7} (1/8) N(3{(2/3)^l − 1}, (2/3)^{2l})

#4 Kurtotic Unimodal: (2/3)N(0, 1) + (1/3)N(0, (1/10)²)

#5 Outlier: (1/10)N(0, 1) + (9/10)N(0, (1/10)²)

#6 Bimodal: (1/2)N(−1, (2/3)²) + (1/2)N(1, (2/3)²)

#7 Separated Bimodal: (1/2)N(−3/2, (1/2)²) + (1/2)N(3/2, (1/2)²)

#8 Skewed Bimodal: (3/4)N(0, 1) + (1/4)N(3/2, (1/3)²)

#9 Trimodal: (9/20)N(−6/5, (3/5)²) + (9/20)N(6/5, (3/5)²) + (1/10)N(0, (1/4)²)

#10 Claw: (1/2)N(0, 1) + Σ_{l=0}^{4} (1/10)N(l/2 − 1, (1/10)²)

#11 Double Claw: (49/100)N(−1, (2/3)²) + (49/100)N(1, (2/3)²) + Σ_{l=0}^{6} (1/350)N((l − 3)/2, (1/100)²)

#12 Asymmetric Claw: (1/2)N(0, 1) + Σ_{l=−2}^{2} (2^{1−l}/31)N(l + 1/2, (2^{−l}/10)²)

#13 Asymmetric Double Claw: Σ_{l=0}^{1} (46/100)N(2l − 1, (2/3)²) + Σ_{l=1}^{3} (1/300)N(−l/2, (1/100)²) + Σ_{l=1}^{3} (7/300)N(l/2, (7/100)²)

#14 Smooth Comb: Σ_{l=0}^{5} (2^{5−l}/63)N({65 − 96(1/2)^l}/21, (32/63)²/2^{2l})

#15 Discrete Comb: Σ_{l=0}^{2} (2/7)N((12l − 15)/7, (2/7)²) + Σ_{l=8}^{10} (1/21)N(2l/7, (1/21)²)

[Plots of the 15 normal mixture densities: #1 Gaussian, #2 Skewed Unimodal, #3 Strongly Skewed, #4 Kurtotic Unimodal, #5 Outlier, #6 Bimodal, #7 Separated Bimodal, #8 Skewed Bimodal, #9 Trimodal, #10 Claw, #11 Double Claw, #12 Asymmetric Claw, #13 Asymmetric Double Claw, #14 Smooth Comb, #15 Discrete Comb.]