Acta Applicandae Mathematicae 50: 253–340, 1998. © 1998 Kluwer Academic Publishers. Printed in the Netherlands.

Statistical Analysis of Mixtures and the Empirical Probability Measure

PHILIPPE BARBE
CNRS – Université Paul Sabatier, Laboratoire de Statistique et Probabilités, 118 Route de Narbonne, 31062 Toulouse, France. e-mail: [email protected]

(Received: 14 March 1997)

Abstract. We consider the problem of estimating a mixture of probability measures in an abstract setting. Twelve examples are worked out, in order to show the applicability of the theory.

Mathematics Subject Classifications (1991): 62G05, 62F12, 62G20, 41A35, 44A99, 45P05.

Key words: mixture models, empirical process, approximation by linear operators, integral transforms.

1. Introduction

Mixture models have been used in a wide range of applications, including economics, biology, medicine, astronomy, and pattern recognition, and can be traced back more than one hundred years, to the work of Pearson (1894). Although a huge literature has been produced since then, in particular during the last 30 years, very little is known about the various estimation procedures for these models. The purpose of this paper is to develop a general theory of statistical estimation for mixture models. It also provides an example of a statistical problem which can be better understood – if not solved – by combining notions from several areas of mathematics. To explain our goal precisely, let us first recall what a mixture model is. We will then review some of the literature on mixture models, and sketch the content of the paper as well as say a few words on the relevant mathematical tools.

Consider a family of probability measures (p.m.’s) P = {Pθ : θ ∈ Θ} defined on some separable metric space X. Throughout this paper we assume that the parameter space Θ is also a separable metric space. Given a p.m. µ on Θ endowed with its Borel σ-field, we define the mixture

Pµ := ∫ Pθ dµ(θ).

More precisely, for any Borel set A ⊂ X, we define Pµ(A) := ∫ Pθ(A) dµ(θ)

(see, e.g., Robbins (1948) for a justification that this indeed defines a measure). This paper deals with one aspect of the following statistical problem. Given a sample X1, . . . , Xn of independent and identically distributed (i.i.d.) random variables (r.v.’s) with a common p.m. Pµ∗, we want to estimate Pµ∗ and µ∗ on the basis of the sample, knowing the family P and that µ∗ belongs to a known set M of p.m.’s (we shall mainly focus on the case where M is the set of all p.m.’s on Θ).
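To fix ideas, here is a minimal sketch (our own illustration, not from the paper) of the two-stage data-generating mechanism behind Pµ∗: first draw θ from µ∗, then draw X from Pθ. The concrete choices of µ∗ and Pθ below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(n, draw_theta, draw_x_given_theta):
    """Draw X_1, ..., X_n i.i.d. from P_{mu*}: theta_i ~ mu*, then X_i ~ P_{theta_i}."""
    thetas = draw_theta(n)
    return draw_x_given_theta(thetas)

# Illustrative (hypothetical) choice: mu* = Exp(1) on Theta = R_+,
# and P_theta = Exp(theta), the exponential p.m. with mean 1/theta.
X = sample_mixture(
    10_000,
    draw_theta=lambda n: rng.exponential(1.0, size=n),
    draw_x_given_theta=lambda thetas: rng.exponential(1.0 / thetas),
)
```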

It is well known that the empirical p.m. Pn := n^{−1} ∑_{1≤i≤n} δXi is in various senses a consistent estimator of Pµ∗ (this is in essence the strong law of large numbers for the measure-valued r.v.’s δXi, which asserts that Pn converges in some sense to Pµ∗ as n → ∞). However, Pn is not of the form Pµ in general. Our goal is then to investigate how close Pn is (in a sense to be made precise later) to the set {Pµ : µ ∈ M}. More precisely, we shall define a distance between Pn and the set of mixtures, say dn := d(Pn, {Pµ : µ ∈ M}), and consider the distribution Πn of n^{1/2}dn. The starting point of our investigation will be to show that, under reasonable conditions, Πn converges weakly*. Since the distribution of Pn depends on Pµ∗, Πn depends on µ∗, and its weak* limit, Π(µ∗), also depends on µ∗. Hence we can define a mapping µ∗ ↦ Π(µ∗). Now, we want to describe very precisely when the distance dn between Pn and the set of mixtures is of order o(n^{−1/2}) (as n → ∞). In such a case, this distance is of magnitude smaller than the stochastic fluctuations of Pn (which are of order n^{−1/2} thanks to the central limit theorem), and Pn is then very close to being a mixture. That dn is of order o(n^{−1/2}) means that Π(µ∗) is the point mass at 0. This leads us to characterize the set Π^{−1}(δ0). Then, we shall study the mapping Π(·), describe its level sets (this tells us for which measures µ∗ the distances dn have the same limiting distribution) and study its continuity.

The results we shall obtain provide a general asymptotic theory of minimum distance estimators in mixture models. Although the paper is of a theoretical nature, its applicability will be extensively illustrated in a number of examples.

We now briefly review the literature on mixture models, explain how the present investigation is related to previous works, and indicate how those previous works can be useful in conjunction with our new results. We will consider the following five topics: identifiability, classes of mixtures that have been considered, maximum likelihood estimation, other methods of estimation, and efficiency of the estimators.

Identifiability. Identifiability means here that the mapping µ ∈ M ↦ Pµ is one-to-one. If the parameter of interest is µ∗, then identifiability is clearly needed in order to obtain a consistent estimator, that is, to find a measure µn, calculated from the data X1, . . . , Xn, which converges to µ∗ in a reasonable sense as n → ∞. The seminal papers of Teicher (1960, 1961, 1963, 1967) have led many authors to study this question. Among those, we mention Barndorff-Nielsen (1965), Yakowitz and Spragins (1968), Tallis (1969), Rennie (1972), Chandra (1977), Tallis and Chesson (1982), and Bach, Plachky and Thomson (1987).


Some special classes, such as the von Mises distributions, are studied by Fraser, Hsu and Walker (1981), and mixtures of Gaussian distributions by Bruni and Koch (1985); Kent (1983) studies identifiability of mixtures arising in directional data; Luxmann-Ellinghaus (1987) shows the identifiability of mixtures of some power series distributions (see also Milhaud and Mounime (1993) for an improved result). Patil and Bildikar (1966) emphasized mixtures of discrete distributions. A survey of the identifiability issue may be found in Prakasa Rao (1992, Chapter 8). Since our model is related directly to Pµ∗ and not to µ∗ itself, we shall not need any identifiability assumption in our main results. However, to turn them into statistical inference procedures on µ∗, an assumption much weaker than identifiability will be required (see Section 5). It will be seen (in Section 7) that the above-mentioned works are useful for obtaining sufficient conditions for our assumptions to hold in concrete examples.

Classes of mixtures. Several classes P have been studied from a theoretical point of view and, therefore, their estimation is of interest. The reader may refer to Section 7 of this paper for further references on ‘classical’ examples. The monograph by Lindsey (1995) also provides several motivations for studying mixture models. However, we mention three less classical examples which do not seem to have received much attention as far as statistical procedures are concerned. Mixture representations have been very useful for investigating the infinite divisibility of some distributions. Our results do not allow us to obtain an infinitely divisible estimator of an infinitely divisible distribution. However, a few subclasses of infinitely divisible distributions can be estimated as mixtures, and these provide examples of application of this paper. Mixture representations of some infinitely divisible distributions may be found in the works of Goldie (1967), Steutel (1967, 1968, 1970), Kelker (1971), Keilson and Steutel (1972), Shanbhag and Sreehari (1977) and Kent (1981). In this area the seminal work of Thorin (1977) must be mentioned. A beautiful treatment of the link between mixture representations and infinite divisibility, with further references, is in the monograph by Bondesson (1992). Mixtures also appear naturally in limit theorems in probability theory (see Martin and Schwartz (1972), Eagleson (1975), Rootzen (1977)). Here again our results can be used to derive statistical procedures to estimate the limiting distributions by distributions of the same type. Finally, p.m.’s invariant under a group of transformations also appear as mixtures of extremal measures (see Varadarajan (1963), Farrell (1962), Maitra (1977) and references therein). Estimation procedures for such mixtures within an abstract framework have not been considered in the literature, possibly because of the strong emphasis on the so-called maximum likelihood estimator. Indeed, one should notice that maximum likelihood theory is not that widely applicable, since it requires a density for the p.m.’s and therefore the existence of a measure dominating all the p.m.’s considered. A very classical example of invariant p.m.’s is given by those invariant under the group of rotations of R^d; they are mixtures of uniform distributions over the spheres. More complicated examples of measures invariant under a group of transformations appear very naturally in integral geometry, and we refer to the book by Santalo (1984) for a general exposition and references.

Maximum likelihood estimation. Several methods to estimate µ∗ have been proposed, although little is actually known about each of them. The most popular estimator in the theoretical literature is the maximum likelihood estimator (m.l.e.). It turns out to be consistent under very general conditions (Pfanzagl, 1988; Masse, 1993), but it requires a density to exist. Our results do not need the Pθ’s to admit a density w.r.t. a fixed p.m. The algorithmic aspects of the m.l.e. have been studied by Simar (1976) in the case of a mixture of Poisson distributions and by Jewell (1982) for exponential distributions, while Lindsay (1983a, b) emphasized exponential families. Van de Geer (1993) studied rates of convergence of the m.l.e. of the density of Pµ∗ under entropy conditions on the family P. Eggermont and LaRiccia (1995) studied consistency and rates of convergence of the maximum smoothed likelihood density estimator when Θ = X = R^d. As far as limiting distributions are concerned, the situation is rather easy to describe. To the author’s knowledge, there is no general central limit theorem type result for the m.l.e. However, central limit theorems are available for smooth functionals of Pµ∗ in the Poisson case (Lambert and Tierney, 1984), and for more general mixtures of discrete distributions (see Milhaud and Mounime (1993), who deal with Noack’s (1950) distributions). Van de Geer (1994) succeeded in proving a central limit theorem and the efficiency of the m.l.e. in estimating a linear functional of Pµ∗, but she required several conditions which seem difficult to check in practical cases. Further properties of the m.l.e., such as self-consistency, are in Laird (1978). In contrast to these works on maximum likelihood for special models, we can provide a widely applicable theory. Our assumptions will be seen to be very mild. Compared with the existing works, we could even argue that we do not have any assumption at all, in the sense that once the family P is picked, we can provide a universal method to estimate the mixture (consider a universal Donsker class of functions in the next sections).

Other methods of estimation. Methods of estimation other than the m.l.e. have been proposed in various special cases of mixtures, and we shall point out the relevant literature in the corresponding examples in Section 7. As far as general mixtures are concerned, Choi and Bulgren (1968) proved the consistency of a minimum distance estimator based on the Wolfowitz distance between the empirical cumulative distribution function (c.d.f.) and the c.d.f.’s of mixtures. A variant of their estimator and similar results are in MacDonald (1971). MacDonald shows some simulations which suggest that his estimator is less biased than Choi and Bulgren’s (1968). Deely and Kruse (1968) study a minimum distance estimator based on the Kolmogorov distance. This estimator is very appealing from a computational point of view since it leads to a linear programming problem, and it seems to the author that it has been greatly overlooked. The approach of Deely and Kruse (1968) has been further developed by Blum and Susarla (1977). The estimators studied in this paper may be viewed as an abstract version of the Deely and Kruse (1968) estimator. We shall prove central limit theorem-type results. In concrete applications, our estimators can often be computed using linear programming techniques such as the simplex algorithm or the polynomial-time algorithms of Khachiyan (1979) or Karmarkar (1984). One result of this paper is that building a general theory for mixtures on abstract spaces turns out to be much easier than dealing with special cases. Once the abstract case has been studied, applications to concrete examples follow naturally, and we shall obtain some unexpected results in these examples.
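To make the linear programming connection concrete, here is a minimal sketch (our own illustration, not the paper's algorithm): with a fixed finite grid θ1, . . . , θK of candidate support points, minimizing the Kolmogorov distance between the empirical c.d.f. and the mixture c.d.f. over the mixing weights reduces to a linear program in (w, t). The function name and the restriction of the supremum to the data points are our simplifications.

```python
import numpy as np
from scipy.optimize import linprog

def min_kolmogorov_mixture(x, thetas, cdf):
    """Minimize sup_z |sum_k w_k F_{theta_k}(z) - F_n(z)| over weights w on a
    fixed grid of thetas; evaluating the sup at the (sorted) data points only,
    this is a linear program in the variables (w_1, ..., w_K, t)."""
    x = np.sort(np.asarray(x))
    n, K = len(x), len(thetas)
    Fn = np.arange(1, n + 1) / n                       # empirical c.d.f. at the x_j
    A = cdf(x[:, None], np.asarray(thetas)[None, :])   # A[j, k] = F_{theta_k}(x_j)
    c = np.r_[np.zeros(K), 1.0]                        # minimize t
    A_ub = np.vstack([np.c_[A, -np.ones(n)],           #  A w - t <= F_n
                      np.c_[-A, -np.ones(n)]])         # -A w - t <= -F_n
    b_ub = np.r_[Fn, -Fn]
    A_eq = np.r_[np.ones(K), 0.0][None, :]             # weights sum to one
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * K + [(None, None)])
    return res.x[:K], res.x[-1]                        # weights w, attained distance

# e.g., for mixtures of exponentials: cdf = lambda x, th: 1.0 - np.exp(-th * x)
```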

Efficiency. We mention the semi-parametric approach of Van der Vaart (1991), who studied conditions for n^{1/2}-consistency of functionals of the mixing measure. A somewhat related work has been done by Tierney and Lambert (1984), who investigate the estimation of differentiable functionals of Pµ∗. They showed in an example (mixture of Poisson distributions) that any estimator based on an estimator of Pµ∗ has a greater asymptotic variance than the one obtained by using the empirical p.m. or, equivalently, that the empirical p.m. is the most precise estimator of Pµ∗ in the Poisson mixture case. The general results on efficiency in semi-parametric models by Bickel, Klaassen, Ritov and Wellner (1993) are also relevant. Therefore, it is clear that it is not worth estimating Pµ∗ as a mixture if one is interested only in functionals of Pµ∗; one might do better to use the naive empirical probability measure. But it turns out that the differentiability of a functional can often be obtained using some special types of distances between p.m.’s which are deeply rooted in the modern theory of empirical processes (we refer the reader to the discussion on the choice of a metric in Chapter 1 of Barbe and Bertail (1995)), and which will be the basic tool in our approach. We shall not study the efficiency of our estimators in detail; we shall only give a sufficient condition for them to be efficient. To our knowledge, no result of this generality is known for other estimators. Although this is a very restrictive point of view, one can see this paper as an attempt to build efficient estimators, and to partially describe when efficiency can be obtained with minimum distance type estimators.

Further work has been done in special cases; in particular, deconvolution problems are a special case of mixtures. We refer to the work of Ritov (1987), Carroll and Hall (1988), Devroye (1989), Stefanski (1990), Zhang (1990) or Fan (1991) and references therein for this aspect, and it is clear that our results can be applied to deconvolution problems.

There has been a tremendous amount of work on the statistical analysis of finite mixtures (i.e., assuming that µ∗ is supported by a finite number of points). We shall not deal specifically with finite mixture models, but our results apply when one does not put any upper bound on the number of components (see Remarks 3.1 and 3.2).


We refer to the paper by Redner and Walker (1984) and the books by Everitt and Hand (1981), Titterington, Smith and Makov (1985) and McLachlan and Basford (1988) for further references and examples of application of mixture models.

Content of the paper and relevant mathematical tools. The paper is organized as follows.

In Section 2 we fix some notation and define our estimators. Under very general conditions, we prove that our minimum distance estimators are consistent; therefore, the main goal described at the beginning of this introduction is meaningful. The relevant mathematical theory we will use (in Section 2 as well as in essentially all the other sections) is modern abstract empirical process theory. It is closely related to the field of probability in Banach spaces, and deals mainly with weak* convergence of some measures defined on spaces of measures. In Section 3, we start our way towards answering our main question, and we give some general results on the limiting behaviour of a distance between Pµn, our estimator of Pµ∗, and the empirical probability measure Pn of X1, . . . , Xn. In this section, we will need some probabilistic coupling techniques, and will briefly show how a statistical technique, the bootstrap, can be used in practical applications. The results of Section 3 lead us (in Section 4) to obtain a necessary and sufficient condition for Pµn to behave asymptotically as Pn. In other words, we characterize situations where the distance between Pµn and Pn is very small. The result is of an algebraic nature and shows that the main problem is to describe the space spanned by some functions built from the family P. As far as the proof of the main theorem goes, the relevant mathematical tool is abstract Wiener space theory. This allows us to transform a probabilistic problem into a functional analytic one. For applications, we will need mainly some classical functional analysis, a result which is usually related to diffusion processes or stochastic differential equations, and the so-called quantile representation used in mathematical statistics. Since statisticians are sometimes interested in estimating µ∗ and not only Pµ∗, Section 5 shows how to turn results on Pµn into results on µn. This leads us to study the role of identifiability in mixture models in our abstract setting, as well as to take a new look at the results of Section 4, using a condition weaker than identifiability. In particular, Section 5 shows that with minimum distance estimators we do not need identifiability. The mathematical notions coming in are the invertibility of a linear operator and some elementary functional analysis characterizing completeness. In Section 6 we study some basic properties of the function which associates to a mixing distribution the limiting distribution of the distance between Pµn and Pn. In particular, we try to see which aspects of µ∗ are kept in this limiting distribution. This section does not use any new mathematical ingredient, but offers a combination of those introduced previously. The theory developed in Sections 2–6 is then used in Section 7, where we study examples. We apply our main results to obtain sharp conditions under which the empirical p.m. is very close to a mixed distribution. In particular, we closely examine mixtures of normal, Poisson, Gamma, etc. distributions. This section uses mainly elementary facts on analytic functions, some classical asymptotic analysis, and some properties of Hermite polynomials.

In Section 8 we obtain a new necessary and sufficient condition for Pn to behave asymptotically as Pµn, which motivates us to study some representations of the inverse of an operator canonically associated to P. Mathematically, this links our statistical problem to that of approximating a discontinuous operator by a family of nicely behaved ones, and, in concrete applications, to some problems on approximation of functions and inversion of integral transforms. The new characterization obtained in Section 8 is applied in Section 9 to the same examples. Although it is not stated explicitly, the main mathematical fact used in Section 9 is that the continuous functions are dense in the space of distributions; our examples are in essence concrete approximations of some distributions by continuous functions, and they require mainly some classical analysis. The final section contains some general comments and open problems. An appendix gives some approximation results that are needed in our proofs.

To make our results rather concrete, and despite the fact that Sections 7 and 9 are devoted to applications, we shall illustrate the results of Sections 2–6 throughout the paper with the following three basic examples, labelled Examples 1–3. More information on these examples can be found in Section 7.

EXAMPLE 1 (Estimating a p.m.). Let Θ = X and let Pθ = δθ be the Dirac measure at θ. We take M as the set of all p.m.’s on Θ. Then Pµ∗ = µ∗, so that estimating the mixing measure µ∗ is the same as estimating the underlying p.m. Pµ∗. This example has not been considered in the previous literature on mixtures, but turns out to be extremely useful in guessing the type of results we can achieve.

EXAMPLE 2 (Mixture of uniform densities). Let X = Θ = R+ and let Pθ be the uniform distribution over [0, θ], denoted also U[0,θ], with U[0,0] = δ0. We take M as the set of all p.m.’s on Θ. Thanks to a theorem of Kintchine (1938), unimodal distributions on R+ with mode 0 coincide with mixtures of the uniform distributions Pθ, θ ≥ 0. In other words, a p.m. P on R+ is unimodal with mode 0 iff there exists a µ∗ such that P = Pµ∗. Thus, a nice way to obtain a unimodal estimator of a unimodal distribution is to estimate µ∗ by some µn and then to take Pµn. Our estimators are actually well suited for that type of problem, where one is much more interested in Pµ∗ than in µ∗.

EXAMPLE 3 (Mixture of exponential distributions). Let X = Θ = R+ and let Pθ = Exp(θ) be the exponential p.m. with density θ e^{−θx} w.r.t. the Lebesgue measure. We take M as the set of all p.m.’s on Θ. Writing Fµ∗(x) = Pµ∗[0, x] = 1 − ∫ e^{−θx} dµ∗(θ), we see that estimating µ∗ leads to a Monte Carlo method to invert Laplace transforms (see, e.g., Jewell (1982)).
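As a quick numeric illustration of this identity (our own example, with a choice of µ∗ not taken from the paper): if µ∗ is the Gamma distribution with shape a and rate b, its Laplace transform is (b/(b + x))^a, so the mixture c.d.f. should be 1 − (b/(b + x))^a.

```python
import numpy as np

rng = np.random.default_rng(1)
a, b, n = 2.0, 3.0, 200_000

theta = rng.gamma(a, 1.0 / b, size=n)      # theta ~ Gamma(shape a, rate b)
X = rng.exponential(1.0 / theta)           # X | theta ~ Exp(theta)

x = np.array([0.5, 1.0, 2.0, 5.0])
F_emp = (X[:, None] <= x).mean(axis=0)     # empirical c.d.f. at a few points
F_lap = 1.0 - (b / (b + x)) ** a           # 1 - L(mu*)(x), the mixture c.d.f.
print(np.column_stack([x, F_emp, F_lap]))  # the two middle/right columns agree
```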

At this point, the reader should be aware of the following. Although we shall work only under the assumption that the observations X1, . . . , Xn are i.i.d., some of our results can be extended (using recent results on empirical processes under dependence conditions, as surveyed in Andrews and Pollard (1994)) to cases where both the independence and the identical distribution assumptions are dropped. Also, the case of censored data can be handled by the same techniques as ours. Though little has been done in this direction in the literature on mixtures, this extension may be worthwhile, given the work of Delbaen and Hazendonck (1984) and some recent developments in time series and some censored models.

2. Definition and Consistency of a Minimum Distance Estimator

We are now going to define the minimum distance estimators we are interested in, and prove their consistency. When Θ = R^k, it is quite natural to look at µ∗ through its distribution function. This point of view is somewhat misleading, since it is specific to Euclidean spaces. Having in mind Example 1 and the empirical process theory of Dudley (1984) or Pollard (1984), it is much more natural, having an estimator µn of µ∗, to look at the corresponding process indexed by a class of functions (see below).

Before going further, we need to introduce some general notation. If Y is a metric space endowed with its Borel σ-field, we denote by M(Y) (resp. M+(Y), M1(Y)) the set of all signed (resp. nonnegative, probability) finite measures on Y. If G is a class of real-valued functions defined on Y, it induces a pseudo-seminorm ‖·‖G on M(Y) (and therefore on M+(Y) and M1(Y)) defined by

‖ν‖G := sup{ |νg| : g ∈ G },

where we denote νg := ∫ g dν. It should be noticed that we may have ‖ν‖G = 0 without having ν = 0 if G is not rich enough, and we may also have ‖ν‖G = ∞.

Going back to the problem of estimating Pµ∗ ∈ M1(X), it is quite natural in our abstract setting to consider a class F of real-valued functions on X. It provides a pseudo-seminorm ‖·‖F on M(X).
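As a minimal illustration (notation and example ours): for two empirical measures and a finite class of functions, this pseudo-seminorm can be computed directly.

```python
import numpy as np

def seminorm(sample_p, sample_q, funcs):
    """||P - Q||_G = sup_{g in G} |Pg - Qg| for two empirical measures,
    with the supremum taken over a finite class `funcs`."""
    return max(abs(np.mean(g(sample_p)) - np.mean(g(sample_q))) for g in funcs)

# With indicators g_z = 1{. <= z}, this is a discretized Kolmogorov distance:
rng = np.random.default_rng(2)
xp, xq = rng.exponential(1.0, 500), rng.exponential(1.2, 500)
G = [lambda x, z=z: (x <= z).astype(float) for z in np.linspace(0.1, 5.0, 50)]
print(seminorm(xp, xq, G))
```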

Let us define the empirical p.m.

Pn := n^{−1} ∑_{1≤i≤n} δXi .

When the Xi’s are i.i.d., it is well known that Pn is an estimator of the unknown Pµ∗. In this section we shall not assume that the Xi’s are i.i.d., in order to show roughly how the results of the next sections can be extended to a non-i.i.d. setting. We strongly point out to the statistician readers that all we actually need is that Pn is a consistent estimator of Pµ∗ in the sense that

lim_{n→∞} ‖Pn − Pµ∗‖F = 0 a.s., (2.1)

no matter what Pn is. In particular, for censored data, one could let Pn be the p.m. pertaining to the Kaplan–Meier estimator, say. When the Xi’s are i.i.d., classes F for which (2.1) holds are called Pµ∗-Glivenko–Cantelli classes (GC classes). In the i.i.d. setting, the reader is referred to Dudley (1984) or Pollard (1984) for conditions under which (2.1) holds, and to Talagrand (1987, 1995) for a characterization of GC classes.

What (2.1) tells us is that Pn is an approximation of Pµ∗ in the sense of ‖·‖F (no matter how Pn is obtained!). Therefore, since Pn is observed, one way to estimate Pµ∗ is to approximate Pn itself by a mixture. This yields the estimator µn such that

‖Pµn − Pn‖F ≤ inf{ ‖Pµ − Pn‖F : µ ∈ M } + ηn, (2.2)

where ηn is a positive sequence which converges to 0 a.s. It must be noticed that µn may not be unique, especially if F is rather small. We can take ηn = 0 if the infimum in the r.h.s. of (2.2) is achieved. Also, µn depends on the sample X1, . . . , Xn and on the class F. We do not keep track of F in the notation µn, but the reader should keep it in mind, since we shall use various classes in concrete examples.

Notice also that the estimator Pµn does not depend on the parametrization chosen for the family P.

In view of (2.2), it is also convenient to define

∆n := ‖Pµn − Pn‖F and ∆ := inf{ ‖Pµ − Pµ∗‖F : µ ∈ M } < ∞.

Clearly, if µ∗ ∈ M, then ∆ = 0. But we can have ∆ = 0 without having µ∗ ∈ M (take, for instance, M = {µ}, µ ≠ µ∗, and F = {x ∈ X ↦ 1}).

When ∆ = 0, our first proposition shows that Pµn is a reasonable estimator for Pµ∗ in the sense of the pseudo-seminorm ‖·‖F.

PROPOSITION 2.1. If (2.1) holds, then lim_{n→∞} ∆n = ∆ a.s.

Proof. Using the triangle inequality and the definition of µn, we have, for any ν ∈ M,

∆n ≤ inf_{µ∈M} ‖Pµ − Pn‖F + ηn ≤ ‖Pν − Pµ∗‖F + ηn + ‖Pµ∗ − Pn‖F .

Taking the infimum over ν ∈ M and using (2.1), lim sup_{n→∞} ∆n ≤ ∆ a.s. But we also have

∆n ≥ inf_µ ‖Pµ − Pn‖F ≥ inf_µ | ‖Pµ − Pµ∗‖F − ‖Pµ∗ − Pn‖F | ≥ ∆ − ‖Pµ∗ − Pn‖F ,

so that (2.1) yields lim inf_{n→∞} ∆n ≥ ∆ a.s. □


Proposition 2.1 tells us that if ∆ = 0, if (2.1) holds, and if ηn converges a.s. to 0, then Pµn is a consistent estimator of Pµ∗ in the sense of ‖·‖F; but this does not say much about µn w.r.t. µ∗.

To deduce results on µn, which lives on Θ, from results on Pµn, which lives on X, we need something to transfer results related to X into results related to Θ. The trick is to define an operator canonically associated with the family P. Precisely, we view the family P as an operator which maps measurable functions defined on X into measurable functions defined on Θ. If f is Pθ-integrable for every θ ∈ Θ, we define

Pf(θ) := Pθf = ∫ f(x) dPθ(x).

We shall assume that

all the functions in F are Pθ-integrable for every θ ∈ Θ. (2.3)

Then the operator P maps F into

H := PF = { Pf : f ∈ F },

which is a class of functions from Θ to R. Extending the notation introduced in Section 1, if ν ∈ M(Θ), we denote by Pν the signed measure Pν := ∫ Pθ ν(dθ) (i.e., Pν(A) := ∫ Pθ(A) dν(θ)).

The key to this paper is the following very easy result.

PROPOSITION 2.2. For any signed measure ν ∈ M(Θ), ‖Pν‖F = ‖ν‖H.

Proof. By the definition of Pν, we have for any f ∈ F

∫ f dPν = ∫∫ f(x) Pθ(dx) ν(dθ) = ∫ (Pθf) ν(dθ) = ∫ (Pf)(θ) ν(dθ),

and the result follows from the definition of H. □

Combining Propositions 2.1 and 2.2, we deduce

PROPOSITION 2.3. If (2.1) and (2.3) hold, then

lim_{n→∞} ‖µn − µ∗‖H = inf{ ‖µ − µ∗‖H : µ ∈ M } a.s.

In particular, if µ∗ ∈ M, the estimator µn converges a.s. to µ∗ in the ‖·‖H-pseudo-seminorm.

Although it will not be used in the sequel, we mention the following property of the mapping P.


LEMMA 2.1. For any q ≥ 1, P is a contraction from L^q(Pµ∗) into L^q(µ∗). Moreover, P preserves positivity, and

‖P‖_{L^q(Pµ∗)→L^q(µ∗)} := sup{ ‖Pf‖_{L^q(µ∗)} : ‖f‖_{L^q(Pµ∗)} ≤ 1 } = 1.

Proof. That P preserves positivity is obvious. Next, let f ∈ L^q(Pµ∗). The operator P is a contraction from L^q(Pµ∗) to L^q(µ∗) since, using Jensen’s inequality for each Pθ,

‖Pf‖_{L^q(µ∗)} = ( ∫ |Pf(θ)|^q dµ∗(θ) )^{1/q} ≤ ( ∫ (Pθ|f|)^q dµ∗(θ) )^{1/q} ≤ ( ∫ Pθ(|f|^q) dµ∗(θ) )^{1/q} = (Pµ∗|f|^q)^{1/q} = ‖f‖_{L^q(Pµ∗)}.

If f is the constant function 1, then ‖Pf‖_{L^q(µ∗)} = ‖f‖_{L^q(Pµ∗)} = 1. □

Let us illustrate the use of Propositions 2.1 and 2.3 with our three basic examples.

EXAMPLE 1 (continued). The mapping P is just the identity, and H = F. Clearly, µn = Pn ensures that (2.2) holds, even with ηn = 0. Propositions 2.1 and 2.3 say nothing more than (2.1), and we cannot expect a better result.

EXAMPLE 2 (continued). Let F = { 1{· ≤ z} : z ≥ 0 }, so that ‖·‖F is the Kolmogorov distance (sup-norm) between the c.d.f.’s. We shall then call F the Kolmogorov class in this example. The mapping P is Pf(θ) = θ^{−1} ∫₀^θ f(x) dx for θ > 0, and Pf(0) = f(0). We denote fz(x) := 1{x ≤ z}. Thus, H = { hz : z ≥ 0 } ∪ { θ ↦ 1{θ = 0} }, where hz(θ) = Pfz(θ) = 1 ∧ (z/θ) for θ > 0.

EXAMPLE 3 (continued). We take F = { 1{· ≥ z} : z ≥ 0 }, and denote now fz(x) := 1{x ≥ z}. The norm ‖·‖F again generates the sup-norm between c.d.f.’s, and we shall call F the complement Kolmogorov class. The mapping P is

Pf(θ) = θ ∫ f(x) e^{−θx} dx = θ L(f)(θ),

where L(·) is the Laplace transform operator. Therefore H = { θ ↦ e^{−θz} : z ≥ 0 }. Since

‖µn − µ∗‖H = sup_{z≥0} |L(µn)(z) − L(µ∗)(z)|,

Proposition 2.3 shows that the Laplace transform of µn converges pointwise a.s. to that of µ∗. Therefore, µn converges weakly* a.s. to µ∗.

Comparing Examples 2 and 3, we see that in Example 3 we obtain much more (i.e., weak* convergence of µn to µ∗) by an extra argument relying on the convergence of Laplace transforms. We can do something similar in Example 2.

EXAMPLE 2 (continued). Let F1 = { f_{1,z} : z ≥ 0 }, where f_{1,z}(x) = (1 − zx) e^{−zx}. Then h_{1,z}(θ) := Pf_{1,z}(θ) = e^{−θz}. Therefore, if H1 = PF1, the norm ‖·‖_{H1} between two p.m.’s is the sup-norm between their Laplace transforms. Consequently, if µ∗ ∈ M, µn converges weakly* a.s. to µ∗.

What we did to obtain convergence of the Laplace transform in Example 2 was, first, to choose the class H1, and then, for each function h ∈ H1, to solve the equation Pf = h in order to construct F. This is not always possible. For instance, in Example 3, all the functions h that can be obtained through the mapping P are of the form h(θ) = θ L(f⁺)(θ) − θ L(f⁻)(θ), where f⁺ := f ∨ 0 and f⁻ := −f ∨ 0. Hence, they are differences of two completely monotone functions. In particular, one cannot generate functions such as θ ↦ 1{θ ≤ z}. This may be viewed as a drawback of our framework. It is not, in the sense that in some cases an extra argument gives the convergence of the c.d.f.’s in sup-norm (i.e., allows us to change the class F or H). This is, for instance, the case in Example 3 when the limiting p.m. µ∗ is continuous.

The reader should also notice that no identifiability assumption is required in this section. One nice thing about the operator P and the way the classes H are built is that they allow us to work easily in the quotient space of M(Θ) by the equivalence relation µ ∼ ν iff Pµ = Pν.

3. General Results on the Limiting Distribution of inf_µ ‖Pµ − Pn‖F

From now on, we shall assume, without stating it explicitly, that the sample X1, . . . , Xn is i.i.d. with some common p.m. Pµ∗.

In this section we show that inf_µ ‖Pµ − Pn‖F admits a limiting distribution (under mild conditions), and we shall say a few things about this limiting distribution. For this purpose, we need to strengthen (2.1) by using the weak* convergence theory. We view Pn as a random element in the space l∞(F) of all bounded functions over F. The space l∞(F) is endowed with the sup-norm ‖·‖F and equipped with a σ-algebra generated by the balls centered on continuous functions of l∞(F), which makes all the finite-dimensional projections measurable (i.e., for any k ≥ 1 and any f1, . . . , fk ∈ F, the random vector (Pnf1, . . . , Pnfk) is measurable). We shall assume that

F is a Pµ∗-Donsker class. (3.1)

The meaning of (3.1) is that the distribution of the process n^{1/2}(Pn − Pµ∗)f indexed by f ∈ F converges weakly* to that of a centered Gaussian process Bf indexed by f ∈ F, with covariance

E BfBg = Pµ∗(fg) − Pµ∗f Pµ∗g, f, g ∈ F.

Moreover, B ∈ l∞(F) has a.s. continuous sample paths when F is equipped with the L2(Pµ∗)-seminorm. The reader may refer to the book of Pollard (1984) for this type of weak* convergence.
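For instance (an illustrative simulation of ours, with Xi uniform on [0, 1] and two indicators from the Kolmogorov class), the covariance above can be checked empirically:

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 1_000, 2_000
f = lambda x: (x <= 0.3)                      # f = 1{. <= 0.3}, so Pf = 0.3
g = lambda x: (x <= 0.7)                      # g = 1{. <= 0.7}, so Pg = 0.7

X = rng.random((reps, n))                     # reps samples of size n from U[0,1]
Gf = np.sqrt(n) * (f(X).mean(axis=1) - 0.3)   # n^{1/2}(P_n - P)f
Gg = np.sqrt(n) * (g(X).mean(axis=1) - 0.7)   # n^{1/2}(P_n - P)g
print(np.cov(Gf, Gg)[0, 1], 0.3 - 0.3 * 0.7)  # ~ P(fg) - Pf Pg = 0.09
```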


We mention that an equivalent of (2.1) for this section and the following ones is that the distribution of αn(Pn − P) converges weakly* to that of a process B (not necessarily centered and Gaussian!) for some nondecreasing sequence αn.

We denote by Πn the p.m. of n^{1/2} inf_{µ∈M} ‖Pµ − Pn‖F. Notice that Πn is a p.m. on R+ which depends on µ∗ (since the Xi’s are i.i.d. Pµ∗), on F and on M. Our next result shows that Πn converges weakly* under (3.1), provided that

M is a convex set which contains µ∗. (3.2)

PROPOSITION 3.1. Under (3.1) and (3.2), the sequence of p.m.’s (Πn)_{n≥1} converges weakly*.

Remark 3.1. It should be noticed that, for a fixed positive integer k, the set Mk of p.m.’s of the form ∑_{1≤i≤k} ti δ_{θi}, θ1, . . . , θk ∈ Θ, 0 ≤ ti ≤ 1, ∑_{1≤i≤k} ti = 1, is not convex. So (3.2) does not hold if we are interested in a model where we know a priori that µ∗ is supported by a finite number of points with an upper bound on the number of these points. Of course, if M is the set of all p.m.’s, or the set of all p.m.’s which are supported by a finite number of points (without any upper bound on the number of points), then (3.2) holds. It is possible that the limit of Πn is degenerate and is δ0 (see the next sections, where this question is extensively studied).

Proof. Using the Skorokhod–Dudley–Wichura theorem (see, e.g., Pollard (1984)), we can construct a version P∗n of Pn and B∗ of B such that

lim_{n→∞} ‖n^{1/2}(P∗n − Pµ∗) − B∗‖F = 0 a.s. (3.3)

Therefore, we have

n^{1/2} inf_{µ∈M} ‖Pµ − P∗n‖F = ln + o(1) a.s. as n → ∞,

where

ln := inf_{µ∈M} ‖n^{1/2}(Pµ − Pµ∗) − B∗‖F .

It suffices to show that ln converges a.s. Let ε > 0 and let µn,ε ∈ M be such that

‖n^{1/2}(Pµn,ε − Pµ∗) − B∗‖F ≤ ln + ε.

Since µn,ε and µ∗ are in the convex set M,

µ̃n,ε := √(n/(n+1)) µn,ε + (1 − √(n/(n+1))) µ∗ ∈ M

and, consequently,

l_{n+1} ≤ ‖(n+1)^{1/2}(Pµ̃n,ε − Pµ∗) − B∗‖F = ‖n^{1/2}(Pµn,ε − Pµ∗) − B∗‖F ≤ ln + ε.


Since ε is arbitrary, (ln) is a nonincreasing sequence of nonnegative r.v.’s and converges a.s. (its limit is measurable as a nonincreasing limit of nonnegative r.v.’s). □

As a consequence of Proposition 3.1, to each p.m. µ∗ for which (3.1) and (3.2) hold, we can associate a p.m. on R+, Π(µ∗), which is the weak* limit of Πn when the r.v.’s X1, . . . , Xn are i.i.d. Pµ∗. We have not found any trace of a similar construction in the literature. The study of this mapping is the main topic of this paper.

Recall that a signed measure ν admits a unique Hahn–Jordan decomposition ν = ν⁺ − ν⁻, where ν⁺ and ν⁻ are nonnegative measures.

When M is the set of all p.m.’s over Θ, the p.m. Π(µ∗) is given as follows.

PROPOSITION 3.2. If (3.1)–(3.2) hold and M = M1(Θ), then Π(µ∗) is the distribution of

inf{ ‖Pν − B‖F : ‖dν⁻/dµ∗‖∞ < ∞, ν(Θ) = 0, ν ∈ M(Θ) }. (3.4)

Proof. We use the notation of the proof of Proposition 3.1. Let

α∗ := inf{ ‖Pν − B∗‖F : ‖dν⁻/dµ∗‖∞ < ∞, ν(Θ) = 0 }

and let νε ∈ M(Θ) be such that

‖dνε⁻/dµ∗‖∞ < ∞, νε(Θ) = 0 and ‖Pνε − B∗‖F ≤ α∗ + ε.

Then, for n large enough, µn := µ∗ + n^{−1/2} νε is a p.m., and

β∗n := n^{1/2} ‖Pµn − P∗n‖F ≤ ‖n^{1/2}(Pµn − Pµ∗) − B∗‖F + o(1) ≤ ‖Pνε − B∗‖F + o(1) ≤ α∗ + ε + o(1) a.s.

Conversely, let µn be such that

n^{1/2} ‖Pµn − P∗n‖F ≤ inf_{µ∈M} n^{1/2} ‖Pµ − P∗n‖F + n^{−1}.

Then

‖n^{1/2}(Pµn − Pµ∗) − B∗‖F ≤ inf_{µ∈M} n^{1/2} ‖Pµ − P∗n‖F + n^{−1} + o(1) a.s.,

and the result follows. □


Remark 3.2. If M is the set of all p.m.’s on Θ with finite support and { Pµ : µ ∈ M } is dense in { Pµ : µ ∈ M1(Θ) }, viewed as a subset of l∞(F), then the result of Proposition 3.2 is clearly still true. This often holds in practice and, therefore, almost all the results of this paper can be extended to cover finite mixtures with no upper bound on the number of components.

Notice that if µ∗ = δθ, then (3.4) is

inf{ ‖λ(Pµ − Pθ) − B‖F : µ ∈ M1(Θ), λ ≥ 0 }.

The mapping µ∗ ↦ Π(µ∗) is not continuous in general (see Section 6, where this mapping is studied). It has, however, a useful property; to state it, let us denote by Π(µ) a r.v. with distribution Π(µ). If X, Y are two r.v.’s on R, we say that X is stochastically less than Y, and write X ≤st Y, iff for any x ∈ R, P{X > x} ≤ P{Y > x}.

Recall also that a function F is called an envelope of F if |f| ≤ F for any f ∈ F.

PROPOSITION 3.3. If F is Pµj-Donsker (j = 1, 2) with a Pµj-integrable envelope and M is a convex set containing µ1 and µ2, then, for any λ ∈ [0, 1], F is P_{λµ1+(1−λ)µ2}-Donsker and

Π(λµ1 + (1 − λ)µ2) ≤st √λ Π(µ1) + √(1 − λ) Π(µ2), (3.5)

where Π(µ1) and Π(µ2) are independent.

Remark 3.3. Inequality (3.5) is much weaker than convexity and does not mean that much, except in the useful case where Π(µ1) = Π(µ2) = 0.

Proof. The proof uses a coupling technique. We assume that λ ∉ {0, 1}, since the result is trivial otherwise. Let Xj := (X_{j,i})_{i≥1}, j = 1, 2, be two independent sequences of independent r.v.’s, those in Xj having the common p.m. Pµj, j = 1, 2. Let P_{j,n} := n^{−1} ∑_{1≤i≤n} δ_{X_{j,i}}, j = 1, 2. Using the Skorokhod–Dudley–Wichura theorem, we can find versions P∗_{j,n} of P_{j,n} and Gaussian processes B∗j such that

lim_{n→∞} ‖n^{1/2}(P∗_{j,n} − Pµj) − B∗j‖F = 0 a.s., j = 1, 2. (3.6)

Since P_{1,n} and P_{2,n} are independent, we can further take B∗1 and B∗2 independent. Let Nn be a binomial B(n, λ) r.v. independent of the (P∗_{j,n})_{n≥1} and the B∗j, j = 1, 2. Realizing Nn as the sum of n independent B(1, λ) r.v.’s, we can assume that lim_{n→∞} Nn/n = λ a.s. The p.m.

P∗n := (Nn/n) P∗_{1,Nn} + ((n − Nn)/n) P∗_{2,n−Nn}

has the same distribution (as a r.v. in l∞(F)) as the empirical p.m. Pn of a sample X1, . . . , Xn i.i.d. P_{λµ1+(1−λ)µ2} = λPµ1 + (1 − λ)Pµ2.


Let µ_{j,n}, j = 1, 2, be such that

‖P_{µ_{j,n}} − P∗_{j,n}‖F ≤ inf_{µ∈M} ‖Pµ − P∗_{j,n}‖F + n^{−1}.

If n = 0, set µ_{j,0} := 0 (the measure identically equal to 0). Set (with the convention 0/0 = 0)

µ̄n := (Nn/n) µ_{1,Nn} + ((n − Nn)/n) µ_{2,n−Nn} ∈ M.

Then,

n^{1/2} inf_{µ∈M} ‖Pµ − P∗n‖F ≤ n^{1/2} ‖P_{µ̄n} − P∗n‖F
= n^{1/2} ‖ (Nn/n)(P_{µ_{1,Nn}} − P∗_{1,Nn}) + ((n − Nn)/n)(P_{µ_{2,n−Nn}} − P∗_{2,n−Nn}) ‖F
≤ n^{1/2} (Nn/n) inf_{µ∈M} ‖Pµ − P∗_{1,Nn}‖F + n^{1/2} ((n − Nn)/n) inf_{µ∈M} ‖Pµ − P∗_{2,n−Nn}‖F + o(1)
= √(Nn/n) Nn^{1/2} inf_{µ∈M} ‖Pµ − P∗_{1,Nn}‖F + √((n − Nn)/n) (n − Nn)^{1/2} inf_{µ∈M} ‖Pµ − P∗_{2,n−Nn}‖F + o(1) a.s. (3.7)

Since (3.6) holds, replacing (3.3) by (3.6) in the proof of Proposition 3.1, we see that

Π(µj) := lim_{n→∞} n^{1/2} inf_{µ∈M} ‖Pµ − P∗_{j,n}‖F (3.8)

exists a.s. Since lim_{n→∞} Nn/n = λ a.s., (3.7) and (3.8) give the result. □

A consequence of Proposition 3.3 is the following. For a class of functions F and a family P of p.m.’s, let

M(F,P) := { µ ∈ M : Π(µ) = δ0 }.

The set M(F,P) contains all the p.m.’s µ such that ‖Pµn − Pn‖F = o_{Pµ}(n^{−1/2}), where Pn is the empirical p.m. of an i.i.d. sample of size n with p.m. Pµ. Nothing prevents M(F,P) from being empty, but we shall exhibit many examples where it is not, and study it carefully in the following sections.

PROPOSITION 3.4. If F is Pµ-Donsker for every µ in the convex set M, then the set M(F,P) is convex.


Proof. If M(F,P) is empty, the result is clearly true. Otherwise, it is a straightforward consequence of Proposition 3.3. □

We give a trivial bound on ∆n which may be used in applications.

PROPOSITION 3.5. If µ∗ ∈ M, then

inf_µ ‖Pµ − Pn‖F ≤ ‖Pn − Pµ∗‖F .

Proof. It is obvious. □

There are several ways to use the bound of Proposition 3.5 in practice. First, in some cases, explicit bounds for P{‖Pn − Pµ∗‖F > x} are known and yield explicit bounds for P{∆n > x}. One such example is when X = R and F is the Kolmogorov class. In this case, P{‖Pn − Pµ∗‖F > x} ≤ 2 exp(−nx²/2) (Dvoretzky et al., 1956; Massart, 1986). Talagrand (1994, 1995) and Ledoux (1996) also provide very sharp bounds in an abstract setting for some polynomial universal Donsker classes.

When there is no explicit bound for P{‖Pn − Pµ∗‖F > x}, the reader should notice that if F is Pµ∗-Donsker, then n^{1/2}‖Pn − Pµ∗‖F can be bootstrapped, using the results of Gine and Zinn (1990) for the classical bootstrap and of Praestgaard and Wellner (1993) for the Lo (1991) and Mason and Newton (1993) generalized bootstraps. Therefore, Proposition 3.5 may be used to obtain (conservative) confidence intervals or tests on Pµ∗. We shall see in Section 5 how any information on Pµ∗ can be carried over to µ∗.
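A minimal sketch of both routes for the Kolmogorov class (our own illustration; the exponential radius inverts the bound quoted above, and the bootstrap radius is the classical resampling alternative):

```python
import numpy as np

rng = np.random.default_rng(4)

def exponential_radius(n, alpha):
    """x solving 2*exp(-n x^2 / 2) = alpha, from the bound quoted above; then
    P{inf_mu ||P_mu - P_n||_F > x} <= alpha by Proposition 3.5."""
    return np.sqrt(2.0 * np.log(2.0 / alpha) / n)

def bootstrap_radius(x, alpha, B=999):
    """(1 - alpha)-quantile of ||P_n* - P_n||_F over B classical resamples."""
    x = np.sort(x)
    n = len(x)
    Fn = np.arange(1, n + 1) / n
    stats = np.empty(B)
    for i in range(B):
        xs = np.sort(rng.choice(x, size=n, replace=True))
        Fn_star = np.searchsorted(xs, x, side="right") / n
        stats[i] = np.abs(Fn_star - Fn).max()   # sup over z, up to left limits
    return np.quantile(stats, 1.0 - alpha)
```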

We conclude this section by mentioning an invariance property of the norm ‖·‖F. Let g be a one-to-one measurable mapping from X into another separable metric space Y endowed with its Borel σ-field. Let P^g_n := n^{−1} ∑_{1≤i≤n} δ_{g(Xi)} and denote by (P^g)θ := Pθ ∘ g^{−1} the image of Pθ under g. If FY is a class of functions on Y, let us denote FY ∘ g := { f ∘ g : f ∈ FY }.

LEMMA 3.1. For any measurable one-to-one mapping g from X to Y, and any class of functions FY, we have, for any p.m. µ,

‖(P^g)µ − P^g_n‖_{FY} = ‖Pµ − Pn‖_{FY∘g}.

In particular, if X = Y and FY ∘ g = FY = F, then for any µ

‖(P^g)µ − P^g_n‖F = ‖Pµ − Pn‖F . (3.9)

Proof. Just notice that, whenever it makes sense, for f ∈ FY,

(P^g)µf − P^g_nf = Pµ(f ∘ g) − Pn(f ∘ g). □


4. Necessary and Sufficient Condition for Having inf_µ ‖Pµ − Pn‖F = o_{Pµ∗}(n^{−1/2})

Our aim now is to obtain conditions under which the estimator Pµn is close enough to the empirical p.m. that they share the same limiting distribution, so that Pµn will be an efficient estimator. We shall mainly concentrate on the case where

M is the set of all p.m.’s on Θ. (4.1)

As mentioned in Section 3, our results also hold when M is the set of all measures with finite support, assuming that { Pµ : µ ∈ M } is dense in { Pµ : µ ∈ M1(Θ) } ⊂ l∞(F).

For the sake of brevity, we shall write inf_µ ‖Pµ − Pn‖F = inf_{µ∈M} ‖Pµ − Pn‖F. We shall investigate conditions under which

inf_µ ‖Pµ − Pn‖F = o_{Pµ∗}(n^{−1/2}) as n → ∞, (4.2)

i.e., with the notation of Section 3, conditions under which Π(µ∗) = 0 a.s., or Π(µ∗) = δ0. We shall comment in Remark 4.2 on condition (4.2) when M is a strict subset of M1(Θ). Surprisingly, we have not found any investigation of (4.2) in the literature, except in the case of mixtures of uniform distributions, where partial results exist (Hartigan and Hartigan, 1985; Lemdani, 1995).

To understand the interest of (4.2), assume that F is Pµ∗-Donsker. Then the p.m. of n^{1/2}(Pn − Pµ∗) converges weakly* to that of a centered Gaussian process B in l∞(F). Therefore, if (4.2) holds, the p.m. of n^{1/2}(Pµn − Pµ∗) converges weakly* to that of the same Gaussian process B. This tells us several things. First, the limiting behaviour of Pµn, which is a natural estimator for Pµ∗, is the same as that of the empirical p.m. In this case, the empirical p.m. is very close to a mixture. If the statistician is not specially interested in estimating the mixed distribution by a mixed distribution, then there is no real need to estimate µ∗, and one would be better off using the empirical p.m. for statistical inference. Second, for some mixtures of Poisson distributions, it has been proved by Tierney and Lambert (1984) that, in order to estimate a smooth functional T(Pµ∗), the asymptotic minimum variance estimator is T(Pn). Their result has been generalized, and it follows from Bickel et al. (1993) that if the tangent space (at Pµ∗) of the model (but notice that defining the tangent space in the sense of Bickel et al. (1993) requires more assumptions on the Pθ’s than we use here) is L2(Pµ∗), then the empirical p.m. is efficient for estimating Pµ∗. Similar results can be derived for the estimation of smooth functionals T(Pµ∗), giving sufficient conditions under which T(Pn) is efficient (see Bickel et al. (1993)). Now, if T is differentiable w.r.t. the norm ‖·‖F (see Chapter 1 in Barbe and Bertail (1995) on how one can try to force this assumption to hold by choosing F in a proper way), and T(Pn) is an efficient estimator of T(Pµ∗), the estimator T(Pµn) achieves the asymptotic efficiency bound if (4.2) holds. In this case, (4.2) is a property which one should look for. Third, a result like (4.2) may be used in a similar way as Proposition 3.5. For instance, since we know how to bootstrap the empirical process n^{1/2}(Pn − Pµ∗) indexed by F (see Praestgaard and Wellner, 1993), we can simulate quite easily the asymptotic distribution of smooth functionals of Pµn if (4.2) holds.

We shall assume that the class

F admits a Pµ∗-integrable envelope F. (4.3)

If F is Pµ∗-Donsker, so is F ∪ {1}, where 1 is the constant function equal to 1 on X. Therefore, up to adding 1 to F, there is no loss of generality in assuming that

if (Qε) is a family of signed measures on X such that lim_{ε→0} ‖Qε − Q‖F = 0 for some p.m. Q, then lim_{ε→0} Qε(X) = Q(X) = 1. (4.4)

In order to state our main result, let us recall a few facts about Gross’ theory of abstract Wiener spaces (Gross, 1967). Consider the separable Banach space E = (Cb(F), ‖·‖F), where Cb(F) is the set of all continuous bounded functions over F, when F is equipped with the L2(Pµ∗)-seminorm. Denote by ρ the Gaussian measure on E induced by B, and further let E∗ be the topological dual of E. Following the exposition in Chapter 4 of Ledoux (1994), we consider the abstract Wiener space factorization

E∗ −j→ L2(ρ) −j∗→ E.

More precisely, if y ∈ E∗, then its image under j is the function (in L2(ρ)) x ∈ E ↦ 〈x, y〉. Next, if g ∈ L2(ρ), then j∗(g) = ∫_E x g(x) dρ(x) (here the integral is in the strong sense, and agrees with the weak integral). The reproducing kernel Hilbert space H of ρ is j∗(L2(ρ)) (recall that the process B depends on µ∗ and P through Pµ∗, so that the whole construction of the Hilbert space H depends on µ∗, P and F). Our next theorem transforms the probabilistic problem of checking whether (4.2) holds into a functional analytic one.

THEOREM 4.1. Assume that F is Pµ∗-Donsker and that (4.1), (4.3) and (4.4) hold. Let H be the reproducing kernel Hilbert space associated with ρ. Then (4.2) holds if and only if H is contained in the closure in (l∞(F), ‖·‖F) of the set

P(µ∗,F) := { Pν : ν ∈ M(Θ), ν⁻ ≪ µ∗, ν⁻PF < ∞ },

that is,

H ⊂ cl_{(l∞(F),‖·‖F)} P(µ∗,F). (4.5)

Remark 4.1. The set P(µ∗,F) will play a very important role in further results. Sometimes it is convenient to consider the set

P′(µ∗,F) := { Pν : ‖dν⁻/dµ∗‖∞ < ∞ }.


Let Pν ∈ P(µ∗,F). For any M > 0, define the measure ν⁻_M by its density

dν⁻_M/dµ∗ := (dν⁻/dµ∗) ∧ M,

and let νM := ν⁺ − ν⁻_M, so that PνM ∈ P′(µ∗,F). If f ∈ F,

|Pνf − PνMf| ≤ ∫ F dP_{ν⁻−ν⁻_M} ≤ ∫_Θ (PθF) 1{dν⁻/dµ∗ > M} dν⁻(θ).

Thus, since ν⁻PF < ∞, letting M → ∞ shows that P(µ∗,F) ⊂ cl_{(l∞(F),‖·‖F)} P′(µ∗,F). Next, if Pµ∗F < ∞, then P′(µ∗,F) ⊂ P(µ∗,F). In this case,

cl_{(l∞(F),‖·‖F)} P(µ∗,F) = cl_{(l∞(F),‖·‖F)} P′(µ∗,F).

So, in a few proofs, when dealing with closures in (l∞(F), ‖·‖F), we shall use elements of P′(µ∗,F) and P(µ∗,F) quite freely, without much explicit reference.

Proof. Assume that Π(µ∗) = δ0 (i.e., (4.2) holds). Proposition 3.2 shows that, a.s.,

B ∈ cl_{(l∞(F),‖·‖F)} P(µ∗,F).

Since cl H = supp ρ, (4.5) holds.

Conversely, using the notation of the proof of Proposition 3.1, we have

inf_µ n^{1/2} ‖Pµ − P∗n‖F = inf_µ ‖n^{1/2}(Pµ − Pµ∗) − B∗‖F + o(1) a.s. as n → ∞.

Since supp ρ = cl(H) is in the closure of P(µ∗,F) (since so is H), there exists a sequence Pνm ∈ P(µ∗,F) such that B∗ = lim_{m→∞} Pνm (in (l∞(F), ‖·‖F)). Define ν⁻_{m,M} by

(dν⁻_{m,M}/dµ∗)(θ) := (dν⁻_m/dµ∗)(θ) ∧ M

and, further, let

c_{n,m,M} := (ν⁺_m(Θ) − ν⁻_{m,M}(Θ))/n^{1/2},
µ_{n,m,M} := (1 − c_{n,m,M}) µ∗ + n^{−1/2}(ν⁺_m − ν⁻_{m,M}).

For n large enough, µ_{n,m,M} is a p.m., and, as n → ∞,

n^{1/2} inf_µ ‖Pµ − P∗n‖F + o(1) ≤ ‖n^{1/2}(P_{µ_{n,m,M}} − Pµ∗) − B∗‖F
≤ ‖−n^{1/2} c_{n,m,M} Pµ∗ + P_{ν⁺_m−ν⁻_{m,M}} − B∗‖F
≤ |ν⁺_m(Θ) − ν⁻_{m,M}(Θ)| ‖Pµ∗‖F + ‖P_{ν⁺_m−ν⁻_{m,M}} − B∗‖F =: A_{m,M}.


Consequently,

lim sup_{n→∞} inf_µ n^{1/2} ‖Pµ − P∗n‖F ≤ lim_{m→∞} lim sup_{M→∞} A_{m,M} = 0. □

Remark 4.2. When M is not convex, the proof of Theorem 4.1 works provided we can build the measures µ_{n,m,M} in M.

A result somewhat related to Theorem 4.1 is in Van der Vaart (1991) and Bickel et al. (1993), and we shall discuss this point at the end of this section. The statistician readers should notice the simplicity of the proof that (4.2) implies (4.5) in Theorem 4.1. The surprising power of the result can be seen in the next examples.

EXAMPLE 1 (continued). Condition (4.5) holds in a trivial manner, since B∗ = lim_{n→∞} n^{1/2}(P∗n − Pµ∗) (with the notation of the proof of Proposition 3.1), and therefore B∗ ∈ cl P(µ∗,F) a.s.

EXAMPLE 2 (continued). We still consider the Kolmogorov class F. Instead of using the full generality of abstract Wiener space theory, we evaluate H by using the support theorem of Stroock and Varadhan (1972) (see also Ikeda and Watanabe (1989)). Let us denote by Fµ∗ the c.d.f. of Pµ∗. The process B admits a stochastic integral representation Bf = ∫ f(x) dB(x), where B(x) := B̄(Fµ∗(x)) and B̄ is a standard Brownian bridge. This representation of B comes essentially from the so-called quantile representation (see, e.g., Shorack and Wellner (1986)). In other words, Bfz = B̄(Fµ∗(z)). Let W be a standard Wiener process, so that B̄ =d W − W(1) Id. Let ρW be the p.m. induced on (C[0,1], ‖·‖∞) by W|[0,1]. The support theorem of Stroock and Varadhan (1972) asserts that

supp ρW = cl{ f : f(0) = 0, f a.c., ∫₀¹ ḟ²(t) dt < ∞ },

where ‘a.c.’ means absolutely continuous w.r.t. the Lebesgue measure, ḟ denotes the Radon–Nikodym derivative of f, and the closure is in (C[0,1], ‖·‖∞). Thus, writing ρB̄ for the p.m. of B̄ in (C[0,1], ‖·‖∞),

supp ρB̄ = cl{ f(x) − x f(1) : f(0) = 0, f a.c., ∫₀¹ ḟ²(t) dt < ∞ } = cl H,

where

H := { f : f(0) = f(1) = 0, f a.c., ∫₀¹ ḟ²(t) dt < ∞ }.

Let f ∈ supp ρB̄. Theorem 4.1 shows that (4.2) implies

f ∘ Fµ∗ ∈ cl{ Fν : Pν ∈ P(µ∗,F) }, (4.6)

where Fν is the c.d.f. of Pν, i.e., Fν(x) = Pν[0, x].


Let α := sup supp µ∗. Assume that (a, b) ∩ supp µ∗ = ∅ for some 0 ≤ a < b ≤ α. If Pν ∈ P(µ∗,F), then supp ν⁻ ∩ (a, b) = ∅. For z ∈ (a, b),

Pνfz = Pν⁺fz − ∫₀^a Pθ[0, z] dν⁻(θ) − ∫_b^∞ Pθ[0, z] dν⁻(θ)
= Fν⁺(z) − ν⁻[0, a] − z ∫_b^∞ θ^{−1} dν⁻(θ).

Consequently, on (a, b), the function z ↦ Pνfz is concave. Therefore, all the functions in cl{ Fν : Pν ∈ P(µ∗,F) } are concave on (a, b). Clearly, there exist functions f ∘ Fµ∗, f ∈ supp ρB̄, that are not concave on (a, b). Thus, (4.6) implies that any interval (a, b) with 0 ≤ a < b ≤ α must intersect supp µ∗. In other words, supp µ∗ = [0, α] (or [0, ∞) if α = ∞) is necessary for (4.2) to hold.

It is now clear that the condition that supp µ∗ be an interval containing 0 is also sufficient for (4.2) to hold. Indeed, if supp µ∗ = [0, α], then supp Pµ∗ = [0, α]. Then notice that any function f ∘ Fµ∗, f ∈ H, has support in [0, α] and can be approximated uniformly by a sequence (fn)_{n≥1} of twice differentiable functions. Each fn can be written as a difference fn = f_{1,n} − f_{2,n} of two concave functions supported by [0, α], and is therefore a limit of functions of the form (F_{νn,m})_{m≥1} for some νn,m ∈ P(µ∗,F) (here, if needed, we use Lemma A.1 to construct νn,m).

A similar investigation for Grenander’s maximum likelihood estimator of a unimodal p.m., say P̃µn, is in Kiefer and Wolfowitz (1976). Among other things, they obtained sufficient conditions to have ‖P̃µn − Pn‖F = o_{Pµ∗}(n^{−1/2}). Their conditions are stronger than ours and have not been proved to be necessary. However, if they hold, we deduce that ‖Pµn − P̃µn‖F = o_{Pµ∗}(n^{−1/2}) as n → ∞.

EXAMPLE 3 (continued). Since F generates the Kolmogorov distance, the space (l∞(F), ‖·‖F) coincides with the space B(R+) of all bounded functions on R+ endowed with the sup-norm. Arguing as in Example 2, the support of the p.m. of the Gaussian process B coincides with the closure of

Sµ∗ := { f ∘ Fµ∗ : f ∈ H },

with Fµ∗(x) := Pµ∗[0, x] = 1 − ∫ e^{−θx} dµ∗(θ). Let us first prove the following useful claim (which is much stronger than really needed, as we will see in the other applications in Section 7!).

Claim. In B(R+), the closure cl Sµ does not depend on µ. Therefore, for any µ ∈ M1(Θ),

cl Sµ = cl Sδ1 = cl{ x ∈ R+ ↦ f(e^{−x}) : f ∈ H }.

Proof. Let µ1, µ2 ∈ M1(Θ), and let f ∘ Fµ1 ∈ Sµ1, f ∈ H. The function g := f ∘ Fµ1 ∘ Fµ2^{−1} satisfies f ∘ Fµ1 = g ∘ Fµ2. Since Fµ1 ∘ Fµ2^{−1} is infinitely differentiable and increasing, g is absolutely continuous w.r.t. the Lebesgue measure, with Radon–Nikodym derivative

ġ(x) = (F′µ1/F′µ2)(Fµ2^{−1}(x)) · ḟ(Fµ1 ∘ Fµ2^{−1}(x)), 0 ≤ x ≤ 1.

For any 0 ≤ ε < 1, set

ḣε(x) := ġ(x) 1{ε ≤ x ≤ 1 − ε}.

Then clearly ∫₀¹ ḣε²(x) dx < ∞. Set

hε(x) := ∫₀^x ḣε(y) dy and gε(x) := hε(x) − (1 − ε)^{−1}(x − ε)⁺ hε(1).

Now, gε ∈ H. Furthermore,

|gε(x) − g(x)| = |hε(x) − g(x) − (1 − ε)^{−1}(x − ε)⁺ hε(1)|
≤ | ∫₀^{ε∧x} ġ(y) dy | + |hε(1)| + | ∫_{(1−ε)∨(1−x)}^{1} ġ(y) dy |
= | ∫₀^{Fµ2∘Fµ1^{−1}(ε∧x)} ḟ(y) dy | + | ∫_ε^{1} ġ(y) dy | + | ∫_{Fµ2∘Fµ1^{−1}((1−ε)∨(1−x))}^{1} ḟ(y) dy |
= | f ∘ Fµ2 ∘ Fµ1^{−1}(ε ∧ x) | + | f ∘ Fµ2 ∘ Fµ1^{−1}(ε) | + | f ∘ Fµ2 ∘ Fµ1^{−1}((1 − ε) ∨ (1 − x)) |.

Since f ∈ H, Hölder’s inequality gives the well-known bound

|f(x)| = | ∫₀^x ḟ(y) dy | ≤ ( ∫₀^x dy )^{1/2} ( ∫₀^x ḟ²(y) dy )^{1/2} ≤ x^{1/2} ( ∫₀¹ ḟ²(y) dy )^{1/2},

and, similarly, |f(1 − x)| ≤ x^{1/2} ( ∫₀¹ ḟ²(y) dy )^{1/2}. Consequently,

|gε(x) − g(x)| ≤ 2 ( 1 − Fµ2 ∘ Fµ1^{−1}(1 − ε) + Fµ2 ∘ Fµ1^{−1}(ε) )^{1/2} ( ∫₀¹ ḟ²(y) dy )^{1/2}.

Therefore, lim_{ε→0} ‖gε − g‖∞ = 0, so that lim_{ε→0} ‖f ∘ Fµ1 − gε ∘ Fµ2‖∞ = 0. In conclusion, Sµ1 ⊂ cl Sµ2. Interchanging µ1 and µ2, we obtain the first part of the claim. Then, f ∈ H iff x ∈ [0, 1] ↦ f(1 − x) is also in H, so that cl Sδ1 is indeed the set in the claim. □

It is easy to see that (4.2) holds iff

inf{ ‖Pn − Pµ‖F : µ ∈ M1(Θ) ∪ {δ∞} } = o_P(n^{−1/2}),

where Pδ∞ := P_{+∞} = δ0. Then, using our claim, we see that (4.5) is equivalent to

{ x ∈ R+ ↦ f(e^{−x}) : f ∈ H } ⊂ cl_{(B(R+),‖·‖∞)} { Fν : ν ∈ M(Θ) ∪ {δ∞}, ν⁻ ≪ µ∗ },

where we denote Fν(x) := Pν(−∞, x] = Pν1{· ≤ x} for a signed measure ν. Making the change of variable t = e^{−x}, and using that the constant function 1 is Fδ∞, we obtain that (4.5) is equivalent to

H ⊂ cl_{(B[0,1),‖·‖∞)} { t ∈ [0, 1) ↦ ∫ t^θ dν(θ) : ν ∈ M(Θ), ν⁻ ≪ µ∗ }. (4.7)

Using the Sasz–Muntz theorem (see Lemma A.4 for the version we are actuallyusing), a necessary and sufficient condition for (4.7) to hold is that suppµ∗

contains a sequence of distinct points (θi)i>1 such that∑i>1 θi/(1 + θ2

i ) =∞.To conclude this section, we suggest some connection between Theorem 4.1

and the asymptotic efficiency theory as exposed in Bickel et al. (1993). In orderto keep the discussion reasonably short, we do not recall much of the definitionsof Bickel et al. (1993), and the reader who is not very familiar with efficiencytheory may skip the remainder of this section. Until the end of this section,we assume that all the Pθ’s are absolutely continuous w.r.t. a fixed dominatingmeasure P , and let

pθ :=dPθdP

and pµ∗ :=∫pθ dµ∗(θ) =

dPµ∗

dP.

Denote TPµ∗ (P) the closed tangent space at Pµ∗ of the model space P := Pµ :µ ∈M1(Θ). It can be easily seen that

TPµ∗ (P) = clL2(Pµ∗ )h ∈ L2(Pµ∗) : supph ⊂ suppPµ∗ ,

h = pλ(µ−µ∗), λ > 0, µ ∈M1(Θ).

Thus, notice that

TPµ∗ (P) = clL2(Pµ∗ )

dPνdPµ∗

:Pν ∈ P(µ∗,F),dPνdPµ∗

∈ L2(Pµ∗)

.

The set in the r.h.s. of (4.5) may be viewed as the tangent space of P at Pµ∗when P is viewed as a subset of l∞(F) (and not L2(Pµ∗) as in Bickel et al.

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.24

Page 25: Statistical Analysis of Mixtures and the Empirical Probability Measure

STATISTICAL ANALYSIS OF MIXTURES AND THE EMPIRICAL PROBABILITY MEASURE 277

(1993)), provided we agree to include a one-sided derivative in the tangent space(see also point 1 p. 263 of Bickel et al. (1993)).

Let πf ∈ l∞(F)∗ be the evaluation map πf (Q) = Qf . Since F is Pµ∗-Donsker, the function I(πf ) := f − Pµ∗f is in L2(Pµ∗). If, moreover,

I(πf ) : f ∈ F ⊂ TPµ∗ (P), (4.8)

then the functions I(πf ) are the efficient influence functions, and the influencecovariance functional operator for Pµ∗ ∈ l∞(F) is I−1(πf1 , πf2) := Pµ∗f1f2 −Pµ∗f1Pµ∗f2. Then Pn is clearly asymptotically efficient for Pµ∗ . Consequently,under (4.8) and the assumptions of Theorem 4.1, Pµn is efficient if (4.5) holds.It is outside of the scope of this paper to investigate deeply the link between(4.8) and (4.5) and our purpose here was mainly to point out that further workis needed in that direction, and that the formalisms developed in Bickel et al.(1993) and in this paper give hope of settling the question of how to obtaingeneral criteria under which one can achieve efficiency in mixture models.

5. From the Estimation of Pµ∗ to that of µ∗

In Section 4 we have studied some conditions under which n1/2 infµ ‖Pµ −Pn‖F = oPµ∗ (1). If this holds and µn is defined as in (2.2) with

limn→∞

n1/2ηn = 0 in probability, (5.1)

then

n1/2(Pµn − Pµ∗)d= B as n→∞,

as processes indexed by F . Our aim now is to turn this convergence into one onn1/2(µ∗ − µn). Assume that the operator

P is invertible from F to H. (5.2)

We first state a few results related to (5.2) and its statistical meaning.

LEMMA 5.1. If the family P of p.m.’s Pθ , θ ∈ Θ, is complete, then (5.2) holds.Proof. Completeness means (see, e.g., Lehmann (1959) for the statistical

aspects and Davis (1963) for the mathematical ones) that if Pθf = Pf(θ) = 0for any θ ∈ Θ and f measurable, then f = 0. This implies that P is injectivefrom F to H. 2

Extending the usual definition, one can say that (5.2) is F-completeness of thefamily P.

It turns out that (5.2) can be related to the identifiability of the mixture,something which we have not used in all the previous sections. Recall that

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.25

Page 26: Statistical Analysis of Mixtures and the Empirical Probability Measure

278 PHILIPPE BARBE

the mixture model defined by P is said to be identifiable if the applicationµ ∈ M1(Θ) 7→ Pµ ∈ M1(X ) is injective. Actually, the proper notion herewould be that the mapping µ 7→ Pµ from l∞(H) to l∞(F) is injective; but thisdoes not say much for practical purposes.

Our next two lemmas show that the operator P in Examples 2 and 3 isinvertible as a special case of location/scale models. The works on identifiabilityquoted in Section 1 can then be used to show that (5.2) holds in many practicalcases.

LEMMA 5.2. Assume that X = R+, Θ = R+ and that Pθ has a density θp(θx)w.r.t. the Lebegue measure dx (i.e. P defines a scale model w.r.t. to a fixed densityp(·)). If F ⊂ L1(dx) and P is identifiable, then P is F-complete and (5.2) holds.

Proof. Let f = f1 − f2, where f1, f2 ∈ F . Assume that f 6= 0, but thatPf = 0, i.e.∫

f(x)θp(θx) dx = 0 for any θ ∈ Θ.

Switching θ and x, for any x ∈ R+,

x

∫f(θ)p(θx) dθ = 0. (5.3)

Integrating w.r.t. dx, we show that∫f(θ) dθ = 0. For s = +,− let µs be the

p.m. with density fs(θ)/∫fs(t) dt w.r.t. dθ. Then (5.3) implies that Pµ+ = Pµ−

while µ+ 6= µ− since f 6= 0. Thus, the mixture model is not identifiable. 2

LEMMA 5.3. Assume that X = Θ = R and that Pθ admits a density p(x − θ)w.r.t. the Lebesgue measure (i.e. P defines a location model w.r.t. a fixed densityp(.)). If F ⊂ L1(dx) and P is identifable, then P is F-complete and (5.2) holds.

Proof. The proof is very similar to that of Lemma 5.2. With obvious notation,if f 6= 0 and Pf = 0, we obtain∫

f+(θ)p(x− θ) dθ =

∫f−(θ)p(θ − x) dθ for any x ∈ R,

and conclude as in the proof of Lemma 5.2. 2

Remark 5.1. From Lemmas 5.2 and 5.3, we see that for practical purpose,(5.2) is much weaker than identifiability of the mixture model. The invertibilityof P is equivalent to the identifiability of a submodel. This is the case, forinstance, if one considers mixtures of uniform p.m.’s over all intervals [a, b],a 6 b. Denoting θ = (a, b) and Pθ = U[a,b], we have Pf(θ) = (b−a)

∫ ba f(x) dx.

The submodel defined by all the θ’s of the form (0, b) is identifiable (mixtureof U[0,b] distributions as in Example 2), and P is invertible with P−1h(x) =x∂h/∂b(a, b)|(0,x)+h(0, x), since the restriction of P(·)(θ) to all the θ = (0, b) is

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.26

Page 27: Statistical Analysis of Mixtures and the Empirical Probability Measure

STATISTICAL ANALYSIS OF MIXTURES AND THE EMPIRICAL PROBABILITY MEASURE 279

invertible. We could also consider the submodel made of mixtures of U[x,x] = δx.In this case, f(x) = Pf(x, x).

Let B be a centered Gaussian process as at the beginning of Section 3. Under(5.2) we can define a Gaussian process on H = PF by

Gh = B(P−1(h)), h ∈ H.

The process G is centered, with covariance

EGh1Gh2 = Pµ∗(P−1h1P−1h2)− Pµ∗(P−1h1)Pµ∗(P−1h2)

= Pµ∗(P−1h1P−1h2)− µ∗h1µ∗h2.

PROPOSITION 5.1. If (2.2), (3.1), (4.2), (5.1) and (5.2) hold, then, the dis-tribution of n1/2(µn − µ∗) ∈ (l∞(H), ‖ · ‖H) converges weakly∗ to that ofG ∈ (l∞(H), ‖ · ‖H).

Proof. The result is obvious, since

oPµ∗ (1) = ‖n1/2(Pµn − Pµ∗)−B‖F = ‖n1/2(µn − µ∗)−G‖H

(use Proposition 2.2 and definition of G). 2

Using Proposition 5.1, one can carry over to the estimator µn the limiting dis-tribution results obtained in Examples 2, 3 in Section 4.

It should also be clear that results on the convergence of µn to µ∗ can beobtained by other arguments in some special cases. For instance, the rate ofconvergence can be obtained using the approach of Chen (1995), showing thatthe mapping µ 7→ Pµ is Holderian fromM1(Θ) to (l∞(F), ‖ · ‖F ) for a suitabledistance on M1(Θ) and a suitable class F . In the same vein, one could alsouse Edelman’s (1988) Fourier transform argument (see his Theorem 1 and hiscomment on p. 1613). However, we shall not focus on these aspects in the presentpaper, since we do not have a fully general method on how to obtain limitingresults on µn for any topology on M1(Θ).

We can rephrase Proposition 3.2 and Theorem 4.1 as follows, where HG

denotes the reproducing kernel Hilbert space associated to the Gaussian pro-cess G.

PROPOSITION 5.2. Under the assumptions of Theorem 4.1 and (5.2), (4.2)holds if and only if

inf‖ν −G‖H : ν ∈M(Θ), ν− µ∗, ν−PF <∞ = 0 a.s.

or equivalently, if and only if

HG ⊂ cl(l∞(H),‖·‖H)ν : ν ∈M(Θ), ν− µ∗, ν−PF <∞.

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.27

Page 28: Statistical Analysis of Mixtures and the Empirical Probability Measure

280 PHILIPPE BARBE

Proof. It is straightforward from Proposition 3.2, Theorem 4.1 and Proposi-tion 2.2. 2

6. The Mapping Π

In Section 4 we gave a necessary and sufficient condition for a p.m. µ∗ to be inthe set M(F ,P) = Π−1(δ0). In this section we investigate further properties ofthe mapping Π. We are seeking for conditions under which two measures are inthe same set of pre-images and study the continuity of Π.

We already know that δ0 can have many pre-images through Π(·) (see, e.g.,Examples 2 and 3 in Section 4). We have not been able to characterize the setΠ−1(λ) for any p.m. λ on R+, but we can give a sufficient condition for twop.m.’s µ∗1, µ∗2 to have the same image through Π(·). This shows partially whichaspect of µ∗ is kept in Π(µ∗1).

Let us consider two p.m.’s µ∗1 and µ∗2 on Θ. The pair (µ∗1, µ∗2) induces a

reproducing kernel Hilbert space H[µ∗1 ,µ∗2 ] as follows. Let S := supp(Pµ∗1 −Pµ∗2 ).

Let B[µ∗1 ,µ∗2 ] be the centered Gaussian process in (l∞(F), ‖ · ‖F ) with covariance

EB[µ∗1 ,µ∗2 ]fB[µ∗1 ,µ

∗2 ]g

= Pµ∗1 (fg1S)− Pµ∗1 (f1S)Pµ∗1 (g1S) +

+Pµ∗2 (fg1S)− Pµ∗2 (f1S)Pµ∗2 (g1S).

The Hilbert space H[µ∗1 ,µ∗2 ] is the reproducing kernel Hilbert space associated to

B[µ∗1 ,µ∗2 ].

Let, further, D := supp(µ∗1 − µ∗2). The real number

c :=∫DPθ(S) dµ∗j(θ)

does not depend on j = 1, 2 since

1− c :=∫DcPθ(S) dµ∗j (θ) +

∫ΘPθ(S

c) dµ∗j(θ).

Moreover, c = 0 if and only if µ∗1 = µ∗2. Hence, if µ∗1 6= µ∗2, we can define thep.m.’s

Qj(·) = c−1∫DPθ(· ∩ S) dµ∗j(θ), j = 1, 2, (6.1)

Q(·) = (1− c)−1(∫

DcPθ(· ∩ S) dµ∗j(θ) +

∫ΘPθ(· ∩ Sc) dµ∗j(θ)

). (6.2)

(Q does not depend on j = 1, 2 from the very definition of S and D). We alsoset Q to be the null measure if c = 1.

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.28

Page 29: Statistical Analysis of Mixtures and the Empirical Probability Measure

STATISTICAL ANALYSIS OF MIXTURES AND THE EMPIRICAL PROBABILITY MEASURE 281

THEOREM 6.1. Let µ∗1, µ∗2 be two p.m.’s on Θ, µ∗1 6= µ∗2. Let Q1, Q2, Q as in(6.1)–(6.2). Assume that

(i) F is Q1, Q2, Q, Pµ∗1 , Pµ∗2 -Donsker and 1 ∈ F ,

(ii) H[µ∗1 ,µ∗2 ] ⊂ cl(l∞(F),‖·‖F )

Pν :

∥∥∥∥dν−

dµ∗1

∥∥∥∥∞<∞

and Pµ∗2 ∈ cl(l∞(F),‖·‖F )

Pν :

∥∥∥∥dν+

dµ∗1

∥∥∥∥∞<∞

.

Then Π(µ∗1) 6st Π(µ∗2). Moreover, if

(iii) H[µ∗1 ,µ∗2 ] ⊂ cl(l∞(F),‖·‖F )

Pν :

∥∥∥∥dν−

dµ∗2

∥∥∥∥∞<∞

and Pµ∗1 ∈ cl(l∞(F),‖·‖F )

Pν :

∥∥∥∥dν+

dµ∗2

∥∥∥∥∞<∞

,

then Π(µ∗2) 6st Π(µ∗1), so that actually Π(µ∗1) = Π(µ∗2).Proof. The proof uses a coupling argument which leads to two r.v.’s Π(µ∗1) 6

Π(µ∗2) under assumptions (i) and (ii). The result under assumption (iii) is thenobvious.

Let Qn be a p.m. distributed in (l∞(F), ‖ · ‖F ) as the empirical p.m. of asample of size n from Q, and similarly, Qjn corresponds to a sample from Qj .Using the Skorokhod–Dudley–Wichura theorem, we can assume that for somecentered Gaussian processes B1, B2, B ∈ l∞(F),

‖n1/2(Qjn −Qj)−Bj‖F = o(1) a.s. as n→∞, (6.3)

‖n1/2(Qn −Q)−B‖F = o(1) a.s. as n→∞. (6.4)

For n > 1, let Nn ∼ B(n, c) be a Bernoulli r.v., independent of all the previousr.v.’s. Again, we may assume that there exists a Gaussian r.v. Υ ∼ N (0, c(1−c))such that∣∣∣∣n1/2

(Nn

n− c)−Υ

∣∣∣∣ = o(1) a.s. as n→∞. (6.5)

Since

Pµ∗j = cQj + (1− c)Q, (6.6)

the p.m.

P jn :=Nn

nQjNn +

n−Nn

nQn−Nn (6.7)

is distributed (in l∞(F)) as the empirical p.m. of a sample of size n fromPµ∗j . This defines our coupling. Combining (6.3)–(6.7) we obtain a Skorokhod–Dudley–Wichura representation

εjn := ‖n1/2(P jn − Pµ∗j )−ΥQj + ΥQ− c1/2Bj − (1− c)1/2B‖F

= o(1) a.s. as n→∞.

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.29

Page 30: Statistical Analysis of Mixtures and the Empirical Probability Measure

282 PHILIPPE BARBE

The proof of Proposition 3.1 shows that there exists a r.v. Π(µ∗j ) such that

limn→∞

infµ‖Pµ − P jn‖F = Π(µ∗j ) a.s. j = 1, 2.

Let µn such that

‖Pµn − P 2n‖F 6 inf

µ‖Pµ − P 2

n‖+ n−1.

Then, for any p.m. µ,

|m1/2‖Pµ − P 1m‖F − n1/2‖Pµn − P 2

n‖F |

6 ‖m1/2(Pµ − P 1m)− n1/2(Pµn − P 2

n)‖F

= ‖m1/2(Pµ − Pµ∗1 ) +m1/2(Pµ∗1 − P1m)−

−n1/2(Pµn − Pµ∗2 )− n1/2(Pµ∗2 − P2n)‖F

6 ‖m1/2(Pµ − Pµ∗1 )− n1/2(Pµn − Pµ∗2 ) +

+Υ(Q2 −Q1) + c1/2(B2 −B1)‖F + ε1m + ε2

n. (6.8)

Our aim is to show that we can pick a p.m. µ to make (6.8) small. Usingassumption (ii), let µ2,ε such that∥∥∥∥dµ+

2,ε

dµ∗1

∥∥∥∥∞<∞ and ‖Pµ2,ε − Pµ∗2 ‖F 6 ε.

Set

ν1,n,ε := n1/2(µn − µ2,ε) + c−1Υ(µ∗1 − µ2,ε).

Using assumption (ii) (with the fact that H[µ∗1 ,µ∗2 ] is the reproducing kernel Hilbert

space associated to B1 −B2) and arguing as in Section 4, let ν2,ε such that∥∥∥∥dν−2,εdµ∗1

∥∥∥∥∞<∞ and ‖c1/2(B1 −B2)− Pν2,ε‖F 6 ε.

Let, further,

cn,m,ε := 1 +m−1/2(ν1,n,ε(Θ) + ν2,ε(Θ)).

Then, using that Bj1 = 0 and Pµ2 1 = 1,

m1/2|cn,m,ε − 1| = |(n1/2 + c−1Υ)(1− µ2,ε(Θ)) + ν2,ε(Θ)|

6 (n1/2 + c−1|Υ|)|1− Pµ2,ε(X )|+ |Pν2,ε(X )|

6 2ε(1 + c−1|Υ|+ n1/2).

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.30

Page 31: Statistical Analysis of Mixtures and the Empirical Probability Measure

STATISTICAL ANALYSIS OF MIXTURES AND THE EMPIRICAL PROBABILITY MEASURE 283

Consequently, for m large enough,

|c−1n,m,ε − 1| 6 4εm−1/2(1 + c−1|Υ|+ n1/2).

For m large enough, set

µn,m,ε := µ∗1 +m−1/2(ν1,n,ε + ν2,ε)

and

µn,m,ε = µn,m,ε/cn,m,ε.

Thus, µn,m,ε is a p.m. if m is large enough. For µ = µn,m,ε, the upper bound of(6.8) is less than

‖ΥPν1,n,ε − n1/2(Pµn − Pµ∗2 ) + c−1Υ(Pµ∗2 − Pµ∗1 )‖F +

+ ‖Pν2,ε − c1/2(B1 −B2)‖F + ε1m + ε2

n +m1/2|c−1n,m,ε − 1|‖Pµn,m,ε‖F

6 ‖Pµ2,ε − Pµ∗2 ‖F (n1/2 + c−1|Υ|) + ε+ ε1m + ε2

n +

+ 4ε(1 + c−1|Υ|+ n1/2)×

×(‖Pµ∗1 ‖F +m−1/2‖Pν1,n,ε‖F +m−1/2‖Pν2,ε‖F ). (6.9)

Combining (6.8)–(6.9), taking first a limit as m→∞, and then a limit as ε→ 0,we obtain

limm→∞

infµm1/2‖Pµ − P 1

m‖F 6 limε→0

lim infm→∞

m1/2‖Pµn,m,ε − P 1m‖F

6 n1/2‖Pµn − P 2n‖F + ε2

n.

Let n tend to infinity in the upper bound; we obtain Π(µ∗1) 6 Π(µ∗2) a.s. 2

In order to apply Theorem 6.1, notice that when F is the Kolmogorov class onR (i.e. F = 1.6z : z ∈ R), the space H[µ∗1 ,µ

∗2 ] can be identified as follows.

With the notation of the proof of Theorem 6.1, the reproducing kernel Hilbertspace of Bj is f F j : f ∈ H where F j is the c.d.f. pertaining to Qj . SinceB1 and B2 are independent,

H[µ∗1 ,µ∗2 ] = f F 1 − g F 2 : f, g ∈ H.

EXAMPLE 2 (continued). Assume that S = [a, b], i.e. Pµ∗1 = Pµ∗2 except on aninterval [a, b]. Then S = D. We first identify the closure of H[µ∗1 ,µ

∗2 ]. If a 6 x 6 b,

F j(x) = c−1∫

[a,b]U[0,θ]([0, x] ∩ [a, b]) dµ∗j (θ)

= c−1∫

[a,b]θ−1(aU[0,a] + (θ − a)U[a,θ])([0, x] ∩ [a, b]) dµ∗j (θ)

= c−1∫

[a,b]θ−1(θ − a)U[a,θ][a, x] dµ∗j (θ).

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.31

Page 32: Statistical Analysis of Mixtures and the Empirical Probability Measure

284 PHILIPPE BARBE

Thus, Qj is a mixture of U[a,θ], a 6 θ 6 b. Let F j be the c.d.f.

F j(x) := F j(x+ a) = c−1∫

[0,b−a](θ + a)−1θU[0,θ][0, x] dµ∗j (θ + a).

Denoting µ∗j(.) = µ∗j((. + a) ∩ [a, b]), the c.d.f. of Pµj is F j . Hence, f ∈H[µ∗1 ,µ

∗2 ] if and only if f(. − a) ∈ H[µ∗1,µ

∗2]. Furthermore, since S = [a, b],

supp(Pµ∗1 − Pµ∗2 ) = [0, b− a]. Let

Hb−a := x ∈ R 7→ h(x/(b− a)) : h ∈ H.

We claim that H[µ∗1 ,µ∗2 ] ⊂ cl(C[0,b−a],‖.‖∞)Hb−a. Indeed, for f ∈ H, denote

fε(x) =

0 if x 6 ε or x > 1,

f(x)− 1−x1−ε f(ε) if x ∈ [ε, 1].

Then

fε(x) =

0 if x 6 ε or x > 1,

f(x) + f(ε)1−ε if x ∈ [ε, 1],

which shows that fε ∈ H. Moreover,

‖fε − f‖∞ 6f(ε)

1− ε 6ε1/2

1− ε

(∫ 1

0f 2(t) dt

)1/2

.

Let h ∈ H[µ∗1 ,µ∗2 ]. Then h = f F 1 − g F 2 with f, g ∈ H. The function h(ε) :=

fε F 1 − gε F 2 belongs to Hb−a, and limε→∞ ‖hε − h‖∞ = 0. Consequently,H[µ∗1 ,µ

∗2 ] ⊂ clHb−a as we claimed.

Assume now that [a, b] ⊂ suppµ∗1, which means that the density fµ∗1 = F ′µ∗1does not have any flat part on [a, b]. Then [0, b− a] ⊂ supp µ∗1. Thus,

C := cl(l∞(F),‖·‖∞)

Pν :

∥∥∥∥∥ dνdµ∗1

∥∥∥∥∥F<∞

coincides with the closure in (C[0, b−a], ‖.‖∞) of the space spanned by concavefunctions on [0, b−a], which is C[0, a−b] itself. Hence, clH[µ∗1 ,µ

∗2 ] ⊂ clHb−a ⊂ C

and we have proved the following rather unexpected result:If Fµ∗1 = Fµ∗2 except on an interval [a, b] ⊂ suppµ∗1 (i.e. the derivative F ′µ∗1 is

decreasing on [a, b]), then Π(µ∗1) 6st Π(µ∗2). In particular, if Fµ∗1 = Fµ∗2 excepton an interval [a, b] ⊂ suppµ∗1 ∩ suppµ∗2, then Π(µ∗1) = Π(µ∗2) (since F 6st Gand G 6st F implies F = G).

This result means that if starting with a unimodal p.m. Pµ∗1 we perturb it onan interval [a, b] where its density is strictly decreasing, obtaining this way aunimodal p.m. Pµ∗2 with density still decreasing on [a, b], then Π(µ∗1) = Π(µ∗2).

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.32

Page 33: Statistical Analysis of Mixtures and the Empirical Probability Measure

STATISTICAL ANALYSIS OF MIXTURES AND THE EMPIRICAL PROBABILITY MEASURE 285

Consequently Π(µ) depends only on the Pµ(Ci)’s, i ∈ I , where the Ci’s(i ∈ I) are the connected components of the set of all decreasing points of thedensity of Pµ, and on the value of the density of Pµ on its flat parts. In otherwords, changing the density of Pµ where it is decreasing into a density which isdecreasing at the same points and with the same flat parts does not change thevalue of Π(µ). So very little of Pµ remains in Π(µ) and Π(µ) is a very robustfunction of µ provided we do not change suppµ.

Let us now study the continuity of Π(·). In fact, we know already that ingeneral Π(·) is not continuous for any reasonable topology. Indeed, Examples 2and 3 in Section 4 show that M(F ,P) may contain all the p.m.’s on R+ whichsupport an interval containing 0. Thus, for any reasonable topology on M(R+),the set M(F ,P) is dense in M(Θ) = M(R+) in these two cases. In otherwords, efficiency of µn is highly nonrobust; any mixture is a limit of mixtureswhere efficiency can be achieved. Even stronger, since Pµn is the closest mixtureto Pn, small perturbations of µ∗ can make any estimator less efficient than Pn ifPn is efficient.

Although Π(·) is rather poorly behaved, we are going to prove that it isupper-semicontinuous along ‘nice’ sequences if F is a ‘nice’ class. Hence, smallperturbations of µ∗ in ‘good’ directions do not decrease the efficiency of theestimator µn!

For any class F , we denote (F−F)2 the class (f−g)2 : f, g ∈ F. Considera sequence (µ∗k)k>1 in M1(Θ), such that

Pµ∗k converges to some Pµ∗ in l∞(F), l∞((F − F)2) and weakly*. (6.10)

In order to compare limk→∞Π(Pµ∗k ) (if it exists) with Π(P ∗µ), we assumefurther that

for any k > 1, Pµ∗ ∈ cl(l∞(F),‖·‖F )Pν : ν− µ∗k. (6.11)

In order that Π(µ∗k) is well defined for each k and use our general approach, itis clear that F must be Pµ∗

k-Donsker for any k > 1, and Pµ∗-Donsker. We shall

require slightly more, namely that

F is a (Pµ∗k )k>1 ∪ Pµ∗-uniform Donsker class (6.12)

as defined in Sheehy and Wellner (1992). It is convenient to consider a class ofuniformly bounded functions, i.e. such that

supf∈F‖f‖∞ <∞. (6.13)

Such a class admits an envelope F which may be taken as the constant functionequal to the l.h.s. of (6.13).

For any function f ∈ F , we denote its modulus of continuity at x ∈ X ,

ω(f, x, δ) := sup|f(x)− f(y)| : d(x, y) 6 δ.

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.33

Page 34: Statistical Analysis of Mixtures and the Empirical Probability Measure

286 PHILIPPE BARBE

We require the class of functions F to be equicontinuous in Pµ∗-mean, in thesense that

limδ→0

supf∈F

Pµ∗ω(f, ., δ) = 0. (6.14)

THEOREM 6.2. Assume that (µ∗k)k>1 is a sequence in M1(Θ) and that F is aclass of functions such that (6.10)–(6.14) hold. Then,

lim supk→∞

Π(µ∗k) 6st Π(µ∗), (6.15)

in the sense that for any x ∈ R,

lim supk→∞

Π(µ∗k)(x,∞) 6 Π(µ∗)(x,∞).

Remark 6.1. Another way to read (6.15) is that there exists a sequence ofr.v.’s Π(µ∗k) and Π(µ∗) such that lim supk→∞ Π(µ∗k) 6 Π(µ∗) a.s.

Proof. The proof is quite long and requires several couplings and almost surerepresentations. We start with an a.s. representation for the empirical p.m. Pn ofa sample of size n from Pµ∗ . From Pn we build an empirical p.m. Pk,n of asample of size n from Pµ∗k . This will lead to a weak* limit theorem on the pair(Pn, Pk,n) for fixed k, as n→∞. We shall turn it into an a.s. representation of(Pn, Pk,n) as n → ∞ involving two Gaussian processes (B,Bk) for which wecan prove that (6.10) implies limk→∞ ‖B−Bk‖F = 0 in probability. With such arepresentation in hands, the proof of Proposition 3.1 shows that we can constructsome r.v.’s Π(µ∗k) and Π(µ∗). Using ideas quite similar as those used to proveTheorem 4.1, we shall be able to compare the behaviour of Π(µ∗k) and Π(µ∗) as ktends to infinity along a subsequence which diverges fast enough. This restrictionon the growth of the subsequence leads us to a proof by contradiction.

Hence, to start, let us assume that for some x > 0,

lim supk→∞

Π(µ∗k)(x,∞) > Π(µ∗)(x,∞).

Up to extracting a subsequence, we can actually assume that

limk→∞

Π(µ∗k)(x,∞) > Π(µ∗)(x,∞) (6.16)

and in all the remainder of the proof we shall only consider this subsequence.Using the Skorokhod–Dudley–Wichura theorem, we can assume that we draw

a sample (X(n)i )16i6n of size n of i.i.d. r.v.’s with a common p.m. Pµ∗ such that

their empirical p.m.

Pn := n−1∑

16i6nδX

(n)i

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.34

Page 35: Statistical Analysis of Mixtures and the Empirical Probability Measure

STATISTICAL ANALYSIS OF MIXTURES AND THE EMPIRICAL PROBABILITY MEASURE 287

verifies

‖n1/2(Pn − Pµ∗)−B‖F = o(1) a.s. as n→∞

for some centered Gaussian process B ∈ l∞(F).Let ε > 0. Following Chapter IV.3 of Pollard (1984), let C0, C1, . . . , Cr be a

partition (depending on ε) of X such that

Pµ∗(∂Cl) = 0 for l = 0, 1, . . . , r,

Pµ∗(C0) 6 ε,diam(Cl) 6 ε for l = 0, 1, . . . , r.

Since Pµ∗k converges weakly* to Pµ∗ as k → ∞ (recall that we actually onlyconsider the subsequence such that (6.16) holds), there exists k(ε) such that forany k > k(ε),

Pµ∗k(Cl) > (1− ε)Pµ∗(Cl), l = 0, 1, . . . , r.

In order to construct from Pn an empirical p.m. pertaining to an i.i.d. samplefrom Pµ∗k , let (Ui)i>1 be a sequence of i.i.d. r.v.’s uniformly distributed over

[0, 1], independent of all the previous r.v.’s. We define a family of r.v.’s X(n,ε)i ,

1 6 i 6 n, 0 < ε < 14 , as follows.

If Ui 6 1− ε, then

X(n,ε)i ∼

∑06l6r

Pµ∗k(ε)

(·|Cl)1X(n)i ∈ Cl.

If Ui > 1− ε, then

X(n,ε)i ∼ ε−1

∑06l6r

Pµ∗k(ε)

(.|Cl)[Pµ∗k(ε)

(.|Cl)− (1− ε)Pµ∗(Cl)].

Then, (X(n,ε)i )16i6ε is a sample from Pµ∗

k(ε). Define their empirical distribution

function

P (ε)n := n−1

∑16i6n

δX

(n,ε)i

.

Since (6.12) holds, the distribution of n1/2(P(ε)n − Pµ∗

k(ε)) converges weakly* to

that of a centered Gaussian process B(ε) as n→∞, when F is equipped with theL2(Pµ∗

k(ε))-seminorm, and viewing P (ε)

n as a r.v. in (l∞(F), ‖·‖F ). Consequently,

the sequence of distributions of (n1/2(Pn−Pµ∗), n1/2(P(ε)n −Pµ∗

k(ε))) ∈ l∞(F)×

l∞(F) is tight, since it its marginals are tight (notice that in the space l∞(F)×l∞(F), F is equipped with the L2(Pµ∗)-seminorm for the first set in the product,

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.35

Page 36: Statistical Analysis of Mixtures and the Empirical Probability Measure

288 PHILIPPE BARBE

and with the L2(Pµ∗k(ε)

)-seminorm for the second one). Calculating the covariancefunction, one can check that these distributions converge weakly* to that of(B,B(ε)) say.

Let us now show that the family of p.m.’s (B,B(ε)) in l∞(F) × l∞(F) istight along the subsequence ε = 1/q, q ∈ N, when F is equipped with theL2(Pµ∗)-seminorm (for both spaces l∞(F) now!). For this, if f, g ∈ F , then

|‖f − g‖2L2(Pµ∗

k) − ‖f − g‖

2L2(Pµ∗)|

=

∣∣∣∣∫ (f − g)2d(Pµ∗k − Pµ∗)∣∣∣∣

6 ‖Pµ∗k − Pµ∗‖(F−F)2 =: δk.

Consequently, if f, g ∈ F , then ‖f − g‖L2(Pµ∗) 6 δ implies ‖f − g‖L2(Pµ∗k

) 6

(δ2 + δk)1/2.

When N is a seminorm, denote Fδ,N the set f − g : f, g ∈ F ,N(f − g) 6δ. Assumption (6.12) implies that F is (Pµ∗k )k>1-uniformly pre-Gaussian (seeDefinition 2.0 in Sheehy and Wellner (1992)). Moreover, (6.10) implies thatlimk→∞ δk = 0. Consequently, for any η > 0,

limδ→0

lim supq→∞

P‖B(1/q)‖F(δ,‖·‖L2(Pµ∗ )) > η

6 limδ→0

lim supq→∞

P‖B(1/q)‖F((δ2+δk)1/2,‖·‖L2(Pµ∗k

))> η

= 0.

This proves the tightness of the p.m.’s of B(1/q), and thus, that of the p.m.’sof (B(1/q), B)q>1 ∈ l∞(F)× l∞(F). To identify the limit, if f ∈ F , notice that(with the convention 0/0 = 0)

var((n1/2(Pn − Pµ∗)− n1/2(P (ε)n − Pµ∗k(ε)

))f)

= var(f(X(n)i )− f(X

(n,ε)i ))

6 2‖f‖∞E|f(X(n)i )− f(X

(n,ε)i )|

6 4‖f‖2∞PUi > 1− ε+ 2‖f‖∞PUi 6 1− ε ×

×∑

06l6r

∫Cl×Cl

|f(x)− f(y)|Pµ∗(dx)

Pµ∗(Cl)×Pµ∗

k(dy)

Pµ∗k(Cl)Pµ∗(Cl)

6 4‖f‖2∞ε+ 2‖f‖∞

∑06l6r

∫Cl

ω(f, x, 2ε)Pµ∗(dx)

6 4‖F‖2∞ε+ 2‖F‖∞ sup

f∈FPµ∗ω(f, ., 2ε).

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.36

Page 37: Statistical Analysis of Mixtures and the Empirical Probability Measure

STATISTICAL ANALYSIS OF MIXTURES AND THE EMPIRICAL PROBABILITY MEASURE 289

The r.h.s. in the preceding chain of inequalities tends to 0 as ε→ 0. Consequently,

limq→∞

‖B −B(1/q)‖F = 0 in probability. (6.17)

Now, let (Pn, P(1/q)n ) be a version of (Pn, P

(1/q)n ) such that a Skorokhod–Dudley–

Wichura representation holds, i.e.

‖n1/2(Pn − Pµ∗)− B‖F = o(1) a.s. as n→∞ and

εqn := ‖n1/2(P(1/q)n − Pµ∗

k(1/q)))− B(1/q)‖F = o(1) a.s. as n→∞.

Then, (B(1/q), B)d= (B(1/q), B). This defines our coupling, and we control

‖B − B(1/q)‖F in probability thanks to (6.17).Next, let µn such that

‖Pµn − Pn‖F 6 infµ‖Pµ − Pn‖F + n−1.

Then, for any p.m. µ,

m1/2‖Pµ − P (1/q)m ‖F

= ‖m1/2(Pµ − Pµ∗k(1/q)

)− B(1/q)‖F + εqm

= ‖m1/2(Pµ − Pµ∗k(1/q)

)− n1/2(Pµn − Pµ∗)‖F +

+ ‖n1/2(Pµn − Pµ∗)− B‖F + ‖B − B(1/q)‖F + εqm. (6.18)

Using assumption (6.11)(and Remark 4.1), let µ∗k,η µ∗k such that ‖dµ∗k,η/dµ∗k‖∞<∞, and

limη→0‖Pµ∗

k,η− Pµ∗‖F = 0.

Set

µn,q,m,η := µ∗k(1/q) +

√n

m(µn − µ∗k(1/q),η).

For m large enough, µn,q,m,η is a p.m. Thus, (6.18) yields

infµm1/2‖Pµ − P (1/q)

m ‖F

6 n1/2‖Pµ∗k(1/q),η

− Pµ∗‖F + ‖n1/2(Pµn − Pµ∗)− B‖F +

+ ‖B − B(1/q)‖F + εqm.

Consequently,

Π(µ∗k(1/q))(x,∞)

6 Pn1/2‖Pµ∗k(1/q),η

− Pµ∗‖F + ‖n1/2(Pµn − Pµ∗)− B‖F +

+ ‖B − B(1/q)‖F > x. (6.19)

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.37

Page 38: Statistical Analysis of Mixtures and the Empirical Probability Measure

290 PHILIPPE BARBE

Take first the limite as η → 0 and then the limit as n→∞ in the r.h.s. of (6.19)to obtain

Π(µ∗k(1/q))(x,∞) 6 PΠ(µ∗) + ‖B − B(1/q)‖F > x. (6.20)

Let q tends to infinity on both sides of (6.20), use (6.17) and the right continuityof c.d.f.’s to conclude that

lim supq→∞

Π(µ∗k(1/q))(x,∞) 6 PΠ(µ∗) > x

which contradicts (6.16) and proves Theorem 6.2. 2

Assumptions of Theorem 6.2 may seem heavy to check. Let us show that theyoften hold when X = R and F is the Kolmogorov class. In this case, we assumethat the mapping

µ ∈M1(Θ) 7→ Pµ ∈M1(X ) is continuous

when M1(Θ) and M1(X ) are equipped with the weak* topology (6.21)

and that

Pµ∗ is a continuous p.m. (6.22)

It is more convenient to work with the c.d.f.’s Fµ∗k

and Fµ∗ recalling that, in thecase under investigation,

‖Pµ − Pν‖F = ‖Fµ − Fν‖∞.

Assume that (µ∗k)k>1 converges weakly* to µ∗. Then (6.10) holds due to thefollowing. First, (6.21) implies that Pµ∗k converges weakly* to Pµ∗ . But then,(6.22) implies that Fµ∗k converges to Fµ∗ uniformly, which means that Pµ∗k con-verges to Pµ∗ in (l∞(F), ‖.‖F ). Finally, any function in (F −F)2 is of the formf(x) = 1z1 < x 6 z2, so that

(Pµ∗k − Pµ∗)f = (Fµ∗k − Fµ∗)(z2)− (Fµ∗k − Fµ∗)(z1)

and convergence of Pµ∗k to Pµ∗ in l∞((F − F)2) is implied by that in l∞(F).Donsker’s theorem implies that (6.12) holds, while (6.14) is implied by (6.22).Thus, all what remains to check is (6.11). Under (6.21), Lemma A.1 and

(6.21) show that (6.11) holds, for instance, if suppµ∗ ⊂ suppµ∗k.In summary, we have the following result.

LEMMA 6.1. If X = R, F = 1· 6 z : z ∈ R, and (6.21)–(6.22) holds.Then Theorem 6.1 holds for any sequence µ∗k converging weakly* to µ∗ and suchthat suppµ∗ ⊂ suppµ∗k.

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.38

Page 39: Statistical Analysis of Mixtures and the Empirical Probability Measure

STATISTICAL ANALYSIS OF MIXTURES AND THE EMPIRICAL PROBABILITY MEASURE 291

EXAMPLE 2 (continued). Let us show that Lemma 6.1 holds. Indeed, (6.22) istrivial, and (6.21) holds since

Fµ(x) =

∫ ∞0

Fθ(x) dµ(x) =

∫ ∞0

(1 ∧ x

θ

)dµ(θ)

and the function 1 ∧ x/θ is continuous and bounded. Thus, if µ∗k convergesweakly to µ∗, the functions Fµ∗k converge pointwise to Fµ∗ , which implies weak*convergence of Pµ∗k to Pµ∗ .

EXAMPLE 3 (continued). Again, Lemma 6.1 holds. (6.22) is obvious. Arguingas in Example 2, (6.21) follows from the fact that

Fµ(x) =

∫ ∞0

θ e−θx dµ(θ),

and the function θ 7→ θ e−θx is continuous and bounded for any x > 0. Hence,µ→ Fµ is continuous on the set µ ∈M1(Θ) :µ0 = 0, which leads Theorem6.2 on any sequence (µ∗k) with µ∗k0 = 0.

7. Examples

This section contains only examples and show how to apply our main results tovarious cases. We investigate the conditions under which

n1/2 infµ‖Pµ − Pn‖F = oP ∗µ (1). (7.1)

As pointed out in Section 2, it may be interesting to start from a class offunctions H and map it back to F through P−1 when P is invertible. Hence, weshall provide inversion formulas for the operator P when we have been able tofind one. Often, one can easily inverse the operator P using Fourier transform.However, we give inversion using only real functions since this will be neededin the next two sections where we develop another approach to check (7.1).

Before starting with the examples, let us give a result which is useful tocheck the assumptions of Theorem 4.1 in some special cases. The followingproposition gives (7.1) in a rather systematic way under weaker conditions thanthose discussed in Section 4 of Thierney and Lambert (1984), since we do notrestrict the Pθ’s to be an exponential family (see, for instance, Example 11).

PROPOSITION 7.1. Assume that the hypotheses of Theorem 4.1 hold, and alsothat

(i) P is a set of p.m.’s with support an interval X ⊆ R indexed by Θ ⊂ R,and with intX 6= ∅,

(ii) the c.d.f.’s Fθ(x) := Pθ(−∞, x] are continuous on Θ×X ,

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.39

Page 40: Statistical Analysis of Mixtures and the Empirical Probability Measure

292 PHILIPPE BARBE

(iii) F = 1· 6 z : z ∈ R,(iv) the mapping m 7→

∫Fθ(x) dm(x) maps injectively finite Borel measures m

on intX into the set of all real analytic functions on (θ−, θ+) ⊂ Θ.

If suppµ∗ contains a clustering point in (θ−, θ+), then (7.1) holds.Proof. Let (θi)i>1 be a sequence in suppµ∗∩ (θ−, θ+) with a clustering point

in (θ−, θ+). Let m be a finite Borel measure on intX . If θ 7→∫Fθ(x) dm(x) is

zero on the sequence θi, it is zero everywhere on (θ−, θ+) thanks to analycity.Hence, using assumption (iv), m = 0 if suppm ⊂ X . By a theorem of Banach(see, e.g., Theorem 11.1.7 in Davis (1963)), the set of functions x 7→ Fθi(x) : i >1∪x 7→ 1 spans (CK(intX ), ‖ ·‖∞) (here we use the fact that the dual spaceCK(intX )∗ is identified with the set of all finite Borel measures on intX ). Hence,x 7→ Fθi(x) : i > 1 spans the set C0(X ) of all functions f ∈ C(intX ) suchthat limx→∂X f(x) = 0.

Using the representation of B with a Brownian bridge B as in Section 4, thesupport of B may be identified with the closure of the set

Sµ∗ := f F ∗µ : f ∈ H.

Since Sµ∗ ⊂ C0(X ), Sµ∗ is in the space spanned by the Fθi’s. Using assumption(ii) and, if needed, a smoothing procedure as in Equation (A.3) in the proof ofLemma A.4 in the Appendix, one sees that the assumption of Theorem 4.1 holds,so that (4.2) (i.e. (7.1)) holds. 2

Remark 7.1. If P defines a location or a scale model, notice that the injec-tivity of the mapping in assumption (iv) of Proposition 7.1 is equivalent to theidentifiability of the mixture model. Also condition (iv) could be called a weakanalycity condition since it asserts that the function θ 7→ Fθ is analytic whenwe equip the continuous bounded functions on R with the weak topology (seeHerve (1989) for analytic functions in infinite-dimensional spaces).

We now recall the results we obtained in our basic Examples 2 and 3. In thissection, the equations (resp. propositions) are labeled (7.x.y) to refer to the ythequation (resp. proposition) in Example x.

EXAMPLE 2 (continued). Before collating the results we have obtained, let usrecall that mixtures of uniforms coincide with unimodal distribution by Kintchine’s(1938) theorem. Moreover, µ∗ is uniquely determined by Pµ∗ (see, e.g., Dhar-madhikari and Joag-Dev (1988)). The estimation and tests of unimodal p.m.’shave been on-going over the years. An approach based on a nonparametric ker-nel estimate has been proposed by Silverman (1981) and studied by Mammenet al. (1992) under drastically restrictive technical conditions. One of the mostpopular estimators of a unimodal p.m. (since in many ways optimal, and veryeasy to compute) is Grenander’s (1956) least concave majorant. Its asymptoticproperties have been studied by Prakasa Rao (1969), Groeneboom (1985), and

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.40

Page 41: Statistical Analysis of Mixtures and the Empirical Probability Measure

STATISTICAL ANALYSIS OF MIXTURES AND THE EMPIRICAL PROBABILITY MEASURE 293

Kim and Pollard (1990). It has been proved by Kiefer and Wolfowitz (1976) thatGrenander’s estimator is asymptotically efficient. A nonasymptotic approach isin Birge (1989). A minimum distance procedure, which is at the origin of ourExample 2, is in Hartigan and Hartigan (1985). In particular, they give a suffi-cient condition for the empirical c.d.f. to be at a Kolmogorov distance o(n−1/2)of the set of all unimodal distributions. This condition has been improved byLemdani (1995). We shall obtain a much better sufficient condition which willbe proved to be necessary when the mode is known. A result in the same spiritunder stronger conditions, is in Kiefer and Wolfowitz (1976), who studied theKolmogorov distance between the semiparametric maximum likelihood estimator(i.e. the Grenander estimator) and the empirical c.d.f.

Combining the results of Section 4, we have

PROPOSITION 7.2.1. For Pθ = U[0,θ], Θ = R+ and F = 1· 6 z : z > 0,(7.1) holds if and only if suppµ∗ is an interval with left point 0.

In order to handle the case F = F1 (see Example 2 in Section 2), let us firstintroduce a definition related to the Sasz–Muntz theorem.

DEFINITION 7.1. A measure µ of R+ is called a Muntz measure if suppµcontains a sequence of distinct points, (θi)i>1, with

∑θi/(1 + θ2

i ) = +∞.

PROPOSITION 7.2.2. For

Pθ = U[0,θ], Θ = R+ and F = x 7→ (1− zx) e−zx : z > 0,

(7.1) holds if µ∗ is a Muntz measure. Moreover, for (7.1) to hold, it is necessarythat ]suppµ∗ = ∞. Thus, if suppµ∗ is a compact subset of (0,∞) and (7.1)holds, then µ∗ is a Muntz measure.

Proof. One easily checks that the limiting process B indexed by F admitsthe representation (with B a standard Brownian bridge and Fµ∗ the c.d.f. of Pµ∗)

B((1− z.) e−z.) =

∫ ∞0

(1− zx) e−zx dB Fµ∗(x)

= z

∫ ∞0

(2− zx) e−zxB Fµ∗(x) dx.

Consequently, the support of B can be identified with the closure Sµ∗ in (B(R+),‖ · ‖∞) of the set

Sµ∗ :=z > 0 7→ z

∫ ∞0

(2− zx) e−zxf Fµ∗(x) dx : f ∈ H.

Since

Sµ∗ ⊂ C0,0(R+) :=f ∈ C(R+) : lim

z→0f(z) = lim

z→∞f(z) = 0

,

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.41

Page 42: Statistical Analysis of Mixtures and the Empirical Probability Measure

294 PHILIPPE BARBE

we also have clSµ∗ ⊂ C0,0(R+).Let fz(x) := (1 − zx) e−zx. To show that Sµ∗ is in the space spanned by

the functions z 7→ Pνfz, ν− µ∗, notice that Pνfz = νPfz = νhz withhz(θ) = z e−θz . If µ∗ is a Muntz measure, there exists a sequence of points(θi)i>1 in suppµ∗ such that the functions z 7→ hz(θi)/z = e−θiz and z 7→ 1span the set of continuous bounded functions on R+ with limit at +∞ for thesupremum norm (this can be shown using the same proof as in Schwartz (1943),i.e. making a change of variable t = e−z). Thus, the functions z 7→ hz(θi) spansC0,0(R+). Using, if needed, a smoothing procedure similar to the one in theproof of Lemma A.4, we infer that z 7→ Pνhz : ν µ∗ spans C0,0(R+) andso spans Sµ∗ . Thus, we can apply Theorem 4.1, and (7.1) holds.

Assume now that ]suppµ∗ <∞, i.e. µ∗ =∑

16i6k µ∗i δθi for some θi ∈ (0,∞)

and∑

16i6k µ∗i = 1, µ∗i > 0. For any p.m. µ, write µ = µ∗ + µ⊥ its Lebesgue

decomposition w.r.t. µ∗, i.e. µ∗ µ∗ and µ⊥⊥µ∗. Then µ∗ =∑

16i6k µ∗iδθi . If(7.1) holds, then Theorem 4.1 implies that for any function f in clSµ∗

infµ

supz>0|n1/2

∫e−θz dµ⊥(θ) +

+n1/2∑

16i6k(µ∗i − µ∗i ) e−θiz − f(z)| = o(1) (7.2.2)

as n→∞. Equation (7.2.2) implies that any function in the support of B Fµ∗can be written as the limit of some nonnegative function plus some functions inthe finite-dimensional vector space spanned by z 7→ e−θiz : 1 6 i 6 k, whichis clearly wrong. Thus, (7.1) cannot holds. 2

EXAMPLE 3 (continued). Recall that Pθ is the exponential p.m. with densi-ty θ e−θx1x > 0 w.r.t. the Lebesgue measure on R. Since θ−1Pf(θ) =∫

e−θxf(x) dx is the Laplace transform of f , the real inversion formula forLaplace transform in Feller (1971, Ch. VII.6) yields

P−1h(x) = limm→∞

(−1)m

(m− 1)!

(m

x

)m( hId

)(m−1) (mx

).

PROPOSITION 7.3.1. For Pθ = Exp(θ) and F = 1· > z : z > 0, (7.1)holds if and only if µ∗ is a Muntz measure.

Proof. It follows from Example 3 in Section 4, that (7.1) implies ]suppµ∗ =∞ is proved exactly as in Proposition 7.2.2. 2

EXAMPLE 4. Let us consider the nonidentifiable (Teicher, 1960) mixture ofnormal distributions. Thus, let Θ = R × R+, and for θ = (m,σ2), let Pθ =N (m,σ2).

PROPOSITION 7.4.1. For Pm,σ2 = N (m,σ2), (m,σ2) ∈ R × R+, (7.1) holdsno matter what F is.

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.42

Page 43: Statistical Analysis of Mixtures and the Empirical Probability Measure

STATISTICAL ANALYSIS OF MIXTURES AND THE EMPIRICAL PROBABILITY MEASURE 295

Proof. Just notice that Pn = n−1∑16i6nN (Xi, 0). 2

In the next examples, we deal with subclasses of mixtures of Gaussian p.m.’s inorder to avoid triviality as in Proposition 7.4.1.

EXAMPLE 5. Let Θ = R, and Pθ = N (θ, σ2), where σ2 is now supposed to befixed and known. Here the operator P is

Pf(θ) = (2π)−1/2σ−1∫

exp(−(x− θ)2/2σ2)f(x) dx

= e−θ2/2σ2

(2π)−1/2σ

∫eθy e−y

2σ2/2f(yσ2) dy.

Thus, if we set gf (y) := e−y2σ2/2f(yσ2), we see that σ−1(2π)1/2 eθ

2/2σ2Pf(θ) isthe Laplace transform of gf . Using the inversion formula for Laplace transformin Feller (1971, Ch. VII.6), we obtain

gf (y) = limm→∞

(−1)m−1

(m− 1)!

(m

y

)m×

× dm−1

dθm−1

(σ−1(2π)1/2 eθ

2/2σ2Pf(−θ)) ∣∣∣∣θ=m/y

,

so that

f(x) = ex2/2σ2

limm→∞

(−1)m−1

(m− 1)!

(mσ2

x

)m(2π)1/2

σ×

× dm−1

dθm−1

(eθ

2/2σ2Pf(−θ)) ∣∣∣∣θ=mσ2/x

.

Clearly, we shall obtain results similar to that for mixtures of exponential distri-butions. We shall make use of Proposition 7.1 which leads to conditions slightlymore restrictive than the Sasz–Muntz theorem.

PROPOSITION 7.5.1. For Pθ = N (θ, σ2), θ ∈ R, and F = 1· 6 z : z ∈ R,(7.1) holds if suppµ∗ contains a finite clustering point in R. Morever, (7.1)implies that ]suppµ∗ = ∞. Hence, if (7.1) holds and suppµ∗ is bounded, thensuppµ∗ admits a finite clustering point.

Proof. We can assume that σ = 1. The family P defines a location model.All we really have to check to apply Proposition 7.1 is that for any finite Borelmeasure m on R, the function

g(θ) := gm(θ) := mP−θf. =

∫Φ(z + θ) dm(z)

is real analytic in any interval [θ−, θ+] (where Φ denotes the standard normalc.d.f.). This can be done as follows.

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.43

Page 44: Statistical Analysis of Mixtures and the Empirical Probability Measure

296 PHILIPPE BARBE

We have g′(θ) =∫φ(z+θ) dm(z) with φ(x) = (2π)−1/2 e−x

2/2. Let us showthat g′ is analytic. Let ψ(x) := e−x

2. The nth Hermite polynomial Hn is defined

by

Hn(x) := (−1)n ex2 dn

dxne−x

2= (−1)nψ(n)(x)/ψ(x).

Let

ξn(x) := (√π2nn!)−1/2Hn(x) e−x

2/2.

Theorem 8.91.3 in Szego (1939) gives ‖ξn‖∞ 6 cn−1/12 for some constant c.One easily sees that

g(n+1)(θ) =

∫dn

dθnφ(z + θ) dm(z)

= 2−n/2∫Hn((z + θ)/

√2)φ(z + θ) dm(z)

= π1/4(n!)1/2∫ξn((z + θ)/2) e−(z+θ)2/4 dm(z).

Consequently, the following bound holds,

|g(n+1)(θ)| 6 π1/4(n!)1/4cn−1/12∫

e−(z+θ)2/4 d|m|(z).

Therefore, on any interval (θ−, θ+), there exists a constant r such that

supθ∈(θ−,θ+)

|g(n)(θ)| 6 rnn!, n > 0.

Then g(·) is real analytic (see, e.g., Theorem 1.9.3 in Davis (1963)). ApplyProposition 7.1 to obtain the sufficient part of Proposition 7.5.1.

The necessity part of Proposition 7.5.1. follows as in Proposition 7.2.2. 2

In the proof of Proposition 7.5.1, we can also show that g′(θ) is analytic asfollows. Extend g(θ) to the complex plane, integrate it over a triangular path, anduse Fubini’s theorem. Then, since φ is analytic, we obtain that the integral of g(θ)over any triangular path is 0, and we use Morera’s theorem (see, e.g., Conway(1973)) to conclude that g is analytic. This looks more efficient. However, tocheck that the assumptions of Fubini’s theorem are fulfilled requires about thesame amount of work. Moreover, bounding the derivative of g is useful if onewants to use another technique developed hereafter in Section 8, a techniquewhich allows us to investigate the rate of convergence in (7.1). Thus, we shalluse mainly proofs in the spririt of Proposition 7.5.1. The reader will see in ourtreatment of Example 3 in Section 8 that this is what we need there.

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.44

Page 45: Statistical Analysis of Mixtures and the Empirical Probability Measure

STATISTICAL ANALYSIS OF MIXTURES AND THE EMPIRICAL PROBABILITY MEASURE 297

EXAMPLE 6. Let Θ = R+ and Pθ = N (m, θ), F = 1· 6 z : z ∈ R.Variance mixture of normal distributions have been studied by Andrews andMallows (1974) who characterized all the distributions that are scaled normalmixtures (see also Bagirov (1988)). Efron and Olshen (1978) studied the rangethat can be attained by the c.d.f. of variance normal mixtures. Kelker (1971)investigate infinite divisibility of this class of mixtures.

Since the Kolmogorov distance is invariant by translation of the r.v.’s, thereis no loss of generality in assuming that m = 0 and setting Pθ = N (0, θ). Themapping P is

Pf(θ) = (2πθ)−1/2∫

exp(−x2/2θ)f(x) dx

= (2πθ)−1/2∫

R+exp(−x2/2θ)(f(x) + f(−x)) dx.

Consequently, for s > 0,

Pf(1/s)(2π/s)1/2 =

∫R+

exp(−sy)(f(√

2y) + f(−√

2y))(2y)−1/2 dy.

Using (4.10), we have

f(√

2y) + f(−√

2y)

= (2y)1/2 limm→∞

(−1)m−1

(m− 1)!

(m

y

)m dm−1

dsm−1

((2π/s)1/2Pf(1/s)

)∣∣∣∣∣s=y

,

or equivalently, for x > 0,

f(x) + f(−x)

= x limm→∞

(−1)m−1

(m− 1)!

(2mx2

)m dm−1

dsm−1

((2π/s)1/2Pf(1/s)

)∣∣∣∣∣s=x2/2

.

In this case the mapping P is not one-to-one, due to the symmetry of the normaldistribution. Thus, it may be better to use a class F of symmetric functions suchas F2 := 1−z 6 · 6 z : z ∈ R.

PROPOSITION 7.6.1. Let

Pθ = N (0, θ), θ > 0 and F = F2 := 1−z 6 · 6 z : z > 0.

Then (7.1) holds if suppµ∗ contains a finite clustering point in (0,∞]. Moreover,(7.1) implies that ]suppµ∗ =∞. Hence, if inf suppµ∗ > 0 and (7.1) holds, thensuppµ∗ admits a clustering point in (0,∞].

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.45

Page 46: Statistical Analysis of Mixtures and the Empirical Probability Measure

298 PHILIPPE BARBE

Proof. Let m be a finite Borel measure on R+, and let (with a change ofvariable s = 1/θ)

g(s) := gm(s) :=∫

Φ(sx) dm(x).

Let us show that g(·) is real analytic, so that Proposition 7.1 applies. We havefor n > 0,

g(n+1)(s) :=∫xn+1φ(n)(sx) dm(x)

= 2−n/2∫xn+1Hn(sx/

√2)φ(sx) dm(x)

= (n!)1/2π1/4∫xn+1ξn(sx/

√2) e−x

2s2/4 dm(x).

Therefore, denoting |m| the total variation measure of m, and c some constant

|g(n+1)(s)| 6 π1/4(n!)1/2cn−1/12∫|x|n+1 e−x

2s2/4 d|m|(x).

The function xn+1 e−x2s2/8 is maximum for x = 2(n+1)1/2/s, and its maximum

is

gn := 2n+1(n+ 1)n+1s−n+1 exp(−(n+ 1)/2

).

Hence,

g(n+1)(s) 6 π1/4(n!)1/2gn

∫e−x

2s2/8 d|m|(x).

For any s > 0,∫

e−x2s2/8 d|m|(x) < ∞. Hence, to show that g(n+1) is real

analytic on (0,∞), it suffices to show that (n!)1/2gn 6 rnn! for n large enoughand some r > 0. Notice that

log((n!)1/2gn) = (1/2) logn! + log gn

= (12)((n + (1

2 ) logn− n+ log√

2π + O(n−1))

+(n+ 1) log(2/s) + ((n + 1)/2) log(n+ 1)− (n+ 1)/2

= n logn+ (34 ) logn− n+ n log(2/s)−

−(1

2

)log√

2π + log(2/s) + O(n−1)

= logn! + (14) logn+ n log(2/s)− (1

2 ) log√

2π + log(2/s) + O(n−1).

Consequently, there exists an r > 0 so that for n large enough, (n!)1/2gn 6 n!rn,and g(·) is analytic on [0,∞).

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.46

Page 47: Statistical Analysis of Mixtures and the Empirical Probability Measure

STATISTICAL ANALYSIS OF MIXTURES AND THE EMPIRICAL PROBABILITY MEASURE 299

The mapping m 7→ gm is injective on the set of all measures m nonsymmetricw.r.t. 0, since gm = 0 implies g′m = 0. If gm = 0, changing s in

√s we obtain∫

e−sx2x dm(x) = 0 for any s.

It follows from elementary properties of the Laplace transform that m = 0. ApplyProposition 7.1 to obtain the sufficient condition in Proposition 8.6.1.

The necessity part follows also from Theorem 4.1 in the usual way. Clearlythe support of B is an infinite dimensional space and cannot be approximatedby functions of the form x > 0 7→∑

16i6k aiΦ(x/θi) + v(x) where the θi’s andk are fixed and v is nondecreasing. 2

EXAMPLE 7. Θ = R+, Pθ = N (m+θβ, θ2σ2), where m,β ∈ R and σ2 > 0 areknown parameters. Mixtures of such p.m.’s are called variance-mean mixtures,and have been studied by Barndorff–Nielsen, Kent and Sørensen (1982). Herethe mapping P is

Pf(θ) = (2π)−1/2θ−1σ−1∫

exp(−(x−m− θβ)2

2θ2σ2

)f(x) dx

= (2π)−1/2θ−1σ−1∫

exp(− 1

2σ2 ((x/θ)− β)2)f(x+m) dx.

When, β = 0, we are in the situation of Example 6; therefore, we shall deal herewith β 6= 0.

Our first task is to obtain an inversion formula for the operator P. Define

Ψε(f)(y) :=1

(2π)1/2ε

∫exp

(−(y − x)2

2ε2

)f(x) dx.

For any continuous bounded function f on R,

limε→∞

Ψε(f)(y) = f(y). (7.7.1)

Expand the exponential to obtain formally

f(y) = limε→∞

1√2πε

∑i>0

(−1)i

ε2ii!2i

∫(y − x)2if(x) dx

= limε→∞

1√2πε

∑i>0

(−1)i

ε2ii!2i×

×∑

06j62i

(2ij

)y2i−j(−1)j

∫xj f(x) dx. (7.7.2)

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.47

Page 48: Statistical Analysis of Mixtures and the Empirical Probability Measure

300 PHILIPPE BARBE

Set c = |β|/σ√

2, ψ(x) = e−x2. The change of variables x = yβ and θ = 1/s

leads

gf (s) := s−1β−1σ√

2πPf(1/s) =

∫ψ(c(sy − 1))f(y) dy,

where f(y) = f(yβ+m). Thus, formally, with Hn the nth Hermite polynomial,

g(n)f (s) =

∫(−1)ncnynHn(c(sy − 1))φ(c(sy − 1))f(y) dy

and in particular

g(n)f (0) =

∫(−1)ncnynHn(−c)φ(−c)f (y) dy. (7.7.3)

Combining (8.7.1)–(8.7.4), we obtain

f(y) = limε→∞

1

ε√

∑i>0

(−1)i

i!2iε2i ×

×∑

06j62i

(2ij

)y2i−j(−1)j

1(−1)jcjHj(−c)φ(c)

g(j)f (0)

= limε→∞

1εβ

∑i>0

(−1)i

i!2iε2i

∑06j62i

(2ij

)y2i−j 1

cjHj(−c)φ(c)×

× dj

dsjs−1Pf(1/s)

∣∣∣∣s=0

(7.7.4)

which is the desired inversion formula (of course, all the above calculations arerather formal but we shall make them rigorous in Section 9).

PROPOSITION 7.7.1. Let Pθ = N (m+θβ, θ2σ2), Θ = R, where β 6= 0. Assumethat F = 1· 6 z : z ∈ R. (7.1) holds if suppµ∗ contains a finite clusteringpoint in R. Moreover, (7.1) implies that ]suppµ∗ = ∞. Hence, if (7.1) holdsand suppµ∗ is bounded, then suppµ∗ admits a finite clustering point.

Proof. Since F is invariant by translations, we can assume that m = 0. Also,we can reparametrize the family, replacing β by β/σ. Then, in order to applyProposition 7.1, we want to show that if m is a finite Borel measure, the function(set s = 1/θ)

g(s) :=∫

Φ(sx− β) dm(x)

is analytic in (0,∞). Following the proof of Proposition 7.5.1,

g(n+1)(s) =

∫xn+1ζn(2−1/2(sx− β)) exp

(−(sx− β)2/4

)dm(x),

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.48

Page 49: Statistical Analysis of Mixtures and the Empirical Probability Measure

STATISTICAL ANALYSIS OF MIXTURES AND THE EMPIRICAL PROBABILITY MEASURE 301

and

|g(n+1)(s)| 6 π1/4(n!)1/2cn−1/2∫|x|n+1 exp

(−(sx− β)2/4

)d|m|(x).

If n ∈ 2N + 1,∫|x|n+1 exp

(−(sx− β)2

4

)d|m|(x)

6(∫|x|n+2 exp

(−(sx− β)2

4n+ 2n+ 1

)d|m|(x)

)n+1/n+2

.

Consequently, for some constant c,

|g(n+1)(s)| 6 π1/4(n!)1/2cn−1/12gn(‖m‖TV + 1),

where

gn =

supx∈R x

n+1 exp(− (sx−β)2

4

), if n+ 1 ∈ 2N,(

supx∈R xn+2 exp

(− (sx−β)2

4n+2n+1

))n+1/n+2, if n+ 1 ∈ 2N + 1.

One can check that

log gn =n

2logn− n(1

2 + log s) + o(n) as n→∞.

Hence,

12 logn! + log gn = 1

2(n logn− n+ o(n))− n

2logn− n(1

2 + log s)

= n logn− n− n log s+ o(n)

= logn! + O(n).

It follows that for some r > 0,

|g(n+1)(s)| 6 n!rn,

and g is analytic. Proposition 7.1 applies to give the sufficient part of Proposi-tion 7.7.1. The necessary part is clear. 2

EXAMPLE 8. Θ = R, and Pθ is the gamma p.m. with density w.r.t. the Lebesguemeasure βθΓ(θ)−1xθ−1 e−βx. Mixtures of Pθ’s make a subclass of the generalizedgamma convolution p.m.’s introduced by Thorin (1977) and further studied byBondesson (1992). The operator P is given by

Pf(θ) = βθΓ(θ)−1∫

e−βxxθ−1f(x) dx

= βθΓ(θ)−1∫

e−β ey eθyf(ey) dy.

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.49

Page 50: Statistical Analysis of Mixtures and the Empirical Probability Measure

302 PHILIPPE BARBE

Therefore, Γ(θ)β−θPf(θ) is the Laplace transform of g(y) := e−β eyf(ey). Thisyields the inversion formula

P−1h(x) = eβx limm→∞

(−1)m−1

(m− 1)!

(m

logx

)m dm−1

dθm−1 Γ(θ)β−θh(θ)

∣∣∣∣θ=m/ log x

.

PROPOSITION 7.8.1. With Pθ the Gamma distribution with density

βθΓ(θ)−1xθ−1 e−βx and F := 1· 6 z : z > 0,

(7.1) holds if suppµ∗ admits a clustering point in (0,∞). Moreover, (7.1) impliesthat ]suppµ∗ =∞. Hence, if suppµ∗ is a compact subset of (0,∞), (7.1) impliesthat it admits a clustering point in (0,∞).

Proof. We shall use a proof in the same spirit as in Proposition 7.1. Aspreviously, let C0,0(R+) be the subset of C(R+) of all functions f(x) such thatlimx→0 f(x) = limx→∞ f(x) = 0.

Let (θi)i>1 be a sequence in R+ which converges to a point θ∗ ∈ R+. Forany c > 0, α > 0, let

Ac,α := y ∈ R+ 7→ yθ−1 e−cyα

: θ ∈ (θi)i>1.

Let us first show that if θ∗ > 1, then Ac,α spans C0,0(R+). Assuming that θ∗ > 1,we can assume that all the θi’s are in (1,∞). Recall the inclusion of the dualspaces C0,0(R+)∗ ⊂ CK((0,∞))∗. Let m be a finite Borel measure on (0,∞).Then, set

g(θ) := gm(θ) :=∫yθ−1 e−cy

αdm(y).

In order to show that g(·) is analytic, extend g(·) to a half complex plane indefining for Re z > 0

G(z) :=∫yz e−cy

αdm(y),

so that g(θ) = G(θ − 1) in the range Re θ > 1. To show that G(.) is analytic inthe domain Re z > 0, let z, z′ with Re z,Re z′ > 0 and write z′ − z = a + ib,where a, b are real numbers. Notice that

∆(z, z′) = |G(z) −G(z′)− (z − z′)∫yz log y e−cy

αdm(y)|

6∫|1− yz′−z − (z − z′) log y||yz| e−cyα d|m|(y).

Next, write

1− ya+ib + (a+ ib) log y

= (1− ya − a log y) + ya(1− yib − ib log y) + ib log y(ya − 1),

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.50

Page 51: Statistical Analysis of Mixtures and the Empirical Probability Measure

STATISTICAL ANALYSIS OF MIXTURES AND THE EMPIRICAL PROBABILITY MEASURE 303

so that

∆(z, z′) 6∫|1− ya − a log y||yz| e−cyα d|m|(y) +

+

∫|1− yib − ib log y|ya e−cy

αd|m|(y) +

+

∫b|ya − 1| log y e−cy

αd|m|(y). (7.8.1)

The inequality |ex − 1− x| 6 x2 for x 6 0.2 yields∣∣∣∣1− ya − a log ya+ ib

∣∣∣∣ =

∣∣∣∣1− ya − a log ya2 + b2 (a− ib)

∣∣∣∣6∣∣∣∣1− ya − a log y

a

∣∣∣∣ |a− ib|√a2 + b2

6 a2(log y)2

for |a log y| 6 0.2 and, consequently,

lima→0

∫ exp(0.2/a)

0

∣∣∣∣1− ya − a log ya+ ib

∣∣∣∣ |yz| e−cyα d|m|(y) = 0,

while

lima→0

∫ ∞exp(0.2/a)

∣∣∣∣1− ya − a log ya+ ib

∣∣∣∣ |yz| e−cyα d|m|(y)

6 lima→0

a−1 e−c exp(0.1α/a)∫ ∞

0|1− ya − a log y| e−cyα/2 d|m|(y) = 0.

Hence,

lima+ib→0

∫ ∣∣∣∣1− ya − a log ya+ ib

∣∣∣∣ |y|z e−cyα

d|m|(y) = 0.

The two other terms in the r.h.s. of (7.8.1) can be handled in the same way, andthus G(.) is analytic on Re z > 0. Hence, g(·) is real analytic at any θ > 1.Arguing as in the proof of Proposition 7.1, the set Ac,α spans C0,0(R+).

Let Fθ(x) := Pθ[0, x] be the c.d.f. of Pθ. The support of the Gaussian processB is the closure of Sµ∗ := f Fµ∗ : f ∈ H in (B(R+), ‖ · ‖∞). Let

A := y 7→ Fθi(y) : i > 1.

Let us show that A spans Sµ∗ in (B(R+), ‖ · ‖∞). It suffices to prove that Aspans a set which contains C0

b (R+) (since Sµ∗ ⊂ C0,0(R+)). Let η > 0 such thatβ − η > 0. Notice that Fθ(y) = βθΓ(θ)−1 ∫ x

0 fθ(y) e−ηy dy with fθ ∈ Aβ−η,1.Since Aβ−η,1 spans C0,0(R+), it suffices to show that

C :=x 7→

∫ x

0f(y) e−ηy dy : f ∈ C0,0(R+)

∪ x 7→ 1

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.51

Page 52: Statistical Analysis of Mixtures and the Empirical Probability Measure

304 PHILIPPE BARBE

spans C0b (R+). Since∫ x

0e−ay e−ηy dy = (1− e−(η+a)y)/(η + a),

all the functions 1− e−ay, a > η belong to the space spanned by C.Let us study the space spanned by

C1 := y > 0 7→ 1− e−ay : a > η.

Make a change of variable e−y = t ∈ (0, 1], and set

C2 := t ∈ [0, 1] 7→ 1− ta; a > η + 1.

Since

limε→∞

sup06t61

|ε−1((1− ta+ε)− (1− ta))− ta−1| = 0

(recall a > η + 1 > 1), all the monomials ta−1, a > 1 are in the space spannedby C2 in (B[0, 1], ‖.‖∞). The Muntz theorem shows then that C2 spans the set ofall continuous functions f on [0, 1] with f(0) = 0. Hence, C1 spans C0,0(R+) aswell as A. This shows Proposition 7.8.1 when suppµ∗ admits a clustering pointθ∗ ∈ (1,∞).

Now, if the only clustering point of suppµ∗ is θ∗ ∈ (0, 1], notice that for anys > 0,

βθΓ(θ)Fθ(xs) =

∫ xs

0yθ−1 e−βy dy = s

∫ x

0tsθ−1 e−βt

sdt.

Pick s such that θ∗s > 1. Let (θi)i>1 converging to θ∗. Up to removing someterms, we can assume that sθi > 1. Arguing as in the case θ∗ > 1, and using theset Aβ−η,s instead of Aβ−η,1, we see that Fθi : i > 1 spans C0,0(R+) sincex 7→ f(x1/s) : f ∈ C0,0(R+) = C0,0(R+). This fully proves the sufficient partof Proposition 8.8.1.

If (7.1) holds, then ]suppµ∗ = ∞ by the same argument as in Proposition7.5.1. 2

EXAMPLE 9. Let Θ = R+ and Pθ be the p.m. with density

pθ(x) = rθ−rxr−110 6 x 6 θ.

Mixtures of densities pθ(.) are called r-unimodal (see, e.g., Dharmadhikari andJoag-Dev (1988, p. 73)) as defined by Olshen and Savage (1970). It is easy tocheck that X is r-unimodal if and only if Xr is unimodal. Consider the mappingg(x) = x1/r from R+ into R+. This mapping is one-to-one and maps a unimodalr.v. into a r-unimodal one. If F = 1· 6 z : z > 0, we have F = F g−1.Consequently, Lemma 3.1 shows that all the results for the unimodal distributionsthat we have proved with the class F hold for d-unimodals.

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.52

Page 53: Statistical Analysis of Mixtures and the Empirical Probability Measure

STATISTICAL ANALYSIS OF MIXTURES AND THE EMPIRICAL PROBABILITY MEASURE 305

EXAMPLE 10. Mixture of Cauchy distributions. Let Pθ be the Cauchy distribu-tion with density

pθ(x) :=1

π(1 + (x− θ)2), θ ∈ Θ = R.

We use the Kolmogorov distance, so that F := 1· 6 z : z ∈ R. The operatorP is given by

Pf(θ) :=1π

∫f(x)

1 + (x− θ)2 dx.

We have not been able to explicitly invert the operator P in the Cauchy mixtureexample (with only real-valued function operators), but Theorem 4.1 still givesa fine result in conjunction with Proposition 7.1.

PROPOSITION 7.10.1. Assume that Pθ is the Cauchy distribution with medianθ, that Θ = R and F := 1· 6 z : z ∈ R. Then (7.1) holds if suppµ∗ admitsa clustering point in R. Moreover, (7.1) implies that ]suppµ∗ =∞.

Proof. We use Proposition 7.1. Thus, all what we have to prove is that themapping

θ ∈ R 7→ g(θ) := gm(θ) :=∫

[1 + (x− θ)2]−1 dm(x)

is analytic whenever m is a finite Borel measure.Let k(x) = 1/(1 + x). Notice that

g(n)(θ) =

∫(−1)n

dn

dynk(y2)

∣∣∣∣y=x−θ

dm(x).

In order to bound g(n), we need to bound the derivatives of k(y2). Recall (seeGradshteyn and Ryzhik (1980, formula 0.432)) that

dn

dynk(y2) =

∑06p6[n/2]

(2y)n−2pk(n−p)(y2)(n)2p

p!,

where (n)0 = 1 and (n)k = n(n− 1) · · · (n− k + 1) for k > 1. Therefore,

dn

dxnk(x2) =

∑06p6[n/2]

(2x)n−2p (−1)n−p(n− p)!(1 + x2)n−p+1

(n)2p

p!.

The function x 7→ xn−2p/(1 +x2)n−p+1 is maximum at x2 = (n−2p)/(n+ 2p),with maximum

A :=(n− 2pn+ 2

)(n−2p)/2 ( n+ 22n+ 2− 2p

)n−p+1

=

(n− 2pn+ 1− p

)(n/2)−p ( n+ 2n+ 1− p

)(n/2)−1

2−n+p−1.

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.53

Page 54: Statistical Analysis of Mixtures and the Empirical Probability Measure

306 PHILIPPE BARBE

Notice that (n− 2p)/(n+ 1− p) < 1 so that

A 6(

n+ 2n+ 1− p

)(n/2)+1

2−n+p−1 6(n+ 2n/2

)(n/2)+1

2−n+p−1

=

(n+ 2n

)(n/2)+1

2−(n/2)+p 6 c2−(n/2)+p

for some universal constant c. Consequently,∣∣∣∣ dn

dxnk(x2)

∣∣∣∣ 6 c∑

06p6[n/2]

(n)2p

p!2n−2p2−(n/2)+p

= c2n/2∑

06p6[n/2]

(n)2p

p!2−p

6 c2n/2n!n 6 rnn!

for some r < ∞ and n large enough. Thus, for n large enough, g(n)(θ) 62nn!‖m‖TV and so g(.) is real analytic at any point θ ∈ R. Apply Proposition7.1 to obtain the sufficient part of Proposition 7.10.1.

The necessity part follows as usual. If ]suppµ∗ <∞ and (7.1) holds, all thefunctions in the support of B can be written as an increasing function f withlimx→∞ f(x) = 0 plus a linear combination of the functions Fθ for θ ∈ suppµ∗,which is clearly wrong. 2

Our last example in this section deals with discrete distributions.

EXAMPLE 11. Mixture of Noack’s (1950) distributions. Let (ak)k>0 be asequence of positive real numbers such that the series Z(θ) :=

∑k>0 akθ

k has aradius of convergence R ∈ (0,∞]. Following Milhaud and Mounime (1993), letΘ = [0, R) (or [0, R] if Z(R) is defined), and define a p.m. on the integers,

Pθ = Z(θ)−1∑k>0

akθkδk, 0 6 θ < R. (7.11.1)

Milhaud and Mounime (1993) study the maximum likelihood estimation of µ∗

in such a model. A very different method, using Fourier transform, has beenproposed by Zhang (1995) who obtained the rate of convergence for the pointwisemean square error of his estimator of the density of µ∗ (assuming of course thatµ∗ admits such a density) and Hengartner (1997) for Poisson mixture and rateof convergence of the mean integrated square error of the density of µ∗. Thisfamily generalizes the Poisson distribution which is obtained for ak = 1/k! andincludes negative binomial p.m.’s (ak =

(n+k−1k

)).

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.54

Page 55: Statistical Analysis of Mixtures and the Empirical Probability Measure

STATISTICAL ANALYSIS OF MIXTURES AND THE EMPIRICAL PROBABILITY MEASURE 307

The inverse of the operator

Pf(θ) = Z(θ)−1∑k>0

akθkf(k)

is given by the formula

P−1h(k) =1

akk!dk

dθk(Z(θ)h(θ))

∣∣∣∣∣θ=0

. (7.11.2)

For mixtures of distributions on the integers, it seems rather natural to considerthe class of functions F := 1· = k : k ∈ N. Therefore, if we denote

Nk,n := ]1 6 i 6 n : Xi = k,

we have

‖Pµ − Pn‖F = supk>0|Pµ(k)− n−1Nk,n|.

The class F is Pµ∗-Donsker since 1· 6 k : k ∈ N and 1· 6 k+ 1−1· 6k : k ∈ N are universally Donsker (this follows from Donsker’s theorem; seealso Dudley (1987) for universal Donsker classes).

PROPOSITION 7.11.1. Let Pθ as in (7.11.1), and let F := 1· = k : k ∈ N.For (7.1) to hold, it suffices that µ∗ admits a clustering point θ0 ∈ [0, R) suchthat

lim supn→∞

n−1( log supakknθk−n0 : k > n − n logn)∈ [−∞,+∞) (7.11.3)

(the lim sup may be −∞).Proof. Let r such that

lim supn→∞

n−1(log supakknθk−n0 : k > n − n logn)6 r − 1. (7.11.4)

If Un denotes the empirical c.d.f. of a uniform sample over [0, 1], we have(Nk,n

n− Fµ∗(k)

)k>0

d=((Un − Id)(Fµ∗(k + 1))− (Un − Id)(Fµ∗(k))

)k>0 .

Consequently, the support of the Gaussian process B indexed by F is the closurein (l∞(F), ‖ · ‖F) of

Sµ∗ := f Fµ∗(k + 1)− f Fµ∗(k) : f ∈ H.

Clearly, Sµ∗ ⊂ C0(N), where C0(N) is the set of all sequences in N with limit0 at infinity.

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.55

Page 56: Statistical Analysis of Mixtures and the Empirical Probability Measure

308 PHILIPPE BARBE

Notice that

Z(θ)Pθfk = akθk.

Let m be a finite measure on the integers and

g(θ) := gm(θ) :=∫akθ

k dm(k).

Then, the derivative verifies

|g(n)(θ0)| =

∣∣∣∣ ∫ ak(k)nθk−n0 1k > n dm(k)

∣∣∣∣6 supak(k)nθ

k−n0 : k > n‖m‖TV

6 supakknθk−n0 : k > n‖m‖TV6 rnn!

for some r > 0, where the last inequality comes from (7.11.4) and holds forn large enough. Now, we conclude the proof in the spirit of Proposition 7.1 toobtain that Z(θ)Pθfk : θ ∈ (θi)i>1 (with θi a sequence in suppµ∗ convergingto θ0) spans C0(N).

The reader should notice that (7.11.3) is always verified when θ0 = 0, i.e.if 0 is a clustering point of the support of the mixing measure µ∗. Otherwise,(7.11.3) expresses a balance between the location of the clustering point θ0 andthe decay of the sequence (az)z>0. 2

8. An Other Necessary and Sufficient Condition for Havinginfµ‖Pµ − Pn‖F = oPµ∗ (n

1/2)

As far as the theory goes, examples in Section 7 show that Theorem 4.1 is a nicetool to obtain conditions under which Pµn is close to Pn in the sense that (4.2)holds. But our abstract approach is not very explicit since we do not describemuch the p.m. µn. In this section, we provide a new condition equivalent to (4.2)which gives another way to check whether (4.2) holds. Although equivalent,the result we shall obtain is not as nice as Theorem 4.1. In applications, itrequires more complicated calculations and seldom provides conditions as sharpas Theorem 4.1. However, we believe it is useful for four reasons:

− it provides a quite explicit description of µn (but not a close formula!).This may be interesting if one needs a starting point to run an optimizationalgorithm leading to an estimation of µ∗;

− although we have not met one, there could be some cases where Theorem 4.1is more difficult to apply than the result of this section;

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.56

Page 57: Statistical Analysis of Mixtures and the Empirical Probability Measure

STATISTICAL ANALYSIS OF MIXTURES AND THE EMPIRICAL PROBABILITY MEASURE 309

− from a theoretical point of view, it enlightens (4.2) by showing that it isrelated to the properties of the operator P;

− although we did not pursue in that direction, the technique used in thissection is worth using if one wants to investigate the exact rates and limitingdistribution in (4.2).

The next theorem is a variation of Proposition 3.2 but is much more importantas far as applications are concerned, since it has the same status as Theorem 4.1.

THEOREM 8.1. Assume that F is Pµ∗-Donsker and that (4.1), (4.3) and (4.4)

hold. In order to have n1/2 infµ ‖Pµ−Pn‖F = oPµ∗ (1) as n→∞, it is necessaryand sufficient that

n1/2 inf‖Pν − Pn‖F : ν ∈M(Θ), ν− µ∗, ν−PF <∞

= oP ∗µ (1) as n→∞. (8.1)

Notice that the major difference betwen (8.1) and (4.2) is that the infimum in(8.1) is taken over a much larger class of signed measures ν.

Proof. Since the set on which the infimum is taken in (8.1) is larger than theone in (4.2), the condition is trivially necessary.

Let us prove that it is sufficient. Since F is Pµ∗-Donsker, there exists a versionP ∗n of Pn and B∗ of B such that

ε∗n := ‖n1/2(P ∗n − Pµ∗)−B∗‖F = o(1) a.s. as n→∞.

Using (8.1), there exists a family νn of signed measures such that ν−n µ∗,ν−n PF <∞ and

limn→∞

n1/2‖Pνn − P ∗n‖F = 0 in probability.

Let ν−n,M be the (nonnegative) measure defined by its density

dν−n,Mdµ∗

(θ) :=dν−ndµ∗

(θ) ∧M.

Let further

µn,m,M = µ∗ +

√m

n(νm − µ∗ + ν−m − ν−m,M )

and

cn,m,M := 1/µn,m,M (Θ).

Notice that cn,m,M is well defined for any n large enough, and that

limn→∞

cn,m,M = 1 in probability

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.57

Page 58: Statistical Analysis of Mixtures and the Empirical Probability Measure

310 PHILIPPE BARBE

(the randomness in cn,m,M comes from that of νε,n). Set

µn,m,M := cn,m,M µn,m,M ,

which is a well defined p.m. for n large enough. For any p.m. µ,

n1/2 infµ‖Pµ − P ∗n‖F

6 n1/2‖Pµ − P ∗n‖F6 ‖n1/2(Pµ − Pµ∗)−B∗‖F + ε∗n

6 ‖n1/2(Pµ − Pµ∗)−m1/2(P ∗m − Pµ∗)‖F + ε∗n + ε∗m

6 ‖n1/2(Pµ − Pµ∗)−m1/2(Pνm − Pµ∗)‖F +

+ ε∗n + ε∗m +m1/2‖Pνm − P ∗m‖F= ‖n1/2(µ− µ∗)−m1/2(νm − µ∗)‖H +

+ ε∗n + ε∗m +m1/2‖Pνm − P ∗m‖F . (8.2)

Taking µ = µn,m,M in the upper bound (8.2) yields

n1/2 infµ‖Pµ − P ∗n‖F

6 n1/2|cn,m,M − 1|‖µ∗‖H + cn,m,Mm1/2‖ν−m − νm,M‖H +

+ |cn,m,M − 1|m1/2‖νm − µ∗‖H +

+ ε∗n + ε∗m +m1/2‖Pνε,m − P ∗m‖F . (8.3)

Taking successively in both side of (8.3) the limit as n → ∞, the lim inf asM →∞ and the lim sup as m→∞ in the r.h.s. of (8.3), we obtain

lim supn→∞

n1/2 infµ‖Pµ − P ∗n‖F

6 lim supm→∞

lim infM→∞

lim supn→∞

n1/2|cn,m,M − 1|‖µ∗‖H. (8.4)

But notice that

n1/2(µn,m,M (Θ)− 1) = m1/2(νm − µ∗ + ν−m − ν−m,M )(Θ).

Consequently,

limM→∞

limn→∞

n1/2(c−1n,m,M − 1) = m1/2(νm(Θ)− µ∗(Θ))

= m1/2(Pνm − Pµ∗)1 = m1/2(Pνm − P ∗m)1.

Hence, using (8.1) and the definition of νn, we see that the r.h.s. of (8.4) is 0, sothat (4.2) holds. 2

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.58

Page 59: Statistical Analysis of Mixtures and the Empirical Probability Measure

STATISTICAL ANALYSIS OF MIXTURES AND THE EMPIRICAL PROBABILITY MEASURE 311

Remark 8.1. Before looking at the consequence of Theorem 8.1 in our threebasic examples, let us comment on its general use. Recall the notation (seeTheorem 4.1)

P(µ∗,F) := Pν : ν ∈M(Θ), ν− µ∗, ν−PF <∞.If for any x ∈ suppPµ∗ , the Dirac measure δx is in the closure (in (l∞(F), ‖·‖F ))of P(µ∗,F), we can find measures νε,x such that

limε→∞

‖Pνε,x − δx‖F = 0. (8.5)

Set νε,n := n−1∑16i6n νε,Xi . Then (8.5) implies

limε→∞

‖Pνε,n − Pn‖F = 0

so that (8.1) holds. In practice, (8.5) is a too strong requirement, since there isoften a subclass Fx ⊂ F on which Pνε,x does not converge to δx. But generally⋂

16i6nFXi = ∅, and this will be good enough to check (8.1). In any case,(8.5) should be understood as being a general guide towards proving that (8.1)holds, and (8.1) should be interpreted as saying that δx belongs to the closureof some vector space spanned by the family P (but notice the important factthat this last requirement is much stronger than the real meaning of (8.1) if weconsider only subspaces of (l∞(F), ‖ · ‖F )). So, in practice, the key step tocheck (8.1) will be to find a family Pνε,x such that δx = limε→∞ Pνε,x in somesense. Observe also that Pνf = νPf . Therefore, if δx is a limit of Pνε,x , wehave limε→∞ νε,x(Pf) = f(x), so that νε,x is an appoximation of the inverseδx P−1 (if it exists – see Section 5 for the invertibility of P). Therefore, away to check (8.1) is to first find a signed measure νε,x in P(µ∗,F) such thatlimε→∞ νε,x(Pf) = f(x), and then try to extend this pointwise convergence (inl∞(F)) into a uniform one required in (8.1).

It is also clear from the triviality of the necessity part of Theorem 4.1 that thesufficient part tells us much more than the necessity one. So we shall investigate(8.1) as a sufficient condition in this section.

Remark 8.2. IfM is a strict convex subset ofM1(Θ), Theorem 4.1 still holdsprovided the measure µn,m,M belongs toM. Also, whenM is the set of all finitelinear combinations of Dirac measures, we can obtain the result of Theorem 4.1in using similar density property of Pµ :µ ∈M as in Remarks 3.1, 3.2.

EXAMPLE 1 (continued). Clearly (8.1) holds. Here, Theorem 4.1 says nothingelse but F is Pµ∗-Donsker.

EXAMPLE 2 (continued). We apply here our general guideline given in Remark8.1. If h(θ) := Pf(θ) = θ−1 ∫ θ

0 f(x) dx, we have P−1(h)(x) = xh(x) + h(x)

where h is the Lebesgue derivative of h. This suggests looking for measures νε,xlike x(δx+ε − δx)/ε + δx. But we may have to smooth the Dirac measures tohave ν−ε,x µ∗.

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.59

Page 60: Statistical Analysis of Mixtures and the Empirical Probability Measure

312 PHILIPPE BARBE

In order to make precise statements, assume that suppµ∗ = [0, α] for someα > 0 (α = ∞ is allowed, and in this case one should read suppµ∗ = [0,∞);also the case α = 0 may be considered, but is trivial). Let gε(θ, x) be as in theproof of Lemma A.2 in the Appendix, and set

dνn,εdµ∗

(θ) := θn−1∑

16i6ngε(θ,Xi).

Clearly νn,ε ∈ P(µ∗,F). To check (8.1), let

kz(θ) := θhz(θ) = θPfz(θ) = θ ∧ z

so that P−1h(x) = kz(x). Then

Pνn,εfz = n−1∑

16i6n

∫θhz(θ)gε(θ,Xi) dµ∗(θ).

If x+ ε 6 z, then∣∣∣∣∫ kz(θ)gε(θ, x) dµ∗(θ)− 1∣∣∣∣ =

∣∣∣∣∫ θgε(θ, x) dµ∗(θ)− 1∣∣∣∣ 6 ε.

If x− ε > z, then∣∣∣∣∫ kz(θ)gε(θ, x) dµ∗(θ)∣∣∣∣ =

∣∣∣∣z ∫ gε(θ, x) dµ∗(θ)∣∣∣∣ = 0.

If x− ε 6 z 6 x+ ε, then (with the notation of the proof of Lemma A.2)

0 6∫kz(θ)gz(θ, x) dµ∗(θ)

= (x2 − x1)−1(∫

kz(θ)g2(θ, x) dµ∗(θ)−∫kz(θ)g1(θ, x) dµ∗(θ)

)

6 kz(x2 + α2)− kz(x1 − α1)

x2 − x16 x2 + α2 − (x1 − α1)

x2 − x1

6 1 + (x2 − x1) 6 3.

Consequently,

‖Pnfz − Pνn,εfz‖ 6 εn−1∑

16i6n1Xi + ε 6 z+

+ 3n−1∑

16i6n1Xi − ε 6 z 6 Xi + ε

6 ε+ 3 supz>0

Pn[z − ε, z + ε].

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.60

Page 61: Statistical Analysis of Mixtures and the Empirical Probability Measure

STATISTICAL ANALYSIS OF MIXTURES AND THE EMPIRICAL PROBABILITY MEASURE 313

Therefore,

lim supε→0

‖Pn − Pνn,ε‖F 6 3/n a.s.

and condition (8.1) holds. Therefore, in order to have (4.2) here, it is sufficientthat suppµ∗ is an interval which contains 0. This gives the same condition asTheorem 4.1, which we know to be also necessary.

The reader may notice that the class Fx introduced in Remark 4.1 can betaken here as fz :x− δ 6 z 6 x+ δ for, say, 4δ = min16i<j6n |Xi −Xj |.

EXAMPLE 3 (continued). As been seen in Example 3 in Section 3, mixturesof exponential distributions are strongly related to the Laplace transform, andwe need to recall few facts about the Laplace transform inversion. These factswill be crucial in further examples given in Section 9. Let u(·) be a boundedcontinuous function on R+ and let L(·) be the Laplace transform operator, sothat

w(λ) := L(u)(λ) :=∫ ∞

0e−λxu(x) dx.

Then (Feller (1971, Ch. VII.6)),

limm→∞

(−1)m−1

(m− 1)!

(m

x

)mw(m−1)(m/x) = u(x). (8.6)

Since in the case of a mixture of exponential distributions, we have

Pf(θ) = θ

∫e−θxf(x) dx = θL(f)(θ),

(8.6) yields the inversion formula for P. For h = Pf and Id the identity function,

P−1h(x) = limm→∞

(−1)m−1

(m− 1)!

(m

x

)m( hId

)(m−1)

(m/x). (8.7)

Looking at (8.7), one sees that what matters in P−1(h)(x) for x > 0 arethe high order derivatives of h(·), taken at points (m/x) that tend to +∞as m → ∞. Therefore, referring to the general comments in Remark 8.1, toapproximate P−1(h)(x) by some νε,x(h), one may think that we need to have+∞ = limε→∞ sup supp νε,x.

Let us now follow the general guideline given in Remark 8.1. Hence, we wantto approximate δx by some Pνε,x = νε,x P.

We first need to recall how (8.6) can be obtained. Following Feller(1971, Ch. VII.6), let Γα,ν be a r.v. with a gamma distribution with densityανzν−1 e−αz/Γ(ν) for z > 0. The inversion formula (8.6) comes from the factthat

limm→∞

Γm/x,m = x in probability.

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.61

Page 62: Statistical Analysis of Mixtures and the Empirical Probability Measure

314 PHILIPPE BARBE

Then, if u(·) is bounded and continuous, weak* convergence shows that

limm→∞

Eu(Γm/x,m) = u(x),

and this is (8.6) since Eu(Γm/x,m) is the l.h.s. of (8.6). Set

Πm(f)(x) := Ef(Γm/x,m).

We have

Πm(fz)(x) − fz(x) = PΓm/x,m > z − 1x > z.

Moreover, using Markov’s inequality, for any 0 6 t < m/x, and τ = tx/m ∈[0, 1),

PΓm/x,m > z 6 e−tzE etΓm/x,m

= e−tz(1− xt/m)−m

= exp(−m((τz/x) + log(1− τ))). (8.8)

If x < z, then

0 6 Πmfz(x)− fz(x) = Πm(fz)(x) 6 exp(−m((τz/x) + log(1− τ))).

If x > z,

0 6 fz(x)−Πmfz(x)

= 1− PΓm/x,m > z

= P−Γm/x,m > −z

6 exp(−m(−(τz/x) + log(1 + τ))). (8.9)

Let x > 0 and 0 < ε < x/4. Combining (8.8) and (8.9), we obtain

limm→∞

sup|z−x|>ε

|Πmfz(x)− fz(x)| = 0.

Next, if |z − x| 6 ε, notice that

|Πmfz(x)− fz(x)| = |PΓm/x,m > z − 1x > z| 6 1.

Therefore, taking ε < min16i<j6n |Xi −Xj |/8,

lim supm→∞

supz>0|n−1

∑16i6n

Πmfz(Xi)− Pnfz| 6 n−1. (8.10)

Equation (8.10) shows that we can replace Pnfz in (8.1) by n−1∑16i6n Πm

fz(Xi). Therefore, following Remark 8.1, we are looking for an approximationof Πmfz(Xi) by some νε,XiPf .

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.62

Page 63: Statistical Analysis of Mixtures and the Empirical Probability Measure

STATISTICAL ANALYSIS OF MIXTURES AND THE EMPIRICAL PROBABILITY MEASURE 315

Using Leibniz’s rule, we have

Πm(fz)(x) = Efz(Γm/x,m) =(−1)m−1

(m− 1)!

(m

x

)m (PfzId

)(m−1)

(m/x)

=(−1)m−1

(m− 1)!

(m

x

)m ∑06k6m−1

(m− 1)!(m− k − 1)!

(−1)k ×

×(x/m)k+1(Pfz)(m−1−k)(m/x)

=∑

06k6m−1

(−1)k

k!(m/x)kh(k)

z (m/x). (8.11)

Thus, to approximate Πm(fz)(x) by some νε,xPfz , it is sufficient to approximate

h(k)z (m/x) by some νε,x,khz . In order to be able to choose supp νε,x,k as we want,

we need to use the real analytic property of the functions hz(θ) = e−θz.Let 0 6 t 6 x and

∆k,r(hz, x) := h(k)z (x)−

∑06q6r

(q!)−1(x− t)qh(k+q)z (t).

Using Taylor’s formula, we have

|∆k,r(hz , x)|

=

∣∣∣∣∫ x

t(r!)−1(x− y)rh(k+r+1)

z (y) dy∣∣∣∣

=

∣∣∣∣∫ x

t(r!)−1(x− y)r(−z)k+r+1 e−zy dy

∣∣∣∣= zk+r+1(r!)−1 e−zt

∫ x−t

0(x− t− v)r e−zv dv

6 zk+r+1(r!)−1 e−zt(x− t)r∫ ∞

0e−zv dv

= zk+r e−zt(x− t)r/r!

Since the function z 7→ zk+r e−zt is maximum for z = (k + r)/t, we have

supz>0|∆k,r(hz , x)| 6 eAk,r(x)

with

Ak,r(x) := (k + r) log(k + r)− (k + r) log t+ r log(x− t)− k − r −

− (r + (1/2)) log r + r − log√

2π + O(1/r) as r →∞

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.63

Page 64: Statistical Analysis of Mixtures and the Empirical Probability Measure

316 PHILIPPE BARBE

(the O(1/r) term comes from the asymptotic approximation of log r!, and there-fore, is uniform in k ∈ N and x > 0). Notice that

Ak,r(x) = −r(log t− log(x− t)) + o(r)

provided that k stays in the range limr→∞ kr−1 log r = 0 (here the o(r) isuniform in this range of k and for x > 0).

If for some δ > 0, (1 + δ)x/2 6 t 6 x, then

logt

x− t > log1 + δ

1− δ > 0,

so that in the range (1 + δ)x/2 6 t 6 x,

limr→∞

supz>0|∆k,r(hz , x)| = 0. (8.12)

Combining (8.11) and (8.12), we obtain, for anym > 1, x > 0 and (1+δ)m/2x 6t 6 m/x,

limr→∞

supz>0|Πmfz(x)−

∑06k6m−1

(−1)k

k!

(m

x

)k×

×∑

06l6r

((m/x)− t)ll!

h(k+l)z (t)| = 0. (8.13)

Assume that suppµ∗ contains an unbounded sequence (tq)q>1 of clusteringpoints. Up to extracting a subsequence, we can assume that limq→∞ tq =∞. Forany sample point Xi, the intervals [(1+δ)m/2Xi,m/Xi] (with m > (1+δ)/(1−δ)) cover [(1 + δ)Xi/(1− δ),∞). Let m(Xi, q) such that tq ∈ [(1 + δ)m(Xi, q)/2Xi,m(Xi, q)/Xi]. For any 1 6 i 6 n, limq→∞m(Xi, q) =∞.

Using Lemma A.3, if tq is a clustering point of suppµ∗, there exists somefunction gε,tq such that for any integer l

limε→∞

supz>0

∣∣∣∣∫ hz(θ)gε,l(θ, tq) dµ∗(θ)− h(l)z (tq)

∣∣∣∣ = 0, (8.14)

and, moreover, replacing hz(.) by the constant function 1, Lemma A.3 also gives

limε→∞

∫gε,0(θ, tq) dµ∗(θ) = 1. (8.15)

Define the random signed measure

dνε,n,r,qdµ∗

:= n−1∑

16i6n

∑06k6m(Xi,q)

(−1)k

k!

(m(Xi, q)

Xi

)k×

×∑

06l6r

(m(Xi,q)Xi

− tq)l

l!gε,k+l(θ, tq).

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.64

Page 65: Statistical Analysis of Mixtures and the Empirical Probability Measure

STATISTICAL ANALYSIS OF MIXTURES AND THE EMPIRICAL PROBABILITY MEASURE 317

Combining (8.13)–(8.15), we obtain

limr→∞

limε→∞

supz>0|νε,n,r,q − n−1

∑06i6n

Πm(Xi,q)fz(Xi)| = 0.

Since limq→∞m(Xi, q) =∞, we can modify slightly (8.10) into

lim supq→∞

supz>0|n−1

∑16i6n

Πm(Xi,q)fz(Xi)− Pnfz| 6 n−1,

so that finally

lim supq→∞

limr→∞

limε→∞

‖Pνε,n,r,q − Pn‖F 6 n−1,

and (8.1) holds. 2

In conclusion, we have proved that provided suppµ∗ contains an unboundedsequence of clustering points tn ∈ (0,∞), then (8.1) holds, and therefore (4.2)holds. This condition is much stronger than that obtained with Theorem 4.1,and requires much more work. But the construction of µn is quite explicit. Thereader may notice that if t0 ∈ (0,∞) is a clustering point of suppµ∗, Schwartz’s(1943) version of the Sasz–Muntz theorem shows that the set of functions A :=x 7→ e−tnx :n > 1, where tn is a sequence in suppµ∗ which converges to t0, isdense in Lp(dx)-spaces. Therefore, A can be used to approximate any functionthat approximate δx as a distribution. But our proof is needed since we need akind of uniformity w.r.t. the class F in the approximation.

We also notice that in the approximation step (after (8.11)) the bound for∆k,r(hz , x) is obtained in bounding the derivative of hz , which is mainly what wedid in many examples in Section 7 (see the comment after the end of Example 5).

An interesting aspect of the approach of this section is also that the mea-sures P+

νε,n,r,q are natural candidate to start a numerical optimization procedureto calculate µn more accurately, and they are rather explicit.

One of the main differences between Examples 2 and 3 is that in Example3 we can use analyticity of the functions in H in order to obtain an inversionformula for P acting on F . We have not been able to obtain a result similar toProposition 7.1 in using Theorem 8.1, but the following definition may enlightenboth Proposition 7.1 and the difference between Examples 2 and 3.

DEFINITION 8.1. The family P is locally F-complete at C ⊂ Θ if and only iffor any open set V ⊃ C, the map P(·)|V is injective over F .

To some extent, local F-completeness is related to the concept of local oper-ator or differential operator (Gallot et al., 1993). Indeed, if P−1 is a differentialoperator, it is a local operator. In this case, to calculate P−1(h)(x), it is enough toknow the function h in the neighbourhood V of some point θ = θ(x). Therefore,

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.65

Page 66: Statistical Analysis of Mixtures and the Empirical Probability Measure

318 PHILIPPE BARBE

if two functions hi = Pfi, i = 1, 2 coincide on V , it does not imply that f1 = f2on the whole space, and so, if F is rich enough, P is not locally F-complete atθ. Roughly, one should think that P locally F-complete at some θ ∈ Θ meansthat P−1 is not a local operator.

On the other hand, if the functions in F are, say, real analytic, then often sowill be H, and P will be locally F-complete at any θ ∈ Θ. Therefore, one shouldkeep in mind that in Definition 8.1, the structure of both P and F matters.

The reader may also notice that F-completeness at (θ−, θ+) is also closelyrelated to assumption (iv) in Theorem 7.1. Indeed, assumption (iv) in Theorem7.1 implies F-completeness at (θ−, θ+) for location or scale models.

The above discussion is illustrated in the following examples.

EXAMPLE 2 (continued). Here, P−1(h)(x) = xh(x) + h(x) is clearly a localoperator. The family P is not F-complete on any proper subset of R+ (recallthat in this example we take for F the Kolmogorov class). However, thanks toreal analytic properties of the functions in F1, P is locally F1-complete at anypoint θ > 0.

EXAMPLE 3 (continued). Analytical properties of Laplace transform ensure thatP is locally F-complete at any θ ∈ Θ for any class F of functions for whichthe Laplace transform is defined on R+.

One can expect that if P is locally F-complete at some θ ∈ Θ, then (4.2)holds if for instance µ∗ is a measure which is continuous in a neighbourhood ofθ. We have not been able to prove such a general result, which is probably wrongwithout further assumptions. However, Proposition 7.1 goes in this direction whenworking on the real line. Let us just illustrate the use of F-completeness at anypoint.

EXAMPLE 2 (continued). Here we deal with the class of functions F1. Let t0 >0. Since h1,z(θ) = e−θz is real analytic (i.e. P is locally F-complete at anypoint), we have for q = 0, 1,

h(q)1,z(θ) =

∑k>0

(k!)−1(θ − t0)kh(k+q)1,z (t0).

Hence, using (8.12), we infer that

limr→∞

supz>0|xh′1,z(x) + h(x)− x

∑06k6r

(k!)−1(x− t0)kh(k+1)1,z (t0)−

−∑

06k6r(k!)−1(x− t0)kh

(k)1,z (t0)| = 0.

Using (8.14) and the end of the proof of Example 3 in this section, one deducesthat n1/2 infµ ‖Pµ−Pn‖F1 = oPµ∗ (1) as n→∞ if, for instance, suppµ∗ containsan unbounded sequence of clustering points (tn)n>1 ∈ (0,∞).

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.66

Page 67: Statistical Analysis of Mixtures and the Empirical Probability Measure

STATISTICAL ANALYSIS OF MIXTURES AND THE EMPIRICAL PROBABILITY MEASURE 319

9. Further Examples

The purpose of this section is to show how to apply Theorem 8.1. We shallmainly go through a few examples labelled as in Section 7. Hence, Example12 is new. We also label Equations (9.x.y) to refer to equation y relative toExample x. We refer to Section 8 for Examples 2 and 3. Example 4 is quitetrivial and is omitted. Few changes of variables show that Examples 5, 6 and8 are quite similar to Example 3 since they involve Laplace transform. So weomit them in this section. For Example 10, we have not been able to obtainan inversion formula using only real-valued functions, and so we are unable toapply Theorem 7.1.

EXAMPLE 7 (continued). An inversion formula for the operator

Pf(θ) =1√

2πθσ

∫exp

(−(x−m− θβ)2

2θ2σ2

)f(x) dx

is given in Section 7. Conditions under which (7.7.1) holds are fairly standard(for instance, if f is continuous bounded), and we can easily obtain uniformityover a class of functions in (7.7.1). Thus, we shall not concentrate on this aspect,and just assume that

limn→∞

lim supε→0

supf∈F

n1/2∣∣∣∣n−1

∑16i6n

Ψε(f)(Xi)− Pnf∣∣∣∣ = 0 a.s. (9.7.1)

PROPOSITION 9.7.1. Let Pθ = N (m+θβ, θ2σ2), Θ = R, where β 6= 0. Assumethat F is a Pµ∗-Donsker class of functions with envelope F , such that (9.7.1)holds and∫

etx2F (x) dx <∞, for any t ∈ R. (9.7.2)

If ∞ is a clustering point of suppµ∗, then (7.1) holds.Proof. Under (9.7.1), it suffices to show the existence of measures νη,x µ∗

such that

limn→∞

lim supη→0

n1/2 supf∈F

∣∣∣∣n−1∑

16i6nνη,Xi f−n−1

∑16i6n

Ψε(f)(Xi)

∣∣∣∣ = 0. (9.7.3)

Let Ψε,k(·) be the functional defined by

Ψε,k(f)(y) :=1

(2π)1/2ε

∑06i6k

(−1)i

ε2ii!2i

∫(y − x)2if(x) dx.

Then

|Ψε(f)(y)−Ψε,k(f)(y)|

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.67

Page 68: Statistical Analysis of Mixtures and the Empirical Probability Measure

320 PHILIPPE BARBE

61

(2π)1/2ε

∑i>k+1

1ε2ii!2i

∫(y − x)2if(x) dx

6 1(2π)1/2ε

∫exp((y − x)2/2ε2)f(x) dx.

Thus, (9.7.1)–(9.7.2) and Lebesgue’s dominated convergence theorem yield

limk→∞

supf∈F|Ψε(f)(y) −Ψε,k(f)(y)| = 0.

Therefore, it suffices to approximate Ψε,k(f)(y) to obtain (9.7.3). Expanding(y−x)2i with the binomial expansion, it suffices then to approximate the mapping

f 7→ dj

dsjs−1Pf(1/s)

∣∣∣∣s=0

(see (7.7.4)). This can be done by using Lemma A.3 of the appendix, since ∞is a clustering point of suppµ∗ (recall the change of variable θ = 1/s). 2

EXAMPLE 11. The operator P is

Pf(θ) = Z(θ)−1∑k>0

akθkf(k), 0 6 θ < R. (9.11.1)

and its inverse is given in (7.11.2)

PROPOSITION 9.11.1. Let Pθ as in (9.11.1) and let F = 1· = k : k ∈ N.For (8.1) to hold, it suffices that µ∗ admits a clustering point θ0 ∈ [0, R) andthat there exists t > θ0 such that for any k ∈ N

limm→∞

m supaztz−kzk+m : z ∈ N, z > k +m = 0. (9.11.2)

Proof. We use Theorem 7.1 with Remark 7.1, and approximate the Diracmeasure δk by a mixture of Pθ’s w.r.t. a signed measure as follows.

Recall the inversion formula

f(k) = (akk!)−1 dk

dθk(Z(θ)Pf(θ))

∣∣∣∣θ=0

.

Since any function in F is less than 1, Z(θ)Pf(θ) is real analytic on [0, R).Therefore,

dk

dθkZ(θ)Pf(θ) =

∑i>0

(i!)−1(θ − θ0)idi+k

dθi+k(Z(θ)Pf(θ))

∣∣∣∣θ=θ0

.

It follows that

δkf = f(k) = (akk!)−1∑i>0

(i!)−1(−1)iθi0di+k

dθi+k(Z(θ)Pf(θ))

∣∣∣∣θ=θ0

.

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.68

Page 69: Statistical Analysis of Mixtures and the Empirical Probability Measure

STATISTICAL ANALYSIS OF MIXTURES AND THE EMPIRICAL PROBABILITY MEASURE 321

Define

δk,mf := (akk!)−1∑

06i6m(i!)−1(−1)iθi0

di+k

dθi+k(Z(θ)Pf(θ))

∣∣∣∣θ=θ0

.

We have

δkf − δk,mf = (akk!)−1∑

i>m+1

(i!)−1(−1)iθi0di+k

dθi+k(Z(θ)Pf(θ))

∣∣∣∣θ=θ0

.

For z ∈ N, set fz(·) := 1. = z, so that F = fz : z ∈ N. We have

Z(θ)Pfz(θ) = azθz,

and

di

dθi(Z(θ)Pfz(θ)) = (z)iθ

z−iaz1i 6 z.

Consequently, setting n! = 1 if n 6 0,

|δkfz − δk,mfz| = θz−k0azak

1k!

∑m+16i6z−k

z!i!(z − k − i)!

6 θz−k0azak

1k!

z!

Γ(z−k+1

2

)2 z,

where the last inequality comes from i!(z − k − i)! 6 Γ((z − k + 1)/2)2 for0 6 i 6 z − k.

Since log Γ(z + 1) − 2 log Γ((z − k + 1)/2) = (12) log z + k log z + O(1) as

z →∞, we infer that for some constant c (independent of m),

|δkfz − δk,mfz| 6 cθz−k0azz

k+(1/2)

akk!.

Therefore, using (9.11.2), we have for any k ∈ N,

limn→∞

supz>0|δkfz − δk,mfz| = 0. (9.11.3)

Let Xn,n = maxi6i6nXi, so that Pn =∑

16i6Xn,n n−1Ni,nδi. Let us denote

Pn,m =∑

06i6Xn,n n−1Ni,nδi,m. Equation (9.11.3) yields

limm→∞

‖Pn − Pn,m‖F = 0.

Therefore, to prove that (8.1) holds, it suffices to prove it with Pn,m instead of Pn,and it actually suffices to prove that δk,n itself can be approximated by signed

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.69

Page 70: Statistical Analysis of Mixtures and the Empirical Probability Measure

322 PHILIPPE BARBE

mixtures of Pθ’s where the negative part of the signed measure is absolutelycontinuous w.r.t. µ∗. Just rewrite δk,mf as

δk,mf = (akk!)−1∑

06i6m(−1)i(i!)−1θ0 ×

×∑

06j6k+i

(k + i

j

)Z(k+i−j)(θ0)(Pf)(j)(θ0).

It suffices to approximate all the (Pf)(j)(θ0) by measures νj,ε with ν−j,ε µ∗,for j = 0, 1, . . . ,Xn,n +m. This can be done using Lemma A.3 provided thereexists a neighbourhood U of θ0 such that for any k

maxi6i6k

supf∈F

supy∈U|(Pf)(i)(y)| <∞. (9.11.4)

Since θ0 < R, all the derivatives Z(i)(θ), 1 6 i 6 k, θ in a small neighbourhoodU of θ0 are bounded and (9.11.4) holds if (considering U such that supU 6 t),

max16i6k

supf∈F

supy∈U

∣∣∣∣ di

dθi(Z(θ)Pf(θ))

∣∣∣∣θ=y

∣∣∣∣= max

16i6ksupz∈N

supy∈U|(z)iazyz−k−i1k + i 6 z|

6 max16i6k

supz∈N|ziaztz−k−i1k + i 6 z|

6 max16i6k

supaztz−kzk+i : z > k + iθ−m0

<∞.

But this last inequality follows from (9.11.2) and this concludes the proof ofProposition 9.11. 2

EXAMPLE 12. Let Θ = R+, and Pθ be the p.m. with density

pθ(x) = rθ−r(θ − x)r−110 6 x 6 θ

w.r.t. the Lebesgue measure, since the mapping P is then

Pf(θ) = rθ−r∫ θ

0(θ − x)r−1f(x) dx = Γ(r + 1)θ−rI0

rf(θ), (9.12.1)

where I0r is the Riemann–Liouville fractional integral operator of order r (the

reader may refer to Miller and Ross (1993) for fractional calculus and refer-ences therein). Hence, we shall call Pθ the Riemann–Liouville p.m. (of orderr) with parameter θ. For r = 1, Pθ = U[0,θ] and, therefore, this example is ageneralization of Example 2.

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.70

Page 71: Statistical Analysis of Mixtures and the Empirical Probability Measure

STATISTICAL ANALYSIS OF MIXTURES AND THE EMPIRICAL PROBABILITY MEASURE 323

When r is an integer, it is easy to see that we can obtain the same necessaryand sufficient conditions as in Example 2 in order to have (8.1). Indeed, noticethat in this case (r ∈ N),

Pf(θ)θr/r = f (−r)(θ),

where, for any function f , we set f (0) = f and for any integer s > 1,

f (−s)(θ) =

∫ θ

0f (−s+1)(x) dx.

Consequently, if h = Pf ,

P−1h(x) =dr

dθr(r−1θrh(θ))

∣∣∣θ=x

= r−1∑

06i6r

(r

i

)r!

(r − i)!xr−ih(i)(x) (9.12.2)

and the same proofs as in the one of Example 2 give a necessary and sufficientcondition to have (7.1) with the Kolmogorov distance. Where r is not an integer,formula (9.12.2) is no longer valid. To invert P when r 6∈ N, we use the Laplacetransform operator L and recall that LI0

r (f)(t) = trL(f)(t). Therefore, (9.12.1)yields

L(θrPf(θ)

Γ(r + 1)

)(t) = trL(f)(t), (9.12.3)

and the inverse of P is given formally by

f(x) = L−1(t−rL

(θrPf(θ)

Γ(r + 1)

)(t)

)(x).

It will require some work to check the sufficient condition (8.1).

PROPOSITION 9.12.1. Let Pθ be the Riemann–Liouville p.m. of order r, withparameter θ, and let F = 1· 6 z : z > 0. If suppµ∗ = [0, α] (or [0,∞)),then (8.1) holds.

Proof. We only prove the result when α <∞, the one with α =∞ followsfrom some easy modifications. As in Example 3 in Section 8, let

Πmf(x) =(−1)m−1

(m− 1)!

(m

x

)m(Lf)(m−1)(m/x) = Ef(Γm/x,m).

Then, (8.10) shows that, with fz(x) = 1x 6 z,

lim supm→∞

supz>0

∣∣∣∣n−1∑

16i6nΠmfz(Xi)− Pnfz

∣∣∣∣ 6 n−1. (9.12.4)

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.71

Page 72: Statistical Analysis of Mixtures and the Empirical Probability Measure

324 PHILIPPE BARBE

Using (9.12.3), we also have

Πmf(x) =(−1)m−1

(m− 1)!

(m

x

)m dm−1

dtm−1

(t−rL

(θrPf(θ)

Γ(r + 1)

)(t)

) ∣∣∣∣t=m/x

=(−1)m−1

(m− 1)!

(m

x

)m×

×∑

06i6m−1

(m− 1i

)(r)m−2−i

(m

x

)r−(m−1−i)×

× di

dtiL(θrPf(θ)

Γ(r + 1)

)(t)

∣∣∣∣t=m/x

. (9.12.5)

Let us see what we need to check (8.1). Assume that there is a family of signedmeasures ν(i)

x,ε µ∗ with ‖dν(i)x,ε/dµ∗‖∞ <∞, such that

limε→∞

supz>0

∣∣∣∣ν(i)m/x,εhz −

di

dtiL(θrPf(θ)

Γ(r + 1)

)(t)

∣∣∣∣t=m/x

∣∣∣∣ = 0 (9.12.6)

for any x = X1, . . . ,Xn. Then, we set

νn,m,ε =(−1)m−1

(m− 1)!1n

∑16i6n

m

Xi

∑06j6m−1

(m− 1j

)(r)m−2−j ×

×(m

Xi

)r−(m−1−j)ν

(j)m/Xi,ε

and (9.12.4)–(9.12.6) give

lim supm→∞

lim supε→0

‖Pn − Pνn,m,ε‖F 6 1/n

and (8.1) holds. So, all what we have to do is to construct ν(i)t,ε in (9.12.6) for

t = m/X1, . . . ,m/Xn. Since Xi > 0 a.s., taking m large enough, we cansuppose that t is as large as we need.

Let us write

L(θrPf(θ))(t) =

∫ ∞0

e−θtθrPf(θ) dθ

= A1(h)(t) +A2(h)(t)

with the operators

A1(h)(t) =

∫ α

0e−θtθrh(θ) dθ,

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.72

Page 73: Statistical Analysis of Mixtures and the Empirical Probability Measure

STATISTICAL ANALYSIS OF MIXTURES AND THE EMPIRICAL PROBABILITY MEASURE 325

A2(h)(t) =

∫ ∞α

e−θtθrh(θ) dθ.

To obtain (9.12.6), we have to approximate the operators A1 and A2 and theirderivatives. We approximate A1 first.

Set for k > 1, i > 0,

A(i)1 (h)(t) =

di

dtiA1(h)(t),

A(i)1,k(h)(t) =

∑16j6k

(αj/k)rh(αj/k)

∫ αj/k

α(j−1)/kθi e−θt dθ.

Notice that with hz = Pfz ,

θrhz(θ) = θr − (θ − θ ∧ z)r =

θr if θ 6 zθr − (θ − z)r if θ > z.

Therefore, if θ < z < α,∣∣∣∣ ddθθrhz(θ)

∣∣∣∣ 6 rαr−1,

while if z < θ < α,∣∣∣∣ ddθθrhz(θ)

∣∣∣∣ = |rθr−1 − r(θ − r)r−1| 6 rαr−1.

Consequently, denoting X1,n = min16i6nXi and Xn,n = max16i6nXi, for any0 6 z 6 α and m/Xn,n 6 t 6 m/X1,n,

|A(i)1 (hz)(t)−A(i)

i,k(hz)(t)| 6 rαr−1k−1∑

16j6k

∫ αj/k

α(j−1)/kθi e−θt dθ

= rαr−1k−1∫ α

0θi e−θt dθ

= rαr−1k−1t−i−1∫ αt

0ui e−u du

6 rαr−1k−1t−i−1Γ(i+ 1)

6 rαr−1i!k−1Xi+1n,nm

−i−1

6 rαr+ii!k−1m−i−1. (9.12.7)

Since suppµ∗ = [0, α], all the points αj/k, 1 6 j 6 k are in suppµ∗. We canapproximate the Dirac measure δαj/k, and the mapping h 7→ h(αj/k), by some

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.73

Page 74: Statistical Analysis of Mixtures and the Empirical Probability Measure

326 PHILIPPE BARBE

measures δ∗αj/k,ε µ∗ with ‖dδ∗αj/k,ε/dµ∗‖∞ < ∞ (see Lemma A.3 in theAppendix) such that

supz>0|δ∗αj/k,εhz − δαj/khz| 6 ε. (9.12.8)

Set

ν(i)n,m,k,ε = n−1

∑16l6n

∑16j6k

(αj/k)r∫ αj/k

α(j−1)/kθi e−θm/Xl dθδ∗ε,αj/k

and take k = k(m) such that rαr+ii!/kmi+1 6 1/mn1/2. Then (9.12.7)–(9.12.8)yield

lim supε→0

supz>0|PnA(i)

1 (hz)− ν1,n,m,k,ε(hz)| 6 rαr+ii!/kmi+1 6 1/mn1/2

and we are done with the approximation of the operators A(i)1 , 1 6 i 6 m.

We now turn to the approximation of the operator A2 and its derivatives.Let β < α. Then, for z 6 β < α < θ, the function θrhz(θ) = θr − (θ − z)r

is real analytic. Set gz(θ) := θrhz(θ) and

A(i)2 (h)(t) =

di

dtiA2(h)(t) = (−1)i

∫ ∞α

θi e−θtθrh(θ) dθ,

A(i)2,k(hz)(t) = (−1)i

∫ ∞α

∑06j6k

θi e−θt(j!)−1(θ − β)jg(j)z (β) dθ.

Taylor’s formula yields∣∣∣∣∣∣gz(θ)−∑

16i6k(j!)−1(θ − β)jg(j)

z (β)

∣∣∣∣∣∣=

∣∣∣∣∣∫ θ

β(k!)−1(y − β)kg(k+1)

z (y) dy

∣∣∣∣∣=

∣∣∣∣∣∫ θ

β(k!)−1(y − β)k(r)k(y

r−k − (y − z)r−k) dy

∣∣∣∣∣6 (r)k(k!)(−1)

∫ θ

0ykyr−k dy

= θr+1(r)k(k!)−1.

Consequently, for m/Xn,n 6 t 6 m/X1,n, using that Xn,n 6 α, we have

|A(i)2 (hz)(t)−A(i)

2,k(hz)(t)|

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.74

Page 75: Statistical Analysis of Mixtures and the Empirical Probability Measure

STATISTICAL ANALYSIS OF MIXTURES AND THE EMPIRICAL PROBABILITY MEASURE 327

6∫ ∞α

θi e−θtθr+1 dθ(r)k/k!

=

∫ ∞αt

ur+i+1 e−u du(r)k/k!ti+r+2

6 Γ(r + i+ 2)(r)k(k!)−1t−i−r−2

6 Γ(r + i+ 2)(r)k(k!)−1(α/m)i+r+2. (9.12.9)

The bound (9.12.9) shows that instead of approximating the operators A(i)2 (h)(t)

by signed measures, we can approximate A(i)2,k(hz)(t) in the range z 6 α. This

will for do the job also z > α since θrhz(θ) does not depend on z when 0 6θ 6 α 6 z.

Notice that

A(i)2,k(hz)(t) = (−1)i

∑06j6k

g(j)z (β)cj,i,α(t),

where

cj,i,α(t) := (j!)−1∫ ∞α

θi(θ − β)j e−θt dθ.

Thus, all what we have to do is to approximate the operator hz 7→ g(j)z (β), i.e.

hz 7→dj

dθjθrhz(θ)

∣∣∣∣∣θ=β

=∑

06l6j

(j

l

)(r)j−l−1β

r−l+jh(l)z (β),

which is done if we approximate the operators h 7→ h(l)(β). These last operatorsare approximated using Lemma A.3, noticing that β is a clustering point ofsuppµ∗ ∩ [0, β] ⊂ [0, α] and this ends the proof of Proposition 9.12.1.

Let us mention that the coefficient cj,i,α(t) is explicit,

cj,i,α(t) =∑

06k6j

(k

j

)(−β)j−kt−i−k−1

∑06l6i+k

(i+ k

l

)(αt)i+k+l(−1)ll!.

2

10. Further Remarks and Open Questions

The various examples studied in Sections 7 and 9 show that our results are applic-able. It is, however, clear that the main problem is checking whether the familyPν spans the reproducing kernel Hilbert space associated to the Gaussian processB, or, equivalently, if the Dirac measure δx is in some sense in the closure of allthe Pν’s, where ν varies in all M(Θ). This general question goes much beyondany statistical applications, and the author is not aware of any algorithm that

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.75

Page 76: Statistical Analysis of Mixtures and the Empirical Probability Measure

328 PHILIPPE BARBE

allows us to easily check this fact. The papers by Beurling (1955) and Bercoviciand Foias (1984) even suggest that this may be a major problem.

It should be mentioned that our results yield results for other estimation meth-ods. For instance on the real line, Choi and Bulgren (1968) proposed an estimatorof µ∗,

µn := arg infµ

∫(Fµ(x)− Fn(x))2 dFn(x).

This estimator is easily computable using quadratic programming. Notice that ifµn is our estimator defined when F generates the Kolmogorov distance, then

n infµ

∫(Fµ − Fn)2 dFn

6∫n‖Fµn − Fn‖2

∞ dFn(x) = n‖Fµn − Fn‖2∞. (10.1)

Therefore, if infµ ‖Fµ−Fn‖∞ = oPµ∗ (n−1/2), the estimator Fµn and Fn behave

in a quite similar way. One can find sufficient conditions for (10.1) to tends to0 as n→ ∞ using similar arguments as ours, and our general abstract point ofview could be carried over to study (10.1) more precisely.

In some cases we obtained necessary and sufficient conditions for having

infµ‖Pµ − Pn‖F = oPµ∗ (n

−1/2).

However, in order to avoid technicalities we sometimes enforce some restrictionsto obtain the necessity part. For instance, Proposition 7.1 is rather general, butdoes not give the best conditions when applied in Examples 5–11. This is mainlya problem in analysis, namely to describe the closure of some set of functions.An open question is actually to obtain necessary and sufficient conditions in thoseexamples.

Concerning the measure µn, our results are given in the space (l∞(H), ‖·‖H)in Section 5. It would be interesting to turn them into results in other spaces.For instance, in Example 3 we cannot obtain a central limit-type theorem on thec.d.f. pertaining to µn, but only on its Laplace transform. The author will showin a later work how the point of view of Section 8 yields results on µn in ratherarbitrary spaces. It is unclear at this stage what generality can be achieved and iffully applicable results (i.e. without uncheckable technical assumptions) can beobtained.

Also, the investigation of the mapping Π in various cases would be of interestin order to understand better what are the features of the original distribution Pµ∗that are captured by the limiting distribution Π(µ∗). For instance, our couplingtechnique allows us to give sufficient conditions to have Π(µ1) = Π(µ2) but doesnot provide any necessary conditions. We believe that our sufficient conditionsare actually quite sharp, but we are unable to prove any worth while statementin that direction.

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.76

Page 77: Statistical Analysis of Mixtures and the Empirical Probability Measure

STATISTICAL ANALYSIS OF MIXTURES AND THE EMPIRICAL PROBABILITY MEASURE 329

We would be also very interested in knowing the range of Π(·). Here is, forinstance, one question which may be easier, that would be worth studying andthat we have not been able to solve. Rewriting (3.4), we see that

Π(µ∗) = inf

‖ν −Gµ∗‖H :

∥∥∥∥∥dν−

dµ∗

∥∥∥∥∥∞<∞, ν(Θ) = 0, ν ∈M(Θ)

(here, we denoteGµ∗ to point out that we are uder Pµ∗). On the real line (X = R),if F = 1· 6 z : z ∈ R is the Kolmogorov class, and if all the Pµ’s arecontinuous, we can construct the processes Gµ such that ‖Gµ‖F = ‖Bµ‖F doesnot depends on µ. Indeed, notice that if B is a standard Brownian bridge,

‖Bµ‖F = ‖B Fµ(·)‖∞ = ‖B‖∞,

where Fµ is the c.d.f. of Pµ. If µ = δx, the process Gµ is quite difficultto approximate by measures ν with ν− = αδx, α > 0 (this is the condi-tion ‖dν−/dµ‖∞ < ∞). Therefore, we can expect that for any x ∈ suppµ,Π(µ) 6st Π(δx). This would be very interesting for applications. It tells us thatif we tabulate Π(δx), then we can obtain conservative tests for the hypothesis thatthe data comes from a mixture of Pθ’s. More generally, it would be interesting toknow the p.m.’s in the range of Π(·) which are extremal for the stochastic partialorder. We have not been able to obtain a nice result in that direction, except inthe special case of mixture of uniforms. We give our result now, for the sake ofcompleteness. Its proof can be recast in an abstract setting, but the assumptionsneeded by the author do not make an abstract exposition worthwhile.

EXAMPLE 2 (continued). Our next result is even stronger than the conjecturementioned above. It is useful to test that a p.m. on R+ is unimodal (with mode0), since it shows that the critical value of the test statistics infµ ‖Pµ − Pn‖Fcalculated under the uniform distribution leads to conservative tests, even fora finite sample size. This partially solves a conjecture raised by Hartigan andHartigan (1985).

We denote Pµ,n the empirical p.m. of n r.v.’s i.i.d. with common p.m. Pµ.

PROPOSITION 10.1. For any p.m. µ∗ and any n > 1,

infµ‖Pµ − Pµ∗,n‖F 6st inf

µ‖Pµ − Pδ1,n‖F .

Consequently, Π(µ∗) 6st Π(δ1).Proof. It is more convenient to use c.d.f.’s to prove Proposition 10.1, and we

shall denote Fµ as the c.d.f. of Pµ, and Fµ,n as that of Pµ,n. Thus, Fδ1,n is theempirical c.d.f. of a sample of size n uniformly distributed over [0, 1]. Notice

that Fδ1,n Fµd=Fµ,n. Let µ ∈M1(Θ) and let Kµ = Fµ Fµ∗ . Then

‖Kµ − Fδ1,n Fµ∗‖∞ 6 ‖Fµ − Fδ1,n‖∞.

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.77

Page 78: Statistical Analysis of Mixtures and the Empirical Probability Measure

330 PHILIPPE BARBE

If sup suppµ 6 1, then Kµ is a unimodal d.f. (it is nondecreasing, concave, withKµ(0) = 1−Kµ(∞) = 0), and so infν ‖Fν − Fδ1,n Fµ∗‖∞ 6 ‖Fµ − Fδ1,n‖∞.

Next, consider a p.m. µ with sup suppµ > 1. Then Kµ is still nondecreasing,concave, with Kµ(0) = 0, but now Kµ(∞) = Fµ(1) < 1. Notice that Fδ1,n Fµ∗(x) = 1 for any x > xn = F←µ∗ F←δ1,n

(1) <∞. Set Kµ = Kµ if x 6 xn and

Kµ linear on [xn,∞) with slope

K ′µ,d = limh↓0

h−1(Kµ(xn + h)−Kµ(x)).

Set Kµ := Kµ ∧ 1. Then Kµ is a unimodal c.d.f. on R+ with mode 0, and

‖Kµ − Fδ1,n Fµ∗‖∞ = ‖Kµ − Fδ1,n Fµ∗‖∞ 6 ‖Fµ − Fδ1,n‖∞,

so that

infν‖Fν − Fδ1,n Fµ∗‖∞ 6 ‖Fµ − Fδ1,n‖∞.

Since µ is arbitrary, the result follows. 2

Concerning the mapping P, it would be very interesting to have some kindof general results concerning inversion formulas. But any general result seemsdifficult to reach. In case of convolution-type mixtures, Laplace transform is anefficient tool to obtain sufficient conditions for (8.1) to hold. Indeed if Pf(θ) =∫g(θ − x)f(x) dx, then formally

f = L−1(L(Pf)

L(g)

)= L−1(Lf).

We have seen that it is quite easy to construct the error in the approximationf −Πm(f). Thus, within our framework, the main point is then to approximatethe Laplace transform LPf . This is quite easy if suppµ∗ = R+ but we have notbeen able to obtain general sharp results with this approach.

Appendix

In this appendix, we prove some analytical results that we use in our proofs.Although we have not found these results in the literature, we do not claim thatthey are original.

Our first lemma deals with approximation of p.m.’s.

LEMMA A.1. Let µ and ν be two p.m.’s on R+ with suppµ ⊂ suppν. Then, thereexists a family of p.m.’s (νε)ε>0 such that νε ν and ‖dνε/dν‖∞ <∞ and νεconverges weakly* to µ as ε→ 0. Furthermore, for H = θ 7→ 1∨ z/θ : z > 0,we have limε→∞ ‖νε − µ‖H = 0, provided µ 6= δ0.

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.78

Page 79: Statistical Analysis of Mixtures and the Empirical Probability Measure

STATISTICAL ANALYSIS OF MIXTURES AND THE EMPIRICAL PROBABILITY MEASURE 331

Proof. Let us first assume that supp ν is compact in R, and up to a translation,let 0 = inf supp ν, and a := 1 + sup suppµ. Let ε > 0 and Ik,ε = [a(k −1)/m, ak/m], k = 1, 2, . . . ,m−1, wherem = [1/ε]. LetKε := k : ν(Ik,ε) > 0and define the p.m.’s νε by

dνεdν

(x) :=∑k∈Kε

µ(Ik,ε)

ν(Ik,ε)1x ∈ Ik,ε.

If f is a continuous function on R with bounded first derivative, using that

infx∈Ik,ε

f(x) 6 ν(Ik,ε)−1∫Ik,ε

f(x) dν(x) 6 supx∈Ik,ε

f(x),

we obtain∣∣∣∣∫ fd(νε − µ)

∣∣∣∣ =

∣∣∣∣∣∣∑k∈Ik,ε

µ(Ik,ε)ν(Ik,ε)−1∫Ik,ε

f dν −∫Ik,ε

f dµ

∣∣∣∣∣∣6 2

∑k∈Ik,ε

sup|f(x)− f(y)| : x, y ∈ Ik,εµ(Ik,ε)

6 2‖f ′‖∞a/[1/ε].

Consequently, νε converges weakly* to µ.If ν does not have a compact a support, approximate it first by p.m.’s with

compact support, say νi (with support in [0, i], for instance) for which we canconsider families νiε as previously, and use a diagonal argument.

Using the notation of Section 2, let Pθ = U[0,θ] and let F = 1[0,z] : z ∈ R+.Then H = PF . Since νε converges weakly* to µ as ε → 0, Pνε convergesweakly* to Pµ. But Pµ is continuous if µ 6= δ0, and therefore, ‖Pνε −Pµ‖F → 0as ε→ 0, i.e. ‖νε − µ‖H → 0 as ε→ 0. 2

LEMMA A.2. Let µ be a p.m. on R and let 0 < ε < 1. If x is not isolated insuppµ, there exists a bounded function θ 7→ gε(θ, x) such that

(i) supp gε(., x) ⊂ [x− ε, x+ ε],(ii)

∫gε(θ, x) dµ(θ) = 0,

(iii) |∫θgε(θ, x) dµ(θ)− 1| 6 ε.

Proof. Since x is not isolated, there exists two points x1 < x2 in (suppµ ∩[x− ε/2, x+ ε/2])\x.

Set α := (x2 − x1)2/2. Since xi ∈ suppµ, i = 1, 2, we have µ[xi − αi, xi +αi] > 0 for any αi > 0, i = 1, 2. Let for αi 6 α,

gi(θ, x) = 1[xi−αi,xi+αi](θ)/µ[xi − αi, xi + αi]

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.79

Page 80: Statistical Analysis of Mixtures and the Empirical Probability Measure

332 PHILIPPE BARBE

and, further, let

gε(θ, x) :=g2(θ, x)− g1(θ, x)

x2 − x1.

Clearly, (i) and (ii) hold for ε2/2 6 ε, i.e. ε 6 2. Moreover,

(x2 − x1)

∫θgε(θ, x) dµ(θ) =

∫θg2(θ, x) dµ(θ)−

∫θg1(θ, x) dµ(θ).

Therefore, we have the bounds

(x2 − x1)

∫θgε(θ, x) dµ(θ) 6 x2 + α2 − (x1 − α1) 6 x2 − x1 + 2α

and

(x2 − x1)

∫θgε(θ, x) dµ(θ) > x2 − α2 − (x1 + α1) > x2 − x1 − 2α.

Consequently, (iii) holds since∣∣∣∣∫ θgε(θ, x) dµ(θ)− 1∣∣∣∣ 6 2α/(x2 − x1) = x2 − x1 6 ε.

2

LEMMA A.3. Let µ be a p.m. on R and x a clustering point in suppµ. Let Hbe a class of k+ 1-time continuously differentiable functions such that for someopen set U with x ∈ U ,

max06i6k+1

suph∈H

supy∈U|h(i)(y)| <∞.

Then, for any ε > 0, there exists functions θ 7→ gε,i(x, θ) with support includedin [x− ε, x+ ε], bounded, and such that

max06i6k

suph∈H

∣∣∣∣∫ h(θ)gε,i(x, θ) dµ(θ)− h(i)(x)

∣∣∣∣ 6 ε.Proof. If x0 < x1 < · · · < xk, denote ν[x0, . . . , xk] the signed measure

defined by

ν[x0, . . . , xk] := k!∑

06i6k

δxi∏j 6=i(xi − xj)

,

with, for k = 0, ν[x0] = δx0 .Since x ∈ suppµ is a clustering point, let xn ∈ suppµ\x such that

limn→∞ xn = x. Up to extracting a subsequence, we can assume that all the xn’sare distinct. Thus, denote x0

n < x1n < · · · < xkn the tuple (xn, xn+1, . . . , xn+k)

ordered in increasing order. Using formulas 25.1.5 and 25.1.10 in Abramovitz and

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.80

Page 81: Statistical Analysis of Mixtures and the Empirical Probability Measure

STATISTICAL ANALYSIS OF MIXTURES AND THE EMPIRICAL PROBABILITY MEASURE 333

Stegun (1965), we have ν[x0n, x

1n, . . . , x

in]h = h(i)(ζin) for some x0

n 6 ζin 6 xin.For n large enough, all the xn’s are in U and therefore

suph∈H|ν[x0

n, . . . , xin]h− h(i)(x)| 6 (xin − x0

n) suph∈H

supy∈U|h(i+1)(y)|.

Therefore, denoting

ηn := max06i6k

suph∈H|ν[x0

n, . . . , xin]h− h(i)(x)|,

we have limn→∞ ηn = 0.Next, let δn := min16i6k(xin − xi−1

n ). For δ < δn/2, let

g(n)i (θ, x, δ) := i!

∑06j6i

1[xjn−δ,xjn+δ]

µ[xjn − δ, xjn + δ]

1∏16l6il 6=j

(xj − xl).

Then, for h ∈ H,∣∣∣∣∫ h(θ)g(n)i (θ, x, δ)− ν[x0

n, . . . , xin]h

∣∣∣∣= i!

∣∣∣∣∣∣∣∑

06j6i

∫ xjn+δ

xjn−δ(h(θ)− h(xjn))gi(θ, x, δ) dµ(θ)

µ[xjn − δ, xjn + δ]

1∏16l6il 6=j

(xj − xl)

∣∣∣∣∣∣∣6 i!

∑06j6i

δ‖h′‖U∏16l6il 6=j

(xj − xl).

6 i!δδ1−in ‖h′‖U ,

where ‖g‖U := sup|g(x)| : x ∈ U is the sup-norm over U . Consequently, ifM := max06i6k+1 suph∈H ‖h(i)‖U , we have

suph∈H

∣∣∣∣∫ h(θ)g(n)i (θ, x, δkn) dµ(θ)− h(i)(x)

∣∣∣∣6 (xkn − x0

n)M + k!δnM.

Take δ = δkn/2 say. For ε > 0, let nε such that (xknε −x0nε)M + k!δnεM 6 ε. Set

gε,i(θ, x) := g(nε)i (θ, x, δknε). 2

Our next lemma is a version of the Muntz theorem which is used in Section 8.We define the set

S :=

f ∈ C[0, 1] : f(0) = 0, f a.c.,

∫ 1

0f 2(t) dt <∞

.

Recall that a Muntz measure is a measure on R which support admits a a sequenceof distinct points (θi)i>1 such that the series

∑i>1 θi/(1 + θ2

i ) diverges.

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.81

Page 82: Statistical Analysis of Mixtures and the Empirical Probability Measure

334 PHILIPPE BARBE

LEMMA A.4. The measure µ∗ is a Muntz measure on R+ if and only if

S ⊂ cl(C[0,1],‖.‖∞)

t ∈ [0, 1] 7→

∫tθ dν(θ) : ν− µ∗

. (A.1)

Proof. Define

S1 := cl(C[0,1],‖·‖∞)

t ∈ [0, 1] 7→

∫tθ dν(θ) : ν− µ∗

,

S2 := cl(C[0,1],‖·‖∞)

t ∈ [0, 1] 7→

∫tθ dν(θ) + cµ∗0 :

]supp ν <∞, supp ν− ⊂ suppµ∗\0, c ∈ R.

Claim. S1 = S2.

Proof of the Claim. We first prove the inclusion S1 ⊂ S2. Since both S1 andS2 are closed, it suffices to show that S2 is dense in S1.

Let ν be a nonnegative measure such that ν µ∗ and f(t) = −∫tθ dν(θ) ∈

S1. Let

νε :=∑i>0

ν(εi, ε(i + 1)]δε(i+1) + ν0δ0

and fε(t) := −∫tθ dνε(θ) ∈ S2. For any k > 0 we have

|f(t)− fε(t)|

6∑i>0

∫(εi,ε(i+1)]

|tθ − tε(i+1)| dν(θ)

6∑i>0

ν(εi, ε(i+ 1)]tεi(1− tε) 6 ν(0, εk] + ν(R+)tεk(1− tε)

6 ν(0, εk] + ν(R+) sup06t61

|tεk(1− tε)| 6 ν(0, εk] + ν(R+)d/k

for some universal constant d (d = maxk log(1 − (k + 1)−1) : k > 1 forinstance). Since k is arbitrary, fε converges uniformly to f as ε → 0 and S2 isdense in S1.

We now prove that S2 ⊂ S1. If µ∗0 > 0, the functions t 7→ −cµ∗0 is inS1. So we can assume from now on that µ∗0 = 0. Let f(t) =

∫tθ dν(θ) ∈ S2

with ν =∑

16i6m νθiδθi . Let νε := ν+ − ν−ε with

ν−ε (A) :=∑

16i6m

µ∗(A ∩ (θi − ε, θi + ε))

µ∗(θi − ε, θi + ε)ν−θi

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.82

Page 83: Statistical Analysis of Mixtures and the Empirical Probability Measure

STATISTICAL ANALYSIS OF MIXTURES AND THE EMPIRICAL PROBABILITY MEASURE 335

and fε(t) :=∫tθ dνε(θ). Then

|f(t)− fε(t)| =∑

16i6m|ν|θi

∣∣∣∣∣tθi − 1µ∗(θi − ε, θi + ε)

∫ θi+ε

θi−εtθ dµ∗(θ)

∣∣∣∣∣6 2

∑16i6m

|ν|θitθi−ε|1− 2t2εθi |

and therefore limε→0 ‖f − fε‖∞ = 0 and the claim is proved.We can now proceed to prove Lemma A.4. Muntz’s theorem asserts (see, e.g.,

Borwein and Ederlyi (1995)) that for distinct θi’s,

C[0, 1] = S := cl(C[0,1],‖.‖∞)span(t 7→ tθi : i > 1 ∪ t 7→ 1)

iff∑i θi/(1 + θ2

i ) = ∞. If µ∗ is a Muntz measure, take θi as in the definitionof a Muntz measure. Then Muntz’s theorem yields S ⊂ cl(C[0,1],‖·‖∞)spant 7→tθi : i > 1. Since t 7→ tθi is a function in S2, we obtain S ⊂ S2 = S1 (the lastequality coming from the claim).

Conversely, assume that S ⊂ S1 = S2. Since S is dense in f ∈ C[0, 1] :f(0) = 0, we have C[0, 1] ⊂ cl(C[0,1],‖.‖∞)span(S2 ∪ t 7→ 1). Hence, fromMuntz’s theorem and the definition of S2, there exists points θi ∈ suppµ∗ suchthat

∑θi/(1 + θ2

i ) =∞ and µ∗ is a Muntz measure. 2

Acknowledgements

The origin of this paper is in Christian Robert’s invitation to the InternationalWorkshop on Mixtures in Aussois (Sept. 95), and the close neighbourhood ofXavier Milhaud. I would like to thank both of them for the discussions we hadwhile I was writing this paper. The DEA lectures of Michel Ledoux in Toulousein 1994 had certainly a decisive influence on the type of results I aimed at inthis paper. I proved the main results when I was visiting Lanh Tran at IndianaUniversity, and it is a real pleasure to thank him for having provided me withan exceptional working environment in September 1994 and during the summerof 1995. Part of this work was completed while I was visiting the School ofMathematics in the Georgia Institute of Technology. Also the Departments ofMathematics at the University of Tennessee in Knoxville, Texas A&M in CollegeStation and University of Iowa in Iowa City are some of the nice places wherethis paper has been completed. I would like also to express my gratitude toNadine and Xavier who hosted me very well in Toulouse in winter 95 where agood part of this paper was written. Thanks are also due to the little girl whogave a nice toy to the little boy and allowed me to complete this work peacefully.

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.83

Page 84: Statistical Analysis of Mixtures and the Empirical Probability Measure

336 PHILIPPE BARBE

References

Abramowitz, M. and Stegun, J. A. (1965), Handbook of Mathematical Functions, Dover, NewYork.

Andrews, D. F. and Mallows, C. L. (1974), Scale mixture of normal distributions, J. Roy. Statist.Soc., Ser. B 36, 99–102.

Andrews, D. W. K. and Pollard, D. (1994), An introduction to functional central limit theoremsfor dependent stochastic processes, Internat. Statist. Rev. 62, 119–132.

Bach, A., Plachky, D. and Thomson, W. (1987), A characterization of identifiability of mixturesof distributions, in: M. L. Puri and P. Revesz (eds), Mathematical Statistics and ProbabilityTheory, Vol. A, D. Reidel, Dordrecht, pp. 15–21.

Bagirov, E. B. (1988), Some remark on mixtures of normal distributions, Theory Probab. Appl.33, 709–710.

Barbe, Ph. and Bertail, P. (1995), The Weighted Bootstrap, Lecture Notes in Statistics, Springer,New York.

Barndorff-Nielsen, O. (1965), Identifiability of finite mixtures of exponential families, J. Math.Anal. Appl. 21, 115–121.

Barndorff-Nielsen, O., Kent, J. and Sørensen, M. (1982), Normal variance-mean mixture and z-distributions, Internat. Statist. Rev. 50, 145–159.

Bercovici, H. and Foias, C. (1984), A real variable restatement of Riemann’s hypothesis, Israel J.Math. 48, 57–68.

Beurling, A. (1955), A closure problem related to the Riemann Zeta-function, Proc. Acad. Sci.U.S.A. 41, 312–314.

Bickel, P. J., Klaassen, C. A., Ritov, Y. and Wellner, J. A. (1993), Efficiency and AdaptativeEstimation for Semiparametric Models, Johns Hopkins University Press, Baltimore.

Birge, L. (1989), The Grenander estimator: a nonasymptotic approach, Ann. Statist. 17, 1532–1549.Blum, J. R. and Sursala, V. (1977), Estimating the parameters of a mixing disribution function,

Ann. Probab. 5, 200–209.Bondesson, L. (1992), Generalized Gamma Convolutions and Related Classes of Distributions,

Lecture Notes in Statistics 76, Springer, New York.Borwein, P. and Ederlyi, T. (1995), Polynomials and Polynomial Inequalities, Springer, New York.Bruni, C. and Koch, G. (1985), Identifiability of continuous mixtures of unknown Gaussian distri-

butions, Ann. Probab. 13, 1341–1357.Carroll, R. J. and Hall, P. (1988), Optimal rates for deconvolving a density, J. Amer. Statist. Assoc.

83, 1184–1186.Chandra, S. (1977), On the mixture of probability distributions, Scand. J. Statist. 4, 105–112.Chen, J. (1995), Optimal rate of convergence for finite mixture models, Ann. Statist. 23, 221–223.Choi, K. and Bulgren, W. B. (1968), An estimation procedure for mixture of distributions, J. Roy.

Statist. Soc., Ser. B, 444–460.Conway, J. B. (1973), Functions of One Complex Variable, Springer, New York.Davis, Ph. J. (1963), Interpolation and Approximation, Blaisdell, New York.Deely, J. J. and Kuse, R. L. (1968), Construction of sequences estimating the mixture distribution,

Ann. Math. Statist. 39, 286–288.Delbaen, F. and Hazendonck, J. (1984), Weighted Markov processes with an application to risk

theory, in: F. de Vylder, M. Govaerts and J. Hazendonck (eds), Proc. NATO Advanced StudyInstitute on Insurance Premium, Louvain, July 18–31, 1983, D. Reidel, Dordrecht, pp. 121–132.

Devroye, L. (1990), Consistent deconvolution in density estimation, Canad. J. Statist. 17, 235–239.Dharmadhikari, S. and Joag-Dev, K. (1988), Unimodality, Convexity, and Applications, Academic

Press, New York.Dvoretski, A., Kiefer, J. and Wolfowitz, J. (1956), Asymptotic minimax charater of the sample

distribution function and of the classical multinomial estimation, Ann. Math. Statist. 27, 642–669.

Dudley, R. M. (1984), A Course on Empirical Processes, in: Ecole d’Ete de Probabilite de SaintFlour, XII-1982, Lecture Notes in Math. 1097, Springer, New York, pp. 1–142.

Dudley, R. M. (1987), Universal Donsker classes and metric entropy, Ann. Probab. 15, 1306–1326.

ACAP1290.tex; 29/01/1998; 13:55; v.7; p.84

Page 85: Statistical Analysis of Mixtures and the Empirical Probability Measure

STATISTICAL ANALYSIS OF MIXTURES AND THE EMPIRICAL PROBABILITY MEASURE 337

Eagleson, G. K. (1975), Martingale convergence to mixtures of infinitely divisible laws, Ann. Probab. 3, 557–562.
Edelman, D. (1988), Estimation of the mixing distribution for a normal mean with applications to the compound decision problem, Ann. Statist. 16, 1602–1622.
Efron, B. and Olshen, R. (1978), How broad is the class of normal scale mixtures?, Ann. Statist. 6, 1159–1164.
Eggermont, P. and LaRiccia, V. N. (1995), Maximum smoothed likelihood density estimation for inverse problems, Ann. Statist. 23, 199–220.
Everitt, B. S. and Hand, D. J. (1981), Finite Mixture Distributions, Chapman and Hall, London.
Fan, J. (1991), On the optimal rates of convergence for nonparametric deconvolution problems, Ann. Statist. 19, 1257–1272.
Farrell, R. H. (1962), Representation of invariant measures, Illinois J. Math. 6, 447–467.
Feller, W. (1971), An Introduction to Probability Theory and its Applications, Vol. 2, Wiley, New York.
Fraser, M. D., Hsu, Y. S. and Walker, J. J. (1981), Identifiability of finite mixture of von Mises distributions, Ann. Statist. 9, 1130–1131.
Gallot, S., Hulin, D. and Lafontaine, J. (1993), Riemannian Geometry, 2nd edn, Springer, Berlin.
Gine, E. and Zinn, J. (1990), Bootstrapping general empirical measures, Ann. Probab. 18, 851–869.
Goldie, C. M. (1967), A class of infinitely divisible random variables, Proc. Camb. Philos. Soc. 63, 1141–1143.
Gradshteyn, I. S. and Ryzhik, I. M. (1980), Table of Integrals, Series and Products, 5th edn, Academic Press, San Diego, Calif.
Grenander, U. (1956), On the theory of mortality measurement, part II, Scand. Actuar. J. 39, 125–153.
Groeneboom, P. (1985), Estimating a monotone density, in: L. M. Le Cam and R. A. Olshen (eds), Proc. Berkeley Conf. in Honor of Jerzy Neyman and Jack Kiefer, Vol. 2, Wadsworth, Belmont, Calif., pp. 539–555.
Gross, L. (1967), Abstract Wiener spaces, in: L. M. Le Cam and J. Neyman (eds), Proc. Fifth Berkeley Symp. Math. Statist. Probab., Vol. II(1), University of California Press, Berkeley, Calif., pp. 31–42.
Hartigan, J. A. and Hartigan, P. M. (1985), The dip test of unimodality, Ann. Statist. 13, 70–80.
Hengartner, N. (1997), Adaptive demixing in Poisson mixture models, Ann. Statist. 25, 917–928.
Herve, M. (1989), Analyticity in Infinite Dimensional Spaces, De Gruyter, Berlin.
Ikeda, N. and Watanabe, S. (1989), Stochastic Differential Equations and Diffusion Processes, 2nd edn, North-Holland/Kodansha.
Jewell, N. P. (1982), Mixtures of exponential distributions, Ann. Statist. 10, 479–484.
Karmarkar, N. (1984), A new polynomial-time algorithm for linear programming, Combinatorica 4, 373–395.
Keilson, J. and Steutel, F. W. (1972), Families of infinitely divisible distributions closed under mixing and convolution, Ann. Math. Statist. 43, 242–250.
Kelker, D. (1971), Infinite divisibility and variance mixtures of the normal distribution, Ann. Math. Statist. 42, 802–808.
Kent, J. T. (1981), Convolution mixtures of infinitely divisible distributions, Math. Proc. Camb. Phil. Soc. 90, 141–153.
Kent, J. T. (1983), Identifiability of finite mixtures for directional data, Ann. Statist. 11, 984–988.
Khachiyan, L. G. (1979), A polynomial algorithm in linear programming, Soviet Math. Dokl. 20, 191–194.
Kiefer, J. and Wolfowitz, J. (1976), Asymptotically minimax estimation of concave and convex distribution functions, Zeit. Wahrsch. Theor. verw. Geb. 34, 73–85.
Kim, J. and Pollard, D. (1990), Cube root asymptotics, Ann. Statist. 18, 191–219.
Khintchine, A. Y. (1938), On unimodal distributions, Izv. Nauchno-Issled. Inst. Mat. Mech. Tomsk. Gos. Univ. 2, 1–7.
Laird, N. (1978), Nonparametric maximum likelihood estimation of a mixing distribution, J. Amer. Statist. Assoc. 73, 805–811.
Lambert, D. and Tierney, L. (1984), Asymptotic properties of maximum likelihood estimates in the mixed Poisson model, Ann. Statist. 12, 1388–1399.
Ledoux, M. (1994), Isoperimetry and Gaussian Analysis, Ecole d'Ete de Probabilites de St. Flour, Lecture Notes in Math., Springer, New York, to appear.
Ledoux, M. (1996), On Talagrand's deviation inequalities for product measures, Preprint.
Lehmann, E. L. (1959), Testing Statistical Hypotheses, Wiley, New York.
Lemdani, M. (1995), Tests dans le cas d'un melange de lois dans des modeles parametriques et non parametriques, PhD thesis, Ecole Polytechnique.
Lindsay, B. G. (1983a), The geometry of mixture likelihood: a general theory, Ann. Statist. 11, 86–94.
Lindsay, B. G. (1983b), Efficiency of the conditional score in a mixture setting, Ann. Statist. 11, 486–497.
Lindsay, B. G. (1995), Mixture Models: Theory, Geometry and Applications, NSF-CBMS Regional Conference Series in Probability and Statistics, Vol. 5, IMS, Hayward, CA.
Lo, A. (1991), Bayesian bootstrap clones and a biometry function, Sankhya Ser. A 53, 320–333.
Luxmann-Ellinghaus, U. (1987), On the identifiability of mixtures of infinitely divisible power series distributions, Statist. Probab. Lett. 5, 375–378.
MacDonald, P. D. M. (1971), Comment on 'An estimation procedure for mixture of distributions' by Choi and Bulgren, J. Roy. Statist. Soc. Ser. B 33, 326–329.
McLachlan, G. J. and Basford, K. E. (1988), Mixture Models: Inference and Applications to Clustering, Marcel Dekker, New York.
Maitra, A. (1977), Integral representations of invariant measures, Trans. Amer. Math. Soc. 229, 209–225.
Mammen, E., Marron, J. S. and Fisher, N. I. (1992), Some asymptotics for multimodality tests based on kernel density estimates, Probab. Theory Related Fields 91, 115–132.
Martin, R. D. and Schwartz, S. C. (1972), On mixture, quasi-mixture and nearly normal random processes, Ann. Math. Statist. 40, 948–967.
Mason, D. M. and Newton, M. A. (1992), A rank statistic approach to the consistency of a general weighted bootstrap, Ann. Statist. 20, 1611–1624.
Massart, P. (1986), Rates of convergence in the central limit theorem for empirical processes, Ann. Inst. Henri Poincare, Probab. Statist. 22, 381–424.
Masse, J. C. (1993), Nonparametric maximum likelihood estimation in a nonlocally compact parameter setting, Technical report 93-24, Departement de Mathematique et de Statistique, Universite Laval, Quebec.
Milhaud, X. and Mounime, S. (1993), A modified maximum likelihood estimator for finite mixture, Preprint.
Miller, K. S. and Ross, B. (1993), An Introduction to the Fractional Calculus and Fractional Differential Equations, Wiley, New York.
Noack, A. (1950), A class of random variables with discrete distributions, Ann. Math. Statist. 21, 127–132.
Olshen, R. A. and Savage, L. J. (1970), Generalized unimodality, J. Appl. Probab. 7, 21–34.
Patil, G. P. and Bildikar, S. (1966), Identifiability of countable mixtures of discrete probability distributions using methods of infinite matrices, Proc. Camb. Philos. Soc. 62, 485–494.
Pearson, K. (1894), Contributions to the mathematical theory of evolution, Philos. Trans. Roy. Soc. London Ser. A 185, 71–110.
Pfanzagl, J. (1988), Consistency of maximum likelihood estimators for certain nonparametric families, in particular mixtures, J. Statist. Plann. Inf. 19, 137–158.
Pollard, D. (1984), Convergence of Stochastic Processes, Springer, New York.
Praestgaard, J. and Wellner, J. A. (1993), Exchangeably weighted bootstraps of the general empirical process, Ann. Probab. 21, 2053–2086.
Prakasa Rao, B. L. S. (1969), Estimating a unimodal density, Sankhya Ser. A 31, 23–36.
Prakasa Rao, B. L. S. (1992), Identifiability in Stochastic Models: Characterization of Probability Distributions, Academic Press, New York.
Redner, R. and Walker, H. F. (1984), Mixture densities, maximum likelihood and the EM algorithm, SIAM Rev. 26, 195–239.
Rennie, R. R. (1972), On the independence of the identifiability of finite multivariate mixture and the identifiability of the marginal mixtures, Sankhya Ser. A 34, 449–452.
Robbins, H. (1948), Mixture of distributions, Ann. Math. Statist. 19, 360–369.
Rootzen, H. (1977), A note on convergence to mixtures of normal distributions, Zeit. Wahrsch. Theor. verw. Geb. 38, 211–216.
Santalo, L. (1984), Integral Geometry and Geometric Probability, Cambridge University Press, Cambridge.
Schwartz, L. (1943), Etude des sommes d'exponentielles reelles, Act. Sci. Ind. 959, Hermann, Paris.
Shanbhag, D. N. and Sreehari, M. (1977), On certain self-decomposable distributions, Zeit. Wahrsch. Theor. verw. Geb. 38, 217–222.
Shorack, G. R. and Wellner, J. A. (1986), Empirical Processes with Applications to Statistics, Wiley, New York.
Sheehy, A. and Wellner, J. A. (1992), Uniform Donsker classes of functions, Ann. Probab. 20, 1983–2030.
Silverman, B. W. (1981), Using kernel density estimates to investigate multimodality, J. Roy. Statist. Soc. Ser. B 43, 97–99.
Simar, L. (1976), Maximum likelihood estimation of a compound Poisson process, Ann. Statist. 4, 1200–1209.
Stefanski, L. A. (1990), Rates of convergence of some estimators in a class of deconvolution problems, Statist. Probab. Lett. 9, 229–235.
Steutel, F. W. (1967), Note on the infinite divisibility of exponential mixtures, Ann. Math. Statist. 38, 1303–1305.
Steutel, F. W. (1968), A class of infinitely divisible mixtures, Ann. Math. Statist. 39, 1153–1157.
Steutel, F. W. (1970), Preservation of infinite divisibility under mixing and related topics, Math. Centre Tracts 33, Amsterdam.
Stroock, D. and Varadhan, S. R. S. (1972), On the support of diffusion processes with applications to the strong maximum principle, in: L. M. Le Cam, J. Neyman and E. L. Scott (eds), Proc. Sixth Berkeley Symp. Math. Statist. Probab., University of California Press, Berkeley, Calif., pp. 333–359.
Szego, G. (1939), Orthogonal Polynomials, AMS Coll. Publ., Amer. Math. Soc., Providence, R.I.
Talagrand, M. (1987), Donsker classes and geometry, Ann. Probab. 15, 1327–1338.
Talagrand, M. (1994), Sharper bounds for Gaussian and empirical processes, Ann. Probab. 22, 28–76.
Talagrand, M. (1995), The Glivenko–Cantelli problem, ten years later, Preprint.
Talagrand, M. (1995), Concentration of measure and isoperimetric inequalities in product spaces, Publ. Math. IHES 81, 73–205.
Tallis, G. M. (1969), The identifiability of mixtures of distributions, J. Appl. Probab. 6, 389–398.
Tallis, G. M. and Chesson, P. (1982), Identifiability of mixtures, J. Austral. Math. Soc. 32, 339–348.
Teicher, H. (1960), On the mixture of distributions, Ann. Math. Statist. 31, 55–77.
Teicher, H. (1961), Identifiability of mixtures, Ann. Math. Statist. 32, 244–248.
Teicher, H. (1963), Identifiability of finite mixtures, Ann. Math. Statist. 34, 1265–1269.
Teicher, H. (1967), Identifiability of mixtures of product measures, Ann. Math. Statist. 38, 1300–1302.
Tierney, L. and Lambert, D. (1984), Asymptotic efficiency of estimators of functionals of mixed distributions, Ann. Statist. 12, 1380–1387.
Thorin, O. (1977), On the infinite divisibility of the lognormal distribution, Scand. Actuar. J., 121–148.
Titterington, D. M., Smith, A. F. M. and Makov, U. E. (1985), Statistical Analysis of Finite Mixture Distributions, Wiley, New York.
Van de Geer, S. (1993), Rates of convergence for the maximum likelihood estimator in mixture models, Technical Report TW 93-09, University of Leiden.
Van de Geer, S. (1994), Asymptotic normality in mixture models, Technical Report TW 94-03, University of Leiden.
Van der Vaart, A. (1991), On differentiable functionals, Ann. Statist. 19, 178–204.
Varadarajan, V. S. (1963), Groups of automorphisms of Borel spaces, Trans. Amer. Math. Soc. 109, 191–220.
Yakowitz, S. J. and Spragins, J. D. (1968), On the identifiability of finite mixtures, Ann. Math. Statist. 39, 209–214.
Zhang, C.-H. (1990), Fourier methods for estimating mixing densities and distributions, Ann. Statist. 18, 806–831.
Zhang, C.-H. (1995), On estimating mixing densities in discrete exponential family models, Ann. Statist. 23, 929–945.
