
Asymptotics in Statistics

Lecture Notes for Stat522B

Jiahua Chen
Department of Statistics

University of British Columbia


Course Outline

A number of asymptotic results in statistics will be presented: concepts of stochastic order, the classical law of large numbers and central limit theorem, and the large sample behaviour of the empirical distribution and sample quantiles.

Prerequisite: Stat 460/560 or permission of the instructor.

Topics:

• Review of probability theory, probability inequalities.

• Modes of convergence, stochastic order, laws of large numbers.

• Results on asymptotic normality.

• Empirical distribution, moments and quantiles

• Smoothing method

• Asymptotic Results in Finite Mixture Models

Assessment: Students will be expected to work on 20 assignment problems plus a research report on a topic of their own choice.


Contents

1 Brief preparation in probability theory
  1.1 Measure and measurable space
  1.2 Probability measure and random variables
  1.3 Conditional expectation
  1.4 Independence
  1.5 Assignment problems

2 Fundamentals in Asymptotic Theory
  2.1 Mode of convergence
  2.2 Uniform strong law of large numbers
  2.3 Convergence in distribution
  2.4 Central limit theorem
  2.5 Big and small o, Slutsky's theorem
  2.6 Asymptotic normality for functions of random variables
  2.7 Sum of random number of random variables
  2.8 Assignment problems

3 Empirical distributions, moments and quantiles
  3.1 Properties of sample moments
  3.2 Empirical distribution function
  3.3 Sample quantiles
  3.4 Inequalities on bounded random variables
  3.5 Bahadur's representation

4 Smoothing method
  4.1 Kernel density estimate
    4.1.1 Bias of the kernel density estimator
    4.1.2 Variance of the kernel density estimator
    4.1.3 Asymptotic normality of the kernel density estimator
  4.2 Non-parametric regression analysis
    4.2.1 Kernel regression estimator
    4.2.2 Local polynomial regression estimator
    4.2.3 Asymptotic bias and variance for fixed design
    4.2.4 Bias and variance under random design
  4.3 Assignment problems

5 Asymptotic Results in Finite Mixture Models
  5.1 Finite mixture model
  5.2 Test of homogeneity
  5.3 Binomial mixture example
  5.4 C(α) test
    5.4.1 The generic C(α) test
    5.4.2 C(α) test for homogeneity
    5.4.3 C(α) statistic under NEF-QVF
    5.4.4 Expressions of the C(α) statistics for NEF-VEF mixtures
  5.5 Brute-force likelihood ratio test for homogeneity
    5.5.1 Examples
    5.5.2 The proof of Theorem 5.2


Chapter 1

Brief preparation in probability theory

1.1 Measure and measurable space

Measure theory is motivated by the desire to measure the length, area or volume of subsets in a space Ω under consideration. However, unless Ω is finite, the number of possible subsets of Ω is very large. In most cases, it is not possible to define a measure for all of them so that it has some desirable properties and is consistent with common notions of area and volume.

Consider the one-dimensional Euclidean space R consisting of all real numbers, and suppose that we want to give a length measurement to each subset of R. For an ordinary interval (a, b] with b > a, it is natural to define its length as

µ((a,b]) = b−a,

where µ is the notation for measuring the length of a set. Let I_i = (a_i, b_i] and A = ∪_i I_i, and suppose a_i ≤ b_i < a_{i+1} for all i = 1,2,.... It is natural to require µ to have the property that

µ(A) = ∑_{i=1}^∞ (b_i − a_i).

That is, we are imposing a rule on measuring the length of the subsets of R.


Naturally, if the lengths of A_i, i = 1,2,... have been defined, we want

µ(∪_{i=1}^∞ A_i) = ∑_{i=1}^∞ µ(A_i),   (1.1)

when the A_i are mutually exclusive.

The above discussion shows that a measure might be introduced by first assigning measurements to simple subsets, and then extended by applying the additive rule (1.1) to assign measurements to more complex subsets. Unfortunately, this procedure often does not extend the domain of the measure to all possible subsets of Ω. Instead, we can identify the maximal collection of subsets to which a measure can be extended. This collection of sets is closed under countable union. The notion of σ-algebra seems to be the result of such a consideration.

Definition 1.1 Let Ω be a space under consideration. A class of subsets F is called a σ-algebra if it satisfies the following three conditions:

(1) The empty set ∅ ∈ F;

(2) If A ∈ F, then A^c ∈ F;

(3) If A_i ∈ F, i = 1,2,..., then their union ∪_{i=1}^∞ A_i ∈ F.

Note that property (3) is only applicable to a countable number of sets. When Ω = R and F contains all intervals, the smallest possible such σ-algebra is called the Borel σ-algebra, and all the sets in it are called Borel sets. We denote the Borel σ-algebra as B. Even though not every subset of the real numbers is a Borel set, statisticians rarely have to consider non-Borel sets in their research. As a side remark, the domain of a measure on R such that µ((a,b]) = b − a can be extended beyond the Borel σ-algebra, for instance to the Lebesgue σ-algebra.

When a space Ω is equipped with a σ-algebra F, we call (Ω, F) a measurable space: it has the potential to be equipped with a measure. A measure is formally defined as a set function on F with some properties.

Definition 1.2 Let (Ω, F) be a measurable space. A set function µ defined on F is a measure if it satisfies the following three properties.

(1) For any A ∈ F, µ(A) ≥ 0;

(2) The empty set ∅ has measure 0;

(3) It is countably additive:

µ(∪_{i=1}^∞ A_i) = ∑_{i=1}^∞ µ(A_i)

when the A_i are mutually exclusive.

We have to restrict the additivity to a countable number of sets. This restriction results in a strange fact in probability theory. If a random variable is continuous, then the probability that this random variable takes any specific real value is zero. At the same time, the chance for it to fall into some interval (which is made up of individual values) can be larger than 0. The definition of a measure disallows adding up probabilities over all the real values in the interval to form the probability of the interval.

In measure theory, the measure of a subset is allowed to be infinity. We assume that ∞ + ∞ = ∞ and so on. If we let µ(A) = ∞ for every non-empty set A, this set function satisfies the conditions for a measure. Such a measure is probably not useful. Even if some sets have infinite measure, we would like to have a sequence of mutually exclusive sets such that every one of them has finite measure and their union covers the whole space. We call this kind of measure σ-finite. Naturally, σ-finite measures have many other mathematical properties that are convenient in applications.

When a space is equipped with a σ-algebra F, the sets in F have the potential to be measured. Hence, we have a measurable space (Ω, F). After a measure ν is actually assigned, we obtain a measure space (Ω, F, ν).

1.2 Probability measure and random variables

To a mathematician, a probability measure P is merely a specific measure: it assigns measure 1 to the whole space. The whole space is now called the sample space, which denotes the set of all possible outcomes of an experiment. Individual possible outcomes are called sample points. For theoretical discussion, a specific experimental setup is redundant in probability theory. In fact, we do not mention the sample space at all.

In statistics, the focus is on functions defined on the sample space Ω, and these functions are called random variables. Let X be a random variable. The desire to compute the probability of {ω : X(ω) ∈ B} for a Borel set B makes it necessary that {ω : X(ω) ∈ B} ∈ F. These considerations motivate the definition of a random variable.

Definition 1.3 A random variable X is a real-valued function on the probability space (Ω, F, P) such that {ω : X(ω) ∈ B} ∈ F for all Borel sets B.

In plain words, random variables are F-measurable functions.

Interestingly, this definition rules out the possibility for X to take infinity as its value, and it implies that the cumulative distribution function defined as

F(x) = P(X ≤ x)

has limit 1 as x → ∞. A one-dimensional function F(x) is a cumulative distribution function of some random variable if and only if

1. lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1.

2. F(x) is a non-decreasing, right continuous function.

Note also that with each random variable defined, we can define a corresponding probability measure P_X on the real line such that

P_X(B) = P(X ∈ B).

We have hence obtained an induced measure on R. At the same time, the collection of sets {X ∈ B}, B ∈ B, is also a σ-algebra. We call it σ(X); it is a sub-σ-algebra of F.

Definition 1.4 Let X be a random variable on a probability space (Ω, F, P). We define σ(X) to be the smallest σ-algebra such that

{X ∈ B} ∈ σ(X)

for all B ∈ B.

It is seen that the sum of two random variables is also a random variable. All commonly used functions of random variables are also random variables. That is, they remain F-measurable.


The rigorous definitions of integration and expectation are involved. Let us assume that for a measurable function f(·) ≥ 0 on a measure space (Ω, F, ν), a simple definition of the integration

∫ f(·) dν

is available. A general function f can be written as f^+ − f^−, the difference between its positive and negative parts. The integration of this function is the difference between two separate integrations,

∫ f dν = ∫ f^+ dν − ∫ f^− dν,

unless we are in the situation of ∞ − ∞. In that case, the integration is said not to exist. The expectation of a function of a random variable X is simply

∫ f(X(·)) dP = ∫ f(·) dP_X,

again banning ∞ − ∞. Note that the integrations on the two sides are with respect to two different measures. The above equality is jokingly called the law of the unconscious statistician. In mathematics, it is called the change-of-variable formula.

The integration taught in undergraduate calculus courses is called Riemann integration. Most properties of Riemann integration remain valid for this measure-theory-based integration. The new integration makes more functions integrable. Under the new definition (even though we did not really give one), it becomes unnecessary to separately define the expectation of continuous random variables and the expectation of discrete random variables. Without a unified definition, commonly accepted formulas such as

E(X +Y ) = E(X)+E(Y )

are unprovable.

The concept of the Radon-Nikodym derivative is hard for many, but it is handy, for example, when we have to work with discrete and continuous random variables on the same platform. Suppose ν and λ are two σ-finite measures on the measurable space (Ω, F). We say λ is dominated by ν if for any F-measurable set A, ν(A) = 0 implies λ(A) = 0. We use the notation λ ≪ ν. Note that this definition depends on the σ-algebra F.

The famous Radon-Nikodym Theorem is as follows.


Theorem 1.1 Let ν and λ be two σ-finite measures on (Ω, F). If λ ≪ ν, then there exists a non-negative F-measurable function f such that

λ(A) = ∫_A f dν

for all A ∈ F. We call f the Radon-Nikodym derivative of λ with respect to ν.

If λ is a probability measure, then f can be chosen to be non-negative, and it is called the density function of λ with respect to ν. The commonly referred-to probability density function is the derivative of the cumulative distribution function of an absolutely continuous random variable with respect to the Lebesgue measure. The probability mass function is the derivative of the cumulative distribution function of an integer-valued random variable with respect to the counting measure. One such example is the probability mass function of the Poisson distribution.

Essentially, a measure assigns a non-negative value to every member of the σ-field on which it is defined, and it possesses some properties. The Lebesgue measure assigns to every interval a value equal to its length. The value of every other set in the σ-field is derived from the rules (properties) of a measure. In comparison, under the counting measure, which we often do not explicitly define, each set containing a single integer has measure 1, and any set with a finite number of integers has measure equal to the number of integers it contains.

1.3 Conditional expectation

The concept of expectation is developed as the theoretical average size of a random variable. The word “conditional” has a meaning tied more tightly to probability. When we focus on a subset of the sample space and rescale the probability on this event to 1, we get a conditional probability measure. In elementary probability theory, the conditional expectation is also the average size of a random variable, where we only examine its behaviour when its outcome is in a pre-specified subset of the sample space.

To understand the advanced notion of conditional expectation, we start with an indicator random variable. By taking values 1 and 0 only, an indicator random variable I_A divides the sample space into two pieces: A and A^c. The conditional expectation of a random variable X given I_A = 1 is the average size of X when A occurs. Similarly, the conditional expectation of X given I_A = 0 is the average size of X over A^c. Thus, the random variable I_A partitions Ω into two pieces, and we compute the conditional expectation of X over each piece. We may use a random variable Y to cut the sample space into more pieces and compute conditional expectations over each of them. Consequently, the conditional expectation of X given Y becomes a function: it takes different values on different pieces of the sample space, and the partition is created by the random variable Y.

If the random variable Y is not discrete, it does not partition the sample space neatly into countably many mutually exclusive pieces. Computing the average size of X given the value of Y is then hard to imagine. At this moment, we may realize that the concept of σ-algebra can be helpful. In fact, with the concept of σ-algebra, we can define the conditional expectation of X without the help of Y.

The conditional expectation is not given by a constructive definition, but by a conceptual requirement on what properties it should have.

Definition 1.5 The conditional expectation of a random variable X given a σ-algebra A, denoted E(X|A), is an A-measurable function such that

∫_A E(X|A) dP = ∫_A X dP

for every A ∈ A.

If Y is a random variable, then we define E(X|Y) as E(X|σ(Y)). It turns out that such a function exists and is practically unique whenever the expectation of X exists. The conditional expectation defined in elementary probability theory does not contradict this definition. In view of this new definition, we must have E{E(X|Y)} = E(X). This formula is true by definition! We regret that the above definition is not too useful for computing E(X|Y) when given two random variables X and Y.

When working with the conditional expectation under measure theory, we should remember that the conditional expectation is a random variable. It is regarded as non-random with respect to the σ-algebra in its conditioning argument. Most formulas in elementary probability theory have their measure-theory versions. For example, we have

E[g(X)h(Y )|Y ] = h(Y )E[g(X)|Y ]

whenever the relevant quantities exist.

The definition of conditional probability can be derived from the conditional expectation. For any event A, we note that P(A) = E{I_A}. Hence, we regard the conditional probability P(A|B) as the value of E{I_A | I_B} when the sample point ω ∈ B. To take it to the extreme, many probabilists advocate forgoing the probability operation altogether.

1.4 Independence

Probability theory becomes a discipline in its own right, rather than a special case of measure theory, largely due to some special notions dear to probabilistic thinking.

Definition 1.6 Let (Ω, F, P) be a probability space. Two events A, B ∈ F are independent if and only if P(AB) = P(A)P(B).

Let F_1 and F_2 be two sub-σ-algebras of F. They are independent if and only if A is independent of B for all A ∈ F_1 and B ∈ F_2.

Let X and Y be two random variables. We say that X and Y are independent if and only if σ(X) and σ(Y) are independent of each other.

Conceptually, when A and B are two independent events, then P(A|B) = P(A) by the definition in elementary probability theory textbooks. Yet one cannot replace P(AB) = P(A)P(B) in the above independence definition by P(A|B) = P(A). It becomes problematic when, for example, P(B) = 0.

Theorem 1.2 Two random variables X and Y are independent if and only if

P(X ≤ x,Y ≤ y) = P(X ≤ x)P(Y ≤ y) (1.2)

for any real numbers x and y.

The generalization to a countable number of random variables can be done easily. A key point is that pairwise independence is not sufficient for full independence.


1.5 Assignment problems

1. Let X be a random variable having the Poisson distribution with mean µ = 1, Y be a random variable having the standard normal distribution, and W be a random variable such that P(W = 0) = P(W = 1) = 0.5.

Assume X, Y and W are independent. Construct a measure ν(·) such that it dominates the probability measure induced by WX + (1 − W)Y.

2. Let the space Ω be the set of all real numbers. Suppose a σ-algebra F contains all half intervals of the form (−∞, x] for all real numbers x. Show that F contains all singleton sets {x}.

3. Let B be the Borel σ-algebra on R and let Y be a random variable. Verify that

σ(Y) = {Y^{−1}(B) : B ∈ B}

is a σ-algebra, where Y^{−1}(B) = {ω : Y(ω) ∈ B}.

4. From the measurability point of view, show that if X and Y are two random variables, then X + Y and XY are also random variables.

Give an example where X/Y is not a random variable if the definition in Section 1.2 is rigorously interpreted.

5. Prove that if F(x) is the cumulative distribution function of some random variable, then

lim_{x→−∞} F(x) = 0,   lim_{x→∞} F(x) = 1.

6. Assume that g(·) is a measurable function and Y is a random variable. Assume both E(Y) and E{g(Y)} exist. Prove or disprove that

E{g(Y)|Y} = g(Y);   E{Y|g(Y)} = Y.

7. Assume all relevant expectations exist. Show that

E[g(X)h(Y )|Y ] = h(Y )E[g(X)|Y ]

provided that both g and h are measurable functions. The equality may be interpreted as valid except on a measurable zero-probability event.


8. Define VAR(X|Y) = E[{X − E(X|Y)}² | Y]. Show that

VAR(X) = E{VAR(X|Y)} + VAR{E(X|Y)}.

9. Prove Theorem 1.2.

10. Prove that if X and Y are independent random variables, and h and g are two measurable functions, then

E[h(X)g(Y)] = E[h(X)]E[g(Y)]

under the assumption that all expectations exist and are finite.

11. Suppose X and Y are jointly normally distributed with means 0, variances 1, and correlation coefficient ρ.

Verify that E(X |Y ) = ρY .

Remark: rigorous proofs of some assignment problems may require knowledge beyond what has been presented in this chapter. It is hard to clearly state what results should be assumed. Hence, we have to leave a big dose of ambiguity here. Nevertheless, these problems show that some commonly accepted results are not self-evident; they are in fact rigorously established somewhere.


Chapter 2

Fundamentals in Asymptotic Theory

Other than a few classical results in mathematical statistics, the exact distributional property of a statistic or other random object is often hard to determine to the last detail. A good approximation to the exact distribution is very useful in investigating the properties of various statistical procedures. In statistical applications, many observations, say n of them, from the same probability model/population are often assumed available. Good approximations are possible when the number of repeated observations is large. A theory developed for the situation where the number of observations is large forms the asymptotic theory.

In asymptotic theory, we work hard to find the limiting distribution of a sequence of random quantities T_n as n → ∞. Such results are sometimes interesting in their own right. In statistical applications, we do not really have a sample size n that increases as time goes on, much less one that increases unboundedly. If so, why should we care about the limit, which is usually attained only when n = ∞? My answer is similar to the rationale for using a tangent line to replace a segment of a smooth curve in mathematical analysis. If f(x) is a smooth function in a neighborhood of x = 0, we have approximately

f (x)≈ f (0)+ f ′(0)x.

While the approximation may never be exact unless x = 0, we are comfortable to claim that if the approximation is precise enough at x = 0.1, it will be precise enough for |x| ≤ 0.1. In asymptotic theory, if the limiting distribution approximates the finite sample distribution when n = 100 well enough, we are confident that when n > 100, the approximation will likely be more accurate. In this situation, we are comfortable using the limiting distribution in place of the exact distribution for statistical inference.

In this chapter, we introduce some classical notions and results on limiting processes.

2.1 Mode of convergence

Let X_1, X_2, ..., X_n, ... be a sequence of random variables defined on a probability space with sample space Ω, σ-algebra F, and probability measure P.

Recall that every random variable is a real-valued function. Thus, a sequence of random variables is also a sequence of functions. At each sample point ω ∈ Ω, we have a sequence of real numbers:

X_1(ω), X_2(ω), ....

For some ω, the limit of the above sequence may exist. For some other ω, the limit may not exist. Let A ⊂ Ω be the set of ω at which the above sequence converges. It can be shown that A is measurable. Let X be a random variable such that for each ω ∈ A, X(ω) = lim_{n→∞} X_n(ω).

Definition 2.1 Convergence almost surely: If P(A) = 1, we say that {X_n}_{n=1}^∞ converges almost surely to X. In notation, X_n →_{a.s.} X.

A minor point is that the limit X is unique only up to a zero-probability event under the conditions in the above definition. If another random variable Y differs from X only on a zero-probability event, then we also have X_n → Y almost surely. Proving the almost sure convergence of a random variable sequence is often hard. A weaker version of convergence is much easier to establish.

Let X and {X_n}_{n=1}^∞ be a random variable and a sequence of random variables defined on a probability space. In the weak version of convergence, we examine the probability of the difference X − X_n being large.

Definition 2.2 Convergence in probability: Suppose that for any δ > 0,

lim_{n→∞} P{|X_n − X| ≥ δ} = 0.

Then we say that X_n converges to X in probability. In notation, X_n →_p X.


Conceptually, the mode of almost sure convergence keeps track of the values of the random variables at the same sample point on and on. It requires the convergence of X_n(ω) at almost all sample points. If you find “almost all sample points” too tricky, simply interpret it as “all sample points” and you are not too far from the truth. The mode of convergence in probability requires that the event on which X_n and X differ by more than a fixed amount shrinks in probability. This event is n-dependent: it is one event when n = 10 and another when n = 11, and so on. In other words, we have a moving target as n evolves when defining convergence in probability. Because of this, convergence in probability does not imply convergence of X_n(ω) for any ω ∈ Ω. The following classical example is a vivid illustration of this point.

Example 2.1 Let Ω = [0,1], the unit interval of real numbers. Let F be the classical Borel σ-algebra on [0,1] and P be the uniform probability measure.

For m = 0,1,2,... and k = 0,1,...,2^m − 1, let

X_{2^m + k}(ω) = 1 when k < 2^m ω ≤ k + 1, and 0 otherwise.

In plain words, we have defined a sequence of random variables made of indicator functions on intervals of shrinking length 2^{−m}. Yet the union of every 2^m such intervals completely covers the sample space [0,1] as k goes from 0 to 2^m − 1.

It is seen that

P(|X_n| > 0) = 2^{−m} ≤ 2/n,

where m is the integer such that 2^m ≤ n < 2^{m+1}. Hence, as n → ∞, P(|X_n| > 0) → 0. This implies X_n → 0 in probability.

At the same time, the sequence

X1(ω),X2(ω),X3(ω), . . .

contains infinitely many 0's and 1's for any ω. Thus no such sequence converges. In other words,

P({ω : X_n(ω) converges}) = 0.

Hence, Xn does not converge to 0 in the mode of “almost surely”.
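As an illustration (a simulation sketch we added, not part of the original notes; the helper function X, the seed and the sample sizes are our own choices), the following Python code checks both claims: P(|X_n| > 0) shrinks roughly like 2/n, while along any fixed sample point ω the path keeps returning to 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def X(n, omega):
    """X_n(omega) from Example 2.1, writing n = 2**m + k with 0 <= k < 2**m."""
    m = int(np.floor(np.log2(n)))
    k = n - 2**m
    return 1.0 if k < 2**m * omega <= k + 1 else 0.0

omegas = rng.uniform(0, 1, size=10000)

# Convergence in probability: P(|X_n| > 0) tends to 0.
for n in [2, 8, 64, 512, 4096]:
    print(n, np.mean([X(n, w) > 0 for w in omegas]))   # roughly 2 ** (-floor(log2(n)))

# No almost sure convergence: along one fixed omega the path keeps hitting 1.
omega = omegas[0]
path = [X(n, omega) for n in range(1, 4097)]
print("last n <= 4096 with X_n = 1:", 1 + max(i for i, v in enumerate(path) if v == 1.0))
```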


Due to the finiteness of the probability measure, if a sequence of random variables X_n converges to X almost surely, then X_n also converges to X in probability. If a sequence of random variables X_n converges to X in probability, X_n does not necessarily converge to X almost surely, as shown by the above example. However, X_n always has a subsequence {X_{n_k}} such that X_{n_k} → X almost surely.

The convergence in moment is another commonly employed concept. It is often not directly applied in statistical theory, but it is sometimes convenient to verify the convergence in moment. The convergence in moment implies the convergence in probability.

Definition 2.3 Convergence in moment: Let r > 0 be a real number. If the rth absolute moment exists for all {X_n}_{n=1}^∞ and X, and

lim_{n→∞} E{|X_n − X|^r} = 0,

then X_n converges to X in the rth moment.

By a well-known inequality in probability theory, we can show that the rth moment convergence implies the sth moment convergence when 0 < s < r. In addition, it also implies the convergence in probability. Such a result can be established easily by using the following inequality.

Markov Inequality: Let X be a random variable such that E{|X|^r} < ∞ for some r > 0. Then for any ε > 0, we have

P(|X| ≥ ε) ≤ E{|X|^r}/ε^r.

PROOF: It is easy to verify that

I(|X| ≥ ε) ≤ |X|^r/ε^r.

Taking expectations results in the inequality to be shown. ♦

When r = 2 and X is replaced by X − µ, the Markov inequality becomes Chebyshev's inequality:

P(|X − µ| ≥ ε) ≤ σ²/ε²,

where µ = E(X) and σ² = Var(X).

Example 2.2 Suppose X_n → X in the rth moment for some r > 0. For any δ > 0, we have

P(|X_n − X| ≥ δ) ≤ E{|X_n − X|^r}/δ^r.

The right-hand side converges to zero as n → ∞ because of the moment convergence. Thus, we have shown that X_n → X in probability.

The reverse of this result is not true in general. For example, let X be a random variable with the uniform distribution on [0, 1]. Define X_n = X + nI(X < n^{−1}) for n = 1,2,.... It is easy to show that X_n → X almost surely. However, E|X_n − X| = 1, which does not converge to zero. Hence, X_n does not converge to X in the first moment. If, however, there exists a nonrandom constant M such that P(|X_n| < M) = 1 for all n and X_n → X in probability, then X_n → X in the rth moment for all r > 0.
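A quick numerical check of this counterexample (an illustrative sketch, not from the notes; the sample sizes and seed are arbitrary): the probability that X_n and X differ by more than a fixed amount vanishes, while E|X_n − X| stays at 1.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=200000)        # X ~ Uniform(0, 1)

for n in [10, 100, 1000, 10000]:
    Xn = X + n * (X < 1.0 / n)            # X_n = X + n * I(X < 1/n)
    diff = np.abs(Xn - X)
    print(n, np.mean(diff > 0.5), np.mean(diff))   # P(|X_n - X| > 0.5) -> 0, E|X_n - X| stays near 1
```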

A typical tool for proving almost sure convergence is the Borel-Cantelli Lemma.

Lemma 2.1 Borel-Cantelli Lemma: If {A_n, n ≥ 1} is a sequence of events for which ∑_{n=1}^∞ P(A_n) < ∞, then

P(A_n occur infinitely often) = 0.

The event {A_n occur infinitely often} contains all sample points that are members of infinitely many of the A_n's. We will write i.o. for “infinitely often”. The fact that X_n → X almost surely is equivalent to

P(|X_n − X| ≥ ε, i.o.) = 0

for all ε > 0. In view of the Borel-Cantelli Lemma, if

∑_{n=1}^∞ P(|X_n − X| ≥ ε) < ∞

for all ε > 0, then X_n → X almost surely.

Let X_1, X_2, ..., X_n, ... be a sequence of independent and identically distributed (i.i.d.) random variables such that their second moment exists. Let µ = E(X_1) and σ² = Var(X_1). Let X̄_n = n^{−1} ∑_{i=1}^n X_i, so that {X̄_n} is a sequence of random variables too. By Chebyshev's inequality,

P(|X̄_n − µ| ≥ ε) ≤ σ²/(nε²)

for any given ε > 0. As n → ∞, the probability converges to 0. Hence, we have shown X̄_n → µ in probability. Note that we may view µ as a random variable with a degenerate distribution.

The same proof can be used to establish the almost sure convergence if the 4th moment of X_1 exists. In fact, the existence of the first moment is sufficient to establish the almost sure convergence of the sample mean of i.i.d. random variables. The elementary proof under the first moment assumption only is long and complex. We present the following without proof.

Theorem 2.1 Law of Large Numbers: Let X_1, X_2, ..., X_n, ... be a sequence of independent and identically distributed (i.i.d.) random variables.

(a) If nP(|X_1| > n) → 0, then

X̄_n − c_n → 0

in probability, where c_n = E{X_1 I(|X_1| ≤ n)}.

(b) If E|X_1| < ∞, then

X̄_n − E(X_1) → 0

almost surely.
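The following simulation sketch (not part of the notes; the Pareto example, seed and constants are our own choices) illustrates part (b): the distribution has a finite mean but infinite variance, yet the sample mean still settles down to E(X_1).

```python
import numpy as np

rng = np.random.default_rng(2)

# Pareto with shape alpha = 1.5 on [1, infinity): mean alpha/(alpha-1) = 3, infinite variance.
alpha, n = 1.5, 10**6
X = (1.0 - rng.uniform(size=n)) ** (-1.0 / alpha)

partial_means = np.cumsum(X) / np.arange(1, n + 1)
for m in [10**2, 10**3, 10**4, 10**5, 10**6]:
    print(m, partial_means[m - 1])        # drifts toward E(X_1) = 3
```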

The existence of the first moment of a random variable is closely related to how fast P(|X| > n) goes to zero as n → ∞. Here we give an interesting inequality and a related result.

Let X be a positive random variable with finite expectation. That is, assume P(X ≥ 0) = 1 and E{X} < ∞. Then we have

E{X} = ∑_{n=0}^∞ E{X I(n < X ≤ n+1)}.

Since

n I(n < X ≤ n+1) ≤ X I(n < X ≤ n+1) ≤ (n+1) I(n < X ≤ n+1)

for all n = 0,1,..., we get

∑_{n=0}^∞ n P(n < X ≤ n+1) ≤ E{X} ≤ ∑_{n=0}^∞ (n+1) P(n < X ≤ n+1).


Let q_n = P(X > n) so that P(n < X ≤ n+1) = q_n − q_{n+1}. We then have

∑_{n=0}^∞ n P(n < X ≤ n+1) = ∑_{n=0}^∞ q_{n+1}.

Consequently, if E{X} < ∞, then

∑_{n=0}^∞ q_{n+1} = ∑_{n=0}^∞ P(X > n+1) < ∞.

If X_1, X_2, ..., X_n, ... is a sequence of random variables with the same distribution as X, then we have

∑_{n=0}^∞ P(X_n > n+1) < ∞.

By the Borel-Cantelli Lemma, X_n ≤ n+1 for all but finitely many n, almost surely.

2.2 Uniform strong law of large numbers

In many statistical problems, we must work with i.i.d. random variables indexed by some parameters. For each given parameter value, the (strong) law of large numbers is applicable. However, we are often interested in the large sample properties of quantities derived from the sum of such random variables. These properties can often be obtained based on the uniform convergence, with probability one, of these functions. Rubin (1956) gives a sufficient condition for such uniform convergence which is particularly simple to use.

Let X_1, X_2, ..., X_n, ... be a sequence of i.i.d. random variables taking values in an arbitrary space X. Let g(x;θ) be a measurable function in x for each θ ∈ Θ. Suppose further that Θ is a compact parameter space.

Theorem 2.2 Suppose there exists a function H(·) such that E{H(X)} < ∞ and |g(x;θ)| ≤ H(x) for all θ ∈ Θ, and that the parameter space Θ is compact. In addition, there exist sets A_j, j = 1,2,... such that

P(X_i ∈ ∪_{j=1}^∞ A_j) = 1

and g(x;θ) is continuous in θ uniformly in x ∈ A_j for each j. Then, almost surely and uniformly in θ ∈ Θ,

n^{−1} ∑_{i=1}^n g(X_i;θ) → E{g(X_1;θ)},

and E{g(X_1;θ)} is a continuous function in θ.

Proof: We may define B_k = ∪_{j=1}^k A_j for k = 1,2,.... Note that B_k is monotone increasing. The theorem condition implies that P(X ∈ B_k) → 1 as k → ∞ and therefore

H(X) 1(X ∈ B_k^c) → 0

almost surely, where X is a random variable with the same distribution as X_1. By the dominated convergence theorem, the condition E{H(X)} < ∞ leads to

E{H(X) 1(X ∈ B_k^c)} → 0

as k → ∞. We now take note of

sup_θ |n^{−1} ∑_{i=1}^n g(X_i;θ) − E{g(X;θ)}|
  ≤ sup_θ |n^{−1} ∑_{i=1}^n g(X_i;θ) 1(X_i ∈ B_k) − E{g(X;θ) 1(X ∈ B_k)}|
    + sup_θ |n^{−1} ∑_{i=1}^n g(X_i;θ) 1(X_i ∉ B_k) − E{g(X;θ) 1(X ∉ B_k)}|.

The second term is bounded by

n^{−1} ∑_{i=1}^n H(X_i) 1(X_i ∉ B_k) + E{H(X) 1(X ∈ B_k^c)} → 2 E{H(X) 1(X ∈ B_k^c)},

which is arbitrarily small almost surely. Because H(X) dominates g(X;θ), these results show that the proof of the theorem can be carried out as if X ∈ B_k for some large enough k.

In other words, we need only prove this theorem when g(x;θ) is simply equicontinuous over x. Under this condition, for any ε > 0, there exist a finite number of θ values, θ_1, θ_2, ..., θ_m, such that

sup_{θ∈Θ} min_j |g(x;θ) − g(x;θ_j)| < ε/2.

This also implies

sup_{θ∈Θ} min_j |E{g(X;θ)} − E{g(X;θ_j)}| < ε/2.

Next, we easily observe that

sup_θ |n^{−1} ∑_{i=1}^n g(X_i;θ) − E{g(X;θ)}| ≤ max_{1≤j≤m} |n^{−1} ∑_{i=1}^n g(X_i;θ_j) − E{g(X;θ_j)}| + ε.

The first term goes to 0 almost surely by the conventional strong law of large numbers, and ε is an arbitrarily small positive number. The conclusion is therefore true.
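To see the uniform convergence in action, here is a small simulation sketch (not from the notes; the choice g(x;θ) = exp(−θx) with X ~ Exp(1) and Θ = [0,1], for which E{g(X;θ)} = 1/(1+θ), is our own). The supremum over a fine grid of θ shrinks as n grows.

```python
import numpy as np

rng = np.random.default_rng(3)

thetas = np.linspace(0.0, 1.0, 101)               # grid over the compact set Theta = [0, 1]
Eg = 1.0 / (1.0 + thetas)                         # E g(X; theta) for X ~ Exp(1)

for n in [100, 1000, 10000, 100000]:
    X = rng.exponential(size=n)
    gbar = np.exp(-np.outer(thetas, X)).mean(axis=1)   # n^{-1} sum_i g(X_i; theta)
    print(n, np.max(np.abs(gbar - Eg)))                # sup_theta |average - expectation|
```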

2.3 Convergence in distribution

The concept of convergence in distribution is different from the modes of convergence given in the last section.

Definition 2.4 Convergence in distribution: Let X_1, X_2, ..., X_n, ... be a sequence of random variables, and X be another random variable. If

P(X_n ≤ x_0) → P(X ≤ x_0)

for all x_0 such that F(x) = P(X ≤ x) is continuous at x = x_0, then we say that X_n → X in distribution. We may also denote it as X_n →_d X.

The convergence in distribution does not depend on the probability space. Thus, we may instead discuss a sequence of distribution functions F_n(x) and F(x). If F_n(x) → F(x) at all continuity points of F(x), then F_n converges to F(x) in distribution. We sometimes mix up the random variables and their distribution functions. When we state that X_n converges to F(x), the claim is the same as: the distribution of X_n converges to F(x).

It could happen that F_n(x) converges at each x, but the limit, say F(x), does not have properties such as lim_{x→∞} F(x) = 1. In this case, F_n(x) does not converge in distribution although the function sequence converges.


Example 2.3 Let X be a positive random variable and X_n = nX for n = 1,2,.... It is seen that P(X_n < x) → 0 for any finite x. Letting F_n(x) denote the distribution function of X_n, we have F_n(x) → 0 for every x. However, this convergence is not in the mode of “in distribution”.

When F_n → F in distribution, there may not be any corresponding random variables under discussion. It is always possible to construct a probability space and a sequence of random variables {X_n}_{n=1}^∞ and X such that their distribution functions are the same as F_n and F. Furthermore, the construction can be done such that X_n → X almost surely.

Theorem 2.3 Skorohod representation theorem: Suppose F_n → F in distribution. Then there exist a probability space and random variables {X_n}_{n=1}^∞ and X defined on it, with X_n distributed according to F_n and X according to F, such that X_n → X almost surely.

Using this result, one may show that if X_n → X in distribution and g(x) is a continuous function, then g(X_n) → g(X) in distribution.

We end this section by presenting two useful results. (1) X_n → X in distribution if and only if E[g(X_n)] → E[g(X)] for all bounded and continuous functions g. (2) When X_1, X_2, ... are random vectors of finite dimension, then X_n → X in distribution if and only if for any non-random vector a, a^τ X_n → a^τ X in distribution.

Example 2.4 Let X_1, X_2, ..., X_n be a sequence of i.i.d. exponentially distributed random variables. Their common cumulative distribution function is given by

F(x) = 1 − exp(−x)

for x ≥ 0. Let X_(n) = max{X_1, ..., X_n} and X_(1) = min{X_1, ..., X_n}. It is seen that

P(nX_(1) > x) = {exp(−x/n)}^n = exp(−x).

Hence, nX_(1) → X_1 in distribution. On the other hand, we find

P{X_(n) − log n < x} = {1 − n^{−1} exp(−x)}^n → exp(−e^{−x}).

The right-hand side is a cumulative distribution function. Hence, X_(n) − log n converges in distribution to a distribution with cumulative distribution function exp(−e^{−x}). We call it the type I extreme value distribution.
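Both limits are easy to check by simulation. The sketch below (not from the notes; the sample size and number of replications are arbitrary) compares the empirical distributions of nX_(1) and X_(n) − log n with the Exp(1) and type I extreme value (Gumbel) distribution functions.

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 1000, 5000

X = rng.exponential(size=(reps, n))
scaled_min = n * X.min(axis=1)               # exactly Exp(1) for every n
shifted_max = X.max(axis=1) - np.log(n)      # approximately Gumbel

for x in [0.5, 1.0, 2.0]:
    print(x, np.mean(scaled_min <= x), 1 - np.exp(-x),      # empirical vs Exp(1) cdf
          np.mean(shifted_max <= x), np.exp(-np.exp(-x)))   # empirical vs exp(-e^{-x})
```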


2.4 Central limit theorem

The most important example of the convergence in distribution is the classical central limit theorem. It presents an important case when a commonly used statistic is asymptotically normal. The simplest version is as follows. By N(µ, σ²), we mean the normal distribution with mean µ and variance σ².

Theorem 2.4 Classical Central Limit Theorem: Let X_1, X_2, ... be a sequence of i.i.d. random variables. Assume that both µ = E(X_1) and σ² = Var(X_1) exist. Then, as n → ∞,

√n [X̄_n − µ] → N(0, σ²)

in distribution, where X̄_n = n^{−1} ∑_{i=1}^n X_i.
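The theorem is easy to verify numerically. The sketch below (an illustration we added, not part of the notes) standardizes the sample mean of Exp(1) data and compares a few probabilities with their N(0,1) values.

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 200, 20000
mu, sigma = 1.0, 1.0                         # Exp(1) has mean 1 and variance 1

Xbar = rng.exponential(size=(reps, n)).mean(axis=1)
Z = np.sqrt(n) * (Xbar - mu) / sigma         # should be approximately N(0, 1)

print(np.mean(Z <= 1.0))                     # compare with Phi(1) = 0.8413
print(np.mean(np.abs(Z) <= 1.96))            # compare with 0.95
```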

It may appear illogical to some that we start with a sequence of random variables but end up with a normal distribution. As already commented in the last section, we interpret both sides as their corresponding distribution functions.

If the X_n's do not have the same distribution, then having a common mean and variance is not sufficient for the asymptotic normality of the sample mean. A set of nearly necessary and sufficient conditions is the Lindeberg condition. For most applications, we recommend verifying the Liapounov condition.

Theorem 2.5 Central Limit Theorem under the Liapounov condition: Let X_1, X_2, ... be a sequence of independent random variables. Assume that both µ_i = E(X_i) and σ_i² = Var(X_i) exist. Further, assume that for some δ > 0,

∑_{i=1}^n E|X_i − µ_i|^{2+δ} / [∑_{i=1}^n σ_i²]^{1+δ/2} → 0

as n → ∞. Then, as n → ∞,

∑_{i=1}^n (X_i − µ_i) / √(∑_{i=1}^n σ_i²) → N(0, 1)

in distribution.

The central limit theorem for random vectors is established through examining the convergence of a^τ X_n for all possible non-random vectors a.


2.5 Big and small o, Slutsky’s theorem

There are many important statistics that are not straight sums of independent random variables. At the same time, many are also asymptotically normal. Many such results are proved with the help of Slutsky's theorem and the concepts of big and small o's.

Let {a_n} be a sequence of positive numbers and {X_n} be a sequence of random variables. If

X_n/a_n → 0

in probability, we say X_n = o_p(a_n). In general, the definition is meaningful only if a_n is a monotone sequence. If instead, for any given ε > 0, there exist positive constants M and N such that whenever n > N,

P(|X_n/a_n| < M) ≥ 1 − ε,

then we say that X_n = O_p(a_n). In most textbooks, the positiveness of a_n is not required. Not requiring positiveness does not change the essence of the current definition. Sticking to positiveness helps avoid some unintended abuses of these concepts.

We love to compare statistics under investigation to n^{1/2}, n, n^{−1/2}, and so on. If X_n = o_p(n^{−1}), then X_n converges to 0 faster than the rate n^{−1}. If X_n = O_p(n^{−1}), then X_n converges to 0 no slower than the rate n^{−1}. Most importantly, X_n = O_p(n) does not imply that X_n has a size of order n when n is large. Even if X_n = 0 for all n, it is still true that X_n = O_p(n).

Example 2.5 If E|X_n| = o(1), then X_n = o_p(1).

Proof: By the Markov inequality, for any M > 0, we have

P(|X_n| > M) ≤ E|X_n|/M = o(1).

Hence, X_n = o_p(1).

The converse of the above example is clearly wrong.

Example 2.6 Suppose P(X_n = 0) = 1 − n^{−1} and P(X_n = n) = n^{−1}. Then X_n = o_p(n^{−m}) for any fixed m > 0. Yet we do not have E{X_n} = o(1).


While the above example appears in almost all textbooks, it is not unusual to find such a misconception appearing in research papers in some disguised form.

Example 2.7 If X_n = O_p(a_n) and Y_n = O_p(b_n) for two positive sequences of real numbers a_n and b_n, then

(i) X_n + Y_n = O_p(a_n + b_n);

(ii) X_n Y_n = O_p(a_n b_n).

However, neither X_n − Y_n = O_p(a_n − b_n) nor X_n/Y_n = O_p(a_n/b_n) is necessarily true.

Example 2.8 Suppose X_1, ..., X_n is a set of i.i.d. random variables from the Poisson distribution with mean θ. Let X̄_n be the sample mean. Then we have

(1) exp(X̄_n) = exp(θ) + o_p(1);

(2) exp(X̄_n) = exp(θ) + O_p(n^{−1/2}).

Let us first present a simplified version of Slutsky’s Theorem.

Theorem 2.6 Suppose X_n → X in distribution and Y_n = o_p(1). Then X_n + Y_n → X in distribution.

PROOF: Let F_n(x) and F(x) be the cumulative distribution functions of X_n and X. Let x be a continuity point of F(x). For any given ε > 0, according to a result in real analysis, we can always find 0 < ε′ < ε such that F(x) is also continuous at x + ε′.

Since Y_n = o_p(1), for any δ > 0 and ε > 0 there exists an N such that when n > N,

P(|Y_n| ≤ ε) > 1 − δ.

Let ε be chosen such that x + ε is a continuity point of F. Hence, when n > N, we have

P(X_n + Y_n ≤ x) ≤ P(X_n ≤ x + ε) + δ → F(x + ε) + δ.

Since δ can be arbitrarily small, we have shown

limsup P(X_n + Y_n ≤ x) ≤ F(x + ε)

for all ε such that x + ε is a continuity point of F. As indicated earlier, such ε can also be chosen arbitrarily small, so we may let ε → 0. Consequently, by the continuity of F at x, we have

limsup P(X_n + Y_n ≤ x) ≤ F(x).

Similarly, we can show that

liminf P(X_n + Y_n ≤ x) ≥ F(x).

The two inequalities together imply X_n + Y_n → X in distribution. ♦

If F(x) is a continuous function, then we can save a lot of trouble in the above proof.

The simplified Slutsky's theorem I presented above is also referred to as the delta method when it is used as a tool for proving asymptotic results. In a nutshell, it simply states that adding an o_p(1) quantity to a sequence of random variables does not change the limiting distribution.

2.6 Asymptotic normality for functions of random variables

Suppose we already know that for some a_n → ∞, a_n(Y_n − b_n) →_d Y. What do we know about the distribution of g(Y_n) − g(b_n)? The first observation is that if b_n does not have a limit, then even if g is a smooth function, the difference is still far from determined. In general, g(Y_n) − g(b_n) depends on the slope of g near b_n. Hence we only consider the case where b_n is a constant that does not depend on n.

Theorem 2.7 Assume that a_n(Y_n − µ) → Y in distribution, a_n → ∞, and g(·) is continuously differentiable in a neighborhood of µ. Then

a_n[g(Y_n) − g(µ)] → g′(µ)Y

in distribution.

Proof: Using the mean value theorem,

a_n[g(Y_n) − g(µ)] = g′(ξ_n)[a_n(Y_n − µ)],


for some value ξ_n between Y_n and µ. Since a_n → ∞, we must have Y_n → µ in probability. Hence we also have ξ_n →_p µ. Consequently, the continuous differentiability of g at µ implies

g′(ξ_n) − g′(µ) = o_p(1)

and

a_n[g(Y_n) − g(µ)] = g′(µ)[a_n(Y_n − µ)] + o_p(1).

The result follows from Slutsky's theorem. ♦

The result and the proof are presented for the case where Y_n and Y are one-dimensional. It can easily be generalized to the vector case. When a_n does not converge to any constant, our idea should still apply. It is not wise to declare that the asymptotic approach does not work simply because the conditions of Theorem 2.7 are not satisfied.
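The following sketch (our own illustration, not from the notes) combines Theorem 2.7 with Example 2.8: for Poisson(θ) data and g = exp, √n{exp(X̄_n) − exp(θ)} should have approximate variance {g′(θ)}²θ = e^{2θ}θ, since √n(X̄_n − θ) → N(0, θ).

```python
import numpy as np

rng = np.random.default_rng(6)
theta, n, reps = 2.0, 500, 10000

Xbar = rng.poisson(theta, size=(reps, n)).mean(axis=1)
T = np.sqrt(n) * (np.exp(Xbar) - np.exp(theta))    # a_n [g(Y_n) - g(mu)] with g = exp

print(np.var(T))                                   # compare with
print(np.exp(2 * theta) * theta)                   # g'(theta)^2 * theta = e^{2 theta} * theta
```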

2.7 Sum of random number of random variables

Sometimes we need to work with the sum of a random number of random variables. One such example is the total amount of insurance claims in a month.

Theorem 2.8 Let X_i, i = 1,2,... be i.i.d. random variables with mean µ and variance σ². Let N_i, i = 1,2,... be a sequence of integer-valued random variables which is independent of {X_i, i = 1,2,...}, and P(N_n > M) → 1 for any M as n → ∞. Then

N_n^{−1/2} ∑_{j=1}^{N_n} (X_j − µ) → N(0, σ²)

in distribution.

Proof: For simplicity, assume µ = 0, σ² = 1 and let Y_n = n^{−1/2} ∑_{i=1}^n X_i. The classical central limit theorem implies that for any real value x and any positive constant ε, there exists a constant M such that whenever n ≥ M,

|P(Y_n ≤ x) − Φ(x)| ≤ ε.

From the independence assumption,

P(Y_{N_n} ≤ x) = ∑_{m=1}^∞ P(Y_m ≤ x) P(N_n = m).


Hence,

|P(Y_{N_n} ≤ x) − Φ(x)| = |∑_{m=1}^∞ {P(Y_m ≤ x) − Φ(x)} P(N_n = m)|
  ≤ ∑_{m≥M} |P(Y_m ≤ x) − Φ(x)| P(N_n = m) + P(N_n < M)
  ≤ ε + P(N_n < M) → ε.

From the arbitrariness of the choice of ε, we conclude that P(Y_{N_n} ≤ x) → Φ(x) for all x. Hence, the theorem is proved. ♦
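A simulation sketch of Theorem 2.8 (not part of the notes; we take X_j ~ Exp(1) and N_n ~ Poisson(n), which is independent of the X's and satisfies P(N_n > M) → 1 for every M):

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps = 2000, 10000
mu = 1.0                                     # X_j ~ Exp(1): mean 1, variance 1

T = np.empty(reps)
for r in range(reps):
    N = max(int(rng.poisson(n)), 1)          # random number of terms (guard against N = 0)
    X = rng.exponential(size=N)
    T[r] = (X - mu).sum() / np.sqrt(N)       # N^{-1/2} * sum_{j<=N} (X_j - mu)

print(np.mean(T), np.var(T))                 # approximately 0 and sigma^2 = 1
print(np.mean(np.abs(T) <= 1.96))            # approximately 0.95
```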

2.8 Assignment problems

1. Prove that the set {ω : lim X_n(ω) exists} is measurable.

2. Identify an almost surely convergent subsequence in the context of Example 2.1.

3. Prove the Borel-Cantelli Lemma.

4. Suppose that there exists a nonrandom constant M such that P(|X_n| < M) = 1 for all n and X_n → X in probability. Show that X_n → X in the rth moment for all r > 0.

5. Show that if X_n → X almost surely, then X_n → X in probability.

6. Use the Borel-Cantelli Lemma to show that the sample mean X̄_n of an i.i.d. sample converges to its mean almost surely if E|X_1|^4 < ∞.

7. Show that if X_n → X in distribution and g(x) is a continuous function, then g(X_n) → g(X) in distribution.

Furthermore, give an example of a non-continuous g(x) such that g(X_n) does not converge to g(X) in distribution.

8. Prove that X_n → X in distribution if and only if E[g(X_n)] → E[g(X)] for all bounded and continuous functions g.


9. Let X_1, ..., X_n be an i.i.d. sample from the uniform distribution on [0, 1]. Find the limiting distribution of nX_(1), where X_(1) = min{X_1, ..., X_n}, when n → ∞.

10. Let X_1, ..., X_n be an i.i.d. sample from the standard normal distribution. Find a non-degenerate limiting distribution of a_n(X_(1) − b_n) with appropriate choices of a_n and b_n.

11. Suppose F_n and F are one-dimensional cumulative distribution functions and that F_n →_d F. Show that

sup_x |F_n(x) − F(x)| → 0

as n → ∞ if F(x) is a continuous function.

Give a counter-example when F is not continuous.

12. Suppose F_n and F are absolutely continuous one-dimensional cumulative distribution functions and that F_n →_d F. Let f_n(x) and f(x) be their density functions. Give a counter-example to

∫ |f_n(x) − f(x)| dx → 0

as n → ∞.

Prove that the above limiting conclusion is true when f_n(x) → f(x) at all x.

Are there any similar results for discrete distributions?

13. Suppose that {X_{ni}, i = 1,...,n}_{n=1}^∞ is a sequence of sets of random variables. It is known that

max_{1≤i≤n} P(|X_{ni}| > n^{−2}) → 0

as n → ∞. Does this imply that ∑_{i=1}^n X_{ni} = o_p(n^{−1})? What is the order of max_{1≤i≤n} X_{ni}?

14. Suppose that Xn = Op(n2) and Yn = op(n2). Is it true that Yn/Xn = op(1)?

15. Suppose that Xn = Op(an) and Yn = Op(bn). Prove that XnYn = Op(anbn).


16. Suppose that X_n = O_p(a_n) and Y_n = O_p(b_n). Give a counter-example to X_n − Y_n = O_p(a_n − b_n).

17. Suppose we have a sequence of random variables X_n such that X_n →_d X. Show that X_n = O_p(1).

18. Suppose X_n →_d X and Y_n = 1 + o_p(1). Is it true that X_n/Y_n →_d X?

19. Let {X_i}_{i=1}^∞ be a sequence of i.i.d. random variables. Show that X_n = O_p(1). Is it true that ∑_{i=1}^n X_i = O_p(n)?

20. Assume that a_n(Y_n − µ) → Y in distribution, a_n → ∞, and g(·) is continuously differentiable in a neighborhood of µ.

Suppose g′(µ) = 0 and g′′(x) is continuous and non-zero at x = µ. Obtain a limiting distribution of g(Y_n) − g(µ) under an appropriate scale.


Chapter 3

Empirical distributions, moments and quantiles

Let X, X_1, X_2, ... be i.i.d. random variables. Let m_k = E{X^k} and µ_k = E{(X − m_1)^k} for k = 1,2,.... We may also use the notation µ for m_1 and σ² for µ_2. We call m_k the kth moment and µ_k the kth central moment.

With n i.i.d. observations of X, a corresponding empirical distribution function F_n is constructed by placing at each observation X_i a mass n^{−1}. That is,

F_n(x) = n^{−1} ∑_{i=1}^n 1(X_i ≤ x),   −∞ < x < ∞.

We also call F_n(x) the empirical distribution. It is a natural estimator of the cumulative distribution function of X.

Regard F_n(x) as non-random for the moment. Then if a random variable X* has cumulative distribution F_n(x), we would find its moments are given by

m*_k = n^{−1} ∑_{i=1}^n X_i^k   and   µ*_k = n^{−1} ∑_{i=1}^n (X_i − m*_1)^k.

We denote them as m̂_k and µ̂_k because they are natural estimates of m_k and µ_k. We use X̄_n for the sample mean and S_n² for the sample variance.

Since we have reason to believe that F_n is a good estimator of F, it may also be true that ψ(F_n) will estimate ψ(F) for any reasonable functional ψ. In this chapter, we discuss large sample properties of ψ(F_n).


3.1 Properties of sample moments

Moments of a distribution family are very important parameters. Sample moments provide natural estimates. Many other parameters are functions of moments; therefore, their estimates can be obtained by applying the same functions to the sample moments. This is the so-called method of moments.

If the relevant moments of X_i exist, we can easily show:

1. m̂_k →_{a.s.} m_k.

2. n^{1/2}[m̂_k − m_k] →_d N(0, m_{2k} − m_k²).

3. E(m̂_k) = m_k;  n VAR(m̂_k) = m_{2k} − m_k².
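Conclusion 2 is easy to verify by simulation. In the sketch below (not part of the notes; the distribution, k, and sample sizes are our own choices), X ~ Exp(1) so that m_2 = 2 and m_4 = 24, and the variance of n^{1/2}(m̂_2 − m_2) should be close to m_4 − m_2² = 20.

```python
import numpy as np

rng = np.random.default_rng(10)
k, n, reps = 2, 1000, 5000
m2, m4 = 2.0, 24.0                            # moments of Exp(1): m_k = k!

mhat = (rng.exponential(size=(reps, n)) ** k).mean(axis=1)   # sample kth moment, one per replicate
Z = np.sqrt(n) * (mhat - m2)

print(np.var(Z), m4 - m2**2)                  # both close to 20
print(np.mean(Z <= 0.0))                      # roughly 0.5 under asymptotic normality
```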

Before we work on the central sample moments µ̂_k, let us first define

b_k = n^{−1} ∑_{i=1}^n (X_i − µ)^k,   k = 1,2,....

If we replace X_i by X_i − µ in m̂_k, it becomes b_k. Obviously, b_k → µ_k almost surely for all k when the kth moment of X is finite.

Theorem 3.1 Let µ_k, µ̂_k and so on be defined in the same way as above. Assume that the kth moment of X is finite. Then we have

(a) µ̂_k →_{a.s.} µ_k.

(b) E{µ̂_k} − µ_k = {(1/2)k(k−1)µ_{k−2}µ_2 − kµ_k} n^{−1} + O(n^{−2}).

(c) √n{µ̂_k − µ_k} →_d N(0, σ_k²) when we also have E{X^{2k}} < ∞, where

σ_k² = µ_{2k} − µ_k² − 2kµ_{k−1}µ_{k+1} + k²µ_2µ_{k−1}².

PROOF The proof of conclusion (a) is straightforward.


(b) Without loss of generality, let us assume µ = 0 to make the presentation simpler. It is seen that

µ̂_k = n^{−1} ∑_{i=1}^n (X_i − X̄)^k
    = n^{−1} ∑_{i=1}^n ∑_{j=0}^k C(k,j) (−1)^j X_i^{k−j} X̄^j
    = n^{−1} ∑_{i=1}^n X_i^k + n^{−1} ∑_{i=1}^n ∑_{j=1}^k C(k,j) (−1)^j X_i^{k−j} X̄^j
    = b_k + b_1 ∑_{j=1}^k C(k,j) (−1)^j b_{k−j} b_1^{j−1},

where C(k,j) denotes the binomial coefficient and X̄ = b_1 because µ = 0.

This gives the needed expansion of µ̂_k. For the expectation, note that E{b_k} = µ_k. Thus, we get

E{µ̂_k} − µ_k = ∑_{j=1}^k C(k,j) (−1)^j E{b_1^j b_{k−j}}.

We next study the order of these expectations, E{b_1^j b_{k−j}} for j = 1,2,...,k, term by term.

When j = 1, we have

E{b_1 b_{k−1}} = n^{−2} ∑_{i,l} E{X_i X_l^{k−1}}.

Due to independence and the fact that the X_i's have mean 0, the summand is zero unless i = l, and there are only n such terms. From E{X_i^k} = µ_k, we get

E{b_1 b_{k−1}} = n^{−1} µ_k.

When j = 2, we have

E{b_1² b_{k−2}} = n^{−3} ∑_{i,l,m} E{X_i X_l X_m^{k−2}}.

A term in the above summation has nonzero expectation only if i = l. When i = l, there are two cases, i = l = m and i = l ≠ m, which contribute n and n(n−1) terms to the summation respectively, with corresponding expectations µ_k and µ_2 µ_{k−2}. Hence, we get

E{b_1² b_{k−2}} = n^{−1} µ_2 µ_{k−2} + O(n^{−2}).

When j ≥ 3, we have

E{b_1^j b_{k−j}} = n^{−(j+1)} ∑_{i_1,i_2,...,i_j,l} E{X_{i_1} X_{i_2} ··· X_{i_j} X_l^{k−j}}.

The individual expectations are non-zero only if each of i_1, i_2, ..., i_j is paired up with another index, possibly including l. Hence, for terms with nonzero expectation, there are at most j − 1 different index values among i_1, i_2, ..., i_j, l. The total number of such terms is of order no more than n^{j−1}. Since the expectations E{X_{i_1} X_{i_2} ··· X_{i_j} X_l^{k−j}} are bounded, we must have

E{b_1^j b_{k−j}} = O(n^{−2}).

Combining the calculations for j = 1, 2 and for j ≥ 3, we get the conclusion.

(c) We seek to use Slutsky's theorem in this proof. This amounts to expanding the random quantity into a leading term whose limiting distribution follows from a classical result, plus an o_p(1) term which does not alter the limiting distribution.

Since b_1 = O_p(n^{−1/2}), b_1² = O_p(n^{−1}), and b_{k−j} = O_p(1), we get b_1^j b_{k−j} = O_p(n^{−1}) for j ≥ 2. Consequently, we find

√n{µ̂_k − µ_k} = √n{b_k − µ_k − k b_1 µ_{k−1}} − √n k b_1(b_{k−1} − µ_{k−1}) + O_p(n^{−1/2})
             = √n{b_k − µ_k − k b_1 µ_{k−1}} + O_p(n^{−1/2}).

The last equality results from b_1(b_{k−1} − µ_{k−1}) = O_p(n^{−1}). It is seen that

b_k − µ_k − k b_1 µ_{k−1} = n^{−1} ∑_{i=1}^n {X_i^k − kµ_{k−1}X_i − µ_k},

which is a sum of i.i.d. random variables. It is trivial to verify that E{X_i^k − kµ_{k−1}X_i − µ_k} = 0 and VAR{X_i^k − kµ_{k−1}X_i − µ_k} = µ_{2k} − µ_k² − 2kµ_{k−1}µ_{k+1} + k²µ_2µ_{k−1}². Applying the classical central limit theorem to n^{−1/2} ∑_{i=1}^n {X_i^k − kµ_{k−1}X_i − µ_k}, we get the conclusion. ♦


The same technique can be used to show that E(X̄_n − µ)^k = O(n^{−k/2}) when k is a positive even integer, and that E(X̄_n − µ)^k = O(n^{−(k+1)/2}) when k is a positive odd integer. The second result is, however, not as obvious. Both claims are stated and proved in the following theorem.

Theorem 3.2 Assume that the kth moment of X_1 exists and X̄_n = n^{−1} ∑_{i=1}^n X_i. Then

(a) E(X̄_n − µ)^k = O(n^{−k/2}) when k is a positive even integer, and E(X̄_n − µ)^k = O(n^{−(k+1)/2}) when k is a positive odd integer.

(b) E|X̄_n − µ|^k = O(n^{−k/2}) when k ≥ 2.

PROOF: (a) Without loss of generality, assume µ = 0. The claims are then the same as E(∑_{i=1}^n X_i)^k = O(n^{k/2}), or O(n^{(k−1)/2}) when k is odd. We have the generic expansion

(∑_{i=1}^n X_i)^k = ∑ X_{i_1}^{j_1} ··· X_{i_m}^{j_m},

where the summation is over all combinations with j_1, ..., j_m > 0 and j_1 + ··· + j_m = k.

The expectation of X_{i_1}^{j_1} ··· X_{i_m}^{j_m} equals 0 whenever one of j_1, ..., j_m equals 1. Thus, the terms with nonzero expectation must have m ≤ k/2, or m ≤ (k−1)/2 when k is odd. Since each of i_1, ..., i_m takes at most n values, the total number of nonzero expectation terms is of order no more than n^{k/2}, or n^{(k−1)/2} when k is odd. Since their moments are bounded by a common constant, the claims must be true.

(b) The proof of this result becomes trivial based on the inequality in the next theorem. We omit the actual proof here.


Theorem 3.3 Assume that Y_i, i = 1,2,...,n are independent random variables with E(Y_i) = 0 for all i. Then, for each k > 1,

A_k E{(∑_{i=1}^n Y_i²)^{k/2}} ≤ E|∑_{i=1}^n Y_i|^k ≤ B_k E{(∑_{i=1}^n Y_i²)^{k/2}},

where A_k and B_k are some positive constants not depending on n.

This inequality is attributed to Marcinkiewicz and Zygmund, and a proof can be found in Chow and Teicher (1978, 1st Edition, page 356). Its proof is somewhat involved.

3.2 Empirical distribution function

For each fixed x, F_n(x) is the sample mean of Y_i = I(X_i ≤ x), i = 1,2,...,n. Since the Y_i's are i.i.d. random variables with finite moments of any order, the standard large sample results apply. We can easily claim:

1. F_n(x) →_{a.s.} F(x) for each fixed x, and in any order of moments.

2. √n{F_n(x) − F(x)} →_d N(0, σ²(x)) with σ²(x) = F(x){1 − F(x)}.

3. F_n(x) − F(x) = O_p(n^{−1/2}).

The conclusion 3 is a corollary of conclusion 2. A direct proof can be done byusing Chebyshev’s inequality:

P(√

n|Fn(x)−F(x)|> M)≤ σ2(x)M2

whose right hand side can be made arbitrarily small with a proper choice of M.Recall that if F(x) is continuous, then the convergence of Fn(x) at every x

implies the uniform convergence in x. That is, Dn = supx |Fn(x)−F(x)| convergesto 0 almost surely. The statistic Dn is called the Kolmogorov-Smirnov distanceand it is used for the goodness of fit test. In fact, when F is continuous andunivariate, it is known that

P(Dn > d)≤C exp−2nd2

Page 39: Lecture Notes for Stat522B Jiahua Chen

3.3. SAMPLE QUANTILES 35

for all n and d, and C is an absolute constant. If X is a random vector, this resultremains true with 2 replaced by 2− ε , and C then depends on the dimension andε .

In addition, under same conditions,

limn→∞

P(n1/2Dn ≤ d) = 1−2∞

∑j=1

(−1) j+1 exp(2 j2d2).

We refer to Serfling (1980) for more results.

3.3 Sample quantiles

Let F(x) be a cumulative distribution function. We define, for any 0 < p < 1, itspth quantile as

ξp = F−1(p) = infx : F(x)≥ p.

Intuitively, if ξp is the pth quantile of F(x), we should have F(ξp) = p. Theabove definition clearly does not guarantee its validity. The problem arises whenF(x) jumps at ξp. We can, however, prove the following:

Theorem 3.4 Let F be a distribution function. The function F−1(t), 0 < t < 1, isnondecreasing and left-continuous, and satisfies

1. F−1(F(x))≤ x, −∞ < x < ∞,

2. F(F−1(t))≥ t, 0 < t < 1,

3. F(x)≥ t if and only if x≥ F−1(t).

PROOF: We first show that the inverse is monotone. When t1 < t2, we have

x : F(x)≥ t1 ⊃ x : F(x)≥ t2.

The lowest value in a smaller set is larger than the lowest value in a larger set.Hence,

infx : F(x)≥ t1 ≤ infx : F(x)≥ t2.

which is F−1(t1)≤ F−1(t2) or monotonicity.

Page 40: Lecture Notes for Stat522B Jiahua Chen

36CHAPTER 3. EMPIRICAL DISTRIBUTIONS, MOMENTS AND QUANTILES

To prove the left continuity, let tk∞k=1 be an increasing sequence taking val-

ues between 0 and 1 with limit t0. We hence have F−1(tk) is an increasingsequence with upper bound F−1(t0). Hence, it has a limit. We wish to showF−1(tk)→ F−1(t0). If not, let x ∈ (limF−1(tk),F−1(t0)). This implies

t0 > F(x)≥ tk

for all k. This is not possible when lim tk = t0.

1. By definition, for any y such that F(y) ≥ F(x), we have y ≥ F−1(F(x)).This remains to be true when y = x, hence x≥ F−1(F(x)).

2. For any y > F−1(t), we have F(y) ≥ t by definition. Let y→ F−1(t) fromright, and from the right-continuity, we must have F(F−1(t))≥ t.

3. This is the consequence of (1) and (2). ♦.

With an empirical distribution function Fn(x), we define the empirical pthquantile F−1

n (p) = ξp. What properties does this estimator have?In order for ξp to behave, some conditions on F(x) seem necessary. For ex-

ample, if F(x) is a distribution which place 50% probability each at +1 and −1.The median of F(x) equals −1 by our definition. The median of Fn(x) equals −1whenever less than 50% of observations are equal to−1 and it equals 1 otherwise.The median is not meaningful for this type of distributions. To be able to differ-entiate between ξp and ξp±δ , it is most desirable that F(x) strictly increase overthis range.

Here is the consistency result for ξp. Note that ξp depends on n although thisfact is not explicit in its notation.

Theorem 3.5 Let 0 < p < 1. If ξp is the unique solution x of F(x−)≤ p≤ F(x),then ξp→ ξp almost surely.

Proof: For every ε > 0, by the uniqueness condition and the definition of ξp, wehave

F(ξp− ε)< p < F(ξp + ε).

Page 41: Lecture Notes for Stat522B Jiahua Chen

3.3. SAMPLE QUANTILES 37

It has been shown earlier that Fn(ξp± ε)→ F(ξp± ε) almost surely. Thisimplies that

ξp− ε ≤ ξp ≤ ξp + ε

almost surely. Since the size of ε is arbitrary, we must have ξp→ ξp almost surely.♦.

If you like mathematics, the last sentence in the proof can be made more rig-orous.

Theorem 3.6 Let 0 < p,< 1. If F is differentiable at ξp and F ′(ξp)> 0, then

√nF ′(ξp)[ξp−ξp]

d→ N(0, p(1− p)).

Proof: For any real number x, we have

(√

n(ξp−ξp)≤ x ) = (ξp ≤ ξp +x√n).

By definition of the sample quantile, the above event is the same as the followingevent:

Fn(ξp +x√n)≥ p.

Because F has positive derivative at ξp, we have F(ξp) = p. Thus,

P(√

n[ξp−ξp]≤ x)

= P(

Fn(ξp +x√n)−F(ξp +

x√n)≥ F(ξp)−F(ξp +

x√n))

= P(

Fn(ξp +x√n)−F(ξp +

x√n)≥− x√

nF ′(ξp)+o(

1√n))

= P(√

n[Fn(ξp +x√n)−F(ξp +

x√n)]≥−xF ′(ξp)+o(1)

).

By Slutsky’s theorem, for the sake of deriving the limiting distribution, theterm o(1) can be ignored if the resulting probability has a limit. The resultingprobability has limit as the c.d.f. of N(0, [F ′(ξp)]

2 p(1− p)) by applying the cen-tral limit theorem for double arrays. ♦.

If F(x) is absolutely continuous, then F ′(ξp) = f (ξp) the density function.To be more specific, let p = 0.5 and hence ξ0.5 is the median. Thus, the effi-ciency of the sample median depends on the size of the density at the median.

Page 42: Lecture Notes for Stat522B Jiahua Chen

38CHAPTER 3. EMPIRICAL DISTRIBUTIONS, MOMENTS AND QUANTILES

If F(x) is the standard normal, then f (ξ0.5) =1√2π

. The asymptotic variance is

hence 0.52∗(2π) = π

2 . In comparison, the sample mean has asymptotic variance 1which is smaller. Both mean and median are the same location parameter for nor-mal distribution family. Therefore, the sample mean is a more efficient estimatorfor the location parameter than the sample median. If, however, the distributionunder consideration is double exponential, then the value of the density functionat median is 0.5. Hence the asymptotic variance of the sample median is 1. Atthe same time, the sample mean has asymptotic variance 2. Thus, in this case, thesample median is more efficient.

If we take the more extreme example when F(x) has Cauchy distribution, thenthe sample mean has infinite variance. The sample median is far superior. Forthose who advocate robust estimation, they point out that not only the samplemedian is robust, but also it can be more efficient when the model deviates fromnormality.

3.4 Inequalities on bounded random variables

We often work with bounded random variables. There are many particularly sharpinequalities for the sum of bounded random variables.

Theorem 3.7 (Bernstein Inequality) . Let Xn be a random variable having bi-nomial distribution with parameters n and p. For any ε > 0, we have

P(|1n

Xn− p|> ε)≤ 2exp(−14

nε2).

Proof: We work on the P(1nXn > p+ ε) only.

P(1n

Xn > p+ ε) =n

∑k=m

(nk)pkqn−k

≤n

∑k=m

expλ [k−n(p+ ε)](nk)pkqn−k

≤ exp(−λnε)n

∑k=0

(nk)(peλq)k(qe−λ p)n−k

= e−λnε(peλq +qe−λ p)n

Page 43: Lecture Notes for Stat522B Jiahua Chen

3.4. INEQUALITIES ON BOUNDED RANDOM VARIABLES 39

with q = 1− p, m the smallest integer which is larger than n(p+ ε) and for everypositive constant λ .

It is easy to show ex ≤ x+ ex2for all real number x. With the help of this, we

gete−λnε(peλq +qe−λ p)n ≤ exp(nλ

2−λnε).

By choosing λ = 12ε , we get the conclusion.

The other part can be done similarly and so we get the conclusion. ♦.What we have done is, in fact, making use of the moment generating function.

More skillful application of the same technique can give us even sharper boundwhich is applicable in more general cases. We will state, without a proof, of thesharper bound as follows:

Theorem 3.8 (Hoeffding inequality) Let Y1,Y2, . . . ,Yn be independent random vari-ables satisfying P(a≤ Yi ≤ b) = 1, for each i, where a < b. Then, for every ε > 0and all n,

P(Yn−E(Yn)≥ ε)≤ exp(− 2nε2

(b−a)2 ).

With this inequality, we give a very sharp bound for the size of the samplequantile.

Example 3.1 Let 0 < p < 1. Suppose that ξp is the unique solution of F(x−) ≤p≤ F(x). Then, for every ε > 0 and all n,

P(|ξp−ξp|> ε)≤ 2exp−2nδ2ε

where δε = minF(ξp + ε)− p, p−F(ξp− ε).

Proof: Assignment.The result can be stated in an even stronger way. Recall ξp actually depends

on n. Let us now write it as ξpn.

Corollary 3.1 Under the assumptions of the above theorem, for every ε > 0 andall n,

P(supm≥n|ξpm−ξp|> ε)≤ 2

1−ρε

ρnε .

where ρε = exp(−2δ 2ε ).

Page 44: Lecture Notes for Stat522B Jiahua Chen

40CHAPTER 3. EMPIRICAL DISTRIBUTIONS, MOMENTS AND QUANTILES

Remark:

1. We can choose whatever value for ε , including making it a function of n.For example, we can choose ε = 1√

n .

2. Since the bound works for all n, we can apply it for fixed n as well as forasymptotic analysis.

We now introduce another inequality which is also attributed to Bernstein.

Theorem 3.9 (Bernstain) Let Y1, . . . ,Yn be independent random variables satis-fying P(|Yi−EYi| ≤ m) = 1, for each i, where m is finite. Then, for t > 0,

P(|

n

∑i=1

Yi−n

∑i=1

EYi| ≥ nt)≤ 2exp(− n2t2

2∑ni=1Var(Yi)+

23mnt

), (3.1)

for all positive integer n.

The strength of this inequality is at situations where m is not small but theindividual variances are small.

3.5 Bahadur’s representation

We have seen that the properties of the sample quantiles can be investigatedthrough the empirical distribution function based on iid observations. This is verynatural. It is very ingenious to have guessed that there is a linear relationship be-tween the sample quantile and the sample distribution. In a not so accurate way,Bahadur showed that

F−1n (p)−F−1(p) =Cp[Fn(ξp)−F(ξp)]

for some constant Cp depends on p and F when n is large. Such a result make itvery easy to study the properties of sample quantiles. A key step in proving thisresult is to assess the size of

Fn(ξp + x)−Fn(ξp)−F(ξp + x)−F(ξp)de f= ∆n(x)−∆(x).

Page 45: Lecture Notes for Stat522B Jiahua Chen

3.5. BAHADUR’S REPRESENTATION 41

When x is a fixed constant, not random nor depends on n, we have

P|∆n(x)−∆(x)| ≥ t ≤ 2exp− nt2

2σ2(x)+ 23t

whereσ

2(x) = ∆(x)1−∆(x).

As this is true for all n, we may conclusion tentatively that

∆n(x)−∆(x) = Op(n−1/2).

This result can be improved when x is known to be very small. Assume thatthe c.d.f of X satisfies the conditions

|∆(x)|= |F(ξp + x)−F(ξp)| ≤ c|x|

for all small enough |x|. Let us now choose

x = n−1/2(logn)1/2.

It is therefore true that |∆(x)| ≤ cn−1/2(logn)1/2, and σ2(x) ≤ cn−1/2(logn)1/2.Applying these facts to the same inequality when n is large, we have

P|∆n(x)−∆(x)| ≥ 3c1/2n−3/4(logn)3/4

≤ 2exp− 9cn−1/2(logn)3/2

2cn−1/2(logn)1/2 + 23c1/2n−3/4(logn)3/4

≤ 2exp−4log(n)= 2n−4.

By using Borel-Cantelli Lemma, we have shown

∆n(x)−∆(x) = O(n−3/4(logn)3/4)

for this choice of x, almost surely.Now, we try to upgrade this result so that it is true uniformly for x in a small

region of ξp.

Page 46: Lecture Notes for Stat522B Jiahua Chen

42CHAPTER 3. EMPIRICAL DISTRIBUTIONS, MOMENTS AND QUANTILES

Lemma 3.1 Assume that the density function f (x) ≤ c in a neighborhood of ξp.Let an be a sequence of positive numbers such that an = C0n−1/2(logn)1/2. Wehave

sup|x|≤an

|∆n(x)−∆(x)|= O(n−3/4(logn)3/4)

almost surely.

PROOF: Let us divide the interval [−an,an] into αn = 2n1/4(logn)1/2 equal lengthintervals. We round up if αn is not an integer. Let b0,b±1, . . .b±αn be end pointsof these intervals with b0 = 0. Obviously, the length of each interval is not longerthan C0n−3/4.

Let βn = maxF(bi+1)− F(bi) where the max is taken over the obviousrange. Clearly βn = O(n−3/4).

One key observation is:

sup|x|≤an

|∆n(x)−∆(x)| ≤max |∆n(bi)−∆(bi)|+βn.

However, for each i, we have shown by (3.2) that

P|∆n(bi)−∆(bi)| ≥ 3c1/2n−3/4(logn)3/4 ≤ 2n−4.

Hence, the chance for the maximum to be larger than this quantity is at most αn

times of 2n−4. In addition

∑n

2αnn−4

which is finite. Hence, we have shown

sup|x|≤an

|∆n(x)−∆(x)|= O(n−3/4(logn)3/4)

almost surely. This completes the proof.To apply this result to the sample quantile, we need only show that the sample

quantile will stay close to the target quantile. More precisely, we can show thefollowing.

Lemma 3.2 Let 0 < p < 1. Suppose F is differentiable at ξp with F ′(ξp) =

f (ξp)> 0. Then, almost surely,

|ξp−ξp| ≤2(logn)1/2

f (ξp)n1/2 .

Page 47: Lecture Notes for Stat522B Jiahua Chen

3.5. BAHADUR’S REPRESENTATION 43

With this lemma, we are ready to claim the following.

Corollary 3.2 Under the conditions of Lemma 3.2, we have

Fn(ξp)−Fn(ξp)−F(ξp)−F(ξp)= O(n−3/4(logn)3/4)

almost surely.

Finally, we get the Bahadur’s representation.

Theorem 3.10 Under the conditions of Lemma 3.2, and assume that F is twicedifferentiable at ξp. Then, almost surely,

ξp−ξp = f (ξp)−1p−Fn(ξp)+O(n−3/4(logn)3/4).

PROOF: As F is twice differentiable, we have

F(ξp)−F(ξp) = f (ξp)(ξp−ξp)+O(ξp−ξp2).

From Lemma 3.2, we may replace F by Fn on the left hand side which will re-sulting an error of size n−3/4(logn)3/4. In addition, we know that ξp− ξp2 =

O(n−1(logn)). Therefore, we find

Fn(ξp)−Fn(ξp) = f (ξp)(ξp−ξp)+O(n−3/4(logn)3/4). (3.2)

From the definition of ξp, we know it is either (np)th order statistic with a round-ing up or down by 1 if F is continuous at all x. If so,

|Fn(ξp)− p| ≤ n−1.

Otherwise, ξp is converges to ξp almost surely and the density f (ξp) exists andnone zero. Hence, the same bound applies almost surely. Substituting it into (4.7),we have

p−Fn(ξp) = f (ξp)(ξp−ξp)+O(n−3/4(logn)3/4).

This implies the result of this theorem.There is a usual routine in proving asymptotic normality of a statistic. We

first expand the statistics so that it consists of two terms. The first term is thesum of independent random variables. The second term is a higher order term

Page 48: Lecture Notes for Stat522B Jiahua Chen

44CHAPTER 3. EMPIRICAL DISTRIBUTIONS, MOMENTS AND QUANTILES

compared to the first one. Consequently, the classical central limit theorem isapplied together with the Sluztsky’s theorem.

This idea works in most cases. Badahur’s representation further enhancesthis point of view. It finds such an expansion for a highly non-smooth function.Further, it quantifies the higher order term accurately. In statistics, we do notusually need such a precise expansion. We rarely make use of an almost sureresult. However, this technique is useful.

More general results, history of the development can be found in Serfling(1980).

Problems

1. Let F(x) be a one-dimensional cumulative distribution function such F(x)=0.5 if and only if x1 < x< x2 for some x1 < x2. Let ξ0.5 be the sample medianbased on n iid samples from F(x). Derive the limiting distribution of ξ0.5

when n→ ∞.

2. Show that under the i.i.d. and finite moment assumption, E(Xn − µ)k =

O(n−k/2) when k is a positive even integer; and E(Xn−µ)k = O(n−(k+1)/2)

when k is a positive odd integer.

3. Show that

s0 = log

q(p+ ε)

p(q− ε)

is the minimum point of

g(s) = (q+ pes)n exp−(p+ ε)s

where p+q = 1, 0 < p < 1, 0 < ε < q.

Show that (q

q− ε

)(q−ε)( pp+ ε

)(p+ε)

≤ exp(−2ε2).

4. Let X1,X2, . . . ,Xn be i.i.d. random variables. Calculate

E(Xn−µ)k

for k = 3,4,5 as functions of the moments of X1.

Page 49: Lecture Notes for Stat522B Jiahua Chen

3.5. BAHADUR’S REPRESENTATION 45

5. Let Fn(t) be the empirical distribution function based on uniform[0, 1]i.i.d. random variables. Show that

sup0≤t≤1

|Fn(t)− t|= sup0≤p≤1

|F−1n (p)− p|.

6. Let ξ0.25 and ξ0.75 be empirical 25% and 75% quantiles. Derive the limitingdistribution of √

n[(ξ0.75− ξ0.25)− (ξ0.75−ξ0.25)]

under conditions comparable to the conditions in Theorem 3.10 on the dis-tribution function F . Discuss how can we have the derivative conditions onF weakened?

7. Prove Example 3.1.

8. Prove Lemma 3.2

9. The residual in the Bahadur’s representation has higher order if it is in thesense of “in probability”, not “almost surely”. Show that the order can beimproved to Op(n−3/4(logn)1/2).

Page 50: Lecture Notes for Stat522B Jiahua Chen

46CHAPTER 3. EMPIRICAL DISTRIBUTIONS, MOMENTS AND QUANTILES

Page 51: Lecture Notes for Stat522B Jiahua Chen

Chapter 4

Smoothing method

In some statistical applications, the parameter to be estimated is a function onsome Euclidean space, not a real number or real vector. The number of obser-vations available to estimate the value of this function at any given point is oftenconceptually 0. A simple example is to estimate the density function of an ab-solutely continuous random variable. It is widely believed that, with a referenceyet to be found, there does not exist an unbiased estimator in general for densityfunction. A related problem is on non-parametric regression analysis. In this case,the objective is to make inference on the regression function. In theory, a responsevariable y can often be predicted in terms of a few covariates or design variables x.The regression function is the conditional expectation of Y given X = x. When Xis random and has continuous distribution, the number of observations of Y at ex-actly X = x is 0. One again has to make use of observed responses correspondingto X values in a neighborhood of x. We will work on density estimation first.

4.1 Kernel density estimate

Let F(x) be the common cumulative distribution function of an iid sample on realnumbers. When the density function is smooth, we have

f (x)≈ F(x+h)−F(x−h)2h

47

Page 52: Lecture Notes for Stat522B Jiahua Chen

48 CHAPTER 4. SMOOTHING METHOD

when h is a small number. Since Fn(x) is a good estimator of F(x), we may henceestimate f (x) by

f (x) =Fn(x+h)−Fn(x−h)

2h=

12nh

n

∑i=1

1(|Xi− x| ≤ h)

with a proper choice of small h. It is seen that this estimator is the ratio of theaverage number of observations falling into the interval [x−h,x+h] to the lengthof the interval. When h is large, the average may not reflect the density at x butthe average density over a short interval containing x. Thus, the estimator mayhave large bias. When h is very small, the number of observations in the intervalwill fluctuate from sample to sample more severely, thus the estimator has largevariance. In general, we choose the bandwidth h according to the sample size. Asit will be seen, the basic requirements include h→ 0 and nh→ ∞ as the samplesize n→ ∞. Because of this, most quantities to be discussed are functions of neven if there is no explicit notational indication.

LetK(x) =

12

I(−1 < x < 1)

which is itself a density function and Kh(x) = h−1K(x/h). We can then write

f (x) = n−1n

∑i=1

Kh(Xi− x) =∫

Kh(t− x)dFn(t). (4.1)

Hence, the density estimator is the average value of Kh(Xi− x). The estimatordefined by (4.1) is called the kernel density estimator. We call K the kernel of thisdensity estimator and h the bandwidth.

In this type of estimators, the contribution of Xi toward the density estimate atx is determined by Kh(Xi− x). It is easy to see that we may replace K(x) by anyother density function, and the resulting f (x) is still a sensible density estimator.With the previous choice of K, while observations within ±h neighborhood ofx have equal contribution, the observations out side of this interval contributenothing to the density. It is more reasonable to make K(x) a smoothly decreasingfunction in |x|. Thus, a popular choice of K(x) is the density function of thestandard normal distribution. Although the rest of our discussion can be easilygeneralized to multi-dimensional densities, the presentation is the simplest when

Page 53: Lecture Notes for Stat522B Jiahua Chen

4.1. KERNEL DENSITY ESTIMATE 49

X is a scaler. Unless otherwise specified, we assume X is a scale in the rest of thechapter.

Some basic conditions on the kernel function and the density functions are asfollows.

1. K(x) is a density function.

2.∫

sK(s)ds = 0.

3. µ2 =∫

s2K(s)ds < ∞.

4. R(K) =∫

K2(s)ds < ∞.

5. lims→±∞ K(s) = 0.

6. f (x) is continuously differentiable to the second order and has boundedsecond derivative.

The above conditions can be relaxed to obtain the results we are set to discuss.It is not hard to verify that the density function of the standard normal distributionsatisfies all the conditions listed. Hence, the normal density function can be usedas a kernel function, which results in a kernel density estimator with the propertiesto be presented.

4.1.1 Bias of the kernel density estimator

A simple expression of the bias of the kernel density estimator can be easily ob-tained. Note that due to iid structure of the data,

E[ f (x)] =∫

Kh(t− x) f (t)dt =∫

K(s) f (x+hs)ds.

When f (x) is continuous at x, we have

E[ f (x)]→ f (x)

when h→ 0 and K(x) is a density function.

Page 54: Lecture Notes for Stat522B Jiahua Chen

50 CHAPTER 4. SMOOTHING METHOD

Assume the conditions on the density function f (x) and K(x) are all satisfied.Then

f (x+hs) = f (x)+ f ′(x)hs+h2

2f ′′(ξ )s2

for some ξ between x and x+hs. Under these conditions, we have

E[ f (x)] = f (x)+h2

2

∫s2K(s) f ′′(ξ )ds.

Since f ′′(x) is continuous and bounded, we have∫s2K(s) f ′′(ξ )ds→ µ2 f ′′(x)

as h→ 0. Consequently, we have

E[ f (x)] = f (x)+µ2h2

2f ′′(x)+o(h2).

In conclusion, the bias of a kernel density estimator is typically in the orderof h2 when, for example, the kernel function is chosen to be symmetric, and thedensity function has continuous, bounded second derivative.

We now look into the variance properties of the kernel density estimation.

4.1.2 Variance of the kernel density estimator

Due to the iid structure in the kernel density estimator, we have

VAR( f (x)) = n−1VAR(Kh(X− x))

where X is a generic random variable whose density function is given by f (x). Itis easily verifiable that

VAR(Kh(X− x))≤ E[Kh(X− x)]2.

The interesting part is that the difference is E[Kh(X−x)]2 which equal f (x)+O(h2)2 according to the result in the last section. At the same time, it will beseen that E[K2

h (X − x)] = O(h−1) which tends to infinity as h→ 0. Hence, theleading term of the variance is determined purely by E[K2

h (X− x)].

Page 55: Lecture Notes for Stat522B Jiahua Chen

4.1. KERNEL DENSITY ESTIMATE 51

Similar to the bias computation, we have

E[K2h (X− x)] =

∫K2

h (t− x) f (t)dt

= h−1∫

K2(s) f (x+hs)ds

= h−1 f (x)R(K)ds(1+o(1)).

Hence, we have

VAR( f (x)) = (nh)−1 f (x)R(K)(1+o(1)). (4.2)

It is seen then that the mean squared error of the kernel density estimator is

MSE( f (x)) = (nh)−1 f (x)‖K‖2 +h4µ

22 [ f′′(x)]2/4+o((nh)−1 +h4).

Thus, in order for f (x) to be consistent, a set of necessary conditions on h are

h→ 0, nh→ ∞.

To minimize the order of MSE as n→ ∞, we should choose h = n−1/5.The best choices of h at different x are not the same. Thus, one may instead

search for h such that the integrated MSE is minimized. For this purpose, wefurther assume

∫[ f ′′(x)]2dx<∞ and the integration of the remainder term remains

to the order before the integration. If so, we have the mean integrated squared erroras

MISE = (nh)−1‖K‖2 +h4µ

22

∫[ f ′′(x)]2dx/4+o((nh)−1 +h4). (4.3)

The optimal choice of h is then

hopt =

[‖K‖2

nµ22∫[ f ′′(x)]2dx

]1/5

.

With this h, we have

MISEopt =54µ2

2‖K‖4∫[ f ′′(x)]2dx1/5n−4/5.

Page 56: Lecture Notes for Stat522B Jiahua Chen

52 CHAPTER 4. SMOOTHING METHOD

In this expression, we cannot make∫[ f ′′(x)]2dx change as this comes with the

data. We have some freedom to find a K to minimize the MISE. This is equivalentto minimize

µ22‖K‖4. (4.4)

This quantity does not depend on the choice of the bandwidth h. That is, if thekernel function K minimizes (4.4), Kh also minimizes it as long as h > 0.

The solution to this minimization problem is found to be

K(x) =34(1− x2)+

which is called Epanechnikov kernel. It is found, however, other choices of K donot loss much of the efficiency. For example, choosing normal density functionas the kernel function is 95% efficient. It means that one need about 5% moresample to achieve the same level of precision if the normal density function isused as kernel rather than the optimal Epanechnikov kernel is applied.

In conclusion, the choice of K is not considered very important. The choiceof the bandwidth parameter h is. There are thousands of papers published on thistopic. We do not intend to carry you away here.

4.1.3 Asymptotic normality of the kernel density estimator

In most statistical research papers, we are interested in finding constant sequencesan,bn such that

an[ f (x)− f (x)−bn]d−→ Y

for some non-degenerate random variable Y . Such a result can then be used toconstruct confidence bounds for f (x) or perform hypothesis test. Since f (x) hasan iid structure, this Y must have normal distribution.

Denote Zni = Kh(Xi− x)−E[Kh(Xi− x)], then f (x)−E[ f (x)] = n−1∑

ni=1 Zni

which is asymptotically normal when the Liapunov condition is satisfied. It iseasy to verify that the Liapunov condition is satisfied when for some δ > 0,∫

K2+δ (s)ds < ∞; and f (x)> 0. (4.5)

Page 57: Lecture Notes for Stat522B Jiahua Chen

4.2. NON-PARAMETRIC REGRESSION ANALYSIS 53

Let σ2n =Var(Zn1). Under (4.5) and other conditions specified earlier, we have

1√nσn

n

∑i=1

Znid→

N(0,1).

In general, we prefer to know the limiting distribution of f (x)− f (x) rather thanthat of f (x)−E[ f (x)] after properly scaled. The above result helps if

1√nσn

n[EKh(X− x)− f (x)]

converges to some constant. Recall that σ2n = f (x)R2(K)h−1+o(h−1) and E[Kh(X−

x)− f (x)] = µ2 f ′′(x)h2/2+o(h2). Hence,

1√nσn

n[EKh(X− x)− f (x)] = O(n1/2h1/2h2) = O(n1/2h5/2).

By choosing h = n−1/5, the limiting distribution result becomes

1√nh

[ f (x)− f (x)]→ N(a,σ2)

with a = µ2[ f ′′(x)]/2 and σ2 = f (x)R(K).

4.2 Non-parametric regression analysis

A related problem in statistics is the non-parametric regression analysis. The datain such applications consist of pairs (Xi,Yi), i = 1,2, . . . ,n from some probabilitymodel. It is desirable to use X as predictor to predict the response value Y . Fromprobability theory, E(Y |X) minimizes

E[Y −g(X)]2

among all measurable function of X . In statistical literature, m(X) = E(Y |X) isalso called the regression function of Y with respect to X .

In other applications, x values in the model are selected by a design. Thus,they are not random. A commonly assumed model for the data in this situation is

Yi = m(xi)+ v1/2(xi)εi (4.6)

Page 58: Lecture Notes for Stat522B Jiahua Chen

54 CHAPTER 4. SMOOTHING METHOD

for i = 1,2, . . . ,n, where xi’s are design points and m(x) is the unknown non-parametric regression function, v(x) is the variance function, and εi are randomerror. If v(x) is a constant function, the model is homoscedastic, otherwise, it isheteroscedastic. In both cases, random design or fixed design, our observationsconsist of n pairs of (Yi,Xi), i = 1,2, . . . ,n.

4.2.1 Kernel regression estimator

Intuitively, m(x) = E(Y |X = x) is the average value of Y given X = x. When Xis random and has continuous distribution, the event X = x has probability zero.Thus, in theory, the number of observations of (Xi,Yi) in the sample such thatXi = x for any x is practically zero. It is impossible to estimate m(x) for any givenx unbiasedly. When it is reasonable to believe that m(x) is a continuous functionof x, however, one may collect information in a small neighborhood of x for thepurpose of estimating m(x). Consider such a small neighborhood [x− h,x+ h],the average value of Yi correspond to xi’s in this interval is

n

∑i=1

yi1(|Xi− x| ≤ h)/n

∑i=1

1(|Xi− x| ≤ h) =n

∑i=1

yiKh(Xi− x)/n

∑i=1

Kh(Xi− x)

where K(x) = 121(|x| ≤ 1). Clearly, K(x) can be replaced by any other density

function of x to get a general kernel regression estimator:

m(x) =n

∑i=1

yiKh(Xi− x)/n

∑i=1

Kh(Xi− x). (4.7)

Both K and h play the same roles as in the kernel density estimate.When X is random and has absolutely continuous distribution, we may esti-

mate the joint density function of (X ,Y ) by a kernel density estimator

f (x,y) = n−1n

∑i=1

Kh(yi− y)Kh(xi− x).

In practice, we may choose two different kernel functions K and two unequalbandwidths h. We could also replace K(y)K(x) by a density function of a randomvector with two dimension. The theory we intend to illustrate will not change.Thus, we will only discuss the problem under the above simplistic setting.

Page 59: Lecture Notes for Stat522B Jiahua Chen

4.2. NON-PARAMETRIC REGRESSION ANALYSIS 55

The conditional density function of Y given X = x is naturally estimated by

f (y|x) = f (x,y)/ f (x) (4.8)

where f (x) is the kernel density estimator of f (x) with kernel function K and thesame bandwidth h. It is seen that∫

y f (y|x)dy = m(x) (4.9)

assuming the range of y is (−∞,∞). Otherwise, one can choose the kernel functionK with compact support to ensure the validity of the equality. The above identityis then true for y not on the boundary, and when h is very small.

We have seen that the kernel regression estimator m(x) is well motivated bothfor random and non-random X .

4.2.2 Local polynomial regression estimator

The kernel density estimator can be generalized or be regarded as a special case ofanother method. In any small neighborhood of a point x0, one may fit a polynomialmodel to the data:

m(x) = β0 +β1(x− x0)+ · · ·+βp(x− x0)p

for some integer p ≥ 0. If m(x) is a smooth function, this model is justified byTaylor’s expansion at least for x-values in a small neighborhood of x0. Let N(x0;h)be small neighborhood of x0 indexed by h. One can then estimate m(x) by the leastsum of squares based on data in N(x0,h). For a given p, we search for β0, . . . ,βp

to minimize

∑xi∈N(x0;h)

[Yi−β0 +β1(x− x0)+ · · ·+βp(x− x0)p]2.

Instead of defining a neighborhood directly, one may use a kernel function toreduce the weights of observations at xi which are far away from x0. Employingthe same idea as in the kernel regression estimator, we select a suitable kernelfunction K and replace the above sum of squares by the sum of weighted squares:

n

∑i=1

Kh(xi− x0)[Yi−β0 +β1(x− x0)+ · · ·+βp(x− x0)p]2.

Page 60: Lecture Notes for Stat522B Jiahua Chen

56 CHAPTER 4. SMOOTHING METHOD

We then estimate m(x0) by β0. When we choose p = 0, this estimator reducesto the kernel regression estimator. Recently statistical literature indicates that thelocal polynomial regression method has some superior properties. It is, however,too much material for us to cover much of them in this course. Thus, we will onlystudy the case when p = 0.

4.2.3 Asymptotic bias and variance for fixed design

Let us first consider the case when xi’s are not random and are equally spacedin the unit interval [0, 1]. That is, let us assume that xi = n−1(i− 1/2) for i =1,2, . . . ,n. Under model (4.6) and by the definition of (4.7), we have

E[m(x)] =n

∑i=1

m(xi)Kh(xi− x)/n

∑i=1

Kh(xi− x).

At the same time, by the mean value theorem for integrals, we have∫ 1

0m(t)Kh(t− x)dt =

n

∑i=1

∫ i/n

(i−1)/nm(t)Kh(t− x)dt

=n

∑i=1

m(ti)Kh(ti− x)

for some ti’s in [(i−1)/n, i/n]. Thus, we have

|∫ 1

0m(t)Kh(t− x)dt−n−1

n

∑i=1

m(xi)Kh(xi− x)|

≤ (nh)−1n

∑i=1|m(xi)Kh(xi− x)−m(ti)Kh(ti− x)|

= O((nh2)−1)

when m(x) and K(x) both have bounded first derivatives. If K has compact sup-port, and x is an interior point, the range of the summation or integration can berestricted to an interval of length proportional to h. Consequently, the order as-sessment can be refined to O((nh)−1). At the same time, when x is an inner pointof the unit interval,∫ 1

0m(t)Kh(t− x)dt =

∫K(s)m(x+hs)ds = m(x)+

µ2m′′(x)2

h2 +o(h2).

Page 61: Lecture Notes for Stat522B Jiahua Chen

4.2. NON-PARAMETRIC REGRESSION ANALYSIS 57

Hence,

n−1n

∑i=1

m(xi)Kh(xi− x) = m(x)+µ2m′′(x)

2h2 +o(h2)+O((nh2)−1).

Using the same technique, we have

n

∑i=1

Kh(xi− x) = 1+o(h2)+O((nh2)−1).

Combined, we have

E[m(x)] = m(x)+µ2m′′(x)

2h2 +o(h2)+O((nh2)−1).

The order assessment here is slightly different from the literature, for example,page 122 of Wand and Jones (1995). One may investigate more closely on theorder of the error when we approximate the summation with integration to find ifours is not precise enough.

The computation of the asymptotic variance is done in the similar fashion. Wehave

VAR(m(x)) =n

∑i=1

v(xi)K2h (xi− x)/[

n

∑i=1

Kh(xi− x)]2

= (nh)−1v(x)R(K)+o(h2 +(nh)−1).

Again, the order assessment here is different from some standard literature.

4.2.4 Bias and variance under random design

When X is random, the bias of the kernel regression estimator is harder to deter-mine if we interpret the bias very rigidly. The main problem is from the factthat the kernel regression estimator is a ratio of two random variables. It iswell known that for any two random variables X and Y , it is usually true thatE[X/Y ] 6= E(X)/E(Y ). When Y takes a value near 0, the ratio becomes very un-stable. The unstable problem does not get much better even if the chance for Yclose to 0 is very small.

To avoid this problem, we adopt a notion of the asymptotic bias and variance.Suppose a−1

n (Zn−bn)→ Z in distribution such that E(Z) = 0 and Var(Z) = 1. We

Page 62: Lecture Notes for Stat522B Jiahua Chen

58 CHAPTER 4. SMOOTHING METHOD

define the asymptotic mean and variance of Zn as bn and a2n. A similar definition

has been given in Shao (1998), however, this definition has not appeared anywhereelse to my best knowledge.

The numerator Un = ∑ni=1 yiKh(Xi−x) and the denominator Vn = ∑

ni=1 Kh(Xi−

x) in m(x) are both sum of iid random variables. We look for proper constants an

and (un,vn) such that

an[(Un−un,Vn− vn)]

has limiting distribution. For this purpose, we first search for approximate meansand variances of Un and Vn.

It is seen that

E[Un] = E[n

∑i=1

m(Xi)K(Xi− x)]

= n∫

m(t)Kh(t− x) f (t)dt

= n∫

m(x+ sh) f (x+ sh)K(s)ds

= n[m(x) f (x)+12m′′(x) f (x)+2m′(x) f ′(x)+m(x) f ′′(x)µ2h2]

+o(nh2). (4.10)

Thus, we put

un = n[m(x) f (x)+12m′′(x) f (x)+2m′(x) f ′(x)+m(x) f ′′(x)µ2h2].

Similarly, we have

VAR[Un] = E[ n

∑i=1

v(Xi)K2h (Xi− x)

]+VAR

[ n

∑i=1

m(Xi)Kh(Xi− x)]. (4.11)

It is seen that

E[v(X)K2h (X− x)] =

∫v(t)K2

h (t− x) f (t)dt

= h−1v(x) f (x)R(K)+o(h−1).

Page 63: Lecture Notes for Stat522B Jiahua Chen

4.2. NON-PARAMETRIC REGRESSION ANALYSIS 59

For the second term in (4.10), we have

VAR[n

∑i=1

m(Xi)Kh(Xi− x)]

= nVAR[m(X)Kh(Xi− x)]

= n[Em2(X)K2h (X− x)−Em(X)Kh(X− x)2]

= nEm2(X)K2h (X− x)+O(n).

Further,

Em2(X)K2h (X− x) =

∫m2(t)K2

h (t− x) f (t)dt

= h−1∫

m2(x+ sh) f (x+ sh)K(s)ds

= h−1m2(x) f (x)R(K)+o(h−1).

In conclusion, we have shown

VAR(Un) = nh−1v(x)+m2(x) f (x)R(K)+o(nh−1).

Using similar calculation, we have

E[Vn] = n[ f (x)+12

f ′′(x)µ2h2]+o(nh2),

VAR(Vn) = nh−1 f (x)R(K)+o(nh−1).

In view of the above bias and variance results, it is apparent that we shouldchoose h−1/5 and therefore an = n−(3/5) to produce some meaningful limitingdistribution. Assume that the conditions for the joint asymptotic normality ofn−3/5[Un−E(Un),Vn−E(Vn)] are satisfied. For convenience, write Un = n−1Un

and so on for the sake that Un→ m(x) f (x) in probability. It makes the followingpresentation simpler. We have

n2/5[Un− un,Vn− vn]→ N(0,∆) (4.12)

for covariance matrix ∆ consists of δ11 = v(x)+m2(x) f (x)R(K), δ22 = f (x)R(K)

and δ12 = m(x) f (x)R(K). The computation of δ12 is left out as an assignmentproblem.

Page 64: Lecture Notes for Stat522B Jiahua Chen

60 CHAPTER 4. SMOOTHING METHOD

Finally, we have

Un/Vn−m(x) = [Un/Vn− un/vn]+ [un/vn−m(x)].

and[un/vn−m(x)] =

12m′′(x)+2µ2m′(x) f ′(x)/ f (x)h2 +o(h2).

With that, we have

n2/5m(x)−m(x) = n2/5[

Un

Vn−m(x)

]= m′′(x)+2m′(x) f ′′(x)/ f (x)µ2

+n2/5[Un− unvn + unvn−Vn)]

Vnvn+op(1)

= m′′(x)+2m′(x) f ′′(x)/ f (x)µ2

+n2/5[Un− unvn + unvn−Vn)]

v2n

+op(1).

It is then obvious that

n2/5m(x)−m(x)→ N(a,σ2),

with a = µ22 m

′′(x)+2m′(x) f ′(x)/ f (x) and

σ2 =

f 2(x)δ11 +2m(x) f 2(x)δ12 +m2(x) f 2(x)δ22

f 4(x)

= v(x)R(K) f (x)−1. (4.13)

Because of this, it is widely cited that the asymptotic bias of m(x) is given by

12m′′(x)+2m′(x) f ′′(x)/ f (x)µ2h2

and the asymptotic variance is given by

(nh)−1v(x)R(K) f (x)−1.

The citation is not entirely true as two important conditions cannot be ignored: (1)K(x) has compact support; (2) the bandwidth parameter h = O(n−1/5).

Page 65: Lecture Notes for Stat522B Jiahua Chen

4.3. ASSIGNMENT PROBLEMS 61

4.3 Assignment problems

1. Verify that the Liapunov condition is satisfied when (4.5) is met in additionto other conditions on K and f specified before (4.5).

2. Show that when the second moments of X and Y exist,

E[Y −g(X)]2

is minimized among all possible choice of measurable function of X wheng(X) = EY |X.

3. Verify the order assessment given in (4.10). Present your own result if yourassessment is different.

4. Prove that (4.9) is true as defined in the content. Why is it necessary toassume that the range of x is the entire space of real numbers? List twomeaningful generalizations of this result (which is too restrictive in manyways).

5. Verify the result on the variance covariance matrix ∆ defined in (4.12).

6. Verify the result given in (4.13)

Page 66: Lecture Notes for Stat522B Jiahua Chen

62 CHAPTER 4. SMOOTHING METHOD

Page 67: Lecture Notes for Stat522B Jiahua Chen

Chapter 5

Asymptotic Results in FiniteMixture Models

5.1 Finite mixture model

In statistics, a model means a family of probability distributions. Given a randomvariable X , its cumulative distribution function (c.d.f. ) is defined to be F(x) =P(X ≤ x) for x ∈ R. Let p ∈ (0,1) and

F(x) = ∑0≤k≤x

(nk

)pk(1− p)n−k

for x ∈ R and the summation on k is over integers. A random variable with itsc.d.f. having the above algebraic expression is known as binomially distributed,or it is a binomial random variable. It contains two parameters n and p. A par-ticular pair of values in n and p gives one particular binomial distribution. Thebinomial distribution family is the collection of binomial distributions with allpossible parameter values in n and p. We may form a narrower binomial distribu-tion family by holding n fixed. Whether or not n is fixed, this distribution familyis regarded as Binomial Model.

A discrete integer valued random variable has its p.m.f. given by

f (x) = P(X = x) = F(x)−F(x−1)

63

Page 68: Lecture Notes for Stat522B Jiahua Chen

64 CHAPTER 5. ASYMPTOTIC RESULTS IN FINITE MIXTURE MODELS

for x = 0,±1, . . .. For a binomial X , its p.m.f. is given by

BIN(k;n, p) = P(X = k) =(

nk

)pk(1− p)n−k

for k = 0,1, . . . ,n. We write X ∼ BIN(n, p). Let π be a value between 0 and 1 andlet

f (k) = πBIN(k;n, p1)+(1−π)BIN(k;n, p2) (5.1)

for k = 0,1, . . . ,n with some parameters n and p1, p2 ∈ [0,1]. Clearly, the abovef (·) is also a p.m.f. The distributions whose p.m.f. have algebraic structure (5.1)form a new distribution family. Because f (·) is a convex combination of twop.m.f. ’s from a well known distribution family, its distribution is called a binomialmixture distribution. We subsequently have a binomial mixture model.

Be aware that we use f (·) very loosely as a general symbol for a p.m.f. ora p.d.f. Its specifics may change from one paragraph to another. We must notinterpret it as a specific p.m.f. or p.d.f. with great care.

Let f (x;θ) : θ ∈ Θ be a parametric model and G(θ) be a c.d.f. on Θ. Weobtain a mixture distribution represented by

f (x;G) =∫

Θ

f (x;θ)dG(θ), (5.2)

where the integration is understood in the Lebesgue-Stieltjes sense. When G(·) isabsolutely continuous with density function g(θ), the integration equals

∫Θ

f (x;θ)g(θ)dθ .We are often interested in the situation where G is a discrete distribution assign-ing positive probabilities π1, . . . ,πm to finite number of θ -values, θ1, . . . ,θm suchthat ∑

mj=1 π j = 1. In this case, the mixture density (5.2) is reduced to the familiar

convex combination:

f (x;G) =m

∑j=1

π j f (x;θ j). (5.3)

Because a distribution is also referred as a population in some context, we there-fore also call f (x;θ j) a sub-population of the mixture population. We call θ j

sub-population parameter and π j the corresponding mixing proportion. When allπ j > 0, these component parameters θ1, . . . ,θm are the support points of G. Theorder of the mixture model is m if G has at most m support points.

The density functions f (x;G) in (5.2) form a mixture distribution family, andtherefore a mixture model. The density functions f (x;G) in (5.3) form a finite

Page 69: Lecture Notes for Stat522B Jiahua Chen

5.2. TEST OF HOMOGENEITY 65

mixture distribution family, and therefore a finite mixture model. The collectionof the mixing distribution will be denoted as G. A mixture model is a distributionfamily characterized by

f (x;G) =∫

Θ

f (x;θ)dG(θ) : G ∈G

which requires both f (x;θ) : θ ∈Θ and G fully specified.We will use F(x;θ) as the c.d.f. of the component density function f (x;θ) and

similarly for F(x;G) and f (x;G). The same symbols F and f are used for bothcomponent and mixture distributions, we have to pay attention to the symbol intheir entry. We also use G(θ) for the probability the distribution G assigns to aspecific θ value and similarly for F(x). We may also refer f (x;G) as a mixturedensity, a mixture distribution or a mixture model.

We have now completed the introduction of the mixture model.

5.2 Test of homogeneity

Finite mixture models are often used to help determine whether or not data weregenerated from a homogeneous or heterogeneous population. Let X1, . . . ,Xn be asample from the mixture p.d.f.

(1− γ) f (x,θ1)+ γ f (x,θ2), (5.4)

where θ1 ≤ θ2 ∈Θ and 0≤ γ ≤ 1. We wish to test the hypothesis

H0 : θ1 = θ2, (or equivalently γ = 0, or γ = 1).

Namely, we test whether or not the observations come from a homogeneous pop-ulation f (x,θ).

In applications, the null model is the default position. That is, unless there isa strong evidence in contradiction, the null model is regarded as “true”. At thesame time, searching for evince against the null model in favour of a specific typeof alternative is a way to establishing the new theory. We do not blindly trust anew theory unless it survives serious challenges.

There are many approaches to this hypothesis test problem. Given the nicei.i.d. structure and the parametric form of the finite mixture model, the classicallikelihood ratio test will be the one to be discussed.

Page 70: Lecture Notes for Stat522B Jiahua Chen

66 CHAPTER 5. ASYMPTOTIC RESULTS IN FINITE MIXTURE MODELS

5.3 Binomial mixture example

Consider the situation where the kernel distribution is Binomial with known sizeparameter m and probability of success parameter θ . Let X1, . . . ,Xn be a set ofi.i.d. random variables with common finite mixture of binomial distributions suchthat

P(X = k) = (1−π)BIN(k;m,θ1)+πBIN(k;m,θ2)

where π and 1−π are mixing proportions, and θ1 and θ2 are component distribu-tion parameters. Clearly, the parameter space of the mixing parameters θ1 and θ2

are bounded. The likelihood ratio statistic is stochastically bounded. We now firstdemonstrate this fact.

Let nk be the number of observed Xi = k for k = 0,1, . . . ,m. The log-likelihoodfunction is given by

`n(π,θ1,θ2) =m

∑k=0

nk log(1−π) f (k;m,θ1)+π f (k;m,θ2)

Let θk = nk/n. By Jensen’s inequality, we have

`n(π,θ1,θ2)≤ nm

∑k=0

θk log θk

for any choices of π , θ1 and θ2. Let p∗k = P(X = k) under the true distribution ofX . It is well known that in distribution, we have

nm

∑k=0

pk log pk−nm

∑k=0

pk log p∗k → χ2m

as n→ ∞. Let us also use Mn for the likelihood ratio statistic. We find

Mn = 2sup`n(π, p1, p2)− sup`n(1, p, p) ≤ nm

∑k=0

pk log pk−nm

∑k=0

pk log p∗k

which has a χ2m limiting distribution. That is, Mn = Op(1), or it is stochastically

bounded.One should have realized that the conclusion Mn =Op(1) does not particularly

rely on Xi’s having a binomial distribution. The conclusion remains true when Xi’shave common and finite number of support points.

Page 71: Lecture Notes for Stat522B Jiahua Chen

5.3. BINOMIAL MIXTURE EXAMPLE 67

In spite of boundedness of Mn for binomial mixtures, finding the limiting dis-tribution of Mn troubled statisticians as well as geneticists for a long time. Thefirst most satisfactory answer is given by Chernoff and Lander(1985JSPI). Unlikethe results under regular models, the limiting distribution of the likelihood ratiostatistic under binomial mixtures is not an asymptotic pivotal. The outcomes varyaccording to the size of m, and true null distribution p and so on. We now use thesimplest case with m = 2 special component parameter values for illustration. Inthis case,

H0 : f (k;2,0.5), Ha : (1−π) f (k;2,0.5)+π f (k;2,0.5+θ).

We investigate the limiting distribution of the likelihood ratio statistic in the pres-ence of n i.i.d. observations from H0.

Under the alternative hypothesis Ha, the parameter space of (π,θ) is a unitsquare [0,1]× [0,1]. The null hypothesis is made of two lines in this unit square:one is formed by π = 0 and the other is θ = 0.5. Unlike the hypothesis testproblems under regular models, all points on these two lines parameterize thesame distribution. The derivation of new limiting distribution naturally starts fromhow to avoid this non-identiability. Chernoff and Lander were the first to useparameter transformation.

Using the parameter setting under the alternative model, we note

P(X = 0) = 0.25(1−π)+π(1−θ)2;

P(X = 1) = 0.5(1−π)+2πθ(1−θ);

P(X = 2) = 0.25(1−π)+πθ2.

If the data are from the null model, parameters π,θ are not uniquely defined andtherefore cannot be consistently estimated. At the same time, let

ξ1 = 0.5π(θ −0.5); ξ2 = π(θ −0.5)2.

The null model is uniquely defined by ξ1 = ξ2 = 0. After the parameter transfor-mation, we find

P(X = 0) = 0.25−2ξ1 +ξ2;

P(X = 1) = 0.5−2ξ2;

P(X = 2) = 0.25+2ξ1 +ξ2.

Page 72: Lecture Notes for Stat522B Jiahua Chen

68 CHAPTER 5. ASYMPTOTIC RESULTS IN FINITE MIXTURE MODELS

Let p0, p1, p2 be sample proportions of X = 0,1,2. The likelihood would be max-imized by setting

p0 = 0.25−2ξ1 +ξ2;

p1 = 0.5−2ξ2;

p2 = 0.25+2ξ1 +ξ2.

The unconstrained solution is given by

ξ1 = (p2− p0)/4

ξ2 = (p2 + p0−0.5)/2

At the same time, because

π = 4ξ21 /ξ2, θ = 0.5(1+ξ2/ξ1)

and they have range [−0.25,0.25]× [0,1], we must have

|ξ2| ≤ ξ1,4ξ21 ≤ ξ2.

The range is shown in the following figure. In addition, after the shaded region isexpanded, it is well approximated by a cone as show by the plot on the right handside:

|ξ2| ≤ ξ1,ξ2 ≥ 0.

Let ϕ be the angle from the positive x-axis. Then the cone contains two dis-joint regions: 0 < ϕ < π/4 and 3π/4 < ϕ < π/2. This π is the mathematicalconstant for the ratio bewteen a circle’s circumference to its diameter.

At the same time, under the null model, and using the classical central limittheorem, √

n(ξ1, ξ2)d−→ (Z1,Z2)

where (Z1,Z2) are bivariate normal with covariance matrix I−1 and I is the Fisherinformation with respect to (ξ1,ξ2):

I =[

32 00 16

].

The asymptotic variance can also be directly computed.

Page 73: Lecture Notes for Stat522B Jiahua Chen

5.3. BINOMIAL MIXTURE EXAMPLE 69

Applying Chernoff (1954), we may now regard the hypothesis test problem astesting

H0 : ξ1 = ξ2 = 0

against the alternativeHa : |ξ2| ≤ ξ1,ξ2 ≥ 0

given a single pair of observation (Z1,Z2) with mean (ξ1,ξ2) and covariance ma-trix I−1 to obtain the limiting distribution of the original likelihood ratio test. Notethat the log-likelihood function is given by

`(ξ1,ξ2) =−16(Z1−ξ1)2−8(Z2−ξ2)

2 + c

where c is parameter free constant.There are three cases depending on the observed value of (Z1,Z2) to obtain

analytical solutions to the likelihood ratio test statistic.Case I: |Z2| ≤ Z1,Z2 ≥ 0. In this case the MLE ξ1 = Z1 and ξ2 = Z2. Hence,

the likelihood ratio statistic is

T =−2`(ξ1, ξ2)− l(0,0)= 32Z21 +16Z2

2 ∼ χ22

Case II: Z2 < 0. In this case the MLE ξ1 = Z1 and ξ2 = 0. Hence, the likelihoodratio statistic is

T =−2`(ξ1, ξ2)− `(0,0)= 32Z21 ∼ χ

21

Case II—: |Z1| > Z2 ≥ 0. In this case, the likelihood is maximized whenξ1 = ξ2. the MLE ξ1 = ξ2 = (2Z1 +Z2)/3. Hence, the likelihood ratio statistic is

T =−2`(ξ1, ξ2)− `(0,0)= (16/3)(Z2−Z1)2

It can be verified that the event |Z2| ≤ Z1,Z2 ≥ 0 in Case I is independent of32Z2

1 +16Z22 ≤ z for any z. The same is true for the event in Case II. However,

the event in the third case is not independent of (Z2−Z1)2 ≤ z. Thus, we conclude

the limiting distribution of the likelihood ratio test statistic T is given by

P(T ≤ t) = 0.5P(χ21 ≤ t)+2λP(χ2

2 ≤ t)

+(0.5−2λ )P(Z2−Z1)2 ≤ 3t/16||Z1|> Z2 ≥ 0

Page 74: Lecture Notes for Stat522B Jiahua Chen

70 CHAPTER 5. ASYMPTOTIC RESULTS IN FINITE MIXTURE MODELS

where λ = arctan(1/√

2)/(2π) and this π = 3.14 · · · .Chernoff and Lander (1995) presented this result by introducing two standard

normal Y1,Y2. It may not be obvious that this above result is the same.Using the same approach, it is possible to go some distance. For instance, if

one considers

H0 : π(1−π)(θ2−θ1) = 0

against the general alternative

H0 : π(1−π)(θ2−θ1) 6= 0.

Then if the null model is not θ1 = θ2 = 0, the limiting distribution of the likelihoodratio statistic when m = 2 is

0.5χ21 +0.5χ

22 .

The limiting distribution when m = 3 can be similarly derived.

5.4 C(α) test

The general C(α) test is designed to test a specific null value of a parameter ofinterest in the presence of nuisance parameters. More specifically, suppose thestatistical model is made of a family of density functions f (x;ξ ,η) with someappropriate parameter space for (ξ ,η). The problem of interest is to test for H0 :ξ = ξ0 versus the alternative Ha : ξ 6= ξ0. That is, the parameter value of η isleft unspecified in both hypotheses and it is of no interest. This observation earnsits name as a nuisance parameter. Due to its presence, the null and alternativehypotheses are composite as oppose to simple, as both contain more than a singleparameter value in terms of (ξ ,η). As in common practice of methodologicaldevelopment in statistical significance test, the size of the test is set at some α

value between 0 and 1. Working on composite hypothesis and having size α

appear to be the reasons behind the name C(α). While our interest lies in the useof C(α) to test for homogeneity in the context of mixture models, it is helpful tohave a generic introduction.

Page 75: Lecture Notes for Stat522B Jiahua Chen

5.4. C(α) TEST 71

5.4.1 The generic C(α) test

To motivate the C(α) test, let us first examine the situation where the model isfree from nuisance parameters. Denote the model without nuisance parameter asa distribution family f (x;ξ ) with some parameter space in ξ . In addition, weassume that this family is regular. Namely, the density function is differentiable inξ for all x, the derivative and the integration can be exchanged and so on. Basedon an i.i.d. sample x1, . . . ,xn, the score function of ξ is given by

Sn(ξ ) =n

∑i=1

∂ log f (xi;ξ )

∂ξ.

When the true distribution is given by f (x;ξ0), we have ESn(ξ0) = 0. Definethe Fisher information (matrix) to be

I(ξ ) = E[∂ log f (xi;ξ )

∂ξ∂ log f (xi;ξ )

∂ξτ

].

It is well known that

Sτn(ξ0)nI(ξ0)−1Sn(ξ0)

d−→ χ2d

where d is the dimension of ξ . Clearly, a test for H0 : ξ = ξ0 versus the alternativeHa : ξ 6= ξ0 based on Sn can be sensibly constructed with rejection region givenby

Sτn(ξ0)I−1(ξ0)Sn(ξ0)≥ nξ

2d (1−α).

We call it score test and credit its invention to Rao(??). When the dimension of ξ isd = 1, then the test can be equivalently defined based on the asymptotic normalityof Sn(ξ0). In application, one may replace nI(ξ0) by observed information andevaluate it at a root-n consistent estimator ξ .

Back to the general situation where the model is given by f (x;ξ ,η). If η

value in f (x;ξ ,η) is in fact known, the test problem reduces to the one we havejust described. We may then proceed as follows. Define

Sn(ξ ;η) =n

∑i=1

∂ log f (xi;ξ ,η)

∂ξ.

Page 76: Lecture Notes for Stat522B Jiahua Chen

72 CHAPTER 5. ASYMPTOTIC RESULTS IN FINITE MIXTURE MODELS

The semicolon indicates that the “score” operation is with respect to only ξ . Sim-ilarly let us define the ξ -specific Fisher information to be

I11(ξ ,η) = E[∂ log f (X ;ξ ,η)

∂ξ∂ log f (X ;ξ ,η)

∂ξτ

].

With the value of η specified, we have a score test statistic and its limiting distri-bution

Sτn(ξ0;η)nI11(ξ0,η)−1Sn(ξ0;η)

d−→ χ2d .

A score test can therefore be carried out using this statistic.Without a known η value, the temptation is to have η replaced by an efficient

or root-n consistent estimator. In general, the chisquare limiting distribution of

Sτn(ξ0; η)nI(ξ0; η)−1Sn(ξ0; η)

is no longer the simple chisquare. For a specific choice of η , we may work outthe limiting distribution of Sτ

n(ξ0; η) and similarly define a new test statistic. Theapproach of Neyman (1959) achieved this goal in a graceful way.

To explain the scheme of Nayman, let us introduce the other score function

Sn(η ;ξ ) =n

∑i=1

∂ log f (xi;ξ ,η)

∂η.

The above notation highlights that the “score operation” is with respect to only η .Next, let us define the other part of the Fisher information matrix to be

I12(ξ ;η) = Iτ21(ξ ;η) = E

[∂ log f (X ;ξ ,η)

∂ξ∂ log f (X ;ξ ,η)

∂ητ

]and

I22(ξ ;η) = E[∂ log f (X ;ξ ,η)

∂η∂ log f (X ;ξ ,η)

∂ητ

].

Let us now project Sn(ξ ;η) into the orthogonal space of Sn(η ;ξ ) to get

Wn(ξ ,η) = Sn(ξ ;η)− I12(ξ ;η)I−122 (ξ ;η)Sn(η ;ξ ). (5.5)

Conceptually, it removes the influence of the nuisance parameter η on the scorefunction of ξ . At the true parameter value of ξ ,η ,

n−1/2Wn(ξ ,η)d−→ N(0,I11− I12I−1

22 I21).

Page 77: Lecture Notes for Stat522B Jiahua Chen

5.4. C(α) TEST 73

Under the null hypothesis, the value of ξ is specified as ξ0, the value of η is un-specified. Thus, we naturally try to construct a test statistic based on Wn(ξ0, η)

where η is some root-n estimator of η . For this purpose, we must know the dis-tribution of Wn(ξ0, η). The following result of Neyman (1959) makes the answerto this question simple.

Theorem 5.1 Suppose x1, . . . ,xn is an i.i.d. sample from f (x;ξ ,η). Let Wn(ξ ,η)

be defined as (5.5) together with other accompanying notations. Let η be a root-nconsistent estimator of η when ξ = ξ0. We have

Wn(ξ0,η)−Wn(ξ0, η) = op(n1/2)

as n→ ∞, under any distribution where ξ = ξ0.

Due to the above theorem, the limiting distribution of Wn(ξ0, η) is the sameas that of Wn(ξ0,η) with (ξ0,η) being the true parameter values of the model thatgenerated the data. Thus,

W τn (ξ0, η)[nI11− I12I−1

22 I21]−1Wn(ξ0, η)

may be used as the final C(α) test statistic. The information matrix in the abovedefinition is evaluated at ξ0, η , and the rejection region can be decided based onits chisquare limiting distribution.

If we choose η as the constrained maximum likelihood estimator given ξ = ξ0,we would have Wn(ξ0, η) = Sn(ξ0; η) in (5.5). The projected score functionSn(ξ0,η) is one of many possible zero-mean functions satisfying some regularityproperties. Neyman(1959) called such class of functions Cramer functions. EachCramer function can be projected to obtain a corresponding Wn and therefore atest statistic for H0 : ξ = ξ0. Within this class, the test based on score functionSn(ξ0,η) is optimal: having the highest asymptotic power against some local al-ternatives. In general, if the local alternative is of two-sided nature, the optimalitybased on the notion of “uniformly most powerful” cannot be achieved. The localoptimality has to be justified in a more restricted sense.

5.4.2 C(α) test for homogeneity

As shown in the last subsection, the C(α) test is designed to test for a special nullhypothesis in the presence of some nuisance parameters. The most convincing

Page 78: Lecture Notes for Stat522B Jiahua Chen

74 CHAPTER 5. ASYMPTOTIC RESULTS IN FINITE MIXTURE MODELS

example of its application is for homogeneity test. Recall that a mixture model isrepresented by its density function in the form of

f (x;G) =∫

Θ

f (x;θ)dG(θ).

Neyman and Scott (1966) regarded the variance of the mixing distribution Ψ asthe parameter of interest, and the mean and other aspects of Ψ as nuisance param-eters. In the simplest situation where Θ = R, we may rewrite the mixture densityfunction as

ϕ(x;θ ,σ ,Ψ) =∫

Θ

f (x;θ +√

σξ )dΨ(ξ ) (5.6)

such that the standardized mixing distribution Ψ(·) has mean 0 and variance 1.The null hypothesis is H0 : σ = 0. The rational of the choice of

√σ instead of σ

in the above definition will be seen later. Both θ and the mixing distribution Ψ

are nuisance parameters.The partial derivative of ϕ(x;θ ,σ ,Ψ) with respect to σ is given by

∂ϕ(x;θ ,σ ,Ψ)

∂σ=

∫Θ

ξ f ′(x;θ +√

σξ )dΨ(ξ )

2√

σ∫

Θf (x;θ +

√σξ )dΨ(ξ )

.

At σ = 0 or let σ ↓ 0, we find

∂ϕ(x;θ ,σ ,Ψ)

∂σ

∣∣∣∣σ↓0

=f ′′(x;θ)

2 f (x;θ).

This is the score function for σ based on a single observation. We may noticethat the choice of

√σ gives us the non-degenerate score function. The partial

derivative of ϕ(x;θ ,σ ,Ψ) with respect to θ is given by

∂ϕ(x;θ ,σ ,Ψ)

∂θ=

∫Θ

f ′(x;θ +√

σξ )dΨ(ξ )∫Θ

f (x;θ +√

σξ )dΨ(ξ )

which leads to score function for θ based on a single observation as:

∂ϕ(x;θ ,0,Ψ)

∂θ=

f ′(x;θ)

f (x;θ).

Both of them are free from the mixing distribution Ψ.

Page 79: Lecture Notes for Stat522B Jiahua Chen

5.4. C(α) TEST 75

Let us now define

yi(θ) =f ′(xi;θ)

f (xi;θ), zi(θ) =

f ′′(xi;θ)

2 f (xi;θ)(5.7)

with xi’s being i.i.d. observations from the mixture model. The score functionsbased on the entire sample are ∑

ni=1 Zi(θ) and ∑

ni=1Yi(θ) for the mean and vari-

ance of G. Based on the principle of deriving test statistic Wn in the last sub-section, we first project zi(θ) into space of yi(θ), and make use of the residualwi(θ) = zi(θ)−β (θ)yi(θ). The regression coefficient

β (θ) =EY1(θ)Z1(θ)

EY 21 (θ)

.

We capitalized Y and Z to indicate their status as random variables. The expecta-tion is with respect to the homogeneous model f (x;θ).

When θ is the maximum likelihood estimator of θ under the homogeneousmodel assumption f (x,θ), the C(α) statistic has a simple form:

Wn =∑

ni=1Wi(θ)√

nν(θ)=

∑ni=1 Zi(θ)√

nν(θ)(5.8)

with ν(θ) = EW 21 (θ). Because the parameter of interest is univariate, we can

skip the step of creating a quadratic form. Clearly, Wn has standard normal lim-iting distribution and the homogeneity null hypothesis is one-sided. At a givensignificance level α , we reject the homogeneity hypothesis H0 when Wn > zα .This is the C(α) test for homogeneity.

In deriving the C(α) statistic, we assumed the parameter space Θ = R. Withthis parameter space, if G(·) is a mixing distribution on Θ, so is G((θ −θ ∗)/σ)

for any θ ∗ and σ ≥ 0. We have made use of this fact in (5.5). If Θ = R+ as inthe Poisson mixture model where θ ≥ 0, G((θ − θ ∗)/σ) cannot be regarded asa legitimate mixing distribution for some θ ∗ and σ . In the specific example ofPoisson mixture, one may re-parameterize model with ξ = logθ . However, thereseems to be no unified approach in general, and the optimality consideration is atstake.

Whether or not the mathematical derivation of Wn can be carried out as we didearlier for other forms of Θ, the statistic Wn remains a useful metric on the plausi-bility of the homogeneity hypothesis. The limiting distribution of Wn remains thesame and it is usefulness in detecting the population heterogeneity.

Page 80: Lecture Notes for Stat522B Jiahua Chen

76 CHAPTER 5. ASYMPTOTIC RESULTS IN FINITE MIXTURE MODELS

5.4.3 C(α) statistic under NEF-QVF

Many commonly used distributions in statistics belong to a group of natural expo-nential families with quadratic variance function (NEF-QVF; Morris, 1982). Theexamples include normal, Poisson, binomial, and exponential. The density func-tion in one-parameter natural exponential family has a unified analytical form

f (x;θ) = h(x)expxφ −A(φ),

with respect to some σ -finite measure, where θ = A′(φ) represents the mean pa-rameter. Let σ2 = A′′(φ) be the variance under f (x;θ). To be a member of NEF-QVF, there must exist constants a,b, and c such that

σ2 = A′′(φ) = aA′(φ)+bA′(φ)+ c = aθ

2 +bθ + c. (5.9)

Namely, the variance is a quadratic function of the mean.When the kernel density function f (x;θ) is a member of NEF-QVF, the C(α)

statistic has a particularly simple analytical form and simple interpretation.

Theorem 5.2 When the kernel f (x;θ) is a member of NEF-QVF, then the

Wn =∑

ni=1(xi− x)2−nσ2√

2n(a+1)σ2,

where C(α) statistic is given by x = n−1∑

ni=1 xi and σ2 = ax2+bx+c with coeffi-

cients given by (5.9) are the maximum likelihood estimators of θ and σ2, respec-tively.

The analytical form of the C(α) test statistics for the normal, Poisson, bino-mial, and exponential kernels are included in Table 5.4.3 for easy reference. Theirderivation is given in the next subsection.

Table 5.1: Analytical form of C(α) some NEF-QVF mixtures.

Kernel N(θ ,1) Poisson(θ ) BIN(m, p) Exp(θ )

C(α) ∑ni=1(xi−x)2−n√

2n∑

ni=1(xi−x)2−nx√

2nx∑

ni=1(xi−x)2−nx(m−x)/m√

2n(1−1/m)x(m−x)/m∑

ni=1(xi−x)2−nx2√

4nx2

Page 81: Lecture Notes for Stat522B Jiahua Chen

5.4. C(α) TEST 77

Note that the C(α) statistics contains the factor ∑ni=1(xi− x)2 which is a scaled

up sample variance. The second term in the numerator of these C(α) statisticsis the corresponding ‘estimated variance’ if the data are from the correspondinghomogeneous NEF-QVF distribution. Hence, in each case, the test statistic isthe difference between the ‘observed variance’ and the ‘perceived variance’ whenthe null hypothesis is true under the corresponding NEF-QVF kernel distributionassumption. The difference is then divided by their estimated null asymptoticvariance. Thus, the C(α) test is the same as the ‘over-dispersion’ test.

5.4.4 Expressions of the C(α) statistics for NEF-VEF mixtures

The quadratic variance function under the natural exponential family is char-acterized by its density function f (x;θ) = h(x)expxφ − A(φ) and A′′(φ) =aA′(φ)+ bA′(φ)+ c for some constants a, b, and c. The mean and variance aregiven by θ = A′(φ) and σ2 = A′′(φ). Taking derivatives with respect to φ on thequadratic relationship, we find that

A′′′(φ) = 2aA′(φ)+bA′′(φ) = (2aθ +b)σ2,

A(4)(φ) = 2aA′′(φ)2 +2aA′(φ)+bA′′′(φ) = 2aσ4 +(2aθ +b)2

σ2.

Because of the regularity of the exponential family, we have

C

dk f (X ;θ ∗)/dφ k

f (X ;θ ∗)

= 0

for k = 1, 2, 3, 4 where θ ∗ is the true parameter value under the null model. Thisimplies

C (X−θ∗)3 = A′′′(φ∗) = (2aθ

∗+b)σ∗2,

C (X−θ∗)4 = 3A′′(φ∗)2 +A(4)(φ∗) = (2a+3)σ∗4 +(2aθ

∗+b)2σ∗2,

where φ∗ is the value of the natural parameter corresponding to θ ∗, and similarlyfor σ∗2.

The ingredients of the C(α) statistics are

Yi(θ∗) =

f ′(X ;θ ∗)

f (Xi;θ ∗)=

(Xi−θ ∗)

σ∗2,

Zi(θ∗) =

f ′′(X ;θ ∗)

2 f (Xi;θ ∗)=

(Xi−θ ∗)2− (2aθ ∗+b)(Xi−θ ∗)−σ∗2

2σ = 4.

Page 82: Lecture Notes for Stat522B Jiahua Chen

78 CHAPTER 5. ASYMPTOTIC RESULTS IN FINITE MIXTURE MODELS

We then have

C Yi(θ∗)Zi(θ

∗)= C (Xi−θ ∗)3− (2aθ ∗+b)C (Xi−θ ∗)2−σ∗2C (Xi−θ ∗)

2σ∗6= 0.

Therefore, the regression coefficient of Zi(θ∗) against Yi(θ

∗) is β (θ ∗) = 0. Thisleads to the projection

Wi(θ∗) = Zi(θ

∗)−β (θ ∗)Yi(β∗) = Zi(θ

∗)

and

4σ∗8VARWi(θ∗) = VAR(Xi−θ

∗)2−2(2aθ∗+b)C (Xi−θ

∗)3+(2aθ

∗+b)2C (Xi−θ∗)2

= (2a+3)σ∗4 +(2aθ∗+b)2σ∗2−σ∗4

−2(2aθ∗+b)2σ∗2 +(2aθ

∗+b)2σ∗2

= (2a+2)σ∗4.

Hence, ν(θ ∗) = VARWi(θ∗)= 0.5(a+1)σ∗−4.

Because the maximum likelihood estimator θ = X , we find that

n

∑i=1

Wi(θ) =n

∑i=1

Zi(θ) =∑

ni=1(Xi− X)2− σ2

2σ4

with σ2 = aX2+bX+c due to invariance. The C(α) test statistic, Tn =∑ni=1Wi(θ)/

√nν(θ),

is therefore given by the simplified expression in Theorem 5.2. ♦

5.5 Brute-force likelihood ratio test for homogeneity

While C(α) is largely successful at homogeneity test, statisticians remain faithfulto the likelihood ratio test for homogeneity. We have already given the exampleof Hartigan (1985) which shows the straightforward likelihood ratio statistic hasstochastically unbounded. If we insists on the use of likelihood ratio test for ho-mogeneity in general, we must either re-scale the statistics, or confine the modelspace so that the test statistic has a non-degenerative limiting distribution. The re-sult of Chernoff and Lander (1995) on binomial mixture is the example where the

Page 83: Lecture Notes for Stat522B Jiahua Chen

5.5. BRUTE-FORCE LIKELIHOOD RATIO TEST FOR HOMOGENEITY 79

parameter space is naturally bounded. The result of Bickel and Chernoff (1993)is on re-scaled likelihood ratio statistic. The more general results are not easyto obtain. The first such attempt is by Basu and Ghosh (1995) who obtained thelimiting distribution of the likelihood ratio statistic under a separate condition.Their result is a breakthrough in one way, but is short of providing the satisfactoryanswer to the limiting distribution.

In this section, we work on the original likelihood ratio statistics for homo-geneity test with sufficient generality yet still limited in terms of developing auseful inference tool.

Let f (x;θ) for θ ∈ Θ ⊂ R be a p.d.f. with respect to a σ -finite measure. Weobserve a random sample X1, . . . ,Xn of size n from a population with the mixturep.d.f.

f (x;G) = (1−π) f (x;θ1)+π f (x;θ2), (5.10)

where θ j ∈ Θ, j = 1,2 are component parameter values and π and 1−π are themixing proportions. Without loss of generality, we assume 0 ≤ π ≤ 1/2. In thisset up, the mixing distribution G has at most two distinct support points θ1 andθ2. We wish to test

H0 : α = 0 or θ1 = θ2,

versus the full model (5.10).As usual, the log likelihood function of the mixing distribution is given by

`n(G) =n

∑i=1

log(1−π) f (xi;θ1)+π f (xi;θ2).

The maximum likelihood estimator of G is a mixing distribution on Θ with at mosttwo distinct support points at which `n(G) attains it maximum value. Without lossof generality, π < 1/2.

We assume the finite mixture model (5.10) satisfies conditions for the consis-tency of G. The consistency of G is of particular interest when the true mixingdistribution G∗ degenerates. A detailed analysis leads to the following results.

Lemma 5.1 Suppose G is consistent as discussed in this section. As n→ ∞, boththe MLE’s of θ1−θ ∗ and π(θ2−θ ∗) converge to 0 in probability when f (x;θ ∗)

is the true distribution.

Page 84: Lecture Notes for Stat522B Jiahua Chen

80 CHAPTER 5. ASYMPTOTIC RESULTS IN FINITE MIXTURE MODELS

Proof: Let us denote the MLE of G as

G(θ) = (1− π)I(θ1 ≤ θ)+ πI(θ2 ≤ θ)

with π ≤ 1/2. The true mixing distribution under homogeneity model is G∗(θ) =I(θ ∗ ≤ θ). Since Θ is compact, we find δ = infθ∈Θ exp(−|θ |)> 0. Thus, for thedistance defined by (??), we have

D(G,G∗) =∫

Θ

|G(θ)−G∗(θ)|exp(−|θ |)dθ (5.11)

≥ δ

∫Θ

|G(θ)−G∗(θ)|dθ (5.12)

= δ(1− π)|θ1−θ∗|+ π|θ2−θ

∗|. (5.13)

Since π < 1/2, the consistency of G implies both |θ1−θ ∗|→ 0 and π|θ2−θ ∗|→ 0almost surely. These are the conclusions of this lemma. ♦.

The likelihood ratio test statistic is twice of the differences between two max-imum values. We have discussed the one under the alternative model. Whenconfined in the space of null hypothesis, the log likelihood function is simplifiedto

`n(θ) =n

∑i=1

log f (xi;θ).

Note that as usual, `n(·) has been used as a generic notation. Denote its globalmaximum point in Θ as θ . The likelihood ratio test statistic is then

Rn = 2`n(G)− `n(θ).

We now focus on deriving its limiting distribution under a few additional condi-tions.

Strong identifiability. The kernel function f (x;θ) together with the first twoderivatives f ′(x;θ) and f ′′(x;θ) are identifiable by θ . That is, for any θ1 6= θ2

in Θ,2

∑j=1a j f (x,θ j)+b j f ′(x,θ j)+ c j f ′′(x,θ j)= 0,

for all x implies that a j = b j = c j = 0, j = 1,2.

Page 85: Lecture Notes for Stat522B Jiahua Chen

5.5. BRUTE-FORCE LIKELIHOOD RATIO TEST FOR HOMOGENEITY 81

The identifiability required here is stronger than ordinary in the sense that,besides f (x,θ) itself, the first two derivatives are also identifiable. This was firstproposed by Chen (1995) to establish that the best possible rate of estimating Gis n−1/4 under some condition. This topic will be discussed further in anotherchapter.

The following quantities play important roles in our study:

Yi(θ) =f (Xi,θ)− f (Xi,θ

∗)

(θ −θ ∗) f (Xi,θ ∗); (5.14)

Zi(θ) =Yi(θ)−Yi(θ

∗)

θ −θ ∗. (5.15)

At θ = θ ∗, the above functions taking continuity limits as their values. The pro-jection residual of Z to the space of Y is denoted as

Wi(θ) = Zi(θ)−h(θ)Yi(θ∗), (5.16)

where h(θ) = EYi(θ∗)Zi(θ)/EY 2

i (θ∗). These notations match the ones de-

fined in the section for C(α) test with a minor change: Yi(θ∗) differs by a factor

of 2.Both Yi(θ) and Zi(θ) are continuous with EYi(θ) = 0 and EZi(θ) = 0

and that Zi(θ) can be approximated by the derivative of Yi(θ) through the meanvalue theorem. They were regarded as score functions in the contents of C(α) test.

Uniform integrable upper bound. There exists integrable g with some δ > 0such that |Yi(θ)|4+δ ≤ g(Xi) and |Y ′i (θ)|3 ≤ g(Xi) for all θ ∈Θ.

Note that Yi(θ) is uniformly continuous in Xi and θ over S×Θ, where S is anycompact interval of real numbers. This implies equicontinuity of Yi(θ) in θ forXi ∈ S. According to Rubin (1956), this condition ensures that n−1

∑ni=1 |Yi(θ)|k

converges almost surely to E|Y1(θ)|k uniformly in θ ∈ Θ, for k ≤ 4. The sameresults hold for n−1

∑ni=1 |Zi(θ)|k.

Lemma 5.2 Suppose the model (5.10) satisfies the strong identifiability conditionand the uniform integrable upper bound conditions. Then the covariance matrix ofthe vector (Y1(θ

∗),Z1(θ))′ is positive definite at all θ ∈Θ under the homogeneous

model with θ = θ ∗.

Page 86: Lecture Notes for Stat522B Jiahua Chen

82 CHAPTER 5. ASYMPTOTIC RESULTS IN FINITE MIXTURE MODELS

Proof. The result is implied by Cauchy inequality

C 2Y1(θ∗)Z1(θ) ≤ C Y 2

1 (θ∗)C Z2

1(θ),

where the equality holds only if Y1(θ∗) and Z1(θ) are linearly associated. The

strong identifiability condition thus ensures that the inequality holds. The inte-grable upper bound conditions makes these expectations exist. ♦

The limiting distribution of the likelihood ratio statistic Rn will be describedby some stochastic process. The convergence of a sequence of stochastic processis an indispensable notion.

Lemma 5.3 The processes n−1/2∑Yi(θ), n−1/2

∑Y ′i (θ), n−1/2∑Zi(θ) and n−1/2

∑Z′i(θ)over Θ are tight.

Proof. To see this, consider

E

n−1/2

n

∑i=1

Yi(θ2)−n−1/2n

∑i=1

Yi(θ1)

2

= EY1(θ2)−Y1(θ1)2

≤ E

g2/3(X1)(θ2−θ1)2.

By Theorem 12.3 of Billingsley (1968, p. 95), n−1/2∑Yi(θ) is tight. From this

comment, we know that a sufficient condition for the tightness of n−1/2∑Y ′i (θ)

and n−1/2∑Z′i(θ) is that Y ′′i (θ)2 ≤ g(Xi) and Z′′i (θ)2 ≤ g(Xi) with an inte-

grable function g. ♦

Theorem 5.3 Suppose the mixture model (5.10) satisfies all conditions specifiedin this section. When f (x,θ ∗) is the true null distribution, the asymptotic distri-bution of the likelihood ratio test statistic for homogeneity is that of

supθ∈Θ

W+(θ)2,

Page 87: Lecture Notes for Stat522B Jiahua Chen

5.5. BRUTE-FORCE LIKELIHOOD RATIO TEST FOR HOMOGENEITY 83

where W (θ), θ ∈Θ, is a Gaussian process with mean 0, variance 1 and autocor-relation function ρ(θ ,θ ′) given by

ρ(θ ,θ ′) =covW1(θ),W1(θ

′)√EW 2

1 (θ)EW 21 (θ

′), (5.17)

and that W1(θ) is defined by (5.16).

This result is first presented in Chen and Chen (199?). We notice that theresult of ?? is more general. If one works hard enough, this result can be directlyobtained from that one. However, the result here is presented in a much morecomprehensive fashion.

5.5.1 Examples

The autocorrelation function ρ(θ ,θ ′) of the Gaussian process W (θ) is well be-haved with commonly-used kernel functions. In particular, when Z1(θ) and Y1(θ

∗)

are uncorrelated, i.e., h(θ) = 0, ρ(θ ,θ ′) becomes simple. We provide its expres-sions for normal, binomial and Poisson. As shown in the last section, Z1(θ) andY1(θ

∗) are uncorrelated. In terms of Yi and Zi,

ρ(θ ,θ ′) =COVZ1(θ)−h(θ)Y1(θ

∗),Z1(θ′)−h(θ ′)Y1(θ

∗)√VARZ1(θ)−h(θ)Y1(θ ∗)VARZ1(θ ′)−h(θ ′)Y1(θ ∗)

,

where h(θ) = EZ1(θ)Y1(θ∗)/EY 2

1 (θ∗). When Z1(θ) and Y1(θ

∗) are uncor-related,

ρ(θ ,θ ′) =COVZ1(θ),Z1(θ

′)√VARZ1(θ)VARZ1(θ ′)

.

Example 5.1 Normal kernel function. Let f (x;θ) be normal N(θ ,σ∗2) withknown σ∗. For simplicity, let θ ∗ = 0 and σ∗ = 1. Then Y1(0) = X1 and for θ 6= 0,

Z1(θ) = θ−1θ−1(expX1θ −θ

2/2−1)−X1,

and Z1(0)= (X21 −1)/2. We have seen that EZ1(θ)Y1(0)= 0, and hence h(θ)=

0 for all θ . Note that for θ , θ ′ 6= 0,

EZ1(θ)Z1(θ′)= (θθ

′)−2exp(θθ′)−1−θθ

′.

Page 88: Lecture Notes for Stat522B Jiahua Chen

84 CHAPTER 5. ASYMPTOTIC RESULTS IN FINITE MIXTURE MODELS

We have for θ , θ ′ 6= 0,

ρ(θ ,θ ′) =exp(θθ ′)−1−θθ ′√

exp(θ 2)−1−θ 2exp(θ ′2)−1−θ ′2.

For θ 6= 0, it reduces to

ρ(θ ,0) =θ 2√

2exp(θ 2)−1−θ 2.

Example 5.2 Binomial kernel function. Consider the binomial kernel function

f (x,θ) ∝ θx(1−θ)k−x, for x = 0, . . . ,k.

Note that

Y1(θ∗) =

X1

θ ∗− k−X1

1−θ ∗; (5.18)

Y1(θ) =1

θ −θ ∗

[(1−θ

1−θ ∗

)k(1−θ ∗)/θ ∗

(1−θ)/θ

X1

−1

]. (5.19)

As a member of NEF-QVF, binomial mixture models also have EZ1(θ)Y1(θ∗)=

0, so h(θ) = 0 for all θ . The covariance of Z1(θ) and Z1(θ′) for θ ,θ ′ 6= θ ∗ is

1(θ −θ ∗)2(θ ′−θ ∗)2

[1+

(θ −θ ∗)(θ ′−θ ∗)

θ ∗(1−θ ∗)

k

−1− k(θ −θ ∗)(θ ′−θ ∗)

θ ∗(1−θ ∗)

].

Let

u =

√k(θ −θ ∗)

θ ∗(1−θ ∗), u′ =

√k(θ ′−θ ∗)

θ ∗(1−θ ∗).

We obtain

ρ(θ ,θ ′) =(1+uu′/k)k−1−uu′√

(1+u2/k)k−1−u2

(1+u′2/k)k−1−u′2 .

Page 89: Lecture Notes for Stat522B Jiahua Chen

5.5. BRUTE-FORCE LIKELIHOOD RATIO TEST FOR HOMOGENEITY 85

Interestingly, when k is large,

ρ(θ ,θ ′)≈ exp(uu′)−1−uu′√exp(u2)−1−u2exp(u′2)−1−u′2

.

That is, when k is large the autocorrelation function behaves similarly to that ofthe normal kernel. ♦

Example 5.3 Poisson kernel function. Let

f (x,θ) ∝ e−θθ

x, for x = 0,1,2, . . . .

Then

Y1(θ∗) =

X1−θ ∗

θ ∗; (5.20)

Y1(θ) =exp−(θ −θ ∗)(θ/θ ∗)X1−1

θ −θ ∗for θ 6= θ

∗. (5.21)

Again Poisson is a member of NEF-QVF so that Z1(θ) and Y1(θ∗) are uncorre-

lated and h(θ) = 0 for all θ .For θ ,θ ′ 6= θ ∗,

COVZ1(θ),Z1(θ′)= exp(θ −θ ∗)(θ ′−θ ∗)/θ ∗−1− (θ ′−θ ∗)/θ ∗

(θ −θ ∗)2(θ ′−θ ∗)2 .

Put

v =θ −θ ∗√

θ ∗, v′ =

θ ′−θ ∗√θ ∗

.

Then

ρ(θ ,θ ′) =(1+ vv′/k)k−1− vv′√

(1+ v2/k)k−1− v2

(1+ v′2/k)k−1− v′2 .

Interestingly, this form is identical to the one for normal kernel. ♦

We can easily verify all conditions of the theorem are satisfied in these exam-ples, when the parameter space Θ is confined to a compact subset of R.

Exponential distribution is another popular member of NEF-QVF. This dis-tribution does not satisfy the integrable upper bound condition in general. It issomewhat a surprise that many of results developed in the literature are not appli-cable to the exponential mixture model.

Page 90: Lecture Notes for Stat522B Jiahua Chen

86 CHAPTER 5. ASYMPTOTIC RESULTS IN FINITE MIXTURE MODELS

5.5.2 The proof of Theorem 5.2

We carefully analyze the asymptotic behaviour of the likelihood ratio test statisticsunder the true distribution f (x;θ ∗). For convenience, we rewrite the log likelihoodfunction under two-component mixture model as

`n(π,θ1,θ2) =n

∑i=1

log(1−π) f (Xi;θ1)+π f (Xi;θ2).

The new notation helps to highlight the detailed structure of the mixing distribu-tion G. As usual, we have used `n(·) in a very generic and non-rigorous fashion.Define

rn(π,θ1,θ2) = 2`n(π,θ1,θ2)− supθ∈Θ

`n(0,θ ,θ)

andRn = suprn(π,θ1,θ2) : π,θ1,θ2

over the region of 0 ≤ π ≤ 1/2, θ j ∈ Θ, j = 1,2. We may call rn(π,θ1,θ2) loglikelihood ratio functions.

To find the asymptotic distribution of Rn, it is convenient to partition rn intotwo parts:

rn(π,θ1,θ2) = 2`n(π,θ1,θ2)− `n(0,θ ∗,θ ∗)+2`n(0,θ ∗,θ ∗)− `n(0, θ , θ)= r1n(π,θ1,θ2)+ r2n,

where θ is the MLE of θ under the null model. Note that

−r2n =−2`n(0,θ ∗,0)− `n(0, θ , θ)

is an ordinary likelihood ratio (no mixtures involved) and hence an approximationis immediate as follows (Wilks 1938):

r2n =−n−1/2

∑ni=1Yi(θ

∗)2

EY 21 (θ

∗)+op(1). (5.22)

This expansion suffices in deriving the limiting distribution of Rn. The main taskis to analyze r1n(π,θ1,θ2).

Write

r1n(π,θ1,θ2) = 2n

∑i=1

log(1+δi),

Page 91: Lecture Notes for Stat522B Jiahua Chen

5.5. BRUTE-FORCE LIKELIHOOD RATIO TEST FOR HOMOGENEITY 87

where

δi = (1−π)

f (Xi,θ1)

f (Xi,θ ∗)−1+π

f (Xi,θ2)

f (Xi,θ ∗)−1

= (1−π)(θ1−θ∗)Yi(θ1)+π(θ2−θ

∗)Yi(θ2) (5.23)

and Yi(θ) is defined by (5.14).The main idea is as follows. By the Taylor expansion,

r1n(π,θ1,θ2) = 2n

∑i=1

δi−n

∑i=1

δ2i + εn.

We need to argue that when the sample size n is large, the remainder εn is negli-gible uniformly in the mixing parameters. Negligibility relies on consistency ofthe MLE’s of the parameters. By Lemma 5.1, the MLE of θ1 is consistent. Theproblem is that the MLE’s of θ2 and π

Our solution is to consider the case of |θ2− θ ∗| ≤ ε for an arbitrarily smallε > 0, and that of |θ2−θ ∗|> ε , separately. In the process, we used sandwich idea.Let Rn(ε; I) denote the supremum of rn(π,θ1,θ2) with the restriction |θ2−θ ∗|>ε , and Rn(ε; II) the supremum with |θ2−θ ∗| ≤ ε .

Case I: |θ2−θ ∗|> ε .

In this case, let π(θ2) and θ1(θ2) be the MLE’s of π and θ1 with fixed θ2 ∈Θ suchthat |θ2− θ ∗| > ε . The consistency results of Lemma 5.1 remain true for π(θ2)

and θ1(θ2). For simplicity of notation, we write π = π(θ2) and θ1 = θ1(θ2).First we establish an upper bound on Rn(ε; I). By the inequality 2 log(1+x)≤

2x− x2 +(2/3)x3, we have

r1n(π,θ1,θ2) = 2n

∑i=1

log(1+δi)≤ 2n

∑i=1

δi−n

∑i=1

δ2i +

23

n

∑i=1

δ3i , (5.24)

whereδi = (1−π)(θ1−θ

∗)Yi(θ1)+π(θ2−θ∗)Yi(θ2)

as defined in (5.23). Replacing θ1 with θ ∗ in the quantity Yi(θ1) gives

δi = (1−π)(θ1−θ∗)Yi(θ

∗)+π(θ2−θ∗)Yi(θ2)+ ein,

Page 92: Lecture Notes for Stat522B Jiahua Chen

88 CHAPTER 5. ASYMPTOTIC RESULTS IN FINITE MIXTURE MODELS

where the remainder is given by

ein = (1−π)(θ1−θ∗)Yi(θ1)−Yi(θ

∗).

In terms of Zi defined in (??), ein =(1−π)(θ1−θ ∗)2Zi(θ1). Since n−1/2∑

ni=1 Zi(θ1)

is tight, it is Op(1), so

en =n

∑i=1

ein = n1/2(1−π)(θ1−θ∗)2n−1/2

n

∑i=1

Zi(θ1) = n1/2(θ1−θ∗)2Op(1).

Similarly, replace Yi(θ1) with Yi(θ∗) in the square and cubic terms of δi, which

results in a remainder having a higher order than en. Consequently,

r1n(π,θ1,θ2) ≤ 2n

∑i=1(1−π)(θ1−θ

∗)Yi(θ∗)+π(θ2−θ

∗)Yi(θ2)

−n

∑i=1(1−π)(θ1−θ

∗)Yi(θ∗)+π(θ2−θ

∗)Yi(θ2)2

+23

n

∑i=1(1−π)(θ1−θ

∗)Yi(θ∗)+π(θ2−θ

∗)Yi(θ2)3

+ n1/2(θ1−θ∗)2Op(1).

Now write

(1−π)(θ1−θ∗)Yi(θ

∗)+π(θ2−θ∗)Yi(θ2) = m1Yi(θ

∗)+m2Zi(θ2),

where

m1 = (1−π)(θ1−θ∗)+π(θ2−θ

∗); (5.25)

m2 = π(θ2−θ∗)2. (5.26)

Hence

r1n(π,θ1,θ2) ≤ 2n

∑i=1m1Yi(θ

∗)+m2Zi(θ2)−n

∑i=1m1Yi(θ

∗)+m2Zi(θ2)2

+23

n

∑i=1m1Yi(θ

∗)+m2Zi(θ2)3 +n1/2(θ1−θ∗)2Op(1).(5.27)

Page 93: Lecture Notes for Stat522B Jiahua Chen

5.5. BRUTE-FORCE LIKELIHOOD RATIO TEST FOR HOMOGENEITY 89

Sincen−1

∑|Yi(θ∗)|3 + |Zi(θ2)|3= Op(1)

uniformly in θ2, and since

n−1∑m1Yi(θ

∗)+m2Zi(θ2)2

converges uniformly to a positive definite quadratic form in m1 and m2 uniformlyin θ2 (Lemma 5.2), it follows that

∑ |m1Yi(θ∗)+m2Zi(θ2)|3

∑m1Yi(θ ∗)+m2Zi(θ2)2 ≤max(|m1|, |m2|)Op(1).

Letm1 = (1− π)(θ1−θ

∗)+ π(θ2−θ∗), m2 = π(θ2−θ

∗)2.

By Lemma 1, max(|m1|, |m2|) = op(1) uniformly in θ2. Consequently, it followsthat

n

∑i=1m1Yi(θ

∗)+ m2Zi(θ2)3 = op

[n

∑i=1m1Yi(θ

∗)+ m2Zi(θ2)2

]. (5.28)

The remainder n1/2(θ1−θ ∗)2Op(1) in (5.27) is also negligible when compared tothe square terms. To see this, recall that 0≤ π ≤ 1/2, |θ2−θ ∗| ≥ ε and θ1−θ ∗ =

op(1). We have

n1/2(θ1−θ∗)2 ≤ 4n1/2|m1|+ |π(θ2−θ

∗)|2

≤ 4n1/2(|m1|+ m2/ε)2

≤ 8n1/2(m21 + m2

2/ε2)

= op

[n

∑i=1m1Yi(θ

∗)+ m2Zi(θ2)2

]. (5.29)

Combining (5.27), (5.28) and (5.29) yields

r1n(π, θ1,θ2) ≤ 2n

∑i=1m1Yi(θ

∗)+ m2Zi(θ2)

−n

∑i=1m1Yi(θ

∗)+ m2Zi(θ2)21+op(1), (5.30)

Page 94: Lecture Notes for Stat522B Jiahua Chen

90 CHAPTER 5. ASYMPTOTIC RESULTS IN FINITE MIXTURE MODELS

uniformly in θ2.To clarify the role each component plays, we can use the quantity Wi(θ) intro-

duced in (5.16). Since EYi(θ∗)Wi(θ) = 0, Yi(θ

∗) and Wi(θ) are orthogonal sothat

n

∑i=1

Yi(θ∗)Wi(θ2) = Op(n1/2),

uniformly in θ2. Hence (5.30) becomes

r1n(π, θ1,θ2) ≤ 2n

∑i=1tYi(θ

∗)+ m2Wi(θ2)

− t2n

∑i=1

Y 2i (θ

∗)+ m22

n

∑i=1

W 2i (θ2)1+op(1),

where t = m1 + m2h(θ2). For fixed θ2, consider the quadratic function

Q(t,m2) = 2n

∑i=1tYi(θ

∗)+m2Wi(θ2)−

t2

n

∑i=1

Y 2i (θ

∗)+m22

n

∑i=1

W 2i (θ2)

.

(5.31)Over the region of m2 ≥ 0, for any θ2 fixed, Q(t,m2) is maximized at t = t andm2 = m2, with

t =∑Yi(θ

∗)

∑Y 2i (θ

∗), m2 =

∑Wi(θ2)+

∑W 2i (θ2)

. (5.32)

Thus

r1n(π, θ1,θ2)≤Q(t, m2)+op(1) =∑n

i=1Yi(θ∗)2

∑ni=1Y 2

i (θ∗)

+[∑n

i=1Wi(θ2)+]2

∑ni=1W 2

i (θ2)+op(1).

Finally, by integrable upper bound condition,

n−1∑Y 2

i (θ∗) = EY 2

1 (θ∗)+op(1)

andn−1

∑W 2i (θ2) = EW 2

1 (θ2)+op(1).

Therefore,

r1n(π, θ1,θ2)≤n−1/2

∑ni=1Yi(θ

∗)2

EY 21 (θ

∗)+

[n−1/2∑

ni=1Wi(θ2)+]2

EW 21 (θ2)

+op(1).

(5.33)

Page 95: Lecture Notes for Stat522B Jiahua Chen

5.5. BRUTE-FORCE LIKELIHOOD RATIO TEST FOR HOMOGENEITY 91

From (5.33) and (5.22),

rn(π, θ1,θ2) = r1n(π, θ1,θ2)+ r2n ≤[n−1/2

∑Wi(θ2)+]2

EW 21 (θ2)

+op(1). (5.34)

Hence we have established an upper bound on Rn(ε; I) as follows:

Rn(ε; I)≤ sup|θ−θ∗|>ε

[n−1/2∑

ni=1Wi(θ)+]2

EW 21 (θ)

+op(1).

To obtain a lower bound of Rn(ε; I), for any θ2 fixed such that |θ2−θ ∗| ≥ ε ,let π and θ1 to be the values determined by t and m2 as given in (5.32). Since t =

Op(n−1/2), m2 =Op(n−1/2) uniformly in θ2, and |h(θ2)| ≤√

EZ21(θ2)/EY 2

1 (θ∗)

is a bounded quantity, it follows that π = Op(n−1/2) and θ1 = Op(n−1/2) uni-formly in θ2 satisfying |θ2−θ ∗| ≥ ε . Consider the following Taylor expansion

r1n(π, θ1,θ2) = 2n

∑i=1

δi−n

∑i=1

δ2i (1+ ηi)

−2,

where |ηi|< |δi| and δi is the δi in (5.23) with π = π and θ1 = θ1. We have

|δi| ≤ (|θ1|+ |π|M) max1≤i≤n

[supθ∈Θ

|Yi(θ)|].

By integrable upper bound condition, |Yi(θ)|4+δ ≤ g(Xi) and Eg(Xi) is finite.Hence

max1≤i≤n

[supθ∈Θ

|Yi(θ)|] = op(n1/4),

implying max(|ηi|) = op(1) uniformly. It follows that

r1n(π, θ1,θ2) = 2n

∑i=1

δi−n

∑i=1

δ2i 1+op(1).

Applying the argument leading to (5.34), we know that with π and θ1,

rn(π, θ1,θ2)≥[n−1/2

∑ni=1Wi(θ2)+]2

EW 21 (θ2)

+op(1).

Therefore,

supπ,θ1

rn(π,θ1,θ2)≥ rn(π, θ1,θ2)≥[n−1/2

∑ni=1Wi(θ2)+]2

EW 21 (θ2)

+op(1).

Combining with (5.34), we thus arrive at the following result.

Page 96: Lecture Notes for Stat522B Jiahua Chen

92 CHAPTER 5. ASYMPTOTIC RESULTS IN FINITE MIXTURE MODELS

Lemma 5.4 Suppose all conditions specified in this section. Let Rn(ε; I) be thelikelihood ratio statistic with restriction |θ2−θ ∗|> ε . Then when f (x,θ ∗) is thetrue null distribution, as n→ ∞,

Rn(ε; I) = sup|θ−θ∗|>ε

[n−1/2∑

ni=1Wi(θ)+]2

EW 21 (θ)

+op(1).

The asymptotic distribution of n−1/2∑Wi(θ)/

√EW 2

1 (θ) when f (x,θ ∗) is thetrue null distribution, is a Gaussian process with mean 0, variance 1 and autocor-relation function ρ(θ ,θ ′) as (5.17).

Case II: |θ2−θ ∗| ≤ ε .

When θ2 is in an arbitrarily small neighbourhood of θ ∗, some complications ap-pear since θ1−θ ∗ and θ2−θ ∗ are confounded, so that n1/2(θ1−θ ∗)2 is no longernegligible when compared to n1/2(θ2− θ ∗)2. However, in this case, θ1 and θ2

can be treated equally, so that the usual quadratic approximation to the likelihoodbecomes possible. Since the MLE of θ1 is consistent, in addition to |θ2−θ ∗| ≤ ε ,we can restrict θ1 in the following analysis to the region of |θ1−θ ∗| ≤ ε .

In the sequel, π , θ1 and θ2 denote the MLE’s of π , θ1, and θ2 within the regiondefined by 0≤ π ≤ 1/2, |θ1−θ ∗| ≤ ε and |θ2−θ ∗| ≤ ε .

Again, we start with (5.24). Replacing both θ1 and θ2 in δi with θ ∗, we have

δi = m1Yi(θ∗)+m2Zi(θ

∗)+ ein,

where m1 is the same as before, but

m2 = (1−π)(θ1−θ∗)2 +π(θ2−θ

∗)2.

Note that |m1| ≤ ε and m2 ≤ ε2. The remainder term now becomes

ein = (1−π)(θ1−θ∗)2Zi(θ1)−Zi(θ

∗)+π(θ2−θ∗)2Zi(θ2)−Zi(θ

∗).

By integrable upper bound condition,

en =n

∑i=1

ein = n1/2(1−π)|(θ1−θ∗)|3 +π|(θ2−θ

∗)|3Op(1).

Page 97: Lecture Notes for Stat522B Jiahua Chen

5.5. BRUTE-FORCE LIKELIHOOD RATIO TEST FOR HOMOGENEITY 93

Thus

r1n(π,θ1,θ2) ≤ 2n

∑i=1m1Yi(θ

∗)+m2Zi(θ∗)−

n

∑i=1m1Yi(θ

∗)+m2Zi(θ∗)2

+23

n

∑i=1m1Yi(θ

∗)+m2Zi(θ∗)3

+ n1/2(1−π)|(θ1−θ∗)|3 +π|(θ2−θ

∗)|3Op(1). (5.35)

Using the same argument as in Case I,

∑ |m1Yi(θ∗)+m2Zi(θ2)|3

∑m1Yi(θ ∗)+m2Zi(θ2)2 ≤max(|m1|, |m2|)Op(1)≤ εOp(1).

Equation (5.35) reduces to

r1n(π,θ1,θ2) ≤ 2n

∑i=1m1Yi(θ

∗)+m2Zi(θ∗)

−n

∑i=1m1Yi(θ

∗)+m2Zi(θ∗)21+ εOp(1)

+ n1/2(1−π)|(θ1−θ∗)|3 +π|(θ2−θ

∗)|3Op(1).(5.36)

Let us analyze the remainder in terms of the MLE’s:

n1/2(1− π)|(θ1−θ∗)|3 + π|(θ2−θ

∗)|3 = op(1)+ εOp(1)n1/2m2

≤ op(1)+ εOp(1)(1+nm22)

≤ εOp(1)+ εOp(nm22).

Consequently, in terms of the MLE’s, (5.36) reduces to

r1n(π, θ1, θ2) ≤ 2n

∑i=1m1Yi(θ

∗)+ m2Zi(θ∗) (5.37)

−n

∑i=1m1Yi(θ

∗)+ m2Zi(θ∗)21+ εOp(1)+ εOp(1).(5.38)

Note that in the above, the term εOp(nm22) has been absorbed into the quadratic

Page 98: Lecture Notes for Stat522B Jiahua Chen

94 CHAPTER 5. ASYMPTOTIC RESULTS IN FINITE MIXTURE MODELS

quantity. By orthogonalization, we have

r1n(π, θ1, θ2) ≤ 2n

∑i=1tYi(θ

∗)+ m2Wi(θ∗)

t2

n

∑i=1

Y 2i (θ

∗)+ m22

n

∑i=1

W 2i (θ

∗)

1+ εOp(1)+ εOp(1),

where Wi(θ) is defined in (5.16) and t = m1 + m2h(θ ∗)Applying the same technique leading to (5.31), we get

r1n(π, θ1, θ2)≤ 1+ εOp(1)−1[∑Yi(θ

∗)2

∑Y 2i (θ

∗)+

[∑Wi(θ∗)+]2

∑W 2i (θ

∗)

]+ εOp(1).

By (5.22),

rn(π, θ1, θ2) ≤εOp(1)

1+ εOp(1)n−1/2

∑Yi(θ∗)2

EY 21 (θ

∗)

+[n−1/2

∑Wi(θ∗)+]2

1+ εOp(1)EW 21 (θ

∗)+ εOp(1)

=[n−1/2

∑Wi(θ∗)+]2

EW 21 (θ

∗)+ εOp(1).

Therefore,

Rn(ε; II)≤ [n−1/2∑Wi(θ

∗)+]2

EW 21 (θ

∗)+ εOp(1).

Next, let θ2 = θ ∗, and π , θ1 be determined by

m1 +m2h(θ ∗) =∑Yi(θ

∗)

∑Y 2i (θ

∗), m2 =

∑Wi(θ∗)+

∑W 2i (θ

∗).

The rest of the proof is the same as that in Case I, and we get

Rn(ε; II)≥ [n−1/2∑Wi(θ

∗)+]2

EW 21 (θ

∗)+op(1).

Page 99: Lecture Notes for Stat522B Jiahua Chen

5.5. BRUTE-FORCE LIKELIHOOD RATIO TEST FOR HOMOGENEITY 95

Lemma 5.5 Suppose that Conditions 1–5 hold. Let Rn(ε; II) be the likelihoodratio statistic with restriction |θ2−θ ∗| ≤ ε for any arbitrarily small ε > 0. Whenf (x,θ ∗) is the true null distribution, as n→ ∞,

[n−1/2∑Wi(θ

∗)+]2

EW 21 (θ

∗)+op(1)≤ Rn(ε; II)≤ [n−1/2

∑Wi(θ∗)+]2

EW 21 (θ

∗)+ εOp(1),

where Wi(θ∗) is defined by (5.16).

Proof of the Theorem. For any small ε > 0, Rn = maxRn(ε; I),Rn(ε; II). ByLemmas 5.3 and 5.5,

Rn ≤max

[sup

|θ−θ∗|>ε

[n−1/2

∑Wi(θ)+]2

EW 21 (θ)

,[n−1/2

∑Wi(θ∗)+]2

EW 21 (θ

∗)+ εOp(1)

]

plus a term in op(1), and

Rn ≥max

[sup

|θ−θ∗|>ε

[n−1/2

∑Wi(θ)+]2

EW 21 (θ)

,[n−1/2

∑Wi(θ∗)+]2

EW 21 (θ

∗)

]+op(1).

Since n−1/2∑Wi(θ)/

√EW 2

1 (θ) converges to the Gaussian process W (θ), θ ∈Θ, with mean 0, variance 1 and autocorrelation function ρ(θ ,θ ′) which is givenby (5.17), the theorem follows by first letting n→ ∞ and then letting ε → 0. ♦