
Lecture Notes

Weak convergence of stochastic processes

Thomas Mikosch

(2005)

Laboratory of Actuarial Mathematics, University of Copenhagen


Contents

1 Some motivating examples

2 Convergence in Euclidean space
  2.1 Definition
  2.2 The continuous mapping theorem
  2.3 The Portmanteau theorem
  2.4 The Slutsky argument
  2.5 The univariate and multivariate central limit theorems


1 Some motivating examples

Consider an iid sequence of real-valued random variables X, X_i, i = 1, 2, ..., and construct the corresponding random walk process (S_n) from it:

S_0 = 0 ,   S_n = X_1 + ··· + X_n ,   n = 1, 2, ... .   (1.1)

Two invariance principles have stimulated and characterized the development of probability theory and statistics: the strong law of large numbers,

n^{−1} S_n →_{a.s.} EX ,

given that EX is well defined (possibly infinite), and the central limit theorem,

n^{−1/2}(S_n − ES_n) →_d N(0, var(X)) ,

provided that var(X) is finite. These two results have been modified in various ways, among others by extending them to stochastic processes.

Example 1.1 (Glivenko-Cantelli and empirical central limit theorem) Consider the empirical distribution function of the iid sample X_1, ..., X_n with common distribution function F:

F_n(x) = n^{−1} ∑_{i=1}^n I_{(−∞,x]}(X_i) ,   x ∈ R .

Notice the similarity between the sample mean X̄_n = n^{−1} S_n and F_n: both are averages of random elements. The main difference is that F_n is an average of genuine stochastic processes, in contrast to X̄_n, which is an average of real-valued random variables. Nevertheless, the sequence of stochastic processes (F_n) satisfies a strong law of large numbers, commonly known as the Glivenko-Cantelli theorem:

‖F_n − F‖_∞ = sup_x |F_n(x) − F(x)| →_{a.s.} 0 .

The Glivenko-Cantelli theorem is a consequence of the one-dimensional strong law of large numbers,

F_n(x) = n^{−1} ∑_{i=1}^n I_{(−∞,x]}(X_i) →_{a.s.} E I_{(−∞,x]}(X) = F(x) ,   x ∈ R ,

and the monotonicity of the functions F_n and F. For a proof, see for example Billingsley [2]. By definition, the stochastic processes F_n and F assume values in the space of right-continuous functions with left limits (cadlag functions), equipped with the uniform topology or supremum norm (sup-norm). We will see later that this space is rather complicated, but it is one of the spaces most often used in the theory of stochastic processes. Convergence in this space is usually not determined by the sup-norm.
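The following minimal Python sketch is not part of the original notes; it illustrates the Glivenko-Cantelli theorem for U(0, 1) samples, for which F(x) = x, so ‖F_n − F‖_∞ can be computed exactly from the order statistics. All names and parameter choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def sup_distance(sample):
    # ||F_n - F||_inf for U(0,1) data: the supremum over x is attained
    # at a jump of F_n, so the sorted sample suffices.
    x = np.sort(sample)
    n = len(x)
    upper = np.arange(1, n + 1) / n - x   # F_n(x_(i)) - x_(i)
    lower = x - np.arange(0, n) / n       # x_(i) - F_n(x_(i)-)
    return max(upper.max(), lower.max())

for n in [10, 100, 1000, 10_000]:
    print(n, sup_distance(rng.uniform(size=n)))
# The printed sup-distances shrink towards zero, as the theorem predicts.
```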


Nevertheless, one may ask about an analog of the central limit theorem for F_n. It does indeed exist. Again, the one-dimensional central limit theorem helps us to understand what is going on. Notice that

E F_n(x) = F(x)   and   var(F_n(x)) = n^{−1} F(x)(1 − F(x)) .

Moreover, for fixed x, n F_n(x) is Bin(n, F(x)) distributed, hence it satisfies the central limit theorem

U_n(x) = n^{1/2}(F_n(x) − E F_n(x)) = n^{1/2}(F_n(x) − F(x)) →_d Y ∼ N(0, F(x)(1 − F(x))) .

The multivariate central limit theorem is also applicable and gives joint convergence of the finite-dimensional distributions, i.e., of the vectors (U_n(x_1), ..., U_n(x_k))′ towards the distribution of a k-dimensional normal vector Y_k, say. To convince ourselves how the multivariate central limit theorem applies, consider the case k = 2; the cases k > 2 are analogous. Assume x_1 < x_2. Then

n (F_n(x_1), F_n(x_2))′ = ∑_{i=1}^n ( I_{(−∞,x_1]}(X_i), I_{(−∞,x_2]}(X_i) )′ = S_n

is a simple partial sum process of iid two-dimensional random vectors with finite second moment. The central limit theorem applies and tells us that

n^{−1/2}(S_n − E S_n) →_d Y ∼ N(0, Σ) ,

where σ_{ii} = var(I_{(−∞,x_i]}(X_1)) = F(x_i)(1 − F(x_i)) and σ_{12} = cov(I_{(−∞,x_1]}(X_1), I_{(−∞,x_2]}(X_1)) = F(x_1)(1 − F(x_2)).

If we specify the X_i's to be iid U(0, 1) random variables, i.e., F(x) = x, x ∈ (0, 1), then the limiting covariance structure of the normal vector Y_k = (Y_1, ..., Y_k)′ is given by cov(Y_i, Y_j) = x_i ∧ x_j − x_i x_j. A Gaussian process B = (B_t)_{t∈[0,1]} on the interval [0, 1] is well defined in the space of all measurable functions on [0, 1] through its covariance function cov(B_t, B_s). Indeed, the Gaussian finite-dimensional distributions of B are completely determined by this function, and then Kolmogorov's consistency theorem applies. The particular covariance function t ∧ s − st is well known: it belongs to the so-called Brownian bridge B_t = W_t − t W_1, where W = (W_t)_{t∈[0,1]} is a standard Brownian motion on [0, 1] with covariance function cov(W_t, W_s) = s ∧ t. It can be shown that the uniform empirical process n^{1/2}(F_n − F) converges in distribution to the Brownian bridge B. This is the analog of the central limit theorem for the empirical distribution function and is referred to as the uniform empirical central limit theorem.
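A hedged numerical sketch (our own illustration, not from the notes): across many replications, the empirical process evaluated at two points has a covariance close to the Brownian-bridge covariance s ∧ t − st.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 2000, 5000
s, t = 0.3, 0.7

U = rng.uniform(size=(reps, n))
Un_s = np.sqrt(n) * ((U <= s).mean(axis=1) - s)   # U_n(s), one value per replication
Un_t = np.sqrt(n) * ((U <= t).mean(axis=1) - t)   # U_n(t)

# Empirical covariance vs. the Brownian-bridge covariance s ∧ t - s t.
print(np.cov(Un_s, Un_t)[0, 1], min(s, t) - s * t)
```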

The empirical distribution function gives us an example where we can understand what the notion of convergence in distribution of stochastic processes might mean. We mention here that the convergence in distribution of the empirical process is not just the weak convergence of the finite-dimensional distributions to the limiting finite-dimensional distributions, i.e., convergence in distribution of the vectors (U_n(x_1), ..., U_n(x_k))′ towards (B_{x_1}, ..., B_{x_k})′:

(U_n(x_1), ..., U_n(x_k))′ →_d (B_{x_1}, ..., B_{x_k})′


for any choice of x_1, ..., x_k ∈ [0, 1]. For a counterexample, see Example 1.2 below. Nevertheless, the weak convergence of the finite-dimensional distributions will turn out to be necessary for the weak convergence of stochastic processes.

At this point we want to say in a vague way what we mean by weak convergence and by convergence in distribution. For a sequence of random vectors X_n = (X_1^{(n)}, ..., X_k^{(n)})′ and a limiting vector X = (X_1, ..., X_k)′ we can characterize convergence in distribution by the convergence of the distribution functions

G_n(x) = P(X_1^{(n)} ≤ x_1 , ..., X_k^{(n)} ≤ x_k) → P(X_1 ≤ x_1, ..., X_k ≤ x_k) = G(x)

for all continuity points x ∈ R^k of G. In other words, we characterize convergence in distribution by the convergence of the underlying distributions or distribution functions. We write G_n →_w G and refer to this as weak convergence of the underlying probability measures.

In a way, weak convergence and convergence in distribution are exchangeable notions. In the former case one stresses the convergence of the underlying probability measures without caring about the existence of stochastic processes which would have to be constructed with the given measures as their distributions. In the latter case we assume we have stochastic processes with a given structure, and then we pass to their distributions and study their weak convergence.

Another goal of this course is to show that weak convergence of stochastic processes is in a sense equivalent to the weak convergence of the distributions of any continuous mapping acting on the weakly converging stochastic processes. Indeed, from any reasonable notion of weak convergence we would expect that it is preserved under continuous mappings. For example, consider the uniform empirical processes (U_n) on [0, 1]. Then we already know that U_n →_d B, where B is the Brownian bridge. Consider the following mappings from the space of cadlag functions on [0, 1] to R (these mappings are easily seen to be continuous when one takes the sup-norm ‖·‖_∞ to determine convergence, but as said above this is in general not the right thing to do):

f_1(x) = ‖x‖_∞ ,   f_2(x) = sup_{t∈[0,1]} x(t) ,   f_3(x) = ∫_0^1 (x(t))^2 dt .

Once we have made clear what the metric in this function space is, it will not be difficult to see that these real-valued mappings are continuous. Then one can indeed show that

f_j(U_n) →_d f_j(B) ,   j = 1, 2, 3 .

These limit relations are well known in statistics, where they are used for constructing goodness of fit tests. Indeed, the relations

f_1(U_n) = n^{1/2} sup_{t∈[0,1]} |F_n(t) − t| →_d sup_{t∈[0,1]} |B(t)| ,

f_2(U_n) = n^{1/2} sup_{t∈[0,1]} (F_n(t) − t) →_d sup_{t∈[0,1]} B(t)


are used to construct the two- and one-sided Kolmogorov-Smirnov tests for goodness of fit of the uniform distribution on (0, 1). The relation

f_3(U_n) = n ∫_0^1 (F_n(t) − t)^2 dt →_d ∫_0^1 B_t^2 dt

is used to construct the Cramer-von Mises test for goodness of fit of the uniform distribution. The distributions of the limiting functionals are well known and tabulated; see e.g. Shorack and Wellner [13].
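A small sketch (assumed setup: U(0, 1) data; the order-statistic formulas below are the standard closed forms) that computes the Kolmogorov-Smirnov statistic f_1(U_n) and the Cramer-von Mises statistic f_3(U_n):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = np.sort(rng.uniform(size=n))
i = np.arange(1, n + 1)

# f_1(U_n): two-sided Kolmogorov-Smirnov statistic sqrt(n) * sup_t |F_n(t) - t|.
ks = np.sqrt(n) * max((i / n - x).max(), (x - (i - 1) / n).max())

# f_3(U_n): Cramer-von Mises statistic n * integral of (F_n(t) - t)^2 dt.
cvm = 1.0 / (12 * n) + ((x - (2 * i - 1) / (2 * n)) ** 2).sum()

print(ks, cvm)   # to be compared with tabulated critical values
```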

Next we give an example of some processes for which the convergence of the finite-dimensional distributions is not sufficient for the weak convergence of the processes to a limit.

Example 1.2 (The convergence of the finite-dimensional distributions is not sufficient for weak convergence of stochastic processes) We consider an iid sequence (X_n) with standard Frechet distribution function Φ_α(x) = e^{−x^{−α}}, x > 0, for some α > 0. It is an extreme value distribution (see Embrechts et al. [6]), and it is easily seen that

M_n = max(X_1, ..., X_n) =_d n^{1/α} X ,   n = 1, 2, ... .

We construct the processes

Y_n(t) = n^{−1/α} X_i for t = i/n , i = 1, ..., n ,   Y_n(0) = 0 ,

with linear interpolation of the points (i/n, n^{−1/α} X_i) elsewhere for t ∈ [0, 1].

The processes Y_n have continuous sample paths, i.e., they live in the space C[0, 1] of continuous functions on [0, 1], equipped with the sup-norm. It is easy to see that the mapping f(x) = ‖x‖_∞ is continuous as a function from C[0, 1] to R. If we had convergence in distribution Y_n →_d Y for some limit Y (we would expect the limit to have continuous sample paths too), we would expect that the continuous mapping theorem gives us f(Y_n) →_d f(Y). Notice that

f(Y_n) = n^{−1/α} M_n =_d X .

On the other hand, we can characterize the limiting process Y. For every fixed t it is not difficult to see that Y_n(t) →_P 0, hence also (Y_n(t_1), ..., Y_n(t_k))′ →_P 0 for any choice of t_i ∈ [0, 1], and so the limiting process Y would satisfy Y =_d o in the sense of the finite-dimensional distributions, where o is the zero process on [0, 1]. It satisfies f(o) = 0, and therefore f(Y_n) →_d f(Y) is impossible in this case.

We will see later that in this situation the so-called tightness condition on the processes Y_n is not satisfied; in addition to the convergence of the finite-dimensional distributions, tightness will turn out to be another necessary condition for weak convergence of stochastic processes. Roughly speaking, tightness of (Y_n) means that the processes Y_n "must not oscillate too wildly", so that "probability mass cannot disappear from good (= compact) sets".
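A hedged simulation sketch of Example 1.2 (illustrative choices: α = 1, inverse-cdf sampling): the value Y_n(1/2) collapses to zero while the supremum f(Y_n) = n^{−1/α} M_n keeps the fixed Frechet distribution of X.

```python
import numpy as np

rng = np.random.default_rng(4)
alpha = 1.0

def frechet(size):
    # Inverse of Phi_alpha(x) = exp(-x^(-alpha)) applied to uniforms.
    return (-np.log(rng.uniform(size=size))) ** (-1.0 / alpha)

for n in [10, 100, 1000, 10_000]:
    X = frechet((500, n))                          # 500 paths of length n
    mid = n ** (-1.0 / alpha) * X[:, n // 2 - 1]   # Y_n(1/2), per path
    sup = n ** (-1.0 / alpha) * X.max(axis=1)      # f(Y_n), per path
    print(n, np.median(mid), np.median(sup))
# The median of Y_n(1/2) shrinks to 0, while the median of the supremum
# stays near that of X (about 1/log 2 for alpha = 1): no tightness.
```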


In 1949, Doob [5] gave a heuristic argument for how the asymptotic distribution of the Kolmogorov-Smirnov statistic could be derived by using a continuous mapping argument as described above. The original papers on the Kolmogorov-Smirnov and Cramer-von Mises goodness of fit tests did not use such an approach; they rather focused on the particular properties of those statistics. Doob's argument initiated one of the major developments in the theory of stochastic processes in the 1950s. Some of the most famous probabilists contributed to the area, including Donsker [3, 4], Prokhorov [11], and Skorokhod [14, 15].

Donsker, in particular, contributed to the understanding of Brownian motion as a limiting process of scaled and centered random walks.

Example 1.3 (The Donsker invariance principle) Brownian motion is one of the most fundamental processes in modern probability theory. It is used in the theory of stochastic integration as one of the basic ingredients of Ito integration; the Brownian bridge and many other important processes are derived from it. Brownian motion is the most popular process in mathematical finance and has been in use in physics for more than 100 years, including the important contributions of Einstein, Wiener and Levy. However, Brownian motion is not a process which has a "physical" meaning. Its sample paths are nowhere differentiable, a property which is not observable in real life. But it is an approximation to a random walk (S_n), see (1.1) for its definition, on large and small scales.

Define the following continuous time process:

S_n(t) = S_i for t = i/n , i = 0, ..., n ,

with linear interpolation elsewhere for t ∈ [0, 1].

The process S_n(·) so defined has sample paths in C[0, 1]. Donsker [3] showed that

S_n = ( n^{−1/2}(S_n(t) − E S_n(t)) )_{t∈[0,1]} →_d σW ,

where σ^2 = var(X) is assumed to be finite and, as before, W is standard Brownian motion on [0, 1]. The convergence refers to convergence in distribution in the space C[0, 1] of continuous functions on [0, 1] equipped with the sup-norm.

This means that we can consider Brownian motion as some kind of continuous time random walk. Donsker's invariance principle gives us a justification for this notion. Also in this case, the continuous mapping theorem applies: for any continuous mapping f on C[0, 1] we have f(S_n) →_d f(W). For example, take the mapping

f(x) = ( min_{t∈[0,1]} x(t) , max_{t∈[0,1]} x(t) ) ,

which is easily seen to be a continuous mapping from C[0, 1] to R^2. Hence

f(S_n) = n^{−1/2} ( min_{t∈[0,1]} (S_n(t) − E S_n(t)) , max_{t∈[0,1]} (S_n(t) − E S_n(t)) )
       = n^{−1/2} ( min_{i=0,...,n} (S_i − i EX) , max_{i=0,...,n} (S_i − i EX) )
       →_d f(W) = ( min_{t∈[0,1]} W_t , max_{t∈[0,1]} W_t ) .
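A small sketch (illustrative step distribution and sample sizes, not from the notes) comparing the (min, max) functional of the scaled, centered random walk with the same functional of a simulated Brownian motion:

```python
import numpy as np

rng = np.random.default_rng(5)
n, paths = 1000, 2000

# Centered steps with variance 1, here from a shifted exponential law.
steps = rng.exponential(size=(paths, n)) - 1.0
S = np.cumsum(steps, axis=1) / np.sqrt(n)     # scaled walk at grid points i/n
walk_min = np.minimum(S.min(axis=1), 0.0)     # the path starts at S_0 = 0
walk_max = np.maximum(S.max(axis=1), 0.0)

# Brownian motion on the same grid.
W = np.cumsum(rng.normal(size=(paths, n)), axis=1) / np.sqrt(n)
bm_min = np.minimum(W.min(axis=1), 0.0)
bm_max = np.maximum(W.max(axis=1), 0.0)

# The empirical quantiles of the two (min, max) samples should nearly agree.
print(np.quantile(walk_max, [0.5, 0.9]), np.quantile(bm_max, [0.5, 0.9]))
print(np.quantile(walk_min, [0.1, 0.5]), np.quantile(bm_min, [0.1, 0.5]))
```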


The distributions of the latter functionals of Brownian motion are tractable and have been tabulated. The same applies to the functionals of the Brownian bridge considered above. Here we see one obvious advantage of using the continuous time random walk, Brownian motion, instead of the random walk process S_n: the distribution of Brownian motion does not depend on the distribution of the step sizes X_i; it is invariant under the distributions of the X_i, and its finite-dimensional distributions are Gaussian distributions, which are more easily handled than those of S_n.

2 Convergence in Euclidean space

2.1 Definition

In this section we consider sequences of random variables and random vectors and reconsider their convergence from a more abstract point of view. As a motivation, we start with a sequence (X_n) of random variables with distributional limit X. We assume X and all X_n to be defined on an abstract probability space (Ω, F, P). Then we say that (X_n) converges in distribution to X if

G_n(x) = P(X_n ≤ x) → P(X ≤ x) = G(x)   (2.2)

holds for all continuity points x of G, i.e., all points x with P(X = x) = 0 or, equivalently, G(x−) = G(x). We then also say that the sequence of distribution functions (G_n) (the corresponding distributions) converges weakly to the distribution function G (the corresponding distribution). This is perhaps the most intuitive understanding of weak convergence or convergence in distribution on the real line. However, dealing with distribution functions is somewhat restrictive and not always easy. For example, if you wanted to verify the central limit theorem for the partial sums of an iid sequence with finite variance, it would be rather hard, if not impossible, to check (2.2) directly by calculating the distribution functions G_n. One way out of this situation is often used in courses on probability theory: Fourier methods (e.g. the convergence of characteristic functions) are elegant tools to prove convergence in distribution.

The convergence in distribution of random vectors (X_n) can still be described in terms of distribution functions, and characteristic functions are also adequate tools in this situation. However, we intend to study the convergence of infinite-dimensional objects such as stochastic processes on intervals. There the notion of a distribution function is in general meaningless, and therefore we will aim at a definition of convergence in distribution which does not depend on the structure of a finite-dimensional space.

Definition 2.1 (Convergence in distribution and weak convergence in Euclidean space) Assume the R^d-valued vectors X, X_n, n = 1, 2, ..., are defined on (Ω, F, P). Then the sequence (X_n) converges in distribution to X if the relation

E f(X_n) = ∫ f(x) P(X_n ∈ dx) → ∫ f(x) P(X ∈ dx) = E f(X)   (2.3)

holds for all bounded, continuous real-valued functions f on R^d (we write f ∈ C(R^d)).


Assume that P, P_n, n = 1, 2, ..., are distributions on the Borel σ-field B(R^d) of R^d. Then the sequence (P_n) converges weakly to P if

E_{P_n} f(X) = ∫_{R^d} f(x) P_n(dx) → ∫_{R^d} f(x) P(dx) = E_P f(X)

for all f ∈ C(R^d).

We conclude from this definition that P_n →_w P just means X_n →_d X for random vectors X_n and X with distributions P_n and P, respectively. Such random vectors can always be constructed on a sufficiently rich probability space. Therefore convergence in distribution and weak convergence are just different languages which are equivalent to each other, and so we will often restrict ourselves to the formulation of the results in one language.

Notice that the notation X_n →_d X is slightly misleading because it suggests that there is some "physical" or ω-wise convergence of the X_n's towards X (such as for a.s. convergence or for convergence in probability). However, →_d only refers to convergence of the underlying distributions.

Since we do not want to have different notions of convergence we will have to check that Definition 2.1 is consistent with the commonly used definition (2.2) for d = 1; see Remark 2.4 and Exercise 2.5 below.

In our presentation we closely follow some of the classical books in the area. The book which has been most influential as regards the propagation of weak convergence is Billingsley's classic [1], whose first edition appeared in 1968. More specific books which also treat the general theory of weak convergence of stochastic processes are Pollard [10] and Jacod and Shiryaev [8]. By now there exist many other books on more specific topics which also contain large parts on weak convergence.

2.2 The continuous mapping theorem

The continuous mapping theorem tells us that convergence in distribution is preserved under continuous and a.s. continuous transformations.

Theorem 2.2 (The continuous mapping theorem in Euclidean space, e.g. Pollard [10], III.2) Let h : R^d → R^s be measurable and let C be its set of continuity points. Suppose that

X_n →_d X in R^d and P(X ∈ C) = 1.

Then h(X_n) →_d h(X) in R^s.

Proof. We have to show that

E f(h(X_n)) → E f(h(X))

for all f ∈ C(R^s). Hence it suffices to show that E g(X_n) → E g(X) for all bounded, measurable real-valued g on R^d which are continuous at each point of the set C.


1) We construct bounded, continuous g_i ≤ g such that g_i ↑ g at each point of C. Without loss of generality assume g > 0 (add a constant, if necessary). Define

d(x, A) = inf{ |x − y| : y ∈ A } ,   A ⊂ R^d .

Since |·| is continuous, so is d(·, A). Define

g_{m,r}(x) = r ∧ [ m d(x, {g ≤ r}) ] ,   m ≥ 1 ,   r ∈ Q_+ .

Each g_{m,r} is bounded, continuous and g_{m,r} ≤ g. We have

sup_{m,r} g_{m,r}(x) = g(x) ,   x ∈ C .   (2.4)

Indeed, for x ∈ C and ε > 0 choose r ∈ Q_+ such that g(x) − ε < r < g(x). By continuity of g at x, r < g(y) in some neighborhood of x, hence

d(x, {g ≤ r}) > 0 and g_{m,r}(x) = r > g(x) − ε for large m.

Now write the g_{m,r} as a sequence h_1, h_2, ... and define

g_i = max(h_1, ..., h_i) .

By (2.4), g_i ≤ g, and g_i ↑ g at each point of C since ε can be made arbitrarily small.

2) Since X_n →_d X and each g_i is bounded and continuous, we have

E g_i(X_n) → E g_i(X) .   (2.5)

By the properties of (g_i),

lim inf E g(X_n) ≥ lim inf E g_i(X_n) = E g_i(X) .

Monotone convergence and the fact that P(X ∈ C) = 1 yield that

lim inf E g(X_n) ≥ E g(X) .

The same relation holds with g replaced by −g, which yields the statement.
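A quick Monte Carlo illustration of the theorem (our own hedged sketch, all choices illustrative): X_n = X + Z/n converges in distribution to X, and h = sign is continuous except at 0, a point the law of X does not charge, so E h(X_n) → E h(X).

```python
import numpy as np

rng = np.random.default_rng(6)
m = 200_000
X = rng.normal(loc=0.3, size=m)     # P(X = 0) = 0, so P(X in C) = 1 for h = sign
Z = rng.normal(size=m)

for n in [1, 10, 100, 1000]:
    print(n, np.sign(X + Z / n).mean())
print("limit:", np.sign(X).mean())  # the means approach E sign(X)
```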

Exercise 2.3 Suppose X = x a.s. for some constant x. How does Theorem 2.2 read in this case?

Remark 2.4 The classical notion of convergence in distribution on the real line is preserved. (The same statement is correct in Euclidean space, but we restrict ourselves to this simple case.) This follows from the continuous mapping theorem. Indeed, if we have convergence in distribution X_n →_d X in the sense of Definition 2.1, then

I_{(−∞,x]}(X_n) →_d I_{(−∞,x]}(X)

for those x for which f = I_{(−∞,x]} is a.s. continuous, i.e., the point of discontinuity x of this function has probability zero with respect to the distribution of X. In other words, the distribution function of X must be continuous at x, and then by dominated convergence

E I_{(−∞,x]}(X_n) = P(X_n ≤ x) → P(X ≤ x) = E I_{(−∞,x]}(X) .   (2.6)


This corresponds to the classical definition of convergence in distribution.

Conversely, assume the classical definition of convergence in distribution, i.e., (2.6) holds for all continuity points x of the distribution function G(·) = P(X ≤ ·). Any bounded, continuous function f on R can be approximated by a non-decreasing sequence of simple functions f_k (linear combinations of indicator functions of intervals of the form (−∞, x]), uniformly on R. Since there is only a countable number of discontinuities of the distribution function G, one can always choose the jump points of the simple functions f_k such that they are continuity points of G; hence their set of discontinuities has G-probability zero. Then the continuous mapping theorem applies: f_k(X_n) →_d f_k(X). Finally, use the dominated convergence theorem, the fact that ‖f − f_k‖_∞ can be made arbitrarily small, and the decomposition

E f(X) − E f(X_n) = E(f(X) − f_k(X)) + E(f_k(X) − f_k(X_n)) + E(f_k(X_n) − f(X_n))

to conclude that E f(X_n) → E f(X) for any bounded, continuous f.

Exercise 2.5 Make the converse part of the proof precise.

Remark 2.6 The functions g_i used in the above proof are bounded and uniformly continuous. Moreover, the proof shows that every g ∈ C(R^d) can be approximated by the uniformly continuous functions g_i, which also satisfy the limit relation (2.5). Thus, for X_n →_d X it suffices to prove that E g(X_n) → E g(X) for all bounded, uniformly continuous functions g on R^d.

Remark 2.7 We could have formulated the continuous mapping theorem as a weak convergence result as follows. Suppose P and P_n, n = 1, 2, ..., are distributions on R^d such that P_n →_w P, and h : R^d → R^s is a measurable function with set D of discontinuities. If P(D) = 0 then

P_n ∘ h^{−1} = P_n(h ∈ ·) →_w P(h ∈ ·) = P ∘ h^{−1} ,

i.e., there is weak convergence for the distributions induced by the P-a.e. continuous mapping h.

2.3 The Portmanteau theorem

The following result is useful for understanding what convergence in distribution or weak convergence means.

Theorem 2.8 (Portmanteau theorem, see e.g. Billingsley [1], Theorem 2.1) Let X, X_n, n = 1, 2, ..., be R^d-valued random vectors defined on the same probability space. Then the following are equivalent:

1. X_n →_d X.

2. E f(X_n) → E f(X) for all bounded, uniformly continuous functions f on R^d.


3. lim sup P(X_n ∈ F) ≤ P(X ∈ F) for all closed F.

4. lim inf P(X_n ∈ G) ≥ P(X ∈ G) for all open G.

5. lim P(X_n ∈ A) = P(X ∈ A) for all continuity sets A of the distribution of X, i.e., sets with P(X ∈ ∂A) = 0.

Proof. We have learned in Remark 2.6 that 1) and 2) are equivalent. The equivalence of 3) and 4) is immediate by taking complements. If 3) and 4) hold and P(X ∈ ∂A) = 0 then

P(X ∈ A) = P(X ∈ int A) ≤ lim inf P(X_n ∈ int A) ≤ lim sup P(X_n ∈ Ā) ≤ P(X ∈ Ā) = P(X ∈ A) .

(Ā stands for the closure, ∂A for the boundary and int A for the interior of A.) Hence 3) or 4) imply 5).

5) implies 3): Assume F closed. Define

d(x, F) = inf{ |x − y| : y ∈ F }   and   F_k = { x : d(x, F) ≤ δ_k }

for a sequence δ_k ↓ 0. The sets F_k are closed, F_k ↓ F, and hence

lim P(X ∈ F_k) = P(X ∈ F) .   (2.7)

Suppose for the moment that we can choose the δ_k such that P(X ∈ ∂F_k) = 0. Then 5) implies

lim sup P(X_n ∈ F) ≤ lim sup P(X_n ∈ F_k) = P(X ∈ F_k) ,

and an application of (2.7) completes the proof. Since

∂{ x : d(x, F) ≤ δ } ⊂ { x : d(x, F) = δ } ,

these boundaries are disjoint for distinct δ, and hence only a countable number of them can have positive probability. Now choose δ_k ↓ 0 such that each event {X ∈ ∂{ x : d(x, F) ≤ δ_k }} has probability zero.

2) implies 3): Suppose F is closed and define G = { x : d(x, F) < ε } for ε > 0. Fix δ > 0. For small ε > 0,

P(X ∈ G) < P(X ∈ F) + δ ,

since the sets G decrease to F as ε ↓ 0. Define

φ(t) = 1 for t ≤ 0 ,   φ(t) = 1 − t for 0 ≤ t ≤ 1 ,   φ(t) = 0 for t ≥ 1 .

The function

f(x) = φ(ε^{−1} d(x, F))


is uniformly continuous and satisfies

f(x) = 1 for x ∈ F ,   f(x) = 0 for x ∈ G^c ,   0 ≤ f(x) ≤ 1 for all x .

(G^c denotes the complement of G.) By 2), E f(X_n) → E f(X). Moreover,

P(X_n ∈ F) = E I_F(X_n) f(X_n) ≤ E f(X_n) ,

E f(X) = E I_G(X) f(X) ≤ P(X ∈ G) < P(X ∈ F) + δ .

Hence

lim sup P(X_n ∈ F) ≤ lim E f(X_n) = E f(X) < P(X ∈ F) + δ ,

and since δ > 0 was arbitrary, 3) follows.

3) implies 1): Suppose f ∈ C(R^d).

i) Without loss of generality we may assume that 0 < f(x) < 1 for all x (otherwise, first add a constant to the bounded function f such that it becomes positive, then scale it to lie between 0 and 1). Write

F_i = { x : i/k ≤ f(x) } ,   i = 0, ..., k .

The sets F_i are closed since f is continuous, and

∑_{i=1}^k ((i − 1)/k) P(X ∈ F_{i−1} \ F_i) ≤ E f(X) ≤ ∑_{i=1}^k (i/k) P(X ∈ F_{i−1} \ F_i) .

By writing P(X ∈ F_{i−1} \ F_i) = P(X ∈ F_{i−1}) − P(X ∈ F_i), we obtain

(1/k) ∑_{i=1}^k P(X ∈ F_i) ≤ E f(X) ≤ 1/k + (1/k) ∑_{i=1}^k P(X ∈ F_i) .   (2.8)

Under 3), lim sup P(X_n ∈ F_i) ≤ P(X ∈ F_i). From this and applying (2.8) to both X_n and X, we conclude that

lim sup E f(X_n) ≤ 1/k + E f(X) .   (2.9)

Now let k → ∞:

lim sup E f(X_n) ≤ E f(X) .

ii) Apply (2.9) to −f. The two relations complete the proof.


2.4 The Slutsky argument

The following result turns out to be extremely useful when considering applications of convergence in distribution.

Theorem 2.9 (Slutsky's theorem, see Billingsley [1], Theorem 3.2) Let X_{un} be random vectors on the same probability space such that

X_{un} →_d X_u as n → ∞ and X_u →_d X as u → ∞.

Suppose further that the random vectors Y_n are defined on the same probability space as X_{un} and

lim_{u→∞} lim sup_{n→∞} P(|X_{un} − Y_n| > ε) = 0   (2.10)

for each ε > 0. Then Y_n →_d X as n → ∞.

Proof. Let F be closed. Define

F_ε = { x : d(x, F) ≤ ε }

with d as above. Then

P(Y_n ∈ F) ≤ P(X_{un} ∈ F_ε) + P(|X_{un} − Y_n| > ε) .

Since X_{un} →_d X_u as n → ∞, the Portmanteau theorem yields

lim sup_{n→∞} P(Y_n ∈ F) ≤ P(X_u ∈ F_ε) + lim sup_{n→∞} P(|X_{un} − Y_n| > ε) .

By (2.10) and since X_u →_d X as u → ∞, another application of the Portmanteau theorem yields

lim sup P(Y_n ∈ F) ≤ P(X ∈ F_ε) .

Since F is closed, F_ε ↓ F as ε ↓ 0, and hence the result follows by continuity of P.

Example 2.10 (The limit distribution of U-statistics) We focus on U-statistics of order 2; U-statistics of higher order can be treated as well. A U-statistic of order 2 is a natural generalization of the sample mean to functions of two variables. Let h : R^2 → R be a symmetric measurable function and X, X_1, X_2, ... be iid random variables with common distribution function F. Then

H_n = n^{−2} ∑_{1≤i≠j≤n} h(X_i, X_j)

is a U-statistic of order 2 with kernel h. Examples of U-statistics (up to slight changes of the normalization) are:

Sample variance: h(x, y) = 0.5 (x − y)^2 ,
Gini mean difference: h(x, y) = |x − y| ,
Wilcoxon one-sample statistic: h(x, y) = I_{{x+y≤0}}(x, y) .


Assume that E h(X_1, X_2) = 0 and E h^2(X_1, X_2) < ∞. Under these conditions we can write

H_n = n^{−2} ∑_{1≤i≠j≤n} [ h(X_i, X_j) − E(h(X_i, X_j) | X_i) − E(h(X_i, X_j) | X_j) ] + 2 n^{−2} ∑_{1≤i≠j≤n} E(h(X_i, X_j) | X_i)
    = n^{−2} ∑_{1≤i≠j≤n} h_2(X_i, X_j) + 2 n^{−2}(n − 1) ∑_{i=1}^n h_1(X_i)
    = H_{n2} + H_{n1} ,   (2.11)

where

h_1(X_i) = E(h(X_i, X) | X_i) ,   h_2(X_i, X_j) = h(X_i, X_j) − h_1(X_i) − h_1(X_j) .

Relation (2.11) is referred to as the Hoeffding decomposition of the U-statistic H_n. For an extensive theory of U-statistics, including the Hoeffding decomposition and limit theory, see Serfling [12].

If h_1(X_i) ≠ 0 a.s., one can use the fact that the h_2(X_i, X_j)'s are uncorrelated to calculate the variance of H_{n2} and derive the relation

n^{1/2} H_n = n^{1/2} H_{n1} + o_P(1) .

Since H_{n1} is (up to scaling) a partial sum of iid random variables, the central limit theorem immediately implies that n^{1/2} H_n has a normal limit. This case is called non-degenerate.
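A hedged sketch of the non-degenerate case using the Gini mean difference kernel h(x, y) = |x − y| for standard normal data; the centering constant E|X_1 − X_2| = 2/√π is the standard value, everything else is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(7)

def gini_u_stat(x):
    # H_n = n^{-2} sum over i != j of |X_i - X_j|; the diagonal contributes 0.
    n = len(x)
    return np.abs(x[:, None] - x[None, :]).sum() / n**2

n, reps = 200, 2000
mu = 2.0 / np.sqrt(np.pi)              # E|X_1 - X_2| for iid N(0,1) data
z = np.array([np.sqrt(n) * (gini_u_stat(rng.normal(size=n)) - mu)
              for _ in range(reps)])
print(z.mean(), z.std())               # roughly Gaussian across replications
```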

The more interesting case occurs when h_1(X_i) = 0 a.s. Then H_n = H_{n2} a.s., and the U-statistic is called degenerate. Various interesting U-statistics are degenerate or can be approximated by degenerate U-statistics. Among them is the Cramer-von Mises goodness of fit statistic

ω_n^2 = n ∫_R (F_n(x) − F(x))^2 w(F(x)) dF(x) .

Here F_n is the empirical distribution function of the sample X_1, ..., X_n. The ordinary Cramer-von Mises statistic appears for w ≡ 1, and the weight function w(u) = (u(1 − u))^{−1} yields the Anderson-Darling statistic.

Substituting for F_n, squaring and changing the order of summation and integration, one obtains the U-statistic ω_n^2 = H_n with degenerate kernel

h(x, y) = ∫_R (I_{{x≤z}} − F(z)) (I_{{y≤z}} − F(z)) w(F(z)) dF(z) .

Limit distributions of degenerate U-statistics are non-Gaussian. We give a heuristic approach to the weak limit, using the Slutsky argument. For simplicity, we choose w ≡ 1 and X uniform on (0, 1). Since E h^2(X_1, X_2) < ∞, it follows from Hilbert-Schmidt theory for linear operators that

h(x, y) = ∑_{k=1}^∞ γ_k g_k(x) g_k(y) ,


where (g_k) is a complete orthonormal system of eigenfunctions of the linear operator

φ ↦ ∫_0^1 h(x, y) φ(y) dy = E( h(x, X) φ(X) ) ,

and the γ_k are the corresponding eigenvalues, i.e.,

∫_0^1 h(x, y) g_k(y) dy = γ_k g_k(x) .

In particular, since

∫_0^1 h(x, y) · 1 dy = E h(x, X) = 0 ,

the constant function 1 is an eigenfunction corresponding to the eigenvalue 0. By orthonormality of the g_k's,

E h^2(X_1, X_2) = ∑_{k=1}^∞ γ_k^2 .

Introduce the truncated kernel

h_K(x, y) = ∑_{k=1}^K γ_k g_k(x) g_k(y) ,

and write

H_{nK} = n^{−2} ∑_{1≤i≠j≤n} h_K(X_i, X_j) .

Then

H_n = H_{nK} + (H_n − H_{nK}) .

Note that

n H_{nK} = n^{−1} ∑_{1≤i≠j≤n} ∑_{k=1}^K γ_k g_k(X_i) g_k(X_j)
         = ∑_{k=1}^K γ_k n^{−1} ∑_{1≤i≠j≤n} g_k(X_i) g_k(X_j)
         = ∑_{k=1}^K γ_k [ ( n^{−1/2} ∑_{i=1}^n g_k(X_i) )^2 − n^{−1} ∑_{i=1}^n g_k^2(X_i) ] .

The multivariate central limit theorem and the strong law of large numbers imply that

( n^{−1/2} ∑_{i=1}^n (g_1(X_i), ..., g_K(X_i))′ , n^{−1} ∑_{i=1}^n (g_1^2(X_i) − 1, ..., g_K^2(X_i) − 1)′ ) →_d (N_1, ..., N_K, 0, ..., 0)′ ,


where N_1, ..., N_K are iid N(0, 1) random variables. The multivariate central limit theorem yields this result because E g_k(X) = 0, E g_k^2(X) = 1 and E[g_k(X) g_l(X)] = 0 for k ≠ l; the limiting covariance structure is then inherited by the Gaussian limit. The latter relation and the continuous mapping theorem give

n H_{nK} →_d ∑_{k=1}^K γ_k (N_k^2 − 1) .

The remainder kernel h̄(x, y) = h(x, y) − h_K(x, y) yields a degenerate U-statistic

H_n − H_{nK} = n^{−2} ∑_{1≤i≠j≤n} h̄(X_i, X_j) .

Since the h̄(X_i, X_j)'s under the sum are uncorrelated, one can show that

E[ (n (H_n − H_{nK}))^2 ] ≤ const · E h̄^2(X_1, X_2) = const · ∑_{k=K+1}^∞ γ_k^2 .

The right-hand side converges to zero as K → ∞. Hence, by Chebyshev's inequality,

lim_{K→∞} lim sup_{n→∞} P(|n (H_n − H_{nK})| > ε) = 0 ,   ε > 0 .

Moreover, an application of the 3-series theorem yields that

∑_{k=1}^K γ_k (N_k^2 − 1) →_{a.s.} ∑_{k=1}^∞ γ_k (N_k^2 − 1)   as K → ∞ ,

for an iid sequence (N_k) of N(0, 1) random variables. Finally, Slutsky's theorem yields

n H_n →_d ∑_{k=1}^∞ γ_k (N_k^2 − 1) .

This means that the limit distribution of a degenerate U-statistic is in general a complicated infinite weighted sum of independent χ^2-distributed random variables. We mention that the latter limit can be written as a double Ito stochastic integral with respect to Brownian motion.
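A hedged sketch for the degenerate Cramer-von Mises case with w ≡ 1 and uniform data. The kernel below is the defining integral written out in closed form, and the eigenvalues γ_k = (kπ)^{−2} are the standard ones for the Brownian-bridge covariance operator; sample sizes and truncation level are illustrative.

```python
import numpy as np

rng = np.random.default_rng(8)

def kernel(x, y):
    # h(x,y) = int_0^1 (1{x<=z} - z)(1{y<=z} - z) dz for U(0,1) data.
    return 1.0 - np.maximum(x, y) - 0.5 * (1 - x**2) - 0.5 * (1 - y**2) + 1.0 / 3.0

n, reps, K = 300, 1000, 200
stats = np.empty(reps)
for r in range(reps):
    x = rng.uniform(size=n)
    H = kernel(x[:, None], x[None, :])
    np.fill_diagonal(H, 0.0)            # exclude the i = j terms
    stats[r] = H.sum() / n              # n H_n = n^{-1} sum_{i != j} h(X_i, X_j)

gamma = 1.0 / (np.arange(1, K + 1) * np.pi) ** 2
limit = (gamma * (rng.normal(size=(reps, K)) ** 2 - 1)).sum(axis=1)
print(np.quantile(stats, [0.5, 0.9]))   # close to the quantiles of the limit
print(np.quantile(limit, [0.5, 0.9]))
```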

Exercise 2.11 In the above proof we made use of the following fact. Suppose that X_n →_d X in R^d and Y_n →_P y for some constant y in R^s, i.e., P(|Y_n − y| > ε) → 0 for all ε > 0. Then the concatenated vector (X_n, Y_n) converges in distribution to (X, y) in R^{d+s}. Use this result, the central limit theorem and the continuous mapping theorem to show that a non-degenerate U-statistic with finite second moment has a normal limit.


2.5 The univariate and multivariate central limit theorems

In this section we closely follow Pollard [10], Sections III.3 and III.4. It is our aim to use convergence in distribution in the sense of Definition 2.1 to prove some of the classical central limit theorems. We also want to show that the continuity condition on f in (2.3) can be strengthened significantly.

Consider the triangular scheme (X_{ni})_{i=1,...,k_n}, n = 1, 2, ..., of row-wise independent random variables and define the row-wise partial sums

S_n = X_{n1} + ··· + X_{nk_n} .

We want to find conditions which imply the central limit theorem for (S_n), i.e., S_n →_d Y ∼ N(μ, σ^2) for suitable constants μ and σ^2 > 0 or, equivalently, that

E f(S_n) → E f(Y) ,   f ∈ C(R) .   (2.12)

This suggests finding good bounds for the difference |E f(X) − E f(Z)| for two random variables or even random vectors X and Z and smooth functions f, and then applying such an inequality to (2.12). Such an approach would be independent of characteristic functions, which are usually used to prove central limit theorems in Euclidean space. Thus this approach would possibly also allow one to generalize the central limit theorem to infinite-dimensional settings.

We start with a useful lemma. It tells us that slight perturbations of a sequence of random vectors do not influence the convergence in distribution of the sequence.

Lemma 2.12 (Perturbation lemma) Let X, Y, X_n, n = 1, 2, ..., be R^d-valued random vectors such that X_n + σY →_d X + σY for each σ > 0. Then X_n →_d X.

Proof. We use a Slutsky argument. Write X_{nσ} = X_n + σY, Y_n = X_n and X_σ = X + σY. Then we have X_{nσ} →_d X_σ as n → ∞ and X_σ →_d X as σ → 0. Moreover,

lim_{σ→0} lim sup_{n→∞} P(|X_{nσ} − Y_n| > ε) = lim_{σ→0} P(|σY| > ε) = 0 ,   ε > 0 .

Hence Y_n = X_n →_d X.

In the Portmanteau theorem we mentioned that for X_n →_d X it suffices to show that E f(X_n) → E f(X) for all bounded, uniformly continuous f. In the following we will show that it even suffices to consider f which are infinitely often differentiable. The key is the perturbation lemma above. By C^k(R^d) we denote the class of all bounded, real-valued functions on R^d which have bounded, continuous derivatives up to kth order. In particular, f ∈ C^∞(R^d) means that f has bounded, continuous derivatives of all orders.

Theorem 2.13 Let X, X_n, n = 1, 2, ..., be R^d-valued random vectors. The relation X_n →_d X holds if and only if

E f(X_n) → E f(X) for all f ∈ C^∞(R^d) .   (2.13)


Proof. It suffices to show the sufficiency part. Moreover, we may restrict ourselves to a special subclass of C^∞(R^d) for which (2.13) holds. Let X, Y be independent such that Y is Gaussian N(0, I_d), where I_d denotes the d-dimensional identity matrix. Notice that we can write

E f(X + σY) = E[ E( f(X + σY) | X ) ] = E f_σ(X) ,

where

f_σ(x) = (2π)^{−d/2} ∫ f(x + σy) e^{−|y|^2/2} dy = (2πσ^2)^{−d/2} ∫ f(z) e^{−|z−x|^2/(2σ^2)} dz .

Lebesgue dominated convergence justifies the interchange of integrals and limits, implying that f_σ ∈ C^∞(R^d). Now suppose that E f_σ(X_n) → E f_σ(X) for all σ > 0 and all bounded, continuous f, i.e., X_n + σY →_d X + σY for all σ > 0. By the perturbation lemma, X_n →_d X.

Now we return to our original problem concerning the central limit theorem for the partial sums S_n of the triangular scheme of row-wise independent random variables. In addition, we will assume that the X_{ni}'s have mean zero and variances σ_{ni}^2 such that

σ_{n1}^2 + ··· + σ_{nk_n}^2 = 1 .

From Theorem 2.13 we learned that for the central limit theorem S_n →_d Y ∼ N(0, 1) it suffices to show that

|E f(S_n) − E f(Y)| → 0 ,   f ∈ C^∞(R) .   (2.14)

We start with the following simple argument: Taylor expansion yields

f(x + y) − f(x) = y f′(x) + (1/2) y^2 f′′(x) + R(x, y) ,

where

R(x, y) = (1/6) y^3 f′′′(x + θy)   (2.15)

for some θ ∈ (0, 1). Since f ∈ C^3(R), |f′′′| is bounded by a constant C, say. Now replace x, y by independent random variables X, Y and take expectations:

E f(X + Y) − E f(X) = EY · E f′(X) + (1/2) EY^2 · E f′′(X) + E R(X, Y) .   (2.16)

Let X, Y and X, W be independent and such that EY = EW and EY^2 = EW^2. From (2.16) and (2.15) we immediately have

|E f(X + Y) − E f(X + W)| ≤ E|R(X, Y)| + E|R(X, W)| ≤ C ( E|Y|^3 + E|W|^3 ) .   (2.17)

This inequality is the basis for proving a central limit theorem for (S_n):


Theorem 2.14 (Lyapunov's central limit theorem) If the Lyapunov condition

∑_{j=1}^{k_n} E|X_{nj}|^3 → 0 ,   n → ∞ ,   (2.18)

holds, then S_n →_d Y ∼ N(0, 1).

Proof. Choose an f ∈ C^3(R). We want to show that E f(S_n) → E f(Y). Let (ξ_{ni})_{i=1,...,k_n}, n = 1, 2, ..., be row-wise independent Gaussian random variables such that ξ_{ni} has the same mean and variance as X_{ni}. In particular, Z_n = ξ_{n1} + ··· + ξ_{nk_n} ∼ N(0, 1). Moreover, suppose that (ξ_{ni}) and (X_{ni}) are independent. Thus it suffices to show that E f(S_n) − E f(Z_n) → 0. Write

S_{nk} = X_{n1} + ··· + X_{n,k−1} + ξ_{n,k+1} + ··· + ξ_{n k_n} .

Then, by (2.17),

|E f(S_n) − E f(Z_n)| ≤ ∑_{k=1}^{k_n} |E f(S_{nk} + X_{nk}) − E f(S_{nk} + ξ_{nk})|
                      ≤ ∑_{k=1}^{k_n} ( E|R(S_{nk}, X_{nk})| + E|R(S_{nk}, ξ_{nk})| )
                      ≤ C ∑_{k=1}^{k_n} ( E|X_{nk}|^3 + E|ξ_{nk}|^3 ) .

The first sum on the right-hand side converges to zero as n → ∞ in view of Lyapunov's condition (2.18). So does the second since, by Lyapunov's inequality,

∑_{k=1}^{k_n} E|ξ_{nk}|^3 = ∑_{k=1}^{k_n} σ_{nk}^3 E|Y|^3 = E|Y|^3 ∑_{k=1}^{k_n} (E X_{nk}^2)^{3/2} ≤ E|Y|^3 ∑_{k=1}^{k_n} E|X_{nk}|^3 .

This proves the theorem.
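A small numerical sketch (all choices illustrative) of Lyapunov's condition and the resulting normal approximation for a triangular array built from centered, variance-one exponential variables:

```python
import numpy as np

rng = np.random.default_rng(9)

for n in [10, 100, 1000]:
    # Lyapunov sum for the array X_{ni} = X_i / sqrt(n), by Monte Carlo.
    X = rng.exponential(size=200_000) - 1.0
    lyapunov = n * np.mean(np.abs(X / np.sqrt(n)) ** 3)
    # Row sums S_n over many replications, against N(0,1) quantiles.
    S = (rng.exponential(size=(5000, n)) - 1.0).sum(axis=1) / np.sqrt(n)
    print(n, lyapunov, np.quantile(S, [0.05, 0.5, 0.95]))
# Reference N(0,1) quantiles: about [-1.645, 0.0, 1.645].
```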

Exercise 2.15 Let (X_i) be iid random variables with mean zero and variance 1. Check the conditions of Theorem 2.14 for the double array

X_{ni} = X_i / a_n ,   i = 1, ..., n ,

for suitable normalizing constants a_n ↑ ∞.

The 3rd moment condition on the X_{ni}'s is actually not needed. By truncation arguments (i.e., by partitioning the range of the random variables X_{ni} in a clever way) it is possible to formulate sufficient conditions for the central limit theorem in terms of 2nd moments or even 2nd truncated moments. This is the content of Lindeberg's central limit theorem. It is well known that Lindeberg's condition (2.19) is close to a necessary assumption for the central limit theorem (see Petrov [9]). At this point it might also be worthwhile mentioning that the central limit theorem problem can be generalized and modified in very different ways to obtain a great variety of limit laws different from the Gaussian distribution. These distributions include the infinitely divisible laws, in particular the Poisson and the stable distributions.

Theorem 2.16 (Lindeberg's central limit theorem) If the Lindeberg condition

L_n(ε) = ∑_{j=1}^{k_n} E[ X_{nj}^2 I_{(ε,∞)}(|X_{nj}|) ] → 0 ,   n → ∞ , for every ε > 0 ,   (2.19)

holds, then S_n →_d Y ∼ N(0, 1).

Proof. As in the proof of Theorem 2.14 we obtain

|E f(S_n) − E f(Z_n)| ≤ ∑_{k=1}^{k_n} ( E|R(S_{nk}, X_{nk})| + E|R(S_{nk}, ξ_{nk})| ) .   (2.20)

It is our intention to improve the upper bounds on the right-hand side of (2.20). To this end, reconsider the Taylor expansion:

|R(x, y)| = | [f(x + y) − f(x) − y f′(x)] − (1/2) y^2 f′′(x) |
          = | (1/2) y^2 f′′(x + θ′y) − (1/2) y^2 f′′(x) | ≤ ‖f′′‖_∞ y^2 .   (2.21)

Let

C = max( (1/6) ‖f′′′‖_∞ , ‖f′′‖_∞ ) .

Then we conclude from (2.21) and (2.15) for independent X, Y that

E|R(X, Y)| ≤ C E[ |Y|^3 I_{[0,ε]}(|Y|) ] + C E[ Y^2 I_{(ε,∞)}(|Y|) ]
           ≤ ε C EY^2 + C E[ Y^2 I_{(ε,∞)}(|Y|) ] .

An application of this inequality to the first sum on the right-hand side of (2.20) yields the upper estimate

C ∑_{j=1}^{k_n} ( ε E X_{nj}^2 + E[ X_{nj}^2 I_{(ε,∞)}(|X_{nj}|) ] ) = C ε + C L_n(ε) .

The second sum can be estimated as before:

C ∑_{j=1}^{k_n} σ_{nj}^3 E|Y|^3 ≤ C E|Y|^3 max_j σ_{nj} ∑_{j=1}^{k_n} σ_{nj}^2 ≤ C E|Y|^3 max_j ( ε^2 + E[ X_{nj}^2 I_{(ε,∞)}(|X_{nj}|) ] )^{1/2} ≤ C E|Y|^3 ( ε^2 + L_n(ε) )^{1/2} .

By virtue of the Lindeberg condition (2.19), both bounds can be made arbitrarily small by first letting n → ∞ and then ε ↓ 0.


Exercise 2.17 Check what the Lindeberg condition (2.19) means for iid random variables (X_i) with mean zero and variance 1, i.e., for a double array X_{ni} = X_i/a_n, i = 1, ..., n, for appropriate constants a_n ↑ ∞.

Exercise 2.18 Prove that the condition

∑_{j=1}^{k_n} E|X_{nj}|^{2+δ} → 0 ,   n → ∞ ,

for some δ ∈ (0, 1] implies the central limit theorem S_n →_d Y ∼ N(0, 1).

Lindeberg's and Lyapunov's central limit theorems are extremely useful tools for proving asymptotic results in statistics. Among others, the central limit theorem for double arrays of random variables is of great use for proving results on the bootstrap.

Example 2.19 (Bootstrap central limit theorem) Let X_1, X_2, ... be iid random variables with common distribution function F, mean μ and finite variance σ^2. We know from the central limit theorem for iid sums that

sup_x | P( (S_n − nμ)/(nσ^2)^{1/2} ≤ x ) − Φ(x) | → 0 ,

where Φ denotes the standard normal distribution function and S_n = X_1 + ··· + X_n. (Uniform convergence is a consequence of the continuity of the limit distribution function Φ.)

Let F_n denote the empirical distribution function of the first n observations. Given F_n, draw iid copies

X*_{n1}, ..., X*_{nn}   (2.22)

with common distribution function F_n. In other words, given X_1, ..., X_n, each X*_{ni} has distribution function F_n. Obviously, (2.22) is a double array of (conditionally) iid random variables, which we call a bootstrap sample.

We intend to prove the central limit theorem for the bootstrap sums

S*_n = X*_{n1} + ··· + X*_{nn} ,

which means that

sup_x | P*( (S*_n − n E*X*_{n1})/(var*(S*_n))^{1/2} ≤ x ) − Φ(x) | → 0   (2.23)

for almost every realization of (X_n). Here

P*(·) = P(· | X_1, X_2, ...) ,

and E* and var* denote expectation and variance with respect to P*.

First we write down what E*X*_{n1} and var*(S*_n) mean in terms of X_1, ..., X_n. Since F_n defines a uniform measure on the sample X_1, ..., X_n, we have

E*X*_{n1} = (1/n) ∑_{i=1}^n X_i = X̄_n ,

var*(S*_n) = n var*(X*_{n1}) = n · (1/n) ∑_{i=1}^n (X_i − X̄_n)^2 =: n s_n^2 .


Now we apply Lyapunov's central limit theorem to the double array

( (X*_{ni} − X̄_n) / (n s_n^2)^{1/2} )_{i=1,...,n} .

Lyapunov's condition (2.18) then takes the form

∑_{i=1}^n E*| (X*_{ni} − X̄_n) / (n s_n^2)^{1/2} |^3 = n^{−1/2} s_n^{−3} E*|X*_{n1} − X̄_n|^3 → 0 .   (2.24)

We estimate the left-hand side of (2.24):

n^{−1/2} s_n^{−3} E*|X*_{n1} − X̄_n|^3 = n^{−1/2} s_n^{−3} (1/n) ∑_{i=1}^n |X_i − X̄_n|^3 ≤ n^{−1/2} s_n^{−1} max_i |X_i − X̄_n| .

A Borel-Cantelli argument and the strong law of large numbers provide that max_i |X_i − X̄_n|^2 / n →_{a.s.} 0 and s_n^2 →_{a.s.} σ^2. Hence (2.24) is satisfied, and the central limit theorem (2.23) applies for almost every realization of (X_n). Moreover, a combination of the conditional and the unconditional central limit theorems yields that

sup_x | P*( (S*_n − n E*X*_{n1})/(var*(S*_n))^{1/2} ≤ x ) − P( (S_n − nμ)/(nσ^2)^{1/2} ≤ x ) | → 0

as n → ∞ for almost every realization of (X_n). This property indicates that the bootstrap works for the sample mean.

For further work on the bootstrap and related asymptotic theory see the monograph byHall [7].
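A minimal bootstrap sketch under these assumptions (sample size, resample count and the exponential data are illustrative): the studentized bootstrap sums behave like N(0, 1) for a fixed observed sample.

```python
import numpy as np

rng = np.random.default_rng(10)
n, B = 500, 4000

X = rng.exponential(size=n)                # one observed sample
Xbar, s = X.mean(), X.std()                # s^2 = (1/n) sum (X_i - Xbar)^2

# B bootstrap resamples drawn from F_n, i.e. uniformly with replacement.
star = rng.choice(X, size=(B, n), replace=True)
T = (star.sum(axis=1) - n * Xbar) / (np.sqrt(n) * s)

print(np.quantile(T, [0.05, 0.5, 0.95]))   # close to N(0,1) quantiles
```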

Exercise 2.20 Prove in detail that max_i |X_i − X̄_n|^2 / n →_{a.s.} 0 holds provided σ^2 < ∞.

Now we want to apply our knowledge about the central limit theorem for real-valued random variables to the multivariate central limit theorem. For this reason we will use a classical theorem about the convergence of characteristic functions. Recall that the characteristic function of a random vector X in R^d is the Fourier transform of its distribution, i.e.,

f_X(γ) = E e^{iγX} ,   γ ∈ R^d .

Here γX denotes the usual scalar product of γ and X in R^d.

Theorem 2.21 (Continuity theorem for characteristic functions) Let X and (X_n) be random vectors in R^d. Then X_n →_d X holds if and only if for each γ ∈ R^d

f_{X_n}(γ) = E e^{iγX_n} → f_X(γ) = E e^{iγX} .

This elegant statement translates the problem of convergence of (X_n) in distribution into a problem about Fourier transforms. These are particularly easy to handle since one can now work with the complex exponential, a function with values on the unit circle which is uniformly bounded, infinitely often differentiable and has many other attractive properties. Therefore classical probability theory in finite-dimensional spaces has made heavy use of the convenient tools of Fourier analysis.

Proof. (Sketch) The proof of necessity is immediate since we can write

f_{X_n}(γ) = E cos(γX_n) + i E sin(γX_n) ,

and both functions cos(γ ·) and sin(γ ·) are continuous and bounded on R^d. Hence X_n →_d X implies that

E cos(γX_n) → E cos(γX) ,   E sin(γX_n) → E sin(γX) .

The proof of sufficiency is more involved and makes use of the perturbation Lemma 2.12. We give here only the main idea and refer to Pollard [10], Theorem 29 in Chapter III, for details. The basic idea is to show that convergence of the characteristic functions implies that

X_n + σY →_d X + σY   (2.25)

for each σ > 0 and for a Gaussian N(0, I_d) random vector Y (I_d stands for the identity matrix) independent of (X_n) and X. Then one uses the perturbation lemma with σ → 0. The relation (2.25) is equivalent to

E f(X_n + σY) → E f(X + σY)

for f ∈ C(R^d). Adding the σY term ensures that X + σY has a smooth density. Conditioning on X and using the properties of the d-dimensional Gaussian density gives that

E f(X + σY) = E( (2πσ^2)^{−d/2} ∫ f(z) e^{−|z−X|^2/(2σ^2)} dz )
            = ∫ f(z) (2πσ^2)^{−d/2} E e^{−|z−X|^2/(2σ^2)} dz
            = ∫ f(z) J(z) dz .

The expression J(z) is basically the characteristic function of a Gaussian random variable and can also be written as such with respect to the density of Y:

e^{−|z−X|^2/(2σ^2)} = ∫ (2π)^{−d/2} e^{ i y(z−X)/σ − |y|^2/2 } dy .

This is the way the characteristic function of X enters the calculation. The same arguments for X_n + σY give a similar representation for E f(X_n + σY). Some further calculations and the fact that the characteristic function of X_n converges to that of X complete the proof.

Read Chapter III.5 in Pollard [10] for more details.

A corollary of the continuity theorem for characteristic functions is the following very useful tool, which reduces multivariate convergence in distribution to univariate convergence.

Exercise 2.22 (Cramer-Wold device) The relation X_n →_d X holds in R^d if and only if

γX_n →_d γX ,   γ ∈ R^d .


A consequence of the Cramer-Wold device is the multivariate central limit theorem:

Theorem 2.23 (Multivariate central limit theorem) Let (X_n) be iid random vectors in R^d with EX_n = 0 and covariance matrix Σ. Then

n^{−1/2} S_n = n^{−1/2}(X_1 + ··· + X_n) →_d Y ∼ N(0, Σ) .

Proof. By the Cramer-Wold device, it suffices to show that for all γ ∈ R^d

n^{−1/2} γS_n →_d γY ∼ N(0, γ′Σγ)   (2.26)

in R. The random variables γX_i are iid with mean zero and variance γ′Σγ. We apply the classical central limit theorem for iid random variables and obtain (2.26).
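A hedged sketch of the Cramer-Wold reduction (our own illustration, all names and distributions assumed): projections of the scaled sums onto a few directions γ have standard deviations close to (γ′Σγ)^{1/2}.

```python
import numpy as np

rng = np.random.default_rng(11)
n, reps = 200, 10_000
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
A = np.linalg.cholesky(Sigma)

# iid centered vectors with covariance Sigma, built from unit-variance uniforms.
U = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(reps, n, 2))
S = (U @ A.T).sum(axis=1) / np.sqrt(n)    # n^{-1/2} S_n, over reps replications

for gamma in (np.array([1.0, 0.0]), np.array([1.0, -1.0])):
    proj = S @ gamma
    print(gamma, proj.std(), np.sqrt(gamma @ Sigma @ gamma))
```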


References

[1] Billingsley, P. (1999) Convergence of Probability Measures. Wiley, New York.

[2] Billingsley, P. (1995) Probability and Measure, 3rd edition. Wiley, New York.

[3] Donsker, M.D. (1951) An invariance principle for certain probability limit theorems. Mem. Amer. Math. Soc. 6, 1–12.

[4] Donsker, M.D. (1952) Justification and extension of Doob's heuristic approach to the Kolmogorov-Smirnov theorems. Ann. Math. Statist. 23, 277–281.

[5] Doob, J.L. (1949) Heuristic approach to the Kolmogorov-Smirnov theorems. Ann. Math. Statist. 20, 393–403.

[6] Embrechts, P., Klüppelberg, C. and Mikosch, T. (1997) Modelling Extremal Events for Insurance and Finance. Springer, Berlin.

[7] Hall, P.G. (1992) The Bootstrap and Edgeworth Expansion. Springer, New York.

[8] Jacod, J. and Shiryaev, A.N. (1987) Limit Theorems for Stochastic Processes. Springer, Berlin.

[9] Petrov, V.V. (1975) Sums of Independent Random Variables. Springer, Berlin.

[10] Pollard, D. (1984) Convergence of Stochastic Processes. Springer, Berlin.

[11] Prokhorov, Yu.V. (1956) Convergence of random processes and limit theorems in probability. Theor. Probab. Appl. 1, 157–214.

[12] Serfling, R.J. (1980) Approximation Theorems of Mathematical Statistics. Wiley, New York.

[13] Shorack, G.R. and Wellner, J.A. (1986) Empirical Processes with Applications to Statistics. Wiley, New York.

[14] Skorokhod, A.V. (1956) Limit theorems for stochastic processes. Theor. Probab. Appl. 1, 261–290.

[15] Skorokhod, A.V. (1957) Limit theorems for stochastic processes with independent increments. Theor. Probab. Appl. 2, 138–171.
