
Probability


SIAM's Classics in Applied Mathematics series consists of books that were previously allowed to go out of print. These books are republished by SIAM as a professional service because they continue to be important resources for mathematical scientists.

Editor-in-Chief
Robert E. O'Malley, Jr., University of Washington

Editorial Board
Richard A. Brualdi, University of Wisconsin-Madison

Herbert B. Keller, California Institute of Technology

Andrzej Z. Manitius, George Mason University

Ingram Olkin, Stanford University

Stanley Richardson, University of Edinburgh

Ferdinand Verhulst, Mathematisch Instituut, University of Utrecht

Classics in Applied Mathematics

C. C. Lin and L. A. Segel, Mathematics Applied to Deterministic Problems in the Natural Sciences

Johan G. F. Belinfante and Bernard Kolman, A Survey of Lie Groups and Lie Algebras with Applications and Computational Methods

James M. Ortega, Numerical Analysis: A Second Course

Anthony V. Fiacco and Garth P. McCormick, Nonlinear Programming: Sequential Unconstrained Minimization Techniques

F. H. Clarke, Optimization and Nonsmooth Analysis

George F. Carrier and Carl E. Pearson, Ordinary Differential Equations

Leo Breiman, Probability

R. Bellman and G. M. Wing, An Introduction to Invariant Imbedding

Abraham Berman and Robert J. Plemmons, Nonnegative Matrices in the Mathematical Sciences

Olvi L. Mangasarian, Nonlinear Programming

*Carl Friedrich Gauss, Theory of the Combination of Observations Least Subject to Errors: Part One, Part Two, Supplement. Translated by G. W. Stewart

Richard Bellman, Introduction to Matrix Analysis

U. M. Ascher, R. M. M. Mattheij, and R. D. Russell, Numerical Solution of Boundary Value Problems for Ordinary Differential Equations

K. E. Brenan, S. L. Campbell, and L. R. Petzold, Numerical Solution of Initial-Value Problems in Differential-Algebraic Equations

Charles L. Lawson and Richard J. Hanson, Solving Least Squares Problems

J. E. Dennis, Jr. and Robert B. Schnabel, Numerical Methods for Unconstrained Optimization and Nonlinear Equations

Richard E. Barlow and Frank Proschan, Mathematical Theory of Reliability

*First time in print.


Classics in Applied Mathematics (continued)

Cornelius Lanczos, Linear Differential Operators

Richard Bellman, Introduction to Matrix Analysis, Second Edition

Beresford N. Parlett, The Symmetric Eigenvalue Problem

Richard Haberman, Mathematical Models: Mechanical Vibrations, Population Dynamics, and Traffic Flow

Peter W. M. John, Statistical Design and Analysis of Experiments

Tamer Basar and Geert Jan Olsder, Dynamic Noncooperative Game Theory, Second Edition

Emanuel Parzen, Stochastic Processes

Petar Kokotović, Hassan K. Khalil, and John O'Reilly, Singular Perturbation Methods in Control: Analysis and Design

Jean Dickinson Gibbons, Ingram Olkin, and Milton Sobel, Selecting and Ordering Populations: A New Statistical Methodology

James A. Murdock, Perturbations: Theory and Methods

Ivar Ekeland and Roger Temam, Convex Analysis and Variational Problems

Ivar Stakgold, Boundary Value Problems of Mathematical Physics, Volumes I and II

J. M. Ortega and W. C. Rheinboldt, Iterative Solution of Nonlinear Equations in Several Variables

David Kinderlehrer and Guido Stampacchia, An Introduction to Variational Inequalities and Their Applications

F. Natterer, The Mathematics of Computerized Tomography

Avinash C. Kak and Malcolm Slaney, Principles of Computerized Tomographic Imaging

R. Wong, Asymptotic Approximations of Integrals

O. Axelsson and V. A. Barker, Finite Element Solution of Boundary Value Problems: Theory and Computation

David R. Brillinger, Time Series: Data Analysis and Theory

Joel N. Franklin, Methods of Mathematical Economics: Linear and Nonlinear Programming, Fixed-Point Theorems

Philip Hartman, Ordinary Differential Equations, Second Edition

Michael D. Intriligator, Mathematical Optimization and Economic Theory

Philippe G. Ciarlet, The Finite Element Method for Elliptic Problems

Jane K. Cullum and Ralph A. Willoughby, Lanczos Algorithms for Large Symmetric Eigenvalue Computations, Vol. I: Theory

M. Vidyasagar, Nonlinear Systems Analysis, Second Edition

Robert Mattheij and Jaap Molenaar, Ordinary Differential Equations in Theory and Practice

Shanti S. Gupta and S. Panchapakesan, Multiple Decision Procedures: Theory and Methodology of Selecting and Ranking Populations


Probability

Leo Breiman
University of California, Berkeley

Society for Industrial and Applied Mathematics

Philadelphia


Copyright ©1992 by the Society for Industrial and Applied Mathematics.

This SIAM edition is an unabridged, corrected republication of the work first published by Addison-Wesley Publishing Company, Inc., Reading, Massachusetts, 1968.

10 9 8 7 6 5

All rights reserved. Printed in the United States of America. No part of this book may be reproduced, stored, or transmitted in any manner without the written permission of the publisher. For information, write to the Society for Industrial and Applied Mathematics, 3600 University City Science Center, Philadelphia, PA 19104-2688.

Library of Congress Cataloging-in-Publication Data

Breiman, Leo
Probability / Leo Breiman.
p. cm. — (Classics in applied mathematics ; 7)
Originally published: Reading, Mass. : Addison-Wesley Pub. Co., 1968. (Addison-Wesley series in statistics)
Includes bibliographical references and index.
ISBN 0-89871-296-3
1. Probabilities. I. Title. II. Series.
QA273.B864 1992
519.2—dc20    92-1381

siam is a registered trademark.


Preface to the Classic Edition

This is the first of four books I have written; the one I worked the hardest on; and the one I am fondest of. It marked my goodbye to mathematics and probability theory. About the time the book was written, I left UCLA to go into the world of applied statistics and computing as a full-time freelance consultant.

The book went out of print well over ten years ago, but before it did a generation of statisticians, engineers, and mathematicians learned graduate probability theory from its pages. Since the book became unavailable, I have received many calls asking where it could be bought and then for permission to copy part or all of it for use in graduate probability courses.

These reminders that the book was not forgotten saddened me and I was delighted when SIAM offered to republish it in their Classics Series. The present edition is the same as the original except for the correction of a few misprints and errors, mainly minor.

After the book was out for a few years it became commonplace for a younger participant at some professional meeting to lean over toward me and confide that he or she had studied probability out of my book. Lately, this has become rarer and the confiders older. With republication, I hope that the age and frequency trends will reverse direction.

Leo Breiman
University of California, Berkeley

January, 1992


Preface

A few years ago I started a book by first writing a very extensive preface. I never finished that book and resolved that in the future I would write first the book and then the preface. Having followed this resolution I note that the result is a desire to be as brief as possible.

This text developed from an introductory graduate course and seminar in probability theory at UCLA. A prerequisite is some knowledge of real variable theory, such as the ideas of measure, measurable functions, and so on. Roughly, the first seven chapters of Measure Theory by Paul Halmos [64] is sufficient background. There is an appendix which lists the essential definitions and theorems. This should be taken as a rapid review or outline for study rather than as an exposition. No prior knowledge of probability is assumed, but browsing through an elementary book such as the one by William Feller [59, Vol. I], with its diverse and vivid examples, gives an excellent feeling for the subject.

Probability theory has a right and a left hand. On the right is the rigorous foundational work using the tools of measure theory. The left hand "thinks probabilistically," reduces problems to gambling situations, coin-tossing, motions of a physical particle. I am grateful to Michel Loeve for teaching me the first side, and to David Blackwell, who gave me the flavor of the other.

David Freedman read through the entire manuscript. His suggestions resulted in many substantial revisions, and the book has been considerably improved by his efforts. Charles Stone worked hard to convince me of the importance of analytic methods in probability. The presence of Chapter 10 is largely due to his influence, and I am further in his debt for reading parts of the manuscript and for some illuminating conversations on diffusion theory.

Of course, in preparing my lectures, I borrowed heavily from the existing books in the field and the finished product reflects this. In particular, the books by M. Loeve [108], J. L. Doob [39], E. B. Dynkin [43], and K. Ito and H. P. McKean [76] were significant contributors.

Two students, Carl Maltz and Frank Kontrovich, read parts of the manuscript and provided lists of mistakes and unreadable portions. Also, I was blessed by having two fine typists, Louise Gaines and Ruth Goldstein, who rose above mere patience when faced with my numerous revisions of the "final draft." Finally, I am grateful to my many nonmathematician friends who continually asked when I was going to finish "that thing," in voices that could not be interminably denied.

Leo Breiman
Topanga, California

January, 1968


Contents

Chapter 1 Introduction

1. n independent tosses of a fair coin
2. The "law of averages"
3. The bell-shaped curve enters (fluctuation theory)
4. Strong form of the "law of averages"
5. An analytic model for coin-tossing
6. Conclusions

Chapter 2 Mathematical Framework

1. Introduction
2. Random vectors
3. The distribution of processes
4. Extension in sequence space
5. Distribution functions
6. Random variables
7. Expectations of random variables
8. Convergence of random variables

Chapter 3 Independence

1. Basic definitions and results
2. Tail events and the Kolmogorov zero-one law
3. The Borel-Cantelli lemma
4. The random signs problem
5. The law of pure types
6. The law of large numbers for independent random variables
7. Recurrence of sums
8. Stopping times and equidistribution of sums
9. Hewitt-Savage zero-one law

Chapter 4 Conditional Probability and Conditional Expectation

1. Introduction
2. A more general conditional expectation
3. Regular conditional probabilities and distributions



Chapter 5 Martingales

1. Gambling and gambling systems
2. Definitions of martingales and submartingales
3. The optional sampling theorem
4. The martingale convergence theorem
5. Further martingale theorems
6. Stopping times
7. Stopping rules
8. Back to gambling

Chapter 6 Stationary Processes and the Ergodic Theorem

1. Introduction and definitions
2. Measure-preserving transformations
3. Invariant sets and ergodicity
4. Invariant random variables
5. The ergodic theorem
6. Converses and corollaries
7. Back to stationary processes
8. An application
9. Recurrence times
10. Stationary point processes

Chapter 7 Markov Chains

1. Definitions
2. Asymptotic stationarity
3. Closed sets, indecomposability, ergodicity
4. The countable case
5. The renewal process of a state
6. Group properties of states
7. Stationary initial distributions
8. Some examples
9. The convergence theorem
10. The backward method

Chapter 8 Convergence in Distribution and the Tools Thereof

1. Introduction
2. The compactness of distribution functions
3. Integrals and D-convergence
4. Classes of functions that separate
5. Translation into random-variable terms
6. An application of the foregoing
7. Characteristic functions and the continuity theorem


8. The convergence of types theorem
9. Characteristic functions and independence
10. Fourier inversion formulas
11. More on characteristic functions
12. Method of moments
13. Other separating function classes

Chapter 9 The One-Dimensional Central Limit Problem

1. Introduction
2. Why normal?
3. The nonidentically distributed case
4. The Poisson convergence
5. The infinitely divisible laws
6. The generalized limit problem
7. Uniqueness of representation and convergence
8. The stable laws
9. The form of the stable laws
10. The computation of the stable characteristic functions
11. The domain of attraction of a stable law
12. A coin-tossing example
13. The domain of attraction of the normal law

Chapter 10 The Renewal Theorem and Local Limit Theorem

1. Introduction
2. The tools
3. The renewal theorem
4. A local central limit theorem
5. Applying a Tauberian theorem
6. Occupation times

Chapter 11 Multidimensional Central Limit Theorem and Gaussian Processes

1. Introduction
2. Properties of M_k
3. The multidimensional central limit theorem
4. The joint normal distribution
5. Stationary Gaussian process
6. Spectral representation of stationary Gaussian processes
7. Other problems


Chapter 12 Stochastic Processes and Brownian Motion

1. Introduction
2. Brownian motion as the limit of random walks
3. Definitions and existence
4. Beyond the Kolmogorov extension
5. Extension by continuity
6. Continuity of Brownian motion
7. An alternative definition
8. Variation and differentiability
9. Law of the iterated logarithm
10. Behavior at t = ∞
11. The zeros of X(t)
12. The strong Markov property

Chapter 13 Invariance Theorems

1. Introduction
2. The first-exit distribution
3. Representation of sums
4. Convergence of sample paths of sums to Brownian motion paths
5. An invariance principle
6. The Kolmogorov-Smirnov statistics
7. More on first-exit distributions
8. The law of the iterated logarithm
9. A more general invariance theorem

Chapter 14 Martingales and Processes with Stationary, Independent Increments

1. Introduction
2. The extension to smooth versions
3. Continuous parameter martingales
4. Processes with stationary, independent increments
5. Path properties
6. The Poisson process
7. Jump processes
8. Limits of jump processes
9. Examples
10. A remark on a general decomposition


Chapter 15 Markov Processes, Introduction and Pure Jump Case

1. Introduction and definitions
2. Regular transition probabilities
3. Stationary transition probabilities
4. Infinitesimal conditions
5. Pure jump processes
6. Construction of jump processes
7. Explosions
8. Nonuniqueness and boundary conditions
9. Resolvent and uniqueness
10. Asymptotic stationarity

Chapter 16 Diffusions

1. The Ornstein-Uhlenbeck process
2. Processes that are locally Brownian
3. Brownian motion with boundaries
4. Feller processes
5. The natural scale
6. Speed measure
7. Boundaries
8. Construction of Feller processes
9. The characteristic operator
10. Uniqueness
11. φ₊(x) and φ₋(x)
12. Diffusions

Appendix: On Measure and Function Theory

Bibliography

Index


To my mother and father
and Tuesday's children


CHAPTER 1

INTRODUCTION

A good deal of probability theory consists of the study of limit theorems. These limit theorems come in two categories which we call strong and weak. To illustrate and also to dip into history we begin with a study of coin-tossing and a discussion of the two most famous prototypes of weak and strong limit theorems.

1. n INDEPENDENT TOSSES OF A FAIR COIN

These words put us immediately into difficulty. What meaning can be assigned to the words, coin, fair, independent? Take a pragmatic attitude: all computations involving n tosses of a fair coin are based on two givens:

a) There are 2^n possible outcomes, namely, all sequences n-long of the two letters H and T (Heads and Tails).

b) Each sequence has probability 2^{-n}.

Nothing else is given. All computations regarding odds, and so forth, in fair coin-tossing are based on (a) and (b) above. Hence we take (a) and (b) as being the complete definition of n independent tosses of a fair coin.

2. THE "LAW OF AVERAGES"

Vaguely, almost everyone believes that for large n, the number of heads is about the same as the number of tails. That is, if you toss a fair coin a large number of times, then about half the tosses result in heads.

How to make this mathematics? All we have at our disposal to mathematize the "law of averages" are (a) and (b) above. So if there is anything at all corresponding to the law of averages, it must come out of (a) and (b) with no extra added ingredients.

Analyze the 2^n sequences of H and T. In how many of these sequences do exactly k heads appear? This is a combinatorial problem which clearly can be rephrased as: Given n squares, in how many different ways can we distribute k crosses on them? (See Fig. 1.1.) For example, if n = 3, k = 2, then we have the result shown in Fig. 1.2, and the answer is 3.

To get the answer in general, take the k crosses and subscript them so they become different from each other, that is, +_1, +_2, . . . , +_k. Now we may place these latter crosses in n squares in n(n − 1) ⋯ (n − k + 1) ways [+_1 may be put down in n ways, then +_2 in (n − 1) ways, and so forth]. But any permutation of the k subscripted crosses among the boxes they occupy gives rise to exactly the same distribution of unsubscripted crosses. There are k! permutations. Hence

Proposition 1.1. There are exactly

    nC_k = n(n − 1) ⋯ (n − k + 1)/k! = n!/(k!(n − k)!)

sequences of H, T, n-long in which k heads appear.

Simple computations show that if n is even, nC_k is a maximum for k = n/2, and if n is odd, nC_k has its maximum value at k = (n − 1)/2 and k = (n + 1)/2.
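Proposition 1.1 and the location of the maximum can be checked by brute force for small n. The sketch below is our own illustration (the function name `count_heads_sequences` is not from the text); it enumerates all 2^n sequences and compares the counts with the binomial coefficients:

```python
from itertools import product
from math import comb

def count_heads_sequences(n: int, k: int) -> int:
    """Count, by enumeration, the H/T sequences of length n with exactly k heads."""
    return sum(1 for seq in product("HT", repeat=n) if seq.count("H") == k)

n = 10
counts = [count_heads_sequences(n, k) for k in range(n + 1)]
# The brute-force counts agree with nC_k, and they sum to 2^n.
assert counts == [comb(n, k) for k in range(n + 1)]
assert sum(counts) == 2 ** n
# For even n the maximum is at k = n/2.
assert max(range(n + 1), key=lambda k: counts[k]) == n // 2
```

For n = 3, k = 2 this reproduces the answer 3 found with Fig. 1.2.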

Stirling's Approximation [59, Vol. I, pp. 50 ff.]

    n! = √(2πn) (n/e)^n (1 + ε_n),

where ε_n → 0 as n → ∞. We use this to get

(1.3)    2^{-2n} · (2n)C_n = (1/√(πn)) (1 + δ_n),

where δ_n → 0 as n → ∞. In 2n trials there are 2^{2n} possible sequences of outcomes H, T. Thus (1.3) implies that k = n for only a fraction of about 1/√(πn) of the sequences. Equivalently, the probability that the number of heads equals the number of tails is about 1/√(πn) for n large (see Fig. 1.3).

Conclusion. As n becomes large, the proportion of sequences such that heads comes up exactly n/2 times goes to zero (see Fig. 1.3).
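As a numerical illustration of (1.3) (ours, not from the text), one can compute the exact probability of exactly n heads in 2n tosses with exact integer binomial coefficients and watch its ratio to 1/√(πn) approach 1:

```python
from math import comb, pi, sqrt

def prob_equal_heads_tails(n: int) -> float:
    """Exact probability of exactly n heads in 2n fair tosses: (2n)C_n / 2^(2n)."""
    return comb(2 * n, n) / 4 ** n

# Each ratio equals 1 + delta_n; the deviation shrinks roughly like 1/(8n).
ratios = {n: prob_equal_heads_tails(n) * sqrt(pi * n) for n in (10, 100, 1000)}
```

At n = 1000 the ratio is already within about one part in a thousand of 1.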

Figure 1.1    Figure 1.2

Figure 1.3  Probability of exactly k heads in 2n tosses.

Whatever the "law of averages" may say, it is certainly not reasonable in a thousand tosses of a fair coin to expect exactly 500 heads. It is not possible to fix a number M such that for n large most of the sequences have the property that the number of heads in the sequence is within M of n/2. For 2n tosses this fraction of the sequences is easily seen to be less than 2M/√(πn) (forgetting δ_n) and so becomes smaller and smaller.

To be more reasonable, perhaps the best we can get is that usually the proportion of heads in n tosses is close to ½. More precisely:

Question. Given any ε > 0, for how many sequences does the proportion of heads differ from ½ by less than ε?

The answer to this question is one of the earliest and most famous of the limit theorems of probability. Let N(n, ε) be the number of sequences n-long satisfying the condition of the above question.

Theorem 1.4. lim_n 2^{-n} N(n, ε) = 1.

In other words, the fraction of sequences such that the proportion of heads differs from ½ by less than ε goes to one as n increases, for any ε > 0.
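Theorem 1.4 can be checked exactly for moderate n by summing binomial counts; the helper below is our own illustration, not from the text:

```python
from math import comb

def fraction_within(n: int, eps: float) -> float:
    """2^{-n} N(n, eps): exact fraction of n-long sequences whose
    proportion of heads differs from 1/2 by less than eps."""
    hits = sum(comb(n, k) for k in range(n + 1) if abs(k / n - 0.5) < eps)
    return hits / 2 ** n

# For fixed eps the fraction climbs toward 1 as n grows.
values = [fraction_within(n, 0.05) for n in (100, 1000, 10000)]
```

With ε = 0.05 the fraction rises from roughly two thirds at n = 100 to essentially one at n = 10000.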

This theorem is called the weak law of large numbers for fair coin-tossing. To prove this theorem we need to show that

(1.5)    2^{-n} Σ_{k; |k/n − ½| < ε} nC_k → 1 as n → ∞.

Theorem 1.4 states that most of the time, if you toss a coin n times, the proportion of heads will be close to ½. Is this what is intuitively meant by the law of averages? Not quite; the abiding faith seems to be that no matter how badly you have done on the first n tosses, eventually things will settle down and smooth out if you keep tossing the coin.

Ignore this faith for the moment. Let us go back and establish some notation and machinery so we can give Theorem 1.4 an interesting proof. One proof is simply to establish (1.5) by direct computation. It was done this way originally, but the following proof is simpler.


Definition 1.6

a) Let Ω_n be the space consisting of all sequences n-long of H, T. Denote these sequences by ω = (ω_1, . . . , ω_n).

b) Let A, B, C, and so forth, denote subsets of Ω_n. The probability P(A) of any subset A is defined as the sum of the probabilities of all sequences in A, that is,

    P(A) = Σ_{ω ∈ A} 2^{-n};

equivalently, P(A) is the fraction of the total number of sequences that are in A.

For example, one interesting subset of Ω_n is the set A_1 of all sequences such that the first member is H. This set can be described as "the first toss results in heads." We should certainly have, if (b) above makes sense, P(A_1) = ½. This is so, because there are exactly 2^{n−1} members of Ω_n whose first member is H.
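Definition 1.6 can be exercised directly for a small n by enumerating Ω_n. The sketch below is our own illustration (it represents an event as a predicate on sequences, a choice of ours, not the text's):

```python
from itertools import product
from fractions import Fraction

n = 8
omega_n = list(product("HT", repeat=n))      # all 2^n sequences

def prob(event) -> Fraction:
    """P(A) = (number of sequences in A) / 2^n, in the spirit of Definition 1.6(b)."""
    return Fraction(sum(1 for w in omega_n if event(w)), 2 ** n)

# The set A_1 = "first toss results in heads" has probability 1/2.
A1 = lambda w: w[0] == "H"
assert prob(A1) == Fraction(1, 2)
# Probabilities of disjoint events add.
B = lambda w: w[0] == "T" and w[1] == "T"
assert prob(lambda w: A1(w) or B(w)) == prob(A1) + prob(B)
```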

c) Let X(ω) be any real-valued function on Ω_n. Define the expected value of X as

    EX = Σ_{ω} X(ω) 2^{-n}.

Note that the expected value of X is just its average weighted by the probability. Suppose X(ω) takes the value x_1 on the set of sequences A_1, x_2 on A_2, and so forth; then, of course,

    EX = Σ_k x_k P(A_k).

And also note that EX is an integral, that is,

    E(aX + bY) = aEX + bEY,

where a, b are real numbers, and EX ≥ 0 for X ≥ 0. Also, in the future we will denote by {ω; ⋯} the subset of Ω_n satisfying the conditions following the semicolon.

The proof of 1.4 will be based on the important Chebyshev inequality.

Proposition 1.7. For X(ω) any function on Ω_n and any ε > 0,

    P(ω; |X(ω)| ≥ ε) ≤ EX²/ε².

Proof

    P(ω; |X| ≥ ε) = 2^{-n} (number of ω; |X(ω)| ≥ ε) = Σ_{ω; |X(ω)| ≥ ε} 2^{-n}
                  ≤ Σ_{ω} (X(ω)²/ε²) 2^{-n} = EX²/ε².
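Chebyshev's inequality over Ω_n is a finite statement, so it can be verified exhaustively for small n. The following check is our own illustration (taking X = S_n/n − ½, the function used in the proof of Theorem 1.4), done in exact rational arithmetic:

```python
from itertools import product
from fractions import Fraction

n = 10
omega = list(product((0, 1), repeat=n))      # 1 = heads; each sequence has probability 2^-n

def expect(f):
    """EX = sum over all sequences of X(omega) * 2^-n."""
    return sum(Fraction(f(w)) for w in omega) / 2 ** n

def X(w):
    return Fraction(sum(w), n) - Fraction(1, 2)   # S_n/n - 1/2

second_moment = expect(lambda w: X(w) ** 2)       # this comes out to 1/(4n)
for eps in (Fraction(1, 10), Fraction(1, 5), Fraction(2, 5)):
    lhs = expect(lambda w: 1 if abs(X(w)) >= eps else 0)   # P(|X| >= eps)
    assert lhs <= second_moment / eps ** 2                 # Chebyshev's bound
```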


Define functions X_1(ω), . . . , X_n(ω), S_n(ω) on Ω_n by

    X_j(ω) = 1 if the jth member of ω is H,
    X_j(ω) = 0 if the jth member of ω is T,

and S_n = X_1 + ⋯ + X_n, so that S_n(ω) is exactly the number of heads in the sequence ω. For practice, note that

    EX_1 = 0 · P(ω; first toss = T) + 1 · P(ω; first toss = H) = ½,

    EX_1X_2 = 0 · P(ω; either first toss or second toss = T)
            + 1 · P(ω; both first toss and second toss = H) = ¼

(since there are 2^{n−2} sequences beginning with HH). Similarly, check that if i ≠ j, then

    X_iX_j = 1 on 2^{n−2} sequences,

so that EX_iX_j = ¼. Also,

    X_j = 1 on 2^{n−1} sequences,

so that EX_j = ½. Finally, write X_j² = X_j, so that EX_j² = ½ and

(1.9)    ES_n² = Σ_j EX_j² + Σ_{i ≠ j} EX_iX_j = n/2 + n(n − 1)/4.

Proof of Theorem 1.4. By Chebyshev's inequality,

(1.10)    P(ω; |S_n/n − ½| ≥ ε) ≤ E(S_n/n − ½)²/ε².

Use (1.9) now to get

    E(S_n/n − ½)² = (ES_n² − n ES_n + n²/4)/n² = 1/(4n),

implying

    P(ω; |S_n/n − ½| ≥ ε) ≤ 1/(4nε²) → 0 as n → ∞.

Since P(Ω_n) = 1, this completes the proof.


Definition 1.11. Consider n independent tosses of a biased coin with probability p of heads. This is defined by

a) there are 2^n possible outcomes consisting of all sequences in Ω_n;
b) the probability P(ω) of any sequence ω is given by

    P(ω) = p^k (1 − p)^{n−k},

where k is the number of H's in ω.

As before, define P(A), A ⊂ Ω_n, by P(A) = Σ_{ω ∈ A} P(ω). For X(ω) any real-valued function on Ω_n, define EX = Σ_{ω} X(ω) P(ω).

The following problems concern biased coin-tossing.

Problems

1. Show that P(A ∪ B) ≤ P(A) + P(B), with equality if A and B are disjoint.

2. Show that

3. Show that Chebyshev's inequality 1.7 remains true for biased coin-tossing.

4. Prove the weak law of large numbers in the form: for any ε > 0,

    P(ω; |S_n/n − p| ≥ ε) → 0 as n → ∞.

5. Using Stirling's approximation, find an approximation to the value of
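Definition 1.11 can be sanity-checked by enumeration for a small n. The sketch below is our own illustration and assumes the product form P(ω) = p^k (1 − p)^{n−k} with k the number of heads in ω; exact fractions keep the identities exact:

```python
from itertools import product
from fractions import Fraction

n, p = 6, Fraction(1, 3)

def prob_seq(w) -> Fraction:
    """P(omega) = p^(#heads) * (1 - p)^(#tails); heads are encoded as 1."""
    k = sum(w)
    return p ** k * (1 - p) ** (n - k)

omega_n = list(product((0, 1), repeat=n))
total = sum(prob_seq(w) for w in omega_n)                 # should be exactly 1
mean_heads = sum(sum(w) * prob_seq(w) for w in omega_n)   # should be exactly n*p
```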

Definition 1.12. For ω ∈ Ω_n, ω = (ω_1, . . . , ω_n), where ω_i ∈ {H, T}, call ω_i the ith coordinate of ω or the outcome of the ith toss. Any subset A ⊂ Ω_n will be referred to as an event. An event A ⊂ Ω_n will be said to depend only on the i_1, . . . , i_k tosses if it is of the form

    A = {ω; (ω_{i_1}, . . . , ω_{i_k}) ∈ E},    E ⊂ Ω_k.

Problems

6. If A is of the form above, show that P(A) = P′(E), where P′ is the fair coin-tossing probability defined on Ω_k.

7. If A, B ⊂ Ω_n are such that A depends on the i_1, . . . , i_k tosses and B on the j_1, . . . , j_m tosses, and the sets {i_1, . . . , i_k} and {j_1, . . . , j_m} have no common member, then

    P(A ∩ B) = P(A)P(B).

[Hint: On Problems 6 and 7 above induction works.]


Figure 1.4  Probability of k heads in n tosses.

3. THE BELL-SHAPED CURVE ENTERS (Fluctuation Theory)

For large n, the weak law of large numbers says that most outcomes have about n/2 heads, more precisely, that the number of heads falls in the range (n/2)(1 ± ε) with probability almost one for n large. Pose the question, how large a fluctuation about n/2 is nonsurprising? For instance, if you get 60 H in 100 tosses, will you strongly suspect the coin of being biased? If you get 54? 43? and so on. Look at the graph in Fig. 1.4. What we want is a function φ(n) increasing with n such that

(1.13)    P(ω; |S_n − n/2| ≤ φ(n)) → α,    0 < α < 1.

There are useful things we know about φ(n). As the maximum height of the graph is order 1/√n, we suspect that we will have to go about √n steps on either side. Certainly if we put φ(n) = x_n √n/2 (the ½ factor to make things work out later on), then lim x_n > 0; otherwise the limit in (1.13) would be zero. By Chebyshev's inequality,

    P(ω; |S_n − n/2| ≥ x_n √n/2) ≤ 1/x_n².

So lim x_n < ∞; otherwise α in (1.13) would have to be one. These two bounds lead to the immediate suspicion that we could take x_n → x, 0 < x < ∞. But there is no reason, then, not to try x_n = x, all n. First, examine the case for n even. We want to evaluate

    P(ω; |S_{2n} − n| ≤ x√(2n)/2).

This is given by

    2^{-2n} Σ_{|k − n| ≤ x√(n/2)} (2n)C_k.

Put k = n + j, to get

    2^{-2n} Σ_{|j| ≤ x√(n/2)} (2n)C_{n+j}.


Put

    (2n)C_{n+j} = (2n)C_n · D_{j,n},

and let D_{j,n} be the second factor above,

    D_{j,n} = n! n! / ((n + j)! (n − j)!).

Use the expansion log (1 + x) = x(1 + ε(x)), where lim_{x→0} ε(x) = 0. Note that j is restricted to the range R_n = {j; |j| ≤ x√(n/2)}, so that if we write

    log D_{j,n} = −(j²/n)(1 + ε_{j,n}),

then sup_{j ∈ R_n} |ε_{j,n}| → 0. Writing

    2^{-2n} Σ_{j ∈ R_n} (2n)C_{n+j} = 2^{-2n} (2n)C_n Σ_{j ∈ R_n} e^{−(j²/n)(1 + ε_{j,n})}

and using (1.3), we find that this equals

    (1 + δ_n) (1/√(πn)) Σ_{j ∈ R_n} e^{−(j²/n)(1 + ε_{j,n})},

where δ_n → 0. Make the change of variable u_j = j√(2/n); the condition j ∈ R_n becomes |u_j| ≤ x, and the points u_j are spaced √(2/n) apart.


Now the end is near:

    2^{-2n} Σ_{j ∈ R_n} (2n)C_{n+j} = (1 + γ_n) (1/√(2π)) Σ_{|u_j| ≤ x} e^{−u_j²/2} √(2/n),

where γ_n → 0. The factor on the right is graciously just the approximating sum for an integral; that is, we have now shown that

    lim_n P(ω; |S_{2n} − n| ≤ x√(2n)/2) = (1/√(2π)) ∫_{−x}^{x} e^{−u²/2} du.

To get the odd values of n take h > 0 and note that for n sufficiently large

yielding

Thus we have proved, as done originally (more or less), a special case of the famous central limit theorem, which along with the law of large numbers shares the throne in probability theory.

Theorem 1.15

    lim_n P(ω; |S_n − n/2| ≤ x√n/2) = (1/√(2π)) ∫_{−x}^{x} e^{−u²/2} du.

There is a more standard form for this theorem: Let

    Φ(x) = (1/√(2π)) ∫_{−∞}^{x} e^{−u²/2} du,

and Z_n = 2S_n − n, that is, Z_n is the excess of heads over tails in n tosses, or if

    Y_j(ω) = +1 if the jth member of ω is H, −1 if it is T,

then Z_n = Y_1 + ⋯ + Y_n. From 1.15,

    lim_n P(ω; |Z_n|/√n ≤ x) = Φ(x) − Φ(−x).


By symmetry,

    P(ω; Z_n/√n ≤ −x) = P(ω; Z_n/√n ≥ x),

giving

    lim_n P(ω; Z_n/√n ≤ −x) = ½ [1 − (Φ(x) − Φ(−x))].

But Φ(+∞) = 1, so 1 − Φ(−x) = Φ(x), and therefore,

Theorem 1.17

    lim_n P(ω; Z_n/√n ≤ x) = Φ(x).

Thus, the asymptotic distribution of the deviation of the number of heads from n/2 is governed by Φ(x). That Φ(x), the normal curve, should be singled out from among all other limiting distributions is one of the most magical and puzzling results in probability. Why Φ(x)? The above proof gives very little insight as to what properties of Φ(x) cause its sudden appearance against the simple backdrop of fair coin-tossing. We return to this later.
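The convergence in Theorem 1.17 can be watched numerically. The sketch below is our own illustration; it uses the exact binomial distribution of S_n and compares P(Z_n/√n ≤ x) with Φ(x), computed via the error function:

```python
from math import comb, erf, sqrt

def Phi(x: float) -> float:
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def cdf_scaled_excess(n: int, x: float) -> float:
    """P(Z_n / sqrt(n) <= x), where Z_n = 2 S_n - n and S_n is Binomial(n, 1/2)."""
    # Z_n <= x sqrt(n)  iff  S_n <= (n + x sqrt(n)) / 2
    bound = (n + x * sqrt(n)) / 2
    return sum(comb(n, k) for k in range(n + 1) if k <= bound) / 2 ** n

# The discrepancy at a fixed x shrinks as n grows, as Theorem 1.17 asserts.
errors = [abs(cdf_scaled_excess(n, 0.5) - Phi(0.5)) for n in (50, 500, 5000)]
```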

Problems

8. Using Theorem 1.15 as an approximation device, find the smallest integer N such that in 1600 tosses of a fair coin, there is probability at least 0.99 that the number of heads will fall in the range 800 ± N.

9. Show that for j even,

10. Show that

where the sup is over all even j.

11. Consider a sequence of experiments such that on the nth experiment a coin is tossed independently n times with probability of heads p_n. If lim_n np_n = λ, then, letting S_n be the number of heads occurring in the nth experiment, show that

    lim_n P(ω; S_n = k) = e^{−λ} λ^k / k!.


4. STRONG FORM OF THE "LAW OF AVERAGES"

Definition 1.19. Let Ω be the space consisting of all infinite sequences of H's and T's. Denote a point in Ω by ω. Let ω_j be the jth letter in ω, that is, ω = (ω_1, ω_2, . . .).

Define functions on Ω as follows:

    X_j(ω) = 1 if ω_j = H,  X_j(ω) = 0 if ω_j = T;    S_n(ω) = X_1(ω) + ⋯ + X_n(ω).

We are concerned with the behavior of S_n(ω)/n for large n. The intuitive notion of the law of averages would be that

(1.20)    lim_n S_n(ω)/n = ½

for all ω ∈ Ω. This is obviously false; the limit need not exist, and even if it does, it certainly does not need to equal ½. [Consider ω = (H, H, H, . . .).] The most we can ask is whether for almost all sequences in Ω, (1.20) holds.
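Before tackling the difficulty below, a quick Monte Carlo sketch (ours, not part of the argument; the seed is arbitrary) shows one simulated sequence whose running proportion S_n/n drifts toward ½:

```python
import random

rng = random.Random(0)            # fixed seed so the run is reproducible
heads = 0
proportions = {}
for n in range(1, 100_001):
    heads += rng.randint(0, 1)    # 1 = heads, 0 = tails
    if n in (100, 10_000, 100_000):
        proportions[n] = heads / n

# Typically |S_n/n - 1/2| shrinks on the order of 1/sqrt(n); of course one
# sample path proves nothing, which is exactly why a real proof is needed.
```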

What about the set E of ω such that (1.20) does not hold? We would like to say that in some sense the probability of this exceptional set is small. Here we encounter an essential difficulty. We know how to assign probabilities to sets of the form

    {ω; ω_1 = a_1, . . . , ω_n = a_n}.

Indeed, if A ⊂ Ω_n, we know how to assign probabilities to all sets of the form

    {ω; (ω_1, . . . , ω_n) ∈ A},

that is, simply as before,

    P(ω; (ω_1, . . . , ω_n) ∈ A) = 2^{-n} (number of sequences in A).

But the exceptional set E is the set such that S_n(ω)/n ↛ ½ and so does not depend on (ω_1, . . . , ω_n) for any n, but rather on the asymptotic distribution of heads and tails in the sequence ω.

But anyhow, let's try to push through a proof and then see what is wrong with it and what needs to be fixed up.

Theorem 1.21 (Strong law of large numbers). The probability of the set of sequences E such that S_n(ω)/n ↛ ½ is zero.

Proof. First note that it is sufficient to show that S_{m²}/m² → ½ along the subsequence n = m², since for any n we can take m such that m² ≤ n < (m + 1)². For this m,

    S_{m²}/(m + 1)² ≤ S_n/n ≤ S_{(m+1)²}/m²,

and (m + 1)²/m² → 1.

Fix ε > 0 and let E_ε be the set {ω; lim sup_m |S_{m²}/m² − ½| > ε}. Look at the set E_{m_0,m_1} ⊂ Ω of sequences such that the inequality |S_{m²}/m² − ½| > ε occurs at least once for m_0 ≤ m ≤ m_1. That is,

    E_{m_0,m_1} = ∪_{m=m_0}^{m_1} {ω; |S_{m²}(ω)/m² − ½| > ε}.

The set E_{m_0,m_1} is a set of sequences that depends only on the coordinates ω_1, . . . , ω_{m_1²}. We know how to assign probability to such sets, and applying the result of Problem 1,

    P(E_{m_0,m_1}) ≤ Σ_{m=m_0}^{m_1} P(ω; |S_{m²}/m² − ½| > ε).

Using Chebyshev's inequality in the form (1.10) we get

    P(E_{m_0,m_1}) ≤ Σ_{m=m_0}^{m_1} 1/(4m²ε²).

Let m_1 go to infinity and note that

    E_{m_0} = ∪_{m_1 ≥ m_0} E_{m_0,m_1}

is the set of all sequences such that the inequality |S_{m²}/m² − ½| > ε occurs at least once for m ≥ m_0. Also note that the {E_{m_0,m_1}} are an increasing sequence of sets in m_1 for m_0 fixed. If we could make a vital transition and say that

    P(E_{m_0}) = lim_{m_1} P(E_{m_0,m_1}),

then it would follow that

    P(E_{m_0}) ≤ Σ_{m ≥ m_0} 1/(4m²ε²).

Now lim sup_m |S_{m²}/m² − ½| > ε if and only if for any m_0 there exists m ≥ m_0 such that |S_{m²}/m² − ½| > ε. From this, E_ε = lim_{m_0} E_{m_0}, where the sets E_{m_0} are decreasing in m_0. (The limits of increasing or decreasing sequences of sets are well defined, for example, lim_{m_1} E_{m_0,m_1} = ∪_{m_1} E_{m_0,m_1}, and so forth.) If we could again assert as above that

    P(E_ε) = lim_{m_0} P(E_{m_0}),

then

    P(E_ε) ≤ lim_{m_0} Σ_{m ≥ m_0} 1/(4m²ε²) = 0.

By definition, E is the set {ω; lim sup_m |S_{m²}/m² − ½| > 0}, so E = lim_k E_{1/k}, k running through the positive integers, and the sets E_{1/k} increasing in k. Once more, if we assert that

    P(E) = lim_k P(E_{1/k}),

then since P(E_{1/k}) = 0, all k > 0, consequently P(E) = 0 and the theorem is proven. Q.E.D.

The real question is one of how probability may be assigned to subsets of Ω. What we need for the above proof is an assignment of probability P(·) on a class of subsets ℱ of Ω such that ℱ contains all the sets that appear in the above proof and such that P(·) in some way corresponds to a fair coin-tossing probability. More concretely, what we want are the statements

(1.22)

i) ℱ contains all subsets depending only on a finite number of tosses, that is, all sets of the form {ω; (ω_1, . . . , ω_n) ∈ A}, A ⊂ Ω_n, and P(·) is defined on these sets by

    P(ω; (ω_1, . . . , ω_n) ∈ A) = P_n(A),

where P_n is the probability defined previously on Ω_n;

ii) if A_n is any monotone sequence of sets in ℱ, then lim_n A_n is also in ℱ;

iii) if the A_n are as in (ii) above, then

    P(lim_n A_n) = lim_n P(A_n);

iv) if A, B ∈ ℱ are disjoint, then

    P(A ∪ B) = P(A) + P(B).

Of these four, one is simply the requirement that the assignment be consistent with our previous definition of independent coin-tossing. Two and three are exactly the statement of what is needed to make the transitions in the proof of the law of large numbers valid. Four is that the assignment P(·) continue to have on Ω the property that the probability assignment has on Ω_n and whose absence would seem intuitively most offensive, namely,



that if two sets of outcomes are disjoint, then the probability of getting intoeither one or the other is the sum of the probabilities of each one.

Also, is the assignment of P(-) unique in any sense? If it is not, then weare in real difficulty. We can put the above questions into more amenableform. Let &Q be the class of all subsets of Q depending on only a finitenumber of tosses, then

Proposition 1.23. ℱ₀ is a field, where

Definition 1.24. A class of subsets 𝒞 of a space Ω is a field if it is closed under finite unions, intersections, and complementation. The complement of Ω is the empty set ∅.

The proof of (1.23) is a direct verification. For economy, take ℱ to be the smallest class of sets containing ℱ₀ such that ℱ has property (1.22ii). That such a smallest class exists can be established by considering the sets common to every class of sets containing ℱ₀ and satisfying (1.22ii). But (see Appendix A), these properties imply

Proposition 1.25. ℱ is the smallest σ-field containing ℱ₀, where

Definition 1.26. A class of subsets ℱ of Ω is a σ-field if it is closed under complementation and countable intersections and unions. For any class 𝒞 of subsets of Ω, denote by ℱ(𝒞) the smallest σ-field containing 𝒞.

Also

Proposition 1.27. P(·) on ℱ satisfies (1.22) iff P(·) is a probability measure, where

Definition 1.28. A nonnegative set function P(·) defined on a σ-field ℱ of subsets of Ω is a probability measure if

i) (normalization) P(Ω) = 1;
ii) (σ-additivity) for every finite or countable collection {B_k} of sets in ℱ such that B_k is disjoint from B_j, k ≠ j,

P(⋃_k B_k) = Σ_k P(B_k).

Proof of 1.27. If P(·) satisfies (1.22), then by finite induction on (iv),

P(B₁ ∪ ⋯ ∪ Bₙ) = Σ₁ⁿ P(B_k).

Let Aₙ = ⋃₁ⁿ B_k; then the Aₙ are a monotone sequence of sets, lim Aₙ = ⋃₁^∞ B_k. By (1.22iii),

P(⋃₁^∞ B_k) = limₙ P(Aₙ) = limₙ Σ₁ⁿ P(B_k) = Σ₁^∞ P(B_k).


Conversely, if P(·) is σ-additive, it implies (1.22). For if the {Aₙ} are a monotone sequence, say Aₙ ⊂ Aₙ₊₁, we can let B_k = A_k − A_{k−1}, k > 1, B₁ = A₁. The {B_k} are disjoint, and limₙ Aₙ = ⋃ B_k. Thus σ-additivity gives

The starting point is a set function with the following properties:

Definition 1.29. A nonnegative set function P on a field ℱ₀ is a finite probability measure if

i) P(Ω) = 1;
ii) for A, B ∈ ℱ₀ and disjoint,

P(A ∪ B) = P(A) + P(B).

Now the original question can be restated in more standard form: Given the finite probability measure P(·) defined on ℱ₀ by (1.22i), does there exist a probability measure defined on ℱ and agreeing with P(·) on ℱ₀? And in what sense is the measure unique? The problem is seen to be one of extension: given P(·) on ℱ₀, is it possible to extend the domain of definition of P(·) to ℱ such that it is σ-additive? But this is a standard measure-theoretic question. The surprise is that the attempt to patch up the strong law of large numbers has led directly to this well-known problem (see Appendix A.9).

5. AN ANALYTIC MODEL FOR COIN-TOSSING

The fact that the sequence X₁(ω), X₂(ω), … comprised functions depending on consecutive independent tosses of a fair coin was to some extent immaterial in the proof of the strong law. For example, produce functions on a different space Ω′ this way: toss a well-balanced, six-sided die independently; let Ω′ be the space of all infinite sequences ω′ = (ω′₁, ω′₂, …), where ω′_k takes values in {1, …, 6}. Define X′ₙ(ω′) to be one if the nth throw results in an even face, zero if in an odd face. The sequence X′₁(ω′), X′₂(ω′), … has the same probabilistic structure as X₁(ω), … in the sense that the probability of any sequence n long of zeros and ones is 1/2ⁿ in both models (with the appropriate definition of independent throws of a well-balanced die). But this assignment of probabilities is the important information, rather than the exact nature of the underlying space. For example, the same argument


leading to the strong law of large numbers holds for the variables X′₁, X′₂, …. Therefore, in general, we will consider as a model for fair coin-tossing any set of functions X₁, X₂, …, with values zero or one, defined on a space Ω of points ω, such that probability 1/2ⁿ is assigned to all sets of the form

{ω; X₁(ω) = s₁, …, Xₙ(ω) = sₙ}

for s₁, …, sₙ any sequence of zeros and ones.

An interesting analytic model can be constructed on the half-open

unit interval Ω = [0, 1). It can be shown that every number x in [0, 1) has a unique binary expansion containing an infinite number of zeros. The latter restriction takes care of the binary rational points, which have two expansions; that is, for example,

.0111⋯ = .1000⋯

Now for any x ∈ [0, 1) write down this expansion x = .x₁x₂⋯ and define

Xₙ(x) = xₙ.

That is, Xₙ(x) is the nth digit in the expansion of x (see Fig. 1.5).

Figure 1.5

To every interval I ⊂ [0, 1), assign probability P(I) = ‖I‖, the length of I. Now check that the probability of the set

{x; X₁(x) = s₁, …, Xₙ(x) = sₙ}

is 1/2ⁿ, because this set is exactly the interval

[.s₁s₂⋯sₙ, .s₁s₂⋯sₙ + 1/2ⁿ).

Thus, X₁(x), X₂(x), … on [0, 1) with the given probability is a model for fair coin-tossing.
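The dyadic-interval computation above is easy to check mechanically. The following sketch (ours, not the book's; the function names are hypothetical) computes Xₙ(x) exactly with rational arithmetic and verifies that the digit event is the stated interval of length 1/2ⁿ:

```python
from fractions import Fraction

def X(n, x):
    # n-th binary digit of x in [0, 1): the integer part of 2^n * x, mod 2
    return int(x * 2**n) % 2

def digit_event_interval(s):
    # {x; X_1(x) = s_1, ..., X_n(x) = s_n} is the dyadic interval
    # [.s_1...s_n, .s_1...s_n + 1/2^n), of length 1/2^n
    n = len(s)
    left = sum(Fraction(si, 2**(i + 1)) for i, si in enumerate(s))
    return left, left + Fraction(1, 2**n)

s = (1, 0, 1)
a, b = digit_event_interval(s)
assert b - a == Fraction(1, 8)          # probability 1/2^3
for k in range(8):                      # points spread through [a, b)
    x = a + k * (b - a) / 8
    assert tuple(X(n, x) for n in (1, 2, 3)) == s
```

Exact fractions avoid the floating-point roundoff that would otherwise blur the interval endpoints.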

The interest in this particular model is that the extension of P is a classical result. The smallest σ-field containing all the intervals


is the Borel field ℬ([0, 1)) of subsets of [0, 1), and there is a unique extension of P to a probability on this field, namely, Lebesgue measure. The theorem establishing the existence of Lebesgue measure makes the proof of the strong law of large numbers rigorous for the analytic model. The statement in this context is:

Theorem 1.31. For almost all x ∈ [0, 1) with respect to Lebesgue measure, the asymptotic proportion of zeros and of ones in the binary expansion of x is ½.

The existence of Lebesgue measure in the analytic model for coin-tossing makes it plausible that there is an extended probability in the original model. This is true, but we defer the proof to the next chapter.
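Theorem 1.31 lends itself to a numerical illustration. A small Monte Carlo sketch (ours, not part of the text; `digit_proportion` is a hypothetical helper) samples x uniformly, which is sampling with respect to Lebesgue measure, and tallies the ones among its first binary digits:

```python
import random

random.seed(7)

def digit_proportion(x, n):
    # proportion of ones among the first n binary digits of x in [0, 1)
    ones = 0
    for _ in range(n):
        x *= 2
        d = int(x)   # next binary digit
        ones += d
        x -= d
    return ones / n

# average the proportion over many uniform samples: it should hug 1/2
props = [digit_proportion(random.random(), 50) for _ in range(200)]
avg = sum(props) / len(props)
assert abs(avg - 0.5) < 0.05
```

Only finitely many digits are examined here, of course; the theorem itself is a statement about the asymptotic proportion.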

Another way of looking at the analytic model is to say that the binary expansion of a number in [0, 1) produces independent zeros and ones with probability ½ each with respect to Lebesgue measure. Thus, as Theorem 1.31 illustrates, any results established for fair coin-tossing can be written as theorems concerning functions and numbers on [0, 1) with respect to Lebesgue measure. Denote Lebesgue measure from now on by dx or l(dx).

Problem 12 (The Rademacher Functions). Let

yₙ(x) = 1 − 2xₙ,

where x = .x₁x₂⋯ is the unique binary expansion of x containing an infinite number of zeros. Show from the properties of coin-tossing that if i_k ≠ i_j for j ≠ k,

∫₀¹ y_{i₁}(x) ⋯ y_{iₘ}(x) dx = 0.

Graph y₃(x), y₄(x). Show that the sequence of functions {yₙ(x)} is orthonormal with respect to Lebesgue measure.
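The orthonormality claim of Problem 12 can be checked exactly by machine. The sketch below (ours) uses the convention yₙ(x) = 1 − 2xₙ; the book's elided definition may carry the opposite sign, which would not affect orthonormality. Since y₁, …, y₄ are constant on dyadic intervals of length 1/16, the integrals are finite sums:

```python
from fractions import Fraction

def r(n, x):
    # Rademacher function via the n-th binary digit: r_n(x) = 1 - 2*x_n
    # (a flipped sign convention would leave orthonormality unchanged)
    return 1 - 2 * (int(x * 2**n) % 2)

def integrate(f, pieces):
    # exact integral over [0, 1) of a function constant on each
    # dyadic interval [k/pieces, (k+1)/pieces)
    return sum(f(Fraction(k, pieces)) for k in range(pieces)) * Fraction(1, pieces)

# orthonormality of r_1, ..., r_4: the inner product is 1 iff i == j
for i in range(1, 5):
    for j in range(1, 5):
        inner = integrate(lambda x: r(i, x) * r(j, x), 16)
        assert inner == (1 if i == j else 0)
```

The same independence argument behind the problem also makes higher products y_{i₁}⋯y_{iₘ} with distinct indices integrate to zero.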

6. CONCLUSIONS

The strong law of large numbers and the central limit theorem illustrate the two main types of limit theorems in probability.

Strong limit theorems. Given a sequence of functions Y₁(ω), Y₂(ω), …, there is a limit function Y(ω) such that

P(ω; limₙ Yₙ(ω) = Y(ω)) = 1.

Weak limit theorems. Given a sequence of functions Y₁(ω), Y₂(ω), …, show that

limₙ P(ω; Yₙ(ω) < x)

exists for every x.


There is a great difference between strong and weak theorems, which will become more apparent. We will show later, for instance, that Zₙ/√n has no limit in any reasonable way. A more dramatic example of this is: on ([0, 1), ℬ([0, 1))) with P being Lebesgue measure, define

Yₙ(y) = y

for n even. For n odd,

Yₙ(y) = 1 − y.

For all n, P(y; Yₙ(y) < x) = P(y; Y₁(y) < x). But for every y ∈ [0, 1) with y ≠ ½, the sequence Yₙ(y) oscillates between y and 1 − y, so limₙ Yₙ(y) does not exist.

To begin with, we concentrate on strong limit theorems. But to do this we need a more firmly constructed measure-theoretic foundation.
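The contrast between equal distributions and pointwise behavior can be seen concretely. The sketch below (ours) uses the standard choice Yₙ(y) = y for n even and 1 − y for n odd; the book's elided formulas may differ in detail, but any identically distributed, pointwise-oscillating sequence makes the same point:

```python
import random

random.seed(0)

def Y(n, y):
    # identically distributed under Lebesgue measure on [0, 1),
    # yet with no pointwise limit (except at y = 1/2)
    return y if n % 2 == 0 else 1 - y

# same distribution function for every n: P(Y_n < x) = x (Monte Carlo check)
samples = [random.random() for _ in range(20000)]
for n in (2, 3):
    p = sum(Y(n, y) < 0.3 for y in samples) / len(samples)
    assert abs(p - 0.3) < 0.02

# but pointwise the sequence oscillates forever between y and 1 - y
assert Y(2, 0.2) == 0.2 and Y(3, 0.2) == 0.8
```

So the weak-limit statement (convergence of the distribution functions) holds trivially, while the strong-limit statement fails almost everywhere.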

NOTES

To get some of the fascinating interplay between probability and number theory, refer to Mark Kac's monograph [83].

Although there will be very little subsequent work with combinatorics in this text, they occupy an honored and powerful place in probability theory. First, for many of the more important theorems, the original version was for independent fair coin-tossing. Even outside of this, there are some strong theorems in probability for which the most interesting proofs are combinatorial. A good source for these uses is Feller's books [59].

An elegant approach to the measure-theoretic aspects of probability can be found in Neveu's book [113].


CHAPTER 2

MATHEMATICAL FRAMEWORK

1. INTRODUCTION

The context that is necessary for the strong limit theorems we want to prove is:

Definition 2.1. A probability space consists of a triple (Ω, ℱ, P) where

i) Ω is a space of points ω, called the sample space, whose points are called sample points;
ii) ℱ is a σ-field of subsets of Ω; these subsets are called events;
iii) P(·) is a probability measure on ℱ; henceforth we refer to P simply as a probability.

On Ω there is defined a sequence of real-valued functions X₁(ω), X₂(ω), … which are random variables in the sense of

Definition 2.2. A function X(ω) defined on Ω is called a random variable if for every Borel set B in the real line R⁽¹⁾, the set {ω; X(ω) ∈ B} is in ℱ. (X(ω) is a measurable function on (Ω, ℱ).)

Whether a given function is a random variable, of course, depends on the pair (Ω, ℱ). The reason underlying 2.2 is that we want probability assigned to all sets of the form {ω; X(ω) ∈ I}, where I is some interval. It will follow from 2.29 that if {ω; X(ω) ∈ I} is in ℱ for all intervals I, then X must be a random variable.

Definition 2.3. A countable stochastic process, or process, is a sequence of random variables X₁, X₂, … defined on a common probability space (Ω, ℱ, P).

But in a probabilistic model arising in gambling or science, the given data are usually an assignment of probability to a much smaller class of sets. For example, if all the variables X₁, X₂, … take values in some countable set F, the probability of all sets of the form

{ω; X₁(ω) = s₁, …, Xₙ(ω) = sₙ}

is usually given. If the X₁, X₂, … are not discrete, then often the specification is for all sets of the form

{ω; X₁(ω) ∈ I₁, …, Xₙ(ω) ∈ Iₙ},

where I₁, …, Iₙ are intervals.



To justify the use of a probability space as a framework for probability theory, it is really necessary to show that a reasonable assignment of probabilities to a small class of sets has a unique extension to a probability P on a probability space (Ω, ℱ, P). There are fairly general results to this effect. We defer this until we have explored some of the measure-theoretic properties of processes.

2. RANDOM VECTORS

Given two spaces Ω and R, let X be a function on Ω to R, X: Ω → R. The inverse image under X of a set B ⊂ R is {ω; X(ω) ∈ B}. We abbreviate this by {X ∈ B}.

Proposition 2.6. Set operations are preserved under inverse mappings; that is,

{X ∈ ⋃ₙ Bₙ} = ⋃ₙ {X ∈ Bₙ},  {X ∈ ⋂ₙ Bₙ} = ⋂ₙ {X ∈ Bₙ},  {X ∈ Bᶜ} = {X ∈ B}ᶜ.

(Bᶜ denotes the complement of the set B.)

Proof. By definition.

This quickly gives

Proposition 2.7. If X: Ω → R and ℬ is a σ-field in R, the class of sets {X ∈ B}, B ∈ ℬ, is a σ-field. If ℱ is a σ-field in Ω, then the class of subsets B of R such that {X ∈ B} ∈ ℱ is a σ-field.

Proof. Both assertions are obvious from 2.6.

Definition 2.8. If there are σ-fields ℱ and ℬ in Ω, R respectively, X: Ω → R is called a random vector if {X ∈ B} ∈ ℱ for all B ∈ ℬ. (X is a measurable map from (Ω, ℱ) to (R, ℬ).)

We will sometimes refer to (R, ℬ) as the range space of X. But the range of X is the direct image under X of Ω, that is, the union of all points X(ω), ω ∈ Ω.

Denote by ℱ(X) the σ-field of all sets of the form {X ∈ B}, B ∈ ℬ.

Definition 2.9. If 𝒜 is a σ-field contained in ℱ, call X 𝒜-measurable if ℱ(X) ⊂ 𝒜.

If there is a probability space (Ω, ℱ, P) and X is a random vector with range space (R, ℬ), then a set function P̂ can be naturally defined on ℬ by

P̂(B) = P(X ∈ B), B ∈ ℬ.

It is easy to check that P̂ defined this way is a probability on ℬ.


Definition 2.10. P̂ is called the probability distribution of the random vector X.

Conversely, suppose X is a random vector on (Ω, ℱ) to (R, ℬ) and there is a probability distribution P̂ defined on ℬ. Since every set in ℱ(X) is of the form {X ∈ B}, B ∈ ℬ, can P be defined on ℱ(X) by

(2.11)  P({X ∈ B}) = P̂(B)?

The answer, in general, is no! The difficulty is that the same set A ∈ ℱ(X) may be represented in two different ways as {X ∈ B₁} and {X ∈ B₂}, and there is no guarantee that P̂(B₁) = P̂(B₂). What is true is

Proposition 2.12. Let Γ be the range of X. If B ∈ ℬ, B ⊂ Γᶜ implies P̂(B) = 0, then P is uniquely defined on ℱ(X) by 2.11, and is a probability.

Proof. If A = {X ∈ B₁} = {X ∈ B₂}, then B₁ − B₂ and B₂ − B₁ are both in Γᶜ. Hence P̂(B₁) = P̂(B₂). The σ-additivity is quickly verified.

Problem 1. Use 2.12 and the existence of the analytic model for coin-tossing to prove the existence of the desired extension of P in the original model.

3. THE DISTRIBUTION OF PROCESSES

Denote by R⁽∞⁾ the space consisting of all infinite sequences (x₁, x₂, …) of real numbers. In R⁽∞⁾ an n-dimensional rectangle is a set of the form

{x; x₁ ∈ I₁, …, xₙ ∈ Iₙ},

where I₁, …, Iₙ are finite or infinite intervals. Take the Borel field ℬ∞ to be the smallest σ-field of subsets of R⁽∞⁾ containing all finite-dimensional rectangles.

If each component of X = (X₁, …) is a random variable, then it follows that the vector X is a measurable mapping to (R⁽∞⁾, ℬ∞). In other words,

Proposition 2.13. If X₁, X₂, … are random variables on (Ω, ℱ), then for X = (X₁, X₂, …) and every B ∈ ℬ∞, {X ∈ B} ∈ ℱ.

Proof. Let S be a finite-dimensional rectangle

{x; x₁ ∈ I₁, …, xₙ ∈ Iₙ}.

Then

{X ∈ S} = {X₁ ∈ I₁} ∩ ⋯ ∩ {Xₙ ∈ Iₙ}.

This is certainly in ℱ. Now let 𝒞 be the class of sets C in ℬ∞ such that {X ∈ C} ∈ ℱ. By 2.7, 𝒞 is a σ-field. Since 𝒞 contains all rectangles, 𝒞 = ℬ∞.

If all that we observe are the values of a process X₁(ω), X₂(ω), …, the underlying probability space is certainly not uniquely determined. As an example, suppose that in one room a fair coin is being tossed independently,


and calls of zero or one are being made for tails or heads, respectively. In another room a well-balanced die is being cast independently and zero or one called as the resulting face is odd or even. There is, however, no way of discriminating between these two experiments on the basis of the calls.

Denote X = (X₁, …). From an observational point of view, the thing that really interests us is not the space (Ω, ℱ, P), but the distribution of the values of X. If two processes, X on (Ω, ℱ, P) and X′ on (Ω′, ℱ′, P′), have the same probability distribution,

(2.14)  P(X ∈ B) = P′(X′ ∈ B), all B ∈ ℬ∞,

then there is no way of distinguishing between the processes by observing them.

Definition 2.15. Two processes {Xₙ} on (Ω, ℱ, P) and {X′ₙ} on (Ω′, ℱ′, P′) will be said to have the same distribution if (2.14) holds.

The distribution of a process contains all the information which is relevant to probability theory. All theorems we will prove depend only on the distribution of the process, and hence hold for all processes having that distribution. Among all processes having a given distribution P̂ on ℬ∞, there is one which has some claim to being the simplest.

Definition 2.16. For any given distribution P̂ define random variables X̃ₙ on (R⁽∞⁾, ℬ∞, P̂) by

X̃ₙ(x) = xₙ,  x = (x₁, x₂, …).

This process is called the coordinate representation process and has the same distribution as the original process.

This last assertion is immediate since for any B ∈ ℬ∞,

P̂(X̃ ∈ B) = P̂(B) = P(X ∈ B).

This construction also leads to the observation that given any probability P̂ on ℬ∞, there exists a process X such that P(X ∈ B) = P̂(B).

Define the Borel field ℬₙ in R⁽ⁿ⁾ as the smallest σ-field containing all n-dimensional rectangles.

Definition 2.17. An n-dimensional cylinder set in R⁽∞⁾ is any set of the form

{x; (x₁, …, xₙ) ∈ B},  B ∈ ℬₙ.

Problems

2. Show that the class of all finite-dimensional cylinder sets is a field, but not a σ-field.

3. Let F be a countable set, F = {rⱼ}. Denote by F⁽∞⁾ the set of all infinite sequences with coordinates in F. Show that X: Ω → F is a random variable with respect to (Ω, ℱ) iff {ω; X(ω) = rⱼ} ∈ ℱ, all j.


4. Given two processes X, X′ such that both take values in F⁽∞⁾, show that they have the same distribution iff

P(X₁ = s₁, …, Xₙ = sₙ) = P′(X′₁ = s₁, …, X′ₙ = sₙ)

for every n-sequence (s₁, …, sₙ) of elements of F.

4. EXTENSION IN SEQUENCE SPACE

Given the concept of the distribution of a process, the extension problem can be looked at in a different way. The given data are a specification of values P(X ∈ B) for a class of sets in ℬ∞. That is, a set function P̂ is defined on a class of sets 𝒞 ⊂ ℬ∞, and P̂(B) is the probability that the observed values of the process fall in B ∈ 𝒞. Now ask: Does there exist any process whose distribution agrees with P̂ on 𝒞? Alternatively, construct a process X such that P(X ∈ B) = P̂(B) for all B ∈ 𝒞. This is equivalent to the question of whether P̂ on 𝒞 can be extended to a probability on ℬ∞. Because if so, the coordinate representation process has the desired distribution. As far as the original sample space is concerned, once P̂ on ℬ∞ is gotten, 2.12 can be used to get an extension of P to ℱ(X), if P̂ assigns probability zero to sets B falling in the complement of the range of X.

Besides this, another reason for looking at the extension problem on (R⁽∞⁾, ℬ∞) is that this is the smoothest space on which we can always put a process having any given distribution. It has some topological properties which allow nice extension results to be proved.

The basic extension theorem we use is the analog in ℬ∞ of the extension of measures on the real line from their values on intervals. Let 𝒞 be the class of all finite-dimensional rectangles, and assume that P̂ is defined on 𝒞. A finite-dimensional rectangle may be written as a disjoint union of finite-dimensional rectangles, for instance,

Of course, we will insist that if a rectangle S is a finite union ⋃ⱼ Sⱼ of disjoint rectangles, then P̂(S) = Σⱼ P̂(Sⱼ). But an additional regularity condition is required, simply because the class of finite probabilities is much larger than the class of probabilities.

Extension Theorem 2.18. Let P̂ be defined on the class 𝒞 of all finite-dimensional rectangles and have the properties:

a) P̂ ≥ 0, P̂(R⁽∞⁾) = 1;
b) if S₁, …, Sₘ are disjoint n-dimensional rectangles and S = S₁ ∪ ⋯ ∪ Sₘ is a rectangle, then

P̂(S) = Σⱼ P̂(Sⱼ);


c) if {Sⱼ} is a nondecreasing sequence of n-dimensional rectangles and Sⱼ ↑ S, where S is a rectangle, then P̂(Sⱼ) ↑ P̂(S).

Then there is a unique extension of P̂ to a probability on ℬ∞.

Proof. As this result belongs largely to the realm of measure theory rather than probability, we relegate its proof to Appendix A.48.

Theorem 2.18 translates into probability language as: If probability is assigned in a reasonable way to rectangles, then there exists a process X₁, X₂, … such that P(X₁ ∈ I₁, …, Xₙ ∈ Iₙ) has the specified values.

If the probabilities are assigned to rectangles, then in order to be well defined, the assignment must be consistent. This means here that since an n-dimensional rectangle is also an (n + 1)-dimensional rectangle (take Iₙ₊₁ = R⁽¹⁾), its assignment as an (n + 1)-dimensional rectangle must agree with its assignment as an n-dimensional rectangle.

Now consider the situation in which the probability distributions of all finite collections of random variables in a process are specified. Specifically, probabilities Pₙ on ℬₙ, n = 1, 2, …, are given and P̂ is defined on the class of all finite-dimensional cylinder sets (2.17) by

P̂({x; (x₁, …, xₙ) ∈ B}) = Pₙ(B),  B ∈ ℬₙ.

In order for P̂ to be well defined, the Pₙ must be consistent: every n-dimensional cylinder set is also an (n + 1)-dimensional cylinder set and must be given the same probability by Pₙ and Pₙ₊₁.

Corollary 2.19 (Kolmogorov extension theorem). There is a unique extension of P̂ to a probability on ℬ∞.

Proof. P̂ is defined on the class 𝒞 of all finite-dimensional rectangles and is certainly a finite probability on 𝒞. Let Sⱼ*, S* be the rectangles in ℬₙ corresponding to the cylinder sets Sⱼ, S, so that Sⱼ* ↑ S*. Since Pₙ is a probability on ℬₙ, it is well behaved under monotone limits (1.27). Hence Pₙ(Sⱼ*) ↑ Pₙ(S*), and Theorem 2.18 is in force.

The extension requirements become particularly simple when the required process takes values only in a countable set.

Corollary 2.20. Let F ⊂ R⁽¹⁾ be a countable set. If p(s₁, …, sₙ) ≥ 0 is specified for all finite sequences of elements of F and satisfies

Σ_{s₁ ∈ F} p(s₁) = 1,  Σ_{sₙ₊₁ ∈ F} p(s₁, …, sₙ, sₙ₊₁) = p(s₁, …, sₙ),


then there exists a process X₁, X₂, … such that

P(X₁ = s₁, …, Xₙ = sₙ) = p(s₁, …, sₙ).

Proof. Let s denote an n-tuple with coordinates in F. For any B ∈ ℬₙ, define

Pₙ(B) = Σ_{s ∈ B} p(s).

It is easy to check that Pₙ is a finite probability on ℬₙ. Furthermore, if Bₖ ↑ B, then Pₙ(Bₖ) ↑ Pₙ(B). Thus we conclude (see 1.27) that Pₙ is a probability on ℬₙ. The Pₙ are clearly consistent. Now apply 2.19 to get the result.
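The hypotheses of Corollary 2.20 (the displayed conditions are partly elided above; the standard ones are nonnegativity, normalization, and consistency under summing out the last coordinate) can be checked mechanically for a concrete family. A sketch (ours) with a hypothetical two-state Markov chain as the source of p(s₁, …, sₙ):

```python
from itertools import product

# hypothetical two-state chain: initial law pi and transition matrix T
pi = {0: 0.5, 1: 0.5}
T = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}

def p(s):
    # p(s_1, ..., s_n): probability of the finite sequence s under the chain
    prob = pi[s[0]]
    for a, b in zip(s, s[1:]):
        prob *= T[a][b]
    return prob

for n in (1, 2, 3):
    # nonnegativity and normalization over all n-sequences
    total = sum(p(s) for s in product((0, 1), repeat=n))
    assert abs(total - 1) < 1e-12
    # consistency: summing out the last coordinate recovers p on n-sequences
    for s in product((0, 1), repeat=n):
        assert abs(sum(p(s + (t,)) for t in (0, 1)) - p(s)) < 1e-12
```

Any family passing these checks determines, by 2.20, a process with the prescribed finite-dimensional probabilities.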

The extension results in this section are a bit disquieting because, even though the results are purely measure-theoretic, the proofs in the space (R⁽∞⁾, ℬ∞) depend essentially on the topological properties of Euclidean spaces. This is in the nature of the problem. For example, if one has an infinite product (Ω⁽∞⁾, ℱ∞) of (Ω, ℱ) spaces, that is:

ℱ∞ the smallest σ-field containing all sets of the form

Pₙ a probability on all n-dimensional cylinder sets, and the set {Pₙ} consistent, then a probability P on ℱ∞ agreeing with Pₙ on n-dimensional cylinder sets may or may not exist. For a counterexample, see Jessen and Andersen [77] or Neveu [113, p. 84].

Problem 5. Take Ω = (0, 1],

Let ℱ₀ = ⋃_{n≥1} ℱ(X₁, …, Xₙ). Characterize the sets in ℱ₀, ℱ(ℱ₀). For

Prove that P is additive on ℱ₀, but that there is no extension of P to a probability on ℱ(ℱ₀).

5. DISTRIBUTION FUNCTIONS

What is needed to ensure that two processes have the same distribution?

Definition 2.21. Given a process {Xₙ} on (Ω, ℱ, P), define the n-dimensional distribution functions by

Fₙ(x₁, …, xₙ) = P(X₁ < x₁, …, Xₙ < xₙ).


The functions Fₙ(·) are real-valued functions defined on R⁽ⁿ⁾. Denote these at times by Fₙ(x) or, to make their dependence on the random variables explicit, by F_{X₁⋯Xₙ}(x₁, …, xₙ).

Theorem 2.22. Two processes have the same distribution iff all their distribution functions are equal.

The proof of 2.22 follows from a more general result that we want on the record.

Proposition 2.23. Let Q, Q′ be two probabilities on (Ω, ℱ). Let 𝒞 be a class of sets such that A, B ∈ 𝒞 ⇒ A ∩ B ∈ 𝒞, and ℱ = ℱ(𝒞). Then Q = Q′ on 𝒞 implies that Q = Q′ on ℱ.

There seems to be a common belief that 2.23 is true without the hypothesis that 𝒞 be closed under ∩. To disprove this, let Ω = {a, b, c, d}, Q₁(a) = Q₁(d) = Q₂(b) = Q₂(c) = 1/6 and Q₁(b) = Q₁(c) = Q₂(a) = Q₂(d) = 1/3; ℱ is the class of all subsets of Ω, and

Proof. Let ℱ₀(𝒞) be the smallest field containing 𝒞. By the unique extension theorem it suffices to show that Q = Q′ on ℱ₀(𝒞). Let 𝒟 be the smallest class of sets such that

Then 𝒟 = ℱ₀(𝒞). To see this, let 𝒰 be the class of sets A in 𝒟 such that A ∩ C ∈ 𝒟 for all C ∈ 𝒞. Then notice that 𝒰 satisfies (i), (ii), (iii) above, so 𝒰 = 𝒟. This implies that A ∩ C ∈ 𝒟 for all A ∈ 𝒟, C ∈ 𝒞. Now let ℰ be the class of sets E in 𝒟 such that A ∩ E ∈ 𝒟, all A ∈ 𝒟. Similarly, ℰ satisfies (i), (ii), (iii), so ℰ = 𝒟. This yields 𝒟 closed under ∩; but by (ii), (iii), 𝒟 is also closed under complementation, proving the assertion. Let 𝒮 be the class of sets G in ℱ such that Q(G) = Q′(G). Then 𝒮 satisfies (i), (ii), (iii) ⇒ 𝒟 ⊂ 𝒮, or ℱ₀(𝒞) ⊂ 𝒮.

Returning to the proof of 2.22: let P̂, P̂′ be defined on ℬ∞ by P̂(B) = P(X ∈ B), P̂′(B) = P′(X′ ∈ B), respectively. Let 𝒞 ⊂ ℬ∞ be the class of all sets of the form C = {x; x₁ < y₁, …, xₙ < yₙ}. Then clearly 𝒞 is closed under ∩, and ℱ(𝒞) = ℬ∞. Now P̂(C) = Fₙ(y₁, …, yₙ) and P̂′(C) = F′ₙ(y₁, …, yₙ), so that P̂ = P̂′ on 𝒞 by hypothesis. By 2.23, P̂ = P̂′ on ℬ∞.

Another proof of 2.22, which makes it more transparent, is as follows: For any function G(x₁, …, xₙ) on R⁽ⁿ⁾ and I an interval [a, b), a < b, x = (x₁, …, xₙ), write


By definition, since

Fₙ(x₁, …, xₙ) = P(X₁ < x₁, …, Xₙ < xₙ),

the probability of any rectangle {X₁ ∈ I₁, …, Xₙ ∈ Iₙ} with I₁, …, Iₙ left closed, right open, can be expressed in terms of Fₙ by

because, for I = [a, b),

By taking limits, we can now get the probabilities of all rectangles. From the extension theorem 2.18 we know that specifying P̂ on rectangles uniquely determines it.
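In two dimensions, the differencing that recovers rectangle probabilities from F₂ can be written out concretely. A sketch (ours, not the book's) for two independent uniform variables on [0, 1), using the mixed second difference F(b,d) − F(a,d) − F(b,c) + F(a,c):

```python
def clamp(u):
    return min(max(u, 0.0), 1.0)

def F2(x, y):
    # joint distribution function of two independent uniforms on [0, 1):
    # F_2(x, y) = P(X_1 < x, X_2 < y)
    return clamp(x) * clamp(y)

def rect_prob(a, b, c, d):
    # P(a <= X_1 < b, c <= X_2 < d) by the mixed second difference of F_2
    return F2(b, d) - F2(a, d) - F2(b, c) + F2(a, c)

# matches the area of the rectangle intersected with the unit square
assert abs(rect_prob(0.2, 0.5, 0.1, 0.7) - 0.3 * 0.6) < 1e-12
assert abs(rect_prob(-1.0, 0.5, 0.0, 2.0) - 0.5) < 1e-12
```

The same alternating-sign pattern, with 2ⁿ terms, gives rectangle probabilities in n dimensions.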

Frequently, the distribution of a process is specified by giving a set of distribution functions {Fₙ(x)}, n = 1, 2, …. But in order that {Fₙ(x)} be derived from a process {Xₙ} on a probability space (Ω, ℱ, P), they must have certain essential properties.

Proposition 2.25. The distribution functions Fₙ(x) satisfy the conditions:

i) Nonnegativity. For finite intervals I₁, …, Iₙ,

ii) Continuity from below. If x⁽ᵏ⁾ = (x₁⁽ᵏ⁾, …, xₙ⁽ᵏ⁾) and xⱼ⁽ᵏ⁾ ↑ xⱼ, j = 1, …, n, then Fₙ(x⁽ᵏ⁾) ↑ Fₙ(x).

iii) Normalization. All limits of Fₙ exist as the coordinates go to ±∞. If any xⱼ ↓ −∞, then Fₙ(x) → 0. If all xⱼ, j = 1, …, n, ↑ +∞, then Fₙ(x) → 1.

The set of distribution functions is connected by

iv) Consistency.

Fₙ₊₁(x₁, …, xₙ, xₙ₊₁) ↑ Fₙ(x₁, …, xₙ) as xₙ₊₁ ↑ +∞.

Proof. The proof of (i) follows from (2.24). To prove (ii), note that

Use the essential fact that probabilities behave nicely under monotone limits to get (ii). Use this same fact to prove (iii) and (iv); e.g., if all xⱼ⁽ᵏ⁾ ↑ +∞, then


Another important construction theorem verifies that the conditions of 2.25 characterize the distribution functions of a process.

Theorem 2.26. Given a set of functions {Fₙ(x)} satisfying 2.25 (i), (ii), (iii), (iv), there is a process {Xₙ} on (Ω, ℱ, P) such that

Fₙ(x₁, …, xₙ) = P(X₁ < x₁, …, Xₙ < xₙ).

Proof. The idea of how the proof should go is simple. Use Ω = R⁽∞⁾, ℱ = ℬ∞, and use the coordinate representation process X̃₁, X̃₂, …. We want to construct P on ℬ∞ such that if S ∈ ℬ∞ is a semi-infinite rectangle of the form

{x; x₁ < y₁, …, xₙ < yₙ},

then

P(S) = Fₙ(y₁, …, yₙ).

To construct P starting from Fₙ, define P̂ on rectangles whose sides are left closed, right open intervals I₁, …, Iₙ by

Extend this to all rectangles by taking limits. The consistency 2.25 (iv) guarantees that P̂ is well defined on all rectangles. All that is necessary to do now is to verify the conditions of 2.18. If Sⱼ, S are left closed, right open rectangles, and Sⱼ ↑ S, then the continuity from below of Fₙ, 2.25 (ii), yields

P̂(Sⱼ) ↑ P̂(S).

To verify the above for general rectangles, use the fact that their probabilities can be defined as limits of probabilities of left closed, right open rectangles.

The complication is in showing additivity of P̂ on rectangles. It is sufficient to show that

P̂(S) = Σⱼ P̂(Sⱼ)

for left closed, right open, disjoint rectangles S₁, …, Sₖ whose union is a rectangle S. In one dimension the statement P̂(S) = Σⱼ P̂(Sⱼ) follows from the obvious fact that for a = a₀ < a₁ < ⋯ < aₖ = b,

F(b) − F(a) = Σⱼ [F(aⱼ) − F(aⱼ₋₁)].

The general result is a standard theorem in the theory of the Stieltjes integral (McShane [111a, pp. 245–246]).

If a function F(x₁, …, xₙ) satisfies only the first three conditions of 2.25, then Theorem 2.26 implies the following.


Corollary 2.27. There are random variables X₁, …, Xₙ on a space (Ω, ℱ, P) such that

F(x₁, …, xₙ) = P(X₁ < x₁, …, Xₙ < xₙ).

Hence, any such function will be called an n-dimensional distribution function.

If a set {Fₙ}, n = 1, 2, …, of n-dimensional distribution functions satisfies 2.25 (iv), call them consistent. The specification of a consistent set of {Fₙ} is pretty much the minimum amount of data needed to completely specify the distribution of a process in the general case.

Problems

6. For any random variable X, let F_X(x) = P(X < x). The function F_X(x) is called the distribution function of the variable X. Prove that F_X(x) satisfies

7. If a function F(x, y) is nondecreasing in each variable separately, does this imply that F is a two-dimensional distribution function? Give an example of a function F(x, y) such that

iii) there are finite intervals I₁, I₂ such that

8. Let F₁(x), F₂(x), … be functions satisfying the conditions of Problem 6. Prove that the functions

Fₙ(x₁, …, xₙ) = F₁(x₁) ⋯ Fₙ(xₙ)

form a consistent set of distribution functions.

6. RANDOM VARIABLES

From now on, for reasons sufficient and necessary, we study random variables defined on a probability space. The sufficient reason is that the extension theorems state that, given a fairly reasonable assignment of probabilities, a process can be constructed fitting the specified data. The necessity is that most strong limit theorems require this kind of environment. Now we record a few facts regarding random variables and probability spaces.


Proposition 2.28. Let 𝒞 be a class of Borel sets such that ℱ(𝒞) = ℬ₁, and X a real-valued function on Ω. If {X ∈ C} ∈ ℱ for all C ∈ 𝒞, then X is a random variable on (Ω, ℱ).

Proof. Let 𝒟 ⊂ ℬ₁ be the class of all Borel sets D such that {X ∈ D} ∈ ℱ. 𝒟 is a σ-field. 𝒞 ⊂ 𝒟 ⇒ 𝒟 = ℬ₁.

Corollary 2.29. If {X ∈ I} ∈ ℱ for all intervals I, then X is a random variable.

At times functions come up which may be infinite on some parts of Ω but which are random variables on the subsets where they are finite.

Definition 2.30. An extended random variable X on (Ω, ℱ) may assume the values ±∞, but {X ∈ B} ∈ ℱ for all B ∈ ℬ₁.

Proposition 2.31. Let X be a random vector to (R, ℬ). If φ(x) is a random variable on (R, ℬ), then φ(X) is a random variable on (Ω, ℱ), measurable ℱ(X).

Proof. Write, for A ∈ ℬ₁,

{φ(X) ∈ A} = {X ∈ φ⁻¹(A)},

φ⁻¹(A) here denoting the inverse image of A under φ.

Definition 2.32. For random variables X₁, X₂, … on Ω, the σ-field of all events depending on the first n outcomes is the class of sets {(X₁, …, Xₙ) ∈ B}, B ∈ ℬₙ. Denote this by ℱ(X₁, …, Xₙ). The class of sets depending on only a finite number of outcomes is

ℱ₀ = ⋃ₙ ℱ(X₁, …, Xₙ).

In general, ℱ₀ is a field, but not a σ-field. But the fact that ℱ(ℱ₀) = ℱ(X) follows immediately from the definitions.

Proposition 2.33. Given a process X₁, X₂, …. For every set A₁ ∈ ℱ(X) and ε > 0, there is a set A₂ in some ℱ(X₁, …, Xₙ) such that

P(A₁ Δ A₂) < ε,

where A₁ Δ A₂ = (A₁ − A₂) ∪ (A₂ − A₁) is the symmetric set difference.

Proof. The proof of this is one of the standard results which cluster around the construction used in the Carathéodory extension theorem. The statement is that if P on ℱ(ℱ₀) is an extension of P on ℱ₀, then for every set A₁ ∈ ℱ(ℱ₀) and ε > 0, there is a set A₂ in the field ℱ₀ such that P(A₂ Δ A₁) < ε (see Appendix A.12). Then 2.33 follows because ℱ(X) is the smallest σ-field containing ℱ₀ = ⋃ₙ ℱ(X₁, …, Xₙ).


If all the random variables in a process X₁, X₂, … take values in a Borel set E ∈ ℬ₁, it may be more convenient to use the range space (E⁽∞⁾, ℬ∞(E)), where ℬ∞(E) consists of all sets in ℬ∞ which are subsets of E⁽∞⁾. For example, if X₁, X₂, … are coin-tossing variables, then each one takes values in {0, 1}, and the relevant (R, ℬ) for the process is ({0, 1}⁽∞⁾, ℬ∞({0, 1})).

If a random variable X has distribution function F(x), then P(X ∈ B) is a probability measure on ℬ₁ which is an extension of the measure on intervals [a, b) given by F(b) − F(a). Thus, use the notation:

Definition 2.34. For X a random variable, denote by P(X ∈ dx) or F(dx) the probability measure P(X ∈ B) on ℬ₁. Refer to F(dx) as the distribution of X.

Definition 2.35. A sequence X₁, X₂, … of random variables all having the same distribution F(dx) is called identically distributed. Similarly, call random vectors X₁, X₂, … with the same range space (R, ℬ) identically distributed if they have a common distribution P̂.

Problems

9. Show that ℬ∞({0, 1}) is the smallest σ-field containing all sets of the form

{x; x₁ = s₁, …, xₙ = sₙ},

where s₁, …, sₙ is any sequence n long of zeros and ones, n = 1, 2, …

10. Given a process X₁, X₂, … on (Ω, ℱ, P). Let m₁, m₂, … be positive integer-valued random variables on (Ω, ℱ, P). Prove that the sequence X_{m₁}, X_{m₂}, … is a process on (Ω, ℱ, P).

7. EXPECTATIONS OF RANDOM VARIABLES

Definition 2.36. Let X be a random variable on (Ω, ℱ, P). Define the expectation of X, denoted EX, by ∫ X(ω) dP(ω). This is well defined if E|X| < ∞. Alternative notations for the integral are

Definition 2.37. For any probability space (Ω, ℱ, P) define

i) if A ∈ ℱ, the set indicator χ_A(ω) is the random variable

χ_A(ω) = 1 if ω ∈ A, 0 otherwise;


ii) if X is a random variable, then X⁺, X⁻ are the random variables

X⁺ = max(X, 0),  X⁻ = −min(X, 0).

A number of results we prove in this and later sections depend on a principle we state as

Proposition 2.38. Consider a class 𝒞 of random variables having the properties:

iii) for every set A ∈ ℱ, the indicator χ_A is in 𝒞.

Then 𝒞 includes all nonnegative random variables on (Ω, ℱ, P).

Proof. See Appendix A.22.

This is used to prove

Proposition 2.39. Let the processes X on (Ω, ℱ, P), X′ on (Ω′, ℱ′, P′) have the same distribution. Then if φ(x) is measurable (R⁽∞⁾, ℬ∞),

(2.40)  Eφ(X) = E′φ(X′),

in the sense that if either side is well defined, so is the other, and the two are equal.

Proof. Consider all φ for which (2.40) is true. This class satisfies (i) and (ii) of 2.38. Further, let B ∈ ℬ∞ and let φ(x) = χ_B(x). Then the two sides of (2.40) become P(X ∈ B) and P′(X′ ∈ B), respectively. But these are equal since the processes have the same distribution. Hence (2.40) holds for all nonnegative φ. Thus, for any φ, it holds for |φ|, φ⁺, φ⁻.

Corollary 2.41. Define P̂(·) on ℬ∞ by P̂(B) = P(X ∈ B). Then if φ is measurable (R⁽∞⁾, ℬ∞) and E|φ(X)| < ∞,

Eφ(X) = ∫ φ(x) P̂(dx).

Proof. {Xₙ} on (Ω, ℱ, P) and the coordinate representation process {X̂ₙ} on (R⁽∞⁾, ℬ∞, P̂) have the same distribution. Thus, by 2.39, Eφ(X) = Êφ(X̂), and the latter expectation is the integral ∫ φ(x) P̂(dx).


8. CONVERGENCE OF RANDOM VARIABLES

Given a sequence of random variables {Xₙ}, there are various modes of convergence of Xₙ to a limiting random variable X.

Definition 2.42

i) Xₙ converges to X almost surely (a.s.) if P(ω; limₙ Xₙ(ω) = X(ω)) = 1. Denote this by Xₙ → X a.s.

ii) Xₙ converges to X in rth mean, for r > 0, if E|Xₙ − X|ʳ → 0. Denote this by Xₙ → X in rth mean.

iii) Xₙ converges in probability to X if for every ε > 0, P(|Xₙ − X| > ε) → 0. Denote this by Xₙ → X in probability.

The important things to notice are:

First, all these convergences are "probabilistic." That is, if X, X₁, . . . has the same distribution as X′, X′₁, . . . , then Xₙ → X in any of the above senses implies that X′ₙ → X′ in the same sense. This is obvious for convergence in rth mean and in probability; see Problem 12 for the almost sure case.

Secondly, Cauchy convergence in any one of these senses gives convergence.

Proposition 2.43. If Xₘ − Xₙ → 0 in any of the above senses as m, n → ∞ in any way, then there is a random variable X such that Xₙ → X in the same sense.

Proof. Do a.s. convergence first. For all ω such that Xₙ(ω) is Cauchy convergent, limₙ Xₙ(ω) exists. Hence P(ω; limₙ Xₙ(ω) exists) = 1. Let X(ω) = limₙ Xₙ(ω) for all ω such that the limit exists; otherwise put it equal to zero. Then Xₙ → X a.s. For the other modes of Cauchy convergence the proof is deferred until Section 3 of the next chapter (Problems 6 and 7).

Thirdly, of these various kinds of convergence, a.s. convergence is usually the hardest to establish and more or less the strongest. To get from a.s. convergence to convergence in rth mean, some sort of boundedness condition is necessary. Recall

Theorem 2.44 (Lebesgue bounded convergence theorem). If Yₙ → Y a.s. and if there is a random variable Z ≥ 0 such that EZ < ∞ and |Yₙ| ≤ Z for all n, then EYₙ → EY. (See Appendix A.28.)


Hence, using Yₙ = |Xₙ − X|ʳ in 2.44, we get that if Xₙ → X a.s. and the |Xₙ − X|ʳ are dominated by an integrable random variable, then Xₙ → X in rth mean.

Convergence in probability is the weakest. The implications go (2.46):

i) Xₙ → X a.s. implies Xₙ → X in probability;

ii) Xₙ → X in rth mean implies Xₙ → X in probability.

Problems

11. Prove (2.46 i and ii). [Use a generalization of Chebyshev's inequality on (ii).]

12. Let {Xₙ}, {X′ₙ} have the same distribution. Prove that if Xₙ → X a.s., there is a random variable X′ such that X′ₙ → X′ a.s.

13. For a process {Xₙ} prove that the set {ω; limₙ Xₙ(ω) does not exist} is an event (i.e., is in ℱ).

14. Prove that for X a random variable with E|X| < ∞, Aₙ ∈ ℱ, Aₙ ↓ ∅, implies ∫_{Aₙ} X dP → 0.

NOTES

The use of a probability space (Ω, ℱ, P) as a context for probability theory was formalized by Kolmogorov [98] in a monograph published in 1933. But, as Kolmogorov pointed out, the concept had already been current for some time.

Subsequent work in probability theory has proceeded, almost without exception, from this framework. There has been controversy about the correspondence between the axioms for a probability space and more primitive intuitive notions of probability. A different approach, in which the probability of an event is defined as its asymptotic frequency, is given by von Mises [112]. The argument can go on at several levels. At the top is the contention that although it seems reasonable to assume that P is a finite probability, there are no strong intuitive grounds for assuming it σ-additive. Thus, in their recent book [40] Dubins and Savage assume only finite additivity, and even within this weaker framework prove interesting limit theorems. One level down is the question of whether a probability measure P need be additive at all. The more basic property is argued to be: A ⊂ B ⇒ P(A) ≤ P(B). But, as always, with weaker assumptions fewer nontrivial theorems can be proven.


At the other end, it happens that some σ-fields have so many sets in them that examples occur which disturb one's intuitive concept of probability. Thus, there has been some work in the direction of restricting the type of σ-field to be considered. An interesting article on this is by Blackwell [8].

For a more detailed treatment of measure theory than is possible in Appendix A, we recommend the books of Neveu [113], Loeve [108], and Halmos [64].


CHAPTER 3

INDEPENDENCE

Independence, or some form of it, is one of the central concepts of probability, and it is largely responsible for the distinctive character of probability theory.

1. BASIC DEFINITIONS AND RESULTS

Definition 3.1

(a) Given random variables X₁, . . . , Xₙ on (Ω, ℱ, P), they are said to be independent if for any sets B₁, . . . , Bₙ ∈ ℬ₁,

P(X₁ ∈ B₁, . . . , Xₙ ∈ Bₙ) = P(X₁ ∈ B₁) · · · P(Xₙ ∈ Bₙ).

(b) Given a probability space (Ω, ℱ, P) and σ-fields ℱ₁, . . . , ℱₙ contained in ℱ, they are said to be independent if for any sets A₁ ∈ ℱ₁, . . . , Aₙ ∈ ℱₙ,

P(A₁ ∩ · · · ∩ Aₙ) = P(A₁) · · · P(Aₙ).

Obviously, X₁, . . . , Xₙ are independent random variables iff ℱ(X₁), . . . , ℱ(Xₙ) are independent σ-fields.

These definitions have immediate generalizations to random vectors.

Definition 3.2. Random vectors X₁, X₂, . . . , Xₙ are said to be independent if the σ-fields ℱ(X₁), ℱ(X₂), . . . , ℱ(Xₙ) are independent.

Virtually all the results of this section stated for independent random variables hold for independent random vectors. But as the generalization is so apparent, we usually omit it.

Definition 3.3. The random variables X₁, X₂, . . . are called independent if for every n ≥ 2, the random variables X₁, X₂, . . . , Xₙ are independent.

Proposition 3.4. Let X₁, X₂, . . . be independent random variables and B₁, B₂, . . . any sets in ℬ₁. Then

P(X₁ ∈ B₁, X₂ ∈ B₂, . . .) = ∏ₙ₌₁^∞ P(Xₙ ∈ Bₙ).



Proof. Let Aₙ = {X₁ ∈ B₁, . . . , Xₙ ∈ Bₙ}. Then the Aₙ are decreasing, hence P(⋂ₙ Aₙ) = limₙ P(Aₙ). But ⋂ₙ Aₙ = {X₁ ∈ B₁, X₂ ∈ B₂, . . .}, and by independence, P(Aₙ) = P(X₁ ∈ B₁) · · · P(Xₙ ∈ Bₙ).

Note that the same result holds for independent σ-fields.

Proposition 3.5. Let X₁, X₂, . . . be independent random variables, (i₁, i₂, . . .), (j₁, j₂, . . .) disjoint sets of integers. Then the σ-fields

ℱ₁ = ℱ(X_{i₁}, X_{i₂}, . . .), ℱ₂ = ℱ(X_{j₁}, X_{j₂}, . . .)

are independent.

Proof. Consider any set D ∈ ℱ₂ of the form D = {X_{j₁} ∈ B₁, . . . , X_{j_m} ∈ B_m}, Bₖ ∈ ℬ₁, k = 1, . . . , m. Define two measures Q₁ and Q′₁ on ℱ₁ by, for A ∈ ℱ₁,

Q₁(A) = P(A ∩ D), Q′₁(A) = P(A)P(D).

Consider the class C of sets of the form {X_{i₁} ∈ B′₁, . . . , X_{i_n} ∈ B′_n}. Note that Q₁ = Q′₁ on C by the independence of the Xₙ. Thus Q₁ = Q′₁ on C, C is closed under ∩, ℱ(C) = ℱ₁ ⇒ Q₁ = Q′₁ on ℱ₁ (see 2.23). Now repeat the argument. Fix A ∈ ℱ₁ and define Q₂(·), Q′₂(·) on ℱ₂ by P(A ∩ ·), P(A)P(·). By the preceding, for any D of the form given above, Q₂(D) = Q′₂(D), implying Q₂ = Q′₂ on ℱ₂, and thus for any A₁ ∈ ℱ₁, A₂ ∈ ℱ₂, P(A₁ ∩ A₂) = P(A₁)P(A₂).

Corollary 3.6. Let X₁, X₂, . . . be independent random variables, J₁, J₂, . . . disjoint sets of integers. Then the σ-fields ℱₖ = ℱ({Xⱼ}, j ∈ Jₖ) are independent.

Proof. Proceed by induction: assume that ℱ₁, . . . , ℱₙ are independent. Let J′ = J₁ ∪ · · · ∪ Jₙ and ℱ′ = ℱ({Xⱼ}, j ∈ J′).


Since J′ and J_{n+1} are disjoint, ℱ′ and ℱ_{n+1} satisfy the conditions of Proposition 3.5 above. Let A₁ ∈ ℱ₁, . . . , Aₙ ∈ ℱₙ. Then, since ℱₖ ⊂ ℱ′, k = 1, . . . , n, the set A′ = A₁ ∩ · · · ∩ Aₙ is in ℱ′. Let A_{n+1} ∈ ℱ_{n+1}. By 3.5,

P(A₁ ∩ · · · ∩ Aₙ ∩ A_{n+1}) = P(A′)P(A_{n+1}) = P(A₁) · · · P(Aₙ)P(A_{n+1}),

by the induction hypothesis.

From 3.5 and 3.6 we extract more concrete and interesting consequences. For instance, 3.5 implies that the fields ℱ(X₁, . . . , Xₙ) and ℱ(X_{n+1}, . . .) are independent. As another example, if φ₁, φ₂, . . . are measurable (R⁽ᵐ⁾, ℬ_m), then the random variables φ₁(X₁, . . . , X_m), φ₂(X_{m+1}, . . . , X_{2m}), . . . are independent. Another way of stating 3.6 is to say that the random vectors Xₖ = ({Xⱼ}, j ∈ Jₖ) are independent.

How and when do we get independent random variables?

Theorem 3.7. A necessary and sufficient condition for X₁, X₂, . . . to be independent random variables is that for every n and every n-tuple (x₁, . . . , xₙ),

P(X₁ ≤ x₁, . . . , Xₙ ≤ xₙ) = P(X₁ ≤ x₁) · · · P(Xₙ ≤ xₙ).

Proof. It is obviously necessary: consider the sets {Xₖ ≤ xₖ}. To go the other way, we want to show that for arbitrary B₁, . . . , Bₙ ∈ ℬ₁,

P(X₁ ∈ B₁, . . . , Xₙ ∈ Bₙ) = P(X₁ ∈ B₁) · · · P(Xₙ ∈ Bₙ).

Fix x₂, . . . , xₙ and define two σ-additive measures Q and Q′ on ℬ₁ by

Q(B) = P(X₁ ∈ B, X₂ ≤ x₂, . . . , Xₙ ≤ xₙ), Q′(B) = P(X₁ ∈ B)P(X₂ ≤ x₂) · · · P(Xₙ ≤ xₙ).

Now on all sets of the form (−∞, x), Q and Q′ agree, implying that Q = Q′ on ℬ₁. Repeat this by fixing B₁ ∈ ℬ₁, x₃, . . . , xₙ and defining

Q₂(B) = P(X₁ ∈ B₁, X₂ ∈ B, X₃ ≤ x₃, . . .), Q′₂(B) = P(X₁ ∈ B₁)P(X₂ ∈ B)P(X₃ ≤ x₃) · · · ,

so Q₂ = Q′₂ on the sets (−∞, x), hence Q₂ = Q′₂ on ℬ₁, and continue on down.

The implication of this theorem is that if we have any one-dimensional distribution functions F₁(x), F₂(x), . . . and we form the consistent set of distribution functions (see Problem 8, Chapter 2) F₁(x₁) · · · Fₙ(xₙ), then


any resulting process X₁, X₂, . . . having these distribution functions consists of independent random variables.

Proposition 3.8. Let X and Y be independent random variables, f and g ℬ₁-measurable functions such that E|f(X)| < ∞, E|g(Y)| < ∞. Then

E|f(X)g(Y)| < ∞

and

Ef(X)g(Y) = Ef(X) · Eg(Y).

Proof. For any set A ∈ ℬ₁ take f(x) = χ_A(x), and consider the class C of nonnegative functions g(y) for which the equality in 3.8 holds. C is closed under linear combinations, and by the Lebesgue monotone convergence theorem applied to both sides of the equation in 3.8, C is closed under monotone limits. If B ∈ ℬ₁ and g(y) = χ_B(y), then the equation becomes

P(X ∈ A, Y ∈ B) = P(X ∈ A)P(Y ∈ B),

which holds by independence of X and Y. By 2.38, C includes all nonnegative ℬ₁-measurable g. Now fix g and apply the same argument to f to conclude that 3.8 is valid for all nonnegative f and g.

For general f and g, note that 3.8 holding for nonnegative f and g implies E|f(X)g(Y)| = [E|f(X)|][E|g(Y)|], so integrability of f(X) and g(Y) implies that of f(X)g(Y). By writing f = f⁺ − f⁻, g = g⁺ − g⁻ we obtain the general result.

Note that if X and Y are independent, then so are the random variables f(X) and g(Y). So actually the above proposition is no more general than the statement: Let X and Y be independent random variables, E|X| < ∞, E|Y| < ∞; then E|XY| < ∞ and EXY = EX · EY.

By induction, we get

Corollary 3.9. Let X₁, . . . , Xₙ be independent random variables such that E|Xₖ| < ∞, k = 1, . . . , n. Then

E(X₁ · · · Xₙ) = EX₁ · · · EXₙ.

Proof. Follows from 3.8 by induction.
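Corollary 3.9 is easy to sanity-check on a finite product space. The following sketch (not from the text; the two discrete distributions below are arbitrary) verifies EXY = EX · EY with exact rational arithmetic:

```python
from fractions import Fraction as F
from itertools import product

# Finite check of Proposition 3.8 / Corollary 3.9 (illustrative only).
# For independent discrete X and Y the joint probability of (x, y) is
# the product p*q, so the expectation of XY factors exactly.
X = [(F(-1), F(1, 4)), (F(0), F(1, 4)), (F(2), F(1, 2))]  # (value, prob)
Y = [(F(1), F(1, 3)), (F(5), F(2, 3))]

EX = sum(x * p for x, p in X)
EY = sum(y * q for y, q in Y)
EXY = sum(x * y * p * q for (x, p), (y, q) in product(X, Y))
```

With exact rationals the identity EXY = EX · EY holds exactly, not merely to rounding error.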

Problems

1. Let ℱ₁, ℱ₂ be independent σ-fields. Show that if a set A is both in ℱ₁ and ℱ₂, then P(A) = 0 or 1.

2. Use Fubini's theorem (Appendix A.37) to show that for X and Y independent random variables

a) for any B ∈ ℬ₁, P(X ∈ B − y) is a ℬ₁-measurable function of y,

b) P(X + Y ∈ B) = ∫ P(X ∈ B − y) P(Y ∈ dy).


2. TAIL EVENTS AND THE KOLMOGOROV ZERO-ONE LAW

Consider the set E again, on which Sₙ/n does not converge to ½ for fair coin-tossing. As pointed out, this set has the odd property that whether or not ω ∈ E does not depend on the first n coordinates of ω no matter how large n is. Sets which have this fascinating property we call tail events.

Definition 3.10. Let X₁, X₂, . . . be any process. A set E ∈ ℱ(X) will be called a tail event if E ∈ ℱ(Xₙ, X_{n+1}, . . .), all n. Equivalently, let 𝒯 be the σ-field ⋂ₙ₌₁^∞ ℱ(Xₙ, X_{n+1}, . . .); then 𝒯 is called the tail σ-field and any set E ∈ 𝒯 is called a tail event.

This definition may seem formidable, but it captures formally the sense in which certain events do not depend on any finite number of their coordinates. For example,

E = {ω; limₙ (X₁ + · · · + Xₙ)/n exists}

is a tail event. Because for any k,

E = {ω; limₙ (Xₖ + · · · + Xₙ)/n exists},

hence E ∈ ℱ(Xₖ, X_{k+1}, . . .) for all k ≥ 1, ⇒ E ∈ 𝒯. An important class of tail events is given as follows:

Definition 3.11. Let X₁, X₂, . . . be any process, B₁, B₂, . . . Borel sets. The set Xₙ in Bₙ infinitely often, denoted {Xₙ ∈ Bₙ i.o.}, is the set {ω; Xₙ(ω) ∈ Bₙ occurs for an infinite number of n}. Equivalently,

{Xₙ ∈ Bₙ i.o.} = ⋂ₙ₌₁^∞ ⋃ₘ₌ₙ^∞ {Xₘ ∈ Bₘ}.

It is fairly apparent that for many strong limit theorems the events involved will be tail. Hence it is most gratifying that the following theorem is in force.

Theorem 3.12 (Kolmogorov zero-one law). Let X₁, X₂, . . . be independent random variables. Then if E ∈ 𝒯, P(E) is either zero or one.

Proof. E ∈ ℱ(X). By 2.33, there are sets Eₙ ∈ ℱ(X₁, . . . , Xₙ) such that P(Eₙ Δ E) → 0. This implies P(Eₙ) → P(E) and P(Eₙ ∩ E) → P(E). But E ∈ ℱ(X_{n+1}, X_{n+2}, . . .), hence E and Eₙ are in independent σ-fields. Thus P(Eₙ ∩ E) = P(Eₙ)P(E). Taking limits in this latter equation gives

P(E) = P(E)².

The only solutions of x = x² are x = 0 or 1. Q.E.D.


This is really a heart-warming result. It puts us into secure business with strong limit theorems for independent random variables involving tail events. Either the theorem holds true for almost all ω ∈ Ω or it fails almost surely.

Problems

3. Show that {Xₙ ∈ Bₙ i.o.} is a tail event.

4. In the coin-tossing game let s be any sequence m-long of zeros or ones. Let Zₙ be the vector (X_{n+1}, . . . , X_{n+m}), and F the set {Zₙ = s i.o.}. Show that F ∈ 𝒯.

5. (The random signs problem.) Let cₙ be any sequence of real numbers. In the fair coin-tossing game let Yₙ = ±1 as the nth toss is H or T. Let D = {ω; Σₙ cₙYₙ converges}; show that D ∈ 𝒯.

3. THE BOREL-CANTELLI LEMMA

Every tail event has probability zero or one. Now the important question is: how to decide which is which. The Borel-Cantelli lemma is a most important step in that direction. It applies to a class of events which includes many tail events, but it also has other interesting applications.

Definition 3.13. In (Ω, ℱ, P), let Aₙ ∈ ℱ. The set {Aₙ i.o.} is defined as {ω; ω ∈ Aₙ for an infinite number of n}, or equivalently

{Aₙ i.o.} = ⋂ₙ₌₁^∞ ⋃ₘ₌ₙ^∞ Aₘ.

Borel-Cantelli Lemma 3.14

I. The direct half. If Aₙ ∈ ℱ, then Σ₁^∞ P(Aₙ) < ∞ implies P(Aₙ i.o.) = 0.

To state the second part of the Borel-Cantelli lemma we need

Definition 3.15. Events A₁, A₂, . . . in (Ω, ℱ, P) will be called independent events if the random variables χ_{A₁}, χ_{A₂}, . . . are independent (see Problem 8).

II. The converse half. If Aₙ ∈ ℱ are independent events, then Σ₁^∞ P(Aₙ) = ∞ implies P(Aₙ i.o.) = 1.

Proof of I. Since {Aₙ i.o.} ⊂ ⋃ₘ₌ₙ^∞ Aₘ for every n,

P(Aₙ i.o.) ≤ Σₘ₌ₙ^∞ P(Aₘ).

But obviously Σ₁^∞ P(Aₙ) < ∞ implies that Σₘ₌ₙ^∞ P(Aₘ) → 0 as n → ∞.

Proof of II. It is enough to show that P(⋃ₘ₌ₙ^∞ Aₘ) = 1 for every n. Because the events {Aₙ} are independent,

P(⋂ₘ₌ₙ^N Aₘᶜ) = ∏ₘ₌ₙ^N (1 − P(Aₘ)).

Use the inequality log (1 − x) ≤ −x to get

∏ₘ₌ₙ^N (1 − P(Aₘ)) ≤ exp(−Σₘ₌ₙ^N P(Aₘ)) → 0 as N → ∞.

Application 1. In coin-tossing, let s be any sequence k-long of H, T. Let

Aₙ = {ω; (ωₙ, . . . , ω_{n+k−1}) = s}, 0 < P(Heads) < 1.

Proposition 3.16. P(An i.o.) = 1.

Proof. Let B₁ = {ω; (ω₁, . . . , ωₖ) = s}, B₂ = {ω; (ω_{k+1}, . . . , ω_{2k}) = s}, . . . The difficulty is that the Aₙ are not independent events because of the overlap, for instance, between A₁ and A₂; but the Bₙ are independent, and {Bₙ i.o.} ⊂ {Aₙ i.o.}. Now P(Bₙ) = P(B₁) > 0, so Σ P(Bₙ) = ∞, implying by 3.14(II) that P(Bₙ i.o.) = 1.

Another way of putting this proposition is that in coin-tossing (biased or not), given any finite sequence of H, T's, this sequence will occur an infinite number of times as the tossing continues, except on a set of sequences of probability zero.
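This proposition lends itself to a quick Monte Carlo illustration (not from the text; the seed, pattern, and run length are arbitrary choices): along one long run of simulated fair tosses a fixed pattern keeps reappearing, and the disjoint-block events Bₙ used in the proof already account for a positive fraction of the occurrences.

```python
import random

# Simulated fair coin-tossing: count how often a fixed pattern occurs
# along one run, both anywhere (overlapping events A_n) and on disjoint
# blocks (the independent events B_n from the proof of 3.16).
random.seed(0)
k, pattern = 3, ('H', 'T', 'H')
n_tosses = 60_000
tosses = [random.choice('HT') for _ in range(n_tosses)]

# occurrences of the pattern anywhere (the overlapping events A_n)
occurrences = sum(
    tuple(tosses[i:i + k]) == pattern for i in range(n_tosses - k + 1))

# occurrences on disjoint blocks of length k (the independent B_n)
block_hits = sum(
    tuple(tosses[i:i + k]) == pattern for i in range(0, n_tosses - k + 1, k))
```

Both counts grow roughly linearly in the number of tosses (expected rates 1/8 per position and 1/8 per block here), in keeping with the pattern occurring infinitely often a.s.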

Application 2. Again, in coin-tossing, let Yᵢ = ±1 as the ith toss is H or T, Zₙ = Y₁ + · · · + Yₙ. If Zₙ = 0, we say that an equalization (or return to the origin) takes place at time n. Let Aₙ = {Zₙ = 0}. Then {Aₙ i.o.} = {ω; an infinite number of equalizations occur}.

Proposition 3.17. If P(Heads) ≠ ½, then P(Zₙ = 0 i.o.) = 0.

Proof. Immediate, from the Borel-Cantelli lemma and the asymptotic expression for P(Zₙ = 0).

Another statement of 3.17 is that in biased coin-tossing, as we continue tossing, we eventually come to a last equalization, and past this toss there are no more equalizations. What if the coin is fair?

Theorem 3.18. For a fair coin, P(Zₙ = 0 i.o.) = 1.

Proof. The difficulty, of course, is that the events Aₙ = {Zₙ = 0} are not independent, so 3.14 is not directly applicable. In order to get around this, we manufacture a most pedestrian proof, which is typical of the way in which the Borel-Cantelli lemma is stretched out to cover cases of nonindependent events.

The idea of the proof is this: we want to apply the converse part of the Borel-Cantelli lemma, but in order to do this we can look only at random variables related to disjoint stretches of tosses. That is, if we

Page 60: Probability

3.3 THE BOREL-CANTELLI LEMMA 43

consider a subsequence n₁ < n₂ < n₃ < · · · of the integers, then any events {Cₖ} such that each Cₖ depends only on {Y_{nₖ+1}, Y_{nₖ+2}, . . . , Y_{n_{k+1}}} are independent events to which the Borel-Cantelli lemma applies. Suppose, for instance, that we select nₖ < mₖ < n_{k+1} and define

Cₖ = {ω; Z_{mₖ} − Z_{nₖ} ≤ −nₖ, Z_{n_{k+1}} − Z_{mₖ} ≥ mₖ}.

The purpose of defining Cₖ this way is that we know

|Z_{nₖ}| ≤ nₖ,

because each Yᵢ is ±1. Hence ω ∈ Cₖ ⇒ Z_{mₖ} ≤ 0. Again Z_{mₖ} ≥ −mₖ, so, in addition,

ω ∈ Cₖ ⇒ Z_{n_{k+1}} ≥ 0.

Therefore {ω ∈ Cₖ} ⇒ {Zₙ = 0 at least once for nₖ + 1 ≤ n ≤ n_{k+1}}. We have used here a standard trick in probability theory of considering stretches n₁, n₂, . . . so far apart that the effect of what happened previously to nₖ

is small as compared to the amount that Zₙ can change between nₖ and n_{k+1}. Now {Cₖ i.o.} ⊂ {Zₙ = 0 i.o.}, so we need only to prove now that the nₖ, mₖ can be selected in such a way that Σₖ P(Cₖ) = ∞.

Assertion: Given any number α, 0 < α < ½, and integer k ≥ 1, there is an integer φ(k) ≥ 1 such that

P(Z_{φ(k)} ≤ −k) ≥ α.

Proof. We know that for any fixed j, P(Zₙ = j) → 0 as n → ∞. Hence for k fixed, as n → ∞, P(Zₙ ≤ −k) → ½. Simply take φ(k) sufficiently large so that P(Z_{φ(k)} ≤ −k) ≥ α.

Define nₖ, mₖ as follows: n₁ = 1, mₖ = nₖ + φ(nₖ), n_{k+1} = mₖ + φ(mₖ). Compute P(Cₖ) as follows: by symmetry, P(Zₙ ≥ j) = P(Zₙ ≤ −j). Thus, since the distribution of the vector (Y_{i+1}, . . . , Y_{i+j}) is the same as that of (Y₁, . . . , Yⱼ),

P(Cₖ) = P(Z_{φ(nₖ)} ≤ −nₖ)P(Z_{φ(mₖ)} ≥ mₖ) ≥ α² > 0,

so Σₖ P(Cₖ) = ∞ and 3.14(II) applies.

This proof is a bit of a mess. Now let me suggest a much more exciting possibility. Suppose we can prove that P(Zₙ = 0 at least once) = 1. Now every time there is an equalization, everything starts all over again. That is, if Zₙ = 0, then the game starts from the (n + 1)st toss as though it were beginning at n = 0. Consequently, we are sure now to have at least one more equalization. Continue this argument now ad infinitum to conclude that P(Zₙ = 0 at least once) = 1 ⇒ P(Zₙ = 0 i.o.) = 1. We make this argument hold water when 3.18 is generalized in Section 7, and generalize it again in Chapter 7.

Problems

6. Show, by using 3.14, that Xₙ → X in probability ⇒ there is a subsequence {X_{nₖ}} such that X_{nₖ} → X a.s.

7. Show, using 3.14, that if Xₙ − Xₘ → 0 in probability, there is a random variable X such that Xₙ → X in probability. [Hint: Take εₖ ↓ 0 and nₖ such that for m, n ≥ nₖ, P(|Xₙ − Xₘ| > εₖ) ≤ 2⁻ᵏ. Now prove that there is a random variable X such that X_{nₖ} → X a.s.]

8. In order that events A₁, A₂, . . . be independent, show it is sufficient that

P(A_{i₁} ∩ · · · ∩ A_{i_m}) = P(A_{i₁}) · · · P(A_{i_m})

for every finite subcollection A_{i₁}, . . . , A_{i_m}. [One interesting approach to the required proof is: Let 𝒟 be the smallest field containing A₁, . . . , A_N. Define Q on 𝒟 by Q(B₁ ∩ · · · ∩ B_N) = P(B₁) · · · P(B_N), where the sets Bₖ are equal to Aₖ or Aₖᶜ. Use P(A_{i₁} ∩ · · · ∩ A_{i_m}) = P(A_{i₁}) · · · P(A_{i_m}) to show that P = Q on a class of sets to which 2.23 can be applied. Conclude that P = Q on 𝒟.]

9. Use the strong law of large numbers in the form Sₙ/n → p a.s. to prove 3.17.

10. Let X₁, X₂, . . . be independent identically distributed random variables. Prove that E|X₁| < ∞ if and only if Σₙ₌₁^∞ P(|X₁| > n) < ∞.

(See Loeve [108, p. 239].)


4. THE RANDOM SIGNS PROBLEM

In Problem 5, it is shown for Y₁, Y₂, . . . independent +1 or −1 with probability ½ each, that the set {ω; Σ₁^∞ cₖYₖ converges} is a tail event. Therefore it has probability zero or one. The question now is to characterize the sequences {cₙ} such that Σ₁^∞ cₖYₖ converges a.s.

This question is naturally arrived at when you look at the sequence 1/n; that is, Σ 1/n diverges, but Σ (−1)ⁿ/n converges. Now what happens if the signs are chosen at random? In general, look at the consecutive sums Σ₁ⁿ Xₖ of any sequence X₁, X₂, . . . of independent random variables. The convergence set is again a tail event. When does it have probability one?

The basic result here is that in this situation convergence in probability implies the much stronger convergence almost surely.

Theorem 3.19. For X₁, X₂, . . . independent random variables,

Σ₁ⁿ Xₖ converges in probability ⇒ Σ₁ⁿ Xₖ converges a.s.  (3.20)

Proof. Proceeds by an important lemma which is due to Skorokhod [125].

Lemma 3.21. Let S₁, . . . , S_N be successive sums of independent random variables such that sup_{j≤N} P(|S_N − Sⱼ| > α) = c < 1. Then

P(max_{j≤N} |Sⱼ| > 2α) ≤ (1/(1 − c)) P(|S_N| > α).

Proof. Let j*(ω) = first j ≤ N such that |Sⱼ| > 2α. Then

P(max_{j≤N} |Sⱼ| > 2α) = Σⱼ₌₁^N P(j* = j) ≤ (1/(1 − c)) Σⱼ₌₁^N P(j* = j)P(|S_N − Sⱼ| ≤ α).

The set {j* = j} is in ℱ(X₁, . . . , Xⱼ), and S_N − Sⱼ is measurable ℱ(X_{j+1}, . . . , X_N), so the last sum on the right above equals

Σⱼ₌₁^N P(j* = j, |S_N − Sⱼ| ≤ α).


The observation that

{j* = j, |S_N − Sⱼ| ≤ α} ⊂ {|S_N| > α},

since |Sⱼ| > 2α on {j* = j}, completes the proof.

To finish the proof of 3.19: If a sequence sₙ of real numbers does not converge, then there exists an ε > 0 such that for every m,

sup_{n>m} |sₙ − sₘ| > ε.

So if Σ₁ⁿ Xₖ diverges with positive probability, then there exist ε > 0 and δ > 0 such that for every m fixed,

P(sup_{n>m} |Sₙ − Sₘ| > ε) ≥ δ,

where Sₙ = X₁ + · · · + Xₙ.

If Σ₁ⁿ Xₖ is convergent in probability, then

sup_{m<j≤N} P(|S_N − Sⱼ| > ε/2) = c_{m,N} → 0 as m, N → ∞.

Hence, applying 3.21 to the sums Sₙ − Sₘ, m < n ≤ N, we find that

δ ≤ P(max_{m<n≤N} |Sₙ − Sₘ| > ε) ≤ P(|S_N − Sₘ| > ε/2)/(1 − c_{m,N}).

Taking first N → ∞, then m → ∞, conclude δ ≤ 0.

This contradiction proves the theorem.

We can use convergence in second mean to get an immediate criterion.

Corollary 3.22. If EXₖ = 0, all k, and Σ₁^∞ EXₖ² < ∞, then the sums Σ₁ⁿ Xₖ converge a.s.

In particular, for the random signs problem, mentioned at the beginning of this section, the following corollary holds.


Corollary 3.23. A sufficient condition for the sums Σ₁ⁿ cₖYₖ to converge a.s. is Σ₁^∞ cₖ² < ∞.

The open question is necessity. Marvelously enough, the converse of 3.23 is true, so that Σ cₖYₖ converges a.s. if and only if Σ cₖ² < ∞. In fact, a partial converse of 3.22 holds.

Theorem 3.24. Let X₁, X₂, . . . be independent random variables such that EXₖ = 0 and |Xₖ| ≤ α < ∞, all k. Then Σ₁ⁿ Xₖ converges a.s. implies Σ₁^∞ EXₖ² < ∞.

Proof. For any λ > 0, define n*(ω) by

n*(ω) = 1st n such that |Sₙ| > λ,

n*(ω) = ∞ if no such n exists,

where n* is an extended random variable. For any integers j ≤ N, look at

Since {n* = j} ∈ ℱ(X₁, . . . , Xⱼ), then by independence, and EXₖ = 0, all k,

And, by independence, for

Using these,

Sum on j from 1 up to N to get

Also,


Adding this to the above inequality we get

or

Letting N → ∞, we find that

But, since Σ₁ⁿ Xₖ converges a.s., there must exist a λ such that

P(supₙ |Sₙ| ≤ λ) > 0,

implying P(n* = ∞) > 0 and Σ₁^∞ EXₖ² < ∞.

The results of 3.24 can be considerably sharpened. But why bother; elegant necessary and sufficient conditions exist for the convergence of sums Σ₁ⁿ Xₖ where the only assumption made is that the Xₖ are independent. This is the "three-series" theorem of Kolmogorov (see Loeve [108, p. 237]). More on this will appear in Chapter 9. Kac [82] has interesting analytic proofs of 3.23 and its converse.
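The criterion Σ cₖ² < ∞ can be watched at work numerically. Below is a Monte Carlo sketch (not from the text; the seed, sample sizes, and coefficient choices are arbitrary) comparing cₖ = 1/k, where Σ cₖ² < ∞, with cₖ = 1/√k, where it diverges:

```python
import random

# Random signs: with c_k = 1/k the series sum c_k Y_k should converge
# a.s. (Corollary 3.23), so simulated partial-sum paths barely move
# beyond a large index; with c_k = 1/sqrt(k) the sum of squares
# diverges and the paths keep wandering.
random.seed(1)

def tail_oscillation(coeff, n, tail_from):
    """max |S_j - S_tail_from| over tail_from <= j <= n, one sample path."""
    s, base, osc = 0.0, 0.0, 0.0
    for k in range(1, n + 1):
        s += random.choice((-1.0, 1.0)) * coeff(k)
        if k == tail_from:
            base = s
        elif k > tail_from:
            osc = max(osc, abs(s - base))
    return osc

n, tail_from = 20_000, 10_000
osc_convergent = [tail_oscillation(lambda k: 1.0 / k, n, tail_from)
                  for _ in range(10)]
osc_divergent = [tail_oscillation(lambda k: 1.0 / k ** 0.5, n, tail_from)
                 for _ in range(10)]
```

The tail oscillation beyond index 10,000 is on the order of the tail standard deviation √(Σ_{k>10000} cₖ²), which is tiny in the first case and of order one in the second.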

Problems

11. Let X₁, X₂, . . . be independent, and Xₖ ≥ 0. If for some δ, 0 < δ < 1, there exists an x such that

for all k, show that

Give an example to show that in general, X₁, X₂, . . . independent nonnegative random variables and convergence a.s. does not imply that


12. Let Y₁, Y₂, . . . be a process. We will say that the integer-valued random variables m₁, m₂, . . . are optional skipping variables if m₁ < m₂ < · · · and {mₖ = j} ∈ ℱ(Y₁, . . . , Y_{j−1}) (i.e., the decision as to which game to play next depends only on the previous outcomes). Denote Ŷₖ = Y_{mₖ}. Show that

a) If the Y₁, Y₂, . . . are independent and identically distributed, then each Ŷₖ has the same distribution as Y₁.

b) For Y₁, Y₂, . . . as in (a), show that the sequence Ŷ₁, Ŷ₂, . . . has the same distribution as Y₁, Y₂, . . .

c) Give an example where the Y₁, Y₂, . . . are independent, but the Ŷ₁, Ŷ₂, . . . are not independent.

5. THE LAW OF PURE TYPES

Suppose that X₁, X₂, . . . are independent and Σ₁ⁿ Xₖ converges a.s. What can be said about the distribution of the limit X = Σ₁^∞ Xₖ? In general, very little! In the nature of things the distribution of X can be anything. For example, let Xₖ = 0, k > 1; then X = X₁. There is one result available here which is an application of the Kolmogorov zero-one law and remarkable for its simplicity and elegance.

Definition 3.25. A random variable X is said to have a distribution of pure type if either

i) there is a countable set D such that P(X ∈ D) = 1, or

ii) P(X = x) = 0 for every x ∈ R⁽¹⁾, but there is a set D ∈ ℬ₁ of Lebesgue measure zero such that P(X ∈ D) = 1, or

iii) P(X ∈ dx) ≪ l(dx) (Lebesgue measure).

[Recall that μ ≪ ν for two measures μ, ν denotes μ absolutely continuous with respect to ν; see Appendix A.29.]

Theorem 3.26 (Jessen-Wintner law of pure types [78]). Let X₁, X₂, . . . be independent random variables such that

i) Σ₁ⁿ Xₖ converges a.s. to X, and

ii) for each k, there is a countable set Fₖ such that P(Xₖ ∈ Fₖ) = 1.

Then the distribution of X is of pure type.

Proof. Let F = ⋃_{k≥1} Fₖ. Take G to be the smallest additive group in R⁽¹⁾ containing F. G consists of all numbers of the form m₁x₁ + · · · + m_j x_j,


x₁, . . . , x_j ∈ F and m₁, . . . , m_j integers. F is countable, hence G is countable. For any set B ⊂ R⁽¹⁾ write

G ⊕ B = {g + b; g ∈ G, b ∈ B}.

Note that

i) B countable ⇒ G ⊕ B countable,

ii) l(B) = 0 ⇒ l(G ⊕ B) = 0, since G ⊕ B is a countable union of translates of B.

For B ∈ ℬ₁ and C = {ω; Σ₁ⁿ Xₖ converges}, consider the event

A = C ∩ {X ∈ G ⊕ B}.

The point is that A is a tail event. Because if x₁ − x₂ ∈ G, then

x₁ ∈ G ⊕ B ⟺ x₂ ∈ G ⊕ B.

But X − Σₖ₌ₙ^∞ Xₖ ∈ G for all ω in C. Hence A ∈ ℱ(Xₙ, X_{n+1}, . . .) for every n.

By the zero-one law P(A) = 0 or 1. This gives the alternatives:

a) Either there is a countable set D such that P(X ∈ D) = 1, or P(X ∈ G ⊕ B) = 0, hence P(X ∈ B) = 0, for all countable sets B.

b) If the latter in (a) holds, then either there is a set D ∈ ℬ₁ such that l(D) = 0 and P(X ∈ D) = 1, or P(X ∈ G ⊕ B) = 0 for all B ∈ ℬ₁ such that l(B) = 0.

c) In this latter case B ∈ ℬ₁, l(B) = 0 ⇒ P(X ∈ B) = 0; that is, the distribution is absolutely continuous with respect to l(dx).

Theorem 3.26 gives no help as to which type the distribution of the limit random variable belongs to. In particular, for Y₁, Y₂, . . . independent ±1 with probability ½, the question of the type of the distribution of the sums Σ₁^∞ cₙYₙ is open. Some important special cases are given in the following problems.

Problems

13. Show that P(Σ₁^∞ Xₖ/2ᵏ ∈ dx) is Lebesgue measure on [0, 1]. [Recall the analytic model for coin-tossing.]

14. If X and Y are independent random variables, use Problem 2 to show that P(X ∈ dx) ≪ l(dx) ⇒ P(X + Y ∈ dx) ≪ l(dx).


15. Use Problems 13 and 14 to show that the distribution of Σ₁^∞ Yₙ/2ⁿ is absolutely continuous with respect to Lebesgue measure.

16. Show that if independent random variables X₁, X₂, . . . take values in a countable set F, and if there are constants αₙ ∈ F such that

Σₙ P(Xₙ ≠ αₙ) < ∞,

then the sum X = Σ₁^∞ Xₙ has distribution concentrated on a countable number of points. (The converse is also true; see Levy [101].)
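The claim in Problem 13 can be checked numerically. The following Monte Carlo sketch (not from the text; the seed, sample size, and truncation depth are arbitrary choices) compares the empirical distribution of truncated binary-digit sums with the uniform distribution on [0, 1]:

```python
import random

# With fair 0/1 digits, the truncated sums sum_{k<=40} digit_k / 2^k
# are (to within 2^-40) uniformly distributed on [0, 1], so the
# empirical CDF of many samples should sit close to F(x) = x.
random.seed(2)
n_samples, depth = 20_000, 40
samples = sorted(
    sum(random.getrandbits(1) / 2.0 ** k for k in range(1, depth + 1))
    for _ in range(n_samples))

# one-sided Kolmogorov-Smirnov-type deviation from the uniform CDF
ks_stat = max(abs((i + 1) / n_samples - x) for i, x in enumerate(samples))
```

For 20,000 genuinely uniform samples this deviation is typically below 0.01; a large value would contradict the uniformity asserted in Problem 13.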

6. THE LAW OF LARGE NUMBERS FOR INDEPENDENT RANDOM VARIABLES

From the random signs problem, by some juggling, we can generalize the law of large numbers for independent random variables.

Theorem 3.27. Let X₁, X₂, . . . be independent random variables, EXₖ = 0, EXₖ² < ∞. Let bₙ > 0 converge up to +∞. If Σ₁^∞ EXₖ²/bₖ² < ∞, then

(X₁ + · · · + Xₙ)/bₙ → 0 a.s.

Proof. To prove this we need :

Kronecker's Lemma 3.28. Let x₁, x₂, . . . be a sequence of real numbers such that Σ₁ⁿ xₖ → s finite. Take bₙ ↑ ∞; then

(1/bₙ) Σ₁ⁿ bₖxₖ → 0.

Proof. Let rₙ = Σ_{k=n+1}^∞ xₖ, r₀ = s; then xₙ = r_{n−1} − rₙ, n = 1, 2, . . . , and

Σ₁ⁿ bₖxₖ = b₁r₀ − bₙrₙ + Σ₂ⁿ (bₖ − b_{k−1})r_{k−1}.  (3.29)


For any ε > 0, take N such that |rₖ| ≤ ε for k ≥ N. Then, letting γ = max_{n≤N} |rₙ|, we get

|Σ₁ⁿ bₖxₖ| ≤ b₁|r₀| + bₙ|rₙ| + γ Σ₂^N (bₖ − b_{k−1}) + ε Σ_{N+1}ⁿ (bₖ − b_{k−1}).

Divide (3.29) by bₙ, and take lim sup, noting that b_N/bₙ → 0, |rₙ| → 0, to get lim supₙ (1/bₙ)|Σ₁ⁿ bₖxₖ| ≤ ε.

To prove 3.27, by Kronecker's lemma, if Σ₁ⁿ (Xₖ/bₖ) converges a.s., then (X₁ + · · · + Xₙ)/bₙ → 0 a.s.

By 3.22, it is sufficient that Σ₁^∞ E(Xₖ/bₖ)² = Σ₁^∞ EXₖ²/bₖ² < ∞.

As a consequence of 3.27, if the Xₙ are identically distributed, EX₁ = 0, and EX₁² = σ² < ∞, then Sₙ/n → 0 a.s. This is stronger than 1.21, which gives the same conclusion for coin-tossing. For example, we could even take bₙ = n^(½+δ), δ > 0, since then Σₙ σ²/bₙ² < ∞. But the strong law of large numbers is basically a first-moment theorem.

Theorem 3.30. Let X₁, X₂, . . . be independent and identically distributed random variables; if E|X₁| < ∞, then

(X₁ + · · · + Xₙ)/n → EX₁ a.s.;  (3.30)

if E|X₁| = ∞, then the above averages diverge almost everywhere.

Proof. In order to apply 3.27 define truncated random variables X̂ₙ by

X̂ₙ = Xₙ if |Xₙ| ≤ n, X̂ₙ = 0 otherwise.

By Problem 10, P(|Xₙ| > n i.o.) = 0. Hence (3.30) is equivalent to (X̂₁ + · · · + X̂ₙ)/n → EX₁ a.s.


But EX̂ₙ → EX₁, so (3.30) will follow if 3.27 can be applied to show that (1/n) Σ₁ⁿ (X̂ₖ − EX̂ₖ) → 0 a.s.

Since E(X̂ₖ − EX̂ₖ)² ≤ EX̂ₖ², it is sufficient to show that Σ₁^∞ EX̂ₖ²/k² < ∞.

This follows from writing the right-hand side as

Σₖ₌₁^∞ (1/k²) Σⱼ₌₁ᵏ E(X₁²; j − 1 < |X₁| ≤ j).

Interchange the order of summation, and use Σ_{k≥j} 1/k² ≤ 2/j, j ≥ 1, together with X₁² ≤ j|X₁| on {j − 1 < |X₁| ≤ j}, to get

Σ₁^∞ EX̂ₖ²/k² ≤ 2E|X₁| < ∞.

For the converse, suppose that Sₙ/n converges on a set of positive probability. Then it converges a.s. The contradiction is that

Xₙ/n = Sₙ/n − ((n − 1)/n)(S_{n−1}/(n − 1))

must converge a.s. to zero, implying P(|Xₙ| > n i.o.) = 0. This is impossible by Problem 10.

7. RECURRENCE OF SUMS

Through this section let X₁, X₂, . . . be a sequence of independent, identically distributed random variables. Form the successive sums Sₙ = X₁ + · · · + Xₙ.

Definition 3.31. For x ∈ R⁽¹⁾, call x a recurrent state if for every neighborhood I of x, P(Sₙ ∈ I i.o.) = 1.

The problem is to characterize the set of recurrent states. In coin-tossing, with X₁, X₂, . . . equaling ±1 with probability p, q, if p ≠ ½, then P(Sₙ = 0 i.o.) = 0. In fact, the strong law of large numbers implies that for any state j, P(Sₙ = j i.o.) = 0; no states are recurrent. For fair coin-tossing, every time Sₙ returns to zero, the probability of entering the state j is the same as it was at the start when n = 0. It is natural to surmise that in this case P(Sₙ = j i.o.) = 1 for all j. But we can use this kind of reasoning for any distribution; that is, if there is any recurrent state, then all states should be recurrent.


Definition 3.32. Say that a random variable X is distributed on the lattice L_d = {nd}, n = 0, ±1, . . . , d any real number > 0, if Σₙ P(X = nd) = 1 and there is no smaller lattice having this property. If X is not distributed on any lattice, it is called nonlattice. In this case, say that it is distributed on L₀, where L₀ = R⁽¹⁾.

Theorem 3.33. If X₁, X₂, . . . are distributed on L_d, d ≥ 0, then either every state in L_d is recurrent, or no states are recurrent.

Proof. Let G be the set of recurrent points. Then G is closed, because xₙ ∈ G, xₙ → x implies that for every neighborhood I of x, xₙ ∈ I for n sufficiently large; hence P(Sₙ ∈ I i.o.) = 1 and x ∈ G.

Define y to be a possible state if for every neighborhood I of y there is a k such that P(Sₖ ∈ I) > 0. I assert that

x recurrent, y possible ⇒ x − y recurrent.

To show this, take any ε > 0, and k such that P(|Sₖ − y| < ε) > 0. Then

P(|Sₙ − x| < ε finitely often)
≥ P(|Sₖ − y| < ε, |S_{k+n} − Sₖ − (x − y)| < 2ε finitely often)
= P(|Sₖ − y| < ε)P(|Sₙ − (x − y)| < 2ε finitely often).

The left side is zero, implying

P(|Sₙ − (x − y)| < 2ε finitely often) = 0.

If G is not empty, it contains at least one state x. Since every recurrent state is a possible state, x − x = 0 ∈ G. Hence G is a group, and therefore is a closed subgroup of R⁽¹⁾. But the only closed subgroups of R⁽¹⁾ are the lattices L_d, d ≥ 0. For every possible state y, 0 − y ∈ G ⇒ y ∈ G. For d > 0, this implies L_d ⊂ G, hence L_d = G. If X₁ is nonlattice, G cannot be a lattice, so G = R⁽¹⁾.

A criterion for which alternative holds is established in the following theorem.

Theorem 3.34. Let X₁, X₂, . . . be distributed on L_d, d ≥ 0. If there is a finite interval J, J ∩ L_d ≠ ∅, such that Σ₁^∞ P(Sₙ ∈ J) < ∞, then no states are recurrent. If there is a finite interval J such that Σ₁^∞ P(Sₙ ∈ J) = ∞, then all states in L_d are recurrent.

Proof. If Σ₁^∞ P(Sₙ ∈ J) < ∞, use the Borel-Cantelli lemma to get P(Sₙ ∈ J i.o.) = 0. There is at least one state in L_d that is not recurrent, hence none are. To go the other way, we come up against the same difficulty as in proving 3.18: the successive sums S₁, S₂, . . . are not independent. Now we make essential use of the idea that every time one of the sums


S₁, S₂, . . . comes back to the neighborhood of a state x, the subsequent process behaves nearly as if we had started off at the state x at n = 0. If Σ₁^∞ P(Sₙ ∈ J) = ∞, then for any ε > 0 less than half the length of J there is a subinterval I = (x − ε, x + ε) ⊂ J such that Σ₁^∞ P(Sₙ ∈ I) = ∞. Define sets

A₀ = {Sₙ ∉ I for all n ≥ 1}, Aₖ = {Sₖ ∈ I, Sₙ ∉ I for all n > k}, k ≥ 1;

Aₖ is the set on which the last visit to I occurred at the kth trial. Then

{Sₙ ∈ I finitely often} = ⋃ₖ₌₀^∞ Aₖ.

The Aₖ are disjoint, hence

P(Sₙ ∈ I finitely often) = Σₖ₌₀^∞ P(Aₖ).

For k ≥ 1,

Aₖ ⊃ {Sₖ ∈ I, |S_{k+n} − Sₖ| ≥ 2ε, n = 1, 2, . . .}.

Use independence, then identical distribution of the {Xₙ}, to get

P(Aₖ) ≥ P(Sₖ ∈ I)P(|Sₙ| ≥ 2ε, n = 1, 2, . . .).

This inequality holds for all k ≥ 1. Thus

P(Sₙ ∈ I finitely often) ≥ P(|Sₙ| ≥ 2ε, n = 1, 2, . . .) Σₖ₌₁^∞ P(Sₖ ∈ I).

Since Σₖ₌₁^∞ P(Sₖ ∈ I) = ∞, we conclude that for every ε > 0

P(|Sₙ| ≥ 2ε, n = 1, 2, . . .) = 0.  (3.35)

Now take I = (−ε, +ε), and define the sets Aₖ as above. Denote I_δ = (−δ, +δ), so that

Since the sequence of sets is monotone,

Now use (3.35) in getting


to conclude P(Aₖ) = 0, k ≥ 1. Use (3.35) directly to establish P(A₀) = 0.

Therefore P(Sₙ ∈ I finitely often) = 0, and the sums Sₙ enter every neighborhood of the origin infinitely often with probability one. So the origin is a recurrent state, and consequently all states in L_d are recurrent. Q.E.D.

Look again at the statement of this theorem. An immediate application of the Borel-Cantelli lemma gives a zero-one property:

Corollary 3.36. Either

P(Sn ∈ I i.o.) = 1

for all finite intervals I such that I ∩ Ld ≠ ∅, or

P(Sn ∈ I i.o.) = 0

for all such I.

Definition 3.37. If the first alternative in 3.36 holds, call the process S1, S2, ... recurrent. If the second holds, call it transient.

A quick corollary of 3.34 is a proof that fair coin-tossing is recurrent. Simply use the estimate P(S2n = 0) ~ 1/√(πn) to deduce that

Σn P(S2n = 0)

diverges.

The criterion for recurrence given in 3.34 is difficult to check directly

in terms of the distribution of X1, X2, .... A slightly more workable expression will be developed in Chapter 8. There is one important general result, however. If

E|X1| < ∞ and EX1 ≠ 0,

then by the law of large numbers, the sums are transient. If EX1 = 0, the issue is in doubt. All that is known is that Sn = o(n) [o(n) denoting small order of n]. There is no reason why the sums should behave as regularly as the sums in coin-tossing. (See Problem 17 for a particularly badly-behaved example of successive sums with zero means.) But, at any rate,

Theorem 3.38. If EX1 = 0, then the sums S1, S2, ... are recurrent.

Proof. First, we need to prove

Proposition 3.39. If I is any interval of length a, then

Σn≥1 P(Sn ∈ I) ≤ 1 + Σn≥1 P(Sn ∈ (−a, +a)).


Proof. Denote N = Σn≥1 χI(Sn), so that N counts the number of times that the sums enter the interval I. Define an extended random variable n* by

n* = (first n such that Sn ∈ I), n* = ∞ if Sn ∉ I for all n.

Given n* = k and Sk = y ∈ I, every later visit to I has Sk+n − Sk ∈ I − y; denoting by I − y the interval I shifted left a distance y, I − y ⊂ (−a, +a) for y ∈ I. Hence

EN = Σk P(n* = k)E(N | n* = k) ≤ Σk P(n* = k)[1 + Σn≥1 P(Sn ∈ (−a, +a))] ≤ 1 + Σn≥1 P(Sn ∈ (−a, +a)).

We use 3.39 to prove 3.38 as follows: For any positive integer M, the interval (−M, +M) is the union of 2M disjoint intervals of length 1, so by 3.39,

Σn≥1 P(|Sn| < M) ≤ 2M[1 + Σn≥1 P(|Sn| < 1)].

Hence

(3.40) Σn≥1 P(|Sn| < 1) ≥ (1/2M) Σn≥1 P(|Sn| < M) − 1.

The strong law of large numbers implies the weaker result P(|Sn| < εn) → 1 for every ε > 0. Fix ε, and take m so that P(|Sn| < εn) > 1/2 for n ≥ m. Then P(|Sn| < M) > 1/2 for m ≤ n ≤ M/ε, which gives

Σn≥1 P(|Sn| < M) ≥ (M/ε − m)/2.

Substituting this into (3.40), we get

Σn≥1 P(|Sn| < 1) ≥ 1/(4ε) − m/(4M) − 1.


Since ε is arbitrary (first let M → ∞, then ε ↓ 0), conclude that

Σn≥1 P(|Sn| < 1) = ∞.

By 3.34 the sums are recurrent. Q.E.D.
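The coin-tossing case of the recurrence criterion can be checked numerically. The sketch below (illustrative, not part of the text) computes the exact return probability P(S2n = 0) = C(2n, n)/4^n, compares it with the estimate 1/√(πn), and shows the partial sums of the series growing without bound, so that 3.34 gives recurrence.

```python
# Numerical check of Theorem 3.34 for fair coin-tossing: the exact
# probability P(S_2n = 0) = C(2n, n)/4^n matches 1/sqrt(pi*n), and the
# partial sums of these probabilities grow without bound (divergence).
import math

def p_return(n):
    """Exact P(S_2n = 0) for fair coin-tossing."""
    return math.comb(2 * n, n) / 4 ** n

ratio = p_return(500) / (1.0 / math.sqrt(math.pi * 500))
partials = [sum(p_return(n) for n in range(1, N + 1)) for N in (10, 100, 1000)]
print(ratio, partials)   # ratio near 1; partial sums keep growing like 2*sqrt(N/pi)
```

The partial sums behave like 2√(N/π), which is exactly the divergence the theorem requires.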

Problems

17. Unfavorable "fair" game, due to Feller [59, Vol. I, p. 246]. Let X1, X2, ... be independent and identically distributed, and take values in {0, 2, 2², 2³, ...} so that

and define P(X1 = 0) to make the sum unity. Now EX1 = 1, but show that for every ε > 0,

18. Consider k fair coin-tossing games being carried on independently of each other, giving rise to the sequence of random vectors

Yn = (Yn(1), ..., Yn(k)),

where Yn(j) is ±1 as the nth outcome of the jth game is H or T. Let Zn(j) = Y1(j) + · · · + Yn(j), and plot the progress of each game on one axis of R(k). The point described is Zn = (Zn(1), ..., Zn(k)), where Y1 takes any one of the values (±1, ..., ±1) with probability 1/2^k. Denote 0 = (0, 0, ..., 0). Now Zn = 0 only if equalization takes place in all k games simultaneously. Show that

Σn P(Zn = 0) = ∞ for k = 1, 2, while Σn P(Zn = 0) < ∞, hence P(Zn = 0 i.o.) = 0, for k ≥ 3.

8. STOPPING TIMES AND EQUIDISTRIBUTION OF SUMS

Among the many nice applications of the law of large numbers, I am going to pick one. Suppose that X1, X2, ... are distributed on the lattice Ld and the sums are recurrent. For any interval I such that I ∩ Ld ≠ ∅, the number of S1, ..., Sn falling into I goes to ∞. Denote this number by Nn(I). Then


Nn(I)/n, the average number of landings in I per unit time, goes to zero in general (see Problem 20).

An interesting result is that the points S1, ..., Sn become a.s. uniformly distributed in the sense that for any two finite intervals I, I′ intersecting Ld,

Nn(I)/Nn(I′) → ‖I‖/‖I′‖ a.s.

(In the lattice case, define ‖I‖ as the number of points of Ld in I.) This equidistribution is clearly a strong property. The general proof is not elementary (see Harris and Robbins [69]). But there is an interesting proof in the lattice case which introduces some useful concepts. The idea is to look at the number of landings of the Sn sequence in I between successive zeros of Sn.
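The equidistribution property can be watched in a simulation. The sketch below (illustrative; it assumes the simple symmetric ±1 walk on the integers, which is recurrent with EX1 = 0 and lattice distance one) counts landings in two intervals near the origin and compares the ratio with the ratio of lattice-point counts.

```python
# Simulation sketch of equidistribution: Nn(I)/Nn(I') should approach
# ||I||/||I'||, the ratio of the numbers of lattice points in I and I'.
import random

random.seed(7)
n = 1_000_000
I = range(-2, 3)         # 5 lattice points: -2, ..., 2
I_prime = range(0, 10)   # 10 lattice points: 0, ..., 9
s = count_I = count_Ip = 0
for _ in range(n):
    s += random.choice((-1, 1))
    count_I += s in I          # membership in a range object is O(1)
    count_Ip += s in I_prime
ratio = count_I / count_Ip
print(ratio)   # should settle near ||I|| / ||I'|| = 0.5
```

Convergence is slow, since the number of landings grows only like √n, which is also why Nn(I)/n itself goes to zero.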

Definition 3.41. A positive, integer-valued random variable n* is called a stopping time for the sums S1, S2, ... if {n* = n} ∈ ℱ(X1, ..., Xn) for every n.

The field of events ℱ(Sk, k ≤ n*) depending on Sn up to time of stopping consists of all A ∈ ℱ(X) such that

A ∩ {n* = n} ∈ ℱ(X1, ..., Xn) for every n.

For example, in the recurrent case, d = 1, let n* be the first entry of the sums {Sn} into the state y. This is a stopping time. More important, once at state y, the process continues by adding independent random variables, so that Sn*+k − Sn* should have the same distribution as Sk and be independent of anything that happened up to time n*.

Proposition 3.42. If n* is a stopping time, then the process S̃k = Sn*+k − Sn*, k = 1, 2, ... has the same distribution as S1, S2, ... and is independent of ℱ(Sk, k ≤ n*).

Proof. Let A ∈ ℱ(Sk, k ≤ n*), B ∈ 𝔅∞, and write

P(A ∩ {(S̃1, S̃2, ...) ∈ B}) = Σn P(A ∩ {n* = n} ∩ {(S̃1, S̃2, ...) ∈ B}).

On the set {n* = n} the S̃k process is equal to the process Sn+k − Sn, k = 1, 2, ..., and A ∩ {n* = n} ∈ ℱ(X1, ..., Xn). Since the process Sn+k − Sn, k = 1, 2, ... is independent of ℱ(X1, ..., Xn) and has the same distribution as S1, S2, ...,

P(A ∩ {(S̃1, S̃2, ...) ∈ B}) = Σn P(A ∩ {n* = n})P((S1, S2, ...) ∈ B) = P(A)P((S1, S2, ...) ∈ B). Q.E.D.

Note that n* itself is measurable ℱ(Sk, k ≤ n*).


Definition 3.43. The times of the zeros of Sn are defined by

R1 = (first n such that Sn = 0), Rk = (first n > Rk−1 such that Sn = 0);

Rk is called the kth occurrence time of {Sn = 0}. The times between zeros are defined by T1 = R1, Tk = Rk − Rk−1, k ≥ 2.

The usefulness of these random variables is partially accounted for by

Proposition 3.44. If P(Sn = 0 i.o.) = 1, then T1, T2, ... are independent and identically distributed random variables.

Proof. T1 is certainly a stopping time. By 3.42, S̃k = Sk+T1 − ST1 has the same distribution as Sk, but this process is independent of T1. Thus T2, which is the first equalization time for the S̃k process, is independent of T1 and has the same distribution. Repeat this argument for k = 3, ....
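Proposition 3.44 can be illustrated by simulation. The sketch below (illustrative assumptions: fair coin-tossing, for which P(Sn = 0 i.o.) = 1) generates pairs of successive between-zero times; each return time is even, and T2 has the same distribution as T1, checked here through the frequency of the event {T = 2}, whose exact probability is 1/2. Excursions longer than 10000 steps are discarded (ET1 = ∞, compare Problem 19), which introduces only a small bias for this check.

```python
# Simulation sketch of Proposition 3.44: times between zeros of fair
# coin-tossing are i.i.d.; compare T1 and T2 via the frequency of {T = 2}.
import random

random.seed(1)

def excursion(cap=10000):
    """Steps until the walk first returns to 0, or None if it takes > cap."""
    s = 0
    for n in range(1, cap + 1):
        s += random.choice((-1, 1))
        if s == 0:
            return n
    return None

t1_vals, t2_vals = [], []
for _ in range(10000):
    t1, t2 = excursion(), excursion()   # T2 starts afresh at the first zero
    if t1 is not None and t2 is not None:
        t1_vals.append(t1)
        t2_vals.append(t2)

f1 = sum(t == 2 for t in t1_vals) / len(t1_vals)
f2 = sum(t == 2 for t in t2_vals) / len(t2_vals)
print(f1, f2)   # both should be near P(T1 = 2) = 1/2
```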

Theorem 3.45. Let X1, X2, ... have lattice distance one, and P(Sn = 0 i.o.) = 1. Then, for any two states j, l,

limn Nn(j)/Nn(l) = 1 a.s.

Proof. Let R1, R2, ... be the times of successive returns to the origin, T1, T2, ... the times between returns. The T1, T2, ... are independent and identically distributed. Let M1(j) be the number of landings in j before the first return to the origin, M2(j) the number between the first and second returns, etc. The M1, M2, ... are similarly independent and identically distributed (see Problem 22). The law of large numbers could be applied to (M1(j) + · · · + Mk(j))/k if we knew something about EM1(j). Denote π(j) = EM1(j), and assume for the moment that for all states j, 0 < π(j) < ∞. Then

NRk(j)/NRk(l) = [M1(j) + · · · + Mk(j)]/[M1(l) + · · · + Mk(l)] → π(j)/π(l) a.s.

This gives convergence of Nn(j)/Nn(l) along a random subsequence. To get convergence over the full sequence, write, for Rk ≤ n < Rk+1,

Nn(j)/Nn(l) ≤ [NRk(j) + Mk+1(j)]/NRk(l).


Dividing the top and bottom of the first term on the right by k we have that limk of that term is given by

[π(j) + limk Mk+1(j)/k] / π(l).

By Problem 10, limk Mk+1(j)/k is a.s. zero. The second term (the corresponding lower bound) is treated similarly to get

(3.46) limn Nn(j)/Nn(l) = π(j)/π(l) a.s.

Given that we have landed in j for the first time, let λj be the probability, starting from j, that we return to the origin before another landing in j. This must occur with positive probability, otherwise P(Sn = 0 i.o.) = 1 is contradicted. So whether or not another landing occurs before return is decided by tossing a coin with probability λj of failure. The expected number of additional landings past the first is given by the expected number of trials until failure.

This is given by

Σm≥1 m(1 − λj)^m λj < ∞,

hence π(j) is finite. Add the convention π(0) = 1; then (3.46) holds for all states j and l. Let n* be the first time that state l is entered. By (3.46),

Nn*+n(l)/Nn*+n(j) → π(l)/π(j) a.s.

But Nn*+n(l) is the number of times that the sums Sk+n* − Sn*, k = 1, ..., n, land in state zero, and Nn*+n(j) is the number of times that Sk+n* − Sn*, k = 1, ..., n, land in j − l, plus the number of landings in j by the sums Sk up to time n*. Therefore, π(l)/π(j) = π(0)/π(j − l), or

π(j − l) = π(j)/π(l).

This is the exponential equation on the integers — the only solutions are π(j) = r^j. Consider any sequence of states m1, ..., mn−1, 0 terminating at zero. I assert that

P(S1 = m1, ..., Sn−1 = mn−1, Sn = 0) = P(S1 = −mn−1, ..., Sn−1 = −m1, Sn = 0).

The first probability is

P(X1 = m1, X2 = m2 − m1, ..., Xn = −mn−1).

The fact that X1, ..., Xn are independent and identically distributed lets us equate this to the probability of the same increments taken in reverse order, which is the second probability. This implies that π(j) = π(−j), hence r^j = r^{−j}, r = 1, and π(j) = 1 for all j. Q.E.D.


Problems

19. For fair coin-tossing equalizations,

P(T1 = 2n) = (1/(2n − 1)) C(2n, n) 2^{−2n}.

a) Use this result to show that P(T1 > 2n) = P(S2n = 0).

b) From (a) conclude ET1 = ∞.

c) Using (a) again, show that P(T1 > 2n) ~ 1/√(πn).

d) Use (c) to show that P( … ).

e) Conclude from (d) that P(lim … ).

(There are a number of ways of deriving the exact expression above for P(T1 = 2n); see Feller [59, Vol. I, pp. 74-75].)

20. For fair coin-tossing, use P(S2n = 0) ~ 1/√(πn) to show that

ENn(0) ~ √(2n/π).

Use an argument similar to the proof of the strong law of large numbers for fair coin-tossing, Theorem 1.21, to prove

Nn(0)/n → 0 a.s.

21. Define

n1* = (first n such that Sn > 0), n2* = (first n such that Sn+n1* > Sn1*), and so forth,

with Yk the increase of the sums over the kth of these steps, Y1 = Sn1*, Y2 = Sn1*+n2* − Sn1*, .... Show that the sequence (nk*, Yk) are independent, identically distributed vectors. Use the law of large numbers to prove that if E|X1| < ∞, EX1 > 0, then if one of EY1, En1* is finite, so is the other, and

EY1 = EX1 · En1*.

Show by using the sequence


22. For sums S1, S2, ... such that P(Sn = 0 i.o.) = 1, and R1, R2, ... the occurrence times of {Sn = 0}, define the vectors Zk by

Define the appropriate range space (R, 𝔅) for each of these vectors and show that they are independent and identically distributed.

9. HEWITT-SAVAGE ZERO-ONE LAW

Section 7 proved that for any interval I, P(Sn ∈ I i.o.) is zero or one. But these sets are not tail events; whether Sn = 0 an infinite number of times depends strongly on X1. However, there is another zero-one law in operation, formulated recently by Hewitt and Savage [71], which covers a variety of non-tail events.

Definition 3.47. For a process X1, X2, ..., A ∈ ℱ(X) is said to be symmetric if for any finite permutation (i1, i2, ...) of the integers there is a B ∈ 𝔅∞ such that

A = {(X1, X2, ...) ∈ B} = {(Xi1, Xi2, ...) ∈ B}.

Theorem 3.48. For X1, X2, ... independent and identically distributed, every symmetric set has probability zero or one.

Proof. The short proof we give here is due to Feller [59, Vol. II]. Take An ∈ ℱ(X1, ..., Xn) so that P(A Δ An) → 0. An can be written

An = {(X1, ..., Xn) ∈ Bn}, Bn ∈ 𝔅n.

Because the X̃ = (Xi1, Xi2, ...) process, i1, i2, ... any sequence of distinct integers, has the same distribution as X, for any B ∈ 𝔅∞,

(3.49) P((Xi1, Xi2, ...) ∈ B) = P((X1, X2, ...) ∈ B).

Take (i1, i2, ...) = (n + 1, ..., 2n, 1, ..., n, 2n + 1, 2n + 2, ...) and put

Ãn = {(Xn+1, ..., X2n) ∈ Bn}.

By (3.49), taking B ∈ 𝔅∞ such that {X ∈ B} = {X̃ ∈ B} = A,

P(A Δ Ãn) = P(A Δ An) → 0.

But An and Ãn are independent, thus

P(An ∩ Ãn) = P(An)P(Ãn) → P(A)².

Since also P(An ∩ Ãn) → P(A), again, as in the Kolmogorov zero-one law, we wind up with P(A) = P(A)². Q.E.D.


Corollary 3.50. For X1, X2, ... independent and identically distributed, {Sn} the sums, every tail event on the S1, S2, ... process has probability zero or one.

Proof. For A a tail event and (i1, i2, ...) a permutation of only the first n indices, A depends only on Sn+1, Sn+2, ..., and these sums are unchanged by the permutation. Take B ∈ 𝔅∞ such that A = {X ∈ B}; then also A = {(Xi1, Xi2, ...) ∈ B}. Thus A is a symmetric set.

This result leads to the mention of another famous strong limit theorem which will be proved much later. If EXk = 0, EXk² < ∞ for independent, identically distributed random variables, then the form 3.27 of the strong law of large numbers implies

Sn/(√n log n) → 0 a.s.

On the other hand, it is not hard to show that

(3.51) lim supn |Sn|/√n = +∞ a.s.

Therefore fluctuations of Sn should be somewhere in the range √n to √n log n. For any function h(n) ↑ ∞, the random variable lim sup |Sn|/h(n) is a tail random variable, hence a.s. constant. The famous law of the iterated logarithm is

Theorem 3.52. With σ² = EX1²,

lim supn Sn/√(2σ²n log log n) = 1 a.s.

This is equivalent to: For every ε > 0,

P(Sn > (1 + ε)√(2σ²n log log n) i.o.) = 0 and P(Sn > (1 − ε)√(2σ²n log log n) i.o.) = 1.

Therefore, a more general version of the law of the iterated logarithm would be a separation of all nondecreasing h(n) into two classes: those with

P(Sn > √n h(n) i.o.) = 0, and those with P(Sn > √n h(n) i.o.) = 1.

The latter dichotomy holds because of 3.50. The proof of 3.52 is quite tricky, to say nothing of the more general version. The simplest proof around for coin-tossing is in Feller [59, Vol. I].


Actually, this theorem is an oddity. Because, even though it is a strong limit theorem, it is a second-moment theorem and its proof consists of ingenious uses of the Borel-Cantelli lemma combined with the central limit theorem. We give an illuminating proof in Chapter 13.
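The Borel-Cantelli mechanism behind the iterated-logarithm dichotomy can be sketched numerically (an illustration only, not a proof): along the subsequence nk = 2^k, the central limit theorem approximates P(Snk > c√(2nk log log nk)) by the Gaussian tail 1 − Φ(c√(2 log log nk)), which behaves like k^(−c²). For c > 1 the tails are summable; for c < 1 their partial sums keep growing.

```python
# Partial sums of CLT-approximated tail probabilities along n_k = 2^k,
# for constants c above and below the critical value 1.
import math

def gauss_tail(x):
    """P(Z > x) for standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2))

def partial_sum(c, K):
    total = 0.0
    for k in range(2, K + 1):
        loglog_n = math.log(math.log(2.0 ** k))
        total += gauss_tail(c * math.sqrt(2.0 * loglog_n))
    return total

sums_high = [partial_sum(1.2, K) for K in (50, 100, 200)]  # c > 1: converges
sums_low = [partial_sum(0.8, K) for K in (50, 100, 200)]   # c < 1: diverges
print(sums_high, sums_low)
```

The c > 1 sums barely move as K grows, while the c < 1 sums keep increasing, mirroring the zero and one alternatives of the theorem.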

Problem 23. Use the Kolmogorov zero-one law and the central limit theoremto prove (3.51) for fair coin-tossing.

Remark. The important theorems for independence come up over and over again as their contents are generalized. In particular, the random signs problem connects with martingales (Chapter 5), the strong law of large numbers generalizes into the ergodic theorem (Chapter 6), and the notion of recurrence of sums comes up again in Markov processes (Chapter 7).

NOTES

The strong law of large numbers was proven for fair coin-tossing by Borel in 1909. The forms of the strong law given in this chapter were proved by Kolmogorov in 1930, 1933 [92], [98]. The general solution of the random signs problem is due to Khintchine and Kolmogorov [91] in 1925. A special case of the law of the iterated logarithm was proven by Khintchine [88, 1924].

The work on recurrence is more contemporary. The theorems of Section 7 are due to Chung and Fuchs [18, 1951], but the neat proof given that EX1 = 0 implies recurrence was found by Chung and Ornstein [20, 1962]. But this work was preceded by some intriguing examples due to Polya [116, 1921]. (These will be given in Chapter 7.) The work of Harris and Robbins (loc. cit.) on the equidistribution of sums appeared in 1953.

There is a bound for sums of independent random variables with zero means which is much better known than Skorokhod's inequality, that is,

P(max1≤k≤n |Sk| ≥ ε) ≤ ESn²/ε²,

Sk = X1 + · · · + Xk. This is due to Kolmogorov. Compare it with the Chebyshev bound for P(|Sn| ≥ ε). A generalization is proved in Chapter 5.
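Kolmogorov's inequality P(max k≤n |Sk| ≥ ε) ≤ ESn²/ε² can be seen at work in a small simulation (illustrative; it assumes fair ±1 coin-tossing steps, so ESn² = n). The same quantity that Chebyshev allots to |Sn| alone bounds the maximum of all the partial sums.

```python
# Empirical frequencies of {max |S_k| >= eps} and {|S_n| >= eps} versus
# the common bound E S_n^2 / eps^2 from Kolmogorov's inequality.
import random

random.seed(3)
n, eps, trials = 100, 25.0, 10000
max_exceed = final_exceed = 0
for _ in range(trials):
    s = m = 0
    for _ in range(n):
        s += random.choice((-1, 1))
        m = max(m, abs(s))
    max_exceed += m >= eps
    final_exceed += abs(s) >= eps

bound = n / eps ** 2           # E S_n^2 / eps^2 = 0.16 here
p_max = max_exceed / trials
p_final = final_exceed / trials
print(p_final, p_max, bound)   # p_final <= p_max, both well under the bound
```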

The strong law for identically distributed random variables depends essentially on E|X1| < ∞. One might expect that even if E|X1| = ∞, there would be another normalization Nn ↑ ∞ such that the normed sums Sn/Nn converge a.s. One answer is trivial; you can always take Nn increasing so rapidly that a.s. convergence to zero follows. But Chow and Robbins [14]


have obtained the strong result that if E|X1| = ∞, there is no normalization Nn ↑ ∞ such that one gets a.s. convergence to anything but zero.

If E|X1| < ∞, then if the sums are nonrecurrent, either Sn → +∞ a.s. or Sn → −∞ a.s. But if E|X1| = ∞, the sums can be transient and still change sign an infinite number of times; in fact, one can get lim sup Sn = +∞ a.s., lim inf Sn = −∞ a.s. Examples of this occur when X1, X2, ... have one of the symmetric stable distributions discussed in Chapter 9.

Strassen [135] has shown that the law of the iterated logarithm is a second-moment theorem in the sense that if EX1 = 0 and

lim supn Sn/√(2n log log n) = 1 a.s.,

then EX1² < ∞. There is some work on other forms of this law when EX1² = ∞, but the results (Feller [54]) are very specialized.

For more extensive work with independent random variables the most interesting source remains Paul Lévy's book [103]. Loève's book has a good deal of the classical material. For an elegant and interesting development of the ideas of recurrence see Spitzer [130].


CHAPTER 4

CONDITIONAL PROBABILITYAND CONDITIONAL EXPECTATION

1. INTRODUCTION

More general tools need to be developed to handle relationships between dependent random variables. The concept of conditional probability — the distribution of one set of random variables given information concerning the observed values of another set — will turn out to be a most useful tool.

First consider the problem: What is the probability of an event B given that A has occurred? If we know that ω ∈ A, then our new sample space is A. The probability of B is proportional to the probability of that part of it lying in A. Hence

Definition 4.1. Given (Ω, ℱ, P), for sets A, B ∈ ℱ such that P(A) > 0, the conditional probability of B given that A has occurred is defined as

P(B ∩ A)/P(A)

and is denoted by P(B | A).

This extends immediately to conditioning by random variables taking only acountable number of values.

Definition 4.2. If X takes values in {xk}, the conditional probability of A given X = xk is defined by

P(A | X = xk) = P(A ∩ {X = xk})/P(X = xk)

if P(X = xk) > 0, and arbitrarily defined as zero if P(X = xk) = 0.

Note that there is probability zero that X takes values in the set where the conditional probability was not defined by the ratio. P(A | X = xk) is a probability on ℱ, and the natural definition of the conditional expectation of a random variable Y given X = xk is

E(Y | X = xk) = ∫ Y(ω) P(dω | X = xk),

if the integral exists.
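The countable case of Definitions 4.1 and 4.2 can be computed directly. The sketch below uses a hypothetical two-dice setup (an assumption for illustration, not from the text): X is the value of the first die, A the event that the total of two fair dice is 7; the conditional probability is the ratio of Definition 4.2, and the conditional expectation of the total Y is the sum over the conditional distribution.

```python
# Exact conditional probability and expectation for a finite sample space,
# computed by the ratio of Definition 4.2 with exact rational arithmetic.
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))    # equally likely outcomes
p = Fraction(1, len(omega))

def cond_prob(event, k):
    """P(event | X = k) = P(event and X = k) / P(X = k)."""
    top = sum(p for w in omega if event(w) and w[0] == k)
    bottom = sum(p for w in omega if w[0] == k)
    return top / bottom

A = lambda w: w[0] + w[1] == 7
cp = cond_prob(A, 3)                      # P(total = 7 | first die = 3)
ce = sum((3 + y2) * cond_prob(lambda w, y2=y2: w[1] == y2, 3)
         for y2 in range(1, 7))           # E(Y | X = 3)
print(cp, ce)
```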


What needs to be done is to generalize the definition so as to be able to handle random variables taking on nondenumerably many values. Look at the simplest case of this: Suppose there is a random variable X on (Ω, ℱ, P), and let A ∈ ℱ. If B ∈ 𝔅1 is such that P(X ∈ B) > 0, then as above, the conditional probability of A given X ∈ B is defined by

P(A | X ∈ B) = P(A ∩ {X ∈ B})/P(X ∈ B).

But suppose we want to give meaning to the conditional probability of A given X(ω) = x. Of course, if P(X = x) > 0, then we have no trouble and proceed as in 4.2. But many of the interesting random variables have the property that P(X = x) = 0 for all x. This causes a fuss. An obvious thing to try is taking limits, i.e., to try defining

(4.4) P(A | X = x) = lim h↓0 P(A | x − h < X < x + h).

In general, this is no good. If P(X = x0) = 0, then there is no guarantee, unless we put more restrictive conditions on P and X, that the limit above will exist for x = x0.

So either we add these restrictions (very unpleasant), or we look at the problem a different way. Look at the limit in (4.4) globally as a function of x. Intuitively, it looks as though we are trying to take the derivative of one measure with respect to another. This has a familiar ring; we look back to see what can be done.

On 𝔅1 define two measures as follows: Let

P̂(B) = P(X ∈ B), Q(B) = P(A ∩ {X ∈ B}).

Note that 0 ≤ Q(B) ≤ P̂(B), so that Q is absolutely continuous with respect to P̂. By the Radon-Nikodym theorem (Appendix A.30) we can define the derivative of Q with respect to P̂, which is exactly what we are trying to do with limits in (4.4). But we must pay a price for taking this elegant route. Namely, recall that dQ/dP̂ is defined as any 𝔅1-measurable function φ(x) satisfying

(4.6) ∫B φ(x) P̂(dx) = Q(B), all B ∈ 𝔅1.

If φ satisfies (4.6), so does φ′ if φ = φ′ a.s. P̂. Hence this approach, defining P(A | X = x) as any function satisfying (4.6), leads to an arbitrary selection of one function from among a class of functions equivalent (a.s. equal) under P̂. This is a lesser evil.


Definition 4.7. The conditional probability P(A | X = x) is defined as any 𝔅1-measurable function φ(x) satisfying

∫B φ(x) P̂(dx) = P(A ∩ {X ∈ B}), all B ∈ 𝔅1.

In 4.7 above, P(A | X = x) is defined as a 𝔅1-measurable function φ(x), unique up to equivalence under P̂. For many purposes it is useful to consider the conditional probability as a random variable on the original (Ω, ℱ, P) space, rather than the version above which resembles going into representation space. The natural way to do this is to define

P(A | X)(ω) = φ(X(ω)).

Since φ is 𝔅1-measurable, φ(X(ω)) is a random variable on (Ω, ℱ). Since any two versions of φ are equivalent under P̂, any two versions of P(A | X(ω)) obtained in this way are equivalent under P. But there is a more direct way to get to P(A | X), analogous to 4.7. Actually, what is done is just to transform 4.7 to (Ω, ℱ, P).

Definition 4.8. The conditional probability of A given X(ω) is defined as any random variable on Ω, measurable ℱ(X), satisfying

∫{X∈B} P(A | X) dP = P(A ∩ {X ∈ B}), all B ∈ 𝔅1.

Any two versions of P(A | X) differ on a set of probability zero.

This gives the same P(A | X) as starting from 4.7 to get φ(X(ω)), where φ(x) = P(A | X = x). To see this, apply 2.41 to 4.7 and compare the result with 4.8. A proof that is a bit more interesting utilizes a converse of 2.31.

Proposition 4.9. Let X be a random vector on (Ω, ℱ) taking values in (R, 𝔅). If Z is a random variable on (Ω, ℱ), measurable ℱ(X), then there is a random variable θ(x) on (R, 𝔅) such that

Z(ω) = θ(X(ω)).

Proof. See Appendix A.21.

The fact that P(A | X) is ℱ(X)-measurable implies by this proposition that P(A | X) = θ(X), where θ(x) is 𝔅1-measurable. But θ(X) satisfies

∫{X∈B} θ(X) dP = ∫B θ(x) P̂(dx)

(this last by 2.41). Hence θ = φ a.s. P̂.

We can put 4.8 into a form which shows up a seemingly curious phenomenon. Since ℱ(X) is the class of all sets {X ∈ B}, B ∈ 𝔅1, P(A | X) is any


random variable measurable ℱ(X) satisfying

∫D P(A | X) dP = P(A ∩ D), all D ∈ ℱ(X).

From this, make the observation that if X1 and X2 are two random variables which contain the same information in the sense that ℱ(X1) = ℱ(X2), then

P(A | X1) = P(A | X2) a.s.

In a way this is not surprising, because ℱ(X1) = ℱ(X2) implies that X1 and X2 are functions of each other, that is, from 4.9,

X1 = θ1(X2), X2 = θ2(X1).

The idea here is that P(A | X) does not depend on the values of X, but rather on the sets in ℱ that X discriminates between.

The same course can be followed in defining the conditional expectation of one random variable, given the value of another. Let X, Y be random variables on (Ω, ℱ, P). What we wish to define is the conditional expectation of Y given X = x, in symbols, E(Y | X = x). If B ∈ 𝔅1 were such that P(X ∈ B) > 0, intuitively E(Y | X ∈ B) should be defined as ∫ Y(ω)P(dω | X ∈ B), where P(· | X ∈ B) is the probability on ℱ defined as

P(F | X ∈ B) = P(F ∩ {X ∈ B})/P(X ∈ B).

Again, we could take B = (x − h, x + h), let h ↓ 0, and hope the limit exists. More explicitly, we write the ratio

∫{X∈B} Y dP / P(X ∈ B)

and hope that as P(X ∈ B) → 0, the limiting ratio exists. Again the derivative of one set function with respect to another is coming up. What to do is similar: Define

Q(B) = ∫{X∈B} Y dP.

To get things finite, we have to assume E|Y| < ∞; then

|Q(B)| ≤ E|Y| < ∞.

To show that Q is σ-additive, write it as

Q(B) = ∫{X∈B} Y⁺ dP − ∫{X∈B} Y⁻ dP.


Now {Bn} disjoint implies An = {X ∈ Bn} disjoint, and

Q(∪n Bn) = Σn Q(Bn).

Also, P̂(B) = 0 ⟹ Q(B) = 0, thus Q is absolutely continuous with respect to P̂. This allows the definition of E(Y | X = x) as any version of dQ/dP̂.

Definition 4.12. Let E|Y| < ∞; then E(Y | X = x) is any 𝔅1-measurable function satisfying

∫B E(Y | X = x) P̂(dx) = ∫{X∈B} Y dP, all B ∈ 𝔅1.

Any two versions of E(Y | X = x) are a.s. equal under P̂.

Conditional expectations can also be looked at as random variables. Just as before, if φ(x) = E(Y | X = x), E(Y | X) can be defined as φ(X(ω)). Again, we prefer the direct definition.

Definition 4.13. Let E|Y| < ∞; then E(Y | X) is any ℱ(X)-measurable function satisfying

(4.14) ∫{X∈B} E(Y | X) dP = ∫{X∈B} Y dP, all B ∈ 𝔅1.

The random variable Y trivially satisfies (4.14), but in general Y ≠ E(Y | X) because Y(ω) is not necessarily ℱ(X)-measurable. This remark does uncover the property that if ℱ(Y) ⊂ ℱ(X) then E(Y | X) = Y a.s. Another property in this direction is: Consider the space of ℱ(X)-measurable random variables. In this space, the random variable closest to Y is E(Y | X). (For a precise version of this statement see Problem 11.)

Curiously enough, conditional probabilities are a special case of conditional expectations. Because, by the definitions,

P(A | X) = E(χA | X) a.s.

Therefore, the next section deals with the general definition of conditional expectation.

Definition 4.15. Random variables X, Y on (Ω, ℱ, P) are said to have a joint density if the probability P̂(·) defined on 𝔅2 by P̂(F) = P((Y, X) ∈ F) is absolutely continuous with respect to Lebesgue measure on 𝔅2, that is, if there exists f(y, x) on R(2), measurable 𝔅2, such that

P((Y, X) ∈ F) = ∫F f(y, x) dy dx.


Then, by Fubini's theorem (see Appendix A.37), defining f(x) = ∫ f(y, x) dy, for all B ∈ 𝔅1,

P(X ∈ B) = ∫B f(x) dx.

(Actually, any σ-finite product measure on 𝔅2 could be used instead of dy dx.) If a joint density exists, then it can be used to compute the conditional probability and expectation. This is the point of Problems 2 and 3 below.
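The density computation can be illustrated numerically. The sketch below (an assumed example, not from the text) evaluates E(Y | X = x) as the ratio (∫ y f(y, x) dy)/f(x) for the bivariate normal density with unit variances and correlation ρ, for which the known answer is E(Y | X = x) = ρx.

```python
# E(Y | X = x) computed from a joint density as the ratio of two integrals,
# approximated by midpoint Riemann sums.
import math

rho = 0.6

def f(y, x):
    """Bivariate normal density, zero means, unit variances, correlation rho."""
    det = 1.0 - rho ** 2
    q = (y * y - 2.0 * rho * x * y + x * x) / det
    return math.exp(-q / 2.0) / (2.0 * math.pi * math.sqrt(det))

def cond_exp(x, lo=-10.0, hi=10.0, steps=4000):
    """E(Y | X = x) as (integral of y*f(y,x) dy) / (integral of f(y,x) dy)."""
    h = (hi - lo) / steps
    num = den = 0.0
    for i in range(steps):
        y = lo + (i + 0.5) * h
        num += y * f(y, x) * h
        den += f(y, x) * h
    return num / den

approx = cond_exp(1.5)
print(approx)   # should be close to rho * x = 0.9
```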

Problems

1. Let X take on only integer values. Show that P(A | X = x) as defined in 4.7 is any 𝔅1-measurable function φ(x) satisfying

φ(j)P(X = j) = P(A ∩ {X = j}), j = 0, ±1, ±2, ....

Conclude that if P(X = j) > 0, then any version of the above conditional probability satisfies

φ(j) = P(A ∩ {X = j})/P(X = j).

2. Prove that if X, Y have a joint density, then for any B ∈ 𝔅1, one version of P(Y ∈ B | X = x) is given by

∫B f(y, x) dy / f(x),

defined arbitrarily where f(x) = 0.

3. If (Y, X) have a joint density f(y, x) and E|Y| < ∞, show that one version of E(Y | X = x) is given by

∫ y f(y, x) dy / f(x).

4. Let X1, X2 take values in {1, 2, ..., N}. If ℱ(X1) = ℱ(X2), then prove there is a permutation {i1, i2, ..., iN} of {1, 2, ..., N} such that

5. Let P(Aj) > 0, j = 1, ..., N. Prove that

Show that one version of P(A | X1) is

Page 90: Probability

4.2 A MORE GENERAL CONDITIONAL EXPECTATION 73

Let X2(z) = z⁴; find a version of P(A | X2). Find versions of P(A | X1 = x),

6. Given the situation of Problem 5, and with Y a random variable such that E|Y(z)| < ∞, show that

Find a version of

7. If X and Y are independent, show that for any B ∈ 𝔅1, P(Y ∈ B | X) = P(Y ∈ B) a.s. For E|Y| < ∞, show that E(Y | X) = EY a.s.

8. Give an example to show that E(Y | X) = EY a.s. does not imply that X and Y are independent.

9. (Borel paradox). Take Ω to be the unit sphere S(2) in E(3), ℱ the Borel subsets of Ω, P(·) the extension of surface area. Choose two opposing points on S(2) as the poles and fix a reference half-plane passing through them. For any point p, define its longitude ψ(p) as the angle between −π and π that the half-plane of the great semi-circle through p makes with the reference half-plane. Define its latitude θ(p) as the angle that the radius to p makes with the equatorial plane, −π/2 ≤ θ(p) ≤ π/2. Prove that the conditional probability of ψ given θ is uniformly distributed over [−π, π) but that the conditional probability of θ given ψ is not uniformly distributed over (−π/2, π/2]. (See Kolmogorov [98, p. 50].)

2. A MORE GENERAL CONDITIONAL EXPECTATION

Section 1 pointed out that E(Y | X) or P(A | X) depended only on ℱ(X). The point was that the relevant information contained in knowing X(ω) is the information regarding the location of ω. Let (Ω, ℱ, P) be a probability space, Y a random variable, E|Y| < ∞, 𝒟 any σ-field, 𝒟 ⊂ ℱ.

Definition 4.16. The conditional expectation E(Y | 𝒟) is any random variable measurable (Ω, 𝒟) such that

∫D E(Y | 𝒟) dP = ∫D Y dP, all D ∈ 𝒟.

As before, any two versions differ on a set of probability zero. If 𝒟 = ℱ(X), this coincides with E(Y | X) as defined in 4.13.

If X is a random vector to (R, 𝔅), then for x ∈ R,

Definition 4.18. E(Y | X = x) is any random variable on (R, 𝔅), where P̂(B) = P(X ∈ B), satisfying

∫B E(Y | X = x) P̂(dx) = ∫{X∈B} Y dP, all B ∈ 𝔅.


The importance of this is mostly computational. By inspection verify that

(4.19) E(Y | X) = φ(X) a.s., where φ(x) = E(Y | X = x).

Usually, E(Y | X = x) is easier to compute, when densities exist, for example. Then (4.19) gets us E(Y | X).

Proposition 4.20. A list of properties of E(Y | 𝒟):

1) E(aY1 + bY2 | 𝒟) = aE(Y1 | 𝒟) + bE(Y2 | 𝒟) a.s.

2) Y1 ≥ Y2 ⟹ E(Y1 | 𝒟) ≥ E(Y2 | 𝒟) a.s.

3) |E(Y | 𝒟)| ≤ E(|Y| | 𝒟) a.s.

4) ℱ(Y) independent of 𝒟, E|Y| < ∞ ⟹ E(Y | 𝒟) = EY a.s.

Proofs. These proofs follow pretty trivially from the definition 4.16. To improve technique I'll briefly go through them; the idea in all cases is to show that the integrals of both sides of (1), (2), (3), (4) over 𝒟 sets are the same (in (2) and (3), with the corresponding inequality). Let D ∈ 𝒟. Then by 4.16,

∫D E(aY1 + bY2 | 𝒟) dP = ∫D (aY1 + bY2) dP = a∫D Y1 dP + b∫D Y2 dP = ∫D [aE(Y1 | 𝒟) + bE(Y2 | 𝒟)] dP,

so (1) holds since D ∈ 𝒟 is arbitrary. For (2),

∫D E(Y1 | 𝒟) dP = ∫D Y1 dP ≥ ∫D Y2 dP = ∫D E(Y2 | 𝒟) dP, all D ∈ 𝒟,

which forces E(Y1 | 𝒟) ≥ E(Y2 | 𝒟) a.s. For (4), write

∫D Y dP = E(χD Y) = P(D)EY

by independence, so the constant EY satisfies the defining relation 4.16. For (3), note that by 4.16 and (2), ±E(Y | 𝒟) = E(±Y | 𝒟) ≤ E(|Y| | 𝒟) a.s.


An important property of the conditional expectation is, if E|Y| < ∞,

(4.21) E[E(Y | 𝒟)] = EY.

This follows quickly from the definitions: take D = Ω in 4.16.

Let Y = χA(ω); then the general definition of conditional probability is

Definition 4.22. Let 𝒟 be a sub-σ-field of ℱ. The conditional probability of A ∈ ℱ given 𝒟 is a random variable P(A | 𝒟) on (Ω, 𝒟) satisfying

∫D P(A | 𝒟) dP = P(A ∩ D), all D ∈ 𝒟.

By the properties in 4.20, a conditional probability acts almost like a probability, that is,

(4.23) 0 ≤ P(A | 𝒟) ≤ 1 a.s., P(Ω | 𝒟) = 1 a.s.

It is also σ-additive almost surely. This follows from

Proposition 4.24. Let Yn ≥ 0 be random variables such that Yn ↑ Y a.s. and E|Y| < ∞. Then E(Yn | 𝒟) → E(Y | 𝒟) a.s.

Proof. Let Zn = Y − Yn, so Zn ↓ 0 a.s., and EZn ↓ 0. By 4.20(2), Zn ≥ Zn+1 ⟹ E(Zn | 𝒟) ≥ E(Zn+1 | 𝒟) a.s. Therefore the sequence E(Zn | 𝒟) converges monotonically downward a.s. to a random variable U ≥ 0. By the monotone convergence theorem,

EU = limn E[E(Zn | 𝒟)].

Equation (4.21) now gives EU = limn EZn = 0. Thus U = 0 a.s.

Let Ak ∈ ℱ, {Ak} disjoint, and take

Yn = Σk=1^n χAk, Y = χ∪kAk

in the above proposition to get from (4.23) to

(4.25) P(∪k Ak | 𝒟) = Σk P(Ak | 𝒟) a.s.

For A fixed, P(A | 𝒟) is an equivalence class of functions f(A, ω). It seems reasonable to hope from (4.25) that from each equivalence class a function f*(A, ω) could be selected such that the resulting function P*(A | 𝒟) on ℱ × Ω would be a probability on ℱ for every ω. If this can be done, then the entire business of defining E(Y | 𝒟) would be unnecessary because it


could be defined as

E(Y | 𝒟) = ∫ Y(ω′) P*(dω′ | 𝒟).

Unfortunately, in general it is not possible to do this. What can be done is a question which is formulated and partially answered in the next section.

Problems

10. If 𝒟 has the property that D ∈ 𝒟 ⟹ P(D) = 0 or 1, then show E(Y | 𝒟) = EY a.s. (if E|Y| < ∞).

11. Let Y be a random variable on (Ω, ℱ), EY² < ∞. For any random variable X on (Ω, ℱ), let d²(Y, X) = E|Y − X|².

a) Prove that among all random variables X on (Ω, 𝒟), 𝒟 ⊂ ℱ, there is an a.s. unique random variable Y0 which minimizes d(Y, X). This random variable Y0 is called the best predictor of Y based on 𝒟.

b) Prove that Y0 = E(Y | 𝒟) a.s.

12. Let X0, X1, X2, ..., Xn be random variables having a joint normal distribution, EXk = 0, Γij = E(XiXj). Show that E(X0 | X1, X2, ..., Xn) = Σj=1^n λjXj a.s., and give the equations determining the λj in terms of the Γij. (See Chapter 9 for definitions.)
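The best-predictor property of Problem 11 shows up plainly in simulation. The sketch below uses an assumed model (for illustration only): Y = X² + noise, so that E(Y | X) = X², and its mean square error beats other ℱ(X)-measurable candidates such as linear functions of X or the constant EY.

```python
# Empirical mean square errors: the conditional expectation X^2 should
# beat competing predictors of Y = X^2 + noise that are functions of X.
import random

random.seed(11)
data = []
for _ in range(50000):
    x = random.gauss(0.0, 1.0)
    y = x * x + random.gauss(0.0, 1.0)   # E(Y | X) = X^2
    data.append((x, y))

def mse(predictor):
    return sum((y - predictor(x)) ** 2 for x, y in data) / len(data)

err_best = mse(lambda x: x * x)          # the conditional expectation
err_linear = mse(lambda x: 1.0 + x)      # an arbitrary competitor
err_const = mse(lambda x: 1.0)           # the constant EY
print(err_best, err_linear, err_const)
```

The best predictor's error is near the noise variance 1, while the exact errors of the competitors here work out to 4 and 3 respectively.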

13. Let X be a random vector taking values in (R, 𝔅), and φ(x) a random variable on (R, 𝔅). Let Y be a random variable on (Ω, ℱ) and E|Y| < ∞, E|φ(X)| < ∞, E|Yφ(X)| < ∞. Prove that

E(Yφ(X) | X) = φ(X)E(Y | X) a.s.

[This result concurs with the idea that if we know X, then given X, φ(X) should be treated as a constant. To work Problem 13, a word to the wise: Start by assuming φ(X) ≥ 0, Y ≥ 0, consider the class of random variables φ for which it is true, and apply 2.38.]

14. Let Y be a random variable on (Ω, ℱ, P) such that E|Y| < ∞ and X1, X2 random vectors such that ℱ(Y, X1) is independent of ℱ(X2); then prove that

E(Y | X1, X2) = E(Y | X1) a.s.

15. Let X1, X2, ... be independent, identically distributed random variables, E|X1| < ∞, and denote Sn = X1 + · · · + Xn. Prove that

E(X1 | Sn) = Sn/n a.s.

[Use symmetry in the final step.]


3. REGULAR CONDITIONAL PROBABILITIES AND DISTRIBUTIONS

Definition 4.27. P*(A | 𝒟) will be called a regular conditional probability on ℱ1 ⊂ ℱ, given 𝒟, if

a) for A ∈ ℱ1 fixed, P*(A | 𝒟) is a version of P(A | 𝒟);

b) for any ω held fixed, P*(A | 𝒟) is a probability on ℱ1.

If a regular conditional probability, given 𝒟, exists, all the conditional expectations can be defined through it.

Proposition 4.28. If P*(A | 𝒟) is a regular conditional probability on ℱ1 and if Y is a random variable on (Ω, ℱ1), E|Y| < ∞, then

E(Y | 𝒟) = ∫ Y(ω′) P*(dω′ | 𝒟) a.s.

Proof. Consider all nonnegative random variables on (Ω, ℱ1) for which 4.28 holds. For the random variable χA, A ∈ ℱ1, 4.28 becomes

E(χA | 𝒟) = P*(A | 𝒟) a.s.,

which holds by 4.27. Hence 4.28 is true for simple functions. Now for Y ≥ 0, EY < ∞, take Yn simple ↑ Y. Then by 4.24 and the monotone convergence theorem,

E(Y | 𝒟) = limn E(Yn | 𝒟) = limn ∫ Yn(ω′) P*(dω′ | 𝒟) = ∫ Y(ω′) P*(dω′ | 𝒟) a.s.

Unfortunately, a regular conditional probability on ℱ1 given 𝒟 does not exist in general (see the Chapter notes). The difficulty is this: by (4.25), for An ∈ ℱ1 disjoint, there is a set of probability zero off which

P(∪n An | 𝒟) = Σn P(An | 𝒟).

If ℱ1 contains enough countable collections of disjoint sets, then the exceptional sets may pile up.

By doing something which is like passing over to representation space we can get rid of this difficulty. Let Y be a random vector taking values in a space (R, 𝔅).

Definition 4.29. P̃(B | 𝒟), defined for B ∈ 𝔅 and ω ∈ Ω, is called a regular conditional distribution for Y given 𝒟 if

i) for B ∈ 𝔅 fixed, P̃(B | 𝒟) is a version of P(Y ∈ B | 𝒟);

ii) for any ω ∈ Ω fixed, P̃(B | 𝒟) is a probability on 𝔅.

If Y is a random variable, then by using the structure of R(1) in an essential way we prove the following theorem.


Theorem 4.30. There always exists a regular conditional distribution for a random variable Y given 𝒟.

Proof. The general proof we break up into steps.

Definition 4.31. F(x | 𝒟) on R(1) × Ω is a conditional distribution function for Y given 𝒟 if

i) F(x | 𝒟) is a version of P(Y < x | 𝒟) for every x;

ii) for every ω, F(x | 𝒟) is a distribution function.

Proposition 4.32. There always exists a conditional distribution function for Y given 𝒟.

Proof. Let R = {rj} be the rationals. Select versions of P(Y < rj | 𝒟) and define

M = ∪ri<rj {P(Y < ri | 𝒟) > P(Y < rj | 𝒟)}.

So M is the set on which monotonicity is violated. By 4.20(2), P(M) = 0. Define

Nj = {lim ri↑rj P(Y < ri | 𝒟) ≠ P(Y < rj | 𝒟)}, N = ∪j Nj.

Since χ(−∞,ri)(Y) ↑ χ(−∞,rj)(Y), 4.24 implies P(Nj) = 0, or P(N) = 0. Finally take rj ↑ ∞ or rj ↓ −∞ and observe that the set L on which lim P(Y < rj | 𝒟) fails to equal one or zero, respectively, has probability zero. Thus, for ω in the complement of M ∪ N ∪ L, P(Y < x | 𝒟) for x ∈ R is monotone, left-continuous, zero and one at −∞, +∞, respectively. Take G(x) an arbitrary distribution function and define:

F(x | 𝒟) = lim rj↑x P(Y < rj | 𝒟) for ω ∉ M ∪ N ∪ L, F(x | 𝒟) = G(x) for ω ∈ M ∪ N ∪ L.

It is a routine job to verify that F(x | 𝒟) defined this way is a distribution function for every ω. Use 4.24 again and χ(−∞,rj)(Y) ↑ χ(−∞,x)(Y) to check that F(x | 𝒟) is a version of P(Y < x | 𝒟).

Back to Theorem 4.30. For Y a random variable, define P(· | 𝒟), for each ω, as the probability on (R^(1), ℬ₁) extended from the distribution function F(x | 𝒟). Let 𝒞 be the class of all sets C ∈ ℬ₁ such that P(C | 𝒟) is a version of P(Y ∈ C | 𝒟). By 4.23, 𝒞 contains all finite disjoint unions of left-closed right-open intervals. By 4.24, 𝒞 is closed under monotone limits. Hence 𝒞 = ℬ₁. Therefore, P(· | 𝒟) is the required conditional distribution.


4.3 CONDITIONAL PROBABILITIES AND DISTRIBUTIONS 79

For random vectors Y taking values in (R, ℬ), this result can be extended if (R, ℬ) looks enough like (R^(1), ℬ₁).

Definition 4.33. Call (R, ℬ) a Borel space if there is an E ∈ ℬ₁ and a one-to-one mapping φ : R ↔ E such that φ is ℬ-measurable and φ⁻¹ is ℬ₁-measurable.

Borel spaces include almost all the useful probability spaces. For example, see the proof in Appendix A that (R^(∞), ℬ_∞) is a Borel space. So, more easily, is (R^(n), ℬ_n), n ≥ 1.

Theorem 4.34. If Y takes values in a Borel space (R, ℬ), then there is a regular conditional distribution for Y given 𝒟.

Proof. By definition, there is a one-to-one mapping φ : R ↔ E ∈ ℬ₁ with φ, φ⁻¹ measurable ℬ, ℬ₁ respectively. Take Ỹ = φ(Y); Ỹ is a random variable, so there is a regular conditional distribution P₀(A | 𝒟) = P(Ỹ ∈ A | 𝒟) a.s. Define P(B | 𝒟), for B ∈ ℬ, by

Because φ(B) is the inverse set mapping of the measurable mapping φ⁻¹(x), P(· | 𝒟) is a probability on ℬ for each ω, and is also a version of P(Y ∈ B | 𝒟) for every B ∈ ℬ.

Since the distribution of processes is determined on their range space, aregular conditional distribution will suit us just as well as a regular conditionalprobability. For instance,

Proposition 4.35. Let Y be a random vector taking values in (R, ℬ), y a point in R, and φ any ℬ-measurable function such that E|φ(Y)| < ∞. Then if P(· | 𝒟) is a regular conditional distribution for Y given 𝒟,

Proof. Same as 4.28 above.

If Y, X are two random vectors taking values in (R, ℬ), (S, 𝒮) respectively, then define a regular conditional distribution for Y given X = x, P(B | X = x), in the analogous way: for each x ∈ S, it is a probability on ℬ, and for B fixed it is a version of P(B | X = x). Evidently the results 4.34 and 4.35 hold concerning P(B | X = x). Some further useful results are:

Proposition 4.36. Let φ(x, y) be a random variable on the product space (R, ℬ) × (S, 𝒮), E|φ(X, Y)| < ∞. If P(· | X = x) is a regular conditional distribution for Y given X = x, then,


[Note: The point of 4.36 is that x is held constant in the integration occurring on the right-hand side of (4.37). Since φ(x, y) is jointly measurable in (x, y), for fixed x it is a measurable function of y.]

Proof. Let φ(x, y) = χ_C(x) χ_D(y), C, D measurable sets; then, by Problem 13(b), a.s. P,

On the right in 4.37 is

which, by definition of P, verifies (4.37) for this φ. Now, to finish the proof, just approximate in the usual way; that is,

(4.37) is now true for all φ(x, y) of the form Σ_i χ_{C_i}(x) χ_{D_i}(y), C_i, D_i measurable sets; now apply the usual monotonicity argument.

One can see here the usefulness of a regular conditional distribution. It is tempting to replace the right-hand side of (4.37) by E(φ(x, Y) | X = x). But this object cannot be defined through the standard definition of conditional expectation (4.18).

A useful corollary is

Corollary 4.38. Let X, Y be independent random vectors. For φ(x, y) as in 4.36,

Proof. A regular conditional distribution for Y given X = x is P(Y ∈ B). Apply this in 4.36.

Problems

16. Let I be any interval in R^(1). A function φ(x) on I, measurable ℬ₁(I), is called convex if for all t ∈ [0, 1] and x ∈ I, y ∈ I,

φ(tx + (1 − t)y) ≤ tφ(x) + (1 − t)φ(y).

Prove that if Y is a random variable with range in I, and E|Y| < ∞, then for φ(x) convex on I, and E|φ(Y)| < ∞,

a) φ(EY) ≤ E[φ(Y)] (Jensen's inequality);
b) φ(E(Y | 𝒟)) ≤ E(φ(Y) | 𝒟) a.s.

[On (b) use the existence of a regular conditional distribution.]
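Part (a) is easy to sanity-check numerically: for the convex function φ(x) = x², the sample analogue of φ(EY) ≤ E[φ(Y)] is mean(Y)² ≤ mean(Y²), and the gap is exactly the sample variance. A small illustrative sketch (the uniform sample and the choice of φ are ours, not the text's):

```python
import random

random.seed(0)

# Draw a sample from Y ~ Uniform(0, 1); any integrable Y would do.
sample = [random.random() for _ in range(10_000)]

def mean(xs):
    return sum(xs) / len(xs)

phi = lambda x: x * x                  # a convex function

lhs = phi(mean(sample))                # phi(E Y), empirically
rhs = mean([phi(y) for y in sample])   # E[phi(Y)], empirically

# Jensen's inequality: phi(EY) <= E[phi(Y)].  For phi(x) = x^2
# the difference rhs - lhs is the sample variance, hence >= 0.
assert lhs <= rhs
print(lhs, rhs)
```

The assertion can never fail, whichever sample is drawn — Jensen's inequality in miniature.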


17. Let X, Y be independent random vectors.

a) Show that a regular conditional probability for Y given X = x is

b) If φ(x, y) is a random variable on the product of the range spaces, show that a regular conditional distribution for Z = φ(X, Y) given X = x is P(φ(x, Y) ∈ B).

NOTES

The modern definition of conditional probabilities and expectations, using the Radon-Nikodym theorem, was formulated by Kolmogorov in his monograph of 1933 [98].

The difficulty in getting a regular conditional probability on ℱ(X), given any σ-field 𝒟, is similar to the extension problem. Once P(· | 𝒟) is gotten on the range space (R, ℬ) of X, if for any set B ∈ ℬ contained in the complement of the range of X, P(B | 𝒟) = 0, then by 2.12 we can get a regular conditional probability P*(· | 𝒟) on ℱ(X) given 𝒟. In this case it is sufficient that the range of X be a set in ℬ. Blackwell's article [8] also deals with this problem.

The counterexample referred to in the text is this: Let Ω = [0, 1], 𝒟 = ℬ([0, 1]). Take C to be a nonmeasurable set of outer Lebesgue measure one and inner Lebesgue measure zero. The smallest σ-field ℱ containing 𝒟 and C consists of all sets of the form A = (C ∩ B₁) ∪ (Cᶜ ∩ B₂), where B₁, B₂ are in 𝒟. Define P on ℱ by: If A has the above form,

Because the B₁, B₂ in the definition of A are not unique, it is necessary to check that P is well defined, as well as a probability. There does not exist any regular conditional probability on ℱ given 𝒟. This example can be found in Doob [39, p. 624].


CHAPTER 5

MARTINGALES

1. GAMBLING AND GAMBLING SYSTEMS

Since probability theory started from a desire to win at gambling, it is onlysporting to discuss some examples from this area.

Example 1. Let Z₁, Z₂, . . . be independent, Z_j = +1, −1 with probability p, 1 − p. Suppose that at the nth toss we bet the amount b. Then we receive the amount b if Z_n = 1; otherwise we lose the amount b. A gambling strategy is a rule which tells us how much to bet on the (n + 1)st toss. To be interesting, the rule will depend on the first n outcomes. In general, then, a gambling strategy is a sequence of functions b_n : {−1, +1}^(n) → [0, ∞) such that b_n(Z₁, . . . , Z_n) is the amount we bet on the (n + 1)st game. If we start with initial fortune S₀, then our fortune after n plays, S_n, is a random variable defined by

We may wish to further restrict the b_n by insisting that b_n(Z₁, . . . , Z_n) ≤ S_n. Define the time of ruin n* as the first n such that S_n = 0. One question that is certainly interesting is: What is P(n* < ∞)? Equivalently, what is the probability of eventual ruin?
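The omitted display (5.1) defines the fortune recursively; reading it as S_n = S_{n−1} + b_{n−1}(Z₁, . . . , Z_{n−1})·Z_n (our assumption about the missing formula), the claim of Proposition 5.2 below — the expected fortune cannot grow when p ≤ ½ — can be checked by simulation. The constant-bet strategy here is purely illustrative:

```python
import random

random.seed(1)

def play(s0, n_rounds, p, bet_rule):
    """Simulate S_n = S_{n-1} + b_{n-1} * Z_n with the bet capped at S_{n-1}."""
    s = s0
    for _ in range(n_rounds):
        b = min(bet_rule(s), s)              # cannot bet more than we have
        z = 1 if random.random() < p else -1
        s += b * z
    return s

S0, N, TRIALS = 10.0, 20, 20_000
# Fair game (p = 1/2): the fortune is a MG, so E S_n = S_0.
fair = sum(play(S0, N, 0.5, lambda s: 1.0) for _ in range(TRIALS)) / TRIALS
# Unfavorable game (p < 1/2): -S_n is a SMG, so E S_n < S_0.
unfav = sum(play(S0, N, 0.45, lambda s: 1.0) for _ in range(TRIALS)) / TRIALS
print(fair, unfav)
```

With p = ½ the sample mean hovers at the initial fortune; with p < ½ it drifts strictly below it.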

There is one property of the sequence of fortunes given by (5.1) that Iwant to focus on.

Proposition 5.2. For p ≤ ½,

Proof

The last step follows because S_n is a function of Z₁, . . . , Z_n. If p = ½, then EZ_{n+1} = 0; otherwise EZ_{n+1} < 0, but b_n is nonnegative. Thus, for p ≤ ½, E(S_{n+1} | Z₁, . . . , Z_n) ≤ S_n a.s. Note that ℱ(S₁, . . . , S_n) ⊂ ℱ(Z₁, . . . , Z_n). Taking conditional expectations of both sides above with respect to S₁, . . . , S_n concludes the proof.

Example 2. To generalize Example 1, consider any process Z₁, Z₂, . . . With any system of gambling on the successive outcomes of Z₁, Z₂, . . . , the player's fortune after n plays, S_n, is a function of Z₁, . . . , Z_n. We assume that corresponding to any gambling system is a sequence of real-valued functions φ_n measurable ℬ_n such that the fortune S_n is given by

Definition 5.3. The sequence of games Z₁, Z₂, . . . under the given gambling system is called fair if

E(S_{n+1} | Z₁, . . . , Z_n) = S_n a.s.,

and unfavorable if

E(S_{n+1} | Z₁, . . . , Z_n) ≤ S_n a.s.

The first function of probability in gambling was involved in computation of odds. We wish to address ourselves to more difficult questions such as: Is there any system of gambling in fair or unfavorable games that will yield S_n → ∞ a.s.? When can we assert that P(n* < ∞) = 1 for a large class of gambling systems? This class of problems is a natural gateway to a general study of processes behaving like the sequences S_n.

2. DEFINITIONS OF MARTINGALES AND SUBMARTINGALES

Definition 5.4. Given a process X₁, X₂, . . . , it will be said to form a martingale (MG) if E|X_k| < ∞, k = 1, 2, . . . , and

E(X_{k+1} | X₁, . . . , X_k) = X_k a.s., k = 1, 2, . . .

It forms a submartingale (SMG) if E|X_k| < ∞, k = 1, 2, . . . , and

E(X_{k+1} | X₁, . . . , X_k) ≥ X_k a.s., k = 1, 2, . . .

Example 3. Sums of independent random variables. Let Y₁, Y₂, . . . be a sequence of independent random variables such that E|Y_k| < ∞, all k, and EY_k = 0, k = 1, 2, . . . Define X_n = Y₁ + · · · + Y_n; then E|X_n| ≤ Σ₁ⁿ E|Y_k| < ∞. Furthermore, ℱ(X₁, . . . , X_n) = ℱ(Y₁, . . . , Y_n), so

As mentioned in connection with fortunes, if the appropriate inequalities hold with respect to larger σ-fields, then the same inequalities hold for the process, that is,

Proposition 5.7. Let X₁, X₂, . . . be a process with E|X_k| < ∞, k = 1, 2, . . . Let Y₁, Y₂, . . . be another process such that ℱ(X₁, . . . , X_n) ⊂ ℱ(Y₁, . . . , Y_n), n = 1, 2, . . . If

then X₁, X₂, . . . is a SMG (MG).

Proof. Follows from

[See 4.20 (3).]

For any m > n, if X₁, X₂, . . . is a SMG (MG), then

because, for example,

This remark gives us an equivalent way of defining SMG or MG.

Proposition 5.8. Let X₁, X₂, . . . be a process with E|X_k| < ∞, all k. Then it is a SMG (MG) iff for every m > n and A ∈ ℱ(X₁, . . . , X_n),

Proof. By definition,

Problem 1. Let X₁, X₂, . . . be a MG or SMG; let I be an interval containing the range of X_n, n = 1, 2, . . . , and let φ(x) be convex on I. Prove

a) If E|φ(X_n)| < ∞, n = 1, 2, . . . , and X₁, X₂, . . . is a MG, then X′_n = φ(X_n) is a SMG.

b) If E|φ(X_n)| < ∞, n = 1, 2, . . . , φ(x) also nondecreasing on I, and X₁, X₂, . . . a SMG, then X′_n = φ(X_n) is a SMG.

3. THE OPTIONAL SAMPLING THEOREM

One of the most powerful theorems concerning martingales is built around the idea of transforming a process by optional sampling. Roughly, the idea is this: starting with a given process X₁, X₂, . . . , decide on the basis of observing X₁, . . . , X_n whether or not X_n will be the first value of the transformed process; then keep observing until on the basis of your observations a second value of the transformed process is chosen, and so forth. More precisely,

Definition 5.9. Let X₁, X₂, . . . be an arbitrary process. A sequence m₁, m₂, . . . of integer-valued random variables will be called sampling variables if

Then the sequence of random variables X̃_n defined by X̃_n = X_{m_n} is called the process derived by optional sampling from the original process.

Theorem 5.10. Let X₁, X₂, . . . be a SMG (MG), m₁, m₂, . . . sampling variables, and X̃_n the optional sampling process derived from X₁, X₂, . . . If

then the X̃₁, X̃₂, . . . process is a SMG (MG).

Proof. Let A ∈ ℱ(X̃₁, . . . , X̃_n). We must show that

Let D_j = A ∩ {m_n = j}. Then A = ⋃_j D_j and it suffices to show that

We assert that D_j ∈ ℱ(X₁, . . . , X_j), since, letting B ∈ ℬ_n, then

Evidently, each set in this union is in ℱ(X₁, . . . , X_j). Of course,

Now, for arbitrary N > j,

The first two terms on the right in the above telescope. Starting with the lastterm of the sum,

Hence the first two terms reduce to being greater than or equal to

But D_j ⊂ {m_{n+1} > j}, so we conclude that

Letting N -+ oo through an appropriate subsequence, by (b) we can force

But {m_{n+1} > N} ↓ ∅, and the bounded convergence theorem gives

since E|X̃_{n+1}| < ∞ by (a). Q.E.D.

Corollary 5.11. Assume the conditions of Theorem 5.10 are in force, and that in addition, lim E|X_n| < ∞. Then


Proof. Without loss of generality, take m₁ = 1. By 5.10,

By condition (b), we may take a subsequence of N such that

Using this subsequence, the conclusion EX̃_n ≤ lim EX_n follows. For part (2), note that if X_n is a SMG, so is X_n⁺ = max(0, X_n). Thus by 5.10 so is X̃_n⁺. Applying (1) we have

For the original process,

Proposition 5.12. Under hypothesis (b) of Theorem 5.10, if lim E|X_n| < ∞, then sup_n E|X̃_n| < ∞, so that (a) holds and the theorem is in force.

Proof. We introduce truncated sampling variables : for M > 0,

and X̃_{n,M} = X_{m_{n,M}}. The reason for this is

so that the conditions of Theorem 5.10 are in force for X̃_{n,M}. By Corollary 5.11, E|X̃_{n,M}| ≤ a < ∞, where a does not depend on n or M. Note now that lim_{M→∞} m_{n,M} = m_n; hence lim_{M→∞} X̃_{n,M} = X̃_n. By the Fatou lemma (Appendix A.27),

or


One nice application of optional sampling is

Proposition 5.13. If X₁, X₂, . . . is a SMG, then for any x > 0

Proof. 1) Define sampling variables by

The conditions of 5.10 are satisfied, since {m_n > N} = ∅ for N ≥ k, and E|X̃_n| ≤ Σ₁ᵏ E|X_j| < ∞. Now

By the optional sampling theorem,

2) To go below, take x > 0 and define

The conditions of 5.10 are again satisfied, so

But

Therefore, since

part (2) follows.
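The displays of 5.13 are missing from this copy; in the standard form the first part reads x · P(max_{1≤k≤n} X_k ≥ x) ≤ E X_n⁺ for a SMG X₁, . . . , X_n (our statement of the inequality). An empirical sanity check on the symmetric random walk, which is a MG and hence a SMG:

```python
import random

random.seed(2)

N_STEPS, TRIALS, X_LEVEL = 30, 50_000, 5

hits = 0
sum_pos_part = 0.0
for _ in range(TRIALS):
    s, running_max = 0, 0
    for _ in range(N_STEPS):
        s += 1 if random.random() < 0.5 else -1
        running_max = max(running_max, s)
    hits += running_max >= X_LEVEL
    sum_pos_part += max(s, 0)        # X_n^+ for the final value

lhs = X_LEVEL * hits / TRIALS        # x * P(max_k X_k >= x), empirically
rhs = sum_pos_part / TRIALS          # E X_n^+, empirically
assert lhs <= rhs                    # the maximal inequality
print(lhs, rhs)
```

Note that the bound is far from vacuous here: the two sides differ by only a modest margin.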


Problem 2. Use 5.13 and Problem 1 to prove Kolmogorov's extension of the Chebyshev inequality (see the notes to Chapter 3). That is, if X₁, X₂, . . . is a MG and EX_n² < ∞, then show

P(max_{1≤k≤n} |X_k| ≥ ε) ≤ EX_n²/ε².

4. THE MARTINGALE CONVERGENCE THEOREM

One of the outstanding strong convergence theorems is

Theorem 5.14. Let X₁, X₂, . . . be a SMG such that lim E|X_n| < ∞. Then there exists a random variable X such that X_n → X a.s. and E|X| < ∞.
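A concrete instance (our illustration, not the book's): in a Pólya urn the proportion of red balls X_n is a martingale with values in [0, 1], so lim E|X_n| ≤ 1 < ∞ and Theorem 5.14 gives X_n → X a.s. Simulation shows the late fluctuations dying out:

```python
import random

random.seed(3)

def urn_path(n_draws):
    """Polya urn: start with 1 red, 1 black; the drawn color is replaced twice."""
    red, total = 1, 2
    fractions = []
    for _ in range(n_draws):
        if random.random() < red / total:
            red += 1
        total += 1
        fractions.append(red / total)
    return fractions

# X_n = fraction of red balls satisfies E(X_{n+1} | past) = X_n, a MG.
# Average |X_N - X_{N/2}| over many paths; it is tiny for large N,
# reflecting a.s. convergence of each path.
TRIALS, N = 200, 10_000
gap = sum(abs(p[N - 1] - p[N // 2 - 1]) for p in
          (urn_path(N) for _ in range(TRIALS))) / TRIALS
assert gap < 0.05
print(gap)
```

Each path settles to its own random limit (for this urn the limit is uniformly distributed), exactly the behavior the theorem guarantees.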

Proof. To prove this, we are going to define sampling variables. Let b > a be any two real numbers. Let

In general,

That is, the m*_n are the successive times that X_n drops below a or rises above b. Now, what we would like to conclude for any b > a is that

This is equivalent to proving that if

that P(S) = 0. Suppose we defined X̃_n = X_{m*_n}, and

On S, X̃_{2n+1} − X̃_{2n} ≤ −(b − a), so Z = −∞; but if X_n is a SMG, E(X̃_{2n+1} − X̃_{2n}) ≥ 0, so EZ ≥ 0, giving a contradiction which would be resolved only by P(S) = 0. To make this idea firm, define

Then the optional sampling theorem is in force, and X̃_{n,M} is a SMG. Define

noting that X̃_{2n+1,M} − X̃_{2n,M} = 0 for m*_n > M, which certainly holds for 2n > M.

Let β_M be the largest n such that m*_n ≤ M. On the set {β_M = k},

The last term above can be positive only if m*_{2k+1} ≤ M, in which case it becomes (X_M − a). In any case, on {β_M = k}

or, in general,

On the other hand, since X̃_{n,M} is a SMG, EZ_M ≥ 0. Taking expectations of (5.15) gives

Now β_M is nondecreasing, and β_M ↑ ∞ on S. This contradicts (5.16), since lim E(X_M − a)⁺ ≤ lim E|X_M| + a < ∞, unless P(S) = 0. This establishes

where the union is taken over all rational a, b. Then either a random variable X exists such that X_n → X a.s. or |X_n| → ∞ with positive probability. This latter case is eliminated by Fatou's lemma; that is,

From this we also conclude that

In the body of the above proof, a result is obtained which will be useful in the future.


Lemma 5.17. Let X₁, . . . , X_M be random variables such that E|X_n| < ∞, n = 1, . . . , M, and E(X_{n+1} | X_n, . . . , X₁) ≥ X_n, n = 1, . . . , M − 1. Let β_M be the number of times that the sequence X₁, . . . , X_M crosses a finite interval [a, b] from left to right. Then,

Problem 3. Now use the martingale convergence theorem to prove that for X₁, X₂, . . . independent random variables, EX_k = 0, k = 1, 2, . . . ,

5. FURTHER MARTINGALE THEOREMS

To go further and apply the basic convergence theorem, we need:

Definition 5.18. Let X₁, X₂, . . . be a process. It is said to be uniformly integrable if E|X_k| < ∞, all k, and

sup_k ∫_{{|X_k| ≥ x}} |X_k| dP → 0 as x → ∞.

Proposition 5.19. Let X₁, X₂, . . . be uniformly integrable; then lim E|X_n| < ∞. If X_n → X a.s., then E|X − X_n| → 0 and EX_n → EX.

Proof. First of all, for any x > 0,

Hence,

But the last term must be finite for some value of x sufficiently large. Forthe next item, use Fatou's lemma:

Now, by Egoroff's theorem (Halmos [64, p. 88]), for any ε > 0, we can take A ∈ ℱ such that P(A) < ε and V_n = |X_n − X| → 0 uniformly on Aᶜ.


But bounds can be gotten by writing

Taking ε ↓ 0, we get

since (5.20) holds for |X| as well as |X_n|. Now letting x ↑ ∞ gives the result. For the last result, use |EX − EX_n| ≤ E|X − X_n|.

We use this concept of uniform integrability in the proof of:

Theorem 5.21. Let Z, Y₁, Y₂, . . . be random variables on (Ω, ℱ, P) such that E|Z| < ∞. Then

E(Z | Y₁, . . . , Y_n) ⟶ E(Z | Y₁, Y₂, . . .).

(Here ⟶ indicates both a.s. and first mean convergence.)

Proof. Let X_n = E(Z | Y₁, . . . , Y_n). Then E|X_n| ≤ E|Z|. By 4.20(3)

Since

the sequence X₁, X₂, . . . is a MG. Since lim E|X_n| < ∞,

By convexity, |X_n| ≤ E(|Z| | Y₁, . . . , Y_n). Hence

Now,

is a.s. finite; hence as x ↑ ∞, the sets {U > x} converge down to a set of probability zero. Thus the X_n sequence is uniformly integrable and X_n → X. Let A ∈ ℱ(Y₁, . . . , Y_N); then


But for n > N,

This implies that X = E(Z | Y₁, Y₂, . . .) a.s.

Corollary 5.22. If ℱ(Z) ⊂ ℱ(Y₁, Y₂, . . .) and E|Z| < ∞, then

In particular, if A ∈ ℱ(Y₁, Y₂, . . .), then

Proof. Immediate.

There is a converse of sorts to 5.22, and a most useful one.

Theorem 5.23. Let X₁, X₂, . . . be a SMG (MG), and uniformly integrable. Then X_n → X a.s. and in first mean, and

(with equality holding if X₁, X₂, . . . is a MG).

Proof. By definition, for every m > n,

Thus, for every set A ∈ ℱ(X₁, . . . , X_n) and m > n,

But by (5.19), lim E|X_n| < ∞; hence X_n → X a.s., implying X_n → X in first mean, so

Therefore, every uniformly integrable MG sequence is a sequence of conditional expectations, and every uniformly integrable SMG sequence is bounded above by a MG sequence.

In 5.21 we added conditions; that is, we took the conditional expectation of Z relative to the increasing sequence of σ-fields ℱ(Y₁, . . . , Y_n). We can also go the other way.

Theorem 5.24. Let Z, Y₁, Y₂, . . . be random variables, E|Z| < ∞, and 𝒯 the tail σ-field of the Y₁, Y₂, . . . process. Then


Proof. A bit of a sly dodge is used here. Define X₋_n = E(Z | Y_n, Y_{n+1}, . . .), n = 1, 2, . . . Note that

so Lemma 5.17 is applicable to the sequence X₋_M, X₋_{M+1}, . . . , X₋₁. For any interval [a, b], if β_M is the number of times that this sequence crosses from below a to above b, then

By the same argument as in the main convergence theorem, X₋_n either must converge to a random variable X a.s. or |X₋_n| → ∞ with positive probability. But E|X₋_n| ≤ E|Z|, so the Fatou lemma again gives X₋_n → X a.s. and E|X| ≤ E|Z|. Just as in the proof of 5.21,

so the X₋_n sequence is uniformly integrable; hence E(Z | Y_n, Y_{n+1}, . . .) → X in first mean. Since X is an a.s. limit of random variables measurable with respect to ℱ(Y_n, . . .), X is measurable with respect to ℱ(Y_n, . . .) for every n. Hence X is a random variable on (Ω, 𝒯). Let A ∈ 𝒯; then A ∈ ℱ(Y_n, . . .), and

which proves that X = E(Z | 𝒯) a.s.

Problems

4. Show that E|X_n| < ∞, E|X| < ∞, and E|X_n − X| → 0 imply that X₁, X₂, . . . is a uniformly integrable sequence. Get the same conclusion if X_n → X a.s.

5. Apply Theorem 5.21 to the analytic model for coin-tossing and the coin-tossing variables defined in Chapter 1, Section 5. Take f(x) any Borel measurable function on [0, 1) such that ∫ |f(x)| dx < ∞. Let I₁, I₂, . . . , I_N be the intervals

and define the step function fn(x) by

Then prove that
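The omitted displays presumably take I_k = [(k − 1)/2ⁿ, k/2ⁿ) and let f_n be constant on each I_k, equal to the average of f there, so that f_n = E(f | first n binary digits) and 5.21 gives f_n → f a.s. and in first mean. A numerical sketch for f(x) = x² (our choice of f and of the dyadic setup):

```python
# Conditional expectation of f given the first n binary digits of x:
# on each dyadic interval I_k of length 2**-n, f_n equals the average
# of f over I_k.

def f(x):
    return x * x

def f_n(x, n, grid=1000):
    """Average of f over the dyadic interval of length 2**-n containing x."""
    width = 2.0 ** -n
    left = int(x / width) * width
    # crude numerical average over the interval (midpoint rule)
    return sum(f(left + (i + 0.5) * width / grid) for i in range(grid)) / grid

x0 = 0.7310987
errors = [abs(f_n(x0, n) - f(x0)) for n in (2, 5, 8)]
# The martingale f_n(x0) closes in on f(x0) as n grows.
assert errors[0] > errors[1] > errors[2]
print(errors)
```

For continuous f the error at a fixed point shrinks like the interval width, visible already at n = 8.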


6. Given a process X₁, X₂, . . . , abbreviate ℱ_n = ℱ(X_n, X_{n+1}, . . .). Use 5.24 to show that the tail σ-field 𝒯 of the process has the zero-one property (C ∈ 𝒯 ⟹ P(C) = 0 or 1) iff for every A ∈

7. Use Problem 15, Chapter 4, to prove the strong law of large numbers (3.30) from the martingale results.

6. STOPPING TIMES

Definition 5.25. Given a process X₁, X₂, . . . , an extended stopping time is an extended integer-valued random variable n*(ω) ≥ 1 such that {n* = j} ∈ ℱ(X₁, . . . , X_j). The process X̃₁, X̃₂, . . . derived under stopping is defined by

If we define variables m_n(ω) as min(n, n*(ω)), then obviously the m_n(ω) are optional sampling variables, and the X̃_n defined in 5.25 above are given by X̃_n = X_{m_n}. Hence stopping is a special case of optional sampling. Furthermore,

Proposition 5.26. Let X₁, X₂, . . . be a SMG (MG), and X̃₁, X̃₂, . . . be derived under stopping from X₁, X₂, . . . Then X̃₁, X̃₂, . . . is a SMG (MG).

Proof. All that is necessary is to show that conditions (a) and (b) of theoptional sampling Theorem 5.10 are met. Now

Noting that m_n(ω) = min(n, n*(ω)) ≤ n, we know that {m_n > N} = ∅ for N ≥ n, and

Not only does stopping appear as a transformation worthy of study, but it is also a useful tool in proving some strong theorems. For any set B ∈ ℬ₁ and a process X₁, X₂, . . . , define an extended stopping time by

Then X̃₁, X̃₂, . . . is called the process stopped on B.


Proposition 5.27. Let X₁, X₂, . . . be a SMG, B the set [a, ∞], a > 0. If E[sup_n (X_{n+1} − X_n)⁺] < ∞, then for X̃_n the process stopped on B,

Proof. For any n, X̃_n⁺ ≤ a + U, where

But X̃₁ = X₁. By the fact that X̃_n is a SMG, EX̃_n ≥ EX₁, so EX̃_n⁻ ≤ EX̃_n⁺ − EX₁. Thus

Theorem 5.28. Let X₁, X₂, . . . be a MG such that E(sup_n |X_{n+1} − X_n|) < ∞. If the sets A₁, A₂ are defined by

then A₁ ∪ A₂ = Ω a.s.

Proof. Consider the process X₁, X₂, . . . stopped on [K, ∞]. By 5.27 and the basic convergence theorem, X̃_n → X̃ a.s. On the set F_K = {sup_n X_n < K}, X̃_n = X_n, all n. Hence on F_K, lim_n X_n exists and is finite a.s. Thus this limit exists and is finite a.s. on the set ⋃_{K=1}^∞ F_K. But this set is exactly the set {lim sup X_n < ∞}. By using now the MG sequence −X₁, −X₂, . . . , conclude that lim_n X_n exists and is finite a.s. on the set {lim inf X_n > −∞}. Hence lim X_n exists and is finite for almost all ω in the set {lim sup X_n < ∞} ∪ {lim inf X_n > −∞}, and the theorem is proved.

This theorem is something like a zero-one law. Forgetting about a set of probability zero, according to this theorem we find that for every ω, either lim X_n(ω) exists finite, or the sequence X_n(ω) behaves badly in the sense that lim inf X_n(ω) = −∞, lim sup X_n(ω) = +∞. There are some interesting and useful applications. An elegant extension of the Borel-Cantelli lemma valid for arbitrary processes comes first.

Corollary 5.29 (extended Borel-Cantelli lemma). Let Y₁, Y₂, . . . be any process, and A_n ∈ ℱ(Y₁, . . . , Y_n). Then almost surely

Proof. Let


Note that

Obviously, also |Xn+1| < n, so {Xn} is a MG sequence. To boot,

so 5.28 is applicable. Now

Let D₁ = {lim X_n exists finite}. Then on D₁,

Let D₂ = {lim inf X_n = −∞, lim sup X_n = +∞}; then for all ω ∈ D₂,

Since P(Dl U D2) = 1, the corollary follows.

Some other applications of 5.28 are in the following problems.

Problems

8. A loose end, which is left over from the random signs problem, is that if Y₁, Y₂, . . . are independent random variables, EY_k = 0, by the zero-one law either X_n = Σ₁ⁿ Y_k converges to a finite limit a.s. or diverges a.s. The nature of the divergence can be gotten from 5.28 in an important special case. Show that if Y₁, . . . are independent, |Y_k| ≤ a < ∞, all k, EY_k = 0, S_n = Y₁ + · · · + Y_n, then either

a) P(lim_n S_n exists) = 1, or

b) P(lim sup S_n = +∞, lim inf S_n = −∞) = 1.

9. For any process X₁, X₂, . . . and sets A, B ∈ ℬ₁, suppose that P(X_m ∈ B for at least one m > n | X_n, . . . , X₁) ≥ δ > 0 on {X_n ∈ A}. Then prove that


10. Consider a process X₁, X₂, . . . taking values in [0, ∞). Consider {0} an absorbing state in the sense that X_n = 0 ⟹ X_{n+m} = 0, m ≥ 1. Let D be the event that the process is eventually absorbed at zero, that is,

If, for every x there exists a 6 > 0 such that

prove that for almost every sample sequence, either X₁, X₂, . . . is eventually absorbed, or X_n → ∞.

7. STOPPING RULES

Definition 5.30. If an extended stopping time n* is a.s. finite, call it a stopping time or a stopping rule. The stopped variable is X_{n*}.

It is clear that if we define m₁(ω) = 1, m_n(ω) = n*(ω), n ≥ 2, then the m_n are optional sampling variables. From the optional sampling theorem we get the interesting

Corollary 5.31. Let n* be a stopping rule. If X₁, X₂, . . . is a SMG (MG) and if

then

Proof. Obvious.

The interest of 5.31 in gambling is as follows: If a sequence of games is unfavorable under a given gambling system, then the variables −S_n form a SMG if E|S_n| < ∞. Suppose we use some stopping rule n* which, based on the outcome of the first j games, tells us whether or not to quit after the jth game. Then our terminal fortune is S_{n*}. But if (a) and (b) are in force, then

with equality if the game is fair. Thus we cannot increase our expected fortune by using a stopping rule. Also, in the context of gambling, some illumination can be shed on the conditions (a) and (b) of the optional sampling theorem. Condition (b), which is pretty much the stickler, says roughly that the variables m_n cannot sample too far out in the sequence too fast. For example, if there are constants α_n such that m_n ≤ α_n, all n (even if α_n → ∞), then (b) is automatically satisfied. A counterexample where (b) is violated is the honored rule "stop when you are ahead." Let Y₁, Y₂, . . . be independent random variables, Y_j = ±1 with probability ½ each. Then X_n = Y₁ + · · · + Y_n is a MG sequence and represents the winnings after n plays in a coin-tossing game. From Problem 8, lim sup X_n = +∞ a.s.; hence we can define a stopping rule by

that is, stop as soon as we win one dollar. If (5.32) were in force, then EX_{n*} = EX₁ = 0, but X_{n*} = 1. Now (a) is satisfied because E|X_{n*}| = 1; hence we must conclude that (b) is violated. This we can show directly. Note that |X_n| is a SMG sequence, and

Therefore

Going down the ladder we find that

Here, as a matter of fact, n* can be quite large. For example, note that En* = ∞, because if Y₁ = −1, we have to wait for an equalization; in other words, wait for the first time that Y₂ + · · · + Y_n = 0 before we can possibly get a dollar ahead. Actually, on the converse side, it is not difficult to prove:

Proposition 5.33. Let X₁, X₂, . . . be a SMG (MG) and n* a stopping rule. If En* < ∞ and E(|X_{n+1} − X_n| | X_n, . . . , X₁) ≤ a < ∞ for n ≤ n*, then

Proof. All we need to do is verify the conditions of 5.31. Denote Z_n = |X_n − X_{n−1}|, n > 1, Z₁ = |X₁|, and Y = Z₁ + · · · + Z_{n*}. Hence |X_{n*}| ≤ Y.


Interchange the order of summation so that

and we get


For 5.31(b), since Z

As N → ∞, {n* > N} ↓ ∅ a.s. Apply the bounded convergence theorem to

Proposition 5.33 has interesting applications to sums of independent random variables. Let Y₁, Y₂, . . . be independent and identically distributed, S_n = Y₁ + · · · + Y_n, and assume that for some real λ ≠ 0, φ(λ) = Ee^{λY₁} exists, and that φ(λ) ≥ 1. Then

Proposition 5.34 (Wald's identity). If n* is a stopping time for the sums S₁, S₂, . . . such that |S_n| ≤ γ for n ≤ n*, and En* < ∞, then

Proof. The random variables

form a MG, since

Obviously, EX₁ = 1, so if the second condition of 5.33 holds, then Wald's identity follows. The condition is

or

which is clearly satisfied under 5.34.
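A first-moment companion of Wald's identity — Problem 11 below asks for it — is ES_{n*} = EY₁ · En* whenever En* < ∞. An empirical check with Y uniform on {1, 2, 3} and n* the first time the sum reaches 20 (an illustrative setup, not from the text):

```python
import random

random.seed(4)

TARGET, TRIALS = 20, 20_000
EY = 2.0                          # E Y_1 for Y uniform on {1, 2, 3}

sum_stopped, sum_nstar = 0.0, 0.0
for _ in range(TRIALS):
    s, n = 0, 0
    while s < TARGET:             # n* = first n with S_n >= TARGET
        s += random.choice((1, 2, 3))
        n += 1
    sum_stopped += s
    sum_nstar += n

lhs = sum_stopped / TRIALS        # empirical E S_{n*}
rhs = EY * sum_nstar / TRIALS     # E Y_1 * empirical E n*
assert abs(lhs - rhs) < 0.1       # Wald: E S_{n*} = E Y_1 * E n*
print(lhs, rhs)
```

Note that S_n − n·EY₁ is a MG with bounded increments, which is why the two averages agree so closely.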


Problems

11. For n* a stopping time for sums S₁, S₂, . . . of independent, identically distributed random variables Y₁, Y₂, . . . , E|Y₁| < ∞, prove that En* < ∞ implies that

Use this to give another derivation of Blackwell's equation (Problem 21, Chapter 3).

12. Let Y₁, Y₂, . . . be independent and equal ±1 with probability ½. Let r, s be positive integers. Define

Show

13. (See Doob [39, p. 308].) For sums of independent, identically distributed random variables Y₁, Y₂, . . . , define n* as the time until the first positive sum, that is,

Prove that if EY₁ = 0, then En* = ∞.
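Problem 13 is exactly the "stop when you are ahead" phenomenon discussed in Section 7: for the symmetric ±1 walk, n* = first n with S_n = 1 is a.s. finite, yet En* = ∞. Simulation makes both halves plausible (the horizon and trial counts are our choices):

```python
import random

random.seed(5)

def first_passage(horizon):
    """Steps until the +/-1 walk first hits +1, or None within the horizon."""
    s = 0
    for n in range(1, horizon + 1):
        s += 1 if random.random() < 0.5 else -1
        if s == 1:
            return n
    return None

TRIALS, HORIZON = 4_000, 100_000
times = [first_passage(HORIZON) for _ in range(TRIALS)]
n_hit = sum(t is not None for t in times)
hit_fraction = n_hit / TRIALS
mean_hit_time = sum(t for t in times if t is not None) / n_hit

# Almost every walk gets ahead eventually...
assert hit_fraction > 0.95
# ...but the sample mean of n* is large and keeps growing as the
# horizon grows, consistent with E n* = infinity (heavy-tailed n*).
print(hit_fraction, mean_hit_time)
```

The tail P(n* > N) decays only like N^(−1/2), which is why the mean diverges even though the hitting probability is one.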

8. BACK TO GAMBLING

The reason that the strategy "quit when you are ahead" works in the fair coin-tossing game is that an infinite initial fortune is assumed. That is, there is no lower limit, say −M, such that if S_n becomes less than −M, play is ended.

A more realistic model of gambling would consider the sequence of fortunes S₀, S₁, S₂, . . . (that is, money in hand) as being nonnegative and finite. We now turn to an analysis of a sequence of fortunes under gambling satisfying

In addition, in any reasonable gambling house, and by the structure of our monetary system, if we bet on the nth trial, there must be a lower bound to the amount we can win or lose. We formalize this by

Assumption 5.36. There exists a δ > 0 such that either


Definition 5.37. We will say that we bet on the nth game if |S_n − S_{n−1}| ≥ δ. Let n* be the (possibly extended) time of the last bet, that is,

where S₀ is the starting fortune.

Under (5.35) (i) and (ii), and 5.36, the martingale convergence theorem yields strong results. You can't win!

Theorem 5.38.

Remark. The interesting thing about P(n* < ∞) = 1 is the implication that in an unfavorable (or fair) sequence of games, one cannot keep betting indefinitely. There must be a last bet. Furthermore, ES_{n*} ≤ S₀ implies that the expected fortune after the last bet is smaller than the initial fortune.

Proof. Let X_n = −S_n; then the X₀, X₁, . . . sequence is a SMG. Furthermore, E|X_n| = E|S_n| = ES_n. Thus E|X_n| ≤ S₀, all n. Hence there exists a random variable X such that X_n → X a.s. Thus

or

To prove the second part use the monotone convergence theorem

Note that the theorem is actually a simple corollary of the martingale convergence theorem. Now suppose that the gambling house has a minimum bet of a dollars and we insist on betting as long as S_n ≥ a; then n* becomes the time of "going broke," that is, n* = {first n such that S_n < a}, and the obvious corollary of 5.38 is that the persistent gambler goes broke with probability one.

NOTES

Martingales were first fully explored by Doob in 1940 [32] and systematically developed in his book [39] of 1953. Their widespread use in probability theory has mostly occurred since that time. However, many of the results


had been scattered around for some time. In particular, some of them are due to Lévy, appearing in his 1937 book [103], and some to Ville [137, 1939]. Some of the convergence theorems in a measure-theoretic framework are due to Andersen and Jessen. See the Appendix to Doob's book [39] for a discussion of the connection with the Andersen-Jessen approach, and complete references. The important concepts of optional sampling, optional stopping, and the key Lemma 5.17 are due to Doob.

David Freedman has pointed out to me that many of the convergence results can be gotten from the inequality 5.13. For example, here is a more elementary and illuminating proof of 5.21 for Z an ℱ(X)-measurable random variable. For any ε > 0, take k and Z_k measurable ℱ(X₁, . . . , X_k) such that

Now,

Thus,

Take ε ↓ 0 fast enough so that Z_k → Z a.s. to get the result.

For a fascinating modern approach to gambling strategies, see the book by Dubins and Savage [40].


CHAPTER 6

STATIONARY PROCESSES AND THE ERGODIC THEOREM

1. INTRODUCTION AND DEFINITIONS

The question here is: Given a process X₁, X₂, . . . , find conditions for the almost sure convergence of (X₁ + · · · + X_n)/n. Certainly, if the {X_n} are independent identically distributed random variables and E|X₁| < ∞, then,

A remarkable weakening of this result was proved by Birkhoff in 1931 [4]. Instead of having independent identically distributed random variables, think of requiring that the distribution of the process not depend on the placement of the time origin. In other words, assume that no matter when you start observing the sequence of random variables, the resulting observations will have the same probabilistic structure.
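For instance, with X_i independent Uniform(0, 1) the omitted display asserts (X₁ + · · · + X_n)/n → EX₁ = ½ a.s.; a one-line numerical check (our example):

```python
import random

random.seed(6)

n = 100_000
xs = [random.random() for _ in range(n)]
running_mean = sum(xs) / n
# SLLN: (X_1 + ... + X_n)/n -> E X_1 = 0.5 almost surely.
assert abs(running_mean - 0.5) < 0.01
print(running_mean)
```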

Definition 6.1. A process X₁, X₂, . . . is called stationary if for every k, the process X_{k+1}, X_{k+2}, . . . has the same distribution as X₁, X₂, . . . ; that is, for every B ∈ ℬ_∞,

Since the distribution is determined by the distribution functions, (6.2) is equivalent to: For every x₁, . . . , x_n and integer k > 0,

In particular, if a process is stationary, then all the one-dimensional distribution functions are the same; that is,

We can reduce (6.2) and (6.3) by noting

Proposition 6.4. A process X_1, X_2, . . . is stationary if the process X_2, X_3, . . . has the same distribution as X_1, X_2, . . .

Proof. Let X'_k = X_{k+1}, k = 1, 2, . . . Then X'_1, X'_2, . . . has the same distribution as X_1, X_2, . . . Hence X'_2, X'_3, . . . has the same distribution as X_2, X_3, . . . , and so forth.


Sometimes it is more convenient to look at stationary processes that consist of a double-ended sequence of random variables . . . , X_{-1}, X_0, X_1, . . . In this context, what we have is an infinite sequence of readings, beginning in the infinitely remote past and continuing into the infinite future. Define such a process to be stationary if its distribution does not depend on the choice of an origin, i.e., in terms of finite-dimensional distributions:

P(X_{k+1} ≤ x_1, . . . , X_{k+n} ≤ x_n) = P(X_1 ≤ x_1, . . . , X_n ≤ x_n)

for all x_1, . . . , x_n and all k, both positive and negative.

The interesting point here is

Proposition 6.5. Given any single-ended stationary process X_1, X_2, . . . , there is a double-ended stationary process . . . , X̃_{-1}, X̃_0, X̃_1, . . . such that X̃_1, X̃_2, . . . and X_1, X_2, . . . have the same distribution.

Proof. From the Extension Theorem 2.26, all we need to define the X̃_k process is a set of consistent distribution functions; i.e., we need to define

F_{-m, . . . , n}(x_{-m}, . . . , x_n) = P(X̃_{-m} ≤ x_{-m}, . . . , X̃_n ≤ x_n)

such that if either x_{-m} or x_n ↑ ∞, then we drop down to the next highest distribution function. We do this by defining

F_{-m, . . . , n}(x_{-m}, . . . , x_n) = P(X_1 ≤ x_{-m}, . . . , X_{m+n+1} ≤ x_n);

that is, we slide the distribution functions of the X_1, X_2, . . . process to the left. Now X_1, X_2, . . . can be looked at as the continuation of a process that has already been going on an infinite length of time.

Starting with any stationary process, an infinity of stationary processescan be produced.

Proposition 6.6. Let X_1, X_2, . . . be stationary, φ(x) measurable 𝔅_∞; then the process Y_1, Y_2, . . . defined by

Y_k = φ(X_k, X_{k+1}, . . .)

is stationary.

Proof. On R^(∞) define φ_k(x) as φ(x_k, x_{k+1}, . . .). The set

A = {x; (φ_1(x), φ_2(x), . . .) ∈ B},

B ∈ 𝔅_∞, is in 𝔅_∞, because each φ_k(x) is a random variable on (R^(∞), 𝔅_∞). Note that

{(Y_1, Y_2, . . .) ∈ B} = {(X_1, X_2, . . .) ∈ A}

and

{(Y_{k+1}, Y_{k+2}, . . .) ∈ B} = {(X_{k+1}, X_{k+2}, . . .) ∈ A},

which implies the stationarity of the Y_k sequence.


Corollary 6.7. Let X_1, X_2, . . . be independent and identically distributed random variables, φ(x) measurable 𝔅_∞; then

Y_k = φ(X_k, X_{k+1}, . . .)

is stationary.

Proof. The X_1, X_2, . . . sequence is stationary.

Problem 1. Look at the unit circle and define

f(x) = 1 if π ≤ x < 2π, f(x) = 0 if 0 ≤ x < π.

Here Ω is the unit circle, ℱ the Borel σ-field. Take P to be Lebesgue measure divided by 2π. Take θ to be an irrational angle. Define x_1 = x, x_{k+1} = (x_k + θ)[2π], and X_k(x) = f(x_k). Hence the process is a sequence of zeros and ones, depending on whether x_k is in the last two quadrants or the first two when x_1 is picked at random on the circumference. Prove that X_1, X_2, . . . is stationary. ([a] denotes modulo a.)
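Problem 1 is easy to explore numerically. The sketch below fixes the starting point instead of choosing it at random; by equidistribution of the irrational rotation, the empirical frequency of ones along one orbit still comes out near 1/2, the normalized measure of the last two quadrants (an illustration, not a proof):

```python
import math

def rotation_process(x1, theta, n):
    """Problem 1: x_{k+1} = (x_k + theta) mod 2*pi on the circle,
    X_k = f(x_k) with f the indicator of the last two quadrants [pi, 2*pi)."""
    xs, x = [], x1
    for _ in range(n):
        xs.append(1 if x >= math.pi else 0)
        x = (x + theta) % (2 * math.pi)
    return xs

# theta = 1 radian is an irrational angle (1/(2*pi) is irrational)
xs = rotation_process(x1=0.3, theta=1.0, n=200_000)
freq = sum(xs) / len(xs)
print(freq)  # close to 1/2 = P(X_k = 1)
```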

2. MEASURE-PRESERVING TRANSFORMATIONS

Consider a probability space (Ω, ℱ, P) and a transformation T of Ω into itself. As usual, we will call T measurable if the inverse images under T of sets in ℱ are again in ℱ; that is, if T^{-1}A = {ω; Tω ∈ A} ∈ ℱ, all A ∈ ℱ.

Definition 6.8. A measurable transformation T of Ω → Ω will be called measure-preserving if P(T^{-1}A) = P(A), all A ∈ ℱ.

To check whether a given transformation T is measurable, we can easily generalize 2.28 and conclude that if T^{-1}C ∈ ℱ for all C ∈ 𝒞, where ℱ(𝒞) = ℱ, then T is measurable. Again, to check the measure-preserving property: both P(T^{-1}A) and P(A) are σ-additive probabilities, so we need check only their agreement on a class 𝒞, closed under intersections, such that ℱ(𝒞) = ℱ.

Starting from measure-preserving transformations (henceforth assumed measurable) a large number of stationary processes can be generated. Let X(ω) be any random variable on (Ω, ℱ, P). Let T be measure-preserving and define a process X_1, X_2, . . . by X_1(ω) = X(ω), X_2(ω) = X(Tω), X_3(ω) = X(T^2ω), . . . Another way of looking at this is: If X_1(ω) is some measurement on the system at time one, then X_n(ω) is the same measurement after the system has evolved n − 1 steps, so that ω → T^{n−1}ω. It should be intuitively clear that the distribution of the X_1, X_2, . . . sequence does not depend on the origin, since starting from any X_n(ω) we get X_{n+1}(ω) as X_n(Tω), and so forth. To make this firm, denote by T^0 the identity operator, and we prove


Proposition 6.9. Let T be measure-preserving on (Ω, ℱ, P), X a random variable on (Ω, ℱ); then the sequence X_n(ω) = X(T^{n−1}ω), n = 1, 2, . . . , is a stationary sequence of random variables.

Proof. First of all, X_n(ω) is a random variable, because {X_n(ω) ∈ B} = {X(T^{n−1}ω) ∈ B}. Let A = {X ∈ B}. Then

{X(T^{n−1}ω) ∈ B} = T^{−(n−1)}A.

Evidently, however, T measurable implies T^{n−1} measurable, or T^{−n+1}A ∈ ℱ. Now let A = {ω; (X_1, X_2, . . .) ∈ B}, B ∈ 𝔅_∞; thus

A = {ω; (X(ω), X(Tω), . . .) ∈ B}.

Look at A_1 = {ω; (X_2, X_3, . . .) ∈ B}. This similarly is the set {ω; (X(Tω), X(T^2ω), . . .) ∈ B}. Hence ω ∈ A_1 ⇔ Tω ∈ A, or A_1 = T^{−1}A. But, by hypothesis, P(T^{−1}A) = P(A).

Can every stationary process be generated by a measure-preserving transformation? Almost! In terms of distribution, the answer is yes. Starting from any stationary process X_1(ω), X_2(ω), . . . , go to the coordinate representation process X_1, X_2, . . . on (R^(∞), 𝔅_∞, P). By definition, X_n(x) = x_n.

Definition 6.10. On (R^(∞), 𝔅_∞) define the shift transformation S: R^(∞) → R^(∞) by S(x_1, x_2, . . .) = (x_2, x_3, . . .).

So, for example, S(3, 2, 7, 1, . . .) = (2, 7, 1, . . .). The point is that, from the definitions, X_n(x) = X_1(S^{n−1}x). We prove below that S is measurable and measure-preserving, thus justifying the answer of "Almost" above.

Proposition 6.11. The transformation S defined above is measurable, and if X_1, X_2, . . . is stationary, then S preserves P-measure.

Proof. To show S measurable, consider B ∈ 𝔅_∞ and its inverse image

S^{−1}B = {x; Sx ∈ B}.

By definition, letting (Sx)_k be the kth coordinate of Sx, we find that (Sx)_k = x_{k+1}, so that

S^{−1}B = {x; (x_2, x_3, . . .) ∈ B},

and this set is therefore obviously in 𝔅_∞. Furthermore, by the stationarity of X_1, X_2, . . . ,

P(S^{−1}B) = P((X_2, X_3, . . .) ∈ B) = P((X_1, X_2, . . .) ∈ B) = P(B).

So S is also measure-preserving.
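As a tiny illustration of Definition 6.10 and the identity X_n(x) = X_1(S^{n−1}x), here is the shift acting on a finite window of a sequence (a sketch only; finite tuples stand in for points of R^(∞)):

```python
def shift(x):
    """One application of the shift S(x1, x2, x3, ...) = (x2, x3, ...),
    acting here on a finite window of the sequence."""
    return x[1:]

x = (3, 2, 7, 1, 4, 9)
print(shift(x))  # (2, 7, 1, 4, 9), as in the example above

# X_n(x) = X_1(S^{n-1} x): the n-th coordinate is the first coordinate
# of the (n-1)-times shifted sequence.
def X1(y):
    return y[0]

n, y = 4, x
for _ in range(n - 1):
    y = shift(y)
print(X1(y) == x[n - 1])  # True
```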


Problems

2. Show that the following transformations are measurable and measure-preserving.

1) Ω = [0, 1), ℱ = 𝔅[0, 1), P = dx. Let λ be any number in [0, 1) and define Tx = (x + λ)[1].

2) Ω = [0, 1), ℱ = 𝔅[0, 1), P = dx, Tx = (2x)[1].

3) Same as (2) above, but Tx = (kx)[1], where k ≥ 2 is integral.
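Measure preservation for the maps in Problem 2 can be checked by a quick Monte Carlo experiment (an illustration, not a proof): if x is distributed according to P = dx, then the distribution of Tx is the measure A ↦ P(T^{−1}A), so a uniform histogram for Tx is exactly the statement P(T^{−1}A) = P(A) on intervals:

```python
import random

def bin_freqs(ys, nbins=10):
    """Empirical frequencies of the values ys over nbins equal subintervals of [0, 1)."""
    counts = [0] * nbins
    for y in ys:
        counts[min(int(y * nbins), nbins - 1)] += 1
    return [c / len(ys) for c in counts]

random.seed(0)
xs = [random.random() for _ in range(200_000)]  # x distributed according to P = dx
ys = [(2 * x) % 1.0 for x in xs]                # Tx for the map of Problem 2(2)
dev = max(abs(f - 1 / 10) for f in bin_freqs(ys))
print(dev)  # small: Tx is again (approximately) uniform on [0, 1)
```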

3. Show that for the following transformations on ([0, 1), 𝔅[0, 1)) there is no P such that P(single point) = 0 and the transformation preserves P.

1) Tx = λx, 0 < λ < 1.

2) Tx = x^2.

4. On Ω = [0, 1), define T: Ω → Ω by Tx = (2x)[1]. Use ℱ = 𝔅([0, 1)), P = dx. Define

X(x) = 1 if x ≥ 1/2, X(x) = 0 if x < 1/2.

Show that the sequence X_n(x) = X(T^{n−1}x) consists of independent zeros and ones with probability 1/2 each.

Show that corresponding to every stationary sequence X_1(ω), X_2(ω), . . . such that X_n(ω) ∈ {0, 1}, there is a probability Q(dx) on 𝔅[0, 1) such that Tx = (2x)[1] preserves Q-measure, and such that the X_n(x) sequence defined above has the same distribution with respect to (𝔅[0, 1), Q(dx)) as X_1, X_2, . . .
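A numerical sketch of the first part of Problem 4 (an illustration only): iterating Tx = (2x)[1] and recording the indicator of [1/2, 1) reads off the binary digits of x, and for x picked uniformly these behave like independent fair bits:

```python
import random

def doubling_bits(x, n):
    """X_k(x) = X(T^{k-1}x) for Tx = (2x)[1] and X = indicator of [1/2, 1);
    these are exactly the first n binary digits of x."""
    bits = []
    for _ in range(n):
        bits.append(1 if x >= 0.5 else 0)
        x = (2 * x) % 1.0
    return bits

random.seed(1)
samples = [doubling_bits(random.random(), 8) for _ in range(50_000)]
p1 = sum(s[0] for s in samples) / len(samples)          # P(X_1 = 1), near 1/2
p15 = sum(s[0] * s[4] for s in samples) / len(samples)  # P(X_1 = 1, X_5 = 1), near 1/4
print(p1, p15)
```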

3. INVARIANT SETS AND ERGODICITY

Let T be a measure-preserving transformation on (Ω, ℱ, P).

Definition 6.12. A set A ∈ ℱ is invariant if T^{-1}A = A.

If A is an invariant set, then the motion T of Ω → Ω carries A into A; that is, if ω ∈ A, then Tω ∈ A (because T^{-1}A^c = A^c). A^c is also invariant, and for all n, T^n carries points of A into A and points of A^c into A^c. Because of the properties of inverse mappings we have

Proposition 6.13. The class 𝒥 of invariant sets is a σ-field.

Proof. Just write down definitions.

In the study of dynamical systems, Ω is the phase space of the system, and if ω is the state of the system at t = 0, then its state at time t is given by T_tω, where T_t: Ω → Ω is the motion of the phase space into itself induced by the equations of motion. For a conservative system T_t(T_τω) = T_{t+τ}ω.


We discretize time and take T = T_1, so the state of the system at time n is given by T^nω. Suppose that X(ω) is some observable function of the state ω. Physically, in taking measurements, the observation time is quite long compared to some natural time scale of molecular interactions. So we measure, not X(ω), but the average of X(ω) over the different states into which ω passes with the evolution of the system. That is, we measure

(1/τ) ∫_0^τ X(T_tω) dt

for τ large, or in discrete time

(1/n) Σ_{k=1}^{n} X(T^{k−1}ω)

for n large. The brilliant insight of Gibbs was the following argument: that in time, the point ω_t = T_tω wandered all over the phase space, and that the density of the points ω_t in any neighborhood tended toward a limiting distribution. Intuitively, this limiting distribution of points had to be invariant under T. If there is such a limiting distribution, say a measure P, then we should be able to replace the limiting time average

lim_n (1/n) Σ_{k=1}^{n} X(T^{k−1}ω)

by the phase average

∫ X dP.

Birkhoff's result was that this argument, properly formulated, was true! To put Gibbs' conjecture in a natural setting, take Ω to be all points on a surface of constant energy. (This will be a subset of R^(6n), where n is the number of particles.) Take ℱ to be the intersection of 𝔅_{6n} with Ω and P the normalized surface area on Ω. By Liouville's theorem, T: Ω → Ω preserves P-measure. The point is now that T^nω will never become distributed over Ω in accordance with P if there are invariant subsets A of Ω such that P(A) > 0, P(A^c) > 0, because in this case the points of A will remain in A; similarly for A^c. Thus, the only hope for Gibbs' conjecture is that every invariant set A has probability zero or one.

To properly formulate it, begin with

Definition 6.14. Let T be measure-preserving on (Ω, ℱ, P). T is called ergodic if for every A ∈ 𝒥, P(A) = 0 or 1.

One question that is relevant here is: Suppose one defined events A to be a.s. invariant if P(A Δ T^{-1}A) = 0. Is the class of a.s. invariant events considerably different from the class of invariant events? Not so! It is exactly the completion of 𝒥 with respect to P and ℱ.


Figure 6.1

Proposition 6.15. Let A ∈ ℱ be a.s. invariant; then there is a set A′ ∈ ℱ which is invariant such that P(A Δ A′) = 0.

Proof. Let

A″ = ⋃_{n=0}^{∞} T^{−n}A.

Then A″ = A a.s., and T^{−1}A″ ⊂ A″. Let

A′ = ⋂_{m=0}^{∞} T^{−m}A″;

noting that A″ is a.s. invariant gives A′ = A a.s., and T^{−1}A″ ⊂ A″ implies that A′ is invariant.

The concept of ergodicity, as pointed out above, is a guarantee that thephase space does not split into parts of positive probability which areinaccessible to one another.

Example. Let Ω be the unit square, that is, Ω = {(x, y); 0 ≤ x < 1, 0 ≤ y < 1}, ℱ the Borel field. Let α be any positive number and define T(x, y) = ((x + α)[1], (y + α)[1]). Use P = dx dy; then it is easy to check that T is measure-preserving. What we have done is to sew the edges of Ω together (see Fig. 6.1) so that a and a′ are together, and β and β′. T moves points at a 45° angle along the sewn-together square. Just by looking at Fig. 6.1 you can see that T does not move the points of Ω around very much, and it is easy to construct invariant sets of any probability, for example, the shaded set shown in Fig. 6.2.

Figure 6.2


Problems

5. In Problem 2(1), show that if λ is rational, T is not ergodic.

6. We use this problem to illustrate more fully the dynamical aspect. Take (Ω, ℱ) and let T: Ω → Ω be measurable. Start with any point ω ∈ Ω; then ω has the motion ω → Tω → T^2ω → · · · . Let N_n(A, ω) be the number of times that the moving point T^kω enters the set A ∈ ℱ during the first n motions; that is, N_n(A, ω) is the number of times that T^kω ∈ A, k = 0, 1, . . . , n − 1. Keeping ω fixed, define probabilities P_n^(ω)(·) on ℱ by P_n^(ω)(·) = N_n(·, ω)/n. That is, P_n^(ω)(A) is the proportion of times in the first n moves that the point is in the set A. Let X be any random variable on (Ω, ℱ).

a) Show that

∫ X dP_n^(ω) = (1/n) Σ_{k=0}^{n−1} X(T^kω).

Assume that there is a probability P_ω(·) on ℱ such that for every A ∈ ℱ,

P_n^(ω)(A) → P_ω(A).

b) Show that T is P_ω(·)-preserving, that is,

P_ω(T^{−1}A) = P_ω(A), all A ∈ ℱ.

c) Show that if X(ω) ≥ 0 is bounded, then

(1/n) Σ_{k=0}^{n−1} X(T^kω) → ∫ X dP_ω.

What is essential here is that the limit ∫ X dP_ω not depend on where the system started at time zero. Otherwise, to determine the limiting time averages of (c) for the system, a detailed knowledge of the position ω in phase space at t = 0 would be necessary. Hence, what we really need is the additional assumption that P_ω(·) be the same for all ω; in other words, that the limiting proportion of time spent in the set A not depend on the starting position ω. Now substitute this stronger assumption, that is: There is a probability P(·) on ℱ such that for every A ∈ ℱ and ω ∈ Ω,

P_n^(ω)(A) → P(A).

d) Show that under the above assumption, A invariant ⇒ A = Ω or ∅.

This result shows not only that T is ergodic on (Ω, ℱ, P), but ergodic in a much stronger sense than that of Definition 6.14. The stronger assumption above is much too restrictive in the sense that most dynamical systems do not satisfy it. There are usually some starting states ω which are exceptional in that the motion under T does not mix them up very well. Take, for example, elastic two-dimensional molecules in a rectangular box. At t = 0 consider


the state ω which is shown in Fig. 6.3, where all the molecules have the same x-coordinate and velocity. Obviously, there will be large chunks of phase space that will never be entered if this is the starting state. What we want, then, is some weaker version that says P_n^(ω)(·) → P(·) for most starting states ω. With this weakening, the strong result of (d) above will no longer be true. We come back later to the appropriate weakening which, of course, will result in something like 6.14 instead of (d).

Figure 6.3

4. INVARIANT RANDOM VARIABLES

Along with invariant sets go invariant random variables.

Definition 6.16. Let X(ω) be a random variable on (Ω, ℱ, P), T measure-preserving; then X(ω) is called an invariant random variable if X(ω) = X(Tω).

Note

Proposition 6.17. X is invariant iff X is measurable 𝒥.

Proof. If X is invariant, then for every x, {X ≤ x} ∈ 𝒥; hence X is measurable 𝒥. Conversely, if X(ω) = χ_A(ω), A ∈ 𝒥, then

X(Tω) = χ_A(Tω) = χ_{T^{−1}A}(ω) = χ_A(ω) = X(ω).

Now consider the class 𝒞 of all random variables on (Ω, 𝒥) which are invariant; clearly 𝒞 is closed under linear combinations, and X_n ∈ 𝒞, X_n(ω) ↑ X(ω) implies

X(Tω) = lim_n X_n(Tω) = lim_n X_n(ω) = X(ω).

Hence by 2.38, 𝒞 contains all nonnegative random variables on (Ω, 𝒥). Thus clearly every random variable on (Ω, 𝒥) is invariant.

The condition for ergodicity can be put very nicely in terms of invariantrandom variables.

Proposition 6.18. Let T be measure-preserving on (Ω, ℱ, P). T is ergodic iff every invariant random variable X(ω) is a.s. equal to a constant.

Proof. One way is immediate; that is, for any invariant set A, let X(ω) = χ_A(ω). Then X(ω) constant a.s. implies P(A) = 0 or 1. Conversely, suppose P(X ≤ x) = 0 or 1 for all x. Since for x ↑ +∞, P(X ≤ x) → 1, P(X ≤ x) = 1 for all x sufficiently large. Let x_0 = inf{x; P(X ≤ x) = 1}. Then for


every ε > 0, P(x_0 − ε < X ≤ x_0 + ε) = 1, and taking ε ↓ 0 yields P(X = x_0) = 1.

Obviously we can weaken 6.18 to read

Proposition 6.19. T is ergodic iff every bounded invariant random variable is a.s. constant.

In general, it is usually difficult to show that a given transformation is ergodic. Various tricks are used: for example, we can apply 6.19 to Problem 2(1).

Example. We show that if λ is irrational, then Tx = (x + λ)[1] is ergodic. Let f(x) be any Borel-measurable function on [0, 1). Assume it is in L_2(dx), that is, ∫ f^2 dx < ∞. Then we have

f(x) = Σ_n c_n e^{2πinx},

where the sum exists as a limit in the second mean, and Σ |c_n|^2 < ∞. Therefore

f(Tx) = Σ_n c_n e^{2πinλ} e^{2πinx}.

For f(x) to be invariant, c_n(1 − e^{2πinλ}) = 0. This implies either c_n = 0 or e^{2πinλ} = 1. The latter can never be satisfied for nonzero n and irrational λ. The conclusion is that f(x) = c_0 a.s.; by 6.19, T is ergodic.
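The contrast between irrational and rational λ in this example can also be seen numerically. A sketch (illustration only): time averages of f(x) = cos 4πx under Tx = (x + λ)[1] approach the phase average ∫ f dx = 0 when λ is irrational, but depend on the starting point when λ is rational:

```python
import math

def time_average(f, x0, lam, n):
    """(1/n) * sum_{k=0}^{n-1} f(T^k x0) for Tx = (x + lam)[1]."""
    s, x = 0.0, x0
    for _ in range(n):
        s += f(x)
        x = (x + lam) % 1.0
    return s / n

f = lambda x: math.cos(4 * math.pi * x)  # phase average: integral over [0,1) is 0
irr = time_average(f, 0.1, math.sqrt(2) - 1, 100_000)  # irrational lam: ergodic
rat = time_average(f, 0.1, 0.5, 100_000)               # rational lam: not ergodic
print(irr)  # near 0, the phase average
print(rat)  # near cos(0.4*pi), a value that depends on the starting point
```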

Problems

7. Use the method of the above example to show that the transformation ofProblem 2(2) is ergodic.

8. Using 2.38, show that if T is measure-preserving on (Ω, ℱ, P) and X(ω) is any random variable, then

(6.20)  EX(Tω) = EX(ω),

in the sense that if either expectation exists, so does the other, and they are equal.

5. THE ERGODIC THEOREM

One of the most remarkable of the strong limit theorems is the result usuallyreferred to as the ergodic theorem.

Theorem 6.21. Let T be measure-preserving on (Ω, ℱ, P). Then for X any random variable such that E|X| < ∞,

(1/n) Σ_{k=1}^{n} X(T^{k−1}ω) → E(X | 𝒥) a.s.


To prove this result, we prove first an odd integration inequality:

Theorem 6.22 (Maximal ergodic theorem). Let T be measure-preserving on (Ω, ℱ, P), and X a random variable such that E|X| < ∞. Define

S_k(ω) = X(ω) + · · · + X(T^{k−1}ω), and M_n(ω) = max(0, S_1, S_2, . . . , S_n).

Then

∫_{{M_n > 0}} X dP ≥ 0.

Proof. We give a very simple recent proof of this due to Adriano Garsia [61]. For any k ≤ n, M_n(Tω) ≥ S_k(Tω). Hence

X(ω) + M_n(Tω) ≥ X(ω) + S_k(Tω) = S_{k+1}(ω).

Write this as

X(ω) ≥ S_{k+1}(ω) − M_n(Tω), k = 1, . . . , n − 1.

But trivially,

X(ω) ≥ S_1(ω) − M_n(Tω),

since S_1(ω) = X(ω) and M_n(Tω) ≥ 0. These two inequalities together give X(ω) ≥ max(S_1(ω), . . . , S_n(ω)) − M_n(Tω). Thus

∫_{{M_n>0}} X dP ≥ ∫_{{M_n>0}} max(S_1, . . . , S_n) dP − ∫_{{M_n>0}} M_n(Tω) dP.

On the set {M_n > 0}, max(S_1, . . . , S_n) = M_n. Hence

∫_{{M_n>0}} X dP ≥ ∫_{{M_n>0}} M_n dP − ∫_{{M_n>0}} M_n(Tω) dP ≥ ∫ M_n dP − ∫ M_n(Tω) dP,

but

∫ M_n(Tω) dP = ∫ M_n(ω) dP,

so the right-hand side is zero. This last equality is by (6.20).

Completion of proof of 6.21. Assume first that E(X | 𝒥) = 0, and prove that the averages converge to zero a.s. Then apply this result to the random variable X(ω) − E(X | 𝒥) to get the general case. Let X̄ = lim sup_n S_n/n, and for any ε > 0, denote D = {X̄ > ε}. Note that X̄(Tω) = X̄(ω), so X̄ and therefore D are invariant. Define the random variable

X*(ω) = (X(ω) − ε)χ_D(ω),

and using X*, define S*_k, M*_n as above. The maximal ergodic theorem gives

∫_{{M*_n > 0}} X* dP ≥ 0.


The rest of the proof is easy sailing. The sets

F_n = {M*_n > 0} = {max_{1≤k≤n} S*_k > 0}

converge upward to the set

F = {sup_k S*_k > 0} = {sup_k S*_k/k > 0}.

Since sup_k S_k/k ≥ X̄ > ε on D, while X* vanishes off D, F = D. The inequality E|X*| ≤ E|X| + ε allows the use of the bounded convergence theorem, so we conclude that

∫_D X* dP = lim_n ∫_{F_n} X* dP ≥ 0.

Therefore,

∫_D X dP ≥ εP(D).

But

∫_D X dP = ∫_D E(X | 𝒥) dP = 0,

since D is invariant, which implies P(D) = 0; as ε > 0 was arbitrary, X̄ ≤ 0 a.s. Apply the same argument to the random variable −X(ω). Here the lim sup of the sums is

lim sup_n (−S_n)/n = −lim inf_n S_n/n.

The conclusion above becomes −lim inf_n S_n/n ≤ 0, or lim inf_n S_n/n ≥ 0 a.s. Putting these two together gives the theorem. Q.E.D.

A consequence of 6.21 is that if T is ergodic, time averages can be replaced by phase averages; in other words,

Corollary 6.23. Let T be measure-preserving and ergodic on (Ω, ℱ, P). Then for X any random variable such that E|X| < ∞,

(1/n) Σ_{k=1}^{n} X(T^{k−1}ω) → EX a.s.

Proof. Every set in 𝒥 has probability zero or one; hence

E(X | 𝒥) = EX a.s.


6. CONVERSES AND COROLLARIES

It is natural to ask whether the conditions of the ergodic theorem are necessary and sufficient. Again the answer is: Almost. If X is a nonnegative random variable and EX = ∞, it is easy to show that for T measure-preserving and ergodic,

(1/n) Σ_{k=1}^{n} X(T^{k−1}ω) → ∞ a.s.

Because, defining for a > 0,

X_a(ω) = min(X(ω), a),

of course E|X_a| < ∞. Thus the ergodic theorem can be used to get

lim inf_n (1/n) Σ_{k=1}^{n} X(T^{k−1}ω) ≥ lim_n (1/n) Σ_{k=1}^{n} X_a(T^{k−1}ω) = EX_a a.s.

Take a ↑ ∞ to get the conclusion. But, in general, if E|X| = ∞, it does not follow that the averages diverge

a.s. (see Halmos [65, p. 32]).

Come back now to the question of the asymptotic density of the points ω, Tω, T^2ω, . . . In the ergodic theorem, for any A ∈ ℱ, take X(ω) = χ_A(ω). Then the conclusion reads: if T is ergodic,

(1/n) Σ_{k=1}^{n} χ_A(T^{k−1}ω) → P(A) a.s.,

so that for almost every starting point ω, the asymptotic proportion of points in A is exactly P(A). If Ω has a topology with a countable basis such that P(N) > 0 for every open neighborhood N, then this implies that for almost every ω, the set of points ω, Tω, T^2ω, . . . is dense in Ω.

Another interesting and curious result is

Corollary 6.24. Let T: Ω → Ω be measure-preserving and ergodic with respect to both (Ω, ℱ, P_1) and (Ω, ℱ, P_2). Then either P_1 = P_2 or P_1 and P_2 are orthogonal in the sense that there is a set A ∈ 𝒥 such that P_1(A) = 1, P_2(A^c) = 1.

Proof. If P_1 ≠ P_2, take B ∈ ℱ such that P_1(B) ≠ P_2(B) and let X(ω) = χ_B(ω). Let A be the set of ω such that

(1/n) Σ_{k=1}^{n} χ_B(T^{k−1}ω) → P_1(B).

By the ergodic theorem P_1(A) = 1. But A^c includes all ω such that

(1/n) Σ_{k=1}^{n} χ_B(T^{k−1}ω) → P_2(B),

and we see that P_2(A^c) = 1.


Finally, we ask concerning convergence in the first mean of the averagesto EX. By the Lebesgue theorem we know that a.s. convergence plus justa little more gives first mean convergence. But here we have to work a bitto get the additional piece.

Corollary 6.25. Under the conditions of the ergodic theorem,

E | (1/n) Σ_{k=1}^{n} X(T^{k−1}ω) − E(X | 𝒥) | → 0.

Proof. We can assume that E(X | 𝒥) = 0. Let

V_n(ω) = (1/n) Σ_{k=1}^{n} X(T^{k−1}ω).

Since V_n → 0 a.s., by Egoroff's theorem, for any ε > 0 there is a set A ∈ ℱ such that P(A) ≤ ε and V_n → 0 uniformly on A^c. Now,

E|V_n| = ∫_A |V_n| dP + ∫_{A^c} |V_n| dP.

These integrals can be estimated by

∫_{A^c} |V_n| dP ≤ sup_{A^c} |V_n| → 0,

∫_A |V_n| dP ≤ (1/n) Σ_{k=1}^{n} ∫_A |X(T^{k−1}ω)| dP ≤ NP(A) + ∫_{{|X|>N}} |X| dP.

Since ε is arbitrary, conclude that for any N,

lim sup_n E|V_n| ≤ ∫_{{|X|>N}} |X| dP.

Let N go to infinity; then, since E|X| < ∞, the right-hand side above goes to zero.

Problems

9. Another consequence of the ergodic theorem is a weak form of Weyl's equidistribution theorem. For any x in [0, 1) and interval I ⊂ [0, 1), let R_x^(n)(I) be the proportion of the points {(x + λ)[1], (x + 2λ)[1], . . . , (x + nλ)[1]} falling in the interval I. If λ > 0 is irrational, show that for x in a set of Lebesgue measure one, R_x^(n)(I) → length I.

10. Let μ be a finite measure on 𝔅([0, 1)) such that Tx = (2x)[1] preserves μ-measure. Show that μ is singular with respect to Lebesgue measure.


11. Let T be measurable on (Ω, ℱ), and define 𝓜 as the set of all probabilities P on ℱ such that T is measure-preserving on (Ω, ℱ, P). Define real linear combinations by (αP_1 + βP_2)(B) = αP_1(B) + βP_2(B), B ∈ ℱ. Show that

a) 𝓜 is convex; that is, for α, β ≥ 0, α + β = 1, and P_1, P_2 ∈ 𝓜, αP_1 + βP_2 ∈ 𝓜.

An extreme point of 𝓜 is a probability P ∈ 𝓜 which is not a linear combination αP_1 + βP_2, α, β > 0, α + β = 1, with distinct P_1, P_2 ∈ 𝓜. Show that

b) the extreme points of 𝓜 are the probabilities P such that T is ergodic on (Ω, ℱ, P).

7. BACK TO STATIONARY PROCESSES

By the ergodic theorem and its corollary, if the shift transformation S (see 6.10) is ergodic on (R^(∞), 𝔅_∞, P), then

(X_1 + · · · + X_n)/n → EX_1 a.s. and in first mean.

If S is ergodic, then the same conclusions will hold for the original X_1, X_2, . . . process, because a.s. convergence and rth mean convergence depend only on the distribution of the process.

Almost all the material concerning invariance and ergodicity can be formulated in terms of the original process X_1, X_2, . . . rather than going into representation space. If B ∈ 𝔅_∞ and A = {X ∈ B}, then the inverse image under X of S^{-1}B, S the shift operator, is

{ω; (X_2, X_3, . . .) ∈ B}.

Hence, we reach

Definition 6.26. An event A ∈ ℱ is invariant if there exists B ∈ 𝔅_∞ such that for every n ≥ 1,

A = {ω; (X_n, X_{n+1}, . . .) ∈ B}.

The class 𝒥 of invariant events is easily seen to be a σ-field. Similarly, we define a random variable Z to be invariant if there is a random variable φ on (R^(∞), 𝔅_∞) such that for every n ≥ 1,

Z = φ(X_n, X_{n+1}, . . .).

The results of Section 4 hold again; Z is invariant iff it is 𝒥-measurable. The ergodic theorem translates as

Theorem 6.28. If X_1, X_2, . . . is a stationary process, 𝒥 the σ-field of invariant events, and E|X_1| < ∞, then

(X_1 + · · · + X_n)/n → E(X_1 | 𝒥) a.s. and in first mean.


Proof. From the ergodic theorem, S_n/n converges a.s. and in first mean to some random variable Y. It is not difficult to give an argument that the correct translation is Y = E(X_1 | 𝒥). But we can identify Y directly: take Y = lim sup S_n/n; then the sets {Y ≤ y} are invariant, hence Y is 𝒥-measurable. Take A ∈ 𝒥; then, since we have first mean convergence,

(6.29)  ∫_A Y dP = lim_n (1/n) Σ_{k=1}^{n} ∫_A X_k dP.

Select B ∈ 𝔅_∞ so that A = {(X_k, X_{k+1}, . . .) ∈ B} for all k ≥ 1. Now stationarity gives

∫_A X_k dP = ∫_{{(X_k, . . .) ∈ B}} X_k dP = ∫_{{(X_1, . . .) ∈ B}} X_1 dP = ∫_A X_1 dP.

Use this in (6.29) to conclude

∫_A Y dP = ∫_A X_1 dP, all A ∈ 𝒥.

By definition, Y = E(X_1 | 𝒥).

Definition 6.30. A stationary process X_1, X_2, . . . is ergodic if every invariant event has probability zero or one.

If 𝒥 has this zero–one property, of course the averages converge to EX_1, a.s. Ergodicity, like stationarity, is preserved under taking functions of the process. More precisely,

Proposition 6.31. Let X_1, X_2, . . . be a stationary and ergodic process, φ(x) measurable 𝔅_∞; then the process Y_1, Y_2, . . . defined by

Y_k = φ(X_k, X_{k+1}, . . .)

is ergodic.

Proof. This is very easy to see. Use the same argument as in Proposition 6.6 to conclude that for any B ∈ 𝔅_∞ there is a B_1 ∈ 𝔅_∞ such that for all n,

{(Y_n, Y_{n+1}, . . .) ∈ B} = {(X_n, X_{n+1}, . . .) ∈ B_1}.

Hence, every invariant event on the Y-process coincides with an invariant event on the X-process.

One result that is both interesting and useful in establishing ergodicity is

Proposition 6.32. Let X_1, X_2, . . . be a stationary process. Then every invariant event A is a tail event.

Proof. Take B so that A = {(X_n, X_{n+1}, . . .) ∈ B}, n ≥ 1. Hence A ∈ ℱ(X_n, X_{n+1}, . . .), all n.


Corollary 6.33. Let X_1, X_2, . . . be independent and identically distributed; then the process is ergodic.

Proof. Kolmogorov's zero-one law.

By this corollary we can include the strong law of large numbers as aconsequence of the ergodic theorem, except for the converse.

Problems

12. Show that the event {X_n ∈ B i.o.}, B ∈ 𝔅_1, is invariant.

13. If X_1, X_2, . . . is stationary and if there exists B ∈ 𝔅_∞ such that

show that A is a.s. equal to an invariant event.

14. A process X_1, X_2, . . . is called a normal process if X_1, . . . , X_n have a joint normal distribution for every n. Let EX_i = 0, Γ_{ij} = EX_iX_j.

1) Prove that the process is stationary iff Γ_{ij} depends only on |i − j|.

2) Assume Γ_{ij} = r(|i − j|). Then prove that lim_m r(m) = 0 implies that the process is ergodic.

[Assume that for every n, the determinant of (Γ_{ij}), i, j = 1, . . . , n, is not zero. See Chapter 11 for the definition and properties of joint normal distributions.]

15. Show that X_1, X_2, . . . is ergodic iff for every A ∈ 𝔅_k, k = 1, 2, . . . ,

16. Let X_1, X_2, . . . and Y_1, Y_2, . . . be two stationary, ergodic processes on (Ω, ℱ, P). Toss a coin with probability p of heads, independently of X and Y. If it comes up heads, observe X; if tails, observe Y. Show that the resultant process is stationary, but not ergodic.
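A simulation sketch of Problem 16, taking as concrete (assumed) choices X i.i.d. Bernoulli(0.2), Y i.i.d. Bernoulli(0.8), and p = 1/2: the time averages converge, but to a random limit — 0.2 or 0.8 according to the coin — so the limit is a non-constant invariant random variable and the mixture cannot be ergodic:

```python
import random

def mixture_time_average(n, seed):
    """One coin flip (p = 1/2) selects which process is observed forever:
    X i.i.d. Bernoulli(0.2) on heads, Y i.i.d. Bernoulli(0.8) on tails."""
    rng = random.Random(seed)
    q = 0.2 if rng.random() < 0.5 else 0.8
    return sum(rng.random() < q for _ in range(n)) / n

avgs = [mixture_time_average(50_000, seed) for seed in range(20)]
print(sorted(round(a, 2) for a in avgs))
# every time average sits near 0.2 or 0.8, never near the overall mean 0.5
```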

8. AN APPLICATION

There is a very elegant application of the above ideas due to Spitzer, Kesten, and Whitman (see Spitzer [130, pp. 35 ff.]). Let X_1, X_2, . . . be a sequence of independent identically distributed random variables taking values in the integers. The range R_n of the first n sums is defined as the number of distinct points in the set {S_1, . . . , S_n}. Heuristically, the more the tendency of the sums S_n to return to the origin, the smaller R_n will be, because if we are at a given point k at time n, the distribution of points around k henceforth looks like the distribution of points around the origin starting from n = 1. To


pin this down, write

P(no return) = P(S_n ≠ 0, n = 1, 2, . . .);

then,

Proposition 6.34. ER_n/n → P(no return).

So now,

ER_n = Σ_{k=1}^{n} P(S_k ≠ S_{k−1}, . . . , S_k ≠ S_1)
     = Σ_{k=1}^{n} P(X_k ≠ 0, X_k + X_{k−1} ≠ 0, . . . , X_k + · · · + X_2 ≠ 0)
     = Σ_{k=1}^{n} P(S_1 ≠ 0, . . . , S_{k−1} ≠ 0),

the last equality holding because (X_k, . . . , X_2) has the same distribution as (X_1, . . . , X_{k−1}). Therefore lim_k P(S_1 ≠ 0, . . . , S_{k−1} ≠ 0) = P(no return), and the Cesàro averages converge to the same limit.

The remarkable result is

Theorem 6.35. R_n/n → P(no return) a.s.

Proof. Take N any positive integer, and let Z_k be the number of distinct points visited by the successive sums during the times (k − 1)N + 1 to kN; that is, Z_k is the range of {S_{(k−1)N+1}, . . . , S_{kN}}. Note that Z_k depends only on the X_n for n between (k − 1)N + 1 and kN, so that the Z_k are independent, |Z_k| ≤ N, and are easily seen to be identically distributed. Use the obvious inequality R_{nN} ≤ Z_1 + · · · + Z_n and apply the law of large numbers:

lim sup_n R_{nN}/nN ≤ EZ_1/N a.s.

For n′ not a multiple of N, R_{n′} differs by at most N from one of the R_{nN}, so

lim sup_n R_n/n ≤ EZ_1/N a.s.

But Z_1 = R_N; hence, letting N → ∞ and using 6.34, we get

lim sup_n R_n/n ≤ lim_N ER_N/N = P(no return) a.s.


Going the other way is more interesting. Define

V_k = 1 if S_j ≠ S_k for all j > k, and V_k = 0 otherwise.

That is, V_k is one if at time k, S_k is in a state which is never visited again, and zero otherwise. Now V_1 + · · · + V_n is the number of states visited in time n which are never revisited. R_n is the number of states visited in time n which are not revisited prior to time n + 1. Thus R_n ≥ V_1 + · · · + V_n. Now define

φ(x_1, x_2, . . .) = 1 if x_1 + · · · + x_m ≠ 0 for all m ≥ 1, and φ(x_1, x_2, . . .) = 0 otherwise,

and make the important observation that

V_k = φ(X_{k+1}, X_{k+2}, . . .).

Use 6.31 and 6.33 to show that V_1, V_2, . . . is a stationary, ergodic sequence. Now the ergodic theorem can be used to conclude that

lim inf_n R_n/n ≥ lim_n (V_1 + · · · + V_n)/n = EV_1 a.s.

Of course, EV_1 = P(no return), and this completes the proof.
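Theorem 6.35 can be illustrated by simulation. For the asymmetric simple random walk with steps +1 (probability p) and −1 (probability 1 − p), p > 1/2, the classical gambler's-ruin computation gives P(no return) = 2p − 1, and R_n/n settles near that value (a sketch, not part of the proof):

```python
import random

def range_of_walk(n, p, seed=0):
    """R_n = number of distinct points among S_1, ..., S_n for i.i.d. steps
    +1 (probability p) and -1 (probability 1 - p)."""
    rng = random.Random(seed)
    s, visited = 0, set()
    for _ in range(n):
        s += 1 if rng.random() < p else -1
        visited.add(s)
    return len(visited)

p, n = 0.7, 50_000
rn = range_of_walk(n, p)
print(rn / n)  # near P(no return) = 2p - 1 = 0.4 for this walk
```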

9. RECURRENCE TIMES

For X_0, X_1, . . . stationary, and any set A ∈ 𝔅_1, look at the times that the process enters A, that is, the n such that X_n ∈ A.

Definition 6.36. For A ∈ 𝔅_1, P(X_0 ∈ A) > 0, define

R_1 = min{n; n ≥ 1, X_n ∈ A}, R_2 = min{n; n > R_1, X_n ∈ A},

and so forth. These are the occurrence times of the set A. The recurrence times T_1, T_2, . . . are given by

T_1 = R_1, T_k = R_k − R_{k−1}, k ≥ 2.

If {X_n} is ergodic, then P(X_n ∈ A i.o.) = 1, so the R_k are well defined a.s. But if {X_n} is not ergodic, a subsidiary condition has to be imposed to make the R_k well defined. At any rate, the smaller A is, the longer it takes to get


back to it, and in fact we arrive at the following proposition:

Proposition 6.38. Let X_0, X_1, . . . be a stationary process, A ∈ 𝔅_1 such that

(6.39)  P(X_n ∈ A for some n ≥ 1) = 1;

then the R_k, k = 1, . . . , are finite a.s. On the sample space Ω_A = {ω; X_0 ∈ A} the T_1, T_2, . . . form a stationary sequence under the probability P(· | X_0 ∈ A), and

E(T_1 | X_0 ∈ A) = 1/P(X_0 ∈ A).

Remarks. This means that to get the T_1, T_2, . . . to be stationary, we have to start off on the set A at time zero. This seems too complicated, because once we have landed in A the returns should be stationary; that is, the T_2, T_3, . . . should be stationary under P. This is not so, and counterexamples are not difficult to construct (see Problem 20).

Note that P(X_0 ∈ A) > 0, otherwise condition (6.39) is violated. Therefore, conditional probability given {X_0 ∈ A} is well defined.

Proof. Extend {X_n} to n = 0, ±1, ±2, . . . , using 6.5. By (6.39), P(R_1 < ∞) = 1. From the stationarity,

where C ∈ ℱ(X_0, X_{−1}, . . .). Let n → ∞, so that we get

Go down the ladder to conclude that P(R_1 < ∞) = 1 implies P(R_k < ∞) = 1, k ≥ 1.

To prove stationarity, we need to establish that

P(T_1 = j_1, . . . , T_k = j_k | X_0 ∈ A) = P(T_2 = j_1, . . . , T_{k+1} = j_k | X_0 ∈ A).

This is not difficult to do, but to keep out of notational messes, I prove only that

P(T_1 = j | X_0 ∈ A) = P(T_2 = j | X_0 ∈ A).

The generalization is exactly the same argument. Define random variables U_n = χ_A(X_n), and sets C_k by

C_k = {U_{−k} = 1, U_{−k+1} = 0, . . . , U_{−1} = 0}, k = 1, 2, . . .

The {C_k} are disjoint, and


I assert that

Σ_{k=1}^{∞} P(C_k) = 1.

By stationarity of the {U_n} process,

Σ_{k=1}^{n} P(C_k) = P(U_j = 1 for some −n ≤ j ≤ −1) = P(U_j = 1 for some 1 ≤ j ≤ n).

The limit of the right-hand side is P(X_n ∈ A at least once), which is one by (6.39). Now, using stationarity again, we find that

To compute E(T_1 | X_0 ∈ A), note that

E(T_1 | X_0 ∈ A) = Σ_{j=0}^{∞} P(T_1 > j | X_0 ∈ A), and P(T_1 > j, X_0 ∈ A) = P(U_0 = 1, U_1 = 0, . . . , U_j = 0) = P(C_{j+1}).

Consequently,

E(T_1 | X_0 ∈ A) = Σ_{j=0}^{∞} P(C_{j+1})/P(X_0 ∈ A) = 1/P(X_0 ∈ A).
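The expected-recurrence-time identity of Proposition 6.38 is easy to check by simulation in the simplest case: an i.i.d. {0, 1}-valued process with A = {1}, where the conditional return time is geometric (a sketch with the assumed parameter p = 1/4):

```python
import random

def mean_return_time(p, trials=20_000, seed=2):
    """E(T_1 | X_0 in A) for an i.i.d. {0, 1}-valued process with P(X_n = 1) = p
    and A = {1}.  Conditioning on X_0 = 1 leaves X_1, X_2, ... i.i.d., so T_1
    is the waiting time for the next 1, a geometric random variable."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        t = 1
        while rng.random() >= p:  # wait until the process re-enters A
            t += 1
        total += t
    return total / trials

p = 0.25
m = mean_return_time(p)
print(m)  # near 1/p = 4 = 1/P(X_0 in A)
```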

We can use the ergodic theorem to get a stronger result.

Theorem 6.40. If the process {X_n} is ergodic, then the process {T_k} on {X_0 ∈ A} is ergodic under P(· | X_0 ∈ A).

Proof. By the ergodic theorem,

(1/n) Σ_{k=1}^{n} χ_A(X_k) → P(X_0 ∈ A) a.s.

On every sequence such that the latter holds, R_n → ∞. The sum Σ_{k=1}^{R_n} χ_A(X_k) is the number of visits X_k makes to A up to the time of the nth occurrence. Thus Σ_{k=1}^{R_n} χ_A(X_k) = n, so

n/R_n = (1/R_n) Σ_{k=1}^{R_n} χ_A(X_k) → P(X_0 ∈ A) a.s.,

because R_n → ∞.


Note that for every function f measurable 𝔅_∞, there is a function g measurable 𝔅_∞({0, 1}) such that on the set {R_{k−1} = j},

f(T_k, T_{k+1}, . . .) = g(U_j, U_{j+1}, . . .),

where U_n denotes χ_A(X_n) again. Therefore

Since R_{k−1} → ∞ a.s., if E|g| < ∞ we can use the ergodic theorem as follows:

Because U_0 = χ_A(X_0), this limit is

On the space {X_0 ∈ A}, take f to be any bounded invariant function of T_1, T_2, . . . , that is,

Then (6.41) implies that f(T_1, T_2, . . .) is a.s. constant on {X_0 ∈ A}, implying in turn that the process T_1, T_2, . . . is ergodic.

10. STATIONARY POINT PROCESSES

Consider a class of processes gotten as follows: to every integer n, positive and negative, associate a random variable U_n which is either zero or one. A way to look at this process is as a sequence of points: if a point occurs at time n, then U_n = 1; otherwise U_n = 0. We impose a condition to ensure that these processes have a.s. no trivial sample points.

Condition 6.42. There is probability zero that all Un = 0.

For a point process to be time-homogeneous, the {U_n} have to form a stationary sequence.

Definition 6.43. A stationary discrete point process is a stationary process {U_n}, n = 0, ±1, . . . , where U_n is either zero or one.

Note: P(U_n = 1) > 0; otherwise P(all U_n = 0) = 1. Take A = {1}; then (6.38) implies that given {U_0 = 1} the times between points form a stationary


sequence T_1, T_2, . . . of positive integer-valued random variables such that

E(T_1 | U_0 = 1) = 1/P(U_0 = 1) < ∞.

The converse question is: Given a stationary sequence T_1, T_2, . . . of positive integer-valued random variables, is there a stationary point process {U_n} with interpoint distances having the same distribution as T_1, T_2, . . . ? The difficulty in extracting the interpoint distances from the point process was that an origin had to be pinned down; that is, {U_0 = 1} had to be given. But here the problem is to start with the T_1, T_2, . . . and smear out the origin somehow. Define

Suppose that there is a stationary process {Un} with interpoint distances having the same distribution as T1, . . . Denote the probability on {Un} by Q. Then for s any sequence k long of zeroes and ones

This leads to

(6.44)

The right-hand side above depends only upon the probability P on the T1, T2, . . . process. So if a stationary point process exists with interpoint distances T1, T2, . . . it must be unique. Furthermore, if we define Q directly by means of (6.44), then it is not difficult to show that we get a consistent set of probabilities for a stationary point process having interpoint distances with distribution T1, T2, . . . Thus

Theorem 6.45. Let T1, T2, . . . be a stationary sequence of positive integer-valued random variables such that ET1 < ∞. Then there is a unique stationary point process {Un} such that the interpoint distances given {U0 = 1} have the same distribution as T1, T2, . . .
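For the special case of independent, identically distributed interpoint distances, the construction behind Theorem 6.45 can be sketched numerically: place the first point past the origin at n* with P(n* = k) = P(T1 ≥ k)/ET1 (compare Problems 18 and 19 below), then lay down i.i.d. gaps. The helper name and the discrete gap law below are illustrative choices, not from the text:

```python
import random

def stationary_renewal_points(gap_dist, n_steps, rng):
    """Sample U1, ..., Un of a stationary point process with i.i.d. gaps.

    gap_dist: list of (value, probability) pairs for the interpoint
    distance T.  The first point past the origin is placed at n* with
    P(n* = k) = P(T >= k)/ET (the delayed-renewal recipe for i.i.d.
    gaps); after that the gaps are i.i.d. copies of T.
    """
    values, probs = zip(*gap_dist)
    ET = sum(v * p for v, p in gap_dist)
    ks = list(range(1, max(values) + 1))
    tail_probs = [sum(p for v, p in gap_dist if v >= k) / ET for k in ks]
    t = rng.choices(ks, tail_probs)[0]          # first point past the origin
    U = [0] * (n_steps + 1)
    while t <= n_steps:
        U[t] = 1
        t += rng.choices(values, probs)[0]      # next i.i.d. gap
    return U[1:]

# gaps equal to 1 or 3 with probability 1/2 each, so ET = 2
rng = random.Random(0)
N, trials = 100, 1000
freq = sum(sum(stationary_renewal_points([(1, 0.5), (3, 0.5)], N, rng))
           for _ in range(trials)) / (N * trials)
print(freq)   # stationarity forces P(Un = 1) = 1/ET, so this is near 0.5
```

For Tk ≡ 2 (Problem 17 below) the same recipe puts the first point at 1 or 2 with probability 1/2 each, that is, points on the even or on the odd integers equally likely.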


Proof. The work is in showing that the Q defined in (6.44) is a probability for the desired process. To get stationarity one needs to verify that

Once this is done, then it is necessary to check that the interpoint distances given {U0 = 1} have the same distribution as T1, T2, . . . For example, check that

All this verification we leave to the reader.

Problems

17. Given the process T1, T2, . . . such that Tk = 2 for all k, describe the corresponding stationary point process.

18. If the interpoint distances for a stationary point process, given {U0 = 1}, are T1, T2, . . . , prove that the distribution of the time n* until the first point past the origin, that is,

is given by

19. If the interpoint distances T1, T2, . . . given {U0 = 1} are independent random variables, then show that the unconditioned interpoint distances T2, T3, . . . are independent identically distributed random variables with the same distribution as the conditioned random variables and that (T2, T3, . . .) is independent of the time n* of the first point past the origin.

20. Consider the stationary point process {Un} having independent interpoint distances T1, T2, . . . with P(T1 = 1) = 1 − ε, with ε very small. Now consider the stationary point process

Show that for this latter process the random variables defined by the distance from the first point past the origin to the second point, from the second point to the third point, etc., do not form a stationary sequence.

NOTES

The ergodic theorem was proven by G. D. Birkhoff in 1931 [4]. Since that time significant improvements have been made on the original lengthy proof, and the proof given here is the most recent simplification in a sequence involving many simplifications and refinements.


The nicest texts around on ergodic theorems in a framework of measure-preserving transformations are Halmos [65] and Hopf [73] (the latter available only in German). From the point of view of dynamical systems, see Khintchine [90] and Birkhoff [5]. Recently, E. Hopf generalized the ergodic theorem to operators T acting on measurable functions defined on a measure space (Ω, F, μ) such that T does not increase L1-norm or L∞-norm. For this and other operator-theoretic aspects, see the Dunford-Schwartz book [41].

For X1, . . . a process on (Ω, F, P), we could try to define a set transformation S⁻¹ on F(X) similar to the shift in sequence space as follows: If A ∈ F(X), then take B ∈ B_∞ such that A = {(X1, X2, . . .) ∈ B} and define S⁻¹A = {(X2, X3, . . .) ∈ B}. The same difficulty comes up again; B is not uniquely determined, and if A = {X ∈ B1}, A = {X ∈ B2}, it is not true in general that

But for stationary processes, it is easy to see that these latter two sets differ only by a set of probability zero. Therefore S⁻¹ can be defined, not on F(X), but only on equivalence classes of sets in F(X), where sets A1, A2 ∈ F(X) are equivalent if P(A1 Δ A2) = 0. For a deeper discussion of this and other topics relating to the translation of the ergodic theorem into a probabilistic context, see Doob [39].

The fact that the expected recurrence time of A starting from A is 1/P(X0 ∈ A) is due to Kac [81]. A good development of point processes and proofs of a more general version of 6.45 and the ergodic property of the recurrence times is in Ryll-Nardzewski [119].

CHAPTER 7

MARKOV CHAINS

1. DEFINITIONS

The basic property characterizing Markov chains is a probabilistic analogue of a familiar property of dynamical systems. If one has a system of particles and the positions and velocities of all particles are given at time t, the equations of motion can be completely solved for the future development of the system. Therefore, any other information given concerning the past of the process up to time t is superfluous as far as future development is concerned. The present state of the system contains all relevant information concerning the future. Probabilistically, we formalize this by defining a Markov chain as

Definition 7.1. A process X0, X1, . . . taking values in F ∈ B_1 is called a Markov chain if

for all n > 0 and

For each n, there is a version pn(A | x) of P(Xn+1 ∈ A | Xn = x) which is a regular conditional distribution. These are the transition probabilities of the process. We restrict ourselves to the study of a class of Markov chains which have a property similar to conservative dynamical systems. One way to state that a system is conservative is that if it goes from any state x at t = 0 to y in time τ, then starting from x at any time t it will be in y at time t + τ. The corresponding property for Markov chains is that the transition probabilities do not depend on time.

Definition 7.2. A Markov chain X0, X1, . . . on F ∈ B_1 has stationary transition probabilities p(A | x) if p(A | x) is a regular conditional distribution and if for each A ∈ B_1(F), n > 0, p(A | x) is a version of P(Xn+1 ∈ A | Xn = x). The initial distribution is defined by

The transition probabilities and the initial distribution determine the distribution of the process.
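On a countable state space this determination is concrete: draw X0 from the initial distribution, then repeatedly draw Xn+1 from p(· | Xn). A minimal sketch, with an illustrative two-state kernel that is not taken from the text:

```python
import random

def simulate_chain(p, pi0, n_steps, rng):
    """Sample X0, ..., Xn of a chain with stationary transition
    probabilities p and initial distribution pi0 (countable state space).

    p maps a state x to a list of (next_state, p(next_state | x)) pairs;
    pi0 is a list of (state, probability) pairs.
    """
    states, weights = zip(*pi0)
    x = rng.choices(states, weights)[0]         # X0 ~ pi0
    path = [x]
    for _ in range(n_steps):
        nxt, w = zip(*p[x])
        x = rng.choices(nxt, w)[0]              # X(n+1) ~ p(. | Xn)
        path.append(x)
    return path

# illustrative kernel: p(1 | 0) = 0.3, p(0 | 1) = 0.6
p = {0: [(0, 0.7), (1, 0.3)], 1: [(0, 0.6), (1, 0.4)]}
path = simulate_chain(p, [(0, 1.0)], 10, random.Random(1))
print(len(path), path[0])   # 11 0
```

The sampled path depends on nothing beyond p and pi0, which is exactly the sense in which the kernel and initial distribution determine the process.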


Proposition 7.3. For a Markov chain X0, X1, . . . on F ∈ B_1,

(7.4)

and proceed by induction.

The Markov property as defined in 7.1 simply states that the present state of the system determines the probability for one step into the future. This generalizes easily:

Proposition 7.5. Let X0, X1, . . . be a Markov chain, and C any event in F(Xn+1, Xn+2, . . .). Then

Having stationary transition probabilities generalizes into

Proposition 7.6. If the process X0, X1, . . . is Markov on F ∈ B_1 with stationary transition probabilities, then for every B ∈ B_∞(F) there are versions of

and

which are equal.

The proofs of both 7.5 and 7.6 are left to the reader. Proposition 7.3 indicates how to do the following construction:

Proposition 7.7. Let p(A | x) be a regular conditional probability on B_1(F) given x ∈ F, and π(A) a probability on B_1(F). Then there exists a Markov chain X0, X1, X2, . . . with stationary transition probabilities p(A | x) and initial distribution π.

Proof. Use (7.4) to define probabilities on rectangles. Then it is easy to check that all the conditions of 2.18 are met. Now extend and use the coordinate representation process. What remains is to show that any process satisfying (7.4) is a Markov chain with p(A | x) as its transition


probabilities. (That P(X0 ∈ A) = π(A) is obvious.) For this, use (7.4) for n − 1 to conclude that (7.4) can be written as

But, by definition,

Hence p(A | x) is a version of P(Xn ∈ A | Xn−1 = xn−1, . . .) for all n.

Now we switch points of view. For the rest of this chapter forget about the original probability space (Ω, F, P). Fix one version p(A | x) of the stationary transition probabilities and consider p(A | x) to be the given data. Each initial distribution π, together with p(A | x), determines a probability on (F^(∞), B_∞(F)) which we denote by P̂π, and the corresponding expectation by Êπ. Under P̂π, the coordinate representation process X̂0, X̂1, . . . becomes a Markov chain with initial distribution π. Thus we are concerned with a family of processes having the same transition probability, but different initial distributions. If π is concentrated on the single point {x}, then denote the corresponding probability on B_∞(F) by P̂x, and the expectation by Êx. Under P̂x, X̂0, X̂1, . . . is referred to as the "process starting from the point x." Now eliminate the hat over variables, with the understanding that henceforth all processes referred to are coordinate representation processes with P one of the family of probabilities {Pπ}. For any π, and B ∈ B_∞(F), always use the version of Pπ((X0, . . .) ∈ B | X0 = x) given by Px((X0, . . .) ∈ B).

In exactly the same way as in Chapter 3, call a nonnegative integer-valued random variable n* a stopping time or Markov time for a Markov chain X0, X1, . . . if

Define F(Xn, n ≤ n*) to be the σ-field consisting of all events A such that A ∩ {n* ≤ n} ∈ F(X0, . . . , Xn). Then

Proposition 7.8. If n* is a Markov time for X0, X1, . . . , then

Note. The correct interpretation of the above is: Let

Then the right-hand side is φ(Xn*).

Proposition 7.8 is called the strong Markov property.


Proof. Let A ∈ F(Xn, n ≤ n*); then the integral of the left-hand side of 7.8 over A is

The set A ∩ {n* = n} ∈ F(X0, . . . , Xn). Hence

Putting this back in, we find that

A special case of a Markov chain is the sequence of successive sums S0, S1, S2, . . . of independent random variables Y1, Y2, . . . (with the convention S0 = 0). This is true because independence gives

If the summands are identically distributed, then the chain has stationary transition probabilities:

where F(B) denotes the probability P(Y1 ∈ B) on B_1. In this case, take F(A − x) as the fixed conditional probability distribution to be used. Now letting X0 have any initial distribution π and using the transition probability F(A − x) we get a Markov chain X0, X1, . . . having the same distribution as Y0, Y0 + Y1, Y0 + Y1 + Y2, . . . , where Y0, Y1, . . . are independent, Y0 has the distribution π, and Y1, Y2, . . . all have the distribution F. In particular, the process "starting from x" has the distribution of S0 + x, S1 + x, S2 + x, . . . Call any such Markov process a random walk.
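The last remark admits a quick pathwise check, using a fair-coin step distribution as an illustrative choice: fed the same random input, the walk started at x is the walk started at 0 shifted by x.

```python
import random

def random_walk(x0, n_steps, step_dist, rng):
    """The 'process starting from x0': Xn = x0 + Y1 + ... + Yn with
    i.i.d. steps Yk drawn from step_dist, a list of (value, prob) pairs."""
    vals, probs = zip(*step_dist)
    x, path = x0, [x0]
    for _ in range(n_steps):
        x += rng.choices(vals, probs)[0]
        path.append(x)
    return path

coin = [(1, 0.5), (-1, 0.5)]                     # fair coin-tossing steps
w0 = random_walk(0, 50, coin, random.Random(7))  # S0, S1, ...
w3 = random_walk(3, 50, coin, random.Random(7))  # S0 + 3, S1 + 3, ...
print(all(b == a + 3 for a, b in zip(w0, w3)))   # True
```

The helper name is ours; the point is only that the starting point enters the distribution as a rigid translation.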

Problems

1. Define the n-step transition probabilities p^(n)(A | x) for all A ∈ B_1(F), x ∈ F by


a) Show that p^(n)(A | x) equals Px(Xn ∈ A), hence is a version of

b) Show that p^(n)(A | x) is a regular conditional probability on B_1(F) given x ∈ F, and that for all A ∈ B_1(F), x ∈ F, n, m > 0,

2. Let φ be a bounded measurable function. Demonstrate that Ex φ(X1, X2, . . .) is B_1-measurable in x.

2. ASYMPTOTIC STATIONARITY

There is a class of limit theorems which state that certain processes are asymptotically stationary. Generally these theorems are formulated as: Given a process X1, X2, . . . , a stationary process X1*, X2*, . . . exists such that for every B ∈ B_∞,

The best-known of these relate to Markov chains with stationary transition probabilities p(A | x). Actually, what we would really like to show for Markov chains is that no matter what the initial distribution of X0 is, convergence toward the same stationary limiting distribution sets in; that is, for all B ∈ B_∞(F) and initial distributions π,

where {Xn*} is stationary.

Suppose, to begin, that there is a limiting distribution π(·) on B_1(F)

such that for all x ∈ F, A ∈ B_1(F),

If this limiting distribution π(A) exists, then from

comes

Also,


For A fixed, approximate p(A | x) uniformly by simple functions of x. Taking limits implies that π(A) must satisfy

If π is the limiting distribution of Xn, what happens if we start the process off with the distribution π? The idea is that if π is a limiting steady-state distribution, then, starting the system with this distribution, it should maintain a stable behavior. This is certainly true in the following sense: let us start the process with any initial distribution satisfying (7.9). Then this distribution maintains itself throughout time, that is,

This is established by iterating (7.9) to get π(A) = ∫ p^(n)(A | x) π(dx). In this sense any solution of (7.9) gives stable initial conditions to the process.

Definition 7.10. For transition probabilities p(A | x), an initial distribution π(A) will be called a stationary initial distribution if it satisfies (7.9).

But if a stationary initial distribution is used for the process, much more is true.

Proposition 7.11. Let X0, X1, . . . be a Markov process with stationary transition probabilities such that the initial distribution π(A) satisfies (7.9); then the process is stationary.

Proof. By 7.6 there are versions of P(Xn ∈ An, . . . , X1 ∈ A1 | X0 = x), P(Xn+1 ∈ An, . . . , X2 ∈ A1 | X1 = x) which are equal. Since P(X1 ∈ A) = P(X0 ∈ A) = π(A), integrating these versions over A0 gives

which is sufficient to prove the process stationary.

Furthermore, Markov chains have the additional property that if Px(Xn ∈ A) converges to π(A), all A ∈ B_1(F), x ∈ F, then this one-dimensional convergence implies that the distribution of the entire process is converging to the distribution of the stationary process with initial distribution π.

Proposition 7.12. If p^(n)(A | x) → π(A), all A ∈ B_1(F), x ∈ F, then for any B ∈ B_∞(F), and all x ∈ F,

Proof. Write

Then


and, also,

Now,

or

Under the stated conditions, one can show, by taking simple functions that approximate φ(x) uniformly, that

Therefore, the asymptotic behavior problem becomes: How many stationary initial distributions does a given set of transition probabilities have, and does p^(n)(A | x) converge to some stationary distribution as n → ∞?

3. CLOSED SETS, INDECOMPOSABILITY, ERGODICITY

Definition 7.13. A set A ∈ B_1(F) is closed if p(A | x) = 1 for all x ∈ A.

The reason for this definition is obvious: if X0 ∈ A, and A is closed, then Xn ∈ A with probability one for all n. Hence if there are two closed disjoint sets A1, A2, then

and there is no hope that p^(n)(A1 | x) converges to the same limit for all starting points x.

Definition 7.14. A chain is called indecomposable if there are no two disjoint closed sets A1, A2 ∈ B_1(F).

Use a stationary initial distribution π, if one exists, to get the stationary process X0, X1, . . . If the process is in addition ergodic, then use the ergodic theorem to assert

Take conditional expectations of (7.15), given X0 = x. Use the boundedness and Proposition 4.24 to get


Thus, from ergodicity, it is possible to get convergence of the averages of the p^(n)(A | x) to π(A) a.s. π(dx). The questions of ergodicity of the process, uniqueness of stationary initial distributions, and indecomposability go together.

Theorem 7.16. Let the chain be indecomposable. Then if a stationary initial distribution π exists, it is unique and the process gotten by using π as the initial distribution is ergodic.

Proof. Let C be an invariant event under the shift transformation in B_∞, π a stationary initial distribution. Take φ(x) = Px(C). Now, using π as the initial distribution, write

By the Markov property, since C ∈ F(Xn, Xn+1, . . .) for all n > 0,

By Proposition 7.6,

and by the invariance of C, then

Therefore φ(x) satisfies

By a similar argument, P(C | Xn, . . . , X0) = P(C | Xn) = φ(Xn) a.s. and E(φ(Xn) | Xn−1, . . . , X0) = φ(Xn−1) a.s. Apply the martingale theorem to get φ(Xn) → χ_C a.s. Thus for any ε > 0,

because the distribution of Xn is π. So φ(x) can assume only the two values 0, 1 a.s. π(dx). Define sets

Since φ(x) is a solution of (7.17),

except for a set D such that π(D) = 0. Therefore


Let us define

If p(A1^(n) | x) = 1, x ∈ A1^(n), a.s. π, then π(A1^(n) − A1^(n+1)) = 0. Take Ci = ∩n Ai^(n). Then π(Ci) = π(Ai), but the Ci are closed and disjoint. Hence one of C1, C2 is empty, φ(x) is zero a.s. or one a.s., P(C) = 0 or 1, and the process is ergodic. Now, suppose there are two stationary initial distributions π1 and π2 leading to probabilities P1 and P2 on (Ω, F). By 6.24 there is an invariant C such that P1(C) = 1, but P2(C) = 0. Using the stationary initial distribution π = ½π1 + ½π2 we get the probability

which is again ergodic, by the above argument. But

What has been left is the problem: When does there exist a π(A) such that p^(n)(A | x) → π(A)? If F is countable, this problem has a complete solution. In the case of general state spaces F, it is difficult to arrive at satisfactory conditions (see Doob, Chapter 6). But if a stationary initial distribution π(A) exists, then under the following conditions:

1) the state space F is indecomposable under p(A | x);
2) the motion is nonperiodic; that is, F is indecomposable under the transition probabilities p^(n)(A | x), n = 2, 3, . . . ;
3) for each x ∈ F, p(A | x) ≪ π(A);

Doob [35] has shown that

Theorem 7.18. lim_n p^(n)(A | x) = π(A) for all A ∈ B_1(F), x ∈ F.

The proof is essentially an application of the ergodic theorem and its refinements. [As shown in Doob's paper, (3) can be weakened somewhat.]

4. THE COUNTABLE CASE

The case where the state space F is countable is much easier to understand and analyze. It also gives some insight into the behavior of general Markov chains. Hence assume that F is a subset of the integers, and that we have transition probabilities p(j | k) satisfying


where the summation is over all states in F, and n-step transition probabilities p^(n)(j | k) defined by

This is exactly matrix multiplication: Denote the matrix {p(j | k)} by P; then the n-step transition probabilities are the elements of P^n. Therefore, if F has a finite number of states the asymptotic stationarity problem can be studied in terms of what happens to the elements of the matrix as it is raised to higher and higher powers. The theory in this case is complete and detailed. (See Feller [59, Vol. I, Chapter 16].) The idea that simplifies the theory in the countable case is the renewal concept. That is, if a Markov chain starts in state j, then every time it returns to state j, the whole process starts over again as from the beginning.
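The matrix point of view is easy to watch in action: for a finite F the n-step probabilities are literally the entries of P raised to the nth power, and for a well-behaved kernel the rows flatten out toward a common limit. The two-state matrix below is an illustrative example, not one from the text:

```python
def mat_mult(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def n_step(P, n):
    """n-step transition probabilities: the entries of P**n."""
    Q = P
    for _ in range(n - 1):
        Q = mat_mult(Q, P)
    return Q

# illustrative kernel with P[i][j] = p(j | i); stationary pi = (2/3, 1/3)
P = [[0.7, 0.3], [0.6, 0.4]]
P50 = n_step(P, 50)
print([round(v, 4) for v in P50[0]])   # [0.6667, 0.3333]
print([round(v, 4) for v in P50[1]])   # [0.6667, 0.3333]
```

Both rows agree to many decimal places: p^(50)(j | k) has forgotten the starting state k, which is the asymptotic stationarity the section is after.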

5. THE RENEWAL PROCESS OF A STATE

Let X0, X1, . . . be the Markov chain starting from state j. We ignore transitions to all other states and focus attention only on the returns of the process to the state j. Define random variables U1, U2, . . . by

By the Markov property and stationarity of the transition probabilities, for any B ∈ B_∞({0, 1}), (s1, . . . , sk) ∈ {0, 1}^(k),

This simple relationship partially summarizes the fact that once the process returns to j, the process starts anew. In general, any process taking values in {0, 1} and satisfying (7.20) is called a renewal process. We study the behavior at a single state j by looking at the associated process U1, U2, . . . governing returns to j. Define the event G that a return to state j occurs at least once by

Theorem 7.21. The following dichotomy is in force


Proof. Let Fn = {Un = 1, Un+k = 0, all k ≥ 1}, n ≥ 1; F0 = {Uk = 0, all k ≥ 1}. Thus Fn is the event that the last return to j occurs at time n. Hence

The Fn are disjoint, so

and by (7.20),

According to the definitions, F0 = G^c; hence

for all n > 0; so Pj(Xn = j i.o.) = 1. Then

otherwise, the Borel-Cantelli lemma would imply Pj(Xn = j i.o.) = 0. If Pj(G) < 1, then Pj(F0) > 0, and we can use expression (7.22) to substitute for Pj(Fn), getting

This implies Σn Pj(Xn = j) < ∞ and thus Pj(Xn = j i.o.) = 0.

Definition 7.24. Call the state j recurrent if Pj(Xn = j i.o.) = 1, transient if Pj(Xn = j i.o.) = 0.

Note that 7.21 in terms of transition probabilities reads, "j is recurrent iff Σn p^(n)(j | j) = ∞." Define Rk as the time of the kth return to j, and the times between returns as T1 = R1, Tk = Rk − Rk−1, k > 1. Then

Proposition 7.25. If j is recurrent, then T1, T2, . . . are independent, identically distributed random variables under the probability Pj.

Proof. T1 is a Markov time for X0, X1, . . . By the strong Markov property 7.8,

Therefore, the process (X_{T1+1}, . . .) has the same distribution as (X1, . . .) and is independent of F(Xn, n ≤ T1), hence is independent of T1.


The T1, T2, . . . are also called the recurrence times for the state j. The result of 7.25 obviously holds for any renewal process with T1, T2, . . . the times between successive ones in the U1, U2, . . . sequence.

Definition 7.26. Call the recurrent state j

positive-recurrent if Ej T1 < ∞, null-recurrent if Ej T1 = ∞.

Definition 7.27. The state j has period d if T1 is distributed on the lattice Ld under Pj. If d > 1, call the state periodic; if d = 1, nonperiodic. [Recall from (3.32) that Ld = {nd}, n = 0, ±1, . . .]

For j recurrent, let the random vectors Zk be defined by

Then Zk takes values in the space R of all finite sequences of integers, with B the smallest σ-field containing all sets of the form

where l is any integer. Since the length of blocks is now variable, an interesting generalization of 7.25 is

Theorem 7.28. The Z0, Z1, . . . are independent and identically distributed random vectors under the probability Pj.

Proof. This follows from 7.8 by seeing that T1 is a Markov time. Z0 is measurable F(Xn, n ≤ T1) because for B ∈ B,

By 7.8, for B ∈ B,

Now Z1 is the same function of the X_{T1}, X_{T1+1}, . . . process as Z0 is of the X0, X1, . . . process. Hence Z1 is independent of Z0 and has the same distribution.

Call events {An} a renewal event ℰ if the random variables χ_{An} form a renewal process. Problems 3 and 4 are concerned with these.

Problems

3. (Runs of length at least N). Consider coin-tossing (biased or fair) and define

Prove that {An} form a renewal event. [Note that An is the event such that at time n a run of at least N heads has just finished.]

4. Given any sequence tN, N long, of H, T, let (Ω, F, P) be the coin-tossing game (fair or biased). Define An = {ω; (ωn, . . . , ωn−N+1) = tN},


(An = ∅ if n < N). Find necessary and sufficient conditions on tN for the {An} to form a renewal event.

5. If Pj(Xn = j i.o.) = 0, then show that

6. Use Problem 5 to show that for a biased coin

P(no equalizations) = |p − q|.

7. Use Theorem 7.21 to show that if {Zn} are the successive fortunes in a fair coin-tossing game,

6. GROUP PROPERTIES OF STATES

Definition 7.29. If there is an n such that p^(n)(k | j) > 0, j to k (denoted by j → k) is a permissible transition. If j → k and k → j, say that j and k communicate, and write j ↔ k.

Communicating states share properties:

Theorem 7.30. If j ↔ k, then j and k are simultaneously transient, null-recurrent, or positive-recurrent, and have the same period.

Proof. Use the martingale result, Problem 9, Chapter 5, to deduce that under any initial distribution the sets {Xn = j i.o.} and {Xn = k i.o.} have the same probability. So both are recurrent or transient together. Let T1, T2, . . . be the recurrence times for j starting from j, and assume Ej T1 < ∞. Let Vn = 1 or 0 as there was or was not a visit to state k between times Rn−1 and Rn. The Vn are independent and identically distributed (7.28) with P(Vn = 1) > 0. Denote by n* the first n such that Vn = 1, T the time of the first visit to state k. Then

But {n* > n} = {n* ≤ n}^c, and {n* ≤ n} ∈ F(Xk, k ≤ Rn−1), thus is independent of Tn. Hence

Obviously, Ej n* < ∞, so Ej T < ∞. Once k has occurred, the time until another occurrence of k is less than the time to get back to state j plus the time until another occurrence of k starting from j. This latter time has the same distribution as T. The former time must have finite expectation, otherwise Ej T1 = ∞. Hence k has a recurrence time with finite expectation.


Take n1, n2 such that p^(n1)(k | j) > 0, p^(n2)(j | k) > 0. If T^(k) is the first return time to k, define Ik = {n; Pk(T^(k) = n) > 0}, L_{d2} the smallest lattice containing Ik, L_{d1} the smallest lattice containing Ij. Diagrammatically, we can go:

That is, if m ∈ Ik, then n1 + m + n2 ∈ Ij, so n1 + m + n2 ∈ L_{d1}; since also n1 + n2 ∈ L_{d1}, it follows that L_{d2} ⊂ L_{d1}, and d2 ≥ d1. The converse argument gives d1 ≥ d2. Hence

so d1 = d2. Q.E.D.

Let J be the set of recurrent states. Communication (↔) is clearly an equivalence relationship on J, so splits J into disjoint classes C1, C2, . . .

Proposition 7.31. For any Ci,

that is, each Ci is a closed set of states.

Proof. If j is recurrent and j → k, then k → j. Otherwise, Pj(Xn = j i.o.) < 1. Hence the set of states {k; j → k} = {k; j ↔ k}, but this latter is exactly the equivalence class containing j. The sum of p(k | j) over all k such that j → k is clearly one.

Take C to be a closed indecomposable set of recurrent states. They all have the same period d. If d > 1, define the relationship ↔(d) as j ↔(d) k if p^(n1 d)(k | j) > 0, p^(n2 d)(j | k) > 0 for some n1, n2. Since j ↔(d) k is an equivalence relationship, C may be decomposed into disjoint sets D1, D2, . . . under ↔(d).

Proposition 7.32. There are d disjoint equivalence classes D1, D2, . . . , Dd under ↔(d), and they can be numbered such that

or diagrammatically,

The D1, . . . , Dd are called cyclically moving subsets.


Proof. Denote by j →(l) k the existence of an n such that n ≡ l (mod d) and p^(n)(k | j) > 0. Fix j1 and number D1 as the equivalence class containing j1. Take j2 such that p(j2 | j1) > 0, and number D2 the equivalence class containing j2, and so on. See that j →(l) k implies that k →(d−l) j. But

so j1 ↔(d) j_{d+1}, hence D_{d+1} = D1. Consider any state k such that p(k | j) > 0, j ∈ D1 and k ∉ D2, say k ∈ D3. Then k ↔(d) j3. Look at the string

This string leads to j1 → j1 in a number of steps not congruent to 0 (mod d). From this contradiction, conclude k ∈ D2, and Σ_{k∈D2} p(k | j) = 1, j ∈ D1.

If C is a closed set of communicating nonperiodic states, one useful result is that for any two states j, k, all other states are common descendants. That is:

Proposition 7.33. For j, k, l, any three states, there exists an n such that p^(n)(l | j) > 0, p^(n)(l | k) > 0.

Proof. Take n1, n2 such that p^(n1)(l | j) > 0, p^(n2)(l | k) > 0. Consider the set J = {n; p^(n)(l | l) > 0}. By the nonperiodicity, the smallest lattice containing J is L1. Under addition, J is closed, so every integer can be expressed as s1 m1 − s2 m2, m1, m2 ∈ J, s1, s2 nonnegative integers. Take n2 − n1 = s1 m1 − s2 m2, so n1 + s1 m1 = n2 + s2 m2 = n. Now check that p^(n)(l | j) > 0, p^(n)(l | k) > 0.

Problems

8. If the set of states is finite, prove that

a) there must be at least one recurrent state;
b) every recurrent state is positive-recurrent;
c) there is a random variable n* such that for n > n*, Xn is in the set of recurrent states.

9. Give an example of a chain with all states transient.

7. STATIONARY INITIAL DISTRIBUTIONS

Consider a chain X0, X1, . . . starting from an initial state i such that i is recurrent. Let Nn(j) be the number of visits of X1, . . . , Xn to j, and π̄(j) be the expected number of visits to j before return to i, π̄(i) = 1. The proof of Theorem 3.45 goes through, word for word. Use (3.46) again to conclude


The relevance here is

Proposition 7.34

for all k such that i → k.

Proof. A visit to state j occurs on the nth trial before return to state i if {Xn = j, Xn−1 ≠ i, . . . , X1 ≠ i}. Therefore

so that

For k = i, Pi(Xn+1 = i, Xn ≠ i, . . . , X1 ≠ i) = Pi(T^(i) = n + 1), and the right-hand side becomes Σn Pi(T^(i) = n) = 1, where T^(i) is the time of first recurrence of state i.

The {π̄(j)}, therefore, form a stationary initial measure for the chain. By summing, we get

If i is positive-recurrent, then π(j) = π̄(j)/Ei T^(i) forms a stationary initial distribution for the process. By Proposition 6.38, if T^(j) is the first recurrence time for state j, starting from j, then

Every equivalence class C of positive-recurrent states thus has the unique stationary initial distribution given by π(j). Note that π(j) > 0 for all j in


the class. Use the ergodic theorem to conclude

Restrict the state space to such a class C.

Proposition 7.35. Let λ(j) be a solution of

such that Σj |λ(j)| < ∞. Then there is a constant c such that λ(j) = c π(j).

Proof. By iteration

Consequently,

The inner term converges to π(j), and is always ≤ 1. Use the bounded convergence theorem to get

This proposition is useful in that it permits us to get the π(j) by solving the system π(j) = Σk p(j | k) π(k).
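For a finite class that system is plain linear algebra: solve π = πP together with Σ π(j) = 1. Below is a sketch using exact rational arithmetic on an illustrative two-state kernel with P[i][j] = p(j | i); the function name is ours, not the book's.

```python
from fractions import Fraction as F

def stationary_distribution(P):
    """Solve pi(j) = sum_k p(j | k) pi(k) with sum(pi) = 1 by Gaussian
    elimination; one redundant balance equation is replaced by the
    normalization row."""
    n = len(P)
    A = [[P[k][j] - (1 if j == k else 0) for k in range(n)] for j in range(n)]
    A[-1] = [F(1)] * n                      # replace last row by sum(pi) = 1
    b = [F(0)] * (n - 1) + [F(1)]
    for i in range(n):                      # forward elimination (exact)
        piv = next(r for r in range(i, n) if A[r][i] != 0)
        A[i], A[piv], b[i], b[piv] = A[piv], A[i], b[piv], b[i]
        for r in range(i + 1, n):
            f = A[r][i] / A[i][i]
            A[r] = [a - f * c for a, c in zip(A[r], A[i])]
            b[r] -= f * b[i]
    x = [F(0)] * n
    for i in reversed(range(n)):            # back substitution
        x[i] = (b[i] - sum(A[i][j] * x[j] for j in range(i + 1, n))) / A[i][i]
    return x

P = [[F(7, 10), F(3, 10)], [F(3, 5), F(2, 5)]]
pi = stationary_distribution(P)
print(pi)   # [Fraction(2, 3), Fraction(1, 3)]
```

By Theorem 7.16 this solution is unique once the class is indecomposable, which is why solving the linear system is enough.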

8. SOME EXAMPLES

Example A. Simple symmetric random walks. Denote by I the integers, and by I^(m) the space j = (j1, . . . , jm) of all m-tuples of integers. If the particle is at j, then it makes the transition to any one of the 2m nearest neighbors (j1 ± 1, j2, . . . , jm), (j1, j2 ± 1, . . . , jm), . . . with probability 1/2m. The distribution of this process starting from j is given by j + Y1 + · · · + Yn, where Y1, Y2, . . . are independent identically distributed random vectors taking values in (±1, 0, . . . , 0), (0, ±1, 0, . . . , 0), . . . with equal probabilities. All states communicate. The chain has period d = 2. Denote 0 = (0, 0, . . . , 0). For m = 1, the process starting from 0 is the fair coin-tossing game. Thus for m = 1, all states are null-recurrent. Polya [116] discovered the interesting phenomenon that if m = 2, all states are again null-recurrent, but for m ≥ 3 all states are transient. In fact, every random walk on I^(m) that is genuinely m-dimensional is null-recurrent for m ≤ 2, transient for m ≥ 3. See Chung and Fuchs [18].
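Polya's dichotomy is visible in a crude Monte Carlo experiment: estimate the probability that the walk started at 0 revisits 0 within a fixed horizon. In m = 1 the estimate sits near 1, while in m = 3 it stays well below 1 (the total return probability in three dimensions is known to be about 0.34). Horizon and trial counts below are arbitrary choices:

```python
import random

def returned(m, max_steps, rng):
    """One path of the simple symmetric walk on I(m); True if it
    revisits the origin within max_steps steps."""
    pos = [0] * m
    for _ in range(max_steps):
        pos[rng.randrange(m)] += rng.choice((-1, 1))   # move one coordinate
        if all(c == 0 for c in pos):
            return True
    return False

rng = random.Random(0)
trials, horizon = 1000, 200
ests = {}
for m in (1, 2, 3):
    ests[m] = sum(returned(m, horizon, rng) for _ in range(trials)) / trials
    print(m, round(ests[m], 2))
```

A finite horizon cannot distinguish null recurrence from slow recurrence, so this is only suggestive; the m = 2 estimate, in particular, creeps toward 1 very slowly as the horizon grows.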


Example B. The renewal chain. Another way of looking at a renewal process which illuminates the use of the word renewal is the following: At time zero, a new light bulb is placed in a fixture. Let T1 be the number of periods (integral) that it lasts. When it blows out at time n, it is replaced by another light bulb starting at time n that lasts T2 periods; the kth light bulb lasts Tk periods. The light bulbs are of identical manufacture and a reasonable model is to assume that the T1, T2, . . . are independent and identically distributed random variables. Also assume each bulb lasts at least one period; that is, P(T1 ≥ 1) = 1. Let An be the event that a light bulb blows out at time n. Intuitively, this starts the whole process over again. Mathematically, it is easy to show that {An} form a renewal event. Formally, An = {ω; ∃ a k such that T1 + · · · + Tk = n}. Now the point is that 7.25 shows that the converse is true; given any renewal process U1, U2, . . . , and letting T1, T2, . . . be the times between occurrences of {Uk = 1}, we find that if P(T1 < ∞) = 1, then

where the T1, T2, . . . are independent and identically distributed.

For a Markov chain X0, X1, . . . starting from state j, the events {Xn = j}

form a renewal event. Now we ask the converse question: given a renewal process U1, U2, . . . , is there a Markov process X0, X1, . . . starting from the origin such that the process Û1, Û2, . . . , defined by

has the same distribution as U1, U2, . . . ? Actually, we can define a Markov process X0, X1, . . . on the same sample space as the renewal process such that

Definition 7.36. For a renewal process U1, U2, . . . , add the convention R0 = 0, and define the time of the last replacement prior to time n as

The age of the current item is defined by

Clearly,

The Xn process, as defined, takes values in the nonnegative integers, and X0 = 0. What does it look like? If Xn = j, then τn = n − j and on this set there exists no k such that n − j < Rk ≤ n. Therefore, on the set τn = n − j, either τn+1 = n − j or τn+1 = n + 1. So if Xn = j, either


Xn+1 = j + 1 or Xn+1 = 0. Intuitively, either a renewal takes place at time n + 1 or the item ages one more time unit. Clearly, X0, X1, . . . is a Markov chain with the stationary transition probabilities

All states communicate, the chain has period determined by the minimal lattice on which T_1 is distributed, and is null-recurrent or positive-recurrent as E T_1 = ∞ or E T_1 < ∞.

If E T_1 < ∞, there is a stationary point process U*_1, U*_2, ... having interpoint distances T*_1, T*_2, ... with the same distribution as T_1, T_2, ... For this process use R*_k as the time of the kth point past the origin, and define τ*_n, X*_n as above. The transition probabilities of X*_n are the same as for X_n, but X*_n is easily seen to be stationary. At n = 0, the age of the current item is k on the set {U*_{−1} = 0, ..., U*_{−k+1} = 0, U*_{−k} = 1}. The probability of this set is, by stationarity,

Hence π(k) = P(T_1 > k)/E T_1 is a stationary initial distribution for the process. The question of asymptotic stationarity of X_0, X_1, ... is equivalent to asking if the renewal process is asymptotically stationary in the sense

for every B ∈ ℬ_n({0, 1}).
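The age chain of Definition 7.36 and its stationary distribution π(k) = P(T_1 > k)/E T_1 are easy to explore numerically. The following Python sketch is not part of the text; the lifetime law uniform on {1, 2, 3} is our own illustrative choice:

```python
import random

def simulate_age_chain(lifetime_sampler, steps, seed=0):
    """Simulate the age X_n of the item in service at time n.

    At each renewal a fresh lifetime T >= 1 is drawn; the age climbs
    0, 1, ..., T-1 and then drops back to 0 at the next renewal.
    """
    rng = random.Random(seed)
    counts = {}
    age, remaining = 0, lifetime_sampler(rng)
    for _ in range(steps):
        counts[age] = counts.get(age, 0) + 1
        remaining -= 1
        if remaining == 0:            # renewal: item replaced
            age, remaining = 0, lifetime_sampler(rng)
        else:
            age += 1                  # item survives one more period
    return {k: c / steps for k, c in counts.items()}

# Lifetimes T uniform on {1, 2, 3}: E[T] = 2, so pi(k) = P(T > k)/2.
unif3 = lambda rng: rng.randint(1, 3)
freq = simulate_age_chain(unif3, 200_000)
exact = {0: 3/6, 1: 2/6, 2: 1/6}     # P(T>0)/2, P(T>1)/2, P(T>2)/2
for k in exact:
    assert abs(freq[k] - exact[k]) < 0.01
```

The long-run fraction of time the item in service has age k matches π(k), in line with the stationarity argument above.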

Problem 10. Show that a sufficient condition for the U_1, U_2, ... process to be asymptotically stationary in the above sense is

Example C. Birth and death processes. These are a class of Markov chains in which the state space is the integers I or the nonnegative integers I+, and where, if the particle is at j, it can move either to j + 1 with probability α_j, to j − 1 with probability β_j, or remain at j with probability 1 − α_j − β_j. If the states are I, assume all states communicate. If the states are I+, 0 is either an absorbing state (defined as any state i such that p(i | i) = 1) or reflecting (p(1 | 0) > 0). Assume that all other states communicate between themselves, and can get to zero. Equivalently, α_j ≠ 0, β_j ≠ 0, for j > 0. If 0 is absorbing, then all other states are transient, because j → 0 but 0 does not lead to j, j ≠ 0. Therefore, for almost every sample path, either X_n → ∞ or X_n → 0. If 0 is reflecting, the states can be transient or recurrent, either positive or null.


148 MARKOV CHAINS 7.8

To get a criterion, let τ*_j be the first passage time to state j starting from zero:

Let A_j be the event that a return to zero occurs between τ*_j and τ*_{j+1},

The τ*_j are Markov times, and we use 7.8 to conclude that the A_j are independent events. By the Borel-Cantelli lemma, the process is transient or recurrent as Σ_j P(A_j) < ∞ or Σ_j P(A_j) = ∞. Let τ'_j be the first time after τ*_j that X_n ≠ j. Then P(A_j) = E(P(A_j | X_{τ'_j})). Now

On the set X_{τ'_j} = j − 1, A_j can occur either by a return to zero before climbing to j, or by a return to zero only after climbing to j but before climbing to j + 1. Since τ'_j is a Markov time, by the strong Markov property

Checking that P(X_{τ'_j} = j − 1) = β_j/(α_j + β_j) gives the equation

or

where ρ_j = β_j/α_j. Direct substitution verifies that

Certainly, if Σ_j ρ_j < ∞ then Σ_j P(A_j) < ∞. To go the other way, note that since S_j = S_{j−1}/(1 − ρ_j/S_j), then

We have proved

Proposition 7.37. A birth and death process on I+ with the origin reflecting is transient iff

(Due to Harris [68]. See Karlin [86, p. 204] for an alternative derivation.) To discriminate between null and positive recurrence is easier.

Problem 11. Use the condition that


has no solutions such that Σ_k |A(k)| < ∞ to find a necessary and sufficient condition that a recurrent birth and death process on I+ be null-recurrent.

Example D. Branching processes. These processes are characterized as follows: If at time n there are k individuals present, then the jth one, independently of the others, gives birth to Y_j offspring by time n + 1, j = 1, ..., k, where P(Y_j = l) = p_l, l = 0, 1, ... The event {Y_j = 0} corresponds to the death of the jth individual leaving no offspring. The state space is I+; the transition probabilities for X_n, the population size at time n, are

where the Y_1, ..., Y_k are independent and have the same distribution as Y_1. Zero is an absorbing state (unless the model is revised to allow the introduction of new individuals into the population). If p_0 > 0, then the same argument as for birth and death processes establishes the fact that every state except zero is transient. If p_0 = 0, then obviously the same result holds. For a complete and interesting treatment of these chains and their generalizations, see Harris [67].

Problem 12. In a branching process, suppose EY_1 = m < ∞. Use the martingale convergence theorem to show that X_n/m^n converges a.s.

Example E. The Ehrenfest urn scheme. Following the work of Gibbs and Boltzmann, statistical mechanics was faced with this paradox: For a system of particles in a closed container, referring to the 6N position-velocity vector as the state of the system, in the ergodic case every state is recurrent in the sense that the system returns infinitely often to every neighborhood of any initial state.

On the other hand, the observed macroscopic behavior is that a system seems to move irreversibly toward an equilibrium condition. Smoluchowski proposed the solution that states far removed from equilibrium have an enormously large recurrence time; thus the system, over any reasonable observation time, appears to move toward equilibrium. To illustrate this, the Ehrenfests constructed a model as follows: consider two urns I and II, and a total of 2N molecules distributed within the two urns. At time n, a molecule is chosen at random from among the 2N and is transferred from whatever urn it happens to be in to the other urn. Let the state k of the chain be the number of molecules in urn I, k = 0, ..., 2N. The transition probabilities are given by

All states communicate, and since there are only a finite number, all are positive-recurrent. We can use the fact that the stationary distribution satisfies π(k) = 1/E_k(T_k), where T_k is the recurrence time of state k, to get the expected recurrence times.

Problem 13. Use the facts that

to show that


Compare this with the derivation of the central limit theorem for coin-tossing, Chapter 1, Section 3, and show that for N large, if T is the recurrence time for the states {k : |N − k| > x√N},

See Kac [80] for further discussion.
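Smoluchowski's resolution can be seen numerically. The sketch below (Python, not part of the text) assumes the binomial stationary distribution π(k) = C(2N, k)/2^{2N}, which Problem 13 asks the reader to derive, and uses π(k) = 1/E_k(T_k):

```python
from math import comb

def ehrenfest_stationary(N, k):
    """Stationary probability of k molecules in urn I (2N in total)."""
    return comb(2 * N, k) / 4 ** N      # C(2N, k) / 2^(2N)

def expected_recurrence(N, k):
    """E_k(T_k) = 1 / pi(k) for a positive-recurrent chain."""
    return 1.0 / ehrenfest_stationary(N, k)

# Detailed balance check: pi(k) p(k+1 | k) == pi(k+1) p(k | k+1),
# with p(k+1 | k) = (2N - k)/2N and p(k-1 | k) = k/2N.
N = 10
for k in range(2 * N):
    lhs = ehrenfest_stationary(N, k) * (2 * N - k) / (2 * N)
    rhs = ehrenfest_stationary(N, k + 1) * (k + 1) / (2 * N)
    assert abs(lhs - rhs) < 1e-15

# The equilibrium state k = N recurs quickly, while the extreme
# state k = 0 takes about 4^N steps to recur.
assert expected_recurrence(N, 0) == 4 ** N
print(expected_recurrence(N, N))   # about 5.7 for N = 10
```

Already for N = 10 the recurrence time of the empty urn is about 10^6 steps, versus about 6 steps near equilibrium; for molecular values of N the disparity is astronomical.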

9. THE CONVERGENCE THEOREM

The fundamental convergence result for Markov chains on the integers is

Theorem 7.38. Let C be a closed indecomposable set of nonperiodic recurrent states. If the states are null-recurrent, then for all j, k ∈ C

If the states are positive-recurrent with stationary initial distribution π(j), then for all j, k ∈ C

There are many different proofs of this. Interestingly enough, the various proofs are quite diverse in origin and approach. One simple proof is based on

Theorem 7.39 (The renewal theorem). For a nonperiodic renewal process,

There is a nice elementary proof of this in Feller [59, Volume I], and we prove a much generalized version in Chapter 10.
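For a chain with finitely many states, the positive-recurrent conclusion of Theorem 7.38 can be watched directly by powering the transition matrix. A sketch (Python is not part of the text, and the 3-state matrix is our own illustrative choice):

```python
def matpow_rows(P, n):
    """n-step transition probabilities by repeated multiplication."""
    size = len(P)
    R = [row[:] for row in P]
    for _ in range(n - 1):
        R = [[sum(R[i][m] * P[m][j] for m in range(size))
              for j in range(size)] for i in range(size)]
    return R

# An irreducible aperiodic chain on {0, 1, 2} (values our choice).
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.1, 0.4, 0.5]]

Pn = matpow_rows(P, 100)
# Rows of P^n agree: p^(n)(j | k) has a limit pi(j) independent of k.
for j in range(3):
    col = [Pn[k][j] for k in range(3)]
    assert max(col) - min(col) < 1e-10

pi = Pn[0]
# The limit is stationary: pi P = pi.
for j in range(3):
    assert abs(sum(pi[k] * P[k][j] for k in range(3)) - pi[j]) < 1e-10
```

The loss of dependence on the starting row is exactly the convergence p^(n)(j | k) → π(j) asserted by the theorem.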

The way this theorem is used in 7.38 is that for {U_n} the return process for state j, P_j(U_n = 1) = p^(n)(j | j); hence p^(n)(j | j) → π(j) if E_j T^(j) < ∞, or p^(n)(j | j) → 0 if j is null-recurrent. No matter where the process is started, let T^(j) be the first time that j is entered. Then

by the Markov property. Argue that P_k(T^(j) < ∞) = 1 (see the proof of 7.30). Now use the bounded convergence theorem to establish


We can also use some heavier machinery to get a stronger result due to Orey [114]. We give this proof, from Blackwell and Freedman [9], because it involves an interesting application of martingales and the Hewitt-Savage zero-one law.

Theorem 7.41. Let C be a closed indecomposable set of nonperiodic recurrent states. Then for any states k, l in C

Remark. We can get 7.38 in the positive-recurrent case from 7.41 by noting that 7.41 implies

In the null-recurrent case we use an additional fact: First, consider the event A_m that, starting from j, the last entry into j up to time n was at time n − m: A_m = {X_{n−m} = j, X_{n−m+1} ≠ j, ..., X_n ≠ j}. The A_m are disjoint and their union is the whole space. Furthermore, for the process starting from j,

Consequently,

(where p^(0)(j | j) = 1). Let lim sup_n p^(n)(j | j) = ρ. Take a subsequence n' such that p^(n')(j | j) → ρ.

By 7.41, for any other state k, p^(n')(j | k) → ρ. Use this to get

Then for any fixed r > 0, p^(n'+r)(j | j) → ρ. Substitute n = n' + r in (7.42), and chop off some terms to get

Noting that

implies ρ = 0, and using 7.41 we can complete the proof that p^(n)(j | k) → 0.


Proof of 7.41. Take π any initial distribution such that π(l) > 0. Let 𝒯 be the tail σ-field of the process X_0, X_1, ..., and suppose that 𝒯 has the zero-one property under P_π. Then an easy consequence of the martingale convergence theorem (see Problem 6, Chapter 5) is that for any A ∈ ℱ(X),

or

Write C_n+ = {j : P_l(X_n = j) >= P_π(X_n = j)}, and let C_n− be the complement of C_n+ in C. Then by the above,

implying

Now use the initial distribution π which assigns mass ½ each to the states l and k to get the stated result.

The completed proof is provided by

Theorem 7.43. For any tail event A, either P_j(A) is one for all j ∈ C, or zero for all j ∈ C (under the conditions of 7.41).

Proof. Consider the process starting from j. The random vectors Z_k = (X_{R_k}, ..., X_{R_{k+1}−1}) are independent and identically distributed by 7.28. Clearly, ℱ(X) = ℱ(Z_0, Z_1, ...). Take W a tail random variable; that is, W is measurable 𝒯.

For every n, there is a random variable φ_n(x) on (R^(∞), ℬ_∞) such that W = φ_n(X_n, ...). So for every k,

Now R_k is a symmetric function of Z_0, ..., Z_{k−1}. Hence W is a symmetric function of Z_0, Z_1, ... The Hewitt-Savage zero-one law holds word for word for independent identically distributed random vectors, instead of random variables. Therefore W is a.s. constant, and P_j(A) is zero or one for every tail event A.

For any two states j, k let l be a descendant: p^(n)(l | j) > 0, p^(n)(l | k) > 0. Write


Since A is measurable ℱ(X_{n+1}, ...),

If P_j(A) = 1, then P_j(A | X_n = l) ≠ 0. But using k instead of j above, we get

Hence Pk(A) = 1, and the theorem is proved.

From this fundamental theorem follows a complete description of the asymptotic behavior of the p^(n)(j | k). If a closed communicating set of positive-recurrent states has period d, then any one of the cyclically moving subclasses D_r, r = 1, ..., d, is nonperiodic and closed under the transition probability p^(d)(j | k). Looking at this class at time steps d units apart, conclude that

If both transient states and positive-recurrent states are present, then the asymptotic behavior of p^(n)(j | k), for j positive-recurrent and nonperiodic and k transient, depends on the probability P(C | k) that starting from k the process will eventually enter the class of states C communicating with j. From (7.40), in fact,

When j is periodic, the behavior depends not only on P(C | k) but also on the point in the cycle of motion in C at which the particle from k enters C.

10. THE BACKWARD METHOD

There is a simple device which turns out to be important, both theoretically and practically, in the study of Markov chains. Let Z = φ(X_0, X_1, ...) be any random variable on a Markov chain with stationary transition probabilities. Then the device is to get an equation for f(x) = E(Z | X_0 = x) by using the fact that

Of course, this will be useful only if E(Z | X_1 = y, X_0 = x) can be expressed in terms of f. The reason I call this the backward method is that it is the initial conditions of the process that are perturbed. Here are some examples.


Mostly for convenience, look at the countable cases. It is not difficult to see how the same method carries over to similar examples in the general case.

a) Invariant random variable. Let Z be a bounded invariant function, Z = φ(X_n, ...), n >= 0. Then if E(Z | X_0 = j) = f(j),

so that f(j) is a bounded solution of

There is an interesting converse. Let f(j) be a bounded solution of (7.44). Then write (7.44) as

By the Markov property,

This says that f(X_n) is a martingale. Since it is bounded, the convergence theorem applies, and there is a random variable Y such that

If Y = φ(X_0, X_1, ...), from

conclude that Y = φ(X_1, X_2, ...) a.s. Thus Y is a.s. invariant, and

(Use here an initial distribution which assigns positive mass to all states.) Formally,

Proposition 7.45. Let π(j) > 0 for all j ∈ F; then there is a one-to-one correspondence between bounded a.s. invariant random variables and bounded solutions of (7.44).

b) Absorption probabilities. Let C_1, C_2, ... be closed sets of communicating recurrent states. Let A be the event that a particle starting from state k is eventually absorbed in C_1,

A = {X_n ∈ C_1, all n sufficiently large}.

A is an invariant event, so f(j) = P(A | X_0 = j) satisfies (7.44). There are also the boundary conditions:


If one solves (7.44) subject to these boundary conditions and boundedness, is the solution unique? No, in general, because if J is the set of all transient states, the event that the particle remains in J for all time is invariant, and

is zero on all C_h, and any multiple of c(j) may be added to any given solution satisfying the boundary conditions to give another solution. If the probability of remaining in transient states for all time is zero, then the solution is unique. For example, let g be such a solution, and start the process from state j. The process g(X_n), n = 0, 1, ..., is a martingale. Let n* be the first time that one of the C_h is entered. This means that n* is a stopping time for the X_0, X_1, ... sequence. Furthermore, g(X_n) and n* satisfy the hypothesis of the optional stopping theorem. Therefore,

But

Therefore g(j) = P(A | X_0 = j).

In fair coin-tossing, with initial fortune zero, what is the probability

that we win M_1 dollars before losing M_2? This is the same problem as: For a simple symmetric random walk starting from zero, with absorbing states at M_1, −M_2, find the probability of being absorbed into M_1. Let p+(j) be the probability of being absorbed into M_1 starting from j, −M_2 <= j <= M_1. Then p+(j) must satisfy (7.44), which in this case is

and the boundary conditions p+(−M_2) = 0, p+(M_1) = 1. The solution is easy to get: p+(j) = (j + M_2)/(M_1 + M_2).
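The backward equation here, f(j) = ½ f(j + 1) + ½ f(j − 1) with the stated boundary values, can also be solved numerically and checked against the closed form p+(j) = (j + M_2)/(M_1 + M_2). A Python sketch (not part of the text; the iteration scheme is our own choice):

```python
def absorb_prob_M1(j, M1, M2, iters=20000):
    """Solve f(j) = (f(j+1) + f(j-1))/2 with f(-M2) = 0, f(M1) = 1
    by repeated sweeps (Gauss-Seidel on the backward equation)."""
    f = {k: 0.0 for k in range(-M2, M1 + 1)}
    f[M1] = 1.0
    for _ in range(iters):
        for k in range(-M2 + 1, M1):
            f[k] = 0.5 * (f[k + 1] + f[k - 1])
    return f[j]

M1, M2 = 4, 6
for j in range(-M2, M1 + 1):
    exact = (j + M2) / (M1 + M2)     # the closed-form solution
    assert abs(absorb_prob_M1(j, M1, M2) - exact) < 1e-6
```

The iteration converges because the boundary value problem has a unique bounded solution once the probability of staying transient forever is zero, as argued above.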

c) Two other examples. Among many others, not involving invariant sets, I pick two. Let n* be the time until absorption into the class C of recurrent states, assuming C ≠ ∅. Write m(j) = E_j(n*). Check that

and apply the backward argument to give

The boundary conditions are m(j) = 0, j ∈ C.
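For the simple symmetric walk absorbed at M_1 and −M_2, this system for m(j) can be iterated numerically; the closed form m(j) = (M_1 − j)(j + M_2) is classical and serves as a check. A sketch (Python is not part of the text, and the iteration scheme is our own choice):

```python
def mean_absorption_time(j, M1, M2, iters=20000):
    """Iterate m(k) = 1 + (m(k+1) + m(k-1))/2, m(M1) = m(-M2) = 0."""
    m = {k: 0.0 for k in range(-M2, M1 + 1)}
    for _ in range(iters):
        for k in range(-M2 + 1, M1):
            m[k] = 1.0 + 0.5 * (m[k + 1] + m[k - 1])
    return m[j]

M1, M2 = 3, 5
for j in range(-M2, M1 + 1):
    exact = (M1 - j) * (j + M2)       # classical closed form
    assert abs(mean_absorption_time(j, M1, M2) - exact) < 1e-6
```

Problem 15 below derives the same closed form from the martingale {X_n^2 − n}.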


Now let N_i be the number of visits to state i before absorption into C. Denote G(j, i) = E_j(N_i). For k ∉ C, j ∉ C,

So

where δ(j, i) = 0 or 1 as j ≠ i or j = i. The boundary conditions are G(j, i) = 0, j ∈ C. Of course, this makes no sense unless i is transient.
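In matrix terms these equations say that G = (I − Q)^{-1}, with Q the transition matrix restricted to the states outside C. A small sketch (Python is not part of the text; the three-state walk is our own example, and we count the visit at time 0):

```python
def green_function(Q):
    """Expected visit counts G(j, i) = E_j(N_i) over transient states:
    solve (I - Q) G = I by Gauss-Jordan elimination."""
    size = len(Q)
    # Build the augmented matrix [I - Q | I].
    A = [[(1.0 if i == j else 0.0) - Q[i][j] for j in range(size)]
         + [1.0 if i == j else 0.0 for j in range(size)]
         for i in range(size)]
    for col in range(size):                      # forward elimination
        piv = max(range(col, size), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        A[col] = [x / A[col][col] for x in A[col]]
        for r in range(size):
            if r != col and A[r][col]:
                A[r] = [x - A[r][col] * y for x, y in zip(A[r], A[col])]
    return [row[size:] for row in A]

# Symmetric walk on {-1, 0, 1} absorbed at -2 and 2; Q is indexed
# by the transient states [-1, 0, 1].
Q = [[0.0, 0.5, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 0.5, 0.0]]
G = green_function(Q)
# Starting at 0, the expected number of visits to 0 before
# absorption (including time 0) is 2.
assert abs(G[1][1] - 2.0) < 1e-12
```

The value 2 agrees with the return-probability computation of Problem 16: the return probability is ½, so the expected visit count is 1/(1 − ½).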

With these last two examples there is a more difficult uniqueness problem. For example, in (7.46) assume that

Then any nonnegative solution g(j) of (7.46) satisfying

must be E_j(n*). To prove this, check that

is a martingale sequence, that its limit is E_j(n*), and apply optional stopping.

Problems

14. For simple symmetric random walk with absorbing states at M_1, −M_2, show that

15. Let {X_n} be simple symmetric random walk. Derive the expressions for p+(j) and E_j(n*) by showing that the sequences {X_n}, {X_n^2 − n} are martingales and applying the stopping time results of Section 7, Chapter 5.

16. For simple symmetric random walk with absorbing states at M_1, −M_2, use the expression for p+(j) to evaluate q(j) = P_j(at least one return to j). For −M_2 < j < i < M_1, E_j(N_i) is the probability that the particle hits i


before −M_2, times the expected number of returns to i starting from i before absorption. Use p+(j), for absorbing states at −M_2, i, and q(i) to evaluate

17. For any given set D of states, let A be the event that X_n stays in D for all n,

a) Show that f(j) = P(A | j) satisfies

b) Prove using (a) that a state h is transient iff there exists a bounded nontrivial solution to the equation

c) Can you use (b) to deduce 7.37?

NOTES

In 1906 A. A. Markov [110] proved the existence of stationary initial distributions for Markov chains with a finite number of states. His method is simple and clever, and the idea can be generalized. A good exposition is in Doob's book [38, pp. 170 ff.]. The most fundamental work on general state spaces is due to W. Doeblin [25] in 1937 and [28] in 1940. Some of these latter results concerning the existence of invariant initial distributions are given in Doob's book. The basic restriction needed is a sort of compactness assumption to keep the motion from being transient or null-recurrent. But a good deal of Doeblin's basic work occurs before this restriction is imposed, and is concerned with the general decomposition of the state space. For an exposition of this, see K. L. Chung [15] or [17].

The difficulty in the general state space is that there is no way of classifying each state y by means of the process of returns to y. If, for example, p(A | x) assigns zero mass to every one-point set, then the probability of a return to x is zero. You might hope to get around this by considering returns to a neighborhood of x, but then the important independence properties of the recurrence times no longer hold. It may be possible to generalize by taking smaller and smaller neighborhoods and getting limits, but this program looks difficult and has not been carried out successfully. Hence, in the general case, it is not yet clear what definition is most appropriate to use in classifying chains as recurrent or transient. For a fairly natural definition of recurrent chains, Harris [66] generalized Doeblin's result by showing the existence of a


possibly infinite, but always σ-finite, measure Q(dx) satisfying

His idea was very similar to the idea in the countable case: Select a set A ∈ ℬ_1(F) so that an initial distribution π_A(·) exists, concentrated on A, such that every time the process returned to A, it had the distribution π_A. This could be done using Doeblin's technique. Then define π(B), B ∈ ℬ_1(F), as the expected number of visits to B between visits to A, using the initial distribution π_A.

The basic work when the state space is countable but not necessarily finite is due to Kolmogorov [95], 1936. The systematic application of the renewal theorem and concepts was done by Feller; see [55]. K. L. Chung's book [16] is an excellent source for a more complete treatment of the countable case.

The literature concerning applications of Markov chains is enormous. Karlin's book [86] has some nice examples; so does Feller's text [59, Vol. I]. A. T. Bharucha-Reid's book [3] is more comprehensive.

The proof of Proposition 7.37 given in the first edition of this book wasincorrect. I am indebted to P. J. Thomson and K. M. Wilkinson for pointingout the error and supplying a correction.


CHAPTER 8

CONVERGENCE IN DISTRIBUTION
AND THE TOOLS THEREOF

1. INTRODUCTION

Back in Chapter 1, we noted that if Z_n = Y_1 + ... + Y_n, where the Y_i are independent and ±1 with probability ½, then

Thus the random variables Z_n/√n have distribution functions F_n(x) that converge for every value of x to Φ(x), but from Problem 23, Chapter 3, certainly the random variables Z_n/√n do not converge a.s. (or, for that matter, in L_1 or in any strong sense). What converge here are not the values of the random variables themselves, but the probabilities with which the random variables assume certain values. In general, we would like to say that the distribution of the random variable X_n converges to the distribution of X if F_n(x) = P(X_n < x) → F(x) = P(X < x) for every x ∈ R^(1). But this is a bit too strong. For instance, suppose X = 0. Then we would want the values of X_n to be more and more concentrated about zero; that is, for any ε > 0 we would want

Now F(0) = 0, but 8.1 could hold even with F_n(0) = 1 for all n. Take X_n = −1/n, for example. What 8.1 says is that for all x < 0, F_n(x) → F(x), and for all x > 0, F_n(x) → F(x). Apparently, not much should be assumed about what happens for x a discontinuity point of F(x). Hence we state the following:

Definition 8.2. We say that X_n converges to X in distribution, X_n →_D X, if F_n(x) → F(x) at every point x ∈ C(F), the set of continuity points of F. That is, P(X = x) = 0 implies F_n(x) → F(x). We will also write in this case F_n →_D F.

Different terminology is sometimes used.

Definition 8.3. By the law of X, written L(X), is meant the distribution of X. Convergence in distribution is also called convergence in law, and L(X_n) → L(X) is equivalent notation for X_n →_D X. If random variables X and Y have the same distribution, write either L(X) = L(Y) or X =_D Y.


Recall from Chapter 2 that a function F(x) on R^(1) is the distribution function of a random variable iff

Problems

1. Show that if F(x) = P(X < x), then F(x+) − F(x) = P(X = x). Show that the complement of C(F) is at most countable (F(x+) = lim_{y↓x} F(y)).

2. Let T be a dense set of points in R^(1), and F_0(x) a function on T having properties (8.4 i, ii, and iii) with x, y ∈ T. Show that there is a unique distribution function F(x) on R^(1) such that F(x) = F_0(x), x ∈ T.

3. Show that if, for each n, X_n takes values in the integers I, then X_n →_D X implies P(X ∈ I) = 1, and X_n →_D X iff P(X_n = j) → P(X = j), all j ∈ I.

4. If F_n, F are distribution functions, F_n →_D F, and F is continuous, show that

5. If X =_D Y, and φ(x) is ℬ_1-measurable, show that φ(X) =_D φ(Y). Give an example to show that if X, Y, Z are random variables defined on the same probability space, then X =_D Y does not necessarily imply that XZ =_D YZ.

Define a random variable X to have a degenerate distribution if X is a.s.constant.

6. Show that if X_n →_D X and X has a degenerate distribution, then X_n →_P X.
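The X_n = −1/n example above is worth writing out explicitly; it shows exactly why Definition 8.2 excludes the discontinuity points of F. A Python sketch (not part of the text; recall F(x) = P(X < x)):

```python
def F_n(x, n):
    """Distribution function of the point mass at -1/n: P(X_n < x)."""
    return 1.0 if x > -1.0 / n else 0.0

def F(x):
    """Distribution function of X = 0."""
    return 1.0 if x > 0 else 0.0

# At the discontinuity point x = 0 of F, F_n(0) = 1 for every n,
# which never approaches F(0) = 0.  This is why Definition 8.2 only
# demands convergence at continuity points of F.
assert F(0.0) == 0.0 and all(F_n(0.0, n) == 1.0 for n in range(1, 100))

# At continuity points x != 0, F_n(x) -> F(x).
for x in (-0.5, -0.01, 0.01, 0.5):
    assert F_n(x, 10**6) == F(x)
```

Nevertheless X_n →_D X, since the two distribution functions agree in the limit everywhere except at the single discontinuity x = 0.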

2. THE COMPACTNESS OF DISTRIBUTION FUNCTIONS

One of the most frequently used tools in D-convergence is a certain compactness property of distribution functions. The distribution functions themselves are not compact, but we can look at a slightly larger set of functions.

Definition 8.5. Let ℳ be the class of all functions G(x) satisfying (8.4 i and ii), with the addition of

As before, for G, G_n ∈ ℳ, G_n →_D G iff lim_n G_n(x) = G(x) at all points of C(G).

Theorem 8.6 (Helly-Bray). ℳ is sequentially compact under →_D.


Proof. Let G_n ∈ ℳ, and take T = {x_k}, k = 1, 2, ..., dense in R^(1). We apply Cantor's diagonalization method. That is, let I_1 = {n_1, n_2, ...} be an ordered subset of the integers such that G_n(x_1) converges as n → ∞ through I_1. Let I_2 ⊂ I_1 be such that G_n(x_2) converges as n → ∞ through I_2. Continue this way, getting decreasing ordered subsets I_1, I_2, ... of the integers. Let n_m

be the mth member of I_m. For m >= k, n_m ∈ I_k, so for every x_k ∈ T, G_{n_m}(x_k) converges. Define G_0(x) on T by G_0(x_k) = lim_m G_{n_m}(x_k). Define G(x) on R^(1) by

It is easy to check that G ∈ ℳ. Let x ∈ C(G). Then

by definition, but also check that

for x_k' < x < x_k'' with x_k', x_k'' ∈ T. Then

implying

Letting x_k' ↑ x and x_k'' ↓ x gives the result that G_{n_m} converges to G at every x ∈ C(G).

A useful way of looking at the Helly-Bray theorem is

Corollary 8.7. Let G_n ∈ ℳ. If there is a G ∈ ℳ such that every D-convergent subsequence G_{n_m} satisfies G_{n_m} →_D G, then the full sequence G_n →_D G.

Proof. If G_n does not D-converge to G, there exists an x_0 ∈ C(G) such that G_n(x_0) does not converge to G(x_0). But every subsequence of the G_n contains a convergent subsequence G_{n_m}, and G_{n_m}(x_0) → G(x_0), a contradiction.

Fig. 8.1 F_n(x).

Unfortunately, the class of distribution functions itself is not compact. For instance, take X_n = n (see Fig. 8.1). Obviously lim_n F_n(x) = 0 identically. The difficulty here is that mass floats out to infinity, disappearing in the limit. We want to use the Helly-Bray theorem to get some compactness properties for distribution functions. But to do this we are going to have to impose additional restrictions to keep the mass from moving out to infinity. We take some liberties with the notation.


Definition 8.8. F(B), B ∈ ℬ_1, will denote the extension of F(x) to a probability measure on ℬ_1; that is, if F(x) = P(X < x), then F(B) = P(X ∈ B).

Definition 8.9. Let N denote the set of all distribution functions. A subset C ⊂ N will be said to be mass-preserving if for any ε > 0 there is a finite interval I such that F(I^c) < ε, all F ∈ C.

Proposition 8.10. Let C ⊂ N. Then C is conditionally compact in N if and only if C is mass-preserving (that is, F_n ∈ C implies there is a subsequence F_{n_m} such that F_{n_m} →_D F ∈ N).

Proof. Assume C is mass-preserving, F_n ∈ C. There exists G ∈ ℳ such that F_{n_m} →_D G. For any ε, take a, b such that F_n([a, b)) > 1 − ε for all n. Take a' < a, b' > b so that a', b' ∈ C(G). Then

with the conclusion that G(b') − G(a') >= 1 − ε, or G(+∞) = 1, G(−∞) = 0; hence G ∈ N. On the other hand, let C be conditionally compact in N. If C is not mass-preserving, then there is an ε > 0 such that for every finite interval I some F ∈ C satisfies F(I) <= 1 − ε. Take F_n ∈ C such that for every n, F_n([−n, +n)) <= 1 − ε. Now take a subsequence F_{n_m} →_D F ∈ N. Let a, b ∈ C(F); then F_{n_m}([a, b)) → F([a, b)), but for n_m sufficiently large [a, b) ⊂ [−n_m, +n_m). Thus F([a, b)) <= 1 − ε for any a, b ∈ C(F), which implies F ∉ N.

One obvious corollary of 8.10 is

Corollary 8.11. If F_n →_D F, F ∈ N, then {F_n} is mass-preserving.

Problems

7. For −∞ < a < b < +∞, consider the class of all distribution functions such that F(a) = 0, F(b) = 1. Show that this class is sequentially compact.

8. Let F_n →_D F, with F_n, F distribution functions. Show that for B any Borel set, it is not necessarily true that F_n(B) = 1 for all n implies F(B) = 1. Show that if B is closed, however, then F_n(B) = 1 for all n implies F(B) = 1.

9. Let g(x) be ℬ_1-measurable, such that |g(x)| → ∞ as x → ±∞. If C ⊂ N is such that sup_{F ∈ C} ∫ |g| dF < ∞, then C is mass-preserving.

10. Show that if there is an r > 0 such that lim sup_n E|X_n|^r < ∞, then {F_n} is mass-preserving.

11. The support of F is the smallest closed set C such that F(C) = 1.Show that such a minimal closed set exists. A point of increase of F is


defined as a point x such that for every neighborhood N of x, F(N) > 0.Show that the set of all points of increase is exactly the support of F.

12. Define a Markov chain with stationary transition probabilities p(· | x) to be stable if for any sequence of initial distributions π_n D-converging to an initial distribution π, the probabilities ∫ p(· | x) π_n(dx) D-converge to the probability ∫ p(· | x) π(dx).

If the state space of a stable Markov chain is a compact interval, show that there is at least one invariant initial distribution. [Use Problem 7 applied to the probabilities (1/n) Σ_{k=1}^n p^(k)(· | x) for x fixed.]

3. INTEGRALS AND D-CONVERGENCE

Suppose F_n →_D F, with F_n, F ∈ N. Does it then follow that for any reasonable measurable function f(x), ∫ f(x) dF_n → ∫ f(x) dF? The answer is No! For example, let

Now take f(x) = 0, x < 0, and f(x) = 1, x >= 0. Then ∫ f dF_n = 1, but ∫ f dF = 0. But it is easy to see that it works for f bounded and continuous. Actually, a little more can be said.

Proposition 8.12. Let F_n, F ∈ N and F_n →_D F. If f(x) is bounded on R^(1), measurable ℬ_1, and the discontinuity points of f are a set S with F(S) = 0, then

Remark. The set of discontinuity points of a ℬ_1-measurable function is in ℬ_1, so F(S) is well-defined. (See Hobson [72, p. 313].)

Proof. Take a, b ∈ C(F), and let I_1, ..., I_k be a partition of I = [a, b), where I_i = [a_i, b_i) and a_i, b_i ∈ C(F). Define on I

Then

Clearly, the right- and left-hand sides above converge, and


Let the mesh of the partition go to zero. At every point x which is a continuity point of f(x),

By the Lebesgue bounded convergence theorem,

Let M = sup_x |f(x)|. Then, since {F_n} is mass-preserving, for any ε > 0 we can take I such that F_n(I^c) < ε/2M and F(I^c) < ε/2M. Now

Corollary 8.13. In the above proposition, eliminate the condition that f be bounded on R^(1); then

Proof. Define

Every continuity point of f is a continuity point of g_a. Apply Proposition 8.12 to g_a to conclude

Let a ↑ ∞. By the monotone convergence theorem,

Problems

13. Let F_n, F ∈ N, F_n →_D F. For any set E ⊂ R^(1), define the boundary of E as bd(E) = Ē ∩ cl(E^c) (Ē = closure of E). Prove that for any B ∈ ℬ_1 such that F(bd(B)) = 0, F_n(B) → F(B).

14. Let F_n →_D F, and let h(x), g(x) be continuous functions such that

Show that lim sup_n ∫ |g(x)| dF_n < ∞ implies


4. CLASSES OF FUNCTIONS THAT SEPARATE

Definition 8.14. A set ℰ of bounded continuous functions on R^(1) will be called N-separating if for any F, G ∈ N,

implies F = G.

We make this a bit more general (and also, ultimately, more convenient) by allowing the functions of ℰ to be complex-valued. That is, we consider functions f(x) of the form f(x) = f_1(x) + i f_2(x), with f_1, f_2 real-valued, continuous, and bounded. Now, of course, |f(x)| has the meaning of the absolute value of a complex number. As usual, then, ∫ f dF = ∫ f_1 dF + i ∫ f_2 dF.

The nice thing about such a class ℰ of functions is that we can check whether F_n →_D F by looking at the integrals of these functions. More specifically:

Proposition 8.15. Let ℰ be N-separating, and {F_n} mass-preserving. Then there exists an F ∈ N such that F_n →_D F if and only if lim_n ∫ f dF_n exists for every f ∈ ℰ. If this holds, then lim_n ∫ f dF_n = ∫ f dF, all f ∈ ℰ.

Proof. One way is clear. If F_n →_D F, then lim_n ∫ f dF_n = ∫ f dF by 8.12.

To go the other way, take any D-convergent subsequence F_{n_k} of F_n. By mass-preservation, F_{n_k} →_D F ∈ N. Take any other convergent subsequence F_{n_l} →_D G. Then for f ∈ ℰ, by 8.12,

so ∫ f dF = ∫ f dG for all f ∈ ℰ, whence F = G. All D-convergent subsequences of F_n have the same limit F, implying F_n →_D F.

Corollary 8.16. Let ℰ be N-separating and {F_n} mass-preserving. If F ∈ N is such that ∫ f dF_n → ∫ f dF, all f ∈ ℰ, then F_n →_D F.

The relevance of looking at integrals of functions to D-convergence can be clarified by the simple observation that F_n →_D F is equivalent to ∫ f dF_n → ∫ f dF for all functions f of the form

for any x_0 ∈ C(F).


What classes of functions are N-separating? Take ℰ_0 to be all functions f of the form below (see Fig. 8.2), with any a, b finite and any ε > 0.

Proposition 8.17. ℰ_0 is N-separating.

Proof. For any F, G ∈ N, take a, b ∈ C(F) ∩ C(G). Assume that for any f as described,

and conversely,

Let ε ↓ 0 to get F([a, b)) = G([a, b)). The foregoing being true for all a, b ∈ C(F) ∩ C(G) implies F = G.

Figure 8.2

However, ℰ_0 is an awkward set of functions to work with. What is really more important is

Proposition 8.18. Let ℰ be a class of continuous bounded functions on R^(1) with the property that for any f_0 ∈ ℰ_0 there exist f_n ∈ ℰ such that sup_x |f_n(x)| <= M, all n, and lim_n f_n(x) = f_0(x) for every x ∈ R^(1). Then ℰ is N-separating.

Proof. Let ∫ f dF = ∫ f dG, all f ∈ ℰ. For any f_0 ∈ ℰ_0, take f_n ∈ ℰ converging to f_0 as in the statement of 8.18 above. By the Lebesgue bounded convergence theorem,

5. TRANSLATION INTO RANDOM-VARIABLE TERMS

The foregoing is all translatable into random-variable terms. For example:

i) If X_n are random variables, their distribution functions are mass-preserving iff for any ε > 0 there is a finite interval I such that


ii) If |g(x)| → ∞ as x → ±∞, then the distribution functions of the X_n are mass-preserving if sup_n E|g(X_n)| < ∞ (Problem 9).

iii) If the X_n have mass-preserving distribution functions and ℰ is an N-separating set of functions, then there exists a random variable X such that X_n →_D X if and only if lim_n E f(X_n) exists, all f ∈ ℰ.

We will switch freely between discussion in terms of distribution functions and in terms of random variables, depending on which set of terms is more illuminating.
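Item (iii) can be illustrated with the coin-tossing sums from the introduction of this chapter: for a smooth bounded test function f, E f(Z_n) settles to the integral of f against the normal law. A Monte Carlo sketch (Python is not part of the text; the sample sizes, and the test function cos with E cos(Z) = e^{-1/2} for Z standard normal, are our own choices):

```python
import math
import random

def expectation_f(sampler, f, trials=20_000, seed=1):
    """Monte Carlo estimate of E f(X) for X drawn from sampler."""
    rng = random.Random(seed)
    return sum(f(sampler(rng)) for _ in range(trials)) / trials

def normalized_sum(rng, n=256):
    """Z_n = (Y_1 + ... + Y_n)/sqrt(n) for fair +-1 coin tosses."""
    return sum(rng.choice((-1, 1)) for _ in range(n)) / math.sqrt(n)

# One smooth bounded test function from a separating class.
f = lambda x: math.cos(x)
# For Z standard normal, E cos(Z) = exp(-1/2).
approx = expectation_f(normalized_sum, f)
assert abs(approx - math.exp(-0.5)) < 0.02
```

Checking a single f proves nothing by itself; the force of Proposition 8.15 is that agreement of E f(X_n) over a whole separating class, plus mass-preservation, pins down the limit law.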

Proposition 8.19. Let X_n →_D X, and let φ(x) be measurable ℬ_1, with its set S of discontinuities such that P(X ∈ S) = 0. Then φ(X_n) →_D φ(X).

Proof. Let Zₙ = φ(Xₙ), Z = φ(X). If Ef(Zₙ) → Ef(Z) for all f ∈ ℰ₀, then Zₙ →𝒟 Z. Let g(x) = f(φ(x)). This function g is bounded, measurable ℬ₁, and continuous wherever φ is continuous. By 8.12, Eg(Xₙ) → Eg(X).

We cannot do any better with a.s. convergence. This is illustrated by the following problem.

Problem 15. If φ(x) is as in 8.19, and Xₙ →a.s. X, then show φ(Xₙ) →a.s. φ(X). Give an example to show that in general this is not true if φ(x) is only assumed measurable.

6. AN APPLICATION OF THE FOREGOING

With only this scanty background we are already in a position to prove a more general version of the central limit theorem. To do this we work with the class of functions defined by

ℰ₁ consists of all continuous bounded f on R⁽¹⁾ such that f″(x) exists for all x, supₓ |f″(x)| < ∞, and f″(x) is uniformly continuous on R⁽¹⁾.

It is fairly obvious that ℰ₁ satisfies the requirements of 8.18 and hence is ℳ-separating. We use ℰ₁ to establish a simple example of what has become known as the "invariance principle."

Theorem 8.20. If there is one sequence X₁*, X₂*, … of independent, identically distributed random variables, EX₁* = 0, E(X₁*)² = σ*² < ∞, such that


variables such that

Proof. Let f ∈ ℰ₁, and define

By definition lim δ(h) = 0 as h ↓ 0. We may as well assume σ* = σ = 1; otherwise we use Xₖ*/σ*, Xₖ/σ. Let

Since EZₙ² = 1, E(Zₙ*)² = 1, both sequences are mass-preserving. By 8.15

and by 8.16 it suffices to show that

Since only the distributions are relevant here, we can assume that the Xₖ*, Xₖ are defined on a common sample space and are independent of each other. Now write

Define random variables


Use Taylor's expansion around Uₖ to get

where θ, θ* are random variables such that 0 ≤ θ, θ* ≤ 1. Both Xₖ, Xₖ* are independent of Uₖ, so

Let hₙ(x) = x² δ(|x|/√n). Take the expectation of (8.21) and use EXₖ² = E(Xₖ*)² to get

this latter by the identical distribution. Note that

so

Let M = supₓ |f″(x)|; then δ(h) ≤ 2M for all h, so hₙ(X₁) ≤ 2MX₁². But

hₙ(X₁) → 0 a.s. Since X₁² is integrable, the bounded convergence theorem yields Ehₙ(X₁) → 0. Similarly for Ehₙ(X₁*). Thus it has been established that

implying

Q.E.D.

This proof is anachronistic in the sense that there are much simpler methods of proving the central limit theorem if one knows some more probability theory. But it is an interesting proof. We know that if we take X₁*, X₂*, … to be fair coin-tossing variables, then


where the notation 𝒩(0, 1) is clarified by

Definition 8.22. The normal distribution with mean μ and variance σ², denoted 𝒩(μ, σ²), is the distribution of a random variable σX + μ, where

So we have proved

Corollary 8.23. Let X₁, X₂, … be independent, identically distributed random variables, EX₁ = 0, EX₁² = σ² < ∞. Then

Sₙ/(σ√n) →𝒟 𝒩(0, 1).
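Corollary 8.23 can be watched in action numerically. The following sketch is an illustration added here, not part of the original text; the choice of uniform summands, the sample sizes, and the helper names are all arbitrary. It simulates Sₙ/(σ√n) and compares the empirical distribution function with Φ, the 𝒩(0, 1) distribution function:

```python
import math
import random

def clt_demo(n=200, trials=5000, seed=1):
    """Simulate S_n/(sigma*sqrt(n)) for X_i uniform on (-1,1), so that
    sigma^2 = 1/3, and measure the gap to the N(0,1) distribution function."""
    rng = random.Random(seed)
    sigma = math.sqrt(1.0 / 3.0)
    samples = []
    for _ in range(trials):
        s = sum(rng.uniform(-1.0, 1.0) for _ in range(n))
        samples.append(s / (sigma * math.sqrt(n)))

    def edf(x):  # empirical distribution function of the normalized sums
        return sum(1 for z in samples if z <= x) / len(samples)

    def phi(x):  # N(0,1) distribution function via the error function
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    return max(abs(edf(x) - phi(x)) for x in (-2.0, -1.0, 0.0, 1.0, 2.0))

discrepancy = clt_demo()
```

With these (arbitrary) parameters the discrepancy is small, as the corollary predicts; the residual gap is a mix of sampling noise and the 1/√n convergence rate discussed in the Notes at the end of the chapter.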

7. CHARACTERISTIC FUNCTIONS AND THE CONTINUITY THEOREM

The class of functions of the form {e^{iux}}, u ∈ R⁽¹⁾, is particularly important and useful in studying convergence in distribution. To begin with,

Theorem 8.24. The set of all complex exponentials {e^{iux}}, u ∈ R⁽¹⁾, is ℳ-separating.

Proof. Let ∫ e^{iux} dF = ∫ e^{iux} dG for all u. Then for aₖ, k = 1, …, m, any complex numbers, and u₁, …, u_m real,

Let f₀ be in ℰ₀, let εₙ ↓ 0, εₙ ≤ 1, and consider the interval [−n, +n]. Any continuous function on [−n, +n] equal at the endpoints can be uniformly approximated by a trigonometric polynomial; that is, there exists a finite sum

such that |f₀(x) − fₙ(x)| ≤ εₙ, x ∈ [−n, +n]. Since fₙ is periodic and εₙ ≤ 1, for all n, supₓ |fₙ(x)| ≤ 2. By (8.25) above, ∫ fₙ dF = ∫ fₙ dG. This gives ∫ f₀ dF = ∫ f₀ dG, so F = G.

Definition 8.26. Given a distribution function F(x), its characteristic function f(u) is a complex-valued function defined on R⁽¹⁾ by

f(u) = ∫ e^{iux} F(dx).

If F is the distribution function of the random variable X, then, equivalently, f(u) = E e^{iuX}.


Note quickly that

Proposition 8.27. Any characteristic function f(u) has the properties

i) f(0) = 1,
ii) |f(u)| ≤ 1,
iii) f(u) is uniformly continuous on R⁽¹⁾,
iv) f(−u) is the complex conjugate of f(u).

Proof

i) Obvious;

ii) |f(u)| ≤ E|e^{iuX}| = 1;

iii) |f(u + h) − f(u)| ≤ E|e^{ihX} − 1| = δ(h), and by the bounded convergence theorem δ(h) → 0 as h → 0;

iv) f(−u) = E e^{−iuX}, the conjugate of f(u), is obvious.
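Properties i), ii), and iv) are easy to confirm numerically for a finite distribution, where f(u) = E e^{iuX} is a finite sum. The sketch below is an added illustration (the helper name `char_fn` is invented here); for a fair ±1 coin the characteristic function is cos u:

```python
import cmath

def char_fn(pmf):
    """Characteristic function f(u) = E e^{iuX} of a discrete distribution,
    given as a dict {value: probability}."""
    def f(u):
        return sum(p * cmath.exp(1j * u * x) for x, p in pmf.items())
    return f

# a fair +/-1 coin: f(u) = cos u
f = char_fn({-1: 0.5, 1: 0.5})

# checks of Proposition 8.27: f(0) = 1, |f(u)| <= 1, f(-u) = conj f(u)
assert abs(f(0.0) - 1.0) < 1e-12
for u in (0.3, 1.7, 5.0):
    assert abs(f(u)) <= 1.0 + 1e-12
    assert abs(f(-u) - f(u).conjugate()) < 1e-12
    assert abs(f(u) - cmath.cos(u)) < 1e-12
```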

Theorem 8.24 may be stated as: no two distinct distribution functions have the same characteristic function. However, examples are known (see Loeve [108, p. 218]) of distribution functions F₁ ≠ F₂ such that f₁(u) = f₂(u) for all u in the interval [−1, +1]. Consequently, the set of functions {e^{iux}}, −1 ≤ u ≤ +1, is not ℳ-separating.

The condition that Fₙ →𝒟 F can be elegantly stated in terms of the associated characteristic functions.

Theorem 8.28 (The continuity theorem). If Fₙ are distribution functions with characteristic functions fₙ(u) such that

a) limₙ fₙ(u) exists for every u, and

b) limₙ fₙ(u) = h(u) is continuous at u = 0,

then there is a distribution function F such that Fₙ →𝒟 F and h(u) is the characteristic function of F.

Proof. Since {e^{iux}} is ℳ-separating and limₙ ∫ e^{iux} dFₙ exists for every member of {e^{iux}}, by 8.15 all we need to do is show that {Fₙ} is mass-preserving. To do this, we need

Proposition 8.29. There exists a constant a, 0 < a < ∞, such that for any distribution function F with characteristic function f, and any u > 0,

F([−2/u, +2/u]ᶜ) ≤ (a/u) ∫₀ᵘ (1 − Rl f(v)) dv.


Proof. Rl f(u) = ∫ cos ux F(dx), so

Letting

does it.

Now back to the main theorem. By the above inequality,

The bounded convergence theorem gives

Now fₙ(0) = 1 ⇒ h(0) = 1. By continuity of h at zero, lim_{v→0} Rl h(v) = 1. Therefore,

By this, for any ε > 0 we may take a so that limₙ Fₙ([−a, +a]ᶜ) < ε/2. So there is an n₀ such that for n > n₀, Fₙ([−a, +a]ᶜ) < ε. Take b > a such that Fₖ([−b, +b]ᶜ) < ε for k = 1, 2, …, n₀. From these together, supₙ Fₙ([−b, +b]ᶜ) < ε. Q.E.D.

Corollary 8.30. Let Fₙ be distribution functions, fₙ their characteristic functions. If there is a distribution function F with characteristic function f such that limₙ fₙ(u) = f(u) for every u, then Fₙ →𝒟 F.

Proof. Obvious from 8.28.


Clearly, if Fₙ →𝒟 F, then the characteristic functions fₙ(u) converge at every point u to f(u). We strengthen this to

Proposition 8.31. If Fₙ →𝒟 F, then the corresponding characteristic functions fₙ(u) converge uniformly to f(u) on every finite interval I. (Denote this kind of convergence by →uc.)

Proof. This result follows from the fact that the fₙ, f form an equicontinuous family. That is, if we fix a finite interval I, then for any n, u, and h,

Thus, since the {Fn} are mass-preserving,

where δ(h) ↓ 0 as h ↓ 0. Now the usual argument works: divide I up into points u₁, u₂, …, u_m such that |u_{k+1} − uₖ| < h. For u ∈ I,

where uₖ is the point of the partition nearest u. Therefore

because f(u) also satisfies |f(u + h) − f(u)| ≤ δ(h). Taking h → 0 now completes the proof.

The continuity theorem gives us a strong basic tool. Now we start reaping limit theorems from it by using some additional technical details.

Problems

16. A random variable X has a symmetric distribution if P(X ∈ B) = P(X ∈ −B), where −B = {−x; x ∈ B}. Prove that the characteristic function of X is real for all u iff X has a symmetric distribution.

17. A natural question is: What continuous complex-valued functions f(u) on R⁽¹⁾ are characteristic functions? Say that such a function is nonnegative definite if for any complex numbers λ₁, …, λₙ and points u₁, …, uₙ ∈ R⁽¹⁾,

A complete answer to the question is given by the following theorem.


Bochner's Theorem. Let f(u) be continuous on R⁽¹⁾, f(0) = 1. Then f is a characteristic function if and only if it is nonnegative definite.

Prove that if f is a characteristic function, then it is nonnegative definite. (See Loeve [108, pp. 207 ff.] for a proof of the other direction.)

18. Find the characteristic function for a Poisson distribution with parameter λ.

19. Find the characteristic function of Sₙ for coin-tossing.

20. If Y = aX + b, show that

21. A random variable X is called a displaced lattice random variable if there are numbers a, d such that

Show that X is a displaced lattice random variable if and only if there is a u ≠ 0 such that |f_X(u)| = 1. If u₁, u₂ are irrational with respect to each other, and |f_X(u₁)| = |f_X(u₂)| = 1, show that X is a.s. constant, hence |f_X(u)| ≡ 1. Show that X is distributed on a lattice L_d, d > 0, iff there is a u ≠ 0 such that f_X(u) = 1.

8. THE CONVERGENCE OF TYPES THEOREM

Look at the question: Suppose that Xₙ →𝒟 X, and X is nondegenerate. Can we find constants aₙ, bₙ such that aₙXₙ + bₙ →𝒟 X′ where X′ has a law not connected with that of X in any reasonable way? For example, if X₁, X₂, … are independent and identically distributed, EX₁ = 0, EX₁² < ∞, can we find constants Aₙ such that Sₙ/Aₙ 𝒟-converges to something not 𝒩(μ, σ²)? And if Sₙ/Aₙ 𝒟-converges, what can be said about the size of Aₙ compared with √n, the normalizing factor we have been using? Clearly, we cannot get the result that Xₙ →𝒟 X, aₙXₙ + bₙ →𝒟 X′ implies lim aₙ = a exists, because if Xₙ has a symmetric distribution, then Xₙ →𝒟 X ⇒ (−1)ⁿXₙ →𝒟 X, since −Xₙ and Xₙ have the same law. But if we rule this out by requiring aₙ > 0, then the kind of result we want holds.

Theorem 8.32 (Convergence of types theorem). Let Xₙ →𝒟 X, and suppose there are constants aₙ > 0, bₙ such that aₙXₙ + bₙ →𝒟 X′, where X and X′ are nondegenerate. Then there are constants a, b such that ℒ(X′) = ℒ(aX + b) and aₙ → a, bₙ → b.

Proof. Use characteristic functions and let fₙ = f_{Xₙ}, so that


By 8.31, if f′, f are the characteristic functions of X′, X respectively, then

Take a subsequence n_m such that a_{n_m} → a, where a may be infinite. Since

if aₙ → ∞, substitute yₙ = u/aₙ, u ∈ I, to get

Thus |f(u)| ≡ 1, implying X degenerate by Problem 21. Hence a is finite. Using uc-convergence, |fₙ(aₙu)| → |f(au)|; thus |f′(u)| = |f(au)|. Suppose a_{n_m} → a, a_{n_m′} → a′, and a ≠ a′. Use |f(au)| = |f(a′u)|; assume a′ < a, so |f(u)| = |f((a′/a)u)| = ⋯ = |f((a′/a)ᴺu)| by iterating N times. Let N → ∞ to get the contradiction |f(u)| ≡ 1. Thus there is a unique a > 0 such that aₙ → a. So fₙ(aₙu) → f(au). Hence e^{iubₙ} must converge for every u such that f(au) ≠ 0, thus in some interval |u| < δ. Obviously then, lim |bₙ| < ∞, and if b, b′ are two limit points of bₙ, then e^{iub} = e^{iub′} for all |u| < δ, which implies b = b′. Thus bₙ → b, e^{iubₙ} → e^{iub}, and f′(u) = e^{iub} f(au).

9. CHARACTERISTIC FUNCTIONS AND INDEPENDENCE

The part that is really important and makes the use of characteristic functions so natural is the multiplicative property of the complex exponentials and the way that this property fits in with the independence of random variables.

Proposition 8.33. Let X₁, X₂, …, Xₙ be random variables with characteristic functions f₁(u), …, fₙ(u). The random variables are independent iff for all u₁, …, uₙ,

Proof. Suppose X, Y are independent random variables and f, g are complex-valued measurable functions, f = f₁ + if₂, g = g₁ + ig₂, with f₁, f₂, g₁, g₂ ℬ₁-measurable. Then I assert that if E|f(X)| < ∞ and E|g(Y)| < ∞,

so splitting into products does carry over to complex-valued functions. To show this, just verify


All the expectations are those of real-valued functions. We apply the ordinary result to each one and get (8.34). Thus, inducing up to n variables, conclude that if X₁, X₂, …, Xₙ are independent, then

To go the other way, we make use of a result which will be proved in Chapter 11. If we consider the set of functions on R⁽ⁿ⁾,

then these separate the n-dimensional distribution functions. Let F(x₁, …, xₙ) be the distribution function of X₁, …, Xₙ; then the left-hand side of the equation in Proposition 8.33 is simply

But the right-hand side is the integral of

with respect to the distribution function ∏₁ⁿ Fₖ(xₖ). Hence F(x₁, …, xₙ) = ∏₁ⁿ Fₖ(xₖ), thus establishing independence.

Notation. To keep various variables and characteristic functions clear, we denote by f_X(u) the characteristic function of the random variable X.

Corollary 8.35. If X₁, …, Xₙ are independent, then the characteristic function of Sₙ = X₁ + ⋯ + Xₙ is given by

f_{Sₙ}(u) = f_{X₁}(u) ⋯ f_{Xₙ}(u).

The proof is obvious.

Note that X₁, X₂ independent implies that E e^{iu(X₁+X₂)} = f_{X₁}(u) f_{X₂}(u). But having this hold for all u is not sufficient to guarantee that X₁, X₂ are independent. (See Loeve [108, p. 263, Example 1].)
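The multiplicative property of Corollary 8.35 can be verified exactly for finite discrete distributions by brute-force enumeration of the convolution. A small added illustration (the helper names `cf` and `convolve` are invented here):

```python
import cmath
from itertools import product

def cf(pmf, u):
    """Characteristic function E e^{iuX} of a finite discrete distribution."""
    return sum(p * cmath.exp(1j * u * x) for x, p in pmf.items())

def convolve(pmf1, pmf2):
    """Exact distribution of X + Y for independent X, Y."""
    out = {}
    for (x, p), (y, q) in product(pmf1.items(), pmf2.items()):
        out[x + y] = out.get(x + y, 0.0) + p * q
    return out

die = {k: 1.0 / 6.0 for k in range(1, 7)}   # one roll of a fair die
coin = {0: 0.5, 1: 0.5}                     # an independent fair coin
s = convolve(die, coin)

# f_{X+Y}(u) = f_X(u) * f_Y(u) at several (arbitrary) points u
for u in (0.0, 0.4, 1.3):
    assert abs(cf(s, u) - cf(die, u) * cf(coin, u)) < 1e-12
```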

Recall that in Chapter 3 we got the result that if X₁, X₂, … are independent, Σ₁ⁿ Xₖ converges a.s. iff Σ₁ⁿ Xₖ converges in probability, hence iff Σₘⁿ Xₖ →P 0. The one obvious time that →P and →𝒟 coincide is when Yₙ →P c ⇔ Yₙ →𝒟 ℒ (degenerate at c). This observation will lead to

Proposition 8.36. For X₁, X₂, … independent, Σ₁ⁿ Xₖ converges a.s. iff Σ₁ⁿ Xₖ converges in distribution,

because for degenerate convergence we can prove the following proposition.


Proposition 8.37. If Yₙ are random variables with characteristic functions fₙ(u), then Yₙ →𝒟 0 iff fₙ(u) → 1 in some neighborhood of the origin.

Proof. One way is obvious: Yₙ →𝒟 0 implies fₙ(u) → 1 for all u. Now let fₙ(u) → 1 in [−δ, +δ]. Proposition 8.29 gives

The right-hand side goes to zero as n → ∞, so the Fₙ are mass-preserving. Let n′ be any subsequence such that F_{n′} →𝒟 F. Then the characteristic function of F is identically one in [−δ, +δ], hence F is degenerate at zero. By 8.7, the full sequence Fₙ converges to the law degenerate at zero.

This gives a criterion for convergence based on characteristic functions. Use the notation fₖ(u) = f_{Xₖ}(u).

Theorem 8.38. Σ₁ⁿ Xₖ converges a.s. iff ∏₁ⁿ fₖ(u) converges to h(u) in some neighborhood N of the origin, and |h(u)| > 0 on N.

Proof. Certainly Σ₁ⁿ Xₖ converging a.s. implies ∏₁ⁿ fₖ(u) converges everywhere to a characteristic function. To go the other way, the characteristic function of Σₘⁿ Xₖ is ∏ₘⁿ fₖ(u). Because ∏₁ⁿ fₖ(u) → h(u) ≠ 0 on N, ∏ₘⁿ fₖ(u) → 1 on N. Use 8.37 to complete the proof, and note that 8.36 is a corollary.
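Theorem 8.38 can be watched numerically with the random-sign series of Problem 22 below: for Yₖ = ±1 with probability ½, the characteristic function of cₖYₖ is cos(cₖu), so everything hinges on whether ∏ cos(cₖu) stabilizes at a nonzero value near the origin. An added illustration with arbitrarily chosen coefficient sequences (Σ cₖ² < ∞ versus Σ cₖ² = ∞):

```python
import math

def sign_series_product(c, n, u):
    """Partial product of characteristic functions of c_k * Y_k,
    Y_k = +/-1 with probability 1/2, so f_k(u) = cos(c_k * u)."""
    p = 1.0
    for k in range(1, n + 1):
        p *= math.cos(c(k) * u)
    return p

u = 0.5
# c_k = 1/k: sum c_k^2 < infinity, product stabilizes at a nonzero value
conv_1000 = sign_series_product(lambda k: 1.0 / k, 1000, u)
conv_4000 = sign_series_product(lambda k: 1.0 / k, 4000, u)
# c_k = k^(-1/4): sum c_k^2 = sum 1/sqrt(k) = infinity, product drains to zero
div_4000 = sign_series_product(lambda k: k ** -0.25, 4000, u)
```

In the first case the partial products have essentially converged to a nonzero limit by n = 1000; in the second they are already vanishingly small, in line with the criterion |h(u)| > 0 on N.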

Problems

22. For Y₁, Y₂, … independent and ±1 with probability ½, use 8.38 to show that Σ cₖYₖ converges a.s. ⇔ Σ cₖ² < ∞.

23. Show that the condition on fₖ(u) in 8.38 can be partly replaced by: if Σ₁^∞ |1 − fₖ(u)| converges in some neighborhood N of the origin, then

10. FOURIER INVERSION FORMULAS

To every characteristic function corresponds one and only one distribution function. Sometimes it is useful to know how, given a characteristic function, to find the corresponding distribution function, although by far the most important facts regarding characteristic functions do not depend on knowing how to perform this inversion. The basic inversion formula is the Fourier transform inversion formula. There are a lot of different versions of this; we give one particularly useful version.


Theorem 8.39. Let f(u) be the characteristic function of a distribution function F(dx) such that

∫ |f(u)| du < ∞.

Then F(dx) has a bounded continuous density h(x) with respect to Lebesgue measure given by

h(x) = (1/2π) ∫ e^{−iux} f(u) du.    (8.40)

Proof. Assume that (8.40) holds true for one distribution function G(dx) with density g(x) and characteristic function φ(u). Then we show that it holds true in general. Write

Then, interchanging order of integration on the right:

If X has distribution F(dx) and Y has distribution G(dx), then the integral on the right is the density for the distribution of X + Y, where they are taken to be independent. Instead of Y, now use εY in (8.41), because if the distribution of Y satisfies (8.40), you can easily verify that so does that of εY, for ε any real number. As ε → 0 the characteristic function φ_ε(u) of εY converges to one everywhere. Use the bounded convergence theorem to conclude that the left-hand side of (8.41) converges to

The left-hand side is bounded by ∫ |f(u)| du for all y, so the integral of the left-hand side over any finite interval I converges to

If the endpoints of I are continuity points of F(x), then since ℒ(X + εY) → ℒ(X), the right-hand side of (8.41) converges to F(I). Thus the two measures F(B) and ∫_B h(y) dy agree on all intervals, therefore are identical. The continuity and boundedness of h(x) follow directly from the expression (8.40). To conclude, all I have to do is produce one G(x), φ(u) for which (8.40) holds. A convenient pair is

To verify (8.42) do a straightforward contour integration.
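For a concrete check of the inversion formula (8.40), one can integrate numerically. The sketch below is an added illustration (the truncation range and step count are arbitrary); it inverts the 𝒩(0, 1) characteristic function e^{−u²/2} and recovers the normal density. Since the integrand and all its derivatives are negligible at the truncation points, the trapezoidal rule is extremely accurate here:

```python
import math
import cmath

def invert(f, x, U=8.0, steps=4000):
    """Evaluate h(x) = (1/2pi) * integral of e^{-iux} f(u) du numerically,
    by the trapezoidal rule on [-U, U] (the tail of f is negligible)."""
    h = 2.0 * U / steps
    total = 0.5 * (cmath.exp(1j * U * x) * f(-U) + cmath.exp(-1j * U * x) * f(U))
    for k in range(1, steps):
        u = -U + k * h
        total += cmath.exp(-1j * u * x) * f(u)
    return (total * h / (2.0 * math.pi)).real

normal_cf = lambda u: math.exp(-0.5 * u * u)
density0 = invert(normal_cf, 0.0)   # normal density at 0: 1/sqrt(2 pi)
density1 = invert(normal_cf, 1.0)   # normal density at 1: e^{-1/2}/sqrt(2 pi)
```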


We can use the same method to prove

Proposition 8.43. Let φₙ(u) be any sequence of characteristic functions converging to one for all u such that for each n,

If b and a are continuity points of any distribution function F(x) with characteristic function f(u), then

Proof. Whether or not F has a density or f(u) is integrable, (8.41) above still holds, where now the right-hand side is the density of the distribution of X + Yₙ, with X, Yₙ independent and φₙ(u) the characteristic function of Yₙ. Since φₙ(u) → 1, Yₙ →P 0, so ℒ(X + Yₙ) → ℒ(X). The integral of the right-hand side over [a, b) thus converges to F(b) − F(a). The integral of the left-hand side is

This all becomes much simpler if X is distributed on the lattice L_d, d > 0. Then

so that f(u) has period 2π/d. The inversion formula is simply

P(X = kd) = (d/2π) ∫_{−π/d}^{+π/d} e^{−iukd} f(u) du.
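The lattice inversion formula is easy to test numerically, since the trapezoidal rule over one full period is essentially exact for trigonometric polynomials. An added illustration with a fair {0, 1} coin, f(u) = ½(1 + e^{iu}), d = 1 (helper names invented here):

```python
import math
import cmath

def lattice_prob(f, k, d=1.0, steps=2000):
    """P(X = k*d) = (d/2pi) * integral over [-pi/d, pi/d] of
    e^{-iukd} f(u) du, evaluated by the trapezoidal rule."""
    a = math.pi / d
    h = 2.0 * a / steps
    g = lambda u: cmath.exp(-1j * u * k * d) * f(u)
    total = 0.5 * (g(-a) + g(a))
    for j in range(1, steps):
        total += g(-a + j * h)
    return (total * h * d / (2.0 * math.pi)).real

coin_cf = lambda u: 0.5 * (1.0 + cmath.exp(1j * u))
p0 = lattice_prob(coin_cf, 0)   # recovers P(X = 0) = 1/2
p1 = lattice_prob(coin_cf, 1)   # recovers P(X = 1) = 1/2
p5 = lattice_prob(coin_cf, 5)   # recovers P(X = 5) = 0
```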

Problem 24. Let X₁, X₂, … be independent, identically distributed integer-valued random variables. Show that their sums are recurrent iff

where f(u) is the common characteristic function of X₁, X₂, …

11. MORE ON CHARACTERISTIC FUNCTIONS

There are some technical results concerning characteristic functions which we will need later. These revolve around expansions, approximation, and similar results.


Proposition 8.44. If E|X|ᵏ < ∞, then the characteristic function of X has the expansion

where δ(u) denotes a function of u such that lim_{u→0} δ(u) = 0, satisfying |δ(u)| ≤ 3E|X|ᵏ for all u.

Proof. Use the Taylor expansion with remainder on sin y, cos y for y real to get

where θ₁, θ₂ are real numbers such that |θ₁| ≤ 1, |θ₂| ≤ 1. Thus

Now θ₁, θ₂ are random, but still |θ₁| ≤ 1, |θ₂| ≤ 1. Now

which establishes |δ(u)| ≤ 3E|X|ᵏ. Use the bounded convergence theorem to get
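For a finite distribution, the expansion of 8.44 with k = 2 can be checked directly: the remainder after the second-order term, divided by u², should shrink as u → 0. An added illustration (the die example and helper names are not in the original):

```python
import cmath

pmf = {k: 1.0 / 6.0 for k in range(1, 7)}    # one roll of a fair die
m1 = sum(x * p for x, p in pmf.items())       # EX = 3.5
m2 = sum(x * x * p for x, p in pmf.items())   # EX^2 = 91/6

def f(u):
    """Characteristic function E e^{iuX} of the die."""
    return sum(p * cmath.exp(1j * u * x) for x, p in pmf.items())

def remainder(u):
    """|f(u) - (1 + iu EX - u^2 EX^2 / 2)| / u^2, which should vanish
    as u -> 0 by Proposition 8.44."""
    return abs(f(u) - (1.0 + 1j * u * m1 - 0.5 * u * u * m2)) / (u * u)
```

Numerically, the normalized remainder at u = 0.001 is already tiny, while at u = 0.5 it is of order one, consistent with δ(u) → 0.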

Another point that needs discussion is the logarithm of a complex number. For z complex, log z is a many-valued function defined by

e^{log z} = z.    (8.45)

For any determination of log z, log z + 2nπi, n = 0, ±1, …, is another solution of (8.45). Write z = re^{iθ}; then log z = log r + iθ. We always will pick that determination of θ which satisfies −π < θ ≤ π, unless we state otherwise. With this convention, log z is uniquely determined.

Proposition 8.46. For z complex,

log(1 + z) = z + ε(z),

where |ε(z)| ≤ |z| for |z| ≤ ½.

Proof. For |z| < 1, the power series expansion is


One remark: given a sequence fₙ(u) of characteristic functions, frequently we will take lₙ(u) = log fₙ(u) and show that lₙ(u) → φ(u) for some evaluation of the log function. Now lₙ(u) is not uniquely determined;

lₙ(u) + 2πi Nₙ(u), Nₙ(u) integer-valued, is just as good a version of log fₙ(u). However, if lₙ(u) → φ(u), and φ(u) is continuous at the origin for one evaluation of lₙ(u), then because fₙ(u) = e^{lₙ(u)} → e^{φ(u)}, the continuity theorem is in force.

12. METHOD OF MOMENTS

Suppose that all moments of a sequence of distribution functions Fₙ exist and for every integer k > 0 the limit of

exists. Does it follow that there is a distribution F such that Fₙ →𝒟 F? Not necessarily! The reason that the answer may be "No" is that the functions xᵏ do not separate. There are examples [123] of distinct distribution functions F and G such that ∫ |x|ᵏ dF < ∞, ∫ |x|ᵏ dG < ∞ for all k > 0, and

∫ xᵏ dF = ∫ xᵏ dG, all k.

Start to argue this way: if lim ∫ x² dFₙ < ∞, then (Problem 10) the {Fₙ} are mass-preserving. Take a subsequence F_{n′} →𝒟 F. Then (Problem 14)

so for the full sequence

∫ xᵏ dF = limₙ ∫ xᵏ dFₙ, k = 0, 1, 2, …    (8.47)

If there is only one F such that (8.47) holds, then every convergent subsequence of Fₙ converges to F, hence Fₙ →𝒟 F. Thus

Theorem 8.48. If there is at most one distribution function F such that (8.47) holds, then Fₙ →𝒟 F.

The question is now one of uniqueness. Let


If F is uniquely determined by (8.47), then the moment problem given by the μₖ is said to be determined. In general, if the μₖ do not grow too fast, then uniqueness holds. A useful sufficient condition is

Proposition 8.49. If

then there is at most one distribution function F satisfying

Proof. Let

then for any ε > 0 and k ≥ k₀, using the even moments to get bounds for the odd moments,

Hence, by the monotone convergence theorem,

for |ξ| < 1/re. Consider

By the above, φ(z) is analytic in the strip |Rl z| < 1/re. For |z| < 1/re,

This holds for any distribution function F having moments μₖ. Since φ(z) in the strip is the analytic continuation of φ(z) given by (8.50), φ(z) is completely determined by the μₖ. But for Rl z = 0, φ(z) is the characteristic function and thus uniquely determines F.

13. OTHER SEPARATING FUNCTION CLASSES

For restricted classes of distribution functions there are separating classes of functions which are sometimes more useful than the complex exponentials. For example, consider only nonnegative random variables; their distribution functions assign zero mass to (−∞, 0). Call this class of distribution functions ℳ⁺.


Proposition 8.51. The exponentials {e^{−λx}}, λ real and nonnegative, separate in ℳ⁺.

Proof. Suppose F and G are in ℳ⁺ and for all λ ≥ 0,

Then substitute e^{−x} = y, so

In particular, (8.52) holds for λ ranging through the nonnegative integers. Thus for any polynomial P(y),

hence equality holds for any continuous function on [0, 1]. Use an approximation argument to conclude now that F = G.

As before, if Fₙ ∈ ℳ⁺ and ∫ e^{−λx} Fₙ(dx) converges for all λ > 0, then there is at most one distribution function F such that Fₙ →𝒟 F. Let the limit of ∫ e^{−λx} dFₙ(x) be h(λ). Then by the bounded convergence theorem,

So conclude, just as in the continuity theorem, that if

then the sequence {Fₙ} is mass-preserving. Hence there is a unique distribution function F such that Fₙ →𝒟 F.

For X taking on nonnegative integer values, the moment-generating function is defined as

for z complex, |z| < 1.

Problem 25. Prove that the functions zˣ, |z| ≤ 1, are separating in the class of distribution functions of nonnegative integer-valued random variables. If {Xₙ} are a set of such random variables and

converges for all |z| < 1 to a function continuous at z = 1, then show there is a random variable X such that Xₙ →𝒟 X.
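As a concrete instance of such generating functions (an added illustration, not part of the original text): for X Poisson with parameter λ, E z^X = Σₖ e^{−λ} λᵏ zᵏ / k! = e^{λ(z−1)}, which the following sketch confirms by summing the series term by term:

```python
import math

def poisson_gf(lam, z, terms=200):
    """E z^X for X Poisson(lambda), computed as the series
    sum_k e^{-lam} lam^k z^k / k!."""
    total = 0.0
    term = math.exp(-lam)          # the k = 0 term
    for k in range(terms):
        total += term
        term *= lam * z / (k + 1)  # advance to the k+1 term
    return total

# the series matches the closed form e^{lam (z - 1)}
for z in (0.0, 0.3, 0.9):
    assert abs(poisson_gf(2.0, z) - math.exp(2.0 * (z - 1.0))) < 1e-12
```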


NOTES

More detailed background on distribution functions, etc., can be found in Loeve's book [108]. For material on the moment problem, consult Shohat and Tamarkin [123]. For Laplace transforms of distributions ∫ e^{−λx} dF(x), see Widder [140]. Although the central limit theorem for coin-tossing was proved early in the nineteenth century, a more general version was not formulated and proved until 1901 by Lyapunov [109]. The interesting proof we give in Section 6 is due to Lindeberg [106].

An important estimate for the rate of convergence in the central limit theorem is due to Berry and Esseen (see Loeve [108, pp. 282 ff.]). They prove that there is a universal constant c such that if Sₙ = X₁ + ⋯ + Xₙ is a sum of independent, identically distributed random variables with EX₁ = 0, EX₁² = σ² < ∞, E|X₁|³ < ∞, and if Φ(x) is the distribution function of the 𝒩(0, 1) law, then

It is known that c < 4 (Le Cam [99]), and unpublished calculations give bounds as low as 2.05. By considering coin-tossing, note that the 1/√n rate of convergence cannot be improved upon.


CHAPTER 9

THE ONE-DIMENSIONAL CENTRAL LIMIT PROBLEM

1. INTRODUCTION

We know already that if X₁, X₂, … are independent and identically distributed, EX₁ = 0, EX₁² = σ² < ∞, then

Furthermore, by the convergence of types theorem, no matter how Sₙ is normalized, if Sₙ/Aₙ 𝒟-converges, then the limit is a normal law or degenerate. So this problem is pretty well solved, with the exception of the question: why is the normal law honored above all other laws? From here there are a number of directions available; the identically distributed requirement can be dropped. This leads again to a normal limit if some nice conditions on moments are satisfied. Or the condition on moments can be dropped; take X₁, X₂, … independent, identically distributed but EX₁² = ∞. Now a new class of laws enters as the limits of Sₙ/Aₙ for suitable Aₙ, the so-called stable laws.

In a completely different direction is the law of rare events, convergence to a Poisson distribution. But this result is allied to the central limit problem, and there is an elegant unification via the infinitely divisible laws. Throughout this chapter, unless explicitly stated otherwise, equations involving logs of characteristic functions are supposed to hold modulo additive multiples of 2πi.

2. WHY NORMAL?

There is really no completely satisfying answer to this question. But consider: if X₁, X₂, … are independent, identically distributed, and if

what are the properties that X must have? Look at

185


Now Z₂ₙ →𝒟 X. But

The variables Zₙ′, Zₙ″ are independent, and Zₙ′ →𝒟 X, Zₙ″ →𝒟 X, since they have the same distributions as Zₙ. This (we would like to believe) implies that X has the same distribution as (1/√2)(X′ + X″), where X′, X″ are independent and have the same distribution as X. To verify this, note that

or

Since fₙ(u) →uc f(u), where f(u) is the characteristic function of X, it follows that f(u) = f(u/√2)². But the right-hand side of this is the characteristic function of (X′ + X″)/√2. So our expectation is fulfilled. Now the point is:
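The fixed-point relation f(u) = f(u/√2)² is indeed satisfied by the normal characteristic function e^{−σ²u²/2}, as Proposition 9.1 asserts it must be. A quick numerical confirmation (an added illustration; the sample points are arbitrary):

```python
import math

def normal_cf(u, sigma=1.0):
    """Characteristic function of N(0, sigma^2): e^{-sigma^2 u^2 / 2}."""
    return math.exp(-0.5 * (sigma * u) ** 2)

# f(u) = f(u/sqrt(2))^2, the relation forced by X ~ (X' + X'')/sqrt(2)
for u in (0.0, 0.7, 2.5):
    assert abs(normal_cf(u) - normal_cf(u / math.sqrt(2.0)) ** 2) < 1e-12
```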

Proposition 9.1. If a random variable X satisfies EX² < ∞, and

ℒ(X) = ℒ((X′ + X″)/√2),

where X′, X″ are independent and ℒ(X) = ℒ(X′) = ℒ(X″), then X has a 𝒩(0, σ²) distribution.

Proof. The proof is simple. Let X₁, X₂, … be independent, ℒ(Xₖ) = ℒ(X). EX must be zero, since EX = (EX′ + EX″)/√2 implies EX = √2·EX. By iteration,

But the right-hand sums, divided by σ, converge in distribution to 𝒩(0, 1).

Actually, this result holds without the restriction that EX² < ∞. A direct proof of this is not difficult, but it also comes out of later work we will do with stable laws, so we defer it.

3. THE NONIDENTICALLY DISTRIBUTED CASE

Let X₁, X₂, … be independent. Then

Theorem 9.2. Let EXₖ = 0, EXₖ² = σₖ² < ∞, E|Xₖ|³ < ∞, and sₙ² = Σₖ₌₁ⁿ σₖ². If

(1/sₙ³) Σₖ₌₁ⁿ E|Xₖ|³ → 0,    (9.3)

then

Sₙ/sₙ →𝒟 𝒩(0, 1).

Proof. Very straightforward and humdrum, using characteristic functions. Let fₖ = f_{Xₖ} and gₙ be the characteristic function of Sₙ/sₙ. Then

Using the Taylor expansion, we get from (8.44)

Now (E|Xₖ|²)^{3/2} ≤ E|Xₖ|³, so σₖ³ ≤ E|Xₖ|³. Then condition (9.3) implies that sup_{k≤n} σₖ/sₙ → 0. So sup_{k≤n} |fₖ(u/sₙ) − 1| → 0 as n goes to infinity. Therefore use the log expansion

where |θ| ≤ 1 for |z| ≤ ½, to get

where the equality holds modulo 2πi. Consider the second term above,

This bound goes to zero as n → ∞. Apply the Taylor expansion,

to the first term above to get

which converges to −u²/2.


We conclude that

gₙ(u) → e^{−u²/2}

for every u. Since the theorem holds for identically distributed random variables, it follows that e^{−u²/2} must be the characteristic function of the 𝒩(0, 1) distribution. Apply the continuity theorem to complete the proof.

Note that we got, in this proof, the additional dividend that if X is 𝒩(0, 1), then

f_X(u) = e^{−u²/2}.

4. THE POISSON CONVERGENCE

For X₁, X₂, … independent and identically distributed, EX₁ = 0, EX₁² = σ², let

Write, for x > 0,

Now

where limₙ θₙ(x) = 0. This leads to

the point being that

or Mₙ →𝒟 0. In this case, therefore, we are dealing with sums

of independent random variables such that the maximum of the individual summands converges to zero. I have gone through this to contrast it to the situation in which we have a sequence of coins 1, 2, …, with probabilities of heads p₁, p₂, …, where pₙ → 0, and the nth coin is tossed n times. For the nth coin, let Xₖ⁽ⁿ⁾ be one if heads comes up on the kth trial, zero otherwise. So


is the number of heads gotten using the nth coin. Think! The probability of heads on each individual trial is pₙ, and that is going to zero. However, the total number of trials is getting larger and larger. Is it possible that Sₙ converges in distribution? Compute

This will converge if and only if npₙ converges to some λ, 0 ≤ λ < ∞. If λ = 0, then Sₙ →𝒟 0, and henceforth we rule this case out.

For npₙ → λ, 0 < λ < ∞, take characteristic functions, noting that

For n sufficiently large, these are close to one, for u fixed, and we can write

Since npₙ → λ, this gives

so

Theorem 9.4. Sₙ converges in distribution if and only if npₙ → λ, 0 < λ < ∞; the limit then has characteristic function exp(λ(e^{iu} − 1)). The limit random variable X takes values in {0, 1, 2, …}, so

Expanding,

so

Definition 9.5. A random variable X taking values in {0, a, 2a, 3a, …} will be said to have a Poisson distribution with jump size a if
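The convergence in Theorem 9.4 is visible numerically: the binomial(n, λ/n) probabilities approach the Poisson probabilities e^{−λ} λᵏ / k!. An added illustrative sketch (the choices n = 10000 and λ = 2 are arbitrary):

```python
import math

def binom_pmf(n, p, k):
    """P(S_n = k) for n tosses of a coin with heads probability p."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def poisson_pmf(lam, k):
    """P(X = k) for X Poisson with parameter lambda."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam = 2.0
n = 10_000
gaps = [abs(binom_pmf(n, lam / n, k) - poisson_pmf(lam, k)) for k in range(6)]
```

With npₙ = λ held fixed, each gap is of order λ²/n, so here every entry of `gaps` is far below 10⁻³.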

Look now at the summands Xₖ⁽ⁿ⁾. Usually the Xₖ⁽ⁿ⁾ are zero, but once in a while along comes a blip. Again, take Mₙ = max(X₁⁽ⁿ⁾, …, Xₙ⁽ⁿ⁾). Now Mₙ can only take the values 0 or 1, and Mₙ does not converge in distribution to 0 unless λ = 0. Here the contrast obviously is that Mₙ must equal 1 with


positive probability, or λ = 0. It is the difference between the sum of uniformly small smears versus the sum of occasionally large blips. That this is pretty characteristic is emphasized by

Proposition 9.6. Let Sₙ = X₁⁽ⁿ⁾ + ⋯ + Xₙ⁽ⁿ⁾, where X₁⁽ⁿ⁾, …, Xₙ⁽ⁿ⁾ are independent and identically distributed. If Sₙ →𝒟 X, then Mₙ →𝒟 0 if and only if X is normal.

Proof. Deferred until Section 7.

The Poisson convergence can be generalized enormously. For example, suppose Sₙ = X₁⁽ⁿ⁾ + ⋯ + Xₙ⁽ⁿ⁾, the Xₖ⁽ⁿ⁾ independent and identically distributed with

and Sₙ →𝒟 X. We could again show that this is possible only if npᵢ⁽ⁿ⁾ → λᵢ, 0 ≤ λᵢ < ∞, and if so, then

Two interesting points are revealed in this result.

First: The expected number of times that Xₖ⁽ⁿ⁾ = xᵢ is npᵢ⁽ⁿ⁾. So λᵢ is roughly the expected number of times that one of the summands is xᵢ.

Second: Since

X is distributed as Σᵢ Yᵢ, where the Yᵢ are independent random variables and Yᵢ has a Poisson distribution with jump size xᵢ. So the jumps do not interact; each jump size xᵢ contributes an independent Poisson component.

5. THE INFINITELY DIVISIBLE LAWS

To include both the Poisson and the normal convergence, ask the following question: Let

where the Xₖ⁽ⁿ⁾ are independent and identically distributed. If Sₙ →𝒟 X, what are the possible distributions of X?

Sₙ is the sum of many independent components; heuristically, X must have this same property.


Definition 9.8. X will be said to have an infinitely divisible distribution if for every n there are independent and identically distributed random variables X₁⁽ⁿ⁾, …, Xₙ⁽ⁿ⁾ such that

ℒ(X) = ℒ(X₁⁽ⁿ⁾ + ⋯ + Xₙ⁽ⁿ⁾).

Proposition 9.9. A random variable X is a limit in distribution of sums of the type (9.7) if and only if it has an infinitely divisible distribution.

Proof. If X has an infinitely divisible distribution, then by definition there are sums S_n of type (9.7) with distribution exactly equal to that of X. The other way: Consider the two halves of S_{2n},

S_{2n} = Y_n + Y_n'.

The random variables Y_n and Y_n' are independent with the same distribution. If S_{2n} → X in distribution, the distributions of Y_n are mass-preserving, because

P(Y_n > x)² = P(Y_n > x)P(Y_n' > x) ≤ P(S_{2n} > 2x),

and similarly,

P(Y_n < −x)² ≤ P(S_{2n} < −2x).

Take a subsequence {n'} such that Y_{n'} → Y in distribution. Obviously, f_X(u) = [f_Y(u)]²; so X is distributed as the sum of two independent copies of Y. This can be repeated to get X equal in distribution to the sum of m independent, identically distributed random variables by considering S_{nm}.
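The splitting used in this proof can be seen in closed form for the two laws already in hand (a sketch; the parameter values below are arbitrary): for Poisson(λ) and N(β, σ²), the n-th root of the characteristic function is again a characteristic function of the same family, so f(u) = [f_n(u)]^n exactly.

```python
import cmath

def poisson_cf(lam, u):
    # characteristic function of the Poisson law with parameter lam
    return cmath.exp(lam * (cmath.exp(1j * u) - 1))

def normal_cf(beta, var, u):
    # characteristic function of N(beta, var)
    return cmath.exp(1j * beta * u - 0.5 * var * u * u)

lam, beta, var, n = 2.0, 1.5, 0.7, 10   # arbitrary demo values
for u in (-3.0, -0.5, 0.1, 2.0):
    # the n-th root components are again Poisson / normal laws
    assert abs(poisson_cf(lam / n, u) ** n - poisson_cf(lam, u)) < 1e-12
    assert abs(normal_cf(beta / n, var / n, u) ** n - normal_cf(beta, var, u)) < 1e-12
print("Poisson and normal laws are infinitely divisible")
```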

If S_n → X in distribution, do the components have to get smaller and smaller in any reasonably formulated way? Note that in both the Poisson and the normal convergence, X_k^(n) → 0 in probability for any k; that is, P(|X_k^(n)| > ε) → 0 for every ε > 0 [so, of course, max_k P(|X_k^(n)| > ε) → 0, since these probabilities are the same for all k = 1, …, n]. This holds in general.

Proposition 9.10. If S_n → X in distribution, then X_1^(n) → 0 in probability.

Proof. Since f(0) = 1 and f is continuous, there is a neighborhood N of the origin on which |f(u)| is bounded away from zero; hence on N the |f_{S_n}(u)| are eventually bounded away from zero, so

|f_n(u)| = |f_{S_n}(u)|^{1/n} → 1.

On N, then, f_n(u) → 1, and now apply 8.37.

Now I turn to the problem of characterizing the infinitely divisible distributions. Let f(u) be the characteristic function of X. Since X is infinitely divisible, for every n


there is a characteristic function f_n(u) such that f(u) = [f_n(u)]^n, and, by 9.10, f_n(u) → 1. Then, since f_n(u) − 1 → 0,

log f(u) = lim_n n(f_n(u) − 1).     (9.11)

Also, |f(u)| ≠ 0 for all u; otherwise f(u_0) = 0 would imply f_n(u_0) = 0 for all n, contradicting f_n(u_0) → 1. Denote by F_n the distribution function of X_1^(n); then

n(f_n(u) − 1) = n ∫ (e^{iux} − 1) F_n(dx).     (9.12)

If we set up approximating sums of the integral in (9.12), we get sums exactly like those in the general Poisson case looked at before. Note also that if we put μ_n(B) = nF_n(B), a nonnegative measure on ℬ_1, then

n(f_n(u) − 1) = ∫ (e^{iux} − 1) μ_n(dx).

If μ_n converges to a measure μ in the sense that ∫ g dμ_n → ∫ g dμ for continuous bounded functions g, then we could conclude

log f(u) = ∫ (e^{iux} − 1) μ(dx).

This is the basic idea, but there are two related problems. First, the total mass of μ_n is μ_n(R^(1)) = n, hence μ_n(R^(1)) → ∞. Certainly, then, there is no finite measure μ such that μ_n → μ in this sense. Second, how can the characteristic function be represented as above? Now, for any neighborhood N of the origin, we would expect more and more of the mass of μ_n to be in N; that is, μ_n(N) → ∞. But in analogy with Poisson convergence, the number of times that a summand is sizeable enough to take values outside of N should be bounded; that is, sup_n μ_n(N^c) < ∞. We can prove even more than this.

Proposition 9.13

Proof. By inequality 8.29,


Take the real part of Eq. (9.11), and pass to the limit to get

Use |f(u)| > 0, all u, and the bounded convergence theorem for the rest.

so the μ_n sequence is in this sense mass-preserving. What is happening is that the mass of μ_n is accumulating near the origin, and behaving nicely away from the origin as n → ∞. But if φ is continuous with φ(0) = 0, there is some hope that ∫ φ(x)μ_n(dx) may converge. This is true to some extent; more exactly,

Proposition 9.14

Proof.

By Taylor's expansion, cos x = 1 − (x²/2) cos αx, |α| ≤ 1, so there is a β, 0 < β < ∞, such that cos x ≤ 1 − βx², for |x| ≤ 1. Thus

However, n(1 − Rl f_n(1)) → −Rl log f(1), giving the result, since |f(1)| ≠ 0.

By 9.13 and 9.14, if we define ν_n(B) = ∫_B φ(x)μ_n(dx), where φ(x) is bounded and behaves like x² near the origin, then ν_n(B) is a bounded sequence of measures and we can think of trying to apply the Helly-Bray theorem. The choice of φ(x) is arbitrary, subject only to boundedness and the right behavior near zero. The time-honored custom is to take φ(x) to be x²/(1 + x²). Thus let α_n = ∫ (y²/(1 + y²)) μ_n(dy),

making G_n(x) a distribution function. By 9.13 and 9.14, lim α_n < ∞. We can write

but the integrand blows up as x → 0. So we first subtract the infinity by writing,


Then we write β_n = ∫ x/(1 + x²) dμ_n, so that

If the integral term converges, we see that {β_n} can contain no subsequence going to infinity. If it did, then

would imply that, on substituting u = v/β_n and going along the subsequence, we would get e^{iv} = h(0) for all v. If {β_n} has two limit points β, β', then e^{iβu} = e^{iβ'u}, hence β = β'. Thus, uc convergence of the first term entails convergence of the second term to iβu.

The integrand in (9.15)

is a continuous bounded function of x for x ≠ 0. As x → 0, it has the limit −u²/2. By defining φ(0, u) = −u²/2, φ(x, u) is jointly continuous everywhere. By 9.13, {G_n} is mass-preserving. If lim α_n = 0, take n' such that α_{n'} → 0 and conclude from (9.15) that X is degenerate. Otherwise, take n' such that

α_{n'} → α > 0 and G_{n'} → G in distribution. Then G is a distribution function. Go along the n' sequence in (9.15). The β_{n'} sequence must converge to some limit β since the integral term converges uc. Therefore

Suppose G({0}) > 0, then

We have now shown part of

Theorem 9.17. X has an infinitely divisible distribution if and only if its characteristic function f(u) is given by

where ν is a finite measure that assigns zero mass to the origin.

To complete the proof: It has to be shown that any random variable whose characteristic function is of the form (9.18) has an infinitely divisible


distribution. To begin, assume that any function f(u) whose log is of the form (9.18) is a characteristic function. Then it is trivial, because if f_n(u) is defined by

then log f_n(u) is again of the form (9.18); so f_n(u) is a characteristic function. Since f(u) now is given by [f_n(u)]^n for any n, the corresponding distribution is infinitely divisible. The last point is to show that (9.18) always gives a characteristic function. Take partitions 𝒯_n of R^(1) into finite numbers of intervals such that the Riemann sums in (9.18) converge to the integral, that is,

Put β_n = β − Σ_j (x_j/(1 + x_j²)) ν(I_j), denote

and write

See that g_n(u) is the product of a characteristic function of an N(β_n, σ²) distribution and characteristic functions of Poisson distributions with jumps x_j. Therefore, by Corollary 8.35, g_n(u) is a characteristic function. This does it, because

for every u. Check that anything of the form (9.18) is continuous at u = 0. Certainly the first two terms are. As to the integral, note that sup_x |φ(x, u)| ≤ M < ∞ for all |u| ≤ 1. Also, lim_{u→0} φ(x, u) = 0 for every x, and apply the bounded convergence theorem to get

By the continuity theorem, f(u) is a characteristic function. Q.E.D.
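As a concrete instance of the representation just proved (a sketch; the x²/(1 + x²) normalization below is my reading of this section's convention, and the values λ = 0.8, x_0 = 2.0 are invented): taking ν to be a single point mass and choosing β to cancel the centering term recovers exactly the Poisson characteristic function with jumps of size x_0.

```python
import cmath

def phi(x, u):
    # integrand of the representation, x^2/(1+x^2) convention (my reading)
    if x == 0.0:
        return -u * u / 2.0                     # continuous extension at 0
    return ((cmath.exp(1j * u * x) - 1 - 1j * u * x / (1 + x * x))
            * (1 + x * x) / (x * x))

lam, x0 = 0.8, 2.0                              # invented demo values
nu_mass = lam * x0 * x0 / (1 + x0 * x0)         # nu = nu_mass * delta_{x0}
beta = lam * x0 / (1 + x0 * x0)                 # cancels the centering term

for u in (-2.0, 0.5, 3.0):
    log_f = 1j * beta * u + nu_mass * phi(x0, u)
    log_poisson = lam * (cmath.exp(1j * u * x0) - 1)
    assert abs(log_f - log_poisson) < 1e-12
print("a point mass nu gives Poisson jumps of size x0")
```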

6. THE GENERALIZED LIMIT PROBLEM

Just as before, it becomes reasonable to ask what are the possible limit laws of

if the restriction that the X_1^(n), …, X_{k_n}^(n) be identically distributed is lifted.


Some restriction is needed; otherwise, take

to get any limit distribution desired. What is violated in the spirit of the previous work is the idea of a sum of a large number of components, each one small on the average. That is, we had, in the identically distributed case, that for every ε > 0,

This is the restriction that we retain in lieu of identical distribution. It is just about the weakest condition that can be imposed on the summands in order to prevent one of the components from exerting a dominant influence on the sum. With condition (A) a surprising result comes up.

Theorem 9.19. If the sums S_n → X in distribution, then X has an infinitely divisible distribution.

So in a strong sense the infinitely divisible laws are the limit laws of large sums of independent components, each one small on the average. The proof of 9.19 proceeds in exactly the same way as that of Theorem 9.17, the only difference being that μ_n(B) = Σ_{k=1}^{k_n} F_k^(n)(B) instead of nF_n(B), but the same inequalities are used. It is the same proof except that one more subscript is floating around.

Problem 1. Let X have an infinitely divisible distribution,

if ν({0}) = 0, and if ν assigns all its mass to a countable set of points, prove that the distribution of X is of pure type. [Use the law of pure types.]

7. UNIQUENESS OF REPRESENTATION AND CONVERGENCE

Let X have an infinitely divisible distribution with characteristic function f(u). Then by (9.18), there is a finite measure γ(dx) (possibly with mass at the origin) and a constant β such that

and φ(x, u) is continuous in both x and u and bounded for x ∈ R^(1), u ∈ [−U, +U], U < ∞. Log f(u) is defined up to additive multiples of 2πi. Because |f(u)| ≠ 0, there is a unique version of log f(u) which is zero when u is zero and is a continuous function of u on R^(1). Now (9.20) states that this version is given by the right-hand side above.


Proposition 9.21. γ(dx) and β are uniquely determined by (9.20).

Proof. Let ψ(u) = log f(u); this is the continuous version of log f(u). Then (following Loève [108]), take

so that θ(u) is determined by ψ, hence by f. Note that

Hence, using (9.20),

where

It is easy to check that 0 < inf g(x) ≤ sup g(x) < ∞. But θ(u) uniquely determines the measure ν(B) = ∫_B g(x)γ(dx), and thus γ is determined as γ(B) = ∫_B [g(x)]^{−1} ν(dx). If, therefore,

then γ = γ', implying β = β'.

The fact that γ(dx) is unique gives us a handhold on conditions for S_n → X in distribution. Let γ(dx) = αG(dx), where G is a distribution function. Recall that α, G(x) were determined by taking any convergent subsequence α_{n'} of α_n, α = lim_{n'} α_{n'}, and taking G(x) as any limit distribution of the G_{n'}(x) sequence. Since α, G(x) are unique, then α_n → α, G_n → G in distribution. Consequently β_n, defined by

converges to β. Thus, letting γ(x) = γ(−∞, x), and


then S_n → X in distribution implies γ_n → γ in distribution, and {γ_n} is mass-preserving in the sense that for any ε > 0, there exists a finite interval I such that sup_n γ_n(I^c) < ε. These conditions are also sufficient:

Theorem 9.22. S_n → X in distribution, where X has characteristic function given by

for γ(dx) a finite measure, if and only if the measures γ_n(dx) are mass-preserving in the above sense and

Proof. All that is left to do is show sufficiency. This is easy. Since

it follows that

hence F_n converges to the law degenerate at zero. Thus for all u in a finite interval, we can write

where ε_n(u) → 0. Thus

Now we can go back and get the proof of 9.6. If S_n = X_1^(n) + ··· + X_n^(n) and S_n → X in distribution, then clearly X is normal if and only if γ_n converges to a measure γ concentrated on the origin. Equivalently, for every x > 0, γ_n((−x, +x)^c) → 0. Since

this is equivalent to

But

Because X_1^(n) → 0 in probability,


where δ_n(x) → 0 for x > 0. Therefore,

which completes the proof.

8. THE STABLE LAWS

Let X_1, X_2, … be identically distributed, nondegenerate, independent random variables. What is the class of all possible limit laws of normed sums

Since S_n may be written as

the requirement S_n → X in distribution implies that X is infinitely divisible. The condition X_1^(n) → 0 in probability implies A_n → ∞, B_n/n → 0. This class of limit laws is the most interesting set of distributions following the normal and Poisson. Of course, if EX_1² < ∞, then A_n ~ √n and X must be normal. So the only interesting case is EX_1² = ∞. Two important questions arise.

First: What is the form of the class of all limit distributions X such that

Second: Find necessary and sufficient conditions on the common distribution function of X_1, X_2, … so that S_n → X in distribution.

These two questions lead to the stable laws and the domains of attraction of the stable laws.

Definition 9.24. A random variable X is said to have a stable law if for every integer k > 0, and X_1, …, X_k independent with the same distribution as X, there are constants a_k > 0, b_k such that

This approach is similar to the way we intrinsically characterized the normal law; by breaking S_{nk} up into k blocks, we concluded that the limit of S_n/√n must satisfy ℒ(X_1 + ··· + X_k) = ℒ(√k X).
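A quick closed-form check of the definition (a sketch; the standard symmetric Cauchy law is my chosen example, not one the text uses here): its characteristic function e^{−|u|} satisfies [f(u)]^k = f(ku), so X_1 + ··· + X_k is distributed as kX, i.e., a_k = k and b_k = 0, matching a_k = k^{1/α} with α = 1.

```python
import math

def cauchy_cf(u):
    # characteristic function of the standard symmetric Cauchy law
    return math.exp(-abs(u))

for k in (2, 3, 10):
    for u in (-1.7, 0.4, 5.0):
        # cf of X_1 + ... + X_k equals cf of k*X: a_k = k, b_k = 0
        assert abs(cauchy_cf(u) ** k - cauchy_cf(k * u)) < 1e-12
print("Cauchy is stable with exponent alpha = 1")
```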

Proposition 9.25. X is the limit in distribution of normed sums (9.23) if and only if X has a stable law.


Proof. One way is quick: If X is stable, then ℒ(X_1 + ··· + X_n) = ℒ(a_n X + b_n). Then (check this by characteristic functions),

and we can take A_n = a_n, B_n = b_n/a_n to get

(actually equality in distribution to X holds here). To go the other way, suppose

Then Z_{nk} → X in distribution as n → ∞ for all k. Repeat the trick we used for the normal law:

where S^(1) = X_1 + ··· + X_n, S^(2) = X_{n+1} + ··· + X_{2n}, …. Thus

where C_{nk} = (A_{nk}/A_n)B_{nk} − kB_n. By the law of convergence of types,

Therefore,

This not only proves 9.25 but contains the additional information that if S_n → X in distribution, then A_{nk}/A_n → a_k for all k. By considering the limit of A_{nmk}/A_n as n → ∞, we conclude that the constants a_k must satisfy

9. THE FORM OF THE STABLE LAWS

Theorem 9.27. Let X have a stable law. Then either X has a normal distribution or there is a number α, 0 < α < 2, called the exponent of the law, and constants m_1 ≥ 0, m_2 ≥ 0, β such that


Proof. Since X is stable, it is infinitely divisible,

In terms of characteristic function, the definition of stability becomes

or

Separate the situation into two cases.

CASE I. Define a measure μ:

Then μ is σ-finite, μ([−a, +a]^c) < ∞ for any a > 0, ∫_{[−a,+a]} x² dμ < ∞, and

This last integrand behaves like x³ near the origin, and is bounded away from the origin, so the integral exists. Define a change-of-variable measure

to get

Therefore, (9.28) becomes


By the uniqueness of representation of the infinitely divisible characteristic function we get the central result

Similarly, let M⁻(x) = μ(−∞, x), x < 0. Again, kM⁻(x) = M⁻(x/a_k), x < 0, k > 0.

Proposition 9.31. a_k = k^λ, λ > 0, and

Proof. M⁺(x) is nonincreasing. The relation kM⁺(1) = M⁺(1/a_k) implies a_k increasing in k, and we know a_k → ∞. For any k, a_n a_k = a_{nk} gives

For n > k, take j such that k^j ≤ n < k^{j+1}. Then

or log(a_{k^j}) ≤ log a_n < log(a_{k^{j+1}}),

j log a_k ≤ log a_n < (j + 1) log a_k.

Dividing by j log k, we get

Now let n → ∞; consequently j → ∞, and (log n)/(j log k) → 1, implying

To do the other part, set x = (k/n)^λ; then in (9.30),

For k = n, this is nM⁺(1) = M⁺(1/n^λ). Substituting this above gives

or


For all x in the dense set {(k/n)^λ} we have shown M⁺(x) = x^{−1/λ}M⁺(1). The fact that M⁺(x) is nonincreasing makes this hold true for all x. Similarly for M⁻(x).

The condition ∫_{[−1,1]} x² dμ < ∞ implies ∫_{−1}^{1} x² |x|^{−(1/λ)−1} dx < ∞, so that −1/λ + 1 > −1, or finally λ > 1/2. For ψ(u) the expression becomes

where m_1 = M⁺(1)/λ, m_2 = M⁻(−1)/λ, and α = 1/λ; so 0 < α < 2.

CASE II. If γ({0}) = σ² > 0, then

The coefficient σ² is uniquely determined by lim_{u→∞} ψ(u)/u² = −σ²/2, because sup_{x,u} |φ(x, u)/u²| < ∞ and φ(x, u)/u² → 0 for x ≠ 0, as u → ∞. Apply the bounded convergence theorem to ∫_{{0}^c} (φ(x, u)/u²)γ(dx) to get the result. Therefore, dividing (9.28) by u² and letting u → ∞ gives

k = a_k², which implies λ = 1/2.

So (9.28) becomes

As k → ∞, ψ(√k u)/(ku²) → −σ²/2. This entails b_k/k → β and

It is not difficult to check that every characteristic function of the form given in (9.27) is the characteristic function of a stable law. This additional fact completes our description of the form of the characteristic function for stable laws.

Problems

2. Use the methods of this section to show that Proposition 9.1 holds without the restriction EX² < ∞.


3. Use Problem 2 to prove that if X_1, X_2, … are independent, identically distributed random variables and

then X is normal or degenerate.

4. Show that if α < 1, then ψ(u) can be written as

Then prove that β = 0, m_2 = 0, implies X > 0 a.s.

10. THE COMPUTATION OF THE STABLE CHARACTERISTIC FUNCTIONS

When the exponent α is less than 2, the form of the stable characteristic function is given by 9.27. By doing some computations, we can evaluate these integrals in explicit form.

Theorem 9.32. f(u) = e^{ψ(u)} is the characteristic function of a stable law of exponent α, 0 < α < 1 or 1 < α < 2, if and only if it has the form

where c is real, d real and positive, and θ real such that |θ| ≤ 1. For α = 1, the form of the characteristic function is given by

with c, d, θ as above.

Proof. Let

Since I_1(−u) is the complex conjugate of I_1(u), we evaluate I_1(u) only for u > 0. Also,

Consider first 0 < α < 1; then


Substitute ux = y in the first term to get, for u > 0,

where

For 1 < α < 2, integrate by parts, getting

where

Substitute ux = y again, so

If α = 1, the integration by parts gives

Let

Then for u_2 > u_1 > 0,

Now, by the Riemann-Lebesgue lemma (see Chapter 10, Section 2),

since 1_{[u_1,u_2]}(t)/t is a bounded measurable function vanishing outside a finite interval. This gives


Consequently, checking that lim J(T, 1) exists,

and

For the first time, the constant c appearing in the linear term is complex:

where c_1 is real. The integral ∫_0^∞ (sin x/x) dx is well known and equals π/2. Finally, then,

The remaining piece of work is to evaluate

This can be done by contour integration (see Gnedenko and Kolmogorov [62, p. 169]), with the result that

where L(ξ) is real and negative. Putting everything together, we get

For 0 < α < 1, and u > 0,

where d = −(m_1 + m_2) Rl H(α) is real and positive, θ = (m_1 − m_2)/(m_1 + m_2) is real with range [−1, +1], and c is real. For 1 < α < 2,

Now

so


Here d = ((m_1 + m_2)/α) Rl(e^{−i(π/2)α} L(α − 1)) is real and positive, and θ is again (m_1 − m_2)/(m_1 + m_2) with range [−1, +1]. If α = 1, then

θ as above, d real and positive, c real. Q.E.D.

Problem 5. Let S_n = X_1 + ··· + X_n be consecutive sums of independent random variables each having the same symmetric stable law of exponent α. Use the Fourier inversion theorem and the technique of Problem 24, Chapter 8, to show that the sums are transient for 0 < α < 1, recurrent for α ≥ 1. The case α = 1 provides an example where E|X_1| = ∞, but the sums are recurrent. Show that for all α > 0,

Conclude that the sums change sign infinitely often for any α > 0.

11. THE DOMAIN OF ATTRACTION OF A STABLE LAW

Let X_1, X_2, … be independent and identically distributed. What are necessary and sufficient conditions on their distribution function F(x) such that S_n, suitably normalized, converges in distribution to X, where X is nondegenerate? Of course, the limit random variable X must be stable.

Definition 9.33. The distribution F(x) is said to be in the domain of attraction of a stable law with exponent α < 2 if there are constants A_n, B_n such that

and X has exponent α. Denote this by F ∈ D(α).

Complete conditions on F(x) are given by

Theorem 9.34. F(x) is in the domain of attraction of a stable law with exponent α < 2 if and only if there are constants M⁺ ≥ 0, M⁻ ≥ 0, M⁺ + M⁻ > 0, such that as y → ∞:

(ii) For every ξ > 0,
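Before the proof, a concrete family with tails of this kind (an invented example, not from the text): a Pareto law with 1 − F(y) = y^{−α} for y ≥ 1. Its tail ratios (1 − F(ξy))/(1 − F(y)) stabilize at ξ^{−α}, the regular-variation behavior that condition (ii) expresses.

```python
# Invented example: a Pareto law with tail 1 - F(y) = y**(-alpha), y >= 1.
alpha = 1.5

def tail(y):
    # 1 - F(y)
    return y ** (-alpha) if y >= 1 else 1.0

for xi in (0.5, 2.0, 10.0):
    for y in (1e3, 1e6, 1e9):
        # tail ratio stabilizes at xi**(-alpha): regularly varying tails
        assert abs(tail(xi * y) / tail(y) - xi ** (-alpha)) < 1e-9
print("Pareto tails satisfy the xi**(-alpha) tail-ratio condition")
```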


Proof. We show first necessity. By Theorem 9.22, γ_n → γ in distribution. Now take b_n = B_n/n. Then

where

and

Thus, for any x_1 < 0, x_2 > 0, since the γ_n must be mass-preserving,

This is

By taking one or the other of x_1, x_2 to be infinite, we find that the condition becomes

Then

Since b_n → 0, for any ε > 0 and n sufficiently large,

We can use this to get

We know that M⁺(x), M⁻(x) are continuous, so we can conclude that

Now if we fix x > 0, for any y > 0 sufficiently large, there is an n such that


Then

So for any ξ > 0,

In exactly the same way, conclude

Also,

which leads to

To get sufficiency, assume, for example, that M⁺ > 0. I assert we can define constants A_n > 0 such that

Condition (ii) implies 1 − F(x) > 0 for all x > 0. Take A_n such that for any ε > 0, n(1 − F(A_n)) ≥ M⁺, but n(1 − F(A_n + ε)) < M⁺. Then if lim n(1 − F(A_n)) = M⁺(1 + δ), δ > 0, there is a subsequence n' such that for every ε > 0,

This is ruled out by condition (ii). So

Similarly,


Take F to be the distribution function of a random variable X_1, and write

Then

Therefore, the {μ_n} sequence is mass-preserving, and so is the {γ_n} sequence defined, as before, by

Let g(x) = f(x)x²/(1 + x²). The first integral converges to

where δ_1(ε) → 0 as ε → 0. Thus, defining γ(x) in the obvious way,

To complete the proof that γ_n → γ in distribution, we need

Proposition 9.35

where δ_2(ε) → 0 as ε → 0.

In order not to interrupt the main flow of the proof, I defer the proof of 9.35 until the end. Since A_n → ∞, the characteristic function g_n(u) of (X_1 + ··· + X_n)/A_n is given, as before, by

where


and ε_n(u) → 0. Since γ_n → γ in distribution, the first term tends toward ∫ φ(x, u)γ(dx). The characteristic function of S_n/A_n − β_n is e^{−iuβ_n}g_n(u). So, if ε_n(u)β_n → 0, then the theorem (except for 9.35) follows with B_n = β_n. For n sufficiently large,

where f is the characteristic function of X_1. But

so it is sufficient to show that β_n²/n → 0. For any ε > 0, use the Schwarz inequality to get

Apply 9.35 to reach the conclusion.

Proof of 9.35. We adapt a proof due to Feller [59, Vol. II]. Write

We begin by showing that there is a constant c such that

For t > T,

Fix x > 1. For any ε > 0, take T so large that for t > T,

From (9.36)

This inequality implies I(t) → ∞ as t → ∞. Then, dividing by I(t) and letting t → ∞ yields


Now in (9.36) take T = 1, so

The integrand is bounded by ε, so

Thus, taking lim inf on x,

This does it. To relate this to 9.35, integrate by parts, assuming

is a continuity point for all n. This gives

From this we get

Because of the inequality proved above,

Therefore,

The integral over the range (−ε, 0) is treated in the same way, with F(−x) in place of 1 − F(x). Q.E.D.

Problems

6. Show that [For any take x0 such that for

Define k by then

Now select m appropriately.


7. Show, using the same methods used to get β_n²/n → 0, that for F(x) ∈ D(α),

Conclude that

8. For F(x) ∈ D(α), α < 1, by using the same methods as in the proof of 9.35, show that

Conclude that β_n converges to a finite constant β; hence B_n can be defined to be zero for all n.

9. For F(x) ∈ D(α), 1 < α < 2, show that

converges, so B_n can be taken as −ES_n/A_n; that is, X_1, X_2, … are replaced by X_1 − EX_1, X_2 − EX_2, ….

10. For F(x) ∈ D(α), α < 1, if F(x) = 0 for x < 0, then

Prove, using Problem 8 and the computations of the various integrals defining the stable laws, that

12. A COIN-TOSSING EXAMPLE

The way to recognize laws in D(α) is

Definition 9.38. A function H(x) on (0, ∞) is said to be slowly changing if for all ξ ∈ (0, ∞), H(ξx)/H(x) → 1 as x → ∞.

For example, log x is slowly changing, so is log log x, but not x^a, a ≠ 0.
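These examples are easy to test numerically (a sketch of the ratio criterion in Definition 9.38):

```python
import math

xi = 5.0

def ratio(H, x):
    # H(xi * x) / H(x); "slowly changing" means this tends to 1 as x grows
    return H(xi * x) / H(x)

# log and log log are slowly changing: the ratio drifts toward 1
assert abs(ratio(math.log, 1e12) - 1.0) < abs(ratio(math.log, 1e2) - 1.0)
assert abs(ratio(math.log, 1e12) - 1.0) < 0.07
loglog = lambda x: math.log(math.log(x))
assert abs(ratio(loglog, 1e12) - 1.0) < 0.06

# but x**a with a != 0 is not: the ratio is identically xi**a
power = lambda x: x ** 0.3
assert abs(ratio(power, 1e12) - xi ** 0.3) < 1e-9
print("log, log log: slowly changing; x**0.3: not")
```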

Proposition 9.39. F(x) ∈ D(α) if and only if, defining H⁺(x), H⁻(x) on (0, ∞) by


there are constants M⁺ ≥ 0, M⁻ ≥ 0, M⁺ + M⁻ > 0 such that

M⁺ > 0 ⇒ H⁺(x) slowly changing, M⁻ > 0 ⇒ H⁻(x) slowly changing,

and, as x → ∞,

Proof. Obvious.

Now, in fair coin-tossing, if R_1 is the time until the first return to equilibrium, then by Problem 19, Chapter 3,

where d_n → 0, so 1 − F(n) ~ cn^{−1/2}. Thus for the range n ≤ x ≤ n + 1, 1 − F(x) ~ cn^{−1/2}, or finally, 1 − F(x) = cx^{−1/2}(1 + d(x)), d(x) → 0. Therefore, by 9.39, F(x) ∈ D(1/2). To get the normalizing A_n, the condition is that n(1 − F(A_n)) converges to M⁺, that is,

Take A_n = n²; then M⁺ = c = √2/√π. We conclude, if R_n is the time of the nth return to equilibrium, then

Theorem 9.40. R_n/n² → X in distribution, where
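The tail estimate driving this example can be checked exactly (a sketch using the classical identity P(no return to 0 in the first 2m steps) = C(2m, m)/4^m for the simple symmetric random walk): n^{1/2}(1 − F(n)) approaches √(2/π) ≈ 0.798, the constant c = M⁺ above.

```python
import math

def no_return_prob(m):
    # P(no return to 0 in the first 2m steps) = C(2m, m) / 4**m,
    # computed via the recursion t_m = t_{m-1} * (2m - 1) / (2m)
    t = 1.0
    for j in range(1, m + 1):
        t *= (2 * j - 1) / (2 * j)
    return t

target = math.sqrt(2 / math.pi)                  # ~0.79788
vals = [math.sqrt(2 * m) * no_return_prob(m) for m in (10, 100, 10000)]
assert abs(vals[-1] - target) < 1e-3             # sqrt(n)(1 - F(n)) -> sqrt(2/pi)
assert abs(vals[-1] - target) < abs(vals[0] - target)
print(round(vals[-1], 5), round(target, 5))
```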

13. THE DOMAIN OF ATTRACTION OF THE NORMAL LAW

If we look at the various possible distribution functions F(x) and ask when we can normalize S_n so that S_n/A_n − B_n converges to a nondegenerate limit, we now have some pretty good answers. For F(x) to be in D(α), the tails of the distribution must behave in a very smooth way; actually, the mass of the distribution out toward ∞ must mimic the behavior of the tails of the limiting stable law. So for 0 < α < 2, only a few distribution functions F(x) are in D(α). But the normal law is the limit for a wide class of distributions, including all F(x) such that ∫ x²F(dx) < ∞. The obvious unanswered question is: What else does this class contain? In other words, for what distributions F(x) does S_n/A_n − B_n → N(0, 1) in distribution for an appropriate choice of A_n, B_n?

Proof of 9.40. See Problem 10.


We state the result only; see Gnedenko and Kolmogorov [62, pp. 172 ff.] for the proof.

Theorem 9.41. There exist A_n, B_n such that S_n/A_n − B_n → N(0, 1) in distribution if and only if

Problem 11. Show that ∫ y²F(dy) < ∞ implies that the limit in (9.42) is zero. Find a distribution such that ∫ y²F(dy) = ∞ but (9.42) is satisfied.

NOTES

The central limit theorem 9.2 dates back to Lyapunov [109, 1901]. The general setting of the problem into the context of infinitely divisible laws starts with Kolmogorov [94, 1932], who found all infinitely divisible laws with finite second moment, and Paul Lévy [102, 1934], who derived the general expression while investigating processes depending on a continuous-time parameter (see Chapter 14). The present framework dates to Feller [51] and Khintchine [89] in 1937. Stable distributions go back to Paul Lévy [100, 1924], and also [103, 1937]. In 1939 and 1940 Doeblin, [27] and [29], analyzed the problem of domains of attraction. One fascinating discovery in his later paper was the existence of universal laws. These are laws ℒ(X) such that for S_n = X_1 + ··· + X_n sums of independent random variables each having the law ℒ(X), there are normalizing constants A_n, B_n such that for Y having any infinitely divisible distribution, there is a subsequence {n_k} with

For more discussion of stable laws see Feller's book [59, Vol. II]. A much deeper investigation into the area of this chapter is given by Gnedenko and Kolmogorov [62].


CHAPTER 10

THE RENEWAL THEOREMAND LOCAL LIMIT THEOREM

1. INTRODUCTION

By sharpening our analytic tools, we can prove two more important weak limit theorems regarding the distribution of sums of independent, identically distributed random variables. We group these together because the methods are very similar, involving a more delicate use of characteristic functions and Fourier analytical methods. In the last sections, we apply the local limit theorem to get occupation time laws. This we do partly because of their own interest, and also because they illustrate the use of Tauberian arguments and the method of moments.

2. THE TOOLS

A basic necessity is the

Riemann-Lebesgue lemma. Let f(x) be ℬ-measurable and ∫ |f(x)| dx < ∞; then

∫ f(x)e^{iux} dx → 0 as u → ±∞.

Proof. For any ε > 0, take I a finite interval such that

and take M such that

Therefore,

so it is sufficient to prove this lemma for bounded f(x) vanishing off of finite intervals I. Then (see Problem 5, Chapter 5) for any ε > 0, there are disjoint intervals I_1, …, I_n such that


By direct computation,

as u goes to ± oo.
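For an indicator function the direct computation is explicit (a sketch): ∫_a^b e^{iux} dx = (e^{iub} − e^{iua})/(iu), whose modulus is at most 2/|u|, which forces the decay to 0.

```python
import cmath

def indicator_transform(a, b, u):
    # integral of e^{iux} over [a, b], for u != 0
    return (cmath.exp(1j * u * b) - cmath.exp(1j * u * a)) / (1j * u)

a, b = -1.0, 2.5
for u in (10.0, 100.0, 10000.0):
    # |transform| <= 2/|u|, so it vanishes as |u| -> infinity
    assert abs(indicator_transform(a, b, u)) <= 2.0 / u + 1e-12
print("transform of an indicator decays like 1/|u|")
```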

Next, we need to broaden the concept of convergence in distribution. Consider the class 𝒬 of σ-additive measures μ on ℬ_1 such that μ(I) is finite for every finite interval I.

Definition 10.2. Let μ_n, μ ∈ 𝒬. Say that μ_n converges weakly to μ, μ_n → μ (w), if for every continuous function φ(x) vanishing off of a finite interval,

∫ φ(x)μ_n(dx) → ∫ φ(x)μ(dx).

If the μ_n, μ have total mass one, then weak convergence coincides with convergence in distribution. Some of the basic results concerning convergence in distribution have easy extensions.

Proposition 10.4. μ_n → μ weakly iff for every Borel set A contained in a finite interval such that μ(bd(A)) = 0, μ_n(A) → μ(A).

The proof is exactly the same as in Chapter 8. The Helly-Bray theorem extends to

Theorem 10.5. If μ_n ∈ 𝒬 and, for every finite interval I, lim μ_n(I) < ∞, then there is a subsequence μ_{n_k} converging weakly to μ ∈ 𝒬.

Proof. For I_k = [−k, +k], k = 1, 2, …, use the Helly-Bray theorem to get an ordered subset N_k of the positive integers such that on the interval I_k, μ_n → μ^(k) weakly as n runs through N_k, and N_{k+1} ⊂ N_k. Here we use the obvious fact that the Helly-Bray theorem holds for measures whose total mass is bounded by the same constant. Let μ be the measure that agrees with μ^(k) on Borel subsets of I_k. Since N_{k+1} ⊂ N_k, μ is well defined. Let n_k be the kth member of N_k. Then clearly, μ_{n_k} → μ weakly.

There are also obvious generalizations of the ideas of separating classes of functions. But the key result we need is this:

Definition 10.6. Let ℋ be the class of ℬ-measurable complex-valued functions h(x) such that

and

where ĥ(u) is real and vanishes outside of a finite interval I.

Note that if h ∈ ℋ, then for any real v, the function e^{ivx}h(x) is again in ℋ.


Theorem 10.7. Let μ_n, μ ∈ 𝒬. Suppose that there is an everywhere positive function h_0 ∈ ℋ such that ∫ h_0 dμ is finite, and

for all functions h ∈ ℋ of the form e^{ivx}h_0(x), v real. Then μ_n → μ weakly.

Proof. Let α_n = ∫ h_0 dμ_n, α = ∫ h_0 dμ. Note that h_0, or for that matter any h ∈ ℋ, is continuous. If α = 0, then μ = 0, and since α_n → α, for any finite interval I, μ_n(I) → 0. Hence, assume α > 0. Define probability measures ν_n, ν on ℬ_1 by

By the hypothesis of the theorem, for all real v,

so that ν_n → ν in distribution. Thus, by 8.12, for any bounded continuous g(x),

For any continuous φ(x) vanishing off of a finite interval I, take g(x) = φ(x)/h_0(x) to conclude

A question remains as to whether ℋ contains any everywhere positive functions. To see that it does, check that

is in ℋ, with

Now, for λ_1, λ_2 not rational multiples of each other, the function h_0(x) = h_{λ_1}(x) + h_{λ_2}(x) is everywhere positive and in ℋ.

3. THE RENEWAL THEOREM

For independent, identically distributed random variables X_1, X_2, …, define, as usual, S_n = X_1 + ··· + X_n, S_0 = 0. In the case that the sums are transient, so that P(S_n ∈ I i.o.) = 0 for all finite I, there is one major result


concerning the interval distribution of the sums. It interests us chiefly in the case of finite nonzero first moment, say 0 < EX_1 < ∞. The law of large numbers guarantees S_n → ∞ a.s. But there is still the possibility of a more or less regular progression of the sums out to ∞. Think of X_1, X_2, … as the successive life spans of light bulbs in a given socket. After a considerable number of years of operation, the distributions of this replacement process should be invariant under shifts of the time axis. For instance, the expected number of failures in any interval of time I should depend only on the length of I. This is essentially the renewal theorem: Let l_d assign mass d to every point of L_d; l_0 is Lebesgue measure on L_0.

Theorem 10.8. Suppose X_1 is distributed on the lattice L_d, d ≥ 0. Let N(I) be the number of members of the sequence S_0, S_1, S_2, … landing in the finite interval I. Then as y → ∞ through L_d,

EN(I + y) → l_d(I)/EX_1.

Remarks. The puzzle is why the particular limit of 10.8 occurs. To get some feel for this, look at the nonlattice case. Let N(B) denote the number of landings of the sequence S_0, S_1, … in any Borel set B. Suppose that lim_y EN(B + y) existed for all B ∈ ℬ_1. Denote this limit by Q(B). Note that B_1, B_2 disjoint implies N(B_1 ∪ B_2) = N(B_1) + N(B_2). Hence Q(B) is a nonnegative, additive set function. With a little more work, its σ-additivity and σ-finiteness can be established. The important fact now is that for any x ∈ R^(1), B ∈ ℬ_1, Q(B + x) = Q(B); hence Q is invariant under translations. Therefore Q must be some constant multiple of Lebesgue measure, Q(dx) = a dx. To get a, by adding up disjoint intervals, see that

Let N(x) = N([0, x]), and assume in addition that X_1 > 0 a.s.; then

By the law of large numbers,

Along every sample sequence such that (10.9) holds, N(x) → ∞, and


then for the sequences such that (10.9) holds,

If we could take expectations of this, then a = 1/EX_1. (This argument is due to Doob [35].)
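The heuristic a = 1/EX_1 is easy to check by simulation (a sketch with an invented lifetime law, uniform on (0, 2), so that EX_1 = 1):

```python
import random

def renewal_count(t, rng):
    # number of n >= 1 with S_n <= t, lifetimes uniform on (0, 2), E X_1 = 1
    s, count = 0.0, 0
    while True:
        s += rng.uniform(0.0, 2.0)
        if s > t:
            return count
        count += 1

rng = random.Random(1)
t, reps = 1000.0, 200
avg = sum(renewal_count(t, rng) for _ in range(reps)) / reps
# E N([0, t]) / t should approach 1 / E X_1 = 1
print(round(avg / t, 3))
```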

Proof of 10.8. Define the renewal measure on by

The theorem states that H(- + y) - -> ld(-)/EXl as y goes to plus infinitythrough Ld. For technical reasons, it is easier to define

The second measure converges weakly to zero as y goes to plus infinity through L_d, because (from Chapter 3, Section 7) if I is any interval of length a, then

EN(I) ≤ P(S_n ∈ I at least once) EN([−a, +a]).

So for I = [b − a, b],

The fact that EX_1 > 0 implies that the sums are transient, and EN(I) < ∞ for all finite intervals I. Since S_n → ∞ a.s., inf_n S_n is a.s. finite. Hence the right-hand side above goes to zero as y → +∞. So it is enough to show that

μ_y → l_d/EX_1 weakly. Let h ∈ ℋ. The first order of business is to evaluate ∫ h(x) μ_y(dx). Since

then

Using

gives

where we assume either that h(x) is nonnegative, or ∫ |h(x)| μ_y(dx) < ∞. Note that


so

We would like to take the sum inside the integral sign, but the divergence at u = 0 is troublesome. Alternatively, let

Compute as above, that

Now there is no trouble in interchanging to get

In the same way,

where the bar denotes the complex conjugate. Then

The basic result used now is a lemma due to Feller and Orey [60].

Lemma 10.11. Let f(u) be a characteristic function such that f(u) ≠ 1 for 0 < |u| ≤ b. Then on |u| ≤ b, the measure with density Rl(1/(1 − rf(u))) converges weakly as r ↑ 1 to the measure with density Rl(1/(1 − f(u))), plus a point mass of amount π/EX_1 at u = 0. Also, the integral of |Rl(1/(1 − f(u)))| is finite on |u| ≤ b.

This lemma is really the core of the proof. Accept it for the moment. In the nonlattice case, f(u) ≠ 1 for every u ≠ 0. Thus, the limit on the right-hand side of (10.10) exists and

Apply the Riemann-Lebesgue lemma to the integral on the right, getting


The inversion formula (8.40) yields

which proves the theorem in the nonlattice case. In the lattice case, say d = 1, put

Since y ∈ L_1, use the notation μ_n instead of μ_y. Then (10.10) becomes, since f(u) is periodic,

Furthermore, since L_1 is the minimal lattice for X_1, f(u) ≠ 1 on 0 < |u| ≤ π. Apply Lemma 10.11 again:

The Riemann-Lebesgue lemma gives

Now look:


By the inversion formula for Fourier series, taking h(u) so that Σ_m |h(m)| < ∞,

finishing the proof in the lattice case.

Now for the proof of the lemma. Let φ(u) be any real-valued continuous function vanishing for |u| > b, such that sup |φ| ≤ 1. Consider


For any ε > 0, the integrand converges boundedly to zero on (−ε, +ε)^c as r ↑ 1. Consider the integral over (−ε, +ε). The term in brackets equals

Using |1 − rf| ≥ 1 − r, we find that the integral above containing the second term of (10.12) over (−ε, +ε) is dominated by

Assuming that |Rl(1/(1 − f))| has a finite integral, we can make this term arbitrarily small by selection of ε. The function (1 − f̄)/(1 − f) is continuous for 0 < |u| ≤ b. Use the Taylor expansion,

where δ(u) → 0 as u → 0, to conclude that the limit of (1 − f̄)/(1 − f) exists and equals −1 as u → 0. Define its value at zero to be −1, making (1 − f̄)/(1 − f) a continuous function on |u| ≤ b. Then the integral containing the first term of (10.12) is given by

where g(u) is a continuous function such that g(0) = φ(0). Denote m = EX_1. Use the Taylor expansion again to see that for any A > 0, ε can be selected so that for |u| < ε

Combine this with

to conclude that the limit of (10.13) is πφ(0)/m. The last fact we need now is the integrability of

on |u| ≤ b. Since |1 − f(u)|² ≥ m²u²/2 in some neighborhood of the origin, it is sufficient to show that Rl(1 − f(u))/u² has a finite integral. Write


so that

where the orders of integration have been interchanged, and

4. A LOCAL CENTRAL LIMIT THEOREM

For fair coin-tossing, a combinatorial argument gives P(S_2n = 0) ~ 1/√(πn), a result that has been useful. What if the X_1, X_2, ... are integer-valued, or distributed on the lattice L_d? More generally, what about estimating P(S_n ∈ I)? Look at this optimistic argument:

if EX_1 = 0, EX_1² = σ² < ∞. By Problem 4, Chapter 8, the supremum of the difference in (10.14) goes to zero. Thus by substituting y = σ√n x,

Substitute in the integral, and rewrite as

If the convergence is so rapid that the difference in (10.15) is o(1/√n), then multiplying by √n gets us

The integrand goes uniformly to one, so rewrite this as

This gives the surprising conclusion that estimates like P(S_2n = 0) ~ 1/√(πn) may not be peculiar to fair coin-tossing, but may hold for all sums of independent, identically distributed random variables with zero means and


finite variance. The fact that this delicate result is true for most distributions is a special case of the local central limit theorem. It does not hold generally; look at coin-tossing again: P(S_n = 0) = 0 for n odd. The next definition restricts attention to those X_1, X_2, ... such that the sums do not have an unpleasant periodicity.
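The optimistic argument above can be checked exactly, without simulation, by convolving a lattice distribution with itself. The sketch below is a modern numerical aside; the step distribution, uniform on {−1, 0, 1}, is an assumption chosen because it has no periodicity problem (it is centered lattice on L_1 with σ² = 2/3).

```python
import numpy as np

# Exact computation of P(S_n = 0) by repeated convolution, compared with
# the local-limit prediction sqrt(n) P(S_n = 0) -> 1/(sigma sqrt(2 pi)).
n = 400
p = np.array([1/3, 1/3, 1/3])   # law of X on the points {-1, 0, 1}
dist = np.array([1.0])          # law of S_0 = 0
for _ in range(n):
    dist = np.convolve(dist, p) # law of S_1, S_2, ..., S_n in turn
p0 = dist[n]                    # support is -n..n, so index n is the point 0
sigma = np.sqrt(2/3)
limit = 1/(sigma*np.sqrt(2*np.pi))
print(np.sqrt(n)*p0, limit)
```

At n = 400 the two numbers already agree to about three decimal places, which is the delicate 1/√n behavior the section is after.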

Definition 10.16. A random variable X is called a centered lattice random variable if there exists d > 0 such that P(X ∈ L_d) = 1, and there are no d' > d and a such that P(X ∈ a + L_d') = 1. X is called centered nonlattice if there are no numbers a and d > 0 such that P(X ∈ a + L_d) = 1.

For example, a random variable X with P(X = 1) = P(X = 3) = 1/2 is not centered lattice, because L_1 is the minimal lattice with P(X ∈ L_1) = 1, but P(X ∈ 1 + L_2) = 1.

As before, let l_d assign mass d to every point of L_d, and let l_0 denote Lebesgue measure on L_0.

Theorem 10.17. Let X_1, X_2, ... be independent, identically distributed random variables, either centered lattice on L_d or centered nonlattice on L_0, with EX_1 = 0, EX_1² = σ² < ∞. Then for any finite interval I,

(10.18) σ√(2πn) P(S_n ∈ I) → l_d(I).

Proof. In stages. First of all, if X is centered nonlattice, then by Problem 21, Chapter 8, |f(u)| < 1 for u ≠ 0. If X is centered lattice on L_d, then f(u) has period 2π/d and the only points at which |f(u)| = 1 are {2πk/d}, k = 0, ±1, ... Eq. (10.18) is equivalent to the assertion that the measures μ_n defined on ℬ_1 by

converge weakly to the measure l_d. The plan of this proof is to show that for every h(x) ∈ ℋ

and then to apply Theorem 10.7. Now to prove (10.19): Suppose first that |f(u)| ≠ 1 on J − {0},

J some finite closed interval, and that h(u) vanishes on the complement of J. Write

where Fn is the distribution function of Sn. Then


From 8.44, f(u) = 1 − (σ²u²/2)(1 + δ(u)), where δ(u) → 0 as u → 0. Take N = (−b, +b) so small that on N, |δ(u)| < 1/2 and σ²u² < 1. On J − N, |f(u)| ≤ 1 − β, 0 < β < 1. Letting ‖h‖ = sup |h|, we get

(10.20) Eh(S_n) = ∫_N [f(u)]^n h(u) du + θ_n ‖h‖ (1 − β)^n, |θ_n| ≤ |J|.

OnN,

By the substitution u = v/√n,

By (10.21) the integrand on the right is dominated for all v by the integrable function

But just as in the central limit theorem, [f(v/√n)]^n → e^{−σ²v²/2}. Since h(v/√n) → h(0), use the dominated convergence theorem for

Use (10.20) to get

When X is centered nonlattice, this holds for all finite J. Furthermore, the Fourier inversion theorem gives

By putting u = 0 we can prove the assertion. In the lattice case, assume d = 1 for convenience, so f(u) has period 2π. Let h̃(u) = Σ_k h(u + 2πk), so that

The purpose of this is that now |f(u)| ≠ 1 on [−π, +π] − {0}, so (10.22) holds in the form


Just as in the proof of the renewal theorem,

which proves the theorem in the lattice case. Q.E.D.

Problem 1. Under the conditions of this section, show that

uniformly for x in bounded subsets of Ld. [See Problem 4, Chapter 8.]

5. APPLYING A TAUBERIAN THEOREM

Let X_1, X_2, ... be centered lattice. Suppose we want to get information concerning the distribution of T, the time of the first zero of the sums S_1, S_2, ... From (7.42), for all n ≥ 0,

with the convention S_0 = 0. Multiply this equation by r^n, 0 < r < 1, and sum from n = 0 to ∞,

The local limit theorem gives

blows up. Suppose we can use the asymptotic expression for P(S_n = 0) to get the rate at which P(r) blows up. Since

is given by

we have information regarding the rate at which T(r) → ∞ when r → 1. Now we would like to reverse direction and get information about P(T > n) from the rate at which T(r) blows up. The first direction is called an Abelian argument; the reverse and much more difficult direction is called a Tauberian argument.


To get some feeling for what is going on, consider

Put r = e^{−s}. For s small, a good approximation to φ(r) is given by the integral

The integral is the gamma function Γ(a + 1). Since s ~ 1 − r as r ↑ 1, this can be made to provide a rigorous proof of

(10.23) φ(r) ~ Γ(a + 1)/(1 − r)^{a+1}.

Now suppose (10.23) holds for φ(r) = Σ_{n=0}^∞ a_n r^n. Can we reverse and conclude that a_n ~ n^a? Not quite; what is true is the well-known theorem:

Theorem 10.24. Let φ(r) = Σ_{n=0}^∞ a_n r^n, a_n ≥ 0, n = 0, 1, ... Then as n → ∞,

if and only if, as r ↑ 1,

For a nice proof of this theorem, see Feller [59, Vol. II, p. 423]. The implication from the behavior of φ(r) to that of a_1 + ⋯ + a_n is the hard part and is a special case of Karamata's Tauberian theorem [85].
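The easy (Abelian) direction is simple to see numerically. The sketch below (a modern illustration; the choice a = 1/2 is an assumption) sums φ(r) = Σ n^a r^n for r near 1 and compares (1 − r)^{a+1} φ(r) with Γ(a + 1), as in (10.23).

```python
import math

# Abelian side of the discussion around Theorem 10.24, with a_n = n^(1/2):
# as r -> 1, (1 - r)^(3/2) * sum n^(1/2) r^n should approach Gamma(3/2).
a = 0.5
r = 0.999
s = sum(n**a * r**n for n in range(1, 200000))   # truncation error is negligible
print((1 - r)**(a + 1) * s, math.gamma(a + 1))   # both near sqrt(pi)/2 = 0.886...
```

At r = 0.999 the two values agree to about three decimals; the error shrinks like 1 − r, matching the asymptotic statement.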

We use the easy part of this theorem to show that

This follows immediately from the local limit theorem, the fact that Γ(1/2) = √π, and

[Approximate the sums above and below by integrals of 1/√x and 1/√(x + 1), respectively.]

Of course, this gives


Theorem 10.25

Proof. By Karamata's Tauberian theorem, we get

Since the p_n are nonincreasing, for any m ≤ n write

Divide by √n, put m = [λn], 0 < λ < 1, and let n → ∞ to get

Let λ → 1; then (1 − √λ)/(1 − λ) → 1/2, so √n p_n → c/2.
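The conclusion of 10.25, that √n P(T > n) settles down to a positive constant, can be watched in a simulation. This is an illustrative aside; the walk with steps uniform on {−1, 0, 1} is an assumed example of a centered lattice distribution, and the constant suggested in the comment is a back-of-the-envelope value, not taken from the text.

```python
import numpy as np

# For the centered lattice walk with steps uniform on {-1, 0, 1}, T is the
# first n >= 1 with S_n = 0.  Theorem 10.25 says sqrt(n) P(T > n) -> c/2 > 0.
rng = np.random.default_rng(4)
trials, n = 4000, 400
steps = rng.integers(-1, 2, size=(trials, n))   # values in {-1, 0, 1}
S = steps.cumsum(axis=1)
survived = (S != 0).all(axis=1)                 # paths with T > n
val = np.sqrt(n) * survived.mean()
print(val)   # a rough calculation suggests a limit near 2/sqrt(3*pi) ~ 0.65
```

The point of the check is only that the normalized survival probability is of order one, neither dying nor blowing up, as the √n rate predicts.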

Problem 2. Let N_n(0) be the number of times that the sums S_1, ..., S_n visit the state zero. Use the theory of stable laws and 10.25 to find constants A_n ↑ ∞ such that N_n(0)/A_n → X in distribution, where X is nondegenerate.

6. OCCUPATION TIMES

One neat application of the local central limit theorem is to the problem: Given a finite interval I and S_1, S_2, ... sums of independent, identically distributed random variables X_1, X_2, ... such that EX_1 = 0, EX_1² = σ² < ∞, take S_0 = 0. Let N_n be the number of visits of S_0, S_1, ..., S_n to the interval I. Is there a normalization such that N_n/A_n → X in distribution, X nondegenerate? N_n is the amount of time in the first n trials that the sums spend in the interval I, hence the name "occupation-time problem." Let X_1, X_2, ... be centered lattice or nonlattice, so the local limit theorem is in force. To get the size of A_n, compute

Hence we take A_n = √n. The way from here is the method of moments. Let

By combining all permutations of the same indices j_1, ..., j_k we get

and


Define transition probabilities on ℬ_1 by

So

where x0 = 0. If

where p^(0)(B | x) is the point mass concentrated on {x}, and 0 < r < 1, then, defining n_k(−1) = 0,

Proposition 10.27

for x ∈ L_d as r ↑ 1.

Proof. By Problem 1 of this chapter,

for x ∈ L_d. Hence for all x ∈ I ∩ L_d, I a finite interval, and for any ε > 0, there exists n_0 such that for n > n_0,

By 10.24 we get uc convergence of √(1 − r) Σ_n p^(n)(I | x) r^n to Γ(1/2) l_d(I)/(σ√(2π)).

Proposition 10.28

Proof. On the right-hand side of (10.26), look at the first factor of the integrand multiplied by √(1 − r), namely

This converges uniformly for x_{k−1} ∈ I ∩ L_d to a constant. Hence we can pull it out of the integral. Now continue this way to get the result above.

With 10.28 in hand, apply Karamata's Tauberian theorem to conclude that


This almost completes the proof of

Theorem 10.30

where

Proof. By (10.29),

Let

Use integration by parts to show that

and deduce from this that

Proposition 8.49 implies that the distribution here is uniquely determined by the moments. Therefore Theorem 8.48 applies. Q.E.D.

NOTES

The renewal theorem in the lattice case was stated and proved by Erdös, Feller, and Pollard [49] in 1949. Chung later pointed out that in this case the theorem follows from the convergence theorem for countable Markov chains due to Kolmogorov (see Chapter 7, Section 7). The theorem was gradually extended and in its present form was proved by Blackwell [6, 7]. New proofs continue to appear. There is an interesting recent proof due to Feller, see [59, Vol. II], which opens a new aspect of the theorem. The method of proof we use is adapted from Feller and Orey [60, 1961]. Charles Stone [132] by similar methods gets very accurate estimates for the rate of convergence of H(B + x) as x → ∞. A good exposition of the state of renewal theory as of 1958 is given by W. L. Smith [127]. One more result that is usually considered to be part of the renewal theorem concerns the case EX = ∞. What is known here is that if the sums S_0, S_1, ... are transient, then H(I + y) → 0 as y → ±∞ for all finite intervals I. Hence, in particular, if one of EX^+, EX^− is finite, this result holds (see Feller [59, Vol. II, pp. 368 ff]).

Local limit theorems for lattice random variables have a long history. The original proof of the central limit theorem for coin-tossing was gotten by first estimating P(S_n = j) and thus used a local limit theorem to prove the


tendency toward N(0, 1). More local limit theorems for the lattice case are in Gnedenko and Kolmogorov [62]. Essentially the theorem given in the text is due to Shepp [122]; the method of proof follows [133]. In its form in the text, the central limit theorem is not a consequence of the local theorem. But by very similar methods somewhat sharper results can be proved. For example, in the centered nonlattice case, for any interval I, let

Stone proves that

and the central limit theorem for centered nonlattice variables follows fromthis.

The occupation time theorem 10.30 was proven for normally distributed random variables by Chung and Kac [19], and in general by Kallianpur and Robbins [84]. Darling and Kac [23] generalized their results significantly and simplified the proof by adding the Tauberian argument.

The occupation time problem for an infinite interval, say (0, ∞), is considerably different. Then N_n becomes the number of positive sums among S_1, ..., S_n. The appropriate normalizing factor is n, and the famous arc sine theorem states

See Spitzer's book [130], Chapter 4, for a complete discussion in the latticecase, or Feller [59, Vol. II, Chapter 12], for the general case.
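The arc sine limit quoted above is easy to visualize numerically. The sketch below is a modern aside; continuous uniform steps are an illustrative assumption (they avoid ties at zero), and the evaluation point x = 1/4 is arbitrary.

```python
import numpy as np

# Arc sine law: for a mean-zero walk, the fraction of positive sums among
# S_1, ..., S_n satisfies P(N_n/n <= x) -> (2/pi) arcsin(sqrt(x)).
rng = np.random.default_rng(5)
trials, n = 5000, 500
S = rng.uniform(-1.0, 1.0, size=(trials, n)).cumsum(axis=1)
frac = (S > 0).mean(axis=1)          # N_n / n along each sample path
x = 0.25
emp = (frac <= x).mean()
print(emp, 2/np.pi*np.arcsin(np.sqrt(x)))   # both near 1/3
```

The striking feature, visible already at n = 500, is how much mass sits near 0 and 1: a typical path spends most of its time on one side of the axis.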


CHAPTER 11

MULTIDIMENSIONAL CENTRAL LIMIT THEOREM AND GAUSSIAN PROCESSES

1. INTRODUCTION

Suppose that the objects under study are a sequence X_n of vector-valued variables (X_1^(n), ..., X_k^(n)) where each X_j^(n), j = 1, ..., k, is a random variable. What is a reasonable meaning to attach to

where X = (X_1, ..., X_k)? Intuitively, the meaning of convergence in distribution is that the probability that X_n is in some set B ∈ ℬ_k converges to the probability that X is in B, that is,

But when we attempt to make this hold for all B ∈ ℬ_k, difficulty is encountered in that X may be degenerate in part. In the one-dimensional case, the one-point sets to which the limit X assigned positive probability gave trouble and had to be excluded. In general, what can be done is to require that

for all sets B ∈ ℬ_k such that P(X ∈ bd(B)) = 0. The definition we use is directed at the problem from a different but equivalent angle. Let C_0 be the class of all continuous functions on R^(k) vanishing off of compact sets.

Definition 11.1. The k-vectors X_n converge in distribution (or in law) to X if for every f(x) ∈ C_0,

This is written as X_n → X in distribution. In terms of distribution functions, F_n → F if and only if


By considering continuous functions equal to one on some compact set and vanishing on the complement of some slightly larger compact set, conclude that if F_n → F and F_n → G, then F = G. Define as in the one-dimensional case:

Definition 11.2. Let N_k be the set of all distribution functions on R^(k). A set C ⊂ N_k is mass-preserving if for any ε > 0, there is a compact set A such that

If F_n → F, then {F_n} is mass-preserving. From this, conclude that F_n → F if and only if for every bounded continuous function f(x) on R^(k),

For any rectangle S such that F(bd(S)) = 0, approximate χ_S above and below by continuous functions to see that F_n(S) → F(S). Conversely, approximate ∫ f dF_n by Riemann sums over rectangles such that F(bd(S)) = 0 to conclude that F_n → F is equivalent to

There are plenty of rectangles whose boundaries have probability zero, because if P(X ∈ S) = P(X_1 ∈ I_1, ..., X_k ∈ I_k), then

By the same approximation as in 8.12, conclude that for f(x) bounded, ℬ_k-measurable, and with its discontinuity set having F-measure zero,

From this, it follows that if B is in ℬ_k and F(bd(B)) = 0, then F_n(B) → F(B).

Problem 1. Is F_n → F equivalent to requiring

at every continuity point (x_1, ..., x_k) of F? Prove or disprove.

2. PROPERTIES OF N_k

The properties of k-dimensional probability measures are very similar to those of one-dimensional probabilities. The results are straightforward generalizations, and we deal with them sketchily. The major result is the generalization of the Helly-Bray theorem.

Theorem 11.3. Let {F_n} ⊂ N_k be mass-preserving. Then there exists a subsequence F_{n_m} converging in distribution to some F ∈ N_k.


Proof. Here is a slightly different proof that opens the way for generalizations. Take {f_j} ⊂ C_0 to be dense in C_0 in the sense of uniform convergence. To verify that a countable set can be gotten with this property, look at the set of all polynomials with rational coefficients, then consider the set gotten by multiplying each of these by a function h_N which is one inside the k-sphere of radius N and zero outside the sphere of radius N + 1. Use diagonalization again; for every j, let I_j be an ordered subset of the positive integers such that ∫ f_j dF_n converges as n runs through I_j, and I_{j+1} ⊂ I_j. Let n_m be the mth member of I_m; then the limit of ∫ f_j dF_{n_m} exists for all j. Take f ∈ C_0, |f − f_j| < ε, so that

hence lim ∫ f dF_{n_m} exists for all f ∈ C_0. Because {F_n} is mass-preserving, lim ∫ f dF_{n_m} = J(f) exists for all bounded continuous f. Denote the open rectangle {(x_1, ..., x_k); x_1 < y_1, ..., x_k < y_k} by S(y). Take g_n bounded and continuous such that g_n ↑ χ_{S(y)}. Then J(g_n) is nondecreasing; call the limit F(y_1, ..., y_k). The rest of the proof is the simple verification that

F(y_1, ..., y_k) is a distribution function, and that if F(bd(S(y))) = 0, then

Define sets ℰ of N_k-separating functions as in 8.14; prove as in 8.15 and 8.16 that {F_n} mass-preserving, ∫ f dF_n convergent for all f ∈ ℰ, imply the existence of an F ∈ N_k such that F_n → F. Obviously, one sufficient condition for a class of functions to be N_k-separating is that they be dense in C_0 under uc convergence of uniformly-bounded sequences of functions.

Theorem 11.4. The set of complex exponentials of the form

is N_k-separating.

Proof. The point here is that for any f ∈ C_0, we can take n so large that f(x) = 0 on the complement of S_n = {x; x_j ∈ [−n, +n], j = 1, ..., k}. Now approximate f(x) uniformly on S_n by sums of terms of the form exp[πi(m_1x_1 + ⋯ + m_kx_k)/n], m_1, ..., m_k integers. The rest goes through as in Theorem 8.24.

Definition 11.5. For u, x ∈ R^(k), let (u, x) = Σ_j u_jx_j, and define the characteristic function of the k-vector X = (X_1, ..., X_k) as f_X(u) = Ee^{i(u,X)}, or of the distribution function F(x) as

The continuity theorem holds.


Theorem 11.6. For F_n ∈ N_k having characteristic functions f_n(u), if

a) lim_n f_n(u) exists for every u ∈ R^(k),

b) lim_n f_n(u) = h(u) is continuous at the origin,

then there is a distribution function F ∈ N_k such that F_n → F and h(u) is the characteristic function of F.

Proof. The only question at all is the analog of inequality 8.29. But this issimple. Observe that

where S_u = {|x_1| < 1/u, ..., |x_k| < 1/u}. For any function g(v_1, ..., v_k) on R^(k) define T_jg to be the function

Then

where f is the characteristic function of F. The function T_k ⋯ T_1 f(v) is continuous and is zero at the origin. Write the inequality as

Now, f_n(u) → h(u) implies

and everything goes through.

Problems

2. If X_1^(n), ..., X_k^(n) are independent for every n, and if for j fixed,

prove that

and X_1, ..., X_k are independent.


3. Let X^(n) → X in distribution and φ_1, ..., φ_m be continuous functions on R^(k). Show that

4. Show that the conclusion of Problem 3 remains true if φ_k, k = 1, ..., m, continuous is replaced by φ_k(x), k = 1, ..., m, a.s. continuous with respect to the distribution on R^(k) given by X = (X_1, ..., X_k).

3. THE MULTIDIMENSIONAL CENTRAL LIMIT THEOREM

To use the continuity theorem, estimates on/(u) are needed. Write

Proposition 11.7. Let E‖X‖^n < ∞; then

where δ(u) → 0 as u → 0.

Proof. Write

where

θ_1, θ_2 real and |θ_1| ≤ 1, |θ_2| ≤ 1. By the Schwarz inequality we have |(u, x)| ≤ ‖u‖ ‖x‖; thus

The integrand is dominated by the integrable function 3‖X‖^n, and φ(u, X) → 0 as u → 0 for all ω. Apply the bounded convergence theorem to get the result.

Definition 11.8. Given a k-vector (X_1, ..., X_k), EX_j = 0, j = 1, ..., k, define the k × k covariance matrix Γ by

Definition 11.9. The vector X, EX_j = 0, j = 1, ..., k, is said to have a joint normal distribution N(0, Γ) if


Theorem 11.10. Let X_1, X_2, ... be independent k-vectors having the same distribution with zero means and finite covariance matrix Γ. Then

Proof. E(exp i[(u, X_1 + ⋯ + X_n)/√n]) = [E exp(i(u, X)/√n)]^n, where X has the same distribution as X_1, X_2, ... By assumption, EX_j² < ∞, j = 1, ..., k, where X = (X_1, ..., X_k). Thus E‖X‖² < ∞. By Proposition 11.7,

so

Once more, as long as second moments exist, the limit is normal. There are analogs here of Theorem 9.2 for the nonidentically distributed case which involve bounds on E‖X_j‖³. But we leave this; the tools are available for as many of these theorems as we want to prove.
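A quick numerical illustration of Theorem 11.10 (a modern aside; the step distribution, built from uniform variables, is an arbitrary assumption with the covariance matrix chosen as Γ = [[1, 1/2], [1/2, 1]]): the normalized sums should look jointly normal with that Γ, which can be checked both through the empirical covariance and through an orthant probability of the limiting normal.

```python
import numpy as np

# Sums of iid 2-vectors with EX_1 = EX_2 = 0, EX_1^2 = EX_2^2 = 1,
# EX_1 X_2 = 1/2, normalized by sqrt(n).
rng = np.random.default_rng(6)
trials, n = 20000, 100
U = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(trials, n))  # mean 0, var 1
V = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(trials, n))
X1 = U
X2 = 0.5*U + np.sqrt(0.75)*V        # var 1, covariance with X1 equal to 1/2
S = np.stack([X1.sum(axis=1), X2.sum(axis=1)], axis=1) / np.sqrt(n)
emp_cov = S.T @ S / trials
# For the limiting N(0, Gamma) with correlation rho = 1/2,
# P(S_1 < 0, S_2 < 0) = 1/4 + arcsin(rho)/(2 pi) = 1/3.
emp_p = ((S[:, 0] < 0) & (S[:, 1] < 0)).mean()
print(emp_cov)
print(emp_p)   # near 1/3
```

The orthant probability 1/3 is a genuinely two-dimensional check: it depends on the joint normal shape, not just on the two marginals.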

4. THE JOINT NORMAL DISTRIBUTION

The neatest way of defining a joint normal distribution is

Definition 11.11. Say that Y = (Y_1, ..., Y_n), EY_j = 0, has a joint normal (or joint Gaussian) distribution with zero means if there are k independent random variables X = (X_1, ..., X_k), each with N(0, 1) distribution, and a k × n matrix A such that

Y = XA.

Obviously, the matrix A and vector X are not unique. Say that a set Z_1, ..., Z_l of random variables is linearly independent if there are no real numbers a_1, ..., a_l, not all zero, such that

Then note that the minimal k, such that there is an A, X as in 11.11 with Y = XA, is the maximum number of linearly independent random variables in the set Y_1, ..., Y_n. If

then


so the minimum k is also given by the rank of the covariance matrix Γ. Throughout this section, take the joint normal distribution with zero means to be defined by 11.11. We will show that it is equivalent to 11.9.

If Y = XA, then the covariance matrix of Y is given by

So

This is characteristic of covariance matrices.

Definition 11.12. Call a square matrix M symmetric if m_ij = m_ji, nonnegative definite if aMa* ≥ 0 for all vectors a, where a* denotes the transpose of a.

Proposition 11.13. An n × n matrix Γ is the covariance matrix of some set of random variables Y_1, ..., Y_n if and only if

1) Γ is symmetric nonnegative definite, equivalently,

2) there is a matrix A such that

Proof. One way is easy. Suppose Γ_ij = EY_iY_j; then obviously Γ is symmetric, and Σ_ij a_iΓ_ij a_j = E(Σ_i a_iY_i)² ≥ 0. For the converse, we start with the well-known result (see [63], for example) that for Γ symmetric and nonnegative definite, there is an orthogonal matrix O such that

where D is diagonal with nonnegative elements. Then taking B to be diagonal with diagonal elements the square roots of the corresponding elements of D gives B*B = D, so that

Take A = BO; then Γ = A*A. Now take X with components independent N(0, 1) variables and Y = XA to get the result that the covariance matrix of Y is Γ.
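The construction in the converse, diagonalize Γ, take square roots, and multiply back, is concrete enough to run. The sketch below is an illustrative aside (the particular Γ is an assumption); it factors Γ = A*A via the eigendecomposition and then verifies that Y = XA with standard normal X has the prescribed covariance, matching the book's row-vector convention.

```python
import numpy as np

# Factor Gamma = A^t A as in Proposition 11.13, then generate Y = X A.
Gamma = np.array([[2.0, 1.0],
                  [1.0, 2.0]])
w, O = np.linalg.eigh(Gamma)        # Gamma = O diag(w) O^t, w >= 0
A = np.diag(np.sqrt(w)) @ O.T       # so A^t A = O diag(w) O^t = Gamma
assert np.allclose(A.T @ A, Gamma)

rng = np.random.default_rng(1)
X = rng.standard_normal((200000, 2))  # rows are independent N(0,1) vectors
Y = X @ A
emp = Y.T @ Y / len(Y)                # empirical covariance of Y
print(emp)                            # close to Gamma
```

The same factorization handles degenerate Γ: a zero eigenvalue simply produces a row of A carrying no randomness, in line with the rank discussion above.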

If Y has a joint normal distribution, Y = XA, then

Since the X_k are independent N(0, 1),


Hence the Y vector has a joint normal distribution in the sense of 11.9. Furthermore, if Y has the characteristic function exp(−(1/2)uΓu*), then by finding A such that Γ = A*A, taking X to have independent N(0, 1) components, and setting Y' = XA, we get Y' ≅ Y.

We can do better. Suppose Y has characteristic function exp(−(1/2)uΓu*). Take O to be an orthogonal matrix such that OΓO* = D is diagonal. Consider the vector Z = YO.

So

Thus, the characteristic function of Z splits into a product, the Z_1, Z_2, ... are independent by 8.33, and Z_j is N(0, d_jj). Define X_j = 0 if d_jj = 0, otherwise X_j = Z_j/√d_jj. Then the nonzero X_j are independent N(0, 1) variables, and there is a matrix A such that Y = XA.

A fresh memory of linear algebra will show that all that has been done here is to get an orthonormal basis for the functions Y_1, ..., Y_n. This could be done for any set of random variables Y_1, ..., Y_n, getting random variables X_1, ..., X_k such that EX_i² = 1, EX_iX_j = 0, i ≠ j. But the variables Y_1, ..., Y_n, having a joint normal distribution and zero means, have their distribution completely determined by Γ, with the pleasant and unusual property that EY_iY_j = 0 implies Y_i and Y_j are independent. Furthermore, if I_1 = (i_1, ..., i_k), I_2 = (j_1, ..., j_m) are disjoint subsets of (1, ..., n) and Γ_ij = 0 for i ∈ I_1, j ∈ I_2, then it follows that (Y_{i_1}, ..., Y_{i_k}) and (Y_{j_1}, ..., Y_{j_m}) are independent vectors.

We make an obvious extension to nonzero means by

Definition 11.14. Say that Y = (Y_1, ..., Y_n), with m = (EY_1, ..., EY_n), has a joint normal distribution N(m, Γ) if (Y_1 − EY_1, ..., Y_n − EY_n) has the joint normal distribution N(0, Γ).

Also,

Definition 11.15. If Y = (Y_1, ..., Y_n) has the distribution N(0, Γ), the distribution is said to be nondegenerate if the Y_1, ..., Y_n are linearly independent, or equivalently, if the rank of Γ is n.

Problems

5. Show that if Y_1, ..., Y_n are N(0, Γ), the distribution is nondegenerate if and only if there are n independent random variables X_1, ..., X_n, each N(0, 1), and an n × n matrix A such that det(A) ≠ 0 and Y = XA.


6. Let Y_1, ..., Y_n have a joint normal nondegenerate distribution. Show that their distribution function has a density f(y) given by

7. Show that if the random variables Y_1, ..., Y_n have the characteristic function exp(−(1/2)uHu*) for some n × n matrix H, then H is the covariance matrix of Y and EY_j = 0, j = 1, ..., n.

5. STATIONARY GAUSSIAN PROCESS

Definition 11.16. A double-ended process ..., X_{−1}, X_0, X_1, ... is called Gaussian if every finite subset of variables has a joint normal distribution.

Of course, this assumes that E|X_n| < ∞ for all n. When is a Gaussian zero-mean process X stationary? Take Γ(m, n) = EX_mX_n. Since the distribution of the process is determined by Γ(m, n), the condition should be on Γ(m, n).

Proposition 11.17. X is stationary if and only if

Proof. If X is stationary, then EX_mX_n = EX_{m−n}X_0 = Γ(m − n, 0). Conversely, if true, then the characteristic function of X_1, ..., X_n is

But this is exactly the characteristic function of X_{1+m}, ..., X_{n+m}. Use the notation (loose)

and call Γ(n) the covariance function. Call a function H(n) on the integers nonnegative definite if for I any finite subset of the integers and a_j, j ∈ I, any real numbers,

Clearly a covariance function is nonnegative definite. Just as in the finite case, given any symmetric nonnegative definite function H(n) on the integers, we can construct a stationary zero-mean Gaussian process such that EX_mX_n = H(m − n).

How can we describe the general stationary Gaussian process? To do this neatly, generalize a bit. Let ..., X_{−1}, X_0, X_1, ... be a process of complex-valued functions X_j = U_j + iV_j, where U_j, V_j are random variables, and EU_j = EV_j = 0, all j. Call it a complex Gaussian process if any finite


subset of the (U_n, V_n) have a joint normal distribution, and stationary if EX_mX̄_n = Γ(m − n). The covariance function of a complex Gaussian stationary process is Hermitian, Γ(−n) = Γ̄(n), and nonnegative definite in the sense that for a subset I of the integers and a_j complex numbers,

Consider a process that is a superposition of periodic functions with random amplitudes that are independent and normal. More precisely, let λ_1, ..., λ_k be real numbers (called frequencies), and define

where Z_1, ..., Z_k are independent N(0, σ_j²) variables. The {X_n} process is a complex Gaussian process. Further,

so the process is stationary. The functions e^{iλ_jn} are the periodic components with frequency λ_j, and we may as well take λ_j ∈ [−π, +π). The formula (11.18) can be thought of as representing the {X_n} process by a sum over frequency space, λ ∈ [−π, +π). The main structural theorem for complex normal stationary processes is that every such process can be represented as an integral over frequency space, where the amplitudes of the various frequency components are, in a generalized sense, independent and normally distributed.
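The finite superposition (11.18) can be simulated directly. The sketch below is a modern aside; the three frequencies and variances are arbitrary assumptions. It draws independent complex normal amplitudes Z_j with variances σ_j² and checks that the empirical covariance EX_mX̄_n matches Σ_j σ_j² e^{iλ_j(m−n)}, i.e., that the covariance depends only on m − n.

```python
import numpy as np

# X_n = sum_j Z_j exp(i lambda_j n) with independent complex normal Z_j.
rng = np.random.default_rng(2)
lam = np.array([-1.0, 0.3, 2.0])     # frequencies in [-pi, pi)
sig2 = np.array([1.0, 0.5, 2.0])     # component variances sigma_j^2
trials = 200000
# complex normal amplitudes: independent real and imaginary parts,
# scaled so that E |Z_j|^2 = sigma_j^2
Z = (rng.standard_normal((trials, 3)) + 1j*rng.standard_normal((trials, 3))) \
    * np.sqrt(sig2/2)
n = np.arange(6)
X = Z @ np.exp(1j*np.outer(lam, n))  # X[t, n] = sum_j Z_j e^{i lambda_j n}
emp = (X[:, 3] * np.conj(X[:, 1])).mean()   # estimate of Gamma(3 - 1)
theory = (sig2 * np.exp(2j*lam)).sum()
print(emp, theory)
```

Here the spectral distribution F of the lemma below is just the discrete measure putting mass σ_j² at λ_j; the representation theorem replaces this finite sum by an integral against a random measure.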

6. SPECTRAL REPRESENTATION OF STATIONARY GAUSSIAN PROCESSES

The main tool in the representation theorem is a representation result forcovariance functions.

Herglotz Lemma 11.19 [70]. Γ(n) is a Hermitian nonnegative definite function on the integers if and only if there is a finite measure F(B) on ℬ_1([−π, +π)) such that

Proof. One direction is quick. If


then Γ(n) is Hermitian, and

To go the other way, following Loève [108, p. 207], define

Multiply both sides by e^{ilλ} for −n + 1 ≤ l ≤ n − 1, and integrate over [−π, +π] to get

Define F_n(dλ) as the measure on [−π, +π] with density f_n(λ), and take n' a subsequence such that F_{n'} → F in distribution. Then

Now take the mass on the point {π} and put it on {−π} to complete the proof.

Note that the functions {e^{inλ}} are separating on [−π, π); hence F(dλ) is uniquely determined by Γ(n). For Γ(n) the covariance function of a complex Gaussian stationary process X, F(dλ) is called the spectral distribution function of X. To understand the representation of X, an integral with respect to a random measure has to be defined.

Definition 11.20. Let {Z(λ)} be a noncountable family of complex-valued random variables on (Ω, ℱ, P) indexed by λ ∈ [−π, +π). For I an interval [λ_1, λ_2), define Z(I) = Z(λ_2) − Z(λ_1). If the Riemann sums Σ_k f(λ_k)Z(I_k), with I_1, ..., I_n a disjoint partition of [−π, +π) into intervals left-closed, right-open, λ_k ∈ I_k, converge in the second mean to the same random variable for any sequence of partitions such that max_k |I_k| → 0, denote this limit random variable by ∫ f(λ)Z(dλ).

Now we can state


Theorem 11.21. Let X be a complex Gaussian stationary process on (Ω, ℱ, P) with spectral distribution function F(dλ). Then there exists a family {Z(λ)} of complex-valued random variables on (Ω, ℱ, P) indexed by λ ∈ [−π, +π) such that

i) for any λ_1, ..., λ_m, Z(λ_1), ..., Z(λ_m) have a joint normal distribution,

ii) for I_1, I_2 disjoint,

iii) E|Z(I)|² = F(I) for all intervals I,

iv) X_n = ∫ e^{inλ}Z(dλ) a.s., all n.

Proof. The most elegant way to prove this is to use some elementary Hilbertspace arguments. Consider the space C(X) consisting of all finite linearcombinations

where the a_k are complex numbers. Consider the class of all random variables Y such that there exists a sequence Y_n ∈ C(X) with Y_n → Y in the second mean. On this class define an inner product (Y_1, Y_2) by EY_1Ȳ_2. Call random variables equivalent if they are a.s. equal. Then it is not difficult to check that the set of equivalence classes of random variables forms a complete Hilbert space L_2(X) under the inner product (Y_1, Y_2). Let L_2(F) be the Hilbert space of all complex-valued ℬ_1([−π, +π))-measurable functions f(λ) such that ∫ |f(λ)|² F(dλ) < ∞, under the inner product (f, g) = ∫ fḡ dF (take equivalence classes again).

To the element Xₙ ∈ L²(X), correspond the function e^{inλ}. Extend this correspondence linearly,

Σₖ aₖXₖ ↔ Σₖ aₖe^{ikλ}.

Let C(F) be the class of all finite linear combinations Σₖ aₖe^{ikλ}. Then the correspondence carries C(X) onto C(F), and since EXₖX̄ⱼ = Γ(k − j) = ∫ e^{i(k−j)λ} F(dλ), it preserves inner products: Y₁ ↔ f₁ and Y₂ ↔ f₂ imply (Y₁, Y₂) = (f₁, f₂).

If Yₙ ∈ C(X) and Yₙ → Y in the second mean, then Yₙ is Cauchy-convergent in the second mean; consequently so is the sequence fₙ ↔ Yₙ. Hence there is an f ∈ L²(F) such that fₙ → f in L²(F). Define Y ↔ f; this can be checked to give a one-to-one correspondence between L²(F) and L²(X), which is linear and preserves inner products.


The function χ_{[−π, ξ)}(λ) is in L²(F); let Z(ξ) be the corresponding element in L²(X). Now to check that the family {Z(ξ)} has the properties asserted.

Begin with (i). If real random variables Yₖ⁽ⁿ⁾ → Yₖ in the second mean, k = 1, …, m, and (Y₁⁽ⁿ⁾, …, Yₘ⁽ⁿ⁾) has a joint normal distribution for each n, then (Y₁, …, Yₘ) is joint normal. Because each element Γₖⱼ⁽ⁿ⁾ = EYₖ⁽ⁿ⁾Yⱼ⁽ⁿ⁾ of the covariance matrix converges to Γₖⱼ = EYₖYⱼ; hence the characteristic function of (Y₁⁽ⁿ⁾, …, Yₘ⁽ⁿ⁾) converges to exp(−½ Σ Γₖⱼuₖuⱼ), and this must be the characteristic function of (Y₁, …, Yₘ). Conclude that if Y₁, …, Yₘ are in L²(X), then their real and imaginary components have joint normal distributions. Thus for ξ₁, …, ξₘ in [−π, π), the real and imaginary components of Z(ξ₁), …, Z(ξₘ) have a joint normal distribution. For any interval I = [ξ₁, ξ₂), Z(I) ↔ χ_I(λ). Hence for I₁, I₂ disjoint,

EZ(I₁)Z̄(I₂) = (χ_{I₁}, χ_{I₂}) = 0.

Also,

E|Z(I)|² = ∫ |χ_I(λ)|² F(dλ) = F(I).

Lastly, take f(λ) to be a uniformly continuous function on [−π, π). For a partition of [−π, π) into disjoint intervals I₁, …, Iₙ, left-closed, right-open, and λₖ ∈ Iₖ,

Σₖ f(λₖ)Z(Iₖ) ↔ Σₖ f(λₖ)χ_{Iₖ}(λ).

The function on the right equals f(λₖ) on the interval Iₖ, and converges uniformly to f(λ) as maxₖ |Iₖ| → 0. So, in particular, it converges in the second mean to f(λ). If Y ↔ f(λ), then Σₖ f(λₖ)Z(Iₖ) → Y in the second mean. From Definition 11.20,

∫ f(λ)Z(dλ) = Y ↔ f(λ).

For f(λ) = e^{inλ}, the corresponding element is Xₙ; thus Xₙ = ∫ e^{inλ} Z(dλ) a.s. Q.E.D.
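The isomorphism just constructed can be exercised numerically. The sketch below is ours, not the text's: it picks an arbitrary spectral density f(λ) = 1 + cos λ (so F(dλ) = f(λ) dλ), approximates ∫ e^{inλ} Z(dλ) by a Riemann sum over a partition of [−π, π) with independent complex normal increments satisfying E|Z(Iₖ)|² = F(Iₖ), and checks by Monte Carlo that EX₁X̄₀ ≈ Γ(1) = ∫ e^{iλ} F(dλ).

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary spectral density on [-pi, pi): F(dlam) = (1 + cos(lam)) dlam
M = 4000                                     # cells in the partition of [-pi, pi)
lam = -np.pi + 2 * np.pi * (np.arange(M) + 0.5) / M
FI = (1.0 + np.cos(lam)) * (2 * np.pi / M)   # F(I_k) for each cell I_k

n = np.array([-1, 0, 1])
E = np.exp(1j * np.outer(n, lam))            # e^{i n lam_k}

reps = 2000
acc = 0.0
for _ in range(reps):
    # Z(I_k): independent complex Gaussian with E|Z(I_k)|^2 = F(I_k)
    Z = rng.normal(scale=np.sqrt(FI / 2), size=M) \
        + 1j * rng.normal(scale=np.sqrt(FI / 2), size=M)
    X = E @ Z                                # X_n ~ int e^{i n lam} Z(dlam)
    acc += X[2] * np.conj(X[1])              # X_1 * conj(X_0)
gamma1_mc = acc / reps

gamma1_exact = np.sum(np.exp(1j * lam) * FI)  # int e^{i lam} F(dlam)
print(gamma1_mc, gamma1_exact)
```

For this choice of f, Γ(1) = ∫ e^{iλ}(1 + cos λ) dλ = π, and the Monte Carlo estimate should agree with it to within sampling error.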

Proposition 11.22. If {Xₙ} is a real stationary Gaussian process, then the family {Z(λ)}, λ ∈ [−π, +π), has the additional properties: if {Z₁(λ)}, {Z₂(λ)} are the real and imaginary parts of the {Z(λ)} process, Z(λ) = Z₁(λ) + iZ₂(λ), then

i) For any two intervals I, J, EZ₁(I)Z₂(J) = 0,

ii) For any two disjoint intervals I, J, EZ₁(I)Z₁(J) = EZ₂(I)Z₂(J) = 0.


Proof. Write any intervals I, J as the union of the common part I ∩ J and nonoverlapping parts, and apply 11.21(ii) and (iii) to conclude that the imaginary part of EZ(I)Z̄(J) is zero. Therefore,

(11.23) EZ₁(I)Z₂(J) − EZ₂(I)Z₁(J) = 0.

Inspect the correspondence set up in the proof of 11.21 and notice that if Y = Σ aₖXₖ and Y ↔ f(λ), then Ȳ = Σ āₖXₖ corresponds to f̄(−λ). This extends to all of L²(X) and L²(F). Hence, since χ_I(λ) ↔ Z(I), then χ_{−I}(λ) ↔ Z̄(I), where −I = {λ; −λ ∈ I}. From this,

Z(−I) = Z̄(I), that is, Z₁(−I) = Z₁(I), Z₂(−I) = −Z₂(I).

Thus, if we change I to −I in (11.23), the first term remains the same, and the second term changes sign; hence both terms are zero. For I, J disjoint, EZ(I)Z̄(J) = 0 implies EZ₁(I)Z₁(J) = −EZ₂(I)Z₂(J). We can use the sign change again as above to prove that both sides are individually zero. Q.E.D.

For a real stationary Gaussian process with zero means, we can deduce from 11.22(i) that the processes {Z₁(λ)}, {Z₂(λ)} are independent in the sense that all finite subsets Z₁(λ₁), …, Z₁(λₙ), Z₂(λ₁′), …, Z₂(λₘ′) are independent. From 11.22(ii), we deduce that for I₁, …, Iₙ disjoint intervals, the random variables Z₁(I₁), …, Z₁(Iₙ) are independent. Similarly for Z₂(I₁), …, Z₂(Iₙ).

7. OTHER PROBLEMS

The fact that F(dλ) completely determines the distribution of a stationary Gaussian process with zero means leads to some compact results. For example, the process X is ergodic if and only if F(dλ) assigns no mass to any one-point sets [111].

The correspondence between L²(X) and L²(F) was exploited by Kolmogorov [96, 97] and independently by Wiener [143] in a fascinating piece of analysis that leads to the solution of the prediction problem. The starting point is this: the best predictor in a mean-square sense of X₁ based on X₀, X₋₁, … is E(X₁ | X₀, X₋₁, …). But for a Gaussian process, there are constants aₖ⁽ⁿ⁾ such that

E(X₁ | X₀, …, X₋ₙ) = Σₖ₌₀ⁿ aₖ⁽ⁿ⁾X₋ₖ.


Because by taking the aₖ⁽ⁿ⁾ such that

E[(X₁ − Σₖ₌₀ⁿ aₖ⁽ⁿ⁾X₋ₖ)X₋ⱼ] = 0, j = 0, …, n,

or

Γ(j + 1) = Σₖ₌₀ⁿ aₖ⁽ⁿ⁾Γ(j − k), j = 0, …, n,

then X₁ − Σₖ₌₀ⁿ aₖ⁽ⁿ⁾X₋ₖ is independent of (X₀, X₋₁, …, X₋ₙ), so that

E(X₁ | X₀, …, X₋ₙ) = Σₖ₌₀ⁿ aₖ⁽ⁿ⁾X₋ₖ.

From the martingale theorems, since EX₁² < ∞, it is easy to deduce that

E(X₁ | X₀, …, X₋ₙ) → E(X₁ | X₀, X₋₁, …) a.s. and in mean square.

Hence the best predictor is in the space L²(X₀, X₋₁, …) generated by all linear combinations of X₀, X₋₁, …. By the isomorphism this translates into the problem: Let L⁻(F) be the space generated by all linear combinations of e^{ikλ}, k = 0, −1, −2, …. Find the element f(λ) ∈ L⁻(F) which minimizes

∫ |e^{iλ} − f(λ)|² F(dλ).

In a similar way, many problems concerning Gaussian processes translate over into interesting and sometimes well-known problems in functions of a real variable, usually in the area of approximation theory.

NOTES

Note that the representation theorem 11.21 for Gaussian processes depends only on the fact that EXₙX̄ₘ depends on the difference n − m. Define {Xₙ}, n = 0, ±1, …, to be a complex process stationary in the second order if Γ(n, m) = EXₙX̄ₘ is a function of n − m. The only difference in the conclusion of 11.21 is that (i) is deleted. This representation theorem was proved by Cramér [21, 1942], and independently by Loève [107]. Since the work on the prediction problem in 1941-42 by Kolmogorov [96] and [97], and independently by Wiener [143], there has been a torrent of publications on second-order stationary processes and a sizeable amount on Gaussian processes. For a complete and rigorous treatment of these matters, refer to Doob's book [39]. For a treatment which is simpler and places more stress on applications, see Yaglom [144].


CHAPTER 12

STOCHASTIC PROCESSES AND BROWNIAN MOTION

1. INTRODUCTION

The natural generalization of a sequence of random variables {Xₙ} is a collection of random variables {X_t} indexed by a parameter t in some interval I. Such an object we will call a stochastic process.

Definition 12.1. A stochastic process or continuous parameter process is a collection {X_t(ω)} of random variables on (Ω, F, P) where t ranges over an interval I ⊂ R⁽¹⁾. Whenever convenient the notation {X(t, ω)} or simply {X(t)} will be used.

For fixed ω, what is produced by observing the values of X(t, ω) is a function x(t) on I.

The most famous stochastic process, and the most central in probability theory, is Brownian motion. This comes up like so: let X(t) denote one position coordinate of a microscopic particle undergoing molecular bombardments in a glass of water. Make the three assumptions given below.

Assumptions 12.2

1) Independence: X(t + Δt) − X(t) is independent of {X(τ)}, τ ≤ t.

2) Stationarity: the distribution of X(t + Δt) − X(t) does not depend on t.

3) Continuity: for every δ > 0, P(|X(t + Δt) − X(t)| > δ)/Δt → 0 as Δt → 0.

This is the sense of the assumptions: (1) means that the change in position during time [t, t + Δt] is independent of anything that has happened up to the time t. This is obviously only a rough approximation. Physically, what is much more correct is that the momentum imparted to the particle due to molecular bombardments during [t, t + Δt] is independent of what has happened up to time t. This assumption makes sense only if the


displacement of the particle due to its initial velocity at the beginning of the interval [t, t + Δt] is small compared to the displacements it suffers as a result of molecular momentum exchange over [t, t + Δt]. From a model point of view this is the worst assumption of the three. Accept it for now; later we derive the so-called exact model for the motion in which (1) will be replaced. The second assumption is quite reasonable: it simply requires homogeneity in time; that the distribution of change over any time interval depend only on the length of the time interval, and not on the location of the origin of time. This corresponds to a model in which the medium is considered to be infinite in extent.

The third assumption is interesting. We want all the sample functions of our motion to be continuous. A model in which the particle took instantaneous jumps would be a bit shocking. Split the interval [0, 1] into n parts, Δt = 1/n. If the motion is continuous, then

h(Δt) = max_{1≤k≤n} |X(kΔt) − X((k − 1)Δt)|

must converge to zero as Δt → 0. At a minimum, for any δ > 0, P(h(Δt) > δ) → 0. By (1) the variables Yₖ = |X(kΔt) − X((k − 1)Δt)| are independent; by (2), they all have the same distribution. Thus

P(h(Δt) > δ) = 1 − (1 − P(Y₁ > δ))ⁿ,

so that P(h(Δt) > δ) → 0 if and only if nP(Y₁ > δ) → 0. This last is exactly assumption (3):

(12.3) nP(Y₁ > δ) → 0, that is, P(|X(t + Δt) − X(t)| > δ)/Δt → 0 as Δt → 0.

Make the further assumption that X(0) = 0. This is not a restriction, but can be done by considering the process X(t) − X(0), t ≥ 0, which again satisfies (1), (2), and (3). Then

Proposition 12.4. For any process {X(t)}, t ≥ 0, satisfying 12.2(1), (2), and (3) with X(0) = 0, X(t) has a normal distribution with EX(t) = μt, σ²(X(t)) = σ²t.

Proof. For any t, let Δt = t/n, Yₖ = X(kΔt) − X((k − 1)Δt). Then X(t) = Y₁ + ⋯ + Yₙ, where the Y₁, …, Yₙ are independent and identically distributed. Therefore X(t) has an infinitely divisible law. Utilize the proof in (12.3) to show that Mₙ = max_{1≤k≤n} |Yₖ| converges in probability to

Page 267: Probability

250 STOCHASTIC PROCESSES AND BROWNIAN MOTION 12.1

zero. By 9.6, X(t) has a normal distribution. Let μ₁(t) = EX(t). Then

μ₁(t + τ) = μ₁(t) + μ₁(τ).

Let σ²(t) = σ²(X(t)), so that

σ²(t + τ) = σ²(t) + σ²(τ).

The fact is now that μ₁(t) and σ²(t) are continuous. This follows from (3):

X(t + τ) →^P X(t) as τ → 0, which for normal variables implies μ₁(t + τ) → μ₁(t), σ²(t + τ) → σ²(t).

It is easy to show that any continuous solutions of the equation φ(t + τ) = φ(t) + φ(τ) are linear. Q.E.D.

Use the above heuristics to back into a definition of Brownian motion.

Definition 12.5. Brownian motion is a stochastic process on [0, ∞) such that X(0) = 0 and the joint distribution of

X(t₁), …, X(tₙ), 0 = t₀ < t₁ < ⋯ < tₙ,

is specified by the requirement that X(tₖ) − X(tₖ₋₁), k = 1, …, n, be independent, normally distributed random variables with

E[X(tₖ) − X(tₖ₋₁)] = μ(tₖ − tₖ₋₁), σ²(X(tₖ) − X(tₖ₋₁)) = σ²(tₖ − tₖ₋₁).

This can be said another way. The random variables X(tₙ), …, X(t₀) have a joint normal distribution with EX(tₖ) = μtₖ and covariance

E[(X(tⱼ) − μtⱼ)(X(tₖ) − μtₖ)] = σ² min(tⱼ, tₖ).

Problem 1. Show that Brownian motion, as defined by 12.5, satisfies 12.2(1), (2), and (3).
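Definition 12.5 is also a recipe for simulation. The following sketch (the function name and the parameter values are ours, not the text's) draws X(t₁), …, X(tₙ) by accumulating independent N(μΔt, σ²Δt) increments, and checks the moments EX(t) = μt, σ²(X(t)) = σ²t.

```python
import numpy as np

rng = np.random.default_rng(1)

def brownian_at(times, mu=0.0, sigma=1.0, n_paths=1):
    """Sample (X(t_1), ..., X(t_n)) per 12.5: X(0) = 0 and independent
    N(mu*dt, sigma^2*dt) increments over consecutive intervals."""
    t = np.asarray(times, dtype=float)            # increasing time points
    dt = np.diff(np.concatenate(([0.0], t)))      # t_k - t_{k-1}, t_0 = 0
    incr = rng.normal(loc=mu * dt, scale=sigma * np.sqrt(dt),
                      size=(n_paths, dt.size))
    return np.cumsum(incr, axis=1)                # partial sums = X(t_k)

paths = brownian_at([0.5, 1.0, 2.0], mu=0.3, sigma=1.5, n_paths=200_000)
X2 = paths[:, -1]                                 # samples of X(2)
print(X2.mean(), X2.var())    # near mu*t = 0.6 and sigma^2*t = 4.5
```

The same routine can be used for the exercises that follow, since it realizes exactly the finite-dimensional distributions of 12.5.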


2. BROWNIAN MOTION AS THE LIMIT OF RANDOM WALKS

There are other ways of looking at Brownian motion. Consider a particle that moves to the right or left a distance Δx with probability ½. It does this each Δt time unit. Let Y₁, … be independent, and equal ±Δx with probability ½ each. The particle at time t has made [t/Δt] jumps ([z] indicates greatest integer ≤ z). Thus the position of the particle is given by

D(t) = Y₁ + ⋯ + Y_{[t/Δt]}.

The idea is that if Δx, Δt → 0 in the right way, then D(t) will approach Brownian motion in some way. To figure out how to let Δx, Δt → 0, note that ED²(t) ≃ (Δx)²t/Δt. To keep this finite and nonzero, Δx has to be of the order of magnitude of √Δt. For simplicity, take Δx = √Δt. Take Δt = 1/n; then the Y₁, … equal ±1/√n. Thus the D(t) process has the same distribution as

X⁽ⁿ⁾(t) = (1/√n)(Z₁ + ⋯ + Z_{[nt]}),

where Z₁, Z₂, … are ±1 with probability ½ each. Note that

X⁽ⁿ⁾(t) = √([nt]/n) · (Z₁ + ⋯ + Z_{[nt]})/√[nt],

and apply the central limit theorem to conclude X⁽ⁿ⁾(t) →^D N(0, t). In addition, it is no more difficult to show that all the joint distributions of X⁽ⁿ⁾(t) converge to those of Brownian motion. Therefore, Brownian motion appears as the limit of processes consisting of consecutive sums of independent, identically distributed random variables, and its study is an extension of the study of the properties of such sequences.

What has been done in 12.5 is to specify all the finite-dimensional distribution functions of the process. There is now the question again: Is there a process {X(t)}, t ∈ [0, ∞), on (Ω, F, P) with these finite-dimensional distributions? This diverts us into some foundational work.

3. DEFINITIONS AND EXISTENCE

Consider a stochastic process {X(t)}, t ∈ I, on (Ω, F, P). For fixed ω, X(t, ω) is a real-valued function on I. Hence denote by R^I the class of all real-valued functions x(t) on I, and by X(ω) the vector variable {X(t, ω)} taking values in R^I.

Definition 12.6. F(X(s), s ∈ J), J ⊂ I, is the smallest σ-field F′ such that all X(s), s ∈ J, are F′-measurable.

Definition 12.7. A finite-dimensional rectangle in R^I is any set of the form

{x(·); x(t₁) ∈ I₁, …, x(tₙ) ∈ Iₙ},

where I₁, …, Iₙ are intervals. Let B_I be the smallest σ-field of subsets of R^I containing all finite-dimensional rectangles.

For the understanding of what is going on here it is important to characterize B_I. Say that a set B ∈ B_I has a countable base T = {tⱼ} if it is of the form

B = {x(·); (x(t₁), x(t₂), …) ∈ D}, D ∈ B_∞.

This means that B is a set depending only on the coordinates x(t₁), x(t₂), …

Proposition 12.8. The class C of all sets with a countable base forms a σ-field; hence C = B_I.

Proof. Let B₁, B₂, … ∈ C, and take Tₖ as the base for Bₖ; then T = ⋃ₖ Tₖ is a base for all B₁, B₂, …, and if T = {tⱼ}, each Bₖ may be written as

Bₖ = {x(·); (x(t₁), x(t₂), …) ∈ Dₖ}, Dₖ ∈ B_∞.

Now it is pretty clear that any countable set combinations of the Bₖ produce a set with base T, hence a set in C.

Corollary 12.9. For B ∈ B_I, {ω; X(ω) ∈ B} ∈ F.

Proof. By the previous proposition there is a countable set {tⱼ} such that

B = {x(·); (x(t₁), x(t₂), …) ∈ D}, D ∈ B_∞.

Thus {ω; X(ω) ∈ B} = {ω; (X(t₁), X(t₂), …) ∈ D}, and this is in F by 2.13.

Definition 12.10. The finite-dimensional distribution functions of the process are given by

F_{t₁,…,tₙ}(x₁, …, xₙ) = P(X(t₁) ≤ x₁, …, X(tₙ) ≤ xₙ).

The notation F_t(x), t = (t₁, …, tₙ), x = (x₁, …, xₙ), may also be used.

Definition 12.11. The distribution of the process is the probability P̂ on B_I defined by

P̂(B) = P(X(ω) ∈ B), B ∈ B_I.

It is easy to prove that

Proposition 12.12. Any two stochastic processes on I having the same finite-dimensional distribution functions have the same distribution.

Proof. If X(t), X′(t) are the two processes, it follows from 2.22 that X(t₁), X(t₂), … and X′(t₁), X′(t₂), … have the same distribution. Thus if B is any set with base {t₁, t₂, …}, there is a set D ∈ B_∞ such that

P(X ∈ B) = P((X(t₁), X(t₂), …) ∈ D) = P((X′(t₁), X′(t₂), …) ∈ D) = P(X′ ∈ B).


The converse is also true, namely, that starting with a consistent set of finite-dimensional distribution functions we may construct a process having those distribution functions.

Definition 12.13. Given a set of distribution functions

F_{t₁,…,tₙ}(x₁, …, xₙ)

defined for all finite subsets {t₁ < ⋯ < tₙ} of I. They are said to be consistent if

F_{t₁,…,tₙ}(x₁, …, x_{j−1}, ∞, x_{j+1}, …, xₙ) = F_{t₁,…,t̂ⱼ,…,tₙ}(x₁, …, x̂ⱼ, …, xₙ),

where the ˆ denotes missing.

Theorem 12.14. Given a set of consistent distribution functions as in 12.13 above, there is a stochastic process {X(t)}, t ∈ I, such that

P(X(t₁) ≤ x₁, …, X(tₙ) ≤ xₙ) = F_{t₁,…,tₙ}(x₁, …, xₙ).

Proof. Take (Ω, F) to be (R^I, B_I). Denote by T, T₁, T₂, etc., countable subsets of I, and by B_T all sets of the form

{x(·); (x(t₁), x(t₂), …) ∈ D}, D ∈ B_∞, T = {tⱼ}.

By the extension theorem 2.26, there is a probability P_T on B_T such that

P_T(x(t₁) ≤ x₁, …, x(tₙ) ≤ xₙ) = F_{t₁,…,tₙ}(x₁, …, xₙ), t₁, …, tₙ ∈ T.

Take B any set in B_I; then by 12.8 there is a T such that B ∈ B_T. We would like to define P on B_I by P(B) = P_T(B). To do this, the question is: is the definition well defined? That is, if we let B ∈ B_{T₁}, B ∈ B_{T₂}, is P_{T₁}(B) = P_{T₂}(B)? Now B ∈ B_{T₁∪T₂}; hence it is sufficient to show that for T ⊂ T′, P_{T′}(B) = P_T(B), B ∈ B_T. But B_T ⊂ B_{T′}, and P_T = P_{T′} on all rectangles with base in T; hence P_{T′} is an extension to B_{T′} of P_T on B_T, so P_T(B) = P_{T′}(B). Finally, to show P is σ-additive on B_I, take {Bₙ} disjoint in B_I; then there is a T such that all B₁, B₂, … are in B_T, hence so is ⋃ Bₖ. Obviously now, by the σ-additivity of P_T,

P(⋃ₖ Bₖ) = P_T(⋃ₖ Bₖ) = Σₖ P_T(Bₖ) = Σₖ P(Bₖ).

The probability space is now defined. Finish by taking

X(t, ω) = x(t), where ω = x(·). Q.E.D.


4. BEYOND THE KOLMOGOROV EXTENSION

One point of the previous paragraph was that from the definition, the most complicated sets that could be guaranteed to be in F were of the form

{ω; (X(t₁), X(t₂), …) ∈ D}, D ∈ B_∞.

Furthermore, starting from the distribution functions, the extension to B_I is unique, and the maximal σ-field to which the extension is unique is the completion, B̄_I, the class of all sets A such that A differs from a set in B_I by a subset of a set of probability zero (see Appendix A.10). Now consider sets of the form

A₁ = {ω; X(t, ω) = 0 for some t ∈ I}, A₂ = {ω; |X(t, ω)| < a for all t ∈ I}.

These can be expressed as

A₁ = ⋃_{t∈I} {X(t) = 0}, A₂ = ⋂_{t∈I} {|X(t)| < a}.

If each X(t) has a continuous distribution, then A₁ is a noncountable union of sets of probability zero; A₂ is a noncountable intersection. Neither of A₁, A₂ depends on a countable number of coordinates. Clearly, A₁ᶜ does not contain any set of the form {(X(t₁), …) ∈ B}, B ∈ B_∞. Thus A₁ᶜ is not of the form {X ∈ B}, B ∈ B_I, so neither is A₁. Similarly, A₂ contains no sets of the form {X ∈ B}, B ∈ B_I. This forces the unpleasant conclusion that if all we are given are the joint distribution functions of the process, there is no unique way of calculating the interesting and important probabilities that a process has a zero crossing during the time interval I or remains bounded below a in absolute value during I. (Unless, of course, these sets accidentally fall in B̄_I. See Problem 3 for an important set which is not in B̄_I.)

But a practical approach that seems reasonable is: Let

A_ε = {ω; inf_{t∈I} |X(t)| < ε},

and hope that a.s. A₁ = lim_{ε↓0} A_ε. To compute P(A_ε), for I = [0, 1] say, compute

Pₙ = P(inf_{k≤n} |X(k/n)| < ε),


and define P(A_ε) = limₙ Pₙ. Note that {inf_{k≤n} |X(k/n)| < ε} ∈ F, so its probability is well-defined and computable from the distribution functions. Finally, define P(A₁) = lim_{ε↓0} P(A_ε). This method of approximation is appealing. How to get it to make sense? We take this up in the next section.

Problems

2. Prove for Brownian motion that the fields

0 < a < b < c < ∞, are independent.

3. Let

Show by considering A and Aᶜ that A is never in B̄_I for any probability P on B_I.

5. EXTENSION BY CONTINUITY

We are going to insist that all the processes we deal with in this chapter have a very weak continuity property.

Definition 12.15. Given a stochastic process {X(t)}, t ∈ I, say that it is continuous in probability if for every t ∈ I, whenever tₙ → t, then X(tₙ) →^P X(t).

When is X(t, ω) a continuous function of t? The difficulty here again is that the set

{ω; X(t, ω) is continuous on I}

is not necessarily in F. It certainly does not depend on only a countable number of coordinates. However, one way of getting around the problem is to take T = {tⱼ} dense in I. The set

{ω; X(t, ω) is uniformly continuous on T}

is in F. To see this more clearly, for h > 0 define

U(h, ω) = sup_{s,t∈T, |s−t|<h} |X(s, ω) − X(t, ω)|.

The function U(h, ω) is the supremum over a countable set of random variables, hence is certainly a random variable. Furthermore, it is decreasing in h. If as h ↓ 0, U(h, ω) → 0 a.s., then for almost every ω, X(t, ω) is a uniformly continuous function on T. Let C ∈ F be the set on which U(h, ω) → 0. Assume P(C) = 1. For ω ∈ C, define X̃(t, ω) to be the unique continuous


function on I that coincides with X(t, ω) on T. For t ∈ T, obviously X̃(t, ω) = X(t, ω). For t ∉ T, ω ∈ C, X̃(t, ω) = lim_{tⱼ→t} X(tⱼ, ω). Define X̃(t, ω) to be anything continuous for ω ∈ Cᶜ, for example X̃(t, ω) ≡ 0, all t ∈ I, ω ∈ Cᶜ. But note that X̃(tⱼ) → X̃(t) a.s. for tⱼ ∈ T, tⱼ → t, which implies that X(tⱼ) →^P X̃(t), and continuity in probability implies further that X̃(t) = X(t) almost surely. When I is an infinite interval, then this same construction works if for any finite interval J ⊂ I, there is probability one that X(·) is uniformly continuous on T ∩ J. Thus we have proved

Theorem 12.16. If the process {X(t)}, t ∈ I, is continuous in probability, and if there is a countable set T dense in I such that

P(X(t, ω) is uniformly continuous on T ∩ J) = 1

for every finite subinterval J ⊂ I, then there is a process X̃(t, ω) such that X̃(t, ω) is a continuous function of t ∈ I for every fixed ω, and for each t,

P(X̃(t) = X(t)) = 1.

The revised process {X̃(t, ω)} and the original process {X(t, ω)} have the same distribution, because for any countable {tⱼ},

P(X̃(t₁) ≤ x₁, X̃(t₂) ≤ x₂, …) = P(X(t₁) ≤ x₁, X(t₂) ≤ x₂, …).

Not only have the two processes the same distribution, so that they are indistinguishable probabilistically, but the {X̃(t)} process is defined on the same probability space as the original process. The {X̃(t)} process lends itself to all the computations and devices we wanted to use before. For example, for I = [0, 1], take

Ã₁ = {ω; X̃(t, ω) = 0 for some t ∈ I}, Ã_ε = {ω; inf_{t∈I} |X̃(t)| < ε}.

It is certainly now true that

Ã₁ = lim_{ε↓0} Ã_ε.

But take Ã_{n,ε} to be the set Ã_{n,ε} = {ω; ∃k ≤ 2ⁿ such that |X̃(k/2ⁿ)| < ε}. Then, by path continuity,

Ã_ε = ⋃ₙ Ã_{n,ε},

which implies Ã_ε ∈ F, so, in turn, Ã₁ ∈ F. Furthermore,

P(Ã_ε) = limₙ P(Ã_{n,ε}), P(Ã₁) = lim_{ε↓0} P(Ã_ε).

Therefore, by slightly revising the original process, we arrive at a process having the same distribution, for which the reasonable approximation procedures we wish to use are valid and the various interesting sets are measurable.


Obviously not all interesting stochastic processes can be altered slightly so as to have all sample functions continuous. But the basic idea always is to pick and work with the smoothest possible version of the process.

Definition 12.17. Given two processes {X(t)} and {X̃(t)}, t ∈ I, on the same probability space (Ω, F, P). They will be called versions of each other if

P(X(t) = X̃(t)) = 1, each t ∈ I.

Problems

4. Show that if a process {X(t)}, t ∈ I, is continuous in probability, then for any set {tₙ} dense in I, each X(t) is measurable with respect to the completion of F(X(t₁), X(t₂), …), or that each set of F(X(t), t ∈ I) differs from a set of F(X(t₁), …) by a set of probability zero.

5. Conclude from the above that if Tₙ ⊂ Tₙ₊₁, Tₙ ↑ T, Tₙ finite subsets of I, T dense in I, then for J ⊂ I, A ∈ F,

6. If X(t), X̃(t) are versions of each other for t ∈ I, and if both processes have all sample paths continuous on I, show that

P(X(t) = X̃(t) for all t ∈ I) = 1.

7. If {X(t)}, t ∈ I, is a process all of whose paths are continuous on I, then show that the function X(t, ω) defined on I × Ω is measurable with respect to the product σ-field B(I) × F. [For I finite, let I₁, …, Iₙ be any partition of I, tₖ ∈ Iₖ, and consider approximating X(t) by the functions Σₖ X(tₖ, ω)χ_{Iₖ}(t).]

6. CONTINUITY OF BROWNIAN MOTION

It is easy to check that the finite-dimensional distributions given by 12.5 are consistent. Hence there is a process {X(t)}, t ≥ 0, fitting them.

Definition 12.18. Let X(t) be a Brownian motion. If μ ≠ 0 it is said to be a Brownian motion with drift μ. If μ = 0, σ² = 1, it is called normalized Brownian motion, or simply Brownian motion.

Note that (X(t) − μt)/σ is normalized. The most important single sample path property is contained in

Theorem 12.19. For any Brownian motion X(t) there is a dense set T in [0, ∞) such that X(t) is uniformly continuous on T ∩ [0, a], a < ∞, for almost every ω.


In preparation for the proof of this, we need

Proposition 12.20. Let T₀ be any finite collection of points, 0 = t₀ < t₁ < ⋯ < tₙ = T; then for X(t) normalized Brownian motion,

P(max_{k≤n} X(tₖ) > x) ≤ 2P(X(T) > x), P(max_{k≤n} |X(tₖ)| > x) ≤ 2P(|X(T)| > x).

Proof. Denote Yₖ = X(tₖ) − X(tₖ₋₁), and j* = {first j such that X(tⱼ) > x}. Then because X(tₙ) − X(tⱼ) has a distribution symmetric about the origin,

P(X(T) > x) ≥ Σⱼ P(j* = j, X(tₙ) − X(tⱼ) ≥ 0) ≥ ½ Σⱼ P(j* = j) = ½ P(max_{k≤n} X(tₖ) > x).

For the second inequality, use

P(max_{k≤n} |X(tₖ)| > x) ≤ P(max_{k≤n} X(tₖ) > x) + P(max_{k≤n} (−X(tₖ)) > x)

and the fact that −X(t) is normalized Brownian motion.

Proof of 12.19. We show this for a = 1. Take Tₙ = {k/2ⁿ; k = 0, …, 2ⁿ} and T = ⋃ₙ Tₙ. Define

Yₖ = sup_{t∈T∩[(k−1)/2ⁿ, k/2ⁿ]} |X(t) − X((k − 1)/2ⁿ)|, U(h) = sup_{s,t∈T, |s−t|<h} |X(s) − X(t)|.

To show that U(h) → 0 a.s. as h ↓ 0, since U(h) is decreasing in h, it is sufficient to show that U(1/2ⁿ) →^P 0.

By the triangle inequality,

U(1/2ⁿ) ≤ 3 max_{1≤k≤2ⁿ} Yₖ.


We show that P(maxₖ Yₖ > δ) → 0. Use

P(maxₖ Yₖ > δ) ≤ Σₖ P(Yₖ > δ).

The Y₁, Y₂, … are identically distributed, so

P(maxₖ Yₖ > δ) ≤ 2ⁿ P(Y₁ > δ).

Note that, by taking limits in 12.20 over increasing finite subsets of T ∩ [0, 1/2ⁿ],

P(Y₁ > δ) ≤ 2P(|X(1/2ⁿ)| > δ);

hence

P(maxₖ Yₖ > δ) ≤ 2ⁿ⁺¹ P(|X(1/2ⁿ)| > δ).

Since Brownian motion satisfies P(|X(Δt)| > δ)/Δt → 0 as Δt → 0, the right-hand side goes to zero, which proves the theorem.

Corollary 12.21. There is a version of Brownian motion on [0, ∞) such that all sample paths are continuous.

Henceforth, we assume that the Brownian motion we deal with has all sample paths continuous.

Problems

8. Prove that for {X(t)} normalized Brownian motion on [0, ∞), P(X(nδ) ∈ J i.o.) = 1 for all intervals J such that |J| > 0, and fixed δ > 0.

9. Define T = ⋂_{t>0} F(X(τ), τ ≥ t), or A ∈ T if A ∈ F(X(τ), τ ≥ t) for all t > 0. Prove that A ∈ T implies P(A) = 0 or 1. [Apply a generalized version of 3.50.]

7. AN ALTERNATIVE DEFINITION

Normalized Brownian motion is completely specified by stating that it is a Gaussian process, i.e., all finite subsets X(t₁), …, X(tₙ) have a joint normal distribution, EX(t) = 0, and covariance Γ(s, t) = min(s, t). Since all sample functions are continuous, to specify Brownian motion it would only be necessary to work with a countable subset of random variables


{X(t)}, t ∈ T, and get the others as limits. This leads to the idea that with a proper choice of coordinates, X(t) could be expanded in a countable coordinate system. Let Y₁, Y₂, … be independent and N(0, 1). Let φₖ(t) be defined on a closed interval I such that Σₖ |φₖ(t)|² < ∞, all t ∈ I. Consider

Z(t) = Σₖ Yₖφₖ(t).

Since

Σₖ σ²(Yₖφₖ(t)) = Σₖ |φₖ(t)|² < ∞,

the sums converge a.s. for every t ∈ I; hence Z(t) is well-defined for each t except on a set of probability zero. Furthermore, Z(t₁), …, Z(tₙ) is the limit in distribution of joint normal variables, so that {Z(t)} is a Gaussian process.

Note EZ(t) = 0, and for the Z(t) covariance

EZ(s)Z(t) = Σₖ φₖ(s)φₖ(t).

Hence if the φₖ(t) satisfy

Σₖ φₖ(s)φₖ(t) = min(s, t),

then Z(t) is normalized Brownian motion on I. I assert that on I = [0, π],

(12.23) min(s, t) = st/π + (2/π) Σ_{k=1}^∞ sin(ks) sin(kt)/k².

One way to verify this is to define a function of t on [−π, π] for any s > 0 by

h_s(t) = min(s, t) for t ≥ 0, h_s(−t) = −h_s(t),

the odd extension of min(s, t). Denote the right-hand side of (12.23) by g_s(t). The sum converges uniformly; hence g_s(t) is continuous for all s, t. Simply check that for all integers k,

∫_{−π}^{π} h_s(t)e^{−ikt} dt = ∫_{−π}^{π} g_s(t)e^{−ikt} dt,

and use the well-known fact that two continuous functions with the same Fourier coefficients are equal on [−π, +π]. Since h_s(t) = min(s, t) for t ≥ 0, (12.23) results.


Proposition 12.24. Let Y₀, Y₁, … be independent N(0, 1); then

X(t) = Y₀t/√π + √(2/π) Σ_{k=1}^∞ Yₖ sin(kt)/k

is normalized Brownian motion on [0, π].

One way to prove the continuity of sample paths would be to define X⁽ⁿ⁾(t) as the nth partial sum in 12.24, and to show that for almost all ω, the functions xₙ(t) = X⁽ⁿ⁾(t, ω) converged uniformly on [0, π]. This can be shown true, at least for a subsequence X^{(n′)}(t). See Ito and McKean [76, p. 22] for a proof along these lines.
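Short of a uniform-convergence proof, the expansion 12.24 can at least be checked in distribution (a sketch; the truncation level K and the two time points are arbitrary choices of ours): simulate truncated series paths and verify that the empirical covariance at points s, t ∈ [0, π] is near min(s, t).

```python
import numpy as np

rng = np.random.default_rng(3)

K = 2000                       # truncation level of the series
t = np.array([0.7, 1.9])       # two time points in [0, pi]
k = np.arange(1, K + 1)
phi = np.sqrt(2 / np.pi) * np.sin(np.outer(k, t)) / k[:, None]   # K x 2

n_paths, chunk = 40_000, 4_000
acc = 0.0
for _ in range(n_paths // chunk):
    Y0 = rng.normal(size=(chunk, 1))
    Yk = rng.normal(size=(chunk, K))
    # X(t) = Y0 t/sqrt(pi) + sqrt(2/pi) sum_k Yk sin(kt)/k  (truncated)
    X = Y0 * (t / np.sqrt(np.pi)) + Yk @ phi
    acc += (X[:, 0] * X[:, 1]).sum()
cov = acc / n_paths
print(cov)     # should be near min(0.7, 1.9) = 0.7
```

The truncation error in the covariance is of order 1/K by the tail of (12.23), so the Monte Carlo error dominates here.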

8. VARIATION AND DIFFERENTIABILITY

The Brownian motion paths are extremely badly behaved, even among continuous functions. Their more obvious indices of bad behavior are given in this section: they are nowhere differentiable, and consequently of unbounded variation in every interval.

Theorem 12.25. Almost every Brownian path is nowhere differentiable.

Proof. We follow Dvoretzki, Erdős, and Kakutani [42]. Fix β > 0; suppose that a function x(t) has derivative x′(s), |x′(s)| < β, at some point s ∈ [0, 1]; then there is an n₀ such that for n ≥ n₀,

(12.26) |x(t) − x(s)| ≤ 2β|t − s|, if |t − s| ≤ 2/n.

Let x(·) denote functions on [0, 1], and set

Aₙ = {x(·); ∃s such that |x(t) − x(s)| ≤ 2β|t − s| if |t − s| ≤ 2/n}.

The Aₙ increase with n, and the limit set A includes the set of all sample paths on [0, 1] having a derivative at any point which is less than β in absolute value. If (12.26) holds and we let k be the largest integer such that k/n ≤ s, the following is implied:

|x((k + j)/n) − x((k + j − 1)/n)| ≤ 8β/n, j = 0, 1, 2.

Therefore, if

Bₙ = {x(·); ∃k ≤ n such that |x((k + j)/n) − x((k + j − 1)/n)| ≤ 8β/n, j = 0, 1, 2},


then Aₙ ⊂ Bₙ. Thus to show P(A) = 0, which implies the theorem, it is sufficient to get limₙ P(Bₙ) = 0. But, by the independence and stationarity of the increments,

P(Bₙ) ≤ (n + 1)[P(|X(1/n)| ≤ 8β/n)]³ = (n + 1)[√(n/2π) ∫_{|x|≤8β/n} e^{−nx²/2} dx]³.

Substitute nx = y. Then

P(Bₙ) ≤ (n + 1)[(1/√(2πn)) ∫_{|y|≤8β} e^{−y²/2n} dy]³ ≤ (n + 1)(16β/√(2πn))³ → 0. Q.E.D.

Corollary 12.27. Almost every sample path of X(t) has infinite variation on every finite interval.

Proof. If a sample function X(t, ω) has bounded variation on I, then it has a derivative existing almost everywhere on I.

A further result gives more information on the size of the oscillations of X(t). Since E|X(t + Δt) − X(t)|² = Δt, as a rough estimate we would guess that |X(t + Δt) − X(t)| ≃ √Δt. Then for any fine partition t₀, …, tₙ of the interval [t, t + τ],

Σₖ |X(tₖ) − X(tₖ₋₁)|² ≃ Σₖ (tₖ − tₖ₋₁) = τ.

The result of the following theorem not only verifies this, but makes it surprisingly precise.

Theorem 12.28. Let the partitions 𝒯ₙ of [t, t + τ], 𝒯ₙ = (t₀⁽ⁿ⁾, …, t_{mₙ}⁽ⁿ⁾), ‖𝒯ₙ‖ = supₖ |tₖ⁽ⁿ⁾ − tₖ₋₁⁽ⁿ⁾|, satisfy ‖𝒯ₙ‖ → 0. Then

Σₖ (X(tₖ⁽ⁿ⁾) − X(tₖ₋₁⁽ⁿ⁾))² → τ in mean square.


Proof. Assume t = t₀⁽ⁿ⁾, t + τ = t_{mₙ}⁽ⁿ⁾; otherwise do some slight modification. Then τ = Σₖ (tₖ − tₖ₋₁) (dropping the superscript), and

Σₖ (X(tₖ) − X(tₖ₋₁))² − τ = Σₖ [(X(tₖ) − X(tₖ₋₁))² − (tₖ − tₖ₋₁)].

The summands are independent, with zero means. Hence

E[Σₖ (X(tₖ) − X(tₖ₋₁))² − τ]² = Σₖ E[(X(tₖ) − X(tₖ₋₁))² − (tₖ − tₖ₋₁)]².

(X(tₖ) − X(tₖ₋₁))²/(tₖ − tₖ₋₁) has the distribution of X², where X is N(0, 1). So

E[Σₖ (X(tₖ) − X(tₖ₋₁))² − τ]² = E(X² − 1)² Σₖ (tₖ − tₖ₋₁)² ≤ E(X² − 1)² ‖𝒯ₙ‖ τ → 0,

proving convergence in mean square. If Σₙ ‖𝒯ₙ‖ < ∞, then a.s. convergence follows from the Borel-Cantelli lemma plus the Chebyshev inequality.

Theorem 12.28 holds more generally, with Σₖ (X(tₖ⁽ⁿ⁾) − X(tₖ₋₁⁽ⁿ⁾))² → τ a.s. for any sequence 𝒯ₙ of partitions such that ‖𝒯ₙ‖ → 0 and the 𝒯ₙ are successive refinements. (See Doob [39, pp. 395 ff.].)
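Theorem 12.28 is one of the most pleasant Brownian facts to verify by simulation (a sketch; the interval and mesh sizes are arbitrary): over [0, τ] the sum of squared increments settles down to τ as the partition is refined, even though each path has infinite ordinary variation.

```python
import numpy as np

rng = np.random.default_rng(4)

tau = 2.0
results = {}
for m in (1_000, 10_000, 100_000):                 # finer and finer partitions
    dt = tau / m
    incr = rng.normal(scale=np.sqrt(dt), size=m)   # X(t_k) - X(t_{k-1})
    results[m] = np.sum(incr ** 2)                 # sum of squared increments
    print(m, results[m])       # tends to tau = 2.0 as the mesh -> 0
```

By the variance computation in the proof, the fluctuation of the sum about τ is of order ‖𝒯ₙ‖^{1/2}, which is what the printed values display.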

9. LAW OF THE ITERATED LOGARITHM

Now for one of the most precise and well-known theorems regarding oscillations of a Brownian motion. It has gone through many refinements and generalizations since its proof by Khintchine in 1924.

Theorem 12.29. For normalized Brownian motion,

(12.29) P(limsup_{t↓0} X(t)/√(2t log log(1/t)) = 1) = 1.

Proof. I follow essentially Lévy's proof. Let φ(t) = √(2t log log(1/t)).

1) For any δ > 0, P(X(t) > (1 + δ)φ(t) for arbitrarily small t) = 0.

Proof of (1). Take q any number in (0, 1); put tₙ = qⁿ. The plan is to show that if Cₙ is the event

Cₙ = {X(t) > (1 + δ)φ(t) for at least one t ∈ [tₙ₊₁, tₙ]},


then P(Cₙ i.o.) = 0. Define M(T) = sup_{t≤T} X(t) and use

P(Cₙ) ≤ P(M(tₙ) > (1 + δ)φ(tₙ₊₁)),

valid since φ(t) is increasing in t. Use the estimates gotten from taking limits in 12.20:

P(M(T) > x) ≤ 2P(X(T) > x) ≤ 2e^{−x²/2T}.

Hence, letting xₙ = (1 + δ)φ(tₙ₊₁),

P(Cₙ) ≤ 2 exp(−xₙ²/2tₙ),

and since xₙ²/2tₙ = q(1 + δ)² log log(1/tₙ₊₁),

(12.30) P(Cₙ) ≤ 2[(n + 1) log(1/q)]^{−A},

where A = q(1 + δ)². For any δ, select q so that q(1 + δ)² > 1. Then the right-hand side of (12.30) is a term of a convergent sum and the first assertion is proved.

2) For any δ > 0, P(X(t) > (1 − δ)φ(t) for arbitrarily small t) = 1.

Proof of (2). Take q again in (0, 1), tₙ = qⁿ, and let Zₙ = X(tₙ) − X(tₙ₊₁). The Zₙ are independent. Suppose we could show that for ε > 0,

P(Zₙ > (1 − ε)φ(tₙ) i.o.) = 1.

This would be easy in principle, because the independence of the Zₙ allows the converse half of the Borel-Cantelli lemma to be used. On the other hand, from part one of this proof, because the processes {X(t)} and {−X(t)} have the same distribution,

P(−X(tₙ₊₁) ≤ 2φ(tₙ₊₁) for all n sufficiently large) = 1,

or

X(tₙ₊₁) ≥ −2φ(tₙ₊₁)


holds for all n sufficiently large. From X(tₙ) = Zₙ + X(tₙ₊₁) it follows that infinitely often

X(tₙ) ≥ (1 − ε)φ(tₙ) − 2φ(tₙ₊₁).

Note that φ(tₙ₊₁)/φ(tₙ) → √q. Therefore, if we take ε, q so small that

(1 − ε) − 2√q > 1 − δ,

the second part would be established. So now, we start estimating: Zₙ is N(0, tₙ − tₙ₊₁), so with xₙ = (1 − ε)φ(tₙ)/√(tₙ − tₙ₊₁),

P(Zₙ > (1 − ε)φ(tₙ)) = P(N(0, 1) > xₙ) ≥ (c/xₙ)e^{−xₙ²/2}.

Then

xₙ²/2 = [(1 − ε)²/(1 − q)] log log(1/tₙ), so e^{−xₙ²/2} = (n log(1/q))^{−a}, a = (1 − ε)²/(1 − q).

By taking q even smaller, if necessary, we can get a < 1. The right-hand side is then a term of a divergent series and the proof is complete. Q.E.D.

10. BEHAVIOR AT t = ∞

Let Yₖ = X(k) − X(k − 1). The Yₖ are independent N(0, 1) variables, and X(n) = Y₁ + ⋯ + Yₙ is the sum of independent, identically distributed random variables. Thus X(t) for t large has the magnitude properties of successive sums Sₙ. In particular,

Proposition 12.31. X(t)/t → 0 a.s. as t → ∞.

Proof. Since EY₁ = 0, we can use the strong law of large numbers to get X(n)/n → 0 a.s. Let

Zₙ = sup_{n≤t≤n+1} |X(t) − X(n)|.


For n ≤ t ≤ n + 1, |X(t)|/t ≤ |X(n)|/n + Zₙ/n. The first term → 0 a.s.; Zₖ has the same distribution as max_{0≤t≤1} |X(t)|. By 12.20, EZₖ < ∞. Now use Problem 10, Chapter 3, to conclude that Zₙ/n → 0 a.s. Q.E.D.

This is the straightforward approach. There is another way which is surprising, because it essentially reduces behavior for t → ∞ to behavior for t → 0. Define

X⁽¹⁾(t) = tX(1/t), t > 0, X⁽¹⁾(0) = 0.

Proposition 12.32. X⁽¹⁾(t) is normalized Brownian motion on [0, ∞).

Proof. Certainly X⁽¹⁾(t) is Gaussian with zero mean. Also,

EX⁽¹⁾(s)X⁽¹⁾(t) = st EX(1/s)X(1/t) = st min(1/s, 1/t) = min(s, t).

Now to prove 12.31 another way. The statement X(t)/t → 0 as t → ∞ translates into tX(1/t) → 0 a.s. as t → 0. So 12.31 is equivalent to proving that X⁽¹⁾(t) → 0 a.s. as t → 0. If X⁽¹⁾(t) is a version of Brownian motion with all paths continuous on [0, ∞), then trivially, X⁽¹⁾(t) → 0 a.s. at the origin. However, the continuity of X(t) on [0, ∞) gives us only that the paths of X⁽¹⁾(t) are continuous on (0, ∞). Take a version X̃⁽¹⁾(t) of X⁽¹⁾(t) such that all paths of X̃⁽¹⁾(t) are continuous on [0, ∞). By Problem 5, almost all paths of X̃⁽¹⁾(t) and X⁽¹⁾(t) coincide on (0, ∞). Since X̃⁽¹⁾(t) → 0 as t → 0, this is sufficient.
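The inversion X⁽¹⁾(t) = tX(1/t) can also be checked numerically (a sketch; the points s, t and sample size are arbitrary choices of ours): sampling (X(1/t), X(1/s)) from Brownian increments and forming the inverted process, the empirical covariance comes out near min(s, t), as the computation in 12.32 predicts.

```python
import numpy as np

rng = np.random.default_rng(5)

s, t = 0.5, 2.0                  # arbitrary points, s < t, so 1/t < 1/s
n_paths = 200_000

a, b = 1.0 / t, 1.0 / s
W_a = rng.normal(scale=np.sqrt(a), size=n_paths)            # X(1/t)
W_b = W_a + rng.normal(scale=np.sqrt(b - a), size=n_paths)  # X(1/s)

X1_s = s * W_b                   # X1(s) = s * X(1/s)
X1_t = t * W_a                   # X1(t) = t * X(1/t)
print((X1_s * X1_t).mean(), (X1_t ** 2).mean())
# near min(s, t) = 0.5 and Var X1(t) = t = 2.0
```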

By using this inversion on the law of the iterated logarithm we get

Corollary 12.33

P(limsup_{t→∞} X(t)/√(2t log log t) = 1) = 1.

Since −X(t) is also Brownian motion,

P(liminf_{t→∞} X(t)/√(2t log log t) = −1) = 1.

Therefore, almost every path oscillates, for t large, between roughly −√(2t log log t) and +√(2t log log t). The similar versions of 12.29 hold as t → 0; for instance,

(12.34) P(liminf_{t↓0} X(t)/√(2t log log(1/t)) = −1) = 1.


11. THE ZEROS OF X(t)

Look at the set T(ω) of zeros of X(t, ω) in the interval [0, 1]. For any continuous function, the zero set is closed. By (12.29) and (12.34), T(ω) is an infinite set a.s. Furthermore, the Lebesgue measure of T(ω) is a.s. zero, because l(T(ω)) = l(t; X(t) = 0) = ∫₀¹ χ_{{0}}(X(t)) dt, so

E l(T(ω)) = ∫₀¹ P(X(t) = 0) dt = 0,

where the interchange of E and ∫₀¹ dt is justified by the joint measurability of X(t, ω), hence of χ_{{0}}(X(t, ω)). (See Problem 7.)

Theorem 12.35. For almost all ω, T(ω) is a closed, perfect set of Lebesgue measure zero (therefore, noncountable).

Proof. The remaining part is to prove that T(ω) has no isolated points. The idea here is that every time X(t) hits zero, it is like starting all over again, and the law of the iterated logarithm guarantees a clustering of zeros starting from that point. For almost all paths, the point t = 0 is a limit point of zeros of X(t) from the right. For any point a > 0, let t* be the position of the first zero of X(t) following t = a, that is,

Look at the process X^(1)(t) = X(t + t*) − X(t*). This is just looking at the Brownian process as though it started afresh at the time t*. Heuristically, what happens up to time t* depends only on the process up to that time; starting over again at t* should give a process that looks exactly like Brownian motion. If this argument can be made rigorous, then the set C_a of sample paths such that t* is a limit point of zeros from the right has probability one. The intersection of C_a over all rational a > 0 has probability one, also. Therefore almost every sample path has the property that the first zero following any rational is a limit point of zeros from the right. This precludes the existence of any isolated zero. Therefore, the theorem is proved except for the assertion,

The truth of (12.36) and its generalization are established in the next section. Suppose that it holds for more general random starting times. Then we could use this to prove

Corollary 12.37. For any value a, the set T(a) = {t; X(t) = a, 0 ≤ t ≤ 1} is, for almost all ω, either empty or a perfect closed set of Lebesgue measure zero.


Proof. Let t* be the first t such that X(t) = a, that is, t* = inf {t; X(t) = a}. If t* > 1, then T(a) is empty. The set {t* = 1} ⊂ {X(1) = a} has probability zero. If t* < 1, consider the process X(t + t*) − X(t*) as starting out at the random time t*. If this is Brownian motion, the zero set is perfect. But the zero set for this process is

Hence T(a) = T ∩ [0, 1] is perfect a.s.

12. THE STRONG MARKOV PROPERTY

The last item needed to complete and round out this study of sample path properties is a formulation and proof of the statement: At a certain time t*, where t* depends only on the Brownian path up to time t*, consider the motion as starting at t*; that is, look at X^(1)(t) = X(t + t*) − X(t*). Then X^(1)(t) is Brownian motion and is independent of the path of the particle up to time t*. Start with the observation that for τ ≥ 0 fixed,

is Brownian motion and is independent of F(X(s), s ≤ τ). Now, to have any of this make sense, we need:

Proposition 12.39. If t* ≥ 0 is a random variable, so is X(t*).

Proof. For any n > 0, let

X^(n)(t*) is a random variable. On the set {t* < ∞},

The right-hand side → 0, so X^(n)(t*) → X(t*) everywhere.

Next, it is necessary to formulate the statement that the value of t* depends only on the past of the process up to time t*.

Definition 12.40. For any process {X(t)}, a random variable t* ≥ 0 will be called a stopping time if for every t > 0,
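The defining condition is easy to experiment with in discrete time. The sketch below is illustrative only (not from the text, and the helper names are ours): the first time a simple random walk reaches a given level is a stopping time, because the event {t* ≤ n} is decided by the first n steps alone, while the time at which the walk attains its overall maximum is not, since deciding it requires looking at the future.

```python
import random

def hitting_time(path, level=1):
    # First index n with path[n] >= level. Whether this has happened by
    # step n is decided by path[0..n] alone, so it is a stopping time.
    for n, x in enumerate(path):
        if x >= level:
            return n
    return None  # never hit within this path

def argmax_time(path):
    # Index of the maximum over the whole path: deciding it needs the
    # entire future, so it is NOT a stopping time.
    return max(range(len(path)), key=lambda n: path[n])

random.seed(0)
walk = [0]
for _ in range(100):
    walk.append(walk[-1] + random.choice([-1, 1]))

t_star = hitting_time(walk, level=1)
# The event {t* <= n} is measurable from the first n steps:
if t_star is not None:
    assert hitting_time(walk[:t_star + 1], level=1) == t_star
```

The contrast with `argmax_time` is the whole point: truncating the path cannot change a stopping time that has already occurred, but it can change the location of the maximum.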

The last step is to give meaning to "the part of the process up to time t*."


Look at an example:

so t* is the first time that X(t) hits the point x = 1. It can be shown quickly that t* is a stopping time. Look at sets depending on X(t), 0 ≤ t ≤ t*; for example,

Note that for any t > 0, B ∩ {t* ≤ t} depends only on the behavior of the sample path on [0, t], that is,

Generalizing from this we get

Definition 12.41. The σ-field of events B ∈ F such that for every t > 0, B ∩ {t* ≤ t} ∈ F(X(τ), τ ≤ t) is called the σ-field generated by the process up to time t* and denoted by F(X(t), t ≤ t*).

This is all we need:

Theorem 12.42. Let t* be a stopping time; then

is normalized Brownian motion and

Proof. If t* takes on only a countable number of values {τ_k}, then 12.42 is quick. For example, if

then

Now,

Furthermore,


so it is in F(X(t), t ≤ τ_k). By (12.38), then,

Summing over k, we find that

This extends immediately to the statement of the theorem. In the general case, we approximate t* by a discrete stopping variable. Define

" "

Then t*_n is a stopping time because for k/n ≤ t < (k + 1)/n,

the latter because for k/n ≤ t < (k + 1)/n,

But, by the path continuity of X(t), X^(n)(t) → X^(1)(t) for every ω, t. This implies that at every point (x_1, . . . , x_j) which is a continuity point of the distribution function of X^(1)(t_1), . . . , X^(1)(t_j),

This is enough to ensure that equality holds for all (x_1, . . . , x_j). Extension now proves the theorem.


Problem 10. Prove that the variables

are stopping times for Brownian motion.

NOTES

The motion of small particles suspended in water was noticed and described by Brown in 1828. The mathematical formulation and study was initiated by Bachelier [1, 1900] and Einstein [45, 1905], and carried on extensively from that time by both physicists and mathematicians. But rigorous discussion of sample path properties was not started until 1923, when Wiener [141] proved path continuity. Wiener also deduced the orthogonal expansion (12.24) in 1924 [142].

A good source for many of the deeper properties of Brownian motion is Lévy's books [103, 105], and the recent book [76] by Itô and McKean. A very interesting collection of articles that includes many references to earlier works and gives a number of different ways of looking at Brownian motion has been compiled by Wax [139]. The article by Dvoretzky, Erdős, and Kakutani [42] gives the further puzzling property that no sample paths have any "points of increase."

The fact of nondifferentiability of the sample paths was discovered by Paley, Wiener, and Zygmund [115, 1933]. The law of the iterated logarithm for Brownian motion was proved by Khintchine [88, 1933]. The properties of the zero sets of its paths were stated by Lévy, who seemed to assume the truth of 12.42. This latter property was stated and proved by Hunt [74, 1956]. David Freedman suggested the proof given that no zeros of X(t) are isolated.


CHAPTER 13

INVARIANCE THEOREMS

1. INTRODUCTION

Let S_1, S_2, . . . be a player's total winnings in a fair coin-tossing game. A question leading to the famous arc sine theorem is: Let N_n be the number of times that the player is ahead in the first n games,

The proportion of the time that the player is ahead in n games is N_n/n = W_n. Does a limiting distribution exist for W_n, and if so, what is it?

Reason this way: Define Z^(n)(t) = S_[nt] as t ranges over the values 0 ≤ t ≤ 1. Denote Lebesgue measure by l; then

Now Z^(n)(t) = S_[nt] does not converge to anything in any sense, but recall from Section 2 of the last chapter that the processes X^(n)(t) = S_[nt]/√n have all finite-dimensional distribution functions converging to those of normalized Brownian motion X(t) as n → ∞. We denote this by

X(n)(-) -^ X(-), But Wn can also be written as

Of course, the big transition that we would like to make here would be to define

so that W is just the proportion of time that a Brownian particle stays on the positive axis during [0, 1], and then conclude that

The general truth of an assertion like this would be a profound generalization of the central limit theorem. The transition from the obvious application of the central limit theorem to conclude that X^(n)(·) →_D X(·) to get to



W_n →_D W is neither obvious nor easy. But some general theorems of this kind would give an enormous number of limit theorems. For example, if we let

is it possible to find constants A_n such that M_n/A_n →_D M, M nondegenerate? Again, write

Let

then apply our nonexistent theorem to conclude

This chapter will fill in the missing theorem. The core of the theorem is that successive sums of independent, identically distributed random variables with zero means and finite variances have the same distribution as Brownian motion with a random time index. The idea is not difficult. Suppose S_1, S_2, . . . form the symmetric random walk. Define T_1 as the first time such that |X(t)| = 1. By symmetry, P(X(T_1) = 1) = P(X(T_1) = −1) = ½. Define T_2 as the first time such that |X(t + T_1) − X(T_1)| = 1, and so on. But T_1 is determined only by the behavior of the X(t) motion up to time T_1, so that intuitively one might hope that the X(t + T_1) − X(T_1) process would have the same distribution as Brownian motion, but be independent of X(t), t ≤ T_1. To make this sort of construction hold, the strong Markov property is, of course, essential. But also, we need to know some more about the first time that a Brownian motion exits from some interval around the origin.

2. THE FIRST-EXIT DISTRIBUTION

For a set B ∈ B_1, let

be the first exit time of the Brownian motion from the set B. In particular, the first time that the particle hits the point {a} is identical with the first exit time of the particle from (−∞, a) if a > 0, or from (a, ∞) if a < 0, and we denote it by t*_a. Let t*(a, b) be the first exit time from the interval (a, b). The information we want is the probability that the first exit of the particle from the interval (a, b), a < 0 < b, occurs at the point b, and the value of Et*(a, b), the expected time until exit.


For normalized Brownian motion {X(t)}, use the substitution X(t) = X(s) + (X(t) − X(s)) to verify that for s < t,

Hence the processes {X(t)}, {X²(t) − t} are martingales in the sense of

Definition 13.1. A process {Y(t)}, t ∈ I, is a martingale if for all t in I, E|Y(t)| < ∞, and for any s, t in I, s < t,

Suppose that for stopping times t* satisfying the appropriate integrability conditions, the generalization of 5.31 holds as the statement EY(t*) = EY(0). This would give, for a stopping time on Brownian motion,

These two equations would give the parameters we want. If we take t* = t*(a, b), (13.2) becomes

Solving, we get

The second equation of (13.2) provides us with

Using (13.3) we find that
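Both parameters can be sanity-checked by simulation: the exit probability at b is |a|/(b − a) from (13.3), and the expected exit time is |a|b from (13.4). The sketch below is illustrative only (not from the text); the grid step dt is an assumption of the discretization, not part of the theory.

```python
import math, random

def exit_stats(a, b, trials=1500, dt=2e-3, rng=None):
    # Monte Carlo estimate of P(first exit of (a, b) occurs at b) and of
    # the expected exit time, for Brownian motion started at 0.
    rng = rng or random.Random(2)
    sd = math.sqrt(dt)
    hit_b, total_time = 0, 0.0
    for _ in range(trials):
        x, t = 0.0, 0.0
        while a < x < b:
            x += rng.gauss(0.0, sd)   # Brownian increment over dt
            t += dt
        hit_b += (x >= b)
        total_time += t
    return hit_b / trials, total_time / trials

p_b, mean_t = exit_stats(a=-1.0, b=2.0)
# (13.3): P(exit at b) = |a| / (b - a) = 1/3
# (13.4): E t*(a, b)   = |a| * b     = 2
assert abs(p_b - 1/3) < 0.05
assert abs(mean_t - 2.0) < 0.25
```

The small discretization bias (the walk overshoots the boundary by one step) is absorbed by the loose tolerances.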

Rather than detour to prove the general martingale result, we defer proof until the next chapter and prove here only what we need.

Proposition 13.5. For t* = t*(a, b), a < 0 < b,

Proof. Let t* be a stopping time taking values in a countable set {τ_j} ⊂ [0, τ], τ < ∞. Then


By independence, then, E(X(τ) − X(t*)) = 0; hence EX(t*) = 0. For the second equation, write

The last term is easily seen to be zero. The first term is

which equals

Since EX(τ)² = τ,

Take t** = min(t*(a, b), τ), and t*_n a sequence of stopping times taking values in a countable subset of [0, τ] such that t*_n → t** everywhere. By the bounded convergence theorem, Et*_n → Et**. Furthermore, by path continuity, X(t*_n) → X(t**), and

which is integrable by 12.20. Use the bounded convergence theorem again to get EX(t*_n) → EX(t**), EX²(t*_n) → EX²(t**). Write t* for t*(a, b); then

Note that |X(t**)| ≤ max(|a|, |b|), and on {t* > τ}, |X(τ)| ≤ max(|a|, |b|). Hence as τ → ∞,

Since t** ↑ t*, apply monotone convergence to get lim Et** = Et*, completing the proof.

Problem 1. Use Wald's identity 5.34 on the sums

and, letting Δt → 0, prove that for t* = t*(a, b) and any λ,

By showing that differentiation under the integral is permissible, prove 13.5.


3. REPRESENTATION OF SUMS

An important representation is given by

Theorem 13.6. Given independent, identically distributed random variables Y_1, Y_2, . . . , EY_1 = 0, EY_1² = σ² < ∞, S_n = Y_1 + · · · + Y_n, there exists a probability space with a Brownian motion X(t) and a sequence T_1, T_2, . . . of nonnegative, independent, identically distributed random variables defined on it such that the sequence X(T_1), X(T_1 + T_2), . . . has the same distribution as S_1, S_2, . . . , and ET_1 = σ².

Proof. Let (U_n, V_n), n = 1, 2, . . . be a sequence of identically distributed, independent random vectors defined on the same probability space as a Brownian motion {X(t)} such that (X(t), t ≥ 0) and (U_n, V_n, n = 1, 2, . . .) are independent. This can be done by constructing (Ω_1, F_1, P_1) for the Brownian motion, (Ω_2, F_2, P_2) for the (U_n, V_n) sequence, and taking

Suppose U_n < 0 < V_n. Define

Therefore U_1, V_1 function as random boundaries. Note that T_1 is a random variable because

Further, (X(t + T_1) − X(T_1), t ≥ 0) is independent of (X(s), s ≤ T_1, U_1, V_1). By the same argument which was used to establish the strong Markov property,

is a Brownian motion and is independent of X(T_1). Now define T_2 as the first exit time of X^(1)(t) from (U_2, V_2). Then X(T_1 + T_2) − X(T_1) has the same distribution as X(T_1). Repeating this procedure we manufacture variables

which are independent and identically distributed. The trick is to select U_1, V_1 so that X(T_1) has the same distribution as Y_1. For any random boundaries U_1, V_1, if E|X(T_1)|² < ∞, 13.5 gives

Hence


Therefore if X(T_1) has the same distribution as Y_1, then Y_1 must satisfy

and automatically,

So EY_1 = 0 is certainly a necessary condition for the existence of random boundaries such that X(T_1) has the same distribution as Y_1.

To show that it is also sufficient, start with the observation that if Y_1 takes on only two values, say u < 0 and v > 0, with probabilities p and q, then from EY_1 = 0 we have pu + qv = 0. For this distribution we can take fixed boundaries U_1 = u, V_1 = v. Because, by 13.5, EX(T_1) = 0, uP(X(T_1) = u) + vP(X(T_1) = v) = 0, which implies that P(X(T_1) = u) = p. This idea can be extended to prove

Proposition 13.7. For any random variable Y such that EY = 0, EY² < ∞, there are random boundaries U < 0 < V such that X(T) has the same distribution as Y.

Proof. Assume first that the distribution of Y is concentrated on points u_i < 0, v_i > 0 with probabilities p_i, q_i such that (u_i, v_i) are pairs satisfying u_i p_i + v_i q_i = 0. Then take

By the observation above,


Therefore X(T) has the same distribution as Y. Suppose now that the distribution of Y is concentrated on a finite set {y_j} of points. Then it is easy to see that the pairs u_i < 0, v_i > 0 can be gotten such that Y assumes only values in {u_i}, {v_i} and the pairs (u_i, v_i) satisfy the conditions above. (Note that u_i may equal u_j, i ≠ j, and similarly the v_i are not necessarily distinct.)

For Y having any distribution such that EY = 0, EY² < ∞, take Y_n →_D Y where Y_n takes values in a finite set of points, EY_n = 0. Define random boundaries U_n, V_n having stopping time T_n such that X(T_n) has the same distribution as Y_n. Suppose that the random vectors (U_n, V_n) have mass-preserving distributions. Then

take (U_{n'}, V_{n'}) →_D (U, V). For these random boundaries and associated stopping time T, for I ⊂ (0, ∞), use (13.3) to get

Similarly,


Hence if P(Y ∈ bd(I)) = P(V ∈ bd(I)) = 0, then

The analogous proof holds for I ⊂ (−∞, 0). To complete the proof we need therefore to show that the Y_n can be selected so that the (U_n, V_n) have a mass-preserving set of distributions. Take F(dy) to denote the distribution of Y. We can always select a nonempty finite interval (a, b) such that, including part or all of the mass of F(dy) at the endpoints of (a, b) in the integral, we get

Thus we can always take the distributions F_n of the Y_n such that

In this case, the (u_i, v_i) pairs have the property that either both are in [a, b] or both are outside of [a, b]. Since EY² < ∞, we can also certainly take the Y_n such that EY_n² ≤ M < ∞ for all n. Write

But the function |uv| goes to infinity as either |u| → ∞ or |v| → ∞ everywhere in the region {u < a, v > b}. This does it.
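The construction just proved can be tried numerically in its simplest case. The sketch below is illustrative only (the grid step dt is an assumption of the simulation, not part of the theorem): it embeds the mean-zero two-point law Y = −1 with probability 2/3, Y = +2 with probability 1/3, by running Brownian motion to the fixed boundaries u = −1, v = 2.

```python
import math, random

def embed_two_point(u, v, trials=2000, dt=2e-3, rng=None):
    # Run Brownian motion until it first exits (u, v); the exit value
    # X(T) takes the two values u, v with P(X(T) = u) = v/(v - u),
    # exactly the mean-zero two-point law of the proof above.
    rng = rng or random.Random(3)
    sd = math.sqrt(dt)
    exits = []
    for _ in range(trials):
        x = 0.0
        while u < x < v:
            x += rng.gauss(0.0, sd)
        exits.append(u if x <= u else v)
    return exits

# Target: Y = -1 with prob 2/3, Y = +2 with prob 1/3 (so EY = 0).
vals = embed_two_point(u=-1.0, v=2.0)
p_hat = sum(1 for x in vals if x == -1.0) / len(vals)
mean = sum(vals) / len(vals)
assert abs(p_hat - 2/3) < 0.05   # matches p = v/(v - u) = 2/3
assert abs(mean) < 0.15          # EX(T) = EY = 0
```

General distributions are handled by randomizing the pair (U, V) over the matched boundary pairs, as in the proof; the two-point case is the building block.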

Problem 2. Let the distribution function of Y be F(dy). Prove that the random boundaries (U, V) with distribution

where α⁻¹ = EY⁺ and ψ(u, v) is zero or one as u and v have the same or opposite signs, give rise to an exit time T such that X(T) has the same distribution as Y. (Here U and V can be both positive and negative.)

4. CONVERGENCE OF SAMPLE PATHS OF SUMS TO BROWNIAN MOTION PATHS

Now it is possible to show that in a very strong sense Brownian motion is the limit of random walks with smaller and smaller steps, or of normed sums of independent, identically distributed random variables. The random


walk example is particularly illuminating. Let the walk X^(n)(t) take steps of size ±1/√n every 1/n time units. Using the method of the previous section, let T_1 be the first time that X(t) changes by an amount 1/√n, then T_2 the time until a second change of amount 1/√n occurs, etc.

The process

has the same distribution as the process X^(n)(t). By definition,

T_1 + · · · + T_[nt] = time until [nt] changes of magnitude 1/√n have occurred along X(t).

Therefore, up to time T_1 + · · · + T_[nt], the sum of the squares of the changes in X(t) is approximately t. But by 12.28, this takes a length of time t. So, we would expect that T_1 + · · · + T_[nt] → t, hence that each sample path of the interpolated motion X^(n)(t) would converge as a function of t to the corresponding path of Brownian motion. The convergence that does take place is uniform convergence. This holds, in general, along subsequences.
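This picture can be checked on a simulated path. The sketch below is illustrative only (the grid step is an assumption): it records the successive times at which a discretized Brownian path has moved 1/√n away from its last recorded level, and checks that about [nt] such changes have occurred by time t.

```python
import math, random

def crossing_count(n, t_end=1.0, dt=None, rng=None):
    # Walk along one Brownian path and record each time it moves
    # 1/sqrt(n) away from the last recorded level; these gaps play the
    # role of T_1, T_2, ... and should number about [n * t] by time t.
    rng = rng or random.Random(4)
    dt = dt or 1.0 / (400 * n)        # grid much finer than the spacing
    sd, h = math.sqrt(dt), 1.0 / math.sqrt(n)
    x, level, t, count = 0.0, 0.0, 0.0, 0
    while t < t_end:
        x += rng.gauss(0.0, sd)
        t += dt
        if abs(x - level) >= h:
            level = x                 # a change of size ~1/sqrt(n)
            count += 1
    return count

n = 100
c = crossing_count(n)
# T_1 + ... + T_[nt] ~ t translates into count / n ~ t_end = 1.
assert abs(c / n - 1.0) < 0.35
```

The slight undercount comes from the discrete grid overshooting each level, which is exactly the sort of error that vanishes as the grid refines.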

Theorem 13.8. Let Y_1, Y_2, . . . be independent, identically distributed random variables, EY_1 = 0, EY_1² = σ² < ∞, S_n = Y_1 + · · · + Y_n. Define the processes X^(n)(t) by S_[nt]/σ√n. Then there are processes {X̃^(n)(t)}, for each n having the same distribution as {X^(n)(t)}, defined on a common probability space, and a Brownian motion process {X(t)} on the same space, such that for any subsequence {n_k} increasing rapidly enough,

Proof. Assume that EY_1² = 1. Let (Ω, F, P) be constructed as in the representation theorem. For each n, consider the Brownian motion X_n(t) = √n X(t/n). Construct T_1^(n), T_2^(n), . . . using the motion X_n(t). Then the {S_j} sequence has the same distribution as the {X_n(T_1^(n) + · · · + T_j^(n))} sequence. Thus the X^(n)(t) process has the same distribution as

The sequence T_1^(n), T_2^(n), . . . for each n consists of independent, identically distributed random variables such that ET_1^(n) = 1, and T_j^(n) has the same distribution for all n. The weak law of large numbers gives

so for n large, the random time appearing in (13.9) should be nearly t.


Argue that if it can be shown that


for n running through some subsequence, then the continuity of X(t) guarantees that along the same subsequence

What can be easily proved is that W_n →_P 0. But this is enough, because then for any subsequence increasing rapidly enough, W_{n_k} → 0 a.s. Use

so

Ignore the second term, and write

For any ε, 0 < ε < 1, take

The distribution of M^(n) is the same as that of M^(1). Now write

This bounding term has the same distribution as


The law of large numbers implies

Taking ε ↓ 0 now gives the result.

5. AN INVARIANCE PRINCIPLE

The question raised in the introduction to this chapter is generally this:

The sequence of processes X^(n)(·) converges in distribution to Brownian motion

X(·), denoted X^(n)(·) →_D X(·), in the sense that for any 0 < t_1 < · · · < t_k ≤ 1,

Let H(x(·)) be defined on R^[0,1]. When is it true that

There is no obvious handle. What is clear, however, is that we can proceed as follows: Let us suppose X^(n)(·) and X(·) are defined on the same space and take values in some subset D ⊂ R^[0,1]. On D define the sup-norm metric

and assume that H(x(·)) defined on D is continuous with respect to ρ. Then if the sample paths of the X^(n)(·) processes converge uniformly to the corresponding paths of X(·), that is, if

then

But this is enough to give us what we want. Starting with the X^(n)(t) = S_[nt]/σ√n, we can construct X̃^(n)(·) having the same distribution as X^(n)(·) so that the X̃^(n)(t) → X(t) uniformly for t ∈ [0, 1] for n running through subsequences. Thus, H(X̃^(n)(·)) → H(X(·)) a.s., n ∈ {n_k}. This implies

But this holding true for every subsequence {n_k} increasing rapidly enough implies that the full sequence converges in distribution to H(X(·)). Now to fasten down this idea.


Definition 13.10. D is the class of all functions x(t), 0 ≤ t ≤ 1, such that x(t−), x(t+) exist for all t ∈ (0, 1), and x(t+) = x(t). Also, x(0+) = x(0), x(1−) = x(1). Define ρ(x(·), y(·)) on D by

Definition 13.11. For H(x(·)) defined on D, let G be the set of all functions x(·) ∈ D such that H is discontinuous at x(·) in the metric ρ. If there is a set G_1 ∈ B^[0,1] such that G ⊂ G_1 and for a normalized Brownian motion {X(t)}, P(X(·) ∈ G_1) = 0, call H a.s. B-continuous.

The weakening of the continuity condition on H to a.s. B-continuity is important. For example, the H that leads to the arc sine law is discontinuous at the set of all x(·) ∈ D such that

(We leave this to the reader to prove as Problem 4.) But this set has probability zero in Brownian motion.

With these definitions, we can state the following special case of the "invariance principle."

Theorem 13.12. Let H defined on D be a.s. B-continuous. Consider any process of the type

where the S_n are sums of independent, identically distributed random variables Y_1, Y_2, . . . with EY_1 = 0, EY_1² = σ². Assume that the H(X^(n)(·)) are random variables. Then

where {X(t)} is normalized Brownian motion,

Proof. Use 8.8; it is enough to show that any subsequence {n_k} contains a subsequence {n'_k} such that (13.13) holds along {n'_k}. Construct X̃_n(t) as in the proof of 13.8. Take {n'_k} any subsequence of {n_k} increasing rapidly enough. Then X̃^(n'_k)(t) converges uniformly to X(t) for almost every ω, implying that (13.13) holds along the {n'_k} sequence.

There is a loose end in that H(X(·)) was not assumed to be a random variable. However, since

the latter is a.s. equal to a random variable. Hence it is a random variable


with respect to the completed probability space (Ω, F̄, P), and its distribution is well defined.

The reason that theorems of this type are referred to as invariance principles is that they establish convergence to a limiting distribution which does not depend on the distribution function of the independent summands Y_1, Y_2, . . . except for the one parameter σ². This gives the freedom to choose the most convenient way to evaluate the limit distribution. Usually, this is done either directly for Brownian motion or by combinatorial arguments for coin-tossing variables Y_1, Y_2, . . . In particular, see Feller's book [59, Vol. I] for a combinatorial proof that in fair coin-tossing, the proportion of times W_n that the player is ahead in the first n tosses has the limit distribution
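That limit is the arc sine law, P(W ≤ x) = (2/π) arcsin √x. A quick Monte Carlo sketch (illustrative only; ties S_k = 0 are counted as "not ahead," which introduces a small finite-n bias) compares the empirical distribution of W_n with this CDF:

```python
import math, random

def time_ahead_fraction(n, rng):
    # Fraction of the first n steps on which the walk is strictly positive.
    s, ahead = 0, 0
    for _ in range(n):
        s += rng.choice([-1, 1])
        ahead += (s > 0)
    return ahead / n

def arcsine_cdf(x):
    # The limit law quoted above: P(W <= x) = (2/pi) * arcsin(sqrt(x)).
    return (2.0 / math.pi) * math.asin(math.sqrt(x))

rng = random.Random(5)
samples = sorted(time_ahead_fraction(500, rng) for _ in range(800))

# Compare the empirical CDF with the arc sine CDF at a few points.
for x in (0.1, 0.25, 0.5, 0.75, 0.9):
    emp = sum(1 for w in samples if w <= x) / len(samples)
    assert abs(emp - arcsine_cdf(x)) < 0.12
```

The density (2/π)·1/√(x(1 − x)) blows up at 0 and 1, matching the simulation's pile-up of games spent almost entirely on one side.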

Problems

3. Show that the function on D defined by

is continuous everywhere.

4. Show that the function on D defined by

is continuous at x(·) if and only if l{t; x(t) = 0} = 0.

6. THE KOLMOGOROV-SMIRNOV STATISTICS

An important application of invariance is to an estimation problem. Let Y_1, Y_2, . . . be independent, identically distributed random variables with a continuous but unknown distribution function F(x). The most obvious way to estimate F(x) given n observations Y_1, . . . , Y_n is to put

The law of large numbers guarantees

for fixed x. From the central limit theorem,


However, we will be more interested in uniform estimates:

and the problem is to show that D_n^+, D_n^-, D_n converge in distribution, and to find the limiting distribution.

Proposition 13.14. Each of D_n^+, D_n^-, D_n has the same distribution for all continuous F(x).

Proof. Call I = [a, b] an interval of constancy for F(x) if P(Y_1 ∈ I) = 0 and there is no larger interval containing I having this property. Let B be the union of all the intervals of constancy. Clearly, we can write

and similar equations for D_n^+ and D_n^-. For x ∈ B^c, the sets

are identical. Put U_k = F(Y_k), and set

Ĝ_n(y) = (number of U_k ≤ y, k = 1, . . . , n)/n. Then

Since F(x) maps R^(1) onto (0, 1), plus possibly the points {0} or {1},

the latter holding because P(U_1 = 0) = P(U_1 = 1) = 0. The distribution of U_1 is given by

Put x = inf {ξ; F(ξ) = y}. Then

Thus U_1 is uniformly distributed on [0, 1], and D_n for arbitrary continuous


F has the same distribution as D_n for the uniform distribution. Similarly for D_n^- and D_n^+.

Let U_1, . . . , U_n be independent random variables uniformly distributed on [0, 1]. The order statistics are defined as follows: U_1^(n) is the smallest, and so forth; U_n^(n) is the largest. The maximum of |Ĝ_n(y) − y| or of Ĝ_n(y) − y or y − Ĝ_n(y) must occur at one of the jumps of Ĝ_n(y). The jumps are at the points U_k^(n), and

Since the size of the jumps is 1/n, then to within 1/√n,

The fact that makes our invariance theorem applicable is that the U_1^(n), . . . , U_n^(n) behave something like sums of independent random variables. Let W_1, W_2, . . . be independent random variables with the negative exponential distribution. That is,

Denote Z_n = W_1 + · · · + W_n; then

Proposition 13.15. U_k^(n), k = 1, . . . , n, have the same joint distribution as Z_k/Z_{n+1}, k = 1, . . . , n.

Proof. To show this, write (using a little symbolic freedom)

Thus


From this,

Therefore,

On the other hand, for 0 ≤ y_1 ≤ · · · ≤ y_n ≤ 1,

where the sum is over all permutations (j_1, . . . , j_n) of (1, . . . , n). Using independence, this yields n! dy_1 · · · dy_n.
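The representation is easy to verify numerically. The following sketch (illustrative only, not from the text) builds order statistics both ways, via exponential sums and by sorting uniforms directly, and checks that the means agree with EU_k^(n) = k/(n + 1):

```python
import random

def order_stats_via_exponentials(n, rng):
    # Proposition 13.15: with W_i i.i.d. standard exponential and
    # Z_k = W_1 + ... + W_k, the vector (Z_1/Z_{n+1}, ..., Z_n/Z_{n+1})
    # is distributed like the order statistics of n uniforms.
    total, z = 0.0, []
    for _ in range(n + 1):
        total += rng.expovariate(1.0)
        z.append(total)
    return [zk / z[-1] for zk in z[:-1]]

rng = random.Random(6)
n, reps = 10, 4000
mean_exp = [0.0] * n
mean_dir = [0.0] * n
for _ in range(reps):
    e = order_stats_via_exponentials(n, rng)
    d = sorted(rng.random() for _ in range(n))
    for k in range(n):
        mean_exp[k] += e[k] / reps
        mean_dir[k] += d[k] / reps

# Both constructions should match E U_(k) = k / (n + 1).
for k in range(n):
    assert abs(mean_exp[k] - (k + 1) / (n + 1)) < 0.02
    assert abs(mean_dir[k] - (k + 1) / (n + 1)) < 0.02
```

The exponential form is the useful one here, because Z_k − k is a sum of i.i.d. mean-zero variables, which is what lets the invariance principle take over.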

Use this proposition to transform the previous expression for D_n into

with analogous expressions for D_n^+, D_n^-. Then

Because EW_1 = 1, σ²(W_1) = 1, it follows that n/Z_{n+1} →_P 1, and that Z_k − k is a sum of independent, identically distributed random variables with first moment zero and second moment one. Put S_k = Z_k − k,

and ignore the n/Z_{n+1} term and terms of order 1/√n. Then D_n is given by

Obviously,

sup_t |X^(n)(t) −

is a continuous function in the sup-norm metric, so now applying the invariance principle, we have proved


Theorem 13.16

7. MORE ON FIRST-EXIT DISTRIBUTIONS

There is a wealth of material in the literature on evaluating the distributions of functions on Brownian motion. One method uses some transformations that carry Brownian motion into Brownian motion. A partial list of such transformations is

Proposition 13.17. If X(t) is normalized Brownian motion, then so is

1) −X(t), t ≥ 0 (symmetry),

2) X(t + τ) − X(τ), t ≥ 0, τ ≥ 0 fixed (origin change),

3) tX(1/t), t > 0 (inversion),

4) (1/√a)X(at), t ≥ 0, a > 0 (scale change),
5) X(T) − X(T − t), 0 ≤ t ≤ T, T fixed (reversal).

To get (4) and (5) just check that the processes are Gaussian with zero means and the right covariance.

We apply these transformations and the strong Markov property to get the distributions of some first exit times and probabilities. These are related to a number of important functions on Brownian motion. For example, for x > 0, if t*_x is the first hitting time of the point {x}, then

To get the distribution of t*_x, let

Take x, y > 0, and note that t*_{x+y} = t*_x + T*, where T* is the first passage time of the process X̃(t) = X(t*_x + t) − X(t*_x) to the point y. By the strong Markov property, T* has the same distribution as t*_y and is independent of t*_x. Thus,


Since Ee^{−λt*_x} is decreasing in x, and therefore well-behaved, 13.19 implies

Now we can get more information by a scale change. Transformation 13.17(4) implies that a scale change in space by an amount √a changes time by a factor a. To be exact,

Therefore t*_x has the same distribution as x²t*_1, yielding

Now this uniquely determines the distribution of t*_x, so if we can get c, then we are finished. Unfortunately, there seems to be no very simple way to get c. Problem 5 outlines one method of showing that c = √2. Accept this for now, because arguments of this sort can get us the distribution of

Denote

so that on A_a, t*_a = t*, and on A_b, t*_a = t* + T*, where T* is the additional time needed to get to x = a once the process has hit x = b. So define

Put these together:

Now check that A_b ∈ F(X(τ), τ ≤ t*). Since the variable T* is independent of F(X(t), t ≤ t*),

The same argument for t*_b gives


Now solve, to get

The sum of (13.23(1) and (2)) is Ee^{−λt*}, the Laplace transform of the distribution of t*. By inverting this, we can get

Very similar methods can be used to compute the probability that the Brownian motion ever hits the line x = at + b, a > 0, b > 0, or equivalently, exits from the open region with the variable boundary x = at + b. Let p(a, b) be the probability that X(t) ever touches the line at + b, a > 0, b > 0. Then

the argument being that to get to at + b_1 + b_2, first the particle must get to at + b_1, but once it does, it then has to get to a line whose equation relative to its present position is at + b_2. To define this more rigorously, let T* be the time of first touching at + b_1; then t* = min(T*, s) is a stopping time. The probability that the process ever touches the line at + b_1 + b_2

and T* ≤ s equals the probability that the process X(t + t*) − X(t*) ever touches the line at + b_2 and t* ≤ s. By the strong Markov property, the latter probability is the product p(a, b_2)P(T* ≤ s). Let s → ∞ to get the result. Therefore, p(a, b) = e^{−γ(a)b}. Take t*_b to be the hitting time of the point b. Then

Conditioning on t*_b yields

(see 4.38). Use (13.21) to conclude that

which leads to 2aγ(a) = γ²(a), or γ(a) = 2a. Thus

The probability,


of exiting from the two-sided region |x| < at + b is more difficult to compute. One way is to first compute Ee^{−λt*}, where t* is the first time of hitting at + b, and then imitate the development leading to the two-sided boundary distribution in (13.23). Another method is given in Doob [36].
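The one-sided formula p(a, b) = e^{−2ab} is simple enough to check by direct simulation. The sketch below is illustrative only; the finite horizon and grid step are assumptions of the simulation, and both bias the estimate slightly downward (a crossing can be missed between grid points or after the horizon).

```python
import math, random

def touches_line(a, b, horizon=30.0, dt=0.01, rng=None):
    # One discretized Brownian path on [0, horizon]: does it ever
    # touch the moving boundary x = a*t + b?
    sd = math.sqrt(dt)
    x, t = 0.0, 0.0
    while t < horizon:
        if x >= a * t + b:
            return True
        x += rng.gauss(0.0, sd)
        t += dt
    return False

a, b = 0.5, 0.5
rng = random.Random(7)
trials = 2000
hits = sum(touches_line(a, b, rng=rng) for _ in range(trials))
# p(a, b) = exp(-2ab) = exp(-0.5) ~ 0.607
assert abs(hits / trials - math.exp(-2 * a * b)) < 0.08
```

Because the boundary recedes at speed a while the path grows only like √t, truncating at a large horizon loses very little probability, which is why the finite-horizon estimate lands close to e^{−2ab}.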

The expression for p(a, b) can be used to get the distribution of

and therefore of lim_n P(D_n^+ ≤ x). Let Y(t) = X(t) − tX(1). Then Y(t) is a Gaussian process with covariance

Consider the process

Its covariance is min(s, t), so X^(1)(t) is normalized Brownian motion. Therefore

The limiting distribution for D_n is similarly related to the probability of exiting from the two-sided region {|x| < γ(1 + t)}.
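Both limit laws can be seen numerically without simulating Brownian motion at all, since for uniforms the statistic √n D_n^+ is computable from the order statistics. The sketch below (illustrative only, not from the text) compares its empirical distribution with the limit 1 − e^{−2x²} obtained from p(a, b):

```python
import math, random

def ks_plus(n, rng):
    # sqrt(n) * D_n^+ for a uniform sample: the sup of G_n(y) - y is
    # attained at an order statistic, so it equals max_k (k/n - U_(k)).
    u = sorted(rng.random() for _ in range(n))
    return math.sqrt(n) * max((k + 1) / n - u[k] for k in range(n))

rng = random.Random(8)
samples = [ks_plus(500, rng) for _ in range(3000)]

# Limit law from the p(a, b) computation: P(lim <= x) = 1 - exp(-2x^2).
for x in (0.5, 1.0, 1.5):
    emp = sum(1 for s in samples if s <= x) / len(samples)
    assert abs(emp - (1 - math.exp(-2 * x * x))) < 0.05
```

The two-sided statistic D_n converges instead to the Kolmogorov distribution, an alternating series in e^{−2k²x²}, which matches the more difficult two-sided exit problem mentioned above.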

Problems

5. Assuming

find Ee^{−λt*}, where t* = t*(−1, +1). Differentiating this with respect to λ at λ = 0, find an expression for Et*, and compare this with the known value of Et* to show that c = √2.

6. Use Wald's identity (see Problem 1) to get (13.23(1) and (2)) by using the equations for λ and −λ.

7. Using E exp[−λt*_x] = exp[−√(2λ)|x|], prove that for x > 0,


8. For S_1, S_2, . . . sums of independent, identically distributed random variables with zero means and finite second moments, find normalizing constants so that the following random variables converge in distribution to a nondegenerate limit, and evaluate the distribution of the limit, or the Laplace transform of the limit distribution.

8. THE LAW OF THE ITERATED LOGARITHM

Let S_n, n = 1, 2, . . . , be successive sums of independent, identically distributed random variables Y_1, Y_2, . . .

One version of the law of the iterated logarithm is

Theorem 13.25

Strassen [134] noted recently that even though this is a strong limit theorem, it follows from an invariance principle, and therefore is a distant consequence of the central limit theorem. The result follows fairly easily from the representation theorem, 13.8. What we need is

Theorem 13.26. There is a probability space with a Brownian motion X(t) defined on it and a sequence S̃_n, n = 1, . . . , having the same distribution as S_n/σ, n = 1, . . . , such that

Proof. By 13.8, there is a sequence of independent, identically distributed, nonnegative random variables T_1, T_2, . . . , ET_1 = 1, such that X(T_1 + · · · + T_n), n = 1, 2, . . . , has the same distribution as S_n/σ, n = 1, . . . Therefore (13.27) reduces to proving that

where φ(t) = √(2t log log t). By the law of large numbers,


For any ε > 0, there is an almost surely finite function t0(ω) such that for t > t0(ω),

Let

Thus, for

In consequence, if we define

By 12.20,

Write

Use Borel-Cantelli again, getting

or

Going back,

Taking ε ↓ 0 gives (13.27), which completes the proof.


9. A MORE GENERAL INVARIANCE THEOREM

The direction in which generalization is needed is clear. Let {Y(n)(t)}, t ∈ [0, 1], n = 0, 1, . . . , be a sequence of processes such that Y(n)(·) → Y(0)(·) in the sense that all finite-dimensional distributions converge to the appropriate limit. Suppose that all sample functions of {Y(n)(t)} are in D. Suppose also that some metric ρ(x(·), y(·)) is defined on D, and that in this metric H is a function on D a.s. continuous with respect to the distribution of Y(0)(·). Find conditions to ensure that

This has been done for some useful metrics and we follow Skorokhod's strategy. The basic idea is similar to that in our previous work: Find processes Ỹ(n)(·), n = 0, 1, . . . , defined on a common probability space such that for each n, Ỹ(n)(·) has the same distribution as Y(n)(·), and has all its sample functions in D. Suppose the Ỹ(n)(·) have the additional property that

Then conclude, as in Section 5, that if H(Y(n)(·)) and H(Ỹ(n)(·)) are random variables,

The basic tool is a construction that yields the very general

Theorem 13.28. Let {Y(n)(t)}, t ∈ [0, 1], n = 0, 1, . . . , be any sequence of processes such that Y(n)(·) → Y(0)(·) in distribution. Then for any countable set T ⊂ [0, 1], there are processes {Ỹ(n)(t)}, t ∈ T, defined on a common space such that

a) For each n, {Ỹ(n)(t)}, t ∈ T, and {Y(n)(t)}, t ∈ T, have the same distribution.
b) For every t ∈ T,

Proof. The proof of this is based on some simple ideas but is filled with technical details. We give a very brief sketch and refer to Skorokhod [124] for a complete proof.

First, show that a single sequence of random variables Xn converging to X0 in distribution can be replaced by random variables X̃n converging to X̃0 almost surely on a common space, with X̃n having the same distribution as Xn. It is a bit surprising to go from convergence in distribution to almost sure convergence. But if, for example, the Xn are fair coin-tossing random variables such that Xn → X0 in distribution, then replace all Xn by the


random variables X̃n on ((0, 1), B((0, 1)), dx) defined by

Not only does X̃n(x) → X̃0(x), but X̃n(x) = X̃0(x). In general, take (Ω, F, P) = ((0, 1), B((0, 1)), dx). The device is simple: If Fn(z), the distribution function of Xn, is continuous with a unique inverse, then take

Consequently,

Since Xn → X0 in distribution, Fn(z) → F0(z) for every z; thus Fn⁻¹(x) → F0⁻¹(x), all x, or X̃n(x) → X̃0(x). Because Fn may not have a unique inverse, define

Now verify that these variables do the job.

Generalize now to a sequence of processes Xn = (Xn(1), Xn(2), . . .) such that Xn → X0 in distribution. Suppose we have a nice 1-1 mapping θ: R(∞) ↔ B, B ∈ B1, such that θ, θ⁻¹ are measurable B(∞), B1(B) respectively, and such that the following holds:

Take Ỹn, n ≥ 0, random variables on a common space such that Ỹn has the same distribution as θ(Xn) and Ỹn → Ỹ0 a.s. Define X̃n = θ⁻¹(Ỹn). It is easy to see that X̃n and Xn have the same distribution. If θ⁻¹ is smooth enough so that Ỹn → Ỹ0 a.s. implies that every coordinate of θ⁻¹(Ỹn) converges a.s. to the corresponding coordinate of θ⁻¹(Ỹ0), then this does it. To get such a θ, let Cn,k be the set {x: P(Xn(k) = x) > 0}. Let C = ∪n,k Cn,k; C is countable. Take φ(x): R(1) ↔ (0, 1) to be 1-1 and continuous such that φ(C) contains no binary rationals. There is a 1-1 measurable mapping f: (0, 1)(∞) ↔ (0, 1) constructed in Appendix A.47. The mapping θ: R(∞) ↔ (0, 1) defined by

has all the necessary properties.
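The quantile device of the proof is easy to see numerically. The sketch below (illustrative only: it takes Fn to be the normal distribution with mean 0 and standard deviation 1 + 1/n, a choice not from the text) realizes X̃n = Fn⁻¹ on the common space ((0, 1), B((0, 1)), dx) and checks that the coupled variables converge at each point x, not merely in distribution.

```python
from statistics import NormalDist

# Sketch of the quantile device on the common space ((0,1), B((0,1)), dx).
# Illustrative choice (not from the text): F_n = normal with mean 0 and
# standard deviation 1 + 1/n, so F_n converges to F_0 = N(0, 1).
def x_tilde(n, x):
    """X~_n(x) = F_n^{-1}(x) for 0 < x < 1."""
    sigma = 1.0 + (1.0 / n if n > 0 else 0.0)
    return NormalDist(mu=0.0, sigma=sigma).inv_cdf(x)

# At each fixed point x of the common space, X~_n(x) -> X~_0(x):
for x in (0.1, 0.5, 0.9):
    gaps = [abs(x_tilde(n, x) - x_tilde(0, x)) for n in (1, 10, 100, 1000)]
    assert all(a >= b for a, b in zip(gaps, gaps[1:]))  # gaps shrink
    assert gaps[-1] < 0.01                              # and vanish
```

Each x_tilde(n, ·) has exactly the distribution Fn under Lebesgue measure on (0, 1), yet the convergence is now pointwise in x, which is the promised upgrade from convergence in distribution to almost sure convergence.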


Take one more step. The process {X̃(n)(t)}, t ∈ T, having the same distribution as {X(n)(t)}, t ∈ T, has the property that a.s. every sample function is the restriction to T of a function in D. Take T dense in [0, 1], and for any t ∈ [0, 1], t ∉ T, define

[Assume 1 ∈ T.] The processes {X̃(n)(t)}, t ∈ [0, 1], defined this way have all their sample functions in D, except perhaps for a set of probability zero. Furthermore, they have the same distribution as X(n)(·) for each n.

Throwing out a set of probability zero, we get the statement: For each ω fixed, the sample paths xn(t) = X̃(n)(t, ω) are in D with the property that xn(t) → x0(t) for all t ∈ T. The extra condition needed is something to guarantee that this convergence on T implies that ρ(xn(·), x0(·)) → 0 in the metric we are using. To illustrate this, use the metric

introduced in the previous sections. Other metrics will be found in Skorokhod's article referred to above. Define

If xn(t) → x0(t) for t ∈ T, and x0(·) ∈ D, then lim_{h↓0} d(h) = 0 implies ρ(xn, x0) → 0. Hence the following:

Theorem 13.29. Under the above assumptions, if

then

Proof. Y(0)(t) has continuous sample paths a.s. Take Tm = {t1, . . . , tm} ⊂ T. Then, letting

we find that Mn → M0 in distribution. For ε a continuity point of the distribution of M0, P(Mn > ε) → P(M0 > ε). Therefore,

Letting Tm ↑ T and using (13.30) implies the continuity.


Define random variables

If Tm = {t1, . . . , tm} ⊂ T is such that

then

The first term goes to zero a.s., leaving

Take h ↓ 0. Since Un does not depend on h, the continuity of Y(0)(·) and (13.30) yield

Remark. Note that under (13.30), since Y(0)(·) has continuous sample paths a.s., Y(0)(t) has a version with all sample paths continuous.

The general theorems are similar; Y(n)(·) → Y(0)(·) in distribution plus some equicontinuity condition on the Y(n)(·) gives

Problem 9. For random variables having a uniform distribution on [0, 1], and F̂n(x) the sample distribution function defined in Section 6, use the multidimensional central limit theorem to show that

where Y(t) is a Gaussian process with covariance

Prove that (13.30) is satisfied by using 13.15 and the Skorokhod lemma 3.21.

NOTES

The invariance principle as applied to sums of independent, identically distributed random variables first appeared in the work of Erdős and Kac [47, 1946] and [48, 1947]. The more general result of 13.12 is due to Donsker [30, 1951]. The method of imitating the sums by using a Brownian motion


evaluated at random times was developed by Skorokhod [126, 1961]. The possibility of using these methods on the Kolmogorov-Smirnov statistics was suggested by Doob [37] in a paper where he also evaluates the distribution of the limiting functional on the Brownian motion. Donsker later [31, 1952] proved that Doob's suggested approach could be made rigorous. For some interesting material on the distribution of various functionals on Brownian motion, see Cameron and Martin [13], Kac [82], and Dinges [24].

Strassen's recent work [134, 1964] on the law of the iterated logarithm contains some fascinating generalizations of this law concerning the limiting fluctuations of Brownian motion. A relatively simple proof of the law of the iterated logarithm for coin-tossing is given by Feller [59, Vol. I]. A generalized version proved by Erdős [46, 1942] for coin-tossing, and extended by Feller [53, 1943], is: Let φ(n) be a positive, monotonically increasing function, Sn = Y1 + · · · + Yn, with Y1, . . . independent and identically distributed random variables with mean zero and finite second moment. Then

equals zero or one, depending on whether

converges or diverges.

The general question concerning convergence of a sequence of processes X(n)(·) → X(·) in distribution and related invariance results was dealt with in 1956 by Prokhorov [118] and by Skorokhod [124]. We followed the latter in Section 9.

The arc sine law has had an honorable history. Its importance in probability has been not so much in the theorem itself as in the variety and power of the methods developed to prove it. For Brownian motion, it was derived by Paul Lévy [104, 1939]. Then Erdős and Kac [48, 1947] used an invariance argument to get it for sums of independent random variables. Then Sparre Andersen in 1954 [128] discovered a combinatorial proof that revealed the surprising fact that the law held for random variables whose second moments were not necessarily finite. Spitzer extended the combinatorial methods into entirely new areas [129, 1956]. For the latter, see particularly Spitzer's book [130], also the development by Feller [59, Vol. II]. Another interesting proof was given by Kac [82] for Brownian motion as a special case of a method that reduces the finding of distributions of functionals to related differential equations. There are at least three more proofs we know of that come from other areas of probability.


CHAPTER 14

MARTINGALES AND PROCESSES WITH STATIONARY, INDEPENDENT INCREMENTS

1. INTRODUCTION

In Chapter 12, Brownian motion was defined as follows:

1) X(t + τ) − X(t) is independent of everything up to time t,
2) The distribution of X(t + τ) − X(t) depends only on τ,

The third assumption involved continuity and had the eventual consequence that a version of Brownian motion was available with all sample paths continuous.

If the third assumption is dropped, then we get a class of processes satisfying (1) and (2) which have the same relation to Brownian motion as the infinitely divisible laws do to the normal law. In fact, examining these processes gives much more meaning to the representation for characteristic functions of infinitely divisible laws.

These processes cannot have versions with continuous sample paths; otherwise the argument given in Chapter 12 forces them to be Brownian motion. Therefore, the extension problem that plagued us there, and that we solved by taking a continuous version, comes back again. We deal with this problem in the same way: we take the smoothest possible version available. Of the results available relating to smoothness of sample paths, one of the most general is for continuous parameter martingale processes. So first we develop the martingale theorems. With this theory in hand, we then prove that there are versions of any of the processes satisfying (1) and (2) above such that all sample paths are continuous except for jumps. Then we investigate the size and number of jumps in terms of the distribution of the process, and give some applications.

2. THE EXTENSION TO SMOOTH VERSIONS

Virtually all the well-known stochastic processes {X(t)}, t ∈ I, can be shown to have versions such that all sample paths have only jump discontinuities. That is, the sample paths are functions x(t) which have finite right- and left-hand limits x(t−) and x(t+) at all t ∈ I for which these limits


can be defined. This last phrase refers to endpoints. Make the convention that if t is in the interior of I, both x(t−) and x(t+) limits can be defined. At a closed right endpoint, only the x(t−) limit can be defined. At a closed left endpoint only the x(t+) limit can be defined. At open (including infinite) endpoints, neither limit can be defined. We specialize a bit more and define:

Definition 14.1. D(I) is the class of all functions x(t), t ∈ I, which have only jump discontinuities and which are right-continuous; that is, x(t+) = x(t) for all t ∈ I such that x(t+) is defined.

Along with this goes

Definition 14.2. A process {X(t)}, t ∈ I, will be called continuous in probability from the right if whenever τn ↓ t,

We want to find conditions on the process {X(t)}, t ∈ I, so that a version exists with all sample paths in D(I). As with Brownian motion, start by considering the variables of the process on a set T countable and dense in I, with the convention that T includes any closed endpoints of I. In the case of continuous sample paths the essential property was that for I finite, any function defined and uniformly continuous on T had an extension to a continuous function on I. The analog we need here is

Definition 14.3. A function x(t) defined on T is said to have only jump discontinuities in I if the limits

exist and are finite for all t ∈ I where these limits can be defined.

Proposition 14.4. If x(t) defined on T has only jump discontinuities on I, then the function x̃(t) defined on I by

and x̃(b) = x(b) for b a closed right endpoint of I, is in D(I).

Proof. Let tn ↓ t, tn, t ∈ I, and take sn ∈ T, sn ≥ tn > t, such that sn ↓ t and x(sn) − x̃(tn) → 0. Since x(sn) → x̃(t), this implies x̃(t+) = x̃(t). Now take tn ↑ t, and sn ∈ T with tn < sn < t and x̃(tn) − x(sn) → 0. This shows that x̃(tn) → lim_{s↑t, s∈T} x(s).

We use this to get conditions for the desired version.

Theorem 14.5. Let the process {X(t)}, t ∈ I, be continuous in probability from the right. Suppose that almost every sample function of the countable process {X(t)}, t ∈ T, has only jump discontinuities on I. Then there is a version of {X(t)}, t ∈ I, with all sample paths in D(I).


Proof. If for fixed ω, {X(t, ω)}, t ∈ T, does not have only jump discontinuities on I, put X̃(t, ω) = 0, all t ∈ I. Otherwise, define

and X̃(b, ω) = X(b, ω) for b a closed right endpoint of I. By 14.4, the process {X̃(t)}, t ∈ I, so defined has all its sample paths in D(I). For any t ∈ I such that sn ↓ t, sn ∈ T, X̃(t) = limn X(sn) a.s. By the continuity in probability from the right, X(sn) → X(t) in probability. Hence

completing the proof.

Problem 1. For x(t) ∈ D(I), J any finite closed subinterval of I, show that

1) sup_{t∈J} |x(t)| < ∞,

2) for any δ > 0, the set

is finite.
3) The set of discontinuity points of x(t) is at most countable.

3. CONTINUOUS PARAMETER MARTINGALES

Definition 14.6. A process {X(t)}, t ∈ I, is called a martingale (MG) if

Call the process a submartingale (SMG) if under the same conditions

This definition is clearly the immediate generalization of the discrete parameter case. The basic sample path property is:

Theorem 14.7. Let {X(t)}, t ∈ I, be a SMG. Then for T dense and countable in I, almost every sample function of {X(t)}, t ∈ T, has only jump discontinuities on I.

Proof. It is sufficient to prove this for I a finite, closed interval [σ, τ]. Define


Of course, the limits at σ− and τ+ are not defined. First we show that for almost every sample function the limits in (14.8) are finite for all t ∈ I. In fact,

To show this, take TN finite subsets of T, TN ↑ T and σ, τ ∈ TN. By adding together both (1) and (2) of 5.13, deduce that

Letting N → ∞ proves that (14.10) holds with T in place of TN. Now take x → ∞ to prove (14.9).

Now assume all limits in (14.8) are finite. If a sample path of {X(t)}, t ∈ T, does not have only jump discontinuities on I, then there is a point y ∈ I such that either X−(y−) < X+(y−) or X−(y+) < X+(y+). For any two numbers a < b, let D(a, b) be the set of all ω such that there exists a y ∈ I with either

The union ∪ D(a, b) over all rational a, b; a < b, is then the set of all sample paths not having only jump discontinuities.

Take TN finite subsets of T as above. Let βN be the number of up-crossings of the interval [a, b] by the SMG sequence {X(tj)}, tj ∈ TN (see Section 4, Chapter 5). Then βN ↑ β, where β is a random variable, possibly extended. The significant fact is

Apply Lemma 5.17 to get

to conclude that Eβ < ∞, hence P(D(a, b)) = 0. Q.E.D.

The various theorems concerning transformation of martingales by optional sampling and stopping generalize, if appropriate restrictions are imposed. See Doob [39, Chap. 7] for proofs under weak restrictions. We assume here that all the processes we work with have a version with all sample paths in D(I).

Proposition 14.11. Let t* be a stopping time for a process {X(t)}, t ∈ I, having all sample paths right-continuous. Then X(t*) is a random variable.

Proof. Approximate t* by
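One standard dyadic choice is:

```latex
t_m^* = \frac{k+1}{2^m}
\quad\text{on }\Bigl\{\frac{k}{2^m} \le t^* < \frac{k+1}{2^m}\Bigr\},
\qquad k = 0, 1, 2, \ldots
```

Each t_m^* takes only countably many values, t_m^* ≥ t^*, and t_m^* ↓ t^* as m → ∞, which is what the right-continuity argument uses.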


Then X(tm*) is a random variable. As m → ∞, tm* ↓ t*, so by right-continuity, X(tm*) → X(t*) for every ω.

We prove, as an example, the generalization of 5.31.

Theorem 14.12. Let t* be a stopping time for the SMG (MG) {X(t)}, t ∈ [0, ∞). If all the paths of the process are in D([0, ∞)) and if

then

Proof. Suppose first that t* is uniformly bounded, t* ≤ τ. Take tn* ↓ t*, tn* ≤ τ, but tn* taking on only a finite number of values. By 5.31,

and right-continuity implies X(tn*) → X(t*) everywhere. Some sort of boundedness condition is needed now to conclude EX(tn*) → EX(t*). Uniform integrability is sufficient; that is,

goes to zero as x → ∞. If {X(t)} is a MG process, then {|X(t)|} is a SMG. Hence by the optional sampling theorem, 5.10,

Let

By the right-continuity and (14.9), M < ∞ a.s. Then the uniform integrability follows from E|X(τ)| < ∞ and

But if {X(t)} is not a MG, then use this argument: If the SMG {X(t)}, t ≥ 0, were bounded below, say, X(t) ≥ a, all t ≤ τ and ω, then for x > M,


This gets us to (14.13) again. Proceed as above to conclude EX(t*) ≥ EX(0). In general, for a negative, take Y(t) = max(a, X(t)). Then {Y(t)}, t ≥ 0, is a SMG bounded below, so EY(t*) ≥ EY(0). Take a → −∞ and note that

goes to zero. Similarly for EX(0) − EY(0), proving the theorem for bounded stopping times. If t* is not bounded, define the stopping time t** as min(t*, τ). Then

The first term in this integral goes to zero as τ → ∞ because E|X(t*)| < ∞ by hypothesis. For the second term, simply take a sequence τn → ∞ such that

For this sequence EX(t**) → EX(t*), completing the proof.

Problem 2. For Brownian motion {X(t)}, t ≥ 0, and t* = t*(a, b), prove using 14.12 that EX(t*) = 0, EX²(t*) = Et*.
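A discrete analogue of Problem 2 can be checked by simulation: for a simple symmetric random walk stopped at the first exit t* from (−5, 5), Wald's identities give ESt* = 0 and ESt*² = Et*, mirroring the Brownian statements (the walk and the level 5 are illustrative choices, not from the text):

```python
import random

# Seeded sketch (discrete analogue, illustrative): simple symmetric random
# walk S_n stopped at t* = first exit from (-5, 5).  Wald's identities give
# E S_{t*} = 0 and E S_{t*}^2 = E t*, mirroring EX(t*) = 0, EX^2(t*) = Et*.
random.seed(2)
A, REPS = 5, 5_000

def run_once():
    """Return (S_{t*}, t*) for one walk started at 0."""
    s, n = 0, 0
    while -A < s < A:
        s += random.choice((-1, 1))
        n += 1
    return s, n

samples = [run_once() for _ in range(REPS)]
mean_s = sum(s for s, _ in samples) / REPS
mean_s2 = sum(s * s for s, _ in samples) / REPS
mean_n = sum(n for _, n in samples) / REPS

assert abs(mean_s) < 0.5             # E S_{t*} = 0
assert mean_s2 == A * A              # exit value is exactly +/-5
assert abs(mean_s2 - mean_n) < 1.5   # E S_{t*}^2 = E t* (= 25 here)
```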

4. PROCESSES WITH STATIONARY, INDEPENDENT INCREMENTS

Definition 14.14. A process {X(t)}, t ∈ [0, ∞), has independent increments if for any t and τ > 0, F(X(t + τ) − X(t)) is independent of F(X(s), s ≤ t).

The stationarity condition is that the distribution of the increment does not depend on the time origin.

Definition 14.15. A process {X(t)}, t ∈ [0, ∞), is said to have stationary increments if the distribution of X(t + τ) − X(t), τ > 0, does not depend on t.

In this section we will deal with processes having independent, stationary increments, and we further normalize by taking X(0) = 0. Note that

where τ = t/n. Or

where the Xk(n), k = 1, . . . , n, are independent and identically distributed. Ergo, X(t) must have an infinitely divisible distribution. Putting this formally, we have the following proposition.


Proposition 14.16. Let {X(t)}, t ∈ [0, ∞), be a process with independent, stationary increments; then X(t) has an infinitely divisible distribution for every t ≥ 0.

It follows from X(t + s) = (X(t + s) − X(s)) + (X(s) − X(0)) that ft(u), the characteristic function of X(t), satisfies the identity

If ft(u) had any reasonable smoothness properties, such as ft(u) measurable in t for each u, then (14.17) would imply that ft(u) = [f1(u)]^t, t ≥ 0.

Unfortunately, a pathology can occur: Let φ(t) be a real solution of the equation φ(t + s) = φ(t) + φ(s), t, s ≥ 0. Nonlinear solutions of this do exist [33]. They are nonmeasurable and unbounded in every interval. Consider the degenerate process X(t) = φ(t). This process has stationary, independent increments.

Starting with any process {X(t)} such that ft(u) = [f1(u)]^t, the process X'(t) = X(t) + φ(t) also has stationary, independent increments, but f't(u) ≠ [f'1(u)]^t. This is the extent of the pathology, because it follows from Doob [39, pp. 407 ff.] that if {X(t)} is a process with stationary, independent increments, then there is a function φ(t), φ(t + s) = φ(t) + φ(s), such that the process {X(1)(t)}, X(1)(t) = X(t) − φ(t), has stationary, independent increments, X(1)(0) = 0, and ft(1)(u) is continuous in t for every u. Actually, this is not difficult to show directly in this case (see Problem 3). A sufficient condition that eliminates this unpleasant case is given by

Proposition 14.18. Let {X(t)} be a process with stationary, independent increments such that ft(u) is continuous at t = 0 for every u; then {X(t)} is continuous in probability and ft(u) = [f1(u)]^t.

Proof. Fix u; then taking s ↓ 0 in the equation ft+s(u) = ft(u)fs(u) proves ft(u) continuous for all t ≥ 0. The only continuous solutions of this functional equation are the exponentials. Therefore ft(u) = e^{tψ(u)}. Evaluate this at t = 1 to get f1(u) = e^{ψ(u)}. Use lim_{s↓0} fs(u) = 1 to conclude X(s) → 0 in distribution as s → 0, implying X(s) → 0 in probability. Since

the proposition follows.
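Schematically, the proof rests on the classical exponential solution of the Cauchy functional equation (a sketch in the notation above):

```latex
f_{t+s}(u) = f_t(u)\,f_s(u), \quad t \mapsto f_t(u)\ \text{continuous}
\;\Longrightarrow\; f_t(u) = e^{t\psi(u)}, \qquad \psi(u) = \log f_1(u).
```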

The converse holds.

Proposition 14.19. Given the characteristic function f(u) of an infinitely divisible distribution, there is a unique process {X(t)} with stationary, independent increments, X(0) = 0, such that ft(u) = [f(u)]^t.

Remark. In the above statement, uniqueness means that every process satisfying the given conditions has the same distribution.


Proof. All that is necessary to prove existence is to produce finite-dimensional consistent distribution functions. To specify the distribution of X(t1), X(t2), . . . , X(tn), t1 < t2 < · · · < tn, define variables Y1, Y2, . . . , Yn as being independent, Yk having characteristic function [f(u)]^{tk − tk−1}, t0 = 0. Define

The consistency is obvious. By the extension theorem for processes, 12.14, there is a process {X(t)}, t ≥ 0, having the specified distributions. For these distributions, X(t + τ) − X(t) is independent of the vector (X(t1), . . . , X(tn)), t1, . . . , tn ≤ t. Thus the process {X(t)} has independent increments. Further, the characteristic function of X(t + τ) − X(t) is [f(u)]^τ, all t ≥ 0, implying stationarity of the increments. Of course, the characteristic function of X(t) is [f(u)]^t. By construction, X(0) = 0 a.s. since f0(u) = 1, so we take a version with X(0) = 0. If there is any other such process {X̃(t)} having characteristic function [f(u)]^t, clearly its distribution is the same.

Since there is a one-to-one correspondence between processes with stationary, independent increments, continuous in probability, and characteristic functions of infinitely divisible distributions, we add the terminology: If

call ψ(u) the exponent function of the process.

Problems

3. Since |ft(u)| ≤ 1, all u, show that (14.17) implies that |ft(u)| = |f1(u)|^t. By (9.20), log ft(u) = iβ(t)u + ∫ ν(x, u) γt(dx). Use |ft(u)| = |f1(u)|^t to show that γt(R(1)) is uniformly bounded in every finite t-interval. Deduce that γt+s = γt + γs; hence show that γt = tγ1. Now β(t + s) = β(t) + β(s) by the unique representation. Conclude that X(1)(t) = X(t) − β(t) has a continuous characteristic function ft(1)(u). (It is known that every measurable solution of φ(t + s) = φ(t) + φ(s) is linear. Therefore if β(t) is measurable, then it is continuous and ft(u) is therefore continuous.)
4. For a process with independent increments show that

F(X(t + τ) − X(t)) is independent of F(X(s), s ≤ t) for each t, τ ⟹ F(X(t + τ) − X(t), τ > 0) is independent of F(X(s), s ≤ t).
5. For a process with stationary, independent increments, show that for B ∈ B1,


5. PATH PROPERTIES

We can apply the martingale results to get this theorem:

Theorem 14.20. Let {X(t)}, t ≥ 0, be a process with stationary, independent increments, continuous in probability. Then there is a version of {X(t)} with all sample paths in D([0, ∞)).

Remark. There are a number of ways to prove this theorem. It can be done directly, using Skorokhod's lemma (3.21) in much the same way as was done for path continuity in Brownian motion. But since we have martingale machinery available, we use a device suggested by Doob [39].

Proof. Take, as usual, X(0) = 0. If E|X(t)| < ∞, all t ≥ 0, we are finished. Because, subtracting off the means EX(t) if necessary, assume EX(t) = 0. Then {X(t)}, t ≥ 0, is a martingale since for 0 ≤ s ≤ t,

E(X(t) | X(r), r ≤ s) = E(X(t) − X(s) | X(r), r ≤ s) + X(s) = X(s) a.s.

Simply apply the martingale path theorem (14.7), the continuity in probability, and 14.5.

If E|X(t)| is not finite for all t, one interesting proof is the following: Take φ(x) any continuous bounded function. The process on [0, τ] defined by

is a martingale. But (see Problem 5),

where

The plan is to deduce the path properties of X(t) from the martingale path theorem by choosing suitable φ(x). Take φ(x) strictly increasing, φ(+∞) = a, φ(−∞) = −a, E|φ(X(τ))| = 1. By the continuity in probability of X(t), Φ(t, x) is continuous in t. It is continuous and strictly increasing in x, hence jointly continuous in x and t. Thus,

has an inverse

jointly continuous in

For T dense and countable in [0, τ], the martingale path theorem implies that


{Y(t)}, t ∈ T, has only jump discontinuities on [0, τ]. Hence, for all paths such that sup_{t∈T} |Y(t)| < a, {X(t)}, t ∈ T, has only jump discontinuities on [0, τ]. Now, to complete the proof, we need to show that sup_{t∈T} |X(t)| < ∞ a.s., because for every sample function,

Since |Y(t)| is a SMG, apply 5.13 to conclude

Since

we get

Since a can be made arbitrarily large, conclude that

In the rest of this chapter we take all processes with stationary, independent increments to be continuous in probability with sample paths in D([0, ∞)).

Problems

6. Prove, using Skorokhod's lemma 3.21 directly, that

7. For {X(t)}, t ≥ 0, a process with stationary, independent increments with sample paths in D([0, ∞)), show that the strong Markov property holds; that is, if t* is any stopping time, then

has the same distribution as {X(t)}, t ≥ 0, and is independent of

8. For {X(t)}, t ≥ 0, a process continuous in probability with sample paths in D([0, ∞)), show that

for all t


6. THE POISSON PROCESS

This process stands at the other end of the spectrum from Brownian motion and can be considered the simplest and most basic of the processes with stationary, independent increments. We get at it this way: A sample point ω consists of any countable collection of points of [0, ∞) such that if N(I, ω) is the number of points of ω falling into the interval I, then N(I, ω) < ∞ for all finite intervals I. Define F on Ω such that all N(I, ω) are random variables, and impose on the probability P these conditions.

Conditions 14.21

a) The numbers of points in nonoverlapping intervals are independent. That is, I1, . . . , Ik disjoint ⟹ N(I1), . . . , N(Ik) independent.

b) The distribution of the number of points in any interval I depends only on the length ‖I‖ of I.

A creature with this type of sample space is called a point process. Conditions (a) and (b) arise naturally under wide circumstances: For example, consider a Geiger counter held in front of a fairly sizeable mass of radioactive material, and let the points of ω be the successive registration times. Or consider a telephone exchange where we plot the times of incoming telephone calls over a period short enough so that the disparity between morning, afternoon, and nighttime business can be ignored. Define

Then by 14.21(a) and (b), X(t) is a process with stationary, independent increments. Now the question is: Which one? Actually, the prior part of this question is: Is there one? The answer is:

Theorem 14.23. A process X(t) with stationary, independent increments has a version with all sample paths constant except for upward jumps of length one if and only if there is a parameter λ > 0 such that

Remark. By expanding, we find that X(t) has the Poisson distribution P(X(t) = k) = e^{−λt}(λt)^k/k!, k = 0, 1, . . . , and EX(t) = λt.

Proof. Let X(t) be a process with the given characteristic function, with paths in D([0, ∞)). Then


that is, X(t) is concentrated on the nonnegative integers I+, so that P(X(t) ∈ I+, all t ∈ T) = 1 for any countable set T. Taking T dense in [0, ∞) implies that the paths of the process are integer-valued, with probability one. Also, {X(t)}, t ∈ T, has nondecreasing sample paths, because the distribution of X(t + τ) − X(t) is that of X(τ), and X(τ) ≥ 0 a.s. Therefore there is a version of X(t) such that all sample paths take values in I+ and jump upward only. I want to show that for this version,

By Problem 8, the probability that X(t) − X(t−) > 0 for t rational is zero. Hence, a.s.,

But

To go the other way, let X(t) be any process with stationary, independent increments, integer-valued, such that X(t) − X(t−) = 0 or 1. Take t1* to be the time until the first jump. It is a stopping time. Take t2* to be the time until the first jump of X(t + t1*) − X(t1*). We know that t2* is independent of t1* with the same distribution. The time until the nth jump is t1* + · · · + tn*.

Now {t1* > τ} = {t1* ≤ τ}ᶜ, hence {t1* > τ} is in F(X(t), t ≤ τ). Therefore

This is the exponential equation; the solution is

for some λ > 0. To finish, write

Denote Qn(t) = P(t1* + · · · + tn* > t), and note that


So we have

This recurrence relation and Q1(t) = e^{−λt} gives

leading to
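Numerically, the construction in this proof runs in reverse: laying down i.i.d. exponential waiting times t1*, t2*, . . . produces counts N(t) with the Poisson mean λt. A minimal seeded simulation (the parameters λ = 2, t = 3 are illustrative, not from the text):

```python
import random

# Seeded sketch: generate the jump times as partial sums of i.i.d.
# exponential waiting times (parameter LAM) and count jumps in [0, t].
random.seed(0)
LAM, T, REPS = 2.0, 3.0, 20_000

def count_at(t, lam):
    """N(t): number of jumps in [0, t] with Exp(lam) interarrival times."""
    n, clock = 0, random.expovariate(lam)
    while clock <= t:
        n += 1
        clock += random.expovariate(lam)
    return n

mean = sum(count_at(T, LAM) for _ in range(REPS)) / REPS
assert abs(mean - LAM * T) < 0.1   # EN(t) = lambda * t
```

The empirical mean sits near λt = 6, consistent with N(t) having the Poisson distribution e^{−λt}(λt)^k/k!; a fuller check would compare the entire empirical distribution with these probabilities.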

7. JUMP PROCESSES

We can use the Poisson processes as building blocks. Let the jump points of a Poisson process Y(t) with parameter λ > 0 be t1, t2, . . . Construct a new process X(t) by assigning jump X1 at time t1, X2 at time t2, . . . , where X1, X2, . . . are independent, identically distributed random variables with distribution function F(x), and F(X1, X2, . . .) is independent of F(Y(t), t ≥ 0). Then

Proposition 14.24. X(t) is a process with stationary, independent increments, and

Proof. That X(t) has stationary, independent increments follows from the construction. Let h(u) = ∫ e^{iux} dF. Note that

E(exp

implying
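Proposition 14.24 can be checked numerically. The sketch below (with an illustrative jump law F, not from the text: jumps ±1 with probability 1/2 each, so h(u) = cos u) compares the empirical characteristic function of X(t) with exp(λt(h(u) − 1)):

```python
import math
import random

# Seeded sketch of 14.24 with an illustrative jump law F (not from the
# text): jumps +1 or -1 with probability 1/2, so h(u) = cos(u) and the
# characteristic function of X(t) should equal exp(lam*t*(cos(u) - 1)).
random.seed(1)
LAM, T, U, REPS = 1.5, 2.0, 0.7, 40_000

def compound_poisson(t, lam):
    """One sample of X(t): a Poisson(lam*t) number of i.i.d. +/-1 jumps."""
    n, clock = 0, random.expovariate(lam)
    while clock <= t:
        n += 1
        clock += random.expovariate(lam)
    return sum(random.choice((-1, 1)) for _ in range(n))

# By symmetry the characteristic function is real, so E cos(u X(t))
# estimates it directly.
empirical = sum(math.cos(U * compound_poisson(T, LAM)) for _ in range(REPS)) / REPS
predicted = math.exp(LAM * T * (math.cos(U) - 1.0))
assert abs(empirical - predicted) < 0.03
```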

In Chapter 9, Section 4, it was pointed out that for a generalized Poisson distribution with jumps of size x1, x2, . . . , each jump size contributes an independent component to the distribution. This is much more graphically illustrated here. Say that a process has a jump of size B ∈ B1 at time t if X(t) − X(t−) ∈ B. A process with stationary, independent increments and exponent function

with μ(R(1)) < ∞ is of the type treated above, with λ = μ(R(1)), F(B) = μ(B)/μ(R(1)). Therefore it has sample paths constant except for jumps. Let


R(1) − {0} = ∪_{k=1}^n Bk, for disjoint Bk ∈ B1, k = 1, . . . , n. Define processes {X(Bk, t)}, t ≥ 0, by

Thus X(Bk, t) is the sum of the jumps of X(t) of size Bk up to time t. We need to show measurability of X(Bk, t), but this follows easily if we construct the process X(t) by using a Poisson process Y(t) and independent jump variables X1, X2, . . . , as above. Then, clearly,

Now we prove:

Proposition 14.25. The processes {X(Bk, t)}, t ≥ 0, are independent of each other, and {X(Bk, t)}, t ≥ 0, is a process with stationary, independent increments and exponent function

Proof. It is possible to prove this directly, but the details are messy. So we resort to an indirect method of proof. Construct a sample space and processes {X(k)(t)}, t ≥ 0, k = 1, . . . , n, which are independent of each other and which have stationary, independent increments with exponent functions

Each of these has the same type of distribution as the processes of 14.24, with λ = μ(Bk), Fk(B) = μ(B ∩ Bk)/μ(Bk). Hence Fk(dx) is concentrated on Bk and {X(k)(t)}, t ≥ 0, has jumps only of size Bk. Let

Then {X̃(t)}, t ≥ 0, has stationary, independent increments and the same characteristic function for every t as X(t). Therefore {X̃(t)} and {X(t)} have the same distribution. But by construction,

therefore {X(k)(t)}, t ≥ 0, is defined on the {X̃(t)} process by the same function as {X(Bk, t)} on the {X(t)} process. Hence the processes {X(Bk, t)}, {X(t)} have the same joint distribution as {X(k)(t)}, {X̃(t)}, proving the proposition.


8. LIMITS OF JUMP PROCESSES

The results of the last section give some insight into the description of the paths of the general process. First, let {X(t)}, t ≥ 0, be a process with stationary, independent increments whose exponent function ψ has only an integral component:

If μ(R^(1)) < ∞, then the process is of the form {X̂(t) − βt}, where {X̂(t)}, t ≥ 0 is of the type studied in the previous section, with sample paths constant except for isolated jumps. In the general case, μ assigns infinite mass to arbitrarily small neighborhoods of the origin. This leads to the suspicion that the paths for these processes have an infinite number of jumps of very small size. To better understand this, let D be any neighborhood of the origin and {X(D^c, t)}, t ≥ 0, be the process of jumps of size greater than D, where we again define: the process {X(B, t)}, t ≥ 0, of jumps of {X(t)} of size B ∈ 𝔅₁, {0} ∉ B̄, is given by

X(B, t) = Σ_{s≤t} [X(s) − X(s−)] 1_B(X(s) − X(s−)).

Assuming that the results analogous to the last section carry over, {X(D^c, t)} has exponent function

∫_{D^c} (e^{iux} − 1) μ(dx).

Letting

β_D = ∫_{D^c} (x/(1 + x²)) μ(dx),

we have that the exponent function of X(D^c, t) − β_D t is given by

ψ_D(u) = ∫_{D^c} (e^{iux} − 1 − iux/(1 + x²)) μ(dx).

Therefore, as D shrinks down to {0}, ψ_D(u) → ψ(u), and in some sense we expect that X(t) is the limit of X(D^c, t) − β_D t. In fact, we can get a very strong convergence.

Theorem 14.27. Let {X(t)}, t ≥ 0, be a process with stationary, independent increments having only an integral component in its characteristic function. Take {D_n} neighborhoods of the origin such that D_n ↓ {0}; then {X(D_n^c, t)}, t ≥ 0 is a process with stationary, independent increments and exponent function

∫_{D_n^c} (e^{iux} − 1) μ(dx).


Put

β_n = ∫_{D_n^c} (x/(1 + x²)) μ(dx).

For any t,

sup_{s≤t} |X(D_n^c, s) − β_n s − X(s)| → 0 in probability.

Proof. Take D₁ = R^(1), B_n = D_n − D_{n+1}. Construct a probability space on which there are processes {Z_n(t)}, t ≥ 0, with stationary, independent increments such that the processes are independent of each other, with exponent functions

∫_{B_n} (e^{iux} − 1) μ(dx).

Then 14.25 implies that the paths of Z_n(t) are constant except for jumps of size in B_n. Denote

X_n(t) = Z₁(t) + ⋯ + Z_n(t),  b_k = ∫_{B_k} (x/(1 + x²)) μ(dx).

Consequently, X_n(t) − β_n t is the sum of the independent components Z_k(t) − b_k t. Since the characteristic functions converge, for every t,

Because we are dealing with sums of independent random variables, 8.36 implies that for every t there is a random variable X̃(t) such that

X_n(t) − β_n t → X̃(t) a.s.

This implies that {X̃(t)}, t ≥ 0, is a process with stationary, independent increments having the same distribution as {X(t)}, t ≥ 0. We may assume that {X̃(t)}, t ≥ 0 has all its sample paths in D([0, ∞)). Take, for example, t₀ = 1, T any set dense in [0, 1]. For Y_n(t) = X̃(t) − X_n(t) + β_n t,

Let T_N be finite subsets of T, T_N ↑ T. Then sup_{t∈T} |Y_n(t)| = lim_N sup_{t∈T_N} |Y_n(t)|.

Because {Y_n(t)} is a process with stationary, independent increments, we can apply Skorokhod's lemma:

where

and


Let ψ_n(u) be the exponent function of Y_n(t). Apply inequality 8.29 to write:

For |e^z| ≤ 1, there is a constant γ < ∞ such that |e^z − 1| ≤ γ|z|. Hence,

Since ψ_n(u) → 0, then C_n → 0, leading to

Thus we can find a subsequence {n′} such that a.s.

Some reflection on this convergence allows the conclusion that the process X_n(t) can be identified with X(D_n^c, t). Therefore {X̃(t)} and {X_n(t)} have the same joint distribution as {X(t)}, {X(D_n^c, t)}. Using this fact proves the theorem.

If a function x(t) ∈ D([0, ∞)) is, at all times t, the sum of all its jumps up to time t, then it is called a pure jump function. However, this is not well defined if the sum of the lengths of the jumps is infinite. Then x(t) is the sum of positive and negative jumps which to some extent cancel each other out to produce x(t), and the order of summation of the jumps up to time t becomes important. We define:

Definition 14.28. If there is a sequence of neighborhoods D_n ↓ {0} such that

x(t) = lim_n Σ_{s≤t} [x(s) − x(s−)] 1_{D_n^c}(x(s) − x(s−)) for all t,

then x(t) ∈ D([0, ∞)) is called a pure jump function.

Many interesting processes of the type studied in this section have the property that there is a sequence of neighborhoods D_n ↓ {0} such that

β = lim_n β_n = lim_n ∫_{D_n^c} (x/(1 + x²)) μ(dx)

exists and is finite. Under this assumption we have

Corollary 14.29. Almost every sample path of {X(t) − βt}, t ≥ 0, is a pure jump function.
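A numerical illustration (not from the text): take the assumed jump measure μ(dx) = x^{−1}e^{−x} dx on (0, ∞). Its total mass is infinite, so small jumps occur with unbounded intensity, yet ∫ x μ(dx) = 1, so the jump sums converge without centering and the paths are pure jump functions in the sense of 14.28. The sketch computes the jump intensity and the mean jump sum per unit time when jumps smaller than ε are discarded:

```python
import numpy as np

def trapezoid(y, x):
    """Simple trapezoid rule; avoids depending on a particular numpy version."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

# assumed Levy measure mu(dx) = x^{-1} e^{-x} dx on (0, inf):
# infinite total mass, but finite expected jump sum  ∫ x mu(dx) = ∫ e^{-x} dx = 1
cutoffs = [1e-1, 1e-2, 1e-3, 1e-4]          # discard jumps of size < eps, i.e. D_n = (-eps, eps)
rates, jump_mass = [], []
for eps in cutoffs:
    x = np.geomspace(eps, 50.0, 200_000)
    rates.append(trapezoid(np.exp(-x) / x, x))   # mu([eps, inf)): intensity of retained jumps
    jump_mass.append(trapezoid(np.exp(-x), x))   # ∫_eps^inf x mu(dx): mean jump sum per unit time
```

As ε ↓ 0 the intensities grow without bound (roughly like −log ε) while the mean jump sum stays bounded by 1, which is the picture behind Corollary 14.29.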

Another interesting consequence of this construction is the following: Take neighborhoods D_n ↓ {0} and take {X(D_n^c, t)}, as above, the process of


jumps of size greater than D_n. Restrict the processes to a finite time interval [0, t₀] and consider any event A such that for every n,

A ∈ ℱ(X(t) − X(D_n^c, t), t ∈ [0, t₀]).

Then

Proposition 14.30. P(A) is zero or one.

Proof. Let B_n = D_n − D_{n+1}; then the processes {X(B_n, t)}, t ∈ [0, t₀] are independent and A is measurable:

A ∈ ℱ(X(B_m, t), m ≥ n, t ∈ [0, t₀])

for every n. Apply a slight generalization of the Kolmogorov zero-one law to conclude P(A) = 0 or 1.

The results of this section make it easier to understand why infinitely divisible laws were developed for use in the context of processes with independent increments earlier than in the central limit problem. The processes of jumps of different sizes proceed independently of one another, and the process of jumps of size [x, x + Δx) contributes a Poisson component with exponent function approximately equal to

(e^{iux} − 1) μ([x, x + Δx)).

The fact that the measure μ governs the number and the size of jumps is further exposed in the following problems, all referring to a process {X(t)}, t ≥ 0 with stationary, independent increments and exponent function having only an integral component.

Problems

9. Show that for any set B ∈ 𝔅₁ bounded away from the origin, the process of jumps of size B, {X(B, t)}, t ≥ 0, has stationary, independent increments with exponent function

∫_B (e^{iux} − 1) μ(dx),

and the processes {X(t) − X(B, t)}, {X(B, t)} are independent.

10. For B as above, show that the expected number of jumps of size B in 0 ≤ t ≤ 1 is μ(B).

11. For B as above, let t* be the time of the first jump of size B. For C ⊂ B, C ∈ 𝔅₁, prove that

12. Show that except for a set of probability zero, either all sample functions of {X(t)}, t ∈ [0, t₀] have infinite variation or all have finite variation.


13. Show that for {X(t)}, t ≥ 0 as above, all sample paths have finite variation on every finite time interval [0, t₀] if and only if

∫_{|x|≤1} |x| μ(dx) < ∞.

[Take t0 = 1. The function

is monotonically nondecreasing. Now compute its Laplace transform for λ > 0.]

9. EXAMPLES

Consider the first passage time t* of a Brownian motion X(t) to the point ξ. Denote Z(ξ) = t*. For ξ₁, ξ₂ > 0, write

Z(ξ₁ + ξ₂) = Z(ξ₁) + [Z(ξ₁ + ξ₂) − Z(ξ₁)].

By the strong Markov property, Z(ξ₁ + ξ₂) − Z(ξ₁) is independent of ℱ(Z(ξ), ξ ≤ ξ₁) and is distributed as Z(ξ₂). Thus, to completely characterize Z(ξ), all we need is its characteristic function. From Chapter 13,

E e^{iuZ(ξ)} = exp(−ξ |u|^{1/2}(1 − i sgn u)).

This is the characteristic function of a stable distribution with exponent ½. The jump measure μ(dx) is given by

μ(dx) = c x^{−3/2} dx, x > 0.

Doing some definite integrals gives c = 1/√(2π).

If the characteristic function of a process with stationary, independent increments is stable with exponent α, call the process stable with exponent α. For 0 < α < 1, the limit β = lim_n β_n exists. The processes with exponent functions

c ∫₀^∞ (e^{iux} − 1) x^{−1−α} dx


are the limit of processes {X_n(t)} with exponent functions

c ∫_{1/n}^∞ (e^{iux} − 1) x^{−1−α} dx.

These latter processes have only upward jumps. Hence all paths of {X(t)} are nondecreasing pure jump functions.
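The exponent-½ case can be checked by simulation (a sketch, not from the text), using the known representation of the first passage time: Z(ξ) has the same distribution as ξ²/N² for N standard normal, and its Laplace transform is E e^{−λZ(ξ)} = e^{−ξ√(2λ)}. The sketch also checks the convolution property Z(ξ₁ + ξ₂) =d Z(ξ₁) + Z(ξ₂) through the transforms:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400_000

# First passage time of standard Brownian motion to level xi:
# Z(xi) = xi**2 / N**2 in distribution, N standard normal
def sample_Z(xi, size):
    return xi**2 / rng.standard_normal(size) ** 2

# Laplace transform check: E exp(-lam * Z(xi)) = exp(-xi * sqrt(2 * lam))
lam = 1.0
lhs = np.exp(-lam * sample_Z(1.0, n)).mean()
rhs = np.exp(-np.sqrt(2 * lam))

# Stable-1/2 additivity: Z(2) =d Z(1) + Z'(1) with independent summands,
# compared through the empirical Laplace transforms at lam = 1
sum_lt = np.exp(-(sample_Z(1.0, n) + sample_Z(1.0, n))).mean()
z2_lt = np.exp(-sample_Z(2.0, n)).mean()
```

Both empirical transforms agree with e^{−ξ√2}, reflecting the stationary, independent increments of Z(ξ) in the space parameter ξ.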

Stable processes of exponent > 1 having nondecreasing sample paths do not exist. If 𝓛(X(t)) = 𝓛(−X(t)), the process is symmetric. Bochner [11] noted that it was possible to construct the symmetric stable processes from Brownian motion by a random time change. Take a sample space with a normalized Brownian motion X(t) and a stable process Z(t) defined on it such that Z(t) has nondecreasing pure jump sample paths and ℱ(X(t), t ≥ 0), ℱ(Z(t), t ≥ 0) are independent.

Theorem 14.32. If Z(t) has exponent α, 0 < α < 1, then the process Y(t) = X(Z(t)) is a stable symmetric process of exponent 2α.

Proof. The idea can be seen if we write

Y(t + τ) − Y(t) = X(Z(t + τ)) − X(Z(t)).

Then, given Z(t), the process Y(t + τ) − Y(t) looks just as if it were the Y(τ) process, independent of Y(s), s ≤ t.

For a formal proof, take Z_n(t) the process of jumps of Z(t) of size outside [0, 1/n). Its exponent function is

The jumps of Z_n(t) occur at the jump times of a Poisson process with intensity

μ([1/n, ∞)),

and the jumps have magnitudes Y₁, Y₂, …, independent of one another and of the jump times, and identically distributed. Thus X(Z_n(t)) has jumps only at the jump times of the Poisson process. The size of the kth jump is

U_k = X(Y₁ + ⋯ + Y_k) − X(Y₁ + ⋯ + Y_{k−1}).

By an argument almost exactly the same as the proof of the strong Markov property for Brownian motion, U_k is independent of U_{k−1}, …, U₁ and has the same distribution as U₁. Therefore X(Z_n(t)) is a process with stationary, independent increments. Take {n′} so that Z_{n′}(t) → Z(t) a.s. Use continuity of


Brownian motion to get X(Z_{n′}(t)) → X(Z(t)) a.s. for every t. Thus X(Z(t)) is a process with stationary, independent increments. To get its characteristic function, write

Therefore
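Theorem 14.32 can be illustrated for α = ½ (a sketch, not from the text): the subordinator value Z(1) may be sampled as 1/N² (the one-sided stable-½, or Lévy, distribution), and Y(1) = X(Z(1)) is then N(0, Z(1)) given Z(1). The result should be symmetric stable of exponent 2α = 1, that is, standard Cauchy:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300_000

# Z: stable subordinator of exponent 1/2 at time 1, sampled as Z = 1 / N**2
Z = 1.0 / rng.standard_normal(n) ** 2
# Y(1) = X(Z(1)) for an independent Brownian motion X: N(0, Z) given Z
Y = np.sqrt(Z) * rng.standard_normal(n)

# For alpha = 1/2, Y should be symmetric stable of exponent 2*alpha = 1 (Cauchy);
# compare the empirical CDF with the standard Cauchy CDF at a few points
cauchy_cdf = lambda y: 0.5 + np.arctan(y) / np.pi
emp = [(Y <= q).mean() for q in (-1.0, 0.0, 1.0)]
ref = [cauchy_cdf(q) for q in (-1.0, 0.0, 1.0)]
```

Indeed Y = G/|N| for independent standard normals G, N, which is the classical ratio representation of the Cauchy law.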

10. A REMARK ON A GENERAL DECOMPOSITION

Suppose {X(t)}, t ≥ 0 is a process with stationary, independent increments and exponent function

ψ(u) = iuβ − σ²u²/2 + ∫ (e^{iux} − 1 − iux/(1 + x²)) μ(dx).

Since iuβ − σ²u²/2 is the exponent function for a Brownian motion, a natural expectation is that X(t) = X^(1)(t) + X^(2)(t), where {X^(1)(t)}, t ≥ 0, is a Brownian motion with drift β and variance σ², and {X^(2)(t)}, t ≥ 0, is a process with stationary, independent increments with exponent function having only an integral component, and that the two processes are independent. This is true, in fact, and can be proved by the methods of Sections 7 and 8. But as processes with stationary, independent increments appear in practice either as a Brownian motion or as a process with no Brownian component, we omit the proof of this decomposition.

NOTES

For more material on processes with stationary, independent increments, see Doob [39, Chap. 8] and Paul Lévy's two books [103 and 105]. These latter two are particularly good at giving an intuitive feeling for what these processes look like. Of course, for continuous parameter martingales, the best source is Doob's book. The sample path properties of a continuous parameter martingale were given by Doob in 1951 [38], and applied to processes with independent increments.

Processes with independent increments were introduced by de Finetti in 1929. Their sample path properties were studied by Lévy in 1934 [102]. He then proved Theorem 14.20 as generalized to processes with independent increments, not necessarily stationary. Most of the subsequent decomposition and building up from Poisson processes follows Lévy also, in particular [103, p. 93]. The article by Itô [75] makes this superposition idea more precise by defining an integral over Poisson processes.


CHAPTER 15

MARKOV PROCESSES, INTRODUCTION AND PURE JUMP CASE

1. INTRODUCTION AND DEFINITIONS

Markov processes in continuous time are, as far as definitions go, a straightforward extension of the Markov dependence idea.

Definition 15.1. A process {X(t)}, t ≥ 0 is called Markov with state space F ∈ 𝔅₁ if X(t) ∈ F, t ≥ 0, and for any B ∈ 𝔅₁(F), t, τ ≥ 0,

(15.2) P(X(t + τ) ∈ B | ℱ(X(s), s ≤ t)) = P(X(t + τ) ∈ B | X(t)) a.s.

To verify that a process is Markov, all we need is to have, for any t₁ < ⋯ < t_n ≤ t,

P(X(t + τ) ∈ B | X(t₁), …, X(t_n), X(t)) = P(X(t + τ) ∈ B | X(t)) a.s.

Since finite-dimensional sets determine ℱ(X(s), s ≤ t), this extends to (15.2). The Markov property is a statement about the conditional probability at the one instant t + τ in the future. But it extends to a general statement about the future, given the present and past:

Proposition 15.3. If {X(t)} is Markov, then for Λ ∈ ℱ(X(τ), τ ≥ t),

P(Λ | ℱ(X(s), s ≤ t)) = P(Λ | X(t)) a.s.

The proof of this is left to the reader.

Definition 15.4. By Theorem 4.30, for every t₂ > t₁ a version p_{t₂,t₁}(B | x) of P(X(t₂) ∈ B | X(t₁) = x) can be selected such that p_{t₂,t₁}(B | x) is a probability on 𝔅₁(F) for x fixed, and 𝔅₁(F) measurable in x for B fixed. Call these a set of transition probabilities for the process.

Definition 15.5. π(·) on 𝔅₁(F), given by π(B) = P(X(0) ∈ B), is the initial distribution for the process.

The importance of transition probabilities is that the distribution of the process is completely determined by them if the initial distribution is specified.

319


This follows from

Proposition

The proof is the same as for 7.3. A special case of (15.7) is the Chapman-Kolmogorov equation.

Reason this way: to get from x at time τ to B at time t > τ, fix any intermediate time s, t > s > τ. Then the probability of all the paths that go from x to B through the small neighborhood dy of y at time s is given (approximately) by p_{t,s}(B | y) p_{s,τ}(dy | x). So summing over dy gets us to the Chapman-Kolmogorov equations,

(15.8) p_{t,τ}(B | x) = ∫ p_{t,s}(B | y) p_{s,τ}(dy | x).
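For a concrete check (not from the text), consider a two-state jump process with assumed rates a (from state 0 to 1) and b (from 1 to 0). Its stationary transition matrix has a closed form, and the Chapman-Kolmogorov equations reduce to the matrix identity p_{s+t} = p_s p_t:

```python
import numpy as np

# Hypothetical two-state chain, states {0, 1}: rate a for 0 -> 1, rate b for 1 -> 0
a, b = 1.5, 0.7

def p(t):
    """Transition matrix with entries p_t(k | j); closed form for the two-state chain."""
    e = np.exp(-(a + b) * t)
    return np.array([[b + a * e, a - a * e],
                     [b - b * e, a + b * e]]) / (a + b)

# Chapman-Kolmogorov: p_{s+t}(B | x) = sum_y p_t(B | y) p_s(y | x),
# i.e. matrix multiplication in the row convention p[j, k] = p_t(k | j)
s, t = 0.4, 1.1
lhs = p(s + t)
rhs = p(s) @ p(t)
```

The identity holds exactly here, and p(0) is the identity matrix, matching the boundary behavior of standard transition probabilities discussed below.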

Actually, what is true is

Proposition 15.9. The equation (15.8) holds a.s. with respect to P(X(τ) ∈ dx).

Proof. From (15.7), for every C ∈ 𝔅₁(F),

Taking the Radon derivative with respect to P(X(τ) ∈ dx) gives the result.

One rarely starts with a process on a sample space (Ω, ℱ, P). Instead, consistent distribution functions are specified, and then the process constructed. For a Markov process, what gets specified are the transition probabilities and the initial distribution. Here there is a divergence from the discrete time situation, in which the one-step transition probabilities P(X_{n+1} ∈ B | X_n = x) determine all the multiple-step probabilities. There are no corresponding one-step probabilities here; the probabilities {p_{t,τ}(B | x)}, t > τ ≥ 0, must all be specified, and they must satisfy among themselves at least the functional relationship (15.8).

2. REGULAR TRANSITION PROBABILITIES

Definition 15.10. A set of functions {p_{t,τ}(B | x)} defined for all t > τ ≥ 0, B ∈ 𝔅₁(F), x ∈ F is called a regular set of transition probabilities if

1) p_{t,τ}(B | x) is a probability on 𝔅₁(F) for x fixed and 𝔅₁(F) measurable in x for B fixed,


2) for every B ∈ 𝔅₁(F), x ∈ F, t > s > τ ≥ 0,

p_{t,τ}(B | x) = ∫ p_{t,s}(B | y) p_{s,τ}(dy | x).

To be regular, then, a set of transition probabilities must satisfy the Chapman-Kolmogorov equations identically.

Theorem 15.11. Given a regular set of transition probabilities and a distribution π(dx₀), define probabilities on cylinder sets by: for t₁ < ⋯ < t_n,

(15.12) P(X(t₁) ∈ B₁, …, X(t_n) ∈ B_n) = ∫ π(dx₀) ∫_{B₁} p_{t₁,0}(dx₁ | x₀) ⋯ ∫_{B_n} p_{t_n,t_{n−1}}(dx_n | x_{n−1}).

These are consistent, and the resultant process {X(t)} is Markov with the given functions p_{t,τ}, π as transition probabilities and initial distribution.

Proof. The first verification is consistency. Let B_k = F. The expression in P(X(t_n) ∈ B_n, …) that involves x_k, t_k is an integration with respect to the probability defined for B ∈ 𝔅₁(F) by

By the Chapman-Kolmogorov equations, this is exactly

which eliminates t_k, x_k and gives consistency. Thus, a bit surprisingly, the functional relations (15.8) are the key to consistency. Now extend and get a process {X(t)} on (F^{[0,∞)}, 𝔅^{[0,∞)}(F)). To verify the remainder, let A ∈ 𝔅_n(F). By extension from the definition,

Now, take the Radon derivative to get the result.

One result of this theorem is that there are no functional relationships other than those following from the Chapman-Kolmogorov equations that transition probabilities must satisfy in general.

Another convention we make, starting from a regular set of transition probabilities, is this: Let P_{(x,τ)}(·) be the probability on the space of paths F^{[0,∞)} gotten by "starting the process out at the point x at time τ." More specifically, let P_{(x,τ)}(·) be the probability on 𝔅^{[0,∞)}(F) extended from (15.12), where t₀ = τ and π(dx₀) concentrated on the point {x} are used. So P_{(x,τ)}(·) is well-defined for all x, τ in terms of the transition probabilities only.


Convention. For C ∈ ℱ(X(s), s ≤ τ), A ∈ 𝔅^{[0,∞)}(F), always use the version of P(X(τ + ·) ∈ A | X(τ) = x, C) given by P_{(x,τ)}(A).

Accordingly, we use the transition probabilities not only to get the distribution of the process but also to manufacture versions of all important conditional probabilities. The point of requiring the Chapman-Kolmogorov equations to hold identically rather than a.s. is that if there is an x, B, t > s > τ ≥ 0, such that

p_{t,τ}(B | x) ≠ ∫ p_{t,s}(B | y) p_{s,τ}(dy | x),

then these transition probabilities cannot be used to construct "the process starting from x, τ."

Now, because we wish to study the nature of a Markov process as governed by its transition probabilities, no matter what the initial distribution, we enlarge our nomenclature. Throughout this and the next chapter, when we refer to a Markov process {X(t)} this will no longer refer to a single process. Instead, it will denote the totality of processes having the same transition probabilities but with all possible different initial starting points x at time zero. However, we will use only coordinate representation processes, so the measurability of various functions and sets will not depend on the choice of a starting point or initial distribution for the process.

3. STATIONARY TRANSITION PROBABILITIES

Definition 15.13. Let {p_{t,τ}} be a regular set of transition probabilities. They are called stationary if for all t > τ ≥ 0,

p_{t,τ}(B | x) = p_{t−τ,0}(B | x).

In this case, the p_t(B | x) = p_{t,0}(B | x) are referred to as the transition probabilities for the process.

Some simplification results when the transition probabilities are stationary. The Chapman-Kolmogorov equations become

p_{t+s}(B | x) = ∫ p_t(B | y) p_s(dy | x).

For any A ∈ 𝔅^{[0,∞)}(F), P_{(x,τ)}(A) = P_{(x,0)}(A); the probabilities on the path space of the process are the same no matter when the process is started. Denote, for any A ∈ 𝔅^{[0,∞)}(F) and f(·) on F^{[0,∞)} measurable 𝔅^{[0,∞)}(F),

(15.15) P_x(A) = P_{(x,0)}(A),  E_x f = ∫ f dP_x.


Assume also from now on that for any initial distribution, X(t) → X(0) in probability as t → 0. This is equivalent to the statement:

Definition 15.16. Transition probabilities p_t(B | x) are called standard if

p_t(· | x) → δ_{{x}}(·) as t → 0,

for all x ∈ F, where δ_{{x}}(·) denotes the distribution with unit mass on {x}.

There is another property that will be important in the sequel. Suppose we have a stopping time t* for the process X(t). The analog of the restarting property of Brownian motion is that the process X(t + t*), given everything that happened up to time t*, has the same distribution as the process X(t) starting from the point X(t*).

Definition 15.17. A Markov process {X(t)} with stationary transition probabilities is called strong Markov if for every stopping time t* [see (12.40)], every starting point x ∈ F, and set A ∈ 𝔅^{[0,∞)}(F),

(15.18) P_x(X(t* + ·) ∈ A | ℱ(X(s), s ≤ t*)) = P_{X(t*)}(A) a.s.

Henceforth, call a stopping time for a Markov process a Markov time. It is fairly clear from the definitions that (15.18) holds for any fixed time t* ≡ τ, τ ≥ 0.

In fact,

Proposition 15.19. If t* assumes at most a countable number of values {τ_j}, then t* satisfies (15.18).

Proof. Take C ∈ ℱ(X(s), s ≤ t*). Let φ(x) be any bounded measurable function, φ̄(x) = E_x φ(X(t)). We prove the proposition first for A one-dimensional. Take φ the set indicator of A; then

By definition,

which does it. The same thing goes for φ a measurable function of many variables, and then for the general case.

It is not unreasonable to hope that the strong Markov property would hold in general. It does not! But we defer an example until the next chapter.


One class of examples of strong Markov processes with standard stationary transition probabilities is the processes with stationary, independent increments, where the sample paths are taken right-continuous. Let X(t) be such a process. If at time t the particle is at the point x, then the distribution at time t + τ is gotten by adding to x an increment independent of the path up to x and having the same distribution as X(τ). Thus, these processes are Markov.

Problems

1. For X(t), a process with stationary, independent increments, show that it is Markov with one set of transition probabilities given by

p_t(B | x) = P(x + X(t) ∈ B),

and that this set is regular and standard.

2. Show that processes with stationary, independent increments and right-continuous sample paths are strong Markov.

3. Show that the functions P_x(A), E_x f(·) of (15.15) are 𝔅₁(F)-measurable.

4. Show that any Markov process with standard stationary transition probabilities is continuous from the right in probability.

4. INFINITESIMAL CONDITIONS

What do Markov processes look like? Actually, what do their sample paths and transition probabilities look like? This problem is essentially one of connecting up global behavior with local behavior. Note, for example, that if the transition probabilities p_t are known for all t in any neighborhood of the origin, then they are determined for all t > 0 by the Chapman-Kolmogorov equations. Hence, one suspects that p_t would be determined for all t by specifying the limiting behavior of p_t as t → 0. But then the sample behavior will be very immediately connected with the behavior of p_t near t = 0.

To get a feeling for this, look at the processes with stationary, independent increments. If it is specified that

(1/t) P(|X(t)| > ε) → 0 as t → 0, for every ε > 0,

then the process is Brownian motion, all the transition probabilities are determined, and all sample paths are continuous. Conversely, if all sample paths are given continuous, the above limiting condition at t = 0 must hold.

At the other end, suppose one asks for a process X(t) with stationary, independent increments having all sample paths constant except for isolated jumps. Then (see Section 6, Chapter 14) the probability of no jump in the


time interval [0, t] is given by e^{−λt}.

If there is a jump, with magnitude governed by F(dx), then for B ∈ 𝔅₁ bounded away from the origin,

P(X(t) ∈ B) = λF(B)t + o(t).

Conversely, if there is a process X(t) with stationary, independent increments, and a λ, F such that the above conditions hold as t → 0, it is easy to check that the process must be of the jump type with exponent function given by

ψ(u) = λ ∫ (e^{iux} − 1) F(dx).

In general now, let {X(t)} be any Markov process with stationary transition probabilities. Take f(x) a bounded 𝔅₁-measurable function on F. Consider the class of these functions such that the limit as t ↓ 0 of

(15.21) (E_x f(X(t)) − f(x)) / t

exists for all x ∈ F. Denote the resulting function by (Sf)(x). S is called the infinitesimal operator and summarizes the behavior of the transition probabilities as t → 0. The class of bounded measurable functions such that the limit in (15.21) exists for all x ∈ F we will call the domain of S, denoted by 𝔇(S). For example, for Poisson-like processes,

Define a measure μ(B; x) by

μ(B; x) = λF(B − x).

Then S for this process can be written as

(Sf)(x) = ∫ [f(y) − f(x)] μ(dy; x).

In this example, no further restrictions on f were needed to make the limit as t → 0 exist. Thus 𝔇(S) consists of all bounded measurable functions.

For Brownian motion, take f(x) continuous and with a continuous, bounded second derivative. Write


where


From this,

Use

to get

In this case it is not clear what 𝔇(S) is, but it is certainly not the set of all bounded measurable functions. These two examples will be typical in this sense: the jumps in the sample paths contribute an integral operator component to S; the continuous nonconstant parts of the paths contribute a differential operator component.
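The Brownian case can be checked numerically (a sketch, not from the text): with X(t) distributed N(x, t) under P_x, the difference quotient (E_x f(X(t)) − f(x))/t for small t should be close to ½f″(x). Gauss-Hermite quadrature computes the Gaussian expectation:

```python
import numpy as np

# Infinitesimal operator of standard Brownian motion: (S f)(x) = f''(x) / 2.
# Approximate (E_x f(X(t)) - f(x)) / t for small t with Gauss-Hermite quadrature.
nodes, weights = np.polynomial.hermite_e.hermegauss(80)   # nodes/weights for weight e^{-z^2/2}
weights = weights / weights.sum()                         # normalize to an N(0, 1) expectation

f = np.cos
x, t = 0.3, 1e-4
Ef = np.sum(weights * f(x + np.sqrt(t) * nodes))   # E_x f(X(t)), since X(t) ~ N(x, t)
quotient = (Ef - f(x)) / t
exact = -0.5 * np.cos(x)                           # f''(x) / 2 for f = cos
```

Here E cos(x + √t Z) = cos(x)e^{−t/2}, so the quotient differs from ½f″(x) only by O(t), which the assertion below confirms.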

Once the behavior of p_t near t = 0 is specified by specifying S, the problem of computing the transition probabilities for all t > 0 is present. S hooks into the transition probabilities in two ways. In the first method, we let the initial position be perturbed. That is, given X(0) = x, we let a small time τ elapse and then condition on X(τ). This leads to the backwards equations. In the second method, we perturb on the final position: we compute the distribution up to time t and then let a small time τ elapse. Figures 15.1 and 15.2 illustrate computing P_x(X(t + τ) ∈ B).

Backwards Equations

Letting φ_t(x) = p_t(B | x), we can write the above as

φ_{t+τ}(x) = E_x φ_t(X(τ)).

Fig. 15.1 Backwards equations. Fig. 15.2 Forwards equations.


Dividing by τ and letting τ → 0, if p_t(B | x) is smooth enough in t, x, we find that

(15.22) ∂φ_t/∂t = Sφ_t;

that is, for Brownian motion, the backwards equations are

∂p_t(B | x)/∂t = (1/2) ∂²p_t(B | x)/∂x².

For the Poisson-like processes,

∂p_t(B | x)/∂t = ∫ [p_t(B | y) − p_t(B | x)] μ(dy; x).

Forwards Equations. For any f for which Sf exists,

or

Subtract E_x f(X(t)) from both sides, divide by τ, and let τ ↓ 0. With enough smoothness,

Thus, if S has an adjoint S*, the equations are

For Poisson-like processes,

so the forwards equations are

where μ̄(B; y) = μ(B; y) for y ∉ B, and μ̄({y}; y) = −μ(F − {y}; y). If p_t(dy | x) ≪ μ(dy) for all t, x, then take S* to be the adjoint with respect to μ(dy). For example, for Brownian motion p_t(dy | x) ≪ dy. For all f(y) with continuous second derivatives vanishing off finite intervals,


where p_t(y | x) denotes (badly) the density of p_t(dy | x) with respect to dy. Hence the forwards equation is

∂p_t(y | x)/∂t = (1/2) ∂²p_t(y | x)/∂y².

The forwards or backwards equations, together with the boundary condition p_t(· | x) → δ_{{x}}(·) as t → 0, can provide an effective method of computing the transition probabilities, given the infinitesimal conditions. But the questions regarding the existence and uniqueness of solutions are difficult to cope with. It is possible to look at these equations analytically, forgetting their probabilistic origin, and investigate their solutions. But the most illuminating approach is a direct construction of the required processes.
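As a sketch (not from the text) of how the backwards equations plus the boundary condition determine p_t, take again a two-state jump process with assumed rates a, b. Its infinitesimal operator is the matrix Q below, and an Euler integration of dP/dt = QP from P₀ = I reproduces the closed-form transition matrix:

```python
import numpy as np

# Backwards equation d/dt p_t = S p_t for a hypothetical two-state jump process,
# integrated by Euler steps and compared with the exact solution.
a, b = 1.2, 0.8                      # jump rates: a out of state 0, b out of state 1
Q = np.array([[-a, a], [b, -b]])     # infinitesimal operator acting on functions of the state

dt, T = 1e-4, 1.0
P = np.eye(2)                        # boundary condition: p_0(. | x) is unit mass at x
for _ in range(int(T / dt)):
    P = P + dt * (Q @ P)             # backwards equation: dP/dt = Q P

e = np.exp(-(a + b) * T)
P_exact = np.array([[b + a * e, a - a * e],
                    [b - b * e, a + b * e]]) / (a + b)
```

The Euler solution agrees with the exact transition matrix to the discretization error, and its rows remain probability vectors because the rows of Q sum to zero.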

5. PURE JUMP PROCESSES

Definition 15.24. A Markov process {X(t)} will be called a pure jump process if, starting from any point x ∈ F, the process has all sample paths constant except for isolated jumps, and right-continuous.

Proposition 15.25. If {X(t)} is a pure jump process, then it is strong Markov.

Proof. By 15.19, for φ bounded and measurable 𝔅_k, and

we have for any C ∈ ℱ(X(t), t ≤ t_n*), where t_n* takes on only a countable number of values and t_n* ↓ t*,

Since t* ≤ t_n*, this holds for all C ∈ ℱ(X(t), t ≤ t*). Since all paths are constant except for jumps and right-continuous,

lim_n X(t_n* + t) = X(t* + t)

for every sample path. Taking limits in the above integral equality provesthe proposition.

Assume until further notice that (X(t)} is a pure jump process.

Definition 15.26. Define the time T of the first jump as

T = inf{t : X(t) ≠ X(0)}.

Proposition 15.27. T is a Markov time.

Proof. Let T_n = inf{k/2^n : X(k/2^n) ≠ X(0)}. For every ω, T_n ↓ T. Further,

The sets {T_n < t} are monotonic increasing, and their limit is therefore in ℱ(X(s), s ≤ t). If t is a binary rational, the limit is {T < t}. If not, the limit is {T ≤ t}. But,

The basic structure of Markov jump processes is given by

Theorem 15.28. Under P_x, T and X(T) are independent and there is a 𝔅₁-measurable nonnegative function λ(x) on F such that

P_x(T > t) = e^{−λ(x)t},

with 0 ≤ λ(x) < ∞. If λ(x) = 0, then the state x is absorbing; that is, P_x(X(t) = x) = 1 for all t. The measurability of λ(x) follows from the measurability of P_x(T > t) in x.

Proof. Let t + T₁ be the first exit time from state x past time t; that is,

Then, for x ∉ B,

if P_x(T > t) > 0. Assume this for the moment. Then,

Going back, we have

There must be some t₀ > 0 such that P_x(T > t₀) > 0. Take 0 < t < t₀; then

Therefore φ(t) = P_x(T > t) is a monotonic nonincreasing function satisfying φ(t + τ) = φ(t)φ(τ) for t ≤ t₀. This implies that there is a parameter λ(x) such that φ(t) = e^{−λ(x)t}.

Corollary 15.29. For a pure jump process, let

p(B; x) = P_x(X(T) ∈ B).

Then at every point x in F,

P_x(X(t) = x) = 1 − λ(x)t + o(t),  P_x(X(t) ∈ B) = λ(x)p(B; x)t + o(t), x ∉ B.

Proof. Suppose that we prove that the probability of two jumps in time t is o(t); then both statements are clear, because

P_x(X(t) = x) = P_x(X(t) = x, no more than one jump) + o(t).

Similarly,

P_x(X(t) ∈ B) = P_x(X(t) ∈ B, no more than one jump) + o(t).

Remark. The reason for writing o(t) in front of p(B; x) is to emphasize that o(t)/t → 0 uniformly in B.

We finish by proving

Proposition 15.30. Let T₀ be the time of the first jump, T₀ + T₁ the time of the second jump. Then

P_x(T₀ + T₁ ≤ t) = o(t).

Proof. Now

so

This goes to zero as t → 0, by the bounded convergence theorem. The following inequality

now proves the stated result.

Define a measure μ(dy; x) by

μ(dy; x) = λ(x) p(dy; x).

The infinitesimal operator S is given, following 15.30, by

(Sf)(x) = ∫ [f(y) − f(x)] μ(dy; x),

and 𝔇(S) consists of all bounded 𝔅₁(F) measurable functions. The following important result holds for jump processes.

Theorem 15.32. pt(B \ x) satisfies the backwards equations.

Proof. First, we derive the equation

(15.33) p_t(B | x) = χ_B(x) e^{−λ(x)t} + ∫₀^t λ(x) e^{−λ(x)τ} ∫ p_{t−τ}(B | y) p(dy; x) dτ.

The intuitive idea behind (15.33) is simply to condition on the time and position of the first jump. To see this, write

The first term is χ_B(x)e^{−λ(x)t}. To evaluate the second term, reason that if T ∈ dτ and X(T) ∈ dy, then the particle has to get from y to B in time t − τ. Hence the second term should be

and this is exactly the second term in (15.33). A rigorous derivation could be given along the lines of the proof of the strong Markov property. But it is easier to use a method involving Laplace transforms, which has wide applicability when random times are involved. We sketch this method: First note that since X(t) is jointly measurable in t, ω, p_t(B | x) = E_x χ_B(X(t)) is measurable in t. Define

Write

The first term is


The second is

By the strong Markov property,

Hence

This is exactly the transform of (15.33), and invoking the uniqueness theorem for Laplace transforms (see [140]) gets (15.33) almost everywhere (dt). For {X(t)} a pure jump process, writing p_t(B | x) as E_x χ_B(X(t)) makes it clear that p_t(B | x) is right-continuous. The right side of (15.33) is obviously continuous in time; hence (15.33) holds identically.

Multiply (15.33) by e^{λ(x)t} and substitute t − τ = τ′ in the second term:

Hence p_t(B | x) is differentiable, and

An easy simplification gives the backwards equations.

The forwards equations are also satisfied. (See Chung [16, pp. 224 ff.], for instance.) In fact, most of the questions regarding the transition probabilities can be answered by using the representation

where R_n is the time of the nth jump, R₀ = 0, so

where the T_k are the first exit times after the kth jump. We make use of this in the next section to prove a uniqueness result for the backward equation.

Problem 5. Show that a pure jump process {X(t)}, t ≥ 0, is jointly measurable in (t, ω) with respect to 𝔅₁([0, ∞)) × ℱ.

6. CONSTRUCTION OF JUMP PROCESSES

In modeling Markov processes what is done, usually, is to prescribe infinitesimal conditions. For example: let F be the integers; then a population model with constant birth and death rates would be constructed by


specifying that in a small time Δt, if the present population size is j, the probability of increasing by one is r_B j Δt, where r_B is the birth rate. The probability of a decrease is r_D j Δt, where r_D is the death rate, and the probability of no change is 1 − r_D j Δt − r_B j Δt. What this translates into is

In general, countable state processes are modeled by specifying q(k | j) ≥ 0, q(j) = Σ_k q(k | j) < ∞, such that

General jump processes are modeled by specifying finite measures μ(B; x), measurable in x for every B ∈ 𝔅₁(F), such that for every x,

Now the problem is: Is there a unique Markov jump process fitting into (15.35)? Working backward from the results of the last section, we know that if there is a jump process satisfying (15.35), then it is appropriate to define

λ(x) = μ(F; x),  p(B; x) = μ(B; x)/λ(x),

and look for a process such that

Theorem 15.28 gives us the key to the construction of a pure jump process. Starting at x, we wait there a length of time T₀ exponentially distributed with parameter λ(x); then, independently of how long we wait, our first jump is to a position with distribution p(dy; x). Now we wait at our new position y a time T₁, independent of T₀, with distribution parameter λ(y), and so on. Note that these processes are very similar to the Poisson-like processes with independent increments. Heuristically, they are a sort of patched-together assembly of such processes, in the sense that at every point x the process behaves at that point like a Poisson-like process with parameter λ(x) and jump distribution given by p(B; x).

At any rate, it is pretty clear how to proceed with the construction.

1) The space structure of this process is obtained by constructing a discrete Markov process X₀, X₁, X₂, …, moving under the transition probabilities p(B; x), and starting from any point x.


2) The time flow of the process consists of slowing down or speeding up the rate at which the particle travels along the paths of the space structure.

For every n, x ∈ F, construct random variables Tₙ(x) such that

i) P(Tₙ(x) > t) = e^{−λ(x)t},

ii) the Tₙ(x) are jointly measurable in ω, x,

iii) the processes (X₀, X₁, …), (T₀(x), x ∈ F), (T₁(x), x ∈ F), … are mutually independent.

The Tₙ(x) will serve as the waiting time in the state x after the nth jump. To see that joint measurability can be gotten, define on the probability space ([0, 1], 𝔅([0, 1]), dz) the variables T(z, λ) = −(1/λ) log z. Thus P(T(z, λ) > t) = e^{−λt}. Now define T₀(x) = T(z, λ(x)) and take the cross-product space with the sample space for X₀, X₁, … Similarly for T₁(x), T₂(x), …
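The device T(z, λ) = −(1/λ) log z is ordinary inverse-transform sampling. A quick numerical check of P(T(z, λ) > t) = e^{−λt} (the rate λ = 2 and the level t = 0.5 are arbitrary choices):

```python
import math, random

def T(z, lam):
    # If z is uniform on (0, 1], then -(1/lam) log z is Exp(lam).
    return -math.log(z) / lam

rng = random.Random(0)
lam, t = 2.0, 0.5
n = 200_000
hits = sum(T(1.0 - rng.random(), lam) > t for _ in range(n))  # 1 - U lies in (0, 1]
empirical = hits / n
# empirical should be close to exp(-lam * t) = e^{-1}
```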

For the process itself, proceed with

Definition 15.36. Define variables as follows:

In this definition, Rₙ functions as the time of the nth jump.

Theorem 15.37. If X(t) is well-defined by 15.36 for all t, then X(t) is a pure jump Markov process with transition probabilities satisfying the given infinitesimal conditions.

Proof. This is a straightforward verification. The basic point is that, given X(t) = x, and given, say, that we got to this space-time point in n steps, the waiting time in x past t does not depend on how long has already been spent there; that is,

To show that the infinitesimal conditions are met, just show again that the probability of two jumps in time t is o(t).

The condition that X(t) be well-defined is that

This is a statement that at most a finite number of jumps can occur in every finite time interval. If Pₓ(R∞ < ∞) > 0, there is no pure jump process that satisfies the infinitesimal conditions for all x ∈ F. However, even if R∞ = ∞ a.s., the question of uniqueness has been left open. Is there another Markov


process, not necessarily a jump process, satisfying the infinitesimal conditions? The answer, in regard to distribution, is No. The general result states that if Pₓ(R∞ < ∞) = 0, all x ∈ F, then any Markov process {X(t)} satisfying the infinitesimal conditions (15.35) is a pure jump process and has the same distribution as the constructed process. For details of this, refer to Doob [39, pp. 266 ff.]. We content ourselves with the much easier assertion:

Proposition 15.38. Any two pure jump processes having the same infinitesimal operator S have the same distribution.

Proof. This is now almost obvious, because for both processes, T₀ and X(T₀) have the same distribution. Therefore the sequence of variables X(T₀), X(T₀ + T₁), … has distribution governed by p(B; x), and, given this sequence, the jump times are sums of independent variables with the same distribution as the constructed variables {Rₙ}.

Let {X(t)} be the constructed process. Whether or not R∞ = ∞ a.s., define p_t⁽ᴺ⁾(B | x) as the probability that X(t) reaches B in time t in N or fewer jumps. That is,

For n ≥ 1, and τ ≤ t,

The terms for n ≥ 1 vanish for τ > t, and the zero term is

hence integrating out X₁, T₀ gives

Proposition 15.39

Letting N → ∞ gives another proof that, for a pure jump process, integral equation (15.33) and hence the backwards equations are satisfied.

Define

The significance of p_t(B | x) = lim_N p_t⁽ᴺ⁾(B | x) is that it is the probability of going from x to B in time t in a finite number of steps.


Proposition 15.41. p_t(B | x) is the minimal solution of the backwards equations in the sense that if q_t(x) is any other solution satisfying

then

Proof. The backwards equation is

Multiply by e^{−λ(x)τ}, integrate from 0 to t, and we recover the integral equation

Assume q_t(x) ≥ p_t⁽ᴺ⁾(B | x). Then, substituting this inequality in the integral on the right,

By the nonnegativity of q_t,

Hence q_t(x) ≥ p_t(B | x).

Corollary 15.42. If {X(t)} is a pure jump process, equivalently, if R∞ = ∞ a.s. Pₓ, all x ∈ F, then {p_t(B | x)} are the unique set of transition probabilities satisfying the backwards equations.

7. EXPLOSIONS

If there are only a finite number of jumps in every finite time interval, then everything we want goes through: the forwards and backwards equations are satisfied and the solutions are unique. Therefore it becomes important to be able to recognize from the infinitesimal conditions when the resulting process will be pure jump. The thing that may foul the process up is unbounded λ(x). The expected duration of stay in state x is given by EₓT = 1/λ(x). Hence if λ(x) → ∞ anywhere, there is the possibility that the particle will move from state to state, staying in each one a shorter period of time. In the case where F represents the integers, λ(n) can go to ∞ only if n → ∞. In this case, we can have infinitely many jumps only if the particle


can move out to ∞ in finite time. This is dramatically referred to as the possibility of explosions in the process. Perhaps the origin of this is in a population explosion model with pure birth,

Here the space structure is p(n + 1; n) = 1; the particle must move one step to the right at each jump. Hence Xₙ = n if X₀ = 0. Now

is the time necessary to move n steps. And

If this sum is finite, then R∞ < ∞ a.s. P₀. This is also sufficient for Pⱼ(R∞ < ∞) = 1 for all j ∈ F. Under these circumstances the particle explodes out to infinity in finite time, and the theorems of the previous sections do not apply.

One criterion that is easy to derive is

Proposition 15.43. A process satisfying the given infinitesimal conditions will be pure jump if and only if

Proof. For Σ₀^∞ Tₙ a sum of independent, exponentially distributed random variables with parameters λₙ, Σ₀^∞ Tₙ < ∞ a.s. iff Σ₀^∞ 1/λₙ < ∞. Because, for s > 0,

Verify that the infinite product on the right converges to a finite limit iff Σ₀^∞ 1/λₙ < ∞, and apply 8.36 and the continuity theorem given in Section 13, Chapter 8. Now note that, given (X₀, X₁, …), R∞ is a sum of such variables with parameters λ(Xₙ).
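The criterion is concrete for a pure birth process. With the assumed rates λ(n) = (n + 1)², Σ 1/λ(n) = π²/6 < ∞, so by 15.43 the process explodes: R∞ is finite a.s., with ER∞ = Σ 1/λ(n). A simulation sketch (truncated at a finite number of states):

```python
import math, random

rng = random.Random(0)
lam = lambda n: (n + 1) ** 2      # assumed birth rates; sum of 1/lam(n) = pi^2/6

def explosion_time(n_states=1500):
    # R_inf is a sum of independent Exp(lam(n)) holding times.
    return sum(rng.expovariate(lam(n)) for n in range(n_states))

reps = 1000
mean_R = sum(explosion_time() for _ in range(reps)) / reps
# mean_R should be near pi^2 / 6, about 1.645: finite expected explosion time
```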

Corollary 15.44. If supₓ λ(x) < ∞, then {X(t)} is pure jump.

Note that for a pure birth process Σ₁^∞ 1/λ(n) = ∞ is both necessary and sufficient for the process to be pure jump. For F the integers, another obvious sufficient condition is that every state be recurrent under the space structure. Conversely, consider


Let N(k) be the number of entries that the sequence X₀, X₁, … makes into state k. Then

If this is finite, then certainly there will be an explosion. Time-continuous birth and death processes are defined as processes on the integers such that each jump can be only to adjacent states, so that

For a birth and death process with no absorbing states moving on the nonnegative integers, Eₖ(N(k)) = 1 + M(k), where M(k) is the expected number of returns to k given X₀ = k. Then, as above,

The condition that this latter sum be infinite is both necessary and sufficient for no explosions [12].

Another method for treating birth and death processes was informally suggested to us by Charles Stone. Let F be the nonnegative integers with no absorbing states. Let tₙ* be the first passage time from state 1 to state n, and τₙ* the first passage time from state n to state n + 1. The τ₁*, τ₂*, … are independent, and tₙ* = Σₖ₌₁ⁿ⁻¹ τₖ*.

Proposition 15.45. t∞* = lim tₙ* is finite a.s. if and only if Σ₁^∞ Eτₙ* < ∞.

Proof. Let Tₖ be the duration of the first stay in state k; then τₖ* ≥ Tₖ. Further, Σ₁^∞ Tₖ < ∞ a.s. iff Σ₁^∞ ETₖ < ∞, that is, iff Σ₁^∞ 1/λ(k) < ∞. Hence if inf_k λ(k) = 0, both Σ₁^∞ τₖ* and Σ₁^∞ Eτₖ* are infinite. Now assume inf_k λ(k) = δ > 0. Given any succession of states X₀ = n, X₁, …, X_m = n + 1 leading from n to n + 1, τₙ* is a sum of independent, exponentially distributed random variables T₀ + ⋯ + T_m, and σ²(T₀ + ⋯ + T_m) = σ²(T₀) + ⋯ + σ²(T_m) ≤ (1/δ)(ET₀ + ⋯ + ET_m). Hence

If tₙ* converges a.s., but Σ₁^∞ Eτₙ* = ∞, then for 0 < ε < 1,

Applying Chebyshev's inequality to this probability gives a contradiction, which proves the proposition.


Problems

6. Show that ν(k) = Eτₖ* satisfies the difference equation

where

Deduce conditions on p(k), q(k), λ(k) such that there are no explosions. (See [87].)

7. Discuss completely the explosive properties of a birth and death process with {0} a reflecting state and

8. NONUNIQUENESS AND BOUNDARY CONDITIONS

If explosions are possible, then p_t(F | x) < 1 for some x, t, and the process is not uniquely determined by the given infinitesimal conditions. The nature of the nonuniqueness is that the particle can reach points on some undefined "boundary" of F not included in F. Then to completely describe the process it is necessary to specify its evolution from these boundary points. This is seen most graphically when F is the integers. If Pⱼ(R∞ < ∞) > 0 for some j, then we have to specify what the particle will do once it reaches ∞. One possible procedure is to add to F a state denoted {∞} and to specify transition probabilities from {∞} to j ∈ F. For example, we could make {∞} an absorbing state, that is, p_t({∞} | {∞}) = 1. An even more interesting construction consists of specifying that once the particle reaches {∞} it instantaneously moves into state k with probability Q(k). This is more interesting in that it is not necessary to adjoin an extra state {∞} to F.

To carry out this construction, following Chung [16], define

Now look at the probability p_t⁽¹⁾(j | k) that k → j in time t with exactly one passage to {∞}. To compute this, suppose that R∞ = τ; then the particle moves immediately to state l with probability Q(l), and then must go from l to j in time t − τ with no further excursions to {∞}. Hence, denoting H_k(dτ) = P_k(R∞ ∈ dτ),

or


Similarly, the probability p_t⁽ⁿ⁾(j | k) of k → j in time t with exactly n passages to {∞} is given by

Now define

Proposition 15.46. p_t(j | k) as defined above satisfies

2) the Chapman-Kolmogorov equations, and
3) the backwards equations.

Proof. Left to reader.

Remark. p_t(j | k) does not satisfy the forwards equations. See Chung [16, pp. 224 ff.].

The process constructed in the above way has the property that

This follows from noting that

The integral term is dominated by P_k(R∞ ≤ t). This is certainly less than the probability of two jumps in time t, hence is o(t). Therefore, no matter what Q(l) is, all these processes have the specified infinitesimal behavior. This leads to the observation (which will become more significant in the next chapter) that if it is possible to reach a "boundary" point, then boundary conditions must be added to the infinitesimal conditions in order to specify the process.

9. RESOLVENT AND UNIQUENESS

Although S with domain 𝔇(S) does not determine the process uniquely, this can be fixed up with a more careful and restrictive definition of the domain of S. In this section the processes dealt with will be assumed to have standard stationary transition probabilities, but no restrictions are put on their sample paths.

Definition 15.47. Say that functions φ_t(x) converge boundedly pointwise to φ(x) on some subset A of their domain as t → 0 if


ii) sup_{x∈A} |φ_t(x)| ≤ M < ∞, for all t sufficiently small.

Denote this by φ_t(x) →ᵇᵖ φ(x) on A.

Let C be any class of bounded 𝔅₁(F)-measurable functions. Then we use

Definition 15.48. 𝔇(S, C) consists of all functions f(x) in C such that

converges boundedly pointwise on F to a function in C.

We plan to show that, with an appropriate choice of C, there is at most one process corresponding to a given S, 𝔇(S, C). In the course of this, we will want to integrate functions of the type Eₓf(X(t)), so we need

Proposition 15.49. For f(x) bounded and measurable on F, φ(x, t) = Eₓf(X(t)) is jointly measurable in (x, t), with respect to 𝔅₁(F) × 𝔅([0, ∞)).

Proof. Take f(x) bounded and continuous. Since {X(t)} is continuous in probability from the right, the function φ(x, t) = Eₓf(X(t)) is continuous in t from the right and 𝔅₁(F)-measurable in x for t fixed. Consider the approximation

By the right-continuity, φₙ(x, t) → φ(x, t). But φₙ(x, t) is jointly measurable, therefore so is φ(x, t). Now consider the class of 𝔅₁(F)-measurable functions f(x) such that |f(x)| ≤ 1 and the corresponding φ(x, t) is jointly measurable. This class is closed under pointwise convergence, and contains all continuous functions bounded by one. Hence it contains all measurable functions bounded by one.

Definition 15.50. The resolvent is defined as

for any B ∈ 𝔅₁(F), x ∈ F.

It is easy to check that R_λ(B | x) is a bounded measure on 𝔅₁(F) for fixed x. Furthermore, by 15.49 and the Fubini theorem, R_λ(B | x) is measurable in x for B ∈ 𝔅₁(F) fixed. Denote, for f bounded and measurable,


Then, using the Fubini theorem to justify the interchange,

Take 𝔖 to be the set of all bounded 𝔅₁(F)-measurable functions f(x) such that (T_t f)(x) → f(x) as t → 0 for every x ∈ F. Note that 𝔇(S, 𝔖) ⊂ 𝔖.
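For a finite state space the resolvent of Definition 15.50 has a closed form: if Q is the rate matrix of the process, then R_λ = (λI − Q)⁻¹ (a standard fact, not proved in the text). A numerical check for an assumed two-state chain, leaving state 0 at rate a and state 1 at rate b:

```python
import math

a, b, lam = 1.5, 0.5, 2.0            # illustrative rates and lambda > 0

def p_t_00(t):
    # Closed-form transition probability for the two-state chain.
    return b / (a + b) + (a / (a + b)) * math.exp(-(a + b) * t)

# Numerical resolvent: R_lambda(0 | 0) = integral_0^inf e^{-lam t} p_t(0 | 0) dt.
dt, T = 1e-4, 40.0
R_num = sum(math.exp(-lam * k * dt) * p_t_00(k * dt) * dt
            for k in range(int(T / dt)))

# Entry (0, 0) of (lam I - Q)^{-1}, with Q = [[-a, a], [b, -b]].
det = (lam + a) * (lam + b) - a * b
R_alg = (lam + b) / det
# R_num and R_alg agree up to the quadrature error
```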

Theorem 15.51. If f is in 𝔇(S, 𝔖), then

Proof. If f is in 𝔖, then, since T_t and R_λ commute, the bounded convergence theorem can be applied to T_t(R_λ f) = R_λ(T_t f) to establish R_λ f ∈ 𝔖. Write

The Chapman-Kolmogorov equations imply T_{t+τ} = T_t T_τ.

From this, denoting

we get

Using ‖f‖ = sup_{x∈F} |f(x)|, we have

As t goes to zero, (e^{−λt} − 1)/t → −λ. As τ goes to zero, T_τ f converges boundedly pointwise to f. Using these in (15.52) completes the proof of the first assertion.

Now take f in 𝔇(S, 𝔖); by the bounded convergence theorem,

Note that R_λ and S commute, so


By part (1) of this theorem, R_λ f is in 𝔇(S, 𝔖), hence

The purpose of this preparation is to prove

Theorem 15.53. There is at most one set of standard transition probabilities corresponding to given S, 𝔇(S, 𝔖).

Proof. Suppose there are two different sets, p_t⁽¹⁾ and p_t⁽²⁾, leading to resolvents R_λ⁽¹⁾ and R_λ⁽²⁾. For f in 𝔖, let

Then, by 15.51(1),

But g ∈ 𝔇(S, 𝔖), so use 15.51(2) to get

Therefore g is zero. Thus, for all f ∈ 𝔖, R_λ⁽¹⁾f = R_λ⁽²⁾f. Since 𝔖 includes all bounded continuous functions, for any such function, and for all λ > 0,

By the uniqueness theorem for Laplace transforms (see [140], for example), (T_t⁽¹⁾f)(x) = (T_t⁽²⁾f)(x) almost everywhere (dt). But both these functions are continuous from the right, hence are identically equal. Since bounded continuous functions separate, p_t⁽¹⁾(B | x) = p_t⁽²⁾(B | x).

The difficulty with this result is in the determination of 𝔇(S, 𝔖). This is usually such a complicated procedure that the uniqueness theorem 15.53 above has really only theoretical value. Some examples follow in these problems.

Problems

8. For the transition probabilities constructed and referred to in 15.46, show that a necessary condition for f(j) to be in 𝔇(S, 𝔖) is

9. Show that for any λ, μ,

Use this to show that the set ℜ_λ consisting of all functions {R_λ f}, f in 𝔖, does not depend on λ. Use 15.51 to show that ℜ_λ = 𝔇(S, 𝔖).


10. For Brownian motion, show that the resolvent has a density r_λ(y | x) with respect to dy given by

11. Let C be the class of all bounded continuous functions on R⁽¹⁾. Use the identity in Problem 9, and the method of that problem, to show that 𝔇(S, C) for Brownian motion consists of all functions f in C such that f″(x) is in C.

12. For a pure jump process, show that if supₓ λ(x) < ∞, then 𝔇(S, 𝔖) consists of all bounded 𝔅₁(F)-measurable functions.

10. ASYMPTOTIC STATIONARITY

Questions concerning the asymptotic stationarity of a Markov process(X(7)} can be formulated in the same way as for discrete time chains. Inparticular,

Definition 15.54. π(dx) on 𝔅₁(F) will be called a stationary initial distribution for the process if for every B ∈ 𝔅₁(F) and t > 0,

Now ask: when do the probabilities p_t(B | x) converge as t → ∞ to some stationary distribution π(B) for all x ∈ F? Interestingly enough, the situation here is less complicated than in discrete time because there is no periodic behavior. We illustrate this for X(t) a pure jump process moving on the integers. Define the times between successive returns to state k by

and so forth. By the strong Markov property,

Proposition 15.55. If P_k(t₁* < ∞) = 1, then t₁*, t₂*, … are independent and identically distributed.

If P_k(t₁* < ∞) < 1, then the state k is called transient. To analyze the asymptotic behavior of the transition probabilities, use

where Tₙ(k) is the duration of stay in state k after the nth return. It is independent of the preceding return times.


Put

then

Argue that t₁* is nonlattice because

where τ* is the time from the first exit to the first return. By the strong Markov property, T₀(k), τ* are independent. Finally, note that T₀(k) has a distribution absolutely continuous with respect to Lebesgue measure; hence so does t₁*.

Now apply the renewal theorem 10.8. As t → ∞,

Conclude that

lim_{t→∞} p_t(k | k) = ET/Et₁*.

Hence

Proposition 15.56. Let T be the first exit time from state k, t₁* the time of first return. If Et₁* < ∞, then

The following problems concern the remainder of the problem of asymptotic convergence.
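Proposition 15.56 can be checked by simulation on an assumed two-state chain: leaving 0 at rate a and 1 at rate b, ET = 1/a and Et₁* = 1/a + 1/b, so the limiting value is b/(a + b). The long-run fraction of time spent in state 0 should match:

```python
import random

rng = random.Random(7)
a, b = 1.5, 0.5                       # assumed exit rates from states 0 and 1

t, state, time_in_0, T_END = 0.0, 0, 0.0, 200_000.0
while t < T_END:
    w = min(rng.expovariate(a if state == 0 else b), T_END - t)
    if state == 0:
        time_in_0 += w
    t += w
    state = 1 - state                 # two states: each jump flips the state

frac = time_in_0 / T_END
limit = (1 / a) / (1 / a + 1 / b)     # E(first exit) / E(first return) = 0.25
# frac should be close to limit
```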

Problems

13. Let all the states communicate under the space structure given by p(j; k), and let the expected time of first return be finite for every state. Show that

1) For any k, j,

where π(j) is defined by (15.57).

2) If π̄(k) is the stationary initial distribution under p(j; k), that is,

then

where α is a normalizing constant.


14. Show that if k is a transient state, then p_t(k | k) goes to zero exponentially fast.

15. Show that π is a stationary initial distribution for a process having standard stationary transition probabilities if and only if

for all g such that there exists an f in 𝔇(S, 𝔖) with g = Sf.

NOTES

K. L. Chung's book [16] is an excellent reference for the general structure of time-continuous Markov chains with a countable state space. Even with this simple a state space, the diversity of sample path behavior of processes with standard stationary transition probabilities is dazzling. For more general state spaces, and for jump processes in particular, see Doob's book [39, Chap. 6]. For a thoroughly modern point of view, including discussion of the strong Markov property and the properties of S, 𝔇(S, 𝔖), and the resolvent, see Dynkin [44, especially Vol. I].

The fundamental work in this field started with Kolmogorov [93, 1931]. The problems concerning jump processes were treated analytically by Pospisil [117, 1935-1936] and Feller in 1936, but see Feller [52] for a fuller treatment. Doeblin [26, 1939] had an approach closer in spirit to ours. Doob [33, 1942] carried on a more extended study of the sample path properties. The usefulness of the resolvent and the systematic study of the domain of S were introduced by Feller [57, 1952]. His idea was that the operators {T_t}, t ≥ 0, formed a semigroup, hence methods for analyzing semigroups of operators could be applied to get useful results.

There is an enormous literature on applications of pure jump Markov processes, especially for those with a countable state space. For a look at some of those, check the books by Bharucha-Reid [3], T. E. Harris [67], N. T. J. Bailey [2], and T. L. Saaty [120]. An extensive reference to both theoretical and applied sources is the Bharucha-Reid book.


CHAPTER 16

DIFFUSIONS

1. THE ORNSTEIN-UHLENBECK PROCESS

In Chapter 12, the Brownian motion process was constructed as a model for a microscopic particle in liquid suspension. We found the outstanding nonreality of the model was the assumption that increments in displacement were independent, ignoring the effects of the velocity of the particle at the beginning of the incremental time period. We can do better in the following way:

Let V(t) be the velocity of a particle of mass m suspended in liquid. Let ΔV = V(t + Δt) − V(t), so that m ΔV is the change in momentum of the particle during time Δt. The basic equation is

Here −βV is the viscous resistance force, so −βV Δt is the loss in momentum due to viscous forces during Δt. ΔM is the momentum transfer due to molecular bombardment of the particle during time Δt.

Let M(t) be the momentum transfer up to time t. Normalize arbitrarily to M(0) = 0. Assume that

i) M(t + Δt) − M(t) is independent of (M(τ), τ ≤ t),
ii) the distribution of ΔM depends only on Δt,
iii) M(t) is continuous in t.

The third assumption may be questionable if one uses a hard billiard-ball model of molecules. But even in this case we reason that the jumps of M(t) would have to be quite small unless we allowed the molecules to have enormous velocities. At any rate, (iii) is not unreasonable as an approximation.

But (i), (ii), (iii) together characterize M(t) as a Brownian motion. The presence of drift in M(t) would put a constant multiple of Δt on the right-hand side of (16.1). Such a term corresponds to a constant force field, and would be useful, for example, in accounting for a gravity field. However, we will assume no constant force field exists, and set EM(t) = 0. Put EM²(t) = σ²t; hence M(t) = σX(t), where X(t) is normalized Brownian motion. Equation (16.1) becomes


Doing what comes naturally, we divide by Δt, let Δt → 0, and produce the Langevin equation

The difficulty here is amusing: We know from Chapter 12 that dX/dt exists nowhere. So (16.3) makes no sense in any orthodox way. But look at this: Write it as

where α = β/m, γ = σ/m. Assume V(0) = 0 and integrate from 0 to t to get

Do an integration by parts on the integral,

Now the integral appearing is for each ω just the integral of a continuous function and makes sense. Thus the expression for V(t) given by

can be well defined by this procedure, and results in a process with continuous sample paths.

To get a more appealing derivation, go back to (16.2). Write it as

where δ(Δt) = o(Δt) because, by (16.2), V(t) is continuous and bounded in every finite interval. By summing up, write this as

where 0 = t₀⁽ⁿ⁾ < ⋯ < tₙ⁽ⁿ⁾ = t is a partition σₙ of [0, t]. If the limit of the right-hand side exists in some decent way as ‖σₙ‖ → 0, then it would be very reasonable to define e^{αt}V(t) as this limit. Replace the integration by parts in the integral by a similar device for the sum,

The second sums are the Riemann-Stieltjes sums for the integral ∫₀ᵗ X(τ) d(e^{ατ}).


For every sample path, they converge to the integral. Therefore:

Definition 16.4. The Ornstein-Uhlenbeck process V(t), normalized to be zero at t = 0, is defined as

where the integral is the limit of the approximating sums for every path.
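The definition suggests a direct simulation. The discretization below uses the exact one-step transition V(t + h) = e^{−αh}V(t) plus Gaussian noise of variance ρ(1 − e^{−2αh}), with ρ = γ²/2α; this recursion is a standard consequence of 16.4 and Proposition 16.5, not a formula from the text. Starting from V(0) = 0, the variance at time t should be near ρ(1 − e^{−2αt}):

```python
import math, random

rng = random.Random(3)
alpha, gamma_ = 1.0, 1.0             # illustrative parameters
rho = gamma_ ** 2 / (2 * alpha)      # limiting variance rho = gamma^2 / 2 alpha

def ou_value(t_end, h=0.01):
    phi = math.exp(-alpha * h)
    s = math.sqrt(rho * (1 - phi ** 2))   # exact one-step noise scale
    v = 0.0                               # V(0) = 0, as in Definition 16.4
    for _ in range(int(t_end / h)):
        v = phi * v + s * rng.gauss(0.0, 1.0)
    return v

samples = [ou_value(5.0) for _ in range(4000)]
var = sum(v * v for v in samples) / len(samples)
# var should be near rho * (1 - e^{-10}), i.e. essentially rho = 0.5
```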

Proposition 16.5. V(t) is a Gaussian process with covariance

where ρ = γ²/2α.

Proof. That V(t) is Gaussian follows from its being the limit of sums Σₖ φ(tₖ) ΔₖX, where the ΔₖX = X(tₖ₊₁) − X(tₖ) are independent, normally distributed random variables. To get Γ(s, t), take s ≥ t, put

Write

Use E(ΔⱼX ΔₖX) = 0 for j ≠ k and E(ΔₖX)² = Δtₖ.

Going to the limit, we get

As t → ∞, EV(t)² → ρ, so V(t) converges in distribution to V(∞), where V(∞) is N(0, ρ). What if we start the process with this limiting distribution? This would mean that the integration of the Langevin equation would result in

Define the stationary Ornstein-Uhlenbeck process by

where V∞(0) is N(0, ρ) and independent of ℱ(X(t), t ≥ 0).

Proposition 16.7. Y(t) is a stationary Gaussian process with covariance

Proof. Direct computation.


Remark. Stationarity has not been defined for continuous parameter processes, but the obvious definition is that all finite-dimensional distributions remain invariant under a time shift. For Gaussian processes with zero means, stationarity is equivalent to Γ(s, t) = φ(|s − t|).

The additional important properties of the Ornstein-Uhlenbeck process are:

Proposition 16.8. Y(t) is a Markov process with stationary transition probabilities having all sample paths continuous.

Proof. Most of 16.8 follows from the fact that, for t > 0, Y(t + τ) − e^{−αt}Y(τ) is independent of ℱ(Y(s), s ≤ τ). To prove this, it is necessary only to check the covariance

Now,

The random variable Y(t + τ) − e^{−αt}Y(τ) is normal with mean zero, and

Thus p_t(· | x) has the distribution of

The continuity of paths follows from the definition of V(t) in terms of an integral of X(t).

Problems

1. Show that if a process is Gaussian, stationary, Markov, and continuous in probability, then it is of the form Y(t) + c, where Y(t) is an Ornstein-Uhlenbeck process.

2. Let Z be a vector-valued random variable taking values in R⁽ᵐ⁾, m ≥ 2. Suppose that the components of Z, (Z₁, Z₂, …, Z_m), are independent and identically distributed with a symmetric distribution. Suppose also that the components have the same property under all other orthogonal coordinate systems gotten from the original one by rotation. Show that Z₁, …, Z_m are N(0, σ²).


Remark. The notable result of this problem is that any model for Brownian motion in three dimensions leads to variables normally distributed, providing the components of displacement or velocity along the different axes are independent and identically distributed (symmetry is not essential; see Kac [79]), irrespective of which orthogonal coordinate system is selected. However, it does not follow from this that the process must be Gaussian.

2. PROCESSES THAT ARE LOCALLY BROWNIAN

In the spirit of the Langevin approach of the last section, if Y(t) is Brownian motion with drift μ, variance σ², then write

The same integration procedure as before would then result in a process Y(t) which would be, in fact, Brownian motion with parameters μ, σ. To try to get more general Markov processes with continuous paths, write

As before, X(t) is normalized Brownian motion. Y(t) should turn out to be Markov with continuous paths and stationary transition probabilities. Argue this way: μ(Y)Δt is a term approximately linear in Δt, but except for this term ΔY is of the order of ΔX, hence Y(t) should be continuous. Further, assume Y(t) is measurable ℱ(X(τ), τ ≤ t). Then the distribution of ΔY depends only on Y(t) (through μ(Y) and σ(Y)), and on ΔX, which is independent of ℱ(X(τ), τ ≤ t) with distribution depending only on Δt.
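The heuristic equation ΔY = μ(Y)Δt + σ(Y)ΔX can be turned into a crude simulation by stepping it forward (the Euler scheme; a sketch, not the chapter's construction). Taking μ(y) = −αy and σ(y) = γ, the Langevin choice of Section 1, the scheme should approximately reproduce the Ornstein-Uhlenbeck process:

```python
import math, random

rng = random.Random(2)

def euler_path(mu, sigma, y0, t_end, h=0.01):
    # Step Delta Y = mu(Y) Delta t + sigma(Y) Delta X,  Delta X ~ N(0, Delta t).
    y = y0
    for _ in range(int(t_end / h)):
        y += mu(y) * h + sigma(y) * rng.gauss(0.0, math.sqrt(h))
    return y

alpha, gamma_ = 1.0, 1.0             # illustrative parameters
ys = [euler_path(lambda y: -alpha * y, lambda y: gamma_, 0.0, 4.0)
      for _ in range(2000)]
mean = sum(ys) / len(ys)
# mean should be near 0, and the sample second moment near gamma^2/(2 alpha) = 0.5
```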

Roughly, a process satisfying (16.9) is locally Brownian. Given that Y(t) = y, it behaves for the next short time interval as though it were a Brownian motion with drift μ(y), variance σ²(y). Therefore, we can think of constructing this kind of process by patching together various Brownian motions. Note, assuming ΔX is independent of Y(t),

Of course, the continuity condition is also satisfied,

Define the truncated change in Y by

As a first approximation to the subject matter of this chapter, I will say that we are going to look at Markov processes Y(t) taking values in some interval


F, with stationary transition probabilities p_t satisfying, for every ε > 0 and y in the interior of F,

and having continuous sample paths. Conditions (16.10) are the infinitesimal conditions for the process. A Taylor expansion gives

Proposition 16.11. Let f(x) be bounded with a continuous second derivative. If p_t(dy | x) satisfies (16.10), then (Sf)(x) exists for every point x in the interior of F and equals

Thus,

Proposition 16.12. If the transition probabilities satisfy (16.10), and have densities p_t(y | x) with a continuous second derivative for x ∈ int(F), then

Proof. This is the backwards equation.

Problem 3. Show by direct computation that the transition probabilities for the Ornstein-Uhlenbeck process satisfy

3. BROWNIAN MOTION WITH BOUNDARIES

For X(t) a locally Brownian process as in the last section, the infinitesimal operator S is defined for all interior points of F by 16.11. Of course, this completely defines S if F has only interior points. But if F has a closed boundary point, the definition of S at this point is not clear. This problem is connected with the question of what boundary conditions are needed to uniquely solve the backwards equation (16.13). To illuminate this problem a bit, we consider two examples of processes where F has a finite closed boundary point.

Definition 16.14. Use X₀(t) to denote normalized Brownian motion on R⁽¹⁾; p_t⁽⁰⁾(dy | x) are its transition probabilities.

The examples will be concerned with Brownian motion restricted to the interval F = [0, ∞).


Example 1. Brownian motion with an absorbing boundary. Take F = [0, ∞). The Brownian motion X(t) starting from x > 0 with absorption at {0} is defined by

where X₀(t) is started from x.

It is not difficult to check that X(t) is Markov with stationary transition probabilities. To compute these rigorously is tricky. Let A ⊂ (0, ∞), A ∈ 𝔅₁, and consider Pₓ(X₀(t) ∈ A, t* ≤ t). The set {X₀(t) ∈ A, t* ≤ t} consists of all sample paths that pass out of (0, ∞) at least once and then come back in and get to A by time t. Let A^L be the reflection of A around the point x = 0. Argue that after hitting {0} at time τ ≤ t it is just as probable (by symmetry) to get to A^L by time t as it is to get to A, implying

This can be proven rigorously by approximating t* by stopping times that take only a countable number of values. We assume its validity. Proceed by noting that {X₀(t) ∈ A^L} ⊂ {t* ≤ t}, so that

Now,

The density of these transition probabilities is then the difference of the free Brownian densities centered at x and at −x.
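A standard consequence of the reflection argument (stated here as an assumption, since the displayed formulas are lost in this copy) is Pₓ(t* > t) = 2Φ(x/√t) − 1 for the probability that the motion started at x has not been absorbed by time t. A Monte Carlo check with discretized paths:

```python
import math, random

rng = random.Random(5)

def survival_mc(x, t, n_paths=10_000, n_steps=400):
    # Fraction of discretized Brownian paths from x that stay positive up to t.
    h = t / n_steps
    alive = 0
    for _ in range(n_paths):
        y, ok = x, True
        for _ in range(n_steps):
            y += rng.gauss(0.0, math.sqrt(h))
            if y <= 0.0:
                ok = False
                break
        alive += ok
    return alive / n_paths

x, t = 1.0, 1.0
Phi = lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2)))
exact = 2 * Phi(x / math.sqrt(t)) - 1     # reflection-principle value
est = survival_mc(x, t)
# est slightly overestimates exact: a discrete path can step across 0 and back
```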

Example 2. Brownian motion with a reflecting boundary. Define the Brownian motion X(t) on F = [0, ∞) with a reflecting boundary at {0} to be

where we start the motion from x ≥ 0. What this definition does is to take all parts of the X₀(t) path below x = 0 and reflect them in the x = 0 axis, getting the X(t) path.

Proposition 16.17. X(t) is Markov with stationary transition probability density


Proof. Take A ∈ 𝔅([0, ∞)), x ≥ 0. Consider the probabilities

Because X₀(t) is Markov, these reduce to

These expressions are equal. Hence

In both examples, X(t) equals the Brownian motion X₀(t) until the particle reaches zero. Therefore, in both cases, for x > 0 and f bounded and continuous on [0, ∞),

As expected, then, in the interior of F,

for functions with continuous second derivatives. Assume that the limits f′(0+), f″(0+), as x ↓ 0, of the first and second derivatives exist. In the case of a reflecting boundary at zero, direct computation gives

Thus, (Sf)(0) does not exist unless f′(0+) = 0. If f′(0+) = 0, then (Sf)(0) = ½f″(0+), so not only is (Sf)(x) defined at x = 0, but it is also continuous there.
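Since the reflected process is just |x + W(t)| for a standard Brownian motion W, the density of Proposition 16.17, p_t(y | x) = φ_t(y − x) + φ_t(y + x) with φ_t the N(0, t) density (restated here from the reflection argument; the displayed formula is lost in this copy), can be checked by direct sampling:

```python
import math, random

rng = random.Random(4)
x, t, c = 0.5, 1.0, 1.0                  # illustrative start, time, and level

# The reflected value at time t is |x + W(t)|; no path simulation is needed.
n = 200_000
est = sum(abs(x + rng.gauss(0.0, math.sqrt(t))) <= c for _ in range(n)) / n

# Integral of phi_t(y - x) + phi_t(y + x) over 0 <= y <= c.
Phi = lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2)))
s = math.sqrt(t)
exact = (Phi((c - x) / s) - Phi(-x / s)) + (Phi((c + x) / s) - Phi(x / s))
# est should agree with exact
```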

If {0} is absorbing, then for any f, (Sf)(0) = 0. If we want (Sf)(x) to be continuous at zero, we must add the restriction f″(0+) = 0.

Does the backwards equation (16.13) have the transition probabilities of the process as its unique solution? Even if we add the restriction that we will consider only solutions which are densities of transition probabilities of Markov processes, the examples above show that the solution is not


unique. However, note that in the case of absorption

for all t, y > 0. Intuitively this makes sense, because the probability, starting from x, of being absorbed at zero before hitting the point y goes to one as x → 0. For reflection, use the symmetry to verify that

If either of the above boundary conditions is imposed on the backwards equation, it is possible to show that there is a unique solution which is a set of transition probability densities.

Reflection or absorption is not the only type of behavior possible at boundary points. Odd things can occur, and it was the occurrence of some of these eccentricities which first prompted Feller's investigation [56] and eventually led to a complete classification of boundary behavior.

Problems

4. Show that the process X(t) defined on [0, 1] by folding over Brownian motion,

if |X₀(t) − 2n| ≤ 1, is a Markov process with stationary transition probabilities such that

Evaluate (Sf)(x) for x ∈ (0, 1). For what functions does (Sf)(x) exist at the endpoints? [This process is called Brownian motion with two reflecting boundaries.]
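A quick way to see the folding map of Problem 4 concretely is as a triangle wave of period 2 (the function below is my own illustrative sketch, not the text's notation):

```python
def fold(x):
    # Map a point of the unrestricted Brownian path into [0, 1]:
    # fold(x) = |x - 2n| for the integer n with |x - 2n| <= 1,
    # which is a triangle wave of period 2.
    return 1.0 - abs((x % 2.0) - 1.0)

# folding a few values of a free path into the two-barrier state space
folded = [fold(x) for x in (0.3, 1.2, -0.7, 2.5)]
```

Applying `fold` pointwise to an unrestricted Brownian path X₀(t) produces the process of Problem 4.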

5. For Brownian motion on [0, ∞), either absorbing or reflecting, evaluate the density r_λ(y | x) of the resolvent R_λ(dy | x). For C the class of all bounded continuous functions on [0, ∞), show that

a) For absorbing Brownian motion,

b) For reflecting Brownian motion,

[See Problems 10 and 11, Chapter 15. Note that in (a), R_λ(dy | 0) assigns all of its mass to {0}.]
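For the normalized case (σ² = 1) the resolvent densities of Problem 5 can be obtained by the method of images; the formulas below are a sketch of that standard computation, not taken from the text. The free resolvent density is (2λ)^(−1/2) e^{−√(2λ)|y−x|}, and subtracting or adding an image source at −x gives

```latex
% absorbing barrier at 0: subtract the image source at -x
r_\lambda(y \mid x) = \frac{1}{\sqrt{2\lambda}}
   \left( e^{-\sqrt{2\lambda}\,|y-x|} - e^{-\sqrt{2\lambda}\,(x+y)} \right),
\qquad
% reflecting barrier at 0: add the image source
r_\lambda(y \mid x) = \frac{1}{\sqrt{2\lambda}}
   \left( e^{-\sqrt{2\lambda}\,|y-x|} + e^{-\sqrt{2\lambda}\,(x+y)} \right).
```

The absorbing density vanishes as x ↓ 0, consistent with the remark that R_λ(dy | 0) then sits entirely on {0}.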


4. FELLER PROCESSES

The previous definitions and examples raise a host of interesting and difficult questions. For example:

1) Given the form of S, that is, given σ²(x) and μ(x) defined on int(F), and certain boundary conditions, does there exist a unique Markov process with continuous paths having S as its infinitesimal operator and exhibiting the desired boundary behavior?

2) If the answer to (1) is yes, do the transition probabilities have a density p_t(y | x)? Are these densities smooth enough in x so that they are a solution of the backwards equations? Do the backwards equations have a unique solution?

The approach that will be followed is similar to that of the previous chapter. Investigate first the properties of the class of processes we want to construct, and try to simplify their structure as far as possible. Then work backwards from the given infinitesimal conditions to a process of the desired type.

To begin, assume the following.

Assumption 16.18(a). F is an interval (closed, open, or half-open), finite or infinite. X(t) is a Markov process with stationary transition probabilities such that starting from any point x ∈ F, all sample paths are continuous.

The next step would be to prove that 16.18(a) implies that the strong Markov property holds. This is not true. Consider the following counterexample: Let F = R^(1). Starting from any x ≠ 0, X(t) is Brownian motion starting from that point. Starting from x = 0, X(t) ≡ 0. Then, for x ≠ 0,

But if X(t) were strong Markov, since t* is a Markov time,

The pathology in this example is that starting from the point x = 0 gives distributions drastically different from those obtained by starting from any point x ≠ 0. When you start going through the proof of the strong Markov property, you find that it is exactly this large change in the distribution of the process when the initial conditions are changed only slightly that needs to be avoided. This recalls the concept of stability introduced in Chapter 8.

Definition 16.19. Call the transition probabilities stable if for any sequence of initial distributions π_n(·) converging weakly to π(·), the corresponding probabilities satisfy

for any t > 0. Equivalently, for all φ(x) continuous and bounded on F, E_x φ(X(t)) is continuous on F.


Assume, in addition to 16.18(a),

Assumption 16.18(b). The transition probabilities of X(t) are stable.

Definition 16.20. A process satisfying 16.18(a),(b) will be called a Feller process.

Now we can carry on.

Theorem 16.21. A Feller process has the strong Markov property.

Proof. Let φ(x) be bounded and continuous on F; then ψ(x) = E_x φ(X(t)) is likewise. Let t* be a Markov time, t*_n a sequence of Markov times such that t*_n ↓ t* and each t*_n takes on only a countable number of values. From (15.20), for C ∈ 𝓕(X(t), t ≤ t*),

The path continuity and the continuity of φ, ψ give

Thus, for φ continuous,

The continuous functions separate; thus, for any B ∈ 𝓑₁(F),

To extend this, let φ(x₁, …, x_k) on F^(k) equal a product φ₁(x₁) ⋯ φ_k(x_k), where φ₁, …, φ_k are bounded and continuous on F. It is easy to check that

is continuous in x. By the same methods as in (15.20), get

conclude that

and now use the fact that products of bounded continuous functions separate probabilities on F^(k).

Of course, this raises the question of how much stronger a restriction stability is than the strong Markov property. The answer is: not much!


To go the other way, it is necessary that the state space have something like an indecomposability property: every point of F can be reached from every interior point.

Definition 16.22. The process is called regular if for every x ∈ int(F) and y ∈ F, P_x(t*_y < ∞) > 0.

Theorem 16.23. If X(t) is regular and strong Markov, then its transition probabilities are stable.

Proof. Let x < y < z, x ∈ int(F). Define t_y = min(t*_y, s), t_z = min(t*_z, s), for s > 0 such that P_x(t*_z ≤ s) > 0. These are Markov times. Take φ(x) bounded and continuous, ψ(x) = E_x φ(X(t)). By the strong Markov property,

Suppose that on the set {t*_z < ∞}, t*_y ↑ t*_z a.s. P_x as y ↑ z, implying t_y ↑ t_z a.s. So E_x φ(X(t + t_y)) → E_x φ(X(t + t_z)). The right-hand sides of the above equations are

and

For s a continuity point of P_x(t*_z ≤ s), P_x(t*_y ≤ s) → P_x(t*_z ≤ s), and the sets {t*_y > s} ↓ {t*_z > s}. The conclusion is that ψ(y) → ψ(z).

The final part is to get t*_y ↑ t*_z as y ↑ z. Let t*_z(ω) < ∞ and denote τ = lim t*_y as y ↑ z. On the set {t*_y < ∞}, X(t*_y) = y. By the path continuity, X(τ) = z, so τ = t*_z. By varying x to the left and right of points, 16.23 results.

From here on, we work only with regular processes.

Problem 6. Let C be the class of all bounded continuous functions on F. For a Feller process show that if f ∈ C, then R_λ f is in C. Deduce that D(S, C) ⊂ D(S, B). Prove that there is at most one Feller process corresponding to a given S, D(S, C). (See Theorem 15.53.)

5. THE NATURAL SCALE

For pure jump processes, the structure was decomposable into a space structure, governed by a discrete time Markov chain, and a time rate which determined how fast the particle moved through the paths of the space structure. Regular Feller processes can be decomposed in a very similar way. The idea is clearer when it is stated a bit more generally. Look at a path-continuous Markov process with stable stationary transition probabilities taking values in n-dimensional space, X(t) ∈ R^(n). Consider a set B ∈ 𝓑_n; let t*(B) be the first exit time from B; then, using the nontrivial fact that t*(B) is measurable, define probabilities on the Borel subsets of the boundary of B by

The Q_x(C) are called exit distributions and specify the location of X(t) upon first leaving B. Suppose two such processes have the same exit distributions for all open sets. Then we can prove that under very general conditions they differ from each other only by a random change of time scale [10]. Thus the exit distributions characterize the space structure of the process. To have the exit distributions make sense, it is convenient to know that the particle does exit a.s. from the set in question. Actually, we want and get much more than this.

Theorem 16.24. Let J be a finite open interval such that J̄ ⊂ F. Then

sup_{x∈J} E_x t*(J) < ∞.

Proof. First we need

Lemma 16.25. Let sup_{x∈J} P_x(t*(J) > t) = α < 1 for some t > 0. Then

Proof. Let α_n = sup_{x∈J} P_x(t*(J) > nt). Write

Let t̄*(J) be the first exit time from J, starting from time nt. Then

Since {t̄*(J) > nt} ∈ 𝓕(X(r), r ≤ nt), then

Hence α_{n+1} ≤ α·α_n, and


To finish the proof of 16.24, let J = (a, b) and pick y ∈ J. By regularity, there exist t and α such that P_y(t*_a > t) ≤ α < 1 and P_y(t*_b > t) ≤ α < 1. For y ≤ x < b,

and for a < x < y,

Apply the lemma now to deduce the theorem.

Remark. Note that the lemma has wider applicability than the theorem. Actually, it holds for all intervals J, finite or infinite.

In one dimension the relevant exit probabilities are:

Definition 16.26. For any open interval J = (a, b) such that P_x(t*(J) < ∞) = 1, x ∈ J, define

p⁺(x, J) = P_x(X(t*(J)) = b),   p⁻(x, J) = P_x(X(t*(J)) = a).

Theorem 16.27. There exists a continuous, strictly increasing function u(x) on F, unique up to a linear transformation, such that for J̄ ⊂ F, J = (a, b),

p⁺(x, J) = (u(x) − u(a)) / (u(b) − u(a)).   (16.28)

Proof. For J ⊂ I, note that exiting right from I starting from x ∈ J can occur in two ways:

i) Exit right from J starting from x, then exit right from I starting from b.
ii) Exit left from J starting from x, then exit right from I starting from a.

Use the strong Markov property to conclude that for x ∈ J,

p⁺(x, I) = p⁺(x, J) p⁺(b, I) + p⁻(x, J) p⁺(a, I),   (16.29)

or

If F were bounded and closed, then we could take u(x) = p⁺(x, int(F)) and satisfy the theorem. In general, we have to use extension. Take I₀ to be a bounded open interval such that if F includes any one of its endpoints, then I₀ includes that endpoint; otherwise I₀ is arbitrary. Define u(x) on I₀ = (x₁, x₂) as p⁺(x, I₀). By the equation above, for I₀ ⊂ I₁, x ∈ I₀,


Define an extension u^(1)(x) on I₁ by the right-hand side of (16.30). Suppose another interval I₂ is used to get an extension, say I₁ ⊂ I₂. Then for x ∈ I₂, we would have

For x ∈ I₁, (16.29) gives

Substitute this into (16.30) to conclude that u^(1)(x) = u^(2)(x) on I₁. Thus the extensions are unique. Continuing this way, we can define u(x) on int(F) so that (16.28) is satisfied.

It is increasing; otherwise there exist a finite open J, J̄ ⊂ F, and x ∈ J such that p⁺(x, J) = 0. This contradicts regularity. Extend u, by taking limits, to endpoints of F included in F. Now let J_n be open, J_n ↑ J = (a, b). I assert that t*(J_n) ↑ t*(J). Because t*(J_n) ≤ t*(J), by monotonicity t* = lim_n t*(J_n) exists, and by continuity X(t*) = a or b. For any ε > 0 and n sufficiently large,

Since X(t*(J_n)) → X(t*(J)) a.s., taking limits of the above equation gives

By taking either a_n ↓ a or b_n ↑ b we can establish the continuity. The fact that u(x) is unique up to a linear transformation follows from (16.28).

We will say that a process is on its natural scale if it has the same exit distributions as Brownian motion. From (13.3),

Definition 16.31. A process is said to be on its natural scale if for every J = (a, b), J̄ ⊂ F,

p⁺(x, J) = (x − a) / (b − a);

that is, if u(x) = x satisfies (16.28).
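The identity p⁺(x, J) = (x − a)/(b − a) has an exact discrete analogue: for a symmetric random walk on a grid, the exit-right probability is harmonic in the starting point. A small numerical sketch (the Gauss-Seidel solver and grid size are my own choices, assuming a walk stepping ±1 with probability ½):

```python
def exit_right_probs(N, sweeps=20000):
    # p[k] = P(hit N before 0 | start at k) for a symmetric random walk
    # on {0, ..., N}; solve the harmonicity relation
    # p[k] = (p[k-1] + p[k+1]) / 2 by Gauss-Seidel sweeps.
    p = [0.0] * (N + 1)
    p[N] = 1.0
    for _ in range(sweeps):
        for k in range(1, N):
            p[k] = 0.5 * (p[k - 1] + p[k + 1])
    return p

p = exit_right_probs(10)
# numerically, p[k] is k/10, the discrete form of (x - a)/(b - a)
```

The harmonic solution with boundary values 0 and 1 is linear in the starting point, which is exactly the natural-scale property.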

The distinguishing feature of the space structure of normalized Brownian motion is that it is driftless. There is as much tendency to move to the right as to the left. More formally, if J is any finite interval and x₀ its midpoint, then for normalized Brownian motion, by symmetry, p⁺(x₀, J) = ½. We generalize this to

Proposition 16.32. A process is on its natural scale if and only if for every finite open J, J̄ ⊂ F, with x₀ the midpoint of J, p⁺(x₀, J) = ½.


Proof. Consider n points equally spaced in J,

Starting from x_k, the particle next hits x_{k−1} or x_{k+1} with equal probability, so p⁺(x_k, (x_{k−1}, x_{k+1})) = ½. Therefore, the particle behaves like a symmetric random walk on the points of the partition. From Chapter 7, Section 10,

The continuity established in Theorem 16.27 completes the proof of 16.32.

Let the state space undergo the transformation x̄ = u(x). Equivalently, consider the process

If X(t) is a regular Feller process, then so is X̄(t). The importance of this transformation is:

Proposition 16.34. X̄(t) is on its natural scale.

Proof. Let J = (a, b), and set ā = u(a), b̄ = u(b), J̄ = (ā, b̄). For the X̄(t) process, with x̄ = u(x),

For any regular Feller process, then, a simple space transformation gives another regular Feller process having the same space structure as normalized Brownian motion. Therefore, we restrict attention henceforth to this type and examine the time flow.

Remark. The reduction to natural scale derived here by using the transformation x̄ = u(x) does not generalize to Feller processes in two or more dimensions. Unfortunately, then, a good deal of the theory that follows just does not generalize to higher dimensions.
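Though the coefficients σ²(x) and μ(x) have not been brought back into the discussion, it may help to record the classical closed form the scale function takes for a diffusion with those infinitesimal parameters (a standard formula stated here for orientation, not a result proved in this section):

```latex
u(x) \;=\; \int_{x_0}^{x} \exp\!\left( -\int_{x_0}^{y} \frac{2\,\mu(s)}{\sigma^2(s)}\,ds \right) dy .
```

When μ ≡ 0, u is linear, so a driftless diffusion is already on its natural scale; this is consistent with the Brownian case above.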

6. SPEED MEASURE

The functions m(x, J) = E_x t*(J), for open intervals J, determine how fast the process moves through its paths. There is a measure m(dx) closely associated with these functions. Define, for J = (a, b) finite,

G_J(x, y) = 2 (min(x, y) − a)(b − max(x, y)) / (b − a),   x, y ∈ J.   (16.35)

Then,

Theorem 16.36. Let X(t) be on its natural scale. Then there is a unique measure m(dx) defined on 𝓑₁(int F), with m(B) < ∞ for B bounded, B̄ ⊂ int(F), such that for finite open J, J̄ ⊂ F,

m(x, J) = ∫_J G_J(x, y) m(dy).

Proof. The proof will provide some justification for the following terminology.

Definition 16.37. m(dx) is called the speed measure for the process.

Consider (a, b) partitioned by points a = x₀ < x₁ < ⋯ < x_n = b, where x_k = a + kδ. Define J_k = (x_{k−1}, x_{k+1}). Note that m(x_k, J_k) gives some quantitative indication of how fast the particle is traveling in the vicinity of x_k. Consider the process only at the exit times from one of the J_k intervals. This is a symmetric random walk Z₀, Z₁, … moving on the points x₀, …, x_n. Let n(x_j; x_k) be the expected number of times the random walk visits the point x_k, starting from x_j, before hitting x₀ or x_n. Then

This formula is not immediately obvious, because t*(J) and X(t*(J)) are not, in general, independent. Use this argument: Let t_N be the time taken for the transition Z_N → Z_{N+1}. For x ∈ {x₀, …, x_n},

Sum over N, noting that

This function was evaluated in Problem 14, Chapter 7, with the result,


Defining m(dx) as a measure that gives mass m(x_k, J_k)/2δ to the point x_k, we get

Now to get m(dx) defined on all of int(F): partition F by successive refinements 𝒯^(n) having points a distance δ_n apart, with δ_n → 0. Define the measure m_n

as assigning mass m(x_k, J_k)/2δ_n to all points x_k ∈ 𝒯^(n) which are not endpoints of the partition. For a, b,

For any finite interval I such that Ī ⊂ int(F), (16.39) implies that lim_n m_n(I) < ∞. Use this fact to conclude that there exists a subsequence m_n converging weakly to m, where m is a measure on 𝓑(int F) (see 10.5). Furthermore, for any J = (a, b) and x ∈ J, where a, b, x are in ⋃_n 𝒯^(n),

For an arbitrary finite open interval J, J̄ ⊂ F, take J₁ ⊂ J, where J₁ has endpoints in ⋃_n 𝒯^(n), and pass to the limit as J₁ ↑ J to get (16.40) holding for J and any x ∈ ⋃_n 𝒯^(n). To extend this to arbitrary x, we introduce an identity: If I ⊂ J, I = (a, b), x ∈ I, then the strong Markov property gives

Take I finite and open, and use (16.41) to write

Take z ↑ x. By (16.40), m(z, (y, x)) → 0, so

Since the integral ∫_J G_J(x, y) m(dy) is continuous in x, (16.40) holds for all x ∈ J. But now the validity of (16.40), and the fact that the set of functions {G_J(x, ·)}, for a, b, x ∈ int(F), is separating on int(F), imply that m(dy) is unique, and the theorem follows.

One question left open is the assignment of mass to closed endpoints of F. This we defer to Section 7.


Problems

7. If X(t) is not on its natural scale, show that m(dx) can still be defined by

by using the definition

8. For X(t) Brownian motion with zero drift, show that for J = (a, b),

Use this to prove

[Another way to see the form of m(dy) for Brownian motion is this: m_n(dy) assigns equal masses to equally spaced points. The only possible limit measure of measures of this type is easily seen to be a constant multiple c dy of Lebesgue measure. If c₀ is the constant for normalized Brownian motion, a scale transformation of X(t) shows that c = c₀/σ².]

9. For x ∈ int(F) and J_n open neighborhoods of x such that J_n ↓ {x}, show that t*(J_n) → 0 in probability. Deduce that t*({x}) = 0 a.s. P_x.

10. For f(x) a bounded continuous function on F, and J any finite open interval such that J̄ ⊂ F, prove that

[Use the approximation by random walks on partitions.]
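The identity in Problem 10 can be checked numerically for normalized Brownian motion, where m(dy) = dy and the Green's function has the standard form G_J(x, y) = 2(x∧y − a)(b − x∨y)/(b − a) (my reconstruction of (16.35); the check below confirms it reproduces E_x t*(J) = (x − a)(b − x)):

```python
def G(x, y, a, b):
    # Green's function for Brownian motion on (a, b), absorbed at the endpoints
    lo, hi = (x, y) if x <= y else (y, x)
    return 2.0 * (lo - a) * (b - hi) / (b - a)

def mean_exit_time(x, a, b, n=20000):
    # E_x t*(J) = integral of G_J(x, .) over J against m(dy) = dy,
    # approximated by the midpoint rule
    h = (b - a) / n
    return sum(G(x, a + (i + 0.5) * h, a, b) for i in range(n)) * h
```

Since G_J is piecewise linear in y with a single kink at y = x, the midpoint rule is essentially exact here.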

7. BOUNDARIES

This section sets up classifications that summarize the behavior at the boundary points of F. If F is infinite in any direction, say to the right, call +∞ the boundary point on the right; similarly for −∞.

For a process on its natural scale, the speed measure m(dx) defined on int(F) will to a large extent determine the behavior of the process at the boundaries of F. For example, we would expect that knowing the speed measure in the vicinity of a boundary point b of F would tell us whether the process would ever hit b, hence whether b was an open or closed endpoint of F. Here, by an open endpoint we mean b ∉ F; by closed, b ∈ F. In fact, we have

Proposition 16.43. If b is a finite endpoint of F, then b ∉ F if and only if, for J ⊂ F any nonempty open neighborhood with b as endpoint,

Remark. A closed endpoint of F is sometimes called accessible, for obvious reasons.

Proof. Assume b is a right endpoint. If b ∈ F, then there exists a t such that for c ∈ int(F), P_c(t* > t) = α < 1. Let J = (c, b). Then P_x(t*(J) > t) ≤ α for all x ∈ J. Use Lemma 16.25 to get m(x, J) < ∞ for all x ∈ J. For b ∈ F, J̄ ⊂ F; hence 16.36 holds:

Therefore the integral over (x, b) is finite. Conversely, if the integral is finite for one c ∈ int(F), then as z ↑ b, if b ∉ F and x > c,

The left-hand side of this is nondecreasing as x ↑ b. But the integral equals

which goes to zero as x ↑ b.

For open boundary points b, there is a two-way classification.

Definition 16.44. Let x ∈ int(F), y ∈ int(F), y → b monotonically. Call b

natural if for all t > 0, lim_{y→b} P_y(t*_x ≤ t) = 0;

entrance if there is a t > 0 such that lim_{y→b} P_y(t*_x ≤ t) > 0.

A natural boundary behaves like the points at ∞ for Brownian motion. It takes the particle a long time to get close to such a boundary point and then a long time to get away from it. An entrance boundary has the odd property that it takes a long time for the particle to get out to it, but not to get away from it.


Proposition 16.45. Let b be an open boundary point.

a) If b is finite, it must be natural.
b) If b is infinite, it is natural if and only if, for J ⊂ F any open interval with b as an endpoint,

∫_J |y| m(dy) = ∞.

Proof. Take b finite and right-hand, say. If b is an entrance point, then for J = (c, b), use Lemma 16.25 to get m(x, J) ≤ M < ∞ for all x ∈ J. This implies, if we take J₁ ↑ J and use the monotone convergence theorem, that

This is impossible, by 16.43, since b is an open endpoint. For b = ∞, check that

lim m

If b = ∞ is an entrance boundary, then there is an a such that for J = (a, ∞),

for all x ∈ J. Use Lemma 16.25 to get

Taking x → ∞ proves that ∫_J |y| dm < ∞. Conversely, if the integral is finite for J = (a, ∞), then there exists an M < ∞ such that for all c < ∞, m(x, (a, c)) ≤ M. Assume b = ∞ is not an entrance boundary. For any 0 < ε < ½, take c, x such that a < x < c and

Then

and this completes the proof.
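As a concrete instance of part (b), take normalized Brownian motion, whose speed measure is m(dy) = dy (Problem 8). Then for J = (a, ∞),

```latex
\int_{J} |y|\, m(dy) \;=\; \int_{a}^{\infty} |y|\, dy \;=\; \infty ,
```

so +∞ is a natural boundary for Brownian motion, matching the intuition that the particle neither reaches ∞ nor comes in from it in finite time.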

There is also a two-way classification of closed boundary points b. This is connected with the assignment of speed measure m({b}) at these points. Say b is a closed left-hand endpoint. We define a function G_J(x, y) for all intervals of the form J = [b, c) and x, y ∈ J as follows: Let J̃ be the reflection of J around b, that is, J̃ = (b − (c − b), b], and let ỹ be the reflection of y around b, ỹ = b − (y − b). Then define


where G_{J∪J̃} is defined by (16.35). This definition leads to an extension of (16.40).

Theorem 16.47. It is possible to define m({b}) so that for all finite intervals J = [b, c),

Remark. Note that if ∫_{(b,c)} m(dy) = ∞, then no matter how m({b}) is assigned, m(x, J) = ∞. This leads to

Definition 16.48. Let b be a closed boundary point, J ⊂ F any finite open interval with b as one endpoint such that J̄ ≠ F. Call b

regular if ∫_J m(dy) < ∞,

exit if ∫_J m(dy) = ∞.

Proof of 16.47. The meaning of the definition of G_J will become more intuitive in the course of this proof. Partition any finite interval [b, c) by equally spaced points b = x₀ < x₁ < ⋯ < x_n = c, a distance δ apart. Define J_k = (x_{k−1}, x_{k+1}), k = 1, …, n − 1, and J₀ = [x₀, x₁). By exactly the same reasoning as in deriving 16.38,

where n^(r)(x_j; x_k) is the expected number of visits to x_k of a random walk Z₀, Z₁, … on x₀, …, x_n starting from x_j, with reflection at x₀ and absorption at x_n. Construct a new state space x_{−n}, …, x_{−1}, x₀, …, x_n, where x_{−k} = x̃_k, and consider a symmetric random walk Z̃₀, Z̃₁, … on this space with absorption at x_{−n}, x_n. Use the argument that Z_k = |Z̃_k − x₀| + x₀ is the reflected random walk we want, to see that for x_k ≠ x₀,

where n(x_j; x_k) refers to the Z̃ random walk, and

Hence, for G_J(x, y) as defined in (16.46),


For m(x_k, J_k), k ≥ 1, use the expression ∫_{J_k} G_{J_k}(x_k, y) m(dy). Then

The integrand converges uniformly to G_J(x_j, y) as δ → 0. Hence, for b not an exit boundary, the second term of this converges to

as we pass through successive refinements such that δ → 0. Therefore, m(x₀, J₀)/2δ must converge to a limit, finite or infinite. Define m({b}) to be this limit. Then for x_j any point in the successive partitions,

For m({b}) < ∞, extend this to all x ∈ J by an argument similar to that in the proof of 16.36. For m({b}) = ∞, show in the same way that m(x, J) = ∞ for all x ∈ J.

Note that the behavior of the process is specified at all boundary points except regular boundary points by m(dy) on 𝓑(int F). Summarizing graphically, for b a right-hand endpoint,

Classification        Type
int(F) ↮ b           natural boundary
int(F) ← b           entrance boundary
int(F) → b           exit boundary
int(F) ⇄ b           regular boundary

The last statement (⇄ for regular boundary points) needs some explanation. Consider reflecting and absorbing Brownian motion on [b, ∞) as described in Section 3. Both of these processes have the same speed measure m(dy) = dy on (b, ∞), and b is a regular boundary point for both of them. They differ in the assignment of m({b}). For the absorbing process, obviously m({b}) = ∞; for the reflecting process m({b}) < ∞. Hence, in terms of m on int(F) it is possible to go int(F) → b and b → int(F). Of course, the latter is ruled out if m({b}) = ∞, so ⇄ should be understood above to mean "only in terms of m on int(F)."

Definition 16.49. A regular boundary point b is called

absorbing if m({b}) = ∞,
slowly reflecting if 0 < m({b}) < ∞,
instantaneously reflecting if m({b}) = 0.

See Problem 11 for some interesting properties of a slowly reflecting boundary.


Problems

11. Show, by using the random walk approximation, that

Conclude that if m({b}) > 0, then X(t) spends a positive length of time at the point b, so that p_t({b} | b) > 0 for some t. Show also that for almost all sample paths, {t; X(t) = b} contains no intervals of positive length and no isolated points.

12. For b an entrance boundary, J = (b, c), c ∈ int(F), show that

where m(b, J) = lim_{x→b} m(x, J).

13. For b a regular boundary point, J = [b, c), J̄ ⊂ F, show that

14. For b an exit boundary and any J = (a, b], x ∈ J, show that m(x, J) = ∞ (see the proof of 16.47). Use 16.25 to conclude that P_b(t* < ∞) = 0. Hence, deduce that p_t({b} | b) = 1, t > 0.

8. CONSTRUCTION OF FELLER PROCESSES

Assume that a Feller process is on its natural scale. Because X(t) has the same exit probabilities as Brownian motion X₀(t), we should be able to construct a process with the same distribution as X(t) by expanding or contracting the time scale of the Brownian motion, depending on the current position of the particle. Suppose m(dx) is absolutely continuous with respect to Lebesgue measure,

m(dx) = V(x) dx,

where V(x) is continuous on F. For J small,

where m^(0)(x, J) is the expected time for X₀(t) to exit from J. So if it takes Brownian motion time Δt to get out of J, it takes X(t) about V(x) Δt to get out of J. Look at the process X₀(T(t)), where T(t) = T(t, ω) is for every ω an increasing function of time. If we want this process to look like X(t), then when t changes by V(x) Δτ, we must see that T changes by the amount Δτ. We get the differential equation


We are at the point x = X₀(T(t)), so

Integrating, we get

Hence,

Definition 16.50. Take X₀(t) to be Brownian motion on F, instantaneously reflecting at all finite endpoints. Denote

l(τ) = ∫₀^τ V(X₀(s)) ds.

Define T(t) to be the solution of l(T) = t, that is,

Remark. Because m(J) > 0 for all open neighborhoods J, {x; V(x) = 0} can contain no open intervals. But almost no sample function of X₀(t) can have any interval of constancy. Hence l(τ) is a strictly increasing continuous function of τ, and so is T(t). Further, note that for every t, T(t) is a Markov time for {X₀(t)}.
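A discrete sketch may make this construction concrete (everything here, the names, the Euler grid, and the crude linear-search inversion, is my own illustration, not the text's construction): simulate X₀ on a grid, accumulate l(τ) = ∫₀^τ V(X₀(s)) ds, and read off X(t) = X₀(T(t)) by inverting l.

```python
import random

def time_changed_path(V, n_steps, dt, seed=0):
    # Simulate a Brownian grid path x0, the additive functional
    # l(tau) = integral of V(X0(s)) ds, and the time-changed value
    # X(t) = X0(T(t)), where T is the inverse of l.
    rng = random.Random(seed)
    x0, ell = [0.0], [0.0]
    for _ in range(n_steps):
        x0.append(x0[-1] + rng.gauss(0.0, dt ** 0.5))
        ell.append(ell[-1] + V(x0[-1]) * dt)

    def X(t):
        for i, li in enumerate(ell):  # first grid time with l >= t
            if li >= t:
                return x0[i]
        return x0[-1]  # t beyond the simulated clock

    return x0, ell, X
```

Where V is large the clock l runs fast, so its inverse T advances slowly and the particle lingers, matching the heuristic that X(t) needs about V(x) Δt to leave an interval Brownian motion leaves in time Δt.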

Theorem 16.51. X(t) = X₀(T(t)) is a Feller process on natural scale with speed measure

Proof. For any Markov time t*, X₀(t*) is measurable 𝓕(X₀(s), s ≤ t*). Further, it is easy to show for any two Markov times t₁* ≤ t₂* that

This implies that X(t) = X₀(T(t)) is measurable 𝓕(X₀(s), s ≤ T(t)), for any t.

Hence P_x(X(t + τ) ∈ B | X(s), s ≤ t) equals the expectation of

given 𝓕(X(s), s ≤ t). To evaluate T(t + τ), write

where Δ = T(t + τ) − T(t). Thus Δ is the solution of l₁(Δ) = τ, where


Now X(t + τ) = X₀(T(t + τ)) = X₀(Δ + T(t)). Because l₁(τ) is the same function of the process X₀(· + T(t)) as l(τ) is of X₀(·), the strong Markov property applies:

The proof that X(t) is strong Markov is only sketched. Actually, what we will prove is that the transition probabilities for the process are stable. Let φ(x) be bounded and continuous on F. Denote by t*_y the first passage time of X₀(t) to y. As y → x, t*_y → 0 a.s. P_x. Hence, by path continuity, as y → x,

T(t + l(t*_y)) is defined as the solution of

Thus Δ = T(t + l(t*_y)) − t*_y is the solution of

Use the strong Markov property of X₀(t) to compute

Thus E_x φ(X(t)) is continuous.

Denote by t₀*(J) the first exit time of X₀(t) from J, and by t*(J) the exit time for X(t). The relationship is

Taking expectations, we get

By Problem 10, the latter integral is

Thus, the process has the asserted speed measure. To show X(t) is on its natural scale, take J = (a, b),

If m(dx) is not absolutely continuous, the situation is much more difficult. The key difficulty lies in the definition of l(τ). So let us attempt to transform the given definition of l(τ) above into an expression not involving V(x). Let L(t, J) be the time that X₀ spends in J up to time t,

Then l(τ) can be written approximately as

Suppose that

exists for all y, t. Then, assuming the limit exists in some nice way in y, we get the important alternative expression for l(τ),

That such a function l*(t, x) exists, having some essential properties, is the consequence of a remarkable theorem due to Trotter [136].

Theorem 16.54. Let X₀(t) be unrestricted Brownian motion on R^(1). Almost every sample path has the property that there is a function l*(t, y) continuous on {(t, y); t ≥ 0, y ∈ R^(1)} such that for all B ∈ 𝓑₁,

L(t, B) = ∫_B l*(t, y) dy.

Remarks. l*(t, y) is called local time for the process. It has to be nondecreasing in t for y fixed. Because of its continuity, the limiting procedure leading up to (16.53) is justified. The proof of this theorem is too long and technical to be given here. See Itô and McKean [76, pp. 63 ff.] or Trotter [136].

Assume the validity of 16.54. Then it is easy to see that local time l*(t, y) exists for Brownian motion with reflecting boundaries. For example, if x = 0 is a reflecting boundary, then for B ∈ 𝓑₁([0, ∞)),

where l̄*(t, y) = l*(t, y) + l*(t, −y).

Definition 16.55. Take X₀(t) to be Brownian motion on F, instantaneously reflecting at all closed endpoints, with local time l*(t, y). Let m(dx) be any measure on 𝓑₁(F) such that 0 < m(J) < ∞ for all finite open intervals J with J̄ ⊂ int(F). Denote

l(τ) = ∫ l*(τ, y) m(dy),

and let T(t) be the solution of l(T) = t.


Since m(F) is not necessarily finite, along some path there may be an r such that l(r) = ∞; hence l(s) = ∞ for s ≥ r. But if l(r) < ∞, then l(s) is continuous on 0 ≤ s ≤ r. Furthermore, it is strictly increasing on this range; otherwise there would be s, t with 0 ≤ t < s ≤ r such that l*(t, y) = l*(s, y) for all y ∈ F. Integrate this latter equality over F with respect to dy to get the contradiction s = t. Thus, T(t) will always be well defined except along a path such that there is an r with l(r) < ∞, l(r+) = ∞ (by the monotone convergence theorem l(τ) is always left-continuous). If this occurs, define T(t) = r for all t ≥ l(r). With this added convention, T(t) is continuous and strictly increasing unless it becomes identically constant.

Theorem 16.56. X₀(T(t)) is a Feller process on natural scale with speed measure m(dy).

Proof. That X(t) = X₀(T(t)) is on natural scale is again obvious. To compute the speed measure of X(t), use (16.52): t₀*(J) = T(t*(J)), or l(t₀*(J)) = t*(J). Hence

The integrand does not depend on m(dy). By the definition and continuity properties of local time, the integral

is the expected amount of time the Brownian particle starting from x spends in the interval I before it exits from J. We use Problem 10 to deduce that this expected time is also given by

The verification that E_x l*(t₀*(J), y) is continuous in y leads to its identification with G_J(x, y), and proves the assertion regarding the speed measure of the process. Actually, the identity of Problem 10 was asserted to hold only for the interior of F, but the same proof goes through for J including a closed endpoint of F when the extended definition of G_J(x, y) is used.

The proof in 16.51 that X(t) is strong Markov is seen to be based on two facts:
a) T(t) is a Markov time for every t ≥ 0;
b) T(t + τ) = Δ + T(t), where Δ is the solution of l₁(Δ) = τ, and l₁(τ) is the integral l(τ) based on the process X₀(· + T(t)).

It is easy to show that (a) is true in the present situation. To verify (b), observe that l*(t, y) is a function of y and the sample path for 0 ≤ ξ ≤ t.


Because

it follows that

where l₁*(s, y) is the function l*(s, y) evaluated along the sample path X₀(T(t) + ξ), 0 ≤ ξ ≤ s. Therefore

where l₁(s) is l(s) evaluated along the path X₀(T(t) + ξ), 0 ≤ ξ ≤ s. The rest goes as before.

Many details in the above proof are left to the reader to fill in. An important one is the examination of the case in which l(r) = ∞ for finite r. This corresponds to the behavior of X(t) at finite boundary points. In particular, if at a closed boundary point b the condition of accessibility (16.43) is violated, then it can be shown that the constructed process never reaches b. Evidently, for such measures m(dy), the point b should be deleted from F. With this convention we leave to the reader the proof that the constructed process X(t) is regular on F.

Problem 15. Since l*(t, 0) is nondecreasing, there is an associated measure l*(dt, 0). Show that l*(dt, 0) is concentrated on the zeros of X₀(t). That is, prove

(This problem illustrates the fact that l*(t, 0) is a measure of the time the particle spends at zero.)

9. THE CHARACTERISTIC OPERATOR

The results of the last section show that corresponding to every speed measure m(dx) there is at least one Feller process on natural scale. In fact, there is only one. Roughly, the reasoning is that by breaking F down into smaller and smaller intervals, the distribution of the time needed to get from one point to another becomes determined by the expected times to leave the small subintervals, hence by m(dx). But to make this argument firm, an excursion is needed. This argument depends on what happens over small space intervals. The operator S, which determines behavior over small time intervals, can be computed by allowing a fixed time t to elapse, averaging over X(t), and then taking t → 0. Another approach is to fix the terminal space positions and average over the time it takes the particle to get to these terminal space positions. Take x ∈ J, J open, and define

(U_J f)(x) = (E_x f(X(t*(J))) − f(x)) / E_x t*(J).

Let J ↓ {x}; if lim (U_J f)(x) exists, then see whether, under some reasonable conditions, the limit will equal (Sf)(x).

Definition 16.58. Let x ∈ F. For any neighborhood J of x open relative to F, suppose a function φ(J) is defined. Say that

if for every system J_n of such neighborhoods, J_n ↓ {x}, lim_n φ(J_n) = a.

Theorem 16.59. Let f ∈ D(S, C), where C consists of all bounded continuous functions on F. Then

for all x ∈ F.

Proof. The proof is based on an identity due to Dynkin.

Lemma 16.60. For any Markov time t* such that E_x t* < ∞, for f ∈ D(S, C) and g = Sf,

E_x f(X(t*)) − f(x) = E_x ∫₀^{t*} g(X(s)) ds.

Proof. For any bounded measurable h(x), consider f(x) = (R_λ h)(x) and write

For f in D(S, C), take h = (λ − S)f. Then by 15.51(2), f = R_λ h and (16.61) becomes

where the last integral exists for all λ > 0 by the boundedness of f(x) and E_x t* < ∞. Taking λ → 0 now proves the lemma.

To finish 16.59, if x is an absorbing or exit endpoint, then both expressions are zero at x. For any other point x, there are neighborhoods J of x, open relative to F, such that E_x t*(J) < ∞. Now g is continuous at x, so take J sufficiently small so that |g(X(t)) − g(x)| < ε, t ≤ t*(J). Then by the lemma,
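Lemma 16.60 can be sanity-checked by simulation in the simplest case. The sketch below is an assumption-laden illustration of my own, not part of the proof: for Brownian motion and f(x) = x², the generator gives g = Sf ≡ 1, so the identity reduces to E_x X(t*)² − x² = E_x t*, where t* is the exit time from (a, b); classically E_x t* = (x − a)(b − x) for Brownian motion.

```python
import numpy as np

rng = np.random.default_rng(1)

# Euler simulation of Brownian motion until exit from (a, b).
a, b, x0 = 0.0, 1.0, 0.3
dt, n_paths = 1e-4, 4000

x = np.full(n_paths, x0)
t = np.zeros(n_paths)
active = np.ones(n_paths, dtype=bool)
while active.any():
    k = int(active.sum())
    x[active] += rng.normal(0.0, np.sqrt(dt), k)
    t[active] += dt
    active &= (a < x) & (x < b)

# Dynkin-type identity with f(x) = x^2, g = Sf = 1:
lhs = float(np.mean(x**2) - x0**2)   # E_x f(X(t*)) - f(x)
rhs = float(np.mean(t))              # E_x of the integral of g up to t*
exact = (x0 - a) * (b - x0)          # classical mean exit time
print(lhs, rhs, exact)
```

Both sides agree with each other and with the closed form up to Monte Carlo and discretization error.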

Definition 16.62. Define (Uf)(x) for any measurable function f as

wherever this limit exists. Denote f ∈ D(U, I) if the limit exists for all x ∈ I.

Corollary 16.63.

There are a number of good reasons why U is usually easier to work with than S. For our purposes an important difference is that U is expressible in terms of the scale and speed measure. For any Feller process we can show that (Uf)(x) is very nearly a second-order differential operator. There is a simple expression for this when the process is on natural scale.
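For Brownian motion the limit defining U can be computed by hand: the exit distribution from the symmetric interval J = (x − h, x + h) puts mass ½ on each endpoint and has mean exit time h², so (U_J f)(x) = (½f(x + h) + ½f(x − h) − f(x))/h² → ½f″(x). A small numerical check (my own illustration, with an arbitrary test function):

```python
import math

f = math.sin
x = 0.7
target = 0.5 * (-math.sin(x))   # (1/2) f''(x) for f = sin

# (U_J f)(x) over shrinking symmetric intervals J = (x-h, x+h).
errors = []
for h in (0.1, 0.05, 0.025):
    u_j = (0.5 * f(x + h) + 0.5 * f(x - h) - f(x)) / h**2
    errors.append(abs(u_j - target))
print(errors)
```

The error shrinks like h², consistent with U acting as ½ d²/dx² on smooth functions for Brownian motion.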

Theorem 16.64. Let X(t) be on its natural scale, f ∈ D(S, C). Then

in the following sense for x ∈ int(F):

i) f′(x) exists except perhaps on the countable set of points {x; m({x}) > 0}.
ii) For J = (x₁, x₂) finite, such that f′(x₁), f′(x₂) exist,

where g(x) = (Sf)(x).

Proof. Use 16.60 and Problem 10 to get

Take J = (x − h₁, x + h₂) and use the appropriate values for p⁺(x, J) to get the following equations.


Taking h₁, h₂ ↓ 0 separately, we can show that both right- and left-hand derivatives exist. When h₁, h₂ both go to zero, the integrand converges to zero everywhere except at y = x. Use bounded convergence to establish (i).

By substituting x + h₂ = b, x − h₁ = a, form the identity

Substituting x + h for x and subtracting, we get

The integrand is bounded for all h, and

Therefore g(x)m({x}) = 0 implies that

Take a < x₁ < x₂ < b such that

Use both x₁ and x₂ in the equation above and subtract.

There is an interesting consequence of this theorem in an important special case:

Corollary 16.65. Let m(dx) = V(x) dx on ℬ(int F), where V(x) is continuous on int(F). Then f ∈ D(S, C) implies that f″(x) exists and is continuous on int(F), and for x ∈ int(F),


Problems

16. If b is a closed left endpoint, f ∈ D(S, C), show that

b reflecting

where f⁺(b) is the one-sided derivative

Show also that b absorbing or exit ⇒ (Sf)(b) = 0.

17. If X(t) is not on its natural scale, use the definition

Show that for f ∈ D(S, C), x ∈ int(F),

10. UNIQUENESS

U is given by the scale and speed. But the scale and speed can also be recovered from U.

Proposition 16.66. p⁺(x, J) satisfies

and m(x, J) satisfies

for J open and finite, J ⊂ F.

Proof. Let I ⊂ J, I = (x₁, x₂). Then for x ∈ I,

For b a closed reflecting endpoint and J = [b, c), even simpler identities hold, and 16.66 is again true for J of this form.

To complete the recovery of m, p⁺ from U, one needs to know that the solutions of the equations in 16.66 are unique, subject to the conditions m, p⁺ continuous on J = (a, b), and p⁺(b−, J) = 1, p⁺(a+, J) = 0, m(a+, J) = 0, m(b−, J) = 0.


Proposition 16.67. Let J = (a, b). If f(x) is continuous on J, then

Proof. This is based on the following minimum principle for U:

Lemma 16.68. If h(x) is a continuous function in some neighborhood of x₀ ∈ F, if h ∈ D(U, {x₀}), and if h(x) has a local minimum at x₀, then

This is obvious from the expression for U. Now suppose there is a function φ(x) continuous on J such that φ > 0 on J and Uφ = λφ on J, λ > 0. Suppose that f(x) has a minimum in J; then for ε > 0 sufficiently small, f − εφ has a minimum in J, but

By the minimum principle, f cannot have a minimum. Similarly it cannot have a maximum; thus f = 0. All that is left to do is get φ. This we will do shortly in Theorem 16.69 below.

This discussion does not establish the fact that U determines m({b}) at a closed, reflecting endpoint b, because if J = [b, c), then m(b, J) ≠ 0. In this case suppose there are two solutions f₁, f₂ of Uf = −1, x ∈ J, such that both are continuous on J, and f₁(c) = f₂(c) = 0, f₁(b) = α, f₂(b) = β > α. Form

The minimum principle implies that f(x) ≤ 0 on J. But (Uf)(b) > 0 implies there is a number d, b < d < c, such that f(d) − f(b) > 0. Hence β = α, and f₁(x) = f₂(x) on J.

Theorem 16.69. Uφ = λφ, x ∈ int(F), λ > 0, has two continuous solutions φ₊(x), φ₋(x) such that

ii) φ₊(x) is strictly increasing, φ₋(x) is strictly decreasing;
iii) at closed endpoints b, φ₊ and φ₋ have finite limits as x → b; at open right (left) endpoints φ₊ (φ₋) → ∞;
iv) any other continuous solution of Uφ = λφ, x ∈ int(F), is a linear combination of φ₊, φ₋.
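For Brownian motion, where Uφ = ½φ″, the two solutions can be written down explicitly: φ₊(x) = exp(√(2λ) x) and φ₋(x) = exp(−√(2λ) x). The check below is my own illustration (grid and λ chosen arbitrarily); it verifies the equation on a grid by second differences and confirms the monotonicity in (ii).

```python
import numpy as np

lam = 1.5
c = np.sqrt(2 * lam)
xs = np.linspace(-2.0, 2.0, 2001)
h = xs[1] - xs[0]
phi_plus = np.exp(c * xs)     # candidate strictly increasing solution
phi_minus = np.exp(-c * xs)   # candidate strictly decreasing solution

def residual(phi):
    # max of |(1/2) phi'' - lam * phi| at interior points, second differences
    half_second = 0.5 * (phi[2:] - 2 * phi[1:-1] + phi[:-2]) / h**2
    return float(np.max(np.abs(half_second - lam * phi[1:-1])))

res = max(residual(phi_plus), residual(phi_minus))
monotone_ok = bool(np.all(np.diff(phi_plus) > 0) and np.all(np.diff(phi_minus) < 0))
print(res, monotone_ok)
```

On ℝ both endpoints are open, and indeed φ₊ blows up to the right and φ₋ to the left, matching (iii).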

Proof. The idea is simple in the case F = [a, b]. Here we could take

and show that these two satisfy the theorem. In the general case, we have to use extensions of these two functions. Denote, for x < z,

For x < y < z, the time from x to z consists of the time from x to y plus the time from y to z. Use the strong Markov property to get

Now, in general, t*_y and t*_z are extended random variables and therefore not stopping times. But, by truncating them and then passing to the limit, identities such as (16.70) can be shown to hold. Pick z₀ in F to be the closed right endpoint if there is one, otherwise arbitrary. Define, for x < z₀,

Now take z₁ > z₀, and for x < z₁ define

Use (16.70) to check that the φ₊ as defined for x < z₁ is an extension of the φ₊ defined for x < z₀. Continuing this way we define φ₊(x) on F. Note that φ₊ > 0 on int(F) and is strictly increasing. Now define φ₋ in an analogous way. As w ↑ y or w ↓ y, t*_w → t*_y a.s. P_x on the set {t*_y < ∞}. Use this fact applied to (16.70) to conclude that φ₊ and φ₋ are continuous. If the right-hand endpoint b is open, then lim E_x exp[−λt*_z] = 0 as z → b; otherwise b is accessible. For z > z₀, the definition of φ₊ gives

so φ₊(z) → ∞ as z → b. For y arbitrary in F, let φ(x) = E_x exp[−λt*_y]; then assert that

Proposition 16.71. φ ∈ D(U, F − {y}), and for x ≠ y,

Proof. Take h(x) ∈ C such that h(x) = 0 for all x < y. Then f = R_λh is in D(S, C). By truncating, it is easy to show that the identity (16.61) holds for the extended stopping variable t*_y. For x < y, (16.61) now becomes

or


By 15.51,

But for x < y,

which leads to Uφ = λφ. An obviously similar argument does the trick for x > y.

Therefore φ₊, φ₋ are solutions of Uφ = λφ in int(F). Let φ be any solution of Uφ = λφ in int(F). For x₁ < x₂ determine constants c₁, c₂ so that

This can always be done if φ₊(x₂)φ₋(x₁) − φ₋(x₂)φ₊(x₁) ≠ 0. The function D(x) = φ₊(x)φ₋(x₁) − φ₋(x)φ₊(x₁) is strictly increasing in x, so D(x₁) = 0 ⇒ D(x₂) > 0. The function ψ(x) = φ(x) − c₁φ₊(x) − c₂φ₋(x) satisfies Uψ = λψ, and ψ(x₁) = ψ(x₂) = 0. The minimum principle implies ψ(x) = 0, x₁ ≤ x ≤ x₂. Thus

This must hold over int(F). For if c′₁, c′₂ are constants determined by a larger interval (x′₁, x′₂), then

This is impossible unless φ₊, φ₋ are constant over (x₁, x₂). Hence c′₁ = c₁, c′₂ = c₂.

We can now prove the uniqueness.

Theorem 16.72. For f(x) bounded and continuous on F, g(x) = (R_λf)(x), λ > 0, is in D(U, F) and is the unique bounded continuous solution of

Proof. That g(x) ∈ D(U, F) follows from 16.63 and 15.51. Further,

Suppose that there were two bounded continuous solutions g₁, g₂ on F. Then φ(x) = g₁(x) − g₂(x) satisfies

By 16.69,

If F has two open endpoints, φ cannot be bounded unless c₁ = c₂ = 0.


Therefore, take F to have at least one closed endpoint b, say to the left. Assume that φ(b) ≥ 0; otherwise, use the solution −φ(x). If φ(b) > 0, then

for all h sufficiently small. Then φ(x) can never decrease anywhere, because if it did, it would have a positive maximum in int(F), contradicting the minimum principle. If F = [b, c), then φ(x) = c₂φ₋(x). Since φ₋(b) = 1, the case φ(b) = 0 leads to φ(x) = 0. But φ(b) > 0 is impossible because φ₋(x) is decreasing. Finally, look at F = [b, c]. If φ(c) > 0 or if φ(c) < 0, then an argument similar to the above establishes that φ(x) must be decreasing on F or increasing on F, respectively. In either case, φ(b) > 0 is impossible. The only case not ruled out is φ(b) = φ(c) = 0. Here the minimum principle gives φ(x) = 0, and the theorem follows.

Corollary 16.73. There is exactly one Feller process having a given scale function and speed measure.

Proof. It is necessary only to show that the transition probabilities {p_t(B | x)} are uniquely determined by u(x), m(dx). This will be true if E_x f(X(t)) is uniquely determined for all bounded continuous f on F. The argument used to prove uniqueness in 15.51 applies here to show that E_x f(X(t)) is completely determined by the values of (R_λf)(x), λ > 0. But g(x, λ) = (R_λf)(x) is the unique bounded continuous solution of Ug = λg + f, x ∈ F.

11. φ₊(x) AND φ₋(x)

These functions (which depend also on λ) have a central role in further analytic developments of the theory of Feller processes.

For example, let J be any finite interval, open in F, with endpoints x₁, x₂. The first passage time distributions from J can be specified by the two functions

where A₊ = {X(t*(J)) = x₂}, and A₋ is the other exit set.

Theorem 16.74 (Darling and Siegert)


Remark. Since these expressions are invariant under linear transformations, we could just as well take, instead of φ₊, φ₋, any two linearly independent solutions g₊, g₋ of Ug = λg such that g₊ ↑, g₋ ↓.

Proof. Use the strong Markov property to write

Set h⁺(x) = E_x exp[−λt*_{x₂}], h⁻(x) = E_x exp[−λt*_{x₁}], and solve the above equations to get

Since h⁻(x₁) = 1, h⁺(x₂) = 1, this has the form of the theorem. By construction h⁻(x), h⁺(x) are constant multiples of φ₋(x), φ₊(x).
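As an illustration of the theorem, again only for Brownian motion, where everything is explicit (and with simulation parameters of my own choosing): taking g₊, g₋ as in the remark, the exit-right transform works out to the classical formula E_x[e^{−λt*(J)}; A₊] = sinh(c(x − x₁))/sinh(c(x₂ − x₁)) with c = √(2λ), which a crude Euler simulation reproduces.

```python
import numpy as np

rng = np.random.default_rng(2)

lam, x1, x2, x0 = 1.0, 0.0, 1.0, 0.5
c = np.sqrt(2 * lam)
exact = float(np.sinh(c * (x0 - x1)) / np.sinh(c * (x2 - x1)))

# Euler walk until exit from (x1, x2); average e^{-lam * t*} over right exits.
dt, n_paths = 1e-4, 4000
x = np.full(n_paths, x0)
t = np.zeros(n_paths)
active = np.ones(n_paths, dtype=bool)
while active.any():
    k = int(active.sum())
    x[active] += rng.normal(0.0, np.sqrt(dt), k)
    t[active] += dt
    active &= (x1 < x) & (x < x2)

estimate = float(np.mean(np.exp(-lam * t) * (x >= x2)))
print(estimate, exact)
```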

More important in determining the existence of densities of the transition probabilities p_t(B | x) and their differentiability is the following sequence of statements:

Theorem 16.75

i) (d/dx)φ₊(x) and (d/dx)φ₋(x) both exist except possibly at a countable number of points.

ii) is constant except possibly where (d/dx)φ₊ or (d/dx)φ₋ do not exist.

iii) R_λ(dy | x) ≪ m(dy) on ℬ₁(F), and putting r_λ(y, x) = (dR_λ(· | x)/dm)(y), we have

or equivalently,

for r_λ(y, x) as above.


The proof, which I will not give (see Ito and McKean, pp. 149 ff.), goes like this: Show that Uφ₊, Uφ₋ have the differential form given in 16.64. Then statement (ii) comes from the fact that

Statement (iii) comes from showing directly that the function

satisfies Ug = λg + f for all continuous bounded f.

The statement 16.75(iii), together with some eigenvalue expansions,

finally shows that p_t(dy | x) ≪ m(dy) and that p_t(y | x), the density with respect to m(dy), exists and is a symmetric function of y, x. Also, the same development shows that p_t(y | x) ∈ D(S, C) and that

Unfortunately, the proofs of these results require considerably more analytic work. It is disturbing that such basic things as existence of densities and the proofs that p_t(y | x) is sufficiently differentiable to satisfy the backwards equations lie so deep.

Problem 18. If a regular Feller process has a stationary initial distribution π(dx) on F, then show that π(dx) must be a constant multiple of the speed measure m(dx). [Use 16.75(iii).]

12. DIFFUSIONS

There is some disagreement over what to call a diffusion. We more or less follow Dynkin [44].

Definition 16.76. A diffusion is a Feller process on F such that there exist functions σ²(x), μ(x) defined and continuous on int(F), σ²(x) > 0 on int(F), with

where the convergence (→) is bounded pointwise on all finite intervals J, J ⊂ int(F).
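The limits in the definition can be watched in a simulation. The sketch below is my own illustration with an arbitrarily chosen coefficient pair μ(x) = −x, σ²(x) = 1 (an Ornstein-Uhlenbeck-type drift); the truncation in the definition is dropped here since large increments are negligible over a short horizon. Dividing the first and second moments of the increment over a short time t by t should give values near μ(x₀) and σ²(x₀).

```python
import numpy as np

rng = np.random.default_rng(3)

mu = lambda z: -z          # drift coefficient (assumed example)
sigma = lambda z: 1.0      # diffusion coefficient (assumed example)

# Euler-Maruyama steps over a short horizon t_small, many paths at once.
x0, t_small, dt, n_paths = 0.8, 0.01, 1e-4, 100_000
x = np.full(n_paths, x0)
for _ in range(int(round(t_small / dt))):
    x += mu(x) * dt + sigma(x) * np.sqrt(dt) * rng.normal(size=n_paths)

incr = x - x0
drift_est = float(incr.mean() / t_small)     # near mu(x0) = -0.8
var_est = float((incr**2).mean() / t_small)  # near sigma^2(x0) = 1.0
print(drift_est, var_est)
```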


So a diffusion is pretty much what we started with: a process that is locally Brownian. Note

Proposition 16.77. Let f(x) have a continuous second derivative on int(F). For a diffusion, f ∈ D(U, int(F)) and

Proof. Define f̄(x) = f(x) in some neighborhood J of x such that f̄(x) is bounded, has a continuous second derivative on int(F), and vanishes outside a compact interval I contained in the interior of F. On Iᶜ prove that (Sf̄)(x) → 0. Apply a Taylor expansion now to conclude that f̄ ∈ D(S, C), and that (Sf̄)(x) is given by the right-hand side above.

The scale is given in terms of μ, σ by

Proposition 16.78. For a diffusion, the scale function u(x) is the unique (up to a linear transformation) solution on int(F) of

(16.79)    ½σ²(x)u″(x) + μ(x)u′(x) = 0.

Proof. Equation (16.79) has the solution

for an arbitrary x₀. This u₀(x) has a continuous second derivative on int(F). Thus by 16.77, (Uu₀)(x) = 0, x ∈ int(F), and u₀(x) is a scale function.
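The solution referred to is the classical scale integral u₀(x) = ∫ from x₀ to x of exp(−∫ from x₀ to y of 2μ(s)/σ²(s) ds) dy. A quick numerical verification of (16.79) for the sketch case μ(x) = −x, σ²(x) = 1 (my own choice of coefficients; then u₀′(y) = e^{y²} for x₀ = 0):

```python
import numpy as np

xs = np.linspace(-1.5, 1.5, 3001)
h = xs[1] - xs[0]

u_prime = np.exp(xs**2)             # u0'(x) = exp(-int_0^x 2(-s) ds) = e^{x^2}
u_double = np.gradient(u_prime, h)  # numerical u0''

# Residual of (16.79): (1/2) sigma^2 u0'' + mu u0', with sigma^2 = 1, mu = -x.
residual = 0.5 * u_double + (-xs) * u_prime
rel_err = float(np.max(np.abs(residual[1:-1])) / np.max(u_prime))
positive_ok = bool(np.all(u_prime > 0))
print(rel_err, positive_ok)
```

The strictly positive derivative confirms that u₀ is strictly increasing, as a scale function must be.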

Now that the scale is determined, transform to natural scale, getting the process

Proposition 16.81. X̃(t) is a diffusion with zero drift and

Proof. Proposition 16.81 follows from

Proposition 16.82. Let X(t) be a diffusion on F, w(x) a function continuous on F such that |w′(x)| ≠ 0 on int(F), w″(x) continuous on int(F). Then X̃(t) = w(X(t)) is a diffusion. If x̃ = w(x),


Proof. The transformations of the drift and variance here come from

What needs to be shown is that for the revised process the convergences → μ̃(x̃), → σ̃²(x̃) take place on finite intervals in the interior of F. This is a straightforward verification, and I omit it.

By this time it has become fairly apparent from various bits of evidence that the following should hold:

Theorem 16.83. For a diffusion with zero drift the speed measure is given on ℬ₁(int F) by

Proof. Take f to be zero off a compact interval I ⊂ int(F), with a continuous second derivative. There are two expressions, given by 16.77 and 16.64, for (Uf)(x). Equating these,

or,

which implies the theorem.

These results show that μ(x), σ²(x) determine the scale function and the speed measure completely on int(F). Hence, specifying m on any and all regular boundary points completely specifies the process. The basic uniqueness result 16.73 guarantees at most one Feller process with the given scale and speed. Section 8 gives a construction for the associated Feller process. What remains is to show that the process constructed is a diffusion.

Theorem 16.84. Let m(dx) = V(x) dx on int(F), V(x) continuous and positive on int(F). The process on natural scale with this speed measure is a diffusion with μ(x) = 0 and with σ²(x) given by


Proof. Use the representation X(t) = X₀(T(t)), where T(t) is the random time change of Section 8. The first step in the proof is: Let the interval J_ε = (x − ε, x + ε), and take J ⊂ int(F) finite such that for any x ∈ J, J_ε ⊂ int(F). Then

Proof of (i°). {t*(J_ε) < t} = {t*₀(J_ε) < T(t)}. On the latter set, by taking inverses we get

But letting M = inf V(y), we find that, denoting

Since P_x(|X(t) − x| > ε) ≤ P_x(t*(J_ε) < t), condition (i) of the diffusion definition 16.76 is satisfied.

To do (ii) and (iii): By (i°) it is sufficient to prove

for x in finite J, J ⊂ int(F). Let τ*_ε = min(t, t*(J_ε)). Then we will show that

Proof of (ii°). Use (16.52); that is (dropping the ε),

But T* = min(T(t), t*₀(J)) is a stopping time for Brownian motion. Since X₀(t), X₀²(t) − t are martingales, (ii°) follows.

Use (ii°) as follows:

Hence, we get the identity


This latter integral is in absolute value less than εP_x(t*(J_ε) < t). Apply (i°) now.

For (iii), write

The remaining integral above is bounded by ε²P_x(t*(J_ε) < t), hence gives no trouble. To evaluate E_x T(τ*), write

so that

for all t and x ∈ J. Further, since T(t) → 0 as t → 0,

as t → 0. Since T(τ*)/t is bounded by M, the bounded convergence theorem can be applied to get

for every x ∈ J.

These results help us to characterize diffusions. For example, now we know that of all Feller processes on natural scale, the diffusion processes are those such that dm ≪ dx on int(F) and dm/dx has a continuous version, positive on int(F). A nonscaled diffusion has the same speed measure property, plus a scale function with nonvanishing first derivative and continuous second derivative.

Problem 19. Show that for a diffusion, the functions φ₊, φ₋ of Section 10 are solutions in int(F) of


NOTES

Most of the material in this chapter was gleaned from two recent books on this subject: Dynkin [44], of which an excellent English translation was published in 1965, and Ito and McKean [76, 1965].

The history of the subject matter is very recent. The starting point was a series of papers by Feller, beginning in 1952 [57], which used the semigroup approach. See also [58]. Dynkin introduced the idea of the characteristic operator in [43, 1955], and subsequently developed the theory using the associated concepts. The idea of random time substitutions was first exploited in this context by Volkonskii [138, 1958]. The construction of the general process using local time was completed by Ito and McKean [76].

The material as it stands now leaves one a little unhappy from the pedagogic point of view. Some hoped-for developments would be: (1) a simple proof of the local time theorem (David Freedman has shown me a simple proof of all the results of the theorem excepting the continuity of l*(t, y)); (2) a direct proof of the unique determination of the process by scale function and speed measure, to replace the present detour by means of the characteristic operator; (3) a simplification of the proofs of the existence of densities and smoothness properties for the transition probabilities.

Charles Stone has a method of getting Feller processes as limits of birth and death processes which seems to be a considerable simplification both conceptually and mathematically. Most of this is unpublished. However, see [131] for a bit of it.

The situation in two or more dimensions is a wilderness. The essential property in one dimension that does not generalize is that if a path-continuous process goes from x to y, then it has to pass through all the points between x and y. So far, the most powerful method for dealing with diffusions in any number of dimensions is the use of stochastic integral equations (see Doob [39, Chap. VI], Dynkin [44, Chap. XI]) initiated by Ito. The idea here is to attempt a direct integration to solve the equation

and its multidimensional analogs.


APPENDIX

ON MEASURE AND FUNCTION THEORY

The purpose of this appendix is to give a brief review, with very few proofs, of some of the basic theorems concerning measure and function theory. We refer for the proofs to Halmos [64] by page number.

1. MEASURES AND THE EXTENSION THEOREM

For Ω a set of points ω, define

Definition A.1. A class ℱ of subsets of Ω is a field if A, B ∈ ℱ implies that Aᶜ, A ∪ B, A ∩ B are in ℱ. The class ℱ is a σ-field if it is a field and if, in addition, Aₙ ∈ ℱ, n = 1, 2, ..., implies ⋃₁^∞ Aₙ ∈ ℱ.

Notation A.2. We will use Aᶜ for the complement of A, A − B for A ∩ Bᶜ, ∅ for the empty set, and A Δ B for the symmetric set difference

Note

Proposition A.3. For any class 𝒞 of subsets of Ω, there is a smallest field of subsets, denoted ℱ₀(𝒞), and a smallest σ-field of subsets, denoted ℱ(𝒞), containing all the sets in 𝒞.

Proof. The class of all subsets of Ω is a σ-field containing 𝒞. Let ℱ(𝒞) be the class of sets A such that A is in every σ-field that contains 𝒞. Check that ℱ(𝒞) so defined is a σ-field and that if ℱ is a σ-field, 𝒞 ⊂ ℱ, then ℱ(𝒞) ⊂ ℱ. For fields a finite construction will give ℱ₀(𝒞).

If Aₙ is a sequence of sets such that Aₙ ⊂ Aₙ₊₁, n = 1, 2, ..., and A = ⋃ Aₙ, write Aₙ ↑ A. Similarly, if Aₙ₊₁ ⊂ Aₙ, A = ⋂ Aₙ, write Aₙ ↓ A. Define a monotone class of subsets 𝒞 by: if Aₙ ∈ 𝒞 and Aₙ ↑ A or Aₙ ↓ A, then A ∈ 𝒞.

Monotone Class Theorem A.4 (Halmos, p. 27). The smallest monotone class of sets containing a field ℱ₀ is ℱ(ℱ₀).


Definition A.5. A finitely additive measure μ on a field ℱ is a real-valued (including +∞), nonnegative function with domain ℱ such that for A, B ∈ ℱ, A ∩ B = ∅,

This extends to: if A₁, ..., Aₙ ∈ ℱ are pairwise disjoint, Aᵢ ∩ Aⱼ = ∅, i ≠ j, then

Whether or not the sets A₁, ..., Aₙ ∈ ℱ are disjoint,

Definition A.6. A σ-additive measure (or just measure) on a σ-field ℱ is a real-valued (+∞ included), nonnegative function with domain ℱ such that for A₁, A₂, ... ∈ ℱ, Aᵢ ∩ Aⱼ = ∅, i ≠ j,

We want some finiteness:

Definition A.7. A measure (finitely or σ-additive) on a field ℱ₀ is σ-finite if there are sets Aₖ ∈ ℱ₀ such that ⋃ₖ Aₖ = Ω and for every k, μ(Aₖ) < ∞.

We restrict ourselves henceforth to σ-finiteness! The extension problem for measures is: given a finitely additive measure μ₀ on a field ℱ₀, when does there exist a measure μ on ℱ(ℱ₀) agreeing with μ₀ on ℱ₀? A measure has certain continuity properties:

Proposition A.8. Let μ be a measure on the σ-field ℱ. If Aₙ ↓ A, Aₙ ∈ ℱ, and if μ(Aₙ) < ∞ for some n, then

Also, if Aₙ ↑ A, Aₙ ∈ ℱ, then

This is called continuity from above and below. Certainly, if μ₀ is to be extended, then the minimum requirement needed is that μ₀ be continuous on its domain. Call μ₀ continuous from above at ∅ if whenever Aₙ ∈ ℱ₀, Aₙ ↓ ∅, and μ₀(Aₙ) < ∞ for some n, then


Carathéodory Extension Theorem A.9. If μ₀ on ℱ₀ is continuous from above at ∅, then there is a unique measure μ on ℱ(ℱ₀) agreeing with μ₀ on ℱ₀ (see Halmos, p. 54).

Definition A.10. A measure space is a triple (Ω, ℱ, μ), where ℱ is a σ-field and μ a measure on it. The completion of a measure space, denoted by (Ω, ℱ̄, μ̄), is gotten by defining A ∈ ℱ̄ if there are sets A₁, A₂ in ℱ, A₁ ⊂ A ⊂ A₂, and μ(A₂ − A₁) = 0. Then define μ̄(A) = μ(A₁).

ℱ̄ is the largest σ-field for which unique extension under the hypothesis of A.9 holds. That is, μ̄ is the only measure on ℱ̄ agreeing with μ₀ on ℱ₀.

Proposition A.11. Let B ∉ ℱ̄, and let ℱ₁ be the smallest σ-field containing both B and ℱ. Then there is an infinity of measures on ℱ₁ agreeing with μ₀ on ℱ₀. (See Halmos, p. 55 and p. 71, Problem 3.)

Note that ℱ(ℱ₀) depends only on ℱ₀ and not on μ₀, but that ℱ̄(ℱ₀) depends on μ₀. The measure μ on ℱ(ℱ₀), being a unique extension, must be approximable in some sense by μ₀ on ℱ₀. One consequence of the extension construction is

Proposition A.12 (Halmos, p. 56). For every A ∈ ℱ(ℱ₀) and ε > 0, there is a set A₀ ∈ ℱ₀ such that

We will designate a space Ω and a σ-field ℱ of subsets of Ω as a measurable space (Ω, ℱ). If Γ ⊂ Ω, denote by ℱ(Γ) the σ-field of subsets of Γ of the form A ∩ Γ, A ∈ ℱ, and take the complement relative to Γ.

Some important measurable spaces are

R⁽¹⁾  the real line; ℬ₁ the smallest σ-field containing all intervals

R⁽ᵏ⁾  k-dimensional Euclidean space; ℬₖ the smallest σ-field containing all k-dimensional rectangles

R⁽∞⁾  the space of all infinite sequences (x₁, x₂, ...) of real numbers; ℬ∞ the smallest σ-field containing all sets of the form {(x₁, x₂, ...); x₁ ∈ I₁, ..., xₙ ∈ Iₙ} for any n, where I₁, ..., Iₙ are any intervals

Rᴵ  the space of all real-valued functions x(t) on the interval I ⊂ R⁽¹⁾; ℬ_I the smallest σ-field containing all sets of the form {x(·) ∈ Rᴵ; x(t₁) ∈ I₁, ..., x(tₙ) ∈ Iₙ} for any t₁, ..., tₙ ∈ I and intervals I₁, ..., Iₙ


Definition A.13. The class of all finite unions of disjoint intervals in R⁽¹⁾ is a field. Take μ₀ on this field to be length. Then μ₀ is continuous from above at ∅ (Halmos, pp. 34 ff.). The extension of length to ℬ₁ is Lebesgue measure, denoted by l or by dx.

Henceforth, if we have a measure space (Ω, ℱ, μ) and a statement holds for all ω ∈ Ω with the possible exception of ω ∈ A, where μ(A) = 0, we say that the statement holds almost everywhere (a.e.).

2. MEASURABLE MAPPINGS AND FUNCTIONS

Definition A.14. Given two spaces Ω, R and a mapping X(ω): Ω → R, the inverse image of a set B ⊂ R is defined as

Denote this by {X ∈ B}. The taking of inverse images preserves all set operations; that is,

Definition A.15. Given two measurable spaces (Ω, ℱ), (R, ℬ), a mapping X: Ω → R is called measurable if the inverse image of every set in ℬ is in ℱ.

Proposition A.16 (See 2.29). Let 𝒞 ⊂ ℬ be such that ℱ(𝒞) = ℬ. Then X: Ω → R is measurable if the inverse image of every set in 𝒞 is in ℱ.

Definition A.17. X: Ω → R⁽¹⁾ will be called a measurable function if it is a measurable map from (Ω, ℱ) to (R⁽¹⁾, ℬ₁).

From A.16 it is sufficient that {X < x} ∈ ℱ for all x in a set dense in R⁽¹⁾. Whether or not a function is measurable depends on both Ω and ℱ. Refer therefore to measurable functions on (Ω, ℱ) as ℱ-measurable functions.

Proposition A.18. The class of ℱ-measurable functions is closed under pointwise convergence. That is, if the Xₙ(ω) are each ℱ-measurable and limₙ Xₙ(ω) exists for every ω, then X(ω) = limₙ Xₙ(ω) is ℱ-measurable.

Proof. Suppose Xₙ(ω) ↓ X(ω) for each ω; then {X < x} = ⋃ₙ {Xₙ < x}. This latter set is in ℱ. In general, if Xₙ(ω) → X(ω), take Yₙ = sup over m ≥ n of Xₘ.

Then {Yₙ > y} = ⋃ over m ≥ n of {Xₘ > y}, which is in ℱ. Thus Yₙ is ℱ-measurable, and Yₙ ↓ X.


Define an extended measurable function as a function X(ω) which takes values in the extended real line R⁽¹⁾ ∪ {±∞} such that {X ∈ B} ∈ ℱ for every B ∈ ℬ₁.

By the argument above, if the Xₙ are ℱ-measurable, then lim sup Xₙ and lim inf Xₙ are extended ℱ-measurable; hence the set

is in ℱ.

Proposition A.19 (See 2.31). If X is a measurable mapping from (Ω, ℱ) to (R, ℬ) and φ is a ℬ-measurable function, then φ(X) is an ℱ-measurable function.

The set indicator of a subset A ⊂ Ω is the function

A simple function is any finite linear combination of set indicators,

of sets Aₖ ∈ ℱ.

Proposition A.20. The class of ℱ-measurable functions is the smallest class of functions containing all simple functions and closed under pointwise convergence.

Proof. For any n > 0 and X(ω) a measurable function, define sets Aₖ = {X ∈ [k/n, (k + 1)/n)} and consider

Obviously Xₙ → X.

For any measurable mapping X from (Ω, ℱ) to (R, ℬ), denote by ℱ(X) the σ-field of inverse images of sets in ℬ. Now we prove 4.9.

Proposition A.21. If Z is an ℱ(X)-measurable function, then there is a ℬ-measurable function θ such that Z = θ(X).

Proof. Consider the class of functions φ(X), X fixed, as φ ranges over the ℬ-measurable functions. Any set indicator χ_A(ω), A ∈ ℱ(X), is in this class, because A = {X ∈ B} for some B ∈ ℬ. Hence

Now the class is closed under addition, so by A.20 it is sufficient to show it closed under pointwise convergence. Let φₙ(X) → Y, φₙ ℬ-measurable.


Let B = {limₙ φₙ exists}. Then B ∈ ℬ, and Ω = {X ∈ B}. Define

Obviously, then, Y = φ(X).

We modify the proof of A.20 slightly to get 2.38:

Proposition A.22. Consider a class 𝒞 of ℱ-measurable functions having the properties

Then 𝒞 includes all nonnegative ℱ-measurable functions.

Proof. For X ≥ 0, ℱ-measurable, let Xₙ = Σₖ (k/n)χ_{Aₖ}(ω), where Aₖ = {X ∈ [k/n, (k + 1)/n)}. Then certainly Xₙ ∈ 𝒞, and Xₙ′ ↑ X if we take n′ along the subsequence {2ᵐ}.

3. THE INTEGRAL

Take (Ω, ℱ, μ) to be a measure space. Let X(ω) ≥ 0 be a nonnegative ℱ-measurable function. To define the integral of X, let Xₙ ≥ 0 be simple functions such that Xₙ ↑ X.

Definition A.23. The integral ∫ Xₙ dμ of the nonnegative simple function Xₙ = Σ αₖχ_{Aₖ}(ω), αₖ ≥ 0, is defined by Σ αₖμ(Aₖ).

For Xₙ ↑ X, it is easy to show that ∫ Xₙ₊₁ dμ ≥ ∫ Xₙ dμ ≥ 0.

Definition A.24. Define ∫ X dμ as limₙ ∫ Xₙ dμ. Furthermore, the value of this limit is the same for all sequences of nonnegative simple functions converging up to X (Halmos, p. 101).
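A discrete toy example of A.23 and A.24 (entirely my own, on a finite measure space with randomly chosen point masses): the dyadic simple functions Xₙ = ⌊2ⁿX⌋/2ⁿ increase to X along n, and their integrals, finite sums of value times mass, increase to ∫ X dμ.

```python
import numpy as np

rng = np.random.default_rng(4)

mass = rng.random(50)          # mu({omega}) for 50 atoms omega
X = 3.0 * rng.random(50)       # a nonnegative function on the atoms

def integral(Y):
    # The integral of Y with respect to mu on this finite space:
    # sum of value times point mass.
    return float(np.sum(Y * mass))

simple_integrals = []
for n in range(1, 15):
    Xn = np.floor(2.0**n * X) / 2.0**n   # simple function with dyadic values
    simple_integrals.append(integral(Xn))

exact = integral(X)
print(simple_integrals[-1], exact)
```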

Note that this limit may be infinite. For any ℱ-measurable function X, suppose that ∫ |X| dμ < ∞. In this case define the ℱ-measurable functions


Definition A.25. If ∫ |X| dμ < ∞, define

We may sometimes use the notation

The elementary properties of the integral are: if the integrals of X and Y exist,

Some nonelementary properties begin with

Monotone Convergence Theorem A.26 (Halmos, p. 112). For Xₙ ≥ 0 nonnegative ℱ-measurable functions with Xₙ ↑ X,

From this comes the

Fatou Lemma A.27. If Xₙ ≥ 0, then

Proof. To connect up with A.26, note that for X₁, ..., Xₙ arbitrary nonnegative ℱ-measurable functions,

Hence, by taking limits

Let Yₙ = inf over m ≥ n of Xₘ; then

Since Yₙ ≥ 0 and Yₙ ↑ lim inf Xₙ, apply A.26 to complete the proof.


Another useful convergence result is:

Bounded Convergence Theorem A.28 (2.44). Let Xₙ → X pointwise, where the Xₙ are ℱ-measurable functions such that there is an ℱ-measurable function Z with |Xₙ| ≤ Z for all n, ω, and ∫ Z dμ < ∞. Then (see Halmos, p. 110)

From these convergence theorems can be deduced the σ-additivity of the integral: for {Bₙ} disjoint, Bₙ ∈ ℱ, and ∫ |X| dμ < ∞,

Also, if ∫ |X| dμ < ∞, then the integral is absolutely continuous. That is, for every ε > 0 there exists a δ > 0 such that if A ∈ ℱ and μ(A) < δ, then

4. ABSOLUTE CONTINUITY AND THE RADON-NIKODYM THEOREM

Consider a measurable space (Ω, ℱ) and two measures μ, ν on ℱ.

Definition A.29. Say that ν is absolutely continuous with respect to μ, denoted ν ≪ μ, if

Call two measurable functions X₁, X₂ equivalent if μ({X₁ ≠ X₂}) = 0. Then

Radon-Nikodym Theorem A.30 (Halmos, p. 128). If ν ≪ μ, there is a nonnegative ℱ-measurable function X, determined up to equivalence, such that for any A ∈ ℱ,

Another way of denoting this is to say that the Radon derivative of ν with respect to μ exists and equals X; that is,
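On a finite space the Radon derivative is nothing but a pointwise ratio of masses, which makes A.30 easy to see concretely. This is a toy illustration of mine, not Halmos's proof; the sizes and masses are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)

mu = rng.random(8)
mu[3] = 0.0                  # mu gives the point 3 no mass
nu = rng.random(8) * mu      # then nu << mu: nu vanishes wherever mu does

# Radon derivative X = d(nu)/d(mu), defined mu-a.e. (set to 0 off the support).
X = np.divide(nu, mu, out=np.zeros_like(nu), where=mu > 0)

A = np.array([0, 2, 3, 5])            # an arbitrary event
nu_A = float(nu[A].sum())             # nu(A)
int_A = float((X[A] * mu[A]).sum())   # integral of X over A against mu
print(nu_A, int_A)
```

The two numbers agree for every event A, which is exactly the defining property of dν/dμ.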

The opposite of continuity is

Definition A.31. Say that ν is singular with respect to μ, written μ ⊥ ν, if there exists A ∈ ℱ such that


Lebesgue Decomposition Theorem A.32 (Halmos, p. 134). For any two measures μ, ν on ℱ, ν can be decomposed into two measures ν_c, ν_s, in the sense that for every A ∈ ℱ,

and ν_c ≪ μ, ν_s ⊥ μ.

For a σ-finite measure the set of points {ω; μ({ω}) > 0} is at most countable. Call ν a point measure if there is a countable set C = {ωⱼ} such that for every A ∈ ℱ,

Obviously, any measure ν may be decomposed into ν₁ + ν_p, where ν_p is a point measure and ν₁ assigns mass zero to any one-point set. Hence, on (R⁽¹⁾, ℬ₁) we have the special case of A.32.

Corollary A.33. A measure ν on ℬ₁ can be written as

where ν_p is a point measure, ν_s ⊥ l but ν_s assigns zero mass to any one-point set, and ν_c ≪ l. [Recall that l is Lebesgue measure.]

5. CROSS-PRODUCT SPACES AND THE FUBINI THEOREM

Definition A.34. Given two spaces Ω₁, Ω₂, their cross product Ω₁ × Ω₂ is the set of all ordered pairs {(ω₁, ω₂); ω₁ ∈ Ω₁, ω₂ ∈ Ω₂}. For measurable spaces (Ω₁, ℱ₁), (Ω₂, ℱ₂), ℱ₁ × ℱ₂ is the smallest σ-field containing all sets of the form

where A₁ ∈ ℱ₁, A₂ ∈ ℱ₂. Denote this set by A₁ × A₂.

For a function X(ω₁, ω₂) on Ω₁ × Ω₂, its section at ω₁ is the function on Ω₂ gotten by holding ω₁ constant and letting ω₂ be the variable. Similarly, if A ⊂ Ω₁ × Ω₂, its section at ω₁ is defined as {ω₂; (ω₁, ω₂) ∈ A}.

Theorem A.35 (Halmos, pp. 141 ff.). Let X be an ℱ₁ × ℱ₂-measurable function; then every section of X at ω₁ is an ℱ₂-measurable function. If A ∈ ℱ₁ × ℱ₂, every section of A at ω₁ is in ℱ₂.

If we have measures μ₁ on ℱ₁ and μ₂ on ℱ₂, then

Theorem A.36 (Halmos, p. 144). There is a unique measure μ₁ × μ₂ on ℱ₁ × ℱ₂ such that for every A₁ ∈ ℱ₁, A₂ ∈ ℱ₂,

This is called the cross-product measure.


Fubini Theorem A.37 (Halmos, p. 148). Let X be ℱ₁ × ℱ₂-measurable, and

Then

are respectively ℱ₂- and ℱ₁-measurable functions, which may be infinite on sets of measure zero, but whose integrals exist. And

Corollary A.38. If A ∈ ℱ₁ × ℱ₂ and μ₁ × μ₂(A) = 0, then almost every section of A has μ₂ measure zero.

This all has fairly obvious extensions to finite cross products Ω₁ × ⋯ × Ωₙ.
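A numerical illustration of the Fubini theorem (my own check, with μ₁ = μ₂ = Lebesgue measure on (0, 1) approximated by a midpoint grid, and an arbitrarily chosen bounded integrand): integrating the sections in either order gives the same double integral.

```python
import numpy as np

n = 1000
u = (np.arange(n) + 0.5) / n               # midpoint grid on (0, 1)
w1, w2 = np.meshgrid(u, u, indexing="ij")
X = np.exp(w1 * w2)                        # a bounded measurable integrand

# Integrate the omega_2-sections first, then over omega_1, and vice versa.
order1 = float(X.mean(axis=1).mean())
order2 = float(X.mean(axis=0).mean())
print(order1, order2)
```

Both orders approximate the double integral of e^{xy} over the unit square, whose value is the series sum of 1/(n · n!) over n ≥ 1, about 1.3179.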

6. THE L_r(μ) SPACES

These are some well-known results. Let (Ω, ℱ, μ) be a measure space and the functions X, Y be ℱ-measurable:

Schwarz Inequality A.39. If ∫ |X|^2 dμ and ∫ |Y|^2 dμ are finite, then so is ∫ |XY| dμ, and

(∫ |XY| dμ)^2 ≤ ∫ |X|^2 dμ · ∫ |Y|^2 dμ.
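On a finite measure space, A.39 reduces to the weighted Cauchy-Schwarz inequality for sums, which is easy to check numerically. A small sketch with randomly drawn (hypothetical) weights and functions:

```python
import random

random.seed(1)

# A finite measure space: 20 points with random positive weights (toy data).
omega = range(20)
mu = {w: random.random() for w in omega}
X = {w: random.uniform(-1, 1) for w in omega}
Y = {w: random.uniform(-1, 1) for w in omega}

def integral(f):
    """Integral of f against mu: a weighted sum over omega."""
    return sum(f[w] * mu[w] for w in omega)

# (integral |XY| dmu)^2  <=  integral |X|^2 dmu * integral |Y|^2 dmu
lhs = integral({w: abs(X[w] * Y[w]) for w in omega}) ** 2
rhs = integral({w: X[w] ** 2 for w in omega}) * integral({w: Y[w] ** 2 for w in omega})
print(lhs <= rhs)  # True
```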

Definition A.40. L_r(μ), r > 0, is the class of all 𝓕-measurable functions X such that ∫ |X|^r dμ < ∞.

Completeness Theorem A.41 (Halmos, p. 107). If X_n ∈ L_r(μ) and

∫ |X_n − X_m|^r dμ → 0

as m, n → ∞ in any way, then there is a function X ∈ L_r(μ) such that

∫ |X_n − X|^r dμ → 0.

7. TOPOLOGICAL MEASURE SPACES

In this section, unless otherwise stated, assume that Ω has a metric under which it is a separable metric space. Let 𝓒 be the class of all open sets in Ω. A measure μ on 𝓕(𝓒) is called inner regular if for any A ∈ 𝓕(𝓒)

μ(A) = sup μ(C),


where the sup is over all compact sets C ⊂ A, C ∈ 𝓕(𝓒). It is called outer regular if

μ(A) = inf μ(O),

where the inf is over all open sets O such that A ⊂ O, O ∈ 𝓕(𝓒).

Theorem A.42 (Follows from Halmos, p. 228). Any measure on 𝓕(𝓒) is both inner and outer regular.

Theorem A.43 (Follows from Halmos, p. 240). The class of 𝓕(𝓒)-measurable functions is the smallest class of functions containing all continuous functions on Ω and closed under pointwise convergence.

Theorem A.44 (Halmos, p. 241). If ∫ |X| dμ < ∞ for X 𝓕(𝓒)-measurable, then for any ε > 0 there is a continuous function φ on Ω such that

∫ |X − φ| dμ < ε.

Definition A.45. Given any measurable space (Ω, 𝓕) (Ω not necessarily metric), it is called a Borel space if there is a 1-1 mapping φ: Ω ↔ E, where E ∈ 𝓑_1, such that φ is 𝓕-measurable and φ^(−1) is 𝓑_1(E)-measurable.

Theorem A.46. If Ω is complete, then (Ω, 𝓕(𝓒)) is a Borel space.

We prove this in the case that (Ω, 𝓕(𝓒)) is (R^(∞), 𝓑_∞). Actually, since there is a 1-1 continuous mapping between R^(1) and (0, 1), it is sufficient to show that

Theorem A.47. ((0, 1)^(∞), 𝓑_∞(0, 1)) is a Borel space.

Note. Here (0, 1)^(∞) is the set of all infinite sequences with coordinates in (0, 1), and 𝓑_∞(0, 1) means 𝓑_∞((0, 1)).

Proof. First we construct the mapping Φ from (0, 1) to (0, 1)^(∞). Every number in (0, 1) has a unique binary expansion x = .x_1 x_2 ⋯ containing an infinite number of zeros. Consider the triangular array

1
2  3
4  5  6
7  8  9  10

Let

Φ(x) = (Φ_1(x), Φ_2(x), …),

where the nth coordinate is formed by going down the nth column of the array; that is,

Φ_1(x) = .x_1 x_2 x_4 x_7 ⋯,   Φ_2(x) = .x_3 x_5 x_8 ⋯,


and so on. Conversely, if x ∈ (0, 1)^(∞), x = (x^(1), x^(2), …), expand every coordinate as the unique binary decimal having an infinite number of zeros, say

x^(k) = .x_1^(k) x_2^(k) ⋯,

and define φ(x) to be the binary decimal whose nth entry is x_j^(k) if n appears in the kth column, j numbers down. That is,

φ(x) = .x_1^(1) x_2^(1) x_1^(2) x_3^(1) x_2^(2) x_1^(3) ⋯.

Clearly, Φ and φ are inverses of each other, so the mapping is 1-1 and onto. By 2.13, to show Φ is 𝓑_∞(0, 1)-measurable, it is sufficient to show that each Φ_k(x) is 𝓑_1(0, 1)-measurable. Notice that the coordinates x_1(x), x_2(x), … in the binary expansion of x are measurable functions of x, continuous except at the points which have only a finite number of ones in their expansion (the binary rationals). Furthermore, each Φ_k(x) is a sum of these; for example,

Φ_1(x) = x_1(x)/2 + x_2(x)/4 + x_4(x)/8 + x_7(x)/16 + ⋯.

Therefore, every Φ_k(x) is measurable. The proof that φ(x) is measurable similarly proceeds from the observation that x_j^(k)(x) is a measurable function of x. Q.E.D.
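The digit-interleaving at the heart of this proof is easy to make concrete. Below is a small Python sketch (the function names are ours, not the text's) that locates each position n of the triangular array in its column k at depth j, splits a finite digit string into its first m columns as Φ does, and re-interleaves the columns as φ does; on truncated expansions the two maps invert each other:

```python
def col_pos(n):
    """Locate n in the triangular array 1 / 2 3 / 4 5 6 / ...:
    returns (column k, j-th entry going down that column)."""
    r = 1
    while r * (r + 1) // 2 < n:          # find the row r containing n
        r += 1
    k = n - (r - 1) * r // 2             # offset within row r = column index
    return k, r - k + 1

def Phi(digits, m):
    """Split one digit sequence into its first m columns, as the map Phi does."""
    cols = [[] for _ in range(m)]
    for n, d in enumerate(digits, start=1):
        k, _ = col_pos(n)
        if k <= m:
            cols[k - 1].append(d)
    return cols

def phi(cols, N):
    """Inverse map: re-interleave the column sequences into one digit string."""
    out = []
    for n in range(1, N + 1):
        k, j = col_pos(n)
        out.append(cols[k - 1][j - 1])
    return out

digits = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]     # a truncated binary expansion
cols = Phi(digits, 4)                        # first 4 interleaved coordinates
print(cols[0])                               # digits at positions 1, 2, 4, 7 -> [0, 1, 0, 0]
print(phi(cols, len(digits)) == digits)      # round trip recovers the digits -> True
```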

To prove A.46 generally, see Sierpinski [123a, p. 137], where it is proved that every complete separable metric space is homeomorphic to a subset of R^(∞). (See also p. 206.)

8. EXTENSION ON SEQUENCE SPACE

A.48 (Proof of Theorem 2.18). Consider the class 𝓕_0 of all finite disjoint unions of finite-dimensional rectangles. It is easily verified that 𝓕_0 is a field, and that

𝓕(𝓕_0) = 𝓑_∞.

For any set A ∈ 𝓕_0, A = ⋃_1^m S_j, where the S_j are disjoint rectangles, define

P(A) = Σ_1^m P(S_j).

There is a uniqueness problem: suppose A can also be represented as ⋃_1^k S′_k, where the S′_k are also disjoint. We need to know that

Σ_j P(S_j) = Σ_k P(S′_k).

Write

S_j = ⋃_k (S_j ∩ S′_k),   S′_k = ⋃_j (S_j ∩ S′_k).


By part (b) of the hypothesis,

P(S_j) = Σ_k P(S_j ∩ S′_k),

so

Σ_j P(S_j) = Σ_{j,k} P(S_j ∩ S′_k).

By a symmetric argument,

Σ_k P(S′_k) = Σ_{j,k} P(S_j ∩ S′_k).

Now we are in a position to apply the Carathéodory extension theorem, if we can prove that A_n ∈ 𝓕_0, A_n ↓ ∅ implies P(A_n) → 0. To do this, assume that lim_n P(A_n) = δ > 0. By repeating some A_n in the sequence, if necessary, we can assume that

A_n = {x; (x_1, …, x_n) ∈ A*_n},

where A*_n is a union of disjoint rectangles in R^(n). By part (c) of the hypothesis we can find a set B*_n ⊂ A*_n so that B*_n is a finite union of compact rectangles in R^(n), and if

B_n = {x; (x_1, …, x_n) ∈ B*_n},

then

P(A_n − B_n) ≤ δ/2^(n+1).

Form the sets C_n = ⋂_1^n B_k, and put

C*_n = {(x_1, …, x_n); x ∈ C_n}.

Then, since the A_n are nonincreasing,

P(A_n − C_n) ≤ Σ_1^n P(A_k − B_k) ≤ δ/2.

The conclusion is that lim_n P(C_n) ≥ δ/2, and, of course, C_n ↓ ∅. Take points

x^(n) ∈ C_n.

For every n,

(x_1^(n), …, x_k^(n)) ∈ C*_k,   k ≤ n.

Take N_1 any ordered infinite subsequence of integers such that x_1^(n) → x_1 ∈ C*_1 as n runs through N_1. This is certainly possible since x_1^(n) ∈ C*_1 for all n.


Now take N_2 ⊂ N_1 such that

(x_1^(n), x_2^(n)) → (x_1, x_2) ∈ C*_2

as n runs through N_2. Continuing, we construct subsequences

N_1 ⊃ N_2 ⊃ ⋯.

Let n_k be the kth member of N_k; then for every j, x_j^(n) → x_j as n goes to infinity through {n_k}. Furthermore, the point (x_1, x_2, …) is in C_n for every n ≥ 1, contradicting C_n ↓ ∅.


BIBLIOGRAPHY

[1] BACHELIER, L., "Théorie de la spéculation, Thèse, Paris, 1900," Ann. Éc. Norm. Sup. s. 3, 17, 21-86 (1900).

[2] BAILEY, N. T. J., The Elements of Stochastic Processes, John Wiley & Sons, Inc., New York, 1964.

[3] BHARUCHA-REID, A. T., Elements of the Theory of Markov Processes and Their Applications, McGraw-Hill Book Co., Inc., New York, 1960.

[4] BIRKHOFF, G. D., "Proof of the ergodic theorem," Proc. Nat'l. Acad. Sci. 17, 656-660 (1931).

[5] ———, "Dynamical systems," American Math. Society, New York, 1927, reprinted 1948.

[6] BLACKWELL, D., "A renewal theorem," Duke Math. J. 15, 145-150 (1948).

[7] ———, "Extension of a renewal theorem," Pacific J. Math. 3, 315-320 (1953).

[8] ———, "On a class of probability spaces," Proc. of the 3rd Berkeley Symp. on Math. Stat. and Prob. Vol. II, 1-6 (1956).

[9] ——— and FREEDMAN, D., "The tail σ-field of a Markov chain and a theorem of Orey," Ann. Math. Stat. 35, No. 3, 1291-1295 (1964).

[10] BLUMENTHAL, R., GETOOR, R., and McKEAN, H. P., JR., "Markov processes with identical hitting distributions," Illinois J. Math. 6, 402-420 (1962).

[11] BOCHNER, S., Harmonic Analysis and the Theory of Probability, University of California Press, Berkeley, 1955.

[12] BREIMAN, L., "On transient Markov chains with application to the uniqueness problem for Markov processes," Ann. Math. Stat. 28, 499-503 (1957).

[13] CAMERON, R. H. and MARTIN, W. T., "Evaluations of various Wiener integrals by use of certain Sturm-Liouville differential equations," Bull. Amer. Math. Soc. 51, 73-90 (1945).

[14] CHOW, Y. S. and ROBBINS, H., "On sums of independent random variables with infinite moments and 'fair' games," Proc. Nat. Acad. Sci. 47, 330-335 (1961).

[15] CHUNG, K. L., "Notes on Markov chains," duplicated notes, Columbia Graduate Mathematical Statistical Society, 1951.

[16] ———, Markov Chains with Stationary Transition Probabilities, Springer-Verlag, Berlin, 1960.

[17] ———, "The general theory of Markov processes according to Doeblin," Z. Wahr. 2, 230-254 (1964).

[18] ——— and FUCHS, W. H. J., "On the distribution of values of sums of random variables," Mem. Amer. Math. Soc. No. 6 (1951).

[19] ——— and KAC, M., "Remarks on fluctuations of sums of independent random variables," Mem. Amer. Math. Soc. No. 6 (1951).


[20] ——— and ORNSTEIN, D., "On the recurrence of sums of random variables," Bull. Amer. Math. Soc. 68, 30-32 (1962).

[21] CRAMÉR, H., "On harmonic analysis in certain functional spaces," Ark. Mat. Astr. Fys. 28B, No. 12 (1942).

[22] DARLING, D. A. and SIEGERT, A. J., "The first passage problem for a continuous Markov process," Ann. Math. Stat. 24, 624-639 (1953).

[23] ——— and KAC, M., "On occupation times for Markov processes," Trans. Amer. Math. Soc. 84, 444-458 (1957).

[24] DINGES, H., "Ein verallgemeinertes Spiegelungsprinzip für den Prozess der Brownschen Bewegung," Z. Wahr. 1, 177-196 (1962).

[25] DOEBLIN, W., "Sur les propriétés asymptotiques de mouvements régis par certains types de chaînes simples," Bull. Math. Soc. Roum. Sci. 39, No. 1, 57-115, No. 2, 3-61 (1937).

[26] ———, "Sur certains mouvements aléatoires discontinus," Skand. Aktuarietidskr. 22, 211-222 (1939).

[27] ———, "Sur les sommes d'un grand nombre de variables aléatoires indépendantes," Bull. Sci. Math. 63, No. 1, 23-32, 35-64 (1939).

[28] ———, "Éléments d'une théorie générale des chaînes simples constantes de Markoff," Ann. Sci. École Norm. Sup. (3), 57, 61-111 (1940).

[29] ———, "Sur l'ensemble de puissances d'une loi de probabilité," Studia Math. 9, 71-96 (1940).

[30] DONSKER, M., "An invariance principle for certain probability limit theorems," Mem. Amer. Math. Soc. No. 6 (1951).

[31] ———, "Justification and extension of Doob's heuristic approach to the Kolmogorov-Smirnov theorems," Ann. Math. Stat. 23, 277-281 (1952).

[32] DOOB, J. L., "Regularity properties of certain families of chance variables," Trans. Amer. Math. Soc. 47, 455-486 (1940).

[33] ———, "Topics in the theory of Markov chains," Trans. Amer. Math. Soc. 52, 37-64 (1942).

[34] ———, "Markov chains-denumerable case," Trans. Amer. Math. Soc. 58, 455-473 (1945).

[35] ———, "Asymptotic properties of Markov transition probabilities," Trans. Amer. Math. Soc. 63, 393-421 (1948).

[36] ———, "Renewal theory from the point of view of the theory of probability," Trans. Amer. Math. Soc. 63, 422-438 (1948).

[37] ———, "A heuristic approach to the Kolmogorov-Smirnov theorems," Ann. Math. Stat. 20, 393-403 (1949).

[38] ———, "Continuous parameter martingales," Proc. 2nd Berkeley Symp. on Math. Stat. and Prob., 269-277 (1951).

[39] ———, Stochastic Processes, John Wiley & Sons, Inc., New York, 1953.

[40] DUBINS, L. and SAVAGE, L. J., How to Gamble If You Must, McGraw-Hill Book Co., Inc., New York, 1965.

[41] DUNFORD, N. and SCHWARTZ, J. T., Linear Operators, Part I, Interscience Publishers, Inc., New York, 1958.

[42] DVORETSKI, A., ERDÖS, P., and KAKUTANI, S., "Nonincrease everywhere of the Brownian motion process," Proc. 4th Berkeley Symp. on Math. Stat. and Prob. Vol. II, 103-116 (1961).


[43] DYNKIN, E. B., "Continuous one-dimensional Markov processes," Dokl. Akad. Nauk SSSR 105, 405-408 (1955).

[44] ———, Markov Processes, Vols. I, II, Academic Press, Inc., New York, 1965.

[45] EINSTEIN, A., "On the movement of small particles suspended in a stationary liquid demanded by the molecular-kinetic theory of heat," Ann. d. Physik 17 (1905). [In Investigations of the Theory of the Brownian Movement, edited by R. Fürth, Dover Publications, Inc., New York, 1956.]

[46] ERDÖS, P., "On the law of the iterated logarithm," Ann. of Math. 43, 419-436 (1942).

[47] ——— and KAC, M., "On certain limit theorems of the theory of probability," Bull. Amer. Math. Soc. 52, 292-302 (1946).

[48] ———, "On the number of positive sums of independent random variables," Bull. Amer. Math. Soc. 53, 1011-1020 (1947).

[49] ———, FELLER, W., and POLLARD, H., "A theorem on power series," Bull. Amer. Math. Soc. 55, 201-204 (1949).

[50] FELLER, W., "Zur Theorie der stochastischen Prozesse," Math. Ann. 113, 113-160 (1936).

[51] ———, "On the Kolmogorov-P. Lévy formula for infinitely divisible distribution functions," Proc. Yugoslav Acad. Sci. 82, 95-113 (1937).

[52] ———, "On the integro-differential equations of purely discontinuous Markov processes," Trans. Amer. Math. Soc. 48, 488-515 (1940).

[53] ———, "The general form of the so-called law of the iterated logarithm," Trans. Amer. Math. Soc. 54, 373-402 (1943).

[54] ———, "A limit theorem for random variables with infinite moments," Amer. J. Math. 68, 257-262 (1946).

[55] ———, "Fluctuation theory of recurrent events," Trans. Amer. Math. Soc. 67, 98-119 (1949).

[56] ———, "Diffusion processes in genetics," Proc. 2nd Berkeley Symp. on Math. Stat. and Prob., 227-246 (1951).

[57] ———, "The parabolic differential equations and the associated semi-groups of transformations," Ann. of Math. 55, 468-519 (1952).

[58] ———, "Diffusion processes in one dimension," Trans. Amer. Math. Soc. 77, 1-31 (1954).

[59] ———, An Introduction to Probability Theory and Its Applications, Vol. I, 2nd Ed., Vol. II, John Wiley & Sons, Inc., New York, 1957, 1966.

[60] ——— and OREY, S., "A renewal theorem," J. Math. and Mech. 10, 619-624 (1961).

[61] GARSIA, A., "A simple proof of Eberhard Hopf's maximal ergodic theorem," J. Math. and Mech. 14, 381-382 (1965).

[62] GNEDENKO, B. V. and KOLMOGOROV, A. N., Limit Distributions for Sums of Independent Random Variables, Addison-Wesley Publishing Co., Inc., Reading, Mass., 1954.

[63] HADLEY, G., Linear Algebra, Addison-Wesley Publishing Co., Inc., Reading, Mass., 1961.

[64] HALMOS, P. R., Measure Theory, D. Van Nostrand Co., Inc., Princeton, N.J., 1950.

[65] ———, "Lectures on ergodic theory," Mathematical Society of Japan, No. 3, 1956.


[66] HARRIS, T. E., "The existence of stationary measures for certain Markov processes," Proc. of the 3rd Berkeley Symp. on Math. Stat. and Prob. Vol. II, 113-124 (1956).

[67] ———, The Theory of Branching Processes, Prentice-Hall, Inc., Englewood Cliffs, N.J., 1963.

[68] ———, "First passage and recurrence distributions," Trans. Amer. Math. Soc. 73, 471-486 (1952).

[69] ——— and ROBBINS, H., "Ergodic theory of Markov chains admitting an infinite invariant measure," Proc. of the National Academy of Sciences 39, No. 8, 862-864 (1953).

[70] HERGLOTZ, G., "Über Potenzreihen mit positivem reellen Teil im Einheitskreis," Ber. Verh. Kgl. Sächs. Ges. Leipzig, Math.-Phys. Kl. 63, 501-511 (1911).

[71] HEWITT, E. and SAVAGE, L. J., "Symmetric measures on Cartesian products," Trans. Amer. Math. Soc. 80, 470-501 (1955).

[72] HOBSON, E. W., The Theory of Functions of a Real Variable, 3rd Ed., Vol. I, Cambridge University Press, Cambridge, 1927 (reprinted by Dover Publications, Inc., New York, 1957).

[73] HOPF, E., Ergodentheorie, Ergebnisse der Math. Vol. 2, J. Springer, Berlin, 1937 (reprinted by Chelsea Publishing Co., New York, 1948).

[74] HUNT, G., "Some theorems concerning Brownian motion," Trans. Amer. Math. Soc. 81, 294-319 (1956).

[75] ITÔ, K., "On stochastic processes (I) (Infinitely divisible laws of probability)," Japan J. Math. 18, 261-301 (1942).

[76] ——— and McKEAN, H. P., JR., Diffusion Processes and Their Sample Paths, Academic Press, New York, 1965.

[77] JESSEN, B. and SPARRE ANDERSEN, E., "On the introduction of measures in infinite product sets," Danske Vid. Selsk. Mat.-Fys. Medd. 25, No. 4 (1948).

[78] ——— and WINTNER, A., "Distribution functions and the Riemann zeta function," Trans. Amer. Math. Soc. 38, 48-88 (1935).

[79] KAC, M., "On a characterization of the normal distribution," Amer. J. of Math. 61, 726-728 (1939).

[80] ———, "Random walk and the theory of Brownian motion," Amer. Math. Monthly 54, 369-391 (1949) (reprinted in [139]).

[81] ———, "On the notion of recurrence in discrete stochastic processes," Bull. Amer. Math. Soc. 53, 1002-1010 (1947).

[82] ———, "On some connections between probability theory and differential and integral equations," Proc. 2nd Berkeley Symp. on Math. Stat. and Prob., 189-215 (1951).

[83] ———, "Statistical independence in probability, analysis, and number theory," Carus Mathematical Monograph No. 12, The Mathematical Association of America, 1959.

[84] KALLIANPUR, G. and ROBBINS, H., "The sequence of sums of independent random variables," Duke Math. J. 21, 285-307 (1954).

[85] KARAMATA, J., "Neuer Beweis und Verallgemeinerung der Tauberschen Sätze, welche die Laplaceschen und Stieltjesschen Transformationen betreffen," J. für die reine und angewandte Math. 164, 27-40 (1931).


[86] KARLIN, S., A First Course in Stochastic Processes, Academic Press, New York, 1966.

[87] ——— and McGREGOR, J. L., "Representation of a class of stochastic processes," Proc. Nat'l. Acad. Sci. 41, 387-391 (1955).

[88] KHINTCHINE, A., "Über einen Satz der Wahrscheinlichkeitsrechnung," Fundamenta Math. 6, 9-20 (1924).

[89] ———, "Déduction nouvelle d'une formule de M. Paul Lévy," Bull. Univ. d'État Moskou, Sér. Internat. Sect. A, 1, No. 1, 1-5 (1937).

[90] ———, Mathematical Foundations of Statistical Mechanics, Dover Publications, Inc., New York, 1949.

[91] ——— and KOLMOGOROV, A., "Über Konvergenz von Reihen, deren Glieder durch den Zufall bestimmt werden," Rec. Math. (Mat. Sbornik) 32, 668-677 (1925).

[92] KOLMOGOROV, A., "Sur la loi forte des grandes nombres," C. R. Acad. Sci. Paris 191, 910-912 (1930).

[93] ———, "Über die analytischen Methoden in der Wahrscheinlichkeitsrechnung," Math. Ann. 104, 415-458 (1931).

[94] ———, "Sulla forma generale di un processo stocastico omogeneo," Atti Acad. Naz. Lincei Rend. Cl. Sci. Fis. Mat. Nat. (6), 15, 805-808, 866-869 (1932).

[95] ———, "Anfangsgründe der Theorie der Markoffschen Ketten mit unendlich vielen möglichen Zuständen," Rec. Math. Moscou (Mat. Sbornik), 1 (43), 607-610 (1936).

[96] ———, "Interpolation and extrapolation of stationary random sequences" (Russian), Izv. Akad. Nauk SSSR, Ser. Mat. 5, 3-114 (1941).

[97] ———, "Stationary sequences in Hilbert space" (Russian), Bull. Math. Univ. Moscou 2, No. 6 (1941).

[98] ———, Foundations of Probability (translation), Chelsea Publishing Co., New York, 1950.

[99] LE CAM, L., mimeographed notes, Statistics Department, Univ. of Calif., Berkeley.

[100] LÉVY, P., "Théorie des erreurs. La loi de Gauss et les lois exceptionnelles," Bull. Soc. Math. 52, 49-85 (1924).

[101] ———, "Sur les séries dont les termes sont des variables éventuelles indépendantes," Studia Math. 3, 119-155 (1931).

[102] ———, "Sur les intégrales dont les éléments sont des variables aléatoires indépendantes," Annali R. Scuola Norm. Sup. Pisa (2), 3, 336-337 (1934) and 4, 217-218 (1935).

[103] ———, Théorie de l'Addition des Variables Aléatoires, Gauthier-Villars, Paris, 1937.

[104] ———, "Sur certains processus stochastiques homogènes," Comp. Math. 7, 283-339 (1939).

[105] ———, Processus Stochastiques et Mouvement Brownien, Gauthier-Villars, Paris, 1948.

[106] LINDEBERG, J. W., "Eine neue Herleitung des Exponentialgesetzes in der Wahrscheinlichkeitsrechnung," Math. Z. 15, 211-225 (1922).

[107] LOÈVE, M., Appendix to [105].

[108] ———, Probability Theory, 3rd Ed., D. Van Nostrand Co., Inc., Princeton, N.J., 1963.


[109] LYAPUNOV, A. M., "Nouvelle forme du théorème sur la limite de probabilités," Mém. Acad. Sci. St. Pétersbourg (8), 12, No. 5, 1-24 (1901).

[110] MARKOV, A. A., "Extension of the law of large numbers to dependent events" (Russian), Bull. Soc. Phys. Math. Kazan (2) 15, 155-156 (1906).

[111] MARUYAMA, G., "The harmonic analysis of stationary stochastic processes," Mem. Fac. Sci. Kyusyu Univ. A4, 45-106 (1949).

[111a] McSHANE, E. J., Integration, Princeton University Press, Princeton, N.J., 1944.

[112] MISES, R. v., Probability, Statistics, and Truth, William Hodge and Co., 2nd Ed., London, 1957.

[113] NEVEU, J., Mathematical Foundations of the Calculus of Probability, Holden-Day, Inc., San Francisco, 1965.

[114] OREY, S., "An ergodic theorem for Markov chains," Z. Wahr. 1, 174-176 (1962).

[115] PALEY, R., WIENER, N., and ZYGMUND, A., "Note on random functions," Math. Z., 647-668 (1933).

[116] PÓLYA, G., "Über eine Aufgabe der Wahrscheinlichkeitsrechnung betreffend die Irrfahrt im Strassennetz," Math. Ann. 89, 149-160 (1921).

[117] POSPIŠIL, B., "Sur un problème de M. M. S. Bernstein et A. Kolmogoroff," Časopis Pěst. Mat. Fys. 65, 64-76 (1935-36).

[118] PROKHOROV, Y. V., "Convergence of random processes and limit theorems in probability theory," Teor. Veroyatnost. i Primenen. 1, 177-238 (1956).

[119] RYLL-NARDZEWSKI, C., "Remarks on processes of calls," Proc. of the 4th Berkeley Symp. on Math. Stat. and Prob. Vol. II, 455-465 (1961).

[120] SAATY, T. L., Elements of Queuing Theory, with Applications, McGraw-Hill Book Co., Inc., New York, 1961.

[121] SAKS, S., "Theory of the integral," Monografie Matematyczne, Tom VII, Warsaw-Lwów (1937).

[122] SHEPP, L. A., "A local limit theorem," Ann. Math. Stat. 35, 419-423 (1964).

[123] SHOHAT, J. A. and TAMARKIN, J. D., "The problem of moments," Math. Surveys, No. 1, Amer. Math. Soc., New York (1943).

[123a] SIERPINSKI, W., General Topology, University of Toronto Press, Toronto, 1952.

[124] SKOROKHOD, A. V., "Limit theorems for stochastic processes," Teor. Veroyatnost. i Primenen. 1, 289-319 (1956).

[125] ———, "Limit theorems for stochastic processes with independent increments," Teor. Veroyatnost. i Primenen. 2, 145-177 (1957).

[126] ———, Studies in the Theory of Random Processes, Kiev University, 1961; Addison-Wesley Publishing Co., Inc., Reading, Mass., 1965 (translation).

[127] SMITH, W. L., "Renewal theory and its ramifications," J. Roy. Statist. Soc. B, 20, 243-302 (1958).

[128] SPARRE ANDERSEN, E., "On the fluctuations of sums of random variables," Math. Scand. 1, 163-285 (1953) and 2, 195-223 (1954).

[129] SPITZER, F., "A combinatorial lemma and its applications to probability theory," Trans. Amer. Math. Soc. 82, 323-339 (1956).

[130] ———, Principles of Random Walk, D. Van Nostrand Co., Inc., Princeton, 1964.

[131] STONE, C. J., "Limit theorems for random walks, birth and death processes, and diffusion processes," Illinois J. Math. 7, 638-660 (1963).


[132] ———, "On characteristic functions and renewal theory," Trans. Amer. Math. Soc. 120, 327-342 (1965).

[133] ———, "A local limit theorem for multidimensional distribution functions," Ann. Math. Stat. 36, 546-551 (1965).

[134] STRASSEN, V., "An invariance principle for the law of the iterated logarithm," Z. Wahr. 3, 211-226 (1964).

[135] ———, "A converse to the law of the iterated logarithm," Z. Wahr. 4, 265-268 (1965).

[136] TROTTER, H. F., "A property of Brownian motion paths," Illinois J. Math. 2, 425-433 (1958).

[137] VILLE, J., Étude Critique de la Notion de Collectif, Gauthier-Villars, Paris, 1939.

[138] VOLKONSKII, V. A., "Random substitution of time in strong Markov processes," Teor. Veroyatnost. i Primenen. 3, 332-350 (1958).

[139] WAX, N. (editor), Selected Papers on Noise and Stochastic Processes, Dover Publications, Inc., New York, 1954.

[140] WIDDER, D. V., The Laplace Transform, Princeton Univ. Press, Princeton, 1941.

[141] WIENER, N., "Differential space," J. Math. Phys. 2, 131-174 (1923).

[142] ———, "Un problème de probabilités dénombrables," Bull. Soc. Math. de France 52, 569-578 (1924).

[143] ———, Extrapolation, Interpolation and Smoothing of Stationary Time Series, MIT Press and John Wiley & Sons, Inc., New York, 1950 (reprinted from a publication restricted for security reasons in 1942).

[144] YAGLOM, A. M., An Introduction to the Theory of Stationary Random Functions, Prentice-Hall, Inc., Englewood Cliffs, N.J., 1962.


INDEX

Absorbing boundary, for Brownian motion, 353
  for a Feller process, 369-370
Absorption probabilities, for Markov chains, 155-156
  for fair coin-tossing, 156
Accessible boundary point, 366
Andersen, E., 103, 297
Arc sine law, 272, 282, 283, 297
Asymptotic behavior of Markov transition probabilities, general state space, 137
  countable state space, 150
  Orey's theorem, 151
  continuous time processes, 344
Asymptotic stationarity, for Markov chains, 133-135
  for renewal processes, 147
Bachelier, L., 271
Backwards equations, hold for pure jump processes, 331-332, 335
  for Feller processes, 385
Bailey, N., 346
Berry-Esseen bound, 184
Bharucha-Reid, A., 158, 346
Birkhoff, G., 104, 109, 127, 128
Birth and death processes, discrete time, 147-148
  continuous parameter, 337-339
Blackwell, D., 35, 62, 81, 101, 151, 231
Bochner, S., 317
Bochner's theorem on characteristic functions, 174
Borel, E., 65
Borel-Cantelli lemma, proof, 41-42
  applied to recurrence in coin-tossing, 42-44
  generalized to arbitrary processes, 96-97
Borel field, in sequence space, 12
  finite-dimensional, 22
  in function space, 251
Borel paradox, 73
Boundaries, for a Brownian motion, 352-355
  for Feller processes, 365-370
Boundary conditions, for a Markov process on the integers, 339-340
  for a Brownian motion, 352-355
Bounded pointwise convergence, 340-341
Branching processes, 148-149
Brown, R., 271
Brownian motion, defining assumptions, 248
  normally distributed, 249-250
  distribution of, 250
  as limit in distribution of random walks, 251
  joint measurability, 257
  normalized, 257
  continuous sample paths, 257-259
  distribution of the maximum, 258, 287-288
  defined by an infinite expansion, 259-261
  nondifferentiable, 261-262
  unbounded variation, 262
  variation of paths, 262-263
  law of the iterated logarithm, 263-265


  behavior for large time, 265-267
  time inversion, 266
  set of zeros, 267-268
  stopping times, 268-270
  strong Markov property, 268-270
  first-exit distributions, 273-275, 287-290
  used to represent sums of random variables, 276-278
  as uniform limit of normed sums, 278-281
  as limit in invariance theorem, 281-283
  transformations, 287
  distribution of first hitting times, 287-288
  with stable subordinator, 317-318
  infinitesimal operator, 325-326
  backwards and forwards equation, 327-328
  resolvent, 344
  exact model, 347-350
  boundaries, 352-355
  scale, 361
  speed measure, 365
  used to construct Feller processes, 370-375
  local time, 373-374
Cameron, R., 297
Central limit problem, conditions for convergence of distributions, 198-199
  convergence to the normal, 198-199
  convergence to stable laws, 199-200, 207-212
Central limit problem generalized, 195-196
Central limit theorem, for coin-tossing, 7-10
  for identically distributed random variables with finite second moment, 167-168
  nonidentically distributed case, 186-187
  in terms of distribution of maximum, 190, 198-199
  for multidimensional variables, 237-238
Chapman-Kolmogorov equations, 320-321, 322
Characteristic functions, defined, 170
  properties, 170-171
  and independent random variables, 175-177
  inversion formulas, 177-179
  expansion, 180
  logarithms of, 180-181
  normal distribution, 188
  Poisson distribution, 189
  infinitely divisible laws, 191-195
  stable laws, 200-203, 204-207
  of multidimensional distributions, 235-236
  of multivariate normal distribution, 237, 239
  of processes with independent increments, 304-305
Characteristic operator, defined, 375-376
  as second order differential operator, 377-379
  minimum principle for, 380
  uniqueness of solutions, 380-382
  for a diffusion, 386
Chebyshev inequality for coin-tossing, 4
Chow, Y., 65
Chung, K. L., 65, 145, 157, 158, 231, 232, 332, 339, 346
Coin-tossing, probability of an equalization, 2
  weak law of large numbers, 3
  strong law of large numbers, 11-13
  analytic model, 15
  equalizations, 62
  probability of no equalizations in biased coin-tossing, 141
  absorption probabilities, 155
  limit distribution of returns to equilibrium, 214
Communicating states, 141


Compactness of the set of distribution functions, 160-163
  multidimensional, 234-235
Conditional expectation, given one random variable, 70-71
  given a σ-field, 73-76
Conditional probabilities, for a Markov process, 321-322
Conditional probability, of one event given another, 67
  given the value of a random variable, 68-69
Continuity of Brownian motion sample paths, 257-259
Continuity in probability, 299
Continuity theorem, for characteristic functions, 171-172
  in terms of Laplace transforms, 183
  in terms of moment-generating functions, 183
  for multidimensional characteristic functions, 236
Continuous parameter processes, 248, 251-253
Continuous versions of processes, 254-257
Convergence definitions, almost surely (a.s.), 33
  in mean, 33
  in probability (P), 33
  in distribution (D), 159, 233
  uniformly on compact sets, 173
  weak, 217
  boundedly pointwise (bp), 341
Convergence in distribution, definition, 159
  behavior of integrals, 163-164
  separating classes of functions, 165-166
  convergence of expectations, 167
  determined by characteristic functions, 170-173
  and convergence of moments, 181-182
  of random vectors, 233-234
  of processes, 281, 293
Convergence of integrals, 163-164
Convergence of sequences of conditional expectations, 92-94
Convergence of types theorem, 174-175
Coordinate representation process, 22
Covariance function, 241, 243
Covariance matrix, 237, 239
Cramér, H., 247
Darling, D., 232, 383
Decomposition of processes with stationary independent increments, 310-315
Degenerate distribution, 160
Diffusions, backwards equations, 352
  infinitesimal operator, 352
  construction, 370-372
  defined, 385-386
  scale and speed measure, 386-387
  characterization of, 387-389
  as Feller processes, 387-389
Dinges, H., 297
Distribution functions, determine the distribution of a process, 26
Distribution of pure type, 49
Distribution of a sum of two independent random variables, 39
Doeblin, W., 157, 158, 215, 346
Domain of attraction, of the stable laws, 207-212
  of the normal law, 214-215
Domain of the infinitesimal operator, general definition, 341
  for Brownian motion with boundaries, 355-356
  for diffusion, 378-379
Donsker, M., 296, 297
Doob, J. L., 81, 101, 102, 103, 128, 137, 157, 220, 247, 290, 297, 306, 318, 335, 346, 390
Dubins, L., 34, 103
Dunford, N., 128
Dvoretski, A., 261
Dynkin, E., 346, 390
Dynkin's identity, 376
Ehrenfest urn scheme, 149
Einstein, A., 271


Entrance boundary points, 366
Equidistribution of sums of independent random variables, 58-61
Erdös, P., 231, 261, 296, 297
Ergodic theorem, proof, 113-115
  applied to orthogonality of measures, 116
  converse, 116
  convergence in first mean, 117
  for stationary processes, 118
  applied to the range of sums, 120-122
Ergodicity, of measure-preserving transformation, 109-110, 112-113
  of stationary processes, 119
  of process of independent, identically distributed random variables, 120
  of process of recurrence times for a stationary ergodic process, 124-125
  Markov chains, 136-137
  of Gaussian stationary processes, 246
Exit boundary points, 368-369
Expectation, defined, 31
  of products of independent random variables, 39
Explosions, in a Markov process, 336-338
Exponent function, 305
Extended random variable, 30
Extension of probabilities, in sequence space, 23-25
  from distribution functions in discrete time, 28
  for Markov chains, 130-131
  on function spaces, 251-253
  by continuity of paths, 254-257
  for Markov processes, 321
Extensions to smooth versions, 298-300
Feller-Orey renewal lemma, 221-222
Feller process, defined, 356-358
  stability of transition probabilities, 356-357, 358
  regularity, 358
  scale, 358-362
  first-exit distributions, 359-361, 383-384
  speed measure, 362-365
  boundary classification, 365-370
  construction, 370-375
  local time, 373
  characteristic operator, 375-379
  uniqueness, 379-383
  densities of transition probabilities, 384
  resolvent, 384
  backwards equation, 385
  stationary initial distribution, 385
  (See Diffusions)
Feller, W., 18, 62, 63, 64, 150, 158, 211, 215, 221, 228, 231, 232, 297, 346, 390
Field of events depending on a finite number of outcomes, 30
Finetti, B. de, 318
First-exit distributions, for Brownian motion, 273-275, 287-290
  from linear boundaries, 289-290
  for Feller processes, 359-361, 383-384
First-exit time, for Brownian motion, 273
  finite expectation, 359
  and speed measure, 363
Fourier inversion formulas, 177-179
Freedman, D., 151, 271
Fuchs, W., 65, 145
Functionals on Brownian motion, 282
Gambling systems, 82-83, 101-102
Garsia, A., 114
Gaussian stationary processes, definition, 241, 242
  spectral representation, 242-246
  ergodicity, 246
  prediction problem, 246-247
  Ornstein-Uhlenbeck process, 347-350
Gibbs, W., 109
Gnedenko, B., 215, 232
Green's function, 363, 365, 367-368
Halmos, P., 35, 91, 128
Harris, T., 65, 148, 149, 157, 346
Helly-Bray theorem, 160-161
  for multidimensional distributions, 234-235


Herglotz lemma, 242Hewitt, E., 63Hewitt-Savage zero-one law, proof, 63Hopf, E., 128Hunt, G., 271

Independence, definitions, 36Independent events, 41, 44Independent increments, 303Independent a-fields, 36Infinite sums of independent random

variables, convergence almostsurely, 45-48

finiteness of sums of moments, 47, 48distribution of, 49-51converge or diverge badly, 97convergence in distribution, 176convergence of characteristic

functions of, 177Infinitely divisible distribution of

processes with independentincrements, 303-305

Infinitely divisible laws, defined, 191
  as limits of distributions of sums, 191
  characteristic function, 194
  as limits of distributions of nonidentically distributed summands, 195-196
  necessary and sufficient conditions for convergence to, 198

Infinitesimal conditions, for a pure jump process, 332-333

Infinitesimal operator, domain, general case, 341
  for a diffusion, 352
Instantaneously reflecting boundary points, 369
Invariance principle, applied to central limit theorem, 167-169
  for sums of independent random variables, 272-273, 281-283
  applied to Kolmogorov-Smirnov statistics, 283-287
  applied to the law of the iterated logarithm, 291-292
  for general processes, 293-296

Ito, K., 271, 318, 373, 390

Jensen's inequality, 80
Jessen, B., 44, 103
Joint density, 71
Joint normal distribution, 237, 238-240
Jump processes, 310-315
Jump times, for a Markov process, 328-332
Jumps of a process with independent increments, 310-315

Kac, M., 18, 48, 128, 150, 232, 296, 297, 351

Kakutani, S., 261
Kallianpur, G., 232
Karamata, J., 228
Karlin, S., 148, 158
Kesten, H., 120
Khintchine, A., 65, 128, 215, 263, 271
Kolmogorov, A. N., 34, 40, 48, 65, 73, 81, 158, 215, 231, 232, 246, 247, 346
Kolmogorov inequality, 65
Kolmogorov zero-one law, 40
Kolmogorov-Smirnov statistics, 283-287, 290
Kronecker's lemma, 51

Langevin equation, 351
Lattice distribution, 54, 174, 225
Law of the iterated logarithm, for sums of independent random variables, 64, 291-292
  for Brownian motion, 263-265, 266
Law of pure types, 49
Law of a random variable, 159
Lebesgue bounded convergence theorem, 33
LeCam, L., 184
Lévy, P., 51, 66, 103, 215, 271, 297, 318
Lindeberg, J., 184
Local central limit theorem, 224-227
Local time, 373-375
Loève, M., 35, 48, 66, 184, 247
Lyapunov, A., 184, 215

Markov, A., 157
Markov chains, definitions, 129-131
  construction of, 130-131



  Markov time, 131
  strong Markov property, 131-132
  asymptotic stationarity, 133-135
  stationary initial distributions, 134, 136-137, 143-145
  closed sets, 135
  indecomposability, 135-137
  ergodicity, 136-137
  periodic motion, 137, 140
  backward difference equations, 153-156
Markov chains with a countable state space, renewal process of a state, 138
  returns to a state, 138-140
  classification of states, 139-140
  recurrence times, 139-140
  group properties of states, 141-143
  cyclically moving subsets, 142-143
  stationary initial distributions, 143-145
  renewal chain, 146-147
  birth and death processes, 147, 148
  branching processes, 148-149
  Ehrenfest urn scheme, 149
  asymptotic behavior of transition probabilities, 150-153
  tail σ-field, 152-153
  random variables invariant under shifts, 154
  absorption, probabilities, 154-155
    time until, 155-156
Markov processes, defined, 319-320
  transition probabilities, 319-323
  strong Markov property, 323
  infinitesimal operator, 325-326
  backwards and forwards equations, 326-328
  with pure jump sample paths defined, 328
  uniqueness, general case, 340-343
  resolvent, 340-344
  domain of the infinitesimal operator, 341
  stationary initial distributions, 344, 346
  Ornstein-Uhlenbeck processes, 347-350
  locally Brownian processes, 351-352
  (See Diffusions; Feller processes; Pure jump Markov processes)

Markov processes moving on the integers, returns from infinity, 339-340
  backwards and forwards equations, 340
  asymptotic behavior of transition probabilities, 344-345

Markov times, for Brownian motion, 268-270
  defined, 323
Martin, R., 297
Martingales and submartingales, definitions, 83-84
  optional sampling theorem, 84-89
  inequalities, 88-89
  strong convergence theorem, 89-90
  upcrossing lemma, 91
  as conditional expectations, 92-94
  optional stopping, 95-96
  applied to the generalized Borel-Cantelli lemma, 96-97
  stopping rules, 98-100
  Wald's identity, 100
  applied to gambling systems, 101-102
  continuous parameter, 274, 300-303
  continuity properties of sample paths, 300-301
  optional stopping for continuous parameter martingales, 302-303
  applied to processes with independent increments, 306-307
Mass-preserving sets of distributions, 162
Mass-preserving sets of multidimensional distributions, 234
Maximal ergodic theorem, 114
Maximum of a Brownian motion, 258, 287-288
Maximum of independent random variables, 189-190
McKean, H., 271, 373, 390
Measure-preserving transformations, definition, 106
  invariant sets, 108
  invariant σ-field, 108
  ergodicity, 109-110, 112-113



  invariant random variables, 112-113
  ergodic theorem, 113-115

Method of moments, 181-182
Minimal solution, for Markov process, 336
Mises, R. v., 34
Moment-generating function, 183
Moment problem, 182
Moments, applied to distribution of occupation times, 229-231
Multivariate normal distribution, 237, 238-240

Natural boundary points, 366
Natural scale, 358-362
Neveu, J., 18, 35
Nondifferentiability of Brownian paths, 261-262
Normal distribution, 9, 170, 185-186, 188
  domain of attraction, 214-215
  invariance under orthogonal rotations, 350
Normalized Brownian motion, 257
Null-recurrent states, 140

Occupation times, for sums of lattice random variables, 229
  for sums of nonlattice random variables, 229-231

Occurrence infinitely often, 40
Optional sampling, 85
Optional sampling theorem, 84-89
Optional skipping, 49
Order statistics, 285-286
Orey, S., 151, 221, 231
Ornstein, D., 65
Ornstein-Uhlenbeck process, 347-350

Paley, R., 271
Paths with jump discontinuities, 298-300
Periodic states, 140
Point processes, stationary, 125-127
Poisson, convergence, 188-190
  distribution, 189
  process, 308-310
  processes with random jumps, 310-312

Pollard, H., 231
Polya, G., 65, 145
Pospisil, B., 346
Probabilities on function spaces, 251-253, 254-257
Probability space, 14, 19
Process, continuous parameter, defined, 248, 251-253
Process, discrete time, definition, 19
  measurability, 21
  distribution of, 22

Processes with stationary, independent increments, defined, 303
  infinitely divisible distributions, 303-305
  path properties, 306-307
  Poisson process, 308-310
  jump processes, 310-312
  limits of jump processes, 312-315
  as sums of independent jump processes, 312-315
  zero-one law for path properties, 315
  as Markov processes, 324
  infinitesimal operator, 325

Prokhorov, Y., 297
Pure jump function, 314
Pure jump Markov processes, defined, 328
  strong Markov property, 328
  first jump distributions, 328-330
  backwards equations, 331-332, 335
  infinitesimal conditions, 332-333
  construction, 332-336
  space and time structure, 333-334
  uniqueness, 334-336, 339-340
  existence, 336-339
  explosions, 336-339
  birth and death processes, 337-339
  conditions for no accumulation of jumps, 337-339
  nonuniqueness and boundary conditions, 339-340
  minimal solution, 336

Random boundaries for Brownian motion, 276-278
Random signs problem, 41, 45, 47



Random variables, definition, 19
  sufficient conditions for measurability, 30
  distribution of, 31
  expectation of, 31
  identically distributed, 31
  strong convergence defined, 33
  Cauchy convergence, 33, 44
  independence, 36, 37, 38
  uniform integrability, 91, 94
Random vectors, definition, 20
  distribution of, 21
  function of, 30

Random walk, 132, 145
Range of sums of independent variables, 120-122
Recurrence, in coin-tossing, 42
  in sums of independent random variables, 53-58
  in Markov chains, 138-141
Recurrence of sums of independent variables, determined by characteristic function, 179
Recurrence times, for sums of independent random variables, 60
  for a stationary process, 122-125
  for a Markov state, 139-140
  for the Ehrenfest urn scheme, 149-150
  for sums of lattice random variables, 229
  for a Markov process, 344-345

Recurrent states, 139
Reflecting boundary, for Brownian motion, 353-355
  for a Feller process, 369-370
Regular boundary points, 368-369
Regular conditional distribution, definition, 77
  existence, 78-79
Regular conditional probability, 77, 79

Regular Feller processes, 358
Regular transition probabilities, 321
Renewal process, of a Markov state, 138
  as a Markov chain, 146-147

Renewal theorem, discrete time, 150
  general case, 218-224
  applied to Markov processes, 344-345
Resolvent, of a Markov process, 340-344
  and uniqueness of Markov processes, 343
  for Feller processes, 382, 384

Riemann-Lebesgue lemma, 216-217
Robbins, H., 65, 232
Runs in coin-tossing, 140
Ryll-Nardzewski, C., 128

Saaty, T., 346
Sample path properties, Brownian motion (See Brownian motion)
  general conditions to have only jump discontinuities, 298-300
  of continuous parameter martingales, 300-301
  of processes with stationary, independent increments, 306-307, 312-315, 315-316
  of Poisson processes, 308-310
Savage, J., 34, 63, 103
Scale, of a Feller process, 358-362
  determined by characteristic operator, 379

  for a diffusion, 386
Schwartz, J., 128
Second-order stationary processes, 247
Separating classes of functions, properties, 165-166
  complex exponentials, 170
  the polynomials do not separate, 181
  real exponentials, 183
  for multidimensional distributions, 235
Shepp, L., 232
Shift transformation, 107, 118
Shohat, J., 184
Siegert, A. J., 383
Skorokhod, A., 293, 297
Skorokhod's lemma, 45
Slowly reflecting boundary points, 369-370



Smith, W., 231
Spectral distribution function, 242-243
Spectral integral, 243
Spectral representation theorem, 244-246
Speed measure, 362-365
  on boundary points, 367-369
  absolutely continuous case, 370-372, 386-389
  determined by characteristic operator, 379
  for a diffusion, 387

Spitzer, F., 66, 120, 232, 297
Stable laws, defined, 199
  as limits in distribution of normed sums, 199-200
  characteristic functions of, 200-203, 204-207
  domains of attraction, 207-212
Stable processes, 316-318
Stable subordinator, 317-318
Stable transition probabilities, for Markov chains, 163
  for Feller processes, 356-357, 358
Stationary initial distribution, for Markov chains, 134, 136-137, 143-145
  for Markov processes, 346
  for Feller processes, 385

Stationary point processes, 125-127
Stationary processes, definition, 104
  transformation of, 105
  ergodic theorem, 118
  ergodicity, 119
  Gaussian, 241-247
  second order, 247
Stationary transition probabilities, 129, 322-323

Stirling's approximation, 2
Stochastic process, continuous parameter, defined, 248, 251-253
  distribution of, 252
  with continuous sample paths, 255-257
  having only jump discontinuities, 298-300

Stone, C., 231, 390
Stopping times, for sums of independent random variables, 59, 62
  for martingales, 95-97
  for Markov chain, 131
  for Brownian motion, 268-270
  for Brownian motion with random boundaries, 276-278
  for continuous parameter martingales, 301-303
  for Markov processes, 323

Strassen, V., 66, 291, 297
Strong law of large numbers, for coin-tossing, 11-13
  for nonidentically distributed, independent random variables, 51-52
  for identically distributed, independent random variables, 52-53
  for Brownian motion, 265-266
Strong Markov property, for discrete Markov chains, 131-132
  for Brownian motion, 268-270
  for general Markov processes, 323
  for pure jump processes, 328
  for Feller processes, 357

Sums of independent random variables, convergence of, 45-49
  law of large numbers for, 51-53
  recurrence properties, 54-58
  equidistribution of, 58-61
  stopping times, 59
  tail σ-field, 64
  are martingales, 83
  limit theorem for their range, 120-122
  renewal theorem, 218-224
  invariance principle, 272-273, 281-283
  distribution of the maximum, 273
  representation by Brownian motion, 276-278
  converge strongly to Brownian motion paths, 278-281
  law of the iterated logarithm, 291-292



Tail events, defined, 40
Tail σ-field, for independent random variables, 40
  for sums of independent random variables, 64
  condition for the zero-one property, 95
  for Markov chains with a countable state space, 152-153
Tamarkin, J., 184
Tauberian theorem, 228
  applied to recurrence times, 228-229
  applied to occupation times, 229-231

Transformations of Brownian motion, 287
Transient states, 139
Transition probabilities, stationary, 129, 322
  for continuous parameter Markov processes, 320-321
  regular, 321
  standard, 323
  backwards and forwards equations, 326-328
Trotter, H., 373

Uniform integrability, 91, 94
Upcrossing lemma, 91

Variation of Brownian paths, 262-263
Variation of sample paths for processes with independent increments, 315-316

Versions of stochastic processes, 257
Ville, J., 103
Volkonskii, V., 390

Wald's identity, 100
  for Brownian motion, 275

Wax, N., 271
Weak convergence of measures, compactness property, 217
  defined, 217
  conditions for, 218
Weyl, H., 117
Widder, D., 184
Wiener, N., 246, 247, 271
Whitman, W., 120
Wintner, A., 49

Yaglom, A., 247

Zeros of Brownian motion, 267-268

Zygmund, A., 271
