
Bruce K. Driver

Math 285 Stochastic Processes Spring 2016

June 3, 2016. File: 285notes.tex


Contents

Part Homework Problems

-1 Math 285 Homework Problem List for S2016
   -1.1 Homework 1. Due Friday, April 8, 2016
   -1.2 Homework 2. Due Friday, April 15, 2016
   -1.3 Homework 3. Due Friday, April 22, 2016
   -1.4 Homework 4. Due Friday, April 29, 2016
   -1.5 Homework 5. Due Friday, May 6, 2016
   -1.6 Homework 6. Due Friday, May 13, 2016
   -1.7 Homework 7. Due Friday, May 20, 2016
   -1.8 Homework 8. Due Friday, May 27, 2016
   -1.9 Homework 9. Due Friday, June 3, 2016

Part I Background Material

0 Introduction
   0.1 Deterministic Modeling
   0.2 Stochastic Modeling

1 Probabilities, Expectations, and Distributions
   1.1 Basics of Probabilities and Expectations
   1.2 Some Discrete Distributions

2 Independence
   2.1 Borel Cantelli Lemmas
   2.2 Independent Random Variables

3 Conditional Expectation
   3.1 σ – algebras (partial information)
   3.2 Theory of Conditional Expectation
   3.3 Conditional Expectation for Continuous Random Variables
   3.4 Summary on Conditional Expectation Properties

4 Filtrations and stopping times
   4.1 Filtrations
   4.2 Stopping Times

Part II Discrete Time & Space Markov Processes

5 Markov Chain Basics
   5.1 Markov Chain Descriptions
   5.2 Joint Distributions of an MC
   5.3 Hitting Times Estimates

6 First Step Analysis
   6.1 Finite state space examples
   6.2 More first step analysis examples
   6.3 Random Walk Exercises
   6.4 Wald’s Equation and Gambler’s Ruin
   6.5 Some more worked examples
      6.5.1 Life Time Processes
      6.5.2 Sampling Plans
      6.5.3 Extra Homework Problems
   6.6 *Computations avoiding the first step analysis
      6.6.1 General facts about sub-probability kernels

7 Markov Conditioning
   7.1 The Markov Property
   7.2 The Strong Markov Property
   7.3 Strong Markov Proof of Hitting Times Estimates*

8 Long Run Behavior of Discrete Markov Chains
   8.1 Introduction
   8.2 The Main Results
   8.3 Aperiodic chains
   8.4 Some finite state space examples
   8.5 Periodic Chain Considerations
      8.5.1 A number theoretic lemma

9 *Proofs of Long Run Results
   9.1 Strong Markov Property Consequences
   9.2 Irreducible Recurrent Chains

10 Transience and Recurrence Examples
   10.1 Examples
   10.2 *Transience and Recurrence for R.W.s by Fourier Series Methods

11 Detail Balance and MCMC
   11.1 Detail Balance
   11.2 Some uniform measure MCMC examples
   11.3 The Metropolis-Hastings Algorithm
   11.4 The linear algebra associated to detail balance
   11.5 Convergence rates of reversible chains
   11.6 Importance of the spectral gap
   11.7 Dropping the aperiodic assumption*
   11.8 *Reversible Markov Chains

12 Hidden Markov Models
   12.1 The most likely trajectory via dynamic programming
      12.1.1 Viterbi’s algorithm in the Hidden Markov chain context
   12.2 The computation of the conditional probabilities
      12.2.1 Exercises

Part III Martingales

13 (Sub and Super) martingales
   13.1 (Sub and Super) martingale Examples
   13.2 Jensen’s and Hölder’s Inequalities
   13.3 Stochastic Integrals and Optional Stopping
   13.4 Martingale Convergence Theorems
   13.5 Uniform Integrability and Optional stopping
   13.6 Submartingale Maximal Inequalities
   13.7 *Lp – inequalities
   13.8 Martingale Exercises
      13.8.1 More Random Walk Exercises
      13.8.2 More advanced martingale exercises

14 Some martingale Examples and Applications
   14.1 A Large Deviations Primer
   14.2 A Polya Urn Model
   14.3 Galton Watson Branching Process
      14.3.1 Appendix: justifying assumptions

Part IV Continuous Time Theory

15 Discrete State Space/Continuous Time
   15.1 Geometric, Exponential, and Poisson Distributions
   15.2 Scaling Limits of Markov Chains
   15.3 Sums of Exponentials
   15.4 Continuous Time Markov Chains (formalities)
   15.5 Jump Hold Description II
   15.6 Jump Hold Construction of the Poisson Process*
      15.6.1 More related results*
   15.7 First jump analysis
   15.8 Long time behavior
   15.9 Some Associated Martingales

16 Brownian Motion
   16.1 Normal/Gaussian Random Vectors
   16.2 Stationary and Independent Increment Processes
   16.3 Brownian motion defined
   16.4 Some “Brownian” martingales
   16.5 Optional Sampling Results
   16.6 Scaling Properties of B. M.
   16.7 Random Walks to Brownian Motion
   16.8 Path Regularity Properties of BM
   16.9 The Strong Markov Property of Brownian Motion
   16.10 Dirichlet Problem and Brownian Motion

17 A short introduction to Ito’s calculus
   17.1 A short introduction to Ito’s calculus
   17.2 Ito’s formula, heat equations, and harmonic functions
   17.3 A Simple Option Pricing Model
   17.4 The Black-Scholes Formula

Part V Appendix

18 Analytic Facts
   18.1 A Stirling’s Formula Like Approximation

19 Multivariate Gaussians
   19.1 Review of Gaussian Random Variables
   19.2 Gaussian Random Vectors
   19.3 Gaussian Conditioning

20 Solution to Selected Lawler Problems

References


Part

Homework Problems


-1

Math 285 Homework Problem List for S2016

Note: solutions to Lawler Problems will appear after all of the Lecture Note Solutions.

-1.1 Homework 1. Due Friday, April 8, 2016

• Look at from lecture note exercises: 3.1
• Hand in lecture note exercises: 3.2, 3.3, 4.1, 4.2, 4.3
• Hand in from Lawler: §5.1 on page 125.

-1.2 Homework 2. Due Friday, April 15, 2016

• Look at from lecture note exercises: 6.3, 6.4, 6.6
• Hand in lecture note exercises: 6.1, 6.5, 6.7
• Hand in from Lawler: §1.1, 1.4, 1.19

-1.3 Homework 3. Due Friday, April 22, 2016

• Look at from Lawler: §1.10
• Hand in from Lawler: §1.5, 1.8, 1.9, 1.14, 1.18*, 2.3

*Hint: show the invariant distribution is uniform.

-1.4 Homework 4. Due Friday, April 29, 2016

• Look at lecture note exercises: 10.1
• Hand in from Lawler problems: §7.6, 7.7, 7.8.
• Please use the result in 7.8 to verify your numerical approximations found in 7.7.

*Hints for 7.8. Recall that for $n \in \mathbb{N}$,
$$S_n = \left\{ k = (k_0, \ldots, k_n) \in \{0,1\}^{n+1} : k_{i-1} + k_i \le 1 \text{ for } 1 \le i \le n \right\}$$
and your goal is to compute
$$p_n(m) := \frac{\#\{k \in S_n : k_m = 1\}}{\#(S_n)} \quad \text{for } 0 \le m \le n.$$
As Lawler suggests, for $i, j \in \{0,1\}$ and $m \in \mathbb{N}$, let¹
$$r_m(ij) = \#\{ k = (k_0, \ldots, k_m) \in S_m : k_0 = i \text{ and } k_m = j \}.$$
Notice that if $k = (k_0, \ldots, k_n) \in S_n$ with $k_m = 1$, then
$$(0, k_0, \ldots, k_{m-1}, 1) \in S_{m+1} \quad \text{and} \quad (1, k_{m+1}, \ldots, k_n, 0) \in S_{n-m+1}$$
(and vice versa), where $k_{m-1}$ and $k_{m+1}$ must both be zero so that
$$(0, k_0, \ldots, k_{m-2}, 0) \in S_m \quad \text{and} \quad (0, k_{m+2}, \ldots, k_n, 0) \in S_{n-m}.$$
From these considerations we find
$$\#\{k \in S_n : k_m = 1\} = r_{m+1}(01) \cdot r_{n-m+1}(10) = r_m(00) \cdot r_{n-m}(00).$$
Similarly, if $k = (k_0, \ldots, k_n) \in S_n$ then $(0, k_0, \ldots, k_n, 0) \in S_{n+2}$ and vice versa, from which we learn $\#(S_n) = r_{n+2}(00)$. Combining these results shows
$$p_n(m) = \frac{r_m(00) \cdot r_{n-m}(00)}{r_{n+2}(00)}.$$
A little thought shows this formula is correct at $m = 0$ and $m = n$ provided we use the convention that $r_0(00) = 1$. Now follow the outline in Lawler in order to find $y_m = y(m) := r_m(00)$.

¹ I have changed the indexing a bit since Lawler’s choices are a bit confusing.
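For a quick sanity check of this counting identity, here is a brute-force enumeration in Python (our own verification sketch, not part of Lawler’s outline; the helper names `S` and `r` simply mirror the sets and counts defined above):

```python
from itertools import product

def S(n):
    """S_n: all k = (k_0, ..., k_n) in {0,1}^{n+1} with k_{i-1} + k_i <= 1."""
    return [k for k in product((0, 1), repeat=n + 1)
            if all(k[i - 1] + k[i] <= 1 for i in range(1, n + 1))]

def r(m, i, j):
    """r_m(ij): number of k in S_m with k_0 = i and k_m = j (so r_0(00) = 1)."""
    return sum(1 for k in S(m) if k[0] == i and k[m] == j)

n = 6
Sn = S(n)
for m in range(n + 1):
    direct = sum(k[m] for k in Sn) / len(Sn)                # p_n(m) by counting
    formula = r(m, 0, 0) * r(n - m, 0, 0) / r(n + 2, 0, 0)  # the claimed identity
    assert abs(direct - formula) < 1e-12, (m, direct, formula)
    print(f"p_{n}({m}) = {direct:.6f}")
```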

-1.5 Homework 5. Due Friday, May 6, 2016

• Look at from lecture note exercises: 13.1, 13.2
• Hand in lecture note exercises: 13.3, 13.5
• Hand in from Lawler: §5.4, 5.7a, 5.8a, 5.12



-1.6 Homework 6. Due Friday, May 13, 2016

• Look at from lecture note exercises: 13.7, 13.8, 13.9, 13.14
• Look at from Lawler: §5.13
• Hand in lecture note exercises: 13.11, 13.13
• Hand in from Lawler: §5.7b, 5.9*, 5.14

*Correction to 5.9: the condition, $Pf(x) = g(x)$ for $x \in S \setminus A$, should read $Pf(x) - f(x) = g(x)$ for $x \in S \setminus A$.

-1.7 Homework 7. Due Friday, May 20, 2016

• Look at from lecture note exercises: 15.4, 15.6
• Hand in lecture note exercises: 15.1, 15.2, 15.3, 15.5

-1.8 Homework 8. Due Friday, May 27, 2016

• Look at from lecture note exercises: 15.7, 15.8, 15.9, 19.1
• Hand in lecture note exercises: 15.12, 16.1, 16.3, 19.5

-1.9 Homework 9. Due Friday, June 3, 2016

• Look at from lecture note exercises: 16.4, 16.5, 19.2
• Hand in lecture note exercises: 16.6, 16.7


Part I

Background Material


0

Introduction

Definition 0.1 (Stochastic Process via Wikipedia). ..., a stochastic process, or often random process, is a collection of random variables representing the evolution of some system of random values over time. This is the probabilistic counterpart to a deterministic process (or deterministic system). Instead of describing a process which can only evolve in one way (as in the case, for example, of solutions of an ordinary differential equation), in a stochastic, or random process, there is some indeterminacy: even if the initial condition (or starting point) is known, there are several (often infinitely many) directions in which the process may evolve.

0.1 Deterministic Modeling

In deterministic modeling one often has a dynamical system on a state space $S$. The dynamical system typically takes on one of the following two forms:

1. There exists $f : S \to S$ and a state $x_n$ which evolves according to the rule $x_{n+1} = f(x_n)$. [More generally one might allow $x_{n+1} = f_n(x_0, \ldots, x_n)$ where $f_n : S^{n+1} \to S$ is a given function for each $n$.]

2. There exists a vector field $f$ on $S$ (where now $S = \mathbb{R}^d$ or a manifold) such that $\dot{x}(t) = f(x(t))$. [More generally, we might allow for $\dot{x}(t) = f(t; x|_{[0,t]})$, a functional differential equation.]

Goals: the goals in this case have to do with deriving the properties of the trajectories given the properties of the driving dynamics incorporated in $f$. For example, think of a golfer trying to make a putt or a hot-air balloonist trying to find a path from point A to point B.

0.2 Stochastic Modeling

Much of our time in this course will be spent exploring the above two situations when some extra randomness is added at each stage of the game. The point is that in many situations the exact nature of the dynamics is not known or is rapidly changing. What is known are statistical properties of the dynamics, i.e. likelihoods that the dynamics will be of a certain form. This amounts to replacing $f$ above by some sort of random $f$ and then resolving the problems. However, now rather than trying to find the properties of a given trajectory we instead try to find properties of the statistics of the now random trajectories. Typically, when comparing theory to experiment, one now has to average experimental results (hoping to use the law of large numbers) to make contact with the mathematical theory. Here is a little more detail on the typical sort of scenarios that we will consider in this course.

1. We may now have that $X_{n+1} \in S$ is random and evolves according to
$$X_{n+1} = f(X_n, \xi_n)$$
where $\{\xi_n\}_{n=0}^\infty$ is a sequence of i.i.d. random variables. Alternatively put, we might simply let $f_n := f(\cdot, \xi_n)$ so that $f_n : S \to S$ is a sequence of i.i.d. random functions from $S$ to $S$. Then $\{X_n\}_{n=0}^\infty$ is defined recursively by
$$X_{n+1} = f_n(X_n) \quad \text{for } n = 0, 1, 2, \ldots. \tag{0.1}$$
This is the typical example of a time-homogeneous Markov chain (see the simulation sketch following this list). We assume that the initial condition $X_0 \in S$ is given and is either deterministic or independent of $\{f_n\}_{n=0}^\infty$.

2. Later in the course we will study the continuous time analogue,
$$\text{``}\dot{X}_t = f_t(X_t)\text{''}$$
where $\{f_t\}_{t \ge 0}$ are again i.i.d. random vector fields. The continuous time case will require substantially more technical care. For example, one often considers the controlled differential equation,
$$\dot{X}_t = f(X_t)\, \dot{B}_t \tag{0.2}$$
where $\{B_t\}_{t\ge 0}$ is Brownian motion (equivalently, $\dot{B}_t$ is "white noise") or $B_t$ is a Poisson process. Poisson noise is often used to model arrival times in networks or in queues (i.e. service lines), and appears in electrical circuits due to "thermal fluctuations," to name a few examples. See, for example, Johnson Noise and Shot Noise by Dennis V. Perepelitsa, November 27, 2006. Here are two quotes from this article.


“The thermal agitation of the charge carriers in any circuit causes a small, yet detectable, current to flow. J.B. Johnson was the first to present a quantitative analysis of this phenomenon, which is unaffected by the geometry and material of the circuit.”

“The quantization of charge carried by electrons in a circuit also contributes to a small amount of noise. Consider a photoelectric circuit in which current caused by the photoexcitation of electrons flows to the anode.”

3. We will also consider a class of processes known as (Sub/Super) martingales which encode information about fair (or not so fair) games of chance, amongst many other applications.
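As promised in scenario 1 above, here is a minimal Python sketch of recursion (0.1). The concrete update rule $f(x, \xi) = x + \xi$ with $\xi = \pm 1$ (a nearest-neighbor random walk on $\mathbb{Z}$) is our own choice, made purely for illustration:

```python
import random

def f(x, xi):
    """One deterministic update rule, driven by the noise sample xi."""
    return x + xi                        # nearest-neighbor random walk on Z

def simulate(x0, n_steps, seed=0):
    """Run X_{n+1} = f(X_n, xi_n) with i.i.d. noise xi_n = +/- 1."""
    rng = random.Random(seed)
    X = [x0]
    for _ in range(n_steps):
        xi = rng.choice((-1, 1))         # the i.i.d. driving sequence {xi_n}
        X.append(f(X[-1], xi))
    return X

print(simulate(0, 10))                   # one sample trajectory from X_0 = 0
```

Swapping in a different $f$ changes the model entirely; it is the i.i.d. driving sequence $\{\xi_n\}$ that makes the resulting chain time homogeneous.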


1

Probabilities, Expectations, and Distributions

1.1 Basics of Probabilities and Expectations

Our first goal in this course is to describe modern probability with “sufficient” precision to allow us to do the required computations for this course. We will thus be neglecting some technical details involving measures and σ – algebras. The knowledgeable reader should be able to fill in the missing hypotheses while the less knowledgeable readers should not be too harmed by the omissions to follow.

1. $(\Omega, P)$ will denote a probability space and $S$ will denote a set which is called state space. Informally put, $\Omega$ is a set (often the sample space) and $P$ is a function on all¹ subsets of $\Omega$ (subsets of $\Omega$ are called events) with the following properties:
   a) $P(A) \in [0,1]$ for all $A \subset \Omega$,
   b) $P(\Omega) = 1$ and $P(\emptyset) = 0$,
   c) $P(A \cup B) = P(A) + P(B)$ if $A \cap B = \emptyset$. More generally, if $A_n \subset \Omega$ for all $n$ with $A_n \cap A_m = \emptyset$ for $m \neq n$, we have
   $$P\left(\cup_{n=1}^\infty A_n\right) = \sum_{n=1}^\infty P(A_n).$$

¹ This is often a lie! Nevertheless, for our purposes it will be reasonably safe to ignore this lie.

2. A random variable, $Z$, is a function from $\Omega$ to $\mathbb{R}$ or perhaps some other range space. For example, if $A \subset \Omega$ is an event then the indicator function of $A$,
$$1_A(\omega) := \begin{cases} 1 & \text{if } \omega \in A \\ 0 & \text{if } \omega \notin A, \end{cases}$$
is a random variable.
3. Note that every real valued random variable, $Z$, may be approximated by the discrete random variables
$$Z_\varepsilon := \sum_{n\in\mathbb{Z}} n\varepsilon \cdot 1_{\{n\varepsilon \le Z < (n+1)\varepsilon\}} \quad \text{for all } \varepsilon > 0. \tag{1.1}$$
As we usually do in probability, $\{n\varepsilon \le Z < (n+1)\varepsilon\}$ stands for the event more precisely written as
$$\{\omega \in \Omega : n\varepsilon \le Z(\omega) < (n+1)\varepsilon\}.$$

4. $EZ$ will denote the expectation of a random variable, $Z : \Omega \to \mathbb{R}$, which is defined as follows. If $Z$ only takes on a finite number of real values $z_1, \ldots, z_m$, we define
$$EZ = \sum_{i=1}^m z_i\, P(Z = z_i).$$
For general $Z \ge 0$ we set $EZ = \lim_{n\to\infty} EZ_n$ where $\{Z_n\}_{n=1}^\infty$ is any sequence of discrete random variables such that $0 \le Z_n \uparrow Z$ as $n \uparrow \infty$. Finally, if $Z$ is real valued with $E|Z| < \infty$ (in which case we say $Z$ is integrable), we set $EZ = EZ_+ - EZ_-$ where $Z_\pm = \max(\pm Z, 0)$. With these definitions one eventually shows, via the dominated convergence theorem below, that if $f : \mathbb{R} \to \mathbb{R}$ is a bounded continuous function, then
$$E[f(Z)] = \lim_{\Delta \to 0} \sum_{n\in\mathbb{Z}} f(n\Delta)\, P(n\Delta < Z \le (n+1)\Delta).$$
We summarize this informally² by writing
$$E[f(Z)] = \text{``} \int_{\mathbb{R}} f(z)\, P(z < Z \le z + dz). \text{''}$$
(A numerical illustration of this limiting sum appears after this list.)

² Think of $z = n\Delta$ and $dz = \Delta$.

5. The expectation has the following basic properties:
   a) Expectations of indicator functions: $E 1_A = P(A)$ for all events $A \subset \Omega$.
   b) Linearity: if $X$ and $Y$ are integrable random variables and $c \in \mathbb{R}$, then
   $$E[X + cY] = EX + c\, EY.$$
   c) Monotonicity: if $X, Y : \Omega \to \mathbb{R}$ are integrable with $P(X \le Y) = 1$, then $EX \le EY$. In particular, if $X = Y$ almost surely (a.s.) (i.e. $P(X = Y) = 1$), then $EX = EY$. [What happens on sets of probability 0 is typically irrelevant.]


   d) Finite expectation $\implies$ finite random variable: if $Z : \Omega \to [0,\infty]$ is a random variable such that $EZ < \infty$, then $P(Z = \infty) = 0$, i.e. $P(Z < \infty) = 1$.
   e) MCT: the monotone convergence theorem holds; if $0 \le Z_n \uparrow Z$, then
   $$\uparrow \lim_{n\to\infty} E[Z_n] = E[Z] \quad \text{(with } \infty \text{ allowed as a possible value).}$$
   Example 1: If $\{A_n\}_{n=1}^\infty$ is a sequence of events such that $A_n \uparrow A$ (i.e. $A_n \subset A_{n+1}$ for all $n$ and $A = \cup_{n=1}^\infty A_n$), then
   $$P(A_n) = E[1_{A_n}] \uparrow E[1_A] = P(A) \text{ as } n \to \infty.$$
   Example 2: If $X_n : \Omega \to [0,\infty]$ for $n \in \mathbb{N}$, then
   $$E \sum_{n=1}^\infty X_n = E \lim_{N\to\infty} \sum_{n=1}^N X_n = \lim_{N\to\infty} E \sum_{n=1}^N X_n = \lim_{N\to\infty} \sum_{n=1}^N EX_n = \sum_{n=1}^\infty EX_n.$$
   Example 3: Suppose $S$ is a finite or countable set and $X : \Omega \to S$ is a random function. Then for any $f : S \to [0,\infty]$,
   $$E[f(X)] = \sum_{s\in S} f(s)\, P(X = s).$$
   Indeed, we have
   $$f(X) = \sum_{s\in S} f(s)\, 1_{\{X=s\}}$$
   and so by Example 2 above,
   $$E[f(X)] = \sum_{s\in S} E\left[f(s)\, 1_{\{X=s\}}\right] = \sum_{s\in S} f(s)\, E\left[1_{\{X=s\}}\right] = \sum_{s\in S} f(s)\, P(X = s).$$

   f) DCT: the dominated convergence theorem holds; if
   $$E\left[\sup_n |Z_n|\right] < \infty \text{ and } \lim_{n\to\infty} Z_n = Z, \text{ then}$$
   $$E\left[\lim_{n\to\infty} Z_n\right] = EZ = \lim_{n\to\infty} EZ_n.$$
   Example 1: If $\{A_n\}_{n=1}^\infty$ is a sequence of events such that $A_n \downarrow A$ (i.e. $A_n \supset A_{n+1}$ for all $n$ and $A = \cap_{n=1}^\infty A_n$), then
   $$P(A_n) = E[1_{A_n}] \downarrow E[1_A] = P(A) \text{ as } n \to \infty.$$
   The dominating function is 1 here.
   Example 2: If $\{X_n\}_{n=1}^\infty$ is a sequence of real valued random variables such that
   $$E \sum_{n=1}^\infty |X_n| = \sum_{n=1}^\infty E|X_n| < \infty,$$
   then: 1) $Z := \sum_{n=1}^\infty |X_n| < \infty$ a.s. and hence $\sum_{n=1}^\infty X_n = \lim_{N\to\infty} \sum_{n=1}^N X_n$ exists a.s., 2) $\left|\sum_{n=1}^N X_n\right| \le Z$ with $EZ < \infty$, and so 3) by DCT,
   $$E \sum_{n=1}^\infty X_n = E \lim_{N\to\infty} \sum_{n=1}^N X_n = \lim_{N\to\infty} E \sum_{n=1}^N X_n = \lim_{N\to\infty} \sum_{n=1}^N EX_n = \sum_{n=1}^\infty EX_n.$$

   g) Fatou’s Lemma: Fatou’s lemma holds; if $0 \le Z_n \le \infty$, then
   $$E\left[\liminf_{n\to\infty} Z_n\right] \le \liminf_{n\to\infty} E[Z_n].$$
   This may be proved as an application of the MCT.

6. Discrete distributions. If $S$ is a discrete set, i.e. finite or countable, and $X : \Omega \to S$, we let
$$\rho_X(s) := P(X = s).$$
Notice that if $f : S \to \mathbb{R}$ is a function, then $f(X) = \sum_{s\in S} f(s)\, 1_{\{X=s\}}$ and therefore,
$$Ef(X) = \sum_{s\in S} f(s)\, E 1_{\{X=s\}} = \sum_{s\in S} f(s)\, P(X = s) = \sum_{s\in S} f(s)\, \rho_X(s).$$
More generally, if $X_i : \Omega \to S_i$ for $1 \le i \le n$, we let
$$\rho_{X_1,\ldots,X_n}(s) := P(X_1 = s_1, \ldots, X_n = s_n)$$
for all $s = (s_1, \ldots, s_n) \in S_1 \times \cdots \times S_n$ and
$$Ef(X_1, \ldots, X_n) = \sum_{s = (s_1,\ldots,s_n)} f(s)\, \rho_{X_1,\ldots,X_n}(s).$$

7. Continuous density functions. If $S$ is $\mathbb{R}$ or $\mathbb{R}^n$, we say $X : \Omega \to S$ is a “continuous random variable” if there exists a probability density function, $\rho_X : S \to [0,\infty)$, such that for all bounded (or positive) functions $f : S \to \mathbb{R}$, we have
$$E[f(X)] = \int_S f(x)\, \rho_X(x)\, dx.$$


8. Given random variables $X$ and $Y$ we let:
   a) $\mu_X := EX$ be the mean of $X$.
   b) $\operatorname{Var}(X) := E\left[(X - \mu_X)^2\right] = EX^2 - \mu_X^2$ be the variance of $X$.
   c) $\sigma_X = \sigma(X) := \sqrt{\operatorname{Var}(X)}$ be the standard deviation of $X$.
   d) $\operatorname{Cov}(X,Y) := E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - \mu_X \mu_Y$ be the covariance of $X$ and $Y$.
   e) $\operatorname{Corr}(X,Y) := \operatorname{Cov}(X,Y)/(\sigma_X \sigma_Y)$ be the correlation of $X$ and $Y$.

9. Tonelli’s theorem: if $f : \mathbb{R}^k \times \mathbb{R}^l \to \mathbb{R}_+$, then
$$\int_{\mathbb{R}^k} dx \int_{\mathbb{R}^l} dy\, f(x,y) = \int_{\mathbb{R}^l} dy \int_{\mathbb{R}^k} dx\, f(x,y) \quad \text{(with } \infty \text{ being allowed).}$$
10. Fubini’s theorem: if $f : \mathbb{R}^k \times \mathbb{R}^l \to \mathbb{R}$ is a function such that
$$\int_{\mathbb{R}^k} dx \int_{\mathbb{R}^l} dy\, |f(x,y)| = \int_{\mathbb{R}^l} dy \int_{\mathbb{R}^k} dx\, |f(x,y)| < \infty,$$
then
$$\int_{\mathbb{R}^k} dx \int_{\mathbb{R}^l} dy\, f(x,y) = \int_{\mathbb{R}^l} dy \int_{\mathbb{R}^k} dx\, f(x,y).$$
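As flagged in item 4 above, here is a small numerical sketch of the limiting lattice sum defining $E[f(Z)]$. The choices $Z \sim \operatorname{Exp}(1)$ and $f(z) = z^2$ (for which $E[f(Z)] = 2$ exactly) are our own, made only for illustration:

```python
import math

def lattice_expectation(f, cdf, delta, n_max):
    """Sum_{n >= 0} f(n*delta) * P(n*delta < Z <= (n+1)*delta) for Z >= 0."""
    return sum(f(n * delta) * (cdf((n + 1) * delta) - cdf(n * delta))
               for n in range(n_max))

cdf = lambda z: 1.0 - math.exp(-z)   # CDF of Z ~ Exp(1)
f = lambda z: z * z                  # E[Z^2] = 2 exactly

for delta in (0.5, 0.1, 0.01):
    approx = lattice_expectation(f, cdf, delta, n_max=int(50 / delta))
    print(f"delta = {delta:>5}: {approx:.5f}  (exact 2)")
```

As $\Delta \to 0$ the printed values increase toward 2, in line with the displayed limit.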

1.2 Some Discrete Distributions

Definition 1.1 (Generating Function). Suppose that $N : \Omega \to \mathbb{N}_0$ is an integer valued random variable on a probability space, $(\Omega, \mathcal{B}, P)$. The generating function associated to $N$ is defined by
$$G_N(z) := E\left[z^N\right] = \sum_{n=0}^\infty P(N = n)\, z^n \quad \text{for } |z| \le 1. \tag{1.2}$$
By standard power series considerations, it follows that $P(N = n) = \frac{1}{n!} G_N^{(n)}(0)$, so that $G_N$ can be used to completely recover the distribution of $N$.

Proposition 1.2 (Generating Functions). The generating function satisfies
$$G_N^{(k)}(z) = E\left[N(N-1)\cdots(N-k+1)\, z^{N-k}\right] \quad \text{for } |z| < 1$$
and
$$G^{(k)}(1) = \lim_{z\uparrow 1} G^{(k)}(z) = E\left[N(N-1)\cdots(N-k+1)\right],$$
where it is possible that one and hence both sides of this equation are infinite. In particular, $G'(1) := \lim_{z\uparrow 1} G'(z) = EN$ and, if $EN^2 < \infty$,
$$\operatorname{Var}(N) = G''(1) + G'(1) - \left[G'(1)\right]^2. \tag{1.3}$$

Proof. By standard power series considerations, for $|z| < 1$,
$$G_N^{(k)}(z) = \sum_{n=0}^\infty P(N = n) \cdot n(n-1)\cdots(n-k+1)\, z^{n-k} = E\left[N(N-1)\cdots(N-k+1)\, z^{N-k}\right]. \tag{1.4}$$
Since, for $z \in (0,1)$,
$$0 \le N(N-1)\cdots(N-k+1)\, z^{N-k} \uparrow N(N-1)\cdots(N-k+1) \text{ as } z \uparrow 1,$$
we may apply the MCT to pass to the limit as $z \uparrow 1$ in Eq. (1.4) to find
$$G^{(k)}(1) = \lim_{z\uparrow 1} G^{(k)}(z) = E\left[N(N-1)\cdots(N-k+1)\right].$$

Exercise 1.1 (Some Discrete Distributions). Let $p \in (0,1]$ and $\lambda > 0$. In the four parts below, the distribution of $N$ will be described. You should work out the generating function, $G_N(z)$, in each case and use it to verify the given formulas for $EN$ and $\operatorname{Var}(N)$.

1. Bernoulli$(p)$: $P(N = 1) = p$ and $P(N = 0) = 1 - p$. You should find $EN = p$ and $\operatorname{Var}(N) = p - p^2$.
2. Bin$(n,p)$: $P(N = k) = \binom{n}{k} p^k (1-p)^{n-k}$ for $k = 0, 1, \ldots, n$. ($P(N = k)$ is the probability of $k$ successes in a sequence of $n$ independent yes/no experiments with probability of success being $p$.) You should find $EN = np$ and $\operatorname{Var}(N) = n(p - p^2)$.
3. Geometric$(p)$: $P(N = k) = p(1-p)^{k-1}$ for $k \in \mathbb{N}$. ($P(N = k)$ is the probability that the $k^{\text{th}}$ trial is the first time of success in a sequence of independent trials with probability of success being $p$.) You should find $EN = 1/p$ and $\operatorname{Var}(N) = \frac{1-p}{p^2}$.
4. Poisson$(\lambda)$: $P(N = k) = \frac{\lambda^k}{k!} e^{-\lambda}$ for all $k \in \mathbb{N}_0$. You should find $EN = \lambda = \operatorname{Var}(N)$.

Exercise 1.2. Let $S_{n,p} \overset{d}{=} \operatorname{Bin}(n,p)$, $k \in \mathbb{N}$, and $p_n = \lambda_n/n$ where $\lambda_n \to \lambda > 0$ as $n \to \infty$. Show that
$$\lim_{n\to\infty} P(S_{n,p_n} = k) = \frac{\lambda^k}{k!} e^{-\lambda} = P(\operatorname{Poi}(\lambda) = k).$$
Thus we see that for $p = O(1/n)$ and $k$ not too large relative to $n$, for large $n$,
$$P(\operatorname{Bin}(n,p) = k) \cong P(\operatorname{Poi}(pn) = k) = \frac{(pn)^k}{k!} e^{-pn}.$$
(We will come back to the Poisson distribution and the related Poisson process later on.)
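A quick numerical check of this convergence (our own illustration with $\lambda = 2$ and $k = 3$; only the standard library is used):

```python
from math import comb, exp, factorial

lam, k = 2.0, 3
for n in (10, 100, 1000, 10_000):
    p = lam / n
    # exact binomial probability P(Bin(n, p) = k)
    print(f"n = {n:6d}: P(Bin = {k}) = {comb(n, k) * p**k * (1 - p)**(n - k):.6f}")
# the Poisson(lambda) limit lambda^k e^{-lambda} / k!
print(f"Poisson limit:  {lam**k / factorial(k) * exp(-lam):.6f}")
```

Already at $n = 100$ the binomial probability agrees with the Poisson limit to three decimal places.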


2

Independence

Definition 2.1. We say that an event, $A$, is independent of an event, $B$, iff¹
$$P(A \cap B) = P(A)\, P(B).$$
We further say a collection of events $\{A_j\}_{j\in J}$ are independent iff
$$P\left(\cap_{j\in J_0} A_j\right) = \prod_{j\in J_0} P(A_j)$$
for any finite subset, $J_0$, of $J$.

¹ Shortly we will consider conditional probabilities, $P(\cdot|B)$. With this notation, $A$ is independent of $B$ iff $P(A|B) = P(A)$, i.e. the information gained by $B$ occurring does not affect the likelihood that $A$ occurred.

Lemma 2.2. If $\{A_j\}_{j\in J}$ is an independent collection of events then so is $\{A_j, A_j^c\}_{j\in J}$.

Proof. First consider the case of two independent events, $A$ and $B$. By assumption, $P(A \cap B) = P(A)P(B)$. Since $A$ is the disjoint union of $A \cap B$ and $A \cap B^c$, the additivity of $P$ implies
$$P(A) = P(A \cap B) + P(A \cap B^c) = P(A)P(B) + P(A \cap B^c).$$
Solving this identity for $P(A \cap B^c)$ gives
$$P(A \cap B^c) = P(A)\left[1 - P(B)\right] = P(A)P(B^c).$$
Thus if $\{A, B\}$ are independent then so are $\{A, B^c\}$. Similarly we may show $\{A^c, B\}$ are independent and then that $\{A^c, B^c\}$ are independent. That is,
$$P\left(A^\varepsilon \cap B^\delta\right) = P(A^\varepsilon)\, P\left(B^\delta\right)$$
where $\varepsilon, \delta$ is either “nothing” or “$c$.”

The general case now follows similarly. Indeed, if $\{A_1, \ldots, A_n\} \subset \{A_j\}_{j\in J}$ we must show that
$$P(A_1^{\varepsilon_1} \cap \cdots \cap A_n^{\varepsilon_n}) = P(A_1^{\varepsilon_1}) \cdots P(A_n^{\varepsilon_n})$$
where $\varepsilon_j = c$ or $\varepsilon_j = $ “ ”. But this follows from above. For example, $\{A_1 \cap \cdots \cap A_{n-1}, A_n\}$ being independent implies that $\{A_1 \cap \cdots \cap A_{n-1}, A_n^c\}$ are independent and hence
$$P(A_1 \cap \cdots \cap A_{n-1} \cap A_n^c) = P(A_1 \cap \cdots \cap A_{n-1})\, P(A_n^c) = P(A_1) \cdots P(A_{n-1})\, P(A_n^c).$$
Thus we have shown it is permissible to add $A_j^c$ to the list for any $j \in J$.

Lemma 2.3. If $\{A_n\}_{n=1}^\infty$ is a sequence of independent events, then
$$P\left(\cap_{n=1}^\infty A_n\right) = \prod_{n=1}^\infty P(A_n) := \lim_{N\to\infty} \prod_{n=1}^N P(A_n).$$

Proof. Since $\cap_{n=1}^N A_n \downarrow \cap_{n=1}^\infty A_n$, it follows that
$$P\left(\cap_{n=1}^\infty A_n\right) = \lim_{N\to\infty} P\left(\cap_{n=1}^N A_n\right) = \lim_{N\to\infty} \prod_{n=1}^N P(A_n),$$
where we have used the independence assumption for the last equality. The convergence assertion used above follows from the DCT. Indeed, $1_{\cap_{n=1}^N A_n} \downarrow 1_{\cap_{n=1}^\infty A_n}$ and all of these functions are dominated by 1, and therefore
$$P\left(\cap_{n=1}^\infty A_n\right) = E\left[1_{\cap_{n=1}^\infty A_n}\right] = \lim_{N\to\infty} E\left[1_{\cap_{n=1}^N A_n}\right] = \lim_{N\to\infty} P\left(\cap_{n=1}^N A_n\right).$$

2.1 Borel Cantelli Lemmas

Definition 2.4 ($\{A_n \text{ i.o.}\}$). Suppose that $\{A_n\}_{n=1}^\infty$ is a sequence of events. Let
$$\{A_n \text{ i.o.}\} := \left\{ \sum_{n=1}^\infty 1_{A_n} = \infty \right\}$$
denote the event where infinitely many of the events, $A_n$, occur. The abbreviation “i.o.” stands for infinitely often.

For example, if $X_n$ is $H$ or $T$ depending on whether a heads or tails is flipped at the $n^{\text{th}}$ step, then $\{X_n = H \text{ i.o.}\}$ is the event where an infinite number of heads is flipped.


Lemma 2.5 (The First Borel-Cantelli Lemma). If $\{A_n\}$ is a sequence of events such that $\sum_{n=0}^\infty P(A_n) < \infty$, then
$$P(A_n \text{ i.o.}) = 0.$$

Proof. Since
$$\infty > \sum_{n=0}^\infty P(A_n) = \sum_{n=0}^\infty E 1_{A_n} = E\left[\sum_{n=0}^\infty 1_{A_n}\right],$$
it follows that $\sum_{n=0}^\infty 1_{A_n} < \infty$ almost surely (a.s.), i.e. with probability 1 only finitely many of the $A_n$ can occur.

Under the additional assumption of independence we have the following strong converse of the first Borel-Cantelli lemma.

Lemma 2.6 (Second Borel-Cantelli Lemma). If $\{A_n\}_{n=1}^\infty$ are independent events, then
$$\sum_{n=1}^\infty P(A_n) = \infty \implies P(A_n \text{ i.o.}) = 1. \tag{2.1}$$

Proof. We are going to show $P(\{A_n \text{ i.o.}\}^c) = 0$. Since
$$\{A_n \text{ i.o.}\}^c = \left\{\sum_{n=1}^\infty 1_{A_n} = \infty\right\}^c = \left\{\sum_{n=1}^\infty 1_{A_n} < \infty\right\},$$
we see that $\omega \in \{A_n \text{ i.o.}\}^c$ iff there exists $n \in \mathbb{N}$ such that $\omega \notin A_m$ for all $m \ge n$. Thus we have shown, if $\omega \in \{A_n \text{ i.o.}\}^c$ then $\omega \in B_n := \cap_{m\ge n} A_m^c$ for some $n$ and therefore
$$\{A_n \text{ i.o.}\}^c = \cup_{n=1}^\infty B_n.$$
As $B_n \uparrow \{A_n \text{ i.o.}\}^c$ we have
$$P(\{A_n \text{ i.o.}\}^c) = \lim_{n\to\infty} P(B_n).$$
But making use of the independence (see Lemmas 2.2 and 2.3) and the estimate $1 - x \le e^{-x}$ (see Figure 2.1 below), we find
$$P(B_n) = P\left(\cap_{m\ge n} A_m^c\right) = \prod_{m\ge n} P(A_m^c) = \prod_{m\ge n} \left[1 - P(A_m)\right] \le \prod_{m\ge n} e^{-P(A_m)} = \exp\left(-\sum_{m\ge n} P(A_m)\right) = e^{-\infty} = 0.$$

Combining the two Borel-Cantelli lemmas gives the following zero-one law.

[Fig. 2.1. Comparing $e^{-x}$ and $1 - x$.]

Corollary 2.7 (Borel’s Zero-One Law). If $\{A_n\}_{n=1}^\infty$ are independent events, then
$$P(A_n \text{ i.o.}) = \begin{cases} 0 & \text{if } \sum_{n=1}^\infty P(A_n) < \infty \\ 1 & \text{if } \sum_{n=1}^\infty P(A_n) = \infty. \end{cases}$$

Example 2.8. If $\{X_n\}_{n=1}^\infty$ denotes the outcomes of the toss of a coin such that $P(X_n = H) = p > 0$, then $P(X_n = H \text{ i.o.}) = 1$.
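Here is a Monte Carlo sketch of the dichotomy in Corollary 2.7 (our own illustration; a finite-horizon simulation can only suggest, not prove, the almost sure statements). With $P(A_n) = 1/n^2$ (summable) only a few of the $A_n$ ever occur, while with $P(A_n) = 1/n$ (divergent sum) the count keeps growing with the horizon $N$:

```python
import random

def count_events(prob, N, rng):
    """Simulate independent events A_1, ..., A_N with P(A_n) = prob(n)."""
    return sum(rng.random() < prob(n) for n in range(1, N + 1))

rng = random.Random(0)
for N in (10**3, 10**5):
    summable = count_events(lambda n: 1.0 / n**2, N, rng)  # sum P(A_n) < oo
    divergent = count_events(lambda n: 1.0 / n, N, rng)    # sum P(A_n) = oo
    print(f"N = {N:6d}: summable case: {summable:3d} events, "
          f"divergent case: {divergent:3d} events")
```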

Example 2.9. Suppose a monkey types on a keyboard with each stroke being independent and identically distributed, and with each key being hit with positive probability. Then eventually the monkey will type the text of the bible if she lives long enough. Indeed, let $S$ be the set of possible key strokes and let $(s_1, \ldots, s_N)$ be the strokes necessary to type the bible. Further let $\{X_n\}_{n=1}^\infty$ be the strokes that the monkey types at time $n$, and group the monkey’s strokes as $Y_k := (X_{kN+1}, \ldots, X_{(k+1)N})$. We then have
$$P(Y_k = (s_1, \ldots, s_N)) = \prod_{j=1}^N P(X_j = s_j) =: p > 0.$$
Therefore,
$$\sum_{k=1}^\infty P(Y_k = (s_1, \ldots, s_N)) = \infty$$
and so by the second Borel-Cantelli lemma,
$$P(Y_k = (s_1, \ldots, s_N) \text{ i.o. } k) = 1.$$


2.2 Independent Random Variables

Definition 2.10. We say a collection of discrete random variables, $\{X_j\}_{j\in J}$, is independent if

$P(X_{j_1}=x_1,\dots,X_{j_n}=x_n) = P(X_{j_1}=x_1)\cdots P(X_{j_n}=x_n) \quad (2.2)$

for all possible choices of $\{j_1,\dots,j_n\} \subset J$ and all possible values $x_k$ of $X_{j_k}$.

Proposition 2.11. A collection of discrete random variables, $\{X_j\}_{j\in J}$, is independent iff

$E[f_1(X_{j_1})\cdots f_n(X_{j_n})] = E[f_1(X_{j_1})]\cdots E[f_n(X_{j_n})] \quad (2.3)$

for all choices of $\{j_1,\dots,j_n\}\subset J$ and all choices of bounded (or non-negative) functions $f_1,\dots,f_n$. Here $n$ is arbitrary.

Proof. ($\Rightarrow$) If $\{X_j\}_{j\in J}$ are independent, then

$E[f(X_{j_1},\dots,X_{j_n})] = \sum_{x_1,\dots,x_n} f(x_1,\dots,x_n)\,P(X_{j_1}=x_1,\dots,X_{j_n}=x_n) = \sum_{x_1,\dots,x_n} f(x_1,\dots,x_n)\,P(X_{j_1}=x_1)\cdots P(X_{j_n}=x_n).$

Therefore,

$E[f_1(X_{j_1})\cdots f_n(X_{j_n})] = \sum_{x_1,\dots,x_n} f_1(x_1)\cdots f_n(x_n)\,P(X_{j_1}=x_1)\cdots P(X_{j_n}=x_n) = \left(\sum_{x_1} f_1(x_1)\,P(X_{j_1}=x_1)\right)\cdots\left(\sum_{x_n} f_n(x_n)\,P(X_{j_n}=x_n)\right) = E[f_1(X_{j_1})]\cdots E[f_n(X_{j_n})].$

($\Leftarrow$) Now suppose that Eq. (2.3) holds. If $f_j := \delta_{x_j}$ for all $j$, then

$E[f_1(X_{j_1})\cdots f_n(X_{j_n})] = E[\delta_{x_1}(X_{j_1})\cdots\delta_{x_n}(X_{j_n})] = P(X_{j_1}=x_1,\dots,X_{j_n}=x_n),$

while

$E[f_k(X_{j_k})] = E[\delta_{x_k}(X_{j_k})] = P(X_{j_k}=x_k).$

Therefore it follows from Eq. (2.3) that Eq. (2.2) holds, i.e. $\{X_j\}_{j\in J}$ is an independent collection of random variables.

Using this as motivation we make the following definition.

Definition 2.12. A collection of arbitrary random variables, $\{X_j\}_{j\in J}$, is independent iff

$E[f_1(X_{j_1})\cdots f_n(X_{j_n})] = E[f_1(X_{j_1})]\cdots E[f_n(X_{j_n})]$

for all choices of $\{j_1,\dots,j_n\}\subset J$ and all choices of bounded (or non-negative) functions $f_1,\dots,f_n$.

Fact 2.13. To check independence of a collection of real-valued random variables, $\{X_j\}_{j\in J}$, it suffices to show

$P(X_{j_1}\le t_1,\dots,X_{j_n}\le t_n) = P(X_{j_1}\le t_1)\cdots P(X_{j_n}\le t_n)$

for all possible choices of $\{j_1,\dots,j_n\}\subset J$ and all possible $t_k\in\mathbb{R}$. Moreover, one can replace $\le$ by $<$ or reverse these inequalities in the above expression.

Theorem 2.14 (Groupings of independent RVs). If $\{X_j\}_{j\in J}$ are independent random variables and $J_0, J_1$ are finite disjoint subsets of $J$, then

$E\left[f_0\left(\{X_j\}_{j\in J_0}\right)\cdot f_1\left(\{X_j\}_{j\in J_1}\right)\right] = E\left[f_0\left(\{X_j\}_{j\in J_0}\right)\right]\cdot E\left[f_1\left(\{X_j\}_{j\in J_1}\right)\right]. \quad (2.4)$

This holds more generally for any $\{J_k\}_{k=0}^n \subset J$ with $J_k\cap J_l = \emptyset$ and $\#(J_k)<\infty$.

In words: disjoint groupings of independent random variables are still independent random vectors.

Proof. Discrete case example. Suppose $X_1,\dots,X_5$ are independent discrete random variables. Then

$P(X_1=s_1, X_2=s_2, X_3=s_3, X_4=s_4, X_5=s_5) = P(X_1=s_1)P(X_2=s_2)P(X_3=s_3)P(X_4=s_4)P(X_5=s_5) = P(X_1=s_1,X_2=s_2)\,P(X_3=s_3,X_4=s_4,X_5=s_5) =: \rho_{1,2}(s_1,s_2)\cdot\rho_{3,4,5}(s_3,s_4,s_5)$

and therefore,

$E[f(X_1,X_2)\,g(X_3,X_4,X_5)] = \sum_{s=(s_1,\dots,s_5)} f(s_1,s_2)\,g(s_3,s_4,s_5)\,P(X_1=s_1,\dots,X_5=s_5) = \sum_{s=(s_1,\dots,s_5)} f(s_1,s_2)\,g(s_3,s_4,s_5)\cdot\rho_{1,2}(s_1,s_2)\cdot\rho_{3,4,5}(s_3,s_4,s_5) = \sum_{(s_1,s_2)} f(s_1,s_2)\,\rho_{1,2}(s_1,s_2)\cdot\sum_{(s_3,s_4,s_5)} g(s_3,s_4,s_5)\,\rho_{3,4,5}(s_3,s_4,s_5) = E[f(X_1,X_2)]\cdot E[g(X_3,X_4,X_5)].$


General Case. Equation (2.4) is easy to verify when $f_0$ and $f_1$ are themselves product functions. The general result is then deduced from this observation along with measure theoretic arguments which go under the name of Dynkin's multiplicative systems theorem.

Proposition 2.15 (Disintegration I). Suppose that $X$ is an $\mathbb{R}^k$-valued random variable, $Y$ is an $\mathbb{R}^l$-valued random variable independent of $X$, and $f:\mathbb{R}^k\times\mathbb{R}^l\to\mathbb{R}_+$. Then (assuming $X$ and $Y$ have continuous densities $\rho_X(x)$ and $\rho_Y(y)$ respectively),

$E[f(X,Y)] = \int_{\mathbb{R}^k} E[f(x,Y)]\,\rho_X(x)\,dx \quad\text{and}\quad E[f(X,Y)] = \int_{\mathbb{R}^l} E[f(X,y)]\,\rho_Y(y)\,dy.$

Proof. It is a fact that independence implies that the joint density, $\rho_{(X,Y)}(x,y)$, of $(X,Y)$ must be given by

$\rho_{(X,Y)}(x,y) = \rho_X(x)\,\rho_Y(y).$

Therefore,

$E[f(X,Y)] = \int_{\mathbb{R}^k\times\mathbb{R}^l} f(x,y)\,\rho_X(x)\,\rho_Y(y)\,dx\,dy = \int_{\mathbb{R}^k}\left[\int_{\mathbb{R}^l} f(x,y)\,\rho_Y(y)\,dy\right]\rho_X(x)\,dx = \int_{\mathbb{R}^k} E[f(x,Y)]\,\rho_X(x)\,dx.$

One of the key theorems involving independent random variables is the strong law of large numbers. The other is the central limit theorem.

Theorem 2.16 (Kolmogorov's Strong Law of Large Numbers). Suppose that $\{X_n\}_{n=1}^\infty$ are i.i.d. random variables and let $S_n := X_1+\dots+X_n$. Then there exists $\mu\in\mathbb{R}$ such that $\frac1n S_n\to\mu$ a.s. iff $X_n$ is integrable, in which case $EX_n = \mu$.

Remark 2.17. If $E|X_1| = \infty$ but $EX_1^- < \infty$, then $\frac1n S_n\to\infty$ a.s. To prove this, for $M>0$ let

$X_n^M := \min(X_n, M) = \begin{cases} X_n & \text{if } X_n\le M \\ M & \text{if } X_n\ge M\end{cases}$

and $S_n^M := \sum_{i=1}^n X_i^M$. It follows from Theorem 2.16 that $\frac1n S_n^M \to \mu_M := EX_1^M$ a.s. Since $S_n\ge S_n^M$, we may conclude that

$\liminf_{n\to\infty}\frac{S_n}{n} \ge \liminf_{n\to\infty}\frac1n S_n^M = \mu_M \quad\text{a.s.}$

Since $\mu_M\to\infty$ as $M\to\infty$, it follows that $\liminf_{n\to\infty}\frac{S_n}{n} = \infty$ a.s. and hence that $\lim_{n\to\infty}\frac{S_n}{n} = \infty$ a.s.
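The strong law is easy to visualize by simulation. Here is a minimal sketch (illustrative distribution and sample size assumed, NumPy assumed): for i.i.d. Exp(1) samples the running averages $\frac1n S_n$ settle down to $\mu = 1$.

```python
import numpy as np

rng = np.random.default_rng(1)

# i.i.d. Exp(1) samples, so mu = E[X_n] = 1.
X = rng.exponential(scale=1.0, size=1_000_000)
running_avg = np.cumsum(X) / np.arange(1, X.size + 1)

# The strong law says running_avg -> 1 a.s.; inspect a few checkpoints.
for n in (10, 1_000, 100_000, 1_000_000):
    print(n, running_avg[n - 1])
```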

Here is a crude special case of Theorem 2.16 which however does come with a rate estimate. We will do considerably better later in Corollary 13.59.

Proposition 2.18. Let $k\in\mathbb{N}$ with $k\ge2$ and let $\{X_n\}_{n=1}^\infty$ be i.i.d. random variables with $EX_n = 0$ and $EX_n^{2k} < \infty$. Then for every $p > \frac12 + \frac1{2k}$,

$\lim_{n\to\infty}\frac{S_n}{n^p} = 0 \quad\text{a.s.}$

In other words, for any small $\varepsilon>0$ we have

$\frac{S_n}{n} = \frac{S_n}{n^p}\cdot\frac1{n^{1-p}} = O\left(\frac1{n^{1-p}}\right) = O\left(\frac1{n^{\frac12\left(1-\frac1k-\varepsilon\right)}}\right).$

Proof. We start with the identity,

$E\left[\frac1n S_n\right]^{2k} = \frac1{n^{2k}}\sum_{j_1,\dots,j_{2k}=1}^n E[X_{j_1}\cdots X_{j_{2k}}].$

Using $E[X_{j_1}\cdots X_{j_{2k}}] = 0$ if any one index $j_l$ is distinct from all the others, we conclude that the above sum can contain at most $C_k n^k$ non-zero terms for some $C_k < \infty$, and all of these terms are bounded by a constant $C$ depending on $EX_n^{2k}$. For example, if $k=2$ we have $E[X_{j_1}X_{j_2}X_{j_3}X_{j_4}] = 0$ unless $j_1=j_2=j_3=j_4$ (of which there are $n$ such terms) or $j_1=j_2$ and $j_3=j_4$ (or similarly with permuted indices), of which there are $3n^2$ terms.

From the previous observations it follows that

$E\left[\frac1n S_n\right]^{2k} \le C\frac{n^k}{n^{2k}} = C\frac1{n^k}.$

Therefore if $0<\alpha<1$, then

$E\left(\sum_{n=1}^\infty\left[n^\alpha\frac1n S_n\right]^{2k}\right) = \sum_{n=1}^\infty E\left[n^\alpha\frac1n S_n\right]^{2k} \le \sum_{n=1}^\infty C\,\frac{n^{2k\alpha}}{n^k} = \sum_{n=1}^\infty C\,\frac1{n^{k(1-2\alpha)}} < \infty$

provided $k(1-2\alpha)>1$, i.e. $1-2\alpha>\frac1k$, i.e. $\alpha<\frac12\left(1-\frac1k\right)$. For such an $\alpha$ we have

$\sum_{n=1}^\infty\left[n^\alpha\frac1n S_n\right]^{2k} < \infty \text{ a.s.} \implies \lim_{n\to\infty}\frac1{n^{1-\alpha}}S_n = 0 \text{ a.s.}$

Tracing through the inequalities shows $p := 1-\alpha > 1-\frac12\left(1-\frac1k\right) = \frac12+\frac1{2k}$ is

the required restriction on $p$.

Often, for practical purposes, the following weak law of large numbers is in fact more useful. For the proof we will need the following simple but very useful inequality.

Lemma 2.19 (Chebyshev's Inequality). If $X$ is a random variable, $\delta>0$, and $p>0$, then

$P(|X|\ge\delta) = E\left[1_{|X|\ge\delta}\right] \le E\left[\frac{|X|^p}{\delta^p}1_{|X|\ge\delta}\right] \le \delta^{-p}E|X|^p. \quad (2.5)$

Proof. Taking expectations of the pointwise inequalities,

$1_{|X|\ge\delta} \le \frac{|X|^p}{\delta^p}1_{|X|\ge\delta} \le \delta^{-p}|X|^p,$

immediately gives Eq. (2.5).

Theorem 2.20. Let $\{X_n\}_{n=1}^\infty$ be uncorrelated, square integrable random variables. Then

$P\left(\left|\frac1n\sum_{m=1}^n(X_m-EX_m)\right|\ge\delta\right) \le \frac1{\delta^2 n^2}\sum_{m=1}^n\operatorname{Var}(X_m).$

If we further assume that $EX_m = \mu$ and $\operatorname{Var}(X_m) = \sigma^2$ are independent of $m$, then

$P\left(\left|\frac1n\sum_{m=1}^n X_m - \mu\right|\ge\delta\right) \le \frac{\sigma^2}{\delta^2}\cdot\frac1n.$

Proof. By Chebyshev’s inequality and the assumption that Cov (Xm, Xk) =δmk Var (Xm) , we find

P

(∣∣∣∣∣ 1nn∑

m=1

(Xm − EXm)

∣∣∣∣∣ ≥ δ)

≤ 1

δ2E

∣∣∣∣∣ 1nn∑

m=1

(Xm − EXm)

∣∣∣∣∣2

=1

δ2n2E

n∑m.k=1

[(Xm − EXm) (Xk − EXk)]

=1

δ2n2

n∑m.k=1

Cov (Xm, Xk) =1

δ2n2

n∑m=1

Var (Xm) .
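The bound of Theorem 2.20 can be compared against simulation. A minimal sketch (fair coin flips, so $\mu = 1/2$ and $\sigma^2 = 1/4$; parameters are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

n, trials = 1_000, 20_000
delta = 0.1
mu, sigma2 = 0.5, 0.25  # mean and variance of a fair Bernoulli coin

# trials independent copies of the sample mean of n Bernoulli(1/2) flips.
means = rng.integers(0, 2, size=(trials, n)).mean(axis=1)
empirical = np.mean(np.abs(means - mu) >= delta)
chebyshev = sigma2 / (delta**2 * n)

# The empirical probability sits (far) below the Chebyshev bound 0.025.
print(empirical, chebyshev)
```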


3

Conditional Expectation

3.1 σ – algebras (partial information)

Definition 3.1 ($\sigma$-algebra of $X$). Let $\Omega$ be a sample space, $W$ be a set, and $X:\Omega\to W$ be a function. The $X$-measurable events in $\Omega$, $\sigma(X)$, are those events of the form

$\sigma(X) := \left\{A := \{X\in B\} = X^{-1}(B) : B\subset W\right\}.$

Put another way, $A\in\sigma(X)$ iff $A = \{X\in B\}$ for some $B\subset W$.

Fig. 3.1. In this figure $\Omega = \{1,\dots,9\}$, $A_a = \{X=a\} = \{1,8\}$, $A_b = \{X=b\} = \{2,5,7\}$, $A_c = \{X=c\} = \{3,4,6,9\}$, and $X:\Omega\to S = \{a,b,c\}$ is the function that satisfies $X = x$ on $A_x$ for $x\in S$. With this notation we find $A\in\sigma(X)$ iff $A=\emptyset$ or $A$ is a union of elements from the list of events $\{A_a, A_b, A_c\}$.

Lemma 3.2. An event $A\subset\Omega$ is in $\sigma(X)$ iff there exists a function $f:W\to\mathbb{R}$ such that $f(X) = 1_A$. [The function $f$ is unique iff $X(\Omega) = W$.]

Proof. If $A\in\sigma(X)$, then $A = X^{-1}(B)$ for some $B\subset W$, and in this case

$1_A = 1_{X^{-1}(B)} = 1_B(X).$

Conversely, if $f(X) = 1_A$, then let $B := \{w\in W : f(w) = 1\}$. We then have

$\omega\in X^{-1}(B) \iff X(\omega)\in B \iff 1 = f(X(\omega)) = 1_A(\omega) \iff \omega\in A,$

which shows $A = X^{-1}(B)\in\sigma(X)$.

Remark 3.3. Notice that if $A_i\in\sigma(X)$ for $i\in\mathbb{N}$, then $A_1^c$, $\cup_{i=1}^\infty A_i$, $\cap_{i=1}^\infty A_i$, and $A_2\setminus A_1 = A_2\cap A_1^c$ are in $\sigma(X)$. In other words, $\sigma(X)$ is stable under all of the usual set operations. Also notice that $\emptyset,\Omega\in\sigma(X)$.

The reader should interpret $\sigma(X)$ as the events in $\Omega$ which can be distinguished (measured) by knowing $X$. Now let $S$ be another set and $Y:\Omega\to S$ be another function.

Definition 3.4 ($X$-measurable). We say $Y$ is $X$-measurable iff $\sigma(Y)\subset\sigma(X)$, i.e. the events that can be measured by $Y$ can also be measured by $X$.

For our purposes you can forget the previous definition and use the criterion for $X$-measurability given in the following proposition.

Proposition 3.5 ($X$-measurability). Let $Y:\Omega\to S$ and $X:\Omega\to W$ be functions. Then $Y$ is $X$-measurable iff there exists $f:W\to S$ such that $Y = f(X)$, i.e. $Y(\omega)$ is completely determined by knowing $X(\omega)$.

Proof. If $Y = f(X)$ for some function $f:W\to S$, then for $A\subset S$ we have

$\{Y\in A\} = \{f(X)\in A\} = \{X\in f^{-1}(A)\}\in\sigma(X),$

from which it follows that $\sigma(Y)\subset\sigma(X)$. You are asked to prove the converse direction in Exercise 3.1.

Exercise 3.1 (Optional). Let $W_0 := X(\Omega)\subset W$. Finish the proof of Proposition 3.5 using the following outline:

1. Use the fact that $\sigma(Y)\subset\sigma(X)$ to show that for each $s\in S$ there exists $B_s\subset W_0\subset W$ such that $\{Y=s\} = \{X\in B_s\}$.
2. Show $B_s\cap B_{s'} = \emptyset$ for all $s,s'\in S$ with $s\ne s'$.
3. Show $X(\Omega) = W_0 = \cup_{s\in S}B_s$.

Now fix a point $s_*\in S$ and define $f:W\to S$ by setting $f(w) = s_*$ when $w\in W\setminus W_0$ and $f(w) = s$ when $w\in B_s\subset W_0$.

4. Verify that $Y = f(X)$.


3.2 Theory of Conditional Expectation

Let us now fix a probability, P, on Ω for the rest of this subsection.

Notation 3.6 (Conditional Expectation 1). Given $Y\in L^1(P)$ and $A\subset\Omega$, let

$E[Y : A] := E[1_A Y]$

and

$E[Y|A] := \begin{cases} E[Y:A]/P(A) & \text{if } P(A)>0 \\ 0 & \text{if } P(A)=0. \end{cases} \quad (3.1)$

(In point of fact, when $P(A) = 0$ we could set $E[Y|A]$ to be any real number. We choose $0$ for definiteness and so that $Y\mapsto E[Y|A]$ is always linear.)

Example 3.7 (Conditioning for the uniform distribution). Suppose that $\Omega$ is a finite set and $P$ is the uniform distribution on $\Omega$, so that $P(\{\omega\}) = \frac1{\#(\Omega)}$ for all $\omega\in\Omega$. Then for any non-empty subset $A\subset\Omega$ and $Y:\Omega\to\mathbb{R}$, $E[Y|A]$ is the expectation of $Y$ restricted to $A$ under the uniform distribution on $A$. Indeed,

$E[Y|A] = \frac1{P(A)}E[Y:A] = \frac1{P(A)}\sum_{\omega\in A}Y(\omega)P(\{\omega\}) = \frac1{\#(A)/\#(\Omega)}\sum_{\omega\in A}Y(\omega)\frac1{\#(\Omega)} = \frac1{\#(A)}\sum_{\omega\in A}Y(\omega).$

Theorem 3.8 (Conditional Expectation via Best Approximations). Suppose $X:\Omega\to W$ is as above and $Y\in L^2(P)$, i.e. $Y:\Omega\to\mathbb{C}$ with $E|Y|^2<\infty$. Then there exists an "essentially unique" function $h:W\to\mathbb{C}$ with $h(X)\in L^2(P)$ which satisfies¹

$E\left[|Y-h(X)|^2\right] \le E|Y-f(X)|^2 \quad (3.2)$

for all functions $f:W\to\mathbb{C}$ such that $f(X)\in L^2(P)$. The function $h$ may alternatively be determined by requiring

$E\left[[Y-h(X)]\,f(X)\right] = 0 \quad \forall\, f:W\to\mathbb{C} \text{ with } E|f(X)|^2<\infty. \quad (3.3)$

Proof. The existence of such an $h$ satisfying Eq. (3.2) is a consequence of the orthogonal projection theorem in Hilbert spaces. We will simply take this result for granted. However, let us show the conditions in Eq. (3.2) and Eq. (3.3) are equivalent.

1 Note that $h$ depends on $X$, $Y$, and $P$. However, Theorem 3.16 below asserts that $h$ can be determined by only knowing the "joint distribution" of $(X,Y)$.

Eq. (3.2) $\implies$ Eq. (3.3): If Eq. (3.2) holds, then for any $f:W\to\mathbb{R}$ with $f(X)\in L^2(P)$ and $t\in\mathbb{R}$,

$\varphi(t) := E\left[|Y-[h(X)+tf(X)]|^2\right] = E|Y-h(X)|^2 + 2tE([Y-h(X)]f(X)) + t^2E|f(X)|^2$

has a minimum at $t=0$. So by the first derivative test it follows that

$0 = \varphi'(0) = 2E([Y-h(X)]f(X)),$

which shows Eq. (3.3) holds.

Eq. (3.3) $\implies$ Eq. (3.2): Assuming Eq. (3.3), it follows that

$E|Y-f(X)|^2 = E|Y-h(X)+(h-f)(X)|^2 = E|Y-h(X)|^2 + 2E([Y-h(X)](h-f)(X)) + E|(h-f)(X)|^2 = E|Y-h(X)|^2 + E|(h-f)(X)|^2 \ge E|Y-h(X)|^2.$

It should be noted that the function $h$ in Theorem 3.8 depends on both $X$ and $Y$, and therefore $E[Y|X] = h(X)$ also depends on both $X$ and $Y$. However, as explained in Theorem 3.16 below, the function $h$ only depends on $(X,Y)$ through their "joint distribution."

Definition 3.9 (Conditional Expectation I). We refer to the function $h(X)$ in Theorem 3.8 as the conditional expectation of $Y$ given $X$ (or $\sigma(X)$) and denote the result by $E[Y|\sigma(X)]$ or by $E[Y|X]$.
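The characterizations (3.2) and (3.3) can be probed by simulation. A minimal sketch with an assumed toy model where $h$ is known exactly (NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy model: Y = X^2 + eps with eps independent of X, so E[Y|X] = X^2.
X = rng.uniform(-1, 1, size=200_000)
Y = X**2 + rng.normal(scale=0.3, size=X.size)

h = X**2              # the conditional expectation h(X) = E[Y|X]
f = 0.5 + 0.0 * X     # an arbitrary competitor f(X)

# Eq. (3.2): E|Y - h(X)|^2 <= E|Y - f(X)|^2 for every f.
print(np.mean((Y - h) ** 2), np.mean((Y - f) ** 2))

# Eq. (3.3): Y - h(X) is orthogonal to functions of X, e.g. f(X) = X.
print(np.mean((Y - h) * X))  # approximately 0
```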

Proposition 3.10 (Discrete formula). Suppose that $X:\Omega\to W$ has finite or countable range, i.e. $X(\Omega) = \{x_i\}_{i=1}^N$ where $N\in\mathbb{N}\cup\{\infty\}$. In this case,

$E[Y|X] = h(X) \text{ where } h(x) = E_P[Y|X=x], \quad (3.4)$

where $E_P[Y|X=x]$ is as in Notation 3.6.

Proof. We are looking to find $h:X(\Omega)\subset W\to\mathbb{R}$ satisfying

$0 = E[[Y-h(X)]f(X)] = \sum_{x\in X(\Omega)} E[[Y-h(X)]f(X) : X=x] = \sum_{x\in X(\Omega)} E[[Y-h(x)]f(x) : X=x] = \sum_{x\in X(\Omega)}\left(E[Y:X=x] - h(x)P(X=x)\right)f(x)$

for all $f$ on $X(\Omega)\subset W$. This implies that we require

$0 = E[Y:X=x] - h(x)P(X=x) \text{ for all } x\in X(\Omega). \quad (3.5)$

In other words,

$h(x) = E_P[Y\cdot 1_{X=x}]/P(X=x) = E_P[Y|X=x].$

If $P(X=x) = 0$, then Eq. (3.5) holds no matter the value of $h(x)$. For definiteness, we choose $h(x) = 0$ in this case.

Remark 3.11. If we have already worked out the conditional probability measure $P(\cdot|X=x)$, then $h(x)$ in Eq. (3.4) may be computed using

$h(x) = E_P[Y|X=x] = E_{P(\cdot|X=x)}Y.$

The point is that for any random variable $Z:\Omega\to\mathbb{R}$ we always have

$E_P[Z|X=x] = E_{P(\cdot|X=x)}Z. \quad (3.6)$

For example, if $Z = 1_B$, then

$E_P[1_B|X=x] = \frac1{P(X=x)}E_P[1_B\cdot 1_{X=x}] = \frac1{P(X=x)}P(B\cap\{X=x\}) = P(B|X=x) = E_{P(\cdot|X=x)}1_B. \quad (3.7)$

For general $Z$, let $Z_\varepsilon$ be the approximation in Eq. (1.1). From Eq. (3.7) and linearity it follows that

$E_P[Z_\varepsilon|X=x] = E_{P(\cdot|X=x)}Z_\varepsilon,$

and then letting $\varepsilon\downarrow0$ proves Eq. (3.6).

Let us pause for a moment to record a few basic general properties of conditional expectations.

Proposition 3.12 (Contraction Property). For all $Y\in L^2(P)$ we have $E|E[Y|X]| \le E|Y|$. Moreover, if $Y\ge0$ then $E[Y|X]\ge0$ (a.s.).

Proof. Let $E[Y|X] = h(X)$ (with $h:W\to\mathbb{R}$) and define

$f(x) = \begin{cases} 1 & \text{if } h(x)\ge0 \\ -1 & \text{if } h(x)<0. \end{cases}$

Since $h(x)f(x) = |h(x)|$, it follows from Eq. (3.3) that

$E[|h(X)|] = E[Yf(X)] = |E[Yf(X)]| \le E[|Yf(X)|] = E|Y|.$

For the second assertion, take $f(x) = 1_{h(x)<0}$ in Eq. (3.3) in order to learn

$E\left[h(X)1_{h(X)<0}\right] = E\left[Y1_{h(X)<0}\right] \ge 0.$

As $h(X)1_{h(X)<0}\le0$, we may conclude that $h(X)1_{h(X)<0} = 0$ a.s.

Because of this proposition we may extend the notion of conditional expectation to $Y\in L^1(P)$, as stated in the following theorem which we do not bother to prove here.

Theorem 3.13. Given $X:\Omega\to W$ and $Y\in L^1(P)$, there exists an "essentially unique" function $h:W\to\mathbb{R}$ such that Eq. (3.3) holds for all bounded functions $f:W\to\mathbb{R}$. (As above we write $E[Y|X]$ for $h(X)$.) Moreover, the contraction property, $E|E[Y|X]| \le E|Y|$, still holds.

Definition 3.14 (Conditional Expectation). If $Y\in L^1(P)$, we let $E[Y|X] = h(X)$ be the a.s. unique random variable which is $\sigma(X)$ measurable and satisfies

$E(Yf(X)) = E(h(X)f(X)) = E(E[Y|X]f(X))$

for all bounded $f$.

Theorem 3.15 (Basic properties). Let $Y, Y_1$, and $Y_2$ be integrable random variables and $X:\Omega\to W$ be given. Then:

1. $E(Y_1+Y_2|X) = E(Y_1|X) + E(Y_2|X)$.
2. $E(aY|X) = aE(Y|X)$ for all constants $a$.
3. $E(g(X)Y|X) = g(X)E(Y|X)$ for all bounded functions $g$.
4. $E(E(Y|X)) = EY$. (Law of total expectation.)
5. If $Y$ and $X$ are independent, then $E(Y|X) = EY$.
6. If $Y$ is $\sigma(X)$ measurable, then $E[Y|X] = Y$.

Proof. 1. Let $h_i(X) = E[Y_i|X]$; then for all bounded $f$,

$E[Y_1 f(X)] = E[h_1(X)f(X)] \quad\text{and}\quad E[Y_2 f(X)] = E[h_2(X)f(X)],$

and therefore adding these two equations together implies

$E[(Y_1+Y_2)f(X)] = E[(h_1(X)+h_2(X))f(X)] = E[(h_1+h_2)(X)f(X)]$

for all bounded $f$. Therefore we may conclude that

$E(Y_1+Y_2|X) = (h_1+h_2)(X) = h_1(X)+h_2(X) = E(Y_1|X)+E(Y_2|X).$

2. The proof is similar to 1 but easier and so is omitted.

3. Let $h(X) = E[Y|X]$; then $E[Yf(X)] = E[h(X)f(X)]$ for all bounded functions $f$. Replacing $f$ by $g\cdot f$ implies

$E[Yg(X)f(X)] = E[h(X)g(X)f(X)] = E[(h\cdot g)(X)f(X)]$

for all bounded functions $f$. Therefore we may conclude that

$E[Yg(X)|X] = (h\cdot g)(X) = h(X)g(X) = g(X)E(Y|X).$

4. Take $f\equiv1$ in Eq. (3.3).

5. If $X$ and $Y$ are independent and $\mu := E[Y]$, then

$E[Yf(X)] = E[Y]E[f(X)] = \mu E[f(X)] = E[\mu f(X)],$

from which it follows that $E[Y|X] = \mu$ as desired.

6. By assumption $Y = f(X)$ for some function $f$, and hence the best approximation to $Y$ by a function of $X$ is clearly $Y = f(X)$ itself. More formally,

$E[Yg(X)] = E[f(X)g(X)] \text{ for all bounded } g,$

and hence $E[Y|X] = f(X) = Y$.

Exercise 3.2. Suppose that $X$ and $Y$ are two integrable random variables such that

$E[X|Y] = 18 - \frac35 Y \quad\text{and}\quad E[Y|X] = 10 - \frac13 X.$

Find $EX$ and $EY$.

The next theorem says that conditional expectations essentially only depend on the distribution of $(X,Y)$ and nothing else.

Theorem 3.16 (Dependence only on distributions). Suppose that $(X,Y)$ and $(\tilde X,\tilde Y)$ are random vectors such that $(X,Y) \stackrel{d}{=} (\tilde X,\tilde Y)$, i.e. $E[G(X,Y)] = E[G(\tilde X,\tilde Y)]$ for all bounded (or non-negative) functions $G$. If $h(X) = E[u(X,Y)|X]$, then $E[u(\tilde X,\tilde Y)|\tilde X] = h(\tilde X)$.

Proof. By assumption we know that

$E[u(X,Y)f(X)] = E[h(X)f(X)] \text{ for all bounded } f.$

Since $(X,Y)\stackrel{d}{=}(\tilde X,\tilde Y)$, this is equivalent to

$E[u(\tilde X,\tilde Y)f(\tilde X)] = E[h(\tilde X)f(\tilde X)] \text{ for all bounded } f,$

which is equivalent to $E[u(\tilde X,\tilde Y)|\tilde X] = h(\tilde X)$.

Exercise 3.3. Let $\{X_i\}_{i=1}^\infty$ be i.i.d. random variables with $E|X_i|<\infty$ for all $i$, and let $S_m := X_1+\dots+X_m$ for $m=1,2,\dots$. Show

$E[S_m|S_n] = \frac{m}{n}S_n \text{ for all } m\le n.$

Hint: observe by symmetry² that there is a function $h:\mathbb{R}\to\mathbb{R}$ such that

$E(X_i|S_n) = h(S_n), \text{ independent of } i.$

Remark 3.17. If $m>n$ in Exercise 3.3, then $S_m = S_n + X_{n+1}+\dots+X_m$. Since $X_i$ is independent of $S_n$ for $i>n$ (see Theorem 2.14), it follows from item 5 of Theorem 3.15 that $E[X_i|S_n] = \mu := EX_i$. This result along with the basic properties of conditional expectation shows

$E(S_m|S_n) = E(S_n+X_{n+1}+\dots+X_m|S_n) = E(S_n|S_n) + E(X_{n+1}|S_n) + \dots + E(X_m|S_n) = S_n + (m-n)\mu \text{ if } m\ge n.$

So in general,

$E[S_m|S_n] = \begin{cases} \frac{m}{n}S_n & \text{if } m\le n \\ S_n+(m-n)\mu & \text{if } m\ge n. \end{cases}$

Item 5 of Theorem 3.15 is a rather special case of the following useful tool for computing conditional expectations.

Proposition 3.18 (Disintegration II). Suppose that $X:\Omega\to W$ and $Y:\Omega\to W'$ are independent random functions and $U:W\times W'\to\mathbb{C}$ is a function such that $E|U(X,Y)|<\infty$. Under these assumptions,

$E[U(X,Y)|X] = h(X)$

where

$h(x) := E[U(x,Y)] \text{ for all } x\in W.$

2 Apply Theorem 3.16 using $(X_1,S_n)\stackrel{d}{=}(X_i,S_n)$ for $1\le i\le n$.


Proof. The theorem is true in general but requires measure theory in order to prove it in full generality. Here I will give the proof in the case that $X(\Omega)$ is a countable set (see Proposition 3.23 for a proof for certain continuous random functions).

Let

$h(x) := E[U(X,Y)|X=x] = E[U(x,Y)|X=x] = E[U(x,Y)],$

where in the last equality we have used the independence of $X$ from $Y$, which by definition means

$E[f(X)g(Y)] = E[f(X)]\cdot E[g(Y)]$

for all bounded functions $f:W\to\mathbb{C}$ and $g:W'\to\mathbb{C}$. The result now follows from Proposition 3.10.

Theorem 3.19 (Tower Property). Let $Y$ be an integrable random variable and $X_i:\Omega\to W_i$ be given functions for $i=1,2$. Then

$E[E[Y|(X_1,X_2)]|X_1] = E[Y|X_1] \quad (3.8)$

and

$E[E[Y|X_1]|(X_1,X_2)] = E[Y|X_1]. \quad (3.9)$

Proof. If $g:W_1\to\mathbb{R}$ is a bounded function, then $g(X_1)$ is both $\sigma(X_1)$ and $\sigma(X_1,X_2)$ measurable and therefore

$E[E[Y|(X_1,X_2)]\cdot g(X_1)] = E[Y\cdot g(X_1)] = E[E[Y|X_1]\cdot g(X_1)].$

As this equation holds for all such $g$, we conclude that Eq. (3.8) holds, i.e.

$E[Y|X_1] = E[E[Y|(X_1,X_2)]|X_1].$

For Eq. (3.9) we simply use item 6 of Theorem 3.15 and the fact that $E[Y|X_1]$ is $\sigma(X_1)\subset\sigma(X_1,X_2)$ measurable to conclude

$E[E[Y|X_1]|(X_1,X_2)] = E[Y|X_1].$

3.3 Conditional Expectation for Continuous Random Variables

(We will cover this section later in the course as needed.)

Suppose that $Y$ and $X$ are continuous random variables which have a joint density $\rho_{(Y,X)}(y,x)$. Then by definition of $\rho_{(Y,X)}$ we have, for all bounded or non-negative $U$, that

$E[U(Y,X)] = \int\!\!\int U(y,x)\,\rho_{(Y,X)}(y,x)\,dy\,dx. \quad (3.10)$

The marginal density associated to $X$ is then given by

$\rho_X(x) := \int\rho_{(Y,X)}(y,x)\,dy, \quad (3.11)$

and recall that the conditional density $\rho_{(Y|X)}(y,x)$ is defined by

$\rho_{(Y|X)}(y,x) = \begin{cases} \frac{\rho_{(Y,X)}(y,x)}{\rho_X(x)} & \text{if } \rho_X(x)>0 \\ 0 & \text{if } \rho_X(x)=0. \end{cases} \quad (3.12)$

Observe that if $\rho_{(Y,X)}(y,x)$ is continuous, then

$\rho_{(Y,X)}(y,x) = \rho_{(Y|X)}(y,x)\,\rho_X(x) \text{ for all } (x,y). \quad (3.13)$

Indeed, if $\rho_X(x) = 0$, then

$0 = \rho_X(x) = \int\rho_{(Y,X)}(y,x)\,dy,$

from which it follows that $\rho_{(Y,X)}(y,x) = 0$ for all $y$. If $\rho_{(Y,X)}$ is not continuous, Eq. (3.13) still holds for "a.e." $(x,y)$, which is good enough.

Lemma 3.20. In the notation above,

$\rho(x,y) = \rho_{(Y|X)}(y,x)\,\rho_X(x) \text{ for a.e. } (x,y). \quad (3.14)$

Proof. By definition, Eq. (3.14) holds when $\rho_X(x)>0$, and $\rho(x,y) \ge \rho_{(Y|X)}(y,x)\rho_X(x)$ for all $(x,y)$. Moreover,

$\int\!\!\int\rho_{(Y|X)}(y,x)\rho_X(x)\,dx\,dy = \int\!\!\int\rho_{(Y|X)}(y,x)\rho_X(x)1_{\rho_X(x)>0}\,dx\,dy = \int\!\!\int\rho(x,y)1_{\rho_X(x)>0}\,dx\,dy = \int\rho_X(x)1_{\rho_X(x)>0}\,dx = \int\rho_X(x)\,dx = 1 = \int\!\!\int\rho(x,y)\,dx\,dy,$

or equivalently,

$\int\!\!\int\left[\rho(x,y) - \rho_{(Y|X)}(y,x)\rho_X(x)\right]dx\,dy = 0,$

which implies the result.


Theorem 3.21. Keeping the notation above, for all bounded or non-negative $U$ we have $E[U(Y,X)|X] = h(X)$ where

$h(x) = \int U(y,x)\,\rho_{(Y|X)}(y,x)\,dy \quad (3.15)$

$\phantom{h(x)} = \begin{cases} \frac{\int U(y,x)\,\rho_{(Y,X)}(y,x)\,dy}{\int\rho_{(Y,X)}(y,x)\,dy} & \text{if } \int\rho_{(Y,X)}(y,x)\,dy > 0 \\ 0 & \text{otherwise.} \end{cases} \quad (3.16)$

In the future we will usually denote $h(x)$ informally by $E[U(Y,x)|X=x]$,³ so that

$E[U(Y,x)|X=x] := \int U(y,x)\,\rho_{(Y|X)}(y,x)\,dy. \quad (3.17)$

3 Warning: this is not consistent with Eq. (3.1), as $P(X=x) = 0$ for continuous distributions.

Proof. We are looking for $h:W\to\mathbb{R}$ such that

$E[U(Y,X)f(X)] = E[h(X)f(X)] \text{ for all bounded } f.$

Using Lemma 3.20, we find

$E[U(Y,X)f(X)] = \int\!\!\int U(y,x)f(x)\,\rho_{(Y,X)}(y,x)\,dy\,dx = \int\!\!\int U(y,x)f(x)\,\rho_{(Y|X)}(y,x)\,\rho_X(x)\,dy\,dx = \int\left[\int U(y,x)\,\rho_{(Y|X)}(y,x)\,dy\right]f(x)\,\rho_X(x)\,dx = \int h(x)f(x)\,\rho_X(x)\,dx = E[h(X)f(X)],$

where $h$ is given as in Eq. (3.15).

Example 3.22 (Durrett 8.15, p. 145). Suppose that $X$ and $Y$ have joint density $\rho(x,y) = 8xy\cdot 1_{0<y<x<1}$. We wish to compute $E[u(X,Y)|Y]$. To this end we compute

$\rho_Y(y) = \int_{\mathbb{R}} 8xy\cdot 1_{0<y<x<1}\,dx = 8y\int_{x=y}^{x=1} x\,dx = 8y\cdot\frac{x^2}{2}\Big|_y^1 = 4y\left(1-y^2\right).$

Therefore,

$\rho_{X|Y}(x,y) = \frac{\rho(x,y)}{\rho_Y(y)} = \frac{8xy\cdot 1_{0<y<x<1}}{4y(1-y^2)} = \frac{2x\cdot 1_{0<y<x<1}}{1-y^2}$

and so

$E[u(X,Y)|Y=y] = \int_{\mathbb{R}}\frac{2x\cdot 1_{0<y<x<1}}{1-y^2}\,u(x,y)\,dx = 2\,\frac{1_{0<y<1}}{1-y^2}\int_y^1 u(x,y)\,x\,dx,$

so that

$E[u(X,Y)|Y] = 2\,\frac1{1-Y^2}\int_Y^1 u(x,Y)\,x\,dx$

is the best approximation to $u(X,Y)$ by a function of $Y$ alone.
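The integrations above are easily double-checked symbolically. A minimal sketch using sympy (an assumption of convenience; output may differ in the ordering of factors):

```python
import sympy as sp

x, y = sp.symbols('x y', positive=True)
rho = 8 * x * y  # joint density on the region 0 < y < x < 1

# Marginal of Y: integrate out x over (y, 1); should give 4*y*(1 - y**2).
rho_Y = sp.simplify(sp.integrate(rho, (x, y, 1)))
print(rho_Y)

# Conditional density of X given Y = y, and e.g. E[X | Y = y].
rho_XgY = sp.simplify(rho / rho_Y)   # 2*x/(1 - y**2) on y < x < 1
EX_given_Y = sp.simplify(sp.integrate(x * rho_XgY, (x, y, 1)))
print(EX_given_Y)                    # 2*(1 - y**3)/(3*(1 - y**2))
```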

Proposition 3.23. Suppose that $X, Y$ are independent random functions. Then

$E[U(Y,X)|X] = h(X)$

where

$h(x) := E[U(Y,x)].$

Proof. I will prove this in the continuous distribution case and leave the discrete case to the reader. (The theorem is true in general but requires measure theory in order to prove it in full generality.) The independence assumption is equivalent to $\rho_{(Y,X)}(y,x) = \rho_Y(y)\,\rho_X(x)$. Therefore,

$\rho_{(Y|X)}(y,x) = \begin{cases} \rho_Y(y) & \text{if } \rho_X(x)>0 \\ 0 & \text{if } \rho_X(x)=0, \end{cases}$

and therefore $E[U(Y,X)|X] = h_0(X)$ where

$h_0(x) = \int U(y,x)\,\rho_{(Y|X)}(y,x)\,dy = 1_{\rho_X(x)>0}\int U(y,x)\,\rho_Y(y)\,dy = 1_{\rho_X(x)>0}E[U(Y,x)] = 1_{\rho_X(x)>0}\,h(x).$

If $f$ is a bounded function of $x$, then

$E[h_0(X)f(X)] = \int h_0(x)f(x)\,\rho_X(x)\,dx = \int_{\{x:\rho_X(x)>0\}} h_0(x)f(x)\,\rho_X(x)\,dx = \int_{\{x:\rho_X(x)>0\}} h(x)f(x)\,\rho_X(x)\,dx = \int h(x)f(x)\,\rho_X(x)\,dx = E[h(X)f(X)].$

So for all practical purposes $h(X) = h_0(X)$, i.e. $h(X) = h_0(X)$ a.s. (Indeed, take $f(x) = \operatorname{sgn}(h(x)-h_0(x))$ in the above equation to learn that $E|h(X)-h_0(X)| = 0$.)


Theorem 3.24 (Iterated conditioning 1). Let $X, Y$, and $Z$ be random vectors and suppose that $(X,Y)$ is distributed according to $\rho_{(X,Y)}(x,y)\,dx\,dy$. Then

$E[Z|Y=y] = \int E[Z|Y=y, X=x]\,\rho_{X|Y}(x,y)\,dx \quad (\rho_Y(y)\,dy\text{-a.s.}).$

Proof. Let $h(x,y) := E[Z|Y=y, X=x]$ so that

$E[Zv(X,Y)] = E[h(X,Y)v(X,Y)] \text{ for all } v.$

Taking $v(x,y) = g(y)$ to be a function of $y$ alone shows

$E[Zg(Y)] = E[h(X,Y)g(Y)] \text{ for all } g.$

Thus it follows that

$E[Z|Y=y] = E[h(X,Y)|Y=y] = \int h(x,y)\,\rho_{X|Y}(x,y)\,dx = \int E[Z|Y=y, X=x]\,\rho_{X|Y}(x,y)\,dx.$

Remark 3.25. Often, $E[Z|Y=y_0]$ may be computed as

$\lim_{\varepsilon\downarrow0} E\left[Z\,\big|\,|Y-y_0|<\varepsilon\right].$

To understand this formula, suppose that $h(y) := E[Z|Y=y]$ and $\rho_Y(y)$ are continuous near $y_0$ and $\rho_Y(y_0)>0$. Then

$E[Z\,|\,|Y-y_0|<\varepsilon] = \frac{E[Z : |Y-y_0|<\varepsilon]}{P(|Y-y_0|<\varepsilon)} = \frac{E\left[h(Y)1_{|Y-y_0|<\varepsilon}\right]}{P(|Y-y_0|<\varepsilon)} = \frac{\int h(y)1_{|y-y_0|<\varepsilon}\,\rho_Y(y)\,dy}{\int 1_{|y-y_0|<\varepsilon}\,\rho_Y(y)\,dy} \to h(y_0) \text{ as } \varepsilon\downarrow0,$

wherein we have used $h(y)\cong h(y_0)$ for $y$ near $y_0$ and therefore

$\frac{\int h(y)1_{|y-y_0|<\varepsilon}\,\rho_Y(y)\,dy}{\int 1_{|y-y_0|<\varepsilon}\,\rho_Y(y)\,dy} \cong \frac{\int h(y_0)1_{|y-y_0|<\varepsilon}\,\rho_Y(y)\,dy}{\int 1_{|y-y_0|<\varepsilon}\,\rho_Y(y)\,dy} = h(y_0).$

Here is a consequence of this result.

Theorem 3.26 (Iterated conditioning 2). Suppose that $\Omega$ is partitioned into disjoint sets $\{A_i\}_{i=1}^n$ and $y_0$ is given such that the following limits exist:

$P(A_i|Y=y_0) := \lim_{\varepsilon\downarrow0} P(A_i\,|\,|Y-y_0|<\varepsilon) \quad\text{and}\quad E[Z|Y=y_0, A_i] := \lim_{\varepsilon\downarrow0} E[Z\,|\,|Y-y_0|<\varepsilon, A_i].$

(In particular, we are assuming that $P(|Y-y_0|<\varepsilon, A_i)>0$ for all $\varepsilon>0$.) Then

$E[Z|Y=y_0] = \sum_{i=1}^n E[Z|Y=y_0, A_i]\,P(A_i|Y=y_0).$

Proof. Since

$E[Z\,|\,|Y-y_0|<\varepsilon, A_i] = \frac{E\left[Z1_{A_i}1_{|Y-y_0|<\varepsilon}\right]}{P(A_i\cap\{|Y-y_0|<\varepsilon\})} = \frac{E[Z1_{A_i}\,|\,|Y-y_0|<\varepsilon]\,P(|Y-y_0|<\varepsilon)}{P(A_i\cap\{|Y-y_0|<\varepsilon\})} = \frac{E[Z1_{A_i}\,|\,|Y-y_0|<\varepsilon]}{P(A_i\,|\,|Y-y_0|<\varepsilon)},$

it follows that

$\lim_{\varepsilon\downarrow0} E[Z1_{A_i}\,|\,|Y-y_0|<\varepsilon] = \lim_{\varepsilon\downarrow0}\left(E[Z\,|\,|Y-y_0|<\varepsilon, A_i]\cdot P(A_i\,|\,|Y-y_0|<\varepsilon)\right) = E[Z|Y=y_0, A_i]\,P(A_i|Y=y_0).$

Moreover,

$\sum_{i=1}^n \lim_{\varepsilon\downarrow0} E[Z1_{A_i}\,|\,|Y-y_0|<\varepsilon] = \lim_{\varepsilon\downarrow0}\sum_{i=1}^n E[Z1_{A_i}\,|\,|Y-y_0|<\varepsilon] = \lim_{\varepsilon\downarrow0} E\left[Z\sum_{i=1}^n 1_{A_i}\,\Big|\,|Y-y_0|<\varepsilon\right] = \lim_{\varepsilon\downarrow0} E[Z\,|\,|Y-y_0|<\varepsilon] = E[Z|Y=y_0],$

and therefore

$E[Z|Y=y_0] = \sum_{i=1}^n \lim_{\varepsilon\downarrow0} E[Z1_{A_i}\,|\,|Y-y_0|<\varepsilon] = \sum_{i=1}^n E[Z|Y=y_0, A_i]\,P(A_i|Y=y_0)$

as claimed.


Remark 3.27 (An advanced remark). The last result is in fact a special case of Theorem 3.24 wherein we take $X = \sum_i i\,1_{A_i}$. In this case, any function $u(X,Y)$ may be written as

$u(X,Y) = \sum_i 1_{X=i}\,u(i,Y).$

Thus if we let $h(x,y) := E[Z|(X=x, Y=y)]$, i.e. $h(X,Y) = E[Z|(X,Y)]$ a.s., then on one hand

$E[h(X,Y)u(X,Y)] = \sum_i E[h(X,Y)1_{X=i}\,u(i,Y)] = \sum_i E[h(i,Y)u(i,Y)1_{X=i}],$

while on the other,

$E[h(X,Y)u(X,Y)] = E[Zu(X,Y)] = \sum_i E[Z1_{X=i}\,u(i,Y)].$

Taking $u(i,Y) = \delta_{i,j}v(Y)$ and comparing the resulting expressions shows

$E[h(j,Y)v(Y)1_{X=j}] = E[Z1_{X=j}v(Y)] \text{ for all } v,$

and therefore that

$E[Z1_{X=j}|Y] = E[h(j,Y)1_{X=j}|Y] = h(j,Y)\cdot E[1_{X=j}|Y].$

Summing this equation on $j$ then shows

$E[Z|Y] = \sum_j E[Z1_{X=j}|Y] = \sum_j h(j,Y)\cdot E[1_{X=j}|Y],$

which reads

$E[Z|Y=y] = \sum_j E[Z|X=j, Y=y]\cdot E[1_{X=j}|Y=y] = \sum_j E[Z|X=j, Y=y]\cdot P(X=j|Y=y) = \sum_j E[Z|A_j, Y=y]\cdot P(A_j|Y=y) \quad (\mu_Y\text{-a.s.}),$

where $\mu_Y$ is the law of $Y$, i.e. $\mu_Y(A) := P(Y\in A)$.

Example 3.28. Suppose that $\{T_k\}_{k=1}^n$ are independent random times such that $P(T_k>t) = e^{-\lambda_k t}$ for all $t\ge0$, for some $\lambda_k>0$. Let $\{\tilde T_k\}_{k=1}^n$ be the order statistics of the sequence, i.e. $\{\tilde T_k\}_{k=1}^n$ is the sequence $\{T_k\}_{k=1}^n$ in increasing order, so that $\tilde T_1 < \tilde T_2 < \dots < \tilde T_n$. Further let $K = i$ on $\{\tilde T_1 = T_i\}$. Then

$E\left[f\left(\tilde T_2-\tilde T_1\right)|\tilde T_1=t\right] = \sum_i E\left[f\left(\tilde T_2-\tilde T_1\right)|\tilde T_1=t, K=i\right]P\left(K=i|\tilde T_1=t\right),$

where $\{\tilde T_1=t, K=i\} = \{t = T_i < T_j \text{ for } j\ne i\}$ and therefore,

$E\left[f\left(\tilde T_2-\tilde T_1\right)|\tilde T_1=t, K=i\right] = E\left[f\left(\tilde T_2-t\right)|\tilde T_1=t, K=i\right] = E\left[f\left(\min_{j\ne i}T_j - t\right)\Big|\,t = T_i < \min_{j\ne i}T_j\right] = E\left[f\left(\min_{j\ne i}T_j - t\right)\Big|\,t < \min_{j\ne i}T_j\right] = E\left[f\left(\min_{j\ne i}T_j\right)\right],$

wherein we have used that $T_i$ is independent of $S := \min_{j\ne i}T_j$, which is exponentially distributed with rate $\lambda-\lambda_i$, where $\lambda = \lambda_1+\dots+\lambda_n$. Since $S$ is an exponential random variable, $P(S>t+s|S>t) = P(S>s)$, i.e. $S-t$ under $P(\cdot|S>t)$ is the same in distribution as $S$ under $P$. Thus we have shown

$E\left[f\left(\tilde T_2-\tilde T_1\right)|\tilde T_1=t\right] = \sum_i E\left[f\left(\min_{j\ne i}T_j\right)\right]P\left(K=i|\tilde T_1=t\right).$

We now compute informally,

$P\left(K=i|\tilde T_1=t\right) = \frac{P(t = T_i < T_j \text{ for } j\ne i)}{P(t=\tilde T_1)} = \frac{e^{-(\lambda-\lambda_i)t}\cdot P(T_i=t)}{P(t=\tilde T_1)} = \frac{e^{-(\lambda-\lambda_i)t}\cdot\lambda_i e^{-\lambda_i t}\,dt}{\lambda e^{-\lambda t}\,dt} = \frac{\lambda_i}{\lambda}.$

Here is the above computation done more rigorously:

$P\left(K=i\,|\,t<\tilde T_1\le t+\varepsilon\right) = \frac{P(T_i<T_j \text{ for } j\ne i,\ t<T_i\le t+\varepsilon)}{P(t<\tilde T_1\le t+\varepsilon)} = \frac{\int_t^{t+\varepsilon} P(\tau<T_j \text{ for } j\ne i)\,\lambda_i e^{-\lambda_i\tau}\,d\tau}{\int_t^{t+\varepsilon}\lambda e^{-\lambda\tau}\,d\tau} \xrightarrow{\varepsilon\downarrow0} \frac{P(t<T_j \text{ for } j\ne i)\,\lambda_i e^{-\lambda_i t}}{\lambda e^{-\lambda t}} = \frac{e^{-(\lambda-\lambda_i)t}\lambda_i e^{-\lambda_i t}}{\lambda e^{-\lambda t}} = \frac{\lambda_i}{\lambda}.$

In summary we have shown

$E\left[f\left(\tilde T_2-\tilde T_1\right)|\tilde T_1=t\right] = \sum_i E\left[f\left(\min_{j\ne i}T_j\right)\right]\frac{\lambda_i}{\lambda}.$


3.4 Summary on Conditional Expectation Properties

Let $Y$ and $X$ be random variables such that $EY^2<\infty$ and let $h$ be a function from the range of $X$ to $\mathbb{R}$. Then the following are equivalent:

1. $h(X) = E(Y|X)$, i.e. $h(X)$ is the conditional expectation of $Y$ given $X$.
2. $E(Y-h(X))^2 \le E(Y-g(X))^2$ for all functions $g$, i.e. $h(X)$ is the best approximation to $Y$ among functions of $X$.
3. $E(Y\cdot g(X)) = E(h(X)\cdot g(X))$ for all functions $g$, i.e. $Y-h(X)$ is orthogonal to all functions of $X$. Moreover, this condition uniquely determines $h(X)$.

The methods for computing E(Y |X) are given in the next two propositions.

Proposition 3.29 (Discrete Case). Suppose that $Y$ and $X$ are discrete random variables and $p(y,x) := P(Y=y, X=x)$. Then $E(Y|X) = h(X)$, where

$h(x) = E(Y|X=x) = \frac{E(Y : X=x)}{P(X=x)} = \frac1{p_X(x)}\sum_y y\,p(y,x) \quad (3.18)$

and $p_X(x) = P(X=x)$ is the marginal distribution of $X$, which may be computed as $p_X(x) = \sum_y p(y,x)$.

Proposition 3.30 (Continuous Case). Suppose that $Y$ and $X$ are random variables which have a joint probability density $\rho(y,x)$ (i.e. $P(Y\in dy, X\in dx) = \rho(y,x)\,dy\,dx$). Then $E(Y|X) = h(X)$, where

$h(x) = E(Y|X=x) := \frac1{\rho_X(x)}\int_{-\infty}^\infty y\,\rho(y,x)\,dy \quad (3.19)$

and $\rho_X(x)$ is the marginal density of $X$, which may be computed as

$\rho_X(x) = \int_{-\infty}^\infty\rho(y,x)\,dy.$

Intuitively, in all cases, $E(Y|X)$ on the set $\{X=x\}$ is $E(Y|X=x)$. This intuition should help motivate some of the basic properties of $E(Y|X)$ summarized in the next theorem.

Theorem 3.31. Let $Y, Y_1, Y_2$, and $X$ be random variables. Then:

1. $E(Y_1+Y_2|X) = E(Y_1|X) + E(Y_2|X)$.
2. $E(aY|X) = aE(Y|X)$ for all constants $a$.
3. $E(f(X)Y|X) = f(X)E(Y|X)$ for all functions $f$.
4. $E(E(Y|X)) = EY$.
5. If $Y$ and $X$ are independent then $E(Y|X) = EY$.
6. If $Y\ge0$ then $E(Y|X)\ge0$.

Remark 3.32. Property 4 in Theorem 3.31 turns out to be a very powerful method for computing expectations. I will finish this summary by writing out Property 4 in the discrete and continuous cases:

$EY = \sum_x E(Y|X=x)\,p_X(x) \quad\text{(Discrete Case)}$

where

$E(Y|X=x) = \begin{cases} \frac{E(Y1_{X=x})}{P(X=x)} & \text{if } P(X=x)>0 \\ 0 & \text{otherwise,} \end{cases}$

and

$E[U(Y,X)] = \int E(U(Y,X)|X=x)\,\rho_X(x)\,dx \quad\text{(Continuous Case)}$

where

$E[U(Y,x)|X=x] := \int U(y,x)\,\rho_{(Y|X)}(y,x)\,dy$

and

$\rho_{(Y|X)}(y,x) = \begin{cases} \frac{\rho_{(Y,X)}(y,x)}{\rho_X(x)} & \text{if } \rho_X(x)>0 \\ 0 & \text{if } \rho_X(x)=0. \end{cases}$
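As an illustration of the continuous version of Property 4, here is a sketch using the density of Example 3.22 with SciPy quadrature (an assumption of convenience), computing $E[XY]$ directly and via iterated conditioning:

```python
from scipy import integrate

# Joint density rho(x, y) = 8*x*y on 0 < y < x < 1 (Example 3.22).
rho = lambda y, x: 8 * x * y

# Direct computation of E[XY]: integrate x*y*rho over the region.
direct, _ = integrate.dblquad(lambda y, x: x * y * rho(y, x),
                              0, 1, lambda x: 0, lambda x: x)

# Iterated computation: E[XY] = int y * E[X | Y=y] * rho_Y(y) dy with
# rho_Y(y) = 4*y*(1 - y**2) and E[X | Y=y] = 2*(1 - y**3)/(3*(1 - y**2)).
EX_given = lambda y: 2 * (1 - y**3) / (3 * (1 - y**2))
iterated, _ = integrate.quad(
    lambda y: y * EX_given(y) * 4 * y * (1 - y**2), 0, 1)

print(direct, iterated)  # both approximately 4/9
```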


4

Filtrations and stopping times

Notation 4.1. Let $\mathbb{N} := \{1,2,3,\dots\}$, $\mathbb{N}_0 := \mathbb{N}\cup\{0\}$, and $\bar{\mathbb{N}} := \mathbb{N}_0\cup\{\infty\} = \mathbb{N}\cup\{0,\infty\}$.

In this chapter, let $\Omega$ be a sample space and $S$ be a set called the state space. Further let $X := \{X_n\}_{n=0}^\infty$ be a sequence of random variables taking values in $S$, which we will refer to generically as a stochastic process.

4.1 Filtrations

Definition 4.2 (Filtration). The filtration associated to $X$ is $\left\{\mathcal{F}_n^X := \sigma(X_0,\dots,X_n)\right\}_{n=0}^\infty$. We further let $\mathcal{F}_\infty^X = \sigma(X_0,\dots,X_n,\dots)$.

Notice that $g:\Omega\to S$ is $\mathcal{F}_n^X$ measurable iff $g = G(X_0,\dots,X_n)$ for some $G:S^{n+1}\to S$. Also notice that $\mathcal{F}_m^X\subset\mathcal{F}_n^X$ for all $0\le m\le n\le\infty$.

Remark 4.3 (Independence). If $X := \{X_n\}_{n=0}^\infty$ are independent $S$-valued random variables and $n\in\mathbb{N}_0$, then by Theorem 2.14 $X_{n+1}$ is independent of $(X_0,\dots,X_n)$ and therefore, by item 5 of Theorem 3.15,

$E[f(X_{n+1})|(X_0,\dots,X_n)] = Ef(X_{n+1}) \quad \forall\, n\in\mathbb{N}_0 \text{ and } f:S\to\mathbb{R}. \quad (4.1)$

It turns out that Eq. (4.1) is equivalent to the assertion that $X := \{X_n\}_{n=0}^\infty$ are independent.

For example, if Eq. (4.1) holds, then

$E[f_1(X_1)|X_0] = m_1 := Ef_1(X_1) \text{ a.s.},$

which implies

$E[f_1(X_1)f_0(X_0)] = E[E[f_1(X_1)|X_0]\cdot f_0(X_0)] = E[m_1 f_0(X_0)] = m_1 E[f_0(X_0)] = Ef_1(X_1)\cdot E[f_0(X_0)].$

Similarly, another application of Eq. (4.1) gives

$E[f_2(X_2)|(X_0,X_1)] = m_2 := E[f_2(X_2)]$

and therefore,

$E[f_2(X_2)f_1(X_1)f_0(X_0)] = E[E[f_2(X_2)|(X_0,X_1)]\cdot f_1(X_1)f_0(X_0)] = m_2 E[f_1(X_1)f_0(X_0)] = E[f_2(X_2)]\cdot Ef_1(X_1)\cdot E[f_0(X_0)].$

The general result follows by induction.

Besides independence, the two types of dependence structures we will study most in this course are described in the following two definitions.

Definition 4.4 (Markov Property). Let $(\Omega,P)$ be a probability space and $X := \{X_n:\Omega\to S\}_{n=0}^\infty$ be a stochastic process. We say that $(\Omega,X,P)$ has the Markov property if

$E[f(X_{n+1})|(X_0,\dots,X_n)] = E[f(X_{n+1})|X_n]$

for all $n\ge0$ and all bounded or non-negative functions $f:S\to\mathbb{R}$.

Definition 4.5 (Martingales). Let $\{M_n\}_{n=0}^\infty$ be a sequence of complex- or real-valued integrable random variables on a probability space $(\Omega,P)$. We say $\{M_n\}_{n=0}^\infty$ is a martingale if

$E[M_{n+1}|(M_0,\dots,M_n)] = M_n \text{ for } n = 0,1,2,\dots.$

In other words,

$E[M_{n+1}-M_n|(M_0,\dots,M_n)] = 0,$

which states that the expected future increment given the past is zero.

Example 4.6. If $\{X_n\}_{n=0}^\infty$ are independent random variables with $EX_n = 0$, then $M_n = X_0+\dots+X_n$ is a martingale. Indeed,

$E[M_{n+1}-M_n|(M_0,\dots,M_n)] = E[M_{n+1}-M_n|(X_0,\dots,X_n)] = E[X_{n+1}|(X_0,\dots,X_n)] = EX_{n+1} = 0,$

wherein we have used item 5 of Theorem 3.15 for the second-to-last equality.


4.2 Stopping Times

Definition 4.7 (Stopping time). A function $\tau:\Omega\to\bar{\mathbb{N}} := \mathbb{N}_0\cup\{\infty\}$ is an $\mathcal{F}_n$ stopping time iff

$\{\tau=n\}\in\mathcal{F}_n \text{ for all } n\in\mathbb{N}_0.$

[In words: we should be able to determine whether we are going to stop at time $n$ from the information about $\{X_k\}_{k=0}^\infty$ that we have up to time $n$. See Remark 4.12 below for a more precise version of this statement.]

Remark 4.8. According to Lemma 3.2, $\tau:\Omega\to\bar{\mathbb{N}}$ is a stopping time iff for each $n\in\mathbb{N}_0$ there exists a function $f_n$ so that

$1_{\tau=n} = f_n(X_0,\dots,X_n).$

More precisely, there must exist sets $A_n\subset S^{n+1}$ for $n\in\mathbb{N}_0$ so that

$\{\tau=n\} = \{(X_0,\dots,X_n)\in A_n\} \quad \forall\, n\in\mathbb{N}_0.$

Example 4.9. If $\tau:\Omega\to\bar{\mathbb{N}}$ is constant, $\tau(\omega) = k$ for some $k\in\mathbb{N}_0$, then $\tau$ is a stopping time.

Example 4.10 (First hitting times). Let $A\subset S$ be a set and let

$H_A := \inf\{n\ge0 : X_n\in A\},$

where $\inf\emptyset := \infty$. We call $H_A$ the first hitting time of $A$. Since

$\{H_A=n\} = \{X_0\notin A,\dots,X_{n-1}\notin A,\ X_n\in A\} = \{X_0\in A^c,\dots,X_{n-1}\in A^c,\ X_n\in A\} = (X_0,X_1,\dots,X_n)^{-1}(A^c\times\dots\times A^c\times A)\in\mathcal{F}_n^X,$

we see that $H_A$ is a stopping time.
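First hitting times are computed from the path up to the present, exactly as the stopping-time property requires. A simulation sketch for a simple random walk (the walk and the target set are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)

def first_hitting_time(steps, A):
    """H_A = inf{n >= 0 : X_n in A} for the walk X_n = X_0 + sum of steps."""
    X = np.concatenate(([0], np.cumsum(steps)))  # X_0 = 0, X_1, X_2, ...
    hits = np.nonzero(np.isin(X, list(A)))[0]
    return hits[0] if hits.size else np.inf      # inf(empty set) = infinity

steps = rng.choice([-1, 1], size=10_000)  # +/-1 steps of a simple random walk
print(first_hitting_time(steps, A={5}))   # first time the walk reaches 5
```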

Example 4.11 (First hitting time after 0). Let $A\subset S$ be a set and let

$T_A := \inf\{n>0 : X_n\in A\},$

where $\inf\emptyset := \infty$. Since $\{T_A=0\} = \emptyset\in\mathcal{F}_0$ and for $n\ge1$ we have

$\{T_A=n\} = \{X_1\in A^c,\dots,X_{n-1}\in A^c,\ X_n\in A\} = (X_0,X_1,\dots,X_n)^{-1}(S\times A^c\times\dots\times A^c\times A)\in\mathcal{F}_n^X,$

it follows that $T_A$ is a stopping time.

Remark 4.12 (Another stopping time test). Let $\tau:\Omega\to\bar{\mathbb{N}}$ be a function; then $\tau$ is a stopping time iff $\tau$ satisfies the following test.

Stopping time test. To each $\omega\in\Omega$, let $n = \tau(\omega)\in\bar{\mathbb{N}}$. If $n=\infty$ there is nothing to check. If $n<\infty$, then $\tau$ passes the test if $\tau(\omega') = n$ for any other $\omega'\in\Omega$ such that $X_k(\omega) = X_k(\omega')$ for $0\le k\le n$.

Exercise 4.1. Let $\tau:\Omega\to\bar{\mathbb{N}}$ be a function. Verify the following are equivalent:

1. $\tau$ is a stopping time.
2. $\{\tau\le n\}\in\mathcal{F}_n^X$ for all $n\in\mathbb{N}_0$.
3. $\{\tau>n\}\in\mathcal{F}_n^X$ for all $n\in\mathbb{N}_0$.

Also show that if $\tau$ is a stopping time, then $\{\tau=\infty\}\in\mathcal{F}_\infty^X$.

Exercise 4.2. If $\tau$ and $\sigma$ are two stopping times, show that $\sigma\wedge\tau = \min\{\sigma,\tau\}$, $\sigma\vee\tau = \max\{\sigma,\tau\}$, and $\sigma+\tau$ are all stopping times.

Exercise 4.3 (Hitting time after a stopping time). Let $\sigma$ be any stopping time. Show that

$\tau_1 = \inf\{n\ge\sigma : X_n\in B\} \quad\text{and}\quad \tau_2 = \inf\{n>\sigma : X_n\in B\}$

are both stopping times.

Definition 4.13. Let $\tau$ be a stopping time. A function $F:\Omega\to W$ is said to be $\mathcal{F}_\tau^X$ measurable if

$F = \text{``}\sum_{n\in\bar{\mathbb{N}}} 1_{\tau=n}F_n\text{''}, \quad\text{i.e. } F = F_n \text{ on } \{\tau=n\}\ \forall\, n\in\bar{\mathbb{N}},$

where $F_n:\Omega\to W$ is $\mathcal{F}_n^X$ measurable for all $n\in\bar{\mathbb{N}}$. In more detail, we are assuming for each $n\in\bar{\mathbb{N}}$ there exists $f_n:S^{n+1}\to W$ such that $F_n = f_n(X_0,\dots,X_n)$ and $F = F_n$ on $\{\tau=n\}$. We also say that $A\subset\Omega$ is in $\mathcal{F}_\tau^X$ iff $A\cap\{\tau=n\}\in\mathcal{F}_n^X$ for all $n\in\bar{\mathbb{N}}$.

Remark 4.14. A set $A\subset\Omega$ is in $\mathcal{F}_\tau^X$ iff $1_A$ is $\mathcal{F}_\tau^X$ measurable. Indeed, if $A\in\mathcal{F}_\tau^X$ then there exists $f_n:S^{n+1}\to\{0,1\}$ such that $1_{A\cap\{\tau=n\}} = f_n(X_0,\dots,X_n)$. Conversely, if $1_A$ is $\mathcal{F}_\tau^X$ measurable, then there exists $f_n:S^{n+1}\to\mathbb{R}$ such that

$1_A = \sum_{n\in\bar{\mathbb{N}}} 1_{\tau=n} f_n(X_0,\dots,X_n)$

and so $1_{A\cap\{\tau=n\}} = 1_{\tau=n} f_n(X_0,\dots,X_n)$. Thus it follows that

$A\cap\{\tau=n\} = \{f_n(X_0,\dots,X_n) = 1\}\cap\{\tau=n\}, \text{ where } \{f_n(X_0,\dots,X_n)=1\} = (X_0,\dots,X_n)^{-1}\left(f_n^{-1}(\{1\})\right)\in\mathcal{F}_n^X,$

which implies $A\cap\{\tau=n\}\in\mathcal{F}_n^X$ for all $n\in\bar{\mathbb{N}}$ and hence $A\in\mathcal{F}_\tau^X$.


Example 4.15. If $\tau$ is a stopping time and $f:S\to\mathbb{R}$ is a function, then $F := 1_{\tau<\infty}f(X_\tau)$ is $\mathcal{F}_\tau$ measurable. Indeed, for $n\in\mathbb{N}_0$ we have

$F\,1_{\tau=n} = f(X_n)\,1_{\tau=n}.$

Example 4.16. If $f\ge0$, then

$F := \sum_{k\le\tau} f(X_k)$

is $\mathcal{F}_\tau$ measurable. Indeed, for $n\in\bar{\mathbb{N}}$ we have

$F\,1_{\tau=n} = 1_{\tau=n}\sum_{k\le n} f(X_k),$

which is of the desired form.

Theorem 4.17. Suppose now that $P$ is a probability on $\Omega$. If $Z\in L^1(P)$ and $\tau$ is a stopping time, then

$E[Z|\mathcal{F}_\tau] = \sum_{n\in\bar{\mathbb{N}}} 1_{\tau=n} E[Z|\mathcal{F}_n]. \quad (4.2)$

Proof. Let $F$ be a bounded $\mathcal{F}_\tau$ measurable function, i.e. $F = \sum_{n\in\bar{\mathbb{N}}} 1_{\tau=n}F_n$ where $F_n = f_n(X_0,\dots,X_n)$ are $\mathcal{F}_n$ measurable functions as in Definition 4.13. Since $F1_{\tau=n} = F_n1_{\tau=n}$ is $\mathcal{F}_n$ measurable for all $n$, we find

$E[ZF] = \sum_{n\in\bar{\mathbb{N}}} E\left[ZF_n1_{\tau=n}\right] = \sum_{n\in\bar{\mathbb{N}}} E\left[E[Z|\mathcal{F}_n]F_n1_{\tau=n}\right] = \sum_{n\in\bar{\mathbb{N}}} E\left[E[Z|\mathcal{F}_n]F1_{\tau=n}\right] = E\left[\left(\sum_{n\in\bar{\mathbb{N}}} E[Z|\mathcal{F}_n]1_{\tau=n}\right)F\right].$

As this is true for all bounded $\mathcal{F}_\tau$ measurable functions $F$, we conclude that Eq. (4.2) holds.


Part II

Discrete Time & Space Markov Processes


5

Markov Chain Basics

In deterministic modeling one often has a dynamical system on a state space $S$. The dynamical system typically takes one of the two forms:

1. There exists $f:S\to S$, and a state $x_n$ then evolves according to the rule $x_{n+1} = f(x_n)$. [More generally one might allow $x_{n+1} = f_n(x_0,\dots,x_n)$ where $f_n:S^{n+1}\to S$ is a given function for each $n$.]
2. There exists a vector field $f$ on $S$ (where now $S = \mathbb{R}^d$ or a manifold) such that $\dot x(t) = f(x(t))$. [More generally, we might allow for $\dot x(t) = f\left(t; x|_{[0,t]}\right)$, a functional differential equation.]

Much of our time in this course will be spent exploring the above two situations where some extra randomness is added at each stage of the game. Namely:

1. We may now have that $X_{n+1}\in S$ is random and evolves according to

$X_{n+1} = f(X_n,\xi_n)$

where $\{\xi_n\}_{n=0}^\infty$ is a sequence of i.i.d. random variables. Alternatively put, we might simply let $f_n := f(\cdot,\xi_n)$, so that $f_n:S\to S$ is a sequence of i.i.d. random functions from $S$ to $S$.¹ Then $\{X_n\}_{n=0}^\infty$ is defined recursively by

$X_{n+1} = f_n(X_n) \text{ for } n = 0,1,2,\dots \quad\text{(i.e. } X_{n+1}(\omega) = f_n(\omega, X_n(\omega))\text{)} \quad (5.1)$

This is the typical example of a time-homogeneous Markov chain. We assume that the initial condition $X_0\in S$ is given and is either deterministic or independent of $\{f_n\}_{n=0}^\infty$.

2. Later in the course we will study the continuous time analogue,

"$\dot X_t = f_t(X_t)$"

where $\{f_t\}_{t\ge0}$ are again i.i.d. random vector fields. The continuous time case will require substantially more technical care.

1 To be more explicit, we actually have $f_n:\Omega\times S\to S$. When we write $f_n(y)$ below, this is shorthand for the random function

$\Omega\ni\omega\mapsto f_n(\omega,y).$

As usual in probability, we do our best to suppress the $\omega$ from the notation.

5.1 Markov Chain Descriptions

Notation 5.1. Given a random function $f:S\to S$ we may describe its "statistics" or distribution in two ways. The first is to assign a matrix to $f$, while the second is to assign a weighted graph to $f$.

1. (Matrix assignment.) The first way is to let

$p(x,y) := P(f(x)=y) = P(\{\omega\in\Omega : f_\omega(x)=y\}) \quad \forall\, x,y\in S.$

Here we must have that $p(x,y)\in[0,1]$ and $\sum_{y\in S}p(x,y) = 1$ for all $x\in S$.

2. (Jump diagram.) Given $p(x,y)$ as above, let $\langle x,y\rangle$ be an edge of a graph over $S$ iff $p(x,y)>0$, and weight this edge by $p(x,y)$.

The function $p:S\times S\to[0,1]$ is called the one-step transition probability associated to the Markov chain $\{X_n\}_{n=0}^\infty$. Notice that:

a) $\sum_y p(x,y) = \sum_{y\in S} P(y = f_n(x)) = 1$, and
b) as we will see in Theorem 5.11 below, the law of $\{X_n\}_{n=0}^\infty$ is completely determined by the one-step transition probability $p$ and the initial distribution, $\pi(x) := P(X_0=x)$.

Notation 5.2. If $p(x,y)\in[0,1]$ and $\sum_{y\in S}p(x,y) = 1$ for all $x\in S$, we let $\mathbf{P}$ be the matrix indexed by $S$ so that $\mathbf{P}_{xy} = p(x,y)$ for all $x,y\in S$. We refer to such a matrix $\mathbf{P}$ as a Markov matrix or a transition matrix.

Example 5.3. The transition matrix (rows and columns indexed by the states 1, 2, 3)

$\mathbf{P} = \begin{pmatrix} 1/4 & 1/2 & 1/4 \\ 1/2 & 0 & 1/2 \\ 1/3 & 1/3 & 1/3 \end{pmatrix}$

is represented by the jump diagram in Figure 5.1.

Example 5.4. The jump diagram for

$\mathbf{P} = \begin{pmatrix} 1/4 & 1/2 & 1/4 \\ 1/2 & 0 & 1/2 \\ 1/3 & 1/3 & 1/3 \end{pmatrix}$

is shown in Figure 5.2.


Fig. 5.1. A simple 3-state jump diagram. We typically abbreviate the jump diagram on the left by the one on the right; that is, we infer by conservation of probability that there has to be probability 1/4 of staying at 1, 1/3 of staying at 3, and 0 probability of staying at 2.

Fig. 5.2. In the above diagram there are jumps from 1 to 1 with probability 1/4 and jumps from 3 to 3 with probability 1/3 which are not explicitly shown but must be inferred by conservation of probability.

Example 5.5. Suppose that $S = \{1,2,3\}$; then

$\mathbf{P} = \begin{pmatrix} 0 & 1 & 0 \\ 1/2 & 0 & 1/2 \\ 1 & 0 & 0 \end{pmatrix}$

has the jump diagram given in Figure 5.3.

Example 5.6 (Realizing Random Functions I). Suppose we are given the Markov matrix

$\mathbf{P} = \begin{pmatrix} 1/2 & 1/4 & 1/4 \\ 1/3 & 1/3 & 1/3 \\ 3/4 & 1/8 & 1/8 \end{pmatrix}$

Fig. 5.3. A simple 3-state jump diagram.

and we want to simulate the associated chain by producing random functions $\{f_n\}_{n=0}^\infty$ such that $P(f_n(i)=j) = \mathbf{P}_{ij}$ for $1\le i,j\le 3$. One experimental way to do this is to use three independent spinners as shown in Figure 5.4.

Fig. 5.4. In this figure we generate the random function $f:\{1,2,3\}\to\{1,2,3\}$ using the results of three independent spinners. For the sample point shown here we have $f(1)=2$, $f(2)=3$, and $f(3)=1$.

In the previous example we constructed our random function $f$ so that $f(1), f(2), f(3)$ were independent. In fact, at any given time $n$ we only use one of the values $f_n(1), f_n(2), f_n(3)$, and so the independence, and in fact the joint distribution, of the three values $f_n(1), f_n(2), f_n(3)$ is unimportant. The next example uses this fact to produce a more numerically tractable construction of the random functions $\{f_n\}$.

Example 5.7 (Realizing Random Functions II). Again suppose that $\mathbf{P}$ is the Markov matrix in Example 5.6. We now keep the same spinner faces as in Figure 5.4 but now use a "master spinner" to determine the placement of the needle on all of the spin faces at once, see Figure 5.5. In this case we still have $P(f(i)=j) = \mathbf{P}_{ij}$, but it is now no longer true that $f(1), f(2), f(3)$ are independent. As mentioned above, this is unimportant from the Markov chain's point of view.
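Numerically, the master-spinner idea amounts to drawing one uniform $U_n$ per time step and reading $f_n(i)$ off the cumulative row sums of $\mathbf{P}$. A minimal sketch (NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(6)

P = np.array([[1/2, 1/4, 1/4],
              [1/3, 1/3, 1/3],
              [3/4, 1/8, 1/8]])
cum = P.cumsum(axis=1)  # row-wise CDFs: one spinner face per state

def step(x, u):
    """f_n(x) for the master-spinner construction: one uniform u drives
    every row, so P(f_n(i) = j) = P[i, j], but f_n(1), f_n(2), f_n(3)
    are dependent -- which is irrelevant for the law of the chain."""
    return int(np.searchsorted(cum[x], u))

X = [0]  # start the chain at state 1 (index 0)
for u in rng.uniform(size=10_000):
    X.append(step(X[-1], u))

print(np.bincount(X, minlength=3) / len(X))  # empirical occupation frequencies
```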


Fig. 5.5. In this figure we generate the random function $f:\{1,2,3\}\to\{1,2,3\}$ using the result of a master spinner to determine the (same) placement of the needle on all three spin faces at once. For this sample point we have $f(1)=f(2)=3$ and $f(3)=1$.

Example 5.8 (Random Walks on Graphs). Let $S$ be a set and $G$ be a graph on $S$. We then take $\{f_n\}_{n=0}^\infty$ i.i.d. such that

$P(f_n(x)=y) = \begin{cases} 0 & \text{if } \{x,y\}\notin G \\ \frac1{d(x)} & \text{if } \{x,y\}\in G \end{cases}$

where $d(x) := \#\{y\in S : \{x,y\}\in G\}$. We can give a similar definition for directed graphs, namely

$P(f_n(x)=y) = \begin{cases} 0 & \text{if } \langle x\to y\rangle\notin G \\ \frac1{d(x)} & \text{if } \langle x\to y\rangle\in G \end{cases}$

where now

$d(x) := \#\{y\in S : \langle x\to y\rangle\in G\}.$

A directed graph on $S$ is a subset $G\subset S^2\setminus\Delta$ where $\Delta := \{(s,s) : s\in S\}$. We say that $G$ is undirected if $(s,t)\in G$ implies $(t,s)\in G$. As we have seen, every Markov chain is really determined by a weighted random walk on a graph.

Example 5.9. Suppose we flip a fair coin repeatedly and would like to find the first time the pattern $HHT$ appears. To do this we will later examine the Markov chain $Y_n = (X_n, X_{n+1}, X_{n+2})$, where $\{X_n\}_{n=0}^\infty$ is the sequence of unbiased independent coin flips with values in $\{H,T\}$. The state space for $Y_n$ is

$S = \{TTT, THT, TTH, THH, HHH, HTT, HTH, HHT\}.$

The transition matrix for recording three flips in a row of a fair coin is

$\mathbf{P} = \frac12\begin{array}{c|cccccccc} & TTT & THT & TTH & THH & HHH & HTT & HTH & HHT \\ \hline TTT & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ THT & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 \\ TTH & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 \\ THH & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 1 \\ HHH & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 1 \\ HTT & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ HTH & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 \\ HHT & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 \end{array}$

Example 5.10 (Ehrenfest Urn Model). Let a beaker filled with a particle–fluid mixture be divided into two parts $A$ and $B$ by a semipermeable membrane. Let $X_n = (\#\text{ of particles in } A)$, which we assume evolves by choosing a particle at random from $A\cup B$ and then replacing this particle in the opposite bin from which it was found. Modeling $\{X_n\}$ as a Markov process we find

$P(X_{n+1}=j\,|\,X_n=i) = \begin{cases} 0 & \text{if } j\notin\{i-1,i+1\} \\ \frac{i}{N} & \text{if } j=i-1 \\ \frac{N-i}{N} & \text{if } j=i+1 \end{cases} =: q(i,j).$

As these probabilities do not depend on $n$, $\{X_n\}$ is a time-homogeneous Markov chain.
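The Ehrenfest transition matrix is tridiagonal and easy to assemble. A minimal sketch (the value of $N$ is an illustrative choice):

```python
import numpy as np

def ehrenfest_matrix(N):
    """q(i, j) for the Ehrenfest urn: from i particles in A, move to i-1
    with probability i/N and to i+1 with probability (N-i)/N."""
    q = np.zeros((N + 1, N + 1))
    for i in range(N + 1):
        if i > 0:
            q[i, i - 1] = i / N
        if i < N:
            q[i, i + 1] = (N - i) / N
    return q

q = ehrenfest_matrix(4)
print(q.sum(axis=1))  # each row sums to 1, as required of a Markov matrix
```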

Exercise 5.1. Consider a rat in a maze consisting of 7 rooms laid out as in the following figure:

1 2 3
4 5 6
7

In this figure rooms are connected by vertical or horizontal adjacent passages only, so that 1 is connected to 2 and 4 but not to 5, and 7 is connected only to 4. At each time $t\in\mathbb{N}_0$ the rat moves from her current room to one of the adjacent rooms with equal probability (the rat always changes rooms at each time step). Find the one-step $7\times7$ transition matrix $q$ with entries given by $q(i,j) := P(X_{n+1}=j|X_n=i)$, where $X_n$ denotes the room the rat is in at time $n$.

5.2 Joint Distributions of an MC

Theorem 5.11. Suppose that $\{f_n\}_{n=0}^\infty$ are i.i.d. random functions² from $S$ to $S$, $X_0\in S$ is a random variable independent of the $\{f_n\}_{n=0}^\infty$, and $\{X_n\}_{n=1}^\infty$ is generated as in Eq. (5.1). Further let

$p(x,y) := P(y = f_n(x)) = P(y = f_1(x))$

and

$\pi(x) := P(X_0 = x).$

Then

$P_\pi(X_0=x_0, X_1=x_1,\dots,X_n=x_n) = \pi(x_0)\,p(x_0,x_1)\,p(x_1,x_2)\cdots p(x_{n-1},x_n) \quad (5.2)$

for all $x_0,\dots,x_n\in S$. In particular, if $G:S^{n+1}\to\mathbb{R}$ is a function, then

$E_\pi[G(X_0,\dots,X_n)] = \sum_{x_0,\dots,x_n\in S} G(x_0,\dots,x_n)\,\pi(x_0)\,p(x_0,x_1)\,p(x_1,x_2)\cdots p(x_{n-1},x_n). \quad (5.3)$

Proof. This is straightforward to verify using

$\{X_0=x_0, X_1=x_1,\dots,X_n=x_n\} = \{X_0=x_0\}\cap\left(\cap_{k=0}^{n-1}\{x_{k+1}=f_k(x_k)\}\right),$

and hence, as $(X_0, f_0(x_0),\dots,f_{n-1}(x_{n-1}))$ is a list of independent $S$-valued random variables, it follows that

$P_\pi(X_0=x_0, X_1=x_1,\dots,X_n=x_n) = P(X_0=x_0)\cdot\prod_{k=0}^{n-1}P(x_{k+1}=f_k(x_k)) = \pi(x_0)\cdot\prod_{k=0}^{n-1}p(x_k,x_{k+1}).$

2 Actually, the functions $f_n$ do not need to be identically distributed; we really need only require that $P(y = f_n(x))$ is independent of $n$ for all $x, y$. The correlations between events like $\{y = f_n(x)\}$ and $\{y' = f_n(x')\}$ are irrelevant.

Notation 5.12 ($P_x$ and $P_\pi$). We will either write $\{X_n^\pi\}_{n=0}^\infty$ for the Markov chain constructed in Theorem 5.11 when $\pi(y) := P(X_0=y)$, or we will indicate the starting distribution $\pi$ by writing $P_\pi$ rather than $P$ in this case. If $\pi(y) = \delta_x(y)$ for some $x\in S$, then we simply write $\{X_n^x\}_{n=0}^\infty$ instead of $\{X_n^{\delta_x}\}_{n=0}^\infty$ and $P_x$ instead of $P_{\delta_x}$. We also let $E_\pi$ and $E_x$ denote expectations relative to $P_\pi$ and $P_x$ respectively. We may also write

$P_\pi = \sum_{x\in S}\pi(x)P_x \quad\text{and}\quad E_\pi = \sum_{x\in S}\pi(x)E_x.$

Proposition 5.13. Let us continue the notation above. The Markov chain $\{X_n\}_{n=0}^\infty$ has the Markov property. More precisely, if $g:S\to\mathbb{R}$ is a bounded function, then

$E[g(X_{n+1})|(X_0,\dots,X_n)] = (\mathbf{P}g)(X_n) = E[g(X_{n+1})|X_n], \quad (5.4)$

where

$(\mathbf{P}g)(x) := \sum_{y\in S}p(x,y)\,g(y) = \sum_{y\in S}\mathbf{P}_{xy}\,g(y).$

We also have

$P(X_{n+1}=x_{n+1}|X_0=x_0,\dots,X_n=x_n) = p(x_n,x_{n+1}) = P(X_{n+1}=x_{n+1}|X_n=x_n). \quad (5.5)$

Proof. We will give two proofs. The first proof makes use of the construction of $\{X_n\}_{n=0}^\infty$ in Eq. (5.1). The second proof uses the description of the distribution of $\{X_n\}$ given in Theorem 5.11.

Proof 1. By the construction of $\{X_n\}_{n=0}^\infty$ in Eq. (5.1), the events $\{X_0=x_0,\dots,X_n=x_n\}$ and $\{f_n(x_n)=y\}$ are independent, and therefore

$h(x_0,\dots,x_n) := E[g(X_{n+1})|X_0=x_0,\dots,X_n=x_n] = E[g(f_n(x_n))|X_0=x_0,\dots,X_n=x_n] = E[g(f_n(x_n))] = \sum_{y\in S}g(y)\,P(f_n(x_n)=y) = \sum_{y\in S}p(x_n,y)\,g(y).$

It now follows from Proposition 3.10 that

$E[g(X_{n+1})|(X_0,\dots,X_n)] = h(X_0,\dots,X_n) = \sum_{y\in S}p(X_n,y)\,g(y).$

Since the latter expression only depends on $X_n$, the previous equation along with the tower property (Theorem 3.19) of conditional expectations implies³

$\sum_{y\in S}p(X_n,y)\,g(y) = E[g(X_{n+1})|X_n].$

At this point Eq. (5.4) is proved. A similar argument (or take $g(y) = \delta_{x_{n+1},y}$ in Eq. (5.4)) also shows

$P(X_{n+1}=x_{n+1}|X_0=x_0,\dots,X_n=x_n) = p(x_n,x_{n+1}).$

Similarly, since $\{f_n(x_n)=x_{n+1}\}$ and $\{X_n=x_n\}$ are independent events, we find

$P(X_{n+1}=x_{n+1}|X_n=x_n) = P(f_n(x_n)=x_{n+1}|X_n=x_n) = P(f_n(x_n)=x_{n+1}) = p(x_n,x_{n+1}).$

3 We could alternatively make use of the fact that $\{X_n=x_n\}$ is independent of $f_n(x_n)$ and then, working as before, directly show the second equality in Eq. (5.4).

Proof 2. We have

$E[g(X_{n+1}) : X_0=x_0,\dots,X_n=x_n] = E[1_{x_0}(X_0)\cdots 1_{x_n}(X_n)\,g(X_{n+1})] = \sum_{x_{n+1}\in S}\pi(x_0)\,p(x_0,x_1)\cdots p(x_{n-1},x_n)\,p(x_n,x_{n+1})\,g(x_{n+1}) = P(X_0=x_0,\dots,X_n=x_n)\sum_{x_{n+1}\in S}p(x_n,x_{n+1})\,g(x_{n+1}),$

from which it follows that

$E[g(X_{n+1})|X_0=x_0,\dots,X_n=x_n] = \sum_{y\in S}p(x_n,y)\,g(y). \quad (5.6)$

It now follows from this identity and Proposition 3.10 that

$E[g(X_{n+1})|(X_0,\dots,X_n)] = \sum_{y\in S}p(X_n,y)\,g(y),$

which proves the first equality in Eq. (5.4). This equality along with the tower property of conditional expectation then implies

$\sum_{y\in S}p(X_n,y)\,g(y) = E\left[\sum_{y\in S}p(X_n,y)\,g(y)\,\Big|\,X_n\right] = E[E[g(X_{n+1})|(X_0,\dots,X_n)]|X_n] = E[g(X_{n+1})|X_n].$

Taking $g = 1_{x_{n+1}}$ in Eq. (5.6) shows

$P(X_{n+1}=x_{n+1}|X_0=x_0,\dots,X_n=x_n) = p(x_n,x_{n+1}),$

or equivalently that

$P(X_{n+1}=x_{n+1}, X_0=x_0,\dots,X_n=x_n) = P(X_0=x_0,\dots,X_n=x_n)\,p(x_n,x_{n+1}).$

Summing this last equation on $x_0,\dots,x_{n-1}$ gives

$P(X_{n+1}=x_{n+1}, X_n=x_n) = P(X_n=x_n)\,p(x_n,x_{n+1}),$

which implies

$P(X_{n+1}=x_{n+1}|X_n=x_n) = p(x_n,x_{n+1}).$

Both equalities in Eq. (5.5) have now been verified.

Corollary 5.14. For all $n\in\mathbb{N}$, $P_\pi(X_n=y) = \sum_{x\in S}\pi(x)\,p_n(x,y)$, where $p_0(x,y) = \delta_x(y)$, $p_1 = p$, and $\{p_n\}_{n\ge1}$ is defined inductively by

$p_n(x,y) := \sum_{z\in S}p(x,z)\,p_{n-1}(z,y).$

Proof. From Eq. (5.3) with $G(x_0,\dots,x_n) = \delta_y(x_n)$ we learn that

$P_\pi(X_n=y) = \sum_{x_0,\dots,x_n\in S}\delta_y(x_n)\,\pi(x_0)\,p(x_0,x_1)\,p(x_1,x_2)\cdots p(x_{n-1},x_n) = \sum_{x\in S}\pi(x)\,p_n(x,y).$
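Numerically, $p_n$ is just the $n$th matrix power, and the time-$n$ law is $\pi\mathbf{P}^n$ when distributions are written as row vectors. A quick sketch using the matrix of Example 5.3 (NumPy assumed):

```python
import numpy as np

P = np.array([[1/4, 1/2, 1/4],
              [1/2, 0.0, 1/2],
              [1/3, 1/3, 1/3]])
pi0 = np.array([1.0, 0.0, 0.0])  # start surely in state 1

# p_n(x, y) is the (x, y) entry of P^n, and P_pi(X_n = y) = (pi0 @ P^n)[y].
Pn = np.linalg.matrix_power(P, 10)
print(pi0 @ Pn)        # the law of X_10
print(Pn.sum(axis=1))  # rows of P^n still sum to 1
```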

Notation 5.15. To emphasize the matrix aspect of a Markov process, we will often write $\mathbf{P}$ for the "matrix" with entries $\{p(x,y)\}_{x,y\in S}$ and correspondingly often write $\mathbf{P}_{xy}$ for $p(x,y)$ and $\mathbf{P}^n_{x,y}$ for $p_n(x,y)$.

Definition 5.16. We say π : S → [0, 1] is an invariant distribution for theMarkov chain determined by p : S × S → [0, 1] provided

∑y∈S π (y) = 1 and

π (y) =∑x∈S

π (x) p (x, y) for all y ∈ S.

Alternatively put, π, should be a distribution such that Pπ (Xn = x) = π (x) forall n ∈ N0 and x ∈ S. [The invariant distribution is found by solving the matrixequation, πP = π under the restrictions that π (x) ≥ 0 and

∑x∈S π (x) = 1.]

Page: 39 job: 285notes macro: svmonob.cls date/time: 3-Jun-2016/13:04

Page 48: Math 285 Stochastic Processes Spring 2016bdriver/285_S2016/Lecture Notes/285notes.pdfBruce K. Driver Math 285 Stochastic Processes Spring 2016 June 3, 2016 File:285notes.tex

40 5 Markov Chain Basics

Example 5.17. If

P =

1/4 1/2 1/41/2 0 1/21/3 1/3 1/3

,then Ptr has eigenvectors,

− 23− 1

31

↔ 0,

1

561

↔ 1,

1−21

↔ − 5

12.

The invariant distribution is given by

π =1

1 + 1 + 56

[1 5

6 1]

=[

617

517

617

]=[

0.352 94 0.294 12 0.352 94].

Notice that

P100 =

1/4 1/2 1/41/2 0 1/21/3 1/3 1/3

100

=

0.352 94 0.294 12 0.352 940.352 94 0.294 12 0.352 940.352 94 0.294 12 0.352 94

∼=

πππ

.Exercise 5.2 (2 - step MC). Consider the following simple (i.e. no-brainer)two state “game” consisting of moving between two sites labeled 1 and 2. Ateach site you find a coin with sides labeled 1 and 2. The probability of flipping a2 at site 1 is a ∈ (0, 1) and a 1 at site 2 is b ∈ (0, 1). If you are at site i at time n,then you flip the coin at this site and move or stay at the current site as indicatedby coin toss. We summarize this scheme by the “jump diagram” of Figure 5.6.It is reasonable to suppose that your location, Xn, at time n is modeled by a

11−a22

a++

2b

kk 1−bll

Fig. 5.6. The generic jump diagram for a two state Markov chain.

Markov process with state space, S = 1, 2 . Explain (briefly) why this is atime homogeneous chain and find the one step transition probabilities,

p (i, j) = P (Xn+1 = j|Xn = i) for i, j ∈ S.

Use your result and basic linear (matrix) algebra to compute,limn→∞ P (Xn = 1) . Your answer should be independent of the possiblestarting distributions, ν = (ν1, ν2) for X0 where νi := P (X0 = i) .

Solution to Exercise 5.2. Writing P as a matrix with entry in the ith rowand jth column being p (i, j) , we have

P =

[1− a ab 1− b

].

If P (X0 = i) = νi for i = 1, 2 then

P (Xn = 1) =

2∑k=1

νkPnk,1 = [νPn]1

where we now write ν = (ν1, ν2) as a row vector. A simple computation showsthat

det(Ptr − λI

)= det (P− λI)

= λ2 + (a+ b− 2)λ+ (1− b− a)

= (λ− 1) (λ− (1− a− b)) .

Note that

P

[11

]=

[11

]since

∑j p (i, j) = 1 – this is a general fact. Thus we always know that λ1 = 1

is an eigenvalue of P. The second eigenvalue is λ2 = 1− a− b. We now find theeigenvectors of Ptr;

Nul(Ptr − λ1I

)= Nul

([−a ba −b

])= R ·

[ba

]while

Nul(Ptr − λI2

)= Nul

([b ba a

])= R ·

[1−1

].

In fact by a direct check we have,[b a]P =

[b a]

and[1 −1

]P = (1− b− a)

[1 −1

].

Thus we may writeν = α (b, a) + β (1,−1)

where

Page: 40 job: 285notes macro: svmonob.cls date/time: 3-Jun-2016/13:04

Page 49: Math 285 Stochastic Processes Spring 2016bdriver/285_S2016/Lecture Notes/285notes.pdfBruce K. Driver Math 285 Stochastic Processes Spring 2016 June 3, 2016 File:285notes.tex

5.3 Hitting Times Estimates 41

1 = ν · (1, 1) = α (b, a) · (1, 1) = α (a+ b) .

Thus β = ν1 − b = − (ν2 − a) , we have

ν =1

a+ b(b, a) + β (1,−1) .

and therefore,

νPn = (b, a) Pn + β (1,−1) Pn =1

a+ b(b, a) + β (1,−1)λn2 .

By our assumptions on and a, b ∈ (0, 1) it follows that |λ2| < 1 and therefore

limn→∞

νPn =1

a+ b(b, a)

and we have shown

limn→∞

P (Xn = 1) =b

a+ band lim

n→∞P (Xn = 2) =

a

a+ b

independent of the starting distribution ν. Also observe that the convergence isexponentially fast. Notice that

π :=1

a+ b(b, a)

is the invariant distribution of this chain.

5.3 Hitting Times Estimates

Lemma 5.18. For any random time T : Ω → N∪0,∞ we have

P (T =∞) = limn→∞

P (T > n) and ET =

∞∑k=0

P (T > k) =

∞∑k=1

P (T ≥ k) .

Proof. The first equality is a consequence of the continuity of P (or useDCT) and the fact that

T > n ↓ T =∞ .The second equality is proved by the following simple computation;

∞∑k=0

P (T > k) =

∞∑k=0

E1T>k = E∞∑k=0

1T>k = ET.

Let us now assume that Xn∞n=0 is a Markov chain with values in S andtransition kernel P. (We often write p (x, y) for Pxy.) We are going to furtherassume that B ⊂ S is non-empty proper subset of S and A = S \B.

Definition 5.19 (Hitting times). Given a subset B ⊂ S we let HB be thefirst time Xn hits B, i.e.

HB = min n : Xn ∈ B

with the convention that HB = ∞ if n : Xn ∈ B = ∅. We call HB the firsthitting time of B by X = Xnn .

Observe that

HB = n = X0 /∈ B, . . . ,Xn−1 /∈ B,Xn ∈ B= X0 ∈ A, . . . ,Xn−1 ∈ A,Xn ∈ B

andHB > n = X0 ∈ A, . . . ,Xn−1 ∈ A,Xn ∈ A

so that HB = n and HB > n only depends on (X0, . . . , Xn) . Recall that arandom time, T : Ω → N ∪ 0,∞ , with either of these properties is called astopping time.

Notation 5.20 Let Q:= Px,yx,y∈A be the matrix P restricted to A.

Proposition 5.21. Using the notation above and letting 1 denote the constantfunction 1 on A then for all x ∈ A;

Px (HB > n) = [Qn1]x (5.7)

Px (HB =∞) = limn→∞

[Qn1]x and (5.8)

Ex (HB) =

∞∑n=0

[Qn1]x . (5.9)

Proof. If x ∈ A and n ∈ N, then

Px (HB > n) = Px (X1 ∈ A,X2 ∈ A, . . . ,Xn ∈ A)

=∑

x1,...,xn∈APx (X0 = x1, . . . , Xn−1 = xn−1, Xn = xn)

=∑

x1,...,xn∈APx,x1

Px1,x2. . .Pxn−1,xn =

∑y∈A

Qnx,y = [Qn1]x

which proves Eq. (5.7). An application of Lemma 5.18 then gives the remainingresults.

Lemma 5.22. If m,n ∈ N and x ∈ A, then

Px (HB > m+ n) =∑y∈A

Qmx,yPy (HB > n) (5.10)

Page: 41 job: 285notes macro: svmonob.cls date/time: 3-Jun-2016/13:04

Page 50: Math 285 Stochastic Processes Spring 2016bdriver/285_S2016/Lecture Notes/285notes.pdfBruce K. Driver Math 285 Stochastic Processes Spring 2016 June 3, 2016 File:285notes.tex

42 5 Markov Chain Basics

andPx (HB =∞) =

∑y∈S

Qmx,yPy (HB =∞) . (5.11)

Proof. From Eq. (5.7),

Px (HB > m+ n) =[Qn+m1

]x

= [QmQn1]x

=∑y∈A

Qmx,y [Qn1]y =

∑y∈A

Qmx,yPy (HB > n)

which is Eq. (5.10). Letting n → ∞ (using DCT) in Eq. (5.10) then leads toEq. (5.11).

Corollary 5.23. If there exists α < 1 such that Px (HB =∞) ≤ α for allx ∈ A, then Px (HB =∞) = 0 for all x ∈ A. [In words; if there is a “uniform”chance that X hits B starting from any site, then X will surely hit B from anypoint in A.]

Proof. From Eqs. (5.7) and (5.11) and the assumption that Px (HB =∞) ≤α for all x ∈ A we find,

Px (HB =∞) ≤∑y∈S

Qmx,yα = α [Qm1]x = αPx (HB > m) .

Letting m ↑ ∞ then shows,

Px (HB =∞) ≤ αPx (HB =∞) =⇒ Px (HB =∞) = 0.

Corollary 5.24. If there exists α ∈ (0, 1) and n ∈ N such that

Px (HB > n) = [Qn1]x ≤ α ∀ x ∈ A, (5.12)

then;

1. I−Q is invertible with

(I−Q)−1x,y =

∞∑n=0

Qnx,y <∞ ∀ x, y ∈ A,

2.ExHB =

[(I−Q)

−11]x≤ n

1− α<∞ ∀ x ∈ A,

where [(I−Q)

−11]x

=∑y∈A

(I−Q)−1x,y =

∑y∈A

∞∑n=0

Qnx,y,

3. and for any 0 < β < n√α and x ∈ A,

Ex[βHB

]≤ β

(1− β n

√α)−1

<∞.

This last assertion implies supx∈A ExHNB < ∞ for all N ∈ N. [In words;

if there is a “uniform” chance that X hits B starting from any site withina fixed number of steps, then the expected hitting time of B is finite andbounded independent of the starting distribution.]

Proof. Proof 1. The assumption in Eq. (5.12) may be stated as sayingQn1 ≤ α1. Therefore if ` = kn+m ∈ N0 with k ∈ N0 (k ≤ `/n) and 0 ≤ m < n,then

Px (HB > `) = Qkn+m1 = [Qn]k

Qm1 ≤ [Qn]k

1 ≤ αk1. (5.13)

Summing this last equation on k ∈ N0 and 0 ≤ m < n (so that ` runs over N0)we find (with the aid of Eq. (5.9)) that

ExHB =

∞∑n=0

[Qn1]x ≤∞∑k=0

∑0≤m<n

αk =n

1− α<∞.

For the last assertion we see from Eq. (5.13) and observations that k ≤ `/n(noted above) we find

Px (HB > `) ≤ α`/n

and therefore,

ExβHB = Px (HB = 0) +

∞∑`=0

β`+1Px (HB = `+ 1)

≤ 1 +

∞∑`=0

β`+1Px (HB > `) ≤ 1 + β

∞∑`=0

β`α`/n

≤ 1 + β(1− β n

√α)−1

.

Proof 2. From Eq. (5.10) and the assumption in Eq. (5.12),

Px (HB > m+ n) =∑y∈A

Qmx,yPy (HB > n) ≤ α

∑y∈A

Qmx,y = αPx (HB > m) .

Therefore if ` = kn+m ∈ N0 with k ∈ N0 and 0 ≤ m < n, then by induction,

Px (HB > `) = Px (HB > kn+m) ≤ αPx (HB > (k − 1)n+m)

≤ α2Px (HB > (k − 2)n+m) · · · ≤ αkPx (HB > m) ≤ αk

Page: 42 job: 285notes macro: svmonob.cls date/time: 3-Jun-2016/13:04

Page 51: Math 285 Stochastic Processes Spring 2016bdriver/285_S2016/Lecture Notes/285notes.pdfBruce K. Driver Math 285 Stochastic Processes Spring 2016 June 3, 2016 File:285notes.tex

5.3 Hitting Times Estimates 43

and hence

ExHB =

∞∑`=0

Px (HB > `) =

∞∑k=0

∑0≤m<n

Px (HB > kn+m)

≤∞∑k=0

∑0≤m<n

αk =n

1− α<∞.

Corollary 5.25. If A = S \ B is a finite set and Px (HB =∞) < 1 for allx ∈ A, then EπHB <∞.

Proof. Since

Px (HB > m) ↓ Px (HB =∞) < 1 for all x ∈ A

we can find Mx < ∞ such that Px (HB > Mx) < 1 for all x ∈ A. As A is afinite set

n := maxx∈A

Mx ∈ N and α := maxx∈A

Px (HB > n) < 1.

Corollary 5.24 now applies to complete the proof.

Remark 5.26 (First step analysis arguments). We can also computePx (HB > n) using the first step analysis. (Please note: this remarkmakes use of information developed in Chapter 6 below.) To do this letun (x) := Px (HB > n) for x ∈ A and observe that Px (HB > 0) = 0 for x ∈ Band u0 (x) = 1. The first step analysis then gives,

un+1 (x) = Px (HB > n+ 1) =∑y∈S

p (x, y)Px (HB > n+ 1|X1 = y)

=∑y∈A

p (x, y)Px (HB > n+ 1|X1 = y) =∑y∈A

p (x, y)Py (HB > n)

=∑y∈A

p (x, y)un (y) .

Thus we have shown un+1 = Qun for all n and therefore un = Qnu0 = un =Qn1 as before.

Similarly using the first step analysis we find,

Px (HB =∞) =∑y∈S

p (x, y)Py (HB =∞) =∑y∈A

p (x, y)Py (HB =∞)

from which it follows that

Px (HB =∞) =∑y∈S

Qx,yPy (HB =∞)

or in matrix notation, u = Qu where u := Px (HB =∞)x∈A . This equationmay be iterated n - times to arrive at Eq. (5.11).

Page: 43 job: 285notes macro: svmonob.cls date/time: 3-Jun-2016/13:04

Page 52: Math 285 Stochastic Processes Spring 2016bdriver/285_S2016/Lecture Notes/285notes.pdfBruce K. Driver Math 285 Stochastic Processes Spring 2016 June 3, 2016 File:285notes.tex
Page 53: Math 285 Stochastic Processes Spring 2016bdriver/285_S2016/Lecture Notes/285notes.pdfBruce K. Driver Math 285 Stochastic Processes Spring 2016 June 3, 2016 File:285notes.tex

6

First Step Analysis

The next theorem (which is a special case of Theorem 7.5) is the basis ofthe first step analysis developed in this section.

Theorem 6.1 (First step analysis). Let F (X) = F (X0, X1, . . . ) be somefunction of the paths (X0, X1, . . . ) of our Markov chain, then for all x, y ∈ Swith p (x, y) > 0 we have

Ex [F (X0, X1, . . . ) |X1 = y] = Ey [F (x,X0, X1, . . . )] (6.1)

and

Ex [F (X0, X1, . . . )] =∑y∈S

p (x, y)Ey [F (x,X0, X1, . . . )] (6.2)

= Ep(x,·) [F (x,X0, X1, . . . )]

Proof. Equation (6.1) follows directly from Theorem 7.5 below,

Ex [F (X0, X1, . . . ) |X1 = y] = Ex [F (X0, X1, . . . ) |X0 = x,X1 = y]

= Ey [F (x,X0, X1, . . . )] .

Equation (6.2) now follows from Eq. (6.1), the law of total expectation(i.e. Ex [F (X0, X1, . . . )] = ExE [F (X0, X1, . . . ) |X1]), and the fact thatPx (X1 = y) = p (x, y) .

Alternative direct proof. For this proof we use the measure theoretic factthat it suffice to prove the theorem under the restriction that F (X0, X1, . . . ) isactually of the form,

F (X0, X1, . . . ) = F (X0, X1, . . . , Xn) for some n <∞.

Under this additional assumption,

Ex [F (X0, X1, . . . , Xn) 1X1=y]

=∑

x1,...,xn∈SF (x, x1, . . . , xn) 1x1=yp (x, x1) p (x1, x2) . . . p (xn−1, xn)

= p (x, y)∑

x2,...,xn∈SF (x, y, x2, . . . , xn) p (y, x2) . . . p (xn−1, xn)

= p (x, y)∑

x1,...,xn−1∈SF (x, y, x1, . . . , xn−1) p (y, x1) . . . p (xn−2, xn−1)

= Px (X1 = y)EyF (x,X0, . . . , Xn−1)

which upon dividing by Px (X1 = y) gives Eq. (6.1).Let us now suppose for until further notice that B is a non-empty proper

subset of S, A = S \B, and HB = HB (X) is the first hitting time of B by X.

Notation 6.2 Given a transition matrix P = (p (x, y))x,y∈S we letQ = PA×A = (p (x, y))x,y∈A and R = PA×B = (p (x, y))x∈A,y∈B sothat, schematically,

P =

A B[PA×A PA×B∗ ∗

]AB

=

A B[Q R∗ ∗

]AB.

Remark 6.3. To construct the matrix Q and R from P, let P′ be P with therows corresponding to B omitted. To form Q from P′, remove the columnsof P′ corresponding to B and to form R from P′, remove the columns of P′

corresponding to A.

Example 6.4. If S = 1, 2, 3, 4, 5, 6, 7 , A = 1, 2, 4, 5, 6 , B = 3, 7 , and

P =

r\c 1 2 3 4 5 6 71 0 1/2 0 1/2 0 0 02 1/3 0 1/3 0 1/3 0 03 0 1/2 0 0 0 1/2 04 1/3 0 0 0 1/3 0 1/35 0 1/3 0 1/3 0 1/3 06 0 0 1/2 0 1/2 0 07 0 0 0 0 0 0 1

,

then

P′ =

r\c 1 2 3 4 5 6 71 0 1/2 0 1/2 0 0 02 1/3 0 1/3 0 1/3 0 04 1/3 0 0 0 1/3 0 1/35 0 1/3 0 1/3 0 1/3 06 0 0 1/2 0 1/2 0 0

.

Deleting the 3 and 7 columns of P′ gives

Page 54: Math 285 Stochastic Processes Spring 2016bdriver/285_S2016/Lecture Notes/285notes.pdfBruce K. Driver Math 285 Stochastic Processes Spring 2016 June 3, 2016 File:285notes.tex

46 6 First Step Analysis

Q = PA×A =

1 2 4 5 60 1/2 1/2 0 0

1/3 0 0 1/3 01/3 0 0 1/3 00 1/3 1/3 0 1/30 0 0 1/2 0

12456

and deleting the 1, 2, 4, 5, and 6 columns of P′ gives

R = PA×B =

3 70 0

1/3 00 1/30 0

1/2 0

12456

.

Theorem 6.5 (Hitting distributions). Let h : B → R be a bounded or non-negative function and let u : S → R be defined by

u (x) := Ex [h (XHB ) : HB <∞] for x ∈ S.

Then u = h on B and u|A satisfies,

u (x) =∑y∈A

p (x, y)u (y) +∑y∈B

p (x, y)h (y) for all x ∈ A. (6.3)

In matrix notation this becomes (where u := u|A),

u = Qu+ Rh =⇒ u = (I −Q)−1

Rh,

i.e. for all x ∈ A we have1

Ex [h (XHB ) : HB <∞] =[(I −Q)

−1Rh]x

=[(I −PA×A)

−1PA×Bh

]x

.

(6.4)As a special case, if y ∈ B and h (s) = δy (s) , then Eq. (6.4) becomes,

Px (XHB = y : HB <∞) =[(I −Q)

−1R]x,y

=[(I −PA×A)

−1PA×Bδy

]x

(6.5)

1 It is possible that I − Q is not invertible. In this case it turns out that (at leastwhen h ≥ 0) that the correct interpretation of (I −Q)−1 is;

(I −Q)−1 =

∞∑n=0

Qn.

More generally if B0 ⊂ B and h (s) := 1B0(s) we learn that

Px (XHB ∈ B0 : HB <∞) =[(I −PA×A)

−1PA×B1B0

]x. (6.6)

Proof. Let

F (X0, X1, . . . ) = h(XHB(X)

)1HB(X)<∞,

then for x ∈ A we have F (x,X0, X1, . . . ) = F (X0, X1, . . . ) . Therefore by thefirst step analysis (Theorem 6.1) we learn

u (x) = Ex[h(XHB(X)

)1HB(X)<∞

]= ExF (X0, X1, . . . )

= ExF (x,X1, . . . ) =∑y∈S

p (x, y)EyF (x,X0, X1, . . . )

=∑y∈S

p (x, y)EyF (X0, X1, . . . )

=∑y∈S

p (x, y)Ey[h(XHB(X)

)1HB(X)<∞

]=∑y∈A

p (x, y)u (y) +∑y∈B

p (x, y)h (y) .

wherein the last equality we have used,

Ey[h(XHB(X)

)1HB(X)<∞

]=

u (y) if y ∈ Ah (y) if y ∈ B .

Theorem 6.6 (Travel averages). Given g : A → [0,∞] , let w (x) :=Ex[∑

n<HBg (Xn)

]. Then w (x) satisfies

w (x) =∑y∈A

p (x, y)w (y) + g (x) for all x ∈ A. (6.7)

In matrix notation this becomes,

w = Qw + g =⇒ w = (I −Q)−1

g

so that

Ex

[ ∑n<HB

g (Xn)

]=[(I −Q)

−1g]x

=[(I −PA×A)

−1g]x

The following two special cases are of most interest;

Page: 46 job: 285notes macro: svmonob.cls date/time: 3-Jun-2016/13:04

Page 55: Math 285 Stochastic Processes Spring 2016bdriver/285_S2016/Lecture Notes/285notes.pdfBruce K. Driver Math 285 Stochastic Processes Spring 2016 June 3, 2016 File:285notes.tex

6.1 Finite state space examples 47

1. Suppose g (x) = δy (x) for some y ∈ A, then∑n<HB

g (Xn) =∑n<HB

δy (Xn) is the number of visits of the chain to y and

Ex (# visits to y before hitting B)

= Ex

[ ∑n<HB

δy (Xn)

]= (I −Q)

−1x,y . (6.8)

2. Suppose that g (x) = 1, then∑n<HB

g (Xn) = HB and we may concludethat

Ex [HB ] =[(I −Q)

−11]x

(6.9)

where 1 is the column vector consisting of all ones.

Proof. Let F (X0, X1, . . . ) =∑n<HB(X0,X1,... )

g (Xn) be the sum of thevalues of g along the chain before its first exit from A, i.e. entrance into B.With this interpretation in mind, if x ∈ A, it is easy to see that

F (x,X0, X1, . . . ) =

g (x) if X0 ∈ B

g (x) + F (X0, X1, . . . ) if X0 ∈ A= g (x) + 1X0∈A · F (X0, X1, . . . ) .

Therefore by the first step analysis (Theorem 6.1) it follows that

w (x) = ExF (X0, X1, . . . ) =∑y∈S

p (x, y)EyF (x,X0, X1, . . . )

=∑y∈S

p (x, y)Ey [g (x) + 1X0∈A · F (X0, X1, . . . )]

= g (x) +∑y∈A

p (x, y)Ey [F (X0, X1, . . . )]

= g (x) +∑y∈A

p (x, y)w (y) .

Remark 6.7. We may combine Theorems 6.1 and 6.6 into one theorem as follows.Suppose that h : S → R is a given function and let

w (x) := Ex

[ ∑n<HB

h (Xn) + h (XHB )

],

then for w (x) = h (x) for x ∈ B and

w (x) =∑y∈A

p (x, y)w (y) + h (x) +∑y∈B

p (x, y)h (y) for x ∈ A.

In matrix format we have

w = PA×Aw + hA + PA×BhB

where hA = h (x)x∈A and hB = h (x)x∈B .

6.1 Finite state space examples

Example 6.8. Consider the Markov chain determined by

P =

1 2 3 40 1/3 1/3 1/3

3/4 1/8 1/8 00 0 1 00 0 0 1

1234

whose hitting diagram is given in Figure 6.1.Notice that 3 and 4 are absorbing

1 2

41/8

3/4

3

1/3

1/3

1/3

1/8

Fig. 6.1. For this chain the states 3 and 4 are absorbing.

states. Taking A = 1, 2 and B = 3, 4 , we find

1 2 3 4

P′ =

[0 1/3 1/3 1/3

3/4 1/8 1/8 0

]12,

1 2

Q =

[0 1/3

3/4 1/8

]12, and

3 4

R =

[1/3 1/31/8 0

]12.

Matrix manipulations now show,

Page: 47 job: 285notes macro: svmonob.cls date/time: 3-Jun-2016/13:04

Page 56: Math 285 Stochastic Processes Spring 2016bdriver/285_S2016/Lecture Notes/285notes.pdfBruce K. Driver Math 285 Stochastic Processes Spring 2016 June 3, 2016 File:285notes.tex

48 6 First Step Analysis

Ei (# visits to j before hitting 3, 4)

= (I −Q)−1

=

i\j12

1 2[75

815

65

85

]=

[1.4 0.533331.2 1.6

],

EiH3,4 = (I −Q)−1

[11

]=

i

12

[2915145

]=

[1.933 3

2.8

]and

Pi(XH3,4 = j

)= (I −Q)

−1R =

i\j12

3 4[815

715

35

25

]=

[0.534 0.4670.6 0.4

].

The output of one simulation from www.zweigmedia.com/RealWorld/

markov/markov.html is in Figure 6.2 below.

Fig. 6.2. In this run, rather than making sites 3 and 4 absorbing, we have madethem transition back to 1. I claim now to get an approximate value for P1 (Xn hits 3)we should compute: (State 3 Hits)/(State 3 Hits + State 4 Hits). In this example wewill get 171/(171 + 154) = 0.526 15 which is a little lower than the predicted value of0.533 . You can try your own runs of this simulator.

Remark 6.9. As an aside in the above simulation we really used the matrix,

P =

1 2 3 40 1/3 1/3 1/3

3/4 1/8 1/8 01 0 0 01 0 0 0

1234

for which has invariant distribution,

π =1

21 + 8 + 8 + 7

[21 8 8 7

]=[

2144

211

211

744

]=[

0.477 0.182 0.182 0.159].

Notice that2

11/

(2

11+

7

44

)=

8

15!

Lemma 6.10. Suppose that B is a subset of S and x ∈ A := S \B. Then

Px (HB <∞) =[(I −PA×A)

−1PA×B1

]x

where

(I −PA×A)−1

:=

∞∑n=0

PnA×A.

[See the optional Section 6.6 below for more analysis of this type.]

Proof. We work this out by first principles,

Px (HB <∞) =

∞∑n=1

Px (HB = n) =

∞∑n=1

Px (X1 ∈ A, . . . ,Xn−1 ∈ A,Xn ∈ B)

=

∞∑n=1

∑x1,...,xn−1∈A

∑y∈B

p (x, x1) p (x1, x2) . . . p (xn−2, xn−1) p (xn−1, y)

=

[ ∞∑n=1

Pn−1A×APA×B1

]x

=

[ ∞∑n=0

PnA×APA×B1

]x

.

Example 6.11. Let us continue the rat in the maze Exercise 5.1 and now supposethat room 3 contains food while room 7 contains a mouse trap. 1 2 3 (food)

4 5 67 (trap)

.Page: 48 job: 285notes macro: svmonob.cls date/time: 3-Jun-2016/13:04

Page 57: Math 285 Stochastic Processes Spring 2016bdriver/285_S2016/Lecture Notes/285notes.pdfBruce K. Driver Math 285 Stochastic Processes Spring 2016 June 3, 2016 File:285notes.tex

6.1 Finite state space examples 49

Recall that the transition matrix for this chain with sites 3 and 7 absorbing isgiven by,

P =

1 2 3 4 5 6 7

0 1/2 0 1/2 0 0 01/3 0 1/3 0 1/3 0 00 0 1 0 0 0 0

1/3 0 0 0 1/3 0 1/30 1/3 0 1/3 0 1/3 00 0 1/2 0 1/2 0 00 0 0 0 0 0 1

1234567

,

see Figure 6.3 for the corresponding jump diagram for this chain.

1

1/2

1/2++

2

1/3,,

1/3

1/3

kk3

food

4

1/3

SS

1/3

1/3++

51/3

kk

1/3++

1/3

SS

6

1/2

RR

1/2

kk

7trap

Fig. 6.3. The jump diagram for our proverbial rat in the maze. Here we assume therat is “absorbed” at sites 3 and 7

We would like to compute the probability that the rat reaches the food beforehe is trapped. To answer this question we let A = 1, 2, 4, 5, 6 , B = 3, 7 ,and T := HB be the first hitting time of B. Then deleting the 3 and 7 rows ofP leaves the matrix,

P′ =

1 2 3 4 5 6 70 1/2 0 1/2 0 0 0

1/3 0 1/3 0 1/3 0 01/3 0 0 0 1/3 0 1/30 1/3 0 1/3 0 1/3 00 0 1/2 0 1/2 0 0

12456

.

Deleting the 3 and 7 columns of P′ gives

Q = PA×A =

1 2 4 5 60 1/2 1/2 0 0

1/3 0 0 1/3 01/3 0 0 1/3 00 1/3 1/3 0 1/30 0 0 1/2 0

12456

and deleting the 1, 2, 4, 5, and 6 columns of P′ gives

R = PA×B =

3 70 0

1/3 00 1/30 0

1/2 0

12456

.

Therefore,

I −Q =

1 − 1

2 −12 0 0

− 13 1 0 − 1

3 0− 1

3 0 1 − 13 0

0 − 13 −

13 1 − 1

30 0 0 − 1

2 1

,and using a computer algebra package we find

Ei [# visits to j before hitting 3, 7] = (I −Q)−1

=

1 2 4 5 6 j116

54

54 1 1

356

74

34 1 1

356

34

74 1 1

323 1 1 2 2

313

12

12 1 4

3

i

12456

.

In particular we may conclude,E1TE2TE4TE5TE6T

= (I −Q)−1

1 =

173143143163113

,and

Page: 49 job: 285notes macro: svmonob.cls date/time: 3-Jun-2016/13:04

Page 58: Math 285 Stochastic Processes Spring 2016bdriver/285_S2016/Lecture Notes/285notes.pdfBruce K. Driver Math 285 Stochastic Processes Spring 2016 June 3, 2016 File:285notes.tex

50 6 First Step Analysis

P1 (XT = 3) P1 (XT = 7)P2 (XT = 3) P2 (XT = 3)P4 (XT = 3) P4 (XT = 3)P5 (XT = 3) P5 (XT = 3)P6 (XT = 3) P6 (XT = 7)

= (I −Q)−1

R =

3 7712

512

34

14

512

712

23

13

56

16

12456

.

Since the event of hitting 3 before 7 is the same as the event XT = 3 , thedesired hitting probabilities are

P1 (XT = 3)P2 (XT = 3)P4 (XT = 3)P5 (XT = 3)P6 (XT = 3)

=

712345122356

.Example 6.12 (A modified rat maze). Here is the modified maze, 1 2 3(food)

4 56(trap)

.We now let T = H3,6 be the first time to absorption – we assume that 3 and6 made are absorbing states.2 The transition matrix is given by

P =

1 2 3 4 5 60 1/2 0 1/2 0 0

1/3 0 1/3 0 1/3 00 0 1 0 0 0

1/3 0 0 0 1/3 1/30 1/2 0 1/2 0 00 0 0 0 0 1

123456

.

The corresponding Q and R matrices in this case are;

Q =

1 2 4 50 1/2 1/2 0

1/3 0 0 1/31/3 0 0 1/30 1/2 1/2 0

1245

, and R =

3 60 0

1/3 00 1/30 0

1245

.

2 It is not necessary to make states 3 and 6 absorbing. In fact it does matter at allwhat the transition probabilities are for the chain for leaving either of the states3 or 6 since we are going to stop when we hit these states. This is reflected in thefact that the first thing we will do in the first step analysis is to delete rows 3 and6 from P. Making 3 and 6 absorbing simply saves a little ink.

After some matrix manipulation we then learn,

Ei [# visits to j] = (I4 −Q)−1

=

1 2 4 52 3

232 1

1 2 1 11 1 2 11 3

232 2

1245

,

Pi [XT = j] = (I4 −Q)−1

R =

3 612

12

23

13

13

23

12

12

1245

,

Ei [T ] = (I4 −Q)−1

1111

=

6556

1245

.

So for example, P4(XT = 3(food)) = 1/3, E4(Number of visits to 1) = 1,E5(Number of visits to 2) = 3/2 and E1T = E5T = 6 and E2T = E4T = 5.

Exercise 6.1 (Invariant distributions and expected return times). Sup-pose that Xn∞n=0 is a Markov chain on a finite state space S determined byone step transition probabilities, p (x, y) = P (Xn+1 = y|Xn = x) . For x ∈ S, letRx := inf n > 0 : Xn = x be the first passage time3 to x. [We will assumehere that EyRx <∞ for all x, y ∈ S.] Use the first step analysis to show,

ExRy =∑z:z 6=y

p (x, z)EzRy + 1. (6.10)

Now further assume that π : S → [0, 1] is an invariant distributions for p, thatis∑x∈S π (x) = 1 and

∑x∈S π (x) p (x, y) = π (y) for all y ∈ S, i.e. πP = π. By

multiplying Eq. (6.10) by π (x) and summing on x ∈ S, show,

π (y)EyRy = 1 for all y ∈ S =⇒ π (y) =1

EyRy> 0. (6.11)

Example 6.13 (Two heads in a row). On average, how many times do we needto toss a coin to get two consecutive heads? To answer this question, we need toconstruct an appropriate Markov chain. There are different ways to approachthis; probably the simplest is to let

Xn = 2 ∧(# of heads tossed in a row after the nth -toss

).

3 Rx is the first return time to x when the chain starts at x.

Page: 50 job: 285notes macro: svmonob.cls date/time: 3-Jun-2016/13:04

Page 59: Math 285 Stochastic Processes Spring 2016bdriver/285_S2016/Lecture Notes/285notes.pdfBruce K. Driver Math 285 Stochastic Processes Spring 2016 June 3, 2016 File:285notes.tex

6.1 Finite state space examples 51

Here is an example of how Xn is computed;

Time 0 1 2 3 4 5 6 7 8 9 10 11 . . .Tosses H T H T T H H T H H H . . .

X 0 1 0 1 1 0 1 2 0 1 2 2 . . .

This means Xn is a Markov chain with state space 0, 1, 2 and transitionmatrix

P =

0 1 2 12

12 0

12 0 1

212 0 1

2

012.

We then take B = 2 and we wish to compute

E0HB =[(I −Q)

−11]

0

where

Q =

0 1[12

12

12 0

]01.

The linear algebra gives,

(I −Q)−1

=

0 1[4 22 2

]01

and (I −Q)−1

1 =

[4 22 2

] [11

]=

[64

]01

and hence E0HB = 6.

Remark 6.14. It is unimportant how we make assignments in the bottom rowof P in Example 6.13 since when B = 2 we will immediately delete this rowin the computations above. So we could have made 2 an absorbing state, i.e.

P =

0 1 2 12

12 0

12 0 1

20 0 1

012

Actually a more interesting choice is to send 2 back to 0 and restart the chain,i.e. take

P =

0 1 2 12

12 0

12 0 1

21 0 0

012.

This represents the chain where we restart the game once we have reached twoheads in a row. The invariant distribution for P is

π =1

7

[4 2 1

]=[

47

27

17

].

Note that on average we expect 4 visits to 0 and 2 visits to 2 for every timethe game is played and hence the expected number of times to get two headsin a row is 4 + 2 = 6 in agreement with Example 6.13. (Alternatively we have1/7 = 1/E2R2 whereR2 is the return time to 2. Thus E2R2 = 7.NowR2 = 1+Habove and hence we see that EH = 7− 1 = 6 as we saw before.) Let us further

Fig. 6.4. Here is a numerical verification of the above statements. In this case thechain was run for N = 7000 and in that time there were 1007 returns to 3 (correspondsto site 2) for an expected return time of about 7.

observe that

P100 =

12

12 0

12 0 1

21 0 0

100

=

0.571 43 0.285 71 0.142 860.571 43 0.285 71 0.142 860.571 43 0.285 71 0.142 86

while

π =[0.571 43 0.285 71 0.142 86

].

Page: 51 job: 285notes macro: svmonob.cls date/time: 3-Jun-2016/13:04

Page 60: Math 285 Stochastic Processes Spring 2016bdriver/285_S2016/Lecture Notes/285notes.pdfBruce K. Driver Math 285 Stochastic Processes Spring 2016 June 3, 2016 File:285notes.tex

52 6 First Step Analysis

6.2 More first step analysis examples

Example 6.15 (Computing EH2B). Let HB = inf n ≥ 0 : Xn ∈ B . Here we

compute w (x) = ExH2B as follows,

w (x) =∑y∈S

p (x, y)Ex[H2B |X1 = y

]=∑y∈S

p (x, y)Ey[(HB + 1)

2]

=∑y∈S

p (x, y)Ey[H2B + 2HB + 1

]= 1 +

∑y∈A

p (x, y)Ey[H2B + 2HB

]= 1 +

∑y∈A

p (x, y)w (y) + 2∑y∈A

p (x, y)EyHB .

Thus in matrix notation with Q = PA×A this may be written as,

E(·)H2B = w = (I −Q)

−1[1 + 2Q (I −Q)

−11]

wherein we have used E(·)HB = (I −Q)−1

1. In other words,

E(·)H2B = w = (I −Q)

−1[I + 2Q (I −Q)

−1]

1

= (I −Q)−2

(I + Q) 1.

Example 6.16 (Generating function for HB). Let us now suppose that |z| < 1and let us set w (x) := Ex

[zHB

]and notice that w (y) = 1 for y ∈ B. Then by

the first step analysis we have,

w (x) :=∑y∈S

p (x, y)Ex[zHB |X1 = y

]=∑y∈S

p (x, y)Ey[zHB+1

]= z

∑y∈S

p (x, y)Ey[zHB

]= z

∑y∈S

p (x, y)w (y) .

So let w := w|A, we find,

w = z [Qw+p (·, B)] = z [Qw + 1−p (·, A)]

= z [Qw + 1−Q1] .

Thus solving this equation for w gives,

E(·)[zHB

]= w = z (I − zQ)

−1(I −Q) 1.

Differentiating this equation with respect to z at z = 1 shows

E(·) [HB ] = 1+d

dz|1 (I − zQ)

−1(I −Q) 1

= 1+ (I −Q)−1

Q (I −Q)−1

(I −Q) 1

=[I+ (I −Q)

−1Q]

1 = (I −Q)−1

1

which reproduces are old formula for E(·) [HB ] .

Example 6.17 (Generating function for∑n<HB

g (Xn)). Suppose that g : A→R is a given function, F :=

∑n<HB

g (Xn) , |z| ≤ 1, and let

w (x) := ExzF .

Then by the first step analysis,

w (x) =∑y∈S

p (x, y)EyzF (x,X0,X1,... )

where

EyzF (x,X0,X1,... ) = zg(x) if y ∈ B and

EyzF (x,X0,X1,... ) = zg(x)EyzF (X0,X1,... ) = zg(x)w (y) if y ∈ A.

Thus we learn that w satisfies,

w (x) = zg(x)

∑y∈A

Qxyw +∑y∈B

Pxy1

.Thus let D (z) = diag

(zg(x) : x ∈ A

), we have

w = D (z) Qw +D (z) R1 (6.12)

and hencew = (I −D (z) Q)

−1D (z) R1.

Sanity checks.

1. If z = 1 then w = 1 which solves Eq. (6.12) since Q1 + R1 = PA×S1 = 1since the row sums of P add to 1. Consequently we now know that

(I −Q) 1 = R1 =⇒ (I −Q)−1

R1 = 1.

2. Suppose that we differentiate this expression in z one time,

Page: 52 job: 285notes macro: svmonob.cls date/time: 3-Jun-2016/13:04

Page 61: Math 285 Stochastic Processes Spring 2016bdriver/285_S2016/Lecture Notes/285notes.pdfBruce K. Driver Math 285 Stochastic Processes Spring 2016 June 3, 2016 File:285notes.tex

6.2 More first step analysis examples 53

d

dz|z=1w = (I −Q)

−1D′ (1) Q (I −Q)

−1R1+ (I −Q)

−1D′ (1) R1

= (I −Q)−1D′ (1) Q1+ (I −Q)

−1D′ (1) R1

= (I −Q)−1D′ (1) [Q1 + R1] = (I −Q)

−1D′ (1) 1

= (I −Q)−1

g

which is consistent with the fact that

d

dz|z=1w = E(·)

[ ∑n<HB

g (Xn)

]and our previous computations.

Exercise 6.2 (III.4.P11 on p.132). An urn contains two red and two greenballs. The balls are chosen at random, one by one, and removed from the urn.The selection process continues until all of the green balls have been removedfrom the urn. What is the probability that a single red ball is in the urn at thetime that the last green ball is chosen?

Theorem 6.18. Let h : B → [0,∞] and g : A→ [0,∞] be given and for x ∈ S.If we let 4

w (x) := Ex

[h (XHB ) ·

∑n<HB

g (Xn) : HB <∞

]and

gh (x) = g (x)Ex [h (XHB ) : HB <∞] ,

then

w (x) = Ex

[ ∑n<HB

gh (Xn) : HB <∞

]. (6.13)

Remark 6.19. Recall that we can find uh (x) := Ex [h (XHB ) : HB <∞] usingTheorem 6.5 and then we can solve for w (x) using Theorem 6.6 with g replacedby gh (x) = g (x)uh (x) . So in the matrix language we solve for w (x) as follows;

uh := (I −Q)−1

Rh,

gh : = g ∗ uh, and

w = (I −Q)−1

gh,

where [a ∗ b]x := ax · bx – the entry by entry product of column vectors.

4 Recall from Theorem 6.5 that uh = (I −Q)−1 Rh, i.e. u = h on B and u satisfies

u (x) =∑y∈A

p (x, y)u (y) +∑y∈B

p (x, y)h (y) for all x ∈ A.

Proof. First proof. LetH (X) := h (XHB ) 1HB<∞, then using 1n<HB(X) =1X0∈A,...,Xn∈A and

H (X0, . . . , Xn−1, Xn, . . . ) = H (Xn, Xn+1, . . . ) when X0, . . . , Xn ∈ A

along with the strong Markov property in Theorem 7.11 shows;

w (x) =

∞∑n=0

Ex [H (X) · 1n<HBg (Xn)]

=

∞∑n=0

Ex [H (X) · 1X0∈A,...,Xn∈Ag (Xn)]

=

∞∑n=0

Ex[E(Y )Xn

[H (X0, . . . , Xn−1, Y )] · 1X0∈A,...,Xn∈Ag (Xn)]

=

∞∑n=0

Ex[E(Y )Xn

H (Y ) · 1X0∈A,...,Xn∈Ag (Xn)]

=

∞∑n=0

Ex [uh (Xn) · 1X0∈A,...,Xn∈Ag (Xn)]

=

∞∑n=0

Ex [uh (Xn) · g (Xn) 1n<HB ]

= Ex

[ ∑n<HB

uh (Xn) · g (Xn)

]= Ex

[ ∑n<HB

gh (Xn)

].

Second proof. Let G (X) :=∑n<HB

g (Xn) and observe that

H (x, Y ) =

H (Y ) if x ∈ Ah (x) if x ∈ B and

G (x, Y ) = g (x) +G (Y )

and so by the first step analysis we find,

w (x) = Ex [H (X)G (X)] = Ep(x,·) [H (x, Y )G (x, Y )]

= Ep(x,·) [H (x, Y ) (g (x) +G (Y ))]

= g (x)Ep(x,·) [H (x, Y )] + Ep(x,·) [H (x, Y )G (Y )] .

The first step analysis also shows (see the proof of Theorem 6.5)

uh (x) := Ex [h (XHB ) 1HB<∞] = Ex [H (X)] = Ep(x,·) [H (x, Y )] .

and therefore,

Page: 53 job: 285notes macro: svmonob.cls date/time: 3-Jun-2016/13:04

Page 62: Math 285 Stochastic Processes Spring 2016bdriver/285_S2016/Lecture Notes/285notes.pdfBruce K. Driver Math 285 Stochastic Processes Spring 2016 June 3, 2016 File:285notes.tex

54 6 First Step Analysis

w (x) = g (x)uh (x) + Ep(x,·) [H (x, Y )G (Y )]

Since G (Y ) = 0 if Y0 ∈ B and H (x, Y ) = H (Y ) if Y0 ∈ A we find,

Ep(x,·) [H (x, Y )G (Y )] =∑x∈S

p (x, y)Ey [H (x, Y )G (Y )]

=∑x∈A

p (x, y)Ey [H (x, Y )G (Y )]

=∑x∈A

p (x, y)Ey [H (Y )G (Y )]

=∑x∈A

p (x, y)w (y)

and hence

w (x) = g (x)uh (x) +∑x∈A

p (x, y)w (y) = gh (x) +∑x∈A

p (x, y)w (y) .

But Theorem 6.6 with g replaced by gh then shows w is given by Eq. (6.13).

Example 6.20 (A possible carnival game). Suppose that B is the disjoint unionof L and W and suppose that you win

∑n<HB

g (Xn) if you end in W and winnothing when you end in L. What is the least we can expect to have to pay toplay this game and where in A := S \B should we choose to start the game. Toanswer these questions we should compute our expected winnings (w (x)) foreach starting point x ∈ A;

w (x) = Ex

[1W (XHB )

∑n<HB

g (Xn)

].

Once we find w we should expect to pay at least C := maxx∈A w (x) and weshould start at a location x0 ∈ A where w (x0) = maxx∈A w (x) = C. As anapplication of Theorem 6.18 we know that

w (x) =[(I−Q)

−1gh

]x

where5

gh (x) = g (x)Ex [1W (XHB )] = g (x)Px (XHB ∈W ) .

Let us now specialize these results to the chain in Example 6.8 where

5 Intuitively, the effective pay off for a visit to site x is g (x) · Px ( we win) + 0 ·Px (we loose) .

P =

1 2 3 40 1/3 1/3 1/3

3/4 1/8 1/8 00 0 1 00 0 0 1

1234

Let us make 4 the winning state and 3 the losing state (i.e. h (3) = 0 andh (4) = 1) and let g = (g (1) , g (2)) be the payoff function. We have alreadyseen that [

uh (1)uh (2)

]=

[P1 (XHB = 4)P2 (XHB = 4)

]=

[71525

]so that g ∗ uh =

[715g125g2

]and therefore

[w (1)w (2)

]= (I −Q)

−1

[715g125g2

]=

[75

815

65

85

] [715g125g2

]=

[4975g1 + 16

75g21425g1 + 16

25g2

].

Let us examine a few different choices for g.

1. When g (1) = 32 and g (2) = 7, we have[w (1)w (2)

]=

[75

815

65

85

] [71532257

]=

[1125

1125

]=

[22.422.4

]and so it does not matter where we start and we are going to have to payat least $22.40 to play.

2. When g (1) = 10 = g (2) , then[w (1)w (2)

]=

[75

815

65

85

] [715102510

]=

[263

12

]=

[8. 666 7

12.0

]and we should enter the game at site 2. We are going to have to pay at least$12 to play.

3. If g (1) = 20 and g (2) = 7,[w (1)w (2)

]=

[75

815

65

85

] [71520257

]=

[3642539225

]=

[14.5615.68

]and again we should enter the game at site 2. We are going to have to payat least $15.68 to play.

Page: 54 job: 285notes macro: svmonob.cls date/time: 3-Jun-2016/13:04

Page 63: Math 285 Stochastic Processes Spring 2016bdriver/285_S2016/Lecture Notes/285notes.pdfBruce K. Driver Math 285 Stochastic Processes Spring 2016 June 3, 2016 File:285notes.tex

6.3 Random Walk Exercises 55

6.3 Random Walk Exercises

Exercise 6.3 (Uniqueness of solutions to 2nd order recurrence rela-tions). Let a, b, c be real numbers with a 6= 0 6= c, α, β ∈ Z∪±∞ withα < β, and g : Z∩ (α, β) → R be a given function. Show that there is exactlyone function u : [α, β] ∩ Z→ R with prescribed values on two two consecutivepoints in [α, β] ∩ Z which satisfies the second order recurrence relation:

au (x+ 1) + bu (x) + cu (x− 1) = f (x) for all x ∈ Z∩ (α, β) . (6.14)

are for α < x < β. Show; if u and w both satisfy Eq. (6.14) and u = w on twoconsecutive points in (α, β) ∩ Z, then u (x) = w (x) for all x ∈ [α, β] ∩ Z.

Exercise 6.4 (General homogeneous solutions). Let a, b, c be real num-bers with a 6= 0 6= c, α, β ∈ Z∪±∞ with α < β, and supposeu (x) : x ∈ [α, β] ∩ Z solves the second order homogeneous recurrence relation

au (x+ 1) + bu (x) + cu (x− 1) = 0 for all x ∈ Z∩ (α, β) , (6.15)

i.e. Eq. (6.14) with f (x) ≡ 0. Show:

1. for any λ ∈ C,aλx+1 + bλx + cλx−1 = λx−1p (λ) (6.16)

where p (λ) = aλ2 + bλ + c is the characteristic polynomial associatedto Eq. (6.14).

Let λ± = −b±√b2−4ac

2a be the roots of p (λ) and suppose for the momentthat b2− 4ac 6= 0. From Eq. (6.14) it follows that for any choice of A± ∈ R,the function,

w (x) := A+λx+ +A−λ

x−, (6.17)

solves Eq. (6.14) for all x ∈ Z.2. Show there is a unique choice of constants, A± ∈ R, such that the functionu (x) is given by

u (x) := A+λx+ +A−λ

x− for all α ≤ x ≤ β.

3. Now suppose that b2 = 4ac and λ0 := −b/ (2a) is the double root of p (λ) .Show for any choice of A0 and A1 in R that

w (x) := (A0 +A1x)λx0 (6.18)

solves Eq. (6.14) for all x ∈ Z. Hint: Differentiate Eq. (6.16) with respectto λ and then set λ = λ0.

4. Again show that any function u solving Eq. (6.14) is of the form u (x) =(A0 +A1x)λx0 for α ≤ x ≤ β for some unique choice of constants A0, A1 ∈R.

In the next group of exercises you are going to use first step analysis to showthat a simple unbiased random walk on Z is null recurrent. We let Xn∞n=0 bethe Markov chain with values in Z with transition probabilities given by

P (Xn+1 = x± 1|Xn = x) = 1/2 for all n ∈ N0 and x ∈ Z.

Further let a, b ∈ Z with a < 0 < b and

Ha,b := min n : Xn ∈ a, b and Hb := inf n : Xn = b .

We know by Corollary 5.25 that E0 [Ha,b] < ∞ from which it follows thatP (Ha,b <∞) = 1 for all a < 0 < b. For these reasons we will ignore the eventHa,b =∞ in what follows below.

Exercise 6.5. Let w (x) := Px(XHa,b = b

):= P

(XHa,b = b|X0 = x

).

1. Use first step analysis to show for a < x < b that

w (x) =1

2(w (x+ 1) + w (x− 1)) (6.19)

provided we define w (a) = 0 and w (b) = 1.2. Use the results of Exercises 6.3 and 6.4 to show

Px(XHa,b = b

)= w (x) =

1

b− a(x− a) . (6.20)

3. Let

Hb :=

min n : Xn = b if Xn hits b

∞ otherwise

be the first time Xn hits b. Explain why,XHa,b = b

⊂ Hb <∞ and

use this along with Eq. (6.20) to conclude that Px (Hb <∞) = 1 for allx < b. (By symmetry this result holds true for all x ∈ Z.)

Exercise 6.6. The goal of this exercise is to give a second proof of the fact thatPx (Hb <∞) = 1. Here is the outline:

1. Let w (x) := Px (Hb <∞) . Again use first step analysis to show that w (x)satisfies Eq. (6.19) for all x with w (b) = 1.

2. Use Exercises 6.3 and 6.4 to show that there is a constant, c, such that

w (x) = c · (x− b) + 1 for all x ∈ Z.

3. Explain why c must be zero to again show that Px (Hb <∞) = 1 for allx ∈ Z.

Exercise 6.7. Let H = Ha,b and u (x) := ExH := E [H|X0 = x] .

Page: 55 job: 285notes macro: svmonob.cls date/time: 3-Jun-2016/13:04

Page 64: Math 285 Stochastic Processes Spring 2016bdriver/285_S2016/Lecture Notes/285notes.pdfBruce K. Driver Math 285 Stochastic Processes Spring 2016 June 3, 2016 File:285notes.tex

56 6 First Step Analysis

1. Use first step analysis to show for a < x < b that

u (x) =1

2(u (x+ 1) + u (x− 1)) + 1 (6.21)

with the convention that u (a) = 0 = u (b) .2. Show that

u (x) = A0 +A1x− x2 (6.22)

solves Eq. (6.21) for any choice of constants A0 and A1.3. Choose A0 and A1 so that u (x) satisfies the boundary conditions, u (a) =

0 = u (b) . Use this to conclude that

ExHa,b = −ab+ (b+ a)x− x2 = −a (b− x) + bx− x2. (6.23)

Remark 6.21. Notice that Ha,b ↑ Hb = inf n : Xn = b as a ↓ −∞, and sopassing to the limit as a ↓ −∞ in Eq. (6.23) shows

ExHb =∞ for all x < b.

Combining the last couple of exercises together shows that Xn is “null -recurrent.”

Exercise 6.8. Let H = Hb. The goal of this exercise is to give a second proofof the fact and u (x) := ExH = ∞ for all x 6= b. Here is the outline. Letu (x) := ExH ∈ [0,∞] = [0,∞) ∪ ∞ .

1. Note that u (b) = 0 and, by a first step analysis, that u (x) satisfies Eq.(6.21) for all x 6= b – allowing for the possibility that some of the u (x) maybe infinite.

2. Argue, using Eq. (6.21), that if u (x) < ∞ for some x < b then u (y) < ∞for all y < b. Similarly, if u (x) < ∞ for some x > b then u (y) < ∞ for ally > b.

3. If u (x) < ∞ for all x > b then u (x) must be of the form in Eq. (6.22)for some A0 and A1 in R such that u (b) = 0. However, this would imply,u (x) = ExH → −∞ as x → ∞ which is impossible since ExH ≥ 0 for allx. Thus we must conclude that ExH = u (x) =∞ for all x > b. (A similarargument works if we assume that u (x) <∞ for all x < b.)

1. The first step analysis shows u (x) satisfies Eq. (6.21) for all x 6= b even ifu (x) =∞ for some x. Note that u (b) = EbHb = 0.

2. If u (x) <∞ for some x < b, then Eq. (6.21) can only hold if u (x+ 1) <∞and u (x− 1) <∞. In particular we may continue this line of reasoning tosee that u (x− 2) <∞, and then u (x− 3) <∞, . . . , i.e. u (y) <∞ for ally < b. A similar argument shows that if u (x) < ∞ for some x > b thenu (y) <∞ for all y > b.

3. If u (x) < ∞ for all x > b then u (x) must be of the form in Eq. (6.22) forsome A0 and A1 in R such that u (b) = 0. However, this would imply,

ExH = u (x) = A0 +A1x− x2 → −∞ as x→∞

which is impossible since ExH ≥ 0 for all x. Thus we must conclude thatin fact u (x) = ∞ for all x > b. A similar argument shows that u (x) = ∞for all x < b as well.

For the remaining exercises in this section we will assume that p ∈ (1/2, 1)and q = 1− p so that p/q > 1.

Exercise 6.9 (Biased random walks I). Let p ∈ (1/2, 1) and consider thebiased random walk Xnn≥0 on the S = Z where Xn = ξ0 + ξ1 + · · · + ξn,

ξi∞i=1 are i.i.d. with P (ξi = 1) = p ∈ (0, 1) and P (ξi = −1) = q := 1 − p,and ξ0 = x for some x ∈ Z. Let H = H0 be the first hitting time of 0 andu (x) := Px (H <∞) .

a) Use the first step analysis to show

u (x) = pu (x+ 1) + qu (x− 1) for x 6= 0 and u (0) = 1. (6.24)

b) Use Eq. (6.24) along with Exercises 6.3 and 6.4 to show for some a± ∈ Rthat

u (x) = (1− a+) + a+ (q/p)x

for x ≥ 0 and (6.25)

u (x) = (1− a−) + a− (q/p)x

for x ≤ 0. (6.26)

c) By considering the limit as x→ −∞ conclude that a− = 0 and u (x) = 1 forall x < 0, i.e. Px (H0 <∞) = 1 for all x ≤ 0.

Exercise 6.10 (Biased random walks II). The goal of this exercise is toevaluate Px (H0 <∞) for x ≥ 0. To do this let Bn := 0, n and Hn := H0,n.Let h (x) := Px (XHn = 0) where XHn = 0 is the event of hitting 0 before n.

a) Use the first step analysis to show

h (x) = ph (x+ 1) + qh (x− 1) with h (0) = 1 and h (n) = 0.

b) Show the unique solution to this equation is given by

Px (XHn = 0) = h (x) =(q/p)

x − (q/p)n

1− (q/p)n .

c) Argue that

Px (H <∞) = limn→∞

Px (XHn = 0) = (q/p)x< 1 for all x ≥ 0.

Page: 56 job: 285notes macro: svmonob.cls date/time: 3-Jun-2016/13:04

Page 65: Math 285 Stochastic Processes Spring 2016bdriver/285_S2016/Lecture Notes/285notes.pdfBruce K. Driver Math 285 Stochastic Processes Spring 2016 June 3, 2016 File:285notes.tex

6.4 Wald’s Equation and Gambler’s Ruin 57

The following formula summarizes Exercises 6.9 and 6.10; for 12 < p < 1,

Px (H <∞) =

(q/p)

xif x ≥ 0

1 if x < 0. (6.27)

Example 6.22 (Biased random walks III). Continue the notation in Exercise 6.9.Let us start to compute ExH. Since Px (H =∞) > 0 for x > 0 we already knowthat ExH = ∞ for all x > 0. Nevertheless we will deduce this fact again here.Letting u (x) = ExH it follows by the first step analysis that, for x 6= 0,

u (x) = p [1 + u (x+ 1)] + q [1 + u (x− 1)]

= pu (x+ 1) + qu (x− 1) + 1 (6.28)

with u (0) = 0. Notice u (x) =∞ is a solution to this equation while if u (n) <∞for some n 6= 0 then Eq. (6.28) implies that u (x) < ∞ for all x 6= 0 with thesame sign as n. A particular solution to this equation may be found by tryingu (x) = αx to learn,

αx = pα (x+ 1) + qα (x− 1) + 1 = αx+ α (p− q) + 1

which is valid for all x provided α = (q − p)−1. The general finite solution to

Eq. (6.28) is therefore,

u (x) = (q − p)−1x+ a+ b (q/p)

x. (6.29)

Using the boundary condition, u (0) = 0 allows us to conclude that a + b = 0and therefore,

u (x) = (q − p)−1x+ a [1− (q/p)

x] . (6.30)

Notice that u (x)→ −∞ as x→ +∞ no matter how a is chosen and thereforewe must conclude that the desired solution to Eq. (6.28) is u (x) =∞ for x > 0as we already mentioned. In the next exercise you will compute ExH for x < 0.

Exercise 6.11 (Biased random walks IV). Continue the notation in Ex-ample 6.22. Using the outline below, show

ExH =|x|p− q

for x ≤ 0. (6.31)

In the following outline n is a negative integer, Hn is the first hitting time of nso that Hn,0 = Hn ∧H = min H,Hn is the first hitting time of n, 0 . By

Corollary 5.25 we know that u (x) := Ex[Hn,0

]< ∞ for all n ≤ x ≤ 0 and

by a first step analysis one sees that u (x) still satisfies Eq. (6.28) for n < x < 0and has boundary conditions u (n) = 0 = u (0) .

a) From Eq. (6.30) we know that, for some a ∈ R,

Ex[Hn,0

]= u (x) = (q − p)−1

x+ a [1− (q/p)x] .

Use u (n) = 0 in order to show

a = an =n

(1− (q/p)n) (p− q)

and therefore,

Ex[Hn,0

]=

1

p− q

[|x|+ n

1− (q/p)x

1− (q/p)n

]for n ≤ x ≤ 0.

b) Argue that ExH = limn→−∞ Ex [Hn ∧H] and use this and part a) to proveEq. (6.31).

Remark 6.23 (More on the boundary conditions). If we were to use Theorem 6.6directly to derive Eq. (6.28) in the case that u (x) := Ex

[Hn,0

]< ∞ we for

all 0 ≤ x ≤ n. we would find, for x 6= 0, that

u (x) =∑

y/∈n,0

q (x, y)u (y) + 1

which implies that u (x) satisfies Eq. (6.28) for n < x < 0 provided u (n) andu (0) are taken to be equal to zero. Let us again choose a and b

w (x) := (q − p)−1x+ a+ b (q/p)

x

satisfies w (0) = 0 and w (−1) = u (−1) . Then both w and u satisfy Eq. (6.28)for n < x ≤ 0 and agree at 0 and −1 and therefore are equal for n ≤ x ≤ 0and in particular 0 = u (n) = w (n) . Thus correct boundary conditions on w inorder for w = u are w (0) = w (n) = 0 as we have used above.

6.4 Wald’s Equation and Gambler’s Ruin

Example 6.24. Here are some example of random times some of which are stop-ping times and some of which are not. In these examples we will always use theconvention that the minimum of the empty set is +∞.

1. The random time, τ = min k : |Xk| ≥ 5 (the first time, k, such that |Xk| ≥5) is a stopping time since

τ = k = |X1| < 5, . . . , |Xk−1| < 5, |Xk| ≥ 5.

Page: 57 job: 285notes macro: svmonob.cls date/time: 3-Jun-2016/13:04

Page 66: Math 285 Stochastic Processes Spring 2016bdriver/285_S2016/Lecture Notes/285notes.pdfBruce K. Driver Math 285 Stochastic Processes Spring 2016 June 3, 2016 File:285notes.tex

58 6 First Step Analysis

2. Let Wk := X1 + · · ·+Xk, then the random time,

τ = mink : Wk ≥ π

is a stopping time since,

τ = k =

Wj = X1 + · · ·+Xj < π for j = 1, 2, . . . , k − 1,

& X1 + · · ·+Xk−1 +Xk ≥ π

.

3. For t ≥ 0, let N(t) = #k : Wk ≤ t. Then

N(t) = k = X1 + · · ·+Xk ≤ t, X1 + · · ·+Xk+1 > t

which shows that N (t) is not a stopping time. On the other hand, since

N(t) + 1 = k = N(t) = k − 1= X1 + · · ·+Xk−1 ≤ t, X1 + · · ·+Xk > t,

we see that N(t) + 1 is a stopping time!4. If τ is a stopping time then so is τ + 1 because,

1τ+1=k = 1τ=k−1 = σk−1 (X0, . . . , Xk−1)

which is also a function of (X0, . . . , Xk) which happens not to depend onXk.

5. On the other hand, if τ is a stopping time it is not necessarily true thatτ − 1 is still a stopping time as seen in item 3. above.

6. One can also see that the last time, k, such that |Xk| ≥ π is typically nota stopping time. (Think about this.)

The following presentation of Wald’s equation is taken from Ross [17, p.59-60].

Theorem 6.25 (Wald’s Equation). Suppose that Xn∞n=1 is a sequence ofi.i.d. random variables, f (x) is a non-negative function of x ∈ R, and τ is astopping time. Then

E

[τ∑n=1

f (Xn)

]= Ef (X1) · Eτ. (6.32)

This identity also holds if f (Xn) are real valued but integrable and τ is a stop-ping time such that Eτ <∞. (See Resnick for more identities along these lines.)

Proof. If f (Xn) ≥ 0 for all n, then the the following computations need nojustification,

E

[τ∑n=1

f (Xn)

]= E

[ ∞∑n=1

f (Xn) 1n≤τ

]=

∞∑n=1

E [f (Xn) 1n≤τ ]

=

∞∑n=1

E [f (Xn)un (X1, . . . , Xn−1)]

=

∞∑n=1

E [f (Xn)] · E [un (X1, . . . , Xn−1)]

=

∞∑n=1

E [f (Xn)] · E [1n≤τ ] = Ef (X1)

∞∑n=1

E [1n≤τ ]

= Ef (X1) · E

[ ∞∑n=1

1n≤τ

]= Ef (X1) · Eτ.

If E |f (Xn)| <∞ and Eτ <∞, the above computation with f replaced by|f | shows all sums appearing above are equal E |f (X1)| · Eτ < ∞. Hence wemay remove the absolute values to again arrive at Eq. (6.32).

Example 6.26. Let Xn∞n=1 be i.i.d. such that P (Xn = 0) = P (Xn = 1) = 1/2and let

τ := min n : X1 + · · ·+Xn = 10 .

For example τ is the first time we have flipped 10 heads of a fair coin. By Wald’sequation (valid because Xn ≥ 0 for all n) we find

10 = E

[τ∑n=1

Xn

]= EX1 · Eτ =

1

2Eτ

and therefore Eτ = 20 <∞.

Example 6.27 (Gambler’s ruin). Let Xn∞n=1 be i.i.d. such that P (Xn = −1) =P (Xn = 1) = 1/2 and let

τ := min n : X1 + · · ·+Xn = 1 .

So τ may represent the first time that a gambler is ahead by 1. Notice thatEX1 = 0. If Eτ < ∞, then we would have τ < ∞ a.s. and by Wald’s equationwould give,

1 = E

[τ∑n=1

Xn

]= EX1 · Eτ = 0 · Eτ

which can not hold. Hence it must be that

Eτ = E [first time that a gambler is ahead by 1] =∞.

Page: 58 job: 285notes macro: svmonob.cls date/time: 3-Jun-2016/13:04

Page 67: Math 285 Stochastic Processes Spring 2016bdriver/285_S2016/Lecture Notes/285notes.pdfBruce K. Driver Math 285 Stochastic Processes Spring 2016 June 3, 2016 File:285notes.tex

6.5 Some more worked examples 59

6.5 Some more worked examples

Example 6.28. Let S = 1, 2 and P =

[0 11 0

]with jump diagram in Figure 6.5.

In this case P2n = I while P2n+1 = P and therefore limn→∞Pn does not exist.

1

1))

21

ii

Fig. 6.5. A non-random chain.

On the other hand it is easy to see that the invariant distribution, π, for P isπ =

[1/2 1/2

]and, moreover,

P + P2 + · · ·+ PN

N→ 1

2

[1 11 1

]=

[ππ

].

Let us compute [E1R1

E2R1

]=

([1 00 1

]−[

0 10 0

])−1 [11

]=

[21

]and [

E1R2

E2R2

]=

([1 00 1

]−[

0 01 0

])−1 [11

]=

[12

]so that indeed, π1 = 1/E1R1 and π2 = 1/E2R2. Of course R1 = 2 (P1 -a.s.)and R2 = 2 (P2 -a.s.) so that it is obvious that E1R1 = E2R2 = 2.

Example 6.29. Again let S = 1, 2 and P =

[10

01

]with jump diagram in

Figure 6.6. In this case the chain is not irreducible and every π = [a b] with

1166 2 1

hh

Fig. 6.6. A simple non-irreducible chain.

a+ b = 1 and a, b ≥ 0 is an invariant distribution.

Example 6.30. Suppose that S = 1, 2, 3 , and

P =

1 2 30 1 0

1/2 0 1/21 0 0

123

has the jump graph given by 6.7. Notice that P211 > 0 and P3

11 > 0 that P is

1

1,,

2

12yy

12

ll

3

1

YY

Fig. 6.7. A simple 3 state jump diagram.

“aperiodic.” We now find the invariant distribution,

Nul (P− I)tr

= Nul

−1 12 1

1 −1 00 1

2 −1

= R

221

.Therefore the invariant distribution is given by

π =1

5

[2 2 1

].

Let us now observe that

P2 =

12 0 1

212

12 0

0 1 0

P3 =

0 1 01/2 0 1/21 0 0

3

=

12

12 0

14

12

14

12 0 1

2

P20 =

4091024

205512

2051024

205512

4091024

2051024

205512

205512

51256

=

0.399 41 0.400 39 0.200 200.400 39 0.399 41 0.200 200.400 39 0.400 39 0.199 22

.Let us also compute E2R3 via,E1R3

E2R3

E3R3

=

1 0 00 1 00 0 1

− 0 1 0

1/2 0 01 0 0

−1 111

=

435

Page: 59 job: 285notes macro: svmonob.cls date/time: 3-Jun-2016/13:04

Page 68: Math 285 Stochastic Processes Spring 2016bdriver/285_S2016/Lecture Notes/285notes.pdfBruce K. Driver Math 285 Stochastic Processes Spring 2016 June 3, 2016 File:285notes.tex

60 6 First Step Analysis

so that1

E3R3=

1

5= π3.

Example 6.31. The transition matrix,

P =

1 2 31/4 1/2 1/41/2 0 1/21/3 1/3 1/3

123

is represented by the jump diagram in Figure 6.8. This chain is aperiodic. We

114

12

##

212

12oo

3

13

YY

13

EE

Fig. 6.8. In the above diagram there are jumps from 1 to 1 with probability 1/4 andjumps from 3 to 3 with probability 1/3 which are not explicitly shown but must beinferred by conservation of probability.

find the invariant distribution as,

Nul (P− I)tr

= Nul

1/4 1/2 1/41/2 0 1/21/3 1/3 1/3

−1 0 0

0 1 00 0 1

tr

= Nul

− 34

12

13

12 −1 1

314

12 −

23

= R

1561

= R

656

π =

1

17

[6 5 6

]=[

0.352 94 0.294 12 0.352 94].

In this case

P10 =

1/4 1/2 1/41/2 0 1/21/3 1/3 1/3

10

=

0.352 98 0.294 04 0.352 980.352 89 0.294 23 0.352 890.352 95 0.294 1 0.352 95

.

Let us also computeE1R2

E2R2

E3R2

=

1 0 00 1 00 0 1

−1/4 0 1/4

1/2 0 1/21/3 0 1/3

−1 111

=

115175135

so that

1/E2R2 = 5/17 = π2.

Example 6.32. Consider the following Markov matrix,

P =

1 2 3 4

1/4 1/4 1/4 1/41/4 0 0 3/41/2 1/2 0 00 1/4 3/4 0

1234

with jump diagram in Figure 6.9. Since this matrix is doubly stochastic (i.e.

114

14 //

14

234

14

4

14

EE

34

3

12

LL

12

RR

Fig. 6.9. The jump diagram for Q.

∑4i=1 Pij = 1 for all j as well as

∑4j=1 Pij = 1 for all i), it is easy to check that

π = 14

[1 1 1 1

]. Let us compute E3R3 as follows

Page: 60 job: 285notes macro: svmonob.cls date/time: 3-Jun-2016/13:04

Page 69: Math 285 Stochastic Processes Spring 2016bdriver/285_S2016/Lecture Notes/285notes.pdfBruce K. Driver Math 285 Stochastic Processes Spring 2016 June 3, 2016 File:285notes.tex

6.5 Some more worked examples 61E1R3

E2R3

E3R3

E4R3

=

1 0 0 00 1 0 00 0 1 00 0 0 1

1/4 1/4 0 1/41/4 0 0 3/41/2 1/2 0 00 1/4 0 0

−1

1111

=

5017521743017

so that E3R3 = 4 = 1/π4 as it should be. Similarly,

E1R2

E2R2

E3R2

E4R2

=

1 0 0 00 1 0 00 0 1 00 0 0 1

1/4 0 1/4 1/41/4 0 0 3/41/2 0 0 00 0 3/4 0

−1

1111

=

5417444175017

and again E2R2 = 4 = 1/π2.

Example 6.33. Consider the following example,

P =

1 2 31/2 1/2 00 1/2 1/2

1/2 1/2 0

123

with jump diagram given in Figure 6.10.We have

1 2

3

1/2

1/2

1/21/2

Fig. 6.10. The jump diagram associated to P.

P2 =

1/2 1/2 00 1/2 1/2

1/2 1/2 0

2

=

14

12

14

14

12

14

14

12

14

and

P3 =

1/2 1/2 00 1/2 1/2

1/2 1/2 0

3

=

14

12

14

14

12

14

14

12

14

.To have a picture what is going on here, imaging that π = (π1, π2, π3)

represents the amount of sand at the sites, 1, 2, and 3 respectively. Duringeach time step we move the sand on the sites around according to the followingrule. The sand at site j after one step is

∑i πiPij , namely site i contributes Pij

fraction its sand, πi, to site j. Everyone does this to arrive at a new distribution.Hence π is an invariant distribution if each πi remains unchanged, i.e. π = πP.(Keep in mind the sand is still moving around it is just that the size of the pilesremains unchanged.)

As a specific example, suppose π = (1, 0, 0) so that all of the sand starts at1. After the first step, the pile at 1 is split into two and 1/2 is sent to 2 to getπ1 = (1/2, 1/2, 0) which is the first row of P. At the next step the site 1 keeps1/2 of its sand (= 1/4) and still receives nothing, while site 2 again receivesthe other 1/2 and keeps half of what it had (= 1/4 + 1/4) and site 3 then gets(1/2 · 1/2 = 1/4) so that π2 =

[14

12

14

]which is the first row of P2. It turns

out in this case that this is the invariant distribution. Formally,

[14

12

14

] 1/2 1/2 00 1/2 1/2

1/2 1/2 0

=[

14

12

14

].

In general we expect to reach the invariant distribution only in the limit asn→∞.

Notice that if π is any stationary distribution, then πPn = π for all n andin particular,

π = πP2 =[π1 π2 π3

] 14

12

14

14

12

14

14

12

14

=[

14

12

14

].

Hence[

14

12

14

]is the unique stationary distribution for P in this case.

Example 6.34 (§3.2. p108 Ehrenfest Urn Model). Let a beaker filled with a par-ticle fluid mixture be divided into two parts A and B by a semipermeablemembrane. Let Xn = (# of particles in A) which we assume evolves by choos-ing a particle at random from A ∪ B and then replacing this particle in the

Page: 61 job: 285notes macro: svmonob.cls date/time: 3-Jun-2016/13:04

Page 70: Math 285 Stochastic Processes Spring 2016bdriver/285_S2016/Lecture Notes/285notes.pdfBruce K. Driver Math 285 Stochastic Processes Spring 2016 June 3, 2016 File:285notes.tex

62 6 First Step Analysis

opposite bin from which it was found. Suppose there are N total number ofparticles in the flask, then the transition probabilities are given by,

Pij = P(Xn+1 = j | Xn = i) =

0 if j /∈ i− 1, i+ 1iN if j = i− 1N−iN if j = i+ 1.

For example, if N = 2 we have

(Pij) =

0 1 20 1 0

1/2 0 1/20 1 0

012

and if N = 3, then we have in matrix form,

(Pij) =

0 1 2 30 1 0 0

1/3 0 2/3 00 2/3 0 1/30 0 1 0

0123

.

In the case N = 2, 0 1 01/2 0 1/20 1 0

2

=

12 0 1

20 1 012 0 1

2

0 1 0

1/2 0 1/20 1 0

3

=

0 1 012 0 1

20 1 0

and when N = 3,

0 1 0 0

1/3 0 2/3 00 2/3 0 1/30 0 1 0

2

=

13 0 2

3 00 7

9 0 29

29 0 7

9 00 2

3 0 13

0 1 0 01/3 0 2/3 00 2/3 0 1/30 0 1 0

3

=

0 7

9 0 29

727 0 20

27 00 20

27 0 727

29 0 7

9 0

0 1 0 01/3 0 2/3 00 2/3 0 1/30 0 1 0

25

∼=

0.0 0.75 0.0 0.250.25 0.0 0.75 0.00.0 0.75 0.0 0.250.25 0.0 0.75 0.0

0 1 0 01/3 0 2/3 00 2/3 0 1/30 0 1 0

26

∼=

0.25 0.0 0.75 0.00.0 0.75 0.0 0.250.25 0.0 0.75 0.00.0 0.75 0.0 0.25

:

0 1 0 01/3 0 2/3 00 2/3 0 1/30 0 1 0

100

∼=

0.25 0.0 0.75 0.00.0 0.75 0.0 0.250.25 0.0 0.75 0.00.0 0.75 0.0 0.25

We also have

(P− I)tr

=

−1 1 0 013 −1 2

3 00 2

3 −1 13

0 0 1 −1

tr

=

−1 1

3 0 01 −1 2

3 00 2

3 −1 10 0 1

3 −1

and

Nul(

(P− I)tr)

=

1331

.Hence if we take, π = 1

8

[1 3 3 1

]then

πP =1

8

[1 3 3 1

] 0 1 0 0

1/3 0 2/3 00 2/3 0 1/30 0 1 0

=1

8

[1 3 3 1

]= π

is the stationary distribution. Notice that

Page: 62 job: 285notes macro: svmonob.cls date/time: 3-Jun-2016/13:04

Page 71: Math 285 Stochastic Processes Spring 2016bdriver/285_S2016/Lecture Notes/285notes.pdfBruce K. Driver Math 285 Stochastic Processes Spring 2016 June 3, 2016 File:285notes.tex

6.5 Some more worked examples 63

(1/2)(P^{25} + P^{26}) \cong (1/2)\begin{bmatrix} 0.0 & 0.75 & 0.0 & 0.25 \\ 0.25 & 0.0 & 0.75 & 0.0 \\ 0.0 & 0.75 & 0.0 & 0.25 \\ 0.25 & 0.0 & 0.75 & 0.0 \end{bmatrix} + (1/2)\begin{bmatrix} 0.25 & 0.0 & 0.75 & 0.0 \\ 0.0 & 0.75 & 0.0 & 0.25 \\ 0.25 & 0.0 & 0.75 & 0.0 \\ 0.0 & 0.75 & 0.0 & 0.25 \end{bmatrix} = \begin{bmatrix} 0.125 & 0.375 & 0.375 & 0.125 \\ 0.125 & 0.375 & 0.375 & 0.125 \\ 0.125 & 0.375 & 0.375 & 0.125 \\ 0.125 & 0.375 & 0.375 & 0.125 \end{bmatrix} = \begin{bmatrix} π \\ π \\ π \\ π \end{bmatrix}.
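As a numerical sanity check, here is a short Python/numpy sketch (ours, not part of the original notes) reproducing the N = 3 computations above: the even and odd powers of P oscillate, while the average of two consecutive large powers has every row equal to π = (1/8)[1 3 3 1], the binomial(3, 1/2) distribution.

import numpy as np
from math import comb

N = 3
# Ehrenfest transition matrix on states 0, ..., N.
P = np.zeros((N + 1, N + 1))
for i in range(N + 1):
    if i > 0:
        P[i, i - 1] = i / N        # a particle moves out of A
    if i < N:
        P[i, i + 1] = (N - i) / N  # a particle moves into A

P25 = np.linalg.matrix_power(P, 25)
P26 = np.linalg.matrix_power(P, 26)
print((P25 + P26) / 2)             # every row is [0.125 0.375 0.375 0.125]

# The binomial distribution pi(k) = C(N,k)/2^N is stationary.
pi = np.array([comb(N, k) for k in range(N + 1)]) / 2 ** N
print(np.allclose(pi @ P, pi))     # True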

Example 6.35. Let us consider the Markov matrix,

P = \begin{bmatrix} 0 & 1 & 0 \\ 1/2 & 0 & 1/2 \\ 1 & 0 & 0 \end{bmatrix}  (states 1, 2, 3).

In this case we have

P^{25} = \begin{bmatrix} 0 & 1 & 0 \\ 1/2 & 0 & 1/2 \\ 1 & 0 & 0 \end{bmatrix}^{25} \cong \begin{bmatrix} 0.39990 & 0.40015 & 0.19995 \\ 0.40002 & 0.39990 & 0.20007 \\ 0.40015 & 0.39990 & 0.19995 \end{bmatrix},

P^{26} = \begin{bmatrix} 0 & 1 & 0 \\ 1/2 & 0 & 1/2 \\ 1 & 0 & 0 \end{bmatrix}^{26} \cong \begin{bmatrix} 0.40002 & 0.39990 & 0.20007 \\ 0.40002 & 0.40002 & 0.19995 \\ 0.39990 & 0.40015 & 0.19995 \end{bmatrix},

P^{100} = \begin{bmatrix} 0 & 1 & 0 \\ 1/2 & 0 & 1/2 \\ 1 & 0 & 0 \end{bmatrix}^{100} \cong \begin{bmatrix} 0.4 & 0.4 & 0.2 \\ 0.4 & 0.4 & 0.2 \\ 0.4 & 0.4 & 0.2 \end{bmatrix},

and observe that

[0.4  0.4  0.2] \begin{bmatrix} 0 & 1 & 0 \\ 1/2 & 0 & 1/2 \\ 1 & 0 & 0 \end{bmatrix} = [0.4  0.4  0.2],

so that π = [0.4  0.4  0.2] is a stationary distribution for P.

6.5.1 Life Time Processes

A computer component has life time T, with P (T = k) = ak for k ∈ N. Let Xn

denote the age of the component in service at time n. The set up is then

[0, T1] ∪ (T1, T1 + T2] ∪ (T1 + T2, T1 + T2 + T3] ∪ . . .

so for example if (T1, T2, T3, . . . ) = (1, 3, 4, . . . ) , then

X0 = 0, X1 = 0, X2 = 1, X3 = 2, X4 = 0, X5 = 1, X6 = 2, X7 = 3, X8 = 0, . . . .

The transition probabilities are then

P(X_{n+1} = 0 \mid X_n = k) = P(T = k+1 \mid T > k) = \frac{a_{k+1}}{\sum_{m>k} a_m}

and

P(X_{n+1} = k+1 \mid X_n = k) = P(T > k+1 \mid T > k) = \frac{P(T > k+1)}{P(T > k)} = \frac{\sum_{m>k+1} a_m}{\sum_{m>k} a_m} = \frac{\sum_{m>k} a_m - a_{k+1}}{\sum_{m>k} a_m} = 1 - \frac{a_{k+1}}{\sum_{m>k} a_m}.

See Exercise IV.2.E6 of Karlin and Taylor for a concrete example involving achain of this form.

There is another way to look at this same situation, namely let Yn denotethe remaining life of the part in service at time n. So if (T1, T2, T3, . . . ) =(1, 3, 4, . . . ) , then

Y0 = 1, Y1 = 3, Y2 = 2, Y3 = 1, Y4 = 4, Y5 = 3, Y6 = 2, Y7 = 1, . . . ,

and the corresponding transition matrix is determined by

P (Yn+1 = k − 1|Yn = k) = 1 if k ≥ 2

while P(Y_{n+1} = k \mid Y_n = 1) = P(T = k).

Example 6.36 (Exercise IV.2.E6 revisited). Let Yn denote the remaining life of the part in service at time n. So if (T1, T2, T3, . . . ) = (1, 3, 4, . . . ), then

Y0 = 1, Y1 = 3, Y2 = 2, Y3 = 1, Y4 = 4, Y5 = 3, Y6 = 2, Y7 = 1, . . . .

If

k          0    1    2    3    4
P(T = k)   0    0.1  0.2  0.3  0.4

the transition matrix is now given by

P = \begin{bmatrix} 1/10 & 1/5 & 3/10 & 2/5 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}  (states 1, 2, 3, 4),


whose invariant distribution is given by

π = (1/30)[10  9  7  4] = [1/3  3/10  7/30  2/15].

The failure of a part is indicated by Yn being 1, and so again the failure frequency is 1/3 of the time as found before. Observe that the expected life time of a part is

E[T] = 1 · 0.1 + 2 · 0.2 + 3 · 0.3 + 4 · 0.4 = 3.

Thus we see that π_1 = 1/ET, which is what we should have expected. To go a little further notice that from the jump diagram in Figure 6.11,

[Figure 6.11 shows states 1, 2, 3, 4 with arrows from state 1 to state k labeled a_k (k = 1, . . . , 4), and from each state k ≥ 2 an arrow to k − 1 with probability 1.]

Fig. 6.11. The jump diagram for this “renewal” chain.

one sees that

k            1         2         3         4
P1(R1 = k)   a1 = 0.1  a2 = 0.2  a3 = 0.3  a4 = 0.4

and therefore,

E1R1 = 1a1 + 2a2 + 3a3 + 4a4 = ET,

and hence π_1 = 1/E1R1 = 1/ET in general for this type of chain.
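A hedged numerical check of this example (Python/numpy; ours, not from the original notes): build P from the table, extract the stationary distribution, and confirm π_1 = 1/E[T] = 1/3.

import numpy as np

a = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}        # P(T = k)
# Remaining-life chain on states 1, 2, 3, 4.
P = np.array([[a[1], a[2], a[3], a[4]],
              [1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])

# Stationary distribution: left eigenvector for eigenvalue 1, normalized.
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmin(np.abs(w - 1))])
pi = pi / pi.sum()
print(pi)                                    # [10, 9, 7, 4] / 30
ET = sum(k * p for k, p in a.items())        # = 3
print(np.isclose(pi[0], 1 / ET))             # True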

6.5.2 Sampling Plans

This section summarizes the results of Section IV.2.3 of Karlin and Taylor. There one considers a production line where each item manufactured has probability 0 ≤ p ≤ 1 of being defective. Let i and r be two integers and sample the output of the line as follows. We begin by sampling every item until we have found i in a row which are good. Then we sample each of the next items with probability 1/r, determined randomly as each item comes off the line. (If r = 6 we might throw a die each time a product comes off the line and sample that product when we roll a 6, say.) If we find a bad part we start the process over again. We now describe this as a Markov chain with states {E_k}_{k=0}^{i}, where E_k denotes that we have seen k good parts in a row for 0 ≤ k < i, and E_i indicates that we are in stage II where we are randomly choosing to sample an item with probability 1/r. The transition probabilities for this chain are given by

P (Xn+1 = Ek+1|Xn = Ek) = q := 1− p for k = 0, 1, 2 . . . , i− 1,

P (Xn+1 = E0|Xn = Ek) = p if 0 ≤ k ≤ i− 1,

P(X_{n+1} = E_0 \mid X_n = E_i) = \frac{p}{r} \quad\text{and}\quad P(X_{n+1} = E_i \mid X_n = E_i) = 1 - \frac{p}{r},

with all other transition probabilities being zero. The stationary distribution for this chain satisfies the equations,

π_k = \sum_{l} P(X_{n+1} = E_k \mid X_n = E_l)\, π_l,

so that

π_0 = p \sum_{k=0}^{i-1} π_k + \frac{p}{r} π_i,

π_1 = qπ_0, \quad π_2 = qπ_1, \quad \dots, \quad π_{i-1} = qπ_{i-2},

π_i = qπ_{i-1} + \left(1 - \frac{p}{r}\right) π_i.

These equations may be solved (see Section IV.2.3 of Karlin and Taylor) to find in particular that

π_k = \frac{p(1-p)^k}{1 + (r-1)(1-p)^i} \quad\text{for } 0 \le k < i, \quad\text{and}\quad π_i = \frac{r(1-p)^i}{1 + (r-1)(1-p)^i}.

See Karlin and Taylor for more comments on this solution.
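These formulas are easy to test numerically. The sketch below (Python/numpy; the specific values p = 0.1, i = 3, r = 6 are ours, chosen only for illustration) builds the (i+1)-state chain and compares its stationary distribution with the displayed formulas.

import numpy as np

p, i, r = 0.1, 3, 6
q = 1 - p
n = i + 1                       # states E_0, ..., E_i
P = np.zeros((n, n))
for k in range(i):
    P[k, k + 1] = q             # another good item in a row
    P[k, 0] = p                 # a defective item restarts stage I
P[i, 0] = p / r                 # sampled (prob 1/r) and found bad
P[i, i] = 1 - p / r

# Stationary distribution via a large power of P (chain is aperiodic).
pi_num = np.linalg.matrix_power(P, 10_000)[0]

Z = 1 + (r - 1) * q ** i
pi_formula = np.array([p * q ** k / Z for k in range(i)] + [r * q ** i / Z])
print(np.allclose(pi_num, pi_formula))    # True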

6.5.3 Extra Homework Problems

Exercises 6.12 – 6.15 refer to the following Markov matrix:

P = \begin{bmatrix} 0 & 1 & 0 & 0 & 0 & 0 \\ 1/2 & 1/2 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1/2 & 1/2 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 1/2 & 0 & 0 & 0 & 1/2 \\ 0 & 0 & 0 & 1/4 & 3/4 & 0 \end{bmatrix}  (states 1, . . . , 6).  (6.33)

We will let {Xn}∞n=0 denote the Markov chain associated to P.


Exercise 6.12. Make a jump diagram for this matrix and identify the recurrent and transient classes. Also find the invariant distributions for the chain restricted to each of the recurrent classes.

Exercise 6.13. Find all of the invariant distributions for P.

Exercise 6.14. Compute the hitting probabilities, h5 = P5(Xn hits {3, 4}) and h6 = P6(Xn hits {3, 4}).

Exercise 6.15. Find limn→∞ P6 (Xn = j) for j = 1, 2, 3, 4, 5, 6.

6.6 *Computations avoiding the first step analysis

You may safely skip the rest of this section!!

Theorem 6.37. Let n denote a non-negative integer. If h : B → R is measurable and either bounded or non-negative, then

E_x[h(X_n) : H_B = n] = \left(Q_A^{n-1} Q[1_B h]\right)(x)

and

E_x[h(X_{H_B}) : H_B < ∞] = \left(\sum_{n=0}^{∞} Q_A^{n} Q[1_B h]\right)(x).  (6.34)

If g : A → R_+ is a measurable function, then for all x ∈ A and n ∈ N_0,

E_x[g(X_n) 1_{n<H_B}] = (Q_A^{n} g)(x).

In particular we have

E_x\left[\sum_{n<H_B} g(X_n)\right] = \sum_{n=0}^{∞} (Q_A^{n} g)(x) =: u(x),  (6.35)

where by convention, \sum_{n<H_B} g(X_n) = 0 when H_B = 0.

Proof. Let x ∈ A. In computing each of these quantities we will use;

{H_B > n} = {X_i ∈ A for 0 ≤ i ≤ n} and

{H_B = n} = {X_i ∈ A for 0 ≤ i ≤ n − 1} ∩ {X_n ∈ B}.

From the second identity above it follows that

E_x[h(X_n) : H_B = n] = E_x[h(X_n) : (X_1, . . . , X_{n-1}) ∈ A^{n-1}, X_n ∈ B]
= \int_{A^{n-1}×B} \prod_{j=1}^{n} Q(x_{j-1}, dx_j)\, h(x_n)
= \left(Q_A^{n-1} Q[1_B h]\right)(x),

and therefore

E_x[h(X_{H_B}) : H_B < ∞] = \sum_{n=1}^{∞} E_x[h(X_n) : H_B = n] = \sum_{n=1}^{∞} Q_A^{n-1} Q[1_B h] = \sum_{n=0}^{∞} Q_A^{n} Q[1_B h].

Similarly,

E_x[g(X_n) 1_{n<H_B}] = \int_{A^n} Q(x, dx_1) Q(x_1, dx_2) \cdots Q(x_{n-1}, dx_n)\, g(x_n) = (Q_A^{n} g)(x),

and therefore,

E_x\left[\sum_{n=0}^{∞} g(X_n) 1_{n<H_B}\right] = \sum_{n=0}^{∞} E_x[g(X_n) 1_{n<H_B}] = \sum_{n=0}^{∞} (Q_A^{n} g)(x).

In practice it is not so easy to sum the series in Eqs. (6.34) and (6.35). Thus we would like to have another way to compute these quantities. Since \sum_{n=0}^{∞} Q_A^{n} is a geometric series, we expect that

\sum_{n=0}^{∞} Q_A^{n} = (I − Q_A)^{-1},

which is basically correct, at least when (I − Q_A) is invertible. This suggests that if u(x) = E_x[h(X_{H_B}) : H_B < ∞], then (see Eq. (6.34))

u = Q_A u + Q[1_B h] on A,  (6.36)

and if u(x) = E_x[\sum_{n<H_B} g(X_n)], then (see Eq. (6.35))

u = Q_A u + g on A.  (6.37)


That these equations are valid is the content of Corollary 6.43 below and Theorem 6.6 above, which we proved using the “first step” analysis. We will give another direct proof in Theorem 6.42 below as well.

Lemma 6.38. Keeping the notation above we have

E_xT = \sum_{n=0}^{∞} \sum_{y∈A} Q^{n}(x, y) for all x ∈ A,  (6.38)

where ExT =∞ is possible.

Proof. By definition of T we have for x ∈ A and n ∈ N_0 that,

P_x(T > n) = P_x(X_1, . . . , X_n ∈ A) = \sum_{x_1, \dots, x_n ∈ A} p(x, x_1) p(x_1, x_2) \cdots p(x_{n-1}, x_n) = \sum_{y∈A} Q^{n}(x, y).  (6.39)

Therefore Eq. (6.38) now follows from Lemma 5.18 and Eq. (6.39).

Proposition 6.39. Let us continue the notation above and let us further assume that A is a finite set and

P_x(T < ∞) = P_x(X_n ∈ B for some n) > 0 ∀ x ∈ A.  (6.40)

Under these assumptions, E_xT < ∞ for all x ∈ A and in particular P_x(T < ∞) = 1 for all x ∈ A. In this case we may write Eq. (6.38) as

(E_xT)_{x∈A} = (I − Q)^{-1} \mathbf{1},  (6.41)

where 1 (x) = 1 for all x ∈ A.

Proof. Since {T > n} ↓ {T = ∞} and P_x(T = ∞) < 1 for all x ∈ A, it follows that there exist an m ∈ N and 0 ≤ α < 1 such that P_x(T > m) ≤ α for all x ∈ A. Since P_x(T > m) = \sum_{y∈A} Q^{m}(x, y), it follows that the row sums of Q^{m} are all at most α < 1. Further observe that

\sum_{y∈A} Q^{2m}(x, y) = \sum_{y,z∈A} Q^{m}(x, z) Q^{m}(z, y) = \sum_{z∈A} Q^{m}(x, z) \sum_{y∈A} Q^{m}(z, y) ≤ \sum_{z∈A} Q^{m}(x, z) · α ≤ α².

Similarly one may show that \sum_{y∈A} Q^{km}(x, y) ≤ α^{k} for all k ∈ N. Therefore from Eq. (6.39) with n replaced by km, we learn that P_x(T > km) ≤ α^{k} for all k ∈ N, which then implies that

\sum_{y∈A} Q^{n}(x, y) = P_x(T > n) ≤ α^{⌊n/m⌋} for all n ∈ N,

where ⌊t⌋ is the greatest integer not exceeding t, i.e. ⌊t⌋ = k ∈ N_0 when k ≤ t < k + 1. Therefore, we have

E_xT = \sum_{n=0}^{∞} \sum_{y∈A} Q^{n}(x, y) ≤ \sum_{n=0}^{∞} α^{⌊n/m⌋} ≤ m · \sum_{l=0}^{∞} α^{l} = \frac{m}{1 − α} < ∞.

So it only remains to prove Eq. (6.41). From the above computations we see that \sum_{n=0}^{∞} Q^{n} is convergent. Moreover,

(I − Q) \sum_{n=0}^{∞} Q^{n} = \sum_{n=0}^{∞} Q^{n} − \sum_{n=0}^{∞} Q^{n+1} = I,

and therefore (I − Q) is invertible with \sum_{n=0}^{∞} Q^{n} = (I − Q)^{-1}. Finally,

(I − Q)^{-1} \mathbf{1} = \sum_{n=0}^{∞} Q^{n} \mathbf{1} = \left(\sum_{n=0}^{∞} \sum_{y∈A} Q^{n}(x, y)\right)_{x∈A} = (E_xT)_{x∈A},

as claimed.
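As a concrete numerical illustration of Eq. (6.41) (our own sketch in Python/numpy; the chain — the fair random walk on {0, . . . , 4} with B = {0, 4} and A = {1, 2, 3} — is an assumption chosen for the example), the formula recovers the familiar expected absorption times E_xT = x(4 − x):

import numpy as np

# Q = the transition matrix restricted to A = {1, 2, 3} for the fair walk.
Q = np.array([[0.0, 0.5, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.5, 0.0]])

# (E_x T)_{x in A} = (I - Q)^{-1} 1, implemented as a linear solve.
ET = np.linalg.solve(np.eye(3) - Q, np.ones(3))
print(ET)    # [3. 4. 3.] = x(4 - x) for x = 1, 2, 3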

Remark 6.40. Let {Xn}∞n=0 denote the fair random walk on {0, 1, 2, . . . } with 0 being an absorbing state. Let T = T0, i.e. B = {0} so that A = N is now an infinite set. From Remark 6.21, we learn that EiT = ∞ for all i > 0. This shows that we can not in general drop the assumption that A (here A = {1, 2, . . . }) is a finite set in the statement of Proposition 6.39.

6.6.1 General facts about sub-probability kernels

Definition 6.41. Suppose (A,A) is a measurable space. A sub-probabilitykernel on (A,A) is a function ρ : A ×A → [0, 1] such that ρ (·, C) is A/BR –measurable for all C ∈ A and ρ (x, ·) : A → [0, 1] is a measure for all x ∈ A.

As with probability kernels we will identify ρ with the linear map, ρ : A_b → A_b, given by

(ρf)(x) = ρ(x, f) = \int_A f(y)\, ρ(x, dy).

Of course we have in mind that A = S_A and ρ = Q_A. In the following theorem let ‖g‖_∞ := sup_{x∈A} |g(x)| for all g ∈ A_b.


Theorem 6.42. Let ρ be a sub-probability kernel on a measurable space (A, A) and define u_n(x) := (ρ^{n} 1)(x) for all x ∈ A and n ∈ N_0. Then;

1. {u_n} is a decreasing sequence so that u := lim_{n→∞} u_n exists and is in A_b. (When ρ = Q_A, u_n(x) = P_x(H_B > n) ↓ u(x) = P_x(H_B = ∞) as n → ∞.)
2. The function u satisfies ρu = u.
3. If w ∈ A_b and ρw = w, then |w| ≤ ‖w‖_∞ u. In particular the equation ρw = w has a non-zero solution w ∈ A_b iff u ≠ 0.
4. If u = 0 and g ∈ A_b, then there is at most one w ∈ A_b such that w = ρw + g.
5. Let

U := \sum_{n=0}^{∞} u_n = \sum_{n=0}^{∞} ρ^{n} 1 : A → [0, ∞]  (6.42)

and suppose that U(x) < ∞ for all x ∈ A. Then for each g ∈ S_b,

w = \sum_{n=0}^{∞} ρ^{n} g  (6.43)

is absolutely convergent,

|w| ≤ ‖g‖_∞ U,  (6.44)

ρ(x, |w|) < ∞ for all x ∈ A, and w solves w = ρw + g. Moreover if v also solves v = ρv + g and |v| ≤ CU for some C < ∞, then v = w. Observe that when ρ = Q_A,

U(x) = \sum_{n=0}^{∞} P_x(H_B > n) = \sum_{n=0}^{∞} E_x(1_{H_B>n}) = E_x\left(\sum_{n=0}^{∞} 1_{H_B>n}\right) = E_x[H_B].

6. If g : A → [0, ∞] is any measurable function, then

w := \sum_{n=0}^{∞} ρ^{n} g : A → [0, ∞]

is a solution to w = ρw + g. (It may be that w ≡ ∞ though!) Moreover if v : A → [0, ∞] satisfies v = ρv + g, then w ≤ v. Thus w is the minimal non-negative solution to v = ρv + g.
7. If there exists α < 1 such that u ≤ α on A, then u = 0. (When ρ = Q_A, this states that P_x(H_B = ∞) ≤ α for all x ∈ A implies P_x(H_B = ∞) = 0 for all x ∈ A.)
8. If there exist an α < 1 and an n ∈ N such that u_n = ρ^{n} 1 ≤ α on A, then there exists C < ∞ such that

u_k(x) = (ρ^{k} 1)(x) ≤ Cβ^{k} for all x ∈ A and k ∈ N_0,

where β := α^{1/n} < 1. In particular, U ≤ C(1 − β)^{-1} and u = 0 under this assumption. (When ρ = Q_A this assertion states: if P_x(H_B > n) ≤ α for all x ∈ A, then P_x(H_B > k) ≤ Cβ^{k} and E_xH_B ≤ C(1 − β)^{-1} for all k ∈ N_0.)

Proof. We will prove each item in turn.

1. First observe that u_1(x) = ρ(x, A) ≤ 1 = u_0(x) and therefore,

u_{n+1} = ρ^{n+1} 1 = ρ^{n} u_1 ≤ ρ^{n} 1 = u_n.

We now let u := lim_{n→∞} u_n so that u : A → [0, 1].
2. Using DCT we may let n → ∞ in the identity ρu_n = u_{n+1} in order to show ρu = u.
3. If w ∈ A_b with ρw = w, then

|w| = |ρ^{n} w| ≤ ρ^{n} |w| ≤ ‖w‖_∞ ρ^{n} 1 = ‖w‖_∞ · u_n.

Letting n → ∞ shows that |w| ≤ ‖w‖_∞ u.
4. If w_i ∈ A_b solves w_i = ρw_i + g for i = 1, 2, then w := w_2 − w_1 satisfies w = ρw and therefore |w| ≤ ‖w‖_∞ u = 0.
5. Let U := \sum_{n=0}^{∞} u_n = \sum_{n=0}^{∞} ρ^{n} 1 : A → [0, ∞] and suppose U(x) < ∞ for all x ∈ A. Then u_n(x) → 0 as n → ∞ and so bounded solutions to ρu = u are necessarily zero. Moreover we have, for all k ∈ N_0, that

ρ^{k} U = \sum_{n=0}^{∞} ρ^{k} u_n = \sum_{n=0}^{∞} u_{n+k} = \sum_{n=k}^{∞} u_n ≤ U.  (6.45)

Since the tails of convergent series tend to zero it follows that lim_{k→∞} ρ^{k} U = 0. Now if g ∈ S_b, we have

\sum_{n=0}^{∞} |ρ^{n} g| ≤ \sum_{n=0}^{∞} ρ^{n} |g| ≤ \sum_{n=0}^{∞} ρ^{n} ‖g‖_∞ = ‖g‖_∞ · U < ∞  (6.46)

and therefore \sum_{n=0}^{∞} ρ^{n} g is absolutely convergent. Making use of Eqs. (6.45) and (6.46) we see that

\sum_{n=1}^{∞} ρ|ρ^{n-1} g| ≤ ‖g‖_∞ · ρU ≤ ‖g‖_∞ U < ∞,

and therefore (using DCT),

w = \sum_{n=0}^{∞} ρ^{n} g = g + \sum_{n=1}^{∞} ρ^{n} g = g + ρ \sum_{n=1}^{∞} ρ^{n-1} g = g + ρw,

i.e. w solves w = g + ρw. If v : A → R is measurable such that |v| ≤ CU and v = g + ρv, then y := w − v solves y = ρy with |y| ≤ (C + ‖g‖_∞)U. It follows that

|y| = |ρ^{n} y| ≤ (C + ‖g‖_∞) ρ^{n} U → 0 as n → ∞,

i.e. 0 = y = w − v.
6. If g ≥ 0 we may always define w by Eq. (6.43), allowing for w(x) = ∞ for some or even all x ∈ A. As in the proof of the previous item (with DCT being replaced by MCT), it follows that w = ρw + g. If v ≥ 0 also solves v = g + ρv, then

v = g + ρ(g + ρv) = g + ρg + ρ² v,

and more generally by induction we have

v = \sum_{k=0}^{n} ρ^{k} g + ρ^{n+1} v ≥ \sum_{k=0}^{n} ρ^{k} g.

Letting n → ∞ in this last equation shows that v ≥ w.
7. If u ≤ α < 1 on A, then by item 3. with w = u we find that

u ≤ ‖u‖_∞ · u ≤ αu,

which clearly implies u = 0.
8. If u_n ≤ α < 1, then for any m ∈ N we have,

u_{n+m} = ρ^{m} u_n ≤ αρ^{m} 1 = αu_m.

Taking m = kn in this inequality shows u_{(k+1)n} ≤ αu_{kn}. Thus a simple induction argument shows u_{kn} ≤ α^{k} for all k ∈ N_0. For general l ∈ N_0 we write l = kn + r with 0 ≤ r < n. We then have,

u_l = u_{kn+r} ≤ u_{kn} ≤ α^{k} = α^{(l-r)/n} ≤ Cα^{l/n},

where C := α^{-(n-1)/n}.

Corollary 6.43. If h : B → [0, ∞] is measurable, then u(x) := E_x[h(X_{H_B}) : H_B < ∞] is the unique minimal non-negative solution to Eq. (6.36), while if g : A → [0, ∞] is measurable, then u(x) = E_x[\sum_{n<H_B} g(X_n)] is the unique minimal non-negative solution to Eq. (6.37).

Exercise 6.16. Keeping the notation of Exercises 6.9 and 6.11, use Corollary 6.43 to show again that P_x(H_B < ∞) = (q/p)^{x} for all x > 0 and E_xT_0 = x/(q − p) for x < 0. You should do so without making use of the extraneous hitting times, T_n for n ≠ 0.

Corollary 6.44. If P_x(H_B = ∞) = 0 for all x ∈ A and h : B → R is a bounded measurable function, then u(x) := E_x[h(X_{H_B})] is the unique solution to Eq. (6.36).

Corollary 6.45. Suppose now that A = B^c is a finite subset of S such that P_x(H_B = ∞) < 1 for all x ∈ A. Then there exist C < ∞ and β ∈ (0, 1) such that P_x(H_B > n) ≤ Cβ^{n}, and in particular E_xH_B < ∞ for all x ∈ A.

Proof. Let α_0 = max_{x∈A} P_x(H_B = ∞) < 1. We know that

lim_{n→∞} P_x(H_B > n) = P_x(H_B = ∞) ≤ α_0 for all x ∈ A.

Therefore if α ∈ (α_0, 1), using the fact that A is a finite set, there exists an n sufficiently large such that P_x(H_B > n) ≤ α for all x ∈ A. The result now follows from item 8. of Theorem 6.42.


7 Markov Conditioning

We assume that {Xn}∞n=0 is a Markov chain with values in S and transition kernel P, and that π : S → [0, 1] is a probability on S. As usual we write Pπ for the unique probability satisfying Eq. (5.2) and we will often write p(x, y) for P_{xy}. The most important result of this chapter is the strong Markov property stated in Theorem 7.8 below.

7.1 The Markov Property

Notation 7.1 Let {f̃_n}_{n=0}^{∞} be an independent copy of the {f_n}_{n=0}^{∞} and for each x ∈ S let {Y^{x}_{n}}_{n=0}^{∞} be the Markov chain defined inductively by,

Y^{x}_{0} = x and Y^{x}_{n+1} = f̃_n(Y^{x}_{n}) for n ∈ N_0.

It should be clear to the reader that

E[F(Y^{x}_{0}, Y^{x}_{1}, . . . )] = E_x[F(X_0, X_1, . . . )]

for all bounded functions (x_0, x_1, . . . ) → F(x_0, x_1, . . . ). The next two theorems generalize these statements in far reaching ways.

Theorem 7.2 (Markov Property). For any m ∈ N,

(X_0, X_1, . . . , X_m, X_{m+1}, . . . ) \stackrel{d}{=} (X_0, X_1, . . . , X_m, Y^{X_m}_{1}, Y^{X_m}_{2}, . . . ).

Proof. Let

X̃_k := \begin{cases} X_k & \text{if } k ≤ m \\ Y^{X_m}_{k-m} & \text{if } k ≥ m, \end{cases}

N ∈ N such that m ≤ N, and x_0, x_1, . . . , x_N ∈ S. Then

P(X̃_0 = x_0, X̃_1 = x_1, . . . , X̃_N = x_N)
= P(X_0 = x_0, . . . , X_m = x_m, Y^{x_m}_{1} = x_{m+1}, . . . , Y^{x_m}_{N-m} = x_N)
= P(X_0 = x_0, . . . , X_m = x_m) · P(Y^{x_m}_{1} = x_{m+1}, . . . , Y^{x_m}_{N-m} = x_N)  (7.1)

wherein we have used the independence of (Y^{x_m}_{1}, . . . , Y^{x_m}_{N-m}) from X. Furthermore, as in the proof of Theorem 5.11,

P(Y^{x_m}_{1} = x_{m+1}, . . . , Y^{x_m}_{N-m} = x_N) = P(f̃_0(x_m) = x_{m+1}, . . . , f̃_{N-m-1}(x_{N-1}) = x_N) = p(x_m, x_{m+1}) \cdots p(x_{N-1}, x_N),

which combined with Eq. (7.1) and Theorem 5.11 shows,

P(X̃_0 = x_0, X̃_1 = x_1, . . . , X̃_N = x_N) = P(X_0 = x_0, X_1 = x_1, . . . , X_N = x_N).

Remark 7.3. Here is the intuition behind the proof of Theorem 7.2. The trajectory (X_0, X_1, . . . , X_m, X_{m+1}, X_{m+2}, . . . ) is constructed using the original spinners that determine the f_n, while the trajectory (X_0, X_1, . . . , X_m, Y^{X_m}_{1}, Y^{X_m}_{2}, . . . ) is determined by using identical but independent spinners (i.e. the f̃_n) at time m and beyond. It should be intuitively clear that the statistics of these two trajectories should be the same, and this is the content of Theorem 7.2.

We end this section with some consequences and variants of Theorem 7.2. For the first application let us give another proof of Proposition 5.13.

Corollary 7.4. If g : S → R is a bounded function, then

E [g (Xm+1) | (X0, . . . , Xm)] = (Pg) (Xm) (7.2)

and E[g(X_{m+1}) | X_m] = (Pg)(X_m).

Proof. Let u : S^{m+1} → R be another bounded function, then

E[g(X_{m+1}) u(X_0, . . . , X_m)] = E[g(Y^{X_m}_{1}) u(X_0, . . . , X_m)]
= \sum_{x∈S} E[g(Y^{x}_{1}) u(X_0, . . . , X_{m-1}, x) 1_{X_m=x}]
= \sum_{x∈S} E[g(Y^{x}_{1})] · E[u(X_0, . . . , X_{m-1}, x) 1_{X_m=x}]
= \sum_{x∈S} (Pg)(x) · E[u(X_0, . . . , X_{m-1}, x) 1_{X_m=x}]
= \sum_{x∈S} E[(Pg)(x) u(X_0, . . . , X_{m-1}, x) 1_{X_m=x}]
= E[(Pg)(X_m) u(X_0, . . . , X_m)],

from which Eq. (7.2) follows. The second equation follows from Eq. (7.2) and the tower property of conditional expectation;

(Pg)(X_m) = E[(Pg)(X_m) | X_m] = E[E[g(X_{m+1}) | (X_0, . . . , X_m)] | X_m] = E[g(X_{m+1}) | X_m].

Theorem 7.5 (Markov conditioning). Let X = {Xn}∞n=0 be a Markov chain with transition kernel p and for each x ∈ S let {Yn}∞n=0 be another Markov chain with transition kernel p starting at Y_0 = x which is independent of X. Then relative to P(·|X_m = x) we have,

X = (X_0, X_1, . . . ) \stackrel{d}{=} (X_0, . . . , X_{m-1}, x, Y_1, Y_2, . . . ).

In more detail we are asserting,

E[F(X) | X_m = x] = E[F(X_0, . . . , X_{m-1}, x, Y_1, Y_2, . . . ) | X_m = x]

for all F(X) = F(X_0, X_1, . . . ), where F is either bounded or non-negative.

Proof. First Proof. By Theorem 7.2,

E[F(X) : X_m = x] = E[F(X_0, X_1, . . . , X_m, Y^{X_m}_{1}, Y^{X_m}_{2}, . . . ) : X_m = x]
= E[F(X_0, X_1, . . . , X_{m-1}, x, Y^{x}_{1}, Y^{x}_{2}, . . . ) : X_m = x],

which upon dividing by P(X_m = x) completes the proof.

Second Proof. In this proof we use the fact¹ that it suffices to verify the assertions of the theorem for F(x) = F(x_0, . . . , x_N), i.e. for F(x) depending on only finitely many coordinates. For any k ∈ N let

¹ We are using a standard measure theoretic result here which goes by the name of Dynkin’s multiplicative system theorem. One could also use the so called π – λ theorem or the monotone class theorem as well. See Math 280 for these sorts of arguments.

ρ_k(x_0, . . . , x_k) := \prod_{l=1}^{k} p(x_{l-1}, x_l)

and observe that

ρ_m(x_0, . . . , x_{m-1}, x_m)\, ρ_n(x_m, y_1, . . . , y_n) = ρ_{m+n}(x_0, . . . , x_{m-1}, x_m, y_1, . . . , y_n).

Now let N = m + n, \mathbf{x} := (x_0, . . . , x_{m-1}) ∈ S^{m}, and \mathbf{y} = (y_1, . . . , y_n) ∈ S^{n}. We then have,

E[F(X_0, . . . , X_{m-1}, x, Y_1, Y_2, . . . , Y_n) : X_m = x]
= \sum_{\mathbf{x}} \sum_{\mathbf{y}} F(\mathbf{x}, x, \mathbf{y})\, π(x_0)\, ρ_m(\mathbf{x}, x)\, ρ_n(x, \mathbf{y})
= \sum_{(\mathbf{x}, \mathbf{y})} F(\mathbf{x}, x, \mathbf{y})\, π(x_0)\, ρ_{m+n}(\mathbf{x}, x, \mathbf{y})
= \sum_{(x_0, . . . , x_{m+n})} F(x_0, . . . , x_{m+n})\, 1_{x_m=x}\, π(x_0)\, ρ_{m+n}(x_0, . . . , x_{m+n})
= E[F(X_0, . . . , X_N) : X_m = x].

Corollary 7.6. Again let Y = (Y_0, Y_1, . . . ) be a Markov chain independent of X = (X_0, X_1, . . . ) but having the same transition matrix, P. If m ∈ N we have

E[F(X) | (X_0, . . . , X_m)] = E^{(Y)}_{X_m} F(X_0, . . . , X_{m-1}, Y_0, Y_1, Y_2, . . . ),

where E^{(Y)}_{X_m} indicates the expectation in the Y – process only, keeping the X – process “frozen.” Put another way,

E[F(X) | (X_0, . . . , X_m)] = h(X_0, . . . , X_m)

where

h(x_0, . . . , x_m) = E_{x_m}[F(x_0, . . . , x_{m-1}, X_0, X_1, . . . )].  (7.3)

Proof. By Theorem 7.2,

E[F(X) : (X_0, . . . , X_m) = (x_0, . . . , x_m)]
= E[F(x_0, . . . , x_m, X_{m+1}, X_{m+2}, . . . ) : (X_0, . . . , X_m) = (x_0, . . . , x_m)]
= E[F(x_0, . . . , x_m, Y^{x_m}_{1}, Y^{x_m}_{2}, . . . ) : (X_0, . . . , X_m) = (x_0, . . . , x_m)]
= E[F(x_0, . . . , x_m, Y^{x_m}_{1}, Y^{x_m}_{2}, . . . )] · P((X_0, . . . , X_m) = (x_0, . . . , x_m))
= E_{x_m}[F(x_0, . . . , x_{m-1}, X_0, X_1, . . . )] · P((X_0, . . . , X_m) = (x_0, . . . , x_m)).

Dividing this equation by P((X_0, . . . , X_m) = (x_0, . . . , x_m)) then shows that h defined in Eq. (7.3) is also given by

h(x_0, . . . , x_m) = E[F(X) | (X_0, . . . , X_m) = (x_0, . . . , x_m)],

which along with Proposition 3.10 completes the proof.

Corollary 7.7. For each m ∈ N and x ∈ S such that P(X_m = x) > 0, (X_0, . . . , X_{m-1}) and (X_{m+1}, X_{m+2}, . . . ) are independent relative to P(·|X_m = x). Stated more poetically, the past is independent of the future given the present.

Proof. Let F(x_0, . . . , x_{m-1}) and G(x_{m+1}, x_{m+2}, . . . ) be given bounded functions, then

E[F(X_0, . . . , X_{m-1}) G(X_{m+1}, X_{m+2}, . . . ) : X_m = x]
= E[F(X_0, . . . , X_{m-1}) G(Y^{x}_{1}, Y^{x}_{2}, . . . ) : X_m = x]
= E[F(X_0, . . . , X_{m-1}) : X_m = x] · E\, G(Y^{x}_{1}, Y^{x}_{2}, . . . ),

and so dividing this equation by P(X_m = x) shows,

E[F(X_0, . . . , X_{m-1}) G(X_{m+1}, X_{m+2}, . . . ) | X_m = x]
= E[F(X_0, . . . , X_{m-1}) | X_m = x] · E\, G(Y^{x}_{1}, Y^{x}_{2}, . . . ).

Taking F ≡ 1 in this equation implies

E[G(X_{m+1}, X_{m+2}, . . . ) | X_m = x] = E\, G(Y^{x}_{1}, Y^{x}_{2}, . . . ),

and hence,

E[F(X_0, . . . , X_{m-1}) G(X_{m+1}, X_{m+2}, . . . ) | X_m = x]
= E[F(X_0, . . . , X_{m-1}) | X_m = x] · E[G(X_{m+1}, X_{m+2}, . . . ) | X_m = x],

which is the desired independence statement.

7.2 The Strong Markov Property

The results in Section 7.1 have a far reaching extension to the case where m ∈ N_0 is replaced by a stopping time τ. The strong Markov property will play an important role in Chapter 8 on the limiting behaviors of Markov chains.

Theorem 7.8 (Strong Markov Property). Again let {Y^{x}_{n}}_{n=0}^{∞} be the Markov chain (independent of X) described in Notation 7.1. If τ is an F^{X}-stopping time and

X̃_n := \begin{cases} X_n & \text{if } n < τ \\ Y^{X_τ}_{n-τ} & \text{if } n ≥ τ \end{cases} \quad\text{for } n ∈ N_0,

then

(X̃_0, X̃_1, . . . ) \stackrel{d}{=} (X_0, X_1, . . . ).

Alternatively this may be stated,

(X_0, X_1, . . . , X_τ, Y^{X_τ}_{1}, Y^{X_τ}_{2}, . . . ) \stackrel{d}{=} (X_0, X_1, . . . , X_τ, X_{τ+1}, . . . ) on {τ < ∞}.

Proof. The intuitive proof of this theorem is essentially the same as the one described in Remark 7.3. For the formal proof, let N ∈ N, n ≤ N, and then find A ⊂ S^{n+1} such that

{τ = n} = {(X_0, . . . , X_n) ∈ A}.

Then for any x_0, x_1, . . . , x_N ∈ S, we have

P(X̃_0 = x_0, X̃_1 = x_1, . . . , X̃_N = x_N, τ = n)
= P(X_0 = x_0, . . . , X_n = x_n, Y^{x_n}_{1} = x_{n+1}, . . . , Y^{x_n}_{N-n} = x_N, τ = n)
= 1_A(x_0, x_1, . . . , x_n) · P(X_0 = x_0, . . . , X_n = x_n, Y^{x_n}_{1} = x_{n+1}, . . . , Y^{x_n}_{N-n} = x_N)
= 1_A(x_0, x_1, . . . , x_n) · P(X_0 = x_0, . . . , X_n = x_n, X_{n+1} = x_{n+1}, . . . , X_N = x_N)
= P(X_0 = x_0, . . . , X_n = x_n, X_{n+1} = x_{n+1}, . . . , X_N = x_N, τ = n),

wherein the second to last equality we used Theorem 7.2. Summing the previous equation on n ≤ N shows,

P(X̃_0 = x_0, X̃_1 = x_1, . . . , X̃_N = x_N, τ ≤ N) = P(X_0 = x_0, . . . , X_N = x_N, τ ≤ N).  (7.4)

Finally since

(X_0, . . . , X_N) = (X̃_0, . . . , X̃_N) on {τ > N},

we conclude that

P(X̃_0 = x_0, X̃_1 = x_1, . . . , X̃_N = x_N, τ > N) = P(X_0 = x_0, . . . , X_N = x_N, τ > N).  (7.5)

Summing Eqs. (7.4) and (7.5) gives

P(X̃_0 = x_0, . . . , X̃_N = x_N) = P(X_0 = x_0, . . . , X_N = x_N),

which completes the proof.
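The theorem can also be illustrated by simulation. The following Monte Carlo sketch (Python/numpy; the 3-state kernel and the stopping time τ = H_{2}, the first hitting time of state 2, are our own arbitrary choices) drives one chain with a single stream of independent "spinners" and a second, spliced chain which switches to a fresh independent stream from time τ onward; the empirical laws of the two chains at time N agree up to Monte Carlo error.

import numpy as np

rng = np.random.default_rng(1)
P = np.array([[0.2, 0.5, 0.3],
              [0.6, 0.1, 0.3],
              [0.3, 0.3, 0.4]])
C = P.cumsum(axis=1)

def drive(x, u):
    # One step of the chain driven by the uniform "spinner" u.
    return int(np.searchsorted(C[x], u))

N, M = 15, 40000
lawX, lawXt = np.zeros(3), np.zeros(3)
for _ in range(M):
    u1, u2 = rng.random(N), rng.random(N)   # two independent spinner streams
    x = 0
    for n in range(N):                       # X: driven by u1 throughout
        x = drive(x, u1[n])
    lawX[x] += 1
    y, hit = 0, False
    for n in range(N):                       # X~: switch to u2 once tau = H_{2} occurs
        hit = hit or (y == 2)
        y = drive(y, u2[n] if hit else u1[n])
    lawXt[y] += 1
print(lawX / M)
print(lawXt / M)   # the two empirical laws of X_N agree up to sampling error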


Notation 7.9 Let {Yn}∞n=0 be a Markov chain independent of {Xn} but having the same transition probabilities, P. Further let E^{(Y)}_{y} denote the expectation relative to the chain Y started at y ∈ S.

Corollary 7.10. If τ is an F^{X} – stopping time and (x_0, x_1, . . . ) → F(x_0, x_1, . . . ) is a bounded (or non-negative) function, then

E[F(X) : τ < ∞] = E\left[E^{(Y)}_{X_τ}[F(X_0, . . . , X_τ, Y_1, Y_2, . . . )] : τ < ∞\right].

Alternatively, if for each n ∈ N_0 we let

F_n(x_0, x_1, . . . , x_n) := E^{(Y)}_{x_n} F(x_0, . . . , x_n, Y_1, Y_2, . . . ) = E_{x_n} F(x_0, . . . , x_n, X_1, X_2, . . . ),

then

E[F(X) : τ < ∞] = E[F_τ(X_0, . . . , X_τ) : τ < ∞].

Proof. By Theorem 7.8,

E[F(X) : τ < ∞] = E[F(X̃) : τ < ∞]
= E[F(X_0, X_1, . . . , X_τ, Y^{X_τ}_{1}, Y^{X_τ}_{2}, . . . ) : τ < ∞]
= E\left[E^{(Y)}_{X_τ}[F(X_0, . . . , X_τ, Y_1, Y_2, . . . )] : τ < ∞\right].

Let us end this section with one more variant of the strong Markov property.

Theorem 7.11 (Strong Markov Property). Let x ∈ S, {Yn}∞n=0 be as in Notation 7.9, and τ be an F^{X} – stopping time. Then given {τ < ∞, X_τ = x} we have

(X_0, X_1, . . . ) \stackrel{d}{=} (X_0, X_1, . . . , X_{τ-1}, X_τ, Y_1, Y_2, . . . ),  (7.6)

i.e. relative to P(·|τ < ∞, X_τ = x) the two sides have the same law, where Y is started at Y_0 = x. We also have,

E[F(X_0, X_1, . . . ) : τ < ∞]
= E\left[E^{(Y)}_{X_τ}[F(X_0, X_1, . . . , X_{τ-1}, Y_0, Y_1, Y_2, . . . )] : τ < ∞\right]
= E\left[E^{(Y)}_{X_τ}[F(X_0, X_1, . . . , X_{τ-1}, X_τ, Y_1, Y_2, . . . )] : τ < ∞\right].  (7.7)

Proof. Using 1_{τ=n} = f_n(X_0, X_1, . . . , X_n) for an appropriate function f_n, it follows from Theorem 7.5 that

Eν[F(X_0, X_1, . . . ) : τ = n, X_n = x] = Eν[F(X_0, X_1, . . . , X_{n-1}, x, Y_1, Y_2, . . . ) : τ = n, X_n = x].

Summing this equation on n and using {τ = n, X_n = x} = {τ = n, X_τ = x}, we learn that

Eν[F(X_0, X_1, . . . ) : τ < ∞, X_τ = x]
= \sum_{n=0}^{∞} Eν[F(X_0, X_1, . . . ) : τ = n, X_n = x]
= \sum_{n=0}^{∞} Eν[F(X_0, X_1, . . . , X_{n-1}, x, Y_1, Y_2, . . . ) : τ = n, X_n = x]
= \sum_{n=0}^{∞} Eν[F(X_0, X_1, . . . , X_{τ-1}, X_τ, Y_1, Y_2, . . . ) : τ = n, X_τ = x]
= Eν[F(X_0, X_1, . . . , X_{τ-1}, X_τ, Y_1, Y_2, . . . ) : τ < ∞, X_τ = x],  (7.8)

which suffices to prove Eq. (7.6). We may rewrite Eq. (7.8) as,

Eν[F(X_0, X_1, . . . ) : τ < ∞, X_τ = x]
= Eν\left[E^{(Y)}_{x}[F(X_0, X_1, . . . , X_{τ-1}, Y_0, Y_1, Y_2, . . . )] : τ < ∞, X_τ = x\right]
= Eν\left[E^{(Y)}_{X_τ}[F(X_0, X_1, . . . , X_{τ-1}, Y_0, Y_1, Y_2, . . . )] : τ < ∞, X_τ = x\right].

Summing this last equation on x ∈ S then gives Eq. (7.7).

Here is a special case of Theorem 7.11.

Theorem 7.12 (Strong Markov Property). Let ({Xn}∞n=0, {P_x}_{x∈S}, p) be a Markov chain as above and τ : Ω → [0, ∞] be a stopping time. Then

Eπ[f(X_τ, X_{τ+1}, . . . )\, g_τ(X_0, . . . , X_τ)\, 1_{τ<∞}] = Eπ\left[\left[E^{(Y)}_{X_τ} f(Y_0, Y_1, . . . )\right] g_τ(X_0, . . . , X_τ)\, 1_{τ<∞}\right]  (7.9)

for all f, g = {g_n} ≥ 0 or f and g bounded. In other words,

Eπ[f(X_τ, X_{τ+1}, . . . )\, 1_{τ<∞} \mid F_τ] = E^{(Y)}_{X_τ}[f(Y_0, Y_1, . . . )].

Proof. Apply Eq. (7.7) of Theorem 7.11 with F(X_0, X_1, . . . ) = f(X_τ, X_{τ+1}, . . . )\, g_τ(X_0, . . . , X_τ)\, 1_{τ<∞}.


7.3 Strong Markov Proof of Hitting Times Estimates*

In this optional subsection we reprove the results in Section 5.3 by making use of the strong Markov property.

Corollary 7.13. Let B ⊂ S and H_B be as above; then for n, m ∈ N we have

Pπ(H_B > m + n) = Eπ[1_{H_B>m}\, P_{X_m}[H_B > n]].  (7.10)

Proof. First proof. By the Markov property in Theorem 7.2, for each y ∈ S, we have

(X_0, . . . , X_m, Y^{y}_{1}, Y^{y}_{2}, . . . ) \stackrel{Pπ(·|X_m=y)}{=} (X_0, X_1, . . . ).

This observation along with the identity,

{H_B(X) > m + n} = {H_B(X) > m, H_B(X_m, X_{m+1}, . . . ) > n},

leads to

h(y) := Pπ(H_B(X) > m + n \mid X_m = y)
= Pπ(H_B(X) > m, H_B(X_m, X_{m+1}, . . . ) > n \mid X_m = y)
= Pπ(H_B(X) > m, H_B(Y^{y}_{0}, Y^{y}_{1}, Y^{y}_{2}, . . . ) > n \mid X_m = y)
= Pπ(H_B(X) > m \mid X_m = y) · P(H_B(Y^{y}_{0}, Y^{y}_{1}, Y^{y}_{2}, . . . ) > n)
= Pπ(H_B > m \mid X_m = y) · P_y(H_B > n)
= Eπ(1_{H_B>m} \mid X_m = y) · P_y(H_B > n).

Therefore it follows that

Pπ(H_B > m + n) = Eπ Eπ[1_{H_B>m+n} \mid X_m] = Eπ[h(X_m)]
= Eπ[Eπ(1_{H_B>m} \mid X_m) · P_{X_m}(H_B > n)] = Eπ[1_{H_B>m}\, P_{X_m}(H_B > n)],

where the last equality is the tower property of conditional expectation.

Second proof. Using Theorem 7.11,

Pπ(H_B > m + n) = Eπ[1_{H_B(X)>m+n}]
= Eπ\left[E^{(Y)}_{X_m}[1_{H_B(X_0, . . . , X_{m-1}, Y_0, Y_1, . . . )>m+n}]\right]
= Eπ\left[E^{(Y)}_{X_m}[1_{H_B(X)>m} · 1_{H_B(Y)>n}]\right]
= Eπ\left[1_{H_B(X)>m}\, E^{(Y)}_{X_m}[1_{H_B(Y)>n}]\right]
= Eπ[1_{H_B>m}\, P_{X_m}[H_B > n]].

Corollary 7.14. Suppose that B ⊂ S is a non-empty proper subset of S and A = S \ B. Further suppose there is some α < 1 such that P_x(H_B = ∞) ≤ α for all x ∈ A; then Pπ(H_B = ∞) = 0. [In words: if there is a “uniform” chance that X hits B starting from any site, then X will surely hit B from any point in A.]

Proof. Since H_B = 0 on {X_0 ∈ B} we in fact have P_x(H_B = ∞) ≤ α for all x ∈ S. Letting n → ∞ in Eq. (7.10) shows,

Pπ(H_B = ∞) = Eπ[1_{H_B>m}\, P_{X_m}[H_B = ∞]] ≤ Eπ[1_{H_B>m}\, α] = α Pπ(H_B > m).

Now letting m → ∞ in this equation shows Pπ(H_B = ∞) ≤ α Pπ(H_B = ∞), from which it follows that Pπ(H_B = ∞) = 0.

Corollary 7.15. Suppose that B ⊂ S is a non-empty proper subset of S and A = S \ B. Further suppose there are some α < 1 and n < ∞ such that P_x(H_B > n) ≤ α for all x ∈ A; then

Eπ(H_B) ≤ \frac{n}{1 − α} < ∞

for every starting distribution π. [In words: if there is a “uniform” chance that X hits B starting from any site within a fixed number of steps, then the expected hitting time of B is finite and bounded independent of the starting distribution.]

Proof. Again using H_B = 0 on {X_0 ∈ B} we may conclude that P_x(H_B > n) ≤ α for all x ∈ S. Taking m = kn in Eq. (7.10) shows

Pπ(H_B > kn + n) = Eπ[1_{H_B>kn}\, P_{X_{kn}}[H_B > n]] ≤ Eπ[1_{H_B>kn} · α] = α Pπ(H_B > kn).

Iterating this equation, using the fact that Pπ(H_B > 0) ≤ 1, shows Pπ(H_B > kn) ≤ α^{k} for all k ∈ N_0. Therefore with the aid of Lemma 5.18 and the observation,

Pπ(H_B > kn + m) ≤ Pπ(H_B > kn) for m = 0, . . . , n − 1,

we find,

EπH_B = \sum_{k=0}^{∞} Pπ(H_B > k) ≤ \sum_{k=0}^{∞} n Pπ(H_B > kn) ≤ \sum_{k=0}^{∞} n α^{k} = \frac{n}{1 − α} < ∞.


8 Long Run Behavior of Discrete Markov Chains

In this chapter, Xn will be a Markov chain with a finite or countable state space, S. To each state x ∈ S, let

H_x := min{n ≥ 0 : X_n = x} and R_x := min{n ≥ 1 : X_n = x}  (8.1)

be the first hitting time and the first passage (return) time of the chain to site x, respectively.

8.1 Introduction

We wish to begin with some informal facts and intuition that the reader should keep in mind while proceeding through this chapter.

1. If for some starting distribution (ν),

π(x) = lim_{n→∞} Pν(X_n = x) = lim_{n→∞} [νP^{n}]_x exists,  (8.2)

then we expect

πP = lim_{n→∞} νP^{n} · P = π,

so that π is necessarily an invariant distribution of P. [This is easily justified if #(S) < ∞. For infinite state spaces complications arise if the column sums of P are infinite.]

2. From your homework we expect that (at least if E_xR_x < ∞ for all x ∈ S)

π(x) = \frac{1}{E_xR_x}.

3. One further expects (“on average”) that

\frac{1}{n} \sum_{m=0}^{n} 1_{X_m=x} = \frac{\#\text{ returns to } x}{\#\text{ steps}} \cong \frac{\#\text{ steps}/E_xR_x}{\#\text{ steps}} = \frac{1}{E_xR_x},

i.e. that

lim_{n→∞} \frac{1}{n} \sum_{m=0}^{n} 1_{X_m=x} = \frac{1}{E_xR_x} a.s.  (8.3)

4. Notice that if Eq. (8.3) holds then taking expectations (using DCT) implies

lim_{n→∞} \frac{1}{n} \sum_{m=0}^{n} Pν(X_m = x) = E\left[lim_{n→∞} \frac{1}{n} \sum_{m=0}^{n} 1_{X_m=x}\right] = \frac{1}{E_xR_x},

which may be restated as saying,

\frac{1}{E_xR_x} = lim_{n→∞} \frac{1}{n} \sum_{m=0}^{n} (νP^{m})_x,

which is a weaker form of Eq. (8.2).

Example 8.1 (A periodic chain). Let S = {1, 2, 3} and

P = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix},

so that the following transitions occur with probability 1: 1 → 2 → 3 → 1. In this case P² corresponds to 1 → 3 → 2 → 1 and P³ corresponds to 1 → 1, 2 → 2 and 3 → 3, i.e.

P² = \begin{bmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix} \quad\text{and}\quad P³ = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}.

With these observations we see that Pⁿ cycles through the above three matrices and so lim_{n→∞} Pⁿ does not exist. On the other hand it is quite easy to verify;

lim_{n→∞} \frac{1}{n} \sum_{m=0}^{n} P^{m} = \frac{1}{3}\left[P + P² + P³\right] = \frac{1}{3} \begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix},

where each row is π = (1/3)[1  1  1], which is the unique invariant distribution of P. [The eigenvalues of P are the three roots of unity in C, namely 1 and −1/2 ± (√3/2) i.]
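A quick numerical confirmation of the example (Python/numpy sketch, ours): Pⁿ never settles down, but the Cesàro averages do.

import numpy as np

P = np.array([[0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)

print(np.linalg.matrix_power(P, 100))  # one of the three cyclic permutation matrices

# Cesaro average (1/n) sum_{m=1}^n P^m tends to the all-1/3 matrix.
A, Q, n = np.zeros((3, 3)), np.eye(3), 300
for _ in range(n):
    Q = Q @ P
    A += Q
print(A / n)                           # approximately 1/3 in every entry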

Our goal in this chapter is to refine and make precise the ideas above, being wary of the previous example.


8.2 The Main Results

See Kallenberg [10, pages 148-158] for more information along the lines of what is to follow. The first thing we need to do is to divide the state space up into different “classes.”

Definition 8.2. A state y is accessible from x (written x → y) iff P_x(H_y < ∞) > 0, and x and y communicate (written x ←→ y) iff x → y and y → x.

Notice that (since P_x(H_x < ∞) = 1) x ←→ x for all x ∈ S. We also have x → y iff there is a path x = x_0 → x_1 → · · · → x_n = y in S such that p(x_0, x_1) p(x_1, x_2) . . . p(x_{n-1}, x_n) > 0. Furthermore if x ←→ y and y ←→ z, then x ←→ z; that is, the notion of communicating is an equivalence relation.

Definition 8.3. For each x ∈ S, let C_x := {y ∈ S : x ←→ y} be the communicating class of x.

The state space, S, is partitioned into a disjoint union of its communicatingclasses, see Figure 8.1 for an example.

Definition 8.4. A communicating class C ⊂ S is closed provided the probability that Xn leaves C, given that it started in C, is zero. In other words, P_{xy} = 0 for all x ∈ C and y ∉ C.

Notice that if C ⊂ S is a closed communicating class, then Xn restricted to C is a Markov chain.

Definition 8.5. A state x ∈ S is:

1. transient if P_x(R_x < ∞) < 1 (⟺ P_x(R_x = ∞) > 0),
2. recurrent if P_x(R_x < ∞) = 1 (⟺ P_x(R_x = ∞) = 0),
 a) positive recurrent if 1/(E_xR_x) > 0, i.e. E_xR_x < ∞,
 b) null recurrent if it is recurrent (P_x(R_x < ∞) = 1) and 1/(E_xR_x) = 0, i.e. E_xR_x = ∞.

We let S_t, S_r, S_{pr}, and S_{nr} be the transient, recurrent, positive recurrent, and null recurrent states respectively.

The rest of this chapter is devoted to the main Markov chain limiting results along with some illustrative examples. The more technical aspects of the proofs will be covered in Chapter 9. Some of the more interesting examples of the theory given here will appear in Chapter 10; see Examples 10.7, 10.8, and 10.9 for examples of transient, null-recurrent, and positively recurrent states.

Proposition 8.6 (Class properties). The notions of being recurrent, positive recurrent, null recurrent, or transient are all class properties. Namely if C ⊂ S is a communicating class then either all x ∈ C are recurrent, positive recurrent, null recurrent, or transient. Hence it makes sense to refer to C as being either recurrent, positive recurrent, null recurrent, or transient.

Proof. See Proposition 8.19 for the assertion that being recurrent or transient is a class property. For the fact that positive and null recurrence is a class property, see Proposition 9.5 below.

Lemma 8.7. Let C ⊂ S be a communicating class. Then

C not closed =⇒ C is transient

or equivalently put,

C is recurrent =⇒ C is closed.

Proof. If C is not closed and x ∈ C, there is a y ∉ C such that x → y, i.e. there is a path x = x_0, x_1, . . . , x_n = y with all of the {x_j}_{j=0}^{n} being distinct such that

P_x(X_0 = x, X_1 = x_1, . . . , X_{n-1} = x_{n-1}, X_n = x_n = y) > 0.

Since y ∉ C we must have y ↛ C, and therefore on the event,

A := {X_0 = x, X_1 = x_1, . . . , X_{n-1} = x_{n-1}, X_n = x_n = y},

we have X_m ∉ C for all m ≥ n, and therefore R_x = ∞ on the event A, which has positive probability.

Definition 8.8. To each site x ∈ S, let

M_x := \sum_{n=0}^{∞} 1_{X_n=x} = \sum_{n=0}^{∞} 1_x(X_n)  (8.4)

be the number of visits (returns) of {Xn}_{n≥0} to site x.

Proposition 8.9. Suppose that C ⊂ S is a finite communicating class and let T = inf{n ≥ 0 : X_n ∉ C} = H_{S\setminus C} be the first exit time from C. If C is not closed, then not only is C transient but also E_xT < ∞ for all x ∈ C. We also have the equivalence of the following statements:

1. C is closed.
2. C is positive recurrent.
3. C is recurrent.


In particular if #(S) < ∞, then the recurrent (= positively recurrent) states are precisely the union of the closed communication classes and the transient states are what is left over. [Warning: when #(S) = ∞ or, more importantly, #(C) = ∞, life is not so simple – see Example 10.6 below.]

Proof. These results follow fairly easily from Corollary 7.15 and the fact that

T = \sum_{x∈C} M_x.

Remark 8.10. Let {Xn}∞n=0 denote the fair random walk on {0, 1, 2, . . . } with 0 being an absorbing state. The communication classes are {0} and {1, 2, . . . }, with the latter class not being closed and hence transient. If T = H_0 is the first exit time from {1, 2, . . . }, then using Remark 6.21 it follows that E_xT = ∞ for all x > 0, which shows we can not drop the assumption that #(C) < ∞ in the first statement of Proposition 8.9. Similarly, using the fair random walk example, we see that it is not possible to drop the condition that #(C) < ∞ for the equivalence statements as well.

Example 8.11. Let P be the Markov matrix with jump diagram given in Figure 8.1. In this case the communication classes are {1, 2}, {3, 4}, and {5}. The latter two are closed and hence positively recurrent, while {1, 2} is transient.

[Figure 8.1 shows the jump diagram on states 1–5 with the indicated transition probabilities (1/2, 1/3, 1/2, 1/2, 1/2, 1/2); states 1 and 2 feed into the closed classes {3, 4} and {5}.]

Fig. 8.1. A 5 state Markov chain with 3 communicating classes.

Example 8.12. Let {Xn}∞n=0 denote the fair random walk on S = Z; then this chain is irreducible. On the other hand, if {Xn}∞n=0 is the fair random walk on {0, 1, 2, . . . } with 0 being an absorbing state, then the communication classes are {0} (closed) and {1, 2, . . . } (not closed).

Remark 8.13 (Warnings). There are some subtle points which may arise when #(S) = ∞.

1. If C ⊂ S is closed and #(C) = ∞, C could be recurrent or it could be transient. Transient in this case means the walk goes off to “infinity.”
2. Interchanging sums and limits is not always permissible. For example, consider the fair random walk on Z, which satisfies E_xR_x = ∞ for all x ∈ Z. In this case, for any starting distribution ν, we will see below that

lim_{n→∞} \frac{1}{n} \sum_{m=1}^{n} 1_{X_m=x} = \frac{1}{E_xR_x} = 0, Pν – a.s. for each x ∈ Z,

while

1 = lim_{n→∞} \frac{1}{n} \sum_{m=1}^{n} 1 = lim_{n→∞} \frac{1}{n} \sum_{m=1}^{n} \sum_{x∈Z} 1_{X_m=x} ≠ \sum_{x∈Z} lim_{n→∞} \frac{1}{n} \sum_{m=1}^{n} 1_{X_m=x} = 0.

More generally, in the terminology below, if C is a closed communication class which is null recurrent or transient, then for any starting distribution ν with ν(C) = 1 we will have, Pν – a.s., that

1 = lim_{n→∞} \frac{1}{n} \sum_{m=1}^{n} 1 = lim_{n→∞} \frac{1}{n} \sum_{m=1}^{n} \sum_{x∈C} 1_{X_m=x} ≠ \sum_{x∈C} lim_{n→∞} \frac{1}{n} \sum_{m=1}^{n} 1_{X_m=x} = 0.

The following proposition is a consequence of the strong Markov property in Theorem 7.11.

Remark 8.14 (MC Segment Decomposition). The following excursion decomposition is the key idea behind most of the results in this chapter. Let {Xn}∞n=0 be our Markov chain starting with some initial distribution ν : S → [0, 1] and let x ∈ S be given. We then decompose the chain as follows. Let T_N be the time of the N-th visit to site x, so that T_1 = H_x, T_2 = min{n > H_x : X_n = x}, . . . , T_{N+1} = min{n > T_N : X_n = x}, with the convention that the minimum of the empty set is ∞. The chain then decomposes into “segments” (or excursions),

S_1 := (X_0, . . . , X_{T_1} = x),
S_2 := (X_{T_1} = x, X_{T_1+1}, . . . , X_{T_2} = x),
S_3 := (X_{T_2} = x, X_{T_2+1}, . . . , X_{T_3} = x),
⋮

It is possible there is only one segment, but what we do know is, by the strong Markov property, that given T_N < ∞, the segments S_1, . . . , S_N are independent and S_2, . . . , S_N have the same distributions. In particular we conclude, relative to Pν(·|T_N < ∞), that {τ_K := T_K − T_{K-1}}_{K=2}^{N} are i.i.d. random variables such that

Pν(τ_K = n \mid T_N < ∞) = P_x(R_x = n) for n ∈ N and 2 ≤ K ≤ N,

and in particular

Eν[τ_K \mid T_N < ∞] = E_xR_x for 2 ≤ K ≤ N.

Proposition 8.15. If x ∈ S, N ∈ N, and ν : S → [0, 1] is any probability on S, then

Pν(M_x ≥ N) = Pν(H_x < ∞) · P_x(R_x < ∞)^{N-1}.  (8.5)

[Also see Proposition 9.1 for another proof.]

Proof. Following the ideas in Remark 8.14: intuitively, {M_x ≥ N} happens iff the chain first visits x, with probability Pν(H_x < ∞), and then revisits x another N − 1 times, the probability of each revisit being P_x(R_x < ∞). Since Markov chains are forgetful, these probabilities are all independent and hence we arrive at Eq. (8.5). See Proposition 9.1 below for the formal proof based on the strong Markov property in Theorem 7.11.

Formal proof. Using the strong Markov property as stated in Theorem 7.8 we know that

X \stackrel{d}{=} (X_0, . . . , X_{τ_N}, Y^{x}_{1}, Y^{x}_{2}, . . . ) on {τ_N < ∞},

and therefore

P(M_x ≥ N + 1) = P(τ_{N+1} < ∞) = P(τ_N < ∞, τ_{N+1} − τ_N < ∞)
= P(τ_N < ∞, R_x(X_{τ_N}, X_{τ_N+1}, . . . ) < ∞)
= P(τ_N < ∞, R_x(Y^{x}_{0}, Y^{x}_{1}, . . . ) < ∞)
= P(τ_N < ∞) · P(R_x(Y^{x}_{0}, Y^{x}_{1}, . . . ) < ∞)
= P(τ_N < ∞) · P_x(R_x < ∞) = P_x(R_x < ∞) · P(M_x ≥ N),

and so Equation (8.5) now follows by induction on N.

Corollary 8.16. If x ∈ S and ν : S → [0, 1] is any probability on S, then¹

Pν(M_x = ∞) = Pν(X_n = x i.o.) = Pν(H_x < ∞)\, 1_{x∈S_r},  (8.6)

P_x(M_x = ∞) = P_x(X_n = x i.o.) = 1_{x∈S_r},  (8.7)

Eν M_x = \sum_{n=0}^{∞} \sum_{y∈S} ν(y)\, P^{n}_{yx} = \frac{Pν(H_x < ∞)}{1 − P_x(R_x < ∞)},  (8.8)

and (taking ν = δ_y)

E_y M_x = \sum_{n=0}^{∞} P^{n}_{yx} = \frac{P_y(H_x < ∞)}{1 − P_x(R_x < ∞)},  (8.9)

where the following conventions are used in interpreting the right hand sides of Eqs. (8.8) and (8.9): a/0 := ∞ if a > 0 while 0/0 := 0.

¹ See Definition 2.4 to review the meaning of {X_n = y i.o.}. In this case, {X_n = y i.o.} = {M_y = ∞}.

Proof. Since

{M_x ≥ N} ↓ {M_x = ∞} = {X_n = x i.o.} as N ↑ ∞,

it follows, using Eq. (8.5), that

Pν(X_n = x i.o.) = lim_{N→∞} Pν(M_x ≥ N) = Pν(H_x < ∞) · lim_{N→∞} P_x(R_x < ∞)^{N-1},  (8.10)

which gives Eq. (8.6). Equation (8.7) follows by taking ν = δ_x in Eq. (8.6) and recalling that x ∈ S_r iff P_x(R_x < ∞) = 1. Similarly Eq. (8.9) is a special case of Eq. (8.8) with ν = δ_y. We now prove Eq. (8.8).

Using the definition of M_x in Eq. (8.4),

Eν M_x = Eν \sum_{n=0}^{∞} 1_{X_n=x} = \sum_{n=0}^{∞} Eν 1_{X_n=x} = \sum_{n=0}^{∞} Pν(X_n = x) = \sum_{n=0}^{∞} \sum_{y∈S} ν(y)\, P^{n}_{yx},

which is the first equality in Eq. (8.8). For the second equality we combine Lemma 5.18 along with Eq. (8.5) to find;

Eν M_x = \sum_{N=1}^{∞} Pν(M_x ≥ N) = \sum_{N=1}^{∞} Pν(H_x < ∞)\, P_x(R_x < ∞)^{N-1} = \frac{Pν(H_x < ∞)}{1 − P_x(R_x < ∞)}.
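Equation (8.9) invites a numerical experiment. In the sketch below (Python/numpy; the small absorbing chain is our own example, not from the notes), we estimate E_1M_0 for the transient state 0 both by simulating visit counts and by the exact identity E_yM_x = [(I − Q)^{-1}]_{yx}, where Q is P restricted to the transient states (this is the sum of the matrix-power series ∑_n P^n_{yx} from Eq. (8.9), cf. Section 6.6).

import numpy as np

rng = np.random.default_rng(2)
# States 0 and 1 are transient, state 2 is absorbing.
P = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.00, 0.00, 1.00]])

# Exact: E_y M_x = [(I - Q)^{-1}]_{yx} with Q = P restricted to {0, 1}.
Q = P[:2, :2]
M = np.linalg.inv(np.eye(2) - Q)
print(M[1, 0])                      # E_1 M_0 = 4/3

# Monte Carlo estimate of M_0 = number of visits to 0 starting from 1.
visits, trials = 0, 20000
for _ in range(trials):
    x = 1
    while x != 2:                   # absorption at 2 happens a.s.
        visits += (x == 0)
        x = rng.choice(3, p=P[x])
print(visits / trials)              # approximately 4/3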


Corollary 8.17. If ν : S → [0, 1] is any starting distribution and x ∈ S_t, then

Pν(M_x < ∞) = 1 and lim_{n→∞} Pν(X_n = x) = 0.

The first equality above states that, Pν – a.s., Xn visits x at most a finite number of times, which further implies

lim_{n→∞} \frac{1}{n} \sum_{m=1}^{n} 1_{X_m=x} = 0, Pν − a.s.

Proof. When x ∈ S_t, Eq. (8.8) implies Eν M_x < ∞, which immediately implies Pν(M_x < ∞) = 1. For the second assertion we simply observe that

∞ > Eν M_x = \sum_{n=0}^{∞} Eν 1_{X_n=x} = \sum_{n=0}^{∞} Pν(X_n = x),

and therefore lim_{n→∞} Pν(X_n = x) = 0.

Theorem 8.18 (Recurrent States). Let y ∈ S. Then the following are equivalent;

1. y is recurrent, i.e. P_y(R_y < ∞) = 1,
2. P_y(X_n = y i.o.) = 1,
3. E_yM_y = \sum_{n=0}^{∞} P^{n}_{yy} = ∞.

Proof. The equivalence of the first two items follows directly from Eq. (8.7) or directly from Remark 8.14 when ν = δ_y. The equivalence of items 1. and 3. follows directly from Eq. (8.9) with x = y.

Proposition 8.19. If x ←→ y, then x is recurrent iff y is recurrent, i.e. the property of being recurrent or transient is a class property.

Proof. First proof. For any α, β, n ∈ N_0 we have,

P^{n+α+β}_{xx} = \sum_{u,v∈S} P^{α}_{xu} P^{n}_{uv} P^{β}_{vx} ≥ P^{α}_{xy} P^{n}_{yy} P^{β}_{yx},

and therefore,

E_xM_x ≥ \sum_{n=0}^{∞} P^{n+α+β}_{xx} ≥ P^{α}_{xy} P^{β}_{yx} \sum_{n=0}^{∞} P^{n}_{yy} = \left(P^{α}_{xy} P^{β}_{yx}\right) E_yM_y.

Since x ←→ y we may choose α, β ∈ N so that c := P^{α}_{xy} P^{β}_{yx} > 0, and so the above inequality shows that if E_yM_y = ∞ then E_xM_x = ∞ as well. By interchanging the roles of x and y we may similarly show that E_xM_x = ∞ implies E_yM_y = ∞. Thus using item 3. of Theorem 8.18, it follows that x is recurrent iff y is recurrent.

Second proof. See the aside in the proof of Corollary 8.20 below.

Corollary 8.20. If C ⊂ S_r is a recurrent communication class, then

P_x(R_y < ∞) = 1 = P_x(M_y = ∞) for all x, y ∈ C,  (8.11)

and in fact

P_x(∩_{y∈C}{X_n = y i.o.}) = P_x(∩_{y∈C}{M_y = ∞}) = 1 for all x ∈ C.  (8.12)

More generally if ν : S → [0, 1] is a probability such that ν(x) = 0 for x ∉ C, then

Pν(∩_{y∈C}{X_n = y i.o.}) = 1.  (8.13)

In words, if we start in C then every state in C is visited an infinite number of times. (Notice that P_x(R_y < ∞) = P_x({X_n}_{n≥1} hits y).)

Proof. First Proof. Let C be a communicating class and suppose x, y ∈ C with x being recurrent. (After the proof we will show directly that y is recurrent as well, thus giving another proof of Proposition 8.19.) Since x is recurrent, P_x – a.s. there are infinitely many independent round trip excursions from x to x in the segment decomposition described in Remark 8.14 with ν = δ_x. For n ∈ N, let A_n be the event that y is visited by the chain during the n-th segment, which is a round trip excursion from x to x. By the strong Markov property, {A_n}_{n=1}^{∞} is an independent collection of sets with p = P_x(A_n) being independent of n. Moreover since x → y, it follows that p > 0. Therefore by an application of the second Borel–Cantelli Lemma 2.6, P_x(A_n i.o.) = 1, i.e. P_x(M_y = ∞) = 1, which proves Eq. (8.11). If we now assume Proposition 8.19, we may conclude that P_x(M_y = ∞) = 1 for all x, y ∈ C. Equation (8.12) is a consequence of Eq. (8.11) and the fact that the countable intersection of probability one events is again a probability one event. Equation (8.13) follows by multiplying Eq. (8.12) by ν(x) and then summing on x ∈ C.

Aside: a second proof of Proposition 8.19. Let x ∈ S be recurrent and y ∈ S such that y ≠ x and y communicates with x, i.e. y ←→ x. Recall that we have just shown Eq. (8.12) holds (i.e. P_x(M_y = ∞) = 1 = P_x(H_y < ∞)) under these assumptions. Using the identity,

M_y(X_0, X_1, . . . ) = \sum_{n=0}^{∞} 1_{X_n=y} = \sum_{n=0}^{∞} 1_{X_{H_y+n}=y} = M_y(X_{H_y}, X_{H_y+1}, X_{H_y+2}, . . . ),

along with the strong Markov property, implies;

1 = P_x(M_y(X_0, X_1, . . . ) = ∞) = P_x(M_y(X_0, X_1, . . . ) = ∞, H_y < ∞)
= P_x(H_y < ∞) · P_y(M_y(X_0, X_1, . . . ) = ∞)
= P_y(M_y = ∞).


This certainly shows y is recurrent as well.

Second proof of Corollary 8.20. Let x, y ∈ C ⊂ S_r and choose m ∈ N such that P_y(X_m = x) > 0. Then using P_y(M_y = ∞) = 1 we learn that

P_y(X_m = x) = P_y(M_y = ∞, X_m = x) = P_y(M_y = ∞ \mid X_m = x)\, P_y(X_m = x) = P_x(M_y = ∞)\, P_y(X_m = x),

which implies,

1 = P_x(M_y = ∞) ≤ P_x(R_y < ∞) ⟹ Eq. (8.11).

Here we have used

{M_y = ∞} = \left\{\sum_{k=m}^{∞} 1_y(X_k) = ∞\right\}

along with the Markov property (see Theorem 7.5) to assert that

P_y(M_y = ∞ \mid X_m = x) = P_y\left(\sum_{k=m}^{∞} 1_y(X_k) = ∞ \,\Big|\, X_m = x\right) = P_x\left(\sum_{k=0}^{∞} 1_y(X_k) = ∞\right) = P_x(M_y = ∞).

Equation (8.12) is a consequence of Eq. (8.11) and the fact that the countable intersection of probability one events is again a probability one event. Equation (8.13) follows by multiplying Eq. (8.12) by ν(x) and then summing on x ∈ C.

Theorem 8.21 (Transient States). Let y ∈ S. Then the following are equivalent;

1. y is transient, i.e. P_y(R_y < ∞) < 1,
2. P_y(X_n = y i.o.) = 0, and
3. E_yM_y = \sum_{n=0}^{∞} P^{n}_{yy} < ∞.

Moreover, if x ∈ S and y ∈ S_t, then

\sum_{n=0}^{∞} P^{n}_{xy} = E_xM_y < ∞  (8.14)

which implies

P_x(X_n = y i.o.) = 0 and lim_{n→∞} P_x(X_n = y) = lim_{n→∞} P^{n}_{xy} = 0,  (8.15)

and more generally, if ν : S → [0, 1] is any probability, then

\sum_{n=0}^{∞} Pν(X_n = y) = EνM_y < ∞  (8.16)

which implies

Pν(X_n = y i.o.) = 0 and lim_{n→∞} Pν(X_n = y) = lim_{n→∞} [νP^{n}]_y = 0.  (8.17)

Proof. The equivalence of the first two items follows directly from Eq. (8.7) and the equivalence of items 1. and 3. follows directly from Eq. (8.9) with x = y. The facts that E_xM_y < ∞ and EνM_y < ∞ in Eqs. (8.14) and (8.16) for all y ∈ S_t are consequences of Eqs. (8.9) and (8.8) respectively. The remaining implications in Eqs. (8.15) and (the more general) Eq. (8.17) are consequences of the facts: 1) the n-th term in a convergent series tends to zero as n → ∞, 2) {X_n = y i.o.} = {M_y = ∞}, and 3) EνM_y < ∞ implies Pν(M_y = ∞) = 0.

Corollary 8.22. 1) If the state space, S, is a finite set, then S_r ≠ ∅. 2) Any finite and closed communicating class C ⊂ S is recurrent.

Proof. First suppose that #(S) < ∞ and, for the sake of contradiction, suppose S_r = ∅, or equivalently that S = S_t. Then by Theorem 8.21, lim_{n→∞} P^{n}_{xy} = 0 for all x, y ∈ S. On the other hand, \sum_{y∈S} P^{n}_{xy} = 1, so that

1 = lim_{n→∞} \sum_{y∈S} P^{n}_{xy} = \sum_{y∈S} lim_{n→∞} P^{n}_{xy} = \sum_{y∈S} 0 = 0,

which is a contradiction. (Notice that if S were infinite, we could not (in general) interchange the limit and the above sum without some extra conditions.)

To prove the second statement, restrict Xn to C to get a Markov chain on the finite state space C. By what we have just proved, there is a recurrent state x ∈ C. Since recurrence is a class property, it follows that all states in C are recurrent.

Definition 8.23. A function π : S → [0, 1] is a sub-probability if \sum_{y∈S} π(y) ≤ 1. We call \sum_{y∈S} π(y) the mass of π. So a probability is a sub-probability with mass one.

Definition 8.24. We say a sub-probability, π : S → [0, 1], is invariant if πP = π, i.e.

\sum_{x∈S} π(x)\, P_{xy} = π(y) for all y ∈ S.  (8.18)

An invariant probability, π : S → [0, 1], is called an invariant distribution.

Example 8.25. If #(S) < ∞ and p : S × S → [0, 1] is a Markov transition matrix with column sums adding up to 1, then π(x) := 1/#(S) is an invariant distribution for p. In particular, if p is a symmetric Markov transition matrix (p(x, y) = p(y, x) for all x, y ∈ S), then the uniform distribution π is an invariant distribution for p.


Example 8.26. Let S be a finite set of nodes and G be an undirected graph on S, i.e. G is a subset of S × S such that

1. (x, x) ∉ G for all x ∈ S,
2. if (x, y) ∈ G, then (y, x) ∈ G [the graph is undirected], and
3. for all x ∈ S, the set S_x := {y ∈ S : (x, y) ∈ G} is not empty. [We are not allowing for any isolated nodes in our graph.]

Let

ν(x) := \#(S_x) = \sum_{y∈S} 1_{(x,y)∈G}

be the valence of G at x. The random walk on this graph is then the Markov chain on S with Markov transition matrix,

p(x, y) := \frac{1}{ν(x)} 1_{S_x}(y) = \frac{1}{ν(x)} 1_{(x,y)∈G}.

Notice that

\sum_{x∈S} ν(x)\, p(x, y) = \sum_{x∈S} 1_{(x,y)∈G} = \sum_{x∈S} 1_{(y,x)∈G} = ν(y).

Thus if we let Z := \sum_{x∈S} ν(x) and π(x) := ν(x)/Z, we will have that π is an invariant distribution for P.
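A minimal sketch of this computation (Python/numpy, with a small graph of our own choosing):

import numpy as np

# An undirected graph on S = {0, 1, 2, 3} given by its edge set.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
S = 4
A = np.zeros((S, S))
for x, y in edges:
    A[x, y] = A[y, x] = 1           # undirected: both directions

nu = A.sum(axis=1)                   # valence nu(x) = #(S_x)
P = A / nu[:, None]                  # p(x, y) = 1_{(x,y) in G} / nu(x)

pi = nu / nu.sum()                   # pi(x) = nu(x) / Z
print(np.allclose(pi @ P, pi))       # True: pi is invariant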

Theorem 8.27 (Ergodic Theorem). Suppose that P = (p_{xy}) is an irreducible Markov kernel and

π_y := \frac{1}{E_yR_y} for all y ∈ S.  (8.19)

Then:

1. For all x, y ∈ S, we have

lim_{n→∞} \frac{1}{n} \sum_{m=0}^{n} 1_{X_m=y} = π_y, P_x − a.s.,  (8.20)

and

lim_{n→∞} \frac{1}{n} \sum_{m=1}^{n} P_x(X_m = y) = lim_{n→∞} \frac{1}{n} \sum_{m=1}^{n} P^{m}_{xy} = π_y.  (8.21)

2. If µ : S → [0, 1] is an invariant sub-probability, then either µ(x) > 0 for all x or µ(x) = 0 for all x.
3. P has at most one invariant distribution.
4. P has a (necessarily unique) invariant distribution, µ : S → [0, 1], iff P is positive recurrent, in which case µ(x) = π(x) = 1/(E_xR_x) > 0 for all x ∈ S.

(These results may of course be applied to the restriction of a general non-irreducible Markov chain to any one of its communication classes.)

Proof. These results are the contents of Theorem 9.3 and Propositions 9.5 and 9.7 below.

Using this result we can give another proof of Proposition 8.9.

Theorem 8.28 (General Convergence Theorem). Let ν : S → [0, 1] be any probability, y ∈ S, C = C_y be the communicating class² containing y, H_C = inf{n : X_n ∈ C} with H_C = ∞ when X_n ∉ C for all n,³ and

π_y := π_y(ν) = \frac{Pν(H_C < ∞)}{E_yR_y},  (8.22)

where 1/∞ := 0. Then:

1. Pν – a.s.,

lim_{n→∞} \frac{1}{n} \sum_{m=1}^{n} 1_{X_m=y} = \frac{1}{E_yR_y} · 1_{H_C<∞},  (8.23)

2.

lim_{n→∞} \frac{1}{n} \sum_{m=1}^{n} \sum_{x∈S} ν(x)\, P^{m}_{xy} = lim_{n→∞} \frac{1}{n} \sum_{m=1}^{n} Pν(X_m = y) = π_y,  (8.24)

3. π is an invariant sub-probability for P, and
4. the mass of π, π(S) := \sum_{y∈S} π_y, is given by

π(S) = \sum_{C:\ \text{pos. recurrent}} Pν(H_C < ∞) ≤ 1.  (8.25)

Proof. If y ∈ S is a transient site, then according to Eq. (8.17), Pν(M_y < ∞) = 1 and therefore lim_{n→∞} (1/n) \sum_{m=1}^{n} 1_{X_m=y} = 0, which agrees with Eq. (8.23) for y ∈ S_t.

So now suppose that y ∈ S_r. Let C be the communication class containing y and

H = H_C := inf{n ≥ 0 : X_n ∈ C}

be the first time when Xn enters C. It is clear that {R_y < ∞} ⊂ {H < ∞}. On the other hand, for any z ∈ C, it follows by the strong Markov property of Theorem 7.11 and Corollary 8.20 that, conditioned on {H < ∞, X_H = z}, Xn hits y i.o. and hence Pν(R_y < ∞ \mid H < ∞, X_H = z) = 1. Equivalently put,

² No assumptions on the type of this class are needed here.
³ We sometimes write the event {H_C < ∞} as {X_n hits C}.


Pν(R_y < ∞, H < ∞, X_H = z) = Pν(H < ∞, X_H = z) for all z ∈ C.

Summing this last equation on z ∈ C then shows

Pν(R_y < ∞) = Pν(R_y < ∞, H < ∞) = Pν(H < ∞),

and therefore {R_y < ∞} = {H < ∞} modulo an event with Pν – probability zero.

Another application of the strong Markov property of Theorem 7.11, observing that X_{R_y} = y on {R_y < ∞}, allows us to conclude that the Pν(·|R_y < ∞) = Pν(·|H < ∞) – law of (X_{R_y}, X_{R_y+1}, X_{R_y+2}, . . . ) is the same as the P_y – law of (X_0, X_1, X_2, . . . ). Therefore, we may apply Theorem 8.27 to conclude that

lim_{n→∞} \frac{1}{n} \sum_{m=1}^{n} 1_{X_m=y} = lim_{n→∞} \frac{1}{n} \sum_{m=1}^{n} 1_{X_{R_y+m}=y} = \frac{1}{E_yR_y}, Pν(·|R_y < ∞) – a.s.

On the other hand, on the event {R_y = ∞} we have lim_{n→∞} (1/n) \sum_{m=1}^{n} 1_{X_m=y} = 0. Thus we have shown, Pν – a.s., that

lim_{n→∞} \frac{1}{n} \sum_{m=1}^{n} 1_{X_m=y} = \frac{1}{E_yR_y} 1_{R_y<∞} = \frac{1}{E_yR_y} 1_{H<∞} = \frac{1}{E_yR_y} 1_{H_C<∞},

which is Eq. (8.23). Taking expectations of this equation, using the dominated convergence theorem, gives Eq. (8.24).

Since ExRx =∞ unless x is a positive recurrent site, it follows that∑x∈S

πxPxy =∑x∈Spr

πxPxy =∑

C: pos-rec.

Pν (HC <∞)∑x∈C

1

ExRxPxy. (8.26)

As each positive recurrent class, C, is closed; if x ∈ C and y /∈ C, thenPxy = 0. Therefore

∑x∈C

1ExRxPxy is zero unless y ∈ C. So if y /∈ Spr we

have∑x∈S πxPxy = 0 = πy and if y ∈ Spr, then by Theorem 8.27,∑

x∈C

1

ExRxPxy = 1y∈C ·

1

EyRy.

Using this result in Eq. (8.26) shows that∑x∈S

πxPxy =∑

C: pos-rec.

Pν (HC <∞) 1y∈C ·1

EyRy= πy

so that π is an invariant distribution. Similarly, using Theorem 8.27 again,

∑x∈S

πx =∑

C: pos-rec.

Pν (HC <∞) ·

(∑x∈C

1

ExRx

)=

∑C: pos-rec.

Pν (HC <∞) .

The point is that for any communicating class C of S we have,∑x∈C

1

ExRx=

1 if C is positive recurrent0 otherwise.

Corollary 8.29. If C is a closed finite communicating class then C is positive recurrent. (Recall that we already know that C is recurrent by Corollary 8.22.)

Proof. For x, y ∈ C, let

    π_y := lim_{n→∞} (1/n) ∑_{m=1}^n P_x(X_m = y) = 1/(E_y R_y),   (8.27)

where the second equality is a consequence of Theorem 8.28. Since C is closed,

    ∑_{y∈C} P_x(X_n = y) = 1

and therefore,

    ∑_{y∈C} π_y = lim_{n→∞} (1/n) ∑_{y∈C} ∑_{m=1}^n P_x(X_m = y) = lim_{n→∞} (1/n) ∑_{m=1}^n ∑_{y∈C} P_x(X_m = y) = 1.

In particular it follows that we must have π_y > 0 for some y ∈ C, and hence for all y ∈ C by Theorem 8.27 with S replaced by C. So by Eq. (8.27) we conclude E_y R_y < ∞ for all y ∈ C, i.e. every y ∈ C is a positive recurrent state.

Remark 8.30 (The MCMC Seed). Suppose that S is a finite set, P = (p_{xy}) is an irreducible Markov kernel, ν is a probability on S, and f : S → ℂ is a function. Then

    E_π f := ∑_{x∈S} π(x) f(x) = lim_{n→∞} (1/n) ∑_{m=1}^n f(X_m),   P_ν – a.s.,

where π is the invariant distribution for P. Indeed,

    lim_{n→∞} (1/n) ∑_{m=1}^n f(X_m) = lim_{n→∞} (1/n) ∑_{m=1}^n ∑_{x∈S} f(x) 1_{X_m=x}
        = ∑_{x∈S} f(x) lim_{n→∞} (1/n) ∑_{m=1}^n 1_{X_m=x}
        = ∑_{x∈S} f(x) π(x),   P_ν – a.s.
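As an illustration of the remark, here is a minimal numerical sketch in Python; the 3-state matrix P and the test function f are arbitrary choices (not from the text). The time average of f(X_m) along one simulated path is compared with π(f), where π is recovered as the left eigenvector of P for the eigenvalue 1.

import numpy as np

rng = np.random.default_rng(0)

# An illustrative irreducible Markov matrix on S = {0, 1, 2}.
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
f = np.array([1.0, 2.0, 4.0])  # an arbitrary test function f : S -> R

# Time average (1/n) sum_{m=1}^n f(X_m) along one simulated trajectory.
n, x, total = 100_000, 0, 0.0
for _ in range(n):
    x = rng.choice(3, p=P[x])
    total += f[x]

# Invariant distribution: left eigenvector of P with eigenvalue 1.
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1))])
pi /= pi.sum()

print("time average:", total / n)   # the two numbers should nearly agree
print("pi(f)       :", pi @ f)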


8.3 Aperiodic chains

Definition 8.31. A state x ∈ S is aperiodic if P^n_{xx} > 0 for all n sufficiently large.

Lemma 8.32. If x ∈ S is aperiodic and y ←→ x, then y is aperiodic. So being aperiodic is a class property.

Proof. We have

    P^{n+m+k}_{yy} = ∑_{w,z∈S} P^n_{y,w} P^m_{w,z} P^k_{z,y} ≥ P^n_{y,x} P^m_{x,x} P^k_{x,y}.

Since y ←→ x, there exist n, k ∈ ℕ such that P^n_{y,x} > 0 and P^k_{x,y} > 0. Since P^m_{x,x} > 0 for all large m, it follows that P^{n+m+k}_{yy} > 0 for all large m and therefore y is aperiodic as well.

Lemma 8.33. A state x ∈ S is aperiodic iff 1 is the greatest common divisor of the set

    {n ∈ ℕ : P_x(X_n = x) = P^n_{xx} > 0}.

Proof. Use the number theory Lemma 8.44 below.

Theorem 8.34. If P is an irreducible, aperiodic, and recurrent Markov chain, then

    lim_{n→∞} P^n_{xy} = π_y = 1/(E_y R_y).   (8.28)

More generally, if C is a communication class which is assumed to be aperiodic if it is recurrent, then

    lim_{n→∞} P_ν(X_n = y) := lim_{n→∞} ∑_{x∈S} ν(x) P^n_{xy} = P_ν(H_C < ∞) · 1/(E_y R_y) for all y ∈ C.   (8.29)

Proof. We will only give the proof under the added assumption that P is positive recurrent, or equivalently that P has an invariant distribution.⁴ The proof uses the important idea of a coupling argument (we follow [14, Theorem 1.8.3] or Kallenberg [10, Chapter 8]). Here is the idea. Let {X_n} and {Y_n} be two independent Markov chains having P as their transition matrix. Then the chain (X_n, Y_n) ∈ S × S is again a Markov chain with transition matrix P ⊗ P.⁵

The aperiodicity and irreducibility assumption guarantees that P ⊗ P is still irreducible and aperiodic (but the aperiodicity is not needed).⁶ Moreover, if π is an invariant distribution for P then⁷ π ⊗ π is an invariant distribution for P ⊗ P, and so P ⊗ P is positive recurrent by Theorem 8.27. We now have a couple of choices for how to proceed. We might take H to be the hitting time of the chain (X_n, Y_n) of a fixed point (x, x) ∈ ∆ ⊂ S × S, or we can let H be the first hitting time of the diagonal itself. For either choice H < ∞ a.s. no matter the starting distributions, because the chain is positive recurrent. We then let

    Ỹ_n = { Y_n if n ≤ H, X_n if n > H. }

By the strong Markov property, Y and Ỹ have the same distribution. Thus if f : S → ℝ is a bounded function and µ, ν are two initial distributions on S we will have,

    E_µ f(X_n) − E_ν f(X_n) = E_µ f(X_n) − E_ν f(Y_n) = E_{µ⊗ν}[f(X_n) − f(Ỹ_n)]
        = E_{µ⊗ν}[f(X_n) − f(Ỹ_n) : n < H],

from which it follows that

    |E_µ f(X_n) − E_ν f(X_n)| ≤ 2‖f‖_∞ P(H > n) → 0 as n → ∞,   (8.30)

where ‖f‖_∞ = sup_{x∈S} |f(x)|. This inequality shows that the initial distribution plays no role in the limiting distribution of {X_n}_{n=0}^∞.

Taking µ = π to be the invariant distribution and using E_π f(X_n) = E_π f(X_0) = π(f), we find,

    |E_ν f(X_n) − π(f)| ≤ 2‖f‖_∞ P(H > n) → 0 as n → ∞,

which shows Law_{P_ν}(X_n) ⇒ π as n → ∞, and this gives Eq. (8.29). QED.

Alternative derivation of Eq. (8.30). In Nate Eldredge's notes, the inequality in Eq. (8.30) is derived a bit differently. He takes H to be the hitting time of (x, x) ∈ ∆ and then notes by the strong Markov property that

⁴ We still have to handle the null-recurrent case, which is going to be omitted in these notes; see Kallenberg [10, Chapter 8, Theorem 8.18, p. 152]. The key new ingredient not explained here may be found in [10, Lemma 8.21].
⁵ By definition the state space is now S × S and [P ⊗ P]_{(x,y),(x′,y′)} = P_{x,x′} P_{y,y′}.
⁶ If P is periodic, then P ⊗ P may be reducible. For example, if our chain transitions are 1 → 2 → 3 → 1 all with probability 1, then P ⊗ P is reducible: if the chain starts at (1, 1) it will only visit (2, 2) and (3, 3) and never (i, j) for i ≠ j.
⁷ Again by definition the state space is S × S and π ⊗ π : S × S → [0, 1] is defined by π ⊗ π(x, y) = π(x)π(y).

    E_µ f(X_n) = E_{µ⊗ν} f(X_n)
        = ∑_{k=0}^n E_{µ⊗ν}[f(X_n) | H = k] P(H = k) + E_{µ⊗ν}[f(X_n) : H > n]
        = ∑_{k=0}^n E_x[f(X_{n−k})] P(H = k) + E_{µ⊗ν}[f(X_n) : H > n].

A similar calculation shows

    E_ν f(X_n) = E_{µ⊗ν} f(Y_n) = ∑_{k=0}^n E_x[f(Y_{n−k})] P(H = k) + E_{µ⊗ν}[f(Y_n) : H > n]
        = ∑_{k=0}^n E_x[f(X_{n−k})] P(H = k) + E_{µ⊗ν}[f(Y_n) : H > n].

Then subtracting these two equations shows,

    E_µ f(X_n) − E_ν f(X_n) = E_{µ⊗ν}[f(X_n) : H > n] − E_{µ⊗ν}[f(Y_n) : H > n]
        = E_{µ⊗ν}[f(X_n) − f(Y_n) : H > n]
        = E_{µ⊗ν}[f(X_n) − f(Ỹ_n) : H > n],

and so again,

    |E_µ f(X_n) − E_ν f(X_n)| ≤ 2‖f‖_∞ P(H > n).

8.4 Some finite state space examples

Example 8.35 (Analyzing a non-irreducible Markov chain). In this example we are going to analyze the limiting behavior of the non-irreducible Markov chain determined by the Markov matrix (rows and columns indexed by the states 1, ..., 5),

        P = [ 0    1/2  0    0    1/2 ]
            [ 1/2  0    0    1/2  0   ]
            [ 0    0    1/2  1/2  0   ]
            [ 0    0    1/3  2/3  0   ]
            [ 0    0    0    0    1   ].

Here are the steps to follow.

1. Find the jump diagram for P. In our case it is given in Figure 8.2.

[Figure omitted.] Fig. 8.2. The jump diagram for P.

2. Identify the communication classes. In our example they are {1, 2}, {5}, and {3, 4}. The first is not closed and hence transient, while the second two are closed and finite sets and hence recurrent.
3. Find the invariant distributions for the recurrent classes. For {5} it is simply π′_5 = [1], and for {3, 4} we must find the invariant distribution for the 2 × 2 Markov matrix (rows and columns indexed by 3, 4),

    Q = [ 1/2  1/2 ]
        [ 1/3  2/3 ].

We do this in the usual way, namely

    Nul(I − Q^tr) = Nul([[1, 0], [0, 1]] − [[1/2, 1/3], [1/2, 2/3]]) = ℝ · [2, 3]^tr,

so that π′_{3,4} = (1/5)[2 3].
4. We can turn π′_{3,4} and π′_5 into invariant distributions for P by padding the row vectors with zeros to get

    π_{3,4} = [0  0  2/5  3/5  0] and π_5 = [0  0  0  0  1].

The general invariant distribution may then be written as

    π = απ_5 + βπ_{3,4} with α, β ≥ 0 and α + β = 1.

5. We can now work out lim_{n→∞} P^n. If we start at site x we are considering the x-th row of lim_{n→∞} P^n. If we start in the recurrent class {3, 4} we will simply get π_{3,4} for these rows, and if we start in the recurrent class {5} we will get π_5. However, if we start in the non-closed transient class {1, 2} we have

    first row of lim_{n→∞} P^n = P_1(X_n hits 5) π_5 + P_1(X_n hits {3, 4}) π_{3,4}   (8.31)

and

    second row of lim_{n→∞} P^n = P_2(X_n hits 5) π_5 + P_2(X_n hits {3, 4}) π_{3,4}.   (8.32)

Recall that {X_n hits 5} = {H_5 < ∞} and {X_n hits {3, 4}} = {H_{3,4} < ∞}.

6. Compute the required hitting probabilities. Let us now compute the required hitting probabilities by taking B = {3, 4, 5} and A = {1, 2}. Then we have,

    (P_x(X_{H_B} = y))_{x∈A, y∈B} = (I − P_{A×A})^{−1} P_{A×B}
        = ([[1, 0], [0, 1]] − [[0, 1/2], [1/2, 0]])^{−1} [[0, 0, 1/2], [0, 1/2, 0]]
        = [[4/3, 2/3], [2/3, 4/3]] [[0, 0, 1/2], [0, 1/2, 0]]
        = [[0, 1/3, 2/3], [0, 2/3, 1/3]].

From this we learn

    P_1(X_n hits 5) = 2/3,  P_2(X_n hits 5) = 1/3,
    P_1(X_n hits {3, 4}) = 1/3,  and  P_2(X_n hits {3, 4}) = 2/3.

7. Using these results in Eqs. (8.31) and (8.32) shows,

    first row of lim_{n→∞} P^n = (2/3) π_5 + (1/3) π_{3,4}
        = [0  0  2/15  1/5  2/3] = [0.0  0.0  0.13333  0.2  0.66667]

and

    second row of lim_{n→∞} P^n = (1/3) π_5 + (2/3) π_{3,4}
        = (1/3)[0  0  0  0  1] + (2/3)[0  0  2/5  3/5  0]
        = [0  0  4/15  2/5  1/3] = [0.0  0.0  0.26667  0.4  0.33333].

These answers already compare well with

    P^10 = [ 9.7656×10⁻⁴  0.0          0.13276  0.20024  0.66602 ]
           [ 0.0          9.7656×10⁻⁴  0.26626  0.39976  0.33301 ]
           [ 0.0          0.0          0.4      0.6      0.0     ]
           [ 0.0          0.0          0.4      0.6      0.0     ]
           [ 0.0          0.0          0.0      0.0      1.0     ].
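Steps 6 and 7 are easy to verify numerically. The following sketch forms (I − P_{A×A})^{−1} P_{A×B} and powers of P with numpy; the matrices are exactly those of this example (states are 0-indexed in the code).

import numpy as np

P = np.array([[0,   1/2, 0,   0,   1/2],
              [1/2, 0,   0,   1/2, 0  ],
              [0,   0,   1/2, 1/2, 0  ],
              [0,   0,   1/3, 2/3, 0  ],
              [0,   0,   0,   0,   1  ]])

A = [0, 1]      # the transient class {1, 2}, 0-indexed
B = [2, 3, 4]   # B = {3, 4, 5}, 0-indexed

# Hitting distribution (I - P_{AxA})^{-1} P_{AxB}.
hit = np.linalg.solve(np.eye(2) - P[np.ix_(A, A)], P[np.ix_(A, B)])
print(hit)  # rows [0, 1/3, 2/3] and [0, 2/3, 1/3], as in step 6

print(np.linalg.matrix_power(P, 10).round(5))   # compare with P^10 above
print(np.linalg.matrix_power(P, 200).round(5))  # essentially lim P^n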

8.5 Periodic Chain Considerations

Definition 8.36. For each x ∈ S, let d(x) be the greatest common divisor of {n ≥ 1 : P^n_{xx} > 0}, with the convention that d(x) = 0 if P^n_{xx} = 0 for all n ≥ 1. We refer to d(x) as the period of x. We say a site x is aperiodic if d(x) = 1.

Example 8.37. Each site of the fair random walk on S = ℤ has period 2, while for the fair random walk on {0, 1, 2, ...} with 0 being an absorbing state, each x ≥ 1 has period 2 while 0 has period 1, i.e. 0 is aperiodic.

Theorem 8.38. The period function is constant on each communication class of a Markov chain.

Proof. Let x, y ∈ C, a = d(x) and b = d(y). Now suppose that P^m_{xy} > 0 and P^n_{yx} > 0; then P^{m+n}_{x,x} ≥ P^m_{xy} P^n_{yx} > 0 and so a|(m + n). Further suppose that P^l_{y,y} > 0 for some l ∈ ℕ; then

    P^{m+n+l}_{x,x} ≥ P^m_{xy} P^l_{y,y} P^n_{yx} > 0

and therefore a|(m + n + l), which coupled with a|(m + n) implies a|l. We may therefore conclude that a ≤ b (in fact a|b) as b = gcd({l ∈ ℕ : P^l_{y,y} > 0}). Similarly we show that b ≤ a, and therefore b = a.

Lemma 8.39. If d(x) is the period of site x, then:

1. if m ∈ ℕ and P^m_{x,x} > 0, then d(x) divides m,
2. P^{n d(x)}_{x,x} > 0 for all n ∈ ℕ sufficiently large, and
3. x is aperiodic iff P^n_{x,x} > 0 for all n ∈ ℕ sufficiently large.

In summary, A_x := {m ∈ ℕ : P^m_{x,x} > 0} ⊂ d(x)ℕ and d(x)n ∈ A_x for all n ∈ ℕ sufficiently large.

Proof. Choose n_1, ..., n_k ∈ {n ≥ 1 : P^n_{xx} > 0} such that d(x) = gcd(n_1, ..., n_k). For part 1. we also know that d(x) = gcd(n_1, ..., n_k, m), and therefore d(x) divides m. For part 2., if m_1, ..., m_k ∈ ℕ we have,

    (P^{m_1 n_1 + ··· + m_k n_k})_{x,x} ≥ ∏_{l=1}^k [P^{n_l}_{x,x}]^{m_l} > 0.

This observation along with the number theoretic Lemma 8.44 below is enough to show P^{n d(x)}_{x,x} > 0 for all n ∈ ℕ sufficiently large. The third item is a special case of item 2.

Example 8.40. Suppose that P = [[0, 1], [1, 0]]. Then P^m = P if m is odd and P^m = [[1, 0], [0, 1]] if m is even. Therefore d(x) = 2 for x = 1, 2, and in this case P^{2n}_{x,x} = 1 > 0 for all n ∈ ℕ. However observe that P² is no longer irreducible – there are now two communication classes.

Example 8.41. Consider the Markov chain with jump diagram given in Figure 8.3. In this example, d(x) = 2 for all x and all states for P² are aperiodic.

[Figure omitted: P jump diagram and P² jump diagram.] Fig. 8.3. All arrows are assumed to have weight 1 unless otherwise specified. Notice that each state has period d = 2 and that P² is the transition matrix having two aperiodic communication classes.

However P² is no longer irreducible. This is an indication of what happens in general. In terms of matrices (rows and columns indexed by 1, ..., 6),

    P = [ 0    1    0    0    0    0 ]        P² = [ 0    0    1    0    0    0 ]
        [ 0    0    1    0    0    0 ]             [ 0    0    0    1    0    0 ]
        [ 0    0    0    1    0    0 ]             [ 1/2  0    0    0    1/2  0 ]
        [ 1/2  0    0    0    1/2  0 ]   and       [ 0    1/2  0    0    0    1/2 ]
        [ 0    0    0    0    0    1 ]             [ 1    0    0    0    0    0 ]
        [ 1    0    0    0    0    0 ]             [ 0    1    0    0    0    0 ].

Example 8.42. Consider the Markov chain with jump diagram given in Figure 8.4.

[Figure omitted: P jump diagram and P² jump diagram.] Fig. 8.4. Assume there are no implied jumps from a site back to itself, i.e. P_{i,i} = 0 for all i. This chain is then irreducible and has period 2.

Assume there are no implied jumps from a site back to itself, i.e. P_{x,x} = 0 for all x. This chain is then irreducible and has period 2. To calculate the period, notice that starting at y there is an obvious loop of length 4 and starting at x there is one of length 6. Therefore the period must divide both 4 and 6 and so must be either 2 or 1. The period is not 1, as one can only return to a site with an even number of jumps in this picture. If, on the other hand, there were any one vertex x with P_{x,x} > 0, then the period of the chain would have been one, i.e. the chain would have been aperiodic. Further notice that the jump diagram for P² is no longer irreducible: the red vertices and the blue vertices split apart. This has to happen as a consequence of Proposition 8.43 below.

Proposition 8.43. If P is the Markov matrix for a finite state irreducible aperiodic chain, then there exists n_0 ∈ ℕ such that P^n_{xy} > 0 for all x, y ∈ S and n ≥ n_0.


Proof. Let x, y ∈ S. By Lemma 8.39 with d(x) = 1 we know that P^m_{x,x} > 0 for all m large. As P is irreducible there exists a ∈ ℕ such that P^a_{xy} > 0, and therefore P^{a+m}_{x,y} ≥ P^m_{x,x} P^a_{x,y} > 0 for all m sufficiently large. This shows for all x, y ∈ S there exists n_{x,y} ∈ ℕ such that P^n_{xy} > 0 for all n ≥ n_{x,y}. Since there are only finitely many states, we may now take n_0 := max{n_{x,y} : x, y ∈ S} < ∞.

8.5.1 A number theoretic lemma

Lemma 8.44 (A number theory lemma). Suppose that 1 is the greatest common divisor of a set of positive integers, Γ := {n_1, ..., n_k}. Then there exists N ∈ ℕ such that the set,

    A = {m_1 n_1 + · · · + m_k n_k : m_i ≥ 0 for all i},

contains all n ∈ ℕ with n ≥ N. More generally, if q = gcd(Γ) (perhaps not 1), then A ⊂ qℕ and A contains all points qn for n sufficiently large.

Proof. First proof. The set I := {m_1 n_1 + · · · + m_k n_k : m_i ∈ ℤ for all i} is an ideal in ℤ, and as ℤ is a principal ideal domain there is a q ∈ I with q > 0 such that I = qℤ = {qm : m ∈ ℤ}. In fact q = min(I ∩ ℕ). Since q ∈ I we know that q = m_1 n_1 + · · · + m_k n_k for some m_i ∈ ℤ, and so if l is a common divisor of n_1, ..., n_k then l divides q. Moreover, as I = qℤ and n_i ∈ I for all i, we know that q|n_i as well. This shows that q = gcd(n_1, n_2, ..., n_k).

Now suppose that n ≥ n_1 + · · · + n_k is given and large (to be explained shortly). Then write n = l(n_1 + · · · + n_k) + r with l ∈ ℕ and 0 ≤ r < n_1 + · · · + n_k, and therefore,

    nq = ql(n_1 + · · · + n_k) + rq
       = ql(n_1 + · · · + n_k) + r(m_1 n_1 + · · · + m_k n_k)
       = (ql + r m_1) n_1 + · · · + (ql + r m_k) n_k,

where

    ql + r m_i ≥ ql − (n_1 + · · · + n_k)|m_i|,

which is greater than 0 for l, and hence n, sufficiently large.

Second proof. (The following proof is from Durrett [6].) We first will show that A contains two consecutive positive integers, a and a + 1. To prove this let

    k := min{|b − a| : a, b ∈ A with a ≠ b},

and choose a, b ∈ A with b = a + k. If k > 1, there exists n ∈ Γ ⊂ A such that k does not divide n. Let us write n = mk + r with m ≥ 0 and 1 ≤ r < k. It then follows that (m + 1)b and (m + 1)a + n are in A,

    (m + 1)b = (m + 1)(a + k) > (m + 1)a + mk + r = (m + 1)a + n,

and

    (m + 1)b − [(m + 1)a + n] = k − r < k.

This contradicts the definition of k, and therefore k = 1.

Let N = a². If n ≥ N, then n − a² = ma + r for some m ≥ 0 and 0 ≤ r < a. Therefore,

    n = a² + ma + r = (a + m)a + r = (a + m − r)a + r(a + 1) ∈ A.


9

*Proofs of Long Run Results

In proving the results above, we are going to make essential use of a strong form of the Markov property of Theorem 7.11.

9.1 Strong Markov Property Consequences

Let

    f^{(n)}_{xx} = P_x(R_x = n) = P_x(X_1 ≠ x, ..., X_{n−1} ≠ x, X_n = x)

and m_{xy} := E_x(M_y) – the expected number of visits to y after n = 0.

Proposition 9.1. Let x ∈ S and n ≥ 1. Then P^n_{xx} satisfies the "renewal equation,"

    P^n_{xx} = ∑_{m=1}^n P_x(R_x = m) P^{n−m}_{xx}.   (9.1)

Also if y ∈ S, K ∈ ℕ, and ν : S → [0, 1] is any probability on S, then Eq. (8.5) holds, i.e.

    P_ν(M_y ≥ K) = P_ν(H_y < ∞) · P_y(R_y < ∞)^{K−1}.   (9.2)

[Equation (9.2) was also proved in Proposition 8.15.]

Proof. To prove Eq. (9.1) we first observe for n ≥ 1 that {X_n = x} is the disjoint union of {X_n = x, R_x = m} for 1 ≤ m ≤ n, and therefore¹,

    P^n_{xx} = P_x(X_n = x) = ∑_{m=1}^n P_x(R_x = m, X_n = x)
        = ∑_{m=1}^n P_x(X_1 ≠ x, ..., X_{m−1} ≠ x, X_m = x, X_n = x)
        = ∑_{m=1}^n P_x(X_1 ≠ x, ..., X_{m−1} ≠ x, X_m = x) P^{n−m}_{xx}
        = ∑_{m=1}^n P^{n−m}_{xx} P_x(R_x = m).

For Eq. (9.2) we have {M_y ≥ 1} = {R_y < ∞}, so that P_x(M_y ≥ 1) = P_x(R_y < ∞). For K ≥ 2, since R_y < ∞ if M_y ≥ 1, we have

    P_x(M_y ≥ K) = P_x(M_y ≥ K | R_y < ∞) P_x(R_y < ∞).

Since, on {R_y < ∞}, X_{R_y} = y, it follows by the strong Markov property of Theorem 7.11 that

    P_x(M_y ≥ K | R_y < ∞) = P_x(M_y ≥ K | R_y < ∞, X_{R_y} = y)
        = P_x(1 + ∑_{n≥1} 1_{X_{R_y+n}=y} ≥ K | R_y < ∞, X_{R_y} = y)
        = P_y(1 + ∑_{n≥1} 1_{X_n=y} ≥ K)
        = P_y(M_y ≥ K − 1).

By the last two displayed equations,

    P_x(M_y ≥ K) = P_y(M_y ≥ K − 1) P_x(R_y < ∞).   (9.3)

Taking x = y in this equation shows,

    P_y(M_y ≥ K) = P_y(M_y ≥ K − 1) P_y(R_y < ∞),

and so by induction,

    P_y(M_y ≥ K) = P_y(R_y < ∞)^K.   (9.4)

Equation (9.2) now follows from Eqs. (9.3) and (9.4).

¹ Alternatively, we could use the Markov property to show,

    P^n_{ii} = P_i(X_n = i) = ∑_{k=1}^n E_i(1_{R_i=k} · 1_{X_n=i}) = ∑_{k=1}^n E_i(1_{R_i=k} · E_i 1_{X_{n−k}=i})
        = ∑_{k=1}^n E_i(1_{R_i=k}) E_i(1_{X_{n−k}=i}) = ∑_{k=1}^n P_i(R_i = k) P_i(X_{n−k} = i) = ∑_{k=1}^n P^{n−k}_{ii} P_i(R_i = k).


9.2 Irreducible Recurrent Chains

For this section we are going to assume that {X_n} is an irreducible and recurrent Markov chain. Let us now fix a state y ∈ S and define,

    τ_1 = H_y = min{n ≥ 0 : X_n = y},
    τ_2 = min{n ≥ 1 : X_{n+τ_1} = y},
    ...
    τ_N = min{n ≥ 1 : X_{n+τ_{N−1}} = y},

so that τ_N is the time it takes for the chain to visit y after the (N − 1)'st visit to y. By Corollary 8.20 we know that P_ν(τ_N < ∞) = 1 for all N ∈ ℕ and any starting distribution ν : S → [0, 1]. We will use the strong Markov property to prove the following key lemma in our development.

Lemma 9.2. We continue to use the notation above and in particular assume that {X_n}_{n=0}^∞ is an irreducible recurrent Markov chain. If ν : S → [0, 1] is any starting distribution, the random times {τ_N}_{N=1}^∞ above are independent, and moreover {τ_N}_{N=2}^∞ are identically distributed with P_ν(τ_N = n) = P_y(R_y = n) for all n ∈ ℕ and N ≥ 2.

Proof. This lemma follows rather immediately from Remark 8.14. Here is a slightly longer argument for those wishing for more detail.

Let T_0 = 0 and then define T_N inductively by T_{N+1} = inf{n > T_N : X_n = y}, so that T_N is the time of the N'th visit of {X_n}_{n=1}^∞ to site y. Observe that T_1 = τ_1,

    τ_{N+1}(X_0, X_1, ...) = τ_1(X_{T_N}, X_{T_N+1}, X_{T_N+2}, ...),

and (τ_1, ..., τ_N) is a function of (X_0, ..., X_{T_N}). Since P_ν(T_N < ∞) = 1 (Corollary 8.20) and X_{T_N} = y, we may apply the strong Markov property in the form of Theorem 7.11 to learn:

1. τ_{N+1} is independent of (X_0, ..., X_{T_N}) and hence τ_{N+1} is independent of (τ_1, ..., τ_N), and
2. the distribution of τ_{N+1} under P_ν is the same as the distribution of R_y under P_y.

The result now follows from these two observations and induction.

Theorem 9.3. Suppose that {X_n} is an irreducible recurrent Markov chain, and let y ∈ S be a fixed state. Define

    π_y = π(y) := 1/(E_y R_y),   (9.5)

with the understanding that π_y = 0 if E_y R_y = ∞. If ν : S → [0, 1] is any probability distribution on S, then

    lim_{n→∞} (1/n) ∑_{m=0}^n 1_{X_m=y} = π_y,   P_ν − a.s.,   (9.6)

and

    lim_{n→∞} (1/n) ∑_{m=0}^n P_ν(X_m = y) = π_y,   (9.7)

and in particular (take ν = δ_x)

    lim_{n→∞} (1/n) ∑_{m=0}^n P^m_{xy} = lim_{n→∞} (1/n) ∑_{m=0}^n P_x(X_m = y) = π_y   (9.8)

for all x ∈ S.

Proof. Let us first note that Eq. (9.7) follows by taking expectations of Eq. (9.6). So it only remains to prove Eq. (9.6).

By Corollary 8.20 we already know that 1 = P_ν(τ_1 < ∞) = P_ν(H_y < ∞), so that P_ν – a.s. the chain will hit y. By Remark 8.14 (or more formally by Lemma 9.2), the sequence {τ_N}_{N≥2} is i.i.d. relative to P_ν and E_ν τ_N = E_y R_y. We may now use the strong law of large numbers to conclude that

    lim_{N→∞} (τ_1 + τ_2 + · · · + τ_N)/N = E_ν τ_2 = E_y R_y   (P_ν – a.s.).   (9.9)

This may be expressed as follows: if T_N = τ_1 + τ_2 + · · · + τ_N is the time when the chain visits y for the N-th occurrence, then

    lim_{N→∞} T_N/N = E_y R_y   (P_ν – a.s.).   (9.10)

Given a time n ∈ ℕ, let

    N_n = ∑_{m=0}^n 1_{X_m=y} = (# visits to y up to time n).

We will make use of the following two simple observations.

• Since y is visited infinitely often, N_n → ∞ as n → ∞ and therefore lim_{n→∞} (N_n + 1)/N_n = 1.
• A little thought shows that the following arrangement of times must hold:

    0 ----- T_{N_n} -- n ----- T_{N_n+1} ----> time.

In more detail, since there were N_n visits to y in the first n steps, the time of the N_n-th visit to y must be less than or equal to n, i.e. T_{N_n} ≤ n. Similarly, the time T_{N_n+1} of the (N_n + 1)-st visit to y must be larger than n, i.e. n < T_{N_n+1}.

Using the last bullet point we have

    T_{N_n}/N_n ≤ n/N_n < T_{N_n+1}/N_n = (T_{N_n+1}/(N_n + 1)) · ((N_n + 1)/N_n).

Letting n → ∞ in these inequalities, making use of the "sandwich" theorem, the first bullet point, and Eq. (9.10) shows

    lim_{n→∞} n/N_n = E_y R_y,   P_ν − a.s.

Taking reciprocals of the last equation gives Eq. (9.6).

Remark 9.4 (Rates of Convergence). According to Proposition 2.18 above² or Corollary 13.59 below, roughly speaking we know that T_N/N = E_x R_x + O(1/√N) and therefore

    N_n/n ∼ (T_{N_n}/N_n)^{−1} = 1/(E_x R_x + O(1/√N_n)) = (1/(E_x R_x))(1 + O(1/√N_n)) = 1/(E_x R_x) + O(1/√n).

What we could formally prove is that for any 0 < α < 1/2 we have

    lim_{n→∞} n^α |N_n/n − 1/(E_x R_x)| = 0,   P_ν – a.s.

We will discuss rates of convergence more in Chapter 11.

Proposition 9.5. Suppose that {X_n} is an irreducible, recurrent Markov chain and let π_x = 1/(E_x R_x) for all x ∈ S as in Eq. (9.5). Then either π_x = 0 for all x ∈ S (in which case {X_n} is null recurrent) or π_x > 0 for all x ∈ S (in which case {X_n} is positive recurrent). Moreover, if π_x > 0 then

    ∑_{x∈S} π_x = 1 and   (9.11)

    [πP]_y := ∑_{x∈S} π_x P_{xy} = π_y for all y ∈ S.   (9.12)

That is, π = (π_x)_{x∈S} is the unique stationary distribution for P.

² Here we also use the results in 5.24 along with the first step analysis to conclude that E_x R_x^N < ∞ for all N ∈ ℕ.

Proof. Suppose that E_{x_0} R_{x_0} < ∞ for some x_0, i.e. π(x_0) > 0 for some x_0 ∈ S, and let

    α := ∑_{x∈S} π(x).

For any starting distribution ν : S → [0, 1], we have

    ∑_{x∈S} (1/n) ∑_{m=1}^n P_ν(X_m = x) = (1/n) ∑_{m=1}^n ∑_{x∈S} P_ν(X_m = x) = (1/n) ∑_{m=1}^n 1 = 1.

This equality along with Eq. (9.7) and Fatou's lemma gives

    0 < π(x_0) ≤ α = ∑_{x∈S} π(x) = ∑_{x∈S} lim_{n→∞} (1/n) ∑_{m=1}^n P_ν(X_m = x)
        ≤ lim inf_{n→∞} ∑_{x∈S} (1/n) ∑_{m=1}^n P_ν(X_m = x) = 1.

Using

    ∑_{x∈S} P_ν(X_m = x) P_{xy} = ∑_{x∈S} [νP^m]_x P_{xy} = [νP^{m+1}]_y = P_ν(X_{m+1} = y),

it is easy to conclude that

    ∑_{x∈S} (1/n) ∑_{m=1}^n P_ν(X_m = x) P_{xy} = (1/n) ∑_{m=1}^n P_ν(X_{m+1} = y) → π(y) as n → ∞.

So by another application of Fatou's lemma along with Eq. (9.7),

    ∑_{x∈S} π(x) P_{xy} = ∑_{x∈S} lim_{n→∞} (1/n) ∑_{m=1}^n P_ν(X_m = x) P_{xy}
        ≤ lim inf_{n→∞} (1/n) ∑_{m=1}^n P_ν(X_{m+1} = y) = π(y).

If ∑_{x∈S} π(x) P_{x,y_0} < π(y_0) for some y_0 ∈ S, we would find,

    α = ∑_{x∈S} π(x) = ∑_{x∈S} π(x) ∑_{y∈S} P_{xy} = ∑_{y∈S} ∑_{x∈S} π(x) P_{xy} < ∑_{y∈S} π(y) = α,

which is absurd, and so we must conclude that in fact,

    ∑_{x∈S} π(x) P_{x,y} = π(y) for all y ∈ S,

i.e. πP = π.

Taking ν := α^{−1}π (so ν is an invariant distribution on S) in Eq. (9.7) allows us to conclude that, for all y ∈ S,

    π(y) = lim_{n→∞} (1/n) ∑_{m=1}^n P_{α^{−1}π}(X_m = y) = lim_{n→∞} (1/n) ∑_{m=1}^n α^{−1}π(y) = α^{−1}π(y).

Using this equation with y = x_0 (where π(x_0) > 0) forces α = 1, and so π is indeed an invariant distribution for P.

The assertion that E_x R_x < ∞ (i.e. π(x) > 0) for all x ∈ S follows directly from Proposition 8.19. Alternatively, since x communicates with x_0 there exists m ∈ ℕ so that P^m_{x_0,x} > 0, and hence

    π(x) = [πP^m]_x = ∑_{y∈S} π(y) P^m_{y,x} ≥ π(x_0) P^m_{x_0,x} > 0.

Lastly, if ν is another invariant distribution for P, then from Eq. (9.7) we will have

    π(y) = lim_{n→∞} (1/n) ∑_{m=0}^n P_ν(X_m = y) = lim_{n→∞} (1/n) ∑_{m=0}^n ν(y) = lim_{n→∞} ((n + 1)/n) ν(y) = ν(y),

which shows that the invariant distribution is unique.

Remark 9.6. If we further assume that #(S) < ∞ in Proposition 9.5, then the previous proof becomes much simpler. For in this case we know by our hitting time estimates that E_x R_x < ∞ for all x ∈ S and

    ∑_{x∈S} π(x) = ∑_{x∈S} lim_{n→∞} (1/n) ∑_{m=1}^n P_ν(X_m = x) = lim_{n→∞} (1/n) ∑_{x∈S} ∑_{m=1}^n P_ν(X_m = x)
        = lim_{n→∞} (1/n) ∑_{m=1}^n ∑_{x∈S} P_ν(X_m = x) = lim_{n→∞} (1/n) ∑_{m=1}^n 1 = 1.

Similarly, in the previous proof we have directly that

    ∑_{x∈S} π(x) P_{xy} = ∑_{x∈S} lim_{n→∞} (1/n) ∑_{m=1}^n P_ν(X_m = x) P_{xy}
        = lim_{n→∞} ∑_{x∈S} (1/n) ∑_{m=1}^n P_ν(X_m = x) P_{xy}
        = lim_{n→∞} (1/n) ∑_{m=1}^n ∑_{x∈S} P_ν(X_m = x) P_{xy}
        = lim_{n→∞} (1/n) ∑_{m=1}^n P_ν(X_{m+1} = y) = π(y),

showing π is the invariant distribution for P.

Proposition 9.7. Suppose that P is an irreducible Markov kernel which admits a stationary distribution µ. Then P is positive recurrent and µ_y = π_y = 1/(E_y R_y) for all y ∈ S. In particular, an irreducible Markov kernel has at most one invariant distribution, and it has exactly one iff P is positive recurrent.

Proof. Taking ν = µ in Eq. (9.7) shows

    π(y) = lim_{n→∞} (1/n) ∑_{m=1}^n P_µ(X_m = y) = lim_{n→∞} (1/n) ∑_{m=1}^n µ(y) = µ(y),

and so µ(y) = π(y). In particular π is an invariant distribution, so E_x R_x = 1/π(x) < ∞ for some x ∈ S. The fact that P is positive recurrent now follows from Proposition 9.5.


10

Transience and Recurrence Examples

Let us begin by recalling some of the basic definitions and theorems involving recurrence and transience.

Definition 10.1. A state x ∈ S is:

1. transient if P_x(R_x < ∞) < 1 (⟺ P_x(R_x = ∞) > 0),
2. recurrent if P_x(R_x < ∞) = 1 (⟺ P_x(R_x = ∞) = 0),
   a) positive recurrent if 1/(E_x R_x) > 0, i.e. E_x R_x < ∞,
   b) null recurrent if it is recurrent (P_x(R_x < ∞) = 1) and 1/(E_x R_x) = 0, i.e. E_x R_x = ∞.

We let S_t, S_r, S_pr, and S_nr be the transient, recurrent, positive recurrent, and null recurrent states respectively.

The following theorem summarizes Theorem 8.18 and Corollary 8.20.

Theorem 10.2 (Recurrent States). Let y ∈ S. Then the following are equivalent:

1. y is recurrent, i.e. P_y(R_y < ∞) = 1,
2. P_y(X_n = y i.o. n) = 1,
3. E_y M_y = ∑_{n=0}^∞ P^n_{yy} = ∞.

Moreover, if C ⊂ S is a recurrent communication class, then P_x(∩_{y∈C}{X_n = y i.o. n}) = 1 for all x ∈ C. In words, if we start in C then every state in C is visited an infinite number of times.

The next result summarizes Propositions 9.5 and 9.7.

Proposition 10.3. Suppose that P is an irreducible Markov kernel, {X_n} is the corresponding Markov chain, and π_y = 1/(E_y R_y) for all y ∈ S.

1. If {X_n} is recurrent, then either π_x = 0 for all x ∈ S (in which case {X_n} is null recurrent) or π_x > 0 for all x ∈ S (in which case {X_n} is positive recurrent). Moreover, if π_x > 0 then π = (π_x)_{x∈S} is the unique stationary distribution for P.
2. If {X_n} admits a stationary distribution µ, then {X_n} (or equivalently P) is positive recurrent and µ_y = π_y = 1/(E_y R_y) for all y ∈ S.

In particular, an irreducible Markov kernel P has at most one invariant distribution, and it has exactly one iff P is positive recurrent.

The next theorem summarizes Theorem 8.21.

Theorem 10.4 (Transient States). Let y ∈ S. Then the following are equivalent:

1. y is transient, i.e. P_y(R_y < ∞) < 1,
2. P_y(X_n = y i.o. n) = 0, and
3. E_y M_y = ∑_{n=0}^∞ P^n_{yy} < ∞.

More generally, if ν : S → [0, 1] is any probability and y ∈ S is transient, then

    E_ν M_y = ∑_{n=0}^∞ P_ν(X_n = y) < ∞  ⟹  { lim_{n→∞} P_ν(X_n = y) = 0 and P_ν(X_n = y i.o. n) = 0. }   (10.1)

For the reader's convenience here is a combination of Theorems 8.18 and 8.21.

Theorem 10.5 (Recurrent/Transient States). Let y ∈ S. Then the following are equivalent:

1. y is recurrent (transient), i.e. P_y(R_y < ∞) = 1 (< 1),
2. P_y(X_n = y i.o. n) = 1 (= 0),
3. E_y M_y = ∑_{n=0}^∞ P^n_{yy} = ∞ (< ∞).

Moreover, if y is either transient or null-recurrent, then

    lim_{n→∞} P_ν(X_n = y) = lim_{n→∞} [νP^n]_y = 0.

10.1 Examples

Example 10.6. Let {X_n}_{n=0}^∞ denote the fair random walk on {0, 1, 2, ...} with 0 being an absorbing state, and H = H_0 the first hitting time of 0. The communication classes are now {0} and {1, 2, ...}, with the latter class not being closed and hence transient. Using Exercise 6.7 or Exercise 6.8, it follows that E_x H_0 = ∞ for all x > 0, which shows we cannot drop the assumption that #(C) < ∞ in the first statement of Proposition 8.9. Similarly, using the fair random walk example, we see that it is not possible to drop the condition that #(C) < ∞ for the equivalence statements as well.

The next examples show that if C ⊂ S is closed and #(C) = ∞, then C could be recurrent or it could be transient. Transient in this case means the chain goes off to "infinity," i.e. eventually leaves every finite subset of C never to return again. In Examples 10.7 – 10.9 below, let X = {X_n} be a (possibly biased) random walk on S = ℤ, R_y = inf{n > 0 : X_n = y} and H_y = inf{n ≥ 0 : X_n = y}.

Example 10.7. Let S = ℤ and X = {X_n} be the standard fair random walk on ℤ, i.e. P(X_{n+1} = x ± 1 | X_n = x) = 1/2. Then S itself is a closed class and every element of S is (null) recurrent. Indeed, using Exercise 6.5 or Exercise 6.6 and the first step analysis we know that

    P_0[R_0 = ∞] = (1/2)(P_0[R_0 = ∞ | X_1 = 1] + P_0[R_0 = ∞ | X_1 = −1])
        = (1/2)(P_1[H_0 = ∞] + P_{−1}[H_0 = ∞]) = (1/2)(0 + 0) = 0.

This shows 0 is recurrent. Similarly, using Exercise 6.7 or Exercise 6.8 and the first step analysis we find,

    E_0[R_0] = (1/2)(E_0[R_0 | X_1 = 1] + E_0[R_0 | X_1 = −1])
        = (1/2)(1 + E_1[H_0] + 1 + E_{−1}[H_0]) = (1/2)(∞ + ∞) = ∞,

and so 0 is null recurrent. As this chain is invariant under translation, it follows that every x ∈ ℤ is a null recurrent site.

Example 10.8. Let S = ℤ and X = {X_n} be a biased random walk on ℤ, i.e. P(X_{n+1} = x + 1 | X_n = x) = p and P(X_{n+1} = x − 1 | X_n = x) = q := 1 − p with p > 1/2. Then every site is now transient. Recall from Exercises 6.9 and 6.10 (see Eq. (6.27)) that

    P_x(H_0 < ∞) = { (q/p)^x if x ≥ 0, 1 if x < 0. }   (10.2)

Using these results and the first step analysis implies,

    P_0[R_0 = ∞] = p P_0[R_0 = ∞ | X_1 = 1] + q P_0[R_0 = ∞ | X_1 = −1]
        = p P_1[H_0 = ∞] + q P_{−1}[H_0 = ∞]
        = p[1 − (q/p)^1] + q(1 − 1)
        = p − q = 2p − 1 > 0.

Example 10.9. Again let S = ℤ and p ∈ (1/2, 1), and suppose that {X_n} is the random walk on ℤ described by the jump diagram in Figure 10.1. In this case, using the results of Exercise 6.11, we learn that

    E_0[R_0] = (1/2)(E_0[R_0 | X_1 = 1] + E_0[R_0 | X_1 = −1])
        = (1/2)(1 + E_1[H_0] + 1 + E_{−1}[H_0])
        = 1 + (1/2)(1/(p − q) + 1/(p − q)) = 1 + 1/(p − q) = 2p/(2p − 1) < ∞.

[Figure omitted: jump diagram of a walk on ℤ which, away from 0, steps toward 0 with probability p and away with probability q, and which steps to ±1 with probability 1/2 each from 0.] Fig. 10.1. A positively recurrent Markov chain.

This shows the site 0 is positively recurrent. Thus according to Proposition 8.6, every site in ℤ is positively recurrent. (Notice that E_0[R_0] → ∞ as p ↓ 1/2, i.e. as the chain becomes closer to the unbiased random walk of Example 10.7.)

Example 10.10. Let us revisit the fair random walk on ℤ described before Exercise 6.5. In this case P_0(X_n = 0) = 0 if n is odd and

    P_0(X_{2n} = 0) = (2n choose n)(1/2)^{2n} = ((2n)!/(n!)²)(1/2)^{2n}.   (10.3)

Making use of Stirling's formula, n! ∼ √(2π) n^{n+1/2} e^{−n}, we find,

    (1/2)^{2n} (2n)!/(n!)² ∼ (1/2)^{2n} (√(2π)(2n)^{2n+1/2} e^{−2n})/(2π n^{2n+1} e^{−2n}) = √(1/π) · (1/√n),   (10.4)

and therefore,

    ∑_{n=0}^∞ P_0(X_n = 0) = ∑_{n=0}^∞ P_0(X_{2n} = 0) ∼ 1 + ∑_{n=1}^∞ √(1/π)(1/√n) = ∞,

which shows again that this walk is recurrent. To now verify that this walk is null-recurrent, it suffices to show there is no invariant probability distribution π for this walk. Such an invariant measure must satisfy,

    π(x) = (1/2)[π(x + 1) + π(x − 1)],

which has general solution given by π(x) = A + Bx. In order for π(x) ≥ 0 for all x we must take B = 0, i.e. π is constant. As ∑_{x∈ℤ} π(x) = A · ∞, there is no way to normalize π to become a probability distribution and hence {X_n}_{n=0}^∞ is null-recurrent.
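As a quick numerical sanity check on the asymptotics in Eq. (10.4), one can compare the exact return probabilities with 1/√(πn); the sketch below does so for a few n (the particular values are arbitrary).

import math

for n in (10, 100, 1000, 10000):
    exact = math.comb(2 * n, n) / 4**n      # P_0(X_{2n} = 0), exactly
    stirling = 1 / math.sqrt(math.pi * n)   # the approximation from (10.4)
    print(n, exact, stirling, exact / stirling)  # ratio tends to 1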

Fact 10.11 Simple random walk in ℤ^d is recurrent if d = 1 or 2 and is transient if d ≥ 3. [For an informal proof of this fact, see page 49 of Lawler. For a formal proof see Section 10.2 below or Todd Kemp's lecture notes.]

Example 10.12. The above method may easily be modified to show that the biased random walk on ℤ (see Exercise 6.9) is transient. In this case 1/2 < p < 1 and

    P_0(X_{2n} = 0) = (2n choose n) p^n (1 − p)^n = (2n choose n)[p(1 − p)]^n.

Since p(1 − p) has a maximum at p = 1/2 of 1/4, we have ρ_p := 4p(1 − p) < 1 for 1/2 < p < 1. Therefore,

    P_0(X_{2n} = 0) = (2n choose n)[ρ_p/4]^n = (2n choose n)(1/2)^{2n} ρ_p^n ∼ √(1/π)(1/√n) ρ_p^n.

Hence

    ∑_{n=0}^∞ P_0(X_n = 0) = ∑_{n=0}^∞ P_0(X_{2n} = 0) ∼ 1 + ∑_{n=1}^∞ (1/√n) ρ_p^n ≤ 1 + 1/(1 − ρ_p) < ∞,

which again shows the biased random walk is transient.

Exercise 10.1. Let {X_n}_{n=0}^∞ be the fair random walk on ℤ (as in Exercise 6.5) starting at 0 and let

    A_N := E[∑_{k=0}^{2N} 1_{X_k=0}]

denote the expected number of visits to 0. Use Stirling's formula and integral approximations for ∑_{n=1}^N 1/√n to argue that A_N ∼ c√N for some constant c > 0.

10.2 *Transience and Recurrence for R.W.s by Fourier Series Methods

In the next result we will give another way to compute (or at least estimate) E_x M_y for random walks and thereby determine if the walk is transient or recurrent.

Theorem 10.13. Let {ξ_i}_{i=1}^∞ be i.i.d. random vectors with values in ℤ^d, and for x ∈ ℤ^d let X_n := x + ξ_1 + · · · + ξ_n for all n ≥ 1 with X_0 = x. As usual let M_z := ∑_{n=0}^∞ 1_{X_n=z} denote the number of visits to z ∈ ℤ^d. Then

    E M_z = lim_{α↑1} (1/(2π))^d ∫_{[−π,π]^d} e^{i(x−z)·θ}/(1 − α E[e^{iξ_1·θ}]) dθ,

where dθ = dθ_1 ... dθ_d, and in particular,

    E M_x = lim_{α↑1} (1/(2π))^d ∫_{[−π,π]^d} 1/(1 − α E[e^{iξ_1·θ}]) dθ.   (10.5)

Proof. For 0 < α ≤ 1 let

    M_y^{(α)} := ∑_{n=0}^∞ α^n 1_{X_n=y},

so that M_y^{(1)} = M_y. Given any θ ∈ [−π, π]^d we have,

    ∑_{y∈ℤ^d} E M_y^{(α)} e^{iy·θ} = E ∑_{y∈ℤ^d} M_y^{(α)} e^{iy·θ} = E ∑_{y∈ℤ^d} ∑_{n=0}^∞ α^n 1_{X_n=y} e^{iy·θ}
        = E ∑_{n=0}^∞ α^n ∑_{y∈ℤ^d} 1_{X_n=y} e^{iy·θ} = E ∑_{n=0}^∞ α^n e^{iθ·X_n} = ∑_{n=0}^∞ α^n E[e^{iθ·X_n}],

where

    E[e^{iθ·X_n}] = e^{iθ·x} E[e^{iθ·(ξ_1+···+ξ_n)}] = e^{iθ·x} E ∏_{j=1}^n e^{iθ·ξ_j} = e^{iθ·x} ∏_{j=1}^n E e^{iθ·ξ_j} = e^{iθ·x} · (E[e^{iξ_1·θ}])^n.

Combining the last two equations shows,

    ∑_{y∈ℤ^d} E M_y^{(α)} e^{iy·θ} = e^{iθ·x}/(1 − α E[e^{iξ_1·θ}]).

Multiplying this equation by e^{−iz·θ} for some z ∈ ℤ^d we find, using the orthogonality of {e^{iy·θ}}_{y∈ℤ^d}, that

    E M_z^{(α)} = (1/(2π))^d ∫_{[−π,π]^d} ∑_{y∈ℤ^d} E M_y^{(α)} e^{iy·θ} e^{−iz·θ} dθ
        = (1/(2π))^d ∫_{[−π,π]^d} e^{iθ·(x−z)}/(1 − α E[e^{iξ_1·θ}]) dθ.

Since M_z^{(α)} ↑ M_z as α ↑ 1, the result is now a consequence of the monotone convergence theorem.

Example 10.14. Suppose that P(ξ_i = 1) = p and P(ξ_i = −1) = q := 1 − p, and x = 0. Then

    E[e^{iξ_1·θ}] = p e^{iθ} + q e^{−iθ} = p(cos θ + i sin θ) + q(cos θ − i sin θ) = cos θ + i(p − q) sin θ.

Therefore according to Eq. (10.5) we have,

    E_0 M_0 = lim_{α↑1} (1/(2π)) ∫_{−π}^π 1/(1 − α(cos θ + i(p − q) sin θ)) dθ.   (10.6)

(We could compute the integral in Eq. (10.6) exactly using complex contour integral methods, but I will not do this here.)

The integrand in Eq. (10.6) may be written as,

    (1 − α cos θ + iα(p − q) sin θ)/((1 − α cos θ)² + α²(p − q)² sin² θ).

As sin θ is odd while the denominator is even, we find,

    E_0 M_0 = lim_{α↑1} (1/(2π)) ∫_{−π}^π (1 − α cos θ)/((1 − α cos θ)² + α²(p − q)² sin² θ) dθ.   (10.7)

a) Let us first suppose that p = 1/2 = q, in which case the above equation reduces to

    E_0 M_0 = lim_{α↑1} (1/(2π)) ∫_{−π}^π 1/(1 − α cos θ) dθ = (1/(2π)) ∫_{−π}^π 1/(1 − cos θ) dθ,

wherein we have used the MCT (for θ ∼ 0) and DCT (for θ away from 0) to justify passing the limit inside of the integral. Since 1 − cos θ ≈ θ²/2 for θ near zero and ∫_{−ε}^ε (1/θ²) dθ = ∞, it follows that E_0 M_0 = ∞ and the fair random walk on ℤ is recurrent.

b) Now suppose that p ≠ 1/2 and let us write α := 1 − ε for some ε which we will eventually let tend down to zero. With this notation the integrand f_α(θ) in Eq. (10.7) satisfies,

    f_α(θ) = (1 − cos θ + ε cos θ)/((1 − cos θ + ε cos θ)² + (1 − ε)²(p − q)² sin² θ)
        = (1 − cos θ)/((1 − cos θ + ε cos θ)² + (1 − ε)²(p − q)² sin² θ)
          + (ε cos θ)/((1 − cos θ + ε cos θ)² + (1 − ε)²(p − q)² sin² θ)
        ≤ (1 − cos θ)/((1 − ε)²(p − q)² sin² θ) + (ε cos θ)/(ε² cos² θ + (1 − ε)²(p − q)² sin² θ).

The first term is bounded in θ and ε because

    lim_{θ↓0} (1 − cos θ)/sin² θ = 1/2,

and therefore only makes a finite contribution to the integral. Integrating the second term near zero and making the change of variables u = sin θ (so du = cos θ dθ) shows,

    ∫_{−δ}^δ (ε cos θ)/(ε² cos² θ + (1 − ε)²(p − q)² sin² θ) dθ = ∫_{−sin δ}^{sin δ} ε/(ε²(1 − u²) + (1 − ε)²(p − q)² u²) du
        ≤ ∫_{−sin δ}^{sin δ} ε/(ε² (1/2) + (1/2)(p − q)² u²) du = 2 ∫_{−sin δ}^{sin δ} ε/(ε² + (p − q)² u²) du,

provided δ is sufficiently small but fixed and ε is small. Lastly, we make the change of variables u = εx/|p − q| in order to find

    ∫_{−δ}^δ (ε cos θ)/(ε² cos² θ + (1 − ε)²(p − q)² sin² θ) dθ ≤ (4/|p − q|) ∫_0^{sin(δ)|p−q|/ε} 1/(1 + x²) dx ↑ 2π/|p − q| < ∞ as ε ↓ 0.

Combining these estimates shows the limit in Eq. (10.7) is finite, so the random walk is transient when p ≠ 1/2.

Example 10.15 (Unbiased R.W. in ℤ^d). Now suppose that P(ξ_i = ±e_j) = 1/(2d) for j = 1, 2, ..., d and X_n = ξ_1 + · · · + ξ_n. In this case,

    E[e^{iθ·ξ_j}] = (1/d)[cos(θ_1) + · · · + cos(θ_d)],

and so according to Eq. (10.5) we find (as before)

    E M_0 = lim_{α↑1} (1/(2π))^d ∫_{[−π,π]^d} 1/(1 − α(1/d)[cos(θ_1) + · · · + cos(θ_d)]) dθ
        = (1/(2π))^d ∫_{[−π,π]^d} 1/(1 − (1/d)[cos(θ_1) + · · · + cos(θ_d)]) dθ   (by MCT and DCT).

Again the integrand is singular near θ = 0, where

    1 − (1/d)[cos(θ_1) + · · · + cos(θ_d)] ≅ 1 − (1/d)[d − (1/2)‖θ‖²] = (1/(2d))‖θ‖².

Hence it follows that E M_0 < ∞ iff ∫_{‖θ‖≤R} (1/‖θ‖²) dθ < ∞ for R < ∞. The last integral is well known to be finite iff d ≥ 3, as can be seen by computing in polar coordinates. For example, when d = 2 we have

    ∫_{‖θ‖≤R} (1/‖θ‖²) dθ = 2π ∫_0^R (1/r²) r dr = 2π ln r|_0^R = 2π(ln R − ln 0) = ∞,

while when d = 3,

    ∫_{‖θ‖≤R} (1/‖θ‖²) dθ = 4π ∫_0^R (1/r²) r² dr = 4πR < ∞.

In this way we have shown that the unbiased random walk in ℤ and ℤ² is recurrent while it is transient in ℤ^d for d ≥ 3.
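For illustration, one can approximate the integral in Eq. (10.5) by a crude Riemann sum at fixed α < 1 and watch what happens as α ↑ 1; the grid size and the α values in the sketch below are arbitrary choices. The d = 1, 2 approximations blow up while the d = 3 values level off, matching the recurrence/transience dichotomy.

import numpy as np

def green_at_zero(d, alpha, m=101):
    """Riemann-sum approximation of (2*pi)^{-d} * integral of
    1 / (1 - alpha*phi(theta)) over [-pi, pi]^d, where
    phi(theta) = (1/d) * sum_j cos(theta_j) (unbiased walk on Z^d)."""
    theta = np.linspace(-np.pi, np.pi, m, endpoint=False)
    grids = np.meshgrid(*([theta] * d), indexing="ij")
    phi = sum(np.cos(g) for g in grids) / d
    # mean over the uniform grid equals integral / (2*pi)^d
    return np.mean(1.0 / (1.0 - alpha * phi))

for d in (1, 2, 3):
    print(d, [round(green_at_zero(d, a), 2) for a in (0.9, 0.99, 0.999)])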


11

Detail Balance and MCMC

In this chapter, let π : S → (0, 1) be a positive probability on a (large and complicated) finite set S. One important application of Markov chains is sampling (approximately) from such probability distributions even when π itself is computationally intractable. In order to "sample" from π we first try to find an irreducible Markov matrix P on S such that πP = π. Then from Theorem 8.27 in the form of Remark 8.30, if {X_n}_{n=0}^∞ is the Markov chain associated to P we have,

    π(f) := ∑_{x∈S} f(x)π(x) = lim_{n→∞} (1/n) ∑_{m=1}^n f(X_m),   P_ν – a.s.

Thus we expect that if we simulate {X_n}_{n=0}^∞, we will have,

    π(f) ≅ (1/n) ∑_{m=1}^n f(X_m) for large n.

[Using Remark 9.4, the error in the above approximation is roughly C/√n, where C = C(ω) is a random constant which is likely to be impossible to estimate. We will come back to error estimates in the L² – sense below in Corollary 11.38.] The key to implementing this method is to construct P in such a way that the associated chain is relatively easy to simulate.

Remark 11.1 (A useless choice for P). Suppose π : S → (0, 1) is a distribution we would like to sample as the stationary distribution of a Markov chain. One way to construct such a chain would be to make the Markov matrix P with all rows given by π. In this case we will have νP = π, and so P_ν(X_1 = x) = π(x) for all x ∈ S. Thus for this chain we would reach the invariant distribution in one step. However, to simulate this chain is equivalent to sampling from π, which is what we were trying to figure out how to do in the first place! We need a simpler Markov matrix P with πP = π.

11.1 Detail Balance

The next lemma gives a useful criterion on a Markov matrix P so that π is an invariant (= stationary) distribution of P.

Lemma 11.2 (Detail balance). In general, if we can find a distribution π satisfying the detail balance equation,

    π_i P_{ij} = π_j P_{ji} for all i ≠ j,   (11.1)

then π is a stationary (= invariant) distribution, i.e. πP = π.

Proof. First proof. Intuitively, Eq. (11.1) states that sites i and j are always exchanging sand back and forth at equal rates. Hence if all sites are doing this, the size of the piles of sand at each site must remain unchanged.

Second proof. Summing Eq. (11.1) on i, making use of the fact that ∑_i P_{ji} = 1 for all j, implies ∑_i π_i P_{ij} = π_j.
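Numerically this is easy to sanity-check: build a pair (π, P) in detail balance and verify πP = π. The symmetric-weight construction in the sketch below is just one convenient way to produce such a pair (it is not from the text).

import numpy as np

rng = np.random.default_rng(42)

# A random reversible pair (pi, P): take symmetric weights w_ij and set
# P_ij = w_ij / w_i., pi_i = w_i. / w..; then pi_i P_ij = w_ij / w.. is symmetric.
w = rng.random((5, 5))
w = (w + w.T) / 2                        # symmetric weights
P = w / w.sum(axis=1, keepdims=True)
pi = w.sum(axis=1) / w.sum()

assert np.allclose(pi[:, None] * P, (pi[:, None] * P).T)  # detail balance
print(np.allclose(pi @ P, pi))           # stationarity pi P = pi -> True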

Example 11.3 (Simple Random Walks on Graphs). Let S be a finite set of nodes and G an undirected graph on S, i.e. G is a subset of S × S such that

1. (x, x) ∉ G for all x ∈ S,
2. if (x, y) ∈ G, then (y, x) ∈ G [the graph is undirected], and
3. for all x ∈ S, the set S_x := {y ∈ S : (x, y) ∈ G} is not empty. [We are not allowing for any isolated nodes in our graph.]

Let us write x ∼ y to mean (x, y) ∈ G, and let

    ν(x) := #(S_x) = ∑_{y∈S} 1_{(x,y)∈G}

be the valence of G at x. The random walk on this graph is then the Markov chain on S with Markov transition matrix,

    p(x, y) := (1/ν(x)) 1_{S_x}(y) = (1/ν(x)) 1_{(x,y)∈G}.

Notice that

    ν(x) p(x, y) = 1_{x∼y} = ν(y) p(y, x),

so that (ν, P) satisfy the detail balance equation. Therefore if

    Z := ∑_{x∈S} ν(x) and π(x) := (1/Z) ν(x),

then π is an invariant distribution for P.


Remark 11.4. If (P, π) are in detail balance and π : S → (0, 1) is strictly positive, then P_{xy} ≠ 0 iff P_{yx} ≠ 0. Thus,

    G := {(x, y) ∈ S × S : x ≠ y and P_{xy} ≠ 0}

defines an undirected graph on S. Moreover, it is easy to check that P is irreducible iff G is connected.

Example 11.5. Capitalizing on Example 11.3, consider the following problem. A knight starts at a corner of a standard 8 × 8 chessboard, and on each step moves at random. How long, on average, does it take to return to its starting position?

    · ∗ · ∗ · ∗ · ∗
    ∗ · ∗ · ∗ · ∗ ·
    · ∗ · ∗ · ∗ · ∗
    ∗ · ∗ · ∗ · ∗ ·
    · ∗ · ∗ · ∗ · ∗
    ∗ · ∗ · ∗ · ∗ ·
    · ∗ · ∗ · ∗ · ∗
    ∗ · ∗ · ∗ · ∗ ·

If i is a corner, the question is what is E_i[R_i], where the Markov process in question is the one described in the problem: it is a simple random walk (SRW) on a graph whose vertices are the 64 squares of the chessboard, and two squares are connected in the graph if a knight can move from one to the other. (Note: a knight's move is 1 up/down and 2 left/right, or 2 up/down and 1 left/right.¹ So choosing uniformly among moves or among positions amounts to the same thing.) This graph is connected (as a little thought will show), and so the SRW is irreducible; therefore there is a unique stationary distribution. By Example 11.3, the stationary distribution π is given by π(i) = ν_i/∑_j ν_j, and so by the Ergodic Theorem 8.27,

    E_i[R_i] = 1/π(i) = (1/ν_i) ∑_j ν_j.

A knight moves 2 units in one direction and 1 unit in the other direction. Starting at a corner (1, 1), the knight can only move to the positions (3, 2) and (2, 3), so ν_i = 2. To solve the problem, we need to calculate ν_j for all starting positions j on the board. By square symmetry, we need only calculate the numbers for the upper 4 × 4 grid. An easy calculation gives these valences as

    2 3 4 4
    3 4 6 6
    4 6 8 8
    4 6 8 8

¹ A little experimentation will show that this graph is connected, so the random walk is irreducible.

The sum of all the valences of the 4 × 4 grid is 84, and so the sum over the full chessboard is 4 · 84 = 336. Thus, the expected number of steps for the knight to return to the corner is

    E_i[R_i] = (1/ν_i) ∑_j ν_j = (1/2) · 336 = 168.
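The valence table and the answer 168 are easy to reproduce by brute force; here is a small sketch (board coordinates are 0-based).

moves = [(1, 2), (2, 1), (-1, 2), (-2, 1), (1, -2), (2, -1), (-1, -2), (-2, -1)]

def valence(i, j, n=8):
    """Number of legal knight moves from square (i, j) on an n x n board."""
    return sum(0 <= i + di < n and 0 <= j + dj < n for di, dj in moves)

total = sum(valence(i, j) for i in range(8) for j in range(8))
print(total)                    # 336, the sum of all valences
print(total / valence(0, 0))    # E_corner[R] = 336 / 2 = 168.0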

11.2 Some uniform measure MCMC examples

Let us continue the notation in Example 11.3, i.e. suppose S is a finite set and G is a connected unoriented graph on S with no isolated points. The simple random walk on G would allow us to sample from the distribution π(x) = ν_x/∑_{y∈S} ν_y. This is already useful, as finding Z := ∑_{y∈S} ν_y may be intractable, and moreover for large S, Z will be very large, causing problems of its own. In this section we are going to be interested in sampling the uniform distribution, π(x) = Z^{−1} on S, where now Z := #(S). The difficulty is going to be that the set S is typically very large and complicated, so that Z := #(S) is large and perhaps not even known how to compute. The next example shows how one might sample the uniform distribution in this setting.

Example 11.6 (Uniform Samplers). Keeping the notation above, if ν_x = K is constant for all x ∈ S, then the SRW has the uniform distribution as its stationary distribution. What if the valences are not constant but we at least know that there exists K < ∞ such that ν_x ≤ K for all x ∈ S? In this case we can take

    p(x, y) := (1/K) 1_{⟨x,y⟩∈G} + c(x) 1_{y=x},

where c(x) is chosen so that

    1 = ∑_{y∈S} p(x, y) = (1/K) ν(x) + c(x)  ⟹  c(x) := 1 − (1/K) ν(x) ≥ 0.

Then clearly p(x, y) is a symmetric irreducible Markov matrix, and therefore the invariant distribution of P is π(x) = 1/#(S). Thus if {X_n}_{n=0}^∞ is the Markov chain associated to P, we will have

    (1/|S|) ∑_{x∈S} f(x) = lim_{n→∞} (1/n) ∑_{m=1}^n f(X_m),   P_ν – a.s.,

for any starting distribution ν.


Example 11.7. Suppose that S is the set of D × D matrices with entries in {0, 1}, so that #(S) = 2^{D²}; for D = 25 this is 2^{625} ≅ 1.4 × 10^{188}! We say that ⟨A, B⟩ ∈ G if A, B ∈ S and A and B differ at exactly one location, i.e. (A + B) mod 2 is a matrix in S with exactly one non-zero entry. In this case ν_A = D², and the simple random walk on S will sample from the uniform distribution. The associated Markov matrix is

    p(A, B) = (1/D²) 1_{A∼B}.

It is relatively easy to sample from this chain. To describe how to do this, for i, j ∈ {1, ..., D}, let E(i, j) ∈ S denote the matrix with 1 at the (i, j)-th position and zeros at all other positions. To simulate the associated Markov chain {X_n}_{n=0}^∞: at each n, choose i, j ∈ {1, ..., D} uniformly at random so that i is independent of j and both are independent of the previous choices made. Then if X_n = A ∈ S we take X_{n+1} = (A + E(i, j)) mod 2.

The next example gives a variant of this example.

Example 11.8 (HCC). Let us continue the notation in Example 11.7 and further let S_HC denote those A ∈ S such that there are no adjacent 1's in the matrix A. To be more precise, we are requiring for all i, j ∈ {1, ..., D} that

    A_{ij} · (A_{i−1,j} + A_{i+1,j} + A_{i,j−1} + A_{i,j+1}) = 0,

where by convention A_{i,j} = 0 if either i or j is in {0, D + 1}. We refer to S_HC as the Hard Core Configuration space. This is a simple model of the configuration of an ideal gas (the 1s represent gas molecules, the 0s empty space). We would like to pick a random hard-core configuration – meaning sample from the uniform distribution on S_HC, i.e. each A ∈ S_HC is to be chosen with weight 1/|S_HC|. The problem is that |S_HC| is unknown, even to exponential order, for large D. A simple upper bound is 2^{D²} = #(S). It is conjectured that |S_HC| ∼ β^{D²} for some β ∈ (1, 2), but no one even has a good guess as to what β is. Nevertheless, we can use the Markov chain associated to the Markov matrix,

    p_HC(A, B) = (1/D²) 1_{A∼B} + c(A) 1_{A=B},

in order to sample from this distribution. Proposition 11.10 below will justify the following algorithm for sampling from this chain.

Sampling Algorithm. Suppose we have generated (X_0, X_1, ..., X_n) and X_n = A ∈ S_HC. To construct X_{n+1}:

1. Choose i, j ∈ {1, ..., D} uniformly at random so that i is independent of j and both are independent of the previous choices made.
2. Let B := (A + E(i, j)) mod 2 and set

    X_{n+1} = { B if B ∈ S_HC, A otherwise. }

Note that B ∈ S_HC iff

    B_{ij} · (B_{i−1,j} + B_{i+1,j} + B_{i,j−1} + B_{i,j+1}) = 0,

where again B_{kl} = 0 if either k or l is in {0, D + 1}.
3. Loop on n.

You are asked to implement the algorithm in Example 11.8 for D = 50 in one of your homework problems, and to use your simulations to estimate the probability that A_{25,25} = 1 when A is chosen uniformly at random from S_HC.
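For reference, here is a minimal sketch of the sampling algorithm above (it is essentially the homework problem, so treat it as a template rather than a solution; the run length and burn-in fraction are arbitrary choices, and indices are 0-based, so the entry A_{25,25} of the text appears as A[24, 24] below).

import numpy as np

rng = np.random.default_rng(0)

def ok_to_flip(A, i, j):
    """Is the configuration still hard-core after flipping entry (i, j)?
    Removing a 1 is always fine; adding a 1 needs all four neighbors to be 0."""
    if A[i, j] == 1:
        return True
    D = A.shape[0]
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = i + di, j + dj
        if 0 <= r < D and 0 <= c < D and A[r, c] == 1:
            return False
    return True

D, steps = 50, 500_000
A = np.zeros((D, D), dtype=int)   # the empty configuration lies in S_HC
hits = samples = 0
for n in range(steps):
    i, j = rng.integers(D), rng.integers(D)
    if ok_to_flip(A, i, j):
        A[i, j] ^= 1              # X_{n+1} = (A + E(i, j)) mod 2
    if n >= steps // 10:          # crude burn-in before collecting statistics
        samples += 1
        hits += A[24, 24]         # the text's A_{25,25}, 0-indexed

print(hits / samples)             # estimate of P(A_{25,25} = 1)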

Lemma 11.9 (Conditional detail balance). Suppose that (π, P) satisfies detail balance and S̃ is a subset of S. Let

    π̃(x) = π(x | S̃) = π(x)/∑_{y∈S̃} π(y) for x ∈ S̃

be the conditional distribution of π given S̃. For x, y ∈ S̃ let,

    p̃(x, y) := { p(x, y) if x ≠ y, c(x) if x = y, }

where

    c(x) := 1 − ∑_{y∈S̃\{x}} p(x, y).

Then P̃ is a Markov matrix on S̃ and (π̃, P̃) are in detail balance. In particular, π̃ is an invariant distribution for P̃.

Proof. First off,

    ∑_{y∈S̃} p̃(x, y) = ∑_{y∈S̃\{x}} p(x, y) + (1 − ∑_{y∈S̃\{x}} p(x, y)) = 1,

so that p̃ is a Markov matrix. Moreover, if we let Z := ∑_{y∈S̃} π(y), then for x ≠ y in S̃ we have,

    π̃(x) p̃(x, y) = (1/Z) π(x) p(x, y) = (1/Z) π(y) p(y, x) = π̃(y) p̃(y, x),

which proves the detail balance condition.


Proposition 11.10 (Conditional sampling). Let us continue the notation in Lemma 11.9 and further suppose that {f_n : S → S}_{n=0}^∞ are random functions as in Theorem 5.11, so that X_{n+1} = f_n(X_n) generates the Markov chain associated to P. If we let f̃_n : S̃ → S̃ be defined by

    f̃_n(x) = { f_n(x) if f_n(x) ∈ S̃, x otherwise, }

then {f̃_n}_{n=0}^∞ are independent random functions which generate (in the sense that X̃_{n+1} = f̃_n(X̃_n)) the Markov chain {X̃_n}_{n=0}^∞ associated to P̃.

Proof. We need only note that if x, y ∈ S̃ with x ≠ y, then

    P(f̃_n(x) = y) = P(f_n(x) = y, f_n(x) ∈ S̃) + P(x = y, f_n(x) ∉ S̃)
        = P(f_n(x) = y) + 0 = p(x, y) = p̃(x, y).

It then follows that

    P(f̃_n(x) = x) = 1 − P(f̃_n(x) ∈ S̃ \ {x}) = 1 − ∑_{y∈S̃\{x}} P(f̃_n(x) = y)
        = 1 − ∑_{y∈S̃\{x}} p(x, y) = c(x).

Example 11.11. Let T be a finite set and S the set of functions s : T → {0, 1}, so that #(S) = 2^{#(T)}. We say (s, s′) ∈ G iff s and s′ differ at exactly one location t ∈ T, i.e. ∑_t |s(t) − s′(t)| = 1. (We abbreviate this condition by writing s ∼ s′.) In this case ν_s = #(T). To construct a random walk on S, let {U_n}_{n=1}^∞ be i.i.d. T-valued uniformly distributed random variables. We then construct the random walk starting at s ∈ S by X_0 = s and X_{n+1} = (X_n + δ_{U_{n+1}}) mod 2 for n ≥ 0. Note that

    p(s, s′) = { 1/#(T) if s′ ∼ s, 0 otherwise. }

Since p(s, s′) = p(s′, s), it follows that p has the uniform distribution as its invariant measure. If we now take some subset S̃ ⊂ S, we can sample from the "restricted chain" {X̃_n} on S̃ using

    X̃_{n+1} = { X̃_n + δ_{U_{n+1}} if X̃_n + δ_{U_{n+1}} ∈ S̃, X̃_n otherwise. }

Example 11.12. Consider the problem of graph colorings. Let G = (V, E) be a finite graph. A q-coloring of G (with q ∈ ℕ) is a function f : V → {1, 2, ..., q} with the property that, if u, v ∈ V with u ∼ v, then f(u) ≠ f(v). The set of q-colorings is very hard to count (especially for small q), so again we cannot directly sample a uniformly random q-coloring. Instead, define a Markov chain on the state space of all q-colorings f, where for f ≠ g

    p(f, g) = { 1/(q|V|) if f and g differ at exactly one vertex, 0 otherwise. }

Again: since there are at most q different q-colorings of G that differ from a given f at a given vertex, and there are |V| vertices, we have ∑_{g≠f} p(f, g) ≤ 1, and so setting

    p(f, f) = 1 − ∑_{g≠f} p(f, g)

yields a stochastic matrix p which is the transition kernel of a Markov chain. It is evidently symmetric, and by considerations like those in Example 11.8 the chain is irreducible and aperiodic (so long as q is large enough for any q-colorings to exist!); hence, the stationary distribution is uniform, and this Markov chain converges to the uniform distribution.

To simulate the corresponding Markov chain: given X_n = f, choose one of the |V| vertices, v, uniformly at random and one of the q colors, k, uniformly at random. If the new function g defined by g(v) = k and g(w) = f(w) for w ≠ v is a q-coloring of the graph, set X_{n+1} = g; otherwise, keep X_{n+1} = X_n = f. This process has the transition probabilities listed above, and so running it for large n gives an approximately uniformly random q-coloring.
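Here is a minimal sketch of that simulation for a small illustrative graph; the 6-cycle, the choice q = 4, and the run length are arbitrary (q is taken at least max degree + 2, a standard sufficient condition for the single-site chain to reach every coloring).

import random

def random_q_coloring(adj, q, steps=200_000, seed=0):
    """Run the chain of Example 11.12: propose recoloring one uniformly
    chosen vertex with one uniformly chosen color; accept iff still proper."""
    rng = random.Random(seed)
    V = list(adj)
    f = {}  # build an initial proper coloring greedily (works when q > max degree)
    for v in V:
        used = {f[w] for w in adj[v] if w in f}
        f[v] = next(c for c in range(q) if c not in used)
    for _ in range(steps):
        v = rng.choice(V)
        k = rng.randrange(q)
        if all(f[w] != k for w in adj[v]):  # proposal is a proper coloring
            f[v] = k
    return f

# Example: a 6-cycle with q = 4 colors.
adj = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
print(random_q_coloring(adj, q=4))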

Exercise 11.1. Let S be a finite set and P an irreducible Markov matrix with invariant distribution π which satisfies detail balance. Further let A := S̃ be a subset of S, B = S \ A, and π̃(x) = π(x)/π(S̃) as above. If {X_n}_{n=0}^∞ is the Markov chain associated to P, which we will start in S̃, let {X̃_k := X_{τ_k}}_{k=0}^∞, where {τ_k}_{k=0}^∞ are the stopping times defined by τ_0 = 0 and then inductively by

    τ_{k+1} := min{n > τ_k : X_n ∈ S̃} for k ∈ ℕ_0.

Show {X̃_k}_{k=0}^∞ is a Markov chain on S̃ with corresponding Markov matrix,

    P̃ = P_{A×A} + P_{A×B}(I − P_{B×B})^{−1} P_{B×A},

and that π̃ is in detail balance with P̃. In particular it follows that

    lim_{n→∞} (1/n) ∑_{k=1}^n f(X̃_k) = π̃(f),   P_ν – a.s.,

for any distribution ν on S̃. [This algorithm will likely run very slowly, especially when S̃ is a rather "thin" subset of S, since then the chain {X_n} will spend a lot of time outside of S̃.]


11.3 The Metropolis-Hastings Algorithm

In Examples 11.7, 11.8, and 11.12, we constructed Markov chains with symmetric kernels, therefore having the uniform distribution as stationary, and converging to it. This is often a good way to simulate uniform samples. But how can we sample from a non-uniform distribution?

Proposition 11.13. Let S be a finite set, let π be a (strictly positive) distribution on S, and suppose that Q_{ij} = q(i, j) is a symmetric (Q_{ij} = Q_{ji}), irreducible Markov matrix on S. [Note that Q's invariant distribution is the uniform distribution on S.] Then define a new Markov matrix P_{ij} = p(i, j) by

p(i, j) := q(i, j) min(1, π(j)/π(i)) for i ≠ j, and

p(i, i) := 1 − ∑_{j≠i} p(i, j).

Then P is an irreducible Markov matrix with stationary distribution π. In fact, (π, P) are in detail balance.

Proof. By construction we have p(i, j) ∈ [0, 1] and, for i ≠ j, p(i, j) = 0 iff q(i, j) = 0, and

π(i) p(i, j) = q(i, j) min{π(i), π(j)} = q(j, i) min{π(j), π(i)} = π(j) p(j, i),

so π satisfies the detailed balance condition for p. Consequently, π is the stationary distribution for p. Moreover, it follows that p is irreducible (since q is).

Thus if {X_n}_{n=0}^∞ is the Markov chain associated to P we will have

lim_{n→∞} (1/n) ∑_{m=0}^n f(X_m) = π(f) P_ν – a.s.

In order to simulate from this chain we need to explain an effective way to construct a random function f̂: S → S so that P(f̂(i) = j) = p(i, j). We assume that q is simple enough that we have no difficulty simulating a random function f: S → S so that P(f(i) = j) = q(i, j).

Lemma 11.14. If f: S → S is a random function such that P(f(i) = j) = q(i, j) for all i, j ∈ S, U ∈ [0, 1] is an independent uniform random variable, and

f̂(i) := f(i) if U ≤ π(f(i))/π(i), and f̂(i) := i if U > π(f(i))/π(i),

then P(f̂(i) = j) = p(i, j) for all i, j ∈ S.

Proof. For j ≠ i we have

P(f̂(i) = j) = P(f(i) = j, U ≤ π(j)/π(i)) = P(f(i) = j) P(U ≤ π(j)/π(i)) = q(i, j) min(π(j)/π(i), 1) = p(i, j),

and this suffices to complete the proof.

With the above constructions in hand we can now explain how to simulate the chain {X_n}.

Metropolis-Hastings algorithm. Given X_n = i we update to X_{n+1} using the following steps (a Python sketch follows the steps below).

1. Spin the q(i, ·) – spinner (i.e. evaluate f(i)); if it lands at i (f(i) = i), stay at i.
2. If it lands at j ≠ i (f(i) = j ≠ i), then flip a biased coin with probability of heads being min(π(j)/π(i), 1).
   a) If the coin flip is heads then move to j, i.e. set X_{n+1} = j. [This will happen for sure if π(j) ≥ π(i).]
   b) If the coin flip is tails then stay at i, i.e. X_{n+1} = i.
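Here is a minimal Python sketch of this update (not the notes' own code). It assumes a symmetric proposal sampler q_sample and works with an unnormalized π, since only the ratios π(j)/π(i) enter; the weights and nearest-neighbor proposal below are made up for illustration.

    import random

    def metropolis_step(q_sample, pi, i):
        """One Metropolis update: propose j by "spinning the q(i, .) spinner",
        then accept the move with probability min(1, pi(j)/pi(i))."""
        j = q_sample(i)                                   # step 1: propose
        if j != i and random.random() <= min(1.0, pi(j) / pi(i)):
            return j                                      # step 2a: heads, move to j
        return i                                          # step 2b (or j == i): stay

    # Illustration: pi proportional to the weights below on S = {0,...,9},
    # with the symmetric nearest-neighbor (mod 10) proposal.
    weights = [1, 2, 3, 4, 5, 5, 4, 3, 2, 1]
    pi = lambda i: weights[i]                             # unnormalized pi suffices
    q_sample = lambda i: (i + random.choice([-1, 1])) % 10
    x, counts, n = 0, [0] * 10, 200_000
    for _ in range(n):
        x = metropolis_step(q_sample, pi, x)
        counts[x] += 1
    print([round(c / n, 3) for c in counts])              # ~ weights / sum(weights)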

There are many variations on the Metropolis-Hastings algorithm, all with the same basic ingredients and goal: to use a (collection of correlated) Markov chain(s), designed to have a given stationary distribution π, to sample from π approximately. They are collectively known as Markov chain Monte Carlo simulation methods. One place they are especially effective is in calculating high-dimensional integrals (i.e. expectations), a common problem in multivariate statistics.

11.4 The linear algebra associated to detail balance

In this section suppose that S is a finite set, ν: S → (0, 1] is a given (strictly positive) probability on S, and p: S × S → [0, 1] is a Markov matrix. We let ℓ²(ν) denote the vector space of functions f: S → R equipped with the inner product

⟨f, g⟩_ν = ∑_{x∈S} f(x) g(x) ν(x) ∀ f, g ∈ ℓ²(ν).

We further let P: ℓ²(ν) → ℓ²(ν) be defined by

Pf(x) := ∑_{y∈S} p(x, y) f(y).


Notation 11.15 If P is a Markov matrix we let σ(P) denote the set of eigenvalues of P.

We are now going to discuss the spectral theory of P – mostly in the case that P has an invariant distribution, π, which is in detail balance with P. We start with a general lemma not requiring detail balance.

Lemma 11.16. If P is a Markov matrix and π is an invariant distribution for P, then P is a contraction on ℓ^p(π) for all 1 ≤ p ≤ ∞.

Proof. The case f ∈ ℓ¹(π) is all we really need, so let us prove this special case first. The point is that

∑_{x∈S} π(x) |Pf(x)| = ∑_{x∈S} π(x) |∑_{y∈S} P_{x,y} f(y)| ≤ ∑_{x∈S} π(x) ∑_{y∈S} P_{x,y} |f(y)| = ∑_{y∈S} ∑_{x∈S} π(x) P_{x,y} |f(y)| = ∑_{y∈S} π(y) |f(y)|,

which is precisely the statement that ‖Pf‖_{ℓ¹(π)} ≤ ‖f‖_{ℓ¹(π)}.

The remaining cases use Hölder's inequality. Let us be content here with the case p = 2, where we only need the Cauchy-Schwarz inequality. If f ∈ ℓ²(π), then

|Pf(x)| ≤ ∑_{y∈S} P_{x,y} |f(y)| ≤ (∑_{y∈S} P_{x,y} |f(y)|²)^{1/2} (∑_{y∈S} P_{x,y} 1²)^{1/2} = (∑_{y∈S} P_{x,y} |f(y)|²)^{1/2} · 1.

Squaring this inequality, multiplying by π(x), and then summing on x shows

∑_{x∈S} π(x) |Pf|²(x) ≤ ∑_{x∈S} π(x) ∑_{y∈S} P_{x,y} |f(y)|² = ∑_{y∈S} ∑_{x∈S} π(x) P_{x,y} |f(y)|² = ∑_{y∈S} π(y) |f(y)|²,

which is precisely the statement that ‖Pf‖_{ℓ²(π)} ≤ ‖f‖_{ℓ²(π)}.

Corollary 11.17. If P is an irreducible Markov matrix with invariant distribution π (necessarily positive), then 1 ∈ σ(P) and, if λ ∈ σ(P), then |λ| ≤ 1.

Proof. By the very definition of a Markov matrix we know that P1 = 1, where 1 is the constant function 1 on S, and hence 1 ∈ σ(P). If λ ∈ σ(P) and f: S → C is a non-zero function such that Pf = λf, then

|λ| ‖f‖_{ℓ¹(π)} = ‖λf‖_{ℓ¹(π)} = ‖Pf‖_{ℓ¹(π)} ≤ ‖f‖_{ℓ¹(π)},

from which we conclude that |λ| ≤ 1.

The only statement we really need from the next three results is the part of Corollary 11.20 which states: if (π, P) satisfy detail balance, then P is a symmetric operator on ℓ²(π).

Lemma 11.18 (Adjoint of P). The adjoint of P (relative to ⟨·, ·⟩_ν) is given by

(P*g)(x) = ∑_{y∈S} q_ν(x, y) g(y), where (11.2)

q_ν(x, y) = [ν(y)/ν(x)] p(y, x). (11.3)

Proof. This lemma is proved as a consequence of the simple identities,

⟨Pf, g⟩_ν = ∑_{x∈S} ν(x) g(x) ∑_{y∈S} p(x, y) f(y)
= ∑_{x,y∈S} g(x) [ν(x)/ν(y)] p(x, y) f(y) ν(y)
= ∑_{y∈S} [∑_{x∈S} q_ν(y, x) g(x)] f(y) ν(y) = ⟨f, P*g⟩_ν.

Corollary 11.19. Keeping the notation in Lemma 11.18, the function q_ν(x, y) is a Markov matrix iff ν is an invariant measure for P, i.e. νP = ν.

Proof. The matrix q_ν is a Markov matrix iff for all x ∈ S,

1 = ∑_{y∈S} q_ν(x, y) = ∑_{y∈S} [ν(y)/ν(x)] p(y, x) = (1/ν(x)) · (νP)(x),

from which the result immediately follows.


Corollary 11.20. Keeping the notation in Lemma 11.18, we have P = P* iff ν and p satisfy the detail balance equation.

Proof. This is an immediate consequence of Eq. (11.3). For those that have skipped directly to this corollary, let us check directly that if (ν, P) is in detail balance, then P is self-adjoint on ℓ²(ν). For f, g ∈ ℓ²(ν), we have

⟨Pf, g⟩_ν := ∑_{x∈S} ∑_{y∈S} P_{xy} f(y) g(x) ν(x)
= ∑_{y∈S} ∑_{x∈S} f(y) ν(x) P_{xy} g(x)
= ∑_{y∈S} ∑_{x∈S} f(y) ν(y) P_{yx} g(x) = ⟨f, Pg⟩_ν.

Second proof. Let Df(x) := ν(x) f(x) for all x ∈ S. Then ⟨f, g⟩_ν = f · Dg = f^{tr} D g, where

f · g = ∑_{x∈S} f(x) g(x)

is the usual dot product. We then find

⟨Pf, g⟩_ν = Pf · Dg = f · P^{tr} D g and ⟨f, Pg⟩_ν = f · D P g,

and so P* = P iff P^{tr} D = D P, i.e.

ν(x) P_{xy} = [DP]_{xy} = [P^{tr} D]_{xy} = P_{yx} ν(y),

which is precisely the detail balance condition.

Because of this corollary, we may apply the spectral theorem from linear algebra to arrive at the following important theorem.

Theorem 11.21 (Spectral Theorem). If P is a Markov matrix and π: S → (0, 1] is a probability distribution on S such that (π, P) are in detail balance, then Pf(x) = ∑_{y∈S} p(x, y) f(y) has an orthonormal (relative to the ℓ²(π) – inner product) basis of eigenvectors {f_n}_{n=1}^{|S|} ⊂ ℓ²(π) with corresponding eigenvalues σ(P) := {λ_n}_{n=1}^{|S|} ⊂ R.

Advice to the reader: on your first reading of the proof of Theorem 11.22, stick to the (most important) case where P is aperiodic.

Theorem 11.22. If (π, P) are in detail balance, then σ(P) ⊂ [−1, 1] and 1 ∈ σ(P). We further have:

1. If P is irreducible and aperiodic, then 1 has multiplicity one with P1 = 1, i.e. 1 is the unique column eigenvector of P modulo scaling. Moreover there exists α < 1 so that σ(P) ⊂ [−α, α] ∪ {1}.
2. If P is only assumed to be irreducible, then there exists α < 1 such that σ(P) ⊂ [−α, α] ∪ {±1}, and again 1 has multiplicity 1 and as usual P1 = 1.

Proof. By Corollary 11.20, P is self-adjoint on ℓ²(π) and therefore σ(P) ⊂ R. If Pf = λf with f ≠ 0 and λ ∈ σ(P), then by Lemma 11.16,

‖f‖_π ≥ ‖Pf‖_π = ‖λf‖_π = |λ| ‖f‖_π,

which shows |λ| ≤ 1, so that σ(P) ⊂ [−1, 1]. As for any Markov matrix we have P1 = 1 = 1 · 1, from which it follows that 1 ∈ σ(P). We now prove the remaining items in turn.

1. If P is irreducible and aperiodic, then according to Proposition 8.43 there exists an n ∈ N so that P^n_{xy} > 0 for all x, y ∈ S. Suppose that f ∈ ℓ²(π) is a unit vector such that Pf = λf and f is perpendicular to 1, i.e.

0 = ⟨f, 1⟩_π = π(f) = ∑_{x∈S} f(x) π(x).

This last identity forces f(x) < 0 for some x ∈ S and f(y) > 0 for another y ∈ S, and therefore

|λ|^n |f(x)| = |λ^n f(x)| = |(P^n f)(x)| = |∑_{y∈S} P^n_{xy} f(y)| < ∑_{y∈S} P^n_{xy} |f(y)|.

Multiplying this inequality by π(x) and summing the result gives

|λ|^n π(|f|) < π(|f|) ⟹ |λ|^n < 1 ⟹ |λ| < 1,

and this completes the proof in the aperiodic case.

2. Let N ∈ N and set

Q := (1/N) ∑_{n=1}^N P^n,

so that

Qf(x) = ∑_{y∈S} q(x, y) f(y) where q(x, y) = (1/N) ∑_{n=1}^N P^n_{x,y}.


As P is irreducible, we know that for N sufficiently large, q(x, y) > 0 for all x, y ∈ S, so that Q is an aperiodic irreducible Markov matrix. Now suppose that f ∈ ℓ²(π) is a unit vector such that Pf = λf and f is perpendicular to 1. Then

Qf = (1/N) ∑_{n=1}^N P^n f = (1/N) ∑_{n=1}^N λ^n f = c_N f (11.4)

where

c_N = (1/N) · λ(λ^N − 1)/(λ − 1) if λ ≠ 1, and c_N = 1 if λ = 1. (11.5)

From the aperiodic case above applied to Q we may conclude that |c_N| < 1. On the other hand, from Eq. (11.5) we see that |c_N| → ∞ if |λ| > 1, which would contradict |c_N| < 1, and so we may conclude |λ| ≤ 1. If λ = 1 we would have c_N = 1 (i.e. Qf = f), which again leads to a contradiction since by item 1. we know 1 is an eigenvalue of Q with multiplicity 1. Thus we may conclude that either |λ| < 1 or λ = −1.

Remark. Here is another way to see in item 2. that 1 ∈ σ(P) has multiplicity 1. Let f ∈ ℓ²(π) be a unit vector such that Pf = f. Then using Eq. (8.21) of the ergodic Theorem 8.27 we find

f(x) = (1/n) ∑_{m=1}^n [P^m f]_x = ∑_{y∈S} (1/n) ∑_{m=1}^n P^m_{xy} f(y) → ∑_{y∈S} π(y) f(y) = π(f) as n → ∞,

where π is the unique invariant distribution for P. This shows f(x) = π(f) for all x ∈ S, i.e. f = π(f) · 1.

For the rest of this chapter we will be making the following assumption.

Assumption 11.23 For this section we assume S is a finite set, p: S × S → [0, 1] is an irreducible Markov matrix², and π: S → (0, 1] is the unique invariant distribution for p, which we further assume satisfies the detail balance equation,

π(x) p(x, y) = π(y) p(y, x).

As we know P1 = 1, let us agree that we take f_1 = 1 and λ_1 = 1.

² Thus we know the associated chain is irreducible and p has a unique strictly positive invariant distribution, π, on S.

Example 11.24. The Markov matrix

P = [0 1; 1 0]

is irreducible with π = [1/2, 1/2],

f_1 = [1, 1]^{tr} and f_2 = [1, −1]^{tr},

with λ_1 = 1 and λ_2 = −1. The fact that λ_2 = −1 is a reflection of the fact that P has period 2.

Example 11.25. The Markov matrix

P =
[ 0    1    0    0
  1/2  0    0    1/2
  0    0    0    1
  0    1/2  1/2  0 ]

is irreducible, 2-periodic, and has

π = [1/6, 1/3, 1/6, 1/3]

as its invariant distribution, which satisfies detail balance. For example

π_2 P_{24} = (1/3) · (1/2) = π_4 P_{42}.

In this case, σ(P) = {−1, −1/2, 1/2, 1}.

Example 11.26. The Markov matrix

P =
[ 0 1 0
  0 0 1
  1 0 0 ]

describes the chain which moves around the circle, 1 → 2 → 3 → 1. It is irreducible, 3-periodic, and

σ(P) = { −1/2 − (√3/2)i, −1/2 + (√3/2)i, 1 },

which are the third roots of unity; π = [1/3, 1/3, 1/3] is its invariant measure. This matrix does not satisfy detail balance.


11.5 Convergence rates of reversible chains

In order for the Metropolis-Hastings algorithm (or other MCMC methods) to be effective, the Markov chain designed to converge to the given distribution must converge reasonably fast. The term used to measure this rate is the mixing time of the chain. (To make this precise, we need a precise measure of closeness to the stationary distribution; then we can ask how long before the distance to stationarity is less than ε > 0.)

Notation 11.27 If ν is a probability on S and f: S → R is a function, let

ν(f) := E_ν f = ∑_{x∈S} f(x) ν(x).

Theorem 11.28 (L²(P_ν) – convergence rates). Let P be an irreducible and aperiodic Markov matrix, π be the invariant distribution of P, and assume that (π, P) are in detail balance. Let ν: S → [0, 1] be any starting distribution, f: S → R be any function, and

S_N(f) := (1/N) ∑_{m=1}^N f(X_m) for all N ∈ N

be the time average of f along the chain {X_m}_{m=1}^N. Then

E_ν |S_N(f) − π(f)|² = O(1/N).

Proof. We carry out the proof in three steps.

1. Using the spectral theorem, for any f: S → R we have

P^n f = P^n ∑_{j=1}^{|S|} ⟨f, u_j⟩_π u_j = ∑_{j=1}^{|S|} ⟨f, u_j⟩_π P^n u_j = ∑_{j=1}^{|S|} ⟨f, u_j⟩_π λ_j^n u_j = ⟨f, 1⟩_π 1 + O(α^n) = π(f) + O(α^n),

where as above λ_1 = 1, u_1 = 1, and α = max_{j>1} |λ_j| < 1.

2. For n ≥ m,

E[f(X_m) g(X_n)] = ν(P^m [f · P^{n−m} g]) = ν(P^m [f · (π(g) + O(α^{n−m}))])
= π(g) ν(P^m f) + O(α^{n−m}) = π(g) ν(π(f) + O(α^m)) + O(α^{n−m})
= π(g) π(f) + O(α^m) + O(α^{n−m}).

Here we have used that P is a contraction on ℓ²(π) (see Lemma 11.16) and item 1. to conclude that P^m f = π(f) + O(α^m).

3. Letting h := f − π(f), so that π(h) = 0, we find

E[(S_N(f) − π(f))²] = E[[S_N(f − π(f))]²] = E[[S_N(h)]²]
= (1/N²) ∑_{m,n=1}^N E[h(X_m) h(X_n)]
= (1/N²) ∑_{m=1}^N E[h(X_m)²] + (2/N²) ∑_{1≤m<n≤N} E[h(X_m) h(X_n)]
≤ (N/N²) ‖h‖_∞² + (2/N²) ∑_{1≤m<n≤N} E[h(X_m) h(X_n)]
≤ O(1/N) + (2/N²) ∑_{1≤m<n≤N} [O(α^m) + O(α^{n−m})] = O(1/N).

For the last equality we have used

∑_{1≤m<n≤N} [O(α^m) + O(α^{n−m})] ≤ C N ∑_{1≤m<∞} α^m + C ∑_{1≤m≤N} ∑_{m<n<∞} α^{n−m} ≤ C N,

owing to the fact that

∑_{k=1}^∞ α^k = α/(1 − α) for |α| < 1.

Roughly speaking, the previous theorem indicates that the very best we can hope for in terms of pointwise convergence is

π(f) = (1/N) ∑_{n=1}^N f(X_n) + O(1/√N) P_ν – a.s.

11.6 Importance of the spectral gap

Let us try to be more precise on how many runs we will have to take in our simulations in order to get a good answer. As above, suppose that P is an irreducible Markov matrix with invariant distribution π which is in detail balance with P. For simplicity, let us further assume that P is aperiodic.


Notation 11.29 If ν: S → [0, 1] is a probability distribution and κ ∈ N_0, let ν_κ := νP^κ, so that ν_κ(x) = P_ν(X_κ = x).

Remark 11.30. When P is not aperiodic one should replace ν_κ by

ν_κ(x) := (1/(κ+1)) ∑_{m=0}^κ P_ν(X_m = x), i.e. ν_κ := (1/(κ+1)) ∑_{m=0}^κ νP^m,

in the arguments below.

Theorem 11.31. Let us keep the above assumptions and notation in force, and let {u_j}_{j=1}^{#(S)} be a ⟨·, ·⟩_π – orthonormal basis for ℓ²(π) with u_1 = 1 and λ_1 = 1, and α := max_{j>1} |λ_j| < 1. Then for any f: S → R, starting distribution ν: S → [0, 1], and κ ∈ N_0 we have

|ν_κ(f) − π(f)|² ≤ C_κ(ν) Var_π(f) (11.6)

where

C_κ(ν) := ∑_{k=2}^{#(S)} |λ_k|^{2κ} |⟨ν/π, u_k⟩_π|² ≤ α^{2κ} (ν(ν/π) − 1). (11.7)

In summary,

|ν_κ(f) − π(f)|² ≤ α^{2κ} Var_π(f) · (ν(ν/π) − 1). (11.8)

Proof. Since P^κ u_j = λ_j^κ u_j and

f = ∑_{k=1}^{#(S)} ⟨f, u_k⟩_π u_k = π(f) + ∑_{k=2}^{#(S)} ⟨f, u_k⟩_π u_k,

we find

P^κ f = ∑_{k=1}^{#(S)} ⟨f, u_k⟩_π P^κ u_k = ∑_{k=1}^{#(S)} λ_k^κ ⟨f, u_k⟩_π u_k = π(f) + ∑_{k=2}^{#(S)} λ_k^κ ⟨f, u_k⟩_π u_k,

and hence

ν_κ(f) = νP^κ f = π(f) + ∑_{k=2}^{#(S)} λ_k^κ ⟨f, u_k⟩_π ν(u_k) = π(f) + ∑_{k=2}^{#(S)} λ_k^κ ⟨f, u_k⟩_π ⟨ν/π, u_k⟩_π.

Making use of the Cauchy-Schwarz inequality it then follows that

|νP^κ f − π(f)|² ≤ (∑_{k=2}^{#(S)} |λ_k|^κ |⟨f, u_k⟩_π| |⟨ν/π, u_k⟩_π|)²
≤ ∑_{k=2}^{#(S)} |λ_k|^{2κ} |⟨ν/π, u_k⟩_π|² · ∑_{k=2}^{#(S)} |⟨f, u_k⟩_π|² = C_κ(ν) · ∑_{k=2}^{#(S)} |⟨f, u_k⟩_π|².

Since

∑_{k=2}^{#(S)} |⟨f, u_k⟩_π|² = ‖f‖_π² − ⟨f, 1⟩_π² = π(f²) − [π(f)]² = Var_π(f),

the previous two equations prove Eq. (11.6). Moreover,

C_κ(ν) = ∑_{k=2}^{#(S)} |λ_k|^{2κ} |⟨ν/π, u_k⟩_π|² ≤ α^{2κ} ∑_{k=2}^{#(S)} |⟨ν/π, u_k⟩_π|² = α^{2κ} (‖ν/π‖_π² − ⟨ν/π, 1⟩_π²) = α^{2κ} (ν(ν/π) − 1),

which proves the inequality in Eq. (11.7).

Example 11.32. If π is the uniform distribution on S and ν = δ_x for some x ∈ S, then

ν(ν/π) − 1 = |S| − 1 ≤ |S|.

If we further assume that |f| ≤ K for some constant K, then Var_π(f) ≤ K² and we have

|ν_κ(f) − π(f)|² ≤ α^{2κ} K² |S|.

If we are given ε > 0 as the allowed error, then we should choose κ so that

α^{2κ} K² |S| ≤ ε², i.e. 2κ = ln(ε²/(K²|S|))/ln α, i.e. κ = ln(ε/(K√|S|))/ln α,


i.e.

κ ≥ ln(K√|S|/ε)/|ln α| = ln(K√|S|/ε)/ln(α^{−1}).

So for example if K = 1, |S| = 2^{50²} = 2^{2500}, ε = 10^{−2}, and α = e^{−1/10} ≅ 0.90484 (so α^{−1} = e^{1/10}), then we require

κ ≥ 10 ln(100 · 2^{2500/2}) ≅ 8710.4.

Thus if we take independent runs {X_n^m : n ∈ N_0}_{m=1}^n of our chain, starting at a fixed point x ∈ S, we will have by the law of large numbers that

(1/n) ∑_{m=1}^n f(X_κ^m) → ν_κ(f) = E_ν f(X_κ) as n → ∞.

Moreover, using the weak law of large numbers (see Theorem 2.20), we conclude for any δ > 0 that

P(|(1/n) ∑_{m=1}^n f(X_κ^m) − ν_κ(f)| ≥ δ) ≤ (1/δ²)(1/n) Var_{ν_κ}(f).

Taking δ = 10^{−2} we find

P(|(1/n) ∑_{m=1}^n f(X_κ^m) − ν_κ(f)| ≥ 10^{−2}) ≤ 10²/n.

In conclusion, if we take n = 10⁴ and κ = 8711, we would find that, 99% of the time,

|(1/10⁴) ∑_{m=1}^{10⁴} f(X_{8711}^m) − π(f)| ≤ 2 · 10^{−2}.

The difficult thing here is of course getting an estimate on the size of α.
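For concreteness, here is a small script carrying out the bookkeeping of Example 11.32 (a sketch, working with logarithms since |S| = 2^{2500} overflows floating point):

    import math

    K, log2_S, eps, log_inv_alpha = 1.0, 2500, 1e-2, 1 / 10   # alpha = e^{-1/10}
    # kappa >= ln(K * sqrt(|S|) / eps) / ln(1/alpha), with |S| = 2**2500:
    log_num = math.log(K) + (log2_S / 2) * math.log(2) - math.log(eps)
    print(math.ceil(log_num / log_inv_alpha))   # 8711, as in the text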

To get more detailed information about this topic, the reader might start by consulting [12], [4], and [13, Chapter 17] and the references therein.

11.7 Dropping the aperiodic assumption*

Advice to the reader: ignore this section, and in fact the rest of this chapter, on first reading.

Notation 11.33 Let Π_λ denote orthogonal projection onto the λ – eigenspace of P for all λ ∈ σ(P). We then have

P = Π_1 − Π_{−1} + R, where R = ∑_{λ∈σ(P)\{±1}} λ Π_λ.

We further let

α := max{|λ| : λ ∈ σ(P)\{±1}} < 1.

With this notation we have, for k ∈ N,

P^k = Π_1 + (−1)^k Π_{−1} + R^k (11.9)

where ‖R^k‖ ≤ α^k. Moreover if P is aperiodic, then Π_{−1} = 0. In all cases

Π_1 f = ⟨f, 1⟩_π 1 = π(f) 1.

Corollary 11.34. Under Assumption 11.23, if ν: S → [0, 1] is any probability on S and f: S → R is any function, then

E_ν[f(X_n)] = π(f) + (−1)^n ν(Π_{−1}f) + ‖f‖ · O(α^n). (11.10)

In particular if P is aperiodic, then

E_ν[f(X_n)] = π(f) + ‖f‖ · O(α^n).

Proof. Let ρ(x) := ν(x)/π(x), where π is the invariant distribution for P. By the properties of Markov chains we have

E_ν[f(X_n)] = ∑_{x,y∈S} ν(x) P^n_{x,y} f(y) = ∑_{x∈S} ν(x)(P^n f)(x)
= ∑_{x∈S} (P^n f)(x) ρ(x) π(x) = ⟨P^n f, ρ⟩_π = ⟨f, P^n ρ⟩_π.

Making use of Eq. (11.9) we find

⟨f, P^n ρ⟩_π = ⟨f, (Π_1 + (−1)^n Π_{−1} + R^n) ρ⟩_π
= π(f) + (−1)^n ⟨Π_{−1} f, ρ⟩_π + ‖f‖ O(α^n)
= π(f) + (−1)^n ν(Π_{−1} f) + ‖f‖ O(α^n).


Remark 11.35. The above corollary suggests that in the aperiodic case, in order to estimate π(f), a good strategy is to choose n fairly large so that the error estimate

|E_ν[f(X_n)] − π(f)| ≤ C₁ α^n

forces E_ν[f(X_n)] ≅ π(f). We then estimate E_ν[f(X_n)] using the law of large numbers, i.e.

E_ν[f(X_n)] ≅ (1/N) ∑_{k=1}^N f(x_n(k))

where {x_n(k)}_{k=1}^N are N independent samples from X_n. We then expect to have

|π(f) − (1/N) ∑_{k=1}^N f(x_n(k))| ≤ C₁ α^n + C₂ (1/√N).

The typical difficulties here are that C₁, α, and the random constant C₂ are typically not easy to estimate.

Notation 11.36 Let S_N(f) := (1/N) ∑_{n=1}^N f(X_n) be the sample average of f: S → R.

Corollary 11.37. Under Assumption 11.23,

E_ν[S_N(f)] = (1/N) ∑_{n=1}^N E_ν[f(X_n)] = π(f) + ‖f‖ O(1/N). (11.11)

Proof. Summing Eq. (11.10) on n shows

(1/N) ∑_{n=1}^N E_ν[f(X_n)] = π(f) + (ε_N/N) ν(Π_{−1}f) + ‖f‖ O((1/N) ∑_{n=1}^N α^n)
= π(f) + (ε_N/N) ν(Π_{−1}f) + ‖f‖ O((1/N) · α/(1 − α)) = π(f) + ‖f‖ O(1/N),

where ε_N ∈ {0, ±1}.

Corollary 11.38 (L²(P_ν) – convergence rates). Given the assumptions above, for any initial distribution ν,

E_ν |S_N(f) − π(f)|² = O(1/N).

Proof. If we let f = π(f) + h (i.e. h = f − π(f)), then

S_N(f) = (1/N) ∑_{n=1}^N f(X_n) = π(f) + (1/N) ∑_{n=1}^N h(X_n),

and therefore

E[S_N(f) − π(f)]² = (1/N²) ∑_{m,n=1}^N E[h(X_n) h(X_m)]
= (1/N²) ∑_{m=1}^N E[h²(X_m)] + (2/N²) ∑_{1≤m<n≤N} E[h(X_n) h(X_m)]
≤ ‖h‖_∞² (1/N) + (2/N²) ∑_{1≤m<n≤N} E[h(X_n) h(X_m)].

To finish the proof it suffices to show

∑_{1≤m<n≤N} E[h(X_n) h(X_m)] = ∑_{1≤m<n≤N} E[h(X_m)(P^{n−m}h)(X_m)] = O(N),

wherein we have used the Markov property in the first equality.

Case 1. Suppose that P is aperiodic. Under this additional assumption, we know that ‖P^{n−m}h‖_∞ ≤ C ‖h‖_∞ α^{n−m}, and hence we find

∑_{1≤m<n≤N} E[h(X_n) h(X_m)] ≤ C ‖h‖_∞² ∑_{1≤m<n≤N} α^{n−m}.

This completes the proof in this case upon observing that

∑_{1≤m<n≤N} O(α^{n−m}) ≤ C ∑_{1≤m<n≤N} α^{n−m} ≤ [α/(1 − α)] N = O(N). (11.12)

Case 2. For the general case, we decompose h as

h = f_− + g := Π_{−1}f + ∑_{λ∈σ(P)\{±1}} Π_λ f,

so that

P^{n−m}h = (−1)^{n−m} f_− + P^{n−m} g.

With this notation, along with Eq. (11.10), we learn

E[h(X_n) h(X_m)] = E[h(X_m)(P^{n−m}h)(X_m)]
= E[h(X_m)[(−1)^{n−m} f_− + P^{n−m}g](X_m)]
= (−1)^{n−m} E[h(X_m) f_−(X_m)] + O(α^{n−m})
= (−1)^{n−m} π(h f_−) + (−1)^m ν(Π_{−1}[h f_−]) + O(α^m) + O(α^{n−m}).

The last displayed equation combined with Eq. (11.12) and Lemma 11.39 below again implies

∑_{1≤m<n≤N} E[h(X_n) h(X_m)] = O(N).

Lemma 11.39. The following sums are all of order O(N):

∑_{1≤m<n≤N} (−1)^m, ∑_{1≤m<n≤N} (−1)^n, ∑_{1≤m,n≤N} (−1)^{m∧n}, ∑_{1≤m,n≤N} (−1)^{m∨n}, and ∑_{1≤m,n≤N} (−1)^{|m−n|}.

Proof. These results are elementary and are based on the fact that for any 0 ≤ k < p < ∞ we have

∑_{l=k}^p (−1)^l ∈ {0, ±1}.

For example,

∑_{1≤m,n≤N} (−1)^{m∧n} = ∑_{m=1}^N (−1)^m + 2 ∑_{1≤m<n≤N} (−1)^m ∈ {0, ±1} + 2 ∑_{1<n≤N} {0, ±1},

which is certainly bounded by CN. The largest term is

∑_{1≤m,n≤N} (−1)^{|m−n|} = ∑_{m=1}^N 1 + 2 ∑_{1≤m<n≤N} (−1)^{n−m} = N + 2 ∑_{1≤m<N} {0, ±1},

which is again bounded by CN.

11.8 *Reversible Markov Chains

Let S be a finite or countable state space, p: S × S → [0, 1] be a Markov transition kernel, and ν: S → [0, 1] be a probability on S. We now wish to find conditions on ν and p so that, for each N ∈ N, {Y_n := X_{N−n}}_{0≤n≤N} is distributed as a time homogeneous Markov chain. Thus we are looking for a Markov transition kernel q: S × S → [0, 1] and a probability ν̂: S → [0, 1] such that

P_ν(Y_0 = x_0, …, Y_N = x_N) = ν̂(x_0) q(x_0, x_1) ⋯ q(x_{N−1}, x_N)

for all x_j ∈ S. By assumption,

P_ν(Y_0 = x_0, …, Y_N = x_N) = P_ν(X_N = x_0, …, X_0 = x_N) = ν(x_N) p(x_N, x_{N−1}) ⋯ p(x_1, x_0).

Comparing these last two displayed equations leads us to require, for all N ∈ N_0 and {x_j} ⊂ S, that

ν̂(x_0) q(x_0, x_1) ⋯ q(x_{N−1}, x_N) = ν(x_N) p(x_N, x_{N−1}) ⋯ p(x_1, x_0).

Taking N = 0 forces us to take ν̂ = ν. Taking N = 1 we have

ν(x_0) q(x_0, x_1) = ν(x_1) p(x_1, x_0), (11.13)

and then summing this equation on x_1 shows ν(x_0) = (νP)(x_0), which forces ν to be a stationary distribution for p.

From now on let us assume that p has a strictly positive stationary distribution, π, i.e. that all communication classes of p are positive recurrent. Let ν be one of these stationary distributions. Then from Eq. (11.13) we see that we must take

q(x_0, x_1) = [ν(x_1)/ν(x_0)] p(x_1, x_0) ≥ 0.

Observe that

∑_{x_1∈S} q(x_0, x_1) = ∑_{x_1∈S} [ν(x_1)/ν(x_0)] p(x_1, x_0) = (νP)(x_0)/ν(x_0) = ν(x_0)/ν(x_0) = 1

and that

∑_{x_0∈S} ν(x_0) q(x_0, x_1) = ∑_{x_0∈S} ν(x_1) p(x_1, x_0) = ν(x_1),

which shows that q: S × S → [0, 1] is a Markov transition kernel with stationary distribution ν. Finally we have


ν(x_0) q(x_0, x_1) ⋯ q(x_{N−1}, x_N)
= ν(x_0) [ν(x_1)/ν(x_0)] p(x_1, x_0) [ν(x_2)/ν(x_1)] p(x_2, x_1) ⋯ [ν(x_N)/ν(x_{N−1})] p(x_N, x_{N−1})
= ν(x_N) p(x_N, x_{N−1}) ⋯ p(x_1, x_0),

as desired.

Definition 11.40. We say the chain {X_n} associated to p is (time) reversible if there exists a distribution ν on S such that, for any N ∈ N,

(X_0, …, X_N) and (X_N, X_{N−1}, …, X_0) have the same distribution under P_ν.

What we have proved above is that p is reversible iff there exists ν: S → (0, ∞) such that

ν(x) p(x, y) = ν(y) p(y, x) ∀ x, y ∈ S.

A small numerical illustration of the time-reversed kernel follows.
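As a quick numerical illustration (a sketch, not from the notes), the reversed kernel q may be formed from P and ν and tested for reversibility; the 3-cycle of Example 11.26 is used as the test case.

    import numpy as np

    def reversed_kernel(P, nu):
        """q(x,y) = nu(y) p(y,x) / nu(x); P is reversible iff q equals P."""
        return P.T * nu[None, :] / nu[:, None]

    P = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]], float)   # 3-cycle chain
    nu = np.array([1 / 3, 1 / 3, 1 / 3])                     # its stationary law
    Q = reversed_kernel(P, nu)
    print(Q.sum(axis=1))          # rows sum to 1: q is a Markov kernel
    print(np.allclose(Q, P))      # False: the 3-cycle is not reversible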


12

Hidden Markov Models

Problem: Let {X_n} ⊂ S be a stochastic process and {E_n: S → O}_{n=0}^∞ a collection of random functions. It is often the case that we cannot observe X_n directly but only some noisy output {U_n = E_n(X_n)}_{n=0}^∞ of the {X_n}. Our goal is to predict the values of {X_n}_{n=0}^N given that we have observed u := {u_n}_{n=0}^N ⊂ O. One way to make such a prediction is based on finding the most probable trajectory, i.e. we are looking for (x*_0, …, x*_N) ∈ S^{N+1} which is a maximizer of the function

F_N(x_0, …, x_N) = P({X_n = x_n}_{n=0}^N | {U_n = u_n}_{n=0}^N) = P({X_n = x_n}_{n=0}^N, {U_n = u_n}_{n=0}^N) / P({U_n = u_n}_{n=0}^N).

Equivalently (since the denominator above does not depend on (x_0, …, x_N)), we may define (x*_0, …, x*_N) as

(x*_0, …, x*_N) = arg max_{(x_0,…,x_N)∈S^{N+1}} F_N(x_0, …, x_N),

where

F_N(x_0, …, x_N) = P({X_n = x_n}_{n=0}^N, {U_n = u_n}_{n=0}^N) = P({(X_n, U_n) = (x_n, u_n)}_{n=0}^N). (12.1)

With no extra structure this maximization problem is intractable even for moderately large N, since it requires |S|^{N+1} evaluations of F_N.¹ On the other hand, we will be able to say something about this problem if we make the following additional assumptions.

Assumption 12.1 Keeping the notation above, let us now further assume that:

1. {X_n} is a (hidden) Markov chain with transition probabilities p: S × S → [0, 1] and initial distribution ν: S → [0, 1], and
2. the "random noise" functions {E_n: S → O}_{n=0}^∞ are i.i.d. random functions which are independent of the chain {X_n}_{n=0}^∞.

¹ It is worth observing that the dimension of the space of possible functions F_N is also roughly |S|^{N+1}.

Notation 12.2 Under Assumption 12.1, the emission probabilities refers to the function e: S × O → [0, 1] defined by

e(x, u) := P(E_n(x) = u) ∀ (x, u) ∈ S × O.

Given u := {u_n}_{n=0}^N ⊂ O, if we let

V_0(x_0) = ν(x_0) e(x_0, u_0) and q_n(x, y) := p(x, y) e(y, u_n), (12.2)

then the function F_N in Eq. (12.1) becomes

F_N(x_0, …, x_N) = P({X_n = x_n}_{n=0}^N) · P({E_n(x_n) = u_n}_{n=0}^N)
= ν(x_0) e(x_0, u_0) ∏_{n=1}^N p(x_{n−1}, x_n) e(x_n, u_n) (12.3)
= V_0(x_0) ∏_{n=1}^N q_n(x_{n−1}, x_n). (12.4)

Example 12.3 (Occasionally Dishonest Casino). In a casino game, a die is thrown. Usually a fair die is used, but sometimes the casino swaps it out for a loaded die. To be precise: there are two dice, F (fair) and L (loaded). For the fair die F, the probability of rolling i is 1/6 for i = 1, 2, …, 6. For the loaded die, however, 1 is rolled with probability 0.5 while 2 through 6 are each rolled with probability 0.1. Each roll, the casino randomly switches out F for L with probability 0.05; if L is in use, they switch back to F next roll with probability 0.9.

In this example, we may take S = {F, L} (F = fair and L = loaded) and (X_n)_{n≥0} to be the Markov chain keeping track of which die is in play; it has transition probabilities

P =
      F     L
  F [0.95  0.05]
  L [0.90  0.10].

We further let O = {1, 2, 3, 4, 5, 6}, and we are given the emission probabilities:


e(x, u):

x \ u   1     2     3     4     5     6
F      1/6   1/6   1/6   1/6   1/6   1/6
L      0.5   0.1   0.1   0.1   0.1   0.1

Questions: You notice that 1 is rolled 6 times in a row. How likely is it that the fair die was in use for those rolls (given the above information)? How likely is it that the loaded die was used in rolls 2, 3, and 5? More generally, what is the most likely sequence of dice that was used to generate this sequence of rolls?

Example 12.3 is a very typical situation that occurs in many real-world problems. Another application of Hidden Markov Models (HMMs) is in machine speech recognition. Here we have S = {words} and O = {wave forms}.

12.1 The most likely trajectory via dynamic programming

Let V_0: S → [0, ∞) and q_n: S × S → [0, ∞) for n ∈ N be given functions, and then set

F_n(x_0, x_1, …, x_n) = V_0(x_0) q_1(x_0, x_1) q_2(x_1, x_2) ⋯ q_n(x_{n−1}, x_n). (12.5)

The dimension of the space of functions F_n having the form in Eq. (12.5) is now approximately n|S|² + |S|, which is much smaller than |S|^{n+1} – the dimension of the space of all functions F_n on S^{n+1}. Viterbi's algorithm exploits this fact in order to find a maximizer of F_n with only order n|S|² operations rather than |S|^{n+1}. The key observation is that we may describe F_n inductively as

F_1(x_0, x_1) = V_0(x_0) q_1(x_0, x_1) and

F_n(x_0, x_1, …, x_n) = F_{n−1}(x_0, x_1, …, x_{n−1}) q_n(x_{n−1}, x_n). (12.6)

From Eq. (12.6) it follows that

max_{x_0,…,x_n∈S} F_n(x_0, x_1, …, x_n) = max_{x_0,…,x_n∈S} [F_{n−1}(x_0, …, x_{n−1}) q_n(x_{n−1}, x_n)]
= max_{x_n∈S} max_{x_{n−1}∈S} [max_{x_0,…,x_{n−2}∈S} F_{n−1}(x_0, …, x_{n−1})] q_n(x_{n−1}, x_n).

This identity suggests that we define (for n ∈ N) V_n: S → [0, 1] by

V_n(x_n) := max_{x_0,…,x_{n−1}∈S} F_n(x_0, x_1, …, x_n) for n ≥ 1. (12.7)

It then follows that

max_{x_0,…,x_n∈S} F_n(x_0, x_1, …, x_n) = max_{x_n∈S} V_n(x_n)

and

V_n(x_n) = max_{x_0,…,x_{n−1}∈S} [F_{n−1}(x_0, …, x_{n−1}) q_n(x_{n−1}, x_n)]
= max_{x_{n−1}∈S} max_{x_0,…,x_{n−2}∈S} [F_{n−1}(x_0, …, x_{n−1}) q_n(x_{n−1}, x_n)]
= max_{x_{n−1}∈S} [V_{n−1}(x_{n−1}) q_n(x_{n−1}, x_n)].

Notation 12.4 If G: S → R is a function on a finite set S, let arg max_{x∈S} G(x) denote a point x* ∈ S such that G(x*) = max_{x∈S} G(x).

The above comments lead to the following algorithm for finding a maximizer of F_N.

Theorem 12.5 (Viterbi's most likely trajectory). Let us continue the notation above, so that for n ≥ 1,

V_n(y) = max_{x∈S} [V_{n−1}(x) q_n(x, y)],

and further let

φ_n(y) := arg max_{x∈S} [V_{n−1}(x) q_n(x, y)].

Given N ∈ N, if we let

x*_N = arg max_{x∈S} V_N(x),
x*_{N−1} = φ_N(x*_N),
x*_{N−2} = φ_{N−1}(x*_{N−1}),
⋮
x*_0 := φ_1(x*_1),

then

(x*_0, x*_1, …, x*_N) = arg max_{(x_0,…,x_N)∈S^{N+1}} F_N(x_0, …, x_N) (12.8)

and

M := max_{(x_0,…,x_N)∈S^{N+1}} F_N(x_0, …, x_N) = V_N(x*_N) = max_x V_N(x). (12.9)


Proof. First off, Eq. (12.9) is just a rewrite of Eq. (12.7). Moreover,

M = V_N(x*_N) = max_{x_{N−1}∈S} [V_{N−1}(x_{N−1}) q_N(x_{N−1}, x*_N)] = V_{N−1}(x*_{N−1}) q_N(x*_{N−1}, x*_N),

while

V_{N−1}(x*_{N−1}) = max_{x_{N−2}∈S} [V_{N−2}(x_{N−2}) q_{N−1}(x_{N−2}, x*_{N−1})] = V_{N−2}(x*_{N−2}) q_{N−1}(x*_{N−2}, x*_{N−1}),

so that

M = V_{N−2}(x*_{N−2}) q_{N−1}(x*_{N−2}, x*_{N−1}) q_N(x*_{N−1}, x*_N).

Continuing this way inductively shows, for 1 ≤ k ≤ N,

M = V_{N−k}(x*_{N−k}) q_{N−k+1}(x*_{N−k}, x*_{N−k+1}) ⋯ q_{N−1}(x*_{N−2}, x*_{N−1}) q_N(x*_{N−1}, x*_N),

and in particular at k = N we find

M = V_0(x*_0) q_1(x*_0, x*_1) ⋯ q_{N−1}(x*_{N−2}, x*_{N−1}) q_N(x*_{N−1}, x*_N) = F_N(x*_0, x*_1, …, x*_N),

which proves Eq. (12.8).

Proof in the special case where N = 3. The algorithm gives

V_1(y) := max_x V_0(x) q_1(x, y) = V_0(φ_1(y)) q_1(φ_1(y), y),
V_2(y) := max_x V_1(x) q_2(x, y) = V_1(φ_2(y)) q_2(φ_2(y), y),
V_3(y) := max_x V_2(x) q_3(x, y) = V_2(φ_3(y)) q_3(φ_3(y), y).

Now choose x*_3 so that

M = max F_3(x_0, x_1, x_2, x_3) = max_y V_3(y) = V_3(x*_3),

and then let

x*_2 = φ_3(x*_3), x*_1 = φ_2(x*_2), and x*_0 = φ_1(x*_1).

We then find by back substitution that

M = max_y V_3(y) = V_3(x*_3)
= V_2(φ_3(x*_3)) q_3(φ_3(x*_3), x*_3) = V_2(x*_2) q_3(x*_2, x*_3)
= V_1(φ_2(x*_2)) q_2(φ_2(x*_2), x*_2) q_3(x*_2, x*_3) = V_1(x*_1) q_2(x*_1, x*_2) q_3(x*_2, x*_3)
= V_0(φ_1(x*_1)) q_1(φ_1(x*_1), x*_1) q_2(x*_1, x*_2) q_3(x*_2, x*_3) = V_0(x*_0) q_1(x*_0, x*_1) q_2(x*_1, x*_2) q_3(x*_2, x*_3)
= F_3(x*_0, x*_1, x*_2, x*_3),

and so (x*_0, x*_1, x*_2, x*_3) is a maximizer of F_3(x_0, x_1, x_2, x_3).

12.1.1 Viterbi’s algorithm in the Hidden Markov chain context

Let us now write out the pseudo code of this algorithm in the hidden Markov chain context. We now assume the setup described in Notation 12.2 and that we have observed u := {u_k}_{k=0}^N ⊂ O.

Hidden Markov chain pseudo code:

For y ∈ S:
  V_0(y) = ν(y) e(y, u_0)
End For.
For n = 1 to N:
  For y ∈ S:
    V_n(y) := max_{x∈S} [V_{n−1}(x) q_n(x, y)] = max_{x∈S} [V_{n−1}(x) p(x, y) e(y, u_n)] and
    φ_n(y) := arg max_{x∈S} [V_{n−1}(x) q_n(x, y)] = arg max_{x∈S} [V_{n−1}(x) p(x, y) e(y, u_n)]
  End For.
End For.
x*_N := arg max_{y∈S} V_N(y)
For n = 1 to N:
  x*_{N−n} = φ_{N−n+1}(x*_{N−n+1})
End For.
Output: (x*_0, x*_1, …, x*_N).

The output {x*_k}_{k=0}^N is a most likely trajectory given {u_k}_{k=0}^N, i.e.

(x*_0, x*_1, …, x*_N) = arg max_{(x_0,…,x_N)∈S^{N+1}} F_N(x_0, …, x_N).

Complexity analysis: Notice that the computational complexity of this algorithm is approximately C · N|S|² rather than |S|^{N+1}. Indeed, the N comes from the for loop over n = 1 to N. Inside this loop there is a loop over y ∈ S, and in this inner loop roughly |S| evaluations are needed to compute V_n(y) and φ_n(y). Thus at each index n, there are about |S|² operations needed. A Python transcription of the pseudo code is given below.
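The following is one possible Python transcription of this pseudo code (a sketch, not the notes' own code); dictionaries stand in for functions on S, and for long observation sequences one would work with logarithms to avoid underflow. The uniform initial distribution in the small test is an assumption, since Example 12.3 does not specify ν.

    def viterbi(nu, p, e, u):
        """Return a most likely state path for observations u, where nu[x],
        p[x][y] and e[x][o] are as in Notation 12.2."""
        states = list(nu)
        V = {x: nu[x] * e[x][u[0]] for x in states}           # V_0
        phi = []                                              # phi[n-1] = phi_n
        for o in u[1:]:
            newV, back = {}, {}
            for y in states:
                x_best = max(states, key=lambda x: V[x] * p[x][y])
                back[y], newV[y] = x_best, V[x_best] * p[x_best][y] * e[y][o]
            V, phi = newV, phi + [back]
        path = [max(states, key=V.get)]                       # x*_N
        for back in reversed(phi):                            # back substitution
            path.append(back[path[-1]])
        return path[::-1]

    # Tiny check with the Example 12.3 parameters (uniform start assumed):
    nu = {"F": 0.5, "L": 0.5}
    p = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.9, "L": 0.1}}
    e = {"F": {i: 1 / 6 for i in range(1, 7)},
         "L": {1: 0.5, **{i: 0.1 for i in range(2, 7)}}}
    print(viterbi(nu, p, e, [1, 1, 1, 1, 1, 1]))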

12.2 The computation of the conditional probabilities

If u := (u_0, u_1, …, u_N) ∈ O^{N+1} is a given observation sequence and x = (x_0, x_1, …, x_N) ∈ S^{N+1} is a state sequence, we may ask how likely it is that X_{(N)} := (X_0, X_1, …, X_N) = x given U_{(N)} := (U_0, U_1, …, U_N) = u,² i.e. we want to compute

ρ(x|u) := P(X_{(N)} = x | U_{(N)} = u) = P(X_{(N)} = x, U_{(N)} = u) / P(U_{(N)} = u)
= [ν(x_0) e(x_0, u_0) ∏_{k=1}^N p(x_{k−1}, x_k) e(x_k, u_k)] / [∑_{y∈S^{N+1}} ν(y_0) e(y_0, u_0) ∏_{k=1}^N p(y_{k−1}, y_k) e(y_k, u_k)].

² Recall that U_n = E_n(X_n), so that U_n depends on both the chain {X_n} and the noise functions {E_n}.

The denominator P(U_{(N)} = u), as written, is computationally too costly to compute even for moderately large N. Theorem 12.7 gives a computationally tractable algorithm for computing P(U_{(N)} = u) and hence for computing ρ(x|u). We first introduce the following notation.

Notation 12.6 For 0 ≤ n ≤ N, let u_{(n)} = (u_0, u_1, …, u_n) ∈ O^{n+1} and similarly U_{(n)} = (U_0, U_1, …, U_n).

Theorem 12.7. Let N ∈ N and let u = (u_0, u_1, …, u_N) ∈ O^{N+1} be an observation sequence, and define

α_n(x) = P(U_{(n)} = u_{(n)}, X_n = x) for 0 ≤ n ≤ N. (12.10)

Then

α_0(x) = ν(x) e(x, u_0) = P(X_0 = x) e(x, u_0), (12.11)

and {α_n: S → [0, 1]}_{n=1}^N may be computed inductively using

α_{n+1}(y) = [α_n P]_y e(y, u_{n+1}) = e(y, u_{n+1}) ∑_{x∈S} α_n(x) p(x, y). (12.12)

We then have

P(U_{(N)} = u) = ∑_{x∈S} P(U_{(N)} = u, X_N = x) = ∑_{x∈S} α_N(x). (12.13)

Proof. Equation (12.13) trivially follows from Eq. (12.10) with n = N, while for Eq. (12.11) we simply have

α_0(x) = P(U_0 = u_0, X_0 = x) = P(E_0(x) = u_0, X_0 = x) = P(E_0(x) = u_0) P(X_0 = x) = e(x, u_0) ν(x).

In order to prove Eq. (12.12), let us realize the Markov chain {X_n}_{n=0}^∞ as it is described in Eq. (5.1). That is, choose X_0 ∈ S and i.i.d. random functions {f_n: S → S}_{n=0}^∞ such that P(X_0 = x) = ν(x), p(x, y) = P(f_n(x) = y) for all x, y ∈ S, and then take X_{n+1} = f_n(X_n) for n ∈ N_0. We further assume that {X_0} ∪ {f_n}_{n=0}^∞ ∪ {E_n}_{n=0}^∞ are all independent in order to satisfy Assumption 12.1. With this notation, and making use of all the independence, we find

α_{n+1}(y) := P(U_{(n+1)} = u_{(n+1)}, X_{n+1} = y)
= ∑_{x∈S} P(U_{(n+1)} = u_{(n+1)}, X_{n+1} = y, X_n = x)
= ∑_{x∈S} P(U_{(n)} = u_{(n)}, E_{n+1}(y) = u_{n+1}, f_n(x) = y, X_n = x)
= ∑_{x∈S} P(U_{(n)} = u_{(n)}, X_n = x) P(E_{n+1}(y) = u_{n+1}) P(f_n(x) = y)
= ∑_{x∈S} α_n(x) e(y, u_{n+1}) p(x, y) = [α_n P]_y e(y, u_{n+1}).

Complexity of the algorithm. The complexity of this algorithm is again roughly CN|S|². Indeed, there are N inductive steps in Eq. (12.12); there are C|S| operations needed to compute the sum in Eq. (12.12), and this has to be done |S| times, once for each y ∈ S. A Python sketch of the recursion follows.
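A Python sketch of this forward recursion, in the same dictionary conventions as the Viterbi code above:

    def forward(nu, p, e, u):
        """Compute P(U_(N) = u) via Eqs. (12.11)-(12.13)."""
        alpha = {x: nu[x] * e[x][u[0]] for x in nu}                  # Eq. (12.11)
        for o in u[1:]:                                              # Eq. (12.12)
            alpha = {y: e[y][o] * sum(alpha[x] * p[x][y] for x in alpha)
                     for y in alpha}
        return sum(alpha.values())                                   # Eq. (12.13)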

12.2.1 Exercises

Exercise 12.1. Here is a summary of the Viterbi algorithm:

1. (Initialize): δ_0(j) = p(j) h_j(b_0) for j ∈ S.
2. (Recursion): δ_{k+1}(j) = [max_{i∈S} δ_k(i) P(i, j)] h_j(b_{k+1}) for j ∈ S, and ψ_{k+1}(j) = arg max_{i∈S} δ_k(i) P(i, j), for k = 0, 1, 2, …, n−1.
3. (Termination): P*(b) [= P*(b_0, …, b_n)] = max_{j∈S} δ_n(j) and j*_n = arg max_{j∈S} δ_n(j).
4. (Back substitution): j*_k = ψ_{k+1}(j*_{k+1}) for k = n−1, n−2, …, 2, 1, 0.

Show, by backward induction on k, that P(j*_k, j*_{k+1}) > 0 for k = 0, 1, 2, …, n−1.

Exercise 12.2. This problem implements the Viterbi algorithm for the "crooked casino" HMM. The underlying Markov chain has state space S = {F, L} and transition matrix

P = [.95 .05; .1 .9].

When the Markov chain is in state F, a fair die is thrown—all six faces are equally likely; when the Markov chain is in state L a loaded die is thrown—face 6 has probability .5 and all other faces have probability 0.1. In short, the "emission probabilities" are given by the matrix

H = [1/6 1/6 1/6 1/6 1/6 1/6; 1/10 1/10 1/10 1/10 1/10 1/2].

We assume that at the start the fair die is being used; that is, p(F) = 1 and p(L) = 0.

Here are 100 random rolls in the crooked casino. Use the Viterbi algorithm to predict the state sequence; that is, which of the two dice was used on each roll.

3151162464 4664424531 1321631164 1521336251 4454363165
6626566666 6511664531 3265124563 6664631636 6631623264

Below are the dice that are tossed on each roll. Don't look until you have done the exercise!

Here are the dice that are rolled, each listed just below the roll itself.

3151162464 4664424531 1321631164 1521336251 4454363165
FFFFFFFFFF FFFFFFFFFF FFFFFFFFFF FFFFFFFFFF FFFFFLLLLL

6626566666 6511664531 3265124563 6664631636 6631623264
LLLLLLLLLL LLLLLLFFFF FFFFFFFFLL LLLLLLLLLL LLLLFFFLLL


Part III

Martingales


13

(Sub and Super) martingales

The typical setup in this chapter is that {X_k, Y_k, Z_k, …}_{k=0}^∞ is some collection of stochastic processes on a probability space (Ω, P), and we let F_n = σ({X_k, Y_k, Z_k, …}_{k=0}^n) for n ∈ N_0 and F = σ({X_k, Y_k, Z_k, …}_{k=0}^∞). Notice that F_n ⊂ F_{n+1} ⊂ F for all n = 0, 1, 2, …. We say a stochastic process {U_n}_{n=0}^∞ is adapted to {F_n} provided U_n is F_n – measurable for all n ∈ N_0, i.e. U_n is a function of {X_k, Y_k, Z_k, …}_{k=0}^n.

Definition 13.1. Let X := {X_n}_{n=0}^∞ be an F_n-adapted sequence of integrable random variables. Then:

1. X is a {F_n}_{n=0}^∞ – martingale if E[X_{n+1}|F_n] = X_n a.s. for all n ∈ N_0.
2. X is a {F_n}_{n=0}^∞ – submartingale if E[X_{n+1}|F_n] ≥ X_n a.s. for all n ∈ N_0.
3. X is a {F_n}_{n=0}^∞ – supermartingale if E[X_{n+1}|F_n] ≤ X_n a.s. for all n ∈ N_0.

It is often fruitful to view X_n as your earnings at time n while playing some game of chance. In this interpretation, your expected earnings at time n + 1 given the history of the game up to time n are the same as, greater than, or less than your earnings at time n according to whether X = {X_n}_{n=0}^∞ is a martingale, submartingale, or supermartingale respectively. In this interpretation, martingales are fair games, submartingales are games which are favorable to the gambler (unfavorable to the casino), and supermartingales are games which are unfavorable to the gambler (favorable to the casino), see Example 13.39.

By induction one shows that X is a supermartingale, martingale, or submartingale iff

E[X_n|F_m] ≤ X_m, = X_m, or ≥ X_m a.s. for all n ≥ m, (13.1)

to be read from top to bottom respectively. This last equation may also be expressed as

E[X_n|F_m] ≤ X_{n∧m}, = X_{n∧m}, or ≥ X_{n∧m} a.s. for all m, n ∈ N_0. (13.2)

The reader should also note that, respectively, E[X_n] is decreasing, constant, or increasing in n.

13.1 (Sub and Super) martingale Examples

Example 13.2. Suppose that {Z_n}_{n=0}^∞ are independent¹ integrable random variables such that EZ_n = 0 for all n ≥ 1. Then S_n := ∑_{k=0}^n Z_k is a martingale relative to the filtration F_n^Z := σ(Z_0, …, Z_n). Indeed,

E[S_{n+1} − S_n|F_n] = E[Z_{n+1}|F_n] = EZ_{n+1} = 0.

This same computation also shows that {S_n}_{n≥0} is a submartingale if EZ_n ≥ 0 and a supermartingale if EZ_n ≤ 0 for all n.

Example 13.3 (Regular martingales). If X is a random variable with E|X| < ∞, then X_n := E[X|F_n] is a martingale. Indeed, by the tower property of conditional expectations,

E[X_{n+1}|F_n] = E[E[X|F_{n+1}]|F_n] = E[X|F_n] = X_n a.s.

A martingale of this form is called a regular martingale. From Proposition 3.12 we know that

E|X_n| = E|E[X|F_n]| ≤ E|X|,

which shows that sup_n E|X_n| ≤ E|X| < ∞ for regular martingales. In the next exercise you will show that not all martingales are regular.

Exercise 13.1. Construct an example of a martingale {M_n}_{n=0}^∞ such that E|M_n| → ∞ as n → ∞. [In particular, {M_n}_{n=1}^∞ will be a martingale which is not of the form M_n = E_{F_n}X for some X ∈ L¹(P).] Hint: try taking M_n = ∑_{k=0}^n Z_k for a judicious choice of {Z_k}_{k=0}^∞, which you should take to be independent, mean zero, and having E|Z_n| growing rather rapidly.

Example 13.4 ("Conditioning a δ – function"). Suppose that Ω = (0, 1], F = F_{(0,1]}, and P = m – Lebesgue measure. We then define random variables {Y_n}_{n=1}^∞ with values in {0, 1} so that

ω = ∑_{n=1}^∞ Y_n(ω)/2^n, i.e. ω = .Y_1(ω) Y_2(ω) Y_3(ω) …

¹ We do not need to assume that the {Z_n}_{n=0}^∞ are identically distributed here!!


is the binary expansion of ω. We then let F_n := σ(Y_1, …, Y_n). A little thought shows that X is F_n measurable iff X is constant on each of the sub-intervals ((k−1)/2^n, k/2^n] for 1 ≤ k ≤ 2^n. As {2^{n/2} 1_{((k−1)/2^n, k/2^n]}}_{k=1}^{2^n} is an orthonormal basis for this collection of functions, we may conclude that

E[X|F_n](ω) = ∑_{k=1}^{2^n} [∫_0^1 X(σ) 2^{n/2} 1_{((k−1)/2^n, k/2^n]}(σ) dσ] · 2^{n/2} 1_{((k−1)/2^n, k/2^n]}(ω)
= ∑_{k=1}^{2^n} [2^n ∫_{(k−1)/2^n}^{k/2^n} X(σ) dσ] · 1_{((k−1)/2^n, k/2^n]}(ω).

If we formally take X = δ_0 to be the Dirac delta function at 0 (with all its weight to the right of zero), then

M_n := "E[δ_0|F_n]" = 2^n 1_{(0, 2^{−n}]} for n ∈ N.

In Exercise 13.2 you are asked to verify directly that {M_n := 2^n 1_{(0, 2^{−n}]}}_{n=0}^∞ is a martingale.

Notice that E|M_n| = 1 for all n. However, there is no X ∈ L¹(Ω, F, P) such that M_n = E[X|F_n]. To verify this last assertion, suppose such an X existed. We would then have, for 2^n > k > 0 and any m > n, that

E[X : (k/2^n, (k+1)/2^n]] = E[E_{F_m}X : (k/2^n, (k+1)/2^n]] = E[M_m : (k/2^n, (k+1)/2^n]] = 0.

Using E[X : A] = 0 for all A in the π – system Q := ∪_{n=1}^∞ {(k/2^n, (k+1)/2^n] : 0 ≤ k < 2^n}, an application of the "π – λ theorem" shows E[X : A] = 0 for all A ∈ σ(Q) = F. By more general measure theoretic arguments, this suffices to show X = 0 a.s., which is impossible since 1 = EM_n = EX.

Moral: not all L¹ – bounded martingales are regular. It is also worth noting that pointwise M_∞ := lim_{n→∞} M_n = 0, yet

lim_{n→∞} EM_n = lim_{n→∞} 1 = 1 ≠ 0 = EM_∞ = E[lim_{n→∞} M_n].

Exercise 13.2. Show that {M_n := 2^n 1_{(0, 2^{−n}]}}_{n∈N_0}, as defined in Example 13.4, is a martingale.

Notation 13.5 Given a sequence {Z_k}_{k=0}^∞, let Δ_kZ := Z_k − Z_{k−1} for k = 1, 2, …, and by convention let Δ_0Z = Z_0.

The next lemma is the "differential" characterization of (sub or super) martingales.

Lemma 13.6 (Submartingale Increments). Let X := {X_n}_{n=0}^∞ be an adapted process of integrable random variables on a filtered probability space (Ω, F, {F_n}_{n=0}^∞, P), and let d_n := Δ_nX := X_n − X_{n−1}. Then X is a martingale (respectively submartingale or supermartingale) iff E[d_{n+1}|F_n] = 0 (E[d_{n+1}|F_n] ≥ 0 or E[d_{n+1}|F_n] ≤ 0 respectively) for all n ∈ N_0.

Conversely, if {d_n}_{n=1}^∞ is an adapted sequence of integrable random variables and X_0 is an F_0-measurable integrable random variable, then X_n = X_0 + ∑_{j=1}^n d_j is a martingale (respectively submartingale or supermartingale) iff E[d_{n+1}|F_n] = 0 (≥ 0 or ≤ 0 respectively) for all n ∈ N.

Proof. We prove the assertions for martingales only, the others all being similar. Clearly X is a martingale iff

0 = E[X_{n+1}|F_n] − X_n = E[X_{n+1} − X_n|F_n] = E[d_{n+1}|F_n].

The second assertion is an easy consequence of the first.

Example 13.7. Suppose that {Y_n}_{n=1}^∞ is any adapted sequence of integrable random variables, X_0 is any integrable F_0 – measurable random variable, and

X_n := X_0 + ∑_{k=1}^n (Y_k − E[Y_k|F_{k−1}]) for n ≥ 1.

Then {X_n}_{n=0}^∞ is a martingale. Indeed, Δ_nX = Y_n − E[Y_n|F_{n−1}], and hence

E[Δ_nX|F_{n−1}] = E[Y_n − E[Y_n|F_{n−1}]|F_{n−1}] = E[Y_n|F_{n−1}] − E[Y_n|F_{n−1}] = 0 for n ≥ 1.

Example 13.8. Suppose that {Z_n}_{n=0}^∞ is a sequence of independent integrable random variables, X_n = Z_0 ⋯ Z_n, and F_n := σ(Z_0, …, Z_n). (Observe that E|X_n| = ∏_{k=0}^n E|Z_k| < ∞.) Since

E[X_{n+1}|F_n] = E[X_n Z_{n+1}|F_n] = X_n E[Z_{n+1}|F_n] = X_n · E[Z_{n+1}] a.s.,

it follows that {X_n}_{n=0}^∞ is a martingale if EZ_n = 1. If we further assume, for all n, that Z_n ≥ 0 so that X_n ≥ 0, then {X_n}_{n=0}^∞ is a supermartingale (submartingale) provided EZ_n ≤ 1 (EZ_n ≥ 1) for all n.

Let us specialize the above example even more by taking each Z_n distributed as p + U, where p ≥ 0 and U is uniformly distributed on [0, 1]. In this case we have, by the strong law of large numbers, that

(1/n) ln X_n = (1/n) ∑_{k=0}^n ln Z_k → E[ln(p + U)] a.s. (13.3)

An elementary computation shows

E[ln(p + U)] = ∫_0^1 ln(p + x) dx = ∫_p^{p+1} ln x dx = (x ln x − x)|_{x=p}^{x=p+1} = (p + 1) ln(p + 1) − p ln p − 1.

The function f(p) := E[ln(p + U)] has a zero at p = p_c ≅ 0.54221, with f(p) < 0 for p < p_c while f(p) > 0 for p > p_c, see Figure 13.1. (Fig. 13.1: the graph of E[ln(p + U)] as a function of p; this function has a zero at p = p_c ≅ 0.54221.) Combining these observations with Eq. (13.3) implies

X_n → 0 if p < p_c, X_n → ∞ if p > p_c, and X_n → ? if p = p_c, a.s., since X_n behaves like exp(n E[ln(p + U)]) by Eq. (13.3).

Notice that EZ_n = p + 1/2, and therefore X_n is a martingale precisely when p = 1/2 and is a submartingale for p > 1/2. So for 1/2 < p < p_c, {X_n}_{n=1}^∞ is a positive submartingale with EX_n = (p + 1/2)^{n+1} → ∞, and yet lim_{n→∞} X_n = 0 a.s.
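The critical value p_c is easy to reproduce numerically; here is a bisection sketch assuming only the formula for E[ln(p + U)] derived above:

    import math

    def f(p):   # E[ln(p+U)] = (p+1)ln(p+1) - p ln p - 1 (with 0 ln 0 = 0)
        return (p + 1) * math.log(p + 1) - (p * math.log(p) if p > 0 else 0.0) - 1

    lo, hi = 0.1, 1.0                     # f(0.1) < 0 < f(1)
    for _ in range(60):                   # bisect to find the zero p_c
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    print((lo + hi) / 2)                  # ~0.54221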

Definition 13.9. We say {C_n: Ω → S}_{n=1}^∞ is predictable or previsible if each C_n is F_{n−1} – measurable for all n ∈ N, i.e. C_n is a function of {X_k, Y_k, Z_k, …}_{k=0}^{n−1}.

Lemma 13.10 (Doob Decomposition). Each adapted sequence {Z_n}_{n=0}^∞ of integrable random variables has a unique decomposition

Z_n = M_n + A_n (13.4)

where {M_n}_{n=0}^∞ is a martingale and A_n is a predictable process such that A_0 = 0. Moreover this decomposition is given by A_0 = 0,

A_n := ∑_{k=1}^n E_{F_{k−1}}[Δ_kZ] for n ≥ 1 (13.5)

and

M_n = Z_n − A_n = Z_n − ∑_{k=1}^n E_{F_{k−1}}[Δ_kZ] (13.6)
= Z_0 + ∑_{k=1}^n (Z_k − E_{F_{k−1}}Z_k). (13.7)

In particular, {Z_n}_{n=0}^∞ is a submartingale (supermartingale) iff A_n is increasing (decreasing) almost surely.

Proof. Assuming Z_n has a decomposition as in Eq. (13.4), then

E_{F_n}[Δ_{n+1}Z] = E_{F_n}[Δ_{n+1}M + Δ_{n+1}A] = Δ_{n+1}A, (13.8)

wherein we have used that M is a martingale and A is predictable, so that E_{F_n}[Δ_{n+1}M] = 0 and E_{F_n}[Δ_{n+1}A] = Δ_{n+1}A. Hence we must define, for n ≥ 1,

A_n := ∑_{k=1}^n Δ_kA = ∑_{k=1}^n E_{F_{k−1}}[Δ_kZ],

which is a predictable process. This proves the uniqueness of the decomposition and the validity of Eq. (13.5). Moreover, if Z is a submartingale then E_{F_{k−1}}[Δ_kZ] ≥ 0, from which it follows that {A_n}_{n=0}^∞ is increasing.

For existence, we define A_n by Eq. (13.5) and M_n := Z_n − A_n as in Eq. (13.6). From Eq. (13.5) we have Δ_nA = E[Δ_nZ|F_{n−1}], and therefore

E[Δ_nM|F_{n−1}] = E[Δ_nZ − Δ_nA|F_{n−1}] = E[Δ_nZ|F_{n−1}] − E[Δ_nZ|F_{n−1}] = 0,

which shows that M is indeed a martingale.

Exercise 13.3. Suppose that {Z_n}_{n=0}^∞ are independent random variables such that σ² := EZ_n² < ∞ and EZ_n = 0 for all n ≥ 1. As in Example 13.2, let S_n := ∑_{k=0}^n Z_k be the martingale relative to the filtration F_n^Z := σ(Z_0, …, Z_n). Show

M_n := S_n² − σ²n for n ∈ N_0

is a martingale. [Hint: make use of the basic properties of conditional expectations.]

Exercise 13.4 (Quadratic Variation). Suppose {M_n}_{n=0}^∞ is a square integrable martingale. Show:

1. E[M²_{n+1} − M²_n|F_n] = E[(M_{n+1} − M_n)²|F_n]. Conclude from this that the Doob decomposition of M²_n is of the form

M²_n = N_n + A_n,

where

A_n := ∑_{1≤k≤n} E[(Δ_kM)²|F_{k−1}] = ∑_{1≤k≤n} E[(M_k − M_{k−1})²|F_{k−1}].

In particular we see from this that M²_n is a submartingale.

2. Show that N_n above may be expressed as N_0 = M²_0 and, for n ≥ 1,

N_n = M²_0 + 2 ∑_{k=1}^n M_{k−1} Δ_kM + ∑_{k=1}^n ((Δ_kM)² − E[(Δ_kM)²|F_{k−1}]).

[From Example 13.7 and Proposition 13.29 below one may see directly that the expression above is a martingale.]

3. If we further assume that M_k − M_{k−1} is independent of F_{k−1} for all k = 1, 2, …, explain why

A_n = ∑_{1≤k≤n} E(M_k − M_{k−1})².

Exercise 13.5. Suppose that {X_n}_{n=0}^∞ is a Markov chain on S with one step transition probabilities P, and F_n := σ(X_0, …, X_n). Let f: S → R be a bounded (for simplicity) function and set g(x) := (Pf)(x) − f(x). Show

M_n := f(X_n) − ∑_{0≤j<n} g(X_j) for n ∈ N_0

is a martingale.

Proposition 13.11 (Markov Chains and Martingales). Suppose that S is a finite or countable set, P is a Markov matrix on S, and {X_n}_{n=0}^∞ is the associated Markov chain. If f: S → R is a function such that either f ≥ 0 or E|f(X_n)| < ∞ for all n ∈ N_0, and Pf = f (Pf ≤ f), then Z_n = f(X_n) is a martingale (supermartingale). More generally, if f: N_0 × S → R is such that either f ≥ 0 or E|f(n, X_n)| < ∞ for all n ∈ N_0, and Pf(n+1, ·) = f(n, ·) (Pf(n+1, ·) ≤ f(n, ·)), then Z_n := f(n, X_n) is a martingale (supermartingale).

Proof. Using the Markov property and the definition of P we find²

E[Z_{n+1}|F_n] = E[f(n+1, X_{n+1})|F_n] = E[f(n+1, X_{n+1})|X_n] = [Pf(n+1, ·)](X_n).

The latter expression is equal (respectively, less than or equal) to Z_n if Pf(n+1, ·) = f(n, ·) (respectively, Pf(n+1, ·) ≤ f(n, ·)) for all n ≥ 0.

² Recall from Proposition 3.10 that E[f(n+1, X_{n+1})|X_n] = h(X_n), where

h(x) = E[f(n+1, X_{n+1})|X_n = x] = ∑_{y∈S} p(x, y) f(n+1, y) = [Pf(n+1, ·)](x).

Remark 13.12. One way to find solutions to the equation Pf(n+1, ·) = f(n, ·), at least for 0 ≤ n ≤ T for some T < ∞, is to let g: S → R be an arbitrary function and T ∈ N be given, and then define

f(n, y) := (P^{T−n}g)(y) for 0 ≤ n ≤ T.

Then Pf(n+1, ·) = P(P^{T−n−1}g) = P^{T−n}g = f(n, ·), and we will have that

Z_n = f(n, X_n) = (P^{T−n}g)(X_n)

is a martingale for 0 ≤ n ≤ T. If f(n, ·) satisfies Pf(n+1, ·) = f(n, ·) for all n, then we must have, with f_0 := f(0, ·),

f(n, ·) = P^{−n}f_0,

where P^{−1}g denotes a function h solving Ph = g. In general P is not invertible, and hence there may be no solution to Ph = g, or there might be many solutions.

Example 13.13. Let S = Z, S_n = X_0 + X_1 + ⋯ + X_n, where {X_i}_{i=1}^∞ are i.i.d. with P(X_i = 1) = p ∈ (0, 1) and P(X_i = −1) = q := 1 − p, and X_0 is an S – valued random variable independent of {X_i}_{i=1}^∞. Recall that {S_n}_{n=0}^∞ is a time homogeneous Markov chain with transition kernel determined by Pf(x) = pf(x+1) + qf(x−1). As we have seen, if f(x) = a + b(q/p)^x, then Pf = f, and therefore, by Proposition 13.11,

M_n = a + b(q/p)^{S_n}

is a martingale for all a, b ∈ R. This is easily verified directly as well:

E_{F_n}(q/p)^{S_{n+1}} = E_{F_n}(q/p)^{S_n + X_{n+1}} = (q/p)^{S_n} E_{F_n}(q/p)^{X_{n+1}} = (q/p)^{S_n} E(q/p)^{X_{n+1}}
= (q/p)^{S_n} · [(q/p)^1 p + (q/p)^{−1} q] = (q/p)^{S_n} · [q + p] = (q/p)^{S_n}.

Example 13.14. We can generalize the above example by observing, for λ ≠ 0, that P[x ↦ λ^x] = (pλ + qλ^{−1}) λ^x. Thus it follows that we may set P^{−1}λ^x = (pλ + qλ^{−1})^{−1} λ^x, and therefore conclude that

f(n, x) := P^{−n}λ^x = (pλ + qλ^{−1})^{−n} λ^x

satisfies Pf(n+1, ·) = f(n, ·). So if we suppose that X_0 is bounded, so that S_n is bounded for all n, then using Proposition 13.11 it follows that

{M_n = f(n, S_n) = (pλ + qλ^{−1})^{−n} λ^{S_n}}_{n≥0}

is a martingale for all λ ≠ 0. To recover the special case in Example 13.13, we choose λ so that pλ + qλ^{−1} = 1. This quadratic equation has 1 and q/p = (1 − p)/p as solutions, as one easily verifies.

Exercise 13.6. For θ ∈ R let

f_θ(n, x) := P^{−n}e^{θx} = (pe^θ + qe^{−θ})^{−n} e^{θx},

so that Pf_θ(n+1, ·) = f_θ(n, ·) for all θ ∈ R. Compute:

1. f_θ^{(k)}(n, x) := (d/dθ)^k f_θ(n, x) for k = 1, 2.
2. Use your results to show

M_n^{(1)} := S_n − n(p − q) and M_n^{(2)} := (S_n − n(p − q))² − 4npq

are martingales.

(If you are ambitious you might also find M_n^{(3)}.)

Example 13.15 (Polya urns). Let us consider an urn that contains red and greenballs. Let (Rn, Gn) denote the number of red and green balls (respectively) inthe urn at time n. If at a given time r red balls and g green balls at a given timewe draw one of these balls at random and replace it and add c more balls of thesame color drawn. So (Rn, Gn) is a Markov process with transition probabilitiesdetermined by,

P ((Rn+1, Gn+1) = (r + c, g) | (Rn, Gn) = (r, g)) =r

r + gand

P ((Rn+1, Gn+1) = (r, g + c) | (Rn, Gn) = (r, g)) =g

r + g. (13.9)

We now let

F_n := σ((R_k, G_k) : k ≤ n) = σ(X_k : k ≤ n)

and recall that by the Markov property, along with Proposition 3.10 (the basic formula for computing conditional expectations), we have

E[f(R_{n+1}, G_{n+1}) | F_n] = E[f(R_{n+1}, G_{n+1}) | (R_n, G_n)] = h(R_n, G_n),

where

h(r, g) := E[f(R_{n+1}, G_{n+1}) | (R_n, G_n) = (r, g)] = (r/(r + g)) f(r + c, g) + (g/(r + g)) f(r, g + c).

Thus we have shown

E[f(R_{n+1}, G_{n+1}) | F_n] = (R_n/(R_n + G_n)) f(R_n + c, G_n) + (G_n/(R_n + G_n)) f(R_n, G_n + c).   (13.10)

Next let us observe that R_n + G_n = R_0 + G_0 + nc, and hence if we let X_n be the fraction of green balls in the urn at time n, then

X_n := G_n/(R_n + G_n) = G_n/(R_0 + G_0 + nc).

We now claim that {X_n}_{n=0}^∞ is an F_n – martingale. Indeed, using Eq. (13.10) with f(r, g) = g/(r + g) we learn

E[X_{n+1} | F_n] = E[X_{n+1} | (R_n, G_n)]
  = (R_n/(R_n + G_n)) · G_n/(R_n + c + G_n) + (G_n/(R_n + G_n)) · (G_n + c)/(R_n + G_n + c)
  = (G_n/(R_n + G_n)) · (R_n + G_n + c)/(R_n + G_n + c) = X_n.
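As a quick numerical sanity check of the martingale property E[X_{n+1} | F_n] = X_n (a minimal sketch, not part of the notes; the initial data r = g = c = 1 are arbitrary):

    import numpy as np

    rng = np.random.default_rng(1)
    r0, g0, c, n_steps, n_paths = 1, 1, 1, 50, 100_000

    R = np.full(n_paths, float(r0))
    G = np.full(n_paths, float(g0))
    for _ in range(n_steps):
        # Draw green with probability G/(R+G) and add c balls of the drawn color.
        green = rng.random(n_paths) < G / (R + G)
        G += c * green
        R += c * ~green

    X = G / (R + G)
    print(X.mean())   # approximately g0/(r0+g0) = 0.5, since E[X_n] = E[X_0]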


Remark 13.16. The previous example is in fact a special case of Proposition 13.11. From Eq. (13.9) (or see Eq. (13.10)) the Markov matrix P associated to the chain in Example 13.15 is determined by

(Pf)(r, g) = (r/(r + g)) f(r + c, g) + (g/(r + g)) f(r, g + c).

From this expression with f(r, g) = g/(r + g) we find

(Pf)(r, g) = (r/(r + g)) · g/(r + c + g) + (g/(r + g)) · (g + c)/(r + c + g)
  = (1/((r + g)(r + c + g))) · [rg + g(g + c)] = g/(r + g) = f(r, g).

An application of Proposition 13.11 now implies that X_n = f(R_n, G_n) = G_n/(R_n + G_n) is necessarily a martingale.

13.2 Jensen's and Hölder's Inequalities

Theorem 13.17 (Jensen's Inequality). Suppose that (Ω, P) is a probability space, −∞ ≤ a < b ≤ ∞, and ϕ : (a, b) → R is a convex function (i.e. ϕ″(x) ≥ 0 on (a, b)). If f : Ω → (a, b) is a random variable with E|f| < ∞, then

ϕ(Ef) ≤ E[ϕ(f)].

[Here it may happen that ϕ ∘ f ∉ L^1(P); in this case it is claimed that ϕ ∘ f is integrable in the extended sense and E[ϕ(f)] = ∞.]

Proof. Let t = Ef ∈ (a, b) and let β ∈ R (β = ϕ′(t) when ϕ′(t) exists) be such that

ϕ(s) − ϕ(t) ≥ β(s − t) for all s ∈ (a, b),   (13.11)

see Figure 13.2. Then integrating the inequality ϕ(f) − ϕ(t) ≥ β(f − t) implies that

0 ≤ Eϕ(f) − ϕ(t) = Eϕ(f) − ϕ(Ef).

Moreover, if ϕ(f) is not integrable, then ϕ(f) ≥ ϕ(t) + β(f − t), which shows that the negative part of ϕ(f) is integrable. Therefore, Eϕ(f) = ∞ in this case.

Example 13.18. Since e^x for x ∈ R, −ln x for x > 0, and x^p for x ≥ 0 and p ≥ 1 are all convex functions, we have the following inequalities:

exp(Ef) ≤ E e^f,   (13.12)

E log(|f|) ≤ log(E|f|),

and for p ≥ 1,

|Ef|^p ≤ (E|f|)^p ≤ E|f|^p.

Fig. 13.2. A convex function, ϕ, along with a chord and a tangent line. Notice that the tangent line is always below ϕ and the chord lies above ϕ between the points of intersection of the chord with the graph of ϕ.

Example 13.19. As a special case of Eq. (13.12), if p_i, s_i > 0 for i = 1, 2, ..., n and Σ_{i=1}^n 1/p_i = 1, then

s_1 ⋯ s_n = e^{Σ_{i=1}^n ln s_i} = e^{Σ_{i=1}^n (1/p_i) ln s_i^{p_i}} ≤ Σ_{i=1}^n (1/p_i) e^{ln s_i^{p_i}} = Σ_{i=1}^n s_i^{p_i}/p_i.   (13.13)

Indeed, we have applied Eq. (13.12) with Ω = {1, 2, ..., n}, µ = Σ_{i=1}^n (1/p_i) δ_i, and f(i) := ln s_i^{p_i}. As a special case of Eq. (13.13), suppose that s, t, p, q ∈ (1, ∞) with q = p/(p − 1) (i.e. 1/p + 1/q = 1); then

st ≤ (1/p) s^p + (1/q) t^q.   (13.14)

(When p = q = 2, the inequality in Eq. (13.14) follows from the inequality 0 ≤ (s − t)^2.)

As another special case of Eq. (13.13), take p_i = n and s_i = a_i^{1/n} with a_i > 0; then we get the arithmetic-geometric mean inequality,

(a_1 ⋯ a_n)^{1/n} ≤ (1/n) Σ_{i=1}^n a_i.   (13.15)

Example 13.20. Let (Ω, P) be a probability space, 0 < p < q < ∞, and f : Ω → C be a measurable function. Then by Jensen's inequality,

(E|f|^p)^{q/p} ≤ E(|f|^p)^{q/p} = E|f|^q,

from which it follows that ‖f‖_p ≤ ‖f‖_q. In particular, L^q(P) ⊂ L^p(P) for all 0 < p < q < ∞.


Theorem 13.21 (Hölder's inequality). Suppose that 1 ≤ p ≤ ∞ and q := p/(p − 1), or equivalently p^{−1} + q^{−1} = 1. If f and g are measurable functions, then

‖fg‖_1 ≤ ‖f‖_p · ‖g‖_q.   (13.16)

Assuming p ∈ (1, ∞) and ‖f‖_p · ‖g‖_q < ∞, equality holds in Eq. (13.16) iff |f|^p and |g|^q are linearly dependent as elements of L^1, which happens iff

|g|^q ‖f‖_p^p = ‖g‖_q^q |f|^p a.s.   (13.17)

Proof. The cases p = 1 and q = ∞ or p = ∞ and q = 1 are easy to deal with and are left to the reader. So we now assume that p, q ∈ (1, ∞). If ‖f‖_p = 0 or ∞ or ‖g‖_q = 0 or ∞, Eq. (13.16) is again easily verified. So we will now assume that 0 < ‖f‖_p, ‖g‖_q < ∞. Taking s = |f|/‖f‖_p and t = |g|/‖g‖_q in Eq. (13.14) gives

|fg|/(‖f‖_p ‖g‖_q) ≤ (1/p) |f|^p/‖f‖_p^p + (1/q) |g|^q/‖g‖_q^q,   (13.18)

with equality iff |g|/‖g‖_q = |f|^{p−1}/‖f‖_p^{p−1} = |f|^{p/q}/‖f‖_p^{p/q}, i.e. |g|^q ‖f‖_p^p = ‖g‖_q^q |f|^p. Integrating Eq. (13.18) implies

‖fg‖_1/(‖f‖_p ‖g‖_q) ≤ 1/p + 1/q = 1,

with equality iff Eq. (13.17) holds. The proof is finished since it is easily checked that equality holds in Eq. (13.16) when |f|^p = c|g|^q or |g|^q = c|f|^p for some constant c.

Theorem 13.22 (Conditional Jensen's inequality). Let (Ω, P) be a probability space, −∞ ≤ a < b ≤ ∞, and ϕ : (a, b) → R be a convex function. Assume f ∈ L^1(Ω, P) is a random variable satisfying f ∈ (a, b) a.s. and ϕ(f) ∈ L^1(Ω, P). Then for any sigma-algebra (of information) G on Ω,

ϕ(E_G f) ≤ E_G[ϕ(f)] a.s.   (13.19)

and

E[ϕ(E_G f)] ≤ E[ϕ(f)],   (13.20)

where E_G f := E[f | G].

Proof. Not completely rigorous proof. Assume that ϕ is C^1, in which case Eq. (13.11) becomes

ϕ(s) − ϕ(t) ≥ ϕ′(t)(s − t) for all s, t ∈ (a, b).   (13.21)

We now take s = f and t = E_G f in this inequality to find

ϕ(f) − ϕ(E_G f) ≥ ϕ′(E_G f)(f − E_G f).

The result now follows by taking E_G of both sides of this inequality to learn

E_G ϕ(f) − ϕ(E_G f) = E_G[ϕ(f) − ϕ(E_G f)] ≥ E_G[ϕ′(E_G f)(f − E_G f)] = ϕ′(E_G f) · E_G[f − E_G f] = 0.

The technical problem with this argument is the justification that E_G[ϕ′(E_G f)(f − E_G f)] = ϕ′(E_G f) E_G[f − E_G f], since there is no reason for ϕ′ to be a bounded function. The honest proof below circumvents this technical detail.

*Honest proof. Let Λ := Q ∩ (a, b) – a countable dense subset of (a, b). By the basic properties of convex functions,

ϕ(s) ≥ ϕ(t) + ϕ′₋(t)(s − t) for all t, s ∈ (a, b),   (13.22)

where ϕ′₋(t) is the left hand derivative of ϕ at t. Taking s = f and then taking conditional expectations imply

E_G[ϕ(f)] ≥ E_G[ϕ(t) + ϕ′₋(t)(f − t)] = ϕ(t) + ϕ′₋(t)(E_G f − t) a.s.   (13.23)

Since this is true for all t ∈ (a, b) (and hence all t in the countable set Λ) we may conclude that

E_G[ϕ(f)] ≥ sup_{t∈Λ} [ϕ(t) + ϕ′₋(t)(E_G f − t)] a.s.

Since E_G f ∈ (a, b), we may conclude that

sup_{t∈Λ} [ϕ(t) + ϕ′₋(t)(E_G f − t)] = ϕ(E_G f) a.s.

Combining the last two estimates proves Eq. (13.19).

From Eq. (13.19) and Eq. (13.22) with s = E_G f and t ∈ (a, b) fixed we find

ϕ(t) + ϕ′₋(t)(E_G f − t) ≤ ϕ(E_G f) ≤ E_G[ϕ(f)].   (13.24)

Therefore

|ϕ(E_G f)| ≤ |E_G[ϕ(f)]| ∨ |ϕ(t) + ϕ′₋(t)(E_G f − t)| ∈ L^1(Ω, G, P),   (13.25)

which implies that ϕ(E_G f) ∈ L^1(Ω, G, P). Taking expectations of Eq. (13.19) is now allowed and immediately gives Eq. (13.20).


Corollary 13.23. If {X_n}_{n=0}^∞ is a martingale and ϕ is a convex function such that E[ϕ(X_n)] < ∞ for all n, then {ϕ(X_n)}_{n=0}^∞ is a sub-martingale. [Typical examples for ϕ are ϕ(x) = |x|^p for p ≥ 1.]

Proof. As {X_n}_{n=0}^∞ is a martingale, X_n = E_{F_n} X_{n+1}. Applying ϕ to both sides of this equation and then using the conditional form of Jensen's inequality then shows

ϕ(X_n) = ϕ(E_{F_n} X_{n+1}) ≤ E_{F_n}[ϕ(X_{n+1})].

13.3 Stochastic Integrals and Optional Stopping

Notation 13.24 Suppose that {c_n}_{n=1}^∞ and {x_n}_{n=0}^∞ are two sequences of numbers. Let c·Δx = {(c·Δx)_n}_{n∈N_0} denote the sequence of numbers defined by (c·Δx)_0 = 0 and

(c·Δx)_n = Σ_{j=1}^n c_j (x_j − x_{j−1}) = Σ_{j=1}^n c_j Δ_j x for n ≥ 1.

(For convenience of notation later we will interpret Σ_{j=1}^0 c_j Δ_j x = 0.)

Interpretation. Suppose that x_j represents the value of a stock on the time interval [j, j+1).^3 For j ∈ N, let c_j denote the amount of stock you hold in the interval (j−1, j], so that during this period you make a profit (or loss) of c_j Δ_j x = c_j x_j − c_j x_{j−1}; therefore (c·Δx)_n = Σ_{j=1}^n c_j Δ_j x represents your profit (or loss) from time 0 to time n. For example, if you want to buy 5 shares of the stock just after time n = 3 and then sell them all just after time n = 9, you would take c_k = 5 · 1_{3<k≤9}, so that

(c·Δx)_9 = 5 · Σ_{3<k≤9} Δ_k x = 5 · (x_9 − x_3)

would represent your profit (or loss) for this transaction. The next example formalizes this observation.

^3 Stocks only change price at integer times in this model.

Example 13.25 (Coin flip game). Suppose that {Z_n}_{n=1}^∞ are i.i.d. Bernoulli random variables with P(Z_n = ±1) = 1/2. Think of Z_n = 1 or −1 depending on whether the nth flip of a coin is heads or tails respectively. Suppose that C_n is your bet on the nth coin flip; you lose your bet if Z_n = −1 and you double your bet if Z_n = +1. If {W_n^C}_{n=0}^∞ represents your wealth just after the nth flip of the coin, then Δ_n W^C = W_n^C − W_{n−1}^C = C_n Z_n. [If Z_n = 1 your wealth goes up by C_n, which represents your net winning of 2C_n (double your bet) minus C_n (your bet) placed before the nth flip.] Thus

W_n^C = W_0 + C_1 Z_1 + ⋯ + C_n Z_n.

If we let X_n := W_n^1 (i.e. take C_n = 1, so that we bet $1 on every game), we will have

X_n = X_0 + Z_1 + ⋯ + Z_n.

With this notation we see that

W_n^C = W_0 + [C·ΔX]_n.

[This example is generalized in Example 13.30 below.]
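That a predictable betting strategy cannot change the fairness of the game is easy to see in simulation. The following minimal sketch (not part of the notes) uses the arbitrary predictable rule "bet $1 exactly when the previous flip was tails":

    import numpy as np

    rng = np.random.default_rng(2)
    n_steps, n_paths = 30, 200_000

    Z = rng.choice([1, -1], size=(n_paths, n_steps))

    # C_n is predictable: it may depend only on Z_1, ..., Z_{n-1}.
    # Here C_1 = 1 and, for n >= 2, C_n = 1 exactly when Z_{n-1} = -1.
    C = np.ones((n_paths, n_steps))
    C[:, 1:] = (Z[:, :-1] == -1)

    W = np.cumsum(C * Z, axis=1)   # W_n - W_0 = (C . dX)_n with dX_n = Z_n
    print(W.mean(axis=0))          # approximately 0 for every n

This is exactly the content of Proposition 13.29 below: transforming a martingale by a predictable process yields a martingale.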

Example 13.26. If σ, τ ∈ N_0 with 0 ≤ σ ≤ τ and c_n := 1_{σ<n≤τ}, then

(c·Δx)_n = x_{τ∧n} − x_{σ∧n} for all n ∈ N_0.

[This is easily proved by considering three cases: n ≤ σ, σ ≤ n ≤ τ, and n ≥ τ.] More generally, if σ, τ ∈ N_0 are arbitrary and c_n := 1_{σ<n≤τ}, then c_n = 1_{σ∧τ<n≤τ} and therefore

(c·Δx)_n = x_{τ∧n} − x_{σ∧τ∧n}.

Notation 13.27 (Stochastic intervals) If σ, τ : Ω → N, let

(σ, τ] := {(ω, n) ∈ Ω × N : σ(ω) < n ≤ τ(ω)},

and we will write 1_{(σ,τ]} for the process 1_{σ<n≤τ}.

Lemma 13.28. If τ is a stopping time, then the processes f_n := 1_{τ≤n} and f_n := 1_{τ=n} are adapted, and f_n := 1_{τ<n} and f_n = 1_{n≤τ} are predictable. Moreover, if σ and τ are two stopping times, then f_n := 1_{σ<n≤τ} is predictable.

Proof. These are all trivial to prove. For example, if f_n := 1_{σ<n≤τ}, then f_n is F_{n−1} measurable since

{σ < n ≤ τ} = {σ < n} ∩ {n ≤ τ} = {σ ≤ n − 1} ∩ {τ ≤ n − 1}^c ∈ F_{n−1}.

Proposition 13.29 (The Discrete Stochastic Integral). Let X = {X_n}_{n=0}^∞ be an adapted integrable process, i.e. E|X_n| < ∞ for all n. If X is a martingale and {C_n}_{n=1}^∞ is a predictable (as in Definition 13.9) sequence of bounded random variables, then {(C·ΔX)_n}_{n=1}^∞ is still a martingale. If X := {X_n}_{n=0}^∞ is a submartingale (supermartingale) (necessarily real valued) and C_n ≥ 0, then {(C·ΔX)_n}_{n=1}^∞ is a submartingale (supermartingale). (In other words, X is a sub-martingale if, no matter what your (non-negative) betting strategy is, you will make money on average.)


Proof. For any adapted process X, we have

E[(C·ΔX)_{n+1} | F_n] = E[(C·ΔX)_n + C_{n+1}(X_{n+1} − X_n) | F_n]
  = (C·ΔX)_n + C_{n+1} E[(X_{n+1} − X_n) | F_n],   (13.26)

from which the result easily follows.

Remark: For n = 1 we have

E[(C·ΔX)_1 | F_0] = C_1 · E[Δ_1 X | F_0] = 0 if X is a mtg, and ≥ 0 if X is a submtg,

and this explains why taking (C·ΔX)_0 = 0 is the correct choice here, given that

(C·ΔX)_n = Σ_{k=1}^n C_k Δ_k X when n ≥ 1.

Example 13.30. Suppose that {X_n}_{n=0}^∞ are mean zero independent integrable random variables and f_k : R^k → R are bounded measurable functions for k ∈ N. Then {Y_n}_{n=0}^∞, defined by Y_0 = 0 and

Y_n := Σ_{k=1}^n f_k(X_0, ..., X_{k−1}) X_k for n ∈ N,   (13.27)

is a martingale sequence relative to {F_n^X}_{n≥0}.

Notation 13.31 Given an {F_n} – adapted process X and an {F_n} – stopping time τ, let X_n^τ := X_{τ∧n}. We call X^τ := {X_n^τ}_{n=0}^∞ the process X stopped by τ.

Theorem 13.32 (Optional stopping theorem). Suppose X = {X_n}_{n=0}^∞ is a supermartingale, martingale, or submartingale with E|X_n| < ∞ for all n. Then for every stopping time τ, X^τ is an {F_n}_{n=0}^∞ – supermartingale, martingale, or submartingale respectively.

Proof. If we define C_k := 1_{k≤τ}, then

[C·ΔX]_n = Σ_{k=1}^n 1_{k≤τ} Δ_k X = Σ_{k=1}^{n∧τ} Δ_k X = X_{n∧τ} − X_0.

This shows that

X_n^τ = X_0 + [C·ΔX]_n,

and the result now follows from Proposition 13.29.

Corollary 13.33. Suppose X = {X_n}_{n=0}^∞ is a martingale and τ is a stopping time such that P(τ = ∞) = 0. If either E[sup_n |X_n|] < ∞ or there exists C < ∞ such that τ ≤ C, then

E X_τ = E X_0.   (13.28)

Proof. From Theorem 13.32 we know that

E X_0 = E[X_n^τ] = E[X_{τ∧n}] for all n ∈ N.

• If we assume that τ ≤ C, then by taking n > C we learn that Eq. (13.28) holds.
• If we assume E[sup_n |X_n|] < ∞, then by the dominated convergence theorem (see Section 1.1),

E X_0 = lim_{n→∞} E[X_{τ∧n}] = E[lim_{n→∞} X_{τ∧n}] = E[X_τ].

Corollary 13.34 (Martingale Expectations). Suppose that {X_n}_{n=0}^∞ is a martingale and τ is a stopping time so that P(τ = ∞) = 0, E|X_τ| < ∞, and

lim_{n→∞} E[|X_n| : τ > n] = 0;   (13.29)

then E X_τ = E X_0.

Proof. We have

X_τ = X_{τ∧n} + (X_τ − X_n) 1_{τ>n},

and hence

E[X_τ] = E[X_{τ∧n}] + ε_n = E X_0 + ε_n,

where

|ε_n| = |E[(X_τ − X_n) 1_{τ>n}]| ≤ E[|X_τ| · 1_{τ>n}] + E[|X_n| · 1_{τ>n}].

The first term on the right side above goes to 0 as n → ∞ by DCT, while the second term goes to zero by the assumption in Eq. (13.29); so ε_n → 0 as n → ∞ and hence E X_τ = E X_0.

Example 13.35. One way to "beat" the coin flip game in Example 13.25 is to use the double or nothing betting strategy. Our strategy for making M > 0 dollars is to bet C_k = 2^{k−1}M on the kth flip, i.e. we start by betting $M and then continually double our bet at each successive flip. We will quit the first time that a heads is flipped. To analyze this strategy, let

T = inf{k ≥ 1 : Z_k = 1} = inf{k ≥ 1 : X_k − X_{k−1} = 1}.

Notice that T is an F^Z – stopping time since

{T = n} = {Z_1 = −1, ..., Z_{n−1} = −1, Z_n = 1} ∈ F_n^Z for all n,

and

P(T > n) = P(Z_1 = −1, ..., Z_{n−1} = −1, Z_n = −1) = (1/2)^n,

so that P(T = ∞) = lim_{n→∞} P(T > n) = 0. Our wealth, W_n, at time n will be

W_n = W_0 + M Σ_{k=1}^n 2^{k−1} Z_k,

where W_0 is our initial wealth. [For the moment we suppose we have good credit with the casino and so our wealth is allowed to go negative.] The process {W_n}_{n=0}^∞ is a martingale, and on the event {T > n},

W_n − W_0 = −M Σ_{k=1}^n 2^{k−1} = −M(2^n − 1),

while on the event {T = n},

W_n − W_0 = W_n − W_{n−1} + W_{n−1} − W_0 = M[2^{n−1} − (2^{n−1} − 1)] = M.

Thus we have shown that W_T = W_0 + M, and so we have indeed won $M. It clearly now follows that

E W_T = E W_0 + M ≠ E W_0,

even though E W_n = E W_0 for all n. This example shows that the assumption in Eq. (13.29) of Corollary 13.34 cannot be eliminated in general. Here we have

E[|W_n − W_0| : T > n] = M(2^n − 1) · (1/2)^n = M[1 − (1/2)^n] → M as n → ∞,

showing explicitly that Eq. (13.29) does not hold.
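A simulation sketch of the doubling strategy (not part of the notes): it exhibits E[W_T − W_0] = M while E[W_n − W_0] = 0 for each fixed n. The horizon n_max = 40 makes P(T > n_max) = 2^{−40} negligible.

    import numpy as np

    rng = np.random.default_rng(3)
    M, n_max, n_paths = 1.0, 40, 200_000

    Z = rng.choice([1, -1], size=(n_paths, n_max))
    bets = M * 2.0 ** np.arange(n_max)      # C_k = M 2^(k-1), k = 1, 2, ...
    T = np.argmax(Z == 1, axis=1)           # index of the first head

    # Wealth increments C_k Z_k, frozen after the stopping time T.
    active = np.arange(n_max) <= T[:, None]
    stopped_gain = (bets * Z * active).sum(axis=1)     # W_T - W_0
    fixed_time_gain = (bets * Z).cumsum(axis=1)[:, 5]  # W_6 - W_0

    print(stopped_gain.mean())      # equals M on (essentially) every path
    print(fixed_time_gain.mean())   # approximately 0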

Example 13.36. Let us now suppose, in Example 13.35 above, that our initial wealth is W_0 = M(2^m − 1) and that we have no credit with the casino, so that we must stop betting at any time n where W_n = 0. Keeping the same betting strategy as in Example 13.35, we now let

τ_m = inf{n ≥ 1 : W_n = 0}.

Then τ_m ∧ T is a finite stopping time (a.s.) and 0 ≤ W_n^{T∧τ_m} ≤ W_0 + M for all n ∈ N_0. Therefore Corollary 13.33 is applicable and shows

M(2^m − 1) = E W_0 = E W_{τ_m∧T} = 0 · P(τ_m < T) + [M(2^m − 1) + M] · P(T < τ_m),

so that

P(you win) = P(T < τ_m) = (2^m − 1)/2^m = 1 − 1/2^m.

In hindsight, this result is obvious since you lose precisely when Z_k = −1 for k = 1, 2, ..., m, i.e.

P(you lose) = P(τ_m < T) = (1/2)^m = 1/2^m.

Moral: the more wealth you start out with, the more likely you will win $M before going bust. Nevertheless, if you repeatedly use this strategy over and over again with a fixed m, then your expected winnings are going to be zero!! [Over many trials you will win most of the time and gain $M, but you will also lose on rare occasions at the heavy cost of losing $M(2^m − 1).]

Moral: the more wealth you start out with the more likely you will win $Mbefore going bust. Nevertheless, if you repeatedly use this strategy over andover again with a fixed m, then you expected winnings are going to be zero!![Over many trials you will win most of the time and gain $M but you will alsoloose on rare occasions at a heavy cost of loosing $ M (2m − 1) .]

Example 13.37. Suppose that Xn∞n=0 represents the value of a stock which isknown to be a sub-martingale. At time n − 1 you are allowed buy Cn ∈ [0, 1]shares of the stock which you will then sell at time n. Your net gain (loss) inthis transaction is CnXn − CnXn−1 = Cn∆nX and your wealth at time n willbe

Wn = W0 +

n∑k=1

Ck∆kX.

The next lemma asserts that the way to maximize your expected gain is tochoose Ck = 1 for all k, i.e. buy the maximum amount of stock you can at eachstage. We will refer to this as the all in strategy.

The next lemma gives the optimal strategy (relative to expected return) forbuying stock which is known to be a submartingale.

Lemma 13.38 (*"All In"). If {X_n}_{n=0}^∞ is a sub-martingale and {C_k}_{k=1}^∞ is a previsible process with values in [0, 1], then

E(Σ_{k=1}^n C_k Δ_k X) ≤ E[X_n − X_0],

with equality when C_k = 1 for all k, i.e. the optimal strategy is to go all in.

Proof. Notice that {1 − C_k}_{k=1}^∞ is a previsible non-negative process and therefore, by Proposition 13.29,

E(Σ_{k=1}^n (1 − C_k) Δ_k X) ≥ 0.

Since

X_n − X_0 = Σ_{k=1}^n Δ_k X = Σ_{k=1}^n C_k Δ_k X + Σ_{k=1}^n (1 − C_k) Δ_k X,

it follows that

E[X_n − X_0] = E(Σ_{k=1}^n C_k Δ_k X) + E(Σ_{k=1}^n (1 − C_k) Δ_k X) ≥ E(Σ_{k=1}^n C_k Δ_k X).

Example 13.39 (*Setting the odds). Let S be a finite set (think of the outcomes of a spinner, or dice, or a roulette wheel) and p : S → (0, 1) be a probability function.^4 Let {Z_n}_{n=1}^∞ be random functions with values in S such that p(s) := P(Z_n = s) for all s ∈ S. (Z_n represents the outcome of the nth game.) Also let α : S → [0, ∞) be the house's payoff function, i.e. for each dollar you (the gambler) bet on s ∈ S, the house will pay α(s) dollars back if s is rolled. Further let W : Ω → W be a measurable function into some other measure space, (W, F), which is to represent your random (or not so random) "whims." We now assume that Z_n is independent of (W, Z_1, ..., Z_{n−1}) for each n, i.e. the dice are not influenced by the previous plays or your whims. If we let F_n := σ(W, Z_1, ..., Z_n) with F_0 = σ(W), then we are assuming that Z_n is independent of F_{n−1} for each n ∈ N.

As a gambler, you are allowed to choose, before the nth game is played, the amounts {C_n(s)}_{s∈S} that you want to bet on each of the possible outcomes of the nth game. Assuming that you are not clairvoyant (i.e. cannot see the future), these amounts may be random but must be F_{n−1} – measurable, that is C_n(s) = C_n(W, Z_1, ..., Z_{n−1}, s); i.e. {C_n(s)}_{n=1}^∞ is a "previsible" process (see Definition 13.9 below). Thus if X_0 denotes your initial wealth (assumed to be a non-random quantity) and X_n denotes your wealth just after the nth game is played, then

X_n − X_{n−1} = −Σ_{s∈S} C_n(s) + C_n(Z_n) α(Z_n),

where −Σ_{s∈S} C_n(s) is your total bet on the nth game and C_n(Z_n) α(Z_n) represents the house's payoff to you for the nth game. Therefore it follows that

X_n = X_0 + Σ_{k=1}^n [−Σ_{s∈S} C_k(s) + C_k(Z_k) α(Z_k)],

X_n is F_n – measurable for each n, and

E_{F_{n−1}}[X_n − X_{n−1}] = −Σ_{s∈S} C_n(s) + E_{F_{n−1}}[C_n(Z_n) α(Z_n)]
  = −Σ_{s∈S} C_n(s) + Σ_{s∈S} C_n(s) α(s) p(s)
  = Σ_{s∈S} C_n(s) (α(s) p(s) − 1).

Thus it follows that, no matter the choice of the betting "strategy," {C_n(s) : s ∈ S}_{n=1}^∞, we will have

E_{F_{n−1}}[X_n − X_{n−1}] is ≥ 0 if α(·)p(·) ≥ 1, = 0 if α(·)p(·) = 1, and ≤ 0 if α(·)p(·) ≤ 1,

that is, {X_n}_{n≥0} is a sub-martingale, martingale, or supermartingale depending on whether α·p ≥ 1, α·p = 1, or α·p ≤ 1.

Moral: If the Casino wants to be guaranteed to make money on average, it had better choose α : S → [0, ∞) such that α(s) < 1/p(s) for all s ∈ S. In this case the expected earnings of the gambler will be decreasing, which means the expected earnings of the Casino will be increasing.

^4 To be concrete, take S = {2, ..., 12}, representing the possible values for the sums of the upward pointing faces of two dice. Assuming the dice are independent and fair then determines p : S → (0, 1). For example p(2) = p(12) = 1/36, p(3) = p(11) = 1/18, p(7) = 1/6, etc.

13.4 Martingale Convergence Theorems

Definition 13.40. Suppose X = {X_n}_{n=0}^∞ is a sequence of extended real numbers and −∞ < a < b < ∞. To each N ∈ N, let U_N^X(a, b) denote the number of up-crossings of [a, b], i.e. the number of times that {X_n}_{n=0}^N goes from below a to above b.

Notice that if {C_n}_{n=1}^∞ is the betting strategy of "buy at or below a and sell at or above b," then U_N^X(a, b) counts the completed buy low, sell high transactions entering the gain [C·ΔX]_N (see Eq. (13.31) below). In more detail, C_n = 1 iff X_j ≤ a for some 0 ≤ j < n and, following {X_k}_{k=0}^{n−1} backwards, we have X_{n−k} < b for k ≥ 1 until the first time that X_k ≤ a. In any other scenario, C_n = 0.

Lemma 13.41. Suppose X = {X_n}_{n=0}^∞ is a sequence of extended real numbers such that U_∞^X(a, b) < ∞ for all a, b ∈ Q with a < b. Then X_∞ := lim_{n→∞} X_n exists in R ∪ {±∞}.

Proof. If lim_{n→∞} X_n does not exist in R ∪ {±∞}, then there would exist a, b ∈ Q such that

lim inf_{n→∞} X_n < a < b < lim sup_{n→∞} X_n,

and for this choice of a and b we must have X_n < a and X_n > b infinitely often. Therefore, U_∞^X(a, b) = ∞.


Theorem 13.42 (Doob's Upcrossing Inequality). If {X_n}_{n=0}^∞ is a martingale and −∞ < a < b < ∞, then for all N ∈ N,

E[U_N^X(a, b)] ≤ (1/(b − a)) E[(X_N − a)₋] ≤ (1/(b − a)) E|X_N − a|.   (13.30)

Proof. Let {C_k}_{k=1}^∞ be the "buy at or below a and sell at or above b" strategy explained after Definition 13.40. Our net winnings of this strategy up to time N is [C·ΔX]_N, which satisfies

[C·ΔX]_N ≥ (b − a) U_N^X(a, b) − [a − X_N]₊.   (13.31)

In words, our net gain in buying at or below a and selling at or above b is at least equal to (b − a) times the number of times we buy low and sell high, minus a possible penalty for holding the stock below a at the end of the day. [The worst case scenario for this penalty is that we bought the stock for $a and at time N, X_N < a, so that we are down $(a − X_N).] From Proposition 13.29 we know that E[C·ΔX]_N = 0 for all N, and therefore taking expectations of Eq. (13.31) gives Eq. (13.30).
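The upcrossing count and the bound (13.30) can be checked numerically; here is a small sketch (not part of the notes) for the simple random walk with the arbitrary levels a = −1, b = 1:

    import numpy as np

    rng = np.random.default_rng(4)
    a, b, N, n_paths = -1.0, 1.0, 200, 5_000

    def upcrossings(path, a, b):
        # Count completed moves from (at or) below a to (at or) above b.
        count, below = 0, False
        for x in path:
            if x <= a:
                below = True
            elif x >= b and below:
                count, below = count + 1, False
        return count

    X = np.cumsum(rng.choice([1, -1], size=(n_paths, N)), axis=1)
    U = np.array([upcrossings(path, a, b) for path in X])
    bound = np.abs(X[:, -1] - a).mean() / (b - a)
    print(U.mean(), "<=", bound)   # Doob's upcrossing inequality (13.30)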

Theorem 13.43 (Martingale Convergence Theorem). If {X_n}_{n=0}^∞ is a martingale such that C := sup_n E|X_n| < ∞, then X_∞ = lim_{n→∞} X_n exists a.s. and E|X_∞| ≤ C.

Proof. Sketch. If −∞ < a < b < ∞, it follows from Eq. (13.30) and Fatou's Lemma or the Monotone convergence theorem (see Section 1.1) that

E[U_∞^X(a, b)] = E[lim_{N→∞} U_N^X(a, b)] ≤ lim inf_{N→∞} E[U_N^X(a, b)] ≤ (C + |a|)/(b − a) < ∞.

From this we learn that P(U_∞^X(a, b) = ∞) = 0 for all a, b ∈ Q with a < b, and hence

P(U_∞^X(a, b) < ∞ for all a, b ∈ Q with a < b) = 1.

From this and Lemma 13.41 it follows that lim_{n→∞} X_n = X_∞ exists a.s. as a random variable with values in R ∪ {±∞}. Another application of Fatou's lemma then shows

E|X_∞| = E lim_{n→∞} |X_n| ≤ lim inf_{n→∞} E|X_n| ≤ C < ∞,

which in particular shows X_∞ ∈ R a.s.

Corollary 13.44 (Submartingale convergence theorem). If {X_n}_{n=0}^∞ is a submartingale such that C := sup_n E|X_n| < ∞, then X_∞ = lim_{n→∞} X_n exists a.s. and E|X_∞| ≤ C.

Proof. By Doob's decomposition (Lemma 13.10), we have X_n = M_n + A_n, where {M_n}_{n=0}^∞ is a martingale and

A_n = Σ_{k=1}^n E[Δ_k X | F_{k−1}].

As {X_n}_{n=0}^∞ is a submartingale we know E[Δ_k X | F_{k−1}] ≥ 0, and hence A_n ≥ 0 for all n and A_n is increasing (i.e. non-decreasing) as n increases. Furthermore, by the monotone convergence theorem,

E A_∞ = lim_{n→∞} E A_n = lim_{n→∞} Σ_{k=1}^n E E[Δ_k X | F_{k−1}] = lim_{n→∞} Σ_{k=1}^n E[Δ_k X] = lim_{n→∞} E[X_n − X_0] ≤ 2C.

Thus we may conclude, for all n ∈ N_0, that

E|M_n| = E|X_n − A_n| ≤ E|X_n| + E A_n ≤ E|X_n| + E A_∞ ≤ 3C < ∞.

Therefore lim_{n→∞} M_n = M_∞ exists by Theorem 13.43, and hence

lim_{n→∞} X_n = lim_{n→∞} [M_n + A_n] = M_∞ + A_∞

exists as well.

13.5 Uniform Integrability and Optional stopping

Definition 13.45. A sequence of random functions, {X_n}_{n=0}^∞, is uniformly integrable (U.I. for short) if the tails are uniformly small in the following sense:

ε_K := sup_n E[|X_n| : |X_n| ≥ K] → 0 as K → ∞.

Lemma 13.46. If {X_n} is U.I., then sup_n E[|X_n|] < ∞.

Proof. Fix a K < ∞ such that

ε_K := sup_n E[|X_n| : |X_n| ≥ K] ≤ 1.

For this K it follows that

E[|X_n|] = E[|X_n| : |X_n| ≥ K] + E[|X_n| : |X_n| < K] ≤ 1 + K,

and so

M := sup_n E[|X_n|] ≤ 1 + K < ∞.


Example 13.47. Let X_n = M_n = 2^n 1_{[0,2^{−n}]} for ω ∈ Ω := [0, 1] as in Exercise 13.2, where P is the length measure on Ω. Then E|X_n| = 1 for all n but {X_n} is not uniformly integrable. Indeed, for any K > 0 we will have E[|X_n| : |X_n| ≥ K] = 1 provided 2^n > K, i.e. n > ln K / ln 2.

On the other hand, if 1 < α < 2 and Y_n := α^n 1_{[0,2^{−n}]}, then

sup_n E[|Y_n| : |Y_n| ≥ K] = sup_n (α/2)^n 1_{n > ln K / ln α} ≤ (α/2)^{ln K / ln α} → 0 as K → ∞,

and so {Y_n}_{n=1}^∞ is uniformly integrable. Intuitively, an L^1(P) – bounded sequence, {X_n}_{n=0}^∞, is uniformly integrable provided the X_n's are not allowed to form "too sharp" peaks.

Proposition 13.48. If {X_n} is U.I., then for all ε > 0 there exists δ > 0 such that sup_n E[|X_n| : A] < ε whenever A ⊂ Ω with P(A) < δ.

Proof. For any A ⊂ Ω,

E[|X_n| : A] = E[|X_n| : A, |X_n| ≥ K] + E[|X_n| : A, |X_n| < K] ≤ E[|X_n| : |X_n| ≥ K] + K P(A),

and so

sup_n E[|X_n| : A] ≤ ε_K + K P(A).   (13.32)

Thus given ε > 0, choose K so large that ε_K < ε/2 and choose δ := ε/(2K), so that when P(A) < δ we will have K P(A) < ε/2, which combined with Eq. (13.32) implies sup_n E[|X_n| : A] < ε.

Corollary 13.49. Suppose that {X_n}_{n=0}^∞ is a U.I. martingale and τ is a stopping time so that P(τ = ∞) = 0 and E|X_τ| < ∞; then E X_τ = E X_0.

Proof. According to Corollary 13.34, we need only use the U.I. assumption to show

lim_{n→∞} E[|X_n| : τ > n] = 0.

But using P(τ > n) → 0 as n → ∞, we may use Proposition 13.48 to conclude

E[|X_n| : τ > n] ≤ sup_k E[|X_k| : τ > n] → 0 as n → ∞.

Let us end this section with a couple of criteria for checking when {X_n} is U.I.

Proposition 13.50 (DCT and U.I.). If F = sup_n |X_n| ∈ L^1(P), then {X_n} is U.I.

Proof. We have

sup_n E[|X_n| : |X_n| ≥ K] ≤ E[F : F ≥ K] → 0 as K → ∞

by DCT.

Proposition 13.51 (p – integrability and U.I.). If there exists p > 1 such that M := sup_n E|X_n|^p < ∞, then {X_n} is U.I.

Proof. If we let ε := p − 1 > 0, then

E[|X_n| : |X_n| ≥ K] = E[|X_n|^{1+ε} (1/|X_n|^ε) : |X_n| ≥ K] ≤ (1/K^ε) E[|X_n|^{1+ε} : |X_n| ≥ K] ≤ (1/K^ε) E[|X_n|^p],

and so

sup_n E[|X_n| : |X_n| ≥ K] ≤ M/K^ε → 0 as K ↑ ∞.

Proposition 13.52. If {X_n}_{n=0}^∞ is uniformly integrable and X = lim_{n→∞} X_n exists a.s., then lim_{n→∞} E|X − X_n| = 0.

Proof. By Fatou's lemma, E|X| ≤ M := sup_n E[|X_n|] < ∞, so that X is integrable. Let Y_n := |X − X_n|, so that Y_n → 0 a.s. and E Y_n ≤ 2M for all n. Using DCT twice we find, for any K > 0, that

lim sup_{n→∞} E Y_n ≤ lim sup_{n→∞} E[Y_n : Y_n < K] + lim sup_{n→∞} E[Y_n : Y_n ≥ K] = lim sup_{n→∞} E[Y_n : Y_n ≥ K],

where

lim sup_{n→∞} E[Y_n : Y_n ≥ K] ≤ lim sup_{n→∞} E[|X| : Y_n ≥ K] + lim sup_{n→∞} E[|X_n| : Y_n ≥ K]
  = lim sup_{n→∞} E[|X_n| : Y_n ≥ K] ≤ sup_n E[|X_n| : Y_n ≥ K].

Since P(Y_n ≥ K) ≤ 2M/K, it follows by Proposition 13.48 that

lim sup_{n→∞} E Y_n ≤ sup_n E[|X_n| : Y_n ≥ K] → 0 as K → ∞.

Corollary 13.53. If X = lim_{n→∞} X_n exists a.s. and there exists p > 1 such that M := sup_n E|X_n|^p < ∞, then lim_{n→∞} E|X − X_n| = 0.


Proof. First proof. This is a direct consequence of Propositions 13.51 and 13.52.

Second direct proof. By Fatou's lemma,

E|X|^p = E lim inf_{n→∞} |X_n|^p ≤ lim inf_{n→∞} E|X_n|^p ≤ M < ∞.

Let Y_n := |X − X_n|, so that Y_n → 0 a.s. and

E Y_n^p ≤ C_p [E|X|^p + E|X_n|^p] ≤ M′ := 2 C_p M.

Now for any K > 0 we have

E Y_n = E[Y_n : Y_n ≤ K] + E[Y_n : Y_n > K]
  ≤ E[Y_n 1_{Y_n≤K}] + E[Y_n (Y_n/K)^{p−1} : Y_n > K]
  ≤ E[Y_n 1_{Y_n≤K}] + (1/K^{p−1}) E Y_n^p ≤ E[Y_n 1_{Y_n≤K}] + M′/K^{p−1}.

Thus by an application of DCT,

lim sup_{n→∞} E Y_n ≤ lim sup_{n→∞} E[Y_n 1_{Y_n≤K}] + M′/K^{p−1} = M′/K^{p−1} → 0 as K → ∞.

Here is the theorem summarizing the key facts about uniformly integrable martingales.

Theorem 13.54 (Uniformly integrable martingales). If {X_n}_{n=0}^∞ is a U.I. martingale, then:

1. X_∞ = lim_{n→∞} X_n exists (P – a.s.) and lim_{n→∞} E|X_n − X_∞| = 0, i.e. X_n → X_∞ in L^1(P). [This in particular implies that lim_{n→∞} E X_n = E X_∞.]
2. {X_n}_{n=0}^∞ is a regular martingale and in fact X_n = E[X_∞ | F_n] for all n ∈ N_0.
3. For any stopping time τ (no need to assume P(τ = ∞) = 0) we have E|X_τ| < ∞ and E X_τ = E X_0. [This even holds if τ = ∞ a.s.] [This result extends Corollary 13.49.]

Proof. We take each item in turn.

1. The first item is a direct consequence of Lemma 13.46, Theorem 13.43, and Proposition 13.52.
2. Passing to the limit, N → ∞, in the equality X_n = E[X_N | F_n], using the fact that conditional expectation is an L^1(P) – contraction, proves item 2.
3. From Theorem 4.17,

E[X_∞ | F_τ] = Σ_{n∈N̄_0} E[X_∞ | F_n] 1_{τ=n} = Σ_{n∈N̄_0} X_n 1_{τ=n} = X_τ.

Combining this equality with the contractive and tower properties of conditional expectations shows E|X_τ| ≤ E|X_∞| < ∞ and

E X_τ = E(E[X_∞ | F_τ]) = E X_∞ = E E[X_∞ | F_n] = E X_n for all n ∈ N_0.

Example 13.55 (Generalized Random Sums). Suppose that {X_k}_{k=1}^∞ are independent random variables such that E X_k = 0. If we further know that K := Σ_{k=1}^∞ E X_k^2 < ∞, then Σ_{k=1}^∞ X_k exists a.s., i.e.

P(Σ_{k=1}^∞ X_k converges in R) = 1.

To prove this, note that S_0 = 0 and S_n = Σ_{k=1}^n X_k is a martingale. Moreover, using the fact that variances of sums of independent random variables add, we find

E S_n^2 = Var(S_n) = Σ_{k=1}^n Var(X_k) = Σ_{k=1}^n E X_k^2 ≤ K < ∞.

From Proposition 13.51 we conclude that {S_n}_{n=0}^∞ is a uniformly integrable martingale, and in particular {S_n}_{n=0}^∞ is L^1 – bounded.^5 Therefore the result now follows from the martingale convergence theorem, Theorem 13.43. Moreover, making use of Theorem 13.54 we may conclude that

E[S_∞] = E[Σ_{k=1}^∞ X_k] = E S_0 = 0,

and more generally, if τ is any stopping time we have E[S_τ] = E[S_0] = 0.

For a more explicit example, if X_k = (1/k) Z_k where {Z_k}_{k=1}^∞ are i.i.d. such that P(Z_k = ±1) = 1/2, then Σ_{k=1}^∞ E X_k^2 = Σ_{k=1}^∞ 1/k^2 < ∞, and therefore the random harmonic series, Σ_{k=1}^∞ (1/k) Z_k, is almost surely convergent. However, notice that

Σ_{k=1}^∞ |(1/k) Z_k| = Σ_{k=1}^∞ 1/k = ∞,

i.e. the series Σ_{k=1}^∞ (1/k) Z_k is never absolutely convergent. For the full generalization of this result, look up Kolmogorov's three series theorem.

^5 Alternatively, by Jensen's inequality, [E|S_n|]^2 ≤ E S_n^2 ≤ K, which again shows {S_n} is L^1 – bounded.
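A minimal simulation sketch of the random harmonic series (not part of the notes); each sample path of the partial sums wanders at first and then settles down to a random limit:

    import numpy as np

    rng = np.random.default_rng(5)
    n_terms, n_paths = 100_000, 5

    Z = rng.choice([1.0, -1.0], size=(n_paths, n_terms))
    partial = np.cumsum(Z / np.arange(1, n_terms + 1), axis=1)

    # Columns show the partial sums stabilizing along each of the 5 paths.
    print(partial[:, [99, 9_999, 99_999]])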


13.6 Submartingale Maximal Inequalities

Notation 13.56 (Running Maximum) If X = {X_n}_{n=0}^∞ is a sequence of (extended) real numbers, we let

X_N^* := max{X_0, ..., X_N}.   (13.33)

Proposition 13.57 (Submartingale Maximal Inequalities). Let {X_n} be a submartingale on a filtered probability space, (Ω, {F_n}_{n=0}^∞, P). Then for any a ≥ 0 and N ∈ N_0,

a P(X_N^* ≥ a) ≤ E[X_N : X_N^* ≥ a] ≤ E[X_N^+],   (13.34)

where X_N^+ := X_N ∨ 0 = max(X_N, 0).

Proof. Let τ := inf{n : X_n ≥ a} and observe that X_N^* ≥ a iff τ ≤ N, and that a ≤ X_k on {τ = k}. Therefore,

a P(X_N^* ≥ a) = a P(τ ≤ N) = Σ_{k=0}^N E[a 1_{τ=k}]
  ≤ Σ_{k=0}^N E[X_k 1_{τ=k}] ≤ Σ_{k=0}^N E[X_N 1_{τ=k}]
  = E[X_N : τ ≤ N] = E[X_N : X_N^* ≥ a]
  ≤ E[X_N^+ : X_N^* ≥ a] ≤ E X_N^+,

wherein we have used {τ = k} ∈ F_k and that {X_n}_{n=0}^∞ is a submartingale in the second inequality.

Corollary 13.58. If {M_n}_{n=0}^∞ is a martingale and M_N^* := max{M_0, ..., M_N}, then for a > 0,

P(|M|_N^* ≥ a) ≤ (1/a^p) E|M_N|^p for all p ≥ 1   (13.35)

and

P(M_N^* ≥ a) ≤ (1/e^{λa}) E[e^{λM_N}] for all λ ≥ 0.   (13.36)

[This corollary has fairly obvious generalizations to other convex functions, ϕ, than ϕ(x) = |x|^p and ϕ(x) = e^{λx}.]

Proof. By the conditional Jensen's inequality, it follows that X_n := |M_n|^p is a submartingale, and so Eq. (13.35) follows from Eq. (13.34) with a replaced by a^p. Again by the conditional Jensen's inequality, {X_n := e^{λM_n}}_{n=0}^∞ is a submartingale. For λ > 0 the map x → e^{λx} is increasing, and hence

{M_N^* ≥ a} = {X_N^* ≥ e^{λa}},

so Eq. (13.36) follows from Eq. (13.34) with a replaced by e^{λa} (the case λ = 0 of Eq. (13.36) is trivial).

The following result is a substantial improvement of Proposition 2.18.

Corollary 13.59 (L^2 – SSLN). Let {X_n} be a sequence of independent random variables with mean zero and σ^2 = E X_n^2 < ∞. Letting S_m = Σ_{k=1}^m X_k and p > 1/2, we have^6

P(S_m = O(m^p)) = 1.   (13.37)

As this is true for all p > 1/2, we may restate Eq. (13.37) as: if p > 1/2, then lim_{m→∞} S_m/m^p = 0 a.s.

Proof. From Eq. (13.35) of Corollary 13.58 with M_n = S_n; if N ∈ N then

P(|S|_N^* ≥ N^p) ≤ (1/N^{2p}) E[S_N^2] = (1/N^{2p}) σ^2 N = σ^2/N^{2p−1}.   (13.38)

Choose α ∈ N large enough so that α(2p − 1) > 1, take N = n^α in Eq. (13.38), and then sum on n to find

E[Σ_{n=1}^∞ 1_{|S|^*_{n^α} ≥ n^{αp}}] = Σ_{n=1}^∞ P(|S|^*_{n^α} ≥ n^{αp}) ≤ Σ_{n=1}^∞ σ^2/n^{α(2p−1)} < ∞.

Therefore (as in the proof of the first Borel – Cantelli lemma),

Σ_{n=1}^∞ 1_{|S|^*_{n^α} ≥ n^{αp}} < ∞ a.s. ⟹ P(|S|^*_{n^α} < n^{αp} for a.a. n) = 1,

and hence, almost surely, we have

|S|^*_{n^α}/n^{αp} < 1 for almost all n.   (13.39)

As [1, ∞) is the disjoint union of the intervals {[n^α, (n+1)^α)}_{n=1}^∞, to each m ∈ N there exists a unique n = n(m) such that n^α ≤ m < (n+1)^α. This inequality, along with the (obvious) inequality |S|_m^* ≤ |S|^*_{(n+1)^α},^7 then implies, for sufficiently large m (with n = n(m)),

|S|_m^*/m^p ≤ |S|^*_{(n+1)^α}/n^{αp} = (|S|^*_{(n+1)^α}/(n+1)^{αp}) · ((n+1)/n)^{αp} < 1 · 2^{αp},

which suffices to prove Eq. (13.37).

^6 Here we say S_m = O(m^p) iff there exists C < ∞ so that |S_m| ≤ C m^p for all m, where C is allowed to be random here.

^7 This is where we need the maximal inequality, since it would not necessarily be true that |S_m| ≤ |S_{(n+1)^α}|!
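A short simulation sketch (not part of the notes) illustrating Corollary 13.59 for Bernoulli steps: S_m/m^p tends to 0 for p > 1/2, while S_m/√m keeps fluctuating at size O(1).

    import numpy as np

    rng = np.random.default_rng(6)
    m = 10 ** np.arange(2, 7)                 # m = 10^2, ..., 10^6
    S = np.cumsum(rng.choice([1.0, -1.0], size=m[-1]))

    for p in (0.5, 0.6):
        print(p, S[m - 1] / m.astype(float) ** p)
    # The p = 0.6 row shrinks toward 0; the p = 0.5 row does not.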


Corollary 13.60. If {Y_n} is a sequence of independent random variables with E Y_n = µ and σ^2 = Var(Y_n) < ∞, then for any β ∈ (0, 1/2),

(1/m) Σ_{n=1}^m Y_n − µ = O(1/m^β) a.s.

Proof. Let X_n = Y_n − µ. Then for p > 1/2 we have

Σ_{n=1}^m Y_n − mµ = Σ_{n=1}^m (Y_n − µ) = S_m = O(m^p) a.s.,

and therefore

(1/m) Σ_{n=1}^m Y_n − µ = O(1/m^β),

where β = 1 − p, which can be any number in (0, 1/2).

Remark 13.61 (Law of the iterated logarithm). Under the assumptions of Corollary 13.59, E S_m^2 = σ^2 m, and so in the root mean square sense, |S_m| ≲ σ√m. The result in Corollary 13.59 makes this statement honest at the expense of replacing √m by m^p for any p > 1/2. The more precise statement is the so-called law of the iterated logarithm, which states

lim sup_{m→∞} S_m/√(2m · ln(ln m)) = σ a.s.

Here is another (not so great) example of using Corollary 13.58.

Corollary 13.62. Let {X_n} be a sequence of Bernoulli random variables with P(X_n = ±1) = 1/2 and let S_0 = 0, S_n := X_1 + ⋯ + X_n for n ∈ N. Then

lim_{N→∞} |S|_N^*/(√N · ln N) := lim_{N→∞} max{|S_0|, ..., |S_N|}/(√N · ln N) = 0 a.s.   (13.40)

[This result actually holds for any sequence {X_n} of i.i.d. random variables such that E X_n = 0 and Var(X_n) = E X_n^2 < ∞.]

Proof. By Eq. (13.36), if a, λ > 0 then

P(S_N^* ≥ a) ≤ e^{−λa} E[e^{λS_N}] = e^{−λa} E[∏_{j=1}^N e^{λX_j}] = e^{−λa} ∏_{j=1}^N E[e^{λX_j}] = e^{−λa} [cosh(λ)]^N,   (13.41)

wherein we have used

E[e^{λX_j}] = (1/2)[e^λ + e^{−λ}] = cosh(λ).

Making the substitutions a → a√N · ln N and λ → λ/√N in Eq. (13.41) shows

P(S_N^* ≥ a√N · ln N) ≤ e^{−λa ln N} [cosh(λ/√N)]^N = (1/N^{λa}) [cosh(λ/√N)]^N.

Given a > 0, we now choose λ so that λa > 1, in which case it follows that

Σ_{N=1}^∞ P(S_N^* ≥ a√N · ln N) ≤ Σ_{N=1}^∞ (1/N^{λa}) [cosh(λ/√N)]^N < ∞,

wherein we have used

lim_{N→∞} [cosh(λ/√N)]^N = lim_{N→∞} (1 + λ^2/(2N) + O(1/N^2))^N = e^{λ^2/2}.

From this inequality we learn

P(S_N^* ≥ a√N · ln N i.o. N) = 0,

and so with probability one we conclude that S_N^*/(√N · ln N) < a for almost all N. Since a was arbitrary, it follows that

lim sup_{N→∞} S_N^*/(√N · ln N) = 0 a.s.

By replacing X_n by −X_n for all n we may further conclude that

lim sup_{N→∞} (−S)_N^*/(√N · ln N) = 0,

and together these two statements prove Eq. (13.40) since |S|_N^* = max{S_N^*, (−S)_N^*}.

Remark 13.63. The above corollary is not a very good example, as the results follow just as easily without the use of Doob's maximal inequality. Indeed, the proof above goes through the same with S_N^* replaced by S_N everywhere, in which case we would conclude that lim_{N→∞} S_N/(√N · ln N) = 0, or equivalently that for all ε > 0, |S_N| ≤ ε√N · ln N for all N ≥ N_ε, where N_ε is a finite random variable. For N ≥ N_ε, we have

|S|_N^* ≤ (ε√N · ln N) ∨ |S|^*_{N_ε} ⟹ |S|_N^* ≤ ε√N · ln N for almost all N.

This then shows that

P(lim sup_{N→∞} |S|_N^*/(√N · ln N) ≤ ε) = 1,

and as ε > 0 was arbitrary, we may conclude that

P(lim sup_{N→∞} |S|_N^*/(√N · ln N) = 0) = 1.

13.7 *L^p – inequalities

Lemma 13.64. Suppose that X and Y are two non-negative random variables such that P(Y ≥ y) ≤ (1/y) E[X : Y ≥ y] for all y > 0. Then for all p ∈ (1, ∞),

E Y^p ≤ (p/(p − 1))^p E X^p.   (13.42)

Proof. We will begin by proving Eq. (13.42) under the additional assumption that Y ∈ L^p(Ω, B, P). Since

E Y^p = p E ∫_0^∞ 1_{y≤Y} · y^{p−1} dy = p ∫_0^∞ E[1_{y≤Y}] · y^{p−1} dy
  = p ∫_0^∞ P(Y ≥ y) · y^{p−1} dy ≤ p ∫_0^∞ (1/y) E[X : Y ≥ y] · y^{p−1} dy
  = p E ∫_0^∞ X 1_{y≤Y} · y^{p−2} dy = (p/(p − 1)) E[X Y^{p−1}].

Now apply Hölder's inequality, with q = p(p − 1)^{−1}, to find

E[X Y^{p−1}] ≤ ‖X‖_p · ‖Y^{p−1}‖_q = ‖X‖_p · [E|Y|^p]^{1/q}.

Combining these two inequalities and solving for ‖Y‖_p shows ‖Y‖_p ≤ (p/(p − 1)) ‖X‖_p, which proves Eq. (13.42) under the additional restriction that Y be in L^p(Ω, B, P).

To remove the integrability restriction on Y, for M > 0 let Z := Y ∧ M and observe that

P(Z ≥ y) = P(Y ≥ y) ≤ (1/y) E[X : Y ≥ y] = (1/y) E[X : Z ≥ y] if y ≤ M,

while

P(Z ≥ y) = 0 = (1/y) E[X : Z ≥ y] if y > M.

Since Z is bounded, the special case just proved shows

E[(Y ∧ M)^p] = E Z^p ≤ (p/(p − 1))^p E X^p.

We may now use the MCT to pass to the limit, M ↑ ∞, and hence conclude that Eq. (13.42) holds in general.

Corollary 13.65 (Doob's Inequality). If X = {X_n}_{n=0}^∞ is a non-negative submartingale and 1 < p < ∞, then

E X_N^{*p} ≤ (p/(p − 1))^p E X_N^p.   (13.43)

Proof. Equation (13.43) follows by applying Lemma 13.64 with the aid of Proposition 13.57.

Corollary 13.66 (Doob's Inequality). If {M_n}_{n=0}^∞ is a martingale and 1 < p < ∞, then for all a > 0 we have

P(|M|_N^* ≥ a) ≤ (1/a) E[|M_N| : |M|_N^* ≥ a] ≤ (1/a) E[|M_N|]   (13.44)

and

E|M|_N^{*p} ≤ (p/(p − 1))^p E|M_N|^p.   (13.45)

Proof. By the conditional Jensen's inequality (Theorem 13.22), it follows that X_n := |M_n| is a submartingale. Hence Eq. (13.44) follows from Eq. (13.34), and Eq. (13.45) follows from Eq. (13.43).

Example 13.67. Let {X_n} be a sequence of independent integrable random variables with mean zero, S_0 = 0, S_n := X_1 + ⋯ + X_n for n ∈ N, and |S|_n^* = max_{j≤n} |S_j|. Since {S_n}_{n=0}^∞ is a martingale, by the conditional Jensen's inequality (Theorem 13.22), {|S_n|^p}_{n=1}^∞ is a (possibly extended) submartingale for any p ∈ [1, ∞). Therefore an application of Eq. (13.34) of Proposition 13.57 shows

P(|S|_N^* ≥ α) = P(|S|_N^{*p} ≥ α^p) ≤ (1/α^p) E[|S_N|^p : |S|_N^* ≥ α].

(When p = 2, this is Kolmogorov's inequality.) From Corollary 13.66 we also know that

E|S|_N^{*p} ≤ (p/(p − 1))^p E|S_N|^p.

In particular, when p = 2 this inequality becomes

E|S|_N^{*2} ≤ 4 · E|S_N|^2 = 4 · Σ_{n=1}^N E|X_n|^2.

13.8 Martingale Exercises

(The next four problems were taken directly from http://math.nyu.edu/~sheff/martingalenote.pdf.)

Exercise 13.7. Suppose Harriet has 7 dollars. Her plan is to make one dollar bets on fair coin tosses until her wealth reaches either 0 or 50, and then to go home. What is the expected amount of money that Harriet will have when she goes home? What is the probability that she will have 50 when she goes home?

Exercise 13.8. Consider a contract that at time N will be worth either 100 or 0. Let S_n be the price of the contract at time 0 ≤ n ≤ N. If S_n is a martingale and S_0 = 47, then what is the probability that the contract will be worth 100 at time N?

Exercise 13.9. Pedro plans to buy the contract in the previous problem at time 0 and sell it the first time T at which the price goes above 55 or below 15. What is the expected value of S_T? You may assume that the value, S_n, of the contract is bounded – there is only a finite amount of money in the world up to time N. Also note that, by assumption, T ≤ N.

Exercise 13.10. Suppose S_N is with probability one either 100 or 0 and that S_0 = 50. Suppose further there is at least a 60% probability that the price will at some point dip to below 40 and then subsequently rise to above 60 before time N. Prove that S_n cannot be a martingale. (I don't know if this problem is correct! But if we modify the 40 to a 30, the buy low, sell high strategy will show that S_n is not a martingale.)

13.8.1 More Random Walk Exercises

For the next four exercises, let {Z_n}_{n=1}^∞ be a sequence of Bernoulli random variables with P(Z_n = ±1) = 1/2 and let S_0 = 0 and S_n := Z_1 + ⋯ + Z_n. Then S becomes a martingale relative to the filtration F_n := σ(Z_1, ..., Z_n) with F_0 := {∅, Ω} – of course S_n is the (fair) simple random walk on Z. For any a ∈ Z, let

σ_a := inf{n : S_n = a}.

Exercise 13.11. For a < 0 < b with a, b ∈ Z, let τ = σ_a ∧ σ_b. Explain why {S_n^τ}_{n=0}^∞ is a bounded martingale and use this to show P(τ = ∞) = 0. Hint: make use of the fact that |S_n − S_{n−1}| = |Z_n| = 1 for all n, and hence the only way lim_{n→∞} S_n^τ can exist is if it stops moving!

Exercise 13.12. In this exercise, you are asked to use the central limit theorem to prove again that P(τ = ∞) = 0, as in Exercise 13.11. Hints: Use the central limit theorem to show

(1/√(2π)) ∫_R f(x) e^{−x^2/2} dx ≥ f(0) P(τ = ∞)   (13.46)

for all f ∈ C^3(R → [0, ∞)) with M := sup_{x∈R} |f^{(3)}(x)| < ∞. Use this inequality to conclude that P(τ = ∞) = 0.

Exercise 13.13. Show

P(σ_b < σ_a) = |a|/(b + |a|)   (13.47)

and use this to conclude P(σ_b < ∞) = 1, i.e. every b ∈ N is almost surely visited by S_n.

Hint: As in Exercise 13.11, notice that {S_n^τ}_{n=0}^∞ is a bounded martingale, where τ := σ_a ∧ σ_b. Now compute E[S_τ] in two different ways.
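A simulation sketch (not part of the notes) of Eq. (13.47), with the arbitrary choice a = −3, b = 5:

    import numpy as np

    rng = np.random.default_rng(7)
    a, b, n_paths = -3, 5, 20_000

    hits_b = 0
    for _ in range(n_paths):
        s = 0
        while a < s < b:          # run the walk until it hits a or b
            s += rng.choice((1, -1))
        hits_b += (s == b)

    print(hits_b / n_paths, "vs", abs(a) / (b + abs(a)))  # both approximately 3/8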

Exercise 13.14. Let τ := σ_a ∧ σ_b. In this problem you are asked to show E[τ] = |a| b with the aid of the following outline.

1. Use Exercise 13.4 above to conclude that N_n := S_n^2 − n is a martingale.
2. Now show

0 = E N_0 = E N_{τ∧n} = E S_{τ∧n}^2 − E[τ ∧ n].   (13.48)

3. Now use DCT and MCT along with Exercise 13.13 to compute the limit as n → ∞ in Eq. (13.48) to find

E[σ_a ∧ σ_b] = E[τ] = b|a|.   (13.49)

4. By considering the limit, a → −∞, in Eq. (13.49), show E[σ_b] = ∞.

For the next group of exercises we are now going to suppose that P(Z_n = 1) = p > 1/2 and P(Z_n = −1) = q = 1 − p < 1/2. As before let F_n = σ(Z_1, ..., Z_n), S_0 = 0 and S_n = Z_1 + ⋯ + Z_n for n ∈ N. Let us review the method above and what you did in Exercise 6.4 above.

In order to follow the procedures above, we start by looking for a function, ϕ, such that ϕ(S_n) is a martingale. Such a function must satisfy

ϕ(S_n) = E_{F_n} ϕ(S_{n+1}) = ϕ(S_n + 1) p + ϕ(S_n − 1) q,

and this then leads us to try to solve the following difference equation for ϕ;

ϕ(x) = p ϕ(x + 1) + q ϕ(x − 1) for all x ∈ Z.   (13.50)

Similar to the theory of second order ODEs, this equation has two linearly independent solutions, which could be found by solving Eq. (13.50) with initial conditions ϕ(0) = 1 and ϕ(1) = 0, and then with ϕ(0) = 0 and ϕ(1) = 1, for example. Rather than doing this, motivated by second order constant coefficient ODEs, let us try to find solutions of the form ϕ(x) = λ^x with λ to be determined. Doing so leads to the equation λ^x = pλ^{x+1} + qλ^{x−1}, or equivalently to the characteristic equation

pλ^2 − λ + q = 0.

The solutions to this equation are

λ = (1 ± √(1 − 4pq))/(2p) = (1 ± √(1 − 4p(1 − p)))/(2p) = (1 ± √(4p^2 − 4p + 1))/(2p) = (1 ± √((2p − 1)^2))/(2p) = {1, (1 − p)/p} = {1, q/p}.

The most general solution to Eq. (13.50) is then given by

ϕ(x) = A + B (q/p)^x.

Below we will take A = 0 and B = 1. As before let σ_a = inf{n ≥ 0 : S_n = a}.

Exercise 13.15. Let a < 0 < b and τ := σ_a ∧ σ_b, and recall that p > 1/2 and q = 1 − p.

1. Apply the method in Exercise 13.11 with S_n replaced by M_n := (q/p)^{S_n} to show P(τ = ∞) = 0. [Recall that {M_n}_{n=1}^∞ is a martingale, as explained in Example 13.13.]
2. Now use the method in Exercise 13.13 to show

P(σ_a < σ_b) = ((q/p)^b − 1)/((q/p)^b − (q/p)^a).   (13.51)

3. By letting a → −∞ in Eq. (13.51), conclude P(σ_b = ∞) = 0.
4. By letting b → ∞ in Eq. (13.51), conclude P(σ_a < ∞) = (q/p)^{|a|}.
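A numerical sketch (not part of the notes) of Eq. (13.51); the parameters p = 0.6, a = −4, b = 3 are arbitrary:

    import numpy as np

    rng = np.random.default_rng(8)
    p, a, b, n_paths = 0.6, -4, 3, 20_000
    q = 1.0 - p

    hit_a = 0
    for _ in range(n_paths):
        s = 0
        while a < s < b:
            s += 1 if rng.random() < p else -1
        hit_a += (s == a)

    theory = ((q / p) ** b - 1) / ((q / p) ** b - (q / p) ** a)
    print(hit_a / n_paths, "vs", theory)   # empirical frequency vs Eq. (13.51)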

Exercise 13.16. Verify that

M_n := S_n − n(p − q)

and

N_n := M_n^2 − σ^2 n

are martingales, where σ^2 = 1 − (p − q)^2. (This should be simple; see either Exercise 13.4 or Exercise 13.6.)

Exercise 13.17. Using Exercise 13.16, show

E(σ_a ∧ σ_b) = ( b[1 − (q/p)^a] + a[(q/p)^b − 1] ) / ( (q/p)^b − (q/p)^a ) · (p − q)^{−1}.   (13.52)

By considering the limit of this equation as a → −∞, show

E[σ_b] = b/(p − q),

and by considering the limit as b → ∞, show E[σ_a] = ∞.

13.8.2 More advanced martingale exercises

Exercise 13.18. Let (M_n)_{n=0}^∞ be a martingale with M_0 = 0 and E[M_n^2] < ∞ for all n. Show that for all λ > 0,

P(max_{1≤m≤n} M_m ≥ λ) ≤ E[M_n^2]/(E[M_n^2] + λ^2).

Hints: First show, for any c > 0, that {X_n := (M_n + c)^2}_{n=0}^∞ is a submartingale and then observe

{max_{1≤m≤n} M_m ≥ λ} ⊂ {max_{1≤m≤n} X_m ≥ (λ + c)^2}.

Now use Doob's Maximal inequality (Proposition 13.57) to estimate the probability of the last set and then choose c so as to optimize the resulting estimate you get for P(max_{1≤m≤n} M_m ≥ λ). (Notice that this result applies to −M_n as well, so it also holds that

P(min_{1≤m≤n} M_m ≤ −λ) ≤ E[M_n^2]/(E[M_n^2] + λ^2) for all λ > 0.)

Exercise 13.19. Let {Z_n}_{n=1}^∞ be independent random variables, S_0 = 0 and S_n := Z_1 + ⋯ + Z_n, and f_n(λ) := E[e^{iλZ_n}]. Suppose E e^{iλS_N} = ∏_{n=1}^N f_n(λ) converges to a continuous function, F(λ), as N → ∞. Show for each λ ∈ R that

P(lim_{n→∞} e^{iλS_n} exists) = 1.   (13.53)

Hints:

1. Show it is enough to find an ε > 0 such that Eq. (13.53) holds for |λ| ≤ ε.
2. Choose ε > 0 such that |F(λ) − 1| < 1/2 for |λ| ≤ ε. For |λ| ≤ ε, show M_n(λ) := e^{iλS_n}/E e^{iλS_n} is a bounded complex^8 martingale relative to the filtration F_n = σ(Z_1, ..., Z_n).

Lemma 13.68 (Protter [16, see the lemma on p. 22]). Let {x_n}_{n=1}^∞ ⊂ R be such that {e^{iux_n}}_{n=1}^∞ is convergent for Lebesgue almost every u ∈ R. Then lim_{n→∞} x_n exists in R.

Proof. Let U be a uniform random variable with values in [0, 1]. By assumption, for any t ∈ R, lim_{n→∞} e^{itUx_n} exists a.s. Thus if {n_k} and {m_k} are any increasing sequences, we have

lim_{k→∞} e^{itUx_{n_k}} = lim_{n→∞} e^{itUx_n} = lim_{k→∞} e^{itUx_{m_k}} a.s.,

and therefore

e^{it(Ux_{n_k} − Ux_{m_k})} = e^{itUx_{n_k}}/e^{itUx_{m_k}} → 1 a.s. as k → ∞.

Hence by DCT it follows that

E[e^{it(Ux_{n_k} − Ux_{m_k})}] → 1 as k → ∞,

and therefore

(x_{n_k} − x_{m_k}) · U = Ux_{n_k} − Ux_{m_k} → 0

in distribution and hence in probability. But this can only happen if (x_{n_k} − x_{m_k}) → 0 as k → ∞. As {n_k} and {m_k} were arbitrary, this suffices to show {x_n} is a Cauchy sequence.

Exercise 13.20 (Continuation of Exercise 13.19 – see Doob [5, Chapter VII.5]). Let {Z_n}_{n=1}^∞ be independent random variables. Use Exercise 13.19 and Lemma 13.68 to prove that the series Σ_{n=1}^∞ Z_n converges in R a.s. iff ∏_{n=1}^N f_n(λ) converges to a continuous function, F(λ), as N → ∞. Conclude from this that Σ_{n=1}^∞ Z_n is a.s. convergent iff Σ_{n=1}^∞ Z_n is convergent in distribution.

^8 Please use the obvious generalization of a martingale for complex valued processes. It will be useful to observe that the real and imaginary parts of a complex martingale are real martingales.


14 Some martingale Examples and Applications

Exercise 14.1. Let S_n be the total assets of an insurance company in year n ∈ N_0. Assume S_0 > 0 is a constant and that for all n ≥ 1, S_n = S_{n−1} + ξ_n, where ξ_n = c − Z_n and {Z_n}_{n=1}^∞ are i.i.d. random variables having the normal distribution with mean µ < c and variance σ^2, i.e. Z_n = µ + σN in distribution, where N is a standard normal random variable.^1 Let

τ = inf{n : S_n ≤ 0} and R = {τ < ∞} = {S_n ≤ 0 for some n}

be the event that the company eventually becomes bankrupt, i.e. is Ruined. Show

P(Ruin) = P(R) ≤ e^{−2(c−µ)S_0/σ^2}.

Outline:

1. Show that λ = −2(c − µ)/σ^2 < 0 satisfies E[e^{λξ_n}] = 1.
2. With this λ show (using S_n = S_0 + ξ_1 + ⋯ + ξ_n) that

Y_n := exp(λS_n) = e^{λS_0} ∏_{j=1}^n e^{λξ_j}   (14.1)

is a non-negative F_n = σ(Z_1, ..., Z_n) – martingale.
3. Use a martingale convergence theorem to argue that lim_{n→∞} Y_n = Y_∞ exists a.s., and then use Fatou's lemma to show E Y_τ ≤ e^{λS_0}.
4. Finally conclude that

P(R) = P(τ < ∞) ≤ E[Y_τ : τ < ∞] ≤ E Y_τ ≤ e^{λS_0} = e^{−2(c−µ)S_0/σ^2}.

Observe, by the strong law of large numbers, that lim_{n→∞} S_n/n = Eξ_1 = c − µ > 0 a.s. Thus for large n we have S_n ∼ n(c − µ) → ∞ as n → ∞. The question we have addressed is what happens to S_n for intermediate values – in particular, what is the likelihood that S_n makes a sufficiently "large deviation" from the "typical" value of n(c − µ) in order for the company to go bankrupt. The next section is a detour which gives a taste of how such probabilities may be estimated.

^1 The number c is to be interpreted as the yearly premium, µ represents the mean payout in claims per year, and σN represents the random fluctuations which can occur from year to year.
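A simulation sketch of the ruin bound (not part of the notes; the parameter values are arbitrary, and the infinite horizon is truncated at n_max, which is harmless since S_n drifts upward):

    import numpy as np

    rng = np.random.default_rng(9)
    S0, c, mu, sigma = 2.0, 1.0, 0.5, 1.0
    n_max, n_paths = 500, 20_000

    xi = c - (mu + sigma * rng.standard_normal((n_paths, n_max)))
    S = S0 + np.cumsum(xi, axis=1)
    ruined = (S <= 0).any(axis=1)

    # Empirical ruin frequency versus the martingale bound of Exercise 14.1.
    print(ruined.mean(), "<=", np.exp(-2 * (c - mu) * S0 / sigma**2))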

14.1 A Large Deviations Primer

Definition 14.1. A real valued random variable, Z, is said to be exponentially integrable if E[e^{θZ}] < ∞ for all θ ∈ R. Under this condition we let M(θ) = E[e^{θZ}] be the moment generating function and

ψ(θ) = ln M(θ) = ln E[e^{θZ}]

be the log-moment generating function.

Summary. The upshot of Theorems 14.2 and 14.3 and of this section is: if S_n := Z_1 + ⋯ + Z_n, where {Z_n}_{n=1}^∞ are i.i.d. exponentially integrable mean zero random variables, then

P(S_n ≥ nℓ) ∼ e^{−n · max_θ [θℓ − ψ(θ)]}.

Theorem 14.2 (Large Deviation Upper Bound). Let, for n ∈ N, S_n := Z_1 + ⋯ + Z_n, where {Z_n}_{n=1}^∞ are i.i.d. exponentially integrable random variables such that EZ = 0, where Z is equal in distribution to Z_n. Then for all ℓ > 0,

P(S_n ≥ nℓ) ≤ e^{−nI(ℓ)},   (14.2)

where

I(ℓ) = sup_{θ≥0} (θℓ − ψ(θ)) = sup_{θ∈R} (θℓ − ψ(θ)) ≥ 0.   (14.3)

In particular,

lim sup_{n→∞} (1/n) ln P(S_n ≥ nℓ) ≤ −I(ℓ) for all ℓ > 0.   (14.4)

Proof. For ℓ > 0 we have, for any θ ≥ 0,

P(S_n ≥ nℓ) = P(e^{θS_n} ≥ e^{θnℓ}) ≤ e^{−θnℓ} E[e^{θS_n}] = (e^{−θℓ} E[e^{θZ}])^n.

With M(θ) := E[e^{θZ}] the moment generating function for Z and ψ(θ) = ln M(θ) the log – moment generating function, we have just shown

P(S_n ≥ nℓ) ≤ exp(−n(θℓ − ψ(θ))) for all θ ≥ 0.

Minimizing the right side of this inequality over θ ≥ 0 gives the upper bound in Eq. (14.2), where I(ℓ) is given as in the first equality in Eq. (14.3).

To prove the second equality in Eq. (14.3), we use the fact that e^{θx} is a convex function in x for all θ ∈ R, and therefore by Jensen's inequality,

M(θ) = E[e^{θZ}] ≥ e^{θEZ} = e^0 = 1 for all θ ∈ R.

This then implies that ψ(θ) = ln M(θ) ≥ 0 for all θ ∈ R. In particular, θℓ − ψ(θ) < 0 for θ < 0, while [θℓ − ψ(θ)]|_{θ=0} = 0, and therefore

sup_{θ∈R} (θℓ − ψ(θ)) = sup_{θ≥0} (θℓ − ψ(θ)) ≥ 0.

This completes the proof, as Eq. (14.4) easily follows from Eq. (14.2).

Theorem 14.3 (Large Deviation Lower Bound). If there is a maximizer, θ_0, for the function θ → θℓ − ψ(θ), then

lim inf_{n→∞} (1/n) ln P(S_n ≥ nℓ) ≥ −I(ℓ) = −(θ_0 ℓ − ψ(θ_0)).   (14.5)

Proof. If there is a maximizer, θ_0, for the function θ → θℓ − ψ(θ), then

0 = ℓ − ψ′(θ_0) = ℓ − M′(θ_0)/M(θ_0) = ℓ − E[Z e^{θ_0 Z}]/M(θ_0).

Thus if W is a random variable with law determined by

E[f(W)] = M(θ_0)^{−1} E[f(Z) e^{θ_0 Z}] for all non-negative functions f : R → [0, ∞],

then E[W] = ℓ.

Suppose that {W_n}_{n=1}^∞ has been chosen to be a sequence of i.i.d. random variables such that W_n is equal in distribution to W for all n. Then for all non-negative functions f : R^n → [0, ∞] we have

E[f(W_1, ..., W_n)] = M(θ_0)^{−n} E[f(Z_1, ..., Z_n) ∏_{i=1}^n e^{θ_0 Z_i}] = M(θ_0)^{−n} E[f(Z_1, ..., Z_n) e^{θ_0 S_n}].

This is easily verified by showing the right side of this equation gives the correct expectations when f is a product function. Replacing f(z_1, ..., z_n) by M(θ_0)^n e^{−θ_0(z_1+⋯+z_n)} f(z_1, ..., z_n) in the previous equation then shows

E[f(Z_1, ..., Z_n)] = M(θ_0)^n E[f(W_1, ..., W_n) e^{−θ_0 T_n}],   (14.6)

where T_n := W_1 + ⋯ + W_n.

Taking δ > 0 and f(z_1, ..., z_n) = 1_{z_1+⋯+z_n ≥ nℓ} in Eq. (14.6) shows

P(S_n ≥ nℓ) = M(θ_0)^n E[e^{−θ_0 T_n} : nℓ ≤ T_n]
  ≥ M(θ_0)^n E[e^{−θ_0 T_n} : nℓ ≤ T_n ≤ n(ℓ + δ)]
  ≥ M(θ_0)^n e^{−nθ_0(ℓ+δ)} P[nℓ ≤ T_n ≤ n(ℓ + δ)]
  = e^{−nI(ℓ)} e^{−nθ_0 δ} P[nℓ ≤ T_n ≤ n(ℓ + δ)].

Taking logarithms of this equation, dividing by n, and letting n → ∞, we learn

lim inf_{n→∞} (1/n) ln P(S_n ≥ nℓ) ≥ −I(ℓ) − θ_0 δ + lim_{n→∞} (1/n) ln P[nℓ ≤ T_n ≤ n(ℓ + δ)] = −I(ℓ) − θ_0 δ + 0,   (14.7)

wherein we have used the central limit theorem to argue that

P[nℓ ≤ T_n ≤ n(ℓ + δ)] = P[0 ≤ T_n − nℓ ≤ nδ] = P[0 ≤ (T_n − nℓ)/√n ≤ √n δ] → 1/2 as n → ∞.

Equation (14.5) now follows from Eq. (14.7), as δ > 0 is arbitrary.

Example 14.4. Suppose that Z is N(0, σ^2) distributed, i.e. Z = σN in distribution with N a standard normal. Then

M(θ) = E[e^{θZ}] = E[e^{θσN}] = exp((1/2)(σθ)^2),

and therefore ψ(θ) = ln M(θ) = (1/2)σ^2 θ^2. Moreover, for ℓ > 0,

ℓ = ψ′(θ) ⟹ ℓ = σ^2 θ ⟹ θ_0 = ℓ/σ^2.

Thus it follows that

I(ℓ) = θ_0 ℓ − ψ(θ_0) = ℓ^2/σ^2 − (1/2)σ^2 (ℓ/σ^2)^2 = (1/2) ℓ^2/σ^2.

In this Gaussian case we actually know that S_n is N(0, nσ^2) distributed, and therefore, by Mill's ratio,

P(S_n ≥ nℓ) = P(√n σN ≥ nℓ) = P(N ≥ √n ℓ/σ) ∼ (σ/(√(2πn) ℓ)) e^{−n ℓ^2/(2σ^2)} = (σ/(√(2πn) ℓ)) e^{−nI(ℓ)} as n → ∞.
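The rate −I(ℓ) is visible numerically. Here is a sketch (not part of the notes, assuming SciPy is available); the exact normal tail gives (1/n) ln P(S_n ≥ nℓ) → −ℓ^2/(2σ^2), and σ = 1, ℓ = 0.5 are arbitrary choices:

    import numpy as np
    from scipy.stats import norm

    sigma, ell = 1.0, 0.5
    for n in (10, 100, 1000, 10_000):
        # S_n ~ N(0, n sigma^2), so P(S_n >= n*ell) = P(N >= sqrt(n)*ell/sigma).
        log_tail = norm.logsf(np.sqrt(n) * ell / sigma)
        print(n, log_tail / n, "->", -ell**2 / (2 * sigma**2))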


Remark 14.5. The technique used in the proof of Theorem 14.3 was to make a change of measure so that the large deviation (from the usual) event with small probability became typical behavior with substantial probability. One could imagine making other types of changes of measure of the form

E[f(W)] = E[f(Z)ρ(Z)]/E[ρ(Z)],

where ρ is some positive function. Under this change of measure the analogue of Eq. (14.6) is

E[f(Z_1, ..., Z_n)] = (E[ρ(Z)])^n · E[f(W_1, ..., W_n) ∏_{j=1}^n 1/ρ(W_j)].

However, to make this change of variables easy to deal with in the setting at hand, we would like to further have

∏_{j=1}^n 1/ρ(W_j) = f_n(T_n) = f_n(W_1 + ⋯ + W_n)

for some function f_n. Equivalently, we would like, for some function g_n, that

∏_{j=1}^n ρ(w_j) = g_n(w_1 + ⋯ + w_n)

for all w_i. Taking logarithms of this equation and differentiating in the w_j and w_k variables shows

ρ′(w_j)/ρ(w_j) = (ln g_n)′(w_1 + ⋯ + w_n) = ρ′(w_k)/ρ(w_k).

From the extremes of this last equation we conclude that ρ′(w_j)/ρ(w_j) = c (for some constant c), and therefore ρ(w) = K e^{cw} for some constant K. This helps to explain why the exponential function is used in the above proof.

14.2 A Polya Urn Model

In this section we are going to analyze the long run behavior of the Polya urn Markov process which was introduced in Example 13.15. Recall that if the urn contains r red balls and g green balls at a given time, we draw one of these balls at random, replace it, and add c more balls of the same color as the ball drawn. Let (r_n, g_n) be the number of red and green balls in the urn at time n. Then we have

P((r_{n+1}, g_{n+1}) = (r + c, g) | (r_n, g_n) = (r, g)) = r/(r + g) and
P((r_{n+1}, g_{n+1}) = (r, g + c) | (r_n, g_n) = (r, g)) = g/(r + g).

Let us observe that r_n + g_n = r_0 + g_0 + nc, and hence if we let X_n be the fraction of green balls in the urn at time n, then

X_n := g_n/(r_n + g_n) = g_n/(r_0 + g_0 + nc).

Recall from Example 13.15 or Remark 13.16 that {X_n}_{n=0}^∞ is a martingale relative to

F_n := σ((r_k, g_k) : k ≤ n) = σ(X_k : k ≤ n).

Since X_n ≥ 0 and E X_n = E X_0 < ∞ for all n, it follows by Theorem 13.43 that X_∞ := lim_{n→∞} X_n exists a.s. The distribution of X_∞ is described in the next theorem.

Theorem 14.6. Let $\gamma := g/c,$ $\rho := r/c,$ and $\mu := \operatorname{Law}_P(X_\infty).$ Then $\mu$ is the beta distribution on $[0,1]$ with parameters $\gamma, \rho,$ i.e.
\[
d\mu(x) = \frac{\Gamma(\rho+\gamma)}{\Gamma(\rho)\Gamma(\gamma)}\, x^{\gamma-1}(1-x)^{\rho-1}\, dx \quad \text{for } x \in [0,1]. \tag{14.8}
\]

Proof. We will begin by computing the distribution of $X_n.$ As an example, the probability of drawing 3 greens and then 2 reds is
\[
\frac{g}{r+g} \cdot \frac{g+c}{r+g+c} \cdot \frac{g+2c}{r+g+2c} \cdot \frac{r}{r+g+3c} \cdot \frac{r+c}{r+g+4c}.
\]
More generally, the probability of first drawing $m$ greens and then $n-m$ reds is
\[
\frac{g(g+c)\cdots(g+(m-1)c) \cdot r(r+c)\cdots(r+(n-m-1)c)}{(r+g)(r+g+c)\cdots(r+g+(n-1)c)}.
\]
Since this is the same probability for any of the $\binom{n}{m}$ ways of drawing $m$ greens and $n-m$ reds in $n$ draws, we have
\begin{align}
P(\text{Draw } m \text{ greens}) &= \binom{n}{m} \frac{g(g+c)\cdots(g+(m-1)c) \cdot r(r+c)\cdots(r+(n-m-1)c)}{(r+g)(r+g+c)\cdots(r+g+(n-1)c)} \nonumber \\
&= \binom{n}{m} \frac{\gamma(\gamma+1)\cdots(\gamma+m-1) \cdot \rho(\rho+1)\cdots(\rho+n-m-1)}{(\rho+\gamma)(\rho+\gamma+1)\cdots(\rho+\gamma+n-1)}. \tag{14.9}
\end{align}


Before going to the general case let us warm up with the special case $g = r = c = 1.$ In this case Eq. (14.9) becomes
\[
P(\text{Draw } m \text{ greens}) = \binom{n}{m} \frac{1 \cdot 2 \cdots m \cdot 1 \cdot 2 \cdots (n-m)}{2 \cdot 3 \cdots (n+1)} = \frac{1}{n+1}.
\]
On the set $\{\text{Draw } m \text{ greens}\}$ we have $X_n = \frac{1+m}{2+n},$ and hence it follows for any $f \in C([0,1])$ that
\[
E[f(X_n)] = \sum_{m=0}^n f\Big(\frac{m+1}{n+2}\Big) \cdot P(\text{Draw } m \text{ greens}) = \sum_{m=0}^n f\Big(\frac{m+1}{n+2}\Big) \frac{1}{n+1}.
\]
Therefore
\[
E[f(X_\infty)] = \lim_{n\to\infty} E[f(X_n)] = \int_0^1 f(x)\, dx \tag{14.10}
\]
and hence we may conclude that $X_\infty$ has the uniform distribution on $[0,1].$

For the general case, recall that $n! = \Gamma(n+1),$ $\Gamma(t+1) = t\Gamma(t),$ and therefore for $m \in \mathbb{N},$
\[
\Gamma(x+m) = (x+m-1)(x+m-2)\cdots(x+1)\,x\,\Gamma(x). \tag{14.11}
\]
Also recall Stirling's formula,
\[
\Gamma(x) = \sqrt{2\pi}\, x^{x-1/2} e^{-x}\,[1 + r(x)] \tag{14.12}
\]
where $|r(x)| \to 0$ as $x \to \infty.$ To finish the proof we will follow the strategy of the proof of Eq. (14.10), using Stirling's formula to estimate the expression for $P(\text{Draw } m \text{ greens})$ in Eq. (14.9).

On the set $\{\text{Draw } m \text{ greens}\}$ we have
\[
X_n = \frac{g+mc}{r+g+nc} = \frac{\gamma+m}{\rho+\gamma+n} =: x_m,
\]
where $\rho := r/c$ and $\gamma := g/c.$ For later use, notice that $\Delta_m x := x_{m+1} - x_m = \frac{1}{\rho+\gamma+n}.$

Using this notation we may rewrite Eq. (14.9) as
\begin{align}
P(\text{Draw } m \text{ greens}) &= \binom{n}{m} \frac{\frac{\Gamma(\gamma+m)}{\Gamma(\gamma)} \cdot \frac{\Gamma(\rho+n-m)}{\Gamma(\rho)}}{\frac{\Gamma(\rho+\gamma+n)}{\Gamma(\rho+\gamma)}} \nonumber \\
&= \frac{\Gamma(\rho+\gamma)}{\Gamma(\rho)\Gamma(\gamma)} \cdot \frac{\Gamma(n+1)}{\Gamma(m+1)\Gamma(n-m+1)} \cdot \frac{\Gamma(\gamma+m)\,\Gamma(\rho+n-m)}{\Gamma(\rho+\gamma+n)}. \tag{14.13}
\end{align}

Now by Stirling's formula,
\begin{align*}
\frac{\Gamma(\gamma+m)}{\Gamma(m+1)} &= \frac{(\gamma+m)^{\gamma+m-1/2} e^{-(\gamma+m)}\,[1+r(\gamma+m)]}{(1+m)^{m+1/2} e^{-(m+1)}\,[1+r(1+m)]} \\
&= (\gamma+m)^{\gamma-1} \cdot \Big(\frac{\gamma+m}{m+1}\Big)^{m+1/2} e^{-(\gamma-1)}\, \frac{1+r(\gamma+m)}{1+r(m+1)} \\
&= (\gamma+m)^{\gamma-1} \cdot \Big(\frac{1+\gamma/m}{1+1/m}\Big)^{m+1/2} e^{-(\gamma-1)}\, \frac{1+r(\gamma+m)}{1+r(m+1)}.
\end{align*}
We will keep $m$ fairly large, so that
\[
\Big(\frac{1+\gamma/m}{1+1/m}\Big)^{m+1/2} = \exp\Big((m+1/2)\ln\Big(\frac{1+\gamma/m}{1+1/m}\Big)\Big) \cong \exp((m+1/2)(\gamma/m - 1/m)) \cong e^{\gamma-1}.
\]
Hence we have
\[
\frac{\Gamma(\gamma+m)}{\Gamma(m+1)} \sim (\gamma+m)^{\gamma-1}.
\]

Similarly, keeping $n-m$ fairly large, we also have
\[
\frac{\Gamma(\rho+n-m)}{\Gamma(n-m+1)} \sim (\rho+n-m)^{\rho-1} \quad \text{and} \quad \frac{\Gamma(\rho+\gamma+n)}{\Gamma(n+1)} \sim (\rho+\gamma+n)^{\rho+\gamma-1}.
\]

Combining these estimates with Eq. (14.13) gives
\begin{align*}
P(\text{Draw } m \text{ greens}) &\sim \frac{\Gamma(\rho+\gamma)}{\Gamma(\rho)\Gamma(\gamma)} \cdot \frac{(\gamma+m)^{\gamma-1} \cdot (\rho+n-m)^{\rho-1}}{(\rho+\gamma+n)^{\rho+\gamma-1}} \\
&= \frac{\Gamma(\rho+\gamma)}{\Gamma(\rho)\Gamma(\gamma)} \cdot \Big(\frac{\gamma+m}{\rho+\gamma+n}\Big)^{\gamma-1} \Big(\frac{\rho+n-m}{\rho+\gamma+n}\Big)^{\rho-1} \cdot \frac{1}{\rho+\gamma+n} \\
&= \frac{\Gamma(\rho+\gamma)}{\Gamma(\rho)\Gamma(\gamma)} \cdot (x_m)^{\gamma-1}(1-x_m)^{\rho-1}\, \Delta_m x.
\end{align*}
Therefore, for any $f \in C([0,1]),$ it follows that
\begin{align*}
E[f(X_\infty)] = \lim_{n\to\infty} E[f(X_n)] &= \lim_{n\to\infty} \sum_{m=0}^n f(x_m)\, \frac{\Gamma(\rho+\gamma)}{\Gamma(\rho)\Gamma(\gamma)}\, (x_m)^{\gamma-1}(1-x_m)^{\rho-1}\, \Delta_m x \\
&= \int_0^1 f(x)\, \frac{\Gamma(\rho+\gamma)}{\Gamma(\rho)\Gamma(\gamma)}\, x^{\gamma-1}(1-x)^{\rho-1}\, dx.
\end{align*}
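Theorem 14.6 is easy to test by simulation. The following Python sketch (ours; the urn parameters are illustrative choices) runs many independent urns and compares the empirical law of $X_n$ for large $n$ with Beta$(\gamma, \rho)$ samples.

```python
import numpy as np

# Simulate many Polya urns and compare X_n (large n) with Beta(gamma, rho),
# where gamma = g0/c and rho = r0/c.
rng = np.random.default_rng(1)
r0, g0, c, steps, reps = 2, 1, 1, 2000, 20_000

g = np.full(reps, float(g0))
total = float(r0 + g0)
for n in range(steps):
    green_drawn = rng.random(reps) < g / total   # draw green w.p. g/(r+g)
    g += c * green_drawn
    total += c

X = g / total                                   # X_n approximates X_infinity
B = rng.beta(g0 / c, r0 / c, size=reps)         # Beta(gamma, rho) samples
print("simulated mean/var:      ", X.mean(), X.var())
print("Beta(gamma,rho) mean/var:", B.mean(), B.var())
```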


14.3 Galton Watson Branching Process

This section is taken from [6, p. 245–249].

Notation 14.7 Let $\{p_k\}_{k=0}^\infty$ be a probability on $\mathbb{N}_0$ where $p_k := P(\xi = k)$ is the offspring distribution. We assume that the mean number of offspring,
\[
\mu := E\xi = \sum_{k=0}^\infty k p_k < \infty.
\]

Notation 14.8 Let $\{\xi^n_i : i, n \ge 1\}$ be a sequence of i.i.d. non-negative integer valued random variables with $P(\xi^n_i = k) = p_k$ for all $k \in \mathbb{N}_0.$ We also let $\{Y_n := \xi^n_1\}_{n=1}^\infty$ in order to shorten notation later.

If $Z_n$ is the number of "people" (or organisms) alive in the $n$th generation, then we assume the $i$th organism has $\xi^{n+1}_i$ offspring, so that
\[
Z_{n+1} = \xi^{n+1}_1 + \dots + \xi^{n+1}_{Z_n} = \sum_{k=1}^\infty \big(\xi^{n+1}_1 + \dots + \xi^{n+1}_k\big) 1_{Z_n = k} \tag{14.14}
\]
represents the number of people present in generation $n+1.$ We complete the description of the process $\{Z_n\}$ by setting $Z_0 = 1$ and $Z_{n+1} = 0$ if $Z_n = 0,$ i.e. once the population dies out it remains extinct forever after. The process $\{Z_n\}_{n\ge 0}$ is called a Galton-Watson branching process, see Figure 14.1.

Standing assumption: We suppose that $p_1 < 1,$ for otherwise we would have $Z_n = 1$ for all $n.$

To understand $Z_n$ a bit better observe that
\begin{align*}
Z_0 &= 1, \\
Z_1 &= \xi^1_{Z_0} = \xi^1_1, \\
Z_2 &= \xi^2_1 + \dots + \xi^2_{\xi^1_1}, \\
Z_3 &= \xi^3_1 + \dots + \xi^3_{Z_2}, \\
&\ \ \vdots
\end{align*}

The sample path in Figure 14.1 corresponds to
\begin{align*}
&\xi^1_1 = 3, \\
&\xi^2_1 = 2,\ \xi^2_2 = 0,\ \xi^2_3 = 3, \\
&\xi^3_1 = \xi^3_2 = \xi^3_3 = \xi^3_4 = 0,\ \xi^3_5 = 4, \text{ and} \\
&\xi^4_1 = \xi^4_2 = \xi^4_3 = \xi^4_4 = 0.
\end{align*}


Fig. 14.1. A possible realization of a Galton-Watson "tree" with $Z_0 = 1,$ $Z_1 = 3,$ $Z_2 = 5,$ and $Z_3 = 4.$

In order to shorten the exposition we will make use of the following two intuitive facts:

1. The different branches of the Galton-Watson tree evolve independently of one another.
2. The process $\{Z_n\}_{n=0}^\infty$ is a Markov chain with transition probabilities
\[
p(k, l) = P(Z_{n+1} = l \mid Z_n = k) = \begin{cases} \delta_{0,l} & \text{if } k = 0 \\ P(Y_1 + \dots + Y_k = l) & \text{if } k \ge 1 \end{cases} = \delta_{k,0}\delta_{0,l} + 1_{k\ge 1} \cdot P(Y_1 + \dots + Y_k = l). \tag{14.15}
\]

Lemma 14.9. If $f : \mathbb{N}_0 \to \mathbb{R}$ is a function, then
\[
(Pf)(k) = E[f(Y_1 + \dots + Y_k)]
\]
where $Y_1 + \dots + Y_k := 0$ if $k = 0.$

Proof. This follows by the simple computation;
\begin{align*}
(Pf)(k) &= \sum_{l=0}^\infty p(k,l) f(l) = \sum_{l=0}^\infty \big[\delta_{k,0}\delta_{0,l} + 1_{k\ge 1} \cdot P(Y_1 + \dots + Y_k = l)\big] f(l) \\
&= \delta_{k,0} f(0) + 1_{k\ge 1} \sum_{l=0}^\infty P(Y_1 + \dots + Y_k = l)\, f(l) \\
&= \delta_{k,0} f(0) + 1_{k\ge 1} E[f(Y_1 + \dots + Y_k)] \\
&= E[f(Y_1 + \dots + Y_k)].
\end{align*}

Let us evaluate $Pf$ for a couple of $f.$ If $f(k) = k,$ then
\[
Pf(k) = E[Y_1 + \dots + Y_k] = k\mu \implies Pf = \mu f. \tag{14.16}
\]
If $f(k) = \lambda^k$ for some $|\lambda| \le 1,$ then
\[
\big(P\lambda^{(\cdot)}\big)(k) = E\big[\lambda^{Y_1+\dots+Y_k}\big] = \varphi(\lambda)^k. \tag{14.17}
\]

Corollary 14.10. The process $\{M_n := Z_n/\mu^n\}_{n=0}^\infty$ is a positive martingale and in particular
\[
EZ_n = \mu^n < \infty \text{ for all } n \in \mathbb{N}_0. \tag{14.18}
\]

Proof. If $f(n) = n$ for all $n,$ then $Pf = \mu f$ by Eq. (14.16) and therefore
\[
E[Z_{n+1}|\mathcal{F}_n] = Pf(Z_n) = \mu \cdot f(Z_n) = \mu Z_n.
\]
Dividing this equation by $\mu^{n+1}$ then shows $E[M_{n+1}|\mathcal{F}_n] = M_n$ as desired. As $M_0 = 1,$ it then follows that $EM_n = 1$ for all $n$ and this gives Eq. (14.18).

Theorem 14.11. If $\mu < 1,$ then, almost surely, $Z_n = 0$ for a.a. $n.$ In fact,
\[
E[\text{Total \# of organisms ever alive}] = E\Big[\sum_{n=0}^\infty Z_n\Big] < \infty.
\]

Proof. When $\mu < 1,$ we have
\[
E\sum_{n=0}^\infty Z_n = \sum_{n=0}^\infty \mu^n = \frac{1}{1-\mu} < \infty
\]
and therefore $\sum_{n=0}^\infty Z_n < \infty$ a.s. As $Z_n \in \mathbb{N}_0$ for all $n,$ this can only happen if $Z_n = 0$ for almost all $n$ a.s.

Theorem 14.12. If $\mu = 1$ and $P(\xi^m_i = 1) < 1,$ then again, almost surely, $Z_n = 0$ for a.a. $n.$

Proof. First note that the assumption $\mu = 1$ and the standing assumption $p_1 < 1$ imply $p_0 > 0.$ This then further implies that for any $k \ge 1$ we have
\[
P(Y_1 + \dots + Y_k = 0) \ge P(Y_1 = 0, \dots, Y_k = 0) = p_0^k > 0,
\]
which then implies
\[
p(k,k) = P(Y_1 + \dots + Y_k = k) < 1. \tag{14.19}
\]


Because of Corollary 14.10 and the assumption that $\mu = 1,$ we know $\{Z_n\}_{n=1}^\infty$ is a martingale. This martingale, being positive, is $L^1$-bounded since
\[
E[|Z_n|] = EZ_n = EZ_0 = 1 \text{ for all } n.
\]
Therefore the martingale convergence theorem guarantees that $Z_\infty := \lim_{n\to\infty} Z_n$ exists with $EZ_\infty \le 1.$ Because $Z_n$ is integer valued, it must happen that $Z_n = Z_\infty$ for a.a. $n.$ If $k \in \mathbb{N},$ the event $\{Z_\infty = k\}$ can be expressed as
\[
\{Z_\infty = k\} = \{Z_n = k \text{ a.a. } n\} = \cup_{M=1}^\infty \{Z_n = k \text{ for all } n \ge M\}.
\]
As the sets in this union are increasing, it follows by continuity of $P$ under increasing (and then decreasing) limits that
\begin{align*}
P(Z_\infty = k) &= \lim_{M\to\infty} P(Z_n = k \text{ for all } n \ge M) \\
&= \lim_{M\to\infty} \lim_{N\to\infty} P(Z_n = k \text{ for } M \le n \le N) \\
&= \lim_{M\to\infty} \lim_{N\to\infty} P(Z_M = k)\, p(k,k)^{N-M} = 0,
\end{align*}
wherein we have made use of Eq. (14.19) in order to evaluate the limit. Thus it follows that
\[
P(Z_\infty > 0) = \sum_{k=1}^\infty P(Z_\infty = k) = 0.
\]
The above argument does not apply to $k = 0$ since $p(0,0) = 1$ by definition.

Remark 14.13. By the way, the branching process $\{Z_n\}_{n=0}^\infty$ with $\mu = 1$ and $P(\xi = 1) < 1$ gives a nice example of a non-regular martingale. Indeed, if $Z$ were regular, we would have
\[
Z_n = E\big[\lim_{m\to\infty} Z_m \big| \mathcal{F}_n\big] = E[0|\mathcal{F}_n] = 0,
\]
which is clearly false.

We now wish to consider the case where $\mu := EY_k = E[\xi^m_i] > 1.$ For $\lambda \in \mathbb{C}$ with $|\lambda| \le 1$ let
\[
\varphi(\lambda) := E\big[\lambda^{Y_1}\big] = \sum_{k\ge 0} p_k \lambda^k \tag{14.20}
\]
be the (probability) generating function of $\{p_k\}_{k=0}^\infty.$ Notice that $\varphi(1) = 1$ and for $\lambda = s \in (-1,1)$ we have
\[
\varphi'(s) = \sum_{k\ge 0} k p_k s^{k-1} \quad \text{and} \quad \varphi''(s) = \sum_{k\ge 0} k(k-1) p_k s^{k-2} \ge 0
\]
with
\[
\lim_{s\uparrow 1} \varphi'(s) = \sum_{k\ge 0} k p_k = E[\xi] =: \mu \quad \text{and} \quad \lim_{s\uparrow 1} \varphi''(s) = \sum_{k\ge 0} k(k-1) p_k = E[\xi(\xi-1)].
\]
Therefore $\varphi$ is convex with $\varphi(0) = p_0,$ $\varphi(1) = 1,$ and $\varphi'(1) = \mu.$

Lemma 14.14. If µ = ϕ′ (1) > 1, there exists a unique ρ < 1 so that ϕ (ρ) = ρ.

Proof. See Figure 14.2 below.

Fig. 14.2. Figure associated to $\varphi(s) = \frac18(1 + 3s + 3s^2 + s^3),$ which is relevant for Exercise 3.13 of Durrett on p. 249. In this case $\rho \cong 0.23607.$

Theorem 14.15 (See Durrett [6], p. 247–248). If $\mu > 1,$ then
\[
P_1(\text{Extinction}) = P_1\big(\lim_{n\to\infty} Z_n = 0\big) = P_1(Z_n = 0 \text{ for some } n) = \rho.
\]

Proof. Since $\{Z_m = 0\} \subset \{Z_{m+1} = 0\},$ it follows that $\{Z_m = 0\} \uparrow \{Z_n = 0 \text{ for some } n\}$ and therefore, if $\theta_m := P_1(Z_m = 0),$ then
\[
P_1(Z_n = 0 \text{ for some } n) = \lim_{m\to\infty} \theta_m.
\]
Notice that $\theta_1 = P_1(Z_1 = 0) = p_0 \in (0,1).$ We now show $\theta_m = \varphi(\theta_{m-1}).$ To see this, conditioned on the set $\{Z_1 = k\},$ $Z_m = 0$ iff all $k$ families die out in the remaining $m-1$ time units. Since each family evolves independently, the probability of this event is $\theta_{m-1}^k.$ [See Example 14.16 for a formal justification of this fact.]


Combining this with $P_1(Z_1 = k) = P(\xi^1_1 = k) = p_k$ allows us to conclude by the first step analysis that
\[
\theta_m = P_1(Z_m = 0) = \sum_{k=0}^\infty P_1(Z_m = 0, Z_1 = k) = \sum_{k=0}^\infty P_1(Z_m = 0|Z_1 = k)\, P_1(Z_1 = k) = \sum_{k=0}^\infty \theta_{m-1}^k p_k = \varphi(\theta_{m-1}).
\]

It is now easy to see that θm ↑ ρ as m ↑ ∞, again see Figure 14.3.

Fig. 14.3. The graphical interpretation of iterating $\theta_m = \varphi(\theta_{m-1}),$ starting from $\theta_0 = P_1(Z_0 = 0) = 0$ and then $\theta_1 = \varphi(0) = p_0 = P_1(Z_1 = 0);$ the iterates $\theta_0, \theta_1, \theta_2, \dots$ increase to the fixed point $\rho.$
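The iteration $\theta_m = \varphi(\theta_{m-1})$ can also be carried out numerically; the sketch below (ours) uses the $\varphi$ from Figure 14.2 and recovers $\rho \cong 0.23607.$

```python
# Fixed-point iteration from the proof of Theorem 14.15, with
# phi(s) = (1 + 3s + 3s^2 + s^3)/8 as in Fig. 14.2.
def phi(s):
    return (1 + 3 * s + 3 * s**2 + s**3) / 8

theta = 0.0                  # theta_0 = P_1(Z_0 = 0) = 0
for m in range(60):
    theta = phi(theta)       # theta_m increases to the fixed point rho

print(theta)                 # ~0.23607, matching rho in Fig. 14.2
```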

14.3.1 Appendix: justifying assumptions

Exercise 14.2 (Markov Chain Products). Suppose that $P$ and $Q$ are Markov matrices on state spaces $S$ and $T.$ Then $P\otimes Q$ defined by
\[
P\otimes Q((s,t), (s',t')) := P(s,s') \cdot Q(t,t')
\]
defines a Markov transition matrix on $S \times T.$ Moreover, if $\{X_n\}$ and $\{Y_n\}$ are Markov chains on $S$ and $T$ respectively with transition matrices $P$ and $Q$ respectively, then $Z_n = (X_n, Y_n)$ is a Markov chain with transition matrix $P\otimes Q.$

Exercise 14.3 (Markov Chain Projections). Suppose that $\{X_n\}$ is the Markov chain on a state space $S$ with transition matrix $P$ and $\pi : S \to T$ is a surjective (for simplicity) map. Then there is a transition matrix $Q$ on $T$ such that $Y_n := \pi \circ X_n$ is a Markov chain on $T$ for all starting distributions on $S$ iff, for all $t \in T,$
\[
P\big(x, \pi^{-1}(t)\big) = P\big(y, \pi^{-1}(t)\big) \text{ whenever } \pi(x) = \pi(y). \tag{14.21}
\]
In this case,
\[
Q(s,t) = P\big(x, \pi^{-1}(t)\big) := \sum_{y\in S\,:\,\pi(y)=t} P(x,y) \text{ where } x \in \pi^{-1}(s). \tag{14.22}
\]

Example 14.16. Using the above results we may now justify the fact that "the different branches of the Galton-Watson tree evolve independently of one another." Indeed, suppose that $\{X_n\}$ and $\{Y_n\}$ are two independent copies of the Galton-Watson process starting with $k$ and $l$ offspring respectively. Let $\pi : \mathbb{N}_0 \times \mathbb{N}_0 \to \mathbb{N}_0$ be the addition map, $\pi(a,b) = a+b,$ and let
\[
Z_n := X_n + Y_n = \pi(X_n, Y_n).
\]
We claim that $\{Z_n\}$ is the Galton-Watson process starting with $k+l$ offspring. To see this we notice, for $c \in \mathbb{N}_0$ (using the transition functions in Eq. (14.15)), that
\begin{align*}
\sum_{(x,y)\in\pi^{-1}(c)} P\otimes P((a,b),(x,y)) &= \sum_{x+y=c} P(a,x)\, P(b,y) \\
&= \sum_{x+y=c} \big[\delta_{a,0}\delta_{0,x} + 1_{a\ge 1} \cdot P(Y_1+\dots+Y_a = x)\big] \cdot \big[\delta_{b,0}\delta_{0,y} + 1_{b\ge 1} \cdot P(Y_1+\dots+Y_b = y)\big].
\end{align*}


Hence, if $\{\tilde Y_n\}$ is an independent copy of $\{Y_n\},$ then assuming $a, b \ge 1$ we find
\begin{align*}
\sum_{x+y=c} P(Y_1+\dots+Y_a = x) \cdot P(\tilde Y_1+\dots+\tilde Y_b = y) &= \sum_{x+y=c} P\big(Y_1+\dots+Y_a = x,\ \tilde Y_1+\dots+\tilde Y_b = y\big) \\
&= P\big(Y_1+\dots+Y_a + \tilde Y_1+\dots+\tilde Y_b = c\big) \\
&= P(a+b, c) = P(\pi(a,b), c).
\end{align*}
When $a = 0,$ we get
\[
\sum_{(x,y)\in\pi^{-1}(c)} P\otimes P((a,b),(x,y)) = \sum_{x+y=c} \delta_{0,x} \cdot \big[\delta_{b,0}\delta_{0,y} + 1_{b\ge 1} \cdot P(Y_1+\dots+Y_b = y)\big] = \delta_{b,0}\delta_{0,c} + 1_{b\ge 1} \cdot P(Y_1+\dots+Y_b = c) = P(a+b, c),
\]
and similarly the statement holds if $b = 0.$ Thus we may now apply Exercise 14.3 to complete the proof.

Using the above results, if $\theta_m(k) := P_k(Z_m = 0),$ then
\[
\theta_m(k+l) = P_{k+l}(Z_m = 0) = P_{(k,l)}(X_m + Y_m = 0) = P_{(k,l)}(X_m = 0, Y_m = 0) = P_k(X_m = 0)\, P_l(Y_m = 0) = \theta_m(k)\,\theta_m(l)
\]
and therefore $\theta_m(k) = \theta_m(1)^k.$


Part IV

Continuous Time Theory


15

Discrete State Space/Continuous Time

In this chapter, we continue to assume that $S$ is a finite or countable state space and $(\Omega, P)$ is a probability space.

15.1 Geometric, Exponential, and Poisson Distributions

Notation 15.1 Let $N : \Omega \to \mathbb{N}_0$ be an integer valued random variable. We write:

1. $N \overset{d}{=} \mathrm{Bern}(p)$ if $P(N=1) = p$ and $P(N=0) = 1-p$ where $0 \le p \le 1,$ and say $N$ is a Bernoulli random variable.
2. $N \overset{d}{=} \mathrm{Geo}(p)$ if $P(N=k) = p(1-p)^{k-1}$ for $k \in \mathbb{N},$ and say $N$ is a geometric random variable.
3. $N \overset{d}{=} \mathrm{Pois}(\lambda)$ if $P(N=k) = \frac{\lambda^k}{k!}e^{-\lambda}$ for $k \in \mathbb{N}_0,$ and say $N$ is a Poisson random variable.

Remark 15.2. Here is how the geometric distribution often arises. If $\{X_k\}_{k=1}^\infty$ are i.i.d. $\mathrm{Bern}(p)$ random variables (i.e. $P(X_i = 1) = p$ and $P(X_i = 0) = 1-p$), then
\[
N := \inf\{k \ge 1 : X_k = 1\} \overset{d}{=} \mathrm{Geo}(p).
\]
Indeed,
\[
P(N = k) = P(X_1 = 0, \dots, X_{k-1} = 0, X_k = 1) = (1-p)^{k-1} p.
\]
Here are two simple consequences of this representation.

Lemma 15.3 (Memoryless property of the geometric distribution). If $N \overset{d}{=} \mathrm{Geo}(p)$ is a geometric random variable, then:

1. $P(N > n) = (1-p)^n$ for all $n \in \mathbb{N},$ and
2. for all $k, n \in \mathbb{N},$
\[
P(N = n+k \mid N > n) = P(N = k).
\]

Proof. We take each item in turn.

1. Using the representation in Remark 15.2,
\[
P(N > n) = P(X_1 = 0, \dots, X_n = 0) = (1-p)^n.
\]
Alternatively, we can sum the geometric series,
\[
P(N > n) = \sum_{k=n+1}^\infty (1-p)^{k-1} p = (1-p)^n p \cdot \frac{1}{1-(1-p)} = (1-p)^n.
\]
Conversely, if $P(N > n) = (1-p)^n$ for all $n \in \mathbb{N},$ then
\[
P(N = n) = P(N > n-1) - P(N > n) = (1-p)^{n-1} - (1-p)^n = p(1-p)^{n-1}.
\]
2. Using the representation in Remark 15.2, we easily (and intuitively) see that
\begin{align*}
P(N = n+k \mid N > n) &= \frac{P(X_1 = 0, \dots, X_{n+k-1} = 0, X_{n+k} = 1)}{P(X_1 = 0, \dots, X_n = 0)} \\
&= P(X_{n+1} = 0, \dots, X_{n+k-1} = 0, X_{n+k} = 1) \\
&= P(X_1 = 0, \dots, X_{k-1} = 0, X_k = 1) = P(N = k).
\end{align*}
Alternatively, by first principles,
\[
P(N = n+k \mid N > n) = \frac{P(N = n+k)}{P(N > n)} = \frac{p(1-p)^{n+k-1}}{(1-p)^n} = p(1-p)^{k-1} = P(N = k).
\]

Exercise 15.1 (Some Discrete Distributions). Let $p \in (0,1]$ and $\lambda > 0.$ In the two parts below, the distribution of $N$ will be described. In each case find the generating function (see Proposition 1.2) and use this to verify the stated values for $EN$ and $\mathrm{Var}(N).$

1. Geometric($p$): $P(N = k) = p(1-p)^{k-1}$ for $k \in \mathbb{N}.$ ($P(N = k)$ is the probability that the $k$th trial is the first success in a sequence of independent trials with probability of success $p.$) You should find $EN = 1/p$ and $\mathrm{Var}(N) = \frac{1-p}{p^2}.$


2. Poisson($\lambda$): $P(N = k) = \frac{\lambda^k}{k!}e^{-\lambda}$ for all $k \in \mathbb{N}_0.$ You should find $EN = \lambda = \mathrm{Var}(N).$

Definition 15.4. A random variable $T \ge 0$ is said to be exponential with parameter $\lambda \in [0,\infty)$ provided $P(T > t) = e^{-\lambda t}$ for all $t \ge 0.$ We will write $T \overset{d}{=} E(\lambda)$ for short.

Alternatively we can express this distribution as
\[
P(T \in (t, t+dt]) = P(T > t) - P(T > t+dt) = e^{-\lambda t} - e^{-\lambda(t+dt)} = e^{-\lambda t}\big[1 - e^{-\lambda\, dt}\big] = e^{-\lambda t}\lambda\, dt + o(dt) \text{ with } t \ge 0.
\]
To be more precise,
\[
E[f(T)] = \int_0^\infty f(\tau)\, \lambda e^{-\lambda\tau}\, d\tau.
\]
Notice that taking $f(\tau) = 1_{\tau > t}$ in the above expression again gives
\[
P(T > t) = E[1_{T>t}] = \int_0^\infty 1_{\tau>t}\, \lambda e^{-\lambda\tau}\, d\tau = \int_t^\infty \lambda e^{-\lambda\tau}\, d\tau = e^{-\lambda t}.
\]
Notice that $T$ has the following memoryless property,
\[
P(T > s+t \mid T > s) = P(T > t) \text{ for all } s, t \ge 0.
\]
Indeed,
\[
P(T > s+t \mid T > s) = \frac{P(T > s+t,\ T > s)}{P(T > s)} = \frac{P(T > s+t)}{P(T > s)} = \frac{e^{-\lambda(t+s)}}{e^{-\lambda s}} = e^{-\lambda t} = P(T > t).
\]
See Theorem 15.8 below for the converse assertion.

Exercise 15.2. Let $T \overset{d}{=} E(\lambda)$ be as in Definition 15.4. Show:

1. $ET = \lambda^{-1}$ and
2. $\mathrm{Var}(T) = \lambda^{-2}.$

In the next two propositions we study $N_p \overset{d}{=} \mathrm{Geo}(p)$ in the limit where $p$ is small, so that there are many trials before success, while at the same time the time between trials is decreasing as well. Taking this limit with the appropriate scalings produces an exponential random variable in the limit.

Notation 15.5 For $t \ge 0$ we let
\[
[t] := \max\{n \in \mathbb{N}_0 : n \le t\} = \sum_{k=1}^\infty (k-1)\, 1_{[k-1,k)}(t)
\]
and more generally if $\varepsilon > 0$ we let
\[
[t]_\varepsilon := \varepsilon\Big[\frac{t}{\varepsilon}\Big] = \sum_{k=1}^\infty (k-1)\varepsilon\, 1_{[(k-1)\varepsilon, k\varepsilon)}(t).
\]

Proposition 15.6 (Scaling limits of geometrics). Let $\lambda > 0$ be given and let $N_\varepsilon \overset{d}{=} \mathrm{Geo}(\lambda\varepsilon)$ for $0 < \varepsilon < \lambda^{-1}.$ Then $\varepsilon N_\varepsilon \implies T \overset{d}{=} E(\lambda)$ as $\varepsilon \downarrow 0,$ i.e.
\[
\lim_{\varepsilon\downarrow 0} P(\varepsilon N_\varepsilon > t) = e^{-\lambda t} \text{ for all } t > 0.
\]

Proof. We have
\[
P(\varepsilon N_\varepsilon > t) = P\Big(N_\varepsilon > \frac{t}{\varepsilon}\Big) = (1 - \varepsilon\lambda)^{[t/\varepsilon]}.
\]
Since $[t/\varepsilon] = t/\varepsilon + O(1)$ and $\ln(1-\varepsilon\lambda) = -\varepsilon\lambda + O(\varepsilon^2),$ we find
\[
\ln P(\varepsilon N_\varepsilon > t) = \Big[\frac{t}{\varepsilon}\Big]\ln(1-\varepsilon\lambda) = \Big(\frac{t}{\varepsilon} + O(1)\Big)\big(-\varepsilon\lambda + O(\varepsilon^2)\big) = -\lambda t + O(\varepsilon)
\]
and the result follows.

We can substantially improve on the previous proposition.

Proposition 15.7. For $p > 0$ and $\lambda > 0$ let $G_p : (0,1] \to \mathbb{N}$ be defined by
\[
G_p(x) := \sum_{n=1}^\infty n \cdot 1_{((1-p)^n, (1-p)^{n-1}]}(x)
\]
and $T_\lambda : (0,1] \to \mathbb{R}_+$ be defined by $T_\lambda(x) = -\frac{1}{\lambda}\ln x.$ If $U$ is a $(0,1]$-valued uniformly distributed random variable, then:

1. $G_p(U) \overset{d}{=} \mathrm{Geo}(p),$
2. $T_\lambda(U) \overset{d}{=} E(\lambda),$ and
3. $\lim_{\varepsilon\downarrow 0} \varepsilon G_{\varepsilon\lambda}(U) = T_\lambda(U).$

Proof. Since the law of $U$ is Lebesgue measure ($m$) on $(0,1],$ we find
\[
P(G_p(U) = n) = m(\{x \in (0,1] : G_p(x) = n\}) = (1-p)^{n-1} - (1-p)^n = p(1-p)^{n-1},
\]


which shows $G_p(U) \overset{d}{=} \mathrm{Geo}(p).$ Similarly, as $T_\lambda(x) = -\frac{1}{\lambda}\ln x > t$ iff $x < e^{-\lambda t},$ it follows that
\[
m(\{x \in (0,1] : T_\lambda(x) > t\}) = e^{-\lambda t},
\]
i.e. $T_\lambda(U) \overset{d}{=} E(\lambda).$

For item 3. we refer the reader to Figure 15.1.

Fig. 15.1. This graph shows $\varepsilon G_\varepsilon$ for $\varepsilon = 1/20$ in black along with the graph of $-\ln x$ in red.

For the reader not convinced by the figure, here are some more details. Since $G_p(x) = n$ iff
\[
n\ln(1-p) < \ln x \le (n-1)\ln(1-p),
\]
or equivalently (keep in mind $\ln(1-p) < 0$) iff
\[
(n-1) \le \frac{\ln x}{\ln(1-p)} < n \iff \Big[\frac{\ln x}{\ln(1-p)}\Big] = n-1,
\]
it follows that
\[
G_p(x) = \Big[\frac{\ln x}{\ln(1-p)}\Big] + 1.
\]
Using $\varepsilon[s] = [\varepsilon s]_\varepsilon$ we conclude
\[
\varepsilon G_{\lambda\varepsilon}(x) = \varepsilon\Big[\frac{\ln x}{\ln(1-\lambda\varepsilon)}\Big] + \varepsilon = \Big[\frac{\varepsilon}{\ln(1-\lambda\varepsilon)}\ln x\Big]_\varepsilon + \varepsilon,
\]
and since $\varepsilon/\ln(1-\lambda\varepsilon) \to -1/\lambda$ as $\varepsilon \downarrow 0,$ it easily follows that
\[
\lim_{\varepsilon\downarrow 0} \varepsilon G_{\lambda\varepsilon}(x) = -\frac{1}{\lambda}\ln x =: T_\lambda(x).
\]
This suffices to prove item 3.
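Item 3. is a pathwise (coupling) statement: the same uniform $U$ is used on both sides. The sketch below (ours; the values of $\lambda$ and $\varepsilon$ are illustrative) shows the convergence numerically.

```python
import numpy as np

# Numerical check of Proposition 15.7: with a common uniform U,
# eps * G_{lam*eps}(U) converges to T_lam(U) = -ln(U)/lam as eps -> 0.
rng = np.random.default_rng(2)
lam = 2.0
U = rng.random(5)

def G(p, x):
    # G_p(x) = n iff (1-p)^n < x <= (1-p)^(n-1), i.e. floor(ln x / ln(1-p)) + 1.
    return np.floor(np.log(x) / np.log(1 - p)) + 1

for eps in [0.1, 0.01, 0.001]:
    print(eps, eps * G(lam * eps, U) - (-np.log(U) / lam))   # gap -> 0
```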

Theorem 15.8 (Memoryless property of the exponential distribution). A random variable $T \in (0,\infty]$ has an exponential distribution iff it satisfies the memoryless property:
\[
P(T > s+t \mid T > s) = P(T > t) \text{ for all } s, t \ge 0,
\]
where as usual $P(A|B) := P(A\cap B)/P(B)$ when $P(B) > 0.$ (Note that $T \overset{d}{=} E(0)$ means that $P(T > t) = e^{-0\cdot t} = 1$ for all $t > 0$ and therefore that $T = \infty$ a.s.)

Proof. (The following proof is taken from [14].) Suppose first that $T \overset{d}{=} E(\lambda)$ for some $\lambda > 0.$ Then
\[
P(T > s+t \mid T > s) = \frac{P(T > s+t)}{P(T > s)} = \frac{e^{-\lambda(s+t)}}{e^{-\lambda s}} = e^{-\lambda t} = P(T > t).
\]
For the converse, let $g(t) := P(T > t).$ Then by assumption,
\[
\frac{g(t+s)}{g(s)} = P(T > s+t \mid T > s) = P(T > t) = g(t)
\]
whenever $g(s) \ne 0,$ and $g$ is a decreasing function. Therefore if $g(s) = 0$ for some $s > 0$ then $g(t) = 0$ for all $t > s.$ Thus it follows that
\[
g(t+s) = g(t)\, g(s) \text{ for all } s, t \ge 0.
\]
Since $T > 0,$ we know that $g(1/n) = P(T > 1/n) > 0$ for some $n$ and therefore $g(1) = g(1/n)^n > 0,$ and we may write $g(1) = e^{-\lambda}$ for some $0 \le \lambda < \infty.$

Observe that for $p, q \in \mathbb{N},$ $g(p/q) = g(1/q)^p,$ and taking $p = q$ then shows $e^{-\lambda} = g(1) = g(1/q)^q.$ Therefore $g(p/q) = e^{-\lambda p/q},$ so that $g(t) = e^{-\lambda t}$ for all $t \in \mathbb{Q}_+ := \mathbb{Q}\cap\mathbb{R}_+.$ Given $r, s \in \mathbb{Q}_+$ and $t \in \mathbb{R}$ such that $r \le t \le s,$ we have, since $g$ is decreasing, that
\[
e^{-\lambda r} = g(r) \ge g(t) \ge g(s) = e^{-\lambda s}.
\]
Hence letting $s \uparrow t$ and $r \downarrow t$ in the above equations shows that $g(t) = e^{-\lambda t}$ for all $t \in \mathbb{R}_+$ and therefore $T \overset{d}{=} E(\lambda).$

15.2 Scaling Limits of Markov Chains

The next exercise deals with how to describe a "lazy" Markov chain. We will say a chain is lazy if $p(x,x) > 0$ for some $x \in S.$ The point being that if $p(x,x) > 0,$ then the chain starting at $x$ may be lazy and stay at $x$ for some period of time before deciding to jump to a new site. The next exercise describes lazy chains in terms of a non-lazy chain and the random times that the lazy chain will spend lounging at each site $x \in S.$ We will refer to this as the jump-hold description of the chain. We will give a similar description of chains on $S$ in the context of continuous time in Corollary 15.42 below. Let $\{X_n\}_{n=0}^\infty$ be the Markov chain with transition kernel $P,$ let $j_1$ be the first jump time of the chain, and let $q(x,y)$ be defined as in Exercise 15.3 below. We then have
\[
P_x(j_1 = k, X_{j_1} = y) = P_x(X_1 = x, \dots, X_{k-1} = x, X_k = y) = p(x,x)^{k-1} p(x,y) = p(x,x)^{k-1}(1 - p(x,x))\, q(x,y),
\]
from which we learn that $j_1$ and $Y_1 := X_{j_1}$ are independent, $j_1 \overset{d}{=} \mathrm{Geo}(1 - p(x,x)),$ and $P_x(Y_1 = y) = q(x,y).$ These computations along with induction and the strong Markov property lead to the results in the next exercise. [You are asked to give an elementary proof of these results.]

Notation 15.9 (Jump and hold times) Let $j_k$ denote the time of the $k$th jump of the chain $\{X_n\}_{n=0}^\infty,$ so that
\[
j_1 := \inf\{n > 0 : X_n \ne X_0\} \quad \text{and} \quad j_{k+1} := \inf\{n > j_k : X_n \ne X_{j_k}\}
\]
with the convention that $j_0 = 0.$ Further let $\sigma_k := j_{k+1} - j_k$ denote the time spent between the $k$th and $(k+1)$th jumps of the chain $\{X_n\}_{n=0}^\infty,$ see Figure 15.2.

The sample path of the chain $\{X_n\}$ associated to Figure 15.2 is

$n$:    0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
$X_n$:  2 2 4 4 4 1 3 3 3 2  2  2  1  1  1  ...

and correspondingly we have $\sigma_0 = 2,$ $\sigma_1 = 3,$ $\sigma_2 = 1,$ $\sigma_3 = 3,$ $\sigma_4 = 3,$ and $\sigma_5 > 2,$ and similarly $j_1 = 2,$ $j_2 = 5,$ $j_3 = 6,$ $j_4 = 9,$ $j_5 = 12,$ and $j_6 > 14.$

Exercise 15.3 (Discrete time M.C. jump-hold description). Let $S$ be a countable or finite set, $(\Omega, P, \{X_n\}_{n=0}^\infty)$ be a Markov chain with transition kernel $P := \{p(x,y)\}_{x,y\in S},$ and let $\nu(x) := P(X_0 = x)$ for all $x \in S.$ For simplicity let us assume there are no absorbing states,¹ (i.e. $p(x,x) < 1$ for all $x \in S$) and then define $Q_{x,y} = q(x,y)$ where
\[
q(x,y) := \begin{cases} \frac{p(x,y)}{1-p(x,x)} & \text{if } x \ne y \\ 0 & \text{if } x = y. \end{cases}
\]
¹ A state $x$ is absorbing if $p(x,x) = 1,$ since in this case there is no chance for the chain to leave $x$ once it hits $x.$

Fig. 15.2. A typical sample path for a lazy discrete time Markov chain with $S = \{1,2,3,4\}.$ The picture indicates the jump times $j_0, \dots, j_5$ and sojourn times $\sigma_0, \dots, \sigma_5$ for this random graph.

Let $j_k$ denote the time of the $k$th jump of the chain $\{X_n\}_{n=0}^\infty,$ so that
\[
j_1 := \inf\{n > 0 : X_n \ne X_0\} \quad \text{and} \quad j_{k+1} := \inf\{n > j_k : X_n \ne X_{j_k}\}
\]
with the convention that $j_0 = 0.$ Further let $\sigma_k := j_{k+1} - j_k$ denote the time spent between the $k$th and $(k+1)$th jumps of the chain $\{X_n\}_{n=0}^\infty,$ see Figure 15.2. Show:

1. For $\{x_k\}_{k=0}^n \subset S$ with $x_k \ne x_{k-1}$ for $k = 1,\dots,n$ and $m_0,\dots,m_{n-1} \in \mathbb{N},$
\[
P\big(\big[\cap_{k=0}^n \{X_{j_k} = x_k\}\big] \cap \big[\cap_{k=0}^{n-1} \{\sigma_k = m_k\}\big]\big) = \nu(x_0) \prod_{k=0}^{n-1} p(x_k,x_k)^{m_k-1}(1-p(x_k,x_k)) \cdot q(x_k,x_{k+1}). \tag{15.1}
\]
2. Summing the previous formula on $m_0,\dots,m_{n-1} \in \mathbb{N},$ conclude
\[
P\big(\cap_{k=0}^n \{X_{j_k} = x_k\}\big) = \nu(x_0) \cdot \prod_{k=0}^{n-1} q(x_k,x_{k+1}),
\]
i.e. this shows $\{Y_k := X_{j_k}\}_{k=0}^\infty$ is a Markov chain with transition kernel $Q_{x,y} = q(x,y).$


3. Conclude, relative to the conditional probability measure $P(\cdot \mid \cap_{k=0}^n \{X_{j_k} = x_k\}),$ that $\{\sigma_k\}_{k=0}^{n-1}$ are independent geometric random variables with $\sigma_k \overset{d}{=} \mathrm{Geo}(1 - p(x_k,x_k))$ for $0 \le k \le n-1.$

Corollary 15.10. Let $P$ and $Q$ be the Markov matrices as described in Exercise 15.3, $\{Y_n\}_{n=0}^\infty$ be the non-lazy Markov chain associated to $Q,$ and $\{U_n\}_{n=0}^\infty$ be i.i.d. random variables uniformly distributed on $(0,1].$ Let $\{\tilde X_n\}_{n\ge 0}$ be the stochastic process which visits the sites (in order) determined by $\{Y_n\}_{n=0}^\infty$ and remains at site $Y_n$ for $\sigma_n := G_{1-p(Y_n,Y_n)}(U_n)$ many steps. Then $\{\tilde X_n\}_{n\ge 0}$ and $\{X_n\}_{n\ge 0}$ have the same distribution, assuming we start each chain with the same initial distribution.

The reason that the geometric distribution appears in the above description is that it has the forgetfulness property described in Lemma 15.3 above. This forgetfulness property of the jump time intervals is a consequence of the memoryless property of a Markov chain. For continuous time chains we will see that we essentially need only replace the geometric distribution in the above description by the exponential distribution (see Definition 15.4), which is the unique continuous distribution with the same forgetfulness property, see Theorem 15.8 above. In Theorem 15.14, we are going to start with a non-lazy Markov chain $\{Y_n\}_{n=0}^\infty$ and make it lazier and lazier while at the same time scaling the amount of time between the steps of the chain to be smaller and smaller. By choosing the appropriate scalings we will get a limiting process, $\{X_t\},$ which is going to be a typical continuous time Markov chain. Theorem 15.14 will in fact give most of the key results in this chapter. We begin with some basic facts about matrix exponentials which will be needed in the proof of Theorem 15.14 and for the rest of this chapter as well.

Definition 15.11. If $B = \{B_{x,y}\}_{x,y\in S}$ is a matrix on $S$ with $K := \sup_{x\in S}\sum_{y\in S}|B_{x,y}| < \infty,$ then we define $e^B$ by the convergent sum
\[
e^B = \sum_{n=0}^\infty \frac{1}{n!}B^n.
\]

Remark 15.12. We will take without proof that the exponentiation function satisfies the following properties:

1. $\frac{d}{dt}e^{tB} = Be^{tB} = e^{tB}B$ with $e^{0B} = I,$ and this differential equation uniquely determines $e^{tB}.$
2. $e^{tB}e^{sB} = e^{(t+s)B}$ for all $s, t \in \mathbb{R},$ and more generally $e^B e^C = e^{B+C}$ provided $BC = CB.$
3. If $K < 1,$ then
\[
\ln(I + B) = \sum_{k=1}^\infty (-1)^{k+1}\frac{1}{k}B^k \text{ exists}
\]
and $e^{\ln(I+B)} = I + B.$

Lemma 15.13 (Exponential product formula). If $B$ is as in Definition 15.11 and $t \ge 0,$ then
\[
\lim_{\varepsilon\downarrow 0}(I + \varepsilon B)^{[t/\varepsilon]} = e^{tB}. \tag{15.2}
\]
[This result is also easily proved for $t < 0$ as well.]

Proof. If $t = 0$ both sides of Eq. (15.2) are equal to the identity, and so there is nothing to prove; hence we may and do assume $t > 0.$ From item 3. of Remark 15.12, for $\varepsilon$ sufficiently small we have
\[
\ln(I + \varepsilon B) = \varepsilon B + O(\varepsilon^2) \quad \text{and} \quad I + \varepsilon B = e^{\ln(I+\varepsilon B)} = e^{\varepsilon B + O(\varepsilon^2)}.
\]
Raising this equation to the $[t/\varepsilon]$th power gives
\[
(I + \varepsilon B)^{[t/\varepsilon]} = e^{[t/\varepsilon]\varepsilon B + [t/\varepsilon]O(\varepsilon^2)} = e^{tB + O(\varepsilon)} \to e^{tB} \text{ as } \varepsilon \downarrow 0,
\]
wherein we have used
\[
\Big[\frac{t}{\varepsilon}\Big]\varepsilon = \Big(\frac{t}{\varepsilon} + O(1)\Big)\varepsilon = t + O(\varepsilon)
\]
for the last equality.
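For a finite state space, Lemma 15.13 can be checked directly with a matrix exponential routine. The sketch below (ours) uses scipy.linalg.expm and a $3\times 3$ matrix as an illustrative $B.$

```python
import numpy as np
from scipy.linalg import expm

# Check of Lemma 15.13: (I + eps*B)^[t/eps] -> e^{tB} as eps -> 0.
B = np.array([[-3.0, 1.0, 2.0], [4.0, -6.0, 2.0], [7.0, 1.0, -8.0]])
t, I = 1.5, np.eye(3)

for eps in [0.1, 0.01, 0.001]:
    approx = np.linalg.matrix_power(I + eps * B, int(t / eps))
    print(eps, np.max(np.abs(approx - expm(t * B))))   # error is O(eps)
```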

Theorem 15.14 (Continuous time MC as a scaling limit). Let $M < \infty$ and $a : S \to (0,M)$ be a given function, $D_{x,y} := a_x\delta_{x,y},$ and let $Q$ be a Markov matrix such that $Q_{xx} = 0$ for all $x \in S,$ so that $Q$ generates a non-lazy chain $\{Y_n\}_{n=0}^\infty.$ For $\varepsilon > 0$ let
\[
P^{(\varepsilon)} := (I - \varepsilon D) + \varepsilon DQ, \quad \text{i.e.} \quad P^{(\varepsilon)}_{xy} = (1 - \varepsilon a_x)\delta_{xy} + \varepsilon a_x Q_{xy}.
\]
Let $\big\{X^{(\varepsilon)}_t = X^{(\varepsilon)}_{[t/\varepsilon]}\big\}_{t\ge 0}$ where $\{X^{(\varepsilon)}_n\}_{n=0}^\infty$ is the Markov chain generated by $P^{(\varepsilon)}.$² Then:

1. $X^{(\varepsilon)} \implies X$ where $X$ is a "continuous time" Markov process with "Markov transition semi-group"
\[
P_t = e^{tA} \text{ where } A := -D + DQ.
\]
[See the proof for the meaning of the assertion that $X^{(\varepsilon)} \implies X.$]
2. $E_x[f(X_t)|\mathcal{F}_s] = (P_{t-s}f)(X_s)$ for all $t > s$ and $f : S \to \mathbb{R}$ bounded or positive, where $f$ is viewed as a column vector, i.e.
\[
(P_t f)(x) := \sum_{y\in S} P_t(x,y) f(y) \quad \forall\ x \in S.
\]
3. We can construct the processes $X^{(\varepsilon)}$ in such a way that $\lim_{\varepsilon\downarrow 0} X^{(\varepsilon)}_t = X_t,$ $P_x$-a.s. for all $x \in S.$
4. The trajectories of the $X$ process may be described as follows:
a) the jump locations are determined by the non-lazy chain $\{Y_n\}.$
b) Given $Y_0 = x_0, \dots, Y_n = x_n$ with $x_j \in S,$ let $\{S_j\}_{j=0}^n$ be independent exponential random variables with $S_j \overset{d}{=} E(a_{x_j}).$ The trajectory of $X$ for $0 \le t \le \sum_{j=0}^n S_j$ is determined by holding the chain at $x_j$ for time $S_j$ for $0 \le j \le n,$ see Figure 15.5 below.

² For $\varepsilon > 0$ small, the lazy chain $\{X^{(\varepsilon)}_n\}_{n=0}^\infty$ is in fact very lazy and only leaves a site $x \in S$ at any given step with probability $\varepsilon a_x \ll 1.$ On the other hand, $\{X^{(\varepsilon)}_t\}_{t\ge 0}$ is basically the chain $\{X^{(\varepsilon)}_n\}_{n=0}^\infty$ forced to take a step every $\varepsilon$ seconds (i.e. $1/\varepsilon$ steps per second). Thus as we make the chain lazier and lazier we simultaneously make it take more steps per second, in such a way that the chain leaves site $x \in S$ at rate $a_x.$

Proof. Since $P^{(\varepsilon)} \to I$ as $\varepsilon \downarrow 0,$ $[P^{(\varepsilon)}]^{n+1} \cong [P^{(\varepsilon)}]^n$ when $\varepsilon$ is small. Moreover, by Lemma 15.13³ we find for any $t > 0$ that
\[
\lim_{\varepsilon\downarrow 0}\big[P^{(\varepsilon)}\big]^{[t/\varepsilon]} = \lim_{\varepsilon\downarrow 0}\big[I + \varepsilon(-D + DQ)\big]^{[t/\varepsilon]} = e^{t(-D+DQ)} = P_t.
\]
We now go through the proof, using these limiting results without further comment.

1. Let $0 = t_0 < t_1 < t_2 < \dots < t_n < \infty.$ Then
\[
P_x\big(X^{(\varepsilon)}_{t_1} = x_1, \dots, X^{(\varepsilon)}_{t_n} = x_n\big) = P_x\big(X^{(\varepsilon)}_{[t_1/\varepsilon]} = x_1, \dots, X^{(\varepsilon)}_{[t_n/\varepsilon]} = x_n\big) \cong \big[P^{(\varepsilon)}\big]^{[t_1/\varepsilon]}_{x,x_1} \cdots \big[P^{(\varepsilon)}\big]^{[(t_n - t_{n-1})/\varepsilon]}_{x_{n-1},x_n}
\]
and hence
\[
\lim_{\varepsilon\downarrow 0} P_x\big(X^{(\varepsilon)}_{t_1} = x_1, \dots, X^{(\varepsilon)}_{t_n} = x_n\big) = P_{t_1}(x,x_1)\, P_{t_2-t_1}(x_1,x_2) \cdots P_{t_n-t_{n-1}}(x_{n-1},x_n).
\]
The latter expression is, by definition, $P_x(X_{t_1} = x_1, \dots, X_{t_n} = x_n).$

2. The Markov property for $X_t$ follows rather directly from the formula in the proof of item 1. above. Intuitively, for $\varepsilon > 0$ we have by the Markov property for $X^{(\varepsilon)}$ that
\[
E_x\big[f\big(X^{(\varepsilon)}_t\big)\big|\mathcal{F}^{(\varepsilon)}_s\big] \cong \Big(\big[P^{(\varepsilon)}\big]^{[(t-s)/\varepsilon]} f\Big)\big(X^{(\varepsilon)}_s\big). \tag{15.3}
\]
Passing to the limit in Eq. (15.3) "shows"
\[
E_x[f(X_t)|\mathcal{F}_s] = (P_{t-s}f)(X_s).
\]

³ The boundedness of the rates $\{a_x\}_{x\in S}$ implies $\sup_{x\in S}\sum_{y\in S}|A_{xy}| < \infty.$

3. Let us now construct $X^{(\varepsilon)}$ so that $\lim_{\varepsilon\downarrow 0} X^{(\varepsilon)}$ is a.s. convergent. We begin by letting $\{Y_n\}_{n=0}^\infty$ be the non-lazy chain associated to $Q$ and then letting $\{U_n\}_{n=0}^\infty$ be i.i.d. random variables uniformly distributed on $(0,1]$ which are independent of the chain $\{Y_n\}_{n=0}^\infty.$ Then let $S^\varepsilon_n := \varepsilon\sigma^\varepsilon_n = \varepsilon G_{\varepsilon a_{Y_n}}(U_n)$ be the $n$th holding time, $J^{(\varepsilon)}_0 = 0,$ and for $k \ge 1$ let $J^{(\varepsilon)}_k := \sum_{0\le n<k} S^\varepsilon_n$ be the time of the $k$th jump, and therefore set⁴
\[
X^{(\varepsilon)}_t = Y_n \text{ when } J^{(\varepsilon)}_n \le t < J^{(\varepsilon)}_{n+1}.
\]
From Exercise 15.3 we may conclude that $\{X^{(\varepsilon)}_t\}_{t\ge 0}$ has the same distribution as the chain described in the beginning of the statement of the theorem. Making use of Proposition 15.7, we will have $\lim_{\varepsilon\downarrow 0} X^{(\varepsilon)}_t = X_t$ a.s., where now
\begin{align}
S_n &:= -\frac{1}{a_{Y_n}}\ln U_n, \quad J_n = \sum_{0\le k<n} S_k, \text{ and} \tag{15.4} \\
X_t &= Y_n \text{ when } J_n \le t < J_{n+1}. \tag{15.5}
\end{align}
4. Item 4. now follows directly from what we have done in item 3. Indeed, given $Y_0 = x_0, \dots, Y_n = x_n$ with $x_j \in S,$ the holding times $\{S_k = -\frac{1}{a_{x_k}}\ln U_k\}_{k=0}^n$ are independent random times with $S_k \overset{d}{=} E(a_{x_k}).$ Moreover, by Theorem 15.22 below, $\lim_{n\to\infty} J_n = \infty$ a.s. and therefore Eq. (15.5) makes sense almost surely.

⁴ Since $J^{(\varepsilon)}_n = \varepsilon j^{(\varepsilon)}_n$ we have $J^{(\varepsilon)}_n \le t < J^{(\varepsilon)}_{n+1}$ iff $\varepsilon j^{(\varepsilon)}_n \le t < \varepsilon j^{(\varepsilon)}_{n+1}$ iff $j^{(\varepsilon)}_n \le \frac{t}{\varepsilon} < j^{(\varepsilon)}_{n+1}$ iff $j^{(\varepsilon)}_n \le [\frac{t}{\varepsilon}] < j^{(\varepsilon)}_{n+1},$ and therefore $X^{(\varepsilon)}_t = X^{(\varepsilon)}_{[t/\varepsilon]}.$


Notation 15.15 We refer to Eqs. (15.4) and (15.5) as the jump-hold construction of the process $\{X_t\}_{t\ge 0}.$

Definition 15.16 (Infinitesimal Generator). Let $S$ be a finite or countable set and $A = \{A_{x,y} \in \mathbb{R}\}_{x,y\in S}$ be a matrix indexed by $S.$ We say $A$ is an infinitesimal Markov generator if $A_{x,y} \ge 0$ for all $x \ne y$ and $\sum_{y\in S} A_{x,y} = 0$ for all $x \in S.$

Example 15.17. Let $Q$ and $D$ be as in Theorem 15.14. Then $A := -D + DQ$ satisfies
\[
A_{x,y} = a_x Q_{xy} \ge 0 \text{ if } x \ne y, \quad A_{x,x} = -a_x < 0, \quad \text{and} \quad \sum_{y\in S} A_{xy} = 0,
\]
i.e. $A$ is an infinitesimal generator. Conversely, if we are given a matrix $A$ with the above properties, then we can define $a_x := -A_{x,x}$ and
\[
Q_{xy} := 1_{x\ne y} \cdot \frac{1}{a_x} A_{x,y}.
\]
We will then have that $Q$ is the Markov matrix of a non-lazy chain such that $A = -D + DQ$ where $D = \operatorname{diag}(\{a_x\}_{x\in S}).$

More generally if ax = 0 for some x we make the following definition.

Definition 15.18. Given a Markov-semi-group infinitesimal generator $A$ on $S,$ let $a_x := -A_{x,x}$ for $x \in S$ and let $Q$ be the Markov matrix indexed by $S$ defined by
\[
Q_{x,y} := 1_{a_x\ne 0} \cdot 1_{x\ne y}\frac{A_{x,y}}{a_x} + 1_{a_x=0} \cdot 1_{x=y}.
\]

Example 15.19. If, with rows and columns indexed by $S = \{1,2,3\},$
\[
A = \begin{pmatrix} -3 & 1 & 2 \\ 0 & 0 & 0 \\ 2 & 7 & -9 \end{pmatrix},
\]
then
\[
D = \begin{pmatrix} 3 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 9 \end{pmatrix} \quad \text{and} \quad Q = \begin{pmatrix} 0 & \frac13 & \frac23 \\ 0 & 1 & 0 \\ \frac29 & \frac79 & 0 \end{pmatrix}.
\]
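The passage from $A$ to $(\{a_x\}, Q)$ in Definition 15.18 is mechanical; here is a small Python sketch (ours) that reproduces the decomposition of Example 15.19.

```python
import numpy as np

# Split a generator A into the rates a_x = -A_{x,x} and the non-lazy
# jump matrix Q of Definition 15.18.
def decompose(A):
    a = -np.diag(A).copy()
    Q = np.zeros_like(A)
    for x in range(A.shape[0]):
        if a[x] != 0:
            Q[x] = A[x] / a[x]
            Q[x, x] = 0.0          # off-diagonal jump probabilities only
        else:
            Q[x, x] = 1.0          # absorbing state: Q(x, x) = 1
    return a, Q

A = np.array([[-3.0, 1.0, 2.0], [0.0, 0.0, 0.0], [2.0, 7.0, -9.0]])
a, Q = decompose(A)
print(a)   # [3. 0. 9.]
print(Q)   # rows [0, 1/3, 2/3], [0, 1, 0], [2/9, 7/9, 0]
```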

Let us summarize the key points of the above results in the following theorem.

Theorem 15.20. Let $S$ be a finite or countable set, $A$ be an infinitesimal Markov generator on $S,$ let $\{a_x\}_{x\in S}$ and $Q$ be defined as in Definition 15.18, and assume $\sup_{x\in S} a_x < \infty$ so that
\[
P_t := e^{tA} = \sum_{n=0}^\infty \frac{t^n}{n!}A^n \text{ exists.}
\]
Further let:

1. $\{Y_n\}_{n=0}^\infty$ be the non-lazy Markov chain associated to $Q.$
2. $\{U_n\}_{n=0}^\infty$ be i.i.d. random variables uniformly distributed on $(0,1]$ which are independent of the chain $\{Y_n\}.$
3. $S_n := -\frac{1}{a_{Y_n}}\ln U_n \overset{d}{=} E(a_{Y_n}),$ where $S_n \equiv \infty$ if $a_{Y_n} = 0,$ and then define $J_0 = 0$ and $J_n := S_0 + S_1 + \dots + S_{n-1}$ for $n \ge 1.$
4. Finally let
\[
X_t := \sum_{n=0}^\infty Y_n 1_{J_n \le t < J_{n+1}} \in S \text{ for } t \in \mathbb{R}_+. \tag{15.6}
\]

Then for all $x_0, x_1, \dots, x_n \in S$ and $0 < t_1 < t_2 < \dots < t_n$ (see Figure 15.6 below),
\[
P_{x_0}(X_{t_1} = x_1, \dots, X_{t_n} = x_n) = P_{t_1}(x_0,x_1)\, P_{t_2-t_1}(x_1,x_2) \cdots P_{t_n-t_{n-1}}(x_{n-1},x_n).
\]

Remark 15.21. The expression in Eq. (15.6) is shorthand for writing: $X_t = Y_n$ where $n$ is the unique non-negative integer such that $t \in [J_n, J_{n+1}).$ If $a_{Y_{n-1}} > 0$ while $a_{Y_n} = 0,$ then $J_n < \infty$ while $J_{n+1} = \infty$ and we will have $X_t = Y_n$ for all $t \ge J_n.$
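The jump-hold construction of Theorem 15.20 translates almost line by line into a simulation. Below is a sketch (ours; the three-state rates and jump matrix are illustrative choices) that samples one trajectory up to a horizon $t_{\max}.$

```python
import numpy as np

# Jump-hold simulation per Theorem 15.20: sample the embedded chain Y_n
# from Q and exponential holding times S_n ~ E(a_{Y_n}).
def sample_path(a, Q, x0, t_max, rng):
    t, x, path = 0.0, x0, [(0.0, x0)]
    while True:
        if a[x] == 0:                        # absorbing: stay forever
            return path
        t += rng.exponential(1.0 / a[x])     # S_n = -ln(U)/a_x ~ E(a_x)
        if t >= t_max:
            return path
        x = rng.choice(len(a), p=Q[x])       # next site from the non-lazy chain
        path.append((t, x))

rng = np.random.default_rng(3)
a = np.array([3.0, 6.0, 8.0])
Q = np.array([[0, 1/3, 2/3], [2/3, 0, 1/3], [7/8, 1/8, 0]])
print(sample_path(a, Q, 0, 1.0, rng))        # [(jump time, new state), ...]
```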

15.3 Sums of Exponentials

From Theorem 15.14, we see that the trajectories of the Markov chain described there were defined for $0 \le t < \sum_{j=0}^\infty S_j,$ and so we could run into trouble if $\sum_{j=0}^\infty S_j < \infty$ with positive probability. As we will see in the next theorem, we avoided this problem by assuming that the jump rates $\{a_x\}_{x\in S}$ of our chain were bounded.

Theorem 15.22. Let $\{T_j\}_{j=1}^\infty$ be independent random variables such that $T_j \overset{d}{=} E(\lambda_j)$ with $0 < \lambda_j < \infty$ for all $j.$ Then:

1. If $\sum_{n=1}^\infty \lambda_n^{-1} < \infty$ then $P(\sum_{n=1}^\infty T_n = \infty) = 0$ (i.e. $P(\sum_{n=1}^\infty T_n < \infty) = 1$).
2. If $\sum_{n=1}^\infty \lambda_n^{-1} = \infty$ then $P(\sum_{n=1}^\infty T_n = \infty) = 1.$

In summary, there are precisely two possibilities; 1) $P(\sum_{n=1}^\infty T_n = \infty) = 0$ or 2) $P(\sum_{n=1}^\infty T_n = \infty) = 1,$ and moreover
\[
P\Big(\sum_{n=1}^\infty T_n = \infty\Big) = \begin{cases} 0 & \text{iff } E[\sum_{n=1}^\infty T_n] = \sum_{n=1}^\infty \lambda_n^{-1} < \infty \\ 1 & \text{iff } E[\sum_{n=1}^\infty T_n] = \sum_{n=1}^\infty \lambda_n^{-1} = \infty. \end{cases}
\]

Proof. 1. Since
\[
E\Big[\sum_{n=1}^\infty T_n\Big] = \sum_{n=1}^\infty E[T_n] = \sum_{n=1}^\infty \lambda_n^{-1} < \infty,
\]
it follows that $\sum_{n=1}^\infty T_n < \infty$ a.s., i.e. $P(\sum_{n=1}^\infty T_n = \infty) = 0.$

2. First observe that for $\alpha < \lambda,$ if $T \overset{d}{=} E(\lambda),$ then
\[
E\big[e^{\alpha T}\big] = \int_0^\infty e^{\alpha\tau}\lambda e^{-\lambda\tau}\, d\tau = \frac{\lambda}{\lambda-\alpha} = \frac{1}{1 - \alpha\lambda^{-1}}. \tag{15.7}
\]
By the DCT, independence, and Eq. (15.7) with $\alpha = -1,$ we find
\[
E\big[e^{-\sum_{n=1}^\infty T_n}\big] = \lim_{N\to\infty} E\big[e^{-\sum_{n=1}^N T_n}\big] = \lim_{N\to\infty} \prod_{n=1}^N E\big[e^{-T_n}\big] = \lim_{N\to\infty} \prod_{n=1}^N \frac{1}{1+\lambda_n^{-1}} = \prod_{n=1}^\infty (1 - a_n)
\]
where
\[
a_n = 1 - \frac{1}{1+\lambda_n^{-1}} = \frac{1}{1+\lambda_n}.
\]
Hence by Exercise 15.4 below, $E[e^{-\sum_{n=1}^\infty T_n}] = 0$ iff $\sum_{n=1}^\infty a_n = \infty,$ which happens iff $\sum_{n=1}^\infty \lambda_n^{-1} = \infty,$ as you should verify. This completes the proof since $E[e^{-\sum_{n=1}^\infty T_n}] = 0$ iff $e^{-\sum_{n=1}^\infty T_n} = 0$ a.s., or equivalently $\sum_{n=1}^\infty T_n = \infty$ a.s.

Lemma 15.23. If $0 \le x \le \frac12,$ then
\[
e^{-2x} \le 1 - x \le e^{-x}. \tag{15.8}
\]
Moreover, the upper bound in Eq. (15.8) is valid for all $x \in \mathbb{R}.$

Proof. The upper bound follows from the convexity of $e^{-x},$ see Figure 15.3. For the lower bound we use the convexity of $\varphi(x) = e^{-2x}$ to conclude that the line joining $(0,1) = (0,\varphi(0))$ and $(1/2, e^{-1}) = (1/2, \varphi(1/2))$ lies above $\varphi(x)$ for $0 \le x \le 1/2.$ Then we use the fact that the line $1-x$ lies above this line to conclude the lower bound in Eq. (15.8), see Figure 15.4.

Fig. 15.3. A graph of $1-x$ and $e^{-x}$ showing that $1-x \le e^{-x}$ for all $x.$

Fig. 15.4. A graph of $1-x$ (in red), the line joining $(0,1)$ and $(1/2, e^{-1})$ (in green), $e^{-x}$ (in purple), and $e^{-2x}$ (in black), showing that $e^{-2x} \le 1-x \le e^{-x}$ for all $x \in [0, 1/2].$

For $\{a_n\}_{n=1}^\infty \subset [0,1],$ let
\[
\prod_{n=1}^\infty (1-a_n) := \lim_{N\to\infty}\prod_{n=1}^N (1-a_n).
\]
The limit exists since $\prod_{n=1}^N (1-a_n)$ decreases as $N$ increases.

Exercise 15.4. Show: if $\{a_n\}_{n=1}^\infty \subset [0,1),$ then
\[
\prod_{n=1}^\infty (1-a_n) = 0 \iff \sum_{n=1}^\infty a_n = \infty.
\]
The implication $\Longleftarrow$ holds even if $a_n = 1$ is allowed.


15.4 Continuous Time Markov Chains (formalities)

Let $S$ be a countable or finite state space, but now let us replace discrete time, $\mathbb{N}_0,$ by its continuous cousin, $T := [0,\infty).$ We now formalize the notion of a continuous time homogeneous Markov chain. In more detail, we will assume that $\{X_t\}_{t\ge 0}$ is a stochastic process whose sample paths are right continuous, see Figure 15.5. (These processes need not have left hand limits if there are an infinite number of jumps in a finite time interval. For the most part we will assume that this does not happen almost surely.) The fact that time is continuous adds some mathematical technicalities which will typically be suppressed in these notes unless it is too dangerous to do so.

Fig. 15.5. A typical sample path for a continuous time Markov chain with $S = \{1,2,3,4\}.$ The picture indicates the jump times $J_0, \dots, J_5$ and sojourn times $S_0, \dots, S_5$ for this random graph.

Notation 15.24 (Jump and Sojourn times) Let $J_0 = 0$ and then define $J_n$ inductively by
\[
J_{n+1} := \inf\{t > J_n : X_t \ne X_{J_n}\}
\]
so that $J_n$ is the time of the $n$th jump of the process $\{X_t\}.$ We further define the sojourn times by
\[
S_n := J_{n+1} - J_n
\]
so that $S_n$ is the time spent at the $n$th site which has been visited by the chain. (This notation conflicts with Karlin and Taylor, who label the sojourn times by the states in $S.$)

In this continuous time setting, if $\{X_t\}_{t\ge 0}$ is a collection of functions on $\Omega$ with values in $S,$ we let $\mathcal{F}^X_t := \sigma(X_s : s \le t)$ be the continuous time filtration determined by $\{X_t\}_{t\ge 0}.$ As usual, a function $f$ on $\Omega$ is $\mathcal{F}_t$-measurable iff $f = F(\{X_s\}_{s\le t})$ is a function of the path $s \to X_s$ restricted to $[0,t].$ As in the discrete time Markov chain setting, for each $x \in S$ we will write $P_x(A) := P(A|X_0 = x).$ That is, $P_x$ is the probability associated to the scenario where the chain is forced to start at site $x.$ We now define, for $x, y \in S,$
\[
P_t(x,y) := P_x(X_t = y), \tag{15.9}
\]
which is the probability of finding the chain at site $y$ at time $t$ given the chain starts at $x.$

Definition 15.25. The time homogeneous Markov property states: for every $0 \le s < t < \infty$ and bounded function $f : S \to \mathbb{R}$ we require
\[
E_x[f(X_t)|\mathcal{F}_s] = E_x[f(X_t)|\sigma(X_s)] = (P_{t-s}f)(X_s),
\]
where, as before,
\[
(P_t f)(x) = \sum_{y\in S} P_t(x,y) f(y).
\]

A more down to earth statement of this property is that for any choices of $0 = t_0 < t_1 < \dots < t_n = s < t$ and $x_1, \dots, x_n \in S,$
\[
P_x(X_t = y \mid X_{t_1} = x_1, \dots, X_{t_n} = x_n) = P_{t-s}(x_n, y), \tag{15.10}
\]
whenever $P_x(X_{t_1} = x_1, \dots, X_{t_n} = x_n) > 0.$ In particular, if $P(X_s = z) > 0$ then
\[
P_x(X_t = y \mid X_s = z) = P_{t-s}(z, y). \tag{15.11}
\]
Roughly speaking, the Markov property may be stated as follows: the probability that $X_t = y$ given knowledge of the process up to time $s$ is $P_{t-s}(X_s, y).$ In symbols we might express this last sentence as
\[
P_x\big(X_t = y \,\big|\, \{X_\tau\}_{\tau\le s}\big) = P_x(X_t = y \mid X_s) = P_{t-s}(X_s, y).
\]
So again, a continuous time Markov process is forgetful in the sense that what the chain does for $t \ge s$ depends only on where the chain is located at time $s,$ namely $X_s,$ and not on how it got there. See Fact 15.41 below for a more general statement of this property. The next theorem gives our first basic description of a continuous time Markov process on a discrete state space in terms of "finite dimensional distributions," see Figure 15.6.


Fig. 15.6. In this figure we fix three times, $t_1, t_2,$ and $t_3,$ at which we are going to probe the process $\{X_t\}_{t\ge 0}.$ In general, this can be done for any finite number of times, and this is what we mean by finite dimensional distributions.

Theorem 15.26 (Finite dimensional distributions). Let $0 < t_1 < t_2 < \dots < t_n$ and $x_0, x_1, x_2, \dots, x_n \in S.$ Then
\[
P_{x_0}(X_{t_1} = x_1, X_{t_2} = x_2, \dots, X_{t_n} = x_n) = P_{t_1}(x_0,x_1)\, P_{t_2-t_1}(x_1,x_2) \cdots P_{t_n-t_{n-1}}(x_{n-1},x_n). \tag{15.12}
\]

Proof. The proof is similar to those of Theorem 5.11 and Proposition 5.13. For notational simplicity let us suppose that $n = 3.$ We then have
\begin{align*}
P_{x_0}(X_{t_1} = x_1, X_{t_2} = x_2, X_{t_3} = x_3) &= P_{x_0}(X_{t_3} = x_3 \mid X_{t_1} = x_1, X_{t_2} = x_2)\, P_{x_0}(X_{t_1} = x_1, X_{t_2} = x_2) \\
&= P_{t_3-t_2}(x_2,x_3)\, P_{x_0}(X_{t_1} = x_1, X_{t_2} = x_2) \\
&= P_{t_3-t_2}(x_2,x_3)\, P_{x_0}(X_{t_2} = x_2 \mid X_{t_1} = x_1)\, P_{x_0}(X_{t_1} = x_1) \\
&= P_{t_3-t_2}(x_2,x_3)\, P_{t_2-t_1}(x_1,x_2)\, P_{t_1}(x_0,x_1),
\end{align*}
wherein we have used the Markov property in the second and fourth equalities.

Proposition 15.27 (Properties of P). Let $P_t(x,y) := P_x(X_t = y)$ be as above. Then:

1. For each $t \ge 0,$ $P_t$ is a Markov matrix, i.e.
\[
\sum_{y\in S} P_t(x,y) = 1 \text{ for all } x \in S \quad \text{and} \quad P_t(x,y) \ge 0 \text{ for all } x, y \in S.
\]
2. $\lim_{t\downarrow 0} P_t(x,y) = \delta_{xy}$ for all $x, y \in S.$
3. The Chapman-Kolmogorov equation holds:
\[
P_{t+s} = P_s P_t \text{ for all } s, t \ge 0, \tag{15.13}
\]
i.e.
\[
P_{t+s}(x,z) = \sum_{y\in S} P_s(x,y)\, P_t(y,z) \text{ for all } s, t \ge 0. \tag{15.14}
\]

We will call a family $\{P_t\}_{t\ge 0}$ satisfying items 1. – 3. a continuous time Markov semigroup.

Proof. Most of the assertions follow from the basic properties of conditional probabilities. The assumed right continuity of $X_t$ implies that $\lim_{t\downarrow 0} P_t = P_0 = I.$ From Equation (15.12) with $n = 2$ we learn that
\[
P_{t_2}(x_0,x_2) = \sum_{x_1\in S} P_{x_0}(X_{t_1} = x_1, X_{t_2} = x_2) = \sum_{x_1\in S} P_{t_1}(x_0,x_1)\, P_{t_2-t_1}(x_1,x_2) = [P_{t_1} P_{t_2-t_1}](x_0,x_2).
\]

Definition 15.28 (Infinitesimal Generator). The infinitesimal generator, $A,$ of a Markov semi-group $\{P_t\}_{t\ge 0}$ is the matrix
\[
A := \frac{d}{dt}\Big|_{0+} P_t, \tag{15.15}
\]
which we are assuming to exist here. [We will not go into the technicalities of this derivative in these notes.]

Assuming that the derivative in Eq. (15.15) exists, then
\[
\frac{d}{dt}P_t = AP_t \quad \text{(Kolmogorov's backward equation)} \quad \text{and} \quad \frac{d}{dt}P_t = P_t A \quad \text{(Kolmogorov's forward equation)}.
\]
Indeed,
\[
\frac{d}{dt}P_t = \frac{d}{ds}\Big|_0 P_{t+s} = \frac{d}{ds}\Big|_0 [P_s P_t] = AP_t \quad \text{and} \quad \frac{d}{dt}P_t = \frac{d}{ds}\Big|_0 P_{t+s} = \frac{d}{ds}\Big|_0 [P_t P_s] = P_t A.
\]

We also must have:

1. Since $P_t(x,y) \ge 0$ for all $t \ge 0$ and $x, y \in S,$ we have for $x \ne y$ that
\[
A_{x,y} = \lim_{t\downarrow 0}\frac{P_t(x,y) - P_0(x,y)}{t} = \lim_{t\downarrow 0}\frac{P_t(x,y)}{t} \ge 0.
\]
2. Since $P_t\mathbf{1} = \mathbf{1}$ we also have
\[
0 = \frac{d}{dt}\Big|_{0+}\mathbf{1} = \frac{d}{dt}\Big|_{0+}P_t\mathbf{1} = A\mathbf{1},
\]
i.e. $A\mathbf{1} = 0$ and thus
\[
a_x := \sum_{y\ne x} A_{x,y} = -A_{x,x} \ge 0.
\]

Example 15.29. Suppose that $S = \{1,2,\dots,n\}$ and $P_t$ is a Markov-semi-group with infinitesimal generator $A,$ so that $\frac{d}{dt}P_t = AP_t = P_t A.$ By assumption $P_t(x,y) \ge 0$ for all $x, y \in S$ and $\sum_{y=1}^n P_t(x,y) = 1$ for all $x \in S.$ We may write this last condition as $P_t\mathbf{1} = \mathbf{1}$ for all $t \ge 0,$ where $\mathbf{1}$ denotes the vector in $\mathbb{R}^n$ with all entries equal to 1. Differentiating $P_t\mathbf{1} = \mathbf{1}$ at $t = 0$ shows that $A\mathbf{1} = 0,$ i.e. $\sum_{y=1}^n A_{x,y} = 0$ for all $x \in S.$ Since
\[
A_{x,y} = \lim_{t\downarrow 0}\frac{P_t(x,y) - \delta_{xy}}{t},
\]
if $x \ne y$ we will have
\[
A_{x,y} = \lim_{t\downarrow 0}\frac{P_t(x,y)}{t} \ge 0.
\]
Thus we have shown that the infinitesimal generator, $A,$ of $P_t$ must satisfy $A_{x,y} \ge 0$ for all $x \ne y$ and $\sum_{y=1}^n A_{x,y} = 0$ for all $x \in S.$ In words, $A$ is an $n\times n$ matrix with non-negative off diagonal entries whose row sums are all zero. You are asked to prove the converse in Exercise 15.5. So an explicit example of an infinitesimal generator when $S = \{1,2,3\}$ is
\[
A = \begin{pmatrix} -3 & 1 & 2 \\ 4 & -6 & 2 \\ 7 & 1 & -8 \end{pmatrix}.
\]
In this case my computer finds
\[
P_t = \begin{pmatrix}
\frac17 e^{-7t} + \frac15 e^{-10t} + \frac{23}{35} & \frac17 - \frac17 e^{-7t} & \frac15 - \frac15 e^{-10t} \\[2pt]
\frac15 e^{-10t} - \frac67 e^{-7t} + \frac{23}{35} & \frac67 e^{-7t} + \frac17 & \frac15 - \frac15 e^{-10t} \\[2pt]
\frac17 e^{-7t} - \frac45 e^{-10t} + \frac{23}{35} & \frac17 - \frac17 e^{-7t} & \frac45 e^{-10t} + \frac15
\end{pmatrix}.
\]
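A computation such as the one above is easy to reproduce; the following sketch (ours) computes $e^{tA}$ numerically and checks it against one entry of the displayed closed form.

```python
import numpy as np
from scipy.linalg import expm

# Check of Example 15.29: compute P_t = e^{tA} numerically and compare
# the (1,1) entry with the stated closed form.
A = np.array([[-3.0, 1.0, 2.0], [4.0, -6.0, 2.0], [7.0, 1.0, -8.0]])
t = 0.3
Pt = expm(t * A)
print(Pt.sum(axis=1))                                  # row sums are all 1
closed_11 = np.exp(-7 * t) / 7 + np.exp(-10 * t) / 5 + 23 / 35
print(Pt[0, 0], closed_11)                             # the two values agree
```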

Fact 15.30 Given a discrete state space $S,$ the collection of homogeneous Markov probabilities on trajectories $\{X_t\}_{t\ge 0} \subset S$ as in Figure 15.5 is in one to one correspondence with continuous time Markov semigroups $\{P_t\},$ which in turn are in one to one correspondence with Markov infinitesimal generators $A,$ where $A$ is a matrix indexed by $S$ with the properties stated after Definition 15.28. The correspondences are given (when no explosions occur) by
\[
A = \frac{d}{dt}\Big|_{0+}P_t, \quad P_t = e^{tA}, \quad P_t(x,y) := P_x(X_t = y),
\]
where the probabilities $P_x$ are determined by Theorem 15.26. In more detail,
\begin{align*}
\{X_t\}_{t\ge 0} &\to P_t(x,y) := P(X_{t+s} = y \mid X_s = x) \text{ for all } x, y \in S \text{ and } s, t \ge 0, \\
\{P_t(x,y)\} &\to \{X_t\}_{t\ge 0} \text{ with } P_x(X_{t_k} = x_k : 1 \le k \le n) = \prod_{k=1}^n P_{t_k - t_{k-1}}(x_{k-1}, x_k), \\
P_t &\to A := \frac{d}{dt}\Big|_{0+}P_t, \text{ and} \\
A &\to P_t := e^{tA} = \sum_{n=0}^\infty \frac{t^n}{n!}A^n.
\end{align*}

Proposition 15.31 (Poisson process generators). If $S = \mathbb{N}_0$ and
\[
A = \begin{pmatrix}
-\lambda & \lambda & 0 & 0 & \cdots \\
0 & -\lambda & \lambda & 0 & \cdots \\
0 & 0 & -\lambda & \lambda & \cdots \\
\vdots & & \ddots & \ddots & \ddots
\end{pmatrix} \tag{15.16}
\]
(rows and columns indexed by $\mathbb{N}_0$), then
\[
P_t = e^{tA} = e^{-t\lambda}\begin{pmatrix}
1 & \lambda t & \frac{(\lambda t)^2}{2!} & \frac{(\lambda t)^3}{3!} & \cdots \\
0 & 1 & \lambda t & \frac{(\lambda t)^2}{2!} & \cdots \\
0 & 0 & 1 & \lambda t & \cdots \\
\vdots & & \ddots & \ddots & \ddots
\end{pmatrix}. \tag{15.17}
\]
In other words,
\[
P_t(m,n) = 1_{n\ge m}\, e^{-t\lambda}\frac{(\lambda t)^{n-m}}{(n-m)!} = P_t(0, n-m)\, 1_{n\ge m}. \tag{15.18}
\]


We will summarize the matrix $A,$ and hence the Markov chain, via the rate diagram:
\[
0 \xrightarrow{\ \lambda\ } 1 \xrightarrow{\ \lambda\ } 2 \xrightarrow{\ \lambda\ } \cdots \xrightarrow{\ \lambda\ } (n-1) \xrightarrow{\ \lambda\ } n \xrightarrow{\ \lambda\ } \cdots.
\]

Proof. First proof. Let $e_0(m) = \delta_{0,m}$ and $\pi_t = e_0 P_t$ be the top row of $P_t.$ Then
\[
\frac{d}{dt}\pi_t(0) = -\lambda\pi_t(0) \quad \text{and} \quad \frac{d}{dt}\pi_t(n) = -\lambda\pi_t(n) + \lambda\pi_t(n-1).
\]
Thus if $w_t(n) := e^{\lambda t}\pi_t(n),$ we have, with the convention that $w_t(-1) = 0,$
\[
\frac{d}{dt}w_t(n) = \lambda w_t(n-1) \text{ for all } n, \text{ with } w_0(n) = \delta_{0,n}.
\]
The solution to this system of equations is $w_t(n) = \frac{(\lambda t)^n}{n!},$ as can be found by induction on $n.$ This justifies the first row of $P_t$ given in Eq. (15.17). The other rows are found similarly, or by using the above results and a little thought. The point is that the chain starting at $m \in \mathbb{N}_0$ behaves just as if it had started at $0$ except that we have to shift the origin to $m.$ To be more precise, we should expect that if $\{N_t\}_{t\ge 0}$ is the Poisson process started at $0$ then $N_t + m$ is the Poisson process started at $m,$ and hence
\[
P_t(m,n) = P_0(N_t + m = n) = P_0(N_t = n-m) = 1_{n\ge m}\, e^{-t\lambda}\frac{(\lambda t)^{n-m}}{(n-m)!}.
\]

Second proof. Suppose that $P_t$ is given as in Eq. (15.17), or equivalently Eq. (15.18). The row sums are clearly equal to one, and we can check directly that $P_t P_s = P_{s+t}.$ To verify this we have
\begin{align}
\sum_{k\in S} P_t(i,k)\, P_s(k,j) &= \sum_{k\in\mathbb{N}_0} 1_{k\ge i}\, e^{-\lambda t}\frac{(\lambda t)^{k-i}}{(k-i)!}\, 1_{j\ge k}\, e^{-\lambda s}\frac{(\lambda s)^{j-k}}{(j-k)!} \nonumber \\
&= 1_{i\le j}\, e^{-\lambda(t+s)}\sum_{i\le k\le j}\frac{(\lambda t)^{k-i}}{(k-i)!}\frac{(\lambda s)^{j-k}}{(j-k)!}. \tag{15.19}
\end{align}
Letting $k = i+m$ with $0 \le m \le j-i,$ the above sum may be written as
\[
\sum_{m=0}^{j-i}\frac{(\lambda t)^m}{m!}\frac{(\lambda s)^{j-i-m}}{(j-i-m)!} = \frac{1}{(j-i)!}\sum_{m=0}^{j-i}\binom{j-i}{m}(\lambda t)^m(\lambda s)^{j-i-m},
\]
and hence by the binomial formula we find
\[
\sum_{i\le k\le j}\frac{(\lambda t)^{k-i}}{(k-i)!}\frac{(\lambda s)^{j-k}}{(j-k)!} = \frac{1}{(j-i)!}(\lambda t + \lambda s)^{j-i}.
\]
Combining this with Eq. (15.19) shows that $\sum_{k\in S} P_t(i,k)\, P_s(k,j) = P_{s+t}(i,j)$ as desired. To finish the proof we now need only observe that $\frac{d}{dt}\big|_0 P_t = A,$ which follows from the facts that
\[
\frac{d}{dt}e^{-\lambda t}\Big|_{t=0} = -\lambda, \quad \frac{d}{dt}(\lambda t)e^{-\lambda t}\Big|_{t=0} = \lambda, \quad \text{and} \quad \frac{d}{dt}\frac{(\lambda t)^k}{k!}e^{-\lambda t}\Big|_{t=0} = 0 \text{ for all } k \ge 2.
\]

Definition 15.32. The continuous time Markov chain $\{N_t\}_{t\ge 0}$ associated to $A$ in Eq. (15.16), when started at $N_0 = 0,$ is called the Poisson process with intensity $\lambda.$

Definition 15.33. A stochastic process $\{X_t\}_{t\ge 0}$ is said to have independent increments if for any $n \in \mathbb{N}$ and $0 = t_0 < t_1 < \dots < t_n < \infty$ the random variables $\{X_{t_i} - X_{t_{i-1}}\}_{i=1}^n$ are independent.

Proposition 15.34. The Poisson process $\{N_t\}_{t\ge 0}$ has independent increments and moreover, for each $0 \le s \le t < \infty,$
\[
N_t - N_s \overset{d}{=} \mathrm{Pois}(\lambda(t-s)).
\]

Proof. Let us give the proof when $n = 3,$ as the general case is similar. Given $k_1, k_2, k_3 \in \mathbb{N}_0$ and $0 = t_0 < t_1 < t_2 < t_3 < \infty$ we find
\begin{align*}
P_0&(N_{t_1} - N_{t_0} = k_1,\ N_{t_2} - N_{t_1} = k_2,\ N_{t_3} - N_{t_2} = k_3) \\
&= P_0(N_{t_1} = k_1,\ N_{t_2} = k_1+k_2,\ N_{t_3} = k_1+k_2+k_3) \\
&= P_{t_1}(0, k_1)\, P_{t_2-t_1}(k_1, k_1+k_2)\, P_{t_3-t_2}(k_1+k_2, k_1+k_2+k_3) \\
&= P_{t_1}(0, k_1)\, P_{t_2-t_1}(0, k_2)\, P_{t_3-t_2}(0, k_3) \\
&= e^{-\lambda t_1}\frac{(\lambda t_1)^{k_1}}{k_1!}\, e^{-\lambda(t_2-t_1)}\frac{(\lambda(t_2-t_1))^{k_2}}{k_2!}\, e^{-\lambda(t_3-t_2)}\frac{(\lambda(t_3-t_2))^{k_3}}{k_3!},
\end{align*}
which suffices to verify all the assertions in the proposition.


Corollary 15.35. Let $\{S_n\}_{n=0}^\infty$ be i.i.d. $E(\lambda)$ random times, $J_0 = 0$ and $J_k = S_0 + S_1 + \dots + S_{k-1}$ for $k \ge 1.$ If we now define, for $0 \le t < \infty,$
\[
N_t := \#\{k \in \mathbb{N} : J_k \le t\},
\]
then $\{N_t\}_{t\ge 0}$ is the Poisson Markov process with parameter $\lambda,$ and in particular
\[
P(J_k \le t < J_{k+1}) = e^{-\lambda t}\frac{(\lambda t)^k}{k!}.
\]
[You are asked to verify these assertions more directly in Section 15.6 below.]

Proof. We simply need to observe that we have really just given the jump-hold description of the Poisson process. In more detail, for the Poisson process generator $A$ in Eq. (15.16), the associated non-lazy Markov matrix is
\[
Q = \begin{pmatrix}
0 & 1 & 0 & 0 & \cdots \\
0 & 0 & 1 & 0 & \cdots \\
0 & 0 & 0 & 1 & \cdots \\
\vdots & & \ddots & \ddots & \ddots
\end{pmatrix}.
\]
So, assuming we start the non-lazy chain $\{Y_n\}_{n=0}^\infty$ associated to $Q$ at $Y_0 = 0,$ we will simply have $Y_n = n$ for all $n.$ According to Theorem 15.14, the holding time, $S_n,$ at the site $Y_n = n$ is $S_n \overset{d}{=} E(\lambda),$ and moreover the $\{S_n\}_{n=0}^\infty$ are independent. Thus the jump-hold description of the sample paths of the Poisson process is:

1. remain at 0 for $S_0$ seconds, then jump to 1,
2. remain at 1 for $S_1$ seconds, then jump to 2,
3. remain at 2 for $S_2$ seconds, then jump to 3,
4. etc.

In terms of formulas we have
\[
N_t = \sum_{k=0}^\infty k \cdot 1_{J_k \le t < J_{k+1}} = \#\{k \in \mathbb{N} : J_k \le t\}
\]
and in particular
\[
P(J_k \le t < J_{k+1}) = P(N_t = k) = e^{-\lambda t}\frac{(\lambda t)^k}{k!}.
\]
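Corollary 15.35 can be tested by direct simulation; the sketch below (ours, with illustrative $\lambda$ and $t$) builds $N_t$ from exponential holding times and compares the empirical law of $N_t$ with the Poisson mass function.

```python
import numpy as np
from math import exp, factorial

# Build N_t from i.i.d. E(lam) holding times and compare P(N_t = k)
# with the Poisson(lam*t) mass function.
rng = np.random.default_rng(4)
lam, t, reps, kmax = 2.0, 1.5, 100_000, 6

S = rng.exponential(1.0 / lam, size=(reps, 40))   # S_0, S_1, ... per replica
J = np.cumsum(S, axis=1)                          # jump times J_1, J_2, ...
N = (J <= t).sum(axis=1)                          # N_t = #{k >= 1 : J_k <= t}

for k in range(kmax):
    pois = exp(-lam * t) * (lam * t) ** k / factorial(k)
    print(k, (N == k).mean(), pois)
```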

Exercise 15.5. Suppose that $S = \{1,2,\dots,n\}$ and $A$ is a matrix such that $A_{i,j} \ge 0$ for $i \ne j$ and $\sum_{j=1}^n A_{i,j} = 0$ for all $i.$ Show
\[
P_t = e^{tA} := \sum_{n=0}^\infty \frac{t^n}{n!}A^n \tag{15.20}
\]
is a time homogeneous Markov kernel.

Hints: 1. To show $P_t(i,j) \ge 0$ for all $t \ge 0$ and $i,j \in S,$ write $P_t = e^{-t\lambda}e^{t(\lambda I + A)}$ where $\lambda > 0$ is chosen so that $\lambda I + A$ has only non-negative entries. 2. To show $\sum_{j\in S} P_t(i,j) = 1,$ compute $\frac{d}{dt}P_t\mathbf{1}$ where $\mathbf{1}$ is the column vector of all 1's.

Theorem 15.36 (Feynman-Kac Formula). Continue the notation in Exercise 15.5 and let $(\Omega, \{\mathcal{F}_t\}_{t\ge 0}, P_x, \{X_t\}_{t\ge 0})$ be a time homogeneous Markov process (assumed to be right continuous) with transition kernels $\{P_t\}_{t\ge 0}.$ Given $V : S \to \mathbb{R},$ let $T_t := T^V_t$ be defined by
\[
(T_t g)(x) = E_x\Big[\exp\Big(\int_0^t V(X_s)\, ds\Big)\, g(X_t)\Big] \tag{15.21}
\]
for all $g : S \to \mathbb{R}.$ Then $T_t$ satisfies
\[
\frac{d}{dt}T_t = T_t(A + M_V) \text{ with } T_0 = I, \tag{15.22}
\]
where $M_V g := Vg$ for all $g : S \to \mathbb{R},$ i.e. $M_V$ is the diagonal matrix with $V(1), \dots, V(n)$ placed along the diagonal. We may summarize this result as
\[
E_x\Big[\exp\Big(\int_0^t V(X_s)\, ds\Big)\, g(X_t)\Big] = \big(e^{t(A+M_V)}g\big)(x). \tag{15.23}
\]

Proof. To see what is going on, let us first assume that $\frac{d}{dt}T_t g$ exists, in which case we may compute it as $\frac{d}{dt}(T_t g)(x) = \frac{d}{dh}\big|_{0+}(T_{t+h}g)(x).$ Then by the chain rule, the fundamental theorem of calculus, and the Markov property, we find
\begin{align*}
\frac{d}{dt}(T_t g)(x) &= \frac{d}{dh}\Big|_{0+}E_x\Big[\exp\Big(\int_0^{t+h} V(X_s)\, ds\Big)\, g(X_t)\Big] + \frac{d}{dh}\Big|_{0+}E_x\Big[\exp\Big(\int_0^t V(X_s)\, ds\Big)\, g(X_{t+h})\Big] \\
&= E_x\Big[\frac{d}{dh}\Big|_{0+}\exp\Big(\int_0^{t+h} V(X_s)\, ds\Big)\, g(X_t)\Big] + \frac{d}{dh}\Big|_{0+}E_x\Big[\exp\Big(\int_0^t V(X_s)\, ds\Big)\big(e^{hA}g\big)(X_t)\Big] \\
&= E_x\Big[\exp\Big(\int_0^t V(X_s)\, ds\Big)\, V(X_t)\, g(X_t)\Big] + E_x\Big[\exp\Big(\int_0^t V(X_s)\, ds\Big)(Ag)(X_t)\Big] \\
&= T_t(Vg)(x) + T_t(Ag)(x),
\end{align*}

which gives Eq. (15.22). [It should be clear that $(T_0 g)(x) = g(x).$]

We now give a rigorous proof. For $0 \le \tau \le t < \infty,$ let $Z_{\tau,t} := \exp\big(\int_\tau^t V(X_s)\, ds\big)$ and let $Z_t := Z_{0,t}.$ For $h > 0,$
\begin{align*}
(T_{t+h}g)(x) = E_x[Z_{t+h}\, g(X_{t+h})] &= E_x[Z_t Z_{t,t+h}\, g(X_{t+h})] \\
&= E_x[Z_t\, g(X_{t+h})] + E_x[Z_t(Z_{t,t+h} - 1)\, g(X_{t+h})] \\
&= E_x\big[Z_t\big(e^{hA}g\big)(X_t)\big] + E_x[Z_t(Z_{t,t+h} - 1)\, g(X_{t+h})].
\end{align*}
Therefore
\[
\frac{(T_{t+h}g)(x) - (T_t g)(x)}{h} = E_x\Big[Z_t\, \frac{(e^{hA}g)(X_t) - g(X_t)}{h}\Big] + E_x\Big[Z_t\Big(\frac{Z_{t,t+h} - 1}{h}\Big)\, g(X_{t+h})\Big],
\]
and then letting $h \downarrow 0$ in this equation implies
\[
\frac{d}{dh}\Big|_{0+}(T_{t+h}g)(x) = E_x[Z_t(Ag)(X_t)] + E_x[Z_t V(X_t)\, g(X_t)].
\]
This shows that $T_t$ is one-sided differentiable and this one-sided derivative is given as in Eq. (15.22).

On the other hand, for $s, t > 0,$ using the Markov property,
\[
T_{t+s}g(x) = E_x[Z_{t+s}\, g(X_{t+s})] = E_x[Z_t Z_{t,t+s}\, g(X_{t+s})] = E_x[Z_t\, E_{\mathcal{F}_t}(Z_{t,t+s}\, g(X_{t+s}))] = E_x[Z_t\, E_{X_t}(Z_{0,s}\, g(X_s))] = (T_t T_s g)(x),
\]
i.e. $\{T_t\}_{t>0}$ still has the semi-group property. So for $h > 0,$
\[
T_{t-h} - T_t = T_{t-h} - T_{t-h}T_h = T_{t-h}(I - T_h)
\]
and hence
\[
\frac{T_t - T_{t-h}}{h} = T_{t-h}\,\frac{T_h - I}{h} \to T_t(A + M_V) \text{ as } h \downarrow 0,
\]
using that $T_t$ is continuous in $t$ and the result we have already proved. This shows $T_t$ is differentiable in $t$ and Eq. (15.22) is valid.
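Eq. (15.23) lends itself to a Monte Carlo check: simulate the chain by the jump-hold construction, accumulate $\int_0^t V(X_s)\,ds$ along the path, and compare with $e^{t(A+M_V)}g.$ The sketch below (ours; the choices of $A,$ $V,$ and $g$ are illustrative) does exactly this.

```python
import numpy as np
from scipy.linalg import expm

# Monte Carlo check of the Feynman-Kac formula, Eq. (15.23).
rng = np.random.default_rng(5)
A = np.array([[-3.0, 1.0, 2.0], [4.0, -6.0, 2.0], [7.0, 1.0, -8.0]])
V = np.array([0.5, -1.0, 0.2])
g = np.array([1.0, 2.0, 3.0])
x0, t, reps = 0, 1.0, 20_000

a = -np.diag(A)                        # jump rates a_x
Q = A / a[:, None]                     # non-lazy jump matrix per Def. 15.18
np.fill_diagonal(Q, 0.0)

acc = 0.0
for _ in range(reps):
    s, x, integral = 0.0, x0, 0.0
    while True:
        hold = rng.exponential(1.0 / a[x])
        if s + hold >= t:
            integral += V[x] * (t - s)       # final partial sojourn up to t
            break
        integral += V[x] * hold
        s += hold
        x = rng.choice(3, p=Q[x])            # jump to the next site
    acc += np.exp(integral) * g[x]

print(acc / reps)                            # Monte Carlo value
print((expm(t * (A + np.diag(V))) @ g)[x0])  # Feynman-Kac matrix value
```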

15.5 Jump Hold Description II

In this section, we are going to start with the Markov semigroup (or equivalently the generator, $A$) and show how to recover the jump-hold description of the associated Markov process, see Corollary 15.42 below. The validity of this argument is a consequence of the strong Markov property (see Fact 15.41) along with Theorem 15.39 below.

Definition 15.37. Given a Markov semigroup infinitesimal generator, $A$, on $S$, let $Q$ be the matrix indexed by $S$ defined by
$$ Q_{x,y} := \begin{cases} A_{x,y}/a_x & \text{if } x \ne y \text{ and } a_x \ne 0, \\ 0 & \text{if } x = y \text{ and } a_x \ne 0, \\ 0 & \text{if } x \ne y \text{ and } a_x = 0, \\ 1 & \text{if } x = y \text{ and } a_x = 0. \end{cases} \tag{15.24}$$
Recall that
$$ a_x := -A_{x,x} = \sum_{y : y \ne x} A_{x,y}.$$

Notice that $Q$ is itself a Markov matrix for a non-lazy chain provided $a_x \ne 0$ for all $x \in S$. We will typically denote the associated chain by $\{Y_n\}_{n=0}^\infty$. Perhaps it is worth remarking that if $x \in S$ is a point where $a_x = 0$, then $\pi_t(y) := e^{tA}(x,y) = [\delta_x e^{tA}](y)$ satisfies
$$ \dot{\pi}_t(y) = [\delta_x A e^{tA}](y) = \sum_{z \in S} A(x,z)\, e^{tA}(z,y) = 0,$$
so that $\pi_t(y) = \pi_0(y) = \delta_x(y)$. This shows that $e^{tA}(x,y) = \delta_x(y)$ for all $t \ge 0$, and so if the chain starts at $x$ then it stays at $x$ when $a_x = 0$, which explains the last two rows in Eq. (15.24).


Example 15.38. If (rows and columns indexed by the states $1, 2, 3$)
$$ A = \begin{pmatrix} -3 & 1 & 2 \\ 0 & 0 & 0 \\ 2 & 7 & -9 \end{pmatrix}, \quad\text{then}\quad Q = \begin{pmatrix} 0 & \tfrac{1}{3} & \tfrac{2}{3} \\ 0 & 1 & 0 \\ \tfrac{2}{9} & \tfrac{7}{9} & 0 \end{pmatrix}. $$
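A small hypothetical helper (the function name is ours, not the notes') implementing Eq. (15.24) reproduces the $Q$ of Example 15.38:

```python
import numpy as np

def jump_matrix(A: np.ndarray) -> np.ndarray:
    """Build the jump (embedded) Markov matrix Q of Eq. (15.24) from a generator A."""
    Q = np.zeros_like(A, dtype=float)
    for x in range(A.shape[0]):
        a_x = -A[x, x]
        if a_x != 0:
            Q[x] = A[x] / a_x   # off-diagonal entries of row x scaled by 1/a_x
            Q[x, x] = 0.0
        else:
            Q[x, x] = 1.0       # absorbing state: stay put
    return Q

A = np.array([[-3.0, 1.0, 2.0],
              [ 0.0, 0.0, 0.0],
              [ 2.0, 7.0, -9.0]])
print(jump_matrix(A))  # rows: [0, 1/3, 2/3], [0, 1, 0], [2/9, 7/9, 0]
```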

Theorem 15.39 (Jump distributions). Suppose that $A$ is a Markov infinitesimal generator and $x \in S$ is a non-absorbing state ($a_x > 0$). Then for all $y \ne x$ and $t > 0$,
$$ P_x(X_{J_1} = y;\ J_1 > t) = e^{-t a_x}\,\frac{A_{x,y}}{a_x}. \tag{15.25}$$
In other words, given $X_0 = x$, $J_1 \stackrel{d}{=} E(a_x)$ and $X_{J_1}$ (the first jump location) is independent of $J_1$ with
$$ P_x(X_{J_1} = y) = \frac{A_{x,y}}{a_x}.$$

Proof. (We give a somewhat informal proof here; the reader should consult the references for a fully rigorous proof.) Let $J_1 = \inf\{\tau > 0 : X_\tau \ne X_0\}$. Referring to Figure 15.5, we then have,⁵ with $\mathcal{P}_n = \left\{\frac{k}{n}\tau : 1 \le k \le n\right\}$, that
$$\begin{aligned}
P_x(J_1 \in [\tau, \tau + d\tau],\ X_{J_1} = y) &= P_x(J_1 \in [\tau, \tau + d\tau],\ X_{\tau + d\tau} = y) + o(d\tau) \\
&= \lim_{n\to\infty} P_x\left(\{X_s = x\}_{s \in \mathcal{P}_n},\ X_{\tau + d\tau} = y\right) + o(d\tau) \\
&= \lim_{n\to\infty} \left[P_{\tau/n}(x,x)\right]^n \cdot P_{d\tau}(x,y) + o(d\tau) \\
&= \lim_{n\to\infty} \left[1 - \frac{\tau}{n}\,a_x + o\left(\frac{1}{n}\right)\right]^n \cdot A_{x,y}\,d\tau + o(d\tau) \\
&= e^{-\tau a_x} A_{x,y}\,d\tau + o(d\tau).
\end{aligned}$$
Integrating this equation over $\tau > t$ then shows
$$ P_x(J_1 > t,\ X_{J_1} = y) = \int_t^\infty e^{-\tau a_x} A_{x,y}\,d\tau = \frac{A_{x,y}}{a_x}\, e^{-t a_x}.$$
As a check, taking $t = 0$ above shows
$$ P_x(X_{J_1} = y) = \frac{A_{x,y}}{a_x},$$
and summing this equation on $y \in S \setminus \{x\}$ then gives the required identity,
$$ \sum_{y \ne x} P_x(X_{J_1} = y) = \sum_{y \ne x} \frac{A_{x,y}}{a_x} = 1.$$

⁵ Although we do not know where in $[\tau, \tau + d\tau]$ the jump time $J_1$ is, we do expect with relatively high likelihood (relative to $d\tau$) that there will be at most one jump in this infinitesimally small interval, and hence we should have $X_{\tau + d\tau} = X_{J_1}$ when $J_1 \in [\tau, \tau + d\tau]$.

Definition 15.40 (Informal). A stopping time, $T$, for $\{X_t\}$, is a random variable with the property that the event $\{T \le t\}$ is determined from the knowledge of $\{X_s : 0 \le s \le t\}$. Alternatively put, for each $t \ge 0$ there is a functional, $f_t$, such that
$$ 1_{T \le t} = f_t(\{X_s : 0 \le s \le t\}).$$
As in the discrete state space setting, the first time the chain hits some subset of states, $A \subset S$, is a typical example of a stopping time, whereas the last time the chain hits a set $A \subset S$ is typically not a stopping time. Similarly to the discrete time setting, the Markov property leads to a strong form of forgetfulness of the chain. This property is again called the strong Markov property, which we take for granted here.

Fact 15.41 (Strong Markov Property) If $\{X_t\}_{t\ge 0}$ is a Markov chain, $T$ is a stopping time, and $j \in S$, then, conditioned on $\{T < \infty$ and $X_T = j\}$,
$$ \{X_s : 0 \le s \le T\} \text{ and } \{X_{t+T} : t \ge 0\} \text{ are independent}$$
and $\{X_{t+T} : t \ge 0\}$ has the same distribution as $\{X_t\}_{t\ge 0}$ under $P_j$.

Using Theorem 15.39 along with the strong Markov property leads to the following jump-hold description of the continuous time Markov chain, $\{X_t\}_{t\ge 0}$, associated to $A$. The reader should compare this next corollary with its discrete time analogue in Exercise 15.3.

Corollary 15.42. Let $A$ be the infinitesimal generator of a Markov semigroup $P_t$. Then the Markov chain, $\{X_t\}$, associated to $P_t$ may be described as follows. Let $\{Y_k\}_{k=0}^\infty$ denote the discrete time Markov chain with Markov matrix $Q$ as in Eq. (15.24). Let $\{S_j\}_{j=0}^\infty$ be random times such that, given $\{Y_j = x_j : j \le n\}$, $S_j \stackrel{d}{=} E(a_{x_j})$ and the $\{S_j\}_{j=0}^n$ are independent for $0 \le j \le n$.⁶ Now let $N_t = \max\{j : S_0 + \cdots + S_{j-1} \le t\}$ (see Figure 15.5) and $X_t := Y_{N_t}$. Then $\{X_t\}_{t\ge 0}$ is the Markov process starting at $x$ with Markov semigroup $P_t = e^{tA}$.

Put another way, if $\{T_n\}_{n=0}^\infty$ are i.i.d. exponential random times with $T_n \stackrel{d}{=} E(1)$ for all $n$, then
$$ \{X_{J_n}, S_n\}_{n=0}^\infty \stackrel{d}{=} \left\{Y_n, \frac{T_n}{a_{Y_n}}\right\}_{n=0}^\infty \tag{15.26}$$
provided $Y_0 \stackrel{d}{=} X_0$.

⁶ A concrete way to choose the $\{S_j\}_{j=0}^\infty$ is as follows. Given a sequence, $\{T_j\}_{j=0}^\infty$, of i.i.d. $E(1)$ random variables which are independent of $\{Y_n\}_{n=0}^\infty$, define $S_j := T_j/a_{Y_j}$.

Proof. First proof. Assuming the process $X$ starts at some $x \in S$, then: it stays at $x$ for an $E(a_x)$ amount of time, $S_0$, then jumps to $x_1$ with probability $Q_{x,x_1}$; it stays at $x_1$ for an $E(a_{x_1})$ amount of time, $S_1$, independent of $S_0$, and then jumps to $x_2$ with probability $Q_{x_1,x_2}$; it stays at $x_2$ for an $E(a_{x_2})$ amount of time, $S_2$, independent of $S_0$ and $S_1$, and then jumps to $x_3$ with probability $Q_{x_2,x_3}$; etc., where one should interpret $T \stackrel{d}{=} E(0)$ to mean that $T = \infty$.

Second proof. The argument in the proof of Theorem 15.39 easily generalizes to multiple jump times. For example, for $0 \le \tau < u < \infty$ and $y, z \in S$ we find
$$\begin{aligned}
&P_x(J_1 \in [\tau, \tau+d\tau],\ J_2 \in [u, u+du],\ X_{J_1} = y,\ X_{J_2} = z) \\
&\cong P_x(J_1 \in [\tau, \tau+d\tau],\ J_2 \in [u, u+du],\ X_{\tau+d\tau} = y,\ X_{u+du} = z) \\
&\cong \lim_{n\to\infty} P_x\left(\left\{X_{\frac{k}{n}\tau} = x,\ X_{\tau+d\tau+\frac{k}{n}(u-(\tau+d\tau))} = y\right\}_{1\le k\le n},\ X_{\tau+d\tau} = y,\ X_{u+du} = z\right) \\
&\cong \lim_{n\to\infty} \left[P_{\tau/n}(x,x)\right]^n \cdot P_{d\tau}(x,y) \cdot \left[P_{\frac{u-\tau}{n}}(y,y)\right]^n P_{du}(y,z) \\
&\cong \lim_{n\to\infty} \left(1 - \frac{\tau}{n}\,a_x + o\left(\frac{1}{n}\right)\right)^n \left(1 - \frac{u-\tau}{n}\,a_y + o\left(\frac{1}{n}\right)\right)^n \cdot P_{d\tau}(x,y)\, P_{du}(y,z) \\
&= e^{-\tau a_x} e^{-(u-\tau)a_y} A_{x,y}\,d\tau\, A_{y,z}\,du + o((d\tau, du)),
\end{aligned}$$
from which we conclude that
$$\begin{aligned}
\mathbb{E}_x[F(J_1, J_2, X_0, X_{J_1}, X_{J_2})] &= \sum_{y : y\ne x,\ z : z \ne y} \int_0^\infty d\tau \int_\tau^\infty du\ F(\tau, u, x, y, z)\, e^{-\tau a_x} e^{-(u-\tau)a_y} A_{x,y} A_{y,z} \\
&= \sum_{y : y\ne x,\ z : z \ne y} \int_0^\infty d\tau \int_\tau^\infty du\ F(\tau, u, x, y, z)\, a_x e^{-\tau a_x}\, a_y e^{-(u-\tau)a_y}\, \frac{A_{x,y}}{a_x}\, \frac{A_{y,z}}{a_y} \\
&= \sum_{y : y\ne x,\ z : z \ne y} \int_0^\infty d\tau \int_0^\infty dv\ F(\tau, v+\tau, x, y, z)\, a_x e^{-\tau a_x}\, a_y e^{-v a_y}\, \frac{A_{x,y}}{a_x}\, \frac{A_{y,z}}{a_y},
\end{aligned}$$
where we should take $\frac{A_{x,y}}{a_x} = 0$ if $a_x = 0$. If $T_1$ and $T_2$ are independent exponential random variables with parameter 1 and $\{Y_n\}$ is the Markov chain with transition matrix $Q$ as above, then
$$ \mathbb{E}_x[F(J_1, J_2, X_0, X_{J_1}, X_{J_2})] = \mathbb{E}_x\left[F\left(\frac{T_1}{a_{Y_0}},\ \frac{T_1}{a_{Y_0}} + \frac{T_2}{a_{Y_1}},\ Y_0, Y_1, Y_2\right)\right].$$
In other words,
$$ (J_1, J_2, X_0, X_{J_1}, X_{J_2}) \stackrel{d}{=} \left(\frac{T_1}{a_{Y_0}},\ \frac{T_1}{a_{Y_0}} + \frac{T_2}{a_{Y_1}},\ Y_0, Y_1, Y_2\right).$$
This has a clear generalization to any number of times.

The proof of Eq. (15.26) is that, on one hand,
$$ P\left(\{X_{J_n} = x_n,\ S_n > s_n\}_{n=0}^N\right) = P(X_0 = x_0)\, e^{-a_{x_0} s_0} \prod_{n=1}^N \left[Q(x_{n-1}, x_n)\, e^{-a_{x_n} s_n}\right],$$
while on the other hand,
$$\begin{aligned}
P\left(\{Y_n = x_n,\ T_n/a_{Y_n} > s_n\}_{n=0}^N\right) &= P\left(\{Y_n = x_n,\ T_n > a_{x_n} s_n\}_{n=0}^N\right) \\
&= P\left(\{Y_n = x_n\}_{n=0}^N\right) \cdot P\left(\{T_n > a_{x_n} s_n\}_{n=0}^N\right) \\
&= P(Y_0 = x_0) \prod_{n=1}^N Q(x_{n-1}, x_n) \cdot \prod_{n=0}^N e^{-a_{x_n} s_n}.
\end{aligned}$$
Thus we have shown
$$ P\left(\{X_{J_n} = x_n,\ S_n > s_n\}_{n=0}^N\right) = P\left(\{Y_n = x_n,\ T_n/a_{Y_n} > s_n\}_{n=0}^N\right)$$
provided $P(X_0 = x_0) = P(Y_0 = x_0)$.

It is possible to show that the description in Corollary 15.42 defines a Markov process with the correct semigroup, $P_t$. For the details the reader is referred to Norris [14, Theorems 2.8.2 and 2.8.4].

Here is one more interpretation of the jump-hold (or sojourn) description of our continuous time Markov chains. When we arrive at a site $x \in S$ we start a collection of independent exponential random clocks, $\{T_y\}_{y \in S\setminus\{x\}}$, with $T_y \stackrel{d}{=} E(A_{x,y})$, where we interpret $T_y = \infty$ if $A_{x,y} = 0$. If the first clock to ring is the $y$ clock, then we jump to $y$. It is fairly simple to check that $J = \min\{T_z : z \ne x\} \stackrel{d}{=} E(a_x)$ and that the probability the $y$ clock rings first is precisely $A_{x,y}/a_x$. At the next site in the chain, we restart the process all over again.
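The racing-clocks picture translates directly into a simulator; the sketch below is our own illustration (the names are made up), drawing one exponential clock per target state and jumping to the first one that rings:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_chain(A, x0, t_max):
    """Jump-hold simulation of the chain with generator A via racing exponential clocks."""
    jump_times, states = [0.0], [x0]
    t, x = 0.0, x0
    while True:
        rates = np.maximum(A[x], 0.0)    # clock rates A_{x,y} for y != x
        if rates.sum() == 0:             # a_x = 0: absorbing state, stay forever
            break
        clocks = np.full(rates.size, np.inf)
        for y in np.nonzero(rates)[0]:
            clocks[y] = rng.exponential(1.0 / rates[y])  # T_y = E(A_{x,y})
        y = int(np.argmin(clocks))       # first clock to ring wins
        t += clocks[y]                   # the minimum of the clocks is E(a_x)
        if t > t_max:
            break
        jump_times.append(t)
        states.append(y)
        x = y
    return jump_times, states

A = np.array([[-3.0, 1.0, 2.0],
              [ 1.0, -1.0, 0.0],
              [ 2.0, 7.0, -9.0]])
print(simulate_chain(A, 0, 2.0))
```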

15.6 Jump Hold Construction of the Poisson Process*

The goal of this section is to give a direct proof of Corollary 15.35. We will use the following notation here. Let $\{T_k\}_{k=1}^\infty$ be an i.i.d. sequence of random exponential times with parameter $\lambda$, i.e. $P(T_k \in [t, t+dt]) = \lambda e^{-\lambda t}\,dt$. Let $W_0 = 0$ and for each $n \in \mathbb{N}$ let $W_n := T_1 + \cdots + T_n$ be the "waiting time" for the $n$th event to occur. Because of Theorem 15.22 we know that $\lim_{n\to\infty} W_n = \infty$ a.s. [Sorry for the change of notation here! To match with the above section you should identify $T_k$ with $S_{k-1}$ and $W_n$ with $J_n$.]

Definition 15.43 (Poisson Process I). For any subset $A \subset \mathbb{R}_+$ let $N(A) := \sum_{n=1}^\infty 1_A(W_n)$ count the number of waiting times which occurred in $A$. When $A = (0,t]$ we will write $N_t := N((0,t])$ for all $t \ge 0$ and refer to $\{N_t\}_{t\ge 0}$ as the Poisson process with intensity $\lambda$.

The next few results summarize a number of the basic properties of this Poisson process. Many of the proofs will be left as exercises to the reader. We will use the following notation below: for each $n \in \mathbb{N}$ and $T \ge 0$ let
$$ \Delta_n(T) := \{(w_1, \dots, w_n) \in \mathbb{R}^n : 0 < w_1 < w_2 < \cdots < w_n < T\}$$
and let
$$ \Delta_n := \cup_{T>0}\, \Delta_n(T) = \{(w_1, \dots, w_n) \in \mathbb{R}^n : 0 < w_1 < w_2 < \cdots < w_n < \infty\}.$$
In the exercises below it will be very useful to observe that
$$ \{N_t = n\} = \{W_n \le t < W_{n+1}\}.$$
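A hypothetical few lines of Python (ours, not the notes') mirror Definition 15.43: accumulate exponential gaps into waiting times $W_n$ and count how many land in $(0,t]$:

```python
import numpy as np

rng = np.random.default_rng(2)
lam, t, n_trials = 2.5, 3.0, 50000

def sample_N_t():
    """Count waiting times W_n = T_1 + ... + T_n falling in (0, t]."""
    w, count = 0.0, 0
    while True:
        w += rng.exponential(1.0 / lam)   # T_k ~ E(lambda)
        if w > t:
            return count
        count += 1

samples = [sample_N_t() for _ in range(n_trials)]
print(np.mean(samples), "should be close to", lam * t)  # E[N_t] = lambda t (cf. Exercise 15.8)
```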

Exercise 15.6. For $0 \le s < t < \infty$, show by induction on $n$ that
$$ a_n(s,t) := \int_{s \le w_1 \le w_2 \le \cdots \le w_n \le t} dw_1 \dots dw_n = \frac{(t-s)^n}{n!}.$$

Exercise 15.7. If $n \in \mathbb{N}$ and $g : \Delta_n \to \mathbb{R}$ is a bounded (or non-negative) function, then
$$ \mathbb{E}[g(W_1, \dots, W_n)] = \int_{\Delta_n} g(w_1, w_2, \dots, w_n)\, \lambda^n e^{-\lambda w_n}\, dw_1 \dots dw_n. \tag{15.27}$$

As a simple corollary we have the following result, which explains the name given to the Poisson process.

Exercise 15.8. Show $N_t \stackrel{d}{=} \mathrm{Poi}(\lambda t)$ for all $t > 0$.

Exercise 15.9. Show for $0 \le s < t < \infty$ that $N_t - N_s \stackrel{d}{=} \mathrm{Poi}(\lambda(t-s))$ and that $N_s$ and $N_t - N_s$ are independent.

Hint: for $m, n \in \mathbb{N}_0$ show, using the few exercises above, that
$$ P_0(N_s = m,\ N_t = m+n) = e^{-\lambda s}\,\frac{(\lambda s)^m}{m!} \cdot e^{-\lambda(t-s)}\,\frac{(\lambda(t-s))^n}{n!}. \tag{15.28}$$
[This result easily generalizes to show $\{N_t\}_{t\ge 0}$ has independent increments, and using this one easily verifies that $\{N_t\}_{t\ge 0}$ is a Markov process with transition kernel given as in Eq. (15.18).]

[This result easily generalizes to show Ntt≥0 has independent increments andusing this one easily verifies that Ntt≥0 is a Markov process with transitionkernel given as in Eq. (15.18).]

15.6.1 More related results*

Corollary 15.44. If $n \in \mathbb{N}$, then $W_n \stackrel{d}{=} \mathrm{Gamma}(n, \lambda^{-1})$.

Proof. Taking $g(w_1, w_2, \dots, w_n) = f(w_n)$ in Eq. (15.27), we find with the aid of Exercise 15.6 that
$$ \mathbb{E}[f(W_n)] = \int_{\Delta_n} f(w_n)\, \lambda^n e^{-\lambda w_n}\, dw_1 \dots dw_n = \int_0^\infty f(w)\, \lambda^n\,\frac{w^{n-1}}{(n-1)!}\, e^{-\lambda w}\, dw,$$
which shows that $W_n \stackrel{d}{=} \mathrm{Gamma}(n, \lambda^{-1})$.

Corollary 15.45. If $t \in \mathbb{R}_+$ and $f : \Delta_n(t) \to \mathbb{R}$ is a bounded (or non-negative) measurable function, then
$$ \mathbb{E}[f(W_1, \dots, W_n) : N_t = n] = \lambda^n e^{-\lambda t} \int_{\Delta_n(t)} f(w_1, w_2, \dots, w_n)\, dw_1 \dots dw_n. \tag{15.29}$$

Proof. Making use of the observation that $\{N_t = n\} = \{W_n \le t < W_{n+1}\}$, we may apply Eq. (15.27) at level $n+1$ with
$$ g(w_1, w_2, \dots, w_{n+1}) = f(w_1, w_2, \dots, w_n)\, 1_{w_n \le t < w_{n+1}}$$
to learn
$$\begin{aligned}
\mathbb{E}[f(W_1, \dots, W_n) : N_t = n] &= \int_{0 < w_1 < \cdots < w_n < t < w_{n+1}} f(w_1, w_2, \dots, w_n)\, \lambda^{n+1} e^{-\lambda w_{n+1}}\, dw_1 \dots dw_n\, dw_{n+1} \\
&= \int_{\Delta_n(t)} f(w_1, w_2, \dots, w_n)\, \lambda^n e^{-\lambda t}\, dw_1 \dots dw_n.
\end{aligned}$$

Definition 15.46 (Order Statistics). Suppose that $X_1, \dots, X_n$ are non-negative random variables such that $P(X_i = X_j) = 0$ for all $i \ne j$. The order statistics of $X_1, \dots, X_n$ are the random variables $\tilde{X}_1, \tilde{X}_2, \dots, \tilde{X}_n$ defined by
$$ \tilde{X}_k = \min_{\#(\Lambda)=k} \max\{X_i : i \in \Lambda\}, \tag{15.30}$$
where $\Lambda$ always denotes a subset of $\{1, 2, \dots, n\}$ in Eq. (15.30).

The reader should verify that $\tilde{X}_1 \le \tilde{X}_2 \le \cdots \le \tilde{X}_n$, that $\{X_1, \dots, X_n\} = \{\tilde{X}_1, \tilde{X}_2, \dots, \tilde{X}_n\}$ with repetitions, and that $\tilde{X}_1 < \tilde{X}_2 < \cdots < \tilde{X}_n$ if $X_i \ne X_j$ for all $i \ne j$. In particular, if $P(X_i = X_j) = 0$ for all $i \ne j$ then $P(\cup_{i\ne j}\{X_i = X_j\}) = 0$ and $\tilde{X}_1 < \tilde{X}_2 < \cdots < \tilde{X}_n$ a.s.

Exercise 15.10. Suppose that $X_1, \dots, X_n$ are non-negative⁷ random variables such that $P(X_i = X_j) = 0$ for all $i \ne j$. Show:

1. If $f : \Delta_n \to \mathbb{R}$ is bounded (or non-negative) measurable, then
$$ \mathbb{E}\left[f(\tilde{X}_1, \dots, \tilde{X}_n)\right] = \sum_{\sigma \in S_n} \mathbb{E}[f(X_{\sigma 1}, \dots, X_{\sigma n}) : X_{\sigma 1} < X_{\sigma 2} < \cdots < X_{\sigma n}], \tag{15.31}$$
where $S_n$ is the permutation group on $\{1, 2, \dots, n\}$.
2. If we further assume that $X_1, \dots, X_n$ are i.i.d. random variables, then
$$ \mathbb{E}\left[f(\tilde{X}_1, \dots, \tilde{X}_n)\right] = n! \cdot \mathbb{E}[f(X_1, \dots, X_n) : X_1 < X_2 < \cdots < X_n]. \tag{15.32}$$
(It is not important that $f(\tilde{X}_1, \dots, \tilde{X}_n)$ is not defined on the null set $\cup_{i\ne j}\{X_i = X_j\}$.)
3. If $f : \mathbb{R}_+^n \to \mathbb{R}$ is a bounded (or non-negative) measurable symmetric function (i.e. $f(w_{\sigma 1}, \dots, w_{\sigma n}) = f(w_1, \dots, w_n)$ for all $\sigma \in S_n$ and $(w_1, \dots, w_n) \in \mathbb{R}_+^n$), then
$$ \mathbb{E}\left[f(\tilde{X}_1, \dots, \tilde{X}_n)\right] = \mathbb{E}[f(X_1, \dots, X_n)].$$
4. Suppose that $Y_1, \dots, Y_n$ is another collection of non-negative random variables with $P(Y_i = Y_j) = 0$ for all $i \ne j$ and such that
$$ \mathbb{E}[f(X_1, \dots, X_n)] = \mathbb{E}[f(Y_1, \dots, Y_n)]$$
for all bounded (or non-negative) measurable symmetric functions from $\mathbb{R}_+^n \to \mathbb{R}$. Show that
$$ (\tilde{X}_1, \dots, \tilde{X}_n) \stackrel{d}{=} (\tilde{Y}_1, \dots, \tilde{Y}_n).$$
Hint: if $g : \Delta_n \to \mathbb{R}$ is a bounded measurable function, define $f : \mathbb{R}_+^n \to \mathbb{R}$ by
$$ f(y_1, \dots, y_n) = \sum_{\sigma \in S_n} 1_{y_{\sigma 1} < y_{\sigma 2} < \cdots < y_{\sigma n}}\, g(y_{\sigma 1}, y_{\sigma 2}, \dots, y_{\sigma n})$$
and then show $f$ is symmetric.

⁷ The non-negativity of the $X_i$ is not really necessary here, but this is all we need to consider.

Exercise 15.11. Let $t \in \mathbb{R}_+$ and $\{U_i\}_{i=1}^n$ be i.i.d. uniformly distributed random variables on $[0,t]$. Show that the order statistics, $(\tilde{U}_1, \dots, \tilde{U}_n)$, of $(U_1, \dots, U_n)$ have the same distribution as $(W_1, \dots, W_n)$ given $\{N_t = n\}$. (Thus, given $N_t = n$, the collection of points $\{W_1, \dots, W_n\}$ has the same distribution as the collection of points $\{U_1, \dots, U_n\}$ in $[0,t]$.)
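The following hypothetical Monte Carlo comparison (ours, with made-up parameters) illustrates Exercise 15.11: the mean of the first arrival $W_1$ given $N_t = n$ should match the mean of the minimum of $n$ i.i.d. uniforms on $[0,t]$, namely $t/(n+1)$:

```python
import numpy as np

rng = np.random.default_rng(3)
lam, t, n = 1.0, 5.0, 4

# Sample W_1 conditioned on {N_t = n} by rejection sampling.
w1_samples = []
while len(w1_samples) < 20000:
    W = np.cumsum(rng.exponential(1.0 / lam, size=n + 1))
    if W[n - 1] <= t < W[n]:        # event {N_t = n} = {W_n <= t < W_{n+1}}
        w1_samples.append(W[0])

U = rng.uniform(0.0, t, size=(20000, n))
print(np.mean(w1_samples), U.min(axis=1).mean(), t / (n + 1))  # all approximately equal
```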

Theorem 15.47 (Joint Distributions). If $\{A_i\}_{i=1}^k \subset \mathcal{B}_{[0,t]}$ is a partition of $[0,t]$, then $\{N(A_i)\}_{i=1}^k$ are independent random variables and $N(A) \stackrel{d}{=} \mathrm{Poi}(\lambda m(A))$ for all $A \in \mathcal{B}_{[0,t]}$ with $m(A) < \infty$. In particular, if $0 < t_1 < t_2 < \cdots < t_n$, then $\{N_{t_i} - N_{t_{i-1}}\}_{i=1}^n$ are independent random variables and $N_t - N_s \stackrel{d}{=} \mathrm{Poi}(\lambda(t-s))$ for all $0 \le s < t < \infty$. (We say that $\{N_t\}_{t\ge 0}$ is a stochastic process with independent increments.)

Proof. If $z \in \mathbb{C}$ and $A \in \mathcal{B}_{[0,t]}$, then
$$ z^{N(A)} = z^{\sum_{i=1}^n 1_A(W_i)} \text{ on } \{N_t = n\}.$$
Let $n \in \mathbb{N}$, $z_i \in \mathbb{C}$, and define
$$ f(w_1, \dots, w_n) = z_1^{\sum_{i=1}^n 1_{A_1}(w_i)} \cdots z_k^{\sum_{i=1}^n 1_{A_k}(w_i)},$$
which is a symmetric function. On $\{N_t = n\}$ we have
$$ z_1^{N(A_1)} \cdots z_k^{N(A_k)} = f(W_1, \dots, W_n)$$
and therefore,
$$\begin{aligned}
\mathbb{E}\left[z_1^{N(A_1)} \cdots z_k^{N(A_k)} \,\big|\, N_t = n\right] &= \mathbb{E}[f(W_1, \dots, W_n) | N_t = n] = \mathbb{E}[f(U_1, \dots, U_n)] \\
&= \mathbb{E}\left[z_1^{\sum_{i=1}^n 1_{A_1}(U_i)} \cdots z_k^{\sum_{i=1}^n 1_{A_k}(U_i)}\right] = \prod_{i=1}^n \mathbb{E}\left[z_1^{1_{A_1}(U_i)} \cdots z_k^{1_{A_k}(U_i)}\right] \\
&= \left(\mathbb{E}\left[z_1^{1_{A_1}(U_1)} \cdots z_k^{1_{A_k}(U_1)}\right]\right)^n = \left(\frac{1}{t}\sum_{i=1}^k m(A_i)\, z_i\right)^n,
\end{aligned}$$
wherein we have made use of the fact that $\{A_i\}_{i=1}^k$ is a partition of $[0,t]$, so that
$$ z_1^{1_{A_1}(U_1)} \cdots z_k^{1_{A_k}(U_1)} = \sum_{i=1}^k z_i\, 1_{A_i}(U_1).$$
Thus it follows that
$$\begin{aligned}
\mathbb{E}\left[z_1^{N(A_1)} \cdots z_k^{N(A_k)}\right] &= \sum_{n=0}^\infty \mathbb{E}\left[z_1^{N(A_1)} \cdots z_k^{N(A_k)} \,\big|\, N_t = n\right] P(N_t = n) \\
&= \sum_{n=0}^\infty \left(\frac{1}{t}\sum_{i=1}^k m(A_i)\, z_i\right)^n \frac{(\lambda t)^n}{n!}\, e^{-\lambda t} = \sum_{n=0}^\infty \frac{1}{n!}\left(\lambda \sum_{i=1}^k m(A_i)\, z_i\right)^n e^{-\lambda t} \\
&= \exp\left(\lambda\left[\sum_{i=1}^k m(A_i)\, z_i - t\right]\right) = \exp\left(\lambda \sum_{i=1}^k m(A_i)(z_i - 1)\right).
\end{aligned}$$
From this result it follows that $\{N(A_i)\}_{i=1}^k$ are independent random variables and $N(A) \stackrel{d}{=} \mathrm{Poi}(\lambda m(A))$ for all $A \in \mathcal{B}_{\mathbb{R}}$ with $m(A) < \infty$.

Alternatively, suppose that $a_i \in \mathbb{N}_0$ and $n := a_1 + \cdots + a_k$. Then
$$\begin{aligned}
P[N(A_1) = a_1, \dots, N(A_k) = a_k \,|\, N_t = n] &= P\left[\sum_{i=1}^n 1_{A_l}(U_i) = a_l \text{ for } 1 \le l \le k\right] \\
&= \frac{n!}{a_1! \cdots a_k!} \prod_{l=1}^k \left[\frac{m(A_l)}{t}\right]^{a_l} = \frac{n!}{t^n} \cdot \prod_{l=1}^k \frac{[m(A_l)]^{a_l}}{a_l!}
\end{aligned}$$
and therefore,
$$\begin{aligned}
P[N(A_1) = a_1, \dots, N(A_k) = a_k] &= P[N(A_1) = a_1, \dots, N(A_k) = a_k \,|\, N_t = n] \cdot P(N_t = n) \\
&= \frac{n!}{t^n} \cdot \prod_{l=1}^k \frac{[m(A_l)]^{a_l}}{a_l!} \cdot e^{-\lambda t}\,\frac{(\lambda t)^n}{n!} = \prod_{l=1}^k \frac{[m(A_l)]^{a_l}}{a_l!} \cdot e^{-\lambda t}\,\lambda^n \\
&= \prod_{l=1}^k \frac{[\lambda m(A_l)]^{a_l}}{a_l!}\, e^{-\lambda m(A_l)},
\end{aligned}$$
which shows that $\{N(A_l)\}_{l=1}^k$ are independent and that $N(A_l) \stackrel{d}{=} \mathrm{Poi}(\lambda m(A_l))$ for each $l$.

Remark 15.48. If $A \in \mathcal{B}_{[0,\infty)}$ with $m(A) = \infty$, then $N(A) = \infty$ a.s. To prove this, observe that $N(A) = \uparrow \lim_{n\to\infty} N(A \cap [0,n])$. Therefore, for any $k \in \mathbb{N}$, we have
$$ P(N(A) \ge k) \ge P(N(A \cap [0,n]) \ge k) = 1 - e^{-\lambda m(A\cap[0,n])} \sum_{0\le l<k} \frac{(\lambda m(A\cap[0,n]))^l}{l!} \to 1 \text{ as } n \to \infty.$$
This shows that $N(A) \ge k$ a.s. for all $k \in \mathbb{N}$, i.e. $N(A) = \infty$ a.s.

15.7 First jump analysis

As usual, let $S$ be a finite or countable state space, $A$ be the infinitesimal generator of a continuous time Markov chain $\{X_t\}_{t\ge 0}$, and
$$ Q_{x,y} := \frac{A_{x,y}}{a_x}\, 1_{x\ne y}$$
be the Markov matrix for the underlying non-lazy Markov chain. We will also use the following notation.

Notation 15.49 Given a right continuous "jump" path $\{X_t\}_{t\ge 0}$ with values in $S$, $x \in S$, and $\tau \in (0,\infty)$, let $x 1_{[0,\tau)} * X$ be the right continuous jump path in $S$ defined by
$$ \left[x 1_{[0,\tau)} * X\right]_t := \begin{cases} x & \text{if } 0 \le t < \tau, \\ X_{t-\tau} & \text{if } t \ge \tau. \end{cases}$$
In words, we start at $x$ and stay there up to time $\tau$, then jump to $X_0$ at time $\tau$, and then follow the path $X$ thereafter; see Figure 15.7.

[Figure 15.7: This picture indicates the construction of the new path $3 \cdot 1_{[0,T)} * X$ where $X$ is the path in Figure 15.5.]

The next two results summarize the "first jump" analysis in this continuous time context.

Theorem 15.50 (First jump conditioning). If $X \to F(X)$ is a bounded or non-negative function of the paths $\{X_t\}_{t\ge 0}$ and $x, y \in S$ with $x \ne y$, then
$$ \mathbb{E}_x[F(X) \,|\, X_{S_0} = y] = \mathbb{E}_y\left[F\left(\left[x 1_{[0,T)} * X\right]\right)\right], \tag{15.33}$$
where $T$ is a random time independent of $X$ with $T \stackrel{d}{=} E(a_x)$. As usual,
$$ S_0 := J_1 = \inf\{t > 0 : X_t \ne X_0\}$$
denotes the first jump time of $X$.

Proof. By the jump-hold description of the chain $X$ starting at $x$ and conditioned on $X_{S_0} = y$, we know that $X_t$ stays at $x$ until time $S_0 \stackrel{d}{=} E(a_x)$, then jumps to $y$, and continues on from there independently of the holding time $S_0$. These simple remarks constitute the proof of Eq. (15.33).

The following corollary is now an immediate consequence of Theorem 15.50.

Corollary 15.51 (First jump analysis). If $X \to F(X)$ is a bounded or non-negative function of the paths $\{X_t\}_{t\ge 0}$, $x \in S$, and $T \stackrel{d}{=} E(a_x)$ is independent of $X$, then
$$ \mathbb{E}_x[F(X)] = \sum_{y\in S} Q_{x,y}\, \mathbb{E}_x[F(X) \,|\, X_{S_0} = y] = \sum_{y\in S} Q_{x,y}\, \mathbb{E}_y\left[F\left(\left[x 1_{[0,T)} * X\right]\right)\right]. \tag{15.34}$$
[Recall that if $X$ has no absorbing sites then $Q_{x,x} = 0$.]

Proof. Using Theorem 15.50 along with the identity $P_x(X_{S_0} = y) = Q_{x,y}$ shows
$$\begin{aligned}
\mathbb{E}_x[F(X)] &= \sum_{y\in S} \mathbb{E}_x[F(X) \,|\, X_{S_0} = y]\, P_x(X_{S_0} = y) \\
&= \sum_{y\in S} \mathbb{E}_x[F(X) \,|\, X_{S_0} = y]\, Q_{x,y} = \sum_{y\in S} Q_{x,y}\, \mathbb{E}_y\left[F\left(\left[x 1_{[0,T)} * X\right]\right)\right].
\end{aligned}$$

Example 15.52 (Expected Hitting Times). For example, let $B$ be a proper subset of $S$, $A := S \setminus B$, and
$$ w(x) := \mathbb{E}_x[H_B(X)]$$
where $H_B(X) = \inf\{t \ge 0 : X_t \in B\}$. By Theorem 15.50, with $T \stackrel{d}{=} E(a_x)$ independent of $X$, we have
$$ \mathbb{E}_x[H_B(X) \,|\, X_{J_1} = y] = \mathbb{E}_y\left[H_B\left(\left[x 1_{[0,T)} * X\right]\right)\right] = \mathbb{E}_y[T + H_B(X)] = \frac{1}{a_x} + \mathbb{E}_y[H_B(X)].$$
Therefore, by Corollary 15.51, for $x \in A$,
$$ \mathbb{E}_x[H_B(X)] = \sum_{y\in S}\left(\frac{1}{a_x} + \mathbb{E}_y[H_B(X)]\right) Q_{x,y} = \frac{1}{a_x} + \sum_{y\in A} Q_{x,y}\, \mathbb{E}_y[H_B(X)].$$
Solving this equation for $w := \{\mathbb{E}_x[H_B(X)]\}_{x\in A}$ gives (assuming $I - Q_{A\times A}$ is invertible)
$$ w = (I - Q_{A\times A})^{-1}\, \frac{1}{a}\Big|_A,$$
wherein we think of $w$ and $\frac{1}{a}\big|_A := \left\{\frac{1}{a_x}\right\}_{x\in A}$ as column vectors.
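Here is a hypothetical numerical instance of this formula (ours, using the made-up generator from earlier sketches): for $B = \{2\}$ (states 0-indexed), solve $w = (I - Q_{A\times A})^{-1}\,(1/a)|_A$ directly:

```python
import numpy as np

A_gen = np.array([[-3.0, 1.0, 2.0],
                  [ 1.0, -1.0, 0.0],
                  [ 2.0, 7.0, -9.0]])
a = -np.diag(A_gen)                 # holding rates a_x
Q = A_gen / a[:, None]
np.fill_diagonal(Q, 0.0)            # jump matrix Q_{x,y}

keep = [0, 1]                       # A = S \ B with B = {2}
QA = Q[np.ix_(keep, keep)]
w = np.linalg.solve(np.eye(len(keep)) - QA, 1.0 / a[keep])
print("E_x[H_B] for x in A:", w)    # here: [1.0, 2.0]
```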


Exercise 15.12 (Expected return times). Let $A$ be an infinitesimal generator of a continuous time Markov chain $\{X_t\}_{t\ge 0}$ on $S$ with $0 < a_x = -A_{x,x} \le K < \infty$ for all $x \in S$. Further, let
$$ R_y := \inf\{t \ge S_0 : X_t = y\}$$
be the first return time⁸ to $y$ on or after the first jump, and $m_{x,y} := \mathbb{E}_x R_y$. Use the first jump analysis to show
$$ m_{x,y} = \frac{1}{a_x} + \sum_{z\ne y} Q_{x,z}\, m_{z,y} = \frac{1}{a_x} + \sum_{z\notin\{x,y\}} \frac{A_{x,z}}{a_x}\, m_{z,y}, \tag{15.35}$$
where $Q_{x,z} := 1_{x\ne z}\, A_{x,z}/a_x$. [You may find it useful to observe: if $H_y := \inf\{t \ge 0 : X_t = y\}$ is the first hitting time of $y$, then $\mathbb{E}_x H_y = \mathbb{E}_x R_y$ if $x \ne y$ and $\mathbb{E}_y H_y = 0$.]

Definition 15.53 (Invariant distributions). A probability distribution $\pi : S \to [0,1]$ is said to be an invariant distribution for $A$ if $\pi A = 0$, i.e.
$$ \sum_{x\in S} \pi_x A_{x,y} = 0 \text{ for all } y \in S.$$

Remark 15.54. If $\pi$ is an invariant distribution for $A$ and $P_t = e^{tA}$, then $\frac{d}{dt}\pi P_t = \pi A P_t = 0$ and hence $\pi P_t = \pi P_0 = \pi$. Consequently, if $\{X_t\}_{t\ge 0}$ is the continuous time Markov chain associated to $A$, then $P_\pi(X_t = y) = \pi_y$ for all $t > 0$ and $y \in S$.

Corollary 15.55. Let $\{X_t\}_{t\ge 0}$ be a finite state irreducible Markov chain⁹ with generator $A = (A_{x,y})_{x,y\in S}$. If $\pi = (\pi_x)_{x\in S}$ is an invariant distribution for $A$, then
$$ \pi_x = \frac{1}{a_x m_{x,x}} = \frac{1}{a_x\, \mathbb{E}_x R_x} = \frac{\mathbb{E}_x S_0}{\mathbb{E}_x R_x}. \tag{15.36}$$
[Intuitively (and as one would anticipate), Eq. (15.36) indicates that $\pi_x$ is the relative fraction of the time the chain spends at site $x$.]

⁸ We require $t \ge S_0$ so that if the chain starts at $y$ it must first leave $y$ before we count being at $y$ again as a return to $y$.
⁹ In detail, we are assuming that $a_x > 0$ for all $x \in S$ and that $Q_{x,y} = 1_{x\ne y}\, A_{x,y}/a_x$ is an irreducible Markov matrix.

Proof. Since the chain is irreducible we will definitely return to $x$, and in fact one can show $m_{x,x} < \infty$; we omit this detail here. As $\pi$ is an invariant distribution for $A$, we are given
$$ \sum_{x : x\ne z} \pi_x A_{x,z} = -\pi_z A_{z,z} = \pi_z a_z \quad \forall\ z \in S.$$

Using this identity along with Eq. (15.35) allows us to conclude
$$\begin{aligned}
\sum_{x\in S} \pi_x a_x m_{x,y} &= \sum_{x\in S} \pi_x a_x \left(\frac{1}{a_x} + \sum_{z\ne y} Q_{x,z}\, m_{z,y}\right) \\
&= 1 + \sum_{x,z} \pi_x 1_{x\ne z,\, z\ne y}\, A_{x,z}\, m_{z,y} \\
&= 1 + \sum_z 1_{z\ne y}\, \pi_z a_z m_{z,y} = 1 + \sum_{x\in S} 1_{x\ne y}\, \pi_x a_x m_{x,y}.
\end{aligned}$$
Hence it follows that $\pi_y a_y m_{y,y} = 1$, which proves Eq. (15.36).

Example 15.56. Consider the two-state Markov chain on $S = \{0,1\}$ with rate diagram $0 \overset{\alpha}{\underset{\beta}{\rightleftarrows}} 1$ (jump rate $\alpha$ from 0 to 1 and $\beta$ from 1 to 0), and let $m_0 = \mathbb{E}_0 R_0$ and $m_1 = \mathbb{E}_1 R_0$. Then the first jump analysis gives
$$ \mathbb{E}_1 R_0 = m_1 = \mathbb{E}_1[R_0 \,|\, X_{S_0} = 0]\, P(X_{S_0} = 0) = \mathbb{E}_1[S_0] = \frac{1}{\beta}$$
and
$$ m_0 = \mathbb{E}_0[R_0 \,|\, X_{S_0} = 1]\, P(X_{S_0} = 1) = (\mathbb{E}_0[S_0] + \mathbb{E}_1[R_0]) \cdot 1 = \frac{1}{\alpha} + m_1.$$
Therefore $\mathbb{E}_0 R_0 = m_0 = \frac{1}{\alpha} + \frac{1}{\beta}$ and
$$ \pi_0 = \frac{\mathbb{E}_0 S_0}{\mathbb{E}_0 R_0} = \frac{1}{\alpha}\cdot\frac{1}{\frac{1}{\alpha} + \frac{1}{\beta}} = \frac{1}{1 + \alpha/\beta},$$
and by symmetry,
$$ \pi_1 = \frac{1}{1 + \beta/\alpha}.$$
Observe that
$$ \pi_0 + \pi_1 = \frac{1}{1+\alpha/\beta} + \frac{1}{1+\beta/\alpha} = \frac{1 + \beta/\alpha + 1 + \alpha/\beta}{(1+\alpha/\beta)(1+\beta/\alpha)} = 1,$$
and using
$$ -\alpha\,\frac{1}{1+\alpha/\beta} + \beta\,\frac{1}{1+\beta/\alpha} \propto -\alpha(1+\beta/\alpha) + \beta(1+\alpha/\beta) = -\alpha - \beta + \beta + \alpha = 0$$
implies
$$ \pi A = \begin{bmatrix} \frac{1}{1+\alpha/\beta} & \frac{1}{1+\beta/\alpha} \end{bmatrix}\begin{bmatrix} -\alpha & \alpha \\ \beta & -\beta \end{bmatrix} = \begin{bmatrix} 0 & 0 \end{bmatrix},$$
as we have shown should happen in general.
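A quick hypothetical check of Example 15.56 in code (parameter values made up): build the two-state generator, verify $\pi A = 0$, and note that the rows of $e^{tA}$ converge to $\pi$ for large $t$:

```python
import numpy as np
from scipy.linalg import expm

alpha, beta = 2.0, 5.0
A = np.array([[-alpha, alpha],
              [ beta, -beta]])
pi = np.array([1.0 / (1.0 + alpha / beta), 1.0 / (1.0 + beta / alpha)])

print(pi @ A)             # approximately [0, 0]: pi is invariant
print(expm(50.0 * A)[0])  # rows of e^{tA} converge to pi
```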


15.8 Long time behavior

In this section, suppose that $\{X_t\}_{t\ge 0}$ is a continuous time Markov chain with infinitesimal generator $A$, so that
$$ P(X_{t+h} = y \,|\, X_t = x) = \delta_{x,y} + A_{x,y}\, h + o(h).$$
We further assume that $A$ completely determines the chain and that $A$ has no absorbing states, i.e. $a_x := -A_{x,x} > 0$ for all $x \in S$.

Definition 15.57. The chain $\{X_t\}$ is irreducible iff the underlying discrete time jump chain $\{Y_n\}$, determined by the Markov matrix $Q_{x,y} := \frac{A_{x,y}}{a_x}\, 1_{x\ne y}$, is irreducible, where as usual $a_x := -A_{x,x} = \sum_{y : y\ne x} A_{x,y}$. Put more directly, $\{X_t\}$ is irreducible iff for all $x, y \in S$ with $x \ne y$ there exist $x_0, x_1, \dots, x_n \in S$ such that $x_0 = x$, $x_n = y$, $x_{i+1} \ne x_i$, and $A_{x_i, x_{i+1}} > 0$ for all $0 \le i < n$.

Lemma 15.58. Let $A$ be the generator of a continuous time Markov chain $\{X_t\}_{t\ge 0}$ and $P_t = e^{tA}$. Then the following are equivalent:

1. $\{X_t\}$ is irreducible;
2. for all $x, y \in S$, $P_t(x,y) > 0$ for some $t > 0$;
3. $P_t(x,y) > 0$ for all $t > 0$ and $x, y \in S$.

[In particular, all irreducible chains are "aperiodic."]

Proof. (1 $\implies$ 3) Let $\{Y_n\}$ be the discrete time Markov chain associated to $Q$. If $Q$ is irreducible, $t > 0$, and $x, y \in S$ with $y \ne x$, let $x_0, x_1, \dots, x_n \in S$ be as in Definition 15.57 and $\{T_j\}_{j=0}^n$ be independent random times such that $T_j \stackrel{d}{=} E(a_{x_j})$. Using the jump-hold description of $\{X_t\}_{t\ge 0}$ we then have
$$ P_t(x,y) = P_x(X_t = y) \ge P_x(Y_n = y,\ J_n \le t < J_{n+1}) \ge \prod_{i=0}^{n-1} Q_{x_i,x_{i+1}} \cdot P(T_0 + \cdots + T_{n-1} < t < T_0 + \cdots + T_n) > 0.$$
(3 $\implies$ 2) This is obvious.
(2 $\implies$ 1) If for all $x, y \in S$, $P_t(x,y) > 0$ for some $t > 0$, then the embedded chain $\{Y_n\}$ has a positive probability of hitting $y$ when started at $x$. As $x$ and $y$ were arbitrary, it follows that $\{Y_n\}$ must be irreducible and hence $\{X_t\}$ is irreducible.

The next theorem gives the basic limiting behavior of irreducible Markov chains. Before stating the theorem we need to introduce a little more notation.

Notation 15.59 Let $S_0$ be the time of the first jump of $X_t$, let
$$ R_x := \min\{t \ge S_0 : X_t = x\}$$
be the first time of hitting the site $x$ after the first jump,¹⁰ and set
$$ \pi_x = \frac{\mathbb{E}_x S_0}{\mathbb{E}_x R_x} = \frac{1}{a_x \cdot \mathbb{E}_x R_x}, \quad\text{where } a_x := -A_{x,x}.$$
[So $\pi_x$ is, on average, the fraction of the time the chain spends at site $x$.] For the sample path in Figure 15.5, $R_1 = J_2$, $R_2 = J_4$, $R_3 = J_3$ and $R_4 = J_1$.

¹⁰ We require $t \ge S_0$ so that if the chain starts at $x$ it must first leave $x$ before we count being at $x$ again as a return to $x$. In short, you cannot return to $x$ unless you have first left $x$ or did not start at $x$.

Theorem 15.60 (Limiting behavior). Let $\{X_t\}$ be an irreducible Markov chain. Then:

1. For every initial starting distribution, $\nu(y) := P(X_0 = y)$ for $y \in S$, and all $y \in S$,
$$ \lim_{T\to\infty} \frac{1}{T}\int_0^T 1_{X_t=y}\,dt = \pi_y \quad (P_\nu \text{ a.s.}). \tag{15.37}$$
In words, the fraction of the time the chain spends at site $y$ is $\pi_y$.
2. $\lim_{t\to\infty} P_t(x,y) = \pi_y$, independent of $x$.
3. $\pi = (\pi_y)_{y\in S}$ is an invariant (stationary) distribution for $A$.
4. If $\pi_x > 0$ for some $x \in S$, then $\pi_x > 0$ for all $x \in S$ and $\sum_{x\in S} \pi_x = 1$.
5. The $\pi_x$ are all positive iff there exists a solution $\nu_x \ge 0$ to
$$ \sum_{x\in S} \nu_x A_{x,y} = 0 \text{ for all } y \in S \text{ with } \sum_{x\in S} \nu_x = 1.$$
If such a solution exists, it is unique and $\nu = \pi$.

Proof. We refer the reader to [14, Theorem 3.8.1] for the full proof. Let us make a few comments on the proof, taking for granted that $\lim_{t\to\infty} P_t(x,y) =: \pi_y$ exists. (See part 4 below for why the limit is independent of $x$ if it exists.)

1. Suppose that $\nu$ is a stationary distribution, i.e. $\nu P_t = \nu$. Then (by the dominated convergence theorem)
$$ \nu_y = \lim_{t\to\infty} \sum_x \nu_x P_t(x,y) = \sum_x \lim_{t\to\infty} \nu_x P_t(x,y) = \left(\sum_x \nu_x\right)\pi_y = \pi_y.$$
Thus $\nu_y = \pi_y$. If $\pi_y = 0$ for all $y$, we must conclude there is no stationary distribution.


2. If we are in the finite state setting, the following computation is justified:
$$ \sum_{y\in S} \pi_y P_s(y,z) = \sum_{y\in S} \lim_{t\to\infty} P_t(x,y)\, P_s(y,z) = \lim_{t\to\infty} \sum_{y\in S} P_t(x,y)\, P_s(y,z) = \lim_{t\to\infty} [P_t P_s]_{x,z} = \lim_{t\to\infty} P_{t+s}(x,z) = \pi_z.$$
This shows that $\pi P_s = \pi$ for all $s$, and differentiating this equation at $s = 0$ then shows $\pi A = 0$.

3. Let us now explain why
$$ \frac{1}{T}\int_0^T 1_{X_t=y}\,dt \to \frac{\mathbb{E}_y S_0}{\mathbb{E}_y R_y} = \frac{1}{a_y \cdot \mathbb{E}_y R_y}. \tag{15.38}$$
The idea is that, because the chain is irreducible, no matter how we start the chain we will eventually hit the site $y$. Once we hit $y$, the (strong) Markov property implies the chain forgets how it got there and behaves as if it started at $y$. The size of $J_1$ (as long as it is finite) is negligible when computing the limit in Eq. (15.38), and so we may as well assume that the chain starts at $y$.

Now consider one typical cycle in the chain starting at $y$, jumping away at time $S_0$ and then returning to $y$ at time $R_y$. The average first jump time is $\mathbb{E}_y S_0 = 1/a_y$, while the average length of such a cycle is $\mathbb{E}_y R_y$. As the chain repeats this procedure over and over again with the same statistics, we expect (by a law of large numbers) that the average time spent at site $y$ is given by
$$ \frac{\mathbb{E}_y S_0}{\mathbb{E}_y R_y} = \frac{1/a_y}{\mathbb{E}_y R_y} = \frac{1}{a_y \cdot \mathbb{E}_y R_y}.$$
Formal argument. To make this last argument rigorous, let $\{\rho_z\}_{z=1}^\infty$ be the successive times between returns to $y$ and $\{\sigma_z\}_{z=1}^\infty$ be the successive sojourn times at $y$. Letting $N = N_T$ be chosen so that
$$ \rho_1 + \cdots + \rho_{N_T} \le T < \rho_1 + \cdots + \rho_{N_T+1},$$
we will have
$$ \frac{\sigma_1 + \cdots + \sigma_N}{\rho_1 + \cdots + \rho_{N+1}} \le \frac{1}{T}\int_0^T 1_{X_t=y}\,dt \le \frac{\sigma_1 + \cdots + \sigma_N + \sigma_{N+1}}{\rho_1 + \cdots + \rho_N},$$
where by the strong law of large numbers, $N = N_T \to \infty$ as $T \to \infty$,
$$ \frac{\sigma_1 + \cdots + \sigma_N}{\rho_1 + \cdots + \rho_{N+1}} = \frac{\frac{\sigma_1 + \cdots + \sigma_N}{N}}{\frac{\rho_1 + \cdots + \rho_{N+1}}{N}} \to \frac{\mathbb{E}S_0}{\mathbb{E}R_y} = \frac{1}{a_y\,\mathbb{E}R_y}$$
and
$$ \frac{\sigma_1 + \cdots + \sigma_{N+1}}{\rho_1 + \cdots + \rho_N} = \frac{\frac{\sigma_1 + \cdots + \sigma_{N+1}}{N}}{\frac{\rho_1 + \cdots + \rho_N}{N}} \to \frac{\mathbb{E}S_0}{\mathbb{E}R_y} = \frac{1}{a_y\,\mathbb{E}R_y} \text{ as } T \to \infty.$$

4. Taking expectations of Eq. (15.37) shows (at least when $\#(S) < \infty$)
$$ \pi_y = \lim_{T\to\infty} \frac{1}{T}\int_0^T P_\nu(X_t = y)\,dt = \lim_{T\to\infty} \frac{1}{T}\int_0^T \sum_x \nu_x P_t(x,y)\,dt = \sum_x \nu_x \lim_{T\to\infty} \frac{1}{T}\int_0^T P_t(x,y)\,dt = \sum_x \nu_x \lim_{t\to\infty} P_t(x,y)$$
for any initial distribution $\nu$. Taking $\nu_x = \delta_{z,x}$ for any $z \in S$ we choose shows $\lim_{t\to\infty} P_t(z,y) = \pi_y = \frac{1}{a_y\cdot\mathbb{E}_y R_y}$, independent of $z$.
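To see Eq. (15.37) in action, the hypothetical sketch below (ours, reusing the made-up three-state generator) simulates one long path and compares the empirical occupation fractions with the invariant distribution obtained from $\pi A = 0$:

```python
import numpy as np

rng = np.random.default_rng(4)
A = np.array([[-3.0, 1.0, 2.0],
              [ 1.0, -1.0, 0.0],
              [ 2.0, 7.0, -9.0]])
a = -np.diag(A)

# Invariant distribution: least-squares solve of pi A = 0 with sum(pi) = 1.
M = np.vstack([A.T, np.ones(3)])
pi = np.linalg.lstsq(M, np.array([0.0, 0.0, 0.0, 1.0]), rcond=None)[0]

# Long-run occupation fractions from one simulated path.
T, occupation, x, t = 10000.0, np.zeros(3), 0, 0.0
while t < T:
    hold = rng.exponential(1.0 / a[x])
    occupation[x] += min(hold, T - t)   # time spent at x before T
    t += hold
    probs = np.maximum(A[x], 0.0) / a[x]
    x = rng.choice(3, p=probs)

print("empirical:", occupation / T)
print("pi:       ", pi)
```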

15.9 Some Associated Martingales

This section may be considered a warm-up for the martingales appearing relative to Brownian motion in the next chapter. As above, let $S$ be a finite or countable state space and $A$ be the infinitesimal generator of a continuous time Markov chain $\{X_t\}_{t\ge 0}$ on $S$ such that $a_x := -A_{x,x}$ is bounded in $x \in S$. For a function $[0,T]\times S \ni (t,x) \to h(t,x) \in \mathbb{R}$ which is continuously differentiable in $t$ and such that $\sum_y |A_{x,y}|\,|h(t,y)| < \infty$ for all $x \in S$, we let
$$ Lh(t,x) := \dot{h}(t,x) + (Ah)(t,x) = \frac{\partial}{\partial t} h(t,x) + \sum_y A_{x,y}\, h(t,y).$$

Theorem 15.61 (Martingale problem). Let us continue the notation and assumptions above and let $\nu : S \to [0,1]$ be a given initial distribution. If we further assume that
$$ \sup_{0\le t\le T} \mathbb{E}_\nu\left[|h(t,X_t)| + |\dot{h}(t,X_t)| + |(Ah)(t,X_t)|\right] < \infty,$$
then
$$ M_t := h(t,X_t) - \int_0^t Lh(\tau, X_\tau)\,d\tau$$
is an $\{\mathcal{F}_t^X\}_{t\in\mathbb{R}_+}$ martingale. [Here $M_t$ is a right continuous martingale with left hand limits, having jumps at (at most) the jump times of $\{X_t\}$.] In particular, if $h$ also satisfies the "heat equation" in reverse time, $Lh(t,x) = 0$, then $M_t = h(t,X_t)$ is a martingale.


Proof. Working formally for the moment,
$$\begin{aligned}
\frac{d}{d\tau}\mathbb{E}_\nu[h(\tau, X_\tau)\,|\,\mathcal{F}_s] &= \frac{d}{d\tau}[P_{\tau-s} h(\tau, \cdot)](X_s) = [P_{\tau-s}\dot{h}(\tau, \cdot)](X_s) + [P_{\tau-s} A h(\tau, \cdot)](X_s) \\
&= \mathbb{E}_\nu\left[\dot{h}(\tau, X_\tau) + Ah(\tau, X_\tau)\,\big|\,\mathcal{F}_s\right] = \mathbb{E}_\nu[Lh(\tau, X_\tau)\,|\,\mathcal{F}_s],
\end{aligned}$$
wherein we have used the Markov property in the first and third lines and the chain rule in the second line. Integrating this equation over $\tau \in [s,t]$ then shows
$$ \mathbb{E}_\nu[h(t,X_t) - h(s,X_s)\,|\,\mathcal{F}_s] = \mathbb{E}_\nu\left[\int_s^t Lh(\tau, X_\tau)\,d\tau\,\Big|\,\mathcal{F}_s\right]. \tag{15.39}$$
Using
$$ \mathbb{E}_\nu[M_t - M_s\,|\,\mathcal{F}_s] = \mathbb{E}_\nu\left[h(t,X_t) - h(s,X_s) - \int_s^t Lh(\tau, X_\tau)\,d\tau\,\Big|\,\mathcal{F}_s\right],$$
we see that Eq. (15.39) is precisely the statement that $\mathbb{E}_\nu[M_t - M_s\,|\,\mathcal{F}_s] = 0$, i.e. the assertion that $\{M_t\}_{t\ge 0}$ is a martingale.

The formal proof requires more justification. Rather than doing this here, the reader is referred to the analogous Theorem 16.31 in the Brownian motion case for the sorts of arguments needed to fill in the missing details.

Corollary 15.62 (Representation formula). Suppose that $B \subset S$ is a set, $\tau_B = \inf\{t \ge 0 : X_t \in B\}$, and assume that $\mathbb{E}_x \tau_B < \infty$ for all $x \in A := S \setminus B$. If $u : S \to \mathbb{R}$ is a bounded function, then
$$ u(x) = \mathbb{E}_x\left[u(X_{\tau_B}) - \int_0^{\tau_B} (Au)(X_s)\,ds\right]. \tag{15.40}$$

Proof. By Theorem 15.61,
$$ M_t := u(X_t) - \int_0^t (Au)(X_s)\,ds$$
is a martingale relative to any starting distribution $\nu$, which we now take to be $\delta_x$. By a continuous time variant of the optional sampling theorem, one shows $t \to M_{t\wedge\tau_B}$ is still a martingale and therefore
$$ u(x) = \mathbb{E}_x M_0 = \mathbb{E}_x M_{t\wedge\tau_B} = \mathbb{E}_x\left[u(X_{t\wedge\tau_B}) - \int_0^{t\wedge\tau_B} (Au)(X_s)\,ds\right].$$
We may now use DCT to pass to the limit as $t \uparrow \infty$ in the above identity to arrive at Eq. (15.40).

16 Brownian Motion

This chapter is devoted to an introduction to Brownian motion on the state space $S$, which is taken to be $S = \mathbb{R}$, or more generally $\mathbb{R}^d$ for some $d$. The first thing we need to do is recall the basic properties of Gaussian random variables and vectors.

16.1 Normal/Gaussian Random Vectors

A random variable, $Y$, is normal with mean $\mu$ and variance $\sigma^2$ iff
$$ P(Y \in (y, y+dy]) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(y-\mu)^2}\,dy. \tag{16.1}$$

Notation 16.1 Suppose that $Y$ is a random variable. We write $Y \stackrel{d}{=} N(\mu, \sigma^2)$ to mean that $Y$ is a normal random variable with mean $\mu$ and variance $\sigma^2$. In particular, $Y \stackrel{d}{=} N(0,1)$ is used to indicate that $Y$ is a standard normal random variable.

It turns out that rather than working with densities of normal random variables, it is typically easier to make use of the Laplace (or better, Fourier) transform description of the distribution.

Definition 16.2. A random variable, $Y : \Omega \to \mathbb{R}$, is said to be exponentially integrable if $\mathbb{E}[e^{\lambda Y}] < \infty$ for all $\lambda \in \mathbb{R}$. This is equivalent to requiring $\mathbb{E}[e^{\lambda|Y|}] < \infty$ for all $\lambda > 0$.

Lemma 16.3. If $Y$ is an exponentially integrable random variable, then $Y \stackrel{d}{=} N(\mu, \sigma^2)$ iff
$$ \mathbb{E}[e^{\lambda Y}] = e^{\frac{1}{2}\sigma^2\lambda^2 + \lambda\mu}. \tag{16.2}$$

Proof. If $Y \stackrel{d}{=} N(\mu, \sigma^2)$ and $\lambda \in \mathbb{R}$, then using Eq. (16.1) we find
$$\begin{aligned}
\mathbb{E}[e^{\lambda Y}] &:= \int_{-\infty}^\infty e^{\lambda y}\,\frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(y-\mu)^2}\,dy = \int_{-\infty}^\infty e^{\lambda(\sigma x + \mu)}\,\frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}x^2}\,dx \\
&= e^{\lambda\mu}\cdot\int_{-\infty}^\infty \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x-\sigma\lambda)^2}\, e^{\frac{1}{2}\sigma^2\lambda^2}\,dx \\
&= e^{\frac{1}{2}\sigma^2\lambda^2 + \lambda\mu}\cdot\int_{-\infty}^\infty \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}z^2}\,dz = e^{\frac{1}{2}\sigma^2\lambda^2 + \lambda\mu}, \tag{16.3}
\end{aligned}$$
wherein the second line we made the change of variables $y = \sigma x + \mu$, completed the square in the third line, made the change of variables $z = x - \sigma\lambda$ in the fourth line, and used the well known fact that
$$ \sqrt{2\pi} = \int_{-\infty}^\infty e^{-\frac{1}{2}z^2}\,dz$$
in the last. The converse result relies on the fact that the distribution of an exponentially integrable random variable is uniquely determined by its Laplace transform, i.e. by the function $\lambda \to \mathbb{E}[e^{\lambda Y}]$.

Corollary 16.4. If $Y \stackrel{d}{=} N(\mu, \sigma^2)$, then $\mu = \mathbb{E}Y$ and $\sigma^2 = \mathrm{Var}(Y)$.

Proof. Differentiating Eq. (16.2) (which is justified by DCT) we find
$$ \mathbb{E}[Y e^{\lambda Y}] = (\sigma^2\lambda + \mu)\, e^{\frac{1}{2}\sigma^2\lambda^2 + \lambda\mu} \quad\text{and}\quad \mathbb{E}[Y^2 e^{\lambda Y}] = \left(\sigma^2 + (\sigma^2\lambda + \mu)^2\right) e^{\frac{1}{2}\sigma^2\lambda^2 + \lambda\mu}.$$
Taking $\lambda = 0$ in these equations then implies
$$ \mu = \mathbb{E}Y \quad\text{and}\quad \sigma^2 + \mu^2 = \mathbb{E}[Y^2],$$
and therefore $\sigma^2 = \mathrm{Var}(Y)$.

The last two results serve as motivation for the following alternative definition of Gaussian random variables and, more generally, random vectors.

Definition 16.5 (Normal / Gaussian Random Vector). An $\mathbb{R}^d$-valued random vector, $Z$, is said to be normal or Gaussian¹ if for all $\lambda \in \mathbb{R}^d$, $\lambda\cdot Z$ is a Gaussian random variable. Alternatively stated, $Z$ is a Gaussian random vector iff
$$ \mathbb{E}[e^{\lambda\cdot Z}] = \exp\left(\frac{1}{2}\mathrm{Var}(\lambda\cdot Z) + \mathbb{E}(\lambda\cdot Z)\right) \text{ for all } \lambda \in \mathbb{R}^d.$$
We say that $Z$ is a standard normal random vector if $\mathbb{E}[\lambda\cdot Z] = 0$ and $\mathrm{Var}(\lambda\cdot Z) = \lambda\cdot\lambda$ for all $\lambda \in \mathbb{R}^d$.

If we let $\mu = (\mu_i := \mathbb{E}Z_i)_{i=1}^d$ be the mean vector and $C_{ij} := \mathrm{Cov}(Z_i, Z_j)$ be the covariance matrix of $Z$, then
$$ \mathbb{E}[\lambda\cdot Z] = \mu\cdot\lambda \quad\text{and}\quad \mathrm{Var}(\lambda\cdot Z) = \mathbb{E}\left[(\lambda\cdot Z)^2\right] - (\mathbb{E}[\lambda\cdot Z])^2 = \sum_{i,j=1}^d \mathrm{Cov}(\lambda_i Z_i, \lambda_j Z_j) = C\lambda\cdot\lambda.$$
With this notation, a Gaussian random vector, $Z$, is standard normal iff $\mu = 0$ and $C = I$.

¹ I will use the terms normal and Gaussian interchangeably.

Lemma 16.6. If $Z \in \mathbb{R}^d$ is a Gaussian random vector, then for all $(k_1, \dots, k_d) \in \mathbb{N}^d$ we have
$$ \mathbb{E}\prod_{j=1}^d \left|Z_j^{k_j}\right| < \infty.$$

Proof. A simple calculus exercise shows for each $k \ge 0$ that there exists $C_k < \infty$ such that
$$ |x|^k \le C_k\, e^{|x|} \le C_k\left[e^x + e^{-x}\right].$$
Hence we know
$$ \mathbb{E}\prod_{j=1}^d \left|Z_j^{k_j}\right| \le \mathbb{E}\prod_{j=1}^d C_{k_j}\left[e^{Z_j} + e^{-Z_j}\right],$$
and the latter integrand is a linear combination of random variables of the form $e^{a\cdot Z}$ with $a \in \{\pm 1\}^d$. By assumption $\mathbb{E}[e^{a\cdot Z}] < \infty$ for all $a \in \mathbb{R}^d$, and hence we learn that
$$ \mathbb{E}\prod_{j=1}^d C_{k_j}\left[e^{Z_j} + e^{-Z_j}\right] < \infty.$$

Fact 16.7 If $W$ and $Z$ are any two $\mathbb{R}^d$-valued random vectors such that $\mathbb{E}[e^{\lambda\cdot Z}] < \infty$ and $\mathbb{E}[e^{\lambda\cdot W}] < \infty$ for all $\lambda \in \mathbb{R}^d$, then
$$ \mathbb{E}[e^{\lambda\cdot Z}] = \mathbb{E}[e^{\lambda\cdot W}] \text{ for all } \lambda \in \mathbb{R}^d$$
iff $W \stackrel{d}{=} Z$, where $W \stackrel{d}{=} Z$ by definition means
$$ \mathbb{E}[f(W)] = \mathbb{E}[f(Z)]$$
for all functions $f : \mathbb{R}^d \to \mathbb{R}$ such that the expectations make sense.

Fact 16.8 If $W \in \mathbb{R}^k$ and $Z \in \mathbb{R}^l$ are any two random vectors such that $\mathbb{E}[e^{a\cdot W}] < \infty$ and $\mathbb{E}[e^{b\cdot Z}] < \infty$ for all $a \in \mathbb{R}^k$ and $b \in \mathbb{R}^l$, then $W$ and $Z$ are independent iff
$$ \mathbb{E}[e^{a\cdot W} e^{b\cdot Z}] = \mathbb{E}[e^{a\cdot W}]\cdot\mathbb{E}[e^{b\cdot Z}] \quad\forall\ a \in \mathbb{R}^k\ \&\ b \in \mathbb{R}^l. \tag{16.4}$$

Example 16.9. Suppose $W \in \mathbb{R}^k$ and $Z \in \mathbb{R}^l$ are two random vectors such that $(W,Z) \in \mathbb{R}^k\times\mathbb{R}^l$ is a Gaussian random vector. Then $W$ and $Z$ are independent iff $\mathrm{Cov}(W_i, Z_j) = 0$ for all $1 \le i \le k$ and $1 \le j \le l$. To keep the notation to a minimum, let us verify this fact when $k = l = 1$. Then according to Fact 16.8 we need only verify Eq. (16.4) for $a, b \in \mathbb{R}$. However,
$$ \mathrm{Var}((a,b)\cdot(W,Z)) = \mathrm{Var}(aW + bZ) = \mathrm{Var}(aW) + \mathrm{Var}(bZ) + 2ab\,\mathrm{Cov}(W,Z) = \mathrm{Var}(aW) + \mathrm{Var}(bZ)$$
and hence
$$\begin{aligned}
\mathbb{E}[e^{aW} e^{bZ}] &= \mathbb{E}[e^{(a,b)\cdot(W,Z)}] = \exp\left(\frac{1}{2}\mathrm{Var}((a,b)\cdot(W,Z)) + \mathbb{E}[(a,b)\cdot(W,Z)]\right) \\
&= \exp\left(\frac{1}{2}\mathrm{Var}(aW) + \mathbb{E}[aW] + \frac{1}{2}\mathrm{Var}(bZ) + \mathbb{E}[bZ]\right) = \mathbb{E}[e^{aW}]\,\mathbb{E}[e^{bZ}].
\end{aligned}$$

Remark 16.10. In general it is not true that two uncorrelated random variables are independent. For example, suppose that $X$ is any random variable with an even distribution, i.e. $X \stackrel{d}{=} -X$. Let $Y := |X|$, which is typically not independent of $X$.² Nevertheless, $X$ and $Y$ are uncorrelated. Indeed,

² For example, if $X : \Omega \to \{\pm 1, \pm 2\}$ with each value taken with probability $1/4$, then
$$ P(X = 1, |X| = 1) = P(X = 1) = \frac{1}{4} \ne \frac{1}{8} = P(X = 1)\, P(|X| = 1).$$


$$ \mathrm{Cov}(X,Y) = \mathbb{E}[XY] - \mathbb{E}X\cdot\mathbb{E}Y = \mathbb{E}[X|X|] - \mathbb{E}X\cdot\mathbb{E}|X| = 0,$$
wherein we have used that both $X$ and $|X|X$ have even distributions and therefore have zero expectations. Indeed, if $Z \stackrel{d}{=} -Z$ then
$$ \mathbb{E}Z = \mathbb{E}[-Z] = -\mathbb{E}Z \implies \mathbb{E}Z = 0.$$

Exercise 16.1. Suppose that $X$ and $Y$ are independent normal random variables. Show:

1. $Z = (X,Y)$ is a normal random vector, and
2. $W = X + Y$ is a normal random variable.
3. If $N$ is a standard normal random variable and $X$ is any normal random variable, show $X \stackrel{d}{=} \sigma N + \mu$, where $\mu = \mathbb{E}X$ and $\sigma = \sqrt{\mathrm{Var}(X)}$.

Remark 16.11. To conclude that a random vector, $X : \Omega \to \mathbb{R}^d$, is Gaussian, it is not enough to check that each of its components, $\{X_i\}_{i=1}^d$, is a Gaussian random variable. The following simple counterexample was provided by Nate Eldredge. Let $X \stackrel{d}{=} N(0,1)$ and $\xi \stackrel{d}{=} \mathrm{Bern}(1/2)$ (i.e. $\xi = \pm 1$ with probability $1/2$ each), with $X$ and $\xi$ independent, and then let $Y = \xi X$. Then $Y \stackrel{d}{=} N(0,1)$, so that both $X$ and $Y$ are Gaussian random variables, yet $(X,Y)$ is not a Gaussian random vector.

Exercise 16.2. Prove the assertion made in Remark 16.11. Hints: compute $\mathbb{E}[e^{\lambda Y}]$ in order to show $Y \stackrel{d}{=} N(0,1)$, then show $\mathrm{Cov}(X,Y) = 0$ even though $X$ and $Y$ are not independent.
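A hypothetical numerical illustration of Remark 16.11 (ours, taking $\xi = \pm 1$ with equal probability): $X$ and $Y = \xi X$ are uncorrelated and each standard normal, yet $X + Y$ vanishes with probability $1/2$, which no Gaussian vector allows:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100000
X = rng.standard_normal(n)
xi = rng.choice([-1.0, 1.0], size=n)  # xi = +/-1 with probability 1/2
Y = xi * X

print(np.corrcoef(X, Y)[0, 1])        # ~ 0: X and Y are uncorrelated
print(np.mean(X + Y == 0.0))          # ~ 0.5: X+Y has an atom at 0, so (X,Y) is not Gaussian
```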

Definition 16.12 (Gaussian Process). We say a real valued stochastic process, $\{X_t\}_{t\ge 0}$, is a Gaussian process if for all $n \in \mathbb{N}$ and $0 = t_0 < t_1 < \cdots < t_n < \infty$, $(X_{t_1}, \dots, X_{t_n})$ is a Gaussian random vector.

16.2 Stationary and Independent Increment Processes

Rather than considering all continuous time Markov processes with values in $\mathbb{R}$ (or $\mathbb{R}^d$), we are going to focus our attention on one extremely important example, namely Brownian motion. However, before restricting to the Brownian case, let us first introduce a more general class of $\mathbb{R}^d$-valued stochastic processes.

Definition 16.13. A stochastic process, $\{X_t : \Omega \to S := \mathbb{R}^d\}_{t\in T}$, has independent increments if for all finite subsets, $\Lambda = \{0 \le t_0 < t_1 < \cdots < t_n\} \subset T$, the random variables $\{X_0\} \cup \{X_{t_k} - X_{t_{k-1}}\}_{k=1}^n$ are independent. We refer to $X_t - X_s$ for $s < t$ as an increment of $X$.

Definition 16.14. A stochastic process, $\{X_t : \Omega \to S := \mathbb{R}^d\}_{t\in T}$, has stationary increments if for all $0 \le s < t < \infty$, the law of $X_t - X_s$ depends only on $t - s$.

Example 16.15. The Poisson process of rate $\lambda$, $\{X_t\}_{t\ge 0}$, has stationary independent increments with $X_t - X_s \stackrel{d}{=} \mathrm{Poi}(\lambda(t-s))$. Also, if we let
$$ \tilde{X}_t := X_t - \mathbb{E}X_t = X_t - \lambda t$$
be the "centering" of $X_t$, then $\{\tilde{X}_t\}_{t\ge 0}$ is a mean zero process with independent increments. In this case, $\mathbb{E}\tilde{X}_t^2 = \mathrm{Var}(\tilde{X}_t) = \lambda t$.

Lemma 16.16. Suppose that $\{X_t\}_{t\ge 0}$ is a mean zero, square-integrable stochastic process with stationary independent increments. If we further assume that $X_0 = 0$ a.s. and $t \to \mathbb{E}X_t^2$ is continuous, then $\mathbb{E}X_t^2 = \lambda t$ for some $\lambda \ge 0$.

Proof. Let $\varphi(t) := \mathbb{E}X_t^2$ and let $s, t \ge 0$. Then
$$\begin{aligned}
\varphi(t+s) = \mathbb{E}X_{t+s}^2 &= \mathbb{E}[(X_{t+s} - X_t) + X_t]^2 \\
&= \mathbb{E}(X_{t+s} - X_t)^2 + \mathbb{E}X_t^2 + 2\,\mathbb{E}[(X_{t+s} - X_t) X_t] \\
&= \mathbb{E}X_s^2 + \mathbb{E}X_t^2 + 2\,\mathbb{E}[X_{t+s} - X_t]\cdot\mathbb{E}X_t = \varphi(t) + \varphi(s) + 0.
\end{aligned}$$
Now let $\lambda := \varphi(1)$ and use the previous equation to learn, for $n \in \mathbb{N}$, that
$$ \lambda = \varphi(1) = \varphi\Big(\underbrace{\tfrac{1}{n} + \cdots + \tfrac{1}{n}}_{n\text{-times}}\Big) = n\cdot\varphi\left(\frac{1}{n}\right),$$
i.e. $\varphi\left(\frac{1}{n}\right) = \frac{1}{n}\lambda$. Similarly, if $k \in \mathbb{N}$ we may conclude
$$ \varphi\left(\frac{k}{n}\right) = \varphi\Big(\underbrace{\tfrac{1}{n} + \cdots + \tfrac{1}{n}}_{k\text{-times}}\Big) = k\,\varphi\left(\frac{1}{n}\right) = \frac{k}{n}\lambda.$$
Thus $\varphi(t) = \lambda t$ for all $t \in \mathbb{Q}_+$, and then by continuity we must have $\varphi(t) = \lambda t$ for all $t \in \mathbb{R}_+$.

Proposition 16.17. Suppose that $\{X_t : \Omega \to S := \mathbb{R}^d\}_{t\in T}$ is a stochastic process with stationary and independent increments, and let $\mathcal{F}_t := \mathcal{F}_t^X$ for all $t \in T$. Then
$$ \mathbb{E}[f(X_t)\,|\,\mathcal{F}_s] = (P_{t-s} f)(X_s), \quad\text{where} \tag{16.5}$$
$$ (P_t f)(x) = \mathbb{E}[f(x + X_t - X_0)]. \tag{16.6}$$
Moreover, $\{P_t\}_{t\ge 0}$ satisfies the Chapman-Kolmogorov equation, i.e. $P_t P_s = P_{t+s}$ for all $s, t \ge 0$.

Proof. By assumption, $X_t - X_s$ is independent of $\mathcal{F}_s$ and therefore, by a jazzed up version of Proposition 3.18,
$$ \mathbb{E}[f(X_t)\,|\,\mathcal{F}_s] = \mathbb{E}[f(X_s + X_t - X_s)\,|\,\mathcal{F}_s] = G(X_s),$$
where
$$ G(x) = \mathbb{E}[f(x + X_t - X_s)].$$
Because the increments are stationary, we may replace $X_t - X_s$ in the definition of $G$ above by $X_{t-s} - X_0$, which then proves Eq. (16.5) with $P_t$ as in Eq. (16.6).

To verify the Chapman-Kolmogorov equation, letting $\tilde{X}$ be an independent copy of $X$ (with expectation $\tilde{\mathbb{E}}$) we have
$$\begin{aligned}
(P_t P_s f)(x) &= \mathbb{E}[(P_s f)(x + X_t - X_0)] = \mathbb{E}\left[\tilde{\mathbb{E}}\, f\left(x + X_t - X_0 + \tilde{X}_s - \tilde{X}_0\right)\right] \\
&= \mathbb{E}\left[\tilde{\mathbb{E}}\, f\left(x + X_t - X_0 + \tilde{X}_{t+s} - \tilde{X}_t\right)\right] = \mathbb{E}[f(x + X_t - X_0 + X_{t+s} - X_t)] \\
&= \mathbb{E}[f(x + X_{t+s} - X_0)] = (P_{t+s} f)(x).
\end{aligned}$$

Example 16.18. The Poisson process (see Definition 15.32) of rate $\lambda$, $\{N_t\}_{t\ge 0}$, has stationary independent increments with $N_t - N_s \stackrel{d}{=} \mathrm{Poi}(\lambda(t-s))$, as was shown in Proposition 15.34. In this case,
$$ (P_t f)(x) = \mathbb{E} f(x + N_t - N_0) = \sum_{k=0}^\infty f(x+k)\,\frac{(\lambda t)^k}{k!}\, e^{-\lambda t}$$
and we find
$$ (Af)(x) = \frac{d}{dt}\Big|_{0+}(P_t f)(x) = \sum_{k=0}^\infty f(x+k)\,\frac{d}{dt}\Big|_{0+}\frac{(\lambda t)^k}{k!}\, e^{-\lambda t} = \sum_{k=0}^1 f(x+k)\,\frac{d}{dt}\Big|_{0+}\frac{(\lambda t)^k}{k!}\, e^{-\lambda t} = -\lambda f(x) + \lambda f(x+1).$$

Proposition 16.19 (Brownian Motion I). Suppose that $\{B_t\}_{t\ge 0}$ is a real valued stochastic process on some probability space $(\Omega, \mathcal{B}, P)$ with right continuous sample paths such that:

1. $B_0 = 0$ a.s. and $\mathbb{E}B_t = 0$ for all $t \ge 0$;
2. $\mathbb{E}(B_t - B_s)^2 = t - s$ for all $0 \le s \le t < \infty$;
3. $B$ has independent increments, i.e. if $0 = t_0 < t_1 < \cdots < t_n < \infty$, then $\{B_{t_j} - B_{t_{j-1}}\}_{j=1}^n$ are independent random variables;
4. (Moment Condition) there exist $p > 2$, $q > 1$ and $c < \infty$ such that $\mathbb{E}|B_t - B_s|^p \le c|t-s|^q$ for all $s, t \in \mathbb{R}_+$.

Then $B_t - B_s \stackrel{d}{=} N(0, t-s)$ for all $0 \le s < t < \infty$, and in fact $\{B_t\}_{t\ge 0}$ is a "Brownian motion" as described in Section 16.3 below. [You are asked to show in Exercise 16.3 below that the infinitesimal generator of $\{B_t\}_{t\ge 0}$ is "$A = \frac{1}{2}\frac{d^2}{dx^2}$."]

The previous result is a consequence of the central limit theorem. As the next example illustrates, the moment condition above is needed in order to show $\{B_t\}_{t\ge 0}$ is a Gaussian process as defined in Definition 16.12.

Example 16.20 (Normalized Poisson Process). There certainly are right continuous processes satisfying items 1-3 but not the moment condition 4 which are not Gaussian. Indeed, if $\{N_t\}_{t\ge 0}$ is a Poisson process with intensity $\lambda$ (see Definition 15.32), then $B_t := \lambda^{-1/2}(N_t - \lambda t)$ satisfies items 1-3 above. The independence of the increments follows from Proposition 15.34, and from Exercise 15.1 we know that $\mathrm{Var}(N_t - N_s) = \lambda(t-s)$ and $\mathbb{E}(N_t - N_s) = \lambda(t-s)$, so $\mathbb{E}B_t = 0$ and
$$ \mathbb{E}(B_t - B_s)^2 = \mathrm{Var}(B_t - B_s) = \mathrm{Var}\left(\lambda^{-1/2}(N_t - N_s)\right) = \lambda^{-1}\,\mathrm{Var}(N_t - N_s) = t - s.$$
In this case one can show that $\mathbb{E}[|B_t - B_s|^p] \sim |t-s|$ (for $|t-s|$ small) for all $1 \le p < \infty$, so the moment condition in item 4 fails. This is good, as $\{B_t\}_{t\ge 0}$ is not Gaussian!

16.3 Brownian motion defined

Definition 16.21 (Brownian Motion). Let $(\Omega, \mathcal{F}, \{\mathcal{F}_t\}_{t\in\mathbb{R}_+}, P)$ be a filtered probability space. A real valued adapted process, $\{B_t : \Omega \to \mathbb{R}\}_{t\in\mathbb{R}_+}$, is called a Brownian motion if:

1. $\{B_t\}_{t\in\mathbb{R}_+}$ has independent increments, with the increment $B_t - B_s$ being independent of $\mathcal{F}_s$ for all $0 \le s < t < \infty$;
2. for $0 \le s < t$, $B_t - B_s \stackrel{d}{=} N(0, t-s)$, i.e. $B_t - B_s$ is a normal mean zero random variable with variance $t - s$;
3. $t \to B_t(\omega)$ is continuous for all $\omega \in \Omega$.

Remark 16.22. In light of Lemma 16.16, we could have described a Brownian motion (up to a scale) as a continuous-in-time stochastic process $\{B_t\}_{t\ge 0}$ with stationary independent mean zero Gaussian increments such that $B_0 = 0$ and $t \to \mathbb{E}B_t^2$ is continuous.

Exercise 16.3 (Brownian Motion). Let $\{B_t\}_{t\ge 0}$ be a Brownian motion as in Definition 16.21.

1. Explain why $\{B_t\}_{t\ge 0}$ is a time homogeneous Markov process with transition operator
$$ (P_t f)(x) = \int_\mathbb{R} p_t(x,y)\, f(y)\,dy \tag{16.7}$$
where
$$ p_t(x,y) = \frac{1}{\sqrt{2\pi t}}\, e^{-\frac{1}{2t}|y-x|^2}. \tag{16.8}$$
In more detail, use Proposition 16.17 and Eq. (16.10) below to argue
$$ \mathbb{E}[f(B_t)\,|\,\mathcal{F}_s] = (P_{t-s} f)(B_s). \tag{16.9}$$
2. Show by direct computation that $P_t P_s = P_{t+s}$ for all $s, t > 0$. Hint: probably the easiest way to do this is to make use of Exercise 16.1 along with the identity
$$ (P_t f)(x) = \mathbb{E}\left[f\left(x + \sqrt{t}\, Z\right)\right], \tag{16.10}$$
where $Z \stackrel{d}{=} N(0,1)$.
3. Show by direct computation that $p_t(x,y)$ of Eq. (16.8) satisfies the heat equation,
$$ \frac{d}{dt}\, p_t(x,y) = \frac{1}{2}\frac{d^2}{dx^2}\, p_t(x,y) = \frac{1}{2}\frac{d^2}{dy^2}\, p_t(x,y) \text{ for } t > 0.$$
4. Suppose that $f : \mathbb{R}\to\mathbb{R}$ is a twice continuously differentiable function with compact support. Show
$$ \frac{d}{dt} P_t f = A P_t f = P_t A f \text{ for all } t > 0, \quad\text{where}\quad Af(x) = \frac{1}{2} f''(x).$$
Note: you may use (under the above assumptions), without proof, the fact that it is permissible to interchange the $\frac{d}{dt}$ and $\frac{d}{dx}$ derivatives with the integral in Eq. (16.7).
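The following hypothetical sketch (ours) simulates Brownian paths on a time grid via independent $N(0, dt)$ increments and checks the variance $\mathbb{E}B_T^2 = T$ together with the covariance $\mathbb{E}[B_s B_t] = \min(s,t)$ discussed in Remark 16.24 below:

```python
import numpy as np

rng = np.random.default_rng(6)
n_paths, n_steps, T = 20000, 1000, 1.0
dt = T / n_steps

# B_t as cumulative sums of independent N(0, dt) increments.
increments = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
B = np.cumsum(increments, axis=1)

s_idx, t_idx = n_steps // 4, n_steps // 2            # s ~ 0.25, t ~ 0.5
print(np.var(B[:, -1]), "should be close to", T)     # Var(B_T) = T
print(np.mean(B[:, s_idx] * B[:, t_idx]), "~ 0.25")  # E[B_s B_t] = min(s, t)
```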

Modulo technical details, Exercise 16.3 shows that $A = \frac{1}{2}\frac{d^2}{dx^2}$ is the infinitesimal generator of Brownian motion, i.e. the infinitesimal generator of $P_t$ in Eq. (16.7). The technical details we have ignored involve the proper function spaces in which to carry out these computations, along with a proper description of the domain of the operator $A$. We will have to omit these delicate issues here. However, let me at least point out that it is no longer necessarily a good idea to try to recover $P_t$ as $\sum_{n=0}^\infty \frac{t^n}{n!} A^n$ in the Brownian motion setting. Indeed, for $\sum_{n=0}^\infty \frac{t^n}{n!} A^n f$ to make sense one needs to assume that, at the very least, $f$ is $C^\infty$, and even this will not guarantee convergence of the sum! Nevertheless, when $A = \frac{1}{2}\frac{d^2}{dx^2}$ it is true that
$$ P_t f = \sum_{n=0}^\infty \frac{t^n}{n!} A^n f = \sum_{n=0}^\infty \frac{t^n}{2^n n!}\, f^{(2n)}$$
when $f$ is a polynomial function, see Exercise 19.2. The infinite series expansion also gives the correct answer when applied to exponential functions, $f(x) = e^{\lambda x}$, where $\lambda \in \mathbb{C}$.

Corollary 16.23. If $0 = t_0 < t_1 < \cdots < t_n < \infty$ and $\Delta_i t := t_i - t_{i-1}$, then
$$ \mathbb{E}F(B_{t_1}, \dots, B_{t_n}) = \int F(x_1, \dots, x_n)\, p_{\Delta_1 t}(x_0, x_1) \cdots p_{\Delta_n t}(x_{n-1}, x_n)\, dx_1 \dots dx_n \tag{16.11}$$
for all bounded or non-negative functions $F : \mathbb{R}^n \to \mathbb{R}$. In particular, if $J_i = (a_i, b_i) \subset \mathbb{R}$ are given bounded intervals, then
$$ P(B_{t_i} \in J_i \text{ for } i = 1, 2, \dots, n) = \int\cdots\int_{J_1\times\cdots\times J_n} p_{\Delta_1 t}(0, x_1)\, p_{\Delta_2 t}(x_1, x_2) \cdots p_{\Delta_n t}(x_{n-1}, x_n)\, dx_1 \dots dx_n, \tag{16.12}$$
as follows from Eq. (16.11) by taking $F(x_1, \dots, x_n) := 1_{J_1}(x_1) \cdots 1_{J_n}(x_n)$.

Proof. This result is a formal consequence of the Markov property (Proposition 16.17), similar to the discrete space case in Theorem 5.11. Rather than carry out this style of proof here, I will give a proof based directly on the independent increments of the Brownian motion. Let $x_0 := 0$. We are going to prove Eq. (16.11) by induction on $n$.

For $n = 1$, we have
$$ \mathbb{E}F(B_{t_1}) = \mathbb{E}F\left(\sqrt{t_1}\, N\right) = \int_\mathbb{R} p_{t_1}(0,y)\, F(y)\,dy,$$
which is Eq. (16.11) with $n = 1$. For the induction step, let $N$ be a standard normal random variable independent of $(B_{t_1}, \dots, B_{t_{n-1}})$. Then
$$\begin{aligned}
\mathbb{E}F(B_{t_1}, \dots, B_{t_n}) &= \mathbb{E}F\left(B_{t_1}, \dots, B_{t_{n-1}}, B_{t_{n-1}} + B_{t_n} - B_{t_{n-1}}\right) \\
&= \mathbb{E}F\left(B_{t_1}, \dots, B_{t_{n-1}}, B_{t_{n-1}} + \sqrt{\Delta_n t}\, N\right) \\
&= \mathbb{E}\int_\mathbb{R} F\left(B_{t_1}, \dots, B_{t_{n-1}}, y\right) p_{\Delta_n t}\left(B_{t_{n-1}}, y\right) dy \\
&= \int_\mathbb{R} \mathbb{E}\left[F\left(B_{t_1}, \dots, B_{t_{n-1}}, y\right) p_{\Delta_n t}\left(B_{t_{n-1}}, y\right)\right] dy. \tag{16.13}
\end{aligned}$$
By the induction hypothesis,
$$ \mathbb{E}\left[F\left(B_{t_1}, \dots, B_{t_{n-1}}, y\right) p_{\Delta_n t}\left(B_{t_{n-1}}, y\right)\right] = \int F(x_1, \dots, x_{n-1}, y)\, p_{\Delta_1 t}(x_0, x_1) \cdots p_{\Delta_{n-1} t}(x_{n-2}, x_{n-1})\, p_{\Delta_n t}(x_{n-1}, y)\, dx_1 \dots dx_{n-1}. \tag{16.14}$$
Combining Eqs. (16.13) and (16.14) and then replacing $y$ by $x_n$ verifies Eq. (16.11).

Remark 16.24 (Gaussian aspects of Brownian Motion). Brownian motion is a Gaussian process (see Definition 16.12). Indeed, if $0 = t_0 < t_1 < \cdots < t_n < \infty$, then $(B_{t_1}, \dots, B_{t_n})$ is a Gaussian random vector, since it is a linear transformation of the increment vector $(\Delta_1 B, \dots, \Delta_n B)$, where $\Delta_i B := B_{t_i} - B_{t_{i-1}}$. The distribution of a mean zero Gaussian random vector is completely determined by its covariance matrix. In this case, if $0 \le s < t < \infty$ we have
$$ \mathbb{E}[B_s B_t] = \mathbb{E}[B_s[B_s + B_t - B_s]] = \mathbb{E}B_s^2 + \mathbb{E}B_s\cdot\mathbb{E}[B_t - B_s] = s.$$
In general we have
$$ \mathbb{E}[B_s B_t] = \min(s,t) \quad\forall\ s, t \ge 0. \tag{16.15}$$
Thus we could define Brownian motion $\{B_t\}_{t\ge 0}$ to be the mean zero continuous Gaussian process such that Eq. (16.15) holds.

Indeed, if $\{B_t\}_{t\ge 0}$ is the mean zero continuous Gaussian process such that Eq. (16.15) holds, then for $0 < t_1 < \cdots < t_n = s < t$ we will have
$$ \mathbb{E}[(B_t - B_s) B_{t_i}] = t\wedge t_i - s\wedge t_i = t_i - t_i = 0$$
for all $1 \le i \le n$. Since $(B_{t_1}, \dots, B_{t_n}, B_t - B_s)$ is a Gaussian random vector, we may conclude that $B_t - B_s$ is independent of $(B_{t_1}, \dots, B_{t_n})$. As the $t_i$ above were arbitrary, we conclude that $B_t - B_s$ is independent of $\mathcal{F}_s^B = \sigma(B_r : r \le s)$. Therefore, if $f : \mathbb{R}\to\mathbb{R}$ is a bounded measurable function we have
$$ \mathbb{E}[f(B_t)\,|\,\mathcal{F}_s] = \mathbb{E}[f(B_t - B_s + B_s)\,|\,\mathcal{F}_s] = G(B_s),$$
where
$$ G(x) := \mathbb{E}[f(B_t - B_s + x)] = \mathbb{E}\left[f\left(x + \sqrt{t-s}\, N\right)\right] = \int_\mathbb{R} p_{t-s}(x,y)\, f(y)\,dy = (P_{t-s} f)(x),$$
with $N \stackrel{d}{=} N(0,1)$. Thus we have shown
$$ \mathbb{E}[f(B_t)\,|\,\mathcal{F}_s] = (P_{t-s} f)(B_s).$$
It is now not too difficult to verify that Eq. (16.11) holds.

Remark 16.25 (White Noise). If we formally differentiate Eq. (16.15) with respect to $s$ and $t$ we find
$$ \mathbb{E}[\dot{B}_s\dot{B}_t] = \frac{d}{dt}\frac{d}{ds}\min(s,t) = \frac{d}{dt}[1_{s\le t}] = \delta(t-s). \tag{16.16}$$
Thus $\{\dot{B}_t\}_{t\ge 0}$ is a totally uncorrelated Gaussian process, which formally implies the $\{\dot{B}_t\}_{t\ge 0}$ are all independent "random variables"; this is referred to as white noise.

Warning: white noise is not an honest real-valued stochastic process, owing to the fact that Brownian motion is very rough and in particular nowhere differentiable. If $\dot{B}_t$ were to exist it should be Gaussian and hence satisfy $\mathbb{E}\dot{B}_t^2 < \infty$. On the other hand, from Eq. (16.16) with $s = t$ we are led to believe that
$$ \mathbb{E}[\dot{B}_t^2] = \delta(t-t) = \delta(0) = \infty.$$
Nevertheless, ignoring these issues, it is common to model noise in a system by such a white noise. For example, we may have $y(t) \in \mathbb{R}^d$ satisfying, in a pristine environment, a differential equation of the form
$$ \dot{y}(t) = f(t, y(t)). \tag{16.17}$$
However, the real world is not so pristine, and the trajectory $y(t)$ is also influenced by noise in the system, which is often modeled by adding a term of the form $g(t, y(t))\,\dot{B}_t$ to the right side of Eq. (16.17) to arrive at the "stochastic differential equation"
$$ \dot{y}(t) = f(t, y(t)) + g(t, y(t))\,\dot{B}_t.$$
In the last section of these notes we will begin to explain Itô's method for making sense of such equations.

Definition 16.26 (Multi-Dimensional Brownian Motion). For $d \in \mathbb{N}$, we say an $\mathbb{R}^d$-valued process, $\{B_t = (B_t^1, \dots, B_t^d)^{\mathrm{tr}}\}_{t\ge 0}$, is a $d$-dimensional Brownian motion provided $\{B_\cdot^i\}_{i=1}^d$ is an independent collection of one dimensional Brownian motions.


We state the following multi-dimensional version of the results argued above. The proof of this theorem will be left to the reader.

Theorem 16.27. A $d$-dimensional Brownian motion, $\{B_t\}_{t\ge 0}$, is a Markov process with transition semigroup, $P_t$, now defined by
$$ (P_t f)(x) = \mathbb{E}\left[f\left(x + \sqrt{t}\, Z\right)\right] = \int_{\mathbb{R}^d} p_t(x,y)\, f(y)\,dy,$$
where now $Z$ is a standard normal random vector in $\mathbb{R}^d$ and
$$ p_t(x,y) = \left(\frac{1}{\sqrt{2\pi t}}\right)^d e^{-\frac{1}{2t}\|x-y\|^2}.$$
Moreover we have
$$ \frac{d}{dt} P_t = \frac{1}{2}\Delta P_t = P_t\,\frac{1}{2}\Delta,$$
where now $\Delta = \sum_{j=1}^d \frac{\partial^2}{\partial x_j^2}$ is the $d$-dimensional Laplacian.

Remark 16.28. Add here comments about the heuristic path integral interpretation of Brownian motion, namely that
$$ \mathbb{E}f(B)\ \text{"="}\ \frac{1}{Z}\int_{C(\mathbb{R}_+,\mathbb{R}^d)} f(\omega)\, e^{-\frac{1}{2}\int_0^\infty |\dot\omega(t)|^2\,dt}\,\mathcal{D}\omega. \tag{16.18}$$
Also mention that
$$ \left(\left(-\frac{d^2}{dt^2}\right)^{-1} h\right)(t) = \int_0^\infty t\wedge s\; h(s)\,ds,$$
that is, $t\wedge s$ is the Green's function associated to the formal covariance matrix of the "Gaussian" measure in Eq. (16.18).

16.4 Some "Brownian" martingales

Definition 16.29 (Continuous time martingales). Given a filtered probability space, $(\Omega, \mathcal{F}, \{\mathcal{F}_t\}_{t\ge 0}, P)$, an adapted process, $X_t : \Omega \to \mathbb{R}$, is said to be an $(\mathcal{F}_t)$ martingale provided $\mathbb{E}|X_t| < \infty$ for all $t$ and
$$ \mathbb{E}[X_t - X_s\,|\,\mathcal{F}_s] = 0 \text{ for all } 0 \le s \le t < \infty.$$
If
$$ \mathbb{E}[X_t - X_s\,|\,\mathcal{F}_s] \ge 0 \quad\text{or}\quad \mathbb{E}[X_t - X_s\,|\,\mathcal{F}_s] \le 0 \text{ for all } 0 \le s \le t < \infty,$$
then $X$ is said to be a submartingale or supermartingale, respectively.

Notation 16.30 Given a function $h : [0,T]\times\mathbb{R}^d \to \mathbb{R}$ such that $\dot{h}(t,x) = \frac{\partial}{\partial t}h(t,x)$, $\nabla_x h(t,x)$, and $\Delta_x h(t,x)$ all exist, let
$$ Lh(t,x) := \partial_t h(t,x) + \frac{1}{2}\Delta_x h(t,x).$$

Theorem 16.31 (Martingale problem). Suppose that $h : [0,T]\times\mathbb{R}^d \to \mathbb{R}$ is a continuous function such that $\dot{h}(t,x) = \frac{\partial}{\partial t}h(t,x)$, $\nabla_x h(t,x)$, and $\Delta_x h(t,x)$ exist and are continuous on $[0,T]\times\mathbb{R}^d$ and satisfy
$$ \sup_{0\le t\le T}\mathbb{E}_\nu\left[|h(t,B_t)| + |\dot{h}(t,B_t)| + |\nabla_x h(t,B_t)| + |\Delta_x h(t,B_t)|\right] < \infty. \tag{16.19}$$
Then the process
$$ M_t := h(t,B_t) - \int_0^t Lh(\tau, B_\tau)\,d\tau \tag{16.20}$$
is an $\{\mathcal{F}_t\}_{t\in\mathbb{R}_+}$ martingale. In particular, if $h$ also satisfies the heat equation in reverse time,
$$ Lh(t,x) := \partial_t h(t,x) + \frac{1}{2}\Delta_x h(t,x) = 0, \tag{16.21}$$
then $M_t = h(t,B_t)$ is a martingale.

Proof. Working formally for the moment,
$$\begin{aligned}
\frac{d}{d\tau}\mathbb{E}_\nu[h(\tau, B_\tau)\,|\,\mathcal{F}_s] &= \frac{d}{d\tau}[P_{\tau-s} h(\tau, \cdot)](B_s) = [P_{\tau-s}\dot{h}(\tau, \cdot)](B_s) + \frac{1}{2}[P_{\tau-s}\Delta h(\tau, \cdot)](B_s) \\
&= \mathbb{E}_\nu\left[\dot{h}(\tau, B_\tau) + \frac{1}{2}\Delta h(\tau, B_\tau)\,\Big|\,\mathcal{F}_s\right] = \mathbb{E}_\nu[Lh(\tau, B_\tau)\,|\,\mathcal{F}_s],
\end{aligned}$$
wherein we have used the Markov property in the first and third lines and the chain rule in the second line. Integrating this equation over $\tau \in [s,t]$ then shows
$$ \mathbb{E}_\nu[h(t,B_t) - h(s,B_s)\,|\,\mathcal{F}_s] = \mathbb{E}_\nu\left[\int_s^t Lh(\tau, B_\tau)\,d\tau\,\Big|\,\mathcal{F}_s\right]. \tag{16.22}$$
Using
$$ \mathbb{E}_\nu[M_t - M_s\,|\,\mathcal{F}_s] = \mathbb{E}_\nu\left[h(t,B_t) - h(s,B_s) - \int_s^t Lh(\tau, B_\tau)\,d\tau\,\Big|\,\mathcal{F}_s\right],$$
we see that Eq. (16.22) is precisely the statement that $\mathbb{E}_\nu[M_t - M_s\,|\,\mathcal{F}_s] = 0$, i.e. the assertion that $\{M_t\}_{t\ge 0}$ is a martingale. *(I would suggest omitting the more technical details to follow upon first reading.)

We now need to justify the above computations.

1. Let us first suppose there exists an $R < \infty$ such that $h(t,x) = 0$ if $|x| = \sqrt{\sum_{i=1}^d x_i^2} \ge R$. By a couple of integrations by parts, we find
$$\begin{aligned}
\frac{d}{d\tau}(P_{\tau-s}h)(\tau, x) &= \int_{\mathbb{R}^d}\frac{d}{d\tau}\left[p_{\tau-s}(x-y)\,h(\tau,y)\right]dy \\
&= \int_{\mathbb{R}^d}\left[\frac{1}{2}(\Delta p_{\tau-s})(x-y)\,h(\tau,y) + p_{\tau-s}(x-y)\,\dot{h}(\tau,y)\right]dy \\
&= \int_{\mathbb{R}^d}\left[\frac{1}{2}p_{\tau-s}(x-y)\,\Delta_y h(\tau,y) + p_{\tau-s}(x-y)\,\dot{h}(\tau,y)\right]dy \\
&= \int_{\mathbb{R}^d} p_{\tau-s}(x-y)\,Lh(\tau,y)\,dy =: (P_{\tau-s}Lh)(\tau,x), \tag{16.23}
\end{aligned}$$
where
$$ Lh(\tau,y) := \dot{h}(\tau,y) + \frac{1}{2}\Delta_y h(\tau,y). \tag{16.24}$$
Since
$$ (P_{\tau-s}h)(\tau,x) := \int_{\mathbb{R}^d} p_{\tau-s}(x-y)\,h(\tau,y)\,dy = \mathbb{E}\left[h\left(\tau, x + \sqrt{\tau-s}\, N\right)\right],$$
where $N \stackrel{d}{=} N(0, I_{d\times d})$, we see that $(P_{\tau-s}h)(\tau,x)$ (and similarly $(P_{\tau-s}Lh)(\tau,x)$) is a continuous function in $(\tau,x)$ for $\tau \ge s$. Moreover, $(P_{\tau-s}h)(\tau,x)|_{\tau=s} = h(s,x)$, and so by the fundamental theorem of calculus (using Eq. (16.23)),
$$ (P_{t-s}h)(t,x) = h(s,x) + \int_s^t (P_{\tau-s}Lh)(\tau,x)\,d\tau. \tag{16.25}$$
Hence for $A \in \mathcal{F}_s$,
$$\begin{aligned}
\mathbb{E}_\nu[h(t,B_t) : A] &= \mathbb{E}_\nu[\mathbb{E}_\nu[h(t,B_t)\,|\,\mathcal{F}_s] : A] = \mathbb{E}_\nu[(P_{t-s}h)(t, B_s) : A] \\
&= \mathbb{E}_\nu\left[h(s,B_s) + \int_s^t (P_{\tau-s}Lh)(\tau, B_s)\,d\tau : A\right] \\
&= \mathbb{E}_\nu[h(s,B_s) : A] + \int_s^t \mathbb{E}_\nu[(P_{\tau-s}Lh)(\tau, B_s) : A]\,d\tau. \tag{16.26}
\end{aligned}$$
Since
$$ (P_{\tau-s}Lh)(\tau, B_s) = \mathbb{E}_\nu[(Lh)(\tau, B_\tau)\,|\,\mathcal{F}_s],$$
we have
$$ \mathbb{E}_\nu[(P_{\tau-s}Lh)(\tau, B_s) : A] = \mathbb{E}_\nu[(Lh)(\tau, B_\tau) : A],$$
which combined with Eq. (16.26) shows
$$ \mathbb{E}_\nu[h(t,B_t) : A] = \mathbb{E}_\nu[h(s,B_s) : A] + \int_s^t \mathbb{E}_\nu[(Lh)(\tau, B_\tau) : A]\,d\tau = \mathbb{E}_\nu\left[h(s,B_s) + \int_s^t (Lh)(\tau, B_\tau)\,d\tau : A\right].$$
This proves $\{M_t\}_{t\ge 0}$ is a martingale when $h(t,x) = 0$ for $|x| \ge R$.

2. For the general case, let $\varphi \in C_c^\infty(\mathbb{R}^d, [0,1])$ be such that $\varphi = 1$ in a neighborhood of $0 \in \mathbb{R}^d$, and for $n \in \mathbb{N}$ let $\varphi_n(x) := \varphi(x/n)$. Observe that $\varphi_n \to 1$, while $\nabla\varphi_n(x) = \frac{1}{n}(\nabla\varphi)(x/n)$ and $\Delta\varphi_n(x) = \frac{1}{n^2}(\Delta\varphi)(x/n)$ both go to zero boundedly as $n\to\infty$. Applying case 1 to $h_n(t,x) = \varphi_n(x)\,h(t,x)$, we find that
$$ M_t^n := \varphi_n(B_t)\,h(t,B_t) - \int_0^t \varphi_n(B_\tau)\left[\dot{h}(\tau,B_\tau) + \frac{1}{2}\Delta h(\tau,B_\tau)\right]d\tau - \varepsilon_n(t)$$
is a martingale, where
$$ \varepsilon_n(t) := \int_0^t\left[\frac{1}{2}\Delta\varphi_n(B_\tau)\,h(\tau,B_\tau) + \nabla\varphi_n(B_\tau)\cdot\nabla h(\tau,B_\tau)\right]d\tau.$$
By DCT,
$$ \mathbb{E}_\nu|\varepsilon_n(t)| \le \int_0^t \frac{1}{2}\,\mathbb{E}_\nu|\Delta\varphi_n(B_\tau)\,h(\tau,B_\tau)|\,d\tau + \int_0^t \mathbb{E}_\nu[|\nabla\varphi_n(B_\tau)|\,|\nabla h(\tau,B_\tau)|]\,d\tau \to 0 \text{ as } n\to\infty,$$
and similarly,
$$ \mathbb{E}_\nu|\varphi_n(B_t)\,h(t,B_t) - h(t,B_t)| \to 0, \quad \int_0^t \mathbb{E}_\nu\left|\varphi_n(B_\tau)\,\dot{h}(\tau,B_\tau) - \dot{h}(\tau,B_\tau)\right| d\tau \to 0,$$
and
$$ \frac{1}{2}\int_0^t \mathbb{E}_\nu|\varphi_n(B_\tau)\,\Delta h(\tau,B_\tau) - \Delta h(\tau,B_\tau)|\,d\tau \to 0$$
as $n\to\infty$. From these comments one easily sees that $\mathbb{E}_\nu|M_t^n - M_t| \to 0$ as $n\to\infty$, which is sufficient to show $\{M_t\}_{t\in[0,T]}$ is still a martingale.

Rewriting Eq. (16.20), we have
$$ h(t,B_t) = M_t + \int_0^t\left[\dot{h}(\tau,B_\tau) + \frac{1}{2}\Delta_x h(\tau,B_\tau)\right]d\tau.$$

Page: 184 job: 285notes macro: svmonob.cls date/time: 3-Jun-2016/13:04

Page 193: Math 285 Stochastic Processes Spring 2016bdriver/285_S2016/Lecture Notes/285notes.pdfBruce K. Driver Math 285 Stochastic Processes Spring 2016 June 3, 2016 File:285notes.tex

16.5 Optional Sampling Results 185

Example 16.32. Considering the simplest case where d = 1 and h (t, x) = f (x) ,Theorem 16.31 asserts that

f (Bt) = Mt +1

2

∫ t

0

f ′′ (Bt) dt

where Mt is a martingale provided,

sup0≤t≤T

Eν [|f (Bt)|+ |f ′ (Bt)|+ |f ′′ (Bt)|] <∞.

Hence;

1. If f ′′ ≥ 0 (i.e. f is subharmonic3) then f (Bt) is a submartingale.2. If f ′′ ≤ 0 (i.e. f is super harmonic) then f (Bt) is a supermartingale.3. If f ′′ = 0 (i.e. f is harmonic) then f (Bt) is a martingale.

More generally, if f : Rd → R is a C2 function, then f (Bt) is a “local”submartingale (supermartingale) iff ∆f ≥ 0 (∆f ≤ 0) .

Exercise 16.4 (h – transforms of Bt). Let Btt∈R+be a d – dimensional

Brownian motion. Show the following processes are martingales;

1. Mt = u (Bt) where u : Rd → R is a Harmonic function, ∆u = 0, such thatsupt≤T E [|u (Bt)|+ |∇u (Bt)|] <∞ for all T <∞.

2. Mt = λ ·Bt for all λ ∈ Rd.3. Mt = e−λYt cos (λXt) and Mt = e−λYt sin (λXt) where Bt = (Xt, Yt) is a

two dimensional Brownian motion and λ ∈ R.4. Mt = |Bt|2 − d · t.5. Mt = (a ·Bt) (b ·Bt)− (a · b) t for all a, b ∈ Rd.6. Mt := eλ·Bt−|λ|

2t/2 for any λ ∈ Rd.

Corollary 16.33 (Compensators). Suppose h : [0,∞) × Rd → R is a C2 –function such that

Lh = ∂th (t, x) +1

2∆xh (t, x) = 0

and both h and h2 satisfy the hypothesis of Theorem 16.31. If we let Mt denotethe martingale, Mt := h(t, Bt) and

At :=

∫ t

0

|(∇xh) (τ,Bτ )|2 dτ,

3 A function f is subharmonic if for any compact region, U, f ≤ g on U where g isthe harmonic function on U such that g = f on ∂U. This turns out to be equivalentto assuming ∆f ≥ 0.

then Nt := M2t − At is a martingale. Thus the submartingale has the “Doob”

decomposition,M2t = Nt +At.

and we call the increasing process, At, the compensator to M2t .

Proof. We need only apply Theorem 16.31 to h2. In order to do this weneed to compute Lh2. Since

∂2i h

2 = ∂i (2h · ∂ih) = 2 (∂ih)2

+ 2h∂ih

we see that 12∆h

2 = h∆h+ |∇h|2 . Therefore,

Lh2 = 2hh+ h∆h+ |∇h|2 = |∇h|2

and hence the proposition follows form Eq. (16.20) with h replaced by h2.

Exercise 16.5 (Compensators). Let Btt∈R+be a d – dimensional Brown-

ian motion. Find the compensator, At, for each of the following square integrablemartingales.

1. Mt = u (Bt) where u : Rd → R is a harmonic function, ∆u = 0, such that

supt≤T

E[∣∣u2 (Bt)

∣∣+ |∇u (Bt)|2]<∞ for all T <∞.

Verify the hypothesis of Corollary 16.33 to show

At =

∫ t

0

|∇u (Bτ )|2 dτ.

2. Mt = λ ·Bt for all λ ∈ Rd.3. Mt = |Bt|2 − d · t.4. Mt := eλ·Bt−|λ|

2t/2 for any λ ∈ Rd.

Fact 16.34 (Levy’s criteria for BM) P. Levy gives another description ofBrownian motion, namely a continuous stochastic process, Xtt≥0 , is a Brow-

nian motion if X0 = 0, Xtt≥0 is a martingale, andX2t − t

t≥0

is a martin-

gale.

16.5 Optional Sampling Results

Similar to Definition 15.40, we have the informal definition of a stopping timein the Brownian motion setting.

Page: 185 job: 285notes macro: svmonob.cls date/time: 3-Jun-2016/13:04

Page 194: Math 285 Stochastic Processes Spring 2016bdriver/285_S2016/Lecture Notes/285notes.pdfBruce K. Driver Math 285 Stochastic Processes Spring 2016 June 3, 2016 File:285notes.tex

186 16 Brownian Motion

Definition 16.35 (Informal). A stopping time, T, for Btt≥0 , is a randomvariable with the property that the event T ≤ t is determined from the knowl-edge of Bs : 0 ≤ s ≤ t . Alternatively put, for each t ≥ 0, there is a functional,ft, such that

1T≤t = ft (Bs : 0 ≤ s ≤ t) .

In this continuous context the optional sampling theorem is as follows.

Theorem 16.36 (Continuous time optional sampling theorem). IfMtt≥0 is a FBt – martingale and σ and τ are two stopping times suchthat there τ ≤ K for some non-random finite constant K < ∞, then Mτ ∈L1 (Ω,Fτ , P ) , Mσ∧τ ∈ L1 (Ω,Fσ∧τ , P ) and

Mσ∧τ = E [Mτ |Fσ] . (16.27)

Remark 16.37. In applying the optional sampling Theorem 16.36, observe thatif τ is any optional time and K < ∞ is a constant, then τ ∧K is a boundedoptional time. Indeed,

τ ∧K < t =

τ < t ∈ Ft if t ≤ KΩ ∈ Ft if t > K.

Therefore if σ and τ are any optional times with σ ≤ τ, we may apply Theorem16.36 with σ and τ replaced by σ ∧K and τ ∧K. One may then try to pass tolimit as K ↑ ∞ in the resulting identity.

For the next few results let d = 1 and for any y ∈ R let

τy := inf t > 0 : Bt = y . (16.28)

Let us now fix a, b ∈ R so that −∞ < a < 0 < b < ∞ and set τ := τa ∧ τb bethe first exit time the Brownian motion from the interval, (a, b) . By the law oflarge numbers for Brownian motions, see Corollary 16.43 below, we may deducethat P0 (τ <∞) = 1. This may also be deduced from Lemma 16.56 – but thisis using a rather large hammer to conclude a simple result. We will give anindependent proof here as well.

Proposition 16.38. With the notation above we have,

Eτ = E [τa ∧ τb] = −ab = |a| b <∞, (16.29)

P (τa ∧ τb =∞) = 0, and (16.30)

Eτb =∞. (16.31)

Proof. Let τ := τa ∧ τb and u (x) = (x− a) (b− x) , see Figure 16.1. ThenLu = 1

2u′′ = −1 and so by Theorem 16.31 (or by Ito’s lemma in Theorem 17.5

below) we know that

Fig. 16.1. A plot of f for a = −2 and b = 5.

Mt = u (Bt)−∫ t

0

Lu (Bs) ds = u (Bt) + t

is a martingale. So by the optional sampling Theorem 16.36 we may concludethat

Eu (Bt∧τ ) + E [τ ∧ t] = EMτ∧t = EM0 = u (B0) = −ab. (16.32)

As u ≥ 0 on [a, b] and Bt∧τ ∈ [a, b] , it follows that

|a| b = E [τ ∧ t] + Eu (Bt∧τ ) ≥ E [τ ∧ t] .

We may now use the monotone convergence theorem (see Section 1.1) to con-clude,

Eτ = limt↑∞

E [τ ∧ t] ≤ |a| b <∞.

Next using the facts; 1) τ is integrable and t∧τ → τ as t ↑ ∞ 2) u is boundedon [a, b] and u (Bt∧τ ) → u (Bτ ) = 0 as t ↑ ∞, we may apply the dominatedconvergence theorem to pass to the limit in Eq. (16.32) to find,

−ab = limt↑∞

[E [τ ∧ t] + Eu (Bt∧τ )] = E [τ ] + E [u (Bτ )] = Eτ

which proves Eqs. (16.29) and (16.30). Lastly using the monotone convergencetheorem we find,

Eτb = lima↓−∞

E [τa ∧ τb] = lima↓−∞

[−ab] =∞

which proves Eq. (16.31).Combining Proposition 16.38 with the next proposition shows that one di-

mensional Brownian motion hits every point in R but the expected hitting timeof hitting any point other than 0 is infinite. As usual, using the Markov propertyit follows that one dimensional Brownian motion hits every point in R infinitelyoften. So in summary one dimensional Brownian motion is null recurrent justthe same as for simple random walks on Z.

Page: 186 job: 285notes macro: svmonob.cls date/time: 3-Jun-2016/13:04

Page 195: Math 285 Stochastic Processes Spring 2016bdriver/285_S2016/Lecture Notes/285notes.pdfBruce K. Driver Math 285 Stochastic Processes Spring 2016 June 3, 2016 File:285notes.tex

16.5 Optional Sampling Results 187

Proposition 16.39. Continuing the notation above we have,

P (τb < τa) =−ab− a

and P (τb <∞) = 1. (16.33)

Proof. Now let τ := τa∧τb and u (x) = x. Since Lu = 12u′′ ≡ 0, by Theorem

16.31 (or by Ito’s lemma in Theorem 17.5 below) we know that Mt = u (Bt) =Bt is a martingale and hence, by the optional sampling Theorem 16.36,

0 = EM0 = EMt∧τ = E [Bt∧τ ] . (16.34)

Since Bt∧τ ∈ [a, b] and P (τ =∞) = 0, Bt∧τ → Bτ ∈ a, b boundedly as t ↑ ∞.Therefore using the dominated convergence theorem to pass to the limit in Eq.(16.34), we find

0 = E [Bτ ] = aP (Bτ = a) + bP (Bτ = b)

= a · P (τa < τb) + bP (τb < τa)

= a · [1− P (τa < τb)] + bP (τb < τa) ,

which easily proves the first equality in Eq. (16.33). Since τb < τa ⊂ τb <∞for all a, it follows that

P (τb <∞) ≥ −ab− a

→ 1 as a ↓ −∞

which proves the second equality in Eq. (16.33).

Exercise 16.6 (The Laplace transform of τb). Let b > 0 and τb :=inf t > 0 : Bt = b . Using (with λ > 0) the martingale,

Mt := eλBt−12λ

2t,

show using the optional sampling Theorem 16.36 that

E0

[e−µτb

]= e−b

√2µ for all µ ≥ 0. (16.35)

[Note: for 0 ≤ t ≤ τb that eλBt ≤ eλb which is useful when applying DCT.]

Remark 16.40. Equation (16.35) may also be proved using Lemma 16.56 di-rectly. Indeed, if

pt (x) :=1√2πt

e−x2/2t,

then

E0

[e−λτa

]= E0

[∫ ∞τa

λe−λtdt

]= E0

[∫ ∞0

1τa<tλe−λtdt

]= E0

[∫ ∞0

P (τa < t)λe−λtdt

]= −2

∫ ∞0

[∫ ∞a

pt (x) dx

]d

dte−λtdt.

Integrating by parts in t, shows

−2

∫ ∞0

[∫ ∞a

pt (x) dx

]d

dte−λtdt = 2

∫ ∞0

[∫ ∞a

d

dtpt (x) dx

]e−λtdt

= 2

∫ ∞0

[∫ ∞a

1

2p′′t (x) dx

]e−λtdt

= −∫ ∞

0

p′t (a) e−λtdt

= −∫ ∞

0

1√2πt

−ate−a

2/2te−λtdt

=a√2π

∫ ∞0

t−3/2e−a2/2te−λtdt

where the last integral may be evaluated to be e−a√

2λ using the Laplace trans-form function of Mathematica.

Alternatively, consider∫ ∞0

pt (x) e−λtdt =

∫ ∞0

1√2πt

e−x2/2te−λtdt =

1√2π

∫ ∞0

1√t

exp

(−x

2

2t− λt

)dt

which may be evaluated using the theory of the Fourier transform (or charac-teristic functions if you prefer) as follows,

√2π

2me−m|x| =

∫R

(|ξ|2 +m2

)−1

eiξxdξ

=

∫R

[∫ ∞0

e−λ(|ξ|2+m2)dλ

]eiξxdξ

=

∫ ∞0

[e−λm

2

∫R

dξe−λ|ξ|2

eiξx]

=

∫ ∞0

e−λm2

(2λ)−1/2e−14λx

2

dλ,

where dξ := (2π)−1/2

dξ. Now make appropriate change of variables and carryout the remaining x integral to arrive at the result.

Page: 187 job: 285notes macro: svmonob.cls date/time: 3-Jun-2016/13:04

Page 196: Math 285 Stochastic Processes Spring 2016bdriver/285_S2016/Lecture Notes/285notes.pdfBruce K. Driver Math 285 Stochastic Processes Spring 2016 June 3, 2016 File:285notes.tex

188 16 Brownian Motion

Remark 16.41 (A representation formula). Suppose that R is a “nice” boundedsub-domain of Rd, u ∈ C2

(R,C

), u = g on ∂R and 1

2∆u = −f on R, then forx ∈ R we have,

u (x) = E[g (x+Bτ ) +

∫ τ

0

f (x+Bs) ds

], (16.36)

where τ = τx := inf t ≥ 0 : x+Bt /∈ R is the first exit time from R. Here aresome of the proof ideas.

1. First note that Eτ <∞ using the one dimensional results above.2. Extend the functions u to Rd so that u has compact support in which caseMt := u (x+Bt)− 1

2

∫ t0∆u (x+Bs) ds is a martingale.

3. Apply the optional sampling theorem to learn

Mτt = u (x+Bt∧τ )− 1

2

∫ t∧τ

0

∆u (x+Bs) ds

= u (x+Bt∧τ ) +

∫ t∧τ

0

f (x+Bs) ds

is a martingale and in particular

u (x) = EMτ0 = EMτ

t

= E[u (x+Bτ∧t) +

∫ τ∧t

0

f (x+Bs) ds

].

4. Using the dominated convergence theorem, pass to the limit as t ↑ ∞ in theprevious identity to arrive at Eq. (16.36).

16.6 Scaling Properties of B. M.

Theorem 16.42 (Transformations preserving B. M.). Let Btt≥0 be aBrownian motion and Bt := σ (Bs : s ≤ t) . Then;

1. bt = −Bt is again a Brownian motion.2. if c > 0 and bt := c−1/2Bct is again a Brownian motion.3. bt := tB1/t for t > 0 and b0 = 0 is a Brownian motion. In particular,

limt↓0 tB1/t = 0 a.s.4. for all T ∈ (0,∞) , bt := Bt+T −BT for t ≥ 0 is again a Brownian motion

which is independent of BT .5. for all T ∈ (0,∞) , bt := BT−t − BT for 0 ≤ t ≤ T is again a Brownian

motion on [0, T ] .

Proof. It is clear that in each of the four cases above btt≥0 is still aGaussian process. Hence to finish the proof it suffices to verify, E [btbs] = s ∧ twhich is routine in all cases. Let us work out item 3. in detail to illustrate themethod. For 0 < s < t,

E [bsbt] = stE [Bs−1Bt−1 ] = st(s−1 ∧ t−1

)= st · t−1 = s.

Notice that t→ bt is continuous for t > 0, so to finish the proof we must showthat limt↓0 bt = 0 a.s. However, this follows from “Kolmogorov’s continuitycriteria” which we do not cover here.

Corollary 16.43 (B. M. Law of Large Numbers). Suppose Btt≥0 is aBrownian motion, then almost surely, for each β > 1/2,

lim supt→∞

|Bt|tβ

=

0 if β > 1/2∞ if β ∈ (0, 1/2) .

(16.37)

Proof. We omit the full proof here other than to show that we may replacethe limiting behavior at ∞ by statements about the limiting behavior as t ↓ 0because bt := tB1/t for t > 0 and b0 = 0 is a Brownian motion.

16.7 Random Walks to Brownian Motion

Let Xj∞j=1 be a sequence of independent Bernoulli random variables with

P (Xj = ±1) = 12 and let W0 = 0, Wn = X1 + · · ·+Xn be the random walk on

Z. For each ε > 0, we would like to consider Wn at n = t/ε. We can not expectWt/ε to have a limit as ε → 0 without further scaling. To see what scaling isneeded, recall that

Var (X1) = EX21 =

1

212 +

1

2(−1)

2= 1

and therefore, Var (Wn) = n. Thus we have

Var(Wt/ε

)= t/ε

and hence to get a limit we should scale Wt/ε by√ε. These considerations

motivate the following theorem.

Theorem 16.44. For all ε > 0, let Bε (t)t≥0 be the continuous time process,defined as follows:

1. If t = nε for some n ∈ N0, let Bε (nε) :=√εWn and

Page: 188 job: 285notes macro: svmonob.cls date/time: 3-Jun-2016/13:04

Page 197: Math 285 Stochastic Processes Spring 2016bdriver/285_S2016/Lecture Notes/285notes.pdfBruce K. Driver Math 285 Stochastic Processes Spring 2016 June 3, 2016 File:285notes.tex

16.8 Path Regularity Properties of BM 189

2. if nε < t < (n+ 1) ε, let Bε (t) be given by

Bε (t) = Bε (nε) +t− nεε

(Bε ((n+ 1) ε)−Bε (nε))

=√εWn +

t− nεε

(√εWn+1 −

√εWn

)=√εWn +

t− nεε

√εXn+1,

i.e. Bε (t) is the linear interpolation between (nε,√εWn) and

((n+ 1) ε,√εWn+1) , see Figure 16.2.

Fig. 16.2. The four graphs are constructed (in Excel) from a single realization ofa random walk. Each graph corresponds to a different scaling parameter, namely,ε ∈

24, 28, 212, 214

. It is clear from these pictures that Bε (t) is not converging to

B (t) for each realization. The convergence is only in law.

Then Bε =⇒ B (“weak convergence”) as ε ↓ 0, where B is a continuousrandom process which we call a Brownian motion.

Remark 16.45. Theorem 16.44 shows that Brownian motion is the result ofcombining additively the effects of many small random steps.

We may allow for more general random walk limits which is the content ofDonsker’s invariance principle or the so called functional central limit theorem.

The setup is to start with a “random walk,” Sn := X1+· · ·+Xn where Xn∞n=1

are i.i.d. random variables with zero mean and variance one. We then define foreach n ∈ N the following continuous process,

Bn (t) :=1√n

(S[nt] + (nt− [nt])X[nt]+1

)(16.38)

where for τ ∈ R+, [τ ] is the integer part of τ, i.e. the nearest integer to τ whichis no greater than τ.

Theorem 16.46 (Donsker’s invariance principle). Let Ω, Bn, and B beas above. Then for Bn =⇒ B, i.e.

limn→∞

E [F (Bn)] = E [F (B)]

for all bounded continuous functions F : Ω → R.

For the proof of this theorem we refer the reader to [11], [18], [10], or [15].

16.8 Path Regularity Properties of BM

For the rest of this chapter we will assume that Btt≥0 is a Brownian motion

on some probability space,(Ω,F , Ftt≥0 ,P

)and Ft := σ (Bs : s ≤ t) .

Notation 16.47 (Partitions) Given Π := 0 = t0 < t1 < · · · < tn = T , apartition of [0, T ] , let

∆iB := Bti −Bti−1 , and ∆it := ti − ti−1

for all i = 1, 2, . . . , n. Further let mesh(Π) := maxi |∆it| denote the mesh ofthe partition, Π.

Exercise 16.7 (Quadratic Variation). Let

Πm :=

0 = tm0 < tm1 < · · · < tmnm = T

be a sequence of partitions such that mesh (Πm)→ 0 as m→∞. Further let

Qm :=

nm∑i=1

(∆iB)2

:=

nm∑i=1

(Btm

i−Btm

i−1

)2

. (16.39)

Showlimm→∞

E[(Qm − T )

2]

= 0.

Page: 189 job: 285notes macro: svmonob.cls date/time: 3-Jun-2016/13:04

Page 198: Math 285 Stochastic Processes Spring 2016bdriver/285_S2016/Lecture Notes/285notes.pdfBruce K. Driver Math 285 Stochastic Processes Spring 2016 June 3, 2016 File:285notes.tex

190 16 Brownian Motion

Also show that

∞∑m=1

mesh (Πm) <∞ =⇒ E

[ ∞∑m=1

(Qm − T )2

]<∞

=⇒ limm→∞

Qm = T a.s.

[These results are often abbreviated by the writing, dB2t = dt.]

Hints: it is useful to observe; 1)

Qm − T =

nm∑i=1

[(∆iB)

2 −∆it]

and 2) using ∆iBd=√∆itN (0, 1) one easily shows there is a constant, c < ∞

such that

E[(∆iB)

2 −∆it]2

= c (∆it)2. (16.40)

Proposition 16.48. Suppose that Πm∞m=1 is a sequence of partitions of [0, T ]such that Πm ⊂ Πm+1 for all m and mesh (Πm)→ 0 as m→∞. Then Qm → Ta.s. where Qm is defined as in Eq. (16.39).

Corollary 16.49 (Roughness of Brownian Paths). A Brownian motion,Btt≥0 , is not almost surely α – Holder continuous for any α > 1/2.

Proof. According to Exercise 16.7, we may choose partition, Πm, such thatmesh (Πm) → 0 and Qm → T a.s. If B were α – Holder continuous for someα > 1/2, then

Qm =

nm∑i=1

(∆mi B)

2 ≤ Cnm∑i=1

(∆mi t)

2α ≤ C max(

[∆it]2α−1

) nm∑i=1

∆mi t

≤ C [mesh (Πm)]2α−1

T → 0 as m→∞

which contradicts the fact that Qm → T as m→∞.

16.9 The Strong Markov Property of Brownian Motion

Notation 16.50 If ν is a probability measure on Rd, we let Pν indicate thatwe are starting our Brownian motion so that B0 distributed as ν. For Brownianmotion, the expectation relative to Pν can be defined to be

Eν[F(B(·)

)]=

∫Rd

E0

[F(x+B(·)

)]dν (x) .

Theorem 16.51 (Strong-Markov Property). Let ν be a probability measureon Rd, τ be an optional time with Pν (τ <∞) > 0, and let

bt := Bt+τ −Bτ on τ <∞ .

Then, conditioned on τ <∞ , b is a Brownian motion starting at 0 ∈ Rdwhich is independent of F+

τ . To be more precise we are claiming, for all boundedmeasurable functions F : Ω → Rd and all A ∈ F+

τ , that

Eν [F (b) |τ <∞] = E0 [F ] (16.41)

and

Eν [F (b) 1A|τ <∞] = Eν [F (b) |τ <∞] · Eν [1A|τ <∞] . (16.42)

Corollary 16.52. Let ν be a probability measure on(Rd,BRd

), T > 0,and let

bt := Bt+T −BT for t ≥ 0.

Then b is a Brownian motion starting at 0 ∈ Rd which is independent of FTrelative to the probability measure, Pν , describing Brownian motion started withB0 distributed as ν.

Proof. This is a direct consequence of Theorem 16.51 with τ = T or it canbe proved directly using the Markov property alone.

Proposition 16.53 (Stitching Lemma). Suppose that Btt≥0 and Xtt≥0

are Brownian motions, τ is an optional time, and Xtt≥0 is independent of

F+τ . Then Bt := Bt∧τ +X(t−τ)+

is another Brownian motion.

For the next couple of results we will follow [10, Chapter 13] (also see [11,Section 2.8]) where more results along this line may be found.

Theorem 16.54 (Reflection Principle). Let τ be an optional time andBtt≥0 be a Brownian motion. Then the “reflected” process (see Figure 16.3),

Bt := Bt∧τ − (Bt −Bt∧τ ) =

Bt if t ≤ τ

Bτ − (Bt −Bτ ) if t > τ,

is again a Brownian motion.

Proof. Let T > 0 be given and set bt := Bt+τ∧T − Bτ∧T . Then applying

Proposition 16.53 with X = b and τ replaced by τ ∧ T we learn that B|[0,T ]d=

B|[0,T ]. Again as T > 0 was arbitrary we may conclude that Bd= B.

Page: 190 job: 285notes macro: svmonob.cls date/time: 3-Jun-2016/13:04

Page 199: Math 285 Stochastic Processes Spring 2016bdriver/285_S2016/Lecture Notes/285notes.pdfBruce K. Driver Math 285 Stochastic Processes Spring 2016 June 3, 2016 File:285notes.tex

16.9 The Strong Markov Property of Brownian Motion 191

Lemma 16.55. If d = 1, then

P0 (Bt > a) ≤ min

(√t

2πa2e−a

2/2t,1

2e−a

2/2t

)

for all t > 0 and a > 0.

Proof. This follows from Lemma 19.7 and the fact that Btd=√tN, where

N is a standard normal random variable.

Lemma 16.56 (Running maximum). If Bt is a 1 - dimensional Brownianmotion and z > 0 and y ≥ 0, then

P(

maxs≤t

Bs ≥ z, Bt < z − y)

= P(Bt > z + y) (16.43)

and

P(

maxs≤t

Bs ≥ z)

= 2P (Bt > z) = P (|Bt| > z) . (16.44)

In particular we have maxs≤tBsd= |Bt| .

z + y

z

z − y

Bt

Bt

Fig. 16.3. A Brownian path B along with its reflection about the first time, τ, thatBt hits the level z.

Proof. Intuitive proof of Eq. (16.44). For z > 0 we havemaxs≤t

Bs ≥ z

= τz ≤ t = τz ≤ t, Bt < z ∪ τz ≤ t, Bt ≥ z .

= τz ≤ t, Bt < z ∪ Bt ≥ z

In order to compute these probabilities we make use of the intuitively “clear”fact that the events

τz ≤ t, Bt < z and τz ≤ t, Bt > z = Bt > z

are equally likely4, see Figure 16.3. Assuming this assertion and usingP (Bt = z) = 0 we find,

P(

maxs≤t

Bs ≥ z)

= P (τz ≤ t, Bt < z) + P (Bt > z)

= 2P (Bt > z) = 2P (Bt ≥ z) .

Formal proof of Eq. (16.44). Let z > 0 be given. Sincemaxs≤tBs ≥ z, Bt ≥ z = Bt ≥ z and P (Bt ≥ z) = P (Bt > z) wefind

P(

maxs≤t

Bs ≥ z)

= P(

maxs≤t

Bs ≥ z, Bt ≥ z)

+ P(

maxs≤t

Bs ≥ z, Bt < z

)= P (Bt ≥ z) + P

(maxs≤t

Bs ≥ z, Bt < z

)= P (Bt > z) + P

(maxs≤t

Bs ≥ z, Bt < z

).

So to finish the proof of Eq. (16.44) it suffices verify Eq. (16.43) for y = 0.Let τ := τz = inf t > 0 : Bt = z and let Bt := Bτt − b(t−τ)+

where bt :=

(Bt∧τ −Bτ ) 1τ<∞, see Figure 16.3. (By Corollary 16.43 we actually know thatτz < ∞ a.s. but we will not use this fact in the proof.) Observe from Figure16.3 that

τ := inft > 0 : Bt = z

= τ,

Bt < z =Bt > z

on τ ≤ t and (16.45)

maxs≤t

Bs ≥ z

= τ ≤ t = τ ≤ t .

Hence we have

P(

maxs≤t

Bs ≥ z, Bt < z

)= P (τ ≤ t, Bt < z) = P

(τ ≤ t, Bt > z

)= P (τ ≤ t, Bt > z) = P (Bt > z) (16.46)

wherein we have used Law (B·, τ) = Law(B·, τ

)in the third equality and

Bt > z ⊂ τ ≤ t for the last equality.

4 Think in terms of a random walker starting at z.

Page: 191 job: 285notes macro: svmonob.cls date/time: 3-Jun-2016/13:04

Page 200: Math 285 Stochastic Processes Spring 2016bdriver/285_S2016/Lecture Notes/285notes.pdfBruce K. Driver Math 285 Stochastic Processes Spring 2016 June 3, 2016 File:285notes.tex

192 16 Brownian Motion

Proof of Eq. (16.43) for general y ≥ 0. We simply follow the above proofexcept we use the identity,

Bt < z − y =Bt > z + y

on τ ≤ t ,

in place of Eq. (16.45). Since Bt > z + y ⊂ τ ≤ t and working as above wefind

P(

maxs≤t

Bs ≥ z, Bt < z − y)

= P (τ ≤ t, Bt < z − y) = P(τ = τ ≤ t, Bt > z + y

)= P (τ ≤ t, Bt > z + y) = P (Bt > z + y) .

Remark 16.57. Notice that

P(

maxs≤t

Bs ≥ z)

= P (|Bt| > z) = P(√

t |B1| > z)

= P(|B1| >

z√t

)→ 1 as t→∞

and therefore it follows that sups<∞Bs =∞ a.s. In particular this shows thatτz <∞ a.s. for all z ≥ 0 which we have already seen in Corollary 16.43.

Corollary 16.58. Suppose now that T = inf t > 0 : |Bt| = a , i.e. the firsttime Bt leaves the strip (−a, a). Then

P0(T < t) ≤ 4P0(Bt > a) =4√2πt

∫ ∞a

e−x2/2tdx

≤ min

(√8t

πa2e−a

2/2t, 1

). (16.47)

Notice that P0(T < t) = P0(B∗t ≥ a) where B∗t = max |Bτ | : τ ≤ t . So Eq.(16.47) may be rewritten as

P0(B∗t ≥ a) ≤ 4P0(Bt > a) ≤ min

(√8t

πa2e−a

2/2t, 1

)≤ 2e−a

2/2t. (16.48)

Proof. By definition T = Ta ∧ T−a so that T < t = Ta < t ∪ T−a < tand therefore

P0(T < t) ≤ P0 (Ta < t) + P0 (T−a < t)

= 2P0(Ta < t) = 4P0(Bt > a) =4√2πt

∫ ∞a

e−x2/2tdx

≤ 4√2πt

∫ ∞a

x

ae−x

2/2tdx =4√2πt

(− tae−x

2/2t

)∣∣∣∣∞a

=

√8t

πa2e−a

2/2t.

This proves everything but the very last inequality in Eq. (16.48). To prove thisinequality first observe the elementary calculus inequality:

min

(4√2πy

e−y2/2, 1

)≤ 2e−y

2/2. (16.49)

Indeed Eq. (16.49) holds 4√2πy≤ 2, i.e. if y ≥ y0 := 2/

√2π. The fact that Eq.

(16.49) holds for y ≤ y0 follows from the following trivial inequality

1 ≤ 1.4552 ∼= 2e−1π = e−y

20/2.

Finally letting y = a/√t in Eq. (16.49) gives the last inequality in Eq. (16.48).

Corollary 16.59 (Fernique’s Theorem). For all λ > 0, then

E[eλ[maxs≤t Bs]

2]=

1√1−2λt

if λ < 12t

∞ if λ ≥ 12t

. (16.50)

Moreover,

E[eλ‖B·‖

2∞,t

]<∞ iff λ <

1

2t. (16.51)

Proof. From Lemma 16.56,

E[eλ[maxs≤t Bs]

2]= Eeλ|Bt|

2

= Eeλt|B1|2

=1√2π

∫Reλtx

2

e−12x

2

dx

=

1√1−2λt

if λ < 12t

∞ if λ ≥ 12t

.

By Corollary 16.58 we can show E[eλ‖B·‖

2∞,t

]<∞ if λ < 1

2t while if λ ≥ 12t we

will have,

E[eλ‖B·‖

2∞,t

]≥ E

[eλ[maxs≤t Bs]

2]=∞.

Alternative weak bound. For a cruder estimate, notice that

‖B·‖∞,t := maxs≤t|Bs| =

[maxs≤t

Bs

]∨[maxs≤t

[−Bs]]

and hence

‖B·‖2∞,t =

[maxs≤t

Bs

]2

∨[maxs≤t

[−Bs]]2

≤[maxs≤t

Bs

]2

+

[maxs≤t

[−Bs]]2

.

Page: 192 job: 285notes macro: svmonob.cls date/time: 3-Jun-2016/13:04

Page 201: Math 285 Stochastic Processes Spring 2016bdriver/285_S2016/Lecture Notes/285notes.pdfBruce K. Driver Math 285 Stochastic Processes Spring 2016 June 3, 2016 File:285notes.tex

16.10 Dirichelt Problem and Brownian Motion 193

Therefore using −Bss≥0 is still a Brownian motion along with the Cauchy-Schwarz’s inequality gives,

E[eλ‖B·‖

2∞,t

]≤ E

[eλ[maxs≤t Bs]

2

· eλ[maxs≤t(−Bs)]2]

≤√E[e2λ[maxs≤t Bs]

2]· E[·eλ[maxs≤t(−Bs)]

2]= E

[e2λ[maxs≤t Bs]

2]<∞ if λ <

1

4t.

16.10 Dirichelt Problem and Brownian Motion

Theorem 16.60 (The Dirichlet problem). Suppose D is an open subset ofRd and τ := inf t ≥ 0 : Bt ∈ Dc is the first exit time form D. Given a boundedmeasurable function, f : bd(D) → R, let u : D → R be defined by (see Figure16.4),

u (x) := Ex [f (Bτ ) : τ <∞] for x ∈ D.

Then u ∈ C∞ (D) and ∆u = 0 on D, i.e. u is a harmonic function.

Fig. 16.4. Brownian motion starting at x ∈ D and exiting on the boundary of D atBτ .

Proof. (Sketch.) Let x ∈ D and r > 0 be such that B (x, r) ⊂ D and letσ := inf t ≥ 0 : Bt /∈ B (x, r) as in Figure 16.5. Setting F := f (Bτ ) 1τ<∞,we see that F θσ = F on σ <∞ and that σ <∞ , Px - a.s. on τ <∞ .Moreover, by either Corollary 16.43 or by Lemma 16.56 (see Remark 16.57), weknow that σ <∞ , Px – a.s. Therefore by the strong Markov property,

Fig. 16.5. Brownian motion starting at x ∈ D and exiting the boundary of B (x, r)at Bσ before exiting on the boundary of D at Bτ .

u (x) = Ex [F ] = Ex [F : σ <∞] = Ex [F θσ : σ <∞]

= Ex [EBσF : σ <∞] = Ex [u (Bσ)] .

Using the rotation invariance of Brownian motion, we may conclude that

Ex [u (Bσ)] =1

ρ (bd(B (x, r)))

∫bd(B(x,r))

u (y) dρ (y)

where ρ denotes surface measure on bd(B (x, r)). This shows that u (x) satisfiesthe mean value property, i.e. u (x) is equal to its average about in sphere cen-tered at x which is contained in D. It is now a well known that this property,see for example [11, Proposition 4.2.5 on p. 242], that this implies u ∈ C∞ (D)and that ∆u = 0.

When the boundary ofD is sufficiently regular and f is continuous on bd(D),it can be shown that, for x ∈ bd(D), that u (y) → f (x) as y ∈ D tends to x.For more details in this direction, see [1], [2], [11, Section 4.2], [8], and [7].

Page: 193 job: 285notes macro: svmonob.cls date/time: 3-Jun-2016/13:04

Page 202: Math 285 Stochastic Processes Spring 2016bdriver/285_S2016/Lecture Notes/285notes.pdfBruce K. Driver Math 285 Stochastic Processes Spring 2016 June 3, 2016 File:285notes.tex
Page 203: Math 285 Stochastic Processes Spring 2016bdriver/285_S2016/Lecture Notes/285notes.pdfBruce K. Driver Math 285 Stochastic Processes Spring 2016 June 3, 2016 File:285notes.tex

17

A short introduction to Ito’s calculus

17.1 A short introduction to Ito’s calculus

Definition 17.1. The Ito integral of an adapted process1 ftt≥0 , is defined by∫ T

0

fdB = lim|Π|→0

n∑i=1

fti−1

(Bti −Bti−1

)(17.1)

when the limit exists. Here Π denotes a partition of [0, T ] so that Π =0 = t0 < t1 < · · · < tn = T and

|Π| = max1≤i≤n

∆it where ∆it = ti − ti−1. (17.2)

Proposition 17.2. Keeping the notation in Definition 17.1 and further assumeEf2

t <∞ for all t. Then we have,

E

[n∑i=1

fti−1

(Bti −Bti−1

)]= 0

and

E

[n∑i=1

fti−1

(Bti −Bti−1

)]2

= En∑i=1

f2ti−1

(ti − ti−1) .

Proof. Since(Bti −Bti−1

)is independent of fti−1

we have,

E

[n∑i=1

fti−1

(Bti −Bti−1

)]=

n∑i=1

Efti−1E(Bti −Bti−1

)=

n∑i=1

Efti−1· 0 = 0.

For the second assertion, we write,

1 To say f is adapted means that for each t ≥ 0, ft should only depend on Bss≤t ,i.e. ft = Ft

(Bss≤t

).

[n∑i=1

fti−1

(Bti −Bti−1

)]2

=

n∑i,j=1

ftj−1

(Btj −Btj−1

)fti−1

(Bti −Bti−1

).

If j < i, then ftj−1

(Btj −Btj−1

)fti−1

is independent of(Bti −Bti−1

)and

therefore,

E[ftj−1

(Btj −Btj−1

)fti−1

(Bti −Bti−1

)]= E

[ftj−1

(Btj −Btj−1

)fti−1

]· E(Bti −Bti−1

)= 0.

Similarly, if i < j,

E[ftj−1

(Btj −Btj−1

)fti−1

(Bti −Bti−1

)]= 0.

Therefore,

E

[n∑i=1

fti−1

(Bti −Bti−1

)]2

=

n∑i,j=1

E[ftj−1

(Btj −Btj−1

)fti−1

(Bti −Bti−1

)]=

n∑i=1

E[fti−1

(Bti −Bti−1

)fti−1

(Bti −Bti−1

)]=

n∑i=1

E[f2ti−1

(Bti −Bti−1

)2]=

n∑i=1

Ef2ti−1· E(Bti −Bti−1

)2=

n∑i=1

Ef2ti−1

(ti − ti−1)

= En∑i=1

f2ti−1

(ti − ti−1) ,

wherein the fourth equality we have used Bti −Bti−1is independent of fti−1

.This proposition motivates the following theorem which will not be proved

in full here.

Page 204: Math 285 Stochastic Processes Spring 2016bdriver/285_S2016/Lecture Notes/285notes.pdfBruce K. Driver Math 285 Stochastic Processes Spring 2016 June 3, 2016 File:285notes.tex

196 17 A short introduction to Ito’s calculus

Theorem 17.3. If ftt≥0 is an adapted process such that E∫ T

0f2t dt < ∞,

then the Ito integral,∫ T

0fdB, exists and satisfies,

E∫ T

0

fdB = 0 and

E

(∫ T

0

fdB

)2

= E∫ T

0

f2t dt.

[More generally, Mt :=∫ t

0fdB is a martingale such that M2

t −∫ t

0f2s ds is also

a martingale. ]

Proof. Let us explain the martingale property for∫ t

0fdB here. First if

ft = u · 1[a,b) (t) with 0 ≤ a < b < ∞ and u is a bounded Fa – measurablefunction, then

Mt =

∫ t

0

fdB =

∫ ∞0

1[0,t)ftdBt = u [Bt∧b −Bt∧a] .

If t ≤ a will have Mt = 0 and so we may as well assume that t ≥ a. If a ≤ s ≤ t,then

E [Mt|Fs] = E [u [Bt∧b −Bt∧a] |Fs] = uE [Bt∧b −Ba|Fs]= u (Bs∧b −Ba) = Ms

while if s < a ≤ t we have

E [Mt|Fs] = E [E [Mt|Fa] |Fs] = E [Ma|Fs] = 0 = Ms.

This shows M is a martingale in this case. In general,∫ t

0fdB is the L2 – limit of

linear combinations of such martingales and hence is again a martingale. More-over, by the continuous analogue of Doob’s maximal inequality, see Proposition13.57, we may arrange it so that t→

∫ t0fdB is still continuous.

We leave it as an exercise to the reader to show; if ft = u · 1[a,b) (t) , then

M2t −

∫ t

0

f2s ds = u2

[[Bt∧b −Bt∧a]

2 − (t ∧ b− t ∧ a)]

is a martingale.

Corollary 17.4. In particular if τ is a bounded stopping time (say τ ≤ T <∞)then

E∫ τ

0

fdB = 0 and

E(∫ τ

0

fdB

)2

= E∫ τ

0

f2t dt.

Proof. The point is that, by the definition of a stopping time, 10≤t≤τft isstill an adapted process. Therefore we have,

E∫ τ

0

fdB = E

[∫ T

0

10≤t≤τftdBt

]= 0

and

E(∫ τ

0

fdB

)2

= E

[∫ T

0

10≤t≤τftdBt

]2

= E

[∫ T

0

(10≤t≤τft)2dt

]= E

[∫ τ

0

f2t dt

].

Theorem 17.5 (Ito’s Lemma). If f (t, x) is C1 – function such that ∂2xf (t, x)

exists and is continuous, then

df (t, Bt) =

[∂tf (t, Bt) +

1

2∂2xf (t, Bt)

]dt+ ∂xf (t, Bt) dBt

= Lf (t, Bt) dt+ ∂xf (t, Bt) dBt.

More precisely,

f (T,BT ) = f (0, B0) +

∫ T

0

∂xf (t, Bt) dBt

+

∫ T

0

[∂tf (t, Bt) +

1

2∂2xf (t, Bt)

]dt. (17.3)

Roughly speaking, all differentials should be expanded out to second order usingthe Ito multiplication rules,

dB2 = dt and dBdt = 0 = dt2.

Proof. The rough idea is as follows. Let Π = 0 = t0 < t1 < · · · < tn = Tbe a partition of [0, T ] , so that by a telescoping series argument,

f (T,BT )− f (0, B0) =

n∑i=1

[f (ti, Bti)− f

(ti−1, Bti−1

)].

Then by Taylor’s theorem,

f (ti, Bti)− f(ti−1, Bti−1

)= ∂tf

(ti−1, Bti−1

)∆it+ ∂xf

(ti−1, Bti−1

)∆iB

+1

2∂2xf(ti−1, Bti−1

)[∆iB]

2+O

(∆it

2, ∆iB3).

Page: 196 job: 285notes macro: svmonob.cls date/time: 3-Jun-2016/13:04

Page 205: Math 285 Stochastic Processes Spring 2016bdriver/285_S2016/Lecture Notes/285notes.pdfBruce K. Driver Math 285 Stochastic Processes Spring 2016 June 3, 2016 File:285notes.tex

17.2 Ito’s formula, heat equations, and harmonic functions 197

The error terms are negligible, i.e.

lim|Π|→0

n∑i=1

O(∆it

2, ∆iB3)

= 0

and hence

f (T,BT )− f (0, B0)

= lim|Π|→0

n∑i=1

[f (ti, Bti)− f

(ti−1, Bti−1

)]= lim|Π|→0

n∑i=1

∂tf(ti−1, Bti−1

)∆it+ lim

|Π|→0

n∑i=1

∂xf(ti−1, Bti−1

)∆iB

+1

2lim|Π|→0

n∑i=1

∂2xf(ti−1, Bti−1

)[∆iB]

2

=

∫ T

0

∂tf (t, Bt) dt+

∫ T

0

∂xf (t, Bt) dBt +1

2

∫ T

0

∂2xf (t, Bt) dt

where the last term follows by an extension of Exercise 16.7.

Example 17.6. Let us verify Ito’s lemma in the special case where f (x) = x2.For T > 0 and

Π = 0 = t0 < t1 < · · · < tn = T

being a partition of [0, T ] , we have, with ∆iB := Bti −Bti−1that

B2T =

n∑i=1

(B2ti −B

2ti−1

)=

n∑i=1

(Bti +Bti−1

) (Bti −Bti−1

)=

n∑i=1

(2Bti−1

+∆iB)∆iB

= 2

n∑i=1

Bti−1∆iB +QΠ|Π|→0→ 2

∫ T

0

BdB + T

where we have used Exercise 16.7 in finding the limit. Hence we conclude

B2T = 2

∫ T

0

BdB + T

which is exactly Eq. (17.3) for f (x) = x2.

17.2 Ito’s formula, heat equations, and harmonic functions

Suppose D is a bounded open subset of Rd and x ∈ D. Let Bt =(B1t , . . . , B

dt

)be a standard d – dimensional Brownian motion. Let Xt := x+Bτt where

τ = inf t ≥ 0 : x+Bt /∈ D

be the first exit time from D. [Warning: I will be using the “natural” multi-dimensional and “stopped” extensions of the Ito theory developed above.] Inthis multi-dimensional setting Ito’s formula becomes,

df (t, Bt) =

[∂tf (t, Bt) +

1

2∆f (t, Bt)

]dt+∇f (t, Bt) · dBt.

The Ito multiplication table in this multi-dimensional setting is

dt · dBi = 0, dt2 = 0, dBidBj = δijdt. (17.4)

Example 17.7 (Feynmann-Kac formula). Suppose that V is a nice function onRd and h (t, x) solves the partial differential equation,

∂th =1

2∆h− V h with h (0, x) = f (x) .

Given T > 0 let

Mt := h (T − t,Xt) e−∫ t0V (Xs)ds := h (T − t,Xt)Zt.

Then by Ito’s lemma,

dMt =∇xh (T − t,Xt) · dBt +

(1

2∆h− ∂th

)(T − t,Xt)Ztdt

− h (T − t,Xt)ZtV (Xt) dt

=∇xh (T − t,Xt) · dBt +

(1

2∆h− ∂th− V h

)(T − t,Xt)Ztdt

=∇xh (T − t,Xt) · dBt.

From this it follows that Mt is a martingale and in particular we learn EM0 =EMT from which ti follows

h (T, x) = E[f (XT ) e

−∫ T0V (Xs)ds

]= E

[f (x+BT ) e

−∫ T0V (x+Bs)ds

].

Page: 197 job: 285notes macro: svmonob.cls date/time: 3-Jun-2016/13:04

Page 206: Math 285 Stochastic Processes Spring 2016bdriver/285_S2016/Lecture Notes/285notes.pdfBruce K. Driver Math 285 Stochastic Processes Spring 2016 June 3, 2016 File:285notes.tex

198 17 A short introduction to Ito’s calculus

Theorem 17.8. If u : D → R is a function such that ∆u = g on D and u = fon ∂D, then

u (x) = E[f (x+Bτ )− 1

2

∫ τ

0

g (x+Bs) ds

]. (17.5)

Proof. By Ito’s formula,

d [u (Xt)] = 1t≤τ∇u (Xt) · dBt +1

2(∆u) (Xt) 1t≤τdt

= 1t≤τ∇u (Xt) · dBt +1

2g (Xt) 1t≤τdt,

i.e.

u (XT ) = u (X0) +

∫ T

0

1t≤τ∇u (Xt) · dBt +1

2

∫ T∧τ

0

g (Xt) dt.

Taking expectations of this equation and using X0 = x shows

E [u (XT )] = u (x) + E∫ T

0

1t≤τ∇u (Xt) · dBt +1

2E∫ T∧τ

0

g (Xt) dt

= u (x) +1

2E∫ T∧τ

0

g (Xt) dt.

Now letting T ↑ ∞ gives the desired result upon observing that

E [u (X∞)] = E [u (x+Bτ )] = E [f (x+Bτ )] .

Notation 17.9 In the future let us write Ex[F(B(·)

)]for E

[F(x+B(·)

)].

Example 17.10. Suppose u : D → R solves, ∆u = −2 on D and u = 0 on ∂D,then Eq. (17.5) implies

u (x) = Ex [τ ] . (17.6)

If instead we let u solve, ∆u = 0 on D and u = f on ∂D, then Eq. (17.5) implies

u (x) = Ex [f (Bτ )] .

Here are some technical details for Eq. (17.6). If u solves ∆u = −2 on Dand u = 0 on ∂D, then by the maximum principle applied to −u we learn thatu ≥ 0 on D. So by the optional sampling theorem

Mt := u (Bτt ) + t ∧ τ

is a martingale and therefore,

u (x) = Exu (Bτ0 ) = ExM0 = Ex [u (Bτt ) + t ∧ τ ]

= Ex [u (Bτt )] + Ex [t ∧ τ ] (17.7)

≥ Ex [t ∧ τ ] . (17.8)

We may now use the MCT and pass to the limit as t ↑ ∞ in the inequality inEq. (17.8) to show Exτ ≤ u (x) <∞. Now that we know τ is integrable and inparticular Px (τ <∞) = 1, we may with the aid of DCT let t ↑ ∞ in Eq. (17.7)to find,

u (x) = Ex [u (Bτ )] + Exτ = Exτ

wherein we have used Bτ ∈ ∂D where u = 0 in the last equality.

17.3 A Simple Option Pricing Model

In this section we are going to try to explain the Black–Scholesformula for option pricing. The following excerpt is taken fromhttp://en.wikipedia.org/wiki/Black-Scholes.

Robert C. Merton was the first to publish a paper expanding the mathematicalunderstanding of the options pricing model and coined the term ”Black-Scholes” options pricing model, by enhancing work that was published byFischer Black and Myron Scholes. The paper was first published in 1973.The foundation for their research relied on work developed by scholars suchas Louis Bachelier, A. James Boness, Sheen T. Kassouf, Edward O. Thorp,and Paul Samuelson. The fundamental insight of Black-Scholes is that theoption is implicitly priced if the stock is traded.Merton and Scholes received the 1997 Nobel Prize in Economics for thisand related work. Though ineligible for the prize because of his death in1995, Black was mentioned as a contributor by the Swedish academy.

Definition 17.11. A European stock option at time T with strike priceK is a ticket that you would buy from a trader for the right to buy a particularstock at time T at a price K. If the stock price, ST , at time T is greater thatK you could then buy the stock at price K and then instantly resell it for aprofit of (ST −K) dollars. If the ST < K, you would not turn in your ticketbut would loose whatever you paid for the ticket. So the pay off of the option attime T is (ST −K)+ .

Question: What should be the price (q) at time zero of such a stock option?

To answer this question, we will use a simplified version of a financial marketwhich consists of only two assets; a no risk bond worth βt = β0e

rt (for somer > 0) dollars per share at time t and a risky stock worth St dollars per share.We are going to model St via a geometric “Brownian motion.”

Page: 198 job: 285notes macro: svmonob.cls date/time: 3-Jun-2016/13:04

Page 207: Math 285 Stochastic Processes Spring 2016bdriver/285_S2016/Lecture Notes/285notes.pdfBruce K. Driver Math 285 Stochastic Processes Spring 2016 June 3, 2016 File:285notes.tex

17.3 A Simple Option Pricing Model 199

Definition 17.12 (Geometric Brownian Motion). Let σ > 0, and µ ∈ Rbe given parameters. We say that the solution to the “stochastic differentialequation,”

dStSt

= σdBt + µdt (17.9)

with S0 being non-random is a geometric Brownian motion. More precisely,St, is a solution to

St = S0 + σ

∫ t

0

SdB + µ

∫ t

0

Ssds. (17.10)

(The parameters σ and µ measure the volatility and drift or trend of thestock respectively.)

Notice that dSS is the relative change of S and formally, E

(dSS

)= µdt and

Var(dSS

)= σ2dt. Taking expectation of Eq. (17.10) gives,

ESt = S0 + µ

∫ t

0

ESsds.

Differentiating this equation then implies,

d

dtESt = µESt with ES0 = S0,

which yields, ESt = S0eµt. So on average, St is growing or decaying exponen-

tially depending on the sign of µ.

Proposition 17.13 (Geometric Brownian motion). The stochastic differ-ential Equation (17.10) has a unique solution given by

St = S0 exp

(σBt +

(µ− 1

2σ2

)t

).

Proof. We do not bother to give the proof of uniqueness here. To proveexistence, let us look for a solution to Eq. (17.9) of the form;

St = S0 exp (aBt + bt) ,

for some constants a and b. By Ito’s lemma, using ddxe

x = d2

dx2 ex = ex and the

multiplication rules, dB2 = dt and dt2 = dB · dt = 0, we find that

dS = S (adB + bdt) +1

2S (adB + bdt)

2

= S (adB + bdt) +1

2Sa2dt,

i.e.dS

S= adB +

(b+

1

2a2

)dt.

Comparing this with Eq. (17.9) shows that we should take a = σ and b = µ− 12σ

2

to get a solution.

Definition 17.14 (Holdings and Value Processes). Let (at, bt) be the hold-ings process which denotes the number of shares of stock and bonds respectivelythat are held in the portfolio at time t. The value process, Vt, of the portfolio,is

Vt = atSt + btβt. (17.11)

Suppose time is partitioned as,

Π = 0 = t0 < t1 < t2 < · · · < tn = T

for some time T in the future. Let us suppose that (at, bt) is constant on theintervals, [0, t1] , (t1, t2], . . . , (tn−1, tn]. Let us write (at, bt) = (ai−1, bi−1) forti−1 < t ≤ ti, see Figure 17.1.

ti−1 timeti ti+1

a or b in dollars

Fig. 17.1. A possible graph of either at or bt.

Therefore the value of the portfolio is given by

Vt = ai−1St + bi−1βt for ti−1 < t ≤ ti.

We now assume that our holding process is self financing (i.e. we do not addany external money to portfolio other than what was invested, V0 = a0S0+b0β0,at the initial time t = 0), then we must have2

2 Equation (17.12) may be written as

Page: 199 job: 285notes macro: svmonob.cls date/time: 3-Jun-2016/13:04

Page 208: Math 285 Stochastic Processes Spring 2016bdriver/285_S2016/Lecture Notes/285notes.pdfBruce K. Driver Math 285 Stochastic Processes Spring 2016 June 3, 2016 File:285notes.tex

200 17 A short introduction to Ito’s calculus

ai−1Sti + bi−1βti = Vti = aiSti + biβti for all i. (17.12)

That is to say, when we rebalance our portfolio at time ti, we are only use themoney, Vti , dollars in the portfolio at time ti. Using Eq. (17.12) at i and i− 1allows us to conclude,

Vti − Vti−1= ai−1Sti + bi−1βti −

(ai−1Sti−1

+ bi−1βti−1

)= ai−1

(Sti − Sti−1

)+ bi−1

(βti − βti−1

)for all i, (17.13)

which states the change of the portfolio balance over the time interval, (ti−1, ti]is due solely to the gain or loss made by the investments in the portfolio. (TheEquations (17.12) and (17.13) are equivalent.) Summing Eq. (17.13) then gives,

Vtj − V0 =

j∑i=1

ai−1

(Sti − Sti−1

)+

j∑i=1

bi−1

(βti − βti−1

)(17.14)

=

∫ tj

0

ardSr +

∫ tj

0

brdβr for all j. (17.15)

More generally, if we throw any arbitrary point, t ∈ [0, T ] , into our partitionwe may conclude that

Vt = V0 +

∫ t

0

adS +

∫ t

0

bdβ for all 0 ≤ t ≤ T. (17.16)

The interpretation of this equation is that Vt−V0 is equal to the gains or lossesdue to trading which is given by∫ t

0

adS +

∫ t

0

bdβ.

Equation (17.16) now makes sense even if we allow for continuous trading. Theprevious arguments show that the integrals appearing in Eq. (17.16) shouldbe taken to be Ito – integrals as defined in Definition 17.1. Moreover, if theinvestor does not have psychic abilities, we should assume that holding processis adapted.

(ai − ai−1)Sti + (bi − bi−1)βti = 0.

This explains why the continuum limit of this equation is not Stdat+βtdbt = 0 butrather must be interpreted as St+dtdat + βt+dtdbt = 0. It is also useful to observethat

d (XY )t = Xt+dtYt+dt −XtYt= (Xt+dt −Xt)Yt+dt +Xt (Yt+dt − Yt) ,

and hence there is no quadratic differential term when d (XY ) is written out thisway.

17.4 The Black-Scholes Formula

Now that we have set the stage we can now try to price the option. (We willclosely follow [3, p. 255-264.] here.)

Fundamental Principle: The price of the option should be equal to theamount of money, V0, that an investor would have to put into the bond-stock market at time t = 0 so as there exists a self-financing holding process(at, bt), such that

VT = aTST + bTβT = (ST −K)+ .

Remark 17.15 (Money for nothing). If we price the option higher than V0, i.e.q > V0, we could make risk free money by selling one of these options at qdollars, investing V0 < q of this money using the holding process (at, bt) tocover the payoff at time T and then pocket the different, q − V0.

If the price of the option was less than V0, i.e. q < V0, the investor shouldbuy the option and then pursue the trading strategy, (−a,−b) . At time zerothe investor has invested q + (−a0S0 − b0β0) = q − V0 < 0 dollars, i.e. he isholding V0−q dollars in hand at time t = 0. The value of his portfolio at time Tis now −VT = − (ST −K)+ . If ST > K, the investor then exercises her optionto pay off the debt she as accrued in the portfolio and if ST ≤ K, she doesnothing since his portfolio is worth zero dollars. Either way, she still has theV0 − q dollars in hand from the start of the transactions at t = 0.

Ansatz: We would like the price of the option q = V0 to depend onlyon what we might know at the initial time. Thus we make the ansatz thatq := f

(S0, T,K, r, σ

2, µ)

3. (It is part of the content of the Black-Scholes formulathat this ansatz is permissible.)

If we have a self-financing holding process (at, bt) , then (as, bs)t≤s≤T isalso a self-financing holding process on [t, T ] such that VT = aTST + bTβT =(ST −K)+ , therefore, given the ansatz and the fundamental principle above,if the stock price is St at time t, the options price at this time should be

Vt = f (St, T − t) for all 0 ≤ t ≤ T. (17.17)

By Ito’s lemma

dVt = fx (St, T − t) dSt +1

2fxx (St, T − t) dS2

t − ft (St, T − t) dt

= fx (St, T − t)St (σdBt + µdt) +

[1

2fxx (St, T − t)S2

t σ2 − ft (St, T − t)

]dt

= fx (St, T − t)StσdBt

+

[fx (St, T − t)Stµ+

1

2fxx (St, T − t)S2

t σ2 − ft (St, T − t)

]dt

3 Since r,K, µ, and σ2 are fixed, we will often drop them from the notation.

Page: 200 job: 285notes macro: svmonob.cls date/time: 3-Jun-2016/13:04

Page 209: Math 285 Stochastic Processes Spring 2016bdriver/285_S2016/Lecture Notes/285notes.pdfBruce K. Driver Math 285 Stochastic Processes Spring 2016 June 3, 2016 File:285notes.tex

17.4 The Black-Scholes Formula 201

On the other hand from Eqs. (17.16) and (17.9), we know that

dVt = atdSt + btβ0rertdt

= atSt (σdBt + µdt) + btβ0rertdt

= atStσdBt +[atStµ+ btβ0re

rt]dt.

Comparing these two equations implies,

at = fx (St, T − t) (17.18)

and

atStµ+ btβ0rert

= fx (St, T − t)Stµ+1

2fxx (St, T − t)S2

t σ2 − ft (St, T − t) . (17.19)

Using Eq. (17.18) and

f (St, T − t) = Vt = atSt + btβ0ert

= fx (St, T − t)St + btβ0ert

in Eq. (17.19) allows us to conclude,

1

2fxx (St, T − t)S2

t σ2 − ft (St, T − t) = rbtβ0e

rt

= rf (St, T − t)− rfx (St, T − t)St.

Thus we see that the unknown function f should solve the partial differentialequation,

1

2σ2x2fxx (x, T − t)− ft (x, T − t) = rf (x, T − t)− rxfx (x, T − t)

with f (x, 0) = (x−K)+ ,

i.e.

ft (x, t) =1

2σ2x2fxx (x, t) + rxfx (x, t)− rf (x, t) (17.20)

with f (x, 0) = (x−K)+ . (17.21)

Fact 17.16 Let N be a standard normal random variable and Φ (x) :=P (N ≤ x) . The solution to Eqs. (17.20) and (17.21) is given by;

f (x, t) = xΦ (g (x, t))−Ke−rtΦ (h (x, t)) , (17.22)

where,

g (x, t) =ln (x/K) +

(r + 1

2σ2)t

σ√t

,

h (x, t) = g (x, t)− σ√t.

Theorem 17.17 (Option Pricing). Given the above setup, the “rational”price” of the European call option is

q = S0Φ

(ln (S0/K) +

(r + 1

2σ2)T

σ√T

)

−Ke−rTΦ

(ln (S0/K) +

(r + 1

2σ2)T

σ√T

− σ√T

)

where Φ (x) := P (N (0, 1) ≤ x) .

Page: 201 job: 285notes macro: svmonob.cls date/time: 3-Jun-2016/13:04

Page 210: Math 285 Stochastic Processes Spring 2016bdriver/285_S2016/Lecture Notes/285notes.pdfBruce K. Driver Math 285 Stochastic Processes Spring 2016 June 3, 2016 File:285notes.tex
Page 211: Math 285 Stochastic Processes Spring 2016bdriver/285_S2016/Lecture Notes/285notes.pdfBruce K. Driver Math 285 Stochastic Processes Spring 2016 June 3, 2016 File:285notes.tex

Part V

Appendix

Page 212: Math 285 Stochastic Processes Spring 2016bdriver/285_S2016/Lecture Notes/285notes.pdfBruce K. Driver Math 285 Stochastic Processes Spring 2016 June 3, 2016 File:285notes.tex
Page 213: Math 285 Stochastic Processes Spring 2016bdriver/285_S2016/Lecture Notes/285notes.pdfBruce K. Driver Math 285 Stochastic Processes Spring 2016 June 3, 2016 File:285notes.tex

18

Analytic Facts

18.1 A Stirling’s Formula Like Approximation

Theorem 18.1. Suppose that f : (0,∞) → R is an increasing concave downfunction (like f (x) = lnx) and let sn :=

∑nk=1 f (k) , then

sn −1

2(f (n) + f (1)) ≤

∫ n

1

f (x) dx

≤ sn −1

2[f (n+ 1) + 2f (1)] +

1

2f (2)

≤ sn −1

2[f (n) + 2f (1)] +

1

2f (2) .

Proof. On the interval, [k − 1, k] , we have that f (x) is larger than thestraight line segment joining (k − 1, f (k − 1)) and (k, f (k)) and thus

1

2(f (k) + f (k − 1)) ≤

∫ k

k−1

f (x) dx.

Summing this equation on k = 2, . . . , n shows,

sn −1

2(f (n) + f (1)) =

n∑k=2

1

2(f (k) + f (k − 1))

≤n∑k=2

∫ k

k−1

f (x) dx =

∫ n

1

f (x) dx.

For the upper bound on the integral we observe that f (x) ≤ f (k)−f ′ (k) (x− k)for all x and therefore,∫ k

k−1

f (x) dx ≤∫ k

k−1

[f (k)− f ′ (k) (x− k)] dx = f (k)− 1

2f ′ (k) .

Summing this equation on k = 2, . . . , n then implies,∫ n

1

f (x) dx ≤n∑k=2

f (k)− 1

2

n∑k=2

f ′ (k) .

Since f ′′ (x) ≤ 0, f ′ (x) is decreasing and therefore f ′ (x) ≤ f ′ (k − 1) for x ∈[k − 1, k] and integrating this equation over [k − 1, k] gives

f (k)− f (k − 1) ≤ f ′ (k − 1) .

Summing the result on k = 3, . . . , n+ 1 then shows,

f (n+ 1)− f (2) ≤n∑k=2

f ′ (k)

and thus ti follows that∫ n

1

f (x) dx ≤n∑k=2

f (k)− 1

2(f (n+ 1)− f (2))

= sn −1

2[f (n+ 1) + 2f (1)] +

1

2f (2)

≤ sn −1

2[f (n) + 2f (1)] +

1

2f (2)

Page 214: Math 285 Stochastic Processes Spring 2016bdriver/285_S2016/Lecture Notes/285notes.pdfBruce K. Driver Math 285 Stochastic Processes Spring 2016 June 3, 2016 File:285notes.tex

206 18 Analytic Facts

Example 18.2 (Approximating n!). Let us take f (n) = lnn and recall that∫ n

1

lnxdx = n lnn− n+ 1.

Thus we may conclude that

sn −1

2lnn ≤ n lnn− n+ 1 ≤ sn −

1

2lnn+

1

2ln 2.

Thus it follows that(n+

1

2

)lnn− n+ 1− ln

√2 ≤ sn ≤

(n+

1

2

)lnn− n+ 1.

Exponentiating this identity then gives the following upper and lower boundson n!;

e√2· e−nnn+1/2 ≤ n! ≤ e · e−nnn+1/2.

These bound compare well with Strirling’s formula (Theorem 18.5) which im-plies,

n! ∼√

2πe−nnn+1/2 by definition⇐⇒ limn→∞

n!

e−nnn+1/2=√

2π.

Observe that

e√2∼= 1. 922 1 ≤

√2π ∼= 2. 506 ≤ e ∼= 2.718 3.

Definition 18.3 (Gamma Function). The Gamma function, Γ : R+ →R+ is defined by

Γ (x) :=

∫ ∞0

ux−1e−udu (18.1)

(The reader should check that Γ (x) <∞ for all x > 0.)

Here are some of the more basic properties of this function.

Example 18.4 (Γ – function properties). Let Γ be the gamma function, then;

1. Γ (1) = 1 as is easily verified.2. Γ (x+ 1) = xΓ (x) for all x > 0 as follows by integration by parts;

Γ (x+ 1) =

∫ ∞0

e−u ux+1 du

u=

∫ ∞0

ux(− d

due−u

)du

= x

∫ ∞0

ux−1 e−u du = x Γ (x) .

In particular, it follows from items 1. and 2. and induction that

Γ (n+ 1) = n! for all n ∈ N. (18.2)

3. Γ (1/2) =√π. This last assertion is a bit trickier. One proof is to make use

of the fact (proved below in Lemma 19.1) that∫ ∞−∞

e−ar2

dr =

√π

afor all a > 0. (18.3)

Taking a = 1 and making the change of variables, u = r2 below implies,

√π =

∫ ∞−∞

e−r2

dr = 2

∫ ∞0

u−1/2e−udu = Γ (1/2) .

Γ (1/2) = 2

∫ ∞0

e−r2

dr =

∫ ∞−∞

e−r2

dr

= I1(1) =√π.

4. A simple induction argument using items 2. and 3. now shows that

Γ

(n+

1

2

)=

(2n− 1)!!

2n√π

where (−1)!! := 1 and (2n− 1)!! = (2n− 1) (2n− 3) . . . 3 · 1 for n ∈ N.

Theorem 18.5 (Stirling’s formula). The Gamma function (see Definition18.3), satisfies Stirling’s formula,

limx→∞

Γ (x+ 1)√2πe−xxx+1/2

= 1. (18.4)

In particular, if n ∈ N, we have

n! = Γ (n+ 1) ∼√

2πe−nnn+1/2

where we write an ∼ bn to mean, limn→∞anbn

= 1.

Page: 206 job: 285notes macro: svmonob.cls date/time: 3-Jun-2016/13:04

Page 215: Math 285 Stochastic Processes Spring 2016bdriver/285_S2016/Lecture Notes/285notes.pdfBruce K. Driver Math 285 Stochastic Processes Spring 2016 June 3, 2016 File:285notes.tex

19

Multivariate Gaussians

The following basic Gaussian integration formula is needed in order to prop-erly normalize Gaussian densities, see Definition 19.2 below.

Lemma 19.1. Let a > 0 and for d ∈ N let,

Id(a) :=

∫Rd

e−a|x|2

dm(x).

Then Id(a) = (π/a)d/2.

Proof. By Tonelli’s theorem and induction,

Id(a) =

∫Rd−1×R

e−a|y|2

e−at2

md−1(dy) dt

= Id−1(a)I1(a) = Id1 (a). (19.1)

So it suffices to compute:

I2(a) =

∫R2

e−a|x|2

dm(x) =

∫R2\0

e−a(x21+x2

2)dx1dx2.

Using polar coordinates, we find,

I2(a) =

∫ ∞0

dr r

∫ 2π

0

dθ e−ar2

= 2π

∫ ∞0

re−ar2

dr

= 2π limM→∞

∫ M

0

re−ar2

dr = 2π limM→∞

e−ar2

−2a

∫ M

0

=2π

2a= π/a.

This shows that I2(a) = π/a and the result now follows from Eq. (19.1).

19.1 Review of Gaussian Random Variables

Definition 19.2 (Normal / Gaussian Random Variable). A random vari-able, Y, is normal with mean µ standard deviation σ2 iff

P (Y ∈ (y, y + dy]) =1√

2πσ2e−

12σ2

(y−µ)2dy. (19.2)

We will abbreviate this by writing Yd= N

(µ, σ2

). When µ = 0 and σ2 = 1 we

say Y is a standard normal random variable. We will often denote standardnormal random variables by Z.

Observe that Eq. (19.2) is equivalent to writing

E [f (Y )] =1√

2πσ2

∫Rf (y) e−

12σ2

(y−µ)2dy

for all bounded functions, f : R→ R. Also observe that Yd= N

(µ, σ2

)is

equivalent to Yd= σZ+µ. Indeed, by making the change of variable, y = σx+µ,

we find

E [f (σZ + µ)] =1√2π

∫Rf (σx+ µ) e−

12x

2

dx

=1√2π

∫Rf (y) e−

12σ2

(y−µ)2 dy

σ=

1√2πσ2

∫Rf (y) e−

12σ2

(y−µ)2dy.

Lastly the constant,(2πσ2

)−1/2is chosen so that

1√2πσ2

∫Re−

12σ2

(y−µ)2dy =1√2π

∫Re−

12y

2

dy = 1.

Lemma 19.3 (Integration by parts). If Xd= N

(0, σ2

)for some σ2 ≥ 0

and f : R→ R is a C1 – function such that Xf (X) , f ′ (X) and f (X) are all

integrable random variables and1 limz→±∞

[f (x) e−

12σ2

x2]

= 0, then

E [Xf (X)] = σ2E [f ′ (X)] = E[X2]· E [f ′ (X)] . (19.3)

Proof. If σ = 0 then X = 0 a.s. and both sides of Eq. (19.3) are zero. So we

now suppose that σ > 0 and set C := 1/√

2πσ2. The result is a simple matterof using integration by parts;

1 This last hypothesis is actually unnecessary!

Page 216: Math 285 Stochastic Processes Spring 2016bdriver/285_S2016/Lecture Notes/285notes.pdfBruce K. Driver Math 285 Stochastic Processes Spring 2016 June 3, 2016 File:285notes.tex

208 19 Multivariate Gaussians

E [f ′ (X)] = C

∫Rf ′ (x) e−

12σ2

x2

dx = C limM→∞

∫ M

−Mf ′ (x) e−

12σ2

x2

dx

= C limM→∞

[f (x) e−

12σ2

x2

|M−M −∫ M

−Mf (x)

d

dxe−

12σ2

x2

dx

]

= C limM→∞

∫ M

−Mf (x)

x

σ2e−

12x

2

dx =1

σ2E [Xf (X)] .

Example 19.4. Suppose that Xd= N (0, 1) and define αk := E

[X2k

]for all

k ∈ N0. By Lemma 19.3,

αk+1 = E[X2k+1 ·X

]= (2k + 1)αk with α0 = 1.

Hence it follows that

α1 = α0 = 1, α2 = 3α1 = 3, α3 = 5 · 3

and by a simple induction argument,

EX2k = αk = (2k − 1)!!,

where (−1)!! := 0.Actually we can use the Γ – function to say more. Namely for any β > −1,

E |X|β =1√2π

∫R|x|β e− 1

2x2

dx =

√2

π

∫ ∞0

xβe−12x

2

dx.

Now make the change of variables, y = x2/2 (i.e. x =√

2y and dx = 1√2y−1/2dy)

to learn,

E |X|β =1√π

∫ ∞0

(2y)β/2

e−yy−1/2dy

=1√π

2β/2∫ ∞

0

y(β+1)/2e−yy−1dy =1√π

2β/2Γ

(β + 1

2

).

Exercise 19.1. Let q (x) be a polynomial2 in x, Zd= N (0, 1) , and

u (t, x) := E[q(x+√tZ)]

(19.4)

=

∫R

1√2πt

e−12t (y−x)2q (y) dy (19.5)

2 Actually, q (x) can be any twice continuously differentiable function which along

with its derivatives grow slower than eεx2

for any ε > 0.

Show u satisfies the heat equation,

∂tu (t, x) =

1

2

∂2

∂x2u (t, x) for all t > 0 and x ∈ R,

with u (0, x) = q (x) .

Hints: Make use of Lemma 19.3 along with the fact (which is easily provedhere) that

∂tu (t, x) = E

[∂

∂tq(x+√tZ)].

You will also have to use the corresponding fact for the x derivatives as well.

Exercise 19.2. Let q (x) be a polynomial in x, Zd= N (0, 1) , and ∆ = d2

dx2 .Show

E [q (Z)] =(e∆/2q

)(0) :=

∞∑n=0

1

n!

((∆

2

)nq

)(0) =

∞∑n=0

1

n!

1

2n(∆nq) (0)

where the above sum is actually a finite sum since ∆nq ≡ 0 if 2n > deg q.Hint: let u (t) := E

[q(√tZ)]. From your proof of Exercise 19.1 you should be

able to see that u (t) = 12E[(∆q)

(√tZ)]. This latter equation may be iterated

in order to find u(n) (t) for all n ≥ 0. With this information in hand you shouldbe able to finish the proof with the aid of Taylor’s theorem.

Example 19.5. Suppose that k ∈ N, then

E[Z2k

]=(e∆2 x2k

)|x=0 =

∞∑n=0

1

n!

1

2n(∆nx2k

)|x=0

=1

k!

1

2k∆kx2k =

(2k)!

k!2k

=2k · (2k − 1) · 2 (k − 1) · (2k − 3) · · · · · (2 · 2) · 3 · 2 · 1

2kk!= (2k − 1)!!

in agreement with Example 19.4.

Example 19.6. Let $Z$ be a standard normal random variable and set $f(\lambda) := E\big[e^{\lambda Z^2}\big]$ for $\lambda < 1/2$. Then $f(0) = 1$ and
\[
f'(\lambda) = E\big[Z^2 e^{\lambda Z^2}\big] = E\Big[\frac{\partial}{\partial Z}\big(Z e^{\lambda Z^2}\big)\Big] = E\big[e^{\lambda Z^2} + 2\lambda Z^2 e^{\lambda Z^2}\big] = f(\lambda) + 2\lambda f'(\lambda).
\]


Solving for $f'(\lambda)$ we find,
\[
f'(\lambda) = \frac{1}{1-2\lambda}\, f(\lambda) \quad\text{with } f(0) = 1.
\]
The solution to this equation is found in the usual way as,
\[
\ln f(\lambda) = \int \frac{f'(\lambda)}{f(\lambda)}\,d\lambda = \int \frac{1}{1-2\lambda}\,d\lambda = -\frac12 \ln(1-2\lambda) + C.
\]
By taking $\lambda = 0$ and using $f(0) = 1$ we find that $C = 0$ and therefore,
\[
E\big[e^{\lambda Z^2}\big] = f(\lambda) = \frac{1}{\sqrt{1-2\lambda}} \quad\text{for } \lambda < \frac12.
\]
This can also be shown by directly evaluating the integral,
\[
E\big[e^{\lambda Z^2}\big] = \int_{\mathbb{R}} \frac{1}{\sqrt{2\pi}}\, e^{-\frac12(1-2\lambda)z^2}\,dz.
\]
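A Monte Carlo check of $E[e^{\lambda Z^2}] = (1-2\lambda)^{-1/2}$ (a sketch; the sampled $\lambda$ values are kept below $1/4$ so that the estimator itself has finite variance):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)
Z = rng.standard_normal(10**6)
for lam in (-1.0, 0.1, 0.2):
    print(lam, np.mean(np.exp(lam * Z**2)), 1 / np.sqrt(1 - 2 * lam))
\end{verbatim}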

Exercise 19.3. Suppose that $Z \overset{d}{=} N(0,1)$ and $\lambda \in \mathbb{R}$. Show
\[
f(\lambda) := E\big[e^{i\lambda Z}\big] = \exp\big(-\lambda^2/2\big). \tag{19.6}
\]
Hint: You may use without proof that $f'(\lambda) = iE\big[Z e^{i\lambda Z}\big]$ (i.e. it is permissible to differentiate past the expectation). Assuming this, use Lemma 19.3 to see that $f'(\lambda)$ satisfies a simple ordinary differential equation.

Lemma 19.7 (Gaussian tail estimates). Suppose that $X$ is a standard normal random variable, i.e.
\[
P(X \in A) = \frac{1}{\sqrt{2\pi}}\int_A e^{-x^2/2}\,dx \quad\text{for all } A \in \mathcal{B}_{\mathbb{R}},
\]
then for all $x \ge 0$,
\[
P(X \ge x) \le \min\Big(\frac12 - \frac{x}{\sqrt{2\pi}}\,e^{-x^2/2},\ \frac{1}{\sqrt{2\pi}\,x}\,e^{-x^2/2}\Big) \le \frac12\, e^{-x^2/2}. \tag{19.7}
\]
Moreover (see [15, Lemma 2.5]),
\[
P(X \ge x) \ge \max\Big(\frac12 - \frac{x}{\sqrt{2\pi}},\ \frac{x}{x^2+1}\,\frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}\Big), \tag{19.8}
\]
which combined with Eq. (19.7) proves Mills' ratio (see [9]);
\[
\lim_{x\to\infty} \frac{P(X \ge x)}{\frac{1}{\sqrt{2\pi}\,x}\,e^{-x^2/2}} = 1. \tag{19.9}
\]

Proof. See Figure 19.1, where the green curve is the plot of $P(X \ge x)$, the black is the plot of
\[
\min\Big(\frac12 - \frac{x}{\sqrt{2\pi}}\,e^{-x^2/2},\ \frac{1}{\sqrt{2\pi}\,x}\,e^{-x^2/2}\Big),
\]
the red is the plot of $\frac12\, e^{-x^2/2}$, and the blue is the plot of
\[
\max\Big(\frac12 - \frac{x}{\sqrt{2\pi}},\ \frac{x}{x^2+1}\,\frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}\Big).
\]

[Fig. 19.1. Plots of $P(X \ge x)$ and its estimates.]

The formal proof of these estimates, for the reader who is not convinced by Figure 19.1, is given below.

We begin by observing that
\[
P(X \ge x) = \frac{1}{\sqrt{2\pi}}\int_x^\infty e^{-y^2/2}\,dy \le \frac{1}{\sqrt{2\pi}}\int_x^\infty \frac{y}{x}\, e^{-y^2/2}\,dy
= -\frac{1}{\sqrt{2\pi}}\,\frac{1}{x}\, e^{-y^2/2}\Big|_{y=x}^{y=\infty} = \frac{1}{\sqrt{2\pi}}\,\frac{1}{x}\, e^{-x^2/2}. \tag{19.10}
\]

If we only want to prove Mills' ratio (19.9), we could proceed as follows. Let $\alpha > 1$, then for $x > 0$,
\[
\begin{aligned}
P(X \ge x) &= \frac{1}{\sqrt{2\pi}}\int_x^\infty e^{-y^2/2}\,dy
\ge \frac{1}{\sqrt{2\pi}}\int_x^{\alpha x} \frac{y}{\alpha x}\, e^{-y^2/2}\,dy
= -\frac{1}{\sqrt{2\pi}}\,\frac{1}{\alpha x}\, e^{-y^2/2}\Big|_{y=x}^{y=\alpha x} \\
&= \frac{1}{\sqrt{2\pi}}\,\frac{1}{\alpha x}\, e^{-x^2/2}\Big[1 - e^{-(\alpha^2-1)x^2/2}\Big]
\end{aligned}
\]


from which it follows,
\[
\liminf_{x\to\infty}\Big[\sqrt{2\pi}\,x\,e^{x^2/2}\cdot P(X \ge x)\Big] \ge 1/\alpha \uparrow 1 \text{ as } \alpha \downarrow 1.
\]
The estimate in Eq. (19.10) shows $\limsup_{x\to\infty}\big[\sqrt{2\pi}\,x\,e^{x^2/2}\cdot P(X \ge x)\big] \le 1$.

To get more precise estimates, we begin by observing,
\[
P(X \ge x) = \frac12 - \frac{1}{\sqrt{2\pi}}\int_0^x e^{-y^2/2}\,dy \tag{19.11}
\]
\[
\le \frac12 - \frac{1}{\sqrt{2\pi}}\int_0^x e^{-x^2/2}\,dy = \frac12 - \frac{x}{\sqrt{2\pi}}\, e^{-x^2/2}.
\]

This equation along with Eq. (19.10) gives the first inequality in Eq. (19.7). To prove the second inequality, observe that $\sqrt{2\pi} > 2$, so
\[
\frac{1}{\sqrt{2\pi}}\,\frac{1}{x}\, e^{-x^2/2} \le \frac12\, e^{-x^2/2} \quad\text{if } x \ge 1.
\]
For $x \le 1$ we must show,
\[
\frac12 - \frac{x}{\sqrt{2\pi}}\, e^{-x^2/2} \le \frac12\, e^{-x^2/2},
\]
or equivalently that $f(x) := e^{x^2/2} - \sqrt{2/\pi}\,x \le 1$ for $0 \le x \le 1$. Since $f$ is convex ($f''(x) = (x^2+1)\,e^{x^2/2} > 0$), $f(0) = 1$ and $f(1) \cong 0.85 < 1$, it follows that $f \le 1$ on $[0,1]$. This proves the second inequality in Eq. (19.7). It follows from Eq. (19.11) that

f ≤ 1 on [0, 1] . This proves the second inequality in Eq. (19.7).It follows from Eq. (19.11) that

P (X ≥ x) =1

2− 1√

∫ x

0

e−y2/2dy

≥ 1

2− 1√

∫ x

0

1dy =1

2− 1√

2πx for all x ≥ 0.

So to finish the proof of Eq. (19.8) we must show,
\[
f(x) := \frac{1}{\sqrt{2\pi}}\,x\,e^{-x^2/2} - (1+x^2)\,P(X \ge x)
= \frac{1}{\sqrt{2\pi}}\Big[x\,e^{-x^2/2} - (1+x^2)\int_x^\infty e^{-y^2/2}\,dy\Big] \le 0 \quad\text{for all } 0 \le x < \infty.
\]
This follows by observing that $f(0) = -1/2 < 0$, $\lim_{x\uparrow\infty} f(x) = 0$ and
\[
\begin{aligned}
f'(x) &= \frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}\,(1-x^2) - 2x\,P(X \ge x) + (1+x^2)\,\frac{1}{\sqrt{2\pi}}\,e^{-x^2/2} \\
&= 2\Big(\frac{1}{\sqrt{2\pi}}\,e^{-x^2/2} - x\,P(X \ge x)\Big) \ge 0,
\end{aligned}
\]
where the last inequality is a consequence of Eq. (19.7).
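Since Figure 19.1 did not survive extraction, here is a small numerical stand-in (a sketch using SciPy's survival function) tabulating $P(X \ge x)$ against the bounds of Eqs. (19.7)-(19.8), and watching Mills' ratio (19.9) tend to $1$:
\begin{verbatim}
import numpy as np
from scipy.stats import norm

for x in (0.5, 1.0, 2.0, 4.0, 8.0):
    tail = norm.sf(x)                       # P(X >= x)
    g = np.exp(-x**2 / 2)
    upper = min(0.5 - x / np.sqrt(2 * np.pi) * g,
                g / (np.sqrt(2 * np.pi) * x))
    lower = max(0.5 - x / np.sqrt(2 * np.pi),
                x / (x**2 + 1) * g / np.sqrt(2 * np.pi))
    mills = tail / (g / (np.sqrt(2 * np.pi) * x))
    print(f"x={x}: {lower:.3e} <= {tail:.3e} <= {upper:.3e}, "
          f"Mills ratio {mills:.4f}")
\end{verbatim}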

19.2 Gaussian Random Vectors

Definition 19.8 (Gaussian Random Vectors). A random vector, $X \in \mathbb{R}^d$, is Gaussian iff
\[
E\big[e^{i\lambda\cdot X}\big] = \exp\Big(-\frac12 \operatorname{Var}(\lambda\cdot X) + iE(\lambda\cdot X)\Big) \quad\text{for all } \lambda \in \mathbb{R}^d. \tag{19.12}
\]
In short, $X$ is a Gaussian random vector iff $\lambda\cdot X$ is a Gaussian random variable for all $\lambda \in \mathbb{R}^d$. (Implicitly in this definition we are assuming that $E|X_j|^2 < \infty$ for $1 \le j \le d$.)

Notation 19.9 Let $X$ be a random vector in $\mathbb{R}^d$ with second moments, i.e. $E[X_k^2] < \infty$ for $1 \le k \le d$. The mean of $X$ is the vector $\mu = (\mu_1,\ldots,\mu_d)^{\mathrm{tr}} \in \mathbb{R}^d$ with $\mu_k := EX_k$ for $1 \le k \le d$, and the covariance matrix $C = C(X)$ is the $d\times d$ matrix with entries,
\[
C_{kl} := \operatorname{Cov}(X_k, X_l) \quad\text{for } 1 \le k, l \le d. \tag{19.13}
\]

Exercise 19.4. Suppose that $X$ is a random vector in $\mathbb{R}^d$ with second moments. Show for all $\lambda = (\lambda_1,\ldots,\lambda_d)^{\mathrm{tr}} \in \mathbb{R}^d$ that
\[
E[\lambda\cdot X] = \lambda\cdot\mu \quad\text{and}\quad \operatorname{Var}(\lambda\cdot X) = \lambda\cdot C\lambda. \tag{19.14}
\]

Corollary 19.10. If $Y \overset{d}{=} N(\mu,\sigma^2)$, then
\[
E\big[e^{i\lambda Y}\big] = \exp\Big(-\frac12 \lambda^2\sigma^2 + i\mu\lambda\Big) \quad\text{for all } \lambda \in \mathbb{R}. \tag{19.15}
\]
Conversely, if $Y$ is a random variable such that Eq. (19.15) holds, then $Y \overset{d}{=} N(\mu,\sigma^2)$.

Proof. ($\implies$) From the remarks after Lemma 19.2, we know that $Y \overset{d}{=} \sigma Z + \mu$ where $Z \overset{d}{=} N(0,1)$. Therefore,
\[
E\big[e^{i\lambda Y}\big] = E\big[e^{i\lambda(\sigma Z+\mu)}\big] = e^{i\lambda\mu}\, E\big[e^{i\lambda\sigma Z}\big] = e^{i\lambda\mu}\, e^{-\frac12(\lambda\sigma)^2} = \exp\Big(-\frac12\lambda^2\sigma^2 + i\mu\lambda\Big).
\]
($\impliedby$) This follows from the basic fact that the characteristic function, or Fourier transform, of a distribution uniquely determines the distribution.
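Eq. (19.15) can be verified empirically with complex-valued Monte Carlo averages (a sketch with arbitrary illustrative parameters):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(4)
mu, sigma = 0.5, 1.3
Y = mu + sigma * rng.standard_normal(10**6)
for lam in (0.7, 1.5):
    emp = np.mean(np.exp(1j * lam * Y))
    exact = np.exp(-lam**2 * sigma**2 / 2 + 1j * mu * lam)
    print(lam, emp, exact)
\end{verbatim}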

Remark 19.11 (Alternate characterization of being Gaussian). Given Corollary 19.10, $Y$ is a Gaussian random variable iff $EY^2 < \infty$ and
\[
E\big[e^{i\lambda Y}\big] = \exp\Big(-\frac12\operatorname{Var}(\lambda Y) + i\lambda EY\Big) = \exp\Big(-\frac{\lambda^2}{2}\operatorname{Var}(Y) + i\lambda EY\Big) \quad\text{for all } \lambda \in \mathbb{R}.
\]


Exercise 19.5. Suppose $X_1$ and $X_2$ are two independent Gaussian random variables with $X_i \overset{d}{=} N(0,\sigma_i^2)$ for $i = 1,2$. Show $X_1 + X_2$ is Gaussian and $X_1 + X_2 \overset{d}{=} N(0, \sigma_1^2 + \sigma_2^2)$. (Hint: use Remark 19.11.)

Exercise 19.6. Suppose that $Z \overset{d}{=} N(0,1)$ and $t \in \mathbb{R}$. Show $E\big[e^{tZ}\big] = \exp\big(t^2/2\big)$. (You could follow the hint in Exercise 19.3, or you could use a completion of the squares argument along with the translation invariance of Lebesgue measure.)

Exercise 19.7. Use Exercise 19.6 to give another proof that $EZ^{2k} = (2k-1)!!$ when $Z \overset{d}{=} N(0,1)$.

Exercise 19.8. Let $Z \overset{d}{=} N(0,1)$ and $\alpha \in \mathbb{R}$; find $\rho: \mathbb{R}_+ \to \mathbb{R}_+ := (0,\infty)$ such that
\[
E\big[f(|Z|^\alpha)\big] = \int_{\mathbb{R}_+} f(x)\,\rho(x)\,dx
\]
for all continuous functions $f: \mathbb{R}_+ \to \mathbb{R}$ with compact support in $\mathbb{R}_+$.

In particular, a random vector $X$ in $\mathbb{R}^d$ with second moments is a Gaussian random vector iff
\[
E\big[e^{i\lambda\cdot X}\big] = \exp\Big(-\frac12 C\lambda\cdot\lambda + i\mu\cdot\lambda\Big) \quad\text{for all } \lambda \in \mathbb{R}^d. \tag{19.16}
\]
We abbreviate Eq. (19.16) by writing $X \overset{d}{=} N(\mu,C)$. Notice that it follows from Eq. (19.13) that $C^{\mathrm{tr}} = C$ and from Eq. (19.14) that $C \ge 0$, i.e. $\lambda\cdot C\lambda \ge 0$ for all $\lambda \in \mathbb{R}^d$.

Definition 19.12. Given a Gaussian random vector, $X$, we call the pair $(C,\mu)$ appearing in Eq. (19.16) the characteristics of $X$.

Lemma 19.13. Suppose that $X = \sum_{l=1}^k Z_l v_l + \mu$ where $\{Z_l\}_{l=1}^k$ are i.i.d. standard normal random variables, $\mu \in \mathbb{R}^d$ and $v_l \in \mathbb{R}^d$ for $1 \le l \le k$. Then $X \overset{d}{=} N(\mu,C)$ where $C = \sum_{l=1}^k v_l v_l^{\mathrm{tr}}$.

Proof. Using the basic properties of independence and normal random variables we find
\[
E\big[e^{i\lambda\cdot X}\big] = E\Big[e^{i\sum_{l=1}^k Z_l\,\lambda\cdot v_l + i\lambda\cdot\mu}\Big] = e^{i\lambda\cdot\mu}\prod_{l=1}^k E\big[e^{iZ_l\,\lambda\cdot v_l}\big] = e^{i\lambda\cdot\mu}\prod_{l=1}^k e^{-\frac12(\lambda\cdot v_l)^2}
= \exp\Big(-\frac12\sum_{l=1}^k(\lambda\cdot v_l)^2 + i\lambda\cdot\mu\Big).
\]
Since
\[
\sum_{l=1}^k(\lambda\cdot v_l)^2 = \sum_{l=1}^k \lambda\cdot v_l\,\big(v_l^{\mathrm{tr}}\lambda\big) = \lambda\cdot\Big(\sum_{l=1}^k v_l v_l^{\mathrm{tr}}\Big)\lambda,
\]
we may conclude,
\[
E\big[e^{i\lambda\cdot X}\big] = \exp\Big(-\frac12 C\lambda\cdot\lambda + i\lambda\cdot\mu\Big),
\]
i.e. $X \overset{d}{=} N(\mu,C)$.

Exercise 19.9 (Existence of Gaussian random vectors for all $C \ge 0$ and $\mu \in \mathbb{R}^d$). Suppose that $\mu \in \mathbb{R}^d$ and $C$ is a symmetric non-negative $d\times d$ matrix. By the spectral theorem we know there is an orthonormal basis $\{u_j\}_{j=1}^d$ for $\mathbb{R}^d$ such that $Cu_j = \sigma_j^2 u_j$ for some $\sigma_j^2 \ge 0$. Let $\{Z_j\}_{j=1}^d$ be i.i.d. standard normal random variables; show $X := \sum_{j=1}^d Z_j\sigma_j u_j + \mu \overset{d}{=} N(\mu,C)$.
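A sketch of the construction in Exercise 19.9 (the matrix $C$ and mean $\mu$ below are arbitrary illustrative choices): diagonalize $C$, form $A = [\sigma_1 u_1|\cdots|\sigma_d u_d]$, and check the sample mean and covariance of $X = AZ + \mu$:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(5)
C = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 0.5],
              [0.0, 0.5, 1.0]])     # symmetric positive definite example
mu = np.array([1.0, -1.0, 0.0])

evals, U = np.linalg.eigh(C)        # C u_j = sigma_j^2 u_j
A = U @ np.diag(np.sqrt(np.clip(evals, 0, None)))
Z = rng.standard_normal((10**5, 3))
X = Z @ A.T + mu                    # rows are samples of X = A Z + mu

print(np.round(np.cov(X.T), 2))     # close to C
print(np.round(X.mean(axis=0), 2))  # close to mu
\end{verbatim}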

Theorem 19.14 (Gaussian Densities). Suppose that $X \overset{d}{=} N(\mu,C)$ is an $\mathbb{R}^d$-valued Gaussian random vector with $C > 0$ (for simplicity). Then
\[
E[f(X)] = \frac{1}{\sqrt{\det(2\pi C)}}\int_{\mathbb{R}^d} f(x)\exp\Big(-\frac12 C^{-1}(x-\mu)\cdot(x-\mu)\Big)\,dx \tag{19.17}
\]
for bounded or non-negative functions, $f: \mathbb{R}^d \to \mathbb{R}$.

Proof. Let us continue the notation in Exercise 19.9 and further let
\[
A := [\sigma_1 u_1|\ldots|\sigma_d u_d] = U\Sigma \tag{19.18}
\]
where
\[
U = [u_1|\ldots|u_d] \quad\text{and}\quad \Sigma = \operatorname{diag}(\sigma_1,\ldots,\sigma_d) = [\sigma_1 e_1|\ldots|\sigma_d e_d],
\]
where $\{e_i\}_{i=1}^d$ is the standard orthonormal basis for $\mathbb{R}^d$. With this notation we know that $X \overset{d}{=} AZ + \mu$ where $Z = (Z_1,\ldots,Z_d)^{\mathrm{tr}}$ is a standard normal Gaussian vector. Therefore,
\[
E[f(X)] = (2\pi)^{-d/2}\int_{\mathbb{R}^d} f(Az+\mu)\, e^{-\frac12\|z\|^2}\,dz \tag{19.19}
\]
wherein we have used
\[
\prod_{j=1}^d \frac{1}{\sqrt{2\pi}}\, e^{-\frac12 z_j^2} = (2\pi)^{-d/2}\, e^{-\frac12\|z\|^2} \quad\text{with}\quad \|z\|^2 := \sum_{j=1}^d z_j^2.
\]


Making the change of variables $x = Az+\mu$ in Eq. (19.19) (i.e. $z = A^{-1}(x-\mu)$ and $dz = dx/\det A$) implies
\[
E[f(X)] = \frac{1}{(2\pi)^{d/2}\det A}\int_{\mathbb{R}^d} f(x)\, e^{-\frac12\|A^{-1}(x-\mu)\|^2}\,dx. \tag{19.20}
\]
Recall from your linear algebra class (or just check) that $CU = U\Sigma^2$, i.e. $C = U\Sigma^2 U^{-1} = U\Sigma^2 U^{\mathrm{tr}}$. Therefore$^3$,
\[
AA^{\mathrm{tr}} = U\Sigma\Sigma U^{\mathrm{tr}} = U\Sigma^2 U^{-1} = C \tag{19.21}
\]
which then implies $\det A = \sqrt{\det C}$ and, for all $y \in \mathbb{R}^d$,
\[
\big\|A^{-1}y\big\|^2 = \big(A^{-1}\big)^{\mathrm{tr}} A^{-1} y\cdot y = \big(AA^{\mathrm{tr}}\big)^{-1} y\cdot y = C^{-1} y\cdot y.
\]
Equation (19.17) follows from these observations, Eq. (19.20), and the identity;
\[
(2\pi)^{d/2}\det A = (2\pi)^{d/2}\sqrt{\det C} = \sqrt{(2\pi)^d\det C} = \sqrt{\det(2\pi C)}.
\]

Theorem 19.15 (Gaussian Integration by Parts). Suppose that $X = (X_1,\ldots,X_d)$ is a mean zero Gaussian random vector and $C_{ij} = \operatorname{Cov}(X_i,X_j) = E[X_iX_j]$. Then for any smooth function $f:\mathbb{R}^d\to\mathbb{R}$ such that $f$ and all its derivatives grow slower than $\exp(|x|^\alpha)$ for some $\alpha < 2$, we have
\[
E[X_i f(X_1,\ldots,X_d)] = \sum_{k=1}^d C_{ik}\, E\Big[\frac{\partial}{\partial X_k} f(X_1,\ldots,X_d)\Big]
= \sum_{k=1}^d E[X_iX_k]\cdot E\Big[\frac{\partial}{\partial X_k} f(X_1,\ldots,X_d)\Big].
\]
Here we write $\frac{\partial}{\partial X_k} f(X_1,\ldots,X_d)$ for $(\partial_k f)(X_1,\ldots,X_d)$ where
\[
(\partial_k f)(x_1,\ldots,x_d) := \frac{\partial}{\partial x_k} f(x_1,\ldots,x_d) = \frac{d}{dt}\Big|_0 f(x+te_k),
\]
where $x := (x_1,\ldots,x_d)$ and $e_k$ is the $k^{\text{th}}$ standard basis vector for $\mathbb{R}^d$.

$^3$ Alternatively,

\[
\begin{aligned}
C_{ik} = \operatorname{Cov}(X_i,X_k) = \operatorname{Cov}(Y_i,Y_k)
&= \sum_{j,m}\operatorname{Cov}(A_{ij}Z_j, A_{km}Z_m) = \sum_{j,m} A_{ij}A_{km}\operatorname{Cov}(Z_j,Z_m) \\
&= \sum_{j,m} A_{ij}A_{km}\,\delta_{jm} = \sum_j A_{ij}A_{kj} = \big(AA^{\mathrm{tr}}\big)_{ik}.
\end{aligned}
\]

Proof. From Exercise 19.9 we know $X \overset{d}{=} Y$ where $Y = \sum_{j=1}^d \sigma_j Z_j u_j$, where $\{u_j\}_{j=1}^d$ is an orthonormal basis for $\mathbb{R}^d$ such that $Cu_j = \sigma_j^2 u_j$ and $\{Z_j\}_{j=1}^d$ are i.i.d. standard normal random variables. To simplify notation we define $A := [\sigma_1 u_1|\ldots|\sigma_d u_d]$ as in Eq. (19.18) so that $Y = AZ$ where $Z = (Z_1,\ldots,Z_d)^{\mathrm{tr}}$ as in the proof of Theorem 19.14. From our previous observations and a simple generalization of Lemma 19.3, it follows that
\[
\begin{aligned}
E[X_i f(X_1,\ldots,X_d)] &= E[Y_i f(Y_1,\ldots,Y_d)] = \sum_j A_{ij}\, E\big[Z_j f\big((AZ)_1,\ldots,(AZ)_d\big)\big] \\
&= \sum_j A_{ij}\, E\Big[\frac{\partial}{\partial Z_j} f\big((AZ)_1,\ldots,(AZ)_d\big)\Big] \\
&= \sum_j A_{ij}\, E\Big[\sum_k (\partial_k f)\big((AZ)_1,\ldots,(AZ)_d\big)\cdot\frac{\partial}{\partial Z_j}(AZ)_k\Big] \\
&= \sum_{j,k} A_{ij}A_{kj}\, E\big[(\partial_k f)(X_1,\ldots,X_d)\big].
\end{aligned}
\]
This completes the proof since $\sum_j A_{ij}A_{kj} = \big(AA^{\mathrm{tr}}\big)_{ik} = C_{ik}$ as we saw in Eq. (19.21).
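A numerical check of Theorem 19.15 for the illustrative choice $f(x) = \sin(x_1)\,x_2$ and a hand-picked covariance $C$ (sampling via a Cholesky factor $A$, which also satisfies $AA^{\mathrm{tr}} = C$):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(6)
C = np.array([[1.0, 0.6], [0.6, 2.0]])
A = np.linalg.cholesky(C)
X = rng.standard_normal((10**6, 2)) @ A.T   # mean zero, covariance C

f   = np.sin(X[:, 0]) * X[:, 1]
d1f = np.cos(X[:, 0]) * X[:, 1]             # df/dx_1
d2f = np.sin(X[:, 0])                       # df/dx_2
for i in range(2):
    lhs = np.mean(X[:, i] * f)
    rhs = C[i, 0] * np.mean(d1f) + C[i, 1] * np.mean(d2f)
    print(i, lhs, rhs)                      # agree up to sampling error
\end{verbatim}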

Theorem 19.16 (Wick’s Theorem). If X = (X1, . . . , X2n) is a mean zeroGaussian random vector, then

E [X1 . . . X2n] =∑

pairings

Ci1j1 . . . Cinjn

where the sum is over all perfect pairings of 1, 2, . . . , 2n and

Cij = Cov (Xi, Xj) = E [XiXj ] .

Proof. From Theorem 19.15,
\[
E[X_1\cdots X_{2n}] = \sum_j C_{1j}\, E\Big[\frac{\partial}{\partial X_j}\big(X_2\cdots X_{2n}\big)\Big] = \sum_{j\ge 2} C_{1j}\, E\big[X_2\cdots\widehat{X_j}\cdots X_{2n}\big],
\]
where the hat indicates a term to be omitted. The result now basically follows by induction. For example,
\[
\begin{aligned}
E[X_1X_2X_3X_4] &= C_{12}E\Big[\frac{\partial}{\partial X_2}(X_2X_3X_4)\Big] + C_{13}E\Big[\frac{\partial}{\partial X_3}(X_2X_3X_4)\Big] + C_{14}E\Big[\frac{\partial}{\partial X_4}(X_2X_3X_4)\Big] \\
&= C_{12}E[X_3X_4] + C_{13}E[X_2X_4] + C_{14}E[X_2X_3] \\
&= C_{12}C_{34} + C_{13}C_{24} + C_{14}C_{23}.
\end{aligned}
\]
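Wick's theorem can be tested by enumerating the perfect pairings recursively; the sketch below (with an arbitrary positive definite $C$) compares the pairing sum for $n = 2$ against a Monte Carlo estimate of $E[X_1X_2X_3X_4]$:
\begin{verbatim}
import numpy as np

def pairings(idx):
    """Yield all perfect pairings of the index list idx."""
    if not idx:
        yield []
        return
    a, rest = idx[0], idx[1:]
    for j, b in enumerate(rest):
        for sub in pairings(rest[:j] + rest[j + 1:]):
            yield [(a, b)] + sub

rng = np.random.default_rng(7)
C = np.array([[1.0, 0.3, 0.2, 0.0],
              [0.3, 1.0, 0.1, 0.4],
              [0.2, 0.1, 1.0, 0.5],
              [0.0, 0.4, 0.5, 1.0]])
A = np.linalg.cholesky(C)
X = rng.standard_normal((10**6, 4)) @ A.T

wick = sum(np.prod([C[a, b] for a, b in p])
           for p in pairings(list(range(4))))
print(np.mean(np.prod(X, axis=1)), wick)   # Monte Carlo vs pairing sum
\end{verbatim}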


Recall that if $X_i$ and $Y_j$ are independent, then $\operatorname{Cov}(X_i,Y_j) = 0$, i.e. independence implies uncorrelatedness. On the other hand, typically uncorrelated random variables are not independent. However, if the random variables involved are jointly Gaussian, then independence and uncorrelatedness are actually the same thing!

Lemma 19.17. Suppose that $Z = (X,Y)^{\mathrm{tr}}$ is a Gaussian random vector with $X \in \mathbb{R}^k$ and $Y \in \mathbb{R}^l$. Then $X$ is independent of $Y$ iff $\operatorname{Cov}(X_i,Y_j) = 0$ for all $1 \le i \le k$ and $1 \le j \le l$.

Remark 19.18. Lemma 19.17 also holds more generally. Namely, if $\{X^l\}_{l=1}^n$ is a sequence of random vectors such that $(X^1,\ldots,X^n)$ is a Gaussian random vector, then $\{X^l\}_{l=1}^n$ are independent iff $\operatorname{Cov}\big(X_i^l, X_k^{l'}\big) = 0$ for all $l \ne l'$ and $i$ and $k$.

Exercise 19.10. Prove Lemma 19.17. Hint: by basic facts about the Fourier transform, it suffices to prove
\[
E\big[e^{ix\cdot X} e^{iy\cdot Y}\big] = E\big[e^{ix\cdot X}\big]\cdot E\big[e^{iy\cdot Y}\big] \quad\text{for all } x \in \mathbb{R}^k \text{ and } y \in \mathbb{R}^l.
\]
If you get stuck, take a look at the proof of Corollary 19.19 below.

Corollary 19.19. Suppose that $X \in \mathbb{R}^k$ and $Y \in \mathbb{R}^l$ are two independent Gaussian random vectors, then $(X,Y)$ is also a Gaussian random vector. This corollary generalizes to multiple independent Gaussian random vectors.

Proof. Let $x \in \mathbb{R}^k$ and $y \in \mathbb{R}^l$, then
\[
\begin{aligned}
E\big[e^{i(x,y)\cdot(X,Y)}\big] &= E\big[e^{i(x\cdot X + y\cdot Y)}\big] = E\big[e^{ix\cdot X} e^{iy\cdot Y}\big] = E\big[e^{ix\cdot X}\big]\cdot E\big[e^{iy\cdot Y}\big] \\
&= \exp\Big(-\frac12\operatorname{Var}(x\cdot X) + iE(x\cdot X)\Big)\times\exp\Big(-\frac12\operatorname{Var}(y\cdot Y) + iE(y\cdot Y)\Big) \\
&= \exp\Big(-\frac12\operatorname{Var}(x\cdot X) + iE(x\cdot X) - \frac12\operatorname{Var}(y\cdot Y) + iE(y\cdot Y)\Big) \\
&= \exp\Big(-\frac12\operatorname{Var}(x\cdot X + y\cdot Y) + iE(x\cdot X + y\cdot Y)\Big)
\end{aligned}
\]
which shows that $(X,Y)$ is again Gaussian.

Remark 19.20 (Be careful). If $X_1$ and $X_2$ are two standard normal random variables, it is not generally true that $(X_1,X_2)$ is a Gaussian random vector. For example, suppose $X_1 \overset{d}{=} N(0,1)$ is a standard normal random variable and $\varepsilon$ is an independent Bernoulli random variable with $P(\varepsilon = \pm 1) = \frac12$. Then $X_2 := \varepsilon X_1 \overset{d}{=} N(0,1)$, but $X := (X_1,X_2)$ is not a Gaussian random vector, as we now verify. If $\lambda = (\lambda_1,\lambda_2) \in \mathbb{R}^2$, then

\[
\begin{aligned}
E\big[e^{i\lambda\cdot X}\big] &= E\big[e^{i(\lambda_1X_1+\lambda_2X_2)}\big] = E\big[e^{i(\lambda_1X_1+\lambda_2\varepsilon X_1)}\big]
= \frac12\sum_{\tau=\pm1} E\big[e^{i(\lambda_1X_1+\lambda_2\tau X_1)}\big] \\
&= \frac12\sum_{\tau=\pm1} E\big[e^{i(\lambda_1+\lambda_2\tau)X_1}\big]
= \frac12\sum_{\tau=\pm1} \exp\Big(-\frac12(\lambda_1+\lambda_2\tau)^2\Big) \\
&= \frac12\sum_{\tau=\pm1} \exp\Big(-\frac12\big(\lambda_1^2+\lambda_2^2+2\tau\lambda_1\lambda_2\big)\Big)
= \frac12\, e^{-\frac12(\lambda_1^2+\lambda_2^2)}\big[\exp(-\lambda_1\lambda_2)+\exp(\lambda_1\lambda_2)\big] \\
&= e^{-\frac12(\lambda_1^2+\lambda_2^2)}\cosh(\lambda_1\lambda_2).
\end{aligned}
\]

On the other hand, $E[X_1^2] = E[X_2^2] = 1$ and
\[
E[X_1X_2] = E\varepsilon\cdot E[X_1^2] = 0\cdot 1 = 0,
\]
from which it follows that $X_1$ and $X_2$ are uncorrelated and $C_X = I_{2\times 2}$. Thus if $X$ were Gaussian we would have,
\[
E\big[e^{i\lambda\cdot X}\big] = \exp\Big(-\frac12 C_X\lambda\cdot\lambda\Big) = e^{-\frac12(\lambda_1^2+\lambda_2^2)},
\]

which is just not the case!

Incidentally, this example also shows that two uncorrelated random variables need not be independent. For if $X_1$, $X_2$ were independent, then again we would have
\[
E\big[e^{i\lambda\cdot X}\big] = E\big[e^{i(\lambda_1X_1+\lambda_2X_2)}\big] = E\big[e^{i\lambda_1X_1} e^{i\lambda_2X_2}\big] = E\big[e^{i\lambda_1X_1}\big]\cdot E\big[e^{i\lambda_2X_2}\big] = e^{-\frac12\lambda_1^2}\, e^{-\frac12\lambda_2^2} = e^{-\frac12(\lambda_1^2+\lambda_2^2)},
\]
which is not the case.
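The counterexample of Remark 19.20 is easy to see numerically: the empirical characteristic function matches the $\cosh$ formula, not the Gaussian one (a sketch with arbitrary illustrative $\lambda$ values):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(8)
X1 = rng.standard_normal(10**6)
eps = rng.choice([-1.0, 1.0], size=10**6)
X2 = eps * X1                               # X2 ~ N(0,1), uncorrelated with X1

l1, l2 = 1.0, 1.5
emp = np.mean(np.exp(1j * (l1 * X1 + l2 * X2)))
not_gauss = np.exp(-(l1**2 + l2**2) / 2) * np.cosh(l1 * l2)
gauss = np.exp(-(l1**2 + l2**2) / 2)
print(emp.real, not_gauss, gauss)           # emp matches the cosh formula
\end{verbatim}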

The following theorem gives another useful way of computing Gaussian integrals of polynomials and exponential functions.


Theorem 19.21. Suppose $X \overset{d}{=} N(0,C)$ where $C$ is a $d\times d$ symmetric positive definite matrix. Let $L = L_C := \sum_{i,j=1}^d C_{ij}\,\partial_i\partial_j$ where $\partial_i := \partial/\partial x_i$. Then for any polynomial function, $q: \mathbb{R}^d \to \mathbb{R}$,
\[
E[q(X)] = \big(e^{\frac12 L} q\big)(0) := \sum_{n=0}^\infty \frac{1}{n!}\Big(\Big(\frac{L}{2}\Big)^n q\Big)(0) \quad\text{(a finite sum)}. \tag{19.22}
\]

Proof. This is a fairly straightforward extension of Exercise 19.2 and so I will only provide a short outline of the proof. 1) Let $u(t) := E\big[q\big(\sqrt{t}\,X\big)\big]$. 2) Using Theorem 19.15 one shows that $\dot{u}(t) = \frac12 E\big[(Lq)\big(\sqrt{t}\,X\big)\big]$. 3) Iterating this result and then using Taylor's theorem finishes the proof just like in Exercise 19.2.

Corollary 19.22. The function $u(t,x) := E\big[q\big(x+\sqrt{t}\,X\big)\big]$ solves the heat equation,
\[
\partial_t u(t,x) = \frac12 L_C\, u(t,x) \quad\text{with } u(0,x) = q(x).
\]
If $X \overset{d}{=} N(0,1)$ we have
\[
u(t,x) = \frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}} q\big(x+\sqrt{t}\,z\big)\, e^{-\frac12 z^2}\,dz = \int_{\mathbb{R}} p_t(x,y)\, q(y)\,dy
\]
where
\[
p_t(x,y) := \frac{1}{\sqrt{2\pi t}}\exp\Big(-\frac{1}{2t}(y-x)^2\Big).
\]
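The operator sum in Eq. (19.22) is again a finite symbolic computation. The sketch below (illustrative only) evaluates $(e^{L/2}q)(0)$ for $q(x_1,x_2) = x_1^2 x_2^2$ and a $2\times2$ covariance $C$, and compares with the value $C_{11}C_{22} + 2C_{12}^2$ predicted by Wick's theorem:
\begin{verbatim}
import sympy as sp

x1, x2 = sp.symbols('x1 x2')
C = sp.Matrix([[1, sp.Rational(1, 2)], [sp.Rational(1, 2), 2]])
q = x1**2 * x2**2

def L(g):
    xs = (x1, x2)
    return sum(C[i, j] * sp.diff(g, xs[i], xs[j])
               for i in range(2) for j in range(2))

total, term, n = sp.Integer(0), q, 0
while term != 0:
    total += term.subs({x1: 0, x2: 0}) / (sp.factorial(n) * 2**n)
    term = sp.expand(L(term))      # apply L once more
    n += 1

# Wick: E[X1^2 X2^2] = C11*C22 + 2*C12^2
print(total, C[0, 0] * C[1, 1] + 2 * C[0, 1]**2)
\end{verbatim}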

19.3 Gaussian Conditioning

Notation 19.23 Let $(X,Y)$ be a mean zero Gaussian random vector taking values in $\mathbb{R}^k\times\mathbb{R}^l$,
\[
C_X := E\big[XX^{\mathrm{tr}}\big],\quad C_Y := E\big[YY^{\mathrm{tr}}\big],\quad C_{X,Y} := E\big[XY^{\mathrm{tr}}\big],\quad\text{and}\quad C_{Y,X} := E\big[YX^{\mathrm{tr}}\big],
\]
so that $C_X$, $C_Y$, $C_{X,Y}$, and $C_{Y,X}$ are $k\times k$, $l\times l$, $k\times l$, and $l\times k$ matrices respectively. [Note that $C_{X,Y}^{\mathrm{tr}} = C_{Y,X}$ while $C_X^{\mathrm{tr}} = C_X$ and $C_Y^{\mathrm{tr}} = C_Y$.]

Definition 19.24 (Pseudo-Inverses). If $C$ is a symmetric $k\times k$ matrix on $\mathbb{R}^k$, let $C^{-1}$ be the $k\times k$ matrix uniquely determined by: $C^{-1}v = 0$ if $v \in \operatorname{Nul}(C)$, while if $v \in \operatorname{Nul}(C)^\perp = \operatorname{Ran}(C)$ we let $C^{-1}v = w$ where $w$ is the unique element of $\operatorname{Ran}(C)$ such that $Cw = v$. [If $C$ is invertible, then the pseudo-inverse is the same as the inverse matrix.]

Lemma 19.25. If $X$ is a mean zero $\mathbb{R}^k$-valued Gaussian random vector, then $X \in \operatorname{Ran}(C_X)$ a.s.

Proof. If $\lambda \in \operatorname{Ran}(C_X)^\perp = \operatorname{Nul}(C_X^{\mathrm{tr}}) = \operatorname{Nul}(C_X)$, then
\[
E\big[(\lambda\cdot X)^2\big] = C_X\lambda\cdot\lambda = 0 \implies \lambda\cdot X = 0 \text{ a.s.}
\]
Letting $\lambda$ run through a basis for $\operatorname{Ran}(C_X)^\perp$, it then follows that $X \in \operatorname{Ran}(C_X)$ a.s.

Lemma 19.26. If $(X,Y)$ is a mean zero Gaussian random vector taking values in $\mathbb{R}^k\times\mathbb{R}^l$, then $Z := Y - C_{Y,X}C_X^{-1}X$ is independent of $X$, i.e.
\[
Y = C_{Y,X}C_X^{-1}X + Z \tag{19.23}
\]
where $Z$ is a mean zero Gaussian random vector independent of $X$ such that
\[
C_Z = C_Y - C_{Y,X}C_X^{-1}C_{X,Y}. \tag{19.24}
\]

Proof. We look for an $l\times k$ matrix, $A$, so that $Z := Y - AX$ is independent of $X$. Since $(X,Y)$ is Gaussian it suffices to find $A$ so that
\[
0 = E\big[ZX^{\mathrm{tr}}\big] = E\big[(Y-AX)X^{\mathrm{tr}}\big] = C_{Y,X} - AC_X,
\]
i.e. we require $AC_X = C_{Y,X}$. This suggests that we let $A = C_{Y,X}C_X^{-1}$, but we must check this works even when $C_X$ is not invertible. In this case
\[
AC_X = C_{Y,X}C_X^{-1}C_X = C_{Y,X}P_X
\]
where $P_X$ is orthogonal projection onto $\operatorname{Ran}(C_X)$. The claim is that $C_{Y,X}P_X = C_{Y,X}$. The point is that if $\lambda \in \operatorname{Nul}(C_X)$, then
\[
C_{Y,X}\lambda = E\big[YX^{\mathrm{tr}}\lambda\big] = E\big[(\lambda\cdot X)\,Y\big] = E[0] = 0,
\]
wherein we have used Lemma 19.25 to conclude $\lambda\cdot X = 0$ a.s. Consequently, $C_{Y,X}$ vanishes on $\operatorname{Ran}(C_X)^\perp$ and hence $C_{Y,X}P_X = C_{Y,X}$.

So we have shown $Z := Y - C_{Y,X}C_X^{-1}X$ is independent of $X$. We now compute the covariance, $C_Z$, of $Z$;
\[
C_Z = E\big[ZZ^{\mathrm{tr}}\big] = E\Big[Z\big[Y - C_{Y,X}C_X^{-1}X\big]^{\mathrm{tr}}\Big] = E\big[ZY^{\mathrm{tr}}\big] = E\Big[\big[Y - C_{Y,X}C_X^{-1}X\big]Y^{\mathrm{tr}}\Big] = C_Y - C_{Y,X}C_X^{-1}C_{X,Y},
\]
where in the third equality we made use of the fact that $Z$ and $X$ are independent ($(X,Z)$ is still jointly Gaussian and $E[ZX^{\mathrm{tr}}] = 0$).
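Lemma 19.26 (and hence the conditioning formula of Theorem 19.27 below) can be checked numerically. The sketch below uses NumPy's pseudo-inverse for $C_X^{-1}$ and verifies that $Z = Y - C_{Y,X}C_X^{-1}X$ is uncorrelated with $X$ and has covariance (19.24); the joint covariance is randomly generated for illustration:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(9)
k, l, n = 2, 2, 2 * 10**5
B = rng.standard_normal((k + l, k + l))
C = B @ B.T                              # covariance of the joint vector (X, Y)
W = rng.standard_normal((n, k + l)) @ np.linalg.cholesky(C).T
X, Y = W[:, :k], W[:, k:]

CX  = C[:k, :k]
CYX = C[k:, :k]
A = CYX @ np.linalg.pinv(CX)             # C_{Y,X} C_X^{-1}
Z = Y - X @ A.T

print(np.round(Z.T @ X / n, 3))          # E[Z X^tr] ~ 0
print(np.round(np.cov(Z.T), 3))          # ~ C_Y - C_{Y,X} C_X^{-1} C_{X,Y}
print(np.round(C[k:, k:] - A @ CYX.T, 3))
\end{verbatim}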


Theorem 19.27 (Gaussian Conditioning). Suppose that $(X,Y)$ is a mean zero Gaussian vector taking values in $\mathbb{R}^k\times\mathbb{R}^l$. Then
\[
E[f(Y)|X] = G(X) \tag{19.25}
\]
where
\[
G(x) := E\big[f\big(C_{Y,X}C_X^{-1}x + Z\big)\big] \tag{19.26}
\]
and $Z$ is an $\mathbb{R}^l$-valued mean zero Gaussian random vector with covariance matrix $C_Z$ as in Eq. (19.24). As a special case, taking $f(y) = y$ above shows
\[
E[Y|X] = C_{Y,X}C_X^{-1}X.
\]
Proof. This follows directly from Proposition 3.18 and the decomposition in Lemma 19.26.

Corollary 19.28. If $(X,Y)$ is a Gaussian random vector taking values in $\mathbb{R}^k\times\mathbb{R}^l$ with $\mu_X = EX$ and $\mu_Y = EY$, then
\[
Y = \mu_Y + C_{Y,X}C_X^{-1}(X-\mu_X) + Z
\]
where now,
\[
C_{Y,X} = E\big[(Y-\mu_Y)(X-\mu_X)^{\mathrm{tr}}\big] = EYX^{\mathrm{tr}} - (EY)(EX)^{\mathrm{tr}}
\]
with similar definitions for $C_{X,Y}$, $C_X$, and $C_Y$, and $Z$ is a mean zero Gaussian $\mathbb{R}^l$-valued random vector independent of $X$ with covariance matrix,
\[
C_Z = C_Y - C_{Y,X}C_X^{-1}C_{X,Y}. \tag{19.27}
\]
Moreover
\[
E[f(Y)|X] = G(X) \quad\text{where}\quad G(x) := E\big[f\big(C_{Y,X}C_X^{-1}x + \bar{Z}\big)\big], \tag{19.28}
\]
where $\bar{Z}$ is an $\mathbb{R}^l$-valued Gaussian random vector, independent of $X$, such that
\[
\bar{Z} = Z + \mu_Y - C_{Y,X}C_X^{-1}\mu_X \overset{d}{=} N\big(\mu_Y - C_{Y,X}C_X^{-1}\mu_X,\ C_Z\big). \tag{19.29}
\]
Proof. Applying Lemma 19.26 to $(X-\mu_X,\, Y-\mu_Y)$ shows,
\[
Y - \mu_Y = C_{Y,X}C_X^{-1}(X-\mu_X) + Z \tag{19.30}
\]
where the matrices $C_{Y,X}$ and $C_X$ are now the covariances as described in the statement. The $\mathbb{R}^l$-valued random vector, $Z$, is still a mean zero Gaussian vector with covariance given by Eq. (19.27) as described. We may rewrite Eq. (19.30) in the form $Y = C_{Y,X}C_X^{-1}X + \bar{Z}$ where $\bar{Z}$ is given as in Eq. (19.29). The formula for $E[f(Y)|X]$ given in Eq. (19.28) follows from this decomposition along with Proposition 3.18.


20 Solution to Selected Lawler Problems


References

1. Richard F. Bass, Probabilistic techniques in analysis, Probability and its Applications (New York), Springer-Verlag, New York, 1995. MR MR1329542 (96e:60001)

2. Richard F. Bass, Diffusions and elliptic operators, Probability and its Applications (New York), Springer-Verlag, New York, 1998. MR MR1483890 (99h:60136)

3. K. L. Chung and R. J. Williams, Introduction to stochastic integration, second ed., Probability and its Applications, Birkhäuser Boston Inc., Boston, MA, 1990. MR MR1102676 (92d:60057)

4. Persi Diaconis and J. W. Neuberger, Numerical results for the Metropolis algorithm, Experiment. Math. 13 (2004), no. 2, 207-213. MR 2068894

5. J. L. Doob, Stochastic processes, Wiley Classics Library, John Wiley & Sons Inc., New York, 1990. Reprint of the 1953 original; a Wiley-Interscience Publication. MR MR1038526 (91d:60002)

6. Richard Durrett, Probability: theory and examples, second ed., Duxbury Press, Belmont, CA, 1996. MR MR1609153 (98m:60001)

7. Richard Durrett, Stochastic calculus, Probability and Stochastics Series, CRC Press, Boca Raton, FL, 1996. A practical introduction. MR MR1398879 (97k:60148)

8. Evgenii B. Dynkin and Aleksandr A. Yushkevich, Markov processes: Theorems and problems, translated from the Russian by James S. Wood, Plenum Press, New York, 1969. MR MR0242252 (39 #3585a)

9. Robert D. Gordon, Values of Mills' ratio of area to bounding ordinate and of the normal probability integral for large values of the argument, Ann. Math. Statistics 12 (1941), 364-366. MR MR0005558 (3,171e)

10. Olav Kallenberg, Foundations of modern probability, second ed., Probability and its Applications (New York), Springer-Verlag, New York, 2002. MR MR1876169 (2002m:60002)

11. Ioannis Karatzas and Steven E. Shreve, Brownian motion and stochastic calculus, second ed., Graduate Texts in Mathematics, vol. 113, Springer-Verlag, New York, 1991. MR MR1121940 (92h:60127)

12. K. L. Mengersen and R. L. Tweedie, Rates of convergence of the Hastings and Metropolis algorithms, Ann. Statist. 24 (1996), no. 1, 101-121. MR 1389882 (98c:60081)

13. Sean Meyn and Richard L. Tweedie, Markov chains and stochastic stability, second ed., Cambridge University Press, Cambridge, 2009. With a prologue by Peter W. Glynn. MR 2509253 (2010h:60206)

14. J. R. Norris, Markov chains, Cambridge Series in Statistical and Probabilistic Mathematics, vol. 2, Cambridge University Press, Cambridge, 1998. Reprint of 1997 original. MR MR1600720 (99c:60144)

15. Yuval Peres, An invitation to sample paths of Brownian motion, stat-www.berkeley.edu/~peres/bmall.pdf (2001), 1-68.

16. Philip Protter, Stochastic integration and differential equations, Applications of Mathematics (New York), vol. 21, Springer-Verlag, Berlin, 1990. A new approach. MR MR1037262 (91i:60148)

17. Sheldon M. Ross, Stochastic processes, Wiley Series in Probability and Mathematical Statistics: Probability and Mathematical Statistics, John Wiley & Sons Inc., New York, 1983. Lectures in Mathematics, 14. MR MR683455 (84m:60001)

18. Barry Simon, Functional integration and quantum physics, second ed., AMS Chelsea Publishing, Providence, RI, 2005. MR MR2105995 (2005f:81003)