Probability Theory
Instructor: Noam Berger
Lecture at TUM in Summer Semester 2013
July 1, 2013
Produced by me
Contents
1 Measure spaces
2 Basic inequalities and types of convergence
3 The law of large numbers
4 Characteristic functions and the central limit theorem
5 Conditional expectation and martingales
1 Measure spaces
16.4
In the first few lectures we give, without proofs, background from Measure Theory which
we will need in the Probability course, and set some of the basic definitions of Probability
Theory.
Let Ω be a non-empty set.
Definition 1.1 A ⊆ P(Ω), i.e. a collection A of subsets of Ω, is a σ-algebra on Ω, if
(i) A ≠ ∅,
(ii) A ∈ A ⇒ A^c ∈ A,
(iii) A_n ∈ A for all n ∈ N ⇒ ⋃_{n∈N} A_n ∈ A.
Remark 1.2 Note that if A_n ∈ A for all n ∈ N, then also ⋂_{n∈N} A_n ∈ A.
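For a finite Ω the defining conditions can be checked mechanically, since closure under countable unions reduces to closure under pairwise unions. A small sketch (the function name and examples are ours, not from the notes):

```python
from itertools import combinations

def is_sigma_algebra(omega, family):
    """Check conditions (i)-(iii) of Definition 1.1 for a finite omega.
    For a finite family, closure under countable unions reduces to
    closure under pairwise unions."""
    fam = {frozenset(s) for s in family}
    if not fam:                                                # (i) non-empty
        return False
    if any(frozenset(omega) - a not in fam for a in fam):      # (ii) complements
        return False
    return all(a | b in fam for a in fam for b in fam)         # (iii) unions

omega = {1, 2, 3, 4}
trivial = [set(), omega]
powerset = [set(c) for r in range(len(omega) + 1)
            for c in combinations(omega, r)]
not_closed = [set(), {1}, omega]   # missing the complement {2, 3, 4}

print(is_sigma_algebra(omega, trivial))     # True
print(is_sigma_algebra(omega, powerset))    # True
print(is_sigma_algebra(omega, not_closed))  # False
```

Both the trivial family {∅, Ω} and the full power set pass the check; a family missing a complement fails condition (ii).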
Definition 1.3 If A is a σ-algebra on Ω, (Ω,A) is a measurable space and each A ∈ A is
measurable.
Definition 1.4 Let U ⊆ 2^Ω. The σ-algebra generated by U is defined to be
σ(U) := ⋂ {A : U ⊆ A, A a σ-algebra on Ω}.
Definition 1.5 Let (Ω, T ) be a topological space. Borel’s σ-algebra w.r.t. the space
(Ω, T ) is the σ-algebra σ(T ). It is usually denoted by B.
We will mostly be interested in Borel’s σ-algebra in the case of the Euclidean spaces Rd.
Definition 1.6 Let (Ω_1, F_1) and (Ω_2, F_2) be measurable spaces. The product space is defined as follows: the space is Ω = Ω_1 × Ω_2, and the σ-algebra is F = σ({A × B : A ∈ F_1, B ∈ F_2}).
Remark 1.7 Often notation is abused and the product σ-algebra is denoted F1 × F2.
Note that this is not the cartesian product of F1 and F2.
Definition 1.6 can be easily extended to any finite number of spaces. Note that we seem
to have two natural ways of defining a σ-algebra on the space R2 - using Borel directly,
or multiplying the one-dimensional Borel σ-algebra with itself. However, both yield the
same σ-algebra.
We now define the product σ-algebra of infinitely many spaces. Let (Ω_k, F_k), k = 1, 2, …, be a collection of measurable spaces.
Definition 1.8 A cylinder is a set A ⊆ ∏_{k=1}^∞ Ω_k s.t. A = ∏_{k=1}^∞ A_k, where
1. A_k ∈ F_k for every k, and
2. there exists k_0 s.t. A_k = Ω_k for all k > k_0.
Now, the product σ-algebra is defined to be the σ-algebra generated by the set of cylinders.
Remark 1.9 There is a slightly simpler way of defining the same object.
Definition 1.10 Let (Ω, F) be a measurable space. A measure on (Ω, F) is a function µ : F → [0, ∞] s.t.:
1. µ is non-negative.
2. µ(∅) = 0.
3. If (A_k)_{k=1}^∞ are pairwise disjoint, then
µ(⋃_{k=1}^∞ A_k) = ∑_{k=1}^∞ µ(A_k).
A measure is called σ-finite if Ω is the union of countably many sets of finite measure. A
measure P is called a probability measure if P (Ω) = 1.
Example 1.11 Let Ω be finite, F = 2^Ω, and P(A) = |A|/|Ω|. Then P is a probability measure.
Theorem 1.12 There exists a unique measure λ on (R, B) such that λ([a, b]) = b − a for every b > a. λ is called Lebesgue's measure.
Theorem 1.13 Let (Ω_1, F_1, µ_1) and (Ω_2, F_2, µ_2) be σ-finite measure spaces. Then there
exists a unique measure µ on (Ω1×Ω2,F1×F2) s.t. µ(A1×A2) = µ1(A1)µ2(A2) for every
A1 ∈ F1 and A2 ∈ F2.
Theorem 1.14 Let (Ω_k, F_k, µ_k), k = 1, 2, …, be probability spaces. Then there exists a unique measure µ on (∏_{k=1}^∞ Ω_k, ∏_{k=1}^∞ F_k) such that µ(A) = ∏_{k=1}^∞ µ_k(A_k) for every cylinder A = ∏_{k=1}^∞ A_k.
Definition 1.15 Let (Ω, F) be a measurable space and let (X, T) be a topological space. f : Ω → X is called a measurable function if f^{-1}(A) ∈ F for all A ∈ T.
Let (Ω, F, P) be a probability space. An event is a measurable set. A random variable is a measurable function from Ω to R. The σ-algebra generated by a random variable X is σ(X) := σ({X^{-1}(A) : A ∈ T}).
19.4
1.1 Integration
Let (Ω, F, µ) be a σ-finite measure space. A measurable function f : Ω → R is called simple if it is non-negative and there is a partition A_1, A_2, …, A_n of Ω s.t. A_k is measurable for all k, and f is constant on each A_k. We define the integral of f to be
∫_Ω f dµ := ∑_{k=1}^n µ(A_k) · f|_{A_k}.
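For a discrete measure on a finite Ω this integral is literally the finite sum above; a minimal sketch (the function name and the example measure are ours):

```python
def integral_simple(partition, values, mu):
    """Integral of a simple function per the definition above:
    sum over k of mu(A_k) times the constant value of f on A_k.
    partition: disjoint measurable sets A_k covering Omega,
    values: the constant value of f on each A_k,
    mu: dict giving mu({w}) for each point (a discrete measure)."""
    return sum(v * sum(mu[w] for w in A) for A, v in zip(partition, values))

# Omega = {1, 2, 3, 4} with mu({w}) = 1/4; f = 2 on {1, 2} and f = 5 on {3, 4}.
mu = {w: 0.25 for w in range(1, 5)}
print(integral_simple([{1, 2}, {3, 4}], [2.0, 5.0], mu))  # 3.5
```

The well-definedness asked for in the exercise below amounts to the fact that refining the partition does not change this sum.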
Exercise Show that this is well defined.
For a non-negative measurable function f we define
∫_Ω f dµ := sup{ ∫_Ω g dµ : g ≤ f, g simple }.
Define
f^+(x) = f(x) if f(x) ≥ 0, and 0 if f(x) ≤ 0;   f^-(x) = −f(x) if f(x) ≤ 0, and 0 if f(x) ≥ 0.
Exercise Show that if f is measurable, then so are f+ and f−.
Definition 1.16 A measurable function f is said to be integrable if ∫_Ω f^+ dµ < ∞ and ∫_Ω f^- dµ < ∞. In this case we define
∫_Ω f dµ := ∫_Ω f^+ dµ − ∫_Ω f^- dµ.   (1.1)
We also define (infinite) integrals of non-integrable functions whenever the difference in (1.1) makes sense.
We say that a random variable X on a probability space (Ω, F, P) has an expectation if it is integrable. In this case we write E(X) := ∫_Ω X dP. We say that X has a variance if X^2 has an expectation, and write
var(X) = E[(X − E(X))^2] = E(X^2) − [E(X)]^2.
We say that X has a k-th moment if E(|X|^k) < ∞, and in this case the k-th moment of X is E(X^k). For two variables X and Y, if they both have expectations and the variable XY has an expectation too, then we say that they have a covariance, and define
cov(X, Y) := E(XY) − E(X)E(Y).
Exercise (1) Show that if X and Y have variances, then they have a covariance. (2) Show that this is not an "if and only if".
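On a finite sample space all of these moments are finite sums and can be computed exactly; a small sketch with a joint distribution of our choosing:

```python
# A joint distribution on a finite sample space: the point (x, y)
# carries probability p. Expectation, variance and covariance are then
# computed directly from the definitions above as finite sums.
dist = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

E_X = sum(p * x for (x, y), p in dist.items())
E_Y = sum(p * y for (x, y), p in dist.items())
E_XY = sum(p * x * y for (x, y), p in dist.items())
var_X = sum(p * x ** 2 for (x, y), p in dist.items()) - E_X ** 2
cov_XY = E_XY - E_X * E_Y          # cov(X, Y) = E(XY) - E(X)E(Y)

print(E_X, E_Y)              # 0.5 0.5
print(var_X)                 # 0.25
print(round(cov_XY, 10))     # 0.15
```

Here X and Y each have a variance, so by part (1) of the exercise the covariance must exist, and indeed it is a plain finite sum.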
We now state without proof some theorems about convergence of integrals, which we will
use in the course. For all of these theorems, (Ω,F , µ) is a σ-finite measure space, typically
a probability space. (f_k : Ω → R), k = 1, 2, …, are measurable.
Theorem 1.17 (Fatou's lemma)
If the functions f_k are non-negative, then
∫_Ω lim inf f_k dµ ≤ lim inf ∫_Ω f_k dµ.
We now assume that there exists a function f : Ω → R such that
µ({x ∈ Ω : lim_{k→∞} f_k(x) ≠ f(x)}) = 0.
Theorem 1.18 (Monotone convergence theorem)
If, in addition, (f_k) is a pointwise increasing sequence and the f_k are non-negative, then
∫_Ω f dµ = lim ∫_Ω f_k dµ.
Theorem 1.19 (Dominated convergence theorem)
If there exists a non-negative integrable function g s.t. |f_k(x)| ≤ g(x) for every k and (almost) all x, then
∫_Ω f dµ = lim ∫_Ω f_k dµ.
1.2 Calculus of probabilities
Let (A_k)_{k=1}^∞ be a sequence of events. We define:
lim inf_{k→∞} A_k = ⋃_{k=1}^∞ ⋂_{n=k}^∞ A_n ;   lim sup_{k→∞} A_k = ⋂_{k=1}^∞ ⋃_{n=k}^∞ A_n.
If lim inf_{k→∞} A_k = lim sup_{k→∞} A_k we say that the sequence converges. We also say that the sequence converges if the equality holds up to measure zero.
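The two set limits can be computed explicitly for an eventually periodic sequence of sets, since both depend only on the repeating tail; a short sketch (names and example ours):

```python
# lim sup A_k consists of the points lying in infinitely many A_k,
# lim inf A_k of the points lying in all but finitely many. For an
# eventually periodic sequence both are determined by one period of
# the tail.
def lim_sup_sets(tail_period):
    out = set()
    for A in tail_period:
        out |= A      # a point recurs infinitely often iff it is in
    return out        # some set of the repeating period

def lim_inf_sets(tail_period):
    out = set(tail_period[0])
    for A in tail_period[1:]:
        out &= A      # in all but finitely many iff in every set
    return out        # of the repeating period

# A_k = {0, 1} for even k and {1, 2} for odd k (for all large k):
period = [{0, 1}, {1, 2}]
print(lim_sup_sets(period))  # {0, 1, 2}
print(lim_inf_sets(period))  # {1}
```

Since lim inf ≠ lim sup here, this alternating sequence of events does not converge.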
Example 1.20
1. If the sequence (A_k)_{k=1}^∞ is increasing, then it converges to ⋃_{k=1}^∞ A_k.
2. If the sequence (A_k)_{k=1}^∞ is decreasing, then it converges to ⋂_{k=1}^∞ A_k.
Theorem 1.21 (Continuity of Probability)
Let (Ω, F, P) be a probability space, and let (A_k)_{k=1}^∞ be a sequence of events. Then
P[lim inf_{k→∞} A_k] ≤ lim inf_{k→∞} P[A_k],   (1.2)
and
P[lim sup_{k→∞} A_k] ≥ lim sup_{k→∞} P[A_k].   (1.3)
As an immediate corollary we get the following:
Corollary 1.22 If (A_k)_{k=1}^∞ is a converging sequence of events, then
P[lim_{k→∞} A_k] = lim_{k→∞} P[A_k].
23.4
Proof of Theorem 1.21: We start by showing this for increasing sequences. Let (A_k)_{k=1}^∞ be increasing, and let B_1 := A_1 and B_{k+1} := A_{k+1} − A_k. Then the events B_k, k = 1, 2, …, are pairwise disjoint, we have A_n = ⋃_{k=1}^n B_k for every n, and in the limit lim_{k→∞} A_k = ⋃_{k=1}^∞ B_k. Therefore we get
P[⋃_{k=1}^∞ A_k] = P[lim_{k→∞} A_k] = P[⋃_{k=1}^∞ B_k] = ∑_{k=1}^∞ P(B_k) = lim_{n→∞} ∑_{k=1}^n P(B_k) = lim_{n→∞} P(A_n).
Next we note that, by taking complements, the same holds for decreasing sequences. We now prove (1.2): for every k, let
B_k = ⋂_{n=k}^∞ A_n.
Then B_k ⊆ A_k and therefore P[B_k] ≤ P[A_k]. Now note that (B_k) is an increasing sequence and that
lim inf_{k→∞} A_k = lim_{k→∞} B_k.
Therefore,
P[lim inf_{k→∞} A_k] = P[lim_{k→∞} B_k] = lim_{k→∞} P[B_k] ≤ lim inf_{k→∞} P[A_k].
(1.3) is proven analogously.
1.3 Distributions and independence
1.3.1 Independence
We say that two events A_1 and A_2 are independent if P(A_1 ∩ A_2) = P(A_1)P(A_2). We say that k events A_1, …, A_k are independent if for every nonempty subset L ⊆ {1, …, k},
P[⋂_{j∈L} A_j] = ∏_{j∈L} P[A_j].
Exercise Find three events A1, A2, A3 s.t. every two are independent, but the three are
not independent.
We say that a sequence of events (Ak)∞k=1 is independent if every finite sub-collection is
independent.
We can also define the very useful notion of independence of σ-algebras: Let (Ω,F , P ) be
a probability space, and let F1 ⊆ F and F2 ⊆ F be σ-algebras. We say that F1 and F2 are
independent if A_1 and A_2 are independent for every A_1 ∈ F_1 and A_2 ∈ F_2. Analogously, we can define independence of larger collections of σ-algebras: F_1, …, F_k are independent if A_1, …, A_k are independent for all choices of A_j ∈ F_j, j = 1, …, k, and infinite collections of σ-algebras are independent if every finite sub-collection is independent.
Example 1.23 Let Ω = [0, 1]2, let B(2) be the Borel σ-algebra for Ω, and let P be the
two dimensional Lebesgue measure. Then (Ω,B(2), P ) is a probability space. Let B(1) be
the Borel σ-algebra of [0, 1]. Let F_1 = {A × [0, 1] : A ∈ B^(1)}, and let F_2 = {[0, 1] × A : A ∈ B^(1)}. Then F_1 and F_2 are independent σ-algebras.
We can now define independence of random variables: A collection of random variables is independent if the collection of induced σ-algebras is independent.
1.3.2 Distributions
Let (Ω, F, P) be a probability space, and let X : Ω → R be a random variable. The distribution of X is the Borel probability measure D_X on R induced by X. More precisely, for every Borel set A ⊆ R, we take D_X(A) = P(X^{-1}(A)).
Definition 1.24 The distribution function F_X : R → R of X is defined to be the function
F_X(a) = D_X((−∞, a]) = P(X ≤ a).
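For a uniform distribution on a finite set, the distribution function is a plain count; a small sketch of Definition 1.24 for a fair die (the example is ours):

```python
# The distribution function of a fair die, directly from the
# definition F_X(a) = P(X <= a). For a uniform distribution on a
# finite set this is a finite count.
def F(a, outcomes):
    return sum(1 for x in outcomes if x <= a) / len(outcomes)

die = [1, 2, 3, 4, 5, 6]
print(F(0.5, die))  # 0.0  (to the left of the support)
print(F(3, die))    # 0.5
print(F(6, die))    # 1.0
```

Note the function is a right-continuous step function, jumping by 1/6 at each outcome.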
Definition 1.25 Let (Ω,F , P ) be a probability space.
1. Let X_1, …, X_k be random variables. The joint distribution of X_1, …, X_k is the Borel probability measure D_{X_1,…,X_k} on R^k defined by
D_{X_1,…,X_k}(A) = P({ω ∈ Ω : (X_1(ω), …, X_k(ω)) ∈ A}).
2. Let (X_k)_{k=1}^∞ be random variables. The joint distribution of (X_k)_{k=1}^∞ is the Borel probability measure D_{X_1,X_2,…} on R^N defined by
D_{X_1,X_2,…}(A) = P({ω ∈ Ω : (X_1(ω), X_2(ω), …) ∈ A}).
26.4
1.4 Absolute continuity and the Radon–Nikodym theorem
Today we will prove a measure theoretic theorem which is very useful in Probability
Theory. Let (Ω,F) be a measurable space. Let µ and ν be measures on (Ω,F).
Definition 1.26 1. We say that ν is absolutely continuous with respect to µ if for every set A ∈ F such that µ(A) = 0 we also have ν(A) = 0. We denote this by ν ≪ µ.
2. We say that ν and µ are equivalent if µ ≪ ν and ν ≪ µ. We denote this by µ ∼ ν.
3. We say that ν and µ are singular if there exists A ∈ F such that µ(A) = 0 and ν(A^c) = 0. We denote this by µ ⊥ ν.
Example 1.27 Let (Ω, F, µ) be a σ-finite measure space, and let f : Ω → R be measurable and non-negative. Define ν : F → [0, ∞] by
ν(A) = ∫_A f dµ := ∫_Ω f·1_A dµ.
Then ν ≪ µ.
Exercise Show that ν is a measure, and that it is absolutely continuous with respect to
µ.
The next theorem will show that Example 1.27 is, in fact, the general case of absolute continuity.
Theorem 1.28 (Radon–Nikodym)
Let (Ω, F) be a measurable space, and let µ and ν be σ-finite measures on (Ω, F). Assume, in addition, that ν ≪ µ. Then there exists a measurable and non-negative f : Ω → R such that for every A ∈ F,
ν(A) = ∫_A f dµ.   (1.4)
Furthermore, f is unique up to measure zero. f is called the Radon–Nikodym derivative of ν with respect to µ, and is denoted dν/dµ.
Exercise Find a counterexample when the assumption of σ-finiteness is removed (it suffices that µ is not σ-finite).
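In the simplest instance of the theorem, two discrete measures on a finite set, the derivative can be written down pointwise; a sketch (function name and measures are our choices):

```python
# Radon-Nikodym in the discrete case: for measures on a finite set
# with nu << mu, the derivative is pointwise f(w) = nu({w}) / mu({w}),
# and then nu(A) = sum over w in A of f(w) * mu({w}), i.e. (1.4).
def rn_derivative(nu, mu):
    assert all(mu[w] > 0 or nu[w] == 0 for w in nu), "nu is not << mu"
    return {w: nu.get(w, 0.0) / mu[w] for w in mu if mu[w] > 0}

mu = {'a': 0.5, 'b': 0.25, 'c': 0.25}
nu = {'a': 0.1, 'b': 0.9, 'c': 0.0}
f = rn_derivative(nu, mu)

A = {'a', 'b'}
print(sum(f[w] * mu[w] for w in A))  # 1.0, which equals nu(A)
```

The uniqueness up to measure zero is visible here too: on points with µ({ω}) = 0 (there are none in this example) the value of f is arbitrary.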
The next definition and theorem (which we will not prove) diverge from our material.
They are, however, important for one of the homework problems.
Definition 1.29 Let (Ω,F , µ) be a measure space. We say that (Ω,F , µ) is non-atomic
if for every A ∈ F with µ(A) > 0 there exists B ⊆ A in F such that 0 < µ(B) < µ(A).
Example 1.30 1. (R, B, λ) is non-atomic.
2. Any finite probability space is atomic.
Theorem 1.31 Let (Ω, F, µ) be non-atomic, and let A ∈ F. Then for every 0 ≤ γ ≤ µ(A) there exists B ⊆ A in F such that µ(B) = γ.
1.4.1 Towards proving Theorem 1.28
We start with the following definition.
Definition 1.32 Let (Ω, F) be a measurable space. A signed measure Φ on (Ω, F) is a function Φ : F → R ∪ {−∞, +∞}, such that:
1. Φ(∅) = 0.
2. If (A_k)_{k=1}^∞ are disjoint measurable sets, then
Φ(⋃_{k=1}^∞ A_k) = ∑_{k=1}^∞ Φ(A_k),
and the sum always converges (possibly to an infinite value).
Example 1.33 The difference of two measures, at least one of which is finite, is a signed measure.
Definition 1.34 Let (Ω, F) be a measurable space, and let Φ be a signed measure on (Ω, F). Then we define
|Φ|(A) := sup{|Φ(B)| + |Φ(A \ B)| : B ⊆ A, B ∈ F}.
Exercise Prove that |Φ| is a measure on (Ω,F), and that |Φ|(A) ≥ |Φ(A)| for every A.
Theorem 1.35 (Hahn decomposition theorem)
Let (Ω,F) be a measurable space, and let Φ be a signed measure on (Ω,F). Then there
exist sets A+ ∈ F and A− ∈ F such that
1. A+ ∪ A− = Ω and A+ ∩ A− = ∅.
2. Φ(A) ≥ 0 for every measurable A ⊆ A+. A+ is called the positive set of Φ.
3. Φ(A) ≤ 0 for every measurable A ⊆ A−. A− is called the negative set of Φ.
A+ and A− are unique up to measure zero.
Proof: Let S = sup{Φ(A) : A ∈ F} and I = inf{Φ(A) : A ∈ F}.
Exercise Show that at most one of them is infinite.
Hint: show that otherwise the sum in Definition 1.32 does not make sense.
Assume without loss of generality that S < ∞. Then one can find a sequence of measurable sets (A_k)_{k=1}^∞ such that for every k,
S − 2^{−k} ≤ Φ(A_k) ≤ S.
Now let
A^+ := lim sup_{k→∞} A_k,
and let A^- := Ω \ A^+.
Claim 1.36 Φ(A+) = S.
Now, if A ⊆ A^+ has Φ(A) < 0, then Φ(A^+ \ A) > S, in contradiction to the definition of S, and similarly, if A ⊆ A^- has Φ(A) > 0 then Φ(A^+ ∪ A) > S, again in contradiction to the definition of S.
30.4
Proof of Claim 1.36: For every k and every A ⊆ A_k we have that Φ(A) ≥ −2^{−k}, and for every A ⊆ A_k^c we have that Φ(A) ≤ 2^{−k}. Therefore, for every k, j, we have
|Φ|(A_k \ A_j) ≤ 2^{−k} + 2^{−j}
and therefore
|Φ|(A_k △ A_j) ≤ 2(2^{−k} + 2^{−j}).
Now, let B_k = ⋃_{j=k}^∞ A_j. Then
|Φ|(B_k \ A_k) ≤ ∑_{j=k+1}^∞ |Φ|(A_j \ A_{j−1}) ≤ ∑_{j=k+1}^∞ [2^{−j} + 2^{1−j}] = 3 · 2^{−k}.
In particular, Φ(B_k) ≥ S − 2^{2−k}.
Now, (B_k) is a decreasing sequence, and
|Φ|(B_k \ B_{k+1}) ≤ 2^{3−k}.
Therefore, with A := A^+ = ⋂_{k=1}^∞ B_k,
|Φ|(B_k \ A) ≤ ∑_{j=k}^∞ |Φ|(B_j \ B_{j+1}) ≤ ∑_{j=k}^∞ 2^{3−j} = 2^{4−k}.
Therefore, for every k we have Φ(A) ≥ Φ(B_k) − |Φ|(B_k \ A) ≥ S − 2^{5−k}, and therefore Φ(A) ≥ S. From the definition of S, we now get Φ(A) = S.
Corollary 1.37 Let A^+, A^- and B^+, B^- be two Hahn decompositions of the space (Ω, F, Φ). Then |Φ|(A^+ △ B^+) = |Φ|(A^- △ B^-) = 0.
Proof: We show that |Φ|(A^+ \ B^+) = 0; by symmetry this suffices. Let C ⊆ A^+ \ B^+. Then Φ(C) ≥ 0 because C ⊆ A^+, and Φ(C) ≤ 0 because C ⊆ B^-, and thus Φ(C) = 0. Since this holds for every measurable C ⊆ A^+ \ B^+, we get that |Φ|(A^+ \ B^+) = 0, as desired.
We can now use Hahn's theorem to prove the Radon–Nikodym theorem.
Proof of Theorem 1.28: We first assume that both µ and ν are finite. The extension to
the σ-finite case is left as an (easy) exercise. For every α ≥ 0 rational, we define the
signed measure Φα := α · µ − ν. Then, for every α ≥ 0 rational, we define Aα to be (a
choice of) the positive set of Φα.
Claim 1.38 Let α1 < α2. Then µ(Aα1 \ Aα2) = ν(Aα1 \ Aα2) = 0.
Proof of Claim 1.38: Let A = A_{α_1} \ A_{α_2}. Then Φ_{α_1}(A) ≥ 0 (as A ⊆ A_{α_1}) and Φ_{α_2}(A) ≤ 0 (as A ⊆ A_{α_2}^c). However, by the definition of Φ_α, and since α_1 < α_2, we get Φ_{α_2}(A) ≥ Φ_{α_1}(A), and thus Φ_{α_1}(A) = Φ_{α_2}(A) = 0. Solving a linear system we get µ(A) = ν(A) = 0.
We now define the function f : Ω → [0, ∞] by
f(ω) := inf{α : ω ∈ A_α}.
Exercise Show that f is measurable, and that due to absolute continuity, µ({ω : f(ω) = ∞}) = 0.
We now need to show that (1.4) holds for every A ∈ F. We first show that
ν(A) ≤ ∫_A f dµ.
To this end, let ε > 0 be rational, and for every n = 0, 1, … let A^{(n)} := {ω ∈ A : nε ≤ f(ω) < (n+1)ε}. Define f̄ := (n+1)ε on A^{(n)}. Then f̄ ≤ f + ε. Moreover, up to measure zero, A^{(n)} ⊆ A_{(n+1)ε}. Therefore ν(A^{(n)}) ≤ (n+1)ε · µ(A^{(n)}), and we get that
ν(A) = ∑_{n=0}^∞ ν(A^{(n)}) ≤ ∑_{n=0}^∞ (n+1)ε · µ(A^{(n)}) = ∫_A f̄ dµ ≤ ∫_A f dµ + ε µ(A).
Taking ε as small as we like proves the desired inequality. The opposite inequality follows similarly, and (1.4) is proved.
Exercise Prove the uniqueness (up to µ-measure zero) of f .
3.3
Definition 1.39 Let X be a random variable, and let D_X be its distribution. We say that X has a density if D_X ≪ λ. In this case, we say that the density of X is
f_X := dD_X/dλ.
2 Basic inequalities and types of convergence
2.1 Inequalities
We begin with the most basic and most useful inequality in Probability Theory.
Theorem 2.1 (Cauchy–Schwarz inequality)
Let X, Y be variables with second moments. Then E(XY)^2 ≤ E(X^2)E(Y^2).
Proof: Assume without loss of generality that E(X^2) = E(Y^2) = 1. Define Z = X − E(XY)Y. Then E(ZY) = E(XY) − E(XY)E(Y^2) = 0. Therefore,
E(X^2)E(Y^2) = 1 = E(X^2) = E[(Z + E(XY)Y)^2] = E(Z^2) + E(XY)^2 E(Y^2) = E(XY)^2 + E(Z^2) ≥ E(XY)^2.
Theorem 2.2 (Markov's inequality)
Let X be a non-negative random variable, and let a > 0. Then
P(X ≥ a) ≤ E(X)/a.
Proof: Let Y be a random variable, defined as follows:
Y = 0 if X < a, and Y = a if X ≥ a.
Then Y ≤ X, and therefore E(Y) ≤ E(X). We get
E(X)/a ≥ E(Y)/a = (a · P(Y = a))/a = P(Y = a) = P(X ≥ a).
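A quick Monte Carlo sanity check of the inequality; the choice of an exponential variable (mean 1) is ours, not from the notes:

```python
import random

# Monte Carlo check of Markov's inequality P(X >= a) <= E(X)/a for a
# non-negative variable. For X exponential with mean 1 the true tail
# is P(X >= a) = e^{-a}, while the Markov bound is 1/a, so at a = 2
# the bound (about 0.5) is far above the truth (about 0.135).
random.seed(0)
xs = [random.expovariate(1.0) for _ in range(100_000)]

a = 2.0
empirical = sum(1 for x in xs if x >= a) / len(xs)   # about e^{-2}
bound = (sum(xs) / len(xs)) / a                      # about 1/2
print(empirical <= bound)  # True
```

The bound is crude (it uses only the first moment), which is exactly why the second-moment inequalities below improve on it.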
We now prove a general inequality which we will use later.
Theorem 2.3 (Jensen’s Inequality)
Let f : R → R be a convex function, and let X be a random variable s.t. X and f(X)
both have expectations. Then
E [f(X)] ≥ f (E[X]) .
Corollary 2.4 [E[X]]^2 ≤ E[X^2] whenever well defined.
Proof of Theorem 2.3: Since f is convex, for every x there exists an affine function g_x such that g_x(x) = f(x) and g_x(y) ≤ f(y) for every y. We take g = g_{E(X)}. Since g is affine, we get g(E[X]) = E[g(X)]. Therefore,
E[f(X)] ≥ E[g(X)] = g(E[X]) = f(E[X]).
We now prove two inequalities that relate to the second moment.
Theorem 2.5 (Chebyshev I)
Let X be a random variable with a variance. Then for every δ > 0,
P[|X − E[X]| ≥ δ] ≤ var[X]/δ^2.
Proof: Let Y = [X − E(X)]^2. Then Y is non-negative, and E(Y) = var(X). Then, by Markov's inequality,
P[|X − E[X]| ≥ δ] = P(Y ≥ δ^2) ≤ E(Y)/δ^2 = var[X]/δ^2.
Theorem 2.6 (Chebyshev II)
Let X be a non-negative random variable with a second moment. Then
P[X > 0] ≥ [E[X]]^2 / E[X^2].
Proof: Let Y = 1_{X>0}. Then E[Y] = E[Y^2] = P[X > 0] and X = XY. The Cauchy–Schwarz inequality tells us
[E[X]]^2 = [E[XY]]^2 ≤ E[X^2] · E[Y^2] = P[X > 0] · E[X^2],
and we get the required inequality by dividing both sides by E[X^2].
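Both second-moment inequalities can be verified exactly on a small discrete distribution (the distribution below is our example):

```python
# Exact check of Chebyshev I and II: X takes the value v with
# probability p, so all moments are finite sums.
vals = [(0.0, 0.3), (1.0, 0.5), (4.0, 0.2)]

EX = sum(v * p for v, p in vals)       # 1.3
EX2 = sum(v * v * p for v, p in vals)  # 3.7
var = EX2 - EX ** 2                    # 2.01

# Chebyshev I with delta = 1: P(|X - EX| >= 1) = 0.3 + 0.2 = 0.5
delta = 1.0
lhs = sum(p for v, p in vals if abs(v - EX) >= delta)
print(lhs <= var / delta ** 2)  # True  (0.5 <= 2.01)

# Chebyshev II: P(X > 0) = 0.7 versus (EX)^2 / EX2 = 1.69/3.7
p_pos = sum(p for v, p in vals if v > 0)
print(p_pos >= EX ** 2 / EX2)   # True  (0.7 >= 0.4567...)
```

Chebyshev I bounds the probability of being far from the mean from above; Chebyshev II bounds the probability of being strictly positive from below.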
7.5
2.2 The lemma of Borel-Cantelli
Let (Ω,F , P ) be a probability space, and let (Ak)∞k=1 be a sequence of events. The event
A := lim supk→∞Ak is the event that infinitely many of the events occur.
Theorem 2.7 (The Borel–Cantelli lemma) 1. If
∑_{k=1}^∞ P(A_k) < ∞,
then P(A) = 0.
2. If the events are independent and
∑_{k=1}^∞ P(A_k) = ∞,   (2.1)
then P(A) = 1.
Exercise Find a sequence (A_k)_{k=1}^∞ of events such that
∑_{k=1}^∞ P(A_k) = ∞,
but P(A) = 0.
Proof of Theorem 2.7: 1: For every k,
A ⊆ ⋃_{j=k}^∞ A_j,
and therefore
P(A) ≤ ∑_{j=k}^∞ P(A_j),
and as the RHS goes to zero as k → ∞, we get that P(A) = 0.
2: Let B_k = ⋃_{j=k}^∞ A_j. Then, using the continuity of probability (Theorem 1.21), the independence of the events, the continuity of the exponential function, and the identity log x ≤ x − 1,
P[B_k^c] = P[⋂_{j=k}^∞ A_j^c] = ∏_{j=k}^∞ P[A_j^c] = exp(∑_{j=k}^∞ log P[A_j^c])
≤ exp(∑_{j=k}^∞ (P[A_j^c] − 1)) = exp(−∑_{j=k}^∞ P[A_j]) = 0,
where the last equality follows from the divergence of the sum in (2.1). Therefore, P[B_k] = 1. Now, as A = ⋂_{k=1}^∞ B_k, we get
P[A] = lim_{k→∞} P[B_k] = 1.
2.3 Types of convergence
In this section we discuss various types of convergence of random variables, and the
relations between those types of convergence.
2.3.1 Convergence in probability
Let (X_k)_{k=1}^∞ be random variables, and let X be a random variable, all defined on the same probability space. We say that (X_k) converges to X in probability, and write
X_k →_prob X,
if for every δ > 0,
lim_{k→∞} P(|X_k − X| > δ) = 0.   (2.2)
This is equivalent to
∀ε ∃N ∀n>N : P(|X_n − X| > ε) < ε.   (2.3)
Exercise Prove the equivalence of (2.2) and (2.3).
Example 2.8 (Weak version of the weak law of large numbers)
Let (X_k)_{k=1}^∞ be i.i.d. bounded variables, and let E = E(X_1). Let S_n = (1/n) ∑_{k=1}^n X_k. Then
S_n →_prob E.
Proof: from Chebyshev.
2.3.2 Almost sure convergence
Let (X_n)_{n=1}^∞ be random variables, and let X be a random variable. Then we say that (X_n) converges to X almost surely, and write
X_n →_a.s. X,
if
P({ω : lim_{n→∞} X_n(ω) = X(ω)}) = 1.   (2.4)
This is equivalent to
∀ε ∃N : P(∃n>N : |X_n − X| > ε) < ε.   (2.5)
Exercise Prove the equivalence of (2.4) and (2.5).
Solution First, assume that (X_n) converges to X almost surely. Then for every ε, with probability 1 there exists N such that |X_n − X| < ε for every n > N, i.e.
P[⋂_{N=1}^∞ {∃n>N : |X_n − X| > ε}] = 0.
Continuity of probability now yields (2.5).
Now, assume that (2.5) holds. For every rational ε > 0, and every 0 < δ < ε, there exists N_δ such that P(∃n>N_δ : |X_n − X| > ε) < δ, and by continuity of probability,
P[∃N ∀n>N : |X_n − X| < ε] = 1.
Since there are only ℵ_0 rationals, we get
P[∀ rational ε > 0 ∃N ∀n>N : |X_n − X| < ε] = 1.
Almost sure convergence follows.
2.3.3 Convergence in Lp
Let 1 ≤ p < ∞. Let (X_n)_{n=1}^∞ be random variables, and let X be a random variable. Then we say that (X_n) converges to X in L^p, and write
X_n →_{L^p} X,
if
lim_{n→∞} E[|X_n − X|^p] = 0.
Exercise Show that if p > q, then convergence in Lp yields convergence in Lq.
Remark 2.9 We will mostly be interested in convergence in L1 and in L2.
2.3.4 Convergence in distribution
Let (X_n)_{n=1}^∞ be random variables, and let X be a random variable. Then we say that (X_n) converges to X in distribution, and write
X_n →_dist X,
if for every continuous, bounded f with compact support,
lim_{n→∞} E[f(X_n)] = E[f(X)].
Exercise Prove that this is equivalent to the following condition: for every x such that F_X is continuous at x,
lim_{n→∞} F_{X_n}(x) = F_X(x).
2.4 Relations between various types of convergence
In this section we will draw a complete diagram of relations between types of convergence.
Theorem 2.10 Let (X_n)_{n=1}^∞ be a sequence of random variables, and let X be a random variable.
1. If
X_n →_a.s. X,
then
X_n →_prob X.
2. If
X_n →_prob X,
then there exists a strictly increasing sequence (n_k)_{k=1}^∞ such that
X_{n_k} →_a.s. X.
Proof: 1. If a sequence (X_n) converges almost surely to a random variable X, then by (2.5), for every ε there exists N such that w.p. 1 − ε, for every n > N, we have |X − X_n| < ε. In particular, P[|X_n − X| > ε] < ε for every n > N, so (2.3) is satisfied.
2. We assume that the sequence (X_n) converges in probability to a random variable X. Then by (2.3) we can define a subsequence as follows: n_1 is chosen so that P[|X_{n_1} − X| > 2^{−1}] < 2^{−1}. Then inductively n_k is chosen to be a number larger than n_{k−1} such that P[|X_{n_k} − X| > 2^{−k}] < 2^{−k}. Such a choice is possible by (2.3). Now choose ε > 0. Then there exists K such that ε > 2^{−K}. Then
P[∃k>K : |X_{n_k} − X| > ε] ≤ ∑_{k=K+1}^∞ P[|X_{n_k} − X| > ε] ≤ ∑_{k=K+1}^∞ P[|X_{n_k} − X| > 2^{−k}] ≤ ∑_{k=K+1}^∞ 2^{−k} = 2^{−K} < ε.
Therefore, by (2.5), the convergence is a.s.
14.5
Theorem 2.11 If (Xn) converges to X in Lp, then (Xn) converges to X in probability.
Proof: Exercise, use Markov’s inequality.
Exercise Find examples showing that there is no implication between a.s. convergence
and convergence in Lp.
Theorem 2.12 1. If (Xn) converges to X in probability, then (Xn) converges to X in
distribution.
2. If (Xn) converges to X in distribution and X is an almost sure constant, then (Xn)
converges to X in probability.
Proof: 1. Let f be a bounded, continuous function with bounded support. Then f is uniformly continuous. Let M be a bound for |f|. Fix ε, and let δ < ε be such that if |x − y| < δ then |f(x) − f(y)| < ε. Let N be such that for all n > N we have
P[|X_n − X| > δ] < δ.
Then for n > N, let A be the event A = {|X − X_n| > δ}, and write
E(f(X)) − E(f(X_n)) = (E[f(X)1_A] − E[f(X_n)1_A]) + (E[f(X)1_{A^c}] − E[f(X_n)1_{A^c}]).
Now,
E[f(X)1_A] − E[f(X_n)1_A] ≤ 2M · P(A) < 2Mε,
and
E[f(X)1_{A^c}] − E[f(X_n)1_{A^c}] ≤ ε.
Therefore
|E(f(X)) − E(f(X_n))| < (2M + 1)ε.
2. Exercise.
So we get the following diagram of implications: almost sure convergence implies convergence in probability, convergence in L^p implies convergence in probability, and convergence in probability implies convergence in distribution; none of the remaining implications holds in general.
24.5
3 The law of large numbers
In this section we discuss the following theorem.
Theorem 3.1 Let (X_n)_{n=1}^∞ be i.i.d. random variables, and assume E(|X_1|) < ∞. Let E := E(X_1). Let
S_n = ∑_{k=1}^n X_k.
Then
S_n/n →_a.s. E.
Conversely, we have the following theorem.
Theorem 3.2 Let (X_n)_{n=1}^∞ be i.i.d. random variables, and let
S_n = ∑_{k=1}^n X_k.
Assume in addition that there exists a random variable X s.t.
S_n/n →_a.s. X.
Then E(|X_1|) < ∞, and X = E(X_1) a.s. (i.e. P(X = E(X_1)) = 1).
We will prove Theorem 3.1 in full when we study martingales (this is in the future...), but at the moment we prove an important special case.
Proof of Theorem 3.1 in the case var(X_1) < ∞: We assume that var(X_1) < ∞. Write σ^2 = var(X_1). First we prove that
S_{n^2}/n^2 →_a.s. E.
Indeed,
var(S_{n^2}) = ∑_{k=1}^{n^2} var(X_k) = n^2 σ^2,
and thus
var[S_{n^2}/n^2] = (1/n^4) var(S_{n^2}) = σ^2/n^2.
At the same time,
E[S_{n^2}/n^2] = E.
Fix ε > 0 rational. By Chebyshev's inequality,
P[|S_{n^2}/n^2 − E| ≥ ε] ≤ var[S_{n^2}/n^2]/ε^2 = (σ^2/ε^2) · (1/n^2),
and thus by Borel–Cantelli, with probability 1 there exists N_ε < ∞ such that for every n > N_ε,
|S_{n^2}/n^2 − E| < ε.   (3.1)
Therefore w.p. 1 (3.1) holds for every rational ε > 0, and therefore
S_{n^2}/n^2 →_a.s. E.
We now need to show that the non-squares do not do too much damage. Again, fix ε > 0 (rational). For every n, let
U_n = max_{n^2 < k ≤ (n+1)^2} |S_k/k − S_{n^2}/n^2|.
We want to estimate P(U_n ≥ ε).
We first estimate
P[|S_k/k − S_{n^2}/n^2| > ε]
for a given n^2 < k ≤ (n+1)^2. First we write
J_k := S_k/k − S_{n^2}/n^2 = (1/k) ∑_{j=n^2+1}^k X_j − (1/n^2 − 1/k) ∑_{j=1}^{n^2} X_j.
Thus
E(J_k) = ((k − n^2)/k − n^2 [1/n^2 − 1/k]) · E = 0.
We also need to calculate the variance:
var(J_k) = ((k − n^2)/k^2) σ^2 + ((k − n^2)/(k n^2))^2 · n^2 σ^2 ≤ (3n/n^4 + 9n^2/n^6) · σ^2 ≤ 4σ^2/n^3
for all n large enough and n^2 < k ≤ (n+1)^2.
Thus, by Chebyshev,
P[|J_k| > ε] ≤ P[|J_k − E(J_k)| > ε/2] ≤ 4 var(J_k)/ε^2 ≤ C_1/n^3,
and by a union bound,
P(U_n ≥ ε) ≤ [(n+1)^2 − n^2] · C_1/n^3 ≤ C_2/n^2,
and by Borel–Cantelli, w.p. 1, U_n < ε for all n large enough.
The theorem follows.
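A quick simulation of the statement just proved; the uniform(0, 1) variables (so E = 1/2 and finite variance) are our choice:

```python
import random

# Simulation of the strong law of large numbers: averages S_n/n of
# i.i.d. uniform(0, 1) variables, whose expectation is E = 1/2, for
# growing n. The averages settle near 1/2.
random.seed(1)

def average(n):
    return sum(random.random() for _ in range(n)) / n

for n in (10, 1_000, 100_000):
    print(n, average(n))
print(abs(average(100_000) - 0.5) < 0.01)  # True
```

A simulation of course only probes finitely many n; the content of the theorem is the almost sure convergence of the whole sequence.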
Proof of Theorem 3.2: Due to Theorem 3.1 it is sufficient to prove that E(|X_1|) < ∞. For contradiction, assume E(|X_1|) = ∞ while keeping the convergence assumption. Then
∑_{k=1}^∞ P(|X_1| ≥ k) = (−1) + ∑_{k=0}^∞ P(|X_1| ≥ k) ≥ (−1) + E(|X_1|) = ∞.
Therefore
∑_{n=1}^∞ P(|X_n| ≥ n) = ∑_{k=1}^∞ P(|X_1| ≥ k) = ∞,
and by Borel–Cantelli there a.s. exists a subsequence (X_{n_k})_{k=1}^∞ such that |X_{n_k}| ≥ n_k. For such n_k,
S_{n_k}/n_k − S_{n_k−1}/(n_k − 1) = X_{n_k}/n_k + ((n_k − 1)/n_k − 1) · S_{n_k−1}/(n_k − 1) = X_{n_k}/n_k − (1/n_k) · S_{n_k−1}/(n_k − 1).
Due to Cauchy's criterion,
lim_{k→∞} [X_{n_k}/n_k − (1/n_k) · S_{n_k−1}/(n_k − 1)] = 0.   (3.2)
On the other hand, due to convergence,
lim_{k→∞} (1/n_k) · S_{n_k−1}/(n_k − 1) = 0,   (3.3)
and due to the choice of the sequence (n_k),
lim inf_{k→∞} |X_{n_k}/n_k| ≥ 1.   (3.4)
(3.3) and (3.4) contradict (3.2).
Exercise: Is the following true? Let (X_n)_{n=1}^∞ be i.i.d. random variables, and let
S_n = ∑_{k=1}^n X_k.
Assume in addition that there exists a random variable X s.t.
S_n/n →_prob X.
Then E(|X_1|) < ∞, and X = E(X_1) a.s. (i.e. P(X = E(X_1)) = 1)?
28.5
4 Characteristic functions and the central limit theorem
4.1 Characteristic functions
Let X be a random variable, and let t ∈ R. Then the (complex) random variable eitX is
bounded by 1 and therefore has an expectation. Therefore, for every random variable X,
we can define its characteristic function φX : R→ C by φX(t) = E[eitX ].
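For a discrete random variable the defining expectation is a finite sum, so the characteristic function can be evaluated exactly; a small sketch with a fair coin on {0, 1} (our example):

```python
import cmath

# For a discrete random variable the characteristic function is a
# finite sum: phi_X(t) = sum over x of P(X = x) * e^{itx}. For a fair
# coin on {0, 1} this gives phi(t) = (1 + e^{it})/2, so phi(0) = 1
# and |phi(t)| <= 1 for every t.
def char_fn(dist, t):
    return sum(p * cmath.exp(1j * t * x) for x, p in dist)

coin = [(0.0, 0.5), (1.0, 0.5)]
print(char_fn(coin, 0.0))              # (1+0j)
print(abs(char_fn(coin, 2.0)) <= 1.0)  # True
```

The two printed facts are instances of the general properties listed next.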
We start by discussing the basic properties of the characteristic function.
Lemma 4.1 Let X be a random variable, and let φX : R → C be its characteristic
function.
1. φ_X is determined by the distribution of X alone.
2. φX(0) = 1.
3. |φX(t)| ≤ 1 for every t ∈ R.
4. φX is continuous in R.
5. If X has an expectation, then φX is everywhere differentiable, and φ′X(0) = iE(X).
Further, |φ′X(t)| ≤ E(|X|) for every t.
6. If X has a variance, then φ_X is everywhere twice differentiable, and φ″_X(0) = −E(X^2). Further, |φ″_X(t)| ≤ E(X^2) for every t.
7. If X and Y are independent, then φX+Y (t) = φX(t) · φY (t) for every t.
Exercise: Show that in parts 5 and 6, the function is, in fact, continuously (twice)
differentiable.
Proof of Lemma 4.1: 1. and 2. are obvious. 3. This is clear because e^{itX} is bounded by 1.
4. Fix ε and let M be such that P[|X| ≥ M] < ε/2. Let t_1, t_2 be such that |t_1 − t_2| < ε/(4M). Then,
|φ_X(t_1) − φ_X(t_2)| ≤ |E[(e^{it_1 X} − e^{it_2 X}) · 1_{|X|<M}]| + |E[(e^{it_1 X} − e^{it_2 X}) · 1_{|X|≥M}]| ≤ M|t_1 − t_2| + 2P[|X| ≥ M] < ε.
5. For every t and every h > 0,
|(e^{i(t+h)X} − e^{itX})/h| ≤ |X|.
As |X| is integrable, by the dominated convergence theorem,
E[iX e^{itX}] = E[lim_{h→0} (e^{i(t+h)X} − e^{itX})/h] = lim_{h→0} E[(e^{i(t+h)X} − e^{itX})/h] = lim_{h→0} (1/h)(E[e^{i(t+h)X}] − E[e^{itX}]) = dφ_X(t)/dt.
31.5
6. Same proof as that of 5.
7. From independence,
φX+Y (t) = E[eit(X+Y )] = E[eitXeitY ] = E[eitX ]E[eitY ] = φX(t)φY (t).
Example 4.2 Let X ∼ N(0, 1). We will calculate φ_X(t) for every t ∈ R. Let f be the density function of X, i.e.
f(x) = e^{−x^2/2}/√(2π).
Then,
φ_X(t) = E[e^{itX}] = ∫_{−∞}^∞ e^{itx} f(x) dx = (1/√(2π)) ∫_{−∞}^∞ exp(itx − x^2/2) dx
= (1/√(2π)) ∫_{−∞}^∞ exp(−(1/2)(x^2 − 2itx − t^2) − t^2/2) dx
= (e^{−t^2/2}/√(2π)) ∫_{−∞}^∞ exp(−(x − it)^2/2) dx = e^{−t^2/2},
where the last equality (i.e. the calculation of the integral) follows from Cauchy's theorem.
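The identity of Example 4.2 can be checked numerically; the truncation to [-10, 10] and the step size are our choices:

```python
import math

# Numerical check of Example 4.2: approximate the integral of
# e^{itx} f(x) for the standard normal density f by a Riemann sum on
# [-10, 10] (the tails beyond are negligible) and compare with
# e^{-t^2/2}. By symmetry the imaginary part integrates to zero, so
# only cos(tx) is summed.
def phi_normal(t, h=1e-3):
    total, x = 0.0, -10.0
    while x < 10.0:
        total += math.cos(t * x) * math.exp(-x * x / 2) * h
        x += h
    return total / math.sqrt(2 * math.pi)

for t in (0.0, 1.0, 2.0):
    print(t, round(phi_normal(t), 6), round(math.exp(-t * t / 2), 6))
```

The two columns agree to the displayed precision, as the contour-shift argument predicts.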
Exercise: Calculate the characteristic functions of Bernoulli, exponential and binomial
distributions.
4.2 The central limit theorem
Let (X_n)_{n=1}^∞ be i.i.d. random variables, and assume E(X_1) = µ, var(X_1) = σ^2 < ∞. We want to understand the behavior of the variable
S_n = ∑_{k=1}^n X_k
for n large. In particular, we want to establish some sort of convergence to a non-trivial variable. E(S_n) = nµ, so in order to get convergence we need to subtract nµ. Once this is done, the variance of S_n − nµ is nσ^2, so we may want to divide by σ√n = √(nσ^2).
Our purpose now is to prove the following theorem.
Theorem 4.3 (Central limit theorem)
Let (X_n)_{n=1}^∞ be i.i.d. random variables, and assume E(X_1) = µ, var(X_1) = σ^2 < ∞. Then,
(S_n − nµ)/(σ√n) →_dist N(0, 1).
Before proving the theorem, we note that w.l.o.g. we may assume µ = 0, σ^2 = 1, and then we get the more aesthetic form
S_n/√n →_dist N(0, 1).
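The theorem can be seen in a simulation; the uniform summands and the sample sizes are our choices:

```python
import random
import statistics

# Simulation of the CLT with X_i uniform on [0, 1], so mu = 1/2 and
# sigma^2 = 1/12: the normalized sums (S_n - n*mu)/(sigma*sqrt(n))
# should be approximately N(0, 1). For Z ~ N(0, 1), P(|Z| <= 1) is
# about 0.68.
random.seed(2)
n, trials = 100, 5_000
zs = [(sum(random.random() for _ in range(n)) - n * 0.5) / (n / 12) ** 0.5
      for _ in range(trials)]

print(abs(statistics.mean(zs)) < 0.1)  # True (empirical mean near 0)
inside = sum(1 for z in zs if abs(z) <= 1) / trials
print(0.65 < inside < 0.72)            # True (near 0.68)
```

Already at n = 100 the empirical fraction inside one standard deviation is close to the Gaussian value, which is the convergence in distribution asserted above.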
We start with a lemma that gives the intuitive explanation of the CLT.
Lemma 4.4 Let (X_n)_{n=1}^∞ be i.i.d. random variables such that E(X_1) = 0 and var(X_1) = 1. Let X ∼ N(0, 1). Let
S_n = ∑_{k=1}^n X_k,
and let U_n = S_n/√n. Then for every t ∈ R,
lim_{n→∞} φ_{U_n}(t) = φ_X(t).
Proof: Fix t. Then φ_{S_n}(t) = [φ_{X_1}(t)]^n, and
φ_{U_n}(t) = E[e^{iU_n t}] = E[e^{iS_n t/√n}] = φ_{S_n}(t/√n) = [φ_{X_1}(t/√n)]^n.
φ_{X_1}(0) = 1 and φ_{X_1} is continuous, and therefore for all n large enough,
log[φ_{X_1}(t/√n)]
is well defined. Thus, for n large enough,
log(φ_{U_n}(t)) = n log φ_{X_1}(t/√n).
Note that by Taylor's theorem,
log φ_{X_1}(t/√n) = log φ_{X_1}(0) + (log φ_{X_1})′(0) · t/√n + (log φ_{X_1})″(0) · t^2/(2n) + R_n,
where lim_{n→∞} nR_n = 0. Also, note that log φ_{X_1}(0) = 0, (log φ_{X_1})′(0) = 0, (log φ_{X_1})″(0) = −1. Therefore,
lim_{n→∞} log(φ_{U_n}(t)) = lim_{n→∞} n log φ_{X_1}(t/√n) = lim_{n→∞} [n · (−t^2/(2n)) + nR_n] = −t^2/2,
and therefore
lim_{n→∞} φ_{U_n}(t) = e^{−t^2/2}.
Class on 4.6 cancelled.
7.6
We now show that the convergence of the characteristic functions indeed guarantees convergence of the distributions. We do it in a few steps. The first step seems to be the
converse of the desired statement.
Lemma 4.5 If
X_n →_dist X,
then for every t ∈ R,
lim_{n→∞} φ_{X_n}(t) = φ_X(t).
Proof: First we define for every M > 0 the cut-off function g_M : R → R by
g_M(x) = 1 for |x| ≤ M;  g_M(x) = M + 1 − |x| for M ≤ |x| ≤ M + 1;  g_M(x) = 0 for |x| ≥ M + 1.
Let f : R → C be defined by f(x) = e^{itx}, and let f_M(x) = f(x)g_M(x). Then for every M,
lim_{n→∞} E(f_M(X_n)) = E(f_M(X)).
For every ε, there exists M such that P(|X| > M) < ε and P(|X_n| > M) < ε for every n. For this M and each n, we have that |E[f_M(X_n)] − E[f(X_n)]| < ε, and similarly, |E[f_M(X)] − E[f(X)]| < ε. The lemma follows.
The next lemma shows that the characteristic function determines the distribution.
Lemma 4.6 Assume X1 and X2 are two random variables with the same characteristic
function φ. Then X1 and X2 have the same distribution.
Proof: Let P_1 be the distribution of X_1, and P_2 that of X_2. It suffices to show that for every interval [a, b], where neither a nor b is an atom of P_1 or P_2, we have P_1([a, b]) = P_2([a, b]). By translation and multiplication by a constant, we may assume that the interval [a, b] is in fact the interval [−1, 1].
We now claim:
P_1([−1, 1]) = P_2([−1, 1]) = lim_{T→∞} (1/2π) ∫_{−T}^T ((e^{it} − e^{−it})/(it)) φ(t) dt.
Let
I(T) = ∫_{−T}^T ((e^{it} − e^{−it})/(it)) φ(t) dt = ∫_{−T}^T ∫ ((e^{it} − e^{−it})/(it)) e^{itx} dP(x) dt.
Note that the integrand is bounded by 2, and both measures are finite, and therefore we may use Fubini and say
I(T) = ∫_{−T}^T ∫ ((e^{it} − e^{−it})/(it)) e^{itx} dP(x) dt = ∫ ∫_{−T}^T ((e^{it} − e^{−it})/(it)) e^{itx} dt dP(x)
= ∫ [∫_{−T}^T (sin((x+1)t)/t) dt − ∫_{−T}^T (sin((x−1)t)/t) dt] dP(x).
Introducing the notation R(T, θ) = ∫_{−T}^T (sin(θt)/t) dt and S(T) = ∫_0^T (sin t/t) dt, we get
I(T) = ∫ (R(T, x + 1) − R(T, x − 1)) dP(x).   (4.1)
Changing the integration variable tells us that
R(T, θ) = 2 sgn(θ) S(T|θ|).
It is well known that limT→∞ S(T ) = π/2 (and even if not, it only changes the constant
in front of the integral), and therefore, for every x,
lim_{T→∞} (R(T, x + 1) − R(T, x − 1)) =  0 if |x| > 1;  π if |x| = 1;  2π if |x| < 1.
S(T ) is bounded, and therefore R is also bounded, and therefore by applying the bounded
convergence theorem we get the desired result.
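The inversion formula can be checked numerically for a distribution one knows explicitly. Below is a sketch (with an arbitrary choice of example, not from the lecture) for P = Uniform[−1/2, 1/2], whose characteristic function is φ(t) = sin(t/2)/(t/2); here P([−1, 1]) = 1, and the truncated integral indeed approaches 1:

```python
import numpy as np

# Uniform[-1/2, 1/2]: phi(t) = sin(t/2)/(t/2) and P([-1, 1]) = 1.
# The kernel (e^{it} - e^{-it})/(it) equals 2 sin(t)/t; np.sinc(x) = sin(pi x)/(pi x).
T = 200.0
t = np.linspace(-T, T, 400001)
integrand = 2 * np.sinc(t / np.pi) * np.sinc(t / (2 * np.pi))
approx = integrand.sum() * (t[1] - t[0]) / (2 * np.pi)
print(approx)  # close to 1
```

The convergence in T is slow because the integrand only decays like 1/t², which matches the oscillatory limit computed above.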
11.6
We now define the notion of tightness for random variables. Let (Xn)∞n=1 be a sequence
of random variables. We say that (Xn) is tight if for every ε there exists M such that for
every n,
P (|Xn| > M) < ε.
Lemma 4.7 Let (X_n)_{n=1}^∞ be a tight sequence of random variables. Then there exists a subsequence (X_{n_k})_{k=1}^∞ which converges in distribution.
To prove Lemma 4.7, we need to use the Riesz representation theorem.
Theorem 4.8 (Riesz representation)
Let C be the space of continuous functions with bounded support from R to R. Then every
bounded linear functional ψ : C → R can be represented as the integral with respect to a
signed measure.
Proof of Lemma 4.7: Let F be a countable dense collection of functions in C, and let f_1, f_2, . . . be an ordering of F. We take a subsequence (X_n^{(1)}) of (X_n) such that E(f_1(X_n^{(1)})) converges, then a subsequence (X_n^{(2)}) of (X_n^{(1)}) such that E(f_2(X_n^{(2)})) converges, and so on. At the end we take the diagonal subsequence X_{n_k} = X_k^{(k)}. Let φ : C → R be the functional defined by φ(f) = lim_{k→∞} E(f(X_{n_k})) on F, and elsewhere by continuity. φ is bounded, and therefore is the integral w.r.t. a signed measure µ. φ is non-negative, and therefore µ is non-negative. All we need to prove is that µ has total mass one. To this end, we use tightness. Fix ε, and let M be such that P(|X_n| > M) < ε for every n. Let

f(x) = 1 for |x| ≤ M;  f(x) = M + 1 − |x| for M ≤ |x| ≤ M + 1;  f(x) = 0 for |x| ≥ M + 1.

Then µ(f) ≥ 1 − ε, and thus µ(R) ≥ 1 − ε; since ε was arbitrary, µ(R) ≥ 1. That µ(R) ≤ 1 is an easy exercise. Therefore, X_{n_k} converges in distribution to any variable with distribution µ.
Lemma 4.9 Let (Xn) be random variables, and assume that φXn converges pointwise to
φ, where φ(0) = 1 and φ is continuous at zero. Then the sequence Xn is tight.
Proof: Fix ε. There exists δ s.t. for every n,

(1/2δ) ∫_{−δ}^{δ} φ_{X_n}(t) dt > 1 − ε/2.
Now fix (large) M. Then

(1/2δ) ∫_{−δ}^{δ} φ_{X_n}(t) dt ≤ P(|X_n| < M) + (1/2δ) max_{|m|>M} | ∫_{−δ}^{δ} e^{imt} dt |

≤ P(|X_n| < M) + (1/2δ) max_{|m|>M} |e^{imδ} − e^{−imδ}| / |m| ≤ P(|X_n| < M) + 1/(δM).

Therefore, for M such that 1/(δM) < ε/2, we get that P(|X_n| > M) < ε.
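As a concrete instance of the bound in the proof (an illustrative computation, not from the notes, taking X ~ N(0,1), so that φ(t) = e^{−t²/2} and P(|X| < M) has a closed form via the error function):

```python
from math import erf, sqrt, pi

# For X ~ N(0,1): (1/2d) * integral_{-d}^{d} e^{-t^2/2} dt has the closed
# form sqrt(pi/2) * erf(d/sqrt(2)) / d, and P(|X| < M) = erf(M/sqrt(2)).
d, M = 0.5, 10.0
lhs = sqrt(pi / 2) * erf(d / sqrt(2)) / d
rhs = erf(M / sqrt(2)) + 1.0 / (d * M)
print(lhs, rhs)  # lhs sits below rhs, as the proof's inequality predicts
```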
Proof of Theorem 4.3: By Lemma 4.4, φ_{X_n}(t) → e^{−t²/2}, which is a continuous function. Therefore, by Lemma 4.9, the sequence (X_n) is tight, and therefore by Lemma 4.7 every subsequence of (X_n) has a convergent subsubsequence. By Lemma 4.5 the characteristic function of the limit of any such subsubsequence is e^{−t²/2}, and thus by Lemma 4.6 the limit is N(0, 1). The theorem follows.
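The theorem can also be illustrated with a quick Monte Carlo experiment (a sketch with arbitrary parameters): normalized sums of centered uniform variables should have an empirical CDF close to that of N(0,1).

```python
import numpy as np
from math import erf, sqrt

# Normalized sums of centered Uniform[-1/2, 1/2] steps (variance 1/12);
# compare the empirical CDF at one point with Phi, the N(0,1) CDF.
rng = np.random.default_rng(7)
n, trials = 200, 20000
steps = rng.random((trials, n)) - 0.5
U = steps.sum(axis=1) / np.sqrt(n / 12.0)
x = 0.5
empirical = (U <= x).mean()
Phi = 0.5 * (1 + erf(x / sqrt(2)))  # about 0.6915
print(empirical, Phi)
```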
14.6 Tal Orenshtein
5 Conditional expectation and martingales
5.1 Conditional expectation
Let (Ω,F , P ) be a probability space, and let X be a random variable on it. The expecta-
tion of the random variable X is its average value. We may define its average conditioned
on an event A. This is done by integrating with respect to the probability measure conditioned on A. If we now partition Ω into events A_1, . . . , A_k (a partition means that the A_i-s are disjoint and their union is Ω), we can calculate the expectation conditioned on each of those events, and define a new variable Y as follows: if A_i occurs, then Y takes the value E(X|A_i).

It is easy to see that the variable Y is the unique variable satisfying the following properties:
1. Y is measurable w.r.t. the σ-algebra G := σ(A1, . . . , Ak).
2. For every (bounded) random variable Z which is measurable w.r.t. G, we have
E(ZY ) = E(ZX).
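On a finite space the two properties are easy to verify directly. The following toy computation (made-up weights and values, purely illustrative) builds Y from a partition and checks E(ZY) = E(ZX) for a G-measurable Z:

```python
import numpy as np

# Omega = {0,...,5} with made-up weights; G generated by the partition
# A1 = {0,1,2}, A2 = {3,4,5}.  Y equals E(X|A_i) on A_i.
p = np.array([0.1, 0.2, 0.1, 0.25, 0.15, 0.2])
X = np.array([1.0, 4.0, -2.0, 0.5, 3.0, 2.0])
parts = [np.array([0, 1, 2]), np.array([3, 4, 5])]

Y = np.empty_like(X)
for A in parts:
    Y[A] = np.dot(p[A], X[A]) / p[A].sum()

# A G-measurable Z is constant on each block; pick arbitrary constants.
Z = np.empty_like(X)
Z[parts[0]], Z[parts[1]] = 5.0, -1.0
gap = abs(np.dot(p, Z * Y) - np.dot(p, Z * X))
print(gap < 1e-12)  # True
```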
This formulation allows us to generalize the definition of conditional expectation to more
general σ-algebras.
Definition 5.1 Let (Ω,F , P ) be a probability space, let X be a random variable which
has an expectation, and let G ⊆ F be a σ-algebra on Ω. We say that a random variable
Y is the conditional expectation of X w.r.t. G, and denote Y = E(X|G) if:
1. Y is measurable w.r.t. G.
2. For every bounded random variable Z which is measurable w.r.t. G, we have
E(ZY ) = E(ZX).
It is not immediately obvious that the conditional expectation exists, or that it is unique.
We will give an example where this is a priori not obvious, and then prove both existence
and uniqueness of the conditional expectation.
Example 5.2 Let Ω = [0, 1), F = B, P = λ. We take X(ω) = sin ω. We then let Y_k be the k-th digit in the decimal expansion of ω, and take G = σ(Y_2, Y_4, Y_6, . . .).
Lemma 5.3 (Existence)
The conditional expectation always exists.
Proof: Let Q be the following measure defined on (Ω, F):

Q(A) = ∫_A X dP.
Note that P and Q are also defined on (Ω, G) because G ⊆ F. Now note that Q ≪ P.
Indeed, if P(A) = 0 then Q(A) = 0. Therefore there exists a Radon-Nikodym derivative
of Q w.r.t. P on the measurable space (Ω, G). We call this derivative Y, and claim that
Y is the conditional expectation. We verify the conditions in the definition:
1. This is obvious as Y was defined on the space (Ω,G).
2. Z is measurable in the space (Ω, G). Therefore,

∫ Z dQ = ∫ ZY dP = E(ZY).

Z is also measurable in the space (Ω, F), and thus

∫ Z dQ = ∫ ZX dP = E(ZX),

and we get E(ZY) = E(ZX).
Lemma 5.4 (Uniqueness)
Assume that Y1 is the conditional expectation of X w.r.t. G, and that at the same time
Y2 is the conditional expectation of X w.r.t. G. Then P (Y1 = Y2) = 1.
Proof: Assume P(Y_1 ≠ Y_2) > 0. Without loss of generality, there exists ε > 0 s.t. P(Y_1 − Y_2 > ε) > ε. Let Z = 1_{Y_1−Y_2>ε}. Then Z is bounded and is measurable w.r.t. G. Now, by the choice of Z, we have E(ZY_1) − E(ZY_2) ≥ ε · P(Y_1 − Y_2 > ε) > ε² > 0, in contradiction to the assumption that E(ZY_1) = E(ZX) = E(ZY_2).
We now collect some useful facts regarding the conditional expectation.
Lemma 5.5 1. Let G = {∅, Ω} be the trivial σ-algebra. Then E(X|G) = E(X) a.s.

2. Let G_2 ⊆ G_1 ⊆ F. Then E(E(X|G_1)|G_2) = E(X|G_2).

3. Jensen's inequality holds for conditional expectations, namely if f is convex then E(f(X)|G) ≥ f(E(X|G)) a.s. when everything is well-defined.

4. Define

var(X|G) := E[(X − E(X|G))²|G] = E(X²|G) − [E(X|G)]².

If var(X) < ∞, then var(X) = E(var(X|G)) + var(E(X|G)).
Exercise Find a variable X and σ-algebras G_1 and G_2 such that E(E(X|G_1)|G_2) ≠ E(E(X|G_2)|G_1).

Exercise Prove that E(E(X|G)) = E(X).
Proof: 1. Let Y = E(X|G). Y is a constant, and it has to satisfy Y = E(Y · 1) =
E(X · 1) = E(X).
2. Let Y1 = E(X|G1) and Y2 = E(X|G2). Let Z be measurable w.r.t. G2. Note that Z is
also measurable w.r.t. G1 because G2 ⊆ G1. Therefore,
E(ZY2) = E(ZX) = E(ZY1),
29
so Y2 = E(Y1|G2).
3. Let g : R² → R be a Borel measurable function s.t. (a) for every x, the function g(x, ·) is affine, (b) g(x, y) ≤ f(y) for every x and y, and (c) g(x, x) = f(x) for every x.
Exercise: Prove the existence of such g.
Now we may take Y = g(E(X|G), X). Then Y ≤ f(X), and a.s.

f(E(X|G)) = E(Y|G) ≤ E(f(X)|G).
4. Assume w.l.o.g. that E(X) = 0. Let Y = E(X|G) and W = X − Y. Then E(XY) = E(Y²) and therefore E(WY) = 0. Therefore,

var(X) = E(X²) = E(Y²) + E(W²) = var(E(X|G)) + E[(X − E(X|G))²] = E(var(X|G)) + var(E(X|G)).
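Part 4 (the law of total variance) can likewise be checked on a small finite space; the numbers below are made up for illustration:

```python
import numpy as np

# Law of total variance on a finite space with made-up numbers:
# var(X) = E(var(X|G)) + var(E(X|G)), G generated by {0,1,2} and {3,4,5}.
p = np.array([0.15, 0.1, 0.2, 0.25, 0.1, 0.2])
X = np.array([2.0, -1.0, 0.0, 3.0, 1.0, -2.0])
parts = [np.array([0, 1, 2]), np.array([3, 4, 5])]

EX_G = np.empty_like(X)      # E(X|G), constant on each block
varX_G = np.empty_like(X)    # var(X|G)
for A in parts:
    w = p[A] / p[A].sum()
    m = np.dot(w, X[A])
    EX_G[A] = m
    varX_G[A] = np.dot(w, (X[A] - m) ** 2)

var_X = np.dot(p, (X - np.dot(p, X)) ** 2)
decomposed = np.dot(p, varX_G) + np.dot(p, (EX_G - np.dot(p, EX_G)) ** 2)
gap = abs(var_X - decomposed)
print(gap < 1e-12)  # True
```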
18.6
5.2 Filtrations
Definition 5.6 Let (Ω,F , P ) be a probability space. A filtration is a sequence (Gn)∞n=1
of σ-algebras on Ω such that
1. For every n, Gn+1 is a refinement of Gn, i.e. Gn ⊆ Gn+1.
2. For every n, Gn ⊆ F .
We may also speak about finite filtrations, namely finite sequences of σ-algebras.
Examples:
1. Let Ω = [0, 1]^{ℵ_0} with the product σ-algebra and measure. We take

G_n = B × B × · · · × B × {∅, [0, 1]} × {∅, [0, 1]} × · · · ,

with n copies of the Borel σ-algebra B.
2. Let (Xn) be a sequence of random variables, then take Gn = σ(X1, . . . , Xn).
We can also define the notion of an inverse filtration which we will use later.
Definition 5.7 Let (Ω, F, P) be a probability space. An inverse filtration is a sequence (G_n)_{n=1}^∞ of σ-algebras on Ω such that
1. For every n, Gn is a refinement of Gn+1, i.e. Gn+1 ⊆ Gn.
2. For every n, Gn ⊆ F .
Example: Let (Xn) be a sequence of random variables, then take Gn = σ(Xn, Xn+1, . . .).
5.3 Martingales
Definition 5.8 Let (Xn) be a sequence of random variables, and let (Gn) be a filtration.
We say that (Xn) is a martingale with respect to (Gn) if
1. E(|Xn|) <∞ for every n.
2. Xn is measurable with respect to Gn for every n.
3. Xn = E(Xn+1|Gn) for every n.
Definition 5.9 A sequence (Xn) of random variables is called a martingale if there exists
a filtration (Gn) such that (Xn) is a martingale w.r.t. (Gn).
Examples

1. Simple random walk.

2. Conditional expectations of the same variable w.r.t. a filtration.
Lemma 5.10 1. A sequence (Xn) of random variables is a martingale if and only if
it is a martingale with respect to the natural filtration Kn = σ(X1, . . . , Xn).
2. Let (X_n) be a martingale w.r.t. a filtration (G_n). For every n, m, E(X_n) = E(X_m). Also, if m > n then X_n = E(X_m|G_n).
3. If (Xn) and (Yn) are both martingales w.r.t. (Gn), then (Xn+Yn) is also a martingale
w.r.t. (Gn).
Exercise:
1. Find two martingales whose sum is not a martingale.
2. (*) Find an example of a martingale (Xn) w.r.t. a filtration (Gn) such that there is
no variable X s.t. Xn = E(X|Gn) for all n.
Proof of Lemma 5.10: 1. Let (X_n) be a martingale w.r.t. (G_n). Clearly, E(|X_n|) < ∞ for every n, and it is also obvious that X_n is measurable with respect to K_n. We need to show that X_n = E(X_{n+1}|K_n). Note that K_n ⊆ G_n, and therefore every Z which is measurable w.r.t. K_n is also measurable w.r.t. G_n. Thus for all such Z, E(ZX_{n+1}) = E(ZX_n), and X_n = E(X_{n+1}|K_n).
21.6
2. This follows from Lemma 5.5.
3. This follows from the linearity of the conditional expectation.
Definition 5.11 Let (Xn) be a sequence of random variables, and let (Gn) be a filtration.
We say that (Xn) is a sub-martingale (resp super-martingale) with respect to (Gn) if
1. E(|Xn|) <∞ for every n.
2. Xn is measurable with respect to Gn for every n.
3. Xn ≤ E(Xn+1|Gn) (resp. Xn ≥ E(Xn+1|Gn)) for every n.
Examples
1. Random walk with a drift.
2. Let (Xn) be a martingale, and let f be convex. Then (f(Xn)) is a sub-martingale.
3. Let (Xn) be a martingale, and let f be concave. Then (f(Xn)) is a super-martingale.
4. Let (X_n) be a martingale. Then (|X_n|), (X_n^+) and (X_n^−) are sub-martingales.
5.4 Stopping times and the Optional stopping theorem
Definition 5.12 Let (G_n)_{n=1}^∞ be a filtration. A random variable T taking values in N ∪ {∞} is called a stopping time if for every n, the event {T ≤ n} belongs to G_n.
Examples:
1. Let (X_n) be a simple random walk, and let G_n = σ(X_1, . . . , X_n). Take T = inf{n : X_n = 9}. Then T is a stopping time.

2. Let (X_n) be a simple random walk, and let G_n = σ(X_1, . . . , X_n). Take T = sup{n < 100 : X_n = 0}. Then T is not a stopping time.

3. Let (X_n) be a simple random walk, and let G_n = σ(X_1, . . . , X_n). Take T = inf{n < 100 : X_n = max{X_k : k = 1, . . . , 99}}. Then T is not a stopping time.

4. Let (X_n) be a simple random walk, and let G_n = σ(X_1, . . . , X_n). Take T = inf{n ≥ 100 : X_n = max{X_k : k = 1, . . . , 99}}. Then T is a stopping time.
Exercise: Prove the assertions above.
We now discuss stopped martingales. We start with a useful lemma.
25.6
Lemma 5.13 Let (G_n) be a filtration, let (X_n) be a martingale w.r.t. (G_n), and let T be a stopping time w.r.t. (G_n). Let

Y_n = X_{n∧T} =  X_n if n ≤ T;  X_T if n ≥ T.

Then (Y_n) is a martingale.
Exercise Find an example of a martingale (Xn) and a random time T such that Xn∧T is
not a martingale.
Proof of Lemma 5.13:
E(Y_{n+1}|G_n) = 1_{T≤n} E(Y_{n+1}|G_n) + 1_{T>n} E(Y_{n+1}|G_n).

On the event {T ≤ n}, we have Y_{n+1} = X_T = Y_n, and thus

1_{T≤n} E(Y_{n+1}|G_n) = 1_{T≤n} Y_n.

On the event {T > n}, we have Y_n = X_n and Y_{n+1} = X_{n+1}, and thus

1_{T>n} E(Y_{n+1}|G_n) = 1_{T>n} E(X_{n+1}|G_n) = 1_{T>n} X_n = 1_{T>n} Y_n.
The lemma follows.
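For SRW the lemma can be verified exactly by enumerating all paths. Below, T is an illustratively chosen hitting time of {−1, 2} (not an example from the notes); the expectation of the stopped walk at time n is exactly 0:

```python
from itertools import product
from fractions import Fraction

# Enumerate all 2^n equally likely +-1 paths; freeze the walk once it hits
# {-1, 2} and check that E(X_{n AND T}) is exactly 0.
n = 8
total = Fraction(0)
for steps in product((-1, 1), repeat=n):
    x, stopped = 0, False
    for d in steps:
        if not stopped:
            x += d
            if x in (-1, 2):
                stopped = True
    total += x
mean = total / 2 ** n
print(mean)  # 0
```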
From here we get an important theorem known as the optional sampling theorem.
Theorem 5.14 (Optional Sampling Theorem)
Let (Gn) be a filtration, let (Xn) be a martingale w.r.t. (Gn), and let T be a stopping time
w.r.t. (Gn). Then E(XT ) = E(X1) if at least one of the following conditions hold:
1. T is bounded (i.e. there exists M s.t. P (T < M) = 1).
2. (X_n) is bounded (i.e. there exists M s.t. P(|X_n| < M for all n) = 1) and T is a.s. finite.
3. (Xn+1 −Xn) is bounded and E(T ) <∞.
Proof: 1. Since XT = YM and X1 = Y1, and (Yn) is a martingale, we get E(XT ) =
E(YM) = E(Y1) = E(X1).
2. Let N be a large number, and let S = min(T,N). Then S is a bounded stopping
time, so E(XS) = E(X1). Now, |XS − XT | < 2M , and P (XS 6= XT ) = P (T > N).
Therefore |E(X_T) − E(X_1)| < 2M · P(T > N), and as lim_{N→∞} P(T > N) = 0 we get E(X_T) = E(X_1).
3. Again, let S = min(T, N). Then E(X_S) = E(X_1), and

E(|X_T − X_S|) ≤ E( Σ_{k=N}^{T−1} |X_{k+1} − X_k| ) ≤ M Σ_{k=N}^{∞} P(T > k),

and as the last sum converges to 0 as N → ∞, we get that E(X_T) = E(X_1).
We now see several applications of the Optional Sampling Theorem. Let (Z_n)_{n=1}^∞ be i.i.d. with distribution P(Z_1 = 1) = P(Z_1 = −1) = 1/2, and let

X_n = Σ_{k=1}^n Z_k.

The sequence (X_n) is called Simple Random Walk (or SRW).
Example 5.15 Fix some a, b positive and integer. Let T = min{n : X_n ∈ {−a, b}}. Then

P(X_T = −a) = b/(a + b).
Example 5.16 Fix some a positive and integer. Let T = min{n : X_n ∈ {−a, a}}. Then E(T) = a².

Example 5.17 Let T = min{n : X_n = 1}. Then E(T) = ∞.
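The first two formulas are easy to test by simulation (an illustrative Monte Carlo run with arbitrary small a and b, not part of the notes):

```python
import numpy as np

# Stop SRW at T = min{n : X_n in {-a, b}}; optional sampling predicts
# P(X_T = -a) = b/(a+b), and for b = a, E(T) = a^2.
rng = np.random.default_rng(42)

def run(a, b):
    x = t = 0
    while -a < x < b:
        x += 1 if rng.integers(2) else -1
        t += 1
    return x, t

trials = 20000
frac = sum(run(2, 3)[0] == -2 for _ in range(trials)) / trials
print(frac)    # close to 3/5 = 0.6
mean_T = sum(run(3, 3)[1] for _ in range(trials)) / trials
print(mean_T)  # close to 3^2 = 9
```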
The same holds for sub and super martingales. We state the theorem for sub martingales,
and the proof is left as an exercise.
Theorem 5.18 Let (Gn) be a filtration, let (Xn) be a sub-martingale w.r.t. (Gn), and let
T be a stopping time w.r.t. (Gn). Then E(XT ) ≥ E(X1) if at least one of the following
conditions hold:
1. T is bounded (i.e. there exists M s.t. P (T < M) = 1).
2. (X_n) is bounded (i.e. there exists M s.t. P(|X_n| < M for all n) = 1) and T is a.s. finite.
3. (Xn+1 −Xn) is bounded and E(T ) <∞.
Exercise: Prove Theorem 5.18
5.5 Convergence Theorems
We start with a simple theorem, which nevertheless contains the main idea of the general
convergence theorem, and will slowly work our way to more and more general theorems.
Theorem 5.19 Let (Xn) be a positive martingale. Then limn→∞Xn exists a.s.
Theorem 5.19 is a special case of the following theorem:
Theorem 5.20 Let (X_n) be a positive sub-martingale, and assume

sup{E(X_n) : n = 1, 2, . . .} < ∞.

Then lim_{n→∞} X_n exists a.s.

Proof: Let

S = sup{E(X_n) : n = 1, 2, . . .}.
We first show that P(sup_n X_n < ∞) = 1. To this end we will show that

lim_{k→∞} P(sup_n X_n > k) = 0.

Fix k, and let T = inf{n : X_n > k} ≤ ∞. Let j > i be natural numbers. Then E(X_j|T = i) > k because (X_n) is a sub-martingale. The events (T = i)_{i=1,...} are disjoint, and therefore E(X_j|T < j) > k. Using Markov's inequality,

P(T < j) < E(X_j)/k ≤ S/k.

Since this holds for every j we get P(T < ∞) ≤ S/k, and as a result P(sup X_n < ∞) = 1.
We now proceed to prove the convergence. Let a < b be rational. We define a sequence of stopping times T_k and S_k as follows:

T_1 = inf{n : X_n < a} ≤ ∞;

S_k = inf{n > T_k : X_n > b} ≤ ∞;

T_{k+1} = inf{n > S_k : X_n < a} ≤ ∞.

We let C_{a,b} = max{k : S_k < ∞}, and call C_{a,b} the number of up-crossings from a to b.
We define a new sub-martingale (Y_n) as follows: Y_1 = X_1. For every n, if there exists k s.t. T_k ≤ n < S_k, then Y_{n+1} = Y_n + X_{n+1} − X_n, and else Y_{n+1} = Y_n. We make a few observations. The first is that E(Y_n) ≤ E(X_n) for every n. Indeed, let A_n be the event that there exists k s.t. T_k ≤ n < S_k. Then A_n ∈ G_n, and thus E(1_{A_n}(X_{n+1} − X_n)|G_n) = 1_{A_n}E(X_{n+1} − X_n|G_n) ≤ E(X_{n+1} − X_n|G_n), since the conditional increment of a sub-martingale is non-negative. Thus E(Y_{n+1} − Y_n) ≤ E(X_{n+1} − X_n), and by induction E(Y_n) ≤ E(X_n).
The second observation, which is left as an exercise, is that

lim_{n→∞} E(Y_n) ≥ (b − a)E(C_{a,b}).
Therefore, a.s., for every rational a and b the number of up-crossings is finite, and therefore (X_n) converges almost surely.
We can now state and prove the main theorem in this section.
Theorem 5.21 (The Martingale Convergence Theorem)
Let (X_n)_{n=1}^∞ be a martingale, and assume that sup_n E(|X_n|) < ∞. Then the sequence (X_n) converges almost surely.

Proof: By Theorem 5.20, both (X_n^+) and (X_n^−) converge, and therefore (X_n) converges.
The same proof idea is useful in proving the inverse martingale convergence theorem.
Theorem 5.22 (Inverse Martingale Convergence Theorem)
Let (Xn)∞n=1 be an inverse martingale. Then (Xn) converges almost surely.
5.6 Uniform integrability and convergence in L1
Definition 5.23 Let (X_n)_{n=1}^∞ be a sequence of random variables. We say that the sequence (X_n) is uniformly integrable if for every ε > 0 there exists M such that E(|X_n| · 1_{|X_n|>M}) < ε for every n.
Exercises
1. If sup_n E(X_n²) < ∞ then (X_n) is uniformly integrable.

2. "Double or nothing": Let (Y_n) be i.i.d. variables with P(Y_1 = 0) = P(Y_1 = 2) = 0.5. Let X_1 = 1, and X_{n+1} = X_n Y_n for all n. Then (X_n) is not uniformly integrable.
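Exercise 2 can be made concrete with a few lines: X_n takes the value 2^{n−1} with probability 2^{−(n−1)} and 0 otherwise, so E(X_n) = 1 for every n, yet for any fixed M the tail mass E(|X_n| 1_{|X_n|>M}) jumps back to 1 once 2^{n−1} > M, so no single M works for all n (an illustrative computation):

```python
# X_n = 2^(n-1) w.p. 2^-(n-1), else 0: expectation 1, but the tail above a
# fixed level M carries the whole expectation for large n.
M = 100.0
expectations, tails = [], []
for n in [1, 5, 10, 20]:
    value, prob = 2.0 ** (n - 1), 2.0 ** (-(n - 1))
    expectations.append(value * prob)
    tails.append(value * prob if value > M else 0.0)
print(expectations)  # [1.0, 1.0, 1.0, 1.0]
print(tails)         # [0.0, 0.0, 1.0, 1.0]
```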
2.7
Proof sketch of the inverse martingale convergence theorem: one gets a bound on the number of down-crossings, just as in the proof of Theorem 5.20. We also need to show that the sequence does not go to infinity: by the crossing bound, if the sequence did not converge it would have to go to infinity, but this cannot happen due to the L¹ bound.
5.7 Doob’s maximal inequality and the reflection principle
Let (X_n) be a martingale. Assume w.l.o.g. that E(X_1) = 0. We know that by Chebyshev,

P(|X_n| > γ) < var(X_n)/γ².
The following result shows that we have similar control over max(X1, X2, . . . , Xn).
Theorem 5.24 (Doob's Maximal Inequality)
Let (X_n) be a martingale, and assume E(X_1) = 0. Then, for every γ > 0,

P(max(X_1, . . . , X_n) > γ) < var(X_n)/γ².
Remark: for max(|X_1|, . . . , |X_n|) we then get an extra factor of 2.
Proof: Let T = min{k : X_k > γ}. The events A_k = {T = k} are disjoint, E(X_n²|A_k) ≥ γ², and thus E(X_n²|T ≤ n) ≥ γ², and from here we get what we need.
Exercise: Write an Lp inequality.
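A quick simulation (illustrative, with arbitrary parameters) shows the inequality in action for SRW, where var(X_n) = n:

```python
import numpy as np

# SRW: compare the empirical P(max(X_1..X_n) > gamma) with the Doob bound
# var(X_n)/gamma^2 = n/gamma^2.
rng = np.random.default_rng(0)
n, gamma, trials = 100, 25, 20000
steps = rng.choice([-1, 1], size=(trials, n))
paths = steps.cumsum(axis=1)
lhs = (paths.max(axis=1) > gamma).mean()
rhs = n / gamma ** 2
print(lhs, rhs)  # the empirical probability sits well below the bound
```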
An even stronger statement holds for sums of independent variables. The next result is about sums of Gaussians.
Theorem 5.25 (Reflection Principle)
Let (X_n) be i.i.d. N(0, 1), and let S_n = Σ_{k=1}^n X_k. Then, for all γ > 0 and n,

P[max(S_1, . . . , S_n) > γ] = 2P[S_n > γ].
Proof: reflection.
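A discrete cousin of this identity can be checked exactly for the ±1 walk: reflecting at the first visit to γ gives P(max(S_1, . . . , S_n) ≥ γ) = 2P(S_n > γ) + P(S_n = γ), with an extra boundary term that is absent in the Gaussian setting. A brute-force verification over all paths (the choice of n and γ is illustrative):

```python
from itertools import product

# Enumerate all 2^n paths of the +-1 walk and verify the discrete
# reflection identity P(M_n >= g) = 2 P(S_n > g) + P(S_n = g).
n, g = 12, 4
hit = ends_above = ends_at = 0
for steps in product((-1, 1), repeat=n):
    s = m = 0
    for d in steps:
        s += d
        m = max(m, s)
    hit += m >= g
    ends_above += s > g
    ends_at += s == g
print(hit == 2 * ends_above + ends_at)  # True
```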
5.8 Azuma's inequality

Let (X_n)_{n=0}^N be a martingale with bounded increments, |X_k − X_{k−1}| ≤ c_k for every k. Then for every t > 0,

P(X_N − X_0 > t) ≤ exp( −t² / (2 Σ_{k=1}^N c_k²) ).
Some applications.
For the proof: If E(X) = 0 and |X| ≤ c, then define Y by Y = c w.p. (X + c)/(2c) and Y = −c with the complementary probability, so that E(Y|X) = X. Then by Jensen

E(e^X) ≤ E(e^Y) = Σ_{k=0}^∞ E(Y^k)/k! = Σ_{k=0}^∞ c^{2k}/(2k)! ≤ Σ_{k=0}^∞ (c²/2)^k/k! = e^{c²/2}.
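For SRW with c_k = 1 the bound reads P(S_n > t) ≤ e^{−t²/(2n)}, and it can be compared with the exact binomial tail (an illustrative check, not from the notes):

```python
import math

# SRW: S_n = 2K - n with K ~ Bin(n, 1/2).  Exact P(S_n > t) vs Azuma's
# bound exp(-t^2 / (2n)) with c_k = 1.
def srw_tail(n, t):
    kmin = math.floor((n + t) / 2) + 1           # S_n > t  <=>  K >= kmin
    return sum(math.comb(n, k) for k in range(kmin, n + 1)) / 2 ** n

n, t = 100, 20
exact, bound = srw_tail(n, t), math.exp(-t ** 2 / (2 * n))
print(exact, bound)  # the exact tail lies below the bound
```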