Probability Theory
Instructor: Noam Berger
Lecture at TUM in Summer Semester 2013
July 1, 2013
Produced by me
Contents
1 Measure spaces
2 Basic inequalities and types of convergence
3 The law of large numbers
4 Characteristic functions and the central limit theorem
5 Conditional expectation and martingales
1 Measure spaces
16.4
In the first few lectures we give, without proofs, background from Measure Theory which
we will need in the Probability course, and set some of the basic definitions of Probability
Theory.
Let Ω be a non-empty set.
Definition 1.1 A ⊆ P(Ω), i.e. a collection A of subsets of Ω, is a σ-algebra on Ω, if
(i) A ≠ ∅,
(ii) A ∈ A ⇒ A^c ∈ A,
(iii) A_n ∈ A for all n ∈ N ⇒ ⋃_{n∈N} A_n ∈ A.
Remark 1.2 Note that if A_n ∈ A for all n ∈ N, then also ⋂_{n∈N} A_n ∈ A.
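For a finite Ω the defining conditions can be checked mechanically, since closure under countable unions reduces to closure under pairwise unions. A small sketch (the function name and examples are ours, not from the notes):

```python
from itertools import combinations

def is_sigma_algebra(omega, family):
    """Check conditions (i)-(iii) of Definition 1.1 for a finite omega.
    For a finite family, closure under countable unions reduces to
    closure under pairwise unions."""
    fam = {frozenset(s) for s in family}
    if not fam:                                                # (i) non-empty
        return False
    if any(frozenset(omega) - a not in fam for a in fam):      # (ii) complements
        return False
    return all(a | b in fam for a in fam for b in fam)         # (iii) unions

omega = {1, 2, 3, 4}
trivial = [set(), omega]
powerset = [set(c) for r in range(len(omega) + 1)
            for c in combinations(omega, r)]
not_closed = [set(), {1}, omega]   # missing the complement {2, 3, 4}

print(is_sigma_algebra(omega, trivial))     # True
print(is_sigma_algebra(omega, powerset))    # True
print(is_sigma_algebra(omega, not_closed))  # False
```

Both the trivial family {∅, Ω} and the full power set pass the check; a family missing a complement fails condition (ii).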
Definition 1.3 If A is a σ-algebra on Ω, (Ω,A) is a measurable space and each A ∈ A is
measurable.
Definition 1.4 Let U ⊆ 2^Ω. The σ-algebra generated by U is defined to be
σ(U) := ⋂ {A : U ⊆ A, A a σ-algebra on Ω}.
Definition 1.5 Let (Ω, T ) be a topological space. Borel’s σ-algebra w.r.t. the space
(Ω, T ) is the σ-algebra σ(T ). It is usually denoted by B.
We will mostly be interested in Borel’s σ-algebra in the case of the Euclidean spaces Rd.
Definition 1.6 Let (Ω_1, F_1) and (Ω_2, F_2) be measurable spaces. The product space is defined as follows: the space is Ω = Ω_1 × Ω_2, and the σ-algebra is F = σ({A × B : A ∈ F_1, B ∈ F_2}).
Remark 1.7 Often notation is abused and the product σ-algebra is denoted F1 × F2.
Note that this is not the cartesian product of F1 and F2.
Definition 1.6 can be easily extended to any finite number of spaces. Note that we seem
to have two natural ways of defining a σ-algebra on the space R2 - using Borel directly,
or multiplying the one-dimensional Borel σ-algebra with itself. However, both yield the
same σ-algebra.
We now define the product σ-algebra of infinitely many spaces. Let (Ω_k, F_k), k = 1, 2, …, be a collection of measurable spaces.
Definition 1.8 A cylinder is a set A ⊆ ∏_{k=1}^∞ Ω_k s.t. A = ∏_{k=1}^∞ A_k, where
1. A_k ∈ F_k for every k, and
2. there exists k_0 s.t. A_k = Ω_k for all k > k_0.
Now, the product σ-algebra is defined to be the σ-algebra generated by the set of cylinders.
Remark 1.9 There is a slightly simpler way of defining the same object.
Definition 1.10 Let (Ω, F) be a measurable space. A measure on (Ω, F) is a function µ : F → [0, ∞] s.t.:
1. µ is non-negative.
2. µ(∅) = 0.
3. If (A_k)_{k=1}^∞ are pairwise disjoint, then
µ(⋃_{k=1}^∞ A_k) = ∑_{k=1}^∞ µ(A_k).
A measure is called σ-finite if Ω is the union of countably many sets of finite measure. A
measure P is called a probability measure if P (Ω) = 1.
Example 1.11 Let Ω be finite, F = 2^Ω, and P(A) = |A|/|Ω|. Then P is a probability measure.
Theorem 1.12 There exists a unique measure λ on (R, B) such that λ([a, b]) = b − a for every b > a. λ is called Lebesgue's measure.
Theorem 1.13 Let (Ω_1, F_1, µ_1) and (Ω_2, F_2, µ_2) be σ-finite measure spaces. Then there
exists a unique measure µ on (Ω1×Ω2,F1×F2) s.t. µ(A1×A2) = µ1(A1)µ2(A2) for every
A1 ∈ F1 and A2 ∈ F2.
Theorem 1.14 Let (Ω_k, F_k, µ_k), k = 1, 2, …, be probability spaces. Then there exists a unique measure µ on (∏_{k=1}^∞ Ω_k, ∏_{k=1}^∞ F_k) such that µ(A) = ∏_{k=1}^∞ µ_k(A_k) for every cylinder A = ∏_{k=1}^∞ A_k.
Definition 1.15 Let (Ω, F) be a measurable space and let (X, T) be a topological space. f : Ω → X is called a measurable function if f^{-1}(A) ∈ F for all A ∈ T.
Let (Ω, F, P) be a probability space. An event is a measurable set. A random variable is a measurable function from Ω to R. The σ-algebra generated by a random variable X is σ(X) := σ({X^{-1}(A) : A ∈ T}).
19.4
1.1 Integration
Let (Ω, F, µ) be a σ-finite measure space. A measurable function f : Ω → R is called simple if it is non-negative and there is a partition A_1, A_2, …, A_n of Ω s.t. A_k is measurable for all k, and f is constant on each A_k. We define the integral of f to be
∫_Ω f dµ := ∑_{k=1}^n µ(A_k) · f|_{A_k}.
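For a discrete measure on a finite Ω this integral is literally the finite sum above; a minimal sketch (the function name and the example measure are ours):

```python
def integral_simple(partition, values, mu):
    """Integral of a simple function per the definition above:
    sum over k of mu(A_k) times the constant value of f on A_k.
    partition: disjoint measurable sets A_k covering Omega,
    values: the constant value of f on each A_k,
    mu: dict giving mu({w}) for each point (a discrete measure)."""
    return sum(v * sum(mu[w] for w in A) for A, v in zip(partition, values))

# Omega = {1, 2, 3, 4} with mu({w}) = 1/4; f = 2 on {1, 2} and f = 5 on {3, 4}.
mu = {w: 0.25 for w in range(1, 5)}
print(integral_simple([{1, 2}, {3, 4}], [2.0, 5.0], mu))  # 3.5
```

The well-definedness asked for in the exercise below amounts to the fact that refining the partition does not change this sum.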
Exercise Show that this is well defined.
For a non-negative measurable function f we define
∫_Ω f dµ := sup{ ∫_Ω g dµ : g ≤ f, g simple }.
Define
f^+(x) = f(x) if f(x) ≥ 0, and 0 if f(x) ≤ 0;   f^-(x) = −f(x) if f(x) ≤ 0, and 0 if f(x) ≥ 0.
Exercise Show that if f is measurable, then so are f+ and f−.
Definition 1.16 A measurable function f is said to be integrable if ∫_Ω f^+ dµ < ∞ and ∫_Ω f^- dµ < ∞. In this case we define
∫_Ω f dµ := ∫_Ω f^+ dµ − ∫_Ω f^- dµ.   (1.1)
We also define (infinite) integrals of non-integrable functions whenever the difference in (1.1) makes sense.
We say that a random variable X on a probability space (Ω, F, P) has an expectation if it is integrable. In this case we write E(X) := ∫_Ω X dP. We say that X has a variance if X^2 has an expectation, and write
var(X) = E[(X − E(X))^2] = E(X^2) − [E(X)]^2.
We say that X has a k-th moment if E(|X|^k) < ∞, and in this case the k-th moment of X is E(X^k). For two variables X and Y, if they both have expectations and the variable XY has an expectation too, then we say that they have a covariance, and define
cov(X, Y) := E(XY) − E(X)E(Y).
Exercise (1) Show that if X and Y have variances, then they have a covariance. (2) Show that this is not an "if and only if".
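On a finite sample space all of these moments are finite sums and can be computed exactly; a small sketch with a joint distribution of our choosing:

```python
# A joint distribution on a finite sample space: the point (x, y)
# carries probability p. Expectation, variance and covariance are then
# computed directly from the definitions above as finite sums.
dist = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

E_X = sum(p * x for (x, y), p in dist.items())
E_Y = sum(p * y for (x, y), p in dist.items())
E_XY = sum(p * x * y for (x, y), p in dist.items())
var_X = sum(p * x ** 2 for (x, y), p in dist.items()) - E_X ** 2
cov_XY = E_XY - E_X * E_Y          # cov(X, Y) = E(XY) - E(X)E(Y)

print(E_X, E_Y)              # 0.5 0.5
print(var_X)                 # 0.25
print(round(cov_XY, 10))     # 0.15
```

Here X and Y each have a variance, so by part (1) of the exercise the covariance must exist, and indeed it is a plain finite sum.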
We now state without proof some theorems about convergence of integrals, which we will
use in the course. For all of these theorems, (Ω,F , µ) is a σ-finite measure space, typically
a probability space. (f_k : Ω → R), k = 1, 2, …, are measurable.
Theorem 1.17 (Fatou's lemma)
If the functions f_k are non-negative, then
∫_Ω lim inf f_k dµ ≤ lim inf ∫_Ω f_k dµ.
We now assume that there exists a function f : Ω → R such that
µ({x ∈ Ω : lim_{k→∞} f_k(x) ≠ f(x)}) = 0.
Theorem 1.18 (Monotone convergence theorem)
If, in addition, (f_k) is a pointwise increasing sequence and the f_k are non-negative, then
∫_Ω f dµ = lim ∫_Ω f_k dµ.
Theorem 1.19 (Dominated convergence theorem)
If there exists a non-negative integrable function g s.t. |f_k(x)| ≤ g(x) for every k and (almost) all x, then
∫_Ω f dµ = lim ∫_Ω f_k dµ.
1.2 Calculus of probabilities
Let (A_k)_{k=1}^∞ be a sequence of events. We define:
lim inf_{k→∞} A_k = ⋃_{k=1}^∞ ⋂_{n=k}^∞ A_n ;   lim sup_{k→∞} A_k = ⋂_{k=1}^∞ ⋃_{n=k}^∞ A_n.
If lim inf_{k→∞} A_k = lim sup_{k→∞} A_k we say that the sequence converges. We also say that the sequence converges if the equality holds up to measure zero.
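The two set limits can be computed explicitly for an eventually periodic sequence of sets, since both depend only on the repeating tail; a short sketch (names and example ours):

```python
# lim sup A_k consists of the points lying in infinitely many A_k,
# lim inf A_k of the points lying in all but finitely many. For an
# eventually periodic sequence both are determined by one period of
# the tail.
def lim_sup_sets(tail_period):
    out = set()
    for A in tail_period:
        out |= A      # a point recurs infinitely often iff it is in
    return out        # some set of the repeating period

def lim_inf_sets(tail_period):
    out = set(tail_period[0])
    for A in tail_period[1:]:
        out &= A      # in all but finitely many iff in every set
    return out        # of the repeating period

# A_k = {0, 1} for even k and {1, 2} for odd k (for all large k):
period = [{0, 1}, {1, 2}]
print(lim_sup_sets(period))  # {0, 1, 2}
print(lim_inf_sets(period))  # {1}
```

Since lim inf ≠ lim sup here, this alternating sequence of events does not converge.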
Example 1.20
1. If the sequence (A_k)_{k=1}^∞ is increasing, then it converges to ⋃_{k=1}^∞ A_k.
2. If the sequence (A_k)_{k=1}^∞ is decreasing, then it converges to ⋂_{k=1}^∞ A_k.
Theorem 1.21 (Continuity of Probability)
Let (Ω, F, P) be a probability space, and let (A_k)_{k=1}^∞ be a sequence of events. Then
P[lim inf_{k→∞} A_k] ≤ lim inf_{k→∞} P[A_k],   (1.2)
and
P[lim sup_{k→∞} A_k] ≥ lim sup_{k→∞} P[A_k].   (1.3)
As an immediate corollary we get the following:
Corollary 1.22 If (A_k)_{k=1}^∞ is a converging sequence of events, then
P[lim_{k→∞} A_k] = lim_{k→∞} P[A_k].
23.4
Proof of Theorem 1.21: We start by showing this for increasing sequences. Let (A_k)_{k=1}^∞ be increasing, and let B_1 := A_1 and B_{k+1} := A_{k+1} − A_k. Then the events B_k, k = 1, 2, …, are pairwise disjoint, we have A_n = ⋃_{k=1}^n B_k for every n, and in the limit lim_{k→∞} A_k = ⋃_{k=1}^∞ B_k. Therefore we get
P[⋃_{k=1}^∞ A_k] = P[lim_{k→∞} A_k] = P[⋃_{k=1}^∞ B_k] = ∑_{k=1}^∞ P(B_k) = lim_{n→∞} ∑_{k=1}^n P(B_k) = lim_{n→∞} P(A_n).
Next we note that, by taking complements, the same holds for decreasing sequences. We now prove (1.2): for every k, let
B_k = ⋂_{n=k}^∞ A_n.
Then B_k ⊆ A_k and therefore P[B_k] ≤ P[A_k]. Now note that (B_k) is an increasing sequence and that
lim inf_{k→∞} A_k = lim_{k→∞} B_k.
Therefore,
P[lim inf_{k→∞} A_k] = P[lim_{k→∞} B_k] = lim_{k→∞} P[B_k] ≤ lim inf_{k→∞} P[A_k].
(1.3) is proven analogously.
1.3 Distributions and independence
1.3.1 Independence
We say that two events A_1 and A_2 are independent if P(A_1 ∩ A_2) = P(A_1)P(A_2). We say that k events A_1, …, A_k are independent if for every nonempty subset L ⊆ {1, …, k},
P[⋂_{j∈L} A_j] = ∏_{j∈L} P[A_j].
Exercise Find three events A1, A2, A3 s.t. every two are independent, but the three are
not independent.
We say that a sequence of events (Ak)∞k=1 is independent if every finite sub-collection is
independent.
We can also define the very useful notion of independence of σ-algebras: Let (Ω,F , P ) be
a probability space, and let F1 ⊆ F and F2 ⊆ F be σ-algebras. We say that F1 and F2 are
independent if A_1 and A_2 are independent for every A_1 ∈ F_1 and A_2 ∈ F_2. Analogously, we can define independence of larger collections of σ-algebras: F_1, …, F_k are independent if A_1, …, A_k are independent for all choices of A_j ∈ F_j, j = 1, …, k, and infinite collections of σ-algebras are independent if every finite sub-collection is independent.
Example 1.23 Let Ω = [0, 1]2, let B(2) be the Borel σ-algebra for Ω, and let P be the
two dimensional Lebesgue measure. Then (Ω,B(2), P ) is a probability space. Let B(1) be
the Borel σ-algebra of [0, 1]. Let F_1 = {A × [0, 1] : A ∈ B^(1)}, and let F_2 = {[0, 1] × A : A ∈ B^(1)}. Then F_1 and F_2 are independent σ-algebras.
We can now define independence of random variables: A collection of random variables is independent if the collection of induced σ-algebras is independent.
1.3.2 Distributions
Let (Ω, F, P) be a probability space, and let X : Ω → R be a random variable. The distribution of X is the Borel probability measure D_X on R induced by X. More precisely, for every Borel set A ⊆ R, we take D_X(A) = P(X^{-1}(A)).
Definition 1.24 The distribution function F_X : R → R of X is defined to be the function
F_X(a) = D_X((−∞, a]) = P(X ≤ a).
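For a uniform distribution on a finite set, the distribution function is a plain count; a small sketch of Definition 1.24 for a fair die (the example is ours):

```python
# The distribution function of a fair die, directly from the
# definition F_X(a) = P(X <= a). For a uniform distribution on a
# finite set this is a finite count.
def F(a, outcomes):
    return sum(1 for x in outcomes if x <= a) / len(outcomes)

die = [1, 2, 3, 4, 5, 6]
print(F(0.5, die))  # 0.0  (to the left of the support)
print(F(3, die))    # 0.5
print(F(6, die))    # 1.0
```

Note the function is a right-continuous step function, jumping by 1/6 at each outcome.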
Definition 1.25 Let (Ω,F , P ) be a probability space.
1. Let X_1, …, X_k be random variables. The joint distribution of X_1, …, X_k is the Borel probability measure D_{X_1,…,X_k} on R^k defined by
D_{X_1,…,X_k}(A) = P({ω ∈ Ω : (X_1(ω), …, X_k(ω)) ∈ A}).
2. Let (X_k)_{k=1}^∞ be random variables. The joint distribution of (X_k)_{k=1}^∞ is the Borel probability measure D_{X_1,X_2,…} on R^N defined by
D_{X_1,X_2,…}(A) = P({ω ∈ Ω : (X_1(ω), X_2(ω), …) ∈ A}).
26.4
1.4 Absolute continuity and the Radon–Nikodym theorem
Today we will prove a measure theoretic theorem which is very useful in Probability
Theory. Let (Ω,F) be a measurable space. Let µ and ν be measures on (Ω,F).
Definition 1.26 1. We say that ν is absolutely continuous with respect to µ if for every set A ∈ F such that µ(A) = 0 we also have ν(A) = 0. We denote this by ν ≪ µ.
2. We say that ν and µ are equivalent if µ ≪ ν and ν ≪ µ. We denote this by µ ∼ ν.
3. We say that ν and µ are singular if there exists A ∈ F such that µ(A) = 0 and ν(A^c) = 0. We denote this by µ ⊥ ν.
Example 1.27 Let (Ω, F, µ) be a σ-finite measure space, and let f : Ω → R be measurable and non-negative. Define ν : F → [0, ∞] by
ν(A) = ∫_A f dµ := ∫_Ω f·1_A dµ.
Then ν ≪ µ.
Exercise Show that ν is a measure, and that it is absolutely continuous with respect to
µ.
The next theorem will show that Example 1.27 is, in fact, the general case of absolute continuity.
Theorem 1.28 (Radon–Nikodym)
Let (Ω, F) be a measurable space, and let µ and ν be σ-finite measures on (Ω, F). Assume, in addition, that ν ≪ µ. Then there exists a measurable and non-negative f : Ω → R such that for every A ∈ F,
ν(A) = ∫_A f dµ.   (1.4)
Furthermore, f is unique up to measure zero. f is called the Radon–Nikodym derivative of ν with respect to µ, and is denoted dν/dµ.
Exercise Find a counterexample when the assumption of σ-finiteness is removed (it suffices that µ is not σ-finite).
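In the simplest instance of the theorem, two discrete measures on a finite set, the derivative can be written down pointwise; a sketch (function name and measures are our choices):

```python
# Radon-Nikodym in the discrete case: for measures on a finite set
# with nu << mu, the derivative is pointwise f(w) = nu({w}) / mu({w}),
# and then nu(A) = sum over w in A of f(w) * mu({w}), i.e. (1.4).
def rn_derivative(nu, mu):
    assert all(mu[w] > 0 or nu[w] == 0 for w in nu), "nu is not << mu"
    return {w: nu.get(w, 0.0) / mu[w] for w in mu if mu[w] > 0}

mu = {'a': 0.5, 'b': 0.25, 'c': 0.25}
nu = {'a': 0.1, 'b': 0.9, 'c': 0.0}
f = rn_derivative(nu, mu)

A = {'a', 'b'}
print(sum(f[w] * mu[w] for w in A))  # 1.0, which equals nu(A)
```

The uniqueness up to measure zero is visible here too: on points with µ({ω}) = 0 (there are none in this example) the value of f is arbitrary.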
The next definition and theorem (which we will not prove) diverge from our material.
They are, however, important for one of the homework problems.
Definition 1.29 Let (Ω,F , µ) be a measure space. We say that (Ω,F , µ) is non-atomic
if for every A ∈ F with µ(A) > 0 there exists B ⊆ A in F such that 0 < µ(B) < µ(A).
Example 1.30 1. (R, B, λ) is non-atomic.
2. Any finite probability space is atomic.
Theorem 1.31 Let (Ω, F, µ) be non-atomic, and let A ∈ F. Then for every 0 ≤ γ ≤ µ(A) there exists B ⊆ A in F such that µ(B) = γ.
1.4.1 Towards proving Theorem 1.28
We start with the following definition.
Definition 1.32 Let (Ω, F) be a measurable space. A signed measure Φ on (Ω, F) is a function Φ : F → R ∪ {−∞, +∞}, such that:
1. Φ(∅) = 0.
2. If (A_k)_{k=1}^∞ are disjoint measurable sets, then
Φ(⋃_{k=1}^∞ A_k) = ∑_{k=1}^∞ Φ(A_k),
and the sum always converges (possibly to an infinite value).
Example 1.33 The difference of two measures, at least one of which is finite, is a signed measure.
Definition 1.34 Let (Ω, F) be a measurable space, and let Φ be a signed measure on (Ω, F). Then we define
|Φ|(A) := sup{|Φ(B)| + |Φ(A \ B)| : B ⊆ A, B ∈ F}.
Exercise Prove that |Φ| is a measure on (Ω,F), and that |Φ|(A) ≥ |Φ(A)| for every A.
Theorem 1.35 (Hahn decomposition theorem)
Let (Ω,F) be a measurable space, and let Φ be a signed measure on (Ω,F). Then there
exist sets A+ ∈ F and A− ∈ F such that
1. A+ ∪ A− = Ω and A+ ∩ A− = ∅.
2. Φ(A) ≥ 0 for every measurable A ⊆ A+. A+ is called the positive set of Φ.
3. Φ(A) ≤ 0 for every measurable A ⊆ A−. A− is called the negative set of Φ.
A+ and A− are unique up to measure zero.
Proof: Let S = sup{Φ(A) : A ∈ F} and I = inf{Φ(A) : A ∈ F}.
Exercise Show that at most one of them is infinite.
Hint: show that otherwise the sum in Definition 1.32 does not make sense.
Assume without loss of generality that S < ∞. Then one can find a sequence of measurable sets (A_k)_{k=1}^∞ such that for every k,
S − 2^{−k} ≤ Φ(A_k) ≤ S.
Now let
A^+ := lim sup_{k→∞} A_k,
and let A^- := Ω \ A^+.
Claim 1.36 Φ(A+) = S.
Now, if A ⊆ A^+ has Φ(A) < 0, then Φ(A^+ \ A) > S, in contradiction to the definition of S, and similarly, if A ⊆ A^- has Φ(A) > 0 then Φ(A^+ ∪ A) > S, again in contradiction to the definition of S.
30.4
Proof of Claim 1.36: For every k and every A ⊆ A_k we have that Φ(A) ≥ −2^{−k}, and for every A ⊆ A_k^c we have that Φ(A) ≤ 2^{−k}. Therefore, for every k, j, we have
|Φ|(A_k \ A_j) ≤ 2^{−k} + 2^{−j}
and therefore
|Φ|(A_k △ A_j) ≤ 2(2^{−k} + 2^{−j}).
Now, let B_k = ⋃_{j=k}^∞ A_j. Then
|Φ|(B_k \ A_k) ≤ ∑_{j=k+1}^∞ |Φ|(A_j \ A_{j−1}) ≤ ∑_{j=k+1}^∞ [2^{−j} + 2^{1−j}] = 3 · 2^{−k}.
In particular, Φ(B_k) ≥ S − 2^{2−k}.
Now, (B_k) is a decreasing sequence, and
|Φ|(B_k \ B_{k+1}) ≤ 2^{3−k}.
Therefore, with A := A^+ = ⋂_{k=1}^∞ B_k,
|Φ|(B_k \ A) ≤ ∑_{j=k}^∞ |Φ|(B_j \ B_{j+1}) ≤ ∑_{j=k}^∞ 2^{3−j} = 2^{4−k}.
Therefore, for every k we have Φ(A) ≥ Φ(B_k) − |Φ|(B_k \ A) ≥ S − 2^{5−k}, and therefore Φ(A) ≥ S. From the definition of S, we now get Φ(A) = S.
Corollary 1.37 Let A^+, A^- and B^+, B^- be two Hahn decompositions of the space (Ω, F, Φ). Then |Φ|(A^+ △ B^+) = |Φ|(A^- △ B^-) = 0.
Proof: We show that |Φ|(A^+ \ B^+) = 0; by symmetry this suffices. Let C ⊆ A^+ \ B^+. Then Φ(C) ≥ 0 because C ⊆ A^+, and Φ(C) ≤ 0 because C ⊆ B^-, and thus Φ(C) = 0. Since this holds for every measurable C ⊆ A^+ \ B^+, we get that |Φ|(A^+ \ B^+) = 0, as desired.
We can now use Hahn's theorem to prove the Radon–Nikodym theorem.
Proof of Theorem 1.28: We first assume that both µ and ν are finite. The extension to
the σ-finite case is left as an (easy) exercise. For every α ≥ 0 rational, we define the
signed measure Φα := α · µ − ν. Then, for every α ≥ 0 rational, we define Aα to be (a
choice of) the positive set of Φα.
Claim 1.38 Let α1 < α2. Then µ(Aα1 \ Aα2) = ν(Aα1 \ Aα2) = 0.
Proof of Claim 1.38: Let A = A_{α_1} \ A_{α_2}. Then Φ_{α_1}(A) ≥ 0 (as A ⊆ A_{α_1}) and Φ_{α_2}(A) ≤ 0 (as A ⊆ A_{α_2}^c). However, by the definition of Φ_α, and since α_1 < α_2, we get Φ_{α_2}(A) ≥ Φ_{α_1}(A), and thus Φ_{α_1}(A) = Φ_{α_2}(A) = 0. Solving a linear system we get µ(A) = ν(A) = 0.
We now define the function f : Ω → [0, ∞] by
f(ω) := inf{α : ω ∈ A_α}.
Exercise Show that f is measurable, and that due to absolute continuity, µ({ω : f(ω) = ∞}) = 0.
We now need to show that (1.4) holds for every A ∈ F. We first show that
ν(A) ≤ ∫_A f dµ.
To this end, let ε > 0 be rational, and for every n = 0, 1, … let A^{(n)} := {ω ∈ A : nε ≤ f(ω) < (n+1)ε}. Define f̄ := (n+1)ε on A^{(n)}. Then f̄ ≤ f + ε. Moreover, up to measure zero, A^{(n)} ⊆ A_{(n+1)ε}. Therefore ν(A^{(n)}) ≤ (n+1)ε · µ(A^{(n)}), and we get that
ν(A) = ∑_{n=0}^∞ ν(A^{(n)}) ≤ ∑_{n=0}^∞ (n+1)ε · µ(A^{(n)}) = ∫_A f̄ dµ ≤ ∫_A f dµ + ε µ(A).
Taking ε as small as we like proves the desired inequality. The opposite inequality follows similarly, and (1.4) is proved.
Exercise Prove the uniqueness (up to µ-measure zero) of f .
3.3
Definition 1.39 Let X be a random variable, and let D_X be its distribution. We say that X has a density if D_X ≪ λ. In this case, we say that the density of X is
f_X := dD_X/dλ.
2 Basic inequalities and types of convergence
2.1 Inequalities
We begin with the most basic and most useful inequality in Probability Theory.
Theorem 2.1 (Cauchy–Schwarz inequality)
Let X, Y be variables with second moments. Then E(XY)^2 ≤ E(X^2)E(Y^2).
Proof: Assume without loss of generality that E(X^2) = E(Y^2) = 1. Define Z = X − E(XY)Y. Then E(ZY) = E(XY) − E(XY)E(Y^2) = 0. Therefore,
E(X^2)E(Y^2) = 1 = E(X^2) = E[(Z + E(XY)Y)^2] = E(Z^2) + E(XY)^2 E(Y^2) = E(XY)^2 + E(Z^2) ≥ E(XY)^2.
Theorem 2.2 (Markov's inequality)
Let X be a non-negative random variable, and let a > 0. Then
P(X ≥ a) ≤ E(X)/a.
Proof: Let Y be a random variable, defined as follows:
Y = 0 if X < a, and Y = a if X ≥ a.
Then Y ≤ X, and therefore E(Y) ≤ E(X). We get
E(X)/a ≥ E(Y)/a = (a · P(Y = a))/a = P(Y = a) = P(X ≥ a).
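A quick Monte Carlo sanity check of the inequality; the choice of an exponential variable (mean 1) is ours, not from the notes:

```python
import random

# Monte Carlo check of Markov's inequality P(X >= a) <= E(X)/a for a
# non-negative variable. For X exponential with mean 1 the true tail
# is P(X >= a) = e^{-a}, while the Markov bound is 1/a, so at a = 2
# the bound (about 0.5) is far above the truth (about 0.135).
random.seed(0)
xs = [random.expovariate(1.0) for _ in range(100_000)]

a = 2.0
empirical = sum(1 for x in xs if x >= a) / len(xs)   # about e^{-2}
bound = (sum(xs) / len(xs)) / a                      # about 1/2
print(empirical <= bound)  # True
```

The bound is crude (it uses only the first moment), which is exactly why the second-moment inequalities below improve on it.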
We now prove a general inequality which we will use later.
Theorem 2.3 (Jensen’s Inequality)
Let f : R → R be a convex function, and let X be a random variable s.t. X and f(X)
both have expectations. Then
E [f(X)] ≥ f (E[X]) .
Corollary 2.4 [E[X]]^2 ≤ E[X^2] whenever well defined.
Proof of Theorem 2.3: Since f is convex, for every x there exists an affine function g_x such that g_x(x) = f(x) and g_x(y) ≤ f(y) for every y. We take g = g_{E(X)}. Since g is affine, we get g(E[X]) = E[g(X)]. Therefore,
E[f(X)] ≥ E[g(X)] = g(E[X]) = f(E[X]).
We now prove two inequalities that relate to the second moment.
Theorem 2.5 (Chebyshev I)
Let X be a random variable with a variance. Then for every δ > 0,
P[|X − E[X]| ≥ δ] ≤ var[X]/δ^2.
Proof: Let Y = [X − E(X)]^2. Then Y is non-negative, and E(Y) = var(X). Then, by Markov's inequality,
P[|X − E[X]| ≥ δ] = P(Y ≥ δ^2) ≤ E(Y)/δ^2 = var[X]/δ^2.
Theorem 2.6 (Chebyshev II)
Let X be a non-negative random variable with a second moment. Then
P[X > 0] ≥ [E[X]]^2 / E[X^2].
Proof: Let Y = 1_{X>0}. Then E[Y] = E[Y^2] = P[X > 0] and X = XY. The Cauchy–Schwarz inequality tells us
[E[X]]^2 = [E[XY]]^2 ≤ E[X^2] · E[Y^2] = P[X > 0] · E[X^2],
and we get the required inequality by dividing both sides by E[X^2].
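Both second-moment inequalities can be verified exactly on a small discrete distribution (the distribution below is our example):

```python
# Exact check of Chebyshev I and II: X takes the value v with
# probability p, so all moments are finite sums.
vals = [(0.0, 0.3), (1.0, 0.5), (4.0, 0.2)]

EX = sum(v * p for v, p in vals)       # 1.3
EX2 = sum(v * v * p for v, p in vals)  # 3.7
var = EX2 - EX ** 2                    # 2.01

# Chebyshev I with delta = 1: P(|X - EX| >= 1) = 0.3 + 0.2 = 0.5
delta = 1.0
lhs = sum(p for v, p in vals if abs(v - EX) >= delta)
print(lhs <= var / delta ** 2)  # True  (0.5 <= 2.01)

# Chebyshev II: P(X > 0) = 0.7 versus (EX)^2 / EX2 = 1.69/3.7
p_pos = sum(p for v, p in vals if v > 0)
print(p_pos >= EX ** 2 / EX2)   # True  (0.7 >= 0.4567...)
```

Chebyshev I bounds the probability of being far from the mean from above; Chebyshev II bounds the probability of being strictly positive from below.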
7.5
2.2 The lemma of Borel-Cantelli
Let (Ω,F , P ) be a probability space, and let (Ak)∞k=1 be a sequence of events. The event
A := lim supk→∞Ak is the event that infinitely many of the events occur.
Theorem 2.7 (The Borel–Cantelli lemma) 1. If
∑_{k=1}^∞ P(A_k) < ∞,
then P(A) = 0.
2. If the events are independent and
∑_{k=1}^∞ P(A_k) = ∞,   (2.1)
then P(A) = 1.
Exercise Find a sequence (A_k)_{k=1}^∞ of events such that
∑_{k=1}^∞ P(A_k) = ∞,
but P(A) = 0.
Proof of Theorem 2.7: 1: For every k,
A ⊆ ⋃_{j=k}^∞ A_j,
and therefore
P(A) ≤ ∑_{j=k}^∞ P(A_j),
and as the RHS goes to zero as k → ∞, we get that P(A) = 0.
2: Let B_k = ⋃_{j=k}^∞ A_j. Then, using the continuity of probability (Theorem 1.21), the independence of the events, the continuity of the exponential function, and the identity log x ≤ x − 1,
P[B_k^c] = P[⋂_{j=k}^∞ A_j^c] = ∏_{j=k}^∞ P[A_j^c] = exp(∑_{j=k}^∞ log P[A_j^c])
≤ exp(∑_{j=k}^∞ (P[A_j^c] − 1)) = exp(−∑_{j=k}^∞ P[A_j]) = 0,
where the last equality follows from the divergence of the sum in (2.1). Therefore, P[B_k] = 1. Now, as A = ⋂_{k=1}^∞ B_k, we get
P[A] = lim_{k→∞} P[B_k] = 1.
2.3 Types of convergence
In this section we discuss various types of convergence of random variables, and the
relations between those types of convergence.
2.3.1 Convergence in probability
Let (X_k)_{k=1}^∞ be random variables, and let X be a random variable, all defined on the same probability space. We say that (X_k) converges to X in probability, and write
X_k →_prob X,
if for every δ > 0,
lim_{k→∞} P(|X_k − X| > δ) = 0.   (2.2)
This is equivalent to
∀ε ∃N ∀n>N : P(|X_n − X| > ε) < ε.   (2.3)
Exercise Prove the equivalence of (2.2) and (2.3).
Example 2.8 (Weak version of the weak law of large numbers)
Let (X_k)_{k=1}^∞ be i.i.d. bounded variables, and let E = E(X_1). Let S_n = (1/n) ∑_{k=1}^n X_k. Then
S_n →_prob E.
Proof: from Chebyshev.
2.3.2 Almost sure convergence
Let (X_n)_{n=1}^∞ be random variables, and let X be a random variable. Then we say that (X_n) converges to X almost surely, and write
X_n →_a.s. X,
if
P({ω : lim_{n→∞} X_n(ω) = X(ω)}) = 1.   (2.4)
This is equivalent to
∀ε ∃N : P(∃n>N : |X_n − X| > ε) < ε.   (2.5)
Exercise Prove the equivalence of (2.4) and (2.5).
Solution First, assume that (X_n) converges to X almost surely. Then for every ε, with probability 1 there exists N such that |X_n − X| < ε for every n > N, i.e.
P[⋂_{N=1}^∞ {∃n>N : |X_n − X| > ε}] = 0.
Continuity of probability now yields (2.5).
Now, assume that (2.5) holds. For every rational ε > 0, and every 0 < δ < ε, there exists N_δ such that P(∃n>N_δ : |X_n − X| > ε) < δ, and by continuity of probability,
P[∃N ∀n>N : |X_n − X| < ε] = 1.
Since there are only ℵ_0 rationals, we get
P[∀ rational ε > 0 ∃N ∀n>N : |X_n − X| < ε] = 1.
Almost sure convergence follows.
2.3.3 Convergence in Lp
Let 1 ≤ p < ∞. Let (X_n)_{n=1}^∞ be random variables, and let X be a random variable. Then we say that (X_n) converges to X in L^p, and write
X_n →_{L^p} X,
if
lim_{n→∞} E[|X_n − X|^p] = 0.
Exercise Show that if p > q, then convergence in Lp yields convergence in Lq.
Remark 2.9 We will mostly be interested in convergence in L1 and in L2.
2.3.4 Convergence in distribution
Let (X_n)_{n=1}^∞ be random variables, and let X be a random variable. Then we say that (X_n) converges to X in distribution, and write
X_n →_dist X,
if for every continuous, bounded f with compact support,
lim_{n→∞} E[f(X_n)] = E[f(X)].
Exercise Prove that this is equivalent to the following condition: for every x such that F_X is continuous at x,
lim_{n→∞} F_{X_n}(x) = F_X(x).
2.4 Relations between various types of convergence
In this section we will draw a complete diagram of relations between types of convergence.
Theorem 2.10 Let (X_n)_{n=1}^∞ be a sequence of random variables, and let X be a random variable.
1. If
X_n →_a.s. X,
then
X_n →_prob X.
2. If
X_n →_prob X,
then there exists a strictly increasing sequence (n_k)_{k=1}^∞ such that
X_{n_k} →_a.s. X.
Proof: 1. If a sequence (X_n) converges almost surely to a random variable X, then by (2.5), for every ε there exists N such that w.p. 1 − ε, for every n > N, we have |X − X_n| < ε. In particular, P[|X_n − X| > ε] < ε for every n > N, so (2.3) is satisfied.
2. We assume that the sequence (X_n) converges in probability to a random variable X. Then by (2.3) we can define a subsequence as follows: n_1 is chosen so that P[|X_{n_1} − X| > 2^{−1}] < 2^{−1}. Then inductively n_k is chosen to be a number larger than n_{k−1} such that P[|X_{n_k} − X| > 2^{−k}] < 2^{−k}. Such a choice is possible by (2.3). Now choose ε > 0. Then there exists K such that ε > 2^{−K}. Then
P[∃k>K : |X_{n_k} − X| > ε] ≤ ∑_{k=K+1}^∞ P[|X_{n_k} − X| > ε] ≤ ∑_{k=K+1}^∞ P[|X_{n_k} − X| > 2^{−k}] ≤ ∑_{k=K+1}^∞ 2^{−k} = 2^{−K} < ε.
Therefore, by (2.5), the convergence is a.s.
14.5
Theorem 2.11 If (Xn) converges to X in Lp, then (Xn) converges to X in probability.
Proof: Exercise, use Markov’s inequality.
Exercise Find examples showing that there is no implication between a.s. convergence
and convergence in Lp.
Theorem 2.12 1. If (Xn) converges to X in probability, then (Xn) converges to X in
distribution.
2. If (Xn) converges to X in distribution and X is an almost sure constant, then (Xn)
converges to X in probability.
Proof: 1. Let f be a bounded, continuous function with bounded support. Then f is uniformly continuous. Let M be a bound for |f|. Fix ε, and let δ < ε be such that if |x − y| < δ then |f(x) − f(y)| < ε. Let N be such that for all n > N we have
P[|X_n − X| > δ] < δ.
Then for n > N, let A be the event A = {|X − X_n| > δ}, and write
E(f(X)) − E(f(X_n)) = (E[f(X)1_A] − E[f(X_n)1_A]) + (E[f(X)1_{A^c}] − E[f(X_n)1_{A^c}]).
Now,
E[f(X)1_A] − E[f(X_n)1_A] ≤ 2M · P(A) < 2Mε,
and
E[f(X)1_{A^c}] − E[f(X_n)1_{A^c}] ≤ ε.
Therefore
|E(f(X)) − E(f(X_n))| < (2M + 1)ε.
2. Exercise.
So we get the following diagram of implications: almost sure convergence implies convergence in probability, convergence in L^p implies convergence in probability, and convergence in probability implies convergence in distribution; none of the remaining implications holds in general.
24.5
3 The law of large numbers
In this section we discuss the following theorem.
Theorem 3.1 Let (X_n)_{n=1}^∞ be i.i.d. random variables, and assume E(|X_1|) < ∞. Let E := E(X_1). Let
S_n = ∑_{k=1}^n X_k.
Then
S_n/n →_a.s. E.
Conversely, we have the following theorem.
Theorem 3.2 Let (X_n)_{n=1}^∞ be i.i.d. random variables, and let
S_n = ∑_{k=1}^n X_k.
Assume in addition that there exists a random variable X s.t.
S_n/n →_a.s. X.
Then E(|X_1|) < ∞, and X = E(X_1) a.s. (i.e. P(X = E(X_1)) = 1).
We will prove Theorem 3.1 in full when we study martingales (this is in the future...), but at the moment we prove an important special case.
Proof of Theorem 3.1 in the case var(X_1) < ∞: We assume that var(X_1) < ∞. Write σ^2 = var(X_1). First we prove that
S_{n^2}/n^2 →_a.s. E.
Indeed,
var(S_{n^2}) = ∑_{k=1}^{n^2} var(X_k) = n^2 σ^2,
and thus
var[S_{n^2}/n^2] = (1/n^4) var(S_{n^2}) = σ^2/n^2.
At the same time,
E[S_{n^2}/n^2] = E.
Fix ε > 0 rational. By Chebyshev's inequality,
P[|S_{n^2}/n^2 − E| ≥ ε] ≤ var[S_{n^2}/n^2]/ε^2 = (σ^2/ε^2) · (1/n^2),
and thus by Borel–Cantelli, with probability 1 there exists N_ε < ∞ such that for every n > N_ε,
|S_{n^2}/n^2 − E| < ε.   (3.1)
Therefore w.p. 1 (3.1) holds for every rational ε > 0, and therefore
S_{n^2}/n^2 →_a.s. E.
We now need to show that the non-squares do not do too much damage. Again, fix ε > 0 (rational). For every n, let
U_n = max_{n^2 < k ≤ (n+1)^2} |S_k/k − S_{n^2}/n^2|.
We want to estimate P(U_n ≥ ε).
We first estimate
P[|S_k/k − S_{n^2}/n^2| > ε]
for a given n^2 < k ≤ (n+1)^2. First we write
J_k := S_k/k − S_{n^2}/n^2 = (1/k) ∑_{j=n^2+1}^k X_j − (1/n^2 − 1/k) ∑_{j=1}^{n^2} X_j.
Thus
E(J_k) = ((k − n^2)/k − n^2 [1/n^2 − 1/k]) · E = 0.
We also need to calculate the variance:
var(J_k) = ((k − n^2)/k^2) σ^2 + ((k − n^2)/(k n^2))^2 · n^2 σ^2 ≤ (3n/n^4 + 9n^2/n^6) · σ^2 ≤ 4σ^2/n^3
for all n large enough and n^2 < k ≤ (n+1)^2.
Thus, by Chebyshev,
P[|J_k| > ε] ≤ P[|J_k − E(J_k)| > ε/2] ≤ 4 var(J_k)/ε^2 ≤ C_1/n^3,
and by a union bound,
P(U_n ≥ ε) ≤ [(n+1)^2 − n^2] · C_1/n^3 ≤ C_2/n^2,
and by Borel–Cantelli, w.p. 1, U_n < ε for all n large enough.
The theorem follows.
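A quick simulation of the statement just proved; the uniform(0, 1) variables (so E = 1/2 and finite variance) are our choice:

```python
import random

# Simulation of the strong law of large numbers: averages S_n/n of
# i.i.d. uniform(0, 1) variables, whose expectation is E = 1/2, for
# growing n. The averages settle near 1/2.
random.seed(1)

def average(n):
    return sum(random.random() for _ in range(n)) / n

for n in (10, 1_000, 100_000):
    print(n, average(n))
print(abs(average(100_000) - 0.5) < 0.01)  # True
```

A simulation of course only probes finitely many n; the content of the theorem is the almost sure convergence of the whole sequence.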
Proof of Theorem 3.2: Due to Theorem 3.1 it is sufficient to prove that E(|X_1|) < ∞. For contradiction, assume E(|X_1|) = ∞ while keeping the convergence assumption. Then
∑_{k=1}^∞ P(|X_1| ≥ k) = (−1) + ∑_{k=0}^∞ P(|X_1| ≥ k) ≥ (−1) + E(|X_1|) = ∞.
Therefore
∑_{n=1}^∞ P(|X_n| ≥ n) = ∑_{k=1}^∞ P(|X_1| ≥ k) = ∞,
and by Borel–Cantelli there a.s. exists a subsequence (X_{n_k})_{k=1}^∞ such that |X_{n_k}| ≥ n_k. For such n_k,
S_{n_k}/n_k − S_{n_k−1}/(n_k − 1) = X_{n_k}/n_k + ((n_k − 1)/n_k − 1) · S_{n_k−1}/(n_k − 1) = X_{n_k}/n_k − (1/n_k) · S_{n_k−1}/(n_k − 1).
Due to Cauchy's criterion,
lim_{k→∞} [X_{n_k}/n_k − (1/n_k) · S_{n_k−1}/(n_k − 1)] = 0.   (3.2)
On the other hand, due to convergence,
lim_{k→∞} (1/n_k) · S_{n_k−1}/(n_k − 1) = 0,   (3.3)
and due to the choice of the sequence (n_k),
lim inf_{k→∞} |X_{n_k}/n_k| ≥ 1.   (3.4)
(3.3) and (3.4) contradict (3.2).
Exercise: Is the following true? Let (X_n)_{n=1}^∞ be i.i.d. random variables, and let
S_n = ∑_{k=1}^n X_k.
Assume in addition that there exists a random variable X s.t.
S_n/n →_prob X.
Then E(|X_1|) < ∞, and X = E(X_1) a.s. (i.e. P(X = E(X_1)) = 1)?
28.5
4 Characteristic functions and the central limit theorem
4.1 Characteristic functions
Let X be a random variable, and let t ∈ R. Then the (complex) random variable eitX is
bounded by 1 and therefore has an expectation. Therefore, for every random variable X,
we can define its characteristic function φX : R→ C by φX(t) = E[eitX ].
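For a discrete random variable the defining expectation is a finite sum, so the characteristic function can be evaluated exactly; a small sketch with a fair coin on {0, 1} (our example):

```python
import cmath

# For a discrete random variable the characteristic function is a
# finite sum: phi_X(t) = sum over x of P(X = x) * e^{itx}. For a fair
# coin on {0, 1} this gives phi(t) = (1 + e^{it})/2, so phi(0) = 1
# and |phi(t)| <= 1 for every t.
def char_fn(dist, t):
    return sum(p * cmath.exp(1j * t * x) for x, p in dist)

coin = [(0.0, 0.5), (1.0, 0.5)]
print(char_fn(coin, 0.0))              # (1+0j)
print(abs(char_fn(coin, 2.0)) <= 1.0)  # True
```

The two printed facts are instances of the general properties listed next.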
We start by discussing the basic properties of the characteristic function.
Lemma 4.1 Let X be a random variable, and let φX : R → C be its characteristic
function.
1. φ_X is determined by the distribution of X alone.
2. φX(0) = 1.
3. |φX(t)| ≤ 1 for every t ∈ R.
4. φX is continuous in R.
5. If X has an expectation, then φX is everywhere differentiable, and φ′X(0) = iE(X).
Further, |φ′X(t)| ≤ E(|X|) for every t.
6. If X has a variance, then φ_X is everywhere twice differentiable, and φ″_X(0) = −E(X^2). Further, |φ″_X(t)| ≤ E(X^2) for every t.
7. If X and Y are independent, then φX+Y (t) = φX(t) · φY (t) for every t.
Exercise: Show that in parts 5 and 6, the function is, in fact, continuously (twice)
differentiable.
Proof of Lemma 4.1: 1. and 2. are obvious. 3. This is clear because e^{itX} is bounded by 1.
4. Fix ε and let M be such that P[|X| ≥ M] < ε/2. Let t_1, t_2 be such that |t_1 − t_2| < ε/(4M). Then,
|φ_X(t_1) − φ_X(t_2)| ≤ |E[(e^{it_1 X} − e^{it_2 X}) · 1_{|X|<M}]| + |E[(e^{it_1 X} − e^{it_2 X}) · 1_{|X|≥M}]| ≤ M|t_1 − t_2| + 2P[|X| ≥ M] < ε.
5. For every t and every h > 0,
|(e^{i(t+h)X} − e^{itX})/h| ≤ |X|.
As |X| is integrable, by the dominated convergence theorem,
E[iX e^{itX}] = E[lim_{h→0} (e^{i(t+h)X} − e^{itX})/h] = lim_{h→0} E[(e^{i(t+h)X} − e^{itX})/h] = lim_{h→0} (1/h)(E[e^{i(t+h)X}] − E[e^{itX}]) = dφ_X(t)/dt.
31.5
6. Same proof as that of 5.
7. From independence,
φX+Y (t) = E[eit(X+Y )] = E[eitXeitY ] = E[eitX ]E[eitY ] = φX(t)φY (t).
Example 4.2 Let X ∼ N(0, 1). We will calculate φ_X(t) for every t ∈ R. Let f be the density function of X, i.e.
f(x) = e^{−x^2/2}/√(2π).
Then,
φ_X(t) = E[e^{itX}] = ∫_{−∞}^∞ e^{itx} f(x) dx = (1/√(2π)) ∫_{−∞}^∞ exp(itx − x^2/2) dx
= (1/√(2π)) ∫_{−∞}^∞ exp(−(1/2)(x^2 − 2itx − t^2) − t^2/2) dx
= (e^{−t^2/2}/√(2π)) ∫_{−∞}^∞ exp(−(x − it)^2/2) dx = e^{−t^2/2},
where the last equality (i.e. the calculation of the integral) follows from Cauchy's theorem.
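The identity of Example 4.2 can be checked numerically; the truncation to [-10, 10] and the step size are our choices:

```python
import math

# Numerical check of Example 4.2: approximate the integral of
# e^{itx} f(x) for the standard normal density f by a Riemann sum on
# [-10, 10] (the tails beyond are negligible) and compare with
# e^{-t^2/2}. By symmetry the imaginary part integrates to zero, so
# only cos(tx) is summed.
def phi_normal(t, h=1e-3):
    total, x = 0.0, -10.0
    while x < 10.0:
        total += math.cos(t * x) * math.exp(-x * x / 2) * h
        x += h
    return total / math.sqrt(2 * math.pi)

for t in (0.0, 1.0, 2.0):
    print(t, round(phi_normal(t), 6), round(math.exp(-t * t / 2), 6))
```

The two columns agree to the displayed precision, as the contour-shift argument predicts.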
Exercise: Calculate the characteristic functions of Bernoulli, exponential and binomial
distributions.
4.2 The central limit theorem
Let (X_n)_{n=1}^∞ be i.i.d. random variables, and assume E(X_1) = µ, var(X_1) = σ^2 < ∞. We want to understand the behavior of the variable
S_n = ∑_{k=1}^n X_k
for n large. In particular, we want to establish some sort of convergence to a non-trivial variable. E(S_n) = nµ, so in order to get convergence we need to subtract nµ. Once this is done, the variance of S_n − nµ is nσ^2, so we may want to divide by σ√n = √(nσ^2).
Our purpose now is to prove the following theorem.
Theorem 4.3 (Central limit theorem)
Let (X_n)_{n=1}^∞ be i.i.d. random variables, and assume E(X_1) = µ, var(X_1) = σ^2 < ∞. Then,
(S_n − nµ)/(σ√n) →_dist N(0, 1).
Before proving the theorem, we note that w.l.o.g. we may assume µ = 0, σ^2 = 1, and then we get the more aesthetic form
S_n/√n →_dist N(0, 1).
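The theorem can be seen in a simulation; the uniform summands and the sample sizes are our choices:

```python
import random
import statistics

# Simulation of the CLT with X_i uniform on [0, 1], so mu = 1/2 and
# sigma^2 = 1/12: the normalized sums (S_n - n*mu)/(sigma*sqrt(n))
# should be approximately N(0, 1). For Z ~ N(0, 1), P(|Z| <= 1) is
# about 0.68.
random.seed(2)
n, trials = 100, 5_000
zs = [(sum(random.random() for _ in range(n)) - n * 0.5) / (n / 12) ** 0.5
      for _ in range(trials)]

print(abs(statistics.mean(zs)) < 0.1)  # True (empirical mean near 0)
inside = sum(1 for z in zs if abs(z) <= 1) / trials
print(0.65 < inside < 0.72)            # True (near 0.68)
```

Already at n = 100 the empirical fraction inside one standard deviation is close to the Gaussian value, which is the convergence in distribution asserted above.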
We start with a lemma that gives the intuitive explanation of the CLT.
Lemma 4.4 Let (X_n)_{n=1}^∞ be i.i.d. random variables such that E(X_1) = 0 and var(X_1) = 1. Let X ∼ N(0, 1). Let
S_n = ∑_{k=1}^n X_k,
and let U_n = S_n/√n. Then for every t ∈ R,
lim_{n→∞} φ_{U_n}(t) = φ_X(t).
Proof: Fix t. Then φ_{S_n}(t) = [φ_{X_1}(t)]^n, and
φ_{U_n}(t) = E[e^{iU_n t}] = E[e^{iS_n t/√n}] = φ_{S_n}(t/√n) = [φ_{X_1}(t/√n)]^n.
φ_{X_1}(0) = 1 and φ_{X_1} is continuous, and therefore for all n large enough,
log[φ_{X_1}(t/√n)]
is well defined. Thus, for n large enough,
log(φ_{U_n}(t)) = n log φ_{X_1}(t/√n).
Note that by Taylor's theorem,
log φ_{X_1}(t/√n) = log φ_{X_1}(0) + (log φ_{X_1})′(0) · t/√n + (log φ_{X_1})″(0) · t^2/(2n) + R_n,
where lim_{n→∞} nR_n = 0. Also, note that log φ_{X_1}(0) = 0, (log φ_{X_1})′(0) = 0, (log φ_{X_1})″(0) = −1. Therefore,
lim_{n→∞} log(φ_{U_n}(t)) = lim_{n→∞} n log φ_{X_1}(t/√n) = lim_{n→∞} [n · (−t^2/(2n)) + nR_n] = −t^2/2,
and therefore
lim_{n→∞} φ_{U_n}(t) = e^{−t^2/2}.
Class on 4.6 cancelled.
7.6
We now show that the convergence of the characteristic functions indeed guarantees convergence of the distributions. We do it in a few steps. The first step seems to be the
converse of the desired statement.
Lemma 4.5 If
X_n →_dist X,
then for every t ∈ R,
lim_{n→∞} φ_{X_n}(t) = φ_X(t).
Proof: First we define for every M > 0 the cut-off function g_M : R → R by
g_M(x) = 1 for |x| ≤ M;  g_M(x) = M + 1 − |x| for M ≤ |x| ≤ M + 1;  g_M(x) = 0 for |x| ≥ M + 1.
Let f : R → C be defined by f(x) = e^{itx}, and let f_M(x) = f(x)g_M(x). Then for every M,
lim_{n→∞} E(f_M(X_n)) = E(f_M(X)).
For every ε, there exists M such that P(|X| > M) < ε and P(|X_n| > M) < ε for every n. For this M and each n, we have that |E[f_M(X_n)] − E[f(X_n)]| < ε, and similarly, |E[f_M(X)] − E[f(X)]| < ε. The lemma follows.
The next lemma shows that the characteristic function determines the distribution.
Lemma 4.6 Assume X1 and X2 are two random variables with the same characteristic
function φ. Then X1 and X2 have the same distribution.
Proof: Let P_1 be the distribution of X_1, and P_2 that of X_2. It suffices to show that for every interval [a, b], where neither a nor b is an atom of P_1 or P_2, we have P_1([a, b]) = P_2([a, b]). By translation and multiplication by a constant, we may assume that the interval [a, b] is in fact the interval [−1, 1].
We now claim:
P_1([−1, 1]) = P_2([−1, 1]) = lim_{T→∞} (1/2π) ∫_{−T}^T ((e^{it} − e^{−it})/(it)) φ(t) dt.
Let
I(T) = ∫_{−T}^T ((e^{it} − e^{−it})/(it)) φ(t) dt = ∫_{−T}^T ∫ ((e^{it} − e^{−it})/(it)) e^{itx} dP(x) dt.
Note that the integrand is bounded by 2, and both measures are finite, and therefore we may use Fubini and say
I(T) = ∫_{−T}^T ∫ ((e^{it} − e^{−it})/(it)) e^{itx} dP(x) dt = ∫ ∫_{−T}^T ((e^{it} − e^{−it})/(it)) e^{itx} dt dP(x)
= ∫ [∫_{−T}^T (sin((x+1)t)/t) dt − ∫_{−T}^T (sin((x−1)t)/t) dt] dP(x).
Introducing the notation R(T, θ) = ∫_{−T}^T (sin(θt)/t) dt and S(T) = ∫_0^T (sin t/t) dt, we get
I(T) = ∫ (R(T, x + 1) − R(T, x − 1)) dP(x).   (4.1)
Changing the integration variable tells us that
R(T, θ) = 2 sgn(θ) S(T|θ|).
It is well known that limT→∞ S(T ) = π/2 (and even if not, it only changes the constant
in front of the integral), and therefore, for every x,
lim_{T→∞} (R(T, x + 1) − R(T, x − 1)) =  0 if |x| > 1;  π if |x| = 1;  2π if |x| < 1.
S(T ) is bounded, and therefore R is also bounded, and therefore by applying the bounded
convergence theorem we get the desired result.
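The inversion formula can be checked numerically for a distribution one knows explicitly. Below is a sketch (with an arbitrary choice of example, not from the lecture) for P = Uniform[−1/2, 1/2], whose characteristic function is φ(t) = sin(t/2)/(t/2); here P([−1, 1]) = 1, and the truncated integral indeed approaches 1:

```python
import numpy as np

# Uniform[-1/2, 1/2]: phi(t) = sin(t/2)/(t/2) and P([-1, 1]) = 1.
# The kernel (e^{it} - e^{-it})/(it) equals 2 sin(t)/t; np.sinc(x) = sin(pi x)/(pi x).
T = 200.0
t = np.linspace(-T, T, 400001)
integrand = 2 * np.sinc(t / np.pi) * np.sinc(t / (2 * np.pi))
approx = integrand.sum() * (t[1] - t[0]) / (2 * np.pi)
print(approx)  # close to 1
```

The convergence in T is slow because the integrand only decays like 1/t², which matches the oscillatory limit computed above.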
11.6
We now define the notion of tightness for random variables. Let (Xn)∞n=1 be a sequence
of random variables. We say that (Xn) is tight if for every ε there exists M such that for
every n,
P (|Xn| > M) < ε.
Lemma 4.7 Let (X_n)_{n=1}^∞ be a tight sequence of random variables. Then there exists a subsequence (X_{n_k})_{k=1}^∞ which converges in distribution.
To prove Lemma 4.7, we need to use the Riesz representation theorem.
Theorem 4.8 (Riesz representation)
Let C be the space of continuous functions with bounded support from R to R. Then every
bounded linear functional ψ : C → R can be represented as the integral with respect to a
signed measure.
Proof of Lemma 4.7: Let F be a countable dense collection of functions in C, and let f_1, f_2, . . . be an ordering of F. We take a subsequence (X_n^{(1)}) of (X_n) such that E(f_1(X_n^{(1)})) converges, then a subsequence (X_n^{(2)}) of (X_n^{(1)}) such that E(f_2(X_n^{(2)})) converges, and so on. At the end we take the diagonal subsequence X_{n_k} = X_k^{(k)}. Let φ : C → R be the functional defined by φ(f) = lim_{k→∞} E(f(X_{n_k})) on F, and elsewhere by continuity. φ is bounded, and therefore is the integral w.r.t. a signed measure µ. φ is non-negative, and therefore µ is non-negative. All we need to prove is that µ has total mass one. To this end, we use tightness. Fix ε, and let M be such that P(|X_n| > M) < ε for every n. Let

f(x) = 1 for |x| ≤ M;  f(x) = M + 1 − |x| for M ≤ |x| ≤ M + 1;  f(x) = 0 for |x| ≥ M + 1.

Then µ(f) ≥ 1 − ε, and thus µ(R) ≥ 1 − ε; since ε was arbitrary, µ(R) ≥ 1. That µ(R) ≤ 1 is an easy exercise. Therefore, X_{n_k} converges in distribution to any variable with distribution µ.
Lemma 4.9 Let (Xn) be random variables, and assume that φXn converges pointwise to
φ, where φ(0) = 1 and φ is continuous at zero. Then the sequence Xn is tight.
Proof: Fix ε. There exists δ s.t. for every n,

(1/2δ) ∫_{−δ}^{δ} φ_{X_n}(t) dt > 1 − ε/2.
Now fix (large) M. Then

(1/2δ) ∫_{−δ}^{δ} φ_{X_n}(t) dt ≤ P(|X_n| < M) + (1/2δ) max_{|m|>M} | ∫_{−δ}^{δ} e^{imt} dt |

≤ P(|X_n| < M) + (1/2δ) max_{|m|>M} |e^{imδ} − e^{−imδ}| / |m| ≤ P(|X_n| < M) + 1/(δM).

Therefore, for M such that 1/(δM) < ε/2, we get that P(|X_n| > M) < ε.
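As a concrete instance of the bound in the proof (an illustrative computation, not from the notes, taking X ~ N(0,1), so that φ(t) = e^{−t²/2} and P(|X| < M) has a closed form via the error function):

```python
from math import erf, sqrt, pi

# For X ~ N(0,1): (1/2d) * integral_{-d}^{d} e^{-t^2/2} dt has the closed
# form sqrt(pi/2) * erf(d/sqrt(2)) / d, and P(|X| < M) = erf(M/sqrt(2)).
d, M = 0.5, 10.0
lhs = sqrt(pi / 2) * erf(d / sqrt(2)) / d
rhs = erf(M / sqrt(2)) + 1.0 / (d * M)
print(lhs, rhs)  # lhs sits below rhs, as the proof's inequality predicts
```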
Proof of Theorem 4.3: By Lemma 4.4, φ_{X_n}(t) → e^{−t²/2}, which is a continuous function. Therefore, by Lemma 4.9, the sequence (X_n) is tight, and therefore by Lemma 4.7 every subsequence of (X_n) has a convergent subsubsequence. By Lemma 4.5 the characteristic function of the limit of any such subsubsequence is e^{−t²/2}, and thus by Lemma 4.6 the limit is N(0, 1). The theorem follows.
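The theorem can also be illustrated with a quick Monte Carlo experiment (a sketch with arbitrary parameters): normalized sums of centered uniform variables should have an empirical CDF close to that of N(0,1).

```python
import numpy as np
from math import erf, sqrt

# Normalized sums of centered Uniform[-1/2, 1/2] steps (variance 1/12);
# compare the empirical CDF at one point with Phi, the N(0,1) CDF.
rng = np.random.default_rng(7)
n, trials = 200, 20000
steps = rng.random((trials, n)) - 0.5
U = steps.sum(axis=1) / np.sqrt(n / 12.0)
x = 0.5
empirical = (U <= x).mean()
Phi = 0.5 * (1 + erf(x / sqrt(2)))  # about 0.6915
print(empirical, Phi)
```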
14.6 Tal Orenshtein
5 Conditional expectation and martingales
5.1 Conditional expectation
Let (Ω,F , P ) be a probability space, and let X be a random variable on it. The expecta-
tion of the random variable X is its average value. We may define its average conditioned
on an event A. This is done by integrating with respect to the probability measure conditioned on A. If we now partition Ω into events A_1, . . . , A_k (a partition means that the A_i-s are disjoint and their union is Ω), we can calculate the expectation conditioned on each of those events, and define a new variable Y as follows: if A_i occurs, then Y takes the value E(X|A_i).

It is easy to see that the variable Y is the unique variable satisfying the following properties:
1. Y is measurable w.r.t. the σ-algebra G := σ(A1, . . . , Ak).
2. For every (bounded) random variable Z which is measurable w.r.t. G, we have
E(ZY ) = E(ZX).
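On a finite space the two properties are easy to verify directly. The following toy computation (made-up weights and values, purely illustrative) builds Y from a partition and checks E(ZY) = E(ZX) for a G-measurable Z:

```python
import numpy as np

# Omega = {0,...,5} with made-up weights; G generated by the partition
# A1 = {0,1,2}, A2 = {3,4,5}.  Y equals E(X|A_i) on A_i.
p = np.array([0.1, 0.2, 0.1, 0.25, 0.15, 0.2])
X = np.array([1.0, 4.0, -2.0, 0.5, 3.0, 2.0])
parts = [np.array([0, 1, 2]), np.array([3, 4, 5])]

Y = np.empty_like(X)
for A in parts:
    Y[A] = np.dot(p[A], X[A]) / p[A].sum()

# A G-measurable Z is constant on each block; pick arbitrary constants.
Z = np.empty_like(X)
Z[parts[0]], Z[parts[1]] = 5.0, -1.0
gap = abs(np.dot(p, Z * Y) - np.dot(p, Z * X))
print(gap < 1e-12)  # True
```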
This formulation allows us to generalize the definition of conditional expectation to more
general σ-algebras.
Definition 5.1 Let (Ω,F , P ) be a probability space, let X be a random variable which
has an expectation, and let G ⊆ F be a σ-algebra on Ω. We say that a random variable
Y is the conditional expectation of X w.r.t. G, and denote Y = E(X|G) if:
1. Y is measurable w.r.t. G.
2. For every bounded random variable Z which is measurable w.r.t. G, we have
E(ZY ) = E(ZX).
It is not immediately obvious that the conditional expectation exists, or that it is unique.
We will give an example where this is a priori not obvious, and then prove both existence
and uniqueness of the conditional expectation.
Example 5.2 Let Ω = [0, 1), F = B, P = λ. We take X(ω) = sin ω. We then let Y_k be the k-th digit in the decimal expansion of ω, and take G = σ(Y_2, Y_4, Y_6, . . .).
Lemma 5.3 (Existence)
The conditional expectation always exists.
Proof: Let Q be the following measure defined on (Ω, F):

Q(A) = ∫_A X dP.
Note that P and Q are also defined on (Ω, G) because G ⊆ F. Now note that Q ≪ P.
Indeed, if P(A) = 0 then Q(A) = 0. Therefore there exists a Radon-Nikodym derivative
of Q w.r.t. P on the measurable space (Ω, G). We call this derivative Y, and claim that
Y is the conditional expectation. We verify the conditions in the definition:
1. This is obvious as Y was defined on the space (Ω,G).
2. Z is measurable in the space (Ω, G). Therefore,

∫ Z dQ = ∫ ZY dP = E(ZY).

Z is also measurable in the space (Ω, F), and thus

∫ Z dQ = ∫ ZX dP = E(ZX),

and we get E(ZY) = E(ZX).
Lemma 5.4 (Uniqueness)
Assume that Y1 is the conditional expectation of X w.r.t. G, and that at the same time
Y2 is the conditional expectation of X w.r.t. G. Then P (Y1 = Y2) = 1.
Proof: Assume P(Y_1 ≠ Y_2) > 0. Without loss of generality, there exists ε > 0 s.t. P(Y_1 − Y_2 > ε) > ε. Let Z = 1_{Y_1−Y_2>ε}. Then Z is bounded and is measurable w.r.t. G. Now, by the choice of Z, we have E(ZY_1) − E(ZY_2) ≥ ε · P(Y_1 − Y_2 > ε) > ε² > 0, in contradiction to the assumption that E(ZY_1) = E(ZX) = E(ZY_2).
We now collect some useful facts regarding the conditional expectation.
Lemma 5.5 1. Let G = {∅, Ω} be the trivial σ-algebra. Then E(X|G) = E(X) a.s.

2. Let G_2 ⊆ G_1 ⊆ F. Then E(E(X|G_1)|G_2) = E(X|G_2).

3. Jensen's inequality holds for conditional expectations, namely if f is convex then E(f(X)|G) ≥ f(E(X|G)) a.s. when everything is well-defined.

4. Define

var(X|G) := E[(X − E(X|G))²|G] = E(X²|G) − [E(X|G)]².

If var(X) < ∞, then var(X) = E(var(X|G)) + var(E(X|G)).
Exercise Find a variable X and σ-algebras G_1 and G_2 such that E(E(X|G_1)|G_2) ≠ E(E(X|G_2)|G_1).

Exercise Prove that E(E(X|G)) = E(X).
Proof: 1. Let Y = E(X|G). Y is a constant, and it has to satisfy Y = E(Y · 1) =
E(X · 1) = E(X).
2. Let Y1 = E(X|G1) and Y2 = E(X|G2). Let Z be measurable w.r.t. G2. Note that Z is
also measurable w.r.t. G1 because G2 ⊆ G1. Therefore,
E(ZY2) = E(ZX) = E(ZY1),
29
so Y2 = E(Y1|G2).
3. Let g : R² → R be a Borel measurable function s.t. (a) for every x, the function g(x, ·) is affine, (b) g(x, y) ≤ f(y) for every x and y, and (c) g(x, x) = f(x) for every x.
Exercise: Prove the existence of such g.
Now we may take Y = g(E(X|G), X). Then Y ≤ f(X), and a.s.

f(E(X|G)) = E(Y|G) ≤ E(f(X)|G).
4. Assume w.l.o.g. that E(X) = 0. Let Y = E(X|G) and W = X − Y. Then E(XY) = E(Y²) and therefore E(WY) = 0. Therefore,

var(X) = E(X²) = E(Y²) + E(W²) = var(E(X|G)) + E[(X − E(X|G))²] = E(var(X|G)) + var(E(X|G)).
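Part 4 (the law of total variance) can likewise be checked on a small finite space; the numbers below are made up for illustration:

```python
import numpy as np

# Law of total variance on a finite space with made-up numbers:
# var(X) = E(var(X|G)) + var(E(X|G)), G generated by {0,1,2} and {3,4,5}.
p = np.array([0.15, 0.1, 0.2, 0.25, 0.1, 0.2])
X = np.array([2.0, -1.0, 0.0, 3.0, 1.0, -2.0])
parts = [np.array([0, 1, 2]), np.array([3, 4, 5])]

EX_G = np.empty_like(X)      # E(X|G), constant on each block
varX_G = np.empty_like(X)    # var(X|G)
for A in parts:
    w = p[A] / p[A].sum()
    m = np.dot(w, X[A])
    EX_G[A] = m
    varX_G[A] = np.dot(w, (X[A] - m) ** 2)

var_X = np.dot(p, (X - np.dot(p, X)) ** 2)
decomposed = np.dot(p, varX_G) + np.dot(p, (EX_G - np.dot(p, EX_G)) ** 2)
gap = abs(var_X - decomposed)
print(gap < 1e-12)  # True
```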
18.6
5.2 Filtrations
Definition 5.6 Let (Ω,F , P ) be a probability space. A filtration is a sequence (Gn)∞n=1
of σ-algebras on Ω such that
1. For every n, Gn+1 is a refinement of Gn, i.e. Gn ⊆ Gn+1.
2. For every n, Gn ⊆ F .
We may also speak about finite filtrations, namely finite sequences of σ-algebras.
Examples:
1. Let Ω = [0, 1]^{ℵ_0} with the product σ-algebra and measure. We take

G_n = B × B × · · · × B × {∅, [0, 1]} × {∅, [0, 1]} × · · · ,

with n copies of the Borel σ-algebra B.
2. Let (Xn) be a sequence of random variables, then take Gn = σ(X1, . . . , Xn).
We can also define the notion of an inverse filtration which we will use later.
Definition 5.7 Let (Ω, F, P) be a probability space. An inverse filtration is a sequence (G_n)_{n=1}^∞ of σ-algebras on Ω such that
1. For every n, Gn is a refinement of Gn+1, i.e. Gn+1 ⊆ Gn.
2. For every n, Gn ⊆ F .
Example: Let (Xn) be a sequence of random variables, then take Gn = σ(Xn, Xn+1, . . .).
5.3 Martingales
Definition 5.8 Let (Xn) be a sequence of random variables, and let (Gn) be a filtration.
We say that (Xn) is a martingale with respect to (Gn) if
1. E(|Xn|) <∞ for every n.
2. Xn is measurable with respect to Gn for every n.
3. Xn = E(Xn+1|Gn) for every n.
Definition 5.9 A sequence (Xn) of random variables is called a martingale if there exists
a filtration (Gn) such that (Xn) is a martingale w.r.t. (Gn).
Examples

1. Simple random walk.

2. Conditional expectations of the same variable w.r.t. a filtration.
Lemma 5.10 1. A sequence (Xn) of random variables is a martingale if and only if
it is a martingale with respect to the natural filtration Kn = σ(X1, . . . , Xn).
2. Let (X_n) be a martingale w.r.t. a filtration (G_n). For every n, m, E(X_n) = E(X_m). Also, if m > n then X_n = E(X_m|G_n).
3. If (Xn) and (Yn) are both martingales w.r.t. (Gn), then (Xn+Yn) is also a martingale
w.r.t. (Gn).
Exercise:
1. Find two martingales whose sum is not a martingale.
2. (*) Find an example of a martingale (Xn) w.r.t. a filtration (Gn) such that there is
no variable X s.t. Xn = E(X|Gn) for all n.
Proof of Lemma 5.10: 1. Let (X_n) be a martingale w.r.t. (G_n). Clearly, E(|X_n|) < ∞ for every n, and it is also obvious that X_n is measurable with respect to K_n. We need to show that X_n = E(X_{n+1}|K_n). Note that K_n ⊆ G_n, and therefore every Z which is measurable w.r.t. K_n is also measurable w.r.t. G_n. Thus for all such Z, E(ZX_{n+1}) = E(ZX_n), and X_n = E(X_{n+1}|K_n).
21.6
2. This follows from Lemma 5.5.
3. This follows from the linearity of the conditional expectation.
Definition 5.11 Let (Xn) be a sequence of random variables, and let (Gn) be a filtration.
We say that (Xn) is a sub-martingale (resp super-martingale) with respect to (Gn) if
1. E(|Xn|) <∞ for every n.
2. Xn is measurable with respect to Gn for every n.
3. Xn ≤ E(Xn+1|Gn) (resp. Xn ≥ E(Xn+1|Gn)) for every n.
Examples
1. Random walk with a drift.
2. Let (Xn) be a martingale, and let f be convex. Then (f(Xn)) is a sub-martingale.
3. Let (Xn) be a martingale, and let f be concave. Then (f(Xn)) is a super-martingale.
4. Let (X_n) be a martingale. Then (|X_n|), (X_n^+) and (X_n^−) are sub-martingales.
5.4 Stopping times and the Optional stopping theorem
Definition 5.12 Let (G_n)_{n=1}^∞ be a filtration. A random variable T taking values in N ∪ {∞} is called a stopping time if for every n, the event {T ≤ n} belongs to G_n.
Examples:
1. Let (X_n) be a simple random walk, and let G_n = σ(X_1, . . . , X_n). Take T = inf{n : X_n = 9}. Then T is a stopping time.

2. Let (X_n) be a simple random walk, and let G_n = σ(X_1, . . . , X_n). Take T = sup{n < 100 : X_n = 0}. Then T is not a stopping time.

3. Let (X_n) be a simple random walk, and let G_n = σ(X_1, . . . , X_n). Take T = inf{n < 100 : X_n = max{X_k : k = 1, . . . , 99}}. Then T is not a stopping time.

4. Let (X_n) be a simple random walk, and let G_n = σ(X_1, . . . , X_n). Take T = inf{n ≥ 100 : X_n = max{X_k : k = 1, . . . , 99}}. Then T is a stopping time.
Exercise: Prove the assertions above.
We now discuss stopped martingales. We start with a useful lemma.
25.6
Lemma 5.13 Let (G_n) be a filtration, let (X_n) be a martingale w.r.t. (G_n), and let T be a stopping time w.r.t. (G_n). Let

Y_n = X_{n∧T} =  X_n if n ≤ T;  X_T if n ≥ T.

Then (Y_n) is a martingale.
Exercise Find an example of a martingale (Xn) and a random time T such that Xn∧T is
not a martingale.
Proof of Lemma 5.13:
E(Y_{n+1}|G_n) = 1_{T≤n} E(Y_{n+1}|G_n) + 1_{T>n} E(Y_{n+1}|G_n).

On the event {T ≤ n}, we have Y_{n+1} = X_T = Y_n, and thus

1_{T≤n} E(Y_{n+1}|G_n) = 1_{T≤n} Y_n.

On the event {T > n}, we have Y_n = X_n and Y_{n+1} = X_{n+1}, and thus

1_{T>n} E(Y_{n+1}|G_n) = 1_{T>n} E(X_{n+1}|G_n) = 1_{T>n} X_n = 1_{T>n} Y_n.
The lemma follows.
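For SRW the lemma can be verified exactly by enumerating all paths. Below, T is an illustratively chosen hitting time of {−1, 2} (not an example from the notes); the expectation of the stopped walk at time n is exactly 0:

```python
from itertools import product
from fractions import Fraction

# Enumerate all 2^n equally likely +-1 paths; freeze the walk once it hits
# {-1, 2} and check that E(X_{n AND T}) is exactly 0.
n = 8
total = Fraction(0)
for steps in product((-1, 1), repeat=n):
    x, stopped = 0, False
    for d in steps:
        if not stopped:
            x += d
            if x in (-1, 2):
                stopped = True
    total += x
mean = total / 2 ** n
print(mean)  # 0
```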
From here we get an important theorem known as the optional sampling theorem.
Theorem 5.14 (Optional Sampling Theorem)
Let (Gn) be a filtration, let (Xn) be a martingale w.r.t. (Gn), and let T be a stopping time
w.r.t. (Gn). Then E(XT ) = E(X1) if at least one of the following conditions hold:
1. T is bounded (i.e. there exists M s.t. P (T < M) = 1).
2. (X_n) is bounded (i.e. there exists M s.t. P(|X_n| < M for all n) = 1) and T is a.s. finite.
3. (Xn+1 −Xn) is bounded and E(T ) <∞.
Proof: 1. Since XT = YM and X1 = Y1, and (Yn) is a martingale, we get E(XT ) =
E(YM) = E(Y1) = E(X1).
2. Let N be a large number, and let S = min(T,N). Then S is a bounded stopping
time, so E(XS) = E(X1). Now, |XS − XT | < 2M , and P (XS 6= XT ) = P (T > N).
Therefore |E(X_T) − E(X_1)| < 2M · P(T > N), and as lim_{N→∞} P(T > N) = 0 we get E(X_T) = E(X_1).
3. Again, let S = min(T, N). Then E(X_S) = E(X_1), and

E(|X_T − X_S|) ≤ E( Σ_{k=N}^{T−1} |X_{k+1} − X_k| ) ≤ M Σ_{k=N}^{∞} P(T > k),

and as the last sum converges to 0 as N → ∞, we get that E(X_T) = E(X_1).
We now see several applications of the Optional Sampling Theorem. Let (Z_n)_{n=1}^∞ be i.i.d. with distribution P(Z_1 = 1) = P(Z_1 = −1) = 1/2, and let

X_n = Σ_{k=1}^n Z_k.

The sequence (X_n) is called Simple Random Walk (or SRW).
Example 5.15 Fix some a, b positive and integer. Let T = min{n : X_n ∈ {−a, b}}. Then

P(X_T = −a) = b/(a + b).
Example 5.16 Fix some a positive and integer. Let T = min{n : X_n ∈ {−a, a}}. Then E(T) = a².

Example 5.17 Let T = min{n : X_n = 1}. Then E(T) = ∞.
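The first two formulas are easy to test by simulation (an illustrative Monte Carlo run with arbitrary small a and b, not part of the notes):

```python
import numpy as np

# Stop SRW at T = min{n : X_n in {-a, b}}; optional sampling predicts
# P(X_T = -a) = b/(a+b), and for b = a, E(T) = a^2.
rng = np.random.default_rng(42)

def run(a, b):
    x = t = 0
    while -a < x < b:
        x += 1 if rng.integers(2) else -1
        t += 1
    return x, t

trials = 20000
frac = sum(run(2, 3)[0] == -2 for _ in range(trials)) / trials
print(frac)    # close to 3/5 = 0.6
mean_T = sum(run(3, 3)[1] for _ in range(trials)) / trials
print(mean_T)  # close to 3^2 = 9
```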
The same holds for sub and super martingales. We state the theorem for sub martingales,
and the proof is left as an exercise.
Theorem 5.18 Let (Gn) be a filtration, let (Xn) be a sub-martingale w.r.t. (Gn), and let
T be a stopping time w.r.t. (Gn). Then E(XT ) ≥ E(X1) if at least one of the following
conditions hold:
1. T is bounded (i.e. there exists M s.t. P (T < M) = 1).
2. (X_n) is bounded (i.e. there exists M s.t. P(|X_n| < M for all n) = 1) and T is a.s. finite.
3. (Xn+1 −Xn) is bounded and E(T ) <∞.
Exercise: Prove Theorem 5.18
5.5 Convergence Theorems
We start with a simple theorem, which nevertheless contains the main idea of the general
convergence theorem, and will slowly work our way to more and more general theorems.
Theorem 5.19 Let (Xn) be a positive martingale. Then limn→∞Xn exists a.s.
Theorem 5.19 is a special case of the following theorem:
Theorem 5.20 Let (X_n) be a positive sub-martingale, and assume

sup{E(X_n) : n = 1, 2, . . .} < ∞.

Then lim_{n→∞} X_n exists a.s.

Proof: Let

S = sup{E(X_n) : n = 1, 2, . . .}.
We first show that P(sup_n X_n < ∞) = 1. To this end we will show that

lim_{k→∞} P(sup_n X_n > k) = 0.

Fix k, and let T = inf{n : X_n > k} ≤ ∞. Let j > i be natural numbers. Then E(X_j|T = i) > k because (X_n) is a sub-martingale. The events (T = i)_{i=1,...} are disjoint, and therefore E(X_j|T < j) > k. Using Markov's inequality,

P(T < j) < E(X_j)/k ≤ S/k.

Since this holds for every j we get P(T < ∞) ≤ S/k, and as a result P(sup X_n < ∞) = 1.
We now proceed to prove the convergence. Let a < b be rational. We define a sequence of stopping times T_k and S_k as follows:

T_1 = inf{n : X_n < a} ≤ ∞;

S_k = inf{n > T_k : X_n > b} ≤ ∞;

T_{k+1} = inf{n > S_k : X_n < a} ≤ ∞.

We let C_{a,b} = max{k : S_k < ∞}, and call C_{a,b} the number of up-crossings from a to b.
We define a new sub-martingale (Y_n) as follows: Y_1 = X_1. For every n, if there exists k s.t. T_k ≤ n < S_k, then Y_{n+1} = Y_n + X_{n+1} − X_n, and else Y_{n+1} = Y_n. We make a few observations. The first is that E(Y_n) ≤ E(X_n) for every n. Indeed, let A_n be the event that there exists k s.t. T_k ≤ n < S_k. Then A_n ∈ G_n, and thus E(1_{A_n}(X_{n+1} − X_n)|G_n) = 1_{A_n}E(X_{n+1} − X_n|G_n) ≤ E(X_{n+1} − X_n|G_n), since the conditional increment of a sub-martingale is non-negative. Thus E(Y_{n+1} − Y_n) ≤ E(X_{n+1} − X_n), and by induction E(Y_n) ≤ E(X_n).
The second observation, which is left as an exercise, is that

lim_{n→∞} E(Y_n) ≥ (b − a)E(C_{a,b}).
Therefore, a.s., for every rational a and b the number of up-crossings is finite, and therefore (X_n) converges almost surely.
We can now state and prove the main theorem in this section.
Theorem 5.21 (The Martingale Convergence Theorem)
Let (X_n)_{n=1}^∞ be a martingale, and assume that sup_n E(|X_n|) < ∞. Then the sequence (X_n) converges almost surely.

Proof: By Theorem 5.20, both (X_n^+) and (X_n^−) converge, and therefore (X_n) converges.
The same proof idea is useful in proving the inverse martingale convergence theorem.
Theorem 5.22 (Inverse Martingale Convergence Theorem)
Let (Xn)∞n=1 be an inverse martingale. Then (Xn) converges almost surely.
5.6 Uniform integrability and convergence in L1
Definition 5.23 Let (X_n)_{n=1}^∞ be a sequence of random variables. We say that the sequence (X_n) is uniformly integrable if for every ε > 0 there exists M such that E(|X_n| · 1_{|X_n|>M}) < ε for every n.
Exercises
1. If sup_n E(X_n²) < ∞ then (X_n) is uniformly integrable.

2. "Double or nothing": Let (Y_n) be i.i.d. variables with P(Y_1 = 0) = P(Y_1 = 2) = 0.5. Let X_1 = 1, and X_{n+1} = X_n Y_n for all n. Then (X_n) is not uniformly integrable.
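Exercise 2 can be made concrete with a few lines: X_n takes the value 2^{n−1} with probability 2^{−(n−1)} and 0 otherwise, so E(X_n) = 1 for every n, yet for any fixed M the tail mass E(|X_n| 1_{|X_n|>M}) jumps back to 1 once 2^{n−1} > M, so no single M works for all n (an illustrative computation):

```python
# X_n = 2^(n-1) w.p. 2^-(n-1), else 0: expectation 1, but the tail above a
# fixed level M carries the whole expectation for large n.
M = 100.0
expectations, tails = [], []
for n in [1, 5, 10, 20]:
    value, prob = 2.0 ** (n - 1), 2.0 ** (-(n - 1))
    expectations.append(value * prob)
    tails.append(value * prob if value > M else 0.0)
print(expectations)  # [1.0, 1.0, 1.0, 1.0]
print(tails)         # [0.0, 0.0, 1.0, 1.0]
```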
2.7
Proof sketch of the inverse martingale convergence theorem: one gets a bound on the number of down-crossings, just as in the proof of Theorem 5.20. We also need to show that the sequence does not go to infinity: by the crossing bound, if the sequence did not converge it would have to go to infinity, but this cannot happen due to the L¹ bound.
5.7 Doob’s maximal inequality and the reflection principle
Let (X_n) be a martingale. Assume w.l.o.g. that E(X_1) = 0. We know that by Chebyshev,

P(|X_n| > γ) < var(X_n)/γ².
The following result shows that we have similar control over max(X1, X2, . . . , Xn).
Theorem 5.24 (Doob's Maximal Inequality)
Let (X_n) be a martingale, and assume E(X_1) = 0. Then, for every γ > 0,

P(max(X_1, . . . , X_n) > γ) < var(X_n)/γ².
Remark: for max(|X_1|, . . . , |X_n|) we then get an extra factor of 2.
Proof: Let T = min{k : X_k > γ}. The events A_k = {T = k} are disjoint, E(X_n²|A_k) ≥ γ², and thus E(X_n²|T ≤ n) ≥ γ², and from here we get what we need.
Exercise: Write an Lp inequality.
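A quick simulation (illustrative, with arbitrary parameters) shows the inequality in action for SRW, where var(X_n) = n:

```python
import numpy as np

# SRW: compare the empirical P(max(X_1..X_n) > gamma) with the Doob bound
# var(X_n)/gamma^2 = n/gamma^2.
rng = np.random.default_rng(0)
n, gamma, trials = 100, 25, 20000
steps = rng.choice([-1, 1], size=(trials, n))
paths = steps.cumsum(axis=1)
lhs = (paths.max(axis=1) > gamma).mean()
rhs = n / gamma ** 2
print(lhs, rhs)  # the empirical probability sits well below the bound
```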
An even stronger statement holds for sums of independent variables. The next result is about sums of Gaussians.
Theorem 5.25 (Reflection Principle)
Let (X_n) be i.i.d. N(0, 1), and let S_n = Σ_{k=1}^n X_k. Then, for all γ > 0 and n,

P[max(S_1, . . . , S_n) > γ] = 2P[S_n > γ].
Proof: reflection.
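A discrete cousin of this identity can be checked exactly for the ±1 walk: reflecting at the first visit to γ gives P(max(S_1, . . . , S_n) ≥ γ) = 2P(S_n > γ) + P(S_n = γ), with an extra boundary term that is absent in the Gaussian setting. A brute-force verification over all paths (the choice of n and γ is illustrative):

```python
from itertools import product

# Enumerate all 2^n paths of the +-1 walk and verify the discrete
# reflection identity P(M_n >= g) = 2 P(S_n > g) + P(S_n = g).
n, g = 12, 4
hit = ends_above = ends_at = 0
for steps in product((-1, 1), repeat=n):
    s = m = 0
    for d in steps:
        s += d
        m = max(m, s)
    hit += m >= g
    ends_above += s > g
    ends_at += s == g
print(hit == 2 * ends_above + ends_at)  # True
```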
5.8 Azuma's inequality

Let (X_n)_{n=0}^N be a martingale with bounded increments, |X_k − X_{k−1}| ≤ c_k for every k. Then for every t > 0,

P(X_N − X_0 > t) ≤ exp( −t² / (2 Σ_{k=1}^N c_k²) ).
Some applications.
For the proof: If E(X) = 0 and |X| ≤ c, then define Y by Y = c w.p. (X + c)/(2c) and Y = −c with the complementary probability, so that E(Y|X) = X. Then by Jensen

E(e^X) ≤ E(e^Y) = Σ_{k=0}^∞ E(Y^k)/k! = Σ_{k=0}^∞ c^{2k}/(2k)! ≤ Σ_{k=0}^∞ (c²/2)^k/k! = e^{c²/2}.
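For SRW with c_k = 1 the bound reads P(S_n > t) ≤ e^{−t²/(2n)}, and it can be compared with the exact binomial tail (an illustrative check, not from the notes):

```python
import math

# SRW: S_n = 2K - n with K ~ Bin(n, 1/2).  Exact P(S_n > t) vs Azuma's
# bound exp(-t^2 / (2n)) with c_k = 1.
def srw_tail(n, t):
    kmin = math.floor((n + t) / 2) + 1           # S_n > t  <=>  K >= kmin
    return sum(math.comb(n, k) for k in range(kmin, n + 1)) / 2 ** n

n, t = 100, 20
exact, bound = srw_tail(n, t), math.exp(-t ** 2 / (2 * n))
print(exact, bound)  # the exact tail lies below the bound
```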