ECE 541 - NOTES ON PROBABILITY THEORY ANDSTOCHASTIC PROCESSES
MAJEED M. HAYAT
These notes are made available to students of the course ECE-541 at the University
of New Mexico. They cannot and will not be used for any type of profitable repro-
duction or sale. Parts of the materials have been extracted from texts and notes by
other authors. These include the text of Probability Theory by Y. Chow and H. Te-
icher, the notes on stochastic analysis by T. G. Kurtz, the text of Real and Complex
Analysis by W. Rudin, the notes on measure and integration by A. Beck, the text
of Probability and Measure by P. Billingsley, the text of Introduction to Stochastic
Processes by P. G. Hoel, S. C. Port, and C. J. Stone, and possibly other sources.
In compiling these notes I tried, as much as possible, to make explicit reference to
these sources. If any omission is found, it is due to my oversight for which I seek the
authors’ forgiveness.
I am truly indebted to these outstanding scholars and authors, some of whom were
my teachers.
Date: November 15, 2006.
1. Fundamentals of Probability: Set-1
1.1. Experiments. The most fundamental component in probability theory is the
notion of a physical (or sometimes imaginary) “experiment,” whose “outcome” is
revealed when the experiment is completed. Probability theory aims to provide the
tools that will enable us to assess the likelihood of an outcome, or more generally, the
likelihood of a collection of outcomes. Let us consider the following example:
Example 1. Shooting a dart: Consider shooting a single dart at a target (board)
represented by the unit closed disc, D, which is centered at the point (0, 0). We write
D = {(x, y) ∈ IR^2 : x^2 + y^2 ≤ 1}. Here, IR denotes the set of real numbers (same as (−∞,∞)), and IR^2 is the set of all points in the plane (IR^2 = IR × IR, where × denotes the Cartesian product of sets). We read the above description of D as “the set of all points (x, y) in IR^2 such that (or with the property) x^2 + y^2 ≤ 1.”
1.2. Outcomes and the Sample Space. Now we define what we mean by an
outcome: An outcome can be “missing the target,” or “miss” for short, in which
case the dart misses the board entirely, or it can be the location of the dart when it
hits the board. Note that we have implicitly decided (or chosen) that we do not
care where the dart lands whenever it misses the board. (The definition of an
outcome is totally arbitrary and therefore it is not unique for any experiment. It
depends on whatever makes sense to us.) Mathematically, we form what is called the
sample space as the set containing all possible outcomes. If we call this set Ω, then
for our dart example, Ω = {miss} ∪ D, where the symbol ∪ denotes set union (we
say x ∈ A ∪ B if and only if x ∈ A or x ∈ B). We write ω ∈ Ω to denote a specific
outcome from the sample space Ω. For example, we may have ω = “miss,” ω = (0, 0),
and ω = (0.1, 0.2); however, according to our definition of an outcome, ω cannot be
(1,1).
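To make the experiment concrete, here is a small simulation sketch; the miss probability of 0.5 and the uniform landing point are modeling assumptions for illustration only and are not part of the definition of Ω:

```python
import random

def shoot_dart(p_miss=0.5):
    """One outcome of the dart experiment: either the string "miss",
    or a landing point (x, y) in the closed unit disc D.
    The mechanism (p_miss, uniform landing) is an assumption."""
    if random.random() < p_miss:
        return "miss"
    # rejection-sample a uniform point in the unit disc
    while True:
        x, y = random.uniform(-1, 1), random.uniform(-1, 1)
        if x * x + y * y <= 1:
            return (x, y)

random.seed(0)
outcomes = [shoot_dart() for _ in range(1000)]
# every outcome is either "miss" or a point of D; (1, 1) can never occur
assert all(w == "miss" or w[0] ** 2 + w[1] ** 2 <= 1 for w in outcomes)
```

Note that every value returned by `shoot_dart` is a member of Ω = {miss} ∪ D by construction.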
1.3. Events. An event is a collection of outcomes, that is, a subset in Ω. Such a
subset can be associated with a question that we may ask about the outcome of the
experiment and whose answer can be determined after the outcome is revealed. For
example, the question: “Q1: Did the dart land within 0.5 of the bull’s eye?” can be
associated with the subset of Ω (or event) given by E = {(x, y) ∈ IR^2 : x^2 + y^2 ≤ 1/4}. We would like to call the set E an event. Now consider the complement of Q1, that is, Q2: “Did the dart not land within 0.5 of the bull’s eye?” with which we can associate the event E^c, where the superscript “c” denotes set complementation (relative to Ω). Namely, E^c = Ω \ E, which is the set of outcomes (or points or members) of Ω that
are not already in E. The notation “\” represents set difference (or subtraction). In
general, for any two sets A and B, A \ B is the set of all elements that are in A but
not in B. Note that E^c = {(x, y) ∈ D : x^2 + y^2 > 1/4} ∪ {miss}. The point here is that if E is an event, then we would want E^c to qualify as an event as well, since
we would like to be able to ask the logical negative of any question. In addition,
we would also like to be able to form a logical “or” of any two questions about the
experiment outcome. Thus, if E1 and E2 are events, we would like their union to also
be an event.
Here is another illustrative example: for each n = 1, 2, . . ., define the subset En = {(x, y) ∈ IR^2 : x^2 + y^2 ≤ 1 − 1/n}, and let E = ⋃_{n=1}^∞ En be their countable union. (Notation: we say ω ∈ ⋃_{n=1}^∞ En if and only if ω ∈ En for some n. Similarly, we say ω ∈ ⋂_{n=1}^∞ En if and only if ω ∈ En for all n.) It is not hard to see (prove it) that E = {(x, y) ∈ IR^2 : x^2 + y^2 < 1}, which corresponds to the valid question “did the dart land inside the board?” Thus, we would want E (which is a countable union of events) to be an event as well.
Example 2. For each n ≥ 1, take An = (−1/n, 1/n). Then ⋂_{n=1}^∞ An = {0} and ⋃_{n=1}^∞ An = (−1, 1). Prove these.
Finally, we should be able to ask whether or not the experiment was conducted; that is, we would like to label the sample space Ω as an event as well. With this
(hopefully) motivating introduction, we proceed to formally define what we mean by
events.
Definition 1. A collection F of subsets of Ω is called a σ-algebra (reads as sigma-
algebra) if:
(1) Ω ∈ F
(2) ⋃_{n=1}^∞ En ∈ F whenever En ∈ F, n = 1, 2, . . .
(3) E^c ∈ F whenever E ∈ F
If F is a σ-algebra, then its members are called events. Note that F is a collection
of subsets and not a union of subsets; thus, F itself is not a subset of Ω.
Here are some consequences of the above definition (you should prove all of them):
(1) ∅ ∈ F .
(2) ⋂_{n=1}^∞ En ∈ F whenever En ∈ F, n = 1, 2, . . .. Here, the countable intersection ⋂_{n=1}^∞ En is defined as follows: ω ∈ ⋂_{n=1}^∞ En if and only if ω ∈ En for all n.
(3) A \ B ∈ F whenever A, B ∈ F. (First prove that A \ B = A ∩ B^c.)
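On a finite sample space, Definition 1 can be checked mechanically, since countable unions reduce to finite ones. A short sketch (the four-point Ω below is an arbitrary illustrative choice):

```python
from itertools import chain, combinations

def is_sigma_algebra(omega, F):
    """Check Definition 1 on a finite sample space, where countable
    unions reduce to finite (in fact pairwise) unions."""
    F = {frozenset(A) for A in F}
    if frozenset(omega) not in F:          # (1) Omega in F
        return False
    for A in F:
        if frozenset(omega) - A not in F:  # (3) closure under complement
            return False
        for B in F:
            if A | B not in F:             # (2) closure under union
                return False
    return True

omega = {1, 2, 3, 4}
power_set = {frozenset(s) for s in chain.from_iterable(
    combinations(sorted(omega), r) for r in range(len(omega) + 1))}

assert is_sigma_algebra(omega, power_set)                        # cf. Example 4
assert is_sigma_algebra(omega, {frozenset(), frozenset(omega)})  # cf. Example 3
assert not is_sigma_algebra(omega, {frozenset(omega), frozenset({1})})
```

The last check fails because the complement of {1} is missing, violating consequence-closure under complementation.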
Generally, members of any σ-algebra (not necessarily corresponding to a random ex-
periment and a sample space Ω) are called measurable sets. Measurable sets were
first introduced in the branch of mathematics called analysis. They were adopted into probability by the great mathematician Kolmogorov. Clearly, in probability
theory we call measurable subsets of Ω events.
Definition 2. Let Ω be a sample space and let F be a σ-algebra of events. We call
the pair (Ω,F) a measurable space.
Definition 3. A collection D of subsets of Ω is called a sub-σ-algebra of F if
(1) D ⊂ F (this means that if A ∈ D, then automatically A ∈ F ).
(2) D is itself a σ-algebra.
Example 3. {∅, Ω} is a sub-σ-algebra of any other σ-algebra.
Example 4. The power set of Ω, which is the set of all subsets of Ω, is a σ-algebra.
In fact, it is the maximal σ-algebra in the sense that it contains any other σ-algebra. The power set of a set Ω is often denoted by 2^Ω.
Interpretation: Once again, we emphasize that it would be meaningful to think
of a σ-algebra as a collection of all valid questions that one may ask about an ex-
periment. The collection has to satisfy certain self-consistency rules, dictated by the
requirements for a σ-algebra, but what we mean by “valid” is really up to us as long
as the self-consistency rules defining the collection of events are met.
Generation of σ-algebras: Let M be a collection of events (not necessarily a σ-algebra); that is, M ⊂ F. This collection could be a collection of certain events of interest. For example, in the dart experiment we may define M = {{miss}, {(x, y) ∈ IR^2 : 1/4 ≤ x^2 + y^2 ≤ 1/2}}. The main question now is the following: Can we construct a minimal (or smallest) σ-algebra that contains M? If such a σ-algebra exists, call it FM, then it must possess the following property: if D is another σ-algebra containing M, then necessarily FM ⊂ D. Hence FM is minimal.
The following Theorem states that there is such a minimal σ-algebra.
Theorem 1. Let M be a collection of events, then there is a minimal σ-algebra
containing M.
Before we prove the Theorem let us look at an example.
Example 5. Suppose that Ω = (−∞,∞), and M = {(−∞, 1), (0,∞)}. It is easy to check that the minimal σ-algebra containing M is FM = {∅, Ω, (−∞, 1), (0,∞), (0, 1), (−∞, 0], [1,∞), (−∞, 0] ∪ [1,∞)}. Explain where each member comes from.
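On a finite sample space, FM can actually be computed by brute force: close M under complements and unions until nothing new appears. The sketch below uses a six-point Ω whose generating sets are chosen to mimic Example 5 (the integer labels are hypothetical stand-ins for the intervals):

```python
def generate_sigma_algebra(omega, M):
    """Minimal sigma-algebra containing M, computed by repeatedly closing
    under complement and pairwise union (sufficient on a finite omega)."""
    omega = frozenset(omega)
    F = {frozenset(), omega} | {frozenset(A) for A in M}
    while True:
        new = {omega - A for A in F} | {A | B for A in F for B in F}
        if new <= F:          # nothing new: closure reached
            return F
        F |= new

# finite stand-in for Example 5: {1,2,3} plays the role of (-inf, 1) and
# {3,4,5,6} the role of (0, inf), so {3} corresponds to (0, 1), etc.
omega = {1, 2, 3, 4, 5, 6}
FM = generate_sigma_algebra(omega, [{1, 2, 3}, {3, 4, 5, 6}])
assert len(FM) == 8           # mirrors the 8 members listed in Example 5
assert frozenset({3}) in FM   # the analogue of (0, 1)
```

The eight members arise from the three disjoint "atoms" {1, 2}, {3}, and {4, 5, 6}, exactly as the intervals (−∞, 0], (0, 1), and [1,∞) generate the eight sets of Example 5.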
Proof of Theorem: Let KM be the collection of all σ-algebras that contain M. We observe that such a collection is not empty, since it contains at least the power set 2^Ω. Let us label each member of KM by an index α, namely Dα, where α ∈ I for some index set I. Define FM = ⋂_{α∈I} Dα. We need to show that 1) FM is a σ-algebra containing M, and 2) FM is minimal. Since each Dα contains M, so does their intersection FM. Note that each Dα contains Ω (since each Dα is a σ-algebra); thus, FM contains Ω. Next, if A ∈ FM, then A ∈ Dα for each α ∈ I. Thus, A^c ∈ Dα for each α ∈ I (since each Dα is a σ-algebra), which implies that A^c ∈ ⋂_{α∈I} Dα. Next, suppose that A1, A2, . . . ∈ FM. Then, we know that A1, A2, . . . ∈ Dα for each α ∈ I. Moreover, ⋃_{n=1}^∞ An ∈ Dα for each α ∈ I (again, since each Dα is a σ-algebra), and thus ⋃_{n=1}^∞ An ∈ ⋂_{α∈I} Dα. This completes the proof that FM is a σ-algebra. Now suppose that F′M is another
σ-algebra containing M; we will show that FM ⊂ F′M. First note that since KM is the collection of all σ-algebras containing M, it must be true that F′M = Dα∗ for some α∗ ∈ I. Now if A ∈ FM, then necessarily A ∈ Dα∗ (since Dα∗ is one of the members of the intersection that defines FM), and consequently A ∈ F′M, since F′M = Dα∗. This establishes that FM ⊂ F′M and completes the proof of the theorem. □
Example 6. Let U be the collection of all open sets in IR. Then, according to the
above theorem, there exists a minimal σ-algebra containing U . This σ-algebra is called
the Borel σ-algebra, B, and its elements are called the Borel subsets of IR. Note that
by virtue of set complementation, union and intersection, B contains all closed sets,
half open intervals, their countable unions, intersections, and so on.
(Reminder: A subset U in IR is called open if for every x ∈ U , there exists an open
interval centered at x which lies entirely in U . Closed sets are defined as complements
of open sets. These definitions extend to IRn in a straightforward manner.)
Restrictions of σ-algebras: Let F be a σ-algebra. For any measurable set U , we
define F ∩ U as the collection of all intersections between U and the members of F, that is, F ∩ U = {V ∩ U : V ∈ F}. It is easy to show that F ∩ U is also a σ-algebra,
which is called the restriction of F to U .
Example 7. Back to the dart experiment: What is a reasonable σ-algebra for this experiment? I would say any such σ-algebra should contain all the Borel subsets of D and the event {miss}. So we can take M = {{miss}} ∪ (B ∩ D). It is easy to check that in this case

(1) FM = ((B ∩ D) ∪ {miss}) ∪ (B ∩ D),

where for any σ-algebra F and any measurable set U, we define F ∪ U as the collection of all unions between U and the members of F, that is, F ∪ U = {V ∪ U : V ∈ F}. Note that F ∪ U is not always a σ-algebra (contrary to F ∩ U), but in this example it is because {miss} is the complement of D.
1.4. Random Variables. Motivation: Recall the dart experiment, and define the
following transformation on the sample space:
X : Ω → IR defined as

X(ω) = 1, if ω ∈ D,
X(ω) = 0, if ω = miss,

where, as before, D = {(x, y) ∈ IR^2 : x^2 + y^2 ≤ 1}.
Consider the collection of outcomes that we can identify if we knew that X fell in the interval (−∞, r). More precisely, we want to identify {ω ∈ Ω : X(ω) ∈ (−∞, r)}, or equivalently, the set X^{-1}((−∞, r)), which is the inverse image of the set (−∞, r) under the transformation (or function) X. It can be checked that (you must work these out carefully):

For r ≤ 0, X^{-1}((−∞, r)) = ∅;
for 0 < r ≤ 1, X^{-1}((−∞, r)) = {miss};
and for r > 1, X^{-1}((−∞, r)) = Ω.

The important thing to note is that in each case, X^{-1}((−∞, r)) is an event; that is, X^{-1}((−∞, r)) ∈ FM, where FM was defined earlier for this experiment in (1). This is a direct consequence of the way we defined the function X.
Here is another transformation defined on the outcomes of the dart experiment. Define

(2) Y(ω) = 10, if ω = miss; Y(ω) = √(x^2 + y^2), if ω = (x, y) ∈ D.

Let’s consider the collection of outcomes that correspond to Y < 1/2, which we can write as {ω ∈ Ω : Y(ω) < 1/2} = Y^{-1}((−∞, 1/2)). Note that

Y^{-1}((−∞, 1/2)) = {(x, y) ∈ IR^2 : x^2 + y^2 < 1/4} ∈ FM.

Moreover, we can also show that

Y^{-1}((−∞, 2)) = D ∈ FM,
Y^{-1}((−∞, 11)) = D ∪ {miss} = Ω ∈ FM,
Y^{-1}((−∞, 0)) = ∅ ∈ FM.
We emphasize again that Y^{-1}((−∞, r)) is always an event; that is, Y^{-1}((−∞, r)) ∈ FM, where FM was defined earlier for this experiment in (1). Again, this is a direct consequence of the way we defined the function Y.
Motivated by these examples, we proceed to define what we mean by a random
variable in general.
Definition 4. Let (Ω,F) be a measurable space. A transformation X : Ω → IR is
said to be F-measurable if for every r ∈ IR, X−1((−∞, r)) ∈ F . Any measurable X
is called a random variable.
Now let X be a random variable and consider the collection of events M = {{ω ∈ Ω : X(ω) ∈ (−∞, r)} : r ∈ IR}, which can also be written more conveniently as {X^{-1}((−∞, r)) : r ∈ IR}. As before, let FM be the minimal σ-algebra containing M. Then FM is the “information” that the random variable X conveys about the experiment. We denote such a σ-algebra from this point on by σ(X), the σ-algebra generated by the random variable X. In the above example of X, σ(X) = {∅, {miss}, Ω, D}. This concept is demonstrated in the next example. (Also see HW#1 for another example of σ(X).)
Example 8. Back to the dart example: Let M′ ≜ {∅, {miss}, Ω}; then it is easy to check that FM′ = {∅, {miss}, Ω, D}, which also happens to be a subset of FM as defined in (1). However, we observe that in fact X^{-1}((−∞, r)) ∈ FM′ for any r ∈ IR. Intuitively, FM′, which we also call σ(X), can be identified as the set of all events that the function X can convey about the experiment. In particular, FM′ consists of precisely those events whose occurrence or nonoccurrence can be determined through our observation of the value of X. In other words, FM′ is all the “information” that the mapping X can provide about the outcome of the experiment. Note that FM′ is much smaller than the original σ-algebra FM, which was ((B ∩ D) ∪ {miss}) ∪ (B ∩ D). As we have seen before, {∅, {miss}, Ω, D} ⊂ ((B ∩ D) ∪ {miss}) ∪ (B ∩ D). Thus, X can only partially inform us of the true outcome of the experiment; it can only tell us whether we hit the target or missed it, nothing else, which is precisely the information contained in FM′.
Facts about Measurable Transformations: Let (Ω,F) be a measurable space.
The following statements are equivalent:
(1) X is a measurable transformation.
(2) X^{-1}((−∞, r]) ∈ F for all r ∈ IR.
(3) X^{-1}((r,∞)) ∈ F for all r ∈ IR.
(4) X^{-1}([r,∞)) ∈ F for all r ∈ IR.
(5) X^{-1}((a, b)) ∈ F for all a ≤ b.
(6) X^{-1}([a, b)) ∈ F for all a ≤ b.
(7) X^{-1}([a, b]) ∈ F for all a ≤ b.
(8) X^{-1}((a, b]) ∈ F for all a ≤ b.
(9) X^{-1}(B) ∈ F for all B ∈ B.
Using (9), we can equivalently define σ(X) as {X^{-1}(B) : B ∈ B}, which can be directly shown to be a σ-algebra (prove it).
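On a finite sample space, fact (9) gives a concrete recipe for σ(X): collect the preimages X^{-1}(S) over all sets S of values. A small sketch, using a hypothetical three-point model of the dart experiment in which two "hit" outcomes stand in for D:

```python
from itertools import chain, combinations

# finite model of the dart experiment (hypothetical labels)
Omega = {"miss", "hit1", "hit2"}
X = {"miss": 0, "hit1": 1, "hit2": 1}   # X = 1 on the "board", 0 on miss

values = set(X.values())
subsets_of_range = chain.from_iterable(
    combinations(sorted(values), r) for r in range(len(values) + 1))
# sigma(X) = { X^{-1}(S) : S a set of values }, the preimage sigma-algebra
sigma_X = {frozenset(w for w in Omega if X[w] in S) for S in subsets_of_range}

# mirrors sigma(X) = {emptyset, {miss}, D, Omega} from the notes
assert sigma_X == {frozenset(), frozenset({"miss"}),
                   frozenset({"hit1", "hit2"}), frozenset(Omega)}
```

The two hit outcomes are never separated by any preimage, which is exactly the sense in which X carries only the hit-or-miss information.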
2. Set 2: Fundamentals of Probability, Continued
2.1. Probability Measure.
Definition 5. Consider the measurable space (Ω, F) and a function P that maps F into IR. P is called a probability measure if
(1) P(E) ≥ 0 for any E ∈ F.
(2) P(Ω) = 1.
(3) If E1, E2, . . . ∈ F and Ei ∩ Ej = ∅ whenever i ≠ j, then P(⋃_{n=1}^∞ En) = ∑_{n=1}^∞ P(En).
The following properties follow directly from the above definition.
Property 1. P(∅) = 0.
Proof. Put E1 = Ω and E2 = E3 = · · · = ∅ in (3) and use (2) to get 1 = P(Ω) = P(Ω ∪ ∅ ∪ ∅ ∪ . . .) = P(Ω) + ∑_{n=2}^∞ P(∅) = 1 + ∑_{n=2}^∞ P(∅). Thus, ∑_{n=2}^∞ P(∅) = 0, which implies that P(∅) = 0 since P(∅) ≥ 0 according to (1).
Property 2. If E1, E2, . . . , En ∈ F and Ei ∩ Ej = ∅ whenever i ≠ j, then P(⋃_{i=1}^n Ei) = ∑_{i=1}^n P(Ei).
Proof. Put En+1 = En+2 = · · · = ∅; the result then follows from (3) since P(∅) = 0 (from Property 1).
Property 3. If E1, E2 ∈ F and E1 ⊂ E2, then P(E1) ≤ P(E2).
Proof. Note that E1 ∪ (E2 \ E1) = E2 and E1 ∩ (E2 \ E1) = ∅. Thus, by Property 2 (with n = 2), P(E2) = P(E1) + P(E2 \ E1) ≥ P(E1), since P(E2 \ E1) ≥ 0.
Property 4. If A1 ⊂ A2 ⊂ A3 ⊂ . . . , with each An ∈ F, then

(3) lim_{n→∞} P(An) = P(⋃_{n=1}^∞ An).
Proof. Put B1 = A1, B2 = A2 \ A1, . . . , Bn = An \ An−1, . . .. Then, it is easy to check that ⋃_{n=1}^∞ An = ⋃_{n=1}^∞ Bn and ⋃_{n=1}^m An = ⋃_{n=1}^m Bn for any m ≥ 1, and that Bi ∩ Bj = ∅ whenever i ≠ j. Hence, P(⋃_{i=1}^n Ai) = ∑_{i=1}^n P(Bi) and P(⋃_{i=1}^∞ Ai) = ∑_{i=1}^∞ P(Bi). But ⋃_{i=1}^n Ai = An, so that P(An) = ∑_{i=1}^n P(Bi). Now since ∑_{i=1}^n P(Bi) converges to ∑_{i=1}^∞ P(Bi), we conclude that P(An) converges to ∑_{i=1}^∞ P(Bi), which is equal to P(⋃_{i=1}^∞ Ai).
Property 5. If A1 ⊃ A2 ⊃ A3 ⊃ . . . , with each An ∈ F, then lim_{n→∞} P(An) = P(⋂_{n=1}^∞ An).
Proof. See HW.
Property 6. For any A ∈ F , 0 ≤ P(A) ≤ 1.
The triplet (Ω, F, P) is called a probability space.
Example 9. Recall the dart experiment. We now define P on (Ω, FM) as follows. Assign P({miss}) = 0.5, and for any A ∈ B ∩ D, assign P(A) = area(A)/(2π). It is easy to check that P defines a probability measure. (For example, P(Ω) = P(D ∪ {miss}) = P(D) + P({miss}) = area(D)/(2π) + 0.5 = 0.5 + 0.5 = 1. Check the other requirements as an exercise.)
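A Monte Carlo sketch of this measure, assuming (for illustration) that a hit lands uniformly on D, which makes P(A) = area(A)/(2π); the event E = {x^2 + y^2 ≤ 1/4} from Section 1.3 then has P(E) = (π/4)/(2π) = 1/8:

```python
import random

random.seed(1)

def shoot():
    """Sample from the measure of Example 9: P({miss}) = 0.5 and, given a
    hit, the landing point is uniform on D (an illustrative assumption)."""
    if random.random() < 0.5:
        return "miss"
    while True:  # rejection sampling of a uniform point in the unit disc
        x, y = random.uniform(-1, 1), random.uniform(-1, 1)
        if x * x + y * y <= 1:
            return (x, y)

n = 200_000
count = 0
for _ in range(n):
    w = shoot()
    if w != "miss" and w[0] ** 2 + w[1] ** 2 <= 0.25:
        count += 1
estimate = count / n
# P(E) = area(E)/(2*pi) = (pi/4)/(2*pi) = 1/8 = 0.125
assert abs(estimate - 0.125) < 0.01
```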
2.2. Distributions and distribution functions: Consider a probability space (Ω, F, P), and consider a random variable X defined on it. Up to this point, we have developed a formalism that allows us to ask questions of the form “what is the probability that X assumes a value in a Borel set B?” Symbolically, this is written as P({ω ∈ Ω : X(ω) ∈ B}), or for short, P{X ∈ B}, with the understanding that the set {X ∈ B} is an event (i.e., a member of F). Answering all questions of the above form is tantamount to assigning a number in the interval [0, 1] to every Borel set. Thus, we can think of a mapping from B into [0, 1] whose knowledge will provide an answer to all questions of the form described earlier. We call this mapping the distribution of the random variable X, and it is denoted by µX. Formally, we have µX : B → [0, 1], defined according to the rule µX(B) = P{X ∈ B}, B ∈ B.
Proposition 1. µX defines a probability measure on the measurable space (IR,B).
Proof. See HW.
Distribution Functions: Recall from your undergraduate probability course that we often associate with each random variable a distribution function, defined as FX(x) = P{X ≤ x}. This function can also be obtained from the distribution of X, µX, by evaluating µX at B = (−∞, x], which is a Borel set. That is, FX(x) ≜ µX((−∞, x]). Note that for any x ≤ y, µX((x, y]) = FX(y) − FX(x).
Property 7. FX is nondecreasing.
Proof. For x1 ≤ x2, (−∞, x1] ⊂ (−∞, x2], and so µX((−∞, x1]) ≤ µX((−∞, x2]) since µX is a probability measure (see Property 3 above).
Property 8. lim_{x→∞} FX(x) = 1 and lim_{x→−∞} FX(x) = 0.
Proof. Note that (−∞,∞) = ⋃_{n=1}^∞ (−∞, n], and by Property 4 above, lim_{n→∞} µX((−∞, n]) = µX(⋃_{n=1}^∞ (−∞, n]) = µX((−∞,∞)) = 1 since µX is a probability measure. Thus we have proved that lim_{n→∞} FX(n) = 1. Now the same argument can be repeated if we replace the sequence n by any increasing sequence xn → ∞. Thus, lim_{n→∞} FX(xn) = 1 for any such sequence, and consequently lim_{x→∞} FX(x) = 1.
The proof of the second assertion is left as an exercise.
Property 9. FX is right continuous, that is, lim_{x↓y} FX(x) = FX(y).
Proof. Note that FX(y) = µX((−∞, y]) and (−∞, y] = ⋂_{n=1}^∞ (−∞, y + 1/n]. So, by Property 5 above, lim_{n→∞} µX((−∞, y + 1/n]) = µX(⋂_{n=1}^∞ (−∞, y + 1/n]) = µX((−∞, y]). Thus, we have proved that lim_{n→∞} FX(y + 1/n) = FX(y). In the same fashion, we can generalize the result to obtain lim_{n→∞} FX(y + xn) = FX(y) for any sequence xn ↓ 0. This completes the proof.
Property 10. FX has a left limit at every point, that is, lim_{x↑y} FX(x) exists.
Proof. Note that (−∞, y) = ⋃_{n=1}^∞ (−∞, y − 1/n]. So, by Property 4 above, lim_{n→∞} FX(y − 1/n) = lim_{n→∞} µX((−∞, y − 1/n]) = µX(⋃_{n=1}^∞ (−∞, y − 1/n]) = µX((−∞, y)) ≜ FX(y−). In the same fashion, we can generalize the result to obtain lim_{n→∞} FX(y − xn) = µX((−∞, y)) for any sequence xn ↓ 0.
Property 11. FX has at most countably many discontinuities.
Proof. Let D be the set of discontinuity points of FX. We first show a simple fact: the jumps (or discontinuity intervals) corresponding to distinct discontinuity points are disjoint. More precisely, pick α, β ∈ D, and suppose, without loss of generality, that α < β. Let Iα = (FX(α−), FX(α)] and Iβ = (FX(β−), FX(β)] represent the discontinuities associated with α and β, respectively. Note that FX(α) ≤ FX(β−); this follows from the definition of FX(β−), the fact that α < β, and the fact that FX is nondecreasing. Hence, Iα and Iβ are disjoint. From this we also conclude that the discontinuities (jumps) associated with the points of D form a collection of disjoint intervals. (*)
Next, note that D = ⋃_{n=1}^∞ Dn, where Dn = {x ∈ IR : FX(x) − FX(x−) > 1/n}. In words, Dn is the set of all discontinuity points whose jumps are greater than 1/n. Since the discontinuities corresponding to the points of Dn form a collection of disjoint intervals, the total length of the union of these disjoint intervals cannot exceed 1 (why?). Hence, if we denote the cardinality of Dn by D#n, then we have (1/n) D#n ≤ 1, or D#n ≤ n. Hence, D is countable since it is a countable union of finite sets (this is a fact from elementary set theory).
Note: We could have finished the proof right after (*) by associating the points of
D with a disjoint collection of intervals, each containing a rational number. In turn,
we can associate the points of D with distinct rational numbers, which proves that
D is countable.
Discrete random variables: If the distribution function of a random variable is
piecewise constant, then we say that the random variable is discrete. Note that in
this case the number of discontinuities is at most countably infinite, and the random
variable may assume at most countably many values, say a1, a2, . . .. It is easy to see
that for a discrete random variable X, E[X] = ∑_{i=1}^∞ ai P{X = ai} = ∑_{i=1}^∞ ai (FX(ai) − FX(ai−)).
Absolutely continuous random variables: If there is a Borel function fX : IR → [0,∞), with the property that ∫_{−∞}^∞ fX(t) dt = 1, such that FX(x) = ∫_{−∞}^x fX(t) dt, then we say that X is absolutely continuous. In this case we call fX the probability density function (pdf) of X. Also note that if FX is differentiable, then such a density exists and is given by the derivative of FX.
Example 10. (Problem 3.8 in the textbook) Let X be a uniformly distributed random variable on [0, 2]. Let the function g be as shown in Fig. 1 below. Compute the distribution function of Y = g(X).
Figure 1. The function g(x) and the resulting distribution function FY(y).
Solution: If y < 0, FY(y) = 0 since Y is nonnegative. If 0 ≤ y ≤ 1, FY(y) = P({X ≤ y/2} ∪ {X > 1 − y/2}) = 0.5y + 0.5. Finally, if y > 1, FY(y) = 1 since Y ≤ 1. The graph of FY(y) is shown above. Note that FY(y) is indeed right continuous everywhere, but it is discontinuous at 0.
In this example the random variable X is absolutely continuous (why?). On the other hand, the random variable Y is not absolutely continuous, because there is no Borel function fY such that FY(x) = ∫_{−∞}^x fY(t) dt. Observe that FY has a jump at 0, and we cannot reproduce this jump by integrating a Borel function (we would need a “delta function,” which is not really a function, let alone a Borel function). Y is not a discrete random variable either, since Y can assume any value in the uncountable collection of real numbers [0, 1].
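The exact graph of g in Fig. 1 is not reproduced here, so the simulation below assumes a tent-shaped g that is consistent with the solution (g(x) = 2x on [0, 1/2], 2(1 − x) on (1/2, 1], and 0 for x > 1); under this assumption the empirical distribution function matches FY(y) = 0.5y + 0.5 on [0, 1]:

```python
import random

def g(x):
    """A tent function consistent with the solution above; the exact graph
    in Fig. 1 is not reproduced here, so this g is an assumption."""
    if x <= 0.5:
        return 2 * x
    if x <= 1:
        return 2 * (1 - x)
    return 0.0

random.seed(2)
n = 200_000
ys = [g(random.uniform(0, 2)) for _ in range(n)]  # X uniform on [0, 2]

def F_Y(y):
    """Empirical distribution function of Y = g(X)."""
    return sum(v <= y for v in ys) / n

assert F_Y(-0.1) == 0.0                 # Y is nonnegative
assert abs(F_Y(0.0) - 0.5) < 0.01       # the jump of 0.5 at y = 0
assert abs(F_Y(0.5) - 0.75) < 0.01      # matches 0.5*y + 0.5 on [0, 1]
assert F_Y(1.0) == 1.0                  # Y <= 1
```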
2.3. Expectation. Recall that in an undergraduate probability course one would talk about the expectation, average, or mean of a random variable. This is done by carrying out an integration (in the Riemann sense) with respect to a probability density function. It turns out that the definition of an expectation does not require having a probability density function (pdf); it is based on a more-or-less intuitive notion of an average. We will follow this general approach here and then connect it to the usual expectation with respect to a pdf whenever the pdf exists. We begin by introducing the expectation of a nonnegative random variable, and will generalize thereafter.
Consider a nonnegative random variable X, and for each n ≥ 1 define the sum Sn = ∑_{i=1}^∞ (i/2^n) P{i/2^n < X ≤ (i+1)/2^n}. We claim that Sn is nondecreasing. If this is the case (to be proven shortly), then we know that Sn is either convergent (to a finite number) or Sn ↑ ∞. In either case, we call the limit of Sn the expectation of X, and symbolically we denote it as E[X]. Thus, we define E[X] = lim_{n→∞} Sn. To see the monotonicity of Sn, we follow Chow and Teicher [1] and observe that

{i/2^n < X ≤ (i+1)/2^n} = {2i/2^(n+1) < X ≤ (2i+2)/2^(n+1)}
= {2i/2^(n+1) < X ≤ (2i+1)/2^(n+1)} ∪ {(2i+1)/2^(n+1) < X ≤ (2i+2)/2^(n+1)};

thus,
Sn = ∑_{i=1}^∞ (2i/2^(n+1)) ( P{2i/2^(n+1) < X ≤ (2i+1)/2^(n+1)} + P{(2i+1)/2^(n+1) < X ≤ (2i+2)/2^(n+1)} )
≤ (1/2^(n+1)) P{1/2^(n+1) < X ≤ 2/2^(n+1)}
  + ∑_{i=1}^∞ (2i/2^(n+1)) P{2i/2^(n+1) < X ≤ (2i+1)/2^(n+1)}
  + ∑_{i=1}^∞ ((2i+1)/2^(n+1)) P{(2i+1)/2^(n+1) < X ≤ (2i+2)/2^(n+1)}
= ∑_{i≥1, i odd} (i/2^(n+1)) P{i/2^(n+1) < X ≤ (i+1)/2^(n+1)} + ∑_{i≥2, i even} (i/2^(n+1)) P{i/2^(n+1) < X ≤ (i+1)/2^(n+1)}
= ∑_{i=1}^∞ (i/2^(n+1)) P{i/2^(n+1) < X ≤ (i+1)/2^(n+1)} = S_{n+1}.
If E[X] < ∞ , we say that X is integrable.
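As a sanity check, the sums Sn can be evaluated numerically for a distribution with a known mean. The Exponential(1) choice below is an illustrative assumption, and the infinite sum is truncated at a point beyond which the remaining mass is negligible:

```python
import math

def S(n, cdf, upper=50.0):
    """The sum S_n = sum_i (i/2^n) P{ i/2^n < X <= (i+1)/2^n }, truncated
    at `upper` (enough mass lies below it for the example used here)."""
    total, step = 0.0, 2.0 ** -n
    i = 1
    while i * step <= upper:
        total += (i * step) * (cdf((i + 1) * step) - cdf(i * step))
        i += 1
    return total

exp_cdf = lambda x: 1 - math.exp(-x)    # Exponential(1), so E[X] = 1
values = [S(n, exp_cdf) for n in (1, 2, 4, 8, 12)]
# S_n is nondecreasing and converges to E[X] = 1 from below
assert all(a <= b for a, b in zip(values, values[1:]))
assert values[-1] < 1.0 and abs(values[-1] - 1.0) < 1e-3
```

The approximation error after truncation is at most one grid step 1/2^n (each value of X is rounded down to a grid point), which is why the n = 12 sum is already within 10^-3 of the true mean.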
General Case: If X is any random variable, then we decompose it into two
parts: a positive part and a negative part. More precisely, let X+ = max(0, X)
and X− = max(0,−X), then clearly X = X+ − X−, and both of X+ and X− are
nonnegative. Hence, we use our definition of expectation for a nonnegative random
variable to define E[X] = E[X+] − E[X−] whenever E[X+] < ∞ or E[X−] < ∞, or
both. In cases where E[X+] < ∞ and E[X−] = ∞, or E[X−] < ∞ and E[X+] = ∞, we define E[X] = −∞ and E[X] = ∞, respectively. Finally, E[X] is not defined whenever E[X+] = E[X−] = ∞. Also, E[|X|] = E[X+] + E[X−], so E[X] is finite if and only if E[|X|] < ∞.
Special Case: Consider any event E and let X = I_E, a binary random variable. (I_E(ω) = 1 if ω ∈ E and I_E(ω) = 0 otherwise; it is called the indicator function of the event E.) Then E[X] = P(E).
Proof. Note that in this case

(4) Sn = ((2^n − 1)/2^n) P{(2^n − 1)/2^n < X ≤ 1},

and the second factor is simply P{E}, while the first factor converges to 1. Thus, Sn → P{E}.
When P(E) = 1, we say that X = 1 with probability one, or almost surely. In general, if X is equal to a constant c almost surely, then E[X] = c.
Another Special Case: Consider a random variable that takes only finitely many values, that is, X = ∑_{i=1}^n ai I_{Ai}, where Ai, i = 1, . . . , n, are events. It is straightforward to show (prove it, following the approach in the previous case) that E[X] = ∑_{i=1}^n ai P(Ai).
Notation and Terminology: E[X] is also written as ∫_Ω X(ω) P(dω), which is called the Lebesgue integral of X with respect to the probability measure P. Often, cumbersome notation is avoided by writing ∫_Ω X P(dω) or simply ∫_Ω X dP.
Linearity of Expectation: The expectation is linear, that is, E[aX + bY] = a E[X] + b E[Y]. This can be seen, for example, by observing that any nonnegative random variable can be approximated from below by functions of the form ∑_{i=1}^∞ xi I_{xi < X ≤ xi+1}(ω), where for any event E, the random variable I_E(ω) = 1 if ω ∈ E and I_E(ω) = 0 otherwise. (Recall that I_E is called the indicator function of the set E.) Indeed, we have seen such an approximation through our definition of the expectation. Namely, if we define

(5) Xn(ω) = ∑_{i=1}^∞ (i/2^n) I_{i/2^n < X ≤ (i+1)/2^n}(ω),
then it is easy to check that Xn(ω) → X(ω), as n →∞. In fact, E[X] was defined as
limn→∞ E[Xn], where E[Xn] precisely coincides with Sn described above [recall that
E[IE(ω)] = P(E)]. Now to prove the linearity of expectations, we note that if X and
Y are random variables with defined expectations, then we can approximate them by
Xn and Yn, respectively. Also, Xn + Yn would approximate X + Y . Next, we observe
that for nonnegative X and Y ,
E[Xn] + E[Yn] = ∑_{i=1}^∞ (i/2^n) P{i/2^n < X ≤ (i+1)/2^n} + ∑_{j=1}^∞ (j/2^n) P{j/2^n < Y ≤ (j+1)/2^n}
= ∑_{j=1}^∞ ∑_{i=1}^∞ (i/2^n) P( {i/2^n < X ≤ (i+1)/2^n} ∩ {j/2^n < Y ≤ (j+1)/2^n} )
  + ∑_{j=1}^∞ ∑_{i=1}^∞ (j/2^n) P( {i/2^n < X ≤ (i+1)/2^n} ∩ {j/2^n < Y ≤ (j+1)/2^n} )
= ∑_{j=1}^∞ ∑_{i=1}^∞ ((i+j)/2^n) P( {i/2^n < X ≤ (i+1)/2^n} ∩ {j/2^n < Y ≤ (j+1)/2^n} ).

But ∑_{j=1}^∞ ∑_{i=1}^∞ ((i+j)/2^n) P( {i/2^n < X ≤ (i+1)/2^n} ∩ {j/2^n < Y ≤ (j+1)/2^n} ) = E[Xn + Yn]. (Think about it.)
Thus, we have shown that E[Xn]+E[Yn] = E[Xn+Yn]. Now take limits of both sides
and use the definition of E[X], E[Y ] and E[X + Y ] as the limits of E[Xn], E[Yn], and
E[Xn +Yn], respectively, to conclude that E[X] +E[Y ] = E[X +Y ]. The homogeneity
property E[aX] = aE[X] can be proved similarly.
It is easy to show that if X ≥ 0 then E[X] ≥ 0; in addition, if X ≤ Y almost surely
(this means that PX ≤ Y = 1), then E[X] ≤ E[Y ].
Expectations in the Context of Distributions: Recall that for a nonnegative random variable X, E[X] = lim_{n→∞} ∑_{i=1}^∞ (i/2^n) P{i/2^n < X ≤ (i+1)/2^n}. But we had seen earlier that P{i/2^n < X ≤ (i+1)/2^n} = µX((i/2^n, (i+1)/2^n]). So we can write ∑_{i=1}^∞ (i/2^n) P{i/2^n < X ≤ (i+1)/2^n} as ∑_{i=1}^∞ (i/2^n) µX((i/2^n, (i+1)/2^n]). We denote the limit of the latter by ∫_IR x dµX, which is read as the Lebesgue integral of x with respect to the probability measure (or distribution) µX. In summary, we have E[X] = ∫_Ω X dP = ∫_IR x dµX. Of course we can extend this notion to a general X in the usual way (i.e., splitting X into its positive and negative parts).
A pair of random variables: Suppose that X and Y are rv’s defined on the measurable space (Ω, F). For any B1, B2 ∈ B, we use the convenient notation {X ∈ B1, Y ∈ B2} for the event {X ∈ B1} ∩ {Y ∈ B2}. Moreover, we can also think of the pair X and Y as a vector (X, Y) in IR^2.
Now consider any B ∈ B^2 (the collection of all Borel sets in the plane) of the form B1 × B2, where B1, B2 ∈ B (i.e., rectangles with measurable sides). We note that {(X, Y) ∈ B} is always an event, because {(X, Y) ∈ B} = {X ∈ B1} ∩ {Y ∈ B2}. In fact, {(X, Y) ∈ B} is an event for any B ∈ B^2, not just those that are rectangles. Here is the proof. Let D be the collection of all Borel sets B in B^2 for which {(X, Y) ∈ B} ∈ F. As we have already seen, D contains all rectangles (i.e., Borel sets of the form B1 × B2). Further, it is easy to show that D qualifies to be called a Dynkin class of sets (see below). Also, the collection S of all Borel sets in the plane that are rectangles is closed under finite intersection (this is because (B1 × B2) ∩ (C1 × C2) = (B1 ∩ C1) × (B2 ∩ C2)). To summarize, the collection D is a Dynkin class, and it contains the collection S, which is closed under finite intersection and contains all rectangular Borel sets in the plane. Now by the Dynkin class Theorem (see below), D contains the σ-algebra generated by S, which is just B^2. Hence, {(X, Y) ∈ B} ∈ F for any B ∈ B^2.
Thus, for any Borel set B in the plane, we can define the joint distribution of X and Y as µXY(B) ≜ P{(X, Y) ∈ B}. This can also be generalized in the obvious way to define a joint distribution of multiple (say n) random variables over B^n, the Borel subsets of IR^n.
Now back to the definition of a Dynkin class and the Dynkin class Theorem. A collection D of subsets of any set Ω is called a Dynkin class if (1) Ω ∈ D, (2) E2 \ E1 ∈ D whenever E1 ⊂ E2 and E1, E2 ∈ D, and (3) if E1 ⊂ E2 ⊂ . . . are members of D, then ⋃_{n=1}^∞ En ∈ D. A collection S of subsets of any set Ω is called a π-class if A ∩ B ∈ S whenever A, B ∈ S. The Dynkin Class Theorem (see Chow and Teicher, for example) states that if a Dynkin class D contains a π-class S, then D contains the σ-algebra generated by S. Moreover, if D also happens to be a π-class (in addition to being a Dynkin class), then D is itself a σ-algebra. We will not prove this Theorem here, although its proof is elementary.
Expectation in the context of distribution functions: Note that we can use the definition of a distribution function to write (for a nonnegative rv X) E[X] = lim_{n→∞} ∑_{i=1}^∞ (i/2^n) ( FX((i+1)/2^n) − FX(i/2^n) ). Now, without being too fussy, we can imagine that if FX is differentiable with derivative fX, then the above limit is the Riemann integral ∫_{−∞}^∞ x fX(x) dx, which is the usual formula for the expectation of a random variable that has a probability density function.
Expectation of a function of a random variable. Now that we have E[X], we
can define E[g(X)] in the obvious way, where g is a Borel-measurable transformation
from IR to IR. (Recall that this means that g−1(B) ∈ B for any B ∈ B.) In particular,
E[g(X)] = ∫Ω g(X) dP = ∫IR g(x) dµX. Also, in the event that X has a density function,
we obtain the usual expression ∫∞−∞ g(x)fX(x) dx.
Markov Inequality: If φ is a non-decreasing function on [0,∞) with φ(ε) > 0, then

P{|X| > ε} ≤ (1/φ(ε)) E[φ(|X|)].

In particular, if we take φ(t) = t2 and consider X − X̄ in place
of X, where X̄ ≜ E[X], then we have P{|X − X̄| > ε} ≤ E[(X − X̄)2]/ε2. Note that the
numerator is simply the variance of X.
Also, if we take φ(t) = t, then we have P{|X| > ε} ≤ E[|X|]/ε.
Proof. Since φ is nondecreasing, {|X| > ε} ⊂ {φ(|X|) ≥ φ(ε)}. Further, note that
1 ≥ I{|X|>ε}(ω) for any ω ∈ Ω (since any indicator function is either 0 or 1). Now
note that φ(|X|) ≥ I{|X|>ε}φ(|X|) ≥ I{|X|>ε}φ(ε). Now take expectations of both ends
of the inequalities to obtain
E[φ(|X|)] ≥ φ(ε)E[I{|X|>ε}] = φ(ε)P{|X| > ε}, and the desired result follows. (See the
first special case after the definition of expectation.)
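As a quick numerical illustration (not part of the notes), the Chebyshev form of the bound can be compared with exact tail probabilities for an exponential random variable with unit mean and unit variance; the test points are arbitrary:

```python
import math

def chebyshev_bound(var, eps):
    # P{|X - E[X]| > eps} <= Var(X) / eps^2 (Markov inequality with phi(t) = t^2)
    return var / eps**2

def exact_tail_exponential(eps):
    # For X ~ Exponential(rate 1): E[X] = Var(X) = 1, and
    # P{|X - 1| > eps} = P{X < 1 - eps} + P{X > 1 + eps}.
    lower = max(0.0, 1.0 - eps)
    return (1.0 - math.exp(-lower)) + math.exp(-(1.0 + eps))

for eps in [0.5, 1.0, 2.0, 3.0]:
    # the exact tail never exceeds the Chebyshev bound
    assert exact_tail_exponential(eps) <= chebyshev_bound(1.0, eps)
```

The bound is loose for small ε (it can exceed 1) but is useful precisely because it requires only the first two moments of X.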
Chernoff Bound: This is an application of the Markov inequality and it is a
useful tool in upper-bounding error probabilities in digital communications. Let X be
a non-negative random variable and let θ be nonnegative. Then by taking φ(t) = e^{θt},
we obtain

P{X > x} ≤ e^{−θx} E[e^{θX}].
The right hand side can be minimized over θ > 0 to yield what is called the Chernoff
bound for a random variable X.
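As an illustration (a standard-normal example, not from the notes): for X ~ N(0, 1) we have E[e^{θX}] = e^{θ²/2}, so the bound is e^{−θx+θ²/2}; minimizing over θ > 0 gives θ = x and the Chernoff bound e^{−x²/2}:

```python
import math

def chernoff_gaussian(x):
    # For X ~ N(0,1), minimizing exp(-theta*x + theta^2/2) over theta > 0
    # gives theta = x, hence the bound exp(-x^2/2).
    return math.exp(-x**2 / 2.0)

def gaussian_tail(x):
    # Exact tail probability of the standard normal, via erfc.
    return 0.5 * math.erfc(x / math.sqrt(2.0))

for x in [0.5, 1.0, 2.0, 3.0]:
    assert gaussian_tail(x) <= chernoff_gaussian(x)
```

The optimized bound captures the correct exponential decay rate of the Gaussian tail, which is why Chernoff bounds are popular for error-probability analysis.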
Independence: Consider a probability space (Ω,F , P), and let D1 and D2 be two
sub σ-algebras of F . We say that D1 and D2 are independent if P(A∩B) = P(A)P(B)
for every A ∈ D1 and B ∈ D2.
For example, if X and Y are random variables in (Ω,F), then we say that they
are independent if σ(X) and σ(Y ) are independent. Note that in this case we auto-
matically have µXY = µXµY , FXY (x, y) = FX(x)FY (y), and fXY (x, y) = fX(x)fY (y)
if these densities exist.
3. Elementary Hilbert Space Theory
Most of the material in this section is extracted from the excellent book by W.
Rudin, Real & Complex Analysis [2]. A complex vector space H is called an inner
product space if ∀x, y ∈ H, we have a complex-valued scalar, < x, y >, read as “the inner
product between vectors x and y,” such that the following properties are satisfied:
(1) < x, y >= < y, x >, ∀x, y ∈ H
(2) < x + y, z >=< x, z > + < y, z >, ∀x, y, z ∈ H
(3) < αx, y >= α < x, y >, ∀x, y ∈ H, α ∈ |C
(4) < x, x >≥ 0,∀x ∈ H and < x, x >= 0 only if x = 0
Note that (3) and (1) together imply < x, αy >= α < x, y >. Also, (3) and (4)
together imply < x, x >= 0 if and only if x = 0. It would be convenient to write
< x, x > as ‖x‖2, which we later use to introduce a norm on H.
3.1. Schwarz Inequality. It follows from properties (1)-(4) that | < x, y > | ≤ ‖x‖ ‖y‖, ∀x, y ∈ H.

Proof: Put A = ‖x‖2, C = ‖y‖2 and B = | < x, y > |. If B = 0 there is nothing to prove, so assume B > 0 (and hence C > 0), and choose α ∈ |C with |α| = 1 so that α < y, x > = | < x, y > |. By (4), < x − αry, x − αry > ≥ 0 for all r ∈ IR, which can be further written as (using (1)-(3)):

(6) ‖x‖2 − rᾱ < x, y > − rα < y, x > + r2‖y‖2 ≥ 0.

With the above choice of α, each of the two middle terms equals −r| < x, y > |, so (6) can be cast as
‖x‖2 − 2r| < x, y > | + r2‖y‖2 ≥ 0, or A − 2rB + r2C ≥ 0, which is true for
any r ∈ IR. Let r1,2 = (2B ± √(4B2 − 4AC))/(2C) = (B ± √(B2 − AC))/C denote the roots of the equation
A − 2rB + r2C = 0. Since A − 2rB + r2C ≥ 0 for every r, the roots cannot be real and distinct (otherwise the quadratic would be negative between them), so it must be true that B2 − AC ≤ 0, which implies | < x, y > |2 ≤ ‖x‖2‖y‖2, or | < x, y > | ≤ ‖x‖ ‖y‖.

Note that | < x, y > | = ‖x‖‖y‖ whenever x = βy, β ∈ |C. Conversely, if | < x, y > | = ‖x‖‖y‖ and y ≠ 0, then with β = (‖x‖/‖y‖)(| < x, y > |/ < y, x >) it is easy to verify that < x − βy, x − βy > = 0, which
implies (using (4)) x − βy = 0. Thus, | < x, y > | = ‖x‖‖y‖ if and only if x is
proportional to y.
3.2. Triangle Inequality. ‖x + y‖ ≤ ‖x‖ + ‖y‖, ∀x, y ∈ H.

Proof. This follows from the Schwarz inequality. Expand ‖x + y‖2 to
obtain ‖x + y‖2 = ‖x‖2 + ‖y‖2 + < x, y > + < y, x > ≤ ‖x‖2 + ‖y‖2 + 2| < x, y > | (recall that < y, x > is the conjugate of < x, y >). Now by the
Schwarz inequality, ‖x‖2 + ‖y‖2 + 2| < x, y > | ≤ ‖x‖2 + ‖y‖2 + 2‖x‖‖y‖ = (‖x‖ + ‖y‖)2,
from which the desired result follows. Note that ‖x + y‖2 = ‖x‖2 + ‖y‖2 if < x, y > = 0
(why?).

This is a generalization of the customary triangle inequality for complex numbers,
which states that |x − z| ≤ |x − y| + |y − z|, x, y, z ∈ |C.
3.3. Norm. We say ‖ · ‖ is a norm on H if:

(1) ‖x‖ ≥ 0.

(2) ‖x‖ = 0 only if x = 0.

(3) ‖x + y‖ ≤ ‖x‖ + ‖y‖.

(4) ‖αx‖ = |α| ‖x‖, α ∈ |C.

With the triangle inequality at hand, we can define ‖ · ‖ on members of H as
follows: ‖x‖ ≜ < x, x >1/2. You should check that this actually defines a norm. This
yields a “yardstick for distance” between two vectors x, y ∈ H, defined as ‖x − y‖. We can now say that H is a normed space.
3.4. Convergence. We can then talk about convergence: a sequence xn ∈ H is said
to be convergent to x, written as xn → x, or limn→∞ xn = x, if for every ε > 0, there
exists N ∈ IN such that ‖xn − x‖ < ε whenever n > N .
3.5. Completeness. An inner-product space H is complete if any Cauchy sequence
in H converges to a point in H. A sequence {yn}∞n=1 in H is called a Cauchy sequence if for
any ε > 0, ∃ N ∈ IN such that ‖yn − ym‖ ≤ ε for all n, m > N .
Now, if H is complete, then H is called a Hilbert space.
Fact: If H is complete, then it is closed. This is because any convergent sequence is
automatically a Cauchy sequence.
3.6. Convex Sets. A set E in a vector space V is said to be a convex set if for any
x, y ∈ E and t ∈ (0, 1), the point zt = tx + (1 − t)y belongs to E. In other words,
the line segment between x and y lies in E. Note that if E is a convex set, then the
translation of E, E + x ≜ {y + x : y ∈ E}, is also a convex set.
3.7. Parallelogram Law. For any x and y in an inner-product space, ‖x + y‖2 +
‖x−y‖2 = 2‖x‖2 +2‖y‖2. This can be simply verified using the properties of an inner
product. See the schematic in Figure 2.

Figure 2. The parallelogram with sides x and y and diagonals x + y and x − y.
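As a quick sanity check (not part of the notes), the parallelogram law can be verified numerically in |Cn with the standard inner product; the particular vectors below are arbitrary choices:

```python
# Numerical check of the parallelogram law in |C^n with the standard inner
# product <x, y> = sum_i x_i * conj(y_i), so ||v||^2 = sum_i |v_i|^2.
def norm_sq(v):
    return sum(abs(c) ** 2 for c in v)

x = [1 + 2j, -0.5j, 3.0]
y = [2 - 1j, 4.0, -1 + 1j]

s = [a + b for a, b in zip(x, y)]   # x + y
d = [a - b for a, b in zip(x, y)]   # x - y

# ||x + y||^2 + ||x - y||^2 = 2||x||^2 + 2||y||^2
assert abs((norm_sq(s) + norm_sq(d)) - (2 * norm_sq(x) + 2 * norm_sq(y))) < 1e-12
```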
3.8. Orthogonality. If < x, y >= 0, then we say that x and y are orthogonal. We
write x ⊥ y. Note that the relation ⊥ is a symmetric relation; that is, x ⊥ y ⇔ y ⊥ x.
Pick a vector x ∈ H, and consider the set of all vectors in H that are orthogonal to x. We write
it as x⊥ = {y ∈ H : < x, y > = 0}; x⊥ is a closed subspace. To see that it is a
subspace, we must show that x⊥ is closed under addition and scalar multiplication;
these are immediate from the definition of an inner product (see Assignment 3). To
see that x⊥ is closed, we must show that the limit of every convergent sequence in
x⊥ is also in x⊥. Let xn be a sequence in x⊥, and assume that xn → x0. We need to
show that x0 ∈ x⊥. To see this, note that | < x0, x > | = | < x0 − xn + xn, x > | =
| < x0 − xn, x > + < xn, x > | = | < x0 − xn, x > | ≤ ‖x0 − xn‖‖x‖, by the Schwarz
inequality (the term < xn, x > vanishes since xn ∈ x⊥). But the term on the right converges to zero, so | < x0, x > | = 0, which
implies x0 ∈ x⊥.
Let M ⊂ H be a subspace of H. We define M⊥ ≜ ⋂x∈M x⊥. It is easy
to see that M⊥ is actually a subspace of H and that M⊥ ∩ M = {0}. It is also true
that M⊥ is closed, which is simply because it is an intersection of closed sets (recall
that x⊥ is closed).
Theorem 2. Any non-empty, closed and convex set E in a Hilbert space H contains
a unique element with smallest norm. That is, there exists xo ∈ E such that ‖xo‖ ≤
‖x‖ for all x ∈ E.
Proof (see also Fig. 3): Existence: Let δ = inf{‖x‖ : x ∈ E}.
In the parallelogram law, replace x by x/2 and y by y/2 to obtain

‖x − y‖2 = 2‖x‖2 + 2‖y‖2 − 4‖(x + y)/2‖2.

Since (x + y)/2 ∈ E (because E is convex), ‖(x + y)/2‖2 ≥ δ2. Hence,

(7) ‖x − y‖2 ≤ 2‖x‖2 + 2‖y‖2 − 4δ2.

By definition of δ, there exists a sequence {yn}∞n=1 in E such that limn→∞ ‖yn‖ = δ.
Now replace x by yn and y by ym in (7) to obtain

(8) ‖yn − ym‖2 ≤ 2‖yn‖2 + 2‖ym‖2 − 4δ2.

Hence,

lim n,m→∞ ‖yn − ym‖2 ≤ 2 lim n→∞ ‖yn‖2 + 2 lim m→∞ ‖ym‖2 − 4δ2 = 2δ2 + 2δ2 − 4δ2 = 0,

and we conclude that {yn}∞n=1 is a Cauchy sequence in H. Since H is complete, there
∃ xo ∈ H such that yn → xo; moreover, xo ∈ E since E is closed. We next show that ‖xo‖ = δ, which would complete the
existence of a minimal-norm member in E. Note that by the triangle inequality,

| ‖xo‖ − ‖yn‖ | ≤ ‖xo − yn‖.

Now take limits of both sides to conclude that limn→∞ ‖yn‖ = ‖xo‖. But we already
know that limn→∞ ‖yn‖ = δ, which implies that ‖xo‖ = δ.

Uniqueness: Suppose that x ≠ x′, x, x′ ∈ E, and ‖x‖ = ‖x′‖ = δ. Replace y in
(7) by x′ to obtain

‖x − x′‖2 ≤ 2δ2 + 2δ2 − 4δ2 = 0,

which implies that x = x′. Hence the minimal-norm element in E is unique. 2
Example 11. For any fixed n, the set |Cn of all n-tuples of complex numbers, x =
(x1, x2, . . . , xn), xi ∈ |C, is a Hilbert space, where with y = (y1, y2, . . . , yn) we define
< x, y > ≜ ∑n i=1 xi ȳi, where the bar denotes complex conjugation.
Figure 3. Elevating M by x: the translate x + M of a subspace M of H = IR2; Qx is the element of x + M at the shortest distance from the origin.

Figure 4. The decomposition x = Px + Qx in H = IR2, with Px ∈ M and Qx ∈ M⊥.
Example 12. L2[a, b] = {f : ∫b a |f(x)|2 dx < ∞} is a Hilbert space, with
< f, g > ≜ ∫b a f(x)ḡ(x) dx and ‖f‖ = √(< f, f >) = [∫b a |f|2 dx]1/2 ≜ ‖f‖2. Actually, we
need the Schwarz inequality for integrals before we can see that we have an inner
product in this case. More on this to follow.
Theorem 3. Projection Theorem. If M ⊂ H is a closed subspace of a Hilbert space
H, then for every x ∈ H there exists a unique decomposition x = Px + Qx, where
Px ∈ M , and Qx ∈ M⊥. Moreover,
(1) ‖x‖2 = ‖Px‖2 + ‖Qx‖2.
(2) If we think of Px and Qx as mappings from H to M and M⊥, respectively,
then the mappings P and Q are linear.
(3) Px is the nearest point in M to x, and Qx is the nearest point in M⊥ to
x. Px and Qx are called the orthogonal projections of x into M and M⊥,
respectively.
Proof.
Existence: Consider x + M; we claim that x + M is closed. Recall that M is closed,
which means that if yn ∈ M and yn → yo, then yo ∈ M. Pick a convergent sequence
in x + M, call it zn, with limit zo. Now zn = x + yn for some yn ∈ M. Since zn is convergent, so is
yn = zn − x, and the limit yo of yn is in M, so zo = x + yo ∈ x + M.

We next show that x + M is convex. Pick x1 and x2 ∈ x + M. We need to show
that for any 0 ≤ α ≤ 1, αx1 + (1 − α)x2 ∈ x + M. But x1 = x + y1, y1 ∈ M, and
x2 = x + y2, y2 ∈ M. So αx1 + (1 − α)x2 = x + αy1 + (1 − α)y2 ∈ x + M since
αy1 + (1 − α)y2 ∈ M.

By the minimal-norm theorem, there exists a member of x + M of smallest norm.
Call it Qx, and let Px = x − Qx. Note that Px ∈ M (indeed Qx = x + m for some m ∈ M, so Px = −m). We need to show that Qx ∈ M⊥,
namely, < Qx, y > = 0, ∀ y ∈ M. Write z = Qx, so that ‖z‖ ≤ ‖w‖ ∀ w ∈ x + M. Fix
y ∈ M with ‖y‖ = 1 and α ∈ |C, and note that z − αy ∈ x + M. Then ‖z‖2 ≤ ‖z − αy‖2 = < z − αy, z − αy > = ‖z‖2 − ᾱ < z, y > − α < y, z > + |α|2, so 0 ≤
−ᾱ < z, y > − α < y, z > + |α|2. Pick α = < z, y >. We obtain 0 ≤ −| < z, y > |2.
This can hold only if < z, y > = 0, i.e., z is orthogonal to every unit vector y ∈ M, and hence to every y ∈ M; therefore,
Qx ∈ M⊥.
Uniqueness: Suppose that x = Px + Qx = (Px)′ + (Qx)′, where Px, (Px)′ ∈ M
and Qx, (Qx)′ ∈ M⊥. Then, Px− (Px)′ = (Qx)′−Qx, where the left side belongs to
M while the right side belongs to M⊥. Hence, each side can only be the zero vector
(why?), and we conclude that Px = (Px)′ and (Qx)′ = Qx.
Minimum Distance Properties: To show that Px is the nearest point in M to x,
pick any y ∈ M and observe that ‖x − y‖2 = ‖Px + Qx − y‖2 = ‖Qx + (Px − y)‖2 =
‖Qx‖2 + ‖Px − y‖2 (since Px − y ∈ M is orthogonal to Qx). The right-hand side is minimized when
‖Px − y‖ = 0, which happens if and only if y = Px. The fact that Qx is the nearest point
in M⊥ to x can be shown similarly.
Linearity: Take x, y ∈ H. Then, we have x = Px + Qx and y = Py + Qy. Now
ax+by = aPx+aQx+bPy+bQy. On the other hand, ax+by = P (ax+by)+Q(ax+by).
Thus, we can write P (ax+ by)−aPx− bPy = −Q(ax+ by)+aQx+ bQy and observe
that the left side is in M while the right side is in M⊥. Hence, each side can only be
the zero vector. Therefore, P (ax + by) = aPx + bPy and Q(ax + by) = aQx + bQy.
2
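The decomposition x = Px + Qx can be made concrete in the familiar Hilbert space IR3. The subspace M (a line through the origin) and the vector x below are arbitrary illustrative choices, not taken from the notes:

```python
# Projection Theorem sketch in IR^3: M is the line spanned by u, and
# Px = (<x,u>/||u||^2) u is the orthogonal projection of x onto M.
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

u = (1.0, 1.0, 1.0)            # spans the closed subspace M
x = (3.0, 1.0, 2.0)            # arbitrary vector to decompose

c = dot(x, u) / dot(u, u)
Px = tuple(c * ui for ui in u)                         # component in M
Qx = tuple(xi - pi for xi, pi in zip(x, Px))           # component in M-perp

# Qx is orthogonal to M, and ||x||^2 = ||Px||^2 + ||Qx||^2.
assert abs(dot(Qx, u)) < 1e-12
assert abs(dot(x, x) - (dot(Px, Px) + dot(Qx, Qx))) < 1e-12

# Px is the nearest point of M to x: compare against other points c'*u.
d0 = dot(Qx, Qx)
for cp in [c - 0.5, c - 0.1, c + 0.1, c + 0.5]:
    y = tuple(cp * ui for ui in u)
    assert sum((xi - yi) ** 2 for xi, yi in zip(x, y)) >= d0
```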
4. Conditional expectations for L2 random variables
4.1. Hölder inequality for expectations. Consider the probability space (Ω,F , P),
and let X and Y be random variables. The next result is called the Hölder inequality for
expectations.

Theorem 4. If p > 1, q > 1 with p−1 + q−1 = 1, then E[|XY |] ≤ E[|X|p]1/p E[|Y |q]1/q.
Proof: (See for example [1] p. 105).
Note that if E[|X|p]1/p = 0, then X = 0 a.s., and hence E[|XY |] = 0 and the inequality holds.
If E[|X|p]1/p > 0 and E[|Y |q]1/q = ∞, then the inequality also holds trivially. Next, consider the
case when 0 < E[|X|p]1/p < ∞ and 0 < E[|Y |q]1/q < ∞. Let U = |X|/E[|X|p]1/p and
V = |Y |/E[|Y |q]1/q, and note that E[U p] = E[V q] = 1. Using the concavity of the
logarithm function, it is easy to see that for any a, b > 0,

log( ap/p + bq/q ) ≥ (1/p) log ap + (1/q) log bq,

and the last term is simply log ab. From this it follows that

ab ≤ ap/p + bq/q.
Thus,

E[UV ] ≤ (1/p)E[U p] + (1/q)E[V q] = 1/p + 1/q = 1,

from which the desired result follows. 2
When p = q = 2, we have E[|XY |] ≤ E[X2]1/2E[Y 2]1/2, and this result is called the
Schwarz inequality for expectations.
Let H be the collection of square-integrable random variables. Now E[(aX+bY )2] =
a2E[X2]+b2E[Y 2]+2abE[XY ] ≤ a2E[X2]+b2E[Y 2]+2abE[X2]1/2E[Y 2]1/2 < ∞, where
we have used Schwarz inequality for expectations in the last step. Hence, H is a
vector space. Next, Schwarz inequality for expectations also tells us that if X and
Y are square integrable, then E[XY ] is defined (i.e., finite). We can therefore define
< X, Y > ≜ E[XY ] and see that it defines an inner product. (Technically we need
to consider the equivalence class of random variables that are equal a.s.; otherwise,
we may not have an inner product. Can you see why?) We can also define the norm
‖X‖2 ≜ < X, X >1/2. This is called the L2-norm associated with (Ω,F , P). Hence,
This collection is often written as L2(Ω,F , P). It can be shown that L2(Ω,F , P) is
complete (see [2], pp. 67). Hence, L2(Ω,F , P) is a Hilbert space. In fact, L2(Ω,D, P)
is a Hilbert space for any σ-algebra D.
4.2. The L2 conditional expectation. Let X, Y ∈ L2(Ω,F , P). Clearly, L2(Ω, σ(X), P)
is a closed subspace of L2(Ω,F , P) (why? See Problem 5 in Assignment 3). We
can now apply the projection theorem to Y , with L2(Ω, σ(X), P) being our closed
subspace, and obtain the decomposition Y = PY + QY , where PY ∈ L2(Ω, σ(X), P)
and QY ∈ L2(Ω, σ(X), P)⊥. We also have the property that ‖Y − PY ‖2 ≤ ‖Y − Y ′‖2
∀ Y ′ ∈ L2(Ω, σ(X), P). We call PY the conditional expectation of Y given σ(X),
which we will write from this point on as E[Y |σ(X)].
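To make the projection picture concrete, here is a small finite sketch (the sample space, X, and Y below are hypothetical choices): on a finite probability space, E[Y |σ(X)] is simply the average of Y over each level set of X, and one can check both the defining property and the minimum-distance property directly:

```python
# Finite illustration: Omega has six equally likely outcomes, X = omega mod 2,
# and E[Y | sigma(X)] averages Y over each level set of X.
omega = [0, 1, 2, 3, 4, 5]          # each with probability 1/6
X = {w: w % 2 for w in omega}       # X takes values 0 and 1
Y = {w: float(w) for w in omega}

def cond_exp(w):
    level = [v for v in omega if X[v] == X[w]]
    return sum(Y[v] for v in level) / len(level)

Z = {w: cond_exp(w) for w in omega}  # Z = E[Y | sigma(X)]

# Defining property: E[ZW] = E[YW] for every sigma(X)-measurable W = h(X).
for h in [lambda t: 1.0, lambda t: t, lambda t: t * t - 3.0]:
    lhs = sum(Z[w] * h(X[w]) for w in omega) / 6.0
    rhs = sum(Y[w] * h(X[w]) for w in omega) / 6.0
    assert abs(lhs - rhs) < 1e-12

# Projection property: Z minimizes E[(Y - g(X))^2] over functions g of X.
def mse(g0, g1):
    return sum((Y[w] - (g0 if X[w] == 0 else g1)) ** 2 for w in omega) / 6.0

best = mse(Z[0], Z[1])
for g0 in [1.0, 2.0, 3.0]:
    for g1 in [2.0, 3.0, 4.0]:
        assert mse(g0, g1) >= best - 1e-12
```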
Exercise 1. (1) Show that if X is a D-measurable r.v., then so is aX, a ∈ IR. (2)
Show that if X and Y are D-measurable, so is X + Y . [See Homework 3]
The above exercise shows that the collection of square-integrable D-measurable
random variables is indeed a vector space.
A function h : IR → IR is said to be a Borel function if h−1(B) ∈ B for every B ∈ B.
Theorem 5. Consider (Ω,F , P), and let X be a random variable. A random variable
Z is a σ(X)-measurable random variable if and only if Z = h(X) for some Borel
function h. 2 (The proof is based on a corollary to the Dynkin class Theorem. We
omit the proof, which can be found in [1].)
Since E[X|σ(Y )] is σ(Y )-measurable by definition, we can write it explicitly as a
Borel function of Y . For this reason, we often write E[X|Y ] in place of E[X|σ(Y )].
4.3. Properties of Conditional Expectation. Our definition of conditional ex-
pectation lends itself to many powerful properties that are useful in practice. One of
the properties also leads to an equivalent definition of the conditional expectation,
which is actually the way it is commonly defined (see the comments after Property 4.3.3).
However, proving the existence of the conditional expectation according to the alter-
native definition requires knowledge of a major theorem in measure theory called the
Radon-Nikodym Theorem, which is beyond the scope of this course. That is why in
this course we took a path (for defining the conditional expectation) that is based on
the projection theorem, which is an important theorem in signal processing for other
reasons as well. Here are some key properties of conditional expectations.
4.3.1. Property 1. For any constant a, E[a|D] = a. This follows trivially from the
observation that Pa = a (why?).
4.3.2. Property 2 (Linearity). For any constants a, b, E[aX + bY |D] = aE[X|D] +
bE[Y |D]. This follows trivially from the linearity of the projection operator P (see
the Projection Theorem).
4.3.3. Property 3. Let Z = E[X|D], then E[XY ] = E[ZY ], for all Y ∈ L2(Ω,D, P).
Interpretation: Z contains all the information that X contains that is relevant to any
D-measurable random variable Y .
Proof: Note that E[XY ] = E[(PX +QX)Y ] = E[(PX)Y ]+E[(QX)Y ] = E[(PX)Y ] =
E[ZY ]. The last equality follows from the definition of Z.
Conversely, if a random variable Z ∈ L2(Ω,D, P) has the property that E[ZY ] =
E[XY ], ∀Y ∈ L2(Ω,D, P), then Z = E[X|D]. To see this, we need to show that
Z = PX. Note that by assumption E[Y (X − Z)] = 0 for any Y ∈ L2(Ω,D, P).
Therefore, E[Y (PX + QX − Z)] = 0, or E[Y PX] + E[Y QX] − E[Y Z] = 0, or
E[Y (Z − PX)] = 0, ∀ Y ∈ L2(Ω,D, P) (since E[Y QX] = 0). In particular, if we
take Y = Z − PX we will conclude that E[(Z − PX)2] = 0, which implies Z = PX
almost surely.
Thus, we arrive at an alternative definition for the L2 conditional expectation.
Definition 6. We define Z ≜ E[X|D] if (1) Z ∈ L2(Ω,D, P) and (2) E[ZY ] = E[XY ]
∀ Y ∈ L2(Ω,D, P).
We will use this new definition frequently in the remainder of this chapter.
4.3.4. Property 4 (Smoothing Property). For any X, E[X] = E[E[X|D]]. To see this,
note that if Z = E[X|D], then E[ZY ] = E[XY ] for all Y ∈ L2(Ω,D, P). Now take
Y = 1 to conclude that E[X] = E[Z].
4.3.5. Property 5. If Y ∈ L2(Ω,D, P), then E[XY |D] = Y E[X|D]. To show this,
we check the second definition for conditional expectations. Note that Y E[X|D] ∈
L2(Ω,D, P), so all we need to show is that E[(XY )W ] = E[(Y E[X|D])W ] for
any W ∈ L2(Ω,D, P). But we already know from the definition of E[X|D] that
E[(W Y )E[X|D]] = E[(W Y )X], since W Y ∈ L2(Ω,D, P).
As a special case, we have E[XY ] = E[E[XY |Y ]] = E[Y E[X|Y ]].
4.3.6. Property 6. For any Borel function g : IR → IR, E[g(Y )|D] = g(Y ) whenever
Y ∈ L2(Ω,D, P). (For example, we have E[g(Y )|Y ] = g(Y ).) To prove this, all we
need to do is to observe that g(Y ) ∈ L2(Ω,D, P) and then apply Property 4.3.5 with
X = 1.
4.3.7. Property 7 (Iterated Conditioning). Let D1 ⊂ D2 ⊂ F . Then, E[X|D1] =
E[E[X|D2]|D1].
Proof: Let Z = E[E[X|D2] | D1]. First note that Z ∈ L2(Ω,D1, P), so we only need to
show that E[ZY ] = E[XY ], ∀ Y ∈ L2(Ω,D1, P). Now E[ZY ] = E[Y E[E[X|D2]|D1]],
and we already know from the definition of E[E[X|D2]|D1] that E[Y ′ E[E[X|D2]|D1]] =
E[Y ′ E[X|D2]] for any Y ′ ∈ L2(Ω,D1, P). But by Property 4.3.5, Y ′ E[E[X|D2]|D1] =
E[Y ′ E[X|D2] | D1]. Now since Y ′ ∈ L2(Ω,D1, P) implies Y ′ ∈ L2(Ω,D2, P), we
use Property 4.3.5 once again to obtain Y ′ E[X|D2] = E[Y ′X|D2]. Thus, we have
E[Y E[E[X|D2]|D1]] = E[E[E[Y X|D2]|D1]] for all Y ∈ L2(Ω,D1, P). But by Prop-
erty 4.3.4, E[E[E[Y X|D2]|D1]] = E[E[Y X|D2]] = E[Y X], which completes the
proof.
Also note that E[X|D1] = E[E[X|D1]|D2]. This is much easier to prove. (Prove it.)
The next property is very important in applications.
4.3.8. Property 8. If X and Y are independent r.v.’s, and if g : IR2 → IR is Borel
measurable, then E[g(X, Y )|σ(Y )] = h(Y ), where for any scalar t, h(t) ≜ E[g(X, t)].
As a consequence, E[g(X, Y )] = E[h(Y )].

Proof: We prove this in the case g(x, y) = g1(x)g2(y). (The general result follows
through an application of a corollary to the Dynkin class theorem.) First note that
E[g1(X)g2(Y )|σ(Y )] = g2(Y )E[g1(X)|σ(Y )]. Also note that h(t) = g2(t)E[g1(X)], so
h(Y ) = g2(Y )E[g1(X)]. We need to show that h(Y ) = E[g1(X)g2(Y ) | σ(Y )]. To do
so, we need to show that E[h(Y )Z] = E[g1(X)g2(Y )Z] for every Z ∈ L2(Ω, σ(Y ), P).
But E[h(Y )Z] = E[g1(X)]E[g2(Y )Z], and since g2(Y )Z is σ(Y )-measurable, the independence of X and Y implies that the latter
is equal to E[g1(X)g2(Y )Z].
The stated consequence follows from Property 4.3.4. Also, the above result can be
easily generalized to random vectors.
4.4. Some applications of Conditional Expectations.
Example 13 (Photon counting).
It is known that the energy of a photon is hν, where h is Planck’s constant and ν is the
optical frequency (color) of the photon in Hertz. We are interested in detecting and
Figure 5. Photon counting: incident radiation produces N photons at the detector; each detected photon generates an electric pulse, and the counter output is M.
counting the number of photons in a certain time interval. We assume that we have a
detector, and upon the detection of the ith photon, it generates an instantaneous pulse
(a delta function) whose area is a random non-negative integer, Gi. We then integrate
the sum of all delta functions (i.e., responses) in a given interval and call that the
photon count M . We will also assume that the Gi’s are mutually independent, and
they are also independent of the number of photons N that impinge on the detector
in the given time interval. Mathematically, we can write

M = ∑N i=1 Gi.

Let us try to calculate E[M ], the average number of detected photons in a given
time interval. Assume that we know P{N = k} (k = 0, 1, 2, 3 . . .) and that P{Gi = k}
is also known (k = 0, 1, 2, 3 . . .).
By using Property 4.3.4, we know that E[M ] = E[E[M |N ]]. To find E[M |N ], we
must appeal to Property 4.3.8. We think of the random sequence {Gi}∞i=1 as X, and
the random variable N as Y . With these, the function h becomes

h(t) = E[∑t i=1 Gi] = ∑t i=1 E[Gi].

Let us assume further that the Gi’s are identically distributed, so that E[Gi] ≡
E[G1]. With this, h(t) = tE[G1] and h(N ) = E[G1]N . Therefore, E[M ] = E[h(N )] =
E[N ]E[G1].
What is the variance of M? Let us first calculate E[M2]. By Property 4.3.4,
E[M2] = E[E[M2|N ]], and by Property 4.3.8, E[M2|N ] is equal to h2(N ), where

h2(t) ≜ E[(∑t i=1 Gi)2] = E[∑t i=1 ∑t j=1 GiGj] = E[∑ i≠j GiGj] + E[∑t i=1 G2i] = (t2 − t)E[G1]2 + t(E[G1]2 + σ2G),

and σ2G ≜ E[G21] − E[G1]2 is the variance of G1, which is common to all the other Gi’s
since they are identically distributed.

Next, E[M2] = E[h2(N )] = E[N2 − N ]E[G1]2 + (E[G1]2 + σ2G)E[N ] = E[G1]2(σ2N + N̄2 −
N̄ ) + (E[G1]2 + σ2G)N̄ , where N̄ ≜ E[N ] and σ2N ≜ E[N2] − N̄2 is the variance of N .

Finally, if we assume that N is a Poisson random variable (as in coherent light), then
σ2N = N̄ and E[M2] = E[G1]2N̄2 + (E[G1]2 + σ2G)N̄ , so the variance of M becomes

σ2M = E[M2] − (E[G1]N̄ )2 = (E[G1]2 + σ2G)N̄ = E[G21]N̄ .
Exercise: For an integer-valued random variable X, we define the generating function
of X as ψX(s) ≜ E[sX ], s ∈ |C, |s| ≤ 1. We can think of the generating function as the
z-transform of the probability mass function associated with X.

For the photon-counting example described above, show that ψM (s) = ψN (ψG(s)).
Example 14 (First Occurrence of Successive Hits).
Consider the problem of flipping a coin successively. Suppose P{H} = p and
P{T} = q = 1 − p. In this case, we define Ω ≜ ×∞i=1 Ωi, where for each i, Ωi = {H, T}.
This means that members of Ω are simply of the form ω = (ω1, ω2, . . .), where ωi ∈ Ωi.
For each Ωi, we define Fi = {∅, {H}, {T}, {H, T}}. We take F associated with Ω as
the minimal σ-algebra containing all cylinders in Ω. An event E in ×∞i=1 Ωi is called
a cylinder if E is of the form {(ω1, ω2, . . .) ∈ Ω : ωi1 ∈ B1, . . . , ωik ∈ Bk}, k finite, Bj ∈ Fij.
In words, in a cylinder, only finitely many of the coordinates are specified. (For
example, think of a vertical cylinder in IR3, where we specify only two coordinates
out of three; now extend this to IR∞.) Let Xi = I{ωi=H} be a {0, 1}-valued random variable,
and define X on Ω as X ≜ (X1, X2, . . .). Finally, we define P on the cylinders of Ω by
forming the product of the probabilities of the coordinates (enforcing independence),
where for each (Ωi,Fi) we define Pi{H} = p and Pi{T} = q = 1 − p.
We would like to define a random variable that tells us when a run of n successive
heads appears for the first time. More precisely, define the random variable T1 as a
function of X as follows: T1 ≜ min{i ≥ 1 : Xi = 1}. For example, if
X = (0, 0, 1, 0, 1, 1, 0, 1, 1, 1, . . .), then T1 = 3. More generally, we define Tk = min{i :
Xi−k+1 = Xi−k+2 = . . . = Xi = 1}. For each k, Tk is a r.v. on Ω. For example, if
X = (0, 0, 1, 0, 1, 1, 0, 1, 1, 1, . . .), then T2 = 6, and T3 = 10.
Let yk = E[Tk]. In what follows we will characterize yk.

Special case: k = 1

It is easy to see that P{T1 = 1} = p, P{T1 = 2} = qp, P{T1 = 3} = q2p, . . . ,
P{T1 = n} = q^{n−1}p.
Recall from undergraduate probability that the above probability law is called the geo-
metric law, and T1 is called a geometric random variable. Also, y1 = ∑∞i=1 ipq^{i−1} = 1/p.
General case: k > 1

We begin by observing that Tk is actually an explicit function, f , say, of Tk−1, XTk−1+1, XTk−1+2, . . ..
Note, however, that Tk−1 and XTk−1+1, XTk−1+2, . . . are independent. Therefore,
by Property 4.3.8, E[Tk] = E[f(Tk−1, XTk−1+1, XTk−1+2, . . .)] = E[h(Tk−1)], where
h(t) = E[f(t, Xt+1, Xt+2, . . .)].

Now it is easy to check that E[f(t, Xt+1, . . .)|Xt+1] = (t + 1)I{Xt+1=1} + (t + 1 +
yk)I{Xt+1=0}. This essentially says that if it took us t flips to see k − 1 consecutive
heads for the first time, then if the (t + 1)st flip is a head, we have achieved
k successive heads at time t + 1; alternatively, if the (t + 1)st flip is a tail, then
we have to start all over again (start afresh) while we have already wasted t + 1 units
of time.

Now, E[f(t, Xt+1, . . .)] = E[E[f(t, Xt+1, . . .) | Xt+1]] = E[(t + 1)I{Xt+1=1} + (t +
1 + yk)I{Xt+1=0}] = (t + 1)p + (t + 1 + yk)q. Thus, h(t) = (t + 1)p + (t + 1 + yk)q and
yk = E[h(Tk−1)] = E[(Tk−1 + 1)p + (Tk−1 + 1 + yk)q] = p + pyk−1 + qyk−1 + qyk + q.
Finally, we obtain

(9) yk = p−1yk−1 + p−1.

We now invoke the initial condition y1 = 1/p, which completes the characterization of
yk. For example, if p = 1/2, then y2 = 2y1 + 2 = 6, and so on.
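Recursion (9) can be cross-checked against an independent computation of E[Tk]. The sketch below uses a Markov-chain formulation (not developed in the notes): the state is the current run length of heads, and the expected hitting time of a run of k heads is found by iteration:

```python
# Cross-check of recursion (9): y_k = p^{-1} y_{k-1} + p^{-1}, y_1 = 1/p.
def expected_time_markov(k, p):
    # State r = current run length of heads; e_r = expected additional flips
    # until a run of k heads, satisfying e_r = 1 + p*e_{r+1} + q*e_0, e_k = 0.
    q = 1.0 - p
    e = [0.0] * (k + 1)
    for _ in range(5000):  # fixed-point (value) iteration
        e = [1.0 + p * e[r + 1] + q * e[0] for r in range(k)] + [0.0]
    return e[0]

def expected_time_recursion(k, p):
    y = 1.0 / p                      # y_1 = 1/p
    for _ in range(2, k + 1):
        y = y / p + 1.0 / p          # equation (9)
    return y

for k in [1, 2, 3, 4]:
    assert abs(expected_time_markov(k, 0.5) - expected_time_recursion(k, 0.5)) < 1e-4
```

For p = 1/2 the recursion gives y1 = 2, y2 = 6, y3 = 14, y4 = 30, matching the Markov-chain computation.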
Example 15. Importance of the Independence Hypothesis in Property 4.3.8.

Let Y = XZ, where Z = 1 + X. Now, E[Y ] = E[XZ] = E[X(1 + X)] = E[X] + E[X2].
However, if we erroneously attempt to use Property 4.3.8 (by ignoring the dependence
of Z on X), we would take h(t) = E[Xt] = E[X]t and obtain E[h(Z)] = E[E[X]Z] = E[X]E[Z] = E[X](1 +
E[X]) = E[X] + E[X]2. Note that the two are different since E[X2] ≠ E[X]2 in general.
5. The L1 conditional expectation
Consider a probability space (Ω,F , P) and let X be an integrable random variable,
that is, E[|X|] < ∞. We write X ∈ L1(Ω,F , P). Let D be a sub σ-algebra. For
each M ≥ 1, we define X ∧ M as X when |X| ≤ M and M sgn(X) otherwise. Note that
X ∧ M ∈ L2(Ω,F , P) (since it is bounded), so we can talk about E[(X ∧ M)|D], which
we call ZM . The question is: can we somehow use ZM to define E[X|D]? Intuitively,
we would want to think of E[X|D] as some kind of limit of ZM as M → ∞. So one
approach for the construction of E[X|D] is to take a “limit” of ZM and then
show that the limit, Z, say, satisfies Definition 6. It turns out that such an approach
would lead to the following definition, which we will adopt from now on.
Definition 7. [L1 Conditional Expectation] If X is an integrable random variable,
then we define Z = E[X|D] if Z has the following two properties: (1) Z is D-measurable,
(2) E[ZY ] = E[XY ] for any bounded D-measurable random variable Y .
All the properties of the L2 conditional expectation carry over to the L1 condi-
tional expectation.
We make the final remark that if a random variable X is square integrable, then
it is integrable. The proof is easy. Suppose that X is square integrable. Then,
E[|X|] = E[|X|I{|X|≤1}] + E[|X|I{|X|>1}] ≤ E[I{|X|≤1}] + E[X2I{|X|>1}] ≤ 1 + E[X2] < ∞.
6. Conditional probabilities
Now we will connect conditional expectations to the familiar conditional probabil-
ities. In particular, we recall from undergraduate probability that if A and B are
events, with P(B) > 0, then we define the conditional probability of A given that the
event B has occurred as P(A|B) ≜ P(A ∩ B)/P(B). What is the connection between
this definition and our notion of a conditional expectation?

Consider the σ-algebra D = {∅, Ω, B, Bc}, and consider E[IA|D]. Because of the
special form of this D, and because we know that this conditional expectation is D-
measurable, we infer that E[IA|D] can assume only two values: one value on B and
another value on Bc. That is, we can write E[IA|D] = aIB + bIBc , where a and b are
constants. We claim that a = P(A ∩ B)/P(B) and b = P(A ∩ Bc)/P(Bc). Note that
because 0 < P(B) < 1, we are not dividing by zero. As seen from a
homework problem, we can prove that (P(A ∩ B)/P(B))IB + (P(A ∩ Bc)/P(Bc))IBc is
actually the conditional expectation E[IA|D] by showing that it
satisfies the two defining properties listed in Definition 6 or
Definition 7. Thus, E[IA|D] encompasses both P(A|B) and P(A|Bc); that is, P(A|B)
and P(A|Bc) are simply the values of E[IA|D] on B and Bc, respectively. Also note
that P(·|B) is actually a probability measure for each B as long as P(B) > 0.
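Here is a minimal finite sanity check (the die example is hypothetical): with Ω the outcomes of a fair die, B the even outcomes, and A the outcomes at least 4, E[IA|D] equals P(A|B) on B and P(A|Bc) on Bc, and satisfies the defining property of Definition 6:

```python
# Omega = {1,...,6} uniform, B = "even outcome", A = "outcome >= 4".
omega = [1, 2, 3, 4, 5, 6]
P = 1.0 / 6.0
B = {2, 4, 6}
A = {4, 5, 6}

def prob(event):
    return sum(P for w in omega if w in event)

a = prob(A & B) / prob(B)             # P(A|B)
b = prob(A - B) / (1.0 - prob(B))     # P(A|B^c)
Z = {w: (a if w in B else b) for w in omega}   # E[I_A | D]

# Defining property: E[ZW] = E[I_A W] for every D-measurable W,
# i.e., every W that is constant on B and on B^c.
for (wB, wBc) in [(1.0, 0.0), (0.0, 1.0), (2.5, -1.0)]:
    W = {w: (wB if w in B else wBc) for w in omega}
    lhs = sum(Z[w] * W[w] * P for w in omega)
    rhs = sum((1.0 if w in A else 0.0) * W[w] * P for w in omega)
    assert abs(lhs - rhs) < 1e-12
```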
7. Joint densities and marginal densities
Let X and Y be random variables on (Ω,F , P) with a distribution µXY (B) defined
on all Borel subsets B of the plane. Suppose that X and Y have a joint density. That
is, there exists an integrable function fXY (·, ·) : IR2 → IR, such that

(10) µXY (B) = ∫B fXY (x, y) dx dy, for any B ∈ B2.
7.1. Marginal densities. Consider µXY (A × IR), where A ∈ B. Notice that µXY (· × IR) is a probability measure on B. Also, by the integral representation shown
in (10),

(11) µXY (A × IR) = ∫A×IR fXY (x, y) dx dy = ∫A (∫IR fXY (x, y) dy) dx ≡ ∫A fX(x) dx,

where fX(·) is called the marginal density of X. Note that fX(·) qualifies as the pdf
of X. Thus, we arrive at the familiar result that the pdf of X can be obtained from
the joint pdf of X and Y through integration. Similarly, fY (y) = ∫IR fXY (x, y) dx.
7.2. Conditional densities. Suppose that X and Y have a joint density fXY . Then,

(12) E[Y |X] = (∫IR y fXY (X, y) dy) / (∫IR fXY (X, y) dy),

and in particular,

(13) E[Y |X = x] = (∫IR y fXY (x, y) dy) / (∫IR fXY (x, y) dy) ≡ g(x).
Proof

We will verify the definition of a conditional expectation given in Definition 7. First,
we must show that g(X) is σ(X)-measurable. This follows from the fact that g is Borel
measurable. (This is because g is obtained from integrating a Borel-measurable func-
tion in the plane over one of the variables. This is proved in a course on integration.
Let’s not worry about the details now.)

Next, we must show that E[g(X)W ] = E[Y W ] for every bounded σ(X)-measurable
W . Without loss of generality, assume that W = h(X) for some Borel function h. To
see that E[g(X)h(X)] = E[Y h(X)], write

E[g(X)h(X)] = E[ ((∫IR t fXY (X, t) dt)/(∫IR fXY (X, t) dt)) h(X) ]

= ∫IR ∫IR ((∫IR t fXY (x, t) dt)/(∫IR fXY (x, t) dt)) h(x) fXY (x, y) dx dy

= ∫IR ∫IR ((∫IR t fXY (x, t) dt)/fX(x)) h(x) fXY (x, y) dx dy

= ∫IR ((∫IR t fXY (x, t) dt)/fX(x)) h(x) (∫IR fXY (x, y) dy) dx

= ∫IR ((∫IR t fXY (x, t) dt)/fX(x)) h(x) fX(x) dx

= ∫IR (∫IR t fXY (x, t) dt) h(x) dx

= ∫IR ∫IR t h(x) fXY (x, t) dt dx

= E[Y h(X)].
In the above, we have used two facts about integration: (1) that we can write a
double integral as an iterated integral, and (2) that we can exchange the order of
integration in the iterated integrals. These two properties are always true as long
as the integrand is non-negative. This is a consequence of what is called Fubini’s
Theorem in analysis [2, Chapter 8]. To complete the proof, we need to address the
concern over the points at which fX(·) = 0 (in which case we cannot
cancel the fX ’s in the numerator and the denominator in the above development).
However, it is straightforward to show that P{fX(X) = 0} = 0, which guarantees
that we can exclude these problematic points from the integration in the x-direction
without changing the integral. 2
With the above results, we can think of

 f_XY(x, y) / ∫_IR f_XY(x, y) dy

as a conditional pdf. In particular, if we define

(14)  f_{Y|X}(y|x) ≜ f_XY(x, y) / ∫_IR f_XY(x, y) dy = f_XY(x, y) / f_X(x),

then we can calculate the conditional expectation E[Y|X] using the formula

 ∫_IR y f_{Y|X}(y|X) dy,

which is the familiar result that we know from undergraduate probability.
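As a sanity check of the conditional-pdf formula, here is a numerical sketch of my own (not from the notes): for jointly Gaussian (X, Y) with standard N(0, 1) marginals and correlation rho, integrating y f_{Y|X}(y|x) dy gives the classical result E[Y|X] = rho·X, which binned sample averages reproduce.

```python
import numpy as np

# Monte Carlo check (an illustrative example, assuming a bivariate normal
# model): E[Y|X = x] should be close to rho * x.
rng = np.random.default_rng(0)
rho, n = 0.7, 200_000
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)

bins = np.linspace(-2, 2, 21)                 # 20 bins on [-2, 2]
centers = 0.5 * (bins[:-1] + bins[1:])
idx = np.digitize(x, bins) - 1
# Averaging y within each x-bin is an empirical version of integrating
# y * f_{Y|X}(y|x) dy.
cond_mean = np.array([y[idx == k].mean() for k in range(len(centers))])

max_err = float(np.max(np.abs(cond_mean - rho * centers)))
print(max_err)   # small: the binned averages track rho * x
```

The bin width and sample size are arbitrary choices; shrinking the bins and growing the sample drives the discrepancy to zero.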
8. Convergence of random sequences
Consider a sequence of random variables X1, X2, . . . , defined on the product space
(Ω, F, P), where as before, Ω = ×_{i=1}^∞ Ω_i is the infinite-dimensional Cartesian product
space and F is the smallest σ-algebra containing all cylinders. Since each Xi is a
function on Ω, when we talk about convergence of random variables we must do so
in the context of convergence of functions.
8.1. Pointwise convergence. Let us take Ωi = [0, 1], F = B ∩ [0, 1], and take P as
Lebesgue measure (generalized length) on [0, 1]. Consider the sequence of functions
fn(t) defined as follows:

(15)  fn(t) = { n²t,      0 ≤ t ≤ (2n)⁻¹
              { n − n²t,  (2n)⁻¹ < t ≤ n⁻¹
              { 0,        otherwise.
It is straightforward to see that for any t, the sequence of numbers fn(t) converges
to 0; thus fn converges to the constant function 0. We say that fn → 0 pointwise,
everywhere, or surely. Generally, a sequence of random variables Xn converges to a
random variable X if for every ω ∈ Ω, Xn(ω) → X(ω). To be more precise, for every
ε > 0 and every ω ∈ Ω, there exists n0 ∈ IN such that |Xn(ω) − X(ω)| < ε whenever
n ≥ n0. Note that in this definition n0 not only depends upon the choice of ε but also
on the choice of ω. Note that in the example above we cannot drop the dependence
of n0 on t (the role of ω here). To see that, observe that sup_t |fn(t) − 0| = n/2; thus,
there is no single n0 that will work for every t.
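This contrast can be inspected numerically; the sketch below (my own, with an arbitrary evaluation grid) shows the pointwise values dying while the sup norms grow like n/2.

```python
import numpy as np

# The triangle functions f_n of (15): f_n(t) -> 0 for each fixed t, while
# sup_t f_n(t) = n/2 grows, so the convergence is not uniform.
def f(n, t):
    t = np.asarray(t, dtype=float)
    rising = (t >= 0) & (t <= 1 / (2 * n))
    falling = (t > 1 / (2 * n)) & (t <= 1 / n)
    return np.where(rising, n**2 * t, np.where(falling, n - n**2 * t, 0.0))

t0 = 0.1                                   # a fixed point: f_n(t0) = 0 once 1/n < t0
pointwise = [f(n, t0).item() for n in (5, 11, 50, 100)]

grid = np.linspace(0, 1, 200_001)
sup_norms = [float(f(n, grid).max()) for n in (5, 11, 50, 100)]
print(pointwise, sup_norms)                # pointwise values die; sup norms ~ n/2
```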
8.2. Uniform convergence. The strongest type of convergence is what’s called uni-
form convergence, in which case n0 will be independent of the choice of ω. More
precisely, a sequence of random variables Xn converges to a random variable X uni-
formly if for every ε > 0, there exists n0 ∈ IN such that |Xn(ω) −X(ω)| < ε for any
ω ∈ Ω and whenever n ≥ n0. This is equivalent to saying that for every ε > 0, there
exists n0 ∈ IN such that supω∈Ω |Xn(ω)−X(ω)| < ε whenever n ≥ n0. Can you make
a small change in the example above to make the convergence uniform?
8.3. Almost-sure convergence (or almost-everywhere convergence). Now con-
sider a slight variation of the example in (15):

(16)  fn(t) = { n²t,      0 ≤ t ≤ (2n)⁻¹
              { n − n²t,  (2n)⁻¹ < t ≤ n⁻¹
              { 1,        t ∈ (n⁻¹, 1] ∩ |Q
              { 0,        t ∈ (n⁻¹, 1] ∩ |Qᶜ.
Note that in this case fn(t) → 0 for t ∈ [0, 1) ∩ |Qc. Note that the Lebesgue measure
(or generalized length) of the set of points for which fn(t) does not converge to the
function 0 is zero. (The latter is because the measure of the set of rational numbers
is zero. To see that, let r1, r2, . . . be an enumeration of the rationals. Pick any ε > 0,
and define the intervals Jn = (rn−2−n−1ε, rn+2−n−1ε). Now, if we sum up the lengths
of all of these intervals, we obtain ε. However, since |Q ⊂ ∪∞n=1Jn, the “length” of the
set |Q cannot exceed ε. Since ε can be selected arbitrarily small, we conclude that
the Lebesgue measure of |Q must be zero.) In this case, we say that the sequence of
functions fn converges to the constant function 0 almost everywhere.
Generally, we say that a sequence of random variables Xn converges to a random
variable X almost surely if there exists A ∈ F , with the property P(A) = 1, such that
for every ω ∈ A, Xn(ω) → X(ω). This is the strongest convergence statement that we
can make in a probabilistic sense (because we don’t care about the points that belong
to a set that has probability zero). We can express this type of convergence by
saying that Xn → X almost surely (or a.s.) if P{ω ∈ Ω : limn→∞ Xn(ω) = X(ω)} = 1,
or simply when P{limn→∞ Xn = X} = 1.
8.4. Convergence in probability (or convergence in measure). We say that
the sequence Xn converges to a random variable X in probability (also called
convergence in measure) when for every ε > 0, limn→∞ P{|Xn − X| > ε} = 0. This is
a weaker type of convergence than almost-sure convergence, as seen next.
Theorem 6. Almost sure convergence implies convergence in probability.
Proof
Let ε > 0 be given. For each N ≥ 1, define E_N = {|Xn − X| > ε for some n ≥ N}.
If we define E ≜ ∩_{N=1}^∞ E_N, we observe that ω ∈ E implies that Xn(ω) does not
converge to X(ω). (This is because if ω ∈ E, then no matter how large we pick N,
N0 say, there will be an n ≥ N0 such that |Xn(ω) − X(ω)| > ε.) Hence, since we know
that Xn converges to X a.s., it must be true that P(E) = 0. Now observe that since
E_1 ⊃ E_2 ⊃ . . ., P(E) = limN→∞ P(E_N), and therefore limN→∞ P(E_N) = 0. Finally,
for n ≥ N, {|Xn − X| > ε} ⊂ E_N, so P{|Xn − X| > ε} ≤ P(E_N). Hence,
limn→∞ P{|Xn − X| > ε} = 0. 2
This theorem is not true, however, in infinite measure spaces. For example, take
Ω = IR, F = B, and the Lebesgue measure m. Let Xn(ω) = ω/n. Then clearly,
Xn → 0 everywhere, but m({|Xn − 0| > ε}) = ∞ for any ε > 0. Where did we use the
fact that P is a finite measure in the proof of Theorem 6?
8.5. Convergence in the mean-square sense (or in L2). We say that the se-
quence Xn converges to a random variable X in the mean-square sense (also in L2)
if limn→∞ E[|Xn − X|2] = 0. Recall that this is precisely the type of convergence
we defined in the Hilbert-space context, which we used to introduce the conditional
expectation. This is a stronger type of convergence than convergence in probability, as
seen next.
Theorem 7. Convergence in the mean-square sense implies convergence in probabil-
ity.
Proof
This is an easy consequence of the Markov inequality with ψ(·) = (·)², which
yields P{|Xn − X| > ε} ≤ E[|Xn − X|²]/ε². 2
Convergence in probability does not imply almost-sure convergence, as seen from
the following example.
Example 16 (The marching-band functions). Consider Ω = [0, 1], take P to be
Lebesgue measure on [0, 1], and define, for n ∈ {2^k, . . . , 2^{k+1} − 1} and k = 0, 1, 2, . . .,

(17)  Xn(ω) = I_{[2^{−k}(n − 2^k), 2^{−k}(n − 2^k + 1))}(ω).

For each k, these intervals tile [0, 1) as n runs over the block {2^k, . . . , 2^{k+1} − 1}.
Hence for any ω ∈ [0, 1), Xn(ω) = 1 for infinitely many n's (once per block); thus, Xn
does not converge to 0 anywhere. At the same time, if n ∈ {2^k, . . . , 2^{k+1} − 1}, then
P{|Xn − 0| > ε} = 2^{−k} for 0 < ε < 1; hence, limn→∞ P{|Xn − 0| > ε} = 0 and Xn
converges to 0 in probability.
Note that in the above example, limn→∞ E[|Xn − 0|2] = 0, so Xn also converges to
0 in L2.
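A quick simulation sketch of the marching-band behavior (using one common indexing of the blocks, which may differ from the author's): along any fixed sample point the indicators fire once per block, forever, while their probabilities shrink geometrically.

```python
import numpy as np

# Block k contains n = 2^k, ..., 2^{k+1}-1, and X_n is the indicator of the
# (n - 2^k)-th subinterval of length 2^{-k}; the subintervals tile [0, 1).
def X(n, w):
    k = n.bit_length() - 1                  # k = floor(log2 n)
    left = (n - 2**k) / 2**k
    return 1.0 if left <= w < left + 2**-k else 0.0

w = 0.3                                     # any fixed sample point in [0, 1)
hits = [n for n in range(1, 2**12) if X(n, w) == 1.0]   # blocks k = 0..11
probs = [2.0 ** -(m.bit_length() - 1) for m in (2**4, 2**8, 2**12)]
print(len(hits), probs)   # one hit per block; P{X_n = 1} -> 0 along n
```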
Recall the example given in (15) and let us modify it slightly as follows:

(18)  fn(t) = { n^{3/2} t,           0 ≤ t ≤ (2n)⁻¹
              { n^{1/2} − n^{3/2} t,  (2n)⁻¹ < t ≤ n⁻¹
              { 0,                   otherwise.

In this case, fn continues to converge to 0 at every point; nonetheless, E[|fn − 0|²] =
1/12 for every n. Thus, fn does not converge to 0 in L2.
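The constant 1/12 can be checked numerically; the sketch below (a Riemann-sum approximation of the integral on a uniform grid, my own choice of grid) shows the second moment sitting at 1/12 for several n.

```python
import numpy as np

# E[|f_n|^2] under Lebesgue measure on [0,1] for the functions in (18):
# a narrowing triangle of height sqrt(n)/2 whose squared integral stays 1/12.
def f(n, t):
    rising = (t >= 0) & (t <= 1 / (2 * n))
    falling = (t > 1 / (2 * n)) & (t <= 1 / n)
    return np.where(rising, n**1.5 * t,
                    np.where(falling, np.sqrt(n) - n**1.5 * t, 0.0))

t = np.linspace(0, 1, 2_000_001)
# (f**2).mean() over a uniform grid approximates the integral over [0, 1].
second_moments = [float((f(n, t) ** 2).mean()) for n in (10, 100, 1000)]
print(second_moments)   # each close to 1/12 = 0.08333...
```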
Lemma 1. If f is a continuous function, then Xn → X in probability implies
f(Xn) → f(X) in probability.
Proof: Left as an exercise.
8.6. Convergence in distribution. A sequence Xn is said to converge to a ran-
dom variable X in distribution if the distribution functions, FXn(x), converge to the
distribution function of X, FX(x), at every point of continuity of FX(x).
Theorem 8. Convergence in probability implies convergence in distribution.
Proof (Adapted from T. G. Kurtz)
Pick ε > 0 and note that
P{X ≤ x + ε} = P{X ≤ x + ε, Xn > x} + P{X ≤ x + ε, Xn ≤ x}

and

P{Xn ≤ x} = P{Xn ≤ x, X > x + ε} + P{Xn ≤ x, X ≤ x + ε}.

Subtracting, the common term P{Xn ≤ x, X ≤ x + ε} cancels and we obtain

(19)  P{Xn ≤ x} − P{X ≤ x + ε} = P{Xn ≤ x, X > x + ε} − P{X ≤ x + ε, Xn > x}.

Note that since {Xn ≤ x, X > x + ε} ⊂ {|Xn − X| > ε},

P{Xn ≤ x} − P{X ≤ x + ε} ≤ P{|Xn − X| > ε}.

By taking the limit superior of both sides and noting that limn→∞ P{|Xn − X| > ε} =
0, we obtain lim sup_{n→∞} P{Xn ≤ x} ≤ P{X ≤ x + ε}. Now replace every occurrence
of ε in (19) with −ε, rearrange and obtain

−P{Xn ≤ x} + P{X ≤ x − ε} = −P{Xn ≤ x, X > x − ε} + P{X ≤ x − ε, Xn > x}.

Since {X ≤ x − ε, Xn > x} ⊂ {|Xn − X| > ε}, we have

−P{Xn ≤ x} + P{X ≤ x − ε} ≤ P{|Xn − X| > ε}.

By taking the limit inferior of both sides, we obtain lim inf_{n→∞} P{Xn ≤ x} ≥
P{X ≤ x − ε}.

Combining, we obtain P{X ≤ x − ε} ≤ lim inf_{n→∞} P{Xn ≤ x} ≤ lim sup_{n→∞} P{Xn ≤
x} ≤ P{X ≤ x + ε}. Thus, by letting ε → 0 we conclude that limn→∞ FXn(x) = FX(x)
at every point of continuity of FX. 2
Theorem 9 (Bounded Convergence Theorem). Suppose that Xn converges to X in
probability, and that |Xn| ≤ M a.s. for all n, for some fixed M. Then E[Xn] → E[X]
as n → ∞.
The following example shows that the boundedness requirement is not superfluous. Take
Ω = [0, 1], F = B ∩ [0, 1], and m as Lebesgue measure (i.e., m(A) = ∫_A dx). Consider
the functions fn(t) = n I_{(0, n⁻¹]}(t). Note that fn(t) → 0 for every t ∈ Ω, but
E[fn − 0] = 1 for every n. Therefore, E[fn] does not converge to E[0] = 0.
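The figure that originally accompanied this passage is not reproduced; one minimal concrete family with the stated behavior (my choice, not necessarily the author's picture) is fn = n·1_{(0, 1/n]}, and the sketch below checks both properties numerically.

```python
import numpy as np

# f_n = n on (0, 1/n], else 0: f_n(t) -> 0 for every t, yet E[f_n] = 1 for
# all n, so E[f_n] does not converge to E[0] = 0.
def f(n, t):
    return np.where((t > 0) & (t <= 1 / n), float(n), 0.0)

t = np.linspace(0, 1, 1_000_001)
means = [float(f(n, t).mean()) for n in (10, 100, 1000)]    # Riemann sums for E[f_n]
pointwise = [float(f(n, 0.37)) for n in (10, 100, 1000)]    # values at a fixed t
print(means, pointwise)   # means stay near 1; pointwise values are 0
```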
Proof of Theorem 9 (Adapted from A. Beck).

Let En = {|Xn| > M} and note that P(En) = 0 by the boundedness assumption.
Define E0 = ∪_{n=1}^∞ En and note that P(E0) ≤ Σ_{n=1}^∞ P(En) = 0.

We will first show that |X| ≤ M a.s. Consider Wn,k ≜ {|Xn − X| > 1/k}, k = 1, 2, . . ..
Since Xn converges to X in probability, P(Wn,k) → 0 as n → ∞. Let Fk = {|X| > M + 1/k}.
Note that Fk ⊂ Wn,k ∪ En for every n, so P(Fk) ≤ P(Wn,k) + P(En) = P(Wn,k). Then,
P(Fk) ≤ limn→∞ P(Wn,k) = 0, or P(Fk) = 0. Therefore, P(∪_{k=1}^∞ Fk) = 0. Since
∪_{k=1}^∞ Fk = {|X| > M}, we conclude |X| ≤ M a.s.

Let ε > 0 be given. Let F0 ≜ ∪_{k=1}^∞ Fk and Gn = {|Xn − X| > ε}. Note that Ω can
be written as the disjoint union (F0 ∪ E0) ∪ (Gn ∪ F0 ∪ E0)ᶜ ∪ (Gn \ (F0 ∪ E0)). Thus,

|E[Xn − X]| ≤ E[|Xn − X|] = E[|Xn − X| I_{F0∪E0}] + E[|Xn − X| I_{Gn\(F0∪E0)}]
 + E[|Xn − X| I_{(Gn∪F0∪E0)ᶜ}] ≤ 0 + 2M E[I_{Gn}] + ε = 2M E[I_{Gn}] + ε.
Since P(Gn) → 0, there exists n0 such that n ≥ n0 implies P(Gn) ≤ ε/(2M). Hence, for
n ≥ n0, |E[Xn] − E[X]| ≤ 2M · ε/(2M) + ε = 2ε, which establishes E[Xn] → E[X] as n → ∞.
2
Lemma 2 (Adapted from T. G. Kurtz). Suppose that X ≥ 0 a.s. Then, limM→∞ E[X ∧ M] = E[X].
Proof
First note that the statement is true when X is a discrete random variable. Then
recall that we can always approximate a random variable X monotonically from below
by discrete random variables. More precisely, define Xn ≤ X as in (5) and note that
Xn ↑ X and E[Xn] ↑ E[X]. Now since Xn ∧ M ≤ X ∧ M ≤ X, we have

E[Xn] = limM→∞ E[Xn ∧ M] ≤ lim inf_{M→∞} E[X ∧ M] ≤ lim sup_{M→∞} E[X ∧ M] ≤ E[X].

Take the limit as n → ∞ to obtain

E[X] ≤ lim inf_{M→∞} E[X ∧ M] ≤ lim sup_{M→∞} E[X ∧ M] ≤ E[X],

from which the Lemma follows. 2
Theorem 10 (Monotone Convergence Theorem, Adapted from T. Kurtz). Suppose
that 0 ≤ Xn ≤ X and Xn converges to X in probability. Then, limn→∞ E[Xn] = E[X].
Proof
For M > 0, Xn ∧ M ≤ Xn ≤ X. Thus,

E[Xn ∧ M] ≤ E[Xn] ≤ E[X].

By the bounded convergence theorem (note that Xn ∧ M → X ∧ M in probability, by
Lemma 1), limn→∞ E[Xn ∧ M] = E[X ∧ M]. Hence,

E[X ∧ M] ≤ lim inf_{n→∞} E[Xn] ≤ lim sup_{n→∞} E[Xn] ≤ E[X].

Now by Lemma 2, limM→∞ E[X ∧ M] = E[X], and we finally obtain

E[X] ≤ lim inf_{n→∞} E[Xn] ≤ lim sup_{n→∞} E[Xn] ≤ E[X],
and the theorem is proven. 2
Lemma 3 (Fatou's Lemma, Adapted from T. G. Kurtz). Suppose that Xn ≥ 0 a.s.
and Xn converges to X in probability. Then, lim inf_{n→∞} E[Xn] ≥ E[X].

Proof

For M > 0, Xn ∧ M ≤ Xn. Thus,

E[X ∧ M] = limn→∞ E[Xn ∧ M] ≤ lim inf_{n→∞} E[Xn],

where the left equality is due to the bounded convergence theorem. Now by Lemma 2,
limM→∞ E[X ∧ M] = E[X], and we obtain E[X] ≤ lim inf_{n→∞} E[Xn]. 2
Theorem 11 (Dominated Convergence Theorem). Suppose that Xn → X in proba-
bility, |Xn| ≤ Y, and E[Y] < ∞. Then, E[Xn] → E[X].

Proof:

Since Y + Xn ≥ 0 and Y + Xn → Y + X in probability, we use Fatou's lemma to
write lim inf_{n→∞} E[Y + Xn] ≥ E[Y + X]. This implies E[Y] + lim inf_{n→∞} E[Xn] ≥ E[Y] +
E[X], or lim inf_{n→∞} E[Xn] ≥ E[X]. Similarly, lim inf_{n→∞} E[Y − Xn] ≥ E[Y − X], which
implies E[Y] + lim inf_{n→∞}(−E[Xn]) ≥ E[Y] − E[X], or lim inf_{n→∞}(−E[Xn]) ≥ −E[X]. But
lim inf_{n→∞}(−xn) = − lim sup_{n→∞} xn, and therefore lim sup_{n→∞} E[Xn] ≤ E[X]. In summary, we
have lim sup_{n→∞} E[Xn] ≤ E[X] ≤ lim inf_{n→∞} E[Xn], which implies limn→∞ E[Xn] = E[X]. 2
9. The L1 Conditional expectation, revisited
Consider a probability space (Ω, F, P) and let X ∈ L1(Ω, F, P). Let D be a sub-
σ-algebra of F. For each n ≥ 1, we define Xn ≜ (X ∧ n) ∨ (−n) (that is, X truncated
to [−n, n]: Xn = X when |X| ≤ n, and Xn = ±n otherwise, according to the sign of X).
Note that Xn ∈ L2(Ω, F, P) (since it is bounded), so we can talk about
E[Xn|D] as a projection of Xn onto L2(Ω, D, P), which we call Zn. We will now
show that Zn is a Cauchy sequence in L1, which means limm,n→∞ E[|Zn − Zm|] = 0.

We already know that E[(Zn − Zm)Y] = E[(Xn − Xm)Y] for any Y ∈ L2(Ω, D, P). In
particular, pick Y = I_{Zn−Zm>0} − I_{Zn−Zm≤0}. In this case, (Zn − Zm)Y = |Zn − Zm|.
Thus, we conclude that E[|Zn − Zm|] = E[(Xn − Xm)(I_{Zn−Zm>0} − I_{Zn−Zm≤0})]. But
the right-hand side is no greater than E[|Xn − Xm|] (why?), and we obtain E[|Zn −
Zm|] ≤ E[|Xn − Xm|]. However, E[|Xn − Xm|] ≤ E[|X| I_{|X|≥min(m,n)}] → 0 by the
dominated convergence theorem (verify this), and we conclude that limm,n→∞ E[|Zn −
Zm|] = 0.

From a key theorem in analysis (e.g., see Rudin, Chapter 3), we know that any
L1 space is complete. Thus, there exists Z ∈ L1(Ω, D, P) such that limn→∞ E[|Zn −
Z|] = 0. Further, Zn has a subsequence, Znk, that converges almost surely to Z.

We take this Z as a candidate for E[X|D]. But first, we must show that Z satisfies
E[ZY] = E[XY] for any bounded, D-measurable Y. This is easy to show. Suppose
that |Y| < M, for some M. We already know that E[ZnY] = E[XnY], and by
the dominated convergence theorem, E[XnY] → E[XY] (since |Xn| ≤ |X|). Also,
|E[ZnY] − E[ZY]| ≤ E[|Zn − Z||Y|] ≤ M E[|Zn − Z|] → 0. Hence, E[ZnY] → E[ZY].
This leads to the conclusion that E[ZY] = E[XY] for any bounded, D-measurable Y.

In summary, we have found a Z ∈ L1(Ω, D, P) such that E[ZY] = E[XY] for any
bounded, D-measurable Y. We define this Z as E[X|D].
10. Central Limit Theorems
To establish the central limit theorem, we need the concept of a characteristic
function. Let Y be a r.v. We define the characteristic function of Y as
φY(u) ≜ E[e^{iuY}], where E[e^{iuY}] exists if E[|Re(e^{iuY})|] < ∞ and E[|Im(e^{iuY})|] < ∞.
Note that |Re(e^{iuY})| ≤ |e^{iuY}| = 1 and |Im(e^{iuY})| ≤ 1. Therefore, E[|Re(e^{iuY})|] ≤ 1
and E[|Im(e^{iuY})|] ≤ 1. Hence, E[e^{iuY}] is well defined for any u ∈ IR. If Y has a pdf
fY(y), then E[e^{iuY}] = ∫_{−∞}^{∞} e^{iuy} fY(y) dy, which is the Fourier transform of fY
evaluated at −u.
Example 17. Y ∈ {0, 1} and P{Y = 1} = p. In this case φY(u) = E[e^{iuY}] =
p e^{iu} + (1 − p) e^{iu·0} = p e^{iu} + (1 − p).
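Example 17 can be checked empirically: the sample average of e^{iuY} over many Bernoulli draws should land on the formula (a Monte Carlo sketch of my own, with arbitrary p and u).

```python
import numpy as np

# Empirical characteristic function of Y ~ Bernoulli(p) versus the closed
# form phi_Y(u) = p e^{iu} + (1 - p).
rng = np.random.default_rng(1)
p, u = 0.3, 1.7
y = rng.random(1_000_000) < p                # Bernoulli(p) samples (booleans)
phi_mc = np.exp(1j * u * y).mean()           # empirical E[e^{iuY}]
phi_exact = p * np.exp(1j * u) + (1 - p)
print(abs(phi_mc - phi_exact))               # ~ 1/sqrt(sample size), small
```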
Example 18. Y ∼ N(0, 1). Then, E[e^{iuY}] = e^{−u²/2} (see justification below).

We also need the following theorem, which relates pdf's to characteristic functions.
10.1. Levy's Inversion Lemma.

Lemma 4. If φX(u) is the characteristic function of X, then

lim_{c→∞} (1/2π) ∫_{−c}^{c} [(e^{−iua} − e^{−iub})/(iu)] φX(u) du = P{a < X < b} + (P{X = a} + P{X = b})/2.

See Chow and Teicher for a proof.

As a special case, if X is an absolutely continuous random variable with pdf fX,
then (1/2π) ∫_{−∞}^{∞} e^{−iux} φX(u) du = fX(x).
Theorem 12 (Central limit theorem). Suppose Xk is a sequence of i.i.d. random
variables with E[Xk] = µ and Var(Xk) = σ². Let Sn = Σ_{k=1}^n Xk and Yn = (Sn − nµ)/(√n σ).
Then, Yn converges to Z in distribution, where Z ∼ N(0, 1) (zero-mean Gaussian random
variable with unit variance).
Sketch of the proof

Observe that

Φ_Yn(ω) = E[e^{jωYn}]
 = E[e^{jω Σ_{k=1}^n (Xk−µ)/(√n σ)}]
 = E[Π_{k=1}^n e^{(jω/(√n σ))(Xk−µ)}]
 = Π_{k=1}^n E[e^{(jω/(√n σ))(Xk−µ)}]
 = Π_{k=1}^n Φ_{(Xk−µ)}(ω/(√n σ))
 = (Φ_{(Xk−µ)}(ω/(√n σ)))^n.

We now expand Φ_{(Xk−µ)}(ω) as

Φ_{(Xk−µ)}(ω) = E[e^{jω(Xk−µ)}]
 = E[1 + jω(Xk−µ) + ((jω)²/2!)(Xk−µ)² + ((jω)³/3!)(Xk−µ)³ + · · ·]
 = 1 + jω E[Xk−µ] − (ω²/2) E[(Xk−µ)²] + ((jω)³/3!) E[(Xk−µ)³] + · · ·
 = 1 − (ω²/2) E[(Xk−µ)²] + ((jω)³/3!) E[(Xk−µ)³] + · · ·

Then we have

Φ_{(Xk−µ)}(ω/(√n σ)) = 1 − (ω²/(2nσ²)) E[(Xk−µ)²] − (jω³/(3! n^{3/2} σ³)) E[(Xk−µ)³] + · · ·
 ≈ 1 − (ω²/(2nσ²)) E[(Xk−µ)²]
 = 1 − ω²/(2n).

Then

Φ_Yn(ω) = (1 − ω²/(2n))^n

and

lim_{n→∞} Φ_Yn(ω) = lim_{n→∞} (1 − (ω²/2)/n)^n = e^{−ω²/2}.
On the other hand, the characteristic function of Z is ΦZ(ω) = E[e^{jωZ}]. Then

ΦZ(ω) = E[e^{jωZ}]
 = ∫_{−∞}^{∞} e^{jωz} fZ(z) dz
 = ∫_{−∞}^{∞} e^{jωz} (1/√(2π)) e^{−z²/2} dz
 = (1/√(2π)) ∫_{−∞}^{∞} e^{−(1/2)(z² − 2jωz)} dz
 = (1/√(2π)) ∫_{−∞}^{∞} e^{−(1/2)[(z−jω)² + ω²]} dz
 = (1/√(2π)) e^{−ω²/2} ∫_{−∞}^{∞} e^{−(1/2)(z−jω)²} dz
 = e^{−ω²/2}.
Therefore, limn→∞ ΦYn(ω) = ΦZ(ω) and FYn → FZ as n → ∞ by the inversion lemma. 2
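The convergence in Theorem 12 is easy to see by simulation; in the sketch below (my own choice of distribution) each Xk is Bernoulli(0.2), so Sn is Binomial(n, 0.2), and the empirical CDF of Yn is compared with the N(0, 1) CDF at a few points.

```python
import numpy as np
from math import erf, sqrt

# Normalized Binomial sums versus the standard normal CDF.
rng = np.random.default_rng(2)
p, n, trials = 0.2, 400, 200_000
mu, sigma = p, sqrt(p * (1 - p))
s = rng.binomial(n, p, size=trials)           # S_n = sum of n Bernoulli(p)'s
yn = (s - n * mu) / (sqrt(n) * sigma)         # Y_n = (S_n - n mu)/(sqrt(n) sigma)

phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))  # standard normal CDF
errs = [abs(float(np.mean(yn <= z)) - phi(z)) for z in (-1.0, 0.0, 1.0)]
print(errs)   # small differences for n = 400
```

The residual discrepancy here is dominated by the lattice nature of the Binomial (spacing 1/(√n σ) on the Y_n axis), which also shrinks as n grows.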
10.2. Central limit theorem for non-identical random variables. Suppose
Xk is a sequence of random variables that are independent but not identically
distributed, with E[Xk] = µk and Var(Xk) = σk². Let Sn = Σ_{k=1}^n Xk. Then
E[Sn] = Σ_{k=1}^n µk ≜ ηn and Var(Sn) = Σ_{k=1}^n σk² ≜ βn².

Let Yn = (Sn − ηn)/βn and suppose Xk satisfies the following two conditions:

a) limn→∞ βn² = limn→∞ Σ_{k=1}^n σk² = ∞;

b) There exist α > 2 and c < ∞ such that E[|Xk|^α] < c for all k.

Then, Yn converges in distribution to a zero-mean unit-variance Gaussian random
variable (for a proof see, for example, Chow and Teicher). A pair of conditions
implying the two conditions above is:

a') There is a constant ε > 0 such that σk² > ε for all k.

b') All the densities fk(x) vanish outside a finite interval (−c, c) for some c.
11. Strong Law of Large Numbers
Suppose that X1, X2, . . . , are i.i.d. random variables, E[X1] = µ < ∞, and E[(X1 −
µ)²] = σ² < ∞. Let Sn ≜ n⁻¹ Σ_{i=1}^n Xi. Note that E[Sn] = µ < ∞ and E[(Sn − µ)²] =
n⁻¹σ². By Chebyshev's inequality,

P{|Sn − µ| > ε} ≤ σ²/(nε²),

and therefore we conclude that Sn converges to µ in probability. This is called the
weak law of large numbers.
As it turns out, this convergence occurs almost surely, yielding what is called the
strong law of large numbers. To prove one version of the strong law, one needs
the notion of a sequence of events occurring infinitely often. Roughly speaking, if
A1, A2, . . . , is a sequence of events (think, for example, of the sequence {|Sn − µ| > ε}),
then one can look for the collection of all outcomes ω, each of which belongs to
infinitely many of the events An. For example, if ωo is such an outcome, and if we take
An = {|Sn − µ| > ε}, where ε > 0, then we would know that Sn(ωo) cannot converge
to µ.

More generally, we define the event {An occurs infinitely often}, or for short {An i.o.},
as

{An i.o.} ≜ ∩_{n=1}^∞ ∪_{k=n}^∞ Ak.

The above event is also referred to as the limit superior of the sequence An, that
is, {An i.o.} = lim sup_{n→∞} An. The terminology of limit superior makes sense since
∪_{k=n}^∞ Ak is a decreasing sequence in n and the intersection yields the limit of this
decreasing sequence.
Before we prove a key result on the probability of events such as {An i.o.}, we will
give an example showing their use.

Let Yn be any sequence of random variables, and let Y be a random variable (possibly
a candidate limit of the sequence, provided that the sequence is convergent). Consider
the event {limn→∞ Yn = Y}, the collection of all outcomes ω for which Yn(ω) → Y(ω).
It is easy to see that

(20)  {limn→∞ Yn = Y}ᶜ = ∪_{ε>0, ε∈|Q} {|Yn − Y| > ε i.o.}.

This is because if ωo ∈ {|Yn − Y| > ε i.o.} for some ε > 0, then we are guaranteed
that for any n ≥ 1, |Yk(ωo) − Y(ωo)| > ε for infinitely many k ≥ n; thus, Yn(ωo) does not
converge to Y(ωo).
Lemma 5 (Borel-Cantelli Lemma). Let A1, A2, . . . , be any sequence of events. If
Σ_{n=1}^∞ P(An) < ∞, then P{An i.o.} = 0.

Proof

From the definition of {An i.o.} we know that {An i.o.} ⊂ ∪_{k≥n} Ak for any n, and thus
P({An i.o.}) ≤ P(∪_{k≥n} Ak) for any n. Now note that Σ_{n=1}^∞ P(An) < ∞ implies
Σ_{k=n}^∞ P(Ak) → 0 as n → ∞. Since P(∪_{k=n}^∞ Ak) ≤ Σ_{k=n}^∞ P(Ak), we obtain
P(∩_{n=1}^∞ ∪_{k=n}^∞ Ak) = limn→∞ P(∪_{k=n}^∞ Ak) = 0 (recall that ∪_{k=n}^∞ Ak is a
decreasing sequence of events). Hence, P({An i.o.}) = 0. 2
The next Theorem is an application of the Borel-Cantelli Lemma.
Theorem 13 (Strong law of large numbers). Let X1, X2, . . . , be an i.i.d. sequence,
E[X1] = µ < ∞, E[X1²] = µ2 < ∞, and E[X1⁴] = µ4 < ∞. Then, Sn ≜ n⁻¹ Σ_{i=1}^n Xi →
µ almost surely.
Proof
Without loss of generality, we will assume that µ = 0. From (20), we know that

(21)  {limn→∞ Sn = 0}ᶜ = ∪_{ε>0, ε∈|Q} {|Sn| > ε i.o.}.

By the Markov inequality (with ψ(·) = (·)⁴),

P{|Sn| > ε} ≤ E[Sn⁴]/ε⁴ = (Σ_{i=1}^n Σ_{j=1}^n Σ_{ℓ=1}^n Σ_{m=1}^n E[Xi Xj Xℓ Xm]) / (n⁴ ε⁴).

Considering the sum in the numerator, we observe that all the terms of the form
E[Xi³]E[Xj], E[Xi²]E[Xj]E[Xℓ], and E[Xi]E[Xj]E[Xℓ]E[Xm] (with distinct indices)
are zero, since µ = 0. Hence,

Σ_{i=1}^n Σ_{j=1}^n Σ_{ℓ=1}^n Σ_{m=1}^n E[Xi Xj Xℓ Xm] = n E[X1⁴] + 3n(n−1) E[X1²] E[X1²],

and therefore

P{|Sn| > ε} ≤ (n µ4 + 3n(n−1) µ2²) / (n⁴ ε⁴).

Hence, we conclude that

Σ_{n=1}^∞ P{|Sn| > ε} < ∞.
By the Borel-Cantelli Lemma,

P{|Sn| > ε i.o.} = 0, ∀ε > 0.

Hence, by (21),

P({limn→∞ Sn = 0}ᶜ) = 0,

and the theorem is proved. 2
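A simulation sketch of the strong law (my own setup, using Uniform(0, 1) draws): rather than checking one n at a time, we ask, path by path, whether the entire tail of running averages stays near µ = 1/2, which is the almost-sure flavor of the statement.

```python
import numpy as np

# Running averages S_n = n^{-1} sum X_i along many independent sample paths.
rng = np.random.default_rng(3)
paths, n = 100, 50_000
x = rng.random((paths, n))                       # i.i.d. Uniform(0,1), mu = 0.5
running = x.cumsum(axis=1) / np.arange(1, n + 1)

# Fraction of paths whose whole tail beyond n = 20,000 stays within 0.01 of mu.
tail_dev = np.abs(running[:, 20_000:] - 0.5).max(axis=1)
frac_good = float((tail_dev < 0.01).mean())
print(frac_good)   # close to 1: almost every path's tail hugs mu
```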
12. Markov Chains
The majority of the material in this section is extracted from the book by
Billingsly (Probability and Measure) and the book by Hoel et al. (Introduction to
Stochastic Processes).
Up to this point we have seen many examples and properties (e.g., convergence)
of i.i.d. sequences. However, many sequences have correlation among their members.
Markov chains are a class of sequences that exhibit a simple type of correlation among
the random variables in the sequence. In words, as far as calculating probabilities
pertaining to future values of the sequence conditional on the present and the past, it
is sufficient to know the present. In other words, as far as the future is concerned,
knowledge of the present contains all the knowledge of the past. We next define
discrete-space Markov chains (MCs) formally.

Consider a sequence of discrete random variables X1, X2, . . . , defined on some prob-
ability space (Ω, F, P). Let S be a finite or countable set representing the set of all
possible values that each Xn may assume. Without loss of generality, assume that
S = {0, 1, 2, . . .}. The sequence is a Markov chain if

P{Xn+1 = j | X0 = i0, . . . , Xn = in} = P{Xn+1 = j | Xn = in}.

For each pair i and j in S, we define pij ≜ P{Xn+1 = j | Xn = i}. Note that we have
implicitly assumed (for now) that P{Xn+1 = j | Xn = i} does not depend upon n, in
which case the chain is called homogeneous. Also note that the definition of pij implies
that Σ_{j∈S} pij = 1, i ∈ S. The numbers pij are called the transition probabilities of
the MC. The initial probabilities are defined as αi ≜ P{X0 = i}. Note that the only
restriction on the αi's is that they are nonnegative and must add up to 1. We
denote the matrix whose (i, j)th entry is pij by IP, which is termed the transition
matrix of the MC. Note that if S is infinite, then IP is a matrix with infinitely many
rows and columns. The transition matrix IP is called a stochastic matrix, as each of
its rows sums to one and all its elements are nonnegative. Figure 6 is a state diagram
illustrating the transition probabilities for a two-state MC.
Figure 6. State diagram of a two-state MC: states 0 and 1, with self-loop
probabilities P11 and P22 and cross-transition probabilities P12 and P21.
12.1. Higher-Order Transitions. Consider P{X0 = i0, X1 = i1, X2 = i2}. By
applying Bayes' rule repeatedly (i.e., P(A ∩ B ∩ C) = P(A|B ∩ C)P(B|C)P(C)), we
obtain

P{X0 = i0, X1 = i1, X2 = i2} = P{X0 = i0} P{X1 = i1 | X0 = i0} P{X2 = i2 | X0 = i0, X1 = i1}
 = αi0 pi0i1 pi1i2.

More generally, it is easy to verify that

P{X0 = i0, X1 = i1, . . . , Xm = im} = αi0 pi0i1 · · · pim−1im

for any sequence i0, i1, . . . , im of states.

Further, it is also easy to see that

P{X3 = i3, X2 = i2 | X1 = i1, X0 = i0} = pi1i2 pi2i3.

More generally,

(22)  P{Xm+ℓ = jℓ, 1 ≤ ℓ ≤ n | Xs = is, 0 ≤ s ≤ m} = pimj1 pj1j2 · · · pjn−1jn.

With these preliminaries, we can define the n-step transition probability as

p^(n)_ij = P{Xm+n = j | Xm = i}.

If we now observe that

{Xm+n = j} = ∪_{k1,...,kn−1} {Xm+1 = k1, . . . , Xm+n−1 = kn−1, Xm+n = j},

we conclude

p^(n)_ij = Σ_{k1,...,kn−1} pik1 pk1k2 · · · pkn−1j.
From matrix theory, we recognize p^(n)_ij as the (i, j)th entry of IP^n, the nth power of
the transition matrix IP. It is convenient to use the convention p^(0)_ij = δij, consistent
with the fact that IP^0 is the identity matrix I.

Finally, it is straightforward to verify that

p^(m+n)_ij = Σ_{v∈S} p^(m)_iv p^(n)_vj,

and that Σ_{j∈S} p^(n)_ij = 1. This means IP^n = IP^k IP^ℓ whenever k + ℓ = n, which is
consistent with taking powers of a matrix.
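The identity p^(m+n)_ij = Σ_v p^(m)_iv p^(n)_vj is exactly matrix multiplication, which a quick numerical check makes concrete; the two-state entries below are hypothetical (Figure 6 leaves them unspecified).

```python
import numpy as np

# Chapman-Kolmogorov as matrix powers: IP^(m+n) = IP^m IP^n, and every IP^n
# is again a stochastic matrix (rows sum to 1).
IP = np.array([[0.7, 0.3],
               [0.8, 0.2]])
m, n = 3, 5
lhs = np.linalg.matrix_power(IP, m + n)
rhs = np.linalg.matrix_power(IP, m) @ np.linalg.matrix_power(IP, n)
ck_err = float(np.abs(lhs - rhs).max())
row_err = float(np.abs(lhs.sum(axis=1) - 1).max())
print(ck_err, row_err)   # both at floating-point roundoff level
```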
For convenience, we now define the conditional probabilities Pi(A) ≜ P{A | X0 = i},
A ∈ F. With this notation and as an immediate consequence of (22), we have

Pi{X1 = i1, . . . , Xm = j, Xm+1 = jm+1, . . . , Xm+n = jm+n} =
 Pi{X1 = i1, . . . , Xm = j} Pj{X1 = jm+1, . . . , Xn = jm+n}.
Example 19. A gambling problem (see related homework problem)

The initial fortune of a gambler is X0 = xo. After playing each hand, he/she either
increases (decreases) the fortune by one dollar with probability p (q). The gambler
quits if either her/his fortune reaches 0 (bankruptcy) or reaches the goal of L dollars.
Let Xn denote the fortune at time n. Note that

(23)  P{Xn = j | X1 = i1, . . . , Xn−1 = in−1} = P{Xn = j | Xn−1 = in−1}.

Hence, the sequence Xn is a Markov chain with state space S = {0, 1, . . . , L}. Also,
the time-independent transition probabilities are given by

(24)  pij = P{Xn+1 = j | Xn = i} = { p,  j = i + 1, 0 < i < L
                                   { q,  j = i − 1, 0 < i < L
                                   { 1,  i = j = L
                                   { 1,  i = j = 0
                                   { 0,  otherwise.
The (L + 1) × (L + 1) probability transition matrix IP ≜ ((pij)) is therefore

(25)  IP = [ 1  0  0  . . . . . . . . . 0
             q  0  p  0  . . . . . . .  .
             0  q  0  p  0  . . . . .  .
             .  .  .  .  .  . . .  .   .
             0  0  . . . . . q  0  p
             0  0  . . . . . . . 0  1 ].

Note that the sum of any row is 1, a characteristic of a stochastic matrix.
Exercise 2. Show that λ = 1 is always an eigenvalue for any stochastic matrix IP.
Let P (x) be the probability of achieving the goal of L dollars, where x is the initial
fortune. In a homework problem we have shown that
(26) P (x) = pP (x + 1) + qP (x− 1), x = 1, 2, . . . , L− 1,
with boundary conditions P (0) = 0 and P (L) = 1. Similarly, define Q(x) as the
probability of going bankrupt. Then,
(27) Q(x) = pQ(x + 1) + qQ(x− 1), x = 1, 2, . . . , L− 1,
with boundary conditions Q(0) = 1 and Q(L) = 0.
For example, if p = q = 1/2, then (see homework solutions)

(28)  P(x) = x/L,

(29)  Q(x) = 1 − x/L.

Thus, limL→∞ P(x) = 0. (What is the implication of this?)
Exercise 3. Show that

(30)  limL→∞ P(x) = 0

when q > p.
Back to the Gambler's ruin problem: take L = 4, p = 0.6. Then,

(31)  IP = [ 1    0    0    0    0
             0.4  0    0.6  0    0
             0    0.4  0    0.6  0
             0    0    0.4  0    0.6
             0    0    0    0    1 ].

Also, straightforward calculation yields

(32)  IP² = [ 1     0     0     0     0
              0.4   0.24  0     0.36  0
              0.16  0     0.48  0     0.36
              0     0.16  0     0.24  0.6
              0     0     0     0     1 ],

and

(33)  limn→∞ IP^n = [ 1       0  0  0  0
                      0.5846  0  0  0  0.4154
                      0.3077  0  0  0  0.6923
                      0.1231  0  0  0  0.8769
                      0       0  0  0  1 ].
(This can be obtained, for example, using the Cayley-Hamilton Theorem.) Note that
the first column of limn→∞ IP^n is

(34)  Q = (Q(0), . . . , Q(4))ᵀ,

the vector of bankruptcy probabilities, and the last column is

(35)  P = (P(0), . . . , P(4))ᵀ,

the vector of probabilities of reaching the goal L = 4.
12.2. Transience and Recurrence. Suppose that X0 = i and define

Eij ≜ ∪_{n=1}^∞ {Xn = j}

as the event that the chain eventually visits state j given that X0 = i. A state i
is said to be recurrent if Pi(Eii) = 1; otherwise, it is said to be transient.

Note that we can write Eij as the disjoint union

Eij = ∪_{k=1}^∞ {Xk = j, Xk−1 ≠ j, . . . , X1 ≠ j}.

In the above, we are saying that the chain can eventually visit state j precisely in the
following mutually exclusive ways: visit j for the first time in one step, or visit j for
the first time in two steps, and so on. If we define

f^(n)_ij ≜ Pi{X1 ≠ j, . . . , Xn−1 ≠ j, Xn = j}

as the probability of a first visit to state j at time n given that X0 = i, then the
quantity

fij ≜ Σ_{n=1}^∞ f^(n)_ij

is precisely Pi(Eij). Hence, a state i is recurrent precisely when fii = 1 and it is
transient when fii < 1.
12.2.1. Visiting a state infinitely often. Suppose that X0 = i and consider the event
Ajk defined as the event that the chain visits state j at least k times. Note that if
Ajk occurs, then the chain must have made a visit from i to j and then revisited j at
least k − 1 times. Realizing that the Pi probability that the chain ever visits state j
from state i is fij and the probability that the chain, starting from j, revisits j at
least k − 1 times is f_jj^{k−1}, we conclude that Pi(Ajk) ≤ fij f_jj^{k−1}. On the other
hand, if the chain happens to visit j from i and also revisit j at least k − 1 times,
then we would know that the event Ajk has occurred. Hence, Pi(Ajk) = fij f_jj^{k−1}.

Note that Ajk is a decreasing sequence of events in k. Consequently,

Pi(∩k Ajk) = limk→∞ Pi(Ajk) = { 0,    if fjj < 1
                               { fij,  if fjj = 1.

However, ∩k Ajk is precisely the event {Xn = j i.o.}. Hence, we arrive at the conclu-
sion

(36)  Pi{Xn = j i.o.} = { 0,    if fjj < 1
                         { fij,  if fjj = 1.

Taking i = j yields

(37)  Pi{Xn = i i.o.} = { 0,  if fii < 1
                         { 1,  if fii = 1.

Thus, Pi{Xn = i i.o.} is either 0 or 1!

We have therefore proved the following result.

Theorem 14. Recurrence of i is equivalent to Pi{Xn = i i.o.} = 1 and transience of
i is equivalent to Pi{Xn = i i.o.} = 0.
The next theorem further characterizes recurrence and transience.
Theorem 15 (Adopted from Billingsly). Recurrence of i is equivalent to Σn p^(n)_ii =
∞; transience of i is equivalent to Σn p^(n)_ii < ∞.

Proof.

Since p^(n)_ii = Pi{Xn = i}, by the Borel-Cantelli lemma Σn p^(n)_ii < ∞ implies Pi{Xn =
i i.o.} = 0. According to (37), this implies fii < 1. Hence, we have shown that
Σn p^(n)_ii < ∞ implies transience. Consequently, recurrence implies that Σn p^(n)_ii = ∞.
We next prove that fii < 1 (or transience) implies Σn p^(n)_ii < ∞, which will, in turn,
imply that Σn p^(n)_ii = ∞ implies recurrence, and the entire theorem will therefore be
proved.

Observe that the chain visits j from i in n steps precisely in the following mutually
exclusive ways: visit j from i for the first time in n steps, visit j from i for the first time
in n − 1 steps and then revisit j in one step, visit j from i for the first time in n − 2
steps and then revisit j in two steps, and so on. More precisely, we write

p^(n)_ij = Pi{Xn = j} = Σ_{s=0}^{n−1} Pi{X1 ≠ j, . . . , Xn−s−1 ≠ j, Xn−s = j, Xn = j}

 = Σ_{s=0}^{n−1} Pi{X1 ≠ j, . . . , Xn−s−1 ≠ j, Xn−s = j} Pj{Xs = j}

 = Σ_{s=0}^{n−1} f^(n−s)_ij p^(s)_jj.
Putting j = i and summing over n = 1 to n = ℓ, we obtain

Σ_{n=1}^ℓ p^(n)_ii = Σ_{n=1}^ℓ Σ_{s=0}^{n−1} f^(n−s)_ii p^(s)_ii
 = Σ_{s=0}^{ℓ−1} p^(s)_ii Σ_{n=s+1}^ℓ f^(n−s)_ii ≤ fii Σ_{s=0}^{ℓ−1} p^(s)_ii.

The last inequality comes from the fact that Σ_{n=s+1}^ℓ f^(n−s)_ii = Σ_{u=1}^{ℓ−s} f^(u)_ii ≤ fii. Re-
alizing that p^(0)_ii = 1, we conclude (1 − fii) Σ_{n=1}^ℓ p^(n)_ii ≤ fii. Hence, if fii < 1, then
Σ_{n=1}^ℓ p^(n)_ii ≤ fii/(1 − fii), and the series Σ_{n=1}^∞ p^(n)_ii will therefore be convergent since
the partial sums are increasing and bounded. 2
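Theorem 15 can be illustrated on an example of our own (not the notes'): for the symmetric random walk on the integers, p^(2n)_00 = C(2n, n)/4^n ~ 1/√(πn), so the partial sums of p^(n)_00 diverge, consistent with recurrence of state 0.

```python
from math import pi, sqrt

# p^(2n)_00 = C(2n, n)/4^n via the stable recursion p_n = p_{n-1}*(2n-1)/(2n),
# with p_0 = 1; the partial sums behave like 2*sqrt(N/pi) and keep growing.
checkpoints = {10: None, 1000: None, 100_000: None}
p, total = 1.0, 0.0
for n in range(1, 100_001):
    p *= (2 * n - 1) / (2 * n)     # p = p^(2n)_00
    total += p
    if n in checkpoints:
        checkpoints[n] = total
print(checkpoints)   # unbounded growth => sum_n p^(n)_00 = infinity: recurrent
```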
Exercise 4. Show that in the gambler’s problem discussed earlier, states 0 and L are
recurrent, but states 1, . . . , L− 1 are transient.
12.3. Irreducible chains. A MC is said to be irreducible if for every i, j ∈ S, there
exists an n (generally dependent upon i and j) such that p^(n)_ij > 0. In words,
if the chain is irreducible then we can always go from any state to any other state in
finite time with a nonzero probability. The next theorem shows that for irreducible
chains, either all states are recurrent or all states are transient; there is nothing
in between. Thus, we can say that an irreducible MC is either recurrent or
transient.

Theorem 16 (Adopted from Billingsly). If the Markov chain is irreducible, then one
of the following two alternatives holds.

(i) All states are recurrent, Pi(∩j{Xn = j i.o.}) = 1 for all i, and Σn p^(n)_ij = ∞
for all i and j.
(ii) All states are transient, Pi(∪j{Xn = j i.o.}) = 0 for all i, and Σn p^(n)_ij < ∞
for all i and j.
Proof.
By irreducibility, for each i and j there exist integers r and s so that p(r)ij > 0 and
p(s)ji > 0. Note that if a chain with X0 = i visits j from i in r steps, then visits j from
j in n steps, and then visits i from j in s steps, then it has visited i from i in r+n+s
steps. Therefore,
p(r+s+n)ii ≥ p
(r)ij p
(n)jj p
(s)ji .
Now if we sum up both sides of the above inequality over n and use the fact that
p(r)ij p
(s)ji > 0, we conclude that
∑n p
(n)ii < ∞ implies
∑n p
(n)jj < ∞ (why do we need
the fact that p(r)ij p
(s)ji > 0 to come up with this conclusion?) In particular, if one
state is transient, then all other states are also transient (i.e., fjj < 1 for all j).
In this case, (36) tells us that PiXn = j i.o. = 0 for all i and j. Therefore,
Pi(∪jXn = j i.o.) ≤ ∑j PiXn = j i.o. = 0 for all i.
Next, note that∑∞
n=1 p(n)ij =
∑∞n=1
∑nv=1 f
(v)ij p
(n−v)jj =
∑∞v=1 f
(v)ij
∑∞m=0 p
(m)jj ≤ ∑∞
m=0 p(m)jj ,
where the inequality follows from the fact that∑∞
v=1 f(v)ij = fij ≤ 1. Hence, if j is
transient (in which case∑
n p(n)jj < ∞ according to Theorem 15), then p
(n)ij < ∞ for
all i. We have therefore established the first alternative of the theorem.
If any state is not transient, then the only possibility is that all states are recurrent.
In this case $P_j\{X_n = j \text{ i.o.}\} = 1$ by Theorem 15, and $\{X_n = j \text{ i.o.}\}$ is an almost-certain
event for all $j$ with respect to $P_j$; namely, $P_j(A \cap \{X_n = j \text{ i.o.}\}) = P_j(A)$ for any
$A \in \mathcal{F}$. Hence, we can write
$$p_{ji}^{(m)} = P_j(\{X_m = i\} \cap \{X_n = j \text{ i.o.}\}) \le \sum_{n>m} P_j\{X_m = i,\, X_{m+1} \ne j, \ldots, X_{n-1} \ne j,\, X_n = j\} = \sum_{n>m} p_{ji}^{(m)} f_{ij}^{(n-m)} = p_{ji}^{(m)} f_{ij}.$$
By irreducibility, there is an $m_0$ for which $p_{ji}^{(m_0)} > 0$; hence, $f_{ij} = 1$. Now by (37),
$P_i\{X_n = j \text{ i.o.}\} = 1$. The only item left to prove is that $\sum_n p_{ij}^{(n)} = \infty$ under the first
alternative. Note that if $\sum_n p_{ij}^{(n)} < \infty$ for some $i$ and $j$, then the Borel–Cantelli lemma
would dictate that $P_i\{X_n = j \text{ i.o.}\} = 0$, and the chain would not be recurrent, which
is a contradiction. Hence, under the first alternative, $\sum_n p_{ij}^{(n)} = \infty$ for all $i$ and $j$.
□
Special Case: Suppose that $\sum_n p_{ij}^{(n)} < \infty$ and the state space $S$ is a finite set. Then
$$\sum_{j \in S} \sum_n p_{ij}^{(n)} = \sum_n \sum_{j \in S} p_{ij}^{(n)} < \infty.$$
But since $\sum_{j \in S} p_{ij}^{(n)} = 1$, we obtain $\sum_{j \in S} \sum_n p_{ij}^{(n)} = \sum_n 1 = \infty$, which
is a contradiction. Hence, in a finite irreducible MC, alternative (ii) is impossible;
that is, every finite state-space irreducible MC is recurrent. Consequently, if the tran-
sition matrix of a finite state-space MC has all positive entries (making the chain
irreducible), then the chain is recurrent.
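To see the divergence $\sum_n p_{ij}^{(n)} = \infty$ numerically, here is a minimal sketch (the 3-state transition matrix is a hypothetical example, not from the notes): it accumulates the partial sums $\sum_{n=1}^{N} p_{ij}^{(n)}$ by repeated matrix multiplication and shows that every entry keeps growing with $N$.

```python
def mat_mult(A, B):
    """Multiply two square matrices given as lists of lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def partial_sums(P, N):
    """Return S with S[i][j] = sum_{n=1}^{N} p_ij^(n)."""
    n = len(P)
    S = [[0.0] * n for _ in range(n)]
    Pn = [row[:] for row in P]  # Pn holds the n-step transition matrix
    for _ in range(N):
        for i in range(n):
            for j in range(n):
                S[i][j] += Pn[i][j]
        Pn = mat_mult(Pn, P)
    return S

# A hypothetical 3-state transition matrix with all positive entries,
# hence irreducible and (being finite) recurrent.
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.3, 0.3, 0.4]]

S100 = partial_sums(P, 100)
S200 = partial_sums(P, 200)
# Each partial sum grows without bound (roughly linearly in N, at rate
# pi_j > 0), consistent with sum_n p_ij^(n) = infinity for all i and j.
```

Doubling $N$ adds roughly $100\,\pi_j$ to the $(i,j)$ entry, where $\pi_j$ is the stationary probability of $j$; no entry ever levels off, in contrast with the transient case where the sums converge.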
12.4. Birth and death Markov chains. Most of the material here is extracted
from the book by Hoel et al. Birth and death chains are common examples of Markov
chains; they include random walks. Consider a Markov chain on the set of nonnegative
integers $\{0, \ldots, d\}$, which may be finite or infinite (in the infinite case we take $d = \infty$).
The transition function has the form
$$p_{ij} = \begin{cases} q_i, & j = i - 1, \\ r_i, & j = i, \\ p_i, & j = i + 1, \\ 0, & \text{otherwise}. \end{cases}$$
Note that $p_i + q_i + r_i = 1$, $q_0 = 0$, and $p_d = 0$ if $d < \infty$. We also assume that $p_i$ and
$q_i$ are positive for $0 < i < d$.
For $a$ and $b$ in $J$ such that $a < b$, set
$$u(x) = P_x(T_a < T_b), \quad a < x < b,$$
and set $u(a) = 1$ and $u(b) = 0$. If the birth and death chain starts at $y$, then in one
step it goes to $y - 1$, $y$, or $y + 1$ with respective probabilities $q_y$, $r_y$, and $p_y$. It follows
that
$$u(y) = q_y u(y-1) + r_y u(y) + p_y u(y+1), \quad a < y < b.$$
Since $r_y = 1 - p_y - q_y$, we can rewrite this as
$$u(y+1) - u(y) = \frac{q_y}{p_y}\,\big(u(y) - u(y-1)\big), \quad a < y < b.$$
Set $\gamma_0 = 1$ and
$$\gamma_y = \frac{q_1 \cdots q_y}{p_1 \cdots p_y}, \quad 0 < y < d.$$
From this we see that
$$u(y+1) - u(y) = \frac{\gamma_{a+1}}{\gamma_a} \cdots \frac{\gamma_y}{\gamma_{y-1}}\,\big(u(a+1) - u(a)\big) = \frac{\gamma_y}{\gamma_a}\,\big(u(a+1) - u(a)\big).$$
Consequently,
$$u(y) - u(y+1) = \frac{\gamma_y}{\gamma_a}\,\big(u(a) - u(a+1)\big), \quad a \le y < b.$$
Summing this equation over $y = a, \ldots, b-1$ and recalling that $u(a) = 1$ and $u(b) = 0$
(the left-hand sides telescope to $u(a) - u(b) = 1$), we conclude that
$$\frac{u(a) - u(a+1)}{\gamma_a} = \frac{1}{\sum_{y=a}^{b-1} \gamma_y}.$$
Thus the preceding equation becomes
$$u(y) - u(y+1) = \frac{\gamma_y}{\sum_{z=a}^{b-1} \gamma_z}, \quad a \le y < b.$$
Summing this equation over $y = x, \ldots, b-1$ and again using $u(b) = 0$, we obtain
$$u(x) = \frac{\sum_{y=x}^{b-1} \gamma_y}{\sum_{y=a}^{b-1} \gamma_y}, \quad a < x < b.$$
It now follows from the definition of $u(x)$ that
$$P_x(T_a < T_b) = \frac{\sum_{y=x}^{b-1} \gamma_y}{\sum_{y=a}^{b-1} \gamma_y}, \quad a \le x \le b. \qquad (59)$$
By subtracting both sides of this equation from 1, we see that
$$P_x(T_b < T_a) = \frac{\sum_{y=a}^{x-1} \gamma_y}{\sum_{y=a}^{b-1} \gamma_y}, \quad a \le x \le b. \qquad (60)$$
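As a sanity check on this formula, here is a minimal Monte Carlo sketch (not from the notes). It assumes a hypothetical chain with constant $p_y = q_y = 0.4$ and $r_y = 0.2$, for which $\gamma_y = 1$, and compares $P_2(T_0 < T_5)$ computed from the formula with an empirical frequency.

```python
import random

def gamma(y, p, q):
    """gamma_y = (q_1 ... q_y) / (p_1 ... p_y), with gamma_0 = 1."""
    g = 1.0
    for k in range(1, y + 1):
        g *= q(k) / p(k)
    return g

def hit_prob(x, a, b, p, q):
    """P_x(T_a < T_b) = sum_{y=x}^{b-1} gamma_y / sum_{y=a}^{b-1} gamma_y."""
    num = sum(gamma(y, p, q) for y in range(x, b))
    den = sum(gamma(y, p, q) for y in range(a, b))
    return num / den

def simulate(x, a, b, p, q, trials, rng):
    """Empirical frequency of the event {T_a < T_b} starting from x."""
    hits = 0
    for _ in range(trials):
        state = x
        while state not in (a, b):
            u = rng.random()
            if u < p(state):
                state += 1
            elif u < p(state) + q(state):
                state -= 1
            # otherwise stay put (probability r_state)
        if state == a:
            hits += 1
    return hits / trials

# Hypothetical chain: p_y = 0.4 (up), q_y = 0.4 (down), r_y = 0.2 (stay).
p = lambda y: 0.4
q = lambda y: 0.4
rng = random.Random(0)
theory = hit_prob(2, 0, 5, p, q)          # gamma_y = 1, so (5 - 2)/5 = 0.6
empirical = simulate(2, 0, 5, p, q, 20000, rng)
```

Since $\gamma_y \equiv 1$ here, the formula reduces to $(b - x)/(b - a)$, the classical gambler's-ruin value; the simulated frequency agrees to within Monte Carlo error.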
In the remainder of this section we consider a birth and death chain on the nonnegative
integers which is irreducible, i.e., such that $p_x > 0$ for $x \ge 0$ and $q_x > 0$ for $x \ge 1$.
We will determine when such a chain is recurrent and when it is transient.
As a special case of (59),
$$P_1(T_0 < T_n) = 1 - \frac{1}{\sum_{y=0}^{n-1} \gamma_y}, \quad n > 1. \qquad (61)$$
Consider now a birth and death chain starting in state 1. Since the birth and death
chain can move at most one step to the right at a time (considering the transition
from state to state as movement along the real number line),
$$1 \le T_2 < T_3 < \cdots. \qquad (62)$$
It follows from (62) that $\{T_0 < T_n\}$, $n > 1$, forms a nondecreasing sequence of
events. We conclude from Theorem 1 of Chapter 1 of Volume I that
$$\lim_{n \to \infty} P_1(T_0 < T_n) = P_1(T_0 < T_n \text{ for some } n > 1). \qquad (63)$$
Equation (62) implies that $T_n \ge n$ and thus $T_n \to \infty$ as $n \to \infty$; hence the event
$\{T_0 < T_n \text{ for some } n > 1\}$ occurs if and only if the event $\{T_0 < \infty\}$ occurs. We can
therefore rewrite (63) as
$$\lim_{n \to \infty} P_1(T_0 < T_n) = P_1(T_0 < \infty). \qquad (64)$$
It follows from (61) and (64) that
$$P_1(T_0 < \infty) = 1 - \frac{1}{\sum_{y=0}^{\infty} \gamma_y}. \qquad (65)$$
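The formula above for $P_1(T_0 < \infty)$ can also be checked by simulation in a transient case. A hedged sketch, assuming a hypothetical chain with constant $p_y = 0.6$, $q_y = 0.4$, $r_y = 0$: then $\gamma_y = (2/3)^y$, $\sum_y \gamma_y = 3$, and the formula gives $P_1(T_0 < \infty) = 1 - 1/3 = 2/3$.

```python
import random

# Hypothetical transient chain: p_y = 0.6 (up), q_y = 0.4 (down), r_y = 0.
# Then gamma_y = (q/p)**y = (2/3)**y and sum_y gamma_y = 3, so the formula
# predicts P_1(T_0 < infinity) = 1 - 1/3 = 2/3.
p, q = 0.6, 0.4
theory = 1.0 - 1.0 / sum((q / p) ** y for y in range(400))  # partial sum ~ 3

def returns_to_zero(rng, cap=100):
    """Start at 1; report whether the walk hits 0 before drifting past cap.

    Once the walk reaches cap, the chance of ever returning to 0 is of
    order (q/p)**cap, which is negligible, so we treat reaching cap as
    escape to infinity."""
    state = 1
    while 0 < state < cap:
        state += 1 if rng.random() < p else -1
    return state == 0

rng = random.Random(1)
trials = 10000
empirical = sum(returns_to_zero(rng) for _ in range(trials)) / trials
```

The empirical return frequency settles near $2/3$, matching the formula; a recurrent choice ($p_y = q_y$) would instead give return frequency 1.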
We are now in a position to show that the birth and death chain is recurrent if and
only if
$$\sum_{y=0}^{\infty} \gamma_y = \infty. \qquad (66)$$
If the birth and death chain is recurrent, then $P_1(T_0 < \infty) = 1$ and (66) follows
from (65). To obtain the converse, we observe that $P(0, y) = 0$ for $y \ge 2$ and hence
$$P_0(T_0 < \infty) = P(0,0) + P(0,1)\,P_1(T_0 < \infty). \qquad (67)$$
Suppose (66) holds. Then by (65), $P_1(T_0 < \infty) = 1$. From this and (67) we conclude
that
$$P_0(T_0 < \infty) = P(0,0) + P(0,1) = 1.$$
Thus 0 is a recurrent state, and since the chain is assumed to be irreducible, it
must be a recurrent chain.
In summary, we have shown that an irreducible birth and death chain on $\{0, 1, 2, \ldots\}$
is recurrent if and only if
$$\sum_{x=1}^{\infty} \frac{q_1 \cdots q_x}{p_1 \cdots p_x} = \infty.$$
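This criterion is easy to evaluate in the constant-rate case, where $\gamma_x = (q/p)^x$. A small sketch (the rates below are hypothetical examples, not from the notes): the partial sums of $\gamma$ diverge for the symmetric walk and converge for an upward-biased one.

```python
def gamma_partial_sum(ratio, N):
    """Partial sum of gamma_y = (q/p)**y for a chain with constant rates."""
    return sum(ratio ** y for y in range(N))

# Symmetric walk (p = q): gamma_y = 1, so the partial sums grow like N.
# The series diverges, and the chain is recurrent.
sym = gamma_partial_sum(1.0, 10000)

# Upward-biased walk (p = 0.6, q = 0.4): gamma_y = (2/3)**y, a geometric
# series converging to 1/(1 - 2/3) = 3.  The chain is transient.
biased = gamma_partial_sum(2.0 / 3.0, 200)
```

The dichotomy mirrors the theorem exactly: divergence of $\sum_x \gamma_x$ (symmetric case) means recurrence, while a finite sum (biased case) means the chain escapes to infinity with positive probability.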
Consider a birth and death chain on $\{0, 1, \ldots, d\}$ or on the nonnegative integers; in
the latter case we set $d = \infty$. We assume without further mention that the chain is
irreducible, i.e., that
$$p_x > 0 \text{ for } 0 \le x < d \quad \text{and} \quad q_x > 0 \text{ for } 0 < x \le d$$
if $d$ is finite, and that
$$p_x > 0 \text{ for } 0 \le x < \infty \quad \text{and} \quad q_x > 0 \text{ for } 0 < x < \infty$$
if $d$ is infinite.
Suppose $d$ is infinite. The system of equations
$$\sum_x \pi(x) P(x, y) = \pi(y), \quad y \in J, \qquad (1)$$
becomes
$$\pi(0) r_0 + \pi(1) q_1 = \pi(0),$$
$$\pi(y-1) p_{y-1} + \pi(y) r_y + \pi(y+1) q_{y+1} = \pi(y), \quad y \ge 1.$$
Since $p_y + q_y + r_y = 1$, these equations reduce to
$$q_1 \pi(1) - p_0 \pi(0) = 0,$$
$$q_{y+1} \pi(y+1) - p_y \pi(y) = q_y \pi(y) - p_{y-1} \pi(y-1), \quad y \ge 1. \qquad (7)$$
It follows easily from (7) and induction that
$$q_{y+1} \pi(y+1) - p_y \pi(y) = 0, \quad y \ge 0,$$
and hence that
$$\pi(y+1) = \frac{p_y}{q_{y+1}}\,\pi(y), \quad y \ge 0.$$
Consequently,
$$\pi(x) = \frac{p_0 \cdots p_{x-1}}{q_1 \cdots q_x}\,\pi(0), \quad x \ge 1. \qquad (8)$$
Set
$$\pi_x = \begin{cases} 1, & x = 0, \\[4pt] \dfrac{p_0 \cdots p_{x-1}}{q_1 \cdots q_x}, & x \ge 1. \end{cases} \qquad (9)$$
Then (8) can be written as
$$\pi(x) = \pi_x\, \pi(0), \quad x \ge 0. \qquad (10)$$
Conversely, any solution of the form (10) satisfies (1).
Suppose now that $\sum_x \pi(x) < \infty$ or, equivalently, that
$$\sum_{x=1}^{\infty} \frac{p_0 \cdots p_{x-1}}{q_1 \cdots q_x} < \infty. \qquad (11)$$
We conclude from (10) that the birth and death chain has a unique stationary
distribution, given by
$$\pi(x) = \frac{\pi_x}{\sum_{y=0}^{\infty} \pi_y}, \quad x \ge 0. \qquad (12)$$
Suppose instead that (11) fails to hold, i.e., that
$$\sum_x \pi_x = \infty. \qquad (13)$$
We conclude from (10) and (13) that any solution of (1) is either identically zero or
has infinite sum, and hence that there is no stationary distribution.
In summary, we see that the chain has a stationary distribution if and only if (11)
holds, and that the stationary distribution, when it exists, is given by (9) and (12).
Suppose now that $d < \infty$. By essentially the same arguments used to obtain (12),
we conclude that the unique stationary distribution is given by
$$\pi(x) = \frac{\pi_x}{\sum_{y=0}^{d} \pi_y}, \quad 0 \le x \le d,$$
where $\pi_x$, $0 \le x \le d$, is given by (9).
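As a numerical check of the finite-$d$ case, the sketch below (with hypothetical constant rates $p = 0.3$ and $q = 0.6$ on $\{0, \ldots, 5\}$, so that $\pi_x = (p/q)^x = 2^{-x}$) builds the normalized distribution and verifies directly that $\sum_x \pi(x) P(x, y) = \pi(y)$ for every $y$.

```python
# Hypothetical finite birth and death chain on {0, ..., d} with constant
# birth probability p = 0.3 and death probability q = 0.6 (r_x absorbs
# the remainder of each row).
d = 5
p, q = 0.3, 0.6

def trans(x, y):
    """Transition function P(x, y) of the birth and death chain."""
    if y == x + 1 and x < d:
        return p
    if y == x - 1 and x > 0:
        return q
    if y == x:
        # r_x = 1 - (birth prob) - (death prob), with q_0 = 0 and p_d = 0
        return 1.0 - (p if x < d else 0.0) - (q if x > 0 else 0.0)
    return 0.0

# pi_x = (p_0 ... p_{x-1}) / (q_1 ... q_x) = (p/q)**x for constant rates.
weights = [(p / q) ** x for x in range(d + 1)]
total = sum(weights)
pi = [w / total for w in weights]

# Check stationarity: sum_x pi(x) P(x, y) = pi(y) for every y.
resid = max(abs(sum(pi[x] * trans(x, y) for x in range(d + 1)) - pi[y])
            for y in range(d + 1))
```

The residual is at machine precision, reflecting the detailed-balance relation $p\,\pi(x) = q\,\pi(x+1)$ that the product formula (9) encodes.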
References

1. Y. S. Chow and H. Teicher, Probability Theory: Independence, Interchangeability, Martingales, 3rd ed. New York: Springer (Springer Texts in Statistics), 1997.

2. W. Rudin, Real and Complex Analysis. New York: McGraw-Hill, 1987.

3. T. G. Kurtz, Stochastic Analysis. Class notes at the University of Wisconsin-Madison.

4. A. Beck, Integration and Measure. Class notes at the University of Wisconsin-Madison.

5. P. G. Hoel, S. C. Port, and C. J. Stone, Introduction to Stochastic Processes. Waveland Press, Inc., 1987.

6. P. Billingsley, Probability and Measure, 3rd ed. New York: Wiley, 1995.