Appendix for Lecture 2: Monte Carlo Methods (Basics)

Dahua Lin

1 Justification of Basic Sampling Methods

Proposition 1. Let F be the cdf of a real-valued random variable with distribution D, and let U ∼ Uniform([0, 1]). Then F^{-1}(U) ∼ D.

Proof. Let X = F^{-1}(U). It suffices to show that the cdf of X is F. For any t ∈ R,

P(X ≤ t) = P(F^{-1}(U) ≤ t) = P(U ≤ F(t)) = F(t). (1)

Here, we use the fact that F is non-decreasing and right-continuous, so that F^{-1}(u) ≤ t if and only if u ≤ F(t).
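As a concrete illustration, the following Python sketch (assuming NumPy; the exponential target and function name are illustrative choices, not from the lecture) samples Exp(rate) by applying the inverse cdf to uniform draws:

```python
import numpy as np

# A minimal sketch of inverse-transform sampling (Proposition 1).
# The exponential target is an illustrative assumption: its cdf
# F(x) = 1 - exp(-rate * x) has the closed-form inverse
# F^{-1}(u) = -log(1 - u) / rate.

def sample_exponential(rate, size, seed=0):
    rng = np.random.default_rng(seed)
    u = rng.uniform(0.0, 1.0, size)    # U ~ Uniform([0, 1])
    return -np.log(1.0 - u) / rate     # F^{-1}(U) has the Exp(rate) law

samples = sample_exponential(rate=2.0, size=100_000)
print(samples.mean())  # close to the true mean 1 / rate = 0.5
```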

Proposition 2. Samples produced using rejection sampling have the desired distribution.

Proof. Each iteration actually generates two random variables: x and u, where u ∈ {0, 1} is the indicator of acceptance. The joint distribution of x and u is given by

p(dx, u = 1) = a(u = 1 | x) q(dx) = p(x)/(M q(x)) · q(x) µ(dx) = (p(x)/M) µ(dx). (2)

Here, a(u | x) is the conditional distribution of u given x, and µ is the base measure. On the other hand, we have

Pr(u = 1) = ∫ p(dx, u = 1) = ∫ (p(x)/M) µ(dx) = 1/M. (3)

Thus, the resultant distribution is

p(dx | u = 1) = p(dx, u = 1) / Pr(u = 1) = p(x) µ(dx). (4)

This completes the proof.
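For concreteness, here is a minimal rejection sampler in Python; the Beta(2, 2) target, uniform proposal, and envelope constant M are illustrative assumptions for the example:

```python
import numpy as np

# A minimal sketch of rejection sampling (Proposition 2). Assumptions:
# the target is the Beta(2, 2) density p(x) = 6 x (1 - x) on [0, 1], the
# proposal q is Uniform([0, 1]) (so q(x) = 1), and M = 1.5 bounds
# p(x) / q(x), since p attains its maximum 1.5 at x = 0.5.

def rejection_sample(n, seed=0):
    rng = np.random.default_rng(seed)
    M = 1.5
    samples = []
    while len(samples) < n:
        x = rng.uniform()                  # propose x ~ q
        u = rng.uniform()                  # u decides acceptance
        if u <= 6.0 * x * (1.0 - x) / M:   # accept with prob p(x) / (M q(x))
            samples.append(x)
    return np.array(samples)

xs = rejection_sample(10_000)
print(xs.mean())  # close to the Beta(2, 2) mean 0.5
```

Consistent with Eq. (3), the empirical acceptance rate in this example is about 1/M = 2/3.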

2 Markov Chain Theory

Proposition 3. When the state space Ω is countable, we have

‖µ − ν‖_TV = (1/2) ∑_{x∈Ω} |µ(x) − ν(x)|. (5)

Proof. Let A = {x ∈ Ω : µ(x) ≥ ν(x)}. By definition, we have

‖µ − ν‖_TV ≥ |µ(A) − ν(A)| = µ(A) − ν(A), (6)

‖µ − ν‖_TV ≥ |µ(A^c) − ν(A^c)| = ν(A^c) − µ(A^c). (7)

We also have

µ(A) − ν(A) = ∑_{x∈A} (µ(x) − ν(x)) = ∑_{x∈A} |µ(x) − ν(x)|, (8)

ν(A^c) − µ(A^c) = ∑_{x∈A^c} (ν(x) − µ(x)) = ∑_{x∈A^c} |µ(x) − ν(x)|. (9)


Combining the equations above results in

‖µ − ν‖_TV ≥ (1/2) (µ(A) − ν(A) + ν(A^c) − µ(A^c))
= (1/2) (∑_{x∈A} |µ(x) − ν(x)| + ∑_{x∈A^c} |µ(x) − ν(x)|)
= (1/2) ∑_{x∈Ω} |µ(x) − ν(x)|. (10)

Next, we show the inequality in the other direction. For any A ⊂ Ω, we have

|µ(A^c) − ν(A^c)| = |(µ(Ω) − µ(A)) − (ν(Ω) − ν(A))| = |µ(A) − ν(A)|. (11)

Hence,

|µ(A) − ν(A)| = (1/2) (|µ(A) − ν(A)| + |µ(A^c) − ν(A^c)|)
≤ (1/2) (∑_{x∈A} |µ(x) − ν(x)| + ∑_{x∈A^c} |µ(x) − ν(x)|)
≤ (1/2) ∑_{x∈Ω} |µ(x) − ν(x)|. (12)

As A is arbitrary, we can conclude that

‖µ − ν‖_TV = sup_A |µ(A) − ν(A)| ≤ (1/2) ∑_{x∈Ω} |µ(x) − ν(x)|. (13)

This completes the proof.
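Proposition 3 is easy to check numerically. The sketch below (an assumed 3-point example) brute-forces the supremum over all subsets and compares it with half the L1 distance:

```python
import numpy as np
from itertools import product

# A small numerical check of Proposition 3 on an illustrative 3-point space:
# half the L1 distance between the pmfs equals the supremum of
# |mu(A) - nu(A)| over all 2^3 subsets A.

mu = np.array([0.5, 0.3, 0.2])
nu = np.array([0.2, 0.4, 0.4])

half_l1 = 0.5 * np.abs(mu - nu).sum()

# Enumerate every subset of {0, 1, 2} via boolean masks.
sup_A = max(
    abs(sum(mu[i] - nu[i] for i in range(3) if mask[i]))
    for mask in product([False, True], repeat=3)
)

print(half_l1, sup_A)  # both print 0.3
```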

Proposition 4. The total variation distance (µ, ν) ↦ ‖µ − ν‖_TV is a metric.

Proof. To show that it is a metric, we verify the four properties that a metric needs to satisfy, one by one.

1. ‖µ − ν‖_TV is non-negative, as |µ(A) − ν(A)| is always non-negative.

2. When µ = ν, |µ(A) − ν(A)| is always zero, and hence ‖µ − ν‖_TV = 0. On the other hand, when µ ≠ ν, there exists A ∈ S such that |µ(A) − ν(A)| > 0, and therefore ‖µ − ν‖_TV ≥ |µ(A) − ν(A)| > 0. Together we can conclude that ‖µ − ν‖_TV = 0 iff µ = ν.

3. ‖µ − ν‖_TV = ‖ν − µ‖_TV, as |µ(A) − ν(A)| = |ν(A) − µ(A)| holds for any measurable subset A.

4. Next, we show that the total variation distance satisfies the triangle inequality, as below. Let µ, ν, η be three probability measures over Ω:

‖µ − ν‖_TV = sup_{A∈S} |µ(A) − ν(A)|
= sup_{A∈S} |µ(A) − η(A) + η(A) − ν(A)|
≤ sup_{A∈S} (|µ(A) − η(A)| + |η(A) − ν(A)|)
≤ sup_{A∈S} |µ(A) − η(A)| + sup_{A∈S} |η(A) − ν(A)|
= ‖µ − η‖_TV + ‖η − ν‖_TV. (14)


The proof is completed.

Proposition 5. Consider a Markov chain over a countable space Ω with transition probability matrix P. Let π be a probability measure over Ω that is in detailed balance with P, i.e. π(x)P(x, y) = π(y)P(y, x) for all x, y ∈ Ω. Then π is invariant to P, i.e. π = πP.

Proof. With the assumption of detailed balance, we have

(πP)(y) = ∑_{x∈Ω} π(x)P(x, y) = ∑_{x∈Ω} π(y)P(y, x) = π(y) ∑_{x∈Ω} P(y, x) = π(y). (15)

Hence, π = πP, or in other words, π is invariant to P.
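For a concrete check, the following sketch (an assumed 3-state toy example, using NumPy) builds a kernel in detailed balance with π via a Metropolis-type rule and verifies invariance:

```python
import numpy as np

# A numerical illustration of Proposition 5: construct P in detailed
# balance with pi (uniform symmetric proposal, Metropolis-type acceptance)
# and verify pi P = pi. The 3-state pi is an illustrative assumption.

pi = np.array([0.2, 0.3, 0.5])
n = len(pi)

P = np.zeros((n, n))
for x in range(n):
    for y in range(n):
        if x != y:
            P[x, y] = (1.0 / n) * min(1.0, pi[y] / pi[x])
    P[x, x] = 1.0 - P[x].sum()   # remaining mass stays at x

# Detailed balance: pi(x) P[x, y] == pi(y) P[y, x] for all x, y.
flows = pi[:, None] * P
print(np.allclose(flows, flows.T))   # True
print(np.allclose(pi @ P, pi))       # True: pi is invariant to P
```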

Proposition 6. Let (X_t) be an ergodic Markov chain Markov(π, P) where π is in detailed balance with P. Then, given an arbitrary sequence x_0, …, x_n ∈ Ω, we have

Pr(X_0 = x_0, …, X_n = x_n) = Pr(X_0 = x_n, …, X_n = x_0). (16)

Proof. First, we have

Pr(X_0 = x_0, …, X_n = x_n) = π(x_0)P(x_0, x_1) ⋯ P(x_{n−1}, x_n). (17)

On the other hand, by detailed balance, we have P(x, y) = π(y)P(y, x)/π(x), and thus

Pr(X_0 = x_n, …, X_n = x_0) = π(x_n)P(x_n, x_{n−1}) ⋯ P(x_1, x_0)
= π(x_n) · [π(x_{n−1})P(x_{n−1}, x_n)/π(x_n)] ⋯ [π(x_0)P(x_0, x_1)/π(x_1)]
= π(x_0)P(x_0, x_1) ⋯ P(x_{n−1}, x_n). (18)

Comparing Eq.(17) and Eq.(18) results in the equality that we intend to prove.
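This time-reversal property is easy to verify numerically; below is a tiny check on an assumed 2-state chain in detailed balance with its initial distribution:

```python
import numpy as np

# A tiny check of Proposition 6 on an assumed 2-state chain: pi and P
# satisfy detailed balance (0.25 * 0.6 == 0.75 * 0.2), so any path and its
# reversal, both started from pi, have equal probability.

pi = np.array([0.25, 0.75])
P = np.array([[0.4, 0.6],
              [0.2, 0.8]])

def path_prob(path):
    p = pi[path[0]]
    for a, b in zip(path, path[1:]):
        p *= P[a, b]
    return p

path = [0, 1, 1, 0, 1]
print(np.isclose(path_prob(path), path_prob(path[::-1])))  # True
```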

Proposition 7. Over a measurable space (Ω, S), if a stochastic kernel P is reversible w.r.t. π, then π is invariant to P.

Proof. Let π′ = πP; it suffices to show that π′(A) = π(A) for every A ∈ S under the reversibility assumption. Given any A ∈ S, let f_A(x, y) := 1(y ∈ A). Then we have

π′(A) = ∫ π(dx) P(x, A)
= ∫ π(dx) ∫ f_A(x, y) P(x, dy)
= ∫∫ f_A(x, y) π(dx) P(x, dy)
= ∫∫ f_A(y, x) π(dx) P(x, dy)    [reversibility]
= ∫∫ 1(x ∈ A) π(dx) P(x, dy)
= ∫ 1(x ∈ A) π(dx) ∫ P(x, dy)
= ∫ 1(x ∈ A) π(dx) = π(A). (19)

This completes the proof.


Proposition 8. Given a stochastic kernel P and a probability measure π over (Ω, S). Suppose both P_x and π are absolutely continuous w.r.t. a base measure µ, that is, π(dx) = π(x)µ(dx) and P(x, dy) = P_x(dy) = p_x(y)µ(dy). Then P is reversible w.r.t. π if and only if

π(x)p_x(y) = π(y)p_y(x) a.e. (20)

Proof. First, assuming detailed balance, i.e. π(x)p_x(y) = π(y)p_y(x) a.e., we show reversibility:

∫∫ f(x, y) π(dx) P(x, dy) = ∫∫ f(x, y) π(x)p_x(y) µ(dx)µ(dy)
= ∫∫ f(x, y) π(y)p_y(x) µ(dx)µ(dy)    [detailed balance]
= ∫∫ f(y, x) π(x)p_x(y) µ(dx)µ(dy)    [exchange variables]
= ∫∫ f(y, x) π(dx) P(x, dy). (21)

Next, we show the converse. The definition of reversibility implies that

∫∫ f(x, y) π(dx) P(x, dy) = ∫∫ f(x, y) π(dy) P(y, dx). (22)

Hence,

∫∫ f(x, y) π(x)p_x(y) µ(dx)µ(dy) = ∫∫ f(x, y) π(y)p_y(x) µ(dx)µ(dy). (23)

Since this holds for an arbitrary integrable function f, the integrands must agree almost everywhere, which implies that π(x)p_x(y) = π(y)p_y(x) a.e.

Proposition 9. Given a stochastic kernel P and a probability measure π over (Ω, S). If P(x, dy) = m(x)I_x(dy) + p_x(y)µ(dy), where I_x denotes the point mass at x, and π(x)p_x(y) = π(y)p_y(x) a.e., then P is reversible w.r.t. π.

Proof. Under the given conditions, we have

∫∫ f(x, y) π(dx) P(x, dy) = ∫∫ f(x, y) π(dx) (m(x)I_x(dy) + p_x(y)µ(dy))
= ∫∫ f(x, y) m(x) π(dx) I_x(dy) + ∫∫ f(x, y) p_x(y) π(dx) µ(dy)
= ∫ f(x, x) m(x) π(dx) + ∫∫ f(x, y) p_x(y) π(x) µ(dx)µ(dy). (24)

For the right-hand side, we have

∫∫ f(y, x) π(dx) P(x, dy) = ∫∫ f(y, x) π(dx) (m(x)I_x(dy) + p_x(y)µ(dy))
= ∫ f(x, x) m(x) π(dx) + ∫∫ f(y, x) p_x(y) π(x) µ(dx)µ(dy)
= ∫ f(x, x) m(x) π(dx) + ∫∫ f(x, y) p_y(x) π(y) µ(dx)µ(dy). (25)

With π(x)p_x(y) = π(y)p_y(x), we can see that the left- and right-hand sides are equal. This completes the proof.


3 Justification of MCMC Methods

Proposition 10. Samples produced using the Metropolis-Hastings algorithm have the desired distribution, and the resulting chain is reversible.

Proof. It suffices to show that the M-H update is reversible w.r.t. π, which implies that π is invariant. The stochastic kernel of the M-H update is given by

P(x, dy) = m(x)I(x, dy) + q(x, dy)a(x, y) = m(x)I(x, dy) + q_x(y)a(x, y)µ(dy). (26)

Here, µ is the base measure, I(x, dy) is the identity measure given by I(x, A) = 1(x ∈ A), and m(x) is the probability that the proposal is rejected, which is given by

m(x) = 1 − ∫_Ω q(x, dy) a(x, y). (27)

Let g(x, y) = h(x)q_x(y)a(x, y), where h is the (possibly unnormalized) density of π. With Proposition 9, it suffices to show that g(x, y) = g(y, x). Here, a(x, y) = min{r(x, y), 1}. Also, from the definition r(x, y) = h(y)q_y(x)/(h(x)q_x(y)), it is easy to see that r(x, y) = 1/r(y, x). We first consider the case where r(x, y) ≤ 1 (thus r(y, x) ≥ 1). Then

g(x, y) = h(x)q_x(y)a(x, y) = h(x)q_x(y) · h(y)q_y(x)/(h(x)q_x(y)) = h(y)q_y(x), (28)

and

g(y, x) = h(y)q_y(x)a(y, x) = h(y)q_y(x). (29)

Hence, g(x, y) = g(y, x) when r(x, y) ≤ 1. Similarly, we can show that the equality holds when r(x, y) ≥ 1. This completes the proof.
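As an illustration, here is a minimal Metropolis-Hastings sampler in Python; the standard-Gaussian target h and the random-walk proposal are assumptions made for the example, not prescribed by the lecture:

```python
import numpy as np

# A minimal sketch of the Metropolis-Hastings update (Proposition 10).
# Assumed target: h(x) = exp(-x^2 / 2), an unnormalized standard Gaussian.
# Assumed proposal: a Gaussian random walk q_x(y) = N(y; x, sigma^2).
# Since this q is symmetric, r reduces to h(y) / h(x) (cf. Proposition 11).

def log_h(x):
    return -0.5 * x * x

def mh_chain(n_steps, sigma=1.0, x0=0.0, seed=0):
    rng = np.random.default_rng(seed)
    xs = np.empty(n_steps)
    x = x0
    for t in range(n_steps):
        y = x + sigma * rng.standard_normal()            # propose y ~ q_x
        if np.log(rng.uniform()) < log_h(y) - log_h(x):  # accept w.p. min{r, 1}
            x = y
        xs[t] = x                                        # on rejection, stay at x
    return xs

xs = mh_chain(100_000)
print(xs.mean(), xs.var())  # approach 0 and 1, the standard Gaussian moments
```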

Proposition 11. The Metropolis algorithm is a special case of the Metropolis-Hastings algorithm.

Proof. It suffices to show that when q is symmetric, i.e. q_x(y) = q_y(x), the acceptance rate reduces to the form given in the Metropolis algorithm. In particular, when q_x(y) = q_y(x), the acceptance rate of the M-H algorithm is

a(x, y) = min{r(x, y), 1} = min{h(y)q_y(x)/(h(x)q_x(y)), 1} = min{h(y)/h(x), 1}. (30)

This completes the proof.

Proposition 12. The Gibbs sampling update is a special case of the Metropolis-Hastings update.

Proof. Without loss of generality, we assume the sample comprises two components, x = (x_1, x_2), and consider the update of the first component. Consider the proposal q_x(dy) = π(dy_1 | x_2) I_{x_2}(dy_2), which redraws the first component from its conditional distribution given x_2 and keeps the second component fixed. In this case, we have

r((x_1, x_2), (y_1, x_2)) = [π(y_1, x_2) π(x_1 | x_2)] / [π(x_1, x_2) π(y_1 | x_2)] = [π(y_1, x_2) π(x_1, x_2)] / [π(x_1, x_2) π(y_1, x_2)] = 1. (31)

This implies that the candidate is always accepted. Also, generating a sample from q_x is equivalent to drawing one from the conditional distribution π(· | x_2). This completes the argument.
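For concreteness, here is a minimal Gibbs sampler for an assumed bivariate Gaussian target, whose full conditionals are available in closed form:

```python
import numpy as np

# A minimal sketch of Gibbs sampling (Proposition 12). Assumed target:
# a bivariate Gaussian with zero means, unit variances, and correlation rho,
# whose full conditionals are x1 | x2 ~ N(rho * x2, 1 - rho^2) and
# symmetrically for x2 | x1.

def gibbs_chain(n_steps, rho=0.8, seed=0):
    rng = np.random.default_rng(seed)
    x1 = x2 = 0.0
    cond_sd = np.sqrt(1.0 - rho * rho)
    out = np.empty((n_steps, 2))
    for t in range(n_steps):
        x1 = rho * x2 + cond_sd * rng.standard_normal()  # x1 ~ pi(. | x2)
        x2 = rho * x1 + cond_sd * rng.standard_normal()  # x2 ~ pi(. | x1)
        out[t] = (x1, x2)
    return out

xs = gibbs_chain(100_000)
print(np.corrcoef(xs.T)[0, 1])  # approaches rho = 0.8
```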

Proposition 13. Let K_1, …, K_m be stochastic kernels with invariant measure π, and let q ∈ R^m be a probability vector. Then K = ∑_{i=1}^m q_i K_i is also a stochastic kernel with invariant measure π. Moreover, if K_1, …, K_m are all reversible, then K is reversible.


Proof. First, it is easy to see that convex combinations of probability measures remain probability measures. As an immediate consequence, K_x, a convex combination of the K_i(x, ·), is also a probability measure. Given a measurable subset A, K_i(·, A) is measurable for each i, and so are its convex combinations. Hence, we can conclude that K remains a stochastic kernel. Next, we show that π is invariant to K, as

πK = π(∑_{i=1}^m q_i K_i) = ∑_{i=1}^m q_i (πK_i) = ∑_{i=1}^m q_i π = π. (32)

This proves the first statement. Next, assume that K_1, …, K_m are reversible. Then for K, we have

∫∫ f(x, y) π(dx) K(x, dy) = ∑_{i=1}^m q_i ∫∫ f(x, y) π(dx) K_i(x, dy)
= ∑_{i=1}^m q_i ∫∫ f(y, x) π(dx) K_i(x, dy)
= ∫∫ f(y, x) π(dx) K(x, dy). (33)

This implies that K is also reversible, thus completing the proof.
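The mixture construction is easy to check numerically; the sketch below (an assumed 3-state example) mixes two Metropolis-type kernels and verifies that π stays invariant:

```python
import numpy as np

# A numerical illustration of Proposition 13: two reversible kernels on an
# assumed 3-state space are mixed with weights q = (0.3, 0.7), and the
# mixture remains a stochastic kernel with pi invariant.

pi = np.array([0.2, 0.3, 0.5])

def metropolis_kernel(pi, Q):
    """Reversible kernel built from a symmetric proposal matrix Q."""
    n = len(pi)
    P = np.zeros((n, n))
    for x in range(n):
        for y in range(n):
            if x != y:
                P[x, y] = Q[x, y] * min(1.0, pi[y] / pi[x])
        P[x, x] = 1.0 - P[x].sum()
    return P

K1 = metropolis_kernel(pi, np.full((3, 3), 1.0 / 3.0))
K2 = metropolis_kernel(pi, (np.ones((3, 3)) - np.eye(3)) / 2.0)

K = 0.3 * K1 + 0.7 * K2                  # mixture kernel
print(np.allclose(K.sum(axis=1), 1.0))   # True: K is a stochastic kernel
print(np.allclose(pi @ K, pi))           # True: pi is invariant to K
```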

Proposition 14. Let K_1, …, K_m be stochastic kernels with invariant measure π. Then K = K_m ∘ ⋯ ∘ K_1 is also a stochastic kernel with invariant measure π.

Proof. Consider K = K_2 ∘ K_1. To show that K is a stochastic kernel, we first show that K_x(dy) = K(x, dy) is a probability measure. Given an arbitrary measurable subset A, we have

K(x, A) = ∫ K_1(x, dy) K_2(y, A). (34)

As K_2(·, A) is a bounded, non-negative, measurable function, this integral is well-defined and A ↦ K(x, A) constitutes a measure. Also,

K(x, Ω) = ∫ K_1(x, dy) K_2(y, Ω) = ∫ K_1(x, dy) = 1. (35)

Hence, K(x, ·) is a probability measure, and thus K is a stochastic kernel. Next, we show that π is invariant to K:

πK = π(K_2 ∘ K_1) = (πK_1)K_2 = πK_2 = π. (36)

We have proved the statement for a composition of two kernels, K_2 ∘ K_1. By induction, we can further extend it to any finite composition, thus completing the proof.
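A companion check for the composition case, on assumed 2-state kernels; note that when π is a row vector, applying K_1 first and then K_2 corresponds to the matrix product K_1 K_2:

```python
import numpy as np

# A quick check of Proposition 14 on assumed toy kernels: two 2-state
# transition matrices that each leave pi invariant (both are in detailed
# balance with pi). Their composition K2 ∘ K1, i.e. the matrix product
# K1 @ K2, is again a stochastic kernel with pi invariant.

pi = np.array([0.25, 0.75])
K1 = np.array([[0.4, 0.6],
               [0.2, 0.8]])   # detailed balance: 0.25 * 0.6 == 0.75 * 0.2
K2 = np.array([[0.7, 0.3],
               [0.1, 0.9]])   # detailed balance: 0.25 * 0.3 == 0.75 * 0.1

K = K1 @ K2                              # K2 ∘ K1: apply K1 first, then K2
print(np.allclose(K.sum(axis=1), 1.0))   # True: K is a stochastic kernel
print(np.allclose(pi @ K, pi))           # True: pi is invariant to K
```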
