Lecture 12: Markov Chain Monte Carlo
Ziyu Shao
School of Information Science and Technology, ShanghaiTech University
December 25, 2018
Ziyu Shao (ShanghaiTech) Lecture 12: Markov Chain Monte Carlo December 25, 2018 1 / 145
Outline
1 Background of Markov Chain
2 Markov Property and Transition Matrix
3 Classification of States
4 Stationary Distribution & Reversibility
5 Basic Graph Theory
6 Application: PageRank
7 Introduction of Markov Chain Monte Carlo (MCMC)
8 Metropolis–Hastings Algorithm
9 Gibbs Sampler
Model Selection in Stochastic Modeling
Enough complexity to capture the complexity of the phenomena in question
Enough structure and simplicity to allow one to compute things of interest
Motivation
Introduced by Andrey Markov
IID sequence of random variables: too restrictive an assumption
Complete dependence among random variables: hard to analyze
Markov chain: a happy medium between complete independence & complete dependence.
Markov chain: allows for correlations among random variables and also has enough structure and simplicity to allow for computations to be carried out
Markov Process
Three basic components of Markov process
A sequence of random variables {Xt, t ∈ T}, where T is an index set, usually called "time".
All possible sample values of {Xt, t ∈ T} are called "states", which are elements of a state space S.
"Markov property": given the present value (information) of the process, the future evolution of the process is independent of the past evolution of the process.
Example: Discrete S & Discrete T
Classification of Markov Process
Discrete-Time Markov Chain: Discrete S & Discrete T
Continuous-Time Markov Chain: Discrete S & Continuous T
Discrete Markov Process: Continuous S & Discrete T
Continuous Markov Process: Continuous S & Continuous T
Markov Chain (DTMC)
Definition
A sequence of random variables X0, X1, X2, ... taking values in the state space {1, 2, ..., M} is called a Markov chain if for all n ≥ 0,

P(Xn+1 = j | Xn = i, Xn−1 = in−1, ..., X0 = i0) = P(Xn+1 = j | Xn = i).
Transition Matrix
Definition
Let X0, X1, X2, ... be a Markov chain with state space {1, 2, ..., M}, and let qij = P(Xn+1 = j | Xn = i) be the transition probability from state i to state j. The M × M matrix Q = (qij) is called the transition matrix of the chain.
Graphical and Matrix Form of Markov Chain
(State diagram of the four-state chain; each arrow carries the corresponding transition probability from Q.)

Q =
0.1  0.3  0    0.6
0.4  0.2  0.1  0.3
0    0.8  0    0.2
0.2  0    0.2  0.6

Each row of the transition matrix sums to 1.
n-step Transition Probability
Definition
The n-step transition probability from i to j is the probability of being at j exactly n steps after being at i. We denote this by q_ij^(n):

q_ij^(n) = P(Xn = j | X0 = i).
Lemma
Entries in the matrix Qn are probabilities of n-step transitions.
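The lemma can be checked numerically. Below is a pure-Python sketch using the four-state transition matrix from the earlier slide; it also verifies the Chapman–Kolmogorov relation Q^(n+m) = Q^n Q^m that is stated next:

```python
def matmul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matpow(A, n):
    """A^n by repeated multiplication (n >= 1)."""
    R = A
    for _ in range(n - 1):
        R = matmul(R, A)
    return R

# Four-state transition matrix from the earlier slide.
Q = [
    [0.1, 0.3, 0.0, 0.6],
    [0.4, 0.2, 0.1, 0.3],
    [0.0, 0.8, 0.0, 0.2],
    [0.2, 0.0, 0.2, 0.6],
]

Q5 = matpow(Q, 5)  # (i, j) entry is the 5-step transition probability

# Every row of Q^n is still a probability distribution.
assert all(abs(sum(row) - 1.0) < 1e-9 for row in Q5)

# Chapman-Kolmogorov: Q^(2+3) = Q^2 Q^3.
CK = matmul(matpow(Q, 2), matpow(Q, 3))
assert all(abs(Q5[i][j] - CK[i][j]) < 1e-9 for i in range(4) for j in range(4))
```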
Proof
Chapman-Kolmogorov Equation
Theorem (Chapman-Kolmogorov Equation)
For any m ≥ 0, n ≥ 0, and i, j ∈ S,

q_ij^(n+m) = Σ_{k∈S} q_ik^(n) q_kj^(m).
Proof
Marginal Distribution of Xn
Theorem
Define t = (t1, t2, ..., tM) by ti = P(X0 = i), and view t as a row vector. Then the marginal distribution of Xn is given by the vector tQ^n. That is, the jth component of tQ^n is P(Xn = j).
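A quick illustration of the theorem (pure-Python sketch; the same four-state chain as before, started deterministically in state 1):

```python
# Four-state transition matrix from the earlier slide.
Q = [
    [0.1, 0.3, 0.0, 0.6],
    [0.4, 0.2, 0.1, 0.3],
    [0.0, 0.8, 0.0, 0.2],
    [0.2, 0.0, 0.2, 0.6],
]

t = [1.0, 0.0, 0.0, 0.0]  # t_i = P(X0 = i): X0 = state 1 with probability 1

# Multiplying t by Q ten times gives t Q^10, the marginal distribution of X_10.
for _ in range(10):
    t = [sum(t[i] * Q[i][j] for i in range(4)) for j in range(4)]

# The marginal distribution still sums to 1.
assert abs(sum(t) - 1.0) < 1e-12
```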
Recurrent and Transient States
Definition
State i of a Markov chain is recurrent if, starting from i, the probability is 1 that the chain will eventually return to i. Otherwise, the state is transient, which means that if the chain starts from i, there is a positive probability of never returning to i.
Example
Irreducible and Reducible Chain
Definition
A Markov chain with transition matrix Q is irreducible if for any two states i and j, it is possible to go from i to j in a finite number of steps (with positive probability). That is, for any states i, j there is some positive integer n such that the (i, j) entry of Q^n is positive. A Markov chain that is not irreducible is called reducible.
Irreducible Implies All States Recurrent
Theorem
In an irreducible Markov chain with a finite state space, all states are recurrent.
A Reducible Markov Chain with Recurrent States
Gambler’s Ruin As A Markov Chain
Coupon Collector As A Markov Chain
Period
Definition
The period of a state i in a Markov chain is the greatest common divisor (gcd) of the possible numbers of steps it can take to return to i when starting at i. That is, the period of i is the greatest common divisor of numbers n such that the (i, i) entry of Q^n is positive. (The period of i is undefined if it is impossible ever to return to i after starting at i.) A state is called aperiodic if its period equals 1, and periodic otherwise. The chain itself is called aperiodic if all its states are aperiodic, and periodic otherwise.
Example
Definition
A row vector s = (s1, ..., sM) such that si ≥ 0 and Σi si = 1 is a stationary distribution for a Markov chain with transition matrix Q if

Σi si qij = sj

for all j, or equivalently, sQ = s.
Example: Two-State Markov Chain
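The figure for this example did not survive extraction, but the closed form is easy to verify. For a generic two-state chain with switching probabilities α and β (the values below are illustrative, not from the slide), the stationary distribution is s = (β, α)/(α + β), which can be checked against sQ = s directly:

```python
# Two-state chain: state 1 switches to state 2 w.p. alpha, state 2 switches
# back w.p. beta (illustrative values).
alpha, beta = 0.3, 0.6
Q = [[1 - alpha, alpha],
     [beta, 1 - beta]]

# Candidate stationary distribution: s = (beta, alpha) / (alpha + beta).
s = [beta / (alpha + beta), alpha / (alpha + beta)]

# Check the definition sQ = s component-wise.
sQ = [s[0] * Q[0][j] + s[1] * Q[1][j] for j in range(2)]
assert all(abs(x - y) < 1e-12 for x, y in zip(sQ, s))
```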
Existence and Uniqueness of Stationary Distribution
Theorem
Any irreducible Markov chain has a unique stationary distribution. In this distribution, every state has positive probability.
Reversibility
Definition
Let Q = (qij) be the transition matrix of a Markov chain. Suppose there is s = (s1, ..., sM) with si ≥ 0 and Σi si = 1, such that

si qij = sj qji

for all states i and j. This equation is called the reversibility or detailed balance condition, and we say that the chain is reversible with respect to s if it holds.
Check the Detailed Balance Equation
Theorem
If for an irreducible Markov chain with transition matrix Q = (qij) there exists a probability solution π to the detailed balance equations

πi qij = πj qji

for all pairs of states i, j, then this Markov chain is positive recurrent and time-reversible, and the solution π is the unique stationary distribution.
Example: Simple Random Walk
Consider a negative-drift simple random walk, restricted to be non-negative, in which q_{0,1} = 1, q_{i,i+1} = p < 0.5, and q_{i,i−1} = 1 − p for i ≥ 1. The state diagram is a birth–death chain on 0, 1, 2, ...: from 0 the walk moves to 1 with probability 1; from each i ≥ 1 it moves up to i + 1 with probability p and down to i − 1 with probability 1 − p.
Modeling Language: Graph Theory
Origin: Euler's 1735 solution of the Seven Bridges of Königsberg problem
Key Elements of A Graph
A graph is an ordered pair G = (V ,E )
V : a set of vertices or nodes
E : a set of edges or links between nodes
Edge: undirected/directed
Undirected Graph
Degree of vertex v : metric for connectivity of vertex v .
deg(v): the number of edges with v as an end vertex
Directed Graph
Indegree of vertex v: the number of incoming edges ending at v.
Outdegree of vertex v : the number of outgoing edges starting from v .
I (v): indegree of v
O(v): outdegree of v
Adjacency Matrix
A square (0,1)-matrix used to represent a finite graph
Matrix elements indicate whether pairs of vertices are adjacent or not
Symmetric matrix: for undirected graph
Adjacency Matrix: Directed Graph
Adjacency Matrix of Nauru Graph
Adjacency Matrix of Cayley Graph
Random Walk on Undirected Graph
How to Organize the Web?
First Try: Web Directories
Yahoo, DMOZ, LookSmart
Yahoo
How to Organize the Web?
Second Try: Web Search
Information Retrieval:
  - find relevant docs in a small and trusted set
  - newspaper articles, patents, etc.

Hardness: the web is huge, full of untrusted documents, random things, web spam, etc.
Challenges for Web Search
Web contains many sources of information. Who to trust?
  - Trick: trustworthy pages may point to each other!

What is the best answer to query keywords?
  - Webpages are not equally important (www.nothing.com vs. www.stanford.edu)
  - Trick: rank pages containing the keywords according to their importance (popularity)
  - Find the page with the highest rank
  - How to rank?
World Wide Web as A Graph
Web as a directed graph
Nodes: webpages
Edges: hyperlinks
Web as A Directed Graph
Milestones
1998: Larry Page (1973–) and Sergey Brin (1973–) invented the PageRank algorithm and then founded Google.
1998: Jon Kleinberg (1971–) invented the Hyperlink-Induced Topic Search (HITS) algorithm.
Links as Votes
A page is more important if it has more links
Incoming links or outgoing links?
Think of incoming links as votes:
  - www.stanford.edu has 23,400 incoming links
  - www.nothing.com has 1 incoming link

Are all in-links equal?
  - Links from important pages count more
  - Recursive question!
Example: PageRank Scores
Simple Recursive Formulation
Each link’s vote is proportional to the importance of its source page.
If page j with importance rj has n out-links, each link gets rj/n votes
Page j’s own importance is the sum of the votes on its in-links
Example
rj = ri/3 + rk/4.
PageRank: The “Flow” Model
A “vote” from an important page is worth more
A page is important if it is pointed to by other important pages
Define a rank rj for page j:

r_j = Σ_{i→j} r_i / O_i

where Oi is the outdegree of i.
Example: Flow Equation
ry = ry/2 + ra/2
ra = ry/2 + rm
rm = ra/2
Solving the Flow Equations
An additional constraint forces uniqueness:
  - ry + ra + rm = 1
  - solution: ry = 2/5, ra = 2/5, rm = 1/5

Gaussian elimination works for small examples

We need a better method for large web-size graphs
PageRank: Matrix Formulation
Link (adjacency) matrix Q:
  - Each page i has Oi out-links
  - If i → j, then Qij = 1/Oi; else Qij = 0

Q is a stochastic matrix: each row sums to 1.
PageRank: Matrix Formulation
Rank vector r:
  - a vector with one entry per page
  - ri is the importance score of page i
  - Σi ri = 1

The flow equations can be written as

r = r · Q
Example
ry = ry/2 + ra/2
ra = ry/2 + rm
rm = ra/2

Q (rows and columns ordered y, a, m):

    y    a    m
y  1/2  1/2   0
a  1/2   0   1/2
m   0    1    0
Random Walk Interpretation
The Stationary Distribution
r · Q = r
Power Iteration Method
Given a web graph with N nodes (pages) and edges (hyperlinks):

Power iteration: a simple iterative scheme
  - Initialize: r(0) = [1/N, ..., 1/N]
  - Iterate: r(t+1) = r(t) · Q
  - Stop when |r(t+1) − r(t)|_1 < ε
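The scheme above can be sketched in a few lines of pure Python, applied to the y/a/m example from the earlier slides (rows of Q ordered y, a, m):

```python
# Row-stochastic link matrix for the 3-page y/a/m example.
Q = [
    [0.5, 0.5, 0.0],  # y links to y and a
    [0.5, 0.0, 0.5],  # a links to y and m
    [0.0, 1.0, 0.0],  # m links to a
]
N = len(Q)

r = [1.0 / N] * N  # r(0) = [1/N, ..., 1/N]
for _ in range(10_000):
    # r(t+1) = r(t) . Q
    r_next = [sum(r[i] * Q[i][j] for i in range(N)) for j in range(N)]
    # Stop when the L1 change is below epsilon.
    if sum(abs(a - b) for a, b in zip(r_next, r)) < 1e-12:
        r = r_next
        break
    r = r_next

# The iteration converges to the ranks found earlier: ry = ra = 2/5, rm = 1/5.
assert all(abs(x - y) < 1e-9 for x, y in zip(r, [0.4, 0.4, 0.2]))
```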
The Google Formulation
r_j^(t+1) = Σ_{i→j} r_i^(t) / O_i   ⟹   r^(t+1) = r^(t) · Q

Does this converge? Does it converge to what we want? Are the results reasonable?
Example 1: Spider Traps
Two pages A and B that link only to each other:

r_A^(t+1) = r_B^(t)
r_B^(t+1) = r_A^(t)

rA → 1, 0, 1, 0, ...
rB → 0, 1, 0, 1, ...
Example 2: Dead End
Page A links only to B, and B has no out-links:

rA → 1, 0, 0, 0, ...
rB → 0, 1, 0, 0, ...
Observations
Some pages are dead ends (no out-links)

Spider traps (all out-links are within the group)
Example
Here m is a spider trap: all of its out-links stay within {m}. With rows and columns ordered y, a, m:

    y    a    m
y  1/2  1/2   0
a  1/2   0   1/2
m   0    0    1

⟹  ry → 1/3, 2/6, 3/12, ..., 0
    ra → 1/3, 1/6, 2/12, ..., 0
    rm → 1/3, 3/6, 7/12, ..., 1
Google’s Solution for Spider Trap
Teleports!
At each step, a walker at page i has two options:
  - w.p. β: follow an out-link chosen uniformly at random (probability 1/Oi each)
  - w.p. 1 − β: jump to a uniformly random page

In practice β ≈ 0.8–0.9.
Google’s Solution for Dead Ends
Here m is a dead end (no out-links), so its row is all zeros and the matrix is not stochastic. The fix: always teleport from dead ends, i.e., replace the all-zero row with the uniform distribution. With rows and columns ordered y, a, m:

    y    a    m              y    a    m
y  1/2  1/2   0          y  1/2  1/2   0
a  1/2   0   1/2   ⟹    a  1/2   0   1/2
m   0    0    0          m  1/3  1/3  1/3
Random Teleports
PageRank equation:

r_j = Σ_{i→j} β · r_i / O_i + (1 − β) · (1/N)

Google matrix:

G = β · Q + (1 − β) · [1/N]_{N×N}

r = r · G
Random Teleports
Google matrix:

Q → G = β · Q + (1 − β) · [1/N]_{N×N}

(The figure shows the y/a/m link graph before and after adding the teleport edges.)
Do Some Calculation
Let β = 0.8. Then

          1/2  1/2   0           1/3  1/3  1/3
G = 0.8 · 1/2   0   1/2  + 0.2 · 1/3  1/3  1/3
           0    0    1           1/3  1/3  1/3

As a result,

     7/15  7/15   1/15
G =  7/15  1/15   7/15
     1/15  1/15  13/15
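The arithmetic on this slide can be verified directly (a pure-Python sketch of G = βQ + (1 − β)[1/N]):

```python
beta = 0.8
N = 3
# Row-stochastic Q for the y/a/m example; m is a spider trap here.
Q = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0],
]

# Google matrix: G = beta * Q + (1 - beta) * (1/N) in every entry.
G = [[beta * Q[i][j] + (1 - beta) / N for j in range(N)] for i in range(N)]

expected = [
    [7/15, 7/15, 1/15],
    [7/15, 1/15, 7/15],
    [1/15, 1/15, 13/15],
]
assert all(abs(G[i][j] - expected[i][j]) < 1e-12
           for i in range(N) for j in range(N))
```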
Implementation of PageRank in Practice
Google’s Team with the leader Jeff Dean developed several tools
BigTable: distributed storage system
GFS (Google File System): distributed file system
Spanner: distributed database
Mapreduce: distributed computing system (followed by Hadoop &Spark)
Motivation of MCMC
Previously we saw that we can generate draws from desired distributions based on the Unif(0,1) distribution.

Universality of the Uniform depends partly on how tractable it is to compute the inverse CDF.

In general it is hard to compute the CDF and its inverse.
Motivation of MCMC
Markov Chain Monte Carlo (MCMC): build your own Markov chain (X0, X1, ...) so that the desired distribution π is the stationary distribution of the chain.

Forward & reverse engineering perspective:
  - Forward engineering: given the transition matrix P, find the stationary distribution of the Markov chain.
  - Reverse engineering: given a distribution π that we want to simulate, engineer a Markov chain whose stationary distribution is π. Run this engineered Markov chain for a long time, and the distribution of the chain will approach π.
MCMC: Sampling Perspective
Sampling from a distribution π

Uses Markov sequences (samples) to effectively simulate from what would otherwise be intractable distributions.

With samples, we can further compute the sample mean, sample variance, and other sample functions to approximate the values of interesting statistics.

This is guaranteed by the strong law of large numbers.
Strong Law of Large Numbers for I.I.D. Sequences
If Y1, Y2, ... is an i.i.d. sequence with common mean µ < ∞, then the strong law of large numbers says that, with probability 1,

lim_{n→∞} (Y1 + ... + Yn)/n = µ.

Equivalently, let Y be a random variable with the same distribution as the Yi and assume that r is a bounded, real-valued function. Then r(Y1), r(Y2), ... is also an i.i.d. sequence with finite mean, and, with probability 1,

lim_{n→∞} (r(Y1) + ... + r(Yn))/n = E(r(Y)).
Strong Law of Large Numbers for Markov Chains
Theorem
Assume that X0, X1, ... is an ergodic Markov chain with stationary distribution π. Let r be a bounded, real-valued function, and let X be a random variable with distribution π. Then, with probability 1,

lim_{n→∞} (r(X1) + ... + r(Xn))/n = E(r(X)),

where E(r(X)) = Σ_j r(j) πj.
Theoretical Foundations of MCMC
All MCMC algorithms construct a reversible (time-reversible) Markov chain: the detailed balance equations help us.
Theorem
If for an irreducible Markov chain with transition matrix Q = (qij) there exists a probability solution π to the detailed balance equations

πi qij = πj qji

for all pairs of states i, j, then this Markov chain is positive recurrent and time-reversible, and the solution π is the unique stationary distribution.
Example: Binary Sequences with No Adjacent 1s
Consider sequences of length m consisting of 0s and 1s. Call a sequence good if it has no adjacent 1s. What is the expected number of 1s in a good sequence if all good sequences are equally likely?
MCMC Approach
Two Main Methods of MCMC
Metropolis–Hastings algorithm: generates a random walk using a proposal distribution and a method for rejecting some of the proposed moves.

Gibbs sampling: requires all the conditional distributions of the target distribution to be sampled exactly. When drawing from the full conditional distributions is not straightforward, other samplers-within-Gibbs are used.
Basic Idea
Proposed by Nicholas Metropolis in 1953 & further developed byWilfred Keith Hastings in 1970.
Start with any irreducible Markov chain on the state space of interest
Then modify it into a new Markov chain with the desired stationary distribution
Modification: moves are proposed according to the original chain, but each proposal may or may not be accepted.
Art: choice of the probability of accepting the proposal
Algorithm for DTMC
Let π = (π1, ..., πM) be a desired stationary distribution on state space {1, ..., M}. Assume that πi > 0 for all i (if not, just delete any states i with πi = 0 from the state space). Suppose that P = (pij) is the transition matrix for a Markov chain on state space {1, ..., M}. Intuitively, P is a Markov chain that we know how to run but that does not have the desired stationary distribution. We will modify P to construct a Markov chain with stationary distribution π.
Use the original transition probabilities pij to propose where to go next
Then accept the proposal with probability aij
Staying in the current state in the event of a rejection.
The normalizing constant of π does not need to be known.
Algorithm 1: Metropolis–Hastings

Require: stationary distribution π = (π1, ..., πM); original transition matrix P = (pij); initial state X0 (chosen randomly or deterministically).
Ensure: modified transition matrix P′ = (p′ij).
1: repeat
2:   If Xn = i, propose a new state j using the transition probabilities in the ith row of the original transition matrix P;
3:   Compute the acceptance probability aij = min(πj pji / (πi pij), 1);
4:   Flip a coin that lands Heads with probability aij;
5:   If the coin lands Heads, accept the proposal (i.e., go to j), setting Xn+1 = j. Otherwise, reject the proposal (i.e., stay at i), setting Xn+1 = i;
6: until convergence;
7: return P′;
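A minimal sketch of the algorithm in pure Python. The target weights, uniform proposal matrix, and run length in the demo are illustrative choices, not from the slides:

```python
import random

def metropolis_hastings(pi, P, x0, n_steps, seed=0):
    """Metropolis-Hastings on states 0..M-1. pi holds (possibly unnormalized)
    target weights; P is the proposal transition matrix (rows sum to 1)."""
    rng = random.Random(seed)
    M = len(pi)
    x = x0
    samples = []
    for _ in range(n_steps):
        # Propose j from the x-th row of P.
        j = rng.choices(range(M), weights=P[x])[0]
        # Acceptance probability a_xj = min(pi_j p_jx / (pi_x p_xj), 1).
        a = min(pi[j] * P[j][x] / (pi[x] * P[x][j]), 1.0)
        if rng.random() < a:
            x = j          # accept: move to j
        samples.append(x)  # on rejection, stay at x
    return samples

# Demo: target proportional to (1, 2, 3) with a uniform proposal chain.
pi = [1.0, 2.0, 3.0]
P = [[1/3, 1/3, 1/3]] * 3
samples = metropolis_hastings(pi, P, x0=0, n_steps=100_000)
freqs = [samples.count(k) / len(samples) for k in range(3)]
# freqs should approach the normalized target (1/6, 2/6, 3/6).
```

Note that only the ratios πj/πi enter the acceptance probability, which is why π never needs to be normalized.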
Metropolis–Hastings Algorithm
Theorem
The sequence X0, X1, ... constructed by the Metropolis–Hastings algorithm is a reversible Markov chain with stationary distribution π.
Remarks
The exact form of π is not necessary to implement Metropolis–Hastings. The algorithm only uses ratios of the form πj/πi. Thus, π needs only to be specified up to proportionality.

If the proposal transition matrix P is symmetric, aij = min(πj/πi, 1).

The algorithm works for any irreducible proposal chain (original chain). Thus, the user has wide latitude to find a proposal chain that is efficient in the context of their problem.
Remarks
The generated sequence X0, X1, ..., Xn gives an approximate sample from π.

If the chain requires many steps to get close to stationarity, there may be initial bias.

Burn-in: the practice of discarding the initial iterations and retaining Xm, Xm+1, ..., Xn, for some m.

The strong law of large numbers for Markov chains:

lim_{n→∞} (r(Xm) + ... + r(Xn))/(n − m + 1) = E(r(X)) = Σx r(x) πx.
Example: Zipf Distribution Simulation
Let M ≥ 2 be an integer. An r.v. X has the Zipf distribution with parameter a > 0 if its PMF is

P(X = k) = (1/k^a) / Σ_{j=1}^M (1/j^a),

for k = 1, 2, ..., M (and 0 otherwise). This distribution is widely used in linguistics for studying frequencies of words.

Create a Markov chain X0, X1, ... whose stationary distribution is the Zipf distribution, and such that |Xn+1 − Xn| ≤ 1 for all n. Your answer should provide a simple, precise description of how each move of the chain is obtained, i.e., how to transit from Xn to Xn+1 for each n.
Solution
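One standard construction, sketched in Python (the parameter values in the demo are illustrative): propose Xn + 1 or Xn − 1 with probability 1/2 each; if the proposal falls outside {1, ..., M}, stay put; otherwise accept with probability min(πj/πXn, 1) = min((Xn/j)^a, 1), since the ±1 proposal is symmetric.

```python
import random

def zipf_chain(a, M, n_steps, seed=0):
    """M-H chain on {1,...,M} with stationary PMF pi_k proportional to 1/k^a,
    moving by at most one state per step (|X_{n+1} - X_n| <= 1)."""
    rng = random.Random(seed)
    x = 1
    samples = []
    for _ in range(n_steps):
        j = x + rng.choice([-1, 1])  # symmetric +-1 proposal
        # Accept with min(pi_j / pi_x, 1) = min((x/j)^a, 1); stay otherwise.
        if 1 <= j <= M and rng.random() < min((x / j) ** a, 1.0):
            x = j
        samples.append(x)
    return samples

samples = zipf_chain(a=1.0, M=5, n_steps=200_000)
freq1 = samples.count(1) / len(samples)
# Target: pi_1 = 1 / (1 + 1/2 + 1/3 + 1/4 + 1/5), roughly 0.438.
```

Rejected out-of-range proposals keep the chain on {1, ..., M} without breaking detailed balance, since the ±1 proposal probabilities remain symmetric between neighboring states.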
Example: Knapsack Problem
m treasures with labels from 1 to m, where the jth treasure is worthgj gold pieces and weighs wj pounds.
The maximum weight we can carry is w pounds.
We must choose a vector x = (x1, ..., xm), where xj is 1 if we choose the jth treasure and 0 otherwise, such that the total weight of the treasures j with xj = 1 is at most w.

Let C be the space of all such vectors, so C consists of all binary vectors (x1, ..., xm) with Σ_{j=1}^m xj wj ≤ w.

We wish to maximize the total worth of the treasures we take.
Finding an optimal solution is an extremely difficult problem, knownas the knapsack problem, which has a long history in computerscience.
A brute force solution would be completely infeasible in general. Howabout MCMC?
Example: Knapsack Problem
(a) Consider the following Markov chain. Start at (0, 0, ..., 0). One move of the chain is as follows. Suppose the current state is x = (x1, ..., xm). Choose a uniformly random J in {1, 2, ..., m}, and obtain y from x by replacing xJ with 1 − xJ (i.e., toggle whether that treasure will be taken). If y is not in C, stay at x; if y is in C, move to y. Show that the uniform distribution over C is stationary for this chain.

(b) Show that the chain from (a) is irreducible, and that it may or may not be aperiodic (depending on w, w1, ..., wm).
Solution
Example: Knapsack Problem
(c) The chain from (a) is a useful way to get approximately uniform solutions, but Bilbo is more interested in finding solutions where the value (in gold pieces) is high. In this part, the goal is to construct a Markov chain with a stationary distribution that puts much higher probability on any particular high-value solution than on any particular low-value solution. Specifically, suppose that we want to simulate from the distribution

π(x) ∝ e^{βV(x)},

where V(x) = Σ_{j=1}^m xj gj is the value of x in gold pieces and β is a positive constant. The idea behind this distribution is to give exponentially more probability to each high-value solution than to each low-value solution. Create a Markov chain whose stationary distribution is as desired.
Solution
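One way to realize such a chain in code (a Python sketch; the three-treasure instance, β = 0.5, and the step count are illustrative assumptions): toggle a uniformly chosen coordinate as in part (a); since that proposal is symmetric, an in-C proposal is accepted with probability min(e^{β(V(y)−V(x))}, 1). Tracking the best state visited turns the sampler into a crude optimizer.

```python
import math
import random

def knapsack_chain(g, w, w_max, beta, n_steps, seed=0):
    """M-H chain on feasible 0/1 vectors with stationary pi(x) ~ exp(beta*V(x)).
    Proposal: toggle one uniformly chosen coordinate; infeasible moves are
    rejected (stay). Returns the best value and vector visited."""
    rng = random.Random(seed)
    m = len(g)
    x = [0] * m
    weight = 0.0
    value = 0.0
    best_value, best_x = 0.0, x[:]
    for _ in range(n_steps):
        j = rng.randrange(m)
        sign = 1 - 2 * x[j]  # +1 if adding treasure j, -1 if dropping it
        if weight + sign * w[j] <= w_max:  # proposal stays in C
            # Symmetric proposal: accept w.p. min(exp(beta * dV), 1).
            if rng.random() < min(math.exp(beta * sign * g[j]), 1.0):
                x[j] ^= 1
                weight += sign * w[j]
                value += sign * g[j]
                if value > best_value:
                    best_value, best_x = value, x[:]
    return best_value, best_x

# Tiny instance: the best feasible set is treasures 2 and 3 (value 12, weight 5).
best_value, best_x = knapsack_chain(g=[10, 7, 5], w=[4, 3, 2], w_max=5,
                                    beta=0.5, n_steps=50_000)
```

With a moderate β the chain can still escape locally good but globally suboptimal states; very large β makes escapes exponentially rare, which is exactly the tension that simulated annealing (next topic) addresses by increasing β gradually.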
Impact of β
Application: Simulated Annealing
MCMC algorithm for optimization problems
M-H with varying β: β gradually increases over time
β is small at first, exploring the state space broadly
As β gets larger and larger, the chain becomes more likely to get stuck near a local optimum, and the stationary distribution concentrates more and more around the best-value solutions

Why the name "annealing"?

Annealing of metals: the metal is first heated to a very high temperature, and then gradually cooled until it reaches a very strong and stable state

β = 1/Temperature
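As a concrete illustration, here is a minimal simulated-annealing sketch for a toy knapsack instance. The item weights, gold values, capacity, and cooling schedule below are all made-up illustration choices, not from the problem:

```python
import math
import random

random.seed(0)

# Toy instance (hypothetical): item weights, gold values, and capacity.
weights = [3, 5, 2, 7, 4]
values = [10, 40, 15, 60, 25]
capacity = 10

def total(x, arr):
    return sum(a for xi, a in zip(x, arr) if xi)

def anneal(n_steps=20000):
    x = [0] * len(weights)              # start from the empty knapsack
    best, best_val = x[:], 0
    for t in range(1, n_steps + 1):
        beta = 0.001 * t                # inverse temperature grows over time
        j = random.randrange(len(x))
        y = x[:]
        y[j] = 1 - y[j]                 # propose flipping one item in or out
        if total(y, weights) > capacity:
            continue                    # infeasible proposal: stay put
        dv = total(y, values) - total(x, values)
        # symmetric proposal, so the M-H acceptance ratio is pi(y)/pi(x) = e^(beta*dv)
        if dv >= 0 or random.random() < math.exp(beta * dv):
            x = y
        if total(x, values) > best_val:
            best, best_val = x[:], total(x, values)
    return best, best_val

best, best_val = anneal()
```

Early on, small β lets value-decreasing flips through so the chain explores; later, large β makes the chain effectively greedy, and the best solution seen is recorded along the way.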
Continuous State Space
MCMC can also be used in the continuous case, when π is a probability density function.

For a continuous state space, a transition function replaces the transition matrix.

In place of Pij, the transition function P(x, y) gives, for each x, the value of a conditional density of the next state given X0 = x.
Example: Beta Simulation
Suppose that we want to generate W ∼ Beta(a, b), but we don’t know about the rbeta command in R. Instead, what we have available are i.i.d. Unif(0, 1) r.v.s. How can we generate W which is approximately Beta(a, b) if a and b are any positive real numbers, with the help of a Markov chain on the continuous state space (0, 1)?
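One standard construction (a sketch, not necessarily the intended solution) is a Metropolis-Hastings chain on (0, 1) with an independence proposal drawn from Unif(0, 1); since the proposal density is constant, the acceptance probability reduces to a ratio of unnormalized Beta densities:

```python
import random

random.seed(42)

def beta_chain(a, b, n_steps=100000):
    """Markov chain on (0, 1) whose stationary density is Beta(a, b)."""
    def f(x):
        # Beta(a, b) density up to its normalizing constant
        return x ** (a - 1) * (1 - x) ** (b - 1)
    x = 0.5
    samples = []
    for _ in range(n_steps):
        y = random.random()            # independence proposal ~ Unif(0, 1)
        # constant proposal density => acceptance probability min(1, f(y)/f(x))
        if random.random() < min(1.0, f(y) / f(x)):
            x = y
        samples.append(x)
    return samples

samples = beta_chain(2, 3)
mean = sum(samples) / len(samples)     # Beta(2, 3) has mean 2/5 = 0.4
```

Note that only Unif(0, 1) draws are used, as the problem requires; the normalizing constant of the Beta density cancels in the acceptance ratio.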
Solution
Example: Normal-Normal Conjugacy
Let Y | θ ∼ N(θ, σ²), where σ² is known but θ is unknown. Using the Bayesian framework, we treat θ as a random variable, with prior given by θ ∼ N(µ, τ²) for some known constants µ and τ². That is, we have the two-level model

θ ∼ N(µ, τ²)
Y | θ ∼ N(θ, σ²)

Describe how to use the Metropolis-Hastings algorithm to find the posterior mean and variance of θ after observing the value of Y.
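A random-walk Metropolis sketch for this model; the observed y, the constants µ, τ², σ², and the proposal scale below are hypothetical illustration values, and conjugacy gives an exact answer to compare against:

```python
import math
import random

random.seed(1)

# Hypothetical illustration values.
mu, tau2 = 0.0, 1.0      # prior: theta ~ N(mu, tau2)
sigma2 = 1.0             # model: Y | theta ~ N(theta, sigma2)
y = 2.0                  # the observed value of Y

def log_post(theta):
    # log of (likelihood * prior), up to an additive constant
    return -(y - theta) ** 2 / (2 * sigma2) - (theta - mu) ** 2 / (2 * tau2)

theta, draws = 0.0, []
for _ in range(200000):
    prop = theta + random.gauss(0, 1.0)   # symmetric random-walk proposal
    if random.random() < math.exp(min(0.0, log_post(prop) - log_post(theta))):
        theta = prop
    draws.append(theta)

burned = draws[50000:]                    # discard burn-in
post_mean = sum(burned) / len(burned)
post_var = sum((d - post_mean) ** 2 for d in burned) / len(burned)
# Conjugacy gives the exact posterior mean and variance to compare against:
exact_mean = (y / sigma2 + mu / tau2) / (1 / sigma2 + 1 / tau2)
exact_var = 1 / (1 / sigma2 + 1 / tau2)
```

Working on the log scale avoids underflow, and the symmetric proposal means the proposal densities cancel from the acceptance ratio.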
Solution
Simulation Results with MCMC
Outline

9 Gibbs Sampler
Basic Idea
Proposed by brothers Stuart and Donald Geman in 1984.
Named after the physicist Josiah Willard Gibbs, in reference to an analogy between the sampling algorithm and statistical physics.

Obtaining approximate draws from a joint distribution, based on sampling from conditional distributions one at a time

Especially useful in problems where these conditional distributions are pleasant to work with.
Basic Idea
At each stage, one variable is updated (keeping all the other variables fixed) by drawing from the conditional distribution of that variable given all the other variables.

Two major kinds of Gibbs sampler:
Systematic scan: the updates sweep through the components in a deterministic order.
Random scan: a randomly chosen component is updated at each stage.
Algorithm: Systematic Scan Gibbs Sampler
Let X and Y be discrete r.v.s with joint PMF pX,Y(x, y) = P(X = x, Y = y). We wish to construct a two-dimensional Markov chain (Xn, Yn) whose stationary distribution is pX,Y. The systematic scan Gibbs sampler proceeds by updating the X-component and the Y-component in alternation. If the current state is (Xn, Yn) = (xn, yn), then we update the X-component while holding the Y-component fixed, and then update the Y-component while holding the X-component fixed.
Algorithm 2 Systematic Scan Gibbs Sampler
Require: Joint PMF pX,Y; initial state (X0, Y0);
Ensure: two-dimensional Markov chain (Xn, Yn);
1: repeat
2:   Draw a value xn+1 from the conditional distribution of X given Y = yn, and set Xn+1 = xn+1;
3:   Draw a value yn+1 from the conditional distribution of Y given X = xn+1, and set Yn+1 = yn+1;
4:   return (Xn+1, Yn+1);
5: until n ≥ N;
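The algorithm above can be sketched on a small made-up joint PMF (the 2×2 table of probabilities below is purely illustrative):

```python
import random

random.seed(7)

# Hypothetical joint PMF p(x, y) on {0, 1} x {0, 1}.
p = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

def draw_x_given_y(y):
    # conditional PMF of X given Y = y, renormalized from the joint table
    w0, w1 = p[(0, y)], p[(1, y)]
    return 0 if random.random() < w0 / (w0 + w1) else 1

def draw_y_given_x(x):
    # conditional PMF of Y given X = x
    w0, w1 = p[(x, 0)], p[(x, 1)]
    return 0 if random.random() < w0 / (w0 + w1) else 1

x, y = 0, 0
n_steps = 200000
counts = {s: 0 for s in p}
for _ in range(n_steps):
    x = draw_x_given_y(y)    # update the X-component, Y held fixed
    y = draw_y_given_x(x)    # update the Y-component, X held fixed
    counts[(x, y)] += 1

freqs = {s: c / n_steps for s, c in counts.items()}  # should approach p
```

The long-run frequencies of the visited states approximate the joint PMF, even though the sampler only ever draws from the two conditionals.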
Algorithm: Random Scan Gibbs Sampler
As above, let X and Y be discrete r.v.s with joint PMF pX,Y(x, y). We wish to construct a two-dimensional Markov chain (Xn, Yn) whose stationary distribution is pX,Y. Each move of the random scan Gibbs sampler picks a uniformly random component and updates it, according to the conditional distribution given the other component.
Algorithm 3 Random Scan Gibbs Sampler
Require: Joint PMF pX,Y; initial state (X0, Y0);
Ensure: two-dimensional Markov chain (Xn, Yn);
1: repeat
2:   Choose which component to update, with equal probabilities;
3:   If the X-component was chosen, draw a value xn+1 from the conditional distribution of X given Y = yn, and set Xn+1 = xn+1, Yn+1 = yn. Similarly, if the Y-component was chosen, draw a value yn+1 from the conditional distribution of Y given X = xn, and set Xn+1 = xn, Yn+1 = yn+1;
4:   return (Xn+1, Yn+1);
5: until n ≥ N;
Random Scan Gibbs as Metropolis-Hastings
Theorem
The random scan Gibbs sampler is a special case of the Metropolis-Hastings algorithm, in which the proposal is always accepted. In particular, it follows that the stationary distribution of the random scan Gibbs sampler is as desired.
Gibbs Sampling vs. Metropolis-Hastings
Gibbs sampling emphasizes conditional distributions.
Metropolis-Hastings emphasizes acceptance probabilities.
Example: Graph Coloring

Let G be a network (also called a graph): there are n nodes, and for each pair of distinct nodes, there either is or isn’t an edge joining them. We have a set of k colors, e.g., if k = 7, the color set may be {red, orange, yellow, green, blue, indigo, violet}. A k-coloring of the network is an assignment of a color to each node, such that two nodes joined by an edge can’t be the same color. Graph coloring is an important topic in computer science, with wide-ranging applications such as task scheduling and the game of Sudoku.
Suppose that it is possible to k-color G. Form a Markov chain on the space of all k-colorings of G, with transitions as follows: starting with a k-coloring of G, pick a uniformly random node, figure out what the legal colors are for that node, and then repaint that node with a uniformly random legal color (note that this random color may be the same as the current color). Show that this Markov chain is reversible, and find its stationary distribution.
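A sketch of this chain on a small hypothetical graph (a 4-cycle with k = 3); the adjacency list and starting coloring below are illustration choices:

```python
import random

random.seed(3)

k = 3
# Hypothetical graph: a 4-cycle, given as an adjacency list.
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
coloring = {0: 0, 1: 1, 2: 0, 3: 1}    # a legal starting 3-coloring

def step(coloring):
    v = random.choice(list(coloring))               # uniformly random node
    banned = {coloring[u] for u in neighbors[v]}    # neighbors' colors
    legal = [c for c in range(k) if c not in banned]
    coloring[v] = random.choice(legal)              # uniformly random legal color

def is_proper(coloring):
    # no edge joins two nodes of the same color
    return all(coloring[v] != coloring[u]
               for v in neighbors for u in neighbors[v])

for _ in range(10000):
    step(coloring)
```

By construction, every state visited is a proper k-coloring: the repaint step only ever chooses colors not used by the node's neighbors.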
Proof
Remarks
Think of each node in the graph as a discrete r.v. that can take on k possible values.

These nodes have a joint distribution and a complicated dependence structure (the constraint that connected nodes cannot have the same color).

We would like to sample a random k-coloring of the entire graph (draw from the joint distribution of all the nodes).

We instead condition on all but one node.

If the joint distribution is to be uniform over all legal colorings, then the conditional distribution of one node given all the others is uniform over its legal colors.

At each stage of the algorithm, we are drawing from the conditional distribution of one node given all the others: we’re running a random scan Gibbs sampler!
Gibbs Sampler for Continuous State Space
In the Gibbs sampler, the target distribution π is an m-dimensional joint density

π(x) = π(x1, · · · , xm).

A multivariate Markov chain is constructed whose stationary distribution is π, and which takes values in an m-dimensional space. The algorithm generates elements by iteratively updating each component of an m-dimensional vector conditional on the other m − 1 components.
Example: Bivariate Standard Normal Distribution
The Gibbs sampler is implemented to simulate (X, Y) from a bivariate standard normal distribution with correlation ρ.

1 Initialize: (x0, y0) ← (0, 0), m ← 1.

2 Generate xm from the conditional distribution of X given Y = ym−1. That is, simulate from a normal distribution with mean ρym−1 and variance 1 − ρ².

3 Generate ym from the conditional distribution of Y given X = xm. That is, simulate from a normal distribution with mean ρxm and variance 1 − ρ².
4 m← m + 1.
5 Return to Step 2.
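The steps above translate directly into code (ρ = 0.8 is an arbitrary illustration value):

```python
import random

random.seed(5)

rho = 0.8                              # illustration value of the correlation
sd = (1 - rho ** 2) ** 0.5             # conditional standard deviation
x, y = 0.0, 0.0
xs, ys = [], []
for _ in range(200000):
    x = random.gauss(rho * y, sd)      # X | Y = y ~ N(rho*y, 1 - rho^2)
    y = random.gauss(rho * x, sd)      # Y | X = x ~ N(rho*x, 1 - rho^2)
    xs.append(x)
    ys.append(y)

# sample correlation of the chain, which should approach rho
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / n
vx = sum((a - mx) ** 2 for a in xs) / n
vy = sum((b - my) ** 2 for b in ys) / n
corr = cov / (vx * vy) ** 0.5
```

Each full sweep draws one coordinate from its exact conditional given the other, so no acceptance step is needed.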
Simulation Results with MCMC
Example: Chicken-Egg with Unknown Parameters
A chicken lays N eggs, where N ∼ Pois(λ). Each egg hatches with probability p, where p is unknown; we let p ∼ Beta(a, b). The constants λ, a, b are known.

Here’s the catch: we don’t get to observe N. Instead, we only observe the number of eggs that hatch, X. Describe how to use Gibbs sampling to find E(p|X = x), the posterior mean of p after observing x hatched eggs.
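A sketch of one possible Gibbs sampler, assuming the two full conditionals that follow from the chicken-egg story: given p and X = x, the unhatched count N − x is Pois(λ(1 − p)), and given N = n and X = x, p is Beta(a + x, b + n − x). The constants below are made-up illustration values:

```python
import math
import random

random.seed(11)

lam, a, b = 10.0, 1.0, 1.0   # hypothetical known constants
x = 4                        # observed number of hatched eggs

def draw_poisson(mean):
    # Knuth's method; fine for the small means used here
    L, k, prod = math.exp(-mean), 0, 1.0
    while True:
        prod *= random.random()
        if prod < L:
            return k
        k += 1

p, n = 0.5, x
p_draws = []
for _ in range(100000):
    n = x + draw_poisson(lam * (1 - p))        # N | X = x, p
    p = random.betavariate(a + x, b + n - x)   # p | X = x, N = n
    p_draws.append(p)

burned = p_draws[20000:]
post_mean_p = sum(burned) / len(burned)        # estimates E(p | X = x)
```

Averaging the p draws (after burn-in) approximates the posterior mean of p given X = x.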
Solution
Example: Three-dimensional Joint Distribution
Random variables X, P and N have joint density

π(x, p, n) ∝ (n choose x) p^x (1 − p)^(n−x) · 4^n / n!

for x = 0, 1, . . . , n, 0 < p < 1, n = 0, 1, . . .. The p variable is continuous; x and n are discrete. Use the Gibbs sampler to simulate from π.
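A sketch of a Gibbs sampler for π, using full conditionals that can be read off the joint density: X | p, n ∼ Bin(n, p); p | x, n ∼ Beta(x + 1, n − x + 1); and, since π(n | x, p) ∝ (4(1 − p))^(n−x)/(n − x)! for n ≥ x, N − x | x, p ∼ Pois(4(1 − p)):

```python
import math
import random

random.seed(17)

def draw_poisson(mean):
    # Knuth's method; the means here are at most 4
    L, k, prod = math.exp(-mean), 0, 1.0
    while True:
        prod *= random.random()
        if prod < L:
            return k
        k += 1

def draw_binomial(n, p):
    # sum of n independent Bernoulli(p) trials
    return sum(random.random() < p for _ in range(n))

x, p, n = 0, 0.5, 1
x_draws = []
for _ in range(100000):
    x = draw_binomial(n, p)                     # X | p, n ~ Bin(n, p)
    p = random.betavariate(x + 1, n - x + 1)    # p | x, n ~ Beta(x+1, n-x+1)
    n = x + draw_poisson(4 * (1 - p))           # N - x | x, p ~ Pois(4(1-p))
    x_draws.append(x)

mean_x = sum(x_draws) / len(x_draws)
```

As a check: marginalizing the joint density shows N ∼ Pois(4) and p | n ∼ Unif(0, 1), so E(X) = E(Np) = 4 · 1/2 = 2, which the sample mean of the x draws should approximate.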
Solution
Simulation Results with MCMC
References
Chapters 11 & 12 of BH