Lecture 12: Markov Chain Monte Carlo
Ziyu Shao
School of Information Science and Technology, ShanghaiTech University
December 25, 2018
Ziyu Shao (ShanghaiTech) Lecture 12: Markov Chain Monte Carlo December 25, 2018 1 / 145
Outline
1 Background of Markov Chain
2 Markov Property and Transition Matrix
3 Classification of States
4 Stationary Distribution & Reversibility
5 Basic Graph Theory
6 Application: PageRank
7 Introduction of Markov Chain Monte Carlo (MCMC)
8 Metropolis–Hastings Algorithm
9 Gibbs Sampler
Model Selection in Stochastic Modeling
Enough complexity to capture the complexity of the phenomena in question
Enough structure and simplicity to allow one to compute things of interest
Motivation
Introduced by Andrey Markov
IID sequence of random variables: too restrictive an assumption
Complete dependence among random variables: hard to analyze
Markov chain: a happy medium between complete independence & complete dependence.
Markov chain: allows for correlations among random variables and also has enough structure and simplicity to allow for computations to be carried out
Markov Process
Three basic components of Markov process
A sequence of random variables {Xt, t ∈ T}, where T is an index set, usually called "time".
All possible sample values of {Xt, t ∈ T} are called "states", which are elements of a state space S.
"Markov property": given the present value (information) of the process, the future evolution of the process is independent of the past evolution of the process.
Example: Discrete S & Discrete T
Classification of Markov Process
Discrete-Time Markov Chain: Discrete S & Discrete T
Continuous-Time Markov Chain: Discrete S & Continuous T
Discrete Markov Process: Continuous S & Discrete T
Continuous Markov Process: Continuous S & Continuous T
Markov Chain (DTMC)
Definition
A sequence of random variables X0, X1, X2, ... taking values in the state space {1, 2, ..., M} is called a Markov chain if for all n ≥ 0,

P(Xn+1 = j | Xn = i, Xn−1 = in−1, ..., X0 = i0) = P(Xn+1 = j | Xn = i).
Transition Matrix
Definition
Let X0, X1, X2, ... be a Markov chain with state space {1, 2, ..., M}, and let qij = P(Xn+1 = j | Xn = i) be the transition probability from state i to state j. The M × M matrix Q = (qij) is called the transition matrix of the chain.
Graphical and Matrix Form of Markov Chain
(State diagram of the four-state chain; each arrow carries the corresponding transition probability from Q.)

Q =
0.1  0.3  0    0.6
0.4  0.2  0.1  0.3
0    0.8  0    0.2
0.2  0    0.2  0.6

Each row of the transition matrix sums to 1.
n-step Transition Probability
Definition
The n-step transition probability from i to j is the probability of being at j exactly n steps after being at i. We denote this by q_ij^(n):

q_ij^(n) = P(Xn = j | X0 = i).
Lemma
Entries in the matrix Qn are probabilities of n-step transitions.
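The lemma can be checked numerically. Below is a pure-Python sketch using the four-state transition matrix from the earlier slide; it also verifies the Chapman–Kolmogorov relation Q^(n+m) = Q^n Q^m that is stated next:

```python
def matmul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matpow(A, n):
    """A^n by repeated multiplication (n >= 1)."""
    R = A
    for _ in range(n - 1):
        R = matmul(R, A)
    return R

# Four-state transition matrix from the earlier slide.
Q = [
    [0.1, 0.3, 0.0, 0.6],
    [0.4, 0.2, 0.1, 0.3],
    [0.0, 0.8, 0.0, 0.2],
    [0.2, 0.0, 0.2, 0.6],
]

Q5 = matpow(Q, 5)  # (i, j) entry is the 5-step transition probability

# Every row of Q^n is still a probability distribution.
assert all(abs(sum(row) - 1.0) < 1e-9 for row in Q5)

# Chapman-Kolmogorov: Q^(2+3) = Q^2 Q^3.
CK = matmul(matpow(Q, 2), matpow(Q, 3))
assert all(abs(Q5[i][j] - CK[i][j]) < 1e-9 for i in range(4) for j in range(4))
```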
Proof
Chapman-Kolmogorov Equation
Theorem (Chapman-Kolmogorov Equation)
For any m ≥ 0, n ≥ 0, and i, j ∈ S,

q_ij^(n+m) = Σ_{k∈S} q_ik^(n) q_kj^(m).
Proof
Marginal Distribution of Xn
Theorem
Define t = (t1, t2, ..., tM) by ti = P(X0 = i), and view t as a row vector. Then the marginal distribution of Xn is given by the vector tQ^n. That is, the jth component of tQ^n is P(Xn = j).
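A quick illustration of the theorem (pure-Python sketch; the same four-state chain as before, started deterministically in state 1):

```python
# Four-state transition matrix from the earlier slide.
Q = [
    [0.1, 0.3, 0.0, 0.6],
    [0.4, 0.2, 0.1, 0.3],
    [0.0, 0.8, 0.0, 0.2],
    [0.2, 0.0, 0.2, 0.6],
]

t = [1.0, 0.0, 0.0, 0.0]  # t_i = P(X0 = i): X0 = state 1 with probability 1

# Multiplying t by Q ten times gives t Q^10, the marginal distribution of X_10.
for _ in range(10):
    t = [sum(t[i] * Q[i][j] for i in range(4)) for j in range(4)]

# The marginal distribution still sums to 1.
assert abs(sum(t) - 1.0) < 1e-12
```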
Recurrent and Transient States
Definition
State i of a Markov chain is recurrent if, starting from i, the probability is 1 that the chain will eventually return to i. Otherwise, the state is transient, which means that if the chain starts from i, there is a positive probability of never returning to i.
Example
Irreducible and Reducible Chain
Definition
A Markov chain with transition matrix Q is irreducible if for any two states i and j, it is possible to go from i to j in a finite number of steps (with positive probability). That is, for any states i, j there is some positive integer n such that the (i, j) entry of Q^n is positive. A Markov chain that is not irreducible is called reducible.
Irreducible Implies All States Recurrent
Theorem
In an irreducible Markov chain with a finite state space, all states are recurrent.
A Reducible Markov Chain with Recurrent States
Gambler’s Ruin As A Markov Chain
Coupon Collector As A Markov Chain
Period
Definition
The period of a state i in a Markov chain is the greatest common divisor (gcd) of the possible numbers of steps it can take to return to i when starting at i. That is, the period of i is the greatest common divisor of numbers n such that the (i, i) entry of Q^n is positive. (The period of i is undefined if it is impossible ever to return to i after starting at i.) A state is called aperiodic if its period equals 1, and periodic otherwise. The chain itself is called aperiodic if all its states are aperiodic, and periodic otherwise.
Example
Definition
A row vector s = (s1, ..., sM) such that si ≥ 0 and Σi si = 1 is a stationary distribution for a Markov chain with transition matrix Q if

Σi si qij = sj

for all j, or equivalently, sQ = s.
Example: Two-State Markov Chain
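The figure for this example did not survive extraction, but the closed form is easy to verify. For a generic two-state chain with switching probabilities α and β (the values below are illustrative, not from the slide), the stationary distribution is s = (β, α)/(α + β), which can be checked against sQ = s directly:

```python
# Two-state chain: state 1 switches to state 2 w.p. alpha, state 2 switches
# back w.p. beta (illustrative values).
alpha, beta = 0.3, 0.6
Q = [[1 - alpha, alpha],
     [beta, 1 - beta]]

# Candidate stationary distribution: s = (beta, alpha) / (alpha + beta).
s = [beta / (alpha + beta), alpha / (alpha + beta)]

# Check the definition sQ = s component-wise.
sQ = [s[0] * Q[0][j] + s[1] * Q[1][j] for j in range(2)]
assert all(abs(x - y) < 1e-12 for x, y in zip(sQ, s))
```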
Existence and Uniqueness of Stationary Distribution
Theorem
Any irreducible Markov chain has a unique stationary distribution. In this distribution, every state has positive probability.
Reversibility
Definition
Let Q = (qij) be the transition matrix of a Markov chain. Suppose there is s = (s1, ..., sM) with si ≥ 0 and Σi si = 1, such that

si qij = sj qji

for all states i and j. This equation is called the reversibility or detailed balance condition, and we say that the chain is reversible with respect to s if it holds.
Check the Detailed Balance Equation
Theorem
If for an irreducible Markov chain with transition matrix Q = (qij) there exists a probability solution π to the detailed balance equations

πi qij = πj qji

for all pairs of states i, j, then this Markov chain is positive recurrent and time-reversible, and the solution π is the unique stationary distribution.
Example: Simple Random Walk
Consider a negative-drift simple random walk, restricted to be non-negative, in which q_{0,1} = 1, q_{i,i+1} = p < 0.5, and q_{i,i−1} = 1 − p for i ≥ 1. The state diagram is a birth–death chain on 0, 1, 2, ...: from 0 the walk moves to 1 with probability 1; from each i ≥ 1 it moves up to i + 1 with probability p and down to i − 1 with probability 1 − p.
Modeling Language: Graph Theory
Origin: Euler's 1735 solution of the Seven Bridges of Königsberg problem
Key Elements of A Graph
A graph is an ordered pair G = (V ,E )
V : a set of vertices or nodes
E : a set of edges or links between nodes
Edge: undirected/directed
Undirected Graph
Degree of vertex v : metric for connectivity of vertex v .
deg(v): the number of edges with v as an end vertex
Directed Graph
Indegree of vertex v: the number of incoming edges ending at v.
Outdegree of vertex v : the number of outgoing edges starting from v .
I (v): indegree of v
O(v): outdegree of v
Adjacency Matrix
A square (0,1)-matrix used to represent a finite graph
Matrix elements indicate whether pairs of vertices are adjacent or not
Symmetric matrix: for undirected graph
Adjacency Matrix: Directed Graph
Adjacency Matrix of Nauru Graph
Adjacency Matrix of Cayley Graph
Random Walk on Undirected Graph
How to Organize the Web?
First Try: Web Directories
Yahoo, DMOZ, LookSmart
Yahoo
How to Organize the Web?
Second Try: Web Search
Information Retrieval:
  - find relevant docs in a small and trusted set
  - newspaper articles, patents, etc.

Hardness: the web is huge, full of untrusted documents, random things, web spam, etc.
Challenges for Web Search
Web contains many sources of information. Who to trust?
  - Trick: trustworthy pages may point to each other!

What is the best answer to query keywords?
  - Webpages are not equally important (www.nothing.com vs. www.stanford.edu)
  - Trick: rank pages containing the keywords according to their importance (popularity)
  - Find the page with the highest rank
  - How to rank?
World Wide Web as A Graph
Web as a directed graph
Nodes: webpages
Edges: hyperlinks
Web as A Directed Graph
Milestones
1998: Larry Page (1973–) and Sergey Brin (1973–) invented the PageRank algorithm and then founded Google.
1998: Jon Kleinberg (1971–) invented the Hyperlink-Induced Topic Search (HITS) algorithm.
Links as Votes
A page is more important if it has more links
Incoming links or outgoing links?
Think of incoming links as votes:
  - www.stanford.edu has 23,400 incoming links
  - www.nothing.com has 1 incoming link

Are all in-links equal?
  - Links from important pages count more
  - Recursive question!
Example: PageRank Scores
Simple Recursive Formulation
Each link’s vote is proportional to the importance of its source page.
If page j with importance rj has n out-links, each link gets rj/n votes
Page j’s own importance is the sum of the votes on its in-links
Example
rj = ri/3 + rk/4.
PageRank: The “Flow” Model
A “vote” from an important page is worth more
A page is important if it is pointed to by other important pages
Define a rank rj for page j:

r_j = Σ_{i→j} r_i / O_i

where Oi is the outdegree of i.
Example: Flow Equation
ry = ry/2 + ra/2
ra = ry/2 + rm
rm = ra/2
Solving the Flow Equations
An additional constraint forces uniqueness:
  - ry + ra + rm = 1
  - solution: ry = 2/5, ra = 2/5, rm = 1/5

Gaussian elimination works for small examples

We need a better method for large web-size graphs
PageRank: Matrix Formulation
Link (adjacency) matrix Q:
  - Each page i has Oi out-links
  - If i → j, then Qij = 1/Oi; else Qij = 0

Q is a stochastic matrix: each row sums to 1.
PageRank: Matrix Formulation
Rank vector r:
  - a vector with one entry per page
  - ri is the importance score of page i
  - Σi ri = 1

The flow equations can be written as

r = r · Q
Example
ry = ry/2 + ra/2
ra = ry/2 + rm
rm = ra/2

Q (rows and columns ordered y, a, m):

    y    a    m
y  1/2  1/2   0
a  1/2   0   1/2
m   0    1    0
Random Walk Interpretation
The Stationary Distribution
r · Q = r
Power Iteration Method
Given a web graph with N nodes (pages) and edges (hyperlinks):

Power iteration: a simple iterative scheme
  - Initialize: r(0) = [1/N, ..., 1/N]
  - Iterate: r(t+1) = r(t) · Q
  - Stop when |r(t+1) − r(t)|_1 < ε
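The scheme above can be sketched in a few lines of pure Python, applied to the y/a/m example from the earlier slides (rows of Q ordered y, a, m):

```python
# Row-stochastic link matrix for the 3-page y/a/m example.
Q = [
    [0.5, 0.5, 0.0],  # y links to y and a
    [0.5, 0.0, 0.5],  # a links to y and m
    [0.0, 1.0, 0.0],  # m links to a
]
N = len(Q)

r = [1.0 / N] * N  # r(0) = [1/N, ..., 1/N]
for _ in range(10_000):
    # r(t+1) = r(t) . Q
    r_next = [sum(r[i] * Q[i][j] for i in range(N)) for j in range(N)]
    # Stop when the L1 change is below epsilon.
    if sum(abs(a - b) for a, b in zip(r_next, r)) < 1e-12:
        r = r_next
        break
    r = r_next

# The iteration converges to the ranks found earlier: ry = ra = 2/5, rm = 1/5.
assert all(abs(x - y) < 1e-9 for x, y in zip(r, [0.4, 0.4, 0.2]))
```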
The Google Formulation
r_j^(t+1) = Σ_{i→j} r_i^(t) / O_i   ⟹   r^(t+1) = r^(t) · Q

Does this converge? Does it converge to what we want? Are the results reasonable?
Example 1: Spider Traps
Two pages A and B that link only to each other:

r_A^(t+1) = r_B^(t)
r_B^(t+1) = r_A^(t)

rA → 1, 0, 1, 0, ...
rB → 0, 1, 0, 1, ...
Example 2: Dead End
Page A links only to B, and B has no out-links:

rA → 1, 0, 0, 0, ...
rB → 0, 1, 0, 0, ...
Observations
Some pages are dead ends (no out-links)

Spider traps (all out-links are within the group)
Example
Here m is a spider trap: all of its out-links stay within {m}. With rows and columns ordered y, a, m:

    y    a    m
y  1/2  1/2   0
a  1/2   0   1/2
m   0    0    1

⟹  ry → 1/3, 2/6, 3/12, ..., 0
    ra → 1/3, 1/6, 2/12, ..., 0
    rm → 1/3, 3/6, 7/12, ..., 1
Google’s Solution for Spider Trap
Teleports!
At each step, a walker at page i has two options:
  - w.p. β: follow an out-link chosen uniformly at random (probability 1/Oi each)
  - w.p. 1 − β: jump to a uniformly random page

In practice β ≈ 0.8–0.9.
Google’s Solution for Dead Ends
Here m is a dead end (no out-links), so its row is all zeros and the matrix is not stochastic. The fix: always teleport from dead ends, i.e., replace the all-zero row with the uniform distribution. With rows and columns ordered y, a, m:

    y    a    m              y    a    m
y  1/2  1/2   0          y  1/2  1/2   0
a  1/2   0   1/2   ⟹    a  1/2   0   1/2
m   0    0    0          m  1/3  1/3  1/3
Random Teleports
PageRank equation:

r_j = Σ_{i→j} β · r_i / O_i + (1 − β) · (1/N)

Google matrix:

G = β · Q + (1 − β) · [1/N]_{N×N}

r = r · G
Random Teleports
Google matrix:

Q → G = β · Q + (1 − β) · [1/N]_{N×N}

(The figure shows the y/a/m link graph before and after adding the teleport edges.)
Do Some Calculation
Let β = 0.8. Then

          1/2  1/2   0           1/3  1/3  1/3
G = 0.8 · 1/2   0   1/2  + 0.2 · 1/3  1/3  1/3
           0    0    1           1/3  1/3  1/3

As a result,

     7/15  7/15   1/15
G =  7/15  1/15   7/15
     1/15  1/15  13/15
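The arithmetic on this slide can be verified directly (a pure-Python sketch of G = βQ + (1 − β)[1/N]):

```python
beta = 0.8
N = 3
# Row-stochastic Q for the y/a/m example; m is a spider trap here.
Q = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0],
]

# Google matrix: G = beta * Q + (1 - beta) * (1/N) in every entry.
G = [[beta * Q[i][j] + (1 - beta) / N for j in range(N)] for i in range(N)]

expected = [
    [7/15, 7/15, 1/15],
    [7/15, 1/15, 7/15],
    [1/15, 1/15, 13/15],
]
assert all(abs(G[i][j] - expected[i][j]) < 1e-12
           for i in range(N) for j in range(N))
```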
Implementation of PageRank in Practice
Google’s Team with the leader Jeff Dean developed several tools
BigTable: distributed storage system
GFS (Google File System): distributed file system
Spanner: distributed database
Mapreduce: distributed computing system (followed by Hadoop &Spark)
Motivation of MCMC
Previously we saw that we can generate draws from desired distributions based on the Unif(0,1) distribution.

Universality of the Uniform depends partly on how tractable it is to compute the inverse CDF.

In general it is hard to compute the CDF and its inverse.
Motivation of MCMC
Markov Chain Monte Carlo (MCMC): build your own Markov chain (X0, X1, ...) so that the desired distribution π is the stationary distribution of the chain.

Forward & reverse engineering perspective:
  - Forward engineering: given the transition matrix P, find the stationary distribution of the Markov chain.
  - Reverse engineering: given a distribution π that we want to simulate, engineer a Markov chain whose stationary distribution is π. Run this engineered Markov chain for a long time, and the distribution of the chain will approach π.
MCMC: Sampling Perspective
Sampling from a distribution π

Uses Markov sequences (samples) to effectively simulate from what would otherwise be intractable distributions.

With samples, we can further compute the sample mean, sample variance, and other sample functions to approximate the values of interesting statistics.

This is guaranteed by the strong law of large numbers.
Strong Law of Large Numbers for I.I.D. Sequences
If Y1, Y2, ... is an i.i.d. sequence with common mean µ < ∞, then the strong law of large numbers says that, with probability 1,

lim_{n→∞} (Y1 + ... + Yn)/n = µ.

Equivalently, let Y be a random variable with the same distribution as the Yi and assume that r is a bounded, real-valued function. Then r(Y1), r(Y2), ... is also an i.i.d. sequence with finite mean, and, with probability 1,

lim_{n→∞} (r(Y1) + ... + r(Yn))/n = E(r(Y)).
Strong Law of Large Numbers for Markov Chains
Theorem
Assume that X0, X1, ... is an ergodic Markov chain with stationary distribution π. Let r be a bounded, real-valued function, and let X be a random variable with distribution π. Then, with probability 1,

lim_{n→∞} (r(X1) + ... + r(Xn))/n = E(r(X)),

where E(r(X)) = Σ_j r(j) πj.
Theoretical Foundations of MCMC
All MCMC algorithms construct a reversible (time-reversible) Markov chain: the detailed balance equations help us.
Theorem
If for an irreducible Markov chain with transition matrix Q = (qij) there exists a probability solution π to the detailed balance equations

πi qij = πj qji

for all pairs of states i, j, then this Markov chain is positive recurrent and time-reversible, and the solution π is the unique stationary distribution.
Example: Binary Sequences with No Adjacent 1s
Consider sequences of length m consisting of 0s and 1s. Call a sequence good if it has no adjacent 1s. What is the expected number of 1s in a good sequence if all good sequences are equally likely?
MCMC Approach
Two Main Methods of MCMC
Metropolis–Hastings algorithm: generates a random walk using a proposal distribution and a method for rejecting some of the proposed moves.

Gibbs sampling: requires all the conditional distributions of the target distribution to be sampled exactly. When drawing from the full conditional distributions is not straightforward, other samplers-within-Gibbs are used.
Basic Idea
Proposed by Nicholas Metropolis in 1953 & further developed byWilfred Keith Hastings in 1970.
Start with any irreducible Markov chain on the state space of interest
Then modify it into a new Markov chain with the desired stationary distribution
Modification: moves are proposed according to the original chain, but each proposal may or may not be accepted.
Art: choice of the probability of accepting the proposal
Algorithm for DTMC
Let π = (π1, ..., πM) be a desired stationary distribution on state space {1, ..., M}. Assume that πi > 0 for all i (if not, just delete any states i with πi = 0 from the state space). Suppose that P = (pij) is the transition matrix for a Markov chain on state space {1, ..., M}. Intuitively, P is a Markov chain that we know how to run but that does not have the desired stationary distribution. We will modify P to construct a Markov chain with stationary distribution π.
Use the original transition probabilities pij to propose where to go next
Then accept the proposal with probability aij
Staying in the current state in the event of a rejection.
The normalizing constant of π does not need to be known.
Algorithm 1: Metropolis–Hastings

Require: stationary distribution π = (π1, ..., πM); original transition matrix P = (pij); initial state X0 (chosen randomly or deterministically).
Ensure: modified transition matrix P′ = (p′ij).
1: repeat
2:   If Xn = i, propose a new state j using the transition probabilities in the ith row of the original transition matrix P;
3:   Compute the acceptance probability aij = min(πj pji / (πi pij), 1);
4:   Flip a coin that lands Heads with probability aij;
5:   If the coin lands Heads, accept the proposal (i.e., go to j), setting Xn+1 = j. Otherwise, reject the proposal (i.e., stay at i), setting Xn+1 = i;
6: until convergence;
7: return P′;
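A minimal sketch of the algorithm in pure Python. The target weights, uniform proposal matrix, and run length in the demo are illustrative choices, not from the slides:

```python
import random

def metropolis_hastings(pi, P, x0, n_steps, seed=0):
    """Metropolis-Hastings on states 0..M-1. pi holds (possibly unnormalized)
    target weights; P is the proposal transition matrix (rows sum to 1)."""
    rng = random.Random(seed)
    M = len(pi)
    x = x0
    samples = []
    for _ in range(n_steps):
        # Propose j from the x-th row of P.
        j = rng.choices(range(M), weights=P[x])[0]
        # Acceptance probability a_xj = min(pi_j p_jx / (pi_x p_xj), 1).
        a = min(pi[j] * P[j][x] / (pi[x] * P[x][j]), 1.0)
        if rng.random() < a:
            x = j          # accept: move to j
        samples.append(x)  # on rejection, stay at x
    return samples

# Demo: target proportional to (1, 2, 3) with a uniform proposal chain.
pi = [1.0, 2.0, 3.0]
P = [[1/3, 1/3, 1/3]] * 3
samples = metropolis_hastings(pi, P, x0=0, n_steps=100_000)
freqs = [samples.count(k) / len(samples) for k in range(3)]
# freqs should approach the normalized target (1/6, 2/6, 3/6).
```

Note that only the ratios πj/πi enter the acceptance probability, which is why π never needs to be normalized.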
Metropolis–Hastings Algorithm
Theorem
The sequence X0, X1, ... constructed by the Metropolis–Hastings algorithm is a reversible Markov chain with stationary distribution π.
Remarks
The exact form of π is not necessary to implement Metropolis–Hastings. The algorithm only uses ratios of the form πj/πi. Thus, π needs only to be specified up to proportionality.

If the proposal transition matrix P is symmetric, aij = min(πj/πi, 1).

The algorithm works for any irreducible proposal chain (original chain). Thus, the user has wide latitude to find a proposal chain that is efficient in the context of their problem.
Remarks
The generated sequence X0, X1, ..., Xn gives an approximate sample from π.

If the chain requires many steps to get close to stationarity, there may be initial bias.

Burn-in: the practice of discarding the initial iterations and retaining Xm, Xm+1, ..., Xn, for some m.

The strong law of large numbers for Markov chains:

lim_{n→∞} (r(Xm) + ... + r(Xn))/(n − m + 1) = E(r(X)) = Σx r(x) πx.
Example: Zipf Distribution Simulation
Let M ≥ 2 be an integer. An r.v. X has the Zipf distribution with parameter a > 0 if its PMF is

P(X = k) = (1/k^a) / Σ_{j=1}^M (1/j^a),

for k = 1, 2, ..., M (and 0 otherwise). This distribution is widely used in linguistics for studying frequencies of words.

Create a Markov chain X0, X1, ... whose stationary distribution is the Zipf distribution, and such that |Xn+1 − Xn| ≤ 1 for all n. Your answer should provide a simple, precise description of how each move of the chain is obtained, i.e., how to transit from Xn to Xn+1 for each n.
Solution
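One standard construction, sketched in Python (the parameter values in the demo are illustrative): propose Xn + 1 or Xn − 1 with probability 1/2 each; if the proposal falls outside {1, ..., M}, stay put; otherwise accept with probability min(πj/πXn, 1) = min((Xn/j)^a, 1), since the ±1 proposal is symmetric.

```python
import random

def zipf_chain(a, M, n_steps, seed=0):
    """M-H chain on {1,...,M} with stationary PMF pi_k proportional to 1/k^a,
    moving by at most one state per step (|X_{n+1} - X_n| <= 1)."""
    rng = random.Random(seed)
    x = 1
    samples = []
    for _ in range(n_steps):
        j = x + rng.choice([-1, 1])  # symmetric +-1 proposal
        # Accept with min(pi_j / pi_x, 1) = min((x/j)^a, 1); stay otherwise.
        if 1 <= j <= M and rng.random() < min((x / j) ** a, 1.0):
            x = j
        samples.append(x)
    return samples

samples = zipf_chain(a=1.0, M=5, n_steps=200_000)
freq1 = samples.count(1) / len(samples)
# Target: pi_1 = 1 / (1 + 1/2 + 1/3 + 1/4 + 1/5), roughly 0.438.
```

Rejected out-of-range proposals keep the chain on {1, ..., M} without breaking detailed balance, since the ±1 proposal probabilities remain symmetric between neighboring states.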
Example: Knapsack Problem
m treasures with labels from 1 to m, where the jth treasure is worthgj gold pieces and weighs wj pounds.
The maximum weight we can carry is w pounds.
We must choose a vector x = (x1, ..., xm), where xj is 1 if we choose the jth treasure and 0 otherwise, such that the total weight of the treasures j with xj = 1 is at most w.

Let C be the space of all such vectors, so C consists of all binary vectors (x1, ..., xm) with Σ_{j=1}^m xj wj ≤ w.

We wish to maximize the total worth of the treasures we take.
Finding an optimal solution is an extremely difficult problem, knownas the knapsack problem, which has a long history in computerscience.
A brute force solution would be completely infeasible in general. Howabout MCMC?
Example: Knapsack Problem
(a) Consider the following Markov chain. Start at (0, 0, ..., 0). One move of the chain is as follows. Suppose the current state is x = (x1, ..., xm). Choose a uniformly random J in {1, 2, ..., m}, and obtain y from x by replacing xJ with 1 − xJ (i.e., toggle whether that treasure will be taken). If y is not in C, stay at x; if y is in C, move to y. Show that the uniform distribution over C is stationary for this chain.

(b) Show that the chain from (a) is irreducible, and that it may or may not be aperiodic (depending on w, w1, ..., wm).
Solution
Example: Knapsack Problem
(c) The chain from (a) is a useful way to get approximately uniform solutions, but Bilbo is more interested in finding solutions where the value (in gold pieces) is high. In this part, the goal is to construct a Markov chain with a stationary distribution that puts much higher probability on any particular high-value solution than on any particular low-value solution. Specifically, suppose that we want to simulate from the distribution

π(x) ∝ e^{βV(x)},

where V(x) = Σ_{j=1}^m xj gj is the value of x in gold pieces and β is a positive constant. The idea behind this distribution is to give exponentially more probability to each high-value solution than to each low-value solution. Create a Markov chain whose stationary distribution is as desired.
Solution
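One way to realize such a chain in code (a Python sketch; the three-treasure instance, β = 0.5, and the step count are illustrative assumptions): toggle a uniformly chosen coordinate as in part (a); since that proposal is symmetric, an in-C proposal is accepted with probability min(e^{β(V(y)−V(x))}, 1). Tracking the best state visited turns the sampler into a crude optimizer.

```python
import math
import random

def knapsack_chain(g, w, w_max, beta, n_steps, seed=0):
    """M-H chain on feasible 0/1 vectors with stationary pi(x) ~ exp(beta*V(x)).
    Proposal: toggle one uniformly chosen coordinate; infeasible moves are
    rejected (stay). Returns the best value and vector visited."""
    rng = random.Random(seed)
    m = len(g)
    x = [0] * m
    weight = 0.0
    value = 0.0
    best_value, best_x = 0.0, x[:]
    for _ in range(n_steps):
        j = rng.randrange(m)
        sign = 1 - 2 * x[j]  # +1 if adding treasure j, -1 if dropping it
        if weight + sign * w[j] <= w_max:  # proposal stays in C
            # Symmetric proposal: accept w.p. min(exp(beta * dV), 1).
            if rng.random() < min(math.exp(beta * sign * g[j]), 1.0):
                x[j] ^= 1
                weight += sign * w[j]
                value += sign * g[j]
                if value > best_value:
                    best_value, best_x = value, x[:]
    return best_value, best_x

# Tiny instance: the best feasible set is treasures 2 and 3 (value 12, weight 5).
best_value, best_x = knapsack_chain(g=[10, 7, 5], w=[4, 3, 2], w_max=5,
                                    beta=0.5, n_steps=50_000)
```

With a moderate β the chain can still escape locally good but globally suboptimal states; very large β makes escapes exponentially rare, which is exactly the tension that simulated annealing (next topic) addresses by increasing β gradually.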
Impact of β
Application: Simulated Annealing
MCMC algorithm for optimization problems
M-H with varying β: β gradually increases over time
β is small at first, exploring the state space broadly
As β gets larger and larger, the chain becomes more likely to get stuck near a local optimum, and the stationary distribution concentrates more and more around the best-value solutions

Why the name "annealing"?

Annealing of metals: the metal is first heated to a very high temperature, and then gradually cooled until it reaches a very strong and stable state

β = 1/Temperature
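As a concrete illustration, here is a minimal simulated-annealing sketch for a toy knapsack instance. The item weights, gold values, capacity, and cooling schedule below are all made-up illustration choices, not from the problem:

```python
import math
import random

random.seed(0)

# Toy instance (hypothetical): item weights, gold values, and capacity.
weights = [3, 5, 2, 7, 4]
values = [10, 40, 15, 60, 25]
capacity = 10

def total(x, arr):
    return sum(a for xi, a in zip(x, arr) if xi)

def anneal(n_steps=20000):
    x = [0] * len(weights)              # start from the empty knapsack
    best, best_val = x[:], 0
    for t in range(1, n_steps + 1):
        beta = 0.001 * t                # inverse temperature grows over time
        j = random.randrange(len(x))
        y = x[:]
        y[j] = 1 - y[j]                 # propose flipping one item in or out
        if total(y, weights) > capacity:
            continue                    # infeasible proposal: stay put
        dv = total(y, values) - total(x, values)
        # symmetric proposal, so the M-H acceptance ratio is pi(y)/pi(x) = e^(beta*dv)
        if dv >= 0 or random.random() < math.exp(beta * dv):
            x = y
        if total(x, values) > best_val:
            best, best_val = x[:], total(x, values)
    return best, best_val

best, best_val = anneal()
```

Early on, small β lets value-decreasing flips through so the chain explores; later, large β makes the chain effectively greedy, and the best solution seen is recorded along the way.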
Continuous State Space
MCMC can also be used in the continuous case, when π is a probability density function.

For a continuous state space, a transition function replaces the transition matrix.

In place of Pij, the transition function P(x, y) gives, for each x, the value of a conditional density of the next state given X0 = x.
Example: Beta Simulation
Suppose that we want to generate W ∼ Beta(a, b), but we don’t know about the rbeta command in R. Instead, what we have available are i.i.d. Unif(0, 1) r.v.s. How can we generate W which is approximately Beta(a, b) if a and b are any positive real numbers, with the help of a Markov chain on the continuous state space (0, 1)?
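One standard construction (a sketch, not necessarily the intended solution) is a Metropolis-Hastings chain on (0, 1) with an independence proposal drawn from Unif(0, 1); since the proposal density is constant, the acceptance probability reduces to a ratio of unnormalized Beta densities:

```python
import random

random.seed(42)

def beta_chain(a, b, n_steps=100000):
    """Markov chain on (0, 1) whose stationary density is Beta(a, b)."""
    def f(x):
        # Beta(a, b) density up to its normalizing constant
        return x ** (a - 1) * (1 - x) ** (b - 1)
    x = 0.5
    samples = []
    for _ in range(n_steps):
        y = random.random()            # independence proposal ~ Unif(0, 1)
        # constant proposal density => acceptance probability min(1, f(y)/f(x))
        if random.random() < min(1.0, f(y) / f(x)):
            x = y
        samples.append(x)
    return samples

samples = beta_chain(2, 3)
mean = sum(samples) / len(samples)     # Beta(2, 3) has mean 2/5 = 0.4
```

Note that only Unif(0, 1) draws are used, as the problem requires; the normalizing constant of the Beta density cancels in the acceptance ratio.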
Solution
Example: Normal-Normal Conjugacy
Let Y | θ ∼ N(θ, σ²), where σ² is known but θ is unknown. Using the Bayesian framework, we treat θ as a random variable, with prior given by θ ∼ N(µ, τ²) for some known constants µ and τ². That is, we have the two-level model

θ ∼ N(µ, τ²)
Y | θ ∼ N(θ, σ²)

Describe how to use the Metropolis-Hastings algorithm to find the posterior mean and variance of θ after observing the value of Y.
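A random-walk Metropolis sketch for this model; the observed y, the constants µ, τ², σ², and the proposal scale below are hypothetical illustration values, and conjugacy gives an exact answer to compare against:

```python
import math
import random

random.seed(1)

# Hypothetical illustration values.
mu, tau2 = 0.0, 1.0      # prior: theta ~ N(mu, tau2)
sigma2 = 1.0             # model: Y | theta ~ N(theta, sigma2)
y = 2.0                  # the observed value of Y

def log_post(theta):
    # log of (likelihood * prior), up to an additive constant
    return -(y - theta) ** 2 / (2 * sigma2) - (theta - mu) ** 2 / (2 * tau2)

theta, draws = 0.0, []
for _ in range(200000):
    prop = theta + random.gauss(0, 1.0)   # symmetric random-walk proposal
    if random.random() < math.exp(min(0.0, log_post(prop) - log_post(theta))):
        theta = prop
    draws.append(theta)

burned = draws[50000:]                    # discard burn-in
post_mean = sum(burned) / len(burned)
post_var = sum((d - post_mean) ** 2 for d in burned) / len(burned)
# Conjugacy gives the exact posterior mean and variance to compare against:
exact_mean = (y / sigma2 + mu / tau2) / (1 / sigma2 + 1 / tau2)
exact_var = 1 / (1 / sigma2 + 1 / tau2)
```

Working on the log scale avoids underflow, and the symmetric proposal means the proposal densities cancel from the acceptance ratio.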
Solution
Simulation Results with MCMC
Outline

9 Gibbs Sampler
Basic Idea
Proposed by brothers Stuart and Donald Geman in 1984.
Named after the physicist Josiah Willard Gibbs, in reference to an analogy between the sampling algorithm and statistical physics.

Obtaining approximate draws from a joint distribution, based on sampling from conditional distributions one at a time

Especially useful in problems where these conditional distributions are pleasant to work with.
Basic Idea
At each stage, one variable is updated (keeping all the other variables fixed) by drawing from the conditional distribution of that variable given all the other variables.

Two major kinds of Gibbs sampler:
Systematic scan: the updates sweep through the components in a deterministic order.
Random scan: a randomly chosen component is updated at each stage.
Algorithm: Systematic Scan Gibbs Sampler
Let X and Y be discrete r.v.s with joint PMF pX,Y(x, y) = P(X = x, Y = y). We wish to construct a two-dimensional Markov chain (Xn, Yn) whose stationary distribution is pX,Y. The systematic scan Gibbs sampler proceeds by updating the X-component and the Y-component in alternation. If the current state is (Xn, Yn) = (xn, yn), then we update the X-component while holding the Y-component fixed, and then update the Y-component while holding the X-component fixed.
Algorithm 2 Systematic Scan Gibbs Sampler
Require: Joint PMF pX,Y; initial state (X0, Y0);
Ensure: two-dimensional Markov chain (Xn, Yn);
1: repeat
2:   Draw a value xn+1 from the conditional distribution of X given Y = yn, and set Xn+1 = xn+1;
3:   Draw a value yn+1 from the conditional distribution of Y given X = xn+1, and set Yn+1 = yn+1;
4:   return (Xn+1, Yn+1);
5: until n ≥ N;
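The algorithm above can be sketched on a small made-up joint PMF (the 2×2 table of probabilities below is purely illustrative):

```python
import random

random.seed(7)

# Hypothetical joint PMF p(x, y) on {0, 1} x {0, 1}.
p = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

def draw_x_given_y(y):
    # conditional PMF of X given Y = y, renormalized from the joint table
    w0, w1 = p[(0, y)], p[(1, y)]
    return 0 if random.random() < w0 / (w0 + w1) else 1

def draw_y_given_x(x):
    # conditional PMF of Y given X = x
    w0, w1 = p[(x, 0)], p[(x, 1)]
    return 0 if random.random() < w0 / (w0 + w1) else 1

x, y = 0, 0
n_steps = 200000
counts = {s: 0 for s in p}
for _ in range(n_steps):
    x = draw_x_given_y(y)    # update the X-component, Y held fixed
    y = draw_y_given_x(x)    # update the Y-component, X held fixed
    counts[(x, y)] += 1

freqs = {s: c / n_steps for s, c in counts.items()}  # should approach p
```

The long-run frequencies of the visited states approximate the joint PMF, even though the sampler only ever draws from the two conditionals.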
Algorithm: Random Scan Gibbs Sampler
As above, let X and Y be discrete r.v.s with joint PMF pX,Y(x, y). We wish to construct a two-dimensional Markov chain (Xn, Yn) whose stationary distribution is pX,Y. Each move of the random scan Gibbs sampler picks a uniformly random component and updates it, according to the conditional distribution given the other component.
Algorithm 3 Random Scan Gibbs Sampler
Require: Joint PMF pX,Y; initial state (X0, Y0);
Ensure: two-dimensional Markov chain (Xn, Yn);
1: repeat
2:   Choose which component to update, with equal probabilities;
3:   If the X-component was chosen, draw a value xn+1 from the conditional distribution of X given Y = yn, and set Xn+1 = xn+1, Yn+1 = yn. Similarly, if the Y-component was chosen, draw a value yn+1 from the conditional distribution of Y given X = xn, and set Xn+1 = xn, Yn+1 = yn+1;
4:   return (Xn+1, Yn+1);
5: until n ≥ N;
Random Scan Gibbs as Metropolis-Hastings
Theorem
The random scan Gibbs sampler is a special case of the Metropolis-Hastings algorithm, in which the proposal is always accepted. In particular, it follows that the stationary distribution of the random scan Gibbs sampler is as desired.
Gibbs Sampling vs. Metropolis-Hastings
Gibbs sampling emphasizes conditional distributions.
Metropolis-Hastings emphasizes acceptance probabilities.
Example: Graph Coloring

Let G be a network (also called a graph): there are n nodes, and for each pair of distinct nodes, there either is or isn’t an edge joining them. We have a set of k colors, e.g., if k = 7, the color set may be {red, orange, yellow, green, blue, indigo, violet}. A k-coloring of the network is an assignment of a color to each node, such that two nodes joined by an edge can’t be the same color. Graph coloring is an important topic in computer science, with wide-ranging applications such as task scheduling and the game of Sudoku.
Suppose that it is possible to k-color G. Form a Markov chain on the space of all k-colorings of G, with transitions as follows: starting with a k-coloring of G, pick a uniformly random node, figure out what the legal colors are for that node, and then repaint that node with a uniformly random legal color (note that this random color may be the same as the current color). Show that this Markov chain is reversible, and find its stationary distribution.
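A sketch of this chain on a small hypothetical graph (a 4-cycle with k = 3); the adjacency list and starting coloring below are illustration choices:

```python
import random

random.seed(3)

k = 3
# Hypothetical graph: a 4-cycle, given as an adjacency list.
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
coloring = {0: 0, 1: 1, 2: 0, 3: 1}    # a legal starting 3-coloring

def step(coloring):
    v = random.choice(list(coloring))               # uniformly random node
    banned = {coloring[u] for u in neighbors[v]}    # neighbors' colors
    legal = [c for c in range(k) if c not in banned]
    coloring[v] = random.choice(legal)              # uniformly random legal color

def is_proper(coloring):
    # no edge joins two nodes of the same color
    return all(coloring[v] != coloring[u]
               for v in neighbors for u in neighbors[v])

for _ in range(10000):
    step(coloring)
```

By construction, every state visited is a proper k-coloring: the repaint step only ever chooses colors not used by the node's neighbors.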
Proof
Remarks
Think of each node in the graph as a discrete r.v. that can take on k possible values.

These nodes have a joint distribution and a complicated dependence structure (the constraint that connected nodes cannot have the same color).

We would like to sample a random k-coloring of the entire graph (draw from the joint distribution of all the nodes).

We instead condition on all but one node.

If the joint distribution is to be uniform over all legal colorings, then the conditional distribution of one node given all the others is uniform over its legal colors.

At each stage of the algorithm, we are drawing from the conditional distribution of one node given all the others: we’re running a random scan Gibbs sampler!
Gibbs Sampler for Continuous State Space
In the Gibbs sampler, the target distribution π is an m-dimensional joint density

π(x) = π(x1, · · · , xm).

A multivariate Markov chain is constructed whose stationary distribution is π, and which takes values in an m-dimensional space. The algorithm generates elements by iteratively updating each component of an m-dimensional vector conditional on the other m − 1 components.
Example: Bivariate Standard Normal Distribution
The Gibbs sampler is implemented to simulate (X, Y) from a bivariate standard normal distribution with correlation ρ.

1 Initialize: (x0, y0) ← (0, 0), m ← 1.

2 Generate xm from the conditional distribution of X given Y = ym−1. That is, simulate from a normal distribution with mean ρym−1 and variance 1 − ρ².

3 Generate ym from the conditional distribution of Y given X = xm. That is, simulate from a normal distribution with mean ρxm and variance 1 − ρ².
4 m← m + 1.
5 Return to Step 2.
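The steps above translate directly into code (ρ = 0.8 is an arbitrary illustration value):

```python
import random

random.seed(5)

rho = 0.8                              # illustration value of the correlation
sd = (1 - rho ** 2) ** 0.5             # conditional standard deviation
x, y = 0.0, 0.0
xs, ys = [], []
for _ in range(200000):
    x = random.gauss(rho * y, sd)      # X | Y = y ~ N(rho*y, 1 - rho^2)
    y = random.gauss(rho * x, sd)      # Y | X = x ~ N(rho*x, 1 - rho^2)
    xs.append(x)
    ys.append(y)

# sample correlation of the chain, which should approach rho
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / n
vx = sum((a - mx) ** 2 for a in xs) / n
vy = sum((b - my) ** 2 for b in ys) / n
corr = cov / (vx * vy) ** 0.5
```

Each full sweep draws one coordinate from its exact conditional given the other, so no acceptance step is needed.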
Simulation Results with MCMC
Example: Chicken-Egg with Unknown Parameters
A chicken lays N eggs, where N ∼ Pois(λ). Each egg hatches with probability p, where p is unknown; we let p ∼ Beta(a, b). The constants λ, a, b are known.

Here’s the catch: we don’t get to observe N. Instead, we only observe the number of eggs that hatch, X. Describe how to use Gibbs sampling to find E(p|X = x), the posterior mean of p after observing x hatched eggs.
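A sketch of one possible Gibbs sampler, assuming the two full conditionals that follow from the chicken-egg story: given p and X = x, the unhatched count N − x is Pois(λ(1 − p)), and given N = n and X = x, p is Beta(a + x, b + n − x). The constants below are made-up illustration values:

```python
import math
import random

random.seed(11)

lam, a, b = 10.0, 1.0, 1.0   # hypothetical known constants
x = 4                        # observed number of hatched eggs

def draw_poisson(mean):
    # Knuth's method; fine for the small means used here
    L, k, prod = math.exp(-mean), 0, 1.0
    while True:
        prod *= random.random()
        if prod < L:
            return k
        k += 1

p, n = 0.5, x
p_draws = []
for _ in range(100000):
    n = x + draw_poisson(lam * (1 - p))        # N | X = x, p
    p = random.betavariate(a + x, b + n - x)   # p | X = x, N = n
    p_draws.append(p)

burned = p_draws[20000:]
post_mean_p = sum(burned) / len(burned)        # estimates E(p | X = x)
```

Averaging the p draws (after burn-in) approximates the posterior mean of p given X = x.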
Solution
Example: Three-dimensional Joint Distribution
Random variables X, P and N have joint density

π(x, p, n) ∝ (n choose x) p^x (1 − p)^(n−x) · 4^n / n!

for x = 0, 1, . . . , n, 0 < p < 1, n = 0, 1, . . .. The p variable is continuous; x and n are discrete. Use the Gibbs sampler to simulate from π.
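A sketch of a Gibbs sampler for π, using full conditionals that can be read off the joint density: X | p, n ∼ Bin(n, p); p | x, n ∼ Beta(x + 1, n − x + 1); and, since π(n | x, p) ∝ (4(1 − p))^(n−x)/(n − x)! for n ≥ x, N − x | x, p ∼ Pois(4(1 − p)):

```python
import math
import random

random.seed(17)

def draw_poisson(mean):
    # Knuth's method; the means here are at most 4
    L, k, prod = math.exp(-mean), 0, 1.0
    while True:
        prod *= random.random()
        if prod < L:
            return k
        k += 1

def draw_binomial(n, p):
    # sum of n independent Bernoulli(p) trials
    return sum(random.random() < p for _ in range(n))

x, p, n = 0, 0.5, 1
x_draws = []
for _ in range(100000):
    x = draw_binomial(n, p)                     # X | p, n ~ Bin(n, p)
    p = random.betavariate(x + 1, n - x + 1)    # p | x, n ~ Beta(x+1, n-x+1)
    n = x + draw_poisson(4 * (1 - p))           # N - x | x, p ~ Pois(4(1-p))
    x_draws.append(x)

mean_x = sum(x_draws) / len(x_draws)
```

As a check: marginalizing the joint density shows N ∼ Pois(4) and p | n ∼ Unif(0, 1), so E(X) = E(Np) = 4 · 1/2 = 2, which the sample mean of the x draws should approximate.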
Solution
Simulation Results with MCMC
References
Chapters 11 & 12 of BH