The Bouncy Particle Sampler: A Non-Reversible Rejection-Free Markov Chain Monte Carlo Method
Alexandre Bouchard-Côté, Sebastian J. Vollmer, Arnaud Doucet
Presented by Changyou Chen, January 20, 2017
Changyou Chen, The Bouncy Particle Sampler: A Non-Reversible Rejection-Free Markov Chain Monte Carlo Method
Introduction
Outline
1 Introduction
2 The Bouncy Particle Sampler
3 Numerical Results
Introduction
SG-MCMC vs. Bouncy Particle Sampler
SG-MCMC: diffusion-based, approximate simulation
(Figure: SG-MCMC trace of x over t ∈ [0, 200] and the corresponding sample histogram.)
Bouncy particle: Poisson-process based, exact simulation
(Figure: piecewise-linear BPS trajectory over t ∈ [0, 200] with the bouncy/jumping points marked.)
Introduction
Bouncy Particle Sampler
1 A Poisson-process based MCMC sampler: the velocity and direction depend on the target distribution (the model posterior); the bouncing (velocity-changing) times are driven by a Poisson process parametrized by the model posterior.
2 The stationary distribution equals the model posterior distribution.
3 Since simulation of a Poisson process can be exact, no discretization error is introduced; the algorithm is rejection free.
4 Theoretically sound.
The Bouncy Particle Sampler
Outline
1 Introduction
2 The Bouncy Particle Sampler
3 Numerical Results
The Bouncy Particle Sampler
Poisson Processes
1 A Poisson process N(t) is a counting process¹ of rate λ if the inter-arrival times are i.i.d. exponential with mean 1/λ.
2 When λ depends on t, it is called an inhomogeneous Poisson process.
¹ t can be considered as one-dimensional time for simplicity; the generalization to general spaces is straightforward.
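The definition above can be illustrated with a short simulation: a minimal sketch (the function name `poisson_arrivals` is ours, not from the paper) that builds the arrival times of a homogeneous Poisson process by accumulating i.i.d. exponential inter-arrival gaps.

```python
import numpy as np

def poisson_arrivals(rate, t_max, rng):
    """Arrival times on [0, t_max] of a homogeneous Poisson process of
    the given rate, built from i.i.d. Exp(rate) inter-arrival gaps."""
    times = []
    t = rng.exponential(1.0 / rate)
    while t < t_max:
        times.append(t)
        t += rng.exponential(1.0 / rate)
    return np.array(times)

rng = np.random.default_rng(0)
arrivals = poisson_arrivals(rate=2.0, t_max=1000.0, rng=rng)
# The number of arrivals concentrates around rate * t_max = 2000.
print(len(arrivals))
```

For the inhomogeneous case used by the BPS, the rate changes along the trajectory, and simulation requires the time-scale transformation or thinning methods described later.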
The Bouncy Particle Sampler
Basic Setup
1 The goal is to sample from a target distribution:
p(x) ∝ e^{−U(x)}
2 In Bayesian models, we are given data D = {d₁, ..., d_N}, a generative model (likelihood) p(D|x) = ∏_{i=1}^N p(d_i|x) and a prior p(x), and we want to sample from the posterior:
p(x|D) ∝ p(x) p(D|x) = p(x) ∏_{i=1}^N p(d_i|x)
3 U(x) is defined as:
U(x) ≜ −∑_{i=1}^N log p(d_i|x) − log p(x)
The Bouncy Particle Sampler
Basic Setup
1 Like HMC, the parameter space is augmented with a velocity variable.
2 Define the following two quantities:
Poisson process intensity: λ(x,v) = max{0, ⟨∇U(x), v⟩}
velocity bounce operator: R(x)v = v − 2 (⟨∇U(x), v⟩ / ‖∇U(x)‖²) ∇U(x)
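As a quick sanity check on R(x), here is a minimal sketch (function and variable names are ours): the reflection flips the component of v along ∇U(x) while preserving the orthogonal component and the norm of v.

```python
import numpy as np

def bounce(grad_u, v):
    """R(x)v = v - 2 <grad_u, v> / ||grad_u||^2 * grad_u:
    reflect v off the hyperplane orthogonal to the gradient."""
    return v - 2.0 * (np.dot(grad_u, v) / np.dot(grad_u, grad_u)) * grad_u

g = np.array([1.0, 0.0])
v = np.array([0.3, 0.8])
w = bounce(g, v)
print(w)  # [-0.3  0.8]: the component along g flips, the rest is unchanged
```

Norm preservation is what makes the bounce compatible with a Gaussian (or uniform-on-the-sphere) velocity distribution.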
The Bouncy Particle Sampler
Algorithm Illustration
1 A one-dimensional example with U(x) = x².
2 A particle changes velocity (bounces) at the first arrival time of an inhomogeneous Poisson process with intensity λ(x,v).
3 In addition, a random velocity refreshment happens at the first arrival time of a Poisson process with constant intensity.
(Figure: U(x) = x² with Poisson process intensity λ(x,v) = max{0, ⟨∇U(x), v⟩}; when ∇U > 0 and v < 0 the bounce probability is 0, and when ∇U > 0 and v > 0 the bounce probability is positive.)
The Bouncy Particle Sampler
Basic BPS Algorithm
λ(x,v) = max{0, ⟨∇U(x), v⟩},  R(x)v = v − 2 (⟨∇U(x), v⟩ / ‖∇U(x)‖²) ∇U(x)

(From the paper:) ... these segments, the algorithm relies on a position- and velocity-dependent intensity function λ : R^d × R^d → [0, ∞) and a position-dependent bouncing matrix R : R^d → R^{d×d}, given respectively by
λ(x, v) = max{0, ⟨∇U(x), v⟩}  (1)
and, for any v ∈ R^d,
R(x)v = (I_d − 2 ∇U(x){∇U(x)}ᵀ / ‖∇U(x)‖²) v = v − 2 (⟨∇U(x), v⟩ / ‖∇U(x)‖²) ∇U(x),  (2)
where I_d denotes the d × d identity matrix, ‖·‖ the Euclidean norm, and ⟨w, z⟩ = wᵀz the scalar product between column vectors w, z.¹ The algorithm also performs a velocity refreshment at random times distributed according to the arrival times of a homogeneous Poisson process of intensity λ^ref ≥ 0, λ^ref being a parameter of the BPS algorithm. Throughout the paper, we use the terminology "event" for a time at which either a bounce or a refreshment occurs. The basic version of the BPS algorithm proceeds as follows:

Algorithm 1 Basic BPS algorithm
1. Initialize the state and velocity (x(0), v(0)) arbitrarily on R^d × R^d.
2. While more events i = 1, 2, ... requested do
 (a) Simulate the first arrival time τ_bounce ∈ (0, ∞) of an inhomogeneous Poisson process of intensity χ(t) = λ(x(i−1) + v(i−1)t, v(i−1)).
 (b) Simulate τ_ref ~ Exp(λ^ref).
 (c) Set τ_i ← min(τ_bounce, τ_ref) and compute the next position
  x(i) ← x(i−1) + v(i−1)τ_i.  (3)
 (d) If τ_i = τ_ref, sample the next velocity v(i) ~ N(0_d, I_d).
 (e) If τ_i = τ_bounce, compute the next velocity v(i) using
  v(i) ← R(x(i)) v(i−1),  (4)
 which is the vector obtained once v(i−1) bounces on the plane tangential to the gradient of the energy function at x(i).
3. End While.

In the algorithm above, Exp(λ) denotes the exponential distribution of rate λ and N(0_d, I_d) the standard Gaussian distribution on R^d.² Compared to the algorithm described in [23], our formulation of step 2a is expressed in terms of an inhomogeneous Poisson process arrival.

We will show further that the transition kernel of the resulting process x(t) admits π as invariant distribution for any λ^ref ≥ 0, but it can fail to be irreducible when λ^ref = 0. It is thus critical to use λ^ref > 0. Our proof of invariance and ergodicity can accommodate some alternative ways to perform the refreshment step 2d. One such variant, which we call restricted refreshment, samples v(i) uniformly on the unit hypersphere S^{d−1} = {x ∈ R^d : ‖x‖ = 1}. We compare experimentally these two variants and others in Section 4.3.

¹ Throughout the paper, when an algorithm contains an expression of the form R(x)v, it is understood that this computation is implemented via the right-hand side of Equation 2, which takes time O(d), rather than the left-hand side, which would naively take time O(d²).
² By exploiting the memorylessness of the exponential distribution, we could alternatively only implement step (b) of Algorithm 1 for the ith event when τ_{i−1} corresponds to a bounce, and set τ_ref ← τ_ref − τ_{i−1} otherwise.
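Algorithm 1 can be sketched end to end on a one-dimensional standard Gaussian target U(x) = x²/2, where the bounce time has a closed form (all names below are ours; in one dimension the bounce operator reduces to v ← −v). The time average of x² along the piecewise-linear path, accumulated exactly per segment, estimates Var(x) = 1.

```python
import numpy as np

def bounce_time(x, v, rng):
    """First arrival of the inhomogeneous PP with intensity
    chi(t) = max(0, <grad U(x + v t), v>) for U(x) = x^2 / 2,
    obtained by solving Xi(tau) = -log V in closed form."""
    a, b = x * v, v * v
    e = rng.exponential(1.0)              # e = -log V, V ~ U(0, 1)
    if a <= 0.0:
        return (-a + np.sqrt(2.0 * b * e)) / b
    return (-a + np.sqrt(a * a + 2.0 * b * e)) / b

def bps_1d(n_events=200_000, lam_ref=1.0, seed=1):
    rng = np.random.default_rng(seed)
    x, v = 0.0, 1.0
    t_total = sum_x2 = 0.0
    for _ in range(n_events):
        tau_b = bounce_time(x, v, rng)
        tau_r = rng.exponential(1.0 / lam_ref)   # refreshment candidate
        tau = min(tau_b, tau_r)
        # exact segment integral of x(s)^2 = (x + v s)^2 over [0, tau]
        sum_x2 += x * x * tau + x * v * tau**2 + v * v * tau**3 / 3.0
        t_total += tau
        x += v * tau
        v = rng.normal() if tau_r < tau_b else -v   # refresh or bounce
    return sum_x2 / t_total

print(bps_1d())  # close to Var(x) = 1 for the standard Gaussian target
```

Note there is no accept/reject step: every event either bounces or refreshes the velocity, which is the "rejection-free" property.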
The Bouncy Particle Sampler
Simulating the bounce time using a time-scale transformation
1 The goal is to simulate the first arrival time of an inhomogeneous Poisson process with intensity χ(t) = max{0, ⟨∇U(x + vt), v⟩}.
2 Let Ξ(t) = ∫₀ᵗ χ(s) ds be the cumulative intensity.
3 The probability that the first arrival time τ exceeds u is:
P(τ > u) = exp(−Ξ(u)) (Ξ(u))⁰ / 0! = exp(−Ξ(u))  (1)
4 Hence τ = Ξ⁻¹(−log V), where V ~ U(0,1) and Ξ⁻¹(p) = inf{t : Ξ(t) ≥ p} is the first time such that Ξ(t) ≥ p.
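The relation τ = Ξ⁻¹(−log V) can be used numerically even when Ξ has no closed-form inverse. A sketch under stated assumptions (χ is evaluated on a fixed grid, Ξ is approximated by the trapezoid rule, and the exponential variate −log V is passed in explicitly; all names are ours):

```python
import numpy as np

def first_arrival(chi, t_grid, exp_variate):
    """Approximate tau = Xi^{-1}(exp_variate), where
    Xi(t) = int_0^t chi(s) ds (trapezoid rule on t_grid) and
    exp_variate = -log V with V ~ U(0, 1)."""
    chi_vals = np.maximum(chi(t_grid), 0.0)
    increments = 0.5 * (chi_vals[1:] + chi_vals[:-1]) * np.diff(t_grid)
    xi = np.concatenate(([0.0], np.cumsum(increments)))
    if xi[-1] < exp_variate:
        return np.inf              # no arrival within the grid horizon
    return np.interp(exp_variate, xi, t_grid)

# Check against a case with a known inverse: chi(t) = 2t gives
# Xi(t) = t^2, so Xi^{-1}(4) = 2.
tau = first_arrival(lambda t: 2.0 * t, np.linspace(0.0, 10.0, 10_001), 4.0)
print(tau)  # close to 2.0
```

In practice one would use bisection or a line search rather than a fixed grid; the grid version just makes the monotone inversion explicit.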
The Bouncy Particle Sampler
Simulating the bounce time using a time-scale transformation
1 When the target distribution is log-concave (U(x) is convex):
Ξ(τ) = ∫₀^τ λ(x + vs, v) ds = ∫_{τ*}^τ λ(x + vs, v) ds,  (2)
where τ* = argmin_{t ≥ 0} U(x + vt) is the minimizer along the ray.
2 After some simplification,
U(x + vτ) − U(x + vτ*) = −log V,  V ~ U(0,1).
3 Solve through line search if not explicitly solvable.
The Bouncy Particle Sampler
Example
U(x + vτ) − U(x + vτ*) = −log V

Example (Gaussian distributions)
Consider the target distribution to be a zero-mean multivariate Gaussian with covariance matrix ½ I_d, so that U(x) = ‖x‖². Then
τ = (1/‖v‖²) · { −⟨x,v⟩ + √(−‖v‖² log V)  if ⟨x,v⟩ ≤ 0
        −⟨x,v⟩ + √(⟨x,v⟩² − ‖v‖² log V)  otherwise  (3)
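Equation (3) translates directly into code. A sketch (the function name is ours; the uniform variate V is passed in so both branches can be checked deterministically):

```python
import numpy as np

def gaussian_bounce_time(x, v, u):
    """Closed-form bounce time of Equation (3) for U(x) = ||x||^2
    (zero-mean Gaussian with covariance I/2); u plays V ~ U(0, 1)."""
    a, b = np.dot(x, v), np.dot(v, v)
    log_v = np.log(u)
    if a <= 0.0:
        return (-a + np.sqrt(-b * log_v)) / b
    return (-a + np.sqrt(a * a - b * log_v)) / b

# <x, v> = 0: Xi(tau) = ||v||^2 tau^2, so tau = sqrt(-log V).
print(gaussian_bounce_time(np.array([1.0, 0.0]), np.array([0.0, 1.0]), 0.5))
# <x, v> = 1, -log V = 3: Xi(tau) = tau^2 + 2 tau = 3, so tau = 1.
print(gaussian_bounce_time(np.array([1.0, 0.0]), np.array([1.0, 0.0]),
                           np.exp(-3.0)))
```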
The Bouncy Particle Sampler
Simulating the bounce time via adaptive thinning
1 The idea is to define an easy-to-simulate Poisson process whose intensity χ̄_s(t) upper bounds χ(t), i.e.:
χ̄_s(t) = 0 for all t < s, and χ̄_s(t) ≥ χ(t) for all t ≥ s  (4)

(From the paper:)
2.3.2 Simulation using adaptive thinning
In scenarios where it is difficult to solve (5), the use of a thinning procedure to simulate τ provides another alternative. Assume we have access to local-in-time upper bounds λ̄_s(t) on λ(t), that is
λ̄_s(t) = 0 for all t < s,  λ̄_s(t) ≥ λ(t) for all t ≥ s,
and that we can simulate the first arrival time of the inhomogeneous Poisson process Π̄_s with intensity λ̄_s(t) defined on [s, ∞). Algorithm 2 shows the pseudocode for the adaptive thinning procedure.

Algorithm 2 Simulation of the first arrival time of an inhomogeneous Poisson process through thinning
1. Set s ← 0, τ ← 0.
2. Do
 (a) Set s ← τ.
 (b) Sample τ as the first arrival point of Π̄_s of intensity λ̄_s.
 (c) While V > λ(τ)/λ̄_s(τ), where V ~ U(0, 1).
3. Return τ.

The event V > λ(τ)/λ̄_s(τ) corresponds to a rejection step in the thinning algorithm but, in contrast to rejection steps that occur in standard MCMC samplers, in the BPS algorithm this just means that the particle does not bounce and simply coasts.

2.3.3 Simulation using superposition
Assume that U(x) can be decomposed as
U(x) = ∑_{j=1}^m U^[j](x).  (7)
Under this assumption, if we let λ^[j](t) = max{0, ⟨∇U^[j](x + tv), v⟩} for j = 1, ..., m, it follows that
λ(t) ≤ ∑_{j=1}^m λ^[j](t).
It is therefore possible to use the adaptive thinning algorithm with λ̄_s(t) = λ̄(t) = ∑_{j=1}^m λ^[j](t) for t ≥ s. Moreover, we can simulate from λ̄ via superposition as follows. First, simulate the first arrival time τ^[j] of each inhomogeneous Poisson process with intensity λ^[j](t) ≥ 0. Second, return τ = min_{j=1,...,m} τ^[j].

Example 3. Exponential families. Consider a univariate exponential family with parameter x, observation ι, sufficient statistic φ(ι) and log-normalizing constant A(x). If we assume a standard Gaussian prior on x, we obtain the following energy:
U(x) = x²/2 − xφ(ι) + A(x),
where the three terms are U^[1](x), U^[2](x) and U^[3](x), respectively.
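Algorithm 2 in code form: a minimal sketch (all names are ours) where `sample_bar(s, rng)` draws the first arrival after s of the dominating process and `chi_bar(s, t)` evaluates the bound. It is checked on a homogeneous process dominated by twice its rate, for which the first arrival is Exp(lam).

```python
import numpy as np

def thinned_arrival(chi, chi_bar, sample_bar, rng):
    """Adaptive thinning (Algorithm 2): repeatedly propose arrivals
    from the dominating process, accept with probability chi/chi_bar."""
    tau = 0.0
    while True:
        s = tau
        tau = sample_bar(s, rng)
        if rng.uniform() <= chi(tau) / chi_bar(s, tau):
            return tau        # accepted proposal: an actual bounce
        # rejected: the particle does not bounce and simply coasts

rng = np.random.default_rng(0)
lam = 1.0
draws = [
    thinned_arrival(
        chi=lambda t: lam,                    # true (constant) intensity
        chi_bar=lambda s, t: 2.0 * lam,       # dominating intensity
        sample_bar=lambda s, r: s + r.exponential(1.0 / (2.0 * lam)),
        rng=rng,
    )
    for _ in range(20_000)
]
print(np.mean(draws))  # close to 1/lam: first arrivals are Exp(lam)
```

The tighter the bound χ̄_s, the fewer wasted proposals per accepted bounce.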
The Bouncy Particle Sampler
Simulating the bounce time using superposition
1 Assume U(x) can be decomposed as:
U(x) = ∑_{j=1}^m U^[j](x).  (5)
2 Let χ^[j](t) = max{0, ⟨∇U^[j](x + tv), v⟩}; then we have
χ(t) ≤ ∑_{j=1}^m χ^[j](t)  (6)
3 Therefore, letting τ^[j] be the first arrival time w.r.t. χ^[j](t),
τ = min_{j=1,...,m} τ^[j].  (7)
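The superposition step itself is a one-liner. A sketch (names ours), checked on two homogeneous factors, where the minimum of independent Exp(1) and Exp(2) first arrivals is Exp(3):

```python
import numpy as np

def superposition_arrival(factor_samplers, rng):
    """Simulate the first arrival time of each factor process and
    return their minimum: the first arrival of the superposition."""
    return min(sampler(rng) for sampler in factor_samplers)

rng = np.random.default_rng(0)
samplers = [lambda r: r.exponential(1.0 / 1.0),   # factor with rate 1
            lambda r: r.exponential(1.0 / 2.0)]   # factor with rate 2
draws = [superposition_arrival(samplers, rng) for _ in range(20_000)]
print(np.mean(draws))  # close to 1/3: min of Exp(1), Exp(2) is Exp(3)
```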
The Bouncy Particle Sampler
Example
Example (Logistic regression)
Let {ℓ_r ∈ R^d}_{r=1}^R be the data and c_r ∈ {0, 1} the label of data point ℓ_r.
The parameter x is assigned a standard multivariate Gaussian prior.
U(x) = ‖x‖²/2 + ∑_{r=1}^R [log(1 + exp⟨ℓ_r, x⟩) − c_r⟨ℓ_r, x⟩],  (8)
where the r-th summand is U^[r](x).
1 An upper bound for the intensity χ^[r](t) corresponding to U^[r](x) is
χ^[r](t) ≤ χ̄^[r] = ∑_{k=1}^d 1[(−1)^{c_r} v_k ≥ 0] · ℓ_{rk} · |v_k|,  (9)
where each χ̄^[r] is a constant given v.
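The constant bound in Equation (9) can be checked numerically against the exact factor intensity. A sketch (all names are ours; like the paper's bound, it assumes non-negative features ℓ_r):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def intensity_bound(ell, c, v):
    """chi_bar^[r] = sum_k 1[(-1)^c v_k >= 0] * ell_k * |v_k|:
    a constant-in-t upper bound on the factor intensity."""
    return np.sum((((-1.0) ** c) * v >= 0) * ell * np.abs(v))

def intensity(ell, c, x, v, t):
    """Exact chi^[r](t) = max(0, <grad U^[r](x + t v), v>), with
    grad U^[r](x) = ell * (sigmoid(<ell, x>) - c)."""
    s = sigmoid(np.dot(ell, x + t * v))
    return max(0.0, (s - c) * np.dot(ell, v))

ell = np.array([1.0, 2.0])
x = np.array([0.3, -0.2])
v = np.array([0.5, -1.0])
for c in (0, 1):
    bound = intensity_bound(ell, c, v)
    assert all(intensity(ell, c, x, v, t) <= bound
               for t in np.linspace(0.0, 5.0, 51))
    print(c, bound)
```

Because χ̄^[r] is constant in t, each factor's dominating process is homogeneous, so thinning plus superposition applies directly.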
The Bouncy Particle Sampler
Theoretical results
(From the paper:) For example, for φ(x) = x_k, k ∈ {1, 2, ..., d}, we have
∫₀^{τ_i} φ(x(i−1) + v(i−1)s) ds = x_k(i−1) τ_i + v_k(i−1) τ_i²/2.
When the integral above is intractable, we may simply subsample the trajectory of x(t) at regular time intervals to obtain the estimator
(1/L) ∑_{l=0}^{L−1} φ(x(lδ)),
where δ > 0 and L = 1 + ⌊T/δ⌋. Alternatively, we could approximate these univariate integrals through quadrature.

2.5 Theoretical results
Peters and de With (2012) present an informal proof establishing the fact that the BPS with λ^ref = 0 admits π as invariant distribution. We provide in Appendix A a rigorous proof of this π-invariance result for λ^ref ≥ 0 and prove that the resulting process is additionally ergodic when λ^ref > 0. In the following we denote by P_t(z, dz′) the transition kernel of the continuous-time Markov process z(t) = (x(t), v(t)).

Proposition 1. For any λ^ref ≥ 0, the infinitesimal generator associated to the transition kernel P_t of the BPS is given, for any continuously differentiable function h : R^d × R^d → R, by
L h(z) = lim_{t→0} [∫ P_t(z, dz′) h(z′) − h(z)] / t
  = −λ(x, v) h(z) + ⟨∇_x h, v⟩ + λ^ref ∫ (h(x, v′) − h(x, v)) ψ(dv′) + λ(x, v) h(x, R(x)v),
where we recall that ψ(v) denotes the standard multivariate Gaussian density on R^d.

This transition kernel is non-reversible and ρ-invariant, where
ρ(z) = π(x) ψ(v).  (12)

If we add the condition λ^ref > 0, we get the following stronger result.

Theorem 1. If λ^ref > 0 then ρ is the unique invariant probability measure of the transition kernel of the BPS, and the corresponding process satisfies a strong law of large numbers for ρ-almost every z(0) and h ∈ L¹(ρ):
lim_{T→∞} (1/T) ∫₀^T h(z(t)) dt = ∫ h(z) ρ(z) dz  a.s.

We exhibit in Section 4.1 a simple example where P_t is not ergodic for λ^ref = 0.

3 The local bouncy particle sampler
3.1 Structured target distribution and factor graph representation
In numerous applications, the target distribution admits some structural properties that can be exploited by sampling algorithms. For example, the popular Gibbs sampler takes advantage of
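The subsampling estimator above needs x(lδ) evaluated along the piecewise-linear path. A sketch (all names are ours) that reconstructs the path on a regular grid from the stored event times, positions and velocities:

```python
import numpy as np

def discretize(event_times, positions, velocities, delta):
    """Evaluate the piecewise-linear path x(t) at t = 0, delta,
    2*delta, ...: locate the last event before each grid time and
    advance linearly from it."""
    event_times = np.asarray(event_times)
    positions = np.asarray(positions, dtype=float)
    velocities = np.asarray(velocities, dtype=float)
    grid = np.arange(0.0, event_times[-1], delta)
    idx = np.clip(np.searchsorted(event_times, grid, side="right") - 1,
                  0, len(event_times) - 1)
    dt = grid - event_times[idx]
    return positions[idx] + velocities[idx] * dt[:, None]

# One particle in 1-d: move right with v = 1 on [0, 1], then left with
# v = -1 on [1, 3].
path = discretize([0.0, 1.0, 3.0], [[0.0], [1.0], [-1.0]],
                  [[1.0], [-1.0], [0.0]], delta=0.5)
print(path.ravel())  # [ 0.   0.5  1.   0.5  0.  -0.5]
```

Averaging φ over these grid points gives the estimator (1/L) ∑_l φ(x(lδ)); when the segment integral of φ is tractable, the exact per-segment formula is preferable.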
The Bouncy Particle Sampler
Basic proof idea
To derive the Fokker-Planck equation for the algorithm:
∂
∂ tµt(z) = (L ∗
µt)(z) ,
where z , (x,v), µt(z) is the density of z at time t, L ∗ is theadjoint of the generator L .
1 Assume the stationary distribution of µ(z) = π(x)ψ(v).2 Write out the joint distribution of z and the joint times of the
Poisson process.3 Calculate the marginal distribution of z, and get the
corresponding density µt(z) for time t.4 Verify that dµt(z′)
dt = 0 for all z′.
The Bouncy Particle Sampler
The local bouncy particle sampler
Divide the parameters into groups using a factor graph.
Figure 2: Top: an example of a factor graph with d = 4 variables and 3 binary factors, F = {f_a, f_b, f_c}. Bottom: an example of paths generated by the local BPS algorithm. The circles show the locations that are stored in memory (each with their associated time and velocity after bouncing, not shown). The small black square on the first path is used to demonstrate the execution of Algorithm 3, used to reconstruct the location x(t). The algorithm first identifies i(t, 1), which is 3 in this example. The algorithm then uses the information of the latest event preceding x_t to add to the position at that event, x_1^(3), the velocity just after the event, v_1^(3), times the time increment denoted by the asterisk, (t − T_1^(3)). The bottom section of the figure also shows the candidate bounce times used by Algorithm 4 to compute the first four bounce events. The exclamation mark indicates an example where a candidate bounce time need not be recomputed thanks to the sparsity structure of the factor graph.
(From the paper:) ... conditional independence properties. We present here a "local" version of the BPS which can exploit any representation of the target density as a product of positive factors
π(x) ∝ ∏_{f∈F} γ_f(x_f)  (13)
where x_f is the restriction of x to a subset N_f ⊆ {1, 2, ..., d} of the components of x, and F is an index set called the set of factors. Hence the energy associated to π is of the form
U(x) = ∑_{f∈F} U_f(x)  (14)
with ∂U_f(x)/∂x_k = 0 for all variables absent from factor f, i.e. for all k ∈ {1, 2, ..., d} \ N_f.
Such a factorization of the target density can be formalized using factor graphs (Figure 2, top). A factor graph is a bipartite graph, with one set of vertices N called the variables, each corresponding to a component of x (|N| = d), and a set of vertices F corresponding to the local factors (γ_f)_{f∈F}. There is an edge between k ∈ N and f ∈ F if and only if k ∈ N_f. Such a representation generalizes undirected graphical models [28, Chap. 2, Section 2.1.3]. For example, factor graphs can have distinct factors connected to the same set of components (i.e. f ≠ f′ with N_f = N_{f′}).

3.2 Local BPS: algorithm description
Similarly to the Gibbs sampler, each step of the local BPS manipulates only a subset of the d components of x. Contrary to the Gibbs sampler, the local BPS does not require sampling from
F = {f_a, f_b, f_c}. The energy is decomposed into:
U(x) = ∑_{f∈F} U_f(x)  (10)
Define local intensity functions λ_f and local bouncing matrices R_f:
λ_f(x, v) = max{0, ⟨∇U_f(x), v⟩}  (11)
R_f(x)v = v − 2 (⟨∇U_f(x), v⟩ / ‖∇U_f(x)‖²) ∇U_f(x)  (12)
The Bouncy Particle Sampler
The local bouncy particle sampler
1 The next bounce time τ is the first arrival time of an inhomogeneous Poisson process with intensity χ(t) = ∑_{f∈F} χ_f(t).
2 Sample a factor f with probability χ_f(τ)/χ(τ), and update the components of the parameter x and velocity v related to factor f, following the basic BPS sampler.
3 Efficient implementation via a priority queue or thinning-based methods (when the number of factors is large).
Theorem
The local BPS still admits ρ(z) = π(x)ψ(v) as the stationary distribution.
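The priority-queue implementation can be sketched schematically (all names are ours; refreshment events and the per-factor time sampling are elided): each factor holds a candidate bounce time, the earliest candidate fires, and only factors sharing variables with it recompute their candidates.

```python
import heapq

def local_bps(factors, neighbors, next_time, bounce, x, v, t_end, rng):
    """Event loop of the local BPS: `next_time(f, x, v, t, rng)` draws
    factor f's candidate bounce time after t; `bounce(f, x, v)` returns
    the velocity with only the components in N_f updated; `neighbors[f]`
    lists the factors sharing variables with f (including f itself)."""
    t = 0.0
    queue = [(next_time(f, x, v, 0.0, rng), f) for f in factors]
    heapq.heapify(queue)
    while queue:
        t_f, f = heapq.heappop(queue)
        if t_f > t_end:
            break
        x = x + v * (t_f - t)            # advance the state linearly
        t = t_f
        v = bounce(f, x, v)
        stale = set(neighbors[f])        # only these candidates change
        queue = [(tc, g) for tc, g in queue if g not in stale]
        heapq.heapify(queue)
        for g in stale:
            heapq.heappush(queue, (next_time(g, x, v, t, rng), g))
    return x, v, t

# Deterministic check with one factor whose candidate always fires one
# time unit later and whose "bounce" flips the velocity.
x, v, t = local_bps(
    factors=["f"], neighbors={"f": ["f"]},
    next_time=lambda f, x, v, t, rng: t + 1.0,
    bounce=lambda f, x, v: -v,
    x=0.0, v=1.0, t_end=2.5, rng=None)
print(x, v, t)  # 0.0 1.0 2.0: bounced at t = 1 and t = 2
```

With a sparse factor graph, |neighbors[f]| is small, so each event costs far less than recomputing all |F| candidates.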
Numerical Results
Outline
1 Introduction
2 The Bouncy Particle Sampler
3 Numerical Results
Numerical Results
Multivariate Gaussian distributions
Figure 3: Left: a trajectory of the BPS for λ^ref = 0; the center of the space is never explored. Right: ESS per CPU second for increasing d for the process with refreshment (slope ≈ −1.24).
(From the paper:) ... and the bounce is based on ∑_{j=1}^s ⟨∇U_{F_j}(x(i)), v(i−1)⟩. It is simple to check that the resulting dynamics preserves π. In contrast to the batch size s = 1, this is not an implementation of the local BPS described in Algorithm 6; instead it corresponds to a local BPS update for a random partition of the factors.

4 Numerical results
4.1 Multivariate Gaussian distributions and the need for refreshment
We use a simple isotropic multivariate Gaussian target distribution, U(x) = ‖x‖², to illustrate the importance of velocity refreshment, restricting ourselves here to BPS with restricted refreshment. Without loss of generality, we assume ‖v(i−1)‖ = 1. From Equation (6), we obtain
⟨x(i), v(i)⟩ = { −√(−log V_i)  if ⟨x(i−1), v(i−1)⟩ ≤ 0
        −√(⟨x(i−1), v(i−1)⟩² − log V_i)  otherwise
and
‖x(i)‖² = { ‖x(i−1)‖² − ⟨x(i−1), v(i−1)⟩² − log V_i  if ⟨x(i−1), v(i−1)⟩ ≤ 0
      ‖x(i−1)‖² − log V_i  otherwise.
In particular, these calculations show that if ⟨x(i), v(i)⟩ ≤ 0 then ⟨x(j), v(j)⟩ ≤ 0 for j > i. Using this result, we can show inductively that ‖x(i)‖² = ‖x(1)‖² − ⟨x(1), v(1)⟩² − log V_i for i ≥ 2. In particular, for x(0) = e₁ and v(0) = e₂, with e_i being elements of the standard basis of R^d, the norm of the position at all points along the trajectory can never be smaller than 1, as illustrated in Figure 3.

In this scenario, we show that BPS without refreshment admits a countably infinite collection of invariant distributions. Again, without loss of generality, assume ‖v(0)‖ = 1. Let us define r_t = ‖x(t)‖ and m_t = ⟨x(t), v(t)⟩/‖x(t)‖, and denote by φ_k the probability density of the chi distribution with k degrees of freedom.

Proposition 2. For any dimension d ≥ 2, the process (r_t, m_t)_{t≥1} is Markov and its transition kernel is invariant w.r.t. the probability densities {f_k(r, m) ∝ φ_k(√2 r) · (1 − m²)^{(k−3)/2} ; k ∈ {2, 3, 4, ...}}.
1 When λ^ref = 0, the center of the space is never explored (non-ergodic).
2 Optimal scaling of ESS vs. dimension d:
BPS: ≈ d^{−1.24} (empirically)
HMC: d^{−1.25} (theoretically)
random-walk MH: d^{−2} (theoretically)
Numerical Results
Comparison of the global and local schemes
1 Test on a sparse Gaussian field (a chain-shaped Gaussian Markov random field of length 1000).
Figure 4: Relative errors for Gaussian chain-shaped random fields. Facets contain results for fields of pairwise precisions 0.1-0.9. Each summarizes the relative errors of 200 (100 local, 100 global) BPS executions, each run for a fixed computational budget (a wall-clock time of 60 seconds).
(From the paper:) By Theorem 1, we have a unique invariant measure as soon as λ^ref > 0.

Next, we look at the scaling of the effective sample size (ESS) per CPU second of the basic BPS algorithm as the dimensionality d of the isotropic normal target increases. We use λ^ref = 5. We focus without loss of generality on the first component of the state vector and estimate the ESS using the R package mcmcse [8] by evaluating the trajectory on a sufficiently fine discretization of the sampled trajectory. The results in log-log scale are displayed in Figure 3. The curve suggests a decay of roughly d^{−1.24}, similar to the d^{−5/4} scaling for an optimally tuned Hamiltonian Monte Carlo (HMC) algorithm in a similar setup [5, Section III], [21, Section 5.4.4]. Both BPS and HMC compare favorably to the d^{−2} scaling of the optimally tuned random-walk MH [24].

4.2 Comparison of the global and local schemes
To quantify the potential computational advantage brought by the local version of the algorithm of Section 3 over the global version of Section 2, we compare both algorithms on a sparse Gaussian field. We use a chain-shaped undirected graphical model of length 1000, and perform separate experiments for various precision parameters for the pairwise interaction between neighbors in the chain. We run the local and global methods for a fixed computational budget (60 seconds), and repeat the experiment in each configuration 100 times. We compute a Monte Carlo estimate of the marginal variance of the variable with index 500, and compare this estimate to the truth (which can be computed explicitly in this case). The results are shown in Figure 4, in the form of a histogram over relative absolute errors of the 100 executions of each setup. They confirm that the smaller computational complexity per local bounce more than offsets the associated decrease in expected trajectory segment length. Moreover, the results show that the BPS method is very robust to the pairwise precision used in this sparse Gaussian field model.

4.3 Comparisons of alternative refreshment schemes
In Section 2 and Section 3, the velocity was refreshed using a standard multivariate normal. We compare here this scheme to alternative refreshment schemes:
Global refreshment: sample the entire velocity vector from a standard multivariate Gaussian distribution.
Local refreshment: if the local BPS is being used, the structure specified by the factor graph can be used to design computationally cheaper refreshment operators. We pick one factor f ∈ F uniformly at random, and we consider resampling only the components of v with
Numerical Results
Comparisons of refreshment schemes
1 Global refreshment: the basic BPS sampler.
2 Local refreshment: local BPS with only some components of v refreshed.
3 Restricted refreshment: restrict v to have unit norm when refreshing.
4 Restricted partial refreshment: a variant of restricted refreshment that only slightly perturbs the current velocity direction.
Numerical Results
Comparisons of refreshment schemes
Figure 5: Comparison of four refreshment schemes (refreshment rates 0.01, 0.1, 1, 10). The top panel shows results for a 100-dimensional problem, and the bottom one for a 1000-dimensional problem. The box plots summarize the marginal variance (in log scale) of the variable with index 50 over 100 executions of BPS for each of the refreshment schemes.
indices in Nf . By the same argument used in Section 3, each refreshment will then requirebounce time recomputation only for the factors f 0 with Nf \ Nf 0 6= ;. Provided that eachvariable is connected with at least one factor f with |Nf | > 1, this scheme is irreducible(and if this condition is not satisfied, additional constant factors can be introduced withoutchanging the target distribution).
Restricted refreshment: this method adds a restriction that the velocities be of unit norm. Thisscheme corresponds to a different invariant distribution ⇢ (x) = ⇡ (x)� (v) where� (v) is nowthe uniform distribution on Sd�1. Refreshment is thus performed by the global refreshmentscheme, followed by a re-normalization step.
Restricted partial refreshment: a variant of the restricted refreshment scheme where we samplean angle ✓ by multiplying a Beta(↵, �)-distributed random variable by 2⇡. We then select avector uniformly at random from the unit length vectors that have an angle ✓ from v. Weused ↵ = 1,� = 4 to favor small angles.
The rationale behind the partial refreshment procedure is to suppress the random walk behaviorof the particle path arising from a refreshment step independent from the current velocity. Somerefreshment is needed to ensure ergodicity but a “good” direction should only be altered slightly.This strategy is akin to the partial momentum refreshment strategy for HMC methods [13], [21,Section 4.3] and, for a normal refreshment, could be similarly implemented. By Lemma 2 all ofthe above schemes preserve ⇢ as invariant distribution. We tested these schemes on two versionsof the chain-shaped factor graph from the previous section (with the pairwise precision parameterset to 0.5), one with 100 dimensions, and one with 1000 dimensions. All methods are providedwith a computational budget of 30 seconds. The results are shown in Figure 5. The results showthat the local refreshment scheme is less sensitive to �ref , performing as well or better than theglobal refreshment scheme. The performance of the restricted and partial methods appears moresensitive to �ref and generally inferior to the other two schemes.
4.4 Comparisons with HMC methods on high-dimensional multivariateGaussian distributions
We compare the local BPS to various state-of-the-art versions of HMC. We use the local refreshmentscheme, no partial refreshment and �ref = 1. We select a 100-dimensional Gaussian example from
17
1 The local refreshment scheme is less sensitive to λ^ref.
Numerical Results
Comparisons with HMC
Multivariate Gaussian:
Figure 6: Relative error of marginal variance estimates for a fixed computational budget (30s). Sampling methods: BPS (adapt=false, fit-metric=false) and Stan/HMC with all four combinations of the fit-metric and nuts settings (adapt=true).
[21, Section 5.3.3.4], where even a basic HMC scheme was shown to perform favorably comparedto standard MH methods. We run several methods for this test case, each for a wall clock time of30 seconds, and measure the relative error on the reconstructed marginal variances. We use Stan[12] as a reference implementation for the HMC algorithms. Different HMC versions are exploredby enabling and disabling the NUTS methodology for determining path lengths, and by enablingand disabling adaptive estimation of a diagonal mass matrix. We always exclude the time takento compile the Stan program in the 30 seconds budget. The three HMC methods tested use 1000iterations of adaptation, since HMC without adaptation (not shown) yields a zero acceptance rate.In contrast, we use the default value for our local algorithm’s tuning parameter (�ref = 1), andno adaptation of the mass matrix. The results (Figure 6) show that this simple implementationperforms remarkably well. The adapted HMC performs reasonably well, except for four marginalswhich are markedly off target. These deviations disappear after incorporating more complex HMCextensions, namely learning a diagonal metric (denoted fit-metric), and adaptively selecting thenumber of leap-frog steps (denoted nuts).
Next, we perform a series of experiments to investigate the comparative performance of our localmethod versus NUTS as the dimensionality d increases. Experiments are performed on the chain-shaped Gaussian Random Field of Section 4.2 (with the pairwise precision parameter set to 0.5).We vary the length of the chain (10, 100, 1000), and run Stan’s implementation of NUTS for 1000iterations + 1000 iterations of adaptation. We measure the wall-clock time (excluding the timetaken to compile the Stan program) and then run our method for the same wall-clock time. Werepeat this 40 times for each chain size. We then measure the absolute value of the relative error on10 equally spaced marginal variances, and show the rate at which they decrease as the percentageof the samples collected in the fixed computational budget varies from 1 percent to 100 percent.The results are displayed in Figure 7. Note that the gap between the two methods widen as thedimensionality increases. To visualize the different behavior of the two algorithms, we show inFigure 8 three marginals of the Stan and BPS paths from the 100-dimensional example for thefirst 0.5% of the full trajectories computed in the computational budget.
4.5 Poisson-Gaussian Markov random field
We consider the following hierarchical model. Let x_{i,j}, i, j ∈ {1, 2, . . . , 10}, denote a sparse-precision, grid-shaped Gaussian Markov random field with pairwise interactions of the same form as those used in the previous chain examples (pairwise precision set to 0.5). For each i, j, let y_{i,j} be Poisson distributed, independent given x = (x_{i,j} : i, j ∈ {1, 2, . . . , 10}), with rate exp(x_{i,j}). We generate a synthetic dataset from this model and approximate the posterior distribution of x given data y = (y_{i,j} : i, j ∈ {1, 2, . . . , 10}). We run Stan with default settings for 16, 32, 64, . . . , 4096 iterations. For each configuration, we run local BPS for the same wall-clock time as Stan, using a local refreshment with λref = 1 and the method from Example 3 to perform the bouncing-time computations. We repeat this series of experiments 10 times with different random seeds for the two samplers. We show in Figure 9 estimates of the posterior variances of the variables indexed 0 (x_{0,0}) and 50 (x_{5,0}). These two marginals are representative of the other variables. Each box plot summarizes the 10 replications. As expected, both methods converge to the same value, but BPS
Figure: Relative error of marginal variance estimates for a fixed computation budget (30s).
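As a rough illustration of the Poisson–Gaussian hierarchical model above (a grid-shaped GMRF latent field with Poisson observations), the sketch below generates a synthetic dataset. The diagonal of the precision matrix is an assumption made here only to guarantee positive definiteness; the paper's exact parameterization may differ.

```python
import numpy as np

n, rho = 10, 0.5                  # grid side length, pairwise precision
d = n * n
P = np.zeros((d, d))              # precision matrix of the GMRF
for i in range(n):
    for j in range(n):
        k = i * n + j
        if j + 1 < n:             # right neighbor on the grid
            P[k, k + 1] = P[k + 1, k] = rho
        if i + 1 < n:             # bottom neighbor on the grid
            P[k, k + n] = P[k + n, k] = rho
# diagonally dominant diagonal (an assumption, for positive definiteness)
np.fill_diagonal(P, 1.0 + np.abs(P).sum(axis=1))

rng = np.random.default_rng(1)
L = np.linalg.cholesky(P)                         # P = L L^T
x = np.linalg.solve(L.T, rng.standard_normal(d))  # x ~ N(0, P^{-1})
y = rng.poisson(np.exp(x))                        # y_ij ~ Poisson(exp(x_ij))
```

Solving L^T x = z with z standard normal gives Cov(x) = (L L^T)^{-1} = P^{-1}, the standard trick for sampling a GMRF from its precision Cholesky factor.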
26 Changyou Chen The Bouncy Particle Sampler: A Non-Reversible Rejection-Free Markov Chain Monte Carlo Method
Numerical Results
Comparisons with HMC
Gaussian random field:
[Figure 7 plot: relative error (log scale) vs. percent of samples processed, for d = 10, 100, 1000; methods: BPS, Stan.]
Figure 7: Relative reconstruction error for d = 10 (left), d = 100 (middle) and d = 1000 (right), averaged over 10 of the dimensions and 40 runs. Each panel is run on a fixed computational budget (corresponding in each panel to the wall-clock time taken by 2000 Stan iterations).
[Figure 8 plot: sample paths vs. percent of samples processed; methods: BPS, Stan.]
Figure 8: Marginals of the paths for variables of index 0 (the left-most variable in the chain) and index 50 (a variable in the middle of the chain). Each of the two marginal paths represents 0.5% of the full trajectory computed in the fixed computational budget used in the 100-dimensional example of Figure 7. While each piecewise-constant step in the HMC trajectory is obtained by a sequence of leap-frog steps, the need for an MH step in HMC means that these intermediate steps are not usable Monte Carlo samples. In contrast, the full trajectory obtained from BPS can be used in the calculation of Monte Carlo averages, as explained in Section 2.4.
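The point about using the full BPS trajectory can be made concrete: because the path is piecewise linear, time averages over it have a closed form (on each linear segment the time integral of the position equals the segment duration times the midpoint). A minimal sketch, with an illustrative event list:

```python
import numpy as np

def trajectory_mean(events):
    """Time average of the position along a piecewise-linear path.

    events: list of (time, position) pairs, with linear interpolation
    between consecutive events (as in a BPS trajectory).
    """
    total_t, total_int = 0.0, 0.0
    for (t0, x0), (t1, x1) in zip(events, events[1:]):
        dt = t1 - t0
        total_t += dt
        # exact integral of a linear segment: duration times midpoint
        total_int += dt * (np.asarray(x0) + np.asarray(x1)) / 2.0
    return total_int / total_t

# toy path: start at 0 with velocity +1, bounce at t=1 to velocity -1
mean_x = trajectory_mean([(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)])  # -> 0.5
```

No sub-sampling or discretization error is introduced, which is the sense in which the whole continuous-time path contributes to the Monte Carlo estimate.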
Figure: Relative error for d = 10, 100, 1000 with a fixed computational budget.
Bayesian logistic regression
Use the superposition trick presented previously. Compared with Firefly, the only scalable algorithm with the same convergence rate as traditional MCMC.
[Figure 10 plot: ESS per likelihood evaluation (log scale) vs. number of datapoints R; algorithms: BPS with constant refresh rate, tuned FireFly.]
Figure 10: ESS per-datum likelihood evaluation for Local BPS and Firefly.
We compare the local BPS with thinning to the MAP-tuned Firefly algorithm implementation provided by the authors. In [18], it is reported that this version of Firefly experimentally outperforms standard MH significantly in terms of ESS per-datum likelihood evaluation. Local BPS and Firefly are compared here in terms of this criterion, where the ESS is averaged over the d = 5 components of x. We generate covariates ι_{rk} i.i.d. ∼ U(0.1, 1.1) and data c_r ∈ {0, 1} for r = 1, . . . , R according to (8). We set a standard multivariate normal prior on x. For the algorithm, we set λref = 0.5, and set to 0.5 the length of the time interval over which a constant upper bound for the rate associated with the prior is used; see Algorithm 7. Experimentally, local BPS always outperforms Firefly, by about an order of magnitude for large data sets. However, we also observe that both Firefly and local BPS have an ESS per-datum likelihood evaluation decreasing approximately as 1/R, so that the gains brought by these algorithms over a correctly scaled random-walk MH algorithm do not appear to increase with R. The rate for local BPS is slightly superior in the regime of up to 10^4 data points, but then returns to the approximate 1/R rate. We expect that tighter bounds on the intensities could improve the computational efficiency.
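Both the superposition trick and the thinning used here reduce to simulating the first arrival of an inhomogeneous Poisson process under a known upper bound on its intensity. A minimal sketch of thinning with a constant bound (the linearly growing intensity is a made-up example, not the model's actual bouncing rate):

```python
import random

def first_arrival_by_thinning(rate, rate_bound, rng):
    """First arrival time of a Poisson process with intensity rate(t).

    Requires rate(t) <= rate_bound for all t. Candidates are proposed from
    a homogeneous process of rate rate_bound and each is accepted with
    probability rate(t) / rate_bound.
    """
    t = 0.0
    while True:
        t += rng.expovariate(rate_bound)    # next candidate arrival
        if rng.random() < rate(t) / rate_bound:
            return t                        # accepted: first true arrival

rng = random.Random(0)
# illustrative intensity rate(t) = min(t, 10), bounded above by 10
samples = [first_arrival_by_thinning(lambda t: min(t, 10.0), 10.0, rng)
           for _ in range(20000)]
mean_arrival = sum(samples) / len(samples)  # ≈ sqrt(pi/2) in expectation
```

The tighter the bound, the fewer rejected candidates per accepted bounce, which is why the text above expects tighter intensity bounds to improve computational efficiency.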
4.7 Bayesian inference of evolutionary parameters
We consider a model from phylogenetics. Given a fixed tree of species with DNA sequences at the leaves, we want to compute the posterior distribution of evolutionary parameters encoded into a rate matrix Q. More precisely, we consider an over-parameterized generalized time-reversible rate matrix [25] with d = 10: 4 unnormalized stationary parameters x_1, . . . , x_4, and 6 unconstrained substitution parameters x_{i,j}, which are indexed by sets of size 2, i.e. i, j ∈ {1, 2, 3, 4}, i ≠ j. Off-diagonal entries of Q are obtained via q_{i,j} = π_j exp(x_{i,j}), where

π_j = exp(x_j) / Σ_{k=1}^{4} exp(x_k).
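The parameterization above is straightforward to assemble in code. A minimal sketch (diagonal entries are set so rows sum to zero, the standard convention for rate matrices; the function and variable names are illustrative):

```python
import numpy as np

def build_gtr_q(x_stat, x_sub):
    """Assemble the over-parameterized GTR rate matrix.

    x_stat: the 4 unnormalized stationary parameters x_1..x_4.
    x_sub: dict mapping frozenset({i, j}) -> x_{i,j}, for i != j.
    Off-diagonals: q_ij = pi_j * exp(x_{i,j}); rows sum to zero.
    """
    pi = np.exp(x_stat) / np.exp(x_stat).sum()  # softmax stationary probs
    Q = np.zeros((4, 4))
    for i in range(4):
        for j in range(4):
            if i != j:
                Q[i, j] = pi[j] * np.exp(x_sub[frozenset((i, j))])
        Q[i, i] = -Q[i].sum()                   # rate-matrix convention
    return Q, pi

# toy parameters: all zeros give uniform pi and off-diagonals equal to 0.25
x_sub = {frozenset((i, j)): 0.0 for i in range(4) for j in range(i + 1, 4)}
Q, pi = build_gtr_q(np.zeros(4), x_sub)
```

Indexing the substitution parameters by unordered pairs makes q_{i,j} automatically satisfy detailed balance, π_i q_{i,j} = π_j q_{j,i}, which is what makes the matrix time-reversible.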
We assign independent standard Gaussian priors on the parameters x_i. We assume that a matrix of aligned nucleotides is provided, where rows are species and columns contain nucleotides believed to come from a shared ancestral nucleotide. Given x = (x_1, . . . , x_4, x_{1,2}, . . . , x_{3,4}), and hence Q,
Bayesian inference of evolutionary parameters
A model from phylogenetics, used to infer evolutionary parameters (details omitted).
[Figure 11 plots: ESS/s for the max, median and min statistics (across parameters); left panel BPS, right panel HMC.]
Figure 11: Maximum, median and minimum ESS/s for BPS (left) and HMC (right). The experiments are replicated 10 times with different random seeds.
the likelihood is a product of conditionally independent continuous-time Markov chains (CTMCs) over {A, C, G, T}, with "time" replaced by a branching process specified by the phylogenetic tree's topology and branch lengths. The parameter x is unidentifiable, and while this can be addressed by bounded or curved parameterizations, the over-parameterization provides an interesting challenge for sampling methods, which need to cope with the strong induced correlations.
We analyze a dataset of primate mitochondrial DNA [11], containing 898 sites and 12 species. We focus on sampling x and fix the tree to a reference tree [14]. We use the basic BPS algorithm with restricted refreshment and λref = 1, in conjunction with an auxiliary-variable method similar to the one described in [31], alternating between two moves: (1) sampling CTMC paths along the tree given x, using uniformization; (2) sampling x given the paths (in which case the derivation of the gradient is simple and efficient). The only difference compared to [31] is that we substitute the HMC kernel with the kernel induced by running BPS. We use this auxiliary-variable method because, conditioned on the paths, the energy is a convex function, and hence we can use the method described in Example 1 to compute the bouncing times.
We compare against a state-of-the-art HMC sampler [29] that uses Bayesian optimization to adapt the key parameters of HMC, the leap-frog stepsize ε and trajectory length L, while preserving convergence to the correct target distribution. This sampler was shown to be comparable or superior to other state-of-the-art HMC methods such as NUTS. It also has the advantage of having efficient implementations in several languages. We use the authors' Java implementation to compare to our Java implementation of the BPS. Both methods view the objective function as a black box (concretely, a Java interface supporting pointwise evaluation and gradient calculation). In all experiments, we initialize at the mode and use a burn-in of 100 iterations and no thinning. The HMC auto-tuner yielded ε = 0.39 and L = 100. For our method, we use the global sampler and the independent global refreshment scheme.
As a first step, we perform various checks to ensure that both the BPS and HMC chains are in close agreement given a sufficiently large number of iterations. We observe that after 20 million HMC iterations, the highest posterior density intervals from the HMC method are in close agreement with those obtained from BPS (result not shown), and that both methods pass the Geweke diagnostic [9].
To compare the effectiveness of the two samplers, we first look at the ESS per second of the model parameters. We show the maximum, median, and minimum over the 10 parameter components, for both BPS and HMC, in Figure 11. As observed in Figure 12, the autocorrelation function (ACF) for the BPS decays faster than that of HMC. HMC's slowly decaying ACF is due to the fact that the stepsize ε in HMC is selected to be very small by the auto-tuner.
To ensure that the problem does not come from faulty auto-tuning, we look at the ESS/s for the log-likelihood statistic when varying the stepsize ε. The results in Figure 13 (right) show that the value selected by the auto-tuner is indeed reasonable, close to the value 0.02 found by brute-force maximization. We repeat the experiments with ε = 0.02 and obtain the same conclusions. This shows that the problem is genuinely challenging for HMC.
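For reference, the ESS figures discussed above are derived from the chain's autocorrelations. A minimal sketch using a simple truncation heuristic (stop summing at the first non-positive empirical autocorrelation); the estimator actually used in the paper may be more refined:

```python
import numpy as np

def ess(chain, max_lag=1000):
    """Effective sample size: n / (1 + 2 * sum of autocorrelations).

    The empirical ACF is truncated at the first non-positive lag,
    a simple and common heuristic.
    """
    x = np.asarray(chain, dtype=float)
    n = len(x)
    x = x - x.mean()
    var = (x @ x) / n
    tau = 1.0                      # integrated autocorrelation time
    for k in range(1, min(max_lag, n)):
        rho = (x[:-k] @ x[k:]) / (n * var)
        if rho <= 0:
            break
        tau += 2.0 * rho
    return n / tau

rng = np.random.default_rng(0)
z = rng.standard_normal(20000)
ar = np.empty_like(z)              # AR(1) chain with strong autocorrelation
ar[0] = z[0]
for t in range(1, len(z)):
    ar[t] = 0.9 * ar[t - 1] + z[t]

e_iid, e_ar = ess(z), ess(ar)      # AR(1) yields far fewer effective samples
```

Dividing by wall-clock time gives the ESS/s criterion used in Figures 11 and 13; a faster-decaying ACF, as for BPS in Figure 12, translates directly into a larger ESS.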
Bayesian inference of evolutionary parameters
[Figure 12 plots: ACF of the logDensity statistic up to lag 50, for BPS and HMC.]
Figure 12: Estimate of the ACF of the log-likelihood statistic for BPS (left) and HMC (right). A similar behavior is observed for the ACF of the other statistics.
[Figure 13 plots: ESS/s vs. refresh rate (BPS, left) and vs. stepsize epsilon (HMC, right).]
Figure 13: Left: sensitivity of BPS's ESS/s on the log-likelihood statistic to the refresh rate. Right: sensitivity of HMC's ESS/s on the log-likelihood statistic to the stepsize ε. Each setting is replicated 10 times with different algorithmic random seeds.
Thanks for your attention!!!