The Bouncy Particle Sampler: A Non-Reversible Rejection-Free Markov Chain Monte Carlo Method
Alexandre Bouchard-Côté, Sebastian J. Vollmer, Arnaud Doucet
Presented by Changyou Chen, January 20, 2017
Changyou Chen, The Bouncy Particle Sampler: A Non-Reversible Rejection-Free Markov Chain Monte Carlo Method
Introduction
Outline
1 Introduction
2 The Bouncy Particle Sampler
3 Numerical Results
Introduction
SG-MCMC vs. Bouncy Particle Sampler
SG-MCMC: diffusion-based, approximate simulation
(Figure: SG-MCMC trace of x over t ∈ [0, 200] and the corresponding sample histogram.)
Bouncy particle: Poisson-process based, exact simulation
(Figure: piecewise-linear BPS trajectory over t ∈ [0, 200] with the bouncy/jumping points marked.)
Introduction
Bouncy Particle Sampler
1 A Poisson-process based MCMC sampler: the velocity and direction depend on the target distribution (the model posterior); the bouncing (velocity-changing) times are driven by a Poisson process parametrized by the model posterior.
2 The stationary distribution equals the model posterior distribution.
3 Since simulation of a Poisson process can be exact, no discretization error is introduced; the algorithm is rejection free.
4 Theoretically sound.
The Bouncy Particle Sampler
Outline
1 Introduction
2 The Bouncy Particle Sampler
3 Numerical Results
The Bouncy Particle Sampler
Poisson Processes
1 A Poisson process N(t) is a counting process¹ of rate λ if the inter-arrival times are i.i.d. exponential with mean 1/λ.
2 When λ depends on t, it is called an inhomogeneous Poisson process.
¹ t can be considered as one-dimensional time for simplicity; the generalization to general spaces is straightforward.
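The definition above can be illustrated with a short simulation: a minimal sketch (the function name `poisson_arrivals` is ours, not from the paper) that builds the arrival times of a homogeneous Poisson process by accumulating i.i.d. exponential inter-arrival gaps.

```python
import numpy as np

def poisson_arrivals(rate, t_max, rng):
    """Arrival times on [0, t_max] of a homogeneous Poisson process of
    the given rate, built from i.i.d. Exp(rate) inter-arrival gaps."""
    times = []
    t = rng.exponential(1.0 / rate)
    while t < t_max:
        times.append(t)
        t += rng.exponential(1.0 / rate)
    return np.array(times)

rng = np.random.default_rng(0)
arrivals = poisson_arrivals(rate=2.0, t_max=1000.0, rng=rng)
# The number of arrivals concentrates around rate * t_max = 2000.
print(len(arrivals))
```

For the inhomogeneous case used by the BPS, the rate changes along the trajectory, and simulation requires the time-scale transformation or thinning methods described later.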
The Bouncy Particle Sampler
Basic Setup
1 The goal is to sample from a target distribution:
p(x) ∝ e^{−U(x)}
2 In Bayesian models, we are given data D = {d₁, ..., d_N}, a generative model (likelihood) p(D|x) = ∏_{i=1}^N p(d_i|x) and a prior p(x), and we want to sample from the posterior:
p(x|D) ∝ p(x) p(D|x) = p(x) ∏_{i=1}^N p(d_i|x)
3 U(x) is defined as:
U(x) ≜ −∑_{i=1}^N log p(d_i|x) − log p(x)
The Bouncy Particle Sampler
Basic Setup
1 Like HMC, the parameter space is augmented with a velocity variable.
2 Define the following two quantities:
Poisson process intensity: λ(x,v) = max{0, ⟨∇U(x), v⟩}
velocity bounce operator: R(x)v = v − 2 (⟨∇U(x), v⟩ / ‖∇U(x)‖²) ∇U(x)
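As a quick sanity check on R(x), here is a minimal sketch (function and variable names are ours): the reflection flips the component of v along ∇U(x) while preserving the orthogonal component and the norm of v.

```python
import numpy as np

def bounce(grad_u, v):
    """R(x)v = v - 2 <grad_u, v> / ||grad_u||^2 * grad_u:
    reflect v off the hyperplane orthogonal to the gradient."""
    return v - 2.0 * (np.dot(grad_u, v) / np.dot(grad_u, grad_u)) * grad_u

g = np.array([1.0, 0.0])
v = np.array([0.3, 0.8])
w = bounce(g, v)
print(w)  # [-0.3  0.8]: the component along g flips, the rest is unchanged
```

Norm preservation is what makes the bounce compatible with a Gaussian (or uniform-on-the-sphere) velocity distribution.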
The Bouncy Particle Sampler
Algorithm Illustration
1 A one-dimensional example with U(x) = x².
2 A particle changes velocity (bounces) at the first arrival time of an inhomogeneous Poisson process with intensity λ(x,v).
3 In addition, a random velocity refreshment happens at the first arrival time of a Poisson process with constant intensity.
(Figure: U(x) = x² with Poisson process intensity λ(x,v) = max{0, ⟨∇U(x), v⟩}; when ∇U > 0 and v < 0 the bounce probability is 0, and when ∇U > 0 and v > 0 the bounce probability is positive.)
The Bouncy Particle Sampler
Basic BPS Algorithm
λ(x,v) = max{0, ⟨∇U(x), v⟩},  R(x)v = v − 2 (⟨∇U(x), v⟩ / ‖∇U(x)‖²) ∇U(x)

(From the paper:) ... these segments, the algorithm relies on a position- and velocity-dependent intensity function λ : R^d × R^d → [0, ∞) and a position-dependent bouncing matrix R : R^d → R^{d×d}, given respectively by
λ(x, v) = max{0, ⟨∇U(x), v⟩}  (1)
and, for any v ∈ R^d,
R(x)v = (I_d − 2 ∇U(x){∇U(x)}ᵀ / ‖∇U(x)‖²) v = v − 2 (⟨∇U(x), v⟩ / ‖∇U(x)‖²) ∇U(x),  (2)
where I_d denotes the d × d identity matrix, ‖·‖ the Euclidean norm, and ⟨w, z⟩ = wᵀz the scalar product between column vectors w, z.¹ The algorithm also performs a velocity refreshment at random times distributed according to the arrival times of a homogeneous Poisson process of intensity λ^ref ≥ 0, λ^ref being a parameter of the BPS algorithm. Throughout the paper, we use the terminology "event" for a time at which either a bounce or a refreshment occurs. The basic version of the BPS algorithm proceeds as follows:

Algorithm 1 Basic BPS algorithm
1. Initialize the state and velocity (x(0), v(0)) arbitrarily on R^d × R^d.
2. While more events i = 1, 2, ... requested do
 (a) Simulate the first arrival time τ_bounce ∈ (0, ∞) of an inhomogeneous Poisson process of intensity χ(t) = λ(x(i−1) + v(i−1)t, v(i−1)).
 (b) Simulate τ_ref ~ Exp(λ^ref).
 (c) Set τ_i ← min(τ_bounce, τ_ref) and compute the next position
  x(i) ← x(i−1) + v(i−1)τ_i.  (3)
 (d) If τ_i = τ_ref, sample the next velocity v(i) ~ N(0_d, I_d).
 (e) If τ_i = τ_bounce, compute the next velocity v(i) using
  v(i) ← R(x(i)) v(i−1),  (4)
 which is the vector obtained once v(i−1) bounces on the plane tangential to the gradient of the energy function at x(i).
3. End While.

In the algorithm above, Exp(λ) denotes the exponential distribution of rate λ and N(0_d, I_d) the standard Gaussian distribution on R^d.² Compared to the algorithm described in [23], our formulation of step 2a is expressed in terms of an inhomogeneous Poisson process arrival.

We will show further that the transition kernel of the resulting process x(t) admits π as invariant distribution for any λ^ref ≥ 0, but it can fail to be irreducible when λ^ref = 0. It is thus critical to use λ^ref > 0. Our proof of invariance and ergodicity can accommodate some alternative ways to perform the refreshment step 2d. One such variant, which we call restricted refreshment, samples v(i) uniformly on the unit hypersphere S^{d−1} = {x ∈ R^d : ‖x‖ = 1}. We compare experimentally these two variants and others in Section 4.3.

¹ Throughout the paper, when an algorithm contains an expression of the form R(x)v, it is understood that this computation is implemented via the right-hand side of Equation 2, which takes time O(d), rather than the left-hand side, which would naively take time O(d²).
² By exploiting the memorylessness of the exponential distribution, we could alternatively only implement step (b) of Algorithm 1 for the ith event when τ_{i−1} corresponds to a bounce, and set τ_ref ← τ_ref − τ_{i−1} otherwise.
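Algorithm 1 can be sketched end to end on a one-dimensional standard Gaussian target U(x) = x²/2, where the bounce time has a closed form (all names below are ours; in one dimension the bounce operator reduces to v ← −v). The time average of x² along the piecewise-linear path, accumulated exactly per segment, estimates Var(x) = 1.

```python
import numpy as np

def bounce_time(x, v, rng):
    """First arrival of the inhomogeneous PP with intensity
    chi(t) = max(0, <grad U(x + v t), v>) for U(x) = x^2 / 2,
    obtained by solving Xi(tau) = -log V in closed form."""
    a, b = x * v, v * v
    e = rng.exponential(1.0)              # e = -log V, V ~ U(0, 1)
    if a <= 0.0:
        return (-a + np.sqrt(2.0 * b * e)) / b
    return (-a + np.sqrt(a * a + 2.0 * b * e)) / b

def bps_1d(n_events=200_000, lam_ref=1.0, seed=1):
    rng = np.random.default_rng(seed)
    x, v = 0.0, 1.0
    t_total = sum_x2 = 0.0
    for _ in range(n_events):
        tau_b = bounce_time(x, v, rng)
        tau_r = rng.exponential(1.0 / lam_ref)   # refreshment candidate
        tau = min(tau_b, tau_r)
        # exact segment integral of x(s)^2 = (x + v s)^2 over [0, tau]
        sum_x2 += x * x * tau + x * v * tau**2 + v * v * tau**3 / 3.0
        t_total += tau
        x += v * tau
        v = rng.normal() if tau_r < tau_b else -v   # refresh or bounce
    return sum_x2 / t_total

print(bps_1d())  # close to Var(x) = 1 for the standard Gaussian target
```

Note there is no accept/reject step: every event either bounces or refreshes the velocity, which is the "rejection-free" property.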
The Bouncy Particle Sampler
Simulating the bounce time using a time-scale transformation
1 The goal is to simulate the first arrival time of an inhomogeneous Poisson process with intensity χ(t) = max{0, ⟨∇U(x + vt), v⟩}.
2 Let Ξ(t) = ∫₀ᵗ χ(s) ds be the cumulative intensity.
3 The probability that the first arrival time τ exceeds u is:
P(τ > u) = exp(−Ξ(u)) (Ξ(u))⁰ / 0! = exp(−Ξ(u))  (1)
4 Hence τ = Ξ⁻¹(−log V), where V ~ U(0,1) and Ξ⁻¹(p) = inf{t : Ξ(t) ≥ p} is the first time such that Ξ(t) ≥ p.
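The relation τ = Ξ⁻¹(−log V) can be used numerically even when Ξ has no closed-form inverse. A sketch under stated assumptions (χ is evaluated on a fixed grid, Ξ is approximated by the trapezoid rule, and the exponential variate −log V is passed in explicitly; all names are ours):

```python
import numpy as np

def first_arrival(chi, t_grid, exp_variate):
    """Approximate tau = Xi^{-1}(exp_variate), where
    Xi(t) = int_0^t chi(s) ds (trapezoid rule on t_grid) and
    exp_variate = -log V with V ~ U(0, 1)."""
    chi_vals = np.maximum(chi(t_grid), 0.0)
    increments = 0.5 * (chi_vals[1:] + chi_vals[:-1]) * np.diff(t_grid)
    xi = np.concatenate(([0.0], np.cumsum(increments)))
    if xi[-1] < exp_variate:
        return np.inf              # no arrival within the grid horizon
    return np.interp(exp_variate, xi, t_grid)

# Check against a case with a known inverse: chi(t) = 2t gives
# Xi(t) = t^2, so Xi^{-1}(4) = 2.
tau = first_arrival(lambda t: 2.0 * t, np.linspace(0.0, 10.0, 10_001), 4.0)
print(tau)  # close to 2.0
```

In practice one would use bisection or a line search rather than a fixed grid; the grid version just makes the monotone inversion explicit.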
The Bouncy Particle Sampler
Simulating the bounce time using a time-scale transformation
1 When the target distribution is log-concave (U(x) is convex):
Ξ(τ) = ∫₀^τ λ(x + vs, v) ds = ∫_{τ*}^τ λ(x + vs, v) ds,  (2)
where τ* = argmin_{t ≥ 0} U(x + vt) is the minimizer along the ray.
2 After some simplification,
U(x + vτ) − U(x + vτ*) = −log V,  V ~ U(0,1).
3 Solve through line search if not explicitly solvable.
The Bouncy Particle Sampler
Example
U(x + vτ) − U(x + vτ*) = −log V

Example (Gaussian distributions)
Consider the target distribution to be a zero-mean multivariate Gaussian with covariance matrix ½ I_d, so that U(x) = ‖x‖². Then
τ = (1/‖v‖²) · { −⟨x,v⟩ + √(−‖v‖² log V)  if ⟨x,v⟩ ≤ 0
        −⟨x,v⟩ + √(⟨x,v⟩² − ‖v‖² log V)  otherwise  (3)
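Equation (3) translates directly into code. A sketch (the function name is ours; the uniform variate V is passed in so both branches can be checked deterministically):

```python
import numpy as np

def gaussian_bounce_time(x, v, u):
    """Closed-form bounce time of Equation (3) for U(x) = ||x||^2
    (zero-mean Gaussian with covariance I/2); u plays V ~ U(0, 1)."""
    a, b = np.dot(x, v), np.dot(v, v)
    log_v = np.log(u)
    if a <= 0.0:
        return (-a + np.sqrt(-b * log_v)) / b
    return (-a + np.sqrt(a * a - b * log_v)) / b

# <x, v> = 0: Xi(tau) = ||v||^2 tau^2, so tau = sqrt(-log V).
print(gaussian_bounce_time(np.array([1.0, 0.0]), np.array([0.0, 1.0]), 0.5))
# <x, v> = 1, -log V = 3: Xi(tau) = tau^2 + 2 tau = 3, so tau = 1.
print(gaussian_bounce_time(np.array([1.0, 0.0]), np.array([1.0, 0.0]),
                           np.exp(-3.0)))
```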
The Bouncy Particle Sampler
Simulating the bounce time via adaptive thinning
1 The idea is to define an easy-to-simulate Poisson process whose intensity χ̄_s(t) upper bounds χ(t), i.e.:
χ̄_s(t) = 0 for all t < s, and χ̄_s(t) ≥ χ(t) for all t ≥ s  (4)

(From the paper:)
2.3.2 Simulation using adaptive thinning
In scenarios where it is difficult to solve (5), the use of a thinning procedure to simulate τ provides another alternative. Assume we have access to local-in-time upper bounds λ̄_s(t) on λ(t), that is
λ̄_s(t) = 0 for all t < s,  λ̄_s(t) ≥ λ(t) for all t ≥ s,
and that we can simulate the first arrival time of the inhomogeneous Poisson process Π̄_s with intensity λ̄_s(t) defined on [s, ∞). Algorithm 2 shows the pseudocode for the adaptive thinning procedure.

Algorithm 2 Simulation of the first arrival time of an inhomogeneous Poisson process through thinning
1. Set s ← 0, τ ← 0.
2. Do
 (a) Set s ← τ.
 (b) Sample τ as the first arrival point of Π̄_s of intensity λ̄_s.
 (c) While V > λ(τ)/λ̄_s(τ), where V ~ U(0, 1).
3. Return τ.

The event V > λ(τ)/λ̄_s(τ) corresponds to a rejection step in the thinning algorithm but, in contrast to rejection steps that occur in standard MCMC samplers, in the BPS algorithm this just means that the particle does not bounce and simply coasts.

2.3.3 Simulation using superposition
Assume that U(x) can be decomposed as
U(x) = ∑_{j=1}^m U^[j](x).  (7)
Under this assumption, if we let λ^[j](t) = max{0, ⟨∇U^[j](x + tv), v⟩} for j = 1, ..., m, it follows that
λ(t) ≤ ∑_{j=1}^m λ^[j](t).
It is therefore possible to use the adaptive thinning algorithm with λ̄_s(t) = λ̄(t) = ∑_{j=1}^m λ^[j](t) for t ≥ s. Moreover, we can simulate from λ̄ via superposition as follows. First, simulate the first arrival time τ^[j] of each inhomogeneous Poisson process with intensity λ^[j](t) ≥ 0. Second, return τ = min_{j=1,...,m} τ^[j].

Example 3. Exponential families. Consider a univariate exponential family with parameter x, observation ι, sufficient statistic φ(ι) and log-normalizing constant A(x). If we assume a standard Gaussian prior on x, we obtain the following energy:
U(x) = x²/2 − xφ(ι) + A(x),
where the three terms are U^[1](x), U^[2](x) and U^[3](x), respectively.
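Algorithm 2 in code form: a minimal sketch (all names are ours) where `sample_bar(s, rng)` draws the first arrival after s of the dominating process and `chi_bar(s, t)` evaluates the bound. It is checked on a homogeneous process dominated by twice its rate, for which the first arrival is Exp(lam).

```python
import numpy as np

def thinned_arrival(chi, chi_bar, sample_bar, rng):
    """Adaptive thinning (Algorithm 2): repeatedly propose arrivals
    from the dominating process, accept with probability chi/chi_bar."""
    tau = 0.0
    while True:
        s = tau
        tau = sample_bar(s, rng)
        if rng.uniform() <= chi(tau) / chi_bar(s, tau):
            return tau        # accepted proposal: an actual bounce
        # rejected: the particle does not bounce and simply coasts

rng = np.random.default_rng(0)
lam = 1.0
draws = [
    thinned_arrival(
        chi=lambda t: lam,                    # true (constant) intensity
        chi_bar=lambda s, t: 2.0 * lam,       # dominating intensity
        sample_bar=lambda s, r: s + r.exponential(1.0 / (2.0 * lam)),
        rng=rng,
    )
    for _ in range(20_000)
]
print(np.mean(draws))  # close to 1/lam: first arrivals are Exp(lam)
```

The tighter the bound χ̄_s, the fewer wasted proposals per accepted bounce.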
The Bouncy Particle Sampler
Simulating the bounce time using superposition
1 Assume U(x) can be decomposed as:
U(x) = ∑_{j=1}^m U^[j](x).  (5)
2 Let χ^[j](t) = max{0, ⟨∇U^[j](x + tv), v⟩}; then we have
χ(t) ≤ ∑_{j=1}^m χ^[j](t)  (6)
3 Therefore, letting τ^[j] be the first arrival time w.r.t. χ^[j](t),
τ = min_{j=1,...,m} τ^[j].  (7)
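The superposition step itself is a one-liner. A sketch (names ours), checked on two homogeneous factors, where the minimum of independent Exp(1) and Exp(2) first arrivals is Exp(3):

```python
import numpy as np

def superposition_arrival(factor_samplers, rng):
    """Simulate the first arrival time of each factor process and
    return their minimum: the first arrival of the superposition."""
    return min(sampler(rng) for sampler in factor_samplers)

rng = np.random.default_rng(0)
samplers = [lambda r: r.exponential(1.0 / 1.0),   # factor with rate 1
            lambda r: r.exponential(1.0 / 2.0)]   # factor with rate 2
draws = [superposition_arrival(samplers, rng) for _ in range(20_000)]
print(np.mean(draws))  # close to 1/3: min of Exp(1), Exp(2) is Exp(3)
```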
The Bouncy Particle Sampler
Example
Example (Logistic regression)
Let {ℓ_r ∈ R^d}_{r=1}^R be the data and c_r ∈ {0, 1} the label of data point ℓ_r.
The parameter x is assigned a standard multivariate Gaussian prior.
U(x) = ‖x‖²/2 + ∑_{r=1}^R [log(1 + exp⟨ℓ_r, x⟩) − c_r⟨ℓ_r, x⟩],  (8)
where the r-th summand is U^[r](x).
1 An upper bound for the intensity χ^[r](t) corresponding to U^[r](x) is
χ^[r](t) ≤ χ̄^[r] = ∑_{k=1}^d 1[(−1)^{c_r} v_k ≥ 0] · ℓ_{rk} · |v_k|,  (9)
where each χ̄^[r] is a constant given v.
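The constant bound in Equation (9) can be checked numerically against the exact factor intensity. A sketch (all names are ours; like the paper's bound, it assumes non-negative features ℓ_r):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def intensity_bound(ell, c, v):
    """chi_bar^[r] = sum_k 1[(-1)^c v_k >= 0] * ell_k * |v_k|:
    a constant-in-t upper bound on the factor intensity."""
    return np.sum((((-1.0) ** c) * v >= 0) * ell * np.abs(v))

def intensity(ell, c, x, v, t):
    """Exact chi^[r](t) = max(0, <grad U^[r](x + t v), v>), with
    grad U^[r](x) = ell * (sigmoid(<ell, x>) - c)."""
    s = sigmoid(np.dot(ell, x + t * v))
    return max(0.0, (s - c) * np.dot(ell, v))

ell = np.array([1.0, 2.0])
x = np.array([0.3, -0.2])
v = np.array([0.5, -1.0])
for c in (0, 1):
    bound = intensity_bound(ell, c, v)
    assert all(intensity(ell, c, x, v, t) <= bound
               for t in np.linspace(0.0, 5.0, 51))
    print(c, bound)
```

Because χ̄^[r] is constant in t, each factor's dominating process is homogeneous, so thinning plus superposition applies directly.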
The Bouncy Particle Sampler
Theoretical results
(From the paper:) For example, for φ(x) = x_k, k ∈ {1, 2, ..., d}, we have
∫₀^{τ_i} φ(x(i−1) + v(i−1)s) ds = x_k(i−1) τ_i + v_k(i−1) τ_i²/2.
When the integral above is intractable, we may simply subsample the trajectory of x(t) at regular time intervals to obtain the estimator
(1/L) ∑_{l=0}^{L−1} φ(x(lδ)),
where δ > 0 and L = 1 + ⌊T/δ⌋. Alternatively, we could approximate these univariate integrals through quadrature.

2.5 Theoretical results
Peters and de With (2012) present an informal proof establishing the fact that the BPS with λ^ref = 0 admits π as invariant distribution. We provide in Appendix A a rigorous proof of this π-invariance result for λ^ref ≥ 0 and prove that the resulting process is additionally ergodic when λ^ref > 0. In the following we denote by P_t(z, dz′) the transition kernel of the continuous-time Markov process z(t) = (x(t), v(t)).

Proposition 1. For any λ^ref ≥ 0, the infinitesimal generator associated to the transition kernel P_t of the BPS is given, for any continuously differentiable function h : R^d × R^d → R, by
L h(z) = lim_{t→0} [∫ P_t(z, dz′) h(z′) − h(z)] / t
  = −λ(x, v) h(z) + ⟨∇_x h, v⟩ + λ^ref ∫ (h(x, v′) − h(x, v)) ψ(dv′) + λ(x, v) h(x, R(x)v),
where we recall that ψ(v) denotes the standard multivariate Gaussian density on R^d.

This transition kernel is non-reversible and ρ-invariant, where
ρ(z) = π(x) ψ(v).  (12)

If we add the condition λ^ref > 0, we get the following stronger result.

Theorem 1. If λ^ref > 0 then ρ is the unique invariant probability measure of the transition kernel of the BPS, and the corresponding process satisfies a strong law of large numbers for ρ-almost every z(0) and h ∈ L¹(ρ):
lim_{T→∞} (1/T) ∫₀^T h(z(t)) dt = ∫ h(z) ρ(z) dz  a.s.

We exhibit in Section 4.1 a simple example where P_t is not ergodic for λ^ref = 0.

3 The local bouncy particle sampler
3.1 Structured target distribution and factor graph representation
In numerous applications, the target distribution admits some structural properties that can be exploited by sampling algorithms. For example, the popular Gibbs sampler takes advantage of
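The subsampling estimator above needs x(lδ) evaluated along the piecewise-linear path. A sketch (all names are ours) that reconstructs the path on a regular grid from the stored event times, positions and velocities:

```python
import numpy as np

def discretize(event_times, positions, velocities, delta):
    """Evaluate the piecewise-linear path x(t) at t = 0, delta,
    2*delta, ...: locate the last event before each grid time and
    advance linearly from it."""
    event_times = np.asarray(event_times)
    positions = np.asarray(positions, dtype=float)
    velocities = np.asarray(velocities, dtype=float)
    grid = np.arange(0.0, event_times[-1], delta)
    idx = np.clip(np.searchsorted(event_times, grid, side="right") - 1,
                  0, len(event_times) - 1)
    dt = grid - event_times[idx]
    return positions[idx] + velocities[idx] * dt[:, None]

# One particle in 1-d: move right with v = 1 on [0, 1], then left with
# v = -1 on [1, 3].
path = discretize([0.0, 1.0, 3.0], [[0.0], [1.0], [-1.0]],
                  [[1.0], [-1.0], [0.0]], delta=0.5)
print(path.ravel())  # [ 0.   0.5  1.   0.5  0.  -0.5]
```

Averaging φ over these grid points gives the estimator (1/L) ∑_l φ(x(lδ)); when the segment integral of φ is tractable, the exact per-segment formula is preferable.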
The Bouncy Particle Sampler
Basic proof idea
To derive the Fokker-Planck equation for the algorithm:
∂
∂ tµt(z) = (L ∗
µt)(z) ,
where z , (x,v), µt(z) is the density of z at time t, L ∗ is theadjoint of the generator L .
1 Assume the stationary distribution of µ(z) = π(x)ψ(v).2 Write out the joint distribution of z and the joint times of the
Poisson process.3 Calculate the marginal distribution of z, and get the
corresponding density µt(z) for time t.4 Verify that dµt(z′)
dt = 0 for all z′.
The Bouncy Particle Sampler
The local bouncy particle sampler
Divide the parameters into groups using a factor graph.
Figure 2: Top: an example of a factor graph with d = 4 variables and 3 binary factors, F = {f_a, f_b, f_c}. Bottom: an example of paths generated by the local BPS algorithm. The circles show the locations that are stored in memory (each with their associated time and velocity after bouncing, not shown). The small black square on the first path is used to demonstrate the execution of Algorithm 3, used to reconstruct the location x(t). The algorithm first identifies i(t, 1), which is 3 in this example. The algorithm then uses the information of the latest event preceding x_t to add to the position at that event, x_1^(3), the velocity just after the event, v_1^(3), times the time increment denoted by the asterisk, (t − T_1^(3)). The bottom section of the figure also shows the candidate bounce times used by Algorithm 4 to compute the first four bounce events. The exclamation mark indicates an example where a candidate bounce time need not be recomputed thanks to the sparsity structure of the factor graph.
(From the paper:) ... conditional independence properties. We present here a "local" version of the BPS which can exploit any representation of the target density as a product of positive factors
π(x) ∝ ∏_{f∈F} γ_f(x_f)  (13)
where x_f is the restriction of x to a subset N_f ⊆ {1, 2, ..., d} of the components of x, and F is an index set called the set of factors. Hence the energy associated to π is of the form
U(x) = ∑_{f∈F} U_f(x)  (14)
with ∂U_f(x)/∂x_k = 0 for all variables absent from factor f, i.e. for all k ∈ {1, 2, ..., d} \ N_f.
Such a factorization of the target density can be formalized using factor graphs (Figure 2, top). A factor graph is a bipartite graph, with one set of vertices N called the variables, each corresponding to a component of x (|N| = d), and a set of vertices F corresponding to the local factors (γ_f)_{f∈F}. There is an edge between k ∈ N and f ∈ F if and only if k ∈ N_f. Such a representation generalizes undirected graphical models [28, Chap. 2, Section 2.1.3]. For example, factor graphs can have distinct factors connected to the same set of components (i.e. f ≠ f′ with N_f = N_{f′}).

3.2 Local BPS: algorithm description
Similarly to the Gibbs sampler, each step of the local BPS manipulates only a subset of the d components of x. Contrary to the Gibbs sampler, the local BPS does not require sampling from
F = {f_a, f_b, f_c}. The energy is decomposed into:
U(x) = ∑_{f∈F} U_f(x)  (10)
Define local intensity functions λ_f and local bouncing matrices R_f:
λ_f(x, v) = max{0, ⟨∇U_f(x), v⟩}  (11)
R_f(x)v = v − 2 (⟨∇U_f(x), v⟩ / ‖∇U_f(x)‖²) ∇U_f(x)  (12)
The Bouncy Particle Sampler
The local bouncy particle sampler
1 The next bounce time τ is the first arrival time of an inhomogeneous Poisson process with intensity χ(t) = ∑_{f∈F} χ_f(t).
2 Sample a factor f with probability χ_f(τ)/χ(τ), and update the components of the parameter x and velocity v related to factor f, following the basic BPS sampler.
3 Efficient implementation via a priority queue or thinning-based methods (when the number of factors is large).
Theorem
The local BPS still admits ρ(z) = π(x)ψ(v) as the stationary distribution.
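The priority-queue implementation can be sketched schematically (all names are ours; refreshment events and the per-factor time sampling are elided): each factor holds a candidate bounce time, the earliest candidate fires, and only factors sharing variables with it recompute their candidates.

```python
import heapq

def local_bps(factors, neighbors, next_time, bounce, x, v, t_end, rng):
    """Event loop of the local BPS: `next_time(f, x, v, t, rng)` draws
    factor f's candidate bounce time after t; `bounce(f, x, v)` returns
    the velocity with only the components in N_f updated; `neighbors[f]`
    lists the factors sharing variables with f (including f itself)."""
    t = 0.0
    queue = [(next_time(f, x, v, 0.0, rng), f) for f in factors]
    heapq.heapify(queue)
    while queue:
        t_f, f = heapq.heappop(queue)
        if t_f > t_end:
            break
        x = x + v * (t_f - t)            # advance the state linearly
        t = t_f
        v = bounce(f, x, v)
        stale = set(neighbors[f])        # only these candidates change
        queue = [(tc, g) for tc, g in queue if g not in stale]
        heapq.heapify(queue)
        for g in stale:
            heapq.heappush(queue, (next_time(g, x, v, t, rng), g))
    return x, v, t

# Deterministic check with one factor whose candidate always fires one
# time unit later and whose "bounce" flips the velocity.
x, v, t = local_bps(
    factors=["f"], neighbors={"f": ["f"]},
    next_time=lambda f, x, v, t, rng: t + 1.0,
    bounce=lambda f, x, v: -v,
    x=0.0, v=1.0, t_end=2.5, rng=None)
print(x, v, t)  # 0.0 1.0 2.0: bounced at t = 1 and t = 2
```

With a sparse factor graph, |neighbors[f]| is small, so each event costs far less than recomputing all |F| candidates.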
Numerical Results
Outline
1 Introduction
2 The Bouncy Particle Sampler
3 Numerical Results
Numerical Results
Multivariate Gaussian distributions
Figure 3: Left: a trajectory of the BPS for λ^ref = 0; the center of the space is never explored. Right: ESS per CPU second for increasing d for the process with refreshment (slope ≈ −1.24).
(From the paper:) ... and the bounce is based on ∑_{j=1}^s ⟨∇U_{F_j}(x(i)), v(i−1)⟩. It is simple to check that the resulting dynamics preserves π. In contrast to the batch size s = 1, this is not an implementation of the local BPS described in Algorithm 6; instead it corresponds to a local BPS update for a random partition of the factors.

4 Numerical results
4.1 Multivariate Gaussian distributions and the need for refreshment
We use a simple isotropic multivariate Gaussian target distribution, U(x) = ‖x‖², to illustrate the importance of velocity refreshment, restricting ourselves here to BPS with restricted refreshment. Without loss of generality, we assume ‖v(i−1)‖ = 1. From Equation (6), we obtain
⟨x(i), v(i)⟩ = { −√(−log V_i)  if ⟨x(i−1), v(i−1)⟩ ≤ 0
        −√(⟨x(i−1), v(i−1)⟩² − log V_i)  otherwise
and
‖x(i)‖² = { ‖x(i−1)‖² − ⟨x(i−1), v(i−1)⟩² − log V_i  if ⟨x(i−1), v(i−1)⟩ ≤ 0
      ‖x(i−1)‖² − log V_i  otherwise.
In particular, these calculations show that if ⟨x(i), v(i)⟩ ≤ 0 then ⟨x(j), v(j)⟩ ≤ 0 for j > i. Using this result, we can show inductively that ‖x(i)‖² = ‖x(1)‖² − ⟨x(1), v(1)⟩² − log V_i for i ≥ 2. In particular, for x(0) = e₁ and v(0) = e₂, with e_i being elements of the standard basis of R^d, the norm of the position at all points along the trajectory can never be smaller than 1, as illustrated in Figure 3.

In this scenario, we show that BPS without refreshment admits a countably infinite collection of invariant distributions. Again, without loss of generality, assume ‖v(0)‖ = 1. Let us define r_t = ‖x(t)‖ and m_t = ⟨x(t), v(t)⟩/‖x(t)‖, and denote by φ_k the probability density of the chi distribution with k degrees of freedom.

Proposition 2. For any dimension d ≥ 2, the process (r_t, m_t)_{t≥1} is Markov and its transition kernel is invariant w.r.t. the probability densities {f_k(r, m) ∝ φ_k(√2 r) · (1 − m²)^{(k−3)/2} ; k ∈ {2, 3, 4, ...}}.
1 When λ^ref = 0, the center of the space is never explored (non-ergodic).
2 Optimal scaling of ESS vs. dimension d:
BPS: ≈ d^{−1.24} (empirically)
HMC: d^{−1.25} (theoretically)
random-walk MH: d^{−2} (theoretically)
Numerical Results
Comparison of the global and local schemes
1 Test on a sparse Gaussian field (a chain-shaped Gaussian Markov random field of length 1000).
Figure 4: Relative errors for Gaussian chain-shaped random fields. Facets contain results for fields of pairwise precisions 0.1-0.9. Each summarizes the relative errors of 200 (100 local, 100 global) BPS executions, each run for a fixed computational budget (a wall-clock time of 60 seconds).
(From the paper:) By Theorem 1, we have a unique invariant measure as soon as λ^ref > 0.

Next, we look at the scaling of the effective sample size (ESS) per CPU second of the basic BPS algorithm as the dimensionality d of the isotropic normal target increases. We use λ^ref = 5. We focus without loss of generality on the first component of the state vector and estimate the ESS using the R package mcmcse [8] by evaluating the trajectory on a sufficiently fine discretization of the sampled trajectory. The results in log-log scale are displayed in Figure 3. The curve suggests a decay of roughly d^{−1.24}, similar to the d^{−5/4} scaling for an optimally tuned Hamiltonian Monte Carlo (HMC) algorithm in a similar setup [5, Section III], [21, Section 5.4.4]. Both BPS and HMC compare favorably to the d^{−2} scaling of the optimally tuned random-walk MH [24].

4.2 Comparison of the global and local schemes
To quantify the potential computational advantage brought by the local version of the algorithm of Section 3 over the global version of Section 2, we compare both algorithms on a sparse Gaussian field. We use a chain-shaped undirected graphical model of length 1000, and perform separate experiments for various precision parameters for the pairwise interaction between neighbors in the chain. We run the local and global methods for a fixed computational budget (60 seconds), and repeat the experiment in each configuration 100 times. We compute a Monte Carlo estimate of the marginal variance of the variable with index 500, and compare this estimate to the truth (which can be computed explicitly in this case). The results are shown in Figure 4, in the form of a histogram over relative absolute errors of the 100 executions of each setup. They confirm that the smaller computational complexity per local bounce more than offsets the associated decrease in expected trajectory segment length. Moreover, the results show that the BPS method is very robust to the pairwise precision used in this sparse Gaussian field model.

4.3 Comparisons of alternative refreshment schemes
In Section 2 and Section 3, the velocity was refreshed using a standard multivariate normal. We compare here this scheme to alternative refreshment schemes:
Global refreshment: sample the entire velocity vector from a standard multivariate Gaussian distribution.
Local refreshment: if the local BPS is being used, the structure specified by the factor graph can be used to design computationally cheaper refreshment operators. We pick one factor f ∈ F uniformly at random, and we consider resampling only the components of v with
Numerical Results
Comparisons of refreshment schemes
1 Global refreshment: the basic BPS sampler.
2 Local refreshment: local BPS with only some components of v refreshed.
3 Restricted refreshment: restrict v to have unit norm when refreshing.
4 Restricted partial refreshment: a variant of restricted refreshment that only slightly perturbs the current velocity direction.
Numerical Results
Comparisons of refreshment schemes
Figure 5: Comparison of four refreshment schemes (refreshment rates 0.01, 0.1, 1, 10). The top panel shows results for a 100-dimensional problem, and the bottom one for a 1000-dimensional problem. The box plots summarize the marginal variance (in log scale) of the variable with index 50 over 100 executions of BPS for each of the refreshment schemes.
indices in Nf . By the same argument used in Section 3, each refreshment will then requirebounce time recomputation only for the factors f 0 with Nf \ Nf 0 6= ;. Provided that eachvariable is connected with at least one factor f with |Nf | > 1, this scheme is irreducible(and if this condition is not satisfied, additional constant factors can be introduced withoutchanging the target distribution).
Restricted refreshment: this method adds a restriction that the velocities be of unit norm. Thisscheme corresponds to a different invariant distribution ⇢ (x) = ⇡ (x)� (v) where� (v) is nowthe uniform distribution on Sd�1. Refreshment is thus performed by the global refreshmentscheme, followed by a re-normalization step.
Restricted partial refreshment: a variant of the restricted refreshment scheme where we samplean angle ✓ by multiplying a Beta(↵, �)-distributed random variable by 2⇡. We then select avector uniformly at random from the unit length vectors that have an angle ✓ from v. Weused ↵ = 1,� = 4 to favor small angles.
The rationale behind the partial refreshment procedure is to suppress the random walk behaviorof the particle path arising from a refreshment step independent from the current velocity. Somerefreshment is needed to ensure ergodicity but a “good” direction should only be altered slightly.This strategy is akin to the partial momentum refreshment strategy for HMC methods [13], [21,Section 4.3] and, for a normal refreshment, could be similarly implemented. By Lemma 2 all ofthe above schemes preserve ⇢ as invariant distribution. We tested these schemes on two versionsof the chain-shaped factor graph from the previous section (with the pairwise precision parameterset to 0.5), one with 100 dimensions, and one with 1000 dimensions. All methods are providedwith a computational budget of 30 seconds. The results are shown in Figure 5. The results showthat the local refreshment scheme is less sensitive to �ref , performing as well or better than theglobal refreshment scheme. The performance of the restricted and partial methods appears moresensitive to �ref and generally inferior to the other two schemes.
4.4 Comparisons with HMC methods on high-dimensional multivariateGaussian distributions
We compare the local BPS to various state-of-the-art versions of HMC. We use the local refreshmentscheme, no partial refreshment and �ref = 1. We select a 100-dimensional Gaussian example from
17
1 The local refreshment scheme is less sensitive to λ^ref.
Numerical Results
Comparisons with HMC
Multivariate Gaussian:
Figure 6: Relative error of marginal variance estimates for a fixed computational budget (30s). Sampling methods: BPS (adapt=false, fit-metric=false) and Stan/HMC with all four combinations of the fit-metric and nuts settings (adapt=true).
[21, Section 5.3.3.4], where even a basic HMC scheme was shown to perform favorably comparedto standard MH methods. We run several methods for this test case, each for a wall clock time of30 seconds, and measure the relative error on the reconstructed marginal variances. We use Stan[12] as a reference implementation for the HMC algorithms. Different HMC versions are exploredby enabling and disabling the NUTS methodology for determining path lengths, and by enablingand disabling adaptive estimation of a diagonal mass matrix. We always exclude the time takento compile the Stan program in the 30 seconds budget. The three HMC methods tested use 1000iterations of adaptation, since HMC without adaptation (not shown) yields a zero acceptance rate.In contrast, we use the default value for our local algorithm’s tuning parameter (�ref = 1), andno adaptation of the mass matrix. The results (Figure 6) show that this simple implementationperforms remarkably well. The adapted HMC performs reasonably well, except for four marginalswhich are markedly off target. These deviations disappear after incorporating more complex HMCextensions, namely learning a diagonal metric (denoted fit-metric), and adaptively selecting thenumber of leap-frog steps (denoted nuts).
Next, we perform a series of experiments to investigate the comparative performance of our localmethod versus NUTS as the dimensionality d increases. Experiments are performed on the chain-shaped Gaussian Random Field of Section 4.2 (with the pairwise precision parameter set to 0.5).We vary the length of the chain (10, 100, 1000), and run Stan’s implementation of NUTS for 1000iterations + 1000 iterations of adaptation. We measure the wall-clock time (excluding the timetaken to compile the Stan program) and then run our method for the same wall-clock time. Werepeat this 40 times for each chain size. We then measure the absolute value of the relative error on10 equally spaced marginal variances, and show the rate at which they decrease as the percentageof the samples collected in the fixed computational budget varies from 1 percent to 100 percent.The results are displayed in Figure 7. Note that the gap between the two methods widen as thedimensionality increases. To visualize the different behavior of the two algorithms, we show inFigure 8 three marginals of the Stan and BPS paths from the 100-dimensional example for thefirst 0.5% of the full trajectories computed in the computational budget.
4.5 Poisson-Gaussian Markov random field
We consider the following hierarchical model. Let x_{i,j}, i, j ∈ {1, 2, . . . , 10}, denote a sparse-precision, grid-shaped Gaussian Markov random field with pairwise interactions of the same form as those used in the previous chain examples (pairwise precision set to 0.5). For each i, j, let y_{i,j} be Poisson distributed, independent given x = (x_{i,j} : i, j ∈ {1, 2, . . . , 10}), with rate exp(x_{i,j}). We generate a synthetic dataset from this model and approximate the posterior distribution of x given data y = (y_{i,j} : i, j ∈ {1, 2, . . . , 10}). We run Stan with default settings for 16, 32, 64, . . . , 4096 iterations. For each configuration, we run local BPS for the same wall-clock time as Stan, using a local refreshment with λref = 1 and the method from Example 3 to perform the bouncing-time computations. We repeat this series of experiments 10 times with different random seeds for the two samplers. We show in Figure 9 estimates of the posterior variances of the variables indexed 0 (x_{0,0}) and 50 (x_{5,0}). These two marginals are representative of the other variables. Each box plot summarizes the 10 replications. As expected, both methods converge to the same value, but BPS
Figure: Relative error of marginal variance estimates for a fixed computation budget (30s).
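As a rough illustration of the Poisson–Gaussian hierarchical model above (a grid-shaped GMRF latent field with Poisson observations), the sketch below generates a synthetic dataset. The diagonal of the precision matrix is an assumption made here only to guarantee positive definiteness; the paper's exact parameterization may differ.

```python
import numpy as np

n, rho = 10, 0.5                  # grid side length, pairwise precision
d = n * n
P = np.zeros((d, d))              # precision matrix of the GMRF
for i in range(n):
    for j in range(n):
        k = i * n + j
        if j + 1 < n:             # right neighbor on the grid
            P[k, k + 1] = P[k + 1, k] = rho
        if i + 1 < n:             # bottom neighbor on the grid
            P[k, k + n] = P[k + n, k] = rho
# diagonally dominant diagonal (an assumption, for positive definiteness)
np.fill_diagonal(P, 1.0 + np.abs(P).sum(axis=1))

rng = np.random.default_rng(1)
L = np.linalg.cholesky(P)                         # P = L L^T
x = np.linalg.solve(L.T, rng.standard_normal(d))  # x ~ N(0, P^{-1})
y = rng.poisson(np.exp(x))                        # y_ij ~ Poisson(exp(x_ij))
```

Solving L^T x = z with z standard normal gives Cov(x) = (L L^T)^{-1} = P^{-1}, the standard trick for sampling a GMRF from its precision Cholesky factor.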
26 Changyou Chen The Bouncy Particle Sampler: A Non-Reversible Rejection-Free Markov Chain Monte Carlo Method
Numerical Results
Comparisons with HMC
Gaussian random field:
[Figure 7 plot: relative error (log scale) vs. percent of samples processed, for d = 10, 100, 1000; methods: BPS, Stan.]
Figure 7: Relative reconstruction error for d = 10 (left), d = 100 (middle) and d = 1000 (right), averaged over 10 of the dimensions and 40 runs. Each panel is run on a fixed computational budget (corresponding in each panel to the wall-clock time taken by 2000 Stan iterations).
[Figure 8 plot: sample paths vs. percent of samples processed; methods: BPS, Stan.]
Figure 8: Marginals of the paths for variables of index 0 (the left-most variable in the chain) and index 50 (a variable in the middle of the chain). Each of the two marginal paths represents 0.5% of the full trajectory computed in the fixed computational budget used in the 100-dimensional example of Figure 7. While each piecewise-constant step in the HMC trajectory is obtained by a sequence of leap-frog steps, the need for an MH step in HMC means that these intermediate steps are not usable Monte Carlo samples. In contrast, the full trajectory obtained from BPS can be used in the calculation of Monte Carlo averages, as explained in Section 2.4.
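The point about using the full BPS trajectory can be made concrete: because the path is piecewise linear, time averages over it have a closed form (on each linear segment the time integral of the position equals the segment duration times the midpoint). A minimal sketch, with an illustrative event list:

```python
import numpy as np

def trajectory_mean(events):
    """Time average of the position along a piecewise-linear path.

    events: list of (time, position) pairs, with linear interpolation
    between consecutive events (as in a BPS trajectory).
    """
    total_t, total_int = 0.0, 0.0
    for (t0, x0), (t1, x1) in zip(events, events[1:]):
        dt = t1 - t0
        total_t += dt
        # exact integral of a linear segment: duration times midpoint
        total_int += dt * (np.asarray(x0) + np.asarray(x1)) / 2.0
    return total_int / total_t

# toy path: start at 0 with velocity +1, bounce at t=1 to velocity -1
mean_x = trajectory_mean([(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)])  # -> 0.5
```

No sub-sampling or discretization error is introduced, which is the sense in which the whole continuous-time path contributes to the Monte Carlo estimate.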
Figure: Relative error for d = 10, 100, 1000 with a fixed computational budget.
Bayesian logistic regression
Use the superposition trick presented previously. Compared with Firefly, the only scalable algorithm with the same convergence rate as traditional MCMC.
[Figure 10 plot: ESS per likelihood evaluation (log scale) vs. number of datapoints R; algorithms: BPS with constant refresh rate, tuned FireFly.]
Figure 10: ESS per-datum likelihood evaluation for Local BPS and Firefly.
We compare the local BPS with thinning to the MAP-tuned Firefly algorithm implementation provided by the authors. In [18], it is reported that this version of Firefly experimentally outperforms standard MH significantly in terms of ESS per-datum likelihood evaluation. Local BPS and Firefly are compared here in terms of this criterion, where the ESS is averaged over the d = 5 components of x. We generate covariates ι_{rk} i.i.d. ∼ U(0.1, 1.1) and data c_r ∈ {0, 1} for r = 1, . . . , R according to (8). We set a standard multivariate normal prior on x. For the algorithm, we set λref = 0.5, and set to 0.5 the length of the time interval over which a constant upper bound for the rate associated with the prior is used; see Algorithm 7. Experimentally, local BPS always outperforms Firefly, by about an order of magnitude for large data sets. However, we also observe that both Firefly and local BPS have an ESS per-datum likelihood evaluation decreasing approximately as 1/R, so that the gains brought by these algorithms over a correctly scaled random-walk MH algorithm do not appear to increase with R. The rate for local BPS is slightly superior in the regime of up to 10^4 data points, but then returns to the approximate 1/R rate. We expect that tighter bounds on the intensities could improve the computational efficiency.
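Both the superposition trick and the thinning used here reduce to simulating the first arrival of an inhomogeneous Poisson process under a known upper bound on its intensity. A minimal sketch of thinning with a constant bound (the linearly growing intensity is a made-up example, not the model's actual bouncing rate):

```python
import random

def first_arrival_by_thinning(rate, rate_bound, rng):
    """First arrival time of a Poisson process with intensity rate(t).

    Requires rate(t) <= rate_bound for all t. Candidates are proposed from
    a homogeneous process of rate rate_bound and each is accepted with
    probability rate(t) / rate_bound.
    """
    t = 0.0
    while True:
        t += rng.expovariate(rate_bound)    # next candidate arrival
        if rng.random() < rate(t) / rate_bound:
            return t                        # accepted: first true arrival

rng = random.Random(0)
# illustrative intensity rate(t) = min(t, 10), bounded above by 10
samples = [first_arrival_by_thinning(lambda t: min(t, 10.0), 10.0, rng)
           for _ in range(20000)]
mean_arrival = sum(samples) / len(samples)  # ≈ sqrt(pi/2) in expectation
```

The tighter the bound, the fewer rejected candidates per accepted bounce, which is why the text above expects tighter intensity bounds to improve computational efficiency.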
4.7 Bayesian inference of evolutionary parameters
We consider a model from phylogenetics. Given a fixed tree of species with DNA sequences at the leaves, we want to compute the posterior distribution of evolutionary parameters encoded into a rate matrix Q. More precisely, we consider an over-parameterized generalized time-reversible rate matrix [25] with d = 10: 4 unnormalized stationary parameters x_1, . . . , x_4, and 6 unconstrained substitution parameters x_{i,j}, which are indexed by sets of size 2, i.e. i, j ∈ {1, 2, 3, 4}, i ≠ j. Off-diagonal entries of Q are obtained via q_{i,j} = π_j exp(x_{i,j}), where

π_j = exp(x_j) / Σ_{k=1}^{4} exp(x_k).
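The parameterization above is straightforward to assemble in code. A minimal sketch (diagonal entries are set so rows sum to zero, the standard convention for rate matrices; the function and variable names are illustrative):

```python
import numpy as np

def build_gtr_q(x_stat, x_sub):
    """Assemble the over-parameterized GTR rate matrix.

    x_stat: the 4 unnormalized stationary parameters x_1..x_4.
    x_sub: dict mapping frozenset({i, j}) -> x_{i,j}, for i != j.
    Off-diagonals: q_ij = pi_j * exp(x_{i,j}); rows sum to zero.
    """
    pi = np.exp(x_stat) / np.exp(x_stat).sum()  # softmax stationary probs
    Q = np.zeros((4, 4))
    for i in range(4):
        for j in range(4):
            if i != j:
                Q[i, j] = pi[j] * np.exp(x_sub[frozenset((i, j))])
        Q[i, i] = -Q[i].sum()                   # rate-matrix convention
    return Q, pi

# toy parameters: all zeros give uniform pi and off-diagonals equal to 0.25
x_sub = {frozenset((i, j)): 0.0 for i in range(4) for j in range(i + 1, 4)}
Q, pi = build_gtr_q(np.zeros(4), x_sub)
```

Indexing the substitution parameters by unordered pairs makes q_{i,j} automatically satisfy detailed balance, π_i q_{i,j} = π_j q_{j,i}, which is what makes the matrix time-reversible.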
We assign independent standard Gaussian priors on the parameters x_i. We assume that a matrix of aligned nucleotides is provided, where rows are species and columns contain nucleotides believed to come from a shared ancestral nucleotide. Given x = (x_1, . . . , x_4, x_{1,2}, . . . , x_{3,4}), and hence Q,
Bayesian inference of evolutionary parameters
A model from phylogenetics, used to infer evolutionary parameters (details omitted).
[Figure 11 plots: ESS/s for the max, median and min statistics (across parameters); left panel BPS, right panel HMC.]
Figure 11: Maximum, median and minimum ESS/s for BPS (left) and HMC (right). The experiments are replicated 10 times with different random seeds.
the likelihood is a product of conditionally independent continuous-time Markov chains (CTMCs) over {A, C, G, T}, with "time" replaced by a branching process specified by the phylogenetic tree's topology and branch lengths. The parameter x is unidentifiable, and while this can be addressed by bounded or curved parameterizations, the over-parameterization provides an interesting challenge for sampling methods, which need to cope with the strong induced correlations.
We analyze a dataset of primate mitochondrial DNA [11], containing 898 sites and 12 species. We focus on sampling x and fix the tree to a reference tree [14]. We use the basic BPS algorithm with restricted refreshment and λref = 1, in conjunction with an auxiliary-variable method similar to the one described in [31], alternating between two moves: (1) sampling CTMC paths along the tree given x, using uniformization; (2) sampling x given the paths (in which case the derivation of the gradient is simple and efficient). The only difference compared to [31] is that we substitute the HMC kernel with the kernel induced by running BPS. We use this auxiliary-variable method because, conditioned on the paths, the energy is a convex function, and hence we can use the method described in Example 1 to compute the bouncing times.
We compare against a state-of-the-art HMC sampler [29] that uses Bayesian optimization to adapt the key parameters of HMC, the leap-frog stepsize ε and trajectory length L, while preserving convergence to the correct target distribution. This sampler was shown to be comparable or superior to other state-of-the-art HMC methods such as NUTS. It also has the advantage of having efficient implementations in several languages. We use the authors' Java implementation to compare to our Java implementation of the BPS. Both methods view the objective function as a black box (concretely, a Java interface supporting pointwise evaluation and gradient calculation). In all experiments, we initialize at the mode and use a burn-in of 100 iterations and no thinning. The HMC auto-tuner yielded ε = 0.39 and L = 100. For our method, we use the global sampler and the independent global refreshment scheme.
As a first step, we perform various checks to ensure that both the BPS and HMC chains are in close agreement given a sufficiently large number of iterations. We observe that after 20 million HMC iterations, the highest posterior density intervals from the HMC method are in close agreement with those obtained from BPS (result not shown), and that both methods pass the Geweke diagnostic [9].
To compare the effectiveness of the two samplers, we first look at the ESS per second of the model parameters. We show the maximum, median, and minimum over the 10 parameter components, for both BPS and HMC, in Figure 11. As observed in Figure 12, the autocorrelation function (ACF) for the BPS decays faster than that of HMC. HMC's slowly decaying ACF is due to the fact that the stepsize ε in HMC is selected to be very small by the auto-tuner.
To ensure that the problem does not come from faulty auto-tuning, we look at the ESS/s for the log-likelihood statistic when varying the stepsize ε. The results in Figure 13 (right) show that the value selected by the auto-tuner is indeed reasonable, close to the value 0.02 found by brute-force maximization. We repeat the experiments with ε = 0.02 and obtain the same conclusions. This shows that the problem is genuinely challenging for HMC.
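For reference, the ESS figures discussed above are derived from the chain's autocorrelations. A minimal sketch using a simple truncation heuristic (stop summing at the first non-positive empirical autocorrelation); the estimator actually used in the paper may be more refined:

```python
import numpy as np

def ess(chain, max_lag=1000):
    """Effective sample size: n / (1 + 2 * sum of autocorrelations).

    The empirical ACF is truncated at the first non-positive lag,
    a simple and common heuristic.
    """
    x = np.asarray(chain, dtype=float)
    n = len(x)
    x = x - x.mean()
    var = (x @ x) / n
    tau = 1.0                      # integrated autocorrelation time
    for k in range(1, min(max_lag, n)):
        rho = (x[:-k] @ x[k:]) / (n * var)
        if rho <= 0:
            break
        tau += 2.0 * rho
    return n / tau

rng = np.random.default_rng(0)
z = rng.standard_normal(20000)
ar = np.empty_like(z)              # AR(1) chain with strong autocorrelation
ar[0] = z[0]
for t in range(1, len(z)):
    ar[t] = 0.9 * ar[t - 1] + z[t]

e_iid, e_ar = ess(z), ess(ar)      # AR(1) yields far fewer effective samples
```

Dividing by wall-clock time gives the ESS/s criterion used in Figures 11 and 13; a faster-decaying ACF, as for BPS in Figure 12, translates directly into a larger ESS.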
Bayesian inference of evolutionary parameters
[Figure 12 plots: ACF of the logDensity statistic up to lag 50, for BPS and HMC.]
Figure 12: Estimate of the ACF of the log-likelihood statistic for BPS (left) and HMC (right). A similar behavior is observed for the ACF of the other statistics.
[Figure 13 plots: ESS/s vs. refresh rate (BPS, left) and vs. stepsize epsilon (HMC, right).]
Figure 13: Left: sensitivity of BPS's ESS/s on the log-likelihood statistic to the refresh rate. Right: sensitivity of HMC's ESS/s on the log-likelihood statistic to the stepsize ε. Each setting is replicated 10 times with different algorithmic random seeds.
Thanks for your attention!!!