arXiv:1603.02485v3 [stat.ME] 10 Nov 2016
Block-Wise Pseudo-Marginal Metropolis-Hastings
M.-N. Tran∗ R. Kohn † M. Quiroz ‡§ M. Villani‡
February 22, 2019
Abstract
The pseudo-marginal Metropolis-Hastings (PMMH) approach is increasingly used for Bayesian inference in statistical models where the likelihood is intractable but can be estimated unbiasedly, such as random effects models and state-space models, or for data subsampling in big-data settings. Dahlin et al. (2015) and Deligiannidis et al. (2015) show how the PMMH approach can be made much more efficient by correlating the underlying pseudo random numbers used to form the estimate of the likelihood at the current and proposed values of the unknown parameters. Their approach greatly speeds up the standard PMMH algorithm, as it requires a much smaller number of particles to form the optimal likelihood estimate. We present an alternative PMMH approach that divides the underlying random numbers into blocks so that the likelihood estimates at the proposed and current values of the parameters only differ by the random numbers in one block. Our approach has several advantages. First, it provides a direct way to control the correlation between the logarithms of the estimates of the likelihood at the current and proposed values of the parameters. Second, under some conditions, the mathematical properties of the method are simplified and made more transparent compared to the treatment in Deligiannidis et al. (2015). Third, blocking is shown to be a natural way to carry out PMMH in many problems, especially when the likelihood is estimated using randomised quasi numbers. We show that, if quasi numbers are used for estimating the likelihood, the optimal number of particles required in the block-wise PMMH is much smaller than that in Deligiannidis et al. (2015). We obtain theory and guidelines for selecting the optimal number of particles, and document large speed-ups in a wide range of applications.
Keywords. Intractable likelihood; Unbiasedness; Panel data; Data subsampling; Quasi Monte Carlo
∗Discipline of Business Analytics, University of Sydney
†School of Economics, UNSW School of Business
‡Department of Computer and Information Science, Linkoping University
§Research Division, Sveriges Riksbank
1 Introduction
In many statistical applications the likelihood is analytically or computationally intractable,
making it difficult to carry out Bayesian inference. One class of models where the
likelihood is often intractable is generalised linear mixed models (GLMM) for longitudinal
data, where random effects are used to account for the dependence between the observations
measured on the same individual (Fitzmaurice et al., 2011; Bartolucci et al., 2012). The
likelihood is intractable because it is an integral over the random effects, but it can be
easily estimated unbiasedly using importance sampling. A second example, which uses
a variant of the unbiasedness idea, is that of unbiasedly estimating the log likelihood by
subsampling, as in Quiroz et al. (2016). Subsampling is useful when the log likelihood
is a sum of terms, with each term in the log likelihood expensive to evaluate, or when
there is a very large number of such terms. Quiroz et al. (2016) estimate the log-likelihood
unbiasedly in this way and then bias correct the resulting likelihood estimator to use within
a PMMH algorithm. They show analytically that the resulting posterior distribution of the
parameters is a perturbation of the true target distribution, with the perturbation error
being very small. State space models are a third class of models where the likelihood is
often intractable but can be unbiasedly estimated using an importance sampling estimator
(Shephard and Pitt, 1997; Durbin and Koopman, 1997) or by a particle filter estimator
(Del Moral, 2004; Andrieu et al., 2010).
It is now well known in the literature that a direct way to overcome the problem of
working with an intractable likelihood is to estimate the likelihood unbiasedly and use this
estimate within a Markov chain Monte Carlo (MCMC) simulation on an expanded space
that includes the random numbers used to construct the likelihood estimator. This was
first considered by Lin et al. (2000) in the Physics literature and Beaumont (2003) in the
Statistics literature; it was formally studied in Andrieu and Roberts (2009), who called the
method pseudo-marginal Metropolis-Hastings and gave conditions for the chain to converge.
Andrieu et al. (2010) use MCMC for inference in state space models where the likelihood
is estimated unbiasedly by the particle filter. Flury and Shephard (2011) give an excellent
discussion with illustrative examples of PMMH. Pitt et al. (2012) and Doucet et al.
(2015) analyse the effect of estimating the likelihood and show that the variance of the
log-likelihood estimator should be around 1 to obtain an optimal tradeoff between the efficiency
of the Markov chain and the computational cost. See also Sherlock et al. (2015), who
consider random walk proposals for the parameters, and show that the optimal variance of
the log of the likelihood estimator can be somewhat higher in this case.
A key issue in estimating models by standard PMMH is that the variance of the log of
the estimated likelihood grows linearly with the number of observations T. Hence, to keep
the variance of the log of the estimated likelihood small and around 1 it is necessary for the
number of particles N, used in constructing the likelihood estimator, to increase in proportion
to T, which means that PMMH requires $O(T^2)$ operations at every MCMC iteration.
Deligiannidis et al. (2015) and Dahlin et al. (2015) propose correlating the pseudo random
numbers used in constructing the estimators of the likelihood at the current and proposed
values of the parameters; see also Stramer and Bognar (2011), Murray and Graham (2016)
and Lee and Holmes (2010). Deligiannidis et al. (2015) show that by inducing a high correlation
between these ensembles of pseudo random numbers it is only necessary to increase
the number of particles N in proportion to $T^{1/2}$, reducing the PMMH algorithm to $O(T^{3/2})$
per iteration. This is a breakthrough in the ability of PMMH to be competitive with more
traditional MCMC methods.
Our article proposes an alternative approach to Deligiannidis et al. (2015) and Dahlin et al.
(2015) called the block-wise PMMH that requires a much smaller number of particles for
optimal performance. The block-wise PMMH divides the set of underlying random numbers
into blocks and updates one block at a time, thus reducing the variation in the Metropolis-
Hastings acceptance probability. This helps the chain to mix well even if highly variable
estimates of the likelihood are used. As a result, only a small number of particles is needed
at every iteration. We obtain theory and guidelines for selecting an optimal number of
particles in the block-wise PMMH. The block-wise PMMH has a number of advantages
over the correlated PMMH method for models where the likelihood can be divided into independent
blocks. First, the correlation between the proposed and current values of the log
likelihood estimators is controlled directly rather than indirectly and nonlinearly through
the correlated ensembles of random numbers. Second, we show that blocking is a natural
way to carry out a dependent PMMH in many problems such as subsampling, panel data,
diffusion processes and Approximate Bayesian Computation. Third, under some conditions,
the theoretical justification and optimality properties of the block-wise PMMH are
more transparent and easier to prove than the derivations in Deligiannidis et al. (2015). In
particular, similarly to Deligiannidis et al. (2015), we show that the optimal number of particles
required at each iteration of the block-wise PMMH is $O(T^{3/2})$. Fourth, the block-wise
PMMH takes less CPU time than the standard and correlated PMMH as we do not need
to generate the full set of random numbers in each iteration. Fifth, the block-wise PMMH
offers a natural way to use randomised quasi numbers for estimating the likelihood. Numerical
integration using randomised quasi Monte Carlo numbers achieves, in many cases,
a better convergence rate than using Monte Carlo pseudo random numbers. A thorough
treatment of quasi Monte Carlo can be found in the monographs by Niederreiter (1992)
and Dick and Pillichshammer (2010). Using quasi numbers has proven successful in the
intractable likelihood literature recently; see, e.g., Gerber and Chopin (2015); Tran et al.
(2015). We show that, if quasi numbers are used for estimating the likelihood, the optimal
number of particles required at each iteration of the PMMH is approximately $O(T^{7/6})$,
compared to $O(T^{3/2})$ in the correlated PMMH (Deligiannidis et al., 2015). Correlating
quasi numbers for use in the correlated PMMH is not trivial, as this might destroy
the desired uniformity property of quasi numbers; this has recently been studied in
Gunawan et al. (2016).
We obtain a large-sample property showing that, under some conditions, slowly updating
the underlying random numbers in likelihood estimation does not have any negative
effect on the mixing of the chain on the parameters of interest. We apply the proposed
method to a wide range of applications and document large speed-ups compared to the
other approaches.
2 Block-wise pseudo-marginal Metropolis-Hastings algorithm
2.1 The PMMH algorithm
Let y be a set of observations with density p(y|θ), where θ ∈ Θ is the vector of unknown
parameters. We are interested in sampling from the posterior π(θ) ∝ p(θ)p(y|θ) in models
where the likelihood p(y|θ) is analytically or computationally intractable. We assume that
p(y|θ) can be unbiasedly estimated by pN(y|θ) = pN(y|θ, u), with u the set of pseudo
random numbers or randomised quasi numbers used to compute pN(y|θ); N is the number
of importance samples or number of particles used and the dimension of u is proportional
to N. We will however call N the number of particles throughout. Denote the density function
of u by pN (u) and define a joint density of θ and u as
πN (θ, u) := p(θ)pN (y|θ, u)pN(u)/p(y). (1)
Then, πN(θ,u) admits π(θ) as its marginal density because ∫ pN(y|θ,u) pN(u) du = p(y|θ) by
unbiasedness. Therefore, we obtain samples from the posterior π(θ) by sampling from
πN(θ,u).
Let q(θ|θ′) be a proposal density for θ, conditional on θ′, with θ′ the current state. Let
u′ be the corresponding current set of random numbers used to compute pN(y|θ′,u′). The
standard PMMH algorithm generates samples from π(θ) by generating a Markov chain with
invariant density based on πN (θ,u) using the Metropolis-Hastings algorithm with proposal
density q(θ,u|θ′,u′)=q(θ|θ′)pN(u). The proposal (θ,u) is accepted with probability
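\[
\alpha(\theta',u';\theta,u) := \min\left(1, \frac{\pi_N(\theta,u)}{\pi_N(\theta',u')}\,\frac{q(\theta',u'|\theta,u)}{q(\theta,u|\theta',u')}\right) = \min\left(1, \frac{p(\theta)\,p_N(y|\theta,u)}{p(\theta')\,p_N(y|\theta',u')}\,\frac{q(\theta'|\theta)}{q(\theta|\theta')}\right), \tag{2}
\]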
which is computable. In the standard PMMH scheme, a new independent set of pseudo-random
numbers u is generated each time the likelihood estimate is computed, so it is
usually unnecessary to store u and u′.
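The standard PMMH step above can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the toy random-intercept model, the N(0, 10²) prior, the symmetric random-walk proposal (so the q-ratio in (2) cancels), and the names `log_prior`, `log_lik_hat` and `standard_pmmh`.

```python
import math
import random


def log_prior(theta):
    """Log density of a N(0, 10^2) prior, up to a constant (an assumed prior)."""
    return -0.5 * (theta / 10.0) ** 2


def log_lik_hat(theta, y, n_particles, rng):
    """Log of an unbiased importance-sampling estimate of p(y | theta).

    Assumed toy model: y_i = theta + alpha_i + e_i with alpha_i, e_i ~ N(0, 1).
    The random effect alpha_i is integrated out by sampling it from its prior,
    so fresh random numbers u are drawn on every call.
    """
    total = 0.0
    for y_i in y:
        weights = [
            math.exp(-0.5 * (y_i - theta - rng.gauss(0.0, 1.0)) ** 2)
            / math.sqrt(2.0 * math.pi)
            for _ in range(n_particles)
        ]
        total += math.log(sum(weights) / n_particles)
    return total


def standard_pmmh(y, n_iter, n_particles, step=0.5, seed=0):
    """Random-walk PMMH: theta and the whole set u are proposed together."""
    rng = random.Random(seed)
    theta = 0.0
    ll = log_lik_hat(theta, y, n_particles, rng)
    draws, accepted = [], 0
    for _ in range(n_iter):
        theta_prop = theta + step * rng.gauss(0.0, 1.0)  # symmetric proposal
        ll_prop = log_lik_hat(theta_prop, y, n_particles, rng)  # new u
        log_alpha = (log_prior(theta_prop) + ll_prop) - (log_prior(theta) + ll)
        if rng.random() < math.exp(min(0.0, log_alpha)):
            # the estimate at the accepted state is stored and reused, not recomputed
            theta, ll = theta_prop, ll_prop
            accepted += 1
        draws.append(theta)
    return draws, accepted / n_iter
```

Only the stored log-likelihood estimate of the current state is carried forward, which is why u itself need not be kept in the standard scheme.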
For the standard PMMH algorithm, Pitt et al. (2012) and Doucet et al. (2015) show
that the variance of log pN (y|θ,u) should be around 1 in order to obtain an optimal tradeoff
between the computational cost and efficiency of the Markov chain in θ and u. However,
in some problems it may be prohibitively expensive to take an N large enough to ensure
that V(log pN(y|θ,u))≈1.
2.2 The block-wise PMMH algorithm
In our block-wise PMMH algorithm, instead of generating a completely new set u when
estimating the likelihood as previously done in the literature, we update u in blocks. Let
us divide the set of variables u into G sets u(1),...,u(G). The extended target (1) can be
re-written as
πN (θ, u(1), ..., u(G)) = p(θ)pN(y|θ, u(1), ..., u(G))pN(u(1), ..., u(G))/p(y). (3)
Instead of updating the full set of (θ,u) as in the standard PMMH, we propose to update
θ and a block u(K) at a time. The block index K is randomly selected from 1,...,G with
P(K=k)>0 for every k=1,...,G. Typically, P(K=k)=1/G. This is similar to component-wise
MCMC, whose convergence is well established in the literature; see, e.g., Johnson et al.
(2013). The acceptance probability (2) becomes
\[
\min\left(1, \frac{p(\theta)\,p_N(y|\theta,u'_{(1)},\dots,u_{(k)},\dots,u'_{(G)})}{p(\theta')\,p_N(y|\theta',u'_{(1)},\dots,u'_{(k)},\dots,u'_{(G)})}\,\frac{q(\theta'|\theta)}{q(\theta|\theta')}\right). \tag{4}
\]
Intuitively, by fixing all u(j)’s except u′(k), the variation in the ratio of the likelihood estimates
is reduced, which helps the chain to mix well.
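A minimal sketch of the block-wise update follows. As before, the toy random-intercept model, the prior, the random-walk proposal and all function names are assumptions for illustration. The point it shows is that the random numbers are stored per block, every block is re-evaluated at the proposed θ, and only the selected block k receives fresh random numbers.

```python
import math
import random


def log_prior(theta):
    """Log density of a N(0, 10^2) prior, up to a constant (an assumed prior)."""
    return -0.5 * (theta / 10.0) ** 2


def block_loglik(theta, y_block, u_block):
    """Log of the unbiased estimate for one block given its stored normals.

    u_block[i][n] is the n-th standard normal used for observation i of the block
    (assumed toy model: y_i = theta + alpha_i + e_i, alpha_i, e_i ~ N(0, 1)).
    """
    total = 0.0
    for y_i, u_i in zip(y_block, u_block):
        w = [math.exp(-0.5 * (y_i - theta - a) ** 2) / math.sqrt(2.0 * math.pi) for a in u_i]
        total += math.log(sum(w) / len(w))
    return total


def draw_block(n_obs, n_particles, rng):
    """Fresh random numbers for one block."""
    return [[rng.gauss(0.0, 1.0) for _ in range(n_particles)] for _ in range(n_obs)]


def blockwise_pmmh(y_blocks, n_iter, n_particles, step=0.5, seed=0):
    rng = random.Random(seed)
    n_blocks = len(y_blocks)
    theta = 0.0
    u = [draw_block(len(yb), n_particles, rng) for yb in y_blocks]
    ll = sum(block_loglik(theta, yb, ub) for yb, ub in zip(y_blocks, u))
    draws, accepted = [], 0
    for _ in range(n_iter):
        k = rng.randrange(n_blocks)  # P(K = k) = 1/G
        theta_prop = theta + step * rng.gauss(0.0, 1.0)
        u_k_prop = draw_block(len(y_blocks[k]), n_particles, rng)
        # every block is evaluated at theta_prop; only block k uses new numbers
        ll_prop = sum(
            block_loglik(theta_prop, yb, u_k_prop if j == k else u[j])
            for j, yb in enumerate(y_blocks)
        )
        log_alpha = (log_prior(theta_prop) + ll_prop) - (log_prior(theta) + ll)
        if rng.random() < math.exp(min(0.0, log_alpha)):
            theta, ll = theta_prop, ll_prop
            u[k] = u_k_prop
            accepted += 1
        draws.append(theta)
    return draws, accepted / n_iter
```

Unlike the standard scheme, u must be stored here, because the blocks that are not refreshed are reused in the next likelihood estimate.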
Instead of updating u in blocks, Deligiannidis et al. (2015) and Dahlin et al. (2015)
move u slowly by making the proposed u and the current u′ correlated. Suppose that
the underlying pseudo random numbers u are standard univariate normal variables and
ρ>0 is a number close to 1. They propose $u=\rho u'+\sqrt{1-\rho^2}\,\epsilon$, with $\epsilon$ a set of standard normal
variables of the same size as u′. We note that it is not trivial to extend this framework to the
case where u are randomised quasi numbers, as u is then a set of uniform rather than standard
normal random variables, and correlating u in this way will destroy the desired uniformity
property of quasi numbers.
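The correlated proposal for normal random numbers can be sketched in a few lines. The function names are hypothetical; the sketch only illustrates that the update leaves the N(0,1) marginal unchanged while giving correlation ρ between current and proposed numbers.

```python
import math
import random


def correlated_update(u_cur, rho, rng):
    """Propose u = rho * u' + sqrt(1 - rho^2) * eps with eps ~ N(0, 1)."""
    scale = math.sqrt(1.0 - rho ** 2)
    return [rho * u + scale * rng.gauss(0.0, 1.0) for u in u_cur]


def sample_corr(a, b):
    """Sample correlation between two equally long lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / n
    va = sum((x - ma) ** 2 for x in a) / n
    vb = sum((y - mb) ** 2 for y in b) / n
    return cov / math.sqrt(va * vb)


def sample_var(a):
    m = sum(a) / len(a)
    return sum((x - m) ** 2 for x in a) / len(a)
```

Applying the same linear map to uniform numbers would not return uniform numbers, which is the difficulty with quasi numbers noted above.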
3 Analysis of the block-wise PMMH
We now suppose that the likelihood can be written as a product of G independent terms,
\[
p(y|\theta)=\prod_{k=1}^{G} p(y_{(k)}|\theta). \tag{5}
\]
Example: panel data models. Consider a panel-data model with T panels, which we
divide into G groups y(1),...,y(G) with approximately T/G panels in each.
Example: big data. Consider a big data set with T independent observations, which we
divide into G groups y(1),...,y(G) with T/G observations in each.
Example: time series data. Consider time series data $y=\{y_1,\dots,y_T\}$, which we divide
into G groups $y_{(1)}=y_{1:t_1}$, $y_{(2)}=y_{t_1+1:t_2}$, ... Then p(y|θ) can be written in the form of (5) with
\[
p(y_{(k)}|\theta)=p(y_{t_{k-1}+1}|y_{t_{k-1}},\theta)\cdots p(y_{t_k}|y_{t_k-1},\theta),\quad k=1,\dots,G.
\]
Here, $t_0=0$, $t_G=T$ and $p(y_1|y_0,\theta)=p(y_1|\theta)$.
We assume that the kth likelihood term p(y(k)|θ) is estimated unbiasedly by $p_{N_{(k)}}(y_{(k)}|\theta,u_{(k)})$,
where the u(k) are independent with $u_{(k)}\sim p_{N_{(k)}}(\cdot)$, and N(k) is the number of particles or
importance samples or the size of u(k). Let $N:=N_{(1)}+\cdots+N_{(G)}$. An unbiased estimator of
the likelihood is
\[
p_N(y|\theta):=\prod_{k=1}^{G} p_{N_{(k)}}(y_{(k)}|\theta,u_{(k)}).
\]
Our analysis of the block-wise PMMH is built on the framework of Pitt et al. (2012). We
follow Pitt et al. (2012) and define $z_{(k)}=z_{(k)}(\theta,u_{(k)}):=\log p_{N_{(k)}}(y_{(k)}|\theta,u_{(k)})-\log p(y_{(k)}|\theta)$ as
the error in the log likelihood estimate of the kth block. Then, $z=z(\theta,u):=\log p_N(y|\theta,u)-\log p(y|\theta)=\sum_{k=1}^{G}z_{(k)}$
is the error in the log-likelihood estimate.
Suppose that u(1),...,u(G) are independent and generated from pN(k)(u(k)) and suppose
that gN(k)(z(k);θ) is the distribution of the corresponding z(k). Then, the distribution of
θ,z(1),...,z(G) is
\[
\pi_N(\theta,z_{(1)},\dots,z_{(G)})=\pi(\theta)\prod_{k=1}^{G}\exp(z_{(k)})\,g_{N_{(k)}}(z_{(k)}|\theta), \tag{6}
\]
which means that, at stationarity, $\pi_{N_{(k)}}(z_{(k)}|\theta)=\exp(z_{(k)})\,g_{N_{(k)}}(z_{(k)}|\theta)$ and the z(k) are independent
conditional on θ, i.e., in the posterior πN(·|θ).
We note that it is straightforward to show that the acceptance probability (2) can be
rewritten as
\[
\min\left\{1,\exp(z-z')\,\frac{\pi(\theta)}{\pi(\theta')}\,\frac{q(\theta'|\theta)}{q(\theta|\theta')}\right\}, \tag{7}
\]
where z′ = z(θ′,u′). This is also the acceptance probability if we consider a Metropolis-
Hastings scheme that samples from
πN(θ,z) = π(θ) exp(z) gN(z|θ), (8)
with gN(z|θ) the density of z. Thus, as noted in Pitt et al. (2012), the properties of the
Metropolis-Hastings scheme based on πN (θ,u) are the same as those based on πN(θ,z). As
we show below, it will be easier to work with πN (θ,z) than with πN(θ,u) because z is a
scalar.
Instead of updating θ and u as in the standard PMMH, the block-wise PMMH updates
θ and a single block, u(k). The terms z and z′ in the acceptance probability (7) are
\[
z=\sum_{j=1,\,j\neq k}^{G} z_{(j)}(\theta,u_{(j)}=u'_{(j)})+z_{(k)}(\theta,u_{(k)}),\qquad z'=\sum_{j=1}^{G} z_{(j)}(\theta',u'_{(j)}).
\]
We use the following notation: $w\sim N(a,b^2)$ means that w has a normal distribution with mean a and variance
$b^2$, and we denote the density of w by $N(w;a,b^2)$.
We make the following assumptions.
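Assumption 1. Suppose that u(1),...,u(G) are independent and generated from pN(k)(u(k))
and,
(i) For each group k, there is a $\gamma^2_{(k)}(\theta)>0$ and an $N_k$ such that
\[
z_{(k)}(\theta,u_{(k)})\sim N\left(-\frac{\gamma^2_{(k)}(\theta)}{2N_k^{2\varpi}},\ \frac{\gamma^2_{(k)}(\theta)}{N_k^{2\varpi}}\right),
\]
for some $\varpi>0$.
(ii) For a given $\sigma^2>0$, let $N_k$ be a function of θ, σ² and G such that $V(z_{(k)}(\theta,u_{(k)}))=\sigma^2/G$, i.e. $N_k=N_k(\theta,\sigma^2,G)=[G\gamma^2_{(k)}(\theta)/\sigma^2]^{1/(2\varpi)}$. This means that $z_{(k)}(\theta,u_{(k)})\sim N(-\sigma^2/(2G),\sigma^2/G)$ for each k.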
As will be clear in Lemma 5, $\varpi=1/2$ if the likelihood is estimated using pseudo random
numbers, and $\varpi=3/2-\epsilon$ for any arbitrarily small $\epsilon>0$ if the likelihood is estimated using
quasi numbers. We note that N(k) is the total number of particles used for the kth group,
and will usually be different from Nk. In panel data models, N(k)=(T/G)Nk. Assumption 1
is a stylized set of assumptions that will hold approximately when the Nk are sufficiently
large. Part (i) is similar to the assumption made in Pitt et al. (2012) and ensures that
if z(k) is normally distributed then the expected value E(exp(z(k))) = 1 for each k; this is
consistent with the estimator pN(k)(y(k)|θ) being unbiased. For panel data models, part (i) is
justified by Lemma 5 in this article. Assumption 1(ii) can be enforced for most panel data
models and subsampling approaches because it is straightforward to estimate the variance
of z(k) accurately for each k and θ.
We can now obtain the following lemma whose proof is straightforward and omitted.
Lemma 1. Suppose that Assumption 1 holds and define z′ and z as above. Let ρ=1−1/G.
Then,
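(i) $z'\sim N\left(\frac{\sigma^2}{2},\sigma^2\right)$ and $z\sim N\left(\frac{\sigma^2(2\rho-1)}{2},\sigma^2\right)$.
(ii) Corr(z,z′)=ρ.
(iii) $z|z'\sim N\left(-\frac{\sigma^2}{2}(1-\rho)+\rho z',\ \sigma^2(1-\rho^2)\right)$.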
We follow Pitt et al. (2012) and assume that the proposal for θ is perfect. That is,
Assumption 2. q(θ|θ′)=π(θ).
This assumption helps identify the effect of estimating the likelihood, as we assume
a perfect proposal, and helps to simplify the derivation of the guidelines for the optimal
number of particles. Under Assumption 2, the acceptance probability (7) of the Metropolis-
Hastings scheme becomes
\[
\alpha(z',z;\rho,\sigma)=\min(1,e^{z-z'}). \tag{9}
\]
The next lemma gives the conditional and unconditional acceptance probabilities of the
Metropolis-Hastings scheme for z and θ and is proved in Appendix A.
Lemma 2. Suppose Assumptions 1 and 2 hold and ρ=1−1/G.
(i) The acceptance probability of the Metropolis-Hastings scheme conditional on z′ =
z(θ′,u′) is
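\[
P(\text{accept}|z',\rho,\sigma)=\exp(-x+\tau^2/2)\,\Phi\left(\frac{x}{\tau}-\tau\right)+\Phi\left(-\frac{x}{\tau}\right)
\]
with $x:=\left(z'+\frac{\sigma^2}{2}\right)(1-\rho)$ and $\tau=\sigma\sqrt{1-\rho^2}$.
(ii) The unconditional acceptance probability of the Metropolis-Hastings scheme is
\[
P(\text{accept}|\rho,\sigma)=2\left(1-\Phi\left(\frac{\sigma\sqrt{1-\rho}}{\sqrt{2}}\right)\right). \tag{10}
\]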
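The two acceptance probabilities in Lemma 2 are easy to evaluate numerically. The sketch below is a direct transcription of the formulas, with Φ taken to be the standard normal distribution function; the function names are hypothetical.

```python
import math


def norm_cdf(x):
    """Standard normal distribution function, written with erfc for tail accuracy."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def accept_prob_conditional(z_cur, rho, sigma):
    """Lemma 2(i): acceptance probability given the current error z' = z_cur."""
    x = (z_cur + 0.5 * sigma ** 2) * (1.0 - rho)
    tau = sigma * math.sqrt(1.0 - rho ** 2)
    return math.exp(-x + 0.5 * tau ** 2) * norm_cdf(x / tau - tau) + norm_cdf(-x / tau)


def accept_prob_unconditional(rho, sigma):
    """Lemma 2(ii), equation (10)."""
    return 2.0 * (1.0 - norm_cdf(sigma * math.sqrt(1.0 - rho) / math.sqrt(2.0)))
```

For example, `accept_prob_unconditional(0.99, math.sqrt(234))` evaluates to roughly 0.279, the acceptance rate reported for the toy example with G=100 and σ²=234.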
Suppose that we are interested in estimating $\pi(\phi)=\int\phi(\theta)\pi(\theta)d\theta$ for some scalar function
ϕ(θ) of θ. Let {θ[j],z[j],j=1,...,M} be the draws obtained from the PMMH sampler
after it has converged, and let the estimator of π(ϕ) be $\hat\pi(\phi):=\frac{1}{M}\sum_{j=1}^{M}\phi(\theta^{[j]})$. Then, we define
the inefficiency of the estimator $\hat\pi(\phi)$ relative to an estimator based on an i.i.d. sample
from π(θ) (as in Assumption 2) as
\[
\mathrm{IF}(\phi,\sigma,\rho):=\lim_{M\to\infty} M\,V_{\mathrm{PMMH}}(\hat\pi(\phi))/V(\phi|y), \tag{11}
\]
where $V_{\mathrm{PMMH}}(\hat\pi(\phi))$ is the posterior variance of the estimator $\hat\pi(\phi)$ and $V(\phi|y):=E_\pi(\phi(\theta)^2)-[E_\pi(\phi(\theta))]^2$
is the posterior variance of ϕ, so that V(ϕ|y)/M is the variance of the ideal estimator
when $\theta^{[j]}\overset{iid}{\sim}\pi(\theta)$. We obtain the following result, which shows that under our
assumptions the inefficiency IF(ϕ,σ,ρ) is independent of ϕ and is a function only of σ and
ρ=1−1/G. The proof is in Appendix A. We call IF(σ,ρ) the inefficiency of the PMMH algorithm,
for given ρ and σ. From (8) and Lemma 1, we note that the posterior density of z
conditional on θ is $\pi_N(z|\theta)=e^{z}g_N(z|\theta)=N(z;\sigma^2/2,\sigma^2)$, which does not depend on θ and is
denoted π(z|σ).
Lemma 3. The inefficiency is given by
\[
\mathrm{IF}(\sigma,\rho)=1+2E_{z'\sim\pi(z'|\sigma)}\left(\frac{1-k(z'|\sigma,\rho)}{k(z'|\sigma,\rho)}\right), \tag{12}
\]
where k(z′|σ,ρ)=Pr(accept|z′,ρ,σ) is the acceptance probability of the MCMC scheme conditional
on the previous iterate z′ and is given by part (i) of Lemma 2.
Similarly to Pitt et al. (2012), we obtain in Appendix B the computing time of the
sampler as
\[
\mathrm{CT}(\sigma,\rho):=\frac{\mathrm{IF}(\sigma,\rho)}{\sigma^{1/\varpi}}, \tag{13}
\]
which takes into account the total number of particles needed to obtain a given precision
and the mixing rate of the PMMH chain.
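The inefficiency (12) and computing time (13) can be evaluated by one-dimensional quadrature, since z′ is N(σ²/2, σ²) under π(z′|σ). The sketch below uses a simple grid rule on the standardised variable; the grid size, the truncation at ±8 standard deviations, the function names and the argument name `varpi` (standing for the exponent in Assumption 1) are assumptions of this illustration.

```python
import math


def norm_cdf(x):
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def accept_prob_conditional(z_cur, rho, sigma):
    """k(z' | sigma, rho) from part (i) of Lemma 2."""
    x = (z_cur + 0.5 * sigma ** 2) * (1.0 - rho)
    tau = sigma * math.sqrt(1.0 - rho ** 2)
    return math.exp(-x + 0.5 * tau ** 2) * norm_cdf(x / tau - tau) + norm_cdf(-x / tau)


def inefficiency(sigma, rho, n_grid=4001, t_max=8.0):
    """IF(sigma, rho) = 1 + 2 E[(1 - k) / k], with z' = sigma^2/2 + sigma * t, t ~ N(0, 1)."""
    h = 2.0 * t_max / (n_grid - 1)
    total = 0.0
    for i in range(n_grid):
        t = -t_max + i * h
        k = accept_prob_conditional(0.5 * sigma ** 2 + sigma * t, rho, sigma)
        weight = math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi) * h
        total += weight * (1.0 - k) / k
    return 1.0 + 2.0 * total


def computing_time(sigma, rho, varpi=0.5):
    """CT(sigma, rho) = IF(sigma, rho) / sigma^(1 / varpi)."""
    return inefficiency(sigma, rho) / sigma ** (1.0 / varpi)
```

Evaluating `computing_time` over a grid of σ for fixed ρ gives a curve whose minimiser can be compared with the optimal σ stated in Lemma 4.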
To simplify the notation in this section we often do not show dependence on ρ as it is
assumed constant. In Section 5 we show that if we take $G=O(T^{1/2})$, then $\rho=1-O(T^{-1/2})$
and $N_k=O(T^{1/(4\varpi)})$ are optimal.
The next lemma shows the optimal σ under our assumptions as well as the corresponding
acceptance rate. Its proof is in Appendix A.
Lemma 4. Given that ρ=1−1/G is fixed and close to 1, the optimal σopt that minimizes
CT(σ,ρ) is $\sigma_{opt}\approx 2.16/\sqrt{1-\rho^2}$ if $\varpi=1/2$, and $\sigma_{opt}\approx 0.82/\sqrt{1-\rho^2}$ if $\varpi=3/2-\epsilon$. The
unconditional acceptance rate (10) under this optimal choice of the tuning parameters is
approximately 0.28 and 0.68, respectively.
Lemma 4 provides a theory for selecting the optimal σopt given that G is fixed. Because
the optimal variance σ²opt is much bigger than 1, the optimal variance in the standard
PMMH, Lemma 4 shows that the block-wise PMMH is much more robust than the standard
PMMH in the sense that the former can tolerate highly noisy estimates of the likelihood.
Let M be the length of the Markov chain to be generated. The average number of times
that a block u(k) is updated is M/G. In general, G should be selected such that M/G is
not too small so that the space of z is well explored. In the examples in this paper, if not
otherwise stated, we set G=100, as we found that the efficiency is relatively insensitive to
larger values of G.
A toy example. Suppose that we wish to sample from $\pi(\theta,z)=\pi(\theta)e^{z}g(z|\sigma)$ in which θ is
the parameter of interest, with π(θ)=N(0,1) and $g(z|\sigma)=N(-\sigma^2/2,\sigma^2)$. Suppose further
that z is divided into blocks $z=\sum_{k=1}^{G}z_{(k)}$ with $z_{(k)}\overset{iid}{\sim}N(-\sigma_G^2/2,\sigma_G^2)$, $\sigma_G^2=\sigma^2/G$ and G=100.
First, we consider two sampling schemes, the standard PMMH and the block-wise
PMMH, to sample from π(θ,z) with $\sigma_G^2=2.34$, i.e. $\sigma^2=234$. Suppose that (θ′,z′) is
the current state. The proposal (θ,z) in the standard PMMH is generated by θ ∼ π(θ)
and z ∼ g(z|σ). The proposal (θ,z) in the block-wise PMMH scheme is generated as
follows. Let $z'=\sum_{k=1}^{G}z'_{(k)}$ be the current z-state and let k be an index generated uniformly
from the set {1,...,G}. Sample $z_{(k)}\sim N(-\sigma_G^2/2,\sigma_G^2)$ and let $z=\sum_{j\neq k}z'_{(j)}+z_{(k)}$ be
the proposal. Both schemes accept (θ,z) with probability $\min(1,e^{z-z'})$. In this setting,
the block-wise PMMH scheme is exactly the correlated PMMH algorithm of Dahlin et al.
(2015) and Deligiannidis et al. (2015) where the proposal density q(z|z′) is normal with
mean $-(\sigma^2/2)(1-\rho)+\rho z'$ and variance $\sigma^2(1-\rho^2)$, ρ=1−1/G.
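The toy example is simple enough to simulate directly. The sketch below is an illustration, not the authors' code: the function name `toy_pmmh` and the choice to start the block errors from their stationary law N(σ²_G/2, σ²_G) (rather than at θ=3 as in Figure 1) are assumptions made so that acceptance rates can be read off without a burn-in period.

```python
import math
import random


def toy_pmmh(sigma2, n_blocks, n_iter, blockwise=True, seed=0):
    """Sample from pi(theta, z) = N(theta; 0, 1) * exp(z) * g(z | sigma)."""
    rng = random.Random(seed)
    s2_g = sigma2 / n_blocks
    sd_g = math.sqrt(s2_g)
    # stationary start: under the target each block error is N(+s2_g / 2, s2_g)
    z_blocks = [rng.gauss(0.5 * s2_g, sd_g) for _ in range(n_blocks)]
    theta = rng.gauss(0.0, 1.0)
    thetas, accepted = [], 0
    for _ in range(n_iter):
        theta_prop = rng.gauss(0.0, 1.0)  # perfect proposal (Assumption 2)
        # block-wise: refresh one block; standard: refresh every block
        idx = [rng.randrange(n_blocks)] if blockwise else list(range(n_blocks))
        new = {k: rng.gauss(-0.5 * s2_g, sd_g) for k in idx}
        log_ratio = sum(new[k] - z_blocks[k] for k in idx)  # z - z'
        if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
            for k in idx:
                z_blocks[k] = new[k]
            theta = theta_prop
            accepted += 1
        thetas.append(theta)
    return thetas, accepted / n_iter
```

With σ²=234 and G=100, the block-wise scheme should accept at a rate close to the value given by (10), whereas the standard scheme essentially never moves.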
Figure 1 plots the θ-samples generated by the standard PMMH scheme (left panel) and
by the block-wise PMMH scheme (right panel). As expected, the standard PMMH chain
is sticky because of the big variance of z, σ2=234.
Figure 1: The samples of θ generated by the standard PMMH scheme (left) and the block-wise PMMH scheme (right). Both chains are initialised at 3 and run for 500,000 iterations.
Now, in order to study the effect of σ2 on the acceptance rate and computing time CT(σ)
of the blockwise sampler, Figure 2 shows the CT(σ) and acceptance rates for various values
of σ2. The figure shows that CT(σ) has a minimum value of 0.0263 at σ2=234, where the
acceptance rate is 0.279, which agrees with the theory. Among all standard PMMH schemes
with different σ2, Pitt et al. (2012) show that the optimal scheme is the one with σ2≈1.
We also run this optimal standard PMMH scheme and obtain a computing time value of
CT(σ=1)=5.32. Hence, the optimal block-wise PMMH is 5.32/0.0263 ≈ 202
times more efficient than the optimal standard PMMH.
3.1 Guidelines for selecting the optimal number of particles for panel data
We now consider the case with the likelihood factorised as in (5). From Lemma 4, for pseudo
numbers where $\varpi=1/2$, the optimal variance of the log-likelihood estimator based on each
group is $\sigma^2_{opt}/G\approx 2.16^2/(1+\rho)$, which is approximately 2.34 when G is large (for example, G≈100).
For quasi numbers, the optimal variance of the log-likelihood estimator based
on each group is $\sigma^2_{opt}/G\approx 0.82^2/(1+\rho)\approx 0.34$, given that G is large. Hence, for each group
k, we propose tuning the number of particles N(k)=N(k)(θ) such that V(z(k)|θ,N(k)) is
approximately 2.34 and 0.34 if the likelihood is estimated using pseudo numbers and quasi
numbers, respectively. In many cases, it is more convenient to tune N(k)=N(k)(θ) at some
central value of θ and then fix N(k) across all MCMC iterations.
Figure 2: Blockwise PMMH: The left panel shows the computing time CT(σ) = IF(σ)/σ² and the right panel shows the acceptance rate, each plotted against the variance σ². The dashed lines indicate the values at the optimal variance σ²opt = 234.
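Under the variance scaling of Assumption 1(i), the tuning rule above reduces to solving γ²/N^{2ϖ} = target for N. The sketch below assumes that scaling holds exactly and that a pilot estimate of the variance is available at some pilot number of particles; the function names and the argument name `varpi` are hypothetical.

```python
import math


def gamma2_from_pilot(var_pilot, n_pilot, varpi):
    """Back out gamma^2 from a pilot variance, assuming V(z) = gamma^2 / N^(2 * varpi)."""
    return var_pilot * n_pilot ** (2.0 * varpi)


def particles_for_target(gamma2, target_var, varpi):
    """Smallest integer N with gamma2 / N^(2 * varpi) <= target_var."""
    return max(1, math.ceil((gamma2 / target_var) ** (1.0 / (2.0 * varpi))))


# Targets suggested in the text: 2.34 for pseudo numbers (varpi = 1/2) and
# 0.34 for quasi numbers (varpi close to 3/2; 1.5 is used here as an approximation).
n_pseudo = particles_for_target(gamma2=100.0, target_var=2.34, varpi=0.5)
n_quasi = particles_for_target(gamma2=100.0, target_var=0.34, varpi=1.5)
```

The value γ²=100 is an arbitrary illustrative number; in practice it would come from a pilot run at a central value of θ.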
4 Applications
4.1 Panel data example
A clinical trial was conducted to test the effectiveness of beta-carotene in preventing non-melanoma
skin cancer (Greenberg et al., 1989). Patients were randomly assigned to a
control or treatment group and biopsied once a year to ascertain the number of new skin
cancers since the last examination. The response yij is a count of the number of new skin
cancers in year j for the ith subject. Covariates include age, skin (1 if skin has burns and
0 otherwise), gender, exposure (a count of the number of previous skin cancers), year of
follow-up and treatment (1 if the subject is in the treatment group and 0 otherwise). There
are T =1683 subjects with complete covariate information.
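Following Donohue et al. (2011) we consider the mixed Poisson model with a random
intercept
\[
p(y_{ij}|\beta,\alpha_i)=\mathrm{Poisson}(\exp(\eta_{ij})),
\]
\[
\eta_{ij}=\beta_0+\beta_1\mathrm{Age}_i+\beta_2\mathrm{Skin}_i+\beta_3\mathrm{Gender}_i+\beta_4\mathrm{Exposure}_{ij}+\alpha_i,
\]
where $\alpha_i\sim N(0,\varrho^2)$, i=1,...,T=1683, j=1,...,ni=5. The likelihood is
\[
p(y|\theta)=\prod_{i=1}^{T}p(y_i|\theta),\qquad p(y_i|\theta)=\int\left(\prod_{j=1}^{n_i}p(y_{ij}|\beta,\alpha_i)\right)p(\alpha_i|\varrho^2)\,d\alpha_i
\]
with $\theta=(\beta,\varrho^2)$ the vector of model parameters to be estimated.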
We run both the optimal standard PMMH and the optimal block-wise PMMH for 50,000
iterations with the first 10,000 discarded as burn-in. For simplicity, each likelihood p(yi|θ) is
estimated by importance sampling based on Ni i.i.d. samples from the natural importance
sampler $p(\alpha_i|\varrho^2)$. For the standard PMMH, for each θ, the number of particles Ni=Ni(θ)
is tuned so that the variance of the log-likelihood estimator, $V(\log\hat p(y_i|\theta))$, is not bigger than
1/T. This is done as follows. We start from some small Ni and increase Ni if this variance
is bigger than 1/T. Note that the variance $V(\log\hat p(y_i|\theta))$ can be approximated in closed
form. The CPU time spent on tuning Ni is taken into account in the comparison. In the
block-wise PMMH, we divide the data into G=99 groups, so that each group has 17 panels,
and the variance of the log-likelihood estimator in each group is tuned to not be bigger than
2.34. Figure 3 plots the samples generated by the two algorithms.
As performance measures, we report the acceptance rate, the integrated autocorrelation
time (IACT), the CPU times, and the time normalised variance (TNV). For a univariate
Markov chain, the IACT is estimated by
\[
\mathrm{IACT}=1+2\sum_{t=1}^{1000}\hat\rho_t,
\]
where $\hat\rho_t$ are the sample autocorrelations. The IACT for a multivariate chain is averaged
over the IACT values for the parameters. The time normalised variance is the product of
the IACT and the CPU time. The TNV is proportional to the computing time defined in
(13) if the CPU time to generate N particles is linearly proportional to N .
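The two performance measures can be computed as in the following sketch. The truncation lag is a parameter (1000 in the text); the function names are hypothetical and the plain-Python loops favour clarity over speed.

```python
def iact(chain, max_lag=1000):
    """Integrated autocorrelation time: 1 + 2 * sum of sample autocorrelations."""
    n = len(chain)
    mean = sum(chain) / n
    centred = [x - mean for x in chain]
    var = sum(c * c for c in centred) / n
    total = 0.0
    for lag in range(1, min(max_lag, n - 1) + 1):
        acov = sum(centred[i] * centred[i + lag] for i in range(n - lag)) / n
        total += acov / var
    return 1.0 + 2.0 * total


def time_normalised_variance(iact_value, cpu_time):
    """TNV = IACT x CPU time."""
    return iact_value * cpu_time
```

For a multivariate chain, `iact` would be applied to each parameter's draws and the results averaged, as described above.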
Table 1 summarises the acceptance rates, the IACT ratio, the CPU ratio, and the
TNV ratio, with the blockwise PMMH as the baseline. As shown, the block-wise PMMH
outperforms the standard PMMH. In particular, the block-wise PMMH is around 25 times
more efficient than the standard PMMH in terms of the time normalised variance.
Figure 3: Plots of the standard PMMH chain (black) and block-wise PMMH chain (green) for the 6 parameters.
Methods            Acceptance   IACT ratio   CPU ratio   TNV ratio
Standard PMMH      0.222        1.080        23.095      24.938
Block-wise PMMH    0.243        1            1           1
Table 1: Skin cancer example using the block-wise PMMH as the baseline
Choice of the numbers of samples N
We demonstrate the usefulness of Lemma 4 as a guideline for selecting suitable numbers of
samples Ni in each estimation of p(yi|θ). Here we use the static strategy where the Ni are fixed
across θ and are tuned at a central θ obtained by a short pilot run. Because there are 17
panels in each group, given a target group-variance $\sigma_G^2$, Ni=Ni(θ) is selected such that
$V(\log\hat p(y_i|\theta))\approx\sigma_G^2/17$, so that $V(z_{(k)})\approx\sigma_G^2$. Keeping Ni fixed is easier to implement;
furthermore, tuning Ni for each proposal θ would add computational cost.
Figure 4 shows the average $\bar N=\sum_i N_i/T$, IACT and computing time CT for various
group-variances $\sigma_G^2$, when the likelihood is estimated using pseudo numbers. Here, CT is
computed as the product of $\bar N$ and IACT. CT has a minimum at $\sigma_G^2\approx 2.3$, which requires
40 samples on average to estimate each p(yi|θ). In this example we found that CT does
not change much when $\sigma_G^2$ is between 2 and 2.4. CT increases slowly when we choose
Ni such that $\sigma_G^2$ is smaller than 2.3, whereas it increases quite dramatically when $\sigma_G^2$ is higher
than the optimal value. To be on the safe side, we therefore advocate a conservative choice of Ni
in practice, that is, one for which $\sigma_G^2$ errs below rather than above the optimal value.
Figure 4: Pseudo numbers: Average N, IACT and CT = N × IACT for various σ²_G.
We now consider using quasi numbers, generated using the scrambled net algorithm of
Matousek (1998), to estimate the likelihood. Because scrambled quasi numbers are dependent,
it is necessary to generate a different set of scrambled quasi numbers to compute each
unbiased estimate $\hat p(y_i|\theta)$. Then, the $\hat p(y_i|\theta)$ are independent of each other, so $\hat p(y|\theta)=\prod_i\hat p(y_i|\theta)$
is an unbiased estimate of $p(y|\theta)=\prod_i p(y_i|\theta)$. We compute $V(\log\hat p(y_i|\theta))$ by replications as its
closed-form approximation is no longer available. Figure 5 shows that CT has a minimum
at $\sigma_G^2\approx 0.3$, which agrees with the theory in Lemma 4. CT increases slowly when $\sigma_G^2$ is
smaller than 0.3, and it increases quickly when $\sigma_G^2$ is higher than this value.
Figure 5: Quasi numbers: Average N, IACT and CT = N × IACT for various σ²_G.
Pseudo numbers vs quasi numbers
We now compare the performance of the standard PMMH using pseudo random numbers,
the blockwise PMMH using pseudo random numbers, and the blockwise PMMH using
randomised quasi numbers. Here, we fix the number of samples Ni=50 across all θ and i.
Table 2 summarises the performance measures with the quasi blockwise PMMH as the
baseline. The two blockwise PMMH frameworks outperform the standard PMMH. The
quasi blockwise PMMH outperforms the pseudo blockwise PMMH in terms of acceptance
rates, IACT and TNV, but not the CPU time. We note that the CPU times depend
heavily on the programming language and the implementation. Our implementation of
the quasi blockwise PMMH is suboptimal as we first generate scrambled quasi numbers
and then transform them into standard normal variates, which are needed for estimating the
likelihood. For the pseudo blockwise PMMH, we generate standard normal variates directly
using the optimised built-in function in Matlab.
Methods                  Acceptance   IACT ratio   CPU ratio   TNV ratio
Standard PMMH            0.002        12.005       0.927       11.133
pseudo blockwise PMMH    0.179        4.121        0.742       3.057
quasi blockwise PMMH     0.225        1            1           1
Table 2: Skin cancer example using the quasi blockwise PMMH as the baseline
4.2 Subsampling MCMC
Quiroz et al. (2016) propose a data subsampling approach, based on the correlated PMMH
algorithm in Deligiannidis et al. (2015), to speed up MCMC. As in Quiroz et al. (2016), we
demonstrate the PMMH methods using two AR(1) models with iid Student-t errors
$\epsilon_t \sim t(\nu)$ with known degrees of freedom $\nu$. The data generating processes are
$$
y_t = \begin{cases}
\beta_0+\beta_1 y_{t-1}+\epsilon_t, & [M_1,\ \theta=(\beta_0=0.3,\ \beta_1=0.6)] \\
\mu+\rho(y_{t-1}-\mu)+\epsilon_t, & [M_2,\ \theta=(\mu=0.3,\ \rho=0.99)],
\end{cases} \qquad (14)
$$
$t=1,\dots,T$, where $p(\epsilon_t)\propto(1+\epsilon_t^2/\nu)^{-(\nu+1)/2}$ with $\nu=5$, and the uniform priors are
$p(\beta_0,\beta_1)=U(-5,5)\cdot U(0,1)$ and $p(\mu,\rho)=U(-5,5)\cdot U(0,1)$.
Define $\ell_t(\theta) := \log p(y_t|y_{t-1},\theta)$ and rewrite the log likelihood $\ell(\theta)$ as
$$
\ell(\theta)=q(\theta)+d(\theta), \quad q(\theta)=\sum_{t=1}^{T} q_t(\theta), \quad d(\theta)=\sum_{t=1}^{T} d_t(\theta), \quad \text{with } d_t(\theta)=\ell_t(\theta)-q_t(\theta),
$$
where $q_t(\theta)\approx\ell_t(\theta)$ is a control variate. We set $q_t(\theta)$ to a Taylor series approximation of
$\ell_t(\theta)$ evaluated at the nearest centroid from a clustering in data space. This reduces the
complexity of computing $q(\theta)$ from $O(T)$ to $O(C)$, where $C$ is the number of centroids.
The reader is referred to Quiroz et al. (2016) for the details. Now, an unbiased
estimate of $\ell(\theta)$ based on a simple random sample with replacement is
$$
\hat\ell(\theta)=\frac{T}{N}\sum_{i=1}^{N} d_{u_i}(\theta)+q(\theta), \quad \text{with } u_i\in\{1,\dots,T\}, \quad \Pr(u_i=t)=\frac{1}{T}, \quad t=1,\dots,T. \qquad (15)
$$
Here $N$ is the subsample size and $u=(u_1,\dots,u_N)$ represents a vector of observation indices.
The resulting likelihood estimate is $\hat p_N(y|\theta,u)=\exp(\hat\ell(\theta)-\hat\sigma^2(\theta)/2)$, where $\hat\sigma^2(\theta)$ is an
unbiased estimate of the variance of $\hat\ell(\theta)$ in (15). Quiroz et al. (2016) show that carrying
out MCMC with this slightly biased likelihood estimator samples from a perturbed posterior
that is very close to the correct posterior if $N$ is sufficiently large in relation to $\sigma^2(\theta)$.
Block-wise PMMH creates G blocks, each with N/G indices. For large G, a high
correlation in the MH log-ratio is naturally induced, since u and u′ differ only at a small
number of positions.
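The blocking of the index vector $u$ can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the toy differences `d` and the toy value `q_total` are assumptions introduced here, and in a real sampler both `d` and `q_total` would be recomputed at each value of $\theta$.

```python
import random

def draw_blocks(T, N, G, rng):
    """Draw N indices uniformly with replacement from {0, ..., T-1}, arranged as G blocks."""
    size = N // G  # assumes G divides N
    return [[rng.randrange(T) for _ in range(size)] for _ in range(G)]

def loglik_estimate(d, q_total, blocks):
    """Difference estimator as in (15): (T/N) * sum_i d[u_i] + q(theta)."""
    T = len(d)
    idx = [t for block in blocks for t in block]
    return T / len(idx) * sum(d[t] for t in idx) + q_total

def refresh_one_block(blocks, T, rng):
    """Block-wise proposal for u: redraw one randomly chosen block, keep the rest."""
    k = rng.randrange(len(blocks))
    proposal = [list(b) for b in blocks]
    proposal[k] = [rng.randrange(T) for _ in blocks[k]]
    return proposal, k

rng = random.Random(0)
T, N, G = 1000, 100, 10
d = [rng.gauss(0.0, 0.01) for _ in range(T)]   # toy differences l_t - q_t (hypothetical)
q_total = -1500.0                               # toy value of q(theta) (hypothetical)
u = draw_blocks(T, N, G, rng)
u_prop, k = refresh_one_block(u, T, rng)
print(loglik_estimate(d, q_total, u), loglik_estimate(d, q_total, u_prop))
```

Because the current and proposed index vectors share all but one block, only $N/G$ new indices are drawn per iteration, and the two estimates share most of their terms.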
We generate T =100,000 observations from the models in (14) and run both the corre-
lated PMMH and the block-wise PMMH for 55,000 iterations from which we discard the
first 5,000 draws as burn-in. Using the same σ2(θ) as in Quiroz et al. (2016) results in
sample sizes N approximately 1300 and N approximately 2600 for models M1 and M2,
respectively. For the block-wise PMMH we use G=100 so that each block has ≈ 13 ob-
servations for M1 and ≈26 observations for M2. Also, following Quiroz et al. (2016), the
persistence parameter in the correlated PMMH is set to φ=0.9999, and we use a random
walk proposal which is adapted during the burn-in phase to target α≈0.15 (Sherlock et al.,
2015).
Table 3 summarises the performance measures introduced in Section 4.1. It is evident
that the block-wise PMMH dramatically outperforms the correlated PMMH on the speed
measures. This is because, as discussed above, the correlated PMMH requires N operations
for generating the vector u. The block-wise PMMH moves only one block at a time, so
that the update of the vector u requires N/G operations.
Methods            Acceptance       IACT ratio      CPU ratio         TNV ratio
                   M1      M2       M1      M2      M1       M2       M1       M2
Correlated PMMH    0.149   0.140    1.110   1.124   62.893   38.610   69.444   43.478
Block-wise PMMH    0.160   0.151    1       1       1        1        1        1
Table 3: Data subsampling example using block-wise PMMH as a baseline.
4.3 Diffusion process example
This section applies the PMMH approaches to estimating diffusion processes. Consider a
diffusion process X={Xt,t≥0} governed by the stochastic differential equation (SDE)
dXt = µ(Xt, θ)dt+ σ(Xt, θ)dWt, (16)
with $W_t$ a Wiener process. We assume that regularity conditions on $\mu(\cdot,\cdot)$ and $\sigma(\cdot,\cdot)$ are met
so that the solution to the SDE in (16) exists and is unique. We are interested in estimating
the vector of parameters $\theta$ based on discrete-time observations $x=\{x_0,x_1,\dots,x_n\}$, where
$x_i$ is the observation of $X$ at time $t_i=i\Delta$ with $\Delta$ some time interval. The likelihood is
$$
p(x|\theta)=\prod_{i=0}^{n-1} p_\Delta(x_{i+1}|x_i,\theta),
$$
where the ∆-interval Markov transition density p∆(xi+1|xi,θ) is typically intractable. In
order to make the discrete approximation of the continuous-time process X sufficiently
accurate, it is common to represent
$$
p_\Delta(x_{i+1}|x_i,\theta)=\int p_h(x_{i+1}|z_{i,M-1},\theta)\,p_h(z_{i,M-1}|z_{i,M-2},\theta)\cdots p_h(z_{i,1}|x_i,\theta)\,dz_{i,1}\cdots dz_{i,M-1}, \qquad (17)
$$
where $p_h(\cdot|\cdot,\theta)$ is the Markov transition density of $X$ after time-step $h=\Delta/M$. Provided that
$h$ is small enough, $p_h(u|v,\theta)$ is well approximated by the Euler approximation
$$
p_h^{euler}(u|v,\theta)=\mathcal{N}(u;\,v+h\mu(v,\theta),\,h\Sigma(v,\theta)),
$$
where $\mathcal{N}(\cdot;a,S)$ denotes the normal density with mean $a$ and covariance $S$, and
$\Sigma(v,\theta)=\sigma(v,\theta)\sigma'(v,\theta)$. The transition density in (17) is then approximated by
$$
p_\Delta^{euler}(x_{i+1}|x_i,\theta)=\int p_h^{euler}(x_{i+1}|z_{i,M-1},\theta)\,p_h^{euler}(z_{i,M-1}|z_{i,M-2},\theta)\cdots p_h^{euler}(z_{i,1}|x_i,\theta)\,dz_{i,1}\cdots dz_{i,M-1}.
$$
Finally, we are interested in the posterior
$$
p(\theta|x)\propto p(\theta)\,p^{euler}(x|\theta), \qquad p^{euler}(x|\theta)=\prod_{i=0}^{n-1} p_\Delta^{euler}(x_{i+1}|x_i,\theta).
$$
The likelihood $p^{euler}(x|\theta)$ is intractable but can be estimated unbiasedly. We follow
Stramer and Bognar (2011) and estimate $p_\Delta^{euler}(x_{i+1}|x_i,\theta)$ by importance sampling using the
importance sampler of Durham and Gallant (2002)
$$
z_{i,m+1}\sim \mathcal{N}\Big(z_{i,m}+\frac{x_{i+1}-z_{i,m}}{M-m},\; h\,\frac{M-m-1}{M-m}\,\Sigma(z_{i,m},\theta)\Big), \qquad m=0,\dots,M-2,
$$
where $z_{i,0}=x_i$. The density of this sampler is
$$
g(z_i)=g(z_{i,1},\dots,z_{i,M-1})=\prod_{m=0}^{M-2}\mathcal{N}\Big(z_{i,m+1};\; z_{i,m}+\frac{x_{i+1}-z_{i,m}}{M-m},\; h\,\frac{M-m-1}{M-m}\,\Sigma(z_{i,m},\theta)\Big).
$$
We sample $N$ such trajectories $z_i^{(j)}=(z_{i,1}^{(j)},\dots,z_{i,M-1}^{(j)})$, $j=1,\dots,N$, and denote by $u_i$ the set of
all required random variates. Then, the unbiased estimator of $p_\Delta^{euler}(x_{i+1}|x_i,\theta)$ is
$$
\hat p_\Delta^{euler}(x_{i+1}|u_i,x_i,\theta)=\frac{1}{N}\sum_{j=1}^{N}\frac{p_h^{euler}(x_{i+1}|z_{i,M-1}^{(j)},\theta)\,p_h^{euler}(z_{i,M-1}^{(j)}|z_{i,M-2}^{(j)},\theta)\cdots p_h^{euler}(z_{i,1}^{(j)}|x_i,\theta)}{g(z_i^{(j)})}.
$$
The likelihood peuler(x|θ) is estimated unbiasedly and in the form of factorisation (5), so
all the theory developed is applicable.
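The importance sampling estimator above can be sketched for a scalar diffusion as follows. This is an illustrative sketch only: the function names, the CIR parameter values and the small positive floor on the state inside the squared diffusion are assumptions made here, not part of the paper's implementation.

```python
import math
import random

def log_norm_pdf(x, mean, var):
    """Log density of N(mean, var) at x."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

def transition_estimate(x0, x1, delta, M, N, drift, diff2, rng):
    """Importance-sampling estimate of the Euler transition density p(x1 | x0)."""
    h = delta / M
    total = 0.0
    for _ in range(N):
        z, log_w = x0, 0.0
        for m in range(M - 1):                      # m = 0, ..., M-2
            mean_g = z + (x1 - z) / (M - m)
            var_g = h * (M - m - 1) / (M - m) * diff2(z)
            z_new = rng.gauss(mean_g, math.sqrt(var_g))
            log_w += log_norm_pdf(z_new, z + h * drift(z), h * diff2(z))  # Euler step density
            log_w -= log_norm_pdf(z_new, mean_g, var_g)                   # proposal density g
            z = z_new
        log_w += log_norm_pdf(x1, z + h * drift(z), h * diff2(z))         # final step to x1
        total += math.exp(log_w)
    return total / N

# CIR drift and squared diffusion; the parameter values are illustrative only.
alpha, beta, sigma = 0.07, 0.2, 0.07
cir_drift = lambda x: beta * (alpha - x)
cir_diff2 = lambda x: sigma ** 2 * max(x, 1e-12)   # floor guards against negative states (assumption)

rng = random.Random(1)
est = transition_estimate(0.050, 0.052, 1.0 / 12.0, 20, 5, cir_drift, cir_diff2, rng)
print(est)
```

For a driftless diffusion with constant volatility the proposal coincides with the exact Brownian bridge, so every importance weight equals the exact transition density. This gives a convenient correctness check for the sketch.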
We apply the proposed method to fit the Cox-Ingersoll-Ross (CIR) model
$$
dX_t=\beta(\alpha-X_t)\,dt+\sigma\sqrt{X_t}\,dW_t
$$
to the FedFunds dataset, which consists of 745 monthly federal funds rates in the US from
July 1954 to August 2016.
Following Stramer and Bognar (2011), we use the prior $p(\alpha,\beta,\sigma)=I_{(0,1)}(\alpha)\,I_{(0,\infty)}(\beta)\,\sigma^{-1}I_{(0,\infty)}(\sigma)$.
As in Stramer and Bognar (2011), we set $\Delta=1/12$ and $N=5$. We set $M=300$ to make the
Euler approximation highly accurate, whereas Stramer and Bognar (2011) use $M=20$. We use
$G=184$ groups. Table 4 summarises the results, which show that the blockwise PMMH
works better than the standard PMMH. However, we note that the improvement is not as
large as that achieved in the previous examples.
Our findings appear to contradict the results in Stramer and Bognar (2011), who report
that the blocking strategy does not work better than the standard PMMH. However, the
setting in Stramer and Bognar (2011) is different from ours. First, their updating scheme
alternates between $\theta$ and the blocks $u_{(1)},\dots,u_{(G)}$, rather than updating $\theta$ together with one of
the blocks at a time as we do. So their scheme updates the main parameter $\theta$ only once every
$G$ iterations, which is very inefficient when $G$ is large. Second, Stramer and Bognar's dataset
consists of 432 monthly rates from January 1963 to December 1998, which is a subset of the
period covered by our dataset. Third, they set $M=20$ while we set $M=300$. Our setting is much
more challenging for the standard PMMH because the estimates of the likelihood are highly
variable. The setting in Stramer and Bognar (2011) is manageable for the standard PMMH,
so the blockwise strategy is not necessary there.
Methods            Acceptance   IACT ratio   CPU ratio   TNV ratio
Standard PMMH      0.246        1.657        1.146       1.899
Blockwise PMMH     0.383        1            1           1
Table 4: FedFunds data example using the pseudo blockwise PMMH as the baseline
[Figure 6 panels: six sample autocorrelation function plots, sample autocorrelation against lag (0 to 100).]
Figure 6: Autocorrelations for the standard PMMH chains (first row) and the blockwise PMMH chains (second row)
4.4 ABC example: Bayesian inference for α-stable models
α-stable distributions (Nolan, 2007) are a class of heavy-tailed distributions used in many
statistical applications. The main difficulty when working with α-stable distributions is that
they do not have closed form densities, which makes it difficult to do inference. However,
as it is easy to sample from an α-stable distribution, one can use Approximate Bayesian
Computation techniques (ABC) for Bayesian inference (Tavare et al., 1997; Peters et al.,
2012). ABC approximates the likelihood by the so-called likelihood-free
$$
p_{LF}(y|\theta)=\int K_\epsilon(S(y'),S(y))\,p(y'|\theta)\,dy', \qquad (18)
$$
where $K_\epsilon(\cdot,\cdot)$ is a kernel with bandwidth $\epsilon$ and $S(\cdot)$ is a vector of summary statistics. Inferences
are then based on the approximate posterior $p_{ABC}(\theta|y)\propto p(\theta)\,p_{LF}(y|\theta)$, where the
likelihood-free $p_{LF}(y|\theta)$ is unbiasedly estimated by an average of $K_\epsilon(S(y^{[i]}),S(y))$ with
$y^{[i]}\overset{iid}{\sim}p(\cdot|\theta)$. We illustrate in this example that the block-wise PMMH scheme is still
applicable, although the likelihood cannot be factorised as in (5).
We use the example in Peters et al. (2012) and generate a data set $y=\{y_1,\dots,y_n\}$
with $n=200$ from a univariate α-stable distribution with parameters $\alpha=1.7$, $\beta=0.9$,
$\gamma=10$ and $\delta=10$. The summary statistics of a pseudo-dataset $y'=\{y'_1,\dots,y'_n\}$ are
$S(y')=(v_\alpha(y'),v_\beta(y'),v_\gamma(y'),v_\delta(y'))$; see Peters et al. (2012) for details. The likelihood-free
$p_{LF}(y|\theta)$ in (18) is estimated by $\hat p_{LF}(y|\theta)=K_\epsilon(S(y'),S(y))$ with $K_\epsilon$ the Gaussian kernel
with covariance matrix $\epsilon I_4$. We use only one pseudo-dataset to estimate the likelihood-free, as
suggested in Peters et al. (2012).
Both the standard PMMH and the block-wise PMMH are run for 50,000 iterations,
with the first 10,000 discarded as burn-in. The block-wise PMMH scheme is carried out as
follows. Given a vector of parameters $\theta$, write the pseudo-data point $y'_i$ as $f(\theta,u_i)$ with $u_i$
the set of random numbers used to generate $y'_i$. We divide the set $u=\{u_i,\,i=1,\dots,200\}$ into
$G=100$ blocks, with block $k$ consisting of $u_{2k-1}$ and $u_{2k}$, $k=1,\dots,G$. The block-wise PMMH
is then used to update $\theta$ and one block of $u$ at a time.
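The update described above can be illustrated with a toy ABC problem. The sketch below replaces the α-stable simulator with a hypothetical location model $y'_i=f(\theta,u_i)=\theta+u_i$, uses the sample mean as the only summary statistic, and assumes a flat prior on $\theta$. All of these are simplifications introduced here for illustration.

```python
import math
import random

def mean(v):
    return sum(v) / len(v)

def log_kernel(s_sim, s_obs, eps):
    """Log Gaussian kernel with variance eps, up to an additive constant."""
    return -0.5 * (s_sim - s_obs) ** 2 / eps

def blockwise_abc(y_obs, n_iter, G, eps, step, rng):
    """Toy block-wise PMMH: y'_i = theta + u_i, summary statistic = sample mean."""
    n = len(y_obs)
    size = n // G                                   # assumes G divides n
    s_obs = mean(y_obs)
    theta = s_obs                                   # start at the observed summary
    u = [rng.gauss(0.0, 1.0) for _ in range(n)]
    log_l = log_kernel(theta + mean(u), s_obs, eps)
    chain, accepted = [], 0
    for _ in range(n_iter):
        theta_prop = theta + step * rng.gauss(0.0, 1.0)   # symmetric random walk
        k = rng.randrange(G)                              # block of u to refresh
        u_prop = list(u)
        for i in range(k * size, (k + 1) * size):
            u_prop[i] = rng.gauss(0.0, 1.0)
        log_l_prop = log_kernel(theta_prop + mean(u_prop), s_obs, eps)
        # Flat prior on theta (assumption). The refreshed block is drawn from its
        # own distribution, so the MH ratio reduces to the ratio of kernel values.
        if rng.random() < math.exp(min(0.0, log_l_prop - log_l)):
            theta, u, log_l = theta_prop, u_prop, log_l_prop
            accepted += 1
        chain.append(theta)
    return chain, accepted / n_iter

rng = random.Random(3)
y_obs = [1.0 + rng.gauss(0.0, 1.0) for _ in range(200)]
chain, acc = blockwise_abc(y_obs, n_iter=2000, G=100, eps=0.05, step=0.3, rng=rng)
print(acc, mean(chain))
```

Only two of the 200 random numbers change per iteration, so the simulated summary statistic at the proposed state stays close to the one at the current state.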
Table 5 summarises the performance measures for different values of $\epsilon$, averaged over 10
replications. Here, the mean squared error (MSE) is the $l_2$-norm of the difference between
the estimated posterior mean and the true parameters. See Pasarica and Gelman (2010)
for a definition of averaged squared distance (ASD) as a performance measure in MCMC;
a larger ASD is better. The results show that the block-wise PMMH has a higher acceptance
rate, a lower IACT and a larger ASD than the standard PMMH for every $\epsilon$, with similar MSE.
ε     Methods      Acc. rate   MSE    IACT     ASD
10    Standard     0.31        1.41   359.75   3.14
      Block-wise   0.37        1.29   174.11   3.43
2     Standard     0.20        1.14   345.60   0.70
      Block-wise   0.30        1.17   224.28   0.96
1     Standard     0.10        0.96   659.50   0.18
      Block-wise   0.21        0.95   376.80   0.32
Table 5: α-stable distribution example
4.5 A time series example
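We consider a time series $\{y_t,\,t=1,\dots,T=1000\}$ generated from the state space model
$$
y_t|x_t\sim\mathrm{Poisson}(\lambda_t), \qquad \lambda_t=e^{\beta+x_t}, \qquad (19)
$$
$$
x_{t+1}=\phi x_t+\eta_t, \qquad \eta_t\sim N(0,\sigma^2), \qquad x_1\sim N(0,\sigma^2/(1-\phi^2)).
$$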
The model parameters are $\theta=(\beta,\phi,\sigma^2)$. We generate the data using the parameter values
$\beta=1$, $\phi=0.5$ and $\sigma^2=2(1-\phi^2)$.
We employ the high-dimensional importance sampling method of Shephard and Pitt
(1997) and Durbin and Koopman (1997) to obtain an unbiased likelihood estimator $\hat p(y|\theta,u)$.
The simulation smoothing step requires $2T$ independent univariate normal variates to generate
each sample path of the states, so the set of random variates $u$ needed is a matrix of
size $N\times(2T)$ with $N$ the number of samples. We divide $u$ into $G=100$ blocks, where $u_1$
consists of the first $2T/G$ columns of $u$, $u_2$ consists of the next $2T/G$ columns of $u$, etc.
We use the static strategy in this example, i.e. the number of sample paths $N$ is fixed.
Let $\bar\theta$ be some central value of $\theta$, e.g. the MLE obtained using the simulated maximum
likelihood method (Gourieroux and Monfort, 1995). For simplicity, we set $\bar\theta$ to the true
value. For the standard PMMH, the value of $N$ is chosen such that $V(\log\hat p_N(y|\bar\theta,u))\approx 1$,
where the variance is estimated by replications. For the two block-wise
PMMH schemes, one with pseudo numbers and the other with quasi numbers, we select
$N$ such that $V(\log\hat p_N(y|\bar\theta,u))\approx 2.16^2/(1-\rho^2)$ and $\approx 0.82^2/(1-\rho^2)$ respectively, with the
correlation $\rho$ estimated as follows. Let $N_0$ be some large value of $N$. Let $z=\log\hat p_{N_0}(y|\bar\theta,u)$
and $z'=\log\hat p_{N_0}(y|\bar\theta,u')$ with $u'$ obtained from $u$ by generating a new set for a randomly
selected block $u_k$, with the other blocks kept fixed. We generate $J=1000$ realisations
$(z_j,z'_j)_{j=1}^{J}$ of $(z,z')$ and estimate $\rho$ by the sample correlation $\hat\rho$. For the correlated PMMH
of Deligiannidis et al. (2015), we set the correlation $\rho=0.99$ and use the same $N$ as in the pseudo
block-wise PMMH. Each MCMC scheme is run for 25,000 iterations, with the first 5,000
discarded as burn-in.
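The replication procedure for estimating $\rho$ can be sketched as below. The state space importance sampler is replaced here by a toy estimator whose log is a sum of $G$ independent block terms; this stand-in, the function names and the values of `N` and `s` are assumptions made purely for illustration.

```python
import math
import random

def block_term(N, rng, s=1.0):
    """Log of an unbiased estimate (mean 1) formed from N fresh normal variates."""
    return math.log(sum(math.exp(s * rng.gauss(0.0, 1.0) - 0.5 * s * s) for _ in range(N)) / N)

def sample_corr(a, b):
    """Sample correlation of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)

def estimate_rho(G, N, J, rng):
    """Generate J pairs (z, z') that differ in one randomly chosen block."""
    z, z_new = [], []
    for _ in range(J):
        terms = [block_term(N, rng) for _ in range(G)]   # one term per block u_k
        k = rng.randrange(G)
        refreshed = block_term(N, rng)                   # new draw for block k only
        total = sum(terms)
        z.append(total)
        z_new.append(total - terms[k] + refreshed)
    return sample_corr(z, z_new)

rho_hat = estimate_rho(G=100, N=5, J=1000, rng=random.Random(4))
print(rho_hat)
```

For this toy estimator the blocks contribute equal variance, so the population correlation is exactly $1-1/G$, which the sample correlation should approach.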
Methods                       N      Acc. rate   IACT ratio   CPU ratio   CT ratio
standard PMMH                 3500   -           -            -           -
correlated PMMH (ρ = .99)     56     0.18        1.613        1.336       2.155
pseudo block-wise PMMH        56     0.23        1.154        1.257       1.451
quasi block-wise PMMH         16     0.23        1            1           1
Table 6: State space example: performance measure ratios with the quasi block-wise PMMH as the baseline
Table 6 summarises the results. The standard PMMH requires around $N=3500$ samples
and is therefore very computationally demanding, so we do not run it in
this case. The quasi block-wise PMMH outperforms the others in this example. Although
both the correlated PMMH and the pseudo block-wise PMMH use the same number of particles
$N$, the latter has a lower CPU time because it generates only one block of $u$ in each iteration.
5 Large sample in T analysis for panel data
In this section we discuss properties of the block-wise PMMH for large $T$ for the panel data
model discussed in Sections 3 and 4.1, and show that the total number of particles required
per MCMC iteration is $O(T^{3/2})$ if pseudo random numbers are used and roughly $O(T^{7/6})$
if quasi random numbers are used, rather than $O(T^2)$ as in the standard PMMH. We also
justify part (i) of Assumption 1 and show a weak posterior correlation between $\theta$ and $z$.
Consider the panel data model, with the panels in the kth group denoted by Gk, and
suppose that we use the same $N_i=N_k$ particles for all panels $i\in G_k$. Let $p(y_i|\theta)$ be the
likelihood of the $i$th panel, and let $\hat p_{N_i}(y_i|\theta,u_i)$ be the unbiased estimate of $p(y_i|\theta)$. We
make the following assumption.
Assumption 3. For each $i\in G_k$ and parameter value $\theta$, there exists an $A_i(\theta)^2$ such that, as
$N_k\to\infty$,
$$
N_k^{\varpi}\big(\hat p_{N_i}(y_i;\theta,u_i)-p(y_i|\theta)\big)\overset{d}{\Rightarrow}\mathcal{N}(0,A_i(\theta)^2), \qquad (20)
$$
for some $\varpi>0$.
The central limit theorem (20) holds for most importance sampling estimates of the
likelihood, where $\varpi=1/2$ if using pseudo random numbers and $\varpi=3/2-\epsilon$ with an
arbitrarily small $\epsilon>0$ if randomised quasi numbers are used; see, e.g., Loh (2003); Owen
(1997).
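Let $\gamma_i(\theta)^2=A_i(\theta)^2/p(y_i|\theta)^2$ and define $z_i(\theta,u_i):=\log\big(\hat p_{N_i}(y_i;\theta,u_i)/p(y_i|\theta)\big)$. The following
result holds and is similar to Lemma 2(ii) in Pitt et al. (2012). Its proof is straightforward
and is omitted.
Lemma 5. Suppose that Assumption 3 holds. Then, for $i\in G_k$, as $N_k\to\infty$,
$$
N_k^{\varpi}\Big(z_i(\theta,u_i)+\frac{\gamma_i^2(\theta)}{2N_k^{2\varpi}}\Big)\overset{d}{\Rightarrow}\mathcal{N}(0,\gamma_i^2(\theta)).
$$
So, informally, $z_i(\theta,u_i)$ is approximately distributed as $\mathcal{N}\big(-\gamma_i^2(\theta)/(2N_k^{2\varpi}),\,\gamma_i^2(\theta)/N_k^{2\varpi}\big)$.
The next corollary formalises the results of this section. Its proof follows from the discussion
immediately above and Lemma 5.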
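Corollary 1. Define $\gamma_{(k)}^2(\theta):=\sum_{i\in G_k}\gamma_i^2(\theta)$. Suppose that we take $G_T=O(T^{1/2})$, where the
subscript $T$, here and below, indicates dependence on $T$. Then,
(i) The number of elements in each group is $|G_k|=O(T/G_T)=O(T^{1/2})$, $\gamma_{(k)}^2(\theta)=O(T/G_T)=O(T^{1/2})$, and
$$
\rho_T=1-O(T^{-1/2}), \qquad \sigma_{opt,T}^2=O(T^{1/2}) \qquad \text{and} \qquad N_{k,T}=N_k(\theta,\sigma_{opt,T}^2,G_T)=O(T^{1/(4\varpi)}).
$$
(ii) The total number of particles used to estimate the likelihood is
$$
\sum_{k=1}^{G}N_{k,T}(\theta)\times(T/G)=O(T^{1+1/(4\varpi)})=
\begin{cases}
O(T^{3/2}), & \text{if } \varpi=1/2 \\
O\big(T^{\frac{7-4\epsilon}{6-4\epsilon}}\big), & \text{if } \varpi=3/2-\epsilon.
\end{cases}
$$
(iii) Part (i) of Assumption 1 holds.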
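The rates in Corollary 1 can be traced in two steps. The display below is a sketch of the arithmetic, writing $\varpi$ for the convergence exponent in Assumption 3 and assuming, as in Lemma 5, that the variance contributed by group $k$ is approximately $\gamma_{(k)}^2/N_k^{2\varpi}$.

```latex
% Step 1: balance the total variance against sigma^2_opt, with G_T = T^{1/2} groups
% and gamma^2_(k) = O(T^{1/2}) per group.
\sigma^2_{opt,T} \approx \sum_{k=1}^{G_T}\frac{\gamma_{(k)}^2}{N_{k,T}^{2\varpi}}
  = O\!\Big(\frac{T^{1/2}\cdot T^{1/2}}{N_{k,T}^{2\varpi}}\Big) = O(T^{1/2})
  \;\Longrightarrow\; N_{k,T}^{2\varpi}=O(T^{1/2})
  \;\Longrightarrow\; N_{k,T}=O\big(T^{1/(4\varpi)}\big).

% Step 2: substitute the two values of varpi into the exponent 1 + 1/(4 varpi).
\varpi=\tfrac12:\quad 1+\frac{1}{4\varpi}=1+\frac12=\frac32,
\qquad
\varpi=\tfrac32-\epsilon:\quad 1+\frac{1}{4\varpi}=1+\frac{1}{6-4\epsilon}
  =\frac{7-4\epsilon}{6-4\epsilon}\;\xrightarrow[\epsilon\to0]{}\;\frac76 .
```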
We now present an important result that justifies that slowly moving $z$ does not have
any undesired effect on the mixing of the chain on $\theta$. Suppose that the same $N_T=O(T^{1/(4\varpi)})$
particles are used for all panels.
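Lemma 6 (Posterior orthogonality of $\theta$ and $z$). Assume that
(i) $z=\sum_{i=1}^{T}z_i(\theta,u_i)\sim\mathcal{N}(-\zeta_T(\theta)/2,\zeta_T(\theta))$ with $\zeta_T(\theta)=(T/N_T^{2\varpi})\,\eta_T(\theta)$, $\eta_T(\theta)=\frac{1}{T}\sum_{i=1}^{T}\gamma_i^2(\theta)$,
and $\mu_\eta=E_\pi(\eta_T(\theta))$ is bounded (in $T$).
(ii) For any bounded and differentiable function $h(\theta)$,
$$
T\times E_\pi\Big[(\theta-\bar\theta)'\,\frac{\partial h(\bar\theta)}{\partial\theta}\,\frac{\partial\eta_T(\bar\theta)}{\partial\theta'}\,(\theta-\bar\theta)\Big]\xrightarrow{T\to\infty}v
$$
for some constant $v>0$. Here $\bar\theta$ denotes the posterior mean $E_\pi(\theta)$ and $\theta'$ denotes the
transpose of the vector $\theta$. Furthermore,
$$
T\times V_\pi(h(\theta))\xrightarrow{T\to\infty}v_h, \qquad \text{and} \qquad T\times V_\pi(\eta_T(\theta))\xrightarrow{T\to\infty}v_\eta
$$
for some constants $v_h>0$, $v_\eta>0$.
Then the posterior correlation of $h(\theta)$ and the log-likelihood estimation error $z$ is approximately
zero, i.e., $\mathrm{Corr}(h(\theta),z|y)\to0$, as $T\to\infty$.
Assumption (i) is justified by Lemma 5 and (ii) is justified by the Bernstein-von Mises
theorem.
Proof. See the Appendix.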
6 Conclusion
We have presented an efficient block-wise PMMH algorithm for Bayesian inference for
models where the likelihood is a product as in panel data models and can be estimated
unbiasedly or for big data problems where the log likelihood is a sum with many terms
which can be estimated unbiasedly. The proposed algorithm divides the set of random
numbers used to estimate the likelihood into blocks, and then updates the parameters of
interest and one block at a time. The block-wise PMMH approach requires a much smaller
number of particles in the likelihood estimation than the standard PMMH. Applications
strongly confirm the usefulness of the proposed methodology. The blocking approach can
in principle be generalised to any model that is estimated by PMMH, including intractable
time series problems. The blocking approach can also be combined with the autocorrelation
approach in Dahlin et al. (2015) and Deligiannidis et al. (2015).
Using a random truncation to obtain unbiased estimates (Rhee and Glynn, 2015) is currently
attracting significant attention. The resulting debiased estimates are highly variable,
which makes it challenging to use the standard PMMH. The block-wise PMMH based on
quasi numbers can offer an efficient method in this situation.
References
Andrieu, C., Doucet, A., and Holenstein, R. (2010). Particle Markov chain Monte Carlo
methods. Journal of the Royal Statistical Society, Series B, 72:1–33.
Andrieu, C. and Roberts, G. (2009). The pseudo-marginal approach for efficient Monte
Carlo computations. The Annals of Statistics, 37:697–725.
Bartolucci, F., Farcomeni, A., and Pennoni, F. (2012). Latent Markov Models for Longitu-
dinal Data. Chapman and Hall/CRC press.
Beaumont, M. A. (2003). Estimation of population growth or decline in genetically moni-
tored populations. Genetics, 164:1139–1160.
Dahlin, J., Lindsten, F., Kronander, J., and Schon, T. B. (2015). Accelerating
pseudo-marginal Metropolis-Hastings by correlating auxiliary variables. Technical report.
arXiv:1511.05483v1.
Del Moral, P. (2004). Feynman-Kac Formulae: Genealogical and Interacting Particle Sys-
tems with Applications. Springer, New York.
Deligiannidis, G., Doucet, A., and Pitt, M. (2015). The correlated pseudo-marginal method.
Technical report.
Dick, J. and Pillichshammer, F. (2010). Digital Nets and Sequences: Discrepancy Theory and
Quasi-Monte Carlo Integration. Cambridge University Press, Cambridge.
Donohue, M. C., Overholser, R., Xu, R., and Vaida, F. (2011). Conditional Akaike infor-
mation under generalized linear and proportional hazards mixed models. Biometrika,
98:685–700.
Doucet, A., Pitt, M., Deligiannidis, G., and Kohn, R. (2015). Efficient implementation of
Markov chain Monte Carlo when using an unbiased likelihood estimator. Biometrika,
102(2):295–313.
Durbin, J. and Koopman, S. J. (1997). Monte Carlo maximum likelihood estimation for
non-Gaussian state space models. Biometrika, 84:669–684.
Durham, G. B. and Gallant, A. R. (2002). Numerical techniques for maximum likeli-
hood estimation of continuous-time diffusion processes. Journal of Business & Economic
Statistics, 20(3):335–338.
Fitzmaurice, G. M., Laird, N. M., and Ware, J. H. (2011). Applied Longitudinal Analysis.
John Wiley & Sons, Ltd, New Jersey, 2nd edition.
Flury, T. and Shephard, N. (2011). Bayesian inference based only on simulated likelihood:
Particle filter analysis of dynamic economic models. Econometric Theory, 1:1–24.
Gerber, M. and Chopin, N. (2015). Sequential quasi Monte Carlo. J. R. Stat. Soc. Ser. B.
Stat. Methodol., 77(3):509–579.
Gourieroux, C. and Monfort, A. (1995). Statistics and Econometric Models, volume 2.
Cambridge University Press, Melbourne.
Greenberg, E. R., Baron, J. A., Stevens, M. M., Stukel, T. A., Mandel, J. S., Spencer,
S. K., Elias, P. M., Lowe, N., Nierenberg, D. N., G., B., and Vance, J. C. (1989). The
skin cancer prevention study: design of a clinical trial of beta-carotene among persons
at high risk for nonmelanoma skin cancer. Controlled Clinical Trials, 10:153–166.
Gunawan, D., Tran, M.-N., Suzuki, K., Dick, J., and Kohn, R. (2016). Computationally
efficient Bayesian estimation of high dimensional copulas with discrete and mixed margins.
Technical report.
Johnson, A. A., Jones, G. L., and Neath, R. C. (2013). Component-wise Markov Chain
Monte Carlo: Uniform and geometric ergodicity under mixing and composition. Statis-
tical Science, 28(3):360–375.
Lee, A. and Holmes, C. (2010). Discussion on particle Markov chain Monte Carlo methods.
Journal of the Royal Statistical Society, Series B, 72:1–33.
Lin, L., Liu, K., and Sloan, J. (2000). A noisy Monte Carlo algorithm. Physical Review D,
61.
Loh, W.-L. (2003). On the asymptotic distribution of scrambled net quadrature. Ann.
Statist., 31(4):1282–1324.
Matousek, J. (1998). On the l2-discrepancy for anchored boxes. Journal of Complexity,
14:527–556.
Murray, I. and Graham, M. M. (2016). Pseudo-marginal slice sampling. In Proceedings of
the 19th International Conference on Artificial Intelligence and Statistics, JMLR W&CP,
volume 51, pages 911–19.
Niederreiter, H. (1992). Random Number Generation and Quasi-Monte Carlo Methods.
Society for Industrial and Applied Mathematics, Philadelphia.
Nolan, J. (2007). Stable Distributions: Models for Heavy-Tailed Data. Birkhauser, Boston.
Owen, A. B. (1997). Scrambled net variance for integrals of smooth functions. The Annals
of Statistics, 25(4):1541–1562.
Pasarica, C. and Gelman, A. (2010). Adaptively scaling the Metropolis algorithm using
expected squared jumped distance. Statistica Sinica, 20:343–364.
Peters, G., Sisson, S., and Fan, Y. (2012). Likelihood-free Bayesian inference for α-stable
models. Computational Statistics & Data Analysis, 56(11):3743 – 3756.
Pitt, M. K., Silva, R. S., Giordani, P., and Kohn, R. (2012). On some properties of
Markov chain Monte Carlo simulation methods based on the particle filter. Journal of
Econometrics, 171(2):134–151.
Quiroz, M., Villani, M., and Kohn, R. (2016). Speeding up MCMC by efficient data
subsampling. Technical report.
Rhee, C.-H. and Glynn, P. W. (2015). Unbiased estimation with square root convergence
for sde models. Operations Research, 63(5):1026–1043.
Shephard, N. and Pitt, M. K. (1997). Likelihood analysis of non-Gaussian measurement
time series. Biometrika, 84:653–667.
Sherlock, C., Thiery, A., Roberts, G., and Rosenthal, J. (2015). On the efficiency of the
pseudo marginal random walk Metropolis algorithm. The Annals of Statistics, 43(1):238–
275.
Stramer, O. and Bognar, M. (2011). Bayesian inference for irreducible diffusion processes
using the pseudo-marginal approach. Bayesian Analysis, 6(2):231–258.
Tavare, S., Balding, D. J., Griffiths, R. C., and Donnelly, P. (1997). Inferring coalescence
times from DNA sequence data. Genetics, 145(2):505–518.
Tran, M.-N., Nott, D. J., and Kohn, R. (2015). Variational Bayes with intractable likeli-
hood. Technical report. arXiv:1503.08621.
Appendix A Proofs
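Proof of Lemma 2. To obtain the conditional acceptance probability, we use the following
results,
$$
\int_{-\infty}^{A}\exp(z)\,\mathcal{N}(z;a,b^2)\,dz=\exp(a+b^2/2)\,\Phi\Big(\frac{A-a-b^2}{b}\Big), \qquad (21)
$$
$$
\int_{A}^{\infty}\mathcal{N}(z;a,b^2)\,dz=\Phi\Big(\frac{a-A}{b}\Big). \qquad (22)
$$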
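Now, as in Lemma 1, let $a(z')=E(z|z')=-\sigma^2/(2G)+\rho z'$ and $\tau^2=V(z|z')=\sigma^2(1-\rho^2)$, so
that the conditional density of $z$ given $z'$ is $\mathcal{N}(z;a(z'),\tau^2)$. Using (21) and (22), the
conditional probability of acceptance is
$$
\begin{aligned}
\int\min(1,\exp(z-z'))\,\mathcal{N}(z;a(z'),\tau^2)\,dz
&=\int_{-\infty}^{z'}\exp(z-z')\,\mathcal{N}(z;a(z'),\tau^2)\,dz+\int_{z'}^{\infty}\mathcal{N}(z;a(z'),\tau^2)\,dz \\
&=\exp\big(a(z')-z'+\tau^2/2\big)\,\Phi\Big(\frac{z'-a(z')-\tau^2}{\tau}\Big)+\Phi\Big(\frac{a(z')-z'}{\tau}\Big) \\
&=\exp\big(-y+\tau^2/2\big)\,\Phi\Big(\frac{y-\tau^2}{\tau}\Big)+\Phi\Big(\frac{-y}{\tau}\Big),
\end{aligned}
$$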
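where $y:=z'-a(z')=(1-\rho)(z'+\sigma^2/2)$. To obtain the unconditional acceptance probability,
we need to take the expectation of the conditional probability with respect to $z'$. Since
$z'\sim\mathcal{N}(\sigma^2/2,\sigma^2)$ and $1-\rho=1/G$, we note that $y\sim\mathcal{N}\big(\sigma^2/G,\sigma^2/G^2\big)$. Hence, the
unconditional probability of acceptance is
$$
\int\exp\big(-y+\tau^2/2\big)\,\Phi\Big(\frac{y-\tau^2}{\tau}\Big)\,\mathcal{N}(y;\sigma^2/G,\sigma^2/G^2)\,dy+\int\Phi\Big(\frac{-y}{\tau}\Big)\,\mathcal{N}(y;\sigma^2/G,\sigma^2/G^2)\,dy. \qquad (23)
$$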
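To proceed further we use the following elementary results,
$$
\exp(-y)\,\mathcal{N}(y;a,b^2)=\exp(b^2/2-a)\,\mathcal{N}(y;a-b^2,b^2), \qquad (24)
$$
$$
\Phi\Big(\frac{-a-c}{\sqrt{b^2+d^2}}\Big)=\int\Phi\Big(\frac{-y-a}{b}\Big)\,\mathcal{N}(y;c,d^2)\,dy, \qquad (25)
$$
$$
\Phi\Big(\frac{-a+c}{\sqrt{b^2+d^2}}\Big)=\int\Phi\Big(\frac{y-a}{b}\Big)\,\mathcal{N}(y;c,d^2)\,dy. \qquad (26)
$$
Hence,
$$
\begin{aligned}
\exp\big(-y+\tau^2/2\big)\,\mathcal{N}(y;\sigma^2/G,\sigma^2/G^2)
&=\exp\big(\tau^2/2+\sigma^2/(2G^2)-\sigma^2/G\big)\,\mathcal{N}(y;\sigma^2/G-\sigma^2/G^2,\sigma^2/G^2) \\
&=\mathcal{N}(y;\rho\sigma^2/G,\sigma^2/G^2), \\
\int\Phi\Big(\frac{y-\tau^2}{\tau}\Big)\,\mathcal{N}(y;\rho\sigma^2/G,\sigma^2/G^2)\,dy
&=\Phi\Big(\frac{\rho\sigma^2/G-\sigma^2(1-\rho^2)}{\sqrt{\sigma^2/G^2+\sigma^2(1-\rho^2)}}\Big)
=\Phi\Big(-\frac{\sigma\sqrt{1-\rho}}{\sqrt{2}}\Big), \\
\int\Phi\Big(\frac{-y}{\tau}\Big)\,\mathcal{N}(y;\sigma^2/G,\sigma^2/G^2)\,dy
&=\Phi\Big(-\frac{\sigma\sqrt{1-\rho}}{\sqrt{2}}\Big).
\end{aligned}
$$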
Proof of Lemma 3. For notational simplicity, we write the proposal density $q(z|z';\rho,\sigma)$ as
$q(z|z')$, the acceptance probability $\alpha(z',z;\rho,\sigma)$ in (9) as $\alpha(z',z)$, and the acceptance
probability $k(z'|\sigma,\rho)$, conditional on the previous iterate, as $k(z')$. Let $\{(\theta_j,z_j),\,j=1,\dots,M\}$ be
iterates, after convergence, of the Markov chain produced by the PMMH sampling scheme.
Then, the Markov transition distribution from $(\theta',z')$ to $(\theta,z)$ is
$$
\begin{aligned}
p(\theta',z';d\theta,dz)
&=\alpha(z',z)\,\pi(\theta)\,q(z|z')\,d\theta\,dz+\Big(1-\int\alpha(z',z^*)\,\pi(\theta^*)\,q(z^*|z')\,d\theta^*dz^*\Big)\,\delta_{(\theta',z')}(d\theta,dz) \\
&=\alpha(z',z)\,\pi(\theta)\,q(z|z')\,d\theta\,dz+\big(1-k(z'|\sigma,\rho)\big)\,\delta_{(\theta',z')}(d\theta,dz),
\end{aligned}
$$
where $\delta_{(\theta',z')}(d\theta,dz)$ is the probability measure concentrated at $(\theta',z')$.
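Consider now the space of functions
$$
\mathcal{F}=\Big\{\tilde\varphi:\tilde\Theta=\Theta\otimes\mathbb{R}\mapsto\mathbb{R},\ \tilde\varphi=\varphi(\theta)\psi(z),\ \pi(\varphi):=E_{\theta\sim\pi(\theta)}(\varphi)=0,\ \pi(\varphi^2):=E_{\theta\sim\pi(\theta)}(\varphi^2)<\infty,\ \pi(\psi^2):=E_{z\sim\pi(z)}(\psi^2)<\infty\Big\}.
$$
We define the operator $P:\mathcal{F}\mapsto\mathcal{F}$ as
$$
\begin{aligned}
(P\tilde\varphi)(\theta,z)
&:=\int\tilde\varphi(\theta^*,z^*)\,p(\theta,z;d\theta^*,dz^*) \\
&=\pi(\varphi)\int\psi(z^*)\,\alpha(z,z^*)\,q(z^*|z)\,dz^*+\tilde\varphi(\theta,z)\,(1-k(z)) \\
&=\varphi(\theta)\,\psi(z)\,(1-k(z)),
\end{aligned}
$$
as $\pi(\varphi)=0$ by assumption. It is straightforward to check that $(P^j\tilde\varphi)(\theta,z)=\varphi(\theta)\psi(z)(1-k(z))^j$
and that $(P\tilde\varphi)(\theta_{j-1},z_{j-1})=E(\tilde\varphi(\theta_j,z_j)|\theta_{j-1},z_{j-1})$. Hence,
$E(\tilde\varphi(\theta_j,z_j)|\theta_0,z_0)=(P^j\tilde\varphi)(\theta_0,z_0)=\tilde\varphi(\theta_0,z_0)(1-k(z_0))^j$.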
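We now consider $\tilde\varphi(\theta,z)=\varphi(\theta)\psi(z)$ with $\psi(z)\equiv1$ so that $\tilde\varphi\in\mathcal{F}$; suppose also that
$(\theta_0,z_0)\sim\pi_N$. Define $c_j:=\mathrm{Cov}(\tilde\varphi(\theta_j,z_j),\tilde\varphi(\theta_0,z_0))=\mathrm{Cov}(\varphi(\theta_j),\varphi(\theta_0))$. Then,
$$
\begin{aligned}
c_j&=E\big(\tilde\varphi(\theta_j,z_j)\,\tilde\varphi(\theta_0,z_0)\big) \\
&=E_{(\theta_0,z_0)\sim\pi_N}\big(E(\tilde\varphi(\theta_j,z_j)|\theta_0,z_0)\,\tilde\varphi(\theta_0,z_0)\big) \\
&=E_{(\theta_0,z_0)\sim\pi_N}\big((1-k(z_0))^j\,\tilde\varphi(\theta_0,z_0)^2\big) \\
&=E_{z_0\sim\pi_N(z)}\big((1-k(z_0))^j\big)\,E_{\theta_0\sim\pi}\big(\varphi(\theta_0)^2\big) \qquad \text{because } z_0 \text{ only depends on } \sigma \text{ by construction} \\
&=E_{z_0\sim\pi_N(z)}\big((1-k(z_0))^j\big)\,c_0.
\end{aligned}
$$
The inefficiency IF is defined as
$$
\mathrm{IF}=\Big(c_0+2\sum_{j=1}^{\infty}c_j\Big)\Big/c_0=1+2\sum_{j=1}^{\infty}E_{z\sim\pi_N(z)}\big((1-k(z))^j\big)=1+2E_{z\sim\pi_N(z)}\Big(\frac{1-k(z)}{k(z)}\Big)
$$
as required.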
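Proof of Lemma 4. From Lemma 1, $\pi(z'|\sigma)=\mathcal{N}(z';\sigma^2/2,\sigma^2)$. Let
$\omega:=[(1-\rho)(z'+\sigma^2/2)-\tau^2]/\tau$ with $\tau=\sigma\sqrt{1-\rho^2}$. Then,
$$
\omega\sim\mathcal{N}\Big(-\frac{\rho\tau}{1+\rho},\ \frac{1-\rho}{1+\rho}\Big),
$$
and we note that the variance of $\omega$ depends only on $\rho$. For $\rho$ close to 1, the variance of $\omega$ is
approximately $1/(2G)$, which is very small. Thus, $\omega$ will be concentrated close to its mean
$\omega^*:=-\rho\tau/(1+\rho)$. Define $p^*(\omega|\tau):=1-k(z'|\rho,\sigma)=\Phi(\omega+\tau)-\exp(-\omega\tau-\tau^2/2)\,\Phi(\omega)$, which
follows from the conditional acceptance probability in the proof of Lemma 2 with $y=\omega\tau+\tau^2$. Then,
$$
\mathrm{IF}(\sigma,\rho)=\int\frac{1+p^*(\omega|\tau)}{1-p^*(\omega|\tau)}\,\mathcal{N}\Big(\omega;-\frac{\rho\tau}{1+\rho},\frac{1-\rho}{1+\rho}\Big)\,d\omega.
$$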
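It is convenient to write $\mathrm{IF}(\sigma,\rho)$ as $\mathrm{IF}(\tau|\rho)$, because we will optimize the computing time
over $\tau$ keeping $\rho$ fixed. Let
$$
f(\omega;\tau):=\frac{1+p^*(\omega|\tau)}{1-p^*(\omega|\tau)}.
$$
Using the 4th order Taylor series expansion of $f(\omega;\tau)$ at $\omega=\omega^*$, the inefficiency factor
can be approximated by
$$
\mathrm{IF}_{approx}(\tau|\rho)=f(\omega^*;\tau)+\frac{1}{2}\,\frac{1-\rho}{1+\rho}\,f^{(2)}(\omega^*;\tau)+\frac{1}{8}\Big(\frac{1-\rho}{1+\rho}\Big)^2f^{(4)}(\omega^*;\tau),
$$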
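which is considered as a function of $\tau$ with $\rho$ fixed. This approximation is very accurate
because, as noted, the variance of $\omega$ is very small for large $G$. So the computing time
$\mathrm{CT}(\sigma,\rho)=\mathrm{IF}(\sigma,\rho)/\sigma^{1/\varpi}$ is approximated by
$$
\mathrm{CT}_{approx}(\tau|\rho)=(1-\rho^2)^{\frac{1}{2\varpi}}\,\frac{\mathrm{IF}_{approx}(\tau|\rho)}{\tau^{1/\varpi}}\propto\frac{\mathrm{IF}_{approx}(\tau|\rho)}{\tau^{1/\varpi}}.
$$
Minimizing this term over $\tau$, for $\rho$ close to 1, we find that $\mathrm{CT}_{approx}(\tau|\rho)$ is minimized at
$\tau\approx2.16$ for $\varpi=1/2$, and at $\tau\approx0.82$ for $\varpi\approx3/2$. So the optimal $\sigma_{opt}\approx2.16/\sqrt{1-\rho^2}$ for
$\varpi=1/2$ and $\sigma_{opt}\approx0.82/\sqrt{1-\rho^2}$ for $\varpi=3/2-\epsilon$ with any arbitrarily small $\epsilon$.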
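The location of the minimum can be checked numerically with a short grid search. The sketch below is a simplification of the argument above: it keeps only the leading term $f(\omega^*;\tau)$ of the Taylor expansion and takes the limit $\rho\to1$, so that $\omega^*=-\tau/2$. The function names and the argument `power` (standing for the convergence exponent of Assumption 3) are assumptions introduced here.

```python
import math

def Phi(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def ct_leading(tau, power):
    """Leading-order computing time f(omega*; tau) / tau**(1/power) as rho -> 1."""
    omega = -0.5 * tau                                  # omega* = -rho*tau/(1+rho) at rho = 1
    # p* = 1 - k, with k the conditional acceptance probability from the proof of Lemma 2.
    p_star = Phi(omega + tau) - math.exp(-omega * tau - 0.5 * tau ** 2) * Phi(omega)
    inefficiency = (1.0 + p_star) / (1.0 - p_star)
    return inefficiency / tau ** (1.0 / power)

def argmin_tau(power):
    """Grid search for the minimiser over tau in [0.05, 5.05)."""
    grid = [0.05 + 0.001 * i for i in range(5000)]
    return min(grid, key=lambda t: ct_leading(t, power))

tau_pseudo = argmin_tau(0.5)   # pseudo random numbers
tau_quasi = argmin_tau(1.5)    # quasi numbers, taking epsilon -> 0
print(round(tau_pseudo, 2), round(tau_quasi, 2))
```

Even with the higher-order terms dropped, the minimisers should land close to the values 2.16 and 0.82 quoted in the proof, since the correction terms are of order $(1-\rho)/(1+\rho)$.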
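Then the unconditional acceptance rate (10) is
$$
\begin{aligned}
P(\mathrm{accept}|\rho,\sigma_{opt})
&=2\Big(1-\Phi\Big(\frac{\sigma_{opt}\sqrt{1-\rho}}{\sqrt{2}}\Big)\Big) \\
&=2\Big(1-\Phi\Big(\frac{\sigma_{opt}\sqrt{1-\rho^2}}{\sqrt{2(1+\rho)}}\Big)\Big) \\
&\approx2\Big(1-\Phi\Big(\frac{2.16}{2}\Big)\Big)\approx0.28.
\end{aligned}
$$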
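The final number can be reproduced directly from the closed-form acceptance rate. The sketch below takes $G=100$ and $\rho=1-1/G$ as an illustrative choice; the function names are assumptions made here.

```python
import math

def Phi(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def acceptance_rate(sigma, rho):
    """Unconditional acceptance probability 2 * (1 - Phi(sigma * sqrt(1 - rho) / sqrt(2)))."""
    return 2.0 * (1.0 - Phi(sigma * math.sqrt(1.0 - rho) / math.sqrt(2.0)))

G = 100
rho = 1.0 - 1.0 / G
sigma_opt = 2.16 / math.sqrt(1.0 - rho ** 2)   # optimal sigma for pseudo random numbers
rate = acceptance_rate(sigma_opt, rho)
print(rate)
```

The argument of $\Phi$ reduces to $2.16/\sqrt{2(1+\rho)}$, which tends to $2.16/2$ as $\rho\to1$, so the rate depends only weakly on $G$ once $G$ is moderately large.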
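Proof of Lemma 6. The joint posterior density of $\theta$ and $z$ in (8) can be written as
$\pi(\theta,z)=\pi(\theta)\pi(z|\theta)$ with $\pi(z|\theta)=\mathcal{N}(\zeta_T(\theta)/2,\zeta_T(\theta))$. We have
$$
\begin{aligned}
\mathrm{Cov}(h(\theta),z|y)
&=E[h(\theta)z|y]-E[h(\theta)|y]\,E[z|y] \\
&=\tfrac{1}{2}E_\pi[h(\theta)\zeta_T(\theta)]-\tfrac{1}{2}E_\pi[h(\theta)]\,E_\pi[\zeta_T(\theta)] \\
&=\tfrac{1}{2}E_\pi[(h(\theta)-\mu_h)\,\zeta_T(\theta)], \qquad \text{with } \mu_h=E_\pi[h(\theta)] \\
&\approx\tfrac{1}{2}E_\pi\Big[(h(\theta)-\mu_h)\Big(\zeta_T(\bar\theta)+\frac{\partial\zeta_T(\bar\theta)}{\partial\theta'}(\theta-\bar\theta)\Big)\Big] \\
&=\tfrac{1}{2}E_\pi\Big[(h(\theta)-\mu_h)\,\frac{\partial\zeta_T(\bar\theta)}{\partial\theta'}(\theta-\bar\theta)\Big] \\
&\approx\tfrac{1}{2}E_\pi\Big[\Big(h(\bar\theta)+\frac{\partial h(\bar\theta)}{\partial\theta'}(\theta-\bar\theta)-\mu_h\Big)\frac{\partial\zeta_T(\bar\theta)}{\partial\theta'}(\theta-\bar\theta)\Big] \\
&=\tfrac{1}{2}E_\pi\Big[(\theta-\bar\theta)'\,\frac{\partial h(\bar\theta)}{\partial\theta}\,\frac{\partial\zeta_T(\bar\theta)}{\partial\theta'}\,(\theta-\bar\theta)\Big] \\
&=\frac{1}{2N_T^{2\varpi}}\,T\times E_\pi\Big[(\theta-\bar\theta)'\,\frac{\partial h(\bar\theta)}{\partial\theta}\,\frac{\partial\eta_T(\bar\theta)}{\partial\theta'}\,(\theta-\bar\theta)\Big] \\
&\approx\frac{v}{2N_T^{2\varpi}}.
\end{aligned}
$$
$$
\begin{aligned}
V(z|y)&=E[z^2|y]-(E[z|y])^2=E_\pi\Big[\frac{\zeta_T(\theta)^2}{4}+\zeta_T(\theta)\Big]-\frac{1}{4}\big(E_\pi[\zeta_T(\theta)]\big)^2 \\
&=\frac{1}{4}V_\pi(\zeta_T(\theta))+E_\pi(\zeta_T(\theta))\approx\frac{T}{4N_T^{4\varpi}}\,v_\eta+\frac{T}{N_T^{2\varpi}}\,\mu_\eta.
\end{aligned}
$$
$$
V(h(\theta)|y)=V_\pi(h(\theta))\approx\frac{v_h}{T}.
$$
Hence,
$$
\mathrm{Corr}(h(\theta),z|y)=\frac{\mathrm{Cov}(h(\theta),z|y)}{\sqrt{V(h(\theta)|y)\,V(z|y)}}
\approx\frac{\dfrac{v}{2N_T^{2\varpi}}}{\sqrt{\dfrac{v_h}{T}\Big(\dfrac{T}{4N_T^{4\varpi}}v_\eta+\dfrac{T}{N_T^{2\varpi}}\mu_\eta\Big)}}
=\frac{v}{\sqrt{v_h\big(v_\eta+4N_T^{2\varpi}\mu_\eta\big)}}\to0
$$
as $T\to\infty$.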
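Appendix B Derivation of the expression (13) for Computing Time
The average number of particles required in each MCMC iteration to give the same accuracy
in terms of variance as $M$ iid iterates $\theta_1,\dots,\theta_M$ from $\pi(\theta)$ is proportional to
$$
\frac{1}{M}\sum_{i=1}^{M}\sum_{k=1}^{G}N_k(\theta_i)\,\mathrm{IF}(\sigma,\rho)
=\frac{1}{M}\sum_{i=1}^{M}\sum_{k=1}^{G}\frac{G^{\frac{1}{2\varpi}}\,\gamma_{(k)}^{1/\varpi}(\theta_i)}{\sigma^{1/\varpi}}\,\mathrm{IF}(\sigma,\rho)
\to\Big(G^{\frac{1}{2\varpi}}\sum_{k=1}^{G}\bar\gamma_{(k)}^{1/\varpi}\Big)\frac{\mathrm{IF}(\sigma,\rho)}{\sigma^{1/\varpi}}
$$
as $M\to\infty$, where $\bar\gamma_{(k)}^{1/\varpi}=E_{\theta\sim\pi}\big(\gamma_{(k)}^{1/\varpi}(\theta)\big)$. The factor in the brackets is independent of $\sigma^2$,
hence the computing time is proportional to $\mathrm{CT}=\mathrm{IF}(\sigma,\rho)/\sigma^{1/\varpi}$.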