
Projected Stein Variational Gradient Descent

Peng Chen 1 Omar Ghattas 1

Abstract

The curse of dimensionality is a critical challenge in Bayesian inference for high-dimensional parameters. In this work, we address this challenge by developing a projected Stein variational gradient descent (pSVGD) method, which projects the parameters into a subspace that is adaptively constructed using the gradient of the log-likelihood, and applies SVGD to the much lower-dimensional coefficients of the projection. We provide an upper bound for the projection error with respect to the posterior and demonstrate the accuracy (compared to SVGD) and scalability of pSVGD with respect to the number of parameters, samples, data points, and processor cores.

1. Introduction

Given observation data for a system with unknown parameters, Bayesian inference provides an optimal probability framework for learning the parameters by updating their prior distribution to a posterior distribution. However, many conventional methods for solving high-dimensional Bayesian inference problems face the curse of dimensionality, i.e., the computational complexity grows rapidly, often exponentially, with respect to (w.r.t.) the number of parameters. To address the curse of dimensionality, the intrinsic properties of the posterior distribution, such as its smoothness, sparsity, and intrinsic low-dimensionality, have been exploited to reduce the parameter correlation and develop efficient methods whose complexity grows slowly or remains the same with increasing dimension. By exploiting the geometry of the log-likelihood function, accelerated Markov chain Monte Carlo (MCMC) methods have been developed to reduce the sample correlation or increase effective sample size independent of the dimension (Girolami & Calderhead, 2011; Martin et al., 2012; Petra et al., 2014; Constantine et al., 2016; Cui et al., 2016; Beskos et al., 2017). Nevertheless, these random and essentially serial sampling methods

1 Oden Institute for Computational Engineering and Sciences, The University of Texas at Austin, Austin, USA. Correspondence to: Peng Chen <[email protected]>.

remain prohibitive for large-scale inference problems with expensive likelihoods. Deterministic methods using sparse quadratures (Schwab & Stuart, 2012; Schillings & Schwab, 2013; Chen & Schwab, 2015; 2016) were shown to converge rapidly with dimension-independent rates for problems with smooth and sparse posteriors. However, for posteriors lacking smoothness or sparsity, the convergence deteriorates significantly, despite incorporation of Hessian-based transformations (Schillings & Schwab, 2016; Chen et al., 2017).

Transport-based variational inference is another type of deterministic method that seeks a transport map in a function space (represented by, e.g., polynomials, kernels, or neural networks) that pushes the prior to the posterior by minimizing the difference between the transported prior and the posterior, measured in, e.g., Kullback–Leibler divergence (Marzouk et al., 2016; Liu & Wang, 2016; Blei et al., 2017; Bigoni et al., 2019; Detommaso et al., 2019). In particular, kernel-based Stein variational methods, using gradient-based (SVGD) (Liu & Wang, 2016; Chen et al., 2018; Liu & Zhu, 2018) and Hessian-based (SVN) (Detommaso et al., 2018; Wang & Li, 2020) optimization methods, are shown to achieve fast convergence in relatively low dimensions. Nonetheless, the convergence and accuracy of these methods deteriorate in high dimensions due to the curse of dimensionality in the kernel representation. This can be partially addressed by a localized SVGD on Markov blankets, which relies on a conditional independence structure of the target distribution (Zhuo et al., 2018; Wang et al., 2018), or by a projected SVN that projects the parameters into a low-dimensional subspace (Chen et al., 2019b).

Contributions: Here, we propose, analyze, and apply a projected SVGD method to tackle the curse of dimensionality for high-dimensional nonlinear Bayesian inference problems. The method relies on the fundamental property that the data typically inform only a low-dimensional subspace of the high-dimensional parameters; see, e.g., (Bashir et al., 2008; Bui-Thanh & Ghattas, 2012; Bui-Thanh et al., 2013; Spantini et al., 2015; Isaac et al., 2015; Cui et al., 2016; Chen et al., 2017; 2019a; Chen & Ghattas, 2019; Bigoni et al., 2019) and references therein. Specifically, our contributions are: (1) we perform dimension reduction by projecting the parameters onto a low-dimensional subspace constructed using the gradient of the log-likelihood, and push the prior samples of the projection coefficients to their posterior by pSVGD;
(2) we prove the equivalence of the projected transport in the coefficient space and the transport in the projected parameter space; (3) we provide an upper bound for the projection error committed in the posterior (with the likelihood approximated by an optimal profile function) in terms of the eigenvalues of a gradient information matrix; (4) we propose adaptive and parallel algorithms to efficiently approximate the optimal profile function and the gradient information matrix; and (5) we demonstrate the accuracy (compared to SVGD) and scalability of pSVGD w.r.t. the number of parameters, samples, data points, and processor cores for a high-dimensional nonlinear Bayesian inference problem.

The major differences of this work compared to pSVN (Chen et al., 2019b) are as follows: (1) pSVGD uses only gradient information of the log-likelihood, which is available for many models, while pSVN requires Hessian information, which may not be available for complex models and codes in practical applications; (2) the upper bound for the projection error w.r.t. the posterior in terms of eigenvalues is much sharper than that for pSVN in terms of projection error in the parameters; (3) here we theoretically prove the equivalence of the projected transport for the coefficients and the transport for the projected parameters; (4) we also investigate the convergence of pSVGD w.r.t. the number of parameters and the scalability of pSVGD w.r.t. the number of data points.

2. Preliminaries

2.1. Bayesian Inference

Let x ∈ R^d denote a random parameter of dimension d ∈ N, which has a continuous prior density p_0 : R^d → R. Let y = {y_i}_{i=1}^s denote a set of i.i.d. observation data. Let f(x) := ∏_{i=1}^s p(y_i | x) denote, up to a multiplicative constant, a continuous likelihood of y at given x. Then the posterior density of the parameter x conditioned on the data y, denoted p : R^d → R, is given by Bayes' rule as

p(x) = (1/Z) f(x) p_0(x),   (1)

where Z is the normalization constant defined as

Z = ∫_{R^d} f(x) p_0(x) dx,   (2)

whose computation is typically intractable, especially for large d and for f with a complex geometry in R^d, e.g., multimodality, rich local behavior, etc. The central task of Bayesian inference is to draw samples of the parameter x from its posterior distribution with density p, and to compute some statistical quantity of interest, e.g., the mean and variance of the parameter x or of some function of x.
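Only the unnormalized posterior and its gradient, ∇_x log p(x) = ∇_x log f(x) + ∇_x log p_0(x), are needed by the methods below, since Z cancels. A minimal sketch of how these quantities might be encoded, assuming a Gaussian prior N(0, Σ_x) and a toy log-likelihood (`prior_prec`, `log_likelihood`, and the other names are illustrative placeholders, not the paper's code):

```python
import numpy as np

# Hypothetical problem pieces (placeholders): a Gaussian prior N(0, Sigma_x)
# with precision Sigma_x^{-1}, and a generic log-likelihood with its gradient.
d = 10
prior_prec = np.eye(d)                     # Sigma_x^{-1}, assumed known

def log_likelihood(x):                     # log f(x), up to a constant (toy choice)
    return -0.5 * np.sum((x - 1.0) ** 2)

def grad_log_likelihood(x):                # grad_x log f(x) for the toy choice above
    return -(x - 1.0)

def log_posterior(x):
    # log p(x) = log f(x) + log p0(x) - log Z; the constant log Z is dropped.
    return log_likelihood(x) - 0.5 * x @ (prior_prec @ x)

def grad_log_posterior(x):
    # grad_x log p(x) = grad_x log f(x) + grad_x log p0(x); Z drops out.
    return grad_log_likelihood(x) - prior_prec @ x
```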

2.2. Stein Variational Gradient Descent (SVGD)

SVGD is one type of variational inference method that seeks an approximation of the posterior density p by a function q* in a predefined function set Q, realized by minimizing the Kullback–Leibler (KL) divergence that measures the difference between the two densities, i.e.,

q* = arg min_{q ∈ Q} D_KL(q|p),   (3)

where D_KL(q|p) = E_{x∼q}[log(q/p)], i.e., the average of log(q/p) with respect to the density q, which vanishes when q = p. In particular, a transport-based function set is considered as Q = {T_♯ p_0 : T ∈ T}, where T_♯ is a pushforward map that pushes the prior density to a new density q := T_♯ p_0 through an invertible transport map T(·) : R^d → R^d in a space T. Let T be given by

T(x) = x + ε φ(x),   (4)

where φ : R^d → R^d is a differentiable perturbation map w.r.t. x, and ε > 0 is small enough so that T is invertible. It is shown in (Liu & Wang, 2016) that

∇_ε D_KL(T_♯ p_0 | p)|_{ε=0} = −E_{x∼p_0}[trace(A_p φ(x))],   (5)

where A_p is the Stein operator given by

A_p φ(x) = ∇_x log p(x) φ(x)^T + ∇_x φ(x).   (6)

Moreover, by choosing the space T = (H_d)^d = H_d × ··· × H_d, a tensor product of a reproducing kernel Hilbert space (RKHS) H_d with kernel k(·,·) : R^d × R^d → R, the steepest descent direction of D_KL(T_♯ p_0 | p) w.r.t. φ, denoted φ*_{p_0,p}, is explicitly given by

φ*_{p_0,p}(·) = E_{x∼p_0}[A_p k(x, ·)],   (7)

where

A_p k(x, ·) = ∇_x log p(x) k(x, ·) + ∇_x k(x, ·).   (8)

Following this optimization perspective, we define a sequence of transport maps T_ℓ for ℓ = 0, 1, . . . , as

T_ℓ(x) = x + ε_ℓ φ*_{p_ℓ,p}(x),   (9)

and p_{ℓ+1} := (T_ℓ)_♯ p_ℓ, the density pushed forward from p_ℓ by the map (T_ℓ)_♯. The function φ*_{p_ℓ,p} is the steepest descent direction of D_KL(T_♯ p_ℓ | p), which is given as in (7) with p_0 replaced by p_ℓ. The learning rate ε_ℓ is suitably updated in each iteration. Convergence of p_ℓ to the posterior p as ℓ → ∞ is studied in (Liu, 2017). To this end, the sequence of transport maps provides a practical algorithm for approximately drawing samples from the posterior: first draw samples x^0_1, . . . , x^0_N from the prior p_0, then update them subsequently as

x^{ℓ+1}_m = x^ℓ_m + ε_ℓ φ̂*_{p_ℓ,p}(x^ℓ_m),   m = 1, . . . , N,   (10)


where φ̂*_{p_ℓ,p}(x^ℓ_m), given by

(1/N) ∑_{n=1}^N [ ∇_{x^ℓ_n} log p(x^ℓ_n) k(x^ℓ_n, x^ℓ_m) + ∇_{x^ℓ_n} k(x^ℓ_n, x^ℓ_m) ],   (11)

is a sample average approximation (SAA) of φ*_{p_ℓ,p}(x^ℓ_m) with the samples x^ℓ_1, . . . , x^ℓ_N. The first term of (11) drives the samples to regions of high posterior density, while the second term acts as a repulsive force that pushes the samples away from each other to avoid collapse to local modes of the posterior. Note that while the posterior p defined in (1) involves the intractable normalization constant Z, ∇_x log p(x) does not, thus making the sample update by (11) computationally efficient. For the kernel k, a common choice is the Gaussian kernel as in (Liu & Wang, 2016),

k(x, x') = exp( −||x − x'||_2^2 / h ),   (12)

where h is the bandwidth, e.g., h = med^2 / log(N), with med representing the median of the pairwise sample distances.

The repulsive force term ∇_x k in (11) plays a critical role in determining the empirical distribution of the samples. However, the kernel function in general suffers from the curse of dimensionality (Ramdas et al., 2015; Zhuo et al., 2018; Wang et al., 2018), which renders the repulsive term ineffective in high dimensions and leads to samples that are not representative of the posterior; as observed in (Zhuo et al., 2018; Wang et al., 2018), the SVGD sample variance becomes very inaccurate for large d.
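For concreteness, the following is a minimal NumPy sketch of one SVGD iteration implementing (10)–(12), assuming a user-supplied `grad_log_posterior` (e.g., as in the earlier sketch); the function name, the fixed step size, and the small bandwidth safeguard are illustrative choices, not the authors' implementation:

```python
import numpy as np

def svgd_step(X, grad_log_posterior, eps=1e-2):
    """One SVGD update (10)-(11) with the Gaussian kernel (12).

    X: (N, d) array of particles; grad_log_posterior: maps (d,) -> (d,).
    """
    N = X.shape[0]
    # Pairwise differences, squared distances, and bandwidth h = med^2 / log N.
    diff = X[:, None, :] - X[None, :, :]              # (N, N, d), x_n - x_m
    sq_dists = np.sum(diff ** 2, axis=-1)             # (N, N)
    h = np.median(sq_dists) / np.log(N) + 1e-12
    K = np.exp(-sq_dists / h)                         # k(x_n, x_m)
    # Gradient of the kernel w.r.t. x_n: -2/h * k(x_n, x_m) * (x_n - x_m).
    grad_K = -2.0 / h * K[:, :, None] * diff          # (N, N, d)
    G = np.stack([grad_log_posterior(x) for x in X])  # (N, d)
    # Sample average approximation (11): attraction + repulsion, then step (10).
    phi = (K.T @ G + grad_K.sum(axis=0)) / N          # (N, d)
    return X + eps * phi
```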

3. Projected SVGD

3.1. Dimension reduction by projection

Let ψ_1, . . . , ψ_r, ψ_{r+1}, . . . , ψ_d ∈ R^d denote an orthogonal basis that spans the parameter space R^d, where ψ_1, . . . , ψ_r span a subspace X_r of dimension r and ψ_{r+1}, . . . , ψ_d span its complement subspace X_⊥. We define a linear projector of rank r, P_r : R^d → R^d, as

P_r x := ∑_{i=1}^r ψ_i ψ_i^T x = Ψ_r w,   ∀x ∈ R^d,   (13)

where Ψ_r := (ψ_1, . . . , ψ_r) ∈ R^{d×r} represents the projection matrix and w := (w_1, . . . , w_r)^T ∈ R^r is the coefficient vector with elements w_i := ψ_i^T x for i = 1, . . . , r. By this projection, we can decompose x ∈ R^d as

x = x_r + x_⊥,   (14)

where the projected parameter x_r := P_r x and its complement x_⊥ := P_⊥ x, with P_⊥ := I_d − P_r and identity matrix I_d ∈ R^{d×d}. By defining Ψ_⊥ := (ψ_{r+1}, . . . , ψ_d) ∈ R^{d×(d−r)} and w_⊥ := (w_{r+1}, . . . , w_d)^T ∈ R^{d−r}, with elements w_i = ψ_i^T x for i = r+1, . . . , d, we have P_⊥ x = Ψ_⊥ w_⊥. For any x ∈ R^d, with the coefficient vector w ∈ R^r of the projection (13), and any v_⊥ ∈ R^{d−r}, we can decompose the prior density as

p_0(Ψ_r w + Ψ_⊥ v_⊥) = p_0^r(w) p_0^⊥(v_⊥|w),   (15)

where the marginal density p_0^r : R^r → R is defined as

p_0^r(w) = ∫_{R^{d−r}} p_0(Ψ_r w + Ψ_⊥ v_⊥) dv_⊥,   (16)

and the conditional density p_0^⊥(·|w) : R^{d−r} → R as

p_0^⊥(v_⊥|w) = p_0(Ψ_r w + Ψ_⊥ v_⊥) / p_0^r(w).   (17)
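Once a basis Ψ_r is available, the projection (13)–(14) reduces to a few matrix products. A small sketch, assuming an orthonormal `Psi_r` whose columns are ψ_1, . . . , ψ_r (the names and the random example basis are illustrative):

```python
import numpy as np

def decompose(x, Psi_r):
    """Split x = P_r x + x_perp as in (13)-(14), with P_r = Psi_r Psi_r^T."""
    w = Psi_r.T @ x          # coefficients w_i = psi_i^T x, i = 1, ..., r
    x_r = Psi_r @ w          # projected parameter x_r = P_r x = Psi_r w
    x_perp = x - x_r         # complement x_perp = (I - P_r) x
    return w, x_r, x_perp

# Example with a random orthonormal basis of rank r in dimension d.
d, r = 100, 5
rng = np.random.default_rng(0)
Psi_r, _ = np.linalg.qr(rng.standard_normal((d, r)))  # Psi_r^T Psi_r = I_r
x = rng.standard_normal(d)
w, x_r, x_perp = decompose(x, Psi_r)
assert np.allclose(x, Psi_r @ w + x_perp)
```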

We seek a profile function g : R^d → R such that g(P_r x) is a good approximation of the likelihood function f(x) for a given projector P_r. We define a particular profile function g* : R^d → R as the marginalization, or conditional expectation, of f w.r.t. the conditional density p_0^⊥, i.e.,

g*(P_r x) = ∫_{R^{d−r}} f(P_r x + Ψ_⊥ v_⊥) p_0^⊥(v_⊥|w) dv_⊥.   (18)

The following proposition, established in (Zahm et al., 2018), shows that g* is the optimal profile function w.r.t. the KL divergence between the posterior density p defined in (1) and the projected posterior density p_r : R^d → R defined as

p_r(x) := (1/Z_r) g(P_r x) p_0(x),   (19)

where Z_r := E_{x∼p_0}[g(P_r x)].

Proposition 1. Given any projector P_r defined in (13), let p_r and p_r^* denote the projected posterior densities defined in (19) for any given profile function g and for the profile function g* defined in (18), respectively. Then we have

D_KL(p | p_r^*) ≤ D_KL(p | p_r),   (20)

where p is the posterior density defined in (1).

Remark 1. Once the projector P_r is provided, g* defined in (18) is optimal by Proposition 1. However, practical evaluation of g* is challenging because (1) the basis Ψ_⊥ is typically not available, and (2) it involves two (d−r)-dimensional integrals in (16) and (18), whose evaluation may be expensive. Several approximations were introduced in (Zahm et al., 2018). We propose an efficient approximation in Section 3.5 that is most suitable for pSVGD.

3.2. Construction of the basis

The basis ψ_1, . . . , ψ_r introduced in the last section plays a critical role in the accuracy of the approximation of the posterior density p by the optimal projected posterior density p_r^*. To construct a good basis, we first make the following assumption on the prior density.


Assumption 1. The prior density satisfies p_0 ∝ exp(−V − W) for two functions V, W : R^d → R such that:

(1) V is twice continuously differentiable, and there exists a symmetric positive definite matrix Γ ∈ R^{d×d} such that

z^T ∇^2 V(x) z ≥ z^T Γ z

for any z ∈ R^d at any x ∈ R^d;

(2) W is bounded in R^d, and there exists a constant γ ≥ 1 such that the maximum difference is bounded by

sup_{x∈R^d} W(x) − inf_{x∈R^d} W(x) ≤ log(γ).

Remark 2. Assumption 1 (1) implies that the function V is at least quadratically convex, so that the density decays rapidly away from the origin. Assumption 1 (2) guarantees that c exp(−V(x)) ≤ p_0(x) ≤ C exp(−V(x)) for some constants 0 < c < C. Priors satisfying the two assumptions are Gaussian or sub-Gaussian, whose tails decay at least as fast as those of a Gaussian. For example, a parameter x ∈ R^d with Gaussian distribution x ∼ N(x̄, Σ) satisfies Assumption 1 with Γ = Σ^{−1} and γ = 1.

By H ∈ R^{d×d} we denote a gradient information matrix, defined as the average of the outer product of the gradient of the log-likelihood w.r.t. the posterior, i.e.,

H = ∫_{R^d} (∇_x log f(x)) (∇_x log f(x))^T p(x) dx.   (21)

By (λ_i, ψ_i)_{i=1}^r we denote the generalized eigenpairs of (H, Γ), with Γ given in Assumption 1, corresponding to the r largest eigenvalues λ_1 ≥ ··· ≥ λ_r, i.e.,

H ψ_i = λ_i Γ ψ_i.   (22)

In particular, for the development of pSVGD, we require ψ_i^T ψ_j = δ_ij instead of the conventional ψ_i^T Γ ψ_j = δ_ij, where δ_ij = 1 if i = j and zero otherwise. We make the following important observation: by the definition of H, the eigenvalue λ_i measures the sensitivity of the data w.r.t. the parameters along the direction ψ_i, i.e., the data mostly inform the parameters in the directions ψ_i corresponding to large eigenvalues λ_i. For small λ_i close to zero, the variation of the likelihood f in the direction ψ_i is negligible.

More rigorously, the following proposition, as shown in (Zahm et al., 2018), provides an upper bound for the projection error committed in the optimally projected posterior p_r^* in terms of the eigenvalues λ_i for r < i ≤ d.

Proposition 2. Under Assumption 1, for the projector P_r defined in (13) with the basis ψ_1, . . . , ψ_r taken as the eigenvectors of the generalized eigenvalue problem (22), and for any continuously differentiable likelihood function f satisfying E_{x∼p_0}[(∇_x log f(x))^T Γ^{−1} ∇_x log f(x)] < ∞, we have

D_KL(p | p_r^*) ≤ (γ/2) ∑_{i=r+1}^d λ_i.   (23)

To distinguish different dimensions, we consider a relative projection error defined as E_r := ∑_{i=r+1}^d λ_i / ∑_{i=1}^d λ_i.

Remark 3. Proposition 2 implies that if the eigenvalues λ_i decay rapidly, then the projected posterior p_r^* is close to the true posterior p in KL divergence for a small r ≪ d. Moreover, as r → d, p_r^* → p if E_r → 0, which is usually true because of the essential ill-posedness of the inference problem due to parameter correlation, i.e., the observation data cannot inform all parameter dimensions if d is large.

Remark 4. Construction of the basis by H is challenging since H involves an integration w.r.t. the posterior distribution. We propose an algorithm in Section 3.5 for an adaptive approximation of H and construction of the basis.
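In practice, the truncation rank r can be read off the spectrum of (22), e.g., by thresholding λ_{r+1} or by the relative projection error E_r. A small sketch (the threshold value mirrors the choice reported in Section 4 and is otherwise an assumption):

```python
import numpy as np

def choose_rank(eigvals, tol=1e-2):
    """Pick r from eigenvalues lambda_1 >= ... >= lambda_d of (22).

    Returns the smallest r with lambda_{r+1} < tol, and the relative
    projection error E_r = sum_{i>r} lambda_i / sum_i lambda_i.
    """
    lam = np.sort(np.asarray(eigvals))[::-1]
    below = np.where(lam < tol)[0]
    r = int(below[0]) if below.size > 0 else lam.size
    E_r = lam[r:].sum() / lam.sum()
    return r, E_r

# Example: a rapidly decaying spectrum.
lam = 10.0 ** (-0.5 * np.arange(50))
r, E_r = choose_rank(lam, tol=1e-2)
print(r, E_r)   # rank with lambda_{r+1} < 1e-2, and the truncated tail fraction
```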

3.3. Bayesian inference of the coefficient

For any x ∈ R^d with the decomposition (14), or equivalently

x = Ψ_r w + Ψ_⊥ w_⊥,   (24)

by the density decomposition (15) we have

p_0(x) = p_0^r(w) p_0^⊥(w_⊥|w).   (25)

For this decomposition, the projected posterior for x ∈ R^d in (19) can be written for w ∈ R^r and w_⊥ ∈ R^{d−r} as

p_r(x) = (1/Z_r) g(Ψ_r w) p_0^r(w) p_0^⊥(w_⊥|w).   (26)

We define a new function of the coefficient vector w as

p^r(w) := (1/Z_r) g(Ψ_r w) p_0^r(w),   (27)

which is indeed a density function of w, because by definition

Z_r = E_{x∼p_0}[g(P_r x)] = E_{w∼p_0^r}[g(Ψ_r w)],   (28)

where in the second equality we used the property that the randomness of x in g(P_r x) can be fully represented by that of w in g(Ψ_r w), due to P_r x = Ψ_r w. Therefore, we can view p_0^r and p^r as the prior and posterior densities of w, and g(Ψ_r w) as the likelihood function. Then, by the density decomposition (26), or equivalently

p_r(x) = p^r(w) p_0^⊥(w_⊥|w),   (29)

sampling x from the projected posterior distribution with density p_r : R^d → R can be conceptually realized by (1) sampling w from the posterior distribution with density p^r : R^r → R, which amounts to solving a Bayesian inference problem for the coefficient vector w of dimension r, (2) sampling w_⊥ from the conditional distribution with density p_0^⊥(·|w) : R^{d−r} → R, and (3) assembling x by the decomposition (24). However, this sampling scheme remains impractical since Ψ_⊥ is in general not available or very expensive to compute, especially for large d. We defer the sampling of w_⊥ to Section 3.5, which describes a practical algorithm, and focus here on sampling w from the posterior p^r by projected SVGD.


3.4. Projected SVGD

To sample from the posterior distribution of the coefficient vector w with density given by Bayes' rule in (27), we employ the SVGD method presented in Section 2.2 in the coefficient space R^r, with r < d. Specifically, we define a projected transport map T^r : R^r → R^r as

T^r(w) = w + ε φ^r(w),   (30)

with a differentiable perturbation map φ^r : R^r → R^r and a small enough ε > 0 such that T^r is invertible. Following the argument in (Liu & Wang, 2016) for the SVGD result (5), we obtain

∇_ε D_KL(T^r_♯ p_0^r | p^r)|_{ε=0} = −E_{w∼p_0^r}[trace(A_{p^r} φ^r(w))],   (31)

where A_{p^r} is the Stein operator given by

A_{p^r} φ^r(w) = ∇_w log p^r(w) φ^r(w)^T + ∇_w φ^r(w).   (32)

To find the perturbation map φ^r, we consider the space φ^r ∈ (H_r)^r = H_r × ··· × H_r, a tensor product of the RKHS H_r defined with kernel k^r(·,·) : R^r × R^r → R. Then the steepest descent direction of D_KL(T^r_♯ p_0^r | p^r) w.r.t. φ^r, denoted φ^{r,*}_{p_0^r, p^r}, is explicitly given by

φ^{r,*}_{p_0^r, p^r}(·) = E_{w∼p_0^r}[A_{p^r} k^r(w, ·)],   (33)

where

A_{p^r} k^r(w, ·) = ∇_w log p^r(w) k^r(w, ·) + ∇_w k^r(w, ·).   (34)

Following the same optimization perspective as in the first step, we can define a sequence of transport maps T^r_ℓ for ℓ = 0, 1, . . . ,

T^r_ℓ(w) = w + ε_ℓ φ^{r,*}_{p^r_ℓ, p^r}(w),   (35)

and push forward the density p^r_ℓ as p^r_{ℓ+1} := (T^r_ℓ)_♯ p^r_ℓ. Convergence of the density p^r_ℓ to the posterior density p^r as ℓ → ∞ can be obtained by following the same argument as in (Liu, 2017). Therefore, to draw samples from the posterior p^r, we can first draw samples w^0_1, . . . , w^0_N from the prior with density p_0^r, and subsequently update them as

w^{ℓ+1}_m = w^ℓ_m + ε_ℓ φ̂^{r,*}_{p^r_ℓ, p^r}(w^ℓ_m),   m = 1, . . . , N,   (36)

where φ̂^{r,*}_{p^r_ℓ, p^r}(w^ℓ_m) is an SAA of φ^{r,*}_{p^r_ℓ, p^r}(w^ℓ_m) with the samples w^ℓ_1, . . . , w^ℓ_N, i.e., φ̂^{r,*}_{p^r_ℓ, p^r}(w^ℓ_m) is given by

(1/N) ∑_{n=1}^N [ ∇_{w^ℓ_n} log p^r(w^ℓ_n) k^r(w^ℓ_n, w^ℓ_m) + ∇_{w^ℓ_n} k^r(w^ℓ_n, w^ℓ_m) ],   (37)

where the practical evaluation of the gradient of the log-posterior ∇_{w^ℓ_n} log p^r(w^ℓ_n), with the posterior density p^r defined in (27), is presented in Section 3.5.

The kernel k^r can be specified as in (12), i.e.,

k^r(w, w') = exp( −||w − w'||_2^2 / h ).   (38)

To account for the data impact in the different directions ψ_1, . . . , ψ_r, as informed by the eigenvalues of (22), we propose to replace ||w − w'||_2^2 in (38) by (w − w')^T (Λ + I)(w − w'), with Λ = diag(λ_1, . . . , λ_r) accounting for the likelihood and I for the prior.
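A sketch of the coefficient-space kernel (38) with the proposed (Λ + I)-weighted metric is given below; the bandwidth rule mirrors (12) and, together with the function names, is an illustrative assumption:

```python
import numpy as np

def weighted_kernel(W, lam):
    """Kernel matrix k^r(w_n, w_m) of (38), with ||.||_2^2 replaced by the
    (Lambda + I)-weighted metric, Lambda = diag(lambda_1, ..., lambda_r).

    W: (N, r) coefficient samples; lam: (r,) eigenvalues of (22).
    Returns the kernel matrix K and the gradients dK[n, m] = grad_{w_n} k^r(w_n, w_m).
    """
    N = W.shape[0]
    M = lam + 1.0                                   # diagonal of Lambda + I
    diff = W[:, None, :] - W[None, :, :]            # (N, N, r), w_n - w_m
    sq = np.einsum('nmr,r,nmr->nm', diff, M, diff)  # weighted squared distances
    h = np.median(sq) / np.log(N) + 1e-12           # bandwidth rule as in (12)
    K = np.exp(-sq / h)
    dK = -2.0 / h * K[:, :, None] * (diff * M)      # gradient w.r.t. w_n
    return K, dK
```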

The following theorem, proved in Appendix A, establishes the connection between pSVGD in the coefficient space and SVGD in the projected parameter space.

Theorem 1. Under the condition p_0^r(w) = p_0(P_r x) for Ψ_r w = P_r x, with the kernel k^r(·,·) : R^r × R^r → R defined in (38) and k(·,·) : R^d × R^d → R defined in (12), at the optimal profile function g* in (18), we have the equivalence of the projected transport map T^r for the coefficient w in the steepest direction (33) and the transport map T for the projected parameter P_r x in the steepest direction (7), i.e.,

T^r(w) = Ψ_r^T T(P_r x).   (39)

In particular, we have

∇_w log p^r(w) = Ψ_r^T ∇_x log p_r(P_r x),   (40)

with the posterior p^r defined in (27) and the posterior p_r defined in (19).

Remark 5. The equivalence (39) implies that pSVGD only updates the projected parameter P_r x, i.e., the component of the parameter x in the subspace spanned by the basis ψ_1, . . . , ψ_r, while leaving the parameter in the complement subspace unchanged. We propose an adaptive scheme (Algorithm 3) that also updates the parameter in the complement subspace.

Remark 6. The condition p_0^r(w) = p_0(P_r x) is satisfied for p_0 such that its marginal for P_r x, i.e., p_0^r defined in (16), is the same as the anchored density p_0(P_r x + Ψ_⊥ v_⊥) at v_⊥ = 0. In particular, Gaussian priors satisfy this condition, since p_0(P_r x + Ψ_⊥ v_⊥) = p_0(P_r x) p_0(Ψ_⊥ v_⊥), where both p_0(P_r x) and p_0(Ψ_⊥ v_⊥) are Gaussian densities.

3.5. Practical algorithms

Sampling from the projected posterior p_r^*(x) defined in (19), with the optimal profile function g* defined in (18), involves, by the decomposition (29), sampling w from the posterior p^r by pSVGD and sampling w_⊥ from the conditional distribution with density p_0^⊥(w_⊥|w). This sampling is impractical because of two challenges, summarized as follows:

1. The d − r basis vectors Ψ_⊥ = (ψ_{r+1}, . . . , ψ_d) for the complement subspace, used in (16) for the marginal prior density p_0^r and in (18) for the optimal profile function g*, are generally not available, or their computation by solving the generalized eigenvalue problem (22) is too expensive for large dimension d. Meanwhile, both (16) and (18) involve high-dimensional integrals.


2. The matrix H defined in (21) for the construction of the basis ψ_1, . . . , ψ_r involves an integration w.r.t. the posterior distribution of the parameter x. However, drawing samples from the posterior to evaluate this integral is precisely the central task of the Bayesian inference.

Challenge 1 can be practically addressed by using the fundamental property pointed out in Section 3.2, i.e., the variation of the likelihood f in the complement subspace X_⊥ is negligible, which means that the posterior distribution meaningfully differs from the prior distribution only in the subspace X_r. Therefore, to draw samples from the posterior, we only need to proceed by the abstract Algorithm 1.

Algorithm 1 An abstract sampling procedure by projection

1. Draw samples from the prior distribution, e.g., x^0_n ∼ p_0, n = 1, . . . , N.

2. Make the parameter decomposition by projection: x^0_n = P_r x^0_n + x_{⊥,n}, n = 1, . . . , N.

3. Push the projected parameters P_r x^0_n in the subspace X_r to the projected posterior distribution (19), i.e., P_r x^ℓ_n ∼ p_r(P_r ·) as ℓ → ∞, n = 1, . . . , N.

4. Assemble the samples to approximate the posterior: x^ℓ_n = P_r x^ℓ_n + x_{⊥,n}, n = 1, . . . , N.

Two important factors in Algorithm 1 have to be specified to make it practically operational: (i) the projector P_r in step 2, and (ii) the transport map in step 3.

We address (ii) first, by proposing a practical pSVGD algorithm given the projector P_r with basis Ψ_r. By the fact that the likelihood f has negligible variation in the subspace X_⊥, for any sample x^ℓ_n, n = 1, . . . , N, we can approximate the optimal profile function g* in (18) as

g*(P_r x^ℓ_n) = f(P_r x^ℓ_n + x_{⊥,n}),   (41)

which is equivalent to using one sample v_⊥, such that Ψ_⊥ v_⊥ = x_{⊥,n} = x^0_n − P_r x^0_n, to approximate the integral (18). Similarly, we can specify the projected prior density at w^ℓ_n as

p_0^r(w^ℓ_n) = p_0(Ψ_r w^ℓ_n + x_{⊥,n}).   (42)

By this specification, we avoid computing not only Ψ_⊥ but also the high-dimensional integrals w.r.t. v_⊥ in both (16) and (18), thus addressing challenge 1. Moreover, under this specification, the gradient of the log-posterior ∇_{w^ℓ_n} log p^r(w^ℓ_n) in (37) for the pSVGD method can be evaluated as

∇_{w^ℓ_n} log p^r(w^ℓ_n) = Ψ_r^T ∇_x log p(Ψ_r w^ℓ_n + x_{⊥,n}).   (43)

Given the projector P_r with basis Ψ_r, we summarize the pSVGD transport of samples in Algorithm 2. In particular, by leveraging the property that the samples can be updated in parallel, we implement a parallel version of pSVGD, using MPI for the communication of information among K processor cores, each holding N different samples, thus producing NK different samples in total.

Algorithm 2 pSVGD in parallel
1: Input: samples {x^0_n}_{n=1}^N in each of K cores, basis Ψ_r, maximum iteration L_max, tolerance wtol for the step norm.
2: Output: posterior samples {x*_n}_{n=1}^N in each core.
3: Set ℓ = 0, project w^0_n = Ψ_r^T x^0_n, x_{⊥,n} = x^0_n − Ψ_r w^0_n, and perform MPI_Allgather for the coefficients {w^0_n}_{n=1}^N.
4: repeat
5:   Compute the gradients ∇_{w^ℓ_n} log p^r(w^ℓ_n) by (43) for n = 1, . . . , N, and perform MPI_Allgather for them.
6:   Compute the kernel values k^r(w^ℓ_n, w^ℓ_m) and their gradients ∇_{w^ℓ_n} k^r(w^ℓ_n, w^ℓ_m) for n = 1, . . . , NK, m = 1, . . . , N, and perform MPI_Allgather for them.
7:   Update the samples w^{ℓ+1}_m from w^ℓ_m by (36) and (37) for m = 1, . . . , N, with the NK samples used for the SAA in (37), and perform MPI_Allgather for {w^{ℓ+1}_m}_{m=1}^N.
8:   Set ℓ ← ℓ + 1.
9: until ℓ ≥ L_max or mean_m(||w^ℓ_m − w^{ℓ−1}_m||_2) ≤ wtol.
10: Reconstruct the samples x*_n = Ψ_r w^ℓ_n + x_{⊥,n}.
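For reference, a serial (single-core) sketch of the transport in Algorithm 2, combining (36)–(38) and (43); the helper names, the fixed step size, and the unweighted kernel metric are simplifying assumptions of this sketch rather than the released implementation:

```python
import numpy as np

def psvgd(X0, Psi_r, grad_log_posterior, eps=1e-2, max_iter=200, wtol=1e-6):
    """Serial sketch of Algorithm 2: transport samples in the subspace of Psi_r.

    X0: (N, d) prior samples; Psi_r: (d, r) orthonormal basis;
    grad_log_posterior: maps (d,) -> (d,), i.e., grad_x log p(x).
    """
    N = X0.shape[0]
    W = X0 @ Psi_r                       # coefficients w_n = Psi_r^T x_n
    X_perp = X0 - W @ Psi_r.T            # frozen complements x_{perp,n}
    for _ in range(max_iter):
        # Coefficient-space gradients (43).
        G = np.stack([Psi_r.T @ grad_log_posterior(Psi_r @ w + xp)
                      for w, xp in zip(W, X_perp)])          # (N, r)
        # Kernel (38) and its gradients (unweighted metric for brevity).
        diff = W[:, None, :] - W[None, :, :]
        sq = np.sum(diff ** 2, axis=-1)
        h = np.median(sq) / np.log(N) + 1e-12
        K = np.exp(-sq / h)
        dK = -2.0 / h * K[:, :, None] * diff
        # SAA update direction (37) and step (36).
        step = eps * (K.T @ G + dK.sum(axis=0)) / N
        W = W + step
        if np.mean(np.linalg.norm(step, axis=1)) <= wtol:
            break
    return W @ Psi_r.T + X_perp          # reconstruct x_n = Psi_r w_n + x_{perp,n}
```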

To construct the projector P_r with basis Ψ_r, we approximate H in (21) by an SAA with posterior samples x_1, . . . , x_M, i.e.,

Ĥ := (1/M) ∑_{m=1}^M ∇_x log f(x_m) (∇_x log f(x_m))^T.   (44)
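A sketch of how Ĥ in (44) and the dominant eigenpairs of (22) might be computed with SciPy; the symmetrization, the small diagonal shift, and the final QR orthonormalization (used here to enforce Ψ_r^T Ψ_r = I_r on the retained eigenvectors) are numerical choices of this sketch, not prescribed by the paper:

```python
import numpy as np
from scipy.linalg import eigh

def build_basis(grad_logf_samples, Gamma, r):
    """Approximate H by the SAA (44) and solve H psi = lambda Gamma psi (22).

    grad_logf_samples: (M, d) array of grad_x log f(x_m) at (approximate)
    posterior samples; Gamma: (d, d) SPD matrix from Assumption 1.
    Returns (lam, Psi_r) with lam sorted in descending order and Psi_r^T Psi_r = I_r.
    """
    G = np.asarray(grad_logf_samples)
    M, d = G.shape
    H = G.T @ G / M                            # Hhat = (1/M) sum_m g_m g_m^T
    H = 0.5 * (H + H.T) + 1e-12 * np.eye(d)    # numerical symmetrization / shift
    lam, Psi = eigh(H, Gamma)                  # generalized eigenproblem (22)
    idx = np.argsort(lam)[::-1][:r]            # keep the r largest eigenvalues
    lam = lam[idx]
    # Orthonormalize the retained eigenvectors (QR) so that Psi_r^T Psi_r = I_r,
    # as used in the projector (13); they span the same dominant subspace.
    Psi_r, _ = np.linalg.qr(Psi[:, idx])
    return lam, Psi_r
```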

Since the posterior samples x_1, . . . , x_M are not available, we propose to adaptively construct the basis Ψ^ℓ_r with samples x^ℓ_1, . . . , x^ℓ_M transported from the prior samples x^0_1, . . . , x^0_M by pSVGD, which addresses challenge 2. This procedure is summarized in Algorithm 3. We remark that, by this adaptive construction, we push the samples to their posterior in each subspace X^ℓ_r spanned by a (possibly) different basis Ψ^ℓ_r, with a different r for each ℓ. With ∪_ℓ X^ℓ_r exploiting all the data-informed dimensions as ℓ → ∞, the samples may represent the true posterior without committing the projection error bounded in Proposition 2.

Algorithm 3 Adaptive pSVGD in parallel
1: Input: samples {x^0_n}_{n=1}^N in each of K cores, maximum iterations L^x_max, L^w_max, tolerances xtol, wtol.
2: Output: posterior samples {x*_n}_{n=1}^N in each core.
3: Set ℓ_x = 0.
4: repeat
5:   Compute ∇_x log f(x^{ℓ_x}_n) in (44) for n = 1, . . . , N in each core, and perform MPI_Allgather for them.
6:   Solve (22), with H approximated as in (44) using all M = NK samples, to get the projection basis Ψ^{ℓ_x}_r.
7:   Apply the pSVGD Algorithm 2, i.e., {x*_n}_{n=1}^N = pSVGD({x^{ℓ_x}_n}_{n=1}^N, Ψ^{ℓ_x}_r, L^w_max, wtol).
8:   Set ℓ_x ← ℓ_x + 1 and x^{ℓ_x}_n = x*_n, n = 1, . . . , N.
9: until ℓ_x ≥ L^x_max or mean_m(||x^{ℓ_x}_m − x^{ℓ_x−1}_m||_X) ≤ xtol.

4. Numerical Experiments

We present a nonlinear Bayesian inference problem with high-dimensional parameters to demonstrate the accuracy of pSVGD compared to SVGD, and the convergence and scalability of pSVGD w.r.t. the number of parameters, samples, data points, and processor cores. A linear inference example, whose posterior is analytically given, is presented in Appendix B to demonstrate the accuracy of pSVGD compared to SVGD. The code is available at https://github.com/cpempire/pSVGD.

Figure 1. RMSE of the pointwise sample variance in L2-norm, with 256 samples, SVGD and pSVGD both terminated at ℓ = 200 iterations, parameter dimension d = (2^n + 1)^2, with n = 3, 4, 5, 6. (Plot: log10 of the RMSE of the variance vs. log2(√d − 1), with one curve for SVGD and one for pSVGD.)

We consider a parameter-to-observable map h : R^d → R^s,

h(x) = O ∘ S(x),   (45)

where S : x → u is the nonlinear discrete solution map of the log-normal diffusion model in a unit square domain,

−∇ · (e^x ∇u) = 0   in (0, 1)^2,   (46)

imposed with Dirichlet boundary conditions u = 1 on the top boundary and u = 0 on the bottom boundary, and homogeneous Neumann boundary conditions on the left and right boundaries. Here ∇· is the divergence operator and ∇ is the gradient operator. The parameter field and the state are discretized by piecewise linear finite elements on a uniform mesh of triangles, and x ∈ R^d and u ∈ R^d are the corresponding nodal values of dimension d. We consider a Gaussian distribution N(0, C) for the parameter field, with covariance C = (−0.1∆ + I)^{−2}, which leads to a Gaussian distribution x ∼ N(0, Σ_x) for the nodal values, where Σ_x ∈ R^{d×d} is discretized from C. O : R^d → R^s is a pointwise observation map at s = 7 × 7 points equally distributed in (0, 1)^2. We consider an additive 5% Gaussian noise ξ ∼ N(0, Σ_ξ), with Σ_ξ = σ^2 I and σ = max(|Ou|)/20, for the data

y = h(x) + ξ,   (47)

which leads to the likelihood function

f(x) = exp( −(1/2) ||y − h(x)||^2_{Σ_ξ^{−1}} ).   (48)
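For a generic differentiable forward map h, the log-likelihood corresponding to (48) and its gradient, which enter (43) and (44), take the following form; `forward` and `jacobian` are hypothetical placeholders for the discrete PDE solve and its (typically adjoint-based) Jacobian action, not the paper's solver:

```python
import numpy as np

def log_likelihood(x, y, noise_var, forward):
    """log f(x) = -0.5 * ||y - h(x)||^2_{Sigma_xi^{-1}}, with Sigma_xi = sigma^2 I."""
    misfit = y - forward(x)
    return -0.5 * np.dot(misfit, misfit) / noise_var

def grad_log_likelihood(x, y, noise_var, forward, jacobian):
    """grad_x log f(x) = J(x)^T Sigma_xi^{-1} (y - h(x)), with J = dh/dx."""
    misfit = y - forward(x)
    return jacobian(x).T @ misfit / noise_var
```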

Figure 2. Scalability w.r.t. the parameter dimension d: decay of the eigenvalues λ_r w.r.t. r (top), and decay of the averaged step norm mean_m ||w^{ℓ+1}_m − w^ℓ_m||_2 w.r.t. the number of iterations (bottom), for d = 289; 1,089; 4,225; 16,641.

We use a DILI-MCMC algorithm (Cui et al., 2016) to generate 10,000 effective posterior samples and use them to compute a reference sample variance. We run SVGD and the adaptive pSVGD (with the rank r chosen such that λ_{r+1} < 10^{−2}) using 256 samples and 200 iterations for different dimensions, both using a line search to select the step size ε_ℓ. The comparison of accuracy can be observed in Figure 1: SVGD samples fail to capture the posterior distribution in high dimensions and become worse with increasing dimension, while pSVGD samples represent the posterior distribution well, as measured by the sample variance, and the approximation remains accurate with increasing dimension.

Figure 3. Scalability w.r.t. the number of samples N: decay of the eigenvalues λ_r w.r.t. r (top), and decay of the averaged step norm mean_m ||w^{ℓ+1}_m − w^ℓ_m||_2 w.r.t. the number of iterations (bottom), for N = 64, 128, 256, 512.

The accuracy of pSVGD is further demonstrated by the significant decay (about 7 orders of magnitude) of the eigenvalues for the different dimensions in the top of Figure 2. Only about 50 dimensions (with a small relative projection error, about E_r < 10^{−6}, committed in the posterior by Proposition 2) are preserved out of 289 to 16,641 dimensions, representing an over 300× dimension reduction for the last case. The similar decays of the eigenvalues λ_r in the projection rank r and of the averaged step norm mean_m ||w^{ℓ+1}_m − w^ℓ_m||_2 in the number of iterations shown in Figure 2 imply that pSVGD is scalable w.r.t. the parameter dimension. Moreover, the similar decays for the different sample sizes N = 64, 128, 256, 512 in Figure 3 demonstrate that pSVGD is scalable w.r.t. the number of samples N. Furthermore, as displayed in the top of Figure 4, with an increasing number of i.i.d. observation data points, s = 7^2, 15^2, 31^2, 63^2, on a refined mesh of size d = 17^2, 33^2, 65^2, 129^2, the eigenvalues decay at almost the same rate with similar relative projection errors E_r, and lead to a similar reduction d/r for r such that λ_{r+1} < 10^{−2}, which implies weak scalability of pSVGD w.r.t. the number of data points. Lastly, from the nearly O(K^{−1}) decay of the CPU time in the bottom of Figure 4, we can see that pSVGD achieves strong parallel scalability (in computing the gradient, the kernel, and the sample update) w.r.t. the number of processor cores K for the same total work with KN = 1,024 samples.

Figure 4. Scalability w.r.t. the number of data points s, by the decay of the eigenvalues λ_r w.r.t. r for s = 49, 225, 961, 3,969 (top), and w.r.t. the number of processor cores K, by the decay of the CPU time of different parts of pSVGD (total, gradient, kernel, update), compared against O(K^{−1}) (bottom).

5. Conclusions and Future Work

We proposed, analyzed, and demonstrated pSVGD for Bayesian inference in high dimensions to tackle the critical challenge of the curse of dimensionality. The projection can be adaptively constructed using only gradient information of the log-likelihood, which is already available in the SVGD algorithm, and the projection error committed in the posterior can be bounded by the truncated eigenvalues. We proved that pSVGD for the coefficients is equivalent to SVGD for the projected parameters under suitable assumptions. The accuracy (compared to SVGD), convergence, and scalability of pSVGD w.r.t. the number of parameters, samples, data points, and processor cores were demonstrated on a high-dimensional nonlinear Bayesian inference problem.

Further analysis of the adaptive pSVGD and its application to other high-dimensional inference problems, e.g., Bayesian neural networks, is of interest. Another promising direction is to study the correlation, the reduction by projection, and the scalability in the data dimension for big data applications.


References

Bashir, O., Willcox, K., Ghattas, O., van Bloemen Waanders, B., and Hill, J. Hessian-based model reduction for large-scale systems with initial condition inputs. International Journal for Numerical Methods in Engineering, 73:844–868, 2008.
Beskos, A., Girolami, M., Lan, S., Farrell, P. E., and Stuart, A. M. Geometric MCMC for infinite-dimensional inverse problems. Journal of Computational Physics, 335:327–351, 2017.
Bigoni, D., Zahm, O., Spantini, A., and Marzouk, Y. Greedy inference with layers of lazy maps. arXiv preprint arXiv:1906.00031, 2019.
Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
Bui-Thanh, T. and Ghattas, O. Analysis of the Hessian for inverse scattering problems: I. Inverse shape scattering of acoustic waves. Inverse Problems, 28(5):055001, 2012.
Bui-Thanh, T., Ghattas, O., Martin, J., and Stadler, G. A computational framework for infinite-dimensional Bayesian inverse problems, Part I: The linearized case, with application to global seismic inversion. SIAM Journal on Scientific Computing, 35(6):A2494–A2523, 2013.
Chen, P. and Ghattas, O. Hessian-based sampling for high-dimensional model reduction. International Journal for Uncertainty Quantification, 9(2), 2019.
Chen, P. and Schwab, C. Sparse-grid, reduced-basis Bayesian inversion. Computer Methods in Applied Mechanics and Engineering, 297:84–115, 2015.
Chen, P. and Schwab, C. Sparse-grid, reduced-basis Bayesian inversion: Nonaffine-parametric nonlinear equations. Journal of Computational Physics, 316:470–503, 2016.
Chen, P., Villa, U., and Ghattas, O. Hessian-based adaptive sparse quadrature for infinite-dimensional Bayesian inverse problems. Computer Methods in Applied Mechanics and Engineering, 327:147–172, 2017.
Chen, P., Villa, U., and Ghattas, O. Taylor approximation and variance reduction for PDE-constrained optimal control problems under uncertainty. Journal of Computational Physics, 2019a. To appear.
Chen, P., Wu, K., Chen, J., O'Leary-Roseberry, T., and Ghattas, O. Projected Stein variational Newton: A fast and scalable Bayesian inference method in high dimensions. In Advances in Neural Information Processing Systems, pp. 15104–15113, 2019b.
Chen, W. Y., Mackey, L., Gorham, J., Briol, F.-X., and Oates, C. J. Stein points. arXiv preprint arXiv:1803.10161, 2018.
Constantine, P. G., Kent, C., and Bui-Thanh, T. Accelerating Markov chain Monte Carlo with active subspaces. SIAM Journal on Scientific Computing, 38(5):A2779–A2805, 2016.
Cui, T., Law, K. J., and Marzouk, Y. M. Dimension-independent likelihood-informed MCMC. Journal of Computational Physics, 304:109–137, 2016.
Detommaso, G., Cui, T., Marzouk, Y., Spantini, A., and Scheichl, R. A Stein variational Newton method. In Advances in Neural Information Processing Systems, pp. 9187–9197, 2018.
Detommaso, G., Kruse, J., Ardizzone, L., Rother, C., Kothe, U., and Scheichl, R. HINT: Hierarchical invertible neural transport for general and sequential Bayesian inference. arXiv preprint arXiv:1905.10687, 2019.
Girolami, M. and Calderhead, B. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214, 2011.
Isaac, T., Petra, N., Stadler, G., and Ghattas, O. Scalable and efficient algorithms for the propagation of uncertainty from data through inference to prediction for large-scale problems, with application to flow of the Antarctic ice sheet. Journal of Computational Physics, 296:348–368, September 2015. doi: 10.1016/j.jcp.2015.04.047.
Liu, C. and Zhu, J. Riemannian Stein variational gradient descent for Bayesian inference. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
Liu, Q. Stein variational gradient descent as gradient flow. In Advances in Neural Information Processing Systems, pp. 3115–3123, 2017.
Liu, Q. and Wang, D. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pp. 2378–2386, 2016.
Martin, J., Wilcox, L., Burstedde, C., and Ghattas, O. A stochastic Newton MCMC method for large-scale statistical inverse problems with application to seismic inversion. SIAM Journal on Scientific Computing, 34(3):A1460–A1487, 2012.
Marzouk, Y., Moselhy, T., Parno, M., and Spantini, A. Sampling via measure transport: An introduction. In Handbook of Uncertainty Quantification, pp. 1–41. Springer, 2016.
Petra, N., Martin, J., Stadler, G., and Ghattas, O. A computational framework for infinite-dimensional Bayesian inverse problems, Part II: Stochastic Newton MCMC with application to ice sheet flow inverse problems. SIAM Journal on Scientific Computing, 36(4):A1525–A1555, 2014.
Ramdas, A., Reddi, S. J., Poczos, B., Singh, A., and Wasserman, L. On the decreasing power of kernel and distance based nonparametric hypothesis tests in high dimensions. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
Schillings, C. and Schwab, C. Sparse, adaptive Smolyak quadratures for Bayesian inverse problems. Inverse Problems, 29(6):065011, 2013.
Schillings, C. and Schwab, C. Scaling limits in computational Bayesian inversion. ESAIM: Mathematical Modelling and Numerical Analysis, 50(6):1825–1856, 2016.
Schwab, C. and Stuart, A. Sparse deterministic approximation of Bayesian inverse problems. Inverse Problems, 28(4):045003, 2012.
Spantini, A., Solonen, A., Cui, T., Martin, J., Tenorio, L., and Marzouk, Y. Optimal low-rank approximations of Bayesian linear inverse problems. SIAM Journal on Scientific Computing, 37(6):A2451–A2487, 2015.
Wang, D., Zeng, Z., and Liu, Q. Stein variational message passing for continuous graphical models. In International Conference on Machine Learning, pp. 5206–5214, 2018.
Wang, Y. and Li, W. Information Newton's flow: second-order optimization method in probability space. arXiv preprint arXiv:2001.04341, 2020.
Zahm, O., Cui, T., Law, K., Spantini, A., and Marzouk, Y. Certified dimension reduction in nonlinear Bayesian inverse problems. arXiv preprint arXiv:1807.03712, 2018.
Zhuo, J., Liu, C., Shi, J., Zhu, J., Chen, N., and Zhang, B. Message passing Stein variational gradient descent. In International Conference on Machine Learning, pp. 6018–6027, 2018.


A. Proof of Theorem 1

Proof. We prove the equivalence at the first step ℓ = 0; the equivalence for steps ℓ > 0 then follows by induction. By the parameter decomposition x = x_r + x_⊥ in (14), with x_r = P_r x, we denote by π_0 and π the prior and posterior densities of the projected parameter x_r, given by

π_0(x_r) = p_0(P_r x)   and   π(x_r) = p_r(P_r x),   (49)

where p_r is the projected posterior defined in (19) with the optimal profile function g = g* given in (18). Equivalently, by the projection property P_r P_r x = P_r x, we have

π(x_r) = (1/Z_r) g*(x_r) π_0(x_r).   (50)

We can write the transport map (4) for the projected parameter x_r in the steepest direction ϕ*_{π_0,π} as

T(x_r) = x_r + ε ϕ*_{π_0,π}(x_r),   (51)

where ϕ*_{π_0,π} is given by

ϕ*_{π_0,π}(·) = E_{x_r∼π_0}[A_π κ(x_r, ·)],   (52)

with the kernel κ(x_r, x̃_r) = k(P_r x, P_r x̃) for any x, x̃ ∈ R^d and the Stein operator

A_π κ(x_r, ·) = ∇_{x_r} log π(x_r) κ(x_r, ·) + ∇_{x_r} κ(x_r, ·).   (53)

By the definition of the kernel in (12), we have

k(P_r x, P_r x̃) = exp( −(1/h)(P_r x − P_r x̃)^T (P_r x − P_r x̃) )
              = exp( −(1/h)(w − w̃)^T Ψ_r^T Ψ_r (w − w̃) )
              = exp( −(1/h)||w − w̃||_2^2 ),   (54)

where we used the relations P_r x = Ψ_r w and P_r x̃ = Ψ_r w̃ in the second equality and the orthonormality Ψ_r^T Ψ_r = I imposed on the generalized eigenvalue problem (22) in the third. Therefore, by the definition (38), we have

k^r(w, w̃) = κ(x_r, x̃_r).   (55)

Moreover, for the gradient of the kernel we have

∇_{x_r} κ(x_r, x̃_r) = −(2/h) κ(x_r, x̃_r)(x_r − x̃_r) = −(2/h) κ(x_r, x̃_r) Ψ_r (w − w̃).   (56)

On the other hand, we have

∇_w k^r(w, w̃) = −(2/h) k^r(w, w̃)(w − w̃),   (57)

which yields

∇_w k^r(w, w̃) = Ψ_r^T ∇_{x_r} κ(x_r, x̃_r).   (58)

For the posterior π defined in (49), we have

∇_{x_r} log π(x_r) = ∇_{x_r}(g*(x_r) π_0(x_r)) / (g*(x_r) π_0(x_r)),   (59)

while for the posterior p^r defined in (27), we have

∇_w log p^r(w) = ∇_w(g*(Ψ_r w) p_0^r(w)) / (g*(Ψ_r w) p_0^r(w)).   (60)

By the chain rule, it is straightforward to see that

∇_w g*(Ψ_r w) = Ψ_r^T ∇_{x_r} g*(x_r).   (61)

Under the assumption p_0^r(w) = p_0(P_r x) of Theorem 1, and since p_0(P_r x) = π_0(x_r) by definition (49), we have

∇_w p_0^r(w) = Ψ_r^T ∇_{x_r} π_0(x_r).   (62)

Therefore, combining (61) and (62), we have

∇_w log p^r(w) = Ψ_r^T ∇_{x_r} log π(x_r).   (63)

To this end, we obtain the equivalence of the Stein operators,

A_{p^r} k^r(w, w̃) = Ψ_r^T A_π κ(x_r, x̃_r),   (64)

for x_r = Ψ_r w and x̃_r = Ψ_r w̃. Since the prior densities satisfy π_0(x_r) = p_0^r(w), we have the equivalence

E_{w∼p_0^r}[A_{p^r} k^r(w, w̃)] = Ψ_r^T E_{x_r∼π_0}[A_π κ(x_r, x̃_r)],   (65)

which establishes the equivalence of the transport maps (39), by w = Ψ_r^T x_r and with the same ε, at ℓ = 0. Moreover, by induction we have

T^r_ℓ(w^ℓ) = Ψ_r^T T_ℓ(P_r x^ℓ),   (66)

which concludes the proof.

B. A linear inference problem

We consider a linear parameter-to-observable map A : R^d → R^s, given by

A x = O ∘ B x,   (67)

where B : x → u is the linear discrete solution map of the diffusion-reaction equation (∆ is the Laplace operator)

−∆u + u = x   in (0, 1),   (68)

with boundary conditions u(0) = 0 and u(1) = 1, which is solved by a finite element method. The continuous parameter and solution are discretized by piecewise linear finite elements on a uniform mesh of size d, and x ∈ R^d and u ∈ R^d are the corresponding nodal values. The parameter x is assumed to follow a Gaussian distribution N(0, C) with covariance C = (−0.1∆ + I)^{−1}, which leads to a Gaussian parameter x ∼ N(0, Σ_x), with covariance Σ_x ∈ R^{d×d} as a discretization of C.

O : R^d → R^s in (67) is an observation map that takes s components of u equally distributed in (0, 1). For s = 15, we have Ou = (u(1/16), . . . , u(15/16))^T. We assume an additive 1% Gaussian noise ξ ∼ N(0, Σ_ξ), with Σ_ξ = σ^2 I and σ = max(|Ou|)/100, for the data

y = A x + ξ,   (69)

so that the likelihood function is given by

f(x) = exp( −(1/2) ||y − A x||^2_{Σ_ξ^{−1}} ).   (70)

Because of the linearity of the inference problem, the posterior of x is also Gaussian, N(x_MAP, Σ_y), with the MAP point x_MAP = Σ_y A^T Σ_ξ^{−1} y and covariance

Σ_y = (A^T Σ_ξ^{−1} A + Σ_x^{−1})^{−1}.   (71)
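The reference solution (71) is straightforward to form for moderate d; a sketch of computing the exact posterior and the pointwise-variance error of a set of samples against it (function and variable names are illustrative):

```python
import numpy as np

def gaussian_posterior(A, Sigma_x, sigma2, y):
    """Exact Gaussian posterior (71) for the linear model y = A x + xi, Sigma_xi = sigma2 * I."""
    Sigma_x_inv = np.linalg.inv(Sigma_x)
    Sigma_y = np.linalg.inv(A.T @ A / sigma2 + Sigma_x_inv)   # posterior covariance
    x_map = Sigma_y @ (A.T @ y) / sigma2                      # MAP point
    return x_map, Sigma_y

def variance_rmse(samples, Sigma_y):
    """RMSE of the pointwise sample variance against the exact diagonal of (71)."""
    var_est = np.var(samples, axis=0)                         # samples: (N, d)
    return np.sqrt(np.mean((var_est - np.diag(Sigma_y)) ** 2))
```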

We run SVGD and pSVGD (with a projection onto r = 8 basis functions, for which λ_9 < 10^{−4}) with 256 samples and 200 iterations for different dimensions, both using a line search to select the step size ε_ℓ. The RMSE (over 10 trials, and their average) of the sample variances compared to the ground truth (71) is shown in Figure 5, which indicates that SVGD deteriorates with increasing dimension, while pSVGD performs well for all dimensions.

Figure 5. RMSE of the pointwise sample variance in L2-norm, with 256 samples, SVGD and pSVGD both terminated at ℓ = 200 iterations, parameter dimension d = 2^n + 1, with n = 4, 6, 8, 10. (Plot: log10 of the RMSE of the variance vs. log2(d − 1), with one curve for SVGD and one for pSVGD.)