On Bayesian Computation
Michael I. Jordan, with Elaine Angelino, Maxim Rabinovich, Martin Wainwright, and Yun Yang

Page 1

On Bayesian Computation

Michael I. Jordan

with Elaine Angelino, Maxim Rabinovich, Martin Wainwright and Yun Yang

Page 2

Previous Work: Information Constraints on Inference

- Minimize the minimax risk under constraints:
  - privacy constraint
  - communication constraint
  - memory constraint

- Yields tradeoffs linking statistical quantities (amount of data, complexity of model, dimension of parameter space, etc.) to “externalities” (privacy level, bandwidth, storage, etc.)

Page 3

Ongoing Work: Computational Constraints on Inference

- Tradeoffs via convex relaxations

- Tradeoffs via concurrency control

- Bounds via optimization oracles

- Bounds via communication complexity

- Tradeoffs via subsampling

- All of this work has been frequentist; how about Bayesian inference?

Page 4

Today’s talk: Bayesian Inference and Computation

- Integration rather than optimization

- Part I: Mixing times for MCMC

- Part II: Distributed MCMC

Page 5

Part I: Mixing Times for MCMC

Page 6

MCMC and Sparse Regression

- Statistical problem: predict/explain a response variable Y based on a very large number of features X_1, ..., X_d.

- Formally:

  Y = X_1 β_1 + X_2 β_2 + · · · + X_d β_d + noise, or
  Y_{n×1} = X_{n×d} β_{d×1} + noise (matrix form),

  where the number of features d ≫ the sample size n.

- Sparsity condition: the number of non-zero β_j's is ≪ n.

- Goal: variable selection, i.e. identify the influential features.

- Can this be done efficiently (in a computational sense)? (A synthetic instance of this setting is sketched below.)
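As a concrete illustration of the setting above, here is a minimal synthetic instance (a sketch under assumed Gaussian design and noise, not the talk's experimental setup):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, s_true = 100, 1000, 5            # sample size, number of features, true sparsity

X = rng.standard_normal((n, d))         # Gaussian design, assumed only for illustration
beta = np.zeros(d)
support = rng.choice(d, size=s_true, replace=False)
beta[support] = 2.0 * rng.choice([-1.0, 1.0], size=s_true)   # strong nonzero signals
Y = X @ beta + rng.standard_normal(n)                         # noise ~ N(0, 1)

# Variable selection asks: given only (X, Y), can the support of beta be
# recovered efficiently, even though d >> n?
print(sorted(support.tolist()))
```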

Page 7

Statistical Methodology

- Frequentist approach: penalized optimization

  f(β) = (1/(2n)) ‖Y − Xβ‖_2^2 + P_λ(β),

  where the first term measures goodness of fit and P_λ(β) is a penalty term (e.g., the lasso penalty P_λ(β) = λ‖β‖_1).

- Bayesian approach: Monte Carlo sampling

  P(β | Y) ∝ P(Y | β) × P(β),

  i.e. posterior probability ∝ likelihood × prior probability.

Page 8

Computation and Bayesian inference

  P(β | Y) = C(Y) × P(Y | β) × P(β),

  i.e. posterior probability = normalizing constant × likelihood × prior probability.

- The most widely used tool for fitting Bayesian models is the sampling technique based on Markov chain Monte Carlo (MCMC).

- The theoretical analysis of the computational efficiency of MCMC algorithms lags that of optimization-based methods.

Page 9

Computational Complexity

- Central object of interest: the mixing time t_mix of the Markov chain (the standard definition is recalled below)

- Bound t_mix as a function of the problem parameters (n, d):

  - rapidly mixing: t_mix grows at most polynomially
  - slowly mixing: t_mix grows exponentially

- It has long been believed by many statisticians that MCMC for Bayesian variable selection is necessarily slowly mixing; efforts have been underway to find lower bounds, but such bounds have not yet been found
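For reference, the mixing time used here is, I assume, the usual worst-case total-variation mixing time; the slides do not spell the definition out, so take this as the standard convention rather than the paper's exact statement:

```latex
% eps-mixing time of a Markov chain with transition kernel P and
% stationary distribution \pi (here, the posterior over models \gamma):
\[
  \tau_{\varepsilon} \;=\; \min\Bigl\{\, t \ge 0 \;:\;
      \max_{\gamma}\, \bigl\| P^{t}(\gamma,\cdot) - \pi \bigr\|_{\mathrm{TV}}
      \le \varepsilon \,\Bigr\}
\]
```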

Page 10

Model

  M_γ: Linear model:    Y = X_γ β_γ + w,  w ∼ N(0, φ^{-1} I_n)

  Precision prior:      π(φ) ∝ 1/φ

  Regression prior:     (β_γ | γ) ∼ N(0, g φ^{-1} (X_γ^T X_γ)^{-1})

  Sparsity prior:       π(γ) ∝ (1/p)^{κ|γ|} · I[|γ| ≤ s_0].
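To make concrete what an MCMC sampler for this model evaluates, the following is my own reconstruction (not code from the talk) of the unnormalized log posterior over models γ: integrating out β_γ and φ under the priors above gives π(γ | Y) ∝ p^{-κ|γ|} (1+g)^{-|γ|/2} (‖Y‖^2 − (g/(1+g)) ‖Φ_γ Y‖^2)^{-n/2} on {|γ| ≤ s_0}, where Φ_γ is projection onto the selected columns. The function name and interface are mine, and X_γ is assumed to have full column rank.

```python
import numpy as np

def log_model_posterior(gamma, X, Y, g, kappa, s0):
    """Unnormalized log pi(gamma | Y) for the g-prior model above.

    Up to an additive constant:
        -kappa*|gamma|*log(p) - (|gamma|/2)*log(1+g)
        - (n/2)*log( ||Y||^2 - g/(1+g) * ||Phi_gamma Y||^2 ).
    """
    n, p = X.shape
    k = int(np.sum(gamma))
    if k > s0:
        return -np.inf                      # outside the sparsity prior's support
    if k == 0:
        fit = 0.0
    else:
        Xg = X[:, gamma.astype(bool)]
        # ||Phi_gamma Y||^2 via least squares: Phi_gamma Y = Xg @ beta_hat
        beta_hat, *_ = np.linalg.lstsq(Xg, Y, rcond=None)
        fit = float(np.sum((Xg @ beta_hat) ** 2))
    rss_like = float(Y @ Y) - (g / (1.0 + g)) * fit
    return (-kappa * k * np.log(p)
            - 0.5 * k * np.log(1.0 + g)
            - 0.5 * n * np.log(rss_like))
```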

Page 11

Assumption A (Conditions on β*)

The true regression vector has components β* = (β*_S, β*_{S^c}) that satisfy the bounds

  Full β* condition:          ‖(1/√n) X β*‖_2^2 ≤ g σ_0^2 (log p)/n

  Off-support S^c condition:  ‖(1/√n) X_{S^c} β*_{S^c}‖_2^2 ≤ L σ_0^2 (log p)/n,

for some L ≥ 0.

Page 12

Assumption B (Conditions on the design matrix)

The design matrix has been normalized so that ‖X_j‖_2^2 = n for all j = 1, ..., p; moreover, letting Z ∼ N(0, I_n), there exist constants ν > 0 and L̃ < ∞ such that L̃ν ≥ 4 and

  Lower restricted eigenvalue RE(s):   min_{|γ|≤s} λ_min((1/n) X_γ^T X_γ) ≥ ν,

  Sparse projection condition SI(s):   E_Z[ max_{|γ|≤s} max_{k∈[p]\γ} (1/√n) |⟨(I − Φ_γ) X_k, Z⟩| ] ≤ (1/2) √(L̃ν log p),

where Φ_γ denotes projection onto the span of {X_j, j ∈ γ}.

This is a mild assumption, needed in the information-theoretic analysis of variable selection.

Page 13

Assumption C (Choices of prior hyperparameters)

The noise hyperparameter g and sparsity penalty hyperparameter κ > 0 are chosen such that

  g ≍ p^{2α} for some α > 0, and

  κ + α ≥ C_1 (L + L̃) + 2 for some universal constant C_1 > 0.

Page 14

Assumption D (Sparsity control)

For a constant C_0 > 8, one of the two following conditions holds:

  Version D(s*): We set s_0 := p in the sparsity prior, and the true sparsity s* is bounded as

    max{1, s*} ≤ (1/(8 C_0 K)) { n/(log p) − 16 L σ_0^2 }

  for some constant K ≥ 4 + α + cL, where c is a universal constant.

  Version D(s_0): The sparsity parameter s_0 satisfies the sandwich relation

    max{ 1, (2 ν^{-2} ω(X) + 1) s* } ≤ s_0 ≤ (1/(8 C_0 K)) { n/(log p) − 16 L σ_0^2 },

  where ω(X) := max_γ |||(X_γ^T X_γ)^{-1} X_γ^T X_{γ*\γ}|||_op^2.

Page 15

Our Results

Theorem (Posterior concentration)
Given Assumption A, Assumption B with s = K s*, Assumption C, and Assumption D(s*), if C_β satisfies

  C_β^2 ≥ c_0 ν^{-2} (L + L̃ + α + κ) σ_0^2 (log p)/n,

we have π_n(γ* | Y) ≥ 1 − c_1 p^{-1} with high probability.

Compare to ℓ_1-based approaches, which require an irrepresentable condition:

  max_{k∈[d], S: |S|=s*} ‖X_k^T X_S (X_S^T X_S)^{-1}‖_1 < 1.

Page 16

Our Results

Theorem (Rapid mixing)
Suppose that Assumption A, Assumption B with s = s_0, Assumption C, and Assumption D(s_0) all hold. Then there are universal constants c_1, c_2, c_3, c_4 such that, for any ε ∈ (0, 1), the ε-mixing time of the Metropolis-Hastings chain is upper bounded as

  τ_ε ≤ c_1 p s_0^2 ( c_2 α (n + s_0) log p + log(1/ε) + 2 )

with probability at least 1 − c_3 p^{-c_4}.

Page 17

High Level Proof Idea

- A Metropolis-Hastings random walk on the d-dimensional hypercube {0, 1}^d (a single-flip step is sketched below)

- Canonical path ensemble argument: for any model γ ≠ γ*, where γ* is the true model, find a path from γ to γ* along which the acceptance ratios are high
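As an illustration of the random walk, here is a simplified single-coordinate-flip Metropolis-Hastings step; this is a sketch only, since the chain analyzed in the theorem may use a richer add/delete/swap proposal, and `log_post` can be any unnormalized log posterior over models, e.g. the `log_model_posterior` sketch above.

```python
import numpy as np

def mh_flip_step(gamma, log_post, rng):
    """One Metropolis-Hastings step on {0,1}^d: propose flipping one coordinate."""
    j = rng.integers(gamma.size)              # symmetric single-flip proposal
    proposal = gamma.copy()
    proposal[j] = 1 - proposal[j]
    log_accept = log_post(proposal) - log_post(gamma)
    if np.log(rng.random()) < log_accept:     # accept with probability min(1, ratio)
        return proposal
    return gamma

# Usage sketch: start from the empty model and iterate.
# rng = np.random.default_rng(0)
# gamma = np.zeros(d, dtype=int)
# for _ in range(num_steps):
#     gamma = mh_flip_step(gamma,
#                          lambda m: log_model_posterior(m, X, Y, g_hyper, kappa, s0),
#                          rng)
```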

Page 18

Part II: Distributed MCMC

Page 19

Traditional MCMC

- Serial, iterative algorithm for generating samples

- Slow for two reasons:

  (1) Large number of iterations required to converge
  (2) Each iteration depends on the entire dataset

- Most research on MCMC has targeted (1)

- Recent threads of work target (2)

Page 20

Serial MCMC

[Diagram: the full dataset is processed by a single core, which produces the samples.]

Page 21

Data-parallel MCMC

[Diagram: the data are partitioned across parallel cores, each of which produces its own “samples”.]

Page 22

Aggregate samples from across partitions — but how?

[Diagram: the data are partitioned across parallel cores; each core produces “samples”, which are then aggregated.]

Page 23

Factorization motivates a data-parallel approach

  π(θ | x) ∝ π(θ) π(x | θ) = ∏_{j=1}^{J} π(θ)^{1/J} π(x^{(j)} | θ),

  i.e. posterior ∝ prior × likelihood, and each factor π(θ)^{1/J} π(x^{(j)} | θ) is a sub-posterior.

Page 24

Factorization motivates a data-parallel approach

  π(θ | x) ∝ π(θ) π(x | θ) = ∏_{j=1}^{J} π(θ)^{1/J} π(x^{(j)} | θ)

- Partition the data as x^{(1)}, ..., x^{(J)} across J cores

- The j-th core samples from a distribution proportional to the j-th sub-posterior (a ‘piece’ of the full posterior)

- Aggregate the sub-posterior samples to form approximate full posterior samples (see the sketch below)
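A minimal sketch of this partition-and-sample pattern; the helper names `log_prior`, `log_lik`, and `run_mcmc` are my own placeholders, not part of the talk:

```python
import numpy as np

def make_subposterior(log_prior, log_lik, x_shard, J):
    """Log density (up to a constant) of one sub-posterior:
    (1/J) * log prior + log likelihood of this shard's data."""
    def log_subpost(theta):
        return log_prior(theta) / J + sum(log_lik(x_i, theta) for x_i in x_shard)
    return log_subpost

# Partition the data into J shards and sample each sub-posterior independently;
# in practice each call below would run on its own core or machine.
# shards = np.array_split(x, J)
# sub_samples = [run_mcmc(make_subposterior(log_prior, log_lik, shard, J))
#                for shard in shards]
```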

Page 25

Aggregation strategies for sub-posterior samples

  π(θ | x) ∝ π(θ) π(x | θ) = ∏_{j=1}^{J} π(θ)^{1/J} π(x^{(j)} | θ)

Sub-posterior density estimation (Neiswanger et al., UAI 2014)

Weierstrass samplers (Wang & Dunson, 2013)

Weighted averaging of sub-posterior samples

- Consensus Monte Carlo (Scott et al., Bayes 250, 2013)

- Variational Consensus Monte Carlo (Rabinovich et al., NIPS 2015)

Page 26

Aggregate ‘horizontally’ across partitions

[Diagram: samples from the parallel cores are aggregated across the partitions.]

Page 27

Naïve aggregation = average

[Diagram: each aggregated sample is an equally weighted combination of the sub-posterior samples, with weights 0.5 and 0.5.]

Page 28

Less naïve aggregation = weighted average

[Diagram: each aggregated sample is a weighted combination of the sub-posterior samples, here with weights 0.58 and 0.42.]

Page 29

Consensus Monte Carlo (Scott et al., 2013)

[Diagram: aggregated samples are formed as weighted combinations of the sub-posterior samples, with matrix-valued weights.]

- Weights are inverse covariance matrices (see the sketch below)

- Motivated by Gaussian assumptions

- Designed at Google for the MapReduce framework
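A sketch of the aggregation rule described on this slide, with weights taken to be the inverse sample covariances of the sub-posteriors; this is an illustrative reconstruction rather than the reference implementation, and it assumes an unconstrained parameter vector of dimension at least 2.

```python
import numpy as np

def consensus_aggregate(sub_samples):
    """Combine aligned draws from J sub-posteriors by inverse-covariance weighting.

    sub_samples: list of J arrays, each of shape (T, dim) -- draw t of shard j.
    returns: array of shape (T, dim) of approximate full-posterior draws.
    """
    weights = [np.linalg.inv(np.cov(s, rowvar=False)) for s in sub_samples]
    total = np.linalg.inv(np.sum(weights, axis=0))                 # (sum_j W_j)^{-1}
    combined = sum(W @ s.T for W, s in zip(weights, sub_samples))  # shape (dim, T)
    return (total @ combined).T
```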

Page 30

Variational Consensus Monte Carlo

Goal: Choose the aggregation function to best approximate the target distribution

Method: Convex optimization via variational Bayes

Page 31

Variational Consensus Monte Carlo

Goal: Choose the aggregation function to best approximate the target distribution

Method: Convex optimization via variational Bayes

  F = aggregation function
  q_F = approximate distribution

  L(F) = E_{q_F}[log π(X, θ)] + H[q_F],

  i.e. objective = likelihood term + entropy.

Page 32

Variational Consensus Monte Carlo

Goal: Choose the aggregation function to best approximate the target distribution

Method: Convex optimization via variational Bayes

  F = aggregation function
  q_F = approximate distribution

  L̃(F) = E_{q_F}[log π(X, θ)] + H̃[q_F],

  i.e. objective = likelihood term + relaxed entropy.

Page 33

Variational Consensus Monte Carlo

Goal: Choose the aggregation function to best approximate the target distribution

Method: Convex optimization via variational Bayes

  F = aggregation function
  q_F = approximate distribution

  L̃(F) = E_{q_F}[log π(X, θ)] + H̃[q_F],

  i.e. objective = likelihood term + relaxed entropy.

No mean field assumption

Page 34

Variational Consensus Monte Carlo

[Diagram: aggregated samples are formed as weighted combinations of the sub-posterior samples, with matrix-valued weights.]

- Optimize over the weight matrices (a sketch of the resulting linear aggregation step follows below)

- Restrict to valid solutions when the parameter vectors are constrained
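For contrast with the fixed CMC weights, here is a sketch of the aggregation family the bullets above describe: a linear combination with matrix-valued weights. In VCMC the weights would be chosen by maximizing the relaxed objective L̃(F); the optimization itself is omitted, and the function below is my illustration, not the paper's code.

```python
import numpy as np

def linear_aggregate(sub_samples, weight_matrices):
    """Aggregation function F(theta_1, ..., theta_J) = sum_j W_j theta_j,
    applied draw-by-draw to aligned sub-posterior samples of shape (T, dim)."""
    return sum(s @ W.T for W, s in zip(weight_matrices, sub_samples))

# Uniform averaging is the special case W_j = I / J:
# J, dim = len(sub_samples), sub_samples[0].shape[1]
# uniform_weights = [np.eye(dim) / J for _ in range(J)]
# aggregated = linear_aggregate(sub_samples, uniform_weights)
```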

Page 35

Variational Consensus Monte Carlo

Theorem (Entropy relaxation)
Under mild structural assumptions, we can choose

  H̃[q_F] = c_0 + (1/K) ∑_{k=1}^{K} h_k(F),

with each h_k a concave function of F, such that

  H[q_F] ≥ H̃[q_F].

We therefore have L(F) ≥ L̃(F).

Page 36

Variational Consensus Monte Carlo

Theorem (Concavity of the variational Bayes objective)
Under mild structural assumptions, the relaxed variational Bayes objective

  L̃(F) = E_{q_F}[log π(X, θ)] + H̃[q_F]

is concave in F.

Page 37

Empirical evaluation

- Compare three aggregation strategies:

  - Uniform average
  - Gaussian-motivated weighted average (CMC)
  - Optimized weighted average (VCMC)

- For each algorithm A, report the approximation error of some expectation E_π[f], relative to serial MCMC (a small helper computing this is sketched below):

  ε_A(f) = |E_A[f] − E_MCMC[f]| / |E_MCMC[f]|

- Preliminary speedup results
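A small helper for the error metric above, estimating each expectation by a sample mean; an illustrative sketch with names of my choosing.

```python
import numpy as np

def relative_error(f_vals_alg, f_vals_serial):
    """epsilon_A(f) = |E_A[f] - E_MCMC[f]| / |E_MCMC[f]|,
    with each expectation estimated by the mean of f over that algorithm's samples."""
    e_alg = float(np.mean(f_vals_alg))
    e_serial = float(np.mean(f_vals_serial))
    return abs(e_alg - e_serial) / abs(e_serial)
```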

Page 38

Example 1: High-dimensional Bayesian probit regression

#data = 100,000, d = 300

[Figure: first moment estimation error, relative to serial MCMC (error truncated at 2.0).]

Page 39

Example 2: High-dimensional covariance estimation

Normal-inverse-Wishart model

#data = 100,000, #dim = 100 ⇒ 5,050 parameters

[Figures: (L) first moment estimation error; (R) eigenvalue estimation error.]

Page 40

Example 3: Mixture of eight 8-dimensional Gaussians

Error relative to serial MCMC, for cluster comembership probabilities of pairs of test data points

Page 41

VCMC reduces CMC error at the cost of speedup (∼2x)

VCMC speedup is approximately linear

[Figure: error and speedup curves for CMC and VCMC.]

Page 42

Discussion

Contributions:

- Convex optimization framework for Consensus Monte Carlo

- Structured aggregation accounting for constrained parameters

- Entropy relaxation

- Empirical evaluation