Workshop on Deep Generative Learning at ICML 2014, Part 2
Stochastic Gradient Fisher Scoring (Ahn, Korattikara, Welling, 2012)

[Figure: regions of large vs. small gradient, illustrating mixing issues]

Bernstein-von Mises theorem (a.k.a. the Bayesian CLT): the posterior is asymptotically Gaussian, centered near θ0, with covariance determined by the Fisher information.
θ0 = true parameter; I_N = Fisher information at θ0

Friday, 4 July 2014
SGFS

Stochastic Gradient Langevin: samples from the correct posterior at low ϵ (low bias, but high variance)

SGFS, a Markov chain for approximate inference: samples from an approximate posterior at any ϵ (low variance, but high bias)
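The Langevin update being compared here can be sketched in code. A minimal toy run, assuming a 1-D standard-normal target with no actual minibatching (the target, step size, and chain length are illustrative choices, not values from the slides):

```python
import numpy as np

def sgld_step(theta, grad_log_post, eps, rng):
    """One SGLD update: half a gradient step plus injected Gaussian noise.

    grad_log_post(theta) should return an unbiased stochastic estimate of
    the gradient of log p(theta | data), i.e. grad log p(theta) plus
    (N / n) times the sum of the minibatch log-likelihood gradients.
    """
    noise = rng.normal(0.0, np.sqrt(eps), size=np.shape(theta))
    return theta + 0.5 * eps * grad_log_post(theta) + noise

# Toy run: target N(0, 1), so grad log p(theta) = -theta.
rng = np.random.default_rng(0)
theta = 3.0
samples = []
for _ in range(20000):
    theta = sgld_step(theta, lambda t: -t, eps=0.1, rng=rng)
    samples.append(theta)
samples = np.array(samples[5000:])  # discard burn-in
print(samples.mean(), samples.std())
```

At this small but nonzero ϵ the sample mean and standard deviation land close to the target's 0 and 1, with a small discretization bias, which is exactly the trade-off the slide describes.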
SGFS

Small ϵ: low bias, high variance. Large ϵ: high bias, low variance.
(A term in the SGFS update compensates for the subsampling noise.)
The SGFS Knob

Burn-in using a large ϵ, then sample; decrease ϵ over time to approach exact sampling.

[Figure: sample clouds along the knob; large ϵ gives low variance (fast) but high bias, small ϵ gives high variance (slow) but low bias]
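The bias-variance effect of the ϵ knob is easy to see numerically. A sketch using an unadjusted Langevin chain on a Gaussian target (the target and step sizes are illustrative choices, not from the slides):

```python
import numpy as np

def ula_chain(eps, n_steps, rng):
    """Unadjusted Langevin chain targeting N(0, 1), where grad log p = -theta.

    For this Gaussian target the chain is an exact AR(1) process whose
    stationary variance is 1 / (1 - eps/4): the bias grows with eps, while
    mixing speed (hence estimator variance for a fixed budget) improves.
    """
    theta, out = 0.0, np.empty(n_steps)
    for t in range(n_steps):
        theta += -0.5 * eps * theta + rng.normal(0.0, np.sqrt(eps))
        out[t] = theta
    return out

rng = np.random.default_rng(0)
var_large = ula_chain(eps=1.0, n_steps=200000, rng=rng)[1000:].var()
var_small = ula_chain(eps=0.1, n_steps=200000, rng=rng)[1000:].var()
# Large eps overestimates the target variance of 1; small eps is nearly
# unbiased but the chain moves more slowly.
print(var_large, var_small)
```

Turning the knob down (decreasing ϵ over time) moves the chain from the fast, biased regime toward exact sampling.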
Demo: SGFS, ε = 2

Demo: SGFS, ε = 0.4
Stochastic Gradient Riemannian Langevin Dynamics (SGRLD) - Patterson & Teh, 2013

In the Euclidean space of the parameters θ = (σ, µ) of a normal distribution:
• The Euclidean distance between two parameter settings can be 1, yet the densities p(x|θ) are very different
• The Euclidean distance can be 10, yet the densities p(x|θ) are almost identical

The update therefore uses a position-dependent metric G(θ), where G(θ) is positive semi-definite: a natural-gradient step adapts to changes in curvature and aligns the injected noise with the local geometry.
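The core mechanics can be sketched with a fixed preconditioner. This is a simplification of SGRLD: with a constant metric the correction term involving derivatives of G(θ) vanishes, so the sketch only shows how the metric rescales both the gradient and the injected noise. The toy target and constants are illustrative assumptions:

```python
import numpy as np

def precond_langevin_step(theta, grad_log_post, G_inv, eps, rng):
    """One Langevin step preconditioned by a FIXED metric G.

    SGRLD uses a position-dependent G(theta) plus a correction term built
    from the derivatives of G; with a constant preconditioner that term
    vanishes. Key point: the noise covariance must be eps * G^{-1},
    matching the rescaled gradient, to keep the right target.
    """
    drift = 0.5 * eps * G_inv @ grad_log_post(theta)
    noise = rng.multivariate_normal(np.zeros(len(theta)), eps * G_inv)
    return theta + drift + noise

# Toy target: N(0, Sigma) with badly scaled Sigma; precondition with Sigma
# (for a Gaussian location model, the Fisher metric is G = Sigma^{-1}).
Sigma = np.diag([100.0, 0.01])
Sigma_inv = np.linalg.inv(Sigma)
grad = lambda th: -Sigma_inv @ th
rng = np.random.default_rng(0)
theta = np.zeros(2)
samples = np.empty((20000, 2))
for t in range(20000):
    theta = precond_langevin_step(theta, grad, G_inv=Sigma, eps=0.1, rng=rng)
    samples[t] = theta
print(samples[2000:].std(axis=0))
```

Without the preconditioner a single ϵ cannot serve both coordinates; with it, the chain recovers the very different per-coordinate scales in one run.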
Stochastic Gradient Hamiltonian Monte Carlo (SGHMC)
T. Chen, E. B. Fox, C. Guestrin (2014)

An (over-)simplified explanation of Hamiltonian Monte Carlo (HMC):

Langevin update = one informative gradient step of size ϵ + one random step of size ϵ, which yields random-walk-type movement and bad mixing.

• HMC allows multiple gradient steps per noise step
• HMC can make distant proposals with high acceptance probability
• Naively using stochastic gradients in HMC does not work well
• The authors use a correction (friction) term to cancel the effect of the noise in the gradients

Talk tomorrow afternoon in Track C (Monte Carlo)
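SGHMC is often written in an SGD-with-momentum form. A sketch following that form (the toy target, constants, and the choice of gradient-noise estimate β̂ = 0 are illustrative assumptions, not values from the paper):

```python
import numpy as np

def sghmc(grad_U, theta0, n_steps, eta=0.01, alpha=0.1, rng=None):
    """SGHMC in its momentum form (after Chen, Fox, Guestrin, 2014):

        v     <- (1 - alpha) * v - eta * grad_U(theta) + N(0, 2 * alpha * eta)
        theta <- theta + v

    The friction term alpha * v is the correction that counteracts the
    extra noise coming from stochastic gradients (here the gradient-noise
    estimate beta_hat is taken to be 0 for simplicity).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    theta, v = float(theta0), 0.0
    out = np.empty(n_steps)
    for t in range(n_steps):
        v = (1.0 - alpha) * v - eta * grad_U(theta) \
            + rng.normal(0.0, np.sqrt(2.0 * alpha * eta))
        theta += v
        out[t] = theta
    return out

# Target p(theta) ~ exp(-theta^2 / 2); simulate minibatch noise on U'(theta).
rng = np.random.default_rng(0)
noisy_grad = lambda th: th + rng.normal(0.0, 0.5)
samples = sghmc(noisy_grad, theta0=3.0, n_steps=50000, rng=rng)
print(samples[5000:].std())
```

Because momentum carries the chain across many gradient steps between noise injections, it avoids the random-walk behavior of the plain Langevin update.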
Distributed SGLD (Ahn, Shahbaba, Welling, 2014)

[Figure: N data points in total, partitioned into shards across workers; the chain runs a trajectory on one worker's shard, then moves to another worker]

Adaptive load balancing: longer trajectories from faster machines.
D-SGLD Results

Wikipedia dataset: 4.6M articles, 811M tokens, vocabulary size 7702
PubMed dataset: 8.2M articles, 730M tokens, vocabulary size 39987
Model: Latent Dirichlet Allocation

Talk tomorrow afternoon in Track C (Monte Carlo)
A Recap

Use an efficient proposal so that the Metropolis-Hastings test can be avoided:
• SGLD - Langevin dynamics with stochastic gradients
• SGFS - Preconditioning matrix based on the Fisher information at the mode
• SGRLD - Position-specific preconditioning matrix based on Riemannian geometry
• SGHMC - Avoids random walks by taking multiple gradient steps
• DSGLD - Distributed version of the above algorithms

Approximate the Metropolis-Hastings test using less data
Why approximate the MH test? (if gradient-based methods seem to work so well)

• Gradient-based proposals are not always available:
  - Parameter spaces of different dimensionality
  - Distributions on constrained manifolds
  - Discrete variables
• High gradients may catapult the sampler to low-density regions
Metropolis-Hastings

[Figure: the three steps - (1) propose a new state, (2) compute the acceptance probability, (3) accept or reject]

The test can be rearranged so that the threshold side, collecting the prior, proposal, and uniform-draw terms, does not depend on the data (x).
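For reference, the exact test that the next slides approximate. A minimal sketch with a symmetric proposal; the toy posterior (Gaussian mean, flat prior) and all constants are illustrative assumptions:

```python
import numpy as np

def mh_step(theta, log_post, propose, rng):
    """One standard Metropolis-Hastings step with a symmetric proposal:
    accept theta' with probability min(1, p(theta'|x) / p(theta|x)).
    Note that log_post touches ALL of the data - this full-data cost is
    exactly what the approximate tests avoid."""
    theta_new = propose(theta, rng)
    if np.log(rng.uniform()) < log_post(theta_new) - log_post(theta):
        return theta_new
    return theta

# Toy posterior: Gaussian mean, unit-variance likelihood, flat prior.
rng = np.random.default_rng(0)
x = rng.normal(1.5, 1.0, 1000)
log_post = lambda th: -0.5 * np.sum((x - th) ** 2)
propose = lambda th, r: th + 0.05 * r.normal()
theta, chain = 0.0, []
for _ in range(20000):
    theta = mh_step(theta, log_post, propose, rng)
    chain.append(theta)
print(np.mean(chain[5000:]))
```

Every step sums over all 1000 data points; with millions of points that per-step cost is the bottleneck the approximate tests attack.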
Approximate Metropolis-Hastings

Evaluate the test on a mini-batch of the data; if the outcome is still uncertain, collect more data.

How do we choose Δ+ and Δ-?
Approach 1: Using Confidence Intervals (Korattikara, Chen, Welling, 2014)

Test on a subset of the data; if the decision is not yet confident, collect more data.

(c is chosen as in a t-test for µ = µ0 vs. µ ≠ µ0)

Talk tomorrow afternoon in Track C (Monte Carlo)

Related work:
• Singh, Wick, McCallum (2012) - inference in large-scale factor graphs
• DuBois, Korattikara, Welling, Smyth (2014) - approximate slice sampling
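The sequential test can be sketched as follows. Two simplifications to note: the per-point log-likelihood ratios are precomputed and passed in (a real implementation evaluates them lazily, minibatch by minibatch), and the stopping rule uses a normal tail approximation instead of the proper t-test from the slides, to stay library-free. The simulated ratios and constants are illustrative assumptions:

```python
import math
import numpy as np

def approx_mh_accept(log_lik_ratios, mu0, rng, batch=100, eps_tol=0.05):
    """Sequential approximate MH test: accept iff the mean log-likelihood
    ratio mu over all N points exceeds the data-independent threshold mu0.

    Draw increasing subsamples without replacement; stop early once the
    sign of (mean - mu0) is settled at the desired confidence level.
    """
    N = len(log_lik_ratios)
    order = rng.permutation(N)
    n = 0
    while True:
        n = min(n + batch, N)
        l = log_lik_ratios[order[:n]]
        mean, sd = l.mean(), l.std(ddof=1)
        if n == N or sd == 0.0:
            return mean > mu0
        # standard error with a finite-population correction
        se = sd / math.sqrt(n) * math.sqrt(1.0 - (n - 1.0) / (N - 1.0))
        z = abs(mean - mu0) / se
        tail = 1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
        if tail < eps_tol:       # confident about the sign: stop early
            return mean > mu0

# Simulated ratios that clearly favor the proposed state.
rng = np.random.default_rng(0)
ratios = rng.normal(0.5, 1.0, 100000)
accepted = approx_mh_accept(ratios, mu0=0.0, rng=rng)
print(accepted)
```

On easy decisions like this one, the test settles after a tiny fraction of the 100,000 points; hard, near-tied decisions fall back to using all the data.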
Independent Component Analysis

Mixture of 4 audio sources: 1.95 million data points, 16 dimensions.
The test function is the Amari distance to the true unmixing matrix.
SGLD + approximate MH

[Figure: performance of SGLD alone vs. SGLD combined with the approximate MH test]
Approach 2: Using Concentration Inequalities (Bardenet, Doucet, Holmes, 2014)

Test on a subset of the data; if the concentration bound is not yet decisive, collect more data.

• Complementary to the previous method
• More robust, as it does not use any CLT assumptions
• Uses more data per test if the CLT assumptions do hold

Talk tomorrow afternoon in Track C (Monte Carlo)
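The same sequential skeleton works with a concentration bound in place of the CLT interval. This sketch uses an Audibert-et-al.-style empirical Bernstein bound, which needs a known range for the log-likelihood ratios but no Gaussian assumption; the simulated bounded ratios and constants are illustrative assumptions:

```python
import numpy as np

def bernstein_halfwidth(l, delta, data_range):
    """Empirical Bernstein bound (Audibert et al., 2009 style): with
    probability >= 1 - delta,
        |sample mean - true mean| <= sqrt(2 * var * log(3/delta) / n)
                                     + 3 * data_range * log(3/delta) / n.
    """
    n = len(l)
    log_term = np.log(3.0 / delta)
    return np.sqrt(2.0 * l.var(ddof=1) * log_term / n) \
        + 3.0 * data_range * log_term / n

# Sequential test: accept iff the mean log-likelihood ratio exceeds mu0.
rng = np.random.default_rng(0)
ratios = rng.uniform(-0.8, 1.2, 100000)   # bounded ratios, true mean 0.2
perm = rng.permutation(len(ratios))
mu0, n = 0.0, 0
while True:
    n = min(n + 500, len(ratios))
    l = ratios[perm[:n]]
    hw = bernstein_halfwidth(l, delta=0.05, data_range=2.0)
    if abs(l.mean() - mu0) > hw or n == len(ratios):
        break
accept = l.mean() > mu0
print(accept, n)
```

The Bernstein half-width shrinks more slowly than a CLT interval, which is the "uses more data per test if the CLT assumptions hold" trade-off from the slide.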
Summary

Use an efficient proposal so that the Metropolis-Hastings test can be avoided:
• SGLD - Langevin dynamics with stochastic gradients
• SGFS - Preconditioning matrix based on the Fisher information at the mode
• SGRLD - Position-specific preconditioning based on Riemannian geometry
• SGHMC - Avoids random walks by taking multiple gradient steps
• DSGLD - Distributed version of the above algorithms

Approximate the Metropolis-Hastings test using less data:
• Confidence intervals - based on confidence levels using CLT assumptions
• Concentration bounds - more robust, as it does not use CLT assumptions, but uses more data than the above if the CLT assumptions do hold
Analysis: SGLD (I. Sato and H. Nakagawa, 2014)

Langevin Dynamics
• The Langevin update is a discrete-time approximation of a stochastic differential equation (SDE)
• The stationary distribution of this SDE is S0(θ)
• Discretization introduces O(ϵ) errors, which are corrected using an MH test

Stochastic Gradient Langevin Dynamics
• The stationary distribution of the SDE that SGLD represents can also be shown to be S0(θ)
• Time-discretized SGLD converges weakly to the SGLD SDE, i.e., for any continuously differentiable function f of polynomial growth the expectations under the two processes converge

Talk Monday afternoon in Track C (Monte Carlo & Approximate Inference)
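The weak-convergence statement is truncated above; in the standard form (my reconstruction from textbook SDE approximation theory, not copied from the paper) it reads:

```latex
% For every continuously differentiable test function f of polynomial
% growth, the expectation under the time-discretized SGLD chain
% \theta^{\epsilon} converges to the expectation under the SGLD SDE
% solution \theta as the step size goes to zero:
\Bigl| \mathbb{E}\bigl[f(\theta^{\epsilon}_{t})\bigr]
     - \mathbb{E}\bigl[f(\theta_{t})\bigr] \Bigr| \;\longrightarrow\; 0
\qquad \text{as } \epsilon \to 0 .
```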
Analysis: Approximate MH

Assume uniform ergodicity, and control the error in the transition kernel.

If the probability of making a wrong decision is controlled:
• then the error in the acceptance probability is bounded,
• and the error in the transition probability is bounded (in total variation).

Error in the stationary distribution: if the error in the transition probability is bounded, and uniform ergodicity holds, then the error in the stationary distribution is also bounded.

For more details:
1. P. Alquier, N. Friel, R. Everitt, A. Boland (2014)
2. R. Bardenet, A. Doucet, C. Holmes (2014)
3. A. Korattikara, Y. Chen, M. Welling (2014)
4. N. S. Pillai, A. Smith (2014)
References - MCMC

Approximate MCMC algorithms using mini-batch gradients
• Stochastic Gradient Langevin Dynamics - M. Welling and Y. W. Teh (ICML 2011)
• Stochastic Gradient Fisher Scoring - S. Ahn, A. Korattikara, M. Welling (ICML 2012)
• Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex - S. Patterson and Y. W. Teh (NIPS 2013)
• Stochastic Gradient Hamiltonian Monte Carlo - T. Chen, E. B. Fox, C. Guestrin (ICML 2014)
• Distributed Stochastic Gradient MCMC - S. Ahn, B. Shahbaba, M. Welling (ICML 2014)

Approximate MCMC algorithms using mini-batch Metropolis-Hastings
• Austerity in MCMC Land: Cutting the Metropolis-Hastings Budget - A. Korattikara, Y. Chen, M. Welling (ICML 2014)
• Towards Scaling up Markov Chain Monte Carlo: An Adaptive Subsampling Approach - R. Bardenet, A. Doucet, C. Holmes (ICML 2014)
• Approximate Slice Sampling for Bayesian Posterior Inference - C. DuBois, A. Korattikara, M. Welling, P. Smyth (AISTATS 2014)

Theory
• Approximation Analysis of Stochastic Gradient Langevin Dynamics using Fokker-Planck Equation and Ito Process - I. Sato and H. Nakagawa (ICML 2014)
• Noisy Monte Carlo: Convergence of Markov Chains with Approximate Transition Kernels - P. Alquier, N. Friel, R. Everitt, A. Boland (arXiv 2014)
• Ergodicity of Approximate MCMC Chains with Applications to Large Data Sets - N. S. Pillai, A. Smith (arXiv 2014)

Asymptotically unbiased MCMC algorithms using mini-batches
• Asymptotically Exact, Embarrassingly Parallel MCMC - W. Neiswanger, C. Wang, E. Xing (arXiv 2013)
• Firefly Monte Carlo: Exact MCMC with Subsets of Data - D. Maclaurin, R. P. Adams (arXiv 2014)
• Accelerating MCMC via Parallel Predictive Prefetching - E. Angelino, E. Kohler, A. Waterland, M. Seltzer, R. P. Adams (arXiv 2014)
Conclusions & Future Directions

• Bayesian inference is not superfluous in the context of big data.
• Two requirements:
  • Stochastic / mini-batch-based updates
  • Distributed implementation
• Two fruitful approaches:
  • Stochastic variational Bayes
  • Mini-batch MCMC
• Future VB:
  • Very flexible variational posteriors, very small remaining bias
  • Black-box inference engine, a la Infer.NET, BUGS
• Future MCMC:
  • Better theory
  • Better use of powerful (stochastic) optimization methods
Stochastic, Fully Structured, Distributed Variational Bayes (driving the bias to 0)

Stochastic Approximation MCMC (driving the variance to 0)
Acknowledgements & Collaborators

• Yee Whye Teh
• Sungjin Ahn
• Babak Shahbaba
• Yutian Chen
• Durk Kingma
• Taco Cohen
• Alex Ihler
• Chris DuBois
• Padhraic Smyth
• Dan Gillen