Gibbs Sampling and Variational Methods

Héctor Corrada Bravo

University of Maryland, College Park, USA

CMSC 644: 2019-04-03

Blei (2012), Comm. ACM

Mixture Models

Documents as mixtures of topics (Hofmann 1999, Blei et al. 2003)

Kovacevic (2014), PLOS One

More applications: genetics, populations as mixtures of ancestral populations

Glueck, et al. (2017). TVCG

More applications: clinical subtyping

Schulman and Saria (2016). JMLR

More applications: clinical prognosis

Software

https://radimrehurek.com/gensim/

https://mc-stan.org/

We have a set of documents $D$.

Each document is modeled as a bag-of-words (bow) over dictionary $W$.

$x_{w,d}$: the number of times word $w \in W$ appears in document $d \in D$.

Approximate Inference by Sampling

Ultimately, what we are interested in is learning topics.

Perhaps, instead of finding parameters $\theta$ that maximize the likelihood, we can sample from a distribution $\Pr(\theta \mid D)$ that gives us topic estimates.

But we have only talked about $\Pr(D \mid \theta)$. How can we sample parameters?

Like EM, the trick here is to expand the model with latent data $Z_m$ and sample from the distribution

$$\Pr(\theta, Z_m \mid Z)$$

This is challenging, but sampling from $\Pr(\theta \mid Z_m, Z)$ and $\Pr(Z_m \mid \theta, Z)$ is easier.

The Gibbs Sampler does exactly that.

Property: after some rounds, samples from the conditional distribution $\Pr(\theta \mid Z_m, Z)$ correspond to samples from the marginal

$$\Pr(\theta \mid Z) = \sum_{Z_m} \Pr(\theta, Z_m \mid Z)$$
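To see this property in action outside the topic-model setting, here is a toy sketch that is not from the lecture: a Gibbs sampler for a standard bivariate normal with correlation `rho`, where both conditionals are one-dimensional normals. The function name, the starting point, and the iteration counts are all illustrative assumptions.

```python
import numpy as np

def gibbs_bivariate_normal(rho, num_iters=20000, burn_in=1000, seed=0):
    """Toy Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = np.random.default_rng(seed)
    cond_sd = np.sqrt(1.0 - rho ** 2)  # sd of x | y and of y | x
    x, y = 5.0, -5.0                   # deliberately poor starting point
    samples = np.empty((num_iters, 2))
    for i in range(num_iters):
        x = rng.normal(rho * y, cond_sd)  # sample x from Pr(x | y)
        y = rng.normal(rho * x, cond_sd)  # sample y from Pr(y | x)
        samples[i] = (x, y)
    return samples[burn_in:]              # drop the early rounds

samples = gibbs_bivariate_normal(rho=0.8)
x_mean = samples[:, 0].mean()
x_var = samples[:, 0].var()
xy_corr = np.corrcoef(samples[:, 0], samples[:, 1])[0, 1]
print(x_mean, x_var, xy_corr)
```

Although `x` is only ever drawn from its conditional given `y`, the retained `x` values should look like draws from the marginal of `x` (mean near 0, variance near 1).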

Quick aside, how to simulate data for pLSA?

Generate parameters $\{p_d\}$ and $\{\theta_t\}$

Generate $\Delta_{w,d,t}$

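Let's go backwards, let's deal with $\Delta_{w,d,t}$ first:

$$\Delta_{w,d,\cdot} \sim \mathrm{Mult}_{x_{w,d}}(\gamma_{w,d,1}, \ldots, \gamma_{w,d,T})$$

where $\gamma_{w,d,t}$ is as given by the E-step.

    # num_words (dictionary size) is an assumed name; delta and gamma have shape (num_docs, num_words, num_topics)
    for d in range(num_docs):
        for w in range(num_words):
            delta[d,w,:] = np.random.multinomial(doc_mat[d,w],
                                                 gamma[d,w,:])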
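The sampling step for $\Delta_{w,d,t}$ needs $\gamma_{w,d,t}$. A minimal sketch follows, assuming the usual pLSA E-step responsibilities $\gamma_{w,d,t} \propto p_{t,d}\,\theta_{w,t}$ (the slide does not restate that formula). The array shapes, the helper names `e_step_gamma` and `sample_delta`, and the randomly generated placeholder inputs are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
num_docs, num_words, num_topics = 4, 6, 3

# Placeholder inputs (assumed shapes): p is (T, D), theta is (T, W), doc_mat is (D, W).
p = rng.dirichlet(np.ones(num_topics), size=num_docs).T
theta = rng.dirichlet(np.ones(num_words), size=num_topics)
doc_mat = rng.integers(0, 10, size=(num_docs, num_words))

def e_step_gamma(p, theta):
    """gamma[d, w, t] proportional to p[t, d] * theta[t, w], normalized over t."""
    unnorm = p.T[:, None, :] * theta.T[None, :, :]  # shape (D, W, T)
    return unnorm / unnorm.sum(axis=2, keepdims=True)

def sample_delta(doc_mat, gamma, rng):
    """Split each count x[w, d] across topics with one multinomial draw."""
    delta = np.zeros(gamma.shape, dtype=np.int64)
    for d in range(doc_mat.shape[0]):
        for w in range(doc_mat.shape[1]):
            delta[d, w, :] = rng.multinomial(doc_mat[d, w], gamma[d, w, :])
    return delta

gamma = e_step_gamma(p, theta)
delta = sample_delta(doc_mat, gamma, rng)
```

By construction, summing `delta` over the topic axis recovers the original counts in `doc_mat`.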

Hmm, that's a problem since we need ...

But we know $\Pr(w, d) = \sum_t p_{t,d}\,\theta_{w,t}$, so let's use that to generate each $x_{w,d}$ as

$$(x_{1,d}, \ldots, x_{W,d}) \sim \mathrm{Mult}_{n_d}(\Pr(1, d), \ldots, \Pr(W, d))$$

    # assumes p has shape (num_topics, num_docs) and theta has shape (num_topics, num_words)
    doc_mat[d,:] = np.random.multinomial(nw[d], np.sum(p[:,d,None] * theta, axis=0))

Now, how about the parameters $\{p_d\}$ and $\{\theta_t\}$? How do we generate the parameters of a Multinomial distribution?

This is where the Dirichlet distribution comes in...

If $p_d \sim \mathrm{Dir}(\alpha)$, then

$$\Pr(p_d) \propto \prod_{t=1}^{T} p_{t,d}^{\alpha_t - 1}$$

Some interesting properties:

$$E[p_{t,d}] = \frac{\alpha_t}{\sum_{t'} \alpha_{t'}}$$

So, if we set all $\alpha_t = 1$ we will tend to have uniform probability over topics ($1/T$ each on average).

If we increase to $\alpha_t = 100$ it will also have uniform probability but will have very little variance (it will almost always be close to $1/T$).
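These two properties are easy to check empirically. The sketch below is only an illustration: the number of topics, the number of draws, and the variable names are arbitrary choices, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
num_topics, num_draws = 5, 10000

# Symmetric Dirichlet draws; each row is one probability vector over topics.
flat = rng.dirichlet(1.0 * np.ones(num_topics), size=num_draws)      # alpha_t = 1
peaked = rng.dirichlet(100.0 * np.ones(num_topics), size=num_draws)  # alpha_t = 100

# Both settings should average about 1/T per topic, but the spread differs a lot.
print("alpha_t = 1:   mean", flat.mean(axis=0).round(3), "sd", flat.std(axis=0).round(3))
print("alpha_t = 100: mean", peaked.mean(axis=0).round(3), "sd", peaked.std(axis=0).round(3))
```

The per-topic means should agree between the two settings, while the standard deviations under $\alpha_t = 100$ should be much smaller.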

So, we can say $p_d \sim \mathrm{Dir}(\alpha)$ and $\theta_t \sim \mathrm{Dir}(\beta)$

And generate data as (with $\alpha_t = 1$)

    p[:,d] = np.random.dirichlet(1. * np.ones(num_topics))
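Putting the pieces of the simulation aside together, one possible end-to-end sketch is shown below. The function name `simulate_corpus`, the fixed `words_per_doc` (standing in for the per-document totals $n_d$), and the array shapes are assumptions made for this illustration.

```python
import numpy as np

def simulate_corpus(num_docs, num_words, num_topics, words_per_doc,
                    alpha=1.0, beta=1.0, seed=0):
    """Simulate word counts from Dirichlet-distributed topic-model parameters."""
    rng = np.random.default_rng(seed)
    # p[:, d] ~ Dir(alpha): topic proportions of document d, shape (T, D).
    p = rng.dirichlet(alpha * np.ones(num_topics), size=num_docs).T
    # theta[t, :] ~ Dir(beta): word distribution of topic t, shape (T, W).
    theta = rng.dirichlet(beta * np.ones(num_words), size=num_topics)
    doc_mat = np.zeros((num_docs, num_words), dtype=np.int64)
    for d in range(num_docs):
        word_probs = p[:, d] @ theta     # Pr(w, d) = sum_t p[t, d] * theta[t, w]
        word_probs /= word_probs.sum()   # guard against floating-point drift
        doc_mat[d, :] = rng.multinomial(words_per_doc, word_probs)
    return p, theta, doc_mat

p, theta, doc_mat = simulate_corpus(num_docs=8, num_words=20,
                                    num_topics=3, words_per_doc=50)
```

Each row of `doc_mat` is one simulated document, and its entries sum to `words_per_doc`.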

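So what we have is a prior over parameters $\{p_d\}$ and $\{\theta_t\}$: $\Pr(p_d \mid \alpha)$ and $\Pr(\theta_t \mid \beta)$

And we can formulate a joint distribution over the missing data $\Delta_{w,d,t}$ and the parameters:

$$\Pr(\Delta_{w,d,t}, p_d, \theta_t \mid \alpha, \beta) = \Pr(\Delta_{w,d,t} \mid p_d, \theta_t)\,\Pr(p_d \mid \alpha)\,\Pr(\theta_t \mid \beta)$$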

However, what we care about is the posterior distribution

$$\Pr(p_d \mid \Delta_{w,d,t}, \theta_t, \alpha, \beta)$$

What do we do???

Another neat property of the Dirichlet distribution is that it is conjugate to the Multinomial

If $\theta \mid \alpha \sim \mathrm{Dir}(\alpha)$ and $X \mid \theta \sim \mathrm{Multinomial}(\theta)$, then

$$\theta \mid X, \alpha \sim \mathrm{Dir}(X + \alpha)$$
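As a small worked example of conjugacy, the sketch below observes multinomial counts and then samples from the resulting Dirichlet posterior. The prior, the "true" probabilities, and the sample sizes are made-up values for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

alpha = np.array([1.0, 1.0, 1.0])       # prior: theta ~ Dir(alpha)
true_theta = np.array([0.7, 0.2, 0.1])  # made-up parameters used to simulate X
counts = rng.multinomial(1000, true_theta)  # observed X | theta ~ Multinomial(theta)

# Conjugacy: the posterior is again a Dirichlet, with the counts added to alpha.
posterior_alpha = counts + alpha
posterior_draws = rng.dirichlet(posterior_alpha, size=5000)

# Posterior mean from the Dirichlet mean formula, to compare against the draws.
posterior_mean = posterior_alpha / posterior_alpha.sum()
print(posterior_mean, posterior_draws.mean(axis=0))
```

No iterative fitting is needed: updating the prior amounts to adding the observed counts to $\alpha$.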

That means we can sample from

$$p_{t,d} \sim \mathrm{Dir}\left(\sum_w \Delta_{w,d,t} + \alpha\right)$$

$$\theta_{w,t} \sim \mathrm{Dir}\left(\sum_d \Delta_{w,d,t} + \beta\right)$$

Blei, Ng, Jordan (2003), JMLR

Coincidentally, we have just specified the Latent Dirichlet Allocation method for topic modeling.

This is the most commonly used method for topic modeling.

We can now specify a full Gibbs Sampler for an LDA mixture model.

Given:

Word-document counts $x_{w,d}$

Number of topics $K$ (the same quantity written as $T$ in the formulas)

Prior parameters $\alpha$ and $\beta$

Do: Learn parameters $\{p_d\}$ and $\{\theta_t\}$ for $K$ topics

Step 0: Initialize parameters $\{p_d\}$ and $\{\theta_t\}$

$$p_d \sim \mathrm{Dir}(\alpha)$$

$$\theta_t \sim \mathrm{Dir}(\beta)$$

Step 1: Sample $\Delta_{w,d,t}$ based on current parameters $\{p_d\}$ and $\{\theta_t\}$

$$\Delta_{w,d,\cdot} \sim \mathrm{Mult}_{x_{w,d}}(\gamma_{w,d,1}, \ldots, \gamma_{w,d,T})$$

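Step 2: Sample parameters from

$$p_{t,d} \sim \mathrm{Dir}\left(\sum_w \Delta_{w,d,t} + \alpha\right)$$

$$\theta_{w,t} \sim \mathrm{Dir}\left(\sum_d \Delta_{w,d,t} + \beta\right)$$

Step 3: Get samples (repeating Steps 1 and 2) for a number of iterations (e.g., 200); we want to reach a stationary distribution...

Step 4: Estimate $\hat{\Delta}_{w,d,t}$ as the average of the samples from the last $m$ iterations (e.g., $m=500$)

Step 5: Estimate parameters $p_d$ and $\theta_t$ based on estimated $\hat{\Delta}_{w,d,t}$

$$\hat{p}_{t,d} = \frac{\sum_w \hat{\Delta}_{w,d,t} + \alpha}{\sum_{t'} \left(\sum_w \hat{\Delta}_{w,d,t'} + \alpha\right)}$$

$$\hat{\theta}_{w,t} = \frac{\sum_d \hat{\Delta}_{w,d,t} + \beta}{\sum_{w'} \left(\sum_d \hat{\Delta}_{w',d,t} + \beta\right)}$$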
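The whole procedure can be sketched in a few dozen lines of NumPy. This is an illustrative sketch rather than a reference implementation: the function name `lda_gibbs`, the array shapes, the symmetric scalar priors, the assumed E-step form $\gamma_{w,d,t} \propto p_{t,d}\,\theta_{w,t}$, and the small default iteration counts are all assumptions, and the tiny corpus is made up.

```python
import numpy as np

def lda_gibbs(doc_mat, num_topics, alpha=1.0, beta=1.0,
              burn_in=200, num_keep=100, seed=0):
    """Gibbs sampler sketch; doc_mat has shape (num_docs, num_words)."""
    rng = np.random.default_rng(seed)
    num_docs, num_words = doc_mat.shape

    # Initialize parameters from their priors: p is (T, D), theta is (T, W).
    p = rng.dirichlet(alpha * np.ones(num_topics), size=num_docs).T
    theta = rng.dirichlet(beta * np.ones(num_words), size=num_topics)

    delta_sum = np.zeros((num_docs, num_words, num_topics))
    for it in range(burn_in + num_keep):
        # Sample delta given the current parameters.
        gamma = p.T[:, None, :] * theta.T[None, :, :]   # (D, W, T)
        gamma /= gamma.sum(axis=2, keepdims=True)
        delta = np.zeros_like(delta_sum)
        for d, w in zip(*np.nonzero(doc_mat)):          # skip zero counts
            delta[d, w, :] = rng.multinomial(doc_mat[d, w], gamma[d, w, :])

        # Sample parameters from their Dirichlet posteriors given delta.
        doc_topic = delta.sum(axis=1)      # (D, T): sum over words
        topic_word = delta.sum(axis=0).T   # (T, W): sum over documents
        for d in range(num_docs):
            p[:, d] = rng.dirichlet(doc_topic[d] + alpha)
        for t in range(num_topics):
            theta[t, :] = rng.dirichlet(topic_word[t] + beta)

        # After burn-in, accumulate delta so it can be averaged.
        if it >= burn_in:
            delta_sum += delta

    delta_hat = delta_sum / num_keep

    # Point estimates of the parameters from the averaged delta.
    p_hat = (delta_hat.sum(axis=1) + alpha).T           # (T, D)
    p_hat /= p_hat.sum(axis=0, keepdims=True)
    theta_hat = delta_hat.sum(axis=0).T + beta          # (T, W)
    theta_hat /= theta_hat.sum(axis=1, keepdims=True)
    return p_hat, theta_hat, delta_hat

# Made-up corpus: documents 0-1 only use words 0-2, documents 2-3 only use words 3-5.
doc_mat = np.array([[5, 4, 6, 0, 0, 0],
                    [6, 5, 5, 0, 0, 0],
                    [0, 0, 0, 5, 6, 4],
                    [0, 0, 0, 4, 5, 6]])
p_hat, theta_hat, delta_hat = lda_gibbs(doc_mat, num_topics=2)
print(p_hat.round(2))
print(theta_hat.round(2))
```

The explicit Python loops keep the correspondence with the steps easy to follow; for a realistic corpus they would be far too slow, and a library such as gensim would be the practical choice.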

Mixture models

We have now seen two different mixture models: soft k-means and topic models

Two inference procedures:

Exact Inference with Maximum Likelihood using the EM algorithm

Approximate Inference using Gibbs Sampling

Next, we will go back to Maximum Likelihood but learn about Approximate Inference using Variational Methods

Variational Methods

Consider LDA model again

Benefits

Full document generative model

Can process new documents (posterior over topics) and words (prior parameters)

With Gibbs we sampled from $\Pr(\theta, \Delta \mid x, \alpha, \beta)$

What if we want to estimate parameters again? (maximum a posteriori parameters)

Very difficult to maximize

Harder than pLSA due to Dirichlet priors

Let's get inspiration from EM: maximize a lower bound

But what should the lower bound be?

Make missing data and parameters "independent"!

Find parameters that make the simple model most similar to the original model
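One common way to make these ideas concrete is the standard mean-field construction, written out below. The slides do not spell it out, so the symbols $q$ and $\mathcal{L}$ and the exact factorization are assumptions about what is intended, with $\theta$ standing for all model parameters as in $\Pr(\theta, \Delta \mid x, \alpha, \beta)$.

```latex
% Simple model: missing data and parameters are made independent.
q(\theta, \Delta) = q(\theta)\, q(\Delta)

% Lower bound on the log evidence, valid for any such q:
\mathcal{L}(q) = E_q\left[\log \Pr(x, \theta, \Delta \mid \alpha, \beta)\right]
               - E_q\left[\log q(\theta, \Delta)\right]

% The gap between the log evidence and the bound is a KL divergence:
\log \Pr(x \mid \alpha, \beta)
  = \mathcal{L}(q)
  + \mathrm{KL}\left(q(\theta, \Delta) \,\|\, \Pr(\theta, \Delta \mid x, \alpha, \beta)\right)
  \ge \mathcal{L}(q)
```

Because the left-hand side does not depend on $q$, maximizing $\mathcal{L}(q)$ is the same as minimizing the KL divergence, which is one precise sense in which the simple model is made most similar to the original model.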

Can then define EM-like algorithm

E-step: define expectation w.r.t. approximate distribution

M-step: maximize parameters of approximate distribution

Net result:

1) Maximum posterior estimates

2) Super simple updates

3) With stochastic approach (update using a few words at a time), extremely scalable

Hoffman, et al. (2010). NIPS


Conclusion

Probabilistic mixture models: powerful model class with many applications

Awesome historical algorithmic development

Outstanding software support
