Inference and Representation, Lab 9: Extensions of LDA
Yacine Jernite
November 6, 2014
Yacine Jernite Inference and Representation: Lab 9 Extensions of LDA
Lecture plan
Notes on MCMC methods
LDA inference and learning
Variations of LDA
Notes on MCMC
Stationary distribution of MCMC satisfies detailed balance:
T(x′|x)P(x) = T(x|x′)P(x′)

Hence: P(x′) = ∑_x T(x′|x)P(x)
∀n > N, x_n ∼ P
(x_n, x_{n+1}, . . . , x_{n+M}) are not i.i.d. from P
However, (1/M) ∑_{l=1}^M x_{n+l} is an unbiased estimator of E_P[x]
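These properties can be checked numerically on a toy chain. The sketch below (transition matrix and target distribution are made up for illustration) verifies detailed balance pairwise, then averages correlated post-burn-in samples to estimate E_P[x]:

```python
import random

# Hypothetical 2-state chain built to satisfy detailed balance w.r.t. P = (0.75, 0.25).
P = [0.75, 0.25]
T = [[0.9, 0.1],     # T[x][y] = T(y | x)
     [0.3, 0.7]]

# Detailed balance: T(x'|x) P(x) == T(x|x') P(x') for every pair of states.
for x in range(2):
    for y in range(2):
        assert abs(T[x][y] * P[x] - T[y][x] * P[y]) < 1e-12

# Burn in, then average correlated samples: each x_n is marginally ~ P,
# so the running mean still estimates E_P[x] = 0.25 despite correlation.
random.seed(0)
x = 0
for _ in range(1000):                          # burn-in
    x = 0 if random.random() < T[x][0] else 1
total, n_samples = 0, 200000
for _ in range(n_samples):
    x = 0 if random.random() < T[x][0] else 1
    total += x
estimate = total / n_samples
```

The samples are correlated (here the autocorrelation roughly quadruples the variance of the mean versus i.i.d. draws), but the estimator still converges to E_P[x].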
Notes on MCMC
Ergodicity of Gibbs Sampling
Irreducible: iff all variables can be explored (fixed or random update order).
Aperiodic: ∀X_0, ∀X, ∀n > N_0, P^n(X|X_0) > 0.
Depends on the model.
Collapsed Gibbs Sampling
Figure: chain A → B → C, collapsed to A → C.
P(C|A) = ∑_B P(B,C|A) = ∑_B P(C|B)P(B|A).
Empirical estimate of E[B] from a sample of A.
Remember: drop all constants in derivations!
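A minimal sketch of the marginalization behind collapsing, on the toy chain A → B → C with made-up binary conditional tables:

```python
# Hypothetical CPTs for a binary chain A -> B -> C.
P_B_given_A = {0: [0.8, 0.2], 1: [0.3, 0.7]}   # P(B = b | A = a)
P_C_given_B = {0: [0.6, 0.4], 1: [0.1, 0.9]}   # P(C = c | B = b)

def p_c_given_a(c, a):
    # Collapse B analytically: P(C|A) = sum_B P(C|B) P(B|A)
    return sum(P_C_given_B[b][c] * P_B_given_A[a][b] for b in (0, 1))

collapsed = p_c_given_a(1, 0)   # 0.8 * 0.4 + 0.2 * 0.9 = 0.5
```

A sampler over (A, C) alone then never has to visit B, which is exactly what the collapsed Gibbs sampler for LDA does with θ and the topic-word distributions.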
Lecture plan: LDA
Brief history
Model description and independence assumptions
Treewidth and cost of exact inference
Approximate inference method
Learning algorithm
History of topic modelling
LSI (Deerwester et al., 1990): classifying documents from a bag-of-words representation, via SVD of tf-idf counts.
      w1      w2      ...  wV
d1    0.01    0.021   ...  0.005
d2    0.0031  0.102   ...  0.3
...   ...     ...     ...  ...
dM    0.11    0.0041  ...  0.093
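As a rough illustration of the SVD step behind LSI, the sketch below runs power iteration on AᵀA of a tiny made-up document-term count matrix (standing in for tf-idf values) to recover the leading singular value and latent direction:

```python
import math

# Toy document-term matrix (rows: documents, cols: word counts),
# standing in for tf-idf values; all numbers are illustrative.
A = [[2.0, 0.0, 1.0],
     [1.0, 0.0, 2.0],
     [0.0, 3.0, 0.0]]

def matvec(M, v):
    return [sum(row[j] * v[j] for j in range(len(v))) for row in M]

At = [list(col) for col in zip(*A)]   # transpose of A

v = [1.0, 1.0, 1.0]                   # initial guess for the top right singular vector
for _ in range(100):                  # power iteration on A^T A
    w = matvec(At, matvec(A, v))
    norm = math.sqrt(sum(x * x for x in w))
    v = [x / norm for x in w]

sigma1 = math.sqrt(sum(x * x for x in matvec(A, v)))   # leading singular value
```

A full LSI pipeline keeps the top k singular triples and represents each document by its k coordinates in that latent space.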
pLSI (Hofmann, 1999):
p(d, w_n) = p(d) ∑_z p(w_n|z) p(z|d)
LDA (Blei, Ng, and Jordan, 2003): admixture model of text.
Extensions (2003-present): Tailored to many problems.
Model description and independence assumptions
Figure: Plate models for pLSI (top) and LDA (bottom)
Treewidth and cost of exact inference
Expanded model is a tree
BUT:
p(θ, z | w, α, β) = p(θ, z, w | α, β) / p(w | α, β)

with:

p(w | α, β) ∝ ∫ ( ∏_{i=1}^k θ_i^{α_i−1} ) ∏_{n=1}^N ∑_{i=1}^k ∏_{j=1}^V (θ_i β_{ij})^{w_n^j} dθ

Intractable.
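At toy scale the marginal can still be estimated by Monte Carlo over θ, which makes the coupling between θ and the per-word sums concrete (k = 2, V = 3 and all values below are made up; this does not scale to realistic k, N, V):

```python
import random

# Toy LDA instance: k = 2 topics, V = 3 word types, N = 3 tokens.
alpha = [1.0, 1.0]                 # Dirichlet(1, 1): uniform over the simplex
beta = [[0.7, 0.2, 0.1],           # beta[i][j] = p(word j | topic i)
        [0.1, 0.3, 0.6]]
words = [0, 2, 1]                  # the observed document

def sample_dirichlet(a):
    g = [random.gammavariate(ai, 1.0) for ai in a]
    s = sum(g)
    return [x / s for x in g]

# p(w | alpha, beta) = E_{theta ~ Dir(alpha)}[ prod_n sum_i theta_i beta[i][w_n] ]
random.seed(0)
S = 50000
total = 0.0
for _ in range(S):
    theta = sample_dirichlet(alpha)
    lik = 1.0
    for w in words:
        lik *= sum(theta[i] * beta[i][w] for i in range(2))
    total += lik
p_w = total / S                    # Monte Carlo estimate of the marginal
```

The product over words of sums over topics is what ties all components of θ together under the integral, so exact inference cannot exploit the tree structure of the expanded model.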
Approximate inference method: LDA
Mean field inference: see last week, assignment
Find a fully factorized q(θ, z) = q(θ) ∏_{i=1}^n q(z_i) that minimizes D_KL(q || p(·|w, α, β)).
Easily get pseudo-marginals.
Gibbs sampling: see assignment
Sample θ ∼ p(θ|z, w, α, β), then sample z_i ∼ p(z_i|θ, z_{−i}, w, α, β).
Collapsed version: z_i ∼ p(z_i|z_{−i}, w, α, β).
Get empirical estimates of marginals.
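A compact sketch of the collapsed sampler on a toy corpus, using the standard count-ratio form of p(z_i | z_{−i}, w, α, β) with θ and the topic-word distributions integrated out (corpus, K, V and hyperparameters are illustrative):

```python
import random

docs = [[0, 0, 1, 2], [2, 2, 3, 3], [0, 1, 1, 3]]   # toy corpus of word ids
K, V = 2, 4
alpha, beta = 0.5, 0.1

random.seed(0)
n_dk = [[0] * K for _ in docs]        # topic counts per document
n_kw = [[0] * V for _ in range(K)]    # word counts per topic
n_k = [0] * K                         # total tokens per topic
z = []                                # topic assignment of each token
for d, doc in enumerate(docs):
    z.append([])
    for w in doc:
        k = random.randrange(K)
        z[d].append(k)
        n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1

for _ in range(200):                  # Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]               # remove the token from all counts
            n_dk[d][k] -= 1; n_kw[k][w] -= 1; n_k[k] -= 1
            # p(z_i = j | z_-i, w) up to a constant
            p = [(n_dk[d][j] + alpha) * (n_kw[j][w] + beta) / (n_k[j] + V * beta)
                 for j in range(K)]
            r = random.random() * sum(p)
            k = 0
            while r > p[k]:
                r -= p[k]
                k += 1
            z[d][i] = k               # reinsert the token with its new topic
            n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1
```

Marginals such as the per-document topic proportions are then estimated from the counts, e.g. (n_dk[d][k] + α) / (len(docs[d]) + Kα), averaged over sweeps.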
Learning algorithm
Variational EM: see last week
E step is approximate inference: find q.
M step: max_{α,β} E_q[log p(θ, z, w; α, β)]
Bayesian Gibbs sampling
Bayesian prior on α, β reduces learning to inference.
Learning algorithm: Gibbs Sampling
Putting a prior distribution on the parameters turns them into random variables, which can then be sampled within the Gibbs Sampling algorithm:
Figure: Putting a Bayesian prior on the parameters: α ∼ Dirichlet(·; α_0) (optional, see Mallet) and β ∼ Dirichlet(·; β_0)
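With a Dirichlet prior, the conditional of each topic-word distribution given the current topic assignments is again Dirichlet (prior pseudo-counts plus observed counts), so β can be resampled inside the Gibbs sweep. A sketch with made-up counts for one topic:

```python
import random

beta0 = [0.1, 0.1, 0.1, 0.1]     # symmetric Dirichlet prior over V = 4 words
n_kw = [5, 0, 2, 1]              # toy counts of words currently assigned to topic k

def sample_dirichlet(a):
    g = [random.gammavariate(ai, 1.0) for ai in a]
    s = sum(g)
    return [x / s for x in g]

random.seed(0)
# Conjugacy: posterior parameters = prior pseudo-counts + observed counts.
beta_k = sample_dirichlet([b + n for b, n in zip(beta0, n_kw)])
```

The same conjugate update applies to α given the per-document topic counts, which is what reduces learning to inference.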
Lecture plan: Variations of LDA
Time and influence
Correlated topics
Supervised topic models
Dynamic topic models
First idea: introduce variability across time for α, β.
Figure: Dynamic Topic Model (Blei and Lafferty, 2006)
Dynamic topic models
Parametrization of P(α_t | α_{t−1}) and P(β_t | β_{t−1})
For the variational EM algorithm: keep the dependences across time
Figure: unrolled Dynamic Topic Model across time steps t−1, t, t+1, with variational factorization
q(z, θ, β) = ∏_{k=1}^K q(β_{k,1}, . . . , β_{k,T}) ∏_{t=1}^T ∏_{d=1}^{D_t} ( q(θ_{t,d}) ∏_{n=1}^{N_{t,d}} q(z_{t,d,n}) )
Dynamic topic models: modelling influence
Next step: model what drives the changes in β:
Figure: Document Influence Model (Gerrish and Blei, 2010)
Dynamic topic models: Document Influence Model
Originally applied to measuring the impact of scholarly publications
Parametrization of influence:
β_{k,t+1} | β_{k,t}, (w, l, z)_t ∼ N(·; β_{k,t} + exp(−β_{k,t}) ∑_d l_{d,k} ∑_n w_{d,n} z_{d,n,k}, σ²I)
Learnt with variational EM: same factorization, now with independent factors for l:
q(z, θ, β, l) = ∏_{k=1}^K q(β_{k,1}, . . . , β_{k,T}) ∏_{t=1}^T ∏_{d=1}^{D_t} ( q(θ_{t,d}) q(l_{t,d}) ∏_{n=1}^{N_{t,d}} q(z_{t,d,n}) )
Dynamic topic models: Document Influence Model
Dynamic Topic Model: http://pdf.aminer.org/000/334/521/dynamic_topic_models.pdf
Document Influence Model: https://www.cs.princeton.edu/~blei/papers/GerrishBlei2010.pdf
Application to musical influence: http://jmlr.org/proceedings/papers/v28/shalit13.pdf
Introducing structure over topics
Correlated Topic Model (Blei and Lafferty, 2006)
Models 2nd order moments of topics with a logistic normal distribution: θ⁰ ∼ N(·; α, Σ), and θ_k = exp(θ⁰_k) / ∑_{k′} exp(θ⁰_{k′})
Original paper: fully factorized mean field approximation and Taylor expansion.
K² additional parameters; only models 2nd order correlations.
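A minimal sketch of the logistic-normal draw in the CTM: sample η ∼ N(µ, Σ) using an assumed Cholesky factor of Σ, then map through the softmax onto the simplex (all numbers are illustrative):

```python
import math
import random

mu = [0.0, 0.5, -0.5]
L = [[1.0, 0.0, 0.0],        # assumed Cholesky factor of Sigma: Sigma = L L^T
     [0.8, 0.6, 0.0],        # off-diagonal entry correlates topics 1 and 2
     [0.0, 0.0, 1.0]]

random.seed(0)
eps = [random.gauss(0.0, 1.0) for _ in mu]
# eta ~ N(mu, Sigma) via the reparametrization eta = mu + L eps
eta = [mu[i] + sum(L[i][j] * eps[j] for j in range(3)) for i in range(3)]
m = max(eta)                 # numerically stable softmax
exps = [math.exp(e - m) for e in eta]
s = sum(exps)
theta = [e / s for e in exps]
```

Unlike a Dirichlet draw, the off-diagonal entries of Σ let two topics systematically rise and fall together, at the cost of losing conjugacy with the multinomial.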
Pachinko Allocation Model (Li and McCallum, 2006)
DAG structure on topics; z_w is the path of word w in the DAG:
z_{w,i} ∼ Mult(·; θ^{(d)}_{z_{w,i−1}})
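Sampling one word's topic path in a hypothetical PAM DAG, descending level by level with document-specific child distributions θ^(d) (node names and probabilities are made up):

```python
import random

# theta_d[node] is the document-specific distribution over that node's children.
theta_d = {
    "root":   {"super1": 0.7, "super2": 0.3},
    "super1": {"sub1": 0.5, "sub2": 0.5},
    "super2": {"sub2": 0.2, "sub3": 0.8},
}

def sample_path(theta_d, node="root"):
    """Draw z_w level by level: z_{w,i} ~ Mult(theta_d[z_{w,i-1}])."""
    path = [node]
    while node in theta_d:            # descend until a leaf (sub-)topic
        children = list(theta_d[node].items())
        r = random.random()
        for child, p in children:
            r -= p
            if r <= 0:
                node = child
                break
        else:                         # guard against float round-off
            node = children[-1][0]
        path.append(node)
    return path

random.seed(0)
path = sample_path(theta_d)           # one root-to-leaf topic path
```

The word itself would then be drawn from the word distribution of the final sub-topic on the path.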
Pachinko Allocation Model: Inference
Figure: Different realizations of the Pachinko Allocation Model. How many parameters are used? What are the effects on topic structure?
Pachinko Allocation Model: Inference
Inference with a Gibbs Sampler for the Four-level PAM structure:
z_{w,1} = 1 (the root), θ are sampled as in LDA, and:

P(z_{w,2} = t_k, z_{w,3} = t_p | D, z_{−w}, α, β) ∝
(n^{(d)}_{1k} + α_{1k}) / (∑_{k′} n^{(d)}_{1k′} + α_{1k′}) × (n^{(d)}_{kp} + α_{kp}) / (∑_{p′} n^{(d)}_{kp′} + α_{kp′}) × (n_{pw} + β_w) / (∑_{w′} n_{pw′} + β_{w′})
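Evaluating this conditional amounts to multiplying three count ratios. A sketch with made-up count tables (the names n1_d, nkp_d, npw are illustrative, with a symmetric β for simplicity):

```python
# Hyperparameters and count tables for the current document d (toy values):
alpha1 = [0.5, 0.5]                  # root -> super-topic hyperparameters
alpha_kp = [[0.5, 0.5], [0.5, 0.5]]  # super -> sub-topic hyperparameters
beta, V = 0.1, 3                     # symmetric word smoothing, vocabulary size

n1_d = [4, 2]                        # n^(d)_{1k}: root -> super-topic counts
nkp_d = [[3, 1], [0, 2]]             # n^(d)_{kp}: super -> sub-topic counts
npw = [[2, 1, 0], [1, 0, 3]]         # n_{pw}: sub-topic -> word counts
w = 0                                # current word type

def conditional(k, p):
    # Product of the three count ratios from the slide (up to a constant).
    a = (n1_d[k] + alpha1[k]) / (sum(n1_d) + sum(alpha1))
    b = (nkp_d[k][p] + alpha_kp[k][p]) / (sum(nkp_d[k]) + sum(alpha_kp[k]))
    c = (npw[p][w] + beta) / (sum(npw[p]) + V * beta)
    return a * b * c

probs = {(k, p): conditional(k, p) for k in range(2) for p in range(2)}
Z = sum(probs.values())
probs = {kp: v / Z for kp, v in probs.items()}   # normalized joint over (z2, z3)
```

The sampler then draws the pair (z_{w,2}, z_{w,3}) jointly from this table, after decrementing the current word's contribution from the counts.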
Pachinko Allocation Model: Learning
Moment matching at every step of the Gibbs Sampler for α_{xy}:

α_{xy} ∝ ∑_d n^{(d)}_{xy} / n^{(d)}_x
Results: 6 super-topics, 12 sub-topics in Figure 3 of http://people.cs.umass.edu/~mccallum/papers/pam-icml06.pdf
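The moment-matching update itself is a short sum over per-document count fractions, assuming n^(d)_x = ∑_y n^(d)_{xy} (toy counts; the overall scale s is set separately):

```python
# Toy per-document counts n_xy^(d): rows are documents d, columns are the
# children y of a fixed interior node x (values and scale s are made up).
n_xy = [[3, 1], [2, 2], [0, 4]]
s = 1.0

# alpha_xy proportional to sum_d n_xy^(d) / n_x^(d), with n_x^(d) = sum_y n_xy^(d)
alpha_x = [s * sum(doc[y] / sum(doc) for doc in n_xy) for y in range(2)]
```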
(Correlated Topic Model: https://www.cs.princeton.edu/~blei/papers/BleiLafferty2006.pdf)
Supervised Topic Models
The inferred θ or z can be used as features in many prediction tasks.
Performance can be improved by jointly training the representation and the predictor.
Hence, supervised LDA:
Supervised Topic Models: MedLDA
Supervised Latent Dirichlet Allocation (using Generalized Linear Models) (Blei and McAuliffe, 2007): https://www.cs.princeton.edu/~blei/papers/BleiMcAuliffe2007.pdf
MedLDA (max-margin objective) (Zhu et al., 2009): http://www.cs.cmu.edu/~amahmed/papers/zhu_ahmed_xing_icml09.pdf
Gibbs MedLDA (using SVMs and Gibbs Sampling) (Zhu et al., 2014): http://jmlr.org/papers/volume15/zhu14a/zhu14a.pdf