A Tutorial on Topic Models
Jia Zeng
Department of Computer Science, Hong Kong Baptist University
Tutorial
May 2010
Jia Zeng (Hong Kong Baptist University) A Tutorial on Topic Models May 2010, Seminar 1 / 63
Overview
1 Bayesian inference [Heinrich, 2008]

2 Latent Dirichlet allocation [Blei et al., 2003]
◮ Gibbs sampling
◮ Variational inference

3 Applications
◮ Natural scene categorization [Fei-Fei and Perona, 2005]
◮ Network topic modeling [Zeng et al., 2009]
Inference problems
Machine learning learns patterns, rules, or regularities from data.
Patterns or rules are represented by distributions.
Two inference problems:
◮ Learning: estimate distribution parameters λ to explain observations O.
◮ Prediction: calculate the probability of a new observation o given the training observations, i.e., find P(o|O) ≈ P(o|λ).
P(λ|O) = P(O|λ) · P(λ) / P(O),   (1)

posterior = likelihood · prior / evidence.   (2)
Maximum likelihood (ML)
Given the distribution parameters, the observations are assumed to be independently and identically distributed (i.i.d.).

Learning:

λ_ML = arg max_λ L(λ|O) ≈ arg max_λ ∑_{o∈O} log P(o|λ).   (3)
The common way to estimate the parameters is to solve the system

∂L(λ|O)/∂λ_k = 0, ∀λ_k.   (4)
Prediction:

P(o|O) = ∫_λ P(o|λ) P(λ|O) dλ
       ≈ ∫_λ P(o|λ_ML) P(λ|O) dλ
       = P(o|λ_ML).   (5)
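As a concrete instance of Eqs. (3)–(4): for a multinomial likelihood, solving ∂L/∂λ_k = 0 under the constraint ∑_k λ_k = 1 gives the empirical frequencies. A minimal Python sketch (the function and data names are illustrative, not from the slides):

```python
from collections import Counter

def multinomial_mle(observations):
    """ML estimate for a multinomial: solving dL(lambda|O)/d lambda_k = 0
    under sum_k lambda_k = 1 gives lambda_k = n_k / N (empirical frequency)."""
    counts = Counter(observations)
    total = sum(counts.values())
    return {o: n / total for o, n in counts.items()}

lam = multinomial_mle(["a", "a", "b", "c"])
print(lam)  # {'a': 0.5, 'b': 0.25, 'c': 0.25}
```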
Maximum a posteriori (MAP)
Learning:
λ_MAP = arg max_λ { ∑_{o∈O} log P(o|λ) + log P(λ) }.   (6)
We use parameterized priors P(λ|α) with hyperparameters α, in which the belief in the anticipated values of λ can be expressed within the framework of probability.

A hierarchy of parameters is created.
Prediction:
P(o|O) ≈ ∫_λ P(o|λ_MAP) P(λ|O) dλ = P(o|λ_MAP).   (7)
Bayesian inference
Learning:
P(λ|O) = P(O|λ) P(λ) / ∫_λ P(O|λ) P(λ) dλ.   (8)
Prediction:
P(o|O) = ∫_λ P(o|λ) P(λ|O) dλ
       = ∫_λ P(o|λ) P(O|λ) P(λ) / P(O) dλ.   (9)
The summations or integrals required for the marginal likelihood are intractable, or there are unknown (latent) variables.
Conjugate priors
Bayesian inference often uses conjugate priors for convenient computation.

A conjugate prior P(λ) of a likelihood function P(o|λ) is a distribution that results in a posterior distribution P(λ|o) with the same functional form as the prior, only with different parameters (e.g., within the exponential family).

For example, the Dirichlet distribution is the conjugate prior of the multinomial distribution.
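Conjugacy makes the posterior update a simple count addition: a Dir(α) prior combined with multinomial counts n yields a Dir(α + n) posterior. A minimal sketch (the helper name is mine, not from the tutorial):

```python
def dirichlet_posterior(alpha, counts):
    """Dirichlet prior + multinomial likelihood => Dirichlet posterior:
    P(lambda|o) = Dir(alpha + n), i.e., add observed counts to pseudo-counts."""
    return [a + n for a, n in zip(alpha, counts)]

# Uniform Dir(1,1,1) prior; observe counts (5, 0, 2) over three outcomes.
post = dirichlet_posterior([1.0, 1.0, 1.0], [5, 0, 2])
print(post)  # [6.0, 1.0, 3.0]
```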
Topic modeling
Large document and image networks require new statistical tools fordeep analysis.
Topic models have become a powerful unsupervised learning tool forlarge-scale document and image networks.
Topic modeling introduces latent topic variables into text and images that reveal the underlying structure through posterior inference.

Topic models can be used in text summarization, document classification, information retrieval, link prediction, and collaborative filtering.
What are topics?
A topic is a group of related words.
Top (stemmed) words of example topics:

1 learn kernel model reinforc algorithm machin classif
2 model learn network neural bayesian time visual
3 retriev inform base model text queri system
4 model imag motion track recognit object estim
5 imag base model recognit segment object detect

1 cell protein express gene activ mutat signal
2 pcr assai detect dna method probe specif
3 apo diseas allel alzheim associ onset gene
4 cell express gene tumor apoptosi protein cancer
5 mutat gene apc cancer famili protein diseas
Topic modeling = word clustering
A topic is a fuzzy set of words with different membership grades, based on word co-occurrences in documents.

Fewer topics per document (homogeneous): words that co-occur within the same document tend to be clustered into the same topic (attractive force).

Fewer words per topic (heterogeneous): words tend to be clustered into their dominant topics with the highest membership grades (repulsive force).
Examples
[Figure: membership grades (0.01–0.05) of topic 1 and topic 2 over a 50-word vocabulary.]
Probabilistic topic modeling
Treat data as observations that arise from a generative model composed of latent variables.
◮ The latent variables reflect the semantic structure of the documents.

Infer the latent structure using posterior inference (MAP):
◮ What are the topics that summarize the document network?

Predict new data with the estimated topic model:
◮ How do the new data fit into the estimated topic structures?
Latent Dirichlet allocation (LDA) [Blei et al., 2003]
Simple intuition: a document exhibits multiple topics.
Generative models
[Figure: three example topics as word distributions ("promoter 0.04, dna 0.02, genetic 0.01, ..."; "evolve 0.04, life 0.02, organism 0.01, ..."; "gene 0.04, neuron 0.02, recognition 0.01, ..."), per-document topic proportions, and per-word latent topic assignments.]

Each document is a mixture of corpus-wide topics.

Each word is drawn from one of those topics.
The posterior distribution (inference)
[Figure: the topics, topic proportions, and latent topic assignments are all unknown, marked "?".]

Observations include documents and their words.

Our goal is to infer the underlying topic structures marked by "?" from the observations.
LDA
Graphical model: α → θ_d → Z_{d,n} → W_{d,n} ← β_k ← η, with plates over words (N), documents (D), and topics (K).
◮ α: Dirichlet parameter.
◮ θ_d: per-document topic proportions.
◮ Z_{d,n}: per-word topic assignment.
◮ W_{d,n}: observed word.
◮ β_k: topic distribution.
◮ η: topic hyperparameter.

A full hierarchical Bayesian model composed of random variables.
LDA: generative process
Draw each topic β_k ∼ Dir(η), for k ∈ {1, . . . , K}.

For each document:
◮ Draw topic proportions θ_d ∼ Dir(α).
◮ For each word:
⋆ Draw Z_{d,n} ∼ Mult(θ_d).
⋆ Draw W_{d,n} ∼ Mult(β_{Z_{d,n}}).
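The generative process above can be run directly to sample a synthetic corpus. A minimal pure-Python sketch (helper names and the toy sizes are mine, not from the slides); Dirichlet draws are built from normalized Gamma draws:

```python
import random

def sample_dirichlet(alpha):
    """Draw from Dir(alpha) by normalizing independent Gamma(alpha_k, 1) draws."""
    g = [random.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

def sample_discrete(p):
    """Draw an index k with probability p[k] (inverse-CDF sampling)."""
    r, acc = random.random(), 0.0
    for k, pk in enumerate(p):
        acc += pk
        if r < acc:
            return k
    return len(p) - 1

def generate_corpus(D, N, K, W, alpha, eta):
    """LDA's generative process: beta_k ~ Dir(eta); per document,
    theta_d ~ Dir(alpha); per word, Z ~ Mult(theta_d), W ~ Mult(beta_Z)."""
    beta = [sample_dirichlet([eta] * W) for _ in range(K)]
    corpus = []
    for _ in range(D):
        theta = sample_dirichlet([alpha] * K)
        corpus.append([sample_discrete(beta[sample_discrete(theta)])
                       for _ in range(N)])
    return corpus

corpus = generate_corpus(D=3, N=10, K=2, W=5, alpha=0.5, eta=0.1)
```

Small α and η concentrate each document on few topics and each topic on few words, matching the attractive/repulsive intuition from the word-clustering slide.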
LDA: inference
From a collection of documents, infer:
◮ Per-word topic assignments Z_{d,n}.
◮ Per-document topic proportions θ_d.
◮ Per-corpus topic distributions β_k.

Use the posterior expectations of Z_{d,n}, θ_d, and β_k to perform document classification and information retrieval.
LDA: intractable exact inference
P(Z, θ, β|W, α, η) = P(W, Z, θ, β|α, η) / P(W|α, η).

P(W|α, η) = ∫_β ∫_θ P(β|η) P(θ|α) P(W|θ, β) dθ dβ.

Because θ and β are coupled inside the integral, the integration is intractable.
LDA: approximate inference
Collapsed Gibbs sampling [Griffiths and Steyvers, 2004].
Mean field variational inference [Blei et al., 2003].
Expectation propagation [Minka and Lafferty, 2002].
Collapsed variational inference [Teh et al., 2007].
Collapsed Gibbs sampling (GS)
“Collapse” means marginalizing out parameters θ and β.
P(W, Z|α, η) = ∫_θ ∫_β P(W, Z, θ, β|α, η) dθ dβ   (10)

= ∫_θ ∏_{d=1}^{D} P(θ_d|α) ∏_{n=1}^{N} P(Z_{d,n}|θ_d) dθ   (11)

× ∫_β ∏_{k=1}^{K} P(β_k|η) ∏_{d=1}^{D} ∏_{n=1}^{N} P(W_{d,n}|β_{Z_{d,n}}) dβ.   (12)
Marginalizing θd for the dth document
∫_{θ_d} P(θ_d|α) ∏_{n=1}^{N} P(Z_{d,n}|θ_d) dθ_d   (13)

= ∫_{θ_d} [Γ(∑_{k=1}^{K} α_k) / ∏_{k=1}^{K} Γ(α_k)] ∏_{k=1}^{K} θ_{d,k}^{α_k − 1} P(Z_d|θ_d) dθ_d,   (14)

P(Z_d|θ_d) = ∏_{k=1}^{K} θ_{d,k}^{n^k_{d,·}}.   (15)

n^k_{d,·} is the number of words in the dth document assigned to the kth topic, n^k_{d,·} = ∑_w n^k_{d,w}.
Marginalizing θd for the dth document
= ∫_{θ_d} [Γ(∑_{k=1}^{K} α_k) / ∏_{k=1}^{K} Γ(α_k)] ∏_{k=1}^{K} θ_{d,k}^{n^k_{d,·} + α_k − 1} dθ_d.   (16)

Since the Dirichlet density integrates to one,

∫_{θ_d} [Γ(∑_{k=1}^{K} (n^k_{d,·} + α_k)) / ∏_{k=1}^{K} Γ(n^k_{d,·} + α_k)] ∏_{k=1}^{K} θ_{d,k}^{n^k_{d,·} + α_k − 1} dθ_d = 1,   (17)

Eq. (11) becomes

[Γ(∑_{k=1}^{K} α_k) / ∏_{k=1}^{K} Γ(α_k)] · [∏_{k=1}^{K} Γ(n^k_{d,·} + α_k) / Γ(∑_{k=1}^{K} (n^k_{d,·} + α_k))].   (18)
Joint probability
Marginalizing β_k for the kth topic, Eq. (12) becomes

[Γ(∑_{w=1}^{W} η_w) / ∏_{w=1}^{W} Γ(η_w)] · [∏_{w=1}^{W} Γ(n^k_{·,w} + η_w) / Γ(∑_{w=1}^{W} (n^k_{·,w} + η_w))].   (19)

P(W, Z|α, η) = ∏_{d=1}^{D} [Γ(∑_{k=1}^{K} α_k) / ∏_{k=1}^{K} Γ(α_k)] · [∏_{k=1}^{K} Γ(n^k_{d,·} + α_k) / Γ(∑_{k=1}^{K} (n^k_{d,·} + α_k))]
× ∏_{k=1}^{K} [Γ(∑_{w=1}^{W} η_w) / ∏_{w=1}^{W} Γ(η_w)] · [∏_{w=1}^{W} Γ(n^k_{·,w} + η_w) / Γ(∑_{w=1}^{W} (n^k_{·,w} + η_w))].   (20)

The collapsed form directly bridges the observed words W and the latent topic assignments Z with the hyperparameters α and η.
Posterior probability
P(Z_{d,n} = k|Z_{−(d,n)}, W, α, η) ∝ P(Z_{d,n} = k, W_{d,n} = w, Z_{−(d,n)}, W_{−(d,n)}|α, η)   (21)

∝ [∏_{k=1}^{K} Γ(n^k_{d,·} + α_k) / Γ(∑_{k=1}^{K} (n^k_{d,·} + α_k))] × [∏_{k=1}^{K} Γ(n^k_{·,w} + η_w) / Γ(∑_{w=1}^{W} (n^k_{·,w} + η_w))].   (22)

The constant factors of Eq. (20), and all factors not involving the dth document or the wth word, cancel in the proportionality.
Posterior probability
If we separate out the kth topic, Eq. (22) can be decomposed as

∝ [∏_{−k} Γ(n^{k,−(d,n)}_{d,·} + α_k) / Γ(∑_{k=1}^{K} n^{k,−(d,n)}_{d,·} + α_k + 1)]
× [∏_{−k} Γ(n^{k,−(d,n)}_{·,w} + η_w) / Γ(∑_{w=1}^{W} n^{k,−(d,n)}_{·,w} + η_w)]
× Γ(n^{k,−(d,n)}_{d,·} + α_k + 1) · Γ(n^{k,−(d,n)}_{·,w} + η_w + 1) / Γ(∑_{w=1}^{W} n^{k,−(d,n)}_{·,w} + η_w + 1).   (23)

Note that Γ(x + 1) = xΓ(x). Cancelling the common Gamma factors yields

P(Z_{d,n} = k|Z_{−(d,n)}, W, α, η) ∝ (n^{k,−(d,n)}_{d,·} + α_k) · (n^{k,−(d,n)}_{·,w} + η_w) / ∑_{w=1}^{W} (n^{k,−(d,n)}_{·,w} + η_w).   (24)

n^{k,−(d,n)}_{d,·} and n^{k,−(d,n)}_{·,w} are the counts excluding the current assignment Z_{d,n} = k.
Gibbs sampling [Griffiths and Steyvers, 2004]
Sample from the target distribution using the Markov chain Monte Carlo (MCMC) method.

In MCMC, a Markov chain is constructed to converge to the target distribution, and samples are then taken from that chain.

Each state of the chain is an assignment of values to the variables being sampled, in this case Z_{d,n}, and transitions between states follow a simple rule.

Gibbs sampling is a stochastic approximate inference method: the next state is reached by sequentially sampling each variable from its distribution conditioned on the current values of all other variables and the data.
Parameter estimation
For simplicity, we use symmetric Dirichlet hyperparameters α_k = α and η_w = η.

Eq. (24) can be abbreviated as

P(Z_{d,n} = k|Z_{−(d,n)}, W, α, η) ∝ (θβ)^{−(d,n)}.   (25)

P(θ_d|Z, W, α) = Dir(θ_d|n_{d,·} + α) ⇒ θ_{d,k} = (n^k_{d,·} + α) / (∑_k n^k_{d,·} + Kα),   (26)

P(φ_k|Z, W, η) = Dir(φ_k|n_k + η) ⇒ β_{k,w} = (n^k_{·,w} + η) / (∑_w n^k_{·,w} + Wη).   (27)

Note that E[Dir(α)]_i = α_i / ∑_i α_i.
Summary of Gibbs sampling
If we use X for the data and Z for the labeling configuration, we obtain the following optimal marginal distribution [Bishop, 2006]:

ln Q_j(Z_j) ∝ E_{i≠j}[ln P(X, Z)].   (28)

From the "variational inference" view, we aim to maximize the lower bound of the model evidence P(W|α, η) in terms of Q_j(Z_j).

From the "message passing" (belief propagation) perspective, we aim to update the belief Q_j(Z_j) ∼ θ_d over Z_{d,n} through all other documents' beliefs θ as well as clique potentials based on β, which implies that LDA is similar to a Potts model (Markov random field).
Mean field variational inference [Blei et al., 2003]
Use fully factorized variational distributions Q to approximate the true posterior distribution P by maximizing the lower bound L, because the KL (Kullback–Leibler) divergence is always non-negative:

ln P(W|α, η) = L(Q) + KL(Q‖P).   (29)

Mean field assumption: Q can be factorized.

Generally,

L(Q) = ∫ Q(Z) ln { P(X, Z) / Q(Z) }.   (30)
Lower bound
In LDA,

L(Q) = E_Q[ln P(X, Z, θ, β|α, η)] − E_Q[ln Q(Z, θ, β|λ, φ, γ)].   (31)

Q(Z, θ, β|λ, φ, γ) = ∏_{k=1}^{K} Q(β_k|λ_k) ∏_{d=1}^{D} Q(θ_d|γ_d) ∏_{n=1}^{N} Q(Z_{d,n}|φ_{d,n}).   (32)

λ ∼ η, φ ∼ θ, and γ ∼ α are variational parameters, each mirroring the model quantity it approximates.

Under Q, the coupling between θ and β no longer exists.
Factorization
L = E_Q[ln P(θ|α)] + E_Q[ln P(Z|θ)] + E_Q[ln P(W|Z, β)] + E_Q[ln P(β|η)]
  − E_Q[ln Q(θ)] − E_Q[ln Q(Z)] − E_Q[ln Q(β)].   (33)
Expansion
L = ln Γ(∑_{k=1}^{K} α_k) − ∑_{k=1}^{K} ln Γ(α_k) + ∑_{k=1}^{K} (α_k − 1)[Ψ(γ_k) − Ψ(∑_{k=1}^{K} γ_k)]
  + ∑_{n=1}^{N} ∑_{k=1}^{K} φ_{n,k}[Ψ(γ_k) − Ψ(∑_{k=1}^{K} γ_k)]
  + ∑_{n=1}^{N} ∑_{k=1}^{K} ∑_{w=1}^{W} φ_{n,k} w_n ln β_{k,w}
  + ln Γ(∑_{k=1}^{K} η_k) − ∑_{k=1}^{K} ln Γ(η_k) + ∑_{k=1}^{K} (η_k − 1)[Ψ(λ_k) − Ψ(∑_{k=1}^{K} λ_k)]
  − ln Γ(∑_{k=1}^{K} γ_k) + ∑_{k=1}^{K} ln Γ(γ_k) − ∑_{k=1}^{K} (γ_k − 1)[Ψ(γ_k) − Ψ(∑_{k=1}^{K} γ_k)]
  − ∑_{n=1}^{N} ∑_{k=1}^{K} φ_{n,k} ln φ_{n,k}
  − ln Γ(∑_{k=1}^{K} λ_k) + ∑_{k=1}^{K} ln Γ(λ_k) − ∑_{k=1}^{K} (λ_k − 1)[Ψ(λ_k) − Ψ(∑_{k=1}^{K} λ_k)].   (34)
Lagrangian multipliers
φ_{n,k} ← β_{k,w} exp(Ψ(γ_k)), normalized so that ∑_k φ_{n,k} = 1,

γ_k ← α_k + ∑_{n=1}^{N} φ_{n,k},

λ_{k,w} ← η_k + ∑_{n=1}^{N} φ_{n,k} w_n,

β_{k,w} ← ∑_{n=1}^{N} φ_{n,k} w_n.

Newton–Raphson iterations update the hyperparameters: α from γ, and η from λ.
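The per-document part of this update cycle can be sketched in a few lines of pure Python (the `digamma` series approximation and all names are mine, not from the slides; topics β are held fixed for clarity):

```python
import math

def digamma(x):
    """Psi(x) via the recurrence Psi(x) = Psi(x+1) - 1/x
    plus a standard asymptotic series for large arguments."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1/12.0 - f * (1/120.0 - f / 252.0))

def variational_doc(doc, beta, alpha, iters=30):
    """Per-document coordinate ascent:
    phi_{n,k} proportional to beta_{k,w_n} * exp(Psi(gamma_k)),
    then gamma_k = alpha + sum_n phi_{n,k}."""
    K = len(beta)
    gamma = [alpha + len(doc) / K] * K   # common initialization
    phi = []
    for _ in range(iters):
        phi = []
        for w in doc:
            p = [beta[k][w] * math.exp(digamma(gamma[k])) for k in range(K)]
            s = sum(p)
            phi.append([x / s for x in p])
        gamma = [alpha + sum(row[k] for row in phi) for k in range(K)]
    return gamma, phi

# Two fixed topics over a 4-word vocabulary; the document mostly uses words 0, 1.
beta = [[0.45, 0.45, 0.05, 0.05], [0.05, 0.05, 0.45, 0.45]]
gamma, phi = variational_doc([0, 1, 0, 2], beta, alpha=0.1)
```

Since each φ_n row sums to one, ∑_k γ_k stays fixed at Kα + N throughout the iterations; the mass simply shifts toward the topics that explain the document's words.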
Summary of variational inference
Variational inference is a deterministic alternative to MCMC.

At each iterative step, we use deterministic optimization instead of sampling.

Variational inference is fast but biased with respect to the true posterior.

Gibbs sampling is unbiased, but it is hard to tell when it has converged.
Natural scene categorization [Fei-Fei and Perona, 2005]
Object → Bag of "words".
Bag of words (BOW)
Independent features
Histogram representation
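The histogram representation can be sketched in a few lines (names and the toy codebook are mine, not from the slides): each local feature is assigned to its nearest codeword and the image becomes a count vector over the dictionary:

```python
def bow_histogram(features, codewords):
    """Assign each local feature to its nearest codeword (squared Euclidean
    distance) and count occurrences: the image becomes a histogram over
    the codeword dictionary, discarding spatial layout."""
    hist = [0] * len(codewords)
    for f in features:
        dists = [sum((a - b) ** 2 for a, b in zip(f, c)) for c in codewords]
        hist[dists.index(min(dists))] += 1
    return hist

codebook = [(0.0, 0.0), (1.0, 1.0)]  # toy 2-D codewords
hist = bow_histogram([(0.1, 0.2), (0.9, 1.1), (1.0, 0.8)], codebook)
print(hist)  # [1, 2]
```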
Categorization system
Pipeline: feature detection & representation → codewords dictionary → image representation → learning (category models and/or classifiers) → recognition (category decision).
Codewords
Image representation
[Figure: an image represented as a histogram of codeword frequencies.]
LDA
Latent Dirichlet Allocation (LDA) applied to scene categorization (Fei-Fei et al., ICCV 2005): e.g., the "beach" category.
Spatial information
Image patches carry strong spatial structural information.

BOW assumptions are still too weak to handle real-world problems.

One remedy is incorporating discriminative methods into generative methods.
Weaknesses
No rigorous geometric information about the object components.

It is intuitive to most of us that objects are made of parts, but no such information is used.

Not yet extensively tested for:
◮ Viewpoint invariance.
◮ Scale invariance.
Network topic modeling [Zeng et al., 2009]
Network data:
◮ Document and image networks.
◮ Social networks.
◮ Biological networks.

Basic components:
◮ A set of entities (e.g., documents, images, individuals, genes).
◮ A set of relations (e.g., citation, coauthorship, co-tags, friendship, pathways).
Examples
[Figure: a citation network, a coauthor network, a social network, and a gene regulatory network.]
Research issues
Static network analyses assume that the network has an invariant structure.
◮ Static structural pattern modeling.

Dynamic network analyses assume that the network structure changes over time.
◮ Dynamic structural pattern modeling.
Structural patterns
Loose definition: known relations + unknown relations.

Stricter definition: a cluster of entities with homogeneous relations compared with other entities.

Property: structural patterns can be propagated through the network.
Tasks
Learn arbitrary topic structure from network data.
Propagate and regularize topic structures in order to extractmeaningful topics.
Predict network structure including entities and relations.
Examples
[Figure: citation and coauthor networks with unknown ("?") entities and relations to be predicted.]
Problems
Multiple relation types:
◮ Citation.
◮ Coauthorship.
◮ Co-proceedings and journals.
◮ Year.
◮ · · ·

Higher-order relations:
◮ A, B, and C collaborate to write a paper, so A, B, and C form a higher-order relation rather than pairwise relations alone.
◮ A cites B and C, while B cites C, which forms a higher-order relation.
Examples
[Figure: an example of a multiple relation and a higher-order relation.]
Factor graph [Kschischang et al., 2001]
[Figure: a document–author network and its factor graph representation.]

A factor graph is a hypergraph, which is used in higher-order relation modeling.

A factor graph is equivalent to a higher-order Markov random field, which is widely used in image processing and computer vision.
Computational problems
Multiple-relation modeling faces the problem of how to fuse information from different relations.

Higher-order structural topic modeling is significantly hampered by the intractable complexity of optimizing long-range topic dependence.
Multirelational topic models
[Figure: (A) words–topics–documents coupled with a citation network; (B) words–topics–authors coupled with a coauthor network.]
Multirelational topic models
[Figure: graphical models (A) and (B) with hyperparameters α and β, factors f_a, f_{a′}, h_d, h_{d′}, and topic variables z_i, z_{i′}.]
Summary
Cascade information fusion.
Parallel information fusion (weights).
Higher-order structural topic models
[Figure: (A) plate diagram of the higher-order structural topic model with hyperparameters α, β and variables θ, γ, z, w, a, u, φ over plates D, J, A; (B) a factor graph with variable nodes x_1, . . . , x_4, factor nodes y_1, y_2, y_3, and edges e_{i,j}.]
Generative process
1 θ_d|α ∼ Dirichlet(α), γ_a|α ∼ Dirichlet(α), u|a_d ∼ Uniform(a_d),

2 z_{d,i}|θ_d, γ_u ∼ Multi(θ_d)Multi(γ_u),

3 w_{d,i}|z_{d,i}, φ ∼ Multi(φ_{z_{d,i}}), φ_{z_{d,i}}|β ∼ Dirichlet(β).

x_d = arg max_j θ_{d,j},   (35)

y_a = arg max_j γ_{a,j}.   (36)
Sparsity
[Figure: histograms of 2-, 3-, and 4-order topic configurations in the document and author topic labeling spaces, showing that only a few configurations occur frequently.]

Higher-order topic structures encountered in document networks are often sparse, i.e., many topic configurations of a higher-order clique are equally unlikely and thus need not be encouraged throughout the network.
Learning clique potentials
Generalized linear model:

f(x_a) = σ(η^T θ_{x_a} + b), x_a ∈ X.   (37)

The feature is defined as

θ_{x_a} = ∏_d θ_{d,x_d}, x_d ∈ x_a.   (38)
Belief propagation and Gibbs sampling for inference
Belief propagation:

θ_{d,x_d} = ∏_{y_a ∈ ne(x_d)} [ ∑_{x_a\x_d} f(x_a) ∏_{x_{d′} ∈ ne(y_a)\x_d} θ_{d′,x_{d′}} ].   (39)

Gibbs sampling:

P(z_i, u_i|z_{−i}, u_{−i}, w, a, e; α, β) ∝ (φθγ)_{−i} θγ.   (40)
Summary
Sparsity relieves the computational complexity significantly.

The hybrid belief propagation and Gibbs sampling algorithm works well.
Thank you!
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. J. Mach. Learn. Res., 3(4-5):993–1022.

Fei-Fei, L. and Perona, P. (2005). A Bayesian hierarchical model for learning natural scene categories. In CVPR.

Griffiths, T. L. and Steyvers, M. (2004). Finding scientific topics. Proc. Natl. Acad. Sci., 101:5228–5235.

Heinrich, G. (2008). Parameter estimation for text analysis. Technical note, University of Leipzig, Germany.

Kschischang, F. R., Frey, B. J., and Loeliger, H.-A. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2):498–519.

Minka, T. P. and Lafferty, J. (2002). Expectation propagation for the generative aspect model. In UAI, pages 352–359.

Teh, Y. W., Newman, D., and Welling, M. (2007). A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In NIPS.

Zeng, J., Cheung, W. K.-W., Li, C.-H., and Liu, J. (2009). Multirelational topic models. In ICDM, pages 1070–1075.