
Expectation-Propagation for the Generative Aspect Model

Tom Minka and John Lafferty

Carnegie Mellon University

Main points

• Aspect model is important and difficult

– PCA for discrete data

• How variational methods can fail

• Extensions of Expectation-Propagation

• Using EP for maximum likelihood

Generative aspect model

Each document mixes aspects in different proportions

[Figure: three documents, each mixing Aspect 1 and Aspect 2 in proportions $\lambda_1$ and $1-\lambda_1$, $\lambda_2$ and $1-\lambda_2$, $\lambda_3$ and $1-\lambda_3$]

(Hofmann 1999; Blei, Ng, & Jordan 2001)

First interpretation

[Figure: a document draws each word from Aspect 1 with probability $\lambda$ or from Aspect 2 with probability $1-\lambda$]

$p(\text{word } w) = \lambda\, p(w \mid a{=}1) + (1-\lambda)\, p(w \mid a{=}2)$

$p(\lambda) \sim \mathrm{Dirichlet}(\alpha_1, \ldots, \alpha_J)$

Multinomial sampling
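A minimal sketch of this first interpretation (the vocabulary, aspect distributions, and document length below are hypothetical, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical aspects: row a is p(word | aspect a) over a 4-word vocabulary.
aspects = np.array([[0.7, 0.1, 0.1, 0.1],
                    [0.1, 0.1, 0.1, 0.7]])
alpha = np.ones(len(aspects))  # Dirichlet hyperparameters (uniform)

def generate_document(n_words):
    # Draw mixing proportions once per document, then sample every word
    # from the single mixed distribution p(w) = sum_a lambda_a p(w | a).
    lam = rng.dirichlet(alpha)
    word_probs = lam @ aspects
    return rng.choice(aspects.shape[1], size=n_words, p=word_probs)

print(generate_document(10))
```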

Second interpretation

[Figure: document $i$ contains Word 1, Word 2, Word 3; each word $j$ carries a hidden label $z_{ij}$ ("Aspect 1" or "Aspect 2") and is sampled from that aspect]

Sample from aspect
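The second interpretation makes the per-word aspect labels $z_{ij}$ explicit; a sketch under the same hypothetical aspects, equivalent in distribution to the block above:

```python
import numpy as np

rng = np.random.default_rng(0)
aspects = np.array([[0.7, 0.1, 0.1, 0.1],   # hypothetical p(word | aspect)
                    [0.1, 0.1, 0.1, 0.7]])
alpha = np.ones(len(aspects))

def generate_document_with_labels(n_words):
    # Per word: first pick an aspect z from lambda, then sample the word
    # from that aspect's own distribution.
    lam = rng.dirichlet(alpha)
    z = rng.choice(len(aspects), size=n_words, p=lam)
    words = np.array([rng.choice(aspects.shape[1], p=aspects[zi]) for zi in z])
    return words, z

print(generate_document_with_labels(10))
```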

Two tasks

Inference:

• Given aspects and document $i$, what is (the posterior for) $\lambda$?

Learning:

• Given some documents, what are (maximum likelihood) aspects?

Variational method

• Blei, Ng, & Jordan, NIPS’01

• Uses interpretation with explicit $z$ (the second one): latent variables $(\lambda, z)$

• Inference: factored variational distribution $q(\lambda)\,q(z)$

• Learning:

– E-step: infer $(\lambda, z)$ per document

– M-step: update aspects $p(w \mid a)$, aspect weights $\alpha$

Expectation Propagation

• Uses interpretation with $\lambda$ only (the first one)

• Inference: EP of Dirichlet posterior $q(\lambda)$

• Learning:

– E-step: infer $\lambda$ per document

– M-step: update aspects $p(w \mid a)$, aspect weights $\alpha$

• Fewer latent variables is better

Geometric interpretation

• Likelihood is composed of terms of the form

$t_w(\lambda) = p(w \mid \lambda)^{n_w} = \Big(\sum_a \lambda_a\, p(w \mid a)\Big)^{n_w}$

• Want Dirichlet approximation:

$\tilde{t}_w(\lambda) = \prod_a \lambda_a^{\beta_{aw}}$

Variational method

• Bound each term via Jensen's inequality:

$p(w \mid \lambda) = \sum_a \lambda_a\, p(w \mid a) \;\ge\; \mathrm{const} \times \prod_a \lambda_a^{\beta_{aw}}$

• Coupled equations:

$\beta_{aw} \propto \exp\!\big(\langle \log \lambda_a \rangle\big)\, p(w \mid a)$
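A sketch of these coupled updates for a single document. The Dirichlet update $\gamma_a = \alpha_a + \sum_w n_w \beta_{aw}$ is not spelled out on the slide but is the standard companion to the $\beta$ equation; all inputs below are hypothetical:

```python
import numpy as np
from scipy.special import digamma

def vb_posterior(n_w, p_w_given_a, alpha, iters=100):
    # Coupled fixed point: beta[a, w] ∝ exp(<log lambda_a>) * p(w | a),
    # where <log lambda_a> = digamma(gamma_a) - digamma(sum(gamma)) under
    # the induced Dirichlet posterior gamma_a = alpha_a + sum_w n_w beta[a, w].
    A, W = p_w_given_a.shape
    gamma = alpha + n_w.sum() / A            # uninformative start
    for _ in range(iters):
        elog = digamma(gamma) - digamma(gamma.sum())
        beta = np.exp(elog)[:, None] * p_w_given_a
        beta /= beta.sum(axis=0, keepdims=True)   # normalize over aspects
        gamma = alpha + beta @ n_w
    return gamma  # Dirichlet parameters of q(lambda)

p = np.array([[0.4, 0.6],     # hypothetical p(w | a) for two aspects
              [0.3, 0.7]])
print(vb_posterior(np.array([6, 4]), p, alpha=np.ones(2)))
```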

Moment matching

• Context function: all but one occurrence

$q^{\setminus w}(\lambda) = t_w(\lambda)^{\,n_w - 1} \prod_{w' \ne w} t_{w'}(\lambda)^{\,n_{w'}}$

• Moment match:

$t_w(\lambda)\, q^{\setminus w}(\lambda) \;\leftrightarrow\; \tilde{t}_w(\lambda)\, q^{\setminus w}(\lambda)$

• Fixed point equations for $\beta$
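A minimal EP sketch for the two-aspect case, where the Dirichlet reduces to a Beta over $\lambda$. The per-token cavity plays the role of the context function, and the tilted moments follow from standard Beta identities; the data and function names are hypothetical:

```python
import numpy as np

def ep_beta_posterior(tokens, p1, p2, a0=1.0, b0=1.0, sweeps=50):
    # tokens: word ids; p1[w] = p(w | aspect 1), p2[w] = p(w | aspect 2).
    # Exact term per token: t(lam) = lam * p1[w] + (1 - lam) * p2[w].
    # Approximate term per token: lam**e1[i] * (1 - lam)**e2[i], so the
    # overall approximation is q(lam) = Beta(a0 + sum e1, b0 + sum e2).
    n = len(tokens)
    e1, e2 = np.zeros(n), np.zeros(n)
    for _ in range(sweeps):
        for i, w in enumerate(tokens):
            # Cavity: q with token i's approximate term removed.
            a = a0 + e1.sum() - e1[i]
            b = b0 + e2.sum() - e2[i]
            # Moments of the tilted distribution t(lam) * Beta(lam; a, b):
            # a two-component mixture of Beta(a+1, b) and Beta(a, b+1).
            z1 = p1[w] * a / (a + b)
            z2 = p2[w] * b / (a + b)
            m1 = (z1 * (a + 1) + z2 * a) / ((z1 + z2) * (a + b + 1))
            m2 = (z1 * (a + 1) * (a + 2) + z2 * a * (a + 1)) \
                 / ((z1 + z2) * (a + b + 1) * (a + b + 2))
            # Match a Beta to (mean, variance), then divide out the cavity.
            s = m1 * (1 - m1) / (m2 - m1 ** 2) - 1
            e1[i] = m1 * s - a
            e2[i] = (1 - m1) * s - b
    return a0 + e1.sum(), b0 + e2.sum()

# Hypothetical two-word vocabulary and a ten-token document.
p1, p2 = np.array([0.4, 0.6]), np.array([0.3, 0.7])
print(ep_beta_posterior([0] * 10, p1, p2))
```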

One term

$t_w(\lambda) = 0.4\,\lambda + 0.3\,(1-\lambda)$

Ten word document

Moments for ten word doc

             Mean    Variance
Exact        0.393   0.0613
EP           0.393   0.0612
Variational  0.365   0.0178
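The exact row in tables like this one can be reproduced by brute force, since $\lambda$ is one-dimensional: integrate the product of terms on a grid. A sketch, assuming a hypothetical ten-word document in which every token has the term $0.4\lambda + 0.3(1-\lambda)$ from the previous slide (the actual document behind the table is not given):

```python
import numpy as np

# Per-token term probabilities p(w | aspect 1), p(w | aspect 2); hypothetical.
p1 = np.full(10, 0.4)
p2 = np.full(10, 0.3)

lam = np.linspace(0.0, 1.0, 100001)
post = np.prod(p1[:, None] * lam + p2[:, None] * (1 - lam), axis=0)

# Trapezoid weights for the integrals (uniform prior on lambda).
w = np.ones_like(lam); w[0] = w[-1] = 0.5
norm = (post * w).sum()
mean = (lam * post * w).sum() / norm
var = ((lam - mean) ** 2 * post * w).sum() / norm
print(mean, var)  # exact posterior moments; will differ from the table,
                  # which used a different (unspecified) document
```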

1000 word document

Moments for 1000 word doc

             Mean    Variance
Exact        0.250   0.0015
EP           0.252   0.0015
Variational  0.252   0.0002

General behavior

• For long documents, VB recovers correct mean, but not correct variance

• Optimizes global criterion, only accurate locally

• What about learning?

Learning

• Want MLE for aspects

• Requires marginalizing $\lambda$

• $p(\mathrm{doc})$ is the probability of generating the doc from a random $\lambda$

• Consider an extreme approximation:

– Probability of generating from the most likely $\lambda$

– Ignores uncertainty in $\lambda$ (the "Occam factor")

– Used by Hofmann (1999)

Toy problem

• Two aspects over two words

– $p(w{=}1 \mid a)$ describes each

• One aspect fixed to $p(w{=}1 \mid a{=}2) = 1$

• $\alpha$ is uniform, i.e. $\alpha = 1$

$p(w{=}1) = \lambda\, p(w{=}1 \mid a{=}1) + (1-\lambda)\, p(w{=}1 \mid a{=}2)$

[Figure: $p(w{=}1)$ lies on the interval from 0 to 1 between Aspect 1 (reached at $\lambda = 1$) and Aspect 2 (reached at $\lambda = 0$)]
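A sketch of the toy problem's two likelihood surfaces: the exact likelihood marginalizes $\lambda$ per document, while the Max approximation plugs in the most likely $\lambda$. The synthetic data below is hypothetical, generated from the model itself:

```python
import numpy as np

rng = np.random.default_rng(1)
lam_grid = np.linspace(0.0, 1.0, 2001)

def doc_lik(k, n, p, lam):
    # p(doc | lambda) for k occurrences of word 1 out of n tokens, with
    # p(w=1 | lambda) = lam * p + (1 - lam) * 1  (aspect 2 fixed at 1).
    q = lam * p + (1.0 - lam)
    return q ** k * (1.0 - q) ** (n - k)

def exact_loglik(docs, p):   # integrate lambda under the uniform prior
    return sum(np.log(np.mean(doc_lik(k, n, p, lam_grid))) for k, n in docs)

def max_loglik(docs, p):     # Hofmann-style: most likely lambda only
    return sum(np.log(np.max(doc_lik(k, n, p, lam_grid))) for k, n in docs)

# 10 hypothetical documents of length 10 with true p(w=1 | a=1) = 0.2.
docs = []
for _ in range(10):
    lam = rng.uniform()
    docs.append((rng.binomial(10, lam * 0.2 + (1.0 - lam)), 10))

p_grid = np.linspace(0.0, 0.99, 100)  # avoid the degenerate p = 1 endpoint
print("exact argmax:", p_grid[np.argmax([exact_loglik(docs, p) for p in p_grid])])
print("max   argmax:", p_grid[np.argmax([max_loglik(docs, p) for p in p_grid])])
```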

Exact likelihood vs. Max

[Figure: exact and Max likelihood curves as functions of $p(w{=}1 \mid a{=}1)$, for 10 docs; dotted lines mark the observed frequencies $n_{i1}/n_i$]

EP likelihood vs. VB

[Figure: EP and VB likelihood curves as functions of $p(w{=}1 \mid a{=}1)$]

VB not much different from Max

100 documents

[Figure: likelihood curves vs. $p(w{=}1 \mid a{=}1)$ with 100 documents]

EP better, VB worse

Longer documents

[Figure: likelihood curves vs. $p(w{=}1 \mid a{=}1)$ with longer documents]

VB not helped

Why VB fails

• Doesn't capture posterior variance of $\lambda$

– No Occam factor

• Gets worse with more documents

• Doesn't matter if posterior is sharp!

– Longer documents

– More distinct aspects

Both sharpen the posterior, but do not help VB

Larger vocabulary

[Figures: likelihood comparisons at increasing vocabulary sizes; 10 docs, length 10]

More documents

[Figure: 100 docs, length 10]

TREC documents

Perplexity

• EP and VB models have nearly same perplexity (!)

• Perplexity is document probability, normalized by length

– Dilutes the penalty of extreme aspects

• Better measure: classification error
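For concreteness, the usual definition (the slides do not spell out the exact evaluation protocol): perplexity exponentiates the average per-word negative log probability, so one badly modeled word is averaged over the whole document's length.

```python
import numpy as np

def perplexity(doc_log_probs, doc_lengths):
    # exp of the negative total log probability divided by total word count:
    # normalizing by length is what dilutes the penalty of extreme aspects.
    return np.exp(-np.sum(doc_log_probs) / np.sum(doc_lengths))

# Hypothetical test set: per-document log p(doc) and lengths.
print(perplexity(np.array([-35.0, -40.2]), np.array([10, 12])))
```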

Summary & Future

• Approximate inference can lead to biased learning

– Must preserve "Occam factor" terms

• Moment matching does well

• Simpler methods may also work

– Area of aspect simplex

• Make visualizations!
