Machine Learning DS 4420 - Spring 2020
Topic Modeling
Byron C. Wallace
Last time: Clustering → Mixture Models → Expectation Maximization (EM)
Today: Topic models
Mixture models

Data: Assume we are given data, D, consisting of N fully unsupervised examples in M dimensions:

D = {t^{(i)}}_{i=1}^{N} where t^{(i)} ∈ R^M

Model:
  Joint:     p_{θ,π}(t, z) = p_θ(t|z) p_π(z)
  Marginal:  p_{θ,π}(t) = ∑_{z=1}^{K} p_θ(t|z) p_π(z)

Generative story: z ~ Multinomial(π), then t ~ p_θ(·|z)

(Marginal) log-likelihood:

ℓ(θ, π) = log ∏_{i=1}^{N} p_{θ,π}(t^{(i)}) = ∑_{i=1}^{N} log ∑_{z=1}^{K} p_θ(t^{(i)}|z) p_π(z)

Slide credit: Matt Gormley and Eric Xing (CMU)
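The marginal log-likelihood above is what EM maximizes. A minimal numeric sketch for the Gaussian case; the function name, the spherical-covariance assumption, and the parameterization are mine for illustration, not the slides':

```python
import numpy as np

def mixture_log_likelihood(T, mus, sigmas, pis):
    """Marginal log-likelihood of a K-component spherical Gaussian mixture.

    T: (N, M) data; mus: (K, M) component means; sigmas: (K,) per-component
    spherical std-devs; pis: (K,) mixing weights p(z).
    """
    N, M = T.shape
    K = len(pis)
    log_terms = np.empty((N, K))
    for k in range(K):
        # log p(z=k) + log p(t | z=k) for every example
        diff = T - mus[k]
        log_pdf = (-0.5 * np.sum(diff**2, axis=1) / sigmas[k]**2
                   - M * np.log(sigmas[k]) - 0.5 * M * np.log(2 * np.pi))
        log_terms[:, k] = np.log(pis[k]) + log_pdf
    # sum_i log sum_k p(t_i|z=k) p(z=k), computed stably via log-sum-exp
    m = log_terms.max(axis=1, keepdims=True)
    return float(np.sum(m[:, 0] + np.log(np.exp(log_terms - m).sum(axis=1))))
```

The inner log-sum-exp is exactly the ∑_{z=1}^{K} term inside the log above; summing over examples gives ℓ(θ, π).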
Naive Bayes
evaluate the classifier. One good evaluation metric is accuracy, which is defined as

accuracy = (# correctly predicted labels in the test set) / (# of instances in the test set)    (6)
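Equation (6) is a one-liner in code. A minimal sketch (the helper name is mine):

```python
def accuracy(y_true, y_pred):
    """Fraction of test instances whose predicted label matches the gold label."""
    assert len(y_true) == len(y_pred)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)
```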
2.2 Naive Bayes model
The naive Bayes model defines a joint distribution over words and classes, that is, the probability that a sequence of words belongs to some class c:

p(c, w_{1:N} | π, θ) = p(c|π) ∏_{n=1}^{N} p(w_n|θ_c)    (7)

π is the probability distribution over the classes (in this case, the probability of seeing spam e-mails and of seeing ham e-mails). It is a discrete probability distribution over the classes that sums to 1. θ_c is the class-conditional probability distribution: given a class, the probability distribution over the words in your vocabulary. Each class has a probability for each word in the vocabulary (in this case, there is one set of probabilities for the spam class and one for the ham class).

Given a new test example, we can classify it using the model by calculating the conditional probability of each class given the words in the e-mail, p(c | w_{1:N}), and seeing which class is most likely. There is an implicit independence assumption behind the model: since we multiply the p(w_n | θ_c) terms together, the words in a document are assumed to be conditionally independent given the class.
2.3 Class prediction with naive Bayes
We classify using the posterior distribution over the classes given the words, which is proportional to the joint distribution since P(X|Y) = P(X,Y)/P(Y) ∝ P(X,Y). The posterior distribution is

p(c | w_{1:N}, π, θ) ∝ p(c|π) ∏_{n=1}^{N} p(w_n|θ_c)    (8)

Note that we do not need a normalizing constant because we only care about which probability is bigger. The classifier assigns the e-mail to the class with the highest probability.
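Equation (8) in log space, to avoid underflow from multiplying many small probabilities. A minimal sketch; `predict` and the back-off constant for unseen words are my assumptions, not part of the notes:

```python
import math

def predict(words, log_pi, log_theta):
    """Return argmax_c [ log p(c|pi) + sum_n log p(w_n|theta_c) ].

    log_pi: {class: log prior}; log_theta: {class: {word: log probability}}.
    The unnormalized log-posterior suffices: we only compare classes.
    """
    scores = {}
    for c in log_pi:
        # words absent from theta_c get a tiny back-off mass (illustration only)
        scores[c] = log_pi[c] + sum(log_theta[c].get(w, math.log(1e-10))
                                    for w in words)
    return max(scores, key=scores.get)
```

In practice one would smooth θ during training rather than back off at prediction time.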
2.4 Fitting a naive Bayes model with maximum likelihood
We find the parameters in the posterior distribution used in class prediction by learning from the labeled training data (in this case, the ham and spam e-mails). We use the maximum likelihood method to find parameters that maximize the likelihood of the observed data set. Given data {w_{d,1:N}, c_d}_{d=1}^{D}, the likelihood under a model with parameters (θ_{1:C}, π) is

p(D | θ_{1:C}, π) = ∏_{d=1}^{D} ( p(c_d|π) ∏_{n=1}^{N} p(w_n|θ_{c_d}) )
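Maximizing this likelihood reduces to counting: class priors are class frequencies, and θ_c gives each word its share of the tokens seen in class-c documents. A minimal sketch (the function name and unsmoothed counts are my choices for illustration):

```python
from collections import Counter, defaultdict

def fit_naive_bayes(docs, labels):
    """MLE for (pi, theta): pi_c = fraction of documents with label c;
    theta_c[w] = count of w in class-c docs / total tokens in class-c docs."""
    n = len(labels)
    class_counts = Counter(labels)
    pi = {c: class_counts[c] / n for c in class_counts}
    word_counts = defaultdict(Counter)
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc)          # accumulate token counts per class
    theta = {}
    for c, counts in word_counts.items():
        total = sum(counts.values())
        theta[c] = {w: cnt / total for w, cnt in counts.items()}
    return pi, theta
```

Real implementations add Laplace smoothing so unseen words do not get probability zero.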
The model
Slide credit: Matt Gormley and Eric Xing (CMU)
Initialize parameters randomly
while not converged:
  1. E-step: Create one training example for each possible value of the latent variables. Weight each example according to the model's confidence. Treat parameters as observed.
  2. M-step: Set the parameters to the values that maximize likelihood. Treat the pseudo-counts from above as observed.
(Soft) EM
And for NB
For soft EM expected of timest occurs in C
Pct e PCZi L Count tin Xi
F Plz c Ix ila
Total Token Count in Xi
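The board notes above re-estimate multinomial parameters from soft counts. A minimal sketch of one soft-EM iteration for a mixture of multinomials (unsupervised naive Bayes); `soft_em_step` and the (N, V) count-matrix layout are my assumptions, not the course's code:

```python
import numpy as np

def soft_em_step(X, log_pi, log_theta):
    """One EM iteration for a multinomial mixture.

    X: (N, V) token-count matrix; log_pi: (K,); log_theta: (K, V).
    E-step: responsibilities r[i, k] = p(z=k | x_i).
    M-step: pi_k = mean_i r[i, k];
            theta_{k,v} ∝ sum_i r[i, k] * count(v in x_i).
    """
    # E-step: log p(z=k) + sum_v x_{i,v} log theta_{k,v}  (up to a constant in k)
    log_joint = log_pi[None, :] + X @ log_theta.T            # (N, K)
    log_joint -= log_joint.max(axis=1, keepdims=True)        # stability
    r = np.exp(log_joint)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: normalized expected counts, exactly the board formula above
    pi = r.mean(axis=0)
    expected = r.T @ X + 1e-12                               # (K, V), tiny smoothing
    theta = expected / expected.sum(axis=1, keepdims=True)
    return np.log(pi), np.log(theta)
```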
TOPIC MODELS
Some content borrowed from: David Blei (Columbia)
Topic Models: Motivation
• Suppose we have a giant dataset (“corpus”) of text, e.g., all of the NYTimes or all emails from a company
❖ Cannot read all documents ❖ But want to get a sense of what they contain
Topic Models: Motivation

• Topic models are a way of uncovering, well, “topics” (themes) in a set of documents
• Topic models are unsupervised
• Can be viewed as a type of clustering, so this follows naturally from prior lectures; we will come back to this
• Can be viewed as a sort of soft clustering of documents into topics
Topic Modeling: Beyond Bag-of-Words
in this paper. To enable a direct comparison, Dirichlet hyperpriors could also be placed on the hyperparameters of the priors described in section 3.

7. Conclusions

Creating a single model that integrates bigram-based and topic-based approaches to document modeling has several benefits. Firstly, the predictive accuracy of the new model, especially when using prior 2, is significantly better than that of either latent Dirichlet allocation or the hierarchical Dirichlet language model. Secondly, the model automatically infers a separate topic for function words, meaning that the other topics are less dominated by these words.
Acknowledgments

Thanks to Phil Cowans, David MacKay and Fernando Pereira for useful discussions. Thanks to Andrew Suffield for providing sparse matrix code.

References

Andrieu, C., de Freitas, N., Doucet, A., & Jordan, M. I. (2003). An introduction to MCMC for machine learning. Machine Learning, 50, 5–43.

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.

Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101, 5228–5235.

Griffiths, T. L., & Steyvers, M. (2005). Topic modeling toolbox. http://psiexp.ss.uci.edu/research/programs data/toolbox.htm.

Griffiths, T. L., Steyvers, M., Blei, D. M., & Tenenbaum, J. B. (2004). Integrating topics and syntax. Advances in Neural Information Processing Systems.

Jelinek, F., & Mercer, R. (1980). Interpolated estimation of Markov source parameters from sparse data. In E. Gelsema and L. Kanal (Eds.), Pattern recognition in practice, 381–402. North-Holland Publishing Company.

Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773–795.

MacKay, D. J. C., & Peto, L. C. B. (1995). A hierarchical Dirichlet language model. Natural Language Engineering, 1, 289–307.

Minka, T. P. (2003). Estimating a Dirichlet distribution. http://research.microsoft.com/~minka/papers/dirichlet/.

Rennie, J. (2005). 20 newsgroups data set. http://people.csail.mit.edu/jrennie/20Newsgroups/.
Table 1. Top: The most commonly occurring words in some of the topics inferred from the 20 Newsgroups training data by latent Dirichlet allocation. Middle: Some of the topics inferred from the same data by the new model with prior 1. Bottom: Some of the topics inferred by the new model with prior 2. Each column represents a single topic, and words appear in order of frequency of occurrence. Content words are in bold. Function words, which are not in bold, were identified by their presence on a standard list of stop words: http://ir.dcs.gla.ac.uk/resources/linguistic utils/stop words. All three sets of topics were taken from models with 90 topics.
Latent Dirichlet allocation

Topic 1     Topic 2       Topic 3       Topic 4
the         i             that          easter
“number”    is            proteins      ishtar
in          satan         the           a
to          the           of            the
espn        which         to            have
hockey      and           i             with
a           of            if            but
this        metaphorical  “number”      english
as          evil          you           and
run         there         fact          is

Bigram topic model (prior 1)

to          the           the           the
party       god           and           a
arab        is            between       to
not         belief        warrior       i
power       believe       enemy         of
any         use           battlefield   “number”
i           there         a             is
is          strong        of            in
this        make          there         and
things      i             way           it

Bigram topic model (prior 2)

party       god           “number”      the
arab        believe       the           to
power       about         tower         a
as          atheism       clock         and
arabs       gods          a             of
political   before        power         i
are         see           motherboard   is
rolling     atheist       mhz           “number”
london      most          socket        it
security    shafts        plastic       that

Example from Wallach, 2006
Key outputs

• Topics: distributions over words; we hope these are somehow thematically coherent
• Document-topics: probabilistic assignments of topics to documents
https://en.wikipedia.org/wiki/Enron_scandal
Example: Enron emails
https://www.cs.cmu.edu/~enron/
Example: Enron emails

Table 1.1: Five topics from a twenty-five topic model fit on Enron e-mails. Example topics concern financial transactions, natural gas, the California utilities, federal regulation, and planning meetings. We provide the five most probable words from each topic (each topic is a distribution over all words).

Topic  Terms
3      trading financial trade product price
6      gas capacity deal pipeline contract
9      state california davis power utilities
14     ferc issue order party case
22     group meeting team process plan
and stock prices (Figure 1.1). The first half of a topic model connects topics to a jumbled “bag of words”. When we say that a topic is about X, we are manually assigning a post hoc label (more on this in Chapter 3.1). It remains the responsibility of the human consumer of topic models to go further and make sense of these piles of straw (we discuss labeling the topics more in Chapter 3).

Making sense of one of these word piles by itself can be difficult. The second half of a topic model links topics to individual documents. For example, the document in Figure 1.1 is about a California utility’s reaction to the short-term electricity market and exemplifies Topic 9 from Figure ??. Considering examples of documents that are strongly connected to a topic, along with the words associated with the topic, can give us a more complete representation of the topic. If we get a sense that Topic 9 is of interest, we can explore deeper to find other documents.
1.3 Foundations
You might notice that we are using the general term “topic model”. There are many mathematical formulations of topic models and many algorithms that learn the parameters of those models from data.
Example from Boyd-Graber, Hu and Mimno, 2017
https://en.wikipedia.org/wiki/Enron_scandal
Document-topic probabilities

Yesterday, SDG&E filed a motion for adoption of an electric procurement cost recovery mechanism and for an order shortening time for parties to file comments on the mechanism. The attached email from SDG&E contains the motion, an executive summary, and a detailed summary of their proposals and recommendations governing procurement of the net short energy requirements for SDG&E’s customers. The utility requests a 15-day comment period, which means comments would have to be filed by September 10 (September 8 is a Saturday). Reply comments would be filed 10 days later.

Topic  Probability
9      0.42
11     0.05
8      0.05

Figure 1.1: Example document from the Enron corpus and its association to topics. Although it does not contain the word “California”, it discusses a single California utility’s dissatisfaction with how much it is paying for electricity.

M × V ≈ (M × K) × (K × V)
Topic Assignment
Topics
Dataset
Figure 1.2: A matrix formulation of finding K topics for a dataset with M documents and V unique words. While this view of topic modeling includes approaches such as latent semantic analysis (LSA, where the approximation is based on SVD), we focus on probabilistic techniques in the rest of this survey.
Example from Boyd-Graber, Hu and Mimno, 2017
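Tables like Table 1.1 above are produced by reading off the most probable words per topic. A minimal sketch, given a topic-word matrix; `top_words` is a hypothetical helper name, not from the survey:

```python
import numpy as np

def top_words(phi, vocab, n=5):
    """Most probable words for each topic.

    phi: (K, V) topic-word probabilities (each row sums to 1);
    vocab: list of the V word strings, in column order.
    """
    # argsort ascending, reverse for descending probability, keep top n
    return [[vocab[j] for j in np.argsort(row)[::-1][:n]] for row in phi]
```

The same idea yields the document side: sorting a row of the (M × K) document-topic matrix gives a document's most probable topics, as in Figure 1.1.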
Topics as Matrix Factorization
• One can view topics as a kind of matrix factorization
• We will try to take a more probabilistic view, but it is useful to keep this in mind
Figure from Boyd-Graber, Hu and Mimno, 2017
Topics as Matrix Factorization
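The M × V ≈ (M × K)(K × V) view can be made concrete with non-negative matrix factorization. A minimal sketch using Lee-and-Seung-style multiplicative updates on synthetic counts; this is a stand-in for the probabilistic methods the lecture develops, and all names here are my own:

```python
import numpy as np

rng = np.random.default_rng(0)
M, V, K = 20, 50, 3

# Synthetic corpus: counts generated from K "true" topics
true_theta = rng.dirichlet(np.ones(K), size=M)      # M x K document-topic
true_phi = rng.dirichlet(np.ones(V), size=K)        # K x V topic-word
counts = rng.poisson(1000 * true_theta @ true_phi)  # M x V word counts

# Multiplicative-update NMF (Frobenius loss): counts ≈ W @ H
W = rng.random((M, K)) + 0.1                        # plays the role of M x K
H = rng.random((K, V)) + 0.1                        # plays the role of K x V
for _ in range(200):
    W *= (counts @ H.T) / (W @ H @ H.T + 1e-10)
    H *= (W.T @ counts) / (W.T @ W @ H + 1e-10)

rel_err = np.linalg.norm(counts - W @ H) / np.linalg.norm(counts)
```

Unlike LDA, this factorization has no probabilistic semantics: rows of W and H are not constrained to be distributions.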
Probabilistic Word Mixtures

Idea: Model text as a mixture over words (ignore order)

Generative model for LDA

Topics (each a distribution over words), e.g.:
  gene 0.04, dna 0.02, genetic 0.01, ...
  life 0.02, evolve 0.01, organism 0.01, ...
  brain 0.04, neuron 0.02, nerve 0.01, ...
  data 0.02, number 0.02, computer 0.01, ...

• Each topic is a distribution over words
• Each document is a mixture of corpus-wide topics
• Each word is drawn from one of those topics

Latent Dirichlet allocation (LDA)

Simple intuition: Documents exhibit multiple topics.
Topic Modeling

Idea: Model corpus of documents with shared topics

Topics (shared)
Topic Proportions (document-specific)
Words in Document (mixture over topics)

• Each topic is a distribution over words
• Each document is a mixture over topics
• Each word is drawn from one topic distribution
Generative model for LDA

Topic assignment for word n of document d:  z_dn ~ Discrete(θ_d)
Word drawn from the assigned topic:         x_dn | z_dn = k ~ Discrete(φ_k)
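The two sampling statements above translate directly into code. A minimal sketch; `generate_document` is a hypothetical helper, with `theta_d` the document's topic proportions and `phi` the K × V topic-word matrix:

```python
import numpy as np

def generate_document(theta_d, phi, n_words, rng):
    """Sample one document per LDA's generative story.

    For each token: draw z_dn ~ Discrete(theta_d), then
    x_dn | z_dn = k ~ Discrete(phi_k).
    theta_d: (K,) topic proportions; phi: (K, V) topic-word distributions.
    Returns the topic assignments z and word indices x.
    """
    K, V = phi.shape
    z = rng.choice(K, size=n_words, p=theta_d)        # topic per token
    x = np.array([rng.choice(V, p=phi[k]) for k in z])  # word from that topic
    return z, x
```

(The full LDA story also draws θ_d itself from a Dirichlet prior for each document; here we take it as given.)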
Each document has different topic proportions.
Topic Modeling
LDA’s view of a document
Slide credit: William Cohen (CMU)
Example: Discovering scientific topics
Example inference
“Genetics”   “Evolution”    “Disease”      “Computers”
human        evolution      disease        computer
genome       evolutionary   host           models
dna          species        bacteria       information
genetic      organisms      diseases       data
genes        life           resistance     computers
sequence     origin         bacterial      system
gene         biology        new            network
molecular    groups         strains        systems
sequencing   phylogenetic   control        model
map          living         infectious     parallel
information  diversity      malaria        methods
genetics     group          parasite       networks
mapping      new            parasites      software
project      two            united         new
sequences    common         tuberculosis   simulations
Example inference

[Figure: per-topic probability for an example document; x-axis: Topics (1–96), y-axis: Probability (0.0–0.4)]
Example inference (II)

problem         model         selection     species
problems        rate          male          forest
mathematical    constant      males         ecology
number          distribution  females       fish
new             time          sex           ecological
mathematics     number        species       conservation
university      size          female        diversity
two             values        evolution     population
first           value         populations   natural
numbers         average       population    ecosystems
work            rates         sexual        populations
time            data          behavior      endangered
mathematicians  density       evolutionary  tropical
chaos           measured      genetic       forests
chaotic         models        reproductive  ecosystem
climate change, gene knockout techniques, and apoptosis (pro-grammed cell death), the subject of the 2002 Nobel Prize inPhysiology. The cold topics were not topics that lacked preva-lence in the corpus but those that showed a strong decrease inpopularity over time. The coldest topics were 37, 289, and 75,corresponding to sequencing and cloning, structural biology, andimmunology. All these topics were very popular in about 1991and fell in popularity over the period of analysis. The NobelPrizes again provide a good means of validating these trends,with prizes being awarded for work on sequencing in 1993 andimmunology in 1989.
Tagging Abstracts. Each sample produced by our algorithm con-sists of a set of assignments of words to topics. We can use theseassignments to identify the role that words play in documents. Inparticular, we can tag each word with the topic to which it wasassigned and use these assignments to highlight topics that areparticularly informative about the content of a document. Theabstract shown in Fig. 6 is tagged with topic labels as superscripts.Words without superscripts were not included in the vocabularysupplied to the model. All assignments come from the samesingle sample as used in our previous analyses, illustrating the
kind of words assigned to the evolution topic discussed above(topic 280).
This kind of tagging is mainly useful for illustrating the contentof individual topics and how individual words are assigned, andit was used for this purpose in ref. 1. It is also possible to use theresults of our algorithm to highlight conceptual content in otherways. For example, if we integrate across a set of samples, we cancompute a probability that a particular word is assigned to themost prevalent topic in a document. This probability provides agraded measure of the importance of a word that uses informa-tion from the full set of samples, rather than a discrete measurecomputed from a single sample. This form of highlighting is usedto set the contrast of the words shown in Fig. 6 and picks out thewords that determine the topical content of the document. Suchmethods might provide a means of increasing the efficiency ofsearching large document databases, in particular, because it canbe modified to indicate words belonging to the topics of interestto the searcher.
ConclusionWe have presented a statistical inference algorithm for LatentDirichlet Allocation (1), a generative model for documents in
Fig. 5. The plots show the dynamics of the three hottest and three coldest topics from 1991 to 2001, defined as those topics that showed the strongest positive and negative linear trends. The 12 most probable words in those topics are shown below the plots.
Fig. 6. A PNAS abstract (18) tagged according to topic assignment. The superscripts indicate the topics to which individual words were assigned in a single sample, whereas the contrast level reflects the probability of a word being assigned to the most prevalent topic in the abstract, computed across samples.
From Griffiths and Steyvers, PNAS 2004
From Naive Bayes to Topic Models (board)
Likelihood

$$
\begin{aligned}
\log p(x_d \mid \beta, \theta_d)
&= \sum_n \log p(x_{dn} \mid \beta, \theta_d) \\
&= \sum_n \log \left( \prod_v p(x_{dn} = v \mid \beta, \theta_d)^{\,\mathbb{I}[x_{dn} = v]} \right) \\
&= \sum_{n,v} \mathbb{I}[x_{dn} = v] \, \log p(x_{dn} = v \mid \beta, \theta_d) \\
&= \sum_{n,v} \mathbb{I}[x_{dn} = v] \, \log \left( \sum_k p(x_{dn} = v, z_{dn} = k \mid \beta, \theta_d) \right) \\
&= \sum_{n,v} \mathbb{I}[x_{dn} = v] \, \log \left( \sum_k p(z_{dn} = k \mid \theta_d) \, p(x_{dn} = v \mid z_{dn} = k, \beta) \right) \\
&= \sum_{n,v} \mathbb{I}[x_{dn} = v] \, \log \left( \sum_k \theta_{d,k} \, \beta_{k,v} \right)
\end{aligned}
$$

In matrix shorthand, writing $X_{d,v}$ for the count of word $v$ in document $d$, the last line is $\sum_v X_{d,v} \log (\theta_d^\top \beta)_v$, i.e., $X \log(\theta \beta)$.
How to estimate parameters in PLSA?
Let’s implement… (in class exercise)
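One way to make the in-class exercise concrete: a minimal EM loop for PLSA over a document–term count matrix, sketched in NumPy. The function name, the dense (D, V, K) responsibility tensor, and the fixed iteration count are illustrative simplifications, not a reference implementation (a real one would exploit sparsity and monitor the log-likelihood for convergence).

```python
import numpy as np

def plsa_em(counts, K, n_iters=100, seed=0):
    """EM for PLSA on a document-term count matrix `counts` (D x V).

    Returns p(z|d) of shape (D, K) and p(w|z) of shape (K, V).
    """
    rng = np.random.default_rng(seed)
    D, V = counts.shape
    # Random initialization; normalize rows into valid distributions.
    p_z_d = rng.random((D, K)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((K, V)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    for _ in range(n_iters):
        # E-step: responsibilities p(z|d,w) prop. to p(z|d) p(w|z), shape (D, V, K).
        joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        resp = joint / joint.sum(axis=2, keepdims=True)
        # M-step: re-estimate both distributions from count-weighted responsibilities.
        weighted = counts[:, :, None] * resp          # n(d,w) * p(z|d,w)
        p_z_d = weighted.sum(axis=1)                  # sum over words -> (D, K)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
        p_w_z = weighted.sum(axis=0).T                # sum over docs -> (K, V)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    return p_z_d, p_w_z
```

The E-step and M-step here are the standard PLSA updates: responsibilities soft-assign each (document, word) occurrence to topics, and the M-step renormalizes the count-weighted responsibilities.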
Evaluation: Are these topics any good?
• As with clustering, this is a bit tricky. Thoughts on how we might evaluate topics?
Likelihood of held-out data
[Figure: plate diagram of the model, with topic assignments z_n generating words x_n, governed by document–topic proportions θ and topic–word distributions φ. Fit the model on observed data; then evaluate the likelihood it assigns to held-out data.]
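A common quantitative version of this check: score held-out documents under the fitted model and report perplexity. A minimal sketch, assuming we already have topic–word distributions p(w|z) from training and folded-in topic proportions p(z|d) for the held-out documents (function names are illustrative):

```python
import numpy as np

def heldout_log_likelihood(counts, p_z_d, p_w_z):
    """Log-likelihood of held-out counts (D x V) under a fitted model.

    p_z_d: per-document topic proportions (D x K), e.g. obtained by
    "folding in" held-out documents with p(w|z) held fixed.
    p_w_z: topic-word distributions (K x V), fixed from training.
    """
    p_w_d = p_z_d @ p_w_z               # p(w|d) = sum_z p(z|d) p(w|z)
    mask = counts > 0                   # only observed entries contribute
    return float((counts[mask] * np.log(p_w_d[mask])).sum())

def perplexity(counts, p_z_d, p_w_z):
    # Lower is better: exp of the negative average per-token log-likelihood.
    ll = heldout_log_likelihood(counts, p_z_d, p_w_z)
    return float(np.exp(-ll / counts.sum()))
```

As a sanity check, a model that is uniform over a vocabulary of size V has held-out perplexity exactly V.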
“Intrusion detection”
Word Intrusion Topic Intrusion
Figure 2: Screenshots of our two human tasks. In the word intrusion task (left), subjects are presented with a set of words and asked to select the word which does not belong with the others. In the topic intrusion task (right), users are given a document's title and the first few sentences of the document. The users must select which of the four groups of words does not belong.
word is selected at random from a pool of words with low probability in the current topic (to reduce the possibility that the intruder comes from the same semantic group) but high probability in some other topic (to ensure that the intruder is not rejected outright due solely to rarity). All six words are then shuffled and presented to the subject.
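The word-intrusion construction described above can be sketched as follows. The threshold used for "low probability" (bottom half of the topic's word distribution) is an illustrative stand-in, not a value from Chang et al.:

```python
import numpy as np

def word_intrusion_instance(p_w_z, topic, n_top=5, seed=0):
    """Build one word-intrusion item for `topic` (word ids into the vocab).

    p_w_z: topic-word probability matrix (K x V).
    Returns (shuffled word ids, intruder id).
    """
    rng = np.random.default_rng(seed)
    K, V = p_w_z.shape
    order = np.argsort(p_w_z[topic])                  # ascending probability
    top = order[::-1][:n_top]                         # topic's top words
    # Candidate intruders: low probability in this topic...
    low_here = set(order[: V // 2])
    # ...but among the top words of some *other* topic.
    high_elsewhere = set()
    for k in range(K):
        if k != topic:
            high_elsewhere.update(np.argsort(p_w_z[k])[::-1][:n_top])
    pool = sorted((low_here & high_elsewhere) - set(top))
    intruder = int(rng.choice(pool))                  # assumes pool is non-empty
    items = np.append(top, intruder)
    rng.shuffle(items)                                # hide the intruder's position
    return items, intruder
```

With sharply distinct topics the pool is non-empty; degenerate models where all topics share the same top words would need a fallback.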
3.2 Topic intrusion
The topic intrusion task tests whether a topic model's decomposition of documents into a mixture of topics agrees with human judgments of the document's content. This allows for evaluation of the latent space depicted by Figure 1(b). In this task, subjects are shown the title and a snippet from a document. Along with the document they are presented with four topics (each topic is represented by the eight highest-probability words within that topic). Three of those topics are the highest-probability topics assigned to that document. The remaining intruder topic is chosen randomly from the other low-probability topics in the model.
The subject is instructed to choose the topic which does not belong with the document. As before, if the topic assignment to documents were relevant and intuitive, we would expect that subjects would select the topic we randomly added as the topic that did not belong. The formulation of this task provides a natural way to analyze the quality of document-topic assignments found by the topic models. Each of the three models we fit explicitly assigns topic weights to each document; this task determines whether humans make the same association.
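The topic-intrusion construction described above can be sketched in the same way: take the document's three highest-probability topics plus one randomly chosen low-probability topic. This is a schematic of the task setup, not the authors' code; the function name is illustrative:

```python
import numpy as np

def topic_intrusion_instance(p_z_d_doc, n_shown=3, seed=0):
    """Pick topics for one topic-intrusion item.

    p_z_d_doc: topic proportions for a single document (length-K vector).
    Returns (shuffled topic ids, intruder id).
    """
    rng = np.random.default_rng(seed)
    order = np.argsort(p_z_d_doc)[::-1]          # topics by descending weight
    shown = order[:n_shown]                      # document's top topics
    intruder = int(rng.choice(order[n_shown:]))  # random low-probability topic
    topics = np.append(shown, intruder)
    rng.shuffle(topics)                          # hide the intruder's position
    return topics, intruder
```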
Due to time constraints, subjects do not see the entire document; they only see the title and first few sentences. While this is less information than is available to the algorithm, humans are good at extrapolating from limited data, and our corpora (encyclopedia and newspaper) are structured to provide an overview of the article in the first few sentences. The setup of this task is also meaningful in situations where one might be tempted to use topics for corpus exploration. If topics are used to find relevant documents, for example, users will likely be provided with similar views of the documents (e.g. title and abstract, as in Rexa).
For both the word intrusion and topic intrusion tasks, subjects were instructed to focus on the meanings of words, not their syntactic usage or orthography. We also presented subjects with the option of viewing the "correct" answer after they submitted their own response, to make the tasks more engaging. Here the "correct" answer was determined by the model which generated the data, presented as if it were the response of another user. At the same time, subjects were encouraged to base their responses on their own opinions, not to try to match other subjects' (the models') selections. In small experiments, we have found that this extra information did not bias subjects' responses.
4 Experimental results
To prepare data for human subjects to review, we fit three different topic models on two corpora. In this section, we describe how we prepared the corpora, fit the models, and created the tasks described in Section 3. We then present the results of these human trials and compare them to metrics traditionally used to evaluate topic models.
From Chang et al., 2009
Which word doesn’t belong?
Which topic doesn’t belong?
Summing up
• PLSA is a simple ad-mixture model that uncovers topics (distributions over words) and soft-assigns instances to these.
• We saw parameter estimation via Expectation-Maximization.
• Next time: Introducing priors into topic models — Latent Dirichlet Allocation (LDA).
★ This will motivate sampling-based estimation