CROSS-LANGUAGE PROBABILISTIC TOPIC MODELS Marie-Francine Moens Joint work with Ivan Vulić and Wim De Smet Department of Computer Science Language Intelligence and Information Retrieval team Katholieke Universiteit Leuven, Belgium http://www.cs.kuleuven.be/groups/liir/
Going back in history: Panini
• Panini = Indian grammarian (6th-4th century B.C.?) who wrote a grammar for Sanskrit
• Realizational chain when creating natural language texts:
  • Ideas -> broad conceptual components of a text -> subideas -> sentences -> set of semantic roles -> set of grammatical and lexical concepts -> character sequences
LSD 8-6-2012
Probabilistic topic model
• Generative model for documents: probabilistic model by which documents can be generated
  • select a document d_j with probability P(d_j)
  • pick a latent class z_k with probability P(z_k | d_j)
  • generate a word w_i with probability P(w_i | z_k)
[Figure: observed word distributions, decomposed into word distributions per topic and topic distributions per document]
[Hofmann SIGIR 1999]
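The three generative steps on this slide can be sketched in a few lines of Python. This is only an illustration under assumed toy distributions: the function name `generate_corpus`, the dictionary layout, and all probabilities are hypothetical and do not come from the slides.

```python
import random


def generate_corpus(p_doc, p_topic_given_doc, p_word_given_topic, n_tokens, seed=0):
    """Sample (document, word) pairs by following the three generative steps."""
    rng = random.Random(seed)
    docs = list(p_doc)
    pairs = []
    for _ in range(n_tokens):
        # Step 1: select a document d_j with probability P(d_j).
        d = rng.choices(docs, weights=[p_doc[x] for x in docs])[0]
        # Step 2: pick a latent class z_k with probability P(z_k | d_j).
        topics = list(p_topic_given_doc[d])
        z = rng.choices(topics, weights=[p_topic_given_doc[d][t] for t in topics])[0]
        # Step 3: generate a word w_i with probability P(w_i | z_k).
        words = list(p_word_given_topic[z])
        w = rng.choices(words, weights=[p_word_given_topic[z][x] for x in words])[0]
        # The latent class z is not recorded: only (d, w) is observed.
        pairs.append((d, w))
    return pairs


# Toy distributions (assumed values, for illustration only).
P_DOC = {"d1": 0.5, "d2": 0.5}
P_TOPIC_GIVEN_DOC = {
    "d1": {"sports": 0.9, "finance": 0.1},
    "d2": {"sports": 0.2, "finance": 0.8},
}
P_WORD_GIVEN_TOPIC = {
    "sports": {"goal": 0.6, "match": 0.4},
    "finance": {"bank": 0.7, "market": 0.3},
}

sample = generate_corpus(P_DOC, P_TOPIC_GIVEN_DOC, P_WORD_GIVEN_TOPIC, n_tokens=10)
```

Only the (document, word) pairs are returned, which mirrors the fact that the latent class has to be inferred from observed word distributions.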
Overview
• Monolingual Latent Dirichlet Allocation
• Bilingual Latent Dirichlet Allocation
• Examples of applications:
  • Cross-lingual event clustering
  • Cross-lingual document categorization
  • Cross-lingual semantic similarities of words
  • Cross-lingual information retrieval
Our research
© 2011-2012 M.-F. Moens
Monolingual Latent Dirichlet Allocation
[Blei, Ng & Jordan JMLR 2003]
[Blei CACM 2012]
Trained on a large corpus, we learn:
- Per-topic word distributions
- Per-document topic distributions
Monolingual Latent Dirichlet Allocation
• α, β: Dirichlet priors
• Trained on a large document corpus
• Key inferential problem: computing the distribution of the hidden variables θ, ϕ and z given a document, i.e., P(θ, ϕ, z | w, α, β). Exact inference is intractable, so approximate methods are used:
  • Variational inference
  • Gibbs sampling
• Inference for a new document
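As an illustration of the Gibbs sampling option, here is a minimal collapsed Gibbs sampler for LDA in plain Python. It is a sketch under assumptions, not the implementation used in the cited work: documents are assumed to be lists of integer word ids, the priors α and β are assumed symmetric, and the function name `lda_gibbs` and its default values are hypothetical.

```python
import random


def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iters=100, seed=0):
    """Collapsed Gibbs sampling for LDA; returns (theta, phi) point estimates."""
    rng = random.Random(seed)
    n_dk = [[0] * n_topics for _ in docs]               # topic counts per document
    n_kw = [[0] * vocab_size for _ in range(n_topics)]  # word counts per topic
    n_k = [0] * n_topics                                # total tokens per topic
    z = []                                              # topic assignment per token

    # Random initialisation of the topic assignments.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Full conditional P(z = t | all other assignments, w), unnormalised.
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1

    # Per-document topic distributions (theta) and per-topic word distributions (phi).
    theta = [
        [(n_dk[d][k] + alpha) / (len(doc) + n_topics * alpha) for k in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(n_kw[k][w] + beta) / (n_k[k] + vocab_size * beta) for w in range(vocab_size)]
        for k in range(n_topics)
    ]
    return theta, phi


# Toy corpus: word ids 0-2 and 3-5 tend to occur in different documents.
toy_docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 1], [5, 4, 3, 3]]
theta, phi = lda_gibbs(toy_docs, n_topics=2, vocab_size=6)
```

The two returned matrices correspond to what the slides call per-document topic distributions and per-topic word distributions. Inference for a new document would reuse the same sampling loop with `phi` held fixed.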
BiLDA = Bilingual LDA
[De Smet & Moens SWSM 2009] [Mimno, Wallach, Naradowsky, Smith, McCallum EMNLP 2009]
Variational inference Gibbs sampling
BiLDA
• A single variable θ generates a distribution of topics over document pairs, which assumes that the paired documents are topically parallel
  • Alignments at the document level:
    • Parallel corpus
    • Comparable corpus
• A generative model: inference on an unseen text collection is possible, e.g.,
  • Training on a general document-aligned corpus (e.g., Wikipedia)
  • Inferring on other documents (e.g., newswire data)
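The shared topic variable can be made concrete with a small generative sketch: one topic distribution is drawn per document pair, and each language then draws its words from its own per-topic word distributions. The function names, the symmetric Dirichlet prior, and the toy vocabularies are assumptions made for illustration only.

```python
import random


def sample_dirichlet(alpha, size, rng):
    """Draw from a symmetric Dirichlet by normalising Gamma variates."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]


def generate_pair(phi_src, phi_tgt, len_src, len_tgt, alpha, rng):
    """Generate one aligned document pair that shares a single theta."""
    n_topics = len(phi_src)
    # One topic distribution for the whole pair: the documents are topically parallel.
    theta = sample_dirichlet(alpha, n_topics, rng)

    def draw_words(phi, length):
        words = []
        for _ in range(length):
            k = rng.choices(range(n_topics), weights=theta)[0]      # topic from shared theta
            w = rng.choices(range(len(phi[k])), weights=phi[k])[0]  # word from language-specific phi
            words.append(w)
        return words

    return theta, draw_words(phi_src, len_src), draw_words(phi_tgt, len_tgt)


# Two topics; the vocabularies of the two languages may differ in size.
PHI_EN = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]  # 4 English word ids
PHI_NL = [[0.4, 0.6, 0.0], [0.0, 0.1, 0.9]]            # 3 Dutch word ids

rng = random.Random(0)
theta_pair, doc_en, doc_nl = generate_pair(PHI_EN, PHI_NL, len_src=8, len_tgt=6, alpha=1.0, rng=rng)
```

Because the two documents need not have the same length or word-by-word alignment, the same sketch covers comparable corpora as well as parallel ones.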
BiLDA
• Can be trained on comparable corpora!
  • Two or more corpora that treat similar content written in different languages
[http://www.ethnologue.com/ethno_docs/distribution.asp?by=area]
• More advanced bilingual generative models:
  • Models are needed that integrate shared and non-shared topics and that deal with a flexible number of topics
[De Smet PhD 2011]
BiLDA applications
• Topics serve as a bridge between languages =>
  • Per-topic word distributions
  • Per-document topic distributions
• Useful functionality for many applications!
  • Cross-lingual event clustering
  • Cross-lingual document categorization
  • Cross-lingual semantic similarities of words
  • Cross-lingual information retrieval
Cross-lingual event clustering
• Uses the per-document topic distributions
• Methods:
  • Cluster documents with similar cross-lingual topic distributions
  • Dissimilarity metric: symmetric Kullback-Leibler (KL) divergence
  • Hierarchical agglomerative clustering with complete linkage
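The clustering method named on this slide can be sketched as follows. The symmetric KL divergence is taken here as the sum KL(p||q) + KL(q||p) (some authors use the average instead), and the `threshold` stopping rule is an assumption, since the slides do not say how the number of clusters is chosen. All names are hypothetical.

```python
import math


def symmetric_kl(p, q, eps=1e-12):
    """KL(p||q) + KL(q||p); eps guards against log(0) for sparse distributions."""
    total = 0.0
    for a, b in zip(p, q):
        a, b = a + eps, b + eps
        total += (a - b) * math.log(a / b)
    return total


def complete_linkage(vectors, threshold):
    """Agglomerative clustering; merging stops once the closest pair exceeds threshold."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: cluster distance = largest pairwise dissimilarity.
                d = max(
                    symmetric_kl(vectors[i], vectors[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:
            break
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Topic distributions of four documents (e.g. two English, two Dutch) over K = 2 topics.
topic_dists = [[0.90, 0.10], [0.85, 0.15], [0.10, 0.90], [0.15, 0.85]]
events = complete_linkage(topic_dists, threshold=0.5)
```

Since the topic space is shared across languages, documents in different languages can be compared directly, without translating them.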
Cross-lingual event clustering
• Trained on 7612 randomly selected English Wikipedia articles, together with their counterparts in the Dutch version of Wikipedia
• Tested on 18 randomly selected events from Google News (English-Dutch) in the period of July 16-18, 2009
$$\mathrm{precision}_i = \frac{|C_i \cap M_i|}{|C_i|}$$
$$\mathrm{recall}_i = \frac{|C_i \cap M_i|}{|M_i|}$$
[De Smet & Moens SWSM 2009]
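A small sketch of how precision_i and recall_i could be computed for one cluster. It assumes that C_i is the set of documents the system placed in cluster i and M_i is the manually labeled set for the matching event; the slides do not define these symbols, so this reading and the function name are assumptions.

```python
def cluster_precision_recall(cluster, gold):
    """Precision and recall of one system cluster against one reference set."""
    cluster, gold = set(cluster), set(gold)
    overlap = len(cluster & gold)  # |C_i ∩ M_i|
    precision = overlap / len(cluster) if cluster else 0.0
    recall = overlap / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical document ids: the cluster has 4 documents, 3 of which are correct,
# and the reference event contains 5 documents.
precision, recall = cluster_precision_recall(
    cluster={"en_12", "nl_07", "nl_31", "en_44"},
    gold={"nl_07", "nl_31", "en_44", "en_90", "nl_02"},
)
```

In this example precision is 3/4 and recall is 3/5.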
Cross-lingual document categorization
• BiLDA is trained on a general, comparable, document-aligned corpus (Wikipedia)
• Given: a labeled document collection L_S in the source language and an unlabeled collection U_T in the target language, both represented by cross-lingual topics
• The classifier (support vector machine) is trained on L_S and tested on U_T
• Topic smoothing: combining topic features obtained from different BiLDA models with different K (K = number of topics)
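The transfer setup can be illustrated with a dependency-free sketch. Topic smoothing is shown as concatenating a document's topic distributions from several models with different K, which is one plausible reading of "combining". A nearest-centroid classifier stands in for the support vector machine named on the slide, purely to keep the example self-contained. All names and numbers are hypothetical.

```python
def smoothed_topic_features(thetas):
    """Concatenate one document's topic distributions from models with different K."""
    features = []
    for theta in thetas:
        features.extend(theta)
    return features


def train_centroids(vectors, labels):
    """Average the feature vectors of each class (stand-in for SVM training)."""
    sums, counts = {}, {}
    for vec, label in zip(vectors, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids, vec):
    """Assign the label of the closest centroid (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(centroids[label], vec))


# Labeled source-language documents: topic distributions from a K=2 and a K=3 model.
source_features = [
    smoothed_topic_features([[0.9, 0.1], [0.8, 0.1, 0.1]]),
    smoothed_topic_features([[0.2, 0.8], [0.1, 0.2, 0.7]]),
]
source_labels = ["sports", "politics"]
centroids = train_centroids(source_features, source_labels)

# Unlabeled target-language document, represented in the same cross-lingual topic space.
target_doc = smoothed_topic_features([[0.85, 0.15], [0.7, 0.2, 0.1]])
predicted = predict(centroids, target_doc)
```

The key point is that the classifier never sees words: source and target documents are both described by topic features, so a model trained on one language can be applied to the other.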
Cross-lingual document categorization
[De Smet, Tang & Moens PAKDD 2011]
Trained on 20,000-30,000 Wikipedia articles per language and tested on 1000 different Wikipedia articles per language
Cross-lingual semantic similarity of words
• Exploitation of the per-topic word distributions to obtain translation candidates or cross-lingual word similarities
• Several methods and their combinations, e.g.:
  - word vectors with TF-ITF scores (T = topic) and cosine similarity
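A possible reading of the TF-ITF idea, sketched by analogy with TF-IDF: each word gets a K-dimensional vector with one weight per topic, and words from both languages are compared by cosine similarity because the topics are shared. The exact weighting formula does not appear in the slide text, so the definitions below (relative frequency within a topic, times log(K / number of topics the word occurs in)) are assumptions, as are the function names and the toy counts.

```python
import math


def tf_itf_vectors(topic_counts):
    """topic_counts maps word -> list of counts, one per topic (same K for all words)."""
    n_topics = len(next(iter(topic_counts.values())))
    topic_totals = [sum(c[k] for c in topic_counts.values()) for k in range(n_topics)]
    vectors = {}
    for word, counts in topic_counts.items():
        n_active = sum(1 for c in counts if c > 0)
        # Inverse topic frequency: words spread over many topics get a low weight.
        itf = math.log(n_topics / n_active) if n_active else 0.0
        vectors[word] = [
            (counts[k] / topic_totals[k] if topic_totals[k] else 0.0) * itf
            for k in range(n_topics)
        ]
    return vectors


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def translation_candidates(word, src_vectors, tgt_vectors):
    """Rank target-language words by cosine similarity to a source-language word."""
    scored = [(cosine(src_vectors[word], vec), cand) for cand, vec in tgt_vectors.items()]
    return [cand for _, cand in sorted(scored, reverse=True)]


# Hypothetical word-topic assignment counts for K = 3 shared topics.
english = {"football": [8, 0, 0], "match": [5, 0, 1], "bank": [0, 9, 0], "money": [0, 6, 2]}
dutch = {"voetbal": [7, 0, 0], "wedstrijd": [4, 1, 0], "bank": [0, 8, 0], "geld": [0, 5, 1]}

en_vecs = tf_itf_vectors(english)
nl_vecs = tf_itf_vectors(dutch)
ranking = translation_candidates("football", en_vecs, nl_vecs)
```

With these toy counts, "voetbal" ranks first for "football" because both words are concentrated in the same topic.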
Cross-lingual semantic similarity of words
[Vulić, De Smet & Moens ACL 2011] [Vulić & Moens EACL 2012]
Cross-language information retrieval
We can integrate additional uncertain evidence, such as the cross-lingual semantic similarity
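$$P(q_1, \ldots, q_m \mid d_j) = \prod_{i=1}^{m} \Big( \lambda \sum_{k=1}^{K} P(q_i \mid z_k)\, P(z_k \mid d_j) + (1 - \lambda)\, P(q_i \mid Cl_j) \Big)$$
P(z_k | d_j): per-document topic distributions; P(q_i | z_k): per-topic word distributions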
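The query likelihood on this slide can be sketched as a scoring function that ranks target-language documents for a source-language query. The interpolation weight is called `lam` here, and the second term of the mixture is treated as a lookup in a generic background model. Both choices are assumptions, as are the function name and the toy probabilities.

```python
import math


def query_log_likelihood(query, theta_d, phi, background, lam=0.5):
    """log P(q_1, ..., q_m | d_j) for one document.

    theta_d:    per-document topic distribution P(z_k | d_j)
    phi:        per-topic word distributions P(q_i | z_k) in the query language
    background: fallback probabilities for query words (assumed second mixture term)
    """
    log_p = 0.0
    for q in query:
        topic_term = sum(phi[k].get(q, 0.0) * theta_d[k] for k in range(len(theta_d)))
        p = lam * topic_term + (1.0 - lam) * background.get(q, 0.0)
        if p <= 0.0:
            return float("-inf")  # the query word is impossible under this model
        log_p += math.log(p)      # sum of logs instead of a product, for stability
    return log_p


# Dutch query scored against English documents through K = 2 shared topics.
PHI_NL = [
    {"verkiezing": 0.5, "stem": 0.5},
    {"doelpunt": 0.6, "wedstrijd": 0.4},
]
BACKGROUND = {"verkiezing": 0.01, "stem": 0.01, "doelpunt": 0.01, "wedstrijd": 0.01}
THETA = {"en_doc_politics": [0.9, 0.1], "en_doc_sports": [0.1, 0.9]}

query = ["verkiezing", "stem"]
scores = {d: query_log_likelihood(query, t, PHI_NL, BACKGROUND) for d, t in THETA.items()}
best_doc = max(scores, key=scores.get)
```

No query translation is needed: the query words are scored with the query language's word distributions, and the document contributes only its topic distribution.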
Cross-language information retrieval
[Vulić, De Smet & Moens Information Retrieval 2012]
Results on three data sets from the CLEF 2001-2003 CLIR campaigns. Trained on the Europarl corpus and Wikipedia; on average ca. 50 queries per run, with > 270,000 English documents and > 190,000 Dutch documents searched
Conclusions
• BiLDA = a novel language-independent, self-contained and dictionary-free framework for cross-language text mining and search
• Is trained on comparable, document-aligned corpora
• Combining many different sources of evidence leads to better results, which are competitive with and sometimes even outperform dictionary-based methods
• Latent class models offer a very interesting approach to the process of language generation: much more to come …
References
• De Smet, W. (2011). Probabilistic Graphical Models for Content Representation and Applications in Monolingual and Multilingual Settings. PhD Thesis.
• De Smet, W. & Moens, M.-F. (2009). Cross-Language Linking of News Stories on the Web Using Interlingual Topic Models. In Proceedings of SWSM 2009. ACM.
• De Smet, W., Tang, J. & Moens, M.-F. (2011). Knowledge Transfer across Multilingual Corpora via Latent Topics. In Proceedings of PAKDD 2011 (Lecture Notes in Computer Science 6634) (pp. 549-560). Springer.
• Vulić, I., De Smet, W. & Moens, M.-F. (2011). Identifying Word Translations from Comparable Corpora Using Latent Topic Models. In Proceedings of ACL-HLT 2011 (pp. 479-484). ACL.
• Vulić, I., De Smet, W. & Moens, M.-F. (2012). Cross-Language Information Retrieval Models Based on Latent Topic Models Trained with Document-Aligned Comparable Corpora. Information Retrieval (DOI: 10.1007/s10791-012-9200-5).
• Vulić, I. & Moens, M.-F. (2012). Detecting Highly Confident Word Translations from Comparable Corpora without Any Prior Knowledge. In Proceedings of EACL 2012 (pp. 449-459). ACL.
• Blei, D. (2012). Introduction to probabilistic topic models. Communications of the ACM, to appear.
• Blei, D.M., Ng, A.Y. and Jordan, M.I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022.
• Hofmann, T. (1999). Probabilistic latent semantic indexing. In Proceedings of Twenty-second Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval.
• Mimno, D.M., Wallach, H.M., Naradowsky, J., Smith, D.A. & McCallum, A. (2009). Polylingual topic models. In Proceedings of EMNLP 2009 (pp. 880-889).