Cross-Language Probabilistic Topic Models

Marie-Francine Moens
Joint work with Ivan Vulić and Wim De Smet
Department of Computer Science, Language Intelligence and Information Retrieval team
Katholieke Universiteit Leuven, Belgium
http://www.cs.kuleuven.be/groups/liir/

Going back in history: Panini
•  Panini = Indian grammarian (6th-4th century B.C.?) who wrote a grammar for Sanskrit
•  Realizational chain when creating natural language texts:
    •  Ideas -> broad conceptual components of a text -> subideas -> sentences -> set of semantic roles -> set of grammatical and lexical concepts -> character sequences

2 LSD 8-6-2012

Probabilistic topic model
•  Generative model for documents: probabilistic model by which documents can be generated
    •  select a document dj with probability P(dj)
    •  pick a latent class zk with probability P(zk|dj)
    •  generate a word wi with probability P(wi|zk)

[Figure labels: observed word distributions; word distributions per topic; topic distributions per document]

[Hofmann SIGIR 1999]
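The three-step generative process above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not code from the cited work: the function names (`generate_corpus`, `word_given_doc`) and the toy distributions are hypothetical. The second function shows how the observed word distribution of a document follows from the other two, P(w|d) = Σk P(w|zk) P(zk|d).

```python
import random

def generate_corpus(p_d, p_z_given_d, p_w_given_z, n_tokens, seed=0):
    """Sample (document, word) pairs: pick d, then z given d, then w given z."""
    rng = random.Random(seed)
    docs = list(p_d)
    pairs = []
    for _ in range(n_tokens):
        d = rng.choices(docs, weights=[p_d[x] for x in docs])[0]
        topics = list(p_z_given_d[d])
        z = rng.choices(topics, weights=[p_z_given_d[d][t] for t in topics])[0]
        words = list(p_w_given_z[z])
        w = rng.choices(words, weights=[p_w_given_z[z][v] for v in words])[0]
        pairs.append((d, w))
    return pairs

def word_given_doc(word, doc, p_z_given_d, p_w_given_z):
    """P(w|d) = sum over topics of P(w|z) * P(z|d)."""
    return sum(p_w_given_z[z].get(word, 0.0) * p
               for z, p in p_z_given_d[doc].items())

# Toy distributions (invented for illustration only).
p_d = {"d1": 0.5, "d2": 0.5}
p_z_given_d = {"d1": {"sports": 0.9, "politics": 0.1},
               "d2": {"sports": 0.2, "politics": 0.8}}
p_w_given_z = {"sports": {"goal": 0.7, "team": 0.3},
               "politics": {"vote": 0.6, "goal": 0.1, "party": 0.3}}

pairs = generate_corpus(p_d, p_z_given_d, p_w_given_z, n_tokens=20)
p_goal_d1 = word_given_doc("goal", "d1", p_z_given_d, p_w_given_z)
```

Only the (document, word) pairs are observed; the topic drawn in the middle step stays hidden, which is what makes inference necessary.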


Overview
•  Monolingual Latent Dirichlet Allocation
•  Bilingual Latent Dirichlet Allocation
•  Examples of applications:
    •  Cross-lingual event clustering
    •  Cross-lingual document categorization
    •  Cross-lingual semantic similarities of words
    •  Cross-lingual information retrieval


Our research

© 2011-2012 M.-F. Moens

Monolingual Latent Dirichlet Allocation

[Blei, Ng & Jordan JMLR 2003]

[Blei CACM 2012]


Trained on a large corpus we learn:
-  Per-topic word distributions
-  Per-document topic distributions

Monolingual Latent Dirichlet Allocation
•  α, β: Dirichlet priors
•  Trained on a large document corpus. Key inferential problem: computing the distribution of the hidden variables θ, ϕ and z given a document, i.e., P(θ,ϕ,z|w,α,β), which is intractable for exact inference. Approximations:
    •  Variational inference
    •  Gibbs sampling
•  Inference for a new document
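Of the two approximations named above, Gibbs sampling is the easier one to sketch. The following is a minimal collapsed Gibbs sampler for monolingual LDA, written with the standard library only. It is one common variant and not necessarily the implementation used in the cited work; the function name `lda_gibbs`, the default hyperparameters and the toy corpus are assumptions made for illustration.

```python
import random

def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA. docs is a list of lists of word ids."""
    rng = random.Random(seed)
    n_dk = [[0] * n_topics for _ in docs]                 # topic counts per document
    n_kw = [[0] * vocab_size for _ in range(n_topics)]    # word counts per topic
    n_k = [0] * n_topics                                  # total tokens per topic
    z = []
    for d, doc in enumerate(docs):                        # random initial assignment
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                               # remove the current assignment
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # P(z = t | all other assignments), up to a constant
                weights = [(n_dk[d][t] + alpha) * (n_kw[t][w] + beta)
                           / (n_k[t] + vocab_size * beta) for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    # Point estimates: per-document topic and per-topic word distributions.
    theta = [[(n_dk[d][k] + alpha) / (len(doc) + n_topics * alpha)
              for k in range(n_topics)] for d, doc in enumerate(docs)]
    phi = [[(n_kw[k][w] + beta) / (n_k[k] + vocab_size * beta)
            for w in range(vocab_size)] for k in range(n_topics)]
    return theta, phi

# Toy corpus of word ids (invented): two documents use words 0-1, two use words 2-3.
docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 0, 1], [3, 2, 2]]
theta, phi = lda_gibbs(docs, n_topics=2, vocab_size=4, n_iter=50)
```

`theta` corresponds to the per-document topic distributions and `phi` to the per-topic word distributions that the slides say are learned from a large corpus.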


BiLDA = Bilingual LDA


[De Smet & Moens SWSM 2009] [Mimno, Wallach, Naradowsky, Smith, McCallum EMNLP 2009]

Inference methods: variational inference, Gibbs sampling

BiLDA
•  A single variable Θ generates a distribution of topics over document pairs; this assumes that the documents are topically parallel
•  Alignments at the document level:
    •  Parallel corpus
    •  Comparable corpus
•  A generative model: inference on an unseen text collection is possible, e.g.,
    •  Training on a general document-aligned corpus (e.g., Wikipedia)
    •  Inferring on other documents (e.g., newswire data)
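The shared topic variable can be made concrete with a small sketch of how one aligned document pair might be generated. This assumes a symmetric Dirichlet prior and one word distribution per topic and per language; the helper names (`sample_dirichlet`, `generate_pair`) and the toy word distributions are hypothetical and not taken from the cited papers.

```python
import random

def sample_dirichlet(rng, alpha, dim):
    """Draw from a symmetric Dirichlet by normalizing Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [x / total for x in draws]

def generate_pair(rng, phi_src, phi_tgt, alpha, len_src, len_tgt):
    """Generate one document pair that shares a single topic distribution."""
    n_topics = len(phi_src)
    theta = sample_dirichlet(rng, alpha, n_topics)    # shared by both languages

    def draw(phi, length):
        words = []
        for _ in range(length):
            z = rng.choices(range(n_topics), weights=theta)[0]        # topic from shared theta
            w = rng.choices(range(len(phi[z])), weights=phi[z])[0]    # word from that language's phi
            words.append(w)
        return words

    return theta, draw(phi_src, len_src), draw(phi_tgt, len_tgt)

# Toy per-topic word distributions (invented): 2 topics, 3 source words, 4 target words.
phi_src = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]
phi_tgt = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.1, 0.7]]
rng = random.Random(0)
theta, src_doc, tgt_doc = generate_pair(rng, phi_src, phi_tgt, alpha=1.0, len_src=6, len_tgt=9)
```

The two documents may differ in length and need not be sentence-by-sentence translations; they only share `theta`, which is why document-level alignment is enough.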


BiLDA
•  Can be trained on comparable corpora!
•  Comparable corpora: two or more corpora that treat similar content written in different languages


[http://www.ethnologue.com/ethno_docs/distribution.asp?by=area]


•  More advanced bilingual generative models:
    •  Models are needed that integrate shared and non-shared topics and that deal with a flexible number of topics

[De Smet PhD 2011]


BiLDA applications
•  Topics serve as a bridge between languages =>
    •  Per-topic word distributions
    •  Per-document topic distributions
•  Useful functionality for many applications!
    •  Cross-lingual event clustering
    •  Cross-lingual document categorization
    •  Cross-lingual semantic similarities of words
    •  Cross-lingual information retrieval

Cross-lingual event clustering

•  Methods:
    •  Cluster documents with similar cross-lingual topic distributions
    •  Dissimilarity metric: symmetric Kullback-Leibler (KL) divergence
    •  Hierarchical agglomerative clustering with complete linkage
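The clustering method above can be sketched as follows. Two details are assumptions, since the slides do not specify them: the symmetric KL divergence is taken as KL(p||q) + KL(q||p) without halving, and merging stops once the smallest complete-linkage distance exceeds a `threshold`. The function names and the toy topic distributions are hypothetical.

```python
import math

def symmetric_kl(p, q, eps=1e-12):
    """KL(p||q) + KL(q||p); eps guards against log(0)."""
    kl_pq = sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q))
    kl_qp = sum(b * math.log((b + eps) / (a + eps)) for a, b in zip(p, q))
    return kl_pq + kl_qp

def complete_linkage(dists, threshold):
    """Agglomerative clustering; cluster distance = largest pairwise distance."""
    clusters = [[i] for i in range(len(dists))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = max(dists[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:       # assumed stopping rule
            break
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Toy per-document topic distributions (invented); documents may be in either language.
thetas = [[0.9, 0.1], [0.85, 0.15], [0.1, 0.9], [0.15, 0.85]]
dists = [[symmetric_kl(p, q) for q in thetas] for p in thetas]
clusters = complete_linkage(dists, threshold=0.5)
```

Because the topic space is shared across languages, English and Dutch documents can be compared directly with the same dissimilarity.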

Input: per-document topic distributions

Cross-lingual event clustering
•  Trained on 7612 English Wikipedia articles, randomly selected, and their Dutch Wikipedia versions
•  Tested on 18 randomly selected events from Google News (English-Dutch) in the period of July 16-18, 2009


$$\mathrm{precision}_i = \frac{|C_i \cap M_i|}{|C_i|}, \qquad \mathrm{recall}_i = \frac{|C_i \cap M_i|}{|M_i|}$$
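As a worked instance of these two measures, the sketch below computes them for one cluster. The slides do not define the symbols, so this assumes that C_i is the set of documents placed in cluster i and M_i the set of documents that truly belong to the matching event; the function name and the document ids are invented.

```python
def cluster_precision_recall(cluster, gold):
    """Precision and recall of one cluster against its reference event set.

    Both arguments are non-empty collections of document ids.
    """
    c, m = set(cluster), set(gold)
    overlap = len(c & m)
    return overlap / len(c), overlap / len(m)

# Toy example: cluster of 4 documents, reference event of 3, with 2 in common.
precision, recall = cluster_precision_recall([1, 2, 3, 4], [3, 4, 5])
```

Here precision is 2/4 and recall is 2/3.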

[De Smet & Moens SWSM 2009]


Cross-lingual document categorization

Input: per-document topic distributions

Cross-lingual document categorization
•  BiLDA is trained on a general comparable document-aligned corpus (Wikipedia)
•  Given a labeled document collection LS in the source language and an unlabeled collection UT in the target language, both represented by cross-lingual topics
•  Training of the classifier (support vector machine) on LS, testing on UT
•  Topic smoothing: combining topic features obtained from different BiLDA models with different K (K = number of topics)
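One simple way to realize topic smoothing is to concatenate, per document, the topic distributions inferred by several models. Whether the cited work combines the features by concatenation is an assumption here, and `smooth_topic_features` and the toy values are hypothetical; the resulting vectors would then be fed to the classifier.

```python
def smooth_topic_features(per_model_thetas):
    """Concatenate per-document topic distributions from several models.

    per_model_thetas[m][d] is the topic distribution of document d under
    model m; models may use different numbers of topics K.
    """
    n_docs = len(per_model_thetas[0])
    return [[p for thetas in per_model_thetas for p in thetas[d]]
            for d in range(n_docs)]

# Toy example: two documents, one model with K=2 and one with K=3 (values invented).
thetas_k2 = [[0.9, 0.1], [0.2, 0.8]]
thetas_k3 = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
features = smooth_topic_features([thetas_k2, thetas_k3])
```

Since the topics are shared across languages, feature vectors built this way for LS and UT live in the same space, which is what lets a classifier trained on one language be applied to the other.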


Cross-lingual document categorization


[De Smet, Tang & Moens PAKDD 2011]


Trained on 20,000-30,000 Wikipedia articles per language and tested on 1000 different Wikipedia articles per language


Cross-lingual semantic similarity of words
•  Exploitation of the per-topic word distributions to obtain translation candidates or cross-lingual word similarities
•  Several methods and their combinations, e.g.:
    -  word vectors with TF-ITF scores (T = topic) and cosine similarity
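The TF-ITF idea can be illustrated with a short sketch. The slides give no formula, so the weighting below is an assumption made by analogy with TF-IDF: TF is the share of a topic's tokens taken by the word, and ITF is log(K / (1 + number of topics in which the word occurs)). All names and counts are invented.

```python
import math

def tf_itf_vector(word_topic_counts, topic_totals):
    """K-dimensional TF-ITF vector of one word (assumed weighting, see lead-in).

    word_topic_counts[k]: tokens of this word assigned to topic k
    topic_totals[k]:      all tokens assigned to topic k in the word's language
    """
    n_topics = len(topic_totals)
    topics_with_word = sum(1 for c in word_topic_counts if c > 0)
    itf = math.log(n_topics / (1 + topics_with_word))
    return [(c / t if t else 0.0) * itf
            for c, t in zip(word_topic_counts, topic_totals)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy counts over K=4 shared topics (invented).
src_word = tf_itf_vector([8, 0, 0, 0], [100, 100, 100, 100])    # source-language word
tgt_close = tf_itf_vector([6, 0, 0, 0], [80, 80, 80, 80])       # target word, same topic
tgt_far = tf_itf_vector([0, 0, 5, 0], [80, 80, 80, 80])         # target word, other topic
sim_close = cosine(src_word, tgt_close)
sim_far = cosine(src_word, tgt_far)
```

Both vectors are indexed by the same K cross-lingual topics, so a source word and a target word can be compared even though they never co-occur in one text.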


Cross-lingual semantic similarity of words

[Vulić, De Smet & Moens ACL 2011] [Vulić & Moens EACL 2012]


Cross-language information retrieval



We can integrate additional uncertain evidence, such as cross-lingual semantic similarity.


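$$P(q_1,\ldots,q_m \mid d_j) = \prod_{i=1}^{m} \Big( \lambda \sum_{k=1}^{K} P(q_i \mid z_k)\, P(z_k \mid d_j) + (1-\lambda)\, P(q_i \mid Cl_j) \Big)$$

Uses the per-document topic distributions P(z_k | d_j) and the per-topic word distributions P(q_i | z_k).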
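The retrieval formula can be turned into a scoring function as sketched below. It works in log space to avoid underflow for long queries. The second term of the mixture is passed in as a plain lookup table called `background`, because the slides do not spell out how P(q_i | Cl_j) is estimated; that table, the function name, the value of λ and the toy distributions are all assumptions.

```python
import math

def query_log_likelihood(query, theta_d, phi_src, background, lam=0.5):
    """log P(q_1..q_m | d_j) under the interpolated topic model.

    query:      list of source-language query words
    theta_d:    P(z_k | d_j) for a target-language document
    phi_src:    phi_src[k][word] = P(word | z_k) in the query language
    background: stand-in for the second mixture term (assumption, see lead-in)
    """
    score = 0.0
    for q in query:
        topic_part = sum(phi_src[k].get(q, 0.0) * theta_d[k]
                         for k in range(len(theta_d)))
        p = lam * topic_part + (1.0 - lam) * background.get(q, 0.0)
        if p <= 0.0:
            return float("-inf")      # query word has no support at all
        score += math.log(p)
    return score

# Toy model (invented): two topics, English query, two Dutch documents.
phi_src = [{"football": 0.5, "goal": 0.5}, {"election": 0.5, "vote": 0.5}]
background = {"football": 0.25, "goal": 0.25, "election": 0.25, "vote": 0.25}
query = ["football", "goal"]
score_d1 = query_log_likelihood(query, [0.9, 0.1], phi_src, background)
score_d2 = query_log_likelihood(query, [0.1, 0.9], phi_src, background)
```

Documents are ranked by this score; the query words are never translated, since the shared topics link the query language to the document language.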


[Vulić, De Smet & Moens Information Retrieval 2012]

Results on three data sets from the CLEF 2001-2003 CLIR campaigns. Training on the Europarl corpus and Wikipedia; on average ca. 50 queries per run; > 270,000 English documents and > 190,000 Dutch documents searched.


Conclusions
•  BiLDA = a novel language-independent, self-contained and dictionary-free framework for cross-language text mining and search
•  Is trained on comparable document-aligned corpora
•  Combining many different sources of evidence leads to better results, which are competitive with, and sometimes even outperform, dictionary-based methods
•  Latent class models offer a very interesting approach to the process of language generation: much more to come …


Termwise project


References


•  De Smet, W. (2011). Probabilistic Graphical Models for Content Representation and Applications in Monolingual and Multilingual Settings. PhD Thesis.

•  De Smet, W. & Moens, M.-F. (2009). Cross-Language Linking of News Stories on the Web Using Interlingual Topic Models. In Proceedings of SWSM 2009. ACM.

•  De Smet, W., Tang, J. & Moens, M.-F. (2011). Knowledge Transfer across Multilingual Corpora via Latent Topics. In Proceedings of PAKDD 2011 (Lecture Notes in Computer Science 6634) (pp. 549-560). Springer.

•  Vulić, I., De Smet, W. & Moens, M.-F. (2011). Identifying Word Translations from Comparable Corpora Using Latent Topic Models. In Proceedings of ACL-HLT 2011 (pp. 479-484). ACL.

•  Vulić, I., De Smet, W. & Moens, M.-F. (2012). Cross-Language Information Retrieval Models Based on Latent Topic Models Trained with Document-Aligned Comparable Corpora. Information Retrieval (DOI: 10.1007/s10791-012-9200-5).

•  Vulić, I. & Moens, M.-F. (2012). Detecting Highly Confident Word Translations from Comparable Corpora without Any Prior Knowledge. In Proceedings of EACL 2012 (pp. 449-459). ACL.


•  Blei, D. (2012). Introduction to probabilistic topic models. Communications of the ACM, to appear.

•  Blei, D.M., Ng, A.Y. and Jordan, M.I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022.

•  Hofmann, T. (1999). Probabilistic latent semantic indexing. In Proceedings of Twenty-second Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval.

•  Mimno, D.M., Wallach, H.M., Naradowsky, J., Smith, D.A. & McCallum, A. (2009). Polylingual topic models. In Proceedings of EMNLP 2009 (pp. 880-889).
