
Page 1

Non-Bayesian Additive Regularization for Multimodal Topic Modeling of Large Collections

Konstantin Vorontsov¹,³, Oleksandr Frei⁴, Murat Apishev², Peter Romov³, Marina Suvorova², Anastasia Yanina¹

¹Moscow Institute of Physics and Technology, ²Moscow State University, ³Yandex, ⁴Schlumberger

Topic Models: Post-Processing and Applications, CIKM'15 Workshop
October 19, 2015, Melbourne, Australia

Page 2

Probabilistic Topic Model (PTM) generating a text collection

Topic model explains terms w in documents d by topics t:

p(w|d) = ∑_t p(w|t) p(t|d)

[Example document shown on the slide, translated from Russian:] A spectral-analytical approach to detecting fuzzy extended repeats in genomic sequences has been developed. The method is based on multiscale estimation of the similarity of nucleotide sequences in the space of expansion coefficients of fragments of GC- and GA-content curves over classical orthogonal bases. Conditions of optimal approximation are found that enable automatic recognition of repeats of various kinds (direct, inverted, and tandem) on the spectral similarity matrix. The method works equally well at different data scales. It can reveal traces of segmental duplications and megasatellite regions in the genome, and regions of synteny when comparing a pair of genomes. It can be used for a detailed study of chromosome fragments (searching for fuzzy regions with a repeating pattern of moderate length).

[Slide figure: the document's topic distribution p(t|d) and the top words of three topic distributions p(w|t), translated from Russian:]

0.018 recognition, 0.013 similarity, 0.011 pattern, …
0.023 DNA, 0.016 genome, 0.009 nucleotide, …
0.014 basis, 0.009 spectrum, 0.006 orthogonal, …

Page 3

Inverse problem: text collection → PTM

Given: D is a set (collection) of documents;
W is a set (vocabulary) of terms;
n_dw = how many times term w appears in document d.

Find: parameters φ_wt = p(w|t), θ_td = p(t|d) of the topic model

p(w|d) = ∑_t φ_wt θ_td

under nonnegativity and normalization constraints:

φ_wt ≥ 0, ∑_{w∈W} φ_wt = 1;   θ_td ≥ 0, ∑_{t∈T} θ_td = 1.

The ill-posed problem of matrix factorization:

ΦΘ = (ΦS)(S⁻¹Θ) = Φ′Θ′

for all S such that Φ′,Θ′ are stochastic.
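
To see this non-uniqueness concretely, here is a minimal numpy sketch (the toy matrices and the mixing matrix S are made up for illustration):

```python
import numpy as np

# A toy stochastic factorization: columns of Phi and Theta sum to 1.
Phi = np.array([[0.7, 0.1],
                [0.2, 0.3],
                [0.1, 0.6]])          # W x T, columns p(w|t)
Theta = np.array([[0.8, 0.4],
                  [0.2, 0.6]])        # T x D, columns p(t|d)

# A mixing matrix S with unit column sums keeps Phi' = Phi S stochastic,
# as long as no entry of Phi' or Theta' = S^{-1} Theta goes negative.
S = np.array([[0.9, 0.2],
              [0.1, 0.8]])

Phi2 = Phi @ S
Theta2 = np.linalg.inv(S) @ Theta

# The same p(w|d) matrix arises from two different "topic models":
assert np.allclose(Phi @ Theta, Phi2 @ Theta2)
print(Phi2.round(3), Theta2.round(3), sep="\n")
```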

Page 4

PLSA — Probabilistic Latent Semantic Analysis [Hofmann, 1999]

Constrained maximization of the log-likelihood:

L(Φ,Θ) = ∑_{d,w} n_dw ln ∑_t φ_wt θ_td → max_{Φ,Θ}

EM-algorithm is a simple iteration method for the nonlinear system

E-step:   p_tdw = norm_{t∈T}( φ_wt θ_td )

M-step:   φ_wt = norm_{w∈W}( ∑_{d∈D} n_dw p_tdw ),   θ_td = norm_{t∈T}( ∑_{w∈d} n_dw p_tdw )

where norm_{t∈T} x_t = max{x_t, 0} / ∑_{s∈T} max{x_s, 0} is vector normalization.
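
For concreteness, this rational EM iteration fits into a few lines of numpy (a toy sketch with random data; all names are mine, not BigARTM's):

```python
import numpy as np

rng = np.random.default_rng(0)
W, T, D = 50, 5, 20
N = rng.poisson(1.0, size=(W, D))            # n_dw as a W x D count matrix

Phi = rng.dirichlet(np.ones(W), size=T).T    # W x T, columns p(w|t)
Theta = rng.dirichlet(np.ones(T), size=D).T  # T x D, columns p(t|d)

for _ in range(50):
    # E-step folded into the M-step: n_wt = sum_d n_dw * phi_wt*theta_td / p(w|d)
    P = Phi @ Theta                          # p(w|d), W x D
    R = N / np.maximum(P, 1e-12)             # n_dw / p(w|d)
    n_wt = Phi * (R @ Theta.T)               # W x T unnormalized counts
    n_td = Theta * (Phi.T @ R)               # T x D unnormalized counts
    Phi = n_wt / n_wt.sum(axis=0, keepdims=True)    # norm over w
    Theta = n_td / n_td.sum(axis=0, keepdims=True)  # norm over t

# Log-likelihood after fitting (non-decreasing across the iterations above).
print(np.sum(N * np.log(np.maximum(Phi @ Theta, 1e-12))))
```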

Page 5

Graphical Models and Bayesian Inference

In the Bayesian approach, graphical models are used to build sophisticated generative models.

David M. Blei. Probabilistic topic models // Communications of the ACM, 2012. Vol. 55, No. 4. Pp. 77–84.

Page 6

Graphical Models and Bayesian Inference

In the Bayesian approach, a lot of calculus has to be done for each model to go from the problem statement to the solution algorithm:

Yi Wang. Distributed Gibbs Sampling of Latent Dirichlet Allocation: The Gritty Details. 2008.

Page 7

ARTM — Additive Regularization of Topic Model

Maximum log-likelihood with an additive regularization criterion R:

∑_{d,w} n_dw ln ∑_t φ_wt θ_td + R(Φ,Θ) → max_{Φ,Θ}

EM-algorithm is a simple iteration method for the system

E-step:   p_tdw = norm_{t∈T}( φ_wt θ_td )

M-step:   φ_wt = norm_{w∈W}( ∑_{d∈D} n_dw p_tdw + φ_wt ∂R/∂φ_wt ),   θ_td = norm_{t∈T}( ∑_{w∈d} n_dw p_tdw + θ_td ∂R/∂θ_td )
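
Compared to PLSA, the only code change is the additive term before normalization. A sketch of the regularized Φ update, with the max{·, 0} cut-off that the norm operator implies (grad_R_phi is any regularizer gradient ∂R/∂Φ supplied by the caller; the names are mine):

```python
import numpy as np

def m_step_phi(n_wt, Phi, grad_R_phi):
    """Regularized M-step: phi_wt = norm_w(n_wt + phi_wt * dR/dphi_wt)."""
    x = np.maximum(n_wt + Phi * grad_R_phi(Phi), 0.0)   # trim negatives
    s = x.sum(axis=0, keepdims=True)
    return np.divide(x, s, out=np.zeros_like(x), where=s > 0)
```

With grad_R_phi returning zeros this is exactly the PLSA update; the following slides give concrete gradients for smoothing, sparsing, and decorrelation.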

Page 8

Example: Latent Dirichlet Allocation [Blei, Ng, Jordan, 2003]

Maximum a posteriori (MAP) estimation with Dirichlet priors:

∑_{d,w} n_dw ln ∑_t φ_wt θ_td + ∑_{t,w} β_w ln φ_wt + ∑_{d,t} α_t ln θ_td → max_{Φ,Θ},

where the first term is the log-likelihood L(Φ,Θ) and the last two terms form the regularization criterion R(Φ,Θ).

EM-algorithm is a simple iteration method for the system

E-step:   p_tdw = norm_{t∈T}( φ_wt θ_td )

M-step:   φ_wt = norm_{w∈W}( ∑_{d∈D} n_dw p_tdw + β_w ),   θ_td = norm_{t∈T}( ∑_{w∈d} n_dw p_tdw + α_t )

Page 9

Many Bayesian PTMs can be reinterpreted as regularizers in ARTM

smoothing (LDA) for background and stop-word topics
sparsing (anti-LDA) for domain-specific topics
topic decorrelation
topic coherence maximization
supervised learning for classification and regression
semi-supervised learning
using document citations and links
determining the number of topics via entropy sparsing
modeling topical hierarchies
modeling temporal topic dynamics
using vocabularies in multilingual topic models
etc.

Vorontsov K. V., Potapenko A. A. Additive Regularization of Topic Models //Machine Learning. Volume 101, Issue 1 (2015), Pp. 303-323.

Page 10

ARTM — Additive Regularization of Topic Model

Maximum log-likelihood with additive combination of regularizers:

∑_{d,w} n_dw ln ∑_t φ_wt θ_td + ∑_{i=1}^{n} τ_i R_i(Φ,Θ) → max_{Φ,Θ},

where τ_i are regularization coefficients.

EM-algorithm is a simple iteration method for the system

E-step:   p_tdw = norm_{t∈T}( φ_wt θ_td )

M-step:   φ_wt = norm_{w∈W}( ∑_{d∈D} n_dw p_tdw + φ_wt ∑_{i=1}^{n} τ_i ∂R_i/∂φ_wt ),   θ_td = norm_{t∈T}( ∑_{w∈d} n_dw p_tdw + θ_td ∑_{i=1}^{n} τ_i ∂R_i/∂θ_td )

Page 11

Assumptions: which topics are well-interpretable?

Topics S ⊂ T contain domain-specific terms:
p(w|t), t ∈ S, are sparse and different (weakly correlated).

Topics B ⊂ T contain background terms:
p(w|t), t ∈ B, are dense and contain common lexis words.

[Diagram: matrices Φ_{W×T} and Θ_{T×D} split into domain-specific and background topics]

Page 12

Smoothing regularization (rethinking LDA)

The non-sparsity assumption for background topics t ∈ B:
φ_wt are similar to a given distribution β_w;
θ_td are similar to a given distribution α_t.

∑_{t∈B} KL_w(β_w ‖ φ_wt) → min_Φ;   ∑_{d∈D} KL_t(α_t ‖ θ_td) → min_Θ.

We minimize the sum of these KL-divergences to get a regularizer:

R(Φ,Θ) = β₀ ∑_{t∈B} ∑_{w∈W} β_w ln φ_wt + α₀ ∑_{d∈D} ∑_{t∈B} α_t ln θ_td → max.

The regularized M-step applied for all t ∈ B coincides with LDA:

φ_wt ∝ n_wt + β₀ β_w,   θ_td ∝ n_td + α₀ α_t,

which is a new non-Bayesian interpretation of LDA [Blei et al., 2003].

Page 13

Sparsing regularizer (further rethinking LDA)

The sparsity assumption for domain-specific topics t ∈ S:
the distributions φ_wt, θ_td contain many zero probabilities.

We maximize the sum of KL-divergences KL(β‖φ_t) and KL(α‖θ_d):

R(Φ,Θ) = −β₀ ∑_{t∈S} ∑_{w∈W} β_w ln φ_wt − α₀ ∑_{d∈D} ∑_{t∈S} α_t ln θ_td → max.

The regularized M-step gives “anti-LDA”, for all t ∈ S:

φ_wt ∝ (n_wt − β₀ β_w)₊,   θ_td ∝ (n_td − α₀ α_t)₊.

Varadarajan J., Emonet R., Odobez J.-M. A sparsity constraint for topic models — application to temporal activity mining // NIPS-2010 Workshop on Practical Applications of Sparse Modeling: Open Issues and New Directions.
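
Both smoothing and sparsing fit the generic regularized M-step sketched earlier; a sketch of the two gradients (beta is the W-vector β_w, beta0 ≥ 0 the coefficient; the names are mine):

```python
import numpy as np

def grad_smoothing(beta, beta0):
    """R = beta0 * sum_t sum_w beta_w ln(phi_wt), so dR/dphi_wt = beta0*beta_w/phi_wt.
    In the M-step, phi_wt * grad = beta0*beta_w, giving phi_wt ∝ n_wt + beta0*beta_w
    (LDA-style smoothing)."""
    return lambda Phi: beta0 * beta[:, None] / np.maximum(Phi, 1e-12)

def grad_sparsing(beta, beta0):
    """The same regularizer with a minus sign: phi_wt ∝ (n_wt - beta0*beta_w)_+
    (anti-LDA), with negatives trimmed to zero by the norm operator."""
    return lambda Phi: -beta0 * beta[:, None] / np.maximum(Phi, 1e-12)
```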

Page 14

Regularization for topic decorrelation

The dissimilarity assumption for domain-specific topics t ∈ S:
if topics are interpretable, then they must differ significantly.

We minimize covariances between the column vectors φ_t:

R(Φ) = −(τ/2) ∑_{t∈S} ∑_{s∈S∖t} ∑_{w∈W} φ_wt φ_ws → max.

The regularized M-step makes the columns of Φ more distant:

φ_wt ∝ ( n_wt − τ φ_wt ∑_{s∈S∖t} φ_ws )₊.

Tan Y., Ou Z. Topic-weak-correlated latent Dirichlet allocation // 7th Int’l Symp. Chinese Spoken Language Processing (ISCSLP), 2010. — Pp. 224–228.
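
The decorrelator plugs into the same M-step; since ∂R/∂φ_wt = −τ ∑_{s≠t} φ_ws, the gradient needs only one row sum. A sketch, same conventions as above:

```python
import numpy as np

def grad_decorrelation(tau):
    """dR/dphi_wt = -tau * sum_{s != t} phi_ws; the M-step then subtracts
    tau * phi_wt * sum_{s != t} phi_ws, pushing the columns of Phi apart."""
    return lambda Phi: -tau * (Phi.sum(axis=1, keepdims=True) - Phi)
```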

Page 15

Example: Combination of sparsing, smoothing, and decorrelation

smoothing background topics B in Φ and Θ

sparsing domain-specific topics S = T\B in Φ and Θ

decorrelation of topics in Φ

R(Φ,Θ) =  β₁ ∑_{t∈B} ∑_{w∈W} β_w ln φ_wt + α₁ ∑_{d∈D} ∑_{t∈B} α_t ln θ_td
        − β₀ ∑_{t∈S} ∑_{w∈W} β_w ln φ_wt − α₀ ∑_{d∈D} ∑_{t∈S} α_t ln θ_td
        − γ ∑_{t∈T} ∑_{s∈T∖t} ∑_{w∈W} φ_wt φ_ws,

where β₀, α₀, β₁, α₁, γ are regularization coefficients.

Page 16

Multimodal Probabilistic Topic Modeling

Given a text document collection, a Probabilistic Topic Model finds:
p(t|d) — topic distribution for each document d,
p(w|t) — term distribution for each topic t.

[Diagram: text documents → topic modeling → topics of documents; words and keyphrases of topics]

Page 17

Multimodal Probabilistic Topic Modeling

A Multimodal Topic Model finds topical distributions for terms p(w|t), authors p(a|t), time p(y|t), …

[Diagram: as above, with document metadata added: authors, date/time, conference, organization, URL, etc.]

Page 18

Multimodal Probabilistic Topic Modeling

A Multimodal Topic Model finds topical distributions for terms p(w|t), authors p(a|t), time p(y|t), objects on images p(o|t), linked documents p(d′|t), advertising banners p(b|t), and users p(u|t).

[Diagram: as above, with further modalities attached: ads, images, links, users]

Page 19

Multimodal extension of ARTM

W^m is the vocabulary of tokens of the m-th modality, m ∈ M;
W = W¹ ⊔ · · · ⊔ W^M is the joint vocabulary of all modalities.

Maximum multimodal log-likelihood with regularization:

∑_{m∈M} λ_m ∑_{d∈D} ∑_{w∈W^m} n_dw ln ∑_t φ_wt θ_td + R(Φ,Θ) → max_{Φ,Θ}

EM-algorithm is a simple iteration method for the system

E-step:   p_tdw = norm_{t∈T}( φ_wt θ_td )

M-step:   φ_wt = norm_{w∈W^m}( ∑_{d∈D} λ_{m(w)} n_dw p_tdw + φ_wt ∂R/∂φ_wt ),   θ_td = norm_{t∈T}( ∑_{w∈d} λ_{m(w)} n_dw p_tdw + θ_td ∂R/∂θ_td )
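
In code, the multimodal extension changes two things: every count n_dw is weighted by its modality coefficient λ_{m(w)}, and Φ is normalized within each modality's vocabulary block. A minimal sketch (the data layout and names are mine):

```python
import numpy as np

def weight_counts(N, word_modality, lam):
    """Scale n_dw by lambda_{m(w)}. N is a W x D count matrix, word_modality
    a length-W array of modality ids, lam a dict {modality id: weight}."""
    weights = np.array([lam[m] for m in word_modality])
    return N * weights[:, None]

def norm_phi_per_modality(n_wt, word_modality):
    """Normalize the columns of Phi separately over each modality's vocabulary."""
    Phi = np.zeros(n_wt.shape)
    mods = np.asarray(word_modality)
    for m in np.unique(mods):
        block = np.maximum(n_wt[mods == m], 0.0)   # the rows of modality m
        s = block.sum(axis=0, keepdims=True)
        Phi[mods == m] = np.divide(block, s, out=np.zeros_like(block), where=s > 0)
    return Phi
```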

Page 20

Example: Multi-lingual topic model of Wikipedia

Top 10 words with p(w|t) probabilities (in %) from a two-language topic model, based on Russian and English Wikipedia articles with mutual interlanguage links.

Topic 68                                Topic 79
research     4.56   институт     6.03   goals    4.48   матч     6.02
technology   3.14   университет  3.35   league   3.99   игрок    5.56
engineering  2.63   программа    3.17   club     3.76   сборная  4.51
institute    2.37   учебный      2.75   season   3.49   фк       3.25
science      1.97   технический  2.70   scored   2.72   против   3.20
program      1.60   технология   2.30   cup      2.57   клуб     3.14

Topic 88                                Topic 251
opera        7.36   опера        7.82   windows    8.00   windows     6.05
conductor    1.69   оперный      3.13   microsoft  4.03   microsoft   3.76
orchestra    1.14   дирижер      2.82   server     2.93   версия      1.86
wagner       0.97   певец        1.65   software   1.38   приложение  1.86
soprano      0.78   певица       1.51   user       1.03   сервер      1.63
performance  0.78   театр        1.14   security   0.92   server      1.54

Page 21

Example: Recommending articles from blog

The quality of recommendations for a baseline matrix-factorization model, a unimodal model with only the modality of user likes, and two multimodal models incorporating words and user-specified data (tags and categories).

Model                     Recall@5   Recall@10   Recall@20
collaborative filtering   0.591      0.652       0.678
likes                     0.62       0.59        0.65
likes + words             0.79       0.64        0.68
all modalities            0.80       0.71        0.69

Page 22

BigARTM project

BigARTM features:
Parallel + online + multimodal + regularized topic modeling
Out-of-core one-pass processing of big data
Built-in library of regularizers and quality measures

BigARTM community:
Code on GitHub: https://github.com/bigartm
Links to docs, discussion group, builds: http://bigartm.org

BigARTM license and programming environment:
Freely available for commercial usage (BSD 3-Clause license)
Cross-platform: Windows, Linux, Mac OS X (32-bit and 64-bit)
Programming APIs: command line, C++, and Python
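
For orientation, fitting a regularized model through the Python API looks roughly like this (a sketch based on BigARTM's Python interface; class and argument names may differ between versions, so check the docs at http://bigartm.org):

```python
import artm

# Batches prepared on disk beforehand (e.g. converted from bag-of-words files).
batch_vectorizer = artm.BatchVectorizer(data_path='wiki_batches',
                                        data_format='batches')

model = artm.ARTM(num_topics=100, dictionary=batch_vectorizer.dictionary)

# Additive regularizers from the built-in library: sparsing Phi (negative tau)
# and decorrelating topics.
model.regularizers.add(artm.SmoothSparsePhiRegularizer(name='sparse_phi', tau=-0.05))
model.regularizers.add(artm.DecorrelatorPhiRegularizer(name='decorrelator', tau=1e5))
model.scores.add(artm.PerplexityScore(name='perplexity'))

model.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=10)
print(model.score_tracker['perplexity'].last_value)
```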

Page 23

The BigARTM project: parallel architecture

Concurrent processing of batches D = D₁ ⊔ · · · ⊔ D_B

Simple single-threaded code for ProcessBatch
User controls when to update the model in the online algorithm
Deterministic (reproducible) results from run to run
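
A sketch of the same idea in Python (not BigARTM's actual C++ implementation; process_batch is the single-threaded routine sketched two slides below): workers process batches concurrently, and summing their count matrices in a fixed batch order keeps results reproducible.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def process_batches_parallel(batches, Phi, n_workers=8):
    """Run the single-threaded process_batch on each batch concurrently and
    aggregate the per-batch count matrices in deterministic batch order."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(lambda b: process_batch(b, Phi), batches)
        return sum(results, start=np.zeros_like(Phi))
```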

Page 24

Online EM-algorithm for Multi-ARTM

Input: collection D split into batches D_b, b = 1, …, B;
Output: matrix Φ;

1  initialize φ_wt for all w ∈ W, t ∈ T;
2  n_wt := 0, ñ_wt := 0 for all w ∈ W, t ∈ T;
3  for all batches D_b, b = 1, …, B:
4      iterate each document d ∈ D_b at a constant matrix Φ:
       (ñ_wt) := (ñ_wt) + ProcessBatch(D_b, Φ);
5      if (synchronize) then:
6          n_wt := n_wt + ñ_wt for all w ∈ W, t ∈ T;
7          φ_wt := norm_{w∈W^m}( n_wt + φ_wt ∂R/∂φ_wt ) for all w ∈ W^m, m ∈ M, t ∈ T;
8          ñ_wt := 0 for all w ∈ W, t ∈ T;
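
As a plain-Python rendering of this loop (a sketch: process_batch is defined after the next slide, grad_R_phi is any regularizer gradient, and synchronization is simplified to "every sync_every batches"):

```python
import numpy as np

def fit_online(batches, Phi, grad_R_phi, sync_every=4):
    """Online EM driver: accumulate fresh counts in n_tilde, fold them into
    the running n_wt, and renormalize Phi at each synchronization point."""
    n_wt = np.zeros_like(Phi)
    n_tilde = np.zeros_like(Phi)
    for b, batch in enumerate(batches, start=1):
        n_tilde += process_batch(batch, Phi)          # step 4
        if b % sync_every == 0:                       # step 5, "synchronize"
            n_wt += n_tilde                           # step 6
            x = np.maximum(n_wt + Phi * grad_R_phi(Phi), 0.0)
            s = x.sum(axis=0, keepdims=True)
            Phi = np.divide(x, s, out=np.zeros_like(x), where=s > 0)  # step 7
            n_tilde[:] = 0.0                          # step 8
    return Phi
```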

Page 25

Online EM-algorithm for Multi-ARTM: ProcessBatch

ProcessBatch iterates documents d ∈ D_b at a constant matrix Φ.

matrix (ñ_wt) := ProcessBatch(set of documents D_b, matrix Φ)

1  ñ_wt := 0 for all w ∈ W, t ∈ T;
2  for all d ∈ D_b:
3      initialize θ_td := 1/|T| for all t ∈ T;
4      repeat:
5          p_tdw := norm_{t∈T}( φ_wt θ_td ) for all w ∈ d, t ∈ T;
6          n_td := ∑_{w∈d} λ_{m(w)} n_dw p_tdw for all t ∈ T;
7          θ_td := norm_{t∈T}( n_td + θ_td ∂R/∂θ_td ) for all t ∈ T;
8      until θ_d converges;
9      ñ_wt := ñ_wt + λ_{m(w)} n_dw p_tdw for all w ∈ d, t ∈ T;
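
The same routine as a plain numpy sketch (unimodal case, λ ≡ 1, and the Θ regularizer omitted; the names are mine):

```python
import numpy as np

def process_batch(batch, Phi, n_iters=10):
    """batch: iterable of documents, each a length-W vector of counts n_dw.
    Returns the W x T matrix of accumulated counts n~_wt at a fixed Phi."""
    n_tilde = np.zeros(Phi.shape)
    T = Phi.shape[1]
    for n_d in batch:
        theta = np.full(T, 1.0 / T)             # step 3: theta_td := 1/|T|
        for _ in range(n_iters):                # steps 4-8: repeat until converged
            P = Phi @ theta                     # p(w|d) for this document
            R = n_d / np.maximum(P, 1e-12)      # n_dw / p(w|d)
            n_td = theta * (Phi.T @ R)          # step 6: n_td = sum_w n_dw p_tdw
            theta = n_td / n_td.sum()           # step 7 without a regularizer
        n_tilde += Phi * np.outer(R, theta)     # step 9: n~_wt += n_dw p_tdw
    return n_tilde
```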

Page 26

BigARTM vs Gensim vs Vowpal Wabbit

3.7M articles from Wikipedia, 100K unique words

                      procs   train     inference   perplexity
BigARTM               1       35 min    72 sec      4000
Gensim.LdaModel       1       369 min   395 sec     4161
VowpalWabbit.LDA      1       73 min    120 sec     4108
BigARTM               4       9 min     20 sec      4061
Gensim.LdaMulticore   4       60 min    222 sec     4111
BigARTM               8       4.5 min   14 sec      4304
Gensim.LdaMulticore   8       57 min    224 sec     4455

procs = number of parallel threads
inference = time to infer θ_d for 100K held-out documents
perplexity is calculated on held-out documents

Page 27

Running BigARTM in parallel

3.7M articles from Wikipedia, 100K unique words

Amazon EC2 c3.8xlarge (16 physical cores + hyperthreading)
No extra memory cost for adding more threads

Page 28

Summary / Questions?

1  Additive Regularization
   PLSA algorithm with regularization on the M-step
   a simple way to incorporate assumptions into the topic model

2  Multimodal Topic Models
   incorporate metadata into the topic model
   parallel text corpora

3  BigARTM (http://bigartm.org): open-source implementation of Multi-ARTM
   parallel online algorithm
   ready for large collections
   supports multimodal collections
   a collection of implemented regularizers
