
Page 1

Non-Bayesian Additive Regularization for Multimodal Topic Modeling of Large Collections

Konstantin Vorontsov¹,³, Oleksandr Frei⁴, Murat Apishev², Peter Romov³, Marina Suvorova², Anastasia Yanina¹

¹Moscow Institute of Physics and Technology, ²Moscow State University, ³Yandex, ⁴Schlumberger

Topic Models: Post-Processing and Applications, CIKM'15 Workshop
October 19, 2015, Melbourne, Australia

Page 2

Probabilistic Topic Model (PTM) generating a text collection

Topic model explains terms w in documents d by topics t:

p(w|d) = ∑_t p(w|t) p(t|d)

[Example document shown on the slide, translated from Russian:] A spectral-analytical approach to detecting fuzzy extended repeats in genomic sequences has been developed. The method is based on multiscale estimation of the similarity of nucleotide sequences in the space of expansion coefficients of fragments of GC- and GA-content curves over classical orthogonal bases. Conditions of optimal approximation are found that enable automatic recognition of repeats of various kinds (direct, inverted, and tandem) on the spectral similarity matrix. The method works equally well at different data scales. It can reveal traces of segmental duplications and megasatellite regions in the genome, and regions of synteny when comparing a pair of genomes. It can be used for a detailed study of chromosome fragments (searching for fuzzy regions with a repeating pattern of moderate length).

[Slide figure: the document's topic distribution p(t|d) and the top words of three topic distributions p(w|t), translated from Russian:]

0.018 recognition, 0.013 similarity, 0.011 pattern, …
0.023 DNA, 0.016 genome, 0.009 nucleotide, …
0.014 basis, 0.009 spectrum, 0.006 orthogonal, …

Page 3

Inverse problem: text collection → PTM

Given: D is a set (collection) of documents;
W is a set (vocabulary) of terms;
n_dw = how many times term w appears in document d.

Find: parameters φ_wt = p(w|t), θ_td = p(t|d) of the topic model

p(w|d) = ∑_t φ_wt θ_td

under nonnegativity and normalization constraints:

φ_wt ≥ 0, ∑_{w∈W} φ_wt = 1;   θ_td ≥ 0, ∑_{t∈T} θ_td = 1.

The ill-posed problem of matrix factorization:

ΦΘ = (ΦS)(S⁻¹Θ) = Φ′Θ′

for all S such that Φ′,Θ′ are stochastic.
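
To see this non-uniqueness concretely, here is a minimal numpy sketch (the toy matrices and the mixing matrix S are made up for illustration):

```python
import numpy as np

# A toy stochastic factorization: columns of Phi and Theta sum to 1.
Phi = np.array([[0.7, 0.1],
                [0.2, 0.3],
                [0.1, 0.6]])          # W x T, columns p(w|t)
Theta = np.array([[0.8, 0.4],
                  [0.2, 0.6]])        # T x D, columns p(t|d)

# A mixing matrix S with unit column sums keeps Phi' = Phi S stochastic,
# as long as no entry of Phi' or Theta' = S^{-1} Theta goes negative.
S = np.array([[0.9, 0.2],
              [0.1, 0.8]])

Phi2 = Phi @ S
Theta2 = np.linalg.inv(S) @ Theta

# The same p(w|d) matrix arises from two different "topic models":
assert np.allclose(Phi @ Theta, Phi2 @ Theta2)
print(Phi2.round(3), Theta2.round(3), sep="\n")
```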

Page 4

PLSA — Probabilistic Latent Semantic Analysis [Hofmann, 1999]

Constrained maximization of the log-likelihood:

L(Φ,Θ) = ∑_{d,w} n_dw ln ∑_t φ_wt θ_td → max_{Φ,Θ}

EM-algorithm is a simple iteration method for the nonlinear system

E-step:   p_tdw = norm_{t∈T}( φ_wt θ_td )

M-step:   φ_wt = norm_{w∈W}( ∑_{d∈D} n_dw p_tdw ),   θ_td = norm_{t∈T}( ∑_{w∈d} n_dw p_tdw )

where norm_{t∈T} x_t = max{x_t, 0} / ∑_{s∈T} max{x_s, 0} is vector normalization.
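
For concreteness, this rational EM iteration fits into a few lines of numpy (a toy sketch with random data; all names are mine, not BigARTM's):

```python
import numpy as np

rng = np.random.default_rng(0)
W, T, D = 50, 5, 20
N = rng.poisson(1.0, size=(W, D))            # n_dw as a W x D count matrix

Phi = rng.dirichlet(np.ones(W), size=T).T    # W x T, columns p(w|t)
Theta = rng.dirichlet(np.ones(T), size=D).T  # T x D, columns p(t|d)

for _ in range(50):
    # E-step folded into the M-step: n_wt = sum_d n_dw * phi_wt*theta_td / p(w|d)
    P = Phi @ Theta                          # p(w|d), W x D
    R = N / np.maximum(P, 1e-12)             # n_dw / p(w|d)
    n_wt = Phi * (R @ Theta.T)               # W x T unnormalized counts
    n_td = Theta * (Phi.T @ R)               # T x D unnormalized counts
    Phi = n_wt / n_wt.sum(axis=0, keepdims=True)    # norm over w
    Theta = n_td / n_td.sum(axis=0, keepdims=True)  # norm over t

# Log-likelihood after fitting (non-decreasing across the iterations above).
print(np.sum(N * np.log(np.maximum(Phi @ Theta, 1e-12))))
```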

Page 5

Graphical Models and Bayesian Inference

In the Bayesian approach, graphical models are used to build sophisticated generative models.

David M. Blei. Probabilistic topic models // Communications of the ACM, 2012. Vol. 55, No. 4. Pp. 77–84.

Page 6

Graphical Models and Bayesian Inference

In the Bayesian approach, a lot of calculus has to be done for each model to go from the problem statement to the solution algorithm:

Yi Wang. Distributed Gibbs Sampling of Latent Dirichlet Allocation: The Gritty Details. 2008.

Page 7

ARTM — Additive Regularization of Topic Model

Maximum log-likelihood with an additive regularization criterion R:

∑_{d,w} n_dw ln ∑_t φ_wt θ_td + R(Φ,Θ) → max_{Φ,Θ}

EM-algorithm is a simple iteration method for the system

E-step:   p_tdw = norm_{t∈T}( φ_wt θ_td )

M-step:   φ_wt = norm_{w∈W}( ∑_{d∈D} n_dw p_tdw + φ_wt ∂R/∂φ_wt ),   θ_td = norm_{t∈T}( ∑_{w∈d} n_dw p_tdw + θ_td ∂R/∂θ_td )
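
Compared to PLSA, the only code change is the additive term before normalization. A sketch of the regularized Φ update, with the max{·, 0} cut-off that the norm operator implies (grad_R_phi is any regularizer gradient ∂R/∂Φ supplied by the caller; the names are mine):

```python
import numpy as np

def m_step_phi(n_wt, Phi, grad_R_phi):
    """Regularized M-step: phi_wt = norm_w(n_wt + phi_wt * dR/dphi_wt)."""
    x = np.maximum(n_wt + Phi * grad_R_phi(Phi), 0.0)   # trim negatives
    s = x.sum(axis=0, keepdims=True)
    return np.divide(x, s, out=np.zeros_like(x), where=s > 0)
```

With grad_R_phi returning zeros this is exactly the PLSA update; the following slides give concrete gradients for smoothing, sparsing, and decorrelation.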

Page 8

Example: Latent Dirichlet Allocation [Blei, Ng, Jordan, 2003]

Maximum a posteriori (MAP) estimation with Dirichlet priors:

∑_{d,w} n_dw ln ∑_t φ_wt θ_td + ∑_{t,w} β_w ln φ_wt + ∑_{d,t} α_t ln θ_td → max_{Φ,Θ},

where the first term is the log-likelihood L(Φ,Θ) and the last two terms form the regularization criterion R(Φ,Θ).

EM-algorithm is a simple iteration method for the system

E-step:   p_tdw = norm_{t∈T}( φ_wt θ_td )

M-step:   φ_wt = norm_{w∈W}( ∑_{d∈D} n_dw p_tdw + β_w ),   θ_td = norm_{t∈T}( ∑_{w∈d} n_dw p_tdw + α_t )

Page 9

Many Bayesian PTMs can be reinterpreted as regularizers in ARTM

smoothing (LDA) for background and stop-word topics
sparsing (anti-LDA) for domain-specific topics
topic decorrelation
topic coherence maximization
supervised learning for classification and regression
semi-supervised learning
using document citations and links
determining the number of topics via entropy sparsing
modeling topical hierarchies
modeling temporal topic dynamics
using vocabularies in multilingual topic models
etc.

Vorontsov K. V., Potapenko A. A. Additive Regularization of Topic Models //Machine Learning. Volume 101, Issue 1 (2015), Pp. 303-323.

Page 10

ARTM — Additive Regularization of Topic Model

Maximum log-likelihood with additive combination of regularizers:

∑_{d,w} n_dw ln ∑_t φ_wt θ_td + ∑_{i=1}^{n} τ_i R_i(Φ,Θ) → max_{Φ,Θ},

where τ_i are regularization coefficients.

EM-algorithm is a simple iteration method for the system

E-step:   p_tdw = norm_{t∈T}( φ_wt θ_td )

M-step:   φ_wt = norm_{w∈W}( ∑_{d∈D} n_dw p_tdw + φ_wt ∑_{i=1}^{n} τ_i ∂R_i/∂φ_wt ),   θ_td = norm_{t∈T}( ∑_{w∈d} n_dw p_tdw + θ_td ∑_{i=1}^{n} τ_i ∂R_i/∂θ_td )

Page 11

Assumptions: which topics are well-interpretable?

Topics S ⊂ T contain domain-specific terms:
p(w|t), t ∈ S, are sparse and different (weakly correlated).

Topics B ⊂ T contain background terms:
p(w|t), t ∈ B, are dense and contain common lexis words.

[Diagram: matrices Φ_{W×T} and Θ_{T×D} split into domain-specific and background topics]

Page 12

Smoothing regularization (rethinking LDA)

The non-sparsity assumption for background topics t ∈ B:
φ_wt are similar to a given distribution β_w;
θ_td are similar to a given distribution α_t.

∑_{t∈B} KL_w(β_w ‖ φ_wt) → min_Φ;   ∑_{d∈D} KL_t(α_t ‖ θ_td) → min_Θ.

We minimize the sum of these KL-divergences to get a regularizer:

R(Φ,Θ) = β₀ ∑_{t∈B} ∑_{w∈W} β_w ln φ_wt + α₀ ∑_{d∈D} ∑_{t∈B} α_t ln θ_td → max.

The regularized M-step applied for all t ∈ B coincides with LDA:

φ_wt ∝ n_wt + β₀ β_w,   θ_td ∝ n_td + α₀ α_t,

which is a new non-Bayesian interpretation of LDA [Blei et al., 2003].

Page 13

Sparsing regularizer (further rethinking LDA)

The sparsity assumption for domain-specific topics t ∈ S:
the distributions φ_wt, θ_td contain many zero probabilities.

We maximize the sum of KL-divergences KL(β‖φ_t) and KL(α‖θ_d):

R(Φ,Θ) = −β₀ ∑_{t∈S} ∑_{w∈W} β_w ln φ_wt − α₀ ∑_{d∈D} ∑_{t∈S} α_t ln θ_td → max.

The regularized M-step gives “anti-LDA”, for all t ∈ S:

φ_wt ∝ (n_wt − β₀ β_w)₊,   θ_td ∝ (n_td − α₀ α_t)₊.

Varadarajan J., Emonet R., Odobez J.-M. A sparsity constraint for topic models — application to temporal activity mining // NIPS-2010 Workshop on Practical Applications of Sparse Modeling: Open Issues and New Directions.
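
Both smoothing and sparsing fit the generic regularized M-step sketched earlier; a sketch of the two gradients (beta is the W-vector β_w, beta0 ≥ 0 the coefficient; the names are mine):

```python
import numpy as np

def grad_smoothing(beta, beta0):
    """R = beta0 * sum_t sum_w beta_w ln(phi_wt), so dR/dphi_wt = beta0*beta_w/phi_wt.
    In the M-step, phi_wt * grad = beta0*beta_w, giving phi_wt ∝ n_wt + beta0*beta_w
    (LDA-style smoothing)."""
    return lambda Phi: beta0 * beta[:, None] / np.maximum(Phi, 1e-12)

def grad_sparsing(beta, beta0):
    """The same regularizer with a minus sign: phi_wt ∝ (n_wt - beta0*beta_w)_+
    (anti-LDA), with negatives trimmed to zero by the norm operator."""
    return lambda Phi: -beta0 * beta[:, None] / np.maximum(Phi, 1e-12)
```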

Page 14

Regularization for topic decorrelation

The dissimilarity assumption for domain-specific topics t ∈ S:
if topics are interpretable, then they must differ significantly.

We minimize covariances between the column vectors φ_t:

R(Φ) = −(τ/2) ∑_{t∈S} ∑_{s∈S∖t} ∑_{w∈W} φ_wt φ_ws → max.

The regularized M-step makes the columns of Φ more distant:

φ_wt ∝ ( n_wt − τ φ_wt ∑_{s∈S∖t} φ_ws )₊.

Tan Y., Ou Z. Topic-weak-correlated latent Dirichlet allocation // 7th Int’l Symp. Chinese Spoken Language Processing (ISCSLP), 2010. — Pp. 224–228.
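
The decorrelator plugs into the same M-step; since ∂R/∂φ_wt = −τ ∑_{s≠t} φ_ws, the gradient needs only one row sum. A sketch, same conventions as above:

```python
import numpy as np

def grad_decorrelation(tau):
    """dR/dphi_wt = -tau * sum_{s != t} phi_ws; the M-step then subtracts
    tau * phi_wt * sum_{s != t} phi_ws, pushing the columns of Phi apart."""
    return lambda Phi: -tau * (Phi.sum(axis=1, keepdims=True) - Phi)
```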

Page 15

Example: Combination of sparsing, smoothing, and decorrelation

smoothing background topics B in Φ and Θ

sparsing domain-specific topics S = T\B in Φ and Θ

decorrelation of topics in Φ

R(Φ,Θ) =  β₁ ∑_{t∈B} ∑_{w∈W} β_w ln φ_wt + α₁ ∑_{d∈D} ∑_{t∈B} α_t ln θ_td
        − β₀ ∑_{t∈S} ∑_{w∈W} β_w ln φ_wt − α₀ ∑_{d∈D} ∑_{t∈S} α_t ln θ_td
        − γ ∑_{t∈T} ∑_{s∈T∖t} ∑_{w∈W} φ_wt φ_ws,

where β₀, α₀, β₁, α₁, γ are regularization coefficients.

Page 16

Multimodal Probabilistic Topic Modeling

Given a text document collection, a Probabilistic Topic Model finds:
p(t|d) — topic distribution for each document d,
p(w|t) — term distribution for each topic t.

[Diagram: text documents → topic modeling → topics of documents; words and keyphrases of topics]

Page 17

Multimodal Probabilistic Topic Modeling

A Multimodal Topic Model finds topical distributions for terms p(w|t), authors p(a|t), time p(y|t), …

[Diagram: as above, with document metadata added: authors, date/time, conference, organization, URL, etc.]

Page 18

Multimodal Probabilistic Topic Modeling

A Multimodal Topic Model finds topical distributions for terms p(w|t), authors p(a|t), time p(y|t), objects on images p(o|t), linked documents p(d′|t), advertising banners p(b|t), and users p(u|t).

[Diagram: as above, with further modalities attached: ads, images, links, users]

Page 19

Multimodal extension of ARTM

W^m is the vocabulary of tokens of the m-th modality, m ∈ M;
W = W¹ ⊔ · · · ⊔ W^M is the joint vocabulary of all modalities.

Maximum multimodal log-likelihood with regularization:

∑_{m∈M} λ_m ∑_{d∈D} ∑_{w∈W^m} n_dw ln ∑_t φ_wt θ_td + R(Φ,Θ) → max_{Φ,Θ}

EM-algorithm is a simple iteration method for the system

E-step:   p_tdw = norm_{t∈T}( φ_wt θ_td )

M-step:   φ_wt = norm_{w∈W^m}( ∑_{d∈D} λ_{m(w)} n_dw p_tdw + φ_wt ∂R/∂φ_wt ),   θ_td = norm_{t∈T}( ∑_{w∈d} λ_{m(w)} n_dw p_tdw + θ_td ∂R/∂θ_td )
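
In code, the multimodal extension changes two things: every count n_dw is weighted by its modality coefficient λ_{m(w)}, and Φ is normalized within each modality's vocabulary block. A minimal sketch (the data layout and names are mine):

```python
import numpy as np

def weight_counts(N, word_modality, lam):
    """Scale n_dw by lambda_{m(w)}. N is a W x D count matrix, word_modality
    a length-W array of modality ids, lam a dict {modality id: weight}."""
    weights = np.array([lam[m] for m in word_modality])
    return N * weights[:, None]

def norm_phi_per_modality(n_wt, word_modality):
    """Normalize the columns of Phi separately over each modality's vocabulary."""
    Phi = np.zeros(n_wt.shape)
    mods = np.asarray(word_modality)
    for m in np.unique(mods):
        block = np.maximum(n_wt[mods == m], 0.0)   # the rows of modality m
        s = block.sum(axis=0, keepdims=True)
        Phi[mods == m] = np.divide(block, s, out=np.zeros_like(block), where=s > 0)
    return Phi
```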

Page 20

Example: Multi-lingual topic model of Wikipedia

Top 10 words with p(w|t) probabilities (in %) from a two-language topic model, based on Russian and English Wikipedia articles with mutual interlanguage links.

Topic 68                                Topic 79
research     4.56   институт     6.03   goals    4.48   матч     6.02
technology   3.14   университет  3.35   league   3.99   игрок    5.56
engineering  2.63   программа    3.17   club     3.76   сборная  4.51
institute    2.37   учебный      2.75   season   3.49   фк       3.25
science      1.97   технический  2.70   scored   2.72   против   3.20
program      1.60   технология   2.30   cup      2.57   клуб     3.14

Topic 88                                Topic 251
opera        7.36   опера        7.82   windows    8.00   windows     6.05
conductor    1.69   оперный      3.13   microsoft  4.03   microsoft   3.76
orchestra    1.14   дирижер      2.82   server     2.93   версия      1.86
wagner       0.97   певец        1.65   software   1.38   приложение  1.86
soprano      0.78   певица       1.51   user       1.03   сервер      1.63
performance  0.78   театр        1.14   security   0.92   server      1.54

Page 21

Example: Recommending articles from blog

The quality of recommendations for a baseline matrix-factorization model, a unimodal model with only the modality of user likes, and two multimodal models incorporating words and user-specified data (tags and categories).

Model                     Recall@5   Recall@10   Recall@20
collaborative filtering   0.591      0.652       0.678
likes                     0.62       0.59        0.65
likes + words             0.79       0.64        0.68
all modalities            0.80       0.71        0.69

Page 22

BigARTM project

BigARTM features:
Parallel + online + multimodal + regularized topic modeling
Out-of-core one-pass processing of big data
Built-in library of regularizers and quality measures

BigARTM community:
Code on GitHub: https://github.com/bigartm
Links to docs, discussion group, builds: http://bigartm.org

BigARTM license and programming environment:
Freely available for commercial usage (BSD 3-Clause license)
Cross-platform: Windows, Linux, Mac OS X (32-bit and 64-bit)
Programming APIs: command line, C++, and Python
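
For orientation, fitting a regularized model through the Python API looks roughly like this (a sketch based on BigARTM's Python interface; class and argument names may differ between versions, so check the docs at http://bigartm.org):

```python
import artm

# Batches prepared on disk beforehand (e.g. converted from bag-of-words files).
batch_vectorizer = artm.BatchVectorizer(data_path='wiki_batches',
                                        data_format='batches')

model = artm.ARTM(num_topics=100, dictionary=batch_vectorizer.dictionary)

# Additive regularizers from the built-in library: sparsing Phi (negative tau)
# and decorrelating topics.
model.regularizers.add(artm.SmoothSparsePhiRegularizer(name='sparse_phi', tau=-0.05))
model.regularizers.add(artm.DecorrelatorPhiRegularizer(name='decorrelator', tau=1e5))
model.scores.add(artm.PerplexityScore(name='perplexity'))

model.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=10)
print(model.score_tracker['perplexity'].last_value)
```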

Page 23

The BigARTM project: parallel architecture

Concurrent processing of batches D = D₁ ⊔ · · · ⊔ D_B

Simple single-threaded code for ProcessBatch
User controls when to update the model in the online algorithm
Deterministic (reproducible) results from run to run
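
A sketch of the same idea in Python (not BigARTM's actual C++ implementation; process_batch is the single-threaded routine sketched two slides below): workers process batches concurrently, and summing their count matrices in a fixed batch order keeps results reproducible.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def process_batches_parallel(batches, Phi, n_workers=8):
    """Run the single-threaded process_batch on each batch concurrently and
    aggregate the per-batch count matrices in deterministic batch order."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(lambda b: process_batch(b, Phi), batches)
        return sum(results, start=np.zeros_like(Phi))
```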

Page 24

Online EM-algorithm for Multi-ARTM

Input: collection D split into batches D_b, b = 1, …, B;
Output: matrix Φ;

1  initialize φ_wt for all w ∈ W, t ∈ T;
2  n_wt := 0, ñ_wt := 0 for all w ∈ W, t ∈ T;
3  for all batches D_b, b = 1, …, B:
4      iterate each document d ∈ D_b at a constant matrix Φ:
       (ñ_wt) := (ñ_wt) + ProcessBatch(D_b, Φ);
5      if (synchronize) then:
6          n_wt := n_wt + ñ_wt for all w ∈ W, t ∈ T;
7          φ_wt := norm_{w∈W^m}( n_wt + φ_wt ∂R/∂φ_wt ) for all w ∈ W^m, m ∈ M, t ∈ T;
8          ñ_wt := 0 for all w ∈ W, t ∈ T;
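
As a plain-Python rendering of this loop (a sketch: process_batch is defined after the next slide, grad_R_phi is any regularizer gradient, and synchronization is simplified to "every sync_every batches"):

```python
import numpy as np

def fit_online(batches, Phi, grad_R_phi, sync_every=4):
    """Online EM driver: accumulate fresh counts in n_tilde, fold them into
    the running n_wt, and renormalize Phi at each synchronization point."""
    n_wt = np.zeros_like(Phi)
    n_tilde = np.zeros_like(Phi)
    for b, batch in enumerate(batches, start=1):
        n_tilde += process_batch(batch, Phi)          # step 4
        if b % sync_every == 0:                       # step 5, "synchronize"
            n_wt += n_tilde                           # step 6
            x = np.maximum(n_wt + Phi * grad_R_phi(Phi), 0.0)
            s = x.sum(axis=0, keepdims=True)
            Phi = np.divide(x, s, out=np.zeros_like(x), where=s > 0)  # step 7
            n_tilde[:] = 0.0                          # step 8
    return Phi
```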

Page 25

Online EM-algorithm for Multi-ARTM: ProcessBatch

ProcessBatch iterates documents d ∈ D_b at a constant matrix Φ.

matrix (ñ_wt) := ProcessBatch(set of documents D_b, matrix Φ)

1  ñ_wt := 0 for all w ∈ W, t ∈ T;
2  for all d ∈ D_b:
3      initialize θ_td := 1/|T| for all t ∈ T;
4      repeat:
5          p_tdw := norm_{t∈T}( φ_wt θ_td ) for all w ∈ d, t ∈ T;
6          n_td := ∑_{w∈d} λ_{m(w)} n_dw p_tdw for all t ∈ T;
7          θ_td := norm_{t∈T}( n_td + θ_td ∂R/∂θ_td ) for all t ∈ T;
8      until θ_d converges;
9      ñ_wt := ñ_wt + λ_{m(w)} n_dw p_tdw for all w ∈ d, t ∈ T;
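
The same routine as a plain numpy sketch (unimodal case, λ ≡ 1, and the Θ regularizer omitted; the names are mine):

```python
import numpy as np

def process_batch(batch, Phi, n_iters=10):
    """batch: iterable of documents, each a length-W vector of counts n_dw.
    Returns the W x T matrix of accumulated counts n~_wt at a fixed Phi."""
    n_tilde = np.zeros(Phi.shape)
    T = Phi.shape[1]
    for n_d in batch:
        theta = np.full(T, 1.0 / T)             # step 3: theta_td := 1/|T|
        for _ in range(n_iters):                # steps 4-8: repeat until converged
            P = Phi @ theta                     # p(w|d) for this document
            R = n_d / np.maximum(P, 1e-12)      # n_dw / p(w|d)
            n_td = theta * (Phi.T @ R)          # step 6: n_td = sum_w n_dw p_tdw
            theta = n_td / n_td.sum()           # step 7 without a regularizer
        n_tilde += Phi * np.outer(R, theta)     # step 9: n~_wt += n_dw p_tdw
    return n_tilde
```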

Page 26

BigARTM vs Gensim vs Vowpal Wabbit

3.7M articles from Wikipedia, 100K unique words

                      procs   train     inference   perplexity
BigARTM               1       35 min    72 sec      4000
Gensim.LdaModel       1       369 min   395 sec     4161
VowpalWabbit.LDA      1       73 min    120 sec     4108
BigARTM               4       9 min     20 sec      4061
Gensim.LdaMulticore   4       60 min    222 sec     4111
BigARTM               8       4.5 min   14 sec      4304
Gensim.LdaMulticore   8       57 min    224 sec     4455

procs = number of parallel threads
inference = time to infer θ_d for 100K held-out documents
perplexity is calculated on held-out documents

Page 27

Running BigARTM in parallel

3.7M articles from Wikipedia, 100K unique words

Amazon EC2 c3.8xlarge (16 physical cores + hyperthreading)
No extra memory cost for adding more threads

Page 28

Summary / Questions?

1  Additive Regularization
   PLSA algorithm with regularization on the M-step
   a simple way to incorporate assumptions into the topic model

2  Multimodal Topic Models
   incorporate metadata into the topic model
   parallel text corpora

3  BigARTM (http://bigartm.org): open-source implementation of Multi-ARTM
   parallel online algorithm
   ready for large collections
   supports multimodal collections
   a collection of implemented regularizers
