Non-Bayesian Additive Regularization for Multimodal Topic Modeling of Large Collections

Konstantin Vorontsov¹,³ • Oleksandr Frei⁴ • Murat Apishev² • Peter Romov³ • Marina Suvorova² • Anastasia Yanina¹

¹Moscow Institute of Physics and Technology • ²Moscow State University • ³Yandex • ⁴Schlumberger
Topic Models: Post-Processing and Applications, CIKM'15 Workshop
October 19, 2015 • Melbourne, Australia
Probabilistic Topic Model (PTM) generating a text collection
Topic model explains terms w in documents d by topics t:
p(w|d) = ∑_t p(w|t) p(t|d)
(Sample document, translated from Russian:) A spectral-analytical approach has been developed for detecting fuzzy extended repeats in genomic sequences. The method is based on multi-scale similarity estimation of nucleotide sequences in the space of expansion coefficients of fragments of GC- and GA-content curves over classical orthogonal bases. Conditions of optimal approximation have been found that provide automatic recognition of repeats of various kinds (direct, inverted, and tandem) on the spectral similarity matrix. The method works equally well at different data scales. It can detect traces of segmental duplications and megasatellite regions in the genome, and synteny regions when comparing a pair of genomes. It can be used for a detailed study of chromosome fragments (searching for fuzzy regions with a moderate length of the repeated pattern).
[Figure: the sample document with its terms attributed to topics via p(t|d) and p(w|t).]

Top words of three topics p(w|t) (translated from Russian):
- 0.018 recognition, 0.013 similarity, 0.011 pattern, …
- 0.023 DNA, 0.016 genome, 0.009 nucleotide, …
- 0.014 basis, 0.009 spectrum, 0.006 orthogonal, …
Peter Romov ([email protected]) ARTM of Large Multimodal Collections 2 / 28
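The mixture formula p(w|d) = ∑_t p(w|t) p(t|d) is easy to check numerically. A toy NumPy example with made-up probabilities (not data from the talk):

```python
import numpy as np

# 4 terms, 2 topics, 1 document; all numbers are illustrative.
phi = np.array([[0.5, 0.0],    # p(w|t): rows are terms, columns are topics
                [0.3, 0.1],
                [0.2, 0.3],
                [0.0, 0.6]])
theta = np.array([0.7, 0.3])   # p(t|d) for a single document d

p_w_given_d = phi @ theta      # mixture of topic distributions
# p_w_given_d == [0.35, 0.24, 0.23, 0.18], a valid distribution over terms
```

Each column of phi and the vector theta sum to one, so the mixture is again a probability distribution over the vocabulary.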
Inverse problem: text collection → PTM
Given: D is a set (collection) of documents; W is a set (vocabulary) of terms; n_dw = how many times term w appears in document d.
Find: parameters φ_wt = p(w|t), θ_td = p(t|d) of the topic model

p(w|d) = ∑_t φ_wt θ_td

under nonnegativity and normalization constraints

φ_wt ≥ 0, ∑_{w∈W} φ_wt = 1;   θ_td ≥ 0, ∑_{t∈T} θ_td = 1.
The ill-posed problem of matrix factorization:

ΦΘ = (ΦS)(S⁻¹Θ) = Φ′Θ′

for all S such that Φ′, Θ′ are stochastic.
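The non-uniqueness can be demonstrated directly: any column-stochastic S whose inverse keeps Θ nonnegative yields a second stochastic factorization with the same product. A small NumPy sketch (matrices chosen for illustration):

```python
import numpy as np

phi = np.array([[0.6, 0.1],
                [0.3, 0.2],
                [0.1, 0.7]])     # stochastic columns: p(w|t)
theta = np.array([[0.5, 0.6],
                  [0.5, 0.4]])   # stochastic columns: p(t|d)
S = np.array([[0.9, 0.1],
              [0.1, 0.9]])       # column-stochastic mixing matrix

phi2 = phi @ S                   # still nonnegative with unit column sums
theta2 = np.linalg.inv(S) @ theta  # nonnegative here because S is near identity

assert np.allclose(phi @ theta, phi2 @ theta2)   # the very same model p(w|d)
assert np.allclose(phi2.sum(axis=0), 1.0)
assert np.allclose(theta2.sum(axis=0), 1.0)
assert (theta2 >= 0).all()
```

Both (Φ, Θ) and (Φ′, Θ′) satisfy the constraints and explain the data equally well, which is exactly why extra assumptions (regularizers) are needed.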
PLSA — Probabilistic Latent Semantic Analysis [Hofmann, 1999]
Constrained maximization of the log-likelihood:
L(Φ,Θ) = ∑_{d,w} n_dw ln ∑_t φ_wt θ_td → max_{Φ,Θ}

EM-algorithm is a simple iteration method for the nonlinear system

E-step:  p_tdw = norm_{t∈T}(φ_wt θ_td)
M-step:  φ_wt = norm_{w∈W}(∑_{d∈D} n_dw p_tdw),   θ_td = norm_{t∈T}(∑_{w∈d} n_dw p_tdw)

where norm_{t∈T} x_t = max{x_t, 0} / ∑_{s∈T} max{x_s, 0} is vector normalization.
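The E- and M-steps above fit in a few lines of NumPy. A minimal PLSA sketch (unoptimized, variable names are ours, not from the talk):

```python
import numpy as np

def norm(x, axis):
    # norm operator: clip at zero, then normalize along the given axis
    x = np.maximum(x, 0)
    s = x.sum(axis=axis, keepdims=True)
    return np.where(s > 0, x / np.where(s == 0, 1, s), 0)

def plsa_em(n_dw, T, n_iter=50, seed=0):
    """PLSA via EM. n_dw is a |D| x |W| count matrix; T is the number of topics."""
    rng = np.random.default_rng(seed)
    D, W = n_dw.shape
    phi = norm(rng.random((W, T)), axis=0)     # p(w|t), stochastic columns
    theta = norm(rng.random((T, D)), axis=0)   # p(t|d), stochastic columns
    for _ in range(n_iter):
        n_wt = np.zeros((W, T))
        n_td = np.zeros((T, D))
        for d in range(D):
            # E-step: p(t|d,w) for every term of document d (W x T)
            p_tdw = norm(phi * theta[:, d], axis=1)
            n_wt += n_dw[d][:, None] * p_tdw
            n_td[:, d] = (n_dw[d][:, None] * p_tdw).sum(axis=0)
        phi = norm(n_wt, axis=0)               # M-step
        theta = norm(n_td, axis=0)
    return phi, theta
```

On a real collection the counters would be accumulated batch by batch instead of holding dense matrices, which is what BigARTM's architecture (later slides) is about.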
Graphical Models and Bayesian Inference
In the Bayesian approach, graphical models are used to build sophisticated generative models.

David M. Blei. Probabilistic topic models // Communications of the ACM, 2012. Vol. 55, No. 4, Pp. 77–84.
Graphical Models and Bayesian Inference
In the Bayesian approach, a lot of calculus has to be done for each model to go from the problem statement to the solution algorithm:

Yi Wang. Distributed Gibbs Sampling of Latent Dirichlet Allocation: The Gritty Details. 2008.
ARTM — Additive Regularization of Topic Model
Maximum log-likelihood with additive regularization criterion R:

∑_{d,w} n_dw ln ∑_t φ_wt θ_td + R(Φ,Θ) → max_{Φ,Θ}

EM-algorithm is a simple iteration method for the system

E-step:  p_tdw = norm_{t∈T}(φ_wt θ_td)
M-step:  φ_wt = norm_{w∈W}(∑_{d∈D} n_dw p_tdw + φ_wt ∂R/∂φ_wt),   θ_td = norm_{t∈T}(∑_{w∈d} n_dw p_tdw + θ_td ∂R/∂θ_td)
Example: Latent Dirichlet Allocation [Blei, Ng, Jordan, 2003]
Maximum a posteriori (MAP) with Dirichlet prior:

∑_{d,w} n_dw ln ∑_t φ_wt θ_td + ∑_{t,w} β_w ln φ_wt + ∑_{d,t} α_t ln θ_td → max_{Φ,Θ}

where the first sum is the log-likelihood L(Φ,Θ) and the last two sums form the regularization criterion R(Φ,Θ).

EM-algorithm is a simple iteration method for the system

E-step:  p_tdw = norm_{t∈T}(φ_wt θ_td)
M-step:  φ_wt = norm_{w∈W}(∑_{d∈D} n_dw p_tdw + β_w),   θ_td = norm_{t∈T}(∑_{w∈d} n_dw p_tdw + α_t)
Many Bayesian PTMs can be reinterpreted as regularizers in ARTM
- smoothing (LDA) for background and stop-word topics
- sparsing (anti-LDA) for domain-specific topics
- topic decorrelation
- topic coherence maximization
- supervised learning for classification and regression
- semi-supervised learning
- using document citations and links
- determining the number of topics via entropy sparsing
- modeling topical hierarchies
- modeling temporal topic dynamics
- using vocabularies in multilingual topic models
- etc.

Vorontsov K. V., Potapenko A. A. Additive Regularization of Topic Models // Machine Learning. Volume 101, Issue 1 (2015), Pp. 303–323.
ARTM — Additive Regularization of Topic Model
Maximum log-likelihood with additive combination of regularizers:

∑_{d,w} n_dw ln ∑_t φ_wt θ_td + ∑_{i=1}^n τ_i R_i(Φ,Θ) → max_{Φ,Θ},

where τ_i are regularization coefficients.

EM-algorithm is a simple iteration method for the system

E-step:  p_tdw = norm_{t∈T}(φ_wt θ_td)
M-step:  φ_wt = norm_{w∈W}(∑_{d∈D} n_dw p_tdw + φ_wt ∑_{i=1}^n τ_i ∂R_i/∂φ_wt),   θ_td = norm_{t∈T}(∑_{w∈d} n_dw p_tdw + θ_td ∑_{i=1}^n τ_i ∂R_i/∂θ_td)
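The regularized M-step is just the PLSA M-step with additive corrections to the counters. A NumPy sketch for the Φ update, with one illustrative smoothing regularizer (the β values and the way regularizers are passed in are our assumptions, not BigARTM's API):

```python
import numpy as np

def norm(x, axis=0):
    # nonnegative normalization applied on every M-step
    x = np.maximum(x, 0)
    s = x.sum(axis=axis, keepdims=True)
    return np.where(s > 0, x / np.where(s == 0, 1, s), 0)

def regularized_m_step_phi(n_wt, terms):
    """M-step for Phi: counters n_wt plus the additive combination of
    regularizer terms; each entry of `terms` returns tau_i * phi * dR_i/dphi
    evaluated at the current phi."""
    phi = norm(n_wt, axis=0)
    return norm(n_wt + sum(t(phi) for t in terms), axis=0)

# Smoothing regularizer R = sum_w beta_w ln phi_wt, for which
# phi * dR/dphi = beta_w (tau folded into beta; values are made up):
beta = np.full(4, 0.1)
smoothing = lambda phi: beta[:, None] * np.ones_like(phi)

n_wt = np.array([[5., 0.], [3., 1.], [2., 3.], [0., 6.]])
phi_new = regularized_m_step_phi(n_wt, [smoothing])   # zeros are smoothed away
```

Adding another regularizer is literally appending one more function to the list, which is the point of the additive combination.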
Assumptions: what topics would be well-interpretable?
- Topics S ⊂ T contain domain-specific terms; p(w|t), t ∈ S, are sparse and different (weakly correlated)
- Topics B ⊂ T contain background terms; p(w|t), t ∈ B, are dense and contain common lexis words

[Figure: matrices Φ_{W×T} and Θ_{T×D}]
Smoothing regularization (rethinking LDA)
The non-sparsity assumption for background topics t ∈ B:
- φ_wt are similar to a given distribution β_w;
- θ_td are similar to a given distribution α_t.

∑_{t∈B} KL_w(β_w ‖ φ_wt) → min_Φ;   ∑_{d∈D} KL_t(α_t ‖ θ_td) → min_Θ.

We minimize the sum of these KL-divergences to get a regularizer:

R(Φ,Θ) = β₀ ∑_{t∈B} ∑_{w∈W} β_w ln φ_wt + α₀ ∑_{d∈D} ∑_{t∈B} α_t ln θ_td → max.

The regularized M-step applied for all t ∈ B coincides with LDA:

φ_wt ∝ n_wt + β₀ β_w,   θ_td ∝ n_td + α₀ α_t,

which is a new non-Bayesian interpretation of LDA [Blei et al., 2003].
Sparsing regularizer (further rethinking LDA)
The sparsity assumption for domain-specific topics t ∈ S: distributions φ_wt, θ_td contain many zero probabilities.

We maximize the sum of KL-divergences KL(β ‖ φ_t) and KL(α ‖ θ_d):

R(Φ,Θ) = −β₀ ∑_{t∈S} ∑_{w∈W} β_w ln φ_wt − α₀ ∑_{d∈D} ∑_{t∈S} α_t ln θ_td → max.

The regularized M-step gives "anti-LDA", for all t ∈ S:

φ_wt ∝ (n_wt − β₀ β_w)₊,   θ_td ∝ (n_td − α₀ α_t)₊.

Varadarajan J., Emonet R., Odobez J.-M. A sparsity constraint for topic models — application to temporal activity mining // NIPS-2010 Workshop on Practical Applications of Sparse Modeling: Open Issues and New Directions.
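The contrast between the smoothing and sparsing M-steps is visible already on toy counters (β values are illustrative, not from the talk):

```python
import numpy as np

n_wt = np.array([[5., 0.],
                 [3., 1.],
                 [2., 3.],
                 [0., 6.]])              # toy topic-term counters
beta0, beta_w = 1.0, np.full(4, 0.25)    # uniform beta, made-up scale

# LDA-style smoothing: add counts, no zeros survive
smoothed = n_wt + beta0 * beta_w[:, None]
phi_smooth = smoothed / smoothed.sum(axis=0)

# "anti-LDA" sparsing: subtract counts and truncate at zero
sparsed = np.maximum(n_wt - beta0 * beta_w[:, None], 0)
phi_sparse = sparsed / sparsed.sum(axis=0)

assert (phi_smooth > 0).all()            # smoothing removed all zeros
assert (phi_sparse == 0).any()           # sparsing kept (or added) zeros
```

Applying smoothing to background topics and sparsing to domain-specific topics in the same model is exactly the additive combination of the next slides.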
Regularization for topics decorrelation
The dissimilarity assumption for domain-specific topics t ∈ S: if topics are interpretable then they must differ significantly.

We minimize covariances between the column vectors φ_t:

R(Φ) = −(τ/2) ∑_{t∈S} ∑_{s∈S\t} ∑_{w∈W} φ_wt φ_ws → max.

The regularized M-step makes the columns of Φ more distant:

φ_wt ∝ (n_wt − τ φ_wt ∑_{s∈S\t} φ_ws)₊.

Tan Y., Ou Z. Topic-weak-correlated latent Dirichlet allocation // 7th Int'l Symp. Chinese Spoken Language Processing (ISCSLP), 2010. — Pp. 224–228.
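A toy NumPy check of the decorrelating M-step; counters and τ are made up, the point is only that overlapping terms shrink and the topic columns move apart:

```python
import numpy as np

def decorrelate_phi(n_wt, phi, tau):
    """One decorrelating M-step: subtract tau * phi_wt * sum_{s != t} phi_ws
    from the counters, truncate at zero, renormalize the columns."""
    other = phi.sum(axis=1, keepdims=True) - phi   # sum over s != t of phi_ws
    updated = np.maximum(n_wt - tau * phi * other, 0)
    return updated / updated.sum(axis=0)

phi = np.array([[0.5, 0.4],
                [0.3, 0.4],
                [0.2, 0.2]])      # two fairly similar topics (toy values)
n_wt = phi * 100                  # counters consistent with phi
phi_new = decorrelate_phi(n_wt, phi, tau=40.0)

# the inner product of the topic columns decreases after the update
assert phi_new[:, 0] @ phi_new[:, 1] < phi[:, 0] @ phi[:, 1]
```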
Example: Combination of sparsing, smoothing, and decorrelation
- smoothing background topics B in Φ and Θ
- sparsing domain-specific topics S = T\B in Φ and Θ
- decorrelation of topics in Φ

R(Φ,Θ) =   β₁ ∑_{t∈B} ∑_{w∈W} β_w ln φ_wt + α₁ ∑_{d∈D} ∑_{t∈B} α_t ln θ_td
         − β₀ ∑_{t∈S} ∑_{w∈W} β_w ln φ_wt − α₀ ∑_{d∈D} ∑_{t∈S} α_t ln θ_td
         − γ ∑_{t∈T} ∑_{s∈T\t} ∑_{w∈W} φ_wt φ_ws,

where β₀, α₀, β₁, α₁, γ are regularization coefficients.
Multimodal Probabilistic Topic Modeling
Given a text document collection, a Probabilistic Topic Model finds:
- p(t|d) — topic distribution for each document d,
- p(w|t) — term distribution for each topic t.

[Diagram: text documents (doc1, doc2, …) → Topic Modeling → topics of documents; words and keyphrases of topics]
Multimodal Probabilistic Topic Modeling
Multimodal Topic Model finds topical distributions for terms p(w|t), authors p(a|t), time p(y|t), objects on images p(o|t), linked documents p(d′|t), advertising banners p(b|t), users p(u|t).

[Diagram: text documents plus metadata (authors, dates, time, conference, organization, URLs, ads, images, links, users) → Topic Modeling → topics]
Multimodal extension of ARTM
W^m is a vocabulary of tokens of the m-th modality, m ∈ M;
W = W¹ ⊔ · · · ⊔ W^M is the joint vocabulary of all modalities.

Maximum multimodal log-likelihood with regularization:

∑_{m∈M} λ_m ∑_{d∈D} ∑_{w∈W^m} n_dw ln ∑_t φ_wt θ_td + R(Φ,Θ) → max_{Φ,Θ}

EM-algorithm is a simple iteration method for the system

E-step:  p_tdw = norm_{t∈T}(φ_wt θ_td)
M-step:  φ_wt = norm_{w∈W^m}(∑_{d∈D} λ_{m(w)} n_dw p_tdw + φ_wt ∂R/∂φ_wt),   θ_td = norm_{t∈T}(∑_{w∈d} λ_{m(w)} n_dw p_tdw + θ_td ∂R/∂θ_td)
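The modality weights λ_m matter most in the Θ update, where counters from all modalities are mixed. A toy illustration (the counters and the "words"/"likes" modality names are made up for this example):

```python
import numpy as np

# Document counters n_td accumulated separately per modality (toy values)
n_td_words = np.array([6., 2.])      # contribution of the "words" modality
n_td_likes = np.array([1., 5.])      # contribution of a "likes" modality

def theta_for(lam_words, lam_likes):
    # multimodal M-step for one document: lambda-weighted sum, normalized
    n = lam_words * n_td_words + lam_likes * n_td_likes
    return n / n.sum()

theta_balanced = theta_for(1.0, 1.0)   # [0.5, 0.5] -- modalities cancel out
theta_likes = theta_for(1.0, 5.0)      # the "likes" modality now dominates

assert theta_likes[1] > theta_balanced[1]
```

Each modality's block of Φ is normalized over its own vocabulary W^m, so the λ_m act through the shared Θ (and through p_tdw), not through the per-modality normalization.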
Example: Multi-lingual topic model of Wikipedia
Top 10 words with p(w|t) probabilities (in %) from a two-language topic model, based on Russian and English Wikipedia articles with mutual interlanguage links (English glosses of the Russian words in parentheses):

Topic 68:
  en: research 4.56, technology 3.14, engineering 2.63, institute 2.37, science 1.97, program 1.60, …
  ru: институт (institute) 6.03, университет (university) 3.35, программа (program) 3.17, учебный (educational) 2.75, технический (technical) 2.70, технология (technology) 2.30, …

Topic 79:
  en: goals 4.48, league 3.99, club 3.76, season 3.49, scored 2.72, cup 2.57, …
  ru: матч (match) 6.02, игрок (player) 5.56, сборная (national team) 4.51, фк (FC) 3.25, против (against) 3.20, клуб (club) 3.14, …

Topic 88:
  en: opera 7.36, conductor 1.69, orchestra 1.14, wagner 0.97, soprano 0.78, performance 0.78, …
  ru: опера (opera) 7.82, оперный (operatic) 3.13, дирижер (conductor) 2.82, певец (singer) 1.65, певица (female singer) 1.51, театр (theater) 1.14, …

Topic 251:
  en: windows 8.00, microsoft 4.03, server 2.93, software 1.38, user 1.03, security 0.92, …
  ru: windows 6.05, microsoft 3.76, версия (version) 1.86, приложение (application) 1.86, сервер (server) 1.63, server 1.54, …
Example: Recommending articles from blog
The quality of recommendations for a baseline matrix-factorization model, a unimodal model with only the modality of user likes, and two multimodal models incorporating words and user-specified data (tags and categories):

Model                    Recall@5  Recall@10  Recall@20
collaborative filtering  0.591     0.652      0.678
likes                    0.62      0.59       0.65
likes + words            0.79      0.64       0.68
all modalities           0.80      0.71       0.69
BigARTM project
BigARTM features:
- Parallel + online + multimodal + regularized topic modeling
- Out-of-core one-pass processing of big data
- Built-in library of regularizers and quality measures

BigARTM community:
- Code on GitHub: https://github.com/bigartm
- Links to docs, discussion group, builds: http://bigartm.org

BigARTM license and programming environment:
- Freely available for commercial usage (BSD 3-Clause license)
- Cross-platform: Windows, Linux, Mac OS X (32-bit, 64-bit)
- Programming APIs: command-line, C++, and Python
The BigARTM project: parallel architecture
- Concurrent processing of batches D = D₁ ⊔ · · · ⊔ D_B
- Simple single-threaded code for ProcessBatch
- User controls when to update the model in the online algorithm
- Deterministic (reproducible) results from run to run
Online EM-algorithm for Multi-ARTM
Input: collection D split into batches D_b, b = 1, …, B;
Output: matrix Φ;

1. initialize φ_wt for all w ∈ W, t ∈ T;
2. n_wt := 0, ñ_wt := 0 for all w ∈ W, t ∈ T;
3. for all batches D_b, b = 1, …, B:
4.   iterate each document d ∈ D_b at a constant matrix Φ: (ñ_wt) := (ñ_wt) + ProcessBatch(D_b, Φ);
5.   if (synchronize) then
6.     n_wt := n_wt + ñ_wt for all w ∈ W, t ∈ T;
7.     φ_wt := norm_{w∈W^m}(n_wt + φ_wt ∂R/∂φ_wt) for all w ∈ W^m, m ∈ M, t ∈ T;
8.     ñ_wt := 0 for all w ∈ W, t ∈ T;

Here ñ_wt are the counters accumulated since the last synchronization, merged into n_wt when the model is updated.
Online EM-algorithm for Multi-ARTM: ProcessBatch
ProcessBatch iterates documents d ∈ D_b at a constant matrix Φ.

matrix (ñ_wt) := ProcessBatch(set of documents D_b, matrix Φ)

1. ñ_wt := 0 for all w ∈ W, t ∈ T;
2. for all d ∈ D_b:
3.   initialize θ_td := 1/|T| for all t ∈ T;
4.   repeat
5.     p_tdw := norm_{t∈T}(φ_wt θ_td) for all w ∈ d, t ∈ T;
6.     n_td := ∑_{w∈d} λ_{m(w)} n_dw p_tdw for all t ∈ T;
7.     θ_td := norm_{t∈T}(n_td + θ_td ∂R/∂θ_td) for all t ∈ T;
8.   until θ_d converges;
9.   ñ_wt := ñ_wt + λ_{m(w)} n_dw p_tdw for all w ∈ d, t ∈ T;
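A minimal NumPy sketch of ProcessBatch for a single modality with λ = 1 and no Θ regularizer; this is an illustration of the algorithm above, not BigARTM's actual code:

```python
import numpy as np

def norm(x, axis=0):
    x = np.maximum(x, 0)
    s = x.sum(axis=axis, keepdims=True)
    return np.where(s > 0, x / np.where(s == 0, 1, s), 0)

def process_batch(batch, phi, n_inner=10):
    """Iterate theta_d for each document at fixed Phi; return local counters."""
    W, T = phi.shape
    n_wt = np.zeros((W, T))
    for n_dw in batch:                     # n_dw: term counts of one document
        theta_d = np.full(T, 1.0 / T)      # step 3: uniform initialization
        for _ in range(n_inner):           # steps 4-8: fixed number of passes
            p_tdw = norm(phi * theta_d, axis=1)               # p(t|d,w), W x T
            theta_d = norm((n_dw[:, None] * p_tdw).sum(axis=0), axis=0)
        n_wt += n_dw[:, None] * p_tdw      # step 9: accumulate counters
    return n_wt

phi = np.array([[0.6, 0.1], [0.3, 0.2], [0.1, 0.7]])   # toy stochastic Phi
batch = [np.array([4., 1., 0.]), np.array([0., 2., 5.])]
n_wt = process_batch(batch, phi)           # sums to the batch token count
```

Because each call touches only its own batch and a read-only Φ, batches can be processed in parallel threads, which is the architecture of the previous slide.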
BigARTM vs Gensim vs Vowpal Wabbit
3.7M articles from Wikipedia, 100K unique words
Model                procs  train    inference  perplexity
BigARTM              1      35 min   72 sec     4000
Gensim.LdaModel      1      369 min  395 sec    4161
VowpalWabbit.LDA     1      73 min   120 sec    4108
BigARTM              4      9 min    20 sec     4061
Gensim.LdaMulticore  4      60 min   222 sec    4111
BigARTM              8      4.5 min  14 sec     4304
Gensim.LdaMulticore  8      57 min   224 sec    4455

procs = number of parallel threads; inference = time to infer θ_d for 100K held-out documents; perplexity is calculated on held-out documents.
Running BigARTM in parallel
3.7M articles from Wikipedia, 100K unique words
- Amazon EC2 c3.8xlarge (16 physical cores + hyperthreading)
- No extra memory cost for adding more threads
Summary / Questions?
1. Additive Regularization
   - PLSA algorithm with regularization on the M-step
   - simple way to incorporate assumptions into the topic model

2. Multimodal Topic Models
   - incorporate metadata into the topic model
   - parallel text corpora

3. BigARTM (http://bigartm.org): open-source implementation of Multi-ARTM
   - parallel online algorithm
   - ready for large collections
   - supports multimodal collections
   - collection of implemented regularizers