Introduction STBM Model Inference Experiments Conclusion
The Stochastic Topic Block Model for theClustering of Vertices in Networks with
Textual edges
Presenter: Wei JIANG
Authors: Charles Bouveyron, Pierre Latouche, Rawya Zreik
Machine Learning Journal Club, CMAP
September 28th 2017
Wei JIANG STBM September 28th 2017 1 / 46
Outline
1 Introduction
2 STBM Model
3 Inference
4 Experiments
Simulation study
Real-world data
Motivation
Communication between individuals via:
Social media: Facebook, Twitter, LinkedIn
Electronic formats: email, web, e-publication
FIGURE – Social network diagram displaying friendship ties among a set of Facebook users
Network Analysis
Network analysis for clustering users in email system
Directed graph
Nodes: users
Edges: sender → recipient
Task: clustering users based on person-to-person links only
How to improve? ⇒ Also take the email content into account.
Structure of paper
Problem: discovering clusters of vertices ⇐ both the network interactions and the text content.
Model: stochastic topic block model (STBM), a probabilistic model for networks with textual edges.
Inference: classification variational expectation-maximization (C-VEM).
Experiments: simulated data to assess the approach and highlight its features; real-world data sets to demonstrate its effectiveness.
Context and notations
Directed network with M vertices
A: the M × M adjacency matrix
$$A_{ij} = \begin{cases} 1 & \text{if there is an edge from } i \text{ to } j \\ 0 & \text{otherwise} \end{cases}$$
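To make the notation concrete, here is a minimal sketch of building A from a directed edge list (the helper name `adjacency` is illustrative, not from the paper):

```python
import numpy as np

def adjacency(edges, M):
    """Build the M x M adjacency matrix: A[i, j] = 1 iff there is a directed edge i -> j."""
    A = np.zeros((M, M), dtype=int)
    for i, j in edges:
        A[i, j] = 1
    return A

# Three vertices with a directed 3-cycle 0 -> 1 -> 2 -> 0
A = adjacency([(0, 1), (1, 2), (2, 0)], 3)
```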
If $A_{ij} = 1$, this edge is characterized by a set of $D_{ij}$ documents: $W_{ij} = (W^{d}_{ij})_{d}$
Each document is a collection of $N^{d}_{ij}$ words: $W^{d}_{ij} = (W^{dn}_{ij})_{n}$
$W = (W_{ij})_{ij}$: the set of all documents exchanged over all the edges
Goal
Cluster the vertices into Q latent groups sharing the same connection profiles
Presence of edges
Documents exchanged between pairs of vertices
⇔ estimate $Y = (Y_1, \dots, Y_M)$, where the latent variable $Y_i$ is such that
$$Y_{iq} = \begin{cases} 1 & \text{if vertex } i \text{ belongs to cluster } q \\ 0 & \text{otherwise} \end{cases}$$
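The one-hot coding of the cluster memberships can be sketched as follows (a generic encoding; names are illustrative):

```python
import numpy as np

def one_hot(labels, Q):
    """Binary matrix Y with Y[i, q] = 1 iff vertex i belongs to cluster q."""
    Y = np.zeros((len(labels), Q), dtype=int)
    Y[np.arange(len(labels)), labels] = 1
    return Y

# Four vertices assigned to clusters 0, 2, 1, 0 among Q = 3 clusters
Y = one_hot([0, 2, 1, 0], 3)
```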
Assumptions
Any kind of relationship between two vertices can be explained by their latent clusters only.
Words in documents are drawn from a mixture distribution over topics, each document d having its own vector of topic proportions $\theta_d$.
Overview of STBM Model
FIGURE – Graphical representation of the stochastic topic block model
Modeling the presence of edges
Stochastic block model (Wang & Wong 1987 ; Nowicki & Snijders2001)
$Y_i \sim \mathcal{M}(1, \rho = (\rho_1, \dots, \rho_Q))$
$A_{ij} \mid Y_{iq} Y_{jr} = 1 \sim \mathcal{B}(\pi_{qr})$ ⇒ $\pi$ is the $Q \times Q$ matrix of connection probabilities
$$\Rightarrow p(A, Y \mid \rho, \pi) = p(A \mid Y, \pi)\, p(Y \mid \rho)$$
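The SBM generative process above can be sketched as a toy sampler (assumed values of ρ and π; not code from the paper):

```python
import numpy as np

def sample_sbm(M, rho, pi, rng):
    """Draw Y_i ~ M(1, rho), then A_ij | Y_iq Y_jr = 1 ~ B(pi_qr)."""
    Q = len(rho)
    y = rng.choice(Q, size=M, p=rho)           # latent cluster of each vertex
    probs = pi[y[:, None], y[None, :]]         # pi[q_i, q_j] for every ordered pair
    A = (rng.random((M, M)) < probs).astype(int)
    np.fill_diagonal(A, 0)                     # no self-loops
    return y, A

rng = np.random.default_rng(0)
rho = np.array([0.5, 0.5])                     # cluster proportions
pi = np.array([[0.25, 0.01],                   # within/between connection probabilities
               [0.01, 0.25]])
y, A = sample_sbm(50, rho, pi, rng)
```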
Modeling the construction of documents
Latent Dirichlet Allocation (Blei et al. 2003)
Each pair of clusters $(q, r)$ of vertices has a vector of topic proportions $\theta_{qr} = (\theta_{qrk})_k \sim \mathrm{Dir}(\alpha = (\alpha_1, \dots, \alpha_K))$; here all components of $\alpha$ are fixed to 1.
The $n$th word of the $d$th document between vertices $i$ and $j$, $W^{dn}_{ij}$, has a latent topic vector:
$$Z^{dn}_{ij} \mid \{Y_{iq} Y_{jr} A_{ij} = 1, \theta\} \sim \mathcal{M}(1, \theta_{qr} = (\theta_{qr1}, \dots, \theta_{qrK}))$$
$$W^{dn}_{ij} \mid Z^{dnk}_{ij} = 1 \sim \mathcal{M}(1, \beta_k = (\beta_{k1}, \dots, \beta_{kV}))$$
⇒ Mixture model for words over topics:
$$W^{dn}_{ij} \mid \{Y_{iq} Y_{jr} A_{ij} = 1, \theta\} \sim \sum_{k=1}^{K} \theta_{qrk}\, \mathcal{M}(1, \beta_k)$$
Assume:
All the latent variables $Z^{dn}_{ij}$ are sampled independently.
Given the latent variables, the words $W^{dn}_{ij}$ are independent.
Denote $Z = (Z^{dn}_{ij})_{ijdn}$ ⇒ joint distribution:
$$p(W, Z, \theta \mid A, Y, \beta) = p(W \mid A, Z, \beta)\, p(Z \mid A, Y, \theta)\, p(\theta)$$
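The document construction can be sketched in the same toy-sampler spirit (K, V and all variable names are illustrative assumptions, not from the paper):

```python
import numpy as np

def sample_document(n_words, theta_qr, beta, rng):
    """For each word: draw a topic Z ~ M(1, theta_qr), then a word W ~ M(1, beta_Z)."""
    K, V = beta.shape
    z = rng.choice(K, size=n_words, p=theta_qr)          # latent topic of each word
    w = np.array([rng.choice(V, p=beta[k]) for k in z])  # word index given its topic
    return z, w

rng = np.random.default_rng(1)
K, V = 3, 8                                              # topics, vocabulary size
theta_qr = rng.dirichlet(np.ones(K))                     # topic proportions of cluster pair (q, r)
beta = rng.dirichlet(np.ones(V), size=K)                 # one word distribution per topic
z, w = sample_document(150, theta_qr, beta, rng)
```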
STBM Model
The full joint distribution of the STBM model:
$$p(A, W, Y, Z, \theta \mid \rho, \pi, \beta) = p(W, Z, \theta \mid A, Y, \beta)\, p(A, Y \mid \rho, \pi)$$
Examples
The simulated messages (150 words each) are drawn from four BBC news texts:
1 The birth of Princess Charlotte
2 Black holes in astrophysics
3 UK politics
4 Cancer diseases in medicine
Key Property of STBM Model
Assume that Y is available. The documents in W can be reorganized such that $W = (\tilde{W}_{qr})_{qr}$.
⇒ All words in $\tilde{W}_{qr}$ share the same mixture distribution over topics.
⇒ Words in W are drawn from an LDA model with $D = Q^2$ independent documents $\tilde{W}_{qr}$.
$p(A, Y \mid \rho, \pi)$ involves the sampling of the clusters and the construction of the binary variables describing the presence of edges ⇒ it corresponds to the likelihood of an SBM model.
For given Y, the full joint distribution factorizes into an LDA-like term and an SBM-like term.
Aim : maximize log-likelihood
First fix the number of groups Q and number of topics K
$$\log p(A, W, Y \mid \rho, \pi, \beta) = \log \sum_{Z} \int_{\theta} p(A, W, Y, Z, \theta \mid \rho, \pi, \beta)\, d\theta$$
Model parameters: (ρ, π, β)
Z and θ are latent variables.
Y = (Y_1, ..., Y_M) is seen as a set of binary vectors for which we aim at providing estimates (motivated by the key property of STBM).
Variational decomposition of log-likelihood
$$\log p(A, W, Y \mid \rho, \pi, \beta) = \mathcal{L}(R(\cdot); Y, \rho, \pi, \beta) + \mathrm{KL}(R(\cdot)\,\|\, p(\cdot \mid A, W, Y, \rho, \pi, \beta))$$
KL: the Kullback-Leibler divergence between the true and approximate posterior distributions $R(\cdot)$ of $(Z, \theta)$, given the data and model parameters:
$$\mathrm{KL}(R(\cdot)\,\|\, p(\cdot \mid A, W, Y, \rho, \pi, \beta)) = -\sum_{Z} \int_{\theta} R(Z, \theta) \log \frac{p(Z, \theta \mid A, W, Y, \rho, \pi, \beta)}{R(Z, \theta)}\, d\theta$$
⇒ Maximizing the lower bound L w.r.t R(Z , θ) induces a minimizationof KL divergence.
Model decomposition of lower bound L
Recall the STBM property: the set of latent variables Y allows the full joint distribution to be decomposed into the sampling of Y and A, plus the construction of the documents given A and Y.
$$\mathcal{L}(R(\cdot); Y, \rho, \pi, \beta) = \tilde{\mathcal{L}}(R(\cdot); Y, \beta) + \log p(A, Y \mid \rho, \pi)$$
where
$$\tilde{\mathcal{L}}(R(\cdot); Y, \beta) = \sum_{Z} \int_{\theta} R(Z, \theta) \log \frac{p(W, Z, \theta \mid A, Y, \beta)}{R(Z, \theta)}\, d\theta$$
⇒ For given Y, the two terms can be maximized independently.
C-VEM algorithm
Aim: maximize the lower bound L.
The C-VEM algorithm alternates between the optimization of R(Z, θ), Y, and (ρ, π, β).
1 Estimation of R(Z, θ): update $R(Z^{dn}_{ij})$ and $R(\theta)$ (E-step of VEM).
2 Estimation of the model parameters (ρ, π, β): maximize the lower bound L ⇒ β appears only in $\tilde{\mathcal{L}}$; ρ and π only in the SBM log-likelihood (M-step).
3 Estimation of Y (classification step): fix (ρ, π, β) and R(Z, θ) ⇒ find the Y maximizing L. Testing all $Q^M$ possible cluster assignments is intractable ⇒ use online clustering methods.
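The classification step can be illustrated by a simplified, runnable sketch that uses only the SBM part of the bound (the full C-VEM step also involves the LDA-like term); all function and variable names are illustrative, not from the paper:

```python
import numpy as np

def greedy_classification_step(A, y, rho, pi):
    """One online sweep of the classification step: each vertex in turn is
    reassigned to the cluster maximizing the SBM part of the bound, holding
    all other labels fixed (this avoids testing all Q^M assignments)."""
    M, Q, eps = A.shape[0], len(rho), 1e-12
    for i in range(M):
        scores = np.empty(Q)
        for q in range(Q):
            p_out, p_in = pi[q, y], pi[y, q]   # edge probabilities if vertex i joins cluster q
            ll = (A[i] * np.log(p_out + eps) + (1 - A[i]) * np.log(1 - p_out + eps)
                  + A[:, i] * np.log(p_in + eps) + (1 - A[:, i]) * np.log(1 - p_in + eps))
            ll[i] = 0.0                        # no self-loop term
            scores[q] = ll.sum() + np.log(rho[q] + eps)
        y[i] = int(np.argmax(scores))
    return y

# Two planted communities of 3 vertices each (dense within, empty between)
A = np.kron(np.eye(2, dtype=int), np.ones((3, 3), dtype=int))
np.fill_diagonal(A, 0)
rho = np.array([0.5, 0.5])
pi = np.array([[0.9, 0.1], [0.1, 0.9]])
y = greedy_classification_step(A, np.array([0, 0, 0, 1, 1, 1]), rho, pi)
```

On this toy graph the planted labels are a fixed point of the sweep: each vertex keeps the cluster that best explains its in- and out-edges.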
Initialization strategy: multiple initializations (Biernacki et al. 2003)
For several initializations of a k-means-like algorithm on a distance matrix between vertices:
1 VEM for LDA is applied to all documents i → j ⇒ $X_{ij} = k$ if k is the majority topic.
2 Distance matrix:
$$\Delta(i, j) = \sum_{h=1}^{M} \delta(X_{ih} \neq X_{jh})\, A_{ih} A_{jh} + \sum_{h=1}^{M} \delta(X_{hi} \neq X_{hj})\, A_{hi} A_{hj}$$
Look at all possible edges from i and from j towards a third vertex h ⇒ compare the edge types.
The distance matrix counts the number of discordances in the way both i and j connect to other vertices, or other vertices connect to them.
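The distance above can be computed directly as follows (illustrative names; X[i, j] holds the majority topic of the documents on edge i → j):

```python
import numpy as np

def init_distance(X, A):
    """Delta(i, j): number of discordances between the majority topics used by
    i and j on their common out-neighbours and common in-neighbours."""
    M = A.shape[0]
    D = np.zeros((M, M), dtype=int)
    for i in range(M):
        for j in range(M):
            out = (X[i, :] != X[j, :]) * A[i, :] * A[j, :]      # edges i->h vs j->h
            inc = (X[:, i] != X[:, j]) * A[:, i] * A[:, j]      # edges h->i vs h->j
            D[i, j] = out.sum() + inc.sum()
    return D

# Three vertices; 0 and 1 both write to 2, but with different majority topics
A = np.array([[0, 0, 1],
              [0, 0, 1],
              [0, 0, 0]])
X = np.array([[0, 0, 1],
              [0, 0, 2],
              [0, 0, 0]])
D = init_distance(X, A)
```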
Model selection
Model selection problem: estimating the number of groups Q and the number of topics K.
Criterion: ICL (Biernacki et al. 2000)
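In the spirit of Biernacki et al. (2000), the criterion has the generic shape of a completed-data log-likelihood penalized BIC-style (a sketch only; the exact STBM penalties are derived in the paper):

```latex
% Generic shape of the ICL criterion; pen(Q, K) collects BIC-like terms,
% one per block of model parameters (the exact penalty for STBM is in the paper)
\mathrm{ICL}(Q, K) \;=\; \log p(A, W, \hat{Y} \mid \hat{\rho}, \hat{\pi}, \hat{\beta}) \;-\; \mathrm{pen}(Q, K)
```

The pair (Q, K) maximizing the criterion is retained.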
Simulation study
Simulation setup
Introductory example on scenario C
Run C-VEM for STBM on a network from scenario C with the actual numbers of groups and topics ⇒ both the network structure and the topic information should be correctly recovered.
FIGURE – Clustering result for the introductory example (scenario C)
Evolution of the lower bound L along the iterations (top left)
The most frequent words in the 3 found topics (bottom left)
The estimated model parameters (ρ, π) (right)
FIGURE – Summary of connection probabilities between groups (π, edge widths), group proportions (ρ, node sizes) and most probable topics for group interactions (edge colors).
Experiment on model selection
Percentage of selections by ICL for each STBM model (Q, K) on 50 simulated networks for each of the three scenarios.
Highlighted rows and columns correspond to the actual values of Q and K.
Benchmark study
Run SBM, LDA and STBM on 20 networks simulated according to thethree scenarios. Average ARI values (Rand, 1971) are reported withstandard deviations for both node and edge clustering.
Easy: the same as the previous simulations of the three scenarios.
Hard 1: the communities are only weakly differentiated ($\pi_{qq} = 0.25$ and $\pi_{q \neq r} = 0.2$).
Hard 2: 40% of the message words are sampled from topics other than the actual topic.
The joint modeling of the network structure and topics makes it possible to recover the complex hidden structure in a network with textual edges.
Real-world data
Enron data set: https://www.cs.cmu.edu/~./enron/
Email communications between 149 employees from 1999 to 2002.
All messages sent between two individuals were merged into a single meta-message ⇒ 1234 directed edges.
Run C-VEM for STBM for numbers of groups Q = 1:14 and numbers of topics K = 2:20 ⇒ model selection: (Q, K) = (10, 5).
FIGURE – Clustering result with STBM on the Enron data set (Sept.-Dec. 2001)
Enron email network
FIGURE – Most specific words for the 5 found topics with STBM on the Enron data set.
1 Financial and trading activity
2 Enron activities in Afghanistan
3 California electricity crisis
4 Usual logistics issues (building equipment, computers, ...)
5 Technical discussions on gas deliveries
Group 10 contains a single individual who has a central place in the network and frequently discusses logistics issues (topic 4) with groups 4, 5, 6 and 7.
Group 8 contains 6 individuals who mainly communicate about Enron activities in Afghanistan (topic 2), among themselves and with other groups.
Groups 4 and 6 are more focused on trading activities (topic 1).
Groups 1, 3 and 9 deal with technical issues on gas deliveries (topic 5).
FIGURE – Clustering results with SBM (left, Q = 8) and STBM (right) on the Enron dataset.
Some clusters found by SBM (e.g. red) have been split by STBM, since some nodes use different topics than the rest.
SBM isolates two "hubs" (light green) ↔ STBM identifies a single "hub", the second being gathered with other nodes using similar topics.
STBM allows a better and deeper understanding of the Enronnetwork.
NIPS co-authorship network
1988-2003 editions (NIPS 1-17, http://robotics.stanford.edu/~gal/data.html)
Contains the abstracts of 2484 accepted papers from 2740 contributing authors.
⇒ Undirected network between 2740 authors with 22640 textual edges.
Model selection by ICL: (Q, K) = (13, 7)
FIGURE – Clustering result with STBM on the Nips co-authorship network
FIGURE – Most specific words for the 5 found topics with STBM on the Nips co-authorship network.
STBM has proved its ability to produce concise and relevant analyses of the structure of a large and dense network.
Conclusion
STBM : modeling and clustering vertices in networks with textualedges
directed or undirected networks
application to various types of networks
C-VEM: model inference
ICL: model selection
Numerical experiments on simulated data
Two real-world networks
large co-authorship network → scalability
Authors
Charles Bouveyron
http://w3.mi.parisdescartes.fr/~cbouveyr/
Pierre Latouche
http://samm.univ-paris1.fr/Pierre-Latouche
Rawya Zreik
http://samm.univ-paris1.fr/Rawya-ZREIK
"Linkage"
https://linkage.fr/