Introduction STBM Model Inference Experiments Conclusion
The Stochastic Topic Block Model for theClustering of Vertices in Networks with
Textual edges
Presenter: Wei JIANG
Authors: Charles Bouveyron, Pierre Latouche, Rawya Zreik
Machine Learning Journal Club, CMAP
September 28th 2017
Wei JIANG STBM September 28th 2017 1 / 46
Outline
1 Introduction
2 STBM Model
3 Inference
4 Experiments
Simulation study
Real-world data
Motivation
Communication between individuals via:
Social media: Facebook, Twitter, LinkedIn
Electronic formats: email, web, e-publication
FIGURE – Social network diagram displaying friendship ties among a set of Facebook users
Network Analysis
Network analysis for clustering users in email system
Directed graph
Nodes: users
Edges: sender → recipient
Task: clustering users based on person-to-person links only
How to improve? ⇒ Also take the email content into account.
Structure of paper
Problem: discovering clusters of vertices ⇐ both the network interactions and the text content.
Model: stochastic topic block model (STBM), a probabilistic model for networks with textual edges.
Inference: classification variational expectation-maximization (C-VEM).
Experiments: simulated data to assess the approach and highlight its features; real-world data sets to demonstrate its effectiveness.
Context and notations
Directed network with M vertices
A: the M × M adjacency matrix
$$A_{ij} = \begin{cases} 1 & \text{if there is an edge from } i \text{ to } j \\ 0 & \text{otherwise} \end{cases}$$
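To make the notation concrete, here is a minimal sketch of building A from a directed edge list (the helper name `adjacency` is illustrative, not from the paper):

```python
import numpy as np

def adjacency(edges, M):
    """Build the M x M adjacency matrix: A[i, j] = 1 iff there is a directed edge i -> j."""
    A = np.zeros((M, M), dtype=int)
    for i, j in edges:
        A[i, j] = 1
    return A

# Three vertices with a directed 3-cycle 0 -> 1 -> 2 -> 0
A = adjacency([(0, 1), (1, 2), (2, 0)], 3)
```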
If $A_{ij} = 1$, this edge is characterized by a set of $D_{ij}$ documents: $W_{ij} = (W^{d}_{ij})_{d}$
Each document is a collection of $N^{d}_{ij}$ words: $W^{d}_{ij} = (W^{dn}_{ij})_{n}$
$W = (W_{ij})_{ij}$: the set of all documents exchanged over all the edges
Goal
Cluster the vertices into Q latent groups sharing the same connection profiles
Presence of edges
Documents exchanged between pairs of vertices
⇔ estimate $Y = (Y_1, \dots, Y_M)$, where the latent variable $Y_i$ is such that
$$Y_{iq} = \begin{cases} 1 & \text{if vertex } i \text{ belongs to cluster } q \\ 0 & \text{otherwise} \end{cases}$$
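The one-hot coding of the cluster memberships can be sketched as follows (a generic encoding; names are illustrative):

```python
import numpy as np

def one_hot(labels, Q):
    """Binary matrix Y with Y[i, q] = 1 iff vertex i belongs to cluster q."""
    Y = np.zeros((len(labels), Q), dtype=int)
    Y[np.arange(len(labels)), labels] = 1
    return Y

# Four vertices assigned to clusters 0, 2, 1, 0 among Q = 3 clusters
Y = one_hot([0, 2, 1, 0], 3)
```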
Assumptions
Any kind of relationship between two vertices can be explained by their latent clusters only.
Words in documents are drawn from a mixture distribution over topics, each document d having its own vector of topic proportions $\theta_d$.
Overview of STBM Model
FIGURE – Graphical representation of the stochastic topic block model
Modeling the presence of edges
Stochastic block model (Wang & Wong 1987 ; Nowicki & Snijders2001)
$Y_i \sim \mathcal{M}(1, \rho = (\rho_1, \dots, \rho_Q))$
$A_{ij} \mid Y_{iq} Y_{jr} = 1 \sim \mathcal{B}(\pi_{qr})$ ⇒ $\pi$ is the $Q \times Q$ matrix of connection probabilities
$$\Rightarrow p(A, Y \mid \rho, \pi) = p(A \mid Y, \pi)\, p(Y \mid \rho)$$
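The SBM generative process above can be sketched as a toy sampler (assumed values of ρ and π; not code from the paper):

```python
import numpy as np

def sample_sbm(M, rho, pi, rng):
    """Draw Y_i ~ M(1, rho), then A_ij | Y_iq Y_jr = 1 ~ B(pi_qr)."""
    Q = len(rho)
    y = rng.choice(Q, size=M, p=rho)           # latent cluster of each vertex
    probs = pi[y[:, None], y[None, :]]         # pi[q_i, q_j] for every ordered pair
    A = (rng.random((M, M)) < probs).astype(int)
    np.fill_diagonal(A, 0)                     # no self-loops
    return y, A

rng = np.random.default_rng(0)
rho = np.array([0.5, 0.5])                     # cluster proportions
pi = np.array([[0.25, 0.01],                   # within/between connection probabilities
               [0.01, 0.25]])
y, A = sample_sbm(50, rho, pi, rng)
```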
Modeling the construction of documents
Latent Dirichlet Allocation (Blei et al. 2003)
Each pair of clusters $(q, r)$ of vertices has a vector of topic proportions $\theta_{qr} = (\theta_{qrk})_k \sim \mathrm{Dir}(\alpha = (\alpha_1, \dots, \alpha_K))$; here all components of $\alpha$ are fixed to 1.
The $n$th word of the $d$th document between vertices $i$ and $j$, $W^{dn}_{ij}$, has a latent topic vector:
$$Z^{dn}_{ij} \mid \{Y_{iq} Y_{jr} A_{ij} = 1, \theta\} \sim \mathcal{M}(1, \theta_{qr} = (\theta_{qr1}, \dots, \theta_{qrK}))$$
$$W^{dn}_{ij} \mid Z^{dnk}_{ij} = 1 \sim \mathcal{M}(1, \beta_k = (\beta_{k1}, \dots, \beta_{kV}))$$
⇒ Mixture model for words over topics:
$$W^{dn}_{ij} \mid \{Y_{iq} Y_{jr} A_{ij} = 1, \theta\} \sim \sum_{k=1}^{K} \theta_{qrk}\, \mathcal{M}(1, \beta_k)$$
Assume:
All the latent variables $Z^{dn}_{ij}$ are sampled independently.
Given the latent variables, the words $W^{dn}_{ij}$ are independent.
Denote $Z = (Z^{dn}_{ij})_{ijdn}$ ⇒ joint distribution:
$$p(W, Z, \theta \mid A, Y, \beta) = p(W \mid A, Z, \beta)\, p(Z \mid A, Y, \theta)\, p(\theta)$$
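The document construction can be sketched in the same toy-sampler spirit (K, V and all variable names are illustrative assumptions, not from the paper):

```python
import numpy as np

def sample_document(n_words, theta_qr, beta, rng):
    """For each word: draw a topic Z ~ M(1, theta_qr), then a word W ~ M(1, beta_Z)."""
    K, V = beta.shape
    z = rng.choice(K, size=n_words, p=theta_qr)          # latent topic of each word
    w = np.array([rng.choice(V, p=beta[k]) for k in z])  # word index given its topic
    return z, w

rng = np.random.default_rng(1)
K, V = 3, 8                                              # topics, vocabulary size
theta_qr = rng.dirichlet(np.ones(K))                     # topic proportions of cluster pair (q, r)
beta = rng.dirichlet(np.ones(V), size=K)                 # one word distribution per topic
z, w = sample_document(150, theta_qr, beta, rng)
```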
STBM Model
The full joint distribution of the STBM model:
$$p(A, W, Y, Z, \theta \mid \rho, \pi, \beta) = p(W, Z, \theta \mid A, Y, \beta)\, p(A, Y \mid \rho, \pi)$$
Examples
The simulated messages (150 words each) are drawn from four BBC news texts:
1 The birth of Princess Charlotte
2 Black holes in astrophysics
3 UK politics
4 Cancer diseases in medicine
Key Property of STBM Model
Assume that Y is available. The documents in W can be reorganized such that $W = (\tilde{W}_{qr})_{qr}$.
⇒ All words in $\tilde{W}_{qr}$ share the same mixture distribution over topics.
⇒ Words in W are drawn from an LDA model with $D = Q^2$ independent documents $\tilde{W}_{qr}$.
$p(A, Y \mid \rho, \pi)$ involves the sampling of the clusters and the construction of the binary variables describing the presence of edges ⇒ it corresponds to the likelihood of an SBM model.
For given Y, the full joint distribution factorizes into an LDA-like term and an SBM-like term.
Aim : maximize log-likelihood
First fix the number of groups Q and number of topics K
$$\log p(A, W, Y \mid \rho, \pi, \beta) = \log \sum_{Z} \int_{\theta} p(A, W, Y, Z, \theta \mid \rho, \pi, \beta)\, d\theta$$
Model parameters: (ρ, π, β)
Z and θ are latent variables.
Y = (Y_1, ..., Y_M) is seen as a set of binary vectors for which we aim at providing estimates (motivated by the key property of STBM).
Variational decomposition of log-likelihood
$$\log p(A, W, Y \mid \rho, \pi, \beta) = \mathcal{L}(R(\cdot); Y, \rho, \pi, \beta) + \mathrm{KL}(R(\cdot)\,\|\, p(\cdot \mid A, W, Y, \rho, \pi, \beta))$$
KL: the Kullback-Leibler divergence between the true and approximate posterior distributions $R(\cdot)$ of $(Z, \theta)$, given the data and model parameters:
$$\mathrm{KL}(R(\cdot)\,\|\, p(\cdot \mid A, W, Y, \rho, \pi, \beta)) = -\sum_{Z} \int_{\theta} R(Z, \theta) \log \frac{p(Z, \theta \mid A, W, Y, \rho, \pi, \beta)}{R(Z, \theta)}\, d\theta$$
⇒ Maximizing the lower bound L w.r.t R(Z , θ) induces a minimizationof KL divergence.
Model decomposition of lower bound L
Recall the STBM property: the set of latent variables Y allows the full joint distribution to be decomposed into the sampling of Y and A, plus the construction of the documents given A and Y.
$$\mathcal{L}(R(\cdot); Y, \rho, \pi, \beta) = \tilde{\mathcal{L}}(R(\cdot); Y, \beta) + \log p(A, Y \mid \rho, \pi)$$
where
$$\tilde{\mathcal{L}}(R(\cdot); Y, \beta) = \sum_{Z} \int_{\theta} R(Z, \theta) \log \frac{p(W, Z, \theta \mid A, Y, \beta)}{R(Z, \theta)}\, d\theta$$
⇒ For given Y, the two terms can be maximized independently.
C-VEM algorithm
Aim: maximize the lower bound L.
The C-VEM algorithm alternates between the optimization of R(Z, θ), Y, and (ρ, π, β).
1 Estimation of R(Z, θ): update $R(Z^{dn}_{ij})$ and $R(\theta)$ (E-step of VEM).
2 Estimation of the model parameters (ρ, π, β): maximize the lower bound L ⇒ β appears only in $\tilde{\mathcal{L}}$; ρ and π only in the SBM log-likelihood (M-step).
3 Estimation of Y (classification step): fix (ρ, π, β) and R(Z, θ) ⇒ find the Y maximizing L. Testing all $Q^M$ possible cluster assignments is intractable ⇒ use online clustering methods.
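The classification step can be illustrated by a simplified, runnable sketch that uses only the SBM part of the bound (the full C-VEM step also involves the LDA-like term); all function and variable names are illustrative, not from the paper:

```python
import numpy as np

def greedy_classification_step(A, y, rho, pi):
    """One online sweep of the classification step: each vertex in turn is
    reassigned to the cluster maximizing the SBM part of the bound, holding
    all other labels fixed (this avoids testing all Q^M assignments)."""
    M, Q, eps = A.shape[0], len(rho), 1e-12
    for i in range(M):
        scores = np.empty(Q)
        for q in range(Q):
            p_out, p_in = pi[q, y], pi[y, q]   # edge probabilities if vertex i joins cluster q
            ll = (A[i] * np.log(p_out + eps) + (1 - A[i]) * np.log(1 - p_out + eps)
                  + A[:, i] * np.log(p_in + eps) + (1 - A[:, i]) * np.log(1 - p_in + eps))
            ll[i] = 0.0                        # no self-loop term
            scores[q] = ll.sum() + np.log(rho[q] + eps)
        y[i] = int(np.argmax(scores))
    return y

# Two planted communities of 3 vertices each (dense within, empty between)
A = np.kron(np.eye(2, dtype=int), np.ones((3, 3), dtype=int))
np.fill_diagonal(A, 0)
rho = np.array([0.5, 0.5])
pi = np.array([[0.9, 0.1], [0.1, 0.9]])
y = greedy_classification_step(A, np.array([0, 0, 0, 1, 1, 1]), rho, pi)
```

On this toy graph the planted labels are a fixed point of the sweep: each vertex keeps the cluster that best explains its in- and out-edges.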
Initialization strategy: multiple initializations (Biernacki et al. 2003)
For several initializations of a k-means-like algorithm on a distance matrix between vertices:
1 VEM for LDA is applied to all documents i → j ⇒ $X_{ij} = k$ if k is the majority topic.
2 Distance matrix:
$$\Delta(i, j) = \sum_{h=1}^{M} \delta(X_{ih} \neq X_{jh})\, A_{ih} A_{jh} + \sum_{h=1}^{M} \delta(X_{hi} \neq X_{hj})\, A_{hi} A_{hj}$$
Look at all possible edges from i and from j towards a third vertex h ⇒ compare the edge types.
The distance matrix counts the number of discordances in the way both i and j connect to other vertices, or other vertices connect to them.
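The distance above can be computed directly as follows (illustrative names; X[i, j] holds the majority topic of the documents on edge i → j):

```python
import numpy as np

def init_distance(X, A):
    """Delta(i, j): number of discordances between the majority topics used by
    i and j on their common out-neighbours and common in-neighbours."""
    M = A.shape[0]
    D = np.zeros((M, M), dtype=int)
    for i in range(M):
        for j in range(M):
            out = (X[i, :] != X[j, :]) * A[i, :] * A[j, :]      # edges i->h vs j->h
            inc = (X[:, i] != X[:, j]) * A[:, i] * A[:, j]      # edges h->i vs h->j
            D[i, j] = out.sum() + inc.sum()
    return D

# Three vertices; 0 and 1 both write to 2, but with different majority topics
A = np.array([[0, 0, 1],
              [0, 0, 1],
              [0, 0, 0]])
X = np.array([[0, 0, 1],
              [0, 0, 2],
              [0, 0, 0]])
D = init_distance(X, A)
```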
Model selection
Model selection problem: estimating the number of groups Q and the number of topics K.
Criterion: ICL (Biernacki et al. 2000)
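In the spirit of Biernacki et al. (2000), the criterion has the generic shape of a completed-data log-likelihood penalized BIC-style (a sketch only; the exact STBM penalties are derived in the paper):

```latex
% Generic shape of the ICL criterion; pen(Q, K) collects BIC-like terms,
% one per block of model parameters (the exact penalty for STBM is in the paper)
\mathrm{ICL}(Q, K) \;=\; \log p(A, W, \hat{Y} \mid \hat{\rho}, \hat{\pi}, \hat{\beta}) \;-\; \mathrm{pen}(Q, K)
```

The pair (Q, K) maximizing the criterion is retained.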
Simulation study
Simulation setup
Introductory example on scenario C
Run C-VEM for STBM on a network from scenario C with the actual numbers of groups and topics ⇒ both the network structure and the topic information should be correctly recovered.
FIGURE – Clustering result for the introductory example (scenario C)
Evolution of the lower bound L along the iterations (top left)
The most frequent words in the 3 found topics (bottom left)
The estimated model parameters (ρ, π) (right)
FIGURE – Summary of connection probabilities between groups (π, edge widths), group proportions (ρ, node sizes) and most probable topics for group interactions (edge colors).
Experiment on model selection
Percentage of selections by ICL for each STBM model (Q, K) on 50 simulated networks for each of the three scenarios.
Highlighted rows and columns correspond to the actual values of Q and K.
Benchmark study
Run SBM, LDA and STBM on 20 networks simulated according to thethree scenarios. Average ARI values (Rand, 1971) are reported withstandard deviations for both node and edge clustering.
Easy: the same as the previous simulations of the three scenarios.
Hard 1: the communities are only weakly differentiated ($\pi_{qq} = 0.25$ and $\pi_{q \neq r} = 0.2$).
Hard 2: 40% of the message words are sampled from topics other than the actual topic.
The joint modeling of the network structure and topics makes it possible to recover the complex hidden structure in a network with textual edges.
Real-world data
Enron data set: https://www.cs.cmu.edu/~./enron/
Email communications between 149 employees from 1999 to 2002.
All messages sent between two individuals were merged into a single meta-message ⇒ 1234 directed edges.
Run C-VEM for STBM for numbers of groups Q = 1:14 and numbers of topics K = 2:20 ⇒ model selection: (Q, K) = (10, 5).
FIGURE – Clustering result with STBM on the Enron data set (Sept.-Dec. 2001)
Enron email network
FIGURE – Most specific words for the 5 found topics with STBM on the Enron data set.
1 Financial and trading activity
2 Enron activities in Afghanistan
3 California electricity crisis
4 Usual logistics issues (building equipment, computers, ...)
5 Technical discussions on gas deliveries
Group 10 contains a single individual who has a central place in the network and frequently discusses logistics issues (topic 4) with groups 4, 5, 6 and 7.
Group 8 contains 6 individuals who mainly communicate about Enron activities in Afghanistan (topic 2), among themselves and with other groups.
Groups 4 and 6 are more focused on trading activities (topic 1).
Groups 1, 3 and 9 deal with technical issues on gas deliveries (topic 5).
FIGURE – Clustering results with SBM (left, Q = 8) and STBM (right) on the Enron dataset.
Some clusters found by SBM (e.g. red) have been split by STBM, since some nodes use different topics than the rest.
SBM isolates two "hubs" (light green) ↔ STBM identifies a single "hub", the second being gathered with other nodes using similar topics.
STBM allows a better and deeper understanding of the Enronnetwork.
NIPS co-authorship network
1988-2003 editions (NIPS 1-17, http://robotics.stanford.edu/~gal/data.html)
Contains the abstracts of 2484 accepted papers from 2740 contributing authors.
⇒ Undirected network between 2740 authors with 22640 textual edges.
Model selection by ICL: (Q, K) = (13, 7)
FIGURE – Clustering result with STBM on the Nips co-authorship network
FIGURE – Most specific words for the 5 found topics with STBM on the Nips co-authorship network.
STBM has proved its ability to produce concise and relevant analyses of the structure of a large and dense network.
Conclusion
STBM : modeling and clustering vertices in networks with textualedges
directed or undirected networks
application to various types of networks
C-VEM: model inference
ICL: model selection
Numerical experiments on simulated data
Two real-world networks
large co-authorship network → scalability
Authors
Charles Bouveyron
http://w3.mi.parisdescartes.fr/~cbouveyr/
Pierre Latouche
http://samm.univ-paris1.fr/Pierre-Latouche
Rawya Zreik
http://samm.univ-paris1.fr/Rawya-ZREIK
"Linkage"
https://linkage.fr/