Upload
bruno-abrahao
View
929
Download
1
Tags:
Embed Size (px)
Citation preview
On the Separability of Structural Classes of Communities
Bruno AbrahaoSucheta SoundarajanJohn HopcroftRobert Kleinberg Cornell University
1
Text
The idea of community structure as distinctive relationships
[Newman-Girvan, 2004]
2
Which community is real?
3
Which community is real?
3
Which community is real?
“Real” Community Random WalkMetis
Infomap Newman-Modularity Louvain3
How do their structures differ?
“Real” Community Random WalkMetis
Infomap Newman-Modularity Louvain3
The definition of community structure
• Community structure is not well defined– different people have different notions of community structure
• Traditional strategy1. start with an expectation of what a community should look like
• e.g., a set of nodes that interact more within the set than with the outside
2. define an optimization problem3. design heuristic4. the solution gives the desired communities
4
Two research questions that we address here
• A multitude of different of algorithms– different objective functions– different heuristicsHow dissimilar are their outputs?
• Communities may differ from the proposed mathematical constructs
– e.g., preponderance of links to the outside, as in this figure, contrary to widely accepted notions
Which algorithms extract communities that most closely resemble the structure of real communities?
5
Two research questions that we address here
• A multitude of different of algorithms– different objective functions– different heuristicsHow dissimilar are their outputs?
• Communities may differ from the proposed mathematical constructs
– e.g., preponderance of links to the outside, as in this figure, contrary to widely accepted notions
Which algorithms extract communities that most closely resemble the structure of real communities?
5
Obstacles to answering the preceding questions
• We don't know what properties communities possess
• We can't characterize communities in the absence of negative examples– Look at real communities and determine their structure– do other sets that are not communities have these properties?
– every other connected set could be a negative example - intractable
– sets that are not annotated could also be communities
• We don't know what metrics we should use– modularity, conductance, clustering coefficient...
6
Our plan to address the preceding questions
• Propose a methodology to analyze community structure by using different notions of communities as references
– key idea: analyze community structure without requiring negative examples of communities
• Scalable and comprehensive, simultaneously considering– multiple notions of communities– diverse domains of application – a broad spectrum of community metrics
• Assess the structural dissimilarities between– the output of different community detection algorithms– the output of algorithms and real communities
7
Building structural classes by extracting examples
Algorithm Network
8
Building structural classes by extracting examples
Algorithm Network
Apply
8
Building structural classes by extracting examples
Algorithm Network Extract community examples
Apply
8
Building structural classes by extracting examples
Algorithm 2
Algorithm 4
Algorithm k
Algorithm 1
Algorithm 3
9
Building structural classes by extracting examples
Algorithm 2
Algorithm 4
Algorithm k
Algorithm 1
Algorithm 3
Class 1
Class 2
Class 3
Class 4
Class k
9
Building a feature space by characterizing examples
Labeled Example
10
Building a feature space by characterizing examples
Labeled Example
Feature Vector
10
Building a feature space by characterizing examples
11
Building a feature space by characterizing examples
Feature Space11
Measure inter-class separability from feature space
Feature Space
12
Measure inter-class separability from feature space
Feature Space
Class Separability Measure
12
Measure inter-class separability from feature space
Feature SpaceSeparability = Distinct structures
Class Separability Measure
Are the classes separable?
12
We test our methods on large-scale network datasets
• Social
• Commercial
• Biological
+ Rice University
Facebook+Rice with permission of Mislove et al.. Other datasets publicly available.
13
We furnish our framework with10 community detection algorithms
• BFS (Random connected subgraphs)• Random-Walk-based (with and without restart)• (α,β)-communities• InfoMap• Markov Clustering• Metis• Louvain• Newman-Clauset-Moore• Link Communities
14
Annotated communities identify exemplar communities
+ Rice University
Metadata included in the datasets
15
Annotated communities identify exemplar communities
+ Rice University
Metadata included in the datasets
15
Annotated communities identify exemplar communities
+ Rice University
Metadata included in the datasets
15
Annotated communities identify exemplar communities
+ Rice University
Metadata included in the datasets
15
Annotated communities identify exemplar communities
+ Rice University
Metadata included in the datasets
15
Annotated communities identify exemplar communities
+ Rice University
Metadata included in the datasets
15
To what extent are the classes separable?
16
First separability measure: Scatter Matrices
• Traditional methods for measuring class separability give a single score, e.g., scatter matrices
Reference: the same data with shuffled labels
• This is a global measure. We need more fine-grained separability information of each class!
Network
17
Idea: use the performance of probabilistic multi-class classifiers
Probabilistic k-way classifier (SVM, k-NN)
Algorithm 1
Algorithm 2
Annotated communities
Train
18
Idea: use the performance of probabilistic multi-class classifiers
Probabilistic k-way classifier (SVM, k-NN)
Algorithm 1
Algorithm 2
Annotated communities
Train
18
Probabilistic k-way classifier(SVM, k-NN)
Classify(cross-validation)
19
Idea: use the performance of probabilistic multi-class classifiers
Probabilistic k-way classifier(SVM, k-NN)
Classify(cross-validation)
19
Idea: use the performance of probabilistic multi-class classifiers
Probabilistic k-way classifier(SVM, k-NN)
Classify(cross-validation)
Pr(Algorithm 1) = 0.05Pr(Algorithm 2) = 0.08...Pr(Annotated) = 0.48
19
Idea: use the performance of probabilistic multi-class classifiers
Cross-validation performance indicates class separability
Probabilistic-SVM cross-validation outcome with 11 structural classes. Data: DBLP network.
Ann.
Metis
MCL
Newm.
Louv.
LC
IM
AB
RW15
RW0
BFS
0.0 0.2 0.4 0.6 0.8 1.0
BFS
RW0
RW15
AB
IM
LC
Louv.
Newm.
MCL
Metis
Ann
Stru
ctur
al Cl
ass
20
Matching annotated communities
Which algorithms extract communities that most closely resemble the structure of annotated communities?
21
Repeat the preceding experiment, leaving out the class of annotated communities
Probabilistic k-way classifier
Algorithm 1
Algorithm 2
Algorithm N
Learn
22
Repeat the preceding experiment, leaving out the class of annotated communities
Probabilistic k-way classifier
Algorithm 1
Algorithm 2
Algorithm N
Learn
22
Probabilistic k-way classifier
Classify
23
Introduce the class of annotated communities in the test set
Probabilistic k-way classifier
Classify
23
Introduce the class of annotated communities in the test set
Probabilistic k-way classifier
Classify
Pr(Algorithm 1) = 0.02Pr(Algorithm 2) = 0.19...Pr(Algorithm k) = 0.12
23
Introduce the class of annotated communities in the test set
Classification reveals that annotated resemble unstructured methods
LJ2
LJ1
DBLP
Amazon
Fly
HS
SC
Ugrad
grad
0.0 0.2 0.4 0.6 0.8 1.0
BFS
RW0
RW15
AB
IM
LC
Louv.
Newm.
MCL
Metis
0.0 0.2 0.4 0.6 0.8 1.0
BFS
RW0
RW15
AB
IM
LC
Louv.
Newm.
MCL
Metis
Probabilistic-SVM classification of annotated communities into 11 structural classes structural class for 9 different networks.
Netw
ork
24
Improving the quality of the space
What classes should we consider?
25
The classifier confuses the two types of RW communities
Probabilistic-SVM cross-validation outcome with 11 structural classes. Data: DBLP network.
Ann.
Metis
MCL
Newm.
Louv.
LC
IM
AB
RW15
RW0
BFS
0.0 0.2 0.4 0.6 0.8 1.0
BFS
RW0
RW15
AB
IM
LC
Louv.
Newm.
MCL
Metis
Ann
Stru
ctur
al Cl
ass
26
Fisher’s discriminant ratio
27
A Separability Framework for Analyzing Community Structure, Bruno Abrahao, Sucheta Soundarajan, John Hopcroft, Robert Kleinberg, To appear in ACM Transactions on Knowledge Discovery from Data (TKDD), 2013
Fisher’s discriminant ratio
27
A Separability Framework for Analyzing Community Structure, Bruno Abrahao, Sucheta Soundarajan, John Hopcroft, Robert Kleinberg, To appear in ACM Transactions on Knowledge Discovery from Data (TKDD), 2013
Cross-validation performance indicates class separability
Probabilistic-SVM cross-validation outcome with 11 structural classes. Data: DBLP network.
Ann.
Metis
MCL
Newm.
Louv.
LC
IM
AB
RW15
RW0
BFS
0.0 0.2 0.4 0.6 0.8 1.0
BFS
RW0
RW15
AB
IM
LC
Louv.
Newm.
MCL
Metis
Ann
Stru
ctur
al Cl
ass
28
Cross-validation performance indicates class separability
29
Stru
ctur
al C
lass
Classification reveals that annotated resemble unstructured methods
LJ2
LJ1
DBLP
Amazon
Fly
HS
SC
Ugrad
grad
0.0 0.2 0.4 0.6 0.8 1.0
BFS
RW0
RW15
AB
IM
LC
Louv.
Newm.
MCL
Metis
0.0 0.2 0.4 0.6 0.8 1.0
BFS
RW0
RW15
AB
IM
LC
Louv.
Newm.
MCL
Metis
Probabilistic-SVM classification of annotated communities into 11 structural classes structural class for 9 different networks.
Netw
ork
30
Classification reveals that annotated resemble unstructured methods
31
Net
wor
k
Can we reveal latent similarities among community detection algorithms?
Our framework enables one to cluster algorithms that behave similarly
32
Step 1: identifying the most important features
7 features out of 36 retain the discriminative power of the full set
33
Grouping algorithms by their tendencieswith respect to most discriminative features
34
High
Medium
Low
Grouping algorithms by their tendencieswith respect to most discriminative features
34
High
Medium
Low
Conclusion of methodology
• We present a methodology to address the complexity of analyzing community structure, which simultaneously considers
– large number of algorithms
– multiple domains of application
– a broad spectrum of metrics to characterize community structure
• A scalable framework that enables
– researchers to compare and understand biases of new and existing community detection algorithms
– practitioners to decide on the most suitable algorithm for particular purpose and network
35
Conclusion of experimental analysis
• Our experimental analysis, which include 10 community detection algorithms and 9 different networks analyzed with 36 properties reveals
– High variability among the output of community detection methods– Annotated communities have a distinct structure from what we
expect• their structure is closer to the output of baseline procedures than to that of popular
algorithms
– A small set of features explain the biases produced by different algorithms
– We can organize the tapestry of available community detection algorithms by grouping them with respect to similarities in behavior
36
Final remarks on future directions
• Traditional methods are unsupervised– they find a particular type of community– little sensitivity to different purposes, structures of interest and domains of
application
• Our approach suggests a supervised approach to community detection– user specifies what they intended to find through examples (real or synthetic)– algorithm learns from those examples and retrieves similar structures in the
network
37
On the Separability of Structural Classes of Communities
Bruno AbrahaoSucheta SoundarajanJohn HopcroftRobert Kleinberg Cornell University
Thank you!
38