On the Separability of Structural Classes of Communities

On the Separability of Structural Classes of Communities

Bruno AbrahaoSucheta SoundarajanJohn HopcroftRobert Kleinberg Cornell University

1

Text

The idea of community structure as distinctive relationships

[Newman-Girvan, 2004]

2

Which community is real?

3


3


“Real” Community Random WalkMetis

Infomap Newman-Modularity Louvain3

How do their structures differ?

“Real” Community Random WalkMetis

Infomap Newman-Modularity Louvain3

The definition of community structure

• Community structure is not well defined– different people have different notions of community structure

• Traditional strategy1. start with an expectation of what a community should look like

• e.g., a set of nodes that interact more within the set than with the outside

2. define an optimization problem3. design heuristic4. the solution gives the desired communities

4

Two research questions that we address here

• A multitude of different of algorithms– different objective functions– different heuristicsHow dissimilar are their outputs?

• Communities may differ from the proposed mathematical constructs

– e.g., preponderance of links to the outside, as in this figure, contrary to widely accepted notions

Which algorithms extract communities that most closely resemble the structure of real communities?

5

Two research questions that we address here

• A multitude of different of algorithms– different objective functions– different heuristicsHow dissimilar are their outputs?

• Communities may differ from the proposed mathematical constructs

– e.g., preponderance of links to the outside, as in this figure, contrary to widely accepted notions

Which algorithms extract communities that most closely resemble the structure of real communities?

5

Obstacles to answering the preceding questions

• We don't know what properties communities possess

• We can't characterize communities in the absence of negative examples– Look at real communities and determine their structure– do other sets that are not communities have these properties?

– every other connected set could be a negative example - intractable

– sets that are not annotated could also be communities

• We don't know what metrics we should use– modularity, conductance, clustering coefficient...

6

Our plan to address the preceding questions

• Propose a methodology to analyze community structure by using different notions of communities as references

– key idea: analyze community structure without requiring negative examples of communities

• Scalable and comprehensive, simultaneously considering– multiple notions of communities– diverse domains of application – a broad spectrum of community metrics

• Assess the structural dissimilarities between– the output of different community detection algorithms– the output of algorithms and real communities

7

Building structural classes by extracting examples

Algorithm Network

8


Algorithm Network

Apply

8


Algorithm Network Extract community examples

Apply

8


Algorithm 2

Algorithm 4

Algorithm k

Algorithm 1

Algorithm 3

9


Algorithm 2

Algorithm 4

Algorithm k

Algorithm 1

Algorithm 3

Class 1

Class 2

Class 3

Class 4

Class k

9

Building a feature space by characterizing examples

Labeled Example

10


Labeled Example

Feature Vector

10


11


Feature Space11

Measure inter-class separability from feature space

Feature Space

12


Feature Space

Class Separability Measure

12


Feature SpaceSeparability = Distinct structures

Class Separability Measure

Are the classes separable?

12

We test our methods on large-scale network datasets

• Social

• Commercial

• Biological

+ Rice University

Facebook+Rice with permission of Mislove et al.. Other datasets publicly available.

13

We furnish our framework with10 community detection algorithms

• BFS (Random connected subgraphs)• Random-Walk-based (with and without restart)• (α,β)-communities• InfoMap• Markov Clustering• Metis• Louvain• Newman-Clauset-Moore• Link Communities

14

Annotated communities identify exemplar communities

+ Rice University

Metadata included in the datasets

15


+ Rice University


15


+ Rice University


15


+ Rice University


15


+ Rice University


15


+ Rice University


15

To what extent are the classes separable?

16

First separability measure: Scatter Matrices

• Traditional methods for measuring class separability give a single score, e.g., scatter matrices

Reference: the same data with shuffled labels

• This is a global measure. We need more fine-grained separability information of each class!

Network

17

Idea: use the performance of probabilistic multi-class classifiers

Probabilistic k-way classifier (SVM, k-NN)

Algorithm 1

Algorithm 2

Annotated communities

Train

18


Probabilistic k-way classifier (SVM, k-NN)

Algorithm 1

Algorithm 2

Annotated communities

Train

18

Probabilistic k-way classifier(SVM, k-NN)

Classify(cross-validation)

19




19




Pr(Algorithm 1) = 0.05Pr(Algorithm 2) = 0.08...Pr(Annotated) = 0.48

19


Cross-validation performance indicates class separability

Probabilistic-SVM cross-validation outcome with 11 structural classes. Data: DBLP network.

Ann.

Metis

MCL

Newm.

Louv.

LC

IM

AB

RW15

RW0

BFS

0.0 0.2 0.4 0.6 0.8 1.0

BFS

RW0

RW15

AB

IM

LC

Louv.

Newm.

MCL

Metis

Ann

Stru

ctur

al Cl

ass

20

Matching annotated communities

Which algorithms extract communities that most closely resemble the structure of annotated communities?

21

Repeat the preceding experiment, leaving out the class of annotated communities

Probabilistic k-way classifier

Algorithm 1

Algorithm 2

Algorithm N

Learn

22

Repeat the preceding experiment, leaving out the class of annotated communities


Algorithm 1

Algorithm 2

Algorithm N

Learn

22


Classify

23

Introduce the class of annotated communities in the test set


Classify

23



Classify

Pr(Algorithm 1) = 0.02Pr(Algorithm 2) = 0.19...Pr(Algorithm k) = 0.12

23


Classification reveals that annotated resemble unstructured methods

LJ2

LJ1

DBLP

Amazon

Fly

HS

SC

Ugrad

grad

0.0 0.2 0.4 0.6 0.8 1.0

BFS

RW0

RW15

AB

IM

LC

Louv.

Newm.

MCL

Metis

0.0 0.2 0.4 0.6 0.8 1.0

BFS

RW0

RW15

AB

IM

LC

Louv.

Newm.

MCL

Metis

Probabilistic-SVM classification of annotated communities into 11 structural classes structural class for 9 different networks.

Netw

ork

24

Improving the quality of the space

What classes should we consider?

25

The classifier confuses the two types of RW communities


Ann.

Metis

MCL

Newm.

Louv.

LC

IM

AB

RW15

RW0

BFS

0.0 0.2 0.4 0.6 0.8 1.0

BFS

RW0

RW15

AB

IM

LC

Louv.

Newm.

MCL

Metis

Ann

Stru

ctur

al Cl

ass

26

Fisher’s discriminant ratio

27

A Separability Framework for Analyzing Community Structure, Bruno Abrahao, Sucheta Soundarajan, John Hopcroft, Robert Kleinberg, To appear in ACM Transactions on Knowledge Discovery from Data (TKDD), 2013

Fisher’s discriminant ratio

27

A Separability Framework for Analyzing Community Structure, Bruno Abrahao, Sucheta Soundarajan, John Hopcroft, Robert Kleinberg, To appear in ACM Transactions on Knowledge Discovery from Data (TKDD), 2013



Ann.

Metis

MCL

Newm.

Louv.

LC

IM

AB

RW15

RW0

BFS

0.0 0.2 0.4 0.6 0.8 1.0

BFS

RW0

RW15

AB

IM

LC

Louv.

Newm.

MCL

Metis

Ann

Stru

ctur

al Cl

ass

28


29

Stru

ctur

al C

lass


LJ2

LJ1

DBLP

Amazon

Fly

HS

SC

Ugrad

grad

0.0 0.2 0.4 0.6 0.8 1.0

BFS

RW0

RW15

AB

IM

LC

Louv.

Newm.

MCL

Metis

0.0 0.2 0.4 0.6 0.8 1.0

BFS

RW0

RW15

AB

IM

LC

Louv.

Newm.

MCL

Metis

Probabilistic-SVM classification of annotated communities into 11 structural classes structural class for 9 different networks.

Netw

ork

30


31

Net

wor

k

Can we reveal latent similarities among community detection algorithms?

Our framework enables one to cluster algorithms that behave similarly

32

Step 1: identifying the most important features

7 features out of 36 retain the discriminative power of the full set

33

Grouping algorithms by their tendencieswith respect to most discriminative features

34

High

Medium

Low

Grouping algorithms by their tendencieswith respect to most discriminative features

34

High

Medium

Low

Conclusion of methodology

• We present a methodology to address the complexity of analyzing community structure, which simultaneously considers

– large number of algorithms

– multiple domains of application

– a broad spectrum of metrics to characterize community structure

• A scalable framework that enables

– researchers to compare and understand biases of new and existing community detection algorithms

– practitioners to decide on the most suitable algorithm for particular purpose and network

35

Conclusion of experimental analysis

• Our experimental analysis, which include 10 community detection algorithms and 9 different networks analyzed with 36 properties reveals

– High variability among the output of community detection methods– Annotated communities have a distinct structure from what we

expect• their structure is closer to the output of baseline procedures than to that of popular

algorithms

– A small set of features explain the biases produced by different algorithms

– We can organize the tapestry of available community detection algorithms by grouping them with respect to similarities in behavior

36

Final remarks on future directions

• Traditional methods are unsupervised– they find a particular type of community– little sensitivity to different purposes, structures of interest and domains of

application

• Our approach suggests a supervised approach to community detection– user specifies what they intended to find through examples (real or synthetic)– algorithm learns from those examples and retrieves similar structures in the

network

37

On the Separability of Structural Classes of Communities

Bruno AbrahaoSucheta SoundarajanJohn HopcroftRobert Kleinberg Cornell University

Thank you!

38

Education

On the Separability of Structural Classes of Communities