GraphScope: Parameter-Free Mining of Large Time-Evolving Graphs

Jimeng Sun CMU

Spiros Papadimitriou IBM

Philip S. Yu IBM

Christos Faloutsos CMU

Motivation of GraphScope

Time-evolving graphs: network traffic graphs, email networks, customer-product relationships, call detail records in telecom networks, financial transaction data.

Key questions:

1. How to monitor community structures?

2. How to detect the change points?

1. Community discovery

[Figure: a customer-product graph and its adjacency matrix (axes ticked 5 to 25), with customer groups (CEOs, researchers) and product groups (books, BMWs). Block annotations: 289/300, 48/50, 5/200, 2/75; density labels: 97%, 96%, 3%, 3%, 54%.]

Simultaneously group customers and products, or source-destination traffic graphs, or sender-recipient communication, etc.

[Figure: e.g., a customers-by-products matrix whose rows are arranged into customer groups and whose columns are arranged into product groups.]

2. Change detection

Find change points in group structure.

[Figure: customers-by-products matrices along a time axis; the group structure changes at the holiday season.]

Problem definition

Given graphs G1, G2, …, Gt, where each Gi is an n-by-m adjacency matrix:

1. Partition them into time segments G(1), G(2), …

2. For each segment, identify the groups.

Desired properties: 1. Scalable, 2. Parameter-free, 3. Incremental

[Figure: graphs along a time axis, partitioned into segments G(1), G(2).]

Outline

Motivation

GraphScope: community discovery, change detection

Experiments

Community detection

Clustering problem, cast as a compression problem

[Figure: graph snapshots at t = 0, t = 1, t = 2.]

Cost objective within a time segment

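[Figure: adjacency matrix partitioned into k = 3 row groups of sizes $n_1, n_2, n_3$ and ℓ = 3 col. groups of sizes $m_1, m_2, m_3$; block (i, j) is labeled with $p_{i,j}$.]

Here $d$ is the segment duration and $p_{i,j}$ is the density of ones (edges) in block (i, j).

Code cost: $d\, n_1 m_2\, H(p_{1,2})$ bits for block (1,2), and in total

$$\sum_{i,j} d\, n_i m_j\, H(p_{i,j}) \text{ bits.}$$

Description cost:

$$\sum_{i,j} \log(d\, n_i m_j) + (\text{row and column partition description}) + \log^* d$$

Total cost = code cost + description cost.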
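The objective above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: $H$ is taken to be the binary Shannon entropy in bits, $\log^*$ is taken to be the iterated-logarithm integer code length, and the names `binary_entropy`, `log_star`, and `segment_cost`, as well as the list-of-lists matrix representation, are hypothetical. The partition description terms are left out, so the description cost returned here is partial.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) source; zero when p is 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def log_star(x):
    """log2(x) + log2(log2(x)) + ... over the positive terms (assumed form)."""
    total = 0.0
    term = math.log2(x) if x > 0 else 0.0
    while term > 0:
        total += term
        term = math.log2(term)
    return total

def segment_cost(snapshots, row_groups, col_groups):
    """Return (code cost, partial description cost) of one time segment.

    snapshots: list of d binary n-by-m matrices (lists of lists).
    row_groups[r], col_groups[c]: group index of row r / column c.
    """
    d = len(snapshots)
    k = max(row_groups) + 1
    l = max(col_groups) + 1
    n_sizes = [row_groups.count(i) for i in range(k)]
    m_sizes = [col_groups.count(j) for j in range(l)]

    # Count the ones that fall in each block, summed over the segment.
    ones = [[0] * l for _ in range(k)]
    for graph in snapshots:
        for r, row in enumerate(graph):
            for c, value in enumerate(row):
                if value:
                    ones[row_groups[r]][col_groups[c]] += 1

    code = 0.0
    desc = log_star(d)  # cost of stating the segment duration
    for i in range(k):
        for j in range(l):
            cells = d * n_sizes[i] * m_sizes[j]
            if cells == 0:
                continue
            code += cells * binary_entropy(ones[i][j] / cells)
            desc += math.log2(cells)  # the log(d * n_i * m_j) term
    return code, desc
```

For two identical 4-by-4 block-diagonal snapshots, a single row group and a single column group give a code cost of 32 bits, while the matching two-by-two grouping gives a code cost of 0 bits at a somewhat higher description cost.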

Total cost = code cost (blocks) + description cost (blocks' model)

One row group, one col group: code cost high, description cost low.

n row groups, m col groups: code cost low, description cost high.

k = 3 row groups, ℓ = 3 col groups: code cost low, description cost low.

Search for the optimum grouping

The problem is NP-hard, even for a single timestamp and column permutation only (reduction from the TSP problem [Johnson+ 03]).

Heuristics:

Search: Split, Merge, Shuffle

Initialization: Resume, Restart
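To make the Shuffle heuristic concrete, here is a minimal Python sketch of one row-shuffle pass on a single binary matrix. It is an assumption-laden illustration, not the authors' procedure: block densities are held fixed during the pass, each row moves to the row group that encodes it in the fewest bits, and the name `shuffle_rows` and the `eps` clamp are hypothetical. For a whole time segment the counts would be summed over its snapshots.

```python
import math

def shuffle_rows(matrix, row_groups, col_groups, k, l, eps=1e-9):
    """Reassign every row to its cheapest row group (one pass)."""
    n, m = len(matrix), len(matrix[0])
    m_sizes = [col_groups.count(j) for j in range(l)]
    n_sizes = [row_groups.count(i) for i in range(k)]

    # Number of ones each row has in each column group.
    row_counts = [[0] * l for _ in range(n)]
    for r in range(n):
        for c in range(m):
            if matrix[r][c]:
                row_counts[r][col_groups[c]] += 1

    # Current density of ones in each block (0.5 for empty blocks).
    dens = [[0.5] * l for _ in range(k)]
    for i in range(k):
        for j in range(l):
            cells = n_sizes[i] * m_sizes[j]
            if cells:
                ones = sum(row_counts[r][j] for r in range(n) if row_groups[r] == i)
                dens[i][j] = ones / cells

    def row_cost(r, i):
        """Bits needed to encode row r with the densities of row group i."""
        cost = 0.0
        for j in range(l):
            p = min(max(dens[i][j], eps), 1.0 - eps)  # avoid log(0)
            x = row_counts[r][j]
            cost -= x * math.log2(p) + (m_sizes[j] - x) * math.log2(1.0 - p)
        return cost

    return [min(range(k), key=lambda i: row_cost(r, i)) for r in range(n)]
```

A column shuffle is the same pass applied to the transposed matrix with the roles of the two partitions swapped. Repeating row and column passes until the total cost stops decreasing is one simple way to use it.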


Change point detection

Option 1: Append to the current segment

Option 2: Start a new segment (change point)

In both cases, we do row and col. shuffles, splits and/or merges.

Choose the most parsimonious option, i.e., the one with the lower total encoding cost.
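The decision between the two options can be sketched as a comparison of bit costs. The Python below is a simplified illustration, not the authors' implementation: `segment_bits` is a hypothetical stand-in cost (block code cost plus a log term per block), and the partitions for both options are passed in by the caller, whereas a full implementation would re-run the split, merge, and shuffle search for each option.

```python
import math

def entropy(p):
    """Binary entropy in bits; zero when p is 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def segment_bits(snapshots, row_groups, col_groups):
    """Simplified segment cost: sum over blocks of cells * H(density)
    plus log2(cells). This is an assumed stand-in for the full objective."""
    k, l = max(row_groups) + 1, max(col_groups) + 1
    cells = [[0] * l for _ in range(k)]
    ones = [[0] * l for _ in range(k)]
    for graph in snapshots:
        for r, row in enumerate(graph):
            for c, value in enumerate(row):
                cells[row_groups[r]][col_groups[c]] += 1
                ones[row_groups[r]][col_groups[c]] += 1 if value else 0
    bits = 0.0
    for i in range(k):
        for j in range(l):
            if cells[i][j]:
                bits += cells[i][j] * entropy(ones[i][j] / cells[i][j])
                bits += math.log2(cells[i][j])
    return bits

def append_or_split(segment, new_graph, seg_parts, new_parts):
    """Return 'append' or 'new segment', whichever needs fewer bits.

    seg_parts, new_parts: (row_groups, col_groups) used for the current
    segment and for the candidate new segment (supplied by the caller).
    """
    cost_append = segment_bits(segment + [new_graph], *seg_parts)
    cost_split = (segment_bits(segment, *seg_parts)
                  + segment_bits([new_graph], *new_parts))
    return "append" if cost_append <= cost_split else "new segment"
```

With a segment of block-diagonal snapshots, a new snapshot with the same structure is appended, while a snapshot whose blocks are flipped makes the appended segment inhomogeneous and therefore more expensive than starting a new segment.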

Experiments

Objectives:

Effectiveness on community discovery and change detection

Compression benefit

Scalable, incremental computation

Evolving communities: NETWORK

29K hosts (nodes), 12K edges per hour (on avg), 1,220 hours, ~14.6M edges total

[Figure: evolving communities over time.]

Community change points: ENRON

34K email addresses, 12K emails per week (on avg), 165 weeks, ~2M emails total

Key change points correspond to key events.

Compression gain

GraphScope gives 10%-150% compression gain.

[Figure: compression gain of GraphScope.]

Graph stream clustering scalability: NETWORK

29K hosts (nodes), 12K edges per hour (on average), 1,220 hours (timestamps), ~14.6M edges total

< 2 sec / snapshot on avg

Related work

Co-clustering [Dhillon+ KDD03] [Chakrabarti+ KDD04]

Graph partitioning [Karypis+ 99]

Time-evolving graphs [Chakrabarti+ KDD06] [Chi+ KDD07] [Asur+ KDD07]


Summary

Organize into few, homogeneous communities

Find changes in community structure

Scalable, parameter-free, incremental


Graph clustering – [Chakrabarti+ KDD’04]

[Figure: two alternative arrangements of the same matrix into row groups and column groups, shown side by side ("versus").]

Why is this better?

Good clustering:

1. Similar nodes are grouped together

2. As few groups as necessary

A few, homogeneous blocks imply good compression.

Good clustering implies good compression.
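A small numeric check of this claim, as a hedged Python sketch: it computes the code cost $\sum_{i,j} n_i m_j H(p_{i,j})$ of one toy matrix under two groupings. The toy matrix, the function names, and the use of binary entropy in bits are assumptions made for illustration only.

```python
import math

def entropy(p):
    """Binary entropy in bits; zero when p is 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def code_cost(matrix, row_groups, col_groups):
    """Sum over blocks of (number of cells) * H(density of ones), in bits."""
    k, l = max(row_groups) + 1, max(col_groups) + 1
    cells = [[0] * l for _ in range(k)]
    ones = [[0] * l for _ in range(k)]
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            cells[row_groups[r]][col_groups[c]] += 1
            ones[row_groups[r]][col_groups[c]] += value
    return sum(cells[i][j] * entropy(ones[i][j] / cells[i][j])
               for i in range(k) for j in range(l) if cells[i][j])

# Toy matrix with two clean communities.
matrix = [[1, 1, 0, 0],
          [1, 1, 0, 0],
          [0, 0, 1, 1],
          [0, 0, 1, 1]]

# Grouping that matches the communities: every block is all ones or all zeros.
good = code_cost(matrix, [0, 0, 1, 1], [0, 0, 1, 1])
# Grouping that mixes the communities: every block is half ones.
poor = code_cost(matrix, [0, 1, 0, 1], [0, 1, 0, 1])
```

Both groupings use the same number of groups, so their description costs are comparable; the homogeneous grouping needs 0 bits of code cost, while the mixed one needs 16 bits, one per cell.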

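Cost objective

[Figure: n × m adj. matrix partitioned into k = 3 row groups of sizes $n_1, n_2, n_3$ and ℓ = 3 col. groups of sizes $m_1, m_2, m_3$; block (i, j) is labeled with $p_{i,j}$, its density of ones (edges).]

Code cost: $n_1 m_2 H(p_{1,2})$ bits for block (1,2), and in total

$$\sum_{i,j} n_i m_j H(p_{i,j}) \text{ bits.}$$

This assumes the group partitionings, sizes and densities are given.

Description cost:

$$\sum_i (\text{row-partition description}) + \sum_j (\text{col-partition description}) + \sum_{i,j} \log(n_i m_j)$$

The partition descriptions correspond to the block size entropy, and the last term is the cost of transmitting the number of edges $e_{i,j}$ of each block.

Total cost = code cost + description cost.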

Graph clustering scalability

[Figure: Time vs. Size. Time (sec) against number of edges, for splits and shuffles.]

Linear in the number of edges: scalable.

Cost objective

Total cost = code cost (blocks) + description cost

One row group, one col group: code cost high, description cost low.

n row groups, m col groups: code cost low, description cost high.

k = 3 row groups, ℓ = 3 col groups: code cost low, description cost low.

Search for optimum

[Figure: Cost vs. number of groups. Bit cost plotted over k and ℓ, with three marked configurations: one row group, one col group; k = 3 row groups, ℓ = 3 col groups; n row groups, m col groups.]

Search for optimum: summary

Split: increase k or ℓ

Shuffle: rearrange rows and cols

Merge: decrease k or ℓ

[Figure: alternating split and shuffle steps, from k = 1, ℓ = 1 through k=1, ℓ=2; k=2, ℓ=2; k=2, ℓ=3; k=3, ℓ=3; k=3, ℓ=4; k=4, ℓ=4; k=4, ℓ=5 to k = 5, ℓ = 5.]
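As an illustration of the Split step, the following Python sketch increases k by one for the rows of a single binary matrix. It is one plausible reading of the heuristic, not a statement of the authors' exact rule: it picks the row group with the highest code cost per row and moves, one row at a time, every row whose removal lowers that per-row cost into a new group. The names `group_bits_per_row` and `split_rows` are hypothetical.

```python
import math

def entropy(p):
    """Binary entropy in bits; zero when p is 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def group_bits_per_row(matrix, rows, col_groups, l):
    """Code cost of one row group divided by its number of rows."""
    if not rows:
        return 0.0
    bits = 0.0
    for j in range(l):
        cols = [c for c, g in enumerate(col_groups) if g == j]
        cells = len(rows) * len(cols)
        if cells:
            ones = sum(matrix[r][c] for r in rows for c in cols)
            bits += cells * entropy(ones / cells)
    return bits / len(rows)

def split_rows(matrix, row_groups, col_groups, k, l):
    """Return (new row assignment, new number of row groups)."""
    members = {i: [r for r, g in enumerate(row_groups) if g == i]
               for i in range(k)}
    # Row group that is currently the most expensive per row.
    worst = max(range(k),
                key=lambda i: group_bits_per_row(matrix, members[i], col_groups, l))
    kept = list(members[worst])
    new_groups = list(row_groups)
    for r in members[worst]:
        without = [x for x in kept if x != r]
        if not without:
            continue  # never empty the original group
        current = group_bits_per_row(matrix, kept, col_groups, l)
        candidate = group_bits_per_row(matrix, without, col_groups, l)
        if candidate < current:
            kept = without
            new_groups[r] = k  # move the row into the new group
    moved = any(g == k for g in new_groups)
    return new_groups, (k + 1 if moved else k)
```

A column split follows by symmetry, and a shuffle pass would normally follow each split to let the remaining rows and columns settle into the enlarged set of groups.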


Given a graph of interactions or associations: customers to products, documents to terms, people to people, computer communications, financial transactions.

Find simultaneously: communities (source and destination), and their number.