39
KDD04 Faloutsos, McCurley & Tom kins 1 Carnegie Mellon Fast Discovery of Connection Subgraphs Christos Faloutsos (CMU) Kevin McCurley (IBM) Andrew Tomkins (IBM)

Fast Discovery of Connection Subgraphs

Embed Size (px)

DESCRIPTION

Fast Discovery of Connection Subgraphs. Christos Faloutsos (CMU) Kevin McCurley (IBM) Andrew Tomkins (IBM). Outline. Introduction / Motivation Survey Proposed Method Algorithms Experiments Conclusions. Introduction. What are the best path s between ‘Kidman’ and ‘Diaz’?. Diaz. - PowerPoint PPT Presentation

Citation preview

Page 1: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 1

Carnegie Mellon

Fast Discovery of Connection Subgraphs

Christos Faloutsos (CMU)Kevin McCurley (IBM)Andrew Tomkins (IBM)

Page 2: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 2

Carnegie Mellon

Outline

• Introduction / Motivation

• Survey

• Proposed Method

• Algorithms

• Experiments

• Conclusions

Page 3: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 3

Carnegie Mellon

Introduction

• What are the best paths between ‘Kidman’ and ‘Diaz’?

Kidman

Diaz

Page 4: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 4

Carnegie Mellon

Problem definition

• Given a graph, and two nodes s and t, and a 'budget' b of nodes

• Find the best b nodes that capture the relationship between s and t

s t

f

Page 5: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 5

Carnegie Mellon

Problem definition

• Given a graph, and two nodes s and t, and a 'budget' b of nodes

• Find the best b nodes that capture the relationship between s and t

s t

f

Page 6: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 6

Carnegie Mellon

Problem definition

• Part 1: How to quantify the goodness?

• Part 2: How to pick ‘best few’ nodes?

• Part 3: Scalability: large graphs (10**7 nodes)

s t

f

Page 7: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 7

Carnegie Mellon

Survey

• Graph Partitioning– [Karypis+Kumar]; [Newman+];

– [Virtanen]; …

• Communities– [Flake+]; [Tomkins, Kleinberg+]

• External distances [Palmer+]

Page 8: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 8

Carnegie Mellon

Outline

• Introduction / Motivation

• Survey

• Proposed Method

• Algorithms

• Experiments

• Conclusions

Page 9: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 9

Carnegie Mellon

• part 1: measuring goodness:– electricity

• part 2: finding good paths– dynamic programming

• part 3: scalability– heuristics

Proposed method

Page 10: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 10

Carnegie Mellon

s t

f

Electricity

• Why not shortest path?

Page 11: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 11

Carnegie Mellon

s t

f

Electricity

• Why not shortest path?

• Why not net. flow?

Page 12: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 12

Carnegie Mellon

s t

f

Electricity

• Why not shortest path?

• Why not net. flow?

• Why not plain ‘voltages’?

+1V 0V

Page 13: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 13

Carnegie Mellon

s t

f

Electricity

• Why not shortest path?

• Why not net. flow?

• Why not plain ‘voltages’?

+1V 0V

+0.5V

Page 14: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 14

Carnegie Mellon

s t

f

...

Electricity, cont’d

• Proposed method: voltages with universal sink:– ~ ‘tax collector’

• goodness of a path:

• its electric current(*)!+1V 0V

0V

Page 15: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 15

Carnegie Mellon

Outline

• Introduction / Motivation

• Survey

• Proposed Method

• Algorithms

• Experiments

• Conclusions

Page 16: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 16

Carnegie Mellon

Electricity – Algorithm

• Voltages/Amperages can be computed easily ( O(E) )

• without universal sink:v(i) = Σumj [v(j) * C(i,j) / C(i,*) ]

i != source, sink

v(source)=1; v(sink)=0

Page 17: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 17

Carnegie Mellon

Electricity – Algorithm

With universal sink:v(i) = 1/(1+a) Σumj [v(j) * C(i,j) / C(i,*) ]

(~ insensitive to a (=1))

Page 18: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 18

Carnegie Mellon

Given the voltages and amperages

• Which b nodes to keep?

• (and how to spot them quickly?)

Part 2: DisplayGen

Page 19: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 19

Carnegie Mellon

Part 2: DisplayGen

Page 20: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 20

Carnegie Mellon

Part 2: DisplayGen

• ‘delivered current’ of a path:– ~ ‘how many electrons’ choose this path

=4/5 *1/2A

Page 21: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 21

Carnegie Mellon

Part 2: DisplayGen

• find subgraph that max’s delivered current

• Incrementally, add nodes with max marginal delivered current

Page 22: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 22

Carnegie Mellon

Part 3: Scalability

‘CandidateGen’

• Starting from the large graph

• Eliminate nodes that are too far away to matter

• How?

Page 23: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 23

Carnegie Mellon

s tsource sink

Part 3: Scalability

• By successive, careful expansions

Page 24: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 24

Carnegie Mellon

s t

Part 3: Scalability

Page 25: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 25

Carnegie Mellon

s t

Part 3: Scalability

Page 26: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 26

Carnegie Mellon

s t

Part 3: Scalability

Page 27: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 27

Carnegie Mellon

Pseudo-code

Until (stoppingCriterion) use pickHeuristic() to pick a node n

expand node n

Page 28: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 28

Carnegie Mellon

Pseudo-code

pickHeuristic() favors• Nearby nodes with• Strong connections to source or sink

and with• Small degree

Page 29: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 29

Carnegie Mellon

Outline

• Introduction / Motivation

• Survey

• Proposed Method

• Algorithms

• Experiments

• Conclusions

Page 30: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 30

Carnegie Mellon

Experiments

• on large real graph – ~15M nodes, ~100M edges, weighted

– ‘who co-appears with whom’ (from 500M web pages)

• Q1: Quality of ‘voltage’ approach?

• Q2: Speed/accuracy trade-off?

Page 31: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 31

Carnegie Mellon

Q1: Quality

• Actors (A); Computer-Scientists (CS)

• Kidman-Diaz (A-A)

• Negreponte-Palmisano (CS-CS)

• Turing-Stone (CS-A)

Page 32: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 32

Carnegie Mellon

(A-A) Kidman-Diaz

Strong, direct link

• What are the best paths between ‘Kidman’ and ‘Diaz’?

Kidman

Diaz

Page 33: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 33

Carnegie Mellon

CS-CS: Negreponte - Palmisano

NN SP

• Mainly: CEOs of major Computer companies (Dell, Gates, Fiorina, ++)

Page 34: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 34

Carnegie Mellon

CS-CS: Negreponte - Palmisano

NNEsther Dyson Louis Gerstner

SP

Page 35: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 35

Carnegie Mellon

CS-A: Turing - Stone

TuringAnderson

Stone

Page 36: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 36

Carnegie Mellon

Outline

• Introduction / Motivation

• ...

• Experiments– Q1: quality

– Q2: speed/accuracy trade-off

• Conclusions

Page 37: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 37

Carnegie Mellon

Speed/Accuracy Trade-off

number of nodes kept (‘b’)

deliveredcurrent Kleinberg-Newell

Rivest-HoffmanTuring-StoneKidman-Diaz

Page 38: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 38

Carnegie Mellon

Speed/accuracy trade-off

• 80/20-like rule:

• the first few nodes/paths contribute the vast majority of ‘delivered current’

• Thus: CandidateGen makes sense

Page 39: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 39

Carnegie Mellon

Conclusions

• Defined the problem• Part 1: Electricity-based method to measure

quality• Part 2: Dynamic programming to spot best

paths (‘DisplayGen’)• Part 3: Scalability with good accuracy

(‘CandidateGen’)• Operational system