Frequent Subgraph Pattern Mining on Uncertain Graph Data

Zhaonian Zou, Jianzhong Li, Hong Gao, Shuo Zhang

Harbin Institute of Technology, China

CIKM’09, Hong Kong, Nov 4, 2009

Outline

Background

Problem Definition

Algorithm

Experimental Results

Conclusions

Background

Graph mining has played an important role in a range of real-world applications.

Medicine: structures of molecules
Bioinformatics: biological networks
Technology: the WWW
Social science: social networks
Many others

Directions of Graph Mining

Patterns of graphs, e.g., [Yan et al. ICDM’02]

Privacy of graphs, e.g., [Zou et al. VLDB’09]

Uncertainties of graphs

Models of graphs, e.g., [Leskovec et al. KDD’05]

Evolution of graphs, e.g., [Faloutsos et al. SIGMOD’07]

Uncertainties of Graphs: Example I

Protein-Protein Interaction (PPI) Networks

Vertices: proteins
Edges: interactions between proteins
Uncertainties: probabilities of interactions really existing

The data are taken from the STRING Database (http://string-db.org).

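[Figure: a fragment of a PPI network over the proteins NTG1, FET3, TIF34, SMT3, RPC40 and RAD59, with edges labeled by the interaction probabilities 0.375, 0.639, 0.651, 0.147, 0.651, 0.639, 0.867 and 0.698.]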

Uncertainties of Graphs: Example II

Topologies of wireless sensor networks (WSNs)

Vertices: sensor nodes
Edges: wireless links between sensor nodes
Uncertainties: probabilities of wireless links functioning at any given time

[Figure: a WSN topology with links labeled by the probabilities 0.75, 0.92, 0.88, 0.95 and 0.69.]

Preliminaries

The support of a subgraph pattern S = (the number of graphs containing S) / (the total number of graphs)

[Figure: a graph database containing two graphs, G1 and G2, with vertex labels A and B and edge labels x, y and z, next to two subgraph patterns, one with support = 1.0 and one with support = 0.5.]
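The support definition can be sketched in a few lines of Python. This is only an illustration under assumptions: the graph encoding (a dict of vertex labels plus a dict of labeled edges), the helper names `edge`, `contains` and `support`, and the toy graphs G1, G2, S1 and S2 are all hypothetical and are not the graphs from the slide's figure. Containment is tested by brute force over vertex mappings, which is only practical for tiny graphs.

```python
from itertools import permutations


def edge(u, v):
    """Undirected edge key (hypothetical encoding)."""
    return frozenset((u, v))


def contains(graph, pattern):
    """Return True if `pattern` is subgraph-isomorphic to `graph` (brute force)."""
    g_labels, g_edges = graph
    p_labels, p_edges = pattern
    p_nodes = list(p_labels)
    # Try every injective mapping of pattern vertices onto graph vertices.
    for image in permutations(g_labels, len(p_nodes)):
        f = dict(zip(p_nodes, image))
        if any(p_labels[v] != g_labels[f[v]] for v in p_nodes):
            continue  # vertex labels must match
        if all(g_edges.get(frozenset(f[v] for v in e)) == label
               for e, label in p_edges.items()):
            return True  # every pattern edge maps to an equally labeled edge
    return False


def support(database, pattern):
    """Fraction of graphs in `database` that contain `pattern`."""
    return sum(contains(g, pattern) for g in database) / len(database)


# Toy data (assumed for illustration only).
G1 = ({1: "A", 2: "B", 3: "B"}, {edge(1, 2): "x", edge(1, 3): "y"})
G2 = ({1: "A", 2: "B"}, {edge(1, 2): "x"})
S1 = ({"a": "A", "b": "B"}, {edge("a", "b"): "x"})
S2 = ({"a": "A", "b": "B"}, {edge("a", "b"): "y"})

print(support([G1, G2], S1), support([G1, G2], S2))
```

In this toy database S1 is contained in both graphs and S2 only in G1, mirroring the two support values shown on the slide.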

Frequent Subgraph Pattern Mining Problem

Input: a graph database D, and a support threshold minsup

Output: all subgraph patterns with support no less than minsup

FSP mining on biological networks (e.g., PPI networks) is an important tool for discovering functional modules [Koyutürk et al. Bioinformatics 04, Turanalp et al. BMC Bioinformatics 08].

PPI networks are subject to uncertainties. How do we define support?

Model of Uncertain Graphs

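[Figure: an uncertain graph with vertex labels A and B, edge labels x and y, and edge existence probabilities 0.5, 0.6, 0.7 and 0.8, shown with two of its implicated graphs.]

An implicated graph is one form in which the uncertain graph may exist. Its probability multiplies p for every edge it keeps and (1 - p) for every edge it drops.

Implicated graph without the 0.5 edge: (1 - 0.5) * 0.6 * 0.7 * 0.8 = 0.168

Implicated graph without the 0.6 edge: 0.5 * (1 - 0.6) * 0.7 * 0.8 = 0.112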

Model of Uncertain Graphs (Cont’d)

Theorem: An uncertain graph represents a probability distribution over all its implicated graphs.
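A minimal sketch of the model, assuming an uncertain graph is given simply as a list of independent edge existence probabilities (the function name `implicated_probability` and the list encoding are illustrative, not from the slides). It reproduces the two probabilities from the example and checks numerically that the probabilities of all implicated graphs sum to 1, as the theorem states.

```python
from itertools import product


def implicated_probability(edge_probs, present):
    """Probability that the uncertain graph exists with exactly the edges in `present`."""
    prob = 1.0
    for p, keep in zip(edge_probs, present):
        prob *= p if keep else 1.0 - p
    return prob


# Edge probabilities of the example uncertain graph.
G = [0.5, 0.6, 0.7, 0.8]

first = implicated_probability(G, [False, True, True, True])   # drops the 0.5 edge
second = implicated_probability(G, [True, False, True, True])  # drops the 0.6 edge

# Sum over all 2^4 implicated graphs.
total = sum(implicated_probability(G, present)
            for present in product([False, True], repeat=len(G)))
```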

Uncertain Graph Databases

[Figure: an uncertain graph database with two uncertain graphs. G1 has four edges with existence probabilities 0.5, 0.6, 0.7 and 0.8; G2 has three edges with existence probabilities 0.8, 0.1 and 0.7.]

In total, there are 2^4 * 2^3 = 128 implicated graph databases.

Theorem: An uncertain graph DB represents a probability distribution over all its implicated graph DBs.

[Figure: an implicated graph database made of one implicated graph of G1 and one implicated graph of G2.]

The uncertain graph DB exists in this form with probability

((1 - 0.5) * 0.6 * 0.7 * 0.8) * (0.8 * 0.1 * (1 - 0.7)) = 4.032 * 10^-3
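The count and the probability above can be checked with a short enumeration. This is a sketch under the assumption that the uncertain graphs are independent and each is encoded as a list of edge probabilities; the names `implicated_graphs` and `implicated_databases` are hypothetical.

```python
from itertools import product
from math import prod


def implicated_graphs(edge_probs):
    """Yield (present, probability) for every implicated graph of one uncertain graph."""
    for present in product([False, True], repeat=len(edge_probs)):
        yield present, prod(p if keep else 1.0 - p
                            for p, keep in zip(edge_probs, present))


def implicated_databases(database):
    """Yield (shape, probability) for every implicated graph database."""
    for combo in product(*(list(implicated_graphs(g)) for g in database)):
        shape = tuple(present for present, _ in combo)
        yield shape, prod(prob for _, prob in combo)


# G1 and G2 from the example, as edge-probability lists.
D = [[0.5, 0.6, 0.7, 0.8], [0.8, 0.1, 0.7]]
distribution = dict(implicated_databases(D))

# The implicated database from the example: G1 drops its 0.5 edge, G2 drops its 0.7 edge.
example = ((False, True, True, True), (True, True, False))
```

Here `len(distribution)` is the number of implicated graph databases and `distribution[example]` is the probability of the one shown on the slide.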

Expected Support

[Figure: an uncertain graph DB D implicating the graph databases d1, d2, ..., dn.]

p1 = Pr(D implicates d1), p2 = Pr(D implicates d2), ..., pn = Pr(D implicates dn)

s1 = support of S in d1, s2 = support of S in d2, ..., sn = support of S in dn

The expected support of S is

$$\mathit{esup}(S) = \sum_{i=1}^{n} s_i \, p_i$$
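To make the definition concrete, the following sketch computes the expected support by enumerating every implicated graph database, which is exponential and only feasible for toy inputs. The encoding is an assumption: each uncertain graph is a pair of edge probabilities and a list of embeddings of S, where an embedding is the set of edge indices that must all be present. The embeddings used for G2 are hypothetical, since the figure does not survive in the text.

```python
from itertools import product


def implicated_graphs(edge_probs):
    """Yield (present, probability) for every implicated graph of one uncertain graph."""
    for present in product([False, True], repeat=len(edge_probs)):
        prob = 1.0
        for p, keep in zip(edge_probs, present):
            prob *= p if keep else 1.0 - p
        yield present, prob


def expected_support(database):
    """esup(S) = sum of s_i * p_i over all implicated graph databases d_i."""
    per_graph = [list(implicated_graphs(probs)) for probs, _ in database]
    esup = 0.0
    for choice in product(*per_graph):  # one implicated graph database d_i
        p_i = 1.0
        hits = 0
        for (present, prob), (_, embeddings) in zip(choice, database):
            p_i *= prob
            # S is contained if all edges of at least one embedding are present.
            if any(all(present[e] for e in emb) for emb in embeddings):
                hits += 1
        s_i = hits / len(database)  # support of S in d_i
        esup += s_i * p_i
    return esup


DATABASE = [
    ([0.5, 0.6, 0.7, 0.8], [(0, 1), (0, 3), (1, 2), (2, 3)]),  # G1
    ([0.8, 0.1, 0.7], [(0, 1)]),                               # G2 (assumed embedding)
]
print(expected_support(DATABASE))
```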

FSP Mining Problem on Uncertain Graphs

Input: an uncertain graph database D, and an expected support threshold minsup

Output: all subgraph patterns with expected support no less than minsup

It is #P-hard to count the number of frequent subgraph patterns. Reduction from the problem of counting the number of satisfying truth assignments of a monotone k-CNF formula.

The FSP mining problem on uncertain graphs is NP-hard.

Approximation Method

It is #P-hard to compute the expected support of a subgraph pattern.

We develop an approximation method to find an approximate set of frequent subgraph patterns. Let e (0 < e < 1) be a relative error tolerance.

[Figure: the expected-support axis from 0 to 1, marked at (1-e) minsup and minsup, with three regions: Discard (below (1-e) minsup), Arbitrary (between (1-e) minsup and minsup) and Output (at least minsup).]

Objective I

Difficulty I: The number of frequent subgraph patterns is exponentially large.

Objective I: Examine subgraph patterns as efficiently as possible to find all frequent ones.

Method for Objective I

Step 1: Build a search tree T of subgraph patterns.

Step 2: Examine subgraph patterns in T in depth-first order. If S is infrequent, then all its descendants can be pruned.

[Figure: the uncertain graphs G1 and G2 from the earlier example, with the expected-support axis and its Discard, Arbitrary and Output regions.]
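The depth-first traversal with pruning can be sketched generically. Everything concrete here is assumed for illustration: the search tree is a hand-written dict, the pattern names S1 to S5 and their expected supports are made up, and `children` and `is_frequent` stand in for the real pattern-extension and frequency-test procedures.

```python
def mine_depth_first(root, children, is_frequent):
    """Return the frequent patterns reachable from `root`, pruning below infrequent ones."""
    results = []
    stack = [root]
    while stack:
        pattern = stack.pop()
        if not is_frequent(pattern):
            continue  # prune: descendants of an infrequent pattern are never examined
        results.append(pattern)
        # Push children in reverse so the first child is examined first.
        stack.extend(reversed(children(pattern)))
    return results


# Toy search tree and expected supports (assumed values).
TREE = {"S1": ["S2", "S3"], "S2": ["S4"], "S3": ["S5"]}
ESUP = {"S1": 0.9, "S2": 0.6, "S3": 0.2, "S4": 0.5, "S5": 0.1}
MINSUP = 0.4

examined = []


def is_frequent(pattern):
    examined.append(pattern)
    return ESUP[pattern] >= MINSUP


frequent = mine_depth_first("S1", lambda s: TREE.get(s, []), is_frequent)
```

In this toy run S3 is infrequent, so its child S5 never reaches the frequency test.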

Objective II

Difficulty II: It is #P-hard to compute the expected support esup(S) of a subgraph pattern S.

Objective II: Make the following judgments without computing esup(S) exactly.

If esup(S) is surely not in the green region, then discard S.

If esup(S) is probably in the green region and surely not in the red region, then output S.

[Figure: the expected-support axis from 0 to 1, marked at (1-e) minsup and minsup, with the regions Discard (red), Arbitrary and Output (green).]

Method for Objective II

Step 1: Approximate esup(S) by an interval [l, u] such that esup(S) ∈ [l, u].

Step 2: Decide whether S can be output or not by testing the following conditions.

[Figure: positions of the interval [l, u] on the expected-support axis relative to (1-e) minsup and minsup, with the outcomes Output, Discard and Shrink.]
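The exact conditions were shown in a figure that does not survive in this text. The sketch below is one decision rule that is consistent with the three outcomes named on the slide, not necessarily the authors' own conditions; the function name `decide` and the string results are hypothetical. It relies only on the guarantee that esup(S) lies in [l, u].

```python
def decide(l, u, minsup, e):
    """Classify a pattern from an interval [l, u] known to contain esup(S)."""
    if l >= (1 - e) * minsup:
        return "output"   # esup(S) >= (1-e)*minsup, so S is not in the Discard region
    if u < minsup:
        return "discard"  # esup(S) < minsup, so S is not in the Output region
    return "shrink"       # [l, u] spans both thresholds; tighten it and test again
```

If the interval has width at most e*minsup, the "shrink" case cannot occur: u >= minsup then implies l >= u - e*minsup >= (1-e)*minsup.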

Approximating esup(S) by [l,u]

$$\Pr(S \text{ occurs in } G) = \sum_{I:\ I \text{ is implicated by } G \text{ and } S \text{ is contained in } I} \Pr(G \text{ implicates } I)$$

$$\mathit{esup}(S) = \frac{1}{|D|} \sum_{i=1}^{|D|} \Pr(S \text{ occurs in } G_i)$$

A subgraph pattern S occurs in an uncertain graph G if S is contained in at least one implicated graph of G.

Algorithm: Approximate esup(S) by [l,u]

Step 1: For each uncertain graph Gi in D, approximate Pr(S occurs in Gi) by an interval [li, ui] of width at most e*minsup.

Step 2:

$$l = \frac{1}{|D|} \sum_{i=1}^{|D|} l_i, \qquad u = \frac{1}{|D|} \sum_{i=1}^{|D|} u_i$$
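Step 2 amounts to averaging the per-graph bounds. A minimal sketch, assuming the intervals [li, ui] from Step 1 are already available as pairs (the function name `combine_intervals` is illustrative):

```python
def combine_intervals(intervals):
    """Average per-graph bounds [l_i, u_i] into the bounds [l, u] on esup(S)."""
    n = len(intervals)
    l = sum(lo for lo, _ in intervals) / n
    u = sum(hi for _, hi in intervals) / n
    return l, u
```

Because u - l is the average of the individual widths, the combined interval is no wider than e*minsup whenever every [li, ui] is.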

Approximate Pr(S occurs in Gi) by [li, ui]

[Figure: the uncertain graph Gi, whose four edges are tagged x1, x2, x3 and x4 and carry the existence probabilities 0.5, 0.6, 0.7 and 0.8, next to the pattern S with vertex labels A, B, B and edge labels x, y.]

Step 1: Find all embeddings of S in Gi. Here there are 4 embeddings.

Step 2: Assign boolean variables to the edges in the embeddings. Pr(x1) = 0.5, Pr(x2) = 0.6, Pr(x3) = 0.7, Pr(x4) = 0.8.

Step 3: Construct a conjunctive formula for each embedding. C1 = (x1 ∧ x2), C2 = (x1 ∧ x4), C3 = (x2 ∧ x3), C4 = (x3 ∧ x4).

Step 4: Construct a DNF formula. F = C1 ∨ C2 ∨ C3 ∨ C4.

Step 5: Estimate Pr(F = TRUE) by p using Karp & Luby’s Markov-Chain Monte-Carlo method with absolute error e*minsup/2 and confidence d (d ∈ [0,1]).

Step 6: [li, ui] = [p - e*minsup/2, p + e*minsup/2].
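Steps 4 to 6 can be illustrated on the example formula. The sketch below is a Karp-Luby-style coverage estimator written for this note, not the authors' implementation: the function names are hypothetical, the sample count is passed in directly rather than derived from the error and confidence parameters, and the values of e and minsup in the usage are assumed. An exact brute-force evaluation is included for comparison; for this F it gives 0.782, since F is equivalent to (x1 ∨ x3) ∧ (x2 ∨ x4).

```python
import random
from itertools import product


def dnf_probability_exact(probs, clauses):
    """Pr(F = TRUE) by summing over all assignments (exponential; toy sizes only)."""
    total = 0.0
    for assignment in product([False, True], repeat=len(probs)):
        if any(all(assignment[v] for v in clause) for clause in clauses):
            weight = 1.0
            for p, value in zip(probs, assignment):
                weight *= p if value else 1.0 - p
            total += weight
    return total


def dnf_probability_karp_luby(probs, clauses, samples, rng):
    """Unbiased coverage estimate of Pr(F = TRUE) for a monotone DNF F."""
    weights = []
    for clause in clauses:
        w = 1.0
        for v in clause:
            w *= probs[v]
        weights.append(w)  # probability that this clause alone is satisfied
    total_weight = sum(weights)
    acc = 0.0
    for _ in range(samples):
        # Pick a clause proportionally to its weight, then force it to be true.
        j = rng.choices(range(len(clauses)), weights=weights)[0]
        forced = set(clauses[j])
        assignment = [v in forced or rng.random() < probs[v]
                      for v in range(len(probs))]
        satisfied = sum(all(assignment[v] for v in c) for c in clauses)
        acc += 1.0 / satisfied  # at least 1, because clause j is satisfied
    return total_weight * acc / samples


# Variables x1..x4 are indexed 0..3; clauses C1..C4 from Step 3.
PROBS = [0.5, 0.6, 0.7, 0.8]
CLAUSES = [(0, 1), (0, 3), (1, 2), (2, 3)]

p = dnf_probability_karp_luby(PROBS, CLAUSES, 20000, random.Random(0))
e, minsup = 0.1, 0.4  # assumed values for illustration
interval = (p - e * minsup / 2, p + e * minsup / 2)  # Step 6
```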

Experimental Results

Data: The STRING Database (http://string-db.org)

Time Efficiency

Approximation Quality

Scalability

Conclusions

A new model of uncertain graph data has been proposed.

The frequent subgraph pattern mining problem on uncertain graph data has been formalized.

The computational complexity of the problem has been formally proved to be NP-hard.

An approximate mining algorithm has been proposed.

The proposed algorithm has high efficiency, high approximation quality, and high scalability.

Thank you