Frequent Subgraph Pattern Mining on Uncertain Graph Data

Zhaonian Zou, Jianzhong Li, Hong Gao, Shuo Zhang

Harbin Institute of Technology, China

CIKM’09, Hong Kong, Nov 4, 2009

Outline

Background

Problem Definition

Algorithm

Experimental Results

Conclusions

Background

Graph mining has played an important role in a range of real-world applications:

Medicine: structures of molecules

Bioinformatics: biological networks

Technology: the WWW

Social science: social networks

Many others

Directions of Graph Mining

Patterns of graphs, e.g., [Yan et al. ICDM’02]

Privacy of graphs, e.g., [Zou et al. VLDB’09]

Uncertainties of graphs

Models of graphs, e.g., [Leskovec et al. KDD’05]

Evolution of graphs, e.g., [Faloutsos et al. SIGMOD’07]

Uncertainties of Graphs: Example I

Protein-Protein Interaction (PPI) Networks

Vertices: proteins

Edges: interactions between proteins

Uncertainties: probabilities that the interactions really exist

The data are taken from the STRING Database (http://string-db.org).

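[Figure: an example PPI network over the proteins NTG1, FET3, TIF34, SMT3, RPC40 and RAD59, whose edges carry the probabilities 0.375, 0.639, 0.651, 0.147, 0.651, 0.639, 0.867 and 0.698.]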

Uncertainties of Graphs: Example II

Topologies of wireless sensor networks (WSNs)

Vertices: sensor nodes

Edges: wireless links between sensor nodes

Uncertainties: probabilities of wireless links functioning at any given time

[Figure: an example WSN topology whose links carry the probabilities 0.75, 0.92, 0.88, 0.95 and 0.69.]

Problem Definition

Preliminaries

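[Figure: a graph database with two labeled graphs, G1 and G2 (vertex labels A and B; edge labels x, y and z), next to two example subgraph patterns, one with support = 1.0 and one with support = 0.5.]

$$\text{support of } S = \frac{\text{the number of graphs containing } S}{\text{the total number of graphs}}$$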

Frequent Subgraph Pattern Mining Problem

Input: a graph database D, and a support threshold minsup

Output: all subgraph patterns with support no less than minsup
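The support definition can be illustrated with a small brute-force sketch. The shapes of G1 and G2 and of the two patterns below are assumptions reconstructed from the labels that survive from the slide figure (G1 as a vertex A joined to four B vertices by edges x, x, y, y; G2 as a triangle with edges x, y, z); the helper names `contains` and `support` are hypothetical. The containment test tries every vertex mapping, so it is only meant for tiny graphs.

```python
from itertools import permutations

# A graph is (vertex_labels, edges): vertex_labels maps vertex id -> label,
# edges is a list of (u, v, edge_label) triples for undirected edges.

def contains(graph, pattern):
    """Brute-force labeled subgraph test; only suitable for tiny graphs."""
    g_labels, g_edges = graph
    p_labels, p_edges = pattern
    g_edge_label = {frozenset((u, v)): lab for u, v, lab in g_edges}
    p_vertices = list(p_labels)
    # Try every injective mapping of pattern vertices onto graph vertices.
    for image in permutations(g_labels, len(p_vertices)):
        m = dict(zip(p_vertices, image))
        labels_ok = all(p_labels[v] == g_labels[m[v]] for v in p_vertices)
        edges_ok = all(
            g_edge_label.get(frozenset((m[u], m[v]))) == lab
            for u, v, lab in p_edges
        )
        if labels_ok and edges_ok:
            return True
    return False

def support(database, pattern):
    """Fraction of graphs in the database that contain the pattern."""
    return sum(contains(g, pattern) for g in database) / len(database)

# ASSUMED graph shapes (the original figure is not recoverable in full).
G1 = ({"a": "A", "b1": "B", "b2": "B", "b3": "B", "b4": "B"},
      [("a", "b1", "x"), ("a", "b2", "x"), ("a", "b3", "y"), ("a", "b4", "y")])
G2 = ({"a": "A", "b1": "B", "b2": "B"},
      [("a", "b1", "x"), ("a", "b2", "y"), ("b1", "b2", "z")])
D = [G1, G2]

# ASSUMED patterns: B -x- A -y- B and B -x- A -x- B.
S_xy = ({"p": "A", "q": "B", "r": "B"}, [("p", "q", "x"), ("p", "r", "y")])
S_xx = ({"p": "A", "q": "B", "r": "B"}, [("p", "q", "x"), ("p", "r", "x")])

print(support(D, S_xy))  # 1.0
print(support(D, S_xx))  # 0.5
```

Under these assumed shapes the two patterns reproduce the supports 1.0 and 0.5 quoted on the slide.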

FSP mining on biological networks (e.g., PPI networks) is an important tool for discovering functional modules [Koyutürk et al. Bioinformatics 04, Turanalp et al. BMC Bioinformatics 08].

PPI networks are subject to uncertainties. How do we define support?

Model of Uncertain Graphs

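[Figure: an uncertain graph with vertex labels A and B, edge labels x and y, and four edges with existence probabilities 0.5, 0.6, 0.7 and 0.8, shown together with two of its implicated graphs.]

The uncertain graph exists in the form of the first implicated graph with probability (1 – 0.5) * 0.6 * 0.7 * 0.8 = 0.168, and in the form of the second implicated graph with probability 0.5 * (1 – 0.6) * 0.7 * 0.8 = 0.112.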

Model of Uncertain Graphs (Cont’d)

Theorem: An uncertain graph represents a probability distribution over all its implicated graphs.
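The theorem can be checked numerically on the four-edge example. The sketch below treats an implicated graph as a choice of which edges exist and assumes the edges are independent, as the product formulas on the slide suggest. The edge names `e1` to `e4` and the function names are hypothetical.

```python
from itertools import product
from math import prod

# Existence probabilities of the four edges of the example uncertain graph.
# The edge names e1..e4 are hypothetical labels for this sketch.
edge_prob = {"e1": 0.5, "e2": 0.6, "e3": 0.7, "e4": 0.8}

def implicated_probability(present):
    """Probability that exactly the edges in `present` exist (independent edges)."""
    return prod(p if e in present else 1.0 - p for e, p in edge_prob.items())

def all_implicated_graphs():
    """Yield every subset of the edge set, i.e. every implicated graph."""
    edges = list(edge_prob)
    for mask in product([False, True], repeat=len(edges)):
        yield frozenset(e for e, keep in zip(edges, mask) if keep)

# The two implicated graphs from the slide.
p_first = implicated_probability({"e2", "e3", "e4"})   # (1 - 0.5) * 0.6 * 0.7 * 0.8
p_second = implicated_probability({"e1", "e3", "e4"})  # 0.5 * (1 - 0.6) * 0.7 * 0.8

# Summing over all 2^4 implicated graphs should give 1 (up to rounding).
total = sum(implicated_probability(g) for g in all_implicated_graphs())
print(round(p_first, 3), round(p_second, 3), round(total, 6))
```

The two probabilities match the slide's 0.168 and 0.112, and the 16 probabilities sum to 1, which is what it means for the uncertain graph to define a distribution over its implicated graphs.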

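Uncertain Graph Databases

[Figure: an uncertain graph database with two uncertain graphs. Uncertain graph G1 has four edges with probabilities 0.5, 0.6, 0.7 and 0.8; uncertain graph G2 has three edges (labels x, y, z) with probabilities 0.8, 0.1 and 0.7. One implicated graph of G1 and one implicated graph of G2 together form an implicated graph database.]

In total, there are 2^4 * 2^3 = 128 implicated graph databases.

Theorem: An uncertain graph DB represents a probability distribution over all its implicated graph DBs.

The uncertain graph database exists in the form of the implicated graph database shown with probability

((1 – 0.5) * 0.6 * 0.7 * 0.8) * (0.8 * 0.1 * (1 – 0.7)) = 4.032 * 10^-3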

Expected Support

Let D be an uncertain graph DB that implicates the graph DBs d1, d2, …, dn.

p1 = Pr(D implicates d1), p2 = Pr(D implicates d2), …, pn = Pr(D implicates dn)

s1 = support of S in d1, s2 = support of S in d2, …, sn = support of S in dn

The expected support of S is

$$\mathrm{esup}(S) = \sum_{i=1}^{n} s_i \, p_i$$
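For a database as small as the two-graph example, the expected support can be computed directly from its definition by enumerating every implicated database. The sketch below assumes independent edges, an illustrative assignment of edge labels to probabilities, and the pattern "A joined to one B by an x edge and to another B by a y edge". The containment test `contains_pattern` is a deliberate simplification that is only valid for this toy pattern and these assumed graph shapes.

```python
from itertools import product
from math import prod

# Toy uncertain database: each graph is a list of (edge label, probability).
# ASSUMPTION: the label-to-probability assignment is illustrative only.
G1 = [("x", 0.5), ("y", 0.6), ("x", 0.7), ("y", 0.8)]
G2 = [("x", 0.8), ("y", 0.1), ("z", 0.7)]
D = [G1, G2]

def contains_pattern(present_labels):
    # ASSUMPTION: in these toy graphs the pattern is present exactly when
    # at least one x edge and at least one y edge exist.
    return "x" in present_labels and "y" in present_labels

def expected_support(database):
    """Sum of s_i * p_i over all implicated databases d_i (exponential time)."""
    all_edges = [edge for g in database for edge in g]
    esup, worlds = 0.0, 0
    for mask in product([False, True], repeat=len(all_edges)):
        # p_i: probability that D implicates this particular database.
        p_i = prod(p if keep else 1.0 - p for (_, p), keep in zip(all_edges, mask))
        # s_i: support of the pattern in this implicated database.
        hits, start = 0, 0
        for g in database:
            part = mask[start:start + len(g)]
            labels = {lab for (lab, _), keep in zip(g, part) if keep}
            hits += contains_pattern(labels)
            start += len(g)
        esup += (hits / len(database)) * p_i
        worlds += 1
    return esup, worlds

esup, worlds = expected_support(D)
print(worlds, round(esup, 3))
```

The loop visits all 128 implicated databases, which is exactly why this direct approach does not scale beyond toy inputs.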

FSP Mining Problem on Uncertain Graphs

Input: an uncertain graph database D, and an expected support threshold minsup

Output: all subgraph patterns with expected support no less than minsup

It is #P-hard to count the number of frequent subgraph patterns. The proof is a reduction from the problem of counting the number of satisfying truth assignments of a monotone k-CNF formula.

The FSP mining problem on uncertain graphs is NP-hard.

Algorithm

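Approximation Method

It is #P-hard to compute the expected support of a subgraph pattern.

We develop an approximation method to find an approximate set of frequent subgraph patterns. Let e (0 < e < 1) be a relative error tolerance.

[Figure: the expected-support axis from 1 down to 0, marked at minsup and (1-e)*minsup, with three regions: Output (expected support at least minsup), Arbitrary (between (1-e)*minsup and minsup) and Discard (below (1-e)*minsup).]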

Objective I

Difficulty I: The number of frequent subgraph patterns is exponentially large.

Objective I: Examine subgraph patterns as efficiently as possible to find all frequent ones.

Method for Objective I

Step 1: Build a search tree T of subgraph patterns.

Step 2: Examine the subgraph patterns in T in depth-first order. If a pattern S is infrequent, then all its descendants can be pruned.

[Figure: the uncertain graphs G1 (edge probabilities 0.5, 0.6, 0.7, 0.8) and G2 (edge probabilities 0.8, 0.1, 0.7) used as the running example, and the expected-support axis with the Output, Arbitrary and Discard regions.]
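The two steps for Objective I can be sketched as a generic depth-first traversal with pruning. This is a minimal illustration, not the authors' implementation: the search tree, the pattern names and the expected-support values below are all made up, and `children` and `is_frequent` are hypothetical callbacks standing in for pattern extension and the frequency test.

```python
def mine_depth_first(roots, children, is_frequent):
    """Depth-first traversal of a pattern search tree with infrequency pruning."""
    frequent, examined = [], []
    stack = list(reversed(list(roots)))
    while stack:
        pattern = stack.pop()
        examined.append(pattern)
        if not is_frequent(pattern):
            continue  # prune: descendants of an infrequent pattern are never visited
        frequent.append(pattern)
        stack.extend(reversed(list(children(pattern))))
    return frequent, examined

# Hypothetical search tree: each pattern maps to its child patterns.
tree = {"a": ["ab", "ac"], "ab": ["abc"], "b": ["bc"]}
# Hypothetical expected supports, non-increasing along every root-to-leaf path.
toy_esup = {"a": 0.9, "ab": 0.6, "abc": 0.2, "ac": 0.3, "b": 0.4, "bc": 0.35}
minsup = 0.5

frequent, examined = mine_depth_first(
    roots=["a", "b"],
    children=lambda s: tree.get(s, []),
    is_frequent=lambda s: toy_esup[s] >= minsup,
)
print(frequent)   # ['a', 'ab']
print(examined)   # ['a', 'ab', 'abc', 'ac', 'b']
```

Pattern "bc" is never examined because its parent "b" is infrequent, which is the saving the pruning rule provides.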

Objective II

Difficulty II: It is #P-hard to compute the expected support esup(S) of a subgraph pattern S.

Objective II: Make the following judgments without computing esup(S) exactly.

If esup(S) is surely not in the green region, then discard S.

If esup(S) is probably in the green region and surely not in the red region, then output S.

[Figure: the expected-support axis from 1 down to 0, marked at minsup and (1-e)*minsup, with the regions Output, Arbitrary and Discard.]

Method for Objective II

Step 1: Approximate esup(S) by an interval [l, u] such that esup(S) ∈ [l, u].

Step 2: Decide whether S can be output or not by testing the following conditions.

[Figure: the test conditions drawn on the expected-support axis (from 1 down to 0, marked at minsup and (1-e)*minsup), with three outcomes: Output, Discard and Shrink.]
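The exact test conditions appear only in the slide figure and are not recoverable from this transcript. The sketch below is one rule that is consistent with the three outcomes named there, under the assumption that patterns with expected support at least minsup must be output and patterns with expected support below (1-e)*minsup must be discarded. The function name `decide` and the thresholds used in the example calls are hypothetical.

```python
def decide(l, u, minsup, e):
    """Classify a pattern from an interval [l, u] known to contain esup(S).

    ASSUMED rule, not taken verbatim from the slides:
      - discard when esup(S) is certainly below minsup (u < minsup);
      - output when esup(S) is certainly at least (1 - e) * minsup;
      - otherwise the interval is too wide to decide and must be shrunk.
    """
    if u < minsup:
        return "discard"
    if l >= (1.0 - e) * minsup:
        return "output"
    return "shrink"

print(decide(0.55, 0.60, minsup=0.5, e=0.1))  # output
print(decide(0.30, 0.35, minsup=0.5, e=0.1))  # discard
print(decide(0.40, 0.52, minsup=0.5, e=0.1))  # shrink
```

Under this rule the "shrink" case can only arise when the interval is wider than e*minsup, so an interval of width at most e*minsup always leads to a decision.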

Approximating esup(S) by [l,u]

$$\mathrm{esup}(S) = \frac{1}{|D|} \sum_{i=1}^{|D|} \Pr(S \text{ occurs in } G_i)$$

$$\Pr(S \text{ occurs in } G) = \sum_{\substack{I:\ I \text{ is implicated by } G \\ \text{and } S \text{ is contained in } I}} \Pr(G \text{ implicates } I)$$

A subgraph pattern S occurs in an uncertain graph G if S is contained in at least one implicated graph of G.

Algorithm: Approximate esup(S) by [l, u]

Step 1: For each uncertain graph Gi in D, approximate Pr(S occurs in Gi) by an interval [li, ui] of width at most e*minsup.

Step 2: Compute

$$l = \frac{1}{|D|} \sum_{i=1}^{|D|} l_i, \qquad u = \frac{1}{|D|} \sum_{i=1}^{|D|} u_i$$
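Step 2 is a plain average of the per-graph bounds, so the width of [l, u] is the average of the per-graph widths and cannot exceed e*minsup. A minimal sketch, with a hypothetical function name and made-up per-graph intervals:

```python
def combine_intervals(per_graph_intervals):
    """Average per-graph bounds [l_i, u_i] into bounds [l, u] on esup(S)."""
    n = len(per_graph_intervals)
    l = sum(li for li, _ in per_graph_intervals) / n
    u = sum(ui for _, ui in per_graph_intervals) / n
    return l, u

# Hypothetical setting: minsup = 0.5 and e = 0.1, so each width is at most 0.05.
minsup, e = 0.5, 0.1
intervals = [(0.76, 0.80), (0.06, 0.10)]  # made-up [l_i, u_i] for two graphs

l, u = combine_intervals(intervals)
print(round(l, 2), round(u, 2))  # 0.41 0.45
# u - l is the mean of the per-graph widths, hence also at most e * minsup.
print(u - l <= e * minsup)       # True
```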

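Approximate Pr(S occurs in Gi) by [li, ui]

[Figure: the uncertain graph Gi with edge probabilities 0.5, 0.6, 0.7 and 0.8, its four edges annotated with the variables (x1), (x2), (x3) and (x4), and the pattern S with vertex labels A, B, B and edge labels x, y.]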

Step 1: Find all embeddings of S in Gi. In this example there are 4 embeddings.

Step 2: Assign boolean variables to the edges in the embeddings. Pr(x1) = 0.5, Pr(x2) = 0.6, Pr(x3) = 0.7, Pr(x4) = 0.8.

Step 3: Construct a conjunctive formula for each embedding. C1 = (x1 ∧ x2), C2 = (x1 ∧ x4), C3 = (x2 ∧ x3), C4 = (x3 ∧ x4).

Step 4: Construct a DNF formula. F = C1 ∨ C2 ∨ C3 ∨ C4.

Step 5: Estimate Pr(F = TRUE) by p using Karp & Luby’s Markov-Chain Monte-Carlo method with absolute error e*minsup/2 and confidence d (d ∈ [0, 1]).

Step 6: [li, ui] = [p - e*minsup/2, p + e*minsup/2].
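Steps 2 to 6 can be tried out on the slide's own formula. The sketch below computes Pr(F = TRUE) exactly by enumeration and then estimates it with a coverage-style Monte Carlo estimator in the spirit of Karp and Luby. Whether this matches the exact variant the authors used is an assumption, and the sample count, the seed and the values of e and minsup are hypothetical; the slides do not say how many samples a given error and confidence require.

```python
import random
from itertools import product
from math import prod

prob = {"x1": 0.5, "x2": 0.6, "x3": 0.7, "x4": 0.8}
clauses = [("x1", "x2"), ("x1", "x4"), ("x2", "x3"), ("x3", "x4")]  # C1..C4

def satisfied(clause, assignment):
    return all(assignment[v] for v in clause)

def exact_probability():
    """Pr(F = TRUE) by enumerating all assignments (feasible only for tiny F)."""
    names = list(prob)
    total = 0.0
    for values in product([False, True], repeat=len(names)):
        a = dict(zip(names, values))
        if any(satisfied(c, a) for c in clauses):
            total += prod(prob[v] if a[v] else 1.0 - prob[v] for v in names)
    return total

def coverage_estimate(num_samples, rng):
    """Coverage estimator: Pr(F) = W * E[1 / (number of satisfied clauses)]."""
    weights = [prod(prob[v] for v in c) for c in clauses]  # Pr(Ci = TRUE)
    total_weight = sum(weights)                            # W
    acc = 0.0
    for _ in range(num_samples):
        # Pick a clause proportionally to its probability, then sample an
        # assignment conditioned on that clause being true.
        clause = rng.choices(clauses, weights=weights, k=1)[0]
        a = {v: (True if v in clause else (rng.random() < p)) for v, p in prob.items()}
        acc += 1.0 / sum(satisfied(c, a) for c in clauses)
    return total_weight * acc / num_samples

exact = exact_probability()
p = coverage_estimate(20000, random.Random(0))  # hypothetical sample count and seed

e, minsup = 0.1, 0.5                            # hypothetical parameter values
half_width = e * minsup / 2
l_i, u_i = p - half_width, p + half_width       # Step 6
print(round(exact, 3))
```

For this formula the exact value is 0.782, since F is equivalent to (x1 ∨ x3) ∧ (x2 ∨ x4) and the two factors have probabilities 0.85 and 0.92. The estimator never samples an assignment that falsifies F, which is what keeps its variance small when Pr(F) is low.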

Experimental Results

Data: The STRING Database (http://string-db.org)

Time Efficiency

Approximation Quality

Scalability

Conclusions

A new model of uncertain graph data has been proposed.

The frequent subgraph pattern mining problem on uncertain graph data has been formalized.

The problem has been formally proved to be NP-hard.

An approximate mining algorithm has been proposed.

The proposed algorithm has high efficiency, high approximation quality, and high scalability.

Thank you