Frequent Subgraph Pattern Mining on Uncertain Graph Data
Zhaonian Zou, Jianzhong Li, Hong Gao, Shuo Zhang
Harbin Institute of Technology, China
CIKM’09, Hong Kong, Nov 4, 2009
Outline
Background
Problem Definition
Algorithm
Experimental Results
Conclusions
Background
Graph mining has played an important role in a range of real-world applications.
medicine: structures of molecules
bioinformatics: biological networks
technology: WWW
social science: social networks
many others
Directions of Graph Mining
Patterns of graphs, e.g., [Yan et al. ICDM’02]
Privacy of graphs, e.g., [Zou et al. VLDB’09]
Uncertainties of graphs
Models of graphs, e.g., [Leskovec et al. KDD’05]
Evolution of graphs, e.g., [Faloutsos et al. SIGMOD’07]
Uncertainties of Graphs: Example I
Protein-Protein Interaction (PPI) Networks
Vertices: proteins
Edges: interactions between proteins
Uncertainties: probabilities of interactions really existing
The data are taken from the STRING Database (http://string-db.org).
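[Figure: a PPI network over the proteins NTG1, FET3, TIF34, SMT3, RPC40, and RAD59, with edges labeled by the interaction probabilities 0.375, 0.639, 0.651, 0.147, 0.651, 0.639, 0.867, and 0.698.]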
Uncertainties of Graphs: Example II
Topologies of wireless sensor networks (WSNs)
Vertices: sensor nodes
Edges: wireless links between sensor nodes
Uncertainties: probabilities of wireless links functioning at any given time
[Figure: a WSN topology with links labeled by the probabilities 0.75, 0.92, 0.88, 0.95, and 0.69.]
Problem Definition
Preliminaries
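The support of a subgraph pattern S = (the number of graphs containing S) / (the total number of graphs)
[Figure: a graph database of two graphs, G1 and G2, with vertex labels A and B and edge labels x, y, and z, and two subgraph patterns. The pattern B-x-A-y-B has support = 1.0; the pattern B-x-A-x-B has support = 0.5.]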
Frequent Subgraph Pattern Mining Problem
Input: a graph database D and a support threshold minsup
Output: all subgraph patterns with support no less than minsup
FSP mining on biological networks (e.g., PPI networks) is an important tool for discovering functional modules [Koyutürk et al. Bioinformatics 04, Turanalp et al. BMC Bioinformatics 08].
PPI networks are subject to uncertainties. How do we define support?
Model of Uncertain Graphs
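[Figure: an uncertain graph with vertex labels A and B, edge labels x and y, and edge existence probabilities 0.5, 0.6, 0.7, and 0.8, shown together with two of its implicated graphs.]
Probability that the uncertain graph exists in the form of one implicated graph (the edge with probability 0.5 absent, the others present): (1 - 0.5) * 0.6 * 0.7 * 0.8 = 0.168
Probability that it exists in the form of another implicated graph (the edge with probability 0.6 absent, the others present): 0.5 * (1 - 0.6) * 0.7 * 0.8 = 0.112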
Model of Uncertain Graphs (Cont’d)
Theorem: An uncertain graph represents a probability distribution over all its implicated graphs.
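The theorem can be checked numerically on the example graph. The following is a minimal sketch, assuming (as the products on the slide suggest) that edges exist independently of one another; the function name `implicated_probability` and the encoding of an implicated graph as a tuple of edge flags are illustrative choices, not the authors' notation.

```python
from itertools import product

# Edge existence probabilities of the example uncertain graph (from the slide).
edge_probs = [0.5, 0.6, 0.7, 0.8]

def implicated_probability(present, probs):
    """Probability that exactly the edges flagged True in `present` exist.

    Assumes independent edges: multiply q for a present edge, 1 - q otherwise.
    """
    p = 1.0
    for exists, q in zip(present, probs):
        p *= q if exists else (1.0 - q)
    return p

# Every subset of the 4 edges yields one implicated graph: 2^4 = 16 in total.
worlds = list(product([False, True], repeat=len(edge_probs)))
distribution = {w: implicated_probability(w, edge_probs) for w in worlds}

# The probabilities of all implicated graphs sum to 1, so they form a distribution.
total = sum(distribution.values())

# The two implicated graphs worked out on the slide.
first = implicated_probability((False, True, True, True), edge_probs)
second = implicated_probability((True, False, True, True), edge_probs)
print(len(worlds), round(total, 12), round(first, 3), round(second, 3))
```

Enumerating all edge subsets is only feasible for tiny graphs; it is used here purely to illustrate the model.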
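Uncertain Graph Databases
[Figure: an uncertain graph database of two uncertain graphs. Uncertain graph G1 has four edges with probabilities 0.5, 0.6, 0.7, and 0.8; uncertain graph G2 has three edges, labeled x, y, and z, with probabilities 0.8, 0.1, and 0.7. An implicated graph database consists of one implicated graph of G1 and one implicated graph of G2.]
In total, there are 2^4 * 2^3 = 128 implicated graph databases.
Theorem: An uncertain graph DB represents a probability distribution over all its implicated graph DBs.
Probability that the uncertain graph DB exists in the form of the implicated graph DB shown: ((1 - 0.5) * 0.6 * 0.7 * 0.8) * (0.8 * 0.1 * (1 - 0.7)) = 4.032 * 10^-3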
Expected Support
Let D be an uncertain graph DB that implicates the graph DBs d1, d2, ..., dn.
pi = Pr(D implicates di), for i = 1, ..., n
si = support of S in di, for i = 1, ..., n
The expected support of S is
$$\mathrm{esup}(S) = \sum_{i=1}^{n} s_i \, p_i$$
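The definition can be evaluated directly on the two-graph example by enumerating all implicated graph databases. This is a brute-force sketch under stated assumptions: edges are independent, each uncertain graph is encoded as a pair (edge probabilities, embeddings of S), and an embedding is the set of edge indices it uses, so S is contained in an implicated graph exactly when all edges of some embedding are present. The embedding sets below are assumed for the pattern B-x-A-y-B and are illustrative, as are all names.

```python
from itertools import product

# Hypothetical encoding: (edge probabilities, embeddings of S as edge-index sets).
G1 = ([0.5, 0.6, 0.7, 0.8], [{0, 1}, {0, 3}, {1, 2}, {2, 3}])
G2 = ([0.8, 0.1, 0.7], [{0, 1}])
D = [G1, G2]

def worlds(probs):
    """Yield (set of present edge indices, probability) for every implicated graph."""
    for flags in product([False, True], repeat=len(probs)):
        p = 1.0
        for exists, q in zip(flags, probs):
            p *= q if exists else 1.0 - q
        yield {i for i, e in enumerate(flags) if e}, p

def contains(present, embeddings):
    """S is contained in the implicated graph if some embedding is fully present."""
    return any(emb <= present for emb in embeddings)

def expected_support(db):
    """esup(S) = sum over implicated graph databases d_i of s_i * p_i."""
    per_graph = [list(worlds(probs)) for probs, _ in db]
    esup, n_dbs = 0.0, 0
    for combo in product(*per_graph):
        p_i, hits = 1.0, 0
        for (present, p), (_, embeddings) in zip(combo, db):
            p_i *= p                               # Pr(D implicates d_i)
            hits += contains(present, embeddings)  # graphs of d_i containing S
        esup += (hits / len(db)) * p_i             # s_i * p_i
        n_dbs += 1
    return esup, n_dbs

esup, n_dbs = expected_support(D)
print(n_dbs, round(esup, 3))
```

The number of implicated graph databases grows exponentially with the number of edges, which is why this direct evaluation does not scale.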
FSP Mining Problem on Uncertain Graphs
Input: an uncertain graph database D, and an expected support threshold minsup
Output: all subgraph patterns with expected support no less than minsup
It is #P-hard to count the number of frequent subgraph patterns.
Reduction from the problem of counting the number of satisfying truth assignments of a monotone k-CNF formula.
The FSP mining problem on uncertain graphs is NP-hard.
Algorithm
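Approximation Method
It is #P-hard to compute the expected support of a subgraph pattern.
We develop an approximation method to find an approximate set of frequent subgraph patterns. Let e (0 < e < 1) be a relative error tolerance.
[Figure: the expected support axis from 0 to 1 with marks at (1-e) * minsup and minsup, divided into three regions: Discard below (1-e) * minsup, Arbitrary between (1-e) * minsup and minsup, and Output from minsup upward.]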
Objective I
Difficulty I: The number of frequent subgraph patterns is exponentially large.
Objective I: Examine subgraph patterns as efficiently as possible to find all frequent ones.
Method for Objective I
Step 1: Build a search tree T of subgraph patterns.
Step 2: Examine subgraph patterns in T in depth-first order. If S is infrequent, then all its descendants can be pruned.
[Figure: uncertain graph G1 (edge probabilities 0.5, 0.6, 0.7, 0.8) and uncertain graph G2 (edge probabilities 0.8, 0.1, 0.7), shown with the expected support axis and its Discard, Arbitrary, and Output regions.]
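The depth-first examination with pruning can be sketched as follows. This is a minimal illustration, not the authors' implementation: the search tree, the pattern names, and the expected support values are hypothetical toy data, and `children` and `esup` are assumed callables supplied by the caller.

```python
def mine_frequent(root, children, esup, minsup):
    """Depth-first search over a pattern tree, pruning below infrequent patterns."""
    frequent, examined = [], []
    stack = [root]
    while stack:
        pattern = stack.pop()
        examined.append(pattern)
        if esup(pattern) < minsup:
            continue  # infrequent: none of its descendants is examined
        frequent.append(pattern)
        # Push children in reverse so they are popped in their listed order.
        stack.extend(reversed(children(pattern)))
    return frequent, examined

# Hypothetical toy search tree: each child extends its parent by one edge.
tree = {
    "A": ["A-x-B", "A-y-B"],
    "A-x-B": ["B-x-A-x-B", "B-x-A-y-B"],
    "A-y-B": ["B-y-A-y-B"],
}
# Hypothetical expected supports (a child never exceeds its parent).
toy_esup = {
    "A": 1.0, "A-x-B": 0.9, "A-y-B": 0.3,
    "B-x-A-x-B": 0.5, "B-x-A-y-B": 0.43, "B-y-A-y-B": 0.05,
}

frequent, examined = mine_frequent(
    "A", lambda p: tree.get(p, []), toy_esup.__getitem__, minsup=0.4
)
print(frequent)
print(examined)
```

In this toy run the pattern "A-y-B" is infrequent, so its child "B-y-A-y-B" is never examined. The pruning is sound because a pattern cannot be more frequent than any of its subpatterns.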
Objective II
Difficulty II: It is #P-hard to compute the expected support esup(S) of a subgraph pattern S.
Objective II: Make the following judgments without computing esup(S) exactly.
If esup(S) is surely not in the green (Output) region, then discard.
If esup(S) is probably in the green (Output) region and surely not in the red (Discard) region, then output.
[Figure: the expected support axis from 0 to 1 with marks at (1-e) * minsup and minsup and the regions Discard, Arbitrary, and Output.]
Method for Objective II
Step 1: Approximate esup(S) by an interval [l, u] such that esup(S) ∈ [l, u].
Step 2: Decide whether S can be output or not by testing the following conditions.
[Figure: positions of the interval [l, u] on the expected support axis relative to (1-e) * minsup and minsup, with the outcomes Output, Discard, and Shrink.]
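One decision rule consistent with the three regions is sketched below. It is an assumption made for illustration, not necessarily the authors' exact conditions: the function name `decide` and the order of the tests are hypothetical. The rule relies only on the guarantee that esup(S) lies in [l, u].

```python
def decide(l, u, minsup, e):
    """Classify a pattern from an interval [l, u] that contains esup(S).

    - u < minsup: esup(S) < minsup, so S is not required in the output -> discard.
    - l >= (1 - e) * minsup: esup(S) is not in the Discard region -> output.
    - otherwise the interval is too wide to decide -> shrink it and retry.
    """
    if u < minsup:
        return "discard"
    if l >= (1 - e) * minsup:
        return "output"
    return "shrink"

minsup, e = 0.4, 0.1
print(decide(0.20, 0.30, minsup, e))  # interval entirely below minsup
print(decide(0.38, 0.41, minsup, e))  # interval above (1 - e) * minsup
print(decide(0.30, 0.50, minsup, e))  # interval straddles both marks
```

Under this rule, an interval of width at most e * minsup never needs shrinking: if u ≥ minsup, then l ≥ u - e * minsup ≥ (1 - e) * minsup, so the second test succeeds.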
Approximating esup(S) by [l,u]
A subgraph pattern S occurs in an uncertain graph G if S is contained in at least one implicated graph of G.
$$\Pr(S \text{ occurs in } G) = \sum_{I \,:\, I \text{ is implicated by } G \text{ and } S \text{ is contained in } I} \Pr(G \text{ implicates } I)$$
$$\mathrm{esup}(S) = \frac{1}{|D|} \sum_{i=1}^{|D|} \Pr(S \text{ occurs in } G_i)$$
Algorithm: Approximate esup(S) by [l, u]
Step 1: For each uncertain graph Gi in D, approximate Pr(S occurs in Gi) by an interval [li, ui] of width at most e*minsup.
Step 2: Compute
$$l = \frac{1}{|D|} \sum_{i=1}^{|D|} l_i, \qquad u = \frac{1}{|D|} \sum_{i=1}^{|D|} u_i$$
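Averaging the per-graph intervals can be sketched as follows. The per-graph estimates below are hypothetical numbers chosen for illustration, and clamping each interval to [0, 1] is an added convenience rather than part of the slide's algorithm.

```python
def combine_intervals(intervals):
    """Average per-graph intervals [li, ui] into an interval [l, u] for esup(S)."""
    n = len(intervals)
    l = sum(li for li, _ in intervals) / n
    u = sum(ui for _, ui in intervals) / n
    return l, u

minsup, e = 0.4, 0.1
half = e * minsup / 2  # half-width, so each interval has width at most e * minsup

# Hypothetical estimates of Pr(S occurs in Gi) for a two-graph database.
estimates = [0.78, 0.08]
# Clamp to [0, 1] because the bounded quantity is a probability (an added step).
intervals = [(max(0.0, p - half), min(1.0, p + half)) for p in estimates]

l, u = combine_intervals(intervals)
print(round(l, 4), round(u, 4), round(u - l, 4))
```

Because every per-graph interval has width at most e * minsup, their average [l, u] also has width at most e * minsup.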
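Approximate Pr(S occurs in Gi) by [li, ui]
[Figure: uncertain graph Gi with vertex labels A and B, edge labels x and y, and edge probabilities 0.5, 0.6, 0.7, and 0.8, its edges marked with the Boolean variables (x1), (x2), (x3), and (x4); and the pattern S, a vertex labeled A joined to two vertices labeled B by edges labeled x and y.]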
Step 1: Find all embeddings of S in Gi. Here there are 4 embeddings.
Step 2: Assign Boolean variables to the edges in the embeddings.
Pr(x1) = 0.5, Pr(x2) = 0.6, Pr(x3) = 0.7, Pr(x4) = 0.8.
Step 3: Construct a conjunctive formula for each embedding.
C1 = (x1 ∧ x2), C2 = (x1 ∧ x4), C3 = (x2 ∧ x3), C4 = (x3 ∧ x4).
Step 4: Construct a DNF formula.
F = C1 ∨ C2 ∨ C3 ∨ C4.
Step 5: Estimate Pr(F = TRUE) by p using Karp & Luby’s Markov-Chain Monte-Carlo method with absolute error e*minsup/2 and confidence d (d ∈ [0, 1]).
Step 6: [li, ui] = [p - e*minsup/2, p + e*minsup/2].
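Steps 2 to 6 can be sketched with a Karp-Luby style coverage estimator for the probability of a DNF formula. This is an illustrative sketch under stated assumptions: the variables are independent, the function name `karp_luby` is hypothetical, and the sample count is picked arbitrarily rather than derived from the error bound e*minsup/2 and the confidence d.

```python
import random

def karp_luby(clauses, probs, samples, rng):
    """Estimate Pr(F = TRUE) for a DNF F = C1 v ... v Cm over independent variables."""
    # Pr(Ci = TRUE) is the product of the probabilities of its variables.
    weights = []
    for clause in clauses:
        w = 1.0
        for v in clause:
            w *= probs[v]
        weights.append(w)
    total = sum(weights)
    hits = 0
    for _ in range(samples):
        # Pick a clause with probability proportional to Pr(Ci = TRUE).
        i = rng.choices(range(len(clauses)), weights=weights)[0]
        # Sample an assignment conditioned on clause i being true.
        assignment = {
            v: (v in clauses[i]) or (rng.random() < p) for v, p in probs.items()
        }
        # Count the sample only if i is the first clause the assignment satisfies,
        # so that each satisfying assignment is counted exactly once overall.
        first = next(
            j for j, c in enumerate(clauses) if all(assignment[v] for v in c)
        )
        hits += (first == i)
    return total * hits / samples

probs = {"x1": 0.5, "x2": 0.6, "x3": 0.7, "x4": 0.8}
clauses = [{"x1", "x2"}, {"x1", "x4"}, {"x2", "x3"}, {"x3", "x4"}]

p = karp_luby(clauses, probs, samples=100_000, rng=random.Random(0))

minsup, e = 0.4, 0.1  # hypothetical threshold and error tolerance
li, ui = p - e * minsup / 2, p + e * minsup / 2
print(round(p, 3), (round(li, 3), round(ui, 3)))
```

For this particular formula the exact value is available as a check: F is equivalent to (x1 ∨ x3) ∧ (x2 ∨ x4), so Pr(F = TRUE) = (1 - 0.5 * 0.3) * (1 - 0.4 * 0.2) = 0.782.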
Experimental Results
Time Efficiency
Approximation Quality
Scalability
Conclusions
A new model of uncertain graph data has been proposed.
The frequent subgraph pattern mining problem on uncertain graph data has been formalized.
The computational complexity of the problem has been formally proved to be NP-hard.
An approximate mining algorithm has been proposed.
The proposed algorithm has high efficiency, high approximation quality, and high scalability.
Thank you