Frequent Subgraph Pattern Mining on Uncertain Graph Data Zhaonian Zou, Jianzhong Li, Hong Gao, Shuo...

Frequent Subgraph Pattern Mining on Uncertain Graph Data

Zhaonian Zou, Jianzhong Li, Hong Gao, Shuo Zhang

Harbin Institute of Technology, China

CIKM’09, Hong Kong, Nov 4, 2009

Outline

Background

Problem Definition

Algorithm

Experimental Results

Conclusions

Background

Graph mining has played an important role in a range of real-world applications:
medicine: structures of molecules
bioinformatics: biological networks
technology: WWW
social science: social networks
many others

Directions of Graph Mining

Patterns of graphs, e.g., [Yan et al. ICDM’02]

Privacy of graphs, e.g., [Zou et al. VLDB’09]

Uncertainties of graphs

Models of graphs, e.g., [Leskovec et al. KDD’05]

Evolution of graphs, e.g., [Faloutsos et al. SIGMOD’07]

Uncertainties of Graphs: Example I

Protein-Protein Interaction (PPI) Networks

Vertices: proteins
Edges: interactions between proteins
Uncertainties: probabilities of interactions really existing

The data are taken from the STRING Database (http://string-db.org).

Uncertainties of Graphs: Example II

Topologies of wireless sensor networks (WSNs)

Vertices: sensor nodes
Edges: wireless links between sensor nodes
Uncertainties: probabilities of wireless links functioning at any given time

Problem Definition

Preliminaries

The support of S = (the number of graphs containing S) / (the total number of graphs)

[Figure: a graph database containing graphs G1 and G2, and two subgraph patterns, one with support = 1.0 and one with support = 0.5.]

Frequent Subgraph Pattern Mining Problem

Input: a graph database D, and a support threshold minsup
Output: all subgraph patterns with support no less than minsup
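The definition of support and the mining problem can be illustrated with a minimal sketch. It assumes, for simplicity, that every vertex has a unique name (as proteins do in a PPI network), so that "graph contains pattern" reduces to an edge-subset test; general subgraph isomorphism is not implemented here. The graphs, pattern names, and the helper names `edge`, `support`, and `frequent` are hypothetical and are not the graphs in the slide's figure.

```python
from typing import Dict, FrozenSet, List, Set

Edge = FrozenSet[str]   # undirected edge between two uniquely named vertices
Graph = Set[Edge]       # a graph is represented by its edge set only


def edge(u: str, v: str) -> Edge:
    return frozenset((u, v))


def support(pattern: Graph, database: List[Graph]) -> float:
    """Fraction of graphs in the database that contain every edge of the pattern."""
    containing = sum(1 for g in database if pattern <= g)
    return containing / len(database)


def frequent(patterns: Dict[str, Graph], database: List[Graph], minsup: float) -> List[str]:
    """Names of the candidate patterns whose support is no less than minsup."""
    return [name for name, p in patterns.items() if support(p, database) >= minsup]


# Hypothetical toy database with two graphs.
g1 = {edge("a", "b"), edge("b", "c"), edge("a", "c")}
g2 = {edge("a", "b"), edge("b", "d")}
database = [g1, g2]

patterns = {
    "a-b": {edge("a", "b")},
    "a-b-c": {edge("a", "b"), edge("b", "c")},
}
```

With `minsup = 0.6`, only the single-edge pattern is reported, since the two-edge pattern is contained in one of the two graphs.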

FSP mining on biological networks (e.g., PPI networks) is an important tool for discovering functional modules [Koyutürk et al. Bioinformatics 04, Turanalp et al. BMC Bioinformatics 08].

PPI networks are subject to uncertainties. How do we define support?

Model of Uncertain Graphs

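[Figure: an uncertain graph, with edges labeled by existence probabilities, and two of its implicated graphs, each annotated with the probability that the uncertain graph exists in this form.]

(1 – 0.5) * 0.6 * 0.7 * 0.8 = 0.168

0.5 * (1 – 0.6) * 0.7 * 0.8 = 0.112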

Model of Uncertain Graphs (Cont’d)

Theorem: An uncertain graph represents a probability distribution over all its implicated graphs.
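A small sketch can make the theorem concrete. It assumes, as the products on the previous slide suggest, that edges exist independently, so the probability of an implicated graph is the product of p over its present edges and (1 - p) over its absent edges. The edge names `e1` to `e4` are hypothetical; the probabilities are the ones appearing in the slide's arithmetic.

```python
from itertools import product
from math import isclose, prod

# Hypothetical edge names; probabilities taken from the slide's arithmetic.
edge_prob = {"e1": 0.5, "e2": 0.6, "e3": 0.7, "e4": 0.8}


def implicated_probability(present, edge_prob):
    """Probability that exactly the edges in `present` exist (independence assumed)."""
    return prod(p if e in present else 1.0 - p for e, p in edge_prob.items())


def all_implicated_graphs(edge_prob):
    """Yield every subset of the edges, i.e. every implicated graph."""
    edges = list(edge_prob)
    for mask in product([False, True], repeat=len(edges)):
        yield frozenset(e for e, keep in zip(edges, mask) if keep)


# Map each of the 2^4 implicated graphs to its probability.
distribution = {
    g: implicated_probability(g, edge_prob) for g in all_implicated_graphs(edge_prob)
}
```

The values in `distribution` sum to 1, which is what it means for the uncertain graph to represent a probability distribution over its implicated graphs.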

Uncertain Graph Databases

[Figure: an uncertain graph database consisting of uncertain graphs G1 and G2, with edges labeled by existence probabilities, and one implicated graph database consisting of an implicated graph of G1 and an implicated graph of G2.]

In total, 2^4 * 2^3 = 128 implicated graph databases.

Probability that the uncertain graph database exists in this form:

((1 – 0.5) * 0.6 * 0.7 * 0.8) * (0.8 * 0.1 * (1 – 0.7)) = 4.032 * 10^-3

Theorem: An uncertain graph DB represents a probability distribution over all its implicated graph DBs.

Expected Support

Let D be an uncertain graph DB, and let d1, d2, …, dn be its implicated graph DBs.

pi = Pr(D implicates di), for i = 1, …, n

si = support of S in di, for i = 1, …, n

The expected support of S is

$$\mathrm{esup}(S) = \sum_{i=1}^{n} s_i \, p_i$$
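The definition can be evaluated directly on a tiny database by enumerating every implicated graph database. The sketch below assumes independent edges and uniquely named vertices, and represents the pattern by a hypothetical predicate `contains` on an implicated graph's edge set; the database `G1`, `G2` is made up for illustration. The enumeration is exponential in the number of edges, so this is only a check of the definition, not a mining method.

```python
from itertools import product
from math import isclose, prod


def implicated_graphs(edge_prob):
    """Yield (edge set, probability) for every implicated graph of one uncertain graph."""
    edges = list(edge_prob)
    for mask in product([False, True], repeat=len(edges)):
        present = frozenset(e for e, keep in zip(edges, mask) if keep)
        prob = prod(p if e in present else 1.0 - p for e, p in edge_prob.items())
        yield present, prob


def expected_support(database, contains):
    """Sum of s_i * p_i over all implicated graph databases d_i of `database`."""
    total = 0.0
    per_graph = [list(implicated_graphs(g)) for g in database]
    for combo in product(*per_graph):          # one implicated graph per uncertain graph
        p_i = prod(prob for _, prob in combo)  # Pr(D implicates d_i)
        s_i = sum(1 for graph, _ in combo if contains(graph)) / len(database)
        total += s_i * p_i
    return total


# Hypothetical uncertain graph database: edge name -> existence probability.
G1 = {"ab": 0.5, "bc": 0.6}
G2 = {"ab": 0.8}

esup_ab = expected_support([G1, G2], lambda g: "ab" in g)
esup_ab_bc = expected_support([G1, G2], lambda g: {"ab", "bc"} <= g)
```

Here `esup_ab` evaluates to (0.5 + 0.8) / 2 = 0.65 and `esup_ab_bc` to (0.5 * 0.6 + 0) / 2 = 0.15.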

FSP Mining Problem on Uncertain Graphs

Input: an uncertain graph database D, and an expected support threshold minsup

Output: all subgraph patterns with expected support no less than minsup

It is #P-hard to count the number of frequent subgraph patterns.
Reduction from the problem of counting the number of satisfying truth assignments of a monotone k-CNF formula.

The FSP mining problem on uncertain graphs is NP-hard.

Algorithm

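Approximation Method

It is #P-hard to compute the expected support of a subgraph pattern.

We develop an approximation method to find an approximate set of frequent subgraph patterns. Let e (0 < e < 1) be a relative error tolerance.

[Figure: the expected-support axis with thresholds (1-e)*minsup and minsup. Patterns with expected support at least minsup: Output. Patterns with expected support below (1-e)*minsup: Discard. Patterns in between: Arbitrary.]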

Objective I

Difficulty I: The number of frequent subgraph patterns is exponentially large.

Objective I: Examine subgraph patterns as efficiently as possible to find all frequent ones.

Method for Objective I

Step 1: Build a search tree T of subgraph patterns.
Step 2: Examine subgraph patterns in T in depth-first order.

If S is infrequent, then all its descendants can be pruned.

[Figure: search tree of subgraph patterns for uncertain graphs G1 and G2, shown with the expected-support axis and its Output, Discard, and Arbitrary regions.]
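The depth-first examination with pruning can be sketched generically. The function below is an illustration only: `children` (how a pattern is extended in the search tree) and `is_frequent` (the frequency test) are hypothetical callables supplied by the caller, and the toy tree and expected-support values are invented. The pruning is sound when extending a pattern never increases its expected support.

```python
def mine_depth_first(root_patterns, children, is_frequent):
    """Visit the search tree depth-first; do not expand infrequent patterns."""
    frequent = []
    stack = list(reversed(list(root_patterns)))
    while stack:
        pattern = stack.pop()
        if not is_frequent(pattern):
            continue  # prune: descendants of an infrequent pattern are never generated
        frequent.append(pattern)
        stack.extend(reversed(list(children(pattern))))
    return frequent


# Hypothetical toy search tree and expected supports (values decrease along each branch).
toy_children = {"a": ["ab", "ac"], "b": ["bc"], "ab": ["abc"]}
toy_esup = {"a": 0.9, "b": 0.2, "ab": 0.6, "ac": 0.1, "abc": 0.5, "bc": 0.1}
examined = []


def toy_is_frequent(pattern, minsup=0.4):
    examined.append(pattern)
    return toy_esup[pattern] >= minsup


result = mine_depth_first(["a", "b"], lambda p: toy_children.get(p, []), toy_is_frequent)
```

In this toy run the pattern `bc` is never examined, because its parent `b` is infrequent.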

Objective II

Difficulty II: It is #P-hard to compute the expected support esup(S) of a subgraph pattern S.

Objective II: Make the following judgments without computing esup(S) exactly.
If esup(S) is surely not in the green region, then discard.
If esup(S) is probably in the green region and surely not in the red region, then output.

[Figure: the expected-support axis with thresholds (1-e)*minsup and minsup and the regions Output (green), Discard (red), and Arbitrary.]

Method for Objective II

Step 1: Approximate esup(S) by an interval [l, u] such that esup(S) ∈ [l, u].
Step 2: Decide whether S can be output or not by testing the following conditions.

[Figure: three positions of the interval [l, u] on the expected-support axis (from 0 to 1, with thresholds (1-e)*minsup and minsup), labeled Output, Discard, and Shrink.]
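The exact conditions were part of a figure that is not preserved in this text. The sketch below shows one set of conditions that is consistent with Objective II; it is an assumption, not necessarily the authors' formulation. The function name `decide` and the example numbers are hypothetical.

```python
def decide(l, u, minsup, e):
    """Classify a pattern from an interval [l, u] known to contain esup(S).

    Assumed conditions (not taken verbatim from the slides):
      - u < minsup            : esup(S) is surely below minsup          -> discard
      - l >= (1 - e) * minsup : esup(S) is surely not in the discard region -> output
      - otherwise             : the interval is too wide to decide      -> shrink
    """
    if u < minsup:
        return "discard"
    if l >= (1.0 - e) * minsup:
        return "output"
    return "shrink"


# Hypothetical examples with minsup = 0.3 and e = 0.1, so (1 - e) * minsup = 0.27.
example_output = decide(0.28, 0.31, minsup=0.3, e=0.1)
example_discard = decide(0.20, 0.29, minsup=0.3, e=0.1)
example_shrink = decide(0.20, 0.40, minsup=0.3, e=0.1)
```

Under these conditions, "shrink" can only occur when the interval is wider than e*minsup, so an interval of width at most e*minsup always leads to a decision.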

Approximating esup(S) by [l,u]

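$$\mathrm{esup}(S) = \frac{1}{|D|} \sum_{i=1}^{|D|} \Pr(S \text{ occurs in } G_i)$$

$$\Pr(S \text{ occurs in } G_i) = \sum_{g:\ g \text{ is implicated by } G_i \text{ and } S \text{ is contained in } g} \Pr(G_i \text{ implicates } g)$$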

A subgraph pattern S occurs in an uncertain graph G if S is contained in at least one implicated graph of G.
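Why the expected support can be written as an average of per-graph occurrence probabilities follows from linearity of expectation. The short derivation below is a sketch; the indicator variables X_i are notation introduced here and do not appear in the slides.

```latex
% X_i = 1 if S is contained in the implicated graph drawn from G_i, and 0 otherwise.
% The support of S in an implicated graph database is the average of the X_i.
\begin{align*}
\mathrm{esup}(S)
  &= \mathbb{E}\!\left[\frac{1}{|D|}\sum_{i=1}^{|D|} X_i\right]
   = \frac{1}{|D|}\sum_{i=1}^{|D|} \mathbb{E}[X_i] \\
  &= \frac{1}{|D|}\sum_{i=1}^{|D|} \Pr(X_i = 1)
   = \frac{1}{|D|}\sum_{i=1}^{|D|} \Pr(S \text{ occurs in } G_i).
\end{align*}
```

The practical consequence is that each uncertain graph can be handled separately, instead of enumerating all implicated graph databases.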

Algorithm: Approximate esup(S) by [l,u]

Step 1: For each uncertain graph Gi in D, approximate Pr(S occurs in Gi) by an interval [li, ui] of width at most e*minsup.

Step 2: Combine the intervals [li, ui] of all Gi in D into the interval [l, u].
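The formula for Step 2 is not preserved in this text. Since esup(S) is the average of Pr(S occurs in Gi) over the graphs in D, a natural choice is to average the lower bounds and the upper bounds; the sketch below assumes exactly that and should be read as an illustration rather than the authors' stated formula. The function name and the interval values are hypothetical.

```python
from math import isclose


def combine_intervals(intervals):
    """Average per-graph intervals [l_i, u_i] into one interval [l, u] for esup(S).

    Assumption: l is the mean of the l_i and u is the mean of the u_i. If every
    per-graph interval has width at most w, the combined interval also has
    width at most w.
    """
    n = len(intervals)
    l = sum(lo for lo, _ in intervals) / n
    u = sum(hi for _, hi in intervals) / n
    return l, u


# Hypothetical per-graph intervals, each of width at most 0.03.
l, u = combine_intervals([(0.70, 0.72), (0.10, 0.13)])
```

For these inputs the combined interval is [0.40, 0.425], whose width 0.025 does not exceed the largest per-graph width.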

Approximate Pr(S occurs in Gi) by [li, ui]

[Figure: uncertain graph Gi, with edges labeled by existence probabilities, and pattern S.]

Step 1: Find all embeddings of S in Gi. (4 embeddings)

Step 2: Assign boolean variables to the edges in the embeddings. Pr(x1) = 0.5, Pr(x2) = 0.6, Pr(x3) = 0.7, Pr(x4) = 0.8.

Step 3: Construct a conjunctive formula for each embedding. C1 = (x1 ∧ x2), C2 = (x1 ∧ x4), C3 = (x2 ∧ x3), C4 = (x3 ∧ x4).

Step 4: Construct a DNF formula. F = C1 ∨ C2 ∨ C3 ∨ C4.

Step 5: Estimate Pr(F = TRUE) by p using Karp & Luby’s Markov-Chain Monte-Carlo method with absolute error e*minsup/2 and confidence d (d ∈ [0,1]).

Step 6: [li, ui] = [p - e*minsup/2, p + e*minsup/2].
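Steps 2 to 6 can be sketched for the example formula as follows. The code implements a Karp-Luby-style coverage estimator for the probability of a DNF formula under independent variables, plus a brute-force exact computation for comparison. The sample count, the random seed, and the values of `e` and `minsup` are arbitrary choices made here; the slides do not state how many samples are needed for a given error and confidence.

```python
import random
from itertools import product
from math import prod

# Variables and clauses from the slide: one boolean variable per edge,
# one conjunctive clause per embedding.
prob = {"x1": 0.5, "x2": 0.6, "x3": 0.7, "x4": 0.8}
clauses = [("x1", "x2"), ("x1", "x4"), ("x2", "x3"), ("x3", "x4")]  # C1..C4


def exact_probability(clauses, prob):
    """Pr(F = TRUE) by enumerating all truth assignments (exponential; for checking only)."""
    variables = list(prob)
    total = 0.0
    for values in product([False, True], repeat=len(variables)):
        a = dict(zip(variables, values))
        if any(all(a[v] for v in c) for c in clauses):
            total += prod(prob[v] if a[v] else 1.0 - prob[v] for v in variables)
    return total


def karp_luby_estimate(clauses, prob, samples, rng):
    """Coverage estimator: pick a clause by weight, sample an assignment that
    satisfies it, and count the sample only if no earlier clause is satisfied."""
    weights = [prod(prob[v] for v in c) for c in clauses]  # Pr(C_j = TRUE)
    total_weight = sum(weights)
    hits = 0
    for _ in range(samples):
        j = rng.choices(range(len(clauses)), weights=weights)[0]
        # Variables of C_j are forced to TRUE; the others are sampled independently.
        a = {v: (v in clauses[j]) or (rng.random() < prob[v]) for v in prob}
        first = next(i for i, c in enumerate(clauses) if all(a[v] for v in c))
        hits += first == j
    return total_weight * hits / samples


exact = exact_probability(clauses, prob)
p = karp_luby_estimate(clauses, prob, samples=20000, rng=random.Random(0))

# Step 6 with hypothetical parameters e = 0.1 and minsup = 0.5.
e, minsup = 0.1, 0.5
li, ui = p - e * minsup / 2, p + e * minsup / 2
```

For this formula the exact value is 0.782, because F is equivalent to (x1 ∨ x3) ∧ (x2 ∨ x4), which gives (1 - 0.5 * 0.3) * (1 - 0.4 * 0.2).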

Experimental Results

Data

The STRING Database (http://string-db.org)

Time Efficiency

Approximation Quality

Scalability

Conclusions

A new model of uncertain graph data has been proposed.

The frequent subgraph pattern mining problem on uncertain graph data has been formalized.

The problem has been formally proved to be NP-hard.

An approximate mining algorithm has been proposed.

The proposed algorithm has high efficiency, high approximation quality, and high scalability.

Thank you
