© 2008 IBM Corporation
Mining Significant Graph Patterns by Leap Search
Xifeng Yan (IBM T. J. Watson), Hong Cheng and Jiawei Han (UIUC), Philip S. Yu (UIC)
IBM T. J. Watson Research Center
Graph Pattern Mining | © 2008 IBM Corporation
Graph Patterns
Interestingness measures / Objective functions
• Frequency: frequent graph pattern
• Discriminative: information gain, Fisher score
• Significance: G-test
• …
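To make the significance measure concrete, the sketch below computes a G-test score from a pattern's frequency p in the positive set and its frequency q in the negative set. The formula used, G = 2m (p ln(p/q) + (1-p) ln((1-p)/(1-q))) with m positive graphs, is a common form of the test and is an assumption here, since the slide does not spell it out. The function name and the clamping constant eps are illustrative only.

```python
import math

def g_test(p, q, m, eps=1e-6):
    """G-test score of a pattern seen in fraction p of m positive graphs,
    taking q (its frequency in the negative graphs) as the expected rate.
    eps is an illustrative clamp that keeps the logarithms finite."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return 2 * m * (p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q)))

same = g_test(0.5, 0.5, m=100)    # identical frequencies in both sets
skewed = g_test(0.8, 0.2, m=100)  # frequent in positives, rare in negatives
print(same, skewed)
```

A pattern with equal frequencies in both sets scores zero, and the score grows as the two frequencies move apart.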
Frequent Graph Pattern
Optimal Graph Pattern (this work)
Objective Functions
Challenge: Not Anti-Monotonic
Challenge: Not Anti-Monotonic (cont.)
Anti-monotonic: enumerate subgraphs from small size to large size
Non-monotonic: enumerate all subgraphs, then check their scores?
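The difference between the two cases can be seen on a toy example. In the sketch below, item sets stand in for graphs and subsets for subgraphs. The data, the helper names, and the frequency-gap score are all hypothetical, chosen only to show that frequency never increases when a pattern grows while a discriminative score can.

```python
def frequency(pattern, database):
    """Fraction of transactions that contain every item of `pattern`."""
    return sum(pattern <= t for t in database) / len(database)

# Toy data (hypothetical): item sets stand in for graphs.
positive = [{"a", "b"}, {"a", "b"}, {"a", "c"}]
negative = [{"a"}, {"a", "c"}, {"b"}]

def gap(pattern):
    """Illustrative score: positive frequency minus negative frequency."""
    return frequency(pattern, positive) - frequency(pattern, negative)

small, large = frozenset("a"), frozenset("ab")

# Frequency is anti-monotonic: the larger pattern is never more frequent.
assert frequency(large, positive) <= frequency(small, positive)
assert frequency(large, negative) <= frequency(small, negative)

# The score is not: here the larger pattern outscores its sub-pattern.
print(gap(small), gap(large))
```

Because the larger pattern can score higher, a low score on a small pattern is not enough to discard its extensions.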
Frequent Pattern Based Mining Framework
Exploratory task
Graph clustering
Graph classification
Graph index
(SIGMOD’04, ’05)(ISMB’05, ’07)
Graph Database -> Frequent Patterns -> Optimal Patterns
1. Bottleneck: millions, even billions, of patterns
2. No guarantee of quality
Direct Pattern Mining Framework
Exploratory task
Graph clustering
Graph classification
Graph index
Graph Database -> Optimal Patterns (direct)
How?
Upper-Bound
Upper-Bound: Anti-Monotonic (cont.)
Rule of thumb: the larger the difference between a graph pattern's frequency in the positive dataset and its frequency in the negative dataset, the more interesting the pattern.
We can recycle the existing graph mining algorithms to accommodate non-monotonic functions.
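A minimal sketch of how such an upper bound can drive pruning, assuming a G-test style score. Every super-pattern of a pattern with frequencies (p, q) has frequencies p' <= p and q' <= q, so under the rule of thumb the best it can do is keep one frequency and drop the other to zero. The names score, upper_bound, can_prune, and the clamp EPS are illustrative and do not come from the slides.

```python
import math

EPS = 1e-6  # illustrative clamp so that log(0) never occurs

def score(p, q, m=100):
    """G-test style score of positive frequency p against negative frequency q."""
    p = min(max(p, EPS), 1 - EPS)
    q = min(max(q, EPS), 1 - EPS)
    return 2 * m * (p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q)))

def upper_bound(p, q, m=100):
    """Bound on the score of every super-pattern of a pattern with
    frequencies (p, q): the widest possible gap is at (p, 0) or (0, q)."""
    return max(score(p, 0.0, m), score(0.0, q, m))

def can_prune(p, q, best_so_far, m=100):
    """Skip a branch whose upper bound cannot beat the best score so far."""
    return upper_bound(p, q, m) <= best_so_far

best = score(0.8, 0.1)  # best pattern found so far (hypothetical)
weak_branch = can_prune(0.1, 0.05, best)
strong_branch = can_prune(0.9, 0.5, best)
print(weak_branch, strong_branch)
```

The bound depends only on the two frequencies, so an existing frequent-pattern miner can evaluate it at each search node without other changes.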
Vertical Pruning
Large <- small
Horizontal Pruning: Structural Proximity
Structural Proximity: Another Perspective
# of frequent patterns >> # of possible frequency pairs
Many patterns share the same score
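The counting argument can be illustrated with a short sketch. A score that depends only on the pair of support counts can take at most (n_pos + 1) * (n_neg + 1) distinct values, however many patterns exist. The dataset sizes, pattern names, and score function below are hypothetical.

```python
from functools import lru_cache

n_pos, n_neg = 50, 50  # hypothetical dataset sizes

# Upper limit on the number of distinct (positive, negative) support pairs.
max_distinct_pairs = (n_pos + 1) * (n_neg + 1)

@lru_cache(maxsize=None)
def score(pos_count, neg_count):
    """Illustrative score: difference of the two frequencies."""
    return pos_count / n_pos - neg_count / n_neg

# Three hypothetical patterns, two of which share a support pair.
supports = {"C-C-O": (40, 5), "C-C=O": (40, 5), "C-N": (10, 30)}
scores = {name: score(*pair) for name, pair in supports.items()}
print(max_distinct_pairs, scores, score.cache_info().misses)
```

Patterns with the same support pair hit the cache, so the score is computed once per pair rather than once per pattern.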
14
IBM T. J. Watson Research Center
Graph Pattern Mining | © 2008 IBM Corporation
Frequency Envelope
15
IBM T. J. Watson Research Center
Graph Pattern Mining | © 2008 IBM Corporation
Structural Leap Search
16
IBM T. J. Watson Research Center
Graph Pattern Mining | © 2008 IBM Corporation
Frequency Association
Significant patterns often fall into the high quantiles of frequency
Start with the most frequent patterns
17
IBM T. J. Watson Research Center
Graph Pattern Mining | © 2008 IBM Corporation
Descending Leap Mine
1. Structural Leap Search with frequency threshold
2. Support-Descending Mining, until F(g*) converges
3. Structural Leap Search
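The support-descending step can be sketched as a loop that lowers the frequency threshold until the best score stops improving. This is a simplified illustration, not the authors' implementation: item sets stand in for graphs, brute-force enumeration stands in for structural leap search, and the halving schedule, the min_floor cut-off, and the frequency-gap score are assumptions.

```python
from itertools import combinations

def frequency(pattern, database):
    """Fraction of transactions that contain every item of `pattern`."""
    return sum(pattern <= t for t in database) / len(database)

def best_pattern(positive, negative, min_freq):
    """Brute-force stand-in for a pattern miner: the best-scoring item set
    whose positive frequency is at least min_freq."""
    items = sorted(set().union(*positive))
    best, best_score = None, float("-inf")
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            pattern = frozenset(combo)
            p = frequency(pattern, positive)
            if p < min_freq:
                continue
            s = p - frequency(pattern, negative)  # illustrative score
            if s > best_score:
                best, best_score = pattern, s
    return best, best_score

def descending_mine(positive, negative, min_floor=0.05):
    """Halve the threshold until the best score stops improving."""
    theta, best, best_score = 1.0, None, float("-inf")
    while theta >= min_floor:
        pattern, s = best_pattern(positive, negative, theta)
        if pattern is not None:
            if s <= best_score:
                break  # the best score has converged
            best, best_score = pattern, s
        theta /= 2
    return best, best_score

positive = [{"a", "b"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
negative = [{"a"}, {"a", "c"}, {"b"}, {"c"}]
found, found_score = descending_mine(positive, negative)
print(sorted(found), found_score)
```

Stopping at convergence is a heuristic: a lower threshold could in principle still reveal a better pattern, which is why the slide follows this step with a final structural leap search.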
Results: NCI Anti-Cancer Screen Datasets
Name       # of Compounds   Tumor Description
MCF-7      27,770           Breast
MOLT-4     39,765           Leukemia
NCI-H23    40,353           Non-Small Cell Lung
OVCAR-8    40,516           Ovarian
P388       41,472           Leukemia
PC-3       27,509           Prostate
SF-295     40,271           Central Nervous System
SN12C      40,004           Renal
SW-620     40,532           Colon
UACC257    39,988           Melanoma
YEAST      79,601           Yeast anti-cancer
Link: http://pubchem.ncbi.nlm.nih.gov
Chemical Compounds: anti-cancer or not
# of vertices: 10 ~ 200
Efficiency
Vertical Pruning
Horizontal Pruning
Effectiveness (runtime)
frequency descending
frequency descending + leap mine
Effectiveness (accuracy)
slightly different
Graph Classification
Name            OA Kernel   LEAP   OA Kernel (6x)   LEAP (6x)
Average (AUC)   0.70        0.72   0.75             0.77
* OA Kernel: Optimal Assignment Kernel; LEAP: LEAP search
Scalability Means Something!
[Figure: runtime of LEAP, OA, LEAP (6x), and OA (6x); labeled runtimes ~20 sec, ~100 sec, ~200 sec, ~8000 sec; growth annotated as Linear and Quadratic]
Direct Pattern Mining Framework
Exploratory task
Graph clustering
Graph classification
Graph index
Graph Database -> Optimal Graph Patterns (direct)
Beyond Graph Patterns
Exploratory task
Clustering
Classification
Index
Itemset/sequence/tree Database -> Optimal Patterns (direct)
1. Direct mining can be applied to itemsets, sequences, and trees
2. Existing algorithms can be recycled to mine patterns with sophisticated measures.
3. Pattern-based methods, including indexing and classification, are competitive.
Thank you
Direct Mining of Discriminative and Essential Graphical and Itemset Features via Model-based Search Tree
SIGKDD’08 @ Las Vegas
Graph Classification: Kernel Approach
Kernel-based Graph Classification
Optimal Assignment Kernel (Fröhlich et al. ICML’05)