Subgraph Containment Search
Dayu Yuan
The Pennsylvania State University
1 © Dayu Yuan 04/19/23
Outline
1. Background & Related Work: preliminaries & problem definition; filter + verification (feature-based index approach)
2. Lindex: a general index structure for subgraph search
3. Direct feature mining for subgraph search
Subgraph Search: Definition
Problem definition: in a graph database D = {g1, g2, ..., gn}, given a query graph q, the subgraph search algorithm returns all database graphs containing q as a subgraph, denoted D(q).
Solutions:
- Brute force: for each query q, scan the whole dataset to find D(q); effectively C(q) = D.
- Filter + verification: given a query q, find a candidate set C(q), then verify each graph in C(q) to obtain D(q).
(Figure: D ⊇ C(q) ⊇ D(q).)
Subgraph Search: Solutions
Filter + verification rule: if a graph g contains the query q, then g must contain all of q's subgraphs.
Inverted index of <key, value> pairs. Key: a subgraph feature (a small fragment of database graphs). Value: the posting list, i.e., the IDs of all database graphs containing the key subgraph.
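As a minimal sketch (not the actual implementation), the pipeline can be modeled with graphs as frozensets of edge identifiers, so that "subgraph" is plain edge-set inclusion standing in for real subgraph isomorphism; all names here are illustrative:

```python
# Toy filter + verification over an inverted feature index. Graphs are
# frozensets of edge identifiers, so containment is edge-set inclusion --
# a stand-in for the (NP-complete) subgraph isomorphism test.

def is_subgraph(a, b):
    """Toy containment test: every edge of a appears in b."""
    return a <= b

def build_index(db, features):
    """Inverted index: feature -> posting list (IDs of db graphs containing it)."""
    return {f: {gid for gid, g in db.items() if is_subgraph(f, g)} for f in features}

def search(q, db, index):
    # Filter: intersect the posting lists of all features contained in q.
    candidates = set(db)
    for f, posting in index.items():
        if is_subgraph(f, q):
            candidates &= posting
    # Verify: run the expensive containment test only on the candidates.
    return {gid for gid in candidates if is_subgraph(q, db[gid])}
```

For example, with db = {'a': {1,2,3}, 'b': {1,2}, 'c': {1,3}} and the single feature {2}, a query {1,2} is filtered down to {'a','b'} before verification.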
Subgraph Search: Related Work
Response time:
(1) Filtering cost (D → C(q)): the cost of searching for the subgraph features contained in the query, plus the cost of loading and joining the posting lists.
(2) Verification cost (C(q) → D(q)): subgraph isomorphism tests, which are NP-complete and dominate the overall cost.
Related work reduces the verification cost by mining subgraph features. Disadvantages: (1) different index structure designs for different features; (2) "batch mode" feature mining (discussed later).
Outline
1. Background
2. Lindex: a general index structure for subgraph search
   - Compact (memory consumption)
   - Effective (filtering power)
   - Efficient (response time)
   - Experiment results
3. Direct feature mining for subgraph search
Lindex: A General Index Structure
Contributions:
- Orthogonal to related work (feature mining)
- General: applicable to all subgraph/subtree features
- Compact: less memory consumption
- Effective: prunes more false positives (with the same features)
- Efficient: runs faster
Lindex: Compact
Space saving (extension labeling). Each edge in a graph is represented as
<ID(u), ID(v), Label(u), Label(edge(u, v)), Label(v)>.
The label of the graph sg2 is <1,2,6,1,7>, <1,3,6,2,6>, and the label of its chosen parent sg1 is <1,2,6,1,7>. The subgraph sg2 can therefore be stored as just the extra edge <1,3,6,2,6>.
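The parent-delta storage can be sketched as follows; the FeatureNode class and its method names are illustrative assumptions, with edge tuples in the slide's <ID(u), ID(v), Label(u), Label(edge(u,v)), Label(v)> format:

```python
# Sketch of the extension-labeling idea: a child feature stores only the
# edges it adds beyond its chosen parent, and full labels are rebuilt by
# walking up the parent chain. Names are illustrative.

class FeatureNode:
    def __init__(self, extra_edges, parent=None):
        self.parent = parent              # chosen parent in the lattice
        self.extra_edges = extra_edges    # edges beyond the parent's label

    def full_label(self):
        """Reconstruct the full edge list from the parent chain."""
        edges = [] if self.parent is None else self.parent.full_label()
        return edges + self.extra_edges

sg1 = FeatureNode([(1, 2, 6, 1, 7)])
sg2 = FeatureNode([(1, 3, 6, 2, 6)], parent=sg1)   # stores one edge, not two
```

Reconstructing sg2's label yields both edge tuples even though only the delta is stored.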
Lindex: Empirical Evaluation of Memory

Index \ Feature  DFG        ∆TCFG      MimR   Tree+∆     DFT
Feature count    7599/6238  9873/5712  5000   6172/3875  00/6172
Gindex           1359       1534       1348   1339       -
FGindex          -          1826       -      -          -
SwiftIndex       -          -          -      -          860
Lindex           677        841        772    676        671

(Units in KB.)
Lindex: Effective in Filtering
Definition (maxSub, minSup):
maxSub(g, S) = { gi ∈ S | gi ⊂ g, ¬∃x ∈ S s.t. gi ⊂ x ⊂ g }
minSup(g, S) = { gi ∈ S | g ⊂ gi, ¬∃x ∈ S s.t. g ⊂ x ⊂ gi }
In the running example: (1) sg2 and sg4 are maxSub of q; (2) sg5 is minSup of q.
Lindex: Effective in Filtering
Strategy One: minimal supergraph filtering. Given a query q and Lindex L(D, S), the candidate set on which the algorithm must run subgraph isomorphism tests is
C(q) = ∩i D(fi) − ∪j D(hj),  ∀fi ∈ maxSub(q), ∀hj ∈ minSup(q)
Example: sg2 and sg4 are maxSub of q and sg5 is minSup of q, so
C(q) = D(sg2) ∩ D(sg4) − D(sg5) = {a,b,c} ∩ {a,b,d} − {b} = {a}
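The strategy reduces to set algebra over posting lists; a minimal sketch, with the function name as an assumption:

```python
# Strategy One as set algebra: intersect the postings of the maxSub
# features, then subtract the postings of the minSup features (graphs
# containing a supergraph of q are answers already, so they skip
# verification entirely).

def candidates(max_sub_postings, min_sup_postings):
    c = set.intersection(*max_sub_postings)
    for d in min_sup_postings:
        c -= d
    return c
```

With the slide's postings D(sg2) = {a,b,c}, D(sg4) = {a,b,d}, and D(sg5) = {b}, this returns {a}.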
Lindex: Effective in Filtering
Strategy Two: postings partition into direct and indirect value sets.
- Direct set Vd(sg): graphs g ∈ D(sg) such that sg can extend to g without being isomorphic to any other indexed feature along the way.
- Indirect set: Vi(sg) = D(sg) − Vd(sg).
Question: why is "b" in the direct value set of "sg1", while "a" is not?
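A toy sketch of the partition, again modeling graphs as frozensets of edges so that containment is edge-set inclusion; the function and its arguments are illustrative assumptions:

```python
# Partition a posting list D(sg) into the direct set Vd(sg) and the
# indirect set Vi(sg) = D(sg) - Vd(sg). A graph g is indirect when some
# *other* indexed feature f sits between sg and g (sg inside f, f inside g).

def partition_posting(sg, posting, features, is_subgraph):
    direct, indirect = set(), set()
    for g in posting:
        between = any(f != sg and is_subgraph(sg, f) and is_subgraph(f, g)
                      for f in features)
        (indirect if between else direct).add(g)
    return direct, indirect
```

In the toy run below, the graph containing an intermediate feature lands in the indirect set, while the graph reached only through sg stays direct.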
Lindex: Effective in Filtering
Given a query q and Lindex L(D, S), the candidate set on which the algorithm must run subgraph isomorphism tests is
C(q) = ∩i Vd(fi) − ∪j D(hj),  ∀fi ∈ maxSub(q), ∀hj ∈ minSup(q)
Graphs verified for query "a":
- Traditional model: {a,b,c} ∩ {a,b,c} = {a,b,c}
- Strategy 1: {a,b,c} ∩ {a,b,c} − {c} = {a,b}
- Strategy 1 + 2: {a,c} ∩ {a} − {c} = {a}
Proof omitted.
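The three candidate computations can be reproduced side by side with illustrative posting lists and direct value sets chosen to match the slide's numbers (the sets themselves are assumptions, not the actual example data):

```python
# The three filtering models side by side for query "a". Posting lists and
# direct value sets below are illustrative values chosen to reproduce the
# slide's table.

def cand(sets, minus):
    c = set.intersection(*sets)
    for m in minus:
        c -= m
    return c

D_f1 = D_f2 = {"a", "b", "c"}            # full postings of the two maxSub features
Vd_f1, Vd_f2 = {"a", "c"}, {"a"}         # their direct value sets
D_h = {"c"}                              # posting of the minSup feature

traditional = cand([D_f1, D_f2], [])     # full postings, no supergraph filtering
strategy1 = cand([D_f1, D_f2], [D_h])    # subtract the minSup posting
strategy12 = cand([Vd_f1, Vd_f2], [D_h]) # intersect direct sets instead
```

Only one graph survives under both strategies combined, matching the table.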
Lindex: Efficient in maxSub Feature Search
Example: the label of the graph sg2 is <1,2,6,1,7>, <1,3,6,2,6>; the label of its chosen parent sg1 is <1,2,6,1,7>; node 1 of sg1 is mapped to node 1 of sg2.
Instead of constructing a canonical label for each subgraph of q and comparing it with the existing labels in the index to check whether an indexing feature matches, Lindex traverses the graph lattice: the mappings constructed to check that a graph sg1 is contained in q are incrementally expanded to check whether sg2, a supergraph of sg1 in the lattice, is also contained in q.
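A toy illustration of the idea on unlabeled graphs (node and edge labels omitted; the function names and representation are assumptions): the embeddings of sg1 are grown by one edge to obtain the embeddings of sg2, matching what a from-scratch search would find.

```python
from itertools import permutations

# Incremental mapping extension on a toy model: patterns and queries are
# node lists plus undirected edges stored as ordered pairs; labels are
# omitted for brevity.

def mappings(pattern_edges, pattern_nodes, query_edges, query_nodes):
    """Brute force: all injective node maps embedding the pattern in the query."""
    out = []
    for perm in permutations(query_nodes, len(pattern_nodes)):
        m = dict(zip(pattern_nodes, perm))
        if all((m[u], m[v]) in query_edges or (m[v], m[u]) in query_edges
               for u, v in pattern_edges):
            out.append(m)
    return out

def extend(maps, new_edge, query_edges, query_nodes):
    """Grow sg1's mappings by one edge (u, v), where v is the new node, to test sg2."""
    u, v = new_edge
    out = []
    for m in maps:
        for qv in query_nodes:
            if qv in m.values():
                continue
            if (m[u], qv) in query_edges or (qv, m[u]) in query_edges:
                out.append({**m, v: qv})
    return out
```

Extending the mappings of a one-node pattern by the edge (a, b) yields exactly the embeddings that a from-scratch match of the two-node pattern would find, without re-deriving canonical labels.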
Lindex: Efficient in minSup Feature Search
The set of minimal supergraphs of a query q in Lindex is a subset of the intersection of the descendant sets of each subgraph node of q in the partial lattice.
Outline
1. Background
2. Lindex: a general index structure for subgraph search
   - Compact (memory consumption)
   - Effective (filtering power)
   - Efficient (response time)
   - Experiment results
3. Direct feature mining for subgraph search
Lindex: Experiments
Experiments on the AIDS dataset (40,000 graphs)
Experiments on the AIDS dataset (40,000 graphs)
Lindex: Experiments
Experiments on the AIDS dataset (40,000 graphs)
Lindex: Experiments
Outline
1. Background
2. Lindex: a general index structure for subgraph search
3. Direct feature mining for subgraph search
   - Motivation
   - Problem definition & objective function
   - Branch & bound
   - Partition of the search space
   - Experiment results
Feature Mining: A Brief History
(Figure: graph feature mining feeds applications such as graph classification and graph containment search.)
Three generations:
1. Mine all frequent subgraphs
2. Batch-mode feature selection
3. Direct feature mining
Feature Mining: Motivation
All previous feature selection algorithms for the subgraph search problem follow a "batch mode":
- They assume a stable database
- Frequent subgraph enumeration is the bottleneck
- The parameters (minimum support, etc.) are hard to tune
Our contributions:
- The first direct feature mining algorithm for the subgraph search problem
- Effective in index updating
- Chooses high-quality features
Feature Mining: Problem Definition
Previous work: given a graph database D, find a set of subgraph (subtree) features minimizing the response time over a training query set Q:
P* = argmin_{|P| = N} Σ_{q∈Q} |C(q, P)|
Our work: given a graph database D and an already built index I with feature set P0, search for a new feature p such that the feature set P0 ∪ {p} minimizes the response time:
T_resp(q) = T_filter(q) + T_verf(q, C(q))
T_resp(q) ≈ T_verf(q, C(q)) ∝ |C(q)|, where |C(q)| = 0 if q ∈ P, and |C(q)| = |∩_{Xq[i]=1} D(pi)| otherwise
gain(p, P0) = Σ_{q∈Q} |C(q, P0)| − Σ_{q∈Q} |C(q, P0 ∪ {p})|
p* = argmax_p gain(p, P0)
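A sketch of evaluating gain(p, P0) on toy data, with edge-set inclusion standing in for subgraph isomorphism and the q ∈ P shortcut omitted (all names and data are assumptions):

```python
# Toy evaluation of gain(p, P0): how much the summed candidate-set size
# over the query log shrinks when feature p joins the feature set P0.
# Candidate sets intersect the postings of features contained in q.

def cand_size(q, feats, postings, db_ids):
    c = set(db_ids)
    for f in feats:
        if f <= q:                      # toy test for "f is a subgraph of q"
            c &= postings[f]
    return len(c)

def gain(p, P0, postings, queries, db_ids):
    old = sum(cand_size(q, P0, postings, db_ids) for q in queries)
    new = sum(cand_size(q, P0 + [p], postings, db_ids) for q in queries)
    return old - new
```

A positive gain means the new feature's posting list removes candidates that the old feature set could not.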
Feature Mining: Problem Definition
Iterative index updating: given database D and the current index I with feature set P0,
(1) Remove useless features: find the feature
    p* = argmin_{p∈P0} ( Σ_{q∈Q} |C(q, P0 \ {p})| − Σ_{q∈Q} |C(q, P0)| ),  then set P0 = P0 − {p*}
(2) Add new features: find a new feature
    p* = argmax_p ( Σ_{q∈Q} |C(q, P0)| − Σ_{q∈Q} |C(q, P0 ∪ {p})| ),  then set P0 = P0 + {p*}
(3) Go to (1).
Here C(q, P) = ∩_{Xq[i]=1} D(pi) = ∩_{pi ∈ maxSub(q, P)} D(pi).
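The loop might be sketched as follows, with the two scoring callables standing in for the argmin/argmax evaluations above (illustrative assumptions throughout):

```python
# Sketch of the iterative update loop: each round drops the feature whose
# removal costs least (step 1), then adds the candidate with the highest
# gain (step 2). The scoring callables stand in for the objective
# evaluations and are illustrative.

def update_index(P0, pool, gain_of_adding, loss_of_removing, rounds=1):
    P = list(P0)
    for _ in range(rounds):
        if P:                                    # (1) remove a useless feature
            worst = min(P, key=loss_of_removing)
            if loss_of_removing(worst) <= 0:     # removal does not hurt filtering
                P.remove(worst)
        scored = [(gain_of_adding(p, P), p) for p in pool if p not in P]
        if scored:                               # (2) add the best new feature
            best_gain, best = max(scored, key=lambda t: t[0])
            if best_gain > 0:
                P.append(best)
    return P
```

In practice the loop would repeat until the objective value stabilizes; here the round count is a parameter for simplicity.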
Feature Mining: More on the Objective Function
(1) Pros and cons of using query logs: the objective functions of previous algorithms (e.g., Gindex, FGindex) also depend on queries, though implicitly.
(2) The selected features are "discriminative". In previous work, the discriminative power of sg is measured w.r.t. sub(sg) or sup(sg), where sub(sg) denotes all subgraphs of sg and sup(sg) denotes all supergraphs of sg. In our objective function, discriminative power is measured w.r.t. P0.
(3) Computation issues:
Feature Mining: More on the Objective Function
gain(p, P0) = Σ_{q∈Q} |C(q, P0)| − Σ_{q∈Q} |C(q, P0 ∪ {p})|
The queries affected by p split into {q ∈ Q | q = p} and the minSup queries {q ∈ Q | p ∈ maxSub(q, P0 ∪ {p})}, so:
gain(p, P0) = Σ_{q∈minSup(p,Q)} ( |C(q, P0)| − |C(q, P0 ∪ {p})| ) + Σ_{q∈Q} I(p = q) |C(q, P0)|
gain(p, P0) = Σ_{q∈minSup(p,Q)} ( |C(q, P0)| − |C(q, P0) ∩ D(p)| ) + Σ_{q∈Q} I(p = q) |C(q, P0)|
Computing D(p) for each enumerated feature p is expensive.
Feature Mining: Challenges
(1) The objective function is expensive to evaluate.
(2) The search space for the new index subgraph feature p is exponential.
(3) The objective function is neither monotonic nor anti-monotonic, so the Apriori rule cannot be used.
(4) Traditional graph feature mining algorithms (e.g., LeapSearch) do not work, since they rely only on frequencies.
Feature Mining: Estimate the Objective Function
The objective function of a new subgraph feature p has an easy-to-compute upper bound and lower bound:
Upp(p, P0) = (1/|Q|) Σ_{q∈minSup(p,Q)} |C(q, P0) − D(q)| + (1/|Q|) Σ_{q∈Q} I(p = q) |C(q, P0)|
Low(p, P0) = (1/|Q|) Σ_{q∈minSup(p,Q)} |C(q, P0) − (1/γ) D(maxSub(p))| + (1/|Q|) Σ_{q∈Q} I(p = q) |C(q, P0)|
Both bounds are inexpensive to compute, and there are two ways to use them:
(1) Lazy calculation: gain(p, P0) need not be calculated when Upp(p, P0) < gain(p*, P0) or when Low(p, P0) > gain(p*, P0).
(2) Interpolation: gain(p, P0) ≈ α Upp(p, P0) + (1 − α) Low(p, P0).
Proof omitted.
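Both uses of the bounds can be sketched as follows; every callable here is an illustrative assumption standing for Upp, Low, and the exact gain:

```python
# (1) Lazy calculation: when Upp(p, P0) <= gain(p*, P0), feature p cannot
# beat the incumbent p*, so its exact gain is never computed.
# (2) Interpolation: estimate the gain as a mix of the two bounds.

def best_feature(candidates, upp, exact_gain):
    best, best_gain = None, float("-inf")
    for p in candidates:
        if upp(p) <= best_gain:
            continue                    # pruned: exact gain never evaluated
        g = exact_gain(p)
        if g > best_gain:
            best, best_gain = p, g
    return best, best_gain

def estimated_gain(p, upp, low, alpha=0.5):
    # gain(p, P0) ~ alpha * Upp(p, P0) + (1 - alpha) * Low(p, P0)
    return alpha * upp(p) + (1 - alpha) * low(p)
```

The saving comes from never calling the expensive exact evaluation on features whose upper bound already loses.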
Feature Mining: Branch and Bound
Exhaustive search over the DFS tree: a graph (pattern) can be canonically labeled as a string, and the DFS tree is a prefix tree of the graphs' labels. Because the objective function is neither monotonic nor anti-monotonic, branches can only be pruned with explicit bounds.
Example (DFS tree with nodes n1 through n7):
- Depth-first search visits n1, n2, n3, n4; the current best pattern is n3.
- On visiting n5, pre-observe that n5 and all its descendants have gain values less than n3's.
- Prune that branch and move on to visit n7.
Feature Mining: Branch and Bound
For each branch, e.g., the branch starting from n5, find a branch upper bound no smaller than the gain value of any node on that branch.
Theorem: for a feature p there exists an upper bound such that for all p' that are supergraphs of p, gain(p', P0) ≤ BUpp(p, P0):
BUpp(p) = (1/|Q|) { Σ_{q∈Q, q⊃p} |C(q, P0) − D(q)| + Σ_{q∈Q} max_{p'⊃p} |C(p')| I(q = p') }
Although correct, this upper bound is not tight.
Proof omitted.
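A sketch of the pruned traversal, with the DFS tree given as an explicit child map and gain/BUpp as assumed callables (the sketch assumes BUpp also bounds the branch root itself):

```python
# Branch and bound over the DFS prefix tree. A branch is pruned, and its
# nodes never evaluated, when its upper bound BUpp cannot beat the best
# gain found so far.

def branch_and_bound(root, children, gain, bupp):
    best, best_gain = None, float("-inf")
    stack = [root]
    while stack:
        node = stack.pop()
        if bupp(node) <= best_gain:
            continue                    # no pattern on this branch can win
        g = gain(node)
        if g > best_gain:
            best, best_gain = node, g
        stack.extend(reversed(children.get(node, [])))
    return best, best_gain
```

On a tree shaped like the slide's example, pruning n5 means its child n6 is never visited at all.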
Feature Mining: Heuristic-Based Search Space Partition
Problem: the search always starts from the same root and proceeds in the same order.
Observation: the new graph pattern p must be a supergraph of some pattern in P0, i.e., p ⊃ p2 in Figure 4. A root r is promising when: (1) a large proportion of the queries are supergraphs of r, since otherwise few queries would use p ⊃ r for filtering; and (2) the average candidate-set size for queries ⊃ r is large, which means improvement over those queries matters.
sPoint(r) = Σ_{q∈minSup(r,Q)} |C(q, P0) − D(q)| + Σ_{q∈minSup(r,Q)} max_{p'⊃r} |C(p')| I(q = p')
Procedure:
(1) gain(p*) = 0
(2) Sort all features in P0 by sPoint(pi) in decreasing order
(3) For i = 1 to |P| do:
    If the branch upper bound BUpp(ri) < gain(p*), break.
    Else find the minimal-supergraph queries minSup(ri, Q);
         p*(ri) = BranchAndBoundSearch(minSup(ri, Q), p*);
         if gain(p*(ri)) > gain(p*), update p* = p*(ri).
Discussion:
(1) Candidate features are enumerated as descendants of the root.
(2) Candidate features need only be frequent on D(r), not on all of D, allowing a smaller minimum support.
(3) Roots are visited in decreasing sPoint(r) order, so a near-optimal feature is found quickly.
(4) Top-k feature selection.
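The procedure can be sketched as follows; sPoint, BUpp, and the per-branch branch-and-bound search are passed in as assumed callables:

```python
# Partitioned search: roots (patterns already in P0) are visited in
# decreasing sPoint order; once a root's branch upper bound cannot beat the
# best gain so far, the loop breaks, as in the slide's procedure.

def partitioned_search(roots, spoint, bupp, search_branch):
    best, best_gain = None, 0.0              # gain(p*) starts at 0
    for r in sorted(roots, key=spoint, reverse=True):
        if bupp(r) <= best_gain:
            break                            # stop, as in step (3) above
        p, g = search_branch(r, best_gain)   # branch & bound over minSup(r, Q)
        if g > best_gain:
            best, best_gain = p, g
    return best, best_gain
```

Because promising roots come first, a strong incumbent is found early and most branches are never searched.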
Outline
1. Background
2. Lindex: a general index structure for subgraph search
3. Direct feature mining for subgraph search
   - Motivation
   - Problem definition & objective function
   - Branch & bound
   - Partition of the search space
   - Experiment results
Feature Mining: Experiments
The same AIDS dataset D.
- Index0: Gindex with minimum support 0.05
- IndexDF: Gindex with minimum support 0.02 (1,175 new features are added)
- Index QG/BB/TK (updated based on Index0): BB = branch and bound; QG = search space partitioned; TK = top-k features returned in one iteration
- Goal: achieve the same decrease in candidate set size
Feature Mining: Experiments
Feature Mining: Experiments
Feature Mining: Experiments
Two datasets: D1 & D2 (80% the same).
- DF(D1): Gindex on dataset D1; DF(D2): Gindex on dataset D2
- Index QG/BB/TK (updated based on DF(D1)): BB = branch and bound; QG = search space partitioned; TK = top-k features returned in one iteration
- Exp1: D2 = D1 + 20% new; Exp2: D2 = 80% D1 + 20% new
- Iterate until the objective value is stable
DF vs. iterative methods
Feature Mining: Experiments
Feature Mining: Experiments
TCFG vs. iterative methods
MimR vs. iterative methods
Iterate until the gain is stable.
Conclusion
1. Lindex: an index structure general enough to support any features; compact, effective, and efficient.
2. Direct feature mining:
   - A third-generation algorithm (no frequent-feature-enumeration bottleneck)
   - Effective in updating the index to accommodate changes
   - Runs much faster than building the index from scratch
   - The selected features filter more false positives than features selected from scratch
Thanks
Questions?