Subgraph Containment Search
Dayu Yuan
The Pennsylvania State University
1 © Dayu Yuan 04/19/23
Outline
1. Background & Related Work: preliminaries & problem definition; filter + verification (feature-based index approach)
2. Lindex: a general index structure for subgraph search
3. Direct feature mining for subgraph search
Subgraph Search: Definition
Problem definition: in a graph database D = {g1, g2, ..., gn}, given a query graph q, the subgraph search algorithm returns all database graphs containing q as a subgraph, denoted D(q).
Solutions:
- Brute force: for each query q, scan the whole dataset to find D(q); effectively C(q) = D.
- Filter + verification: given a query q, find a candidate set C(q), then verify each graph in C(q) to obtain D(q).
(Figure: D ⊇ C(q) ⊇ D(q).)
Subgraph Search: Solutions
Filter + verification rule: if a graph g contains the query q, then g must contain all of q's subgraphs.
Inverted index of <key, value> pairs. Key: a subgraph feature (a small fragment of database graphs). Value: the posting list, i.e., the IDs of all database graphs containing the key subgraph.
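As a minimal sketch (not the actual implementation), the pipeline can be modeled with graphs as frozensets of edge identifiers, so that "subgraph" is plain edge-set inclusion standing in for real subgraph isomorphism; all names here are illustrative:

```python
# Toy filter + verification over an inverted feature index. Graphs are
# frozensets of edge identifiers, so containment is edge-set inclusion --
# a stand-in for the (NP-complete) subgraph isomorphism test.

def is_subgraph(a, b):
    """Toy containment test: every edge of a appears in b."""
    return a <= b

def build_index(db, features):
    """Inverted index: feature -> posting list (IDs of db graphs containing it)."""
    return {f: {gid for gid, g in db.items() if is_subgraph(f, g)} for f in features}

def search(q, db, index):
    # Filter: intersect the posting lists of all features contained in q.
    candidates = set(db)
    for f, posting in index.items():
        if is_subgraph(f, q):
            candidates &= posting
    # Verify: run the expensive containment test only on the candidates.
    return {gid for gid in candidates if is_subgraph(q, db[gid])}
```

For example, with db = {'a': {1,2,3}, 'b': {1,2}, 'c': {1,3}} and the single feature {2}, a query {1,2} is filtered down to {'a','b'} before verification.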
Subgraph Search: Related Work
Response time:
(1) Filtering cost (D → C(q)): the cost of searching for the subgraph features contained in the query, plus the cost of loading and joining the posting lists.
(2) Verification cost (C(q) → D(q)): subgraph isomorphism tests, which are NP-complete and dominate the overall cost.
Related work reduces the verification cost by mining subgraph features. Disadvantages: (1) different index structure designs for different features; (2) "batch mode" feature mining (discussed later).
Outline
1. Background
2. Lindex: a general index structure for subgraph search
   - Compact (memory consumption)
   - Effective (filtering power)
   - Efficient (response time)
   - Experiment results
3. Direct feature mining for subgraph search
Lindex: A General Index Structure
Contributions:
- Orthogonal to related work (feature mining)
- General: applicable to all subgraph/subtree features
- Compact: less memory consumption
- Effective: prunes more false positives (with the same features)
- Efficient: runs faster
Lindex: Compact
Space saving (extension labeling). Each edge in a graph is represented as
<ID(u), ID(v), Label(u), Label(edge(u, v)), Label(v)>.
The label of the graph sg2 is <1,2,6,1,7>, <1,3,6,2,6>, and the label of its chosen parent sg1 is <1,2,6,1,7>. The subgraph sg2 can therefore be stored as just the extra edge <1,3,6,2,6>.
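The parent-delta storage can be sketched as follows; the FeatureNode class and its method names are illustrative assumptions, with edge tuples in the slide's <ID(u), ID(v), Label(u), Label(edge(u,v)), Label(v)> format:

```python
# Sketch of the extension-labeling idea: a child feature stores only the
# edges it adds beyond its chosen parent, and full labels are rebuilt by
# walking up the parent chain. Names are illustrative.

class FeatureNode:
    def __init__(self, extra_edges, parent=None):
        self.parent = parent              # chosen parent in the lattice
        self.extra_edges = extra_edges    # edges beyond the parent's label

    def full_label(self):
        """Reconstruct the full edge list from the parent chain."""
        edges = [] if self.parent is None else self.parent.full_label()
        return edges + self.extra_edges

sg1 = FeatureNode([(1, 2, 6, 1, 7)])
sg2 = FeatureNode([(1, 3, 6, 2, 6)], parent=sg1)   # stores one edge, not two
```

Reconstructing sg2's label yields both edge tuples even though only the delta is stored.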
Lindex: Empirical Evaluation of Memory

Index \ Feature  DFG        ∆TCFG      MimR   Tree+∆     DFT
Feature count    7599/6238  9873/5712  5000   6172/3875  00/6172
Gindex           1359       1534       1348   1339       -
FGindex          -          1826       -      -          -
SwiftIndex       -          -          -      -          860
Lindex           677        841        772    676        671

(Units in KB.)
Lindex: Effective in Filtering
Definition (maxSub, minSup):
maxSub(g, S) = { gi ∈ S | gi ⊂ g, ¬∃x ∈ S s.t. gi ⊂ x ⊂ g }
minSup(g, S) = { gi ∈ S | g ⊂ gi, ¬∃x ∈ S s.t. g ⊂ x ⊂ gi }
In the running example: (1) sg2 and sg4 are maxSub of q; (2) sg5 is minSup of q.
Lindex: Effective in Filtering
Strategy One: minimal supergraph filtering. Given a query q and Lindex L(D, S), the candidate set on which the algorithm must run subgraph isomorphism tests is
C(q) = ∩i D(fi) − ∪j D(hj),  ∀fi ∈ maxSub(q), ∀hj ∈ minSup(q)
Example: sg2 and sg4 are maxSub of q and sg5 is minSup of q, so
C(q) = D(sg2) ∩ D(sg4) − D(sg5) = {a,b,c} ∩ {a,b,d} − {b} = {a}
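The strategy reduces to set algebra over posting lists; a minimal sketch, with the function name as an assumption:

```python
# Strategy One as set algebra: intersect the postings of the maxSub
# features, then subtract the postings of the minSup features (graphs
# containing a supergraph of q are answers already, so they skip
# verification entirely).

def candidates(max_sub_postings, min_sup_postings):
    c = set.intersection(*max_sub_postings)
    for d in min_sup_postings:
        c -= d
    return c
```

With the slide's postings D(sg2) = {a,b,c}, D(sg4) = {a,b,d}, and D(sg5) = {b}, this returns {a}.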
Lindex: Effective in Filtering
Strategy Two: postings partition into direct and indirect value sets.
- Direct set Vd(sg): graphs g ∈ D(sg) such that sg can extend to g without being isomorphic to any other indexed feature along the way.
- Indirect set: Vi(sg) = D(sg) − Vd(sg).
Question: why is "b" in the direct value set of "sg1", while "a" is not?
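A toy sketch of the partition, again modeling graphs as frozensets of edges so that containment is edge-set inclusion; the function and its arguments are illustrative assumptions:

```python
# Partition a posting list D(sg) into the direct set Vd(sg) and the
# indirect set Vi(sg) = D(sg) - Vd(sg). A graph g is indirect when some
# *other* indexed feature f sits between sg and g (sg inside f, f inside g).

def partition_posting(sg, posting, features, is_subgraph):
    direct, indirect = set(), set()
    for g in posting:
        between = any(f != sg and is_subgraph(sg, f) and is_subgraph(f, g)
                      for f in features)
        (indirect if between else direct).add(g)
    return direct, indirect
```

In the toy run below, the graph containing an intermediate feature lands in the indirect set, while the graph reached only through sg stays direct.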
Lindex: Effective in Filtering
Given a query q and Lindex L(D, S), the candidate set on which the algorithm must run subgraph isomorphism tests is
C(q) = ∩i Vd(fi) − ∪j D(hj),  ∀fi ∈ maxSub(q), ∀hj ∈ minSup(q)
Graphs verified for query "a":
- Traditional model: {a,b,c} ∩ {a,b,c} = {a,b,c}
- Strategy 1: {a,b,c} ∩ {a,b,c} − {c} = {a,b}
- Strategy 1 + 2: {a,c} ∩ {a} − {c} = {a}
Proof omitted.
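The three candidate computations can be reproduced side by side with illustrative posting lists and direct value sets chosen to match the slide's numbers (the sets themselves are assumptions, not the actual example data):

```python
# The three filtering models side by side for query "a". Posting lists and
# direct value sets below are illustrative values chosen to reproduce the
# slide's table.

def cand(sets, minus):
    c = set.intersection(*sets)
    for m in minus:
        c -= m
    return c

D_f1 = D_f2 = {"a", "b", "c"}            # full postings of the two maxSub features
Vd_f1, Vd_f2 = {"a", "c"}, {"a"}         # their direct value sets
D_h = {"c"}                              # posting of the minSup feature

traditional = cand([D_f1, D_f2], [])     # full postings, no supergraph filtering
strategy1 = cand([D_f1, D_f2], [D_h])    # subtract the minSup posting
strategy12 = cand([Vd_f1, Vd_f2], [D_h]) # intersect direct sets instead
```

Only one graph survives under both strategies combined, matching the table.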
Lindex: Efficient in maxSub Feature Search
Example: the label of the graph sg2 is <1,2,6,1,7>, <1,3,6,2,6>; the label of its chosen parent sg1 is <1,2,6,1,7>; node 1 of sg1 is mapped to node 1 of sg2.
Instead of constructing a canonical label for each subgraph of q and comparing it with the existing labels in the index to check whether an indexing feature matches, Lindex traverses the graph lattice: the mappings constructed to check that a graph sg1 is contained in q are incrementally expanded to check whether sg2, a supergraph of sg1 in the lattice, is also contained in q.
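A toy illustration of the idea on unlabeled graphs (node and edge labels omitted; the function names and representation are assumptions): the embeddings of sg1 are grown by one edge to obtain the embeddings of sg2, matching what a from-scratch search would find.

```python
from itertools import permutations

# Incremental mapping extension on a toy model: patterns and queries are
# node lists plus undirected edges stored as ordered pairs; labels are
# omitted for brevity.

def mappings(pattern_edges, pattern_nodes, query_edges, query_nodes):
    """Brute force: all injective node maps embedding the pattern in the query."""
    out = []
    for perm in permutations(query_nodes, len(pattern_nodes)):
        m = dict(zip(pattern_nodes, perm))
        if all((m[u], m[v]) in query_edges or (m[v], m[u]) in query_edges
               for u, v in pattern_edges):
            out.append(m)
    return out

def extend(maps, new_edge, query_edges, query_nodes):
    """Grow sg1's mappings by one edge (u, v), where v is the new node, to test sg2."""
    u, v = new_edge
    out = []
    for m in maps:
        for qv in query_nodes:
            if qv in m.values():
                continue
            if (m[u], qv) in query_edges or (qv, m[u]) in query_edges:
                out.append({**m, v: qv})
    return out
```

Extending the mappings of a one-node pattern by the edge (a, b) yields exactly the embeddings that a from-scratch match of the two-node pattern would find, without re-deriving canonical labels.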
Lindex: Efficient in minSup Feature Search
The set of minimal supergraphs of a query q in Lindex is a subset of the intersection of the descendant sets of each subgraph node of q in the partial lattice.
Outline
1. Background
2. Lindex: a general index structure for subgraph search
   - Compact (memory consumption)
   - Effective (filtering power)
   - Efficient (response time)
   - Experiment results
3. Direct feature mining for subgraph search
Lindex: Experiments
Experiments on the AIDS dataset (40,000 graphs)
Experiments on the AIDS dataset (40,000 graphs)
Lindex: Experiments
Experiments on the AIDS dataset (40,000 graphs)
Lindex: Experiments
Outline
1. Background
2. Lindex: a general index structure for subgraph search
3. Direct feature mining for subgraph search
   - Motivation
   - Problem definition & objective function
   - Branch & bound
   - Partition of the search space
   - Experiment results
Feature Mining: A Brief History
(Figure: graph feature mining feeds applications such as graph classification and graph containment search.)
Three generations:
1. Mine all frequent subgraphs
2. Batch-mode feature selection
3. Direct feature mining
Feature Mining: Motivation
All previous feature selection algorithms for the subgraph search problem follow a "batch mode":
- They assume a stable database
- Frequent subgraph enumeration is the bottleneck
- The parameters (minimum support, etc.) are hard to tune
Our contributions:
- The first direct feature mining algorithm for the subgraph search problem
- Effective in index updating
- Chooses high-quality features
Feature Mining: Problem Definition
Previous work: given a graph database D, find a set of subgraph (subtree) features minimizing the response time over a training query set Q:
P* = argmin_{|P| = N} Σ_{q∈Q} |C(q, P)|
Our work: given a graph database D and an already built index I with feature set P0, search for a new feature p such that the feature set P0 ∪ {p} minimizes the response time:
T_resp(q) = T_filter(q) + T_verf(q, C(q))
T_resp(q) ≈ T_verf(q, C(q)) ∝ |C(q)|, where |C(q)| = 0 if q ∈ P, and |C(q)| = |∩_{Xq[i]=1} D(pi)| otherwise
gain(p, P0) = Σ_{q∈Q} |C(q, P0)| − Σ_{q∈Q} |C(q, P0 ∪ {p})|
p* = argmax_p gain(p, P0)
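A sketch of evaluating gain(p, P0) on toy data, with edge-set inclusion standing in for subgraph isomorphism and the q ∈ P shortcut omitted (all names and data are assumptions):

```python
# Toy evaluation of gain(p, P0): how much the summed candidate-set size
# over the query log shrinks when feature p joins the feature set P0.
# Candidate sets intersect the postings of features contained in q.

def cand_size(q, feats, postings, db_ids):
    c = set(db_ids)
    for f in feats:
        if f <= q:                      # toy test for "f is a subgraph of q"
            c &= postings[f]
    return len(c)

def gain(p, P0, postings, queries, db_ids):
    old = sum(cand_size(q, P0, postings, db_ids) for q in queries)
    new = sum(cand_size(q, P0 + [p], postings, db_ids) for q in queries)
    return old - new
```

A positive gain means the new feature's posting list removes candidates that the old feature set could not.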
Feature Mining: Problem Definition
Iterative index updating: given database D and the current index I with feature set P0,
(1) Remove useless features: find the feature
    p* = argmin_{p∈P0} ( Σ_{q∈Q} |C(q, P0 \ {p})| − Σ_{q∈Q} |C(q, P0)| ),  then set P0 = P0 − {p*}
(2) Add new features: find a new feature
    p* = argmax_p ( Σ_{q∈Q} |C(q, P0)| − Σ_{q∈Q} |C(q, P0 ∪ {p})| ),  then set P0 = P0 + {p*}
(3) Go to (1).
Here C(q, P) = ∩_{Xq[i]=1} D(pi) = ∩_{pi ∈ maxSub(q, P)} D(pi).
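The loop might be sketched as follows, with the two scoring callables standing in for the argmin/argmax evaluations above (illustrative assumptions throughout):

```python
# Sketch of the iterative update loop: each round drops the feature whose
# removal costs least (step 1), then adds the candidate with the highest
# gain (step 2). The scoring callables stand in for the objective
# evaluations and are illustrative.

def update_index(P0, pool, gain_of_adding, loss_of_removing, rounds=1):
    P = list(P0)
    for _ in range(rounds):
        if P:                                    # (1) remove a useless feature
            worst = min(P, key=loss_of_removing)
            if loss_of_removing(worst) <= 0:     # removal does not hurt filtering
                P.remove(worst)
        scored = [(gain_of_adding(p, P), p) for p in pool if p not in P]
        if scored:                               # (2) add the best new feature
            best_gain, best = max(scored, key=lambda t: t[0])
            if best_gain > 0:
                P.append(best)
    return P
```

In practice the loop would repeat until the objective value stabilizes; here the round count is a parameter for simplicity.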
Feature Mining: More on the Objective Function
(1) Pros and cons of using query logs: the objective functions of previous algorithms (e.g., Gindex, FGindex) also depend on queries, though implicitly.
(2) The selected features are "discriminative". In previous work, the discriminative power of sg is measured w.r.t. sub(sg) or sup(sg), where sub(sg) denotes all subgraphs of sg and sup(sg) denotes all supergraphs of sg. In our objective function, discriminative power is measured w.r.t. P0.
(3) Computation issues:
Feature Mining: More on the Objective Function
gain(p, P0) = Σ_{q∈Q} |C(q, P0)| − Σ_{q∈Q} |C(q, P0 ∪ {p})|
The queries affected by p split into {q ∈ Q | q = p} and the minSup queries {q ∈ Q | p ∈ maxSub(q, P0 ∪ {p})}, so:
gain(p, P0) = Σ_{q∈minSup(p,Q)} ( |C(q, P0)| − |C(q, P0 ∪ {p})| ) + Σ_{q∈Q} I(p = q) |C(q, P0)|
gain(p, P0) = Σ_{q∈minSup(p,Q)} ( |C(q, P0)| − |C(q, P0) ∩ D(p)| ) + Σ_{q∈Q} I(p = q) |C(q, P0)|
Computing D(p) for each enumerated feature p is expensive.
Feature Mining: Challenges
(1) The objective function is expensive to evaluate.
(2) The search space for the new index subgraph feature p is exponential.
(3) The objective function is neither monotonic nor anti-monotonic, so the Apriori rule cannot be used.
(4) Traditional graph feature mining algorithms (e.g., LeapSearch) do not work, since they rely only on frequencies.
Feature Mining: Estimate the Objective Function
The objective function of a new subgraph feature p has an easy-to-compute upper bound and lower bound:
Upp(p, P0) = (1/|Q|) Σ_{q∈minSup(p,Q)} |C(q, P0) − D(q)| + (1/|Q|) Σ_{q∈Q} I(p = q) |C(q, P0)|
Low(p, P0) = (1/|Q|) Σ_{q∈minSup(p,Q)} |C(q, P0) − (1/γ) D(maxSub(p))| + (1/|Q|) Σ_{q∈Q} I(p = q) |C(q, P0)|
Both bounds are inexpensive to compute, and there are two ways to use them:
(1) Lazy calculation: gain(p, P0) need not be calculated when Upp(p, P0) < gain(p*, P0) or when Low(p, P0) > gain(p*, P0).
(2) Interpolation: gain(p, P0) ≈ α Upp(p, P0) + (1 − α) Low(p, P0).
Proof omitted.
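Both uses of the bounds can be sketched as follows; every callable here is an illustrative assumption standing for Upp, Low, and the exact gain:

```python
# (1) Lazy calculation: when Upp(p, P0) <= gain(p*, P0), feature p cannot
# beat the incumbent p*, so its exact gain is never computed.
# (2) Interpolation: estimate the gain as a mix of the two bounds.

def best_feature(candidates, upp, exact_gain):
    best, best_gain = None, float("-inf")
    for p in candidates:
        if upp(p) <= best_gain:
            continue                    # pruned: exact gain never evaluated
        g = exact_gain(p)
        if g > best_gain:
            best, best_gain = p, g
    return best, best_gain

def estimated_gain(p, upp, low, alpha=0.5):
    # gain(p, P0) ~ alpha * Upp(p, P0) + (1 - alpha) * Low(p, P0)
    return alpha * upp(p) + (1 - alpha) * low(p)
```

The saving comes from never calling the expensive exact evaluation on features whose upper bound already loses.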
Feature Mining: Branch and Bound
Exhaustive search over the DFS tree: a graph (pattern) can be canonically labeled as a string, and the DFS tree is a prefix tree of the graphs' labels. Because the objective function is neither monotonic nor anti-monotonic, branches can only be pruned with explicit bounds.
Example (DFS tree with nodes n1 through n7):
- Depth-first search visits n1, n2, n3, n4; the current best pattern is n3.
- On visiting n5, pre-observe that n5 and all its descendants have gain values less than n3's.
- Prune that branch and move on to visit n7.
Feature Mining: Branch and Bound
For each branch, e.g., the branch starting from n5, find a branch upper bound no smaller than the gain value of any node on that branch.
Theorem: for a feature p there exists an upper bound such that for all p' that are supergraphs of p, gain(p', P0) ≤ BUpp(p, P0):
BUpp(p) = (1/|Q|) { Σ_{q∈Q, q⊃p} |C(q, P0) − D(q)| + Σ_{q∈Q} max_{p'⊃p} |C(p')| I(q = p') }
Although correct, this upper bound is not tight.
Proof omitted.
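A sketch of the pruned traversal, with the DFS tree given as an explicit child map and gain/BUpp as assumed callables (the sketch assumes BUpp also bounds the branch root itself):

```python
# Branch and bound over the DFS prefix tree. A branch is pruned, and its
# nodes never evaluated, when its upper bound BUpp cannot beat the best
# gain found so far.

def branch_and_bound(root, children, gain, bupp):
    best, best_gain = None, float("-inf")
    stack = [root]
    while stack:
        node = stack.pop()
        if bupp(node) <= best_gain:
            continue                    # no pattern on this branch can win
        g = gain(node)
        if g > best_gain:
            best, best_gain = node, g
        stack.extend(reversed(children.get(node, [])))
    return best, best_gain
```

On a tree shaped like the slide's example, pruning n5 means its child n6 is never visited at all.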
Feature Mining: Heuristic-Based Search Space Partition
Problem: the search always starts from the same root and proceeds in the same order.
Observation: the new graph pattern p must be a supergraph of some pattern in P0, i.e., p ⊃ p2 in Figure 4. A root r is promising when: (1) a large proportion of the queries are supergraphs of r, since otherwise few queries would use p ⊃ r for filtering; and (2) the average candidate-set size for queries ⊃ r is large, which means improvement over those queries matters.
sPoint(r) = Σ_{q∈minSup(r,Q)} |C(q, P0) − D(q)| + Σ_{q∈minSup(r,Q)} max_{p'⊃r} |C(p')| I(q = p')
Procedure:
(1) gain(p*) = 0
(2) Sort all features in P0 by sPoint(pi) in decreasing order
(3) For i = 1 to |P| do:
    If the branch upper bound BUpp(ri) < gain(p*), break.
    Else find the minimal-supergraph queries minSup(ri, Q);
         p*(ri) = BranchAndBoundSearch(minSup(ri, Q), p*);
         if gain(p*(ri)) > gain(p*), update p* = p*(ri).
Discussion:
(1) Candidate features are enumerated as descendants of the root.
(2) Candidate features need only be frequent on D(r), not on all of D, allowing a smaller minimum support.
(3) Roots are visited in decreasing sPoint(r) order, so a near-optimal feature is found quickly.
(4) Top-k feature selection.
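The procedure can be sketched as follows; sPoint, BUpp, and the per-branch branch-and-bound search are passed in as assumed callables:

```python
# Partitioned search: roots (patterns already in P0) are visited in
# decreasing sPoint order; once a root's branch upper bound cannot beat the
# best gain so far, the loop breaks, as in the slide's procedure.

def partitioned_search(roots, spoint, bupp, search_branch):
    best, best_gain = None, 0.0              # gain(p*) starts at 0
    for r in sorted(roots, key=spoint, reverse=True):
        if bupp(r) <= best_gain:
            break                            # stop, as in step (3) above
        p, g = search_branch(r, best_gain)   # branch & bound over minSup(r, Q)
        if g > best_gain:
            best, best_gain = p, g
    return best, best_gain
```

Because promising roots come first, a strong incumbent is found early and most branches are never searched.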
Outline
1. Background
2. Lindex: a general index structure for subgraph search
3. Direct feature mining for subgraph search
   - Motivation
   - Problem definition & objective function
   - Branch & bound
   - Partition of the search space
   - Experiment results
Feature Mining: Experiments
The same AIDS dataset D.
- Index0: Gindex with minimum support 0.05
- IndexDF: Gindex with minimum support 0.02 (1,175 new features are added)
- Index QG/BB/TK (updated based on Index0): BB = branch and bound; QG = search space partitioned; TK = top-k features returned in one iteration
- Goal: achieve the same decrease in candidate set size
Feature Mining: Experiments
Feature Mining: Experiments
Feature Mining: Experiments
Two datasets: D1 & D2 (80% the same).
- DF(D1): Gindex on dataset D1; DF(D2): Gindex on dataset D2
- Index QG/BB/TK (updated based on DF(D1)): BB = branch and bound; QG = search space partitioned; TK = top-k features returned in one iteration
- Exp1: D2 = D1 + 20% new; Exp2: D2 = 80% D1 + 20% new
- Iterate until the objective value is stable
DF vs. iterative methods
Feature Mining: Experiments
Feature Mining: Experiments
TCFG vs. iterative methods
MimR vs. iterative methods
Iterate until the gain is stable.
Conclusion
1. Lindex: an index structure general enough to support any features; compact, effective, and efficient.
2. Direct feature mining:
   - A third-generation algorithm (no frequent-feature-enumeration bottleneck)
   - Effective in updating the index to accommodate changes
   - Runs much faster than building the index from scratch
   - The selected features filter more false positives than features selected from scratch
Thanks
Questions?