Fast and practical indexing and querying of very large graphs
Silke Trißl, Ulf Leser
Humboldt-Universität zu Berlin
Presenter: Liwen Sun (Stephen)
SIGMOD’07


Introduction

Let G = (V, E) be a graph and let v, w be two nodes of G. w is reachable from v iff there exists a path from v to w.

Given two nodes v and w, the function reach(v, w) returns true if w is reachable from v, and false otherwise.

The objective of this paper: efficiently answering reach(v, w) queries on very large graphs.

Contents

Existing Methods
GRIPP
  Index Creation
  Query GRIPP
  Pruning Strategies
  Impact of Traversal Order
Experimental Result
Summary

Existing methods

Given a graph with n nodes and m edges:

Recursive, i.e., depth-first traversal
  No additional index
  Query time: O(m+n)

Transitive closure (TC), Agrawal et al. [VLDB’87], H. Lu [VLDB’87]
  Pre-compute the set of node pairs (v, w) for which w is reachable from v.
  Index size: O(n^2); index construction time: O(n^3)
  Query time: O(1)
  The index is infeasible for very large graphs, e.g., 1M nodes.
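The two baselines above sit at opposite ends of the trade-off, which a small sketch may make concrete. The graph `GRAPH` and the function names are hypothetical and only for illustration; the closure is computed with a Warshall-style loop, which is one way to reach the O(n^3) bound and not necessarily the method of the cited papers. Reachability is taken here to mean a path of at least one edge, which is an assumption.

```python
# Hypothetical example graph as adjacency lists (assumed, for illustration only).
GRAPH = {
    "r": ["A"], "A": ["B", "C", "D"], "B": ["E", "F"],
    "D": ["G", "H"], "G": ["B"], "H": ["A"],
}


def reach_dfs(graph, v, w):
    """No index: answer one query by a depth-first traversal, O(m+n)."""
    seen = set()
    stack = list(graph.get(v, []))
    while stack:
        node = stack.pop()
        if node == w:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, []))
    return False


def transitive_closure(graph):
    """Full index: map every node to the set of nodes it reaches."""
    nodes = set(graph) | {c for cs in graph.values() for c in cs}
    closure = {v: set(graph.get(v, [])) for v in nodes}
    # Warshall-style step: if i reaches k, i also reaches everything k reaches.
    for k in nodes:
        for i in nodes:
            if k in closure[i]:
                closure[i] |= closure[k]
    return closure


TC = transitive_closure(GRAPH)


def reach_tc(v, w):
    """With the closure in memory, a query is a single set lookup."""
    return w in TC[v]
```

For n = 1M nodes the closure can hold up to n^2 = 10^12 pairs in the worst case, which is why the slide calls it infeasible.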

Existing methods

Pre- and postorder numbering scheme, Dietz et al. [STOC’87]; only suitable for trees
  Assign each node v a preorder value and a postorder value, i.e., [v_pre, v_post]
  Values are set according to the order of a DFS traversal
    v_pre is the timestamp when v is first visited
    v_post is the timestamp after all of v’s successors have been visited
  E.g., the DFS traversal order is r, A, B, E, F, C, D, ...
    r_pre = 0, A_pre = 1, B_pre = 2, E_pre = 3, E_post = 4, F_pre = 5, F_post = 6, B_post = 7, ...
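The numbering can be sketched as a single DFS with a shared clock. The tree `TREE` below is an assumption, chosen so that the traversal order is r, A, B, E, F, C, D, ... as in the example; `number_tree` is a hypothetical helper name.

```python
# Assumed example tree (children listed in visiting order).
TREE = {
    "r": ["A"], "A": ["B", "C", "D"], "B": ["E", "F"], "D": ["G", "H"],
}


def number_tree(tree, root):
    """Assign [pre, post] timestamps to every node of a tree by DFS."""
    pre, post = {}, {}
    clock = 0

    def visit(v):
        nonlocal clock
        pre[v] = clock          # timestamp when v is first visited
        clock += 1
        for child in tree.get(v, []):
            visit(child)
        post[v] = clock         # timestamp after all successors are done
        clock += 1

    visit(root)
    return pre, post


PRE, POST = number_tree(TREE, "r")
print(PRE["B"], POST["B"])  # 2 7
```

The recursion is fine for small inputs; for very large trees an explicit stack would avoid Python's recursion limit.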

Existing methods

Pre- and postorder numbering scheme (cont’d)

reach(v, w) <==> v_pre < w_pre < v_post

A reachability query becomes a simple range query.

However, general graphs are much more complicated: one node may have multiple incoming edges.

node  pre  post
r       0    17
A       1    16
B       2     7
E       3     4
F       5     6
C       8     9
D      10    15
G      11    12
H      13    14

E.g., the preorder value of G is 11.
  11 lies in [1, 16] of A, so reach(A, G) = true.
  11 does not lie in [2, 7] of B, so reach(B, G) = false.
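As a minimal sketch, the range test for trees can be written directly against the table of pre- and postorder values. The dictionary `LABELS` simply restates that table; `reach_tree` is a hypothetical name.

```python
# (pre, post) values copied from the table in the slides.
LABELS = {
    "r": (0, 17), "A": (1, 16), "B": (2, 7), "E": (3, 4), "F": (5, 6),
    "C": (8, 9), "D": (10, 15), "G": (11, 12), "H": (13, 14),
}


def reach_tree(v, w):
    """In a tree, w is reachable from v iff v_pre < w_pre < v_post."""
    v_pre, v_post = LABELS[v]
    w_pre, _ = LABELS[w]
    return v_pre < w_pre < v_post


print(reach_tree("A", "G"))  # True
print(reach_tree("B", "G"))  # False
```

In a relational setting the same test would be a range predicate on an indexed preorder column.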

Existing methods

Recent approaches
  Based on pre- and postorder numbering scheme, Label+SSPI, Chen et al. [VLDB’05]
  2-Hop-Cover, He et al. [CIKM’05]
  TLC matrix, Wang et al. [ICDE’06]
  GRIPP: today’s talk
    GRaph Indexing based on Pre- and Postorder numbering

Index Creation

Assumptions
  The graph G has exactly one root, i.e., a node without incoming edges.
    This is only for the sake of discussion and is not a limitation of the algorithm.
  Each node’s children have an arbitrary, yet fixed order, e.g., by node ID.
    The children are always visited according to this order.

Index Creation

One node may have multiple incoming edges and can thus be visited many times.
  E.g., A and B both have two incoming edges.

One [v_pre, v_post] pair per node is not enough.
  At time 2, B is visited for the first time, after A.
  At time 12, B is visited for the second time, after G.

Index Creation

The index is created by a DFS traversal of G that assigns each node pre- and postorder values.

When visiting a previously visited node, e.g., B:
  No children of B (e.g., E, F) are explored again.
  B has already been assigned a preorder value.
  Make a new copy of B in the index table to record the new pre- and postorder values.

Index Creation

Let IND(G) be the index table for graph G.
  A tuple in IND(G) is an instance of a node v in G.
  The tree instance is the first instance created for v.
  All other instances are non-tree instances.

(Figure labels: tree instance of B; non-tree instance of B)

Index Creation

Summary of GRIPP Index
  GRIPP uses a separate index table, IND(G), which sufficiently represents the graph structure of G.
  IND(G) is viewed as a relational table, and relevant attributes (e.g., the preorder value) are indexed by a B-tree.
  Construction time: O(m+n)
    By DFS traversal of G
  Index size: O(m+n)
    O(n) tree instances
    O(m-n) non-tree instances
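The construction can be sketched as a DFS that emits one row per visit. This is an in-memory illustration, not the SQL implementation described in the talk. The edge list in `GRAPH` is an assumption, reconstructed so that the result agrees with the numbers quoted in these slides (B visited at times 2 and 12, RIS(A) = [1,20], RIS(B) = [2,7]); `build_gripp` is a hypothetical name.

```python
# Assumed example graph; children are listed in their fixed visiting order.
GRAPH = {
    "r": ["A"], "A": ["B", "C", "D"], "B": ["E", "F"],
    "D": ["G", "H"], "G": ["B"], "H": ["A"],
}


def build_gripp(graph, root):
    """Return IND(G) as rows (node, pre, post, is_tree), sorted by pre."""
    rows = []
    seen = set()
    clock = 0

    def visit(v):
        nonlocal clock
        pre = clock
        clock += 1
        is_tree = v not in seen
        if is_tree:
            # First visit: create the tree instance and explore the children.
            seen.add(v)
            for child in graph.get(v, []):
                visit(child)
        # Later visits create a non-tree instance and do not descend again.
        post = clock
        clock += 1
        rows.append((v, pre, post, is_tree))

    visit(root)
    return sorted(rows, key=lambda row: row[1])


IND = build_gripp(GRAPH, "r")
for row in IND:
    print(row)
```

Every edge is followed exactly once, so the table has one row per edge plus one for the root, which is where the O(m+n) size comes from.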

Querying GRIPP

Order Tree
  IND(G) resembles a rooted tree, which we call the order tree, O(G).
  The non-tree instances in IND(G) must be leaf nodes in O(G).

Querying GRIPP

RIS(v): reachable instance set of v, i.e., all instances in IND(G) whose preorder value lies between v_pre and v_post of the tree instance of v

To evaluate reach(D, G):
  Retrieve RIS(D), i.e., a range query over IND(G), and we find G.
  Only 1 operation.

To evaluate reach(D, C):
  Retrieve RIS(D), and find no instance of C.
  However, RIS(D) contains non-tree instances of A and B, which may have successors in G.
  We need to recursively examine RIS(A) or RIS(B).

Querying GRIPP

reach(v, w)
  Examine if w has an instance in RIS(v). If yes, return true.
  If not, recursively examine the RIS of each hop node in RIS(v).
  The algorithm stops when an instance of w is found, or when no unvisited hop node remains.
  Worst case: O(m-n) recursive calls

Hop Node
  Given two nodes h and v, if RIS(v) contains a non-tree instance of h, then h is a hop node for v.
  E.g., A and B are hop nodes of D.
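The query procedure can be sketched without any pruning beyond remembering which nodes were already expanded. The table `IND` is an assumed index for the running example, with values chosen to agree with the ranges quoted in these slides; a linear scan stands in for the B-tree range query, and the optional `trace` argument is a hypothetical addition that records whose RIS was retrieved.

```python
# Assumed index table: (node, pre, post, is_tree), sorted by pre.
IND = [
    ("r", 0, 21, True), ("A", 1, 20, True), ("B", 2, 7, True),
    ("E", 3, 4, True), ("F", 5, 6, True), ("C", 8, 9, True),
    ("D", 10, 19, True), ("G", 11, 14, True), ("B", 12, 13, False),
    ("H", 15, 18, True), ("A", 16, 17, False),
]
TREE = {node: (pre, post) for node, pre, post, is_tree in IND if is_tree}


def ris(v):
    """Instances whose pre value lies strictly inside v's tree interval."""
    v_pre, v_post = TREE[v]
    return [row for row in IND if v_pre < row[1] < v_post]


def reach(v, w, trace=None):
    """Search RIS(v), then recurse into the RIS of unvisited hop nodes."""
    used = set()

    def search(x):
        used.add(x)
        if trace is not None:
            trace.append(x)
        instances = ris(x)
        if any(node == w for node, _, _, _ in instances):
            return True
        # Hop nodes are the nodes with a non-tree instance in RIS(x).
        return any(
            search(node)
            for node, _, _, is_tree in instances
            if not is_tree and node not in used
        )

    return search(v)


calls = []
print(reach("D", "C", calls), calls)  # True ['D', 'B', 'A']
```

Here reach(D, G) is answered from RIS(D) alone, while reach(D, C) has to hop through B and A before an instance of C is found.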

Pruning Strategies

Skip Strategy
  Keep a list U of nodes whose RIS have been retrieved before, to avoid redundant checking.
  1) If U = {A}, i.e., we have checked RIS(A), then we don’t need to check RIS(B) at all, since RIS(B), i.e., [2,7], lies entirely within RIS(A), i.e., [1,20].

Pruning Strategies

Skip Strategy
  2) When examining RIS(A), if U = {B}, then we can skip the range of RIS(B), i.e., [2,7].
     This reduces the range of RIS(A) from [1,20] to ([1,2] U [7,20]).
  3) However, between RIS(B) and RIS(D), no pruning is possible.
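The three skip cases can be illustrated by computing which sub-ranges of RIS(v) still need to be scanned, given the set U of nodes already retrieved. This is a sketch under the assumption that tree intervals are either nested or disjoint; the intervals in `TREE` are assumed values consistent with the ranges quoted above, and `pruned_ranges` is a hypothetical helper, not the paper's procedure.

```python
# Assumed (pre, post) intervals of the tree instances in the running example.
TREE = {
    "r": (0, 21), "A": (1, 20), "B": (2, 7), "E": (3, 4), "F": (5, 6),
    "C": (8, 9), "D": (10, 19), "G": (11, 14), "H": (15, 18),
}


def pruned_ranges(v, used):
    """Return the parts of v's interval not covered by intervals in `used`."""
    lo, hi = TREE[v]
    # Case 1: RIS(v) lies inside an already retrieved RIS, nothing to scan.
    if any(TREE[u][0] <= lo and hi <= TREE[u][1] for u in used):
        return []
    ranges, cursor = [], lo
    # Case 2: cut out every retrieved interval nested inside RIS(v).
    for u_pre, u_post in sorted(TREE[u] for u in used):
        if cursor <= u_pre and u_post < hi:
            ranges.append((cursor, u_pre))
            cursor = u_post
    # Case 3: intervals outside RIS(v) leave the range untouched.
    ranges.append((cursor, hi))
    return ranges


print(pruned_ranges("B", {"A"}))  # []
print(pruned_ranges("A", {"B"}))  # [(1, 2), (7, 20)]
print(pruned_ranges("B", {"D"}))  # [(2, 7)]
```

Each returned pair would become one range predicate on the preorder column.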

Pruning Strategies

Stop Strategy
  A node s is called a stop node iff all non-tree instances in RIS(s) also have their corresponding tree instances in RIS(s).
  After examining RIS(s), we don’t need to recursively search the hop nodes in RIS(s).
  A is a stop node: no need to search RIS(A) and RIS(B) again.
  D is NOT a stop node.
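A stop-node test follows almost literally from the definition. The sketch below uses an assumed index table for the running example. One detail is an assumption on my part: a non-tree instance of s itself inside RIS(s) is treated as covered by the tree instance of s, which is what makes A a stop node in this example.

```python
# Assumed index table: (node, pre, post, is_tree), sorted by pre.
IND = [
    ("r", 0, 21, True), ("A", 1, 20, True), ("B", 2, 7, True),
    ("E", 3, 4, True), ("F", 5, 6, True), ("C", 8, 9, True),
    ("D", 10, 19, True), ("G", 11, 14, True), ("B", 12, 13, False),
    ("H", 15, 18, True), ("A", 16, 17, False),
]
TREE = {node: (pre, post) for node, pre, post, is_tree in IND if is_tree}


def is_stop_node(s):
    """True iff every non-tree instance in RIS(s) has its tree instance there."""
    s_pre, s_post = TREE[s]
    inside = [row for row in IND if s_pre < row[1] < s_post]
    # Assumption: s's own tree instance counts as part of the covered set.
    covered = {node for node, _, _, is_tree in inside if is_tree} | {s}
    return all(
        node in covered for node, _, _, is_tree in inside if not is_tree
    )


print(is_stop_node("A"))  # True
print(is_stop_node("D"))  # False
```

D fails the test because RIS(D) holds non-tree instances of A and B whose tree instances lie outside D's interval.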

Impact of Traversal Order

During index construction, the traversal order doesn’t affect the index size; however, it has a strong impact on the performance of reachability queries.

The goal
  Larger RIS, and thus fewer recursive calls.
  RIS is a range, and can be accessed in sub-linear time using a B-tree.


Strongly connected component (SCC)

The reachability information of nodes in the same SCC is identical.

E.g., if reach(e,g) = true, then reach(b, i) = true, reach(e,k) = true, etc.
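The SCC property can be checked on a small example. The graph below is hypothetical (it is not the figure the node names e, g, b, i, k refer to), and the SCC is found by a simple mutual-reachability test, chosen for brevity; linear-time algorithms such as Tarjan's exist for large graphs.

```python
# Hypothetical example graph; A -> D -> H -> A forms a cycle.
GRAPH = {
    "r": ["A"], "A": ["B", "C", "D"], "B": ["E", "F"],
    "D": ["G", "H"], "G": ["B"], "H": ["A"],
}


def reachable(graph, v):
    """Set of nodes reachable from v via one or more edges."""
    found = set()
    stack = list(graph.get(v, []))
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(graph.get(node, []))
    return found


def scc_of(graph, v):
    """Nodes that v reaches and that reach v back, plus v itself."""
    return {v} | {w for w in reachable(graph, v) if v in reachable(graph, w)}


component = scc_of(GRAPH, "A")
print(sorted(component))  # ['A', 'D', 'H']
# All members of one SCC reach exactly the same set of nodes.
print(len({frozenset(reachable(GRAPH, v)) for v in component}))  # 1
```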

Impact of Traversal Order

Node traversal order within an SCC
  Consider a strongly connected component C1.
  During index creation, let c be the first node traversed in C1.
  Since every other node in C1 has not been visited before and is reachable from c, RIS(c) contains the tree instance of every other node in C1.

  For nodes v, w in C1:
    reach(c, w): only 1 recursive call, RIS(c).
    reach(v, w) with v≠c: if RIS(v) has a non-tree instance of c, then only two recursive calls, RIS(v), then RIS(c).
  Ideal: for every other node w in C1, RIS(w) contains a non-tree instance of c.


Giant Strongly Connected Component
  Erdős proved that directed random graphs with many edges contain one giant SCC C’.
    Other SCCs tend to be small.
    The size of C’ depends on the graph density.
    This also somewhat reflects real-world graphs.
  Better to traverse C’ before other SCCs.
    Larger RIS


Optimal GRIPP structure

(Figure: graph G and the GRIPP index for G)

reach(a_n, s_1)
  Only three recursive calls:
    Find the non-tree instance of h in RIS(a_n)
    Find the non-tree instance of c in RIS(h)
    Find s_1 in RIS(c)


Heuristic Strategies
  Make a new virtual root r, then add an edge from r to the node with the highest degree in G, say s.
    s tends to have more incoming edges and to reach more nodes.
    s is highly likely to be within the giant SCC.
    s is a good choice for the first visited node in an SCC, i.e., node c in C1.
    This also handles the case where G has no root or more than one root.
  Then, traverse the children in descending order of their degrees.
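The heuristic can be sketched as a preprocessing step that fixes the traversal order before the index is built. Several details are assumptions here: degree is taken as in-degree plus out-degree, ties are broken by node name, and the names `order_for_traversal` and `__root__` are hypothetical. The sketch covers only what the slide states; how nodes unreachable from s get attached to the virtual root is not shown.

```python
from collections import Counter


def order_for_traversal(edges, virtual_root="__root__"):
    """Return (adjacency with children sorted by degree, start node s)."""
    degree = Counter()
    children = {}
    for v, w in edges:
        degree[v] += 1          # out-degree contribution
        degree[w] += 1          # in-degree contribution
        children.setdefault(v, []).append(w)
        children.setdefault(w, [])

    def by_degree(node):
        # Highest degree first; the name breaks ties deterministically.
        return (-degree[node], node)

    ordered = {v: sorted(ws, key=by_degree) for v, ws in children.items()}
    start = min(degree, key=by_degree)
    # The virtual root gets a single edge to the highest-degree node s.
    ordered[virtual_root] = [start]
    return ordered, start


EDGES = [
    ("r", "A"), ("A", "B"), ("A", "C"), ("A", "D"), ("B", "E"),
    ("B", "F"), ("D", "G"), ("D", "H"), ("G", "B"), ("H", "A"),
]
ORDERED, START = order_for_traversal(EDGES)
print(START, ORDERED["A"])  # A ['B', 'D', 'C']
```

Feeding `ORDERED` to a DFS from the virtual root then visits high-degree nodes first, which is the intent of the heuristic.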

Experimental Result


Summary

Practicality
  Throughout the paper, no theorem or formula
  All algorithms are based on SQL and implemented on a commercial RDBMS
  Separate and uniform index table

Most heuristic efforts are made towards transitive closure
  Stop node s
    Recursion stops whenever we encounter s.
  First visited node c in the giant SCC
    RIS(c) is a very large set
    Other nodes in the SCC, and even outside the SCC, rely on the reach power of c.

Thank you!
