Fast and practical indexing and querying of very large graphs
Silke Trißl, Ulf Leser
Humboldt-Universität zu Berlin
Presenter: Liwen Sun (Stephen)
SIGMOD’07
Introduction
Let G = (V, E) be a graph and let v, w be two nodes of G.
w is reachable from v iff there exists a path from v to w.
Given two nodes v and w, the function reach(v, w) returns true if w is reachable from v, and false otherwise.
Objective of this paper: efficiently answer reach(v, w) queries on very large graphs.
Contents
Existing Methods
GRIPP
  Index Creation
  Querying GRIPP
  Pruning Strategies
  Impact of Traversal Order
Experimental Result
Summary
Existing methods
Given a graph with n nodes and m edges:
Recursive, i.e., depth-first traversal
  No additional index
  Query time: O(m+n)
Transitive closure (TC), Agrawal et al. [VLDB’87], H. Lu [VLDB’87]
  Pre-compute the set of node pairs (v, w) for which w is reachable from v.
  Index size: O(n^2); index construction time: O(n^3)
  Query time: O(1)
  The index is infeasible for very large graphs, e.g., 1M nodes
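The two baselines above can be sketched as follows. This is a minimal illustration, not the cited authors' code: the adjacency-dict representation, the function names, and the small `TREE` graph are assumptions, and the closure is built here by one traversal per node rather than by the O(n^3) algorithm the slide refers to.

```python
def reachable_from(adj, v):
    """Return the set of nodes reachable from v via at least one edge (iterative DFS)."""
    seen = set()
    stack = [v]
    while stack:
        node = stack.pop()
        for succ in adj.get(node, ()):
            if succ not in seen:
                seen.add(succ)
                stack.append(succ)
    return seen


def reach_dfs(adj, v, w):
    """Baseline 1: no index, one O(m+n) traversal per query."""
    return w in reachable_from(adj, v)


def transitive_closure(adj):
    """Baseline 2: precompute every reachable pair; lookups are then O(1)."""
    nodes = set(adj) | {s for succs in adj.values() for s in succs}
    return {(v, w) for v in nodes for w in reachable_from(adj, v)}


# Hypothetical example graph (a small tree), used only for illustration.
TREE = {"r": ["A"], "A": ["B", "C", "D"], "B": ["E", "F"], "D": ["G", "H"]}

TC = transitive_closure(TREE)
print(reach_dfs(TREE, "A", "G"))   # True
print(("B", "G") in TC)            # False
```

The closure set grows with the number of reachable pairs, which is what makes it impractical at the million-node scale mentioned above.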
Existing methods
Pre- and postorder numbering scheme, Dietz et al. [STOC’87]; only suitable for trees
Assign each node v a preorder value and a postorder value, i.e., [v_pre, v_post]
Values are set according to the order of the DFS traversal
  v_pre is the timestamp when v is first visited
  v_post is the timestamp after all of v’s successors have been visited
E.g., the DFS traversal order is r, A, B, E, F, C, D, ...
  r_pre = 0, A_pre = 1, B_pre = 2, E_pre = 3, E_post = 4, F_pre = 5, F_post = 6, B_post = 7, ...
Existing methods
Pre- and postorder numbering scheme (cont’d)
reach(v, w) <==> v_pre < w_pre < v_post
A reachability query becomes a simple range query.
However, general graphs are much more complicated: one node may have multiple incoming edges.
node pre post
r 0 17
A 1 16
B 2 7
E 3 4
F 5 6
C 8 9
D 10 15
G 11 12
H 13 14
E.g., the preorder value of G is 11:
11 lies in [1,16] of A, so reach(A,G) = true
11 does not lie in [2,7] of B, so reach(B,G) = false
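The numbering and the range test can be reproduced with a short sketch. The tree shape below is an assumption, reconstructed so that it yields exactly the pre/post values in the table; the function names are illustrative.

```python
def number_tree(children, root):
    """Assign [pre, post] timestamps by DFS; `children` maps a node to its ordered child list."""
    pre, post = {}, {}
    clock = 0

    def visit(node):
        nonlocal clock
        pre[node] = clock          # timestamp of the first visit
        clock += 1
        for child in children.get(node, ()):
            visit(child)
        post[node] = clock         # timestamp after all successors are done
        clock += 1

    visit(root)
    return pre, post


def reach_tree(pre, post, v, w):
    """reach(v, w) on a tree: w's preorder value lies strictly inside v's range."""
    return pre[v] < pre[w] < post[v]


# Assumed tree, chosen to match the table above.
TREE = {"r": ["A"], "A": ["B", "C", "D"], "B": ["E", "F"], "D": ["G", "H"]}

pre, post = number_tree(TREE, "r")
print(pre["A"], post["A"])               # 1 16
print(reach_tree(pre, post, "A", "G"))   # True
print(reach_tree(pre, post, "B", "G"))   # False
```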
Existing methods
Recent approaches based on pre- and postorder numbering scheme
  Label+SSPI, Chen et al. [VLDB’05]
  2-Hop-Cover, He et al. [CIKM’05]
  TLC matrix, Wang et al. [ICDE’06]
  GRIPP: today’s talk
    GRaph Indexing based on Pre- and Postorder numbering
Index Creation
Assumptions
  The graph G has exactly one root, i.e., a node without incoming edges.
    This is only for the sake of discussion and is not a limitation of the algorithm.
  Each node’s children have an arbitrary yet fixed order, e.g., by node ID.
    The children are always visited according to this order.
Index Creation
One node may have multiple incoming edges and can thus be visited many times. E.g., A and B both have two incoming edges.
One [v_pre, v_post] pair per node is not enough. At time 2, B is visited for the first time, after A. At time 12, B is visited for the second time, after G.
Index Creation
The index is created by a DFS traversal of G that assigns each node pre- and postorder values.
When visiting a previously visited node, e.g., B:
  B has already been assigned a preorder value, so no children of B (e.g., E, F) are explored again.
  A new copy of B is made in the index table to record the new pre- and postorder values.
Index Creation
Let IND(G) be the index table for graph G. A tuple in IND(G) is an instance of node v in G
A tree instance is the first instance created for v. Other instances are non-tree instances.
tree instance of B
non-tree instance of B
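Index creation as described on these slides can be sketched as below. The graph is an assumption: it is a reconstruction chosen to agree with the numbers quoted in the talk (B visited at times 2 and 12, RIS(A) spanning [1,20]), and the tuple layout `(node, pre, post, kind)` is illustrative rather than the paper's schema. Recursion is used for brevity; a very large graph would need an iterative traversal.

```python
def build_gripp_index(adj, root):
    """Return IND(G) as a list of (node, pre, post, kind) tuples, sorted by preorder value."""
    index = []
    seen = set()
    clock = 0

    def visit(node):
        nonlocal clock
        pre = clock
        clock += 1
        if node in seen:
            kind = "non-tree"      # already indexed: record a copy, do not descend
        else:
            seen.add(node)
            kind = "tree"          # first instance: explore children in fixed order
            for child in adj.get(node, ()):
                visit(child)
        post = clock
        clock += 1
        index.append((node, pre, post, kind))

    visit(root)
    index.sort(key=lambda t: t[1])
    return index


# Assumed example graph: a tree plus the extra edges G -> B and H -> A.
GRAPH = {"r": ["A"], "A": ["B", "C", "D"], "B": ["E", "F"],
         "D": ["G", "H"], "G": ["B"], "H": ["A"]}

IND = build_gripp_index(GRAPH, "r")
for row in IND:
    print(row)
```

With n = 9 nodes and m = 10 edges, the table holds 9 tree instances and 2 non-tree instances, one per edge that leads to an already visited node.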
Index Creation
Summary of GRIPP Index
GRIPP uses a separate index table, IND(G), which sufficiently represents the graph structure of G.
IND(G) is viewed as a relational table, and relevant attributes (e.g., preorder value) are indexed by B-trees.
Construction time: O(m+n)
  By DFS traversal of G
Index size: O(m+n)
  O(n) tree instances
  O(m-n) non-tree instances
Querying GRIPP
Order Tree
IND(G) resembles a rooted tree, which we call the order tree, O(G).
The non-tree instances in IND(G) must be leaf nodes in O(G).
Querying GRIPP
RIS(v): Reachable instance set of v
To evaluate reach(D,G):
  Retrieve RIS(D), i.e., a range query over IND(G), and find G.
  Only 1 operation.
To evaluate reach(D,C):
  Retrieve RIS(D), and find no instance of C.
  RIS(D) contains non-tree instances of A and B, which may have successors in G.
  We therefore need to recursively examine RIS(A) or RIS(B).
Querying GRIPP
reach(v, w)
  Examine whether w has an instance in RIS(v). If yes, return true.
  If not, recursively examine the RIS of each hop node in RIS(v).
  The algorithm stops when an instance of w is found, or when no unvisited hop node remains.
  Worst case: O(m-n) recursive calls
Hop Node
  Given two nodes h and v, if RIS(v) contains a non-tree instance of h, then h is a hop node for v.
  E.g., A and B are hop nodes of D.
Pruning Strategies
Skip Strategy
Keep a list U of nodes whose RIS have been retrieved before, to avoid redundant checking.
1) If U = {A}, i.e., we have checked RIS(A), then we don’t need to check RIS(B) at all, since the range of RIS(B) lies inside the range of RIS(A).
Pruning Strategies
Skip Strategy
2) When examining RIS(A), if U = {B}, then we can skip the range of RIS(B), i.e., [2,7]. This reduces the range of RIS(A) from [1,20] to ([1,2] U [7,20]).
3) However, between RIS(B) and RIS(D), no pruning is possible.
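The range arithmetic behind the skip strategy can be sketched as interval subtraction. This is an illustration only: the function name and the `(low, high)` pair representation are assumptions, and the paper's SQL-based formulation may differ.

```python
def subtract_ranges(rng, skips):
    """Remove already examined ranges `skips` from `rng`; all ranges are (low, high) pairs."""
    pieces = []
    low, high = rng
    for s_low, s_high in sorted(skips):
        if s_high <= low or s_low >= high:
            continue                      # no overlap with what is left of rng
        if s_low > low:
            pieces.append((low, s_low))   # keep the part before the skipped range
        low = max(low, s_high)            # continue after the skipped range
    if low < high:
        pieces.append((low, high))
    return pieces


# Case 2: RIS(A) = [1,20] with RIS(B) = [2,7] already examined.
print(subtract_ranges((1, 20), [(2, 7)]))   # [(1, 2), (7, 20)]
# Case 1: RIS(B) = [2,7] with RIS(A) = [1,20] already examined: nothing is left to scan.
print(subtract_ranges((2, 7), [(1, 20)]))   # []
```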
Pruning Strategies
Stop Strategy
A node s is called a stop node iff all non-tree instances in RIS(s) also have their corresponding tree instances in RIS(s).
After examining RIS(s), we don’t need to recursively search the hop nodes in RIS(s).
A is a stop node: no need to search RIS(A) and RIS(B) again.
D is NOT a stop node.
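The stop-node test can be sketched directly from the definition. Two assumptions are made here: the `IND` table is a reconstruction consistent with the numbers on the slides, and the tree instance of s itself is counted as belonging to RIS(s), which is what makes A (whose RIS contains a non-tree instance of A) qualify as a stop node.

```python
# Assumed index table: (node, pre, post, kind).
IND = [
    ("r", 0, 21, "tree"), ("A", 1, 20, "tree"), ("B", 2, 7, "tree"),
    ("E", 3, 4, "tree"), ("F", 5, 6, "tree"), ("C", 8, 9, "tree"),
    ("D", 10, 19, "tree"), ("G", 11, 14, "tree"), ("B", 12, 13, "non-tree"),
    ("H", 15, 18, "tree"), ("A", 16, 17, "non-tree"),
]


def is_stop_node(ind, s):
    """True iff every non-tree instance in RIS(s) has its tree instance in RIS(s) (or is s itself)."""
    s_pre, s_post = next((p, q) for n, p, q, k in ind if n == s and k == "tree")
    in_ris = [t for t in ind if s_pre < t[1] < s_post]
    # Assumption: s counts as covered by its own tree instance.
    tree_nodes = {n for n, _, _, k in in_ris if k == "tree"} | {s}
    return all(n in tree_nodes for n, _, _, k in in_ris if k == "non-tree")


print(is_stop_node(IND, "A"))   # True
print(is_stop_node(IND, "D"))   # False
```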
Impact of Traversal Order
During index construction, the traversal order doesn’t affect the index size; however, it has a strong impact on the performance of reachability queries.
The goal: larger RIS, and thus fewer recursive calls.
  An RIS is a range, and can be accessed in sub-linear time using a B-tree.
Impact of Traversal Order
Strongly connected component (SCC)
The reachability information of nodes in the same SCC is identical.
E.g., if reach(e,g) = true, then reach(b,i) = true, reach(e,k) = true, etc.
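The SCC property can be checked on a toy graph. The graph and function names below are hypothetical (the graph is not the one from the slide's figure), and components are found by a simple mutual-reachability test rather than by a linear-time algorithm such as Tarjan's.

```python
def reachable_from(adj, v):
    """Set of nodes reachable from v via at least one edge (iterative DFS)."""
    seen = set()
    stack = [v]
    while stack:
        node = stack.pop()
        for succ in adj.get(node, ()):
            if succ not in seen:
                seen.add(succ)
                stack.append(succ)
    return seen


def strongly_connected_components(adj):
    """Group nodes that can reach each other; returns (components, reach_sets)."""
    nodes = sorted(set(adj) | {s for succs in adj.values() for s in succs})
    reach_sets = {v: reachable_from(adj, v) for v in nodes}
    components, assigned = [], set()
    for v in nodes:
        if v in assigned:
            continue
        comp = {v} | {w for w in reach_sets[v] if v in reach_sets[w]}
        assigned |= comp
        components.append(comp)
    return components, reach_sets


# Hypothetical graph: cycle a -> b -> c -> a, an edge c -> d, and cycle d -> e -> d.
GRAPH = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": ["e"], "e": ["d"]}

components, reach_sets = strongly_connected_components(GRAPH)
print(len(components))                                       # 2
print(reach_sets["a"] == reach_sets["b"] == reach_sets["c"])   # True
```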
Impact of Traversal Order
Node traversal order within an SCC
Consider a strongly connected component C1. During index creation, let c be the first node traversed in C1.
Since no other node in C1 has been visited before and all of them are reachable from c, RIS(c) contains the tree instance of every other node in C1.
For nodes v, w in C1:
  reach(c, w): only 1 recursive call, RIS(c).
  reach(v, w), when v≠c: if RIS(v) has a non-tree instance of c, then only two recursive calls, RIS(v), then RIS(c).
Ideal: for every other node w in C1, RIS(w) contains a non-tree instance of c.
Impact of Traversal Order
Giant Strongly Connected Component
Erdős proved that directed random graphs with many edges contain one giant SCC C’.
  Other SCCs tend to be small.
  The size of C’ depends on the graph density.
  This also somewhat reflects real-world graphs.
Better to traverse C’ before other SCCs.
  Larger RIS
Impact of Traversal Order
Optimal GRIPP structure
[Figure: graph G and the GRIPP structure for G]
reach(an, s1)
Only three recursive calls:
  Find the non-tree instance of h in RIS(an)
  Find the non-tree instance of c in RIS(h)
  Find s1 in RIS(c)
Impact of Traversal Order
Heuristic Strategies
Make a new virtual root r, then add an edge from r to the node with the highest degree in G, say s.
  s tends to have more incoming edges and reach more nodes.
  s is highly likely to be within the giant SCC.
  s is a good choice to be visited first in the SCC, i.e., node c in C1.
  This also handles the case when G has no root or more than one root.
Then, traverse the children in descending order of their degrees.
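The heuristic can be sketched as a preprocessing step that picks the start node and fixes the child order before the DFS. Several details are assumptions, since the slides do not specify them: degree is taken as in-degree plus out-degree, ties are broken by node name, and the virtual root is stored under the hypothetical key `"virtual_root"`. The example graph is likewise assumed.

```python
from collections import defaultdict


def degree_ordered_graph(adj):
    """Return (ordered_adj, start): children sorted by descending degree, plus a virtual root."""
    degree = defaultdict(int)
    for v, succs in adj.items():
        degree[v] += len(succs)        # outgoing edges
        for s in succs:
            degree[s] += 1             # incoming edges
    # Highest-degree node; iterating in sorted order makes ties deterministic.
    start = max(sorted(degree), key=lambda v: degree[v])
    ordered = {v: sorted(succs, key=lambda s: (-degree[s], s))
               for v, succs in adj.items()}
    ordered["virtual_root"] = [start]  # hypothetical name for the new root r
    return ordered, start


# Assumed example graph.
GRAPH = {"r": ["A"], "A": ["B", "C", "D"], "B": ["E", "F"],
         "D": ["G", "H"], "G": ["B"], "H": ["A"]}

ordered, start = degree_ordered_graph(GRAPH)
print(start)          # A
print(ordered["A"])   # ['B', 'D', 'C']
```

Nodes that cannot be reached from the chosen start node are not handled by this sketch; the slides do not say how the traversal continues in that case.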
Experimental Result
Summary
Practicality
  Throughout the paper, there is no theorem or formula.
  All algorithms are based on SQL and implemented on a commercial RDBMS.
  Separate and uniform index table
Most heuristic efforts are made towards transitive closure
  Stop node s
    Recursion stops whenever we encounter s.
  First visited node c in the giant SCC
    RIS(c) is a very large set.
    Other nodes in the SCC, and even outside it, rely on the reach power of c.
Thank you!