QSX: Querying Social Graphs
Querying Big Graphs
Parallel scalability
Making big graphs small
– Bounded evaluability
– Query-preserving graph compression
The impact of the sheer volume of big data
Using an SSD that scans 6 GB/s, a linear scan of a data set D would take
1.9 days when D is of 1 PB (10^15 B)
5.28 years when D is of 1 EB (10^18 B)
Is it feasible to query real-life big graphs?
A departure from classical computational complexity theory
Traditional computational complexity theory of almost 50 years:
• The good: polynomial time computable (PTIME)
• The bad: NP-hard (intractable)
• The ugly: PSPACE-hard, EXPTIME-hard, undecidable…
Parallel query answering
We can do better provided more resources
10,000 processors
How to cope with the sheer volume of big graphs?
Using 10000 SSDs of 6 GB/s each, a linear scan of D might take:
1.9 days/10000 = 16 seconds when D is of 1 PB (10^15 B)
5.28 years/10000 = 4.63 hours when D is of 1 EB (10^18 B)
Only ideally, why?
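The estimates above can be checked with a few lines of Python. This is a minimal sketch: the helper name `scan_seconds` is hypothetical, 6G/s is assumed to mean 6 × 10^9 bytes per second, and n drives are assumed to scan disjoint parts in perfect parallel, which is exactly the ideal case the slide questions.

```python
DAY = 86_400          # seconds per day
YEAR = 365 * DAY      # seconds per (non-leap) year

def scan_seconds(size_bytes, drives=1, rate=6e9):
    """Ideal linear-scan time in seconds.

    Assumption: no startup cost, no skew, no data shipment.
    """
    return size_bytes / (rate * drives)

PB, EB = 1e15, 1e18
print(scan_seconds(PB) / DAY, "days for 1 PB on one drive")
print(scan_seconds(EB) / YEAR, "years for 1 EB on one drive")
print(scan_seconds(PB, drives=10_000), "seconds for 1 PB on 10000 drives")
print(scan_seconds(EB, drives=10_000) / 3600, "hours for 1 EB on 10000 drives")
```

Dividing by the number of drives is the whole model here; any real system adds the overheads discussed on the later slides.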
[Figure: processors (P), each with its own memory (M) and database (DB), connected by an interconnection network]
Do parallel algorithms always work?
If not, is it still feasible to query big graphs?
Parallel scalability
Parallel scalability
A distributed algorithm is useful if it is parallel scalable
Input: G = (G1, …, Gn), distributed to (S1, …, Sn), and a query Q
Output: Q(G), the answer to Q in G
Complexity
t(|G|, |Q|): the time taken by a sequential algorithm with a single processor
T(|G|, |Q|, n): the time taken by a parallel algorithm with n processors
Parallel scalable: if
T(|G|, |Q|, n) = O(t(|G|, |Q|)/n) + O((n + |Q|)^k)
including the cost of data shipment; k is a constant
When G is big, we can still query G by adding more processors if we can afford
them
Degree of parallelism -- speedup
Speedup: for a given task, TS/TL
TS: time taken by a traditional DBMS
TL: time taken by a parallel system with more resources
TS/TL: more resources mean proportionally less time for a task
Linear speedup: the speedup is N when the parallel system has N times the resources of the traditional system
[Figure: speed (throughput, response time) plotted against resources, with the linear-speedup line]
Question: can we do better than linear speedup?
Better than linear speedup?
No; even linear speedup/scaleup is hard to achieve!
Startup costs: initializing each process
Interference: competing for shared resources (network, disk, memory or
even locks)
Skew: it is difficult to divide a task into exactly equal-sized parts; the
response time is determined by the largest part
Data shipment cost: in a shared-nothing architecture
Linear speedup is the best we can hope for -- optimal!
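The effect of skew and startup cost on speedup can be illustrated with a toy cost model. This is a sketch under stated assumptions, not a model from the lecture: work is proportional to fragment size, the response time is the startup cost plus the time of the largest fragment, and the function names are hypothetical.

```python
def response_time(part_sizes, unit_cost=1.0, startup=0.0):
    """Parallel response time: all parts run concurrently, so the largest part dominates."""
    return startup + unit_cost * max(part_sizes)

def speedup(part_sizes, unit_cost=1.0, startup=0.0):
    """TS / TL, where TS is the sequential time over all parts."""
    sequential = unit_cost * sum(part_sizes)
    return sequential / response_time(part_sizes, unit_cost, startup)

even = [25, 25, 25, 25]      # perfectly balanced, 4 processors
skewed = [40, 20, 20, 20]    # same total work, one oversized part

print(speedup(even))                # linear: equals the number of processors
print(speedup(skewed))              # lower, because of skew
print(speedup(even, startup=5.0))   # lower, because of startup cost
```

Under this model the speedup never exceeds the number of parts, which matches the slide's claim that linear speedup is the best one can hope for.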
A closer look: Ullman’s algorithm for subgraph isomorphism uses the adjacency matrix for the entire G.
What if we break G into n fragments and leverage the data locality of subgraph isomorphism?
Give 4 reasons
Think of blocking in MapReduce
Worst case: exponential in |G| and |Q| vs. exponential in |G|/n and |Q|! Contradiction?
No: that compares the worst-case complexity of one particular algorithm with the time really needed by a sequential algorithm
Linear scalability
Querying big data by adding more processors
An algorithm T for answering a class Q of queries
Input: G = (G1, …, Gn), distributed to (S1, …, Sn), and a query Q
Output: Q(G), the answer to Q in G
Algorithm T is linearly scalable in computation if its parallel complexity is a function of |Q| and |G|/n,
and in data shipment if the total amount of data shipped is a function of |Q| and n
The more processors, the less response time
Independent of the size |G| of big G
Is it always possible?
Graph pattern matching via graph simulation
Input: a graph pattern Q and a graph G
Output: Q(G), a binary relation S on the nodes of Q and G such that
• each node u in Q is mapped to a node v in G, such that (u, v) ∈ S
• for each (u, v) ∈ S, each edge (u, u') in Q is mapped to an edge (v, v') in G, such that (u', v') ∈ S
Computable in O((|V| + |VQ|) (|E| + |EQ|)) time
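The two conditions above can be computed as a fixpoint: start from all label-compatible pairs and delete pairs that violate the edge condition. The sketch below is a naive refinement loop, not the algorithm with the time bound quoted on the slide. It assumes nodes carry labels and that u can only match v with the same label (the usual definition of simulation, not spelled out on the slide); `graph_simulation` and the tiny example graph are hypothetical.

```python
def graph_simulation(q_labels, q_edges, g_labels, g_edges):
    """Return the maximum simulation relation as a set of (u, v) pairs.

    q_labels / g_labels: dict node -> label; edges: iterable of (src, dst).
    Returns the empty set if some pattern node has no match.
    """
    g_succ = {v: set() for v in g_labels}
    for v, w in g_edges:
        g_succ[v].add(w)

    # Initial candidates: graph nodes with the same label as the pattern node.
    sim = {u: {v for v, lab in g_labels.items() if lab == q_labels[u]}
           for u in q_labels}

    changed = True
    while changed:
        changed = False
        for u, u2 in q_edges:
            # v stays a match of u only if some successor of v matches u2.
            keep = {v for v in sim[u] if g_succ[v] & sim[u2]}
            if keep != sim[u]:
                sim[u] = keep
                changed = True

    if any(not matches for matches in sim.values()):
        return set()
    return {(u, v) for u, matches in sim.items() for v in matches}

# Pattern: qa -> qb. Graph: A1 -> B1, and A2 with no outgoing edge.
pattern_labels = {"qa": "a", "qb": "b"}
graph_labels = {"A1": "a", "A2": "a", "B1": "b"}
result = graph_simulation(pattern_labels, [("qa", "qb")],
                          graph_labels, [("A1", "B1")])
print(sorted(result))
```

Here A2 is dropped because it has no successor that matches qb, so only A1 and B1 remain in the relation.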
Parallel scalable?
Impossibility
Nontrivial to develop parallel scalable algorithms
There exists NO algorithm for distributed graph simulation that is
parallel scalable in either computation, or data shipment
Why?
Pattern: 2 nodes
Graph: 2n nodes, distributed to n processors
Possibility: when G is a tree, parallel scalable in both response time and data shipment
What can we do if parallel scalability is beyond reach?
Making big graphs small
The cost of query answering
Input: A query Q and a graph G
Question: The answer Q(G) to Q in G
Reduce the cost of computing Q(G) by making G small!
too costly when G is big
The cost of computing Q(G): a function f(|G|, |Q|)
Find a lower function for f? Develop faster algorithm
Reduce the size of |Q|?
[Figure: Q evaluated on G vs. Q evaluated on the smaller GQ]
Reduce the size of G
What should we do?
Making big graphs small
Input: A class Q of queries
Question: Can we effectively find, given any query Q in the class Q and any (possibly big) graph G, a small GQ such that
Q(G) = Q(GQ)?
How to make G small?
Particularly useful for
A single dataset G, e.g., the social graph of Facebook
Minimum GQ – the necessary amount of data for answering Q
Much smaller than G
The essence of parallel query answering
Given a big graph G, and n processors S1, …, Sn
G is partitioned into fragments (G1, …, Gn)
G is distributed to n processors: Gi is stored at Si
Dividing a big G into small fragments Gi of manageable size
Each processor Si processes its local fragment Gi in parallel
Parallel query answering
Input: G = (G1, …, Gn), distributed to (S1, …, Sn), and a query Q
Output: Q(G), the answer to Q in G
[Figure: Q(G) is computed from Q(G1), Q(G2), …, Q(Gn); each fragment is of size about |G|/n, much smaller than G]
What can we do if parallel scalability is beyond reach for our queries?
How to make big graphs small
Input: A class Q of queries
Question: Can we effectively find, given any query Q in the class Q and any (possibly big) graph G, a small GQ such that
Q(G) = Q(GQ)?
Effective methods for making big graphs small
A number of methods
We have seen one of the methods: parallel query answering
Other methods – in the next two lectures
Much smaller than G
• Distributed query processing
• Boundedly evaluable graph queries
• Query preserving graph compression
• Query answering using views
• Bounded incremental evaluation
• …
Making big graphs small
What do we need
Input: A class Q of queries
Question: Can we effectively find, given any query Q in the class Q and any (possibly big) graph G, a small GQ such that
Q(G) = Q(GQ)?
How to characterize this?
How to find GQ?
The time taken to find GQ should be independent of |G|
Not very likely in the absence of auxiliary information
Much smaller than G
Why?
Boundedly evaluable queries
Input: A class Q of queries, an access schema A
Question: Can we find by using A, for any query Q in the class Q and any (possibly big) graph G, a fraction GQ of G such that
|GQ| is independent of |G|,
Q(G) = Q(GQ), and moreover,
GQ can be identified in time determined by Q and A?
A closer look
GQ does not get bigger when G grows -- Q(GQ) can be efficiently computed
The time taken on finding GQ does not increase when G grows
effectively find
Is this possible in practice?
Example: subgraph isomorphism
Find pairs of leading actors and actresses who are from the same country and starred in an award-winning movie released in 2011-2014
Find all matches of the pattern in the graph
A movie database represented as a graph, for movies from 1880 -- 2014
– Nodes: movies, casts (actors, actresses), awards, etc.
– Edges: relationships between the nodes
5.1 million nodes and 19.5 million edges
[Figure: pattern Q with nodes award, year (2011-2014), movie, actor, actress and country]
Example: access constraints
Hold on the entire graph, regardless of queries posed on it
C1: an award is presented to no more than 4 movies each year
C2: each movie has at most 30 leading actors and actresses
C3: each person has only one country of origin
C4-6: there are no more than 134 years (2014 − 1880), 24 major awards, and 196 countries in the graph
Real-life limits
Build indices accordingly
Example: a query plan
Visit at most 17922 nodes and 35136 edges, using indices
1. Fetch a set V1 of 134 year nodes, 24 awards and 196 countries
2. Fetch a set V2 of at most 24 * 3 * 4 = 288 award-winning movies released in 2011-2014, with at most 288 * 2 associated edges, by using award and year nodes in V1
3. Fetch a set V3 of at most (30 + 30) * 288 = 17280 actors and actresses with 17280 edges, using nodes in V2
4. Connect the actors and actresses in V3 to country nodes in V1, with at most 17280 edges -- GQ
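The bounds in this plan follow from the access constraints alone, and the arithmetic can be replayed directly. The sketch below uses only numbers stated on the slides; the variable names are hypothetical, the factor 3 is copied from the slide's own product 24 * 3 * 4, and the country count is the 196 given in constraints C4-6, which is the value that reproduces the stated total of 17922 nodes.

```python
# Step 1: nodes bounded by the constraints on years, awards and countries.
years, awards, countries = 134, 24, 196
v1_nodes = years + awards + countries

# Step 2: at most 4 movies per (year, award) pair; factor 3 as on the slide.
movies = awards * 3 * 4
movie_edges = movies * 2          # one edge to a year, one to an award

# Step 3: at most 30 actors and 30 actresses per movie.
people = (30 + 30) * movies
cast_edges = people               # one edge from each person to the movie

# Step 4: one country of origin per person.
country_edges = people

total_nodes = v1_nodes + movies + people
total_edges = movie_edges + cast_edges + country_edges
print(total_nodes, total_edges)
```

None of these quantities depends on the 5.1 million nodes or 19.5 million edges of the full graph.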
As opposed to 5.1 million nodes and 19.5 million edges
By using the indices
Access constraints: Example
S → (l, N)
S: a set of node labels; l: another label
N: a natural number -- cardinality
Access schema: a set of access constraints
Combining cardinality constraints and index
Semantics: G satisfies S → (l, N) if
• for any set VS of nodes in G labelled with S (one node per label, with distinct labels in S), there exist at most N common neighbours of VS with label l; a common neighbour is connected by an edge to each node in VS
• there is an index on S for l: for each set VS of nodes labelled with S, it finds all common neighbours labelled l in O(N) time
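The semantics can be made concrete with a brute-force sketch that builds the index for one constraint and checks the cardinality bound. Everything here is an illustrative assumption: the names `build_index` and `satisfies` are hypothetical, neighbours are taken to ignore edge direction, and the index is built by enumerating all label-compatible node sets, which is far more expensive than a real index construction.

```python
from collections import defaultdict
from itertools import product

def build_index(labels, edges, S, l):
    """Index for S -> (l, N): maps each tuple VS (one node per label in S)
    to its common neighbours labelled l. Empty results are not stored."""
    nbr = defaultdict(set)
    for u, v in edges:                 # assumption: direction is ignored
        nbr[u].add(v)
        nbr[v].add(u)
    by_label = defaultdict(list)
    for node, label in labels.items():
        by_label[label].append(node)

    index = {}
    for vs in product(*(by_label[s] for s in S)):
        common = {w for w in by_label[l] if all(w in nbr[v] for v in vs)}
        if common:
            index[vs] = common
    return index

def satisfies(index, N):
    """The cardinality part of the constraint: every entry has at most N nodes."""
    return all(len(common) <= N for common in index.values())

labels = {"y2011": "year", "oscar": "award", "m1": "movie", "m2": "movie"}
edges = [("m1", "y2011"), ("m1", "oscar"), ("m2", "y2011")]

by_year_award = build_index(labels, edges, ("year", "award"), "movie")
all_movies = build_index(labels, edges, (), "movie")   # empty S: all nodes labelled l
print(by_year_award, satisfies(by_year_award, 4))
print(all_movies, satisfies(all_movies, 1))
```

With an empty S the only key is the empty tuple, so the constraint simply bounds the number of nodes carrying label l.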
Example: access constraints
Useful special cases: ∅ → (l, N), l → (l', N)
Access constraints
C1: an award is presented to no more than 4 movies each year: (year, award) → (movie, 4)
C2: each movie has at most 30 leading actors and actresses: movie → (actor/actress, 30)
C3: each person has only one country of origin: actor/actress → (country, 1)
C4-6: there are no more than 134 years (2014 − 1880), 24 major awards, and 196 countries in the graph: ∅ → (year, 134), ∅ → (award, 24), ∅ → (country, 196)
Build indices accordingly
Discovering access schema
S → (l, N)
Functional dependencies X → Y, e.g., movie → (year, 1); found by shredding graphs to relations and using, e.g., TANE
Degree bound: l → (l', N) if a node with label l has degree at most N, for any label l'
∅ → (l, N), very common, e.g., ∅ → (country, 196)
Aggregate queries: group by (year, award), we find (year, award) → (movie, 4)
Real-life bounds: 5000 friends per person (Facebook)
…
How to maintain constraints in response to changes to graphs?
Local changes: only to common neighbours
Generating query plans
Fetch operations: construct GQ; then we compute Q(GQ)
A query plan P for a query Q is a sequence of fetching operations
fetch(u, VS, C, q(u))
given a set VS of nodes fetched earlier, fetch all common neighbours of VS that carry the label l of pattern node u, by using the index of access constraint C; keep only the nodes that satisfy the condition q(u) of u, e.g., year in [2011, 2014]
[Figure: pattern Q with nodes award, year (2011-2014), movie, actor, actress and country]
Efficient by using the indices
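A fetch operation can be sketched as a lookup in the index of an access constraint followed by a filter. This is a hypothetical illustration: the indices are hand-written dictionaries keyed by tuples of previously fetched nodes, year nodes are represented by plain integers, and the `fetch` signature is simplified from the four-argument form on the slide.

```python
from itertools import product

def fetch(index, fetched_sets, condition=lambda node: True):
    """Fetch common neighbours through the index of one access constraint.

    index: dict from a tuple of nodes (one per label in S) to the common
           neighbours labelled l.
    fetched_sets: for each label in S, the nodes fetched by earlier steps.
    condition: the predicate q(u) of the pattern node being fetched.
    """
    result = set()
    for vs in product(*fetched_sets):
        result |= {w for w in index.get(vs, ()) if condition(w)}
    return result

# Index for the constraint with empty S on years: every year node in the graph.
year_index = {(): {2010, 2011, 2012}}
# Index for (year, award) -> (movie, 4).
movie_index = {
    (2010, "Oscar"): {"m0"},
    (2011, "Oscar"): {"m1"},
    (2012, "Oscar"): {"m2", "m3"},
}

years = fetch(year_index, [], lambda y: 2011 <= y <= 2014)
movies = fetch(movie_index, [years, {"Oscar"}])
print(sorted(years), sorted(movies))
```

Each call touches only index entries reachable from nodes fetched earlier, so its cost is governed by the bounds N in the constraints rather than by the size of the graph.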
Generating query plans
Independent of |G| no matter how big G grows!
1. Fetch a set V1 of 134 year nodes, 24 awards and 196 countries
2. Fetch a set V2 of at most 24 * 3 * 4 = 288 award-winning movies released in 2011-2014, with at most 288 * 2 associated edges, by using award and year nodes in V1
3. Fetch a set V3 of at most (30 + 30) * 288 = 17280 actors and actresses with 17280 edges, using nodes in V2
4. Connect the actors and actresses in V3 to country nodes in V1, with at most 17280 edges -- GQ
Boundedly evaluable
Boundedly evaluable: if there exists a query plan under an access schema A such that for all graphs G that satisfy A,
• its fetch operations find GQ, and Q(GQ) = Q(G)
• the time for all fetch operations is determined by Q and A only, independent of |G|
An approach to querying big graphs
Given a query Q, and an access schema A
1. Decide whether Q is boundedly evaluable under A
2. If so, generate a bounded query plan P for Q
Independent of the size of |G|?
3. Given any graph G, use the query plan P
a) Fetch GQ
b) Compute Q(GQ)
Questions: the complexity of
– deciding bounded evaluability?
– generating a boundedly evaluable query plan?
Are we done yet?
Deciding bounded evaluability
Input: A query Q, and an access schema A
Question: Is Q boundedly evaluable under A?
Graph pattern matching via subgraph isomorphism; Q = (VQ, EQ), small in real life
Positive: in O(|A| |VQ| |EQ|) time, independent of any graph G
Characterization: Q is boundedly evaluable under A iff VCov(Q, A) = VQ and ECov(Q, A) = EQ
VCov: nodes covered by A, computed by ∅ → (l, N) first and inductively by other constraints in A
ECov: edges (u1, u2) covered by A: one of them is in VCov and the other has a bounded number of candidates by A
Deciding bounded evaluability: independent of |G|
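The characterization by node and edge covers can be sketched as a fixpoint computation over the pattern only. The code below is a simplified reading of the slide, not the authors' algorithm: a pattern node is taken to be covered when some constraint S → (l, N) has l equal to its label and every label in S is carried by an already covered neighbour; an edge is covered when one endpoint is covered and a constraint bounds the other endpoint through it. The function names, and the split of the slide's actor/actress label into two labels, are assumptions.

```python
def covers(q_labels, q_edges, constraints):
    """Return (VCov, ECov) for pattern Q under constraints [(S, l, N), ...]."""
    nbr = {u: set() for u in q_labels}
    for u, v in q_edges:
        nbr[u].add(v)
        nbr[v].add(u)

    def bounded_via(u, S, vcov):
        # Every label in S is carried by some covered neighbour of u.
        return all(any(q_labels[w] == s and w in vcov for w in nbr[u])
                   for s in S)

    vcov, changed = set(), True
    while changed:                       # inductive step, repeated to a fixpoint
        changed = False
        for u in q_labels:
            if u not in vcov and any(
                    l == q_labels[u] and bounded_via(u, S, vcov)
                    for S, l, _ in constraints):
                vcov.add(u)
                changed = True

    ecov = set()
    for a, b in q_edges:
        for u, v in ((a, b), (b, a)):    # u: covered endpoint, v: bounded endpoint
            for S, l, _ in constraints:
                rest = [s for s in S if s != q_labels[u]]
                if (u in vcov and l == q_labels[v] and q_labels[u] in S
                        and bounded_via(v, rest, vcov)):
                    ecov.add((a, b))
    return vcov, ecov

def boundedly_evaluable(q_labels, q_edges, constraints):
    vcov, ecov = covers(q_labels, q_edges, constraints)
    return vcov == set(q_labels) and ecov == set(q_edges)

names = ["award", "year", "movie", "actor", "actress", "country"]
q_labels = {n: n for n in names}         # each pattern node named after its label
q_edges = [("movie", "award"), ("movie", "year"), ("movie", "actor"),
           ("movie", "actress"), ("actor", "country"), ("actress", "country")]
schema = [((), "year", 134), ((), "award", 24), ((), "country", 196),
          (("year", "award"), "movie", 4),
          (("movie",), "actor", 30), (("movie",), "actress", 30),
          (("actor",), "country", 1), (("actress",), "country", 1)]

print(boundedly_evaluable(q_labels, q_edges, schema))
print(boundedly_evaluable(q_labels, q_edges,
                          [c for c in schema if c[1] != "movie"]))
```

The check never looks at a data graph, which is the point of the slide: the decision depends only on Q and A.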
Generating boundedly evaluable query plan
Input: A boundedly evaluable query Q, and an access schema A
Output: A boundedly evaluable query plan P for Q under A
Graph pattern matching via subgraph isomorphism; Q = (VQ, EQ)
Positive: in O(|A| |EQ| + |A| |VQ|^2) time, independent of any graph G
Inductively identify covered nodes and edges, and in each step, generate a corresponding fetch operation
Always possible? Yes, since Q is decided boundedly evaluable under A
Query plan generation: independent of |G|
Instance-bounded in a graph G
1. Decide whether Q is effectively bounded under A
2. If so, generate a bounded query plan P for Q
For any finite set Q of pattern queries, access schema A and a graph G satisfying A, there exists M such that all queries in Q are M-bounded in G under A
Can we do anything if Q is not boundedly evaluable under A?
Extending A to AM by adding constraints of the form
∅ → (l, M), l → (l', M)
such that G satisfies AM
Query Q is M-bounded in G if there is a subgraph GQ of G such that Q(G) = Q(GQ),
and GQ can be found in time determined by Q and AM
M: may depend on |G|
M LQ (LQ + 1)/2, LQ: the number of labels in G
Instance-bounded: on an individual graph, e.g., Facebook
Effectiveness of bounded evaluability
Bounded evaluability: effective for graph pattern queries
How effective is this approach?
60% of subgraph queries and 33% of simulation queries are boundedly evaluable under small access schema
Improvement: 4 orders of magnitude for subgraph queries, and 3 orders of magnitude for simulation queries
A small M of 0.016% of |G| makes all queries M-bounded
Graph pattern matching via subgraph isomorphism: data locality
Does the same approach work on graph simulation, without data
locality?
All the results remain intact on graph pattern matching via simulation
Revised node and edge covers
28587 times faster
Query-preserving graph compression
Dynamic reduction vs. Uniform reduction
Is there any effective uniform reduction to query big data?
Bounded evaluability: dynamic reduction on dataset D
• Given a query Q, identify and fetch a minimum subset DQ of
D such that it has sufficient information for answering Q in D
What is the benefit?
Uniform reduction on dataset D
• Identify and fetch a minimum DC such that for all queries Q posed on D, DC has sufficient information to find answers to Q in D
What is the benefit?
Questions:
DQ is typically smaller than DC. Why?
DC is computed once offline and then we don’t have to worry about it; is this claim true?
Graph compression
The cost of query processing: f(|G|, |Q|)
Compression <R, P>
For a graph G, GC = R(G)
For any Q, Q( G ) = P(Q(GC))
[Figure: compressing: R maps G to Gc; Q is evaluated on Gc; post-processing: P maps Q(Gc) to Q(G)]
Compress big G into a smaller GC
It is unlikely that we can lower its complexity, but can we reduce the size of its parameter |G|?
Lossless: restore G from GC. GC is not much smaller than G
Query friendly compression: decompression of GC back to G
Query preserving graph compression
Query preserving compression <R, P> for a class Q of queries
For any graph G, GC = R(G)
For any Q in Q, Q( G ) = P(Q(Gc))
[Figure: compressing: R maps G to Gc; Q is evaluated on Gc; post-processing: P maps Q(Gc) to Q(G)]
Compress G w.r.t. a particular query class Q
What is new about query preserving compression?
In contrast to lossless compression, no need to restore the original graph G
Relative to a class L of queries of users’ choice
Better compression ratio: only information about L queries
Query preserving compression <R, P> for a class L of queries
For any graph G, Gc = R(G)
For any Q in L, Q( G ) = P(Q(Gc))
For any Q in L, Q(Gc) can be directly computed
Any algorithms and indexing structures for G can be used for Gc
no need to decompress Gc
Gc is computed once for all queries Q in L
Incrementally maintained
Compress G relative to your queries
Compress G by leveraging the equivalence relation
Equivalence relation:
• reachability relation Re: a node pair (u, v) ∈ Re iff they have the same set of ancestors and descendants in G.
• for any graph G, there is a unique maximum Re, i.e., the reachability equivalence relation of G
Reachability queries
Reachability
• Input: A directed graph G, and a pair of nodes s and t in G
• Question: Does there exist a path from s to t in G?
O(|V| + |E|) time
Algorithm and example
1. Compute Re and its equivalence classes
2. Construct a node for each equivalence class (a set of nodes)
3. Construct GC
Time: O(|V||E|)
[Figure: example graph with nodes MSA1, MSA2, BSA1, BSA2, FA1 to FA4 and C1 to Ck, a reachability query QR, and the compressed graph obtained by merging equivalent nodes]
Reachability preserving compression
A reachability preserving compression R for G
– R maps each node in G to its reachability equivalence class in GC, and each edge to an edge between two equivalence classes
– Nodes in GC: equivalence classes
Reduction: 95% on average for reachability queries
Correctness:
– For any query QR(v, w) over G, v can reach w iff R(v) can reach R(w) in GC
– Compression R is in quadratic time
– No post-processing function P is required
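The three construction steps and the correctness claim can be tried out on a small graph. The sketch below is a naive version that computes ancestors and descendants by one traversal per node; the function names are hypothetical, and reachability is assumed to mean a path with at least one edge, so that a class reaches itself in GC only when its members lie on a cycle.

```python
from collections import defaultdict

def reachable_sets(adj, nodes):
    """Map each node to the set of nodes it reaches by a nonempty path."""
    out = {}
    for s in nodes:
        seen, stack = set(), list(adj.get(s, ()))
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                stack.extend(adj.get(v, ()))
        out[s] = frozenset(seen)
    return out

def compress_reachability(nodes, edges):
    adj, radj = defaultdict(set), defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        radj[v].add(u)
    desc = reachable_sets(adj, nodes)
    anc = reachable_sets(radj, nodes)
    # Step 1: nodes with identical ancestors and descendants are equivalent.
    class_ids, R = {}, {}
    for v in nodes:
        R[v] = class_ids.setdefault((anc[v], desc[v]), len(class_ids))
    # Steps 2-3: one node per class; each edge of G becomes an edge between classes.
    gc_edges = {(R[u], R[v]) for u, v in edges}
    return R, gc_edges

def reaches(edges, s, t):
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
    return t in reachable_sets(adj, [s])[s]

nodes = ["a", "b1", "b2", "c", "d", "e"]
edges = [("a", "b1"), ("a", "b2"), ("b1", "c"), ("b2", "c"),
         ("c", "d"), ("d", "e"), ("e", "d")]
R, gc_edges = compress_reachability(nodes, edges)
print(len(set(R.values())), "classes for", len(nodes), "nodes")
print(all(reaches(edges, v, w) == reaches(gc_edges, R[v], R[w])
          for v in nodes for w in nodes))
```

In this example b1 and b2 collapse into one class, as do d and e, and every reachability query gets the same answer on the compressed graph without any post-processing.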
What does it look like in real life?
18 times faster on average for reachability queries
Graph pattern matching by graph simulation
Input: A directed graph G, and a graph pattern Q
Output: the maximum simulation relation R
Bisimulation: a binary relation B over V of G, such that for each node pair (u, v) ∈ B,
• L(u) = L(v)
• for each edge (u, u') ∈ E, there exists (v, v') ∈ E, s.t. (u', v') ∈ B
• for each edge (v, v') ∈ E, there exists (u, u') ∈ E, s.t. (u', v') ∈ B
Equivalence relation Rb: the unique maximum bisimulation relation
Compress G by leveraging the equivalence relation
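The maximum bisimulation can be computed by repeatedly splitting blocks of nodes until no block changes. The sketch below is a naive signature-based refinement, not the O(|E| log |V|) algorithm cited later in the slides; `max_bisimulation` and the example graph are hypothetical.

```python
def max_bisimulation(labels, edges):
    """Return dict node -> block id; two nodes are bisimilar iff ids are equal."""
    succ = {v: set() for v in labels}
    for u, v in edges:
        succ[u].add(v)

    block = dict(labels)                 # start: one block per label
    while True:
        ids, refined = {}, {}
        for v in labels:
            # A node's signature: its label plus the blocks of its successors.
            signature = (labels[v], frozenset(block[w] for w in succ[v]))
            refined[v] = ids.setdefault(signature, len(ids))
        if len(ids) == len(set(block.values())):
            return refined               # no block was split: stable
        block = refined

def compress(labels, edges):
    """R(G): one node per bisimulation class, edges lifted to classes."""
    block = max_bisimulation(labels, edges)
    return block, {(block[u], block[v]) for u, v in edges}

labels = {"A1": "A", "A2": "A", "A3": "A",
          "B1": "B", "B2": "B", "B3": "B",
          "C1": "C", "C2": "C", "D1": "D"}
edges = [("A1", "B1"), ("B1", "C1"),
         ("A2", "B2"), ("B2", "D1"),
         ("A3", "B3"), ("B3", "C2")]
block, gc_edges = compress(labels, edges)
print(len(set(block.values())), "classes for", len(labels), "nodes")
```

A1 and A3 end up in the same class because their children behave alike, while A2 is separated since its child leads to a D node instead of a C node.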
[Figure: example graphs G1 and G2 with nodes labelled A, B, C and D]
Compression for simulation
R(G): computes equivalence classes
R(G): constructs Gc with equivalence classes
P(Q, Gc): expanded to the nodes in their equivalence classes
[Figure: G with nodes msa1, msa2, bsa1, bsa2, fa1 to fa3 and c1 to ck is compressed into Gc with class nodes MSAr, BSAr, FAr, FAr', Cr and Cr']
Compression for simulation
Query preserving compression <R, P> for graph pattern matching
compression function R( ):
• maximum bisimulation relation on the nodes of G
• equivalence relation
• nodes in Gc denote equivalence classes
• R(G) in O(|E| log (|V|)) time
post-processing function P( ):
• making use of the inverse of R( )
• nodes in Q(Gc) are expanded to nodes in their equivalence classes
• P(Q, Gc): linear time in the size of Q(G), i.e., in the size of the output
Reduction: 57% on average for graph pattern matching; 2.3 times faster (simulation)
Subgraph isomorphism?
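The post-processing step described above amounts to applying the inverse of R to each matched class. A minimal sketch, assuming the match on Gc is a set of (pattern node, class) pairs and that `members` (a hypothetical name) records which nodes of G each class contains:

```python
def expand(match_on_gc, members):
    """P: replace each matched class by the nodes of G it stands for."""
    return {(u, v) for u, cls in match_on_gc for v in members[cls]}

# Hypothetical compressed match: pattern node "fa" matched class "FAr", etc.
members = {"FAr": {"fa1", "fa2"}, "Cr": {"c1", "c2", "c3"}}
match_on_gc = {("fa", "FAr"), ("c", "Cr")}

match_on_g = expand(match_on_gc, members)
print(sorted(match_on_g))
```

The work done is one step per output pair, which is why the slide states the cost as linear in the size of Q(G).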
Summing up
Summary and review
What is parallel scalability? Why do we care about it?
Study some parallel algorithms. Show that they are parallel
scalable if they are, and disprove it otherwise
Why do we want to make big graphs small? How can we do it?
What is bounded evaluability of queries? What auxiliary
structures do we need to make queries boundedly evaluable?
What is query-preserving graph compression? Is it lossless? Do
we lose information when using such a compression scheme?
How to develop query preserving graph compression schemes?
Project (1)
Bounded evaluability. Recall keyword search via distinct-root trees (bounded by a fixed depth k; see Lectures 2 and 4)
Develop an algorithm for keyword search based on access constraints; show that such queries can be boundedly evaluated
Develop optimization strategies
Develop a parallel version of your algorithm, in whatever model you like (MapReduce, BSP, GRAPE)
Experimentally evaluate your algorithms, especially their scalability with the size of G
Write a survey on various methods for keyword search with distinct-root trees, as part of the related work.
A research and development project
Project (2)
Recall graph pattern matching by subgraph isomorphism (Lecture 3)
Develop a query-preserving compression scheme for subgraph isomorphism
Implement your compression scheme and an algorithm for graph pattern matching via subgraph isomorphism, based on your query-preserving compression scheme
Experimentally evaluate your compression scheme and evaluation algorithm, especially its scalability with the size of G
Write a survey on graph compression schemes, as part of the related work.
A research and development project
Project (3)
Combine query-preserving compression and distributed algorithm for reachability queries (Lecture 5)
Develop a framework for answering reachability queries, with
– query-preserving compression scheme to reduce graphs
– distributed algorithm for answering reachability queries
– incremental algorithm to maintain compressed graphs in response to changes to the original graphs
Implement the framework with all three algorithms
Experimentally evaluate your method for answering reachability queries, especially its scalability with the size of G
Write a survey on graph compression schemes and distributed algorithms for reachability queries, as part of the related work.
A development project
Papers for you to review
• M. Armbrust, A. Fox, D. A. Patterson, N. Lanham, B. Trushkowsky, J. Trutna, and H. Oh. SCADS: Scale-independent storage for social computing applications. In CIDR, 2009. http://arxiv.org/ftp/arxiv/papers/0909/0909.1775.pdf
• M. Armbrust, E. Liang, T. Kraska, A. Fox, M. J. Franklin, and D. Patterson. Generalized scale independence through incremental precomputation. In SIGMOD, 2013. http://www.cs.albany.edu/~jhh/courses/readings/armbrust.sigmod13.incremental.pdf
• S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden, and I. Stoica. BlinkDB: queries with bounded errors and bounded response times on very large data. In EuroSys, 2013. https://www.cs.berkeley.edu/~sameerag/blinkdb_eurosys13.pdf
• Y. Tian, R. A. Hankins, and J. M. Patel. Efficient Aggregation for Graph Summarization. http://pages.cs.wisc.edu/~jignesh/publ/summarization.pdf
• Y. Cao, W. Fan, and R. Huang. Making pattern queries bounded in big graphs. In ICDE, 2015. (bounded evaluability)
• W. Fan, J. Li, X. Wang, and Y. Wu. Query Preserving Graph Compression. In SIGMOD, 2012. (query-preserving compression)