Upload
others
View
25
Download
0
Embed Size (px)
Citation preview
Extending In-Memory Relational Database Engines with Native Graph Support
Mohamed S. Hassan1 Tatiana Kuznetsova1 Hyun Chai Jeong1 Walid G. Aref1 Mohammad Sadoghi2
1Purdue University – West Lafayette, IN, USA
EDBT’18
2University of California – Davis, CA, USA
Graphs are Ubiquitous 2
Road Network Biological Network
Datacenter Network Social Network
Specialized Graph Database Systems 3
¨ Specialized graph databases can handle graph query-workloads ¤ Vital queries include shortest-path and reachability queries
Why Use Relational Database Systems to Support Graphs?
4
¨ RDBMS technology is very mature and widely-adopted ¨ Relational data can have latent graphs ¨ Can easily represent graphs using relational tables ¨ Many applications involve graph queries
¤ Queries that involve both relational and graph predicates n E.g., for each Patient P in a selected area, find the nearest hospital to P
¨ How can an RDBMS effectively and efficiently handle graph query workloads?
Graph Support in RDBMSs 5
¨ Why is it challenging? ¤ There is an impedance mismatch between the relational model and the
graph model
¨ Two common approaches for supporting graphs: ¤ Native Relational-Core ¤ Native Graph-Core
¨ Native G+R Core (The proposed GRFusion system)
Native Relational-Core 6
¨ Use a vanilla RDBMS ¨ Encode graphs in relational schemas ¨ Support limited graph queries ¨ Translate the supported graph queries
into SQL or procedural SQL ¨ E.g., SQLGraph [SIGMOD’15],
Grail [CIDR’15] ¨ Pros:
¤ Use of very mature RDBMS technology ¨ Cons:
¤ Several graph queries are inefficient to evaluate using pure SQL
¤ Graphs are encoded in complex schemas
Relational Database
Relational Data
Relational Queries (SQL)
Results
Graph Encoded into Relational Tables
Graph Queries
SQL Translation Layer
¨ Build on top of an RDBMS ¨ Extract graphs from the RDBMS ¨ Store graphs and process queries
outside the realm of the RDBMS ¨ E.g., Ringo [SIGMOD’15],
GraphGen [VLDB’15, SIGMOD’17] ¨ Pros:
¤ Native processing of graph operations ¨ Cons:
¤ Graph updates require re-extracting the graphs
¤ Queries cannot reference any non-extracted relational data
Native Graph-Core 7
Relational Database
Relational Data
Graph Extraction Queries (SQL)
Graph Extraction and Materialization Engine
Extracted Graphs
Results Graph Queries
The Relational Model vs. the Graph Model 8
¨ Graph-core approach ¤ +ve: Queries involving graph traversals are efficiently handled in
the graph model (e.g., shortest paths) ¤ -ve: Not as pervasive and mature as RDBMSs
¨ Relational-core approach ¤ +ve: Mature and pervasive ¤ -ve: Either many temporary inserts/deletes/updates, or too many
joins to traverse a graph n Intermediate-result size and cardinality estimation
¨ Can the best of the two worlds be combined? ¤ Support native graph processing inside an RDBMS
Proposed Approach: Native G+R Core 9
¨ Assume that graphs have relational schemas ¨ Relational schemas describe the edges/nodes ¨ Enables graphs to be defined
as native database objects ¨ Store graphs in non-relational structures
optimized for graph operations ¨ Extend the SQL language
¤ Queries can compose relational and graph operations
¨ Cross-Data-Model QEPs (Query Evaluation Plans) ¨ Graph updates are supported
Relational Database
Graph Views (Topology + Tuple Pointers)
Relational Data
Graph Construction
⋈
σ GraphOp
π Graph and Relational Operators
in the Same QEP
Graph-Relational Queries (SQL)
Results
¨ We realize the G+R approach inside VoltDB ¤ An open-source in-memory RDBMS ¤ GRFusion: Our realization of the G+R
approach into VoltDB ¤ A demo of GRFusion will appear in
SIGMOD 2018
GRFusion: Realizing the G+R Approach 10
In-Memory Relational Database
Declarative Graph-Relational Queries
Graph Views Relational Data
Graph-Relational Query Engine
Query Parser
Query Optimizer
Plan Executor
Create Graph View 11
¨ Create-Graph-View statement ¤ Create a named graph database object that can be referenced in queries ¤ Define the relational sources of the graph’s vertexes/edges ¤ Materialize the topology of the graph in main-memory as a singleton graph
structure
Graph-View of a Social Network 12
Graph-View Structure [Traversal Index] 13
Graph-View Structure [Traversal Index] 14
The VERTEXES Construct 15
¨ Appears in the FROM clause and references a graph view ¤ Select … From MyGraphView.VERTEXES v
¨ VERTEXES represents the vertexes of a graph view ¨ A vertex is a tuple with the following properties:
¤ Id ¤ FanIn ¤ FanOut ¤ Property for each vertex attribute
The EDGES Construct 16
¨ Appears in the FROM clause and references a graph view ¤ Select … From MyGraphView.EDGES v
¨ EDGES represents the edges of a graph view ¨ An edge is a tuple with the following properties:
¤ Id ¤ StartVertexId ¤ EndVertexId ¤ Property for each edge attribute
The PATHS Construct – Extended SQL 17
¨ Appears in the FROM clause and references a graph view ¤ Select … From MyGraphView.PATHS P
¨ PATHS represents a set of lazily-evaluated paths ¨ A path is a set of consecutive edges ¨ Each edge has two endpoint vertexes
¤ E.g., (V:attributes) –(:E:attributes)à(V:attributes) ….. ¨ A path is a tuple with the following properties:
¤ Length ¤ StartVertex ¤ EndVertex ¤ Vertexes ¤ Edges
Declarative Graph-Relational Queries 18
The PathScan Operator 19
¨ PathScan is a logical operator that acts on a graph-view ¤ Has three corresponding physical operators: BFScan, DFScan, SPScan
¨ The output of PathScan is a tuple ¤ Extends the standard relational tuple ¤ PathScan output can be ingested by other relational operators in the QEP
¨ PathScan accepts the id of the vertex to start the traversal from ¤ Otherwise, all the vertexes will be considered as start vertexes
¨ Filters can be pushed as Hints into the PathScan operator ¤ E.g., P.PathLength = 2
Friends-of-Friends Query Example 20
¨ For all the users working as lawyers, retrieve the last name of their friends of friends, where the friendships happened after 1/1/2000
QEP of the Friends-of-Friends Query 21
Reachability Query Example 22
¨ Check if Protein X interacts directly (i.e., by an edge) or indirectly (i.e., by a path) with Protein Y through either a covalent or a stable interaction type.
Shortest-Path Queries with Relational Predicates 23
Evaluating The Native G+R Approach 24
¨ Realized a certralized version of GRFusion (Native G+R Core approach) inside VoltDB Version 6.7
¨ Single node running Linux kernel Version 3.17.7 n 32 cores of Intel Xeon 2.90 GHz n 384 GB of RAM
¨ Comparing against: ¤ Native Relational-Core:
n SQLGraph [SIGMOD’15], Grail [CIDR’15]
¤ Natice Graph-Core Systems: n Neo4j [neo4j.com] and Titan [thinkaurelius.github.io/titan]
Experimental Setup 25
¨ Native relational-core approach ¤ SQLGraph [SIGMOD’15]
n Represent path traversal using recursive relational joins n Commercial system (code not available) n Implemented the techniques in VoltDB in-memory
¤ Grail [CIDR’15] n Implemented Grail in VoltDB n Also evaluated Grail in Hekaton n Got similar conclusions (Do not report the Hekaton results here)
Experimental Setup (Cont’d) 26
¨ Native Graph Approch ¤ Neo4j [neo4j.com] and Titan [thinkaurelius.github.io/titan]
n Native graph-cores (specialized graph systems) n Disk-based systems n Titan: configured to use the in-memory storage configuration n Neo4j: Run on RamDisk to mitigate the disk IO cost
¤ GRFusion uses simple graph algorithms (single-source-shortest-path - Dijkstra’s algorithm) n Want to investigate performance gains, if any, of the G+R approach in contrast to
the native relational-core
Evaluating GRFusion 27
¨ Graph queries ¤ Reachability queries ¤ Reachability queries with filtering predicates ¤ Shortest path queries ¤ Subgraph queries (e.g., count triangles)
¨ Datasets
Reachability Queries (DBLP Dataset) 28
¨ Performance of GRFusion, Neo4j, Titan more stable in contrast to SQLGraph ¤ Avoid overheads of recursive relational joins
¨ GRFusion performs better than Neo4J &Titan ¤ VoltDB is optimized for main-memory ¤ Disk-based Titan/Neo4j (although runs on
RamDisk) are not optimized for main-memory ¤ Graph views in GRFusion are more compact
n Encode only the topology within the graph n No vertex/edge attributes in the topology n Thus, GRFusion makes better use of caching
¤ GRFusion/VoltDB are C++-based ¤ Neo4j and Titan are Java-based
n Overheads from the automatic memory management of Java
Reachability Queries (String Dataset) 29
¨ String dataset: ~ 0.5B edges >> DBLP ¨ SQLGraph (based on VoltDB):
¤ Materialize the join results at each intermediate stage
¤ Explosion in size of intermediate results (perform more than 11 joins)
¨ SQLGraph and GRFusion follow BFS evaluation
¨ GRFusion follows as iterative model: ¤ Evaluate one path at a time ¤ Also, only the vertex Ids are stored in BFS
queue ¤ More efficient storage-wise than storing
the tuples of the relational joins as intermediate results
Reachability Queries (Twitter Dataset) 30
¨ Twitter dataset: 1.4B edges dataset
¨ Fan-out is also a factor in the performance ¤ But we did not study effect of fan-
out ¤ Would require synthetic datasets ¤ Current study focus on real
datasets
Constrained-Reachability Queries (String Dataset) 31
¨ Select a sub-graph then perform the query
¨ Use synthesized edge attributes to control selectivity
¨ Vary the selectivity from 5% to 50% ¨ Limit path length of the results of the
generated queries to 20 ¤ To emphasize the effect of the selectivity
of the sub-graph to operate on ¨ GRFusion uses the relational engine to
evaluate the filtering predicates on the edges ¤ Utilize pointers from edges to
corresponding tuples
SSSP Queries – Tiger Dataset 32
¨ Grail implemented in-memory using VoltDB
¨ GRFusion: Run Dijkstra’s algorithm
A Note on the Performance Gains of GRFusion 33
¨ Table scan or index scan/seek ¤ Direct pointers are more efficient
¨ Relational joins ¤ Large intermediate results ¤ Inaccurate cardinality estimation
⋈
σ
eTable T1
σ
eTable T2
σ
eTable T3
⋈
Conclusions 34
¨ GRFusion (G+R Core approach) ¤ Execute both relational and graph operations in native mode
¨ Write declarative path-queries with relational predicates ¨ Outperform the state-of-the-art by up to 4 orders-of-magnitude in
query-time speedup ¨ Avoid large intermediate join results ¨ Avoid inaccurate cardinality estimation that may lead to non-optimal
join-algorithm selection
Thank You!
35
Vertex Query Example 36
¨ Retrige the Birthdate and the number of friends of each user in the social network with last name = ‘Smith’