Extending In-Memory Relational Database Engines …...¨ Extract graphs from the RDBMS ¨ Store graphs and process queries outside the realm of the RDBMS ¨ E.g., Ringo [SIGMOD’15],

Extending In-Memory Relational Database Engines with Native Graph Support

Mohamed S. Hassan1 Tatiana Kuznetsova1 Hyun Chai Jeong1 Walid G. Aref1 Mohammad Sadoghi2

1Purdue University – West Lafayette, IN, USA

EDBT’18

2University of California – Davis, CA, USA

Graphs are Ubiquitous 2

Road Network Biological Network

Datacenter Network Social Network

Specialized Graph Database Systems 3

¨  Specialized graph databases can handle graph query-workloads ¤ Vital queries include shortest-path and reachability queries

Why Use Relational Database Systems to Support Graphs?

4

¨  RDBMS technology is very mature and widely-adopted ¨  Relational data can have latent graphs ¨  Can easily represent graphs using relational tables ¨  Many applications involve graph queries

¤ Queries that involve both relational and graph predicates n E.g., for each Patient P in a selected area, find the nearest hospital to P

¨  How can an RDBMS effectively and efficiently handle graph query workloads?

Graph Support in RDBMSs 5

¨  Why is it challenging? ¤ There is an impedance mismatch between the relational model and the

graph model

¨  Two common approaches for supporting graphs: ¤ Native Relational-Core ¤ Native Graph-Core

¨  Native G+R Core (The proposed GRFusion system)

Native Relational-Core 6

¨  Use a vanilla RDBMS ¨  Encode graphs in relational schemas ¨  Support limited graph queries ¨  Translate the supported graph queries

into SQL or procedural SQL ¨  E.g., SQLGraph [SIGMOD’15],

Grail [CIDR’15] ¨  Pros:

¤  Use of very mature RDBMS technology ¨  Cons:

¤  Several graph queries are inefficient to evaluate using pure SQL

¤  Graphs are encoded in complex schemas

Relational Database

Relational Data

Relational Queries (SQL)

Results

Graph Encoded into Relational Tables

Graph Queries

SQL Translation Layer

¨  Build on top of an RDBMS ¨  Extract graphs from the RDBMS ¨  Store graphs and process queries

outside the realm of the RDBMS ¨  E.g., Ringo [SIGMOD’15],

GraphGen [VLDB’15, SIGMOD’17] ¨  Pros:

¤ Native processing of graph operations ¨  Cons:

¤ Graph updates require re-extracting the graphs

¤ Queries cannot reference any non-extracted relational data

Native Graph-Core 7

Relational Database

Relational Data

Graph Extraction Queries (SQL)

Graph Extraction and Materialization Engine

Extracted Graphs

Results Graph Queries

The Relational Model vs. the Graph Model 8

¨  Graph-core approach ¤ +ve: Queries involving graph traversals are efficiently handled in

the graph model (e.g., shortest paths) ¤  -ve: Not as pervasive and mature as RDBMSs

¨  Relational-core approach ¤ +ve: Mature and pervasive ¤  -ve: Either many temporary inserts/deletes/updates, or too many

joins to traverse a graph n  Intermediate-result size and cardinality estimation

¨  Can the best of the two worlds be combined? ¤ Support native graph processing inside an RDBMS

Proposed Approach: Native G+R Core 9

¨  Assume that graphs have relational schemas ¨  Relational schemas describe the edges/nodes ¨  Enables graphs to be defined

as native database objects ¨  Store graphs in non-relational structures

optimized for graph operations ¨  Extend the SQL language

¤ Queries can compose relational and graph operations

¨  Cross-Data-Model QEPs (Query Evaluation Plans) ¨  Graph updates are supported

Relational Database

Graph Views (Topology + Tuple Pointers)

Relational Data

Graph Construction

⋈

σ GraphOp

π Graph and Relational Operators

in the Same QEP

Graph-Relational Queries (SQL)

Results

¨  We realize the G+R approach inside VoltDB ¤ An open-source in-memory RDBMS ¤ GRFusion: Our realization of the G+R

approach into VoltDB ¤ A demo of GRFusion will appear in

SIGMOD 2018

GRFusion: Realizing the G+R Approach 10

In-Memory Relational Database

Declarative Graph-Relational Queries

Graph Views Relational Data

Graph-Relational Query Engine

Query Parser

Query Optimizer

Plan Executor

Create Graph View 11

¨  Create-Graph-View statement ¤ Create a named graph database object that can be referenced in queries ¤ Define the relational sources of the graph’s vertexes/edges ¤ Materialize the topology of the graph in main-memory as a singleton graph

structure

Graph-View of a Social Network 12

Graph-View Structure [Traversal Index] 13

Graph-View Structure [Traversal Index] 14

The VERTEXES Construct 15

¨  Appears in the FROM clause and references a graph view ¤ Select … From MyGraphView.VERTEXES v

¨  VERTEXES represents the vertexes of a graph view ¨  A vertex is a tuple with the following properties:

¤  Id ¤ FanIn ¤ FanOut ¤ Property for each vertex attribute

The EDGES Construct 16

¨  Appears in the FROM clause and references a graph view ¤ Select … From MyGraphView.EDGES v

¨  EDGES represents the edges of a graph view ¨  An edge is a tuple with the following properties:

¤  Id ¤ StartVertexId ¤ EndVertexId ¤ Property for each edge attribute

The PATHS Construct – Extended SQL 17

¨  Appears in the FROM clause and references a graph view ¤  Select … From MyGraphView.PATHS P

¨  PATHS represents a set of lazily-evaluated paths ¨  A path is a set of consecutive edges ¨  Each edge has two endpoint vertexes

¤  E.g., (V:attributes) –(:E:attributes)à(V:attributes) ….. ¨  A path is a tuple with the following properties:

¤  Length ¤  StartVertex ¤  EndVertex ¤ Vertexes ¤  Edges

Declarative Graph-Relational Queries 18

The PathScan Operator 19

¨  PathScan is a logical operator that acts on a graph-view ¤ Has three corresponding physical operators: BFScan, DFScan, SPScan

¨  The output of PathScan is a tuple ¤ Extends the standard relational tuple ¤ PathScan output can be ingested by other relational operators in the QEP

¨  PathScan accepts the id of the vertex to start the traversal from ¤ Otherwise, all the vertexes will be considered as start vertexes

¨  Filters can be pushed as Hints into the PathScan operator ¤ E.g., P.PathLength = 2

Friends-of-Friends Query Example 20

¨  For all the users working as lawyers, retrieve the last name of their friends of friends, where the friendships happened after 1/1/2000

QEP of the Friends-of-Friends Query 21

Reachability Query Example 22

¨  Check if Protein X interacts directly (i.e., by an edge) or indirectly (i.e., by a path) with Protein Y through either a covalent or a stable interaction type.

Shortest-Path Queries with Relational Predicates 23

Evaluating The Native G+R Approach 24

¨  Realized a certralized version of GRFusion (Native G+R Core approach) inside VoltDB Version 6.7

¨  Single node running Linux kernel Version 3.17.7 n 32 cores of Intel Xeon 2.90 GHz n 384 GB of RAM

¨  Comparing against: ¤ Native Relational-Core:

n SQLGraph [SIGMOD’15], Grail [CIDR’15]

¤ Natice Graph-Core Systems: n Neo4j [neo4j.com] and Titan [thinkaurelius.github.io/titan]

Experimental Setup 25

¨  Native relational-core approach ¤ SQLGraph [SIGMOD’15]

n Represent path traversal using recursive relational joins n Commercial system (code not available) n  Implemented the techniques in VoltDB in-memory

¤ Grail [CIDR’15] n  Implemented Grail in VoltDB n Also evaluated Grail in Hekaton n Got similar conclusions (Do not report the Hekaton results here)

Experimental Setup (Cont’d) 26

¨  Native Graph Approch ¤ Neo4j [neo4j.com] and Titan [thinkaurelius.github.io/titan]

n Native graph-cores (specialized graph systems) n Disk-based systems n Titan: configured to use the in-memory storage configuration n Neo4j: Run on RamDisk to mitigate the disk IO cost

¤ GRFusion uses simple graph algorithms (single-source-shortest-path - Dijkstra’s algorithm) n Want to investigate performance gains, if any, of the G+R approach in contrast to

the native relational-core

Evaluating GRFusion 27

¨  Graph queries ¤ Reachability queries ¤ Reachability queries with filtering predicates ¤ Shortest path queries ¤ Subgraph queries (e.g., count triangles)

¨  Datasets

Reachability Queries (DBLP Dataset) 28

¨  Performance of GRFusion, Neo4j, Titan more stable in contrast to SQLGraph ¤  Avoid overheads of recursive relational joins

¨  GRFusion performs better than Neo4J &Titan ¤  VoltDB is optimized for main-memory ¤  Disk-based Titan/Neo4j (although runs on

RamDisk) are not optimized for main-memory ¤  Graph views in GRFusion are more compact

n  Encode only the topology within the graph n  No vertex/edge attributes in the topology n  Thus, GRFusion makes better use of caching

¤  GRFusion/VoltDB are C++-based ¤  Neo4j and Titan are Java-based

n  Overheads from the automatic memory management of Java

Reachability Queries (String Dataset) 29

¨  String dataset: ~ 0.5B edges >> DBLP ¨  SQLGraph (based on VoltDB):

¤  Materialize the join results at each intermediate stage

¤  Explosion in size of intermediate results (perform more than 11 joins)

¨  SQLGraph and GRFusion follow BFS evaluation

¨  GRFusion follows as iterative model: ¤  Evaluate one path at a time ¤  Also, only the vertex Ids are stored in BFS

queue ¤  More efficient storage-wise than storing

the tuples of the relational joins as intermediate results

Reachability Queries (Twitter Dataset) 30

¨  Twitter dataset: 1.4B edges dataset

¨  Fan-out is also a factor in the performance ¤ But we did not study effect of fan-

out ¤ Would require synthetic datasets ¤ Current study focus on real

datasets

Constrained-Reachability Queries (String Dataset) 31

¨  Select a sub-graph then perform the query

¨  Use synthesized edge attributes to control selectivity

¨  Vary the selectivity from 5% to 50% ¨  Limit path length of the results of the

generated queries to 20 ¤  To emphasize the effect of the selectivity

of the sub-graph to operate on ¨  GRFusion uses the relational engine to

evaluate the filtering predicates on the edges ¤  Utilize pointers from edges to

corresponding tuples

SSSP Queries – Tiger Dataset 32

¨  Grail implemented in-memory using VoltDB

¨  GRFusion: Run Dijkstra’s algorithm

A Note on the Performance Gains of GRFusion 33

¨  Table scan or index scan/seek ¤ Direct pointers are more efficient

¨  Relational joins ¤ Large intermediate results ¤  Inaccurate cardinality estimation

⋈

σ

eTable T1

σ

eTable T2

σ

eTable T3

⋈

Conclusions 34

¨  GRFusion (G+R Core approach) ¤ Execute both relational and graph operations in native mode

¨  Write declarative path-queries with relational predicates ¨  Outperform the state-of-the-art by up to 4 orders-of-magnitude in

query-time speedup ¨  Avoid large intermediate join results ¨  Avoid inaccurate cardinality estimation that may lead to non-optimal

join-algorithm selection

Thank You!

35

Vertex Query Example 36

¨  Retrige the Birthdate and the number of friends of each user in the social network with last name = ‘Smith’

Documents

Extending In-Memory Relational Database Engines …...¨ Extract graphs from the RDBMS ¨ Store graphs and process queries outside the realm of the RDBMS ¨ E.g., Ringo [SIGMOD’15],