Choosing Between Graph Databases and RDF Engines ceur-ws.org/Vol-1034/DeAbreuEtAl_ Between Graph Databases and RDF Engines for Consuming and Mining Linked Data ... engines during the

  • View
    215

  • Download
    3

Embed Size (px)

Text of Choosing Between Graph Databases and RDF Engines ceur-ws.org/Vol-1034/DeAbreuEtAl_ Between Graph...

  • Choosing Between Graph Databases and RDFEngines for Consuming and Mining Linked Data

    Domingo De Abreu, Alejandro Flores, Guillermo Palma, Valeria Pestana, JosePinero, Jonathan Queipo, Jose Sanchez, Maria-Esther Vidal

    Universidad Simon Bolvar, Caracas, Venezuela{dabreu, aflores, gpalma, vpestana, jpinero, jqueipo, jsanchez,

    mvidal}@ldc.usb.ve

    Abstract. Graphs naturally represent Linked Data and implementa-tions of graph-based tasks are required not only for data consumption,but also for mining patterns among links. Despite efficient graph-basedalgorithms and engines have been implemented, there is no clear under-standing of how these solutions may behave on Linked Data. We evaluateboth general purpose graph database and state-of-the-art RDF engines,and our experimental results reveal characteristics of linked datasets andgraph-based tasks that may affect their performance. These results canbe considered as a further step for solving the problem of choosing be-tween graph databases to consume and mine Linked Data.

    1 Introduction

    Graphs are commonly used to represent linked data, and several efficient algo-rithms have been proposed not only to consume, but also to mine Linked Data.For example, Saha et al. [16] and Thor et al. [18] have defined densest sub-graphs and graph summarization techniques to identify patterns between linkeddatasets of genes. Further, algorithms for finding subgraph isomorphisms and fre-quent subgraphs have been extensively studied in the literature [1]. The majorityof these algorithms are computationally complex, and rely on core graph-basedtasks to solve graph reachability, traversal, adjacency and pattern matching.

    Additionally, a variety of engines have been developed to manage, store andquery graph databases. Particularly, in the context of the Semantic Web, en-gines have been defined to store and consume RDF graphs. However, existingRDF engines focus on individual triples, more than providing a graph orientedrepresentation of the data, and the implementation of core graph-based taskscan be time-inefficient because of the number of potential self-joins required toexecute these tasks. On the other hand, several general purpose graph systemshave been defined to manage graph databases; also, APIs are usually availableto execute core graph-based tasks on these databases. However, many of the ex-isting graph engines do not exploit the properties of RDF, and pattern matchingqueries on RDF-based graphs may be inefficient. Therefore, it is important toconduct evaluations that uncover the properties and limitations of existing graphengines, and that provide a further step for solving the problem of choosing themost appropriate engine to execute a given task against linked datasets.

  • In this paper we empirically evaluate the performance of three general pur-pose graph database engines: DEX [11], Neo4j [15] and HypergraphDB [8]. Ad-ditionally, we study RDF-3x [14] a best-of-breed RDF engine [7]. During theevaluation, we analyze the performance of these engines on tasks of adjacencyand reachability. Further, we study different implementations of the algorithmsto find shortest paths and k-hops; also, graph traversals following breadth-firstand depth-first search strategies are evaluated. Finally, we analyze the perfor-mance of pattern matching queries, and the mining tasks of densest subgraphsand graph summarization algorithms. A variety of synthetic graphs are stud-ied; graphs with different density and size are generated using existing graphgenerators (e.g., GTgraph1 and the Berlin SPARQL Benchmark (BSBM) [3]).We describe the variables and configuration setups that impact on the perfor-mance of the studied engines when the above mentioned graph tasks are executedagainst the generated graphs. Finally, we discuss the results of our evaluation,and outline the different conditions that benefit the studied engines. Results ofthe experiments are published at http://graphium.ldc.usb.ve.

    To summarize, our contributions are as follows: i) a characterization of graphproperties and tasks that impact on the performance of existing graph engines;ii) benchmarks comprised of graph-based tasks and a variety of graphs to eval-uate existing graph engines; and iii) an empirical study of the performance ofgraph engines on tasks for consuming and mining linked data.

    This paper is organized as follows: Section 2 summarizes related works, andSection 3 describes the studied graph engines and tasks. Section 4 illustratesparameters that impact existing graph engines, while experimental results arereported in Section 5. Section 6 concludes and outlines our future works.

    2 Related Work

    Existing RDF engines effectively address the challenges of consuming RDF data.However, independent evaluations [7, 10] suggest that thanks to physical struc-tures and its execution techniques, RDF-3x is able to overcome existing engines.Based on these results, we selected RDF-3x as an exemplar of RDF engines.

    Several engines address the problem of managing large persistent graphs (e.g.,DEX [11], Neo4j [15], and HypergraphDB [8]). Commonly, these engines provideAPIs and rely on physical structures to speed up the execution of graph-basedtasks. We evaluate their main functionalities and propose a set of tasks whoseevaluation uncovers useful insights that suggest some of their limitations.

    Benchmarking has contributed to the improvement of systems and in general,of existing areas by the means of defining fair and neutral benchmarks. Partic-ularly, in the context of Databases, the family of the Transactional ProcessingPerformance Council (TPC) benchmarks[12] were developed for database-centricstandards and for the evaluation of system performance under clearly defineddataset conditions and workloads. Recently, the TPC family has been extended

    1 http://www.cse.psu.edu/~madduri/software/GTgraph/index.html

  • to other domains, e.g., virtualization, multi-source data integration, and graphanalysis in multiprocessors. Although some of the new released TPC bench-marks attempt to provide graph generation standards [5], the graph-based tasksare mainly focused on network centrality while core graph-based tasks cannot beevaluated, e.g., pattern marching, adjacency. Other initiatives have impulsed thestudy of performance of graph database engines in current cloud service infras-tructures and mostly networking tasks, e.g., [2] and [4]; and recently, the LDBC2

    council was created to work on benchmarks for Linked Data. Finally, benchmarkshave been defined to test existing RDF engines: LUBM [6], the Berlin SPARQLBenchmark [3], the RDF Store Benchmarks with DBpedia3, and the SP2BenchSPARQL benchmark [17]. Similarly, we tailored a family of graph characteristicsand tasks that allow us to reveal properties of existing RDF/Graph engines.

    3 Existing Graph Databases and Graph-Based Tasks

    3.1 Graph Database and RDF Engines

    DEX [11] is a database engine that relies on labeled attribute multigraphs tomodel linked data. To speed up the different graph-based tasks, DEX offers dif-ferent types of indexing: i) attributes. ii) unique attributes, iii) edges to indextheir neighborhood, and iv) indices on neighborhoods. A DEX graph is storedin a single file; values and resource identifiers are mapped by mapping func-tions; and maps are modeled as B+-tree. Bitmaps are used to store nodes andedges of a certain type. DEX provides an API to create and manage graphdatasets. Different functions are offered to efficiently traverse and select graphs:i) sub-graphs whose nodes or edges are associated with labels that satisfy certainconditions, ii) explode-based methods that visit the edges of a given node, andiii) neighbor-based methods that visit the neighbors of a given node.

    Similar to DEX, Neo4j [15] represents graphs as labeled attribute multigraphsimplemented in native structures. Neo4j can manage several types of indiceson nodes and relationships, and graphs can be traversed by following commonpolicies of breadth- and depth-first. Neo4j provides an API, but Cypher can bealso used to specify pattern matching queries.

    HypergraphDB [8] models graphs as hypergraphs, a graph generalizationwhere several edges correspond to a hyperedge. Hyperedges connect orderedsets of nodes, and can point to other hyperedges. Data is stored in the form ofkey-value pairs on top of BerkeleyDB, and is indexed using B-trees. Differentaccess methods and indices are provided to speed up core graph-based tasks,e.g., indices on edges or neighborhoods.

    Finally, RDF-3x [14] relies on a native index system to efficiently store RDFdata and reduce the overhead of I/O operations. RDF data graphs are imple-mented as a Triple table where each labelled edge of an RDF graph is represented

    2 Linked Data Benchmark Council is sponsored by the European Community underICT-FP7 http://www.ldbc.eu

    3 http://www4.wiwiss.fu-berlin.de/benchmarks-200801/

  • as a tuple of the table. A mapping dictionary is used to encode literals by shortnumber IDs, and compressed clustered B+-trees are used to index data in lexico-graphic order. Six permutations of subject, predicate and object are consideredto create an index on each permutation. Additionally, RDF-3x implements opti-mization and execution techniques that support efficient and scalable executionof SPARQL queries. Further, it exploits operating system features to managecache, and it is able to load portions of intermediate results in resident memory.Thus, RDF-3x speeds up execution time of queries where previous intermediateresults can be reused, and it is considered a best-of-breed RDF engine [7].

    3.2 Graph-Based Tasks

    We describe a set of graph-based tasks and the main properties of our imple-mentations. Source code is available at https://github.com/gpalma/gdbb.Graph Creation: consis