Benchmarking graph databases on the problem of community detection

ADBIS 2014 Conference

Ohrid, Republic of Macedonia, September 7-10, 2014

Benchmarking graph databases on the problemof community detectionSotirios Beis, Symeon Papadopoulos, Yiannis Kompatsiaris

Centre for Research and Technology Hellas, Information Technologies Institute (CERTH-ITI)

Important Note

• This is an update of the actual presentation given in September 2014. After the presentation, we received comments about the implementation of our benchmark, which led us to update the implementation and rerun the experiments. We would like to thank @lgarulli and @OrientDB for their contribution.

• The updated paper is now available on:http://mklab.iti.gr/files/beis_adbis2014_corrected.pdf

while the original is still available on the Springer site:http://link.springer.com/chapter/10.1007/978-3-319-10518-5_1

#2

#3

Overview

• Problem formulation

• Related work

• Workload description

• Evaluation

• Conclusions

Motivation

• The rapid growth of Online Social Networks contributes to the creation of high-volume and velocity graph data

– Twitter had 145M users in 2010 and 300M in 2011

• The need for massive graph data management and processing systems is constantly increasing

– Many methods and technologies available (RDBMS, OODBMS and graph databases)

• Benchmarks to evaluate candidate solutions are absolutely essential!

– We had to choose one in the context of the SocialSensorresearch project

#4

http://dstevenwhite.com/2011/12/29/social-media-growth-2006-2011/

Problem Formulation

• Given a set of database management systems assess the performance of each system with respect to a common mining task: community detection

• Many DBMS available to use

– Graph databases selected, as they are designed to store and manage efficiently big graph data

– Benchmarked systems: Titan, OrientDB and Neo4j

• Many operations used in databases to benchmark

– Workloads that simulate common operations in OSNs employed

#5

Related Work (1)

• Evaluation between different DBMS

– Giatsoglou et al. focused on the Social Tagging System use case scenario

– Angles et al. and Armstrong et al. proposed a synthetic graph generator with OSN characteristics and use this data for evaluation

– Grossniklaus et al. used a workloads that cover a wide variety of graph data use case

– Vicknair et al. utilized query workload that simulates typical operations performed in provenance system

#6

Related Work (2)

• Evaluation between graph DBMS

– Bader et al. proposed a benchmark with four operations: insert, h-hops node and edge selection, while Dominguez et al. reports the implementation.

– A similar workload is proposed by Jouili et al. emphasizing the effects of increasing multiple concurrent users.

– Ciglan et al. proposed a benchmark focusing on graph traversal operations.

– Dayarathna et al. used similar workload, focusing on graph databases server and cloud environments.

#7

Workloads Description (1)

Our Contribution

• Clustering Workload (CW)

– Very important due to its numerous applications, such as topic detection, photo clustering and event detection.

– Until now most community detection algorithms used main memory to store the graph and perform the required operations fast execution time, unable to manage big graphs.

– We propose an implementation of Louvain Method on top of graph databases employing cache techniques take advantage of both graph database capabilities and in-memory execution speed.

#8

Workload Description (2)

Supplementary Workloads

• Massive Insertion Workload (MIW)– Populates a graph with bulk load operations.

– Simulates batch creation of a graph.

• Single Insertion Workload (SIW)– Every object insertion is committed directly.

– Simulates the growth of a OSN.

• Query Workload (QW)– FindNeighbors query (FN) Finds the neighbors of all nodes

• simulates the friends retrieval a Facebook user

– FindAdjacentNodes query (FA) Finds the adjacent nodes of all edges

• used to find whether two user joined the same Facebook group

– FindShortestPath (FS) Finds the shortest paths between 100 nodes

• used to find the level two LinkedIn users are connected

#9

Experimental Study

• Datasets

– Synthetic for CW, generated with LFR generator.

– Real for MIW, SIW & QW, from SNAP dataset collection.

• Evaluation Measure: execution time (second)

#10

Results: Comparison wrt CW

#11

Results: Cache size impact

#12

Titan

Neo4j

OrientDB

Results: comparison wrt MIW and QW

#13

Results: Comparison wrt SIW

#14

YouTube LiveJournal

#15

Conclusions

SUMMARY• OrientDB is the most efficient solution for CW, while Titan is

the winner in SIW

• Neo4j is the fastest graph database for MIW and QW

• Titan and OrientDB have competitive performance in MIW and QW respectively don't scale as good as Neo4j

FUTURE WORK• Test with bigger graphs.

• Run the benchmark employing the distributed version of Titan and OrientDB and other graph databases (Sparksee already added).

• Improve performance of the implemented community detection method.

References (1)

Evaluation between different DBMSGiatsoglou, M., Papadopoulos, S., Vakali, A.: Massive graph management for the web

and web 2.0. In Vakali, A., Jain, L., eds.: New Directions in Web Data Management 1. Volume 331 of Studies in Computational Intelligence. Springer Berlin Heidelberg (2011) 19-58

Angles, R., Prat-Perez, A., Dominguez-Sal, D., Larriba-Pey, J.L.: Benchmarking database systems for social network applications. In: First International Workshop on Graph Data Management Experiences and Systems. GRADES '13, New York, NY, USA, ACM (2013) 15:1-15:7

Armstrong, T.G., Ponnekanti, V., Borthakur, D., Callaghan, M.: Linkbench: a database benchmark based on the facebook social graph. (2013)

Grossniklaus, M., Leone, S., Zaschke, T.: Towards a benchmark for graph data management and processing. (2013)

Vicknair, C., Macias, M., Zhao, Z., Nan, X., Chen, Y., Wilkins, D.: A comparison of a graph database and a relational database: A data provenance perspective. In: Proceedings of the 48th Annual Southeast Regional Conference. ACM SE '10, New York, NY, USA, ACM (2010) 42:1-42:6

#16

References (2)

Evaluation between graph DBMSBader, D.A., Feo, J., Gilbert, J., Kepner, J., Koester, D., Loh, E., Madduri, K., Mann,

B., Meuse, T., Robinson, E.: Hpc scalable graph analysis benchmark (2009)

Dominguez-Sal, D., Urbn-Bayes, P., Gimnez-Va, A., Gmez-Villamor, S., Martnez-Bazn, N., Larriba-Pey, J.: Survey of graph database performance on the hpc scalable graph analysis benchmark. In Shen, H., Pei, J., zsu, M., Zou, L., Lu, J., Ling, T.W., Yu, G., Zhuang, Y., Shao, J., eds.:Web-Age Information Management. Volume 6185 of Lecture Notes in Computer Science. Springer Berlin Heidelberg (2010) 37-48

Ciglan, M., Averbuch, A., Hluchy, L.: Benchmarking traversal operations over graph databases. In: Data Engineering Workshops (ICDEW), 2012 IEEE 28th

International Conference on. (April 2012) 186-189

Jouili, S., Vansteenberghe, V.: An empirical comparison of graph databases. In: Social Computing (SocialCom), 2013 International Conference on. (Sept 2013) 708-715

Dayarathna, M., Suzumura, T.: Xgdbench: A benchmarking platform for graph stores in exascale clouds. In: Cloud Computing Technology and Science (Cloud-Com), 2012 IEEE 4th International Conference on. (Dec 2012) 363-370

#17

#18

Questions

Further contact:

[email protected] / [email protected]

Acknowledgement:

www.socialsensor.eu

Source: https://github.com/socialsensor/graphdb-benchmarks