Transcript

S2X

Graph-Parallel Querying of RDF with GraphX [1]

Alexander SchΓ€tzle, Martin Przyjaciel-Zablocki, Thorsten Berberich, Georg Lausen

University of Freiburg, Germany

[email protected]

http://dbis.informatik.uni-freiburg.de/forschung/projekte/DiPoS/

Big-O(Q) 2015 – Hawaii, USA

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 2

Motivation

Semantic Web has arrived in real-world applications(not only academia & research)

β€’ Web-scale semantic data makes single machine solutions infeasible

Hadoop ecosystem has become de-facto

standard for Big Data applications

Our Idea: Use it for Semantic Web purposes as well

β€’ Main reasons:

1) Web-scale semantic data requires solutions that scale out

2) Industry has settled on Hadoop (or Hadoop-style) architectures

superior cost-benefit ratio compared to specialized infrastructures

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 3

Motivation

RDF/SPARQL-on-Hadoop

β€’ Existing solutions are based on relational-style data models

implemented on data-parallel frameworks

β€’ Hence they do not exploit the inherent graph-like structure of RDF

Parallel Graph Processing Systems

β€’ e.g. Pregel, Giraph, GraphLab

β€’ Tailored towards iterative graph algorithms

β€’ But: Not integrated with Hadoop or hard to combine with other frameworks

Data movement or duplication

S2X (SPARQL on Spark with GraphX)

β€’ Spark GraphX allows to combine graph-parallel and data-parallel computation

β€’ GraphX used for graph pattern matching

β€’ Other SPARQL operators implemented in Spark

Spark GraphX

Apache Spark

β€’ General-purpose in-memory cluster computing system

β€’ Based on Resilient Distributed Datasets (RDD)

- distributed, fault-tolerant collection of elements

- kept in memory and partitioned across the cluster

β€’ Jobs are modeled as Directed Acyclic Graphs (DAG) of tasks

- each tasks runs on a horizontal partition of an RDD

GraphX

β€’ Spark API for graphs and graph-parallel computation

β€’ Uses a Property Graph Data Model (represented by two RDDs)

β€’ Vertex-Centric view of graphs (similar to Pregel)

β€’ Adopts Bulk-Synchronous Parallel (BSP) execution model

- vertex programs run concurrently in a sequence of Supersteps

β€’ Vertex-Cut partitioning

- evenly assigns edges to machines, vertices can span multiple machines

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 4

RDF Property Graph Model

Property Graph (PG)

β€’ 𝑃𝐺 𝑃 = 𝑉, 𝐸, 𝑃

β€’ 𝑉 = 𝑠𝑒𝑑 π‘œπ‘“ π‘£π‘’π‘Ÿπ‘‘π‘–π‘π‘’π‘ , 𝐸 = 𝑠𝑒𝑑 π‘œπ‘“ π‘‘π‘–π‘Ÿπ‘’π‘π‘‘π‘’π‘‘ 𝑒𝑑𝑔𝑒𝑠

β€’ 𝑃 = 𝑃𝑉 , 𝑃𝐸 = π‘π‘œπ‘™π‘™π‘’π‘π‘‘π‘–π‘œπ‘› π‘œπ‘“ π‘£π‘’π‘Ÿπ‘‘π‘’π‘₯ π‘Žπ‘›π‘‘ 𝑒𝑑𝑔𝑒 π‘π‘Ÿπ‘œπ‘π‘’π‘Ÿπ‘‘π‘–π‘’π‘  π‘˜π‘’π‘¦, π‘£π‘Žπ‘™π‘’π‘’

RDF Graph (G)

β€’ 𝐺 = 𝑑1, … , 𝑑𝑛 = 𝑠𝑒𝑑 π‘œπ‘“ 𝑅𝐷𝐹 π‘‘π‘Ÿπ‘–π‘π‘™π‘’π‘ 

β€’ 𝑅𝐷𝐹 π‘‘π‘Ÿπ‘–π‘π‘™π‘’ 𝑑 = 𝑠𝑒𝑏𝑗𝑒𝑐𝑑, π‘π‘Ÿπ‘’π‘‘π‘–π‘π‘Žπ‘‘π‘’, π‘œπ‘π‘—π‘’π‘π‘‘

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 5

i j

𝑃𝐸 𝑖, 𝑗 = π‘˜, 𝑣

𝑃𝑉 𝑖 = π‘˜, 𝑣 𝑃𝑉 𝑗 = π‘˜, 𝑣

𝐼𝐷𝑖 𝐼𝐷𝑗

s op

IRIs = global identifierse.g. dbpedia:Leonardo_da_Vinci

RDF Property Graph Model

RDF to PG - Example

β€’ 𝐺 = 𝐴, π‘˜π‘›π‘œπ‘€π‘ , 𝐡 , 𝐴, π‘™π‘–π‘˜π‘’π‘ , 𝐡 , 𝐴, π‘™π‘–π‘˜π‘’π‘ , 𝐢 , 𝐡, π‘˜π‘›π‘œπ‘€π‘ , 𝐢

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 6

A B

knows

C

likes

knows

likes

𝐼𝐷𝐴 = 𝑣𝐴 𝐼𝐷𝐡 = 𝑣𝐡 𝐼𝐷𝐢 = 𝑣𝐢

𝑃𝐸 𝑣𝐴, 𝑣𝐢 . π‘™π‘Žπ‘π‘’π‘™ = π‘™π‘–π‘˜π‘’π‘ 

𝑃𝐸 𝑣𝐴, 𝑣𝐡 . π‘™π‘Žπ‘π‘’π‘™ = π‘˜π‘›π‘œπ‘€π‘ 

𝑃𝐸 𝑣𝐴, 𝑣𝐡 . π‘™π‘Žπ‘π‘’π‘™ = π‘™π‘–π‘˜π‘’π‘ π‘ƒπΈ 𝑣𝐡 , 𝑣𝐢 . π‘™π‘Žπ‘π‘’π‘™ = π‘˜π‘›π‘œπ‘€π‘ 

𝑃𝑉 𝑣𝐴 . π‘™π‘Žπ‘π‘’π‘™ = 𝐴 𝑃𝑉 𝑣𝐡 . π‘™π‘Žπ‘π‘’π‘™ = 𝐡 𝑃𝑉 𝑣𝐢 . π‘™π‘Žπ‘π‘’π‘™ = 𝐢

SPARQL

Standard Query Language for RDF

(Solution) Mapping

β€’ πœ‡ ∢ 𝑉 β†’ 𝑅𝐷𝐹 π‘‡π‘’π‘Ÿπ‘šπ‘ 

β€’ Result of a triple pattern 𝑑𝑝 is a bag of mappings

Ω𝑑𝑝 = πœ‡ π‘‘π‘œπ‘š πœ‡ = π‘£π‘Žπ‘Ÿπ‘  𝑑𝑝 , πœ‡ 𝑑𝑝 ∈ 𝐺

β€’ Result of a basic graph pattern 𝑏𝑔𝑝 = 𝑑𝑝1, … , π‘‘π‘π‘š is the merge of

compatible mappings (Join on common variables)

Ω𝑏𝑔𝑝 = Ω𝑑𝑝1 β‹ˆ … β‹ˆ Ξ©π‘‘π‘π‘š

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 7

SELECT * WHERE {

?A knows ?B . ?A likes ?B . ?B knows ?C

}

triple pattern

basic graph pattern (BGP)

SPARQL - Example

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 8

SELECT * WHERE {

?A knows ?B . ?A likes ?B . ?B knows ?C

}

tp1 tp2 tp3

A B

knows

C

likes

knows

likes

SPARQL - Example

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 9

SELECT * WHERE {

?A knows ?B . ?A likes ?B . ?B knows ?C

}

tp1 tp2 tp3

A B

knows

C

likes

knows

likes

? 𝑨 ?𝑩

πœ‡11 𝐴 𝐡

πœ‡12 𝐡 𝐢

Ω𝑑𝑝1

SPARQL - Example

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 10

SELECT * WHERE {

?A knows ?B . ?A likes ?B . ?B knows ?C

}

tp1 tp2 tp3

A B

knows

C

likes

knows

likes

? 𝑨 ?𝑩

πœ‡11 𝐴 𝐡

πœ‡12 𝐡 𝐢

Ω𝑑𝑝1

? 𝑨 ?𝑩

πœ‡21 𝐴 𝐡

πœ‡22 𝐴 𝐢

Ω𝑑𝑝2

SPARQL - Example

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 11

SELECT * WHERE {

?A knows ?B . ?A likes ?B . ?B knows ?C

}

tp1 tp2 tp3

A B

knows

C

likes

knows

likes

? 𝑨 ?𝑩

πœ‡11 𝐴 𝐡

πœ‡12 𝐡 𝐢

Ω𝑑𝑝1

? 𝑨 ?𝑩

πœ‡21 𝐴 𝐡

πœ‡22 𝐴 𝐢

Ω𝑑𝑝2

?𝑩 ?π‘ͺ

πœ‡31 𝐴 𝐡

πœ‡32 𝐡 𝐢

Ω𝑑𝑝3

SPARQL - Example

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 12

SELECT * WHERE {

?A knows ?B . ?A likes ?B . ?B knows ?C

}

tp1 tp2 tp3

A B

knows

C

likes

knows

likes

? 𝑨 ?𝑩

πœ‡11 𝐴 𝐡

πœ‡12 𝐡 𝐢

Ω𝑑𝑝1

? 𝑨 ?𝑩

πœ‡21 𝐴 𝐡

πœ‡22 𝐴 𝐢

Ω𝑑𝑝2

?𝑩 ?π‘ͺ

πœ‡31 𝐴 𝐡

πœ‡32 𝐡 𝐢

Ω𝑑𝑝3

β‹ˆ β‹ˆ

Ω𝑏𝑔𝑝 = πœ‡ ∢ ? 𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐡, ? 𝐢 β†’ 𝐢

S2X Query Processing

Basic Idea:

β€’ Every vertex stores the variables of a query where it is a candidate for

β€’ Vertices exchange their candidate sets for validation

Match Candidate:

β€’ π‘šπ‘Žπ‘‘π‘β„ŽπΆ = 𝑣, ? π‘£π‘Žπ‘Ÿ, πœ‡, 𝑑𝑝

β€’ 𝑣 is a candidate for ? π‘£π‘Žπ‘Ÿ with mapping πœ‡ derived from triple pattern 𝑑𝑝

Match Set:

β€’ Set of all match candidates of a vertex

β€’ Match set is stored as a property of the vertex 𝑃𝑉 𝑣 .π‘šπ‘Žπ‘‘π‘β„Žπ‘†

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 13

π‘£π΄π‘™π‘Žπ‘π‘’π‘™ = 𝐴

π‘£π΅π‘™π‘Žπ‘π‘’π‘™ = 𝐡

π‘™π‘Žπ‘π‘’π‘™ = π‘˜π‘›π‘œπ‘€π‘ π‘‘π‘ = (?𝐴 π‘˜π‘›π‘œπ‘€π‘  ?𝐡)

𝑣𝐴, ? 𝐴, πœ‡: ?𝐴 β†’ 𝐴, ?𝐡 β†’ 𝐡 , 𝑑𝑝 𝑣𝐡, ? 𝐡, πœ‡: ? 𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐡 , 𝑑𝑝

S2X Query Processing

BGP Matching in a Nutshell

β€’ Superstep 1:

- Determine all possible match candidates

Iterate over all edges and match with all triple patterns at once

- Yields an overestimation of the BGP result

β€’ Following Supersteps:

- Validate Local Match Sets

Does a vertex have all required match candidates for a variable?

- Validate Remote Match Sets

Are the neighbors of vertex still the required candidates?

- Exchange Match Sets with Neighbor Vertices

- Repeat until no changes occur

β€’ Result Generation (solution mappings):

- Collect and merge (join) all match sets to produce the final mappings

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 14

BGP Matching - Example

Superstep 1

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 16

SELECT * WHERE {

?A knows ?B . ?A likes ?B . ?B knows ?C

}

tp1 tp2 tp3

π‘£π΄π‘™π‘Žπ‘π‘’π‘™ = 𝐴

π‘£π΅π‘™π‘Žπ‘π‘’π‘™ = 𝐡

π‘£πΆπ‘™π‘Žπ‘π‘’π‘™ = 𝐢

π‘™π‘Žπ‘π‘’π‘™ = π‘™π‘–π‘˜π‘’π‘ 

π‘™π‘Žπ‘π‘’π‘™ = π‘˜π‘›π‘œπ‘€π‘ 

π‘™π‘Žπ‘π‘’π‘™ = π‘™π‘–π‘˜π‘’π‘ π‘™π‘Žπ‘π‘’π‘™ = π‘˜π‘›π‘œπ‘€π‘ 

? π‘£π‘Žπ‘Ÿ πœ‡ 𝑑𝑝

? 𝐴 ? 𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐡 𝑑𝑝1

? 𝐡 ?𝐡 β†’ 𝐴, ? 𝐢 β†’ 𝐡 𝑑𝑝3

? π‘£π‘Žπ‘Ÿ πœ‡ 𝑑𝑝

?𝐡 ?𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐡 𝑑𝑝1

? 𝐢 ?𝐡 β†’ 𝐴, ? 𝐢 β†’ 𝐡 𝑑𝑝3

? π‘£π‘Žπ‘Ÿ πœ‡ 𝑑𝑝

BGP Matching - Example

Superstep 1

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 17

SELECT * WHERE {

?A knows ?B . ?A likes ?B . ?B knows ?C

}

tp1 tp2 tp3

π‘£π΄π‘™π‘Žπ‘π‘’π‘™ = 𝐴

π‘£π΅π‘™π‘Žπ‘π‘’π‘™ = 𝐡

π‘£πΆπ‘™π‘Žπ‘π‘’π‘™ = 𝐢

π‘™π‘Žπ‘π‘’π‘™ = π‘™π‘–π‘˜π‘’π‘ 

π‘™π‘Žπ‘π‘’π‘™ = π‘˜π‘›π‘œπ‘€π‘ 

π‘™π‘Žπ‘π‘’π‘™ = π‘™π‘–π‘˜π‘’π‘ π‘™π‘Žπ‘π‘’π‘™ = π‘˜π‘›π‘œπ‘€π‘ 

? π‘£π‘Žπ‘Ÿ πœ‡ 𝑑𝑝

? 𝐴 ? 𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐡 𝑑𝑝1

? 𝐡 ?𝐡 β†’ 𝐴, ? 𝐢 β†’ 𝐡 𝑑𝑝3

? 𝐴 ? 𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐡 𝑑𝑝2

? π‘£π‘Žπ‘Ÿ πœ‡ 𝑑𝑝

?𝐡 ?𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐡 𝑑𝑝1

? 𝐢 ?𝐡 β†’ 𝐴, ? 𝐢 β†’ 𝐡 𝑑𝑝3

? 𝐡 ?𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐡 𝑑𝑝2

? π‘£π‘Žπ‘Ÿ πœ‡ 𝑑𝑝

BGP Matching - Example

Superstep 1

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 18

SELECT * WHERE {

?A knows ?B . ?A likes ?B . ?B knows ?C

}

tp1 tp2 tp3

π‘£π΄π‘™π‘Žπ‘π‘’π‘™ = 𝐴

π‘£π΅π‘™π‘Žπ‘π‘’π‘™ = 𝐡

π‘£πΆπ‘™π‘Žπ‘π‘’π‘™ = 𝐢

π‘™π‘Žπ‘π‘’π‘™ = π‘™π‘–π‘˜π‘’π‘ 

π‘™π‘Žπ‘π‘’π‘™ = π‘˜π‘›π‘œπ‘€π‘ 

π‘™π‘Žπ‘π‘’π‘™ = π‘™π‘–π‘˜π‘’π‘ π‘™π‘Žπ‘π‘’π‘™ = π‘˜π‘›π‘œπ‘€π‘ 

? π‘£π‘Žπ‘Ÿ πœ‡ 𝑑𝑝

? 𝐴 ? 𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐡 𝑑𝑝1

? 𝐡 ?𝐡 β†’ 𝐴, ? 𝐢 β†’ 𝐡 𝑑𝑝3

? 𝐴 ? 𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐡 𝑑𝑝2

? 𝐴 ? 𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐢 𝑑𝑝2

? π‘£π‘Žπ‘Ÿ πœ‡ 𝑑𝑝

?𝐡 ?𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐡 𝑑𝑝1

? 𝐢 ?𝐡 β†’ 𝐴, ? 𝐢 β†’ 𝐡 𝑑𝑝3

? 𝐡 ?𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐡 𝑑𝑝2

? π‘£π‘Žπ‘Ÿ πœ‡ 𝑑𝑝

?𝐡 ?𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐢 𝑑𝑝2

BGP Matching - Example

Superstep 1

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 19

SELECT * WHERE {

?A knows ?B . ?A likes ?B . ?B knows ?C

}

tp1 tp2 tp3

π‘£π΄π‘™π‘Žπ‘π‘’π‘™ = 𝐴

π‘£π΅π‘™π‘Žπ‘π‘’π‘™ = 𝐡

π‘£πΆπ‘™π‘Žπ‘π‘’π‘™ = 𝐢

π‘™π‘Žπ‘π‘’π‘™ = π‘™π‘–π‘˜π‘’π‘ 

π‘™π‘Žπ‘π‘’π‘™ = π‘˜π‘›π‘œπ‘€π‘ 

π‘™π‘Žπ‘π‘’π‘™ = π‘™π‘–π‘˜π‘’π‘ π‘™π‘Žπ‘π‘’π‘™ = π‘˜π‘›π‘œπ‘€π‘ 

? π‘£π‘Žπ‘Ÿ πœ‡ 𝑑𝑝

? 𝐴 ? 𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐡 𝑑𝑝1

? 𝐡 ?𝐡 β†’ 𝐴, ? 𝐢 β†’ 𝐡 𝑑𝑝3

? 𝐴 ? 𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐡 𝑑𝑝2

? 𝐴 ? 𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐢 𝑑𝑝2

? π‘£π‘Žπ‘Ÿ πœ‡ 𝑑𝑝

?𝐡 ?𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐡 𝑑𝑝1

? 𝐢 ?𝐡 β†’ 𝐴, ? 𝐢 β†’ 𝐡 𝑑𝑝3

? 𝐡 ?𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐡 𝑑𝑝2

? 𝐴 ? 𝐴 β†’ 𝐡, ? 𝐡 β†’ 𝐢 𝑑𝑝1

? 𝐡 ?𝐡 β†’ 𝐡, ? 𝐢 β†’ 𝐢 𝑑𝑝3

? π‘£π‘Žπ‘Ÿ πœ‡ 𝑑𝑝

?𝐡 ?𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐢 𝑑𝑝2

? 𝐡 ? 𝐴 β†’ 𝐡, ? 𝐡 β†’ 𝐢 𝑑𝑝1

? 𝐢 ?𝐡 β†’ 𝐡, ? 𝐢 β†’ 𝐢 𝑑𝑝3

BGP Matching - Example

Superstep 1

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 20

SELECT * WHERE {

?A knows ?B . ?A likes ?B . ?B knows ?C

}

tp1 tp2 tp3

π‘£π΄π‘™π‘Žπ‘π‘’π‘™ = 𝐴

π‘£π΅π‘™π‘Žπ‘π‘’π‘™ = 𝐡

π‘£πΆπ‘™π‘Žπ‘π‘’π‘™ = 𝐢

π‘™π‘Žπ‘π‘’π‘™ = π‘™π‘–π‘˜π‘’π‘ 

π‘™π‘Žπ‘π‘’π‘™ = π‘˜π‘›π‘œπ‘€π‘ 

π‘™π‘Žπ‘π‘’π‘™ = π‘™π‘–π‘˜π‘’π‘ π‘™π‘Žπ‘π‘’π‘™ = π‘˜π‘›π‘œπ‘€π‘ 

? π‘£π‘Žπ‘Ÿ πœ‡ 𝑑𝑝

? 𝐴 ? 𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐡 𝑑𝑝1

? 𝐡 ?𝐡 β†’ 𝐴, ? 𝐢 β†’ 𝐡 𝑑𝑝3

? 𝐴 ? 𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐡 𝑑𝑝2

? 𝐴 ? 𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐢 𝑑𝑝2

? π‘£π‘Žπ‘Ÿ πœ‡ 𝑑𝑝

?𝐡 ?𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐡 𝑑𝑝1

? 𝐢 ?𝐡 β†’ 𝐴, ? 𝐢 β†’ 𝐡 𝑑𝑝3

? 𝐡 ?𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐡 𝑑𝑝2

? 𝐴 ? 𝐴 β†’ 𝐡, ? 𝐡 β†’ 𝐢 𝑑𝑝1

? 𝐡 ?𝐡 β†’ 𝐡, ? 𝐢 β†’ 𝐢 𝑑𝑝3

? π‘£π‘Žπ‘Ÿ πœ‡ 𝑑𝑝

?𝐡 ?𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐢 𝑑𝑝2

? 𝐡 ? 𝐴 β†’ 𝐡, ? 𝐡 β†’ 𝐢 𝑑𝑝1

? 𝐢 ?𝐡 β†’ 𝐡, ? 𝐢 β†’ 𝐢 𝑑𝑝3

Overestimation!

BGP Matching - Example

Superstep 2

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 21

SELECT * WHERE {

?A knows ?B . ?A likes ?B . ?B knows ?C

}

tp1 tp2 tp3

π‘£π΄π‘™π‘Žπ‘π‘’π‘™ = 𝐴

π‘£π΅π‘™π‘Žπ‘π‘’π‘™ = 𝐡

π‘£πΆπ‘™π‘Žπ‘π‘’π‘™ = 𝐢

π‘™π‘Žπ‘π‘’π‘™ = π‘™π‘–π‘˜π‘’π‘ 

π‘™π‘Žπ‘π‘’π‘™ = π‘˜π‘›π‘œπ‘€π‘ 

π‘™π‘Žπ‘π‘’π‘™ = π‘™π‘–π‘˜π‘’π‘ π‘™π‘Žπ‘π‘’π‘™ = π‘˜π‘›π‘œπ‘€π‘ 

? π‘£π‘Žπ‘Ÿ πœ‡ 𝑑𝑝

? 𝐴 ? 𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐡 𝑑𝑝1

? 𝐡 ?𝐡 β†’ 𝐴, ? 𝐢 β†’ 𝐡 𝑑𝑝3

? 𝐴 ? 𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐡 𝑑𝑝2

? 𝐴 ? 𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐢 𝑑𝑝2

π·π‘œπ‘’π‘  ?𝐡 , ? 𝐡 β†’ 𝐴,… , 𝑑𝑝1 𝑒π‘₯𝑖𝑠𝑑 𝑖𝑛 𝑣𝐴 β‡’ 𝑁𝑂!

Validation of Local Match Sets:

BGP Matching - Example

Superstep 2

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 22

SELECT * WHERE {

?A knows ?B . ?A likes ?B . ?B knows ?C

}

tp1 tp2 tp3

π‘£π΄π‘™π‘Žπ‘π‘’π‘™ = 𝐴

π‘£π΅π‘™π‘Žπ‘π‘’π‘™ = 𝐡

π‘£πΆπ‘™π‘Žπ‘π‘’π‘™ = 𝐢

π‘™π‘Žπ‘π‘’π‘™ = π‘™π‘–π‘˜π‘’π‘ 

π‘™π‘Žπ‘π‘’π‘™ = π‘˜π‘›π‘œπ‘€π‘ 

π‘™π‘Žπ‘π‘’π‘™ = π‘™π‘–π‘˜π‘’π‘ π‘™π‘Žπ‘π‘’π‘™ = π‘˜π‘›π‘œπ‘€π‘ 

? π‘£π‘Žπ‘Ÿ πœ‡ 𝑑𝑝

? 𝐴 ? 𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐡 𝑑𝑝1

? 𝐡 ?𝐡 β†’ 𝐴, ? 𝐢 β†’ 𝐡 𝑑𝑝3

? 𝐴 ? 𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐡 𝑑𝑝2

? 𝐴 ? 𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐢 𝑑𝑝2 π·π‘œπ‘’π‘  ?𝐴 , ? 𝐴 β†’ 𝐴, ?𝐡 β†’ 𝐢 , 𝑑𝑝1 𝑒π‘₯𝑖𝑠𝑑 𝑖𝑛 𝑣𝐴 β‡’ 𝑁𝑂!

Validation of Local Match Sets:

π·π‘œπ‘’π‘  ?𝐡 , ? 𝐡 β†’ 𝐴,… , 𝑑𝑝1 𝑒π‘₯𝑖𝑠𝑑 𝑖𝑛 𝑣𝐴 β‡’ 𝑁𝑂!

BGP Matching - Example

Superstep 2

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 23

SELECT * WHERE {

?A knows ?B . ?A likes ?B . ?B knows ?C

}

tp1 tp2 tp3

π‘£π΄π‘™π‘Žπ‘π‘’π‘™ = 𝐴

π‘£π΅π‘™π‘Žπ‘π‘’π‘™ = 𝐡

π‘£πΆπ‘™π‘Žπ‘π‘’π‘™ = 𝐢

π‘™π‘Žπ‘π‘’π‘™ = π‘™π‘–π‘˜π‘’π‘ 

π‘™π‘Žπ‘π‘’π‘™ = π‘˜π‘›π‘œπ‘€π‘ 

π‘™π‘Žπ‘π‘’π‘™ = π‘™π‘–π‘˜π‘’π‘ π‘™π‘Žπ‘π‘’π‘™ = π‘˜π‘›π‘œπ‘€π‘ 

? π‘£π‘Žπ‘Ÿ πœ‡ 𝑑𝑝

? 𝐴 ? 𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐡 𝑑𝑝1

? 𝐡 ?𝐡 β†’ 𝐴, ? 𝐢 β†’ 𝐡 𝑑𝑝3

? 𝐴 ? 𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐡 𝑑𝑝2

? 𝐴 ? 𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐢 𝑑𝑝2

? π‘£π‘Žπ‘Ÿ πœ‡ 𝑑𝑝

?𝐡 ?𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐡 𝑑𝑝1

? 𝐢 ?𝐡 β†’ 𝐴, ? 𝐢 β†’ 𝐡 𝑑𝑝3

? 𝐡 ?𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐡 𝑑𝑝2

? 𝐴 ? 𝐴 β†’ 𝐡, ? 𝐡 β†’ 𝐢 𝑑𝑝1

? 𝐡 ?𝐡 β†’ 𝐡, ? 𝐢 β†’ 𝐢 𝑑𝑝3

? π‘£π‘Žπ‘Ÿ πœ‡ 𝑑𝑝

?𝐡 ?𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐢 𝑑𝑝2

? 𝐡 ? 𝐴 β†’ 𝐡, ? 𝐡 β†’ 𝐢 𝑑𝑝1

? 𝐢 ?𝐡 β†’ 𝐡, ? 𝐢 β†’ 𝐢 𝑑𝑝3

Removed!

BGP Matching - Example

Superstep 2

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 24

SELECT * WHERE {

?A knows ?B . ?A likes ?B . ?B knows ?C

}

tp1 tp2 tp3

π‘£π΄π‘™π‘Žπ‘π‘’π‘™ = 𝐴

π‘£π΅π‘™π‘Žπ‘π‘’π‘™ = 𝐡

π‘£πΆπ‘™π‘Žπ‘π‘’π‘™ = 𝐢

π‘™π‘Žπ‘π‘’π‘™ = π‘™π‘–π‘˜π‘’π‘ 

π‘™π‘Žπ‘π‘’π‘™ = π‘˜π‘›π‘œπ‘€π‘ 

π‘™π‘Žπ‘π‘’π‘™ = π‘™π‘–π‘˜π‘’π‘ π‘™π‘Žπ‘π‘’π‘™ = π‘˜π‘›π‘œπ‘€π‘ 

? π‘£π‘Žπ‘Ÿ πœ‡ 𝑑𝑝

? 𝐴 ? 𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐡 𝑑𝑝1

? 𝐴 ? 𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐡 𝑑𝑝2

? π‘£π‘Žπ‘Ÿ πœ‡ 𝑑𝑝

?𝐡 ?𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐡 𝑑𝑝1

? 𝐢 ?𝐡 β†’ 𝐴, ? 𝐢 β†’ 𝐡 𝑑𝑝3

? 𝐡 ?𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐡 𝑑𝑝2

? 𝐡 ?𝐡 β†’ 𝐡, ? 𝐢 β†’ 𝐢 𝑑𝑝3

? π‘£π‘Žπ‘Ÿ πœ‡ 𝑑𝑝

? 𝐢 ?𝐡 β†’ 𝐡, ? 𝐢 β†’ 𝐢 𝑑𝑝3

Exchange

Exchange

BGP Matching - Example

Superstep 3

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 25

SELECT * WHERE {

?A knows ?B . ?A likes ?B . ?B knows ?C

}

tp1 tp2 tp3

π‘£π΄π‘™π‘Žπ‘π‘’π‘™ = 𝐴

π‘£π΅π‘™π‘Žπ‘π‘’π‘™ = 𝐡

π‘£πΆπ‘™π‘Žπ‘π‘’π‘™ = 𝐢

π‘™π‘Žπ‘π‘’π‘™ = π‘™π‘–π‘˜π‘’π‘ 

π‘™π‘Žπ‘π‘’π‘™ = π‘˜π‘›π‘œπ‘€π‘ 

π‘™π‘Žπ‘π‘’π‘™ = π‘™π‘–π‘˜π‘’π‘ π‘™π‘Žπ‘π‘’π‘™ = π‘˜π‘›π‘œπ‘€π‘ 

? π‘£π‘Žπ‘Ÿ πœ‡ 𝑑𝑝

? 𝐴 ? 𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐡 𝑑𝑝1

? 𝐴 ? 𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐡 𝑑𝑝2

? π‘£π‘Žπ‘Ÿ πœ‡ 𝑑𝑝

?𝐡 ?𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐡 𝑑𝑝1

? 𝐢 ?𝐡 β†’ 𝐴, ? 𝐢 β†’ 𝐡 𝑑𝑝3

? 𝐡 ?𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐡 𝑑𝑝2

? 𝐡 ?𝐡 β†’ 𝐡, ? 𝐢 β†’ 𝐢 𝑑𝑝3

π·π‘œπ‘’π‘  ?𝐡 , ?𝐡 β†’ 𝐴, ? 𝐢 β†’ 𝐡 , 𝑑𝑝3𝑒π‘₯𝑖𝑠𝑑 𝑖𝑛 𝑣𝐴 β‡’ 𝑁𝑂!

Validation of

Remote Match Sets:

BGP Matching - Example

Superstep 3

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 26

SELECT * WHERE {

?A knows ?B . ?A likes ?B . ?B knows ?C

}

tp1 tp2 tp3

π‘£π΄π‘™π‘Žπ‘π‘’π‘™ = 𝐴

π‘£π΅π‘™π‘Žπ‘π‘’π‘™ = 𝐡

π‘£πΆπ‘™π‘Žπ‘π‘’π‘™ = 𝐢

π‘™π‘Žπ‘π‘’π‘™ = π‘™π‘–π‘˜π‘’π‘ 

π‘™π‘Žπ‘π‘’π‘™ = π‘˜π‘›π‘œπ‘€π‘ 

π‘™π‘Žπ‘π‘’π‘™ = π‘™π‘–π‘˜π‘’π‘ π‘™π‘Žπ‘π‘’π‘™ = π‘˜π‘›π‘œπ‘€π‘ 

? π‘£π‘Žπ‘Ÿ πœ‡ 𝑑𝑝

? 𝐴 ? 𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐡 𝑑𝑝1

? 𝐴 ? 𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐡 𝑑𝑝2

? π‘£π‘Žπ‘Ÿ πœ‡ 𝑑𝑝

?𝐡 ?𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐡 𝑑𝑝1

? 𝐢 ?𝐡 β†’ 𝐴, ? 𝐢 β†’ 𝐡 𝑑𝑝3

? 𝐡 ?𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐡 𝑑𝑝2

? 𝐡 ?𝐡 β†’ 𝐡, ? 𝐢 β†’ 𝐢 𝑑𝑝3

? π‘£π‘Žπ‘Ÿ πœ‡ 𝑑𝑝

? 𝐢 ?𝐡 β†’ 𝐡, ? 𝐢 β†’ 𝐢 𝑑𝑝3

Removed!No Changes! No Changes!

BGP Matching - Example

Superstep 3

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 27

SELECT * WHERE {

?A knows ?B . ?A likes ?B . ?B knows ?C

}

tp1 tp2 tp3

π‘£π΄π‘™π‘Žπ‘π‘’π‘™ = 𝐴

π‘£π΅π‘™π‘Žπ‘π‘’π‘™ = 𝐡

π‘£πΆπ‘™π‘Žπ‘π‘’π‘™ = 𝐢

π‘™π‘Žπ‘π‘’π‘™ = π‘™π‘–π‘˜π‘’π‘ 

π‘™π‘Žπ‘π‘’π‘™ = π‘˜π‘›π‘œπ‘€π‘ 

π‘™π‘Žπ‘π‘’π‘™ = π‘™π‘–π‘˜π‘’π‘ π‘™π‘Žπ‘π‘’π‘™ = π‘˜π‘›π‘œπ‘€π‘ 

? π‘£π‘Žπ‘Ÿ πœ‡ 𝑑𝑝

? 𝐴 ? 𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐡 𝑑𝑝1

? 𝐴 ? 𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐡 𝑑𝑝2

? π‘£π‘Žπ‘Ÿ πœ‡ 𝑑𝑝

?𝐡 ?𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐡 𝑑𝑝1

? 𝐡 ? 𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐡 𝑑𝑝2

? 𝐡 ?𝐡 β†’ 𝐡, ? 𝐢 β†’ 𝐢 𝑑𝑝3

? π‘£π‘Žπ‘Ÿ πœ‡ 𝑑𝑝

? 𝐢 ?𝐡 β†’ 𝐡, ? 𝐢 β†’ 𝐢 𝑑𝑝3

Exchange

Exchange

BGP Matching - Example

Superstep 4

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 28

SELECT * WHERE {

?A knows ?B . ?A likes ?B . ?B knows ?C

}

tp1 tp2 tp3

π‘£π΄π‘™π‘Žπ‘π‘’π‘™ = 𝐴

π‘£π΅π‘™π‘Žπ‘π‘’π‘™ = 𝐡

π‘£πΆπ‘™π‘Žπ‘π‘’π‘™ = 𝐢

π‘™π‘Žπ‘π‘’π‘™ = π‘™π‘–π‘˜π‘’π‘ 

π‘™π‘Žπ‘π‘’π‘™ = π‘˜π‘›π‘œπ‘€π‘ 

π‘™π‘Žπ‘π‘’π‘™ = π‘™π‘–π‘˜π‘’π‘ π‘™π‘Žπ‘π‘’π‘™ = π‘˜π‘›π‘œπ‘€π‘ 

? π‘£π‘Žπ‘Ÿ πœ‡ 𝑑𝑝

? 𝐴 ? 𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐡 𝑑𝑝1

? 𝐴 ? 𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐡 𝑑𝑝2

? π‘£π‘Žπ‘Ÿ πœ‡ 𝑑𝑝

?𝐡 ?𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐡 𝑑𝑝1

? 𝐡 ? 𝐴 β†’ 𝐴, ? 𝐡 β†’ 𝐡 𝑑𝑝2

? 𝐡 ?𝐡 β†’ 𝐡, ? 𝐢 β†’ 𝐢 𝑑𝑝3

? π‘£π‘Žπ‘Ÿ πœ‡ 𝑑𝑝

? 𝐢 ?𝐡 β†’ 𝐡, ? 𝐢 β†’ 𝐢 𝑑𝑝3

No Changes! No Changes!

Experiments

Small Cluster with low-end configuration

β€’ 10 machines (1 master, 9 worker), 2 disks each, Gigabit network

β€’ 24 GB RAM for Spark, 60% assigned for graph caching(typical Hadoop nodes currently β‰₯ 256 GB RAM, β‰₯ 12 disks, β‰₯ 10 Gigabit network)

β€’ CDH 5.3, Spark 1.2.0, Pig 0.12

WatDiv Benchmark [2]

β€’ Combines e-commerce scenario with a kind of social network

β€’ All different kinds of query shapes (star, linear, snowflake, complex)

β€’ Scale Factors:

- 10 (10K users, ~0.1M vertices)

- 100 (100K users, ~1M vertices)

- 1000 (1M users, ~10M vertices)

Compared with PigSPARQL [3]

β€’ SPARQL Query Processing Baseline for Hadoop

β€’ Proven to be competitive to other Hadoop research approaches

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 29

Experiments

Geometric Mean per Query Group

β€’ S2X outperforms PigSPARQL by up to an order of magnitude

β€’ Runtime of S2X strongly depends on result size (better for small output)

β€’ Runtime can be highly influenced by adjusting parameters of Spark/GraphX

β€’ GraphX currently quite β€œmemory-heavy” (early stage of development)

β€’ Very good preliminary results but still much room for improvement in S2X

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 30

S L SF C

S2X 7,62 9,62 17,98 52,89

PigSPARQL 43,08 62,19 88,09 127,34

0

20

40

60

80

100

120

140

Overall Runtime

S L SF C

result 1,98 2,92 8,03 23,49

BGP 5,62 6,66 9,93 29,04

0

10

20

30

40

50

60

S2X - BGP Matching vs. Result Generation

Summary

S2X (SPARQL on Spark with GraphX)

β€’ SPARQL Query Processor for Hadoop based on Spark GraphX

β€’ Property Graph representation for RDF

β€’ Combines graph-parallel and data-parallel computation

- graph-parallel: BGP matching

- data-parallel: BGP solution mapping generation + other SPARQL operators

Experiments

β€’ Very good preliminary results

- Up to an order of magnitude performance improvement compared to a

state-of-the-art SPARQL Processor baseline for Hadoop

β€’ GraphX is in an early stage of development

- Many improvements can be expected in future GraphX versions

β€’ Many possible improvements for S2X

- Incremental exchange of match sets I/O reduction

- Optimization of graph partitioning avoid unbalanced workloads

- Incremental BGP matching for selective queries Reduction of Overestimation

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 31

Thank You

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 32

Thank you for your attention!

Questions?

References

[1] Alexander SchΓ€tzle, Martin Przyjaciel-Zablocki, Thorsten Berberich, Georg Lausen:

S2X: Graph-Parallel Querying of RDF with GraphX. In: Big-O(Q) 2015

[2] GΓΌnes Aluc, Olaf Hartig, M. Tamer Γ–zsu, Khuzaima Daudjee:

Diversified Stress Testing of RDF Data Management Systems. In: ISWC 2014, pp. 197-212

[3] Alexander SchΓ€tzle, Martin Przyjaciel-Zablocki, Thomas Hornung, Georg Lausen:

PigSPARQL: A SPARQL Query Processing Baseline for Big Data. In: ISWC 2013 Posters

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 33

Experiments

Star Queries

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 34

10 100 1000

result 0,70 1,60 6,87

BGP 1,93 3,73 24,66

0

5

10

15

20

25

30

35

S2X - BGP Matching vs. Result Generation

10 100 1000

S2X 2,63 5,33 31,53

PigSPARQL 41,43 41,28 46,74

0

5

10

15

20

25

30

35

40

45

50

Geometric Mean

Experiments

Linear Queries

β€’ Strong variation in runtimes (min: ~14s, max: ~348s)

β€’ For the long running queries, result generation takes more time than BGP

matching, e.g. 348s = (124s, 224s) many results

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 35

10 100 1000

S2X 2,21 5,23 77,25

PigSPARQL 61,45 61,23 63,93

0

10

20

30

40

50

60

70

80

Geometric Mean

10 100 1000

result 0,55 1,64 27,41

BGP 1,65 3,58 49,85

0

10

20

30

40

50

60

70

80

S2X - BGP Matching vs. Result Generation

Experiments

Snowflake Queries

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 36

10 100 1000

S2X 5,84 12,99 76,63

PigSPARQL 85,47 85,21 93,85

0

10

20

30

40

50

60

70

80

90

100

Geometric Mean

10 100 1000

result 2,58 6,20 32,39

BGP 3,26 6,79 44,25

0

10

20

30

40

50

60

70

80

S2X - BGP Matching vs. Result Generation

Experiments

Complex Queries

β€’ Queries produce many results

for some queries, result generation takes more time than BGP matching

31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 37

10 100 1000

S2X 17,53 41,17 204,96

PigSPARQL 116,13 123,76 143,66

0

30

60

90

120

150

180

210

Geometric Mean

10 100 1000

result 8,09 20,98 76,40

BGP 9,44 20,19 128,57

0

30

60

90

120

150

180

210

S2X - BGP Matching vs. Result Generation


Recommended