S2X
Graph-Parallel Querying of RDF with GraphX [1]
Alexander SchΓ€tzle, Martin Przyjaciel-Zablocki, Thorsten Berberich, Georg Lausen
University of Freiburg, Germany
http://dbis.informatik.uni-freiburg.de/forschung/projekte/DiPoS/
Big-O(Q) 2015 β Hawaii, USA
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 2
Motivation
Semantic Web has arrived in real-world applications(not only academia & research)
β’ Web-scale semantic data makes single machine solutions infeasible
Hadoop ecosystem has become de-facto
standard for Big Data applications
Our Idea: Use it for Semantic Web purposes as well
β’ Main reasons:
1) Web-scale semantic data requires solutions that scale out
2) Industry has settled on Hadoop (or Hadoop-style) architectures
superior cost-benefit ratio compared to specialized infrastructures
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 3
Motivation
RDF/SPARQL-on-Hadoop
β’ Existing solutions are based on relational-style data models
implemented on data-parallel frameworks
β’ Hence they do not exploit the inherent graph-like structure of RDF
Parallel Graph Processing Systems
β’ e.g. Pregel, Giraph, GraphLab
β’ Tailored towards iterative graph algorithms
β’ But: Not integrated with Hadoop or hard to combine with other frameworks
Data movement or duplication
S2X (SPARQL on Spark with GraphX)
β’ Spark GraphX allows to combine graph-parallel and data-parallel computation
β’ GraphX used for graph pattern matching
β’ Other SPARQL operators implemented in Spark
Spark GraphX
Apache Spark
β’ General-purpose in-memory cluster computing system
β’ Based on Resilient Distributed Datasets (RDD)
- distributed, fault-tolerant collection of elements
- kept in memory and partitioned across the cluster
β’ Jobs are modeled as Directed Acyclic Graphs (DAG) of tasks
- each tasks runs on a horizontal partition of an RDD
GraphX
β’ Spark API for graphs and graph-parallel computation
β’ Uses a Property Graph Data Model (represented by two RDDs)
β’ Vertex-Centric view of graphs (similar to Pregel)
β’ Adopts Bulk-Synchronous Parallel (BSP) execution model
- vertex programs run concurrently in a sequence of Supersteps
β’ Vertex-Cut partitioning
- evenly assigns edges to machines, vertices can span multiple machines
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 4
RDF Property Graph Model
Property Graph (PG)
β’ ππΊ π = π, πΈ, π
β’ π = π ππ‘ ππ π£πππ‘ππππ , πΈ = π ππ‘ ππ ππππππ‘ππ πππππ
β’ π = ππ , ππΈ = πππππππ‘πππ ππ π£πππ‘ππ₯ πππ ππππ πππππππ‘πππ πππ¦, π£πππ’π
RDF Graph (G)
β’ πΊ = π‘1, β¦ , π‘π = π ππ‘ ππ π π·πΉ π‘ππππππ
β’ π π·πΉ π‘πππππ π‘ = π π’πππππ‘, ππππππππ‘π, ππππππ‘
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 5
i j
ππΈ π, π = π, π£
ππ π = π, π£ ππ π = π, π£
πΌπ·π πΌπ·π
s op
IRIs = global identifierse.g. dbpedia:Leonardo_da_Vinci
RDF Property Graph Model
RDF to PG - Example
β’ πΊ = π΄, ππππ€π , π΅ , π΄, πππππ , π΅ , π΄, πππππ , πΆ , π΅, ππππ€π , πΆ
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 6
A B
knows
C
likes
knows
likes
πΌπ·π΄ = π£π΄ πΌπ·π΅ = π£π΅ πΌπ·πΆ = π£πΆ
ππΈ π£π΄, π£πΆ . πππππ = πππππ
ππΈ π£π΄, π£π΅ . πππππ = ππππ€π
ππΈ π£π΄, π£π΅ . πππππ = πππππ ππΈ π£π΅ , π£πΆ . πππππ = ππππ€π
ππ π£π΄ . πππππ = π΄ ππ π£π΅ . πππππ = π΅ ππ π£πΆ . πππππ = πΆ
SPARQL
Standard Query Language for RDF
(Solution) Mapping
β’ π βΆ π β π π·πΉ πππππ
β’ Result of a triple pattern π‘π is a bag of mappings
Ξ©π‘π = π πππ π = π£πππ π‘π , π π‘π β πΊ
β’ Result of a basic graph pattern πππ = π‘π1, β¦ , π‘ππ is the merge of
compatible mappings (Join on common variables)
Ξ©πππ = Ξ©π‘π1 β β¦ β Ξ©π‘ππ
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 7
SELECT * WHERE {
?A knows ?B . ?A likes ?B . ?B knows ?C
}
triple pattern
basic graph pattern (BGP)
SPARQL - Example
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 8
SELECT * WHERE {
?A knows ?B . ?A likes ?B . ?B knows ?C
}
tp1 tp2 tp3
A B
knows
C
likes
knows
likes
SPARQL - Example
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 9
SELECT * WHERE {
?A knows ?B . ?A likes ?B . ?B knows ?C
}
tp1 tp2 tp3
A B
knows
C
likes
knows
likes
? π¨ ?π©
π11 π΄ π΅
π12 π΅ πΆ
Ξ©π‘π1
SPARQL - Example
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 10
SELECT * WHERE {
?A knows ?B . ?A likes ?B . ?B knows ?C
}
tp1 tp2 tp3
A B
knows
C
likes
knows
likes
? π¨ ?π©
π11 π΄ π΅
π12 π΅ πΆ
Ξ©π‘π1
? π¨ ?π©
π21 π΄ π΅
π22 π΄ πΆ
Ξ©π‘π2
SPARQL - Example
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 11
SELECT * WHERE {
?A knows ?B . ?A likes ?B . ?B knows ?C
}
tp1 tp2 tp3
A B
knows
C
likes
knows
likes
? π¨ ?π©
π11 π΄ π΅
π12 π΅ πΆ
Ξ©π‘π1
? π¨ ?π©
π21 π΄ π΅
π22 π΄ πΆ
Ξ©π‘π2
?π© ?πͺ
π31 π΄ π΅
π32 π΅ πΆ
Ξ©π‘π3
SPARQL - Example
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 12
SELECT * WHERE {
?A knows ?B . ?A likes ?B . ?B knows ?C
}
tp1 tp2 tp3
A B
knows
C
likes
knows
likes
? π¨ ?π©
π11 π΄ π΅
π12 π΅ πΆ
Ξ©π‘π1
? π¨ ?π©
π21 π΄ π΅
π22 π΄ πΆ
Ξ©π‘π2
?π© ?πͺ
π31 π΄ π΅
π32 π΅ πΆ
Ξ©π‘π3
β β
Ξ©πππ = π βΆ ? π΄ β π΄, ? π΅ β π΅, ? πΆ β πΆ
S2X Query Processing
Basic Idea:
β’ Every vertex stores the variables of a query where it is a candidate for
β’ Vertices exchange their candidate sets for validation
Match Candidate:
β’ πππ‘πβπΆ = π£, ? π£ππ, π, π‘π
β’ π£ is a candidate for ? π£ππ with mapping π derived from triple pattern π‘π
Match Set:
β’ Set of all match candidates of a vertex
β’ Match set is stored as a property of the vertex ππ π£ .πππ‘πβπ
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 13
π£π΄πππππ = π΄
π£π΅πππππ = π΅
πππππ = ππππ€π π‘π = (?π΄ ππππ€π ?π΅)
π£π΄, ? π΄, π: ?π΄ β π΄, ?π΅ β π΅ , π‘π π£π΅, ? π΅, π: ? π΄ β π΄, ? π΅ β π΅ , π‘π
S2X Query Processing
BGP Matching in a Nutshell
β’ Superstep 1:
- Determine all possible match candidates
Iterate over all edges and match with all triple patterns at once
- Yields an overestimation of the BGP result
β’ Following Supersteps:
- Validate Local Match Sets
Does a vertex have all required match candidates for a variable?
- Validate Remote Match Sets
Are the neighbors of vertex still the required candidates?
- Exchange Match Sets with Neighbor Vertices
- Repeat until no changes occur
β’ Result Generation (solution mappings):
- Collect and merge (join) all match sets to produce the final mappings
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 14
BGP Matching - Example
Superstep 1
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 16
SELECT * WHERE {
?A knows ?B . ?A likes ?B . ?B knows ?C
}
tp1 tp2 tp3
π£π΄πππππ = π΄
π£π΅πππππ = π΅
π£πΆπππππ = πΆ
πππππ = πππππ
πππππ = ππππ€π
πππππ = πππππ πππππ = ππππ€π
? π£ππ π π‘π
? π΄ ? π΄ β π΄, ? π΅ β π΅ π‘π1
? π΅ ?π΅ β π΄, ? πΆ β π΅ π‘π3
? π£ππ π π‘π
?π΅ ?π΄ β π΄, ? π΅ β π΅ π‘π1
? πΆ ?π΅ β π΄, ? πΆ β π΅ π‘π3
? π£ππ π π‘π
BGP Matching - Example
Superstep 1
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 17
SELECT * WHERE {
?A knows ?B . ?A likes ?B . ?B knows ?C
}
tp1 tp2 tp3
π£π΄πππππ = π΄
π£π΅πππππ = π΅
π£πΆπππππ = πΆ
πππππ = πππππ
πππππ = ππππ€π
πππππ = πππππ πππππ = ππππ€π
? π£ππ π π‘π
? π΄ ? π΄ β π΄, ? π΅ β π΅ π‘π1
? π΅ ?π΅ β π΄, ? πΆ β π΅ π‘π3
? π΄ ? π΄ β π΄, ? π΅ β π΅ π‘π2
? π£ππ π π‘π
?π΅ ?π΄ β π΄, ? π΅ β π΅ π‘π1
? πΆ ?π΅ β π΄, ? πΆ β π΅ π‘π3
? π΅ ?π΄ β π΄, ? π΅ β π΅ π‘π2
? π£ππ π π‘π
BGP Matching - Example
Superstep 1
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 18
SELECT * WHERE {
?A knows ?B . ?A likes ?B . ?B knows ?C
}
tp1 tp2 tp3
π£π΄πππππ = π΄
π£π΅πππππ = π΅
π£πΆπππππ = πΆ
πππππ = πππππ
πππππ = ππππ€π
πππππ = πππππ πππππ = ππππ€π
? π£ππ π π‘π
? π΄ ? π΄ β π΄, ? π΅ β π΅ π‘π1
? π΅ ?π΅ β π΄, ? πΆ β π΅ π‘π3
? π΄ ? π΄ β π΄, ? π΅ β π΅ π‘π2
? π΄ ? π΄ β π΄, ? π΅ β πΆ π‘π2
? π£ππ π π‘π
?π΅ ?π΄ β π΄, ? π΅ β π΅ π‘π1
? πΆ ?π΅ β π΄, ? πΆ β π΅ π‘π3
? π΅ ?π΄ β π΄, ? π΅ β π΅ π‘π2
? π£ππ π π‘π
?π΅ ?π΄ β π΄, ? π΅ β πΆ π‘π2
BGP Matching - Example
Superstep 1
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 19
SELECT * WHERE {
?A knows ?B . ?A likes ?B . ?B knows ?C
}
tp1 tp2 tp3
π£π΄πππππ = π΄
π£π΅πππππ = π΅
π£πΆπππππ = πΆ
πππππ = πππππ
πππππ = ππππ€π
πππππ = πππππ πππππ = ππππ€π
? π£ππ π π‘π
? π΄ ? π΄ β π΄, ? π΅ β π΅ π‘π1
? π΅ ?π΅ β π΄, ? πΆ β π΅ π‘π3
? π΄ ? π΄ β π΄, ? π΅ β π΅ π‘π2
? π΄ ? π΄ β π΄, ? π΅ β πΆ π‘π2
? π£ππ π π‘π
?π΅ ?π΄ β π΄, ? π΅ β π΅ π‘π1
? πΆ ?π΅ β π΄, ? πΆ β π΅ π‘π3
? π΅ ?π΄ β π΄, ? π΅ β π΅ π‘π2
? π΄ ? π΄ β π΅, ? π΅ β πΆ π‘π1
? π΅ ?π΅ β π΅, ? πΆ β πΆ π‘π3
? π£ππ π π‘π
?π΅ ?π΄ β π΄, ? π΅ β πΆ π‘π2
? π΅ ? π΄ β π΅, ? π΅ β πΆ π‘π1
? πΆ ?π΅ β π΅, ? πΆ β πΆ π‘π3
BGP Matching - Example
Superstep 1
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 20
SELECT * WHERE {
?A knows ?B . ?A likes ?B . ?B knows ?C
}
tp1 tp2 tp3
π£π΄πππππ = π΄
π£π΅πππππ = π΅
π£πΆπππππ = πΆ
πππππ = πππππ
πππππ = ππππ€π
πππππ = πππππ πππππ = ππππ€π
? π£ππ π π‘π
? π΄ ? π΄ β π΄, ? π΅ β π΅ π‘π1
? π΅ ?π΅ β π΄, ? πΆ β π΅ π‘π3
? π΄ ? π΄ β π΄, ? π΅ β π΅ π‘π2
? π΄ ? π΄ β π΄, ? π΅ β πΆ π‘π2
? π£ππ π π‘π
?π΅ ?π΄ β π΄, ? π΅ β π΅ π‘π1
? πΆ ?π΅ β π΄, ? πΆ β π΅ π‘π3
? π΅ ?π΄ β π΄, ? π΅ β π΅ π‘π2
? π΄ ? π΄ β π΅, ? π΅ β πΆ π‘π1
? π΅ ?π΅ β π΅, ? πΆ β πΆ π‘π3
? π£ππ π π‘π
?π΅ ?π΄ β π΄, ? π΅ β πΆ π‘π2
? π΅ ? π΄ β π΅, ? π΅ β πΆ π‘π1
? πΆ ?π΅ β π΅, ? πΆ β πΆ π‘π3
Overestimation!
BGP Matching - Example
Superstep 2
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 21
SELECT * WHERE {
?A knows ?B . ?A likes ?B . ?B knows ?C
}
tp1 tp2 tp3
π£π΄πππππ = π΄
π£π΅πππππ = π΅
π£πΆπππππ = πΆ
πππππ = πππππ
πππππ = ππππ€π
πππππ = πππππ πππππ = ππππ€π
? π£ππ π π‘π
? π΄ ? π΄ β π΄, ? π΅ β π΅ π‘π1
? π΅ ?π΅ β π΄, ? πΆ β π΅ π‘π3
? π΄ ? π΄ β π΄, ? π΅ β π΅ π‘π2
? π΄ ? π΄ β π΄, ? π΅ β πΆ π‘π2
π·πππ ?π΅ , ? π΅ β π΄,β¦ , π‘π1 ππ₯ππ π‘ ππ π£π΄ β ππ!
Validation of Local Match Sets:
BGP Matching - Example
Superstep 2
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 22
SELECT * WHERE {
?A knows ?B . ?A likes ?B . ?B knows ?C
}
tp1 tp2 tp3
π£π΄πππππ = π΄
π£π΅πππππ = π΅
π£πΆπππππ = πΆ
πππππ = πππππ
πππππ = ππππ€π
πππππ = πππππ πππππ = ππππ€π
? π£ππ π π‘π
? π΄ ? π΄ β π΄, ? π΅ β π΅ π‘π1
? π΅ ?π΅ β π΄, ? πΆ β π΅ π‘π3
? π΄ ? π΄ β π΄, ? π΅ β π΅ π‘π2
? π΄ ? π΄ β π΄, ? π΅ β πΆ π‘π2 π·πππ ?π΄ , ? π΄ β π΄, ?π΅ β πΆ , π‘π1 ππ₯ππ π‘ ππ π£π΄ β ππ!
Validation of Local Match Sets:
π·πππ ?π΅ , ? π΅ β π΄,β¦ , π‘π1 ππ₯ππ π‘ ππ π£π΄ β ππ!
BGP Matching - Example
Superstep 2
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 23
SELECT * WHERE {
?A knows ?B . ?A likes ?B . ?B knows ?C
}
tp1 tp2 tp3
π£π΄πππππ = π΄
π£π΅πππππ = π΅
π£πΆπππππ = πΆ
πππππ = πππππ
πππππ = ππππ€π
πππππ = πππππ πππππ = ππππ€π
? π£ππ π π‘π
? π΄ ? π΄ β π΄, ? π΅ β π΅ π‘π1
? π΅ ?π΅ β π΄, ? πΆ β π΅ π‘π3
? π΄ ? π΄ β π΄, ? π΅ β π΅ π‘π2
? π΄ ? π΄ β π΄, ? π΅ β πΆ π‘π2
? π£ππ π π‘π
?π΅ ?π΄ β π΄, ? π΅ β π΅ π‘π1
? πΆ ?π΅ β π΄, ? πΆ β π΅ π‘π3
? π΅ ?π΄ β π΄, ? π΅ β π΅ π‘π2
? π΄ ? π΄ β π΅, ? π΅ β πΆ π‘π1
? π΅ ?π΅ β π΅, ? πΆ β πΆ π‘π3
? π£ππ π π‘π
?π΅ ?π΄ β π΄, ? π΅ β πΆ π‘π2
? π΅ ? π΄ β π΅, ? π΅ β πΆ π‘π1
? πΆ ?π΅ β π΅, ? πΆ β πΆ π‘π3
Removed!
BGP Matching - Example
Superstep 2
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 24
SELECT * WHERE {
?A knows ?B . ?A likes ?B . ?B knows ?C
}
tp1 tp2 tp3
π£π΄πππππ = π΄
π£π΅πππππ = π΅
π£πΆπππππ = πΆ
πππππ = πππππ
πππππ = ππππ€π
πππππ = πππππ πππππ = ππππ€π
? π£ππ π π‘π
? π΄ ? π΄ β π΄, ? π΅ β π΅ π‘π1
? π΄ ? π΄ β π΄, ? π΅ β π΅ π‘π2
? π£ππ π π‘π
?π΅ ?π΄ β π΄, ? π΅ β π΅ π‘π1
? πΆ ?π΅ β π΄, ? πΆ β π΅ π‘π3
? π΅ ?π΄ β π΄, ? π΅ β π΅ π‘π2
? π΅ ?π΅ β π΅, ? πΆ β πΆ π‘π3
? π£ππ π π‘π
? πΆ ?π΅ β π΅, ? πΆ β πΆ π‘π3
Exchange
Exchange
BGP Matching - Example
Superstep 3
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 25
SELECT * WHERE {
?A knows ?B . ?A likes ?B . ?B knows ?C
}
tp1 tp2 tp3
π£π΄πππππ = π΄
π£π΅πππππ = π΅
π£πΆπππππ = πΆ
πππππ = πππππ
πππππ = ππππ€π
πππππ = πππππ πππππ = ππππ€π
? π£ππ π π‘π
? π΄ ? π΄ β π΄, ? π΅ β π΅ π‘π1
? π΄ ? π΄ β π΄, ? π΅ β π΅ π‘π2
? π£ππ π π‘π
?π΅ ?π΄ β π΄, ? π΅ β π΅ π‘π1
? πΆ ?π΅ β π΄, ? πΆ β π΅ π‘π3
? π΅ ?π΄ β π΄, ? π΅ β π΅ π‘π2
? π΅ ?π΅ β π΅, ? πΆ β πΆ π‘π3
π·πππ ?π΅ , ?π΅ β π΄, ? πΆ β π΅ , π‘π3ππ₯ππ π‘ ππ π£π΄ β ππ!
Validation of
Remote Match Sets:
BGP Matching - Example
Superstep 3
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 26
SELECT * WHERE {
?A knows ?B . ?A likes ?B . ?B knows ?C
}
tp1 tp2 tp3
π£π΄πππππ = π΄
π£π΅πππππ = π΅
π£πΆπππππ = πΆ
πππππ = πππππ
πππππ = ππππ€π
πππππ = πππππ πππππ = ππππ€π
? π£ππ π π‘π
? π΄ ? π΄ β π΄, ? π΅ β π΅ π‘π1
? π΄ ? π΄ β π΄, ? π΅ β π΅ π‘π2
? π£ππ π π‘π
?π΅ ?π΄ β π΄, ? π΅ β π΅ π‘π1
? πΆ ?π΅ β π΄, ? πΆ β π΅ π‘π3
? π΅ ?π΄ β π΄, ? π΅ β π΅ π‘π2
? π΅ ?π΅ β π΅, ? πΆ β πΆ π‘π3
? π£ππ π π‘π
? πΆ ?π΅ β π΅, ? πΆ β πΆ π‘π3
Removed!No Changes! No Changes!
BGP Matching - Example
Superstep 3
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 27
SELECT * WHERE {
?A knows ?B . ?A likes ?B . ?B knows ?C
}
tp1 tp2 tp3
π£π΄πππππ = π΄
π£π΅πππππ = π΅
π£πΆπππππ = πΆ
πππππ = πππππ
πππππ = ππππ€π
πππππ = πππππ πππππ = ππππ€π
? π£ππ π π‘π
? π΄ ? π΄ β π΄, ? π΅ β π΅ π‘π1
? π΄ ? π΄ β π΄, ? π΅ β π΅ π‘π2
? π£ππ π π‘π
?π΅ ?π΄ β π΄, ? π΅ β π΅ π‘π1
? π΅ ? π΄ β π΄, ? π΅ β π΅ π‘π2
? π΅ ?π΅ β π΅, ? πΆ β πΆ π‘π3
? π£ππ π π‘π
? πΆ ?π΅ β π΅, ? πΆ β πΆ π‘π3
Exchange
Exchange
BGP Matching - Example
Superstep 4
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 28
SELECT * WHERE {
?A knows ?B . ?A likes ?B . ?B knows ?C
}
tp1 tp2 tp3
π£π΄πππππ = π΄
π£π΅πππππ = π΅
π£πΆπππππ = πΆ
πππππ = πππππ
πππππ = ππππ€π
πππππ = πππππ πππππ = ππππ€π
? π£ππ π π‘π
? π΄ ? π΄ β π΄, ? π΅ β π΅ π‘π1
? π΄ ? π΄ β π΄, ? π΅ β π΅ π‘π2
? π£ππ π π‘π
?π΅ ?π΄ β π΄, ? π΅ β π΅ π‘π1
? π΅ ? π΄ β π΄, ? π΅ β π΅ π‘π2
? π΅ ?π΅ β π΅, ? πΆ β πΆ π‘π3
? π£ππ π π‘π
? πΆ ?π΅ β π΅, ? πΆ β πΆ π‘π3
No Changes! No Changes!
Experiments
Small Cluster with low-end configuration
β’ 10 machines (1 master, 9 worker), 2 disks each, Gigabit network
β’ 24 GB RAM for Spark, 60% assigned for graph caching(typical Hadoop nodes currently β₯ 256 GB RAM, β₯ 12 disks, β₯ 10 Gigabit network)
β’ CDH 5.3, Spark 1.2.0, Pig 0.12
WatDiv Benchmark [2]
β’ Combines e-commerce scenario with a kind of social network
β’ All different kinds of query shapes (star, linear, snowflake, complex)
β’ Scale Factors:
- 10 (10K users, ~0.1M vertices)
- 100 (100K users, ~1M vertices)
- 1000 (1M users, ~10M vertices)
Compared with PigSPARQL [3]
β’ SPARQL Query Processing Baseline for Hadoop
β’ Proven to be competitive to other Hadoop research approaches
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 29
Experiments
Geometric Mean per Query Group
β’ S2X outperforms PigSPARQL by up to an order of magnitude
β’ Runtime of S2X strongly depends on result size (better for small output)
β’ Runtime can be highly influenced by adjusting parameters of Spark/GraphX
β’ GraphX currently quite βmemory-heavyβ (early stage of development)
β’ Very good preliminary results but still much room for improvement in S2X
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 30
S L SF C
S2X 7,62 9,62 17,98 52,89
PigSPARQL 43,08 62,19 88,09 127,34
0
20
40
60
80
100
120
140
Overall Runtime
S L SF C
result 1,98 2,92 8,03 23,49
BGP 5,62 6,66 9,93 29,04
0
10
20
30
40
50
60
S2X - BGP Matching vs. Result Generation
Summary
S2X (SPARQL on Spark with GraphX)
β’ SPARQL Query Processor for Hadoop based on Spark GraphX
β’ Property Graph representation for RDF
β’ Combines graph-parallel and data-parallel computation
- graph-parallel: BGP matching
- data-parallel: BGP solution mapping generation + other SPARQL operators
Experiments
β’ Very good preliminary results
- Up to an order of magnitude performance improvement compared to a
state-of-the-art SPARQL Processor baseline for Hadoop
β’ GraphX is in an early stage of development
- Many improvements can be expected in future GraphX versions
β’ Many possible improvements for S2X
- Incremental exchange of match sets I/O reduction
- Optimization of graph partitioning avoid unbalanced workloads
- Incremental BGP matching for selective queries Reduction of Overestimation
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 31
Thank You
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 32
Thank you for your attention!
Questions?
References
[1] Alexander SchΓ€tzle, Martin Przyjaciel-Zablocki, Thorsten Berberich, Georg Lausen:
S2X: Graph-Parallel Querying of RDF with GraphX. In: Big-O(Q) 2015
[2] GΓΌnes Aluc, Olaf Hartig, M. Tamer Γzsu, Khuzaima Daudjee:
Diversified Stress Testing of RDF Data Management Systems. In: ISWC 2014, pp. 197-212
[3] Alexander SchΓ€tzle, Martin Przyjaciel-Zablocki, Thomas Hornung, Georg Lausen:
PigSPARQL: A SPARQL Query Processing Baseline for Big Data. In: ISWC 2013 Posters
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 33
Experiments
Star Queries
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 34
10 100 1000
result 0,70 1,60 6,87
BGP 1,93 3,73 24,66
0
5
10
15
20
25
30
35
S2X - BGP Matching vs. Result Generation
10 100 1000
S2X 2,63 5,33 31,53
PigSPARQL 41,43 41,28 46,74
0
5
10
15
20
25
30
35
40
45
50
Geometric Mean
Experiments
Linear Queries
β’ Strong variation in runtimes (min: ~14s, max: ~348s)
β’ For the long running queries, result generation takes more time than BGP
matching, e.g. 348s = (124s, 224s) many results
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 35
10 100 1000
S2X 2,21 5,23 77,25
PigSPARQL 61,45 61,23 63,93
0
10
20
30
40
50
60
70
80
Geometric Mean
10 100 1000
result 0,55 1,64 27,41
BGP 1,65 3,58 49,85
0
10
20
30
40
50
60
70
80
S2X - BGP Matching vs. Result Generation
Experiments
Snowflake Queries
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 36
10 100 1000
S2X 5,84 12,99 76,63
PigSPARQL 85,47 85,21 93,85
0
10
20
30
40
50
60
70
80
90
100
Geometric Mean
10 100 1000
result 2,58 6,20 32,39
BGP 3,26 6,79 44,25
0
10
20
30
40
50
60
70
80
S2X - BGP Matching vs. Result Generation
Experiments
Complex Queries
β’ Queries produce many results
for some queries, result generation takes more time than BGP matching
31.08.2015 S2X: Graph-Parallel Querying of RDF with GraphX 37
10 100 1000
S2X 17,53 41,17 204,96
PigSPARQL 116,13 123,76 143,66
0
30
60
90
120
150
180
210
Geometric Mean
10 100 1000
result 8,09 20,98 76,40
BGP 9,44 20,19 128,57
0
30
60
90
120
150
180
210
S2X - BGP Matching vs. Result Generation