Map-Side Merge Joins for Scalable SPARQL BGP Processing
IEEE CloudCom 2013Bristol, UK
Martin Przyjaciel-Zablocki Alexander SchätzleEduard SkaleyThomas Hornung Georg Lausen
University of FreiburgDatabases & Information Systems
December 4th, 2013
Map-Side Merge Joins for Scalable SPARQL BGP Processing 2
Motivation
Front-EndArchitecture
Storage
Back-EndArchitecture
Map-Side Merge Joins for Scalable SPARQL BGP Processing 3
Motivation
Front-EndArchitecture
Storage
Back-EndArchitecture
Online
Real-time (~ ms), interactive (~ s)
Low-latency
Fast, simple queries
Accessing precomputed (offline) data
Avoid Joins (costly)
Map-Side Merge Joins for Scalable SPARQL BGP Processing 4
Motivation
Front-EndArchitecture
Storage
Back-EndArchitecture
Online
Off
line
Real-time (~ ms), interactive (~ s)
Low-latency
Fast, simple queries
Accessing precomputed (offline) data
Avoid Joins (costly)
Batch processing (≥ h)
Scalability, Fault Tolerance
Long-running, complex queries
Joins between large datasets common
Used in many DBMS due to stable & efficient performance
What we present in our paper:
◦ Sort-Merge Join for RDF data with MapReduce
◦ Support for n-way joins
◦ Support for cascaded execution
Map-Side Merge Joins for Scalable SPARQL BGP Processing 5
Sort-Merge Join
L
a a
a b
b a
b b
c a
c b
...
sorted
by jo
in key
R
sorted
by jo
in key
L R
a b
a c
b a
b c
c c
c d
...
Approach:
◦ Divide Join task into N independent subtasks
◦ Each subtask is processed by exactly one machine
Map-Side Merge Joins for Scalable SPARQL BGP Processing 6
Sort-Merge Joins with MapReduce
L1[a … c)
L2
[c … e)
…
LN[x … z]
sorted
by jo
in key
R1
[a … c)
R2[c … e)
…RN
[x … z]
sorted
by jo
in key
L1 R1
L2 R2
LN RN
Challenges:
◦ Datasets must be sorted according to the join key
Map-Side Merge Joins for Scalable SPARQL BGP Processing 7
Sort-Merge Joins with MapReduce
L1[a … c)
L2
[c … e)
…
LN[x … z]
sorted
by jo
in key
R1
[a … c)
R2[c … e)
…RN
[x … z]
sorted
by jo
in key
L1 R1
L2 R2
LN RN
Challenges:
◦ Datasets must be sorted according to the join key
◦ Datasets must be split into N partitions
Map-Side Merge Joins for Scalable SPARQL BGP Processing 8
Sort-Merge Joins with MapReduce
L1[a … c)
L2
[c … e)
…
LN[x … z]
sorted
by jo
in key
R1
[a … c)
R2[c … e)
…RN
[x … z]
sorted
by jo
in key
L1 R1
L2 R2
LN RN
Challenges:
◦ Datasets must be sorted according to the join key
◦ Datasets must be split into N partitions
◦ Same join key values must be processed on the same machine i.e. they must be in the same partition
Map-Side Merge Joins for Scalable SPARQL BGP Processing 9
Sort-Merge Joins with MapReduce
L1
L2
[c … e)
…
LN[x … z]
sorted
by jo
in key
R1
R2[c … e)
…RN
[x … z]
sorted
by jo
in key
L1 R1
L2 R2
LN RN
[a … c) [a … c)
Challenges:
◦ Datasets must be sorted according to the join key
◦ Datasets must be split into N partitions
◦ Same join key values must be processed on the same machine i.e. they must be in the same partition
◦ Cascaded joins: Output must be sorted & partitioned according to the requirements of the following join
Map-Side Merge Joins for Scalable SPARQL BGP Processing 10
Sort-Merge Joins with MapReduce
L1[a … c)
L2
[c … e)
…
LN[x … z]
sorted
by jo
in key
R1
[a … c)
R2[c … e)
…RN
[x … z]
sorted
by jo
in key
L1 R1
L2 R2
LN RN
Main Contributions:
(1) RDF data store that guarantees that one side is already presortedand partitioned we only have to sort & split the output
(2) Merge Join is processed in the Map phase Reduce phase used to postprocess the output for the next join
(3) Dynamic Bloom Filter to remove dangling intermediate results
Map-Side Merge Joins for Scalable SPARQL BGP Processing 11
Sort-Merge Joins with MapReduce
L1[a … c)
L2
[c … e)
…
LN[x … z]
sorted
by jo
in key
R1
[a … c)
R2[c … e)
…RN
[x … z]
sorted
by jo
in key
L1 R1
L2 R2
LN RN
Linear scaling of query runtimes
Average performance benefit of Merge Joins compared to theother approaches: 15 – 48 %
Map-Side Merge Joins for Scalable SPARQL BGP Processing 12
Experiments
Map-Side Merge Joins for Scalable SPARQL BGP Processing
Thank you
13
Contact information:
Martin Przyjaciel-ZablockiAlexander Schätzlezablocki|[email protected] of Freiburg
Discussion Slides1. RDF Data Store Layout
2. Map-Side Merge Join
3. Experiments
Map-Side Merge Joins for Scalable SPARQL BGP Processing 14