
Map-Side Merge Joins for Scalable SPARQL BGP Processing

IEEE CloudCom 2013, Bristol, UK

Martin Przyjaciel-Zablocki, Alexander Schätzle, Eduard Skaley, Thomas Hornung, Georg Lausen

University of Freiburg, Databases & Information Systems

December 4th, 2013


Motivation

Front-End Architecture

Storage

Back-End Architecture




Motivation


Storage


Online

Real-time (~ ms), interactive (~ s)

Low-latency

Fast, simple queries

Accessing precomputed (offline) data

Avoid Joins (costly)

Offline

Batch processing (≥ h)

Scalability, Fault Tolerance

Long-running, complex queries

Joins between large datasets common

Sort-Merge Join

Used in many DBMSs due to stable & efficient performance

What we present in our paper:

◦ Sort-Merge Join for RDF data with MapReduce

◦ Support for n-way joins

◦ Support for cascaded execution

[Figure: two relations L and R, each sorted by join key, are merged into the join result L ⋈ R]
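As a rough illustration of the Sort-Merge Join idea on this slide, here is a minimal single-machine sketch in Python. It assumes both inputs are lists of `(join_key, payload)` tuples already sorted by join key; the row layout and the names `merge_join`, `L` and `R` are hypothetical and not taken from the paper.

```python
from typing import Iterator, List, Tuple

Row = Tuple[str, str]  # (join_key, payload); hypothetical row layout


def merge_join(left: List[Row], right: List[Row]) -> Iterator[Tuple[str, str, str]]:
    """Join two lists that are already sorted by join key (first field)."""
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1          # left key too small, advance left cursor
        elif lk > rk:
            j += 1          # right key too small, advance right cursor
        else:
            # Find the end of the run of equal keys on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Emit the cross product of the two runs.
            for _, lv in left[i:i_end]:
                for _, rv in right[j:j_end]:
                    yield (lk, lv, rv)
            i, j = i_end, j_end


# Toy inputs, both sorted by join key.
L = [("a", "l1"), ("a", "l2"), ("b", "l3"), ("d", "l4")]
R = [("a", "r1"), ("b", "r2"), ("b", "r3"), ("c", "r4")]
result = list(merge_join(L, R))
```

Each input is scanned once from front to back, which is why the approach depends on both sides being sorted by the join key beforehand.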

Sort-Merge Joins with MapReduce

Approach:

◦ Divide Join task into N independent subtasks

◦ Each subtask is processed by exactly one machine

[Figure: L and R, each sorted by join key, are split into N range partitions L1 … LN and R1 … RN with matching key ranges [a … c), [c … e), …, [x … z]; each pair Li ⋈ Ri is joined independently]

Challenges:

◦ Datasets must be sorted according to the join key





◦ Datasets must be split into N partitions






◦ Same join key values must be processed on the same machine, i.e. they must be in the same partition




◦ Cascaded joins: Output must be sorted & partitioned according to the requirements of the following join
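The partitioning requirements listed above can be sketched as follows. This is a minimal single-process illustration, not the paper's MapReduce implementation: the split points in `BOUNDARIES`, the `(join_key, payload)` row layout and the function names are assumptions made for the example. Both inputs are split with the same key ranges, so equal join keys always land in the partition with the same index, and each pair of partitions can be joined independently.

```python
from bisect import bisect_right
from typing import List, Tuple

Row = Tuple[str, str]  # (join_key, payload); hypothetical row layout


def range_partition(rows: List[Row], boundaries: List[str]) -> List[List[Row]]:
    """Split rows into len(boundaries) + 1 key ranges.

    Partition i holds keys in [boundaries[i-1], boundaries[i]); input order
    is preserved, so sorted input yields sorted partitions.
    """
    parts: List[List[Row]] = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        parts[bisect_right(boundaries, row[0])].append(row)
    return parts


def merge_join(left: List[Row], right: List[Row]) -> List[Tuple[str, str, str]]:
    """Merge join of two inputs that are sorted by join key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            key, j_start = left[i][0], j
            while i < len(left) and left[i][0] == key:
                j = j_start  # rescan the matching run on the right side
                while j < len(right) and right[j][0] == key:
                    out.append((key, left[i][1], right[j][1]))
                    j += 1
                i += 1
    return out


BOUNDARIES = ["c", "e"]  # assumed split points: [.. c), [c .. e), [e ..]
L = [("a", "l1"), ("c", "l2"), ("d", "l3"), ("f", "l4")]
R = [("a", "r1"), ("c", "r2"), ("c", "r3"), ("e", "r4"), ("f", "r5")]

l_parts = range_partition(L, BOUNDARIES)
r_parts = range_partition(R, BOUNDARIES)

# Each (Li, Ri) pair is an independent subtask; here they run sequentially.
partitioned = [t for lp, rp in zip(l_parts, r_parts) for t in merge_join(lp, rp)]
```

Because the partitions cover disjoint, ordered key ranges, concatenating the per-partition results gives the same output as one global merge join over the unpartitioned inputs.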




Main Contributions:

(1) RDF data store that guarantees that one side is already presorted and partitioned, so we only have to sort & split the output

(2) Merge Join is processed in the Map phase; the Reduce phase is used to postprocess the output for the next join

(3) Dynamic Bloom Filter to remove dangling intermediate results
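To illustrate contribution (3), the sketch below shows how a Bloom filter can discard intermediate results whose join key cannot match anything in the next join. It is a plain fixed-size Bloom filter, not the dynamic variant named on the slide, and the class, its parameters (`num_bits`, `num_hashes`) and the sample data are assumptions made for the example.

```python
import hashlib
from typing import Iterable


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, rare false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str) -> Iterable[int]:
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


# Join keys that occur on the other side of the next join (toy data).
next_join_keys = ["b", "c"]
bf = BloomFilter()
for k in next_join_keys:
    bf.add(k)

# Intermediate results as (join_key, payload); rows whose key is certainly
# absent from the next join input are dropped before they are written out.
intermediate = [("a", "x"), ("b", "y"), ("c", "z"), ("d", "w")]
kept = [row for row in intermediate if row[0] in bf]
```

A Bloom filter never drops a row that would have joined; it may occasionally keep a dangling row (a false positive), which the following join then eliminates anyway.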




Experiments

Linear scaling of query runtimes

Average performance benefit of Merge Joins compared to the other approaches: 15 – 48 %


Thank you


Contact information:

Martin Przyjaciel-Zablocki, Alexander Schätzle

zablocki|[email protected]

University of Freiburg

Discussion Slides

1. RDF Data Store Layout

2. Map-Side Merge Join

3. Experiments



RDF Data Store Layout (1)






Map-Side Merge Join (1)


Map-Side Merge Join (2)


Experiments (1)


Experiments (2)


Experiments (3)


Experiments (4)