Bounds for Overlapping Interval Join on MapReduce
Foto N. Afrati 1, Shlomi Dolev 2, Shantanu Sharma 2, and Jeffrey D. Ullman 3
1 National Technical University of Athens, Greece
2 Ben-Gurion University of the Negev, Israel
3 Stanford University, USA
2nd Algorithms and Systems for MapReduce and Beyond (BeyondMR), Brussels, Belgium (27 March 2015)



Outline

• Introduction
– Interval and Overlapping Intervals
– Interval Join
– Reducer capacity and Mapping Schema

• Goal of Mapping Schema and Our Contribution

• Unit-Length and Equally-Spaced Intervals

• Variable-Length and Equally-Spaced Intervals

• Conclusion


Introduction

• Interval
– A pair [starting time, ending time]
– A (time) interval, i, is represented by a pair of times $[T_s^i, T_e^i]$, $T_s^i < T_e^i$, where $T_s^i$ and $T_e^i$ are the starting-point and the ending-point of the interval i, respectively
– Example:
• My talk, a phase of a project, a class of a professor

[Figure: the interval "Talk", with starting-point 10am and ending-point 10:30am]


Introduction

• Overlapping Intervals
– Two intervals, say interval i and interval j, are called overlapping intervals if their intersection is nonempty

[Figure: a pair of non-overlapping intervals i and j, and pairs of overlapping intervals; for example, Talk (10am to 10:35am) and Coffee break (10:30am to 11am) overlap]
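The overlap definition above can be sketched in a few lines of Python. This is only an illustration: the function name `overlaps`, the representation of an interval as a `(start, end)` tuple, and the choice of closed intervals (so that touching endpoints count as overlapping) are assumptions, not part of the slides. Times are written as minutes after midnight.

```python
def overlaps(i, j):
    """Return True if closed intervals i and j share at least one point.

    Each interval is a (start, end) tuple with start < end.
    Closed-interval semantics is an assumption made for this sketch.
    """
    start_i, end_i = i
    start_j, end_j = j
    # The intersection is [max of starts, min of ends]; it is nonempty
    # exactly when its left end does not exceed its right end.
    return max(start_i, start_j) <= min(end_i, end_j)


# Times in minutes after midnight (hypothetical encoding).
talk = (10 * 60, 10 * 60 + 35)          # 10am to 10:35am
coffee_break = (10 * 60 + 30, 11 * 60)  # 10:30am to 11am
late_session = (11 * 60 + 30, 12 * 60)  # made-up interval for contrast

print(overlaps(talk, coffee_break))   # the two intervals share 10:30 to 10:35
print(overlaps(talk, late_session))   # no common point
```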


Introduction

• Overlapping Interval Join: an example

Employee
EmpID  Name  Duration
       U     1-Apr – 1-June
       V     1-May – 1-July
       W     1-Apr – 1-July
       X     1-Mar – 1-June
       Y     1-Mar – 1-Aug

Project
Phase                       Duration
Requirement Analysis (RA)   1-Mar – 1-May
Design (D)                  1-Apr – 1-June
Coding (C)                  1-May – 1-Aug

[Figure: timeline from 1-Mar to 1-Aug showing the Project phases RA, D, and C and the Employee intervals]

Find all the employees who are involved in the RA phase of the project
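As a rough, single-machine illustration of this query (not the MapReduce algorithm itself), the join can be written as a nested loop over the two relations. The year 2015 is an assumption made only so that the dates can be constructed, closed intervals are assumed, and the names `first_of`, `overlaps`, and `interval_join` are hypothetical.

```python
from datetime import date

YEAR = 2015  # assumption: the slides give only day and month


def first_of(month):
    """First day of the given month in the assumed year."""
    return date(YEAR, month, 1)


employees = {
    "U": (first_of(4), first_of(6)),
    "V": (first_of(5), first_of(7)),
    "W": (first_of(4), first_of(7)),
    "X": (first_of(3), first_of(6)),
    "Y": (first_of(3), first_of(8)),
}
phases = {
    "RA": (first_of(3), first_of(5)),
    "D": (first_of(4), first_of(6)),
    "C": (first_of(5), first_of(8)),
}


def overlaps(a, b):
    """Closed-interval overlap test (touching endpoints count)."""
    return max(a[0], b[0]) <= min(a[1], b[1])


def interval_join(left, right):
    """All (left key, right key) pairs whose intervals overlap."""
    return [(lk, rk)
            for lk, lv in left.items()
            for rk, rv in right.items()
            if overlaps(lv, rv)]


pairs = interval_join(phases, employees)
ra_staff = [emp for phase, emp in pairs if phase == "RA"]
print(ra_staff)
```

Whether an employee whose interval only touches the end of the RA phase (V in this data) is reported depends on the closed-interval assumption.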


Introduction

• Reducer capacity
– An upper bound on the total number of intervals that are assigned to the reducer
– Example
• The reducer capacity can be the size of the main memory of the processors on which reducers run

• Communication cost
– Total amount of data to be transferred from the map phase to the reduce phase
– There is a tradeoff between the reducer capacity and the communication cost


Introduction

Mapping schema for interval join

An assignment of the set of intervals to some given reducers, such that

– Respect the reducer capacity
• The total number of intervals assigned to a reducer must be less than or equal to the reducer capacity

– Assignment of inputs
• For every output, the two corresponding overlapping intervals must be assigned to at least one reducer in common

[Figure: intervals I1, I2, and I3 assigned to reducers]
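The two conditions of a mapping schema can be checked mechanically. The following is a minimal sketch under stated assumptions: intervals are `(start, end)` tuples keyed by hypothetical identifiers, a reducer is modelled as a pair of identifier sets (one for X, one for Y), overlap uses closed intervals, and the function name `is_valid_schema` is made up for illustration.

```python
from itertools import product


def overlaps(a, b):
    """Closed-interval overlap test."""
    return max(a[0], b[0]) <= min(a[1], b[1])


def is_valid_schema(x, y, reducers, q):
    """Check both mapping-schema conditions.

    x, y:      dicts mapping an interval id to a (start, end) tuple
    reducers:  list of (set of X ids, set of Y ids), one pair per reducer
    q:         reducer capacity (maximum number of intervals per reducer)
    """
    # Condition 1: respect the reducer capacity.
    for x_ids, y_ids in reducers:
        if len(x_ids) + len(y_ids) > q:
            return False
    # Condition 2: every overlapping pair meets at some common reducer.
    for (i, xi), (j, yj) in product(x.items(), y.items()):
        if overlaps(xi, yj):
            if not any(i in x_ids and j in y_ids for x_ids, y_ids in reducers):
                return False
    return True


x = {"x1": (0.0, 1.0), "x2": (2.0, 3.0)}
y = {"y1": (0.5, 1.5), "y2": (2.5, 3.5)}
schema = [({"x1"}, {"y1"}), ({"x2"}, {"y2"})]
print(is_valid_schema(x, y, schema, q=2))
```

With this data the schema satisfies both conditions for q = 2, but fails the capacity condition for q = 1.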


State-of-the-Art

• B. Chawda, H. Gupta, S. Negi, T.A. Faruquie, L.V. Subramaniam, and M.K. Mohania, “Processing Interval Joins On Map-Reduce,” EDBT, 2014.

• MapReduce-based 2-way and multiway interval join algorithms for overlapping intervals

• Does not take the reducer capacity into account

• No analysis of a lower bound on replication of individual intervals

• No analysis of the replication rate of the algorithms offered therein


Goal of Mapping Schema

• Interval join problem
– Assign all the intervals that share at least one common point of time to at least one reducer in common for finding outputs


Our Contribution

• An algorithm for variable-length intervals that can start at any time
– Before this, we consider two simple cases and provide an algorithm for each:
• Unit-length and equally-spaced intervals
• Variable-length and equally-spaced intervals

• All the algorithms achieve an upper bound on the replication rate that almost matches the lower bound


Unit-Length and Equally-Spaced Intervals

• Relations X and Y of n intervals each
• No interval begins before 0 or after k
• Hence, the spacing between the starting points of two successive intervals = $k/n$ < 1

[Figure: intervals of X and Y on a time axis from 0 to 2.25 with ticks every 0.25; n = 9 and k = 2.25, so spacing = 0.25]


Unit-Length and Equally-Spaced Intervals: Algorithm

• Divide the time-range from 0 to k into equal-sized partitions of length w (say P partitions are created)
• Arrange P reducers
• Assign all intervals of X that exist in a partition $p_i$ to the ith reducer
• Assign all intervals of Y that have their starting-point or ending-point in partition $p_i$ to the ith reducer

[Figure: intervals of X and Y on a time axis from 0 to 2.25 (n = 9 and k = 2.25), divided into partitions 1 to 5]
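The four steps above can be sketched as follows for unit-length intervals. This is an illustrative reading of the slide, not the authors' implementation: "exists in a partition" is interpreted as "intersects the partition", time points beyond k are clamped into the last partition, intervals are identified by their starting points, w = 0.5 is an assumed value for the example, and the names `reducer_of` and `assign_unit_length` are hypothetical.

```python
import math
from collections import defaultdict


def reducer_of(t, w, num_reducers):
    """Index of the partition containing time t.

    Assumption: points at or beyond the end of the range are clamped
    into the last partition.
    """
    return min(int(t // w), num_reducers - 1)


def assign_unit_length(x_starts, y_starts, k, w):
    """Assign unit-length intervals [s, s + 1] to reducers.

    Returns a dict: reducer index -> {"X": set of starts, "Y": set of starts}.
    """
    num_reducers = math.ceil(k / w)  # P partitions, one reducer each
    reducers = defaultdict(lambda: {"X": set(), "Y": set()})
    # An interval of X goes to every partition it intersects.
    for s in x_starts:
        first = reducer_of(s, w, num_reducers)
        last = reducer_of(s + 1.0, w, num_reducers)
        for r in range(first, last + 1):
            reducers[r]["X"].add(s)
    # An interval of Y goes only to the partitions of its two endpoints.
    for s in y_starts:
        for t in (s, s + 1.0):
            reducers[reducer_of(t, w, num_reducers)]["Y"].add(s)
    return reducers


starts = [0.25 * i for i in range(9)]  # n = 9, spacing 0.25, as in the figure
reducers = assign_unit_length(starts, starts, k=2.25, w=0.5)
for r in sorted(reducers):
    print(r, len(reducers[r]["X"]), len(reducers[r]["Y"]))
```

Because all intervals have the same length here, an interval of X that overlaps an interval of Y must contain the starting point or the ending point of that Y interval, which is why sending Y only to its two endpoint partitions is enough in this case.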


Unit-Length and Equally-Spaced Intervals

• Does the algorithm work?

• Consider $q = \frac{3nw}{k} + \frac{n}{k} + 2$
• q: the reducer capacity
• w: length of a partition
• n: the total number of intervals in a relation
• k: the last starting point of an interval

• Count how many intervals lie in a partition; if there are at most q of them, then we have a solution and the algorithm works.


Unit-Length and Equally-Spaced Intervals

• Does the algorithm work?
– Count 1: How many intervals of Y overlap with an interval of X in a partition of length w?
• The spacing is k/n, that is, n/k intervals start per unit of time, so at most 2wn/k intervals of Y can overlap with an interval of X
– Count 2: How many intervals of X can start after the starting point of $x_i$ within the partition, and how many can start before it and still reach it?
• Intervals of X after the starting point of $x_i$ = wn/k
• Intervals of X before the starting point of $x_i$ = n/k
– Count 3: Do not forget to count $x_i$ itself and an identical interval of Y, i.e., $y_i$.

[Figure: intervals of X and Y on a time axis from 0 to 2.25 (n = 9 and k = 2.25), divided into partitions 1 to 5]


Unit-Length and Equally-Spaced Intervals

• Does the algorithm work?
– Total number of intervals in a partition
– Count 1 + Count 2 + Count 3 = $\frac{2nw}{k} + \left(\frac{nw}{k} + \frac{n}{k}\right) + 2 = \frac{3nw}{k} + \frac{n}{k} + 2 = q$
– OK. The algorithm works
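The three counts can be evaluated numerically for the running example. This is a small sketch; the function name `unit_length_capacity` is hypothetical, and w = 0.5 is an assumed partition length (it yields five partitions for k = 2.25).

```python
def unit_length_capacity(n, k, w):
    """Sum of Count 1, Count 2, and Count 3 for unit-length intervals.

    n: number of intervals in a relation
    k: length of the range of starting points
    w: length of a partition
    """
    density = n / k                      # intervals starting per unit of time
    count_y = 2 * w * density            # Count 1: intervals of Y
    count_x = w * density + density      # Count 2: intervals of X
    count_self = 2                       # Count 3: x_i and the identical y_i
    return count_y + count_x + count_self


q = unit_length_capacity(n=9, k=2.25, w=0.5)
print(q)  # equals 3nw/k + n/k + 2 for these values
```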


Variable-Length and Equally-Spaced Intervals

• Two types of intervals
– Big and small intervals
– Different length intervals


Variable-Length and Equally-Spaced Intervals

• Big and small intervals
– All the intervals of X are of length $l_{min}$
– All the intervals of Y are of length $l_{max}$
– The previous algorithm will work here too
– Note that an interval of X will be replicated to several reducers, while an interval of Y will be replicated to at most two reducers

[Figure: intervals of X and Y on a time axis from 0 to 4.2 with ticks every 0.7; n = 6 and spacing = 0.7]


Variable-Length and Equally-Spaced Intervals

• Variable-length intervals: A general case
– All the restrictions regarding the length of an interval and the spacing between two intervals are removed
– Intervals can begin at any time greater than or equal to 0 and end by time T
– S: the total length of intervals in one relation

[Figure: intervals of X and Y of various lengths on a time axis from 0 to T, with marks at s, s+1, s+2, and s+3]


Variable-Length and Equally-Spaced Intervals

• Variable-length intervals: A general case
– Algorithm
• Divide the time range into equal-sized partitions
• Arrange reducers
• Follow the same procedure as in the previous algorithm
– i.e., assign all the intervals of X that belong to the ith partition to the ith reducer, and assign each interval of Y to the reducers corresponding to its starting and ending points (at most two reducers)

[Figure: intervals of X and Y of various lengths on a time axis from 0 to T, with marks at s, s+1, s+2, and s+3]


Variable-Length and Equally-Spaced Intervals

• Variable-length intervals: A general case
– Does the algorithm work?
– Consider $q = \frac{3nw + S}{T}$
– Count the average number of intervals of X and Y sent to a reducer; if it is less than or equal to the reducer capacity, then the algorithm will work


Variable-Length and Equally-Spaced Intervals

• Variable-length intervals: A general case
– Count 1: Average number of intervals of Y received by a reducer
– An interval of Y is sent to at most 2 reducers (replication)
– There are $T/w$ reducers and n intervals in Y
• Average number of intervals of Y received by a reducer = $\frac{2n}{T/w} = \frac{2nw}{T}$


Variable-Length and Equally-Spaced Intervals

• Variable-length intervals: A general case
– Count 2: Average number of intervals of X received by a reducer
– Average length of intervals is S/n
– An interval of X is sent to at most $1 + \frac{S}{nw}$ reducers (the average length S/n divided by the length w that one reducer can hold, plus one)
– There are $T/w$ reducers and n intervals in X
• Average number of intervals of X received by a reducer = $\frac{n\,(1 + S/(nw))}{T/w} = \frac{nw + S}{T}$


Variable-Length and Equally-Spaced Intervals

• Variable-length intervals: A general case
– Does the algorithm work?
– Total number of intervals that a reducer receives = Count 1 + Count 2
$\frac{2nw}{T} + \frac{nw + S}{T} = \frac{3nw + S}{T} = q$
– The algorithm works
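The averages for the general case can be checked with a short calculation. This sketch only evaluates the counts from the slides; the function names `average_load` and `partition_length_for`, and the sample values of n, S, T, and w, are made up for illustration. The second function is a plain algebraic rearrangement of q = (3nw + S)/T for w, shown as an assumption about how a partition length could be picked for a given capacity.

```python
def average_load(n, total_length, horizon, w):
    """Average number of Y and X intervals per reducer.

    n:            number of intervals in each relation
    total_length: S, the total length of intervals in one relation
    horizon:      T, the end of the time range
    w:            length of a partition
    """
    num_reducers = horizon / w                  # T / w reducers
    avg_y = 2 * n / num_reducers                # each Y goes to at most 2 reducers
    copies_per_x = 1 + total_length / (n * w)   # 1 + S / (n w)
    avg_x = n * copies_per_x / num_reducers
    return avg_y, avg_x


def partition_length_for(q, n, total_length, horizon):
    """Solve q = (3 n w + S) / T for w (algebraic rearrangement)."""
    return (q * horizon - total_length) / (3 * n)


avg_y, avg_x = average_load(n=100, total_length=200.0, horizon=50.0, w=0.5)
q = avg_y + avg_x
print(avg_y, avg_x, q)
print(partition_length_for(q, n=100, total_length=200.0, horizon=50.0))
```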


Conclusion

• An investigation of good MapReduce algorithms for the problem of finding pairs of overlapping intervals

• Algorithms for:
– Unit-sized and equally-spaced intervals
• Lower bounds on the replication rate = 2 or 2q
• Upper bounds on the replication rate =
– Big-small and equally-spaced intervals
• Lower bounds on the replication rate = 2 or 2q
• Upper bounds on the replication rate =
– A general case for variable length intervals
• Upper bounds on the replication rate =

Proofs of lower and upper bounds on the replication rate are given in the paper


Foto Afrati 1, Shlomi Dolev 2, Shantanu Sharma 2, and Jeffrey D. Ullman 3

1 School of Electrical and Computing Engineering, National Technical University of Athens, Greece

2 Department of Computer Science, Ben-Gurion University of the Negev, Israel, {dolev,sharmas}@cs.bgu.ac.il

3 Department of Computer Science, Stanford University, USA

Presentation is available at http://www.cs.bgu.ac.il/~sharmas/publication.html