
Page 1: Distributed data processing on the Cloud –Lecture 6 Joins

Distributed data processing on the Cloud – Lecture 6

Joins with MapReduce and MapReduce limitations

Satish Srirama

Some material adapted from slides by Jimmy Lin, 2015 (licensed under Creative Commons Attribution 3.0 License)

Page 2: Distributed data processing on the Cloud –Lecture 6 Joins

Outline

• Structured data

– Processing relational data with MapReduce

• MapReduce limitations


Page 3: Distributed data processing on the Cloud –Lecture 6 Joins

Relational Databases

• A relational database consists of tables

• Each table represents a relation: a collection of tuples (rows)

• Each tuple consists of multiple fields (columns)


Page 4: Distributed data processing on the Cloud –Lecture 6 Joins

Basic SQL Commands - Queries

https://mariadb.com/kb/en/library/basic-sql-statements/


Page 5: Distributed data processing on the Cloud –Lecture 6 Joins

Design Pattern: Secondary Sorting

• MapReduce sorts input to reducers by key

– Values are arbitrarily ordered

• What if we also want to sort the values?

– E.g., k → (v1, r), (v3, r), (v4, r), (v8, r)…

• Use case:

– We are monitoring different sensors

– Data: (timestamp, sensor id, sensor value)

– Map outputs: sensorid → (timestamp, sValue)

– How to see the activity of each sensor over time?


Page 6: Distributed data processing on the Cloud –Lecture 6 Joins

Secondary Sorting: Solutions

• Solution 1:

– Buffer values in memory, then sort

– Why is this a bad idea?

• Solution 2:

– “Value-to-key conversion” design pattern: form a composite intermediate key, (k, v1)

– Let the execution framework do the sorting

– Preserve state across multiple key-value pairs to handle processing

– Anything else we need to do?


Page 7: Distributed data processing on the Cloud –Lecture 6 Joins

Value-to-Key Conversion

Before:

k → (v1, r), (v4, r), (v8, r), (v3, r)…

Values arrive in arbitrary order…

After:

(k, v1) → (v1, r)
(k, v3) → (v3, r)
(k, v4) → (v4, r)
(k, v8) → (v8, r)

Values arrive in sorted order… Process by preserving state across multiple keys. Remember to partition correctly!
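To make this concrete, here is a minimal Python sketch of value-to-key conversion (not actual Hadoop code; the sensor records are invented for illustration). The map step builds composite (sensor_id, timestamp) keys, and partitioning on sensor_id alone keeps each sensor's readings on one reducer, where they now arrive time-ordered:

```python
# A minimal Python sketch of value-to-key conversion (not actual Hadoop code;
# the sensor records below are invented for illustration).

records = [  # (timestamp, sensor_id, sensor_value)
    (3, "s1", 0.7), (1, "s1", 0.2), (2, "s2", 0.9), (1, "s2", 0.4),
]

# Map: emit a composite key (sensor_id, timestamp); the framework then
# sorts by the full composite key.
mapped = [((sid, ts), val) for (ts, sid, val) in records]

# Partition on sensor_id only, so all records of one sensor reach the
# same reducer; here a plain sort simulates the shuffle's sorting.
mapped.sort(key=lambda kv: kv[0])

current = None
for (sid, ts), val in mapped:
    if sid != current:              # new sensor: reset per-sensor state
        current = sid
        print(f"sensor {sid}:")
    print(f"  t={ts} value={val}")  # values now arrive in time order
```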


Page 8: Distributed data processing on the Cloud –Lecture 6 Joins

Working Scenario

• Two tables:

– User demographics (gender, age, income, etc.)

– User page visits (URL, time spent, etc.)

• Analyses we might want to perform:

– Statistics on demographic characteristics

– Statistics on page visits

– Statistics on page visits by URL

– Statistics on page visits by demographic characteristic

– …


Page 9: Distributed data processing on the Cloud –Lecture 6 Joins

Relational Algebra

• Primitives

– Projection (π)

– Selection (σ)

– Cartesian product (×)

– Set union (∪)

– Set difference (−)

– Rename (ρ)

• Other operations

– Join (⋈)

– Group by… aggregation

– …


Page 10: Distributed data processing on the Cloud –Lecture 6 Joins

Projection (π)

[Figure: each of the tuples R1–R5 maps to a projected version of itself with fewer attributes]


Page 11: Distributed data processing on the Cloud –Lecture 6 Joins

Projection in MapReduce

• Easy! (see the sketch below)

– Map over tuples, emit new tuples with the appropriate attributes

– No reducers, unless for regrouping or resorting tuples

– Alternatively: perform in the reducer, after some other processing

• Basically limited by HDFS streaming speeds

– The speed of encoding/decoding tuples becomes important

– Take advantage of compression when available

– Semistructured data? No problem!
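As a rough illustration, a projection mapper in Hadoop Streaming style fits in a few lines of Python; the tab-separated format and the column indices are assumptions for the example:

```python
# A minimal, Hadoop-Streaming-style projection mapper in Python (a sketch;
# the tab-separated format and column indices are assumptions).

import sys

WANTED = [0, 2]  # hypothetical indices of the columns to keep

for line in sys.stdin:          # Streaming feeds input splits via stdin
    fields = line.rstrip("\n").split("\t")
    print("\t".join(fields[i] for i in WANTED))  # emit the projected tuple
```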


Page 12: Distributed data processing on the Cloud –Lecture 6 Joins

Selection (σ)

[Figure: tuples R1–R5 are filtered; only R1 and R3 satisfy the predicate]


Page 13: Distributed data processing on the Cloud –Lecture 6 Joins

Selection in MapReduce

• Easy! (see the sketch below)

– Map over tuples, emit only the tuples that meet the selection criteria

– No reducers, unless for regrouping or resorting tuples

– Alternatively: perform in reducer, after some other processing

• Basically limited by HDFS streaming speeds

– Speed of encoding/decoding tuples becomes important

– Take advantage of compression when available

– Semistructured data? No problem!
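A selection mapper looks almost identical to the projection one; a minimal sketch, assuming a tab-separated time column and an arbitrary threshold:

```python
# A minimal, Hadoop-Streaming-style selection mapper in Python (a sketch;
# the column index and threshold are assumptions).

import sys

TIME_COL = 2       # hypothetical position of the time field
THRESHOLD = 60.0   # emit only visits longer than this

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if float(fields[TIME_COL]) > THRESHOLD:  # the selection predicate σ
        print(line, end="")                  # emit the tuple unchanged
```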


Page 14: Distributed data processing on the Cloud –Lecture 6 Joins

Group by… Aggregation

• Example: What is the average time spent per URL?

• In SQL:

– SELECT url, AVG(time) FROM visits GROUP BY url

• In MapReduce (sketched below):

– Map over tuples, emit time, keyed by url

– Framework automatically groups values by keys

– Compute average in reducer

– Optimize with combiners
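A minimal Python sketch of this job, with a dictionary standing in for the shuffle phase (the visit data is invented):

```python
# A minimal Python sketch of the average-time-per-URL job (not Hadoop code;
# the visit data is invented). The dict stands in for the shuffle phase.

from collections import defaultdict

visits = [("a.com", 10), ("b.com", 30), ("a.com", 20), ("b.com", 50)]

# Map: emit time keyed by url; the shuffle groups values per key.
groups = defaultdict(list)
for url, time in visits:
    groups[url].append(time)

# Reduce: compute the average per url. Note a combiner cannot emit partial
# averages (they do not compose); it would emit (sum, count) pairs instead.
for url, times in groups.items():
    print(url, sum(times) / len(times))
```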


Page 15: Distributed data processing on the Cloud –Lecture 6 Joins

Relational Joins

[Figure: tuples R1–R4 and S1–S4 are joined on matching keys, producing the pairs R1·S2, R2·S4, R3·S1, R4·S3]


Page 16: Distributed data processing on the Cloud –Lecture 6 Joins

Working Scenario – revisited

• Two tables:

– User demographics (userid, gender, age, income, etc.)

– User page visits (userid, URL, time spent, etc.)

• Analyses we might want to perform:

– Statistics on demographic characteristics

– Statistics on page visits

– Statistics on page visits by URL

– Statistics on page visits by demographic characteristic

– …


Page 17: Distributed data processing on the Cloud –Lecture 6 Joins

Types of Relationships

One-to-One One-to-Many Many-to-Many


Page 18: Distributed data processing on the Cloud –Lecture 6 Joins

Join Algorithms in MapReduce

• Reduce-side join

• Map-side join

• In-memory join

– Striped variant

– Memcached variant


Page 19: Distributed data processing on the Cloud –Lecture 6 Joins

Reduce-side Join

• Basic idea: group by join key (see the sketch below)

– Map over both sets of tuples

– Emit each tuple as the value, with its join key as the intermediate key

– The execution framework brings together tuples sharing the same key

– Perform the actual join in the reducer

– Similar to a “sort-merge join” in database terminology

• Two variants

– 1-to-1 joins

– 1-to-many and many-to-many joins
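The following Python sketch mimics the reduce-side join logic (it is not Hadoop code; the sample tables are invented). Tuples are tagged with their source relation, grouped by join key, and crossed in the “reducer”:

```python
# A minimal Python sketch of a reduce-side join (not actual Hadoop code;
# the tables are invented). Each record is tagged with its source relation,
# and a dict simulates the shuffle that groups tuples by the join key.

from collections import defaultdict

R = [("u1", "age=25"), ("u2", "age=31")]                  # demographics
S = [("u1", "a.com"), ("u1", "b.com"), ("u2", "c.com")]   # page visits

groups = defaultdict(lambda: {"R": [], "S": []})
for k, v in R:
    groups[k]["R"].append(v)   # map over R: emit (join key, tagged tuple)
for k, v in S:
    groups[k]["S"].append(v)   # map over S: emit (join key, tagged tuple)

# Reduce: cross the R tuples with the S tuples that share the key.
# (This buffering per key is exactly the many-to-many scalability issue.)
for k, sides in groups.items():
    for r in sides["R"]:
        for s in sides["S"]:
            print(k, r, s)
```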


Page 20: Distributed data processing on the Cloud –Lecture 6 Joins

Reduce-side Join: 1-to-1

[Figure: the mappers emit each of R1, R4, S2, S3 keyed by its join key; after the shuffle, the reducer sees the matching R and S tuples under the same key]

Note: there is no guarantee whether R or S will come first at the reducer


Page 21: Distributed data processing on the Cloud –Lecture 6 Joins

Reduce-side Join: 1-to-many

[Figure: for one join key, the mapper output contains a single R1 and many S tuples (S2, S3, S9, …); the reducer pairs R1 with each of them]


Page 22: Distributed data processing on the Cloud –Lecture 6 Joins

Reduce-side Join: V-to-K Conversion

[Figure: with value-to-key conversion, the R tuple sorts first under each join key. In the reducer: a new key is encountered → hold R1 in memory, then cross it with the S records that follow (S2, S3, S9); next key → hold R4, cross with S3, S7]


Page 23: Distributed data processing on the Cloud –Lecture 6 Joins

Reduce-side Join: many-to-many

[Figure: for one join key with multiple R tuples (R1, R5, R8) and multiple S tuples (S2, S3, S9), the reducer holds all R tuples in memory and crosses them with the S records as they stream in]


Page 24: Distributed data processing on the Cloud –Lecture 6 Joins

Problems with Reduce-side join

• In summary: the many-to-many case still has a scalability issue, since all of one relation’s tuples must be buffered in memory per join key

• Another big problem:

– The basic idea behind the reduce-side join is to repartition the two datasets by the join key

– This requires shuffling both datasets across the network


Page 25: Distributed data processing on the Cloud –Lecture 6 Joins

Map-side Join: Basic Idea

Assume the two datasets are sorted by the join key:

[Figure: R1–R4 and S1–S4 shown side by side, aligned pairwise by key]

A sequential scan through both datasets performs the join (called a “merge join” in database terminology)


Page 26: Distributed data processing on the Cloud –Lecture 6 Joins

Map-side Join: Parallel Scans

• If the datasets are sorted by the join key, the join can be accomplished by a scan over both datasets

• How can we accomplish this in parallel?

– Partition and sort both datasets in the same manner

– Example:

• Suppose R and S were both divided into ten files, partitioned in the same manner by the join key

• We simply need to merge join the first file of R with the first file of S, the second file of R with the second of S, etc.

• In MapReduce (the merge step is sketched below):

– Map over one dataset, reading from the other’s corresponding partition

– No reducers necessary (unless to repartition or resort)

• Consistently partitioned datasets: realistic to expect?

– Yes. Most MapReduce jobs are part of a workflow…


Page 27: Distributed data processing on the Cloud –Lecture 6 Joins

In-Memory Join

• Basic idea: load one dataset into memory, stream over the other dataset

– Works if R << S and R fits into memory

– Called a “hash join” in database terminology

• MapReduce implementation (sketched below)

– Distribute R to all nodes

– Map over S, each mapper loads R in memory, hashed by join key

– For every tuple in S, look up join key in R

– No reducers, unless for regrouping or resorting tuples
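A minimal Python sketch of the hash-join logic (the tables are invented; in Hadoop, R would be shipped to every node, e.g. via the distributed cache, and loaded in each mapper’s setup):

```python
# A minimal Python sketch of the hash-join logic (the tables are invented;
# in Hadoop, R would be distributed to every node and loaded in each
# mapper's setup phase).

R = [("u1", "age=25"), ("u2", "age=31")]                 # small relation
S = [("u1", "a.com"), ("u2", "c.com"), ("u1", "b.com")]  # large relation

# Build the in-memory hash table over R, keyed by the join key.
r_table = {}
for k, v in R:
    r_table.setdefault(k, []).append(v)

# Map over S: probe the table for every tuple; no reducer needed.
for k, s in S:
    for r in r_table.get(k, []):
        print(k, r, s)
```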


Page 28: Distributed data processing on the Cloud –Lecture 6 Joins

In-Memory Join: Variants

• Striped variant (sketched after this list):

– What if R is too big to fit into memory?

– Divide R into R1, R2, R3, … s.t. each Rn fits into memory

– Perform the in-memory join: ∀n, Rn ⋈ S

– Take the union of all join results

• Memcached join:

– Load R into memcached

– Replace the in-memory hash lookup with a memcached lookup

– Memcached capacity >> RAM of an individual node

– Memcached scales out with the cluster

– Memcached is fast (basically, the speed of the network)

– Batch requests to amortize the latency costs
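A sketch of the striped idea in Python, under the assumption that only a small chunk of R fits in memory at once (all names and data here are invented for illustration):

```python
# A minimal Python sketch of the striped variant (illustrative): R is split
# into chunks that each fit in memory; each chunk is hash-joined with S and
# the results are unioned.

def hash_join(r_chunk, s):
    table = {}
    for k, v in r_chunk:
        table.setdefault(k, []).append(v)
    return [(k, r, val) for k, val in s for r in table.get(k, [])]

R = [(i, f"r{i}") for i in range(6)]          # invented relations
S = [(i % 6, f"s{i}") for i in range(10)]
CHUNK = 2   # pretend only two R tuples fit in memory at once

results = []
for start in range(0, len(R), CHUNK):
    results.extend(hash_join(R[start:start + CHUNK], S))   # Rn ⋈ S
print(results)   # the union of all per-chunk join results
```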


Page 29: Distributed data processing on the Cloud –Lecture 6 Joins

Which join to use?

• In-memory join > map-side join > reduce-side join

– Why?

• Limitations of each?

– In-memory join: memory

– Map-side join: sort order and partitioning

– Reduce-side join: none in particular; it is the general-purpose option


Page 30: Distributed data processing on the Cloud –Lecture 6 Joins

Processing Relational Data: Summary

• MapReduce algorithms for processing relational data:

– Group by, sorting, partitioning are handled automatically by shuffle/sort in MapReduce

– Selection, projection, and other computations (e.g., aggregation) are performed in either the mapper or the reducer

– Multiple strategies for relational joins

• Complex operations require multiple MapReduce jobs

– Example: top ten URLs in terms of average time spent

– Opportunities for automatic optimization


Page 31: Distributed data processing on the Cloud –Lecture 6 Joins

Issues with MapReduce

• Java verbosity

– Writing low-level MapReduce code is slow

– A lot of expertise is needed to optimize MapReduce code

– Prototyping takes significant time and requires manual compilation

– A lot of custom code is required

• Even for simple functions such as sum and avg, you need to write reduce methods

– Hard to manage more complex MapReduce job chains

• Pig is a high-level language on top of Hadoop MapReduce (next lecture)

– Similar to declarative SQL

• Easier to get started

– In comparison to Hadoop MapReduce:

• 5% of the code

• 5% of the time


Page 32: Distributed data processing on the Cloud –Lecture 6 Joins

Adapting Algorithms to MapReduce

• Designed a classification of how algorithms can be adapted to MapReduce [Srirama et al, FGCS 2012]

– Algorithm → single MapReduce job

• Monte Carlo, RSA breaking

– Algorithm → n MapReduce jobs

• CLustering LARge Applications (CLARA)

– Each iteration in the algorithm → single MapReduce job

• k-medoids (clustering)

– Each iteration in the algorithm → n MapReduce jobs

• Conjugate Gradient

• Especially applicable to Hadoop MapReduce

S. N. Srirama, P. Jakovits, E. Vainikko: Adapting Scientific Computing Problems to Clouds using MapReduce, Future Generation Computer Systems, ISSN: 0167-739X, 28(1):184–192, 2012. Elsevier. DOI: 10.1016/j.future.2011.05.025

Page 33: Distributed data processing on the Cloud –Lecture 6 Joins

Issues with Hadoop MapReduce

• It is designed and suitable for:

– Data processing tasks

– Embarrassingly parallel tasks

• Has serious issues with iterative algorithms

– Long “start up” and “clean up” times, ~17 seconds

– No way to keep important data in memory between MapReduce job executions

– At each iteration, all data is read from HDFS again and, at the end, written back to HDFS

– Results in a significant overhead in every iteration


Page 34: Distributed data processing on the Cloud –Lecture 6 Joins

Alternative Approaches - 1

• Restructuring algorithms into non-iterative versions

– Can we change class 3 & 4 algorithms into class 1 or 2?

• Iterative k-medoids clustering algorithm in MapReduce (sketched below)

• Map

– Find the closest medoid and assign the object to that medoid’s cluster

– Input: (cluster id, object)

– Output: (new cluster id, object)

• Reduce

– Find which object is the most central and assign it as the new medoid of the cluster

– Input: (cluster id, (list of all objects in the cluster))

– Output: (cluster id, new medoid)
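A toy Python sketch of one such iteration on 1-D points (the data and the absolute-distance metric are invented for illustration):

```python
# A toy Python sketch of one k-medoids iteration as map + reduce
# (1-D points and absolute distance are invented for illustration).

from collections import defaultdict

points = [1.0, 1.2, 4.8, 5.1, 5.3]
medoids = [1.0, 5.1]

# Map: assign each object to the cluster of its closest medoid.
clusters = defaultdict(list)
for p in points:
    closest = min(medoids, key=lambda m: abs(p - m))
    clusters[closest].append(p)

# Reduce: pick the most central object of each cluster as its new medoid.
new_medoids = [
    min(objs, key=lambda c: sum(abs(c - o) for o in objs))
    for objs in clusters.values()
]
print(new_medoids)  # a driver would launch one such job per iteration
```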


Page 35: Distributed data processing on the Cloud –Lecture 6 Joins

Alternative Approaches – 1 - continued

• CLARA was designed to make the PAM algorithm more scalable [Kaufman and Rousseeuw, 1990]

– CLARA applies PAM on random samples of the original dataset

• CLARA in MapReduce

– The algorithm can be reduced to 2 MapReduce jobs (job I is sketched below)

• MapReduce job I (finding the candidate sets)

– Map: assign a random key to each point

• Input: < key, point >

• Output: < random key, point >

– Reduce: pick the first S points and run k-medoids on them

• Output: < key, k-medoids(S points) >

– the result of k-medoids() is a set of k medoids
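A toy Python sketch of job I, where a dictionary stands in for the shuffle and kmedoids is a placeholder for the real PAM step (the data and constants are invented):

```python
# A toy Python sketch of CLARA's MapReduce job I (a dict stands in for the
# shuffle; `kmedoids` is a placeholder, not a real library call).

import random
from collections import defaultdict

points = [random.uniform(0, 10) for _ in range(1000)]
C = 5            # number of candidate sets (one per reducer key)
S_SIZE = 40      # sample size S used by PAM

def kmedoids(sample, k=2):
    # Placeholder: a real implementation would run PAM on the sample.
    return sorted(sample)[:k]

# Map: assign a random key in [0, C) to each point.
groups = defaultdict(list)
for p in points:
    groups[random.randrange(C)].append(p)

# Reduce: pick the first S points of each group and cluster them.
candidates = {key: kmedoids(pts[:S_SIZE]) for key, pts in groups.items()}
print(candidates)  # job II then scores each candidate set on the full data
```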

Page 36: Distributed data processing on the Cloud –Lecture 6 Joins

CLARA in MapReduce – continued

• MapReduce job II (measuring the quality over the whole dataset)

– Map: find each point’s distance from its closest medoid, for each candidate set

• Input: < key, point >

• Output: < candidate set key, distance >

– C different sums, one for each candidate set

– Reduce: sum the distances

• The candidate set (and its medoids) with the minimum sum decides the clusters

[Figure: clustering results of k-medoids vs. CLARA, compared side by side]


Page 37: Distributed data processing on the Cloud –Lecture 6 Joins

Alternative Approaches - 2

• Using alternative MapReduce frameworks that are designed to handle iterative algorithms

– In-memory processing

– Spark (Lecture 8)


Page 38: Distributed data processing on the Cloud –Lecture 6 Joins

Alternative Approaches - 3

• Alternative distributed computing models, such as the Bulk Synchronous Parallel (BSP) model [Valiant, 1990] [Jakovits et al, HPCS 2013]

• Available BSP frameworks:

– Graph processing: jPregel & Giraph (Lecture 11)


Page 39: Distributed data processing on the Cloud –Lecture 6 Joins

This week in lab…

• You’ll try Joins in MapReduce


Page 40: Distributed data processing on the Cloud –Lecture 6 Joins

Next Lecture

• Higher-level scripting languages for distributed data processing – Pig


Page 41: Distributed data processing on the Cloud –Lecture 6 Joins

References

• Jimmy Lin and Chris Dyer: Data-Intensive Text Processing with MapReduce. http://lintool.github.io/MapReduceAlgorithms/MapReduce-book-final.pdf

• S. N. Srirama, P. Jakovits, E. Vainikko: Adapting Scientific Computing Problems to Clouds using MapReduce, Future Generation Computer Systems, ISSN: 0167-739X, 28(1):184–192, 2012. Elsevier.

• P. Jakovits, S. N. Srirama: Evaluating MapReduce Frameworks for Iterative Scientific Computing Applications, 12th International Conference on High Performance Computing & Simulation (HPCS 2014), Bologna, Italy, July 21–25, 2014, pp. 226–233. IEEE.


Page 42: Distributed data processing on the Cloud –Lecture 6 Joins

THANK YOU

[email protected]


Page 43: Distributed data processing on the Cloud –Lecture 6 Joins

Issues with MapReduce

• Java verbosity

• Hadoop task startup time

• Not good for iterative algorithms

– Writing low level MapReduce code is slow

– Need a lot of expertise to optimize MapReduce code

– Prototyping requires compilation

– A lot of custom code is required

– Hard to manage more complex MapReduce job chains
