Upload
warwithin
View
217
Download
0
Embed Size (px)
Citation preview
8/14/2019 MapReduceMerge
1/32
Map-Reduce-Merge:Simplied Relational
Data Processing onLarge ClustersHung-chih Yang, Ali DasdanRuey-Lung Hsiao, D. Stott Parker
presented by Nate Roberts
8/14/2019 MapReduceMerge
2/32
Outline
1. Introduction: principles of databases rather than the artifacts.
2. MapReduce
3. Map-Reduce-Merge: extending MapReduce
4. Using Map-Reduce-Merge to implement relational algebra operators
8/14/2019 MapReduceMerge
3/32
Principles of DB, Not
the Artifacts New data-processing systems shouldconsider alternatives to using big,
traditional databases. MapReduce does a good job, in a limited
context, with extraordinary simplicity
Map-Reduce-Merge will try to extend theapplicability without giving up too muchsimplicity
8/14/2019 MapReduceMerge
4/32
Introduction to
Ma Reduce1. Why MapReduce?
2. What is MapReduce?
3. How do you use it?
4. Whats it good for?5. What are its limitations?
8/14/2019 MapReduceMerge
5/32
Why MapReduce?
For (single core) CPUs, Moores Law isbeginning to slow down
The future is multi-core (large clusters of commodity hardware are the newsupercomputers)
But parallel programming is hard to think about!
8/14/2019 MapReduceMerge
6/32
What is MapReduce?
MapReduce handles dispatching tasksacross a large cluster
You just have to dene the tasks, in twostages:
1. Map: (k1, v1)
[(k2, v2)]2. Reduce: (k2, [v2]) [v3]
8/14/2019 MapReduceMerge
7/32
How do you use it?
Example: count the number of occurrencesof each word in a large collection of
documents. For each document d withcontents v:
map: given (d,v), for each word in v, emit(w, 1).
reduce: given (w, [v]), sum the counts in[v]. Emit the sum.
8/14/2019 MapReduceMerge
8/32
Word Counting
Behind the Scenes A single master server dispatches tasks andkeeps a scoreboard.
1. Mappers are dispatched for eachdocument. They do local writes withtheir results.
2. Once mapping nishes, a shufe phaseassigns reducers to each word. Reducersdo remote reads from mappers.
8/14/2019 MapReduceMerge
9/32
MapReduce Schematic
http ://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0007.html
http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0007.htmlhttp://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0007.htmlhttp://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0007.html8/14/2019 MapReduceMerge
10/32
MapReduce in Parallel
http ://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0008.html
http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0007.htmlhttp://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0007.htmlhttp://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0007.html8/14/2019 MapReduceMerge
11/32
Three Optimizations
Fault and slow node tolerance: once tasksare all dispatched and some nodes are
nished, assign some unnished tasksredundantly. (First to return wins.)
Combiner: have mappers do some localreduction.
Locality: assign mappers in such a way thatmost have their input available for localreading.
8/14/2019 MapReduceMerge
12/32
Whats it good for?
Data processing tasks on homogeneousdata sets:
Distributed Grep Building an index mapping words to
documents in which those words occur. Distributed sort
8/14/2019 MapReduceMerge
13/32
What isnt it good for?
Not good at heterogeneous data sets.
8/14/2019 MapReduceMerge
14/32
emp-id dept-id bonus
1 B innov. award ($100)
1 B hard worker ($50)2 A high perform. ($150)
3 A innov. award ($100)
dept-id bonus adjustmentB 1.1
A 0.9
Heterogeneous Data
8/14/2019 MapReduceMerge
15/32
Map-Reduce-Merge:
Extending MapReduce1. Change to reduce phase
2. Merge phase
3. Additional user-denable operations
a. partition selector
b. processorc. merger
d. congurable iterators
8/14/2019 MapReduceMerge
16/32
Reduce & Merge
Phases1. Map: (k1, v1) [(k2, v2)]
2. Reduce: (k2, [v2])
[v3]becomes:
1. Map: (k1, v1) [(k2, v2)]
2. Reduce: (k2, [v2]) (k2, [v3])3. Merge: ((k2, [v3]), (k3, [v4])) (k4, [v5])
8/14/2019 MapReduceMerge
17/32
Programmer-Denable
O erations1. Partition selector - which data should go towhich merger?
2. Processor - process data on an individualsource.
3. Merger - analogous to the map and reduce
denitions, dene logic to do the mergeoperation.
4. Congurable iterators - how to step
through each of the lists as you merge.
8/14/2019 MapReduceMerge
18/32
Employee Bonus
Exam le, Revisited
8/14/2019 MapReduceMerge
19/32
Implementing Relational
Algebra Operations1. Projection
2. Aggregation3. Selection
4. Set Operations: Union, Intersection, Difference
5. Cartesian Product
6. Rename
7. Join
8/14/2019 MapReduceMerge
20/32
Projection
All we have to do is emit a subset of thedata passed in.
Just a mapper can do this.
8/14/2019 MapReduceMerge
21/32
Aggregation
By choosing appropriate keys, canimplement group by and aggregate SQLoperators in MapReduce.
(Do have to be careful here, though: choosebadly, and you might not have enough tasksfor MapReduce to do you any good.)
8/14/2019 MapReduceMerge
22/32
Selection
If selection condition involves only theattributes of one data source, can
implement in mappers. If its on aggregates or a group of values
contained in one data source, canimplement in reducers.
If it involves attributes or aggregates fromboth data sources, implement in mergers.
8/14/2019 MapReduceMerge
23/32
Set Union
Let each of the two MapReduces emit asorted list of unique elements
Merges just iterate simultaneously over thelists:
store the lesser value and increment itsiterator, if there is a lesser value if the two are equal, store one of the
two, and increment both iterators
8/14/2019 MapReduceMerge
24/32
Set Intersection
Let each of the two MapReduces emit asorted list of unique elements
Merges just iterate simultaneously over thelists:
if there is a lesser value, increment itsiterator if the two are equal, store one of the
two, and increment both iterators
8/14/2019 MapReduceMerge
25/32
Set Difference
Let each of the two MapReduces emit asorted list of unique elements
Merges just iterate simultaneously over thelists:
if As value is less than Bs, store As if Bs value is less than As, increment it
if the two are equal, increment both
To Compute A - B:
8/14/2019 MapReduceMerge
26/32
Cartesian Product
Set the reducers up to output the two setsyou want the Cartesian product of.
Each merger will get one partition F fromthe rst set of reducers, and the full set of partitions S from the second.
Each merger emits F x S.
8/14/2019 MapReduceMerge
27/32
Rename
Trivial
8/14/2019 MapReduceMerge
28/32
Sort-Merge Join
Map: partition records into key rangesaccording to the values of the attributes onwhich youre sorting, aiming for evendistribution of values to mappers.
Reduce: sort the data. Merge: join the sorted data for each key
range.
8/14/2019 MapReduceMerge
29/32
Hash Join
Map: use the same hash function for bothsets of mappers.
Reduce: produce a hash table from thevalues mapped.
Merge: operates on corresponding hashbuckets. Use one bucket as a build set , andthe other as a probe.
8/14/2019 MapReduceMerge
30/32
Nested Loop Join
Just like a hash join, except in the mergestep, do a nested loop, scanning the right-hand relation for matches to the left.
8/14/2019 MapReduceMerge
31/32
Conclusion
MapReduce & GFS represent a paradigmshift in data processing: use a simpliedinterface instead of overly-general DBMS
Map-Reduce-Merge adds the ability toexecute arbitrary relational algebra queries
Next steps: develop SQL-like interface anda query optimizer
8/14/2019 MapReduceMerge
32/32
References
J. Dean and S. Ghemawat. MapReduce: Simplied DataProcessing on Large Clusters. In OSDI, pages137-15 0, 2004. Slides available:http://labs.google.com/papers/mapreduce-osdi04-slides/index.html
A. Kimball, Problem Solv ing on Large-Scale Clusters.Lecture given on July 3, 2007. Available athttp://www.youtube.com/watch?v=-vD6PUdf3Js
http://www.youtube.com/watch?v=-vD6PUdf3Jshttp://labs.google.com/papers/mapreduce-osdi04-slides/index.htmlhttp://labs.google.com/papers/mapreduce-osdi04-slides/index.htmlhttp://www.youtube.com/watch?v=-vD6PUdf3Jshttp://www.youtube.com/watch?v=-vD6PUdf3Jshttp://labs.google.com/papers/mapreduce-osdi04-slides/index.htmlhttp://labs.google.com/papers/mapreduce-osdi04-slides/index.htmlhttp://labs.google.com/papers/mapreduce-osdi04-slides/index.htmlhttp://labs.google.com/papers/mapreduce-osdi04-slides/index.html