
Map Reduce presentation. Operating Systems, University of Georgia, 2010.


Page 1: Map Reduce

Map Reduce

By Manuel Correa

Page 2: Map Reduce

Background

Large sets of data need to be processed quickly and efficiently

To process a large data set in a reasonable amount of time, the work must be distributed across thousands of machines

Programmers need to focus on solving problems without worrying about the details of the distributed implementation

Map Reduce is the answer.

Page 3: Map Reduce

What is Map Reduce?

Programming model for processing large data sets

Hides the implementation of parallelization, fault tolerance, data distribution, and load balancing inside a library

Inspired by characteristics of functional programming

Functional operations do not modify data structures. They always create new ones

Original data is not modified

Data flow is implicit within the application

The order of the operations does not matter

Page 4: Map Reduce

What is Map Reduce?

There are two functions: Map and Reduce

Map

Input: Key/Value pairs

Output: Intermediate key/value pairs

Reduce

Input: a key and an iterator over all values for that key

Output: a list of result values

map(k1, v1) --> list(k2, v2)

reduce(k2, list(v2)) --> list(v2)

Complicated?
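Not once the shapes are spelled out. As a purely illustrative aid (these aliases and names are not from the paper, which states the signatures abstractly), the two functions can be written as Python type aliases:

from typing import Callable, Iterable, Tuple, TypeVar

K1 = TypeVar("K1")
V1 = TypeVar("V1")
K2 = TypeVar("K2")
V2 = TypeVar("V2")

# map(k1, v1) --> list(k2, v2): one input record in, intermediate pairs out
MapFn = Callable[[K1, V1], Iterable[Tuple[K2, V2]]]

# reduce(k2, list(v2)) --> list(v2): one key plus all its values in, results out
ReduceFn = Callable[[K2, Iterable[V2]], Iterable[V2]]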

Page 5: Map Reduce

Map Reduce by example: Counting each word in a large set of documents

map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
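A minimal, runnable Python sketch of the same computation, with the shuffle (grouping of intermediate values by key) simulated in memory; map_fn, reduce_fn, and map_reduce are illustrative names, not a real MapReduce API:

from collections import defaultdict

def map_fn(doc_name, contents):
    # Emit an intermediate <word, 1> pair for every word in the document.
    for word in contents.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Sum all partial counts for one word.
    yield (word, sum(counts))

def map_reduce(documents):
    # Shuffle phase, simulated in memory: group intermediate values by key.
    intermediate = defaultdict(list)
    for name, contents in documents.items():
        for key, value in map_fn(name, contents):
            intermediate[key].append(value)
    # Reduce phase: one call per distinct key.
    results = []
    for word, counts in intermediate.items():
        results.extend(reduce_fn(word, counts))
    return results

documents = {
    "document_1": "foo bar baz foo bar test",
    "document_2": "test foo baz bar foo",
}
print(sorted(map_reduce(documents)))
# [('bar', 3), ('baz', 2), ('foo', 4), ('test', 2)]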

Page 6: Map Reduce

Map Reduce by example: Counting each word in a large set of documents

Document_1: foo bar baz foo bar test

Document_2: test foo baz bar foo

Expected results:

<foo, 4>, <bar, 3>, <baz, 2>, <test, 2>

Page 7: Map Reduce

Map Reduce by example: Counting each word in a large set of documents

map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

Map(document_1, contents(document_1)) emits:

<foo, "1">, <bar, "1">, <baz, "1">, <foo, "1">, <bar, "1">, <test, "1">

Map(document_2, contents(document_2)) emits:

<test, "1">, <foo, "1">, <baz, "1">, <bar, "1">, <foo, "1">

Page 8: Map Reduce

Map Reduce by example: Counting each word in a large set of documents

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));

Reduce applied to document_1's intermediate pairs (a partial, per-map-task reduce; this is the combiner mentioned under Refinements) yields:

<foo, "2">, <bar, "2">, <baz, "1">, <test, "1">

Reduce applied to document_2's intermediate pairs yields:

<test, "1">, <foo, "2">, <baz, "1">, <bar, "1">

Page 9: Map Reduce

Map Reduce by example: Counting each word in a large set of documents

Merging the partial counts from both map tasks

<foo, "2">, <bar, "2">, <baz, "1">, <test, "1"> (document_1)

<test, "1">, <foo, "2">, <baz, "1">, <bar, "1"> (document_2)

and applying Reduce(word, values) to each key yields the final results:

<foo, "4">, <bar, "3">, <baz, "2">, <test, "2">

Expected results:

<foo, 4>, <bar, 3>, <baz, 2>, <test, 2>

Page 10: Map Reduce

Implementation

Page 11: Map Reduce

Master node

The master keeps data structures for the Map and Reduce tasks in which the status of each task is maintained

Status: idle, in-progress, or completed

The master node keeps track of the locations of the intermediate files in order to feed the reduce tasks

The master node controls the interaction between the M map tasks and the R reduce tasks
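A minimal Python sketch of this bookkeeping; the class and field names are hypothetical, not from the paper:

from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Task:
    kind: str                     # "map" or "reduce"
    state: str = "idle"           # idle, in-progress, or completed
    worker: Optional[str] = None  # id of the worker running the task
    # For completed map tasks: locations of the intermediate files that
    # the master forwards to the reduce tasks.
    intermediate_files: List[str] = field(default_factory=list)

@dataclass
class Master:
    tasks: Dict[int, Task] = field(default_factory=dict)

    def assign(self, task_id: int, worker: str) -> None:
        # Hand an idle task to a worker and mark it in-progress.
        task = self.tasks[task_id]
        task.state, task.worker = "in-progress", worker

    def complete(self, task_id: int, files: List[str]) -> None:
        # Record completion and remember where the intermediate data lives.
        task = self.tasks[task_id]
        task.state, task.intermediate_files = "completed", files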

Page 12: Map Reduce

Fault Tolerance

The master pings every worker periodically

If a worker fails, the master marks that worker as failed and reassigns its task to another worker

Every worker must notify the master when it has finished its task; the master then assigns it another task

Each task is independent and can be restarted at any moment, so Map Reduce is resilient to worker failures

What if the master fails? The master periodically checkpoints its status and data structures, so another master can restart from the last checkpoint
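A rough sketch of the ping-and-reassign loop, reusing the hypothetical Master and Task classes above; the ping callback and the interval are assumptions, not the paper's protocol:

import time

def monitor(master, workers, ping, interval_s=10.0):
    # Periodically ping every worker; a worker that fails a ping is marked
    # dead and its in-progress tasks become idle again for reassignment.
    while True:
        for worker in list(workers):
            if not ping(worker):
                workers.discard(worker)
                for task in master.tasks.values():
                    if task.worker == worker and task.state == "in-progress":
                        task.state, task.worker = "idle", None
        time.sleep(interval_s)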

Page 13: Map Reduce

Task Granularity

There are M map tasks and R reduce tasks

M and R should be much larger than the number of workers

Running many more tasks than workers improves dynamic load balancing and makes better use of resources

The master makes O(M+R) scheduling decisions and keeps O(M*R) state in memory, roughly one byte per map/reduce task pair

According to the paper, Google often runs with M = 200,000 and R = 5,000, using 2,000 workers
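A back-of-the-envelope check of those numbers in Python (the one-byte-per-pair figure is the paper's approximation):

M, R = 200_000, 5_000

scheduling_decisions = M + R   # O(M + R)
state_bytes = M * R * 1        # O(M * R), ~1 byte per map/reduce task pair

print(f"{scheduling_decisions:,} scheduling decisions")     # 205,000
print(f"{state_bytes / 2**30:.2f} GiB of in-memory state")  # ~0.93 GiB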

Page 14: Map Reduce

Refinements

Partitioning function: controls how intermediate keys are spread across the R reduce tasks (load balancing)

Ordering guarantee: intermediate keys are processed in sorted order within each partition, making it easy to generate sorted output files

Combiner function = Reduce function: partially merges intermediate data on the map side; see the word-count example and the sketch after this list

Input and output readers: support for multiple input and output formats

Skipping bad records: control of bad input that repeatedly crashes user code

Local execution for debugging

Status information: the master exports status pages for monitoring progress
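Because word counting is associative and commutative, the combiner can literally be the reduce logic applied locally to each map task's output. A minimal sketch in the same in-memory style as the earlier example (illustrative names only):

from collections import defaultdict

def combine(intermediate_pairs):
    # Partially merge the <word, count> pairs produced by ONE map task,
    # before anything is shipped across the network to the reducers.
    local = defaultdict(int)
    for word, count in intermediate_pairs:
        local[word] += count
    return list(local.items())

# The map task for document_1 emits six pairs; the combiner shrinks them
# to the four partial counts shown on page 8.
pairs = [("foo", 1), ("bar", 1), ("baz", 1), ("foo", 1), ("bar", 1), ("test", 1)]
print(sorted(combine(pairs)))
# [('bar', 2), ('baz', 1), ('foo', 2), ('test', 1)]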

Page 15: Map Reduce

What are the benefits of Map Reduce?

Easy to use: programmers don't need to worry about the details of distributed computing

A large set of problems can be expressed in the Map Reduce programming model

Flexible and scalable on large clusters of machines. The fault tolerance is elegant and it works

Page 16: Map Reduce

Programs that can be expressed with Map Reduce

Distributed Grep <word, match>

Count URL Access Frequency <URL, total_count>

Reverse Web-link graph <target, list(source)>

Term-Vector per Host <word, frequency>

Inverted index <word, list(document ID)> (see the sketch after this list)

Distributed Sort <key, record>
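As one concrete instance, a minimal in-memory Python sketch of the inverted index; index_map and index_reduce are illustrative names, not a real API:

from collections import defaultdict

def index_map(doc_id, contents):
    # Map: emit an intermediate <word, doc_id> pair for every word.
    for word in contents.split():
        yield (word, doc_id)

def index_reduce(word, doc_ids):
    # Reduce: collapse to <word, sorted list of unique document IDs>.
    return (word, sorted(set(doc_ids)))

documents = {"doc1": "foo bar", "doc2": "bar baz"}
intermediate = defaultdict(list)
for doc_id, contents in documents.items():
    for word, value in index_map(doc_id, contents):
        intermediate[word].append(value)

print(sorted(index_reduce(w, ids) for w, ids in intermediate.items()))
# [('bar', ['doc1', 'doc2']), ('baz', ['doc2']), ('foo', ['doc1'])]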

Page 17: Map Reduce

References

MapReduce: Simplified Data Processing on Large Clusters (http://labs.google.com/papers/mapreduce-osdi04.pdf)

http://code.google.com/edu/parallel/mapreduce-tutorial.html

www.mapreduce.org

http://www.youtube.com/watch?v=yjPBkvYh-ss&feature=PlayList&p=01ABB666FB64D768&index=0&playnext=1

http://hadoop.apache.org/

Page 18: Map Reduce

Map Reduce

Questions?