32
Intro to Map-Reduce Feb 21, 2014

Intro to Map-Reduce Feb 21, 2014. map-reduce? A programming model or abstraction. A novel way of thinking about designing a solution to certain problems…

Embed Size (px)

Citation preview

Page 1: Intro to Map-Reduce Feb 21, 2014. map-reduce? A programming model or abstraction. A novel way of thinking about designing a solution to certain problems…

Intro to Map-Reduce

Feb 21, 2014

Page 2: Intro to Map-Reduce Feb 21, 2014. map-reduce? A programming model or abstraction. A novel way of thinking about designing a solution to certain problems…

CS512 | Spring 2014

map-reduce?

A programming model or abstraction.

A novel way of thinking about designing a solution to certain problems…

Feb 21, 2014

Page 3: Intro to Map-Reduce Feb 21, 2014. map-reduce? A programming model or abstraction. A novel way of thinking about designing a solution to certain problems…

CS512 | Spring 2014

Why?

We have access to huge volumes of data.Facebook posts and photos, twitter streams, …

Feb 21, 2014

Page 4: Intro to Map-Reduce Feb 21, 2014. map-reduce? A programming model or abstraction. A novel way of thinking about designing a solution to certain problems…

CS512 | Spring 2014

Easy…

Computational resources are cheap.Amazon EC2, Microsoft Azure, …

Feb 21, 2014

Page 5: Intro to Map-Reduce Feb 21, 2014. map-reduce? A programming model or abstraction. A novel way of thinking about designing a solution to certain problems…

CS512 | Spring 2014

Not really!

Code running on one CPU is simple, two is a headache, four is a nightmare, …You get the picture.

Feb 21, 2014

Page 6: Intro to Map-Reduce Feb 21, 2014. map-reduce? A programming model or abstraction. A novel way of thinking about designing a solution to certain problems…

CS512 | Spring 2014

Pipe Dream

Forget multiple machines.Just write code imagining one CPU.Someone else takes care of running the code on thousands of machine.Free the programmer from the unnecessary details.

Feb 21, 2014

Page 7: Intro to Map-Reduce Feb 21, 2014. map-reduce? A programming model or abstraction. A novel way of thinking about designing a solution to certain problems…

CS512 | Spring 2014

Map-Reduce

map-reduce programming model to the rescue

Feb 21, 2014

Page 8: Intro to Map-Reduce Feb 21, 2014. map-reduce? A programming model or abstraction. A novel way of thinking about designing a solution to certain problems…

CS512 | Spring 2014

Long long ago…

LISP, 1958

A programming language that introduced several innovative ideas

Feb 21, 2014

Recursive Functions of Symbolic Expression and Their Computation by Machine, Part IJohn McCarthy, MIT, April 1960

Page 9: Intro to Map-Reduce Feb 21, 2014. map-reduce? A programming model or abstraction. A novel way of thinking about designing a solution to certain problems…

CS512 | Spring 2014

LISP

Introduced map and reduce.

Feb 21, 2014

Page 10: Intro to Map-Reduce Feb 21, 2014. map-reduce? A programming model or abstraction. A novel way of thinking about designing a solution to certain problems…

CS512 | Spring 2014

Map

map(mf, [a1, a2, …an])

-> [b1, b2, …, bn]

Accepts two arguments: a function and a list of values.Generates output by repeatedly applying the function on the list of values.

Feb 21, 2014

Page 11: Intro to Map-Reduce Feb 21, 2014. map-reduce? A programming model or abstraction. A novel way of thinking about designing a solution to certain problems…

CS512 | Spring 2014

Reduce

reduce(rf, [b1, b2, …bn])

-> c

Accepts two arguments: a function and a list of values.Generates output by reducing the list of input values using the function.

Feb 21, 2014

Page 12: Intro to Map-Reduce Feb 21, 2014. map-reduce? A programming model or abstraction. A novel way of thinking about designing a solution to certain problems…

CS512 | Spring 2014

Simple composition

Map’s output is a list of values, which reduce can accept as one of its argument.

Feb 21, 2014

Page 13: Intro to Map-Reduce Feb 21, 2014. map-reduce? A programming model or abstraction. A novel way of thinking about designing a solution to certain problems…

CS512 | Spring 2014

Analogy

Break large problem into small pieces Code mf to solve one piece

Run map to apply mf on the small pieces and generate nuggets of solutions

Code rf to combine the nuggets

Run reduce to apply rf on the nuggets to output the complete solution

Feb 21, 2014

Page 14: Intro to Map-Reduce Feb 21, 2014. map-reduce? A programming model or abstraction. A novel way of thinking about designing a solution to certain problems…

CS512 | Spring 2014

Example

1TB file split into 100,000 chunksCount number of lines in each chunkAdd counts together to output final line count

Feb 21, 2014

Page 15: Intro to Map-Reduce Feb 21, 2014. map-reduce? A programming model or abstraction. A novel way of thinking about designing a solution to certain problems…

CS512 | Spring 2014

A slightly different map-reduceMapCopies a function on a number of machines and applies each copy on different pieces of the inputReduceCombine the map outputs from different machines into a final solution

Feb 21, 2014

Page 16: Intro to Map-Reduce Feb 21, 2014. map-reduce? A programming model or abstraction. A novel way of thinking about designing a solution to certain problems…

CS512 | Spring 2014

Map-reduce reintroduced…

Google created the awareness

Hadoop made it into a sensationHadoop is an open-source map-reduce implementation based on Google’s paper.

Feb 21, 2014

MapReduce: Simplified Data Processing on Large ClustersJeffrey Dean and Sanjay GhemawatOSDI'04: Sixth Symposium on Operating System Design and Implementation. December, 2004.

Page 17: Intro to Map-Reduce Feb 21, 2014. map-reduce? A programming model or abstraction. A novel way of thinking about designing a solution to certain problems…

CS512 | Spring 2014

Hadoop

Feb 21, 2014

Page 18: Intro to Map-Reduce Feb 21, 2014. map-reduce? A programming model or abstraction. A novel way of thinking about designing a solution to certain problems…

CS512 | Spring 2014

Example

Feb 21, 2014

Page 19: Intro to Map-Reduce Feb 21, 2014. map-reduce? A programming model or abstraction. A novel way of thinking about designing a solution to certain problems…

CS512 | Spring 2014

Terminology

MapperInstance of the map function

ReducerInstance of the reduce function

Feb 21, 2014

Page 20: Intro to Map-Reduce Feb 21, 2014. map-reduce? A programming model or abstraction. A novel way of thinking about designing a solution to certain problems…

CS512 | Spring 2014

Job

User’s implementation of map and reduce functions

Feb 21, 2014

Page 21: Intro to Map-Reduce Feb 21, 2014. map-reduce? A programming model or abstraction. A novel way of thinking about designing a solution to certain problems…

CS512 | Spring 2014

Splitting the input

User submits job and specifies the input files.

Input files are split into chunks. Chunks are fed to the mappers

typically over a distributed file system like HDFS.

Feb 21, 2014

Page 22: Intro to Map-Reduce Feb 21, 2014. map-reduce? A programming model or abstraction. A novel way of thinking about designing a solution to certain problems…

CS512 | Spring 2014

Copying the job

JobTracker Hadoop service that copies the map

and reduce code to available machines.

Feeds the input to the mappers and connects their outputs to reducers.

Feb 21, 2014

Page 23: Intro to Map-Reduce Feb 21, 2014. map-reduce? A programming model or abstraction. A novel way of thinking about designing a solution to certain problems…

CS512 | Spring 2014

Maps in parallel

Maps run in parallel. Each maps operates on a set of

chunks assigned to it by the job tracker.

Maps write to local disk.

Feb 21, 2014

Page 24: Intro to Map-Reduce Feb 21, 2014. map-reduce? A programming model or abstraction. A novel way of thinking about designing a solution to certain problems…

CS512 | Spring 2014

What if maps fail?

Re-run failed maps. No need to re-run succeeded maps.

Why does this work?Maps typically are idempotent.

Feb 21, 2014

Page 25: Intro to Map-Reduce Feb 21, 2014. map-reduce? A programming model or abstraction. A novel way of thinking about designing a solution to certain problems…

CS512 | Spring 2014

Input to reducers

#reducers (n) known a priori.#partitions equals #reducers.

Hash on the keys of mapper outputs.partition = hash(key) mod n

Load balancing by randomization.

Feb 21, 2014

Page 26: Intro to Map-Reduce Feb 21, 2014. map-reduce? A programming model or abstraction. A novel way of thinking about designing a solution to certain problems…

CS512 | Spring 2014

Wait before reducing…

Cannot start reducers before mappers complete.

Synchronization barrier between map and reduce phases. Why?

Feb 21, 2014

Page 27: Intro to Map-Reduce Feb 21, 2014. map-reduce? A programming model or abstraction. A novel way of thinking about designing a solution to certain problems…

CS512 | Spring 2014

Embarrassing Parallelism

Feb 21, 2014

Page 28: Intro to Map-Reduce Feb 21, 2014. map-reduce? A programming model or abstraction. A novel way of thinking about designing a solution to certain problems…

CS512 | Spring 2014

Not a panacea!

If your workload exhibits embarrassing parallelism, Hadoop might be the ideal framework.

If not, look for other parallel programming paradigms.

Feb 21, 2014

Page 29: Intro to Map-Reduce Feb 21, 2014. map-reduce? A programming model or abstraction. A novel way of thinking about designing a solution to certain problems…

CS512 | Spring 2014

Example

Feb 21, 2014

Page 30: Intro to Map-Reduce Feb 21, 2014. map-reduce? A programming model or abstraction. A novel way of thinking about designing a solution to certain problems…

CS512 | Spring 2014

WordCount

Feb 21, 2014

https://developer.yahoo.com/hadoop/tutorial/module4.html

Page 31: Intro to Map-Reduce Feb 21, 2014. map-reduce? A programming model or abstraction. A novel way of thinking about designing a solution to certain problems…

CS512 | Spring 2014

Mapper

public static class MyMapper implements Mapper<LongWritable, Text, Text, IntWritable> {

private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value,

OutputCollector<Text, IntWritable> output,Reporter reporter) throws IOException {

// Split the given line (in value) into words and emit for each word the tuple <word, 1>

String line = value.toString(); StringTokenizer itr = new StringTokenizer(line); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); output.collect(word, one); } } }

Feb 21, 2014

https://developer.yahoo.com/hadoop/tutorial/module4.html

Page 32: Intro to Map-Reduce Feb 21, 2014. map-reduce? A programming model or abstraction. A novel way of thinking about designing a solution to certain problems…

CS512 | Spring 2014

Reducer

public static class MyReducer implements Reducer<Text, IntWritable, Text, IntWritable> {

public void reduce(Text key, Iterator<IntWritable> values,OutputCollector<Text, IntWritable> output,Reporter reporter) throws IOException {

int sum = 0; while (values.hasNext()) { sum += values.next().get(); } output.collect(key, new IntWritable(sum)); } }

Feb 21, 2014

https://developer.yahoo.com/hadoop/tutorial/module4.html