MapReduce basics

Harisankar H, PhD student, DOS lab, Dept. CSE, IIT Madras

6-Feb-2013

http://harisankarh.wordpress.com

Distributed processing?

• Processing distributed across multiple machines/servers

Image from: http://installornot.com/wp-content/uploads/google-datacenter-tech-13.jpg

Why distributed processing?

– Reduce execution time of large jobs

• E.g., extracting urls from terabytes of data

• Ideally, 1000 machines could finish the job up to 1000 times faster

– Fault-tolerance

• Other nodes will take over the jobs if some of the nodes fail

– Typically, with 10,000 servers, about one fails per day on average

Issues in distributed processing

• Traditionally realized using special-purpose implementations

– E.g., indexer, log processor

• Implementation is really hard at the socket programming level

– Fault-tolerance

• Keep track of failures, reassign tasks

– Hand-coded parallelization

– Scheduling across heterogeneous nodes

– Locality

• Minimise movement of data for computation

– How to distribute data?

• Results in:

– Complex, brittle, non-generic code

– Reimplementation of common features like fault-tolerance and distribution

Need for a generic abstraction for distributed processing

• Tradeoff between genericity and performance

– More generic => usually less performance

• MapReduce is probably a sweet spot where you get both to some extent

[Figure: separation of concerns. The abstraction sits between the app programmer (expresses app logic) and the systems developer (performance, fault handling etc.)]

MapReduce abstraction (app programmer’s view)

• Model input and output as <key,value> pairs

• Provide map() and reduce() functions which act on <k,v> pairs

• Input: set of <k,v> pairs: {k,v}

– For each input <k,v>:

map(k1, v1) → list(k2, v2)

– For each unique output key from map:

reduce(k2, combined list(v2)) → list(v3)

System will take care of distributing the tasks across thousands of machines, handling locality, fault-tolerance etc.
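
The model above can be sketched in a single process to make the data flow concrete. This is a minimal illustration, not how a real platform is built: the class MapReduceSketch, the run() helper, and the URL-counting job are all hypothetical names chosen here, and nothing is distributed. The job reuses the earlier "extracting urls" example: map() emits <url, document name> for each URL-like token, and reduce() emits the number of mentions per URL.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;

// Single-process model of the abstraction; every name here is hypothetical.
class MapReduceSketch {

    // Applies map() to each input pair, groups the map output by key,
    // then applies reduce() once per unique intermediate key.
    static <K1, V1, K2 extends Comparable<K2>, V2, V3> Map<K2, List<V3>> run(
            List<Map.Entry<K1, V1>> input,
            BiFunction<K1, V1, List<Map.Entry<K2, V2>>> map,
            BiFunction<K2, List<V2>, List<V3>> reduce) {
        // Map phase plus grouping: k2 -> combined list(v2).
        Map<K2, List<V2>> grouped = new TreeMap<>();
        for (Map.Entry<K1, V1> in : input) {
            for (Map.Entry<K2, V2> out : map.apply(in.getKey(), in.getValue())) {
                grouped.computeIfAbsent(out.getKey(), k -> new ArrayList<>())
                       .add(out.getValue());
            }
        }
        // Reduce phase: one call per unique key.
        Map<K2, List<V3>> result = new TreeMap<>();
        for (Map.Entry<K2, List<V2>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce.apply(e.getKey(), e.getValue()));
        }
        return result;
    }

    // Example job: count how many times each URL is mentioned across documents.
    static Map<String, List<Integer>> urlMentions(List<Map.Entry<String, String>> docs) {
        return MapReduceSketch.<String, String, String, String, Integer>run(
            docs,
            (docName, contents) -> {
                List<Map.Entry<String, String>> out = new ArrayList<>();
                for (String token : contents.split("\\s+")) {
                    if (token.startsWith("http://")) {
                        out.add(new SimpleEntry<>(token, docName));
                    }
                }
                return out;
            },
            (url, docNames) -> Collections.singletonList(docNames.size()));
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> docs = new ArrayList<>();
        docs.add(new SimpleEntry<>("a.txt", "see http://x.org and http://y.org"));
        docs.add(new SimpleEntry<>("b.txt", "also http://x.org"));
        System.out.println(urlMentions(docs));
    }
}
```

The app programmer only writes the two lambdas; everything inside run() is what the platform would take over.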

Example: word count

• Problem:

– Count the number of occurrences of each unique word in a big collection of documents

• Input <k,v> set:

– <document name, document contents>

• Organize the files in this format

• Output:

– <word, count>

• Get it in output files

• Next step:

– Define the map() and reduce() functions

Word count

map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, List values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));

Program in java

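// With Hadoop's default text input format, key is the byte offset of a line
// in the input file and value is the line itself.
// word (a Text) and one (an IntWritable holding 1) are assumed to be fields
// of the enclosing Mapper class.
public void map(LongWritable key, Text value, Context context) throws … {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);   // emit <word, 1>
    }
}

// Called once per unique word with all the counts emitted for it.
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws … {
    int sum = 0;
    for (IntWritable val : values) {
        sum += val.get();
    }
    context.write(key, new IntWritable(sum));   // emit <word, total count>
}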

Implementing MapReduce abstraction

• Looked at the application programmer’s view

• Need a platform which implements the MapReduce abstraction

• Hadoop is the popular open-source implementation of the MapReduce abstraction

• Questions for the platform developer

– How to

• parallelize?

• handle faults?

• provide locality?

• distribute the data?

[Figure: the abstraction between app programmer and systems developer]

Basics of platform implementation

• parallelize?

– Each map can be executed independently in parallel

– After all maps have finished execution, all reduces can be executed in parallel

• handle faults?

– map() and reduce() have no internal state

• Simply re-execute in case of a failure

• distribute the data?

– Have a distributed file system (HDFS)

• provide locality?

– Prefer to execute map() on the nodes holding the input <k,v> pairs
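
The first two answers can be sketched with a thread pool standing in for the cluster. This is only an illustration under stated assumptions: ParallelMapSketch, runWithRetry(), mapAll(), and the retry limit of 3 are hypothetical, and threads in one JVM replace separate machines. Because a map task keeps no internal state, a failed attempt is handled by simply calling it again on the same input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch: threads stand in for worker machines.
class ParallelMapSketch {

    // Runs one stateless task, re-executing it from scratch if it throws.
    static <I, O> O runWithRetry(Function<I, O> task, I input, int maxAttempts) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.apply(input);
            } catch (RuntimeException e) {
                last = e; // no state to repair, so just try again
            }
        }
        throw new IllegalStateException("task failed " + maxAttempts + " times", last);
    }

    // Runs the same map task over every input split in parallel and
    // returns the results in split order.
    static <I, O> List<O> mapAll(List<I> splits, Function<I, O> task, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<O>> futures = new ArrayList<>();
            for (I split : splits) {
                Callable<O> attempt = () -> runWithRetry(task, split, 3);
                futures.add(pool.submit(attempt));
            }
            List<O> results = new ArrayList<>();
            for (Future<O> f : futures) {
                // Waiting for every map acts as the barrier before reduce starts.
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("map phase failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Re-execution is safe here only because the task has no side effects; a task that wrote partial output somewhere would need extra care.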

MapReduce implementation

• Distributed File System (DFS) + MapReduce (MR) engine

– Specifically, the MR engine uses a DFS

• Distributed file system

– Files split into large chunks and stored in the distributed file system (e.g., HDFS)

– Large chunks: typically 64MB per block

– Can have a master-slave architecture

• Master assigns and manages replicated blocks in the slaves
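
A small sketch of the bookkeeping this implies for the master: how many 64MB blocks a file needs, and which slaves hold the replicas of each block. The class BlockPlacementSketch and the round-robin placement are assumptions made purely for illustration; HDFS's actual placement policy is more involved.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of master-side block bookkeeping.
class BlockPlacementSketch {

    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64MB per block

    // Number of blocks needed for a file (the last block may be partially filled).
    static long blockCount(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Slaves chosen to hold the replicas of one block.
    // Round-robin is an assumption; it only guarantees distinct slaves
    // when replication <= slaveCount.
    static List<Integer> replicaSlaves(long blockIndex, int slaveCount, int replication) {
        if (replication > slaveCount) {
            throw new IllegalArgumentException("not enough slaves for distinct replicas");
        }
        List<Integer> slaves = new ArrayList<>();
        for (int r = 0; r < replication; r++) {
            slaves.add((int) ((blockIndex + r) % slaveCount));
        }
        return slaves;
    }
}
```

For example, a 200MB file needs blockCount(200L * 1024 * 1024) = 4 blocks, the last of which is only partly used.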

MapReduce engine

• Has a master-slave architecture

– Master co-ordinates the task execution across workers

– Workers perform the map() and reduce() functions

• Read blocks from and write blocks to the DFS

– Master keeps track of worker failures and reassigns tasks if necessary

• Failure detection usually done through timeouts
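
Timeout-based failure detection can be sketched as follows. The sketch assumes workers periodically report in to the master (often called a heartbeat, a term not used on the slide); the class HeartbeatSketch and its methods are hypothetical, and time is passed in explicitly so the behaviour is easy to follow.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the master's view of worker liveness.
class HeartbeatSketch {

    private final long timeoutMillis;
    private final Map<String, Long> lastSeen = new HashMap<>();

    HeartbeatSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called whenever a worker reports in.
    void heartbeat(String worker, long nowMillis) {
        lastSeen.put(worker, nowMillis);
    }

    // Workers silent for longer than the timeout are presumed failed;
    // the master would then reassign their tasks to other workers.
    List<String> presumedFailed(long nowMillis) {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
            if (nowMillis - e.getValue() > timeoutMillis) {
                failed.add(e.getKey());
            }
        }
        Collections.sort(failed); // stable order for reporting
        return failed;
    }
}
```

A timeout cannot distinguish a crashed worker from a slow or unreachable one, which is acceptable here because re-executing a stateless task is harmless.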


Some tips for designing MR jobs

• Reduce network traffic between map and reduce

– Model map() and reduce() jobs appropriately

– Use combine() functions

• combine(<k,[v]>) → <k,[v]>

• combine() executes after all map()s finish in each block

– map() → [same node] → combine() → [network] → reduce()

• Make map jobs of roughly equal expected execution times

• Try to make reduce() jobs less skewed
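
The effect of a combine() step can be illustrated with word count. In this sketch (class CombinerSketch and its method names are hypothetical), map() emits one <word, 1> pair per word in a block, and combine() sums those pairs on the same node, so fewer pairs would have to cross the network to reduce().

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of local pre-aggregation before the network transfer.
class CombinerSketch {

    // map(): emit <word, 1> for every word in one input block.
    static List<Map.Entry<String, Integer>> map(String block) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : block.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // combine(): runs on the map node and sums the counts per word,
    // leaving one pair per distinct word in the block.
    static List<Map.Entry<String, Integer>> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            sums.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return new ArrayList<>(sums.entrySet());
    }
}
```

For the block "a b a a", map() produces four pairs while combine() leaves two (<a,3> and <b,1>). This works for word count because addition is associative and commutative, so reduce() gets the same totals either way.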

Pros and cons of MapReduce

• Advantages

– Simple, easy-to-use distributed processing system

– Reasonably generic

– Exploits locality for performance

– Simple and less buggy implementation

• Issues

– Not a magic bullet that fits all problems

• Difficult to model iterative and recursive computations

– E.g.: k-means clustering

– Generate-Map-Reduce

• Difficult to model streaming computations

• Centralized entities like the master become bottlenecks

• Most real-world problems require large chains of MR jobs

Summary

• Today

– Distributed processing issues, MR programming model

– Sample MR job

– How MR can be implemented

– Pros and cons of MR, tips for better performance

• Tomorrow

– Details specific to Hadoop

– Downloading and setting up of Hadoop on a cluster

Ack: some images from: Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (January 2008), 107-113.

Hadoop components

• HDFS

– Master: NameNode

– Slave: DataNode

• MapReduce engine

– Master: JobTracker

– Slave: TaskTracker
