Covers: distributed processing issues, the MR programming model, a sample MR job, how MR can be implemented, pros and cons of MR, and tips for better performance
MapReduce basics
Harisankar H, PhD student, DOS lab, Dept. CSE, IIT Madras
6-Feb-2013
http://harisankarh.wordpress.com
Distributed processing?
• Processing distributed across multiple machines/servers
Image from: http://installornot.com/wp-content/uploads/google-datacenter-tech-13.jpg
Why distributed processing?
– Reduce execution time of large jobs
• E.g., extracting urls from terabytes of data
• In the ideal case, 1000 machines could finish the job close to 1000 times faster
– Fault-tolerance
• Other nodes will take over the jobs if some of the nodes fail
– Typically, if you have 10,000 servers, one will fail per day on average
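The failure figure above can be turned into a rough estimate for a single job. The sketch below is only back-of-the-envelope arithmetic: the class name FailureEstimate, the method expectedFailures, and the 1,000-server, 12-hour job are hypothetical, and it assumes failures are spread evenly over servers and time.

```java
// Back-of-the-envelope failure arithmetic (illustrative names, not from the slides).
class FailureEstimate {

    // Expected number of server failures during one job, assuming failures
    // are spread evenly over servers and time (a simplifying assumption).
    static double expectedFailures(int jobServers, double jobDays,
                                   int fleetServers, double fleetFailuresPerDay) {
        double perServerPerDay = fleetFailuresPerDay / fleetServers;
        return jobServers * jobDays * perServerPerDay;
    }

    public static void main(String[] args) {
        // The whole fleet from the slide: 10,000 servers, one failure per day.
        System.out.println(expectedFailures(10_000, 1.0, 10_000, 1.0));
        // A hypothetical job using 1,000 of those servers for 12 hours.
        System.out.println(expectedFailures(1_000, 0.5, 10_000, 1.0));
    }
}
```

Under these assumptions even a half-day job on a tenth of the fleet expects about 0.05 failures, so a platform that runs many such jobs has to treat failure as routine.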
Issues in distributed processing
• Realized traditionally using special-purpose implementations
– E.g., indexer, log processor
• Implementation really hard at socket programming level
– Fault-tolerance
• Keep track of failures, reassignment of tasks
– Hand-coded parallelization
– Scheduling across heterogeneous nodes
– Locality
• Minimise movement of data for computation
– How to distribute data?
• Results in:
– Complex, brittle, non-generic code
– Reimplementation of common features like fault-tolerance and distribution
Need for a generic abstraction for distributed processing
• Tradeoff between genericity and performance
– More generic => usually less performance
• MapReduce is probably a sweet spot that offers both to some extent
[Figure: separation of concerns. The abstraction sits between the app programmer, who expresses app logic, and the systems developer, who handles performance, fault handling etc.]
MapReduce abstraction (app programmer’s view)
• Model input and output as <key,value> pairs
• Provide map() and reduce() functions which act on <k,v> pairs
• Input: set of <k,v> pairs: {<k,v>}
– For each input <k,v>:
map(k1,v1) → list(k2,v2)
– For each unique output key from map:
reduce(k2, combined list(v2)) → list(v3)
System will take care of distributing the tasks across thousands of machines, handling locality, fault-tolerance etc.
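The model above can be sketched in a few lines of ordinary Java. This is a toy, single-process illustration of the abstraction and not Hadoop's API: the class MiniMapReduce and its methods run and wordCount are hypothetical names, and everything a real platform provides (distribution, locality, fault-tolerance) is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;

// Toy, single-process model of the MapReduce abstraction (not Hadoop's API).
class MiniMapReduce {

    // Applies map to every input pair, groups the intermediate values by key,
    // then applies reduce once per unique intermediate key.
    static <K1, V1, K2 extends Comparable<K2>, V2, V3> Map<K2, List<V3>> run(
            Map<K1, V1> input,
            BiFunction<K1, V1, List<Map.Entry<K2, V2>>> map,
            BiFunction<K2, List<V2>, List<V3>> reduce) {
        // Grouping step: collect all values emitted for the same key.
        Map<K2, List<V2>> grouped = new TreeMap<>();
        for (Map.Entry<K1, V1> in : input.entrySet()) {
            for (Map.Entry<K2, V2> out : map.apply(in.getKey(), in.getValue())) {
                grouped.computeIfAbsent(out.getKey(), k -> new ArrayList<>())
                       .add(out.getValue());
            }
        }
        Map<K2, List<V3>> result = new TreeMap<>();
        for (Map.Entry<K2, List<V2>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce.apply(e.getKey(), e.getValue()));
        }
        return result;
    }

    // Word count expressed against the toy model.
    static Map<String, List<Integer>> wordCount(Map<String, String> docs) {
        BiFunction<String, String, List<Map.Entry<String, Integer>>> map =
            (name, contents) -> {
                List<Map.Entry<String, Integer>> out = new ArrayList<>();
                for (String w : contents.trim().split("\\s+")) {
                    if (!w.isEmpty()) out.add(Map.entry(w, 1));
                }
                return out;
            };
        BiFunction<String, List<Integer>, List<Integer>> reduce =
            (word, counts) -> {
                int sum = 0;
                for (int c : counts) sum += c;
                return List.of(sum);
            };
        return run(docs, map, reduce);
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Map.of(
            "doc1", "to be or not to be",
            "doc2", "to do")));
    }
}
```

The application programmer writes only the two functions; the grouping loop in run stands in for the work the platform does between map and reduce.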
Example: word count
• Problem:
– Count the number of occurrences of each unique word in a big collection of documents
• Input <k,v> set:
– <document name, document contents>
• Organize the files in this format
• Output:
– <word, count>
• Get it in output files
• Next step:
– Define the map() and reduce() functions
Word count
map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, List values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
Program in Java
// word and one are presumably fields of the enclosing class (a Text and an IntWritable holding 1); they are not shown on the slide
public void map(LongWritable key, Text value, Context context) throws … {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
    }
}

public void reduce(Text key, Iterable<IntWritable> values, Context context) throws … {
    int sum = 0;
    for (IntWritable val : values) {
        sum += val.get();
    }
    context.write(key, new IntWritable(sum));
}
Implementing MapReduce abstraction
• Looked at the application programmer’s view
• Need a platform which implements the MapReduce abstraction
• Hadoop is the popular open-source implementation of the MapReduce abstraction
• Questions for the platform developer
– How to
• parallelize?
• handle faults?
• provide locality?
• distribute the data?
Basics of platform implementation
• parallelize?
– Each map can be executed independently in parallel
– After all maps have finished execution, all reduces can be executed in parallel
• handle faults?
– map() and reduce() have no internal state
• Simply re-execute in case of a failure
• distribute the data?
– Have a distributed file system (HDFS)
• provide locality?
– Prefer to execute map() on the nodes holding the input <k,v> pairs
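The first two answers above (run maps in parallel, re-execute on failure) can be illustrated with a thread pool standing in for a cluster. This is a minimal sketch under stated assumptions: RetryingMapRunner, runAll, runWithRetry and flakyWordCount are hypothetical names, threads play the role of worker nodes, and the first-attempt crash is injected artificially.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Threads stand in for worker nodes; this is not how Hadoop schedules tasks.
class RetryingMapRunner {

    // Runs all tasks in parallel. A task that throws is simply re-executed,
    // which is safe only because the tasks keep no internal state.
    static <T> List<T> runAll(List<Callable<T>> tasks, int workers, int maxAttempts) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<T>> futures = new ArrayList<>();
            for (Callable<T> task : tasks) {
                Callable<T> retrying = () -> runWithRetry(task, maxAttempts);
                futures.add(pool.submit(retrying));
            }
            List<T> results = new ArrayList<>();
            for (Future<T> f : futures) {
                results.add(f.get()); // results come back in task order
            }
            return results;
        } catch (ExecutionException e) {
            throw new IllegalStateException("task failed on every attempt", e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } finally {
            pool.shutdown();
        }
    }

    static <T> T runWithRetry(Callable<T> task, int maxAttempts) throws Exception {
        if (maxAttempts < 1) throw new IllegalArgumentException("maxAttempts must be >= 1");
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e; // treat as a failed worker and run the same task again
            }
        }
        throw last;
    }

    // Counts the words in one input split. The counter is only a failure
    // injector that makes the first attempt crash; it is not map() state.
    static Callable<Integer> flakyWordCount(String split) {
        AtomicInteger calls = new AtomicInteger();
        return () -> {
            if (calls.incrementAndGet() == 1) {
                throw new IllegalStateException("simulated crash");
            }
            String s = split.trim();
            return s.isEmpty() ? 0 : s.split("\\s+").length;
        };
    }

    public static void main(String[] args) {
        List<Callable<Integer>> tasks = List.of(
            flakyWordCount("to be or not to be"),
            flakyWordCount("to do"));
        System.out.println(runAll(tasks, 2, 3));
    }
}
```

Because each task depends only on its input split, running it a second time gives the same answer as an attempt that never failed, which is what makes blind re-execution a valid recovery strategy.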
MapReduce implementation
• Distributed File System (DFS) + MapReduce (MR) engine
– Specifically, the MR engine uses a DFS
• Distributed file system
– Files are split into large chunks and stored in the distributed file system (e.g., HDFS)
– Large chunks: typically 64MB per block
– Can have a master-slave architecture
• Master assigns and manages replicated blocks in the slaves
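To make the block arithmetic concrete, the sketch below splits a file into 64MB blocks and spreads replicas across nodes. The names BlockPlanner, blockCount and placeReplicas are hypothetical, the replication factor of 3 used in the example is an assumption (the slides give no number), and the round-robin placement is a stand-in for whatever policy a real master uses.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative block bookkeeping; not the HDFS implementation.
class BlockPlanner {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64MB, as on the slide

    // Number of blocks needed for a file; the last block may be partial.
    static long blockCount(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Places each block's replicas on distinct nodes, round-robin.
    // Entry b of the result lists the node ids that hold block b.
    static List<List<Integer>> placeReplicas(long blocks, int nodes, int replication) {
        if (replication > nodes) {
            throw new IllegalArgumentException("need at least as many nodes as replicas");
        }
        List<List<Integer>> placement = new ArrayList<>();
        for (long b = 0; b < blocks; b++) {
            List<Integer> replicas = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                replicas.add((int) ((b + r) % nodes));
            }
            placement.add(replicas);
        }
        return placement;
    }

    public static void main(String[] args) {
        long blocks = blockCount(200L * 1024 * 1024); // a hypothetical 200MB file
        System.out.println(blocks + " blocks");
        System.out.println(placeReplicas(blocks, 5, 3)); // 5 nodes, 3 replicas (assumed)
    }
}
```

Keeping several copies of each block serves both goals from earlier slides: a node failure does not lose data, and the scheduler has more than one node where a map task can run next to its input.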
MapReduce engine
• Has a master-slave architecture
– Master co-ordinates the task execution across workers
– Workers perform the map() and reduce() functions
• Read blocks from and write blocks to the DFS
– Master keeps track of worker failures and reassigns tasks if necessary
• Failure detection usually done through timeouts
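Timeout-based failure detection and task reassignment can be sketched as follows. All names here (HeartbeatMonitor, heartbeat, failedWorkers, reassign) are hypothetical, timestamps are passed in explicitly so the behaviour is deterministic, and the round-robin reassignment is an assumption rather than the policy of any particular engine.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Master-side bookkeeping sketch; not the Hadoop implementation.
class HeartbeatMonitor {
    private final long timeoutMillis;
    private final Map<String, Long> lastSeen = new HashMap<>();

    HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called whenever a worker reports in.
    void heartbeat(String worker, long nowMillis) {
        lastSeen.put(worker, nowMillis);
    }

    // Workers whose last heartbeat is older than the timeout are presumed failed.
    List<String> failedWorkers(long nowMillis) {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
            if (nowMillis - e.getValue() > timeoutMillis) failed.add(e.getKey());
        }
        Collections.sort(failed);
        return failed;
    }

    // Returns a copy of the task table in which tasks owned by failed workers
    // have been handed to live workers, round-robin.
    static Map<String, String> reassign(Map<String, String> taskToWorker,
                                        List<String> failed, List<String> live) {
        if (live.isEmpty()) throw new IllegalArgumentException("no live workers");
        Map<String, String> updated = new TreeMap<>(taskToWorker);
        int next = 0;
        for (Map.Entry<String, String> e : updated.entrySet()) {
            if (failed.contains(e.getValue())) {
                e.setValue(live.get(next++ % live.size()));
            }
        }
        return updated;
    }

    public static void main(String[] args) {
        HeartbeatMonitor monitor = new HeartbeatMonitor(6_000);
        monitor.heartbeat("w1", 0);
        monitor.heartbeat("w2", 0);
        monitor.heartbeat("w1", 5_000); // w2 stays silent
        List<String> failed = monitor.failedWorkers(10_000);
        System.out.println(failed);
        System.out.println(reassign(
            Map.of("m1", "w1", "m2", "w2", "m3", "w2"), failed, List.of("w1", "w3")));
    }
}
```

A timeout cannot tell a crashed worker from a slow one, so a task may occasionally run twice; that is acceptable because map() and reduce() can be re-executed safely.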
Some tips for designing MR jobs
• Reduce network traffic between map and reduce
– Model map() and reduce() jobs appropriately
– Use combine() functions
• combine(<k,[v]>) → <k,[v]>
• combine() executes after all map()s finish in each block
– map() → [same node] → combine() → [network] → reduce()
• Design map jobs to have roughly equal expected execution times
• Try to make reduce() jobs less skewed
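The effect of combine() on network traffic can be seen with word count. In this sketch the names CombinerDemo, map, combine and reduce are hypothetical plain-Java stand-ins rather than Hadoop classes. Reusing the summing logic as a combiner works here because addition is associative and commutative; that does not hold for every reduce function.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of local pre-aggregation; not Hadoop's combiner API.
class CombinerDemo {

    // map(): emit <word, 1> for each word in one input block.
    static List<Map.Entry<String, Integer>> map(String block) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : block.trim().split("\\s+")) {
            if (!w.isEmpty()) out.add(Map.entry(w, 1));
        }
        return out;
    }

    // combine(): pre-aggregate one node's map output before it crosses the network.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> combined = new TreeMap<>();
        for (Map.Entry<String, Integer> e : mapOutput) {
            combined.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        return combined;
    }

    // reduce(): merge the partial counts received from all nodes.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, count) -> totals.merge(word, count, Integer::sum));
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> raw = map("to be or not to be");
        Map<String, Integer> local = combine(raw);
        System.out.println("pairs without combine: " + raw.size());
        System.out.println("pairs with combine:    " + local.size());
        System.out.println(reduce(List.of(local, combine(map("to do")))));
    }
}
```

The saving grows with repetition inside a block: a block that contains the same word a million times sends one pair for it instead of a million.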
Pros and cons of MapReduce
• Advantages
– Simple, easy-to-use distributed processing system
– Reasonably generic
– Exploits locality for performance
– Simple and less buggy implementation
• Issues
– Not a magic bullet that fits all problems
• Difficult to model iterative and recursive computations
– E.g.: k-means clustering
– Generate-Map-Reduce
• Difficult to model streaming computations
• Centralized entities like the master become bottlenecks
• Most real-world problems require large chains of MR jobs
Summary
• Today
– Distributed processing issues, MR programming model
– Sample MR job
– How MR can be implemented
– Pros and cons of MR, tips for better performance
• Tomorrow
– Details specific to Hadoop
– Downloading and setting up of Hadoop on a cluster
Ack: some images from: Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (January 2008), 107-113.
Hadoop components
• HDFS
– Master: NameNode
– Slave: DataNode
• MapReduce engine
– Master: JobTracker
– Slave: TaskTracker