
Page 1: Hadoop MapReduce framework - Module 3

Hadoop MapReduce framework

Page 2: Hadoop MapReduce framework - Module 3

Hadoop Data Types

Package: org.apache.hadoop.io (API docs: http://hadoop.apache.org/docs/current/api/index.html)

• int -> IntWritable, long -> LongWritable, boolean -> BooleanWritable, float -> FloatWritable, byte -> ByteWritable

We can use the following built-in data types as keys and values:

• Text: stores UTF-8 text

• BytesWritable: stores a sequence of bytes (ByteWritable stores a single byte)

• VIntWritable and VLongWritable: store variable-length int and long values

• NullWritable: a zero-length Writable type that can be used when you don't want to use a key or value

• A key class should implement the WritableComparable interface.

• A value class should implement the Writable interface.

E.g.

public class IntWritable implements WritableComparable<IntWritable>

public interface WritableComparable<T> extends Writable, Comparable<T>
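As a sketch of what these interfaces require, here is a minimal custom key type; the class name and its single field are illustrative, not from the slides:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical custom key type illustrating the WritableComparable contract.
public class YearWritable implements WritableComparable<YearWritable> {
    private int year;

    public void set(int year) { this.year = year; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(year);                 // serialize the field
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        year = in.readInt();                // deserialize in the same order
    }

    @Override
    public int compareTo(YearWritable other) {
        return Integer.compare(year, other.year);  // sort order used in shuffle & sort
    }

    @Override
    public int hashCode() {
        return year;                        // used by HashPartitioner to pick a reducer
    }
}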

Page 3: Hadoop MapReduce framework - Module 3

Hadoop Data Types Contd..

Page 4: Hadoop MapReduce framework - Module 3

MapReduce paradigm

• Splits input files into blocks (typically 64 MB each)

• Operates on key/value pairs

• Mappers filter & transform input data

• Reducers aggregate the mappers' output

• Efficient way to process the cluster:
  • Move code to data
  • Run code on all machines

• Divide & conquer: partition a large problem into smaller sub-problems
  • Independent sub-problems can be executed in parallel by workers (anything from threads to clusters)
  • Intermediate results from each worker are combined to get the final result
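The paradigm is commonly summarized by the type signatures of the two functions:

map:    (K1, V1)       -> list(K2, V2)
reduce: (K2, list(V2)) -> list(K3, V3)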

Page 5: Hadoop MapReduce framework - Module 3

MapReduce paradigm contd..

• Challenges:
  • How to transform a problem into sub-problems?
  • How to assign workers and synchronize the intermediate results?
  • How do the workers get the required data?
  • How to handle failures in the cluster?

Page 6: Hadoop MapReduce framework - Module 3

Map and Reduce tasks

Page 7: Hadoop MapReduce framework - Module 3

Shuffle and Sort

Page 8: Hadoop MapReduce framework - Module 3

MapReduce Execution Framework

Page 9: Hadoop MapReduce framework - Module 3

Combiners

• Combiner: local aggregation of key/value pairs after map() and before the shuffle & sort phase (occurs on the same machine as map())

• Also called “mini-reducer”

• Instead of emitting 100 times (the,1), the combiner emits (the,100)

• Can lead to great speed-ups and save network bandwidth

• Each combiner operates in isolation and has no access to other mappers' key/value pairs

• A combiner cannot be assumed to process all values associated with the same key (it may not run at all; that is Hadoop's decision)

• The combiner's input and output key/value types must match the mapper's output types

Page 10: Hadoop MapReduce framework - Module 3

Combiners contd..

• If the function computed is
  • Commutative [a + b = b + a]
  • Associative [a + (b + c) = (a + b) + c]

  then the reducer can be reused as the combiner (see the driver snippet below).

• Max works:  max(max(a, b), max(c, d, e)) = max(a, b, c, d, e)

• Mean does not:  mean(mean(a, b), mean(c, d, e)) != mean(a, b, c, d, e)
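Word count's sum reducer is both commutative and associative, so it can double as the combiner with one extra line in the driver (a sketch, using the WordCountReducer defined later in this module):

job.setCombinerClass(WordCountReducer.class);  // local aggregation on each map node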

Page 11: Hadoop MapReduce framework - Module 3

MapReduce Programming: Word Count

WordCountDriver.java

public class WordCountDriver extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        Configuration conf = new Configuration();
        GenericOptionsParser parser = new GenericOptionsParser(conf, args);
        args = parser.getRemainingArgs();

        Job job = new Job(conf, "wordcount");
        job.setJarByClass(WordCountDriver.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.setInputPaths(job, new Path("E:\\aa\\input\\names.txt"));
        FileOutputFormat.setOutputPath(job, new Path("E:\\aa\\output\\"));

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // Exit code 0 conventionally signals success to the shell.
        return job.waitForCompletion(true) ? 0 : 1;
    }
}

Page 12: Hadoop MapReduce framework - Module 3

MapReduce Programming: Word Count

WordCountMapper.java

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private Text word = new Text();
    private final static IntWritable one = new IntWritable(1);

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);       // emit (word, 1) for every token
        }
    }
}

Page 13: Hadoop MapReduce framework - Module 3

MapReduce Programming: Word Count

WordCountReducer.java

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();             // add up all counts for this word
        }
        context.write(key, new IntWritable(sum));
    }
}
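Assuming the three classes are packaged into a jar and the driver is given a main() that calls ToolRunner.run (as the minimal driver at the end of this module does), the job would be launched with something like:

hadoop jar wordcount.jar WordCountDriver

The input and output paths are hardcoded in this driver; in practice they would be passed in as arguments.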

Page 14: Hadoop MapReduce framework - Module 3

A minimal MapReduce driver

public class MinimalMapReduceWithDefaults extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        Job job = new Job(getConf());

        // Every setting below is spelled out explicitly but is already the default.
        job.setInputFormatClass(TextInputFormat.class);

        job.setMapperClass(Mapper.class);
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);

        job.setPartitionerClass(HashPartitioner.class);
        job.setNumReduceTasks(1);

        job.setReducerClass(Reducer.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        job.setOutputFormatClass(TextOutputFormat.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new MinimalMapReduceWithDefaults(), args);
        System.exit(exitCode);
    }
}
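Because Mapper and Reducer are the identity classes, this job simply passes records through: with TextInputFormat, every output line is a byte offset followed by the original line, tab-separated. For illustrative input:

0	first input line
17	second input line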

Page 15: Hadoop MapReduce framework - Module 3

Input Splits and Records

• An input split is a chunk of the input that is processed by a single map.

• Each map processes a single split.

• Each split is divided into records, and the map processes each record—a key-value pair—in turn.

public abstract class InputSplit {

public abstract long getLength() throws IOException, InterruptedException;

public abstract String[] getLocations() throws IOException, InterruptedException;

}

Page 16: Hadoop MapReduce framework - Module 3

InputFormat

• An InputFormat is responsible for creating the input splits and dividing them into records.

public abstract class InputFormat<K, V> {

public abstract List<InputSplit> getSplits(JobContext context) throws IOException, InterruptedException;

public abstract RecordReader<K, V> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException;

}

Page 17: Hadoop MapReduce framework - Module 3

InputFormat class hierarchy

Page 18: Hadoop MapReduce framework - Module 3

FileInputFormat

• A place to define which files are included as the input to a job.

• An implementation for generating splits for the input files.

FileInputFormat input paths

public static void addInputPath(Job job, Path path)

public static void setInputPaths(Job job, Path... inputPaths)

FileInputFormat input splits

splitSize = max(minimumSize, min(maximumSize, blockSize))

By default, minimumSize < blockSize < maximumSize, so the split size is the block size.

Page 19: Hadoop MapReduce framework - Module 3

How to control the split size?
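A sketch of the usual knobs, assuming the new MapReduce API (the sizes are illustrative; the properties behind these setters are mapreduce.input.fileinputformat.split.minsize and .maxsize in Hadoop 2.x):

// Raise the minimum split size above the block size for fewer, larger splits.
FileInputFormat.setMinInputSplitSize(job, 128 * 1024 * 1024L);
// Cap the maximum split size for more, smaller splits.
FileInputFormat.setMaxInputSplitSize(job, 32 * 1024 * 1024L);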

Page 20: Hadoop MapReduce framework - Module 3

Text Input: TextInputFormat

• TextInputFormat is the default InputFormat.

• Each record is a line of input.

• The key, a LongWritable, is the byte offset within the file of the beginning of the line.

• The value is the contents of the line, excluding any line terminators (newline, carriage return), and is packaged as a Text object.
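For example, a file containing the two lines below (illustrative contents):

hello world
hadoop

is presented to the mapper as the records (0, "hello world") and (12, "hadoop"), since "hello world" plus its newline occupies bytes 0 through 11.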

Page 21: Hadoop MapReduce framework - Module 3

Binary Input: SequenceFileInputFormat

• Hadoop’s sequence file format stores sequences of binary key-value pairs.

• Sequence files are well suited as a format for MapReduce data since they are splittable.

• They support compression as part of the format.
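A minimal driver snippet for consuming sequence files (a sketch; the mapper's input types shown in the comment are assumptions and must match whatever the file actually stores):

job.setInputFormatClass(SequenceFileInputFormat.class);
// The mapper's input key/value types must match the pairs stored in the file,
// e.g. Mapper<IntWritable, Text, ...> for a file of (IntWritable, Text) records.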

Page 22: Hadoop MapReduce framework - Module 3

Multiple Inputs

• MultipleInputs lets each input path be processed by its own InputFormat and Mapper:

MultipleInputs.addInputPath(job, ABCInputPath,TextInputFormat.class, MapperABC.class);

MultipleInputs.addInputPath(job, XYZInputPath, TextInputFormat.class, MapperXYZ.class);

Page 23: Hadoop MapReduce framework - Module 3

Output Formats Class Hierarchy

Page 24: Hadoop MapReduce framework - Module 3

Output Types

• Text Output
  • The default output format, TextOutputFormat, writes records as lines of text.

• Binary Output
  • SequenceFileOutputFormat writes sequence files for its output.

• Multiple Outputs
  • MultipleOutputs allows you to write data to files whose names are derived from the output keys and values, or in fact from an arbitrary string (see the sketch below).
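A sketch of MultipleOutputs inside a reducer; deriving the file name from the key itself is an illustrative choice, not the only option:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class PartitionedReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private MultipleOutputs<Text, IntWritable> mos;

    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        // Third argument is the base output path; here each key gets its own file.
        mos.write(key, new IntWritable(sum), key.toString());
    }

    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();                        // flush and close all open outputs
    }
}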