
Page 1: Hadoop MapReduce framework - Module 3

Hadoop MapReduce framework

Page 2: Hadoop MapReduce framework - Module 3

Hadoop Data Types

Package: org.apache.hadoop.io (API docs: http://hadoop.apache.org/docs/current/api/index.html)

• int -> IntWritable, long -> LongWritable, boolean -> BooleanWritable, float -> FloatWritable, byte -> ByteWritable

We can use the following built-in data types as keys and values:

• Text: stores UTF-8 text

• BytesWritable: stores a sequence of bytes (ByteWritable stores a single byte)

• VIntWritable and VLongWritable: store variable-length int and long values

• NullWritable: a zero-length Writable type that can be used when you don't want to use a key or value

• A key class should implement the WritableComparable interface.

• A value class should implement the Writable interface.

E.g.

public class IntWritable implements WritableComparable<IntWritable>

public interface WritableComparable<T> extends Writable, Comparable<T>
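As a sketch of what these interfaces require, here is a minimal custom key type; the class name and its single field are illustrative, not from the slides:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical custom key type illustrating the WritableComparable contract.
public class YearWritable implements WritableComparable<YearWritable> {
    private int year;

    public void set(int year) { this.year = year; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(year);                 // serialize the field
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        year = in.readInt();                // deserialize in the same order
    }

    @Override
    public int compareTo(YearWritable other) {
        return Integer.compare(year, other.year);  // sort order used in shuffle & sort
    }

    @Override
    public int hashCode() {
        return year;                        // used by HashPartitioner to pick a reducer
    }
}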

Page 3: Hadoop MapReduce framework - Module 3

Hadoop Data Types Contd..

Page 4: Hadoop MapReduce framework - Module 3

MapReduce paradigm

• Splits input files into blocks (typically 64 MB each)

• Operates on key/value pairs

• Mappers filter & transform input data

• Reducers aggregate the mappers' output

• Efficient way to process the cluster:
  • Move code to data
  • Run code on all machines

• Divide & conquer: partition a large problem into smaller sub-problems
  • Independent sub-problems can be executed in parallel by workers (anything from threads to clusters)
  • Intermediate results from each worker are combined to get the final result
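The paradigm is commonly summarized by the type signatures of the two functions:

map:    (K1, V1)       -> list(K2, V2)
reduce: (K2, list(V2)) -> list(K3, V3)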

Page 5: Hadoop MapReduce framework - Module 3

MapReduce paradigm contd..

• Challenges:
  • How to transform a problem into sub-problems?
  • How to assign workers and synchronize the intermediate results?
  • How do the workers get the required data?
  • How to handle failures in the cluster?

Page 6: Hadoop MapReduce framework - Module 3

Map and Reduce tasks

Page 7: Hadoop MapReduce framework - Module 3

Shuffle and Sort

Page 8: Hadoop MapReduce framework - Module 3

MapReduce Execution Framework

Page 9: Hadoop MapReduce framework - Module 3

Combiners

• Combiner: local aggregation of key/value pairs after map() and before the shuffle & sort phase (occurs on the same machine as map())

• Also called “mini-reducer”

• Instead of emitting 100 times (the,1), the combiner emits (the,100)

• Can lead to great speed-ups and save network bandwidth

• Each combiner operates in isolation and has no access to other mappers' key/value pairs

• A combiner cannot be assumed to process all values associated with the same key (it may not run at all; that is Hadoop's decision)

• The combiner's input and output key/value types must match the mapper's output types

Page 10: Hadoop MapReduce framework - Module 3

Combiners contd..

• If the function computed is
  • Commutative [a + b = b + a]
  • Associative [a + (b + c) = (a + b) + c]

  then the reducer can be reused as the combiner (see the driver snippet below).

• Max works:  max(max(a, b), max(c, d, e)) = max(a, b, c, d, e)

• Mean does not:  mean(mean(a, b), mean(c, d, e)) != mean(a, b, c, d, e)
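Word count's sum reducer is both commutative and associative, so it can double as the combiner with one extra line in the driver (a sketch, using the WordCountReducer defined later in this module):

job.setCombinerClass(WordCountReducer.class);  // local aggregation on each map node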

Page 11: Hadoop MapReduce framework - Module 3

MapReduce Programming: Word Count

WordCountDriver.java

public class WordCountDriver extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        Configuration conf = new Configuration();
        GenericOptionsParser parser = new GenericOptionsParser(conf, args);
        args = parser.getRemainingArgs();

        Job job = new Job(conf, "wordcount");
        job.setJarByClass(WordCountDriver.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.setInputPaths(job, new Path("E:\\aa\\input\\names.txt"));
        FileOutputFormat.setOutputPath(job, new Path("E:\\aa\\output\\"));

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // Exit code 0 conventionally signals success to the shell.
        return job.waitForCompletion(true) ? 0 : 1;
    }
}

Page 12: Hadoop MapReduce framework - Module 3

MapReduce Programming: Word Count

WordCountMapper.java

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private Text word = new Text();
    private final static IntWritable one = new IntWritable(1);

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);       // emit (word, 1) for every token
        }
    }
}

Page 13: Hadoop MapReduce framework - Module 3

MapReduce Programming: Word Count

WordCountReducer.java

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();             // add up all counts for this word
        }
        context.write(key, new IntWritable(sum));
    }
}
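Assuming the three classes are packaged into a jar and the driver is given a main() that calls ToolRunner.run (as the minimal driver at the end of this module does), the job would be launched with something like:

hadoop jar wordcount.jar WordCountDriver

The input and output paths are hardcoded in this driver; in practice they would be passed in as arguments.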

Page 14: Hadoop MapReduce framework - Module 3

A minimal MapReduce driver

public class MinimalMapReduceWithDefaults extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        Job job = new Job(getConf());

        // Every setting below is spelled out explicitly but is already the default.
        job.setInputFormatClass(TextInputFormat.class);

        job.setMapperClass(Mapper.class);
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);

        job.setPartitionerClass(HashPartitioner.class);
        job.setNumReduceTasks(1);

        job.setReducerClass(Reducer.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        job.setOutputFormatClass(TextOutputFormat.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new MinimalMapReduceWithDefaults(), args);
        System.exit(exitCode);
    }
}
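Because Mapper and Reducer are the identity classes, this job simply passes records through: with TextInputFormat, every output line is a byte offset followed by the original line, tab-separated. For illustrative input:

0	first input line
17	second input line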

Page 15: Hadoop MapReduce framework - Module 3

Input Splits and Records

• An input split is a chunk of the input that is processed by a single map.

• Each map processes a single split.

• Each split is divided into records, and the map processes each record—a key-value pair—in turn.

public abstract class InputSplit {

public abstract long getLength() throws IOException, InterruptedException;

public abstract String[] getLocations() throws IOException, InterruptedException;

}

Page 16: Hadoop MapReduce framework - Module 3

InputFormat

• An InputFormat is responsible for creating the input splits and dividing them into records.

public abstract class InputFormat<K, V> {

public abstract List<InputSplit> getSplits(JobContext context) throws IOException, InterruptedException;

public abstract RecordReader<K, V> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException;

}

Page 17: Hadoop MapReduce framework - Module 3

InputFormat class hierarchy

Page 18: Hadoop MapReduce framework - Module 3

FileInputFormat

• A place to define which files are included as the input to a job.

• An implementation for generating splits for the input files.

FileInputFormat input paths

public static void addInputPath(Job job, Path path)

public static void setInputPaths(Job job, Path... inputPaths)

FileInputFormat input splits

splitSize = max(minimumSize, min(maximumSize, blockSize))

By default, minimumSize < blockSize < maximumSize, so the split size is the block size.

Page 19: Hadoop MapReduce framework - Module 3

How to control the split size?
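A sketch of the usual knobs, assuming the new MapReduce API (the sizes are illustrative; the properties behind these setters are mapreduce.input.fileinputformat.split.minsize and .maxsize in Hadoop 2.x):

// Raise the minimum split size above the block size for fewer, larger splits.
FileInputFormat.setMinInputSplitSize(job, 128 * 1024 * 1024L);
// Cap the maximum split size for more, smaller splits.
FileInputFormat.setMaxInputSplitSize(job, 32 * 1024 * 1024L);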

Page 20: Hadoop MapReduce framework - Module 3

Text Input: TextInputFormat

• TextInputFormat is the default InputFormat.

• Each record is a line of input.

• The key, a LongWritable, is the byte offset within the file of the beginning of the line.

• The value is the contents of the line, excluding any line terminators (newline, carriage return), and is packaged as a Text object.
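For example, a file containing the two lines below (illustrative contents):

hello world
hadoop

is presented to the mapper as the records (0, "hello world") and (12, "hadoop"), since "hello world" plus its newline occupies bytes 0 through 11.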

Page 21: Hadoop MapReduce framework - Module 3

Binary Input: SequenceFileInputFormat

• Hadoop’s sequence file format stores sequences of binary key-value pairs.

• Sequence files are well suited as a format for MapReduce data since they are splittable.

• They support compression as part of the format.
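A minimal driver snippet for consuming sequence files (a sketch; the mapper's input types shown in the comment are assumptions and must match whatever the file actually stores):

job.setInputFormatClass(SequenceFileInputFormat.class);
// The mapper's input key/value types must match the pairs stored in the file,
// e.g. Mapper<IntWritable, Text, ...> for a file of (IntWritable, Text) records.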

Page 22: Hadoop MapReduce framework - Module 3

Multiple Inputs

• MultipleInputs lets each input path be processed by its own InputFormat and Mapper:

MultipleInputs.addInputPath(job, ABCInputPath,TextInputFormat.class, MapperABC.class);

MultipleInputs.addInputPath(job, XYZInputPath, TextInputFormat.class, MapperXYZ.class);

Page 23: Hadoop MapReduce framework - Module 3

Output Formats Class Hierarchy

Page 24: Hadoop MapReduce framework - Module 3

Output Types

• Text Output
  • The default output format, TextOutputFormat, writes records as lines of text.

• Binary Output
  • SequenceFileOutputFormat writes sequence files for its output.

• Multiple Outputs
  • MultipleOutputs allows you to write data to files whose names are derived from the output keys and values, or in fact from an arbitrary string (see the sketch below).
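A sketch of MultipleOutputs inside a reducer; deriving the file name from the key itself is an illustrative choice, not the only option:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class PartitionedReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private MultipleOutputs<Text, IntWritable> mos;

    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        // Third argument is the base output path; here each key gets its own file.
        mos.write(key, new IntWritable(sum), key.toString());
    }

    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();                        // flush and close all open outputs
    }
}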