Hadoop Tutorial
Cloudera, Inc.
220 Portage Avenue
Palo Alto, CA 94306
US: 1-888-789-1488
Intl: 1-650-362-0488
www.cloudera.com
Important Notice
© 2010-2013 Cloudera, Inc. All rights reserved.
Cloudera, the Cloudera logo, Cloudera Impala, Impala, and any other product or service names or slogans contained in this document, except as otherwise disclaimed, are trademarks of Cloudera and its suppliers or licensors, and may not be copied, imitated or used, in whole or in part, without the prior written permission of Cloudera or the applicable trademark holder.
Hadoop and the Hadoop elephant logo are trademarks of the Apache Software Foundation. All other trademarks, registered trademarks, product names and company names or logos mentioned in this document are the property of their respective owners. Reference to any products, services, processes or other information, by trade name, trademark, manufacturer, supplier or otherwise does not constitute or imply endorsement, sponsorship or recommendation thereof by us.
Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Cloudera.
Cloudera may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this document. Except as expressly provided in any written license agreement from Cloudera, the furnishing of this document does not give you any license to these patents, trademarks, copyrights, or other intellectual property.
The information in this document is subject to change without notice. Cloudera shall not be liable for any damages resulting from technical errors or omissions which may be present in this document, or from use of this document.
Date: February 25, 2013
Contents
PURPOSE
PREREQUISITES
OVERVIEW
INPUTS AND OUTPUTS
EXAMPLE: WORDCOUNT V1.0
    SOURCE CODE
    USAGE
    WALK-THROUGH
MAPREDUCE USER INTERFACES
    PAYLOAD
        Mapper
        Reducer
        Partitioner
        Reporter
        OutputCollector
    JOB CONFIGURATION
    TASK EXECUTION AND ENVIRONMENT
        Memory Management
        Map Parameters
        Shuffle/Reduce Parameters
        Task JVM Reuse
    JOB SUBMISSION AND MONITORING
        Job Control
    JOB INPUT
        InputSplit
        RecordReader
    JOB OUTPUT
        OutputCommitter
        Task Side-Effect Files
        RecordWriter
    OTHER USEFUL FEATURES
        Submitting Jobs to Queues
        Counters
        DistributedCache
        Tool
        IsolationRunner
        Profiling
        Debugging
        JobControl
        Data Compression
        Skipping Bad Records
EXAMPLE: WORDCOUNT V2.0
    SOURCE CODE
    SAMPLE RUNS
    HIGHLIGHTS
Purpose
This document describes the most important user-facing facets of the Apache Hadoop MapReduce
framework and serves as a tutorial. Apache Hadoop MapReduce consists of client APIs for writing
applications and a runtime on which to run the applications. There are two versions of the API: old and new, and two versions of the runtime: MRv1 and MRv2. This tutorial describes the old API and MRv1.
Prerequisites
Ensure that CDH is installed, configured, and running. The easiest way to get going quickly is to use a
demo VM:
CDH4 Demo VM
CDH3 Demo VM
Overview
Hadoop MapReduce is a software framework for writing applications that process vast amounts of data
(multi-petabyte datasets) in parallel on large clusters consisting of thousands of nodes of commodity
hardware in a reliable, fault-tolerant manner.
A MapReduce job usually splits the input dataset into independent chunks which are processed by the
map tasks in a parallel manner. The framework sorts the outputs of the maps, which are then input to
the reduce tasks. Typically both the input and the output of the job are stored in a distributed
filesystem.
Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework
and the Hadoop Distributed File System (HDFS) are running on the same set of nodes. This configuration
allows the framework to effectively schedule tasks on the nodes where data is already present, resulting
in very high aggregate bandwidth across the cluster.
The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster
node. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring the
tasks, and re-executing failed tasks. The slaves execute the tasks as directed by the master.
Minimally, applications specify the input/output locations and supply map and reduce functions via implementations of appropriate interfaces and/or abstract classes. These locations and functions and other job parameters comprise the job configuration. The Hadoop job client then submits the job (JAR/executable, and so on) and configuration to the JobTracker, which then distributes the software/configuration to the slaves, schedules and monitors the tasks, and provides status and diagnostic information to the job client.
The Hadoop framework is implemented in Java™, and you can develop MapReduce applications in Java or any JVM-based language, or use one of the following interfaces:
Hadoop Streaming - a utility that allows you to create and run jobs with any executables (for example, shell utilities) as the mapper and/or the reducer.
Hadoop Pipes - a SWIG-compatible (not based on JNI™) C++ API to implement MapReduce applications.
This tutorial describes applications written using the old Java MapReduce API.
The old API is in the org.apache.hadoop.mapred package. The new API is in the org.apache.hadoop.mapreduce package.
Inputs and Outputs
The MapReduce framework operates exclusively on key-value pairs, that is, the framework views the input to the job as a set of key-value pairs and produces a set of key-value pairs as the output of the job. The output pairs can be of different types than the input pairs.
The key and value classes must be serializable by the framework and hence must implement the Writable interface. Additionally, the key classes must implement the WritableComparable interface to facilitate sorting.
Input and Output types of a MapReduce job:
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
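To make these requirements concrete, here is a minimal sketch of a custom composite key; the WordYear type and its fields are hypothetical, not part of Hadoop:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical composite key: a (word, year) pair usable as a MapReduce key.
public class WordYear implements WritableComparable<WordYear> {
    private String word = "";
    private int year;

    // Serialize the fields in a fixed order...
    public void write(DataOutput out) throws IOException {
        out.writeUTF(word);
        out.writeInt(year);
    }

    // ...and deserialize them in exactly the same order.
    public void readFields(DataInput in) throws IOException {
        word = in.readUTF();
        year = in.readInt();
    }

    // Defines the sort order the framework applies to keys.
    public int compareTo(WordYear other) {
        int cmp = word.compareTo(other.word);
        if (cmp != 0) {
            return cmp;
        }
        return year < other.year ? -1 : (year == other.year ? 0 : 1);
    }
}

Values only need Writable (write and readFields); the compareTo method is what makes a class usable as a key.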
Example: WordCount v1.0
Before we jump into the details, let's walk through an example MapReduce application, WordCount, to
get a flavor for how MapReduce works. WordCount is a simple application that counts the number of
occurrences of each word in an input set.
Source Code
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class WordCount {

    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
Table 1 WordCount.java
Usage
Compile WordCount.java:
$ mkdir wordcount_classes
$ javac -cp <classpath> -d wordcount_classes WordCount.java
where <classpath> is:
CDH4 - /usr/lib/hadoop/*:/usr/lib/hadoop/client-0.20/*
CDH3 - /usr/lib/hadoop-0.20/hadoop-0.20.2-cdh3u4-core.jar
Create a JAR:
$ jar -cvf wordcount.jar -C wordcount_classes/ .
Assuming that:
/user/cloudera/wordcount/input - input directory in HDFS
/user/cloudera/wordcount/output - output directory in HDFS
Create sample text files as input and move to HDFS:
$ echo "Hello World Bye World" > file0
$ echo "Hello Hadoop Goodbye Hadoop" > file1
$ hadoop fs -mkdir /user/cloudera /user/cloudera/wordcount /user/cloudera/wordcount/input
$ hadoop fs -put file* /user/cloudera/wordcount/input
Run the application:
$ hadoop jar wordcount.jar org.myorg.WordCount /user/cloudera/wordcount/input /user/cloudera/wordcount/output
Output:
$ hadoop fs -cat /user/cloudera/wordcount/output/part-00000
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2
Applications can specify a comma-separated list of paths that will be present in the current working directory of the task using the option -files. The -libjars option allows applications to add JARs to the classpaths of the maps and reduces. The -archives option allows them to pass archives as arguments; these are unzipped/unjarred, and a link with the name of the zip/JAR is created in the current working directory of tasks. More details about the command line options are available in the Hadoop Command Guide.
Running the wordcount example with -libjars and -files:
hadoop jar hadoop-examples.jar wordcount -files cachefile.txt -libjars mylib.jar input output
Walk-through
The WordCount application is quite straightforward.
The Mapper implementation, via its map method, processes one line at a time, as provided by the specified TextInputFormat. It then splits the line into tokens separated by whitespace, via the StringTokenizer, and emits a key-value pair of <word, 1>.
For the given sample input the first map emits:
<Hello, 1>
<World, 1>
<Bye, 1>
<World, 1>
The second map emits:
<Hello, 1>
<Hadoop, 1>
<Goodbye, 1>
<Hadoop, 1>
We'll learn more about the number of maps spawned for a given job, and how to control them in a fine-grained manner, a bit later in the tutorial.
WordCount also specifies a combiner. Hence, the output of each map is passed through the local combiner (which is the same as the Reducer, as per the job configuration) for local aggregation, after being sorted on the keys.
The output of the first map:
<Bye, 1>
<Hello, 1>
<World, 2>
The output of the second map:
<Goodbye, 1>
<Hadoop, 2>
<Hello, 1>
The Reducer implementation, via its reduce method, just sums up the values, which are the occurrence counts for each key (that is, words in this example).
Thus the output of the job is:
<Bye, 1>
<Goodbye, 1>
<Hadoop, 2>
<Hello, 2>
<World, 2>
The main method specifies various facets of the job, such as the input/output paths (passed via the command line), key-value types, input/output formats, and so on, in the JobConf. It then calls JobClient.runJob to submit the job and monitor its progress.
We'll learn more about JobConf, JobClient, Tool, and other interfaces and classes a bit later in the tutorial.
MapReduce User Interfaces
This section provides a reasonable amount of detail on every user-facing aspect of the MapReduce
framework. This should help you implement, configure and tune your jobs in a fine-grained manner.
However, note that the javadoc for each class/interface remains the most comprehensive
documentation available; this is only meant to be a tutorial.
Let's first take the Mapper and Reducer interfaces. Applications typically implement them to provide the map and reduce methods.
We will then discuss other core interfaces including JobConf, JobClient, Partitioner, OutputCollector, Reporter, InputFormat, OutputFormat, OutputCommitter, and others.
Finally, we will wrap up by discussing some useful features of the framework such as the DistributedCache, IsolationRunner, and so on.
Payload
Applications typically implement the Mapper and Reducer interfaces to provide the map and reduce methods. These form the core of the job.
Mapper
Mapper maps input key-value pairs to a set of intermediate key-value pairs. Maps are the individual tasks that transform input records into intermediate records. The transformed intermediate records do not need to be of the same type as the input records. A given input pair may map to zero or many output pairs.
The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job.
Overall, Mapper implementations are passed the JobConf for the job via the JobConfigurable.configure(JobConf) method and override it to initialize themselves. The framework then calls map(WritableComparable, Writable, OutputCollector, Reporter) for each key-value pair in the InputSplit for that task. Applications can then override the Closeable.close() method to perform any required cleanup.
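As a sketch of this lifecycle (the myapp.prefix key and the filtering logic are hypothetical), a Mapper that reads a setting in configure() and cleans up in close() might look like:

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class PrefixFilterMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private String prefix;

    // Called once per task with the job's JobConf; read per-job settings here.
    public void configure(JobConf job) {
        prefix = job.get("myapp.prefix", "");
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        if (value.toString().startsWith(prefix)) {
            output.collect(value, new IntWritable(1));
        }
    }

    // Called once after the last map() call; release any resources here.
    public void close() throws IOException {
    }
}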
Output pairs do not need to be of the same types as input pairs. A given input pair may map to zero or
many output pairs. Output pairs are collected with calls to
OutputCollector.collect(WritableComparable, Writable).
Applications can use the Reporter to report progress, set application-level status messages and update Counters, or just indicate that they are alive.
All intermediate values associated with a given output key are subsequently grouped by the framework, and passed to the Reducer(s) to determine the final output. You can control the grouping by specifying a Comparator via JobConf.setOutputKeyComparatorClass(Class).
The Mapper outputs are sorted and then partitioned per Reducer. The total number of partitions is the same as the number of reduce tasks for the job. You can control which keys (and hence records) go to which Reducer by implementing a custom Partitioner.
You can optionally specify a combiner, via JobConf.setCombinerClass(Class), to perform local aggregation of the intermediate outputs, which helps to cut down the amount of data transferred from the Mapper to the Reducer.
The intermediate, sorted outputs are always stored in a simple (key-len, key, value-len, value) format. Applications can control if, and how, the intermediate outputs are to be compressed and the CompressionCodec to be used via the JobConf.
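For example, turning on compression of the intermediate map outputs takes two JobConf calls (GzipCodec is just one choice of CompressionCodec):

// In the job driver, with conf in scope:
conf.setCompressMapOutput(true);
conf.setMapOutputCompressorClass(org.apache.hadoop.io.compress.GzipCodec.class);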
How Many Maps?
The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of
the input files.
The right level of parallelism for maps seems to be around 10-100 maps per node, although it has been set up to 300 maps for very CPU-light map tasks. Task setup takes a while, so it is best if the maps take at least a minute to execute.
Thus, if you expect 10TB of input data and have a block size of 128MB, you'll end up with 82,000 maps, unless setNumMapTasks(int) (which only provides a hint to the framework) is used to set it even higher.
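For instance, a job wanting more maps than the block-driven default could hint:

// In the job driver, with conf in scope:
conf.setNumMapTasks(100000);  // a hint only; the InputFormat ultimately decides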
Reducer
Reducer reduces a set of intermediate values that share a key to a smaller set of values.
You set the number of reducers for the job via JobConf.setNumReduceTasks(int).
Overall, Reducer implementations are passed the JobConf for the job via the JobConfigurable.configure(JobConf) method and can override it to initialize themselves. The framework then calls the reduce(WritableComparable, Iterator, OutputCollector, Reporter) method for each <key, (list of values)> pair in the grouped inputs. Applications can then override the Closeable.close() method to perform any required cleanup.
Reducer has 3 primary phases: shuffle, sort, and reduce.
Shuffle
Input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the
relevant partition of the output of all the mappers, via HTTP.
Sort
The framework groups Reducer inputs by keys (since different mappers may have output the same key)
in this stage.
The shuffle and sort phases occur simultaneously; while map outputs are being fetched they are merged.
Secondary Sort
If equivalence rules for grouping the intermediate keys are required to be different from those for grouping keys before reduction, then one may specify a Comparator via JobConf.setOutputValueGroupingComparator(Class). Since JobConf.setOutputKeyComparatorClass(Class) can be used to control how intermediate keys are grouped, these can be used in conjunction to simulate secondary sort on values.
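As a sketch of a grouping comparator (the "primary#secondary" key layout is hypothetical), keys could be sorted on their full value but grouped only on the part before the separator:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class PrimaryGroupingComparator extends WritableComparator {
    public PrimaryGroupingComparator() {
        super(Text.class, true);  // true: instantiate keys for compare()
    }

    // Group on the primary part only, so all secondaries for one primary
    // arrive together in a single reduce() call.
    public int compare(WritableComparable a, WritableComparable b) {
        String pa = a.toString().split("#", 2)[0];
        String pb = b.toString().split("#", 2)[0];
        return pa.compareTo(pb);
    }
}

It would be registered with conf.setOutputValueGroupingComparator(PrimaryGroupingComparator.class).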
Reduce
In this phase the reduce(WritableComparable, Iterator, OutputCollector, Reporter) method is called for each <key, (list of values)> pair in the grouped inputs.
The output of the reduce task is typically written to the FileSystem via OutputCollector.collect(WritableComparable, Writable).
Applications can use the Reporter to report progress, set application-level status messages and update Counters, or just indicate that they are alive.
The output of the Reducer is not sorted.
How Many Reduces?
The right number of reduces seems to be 0.95 or 1.75 multiplied by (number of nodes * mapred.tasktracker.reduce.tasks.maximum).
With 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75 the faster nodes will finish their first round of reduces and launch a second wave of reduces, doing a much better job of load balancing.
Increasing the number of reduces increases the framework overhead, but increases load balancing and
lowers the cost of failures.
The scaling factors above are slightly less than whole numbers to reserve a few reduce slots in the
framework for speculative tasks and failed tasks.
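A sketch of deriving the reduce count from cluster capacity via the old API's JobClient (the 0.95 factor gives a single wave of reduces):

// In the job driver, with conf in scope; both calls can throw IOException.
JobClient client = new JobClient(conf);
ClusterStatus cluster = client.getClusterStatus();
// 0.95 x total reduce slots: all reduces launch immediately as maps finish.
int reduces = (int) (0.95 * cluster.getMaxReduceTasks());
conf.setNumReduceTasks(reduces);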
Reducer NONE
It is legal to set the number of reduce tasks to zero if no reduction is desired.
In this case the outputs of the map tasks go directly to the filesystem, into the output path set by setOutputPath(Path). The framework does not sort the map outputs before writing them out to the filesystem.
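Configuring such a map-only job is a one-liner:

// In the job driver, with conf in scope:
conf.setNumReduceTasks(0);  // map output goes straight to the output path, unsorted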
Partitioner
Partitioner partitions the key space. The Partitioner controls the partitioning of the keys of the intermediate map outputs. The key (or a subset of the key) is used to derive the partition, typically by a hash function. The total number of partitions is the same as the number of reduce tasks for the job. Hence this controls which of the m reduce tasks the intermediate key (and hence the record) is sent to for reduction.
HashPartitioner is the default Partitioner.
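As a sketch (the first-letter scheme is hypothetical, purely for illustration), a custom Partitioner routing words by their first character might look like:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {
    public void configure(JobConf job) {
        // No per-job setup needed for this scheme.
    }

    // All keys with the same first character land in the same partition,
    // and hence in the same reduce task.
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (key.getLength() == 0) {
            return 0;
        }
        return key.charAt(0) % numPartitions;
    }
}

It would be registered with conf.setPartitionerClass(FirstLetterPartitioner.class).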
Reporter
Reporter is a facility for MapReduce applications to report progress, set application-level status messages and update Counters.
Mapper and Reducer implementations can use the Reporter to report progress or just indicate that they are alive. In scenarios where the application takes a significant amount of time to process individual key-value pairs, this is crucial since the framework might assume that the task has timed out and kill that task. Another way to avoid this is to set the configuration parameter mapred.task.timeout to a high enough value (or even set it to zero for no timeouts).
Applications can also update Counters using the Reporter.
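A sketch of a map method using these facilities (the counter group and name are hypothetical):

public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
    reporter.setStatus("processing offset " + key);
    if (value.getLength() == 0) {
        reporter.incrCounter("MyApp", "EMPTY_LINES", 1);
        return;
    }
    // ... slow per-record work here ...
    reporter.progress();  // tells the framework the task is still alive
    output.collect(value, new IntWritable(1));
}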
OutputCollector
OutputCollector is a generalization of the facility provided by the MapReduce framework to collect data output by the Mapper or the Reducer (either the intermediate outputs or the output of the job).
Hadoop MapReduce comes bundled with a library of generally useful mappers, reducers, and partitioners.
Job Configuration
JobConf represents a MapReduce job configuration. JobConf is the primary interface to describe a MapReduce job to the Hadoop framework for execution. The framework tries to faithfully execute the job as described by JobConf; however:
Some configuration parameters may have been marked as final by administrators and hence cannot be altered.
While some job parameters are straightforward to set (for example, setNumReduceTasks(int)), other parameters interact subtly with the rest of the framework and/or job configuration and are more complex to set (for example, setNumMapTasks(int)).
JobConf is typically used to specify the Mapper, Combiner (if any), Partitioner, Reducer, InputFormat, OutputFormat and OutputCommitter implementations. JobConf also indicates the set of input files (setInputPaths(JobConf, Path...)/addInputPath(JobConf, Path) and setInputPaths(JobConf, String)/addInputPaths(JobConf, String)) and where the output files should be written (setOutputPath(Path)).
Optionally, JobConf is used to specify other advanced facets of the job such as the Comparator to be used, files to be put in the DistributedCache, whether intermediate and/or job outputs are to be compressed (and how), debugging via user-provided scripts (setMapDebugScript(String)/setReduceDebugScript(String)), whether job tasks can be executed in a speculative manner (setMapSpeculativeExecution(boolean)/setReduceSpeculativeExecution(boolean)), the maximum number of attempts per task (setMaxMapAttempts(int)/setMaxReduceAttempts(int)), the percentage of task failures that can be tolerated by the job (setMaxMapTaskFailuresPercent(int)/setMaxReduceTaskFailuresPercent(int)), etc.
Of course, you can use set(String, String)/get(String, String) to set/get arbitrary parameters needed by applications. However, use the DistributedCache for large amounts of (read-only) data.
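A sketch pulling a few of these knobs together (values are illustrative, not recommendations; MyJob and myapp.prefix are hypothetical):

JobConf conf = new JobConf(MyJob.class);
conf.setMapSpeculativeExecution(true);
conf.setReduceSpeculativeExecution(false);
conf.setMaxMapAttempts(4);
conf.setMaxReduceAttempts(4);
conf.setMaxMapTaskFailuresPercent(5);  // tolerate up to 5% failed map tasks
conf.set("myapp.prefix", "INFO");      // arbitrary application parameter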
Task Execution and Environment
TaskTracker executes Mapper and Reducer tasks as child processes in separate JVMs.
The child task inherits the environment of the parent TaskTracker. You can specify additional options to the child JVM via the mapred.child.java.opts configuration parameter in the JobConf, such as non-standard paths for the runtime linker to search shared libraries via -Djava.library.path=, and so on. If the mapred.child.java.opts property contains the symbol @taskid@, it is interpolated with the value of taskid of the MapReduce task.
Here is an example property with multiple arguments and substitutions, showing JVM GC logging and the start of a passwordless JVM JMX agent so that it can connect with jconsole and the like to watch child
memory, threads, and get thread dumps. The property also sets the maximum heap size of the child JVM to 512MB and adds an additional path to the java.library.path of the child JVM.
<property>
  <name>mapred.child.java.opts</name>
  <value>
    -Xmx512M -Djava.library.path=/home/mycompany/lib -verbose:gc -Xloggc:/tmp/@taskid@.gc
    -Dcom.sun.management.jmxremote.authenticate=false
    -Dcom.sun.management.jmxremote.ssl=false
  </value>
</property>
Memory Management
You can also specify the maximum virtual memory of the launched child task, and any subprocess it launches recursively, using mapred.child.ulimit. Note that the value set here is a per-process limit. The value for mapred.child.ulimit should be specified in kilobytes (KB) and must be greater than or equal to the -Xmx value passed to the Java VM.
mapred.child.java.opts is used only for configuring the launched child tasks from the TaskTracker. Configuring the memory options for daemons is documented in Configuring the Environment of the Hadoop Daemons.
The memory available to some parts of the framework is also configurable. In map and reduce tasks,
performance may be influenced by adjusting parameters that affect the concurrency of operations and the frequency with which data hits disk. Monitoring the filesystem counters for a job, particularly relative to byte counts from the map and into the reduce, is invaluable to the tuning of these parameters.
You can choose to override default limits of virtual memory and RAM enforced by the task tracker, if
memory management is enabled. You can set the following parameters per job:
Name: mapred.task.maxvmem
Type: int
Description: A number, in bytes, that represents the maximum virtual memory task limit for each task of the job. A task will be killed if it consumes more virtual memory than this number.

Name: mapred.task.maxpmem
Type: int
Description: A number, in bytes, that represents the maximum RAM task limit for each task of the job. This number can be optionally used by schedulers to prevent over-scheduling of tasks on a node based on RAM needs.
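These are plain job properties, so a sketch of setting them per job (byte values illustrative) is:

// In the job driver, with conf in scope:
conf.setLong("mapred.task.maxvmem", 2L * 1024 * 1024 * 1024);  // 2 GB virtual memory
conf.setLong("mapred.task.maxpmem", 1L * 1024 * 1024 * 1024);  // 1 GB RAM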
Map Parameters
A record emitted from a map will be serialized into a buffer and metadata will be stored into accounting
buffers. As described in the following options, when either the serialization buffer or the metadata
exceed a threshold, the contents of the buffers will be sorted and written to disk in the background
while the map continues to output records. If either buffer fills completely while the spill is in progress,
the map thread will block. When the map is finished, any remaining records are written to disk and all
on-disk segments are merged into a single file. Minimizing the number of spills to disk can decrease map
time, but a larger buffer also decreases the memory available to the mapper.
Name: io.sort.mb
Type: int
Description: The cumulative size of the serialization and accounting buffers storing records emitted from the map, in megabytes.

Name: io.sort.record.percent
Type: float
Description: The ratio of serialization to accounting space can be adjusted. Each serialized record requires 16 bytes of accounting information in addition to its serialized size to effect the sort. This percentage of space allocated from io.sort.mb affects the probability of a spill to disk being caused by either exhaustion of the serialization buffer or the accounting space. Clearly, for a map outputting small records, a higher value than the default will likely decrease the number of spills to disk.

Name: io.sort.spill.percent
Type: float
Description: This is the threshold for the accounting and serialization buffers. When this percentage of either buffer has filled, their contents will be spilled to disk in the background. Let io.sort.record.percent be r, io.sort.mb be x, and this value be q. The maximum number of records collected before the collection thread will spill is r * x * q * 2^16. Note that a higher value may decrease the number of merges (or even eliminate them), but will also increase the probability of the map task getting blocked. The lowest average map times are usually obtained by accurately estimating the size of the map output and preventing multiple spills.
Other notes
- If either spill threshold is exceeded while a spill is in progress, collection will continue until the spill is finished. For example, if io.sort.spill.percent is set to 0.33, and the remainder of the buffer is filled while the spill runs, the next spill will include all the collected records, or 0.66 of the buffer, and will not generate additional spills. In other words, the thresholds define triggers, not blocking.
- A record larger than the serialization buffer will first trigger a spill, then be spilled to a separate file. It is undefined whether or not this record will first pass through the combiner.
Shuffle/Reduce Parameters
As described previously, each reduce fetches the output assigned to it by the Partitioner via HTTP into
memory and periodically merges these outputs to disk. If intermediate compression of map outputs is
turned on, each output is decompressed into memory. The following options affect the frequency of
these merges to disk prior to the reduce and the memory allocated to map output during the reduce.
io.sort.factor (int): Specifies the number of segments on disk to be merged at the same time. It limits the number of open files and compression codecs during the merge. If the number of files exceeds this limit, the merge will proceed in several passes. Though this limit also applies to the map, most jobs should be configured so that hitting this limit is unlikely there.

mapred.inmem.merge.threshold (int): The number of sorted map outputs fetched into memory before being merged to disk. Like the spill thresholds in the preceding note, this does not define a unit of partition, but a trigger. In practice, this is usually set very high (1000) or disabled (0), since merging in-memory segments is often less expensive than merging from disk (see the notes following this table). This threshold influences only the frequency of in-memory merges during the shuffle.

mapred.job.shuffle.merge.percent (float): The memory threshold for fetched map outputs before an in-memory merge is started, expressed as a percentage of the memory allocated to storing map outputs in memory. Since map outputs that can't fit in memory can be stalled, setting this high may decrease parallelism between the fetch and merge. Conversely, values as high as 1.0 have been effective for reduces whose input can fit entirely in memory. This parameter influences only the frequency of in-memory merges during the shuffle.

mapred.job.shuffle.input.buffer.percent (float): The percentage of memory, relative to the maximum heap size as typically specified in mapred.child.java.opts, that can be allocated to storing map outputs during the shuffle. Though some memory should be set aside for the framework, in general it is advantageous to set this high enough to store large and numerous map outputs.

mapred.job.reduce.input.buffer.percent (float): The percentage of memory, relative to the maximum heap size, in which map outputs may be retained during the reduce. When the reduce begins, map outputs will be merged to disk until those that remain are under the resource limit this defines. By default, all map outputs are merged to disk before the reduce begins, to maximize the memory available to the reduce. For less memory-intensive reduces, this should be increased to avoid trips to disk.
Other notes
- If a map output is larger than 25 percent of the memory allocated to copying map outputs, it will be written directly to disk without first staging through memory.
- When running with a combiner, the reasoning about high merge thresholds and large buffers may not hold. For merges started before all map outputs have been fetched, the combiner is run while spilling to disk. In some cases, one can obtain better reduce times by spending resources combining map outputs (making disk spills small and parallelizing spilling and fetching) rather than by aggressively increasing buffer sizes.
- When merging in-memory map outputs to disk to begin the reduce, if an intermediate merge is necessary because there are segments to spill and at least io.sort.factor segments are already on disk, the in-memory map outputs will be part of the intermediate merge.
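As a sketch, a job whose reduce input fits comfortably in memory might be tuned as follows (the values are illustrative assumptions, not recommendations):

    // Assumes a JobConf named conf.
    conf.setInt("io.sort.factor", 100);                              // merge up to 100 segments per pass
    conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.70f); // 70% of heap for fetched map outputs
    conf.setFloat("mapred.job.reduce.input.buffer.percent", 0.80f);  // retain outputs in memory during the reduce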
Task JVM Reuse
Jobs can enable task JVMs to be reused by specifying the job configuration property mapred.job.reuse.jvm.num.tasks. If the value is 1 (the default), then JVMs are not reused (that is, 1 task per JVM). If it is -1, there is no limit to the number of tasks (of the same job) a JVM can run. You can also specify some value greater than 1 using the method JobConf.setNumTasksToExecutePerJvm(int).
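For example, to remove the limit so that each JVM is reused for as many tasks of the job as possible (a minimal sketch; MyJob is a hypothetical driver class):

    JobConf conf = new JobConf(MyJob.class);
    conf.setNumTasksToExecutePerJvm(-1); // same effect as setting mapred.job.reuse.jvm.num.tasks to -1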
The following properties are localized in the job configuration for each task's execution:
mapred.job.id (String): The job id
mapred.jar (String): job.jar location in the job directory
job.local.dir (String): The job-specific shared scratch space
mapred.tip.id (String): The task id
mapred.task.id (String): The task attempt id
mapred.task.is.map (boolean): Whether this is a map task
mapred.task.partition (int): The id of the task within the job
map.input.file (String): The filename that the map is reading from
map.input.start (long): The offset of the start of the map input split
map.input.length (long): The number of bytes in the map input split
mapred.work.output.dir (String): The task's temporary output directory
The standard output (stdout) and error (stderr) streams of the task are read by the TaskTracker and logged to ${HADOOP_LOG_DIR}/userlogs.

DistributedCache can also be used to distribute both JARs and native libraries for use in the map and/or reduce tasks. The child JVM always has its current working directory added to the java.library.path and LD_LIBRARY_PATH. Hence the cached libraries can be loaded via System.loadLibrary or System.load. More details on how to load shared libraries through the distributed cache are documented at Loading native libraries through DistributedCache.
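The sketch below shows both ideas together: reading localized properties and loading a cached native library from a task's configure() method. The class name, the library name mylib, and the trivial map body are assumptions for illustration.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class NativeLibMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      public void configure(JobConf job) {
        // Localized, per-task properties are available from the task's JobConf.
        String taskAttemptId = job.get("mapred.task.id");
        String inputFile = job.get("map.input.file");
        // Load a library (libmylib.so) distributed via DistributedCache; it
        // resolves through java.library.path, which includes the task's
        // current working directory.
        System.loadLibrary("mylib");
      }

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        output.collect(value, value); // trivial pass-through body for the sketch
      }
    }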
Job Submission and Monitoring
JobClient is the primary interface by which a user job interacts with the JobTracker. JobClient provides facilities to submit jobs, track their progress, access component task reports and logs, get the MapReduce cluster's status information, and so on.

The job submission process involves:

1. Checking the input and output specifications of the job.
2. Computing the InputSplit values for the job.
3. Setting up the requisite accounting information for the DistributedCache of the job, if necessary.
4. Copying the job's JAR file and configuration to the MapReduce system directory on the filesystem.
5. Submitting the job to the JobTracker and optionally monitoring its status.

Job history files are also logged to the user-specified directory hadoop.job.history.user.location, which defaults to the job output directory. The files are stored in _logs/history/ in the specified directory. Hence, by default they will be in mapred.output.dir/_logs/history. You can stop logging by giving the value none for hadoop.job.history.user.location.

You can view the history logs summary in the specified directory using the following command:

$ hadoop job -history output-dir

This command will print job details, and failed and killed tip details.

More details about the job, such as successful tasks and the task attempts made for each task, can be viewed using the following command:

$ hadoop job -history all output-dir
You can use OutputLogFilter to filter log files from the output directory listing.

Normally, you create the application, describe various facets of the job via JobConf, and then use JobClient to submit the job and monitor its progress.
Job Control
You may need to chain MapReduce jobs to accomplish complex tasks which cannot be done via a single MapReduce job. This is fairly easy, since the output of a job typically goes to the distributed filesystem, and that output, in turn, can be used as the input for the next job.

However, this also means that the onus of ensuring that jobs are complete (success/failure) lies squarely on the clients. In such cases, the various job control options are:

- runJob(JobConf): Submits the job and returns only after the job has completed.
- submitJob(JobConf): Only submits the job; then poll the returned handle to the RunningJob to query status and make scheduling decisions.
- JobConf.setJobEndNotificationURI(String): Sets up a notification upon job completion, thus avoiding polling.

You can also use Oozie to implement chains of MapReduce jobs.
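As a minimal chaining sketch, assuming hypothetical FirstJob and SecondJob driver classes and placeholder paths, the blocking runJob(JobConf) call makes sequencing straightforward:

    // Run the first job; runJob blocks and throws IOException if the job fails.
    JobConf first = new JobConf(FirstJob.class);
    FileInputFormat.setInputPaths(first, new Path("in"));
    FileOutputFormat.setOutputPath(first, new Path("intermediate"));
    JobClient.runJob(first);

    // Feed the first job's output to the second job.
    JobConf second = new JobConf(SecondJob.class);
    FileInputFormat.setInputPaths(second, new Path("intermediate"));
    FileOutputFormat.setOutputPath(second, new Path("out"));
    JobClient.runJob(second);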
Job Input
InputFormat describes the input specification for a MapReduce job. The MapReduce framework relies on the InputFormat of the job to:
1. Validate the input specification of the job.
2. Split up the input file(s) into logical InputSplit instances, each of which is then assigned to an individual Mapper.
3. Provide the RecordReader implementation used to glean input records from the logical InputSplit for processing by the Mapper.

The default behavior of file-based InputFormat implementations, typically subclasses of FileInputFormat, is to split the input into logical InputSplit instances based on the total size, in bytes, of the input files. However, the FileSystem block size of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via mapred.min.split.size.

Clearly, logical splits based on input size are insufficient for many applications, since record boundaries must be respected. In such cases, the application should implement a RecordReader, which is responsible for respecting record boundaries and presenting a record-oriented view of the logical InputSplit to the individual task.

TextInputFormat is the default InputFormat.

If TextInputFormat is the InputFormat for a given job, the framework detects input files with the .gz extension and automatically decompresses them using the appropriate CompressionCodec. However, it must be noted that compressed files with this extension cannot be split, and each compressed file is processed in its entirety by a single mapper.
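A small sketch of configuring job input on a JobConf (the input path is a placeholder; setting TextInputFormat explicitly is redundant, since it is the default, but it is shown for clarity):

    conf.setInputFormat(TextInputFormat.class);               // the default, shown explicitly
    conf.setLong("mapred.min.split.size", 64L * 1024 * 1024); // lower bound of 64 MB per split
    FileInputFormat.setInputPaths(conf, new Path("in"));      // "in" is a placeholder path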
InputSplit
InputSplit represents the data to be processed by an individual Mapper. Typically InputSplit presents a byte-oriented view of the input, and it is the responsibility of RecordReader to process it and present a record-oriented view.

FileSplit is the default InputSplit. It sets map.input.file to the path of the input file for the logical split.
RecordReader
RecordReader reads <key, value> pairs from an InputSplit. Typically, RecordReader converts the byte-oriented view of the input provided by the InputSplit and presents a record-oriented view to the Mapper implementations for processing. RecordReader thus assumes the responsibility of processing record boundaries and presents the tasks with keys and values.
Job Output
OutputFormat describes the output specification for a MapReduce job. The MapReduce framework relies on the OutputFormat of the job to:

1. Validate the output specification of the job; for example, check that the output directory doesn't already exist.
2. Provide the RecordWriter implementation used to write the output files of the job. Output files are stored in a FileSystem.

TextOutputFormat is the default OutputFormat.
OutputCommitter
OutputCommitter describes the commit of task output for a MapReduce job. The MapReduce framework relies on the OutputCommitter of the job to:

1. Set up the job during initialization. For example, create the temporary output directory for the job during the initialization of the job. Job setup is done by a separate task when the job is in the PREP state, after initializing tasks. Once the setup task completes, the job is moved to the RUNNING state.
2. Clean up the job after job completion. For example, remove the temporary output directory after job completion. Job cleanup is done by a separate task at the end of the job. The job is declared SUCCEEDED/FAILED/KILLED after the cleanup task completes.
3. Set up the task's temporary output. Task setup is done as part of the same task, during task initialization.
4. Check whether a task needs a commit. This avoids the commit procedure for tasks that do not need it.
5. Commit the task output. Once a task is done, the task commits its output if required.
6. Discard the task commit. If the task has failed or been killed, the output will be cleaned up. If the task could not clean up (in its exception block), a separate task will be launched with the same attempt id to do the cleanup.

FileOutputCommitter is the default OutputCommitter. Job setup/cleanup tasks occupy map or reduce slots, whichever is free, on the TaskTracker. The JobCleanup task, TaskCleanup tasks, and JobSetup task have the highest priority, in that order.
Task Side-Effect Files
In some applications, component tasks need to create and/or write to side files, which differ from the actual job output files.

In such cases, there could be issues with two instances of the same Mapper or Reducer running simultaneously (for example, speculative tasks) trying to open and/or write to the same file (path) on the FileSystem. Hence you must pick unique names per task attempt (using the attempt id, say attempt_200709221812_0001_m_000000_0), not just per task.

To avoid these issues the MapReduce framework, when the OutputCommitter is FileOutputCommitter, maintains a special ${mapred.output.dir}/_temporary/_${taskid} sub-directory, accessible via ${mapred.work.output.dir}, for each task attempt on the FileSystem where the output of the task attempt is stored. On successful completion of the task attempt, the files in ${mapred.output.dir}/_temporary/_${taskid} (only) are promoted to ${mapred.output.dir}. Of course, the framework discards the sub-directory of unsuccessful task attempts. This process is completely transparent to the application.

The application writer can take advantage of this feature by creating any side files required in ${mapred.work.output.dir} during execution of a task via FileOutputFormat.getWorkOutputPath(), and the framework will promote them similarly for successful task attempts, thus eliminating the need to pick unique paths per task attempt.

Note: The value of ${mapred.work.output.dir} during execution of a particular task attempt is actually ${mapred.output.dir}/_temporary/_${taskid}, and this value is set by the MapReduce framework. So, just create any side files in the path returned by FileOutputFormat.getWorkOutputPath() from the MapReduce task to take advantage of this feature.

The entire discussion holds true for maps of jobs with reducer=NONE (that is, 0 reduces), since the output of the map, in that case, goes directly to HDFS.
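The sketch below, meant to run inside a task with its JobConf in scope as job, writes a hypothetical side file side.txt to the work directory; the framework then promotes it on success:

    // ${mapred.work.output.dir} for this task attempt:
    Path workDir = FileOutputFormat.getWorkOutputPath(job);
    FileSystem fs = workDir.getFileSystem(job);
    FSDataOutputStream side = fs.create(new Path(workDir, "side.txt"));
    side.writeBytes("side-effect data\n");
    side.close();
    // On successful completion of the task attempt, the framework promotes
    // side.txt into ${mapred.output.dir}; failed attempts are discarded.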
RecordWriter
RecordWriter writes the output key-value pairs to an output file. RecordWriter implementations write the job outputs to the filesystem.
Other Useful Features
Submitting Jobs to Queues
You submit jobs to queues. Queues, as collections of jobs, allow the system to provide specific functionality. For example, queues use ACLs to control which users can submit jobs to them. Queues are expected to be used primarily by Hadoop schedulers.

Hadoop comes configured with a single mandatory queue, called default. Queue names are defined in the mapred.queue.names property of the Hadoop site configuration. Some job schedulers, such as the Capacity Scheduler, support multiple queues.

A job defines the queue it needs to be submitted to through the mapred.job.queue.name property, or through the setQueueName(String) API. Setting the queue name is optional. If a job is submitted without an associated queue name, it is submitted to the default queue.
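For example (a one-line sketch; conf is the job's JobConf):

    conf.setQueueName("default"); // equivalent to setting the mapred.job.queue.name property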
Counters
Counters represent global counters, defined either by the MapReduce framework or by applications. Each Counter can be of any Enum type. Counters of a particular Enum are bunched into groups of type Counters.Group.

Applications can define arbitrary Counters (of type Enum) and update them via Reporter.incrCounter(Enum, long) or Reporter.incrCounter(String, String, long) in the map and/or reduce methods. These counters are then globally aggregated by the framework.
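As a sketch, the mapper below defines a hypothetical Records enum and updates its counters from the map method; the framework aggregates the final values across all tasks:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class CountingMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      // Each enum value becomes a counter; the enum type names the group.
      static enum Records { PARSED, MALFORMED }

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        if (value.toString().trim().isEmpty()) {
          reporter.incrCounter(Records.MALFORMED, 1); // globally aggregated by the framework
          return;
        }
        reporter.incrCounter(Records.PARSED, 1);
        output.collect(value, new IntWritable(1));
      }
    }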
DistributedCache
DistributedCache distributes application-specific, large, read-only files efficiently. DistributedCache is a facility provided by the MapReduce framework to cache files (text, archives, JARs, and so on) needed by applications.

Applications specify the files to be cached via URLs (hdfs://) in the JobConf. The DistributedCache assumes that the files specified via hdfs:// URLs are already present on the FileSystem.

The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node. Its efficiency stems from the fact that the files are copied only once per job, and from the ability to cache archives, which are un-archived on the slaves.

DistributedCache tracks the modification timestamps of the cached files. Clearly, the cached files should not be modified by the application or externally while the job is executing.

DistributedCache can be used to distribute simple, read-only data/text files and more complex types such as archives and JAR files. Archives (zip, tar, tgz, and tar.gz files) are un-archived at the slave nodes. Files have execution permissions set.
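A minimal sketch of caching one file and one archive at job submission time. The hdfs:// paths are hypothetical placeholders, and handling of the URISyntaxException that new URI(...) can throw is elided; the calls use org.apache.hadoop.filecache.DistributedCache and java.net.URI:

    // At job submission time; conf is the job's JobConf.
    // The paths are placeholders and must already exist on HDFS.
    DistributedCache.addCacheFile(
        new URI("hdfs://namenode:8020/cache/lookup.dat"), conf);
    DistributedCache.addCacheArchive(
        new URI("hdfs://namenode:8020/cache/dicts.zip"), conf); // un-archived on the slaves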
The