Hadoop Tutorial



    Hadoop Tutorial

    Cloudera, Inc.

    220 Portage Avenue

    Palo Alto, CA 94306

    [email protected]

    US: 1-888-789-1488

Intl: 1-650-362-0488

    www.cloudera.com


    Important Notice

© 2010-2013 Cloudera, Inc. All rights reserved.

Cloudera, the Cloudera logo, Cloudera Impala, Impala, and any other product or service names or slogans contained in this document, except as otherwise disclaimed, are trademarks of Cloudera and its suppliers or licensors, and may not be copied, imitated or used, in whole or in part, without the prior written permission of Cloudera or the applicable trademark holder.

    Hadoop and the Hadoop elephant logo are trademarks of the Apache Software Foundation. All other trademarks, registered trademarks, product names and company names or logos mentioned in this document are the property of their respective owners. Reference to any products, services, processes or other information, by trade name, trademark, manufacturer, supplier or otherwise does not constitute or imply endorsement, sponsorship or recommendation thereof by us.

Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Cloudera.

    Cloudera may have patents, patent applications, trademarks, copyrights, or other intellectual property

rights covering subject matter in this document. Except as expressly provided in any written license agreement from Cloudera, the furnishing of this document does not give you any license to these patents, trademarks, copyrights, or other intellectual property.

    The information in this document is subject to change without notice. Cloudera shall not be liable forany damages resulting from technical errors or omissions which may be present in this document, orfrom use of this document.

    Date: February 25, 2013


    Contents

PURPOSE
    PREREQUISITES
    OVERVIEW
    INPUTS AND OUTPUTS
    EXAMPLE: WORDCOUNT V1.0
        SOURCE CODE
        USAGE
        WALK-THROUGH
    MAPREDUCE USER INTERFACES
        PAYLOAD
            Mapper
            Reducer
            Partitioner
            Reporter
            OutputCollector
        JOB CONFIGURATION
        TASK EXECUTION AND ENVIRONMENT
            Memory Management
            Map Parameters
            Shuffle/Reduce Parameters
            Task JVM Reuse
        JOB SUBMISSION AND MONITORING
            Job Control
        JOB INPUT
            InputSplit
            RecordReader
        JOB OUTPUT
            OutputCommitter
            Task Side-Effect Files
            RecordWriter
        OTHER USEFUL FEATURES
            Submitting Jobs to Queues
            Counters
            DistributedCache
            Tool
            IsolationRunner
            Profiling
            Debugging
            JobControl
            Data Compression
            Skipping Bad Records
    EXAMPLE: WORDCOUNT V2.0
        SOURCE CODE
        SAMPLE RUNS
        HIGHLIGHTS


    Purpose

    This document describes the most important user-facing facets of the Apache Hadoop MapReduce

    framework and serves as a tutorial. Apache Hadoop MapReduce consists of client APIs for writing

applications and a runtime on which to run the applications. There are two versions of the API: old and new, and two versions of the runtime: MRv1 and MRv2. This tutorial describes the old API and MRv1.

    Prerequisites

    Ensure that CDH is installed, configured, and running. The easiest way to get going quickly is to use a

    demo VM:

CDH4 Demo VM: https://ccp.cloudera.com/display/SUPPORT/Cloudera%27s+Hadoop+Demo+VM+for+CDH4

    CDH3 Demo VM: https://ccp.cloudera.com/display/SUPPORT/Cloudera%27s+Hadoop+Demo+VM+for+CDH3u4

    Overview

    Hadoop MapReduce is a software framework for writing applications that process vast amounts of data

    (multi-petabyte datasets) in parallel on large clusters consisting of thousands of nodes of commodity

    hardware in a reliable, fault-tolerant manner.

A MapReduce job usually splits the input dataset into independent chunks which are processed by the map tasks in a parallel manner. The framework sorts the outputs of the maps, which are then input to

    the reduce tasks. Typically both the input and the output of the job are stored in a distributed

    filesystem.

Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework and the Hadoop Distributed File System (HDFS) are running on the same set of nodes. This configuration

    allows the framework to effectively schedule tasks on the nodes where data is already present, resulting

    in very high aggregate bandwidth across the cluster.

    The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster

    node. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring the

    tasks, and re-executing failed tasks. The slaves execute the tasks as directed by the master.

Minimally, applications specify the input/output locations and supply map and reduce functions via

    implementations of appropriate interfaces and/or abstract classes. These locations and functions and

other job parameters comprise the job configuration. The Hadoop job client then submits the job

    (JAR/executable, and so on) and configuration to the JobTracker which then distributes the

    software/configuration to the slaves, scheduling tasks and monitoring them, and providing status and

    diagnostic information to the job client.

The Hadoop framework is implemented in Java™, and you can develop MapReduce applications in Java or any JVM-based language, or use one of the following interfaces:


Hadoop Streaming - a utility that allows you to create and run jobs with any executables (for example, shell utilities) as the mapper and/or the reducer.

    Hadoop Pipes - a SWIG-compatible (not based on JNI™) C++ API to implement MapReduce applications.

    This tutorial describes applications written using the old Java MapReduce API.

The old API is in the org.apache.hadoop.mapred package. The new API is in the org.apache.hadoop.mapreduce package.

    Inputs and Outputs

    The MapReduce framework operates exclusively on key-value pairs, that is, the framework views the

    input to the job as a set of key-value pairs and produces a set of key-value pairs as the output of the job.

    The output pairs can be different types than the input pairs.

    The key and value classes must be serializable by the framework and hence must implement the

Writable interface. Additionally, the key classes must implement the WritableComparable interface to

    facilitate sorting.

    Input and Output types of a MapReduce job:

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
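To make the serialization requirement concrete, here is a minimal sketch (not part of the original tutorial) of a custom key type; the class name YearKey is hypothetical. Because it implements WritableComparable, it is both serializable by the framework and sortable:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    public class YearKey implements WritableComparable<YearKey> {
        private int year;

        public YearKey() { }                              // no-arg constructor required by the framework
        public YearKey(int year) { this.year = year; }

        public void write(DataOutput out) throws IOException { out.writeInt(year); }    // Writable
        public void readFields(DataInput in) throws IOException { year = in.readInt(); }

        public int compareTo(YearKey other) {             // WritableComparable: defines the sort order
            return (year < other.year) ? -1 : ((year == other.year) ? 0 : 1);
        }

        public int hashCode() { return year; }            // used by the default HashPartitioner
    }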

    Example: WordCount v1.0

    Before we jump into the details, let's walk through an example MapReduce application, WordCount, to

    get a flavor for how MapReduce works. WordCount is a simple application that counts the number of

    occurrences of each word in an input set.

    Source Code

package org.myorg;

    import java.io.IOException;

    import java.util.*;

    import org.apache.hadoop.fs.Path;

    import org.apache.hadoop.conf.*;

    import org.apache.hadoop.io.*;

    import org.apache.hadoop.mapred.*;

    import org.apache.hadoop.util.*;


public class WordCount {

        public static class Map extends MapReduceBase implements
                Mapper<LongWritable, Text, Text, IntWritable> {

            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
                String line = value.toString();
                StringTokenizer tokenizer = new StringTokenizer(line);
                while (tokenizer.hasMoreTokens()) {
                    word.set(tokenizer.nextToken());
                    output.collect(word, one);
                }
            }
        }

        public static class Reduce extends MapReduceBase implements
                Reducer<Text, IntWritable, Text, IntWritable> {

            public void reduce(Text key, Iterator<IntWritable> values,
                    OutputCollector<Text, IntWritable> output, Reporter reporter) throws
                    IOException {
                int sum = 0;
                while (values.hasNext()) {
                    sum += values.next().get();
                }
                output.collect(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCount.class);
            conf.setJobName("wordcount");

            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);

            conf.setMapperClass(Map.class);
            conf.setCombinerClass(Reduce.class);
            conf.setReducerClass(Reduce.class);

            conf.setInputFormat(TextInputFormat.class);
            conf.setOutputFormat(TextOutputFormat.class);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            JobClient.runJob(conf);
        }
    }

    Table 1 WordCount.java

    Usage

Compile WordCount.java:

$ mkdir wordcount_classes

    $ javac -cp <classpath> -d wordcount_classes WordCount.java

    where <classpath> is:

    CDH4 - /usr/lib/hadoop/*:/usr/lib/hadoop/client-0.20/*
    CDH3 - /usr/lib/hadoop-0.20/hadoop-0.20.2-cdh3u4-core.jar

    Create a JAR:

$ jar -cvf wordcount.jar -C wordcount_classes/ .

    Assuming that:

/user/cloudera/wordcount/input - input directory in HDFS
    /user/cloudera/wordcount/output - output directory in HDFS

    Create sample text files as input and move to HDFS:

    $ echo "Hel l o Wor l d Bye Wor l d" > f i l e0

    $ echo "Hel l o Hadoop Goodbye Hadoop" > f i l e1

    $ hadoop f s - mkdi r / user / cl ouder a / user / cl ouder a/ wordcount

    / user / cl ouder a/ wor dcount / i nput

    $ hadoop f s - put f i l e* / user / cl ouder a/ wor dcount / i nput

    Run the application:

$ hadoop jar wordcount.jar org.myorg.WordCount /user/cloudera/wordcount/input /user/cloudera/wordcount/output


    Output:

$ hadoop fs -cat /user/cloudera/wordcount/output/part-00000

    Bye 1
    Goodbye 1
    Hadoop 2
    Hello 2
    World 2

Applications can specify a comma-separated list of paths that would be present in the current working directory of the task using the option -files. The -libjars option allows applications to add JARs to the classpaths of the maps and reduces. The -archives option allows them to pass archives as arguments; these are unzipped/unjarred, and a link with the name of the zip/JAR is created in the current working directory of tasks. More details about the command line options are available in the Hadoop Command Guide.

Running the wordcount example with -libjars and -files:

    hadoop jar hadoop-examples.jar wordcount -files cachefile.txt -libjars mylib.jar input output

    Walk-through

    The WordCount application is quite straightforward.

The Mapper implementation (lines 14-26), via the map method (lines 18-25), processes one line at a time, as provided by the specified TextInputFormat (line 49). It then splits the line into tokens separated by whitespace, via the StringTokenizer, and emits a key-value pair of <<word>, 1>.

    For the given sample input the first map emits:

<Hello, 1>
    <World, 1>
    <Bye, 1>
    <World, 1>

    The second map emits:

<Hello, 1>
    <Hadoop, 1>
    <Goodbye, 1>
    <Hadoop, 1>

We'll learn more about the number of maps spawned for a given job, and how to control them in a fine-

    grained manner, a bit later in the tutorial.


WordCount also specifies a combiner (line 46). Hence, the output of each map is passed through the local combiner (which is the same as the Reducer as per the job configuration) for local aggregation, after being sorted on the keys.

    The output of the first map:

<Bye, 1>
    <Hello, 1>
    <World, 2>

    The output of the second map:

<Goodbye, 1>
    <Hadoop, 2>
    <Hello, 1>

The Reducer implementation (lines 28-36), via the reduce method (lines 29-35), just sums up the values, which are the occurrence counts for each key (that is, words in this example).

Thus the output of the job is:

    <Bye, 1>
    <Goodbye, 1>
    <Hadoop, 2>
    <Hello, 2>
    <World, 2>

The run method specifies various facets of the job, such as the input/output paths (passed via the command line), key-value types, input/output formats, etc., in the JobConf. It then calls JobClient.runJob (line 55) to submit the job and monitor its progress.

    We'll learn more about JobConf, JobClient, Tool, and other interfaces and classes a bit later in the tutorial.

    MapReduce User Interfaces

    This section provides a reasonable amount of detail on every user-facing aspect of the MapReduce

    framework. This should help you implement, configure and tune your jobs in a fine-grained manner.

    However, note that the javadoc for each class/interface remains the most comprehensive

    documentation available; this is only meant to be a tutorial.

    Let's first take the Mapper and Reducer interfaces. Applications typically implement them to provide

the map and reduce methods.

We will then discuss other core interfaces including JobConf, JobClient, Partitioner, OutputCollector, Reporter, InputFormat, OutputFormat, OutputCommitter, and others.

    Finally, we will wrap up by discussing some useful features of the framework such as the

DistributedCache, IsolationRunner, and so on.


    Payload

Applications typically implement the Mapper and Reducer interfaces to provide the map and reduce methods. These form the core of the job.

    Mapper

Mapper maps input key-value pairs to a set of intermediate key-value pairs. Maps are the individual

    tasks that transform input records into intermediate records. The transformed intermediate records do

    not need to be of the same type as the input records. A given input pair may map to zero or many

    output pairs.

The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job.

Overall, Mapper implementations are passed the JobConf for the job via the JobConfigurable.configure(JobConf) method and override it to initialize themselves. The framework then calls map(WritableComparable, Writable, OutputCollector, Reporter) for each key-value pair in the InputSplit for that task. Applications can then override the Closeable.close() method to perform any required cleanup.
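For illustration only (this class is not part of the tutorial; its name and the parameter it reads are hypothetical, and the imports are the same as in the WordCount listing above), a Mapper might override configure() to read a setting from the JobConf and close() to release per-task resources:

    public static class TokenMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        private boolean caseSensitive;

        public void configure(JobConf job) {
            // Called once per task, before any map() calls.
            caseSensitive = job.getBoolean("wordcount.case.sensitive", true);
        }

        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = caseSensitive ? value.toString() : value.toString().toLowerCase();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }

        public void close() throws IOException {
            // Called once after the last map() call; release any per-task resources here.
        }
    }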

    Output pairs do not need to be of the same types as input pairs. A given input pair may map to zero or

    many output pairs. Output pairs are collected with calls to

    OutputCollector.collect(WritableComparable,Writable).

Applications can use Reporter to report progress, set application-level status messages and update Counters, or just indicate that they are alive.

    All intermediate values associated with a given output key are subsequently grouped by the framework,

and passed to the Reducer(s) to determine the final output. You can control the grouping by specifying a Comparator via JobConf.setOutputKeyComparatorClass(Class).

The Mapper outputs are sorted and then partitioned per Reducer. The total number of partitions is the same as the number of reduce tasks for the job. You can control which keys (and hence records) go to which Reducer by implementing a custom Partitioner.

You can optionally specify a combiner, via JobConf.setCombinerClass(Class), to perform local

    aggregation of the intermediate outputs, which helps to cut down the amount of data transferred from

the Mapper to the Reducer.

The intermediate, sorted outputs are always stored in a simple (key-len, key, value-len, value) format. Applications can control if, and how, the intermediate outputs are to be compressed and the CompressionCodec to be used via the JobConf.
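As a sketch of how this might look on the JobConf (these calls are not part of the WordCount example, and GzipCodec is just one possible codec, assumed here for illustration):

    // Compress the intermediate map outputs with gzip.
    conf.setCompressMapOutput(true);
    conf.setMapOutputCompressorClass(org.apache.hadoop.io.compress.GzipCodec.class);

    // Optionally also compress the final job output.
    FileOutputFormat.setCompressOutput(conf, true);
    FileOutputFormat.setOutputCompressorClass(conf, org.apache.hadoop.io.compress.GzipCodec.class);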

    How Many Maps?

    The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of

    the input files.


    The right level of parallelism for maps seems to be around 10-100 maps per node, although it has been

set up to 300 maps for very CPU-light map tasks. Task setup takes a while, so it is best if the maps take at

    least a minute to execute.

Thus, if you expect 10TB of input data and have a blocksize of 128MB, you'll end up with 82,000 maps, unless setNumMapTasks(int) (which only provides a hint to the framework) is used to set it even higher.
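As a quick check of that figure (an illustrative calculation, not from the original text): 10 TB is roughly 10 x 1024 x 1024 = 10,485,760 MB, and 10,485,760 MB / 128 MB per block is about 81,920 blocks, hence roughly 82,000 map tasks.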

    Reducer

Reducer reduces a set of intermediate values that share a key to a smaller set of values.

    You set the number of reducers for the job via JobConf.setNumReduceTasks(int).

    Overall, Reducer implementations are passed the JobConf for the job via the JobConfigurable.configure(JobConf) method and can override it to initialize themselves. The framework then calls the reduce(WritableComparable, Iterator, OutputCollector, Reporter) method for each pair in the grouped inputs. Applications can then override the Closeable.close() method to perform any required cleanup.

    Reducer has 3 primary phases: shuffle, sort, and reduce.

    Shuffle

    Input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the

    relevant partition of the output of all the mappers, via HTTP.

    Sort

    The framework groups Reducer inputs by keys (since different mappers may have output the same key)

    in this stage.

The shuffle and sort phases occur simultaneously; while map outputs are being fetched they are merged.

    Secondary Sort

    If equivalence rules for grouping the intermediate keys are required to be different from those for

grouping keys before reduction, then one may specify a Comparator via JobConf.setOutputValueGroupingComparator(Class). Since JobConf.setOutputKeyComparatorClass(Class) can be used to control how intermediate keys are

    grouped, these can be used in conjunction to simulate secondary sort on values.
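A hedged sketch of how the two settings are usually combined (the comparator class names below are hypothetical RawComparator implementations, not part of the tutorial):

    // Sort map outputs on the full composite key (natural key plus the secondary field)...
    conf.setOutputKeyComparatorClass(FullKeyComparator.class);
    // ...but group values at the Reducer by the natural key only, so each reduce() call
    // sees all values for one natural key, already ordered by the secondary field.
    conf.setOutputValueGroupingComparator(NaturalKeyGroupingComparator.class);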

    Reduce

In this phase the reduce(WritableComparable, Iterator, OutputCollector, Reporter) method is called for

    each pair in the grouped inputs.

The output of the reduce task is typically written to the FileSystem via

    OutputCollector.collect(WritableComparable, Writable).

Applications can use the Reporter to report progress, set application-level status messages and update Counters, or just indicate that they are alive.

    http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setNumMapTasks(int)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setNumMapTasks(int)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setNumMapTasks(int)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/Reducer.htmlhttp://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/Reducer.htmlhttp://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setNumReduceTasks(int)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setNumReduceTasks(int)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setNumReduceTasks(int)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConfigurable.html#configure(org.apache.hadoop.mapred.JobConf)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConfigurable.html#configure(org.apache.hadoop.mapred.JobConf)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/Reducer.html#reduce(K2,%20java.util.Iterator,%20org.apache.hadoop.mapred.OutputCollector,%20org.apache.hadoop.mapred.Reporter)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/Reducer.html#reduce(K2,%20java.util.Iterator,%20org.apache.hadoop.mapred.OutputCollector,%20org.apache.hadoop.mapred.Reporter)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/Reducer.html#reduce(K2,%20java.util.Iterator,%20org.apache.hadoop.mapred.OutputCollector,%20org.apache.hadoop.mapred.Reporter)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/io/Closeable.html#close()http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/io/Closeable.html#close()http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/io/Closeable.html#close()http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setOutputValueGroupingComparator(java.lang.Class)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setOutputValueGroupingComparator(java.lang.Class)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setOutputKeyComparatorClass(java.lang.Class)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setOutputKeyComparatorClass(java.lang.Class)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/Reducer.html#reduce(K2,%20java.util.Iterator,%20org.apache.hadoop.mapred.OutputCollector,%20org.apache.hadoop.mapred.Reporter)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/Reducer.html#reduce(K2,%20java.util.Iterator,%20org.apache.hadoop.mapred.OutputCollector,%20org.apache.hadoop.mapred.Reporter)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/Reducer.html#reduce(K2,%20java.util.Iterator,%20org.apache.hadoop.mapred.OutputCollector,%20org.apache.hadoop.mapred.Reporter)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/fs/FileSystem.htmlhttp://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/fs/FileSystem.htmlhttp://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/fs/FileSystem.htmlhttp://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/OutputCollector.html#collect(K,%20V)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/OutputCol
lector.html#collect(K,%20V)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/OutputCollector.html#collect(K,%20V)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/fs/FileSystem.htmlhttp://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/Reducer.html#reduce(K2,%20java.util.Iterator,%20org.apache.hadoop.mapred.OutputCollector,%20org.apache.hadoop.mapred.Reporter)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setOutputKeyComparatorClass(java.lang.Class)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setOutputValueGroupingComparator(java.lang.Class)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/io/Closeable.html#close()http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/Reducer.html#reduce(K2,%20java.util.Iterator,%20org.apache.hadoop.mapred.OutputCollector,%20org.apache.hadoop.mapred.Reporter)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConfigurable.html#configure(org.apache.hadoop.mapred.JobConf)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setNumReduceTasks(int)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/Reducer.htmlhttp://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setNumMapTasks(int)
  • 5/28/2018 Hadoop Tutorial

    13/35

    MapReduce User Interfaces

    Hadoop Tutorial | 9

    The output of the Reducer is not sorted.

    How Many Reduces?

The right number of reduces seems to be 0.95 or 1.75 multiplied by (no. of nodes * mapred.tasktracker.reduce.tasks.maximum).

    With 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps

    finish. With 1.75 the faster nodes will finish their first round of reduces and launch a second wave of

    reduces doing a much better job of load balancing.

    Increasing the number of reduces increases the framework overhead, but increases load balancing and

    lowers the cost of failures.

    The scaling factors above are slightly less than whole numbers to reserve a few reduce slots in the

    framework for speculative tasks and failed tasks.
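For example (an illustrative calculation, not from the original text): on a 10-node cluster with mapred.tasktracker.reduce.tasks.maximum set to 2, the 0.95 factor gives 0.95 x 10 x 2 = 19 reduces, while 1.75 gives 1.75 x 10 x 2 = 35 reduces.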

    Reducer NONE

    It is legal to set the number of reduce tasks to zero if no reduction is desired.

    In this case the outputs of the map tasks go directly to the filesystem, into the output path set by

setOutputPath(Path). The framework does not sort the map outputs before writing them out to the

    filesystem.
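A minimal sketch of configuring such a map-only job (assumed, not from the tutorial; the output path is hypothetical):

    conf.setNumReduceTasks(0);   // no reduce phase; each map's output goes straight to the filesystem
    FileOutputFormat.setOutputPath(conf, new Path("/user/cloudera/wordcount/map-only-output"));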

    Partitioner

Partitioner partitions the key space. Partitioner controls the partitioning of the keys of the intermediate

    map outputs. The key (or a subset of the key) is used to derive the partition, typically by a hash function.

    The total number of partitions is the same as the number of reduce tasks for the job. Hence this controls

    which of the m reduce tasks the intermediate key (and hence the record) is sent to for reduction.

HashPartitioner is the default Partitioner.
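As an illustration only (not in the original tutorial; the class name is hypothetical), a custom Partitioner for Text keys could route words by their first letter:

    public static class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {

        public void configure(JobConf job) { }   // no configuration needed

        // Returns the index (0 .. numPartitions-1) of the reduce task this key is sent to.
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            String s = key.toString();
            char first = s.isEmpty() ? ' ' : Character.toLowerCase(s.charAt(0));
            return first % numPartitions;
        }
    }

    // Registered on the job configuration:
    // conf.setPartitionerClass(FirstLetterPartitioner.class);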

    Reporter

Reporter is a facility for MapReduce applications to report progress, set application-level status

    messages and update Counters.

Mapper and Reducer implementations can use the Reporter to report progress or just indicate that

    they are alive. In scenarios where the application takes a significant amount of time to process individual

    key-value pairs, this is crucial since the framework might assume that the task has timed out and kill that

task. Another way to avoid this is to set the configuration parameter mapred.task.timeout to a high

    enough value (or even set it to zero for no timeouts).

Applications can also update Counters using Reporter.
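A brief sketch of both uses inside a map method (illustrative only; the counter enum and the status message are hypothetical and not part of the tutorial):

    public static enum MyCounters { SKIPPED_RECORDS }

    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        if (value.getLength() == 0) {
            reporter.incrCounter(MyCounters.SKIPPED_RECORDS, 1);   // update an application Counter
            return;
        }
        reporter.setStatus("processing offset " + key.get());      // application-level status message
        reporter.progress();                                        // tell the framework the task is alive
        // ... normal map logic goes here ...
    }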

    OutputCollector

OutputCollector is a generalization of the facility provided by the MapReduce framework to collect data

    output by the Mapper or the Reducer (either the intermediate outputs or the output of the job).


Hadoop MapReduce comes bundled with a library of generally useful mappers, reducers, and

    partitioners.

    Job Configuration

JobConf represents a MapReduce job configuration. JobConf is the primary interface to describe a MapReduce job to the Hadoop framework for execution. The framework tries to faithfully execute the job as described by JobConf, however:

Some configuration parameters may have been marked as final by administrators and hence cannot be altered.

    While some job parameters are straightforward to set (for example, setNumReduceTasks(int)), other parameters interact subtly with the rest of the framework and/or job configuration and are more complex to set (for example, setNumMapTasks(int)).

JobConf is typically used to specify the Mapper, Combiner (if any), Partitioner, Reducer, InputFormat, OutputFormat and OutputCommitter implementations. JobConf also indicates the set of input files (setInputPaths(JobConf, Path)/addInputPath(JobConf, Path) and setInputPaths(JobConf, String)/addInputPaths(JobConf, String)) and where the output files should be written (setOutputPath(Path)).

Optionally, JobConf is used to specify other advanced facets of the job such as the Comparator to be used, files to be put in the DistributedCache, whether intermediate and/or job outputs are to be compressed (and how), debugging via user-provided scripts (setMapDebugScript(String)/setReduceDebugScript(String)), whether job tasks can be executed in a speculative manner (setMapSpeculativeExecution(boolean)/setReduceSpeculativeExecution(boolean)), the maximum number of attempts per task (setMaxMapAttempts(int)/setMaxReduceAttempts(int)), the percentage of task failures which can be tolerated by the job (setMaxMapTaskFailuresPercent(int)/setMaxReduceTaskFailuresPercent(int)), and so on.

Of course, you can use set(String, String)/get(String, String) to get and set arbitrary parameters needed by applications. However, use the DistributedCache for large amounts of (read-only) data.
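For example (an assumed sketch, not from the tutorial; the parameter name and cache path are hypothetical, and it presumes imports of java.net.URI and org.apache.hadoop.filecache.DistributedCache in a driver whose main throws Exception, like WordCount above):

    // Small, read-mostly settings can go straight into the configuration and be
    // read back in a task's configure(JobConf) method with conf.get(...).
    conf.set("wordcount.skip.patterns.enabled", "true");

    // Larger read-only data should instead be shipped via the DistributedCache.
    DistributedCache.addCacheFile(new URI("/user/cloudera/cache/stopwords.txt"), conf);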

    Task Execution and Environment

    TaskTracker executes Mapper and Reducer tasks as child processes in separate JVMs.

The child task inherits the environment of the parent TaskTracker. You can specify additional options to the child JVM via the mapred.child.java.opts configuration parameter in the JobConf, such as non-standard paths for the runtime linker to search for shared libraries via -Djava.library.path=<>, and so on. If the mapred.child.java.opts property contains the symbol @taskid@, it is interpolated with the value of taskid of the MapReduce task.

    Here is an example property with multiple arguments and substitutions, showing JVM GC logging, and

    start of a passwordless JVM JMX agent so that it can connect with jconsole and the like to watch child

    http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/lib/package-summary.htmlhttp://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/lib/package-summary.htmlhttp://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/lib/package-summary.htmlhttp://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.htmlhttp://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.htmlhttp://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/conf/Configuration.html#FinalParamshttp://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/conf/Configuration.html#FinalParamshttp://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/conf/Configuration.html#FinalParamshttp://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setNumReduceTasks(int)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setNumReduceTasks(int)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setNumReduceTasks(int)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setNumMapTasks(int)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setNumMapTasks(int)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setNumMapTasks(int)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/FileInputFormat.html#setInputPaths%28org.apache.hadoop.mapred.JobConf,%20org.apache.hadoop.fs.Path...%29http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/FileInputFormat.html#setInputPaths%28org.apache.hadoop.mapred.JobConf,%20org.apache.hadoop.fs.Path...%29http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/FileInputFormat.html#setInputPaths%28org.apache.hadoop.mapred.JobConf,%20org.apache.hadoop.fs.Path...%29http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/FileInputFormat.html#addInputPath(org.apache.hadoop.mapred.JobConf,%20org.apache.hadoop.fs.Path)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/FileInputFormat.html#addInputPath(org.apache.hadoop.mapred.JobConf,%20org.apache.hadoop.fs.Path)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/FileInputFormat.html#addInputPath(org.apache.hadoop.mapred.JobConf,%20org.apache.hadoop.fs.Path)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/FileInputFormat.html#setInputPaths(org.apache.hadoop.mapred.JobConf,%20java.lang.String)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/FileInputFormat.html#setInputPaths(org.apache.hadoop.mapred.JobConf,%20java.lang.String)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/FileInputFormat.html#setInputPaths(org.apache.hadoop.mapred.JobConf,%20java.lang.String)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/FileInputFormat.html#addInputPath(org.apache.hadoop.mapred.JobConf,%20java.lang.String)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/FileInputFormat.html#addInputPath(org.apache.hadoop.mapred.JobConf,%20java.lang.String)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/FileInputFormat.html#addInputPath(org.apache.hadoop.mapred.JobConf,%20java.lang.String)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/FileOutputFormat.html#setOutputPath(org.apache.
hadoop.mapred.JobConf,%20org.apache.hadoop.fs.Path)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/FileOutputFormat.html#setOutputPath(org.apache.hadoop.mapred.JobConf,%20org.apache.hadoop.fs.Path)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/FileOutputFormat.html#setOutputPath(org.apache.hadoop.mapred.JobConf,%20org.apache.hadoop.fs.Path)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setMapDebugScript(java.lang.String)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setMapDebugScript(java.lang.String)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setReduceDebugScript(java.lang.String)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setReduceDebugScript(java.lang.String)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setReduceDebugScript(java.lang.String)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setMapSpeculativeExecution(boolean)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setMapSpeculativeExecution(boolean)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setMapSpeculativeExecution(boolean)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setReduceSpeculativeExecution(boolean)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setReduceSpeculativeExecution(boolean)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setReduceSpeculativeExecution(boolean)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setMaxMapAttempts(int)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setMaxMapAttempts(int)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setMaxReduceAttempts(int)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setMaxReduceAttempts(int)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setMaxReduceAttempts(int)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setMaxMapTaskFailuresPercent(int)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setMaxMapTaskFailuresPercent(int)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setMaxReduceTaskFailuresPercent(int)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setMaxReduceTaskFailuresPercent(int)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setMaxReduceTaskFailuresPercent(int)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/conf/Configuration.html#set(java.lang.String,%20java.lang.String)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/conf/Configuration.html#set(java.lang.String,%20java.lang.String)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/conf/Configuration.html#get(java.lang.String,%20java.lang.String)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/conf/Configuration.html#get(java.lang.String,%20java.lang.String)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/conf/Configuration.html#get(java.lang.Stri
ng,%20java.lang.String)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/conf/Configuration.html#get(java.lang.String,%20java.lang.String)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/conf/Configuration.html#set(java.lang.String,%20java.lang.String)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setMaxReduceTaskFailuresPercent(int)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setMaxMapTaskFailuresPercent(int)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setMaxReduceAttempts(int)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setMaxMapAttempts(int)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setReduceSpeculativeExecution(boolean)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setMapSpeculativeExecution(boolean)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setReduceDebugScript(java.lang.String)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setMapDebugScript(java.lang.String)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/FileOutputFormat.html#setOutputPath(org.apache.hadoop.mapred.JobConf,%20org.apache.hadoop.fs.Path)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/FileInputFormat.html#addInputPath(org.apache.hadoop.mapred.JobConf,%20java.lang.String)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/FileInputFormat.html#setInputPaths(org.apache.hadoop.mapred.JobConf,%20java.lang.String)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/FileInputFormat.html#addInputPath(org.apache.hadoop.mapred.JobConf,%20org.apache.hadoop.fs.Path)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/FileInputFormat.html#setInputPaths%28org.apache.hadoop.mapred.JobConf,%20org.apache.hadoop.fs.Path...%29http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setNumMapTasks(int)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setNumReduceTasks(int)http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/conf/Configuration.html#FinalParamshttp://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.htmlhttp://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/lib/package-summary.html
  • 5/28/2018 Hadoop Tutorial

    15/35

    MapReduce User Interfaces

    Hadoop Tutorial | 11

memory, threads, and get thread dumps. The property also sets the maximum heap size of the child JVM to 512MB and adds an additional path to the java.library.path of the child JVM.

mapred.child.java.opts:

    -Xmx512M -Djava.library.path=/home/mycompany/lib -verbose:gc -Xloggc:/tmp/@taskid@.gc
    -Dcom.sun.management.jmxremote.authenticate=false
    -Dcom.sun.management.jmxremote.ssl=false

    Memory Management

    You can also specify the maximum virtual memory of the launched child task, and any subprocess it

launches recursively, using mapred.child.ulimit. Note that the value set here is a per-process limit. The value for mapred.child.ulimit should be specified in kilobytes (KB) and must be greater than or equal to the -Xmx value passed to the Java VM.
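A hedged sketch of setting the two values consistently from a job driver (the numbers are illustrative assumptions, not recommendations from the tutorial):

    // Child JVM heap is capped at 512 MB...
    conf.set("mapred.child.java.opts", "-Xmx512m");
    // ...so the per-process limit, in kilobytes, must be at least 512 * 1024 = 524,288 KB;
    // 1,048,576 KB (1 GB) leaves headroom for non-heap memory.
    conf.set("mapred.child.ulimit", "1048576");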

mapred.child.java.opts is used only for configuring the launched child tasks from the TaskTracker. Configuring the memory options for daemons is documented in Configuring the Environment of the Hadoop Daemons.

    The memory available to some parts of the framework is also configurable. In map and reduce tasks,

    performance may be influenced by adjusting parameters influencing the concurrency of operations and

the frequency with which data will hit disk. Monitoring the filesystem counters for a job, particularly relative to byte counts from the map and into the reduce, is invaluable to the tuning of these

    parameters.

    You can choose to override default limits of virtual memory and RAM enforced by the task tracker, if

    memory management is enabled. You can set the following parameters per job:

Name: mapred.task.maxvmem
Type: int
Description: A number, in bytes, that represents the maximum virtual memory task limit for each task of the job. A task will be killed if it consumes more virtual memory than this number.

Name: mapred.task.maxpmem
Type: int
Description: A number, in bytes, that represents the maximum RAM task limit for each task of the job. This number can be optionally used by schedulers to prevent overscheduling of tasks on a node based on RAM needs.
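For concreteness, here is a brief sketch (the limits are illustrative values, not recommendations) of setting both properties on a JobConf:

    import org.apache.hadoop.mapred.JobConf;

    // Minimal sketch: override the TaskTracker's default memory limits for one job.
    public class TaskMemoryLimits {
      public static void configure(JobConf conf) {
        long oneGiB = 1024L * 1024L * 1024L;
        // Kill any task of this job that uses more than 2 GiB of virtual memory.
        conf.setLong("mapred.task.maxvmem", 2 * oneGiB);
        // Advertise an expected RAM footprint of 1 GiB per task to the scheduler.
        conf.setLong("mapred.task.maxpmem", oneGiB);
      }
    }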

    Map Parameters

    A record emitted from a map will be serialized into a buffer and metadata will be stored into accounting

    buffers. As described in the following options, when either the serialization buffer or the metadata

    exceed a threshold, the contents of the buffers will be sorted and written to disk in the background

    while the map continues to output records. If either buffer fills completely while the spill is in progress,

    the map thread will block. When the map is finished, any remaining records are written to disk and all

    on-disk segments are merged into a single file. Minimizing the number of spills to disk can decrease map

    time, but a larger buffer also decreases the memory available to the mapper.

Name: io.sort.mb
Type: int
Description: The cumulative size of the serialization and accounting buffers storing records emitted from the map, in megabytes.

Name: io.sort.record.percent
Type: float
Description: The ratio of serialization to accounting space can be adjusted. Each serialized record requires 16 bytes of accounting information in addition to its serialized size to effect the sort. This percentage of space allocated from io.sort.mb affects the probability of a spill to disk being caused by either exhaustion of the serialization buffer or the accounting space. Clearly, for a map outputting small records, a higher value than the default will likely decrease the number of spills to disk.

Name: io.sort.spill.percent
Type: float
Description: This is the threshold for the accounting and serialization buffers. When this percentage of either buffer has filled, its contents will be spilled to disk in the background. Let io.sort.record.percent be r, io.sort.mb be x, and this value be q. The maximum number of records collected before the collection thread will spill is r * x * q * 2^16. Note that a higher value may decrease the number of merges, or even eliminate them, but will also increase the probability of the map task getting blocked. The lowest average map times are usually obtained by accurately estimating the size of the map output and preventing multiple spills.
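To make the spill formula concrete, the following sketch (values chosen purely for illustration, not as recommendations) sets these properties on a JobConf and works through the resulting record capacity in a comment:

    import org.apache.hadoop.mapred.JobConf;

    // Illustrative sketch: tuning the map-side sort buffer for a job.
    public class MapSortTuning {
      public static void configure(JobConf conf) {
        conf.setInt("io.sort.mb", 200);                  // larger serialization/accounting buffer
        conf.setFloat("io.sort.record.percent", 0.10f);  // more accounting space for small records
        conf.setFloat("io.sort.spill.percent", 0.90f);   // spill later, risking a blocked map thread

        // With r = 0.10, x = 200 and q = 0.90, the collector can hold roughly
        // r * x * q * 2^16 = 0.10 * 200 * 0.90 * 65536 = 1,179,648 records before spilling.
      }
    }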

Other notes

• If either spill threshold is exceeded while a spill is in progress, collection will continue until the spill is finished. For example, if io.sort.spill.percent is set to 0.33, and the remainder of the buffer is filled while the spill runs, the next spill will include all the collected records, or 0.66 of the buffer, and will not generate additional spills. In other words, the thresholds are defining triggers, not blocking.
• A record larger than the serialization buffer will first trigger a spill, then be spilled to a separate file. It is undefined whether or not this record will first pass through the combiner.


    Shuffle/Reduce Parameters

    As described previously, each reduce fetches the output assigned to it by the Partitioner via HTTP into

    memory and periodically merges these outputs to disk. If intermediate compression of map outputs is

    turned on, each output is decompressed into memory. The following options affect the frequency of

    these merges to disk prior to the reduce and the memory allocated to map output during the reduce.

Name: io.sort.factor
Type: int
Description: Specifies the number of segments on disk to be merged at the same time. It limits the number of open files and compression codecs during the merge. If the number of files exceeds this limit, the merge will proceed in several passes. Though this limit also applies to the map, most jobs should be configured so that hitting this limit is unlikely there.

Name: mapred.inmem.merge.threshold
Type: int
Description: The number of sorted map outputs fetched into memory before being merged to disk. Like the spill thresholds in the preceding note, this is not defining a unit of partition, but a trigger. In practice, this is usually set very high (1000) or disabled (0), since merging in-memory segments is often less expensive than merging from disk (see notes following this table). This threshold influences only the frequency of in-memory merges during the shuffle.

Name: mapred.job.shuffle.merge.percent
Type: float
Description: The memory threshold for fetched map outputs before an in-memory merge is started, expressed as a percentage of memory allocated to storing map outputs in memory. Since map outputs that can't fit in memory can be stalled, setting this high may decrease parallelism between the fetch and merge. Conversely, values as high as 1.0 have been effective for reduces whose input can fit entirely in memory. This parameter influences only the frequency of in-memory merges during the shuffle.

Name: mapred.job.shuffle.input.buffer.percent
Type: float
Description: The percentage of memory, relative to the maximum heap size as typically specified in mapred.child.java.opts, that can be allocated to storing map outputs during the shuffle. Though some memory should be set aside for the framework, in general it is advantageous to set this high enough to store large and numerous map outputs.

Name: mapred.job.reduce.input.buffer.percent
Type: float
Description: The percentage of memory, relative to the maximum heap size, in which map outputs may be retained during the reduce. When the reduce begins, map outputs will be merged to disk until those that remain are under the resource limit this defines. By default, all map outputs are merged to disk before the reduce begins, to maximize the memory available to the reduce. For less memory-intensive reduces, this should be increased to avoid trips to disk.
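As a hypothetical illustration (the values are examples only, not recommendations), a job whose reduce input fits comfortably in memory might be tuned along these lines:

    import org.apache.hadoop.mapred.JobConf;

    // Illustrative sketch: shuffle/reduce tuning for a reduce whose input fits in memory.
    public class ShuffleTuning {
      public static void configure(JobConf conf) {
        conf.setInt("io.sort.factor", 50);                                // merge up to 50 segments per pass
        conf.setInt("mapred.inmem.merge.threshold", 0);                   // disable count-based in-memory merges
        conf.setFloat("mapred.job.shuffle.merge.percent", 1.0f);          // merge only when the buffer is full
        conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.80f);  // most of the heap for fetched outputs
        conf.setFloat("mapred.job.reduce.input.buffer.percent", 0.70f);   // keep outputs in memory during reduce
      }
    }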

Other notes

• If a map output is larger than 25 percent of the memory allocated to copying map outputs, it will be written directly to disk without first staging through memory.
• When running with a combiner, the reasoning about high merge thresholds and large buffers may not hold. For merges started before all map outputs have been fetched, the combiner is run while spilling to disk. In some cases, one can obtain better reduce times by spending resources combining map outputs, making disk spills small and parallelizing spilling and fetching, rather than aggressively increasing buffer sizes.
• When merging in-memory map outputs to disk to begin the reduce, if an intermediate merge is necessary because there are segments to spill and at least io.sort.factor segments are already on disk, the in-memory map outputs will be part of the intermediate merge.

Task JVM Reuse

Jobs can enable task JVMs to be reused by specifying the job configuration property mapred.job.reuse.jvm.num.tasks. If the value is 1 (the default), then JVMs are not reused (that is, 1 task per JVM). If it is -1, there is no limit to the number of tasks a JVM can run (of the same job). You can also specify some value greater than 1 using the method JobConf.setNumTasksToExecutePerJvm(int).
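For example, a short sketch (the limit of 10 is illustrative only) of enabling JVM reuse for a job with many short-lived tasks:

    import org.apache.hadoop.mapred.JobConf;

    public class JvmReuseExample {
      public static void main(String[] args) {
        JobConf conf = new JobConf(JvmReuseExample.class);
        // Let each task JVM run up to 10 tasks of this job.
        conf.setNumTasksToExecutePerJvm(10);
        // conf.setNumTasksToExecutePerJvm(-1) would remove the limit entirely.
      }
    }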

The following properties are localized in the job configuration for each task's execution:

Name                      Type     Description
mapred.job.id             String   The job id
mapred.jar                String   job.jar location in job directory
job.local.dir             String   The job-specific shared scratch space
mapred.tip.id             String   The task id
mapred.task.id            String   The task attempt id
mapred.task.is.map        boolean  Is this a map task
mapred.task.partition     int      The id of the task within the job
map.input.file            String   The filename that the map is reading from
map.input.start           long     The offset of the start of the map input split
map.input.length          long     The number of bytes in the map input split
mapred.work.output.dir    String   The task's temporary output directory
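A brief sketch (not part of the tutorial's WordCount example; the class and output logic are hypothetical) of how a mapper might read some of these localized values in its configure() method:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Illustrative sketch: a mapper that inspects a few localized job properties.
    public class InputAwareMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {

      private String inputFile;

      @Override
      public void configure(JobConf job) {
        inputFile = job.get("map.input.file");       // file this map task is reading
        String taskId = job.get("mapred.task.id");   // task attempt id
        System.out.println("Attempt " + taskId + " reading " + inputFile);
      }

      @Override
      public void map(LongWritable key, Text value,
                      OutputCollector<Text, LongWritable> output, Reporter reporter)
          throws IOException {
        // Tag each record with the file it came from (illustration only).
        output.collect(new Text(inputFile), new LongWritable(1));
      }
    }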

The standard output (stdout) and error (stderr) streams of the task are read by the TaskTracker and logged to ${HADOOP_LOG_DIR}/userlogs.

DistributedCache can also be used to distribute both JARs and native libraries for use in the map and/or reduce tasks. The child JVM always has its current working directory added to the java.library.path and LD_LIBRARY_PATH. Hence the cached libraries can be loaded via System.loadLibrary or System.load. More details on how to load shared libraries through the distributed cache are documented in Loading native libraries through DistributedCache.

Job Submission and Monitoring

JobClient is the primary interface by which a user job interacts with the JobTracker. JobClient provides facilities to submit jobs, track their progress, access component tasks' reports and logs, get the MapReduce cluster's status information, and so on.

The job submission process involves:

1. Checking the input and output specifications of the job.
2. Computing the InputSplit values for the job.
3. Setting up the requisite accounting information for the DistributedCache of the job, if necessary.
4. Copying the job's JAR file and configuration to the MapReduce system directory on the filesystem.

5. Submitting the job to the JobTracker and optionally monitoring its status.

Job history files are also logged to the user-specified directory hadoop.job.history.user.location, which defaults to the job output directory. The files are stored in _logs/history/ in the specified directory. Hence, by default they will be in mapred.output.dir/_logs/history. You can stop logging by giving the value none for hadoop.job.history.user.location.

You can view the history logs summary in the specified directory using the following command:

$ hadoop job -history output-dir

This command will print job details, and failed and killed tip details.

More details about the job, such as successful tasks and the task attempts made for each task, can be viewed using the following command:

$ hadoop job -history all output-dir

You can use OutputLogFilter to filter log files from the output directory listing.

Normally, you create the application, describe the various facets of the job via JobConf, and then use the JobClient to submit the job and monitor its progress.

Job Control

You may need to chain MapReduce jobs to accomplish complex tasks which cannot be done via a single MapReduce job. This is fairly easy, since the output of one job typically goes to the distributed filesystem and can, in turn, be used as the input for the next job.

However, this also means that the onus of ensuring jobs are complete (success/failure) lies squarely on the clients. In such cases, the various job control options are:

• runJob(JobConf): Submits the job and returns only after the job has completed.
• submitJob(JobConf): Only submits the job; then poll the returned handle to the RunningJob to query status and make scheduling decisions.
• JobConf.setJobEndNotificationURI(String): Sets up a notification upon job completion, thus avoiding polling.

You can also use Oozie to implement chains of MapReduce jobs.
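As a hedged illustration of the first two options (the paths are hypothetical, and the mapper/reducer configuration is omitted for brevity), chaining might look roughly like this:

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    // Illustrative sketch: run one job to completion, then feed its output to a second job.
    public class JobChain {
      public static void main(String[] args) throws IOException {
        JobConf first = new JobConf(JobChain.class);
        FileInputFormat.setInputPaths(first, new Path("/data/raw"));       // hypothetical paths
        FileOutputFormat.setOutputPath(first, new Path("/data/stage1"));
        JobClient.runJob(first);                 // blocks until the first job finishes

        JobConf second = new JobConf(JobChain.class);
        FileInputFormat.setInputPaths(second, new Path("/data/stage1"));   // output of job 1
        FileOutputFormat.setOutputPath(second, new Path("/data/final"));
        RunningJob handle = new JobClient(second).submitJob(second);       // submit, then poll
        System.out.println("Submitted second job: " + handle.getID());
      }
    }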

    Job Input

InputFormat describes the input specification for a MapReduce job. The MapReduce framework relies on the InputFormat of the job to:


1. Validate the input specification of the job.
2. Split up the input file(s) into logical InputSplit instances, each of which is then assigned to an individual Mapper.
3. Provide the RecordReader implementation used to glean input records from the logical InputSplit for processing by the Mapper.

The default behavior of file-based InputFormat implementations, typically subclasses of FileInputFormat, is to split the input into logical InputSplit instances based on the total size, in bytes, of the input files. However, the FileSystem block size of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via mapred.min.split.size.

Clearly, logical splits based on input size are insufficient for many applications, since record boundaries must be respected. In such cases, the application should implement a RecordReader, which is responsible for respecting record boundaries and presenting a record-oriented view of the logical InputSplit to the individual task.

TextInputFormat is the default InputFormat.

If TextInputFormat is the InputFormat for a given job, the framework detects input files with the .gz extension and automatically decompresses them using the appropriate CompressionCodec. Note, however, that compressed files with this extension cannot be split, and each compressed file is processed in its entirety by a single mapper.
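A short hypothetical sketch (the input path and split-size floor are illustrative values) of wiring these input settings into a JobConf:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;

    // Illustrative sketch: configure job input with TextInputFormat and a split-size floor.
    public class InputConfig {
      public static void configure(JobConf conf) {
        conf.setInputFormat(TextInputFormat.class);                   // the default, shown explicitly
        FileInputFormat.setInputPaths(conf, new Path("/data/logs"));  // hypothetical input directory
        // Don't create splits smaller than 64 MB, regardless of block size.
        conf.setLong("mapred.min.split.size", 64L * 1024 * 1024);
      }
    }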

    InputSplit

InputSplit represents the data to be processed by an individual Mapper. Typically, InputSplit presents a byte-oriented view of the input, and it is the responsibility of RecordReader to process it and present a record-oriented view.

FileSplit is the default InputSplit. It sets map.input.file to the path of the input file for the logical split.

    RecordReader

RecordReader reads <key, value> pairs from an InputSplit. Typically, RecordReader converts the byte-oriented view of the input provided by the InputSplit and presents a record-oriented view to the Mapper implementations for processing. RecordReader thus assumes the responsibility of processing record boundaries and presents the tasks with keys and values.

    Job Output

OutputFormat describes the output specification for a MapReduce job. The MapReduce framework relies on the OutputFormat of the job to:

1. Validate the output specification of the job; for example, check that the output directory doesn't already exist.


2. Provide the RecordWriter implementation used to write the output files of the job. Output files are stored in a FileSystem.

TextOutputFormat is the default OutputFormat.

    OutputCommitter

OutputCommitter describes the commit of task output for a MapReduce job. The MapReduce framework relies on the OutputCommitter of the job to:

1. Set up the job during initialization. For example, create the temporary output directory for the job during the initialization of the job. Job setup is done by a separate task when the job is in the PREP state and after initializing tasks. Once the setup task completes, the job will be moved to the RUNNING state.
2. Clean up the job after job completion. For example, remove the temporary output directory after job completion. Job cleanup is done by a separate task at the end of the job. The job is declared SUCCEEDED/FAILED/KILLED after the cleanup task completes.
3. Set up the task's temporary output. Task setup is done as part of the same task, during task initialization.
4. Check whether a task needs a commit. This is to avoid the commit procedure if a task does not need it.
5. Commit the task output. Once the task is done, it will commit its output if required.
6. Discard the task commit. If the task has been failed or killed, the output will be cleaned up. If the task could not clean up (in the exception block), a separate task will be launched with the same attempt id to do the cleanup.

FileOutputCommitter is the default OutputCommitter. Job setup/cleanup tasks occupy map or reduce slots, whichever is free, on the TaskTracker. The JobCleanup task, TaskCleanup tasks, and JobSetup task have the highest priority, in that order.

    Task Side-Effect Files

In some applications, component tasks need to create and/or write to side files, which differ from the actual job output files.

In such cases there could be issues with two instances of the same Mapper or Reducer running simultaneously (for example, speculative tasks) trying to open and/or write to the same file (path) on the FileSystem. Hence you must pick unique names per task-attempt (using the attempt id, say attempt_200709221812_0001_m_000000_0), not just per task.

To avoid these issues, the MapReduce framework, when the OutputCommitter is FileOutputCommitter, maintains a special ${mapred.output.dir}/temporary/${taskid} sub-directory, accessible via ${mapred.work.output.dir}, for each task-attempt on the FileSystem where the output of the task-attempt is stored. On successful completion of the task-attempt, the files in ${mapred.output.dir}/temporary/${taskid} (only) are promoted to ${mapred.output.dir}. Of course, the framework discards the sub-directory of unsuccessful task-attempts. This process is completely transparent to the application.

The application writer can take advantage of this feature by creating any side files required in ${mapred.work.output.dir} during execution of a task via FileOutputFormat.getWorkOutputPath(), and the framework will promote them similarly for successful task-attempts, thus eliminating the need to pick unique paths per task-attempt.

Note: The value of ${mapred.work.output.dir} during execution of a particular task-attempt is actually ${mapred.output.dir}/temporary/${taskid}, and this value is set by the MapReduce framework. So, just create any side files in the path returned by FileOutputFormat.getWorkOutputPath() from the MapReduce task to take advantage of this feature.

The entire discussion holds true for maps of jobs with reducer=NONE (that is, 0 reduces), since the output of the map, in that case, goes directly to HDFS.
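A small hypothetical sketch (the file name and its content are illustrative) of writing such a side file from within a task, so that it is promoted only if the task-attempt succeeds:

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobConf;

    // Illustrative sketch: create a side file in the task's work output directory.
    public class SideFileWriter {
      public static void writeSideFile(JobConf job) throws IOException {
        Path workDir = FileOutputFormat.getWorkOutputPath(job);   // ${mapred.work.output.dir}
        Path sideFile = new Path(workDir, "stats.txt");           // hypothetical side file name
        FileSystem fs = sideFile.getFileSystem(job);
        FSDataOutputStream out = fs.create(sideFile);
        out.writeBytes("records-seen\t12345\n");                  // example content
        out.close();
      }
    }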

    RecordWriter

RecordWriter writes the output key-value pairs to an output file. RecordWriter implementations write the job outputs to the filesystem.

    Other Useful Features

    Submitting Jobs to Queues

You submit jobs to queues. Queues, as collections of jobs, allow the system to provide specific functionality. For example, queues use ACLs to control which users can submit jobs to them. Queues are expected to be used primarily by Hadoop schedulers.

Hadoop comes configured with a single mandatory queue, called default. Queue names are defined in the mapred.queue.names property of the Hadoop site configuration. Some job schedulers, such as the Capacity Scheduler, support multiple queues.

A job defines the queue it needs to be submitted to through the mapred.job.queue.name property, or through the setQueueName(String) API. Setting the queue name is optional. If a job is submitted without an associated queue name, it is submitted to the default queue.
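A minimal sketch (the queue name is hypothetical) of directing a job to a specific queue:

    import org.apache.hadoop.mapred.JobConf;

    public class QueueExample {
      public static void main(String[] args) {
        JobConf conf = new JobConf(QueueExample.class);
        conf.setQueueName("analytics");   // hypothetical queue name
        // Equivalent: conf.set("mapred.job.queue.name", "analytics");
      }
    }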

    Counters

Counters represent global counters, defined either by the MapReduce framework or by applications. Each Counter can be of any Enum type. Counters of a particular Enum are bunched into groups of type Counters.Group.

Applications can define arbitrary Counters (of type Enum) and update them via Reporter.incrCounter(Enum, long) or Reporter.incrCounter(String, String, long) in the map and/or reduce methods. These counters are then globally aggregated by the framework.
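As a hedged example (the enum and the blank-line logic are hypothetical, not taken from the tutorial's WordCount), a mapper might define and increment a counter like this:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Illustrative sketch: an application-defined counter incremented from the map method.
    public class CountingMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {

      // Counters of this enum are grouped together and aggregated by the framework.
      public enum Records { EMPTY_LINES }

      @Override
      public void map(LongWritable key, Text value,
                      OutputCollector<Text, LongWritable> output, Reporter reporter)
          throws IOException {
        if (value.toString().trim().isEmpty()) {
          reporter.incrCounter(Records.EMPTY_LINES, 1);   // bump the counter for blank input lines
          return;
        }
        output.collect(value, new LongWritable(1));
      }
    }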


    DistributedCache

DistributedCache distributes application-specific, large, read-only files efficiently. DistributedCache is a facility provided by the MapReduce framework to cache files (text, archives, JARs, and so on) needed by applications.

Applications specify the files to be cached via URLs (hdfs://) in the JobConf. The DistributedCache assumes that the files specified via hdfs:// URLs are already present on the FileSystem.

The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node. Its efficiency stems from the fact that the files are copied only once per job, and from the ability to cache archives which are un-archived on the slaves.

DistributedCache tracks the modification timestamps of the cached files. Clearly, the cache files should not be modified by the application or externally while the job is executing.

DistributedCache can be used to distribute simple, read-only data/text files and more complex types such as archives and JAR files. Archives (zip, tar, tgz, and tar.gz files) are un-archived at the slave nodes. Files have execution permissions set.
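To ground this, a hedged sketch (the HDFS path and file name are hypothetical) of the typical add-then-read pattern: register a file with the cache at job-submission time, then open the localized copy from within a task.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.net.URI;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;

    // Illustrative sketch: distribute a lookup file and read its localized copy in a task.
    public class CacheUsage {

      // Called while building the JobConf, before job submission.
      public static void addLookupFile(JobConf conf) throws IOException {
        // The file is assumed to already exist on HDFS (hypothetical path).
        DistributedCache.addCacheFile(URI.create("hdfs:///apps/lookup/terms.txt"), conf);
      }

      // Called from a task (for example, in a mapper's configure method).
      public static BufferedReader openLookupFile(JobConf conf) throws IOException {
        Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);
        // The first cached file is our lookup table in this sketch.
        return new BufferedReader(new FileReader(localFiles[0].toString()));
      }
    }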

    The