MapReduce in Action

Team 306
Led by Chen Lin

College of Information Science and Technology

Data Mining Group @ Xiamen University

Contents

1. Basic MapReduce Programs
2. Advanced MapReduce
3. Beyond the horizon
4. Discussion

Basic MapReduce Programs

[Diagram labels: Job, Configuration, Master, JobTracker]

Basic MapReduce Programs

Job Configuration?

• Implement Interface
• Environment Configuration
• Java Class

Interface

• Mapper
• Reducer
• Partitioner
• Combiner
• InputFormat
• OutputFormat

Configure

• JVM: mapred.child.java.opts
• {mapred.local.dir}
• InputPath, OutputPath
• How many Map/Reduce tasks?

Basic MapReduce Program

Text → InputFormat → Map → Reduce → OutputFormat

[Diagram labels: InputSplit, <K1,V2>, List<K1,V1>, K1,List<V1>]
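
The flow in this diagram can be sketched in plain Python. This is only an in-memory simulation of the map, shuffle-and-sort, and reduce steps, not Hadoop code; the word-count task and the names `map_fn`, `reduce_fn`, and `run_job` are assumptions made for illustration.

```python
from collections import defaultdict


def map_fn(offset, line):
    """Map: one input record in, a list of (key, value) pairs out."""
    return [(word, 1) for word in line.split()]


def reduce_fn(word, counts):
    """Reduce: one key with all of its values in, a list of pairs out."""
    return [(word, sum(counts))]


def run_job(lines):
    # Stand-in for the InputFormat: build (byte offset, line) records.
    records = []
    offset = 0
    for line in lines:
        records.append((offset, line))
        offset += len(line.encode("utf-8")) + 1  # +1 for the newline

    # Map phase.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # Shuffle and sort: group the values that share a key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase, visiting keys in sorted order.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output
```

Calling `run_job(["a b a", "b c"])` groups the three words and sums their counts.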

Basic MapReduce

PARTITIONERS AND COMBINERS

Combiners

An optimization in MapReduce that allows for local aggregation before the shuffle and sort phase.

Partitioner

Determines which reducer will be responsible for processing a particular key; the execution framework uses this information to copy the data to the right location during the shuffle and sort phase.
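
A minimal sketch of what a partitioner does, assuming a simple hash-modulo rule. This is plain Python, not the Hadoop `Partitioner` interface; `partition` and `route` are hypothetical names, and `zlib.crc32` is used only because it gives a stable hash across runs.

```python
import zlib


def partition(key, num_reducers):
    """Pick a reducer index for a key: stable hash modulo reducer count."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def route(pairs, num_reducers):
    """Send each (key, value) pair to the bucket of its reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets
```

The property that matters is that every pair with the same key lands in the same bucket, so one reducer sees all values for that key.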

Basic MapReduce Program

InputFormat

• TextInputFormat
• KeyValueTextInputFormat
• SequenceFileInputFormat
• NLineInputFormat
• Creating a custom InputFormat

InputFormat

• TextInputFormat - Each line in the text files is a record. The key is the byte offset of the line, and the value is the content of the line.

• KeyValueTextInputFormat - Each line in the text files is a record. The first separator character divides each line. Everything before the separator is the key, and everything after is the value. The separator is set by the key.value.separator.in.input.line property, and the default is the tab (\t) character.

• NLineInputFormat - Same as TextInputFormat, but each split is guaranteed to have exactly N lines. The mapred.line.input.format.linespermap property, which defaults to one, sets N.
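
The record rules described above can be imitated in a few lines of plain Python. This is a sketch of the behaviour as the slide describes it, not the Hadoop classes themselves; the function names are hypothetical, and in this sketch the last split may hold fewer than N lines when the input runs out.

```python
def text_records(data):
    """Like TextInputFormat: yield (byte offset of line, line content)."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)


def key_value_records(data, separator="\t"):
    """Like KeyValueTextInputFormat: split each line at the first separator."""
    for _, line in text_records(data):
        key, _, value = line.partition(separator)
        yield key, value


def n_line_splits(data, n=1):
    """Like NLineInputFormat: group the text records into splits of n lines."""
    records = list(text_records(data))
    return [records[i:i + n] for i in range(0, len(records), n)]
```

For the input `b"a\t1\nbb\t2\n"`, the first function keys the two lines by offsets 0 and 4, while the second keys them by `a` and `bb`.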

4

Basic MapReduce Program

types for the key/value pairs

Summary for basic Program

What’s a complete MapReduce job?

Code for the mapper, reducer, combiner, and partitioner, along with job configuration parameters. The execution framework handles everything else.

Advanced MapReduce

• Chaining MapReduce jobs
• LOCAL AGGREGATION
• SECONDARY SORTING
• Work on Hadoop Files

Chaining MapReduce jobs

So far you’ve been doing data processing tasks that a single MapReduce job can accomplish.

But as you get more comfortable writing MapReduce programs and take on more ambitious data processing tasks, you’ll find that many complex tasks need to be broken down into simpler subtasks, each accomplished by an individual MapReduce job.
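
One way to picture chaining is that the output of the first job becomes the input of the second. The sketch below simulates this in plain Python; the two subtasks (count words, then group words by their count) and all function names are assumptions chosen for illustration, not part of the original slides.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Run one simulated MapReduce job over (key, value) records."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    output = []
    for out_key in sorted(groups):
        output.extend(reducer(out_key, groups[out_key]))
    return output


# Job 1: count how often each word occurs.
def count_map(_, line):
    return [(word, 1) for word in line.split()]


def count_reduce(word, ones):
    return [(word, sum(ones))]


# Job 2: group words by their count.
def invert_map(word, count):
    return [(count, word)]


def invert_reduce(count, words):
    return [(count, sorted(words))]


def chain(lines):
    step1 = run_job(enumerate(lines), count_map, count_reduce)
    return run_job(step1, invert_map, invert_reduce)
```

Each job stays simple; the complexity lives in how the jobs are wired together.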

LOCAL AGGREGATION

In Hadoop, intermediate results are written to local disk before being sent over the network.

Reductions in the amount of intermediate data translate into increases in algorithmic efficiency.

With a combiner, it is possible to substantially reduce both the number and size of key-value pairs that need to be shuffled from the mappers to the reducers.
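
To make the saving concrete, the following plain-Python sketch counts how many pairs would cross the network with and without a combiner. It assumes a word-count job with one list of lines per mapper; `map_words`, `combine`, and `shuffled_pairs` are hypothetical names, and this is a simulation rather than Hadoop code.

```python
from collections import Counter


def map_words(line):
    """Mapper: emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Combiner: sum the counts per word within one mapper's output."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


def shuffled_pairs(lines_per_mapper, use_combiner):
    """Return every pair that would be sent from mappers to reducers."""
    sent = []
    for lines in lines_per_mapper:
        local = [pair for line in lines for pair in map_words(line)]
        sent.extend(combine(local) if use_combiner else local)
    return sent
```

With two mappers holding `["a a a b"]` and `["a b b"]`, seven pairs are shuffled without the combiner and four with it, while the final totals per word stay the same.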

Pseudo-code for computing the mean of values associated with the same string.
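
The pseudo-code figure itself did not survive in this transcript. As a stand-in, here is a minimal plain-Python sketch of the task, assuming an identity mapper and a reducer that divides a running sum by a running count; the function names are hypothetical and this is not the slide's original code.

```python
from collections import defaultdict


def mapper(key, value):
    """Identity mapper: pass each (string, number) pair through."""
    return [(key, value)]


def reducer(key, values):
    """Compute the mean of all values that share the key."""
    total = 0
    count = 0
    for value in values:
        total += value
        count += 1
    return key, total / count


def mean_by_key(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return dict(reducer(key, values) for key, values in groups.items())
```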

LOCAL AGGREGATION: is it right?

LOCAL AGGREGATION

1. A combiner must have the same input and output key-value types, and these must match the mapper's output types.

2. Combiners are optimizations and must not change the correctness of the algorithm.

Hadoop makes no guarantees on how many times combiners are called; it could be zero, one, or multiple times.
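
A small worked example shows why the second rule matters for the mean. If the mean itself is used as the combiner, the reducer ends up averaging averages, which is wrong whenever the mappers hold different numbers of values. The sketch below is plain Python with hypothetical names and made-up numbers.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def mean_with_mean_combiner(chunks):
    """Wrong on purpose: each chunk is averaged locally, then averaged again."""
    partial_means = [mean(chunk) for chunk in chunks]
    return mean(partial_means)
```

For the values 1 to 5 split as `[1, 2]` and `[3, 4, 5]`, the true mean is 3.0 but the mean of the two local means (1.5 and 4.0) is 2.75.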

LOCAL AGGREGATION: the right usage
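
The figure for this slide is missing from the transcript. A common way to use a combiner correctly for the mean, sketched here under that assumption, is to have the mapper emit (sum, count) pairs so that the combiner's input and output types match and it can run any number of times. Names and sample numbers are hypothetical; this is plain Python, not Hadoop code.

```python
def to_pair(value):
    """Mapper output: a partial (sum, count) for a single value."""
    return (value, 1)


def combine(pairs):
    """Combiner: merge partial (sum, count) pairs into one pair."""
    total = sum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    return [(total, count)]


def reduce_mean(pairs):
    """Reducer: merge all partial pairs, then divide once at the end."""
    total = sum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    return total / count


def mean_via_pairs(chunks, combiner_runs):
    shuffled = []
    for chunk in chunks:
        pairs = [to_pair(value) for value in chunk]
        for _ in range(combiner_runs):  # zero, one, or several runs
            pairs = combine(pairs)
        shuffled.extend(pairs)
    return reduce_mean(shuffled)
```

Because the division happens only in the reducer, the result is the same whether the combiner runs zero, one, or several times.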

SECONDARY SORTING

Sometimes we also need to sort by value:

(k1, m1, v8)
(k1, m2, v1)
(k1, m3, v7)
...
(k2, m1, v2)
(k2, m2, v6)
(k2, m3, v9)

Value-to-key conversion: k1 → (m1, v8) becomes (k1, m1) → (v8)
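
The idea of moving part of the value into the key can be sketched in plain Python. The code assumes records are (k, m, v) triples as on the slide, lets an ordinary sort play the role of the framework's sort on the composite key (k, m), and then groups by k alone; `secondary_sort` is a hypothetical name and this is not Hadoop code.

```python
from itertools import groupby


def secondary_sort(records):
    """Group (k, m, v) records by k, with each group ordered by m."""
    # Move m into the key: the composite key is (k, m), the value is v.
    composite = [((k, m), v) for k, m, v in records]
    # Stand-in for the framework's sort on the composite key.
    composite.sort(key=lambda pair: pair[0])
    # Group by the natural key k only, keeping the sorted order of m.
    grouped = groupby(composite, key=lambda pair: pair[0][0])
    return {k: [(key[1], v) for key, v in group] for k, group in grouped}
```

In a real job the grouping by k alone would typically need a custom partitioner so that all composite keys with the same k reach the same reducer.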

Beyond the horizon

It’s a shame: the topics I will talk about next play an important role in MapReduce, but they are beyond my horizon.

So I need all your help to master them together.

Beyond the horizon

• Create a user custom InputFormat
• Manipulate local files
• Create a user custom Partitioner
• Pipes for C++
• Streaming for other languages

Beyond the horizon

• Joining data from different sources
• Hive
• Pig
• HBase
• Multiple file output

Joining data from different sources

Orders file: CSV format, record fields (Customer ID, Order ID, Price, and Purchase Date)

Customers file: CSV format, record fields (Customer ID, Name, and Phone Number)

Joining data from different sources

Customers file:

Joey Leung,555-555-5555
Edward,123-456-7890
Jose Madriz,281-330-8004
David Stork,408-555-0000
…

Orders file:

A,12.95,02-Jun-2008
B,88.25,20-May-2008
C,32.00,30-Nov-2007
D,25.02,22-Jan-2009

Joined output:

Joey Leung,555-555-5555,B,88.25,20-May-2008
Edward,123-456-7890,C,32.00,30-Nov-2007
Jose Madriz,281-330-8004,A,12.95,02-Jun-2008
Jose Madriz,281-330-8004,D,25.02,22-Jan-2009
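
A join like this can be sketched as a reduce-side join: the map step tags each record with its source and keys it by Customer ID, and the reduce step pairs every customer with every order that shares the ID. The Customer ID column does not appear in the sample rows of this transcript, so the IDs 1 to 4 below are assumptions added for illustration, as are all function names. This is a plain-Python simulation, not Hadoop code.

```python
from collections import defaultdict

# Customer IDs (first column) are assumed; the transcript omits them.
CUSTOMERS = [
    "1,Joey Leung,555-555-5555",
    "2,Edward,123-456-7890",
    "3,Jose Madriz,281-330-8004",
    "4,David Stork,408-555-0000",
]
ORDERS = [
    "3,A,12.95,02-Jun-2008",
    "1,B,88.25,20-May-2008",
    "2,C,32.00,30-Nov-2007",
    "3,D,25.02,22-Jan-2009",
]


def tag(lines, source):
    """Map step: key each record by Customer ID and tag it with its source."""
    for line in lines:
        customer_id, rest = line.split(",", 1)
        yield customer_id, (source, rest)


def join(customers, orders):
    # Shuffle: bring records from both sources together per Customer ID.
    groups = defaultdict(list)
    for key, tagged in list(tag(customers, "customer")) + list(tag(orders, "order")):
        groups[key].append(tagged)

    # Reduce step: pair every customer record with every order record.
    joined = []
    for key in sorted(groups):
        names = [rest for source, rest in groups[key] if source == "customer"]
        items = [rest for source, rest in groups[key] if source == "order"]
        joined.extend(f"{name},{item}" for name in names for item in items)
    return joined
```

Under these assumed IDs, `join(CUSTOMERS, ORDERS)` yields the four joined rows shown on the slide; David Stork has no orders and so does not appear.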

Thank you!

Data Mining Group @ Xiamen University