77
Big Data Analytics Big Data Analytics C. Distributed Computing Emvironments / C.1 Map Reduce Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) Institute for Computer Science University of Hildesheim, Germany Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 1 / 32

C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

  • Upload
    others

  • View
    8

  • Download
    0

Embed Size (px)

Citation preview

Page 1: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics

Big Data AnalyticsC. Distributed Computing Emvironments / C.1 Map Reduce

Lars Schmidt-Thieme

Information Systems and Machine Learning Lab (ISMLL)Institute for Computer Science

University of Hildesheim, Germany

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany1 / 32

Page 2: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics

SyllabusTue. 9.4. (1) 0. Introduction

A. Parallel ComputingTue. 16.4. (2) A.1 ThreadsTue. 23.4. (3) A.2 Message Passing Interface (MPI)Tue. 30.4. (4) A.3 Graphical Processing Units (GPUs)

B. Distributed StorageTue. 7.5. (5) B.1 Distributed File SystemsTue. 14.5. (6) B.2 Partioning of Relational DatabasesTue. 21.5. (7) B.3 NoSQL Databases

C. Distributed Computing EnvironmentsTue. 28.5. (8) C.1 Map-ReduceTue. 4.6. — — Pentecoste Break —Tue. 11.6. (9) C.2 Resilient Distributed Datasets (Spark)Tue. 18.6. (10) C.3 Computational Graphs (TensorFlow)

D. Distributed Machine Learning AlgorithmsTue. 25.6. (11) D.1 Distributed Stochastic Gradient DescentTue. 2.7. (12) D.2 Distributed Matrix Factorization

Tue. 9.7. (13) Questions and AnswersLars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

1 / 32

Page 3: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics

Outline

1. Introduction

2. Parallel Computing Speedup

3. Example: Counting Words

4. Map-Reduce

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany1 / 32

Page 4: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 1. Introduction

Outline

1. Introduction

2. Parallel Computing Speedup

3. Example: Counting Words

4. Map-Reduce

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany1 / 32

Page 5: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 1. Introduction

Technology Stack

Part D

Part C

Part B

Part A

Distributed MachineLearning Algorithms

Distributed Execution Environments

Distributed Storage

Parallel/Distributed Computing

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany1 / 32

Page 6: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 1. Introduction

Why do we need a Computational Model?

I Our data is nicely stored in a distributed infrastructure

I We have a number of computers at our disposal

I We want our analytics software to take advantage of all thiscomputing power

I When programming we want to focus on understanding our data andnot our infrastructure

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany2 / 32

Page 7: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 1. Introduction

Shared Memory Infrastructure

Processor

Memory

Processor Processor

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany3 / 32

Page 8: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 1. Introduction

Distributed Infrastructure

Processor

Memory

Network

Processor

Memory

Processor

Memory

Processor

Memory

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany4 / 32

Page 9: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 2. Parallel Computing Speedup

Outline

1. Introduction

2. Parallel Computing Speedup

3. Example: Counting Words

4. Map-Reduce

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany5 / 32

Page 10: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 2. Parallel Computing Speedup

Parallel Computing / SpeedupI We have p processors available to execute a task T

I Ideally: the more processors the faster a task is executed

I Reality: synchronisation and communication costs

I Speedup s(T , p) of a task T by using p processors:I Be t(T , p) the time needed to execute T using p processors

I Speedup is given by:

s(T , p) =t(T , 1)t(T , p)

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany5 / 32

Page 11: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 2. Parallel Computing Speedup

Parallel Computing / Efficiency

I We have p processors available to execute a task T

I Efficiency e(T , p) of a task T by using p processors:

e(T , p) =t(T , 1)

p · t(T , p)

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany6 / 32

Page 12: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 2. Parallel Computing Speedup

Considerations

I It is not worth using a lot of processors for solving small problems

I Algorithms should increase efficiency with problem size

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany7 / 32

Page 13: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 3. Example: Counting Words

Outline

1. Introduction

2. Parallel Computing Speedup

3. Example: Counting Words

4. Map-Reduce

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany8 / 32

Page 14: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 3. Example: Counting Words

Word Count Example

I Given a corpus of text documents

D := {d1, . . . , dn}

each containing a sequence of words:

(w1, . . . ,wm)

from a set W of possible words.

I the task is to generate word counts for each word in the corpus

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany8 / 32

Page 15: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 3. Example: Counting Words

Paradigms — Shared MemoryI All the processors have access to all counters

I Counters can be overwritten

I Processors need to lock counters before using them

Shared vector for word counts: c ∈ N|W |c ← {0}|W |

Each processor:1. access a document d ∈ D2. for each word w in document d :

2.1 lock(cw )2.2 cw ← cw + 12.3 unlock(cw )

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany9 / 32

Page 16: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 3. Example: Counting Words

Paradigms — Shared Memory

I inefficient due to waiting times for the locksI the more processors, the less efficient

I in a distributed scenario even worsedue to communication overhead for acquiring/releasing the locks

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany10 / 32

Page 17: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 3. Example: Counting Words

Paradigms — Message passing

I Each processor sees only one part of the data

π(D, p) := {dp nP, . . . , d(p+1) nP−1}

I Each processor works on its partition

I Results are exchanged between processors (message passing)

For each processor p:1. For each d ∈ π(D, p)

1.1 process(d)

2. Communicate results

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany11 / 32

Page 18: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 3. Example: Counting Words

Word Count — Message passing

We need to define two types of processes:1. slave

I counts the words on a subset of documents and informs the master2. master

I gathers counts from the slaves and sums them up

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany12 / 32

Page 19: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 3. Example: Counting Words

Word Count — Message passing

Slave:Local memory:

subset of documents: π(D, p) := {dp nP, . . . , d(p+1) nP−1}

address of the master: addr_masterlocal word counts: c ∈ R|W |

1. c ← {0}|W |

2. for each document d ∈ π(D, p)for each word w in document d :

cw ← cw + 1

3. Send message send(addr_master, c)

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany13 / 32

Page 20: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 3. Example: Counting Words

Word Count — Message passing

Master:Local memory:1. Global word counts: cglobal ∈ R|W |

2. List of slaves: S

cglobal ← {0}|W |

s ← {0}|S |

For each received message (p, cp)

1. cglobal ← cglobal + cp

2. sp ← 13. if ||s||1 = |S | return cglobal

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany14 / 32

Page 21: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 3. Example: Counting Words

Paradigms — Message passing

I We need to manually assign master and slave roles for each processor

I Partition of the data needs to be done manually

I Implementations like OpenMPI only provide services to exchangemessages

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany15 / 32

Page 22: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 4. Map-Reduce

Outline

1. Introduction

2. Parallel Computing Speedup

3. Example: Counting Words

4. Map-Reduce

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany16 / 32

Page 23: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 4. Map-Reduce

Map-Reduce

I distributed computing environmentI introduced 2004 by Google

I open source reference implementation: Hadoop (since 2006)I meanwhile supported by many distributed programming environments

I e.g., in document databases such as MongoDB

I builds on a job schedulerI for Hadoop: yarn

I considers data is partitioned over nodesI for Hadoop: blocks of a file in a distributed filesystem

I high level abstractionI programmer only specifies a map and a reduce function

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany16 / 32

Page 24: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 4. Map-Reduce

Key-Value input data

I Map-Reduce requires the data to be stored in a key-value format

I Examples:

Key Valuedocument array of wordsdocument worduser moviesuser friendsuser tweet

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany17 / 32

Page 25: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 4. Map-Reduce

Map-Reduce / Idea

I represent input and output data as key/value pairs.

I break down computation into three phases:1. map phase

I apply a function map to each input key/value pairI the result is represented also as key/value pairsI execute map on each data node for its data partition (data locality)

2. shuffle phaseI group all intermediate key/value pairs by key to key/valueset pairsI repartition the intermediate data by key

3. reduce phaseI apply a function reduce to each intermediate key/valueset pairI execute reduce on each node for its intermediate data partition

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany18 / 32

Note: The shuffle phase is also called sort or merge phase.

Page 26: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 4. Map-Reduce

Map-Reduce

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany19 / 32

Page 27: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 4. Map-Reduce

The Paradigm - Formally

Let I be a set called input keys,O be a set called output keys,X be a space called input values,V be a space called intermediate values,Y be a space called output values

A functionm : I × X → (O × V )∗

is called map,a function

r : O × (V ∗)→ O × Y

reducer.

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany20 / 32

Note: As always, ∗ denotes sequences.

Page 28: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 4. Map-Reduce

Map-Reduce Driver Algorithm1 map-reduce(m, r , p, (Dw )w∈W ):2 in parallel on workers w ∈W :3 E := ∅4 for (i , x) ∈ Dw :5 E := E ∪m(i , x)6 E ′ := dict(default = ∅)7 for (o, v) ∈ E :8 E ′[o] := E ′[o] ∪ {v}9 synchronize all workers

10 for (o, vs) ∈ E ′:11 send (o, vs) to worker p(o)12 F := dict(default = ∅)13 for all (o, vs) received:14 F (o) := F (o) ∪ vs15 synchronize all workers16 Gw := ∅17 for all (o, vs) ∈ F :18 Gw := Gw ∪ r(o, vs)

whereI m a mapper,I r a reducer,I p : O →W a partition function for

the output keys,I (Dw )w∈W a dataset partition stored

on worker w

result:I dataset G , stored distributed on

workers W

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany21 / 32

Page 29: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 4. Map-Reduce

Word Count ExampleMap:

I Input: document-word list pairs

I Output: word-count pairs

(dn, (w1, . . . ,wM)) 7→ ((w , cw )w∈W :cw>0)

Reduce:

I Input: word-(count list) pairs

I Output: word-count pairs

(w , (c1, . . . , cK )) 7→ (w ,K∑

k=1

ck)

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany22 / 32

Page 30: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 4. Map-Reduce

Word Count Example

(d1, “love ain't no stranger”)

(d2, “crying in the rain”)

(d3, “looking for love”)

(d4, “I'm crying”)

(d5, “the deeper the love”)

(d6, “is this love”)

(d7, “Ain't no love”)

(love, 1)

(ain't, 1)

(stranger,1)

(crying, 1)

(rain, 1)

(looking, 1)

(love, 2)

(crying, 1)

(deeper, 1)

(this, 1)

(love, 2)

(ain't, 1)

(love, 1)

(ain't, 1)

(stranger,1)

(crying, 1)

(rain, 1)

(looking, 1)

(love, 2)

(crying, 1)

(deeper, 1)

(this, 1)

(love, 2)

(ain't, 1)

(love, 5)

(stranger,1)

(crying, 2)

(ain't, 2)

(rain, 1)

(looking, 1)

(deeper, 1)

(this, 1)

Mappers ReducersLars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

23 / 32

Page 31: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 4. Map-Reduce

Hadoop Example / Map

1 public static class Map2 extends MapReduceBase3 implements Mapper<LongWritable, Text, Text, IntWritable> {4 private final static IntWritable one = new IntWritable(1);5 private Text word = new Text();67 public void map(LongWritable key, Text value,8 OutputCollector<Text, IntWritable> output,9 Reporter reporter)

10 throws IOException {1112 String line = value. toString ();13 StringTokenizer tokenizer = new StringTokenizer(line );1415 while ( tokenizer .hasMoreTokens()) {16 word.set( tokenizer .nextToken());17 output. collect (word, one);18 }19 }20 }

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany24 / 32

Page 32: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 4. Map-Reduce

Hadoop Example / Reduce

1 public static class Reduce2 extends MapReduceBase3 implements Reducer<Text, IntWritable, Text, IntWritable> {45 public void reduce(Text key, Iterator <IntWritable> values,6 OutputCollector<Text, IntWritable> output,7 Reporter reporter)8 throws IOException {9

10 int sum = 0;11 while (values .hasNext())12 sum += values.next().get();1314 output. collect (key, new IntWritable(sum));15 }16 }

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany25 / 32

Page 33: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 4. Map-Reduce

Hadoop Example / Main

1 public static void main(String [] args) throws Exception {2 JobConf conf = new JobConf(WordCount.class);3 conf .setJobName("wordcount");45 conf .setOutputKeyClass(Text.class);6 conf .setOutputValueClass(IntWritable.class);78 conf .setMapperClass(Map.class);9 conf .setCombinerClass(Reduce.class);

10 conf .setReducerClass(Reduce.class);1112 conf .setInputFormat(TextInputFormat.class);13 conf .setOutputFormat(TextOutputFormat.class);1415 FileInputFormat.setInputPaths(conf, new Path(args[0]));16 FileOutputFormat.setOutputPath(conf, new Path(args[1]));1718 JobClient .runJob(conf);19 }20 }

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany26 / 32

Page 34: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 4. Map-Reduce

Execution

I mappers are executed in parallel

I reducers are executed in parallelI started after all mappers have completed

I high-level abstraction:I No need to worry about how many processors are available

I No need to specify which ones will be mappers and which ones will bereducers

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany27 / 32

Page 35: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 4. Map-Reduce

Fault Tolerance

I map failureI re-execute map

I preferably on another node

I speculative executionI execute two mappers on each data segment in parallel eachI keep results from the firstI kill the slower one, once the other completed.

I node failureI re-execute completed and in-progress map()

I re-execute in-progress reduce tasks

I particular key-value pairs that cause mappers to crashI skip just the problematic pairs

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany28 / 32

Page 36: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 4. Map-Reduce

Parallel Efficiency of Map-Reduce

I We have p processors for performing map and reduce operationsI Time to perform a task T on data D: t(T , 1) = wDI Time for producing intermediate data after the map phase:

t(T inter, 1) = σDI Overheads:

I intermediate data per mapper: σDp

I each of the p reducers needs to read one p-th of the data written byeach of the p mappers:

σDp

1p

p =σDp

I Time for performing the task with Map-reduce:

tMR(T , p) =wDp

+ 2KσDp

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany29 / 32

Note: K represents the overhead of IO operations (reading and writing data to disk)

Page 37: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 4. Map-Reduce

Parallel Efficiency of Map-Reduce

I Time for performing the task in one processor: wD

I Time for performing the task with p processors on Map-reduce:

tMR(T , p) =wDp

+ 2KσDp

I Efficiency of Map-Reduce:

eMR(T , p) =t(T , 1)

p · t(T , p)

=wD

p(wDp + 2K σD

p )

=1

1+ 2K σw

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany30 / 32

Page 38: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 4. Map-Reduce

Parallel Efficiency of Map-Reduce

eMR(T , p) =1

1+ 2K σw

I Apparently the efficiency is independent of p

I High speedups can be achieved with large number of processors

I If σ is high (too much intermediate data) the efficiency deteriorates

I In many cases σ depends on p

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany31 / 32

Page 39: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 4. Map-Reduce

SummaryI Map Reduce is a distributed computing framework.

I Map Reduce represents input and output data as key/value pairs.

I Map Reduce decomposes computation into three phases:1. map: applying a function to each input key/value pair.

2. shuffle:I grouping intermediate data into key/valueset pairsI repartitioning intermediate data by key over nodes

3. reduce: applying a function to each intermediate key/valueset pair.

I Mappers are executed data local

I For a program, the map and reduce functions have to be specified.I the shuffle step is fixed.

I The size of the intermediate data is crucial for efficiency as it has tobe repartioned.

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany32 / 32

Page 40: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics

Further Readings

I original Map Reduce framework by Google:I Dean and Ghemawat [2004, 2008, 2010]

I MapReduce reference implementation in Hadoop:I [White, 2015, ch. 2, 7]

I also ch. 6, 8 and 9.

I MapReduce in a document database, MongoDB:I [Chodorow, 2013, ch. 7]

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany33 / 32

Page 41: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 5. Map-Reduce Tutorial

Outline

5. Map-Reduce Tutorial

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany34 / 32

Page 42: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 5. Map-Reduce Tutorial

MappersI extend baseclass Mapper<KI,VI,KO,VO>

(org.apache.hadoop.mapreduce)I types: KI = input key, VI = input value,

KO = output key, VO = output value.

I overwrite map(KI key, VI value, Context ctxt)I Context is an inner class of Mapper<KI,VI,KO,VO>

I write(KO,VO): write next output pair

I optionally,setup(Context ctxt): set up mapper

I called once before first call to map

I cleanup(Context ctxt): clean up mapperI called once after last call to map

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany34 / 32

Page 43: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 5. Map-Reduce Tutorial

MappersI extend baseclass Mapper<KI,VI,KO,VO>

(org.apache.hadoop.mapreduce)I types: KI = input key, VI = input value,

KO = output key, VO = output value.

I overwrite map(KI key, VI value, Context ctxt)I Context is an inner class of Mapper<KI,VI,KO,VO>

I write(KO,VO): write next output pair

I optionally,setup(Context ctxt): set up mapper

I called once before first call to map

I cleanup(Context ctxt): clean up mapperI called once after last call to map

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany34 / 32

Page 44: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 5. Map-Reduce Tutorial

MappersI extend baseclass Mapper<KI,VI,KO,VO>

(org.apache.hadoop.mapreduce)I types: KI = input key, VI = input value,

KO = output key, VO = output value.

I overwrite map(KI key, VI value, Context ctxt)I Context is an inner class of Mapper<KI,VI,KO,VO>

I write(KO,VO): write next output pair

I optionally,setup(Context ctxt): set up mapper

I called once before first call to map

I cleanup(Context ctxt): clean up mapperI called once after last call to map

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany34 / 32

Page 45: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 5. Map-Reduce Tutorial

Key and Value Types

Requirements for key and value types:I serializable — Writable (org.apache.hadoop.io)

I To store the output of mappers, combiners and reducers in files

Additional requirements for key types:

I comparable — Comparable (java.lang)I To sort the output of mappers and combiners by key

I stable hashcode across different JVM instancesI To partition the output of combiners by key.

I The default implementation in Object is not stable!

I pooled in interface WritableComparable (org.apache.hadoop.io)

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany35 / 32

Page 46: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 5. Map-Reduce Tutorial

Key and Value Types

Requirements for key and value types:I serializable — Writable (org.apache.hadoop.io)

I To store the output of mappers, combiners and reducers in files

Additional requirements for key types:I comparable — Comparable (java.lang)

I To sort the output of mappers and combiners by key

I stable hashcode across different JVM instancesI To partition the output of combiners by key.

I The default implementation in Object is not stable!

I pooled in interface WritableComparable (org.apache.hadoop.io)

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany35 / 32

Page 47: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 5. Map-Reduce Tutorial

Key and Value Types

Requirements for key and value types:I serializable — Writable (org.apache.hadoop.io)

I To store the output of mappers, combiners and reducers in files

Additional requirements for key types:I comparable — Comparable (java.lang)

I To sort the output of mappers and combiners by key

I stable hashcode across different JVM instancesI To partition the output of combiners by key.

I The default implementation in Object is not stable!

I pooled in interface WritableComparable (org.apache.hadoop.io)

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany35 / 32

Page 48: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 5. Map-Reduce Tutorial

Key and Value Types

Requirements for key and value types:I serializable — Writable (org.apache.hadoop.io)

I To store the output of mappers, combiners and reducers in files

Additional requirements for key types:I comparable — Comparable (java.lang)

I To sort the output of mappers and combiners by key

I stable hashcode across different JVM instancesI To partition the output of combiners by key.

I The default implementation in Object is not stable!

I pooled in interface WritableComparable (org.apache.hadoop.io)

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany35 / 32

Page 49: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 5. Map-Reduce Tutorial

SerializationTo store the output of mappers, combiners and reducers in files, keys andvalues have to be serialized:I hadoop requires all keys and values of a step to be of the same class. the class of an object to deserialize is known in advance.

I standard Java serialization (java.lang.Serializable) serializes classinformation, thus is more verbose and complex and therefore not usedin hadoop.

I new interface Writable (org.apache.hadoop.io):I void write(DataOutput out) throws IOException:

write object to a data output.

I void readFields(DataInput in) throws IOException:set member variables (fields) of an object to values read from a datainput.

I standard DataInput and DataOutput (java.io)

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany36 / 32

Page 50: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 5. Map-Reduce Tutorial

Serialization (2/2)I Writable wrappers for elementary data types:

BooleanWritable booleanByteWritable byteDoubleWritable doubleFloatWritable floatIntWritable intLongWritable longShortWritable shortText String

(all in org.apache.hadoop.io)I these are not subclasses of the default wrappers Integer, Double etc.

(as the latter are final)

I If one needs to pass more complex objects between steps,implement custom Writable (in terms of these elementaryWritables).

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany37 / 32

Page 51: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 5. Map-Reduce Tutorial

Serialization (2/2)I Writable wrappers for elementary data types:

BooleanWritable booleanByteWritable byteDoubleWritable doubleFloatWritable floatIntWritable intLongWritable longShortWritable shortText String

(all in org.apache.hadoop.io)I these are not subclasses of the default wrappers Integer, Double etc.

(as the latter are final)

I If one needs to pass more complex objects between steps,implement custom Writable (in terms of these elementaryWritables).

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany37 / 32

Page 52: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 5. Map-Reduce Tutorial

Example 1 / Mapper (in principle)

1 import java . io .IOException;2 import java . util . StringTokenizer ;34 import org.apache.hadoop.io. IntWritable ;5 import org.apache.hadoop.io.Text;6 import org.apache.hadoop.mapreduce.Mapper;78 public class TokenizerMapperS9 extends Mapper<Object, Text, Text, IntWritable>{

1011 public void map(Object key, Text value , Context context12 ) throws IOException, InterruptedException {13 StringTokenizer itr = new StringTokenizer(value. toString ());14 while ( itr .hasMoreTokens())15 context . write (new Text( itr .nextToken()), new IntWritable (1));16 }17 }

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany38 / 32

Page 53: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 5. Map-Reduce Tutorial

Example 1 / Mapper

1 import java . io .IOException;2 import java . util . StringTokenizer ;34 import org.apache.hadoop.io. IntWritable ;5 import org.apache.hadoop.io.Text;6 import org.apache.hadoop.mapreduce.Mapper;78 public class TokenizerMapper9 extends Mapper<Object, Text, Text, IntWritable>{

1011 private final static IntWritable one = new IntWritable(1);12 private Text word = new Text();1314 public void map(Object key, Text value , Context context15 ) throws IOException, InterruptedException {16 StringTokenizer itr = new StringTokenizer(value. toString ());17 while ( itr .hasMoreTokens()) {18 word.set( itr .nextToken());19 context . write (word, one);20 }21 }22 }

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany39 / 32

Page 54: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 5. Map-Reduce Tutorial

Job Configuration

Configuration (org.apache.hadoop.conf):I default constructor:

read default configuration from filesI core-default.xml and

I core-site.xml

(to be found in the classpath).

I addResource(Path file):update configuration from another configuration file.

I String get(String name):get the value of a configuration option.

I set(String name, String value):set the value of a configuration option.

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany40 / 32

Page 55: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 5. Map-Reduce Tutorial

Job

Job (org.apache.hadoop.conf):I constructor Job(Configuration conf, String name):

create a new job.

I setMapperClass(Class cls), setCombinerClass(Class cls),setReducerClass(Class cls):set the class for mappers, combiners and reducers

I setOutputKeyClass(Class cls), setOutputValueClass(Class cls):set the class for output keys and values.

I boolean waitForCompletion(boolean verbose):submit job and wait until it completes.

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany41 / 32

Page 56: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 5. Map-Reduce Tutorial

Input and Output Paths

FileInputFormat (org.apache.hadoop.mapreduce.lib.input):I addInputPath(Job job, Path path):

add input pathsFileOutputFormat (org.apache.hadoop.mapreduce.lib.output):I setOutputPath(Job job, Path path):

set output paths

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany42 / 32

Page 57: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 5. Map-Reduce Tutorial

Example 1 / Job Runner (Mapper Only)

1 import java . io .IOException;2 import java . util . StringTokenizer ;34 import org.apache.hadoop.conf. Configuration ;5 import org.apache.hadoop.fs .Path;6 import org.apache.hadoop.io. IntWritable ;7 import org.apache.hadoop.io.Text;8 import org.apache.hadoop.mapreduce.Job;9 import org.apache.hadoop.mapreduce.lib. input . FileInputFormat ;

10 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;1112 public class MRJobStarter1 {13 public static void main(String [] args) throws Exception {14 Configuration conf = new Configuration ();15 Job job = Job.getInstance(conf , "word count");16 job .setMapperClass(TokenizerMapper.class );17 job .setOutputKeyClass(Text. class );18 job . setOutputValueClass( IntWritable . class );19 FileInputFormat .addInputPath(job, new Path(args [0]));20 FileOutputFormat.setOutputPath(job, new Path(args [1]));2122 System.exit (job .waitForCompletion(true) ? 0 : 1);23 }24 }

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany43 / 32

Page 58: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 5. Map-Reduce Tutorial

Compiling and Running Map Reduce Classes

1. Set paths1 export PATH=PATH : /home/lst/system/hadoop/binexportHADOOPCLASSPATH =(JAVA_HOME)/lib/tools.jar

2. Compile sources:1 hadoop com.sun.tools. javac .Main −sourcepath . MRJobStarter1.java

3. Package all class files of the job into a jar:1 jar cf job . jar MRJobStarter1.class TokenizerMapper.class

4. Run the jar:1 hadoop jar job . jar MRJobStarter1 /ex1/input /ex1/output

I the output directory must not yet exist.

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany44 / 32

Page 59: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 5. Map-Reduce Tutorial

Compiling and Running Map Reduce Classes

1. Set paths1 export PATH=PATH : /home/lst/system/hadoop/binexportHADOOPCLASSPATH =(JAVA_HOME)/lib/tools.jar

2. Compile sources:1 hadoop com.sun.tools. javac .Main −sourcepath . MRJobStarter1.java

3. Package all class files of the job into a jar:1 jar cf job . jar MRJobStarter1.class TokenizerMapper.class

4. Run the jar:1 hadoop jar job . jar MRJobStarter1 /ex1/input /ex1/output

I the output directory must not yet exist.

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany44 / 32

Page 60: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 5. Map-Reduce Tutorial

Compiling and Running Map Reduce Classes

1. Set paths1 export PATH=PATH : /home/lst/system/hadoop/binexportHADOOPCLASSPATH =(JAVA_HOME)/lib/tools.jar

2. Compile sources:1 hadoop com.sun.tools. javac .Main −sourcepath . MRJobStarter1.java

3. Package all class files of the job into a jar:1 jar cf job . jar MRJobStarter1.class TokenizerMapper.class

4. Run the jar:1 hadoop jar job . jar MRJobStarter1 /ex1/input /ex1/output

I the output directory must not yet exist.

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany44 / 32

Page 61: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 5. Map-Reduce Tutorial

Compiling and Running Map Reduce Classes

1. Set paths1 export PATH=PATH : /home/lst/system/hadoop/binexportHADOOPCLASSPATH =(JAVA_HOME)/lib/tools.jar

2. Compile sources:1 hadoop com.sun.tools. javac .Main −sourcepath . MRJobStarter1.java

3. Package all class files of the job into a jar:1 jar cf job . jar MRJobStarter1.class TokenizerMapper.class

4. Run the jar:1 hadoop jar job . jar MRJobStarter1 /ex1/input /ex1/output

I the output directory must not yet exist.

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany44 / 32

Page 62: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 5. Map-Reduce Tutorial

Example 1 / Inputs and Output

I input 1:1 Hello World Bye World

I input 2:1 Hello Hadoop Goodbye Hadoop

I output:lst@lst-uni:~> hdfs dfs -ls /ex1/output.ex2Found 2 items-rw-r–r– 2 lst supergroup 0 2016-05-24 18:59 /ex1/output.ex2/_SUCCESS-rw-r–r– 2 lst supergroup 66 2016-05-24 18:59 /ex1/output.ex2/part-r-00000lst@lst-uni:~> hdfs dfs -cat /ex1/output.ex2/part-r-00000Bye 1Goodbye 1Hadoop 1Hadoop 1Hello 1Hello 1World 1World 1

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany45 / 32

Page 63: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 5. Map-Reduce Tutorial

ReducersI extend baseclass Reducer<KI,VI,KO,VO>

(org.apache.hadoop.mapreduce)I types: KI = input key, VI = input value,

KO = output key, VO = output value.

I overwrite reduce(KI key, Iterable<VI> value, Context ctxt)I Context is an inner class of Reducer<KI,VI,KO,VO>

I write(KO,VO): write next output pair

I optionally,setup(Context ctxt): set up reducer

I called once before first call to reduce

I cleanup(Context ctxt): clean up reducerI called once after last call to reduce

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany46 / 32

Page 64: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 5. Map-Reduce Tutorial

ReducersI extend baseclass Reducer<KI,VI,KO,VO>

(org.apache.hadoop.mapreduce)I types: KI = input key, VI = input value,

KO = output key, VO = output value.

I overwrite reduce(KI key, Iterable<VI> value, Context ctxt)I Context is an inner class of Reducer<KI,VI,KO,VO>

I write(KO,VO): write next output pair

I optionally,setup(Context ctxt): set up reducer

I called once before first call to reduce

I cleanup(Context ctxt): clean up reducerI called once after last call to reduce

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany46 / 32

Page 65: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 5. Map-Reduce Tutorial

ReducersI extend baseclass Reducer<KI,VI,KO,VO>

(org.apache.hadoop.mapreduce)I types: KI = input key, VI = input value,

KO = output key, VO = output value.

I overwrite reduce(KI key, Iterable<VI> value, Context ctxt)I Context is an inner class of Reducer<KI,VI,KO,VO>

I write(KO,VO): write next output pair

I optionally,setup(Context ctxt): set up reducer

I called once before first call to reduce

I cleanup(Context ctxt): clean up reducerI called once after last call to reduce

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany46 / 32

Page 66: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 5. Map-Reduce Tutorial

Example 2 / Reducer

1 import java . io .IOException;23 import org.apache.hadoop.io. IntWritable ;4 import org.apache.hadoop.io.Text;5 import org.apache.hadoop.mapreduce.Reducer;67 public class IntSumReducer8 extends Reducer<Text,IntWritable,Text, IntWritable> {9 private IntWritable result = new IntWritable ();

1011 public void reduce(Text key, Iterable <IntWritable> values,12 Context context13 ) throws IOException, InterruptedException {14 int sum = 0;15 for ( IntWritable val : values )16 sum += val.get();17 result . set(sum);18 context . write (key, result );19 }20 }

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany47 / 32

Page 67: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 5. Map-Reduce Tutorial

Example 2 / Job Runner

1 import java . io .IOException;2 import java . util . StringTokenizer ;34 import org.apache.hadoop.conf. Configuration ;5 import org.apache.hadoop.fs .Path;6 import org.apache.hadoop.io. IntWritable ;7 import org.apache.hadoop.io.Text;8 import org.apache.hadoop.mapreduce.Job;9 import org.apache.hadoop.mapreduce.lib. input . FileInputFormat ;

10 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;1112 public class MRJobStarter2 {13 public static void main(String [] args) throws Exception {14 Configuration conf = new Configuration ();15 Job job = Job.getInstance(conf , "word count");16 job .setMapperClass(TokenizerMapper.class );17 job . setReducerClass (IntSumReducer.class);18 job .setOutputKeyClass(Text. class );19 job . setOutputValueClass( IntWritable . class );20 FileInputFormat .addInputPath(job, new Path(args [0]));21 FileOutputFormat.setOutputPath(job, new Path(args [1]));2223 System.exit (job .waitForCompletion(true) ? 0 : 1);24 }25 }

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany48 / 32

Page 68: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 5. Map-Reduce Tutorial

Example 2 / Output

lst@lst-uni:~> hdfs dfs -cat /ex1/output.2/part*Bye 1Goodbye 1Hadoop 2Hello 2World 2

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany49 / 32

Page 69: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 5. Map-Reduce Tutorial

Example 3 / Job Runner: Only Mapper and Combiner

1 import java . io .IOException;2 import java . util . StringTokenizer ;34 import org.apache.hadoop.conf. Configuration ;5 import org.apache.hadoop.fs .Path;6 import org.apache.hadoop.io. IntWritable ;7 import org.apache.hadoop.io.Text;8 import org.apache.hadoop.mapreduce.Job;9 import org.apache.hadoop.mapreduce.lib. input . FileInputFormat ;

10 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;1112 public class MRJobStarter3 {13 public static void main(String [] args) throws Exception {14 Configuration conf = new Configuration ();15 Job job = Job.getInstance(conf , "word count");16 job .setMapperClass(TokenizerMapper.class );17 job . setCombinerClass(IntSumReducer.class);18 job .setOutputKeyClass(Text. class );19 job . setOutputValueClass( IntWritable . class );20 FileInputFormat .addInputPath(job, new Path(args [0]));21 FileOutputFormat.setOutputPath(job, new Path(args [1]));2223 System.exit (job .waitForCompletion(true) ? 0 : 1);24 }25 }

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany50 / 32

Page 70: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 5. Map-Reduce Tutorial

Example 3 / Output

lst@lst-uni:~> hdfs dfs -cat /ex1/output.3/part*Bye 1Goodbye 1Hadoop 2Hello 1Hello 1World 2

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany51 / 32

Page 71: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 5. Map-Reduce Tutorial

Example 4 / Job Runner: Combiner and Reducer1 import java . io .IOException;2 import java . util . StringTokenizer ;34 import org.apache.hadoop.conf. Configuration ;5 import org.apache.hadoop.fs .Path;6 import org.apache.hadoop.io. IntWritable ;7 import org.apache.hadoop.io.Text;8 import org.apache.hadoop.mapreduce.Job;9 import org.apache.hadoop.mapreduce.lib. input . FileInputFormat ;

10 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;1112 public class MRJobStarter4 {13 public static void main(String [] args) throws Exception {14 Configuration conf = new Configuration ();15 Job job = Job.getInstance(conf , "word count");16 job .setMapperClass(TokenizerMapper.class );17 job . setCombinerClass(IntSumReducer.class);18 job . setReducerClass (IntSumReducer.class);19 job .setOutputKeyClass(Text. class );20 job . setOutputValueClass( IntWritable . class );21 FileInputFormat .addInputPath(job, new Path(args [0]));22 FileOutputFormat.setOutputPath(job, new Path(args [1]));2324 System.exit (job .waitForCompletion(true) ? 0 : 1);25 }26 }

The output is the same as only with a reducer,but less intermediate data is moved.Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

52 / 32

Page 72: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 5. Map-Reduce Tutorial

Predefined Mappers

I Mapper: (org.apache.hadoop.mapreduce):

m(k , v) := ((k , v))

I InverseMapper (org.apache.hadoop.mapreduce.lib.map):

m(k , v) := ((v , k))

I ChainMapper (org.apache.hadoop.mapreduce.lib.chain)

chainm,`(k , v) := (`(k ′,m′) | (k ′, v ′) ∈ m(k , v))

I TokenCounterMapper (org.apache.hadoop.mapreduce.lib.map):

m(k , v) := ((k ′, 1) | k ′ ∈ tokenize(v))

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany53 / 32

Page 73: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 5. Map-Reduce Tutorial

Predefined Mappers (2/2)

I RegexMapper (org.apache.hadoop.mapreduce.lib.map):

m(k , v) := ((k ′, 1) | k ′ ∈ find-regex(v))

Regex pattern to search for andgroups to reportcan be set via configuration optionsPATTERN mapreduce.mapper.regexGROUP mapreduce.mapper.regexmapper..group

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany54 / 32

Page 74: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 5. Map-Reduce Tutorial

Predefined Reducers

I Reducer: (org.apache.hadoop.mapreduce):

r(k ,V ) := (k ,V )

I IntSumReducer, LongSumReducer(org.apache.hadoop.mapreduce.lib.reduce):

m(k ,V ) := (v ,∑v∈V

v))

I ChainReducer (org.apache.hadoop.mapreduce.lib.chain)

chainr ,s(k ,V ) := s(r(k ,V ))

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany55 / 32

Page 75: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 5. Map-Reduce Tutorial

Job Default Values

mapper class: class org.apache.hadoop.mapreduce.Mappercombiner class: nullreducer class: class org.apache.hadoop.mapreduce.Reducer

output key class: class org.apache.hadoop.io.LongWritableoutput value class: class org.apache.hadoop.io.Text

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany56 / 32

Page 76: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 5. Map-Reduce Tutorial

Further Topics

I Controlling map-reduce jobsI YARN

I chaining map-reduce jobs

I iterative algorithms

I Controlling the number of mappers and reducers

I Managing resources required by all mappers or reducers

I Input and output using relational databases

I Streaming

I Examples, examples, examples

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany57 / 32

Page 77: C. Distributed Computing Emvironments / C.1 Map …BigDataAnalytics 4. Map-Reduce HadoopExample/Map 1 public static class Map 2 extendsMapReduceBase 3 implementsMapper

Big Data Analytics 5. Map-Reduce Tutorial

ReferencesKristina Chodorow. MongoDB: The Definitive Guide. O’Reilly and Associates, Beijing, 2 edition, May 2013. ISBN

978-1-4493-4468-9.

Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI’04Proceedings of the 6th Conference on Symposium on Opearting Systems Design & Implementation, volume 6,2004.

Jeffrey Dean and Sanjay Ghemawat. MapReduce. Communications of the ACM, 51(1):107, January 2008. ISSN00010782. doi: 10.1145/1327452.1327492.

Jeffrey Dean and Sanjay Ghemawat. MapReduce: A flexible data processing tool. Communications of the ACM, 53(1):72–77, 2010.

Tom White. Hadoop: The Definitive Guide. O’Reilly, 4 edition, 2015.

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany58 / 32