Map-Reduce and Hadoop
Introduction to Map-Reduce
Map Reduce operations
• Input data are (key, value) pairs
• Two operations are available: map and reduce
• Map
  • Takes a (key, value) pair and generates other (key, value) pairs
• Reduce
  • Takes a key and all of its associated values
  • Generates (key, value) pairs
• A map-reduce algorithm requires a mapper and a reducer
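The contract above can be sketched as a small in-memory simulation. This is illustrative only (the class and method names are made up, and it is not the Hadoop API): map turns one pair into other pairs, the framework groups the outputs by key, and reduce folds each group.

```java
import java.util.*;

// In-memory sketch of the map-reduce model (hypothetical names, not Hadoop).
public class MapReduceModel {

    // map: takes one (key, value) pair, generates other (key, value) pairs.
    static List<Map.Entry<String, Integer>> map(String key, Integer value) {
        return List.of(Map.entry(key, value)); // identity mapper for this sketch
    }

    // reduce: takes a key and all its associated values, generates a pair.
    static Map.Entry<String, Integer> reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;         // here: sum the values
        return Map.entry(key, sum);
    }

    // The framework's job: apply map, group outputs by key, apply reduce.
    static Map<String, Integer> run(List<Map.Entry<String, Integer>> input) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : input)
            for (Map.Entry<String, Integer> out : map(pair.getKey(), pair.getValue()))
                grouped.computeIfAbsent(out.getKey(), k -> new ArrayList<>())
                       .add(out.getValue());
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            Map.Entry<String, Integer> r = reduce(e.getKey(), e.getValue());
            result.put(r.getKey(), r.getValue());
        }
        return result;
    }
}
```

The grouping step in `run` is exactly what the framework provides for free: the mapper and reducer are the only parts the algorithm designer writes.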
Map Reduce example
• Compute the average grade of each student
• For each course, the professor provides us with a text file
• Text file format: lines of “student grade”
• Algorithm (non map-reduce)
  • For each student, collect all grades and compute the average
• Algorithm (map-reduce)
  • Mapper
    • Assume the input file is parsed as (student, grade) pairs
    • So … do nothing!
  • Reducer
    • Compute the average of all values for a given key
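Under the same assumption (the input is already parsed into (student, grade) pairs), the job fits in a few lines. This is a dependency-free Java sketch of the identity mapper plus averaging reducer, not Hadoop code:

```java
import java.util.*;

// Sketch of the average-grade job: identity mapper, averaging reducer.
// Plain Java simulation using the data from the slides, not the Hadoop API.
public class AverageGrade {

    // Reducer: average all grades associated with one student.
    static double reduce(List<Integer> grades) {
        int sum = 0;
        for (int g : grades) sum += g;
        return (double) sum / grades.size();
    }

    // Group the (student, grade) pairs by student, then reduce each group.
    static Map<String, Double> run(List<String[]> pairs) {
        Map<String, List<Integer>> byStudent = new TreeMap<>();
        for (String[] p : pairs) // p[0] = student, p[1] = grade
            byStudent.computeIfAbsent(p[0], k -> new ArrayList<>())
                     .add(Integer.parseInt(p[1]));
        Map<String, Double> averages = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : byStudent.entrySet())
            averages.put(e.getKey(), reduce(e.getValue()));
        return averages;
    }
}
```

Running it on the three course files from the next slide gives every student an average of 15, since each of them received the grades 20, 15 and 10.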
Map Reduce example

Course 1:  Fabrice 20 / Brian 10 / Paul 15
Course 2:  Fabrice 15 / Brian 20 / Paul 10
Course 3:  Fabrice 10 / Brian 15 / Paul 20

Map output:     (Fabrice, 20) (Brian, 10) (Paul, 15) (Fabrice, 15) (Brian, 20) (Paul, 10) (Fabrice, 10) (Brian, 15) (Paul, 20)
Grouped by key: (Fabrice, [20, 15, 10]) (Brian, [10, 15, 20]) (Paul, [15, 20, 10])
Reduce output:  (Fabrice, 15) (Brian, 15) (Paul, 15)
Map Reduce example… too easy
• OK, this was easy because
  • We didn’t care about technical details like reading inputs
  • All keys are “equal”: no weighted average
• Now, can we do something more complicated?
• Let’s compute a weighted average
  • Course 1 has weight 5
  • Course 2 has weight 2
  • Course 3 has weight 3
• What is the problem now?
Map Reduce example

Course 1:  Fabrice 20 / Brian 10 / Paul 15
Course 2:  Fabrice 15 / Brian 20 / Paul 10
Course 3:  Fabrice 10 / Brian 15 / Paul 20

Map output:     (Fabrice, 20) (Brian, 10) (Paul, 15) (Fabrice, 15) (Brian, 20) (Paul, 10) (Fabrice, 10) (Brian, 15) (Paul, 20)
Grouped by key: (Fabrice, [20, 15, 10]) (Brian, [10, 15, 20]) (Paul, [15, 20, 10])
Reduce output:  (Fabrice, 15) (Brian, 15) (Paul, 15)

Problem: the reducer should be able to discriminate between the values
Map Reduce example - advanced
• How do we discriminate between the values for a given key?
  • We can’t … unless the values look different
• New reducer
  • Input: (Name, [course1_Grade1, course2_Grade2, course3_Grade3])
  • Strip the course indication from the values and compute the weighted average
• So, we need to change the input of the reducer, which comes from… the mapper
• New mapper
  • Input: (Name, Grade)
  • Output: (Name, courseName_Grade)
  • The mapper needs to be aware of which input file it is reading
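The new mapper and reducer can be sketched as follows. This is again a plain Java simulation with hypothetical names, not the Hadoop API; the weights 5/2/3 for courses C1/C2/C3 are taken from the slides.

```java
import java.util.*;

// Sketch of the weighted-average job: the mapper tags each grade with its
// course name, the reducer strips the tag and applies the course weight.
// Plain Java simulation (hypothetical names), not the Hadoop API.
public class WeightedAverage {

    static final Map<String, Integer> WEIGHTS =
        Map.of("C1", 5, "C2", 2, "C3", 3); // weights from the slides

    // Mapper: (student, grade) read from 'course' -> (student, "course_grade").
    static Map.Entry<String, String> map(String course, String student, int grade) {
        return Map.entry(student, course + "_" + grade);
    }

    // Reducer: parse the "course_grade" tags and compute the weighted average.
    static double reduce(List<String> taggedGrades) {
        double weighted = 0, totalWeight = 0;
        for (String t : taggedGrades) {
            String[] parts = t.split("_");   // e.g. ["C1", "20"]
            int w = WEIGHTS.get(parts[0]);
            weighted += w * Integer.parseInt(parts[1]);
            totalWeight += w;
        }
        return weighted / totalWeight;
    }
}
```

For Fabrice, for example, the reducer receives [C1_20, C2_15, C3_10] and computes (5·20 + 2·15 + 3·10) / 10 = 16.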
Map Reduce example - 2

Course 1:  Fabrice 20 / Brian 10 / Paul 15
Course 2:  Fabrice 15 / Brian 20 / Paul 10
Course 3:  Fabrice 10 / Brian 15 / Paul 20

Map output:     (Fabrice, C1_20) (Brian, C1_10) (Paul, C1_15) (Fabrice, C2_15) (Brian, C2_20) (Paul, C2_10) (Fabrice, C3_10) (Brian, C3_15) (Paul, C3_20)
Grouped by key: (Fabrice, [C1_20, C2_15, C3_10]) (Brian, [C1_10, C2_20, C3_15]) (Paul, [C1_15, C2_10, C3_20])
Reduce output:  (Fabrice, 16) (Brian, 13.5) (Paul, 15.5)
Introduction to Hadoop
F. Huet, Oasis Seminar, 07/07/2010
What is Hadoop?
• A set of software developed by Apache for distributed computing
• Many different projects
  • MapReduce
  • HDFS: Hadoop Distributed File System
  • HBase: Distributed Database
  • …
• Written in Java
• Can be deployed easily on any cluster
Hadoop Job
• A Hadoop job is composed of a map operation and (possibly) a reduce operation
• The map and reduce operations are implemented in a Mapper subclass and a Reducer subclass
• Hadoop will start many instances of Mapper and Reducer
  • The number is decided at runtime, but it can be specified
• Each instance will work on a subset of the keys called a split
Map-Reduce workflow (source: Hadoop, the Definitive Guide)
Mapper
• Extend the base class Mapper<K1, V1, K2, V2>
  • K1, V1: type of the input (key, value)
  • K2, V2: type of the output (key, value)
• Implement
  public void map(K1 key, V1 value, Context context)
      throws IOException, InterruptedException
• Values are output using context.write
Reducer
• Extend the base class Reducer<K1, V1, K2, V2>
  • K1, V1: type of the input (key, [values])
  • K2, V2: type of the output (key, value)
• Implement
  public void reduce(K1 key, Iterable<V1> values, Context context)
      throws IOException, InterruptedException
• The values are iterable
• Values are output using context.write
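Concretely, the two classes for the grade example might look like the sketch below. The real base classes live in org.apache.hadoop.mapreduce, so minimal stand-ins for Mapper, Reducer and Context are inlined here to keep the sketch self-contained; the real map/reduce methods also throw IOException and InterruptedException, and the real key/value types are Writables such as LongWritable and Text.

```java
import java.util.*;

// Self-contained sketch of Mapper/Reducer subclasses for the grade example.
// "Context" below is a minimal stand-in for Hadoop's Context, just to make
// the shape concrete; this is not runnable against a real Hadoop cluster.
public class GradeJobSketch {

    // Stand-in for Hadoop's Context: collects the (key, value) outputs.
    static class Context {
        final List<Map.Entry<String, String>> out = new ArrayList<>();
        void write(String key, String value) { out.add(Map.entry(key, value)); }
    }

    // Would be Mapper<LongWritable, Text, Text, Text> in real Hadoop:
    // input key = line offset, input value = the line "student grade".
    static class GradeMapper {
        final String course; // which input file we are reading
        GradeMapper(String course) { this.course = course; }
        public void map(Long offset, String line, Context context) {
            String[] parts = line.split(" ");        // e.g. "Fabrice 20"
            context.write(parts[0], course + "_" + parts[1]);
        }
    }

    // Would be Reducer<Text, Text, Text, Text> in real Hadoop; the values
    // for one key arrive as an Iterable.
    static class GradeReducer {
        static final Map<String, Integer> WEIGHTS = Map.of("C1", 5, "C2", 2, "C3", 3);
        public void reduce(String key, Iterable<String> values, Context context) {
            double weighted = 0, total = 0;
            for (String v : values) {                // e.g. "C1_20"
                String[] parts = v.split("_");
                weighted += WEIGHTS.get(parts[0]) * Integer.parseInt(parts[1]);
                total += WEIGHTS.get(parts[0]);
            }
            context.write(key, String.valueOf(weighted / total));
        }
    }
}
```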
Input/Output
• Hadoop helps abstract away the data format and I/O from the map/reduce process
• InputFormat
  • Validates the input data format (user specified)
  • Splits up the input file into splits
  • Provides a RecordReader to read records from the splits
  • Default: TextInputFormat to read text files (the key will be the offset, the value will be the line)
• OutputFormat
  • Validates the output data format
  • Provides a RecordWriter to write records to the file system
  • Default: TextOutputFormat to write plain text files
Hadoop Job example

Configuration config = new Configuration();
Job job = new Job(config, "filesplitTest");
job.setInputFormatClass(TextInputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
// SingleTextOutputFormat and MapSingleSortedFile are classes from the demo
job.setOutputFormatClass(SingleTextOutputFormat.class);
Path outputDir = new Path(output);
Path inputPath = new Path(input);
FileInputFormat.setInputPaths(job, inputPath);
FileOutputFormat.setOutputPath(job, outputDir);
job.setMapperClass(MapSingleSortedFile.class);
job.setReducerClass(Reducer.class);
job.waitForCompletion(true); // submit the job and wait until it finishes
HDFS
• Hadoop Distributed File System
• Aggregates the local storage of the nodes
• Used by Hadoop workers to read input, store temporary data, and write the final output
• Can be accessed using the CLI
  • $> hadoop fs -<command>
  • -put: copy a local file to HDFS
  • -get: copy an HDFS file to a local directory
• Suitable for large files
  • 64 MB blocks
Demo
Scenario
• Input: a text file made of RDF data (subject, predicate, object)
• Output: 3 “files” containing the input data sorted by subject, predicate, or object
• Hadoop cluster
  • eon 2-4 with HDFS
  • Only the Hadoop conf files are needed to use this cluster
• Monitor the computation using the web interface on eon2