Map-Reduce and Hadoop

Map-Reduce and Hadoop - Inria and... · 4 Map Reduce example • Compute the average grade of students • For each course, the professor provides us with a text file • Text file



Introduction to Map-Reduce


Map Reduce operations

• Input data are (key, value) pairs

• Two operations are available: map and reduce

• Map
  • Takes a (key, value) pair and generates other (key, value) pairs

• Reduce
  • Takes a key and all the values associated with it
  • Generates (key, value) pairs

• A map-reduce algorithm requires a mapper and a reducer
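The two operations and the grouping step between them can be sketched in plain Java. This is an in-memory illustration only, not how Hadoop is implemented; the names MiniMapReduce, Pair, run and sumPerKey are hypothetical, and summing the values is just a placeholder reducer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

// In-memory illustration only; all names here are hypothetical.
class MiniMapReduce {

    /** An immutable (key, value) pair. */
    static final class Pair<K, V> {
        final K key;
        final V value;
        Pair(K key, V value) { this.key = key; this.value = value; }
        @Override public String toString() { return "(" + key + ", " + value + ")"; }
    }

    /** Applies the mapper to every input pair, groups the results by key, then reduces each group. */
    static <K1, V1, K2, V2, K3, V3> List<Pair<K3, V3>> run(
            List<Pair<K1, V1>> input,
            BiFunction<K1, V1, List<Pair<K2, V2>>> mapper,
            BiFunction<K2, List<V2>, List<Pair<K3, V3>>> reducer) {
        Map<K2, List<V2>> groups = new LinkedHashMap<>();
        for (Pair<K1, V1> in : input) {
            // Map: each input pair may produce any number of output pairs.
            for (Pair<K2, V2> out : mapper.apply(in.key, in.value)) {
                // Group: collect all values that share a key.
                groups.computeIfAbsent(out.key, k -> new ArrayList<>()).add(out.value);
            }
        }
        // Reduce: one call per key, with all of its values.
        List<Pair<K3, V3>> result = new ArrayList<>();
        for (Map.Entry<K2, List<V2>> group : groups.entrySet()) {
            result.addAll(reducer.apply(group.getKey(), group.getValue()));
        }
        return result;
    }

    /** Example: the mapper passes pairs through unchanged, the reducer sums the values of a key. */
    static List<Pair<String, Integer>> sumPerKey(List<Pair<String, Integer>> input) {
        BiFunction<String, Integer, List<Pair<String, Integer>>> mapper =
                (k, v) -> Collections.singletonList(new Pair<>(k, v));
        BiFunction<String, List<Integer>, List<Pair<String, Integer>>> reducer = (k, values) -> {
            int sum = 0;
            for (int v : values) sum += v;
            return Collections.singletonList(new Pair<>(k, sum));
        };
        return run(input, mapper, reducer);
    }

    public static void main(String[] args) {
        List<Pair<String, Integer>> input = new ArrayList<>();
        input.add(new Pair<>("Fabrice", 20));
        input.add(new Pair<>("Brian", 10));
        input.add(new Pair<>("Fabrice", 15));
        System.out.println(sumPerKey(input)); // prints [(Fabrice, 35), (Brian, 10)]
    }
}
```

The user only supplies the mapper and the reducer; the grouping by key in the middle is what the framework takes care of.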


Map Reduce example

• Compute the average grade of students
  • For each course, the professor provides us with a text file
  • Text file format: lines of “student grade”

• Algorithm (non map-reduce)
  • For each student, collect all grades and compute the average

• Algorithm (map-reduce)
  • Mapper
    • Assume the input file is parsed as (student, grade) pairs
    • So … do nothing!
  • Reducer
    • Compute the average of all values for a given key


Map Reduce example

Course 1: Fabrice 20, Brian 10, Paul 15
Course 2: Fabrice 15, Brian 20, Paul 10
Course 3: Fabrice 10, Brian 15, Paul 20

Map output: (Fabrice, 20) (Brian, 10) (Paul, 15) (Fabrice, 15) (Brian, 20) (Paul, 10) (Fabrice, 10) (Brian, 15) (Paul, 20)

Reduce input (values grouped by key): (Fabrice, [20, 15, 10]) (Brian, [10, 20, 15]) (Paul, [15, 10, 20])

Reduce output: (Fabrice, 15) (Brian, 15) (Paul, 15)
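The average-grade example can be traced with a small plain-Java sketch. It runs in memory and does not use Hadoop; the class AverageGrades and its methods are hypothetical names, and the parsing of "student grade" lines stands in for the input handling that the slides assume is done for us.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// In-memory sketch of the average-grade example; names are hypothetical.
class AverageGrades {

    /** Parses the lines, applies the "do nothing" mapper and groups the grades by student. */
    static Map<String, List<Integer>> mapAndGroup(List<String[]> courseFiles) {
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        for (String[] file : courseFiles) {
            for (String line : file) {
                String[] parts = line.trim().split("\\s+");   // "Fabrice 20" -> ["Fabrice", "20"]
                String student = parts[0];
                int grade = Integer.parseInt(parts[1]);
                // Identity mapper: (student, grade) is emitted unchanged, then grouped by key.
                grouped.computeIfAbsent(student, k -> new ArrayList<>()).add(grade);
            }
        }
        return grouped;
    }

    /** Reducer: the average of all values for one key. */
    static double reduce(List<Integer> grades) {
        double sum = 0;
        for (int g : grades) sum += g;
        return sum / grades.size();
    }

    static Map<String, Double> run(List<String[]> courseFiles) {
        Map<String, Double> averages = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> e : mapAndGroup(courseFiles).entrySet()) {
            averages.put(e.getKey(), reduce(e.getValue()));
        }
        return averages;
    }

    /** The three course files of the example. */
    static List<String[]> sampleFiles() {
        List<String[]> files = new ArrayList<>();
        files.add(new String[] {"Fabrice 20", "Brian 10", "Paul 15"});
        files.add(new String[] {"Fabrice 15", "Brian 20", "Paul 10"});
        files.add(new String[] {"Fabrice 10", "Brian 15", "Paul 20"});
        return files;
    }

    public static void main(String[] args) {
        System.out.println(run(sampleFiles())); // prints {Fabrice=15.0, Brian=15.0, Paul=15.0}
    }
}
```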


Map Reduce example… too easy

• OK, this was easy because
  • We didn’t care about technical details like reading inputs
  • All grades count equally: no weighted average

• Now can we do something more complicated?

• Let’s compute a weighted average
  • Course 1 has weight 5
  • Course 2 has weight 2
  • Course 3 has weight 3

• What is the problem now?


Map Reduce example

Course 1: Fabrice 20, Brian 10, Paul 15
Course 2: Fabrice 15, Brian 20, Paul 10
Course 3: Fabrice 10, Brian 15, Paul 20

Map output: (Fabrice, 20) (Brian, 10) (Paul, 15) (Fabrice, 15) (Brian, 20) (Paul, 10) (Fabrice, 10) (Brian, 15) (Paul, 20)

Reduce input (values grouped by key): (Fabrice, [20, 15, 10]) (Brian, [10, 20, 15]) (Paul, [15, 10, 20])

Reduce output: (Fabrice, 15) (Brian, 15) (Paul, 15)

Problem: the reducer should be able to discriminate between the values in each list, but nothing tells it which course a grade comes from.


Map Reduce example - advanced

• How can we discriminate between values for a given key?
  • We can’t … unless the values look different

• New reducer
  • Input: (Name, [course1_Grade1, course2_Grade2, course3_Grade3])
  • Strip the course indication from the values and compute the weighted average

• So we need to change the input of the reducer, which comes from… the mapper

• New mapper
  • Input: (Name, Grade)
  • Output: (Name, courseName_Grade)
  • The mapper needs to know which input file it is reading


Map Reduce example - 2

Course 1: Fabrice 20, Brian 10, Paul 15
Course 2: Fabrice 15, Brian 20, Paul 10
Course 3: Fabrice 10, Brian 15, Paul 20

Map output: (Fabrice, C1_20) (Brian, C1_10) (Paul, C1_15) (Fabrice, C2_15) (Brian, C2_20) (Paul, C2_10) (Fabrice, C3_10) (Brian, C3_15) (Paul, C3_20)

Reduce input (values grouped by key): (Fabrice, [C1_20, C2_15, C3_10]) (Brian, [C1_10, C2_20, C3_15]) (Paul, [C1_15, C2_10, C3_20])

Reduce output: (Fabrice, 16) (Brian, 13.5) (Paul, 15.5)
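The tagging mapper and the weighted reducer can be sketched in plain Java as follows. This is an in-memory illustration under assumptions, not Hadoop code: the class WeightedAverage, the course labels C1 to C3 used as map keys, and the helper methods are hypothetical; the weights 5, 2 and 3 and the "course_grade" value format come from the slides.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// In-memory sketch of the weighted-average example; names are hypothetical.
class WeightedAverage {

    // Course weights from the example: course 1 -> 5, course 2 -> 2, course 3 -> 3.
    static final Map<String, Integer> WEIGHTS = new HashMap<>();
    static {
        WEIGHTS.put("C1", 5);
        WEIGHTS.put("C2", 2);
        WEIGHTS.put("C3", 3);
    }

    /** Mapper: knows which course file it reads and tags each grade with it, e.g. "C1_20". */
    static String map(String course, int grade) {
        return course + "_" + grade;
    }

    /** Reducer: strips the course tag from each value and computes the weighted average. */
    static double reduce(List<String> taggedGrades) {
        double weightedSum = 0;
        int totalWeight = 0;
        for (String tagged : taggedGrades) {
            String[] parts = tagged.split("_");            // "C1_20" -> ["C1", "20"]
            int weight = WEIGHTS.get(parts[0]);
            weightedSum += weight * Integer.parseInt(parts[1]);
            totalWeight += weight;
        }
        return weightedSum / totalWeight;
    }

    /** Runs map, grouping by student and reduce; input is course -> lines of "student grade". */
    static Map<String, Double> run(Map<String, String[]> courseFiles) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (Map.Entry<String, String[]> file : courseFiles.entrySet()) {
            for (String line : file.getValue()) {
                String[] parts = line.trim().split("\\s+");
                String tagged = map(file.getKey(), Integer.parseInt(parts[1]));
                grouped.computeIfAbsent(parts[0], k -> new ArrayList<>()).add(tagged);
            }
        }
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    /** The three course files of the example, keyed by a course label. */
    static Map<String, String[]> sampleFiles() {
        Map<String, String[]> files = new LinkedHashMap<>();
        files.put("C1", new String[] {"Fabrice 20", "Brian 10", "Paul 15"});
        files.put("C2", new String[] {"Fabrice 15", "Brian 20", "Paul 10"});
        files.put("C3", new String[] {"Fabrice 10", "Brian 15", "Paul 20"});
        return files;
    }

    public static void main(String[] args) {
        System.out.println(run(sampleFiles()));
    }
}
```

For Fabrice the reducer computes (5 × 20 + 2 × 15 + 3 × 10) / 10 = 16. Encoding the course in the value as a string is the simplest way to carry the extra information through the grouping step; a structured value type would work equally well.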


Introduction to Hadoop

F. Huet, Oasis Seminar, 07/07/2010


What is Hadoop ?

• A set of software projects developed at Apache for distributed computing

• Many different projects
  • MapReduce
  • HDFS: Hadoop Distributed File System
  • HBase: distributed database
  • …

• Written in Java

• Can be deployed on any cluster easily


Hadoop Job

• A Hadoop job is composed of a map operation and (possibly) a reduce operation

• Map and reduce operations are implemented in a Mapper subclass and a Reducer subclass

• Hadoop will start many instances of Mapper and Reducer
  • The number is decided at runtime but can be specified

• Each mapper instance works on a subset of the input, called a split


Map-Reduce workflow (figure; source: Hadoop: The Definitive Guide)


Mapper

• Extend the default class Mapper<K1, V1, K2, V2>
  • K1, V1: types of the input (key, value)
  • K2, V2: types of the output (key, value)

• Implement public void map(K1 key, V1 value, Context context) throws IOException, InterruptedException

• Output (key, value) pairs are written using context.write


Reducer

• Extend the default class Reducer<K1, V1, K2, V2>
  • K1, V1: types of the input (key, [values])
  • K2, V2: types of the output (key, value)

• Implement public void reduce(K1 key, Iterable<V1> values, Context context) throws IOException, InterruptedException
  • The values are passed as an Iterable<V1>
  • Output (key, value) pairs are written using context.write
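To show how the four type parameters and context.write fit together without needing a Hadoop installation, here is a plain-Java sketch. SketchMapper, SketchReducer and SketchContext are hypothetical stand-ins that only mirror the shape of the API described on these two slides; they are not Hadoop's classes (the real ones live in org.apache.hadoop.mapreduce and use Hadoop's own key and value types such as Text).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java stand-ins that mirror the shape of the Mapper/Reducer API.
// They are NOT Hadoop classes; every name in this file is hypothetical.
class ApiShapeSketch {

    /** Collects whatever map or reduce writes; stands in for the framework's Context. */
    static class SketchContext<K, V> {
        final List<String> written = new ArrayList<>();
        void write(K key, V value) { written.add("(" + key + ", " + value + ")"); }
    }

    /** K1, V1: input (key, value) types; K2, V2: output (key, value) types. */
    abstract static class SketchMapper<K1, V1, K2, V2> {
        abstract void map(K1 key, V1 value, SketchContext<K2, V2> context);
    }

    /** K1, V1: input (key, [values]) types; K2, V2: output (key, value) types. */
    abstract static class SketchReducer<K1, V1, K2, V2> {
        abstract void reduce(K1 key, Iterable<V1> values, SketchContext<K2, V2> context);
    }

    /** Input: (offset of the line, text of the line). Output: (student, grade). */
    static class GradeMapper extends SketchMapper<Long, String, String, Integer> {
        @Override
        void map(Long offset, String line, SketchContext<String, Integer> context) {
            String[] parts = line.trim().split("\\s+");   // "Fabrice 20" -> ["Fabrice", "20"]
            context.write(parts[0], Integer.parseInt(parts[1]));
        }
    }

    /** Input: (student, all grades of that student). Output: (student, average). */
    static class AverageReducer extends SketchReducer<String, Integer, String, Double> {
        @Override
        void reduce(String student, Iterable<Integer> grades, SketchContext<String, Double> context) {
            double sum = 0;
            int count = 0;
            for (int grade : grades) {   // the values can only be iterated over
                sum += grade;
                count++;
            }
            context.write(student, sum / count);
        }
    }

    public static void main(String[] args) {
        SketchContext<String, Integer> mapOut = new SketchContext<>();
        new GradeMapper().map(0L, "Fabrice 20", mapOut);
        System.out.println(mapOut.written);      // prints [(Fabrice, 20)]

        SketchContext<String, Double> reduceOut = new SketchContext<>();
        new AverageReducer().reduce("Fabrice", Arrays.asList(20, 15, 10), reduceOut);
        System.out.println(reduceOut.written);   // prints [(Fabrice, 15.0)]
    }
}
```

Note that the output types of the mapper (String, Integer) must match the input types of the reducer.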


Input/Output

• Hadoop abstracts data formats and I/O away from the map/reduce process

• InputFormat
  • Validates the data input format (user specified)
  • Splits up the input file into splits
  • Provides a RecordReader to read records from the splits
  • Default: TextInputFormat to read text files (the key is the offset of the line in the file, the value is the line)

• OutputFormat
  • Validates the data output format
  • Provides a RecordWriter to write records to the file system
  • Default: TextOutputFormat to write plain text files
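To make the (offset, line) records of the default text input concrete, the following plain-Java sketch computes them for a small string. It only imitates the record layout and is not Hadoop's reader; the class OffsetLineRecords is a hypothetical name, and the sketch assumes UTF-8 text with "\n" line endings.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Imitates the (key, value) records of a line-oriented text input:
// key = byte offset of the line in the file, value = the line itself.
// Hypothetical helper, assuming UTF-8 content and "\n" line endings.
class OffsetLineRecords {

    static List<String> records(String fileContent) {
        List<String> out = new ArrayList<>();
        long offset = 0;
        for (String line : fileContent.split("\n")) {
            out.add("(" + offset + ", " + line + ")");
            // Advance past the bytes of this line plus its newline character.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(records("Fabrice 20\nBrian 10\nPaul 15\n"));
        // prints [(0, Fabrice 20), (11, Brian 10), (20, Paul 15)]
    }
}
```

A mapper reading a grade file this way would typically ignore the offset key and parse the student name and grade out of the line value.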


Hadoop Job example

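// Job setup fragment. `input` and `output` are assumed to be path strings defined elsewhere.
// MapSingleSortedFile and SingleTextOutputFormat are not standard Hadoop classes
// (presumably written for the demo).
Configuration config = new Configuration();
Job job = new Job(config, "filesplitTest");

// Input: plain text, one record per line
job.setInputFormatClass(TextInputFormat.class);
FileInputFormat.setInputPaths(job, new Path(input));

// Output: (Text, Text) pairs
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setOutputFormatClass(SingleTextOutputFormat.class);
FileOutputFormat.setOutputPath(job, new Path(output));

// Custom mapper; the base Reducer class passes its input through unchanged
job.setMapperClass(MapSingleSortedFile.class);
job.setReducerClass(Reducer.class);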


HDFS

• Hadoop Distributed File System

• Aggregates the local storage of the cluster nodes

• Used by Hadoop workers to read input, store temporary data and final output

• Can be accessed using the CLI
  • $> hadoop fs -command
  • put: copy a local file to HDFS
  • get: copy an HDFS file to a local directory

• Suitable for large files
  • 64 MB blocks
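As a rough illustration of what a 64 MB block size means for a file, the sketch below computes how many blocks a file of a given size would occupy. It is simple arithmetic, not a call into HDFS; the class HdfsBlocks and its methods are hypothetical, and the block size is taken from the slide (it is a configurable setting).

```java
// Back-of-the-envelope arithmetic for a 64 MB block size; hypothetical helper, not an HDFS API.
class HdfsBlocks {

    static final long BLOCK_SIZE = 64L * 1024 * 1024;   // 64 MB, as on the slide

    /** Number of blocks needed to hold a file of the given size (rounding up). */
    static long blockCount(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    /** Number of bytes stored in the last block of a non-empty file. */
    static long lastBlockBytes(long fileSizeBytes) {
        long remainder = fileSizeBytes % BLOCK_SIZE;
        return remainder == 0 ? BLOCK_SIZE : remainder;
    }

    public static void main(String[] args) {
        long size = 200L * 1024 * 1024;                  // a 200 MB file
        System.out.println(blockCount(size));            // prints 4
        System.out.println(lastBlockBytes(size) / (1024 * 1024) + " MB"); // prints 8 MB
    }
}
```

A 200 MB file is therefore stored as three full blocks and one 8 MB block, while a 1 KB file still needs a block of its own, which is one reason HDFS suits large files better than many small ones.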


Demo


Scenario

• Input : a text file made of RDF data (subject, predicate, object)

• Output : 3 “files” containing the input data sorted by subject, predicate or object

• Hadoop cluster
  • eon 2-4 with HDFS
  • Only the Hadoop conf files are needed to use this cluster

• Monitor computation using web interface on eon2
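The expected output of the scenario can be previewed on a tiny in-memory example. The sketch below is plain Java, not the demo's Hadoop job: the class TripleSort and the sample triples are made up for illustration, and each triple is assumed to be one line of three whitespace-separated fields. In a Hadoop job the ordering would typically come from the framework sorting the mapper output by key, with the mapper emitting the chosen field as the key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// In-memory preview of the three sorted outputs; names and data are hypothetical.
class TripleSort {

    /** Sorts "subject predicate object" lines by one field: 0 = subject, 1 = predicate, 2 = object. */
    static List<String> sortBy(List<String> triples, int field) {
        List<String> sorted = new ArrayList<>(triples);
        // The sort is stable: triples with the same field value keep their input order.
        sorted.sort(Comparator.comparing((String t) -> t.trim().split("\\s+")[field]));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> triples = Arrays.asList(
                "bob likes carol",
                "alice knows bob",
                "carol knows alice");
        System.out.println("by subject:   " + sortBy(triples, 0));
        System.out.println("by predicate: " + sortBy(triples, 1));
        System.out.println("by object:    " + sortBy(triples, 2));
    }
}
```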