Map-Reduce and Hadoop
Introduction to Map-Reduce
Map Reduce operations
• Input data are (key, value) pairs
• Two operations are available: map and reduce
• Map
  • Takes a (key, value) pair and generates other (key, value) pairs
• Reduce
  • Takes a key and all of its associated values
  • Generates (key, value) pairs
• A map-reduce algorithm requires a mapper and a reducer
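The contract above can be sketched as a small in-memory simulation. This is illustrative only (the class and method names are made up, and it is not the Hadoop API): map turns one pair into other pairs, the framework groups the outputs by key, and reduce folds each group.

```java
import java.util.*;

// In-memory sketch of the map-reduce model (hypothetical names, not Hadoop).
public class MapReduceModel {

    // map: takes one (key, value) pair, generates other (key, value) pairs.
    static List<Map.Entry<String, Integer>> map(String key, Integer value) {
        return List.of(Map.entry(key, value)); // identity mapper for this sketch
    }

    // reduce: takes a key and all its associated values, generates a pair.
    static Map.Entry<String, Integer> reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;         // here: sum the values
        return Map.entry(key, sum);
    }

    // The framework's job: apply map, group outputs by key, apply reduce.
    static Map<String, Integer> run(List<Map.Entry<String, Integer>> input) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : input)
            for (Map.Entry<String, Integer> out : map(pair.getKey(), pair.getValue()))
                grouped.computeIfAbsent(out.getKey(), k -> new ArrayList<>())
                       .add(out.getValue());
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            Map.Entry<String, Integer> r = reduce(e.getKey(), e.getValue());
            result.put(r.getKey(), r.getValue());
        }
        return result;
    }
}
```

The grouping step in `run` is exactly what the framework provides for free: the mapper and reducer are the only parts the algorithm designer writes.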
Map Reduce example
• Compute the average grade of each student
• For each course, the professor provides us with a text file
• Text file format: lines of “student grade”
• Algorithm (non map-reduce)
  • For each student, collect all grades and compute the average
• Algorithm (map-reduce)
  • Mapper
    • Assume the input file is parsed as (student, grade) pairs
    • So … do nothing!
  • Reducer
    • Compute the average of all values for a given key
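Under the same assumption (the input is already parsed into (student, grade) pairs), the job fits in a few lines. This is a dependency-free Java sketch of the identity mapper plus averaging reducer, not Hadoop code:

```java
import java.util.*;

// Sketch of the average-grade job: identity mapper, averaging reducer.
// Plain Java simulation using the data from the slides, not the Hadoop API.
public class AverageGrade {

    // Reducer: average all grades associated with one student.
    static double reduce(List<Integer> grades) {
        int sum = 0;
        for (int g : grades) sum += g;
        return (double) sum / grades.size();
    }

    // Group the (student, grade) pairs by student, then reduce each group.
    static Map<String, Double> run(List<String[]> pairs) {
        Map<String, List<Integer>> byStudent = new TreeMap<>();
        for (String[] p : pairs) // p[0] = student, p[1] = grade
            byStudent.computeIfAbsent(p[0], k -> new ArrayList<>())
                     .add(Integer.parseInt(p[1]));
        Map<String, Double> averages = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : byStudent.entrySet())
            averages.put(e.getKey(), reduce(e.getValue()));
        return averages;
    }
}
```

Running it on the three course files from the next slide gives every student an average of 15, since each of them received the grades 20, 15 and 10.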
Map Reduce example

Course 1:  Fabrice 20 / Brian 10 / Paul 15
Course 2:  Fabrice 15 / Brian 20 / Paul 10
Course 3:  Fabrice 10 / Brian 15 / Paul 20

Map output:     (Fabrice, 20) (Brian, 10) (Paul, 15) (Fabrice, 15) (Brian, 20) (Paul, 10) (Fabrice, 10) (Brian, 15) (Paul, 20)
Grouped by key: (Fabrice, [20, 15, 10]) (Brian, [10, 15, 20]) (Paul, [15, 20, 10])
Reduce output:  (Fabrice, 15) (Brian, 15) (Paul, 15)
Map Reduce example… too easy
• OK, this was easy because
  • We didn’t care about technical details like reading inputs
  • All keys are “equal”: no weighted average
• Now, can we do something more complicated?
• Let’s compute a weighted average
  • Course 1 has weight 5
  • Course 2 has weight 2
  • Course 3 has weight 3
• What is the problem now?
Map Reduce example

Course 1:  Fabrice 20 / Brian 10 / Paul 15
Course 2:  Fabrice 15 / Brian 20 / Paul 10
Course 3:  Fabrice 10 / Brian 15 / Paul 20

Map output:     (Fabrice, 20) (Brian, 10) (Paul, 15) (Fabrice, 15) (Brian, 20) (Paul, 10) (Fabrice, 10) (Brian, 15) (Paul, 20)
Grouped by key: (Fabrice, [20, 15, 10]) (Brian, [10, 15, 20]) (Paul, [15, 20, 10])
Reduce output:  (Fabrice, 15) (Brian, 15) (Paul, 15)

Problem: the reducer should be able to discriminate between the values
Map Reduce example - advanced
• How do we discriminate between the values for a given key?
  • We can’t … unless the values look different
• New reducer
  • Input: (Name, [course1_Grade1, course2_Grade2, course3_Grade3])
  • Strip the course indication from the values and compute the weighted average
• So, we need to change the input of the reducer, which comes from… the mapper
• New mapper
  • Input: (Name, Grade)
  • Output: (Name, courseName_Grade)
  • The mapper needs to be aware of which input file it is reading
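The new mapper and reducer can be sketched as follows. This is again a plain Java simulation with hypothetical names, not the Hadoop API; the weights 5/2/3 for courses C1/C2/C3 are taken from the slides.

```java
import java.util.*;

// Sketch of the weighted-average job: the mapper tags each grade with its
// course name, the reducer strips the tag and applies the course weight.
// Plain Java simulation (hypothetical names), not the Hadoop API.
public class WeightedAverage {

    static final Map<String, Integer> WEIGHTS =
        Map.of("C1", 5, "C2", 2, "C3", 3); // weights from the slides

    // Mapper: (student, grade) read from 'course' -> (student, "course_grade").
    static Map.Entry<String, String> map(String course, String student, int grade) {
        return Map.entry(student, course + "_" + grade);
    }

    // Reducer: parse the "course_grade" tags and compute the weighted average.
    static double reduce(List<String> taggedGrades) {
        double weighted = 0, totalWeight = 0;
        for (String t : taggedGrades) {
            String[] parts = t.split("_");   // e.g. ["C1", "20"]
            int w = WEIGHTS.get(parts[0]);
            weighted += w * Integer.parseInt(parts[1]);
            totalWeight += w;
        }
        return weighted / totalWeight;
    }
}
```

For Fabrice, for example, the reducer receives [C1_20, C2_15, C3_10] and computes (5·20 + 2·15 + 3·10) / 10 = 16.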
Map Reduce example - 2

Course 1:  Fabrice 20 / Brian 10 / Paul 15
Course 2:  Fabrice 15 / Brian 20 / Paul 10
Course 3:  Fabrice 10 / Brian 15 / Paul 20

Map output:     (Fabrice, C1_20) (Brian, C1_10) (Paul, C1_15) (Fabrice, C2_15) (Brian, C2_20) (Paul, C2_10) (Fabrice, C3_10) (Brian, C3_15) (Paul, C3_20)
Grouped by key: (Fabrice, [C1_20, C2_15, C3_10]) (Brian, [C1_10, C2_20, C3_15]) (Paul, [C1_15, C2_10, C3_20])
Reduce output:  (Fabrice, 16) (Brian, 13.5) (Paul, 15.5)
Introduction to Hadoop
F. Huet, Oasis Seminar, 07/07/2010
What is Hadoop?
• A set of software developed by Apache for distributed computing
• Many different projects
  • MapReduce
  • HDFS: Hadoop Distributed File System
  • HBase: Distributed Database
  • …
• Written in Java
• Can be deployed easily on any cluster
Hadoop Job
• A Hadoop job is composed of a map operation and (possibly) a reduce operation
• The map and reduce operations are implemented in a Mapper subclass and a Reducer subclass
• Hadoop will start many instances of Mapper and Reducer
  • The number is decided at runtime, but it can be specified
• Each instance will work on a subset of the keys called a split
Map-Reduce workflow (source: Hadoop, the Definitive Guide)
Mapper
• Extend the base class Mapper<K1, V1, K2, V2>
  • K1, V1: type of the input (key, value)
  • K2, V2: type of the output (key, value)
• Implement
  public void map(K1 key, V1 value, Context context)
      throws IOException, InterruptedException
• Values are output using context.write
Reducer
• Extend the base class Reducer<K1, V1, K2, V2>
  • K1, V1: type of the input (key, [values])
  • K2, V2: type of the output (key, value)
• Implement
  public void reduce(K1 key, Iterable<V1> values, Context context)
      throws IOException, InterruptedException
• The values are iterable
• Values are output using context.write
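Concretely, the two classes for the grade example might look like the sketch below. The real base classes live in org.apache.hadoop.mapreduce, so minimal stand-ins for Mapper, Reducer and Context are inlined here to keep the sketch self-contained; the real map/reduce methods also throw IOException and InterruptedException, and the real key/value types are Writables such as LongWritable and Text.

```java
import java.util.*;

// Self-contained sketch of Mapper/Reducer subclasses for the grade example.
// "Context" below is a minimal stand-in for Hadoop's Context, just to make
// the shape concrete; this is not runnable against a real Hadoop cluster.
public class GradeJobSketch {

    // Stand-in for Hadoop's Context: collects the (key, value) outputs.
    static class Context {
        final List<Map.Entry<String, String>> out = new ArrayList<>();
        void write(String key, String value) { out.add(Map.entry(key, value)); }
    }

    // Would be Mapper<LongWritable, Text, Text, Text> in real Hadoop:
    // input key = line offset, input value = the line "student grade".
    static class GradeMapper {
        final String course; // which input file we are reading
        GradeMapper(String course) { this.course = course; }
        public void map(Long offset, String line, Context context) {
            String[] parts = line.split(" ");        // e.g. "Fabrice 20"
            context.write(parts[0], course + "_" + parts[1]);
        }
    }

    // Would be Reducer<Text, Text, Text, Text> in real Hadoop; the values
    // for one key arrive as an Iterable.
    static class GradeReducer {
        static final Map<String, Integer> WEIGHTS = Map.of("C1", 5, "C2", 2, "C3", 3);
        public void reduce(String key, Iterable<String> values, Context context) {
            double weighted = 0, total = 0;
            for (String v : values) {                // e.g. "C1_20"
                String[] parts = v.split("_");
                weighted += WEIGHTS.get(parts[0]) * Integer.parseInt(parts[1]);
                total += WEIGHTS.get(parts[0]);
            }
            context.write(key, String.valueOf(weighted / total));
        }
    }
}
```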
Input/Output
• Hadoop helps abstract away the data format and I/O from the map/reduce process
• InputFormat
  • Validates the input data format (user specified)
  • Splits up the input file into splits
  • Provides a RecordReader to read records from the splits
  • Default: TextInputFormat to read text files (the key will be the offset, the value will be the line)
• OutputFormat
  • Validates the output data format
  • Provides a RecordWriter to write records to the file system
  • Default: TextOutputFormat to write plain text files
Hadoop Job example

Configuration config = new Configuration();
Job job = new Job(config, "filesplitTest");
job.setInputFormatClass(TextInputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
// SingleTextOutputFormat and MapSingleSortedFile are classes from the demo
job.setOutputFormatClass(SingleTextOutputFormat.class);
Path outputDir = new Path(output);
Path inputPath = new Path(input);
FileInputFormat.setInputPaths(job, inputPath);
FileOutputFormat.setOutputPath(job, outputDir);
job.setMapperClass(MapSingleSortedFile.class);
job.setReducerClass(Reducer.class);
job.waitForCompletion(true); // submit the job and wait until it finishes
HDFS
• Hadoop Distributed File System
• Aggregates the local storage of the nodes
• Used by Hadoop workers to read input, store temporary data, and write the final output
• Can be accessed using the CLI
  • $> hadoop fs -<command>
  • -put: copy a local file to HDFS
  • -get: copy an HDFS file to a local directory
• Suitable for large files
  • 64 MB blocks
Demo
Scenario
• Input: a text file made of RDF data (subject, predicate, object)
• Output: 3 “files” containing the input data sorted by subject, predicate, or object
• Hadoop cluster
  • eon 2-4 with HDFS
  • Only the Hadoop conf files are needed to use this cluster
• Monitor the computation using the web interface on eon2