Hadoop Installation & MapReduce Programming
CS267 - Data Mining & Machine Learning
-Kuldeep Dhole
Why / How / What
- Why: To be able to deal with Big Data Mining.
- How: By learning Hadoop & MR programming.
- What: Hadoop installation, HDFS basics, & MR programming for Hadoop.
Hadoop Installation
Amazon EC2 cloud - Cloudera's Hadoop Installation: https://www.dropbox.com/s/s8zc3iwlq936hak/Amazon_Cloudera_Hadoop.pdf
Hadoop Components
- HDFS (Hadoop Distributed File System)
- MapReduce Model
HDFS Shell
CLUSTER / LOCAL MACHINE

File system of the local OS (Linux, Windows, etc.), e.g. /home/user1:
> ls -l
> mv f1 f2
> cp f1 f2

HDFS, e.g. /tmp:
> hadoop fs -ls
> hadoop fs -mv hdfs_f1 hdfs_f2
> hadoop fs -cp hdfs_f1 hdfs_f2
- HDFS has its own shell commands
- You need to transfer data: LOCAL FS <-> HDFS
- The same concept applies to every machine in the cluster, and the Hadoop realm on all machines stays in sync.
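A typical round trip between the local file system and HDFS can be sketched with the standard `hadoop fs` subcommands; the paths here are illustrative, not from the installation guide above:

```shell
# Copy a local file into HDFS
hadoop fs -put /home/user1/f1.dat /tmp/f1.dat

# List, move, and copy entirely within HDFS
hadoop fs -ls /tmp
hadoop fs -mv /tmp/f1.dat /tmp/f1_renamed.dat
hadoop fs -cp /tmp/f1_renamed.dat /tmp/f1_copy.dat

# Copy a file back from HDFS to the local file system
hadoop fs -get /tmp/f1_copy.dat /home/user1/f1_copy.dat
```

These commands require a running Hadoop cluster, which is why the Amazon EC2 / Cloudera setup comes first.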
MapReduce Concept
- Programming model for distributed parallel computing.
- Used on scalable commodity-hardware clusters.
- Can process Big Data (100s of GBs, TBs).
- Based on a key-value structure.
- Parallel MAP tasks, which emit <K, V> data.
- Parallel REDUCE tasks, which process <K, V[ ]> data.
MapReduce Model
[Figure: four parallel MAP tasks (M1-M4) each emit <K, V> pairs; a Sort, Merge & Shuffle phase groups them into <K1, V[ ]> ... <K4, V[ ]>; four parallel REDUCE tasks (R1-R4) consume the grouped lists and emit <K, V> output.]
MapReduce Model In Brief
(K1, V1) -> MAP -> List(K2, V2)
(K2, List(V2)) -> REDUCE -> List(K3, V3)
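The two signatures above can be simulated in plain Java with a toy word count, no Hadoop required; the class and method names here are illustrative, not part of the Hadoop API:

```java
import java.util.*;

// Toy word count: MAP each line to (word, 1) pairs, group by key
// (the "Sort, Merge & Shuffle"), then REDUCE each (word, [1,1,...]) to (word, count).
public class MiniMapReduce {
    public static Map<String, Integer> wordCount(List<String> lines) {
        // MAP: (lineNo, line) -> List(word, 1)
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines)
            for (String w : line.split("\\s+"))
                mapped.add(new AbstractMap.SimpleEntry<>(w, 1));

        // SHUFFLE: group values by key -> (word, [1, 1, ...])
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> e : mapped)
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());

        // REDUCE: (word, [1, 1, ...]) -> (word, sum)
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            result.put(e.getKey(), sum);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("a b a", "b c")));
        // {a=2, b=2, c=1}
    }
}
```

In real Hadoop, the shuffle runs across machines and the map/reduce tasks run in parallel; the data flow is the same.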
Hadoop MapReduce Application
- Implemented in Java.
- Components:
  - Mapper
  - Reducer
  - Job Configuration
- Can also be written in other languages (Python, Perl, Shell, etc.) using the Streaming concept.
Complete Application
public class YourApp {
    Mapper {}
    Reducer {}
    Job Configuration {}
}
Mapper Class & Function
public static class YourMap extends Mapper<K1, V1, K2, V2> {
    public void map(K1 key, V1 value, Context context)
            throws IOException, InterruptedException {
        // DO YOUR PROCESSING ON key, value
        // K2 NewKey
        // V2 NewValue
        context.write(NewKey, NewValue);
    }
}
Reducer Class & Function
public static class YourReduce extends Reducer<K2, V2, K3, V3> {
    public void reduce(K2 key, Iterable<V2> values, Context context)
            throws IOException, InterruptedException {
        // DO YOUR PROCESSING ON key, values
        // K3 NewKey
        // V3 NewValue
        context.write(NewKey, NewValue);
    }
}
What are I/O Formats?
Job Configuration

public static void main(String[] args) throws Exception {
    // Create Configuration
    Configuration conf = new Configuration();
    // Create Job
    Job job = new Job(conf, "YourApp");

    // Specify input directory
    FileInputFormat.addInputPath(job, new Path(args[0]));
    // Specify output directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(Map.class);
    // Specify the input split format by which the Mapper reads <K, V>
    job.setInputFormatClass(KeyValueTextInputFormat.class);
    // Specify the format by which the Mapper emits <K, V>
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);

    job.setReducerClass(Reduce.class);
    // Specify the format by which the Reducer emits <K, V>
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    // Specify the output format by which output is written to the output files
    job.setOutputFormatClass(TextOutputFormat.class);

    job.setJarByClass(org.myorg.YourApp.class);
    job.waitForCompletion(true);
}
Reverse Indexing Application
Input file /hdfs/f1.dat:

f1 w1 w2 w3 w4
f2 w2 w3 w4 w5
f3 w3 w4 w5 w6

Output file /hdfs_op/o1.dat:

w1 f1
w2 f1, f2
w3 f1, f2, f3
w4 f1, f2, f3
w5 f2, f3
w6 f3
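The transformation above can be sketched in plain Java before looking at the Hadoop version; the class and method names here are illustrative, not the program that follows:

```java
import java.util.*;

// Reverse index: MAP each (filename, contents) to (word, filename) pairs,
// group by word, then REDUCE each (word, [files...]) to a de-duplicated,
// comma-separated file list.
public class ReverseIndexSketch {
    public static Map<String, String> invert(Map<String, String> files) {
        // MAP + SHUFFLE: word -> ordered set of files containing it
        Map<String, Set<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> f : files.entrySet())
            for (String w : f.getValue().split("\\s+"))
                grouped.computeIfAbsent(w, k -> new TreeSet<>()).add(f.getKey());

        // REDUCE: join the file names into one comma-separated string
        Map<String, String> out = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : grouped.entrySet())
            out.put(e.getKey(), String.join(",", e.getValue()));
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> files = new LinkedHashMap<>();
        files.put("f1", "w1 w2 w3 w4");
        files.put("f2", "w2 w3 w4 w5");
        files.put("f3", "w3 w4 w5 w6");
        for (Map.Entry<String, String> e : invert(files).entrySet())
            System.out.println(e.getKey() + " " + e.getValue());
    }
}
```

The Hadoop program on the following slides does the same thing, except that the grouping is performed by the framework's shuffle phase rather than in memory.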
Hadoop System
[Figure: a Job submitted to the Hadoop system consists of CONF, MAP, and REDUCE components.]
Mapper & Reducer Algorithm

Mapper:
    read line as K<filename>, V<rest of contents>
    tokenize V
    for every token t:
        emit K<t>, V<filename>

Reducer:
    receive K<token>, V[ ]<filenames>
    make a unique list of V[ ]
    form a comma-separated string str of the filenames in V[ ]
    emit K<token>, V<str>
Actual Java Program: Mapper
public static class Map extends Mapper<Text, Text, Text, Text> {
    private Text word = new Text();

    public void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            String temp = tokenizer.nextToken();
            // Strip the last non-alphabetic char from a word
            if (!temp.matches(".*[a-zA-Z]$")) {
                word.set(temp.substring(0, temp.length() - 1));
            } else {
                word.set(temp);
            }
            context.write(word, key);
        }
    }
}
Actual Java Program: Reducer

public static class Reduce extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String doc_list = "";
        HashMap<String, Integer> map = new HashMap<String, Integer>();
        for (Text val : values) {
            map.put(val.toString(), 1);
        }
        Iterator<String> keySetIterator = map.keySet().iterator();
        while (keySetIterator.hasNext()) {
            String k = keySetIterator.next();
            doc_list += k + ",";
        }
        // Drop the trailing comma
        if (doc_list.length() > 0 && doc_list.charAt(doc_list.length() - 1) == ',') {
            doc_list = doc_list.substring(0, doc_list.length() - 1);
        }
        context.write(key, new Text(doc_list));
    }
}
Actual Java Program: Main()

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "reverse-index");
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(Map.class);
    job.setInputFormatClass(KeyValueTextInputFormat.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setReducerClass(Reduce.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setJarByClass(org.myorg.ReverseIndex.class);
    job.waitForCompletion(true);
}
Actual Java Program: Complete App

package org.myorg;
// IMPORT RELEVANT API LIBRARIES

public class AppName {
    Mapper() {}
    Reducer() {}
    Main() {}
}
How to Execute?

- Compile:
  /usr/java/jdk1.7.0_25/bin/javac -classpath /usr/local/hadoop/hadoop-core-1.2.1.jar -d classes ip1/ReverseIndex.java
- Make a JAR:
  /usr/java/jdk1.7.0_25/bin/jar -cvf jar/reverse_index.jar -C classes/ .
- Submit the JAR as a job:
  hadoop jar jar/reverse_index.jar org.myorg.ReverseIndex ip op
DEMO
Important Links
A few examples at my GitHub: https://github.com/dkuldeep11/hadoop
Clear Basics: https://www.udacity.com/course/ud617
Hadoop MR Concept: http://developer.yahoo.com/hadoop/tutorial/module4.html#basics
MR Coding Basics: http://hadoop.apache.org/docs/stable1/mapred_tutorial.html
In Depth: http://bigdatauniversity.com/bdu-wp/bdu-course/introduction-to-mapreduce-programming/
Thank You!
Q/A