Hadoop
Team3: Xiaokui Shu, Ron Cohen
[email protected] [email protected]
CS5604 at Virginia Tech
December 6, 2010



Page 1:

Hadoop

Team3: Xiaokui Shu, Ron Cohen
[email protected] [email protected]
CS5604 at Virginia Tech
December 6, 2010

Page 2:

Content

Introduction
▪ Hadoop
▪ MapReduce

Working With Hadoop
▪ Environment
▪ MapReduce Programming

Summary

Page 3:

Introduction :: Hadoop

Is a software framework
▪ the user writes programs against it, like a super-library

For distributed applications
▪ built-in solutions
▪ solutions depend on this framework

Inspired by Google's MapReduce and Google File System (GFS) papers

Page 4:

Introduction :: Hadoop

Who uses Hadoop?
A9.com – Amazon
▪ Amazon's product search indices
Adobe
▪ 30 nodes running HDFS, Hadoop and HBase
Baidu
▪ handles about 3000 TB per week
Facebook
▪ stores copies of internal log and dimension data sources
Last.fm, LinkedIn, IBM, Yahoo!, Google…

Page 5:

Introduction :: Hadoop

▪ Hadoop Common
▪ HDFS
▪ MapReduce
▪ ZooKeeper

Page 6:

Introduction :: Hadoop :: IR

Connections to the IR book
Ch. 4 Index construction
▪ Distributed indexing (4.4)

Ch. 20 Web crawling and indexes
▪ Distributed crawler (20.2)
▪ Distributed indexing (20.3)

Page 7:

Introduction :: MapReduce

Is a software framework for distributed computing
▪ massive amounts of data
▪ simple processing requirements
▪ portability across a variety of platforms
  ▪ clusters
  ▪ CMP/SMP
  ▪ GPGPU

Introduced by Google

Page 8:

Introduction :: MapReduce

Cited from MapReduce: Simplified Data Processing on Large Clusters

Page 9:

Introduction :: MapReduce

Map
  Map(k1, v1) -> list(k2, v2)
Reduce
  Reduce(k2, list(v2)) -> list(v3)

Hadoop MapReduce
  (input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)

Page 10:

Introduction :: MapReduce Example

Source

$ cat file01
Hello World Bye World
$ cat file02
Hello Hadoop Goodbye Hadoop

Page 11:

Introduction :: MapReduce Example

Map Output

For file01:
< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>

For file02:
< Hello, 1>
< Hadoop, 1>
< Goodbye, 1>
< Hadoop, 1>

Page 12:

Introduction :: MapReduce Example

Reduce Output

< Bye, 1>
< Goodbye, 1>
< Hadoop, 2>
< Hello, 2>
< World, 2>
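The WordCount data flow above (map, group by key, reduce) can be simulated in a few lines of plain Java, independent of Hadoop itself. The class and method names here are illustrative only, not part of any Hadoop API:

```java
import java.util.*;

// In-memory sketch of the WordCount example: map emits (word, 1)
// pairs, then grouping and summing by key plays the role of
// shuffle + reduce.
public class WordCountSketch {
    // "map": emit a (word, 1) pair for every token in a document
    static List<Map.Entry<String, Integer>> map(String document) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : document.split("\\s+"))
            out.add(new AbstractMap.SimpleEntry<>(w, 1));
        return out;
    }

    // shuffle + "reduce": group values by key and sum them
    static SortedMap<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        pairs.addAll(map("Hello World Bye World"));       // file01
        pairs.addAll(map("Hello Hadoop Goodbye Hadoop")); // file02
        System.out.println(reduce(pairs));
        // -> {Bye=1, Goodbye=1, Hadoop=2, Hello=2, World=2}
    }
}
```

Running the sketch on the two example files reproduces exactly the reduce output shown on this slide.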

Page 13:

Introduction :: MapReduce

More input → more mappers
▪ Combiner function after Map

More reducers
▪ Partition function before Reduce

Focus on Map & Reduce
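The combiner and partition functions mentioned above can be sketched with simple assumed semantics: a combiner pre-sums one mapper's local output before the shuffle, and a partition function routes each key to a reducer (Hadoop's default partitioner is hash-based, as here). The class and method names are illustrative, not Hadoop APIs:

```java
import java.util.*;

// Sketch of the two optional MapReduce stages: combine and partition.
public class CombinePartitionSketch {
    // Combiner: merge duplicate keys in a single mapper's output,
    // shrinking the data shipped across the network
    static Map<String, Integer> combine(List<String> mapOutputKeys) {
        Map<String, Integer> local = new TreeMap<>();
        for (String k : mapOutputKeys) local.merge(k, 1, Integer::sum);
        return local;
    }

    // Partition: pick the reducer responsible for a key
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        // file01's map output keys: Hello, World, Bye, World
        Map<String, Integer> local =
            combine(Arrays.asList("Hello", "World", "Bye", "World"));
        System.out.println(local);
        // the two <World,1> pairs became one <World,2>:
        // 3 pairs shipped instead of 4
        for (String k : local.keySet())
            System.out.println(k + " -> reducer " + partition(k, 2));
    }
}
```

With two reducers, every key is routed consistently to the same reducer, which is what lets reduce see all values for a key together.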

Page 14:

Working With Hadoop :: Env

Hadoop is written in Java (C++ is also supported)

Runs in 3 modes:
▪ Local (Standalone) Mode
▪ Pseudo-Distributed Mode
▪ Fully-Distributed Mode

Our instance on the IBM cloud is set up in Pseudo-Distributed Mode

Page 15:

Working With Hadoop

Process:
1. Start the Hadoop service
2. Prepare input
3. Write your MapReduce program
4. Compile your program
5. Run your application with Hadoop

Page 16:

Working With Hadoop :: Env

Start the Hadoop service
$ bin/hadoop namenode -format   (initialize the filesystem)
$ bin/start-all.sh

Put input into the filesystem
$ bin/hadoop fs -put localdir hinputdir
You can also use -get, -rm, -cat with fs

Page 17:

Working With Hadoop :: Env

Compile your program & create a jar
$ javac -classpath ${HADOOP}-core.jar -d wordcount_classes WordCount.java
$ jar -cvf wordcount.jar -C wordcount_classes/ .

Run your application with Hadoop
$ bin/hadoop jar wordcount.jar org.myorg.WordCount hinputdir houtputdir

Page 18:

Working With Hadoop :: Prog

void map(String name, String document):
  // name: document name
  // document: document contents
  for each word w in document:
    EmitIntermediate(w, "1");

void reduce(String word, Iterator partialCounts):
  // word: a word
  // partialCounts: a list of aggregated partial counts
  int result = 0;
  for each pc in partialCounts:
    result += ParseInt(pc);
  Emit(AsString(result));

Cited from Wikipedia

Page 19:

Working With Hadoop :: Prog

public static class Map extends MapReduceBase implements
    Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}

Page 20:

Working With Hadoop :: Prog

public static class Reduce extends MapReduceBase implements
    Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Page 21:

Working With Hadoop :: Prog Configurations & Main class

Leave the rest of the work to the Hadoop MapReduce framework
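The slide does not show the driver itself; a sketch of what that configuration and main class typically look like with the 0.20-era org.apache.hadoop.mapred API is below. The class names (WordCount, Map, Reduce) follow the WordCount example in the Hadoop MapReduce tutorial cited in the references, and this fragment needs a Hadoop installation on the classpath to compile and run:

```java
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Driver: configure the job and submit it; Map and Reduce are the
// static nested classes shown on the previous slides.
public class WordCount {
  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);          // k3: word
    conf.setOutputValueClass(IntWritable.class); // v3: count

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class); // combiner reuses the reducer
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf); // submit and wait for completion
  }
}
```

Everything not configured here (input splitting, shuffling, sorting, fault tolerance) is exactly the "other work" left to the framework.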

Page 22:

Summary

Hadoop introduction
▪ Connections to the IR book

MapReduce overview
▪ Example: WordCount
▪ Environment configuration
▪ Writing your MapReduce application

Page 23:

References

Hadoop Project

http://hadoop.apache.org/

MapReduce in Hadoop

http://hadoop.apache.org/common/docs/r0.20.0/mapred_tutorial.html

MapReduce: Simplified Data Processing on Large Clusters

http://portal.acm.org/citation.cfm?id=1327452.1327492&coll=GUIDE&dl=&idx=J79&part=magazine&WantType=Magazines&title=Communications%20of%20the%20ACM

Hadoop Single-Node Setup
http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html

Who uses Hadoop
http://wiki.apache.org/hadoop/PoweredBy

Page 24:

Thank You!