Hadoop
Team3: Xiaokui Shu, Ron Cohen
[email protected] [email protected]
CS5604 at Virginia Tech
December 6, 2010



Page 1:

Hadoop

Team3: Xiaokui Shu, Ron Cohen
[email protected] [email protected]
CS5604 at Virginia Tech
December 6, 2010

Page 2:

Content

Introduction
▪ Hadoop
▪ MapReduce

Working With Hadoop
▪ Environment
▪ MapReduce Programming

Summary

Page 3:

Introduction :: Hadoop

Is a software framework
▪ the user writes programs against it, like a super-library

For distributed applications
▪ built-in solutions
▪ solutions depend on this framework

Inspired by Google's MapReduce and Google File System (GFS) papers

Page 4:

Introduction :: Hadoop

Who uses Hadoop?
A9.com – Amazon
▪ Amazon's product search indices
Adobe
▪ 30 nodes running HDFS, Hadoop and HBase
Baidu
▪ handles about 3000 TB per week
Facebook
▪ stores copies of internal log and dimension data sources
Last.fm, LinkedIn, IBM, Yahoo!, Google…

Page 5:

Introduction :: Hadoop

▪ Hadoop Common
▪ HDFS
▪ MapReduce
▪ ZooKeeper

Page 6:

Introduction :: Hadoop :: IR

Connections to the IR book
Ch. 4 Index construction
▪ Distributed indexing (4.4)

Ch. 20 Web crawling and indexes
▪ Distributed crawler (20.2)
▪ Distributed indexing (20.3)

Page 7:

Introduction :: MapReduce

Is a software framework for distributed computing
▪ massive amounts of data
▪ simple processing requirements
▪ portability across a variety of platforms
  ▪ clusters
  ▪ CMP/SMP
  ▪ GPGPU

Introduced by Google

Page 8:

Introduction :: MapReduce

Cited from MapReduce: Simplified Data Processing on Large Clusters

Page 9:

Introduction :: MapReduce

Map
  Map(k1, v1) -> list(k2, v2)
Reduce
  Reduce(k2, list(v2)) -> list(v3)

Hadoop MapReduce
  (input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)

Page 10:

Introduction :: MapReduce Example

Source

$ cat file01
Hello World Bye World
$ cat file02
Hello Hadoop Goodbye Hadoop

Page 11:

Introduction :: MapReduce Example

Map Output

For file01:
< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>

For file02:
< Hello, 1>
< Hadoop, 1>
< Goodbye, 1>
< Hadoop, 1>

Page 12:

Introduction :: MapReduce Example

Reduce Output

< Bye, 1>
< Goodbye, 1>
< Hadoop, 2>
< Hello, 2>
< World, 2>
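The WordCount data flow above (map, group by key, reduce) can be simulated in a few lines of plain Java, independent of Hadoop itself. The class and method names here are illustrative only, not part of any Hadoop API:

```java
import java.util.*;

// In-memory sketch of the WordCount example: map emits (word, 1)
// pairs, then grouping and summing by key plays the role of
// shuffle + reduce.
public class WordCountSketch {
    // "map": emit a (word, 1) pair for every token in a document
    static List<Map.Entry<String, Integer>> map(String document) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : document.split("\\s+"))
            out.add(new AbstractMap.SimpleEntry<>(w, 1));
        return out;
    }

    // shuffle + "reduce": group values by key and sum them
    static SortedMap<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        pairs.addAll(map("Hello World Bye World"));       // file01
        pairs.addAll(map("Hello Hadoop Goodbye Hadoop")); // file02
        System.out.println(reduce(pairs));
        // -> {Bye=1, Goodbye=1, Hadoop=2, Hello=2, World=2}
    }
}
```

Running the sketch on the two example files reproduces exactly the reduce output shown on this slide.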

Page 13:

Introduction :: MapReduce

More input → more mappers
▪ Combiner function after Map

More reducers
▪ Partition function before Reduce

Focus on Map & Reduce
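The combiner and partition functions mentioned above can be sketched with simple assumed semantics: a combiner pre-sums one mapper's local output before the shuffle, and a partition function routes each key to a reducer (Hadoop's default partitioner is hash-based, as here). The class and method names are illustrative, not Hadoop APIs:

```java
import java.util.*;

// Sketch of the two optional MapReduce stages: combine and partition.
public class CombinePartitionSketch {
    // Combiner: merge duplicate keys in a single mapper's output,
    // shrinking the data shipped across the network
    static Map<String, Integer> combine(List<String> mapOutputKeys) {
        Map<String, Integer> local = new TreeMap<>();
        for (String k : mapOutputKeys) local.merge(k, 1, Integer::sum);
        return local;
    }

    // Partition: pick the reducer responsible for a key
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        // file01's map output keys: Hello, World, Bye, World
        Map<String, Integer> local =
            combine(Arrays.asList("Hello", "World", "Bye", "World"));
        System.out.println(local);
        // the two <World,1> pairs became one <World,2>:
        // 3 pairs shipped instead of 4
        for (String k : local.keySet())
            System.out.println(k + " -> reducer " + partition(k, 2));
    }
}
```

With two reducers, every key is routed consistently to the same reducer, which is what lets reduce see all values for a key together.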

Page 14:

Working With Hadoop :: Env

Hadoop is written in Java (C++ is also supported)

Runs in 3 modes:
▪ Local (Standalone) Mode
▪ Pseudo-Distributed Mode
▪ Fully-Distributed Mode

Our instance on the IBM cloud is set up in Pseudo-Distributed Mode

Page 15:

Working With Hadoop

Process:
1. Start the Hadoop service
2. Prepare input
3. Write your MapReduce program
4. Compile your program
5. Run your application with Hadoop

Page 16:

Working With Hadoop :: Env

Start the Hadoop service
$ bin/hadoop namenode -format   (initialize the filesystem)
$ bin/start-all.sh

Put input into the filesystem
$ bin/hadoop fs -put localdir hinputdir
You can also use -get, -rm, -cat with fs

Page 17:

Working With Hadoop :: Env

Compile your program & create a jar
$ javac -classpath ${HADOOP}-core.jar -d wordcount_classes WordCount.java
$ jar -cvf wordcount.jar -C wordcount_classes/ .

Run your application with Hadoop
$ bin/hadoop jar wordcount.jar org.myorg.WordCount hinputdir houtputdir

Page 18:

Working With Hadoop :: Prog

void map(String name, String document):
  // name: document name
  // document: document contents
  for each word w in document:
    EmitIntermediate(w, "1");

void reduce(String word, Iterator partialCounts):
  // word: a word
  // partialCounts: a list of aggregated partial counts
  int result = 0;
  for each pc in partialCounts:
    result += ParseInt(pc);
  Emit(AsString(result));

Cited from Wikipedia

Page 19:

Working With Hadoop :: Prog

public static class Map extends MapReduceBase implements
    Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}

Page 20:

Working With Hadoop :: Prog

public static class Reduce extends MapReduceBase implements
    Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Page 21:

Working With Hadoop :: Prog Configurations & Main class

Leave the rest of the work to the Hadoop MapReduce framework
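The slide does not show the driver itself; a sketch of what that configuration and main class typically look like with the 0.20-era org.apache.hadoop.mapred API is below. The class names (WordCount, Map, Reduce) follow the WordCount example in the Hadoop MapReduce tutorial cited in the references, and this fragment needs a Hadoop installation on the classpath to compile and run:

```java
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Driver: configure the job and submit it; Map and Reduce are the
// static nested classes shown on the previous slides.
public class WordCount {
  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);          // k3: word
    conf.setOutputValueClass(IntWritable.class); // v3: count

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class); // combiner reuses the reducer
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf); // submit and wait for completion
  }
}
```

Everything not configured here (input splitting, shuffling, sorting, fault tolerance) is exactly the "other work" left to the framework.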

Page 22:

Summary

Hadoop introduction
▪ Connections to the IR book

MapReduce overview
▪ Example: WordCount
▪ Environment configuration
▪ Writing your MapReduce application

Page 23:

References

Hadoop Project

http://hadoop.apache.org/

MapReduce in Hadoop

http://hadoop.apache.org/common/docs/r0.20.0/mapred_tutorial.html

MapReduce: Simplified Data Processing on Large Clusters

http://portal.acm.org/citation.cfm?id=1327452.1327492&coll=GUIDE&dl=&idx=J79&part=magazine&WantType=Magazines&title=Communications%20of%20the%20ACM

Hadoop Single-Node Setup
http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html

Who uses Hadoop
http://wiki.apache.org/hadoop/PoweredBy

Page 24:

Thank You!