JRubyKaigi2010 Hadoop Papyrus

Page 1: JRubyKaigi2010 Hadoop Papyrus

MapReduce by JRuby and DSL Hadoop Papyrus

2010/8/28

JRubyKaigi 2010

藤川幸一 FUJIKAWA Koichi @fujibee

Page 2: JRubyKaigi2010 Hadoop Papyrus

What’s Hadoop?

• Framework for parallel distributed processing of big data

• OSS clone of Google MapReduce

• For data processing at terabyte scale and beyond
  – Reading 400 TB (web-scale data) from a standard HDD at 50 MB/s would take over 2,000 hours (400 × 10^12 B ÷ 50 × 10^6 B/s ≈ 8 × 10^6 s ≈ 2,200 h)
  – A distributed file system and a parallel processing framework are needed!

Page 3: JRubyKaigi2010 Hadoop Papyrus

Hadoop Papyrus

• My own OSS project
  – Hosted on GitHub: http://github.com/fujibee/hadoop-papyrus

• Framework for describing Hadoop jobs in a (J)Ruby DSL
  – Hadoop jobs are originally written in Java
  – The same very complex Java procedure takes just a few lines of Ruby!

• Supported by the IPA MITOH 2009 project (Japanese government funding)

• Can be run via a Hudson (CI tool) plug-in

Page 4: JRubyKaigi2010 Hadoop Papyrus

Step.1 Not Java: we can write it in Ruby!
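For example, a word count in plain Ruby might look like the sketch below. This is a minimal illustration only: the map/reduce signatures and the output collector are assumptions made for this sketch, not Papyrus's actual API.

# Hypothetical plain-Ruby word count, analogous to the Java WordCount
# shown later. `output` stands in for the collector a framework would pass.
def map(key, value, output)
  value.to_s.split.each { |word| output.call(word, 1) }
end

def reduce(key, values, output)
  output.call(key, values.inject(0) { |sum, v| sum + v })
end

# Local sanity check, no Hadoop required:
pairs = []
map(nil, 'to be or not to be', lambda { |k, v| pairs << [k, v] })
pairs.group_by(&:first).each do |word, kvs|
  reduce(word, kvs.map(&:last), lambda { |k, v| puts "#{k}\t#{v}" })
end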

Page 5: JRubyKaigi2010 Hadoop Papyrus

Step.2 Simple description by a DSL in Ruby

Map/Reduce Job Description

Log Analysis DSL
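How do bare top-level calls such as from and to work? A common Ruby technique, and a plausible sketch of what a Papyrus-like DSL host could do (the class and method names below are hypothetical, not Papyrus's real internals), is to evaluate the user's script against a description object with instance_eval:

# Hypothetical DSL host: each keyword is just a method recording config.
class JobDescription
  attr_reader :conf

  def initialize
    @conf = { :topics => [] }
  end

  def dsl(name);         @conf[:name]    = name;   end
  def from(path);        @conf[:input]   = path;   end
  def to(path);          @conf[:output]  = path;   end
  def pattern(regexp);   @conf[:pattern] = regexp; end
  def column_name(name); @conf[:column]  = name;   end

  def topic(desc, opts = {}, &block)
    @conf[:topics] << { :desc => desc, :label => opts[:label], :body => block }
  end
end

job = JobDescription.new
# The framework would load the user's script file; inlined here:
job.instance_eval <<-'SCRIPT'
  dsl 'LogAnalysis'
  from 'test/in'
  to 'test/out'
SCRIPT
puts job.conf[:input]   # => test/in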

Page 6: JRubyKaigi2010 Hadoop Papyrus

Step.3 Set up the Hadoop server environment easily with Hudson

Page 7: JRubyKaigi2010 Hadoop Papyrus

package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

  public static class TokenizerMapper extends
      Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer extends
      Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Around 70 lines are needed in Java. Hadoop Papyrus needs only the 10 lines below!

dsl 'LogAnalysis'

from 'test/in'
to 'test/out'

# Extract wiki-style [[link]] targets into a column named :link
pattern /\[\[([^|\]:]+)[^\]:]*\]\]/
column_name :link

# Count unique link occurrences
topic "link num", :label => 'n' do
  count_uniq column[:link]
end

The Java version

Hadoop Papyrus

Page 8: JRubyKaigi2010 Hadoop Papyrus

Hadoop Papyrus Details

• Invokes Ruby scripts via JRuby inside the Java Map/Reduce processes
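The usual mechanism for running Ruby inside a Java process is the JRuby Embed API (org.jruby.embed.ScriptingContainer). A minimal sketch of the idea follows, driven from JRuby itself for brevity; in Papyrus the actual call site would be the Java Mapper/Reducer, and the variable name and script content are illustrative only.

require 'java'
java_import org.jruby.embed.ScriptingContainer

# A Java Mapper/Reducer would hold a ScriptingContainer, push the current
# record in, and evaluate the user's Ruby snippet for each record:
container = ScriptingContainer.new
container.put('value', 'one line of input')
container.run_scriptlet("value.split.each { |w| puts w }")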

Page 9: JRubyKaigi2010 Hadoop Papyrus

Hadoop Papyrus Details (cont'd)

• Additionally, you write only the DSL script for the processing you want (log analysis, etc.). Papyrus selects the appropriate part of the script for each phase (map, reduce, or job initialization), so a single script is all you need.
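One plausible way a single script can serve every phase: the framework evaluates the script in each phase and invokes only the block registered for the current one. A hypothetical sketch of this dispatch idea (not Papyrus's actual internals):

# The framework evaluates the same script in every phase, but only the
# block matching the injected phase is executed.
class PhaseRunner
  def initialize(phase)
    @phase  = phase
    @blocks = {}
  end

  def on(phase, &block)
    @blocks[phase] = block
  end

  def run
    @blocks[@phase].call if @blocks[@phase]
  end
end

runner = PhaseRunner.new(:map)  # the framework would inject the real phase
runner.instance_eval do
  on(:setup)  { puts 'configure the job' }
  on(:map)    { puts 'emit key/value pairs' }
  on(:reduce) { puts 'aggregate values per key' }
end
runner.run   # only the :map block executes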

Page 10: JRubyKaigi2010 Hadoop Papyrus

ありがとうございました! Thank you!

Twitter ID: @fujibee