Scalding - Hadoop Word Count in LESS than 70 lines of code

Konrad 'ktoso' Malawski
JARCamp #3, 12.04.2013

Twitter Scalding is built on top of Cascading, which in turn is built on top of Hadoop. It is essentially a very readable, easily extensible DSL for writing MapReduce jobs.

Or rather: Scalding - Hadoop Word Count in 4 lines of code.


Agenda

Why Scalding? (10%)
+ Hadoop Basics (20%)
+ Enter Cascading (40%)
+ Hello Scalding (30%)
= 100%

Why Scalding? Word Count in Types

type Word = String
type Count = Int

String => Map[Word, Count]


Why Scalding? Word Count in Scala

val text = "a a a b b"

def wordCount(text: String): Map[Word, Count] =
  text
    .split(" ")
    .map(a => (a, 1))
    .groupBy(_._1)
    .map { a => a._1 -> a._2.map(_._2).sum }

wordCount(text) should equal (Map("a" -> 3, "b" -> 2))


Stuff > Memory
Scala collections... fun, but memory bound!

val text = "so many words... waaah! ..."

text                                    // in Memory
  .split(" ")                           // in Memory
  .map(a => (a, 1))                     // in Memory
  .groupBy(_._1)                        // in Memory
  .map(a => (a._1, a._2.map(_._2).sum)) // in Memory

Every single step materializes a full intermediate collection in memory.
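On one machine you can push this a bit further before reaching for Hadoop. A minimal sketch (mine, not from the talk) that streams the input so that only the running counts are ever held in memory:

// assumption: input arrives as an Iterator[String],
// e.g. scala.io.Source.fromFile("big.txt").getLines()
def wordCountStreaming(lines: Iterator[String]): Map[String, Int] =
  lines
    .flatMap(_.split(" ")) // words are produced lazily, line by line
    .foldLeft(Map.empty[String, Int]) { (counts, word) =>
      counts.updated(word, counts.getOrElse(word, 0) + 1) // only the counts map is retained
    }

But once even the counts, or the raw data itself, outgrow a single machine, you need something like Hadoop.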

Apache Hadoop (HDFS + MR)
http://hadoop.apache.org/


Why Scalding? Word Count in Hadoop MR

package org.myorg;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

public class WordCount {

  public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}

Trivia: How old is Hadoop?


Cascading
www.cascading.org/


Cascading is
Taps & Pipes & Sinks


1: Distributed Copy

// source Tap
Tap inTap = new Hfs(new TextDelimited(true, "\t"), inPath);

// sink Tap
Tap outTap = new Hfs(new TextDelimited(true, "\t"), outPath);

// a Pipe, connects taps
Pipe copyPipe = new Pipe("copy");

// build the Flow
FlowDef flowDef = FlowDef.flowDef()
  .addSource(copyPipe, inTap)
  .addTailSink(copyPipe, outTap);

// run!
flowConnector.connect(flowDef).complete();


1. DCP - Full Code

public class Main {
  public static void main(String[] args) {
    String inPath = args[0];
    String outPath = args[1];

    Properties props = new Properties();
    AppProps.setApplicationJarClass(props, Main.class);
    HadoopFlowConnector flowConnector = new HadoopFlowConnector(props);

    Tap inTap = new Hfs(new TextDelimited(true, "\t"), inPath);
    Tap outTap = new Hfs(new TextDelimited(true, "\t"), outPath);

    Pipe copyPipe = new Pipe("copy");

    FlowDef flowDef = FlowDef.flowDef()
      .addSource(copyPipe, inTap)
      .addTailSink(copyPipe, outTap);

    flowConnector.connect(flowDef).complete();
  }
}


2: Word Count

String docPath = args[0];
String wcPath = args[1];

Properties props = new Properties();
AppProps.setApplicationJarClass(props, Main.class);
HadoopFlowConnector flowConnector = new HadoopFlowConnector(props);

// create source and sink taps
Tap docTap = new Hfs(new TextDelimited(true, "\t"), docPath);
Tap wcTap = new Hfs(new TextDelimited(true, "\t"), wcPath);

// specify a regex operation to split the "document" text lines into a token stream
Fields token = new Fields("token");
Fields text = new Fields("text");
RegexSplitGenerator splitter = new RegexSplitGenerator(token, "[ \\[\\]\\(\\),.]");
// only returns "token"
Pipe docPipe = new Each("token", text, splitter, Fields.RESULTS);

// determine the word counts
Pipe wcPipe = new Pipe("wc", docPipe);
wcPipe = new GroupBy(wcPipe, token);
wcPipe = new Every(wcPipe, Fields.ALL, new Count(), Fields.ALL);

// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef()
  .setName("wc")
  .addSource(docPipe, docTap)
  .addTailSink(wcPipe, wcTap);

// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect(flowDef);
wcFlow.writeDOT("dot/wc.dot");
wcFlow.complete();


Cascading - how?

// pseudo code...
val flow = FlowDef
val flowConnector: FlowDef => List[MRJob] = ...

val jobs: List[MRJob] = flowConnector(flow)

HadoopCluster.execute(jobs)


Cascading tips

Pipe assembly = new Pipe("assembly");
assembly = new Each(assembly, DebugLevel.VERBOSE, new Debug());
// ...

// head and tail have the same name
FlowDef flowDef = new FlowDef()
  .setName("debug")
  .addSource("assembly", source)
  .addSink("assembly", sink)
  .addTail(assembly);

flowDef.setDebugLevel(DebugLevel.NONE);

With DebugLevel.NONE set, the flowConnector will NOT even create the Debug pipe!

Scalding = Scala + Cascading

Twitter Scalding
github.com/twitter/scalding

Scalding API


map

Scala:

val data = 1 :: 2 :: 3 :: Nil

val doubled = data map { _ * 2 } // Int => Int

Scalding:

IterableSource(data)
  .map('number -> 'doubled) { n: Int => n * 2 } // Int => Int

'number must already be available in the Pipe; 'doubled stays in the Pipe. Note that you must choose the parameter's type explicitly!


mapTo

Scala:

var data = 1 :: 2 :: 3 :: Nil

val doubled = data map { _ * 2 } // Int => Int
data = null // release the reference

Scalding:

IterableSource(data)
  .mapTo('doubled) { n: Int => n * 2 } // Int => Int

'doubled stays in the Pipe; 'number is removed.
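In other words, mapTo behaves like a map followed by projecting away everything except the new fields, only cheaper. A sketch of the equivalence (the job, paths and field names here are mine, not from the talk):

import com.twitter.scalding._

class DoubledJob(args: Args) extends Job(args) {
  // both pipelines yield just the 'doubled field:
  Tsv(args("in"), 'number).read
    .map('number -> 'doubled) { n: Int => n * 2 }
    .project('doubled)
    .write(Tsv(args("out1")))

  Tsv(args("in"), 'number).read
    .mapTo('number -> 'doubled) { n: Int => n * 2 }
    .write(Tsv(args("out2")))
}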


flatMap

Scala:

val data = "1" :: "2,2" :: "3,3,3" :: Nil // List[String]

val numbers = data flatMap { line => // String
  line.split(",") // Array[String]
} map { _.toInt } // List[Int]

numbers should equal (List(1, 2, 2, 3, 3, 3))

Scalding:

TextLine(data) // like List[String]
  .flatMap('line -> 'word) { line: String => line.split(",") } // like List[String]
  .map('word -> 'number) { word: String => word.toInt } // like List[Int]

Here the MR-style map happens outside the flatMap, as a separate pipe operation.


flatMap

Scala:

val data = "1" :: "2,2" :: "3,3,3" :: Nil // List[String]

val numbers = data flatMap { line => // String
  line.split(",").map(_.toInt) // Array[Int]
}

numbers should equal (List(1, 2, 2, 3, 3, 3))

Scalding:

TextLine(data) // like List[String]
  .flatMap('line -> 'word) { line: String => line.split(",").map(_.toInt) } // like List[Int]

Here the inner map is plain Scala, applied inside the flatMap.


groupBy

Scala:

val data = 1 :: 2 :: 30 :: 42 :: Nil // List[Int]

val groups = data groupBy { _ < 10 }

groups // Map[Boolean, List[Int]]

groups(true) should equal (List(1, 2))
groups(false) should equal (List(30, 42))

Scalding:

IterableSource(List(1, 2, 30, 42), 'num)
  .map('num -> 'lessThanTen) { i: Int => i < 10 }
  .groupBy('lessThanTen) { _.size }

groupBy groups all rows sharing an equal (==) value of 'lessThanTen; the count field can be named explicitly, e.g. _.size('lessThanTenCounts).


groupBy

Scalding:

IterableSource(List(1, 2, 30, 42), 'num)
  .map('num -> 'lessThanTen) { i: Int => i < 10 }
  .groupBy('lessThanTen) { _.sum('num -> 'total) }

// 'total = [3, 72]   (1 + 2 = 3, 30 + 42 = 72)


Scalding API

project / discard
map / mapTo
flatMap / flatMapTo
rename
filter
unique
groupBy / groupAll / groupRandom / shuffle
limit
debug
Group operations
joins (see the sketch below)
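Joins are only listed, not shown, in the deck. A minimal sketch (mine; the job, paths and field names are made up) using the fields-based API, where joinWithSmaller joins the current pipe against a pipe known to be smaller:

import com.twitter.scalding._

class JoinSketchJob(args: Args) extends Job(args) {
  val users = Tsv(args("users"), ('id, 'name)).read

  Tsv(args("logs"), ('userId, 'action)).read
    .joinWithSmaller('userId -> 'id, users) // inner join on userId == id; users is the smaller side
    .project('name, 'action)                // keep only the fields we need
    .write(Tsv(args("out")))
}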


Distributed Copy in Scalding

class WordCountJob(args: Args) extends Job(args) {

  val input = Tsv(args("input"))
  val output = Tsv(args("output"))

  input.read.write(output)
}

The End.


Main Class - "Runner"

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.util.ToolRunner
import com.twitter.scalding

object ScaldingJobRunner extends App {

  ToolRunner.run(new Configuration, new scalding.Tool, args)

}

args is inherited from the App trait.
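Assuming the runner and job are bundled into a jar (the jar name below is made up), the job would then be launched with something along the lines of: hadoop jar scalding-demo.jar ScaldingJobRunner WordCountJob --hdfs --input words.txt --output counts.tsv. scalding.Tool treats the first remaining argument as the Job class to instantiate and hands the following --key value pairs to it as Args.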


Word Count in Scalding

class WordCountJob(args: Args) extends Job(args) {

  val inputFile = args("input")
  val outputFile = args("output")

  TextLine(inputFile)
    .flatMap('line -> 'word) { line: String => tokenize(line) }
    .groupBy('word) { _.size }
    .write(Tsv(outputFile))
  // ^ the actual job: 4 lines

  def tokenize(text: String): Array[String] = implemented
}
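The slide leaves tokenize as a stub ("implemented"); a plausible implementation (my sketch, not from the deck) that lower-cases and strips punctuation before splitting on whitespace:

def tokenize(text: String): Array[String] =
  text.toLowerCase
    .replaceAll("[^a-z0-9\\s]", "") // assumption: keep only letters, digits and whitespace
    .split("\\s+")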

Dzięki! Thanks! ありがとう! ("thanks" in Polish, English and Japanese)

Konrad Malawski @ java.pl
t: ktosopl / g: ktoso / b: blog.project13.pl