THE TALE OF THE GLORIOUS LAMBDAS & THE WERE-CLUSTERZ

Mateusz Fedoryszak, Michał Oniszczuk

Scala and big data in ICM. Scoobi, Scalding, Spark, Stratosphere. Scalar 2014


More than the weather forecast.


MUCH MORE…


WE SPY ON SCIENTISTS


RAW DATA


COMMON MAP OF ACADEMIA


HADOOP

How to read millions of papers?


IN ONE PICTURE

Map Reduce


WORD COUNT IS THE NEW HELLO WORLD


WORD COUNT IN VANILLA MAP-REDUCE

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token of every input line.
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the ones collected for each word.
  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  // Driver: wires mapper, reducer, and input/output formats into a job.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
  }
}


HOW SHOULD A WORD COUNT LOOK?

val lines = List("ala ma kota", "kot ma ale")

val words = lines.flatMap(_.split(" "))
val groups = words.groupBy(identity)
val counts = groups.map(x => (x._1, x._2.length))

counts.foreach(println)


SCOOBI, SCALDING

Map-Reduce the right way: with lambdas.


WORD COUNT IN PURE SCALA

val lines = List("ala ma kota", "kot ma ale")

val words = lines.flatMap(_.split(" "))
val groups = words.groupBy(identity)
val counts = groups.map(x => (x._1, x._2.size))

counts.foreach(println)


WORD COUNT IN SCOOBI

val lines = fromTextFile("hdfs://in/...")

val words = lines.mapFlatten(_.split(" "))
val groups = words.groupBy(identity)
val counts = groups.map(x => (x._1, x._2.size))

counts
  .toTextFile("hdfs://out/...", overwrite=true)
  .persist()


BEHIND THE SCENES

val lines =
  fromTextFile("hdfs://in/...")

val words =
  lines.mapFlatten(_.split(" "))
val groups =
  words.groupBy(identity)
val counts =
  groups.map(x => (x._1, x._2.size))

counts
  .toTextFile("hdfs://out/...",
    overwrite=true)
  .persist()

[Diagram: the flatMap, groupBy and map calls are compiled down to map and reduce phases.]
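
To make the mapping onto map and reduce phases concrete, here is a minimal local sketch in plain Scala collections. It is not how Scoobi plans its jobs; the object and method names (MapReducePhases, mapPhase, shuffle, reducePhase) are made up for illustration, and everything runs in a single JVM.

```scala
object MapReducePhases {
  // Map phase: every input line is turned into (word, 1) pairs.
  def mapPhase(lines: Seq[String]): Seq[(String, Int)] =
    lines.flatMap(_.split(" ")).filter(_.nonEmpty).map(word => (word, 1))

  // Shuffle: pairs sharing a key are brought together (here: a local groupBy).
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2)) }

  // Reduce phase: the values collected for each key are summed.
  def reducePhase(groups: Map[String, Seq[Int]]): Map[String, Int] =
    groups.map { case (word, ones) => (word, ones.sum) }

  def wordCount(lines: Seq[String]): Map[String, Int] =
    reducePhase(shuffle(mapPhase(lines)))

  def main(args: Array[String]): Unit =
    wordCount(Seq("ala ma kota", "kot ma ale")).toSeq.sortBy(_._1).foreach(println)
}
```

On a cluster the shuffle step moves data between machines, which is why a groupBy forces a reduce phase while flatMap and map can stay inside a mapper.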


SCOOBI SNACKS

– Joins, group-by, etc. baked in

– Static type checking with custom data types and IO

– One lang to rule them all (and it’s THE lang)

– Easy local testing

– REPL


WHICH ONE IS THE FRAMEWORK?

Scoobi                Scalding
Pure Scala            Cascading wrapper
Developed by NICTA    Developed by Twitter
Strongly typed API    Field-based and strongly typed API

Has cooler logo


THE NEW BIG DATA ZOO

Most slides are by Matei Zaharia from the Spark team


SPARK IDEA


MAPREDUCE PROBLEMS…

[Diagram: iterative jobs (iter. 1, iter. 2, …) do an HDFS read and an HDFS write around every iteration; interactive queries (query 1, 2, 3 giving result 1, 2, 3) each do an HDFS read of the same input.]


… SOLVED WITH SPARK

[Diagram: iterations (iter. 1, iter. 2, …) share data through distributed memory; the input is loaded into distributed memory by one-time processing and then serves query 1, query 2, query 3, …]


RESILIENT DISTRIBUTED DATASETS (RDDS)

Restricted form of distributed shared memory
» Partitioned data
» Higher-level operations (map, filter, join, …)
» No side-effects

Efficient fault recovery using lineage
» List of operations
» Recompute lost partitions on failure
» No cost if nothing fails
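
The lineage idea can be sketched in a few lines of plain Scala. This is a toy model under stated assumptions, not Spark's implementation: Dataset, lineage and computePartition are hypothetical names, partitions are in-memory vectors, and only map-style operations are recorded.

```scala
object LineageSketch {
  // A dataset is its source partitions plus the list of operations
  // (the lineage) that derive it; nothing is materialised eagerly.
  final case class Dataset(source: Vector[Vector[Int]], lineage: List[Int => Int]) {
    // Record one more operation instead of copying any data.
    def map(f: Int => Int): Dataset = copy(lineage = lineage :+ f)

    // (Re)compute a single partition by replaying the lineage on its source.
    def computePartition(i: Int): Vector[Int] =
      source(i).map(x => lineage.foldLeft(x)((acc, f) => f(acc)))
  }

  def main(args: Array[String]): Unit = {
    val ds = Dataset(Vector(Vector(1, 2), Vector(3, 4)), Nil).map(_ + 1).map(_ * 10)
    // Pretend partition 1 was lost: only that partition is recomputed.
    println(ds.computePartition(1))
  }
}
```

Because each partition can be rebuilt from its source and the recorded operations, nothing extra has to be written while the job runs, which is the "no cost if nothing fails" point.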


API

Scala, Python, Java

+ REPL

map, reduce, filter, groupBy, join, …


SPARK EXAMPLES



EXAMPLE: LOG MINING

Load error messages from a log into memory, then interactively search for various patterns

val lines = spark.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split('\t')(2))
val cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count
cachedMsgs.filter(_.contains("bar")).count

[Diagram: the Master sends tasks to three Workers and collects results; each Worker reads one block (Block 1, 2, 3) and keeps its messages in a local cache (Cache 1, 2, 3).]

Result: 1 TB of data in 5-7 sec (vs 170 sec for on-disk data)
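
To try the same pipeline without a cluster, the sketch below runs it on an ordinary Scala collection. It only mirrors the logic of the slide; it does not use Spark. The sample log lines, dates included, are invented, and the names LogMiningLocal, errorMessages and countContaining are hypothetical.

```scala
object LogMiningLocal {
  // Invented sample data: level, date and message separated by tabs.
  val sampleLog: Seq[String] = Seq(
    "ERROR\t2014-04-05\tfoo failed",
    "INFO\t2014-04-05\tall good",
    "ERROR\t2014-04-05\tbar timed out",
    "ERROR\t2014-04-05\tfoo retry failed"
  )

  // Keep only error lines and extract the third tab-separated field.
  // Lines with fewer than three fields would throw here.
  def errorMessages(lines: Seq[String]): Seq[String] =
    lines.filter(_.startsWith("ERROR")).map(_.split('\t')(2))

  // One "interactive query": how many messages contain the pattern?
  def countContaining(messages: Seq[String], pattern: String): Int =
    messages.count(_.contains(pattern))

  def main(args: Array[String]): Unit = {
    // Computed once and reused by every query, like the cached RDD.
    val cachedMsgs = errorMessages(sampleLog)
    println(countContaining(cachedMsgs, "foo"))
    println(countContaining(cachedMsgs, "bar"))
  }
}
```

The local value plays the role of the cache: the filtering and splitting happen once, and every later query scans only the extracted messages.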


PAGERANK PERFORMANCE

Time per iteration (s): Hadoop 170.75, Spark 23.01


SPARK LIBRARIES


SPARK’S ZOO

Spark

Spark Streaming (real-time)

GraphX (graph)

Shark (SQL)

MLlib (machine learning)

BlinkDB


ALL IN ONE

val points = sc.runSql[Double, Double](
  "select latitude, longitude from historic_tweets")

val model = KMeans.train(points, 10)

sc.twitterStream(...)
  .map(t => (model.closestCenter(t.location), 1))
  .reduceByWindow("5s", _ + _)
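
The last step of that pipeline, assigning each point to its nearest cluster center and counting per center, can be sketched locally as follows. This is an illustration in plain Scala, not the Spark or MLlib API: model training and streaming are left out, one batch of points stands in for one window, and all names (ClosestCenterCounts, closestCenter, countsPerCenter) are hypothetical.

```scala
object ClosestCenterCounts {
  type Point = (Double, Double)

  def squaredDistance(a: Point, b: Point): Double = {
    val dx = a._1 - b._1
    val dy = a._2 - b._2
    dx * dx + dy * dy
  }

  // Index of the center nearest to p (centers must be non-empty).
  def closestCenter(centers: Seq[Point], p: Point): Int =
    centers.indices.minBy(i => squaredDistance(centers(i), p))

  // Count the points of one batch per nearest center.
  def countsPerCenter(centers: Seq[Point], batch: Seq[Point]): Map[Int, Int] =
    batch.groupBy(p => closestCenter(centers, p)).map { case (c, ps) => (c, ps.size) }

  def main(args: Array[String]): Unit = {
    val centers = Seq((0.0, 0.0), (10.0, 10.0))
    val batch = Seq((1.0, 1.0), (9.0, 9.0), (8.0, 11.0))
    countsPerCenter(centers, batch).toSeq.sortBy(_._1).foreach(println)
  }
}
```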


SPARK CONCLUSION

• In-memory processing

• Libraries

• Increasingly popular

[Diagram: the Spark stack (Spark, Spark Streaming, GraphX, Shark, MLlib, BlinkDB, …)]


USEFUL LINKS

• spark.apache.org

• spark-summit.org: videos & online hands-on tutorials


STRATOSPHERE

Like Spark, but less popular and less mature


CONCLUSION

• Big data today is where RDBMSs were in the 80’s

• Scala goes well with big data


THANK YOU!

Q&A


SCALABILITY

Logistic Regression: iteration time (s) by number of machines

Machines        25     50     100
Hadoop          184    111    76
HadoopBinMem    116    80     62
Spark           15     6      3

K-Means: iteration time (s) by number of machines

Machines        25     50     100
Hadoop          274    157    106
HadoopBinMem    197    121    87
Spark           143    61     33


INSUFFICIENT RAM

Percent of working set in memory    0%      25%     50%     75%     100%
Iteration time (s)                  68.8    58.1    40.7    29.7    11.5


PERFORMANCE

[Chart: SQL. Response time (s) for Hive, Impala (disk), Impala (mem), Shark (disk), Shark (mem).]

[Chart: Graph. Response time (min) for Hadoop, Giraph, GraphX.]

[Chart: Streaming. Throughput (MB/s/node) for Storm, Spark.]