Introduction to Spark with Scala

Himanshu Gupta
Software Consultant
Knoldus Software LLP
Who am I?

Himanshu Gupta (@himanshug735)
Software Consultant at Knoldus Software LLP
Spark & Scala enthusiast
Agenda

● What is Spark?
● Why do we need Spark?
● Brief introduction to RDD
● Brief introduction to Spark Streaming
● How to install Spark?
● Demo
What is Apache Spark?

A fast and general engine for large-scale data processing, with libraries for SQL, streaming, and advanced analytics
Spark History

2009 - Project begins at UC Berkeley AMP Lab
2010 - Open sourced
2013 - Enters Apache Incubator; Cloudera support; Spark Summit 2013
2014 - Becomes Apache top-level project; Spark Summit 2014
2015 - DataFrames
Spark Stack

Img src - http://spark.apache.org/
Fastest Growing Open Source Project

Img src - https://databricks.com/blog/2015/03/31/spark-turns-five-years-old.html
Agenda

● What is Spark?
● Why do we need Spark?
● Brief introduction to RDD
● Brief introduction to Spark Streaming
● How to install Spark?
● Demo
Code Size

Img src - http://spark-summit.org/wp-content/uploads/2013/10/Zaharia-spark-summit-2013-matei.pdf
Word Count Ex.

public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Daytona GraySort Record: Data to sort 100TB

Hadoop (2013): 2100 nodes, 72 minutes
Spark (2014): 206 nodes, 23 minutes

Img src - http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015
Runs Everywhere
Img src - http://spark.apache.org/
Who is using Apache Spark?

Img src - http://www.slideshare.net/datamantra/introduction-to-apache-spark-45062010
Agenda

● What is Spark?
● Why do we need Spark?
● Brief introduction to RDD
● Brief introduction to Spark Streaming
● How to install Spark?
● Demo
Brief Introduction to RDD

RDD stands for Resilient Distributed Dataset: a fault-tolerant, distributed collection of objects.

In Spark, all work is expressed in one of the following ways:
1) Creating new RDD(s)
2) Transforming existing RDD(s)
3) Calling operations on RDD(s)
Example (RDD)

val master = "local"
val conf = new SparkConf().setMaster(master)

This is the Spark Configuration
Example (RDD)

val master = "local"
val conf = new SparkConf().setMaster(master)
val sc = new SparkContext(conf)

This is the Spark Context

Contd...
Example (RDD)

val master = "local"
val conf = new SparkConf().setMaster(master)
val sc = new SparkContext(conf)
val lines = sc.textFile("demo.txt")

Extract lines from the text file

Contd...
Example (RDD)

val master = "local"
val conf = new SparkConf().setMaster(master)
val sc = new SparkContext(conf)
val lines = sc.textFile("demo.txt")
val words = lines.flatMap(_.split(" ")).map((_, 1))

Map lines to (word, 1) pairs

Contd...
Example (RDD)

val master = "local"
val conf = new SparkConf().setMaster(master)
val sc = new SparkContext(conf)
val lines = sc.textFile("demo.txt")
val words = lines.flatMap(_.split(" ")).map((_, 1))
val wordCountRDD = words.reduceByKey(_ + _)

Word Count RDD

Contd...
Example (RDD)

val master = "local"
val conf = new SparkConf().setMaster(master)
val sc = new SparkContext(conf)
val lines = sc.textFile("demo.txt")
val words = lines.flatMap(_.split(" ")).map((_, 1))
val wordCountRDD = words.reduceByKey(_ + _)
val wordCount = wordCountRDD.collect

collect starts the computation and returns the (word, count) pairs

Contd...
Example (RDD)

val master = "local"
val conf = new SparkConf().setMaster(master)
val sc = new SparkContext(conf)
val lines = sc.textFile("demo.txt")
val words = lines.flatMap(_.split(" ")).map((_, 1))
val wordCountRDD = words.reduceByKey(_ + _)
val wordCount = wordCountRDD.collect

flatMap, map, and reduceByKey are Transformations; collect is an Action

Contd...
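The same pipeline can be traced without a Spark cluster: Scala collections offer flatMap and map with the same shape, and reduceByKey can be mimicked with groupBy plus a sum. A minimal sketch (plain Scala, not the Spark API; the object name and sample input are invented for illustration):

```scala
// Plain-Scala sketch of the word-count pipeline above.
// Seq stands in for the RDD; groupBy + sum stands in for reduceByKey.
object WordCountSketch {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))                                      // lines -> words
      .map((_, 1))                                                // word -> (word, 1)
      .groupBy(_._1)                                              // group pairs by word
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }  // sum the 1s

  def main(args: Array[String]): Unit =
    println(wordCount(Seq("spark and scala", "spark demo")))
}
```

Unlike the RDD version, every step here runs eagerly; in Spark, nothing executes until the collect action fires.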
Agenda

● What is Spark?
● Why do we need Spark?
● Brief introduction to RDD
● Brief introduction to Spark Streaming
● How to install Spark?
● Demo
Brief Introduction to Spark Streaming

Img src - http://spark.apache.org/
How Does Spark Streaming Work?

Img src - http://spark.apache.org/
Why Do We Need Spark Streaming?

High Level API:
TwitterUtils.createStream(...)
  .filter(_.getText.contains("Spark"))
  .countByWindow(Seconds(10), Seconds(5)) // Counting tweets on a sliding window

Fault Tolerant:

Integration: Integrated with Spark SQL, MLlib, GraphX...

Img src - http://spark.apache.org/
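The countByWindow call above keeps a count over a 10-second window that slides forward every 5 seconds. The windowing arithmetic can be mimicked on a plain sequence of per-batch counts (a sketch only; SlidingWindowSketch and the sample numbers are invented, not part of the Spark API):

```scala
// Sliding-window count over per-batch match counts (plain Scala, not Spark).
// counts(i) = matching tweets in batch i; window/slide are measured in batches,
// like countByWindow(Seconds(10), Seconds(5)) with 1-second batches.
object SlidingWindowSketch {
  def countByWindow(counts: Seq[Int], window: Int, slide: Int): Seq[Int] =
    counts.sliding(window, slide).map(_.sum).toSeq

  def main(args: Array[String]): Unit =
    println(countByWindow(Seq(1, 0, 2, 1, 3, 0, 1, 2), window = 4, slide = 2))
}
```

Each output element is the total over one window position; successive windows overlap because the slide is smaller than the window.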
Example (Spark Streaming)

val master = "local"
val conf = new SparkConf().setMaster(master)

Specify Spark Configuration
Example (Spark Streaming)

val master = "local"
val conf = new SparkConf().setMaster(master)
val ssc = new StreamingContext(conf, Seconds(10))

Set up the StreamingContext

Contd...
Example (Spark Streaming)

val master = "local"
val conf = new SparkConf().setMaster(master)
val ssc = new StreamingContext(conf, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)

This is the ReceiverInputDStream: a lines DStream with one batch per time interval

Contd...
Example (Spark Streaming)

val master = "local"
val conf = new SparkConf().setMaster(master)
val ssc = new StreamingContext(conf, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" ")).map((_, 1))

map creates a new DStream (a sequence of RDDs) of (word, 1) pairs

Contd...
Example (Spark Streaming)

val master = "local"
val conf = new SparkConf().setMaster(master)
val ssc = new StreamingContext(conf, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" ")).map((_, 1))
val wordCounts = words.reduceByKey(_ + _)

reduceByKey groups the DStream by word

Contd...
Example (Spark Streaming)

val master = "local"
val conf = new SparkConf().setMaster(master)
val ssc = new StreamingContext(conf, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" ")).map((_, 1))
val wordCounts = words.reduceByKey(_ + _)

ssc.start()

ssc.start() starts the streaming computation

Contd...
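A DStream is just a sequence of RDDs, one per batch interval, and the same flatMap/map/reduceByKey pipeline runs on each batch independently. That per-batch behaviour can be sketched with plain Scala collections (MicroBatchSketch is a made-up name; this is not the Spark API and needs no cluster):

```scala
// Each inner Seq plays the role of one RDD in the DStream;
// the word-count transformations run independently per batch.
object MicroBatchSketch {
  def perBatchWordCounts(batches: Seq[Seq[String]]): Seq[Map[String, Int]] =
    batches.map { batch =>
      batch
        .flatMap(_.split(" "))
        .map((_, 1))
        .groupBy(_._1)
        .map { case (word, pairs) => (word, pairs.map(_._2).sum) }
    }

  def main(args: Array[String]): Unit =
    perBatchWordCounts(Seq(Seq("spark streaming"), Seq("spark spark")))
      .foreach(println)
}
```

In the real program, ssc.start() kicks off this per-batch loop against the socket source; here each batch is simply mapped eagerly.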
Agenda

● What is Spark?
● Why do we need Spark?
● Brief introduction to RDD
● Brief introduction to Spark Streaming
● How to install Spark?
● Demo
How to Install Spark?

Download Spark from -
http://spark.apache.org/downloads.html

Extract it to a suitable directory.

If you downloaded the source package, go to the directory via terminal & run the following command to build it -

mvn -DskipTests clean package

Now Spark is ready to run in interactive mode -

./bin/spark-shell
sbt Setup
name := "Spark Demo"
version := "1.0"
scalaVersion := "2.10.5"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.2.1",
  "org.apache.spark" %% "spark-streaming" % "1.2.1",
  "org.apache.spark" %% "spark-sql" % "1.2.1",
  "org.apache.spark" %% "spark-mllib" % "1.2.1"
)
Agenda

● What is Spark?
● Why do we need Spark?
● Brief introduction to RDD
● Brief introduction to Spark Streaming
● How to install Spark?
● Demo
Demo
Download Code
https://github.com/knoldus/spark-scala
References
http://spark.apache.org/
http://spark-summit.org/2014
http://spark.apache.org/docs/latest/quick-start.html
http://stackoverflow.com/questions/tagged/apache-spark
https://www.youtube.com/results?search_query=apache+spark
http://apache-spark-user-list.1001560.n3.nabble.com/
http://www.slideshare.net/paulszulc/apache-spark-101-in-50-min
Presenter:
[email protected]
@himanshug735

Organizer:
@Knolspeak
http://www.knoldus.com
http://blog.knoldus.com
Thanks