Introduction to Spark with Scala

Himanshu Gupta
Software Consultant
Knoldus Software LLP
Who am I?

Himanshu Gupta (@himanshug735)
Software Consultant at Knoldus Software LLP
Spark & Scala enthusiast
Agenda

● What is Spark?
● Why do we need Spark?
● Brief introduction to RDD
● Brief introduction to Spark Streaming
● How to install Spark?
● Demo
What is Apache Spark?

A fast and general engine for large-scale data processing, with libraries for SQL, streaming, and advanced analytics
Spark History

2009 - Project begins at UC Berkeley AMP Lab
2010 - Open sourced
2013 - Enters Apache Incubator; Cloudera support; Spark Summit 2013
2014 - Becomes Apache top-level project; Spark Summit 2014
2015 - DataFrames
Spark Stack

Img src - http://spark.apache.org/
Fastest Growing Open Source Project

Img src - https://databricks.com/blog/2015/03/31/spark-turns-five-years-old.html
Agenda

● What is Spark?
● Why do we need Spark?
● Brief introduction to RDD
● Brief introduction to Spark Streaming
● How to install Spark?
● Demo
Code Size

Img src - http://spark-summit.org/wp-content/uploads/2013/10/Zaharia-spark-summit-2013-matei.pdf
Word Count Ex.

public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Daytona GraySort Record: Data to sort 100TB

Hadoop (2013): 2100 nodes, 72 minutes
Spark (2014): 206 nodes, 23 minutes

Img src - http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015
Runs Everywhere
Img src - http://spark.apache.org/
Who is using Apache Spark?

Img src - http://www.slideshare.net/datamantra/introduction-to-apache-spark-45062010
Agenda

● What is Spark?
● Why do we need Spark?
● Brief introduction to RDD
● Brief introduction to Spark Streaming
● How to install Spark?
● Demo
Brief Introduction to RDD

RDD stands for Resilient Distributed Dataset: a fault-tolerant, distributed collection of objects.

In Spark, all work is expressed in one of the following ways:
1) Creating new RDD(s)
2) Transforming existing RDD(s)
3) Calling operations on RDD(s)
Example (RDD)

val master = "local"
val conf = new SparkConf().setMaster(master)

This is the Spark Configuration
Example (RDD)

val master = "local"
val conf = new SparkConf().setMaster(master)
val sc = new SparkContext(conf)

This is the Spark Context

Contd...
Example (RDD)

val master = "local"
val conf = new SparkConf().setMaster(master)
val sc = new SparkContext(conf)
val lines = sc.textFile("demo.txt")

Extract lines from the text file

Contd...
Example (RDD)

val master = "local"
val conf = new SparkConf().setMaster(master)
val sc = new SparkContext(conf)
val lines = sc.textFile("demo.txt")
val words = lines.flatMap(_.split(" ")).map((_, 1))

Map lines to (word, 1) pairs

Contd...
Example (RDD)

val master = "local"
val conf = new SparkConf().setMaster(master)
val sc = new SparkContext(conf)
val lines = sc.textFile("demo.txt")
val words = lines.flatMap(_.split(" ")).map((_, 1))
val wordCountRDD = words.reduceByKey(_ + _)

Word Count RDD

Contd...
Example (RDD)

val master = "local"
val conf = new SparkConf().setMaster(master)
val sc = new SparkContext(conf)
val lines = sc.textFile("demo.txt")
val words = lines.flatMap(_.split(" ")).map((_, 1))
val wordCountRDD = words.reduceByKey(_ + _)
val wordCount = wordCountRDD.collect

collect starts the computation and returns the (word, count) pairs

Contd...
Example (RDD)

val master = "local"
val conf = new SparkConf().setMaster(master)
val sc = new SparkContext(conf)
val lines = sc.textFile("demo.txt")
val words = lines.flatMap(_.split(" ")).map((_, 1))
val wordCountRDD = words.reduceByKey(_ + _)
val wordCount = wordCountRDD.collect

flatMap, map, and reduceByKey are Transformations; collect is an Action

Contd...
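The same pipeline can be traced without a Spark cluster: Scala collections offer flatMap and map with the same shape, and reduceByKey can be mimicked with groupBy plus a sum. A minimal sketch (plain Scala, not the Spark API; the object name and sample input are invented for illustration):

```scala
// Plain-Scala sketch of the word-count pipeline above.
// Seq stands in for the RDD; groupBy + sum stands in for reduceByKey.
object WordCountSketch {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))                                      // lines -> words
      .map((_, 1))                                                // word -> (word, 1)
      .groupBy(_._1)                                              // group pairs by word
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }  // sum the 1s

  def main(args: Array[String]): Unit =
    println(wordCount(Seq("spark and scala", "spark demo")))
}
```

Unlike the RDD version, every step here runs eagerly; in Spark, nothing executes until the collect action fires.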
Agenda

● What is Spark?
● Why do we need Spark?
● Brief introduction to RDD
● Brief introduction to Spark Streaming
● How to install Spark?
● Demo
Brief Introduction to Spark Streaming

Img src - http://spark.apache.org/
How Does Spark Streaming Work?

Img src - http://spark.apache.org/
Why Do We Need Spark Streaming?

High Level API:
TwitterUtils.createStream(...)
  .filter(_.getText.contains("Spark"))
  .countByWindow(Seconds(10), Seconds(5)) // Counting tweets on a sliding window

Fault Tolerant:

Integration: Integrated with Spark SQL, MLlib, GraphX...

Img src - http://spark.apache.org/
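The countByWindow call above keeps a count over a 10-second window that slides forward every 5 seconds. The windowing arithmetic can be mimicked on a plain sequence of per-batch counts (a sketch only; SlidingWindowSketch and the sample numbers are invented, not part of the Spark API):

```scala
// Sliding-window count over per-batch match counts (plain Scala, not Spark).
// counts(i) = matching tweets in batch i; window/slide are measured in batches,
// like countByWindow(Seconds(10), Seconds(5)) with 1-second batches.
object SlidingWindowSketch {
  def countByWindow(counts: Seq[Int], window: Int, slide: Int): Seq[Int] =
    counts.sliding(window, slide).map(_.sum).toSeq

  def main(args: Array[String]): Unit =
    println(countByWindow(Seq(1, 0, 2, 1, 3, 0, 1, 2), window = 4, slide = 2))
}
```

Each output element is the total over one window position; successive windows overlap because the slide is smaller than the window.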
Example (Spark Streaming)

val master = "local"
val conf = new SparkConf().setMaster(master)

Specify Spark Configuration
Example (Spark Streaming)

val master = "local"
val conf = new SparkConf().setMaster(master)
val ssc = new StreamingContext(conf, Seconds(10))

Set up the StreamingContext

Contd...
Example (Spark Streaming)

val master = "local"
val conf = new SparkConf().setMaster(master)
val ssc = new StreamingContext(conf, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)

This is the ReceiverInputDStream: a lines DStream with one batch per time interval

Contd...
Example (Spark Streaming)

val master = "local"
val conf = new SparkConf().setMaster(master)
val ssc = new StreamingContext(conf, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" ")).map((_, 1))

map creates a new DStream (a sequence of RDDs) of (word, 1) pairs

Contd...
Example (Spark Streaming)

val master = "local"
val conf = new SparkConf().setMaster(master)
val ssc = new StreamingContext(conf, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" ")).map((_, 1))
val wordCounts = words.reduceByKey(_ + _)

reduceByKey groups the DStream by word

Contd...
Example (Spark Streaming)

val master = "local"
val conf = new SparkConf().setMaster(master)
val ssc = new StreamingContext(conf, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" ")).map((_, 1))
val wordCounts = words.reduceByKey(_ + _)

ssc.start()

ssc.start() starts the streaming computation

Contd...
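A DStream is just a sequence of RDDs, one per batch interval, and the same flatMap/map/reduceByKey pipeline runs on each batch independently. That per-batch behaviour can be sketched with plain Scala collections (MicroBatchSketch is a made-up name; this is not the Spark API and needs no cluster):

```scala
// Each inner Seq plays the role of one RDD in the DStream;
// the word-count transformations run independently per batch.
object MicroBatchSketch {
  def perBatchWordCounts(batches: Seq[Seq[String]]): Seq[Map[String, Int]] =
    batches.map { batch =>
      batch
        .flatMap(_.split(" "))
        .map((_, 1))
        .groupBy(_._1)
        .map { case (word, pairs) => (word, pairs.map(_._2).sum) }
    }

  def main(args: Array[String]): Unit =
    perBatchWordCounts(Seq(Seq("spark streaming"), Seq("spark spark")))
      .foreach(println)
}
```

In the real program, ssc.start() kicks off this per-batch loop against the socket source; here each batch is simply mapped eagerly.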
Agenda

● What is Spark?
● Why do we need Spark?
● Brief introduction to RDD
● Brief introduction to Spark Streaming
● How to install Spark?
● Demo
How to Install Spark?

Download Spark from -
http://spark.apache.org/downloads.html

Extract it to a suitable directory.

If you downloaded the source package, go to the directory via terminal & run the following command to build it -

mvn -DskipTests clean package

Now Spark is ready to run in interactive mode -

./bin/spark-shell
sbt Setup
name := "Spark Demo"
version := "1.0"
scalaVersion := "2.10.5"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.2.1",
  "org.apache.spark" %% "spark-streaming" % "1.2.1",
  "org.apache.spark" %% "spark-sql" % "1.2.1",
  "org.apache.spark" %% "spark-mllib" % "1.2.1"
)
Agenda

● What is Spark?
● Why do we need Spark?
● Brief introduction to RDD
● Brief introduction to Spark Streaming
● How to install Spark?
● Demo
Demo
Download Code
https://github.com/knoldus/spark-scala
References
http://spark.apache.org/
http://spark-summit.org/2014
http://spark.apache.org/docs/latest/quick-start.html
http://stackoverflow.com/questions/tagged/apache-spark
https://www.youtube.com/results?search_query=apache+spark
http://apache-spark-user-list.1001560.n3.nabble.com/
http://www.slideshare.net/paulszulc/apache-spark-101-in-50-min
Presenter:
[email protected]
@himanshug735

Organizer:
@Knolspeak
http://www.knoldus.com
http://blog.knoldus.com
Thanks