1
Introduction to Spark
Gwen Shapira, Solutions Architect
2
Spark is next-generation MapReduce
3
MapReduce has been around for a while.
It made distributed compute easier.
But, can we do better?
4
MapReduce Issues
• Launching mappers and reducers takes time
• One MR job can rarely do a full computation
• Writing to disk (in triplicate!) between each job
• Going back to the queue between jobs
• No in-memory caching
• No iterations
• Very high latency
• Not the greatest APIs either
5
Spark: Easy to Develop, Fast to Run
6
Spark Features
• In-memory cache
• General execution graphs
• APIs in Scala, Java and Python
• Integrates with, but does not depend on, Hadoop
7
Why is it better?
• (Much) faster than MR
• Iterative programming – must-have for ML
• Interactive – allows rapid exploratory analytics
• Flexible execution graph:
  • Map, map, reduce, reduce, reduce, map
• High productivity compared to MapReduce
8
Word Count
val file = spark.textFile("hdfs://...")
file.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
Remember MapReduce WordCount?
9
Agenda
• Concepts
• Examples
• Streaming
• Summary
10
Concepts
11
AMPLab BDAS (Berkeley Data Analytics Stack)
12
CDH5 (simplified)
[Stack diagram: HDFS plus in-memory cache at the bottom, YARN for resource management, Spark / MR / Impala as execution engines, with Spark Streaming and MLlib on top of Spark]
13
How Spark runs on a Cluster
[Cluster diagram: the Driver sends Tasks to several Workers, each holding Data in RAM, and the Workers send Results back to the Driver]
14
Workflow
• SparkContext in the driver connects to the Master
• Master allocates resources for the app on the cluster
• SC acquires executors on worker nodes
• SC sends the app code (JAR) to the executors
• SC sends tasks to the executors
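A rough sketch of the driver side of this workflow (the app name and master URL below are placeholders, not from the original deck):

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical driver program; app name and master URL are placeholders.
val conf = new SparkConf()
  .setAppName("IntroToSpark")
  .setMaster("spark://master:7077")   // SparkContext connects to the Master here

val sc = new SparkContext(conf)       // acquires executors, ships the app JAR to them
// ...define RDDs and call actions; the SparkContext sends tasks to the executors...
sc.stop()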
15
RDD – Resilient Distributed Dataset
• Collection of elements
• Read-only
• Partitioned
• Fault-tolerant
• Supports parallel operations
16
RDD Types
• Parallelized collection
  • parallelize(Seq)
• HDFS files
  • Text, Sequence or any InputFormat
• Both support the same operations
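A minimal sketch of both RDD types, assuming sc is the SparkContext from the driver sketch earlier (the sequence and file path are just placeholders):

// Parallelized collection
val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))

// HDFS text file (the path is a placeholder)
val lines = sc.textFile("hdfs://namenode/data/log.txt")

// Both support the same operations
nums.map(_ * 2).count()
lines.filter(_.nonEmpty).count()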
17
Operations
Transformations
• Map
• Filter
• Sample
• Join
• ReduceByKey
• GroupByKey
• Distinct

Actions
• Reduce
• Collect
• Count
• First, Take
• SaveAs
• CountByKey
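A small sketch of how transformations and actions combine (assuming sc is a SparkContext; the file path and field layout are assumptions):

val lines  = sc.textFile("hdfs://namenode/data/access.log")
val errors = lines.filter(_.contains("ERROR"))      // transformation
val counts = errors.map(l => (l.split(" ")(0), 1))  // transformation
                   .reduceByKey(_ + _)              // transformation
val result = counts.collect()                       // action: triggers the job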
18
Transformations are lazy
19
Lazy transformation
• Find all lines that mention "MySQL"
• Keep only the timestamp portion of each line
• Set the date and hour as key, 1 as value
• Now reduce by key and sum the values
• Return the result as an Array so it can be printed

(The driver just records the steps: "Find lines, get timestamp…" – only when a result is requested does Spark say "Aha! Finally something to do!")
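In code, the pipeline described above might look roughly like this (file path and log layout are assumptions); nothing runs until the final collect:

val lines  = sc.textFile("hdfs://namenode/logs/db.log")
val mysql  = lines.filter(_.contains("MySQL"))   // lazy: find lines mentioning MySQL
val stamps = mysql.map(_.split(" ")(0))          // lazy: keep only the timestamp field
val keyed  = stamps.map(ts => (ts.take(13), 1))  // lazy: date + hour as key, 1 as value
val summed = keyed.reduceByKey(_ + _)            // lazy: sum the values per key
val result = summed.collect()                    // action: only now does any work happen
result.foreach(println)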
20
Persistence / Caching
• Store an RDD in memory for later use
• Each node persists a partition
• persist() marks an RDD for caching
• It will be cached the first time an action is performed
• Use for iterative algorithms
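A minimal caching sketch (the file path is a placeholder):

val points = sc.textFile("hdfs://namenode/data/points.txt").cache()  // mark for caching
points.count()  // first action: reads from HDFS and caches the partitions
points.count()  // later actions read from the in-memory cache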
21
Caching – Storage Levels
• MEMORY_ONLY
• MEMORY_AND_DISK
• MEMORY_ONLY_SER
• MEMORY_AND_DISK_SER
• DISK_ONLY
• MEMORY_ONLY_2, MEMORY_AND_DISK_2…
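cache() is shorthand for persist(MEMORY_ONLY); the other levels are passed to persist() explicitly, as in this sketch (the RDD itself is illustrative):

import org.apache.spark.storage.StorageLevel

val sessions = sc.textFile("hdfs://namenode/data/sessions")
sessions.persist(StorageLevel.MEMORY_AND_DISK)  // spill to disk if it does not fit in RAM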
22
Fault Tolerance
• Lost partitions can be re-computed from source data
  • Because we remember all transformations
msgs = textFile.filter(lambda s: s.startswith("ERROR")) \
               .map(lambda s: s.split("\t")[2])
[Lineage diagram: HDFS File → filter(func = startswith(...)) → Filtered RDD → map(func = split(...)) → Mapped RDD]
23
Examples
24
Word Count
val file = spark.textFile("hdfs://...")
file.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
Remember MapReduce WordCount?
25
Log Mining
• Load error messages from a log into memory
• Interactively search for patterns
26
Log Mining
val lines = spark.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split('\t')(2))

val cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count
cachedMsgs.filter(_.contains("bar")).count
...

(lines is the base RDD, errors and messages are transformed RDDs, and count is the action)
27
Logistic Regression
• Read two sets of points
• Look for a plane w that separates them
• Perform gradient descent:
  • Start with a random w
  • On each iteration, sum a function of w over the data
  • Move w in a direction that improves it
28
Intuition
29
Logistic Regression
val points = spark.textFile(...).map(parsePoint).cache()

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = points.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final separating plane: " + w)
30
Conviva Use-Case
• Monitor online video consumption
• Analyze trends
Need to run tens of queries like this a day:
SELECT videoName, COUNT(1)
FROM summaries
WHERE date='2011_12_12' AND customer='XYZ'
GROUP BY videoName;
31
Conviva With Spark
val sessions = sparkContext.sequenceFile[SessionSummary,NullWritable](pathToSessionSummaryOnHdfs)
val cachedSessions = sessions.filter(whereConditionToFilterSessions).cache
val mapFn : SessionSummary => (String, Long) = { s => (s.videoName, 1) }
val reduceFn : (Long, Long) => Long = { (a, b) => a + b }
val results = cachedSessions.map(mapFn).reduceByKey(reduceFn).collectAsMap
32
Streaming
33
What is it?
• Extension of the Spark API
• For high-throughput, fault-tolerant processing of live data streams
34
Sources & Outputs
Sources:
• Kafka
• Flume
• Twitter
• JMS queues
• TCP sockets

Outputs:
• HDFS
• Databases
• Dashboards
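A sketch wiring one of the sources above (a TCP socket) to one of the outputs (files on HDFS); host, port and output path are placeholders, and sc is the SparkContext from earlier:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)     // source: TCP socket
lines.saveAsTextFiles("hdfs://namenode/streams/lines")  // output: files on HDFS
ssc.start()
ssc.awaitTermination()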
35
Architecture
[Diagram: live input flows into the Streaming Context, which hands micro-batches to the Spark Context for processing]
36
DStreams
• The stream is broken down into micro-batches
• Each micro-batch is an RDD
• This means any Spark function or library can be applied to a stream
  • Including MLlib, graph processing, etc.
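Because each micro-batch is an RDD, ordinary Spark code can be applied inside the stream, for example (assuming the lines DStream from the previous sketch):

lines.foreachRDD { rdd =>
  // rdd is a plain RDD, so any Spark operation (or MLlib, etc.) can be used here
  val errorCount = rdd.filter(_.contains("ERROR")).count()
  println("errors in this batch: " + errorCount)
}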
37
Processing DStreams
38
Processing DStreams - Stateless
39
Processing DStreams - Stateful
40
DStream Operators

• Transformation – produce a DStream from one or more parent streams
  • Stateless (independent per interval): map, reduce
  • Stateful (share data across intervals): window, incremental aggregation, time-skewed join
• Output – write data to an external system (e.g. save an RDD to HDFS): save, foreach
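A sketch of a stateful (windowed) operator, assuming a DStream of (word, 1) pairs named wordPairs like the one in the streaming word count later in the deck:

import org.apache.spark.streaming.Seconds

val windowedCounts = wordPairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,  // combine counts as they enter the window
  Seconds(30),                // window length
  Seconds(10))                // slide interval
windowedCounts.print()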
41
Fault Recovery
• Input from TCP, Flume or Kafka is stored on 2 nodes
• In case of failure, missing RDDs will be re-computed from the surviving nodes
• RDDs are deterministic, so any re-computation will lead to the same result
• Transformations can therefore guarantee exactly-once semantics, even through failures
42
Key Question
How fast can the system recover?
43
Example – Streaming WordCount
import org.apache.spark.streaming.{Seconds, StreamingContext}
import StreamingContext._
...

// Create the context and set up a network input stream
val ssc = new StreamingContext(args(0), "NetworkWordCount", Seconds(1))
val lines = ssc.socketTextStream(args(1), args(2).toInt)

// Split the lines into words, count them,
// and print some of the counts on the master
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()

// Start the computation
ssc.start()
44
Shark
45
Shark Architecture
• Identical to Hive
  • Same CLI, JDBC, SQL parser, Metastore
• Replaced the optimizer, plan generator and execution engine
• Added a Cache Manager
• Generates Spark code instead of MapReduce
46
Hive Compatibility
• MetaStore
• HQL
• UDF / UDAF
• SerDes
• Scripts
47
Dynamic Query Plans
• Hive metadata often lacks statistics
• Join types often require hinting
• Shark gathers statistics per partition
  • While materializing map output
  • Partition sizes, record count, skew, histograms
• Alters the plan accordingly
48
Columnar Memory Store
• Better compression
• CPU efficiency
• Cache locality
49
Spark + Shark Integration
val users = sql2rdd("SELECT * FROM user u JOIN comment c ON c.uid=u.uid")
val features = users.mapRows { row =>
  new Vector(extractFeature1(row.getInt("age")),
             extractFeature2(row.getStr("country")), ...)
}

val trainedVector = logRegress(features.cache())
50
Summary
51
Why Spark?
• Flexible
• High performance
• Machine learning, iterative algorithms
• Interactive data exploration
• Developer productivity
52
Why not Spark?
• Still immature
• Uses *lots* of memory
• Equivalent functionality exists in Impala, Storm, etc.
53
How Spark Works
• RDDs – resilient distributed datasets
• Lazy transformations
• Fault-tolerant caching
• Streams – micro-batches of RDDs
54