1
Introduction to Spark
Gwen Shapira, Solutions Architect
2
Spark is next-generation MapReduce
3
MapReduce has been around for a while.
It made distributed compute easier.
But, can we do better?
4
MapReduce Issues
• Launching mappers and reducers takes time
• One MR job can rarely do a full computation
• Writing to disk (in triplicate!) between each job
• Going back to the queue between jobs
• No in-memory caching
• No iterations
• Very high latency
• Not the greatest APIs either
5
Spark: Easy to Develop, Fast to Run
6
Spark Features
• In-memory cache
• General execution graphs
• APIs in Scala, Java and Python
• Integrates with, but does not depend on, Hadoop
7
Why is it better?
• (Much) faster than MR
• Iterative programming – must-have for ML
• Interactive – allows rapid exploratory analytics
• Flexible execution graph:
  • Map, map, reduce, reduce, reduce, map
• High productivity compared to MapReduce
8
Word Count
val file = spark.textFile("hdfs://...")
file.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
Remember MapReduce WordCount?
9
Agenda
• Concepts
• Examples
• Streaming
• Summary
10
Concepts
11
AMPLab BDAS (Berkeley Data Analytics Stack)
12
CDH5 (simplified)
[Stack diagram: HDFS plus in-memory cache at the bottom, YARN for resource management, Spark / MR / Impala as execution engines, with Spark Streaming and MLlib on top of Spark]
13
How Spark runs on a Cluster
[Cluster diagram: the Driver sends Tasks to several Workers, each holding Data in RAM, and the Workers send Results back to the Driver]
14
Workflow
• SparkContext in the driver connects to the Master
• Master allocates resources for the app on the cluster
• SC acquires executors on worker nodes
• SC sends the app code (JAR) to the executors
• SC sends tasks to the executors
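A rough sketch of the driver side of this workflow (the app name and master URL below are placeholders, not from the original deck):

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical driver program; app name and master URL are placeholders.
val conf = new SparkConf()
  .setAppName("IntroToSpark")
  .setMaster("spark://master:7077")   // SparkContext connects to the Master here

val sc = new SparkContext(conf)       // acquires executors, ships the app JAR to them
// ...define RDDs and call actions; the SparkContext sends tasks to the executors...
sc.stop()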
15
RDD – Resilient Distributed Dataset
• Collection of elements
• Read-only
• Partitioned
• Fault-tolerant
• Supports parallel operations
16
RDD Types
• Parallelized collection
  • parallelize(Seq)
• HDFS files
  • Text, Sequence or any InputFormat
• Both support the same operations
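A minimal sketch of both RDD types, assuming sc is the SparkContext from the driver sketch earlier (the sequence and file path are just placeholders):

// Parallelized collection
val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))

// HDFS text file (the path is a placeholder)
val lines = sc.textFile("hdfs://namenode/data/log.txt")

// Both support the same operations
nums.map(_ * 2).count()
lines.filter(_.nonEmpty).count()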
17
Operations
Transformations
• Map
• Filter
• Sample
• Join
• ReduceByKey
• GroupByKey
• Distinct

Actions
• Reduce
• Collect
• Count
• First, Take
• SaveAs
• CountByKey
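A small sketch of how transformations and actions combine (assuming sc is a SparkContext; the file path and field layout are assumptions):

val lines  = sc.textFile("hdfs://namenode/data/access.log")
val errors = lines.filter(_.contains("ERROR"))      // transformation
val counts = errors.map(l => (l.split(" ")(0), 1))  // transformation
                   .reduceByKey(_ + _)              // transformation
val result = counts.collect()                       // action: triggers the job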
18
Transformations are lazy
19
Lazy transformation
• Find all lines that mention "MySQL"
• Keep only the timestamp portion of each line
• Set the date and hour as key, 1 as value
• Now reduce by key and sum the values
• Return the result as an Array so it can be printed

(The driver just records the steps: "Find lines, get timestamp…" – only when a result is requested does Spark say "Aha! Finally something to do!")
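In code, the pipeline described above might look roughly like this (file path and log layout are assumptions); nothing runs until the final collect:

val lines  = sc.textFile("hdfs://namenode/logs/db.log")
val mysql  = lines.filter(_.contains("MySQL"))   // lazy: find lines mentioning MySQL
val stamps = mysql.map(_.split(" ")(0))          // lazy: keep only the timestamp field
val keyed  = stamps.map(ts => (ts.take(13), 1))  // lazy: date + hour as key, 1 as value
val summed = keyed.reduceByKey(_ + _)            // lazy: sum the values per key
val result = summed.collect()                    // action: only now does any work happen
result.foreach(println)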
20
Persistence / Caching
• Store an RDD in memory for later use
• Each node persists a partition
• persist() marks an RDD for caching
• It will be cached the first time an action is performed
• Use for iterative algorithms
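A minimal caching sketch (the file path is a placeholder):

val points = sc.textFile("hdfs://namenode/data/points.txt").cache()  // mark for caching
points.count()  // first action: reads from HDFS and caches the partitions
points.count()  // later actions read from the in-memory cache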
21
Caching – Storage Levels
• MEMORY_ONLY
• MEMORY_AND_DISK
• MEMORY_ONLY_SER
• MEMORY_AND_DISK_SER
• DISK_ONLY
• MEMORY_ONLY_2, MEMORY_AND_DISK_2…
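cache() is shorthand for persist(MEMORY_ONLY); the other levels are passed to persist() explicitly, as in this sketch (the RDD itself is illustrative):

import org.apache.spark.storage.StorageLevel

val sessions = sc.textFile("hdfs://namenode/data/sessions")
sessions.persist(StorageLevel.MEMORY_AND_DISK)  // spill to disk if it does not fit in RAM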
22
Fault Tolerance
• Lost partitions can be re-computed from source data
  • Because we remember all transformations
msgs = textFile.filter(lambda s: s.startswith("ERROR")) \
               .map(lambda s: s.split("\t")[2])
[Lineage diagram: HDFS File → filter(func = startswith(...)) → Filtered RDD → map(func = split(...)) → Mapped RDD]
23
Examples
24
Word Count
val file = spark.textFile("hdfs://...")
file.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
Remember MapReduce WordCount?
25
Log Mining
• Load error messages from a log into memory
• Interactively search for patterns
26
Log Mining
val lines = spark.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split('\t')(2))

val cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count
cachedMsgs.filter(_.contains("bar")).count
...

(lines is the base RDD, errors and messages are transformed RDDs, and count is the action)
27
Logistic Regression
• Read two sets of points
• Look for a plane w that separates them
• Perform gradient descent:
  • Start with a random w
  • On each iteration, sum a function of w over the data
  • Move w in a direction that improves it
28
Intuition
29
Logistic Regression
val points = spark.textFile(...).map(parsePoint).cache()

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = points.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final separating plane: " + w)
30
Conviva Use-Case
• Monitor online video consumption
• Analyze trends
Need to run tens of queries like this a day:
SELECT videoName, COUNT(1)
FROM summaries
WHERE date='2011_12_12' AND customer='XYZ'
GROUP BY videoName;
31
Conviva With Spark
val sessions = sparkContext.sequenceFile[SessionSummary,NullWritable](pathToSessionSummaryOnHdfs)
val cachedSessions = sessions.filter(whereConditionToFilterSessions).cache
val mapFn : SessionSummary => (String, Long) = { s => (s.videoName, 1) }
val reduceFn : (Long, Long) => Long = { (a, b) => a + b }
val results = cachedSessions.map(mapFn).reduceByKey(reduceFn).collectAsMap
32
Streaming
33
What is it?
• Extension of the Spark API
• For high-throughput, fault-tolerant processing of live data streams
34
Sources & Outputs
Sources:
• Kafka
• Flume
• Twitter
• JMS queues
• TCP sockets

Outputs:
• HDFS
• Databases
• Dashboards
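A sketch wiring one of the sources above (a TCP socket) to one of the outputs (files on HDFS); host, port and output path are placeholders, and sc is the SparkContext from earlier:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)     // source: TCP socket
lines.saveAsTextFiles("hdfs://namenode/streams/lines")  // output: files on HDFS
ssc.start()
ssc.awaitTermination()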
35
Architecture
[Diagram: live input flows into the Streaming Context, which hands micro-batches to the Spark Context for processing]
36
DStreams
• The stream is broken down into micro-batches
• Each micro-batch is an RDD
• This means any Spark function or library can be applied to a stream
  • Including MLlib, graph processing, etc.
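Because each micro-batch is an RDD, ordinary Spark code can be applied inside the stream, for example (assuming the lines DStream from the previous sketch):

lines.foreachRDD { rdd =>
  // rdd is a plain RDD, so any Spark operation (or MLlib, etc.) can be used here
  val errorCount = rdd.filter(_.contains("ERROR")).count()
  println("errors in this batch: " + errorCount)
}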
37
Processing DStreams
38
Processing DStreams - Stateless
39
Processing DStreams - Stateful
40
DStream Operators

• Transformation – produce a DStream from one or more parent streams
  • Stateless (independent per interval): map, reduce
  • Stateful (share data across intervals): window, incremental aggregation, time-skewed join
• Output – write data to an external system (e.g. save an RDD to HDFS): save, foreach
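A sketch of a stateful (windowed) operator, assuming a DStream of (word, 1) pairs named wordPairs like the one in the streaming word count later in the deck:

import org.apache.spark.streaming.Seconds

val windowedCounts = wordPairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,  // combine counts as they enter the window
  Seconds(30),                // window length
  Seconds(10))                // slide interval
windowedCounts.print()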
41
Fault Recovery
• Input from TCP, Flume or Kafka is stored on 2 nodes
• In case of failure, missing RDDs will be re-computed from the surviving nodes
• RDDs are deterministic, so any re-computation will lead to the same result
• Transformations can therefore guarantee exactly-once semantics, even through failures
42
Key Question
How fast can the system recover?
43
Example – Streaming WordCount
import org.apache.spark.streaming.{Seconds, StreamingContext}
import StreamingContext._
...

// Create the context and set up a network input stream
val ssc = new StreamingContext(args(0), "NetworkWordCount", Seconds(1))
val lines = ssc.socketTextStream(args(1), args(2).toInt)

// Split the lines into words, count them,
// and print some of the counts on the master
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()

// Start the computation
ssc.start()
44
Shark
45
Shark Architecture
• Identical to Hive
  • Same CLI, JDBC, SQL parser, Metastore
• Replaced the optimizer, plan generator and execution engine
• Added a Cache Manager
• Generates Spark code instead of MapReduce
46
Hive Compatibility
• MetaStore
• HQL
• UDF / UDAF
• SerDes
• Scripts
47
Dynamic Query Plans
• Hive metadata often lacks statistics
• Join types often require hinting
• Shark gathers statistics per partition
  • While materializing map output
  • Partition sizes, record count, skew, histograms
• Alters the plan accordingly
48
Columnar Memory Store
• Better compression
• CPU efficiency
• Cache locality
49
Spark + Shark Integration
val users = sql2rdd("SELECT * FROM user u JOIN comment c ON c.uid=u.uid")
val features = users.mapRows { row =>
  new Vector(extractFeature1(row.getInt("age")),
             extractFeature2(row.getStr("country")), ...)
}

val trainedVector = logRegress(features.cache())
50
Summary
51
Why Spark?
• Flexible
• High performance
• Machine learning, iterative algorithms
• Interactive data exploration
• Developer productivity
52
Why not Spark?
• Still immature
• Uses *lots* of memory
• Equivalent functionality exists in Impala, Storm, etc.
53
How Spark Works
• RDDs – resilient distributed datasets
• Lazy transformations
• Fault-tolerant caching
• Streams – micro-batches of RDDs
54