
Real Time Data Processing Using Spark Streaming


Page 1: Real Time Data Processing Using Spark Streaming

© Cloudera, Inc. All rights reserved.

Hari Shreedharan, Software Engineer @ Cloudera

Committer/PMC Member, Apache Flume

Committer, Apache Sqoop

Contributor, Apache Spark

Author, Using Flume (O’Reilly)

Real Time Data Processing using Spark Streaming

Page 2: Real Time Data Processing Using Spark Streaming


Motivation for Real-Time Stream Processing

Data is being created at unprecedented rates

• Exponential data growth from mobile, web, social
• Connected devices: 9B in 2012 to 50B by 2020
• Over 1 trillion sensors by 2020
• Datacenter IP traffic growing at a CAGR of 25%

How can we harness this data in real time?
• Value can quickly degrade → capture value immediately
• From reactive analysis to direct operational impact
• Unlocks new competitive advantages
• Requires a completely new approach...

Page 3: Real Time Data Processing Using Spark Streaming


Use Cases Across Industries

Credit: Identify fraudulent transactions as soon as they occur.

Transportation: Dynamic re-routing of traffic or vehicle fleets.

Retail:
• Dynamic inventory management
• Real-time in-store offers and recommendations

Consumer Internet & Mobile: Optimize user engagement based on the user's current behavior.

Healthcare: Continuously monitor patient vital stats and proactively identify at-risk patients.

Manufacturing:
• Identify equipment failures and react instantly
• Perform proactive maintenance

Surveillance: Identify threats and intrusions in real time.

Digital Advertising & Marketing: Optimize and personalize content based on real-time information.

Page 4: Real Time Data Processing Using Spark Streaming


From Volume and Variety to Velocity

Big Data has evolved, and the Hadoop ecosystem evolves with it…

Past:
• Big Data = Volume + Variety
• Batch processing
• Time to insight of hours

Present:
• Big Data = Volume + Variety + Velocity
• Batch + stream processing
• Time to insight of seconds

Page 5: Real Time Data Processing Using Spark Streaming


Key Components of Streaming Architectures

• Data Ingestion & Transportation Service (e.g., Kafka, Flume)
• Real-Time Stream Processing Engine
• Real-Time Data Serving
• Data Management & Integration
• System Management
• Security

Page 6: Real Time Data Processing Using Spark Streaming


Canonical Stream Processing Architecture

[Diagram: data sources feed an ingest tier (Kafka, Flume) into Kafka; multiple applications (App 1, App 2, …) consume the stream and write to HDFS and HBase.]

Page 7: Real Time Data Processing Using Spark Streaming


What is Spark?

Spark is a general-purpose computational framework that offers more flexibility than MapReduce.

Key properties:
• Leverages distributed memory
• Full directed-graph expressions for data-parallel computations
• Improved developer experience

Yet it retains linear scalability, fault tolerance, and data-locality-based computation.

Page 8: Real Time Data Processing Using Spark Streaming


Spark: Easy and Fast Big Data

• Easy to develop
  • Rich APIs in Java, Scala, Python
  • Interactive shell

• Fast to run
  • General execution graphs
  • In-memory storage

2-5× less code; up to 10× faster on disk, 100× in memory.

Page 9: Real Time Data Processing Using Spark Streaming


Easy: High productivity language support

Python:
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()

Scala:
val lines = sc.textFile(...)
lines.filter(s => s.contains("ERROR")).count()

Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) { return s.contains("ERROR"); }
}).count();

• Native support for multiple languages with identical APIs
• Use of closures, iterations, and other common language constructs to minimize code

Page 10: Real Time Data Processing Using Spark Streaming


Easy: Use Interactively

• Interactive exploration of data for data scientists – no need to develop "applications"
• Developers can prototype applications on a live system as they build them

Page 11: Real Time Data Processing Using Spark Streaming


Easy: Expressive API

• map

• filter

• groupBy

• sort

• union

• join

• leftOuterJoin

• rightOuterJoin

• reduce

• count

• fold

• reduceByKey

• groupByKey

• cogroup

• cross

• zip

• sample

• take

• first

• partitionBy

• mapWith

• pipe

• save
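Many of these operators behave like ordinary collection operations. As a plain-Python sketch (a simulation of the semantics, not Spark itself), filter, count, and take on a small log dataset look like this; the dataset and names are illustrative only:

```python
# Plain-Python sketch (not Spark): filter, count, and take
# expressed as ordinary collection operations.
data = ["INFO start", "ERROR disk full", "INFO ok", "ERROR timeout"]

errors = [s for s in data if "ERROR" in s]   # filter: keep ERROR lines
total = len(errors)                          # count: number of matches
first = errors[:1]                           # take(1): first matching record

print(total)   # 2
print(first)   # ['ERROR disk full']
```

In Spark the same chain runs in parallel across partitions, but the per-record logic is identical.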

Page 12: Real Time Data Processing Using Spark Streaming


Easy: Example – Word Count (M/R)

public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Page 13: Real Time Data Processing Using Spark Streaming


Easy: Example – Word Count (Spark)

val spark = new SparkContext(master, appName, [sparkHome], [jars])
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
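The pipeline above can be traced step by step in plain Python (a simulation of the semantics, not PySpark): flatMap splits lines into words, map pairs each word with 1, and reduceByKey sums per key. The input lines are made up for illustration:

```python
from collections import defaultdict

lines = ["to be or not to be", "to do"]

# flatMap: split each line into words
words = [w for line in lines for w in line.split(" ")]

# map: pair each word with a count of 1
pairs = [(w, 1) for w in words]

# reduceByKey(_ + _): sum the counts per word
counts = defaultdict(int)
for w, n in pairs:
    counts[w] += n

print(dict(counts))  # {'to': 3, 'be': 2, 'or': 1, 'not': 1, 'do': 1}
```

Spark distributes each of these steps across partitions (with a shuffle for reduceByKey), but the logic per record is exactly this.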

Page 14: Real Time Data Processing Using Spark Streaming


Easy: Out of the Box Functionality

• Hadoop integration
  • Works with Hadoop data
  • Runs with YARN

• Libraries
  • MLlib
  • Spark Streaming
  • GraphX (alpha)

• Language support
  • Improved Python support
  • SparkR
  • Java 8
  • Schema support in Spark’s APIs

Page 15: Real Time Data Processing Using Spark Streaming


Spark Architecture

[Diagram: a Driver coordinates multiple Workers, each holding data partitions in RAM.]

Page 16: Real Time Data Processing Using Spark Streaming


RDDs

RDD = Resilient Distributed Dataset
• Immutable representation of data
• Operations on one RDD create a new one
• Memory caching layer that stores data in a distributed, fault-tolerant cache
• Created by parallel transformations on data in stable storage
• Lazy materialization

Two observations:
a. Can fall back to disk when the data set does not fit in memory
b. Provides fault tolerance through the concept of lineage
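The lineage idea can be sketched in a few lines of plain Python (a toy model, not Spark's actual implementation): each dataset remembers its parent and the transformation that produced it, so any result can be recomputed from stable storage instead of being replicated. The `ToyRDD` class and its methods are invented for illustration:

```python
# Toy model of RDD lineage (not Spark internals): each ToyRDD holds
# either base data or a parent plus the function that derives it.
class ToyRDD:
    def __init__(self, data=None, parent=None, fn=None):
        self.data, self.parent, self.fn = data, parent, fn

    def map(self, f):
        return ToyRDD(parent=self, fn=lambda rows: [f(r) for r in rows])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda rows: [r for r in rows if pred(r)])

    def compute(self):
        # Lazy materialization: walk the lineage back to stable storage.
        if self.data is not None:
            return self.data
        return self.fn(self.parent.compute())

base = ToyRDD(data=[1, 2, 3, 4])
result = base.map(lambda x: x * 10).filter(lambda x: x > 15)
print(result.compute())  # [20, 30, 40]
```

Because `result` carries its full derivation, a lost partition can be rebuilt by re-running the lineage — which is how the real RDD abstraction gets fault tolerance without replication.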

Page 17: Real Time Data Processing Using Spark Streaming


Fast: Using RAM, Operator Graphs

• In-memory caching: data partitions are read from RAM instead of disk
• Operator graphs: scheduling optimizations and fault tolerance

[Diagram: an RDD operator graph (map, join, filter, groupBy, take) over RDDs A–F, with cached partitions highlighted.]

Page 18: Real Time Data Processing Using Spark Streaming


Spark Streaming

An extension of Apache Spark’s core API for stream processing.

The framework provides:
• Fault tolerance
• Scalability
• High throughput

Page 19: Real Time Data Processing Using Spark Streaming


Spark Streaming

• Incoming data represented as Discretized Streams (DStreams)

• Stream is broken down into micro-batches

• Each micro-batch is an RDD – can share code between batch and streaming
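The micro-batch idea can be simulated in plain Python (illustrative only, not the DStream API): the stream is chopped into small batches, and the same batch function runs unchanged on each one — which is exactly why batch and streaming code can be shared. The stream contents and batch size below are made up:

```python
# Plain-Python sketch of micro-batching: one function, applied
# identically to every small batch of the incoming stream.
def filter_errors(batch):
    # Works on any batch of records, whether from a file or a stream.
    return [s for s in batch if "ERROR" in s]

stream = ["ok", "ERROR a", "ok", "ERROR b", "ERROR c", "ok"]
batch_size = 2  # stand-in for the batch interval (e.g., 1-10 seconds)

micro_batches = [stream[i:i + batch_size]
                 for i in range(0, len(stream), batch_size)]
results = [filter_errors(b) for b in micro_batches]
print(results)  # [['ERROR a'], ['ERROR b'], ['ERROR c']]
```

In Spark Streaming each micro-batch is a full RDD, so `filter_errors` could equally be applied to a batch job's entire dataset.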

Page 20: Real Time Data Processing Using Spark Streaming


val tweets = ssc.twitterStream()

val hashTags = tweets.flatMap (status => getTags(status))

hashTags.saveAsHadoopFiles("hdfs://...")

“Micro-batch” Architecture

[Diagram: the tweets DStream is split into batches at t, t+1, t+2; flatMap produces the corresponding batches of the hashTags DStream, and each batch is saved.]

The stream is composed of small (1-10s) batch computations.

Page 21: Real Time Data Processing Using Spark Streaming


Use DStreams for Windowing Functions
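A windowed DStream can be sketched in plain Python (a simulation of the semantics, not the DStream `window` API): a window of N batches is simply the concatenation of the last N micro-batch RDDs, sliding forward one batch at a time. The batch contents and window length are illustrative:

```python
# Plain-Python sketch of a window over micro-batches: each window
# is the concatenation of the most recent `window_len` batches.
batches = [[1, 2], [3], [4, 5], [6]]   # one list per batch interval
window_len = 3                         # window covers 3 batch intervals

windows = [sum(batches[max(0, i + 1 - window_len): i + 1], [])
           for i in range(len(batches))]
print(windows[-1])  # [3, 4, 5, 6]
```

This is why windowed operations in Spark Streaming compose naturally with ordinary RDD code: a window is just a bigger RDD built from recent batches.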

Page 22: Real Time Data Processing Using Spark Streaming


Spark Streaming

• Runs as a Spark job

• YARN or standalone for scheduling

• YARN has KDC integration

• Use the same code for real-time Spark Streaming and for batch Spark jobs.

• Integrates natively with messaging systems such as Flume, Kafka, ZeroMQ, etc.

• Easy to write “Receivers” for custom messaging systems.

Page 23: Real Time Data Processing Using Spark Streaming


Sharing Code between Batch and Streaming

def filterErrors(rdd: RDD[String]): RDD[String] = {
  rdd.filter(s => s.contains("ERROR"))
}

A library function that filters “ERROR” records.

• Streaming generates RDDs periodically

• Any code that operates on RDDs can therefore be used in streaming as well

Page 24: Real Time Data Processing Using Spark Streaming


Sharing Code between Batch and Streaming

Spark:

val lines = sc.textFile(…)
val filtered = filterErrors(lines)
filtered.saveAsTextFile(...)

Spark Streaming:

val dStream = FlumeUtils.createStream(ssc, "34.23.46.22", 4435)
val filtered = dStream.transform((rdd: RDD[String], time: Time) => {
  filterErrors(rdd)
})
filtered.saveAsTextFiles(…)

Page 25: Real Time Data Processing Using Spark Streaming


Spark Streaming Use-Cases

• Real-time dashboards

• Show approximate results in real-time

• Reconcile periodically with source-of-truth using Spark

• Joins of multiple streams

• Time-based or count-based “windows”

• Combine multiple sources of input to produce composite data

• Re-use RDDs created by Streaming in other Spark jobs.
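Joining multiple streams can be sketched in plain Python (a toy model, not the DStream `join` API): within each micro-batch interval, records from the two sources are matched by key. The stream names and data below are invented for illustration:

```python
# Toy sketch of a per-batch join of two streams: records arriving in
# the same batch interval are matched by key.
clicks = [{"u1": "home"}, {"u2": "cart"}]   # one {user: page} dict per batch
geo    = [{"u1": "US"},   {"u2": "DE"}]     # one {user: country} dict per batch

joined = [{k: (b1[k], b2[k]) for k in b1.keys() & b2.keys()}
          for b1, b2 in zip(clicks, geo)]
print(joined)  # [{'u1': ('home', 'US')}, {'u2': ('cart', 'DE')}]
```

In Spark Streaming the same effect comes from joining the two DStreams' underlying pair RDDs batch by batch, optionally over a window to match records that arrive in nearby intervals.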

Page 26: Real Time Data Processing Using Spark Streaming


Hadoop in the Spark world

[Diagram: Spark and Spark Streaming running alongside the core Hadoop stack (YARN, HDFS, HBase, MapReduce2, Hive, Pig, Impala, Search), with GraphX, MLlib, and Shark shown as add-ons; a legend distinguishes supported Spark components from unsupported add-ons.]

Page 27: Real Time Data Processing Using Spark Streaming


Current project status

• 100+ contributors from 25+ contributing companies, including Databricks, Cloudera, Intel, Yahoo, etc.

• Dozens of production deployments

• Included in CDH!

Page 28: Real Time Data Processing Using Spark Streaming


More Info

• CDH Docs: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Installation-Guide/cdh5ig_spark_installation.html

• Cloudera Blog: http://blog.cloudera.com/blog/category/spark/

• Apache Spark homepage: http://spark.apache.org/

• Github: https://github.com/apache/spark

Page 29: Real Time Data Processing Using Spark Streaming


Thank you