Transcript
Page 1: Real Time Data Processing Using Spark Streaming


Hari Shreedharan, Software Engineer @ Cloudera

Committer/PMC Member, Apache Flume

Committer, Apache Sqoop

Contributor, Apache Spark

Author, Using Flume (O’Reilly)

Real Time Data Processing using Spark Streaming

Page 2: Real Time Data Processing Using Spark Streaming


Motivation for Real-Time Stream Processing

Data is being created at unprecedented rates

• Exponential data growth from mobile, web, social
• Connected devices: 9B in 2012 to 50B by 2020
• Over 1 trillion sensors by 2020
• Datacenter IP traffic growing at a CAGR of 25%

How can we harness this data in real time?
• Value can quickly degrade → capture value immediately
• Move from reactive analysis to direct operational impact
• Unlocks new competitive advantages
• Requires a completely new approach...

Page 3: Real Time Data Processing Using Spark Streaming


Use Cases Across Industries

Credit: Identify fraudulent transactions as soon as they occur.

Transportation: Dynamic re-routing of traffic or vehicle fleets.

Retail:
• Dynamic inventory management
• Real-time in-store offers and recommendations

Consumer Internet & Mobile: Optimize user engagement based on the user's current behavior.

Healthcare: Continuously monitor patient vital stats and proactively identify at-risk patients.

Manufacturing:
• Identify equipment failures and react instantly
• Perform proactive maintenance

Surveillance: Identify threats and intrusions in real time.

Digital Advertising & Marketing: Optimize and personalize content based on real-time information.

Page 4: Real Time Data Processing Using Spark Streaming


From Volume and Variety to Velocity

Big Data has evolved, and the Hadoop ecosystem evolves with it.

Past: Big Data = Volume + Variety
• Batch processing
• Time to insight of hours

Present: Big Data = Volume + Variety + Velocity
• Batch + stream processing
• Time to insight of seconds

Page 5: Real Time Data Processing Using Spark Streaming


Key Components of Streaming Architectures

• Data Ingestion & Transportation Service (e.g., Kafka, Flume)
• Real-Time Stream Processing Engine
• Real-Time Data Serving
• Data Management & Integration
• System Management
• Security

Page 6: Real Time Data Processing Using Spark Streaming


Canonical Stream Processing Architecture

[Diagram: data sources feed Flume/Kafka for ingest; Kafka buffers the streams for multiple consuming applications (App 1, App 2, ...), which write results to HDFS and HBase.]

Page 7: Real Time Data Processing Using Spark Streaming


What is Spark?

Spark is a general-purpose computational framework offering more flexibility than MapReduce.

Key properties:
• Leverages distributed memory
• Full directed-graph expressions for data-parallel computations
• Improved developer experience

Yet it retains linear scalability, fault tolerance, and data-locality-based computation.

Page 8: Real Time Data Processing Using Spark Streaming


Spark: Easy and Fast Big Data

Easy to develop
• Rich APIs in Java, Scala, Python
• Interactive shell

Fast to run
• General execution graphs
• In-memory storage

2-5× less code; up to 10× faster on disk, 100× in memory.

Page 9: Real Time Data Processing Using Spark Streaming


Easy: High productivity language support

Python:
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()

Scala:
val lines = sc.textFile(...)
lines.filter(s => s.contains("ERROR")).count()

Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  public Boolean call(String s) { return s.contains("ERROR"); }
}).count();

• Native support for multiple languages with identical APIs
• Use of closures, iterations, and other common language constructs to minimize code

Page 10: Real Time Data Processing Using Spark Streaming


Easy: Use Interactively

• Interactive exploration of data for data scientists – no need to develop "applications"
• Developers can prototype applications on a live system as they build them
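As a minimal sketch of such a session (the input path is hypothetical), exploring log data in spark-shell might look like this:

$ spark-shell
scala> val logs = sc.textFile("hdfs:///data/app/logs")   // hypothetical path
scala> val errors = logs.filter(_.contains("ERROR"))
scala> errors.count()                                     // counts matching lines across the cluster
scala> errors.take(5).foreach(println)                    // inspect a few records interactively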

Page 11: Real Time Data Processing Using Spark Streaming


Easy: Expressive API

• map

• filter

• groupBy

• sort

• union

• join

• leftOuterJoin

• rightOuterJoin

• reduce

• count

• fold

• reduceByKey

• groupByKey

• cogroup

• cross

• zip

• sample

• take

• first

• partitionBy

• mapWith

• pipe

• save
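For instance, several of the operators above can be chained together. A small sketch with made-up data (names and values are purely illustrative):

// Hypothetical data: (userId, purchaseAmount) and (userId, displayName) pairs
val purchases = sc.parallelize(Seq(("alice", 10.0), ("bob", 20.0), ("alice", 5.0)))
val names     = sc.parallelize(Seq(("alice", "Alice A."), ("bob", "Bob B.")))

val totals = purchases.reduceByKey(_ + _)                 // total spend per user
val joined = names.join(totals)                           // (userId, (name, total))
joined.filter { case (_, (_, total)) => total > 10.0 }    // keep the bigger spenders
      .sortByKey()
      .take(10)
      .foreach(println)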

Page 12: Real Time Data Processing Using Spark Streaming


Easy: Example – Word Count (M/R)

public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Page 13: Real Time Data Processing Using Spark Streaming


Easy: Example – Word Count (Spark)

val spark = new SparkContext(master, appName, [sparkHome], [jars])
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

Page 14: Real Time Data Processing Using Spark Streaming


Easy: Out of the Box Functionality

• Hadoop integration
  • Works with Hadoop data
  • Runs with YARN

• Libraries
  • MLlib
  • Spark Streaming
  • GraphX (alpha)

• Language support
  • Improved Python support
  • SparkR
  • Java 8
  • Schema support in Spark's APIs

Page 15: Real Time Data Processing Using Spark Streaming


Spark Architecture

[Diagram: Spark architecture – a Driver coordinating multiple Workers, each caching Data in RAM.]

Page 16: Real Time Data Processing Using Spark Streaming


RDDs

RDD = Resilient Distributed Dataset
• Immutable representation of data
• Operations on one RDD create a new one
• Memory caching layer that stores data in a distributed, fault-tolerant cache
• Created by parallel transformations on data in stable storage
• Lazy materialization

Two observations:
a. Can fall back to disk when the data set does not fit in memory
b. Provides fault tolerance through the concept of lineage
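A small sketch of lazy materialization and lineage (the input path is hypothetical): transformations only record lineage, nothing runs until an action is called, and lost partitions can be recomputed from that lineage.

val lines  = sc.textFile("hdfs:///data/events")   // hypothetical input
val errors = lines.filter(_.contains("ERROR"))    // transformation: nothing computed yet
val codes  = errors.map(_.split(" ")(1))          // still lazy, lineage keeps growing

codes.count()                                     // action: materializes the whole chain
// If an executor is lost, the missing partitions of `codes` are rebuilt by
// re-running filter and map on the corresponding partitions of `lines`.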

Page 17: Real Time Data Processing Using Spark Streaming


Fast: Using RAM, Operator Graphs

• In-memory caching
  • Data partitions are read from RAM instead of disk

• Operator graphs
  • Scheduling optimizations
  • Fault tolerance

[Diagram: an RDD operator graph built from map, filter, groupBy, join, and take; cached partitions are served from RAM.]
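As a brief sketch of the caching behavior described above (input path hypothetical, `sc` as in the earlier examples): the first action computes and caches the partitions, and later actions reuse them from RAM.

val logs   = sc.textFile("hdfs:///data/logs")    // hypothetical input
val parsed = logs.filter(_.nonEmpty).cache()     // mark for in-memory caching

parsed.count()                                   // first action: reads from disk, populates the cache
parsed.filter(_.contains("WARN")).count()        // subsequent actions reuse the cached partitions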

Page 18: Real Time Data Processing Using Spark Streaming


Spark Streaming

An extension of Apache Spark's core API for stream processing.

The framework provides:
• Fault tolerance
• Scalability
• High throughput

Page 19: Real Time Data Processing Using Spark Streaming


Spark Streaming

• Incoming data represented as Discretized Streams (DStreams)

• Stream is broken down into micro-batches

• Each micro-batch is an RDD – can share code between batch and streaming
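A minimal sketch of the micro-batch model (host, port, and batch interval are illustrative): the StreamingContext cuts the input into an RDD every two seconds, and ordinary RDD-style operations are applied to each micro-batch.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("MicroBatchSketch")
val ssc  = new StreamingContext(conf, Seconds(2))      // 2-second micro-batches

val lines  = ssc.socketTextStream("localhost", 9999)   // each batch interval becomes an RDD
val errors = lines.filter(_.contains("ERROR"))
errors.count().print()                                 // per-batch counts

ssc.start()
ssc.awaitTermination()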

Page 20: Real Time Data Processing Using Spark Streaming


val tweets = ssc.twitterStream()

val hashTags = tweets.flatMap (status => getTags(status))

hashTags.saveAsHadoopFiles("hdfs://...")

[Diagram: the tweets DStream is cut into batches at t, t+1, t+2; flatMap turns each batch into the corresponding hashTags batch, which is then saved.]

"Micro-batch" architecture: the stream is composed of small (1-10 s) batch computations.

Page 21: Real Time Data Processing Using Spark Streaming


Use DStreams for Windowing Functions
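For example (window and slide intervals are illustrative), a sliding window can aggregate hashtag counts over the last minute, recomputed every ten seconds, using reduceByKeyAndWindow; `hashTags` is the DStream from the previous slide and `Seconds` comes from org.apache.spark.streaming.

val tagCounts = hashTags
  .map(tag => (tag, 1))
  .reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(10))   // 60 s window, sliding every 10 s

tagCounts.print()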

Page 22: Real Time Data Processing Using Spark Streaming


Spark Streaming

• Runs as a Spark job

• YARN or standalone for scheduling

• YARN has KDC integration

• Use the same code for real-time Spark Streaming and for batch Spark jobs.

• Integrates natively with messaging systems such as Flume, Kafka, ZeroMQ, etc.

• Easy to write “Receivers” for custom messaging systems.
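As a sketch of such a custom receiver (the socket source and class name are made up for illustration): a receiver implements onStart/onStop and calls store() to hand records to Spark Streaming.

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class SocketLineReceiver(host: String, port: Int)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Start a background thread that connects and pushes lines into Spark.
    new Thread("Socket Line Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = { /* socket is closed when receive() returns */ }

  private def receive(): Unit = {
    val socket = new java.net.Socket(host, port)
    val reader = new java.io.BufferedReader(
      new java.io.InputStreamReader(socket.getInputStream))
    var line = reader.readLine()
    while (!isStopped && line != null) {
      store(line)                       // hand the record to Spark Streaming
      line = reader.readLine()
    }
    reader.close()
    socket.close()
    restart("Trying to reconnect")      // ask Spark to restart the receiver
  }
}

// Usage: val customStream = ssc.receiverStream(new SocketLineReceiver("host", 9999))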

Page 23: Real Time Data Processing Using Spark Streaming


Sharing Code between Batch and Streaming

def filterErrors(rdd: RDD[String]): RDD[String] = {
  rdd.filter(s => s.contains("ERROR"))
}

A library function that filters "ERROR" records.

• Streaming generates RDDs periodically

• Any code that operates on RDDs can therefore be used in streaming as well

Page 24: Real Time Data Processing Using Spark Streaming


Sharing Code between Batch and Streaming

Spark:

val lines = sc.textFile(...)
val filtered = filterErrors(lines)
filtered.saveAsTextFile(...)

Spark Streaming:

val dStream = FlumeUtils.createStream(ssc, "34.23.46.22", 4435)
  .map(event => new String(event.event.getBody.array()))  // extract each Flume event body as a String
val filtered = dStream.transform(rdd => filterErrors(rdd)) // reuse the same batch code on each micro-batch
filtered.saveAsTextFiles(...)

Page 25: Real Time Data Processing Using Spark Streaming


Spark Streaming Use-Cases

• Real-time dashboards
  • Show approximate results in real time
  • Reconcile periodically with the source of truth using Spark

• Joins of multiple streams
  • Time-based or count-based "windows"
  • Combine multiple sources of input to produce composite data

• Re-use RDDs created by Streaming in other Spark jobs
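A hedged sketch of joining two streams (hosts, ports, and the comma-separated record format are hypothetical): both DStreams are keyed by the same field, windowed over the same interval, and joined per micro-batch.

// Assumes an existing StreamingContext `ssc`; Seconds comes from org.apache.spark.streaming
val clicks      = ssc.socketTextStream("host1", 9999).map(line => (line.split(",")(0), 1L))
val impressions = ssc.socketTextStream("host2", 9999).map(line => (line.split(",")(0), 1L))

val clicksPerAd      = clicks.reduceByKeyAndWindow(_ + _, Seconds(60))
val impressionsPerAd = impressions.reduceByKeyAndWindow(_ + _, Seconds(60))

// Per-micro-batch join: (adId, (clickCount, impressionCount)) over the last minute
clicksPerAd.join(impressionsPerAd).print()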

Page 26: Real Time Data Processing Using Spark Streaming


Hadoop in the Spark world

[Diagram: the Hadoop ecosystem in the Spark world – core Hadoop (YARN, HDFS, HBase, MapReduce2) alongside Hive, Pig, Impala, Search, and the Spark stack (Spark, Spark Streaming, GraphX, MLlib, Shark), with components marked as either supported Spark components or unsupported add-ons.]

Page 27: Real Time Data Processing Using Spark Streaming


Current project status

• 100+ contributors and 25+ companies contributing

• Includes Databricks, Cloudera, Intel, Yahoo!, etc.

• Dozens of production deployments

• Included in CDH!

Page 28: Real Time Data Processing Using Spark Streaming


More Info..

• CDH Docs: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Installation-Guide/cdh5ig_spark_installation.html

• Cloudera Blog: http://blog.cloudera.com/blog/category/spark/

• Apache Spark homepage: http://spark.apache.org/

• Github: https://github.com/apache/spark

Page 29: Real Time Data Processing Using Spark Streaming


Thank you

