Jack Gudenkauf, VP Big Data
scala> sc.parallelize(List("Kafka Spark Vertica"), 3).mapPartitions(iter => { iter.toList.map(x => print(x)) }.iterator).collect; println()
https://twitter.com/_JG
PLAYTIKA
• Founded in 2010
• Social Casino global category leader
• 10 games
• 13 platforms
• 1000+ employees
Hari Shreedharan, Software Engineer @ Cloudera
Committer/PMC Member, Apache Flume
Committer, Apache Sqoop
Contributor, Apache Spark
Author, Using Flume (O'Reilly)
Spark + Kafka: Future of Stream Processing
Motivation for Real-Time Stream Processing
Data is being created at unprecedented rates
• Exponential data growth from mobile, web, social
• Connected devices: 9B in 2012 to 50B by 2020
• Over 1 trillion sensors by 2020
• Datacenter IP traffic growing at a CAGR of 25%

How can we harness this data in real-time?
• Value can quickly degrade → capture value immediately
• From reactive analysis to direct operational impact
• Unlocks new competitive advantages
• Requires a completely new approach...
From Volume and Variety to Velocity
Big Data has evolved, and the Hadoop ecosystem evolves with it:

Past: Big-Data = Volume + Variety
• Batch Processing
• Time to insight of hours

Present: Big-Data = Volume + Variety + Velocity
• Batch + Stream Processing
• Time to insight of seconds
Key Components of Streaming Architectures
• Data Ingestion & Transportation Service (Kafka, Flume)
• Real-Time Stream Processing Engine
• Real-Time Data Serving
• Data Management & Integration
• System Management
• Security
Canonical Stream Processing Architecture
[Diagram] Data sources are ingested (via Kafka/Flume) into Kafka; stream-processing apps (App 1, App 2, …) consume from Kafka; results flow through Kafka/Flume into HDFS and HBase.
Spark: Easy and Fast Big Data
• Easy to Develop
  • Rich APIs in Java, Scala, Python
  • Interactive shell
• Fast to Run
  • General execution graphs
  • In-memory storage

2-5× less code; up to 10× faster on disk, 100× in memory
Spark Architecture
[Diagram] The Driver sends Tasks to a set of Workers; each Worker caches Data in RAM and returns Results to the Driver.
RDDs
RDD = Resilient Distributed Dataset
• Immutable representation of data
• Operations on one RDD create a new one
• Memory caching layer that stores data in a distributed, fault-tolerant cache
• Created by parallel transformations on data in stable storage
• Lazy materialization

Two observations:
a. Can fall back to disk when the data set does not fit in memory
b. Provides fault tolerance through the concept of lineage
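A minimal spark-shell sketch of these properties (the HDFS path is a placeholder): transformations return new, immutable RDDs lazily, and nothing is materialized until an action runs.

// Assumes a spark-shell session, where `sc` is the SparkContext.
val lines  = sc.textFile("hdfs://...")          // RDD backed by stable storage (lazy)
val errors = lines.filter(_.contains("ERROR"))  // transformation: a new RDD (still lazy)
errors.cache()                                  // keep this RDD in the distributed cache
val n = errors.count()                          // action: triggers materialization
// If a cached partition is lost, Spark recomputes it from the lineage above.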
Spark Streaming
Extension of Apache Spark's Core API, for Stream Processing.

The framework provides:
• Fault Tolerance
• Scalability
• High Throughput
Spark Streaming
• Incoming data represented as Discretized Streams (DStreams)
• Stream is broken down into micro-batches
• Each micro-batch is an RDD, so code can be shared between batch and streaming
val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

"Micro-batch" architecture: the stream is composed of small (1-10s) batch computations. [Diagram] At each batch interval (t, t+1, t+2, …) the tweets DStream yields an RDD; flatMap produces the corresponding RDD of the hashTags DStream; save writes each batch out.
Use DStreams for Windowing Functions
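For example, a minimal sketch of a windowed count over the hashTags DStream from the previous slide (the window and slide durations are illustrative):

import org.apache.spark.streaming.Seconds

// Count hashtags over the last 60 seconds, recomputed every 10 seconds.
val tagCounts = hashTags
  .map(tag => (tag, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(60), Seconds(10))
tagCounts.print()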
Spark Streaming
• Runs as a Spark job
  • YARN or standalone for scheduling
  • YARN has KDC integration
• Use the same code for real-time Spark Streaming and for batch Spark jobs
• Integrates natively with messaging systems such as Flume, Kafka, ZeroMQ…
• Easy to write "Receivers" for custom messaging systems (a sketch follows below)
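A minimal sketch of such a custom Receiver; the source being read is hypothetical, and the real work is simply calling store() with each received record:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Pushes strings from a custom source into Spark Streaming.
class MyReceiver(host: String, port: Int)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Read on a separate thread so onStart() returns immediately.
    new Thread("MyReceiver") { override def run(): Unit = receive() }.start()
  }

  def onStop(): Unit = { /* close connections, stop threads */ }

  private def receive(): Unit = {
    while (!isStopped) {
      val record = "..."  // placeholder: read one record from the custom source
      store(record)       // hand the record to Spark Streaming
    }
  }
}

// Usage: val stream = ssc.receiverStream(new MyReceiver("host", 9999))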
Sharing Code between Batch and Streaming
def filterErrors(rdd: RDD[String]): RDD[String] = {
  rdd.filter(s => s.contains("ERROR"))
}

Library that filters "ERROR"s
• Streaming generates RDDs periodically
• Any code that operates on RDDs can therefore be used in streaming as well
Sharing Code between Batch and Streaming
Spark:

val lines = sc.textFile(…)
val filtered = filterErrors(lines)
filtered.saveAsTextFile(...)
Spark Streaming:

val dStream = FlumeUtils.createStream(ssc, "34.23.46.22", 4435)
// Flume delivers SparkFlumeEvents; extract each event body as a String.
val lines = dStream.map(e => new String(e.event.getBody.array()))
// transform (unlike foreachRDD) returns a DStream, so the result can be saved.
val filtered = lines.transform((rdd: RDD[String], time: Time) => filterErrors(rdd))
filtered.saveAsTextFiles(…)
Reliability
• Received data automatically persisted to an HDFS Write Ahead Log (WAL) to prevent data loss
  • Set spark.streaming.receiver.writeAheadLog.enable=true in the Spark conf (see the sketch below)
• When the AM dies, the application is restarted by YARN
  • Received, ack-ed and unprocessed data is replayed from the WAL (data that made it into blocks)
• Reliable Receivers can replay data from the original source, if required
  • Un-acked data is replayed from the source
  • The Kafka and Flume receivers bundled with Spark are examples
• Reliable Receivers + WAL = no data loss on driver or receiver failure!
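A minimal sketch of enabling the WAL (the app name, batch interval and checkpoint path are illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("ReliableStreamingApp")
  // Persist received blocks to a WAL before acknowledging them.
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(2))
// The WAL and recovery metadata live under the checkpoint directory on HDFS.
ssc.checkpoint("hdfs://...")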
Reliable Kafka DStream
• Stores received data to a Write Ahead Log on HDFS for replay: no data loss!
• Stable and supported!
• Uses a reliable receiver to pull data from Kafka
• Application-controlled parallelism
  • Create as many receivers as you want to parallelize (see the sketch below)
  • Remember: each receiver is a long-running task and holds one executor hostage; no processing happens on that executor
• Tricky to do this efficiently, and so is controlling ordering (everything needs to be done explicitly)
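A sketch of receiver-based parallelism, assuming hypothetical ZooKeeper quorum, consumer group and topic names:

import org.apache.spark.streaming.kafka.KafkaUtils

// Each createStream call starts one receiver (pinning one executor);
// union the streams so downstream processing sees a single DStream.
val streams = (1 to 4).map { _ =>
  KafkaUtils.createStream(ssc, "zk1:2181,zk2:2181", "my-consumer-group", Map("events" -> 1))
}
val unified = ssc.union(streams)
unified.map(_._2).count().print()   // _._2 is the message payload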
Reliable Kafka DStream - Issues
• Kafka can replay messages if processing failed for some reason
  • So the WAL is overkill: it causes an unnecessary performance hit
  • In addition, the reliable stream causes a lot of network traffic due to unneeded HDFS writes etc.
• Receivers hold executors hostage, which could otherwise be used for processing
• How can we solve these issues?
Direct Kafka DStream
• No long-running receiver = no executor hogging! (see the sketch below)
• Communicates with Kafka via the "low-level API"
• 1 Spark partition per Kafka partition
• At the end of every batch, each partition reads from the first message after the last batch to the current latest message in the partition
  • If a max rate is configured, then rate × batch interval messages are downloaded & processed
• The checkpoint contains the starting and ending offsets of the current RDD
• Recovering from a checkpoint is simple: last offset + 1 is the first offset of the next batch
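A minimal sketch of creating a direct stream with the receiver-less API added in Spark 1.3 (the broker list and topic name are hypothetical):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val topics = Set("events")

// No receiver: each Kafka partition maps to one Spark partition.
val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

directStream.map(_._2).count().print()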
Direct Kafka DStream
• (Almost) exactly-once processing
  • At the end of each interval, the RDD can provide information about the starting and ending offsets
  • These offsets can be persisted, so even on failure you can recover from there (see the sketch below)
• Edge cases are possible and can cause duplicates
  • Failure in the middle of HDFS writes -> duplicates!
  • Failure after processing but before offsets get persisted -> duplicates! (more likely!)
  • Writes to Kafka can also cause duplicates, and so can reads from Kafka
• Fix: your app should really be resilient to duplicates
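A sketch of reading offset ranges off each batch; persisting them is left as a placeholder, and the gap between processing and persisting is exactly the duplicate-producing window described above:

import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

directStream.foreachRDD { rdd =>
  // RDDs produced by the direct stream carry their Kafka offset ranges.
  val offsets: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // 1. Process the batch (placeholder).
  rdd.map(_._2).foreach(record => ())

  // 2. Persist the offsets afterwards (here they are just printed).
  offsets.foreach(o => println(s"${o.topic}/${o.partition}: ${o.fromOffset} -> ${o.untilOffset}"))
}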
Spark Streaming Use-Cases
• Real-time dashboards
  • Show approximate results in real-time
  • Reconcile periodically with a source-of-truth using Spark
• Joins of multiple streams
  • Time-based or count-based "windows"
  • Combine multiple sources of input to produce composite data
• Re-use RDDs created by Streaming in other Spark jobs
What is coming?
• Better monitoring and alerting
  • Batch-level and task-level monitoring
• SQL on Streaming
  • Run SQL-like queries on top of Streaming (medium to long term)
• Python!
  • Limited support already available, with more complete support coming
• ML
  • More real-time ML algorithms
Current Spark project status
• 400+ contributors and 50+ companies contributing
  • Includes Databricks, Cloudera, Intel, Huawei, Yahoo!, etc.
• Dozens of production deployments
• Spark Streaming survived the Netflix Chaos Monkey: production ready!
• Included in CDH!
More Info..
• CDH Docs: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Installation-Guide/cdh5ig_spark_installation.html
• Cloudera Blog: http://blog.cloudera.com/blog/category/spark/
• Apache Spark homepage: http://spark.apache.org/
• GitHub: https://github.com/apache/spark
Thank You!
[email protected]
@harisr1234