Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark Streaming. Akhil Das, [email protected]


Page 1: Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark Streaming

Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark Streaming

[email protected]

Page 2: Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark Streaming

About Me and Sigmoid

● GitHub: github.com/akhld

● Twitter: @AkhlD

● Email: [email protected]

OUR CUSTOMERS (customer logos)

Page 3: Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark Streaming

Overview

● Apache Spark

● Spark Streaming

● High Availability Mesos Cluster

● Running Spark Streaming over a High Availability Mesos Cluster

● Simple Fault-tolerant Streaming Pipeline

● Scaling the pipeline

Page 4: Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark Streaming

Apache Spark

Spark Stack

Resilient Distributed Datasets (RDDs)

- A big collection of data which is:

- Immutable

- Distributed

- Lazily evaluated

- Type Inferred

- Cacheable

RDD1 → RDD2 → RDD3 (lineage of transformations)
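A minimal sketch of these properties, assuming a live SparkContext `sc` (e.g. from spark-shell); the variable names are illustrative:

```scala
// Each transformation returns a new immutable RDD; nothing executes until an action.
val rdd1 = sc.parallelize(1 to 1000000)    // RDD1: distributed collection
val rdd2 = rdd1.map(_ * 2)                 // RDD2: lazily evaluated; rdd1 is unchanged
val rdd3 = rdd2.filter(_ % 3 == 0).cache() // RDD3: marked cacheable, still lazy
val total = rdd3.count()                   // action: triggers evaluation of the whole lineage
```

Because each RDD remembers its lineage (RDD1 → RDD2 → RDD3), a lost partition can be recomputed from its parents rather than restored from a replica.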

Page 5: Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark Streaming

Why Spark Streaming?

Many big-data applications need to process large data streams in near real time:

● Monitoring Systems

● Alert Systems

● Computing Systems

Page 6: Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark Streaming

What is Spark Streaming?

Taken from Apache Spark.

Page 7: Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark Streaming

What is Spark Streaming?

Framework for large scale stream processing

➔ Created at UC Berkeley by Tathagata Das (TD)

➔ Scales to 100s of nodes

➔ Can achieve second-scale latencies

➔ Provides a simple batch-like API for implementing complex algorithms

➔ Can absorb live data streams from Kafka, Flume, ZeroMQ, Kinesis, etc.

Page 8: Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark Streaming

Spark Streaming

Run a streaming computation as a series of very small, deterministic batch jobs:

- Chop up the live stream into batches of X seconds

- Spark treats each batch of data as an RDD and processes it using RDD operations

- Finally, the processed results of the RDD operations are returned in batches
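The three steps above can be sketched with a word count over a socket stream; the hostname, port, and 2-second batch interval are illustrative:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("MicroBatchWordCount")
val ssc = new StreamingContext(conf, Seconds(2))   // chop the live stream into 2s batches

// Each 2s batch becomes an RDD; the same RDD operations run on every batch.
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()                                     // processed results emitted per batch

ssc.start()
ssc.awaitTermination()
```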

Page 9: Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark Streaming

Simple Streaming Pipeline (diagram)

Kafka Server → Spark Streaming (Standalone Spark Cluster) → Storage (HDFS/DB)

Point of Failure

Page 10: Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark Streaming

Mesos High Availability Cluster (diagram)

- Masters quorum: a Leader plus Standby masters.

- Slaves (Slave 1 … Slave N) run Executor tasks for frameworks, e.g. a SparkStreamingJob and a HadoopJob.

- The leading master makes resource Offers to each Framework's Scheduler; for the Framework (SparkStreamingJob), the Scheduler lives in the Driver program.

Page 11: Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark Streaming
Page 12: Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark Streaming

Spark Streaming over a HA Mesos Cluster

● To use Mesos from Spark, you need a Spark binary package in a location accessible to Mesos (HTTP/S3/HDFS), and a Spark driver program configured to connect to Mesos.

● Configuring the driver program to connect to Mesos:

val sconf = new SparkConf()
  .setMaster("mesos://zk://10.121.93.241:2181,10.181.2.12:2181,10.107.48.112:2181/mesos")
  .setAppName("MyStreamingApp")
  .set("spark.executor.uri", "hdfs://Sigmoid/executors/spark-1.3.0-bin-hadoop2.4.tgz")
  .set("spark.mesos.coarse", "true")
  .set("spark.cores.max", "30")
  .set("spark.executor.memory", "10g")

val sc = new SparkContext(sconf)
val ssc = new StreamingContext(sc, Seconds(1))
...

Page 13: Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark Streaming

Spark Streaming Fault-tolerance

Real-time stream processing systems must be operational 24/7, which requires them to recover from all kinds of failures in the system.

● Spark and its RDD abstraction are designed to seamlessly handle failures of any worker node in the cluster.

● In Streaming, driver failure can be recovered from by checkpointing the application state.

● Write Ahead Logs (WAL) and acknowledgements can ensure zero data loss.
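A sketch of both mechanisms together; the checkpoint path is hypothetical, and `spark.streaming.receiver.writeAheadLog.enable` is the receiver WAL switch in this generation of Spark:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical HDFS path; any fault-tolerant store works for checkpoints.
val checkpointDir = "hdfs://Sigmoid/checkpoints/myStreamingApp"

def createContext(): StreamingContext = {
  val conf = new SparkConf()
    .setAppName("MyStreamingApp")
    // WAL: the receiver logs incoming data to fault-tolerant storage before processing it
    .set("spark.streaming.receiver.writeAheadLog.enable", "true")
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint(checkpointDir)
  // ... define the DStream operations here ...
  ssc
}

// On driver restart, rebuild the context from the checkpoint instead of starting fresh.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```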

Page 14: Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark Streaming

Simple Fault-tolerant Streaming Infra (diagram)

Kafka Cluster → Spark Streaming (a SparkStreamingJob running Executor tasks on the nodes of a High Availability Mesos Cluster) → Storage (HDFS/DB)

Page 15: Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark Streaming

Scaling the pipeline

Kafka Cluster → Spark Streaming (nodes of a High Availability Mesos Cluster) → Storage (HDFS/DB)

Understanding the bottlenecks:

- Network: 1 Gbps

- # Cores/Slave: 4

- Disk IO: 100 MB/s on SSD

Goal: receive and process data at 1M events/second.

Choosing the correct # of resources:

- Since a single slave can handle up to 100 MB/s of network and disk IO, a minimum of 6 slaves gets us to ~600 MB/s.
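The sizing can be sanity-checked with a back-of-envelope calculation; the ~600 MB/s aggregate target and 100 MB/s per-slave cap are the slide's figures (1 Gbps of network is ~125 MB/s, so the SSD's 100 MB/s is the binding limit per slave):

```scala
// Per-slave throughput is the minimum of network and disk bandwidth (slide's figures).
val networkMBps  = 125                 // ~1 Gbps
val diskMBps     = 100                 // SSD
val perSlaveMBps = math.min(networkMBps, diskMBps)

val targetMBps   = 600                 // aggregate goal from the slide
val slavesNeeded = math.ceil(targetMBps.toDouble / perSlaveMBps).toInt
println(s"slaves needed: $slavesNeeded")  // 6
```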