Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark Streaming. Akhil Das, [email protected]


Page 1: Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark Streaming

Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark Streaming

[email protected]

Page 2: Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark Streaming

About Me and Sigmoid

● GitHub: github.com/akhld

● Twitter: @AkhlD

● Email: [email protected]

OUR CUSTOMERS (customer logos)

Page 3: Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark Streaming

Overview

● Apache Spark

● Spark Streaming

● High Availability Mesos Cluster

● Running Spark Streaming over a High Availability Mesos Cluster

● Simple Fault-tolerant Streaming Pipeline

● Scaling the pipeline

Page 4: Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark Streaming

Apache Spark

Spark Stack

Resilient Distributed Datasets (RDDs)

- A big collection of data which is:

- Immutable

- Distributed

- Lazily evaluated

- Type Inferred

- Cacheable

RDD1 → RDD2 → RDD3 (lineage of transformations)
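A minimal sketch of these properties, assuming a live SparkContext `sc` (e.g. from spark-shell); the variable names are illustrative:

```scala
// Each transformation returns a new immutable RDD; nothing executes until an action.
val rdd1 = sc.parallelize(1 to 1000000)    // RDD1: distributed collection
val rdd2 = rdd1.map(_ * 2)                 // RDD2: lazily evaluated; rdd1 is unchanged
val rdd3 = rdd2.filter(_ % 3 == 0).cache() // RDD3: marked cacheable, still lazy
val total = rdd3.count()                   // action: triggers evaluation of the whole lineage
```

Because each RDD remembers its lineage (RDD1 → RDD2 → RDD3), a lost partition can be recomputed from its parents rather than restored from a replica.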

Page 5: Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark Streaming

Why Spark Streaming?

Many big-data applications need to process large data streams in near real time:

● Monitoring Systems

● Alert Systems

● Computing Systems

Page 6: Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark Streaming

What is Spark Streaming?

Taken from Apache Spark.

Page 7: Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark Streaming

What is Spark Streaming?

Framework for large scale stream processing

➔ Created at UC Berkeley by Tathagata Das (TD)

➔ Scales to 100s of nodes

➔ Can achieve second-scale latencies

➔ Provides a simple batch-like API for implementing complex algorithms

➔ Can absorb live data streams from Kafka, Flume, ZeroMQ, Kinesis, etc.

Page 8: Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark Streaming

Spark Streaming

Run a streaming computation as a series of very small, deterministic batch jobs:

- Chop up the live stream into batches of X seconds

- Spark treats each batch of data as an RDD and processes it using RDD operations

- Finally, the processed results of the RDD operations are returned in batches
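The three steps above can be sketched with a word count over a socket stream; the hostname, port, and 2-second batch interval are illustrative:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("MicroBatchWordCount")
val ssc = new StreamingContext(conf, Seconds(2))   // chop the live stream into 2s batches

// Each 2s batch becomes an RDD; the same RDD operations run on every batch.
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()                                     // processed results emitted per batch

ssc.start()
ssc.awaitTermination()
```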

Page 9: Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark Streaming

Simple Streaming Pipeline (diagram)

Kafka Server → Spark Streaming (Standalone Spark Cluster) → Storage (HDFS/DB)

Point of Failure

Page 10: Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark Streaming

Mesos High Availability Cluster (diagram)

- Masters quorum: a Leader plus Standby masters.

- Slaves (Slave 1 … Slave N) run Executor tasks for frameworks, e.g. a SparkStreamingJob and a HadoopJob.

- The leading master makes resource Offers to each Framework's Scheduler; for the Framework (SparkStreamingJob), the Scheduler lives in the Driver program.

Page 11: Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark Streaming
Page 12: Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark Streaming

Spark Streaming over a HA Mesos Cluster

● To use Mesos from Spark, you need a Spark binary package in a location accessible to Mesos (HTTP/S3/HDFS), and a Spark driver program configured to connect to Mesos.

● Configuring the driver program to connect to Mesos:

val sconf = new SparkConf()
  .setMaster("mesos://zk://10.121.93.241:2181,10.181.2.12:2181,10.107.48.112:2181/mesos")
  .setAppName("MyStreamingApp")
  .set("spark.executor.uri", "hdfs://Sigmoid/executors/spark-1.3.0-bin-hadoop2.4.tgz")
  .set("spark.mesos.coarse", "true")
  .set("spark.cores.max", "30")
  .set("spark.executor.memory", "10g")

val sc = new SparkContext(sconf)
val ssc = new StreamingContext(sc, Seconds(1))
...

Page 13: Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark Streaming

Spark Streaming Fault-tolerance

Real-time stream processing systems must be operational 24/7, which requires them to recover from all kinds of failures in the system.

● Spark and its RDD abstraction are designed to seamlessly handle failures of any worker node in the cluster.

● In Streaming, driver failure can be recovered from by checkpointing the application state.

● Write Ahead Logs (WAL) and acknowledgements can ensure zero data loss.
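A sketch of both mechanisms together; the checkpoint path is hypothetical, and `spark.streaming.receiver.writeAheadLog.enable` is the receiver WAL switch in this generation of Spark:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical HDFS path; any fault-tolerant store works for checkpoints.
val checkpointDir = "hdfs://Sigmoid/checkpoints/myStreamingApp"

def createContext(): StreamingContext = {
  val conf = new SparkConf()
    .setAppName("MyStreamingApp")
    // WAL: the receiver logs incoming data to fault-tolerant storage before processing it
    .set("spark.streaming.receiver.writeAheadLog.enable", "true")
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint(checkpointDir)
  // ... define the DStream operations here ...
  ssc
}

// On driver restart, rebuild the context from the checkpoint instead of starting fresh.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```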

Page 14: Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark Streaming

Simple Fault-tolerant Streaming Infra (diagram)

Kafka Cluster → Spark Streaming (a SparkStreamingJob running Executor tasks on the nodes of a High Availability Mesos Cluster) → Storage (HDFS/DB)

Page 15: Fully Fault tolerant Streaming Workflows at Scale using Apache Mesos & Spark Streaming

Scaling the pipeline

Kafka Cluster → Spark Streaming (nodes of a High Availability Mesos Cluster) → Storage (HDFS/DB)

Understanding the bottlenecks:

- Network: 1 Gbps

- # Cores/Slave: 4

- Disk IO: 100 MB/s on SSD

Goal: receive and process data at 1M events/second.

Choosing the correct # of resources:

- Since a single slave can handle up to 100 MB/s of network and disk IO, a minimum of 6 slaves gets us to ~600 MB/s.
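The sizing can be sanity-checked with a back-of-envelope calculation; the ~600 MB/s aggregate target and 100 MB/s per-slave cap are the slide's figures (1 Gbps of network is ~125 MB/s, so the SSD's 100 MB/s is the binding limit per slave):

```scala
// Per-slave throughput is the minimum of network and disk bandwidth (slide's figures).
val networkMBps  = 125                 // ~1 Gbps
val diskMBps     = 100                 // SSD
val perSlaveMBps = math.min(networkMBps, diskMBps)

val targetMBps   = 600                 // aggregate goal from the slide
val slavesNeeded = math.ceil(targetMBps.toDouble / perSlaveMBps).toInt
println(s"slaves needed: $slavesNeeded")  // 6
```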