Spark Streaming Early Warning Use Case


Streaming Early Warning
Data Day Seattle, 6-27-2015
Chance Coble


Use Case Profile

➾ Telecommunications company
  Had business problems and pain points
  Scalable analytics infrastructure is a problem
  Pushing infrastructure to its limits
  Open to a proof-of-concept engagement with emerging technology
  Wanted to test on historical data

➾ We introduced Spark Streaming
  Technology would scale
  Could prove it enabled new analytic techniques (incident detection)
  Open to the Scala requirement
  Wanted to prove it was easy to deploy; EC2 helped

Organization Profile

➾ Telecommunications wholesale business
  Processes 90 million calls per day
  Scales up to 1,000 calls per second
  Nearly half a million calls in a 5-minute window
  Technology is loosely split into
    Operational Support Systems (OSS)
    Business Support Systems (BSS)

➾ Core technology is mature
  Analytics on the LAMP stack
  Technology team is strongly skilled in that stack

Jargon

➾ Number
  Comprised of Country Code (possibly), Area Code (NPA), Exchange (NXX), and 4 other digits
  Area codes and exchanges are often geo-coded

  Example: 1 512 867 5309 (Country Code, NPA, NXX, last 4 digits)

Jargon

➾ Trunk Group
  A trunk is a line connecting transmissions between two points. A trunk group shares some common property, in this case being owned by the same entity.
  Transmissions from ingress trunks are routed to transmissions on egress trunks.

➾ Route
  In this case, the selection of a trunk group to facilitate termination at the call's destination

➾ QoS: Quality of Service, governed by metrics
  Call Duration: short calls are an indication of quality problems
  ASR: Answer-Seizure Ratio
    This company measures it as #connected calls / #calls attempted

➾ Real-time: within 5 minutes
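The ASR definition above can be sketched in plain Scala. The function name and the "S"-for-successful convention (borrowed from the status field used later in the deck) are illustrative assumptions:

```scala
// ASR as this company defines it: connected calls / attempted calls.
// A status of "S" marks a connected (successful) call.
def asr(statuses: Seq[String]): Double = {
  val attempted = statuses.size
  if (attempted == 0) 0.0
  else statuses.count(_ == "S").toDouble / attempted
}

val fiveMinuteWindow = Seq("S", "U", "S", "S") // four attempts, three connected
println(asr(fiveMinuteWindow)) // 0.75
```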


The Problem

➾ A switch handles most of their routing
➾ A configuration table in the switch governs routing
  if-this-then-that style logic
➾ Proprietary technology handles adjustments to that table
  Manual intervention is also required

[Diagram: Call Logs → Business Rules Application → Database → Intranet Portal]

The Problem

➾ The backend system receives a log of calls from the switch
  A file is dumped every few minutes
  180 well-defined fields representing features of a call event
  Supports downstream analytics once enriched with pricing, geo-coding, and account information

Their job is to connect calls at the most efficient price without sacrificing quality.

Why Spark?

➾ They were interested because
  A workbench can simplify operationalizing analytics
  They can skip a generation of clunky big data tools
  Works with their data structures
  Will "scale out" rather than up
  Can handle fault-tolerant in-memory updates

Spark Basics - Architecture

[Diagram: the Spark Driver holds the Spark Context, which talks to a Cluster Manager; the Cluster Manager allocates Executors, each running Tasks and holding a Cache]

Spark Basics – Call Status Count Example

import org.apache.spark.{SparkConf, SparkContext}

val cdrLogPath = "/cdrs/cdr20140731042210.ssv"
val conf = new SparkConf().setAppName("CDR Count")
val sc = new SparkContext(conf)
val cdrLines = sc.textFile(cdrLogPath)

// Split each semicolon-delimited call detail record and cache the result,
// since the two filter/count actions below reuse it.
val cdrDetails = cdrLines.map(_.split(";")).cache()
val successful = cdrDetails.filter(x => x(6) == "S").count()
val unsuccessful = cdrDetails.filter(x => x(6) == "U").count()

println("Successful: %s, Unsuccessful: %s".format(successful, unsuccessful))

Spark Basics - RDDs

➾ Operations on data generate distributable tasks through a Directed Acyclic Graph (DAG)
  Functional programming FTW!

➾ Resilient
  Data is redundantly stored, and can be recomputed through the generated DAG

➾ Distributed
  The DAG can process each small task, as well as a subset of the data, through optimizations in the Spark planning engine

➾ Dataset
  This construct is native to Spark computation

Spark Basics - RDDs

➾ Lazy
  Transformations only describe tasks and slices; no work runs until an action forces it
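Spark's laziness can be mimicked in plain Scala with a view: the map below only describes work, like an RDD transformation, and nothing executes until `sum` (the "action") forces it. This is an analogy, not Spark API:

```scala
var evaluations = 0

// `view` makes the map lazy: it records the transformation without running it.
val pipeline = (1 to 10).view.map { x => evaluations += 1; x * 2 }
println(evaluations) // 0 -- nothing has been computed yet

val total = pipeline.sum // forcing an "action" triggers the work
println(evaluations)     // 10
println(total)           // 110
```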


Streaming Applications – Why try it?

➾ Streaming applications
  Site activity statistics
  Spam detection
  System monitoring
  Intrusion detection
  Telecommunications network data

Streaming Models

➾ Record-at-a-time
  Receive one record and process it
  Simple, low latency
  High throughput

➾ Micro-batch
  Receive records and periodically run a batch process over a window
  The process *must* run fast enough to handle all records collected
  Harder to reduce latency
  Easier reasoning
    Global state
    Fault tolerance
    Unified code

DStreams

➾ Stands for Discretized Streams
➾ A series of RDDs
➾ Spark already provided a computation model on RDDs
➾ Note: records are ordered as they are received
  They are also time-stamped for computation in that order
  Is that always the way you want to see your data?

Fault Tolerance – Parallel Recovery

➾ Failed nodes
➾ Stragglers!

Fault Tolerance - Recompute


Throughput vs. Latency


Anatomy of a Spark Streaming Program

import scala.collection.mutable.SynchronizedQueue

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setAppName("QueueStream")
val ssc = new StreamingContext(sparkConf, Seconds(1))

// Queue of RDDs that will feed the stream, one batch per second.
val rddQueue = new SynchronizedQueue[RDD[Int]]()
val inputStream = ssc.queueStream(rddQueue)

// Key each value by its last digit and count per key within each batch.
val mappedStream = inputStream.map(x => (x % 10, 1))
val reducedStream = mappedStream.reduceByKey(_ + _)
reducedStream.print()

ssc.start()
for (i <- 1 to 30) {
  rddQueue += ssc.sparkContext.makeRDD(1 to 1000, 10)
  Thread.sleep(1000)
}
ssc.stop()

Utilities are also available for Twitter, Kafka, Flume, and file streams.

Windows

[Diagram: a window interval sliding over a DStream by the slide interval]
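Window and slide semantics can be illustrated without Spark. Below, Scala's `sliding` mimics a window of 3 batch intervals with a slide of 1 (a plain-Scala analogy for `DStream.window`, not Spark API):

```scala
// One record count per batch interval.
val batchCounts = List(10, 20, 30, 40, 50)

// window = 3 intervals, slide = 1 interval.
val windows = batchCounts.sliding(3, 1).toList
println(windows)
// List(List(10, 20, 30), List(20, 30, 40), List(30, 40, 50))

// Aggregate each window, as the streaming job does with its 5-minute window.
val windowTotals = windows.map(_.sum)
println(windowTotals) // List(60, 90, 120)
```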


Streaming Call Analysis with Windows

// `iteration`, `window`, and `slide` are interval values (in seconds) defined
// elsewhere in the job; `extractCallDetailRecord` and `detectIncidents` are
// application helpers.
val path = "/Users/chance/Documents/cdrdrop"
val conf = new SparkConf()
  .setMaster("local[12]")
  .setAppName("CDRIncidentDetection")
  .set("spark.executor.memory", "8g")
val ssc = new StreamingContext(conf, Seconds(iteration))

val callStream = ssc.textFileStream(path)
// Collect `window` seconds of calls, re-evaluated every `slide` seconds.
val cdr = callStream.window(Seconds(window), Seconds(slide)).map(_.split(";"))
val cdrArr = cdr.filter(c => c.length > 136)
  .map(c => extractCallDetailRecord(c))

val result = detectIncidents(cdrArr)
result.foreach(rdd => rdd.take(10)
  .foreach { case (x, (d, high, low, res)) =>
    println(x + "," + high + "," + d + "," + low + "," + res) })

ssc.start()

ssc.awaitTermination()
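The helpers `extractCallDetailRecord` and `detectIncidents` are not shown in the deck. A hypothetical sketch of the extraction step, with invented field positions (only field 6, the "S"/"U" status used earlier, comes from the slides; the real layout has 180 fields):

```scala
// Hypothetical CDR projection; real records carry 180 fields.
case class CallDetailRecord(callId: String, status: String, durationSec: Int)

def extractCallDetailRecord(fields: Array[String]): CallDetailRecord =
  CallDetailRecord(
    callId = fields(0),               // assumed position
    status = fields(6),               // "S" = successful, "U" = unsuccessful (from the deck)
    durationSec = fields(10).toInt)   // assumed position

val raw = "cdr-001;a;b;c;d;e;S;f;g;h;42".split(";")
val rec = extractCallDetailRecord(raw)
println(rec) // CallDetailRecord(cdr-001,S,42)
```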


Can we enable new analytics?

➾ Incident detection
  Chose a univariate technique [1] to detect behavior out of profile from recent events
  The technique identifies
    out-of-profile events
    dramatic shifts in the profile
  Easy to understand

[Figure: detection profile computed over the recent window]
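The deck does not spell out the detection math. One common univariate choice, offered here only as a sketch and not necessarily the technique of [1], flags values more than k standard deviations from the mean of the recent window:

```scala
// Flag a value that falls outside mean +/- k*stddev of the recent window.
def outOfProfile(recent: Seq[Double], current: Double, k: Double = 3.0): Boolean = {
  val mean = recent.sum / recent.size
  val variance = recent.map(v => (v - mean) * (v - mean)).sum / recent.size
  val stddev = math.sqrt(variance)
  math.abs(current - mean) > k * stddev
}

// Recent ASR readings hover around 0.50; a sudden 0.10 is an incident candidate.
val recentAsr = Seq(0.52, 0.49, 0.51, 0.50, 0.48)
println(outOfProfile(recentAsr, 0.50)) // false
println(outOfProfile(recentAsr, 0.10)) // true
```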


Is it simple to deploy?

➾ EC2 helped
➾ Client had no Hadoop and little NoSQL expertise
➾ Develop and deploy
  Built with sbt, ran on the master

➾ Architecture involved
  Pushing new call detail logs to HDFS on EC2
  Streaming picks up new data and updates RDDs accordingly
  Results were explored in two ways
    Accessing results through data virtualization
    Writing (small) RDD results to a SQL database, then using a business intelligence tool to create report content

[Diagram: call logs flow from current processing into HDFS on EC2, through the streaming job, out to analysis and reporting dashboards with multiple delivery options]

Results

[Chart: throughput (MB) compared against published WordCount results; y-axis from 10 to 350]

Summary of Results

➾ Technology would scale
  Handled 5 minutes of data in under a second

➾ Proved new analytics were enabled
  Solved single-variable incident detection
  Small, simple code

➾ Made a case for Scala adoption
  Team is still skeptical about big data

➾ Wanted to prove it was easy to deploy; EC2 helped
  Burned by a forward-slash bug in the AWS secret token

Incident Visual


Contact Us


Email: chance at blacklightsolutions.com

Phone: 512.795.0855

Web: www.blacklightsolutions.com

Twitter: @chancecoble