
ACM DEBS 2015: Realtime Streaming Analytics

Patterns

Srinath Perera Sriskandarajah Suhothayan

WSO2 Inc.

Data Analytics (Big Data)

o Scientists have been doing this for 25 years with MPI (1991) using special hardware

o Took off with Google’s MapReduce paper (2004); Apache Hadoop, Hive, and a whole ecosystem were created

o Later Spark emerged, and it is faster

o But batch processing takes time

Value of Some Insights Degrades Fast!

o For some use cases (e.g. stock markets, traffic, surveillance, patient monitoring) the value of insights degrades very quickly with time
  o E.g. stock markets and the speed of light
o We need technology that can produce outputs fast
  o Static queries, but need very fast output (alerts, realtime control)
  o Dynamic and interactive queries (data exploration)

History

▪ Realtime analytics is not new either!
- Active Databases (2000+)
- Stream Processing (Aurora, Borealis (2005+), and later Storm)
- Distributed Streaming Operators (e.g. a database research topic around 2005)
- CEP Vendor Roadmap (from http://www.complexevents.com/2014/12/03/cep-tooling-market-survey-2014/)

Data Analytics Landscape

Realtime Interactive Analytics

o Usually done to support interactive queries

o Index data to make it readily accessible so you can respond to queries fast (e.g. Apache Drill)

o Tools like Druid, VoltDB, and SAP Hana can do this with all data in memory to make things really fast

Realtime Streaming Analytics

o Process data without storing, as data comes in
o Queries are fixed (static)
o Triggers when given conditions are met
o Technologies
  o Stream Processing (Apache Storm, Apache Samza)
  o Complex Event Processing/CEP (WSO2 CEP, Esper, StreamBase)
  o Micro Batches (Spark Streaming)

Why Realtime Streaming Analytics Patterns?

o Reason 1: Usual advantages of patterns
  o Give us better understanding
  o Give us a better vocabulary to teach and communicate
  o Tools can implement them
  o ..
o Reason 2: Under the theme of realtime analytics, a lot of people get too carried away with the word count example. Patterns show that word count is just the tip of the iceberg.

Earlier Work on Patterns

o Patterns from SQL (project, join, filter, etc.)
o Event Processing Technical Society’s (EPTS) reference architecture
  o Higher-level patterns such as tracking, prediction, and learning, in addition to the low-level operators that come from SQL-like languages
o Esper’s Solution Patterns document (50 patterns)
o Coral8 white paper

Basic Patterns

o Pattern 1: Preprocessing (filter, transform, enrich, project, ...)
o Pattern 2: Alerts and Thresholds
o Pattern 3: Simple Counting and Counting with Windows
o Pattern 4: Joining Event Streams
o Pattern 5: Data Correlation, Missing Events, and Erroneous Data

Patterns for Handling Trends

o Pattern 7: Detecting Temporal Event Sequence Patterns

o Pattern 8: Tracking (track something over space or time)

o Pattern 9: Detecting Trends (rise, fall, turn, triple bottom)

o Pattern 13: Online Control

Mixed Patterns

o Pattern 6: Interacting with Databases
o Pattern 10: Running the same Query in Batch and Realtime Pipelines
o Pattern 11: Detecting and switching to Detailed Analysis
o Pattern 12: Using a Machine Learning Model


Realtime Streaming Analytics Tools

Implementing Realtime Analytics

o It is tempting to write custom code; a filter looks very easy. But it gets too complex. Don’t!

o Option 1: Stream Processing (e.g. Storm). Kind of works. It is like MapReduce: you have to write code.

o Option 2: Spark Streaming. More compact than Storm, but cannot do some stateful operations.

o Option 3: Complex Event Processing. Compact, SQL-like language, fast.

Stream Processing

o Program a set of processors and wire them up; data flows through the graph

o A middleware framework handles data flow, distribution, and fault tolerance (e.g. Apache Storm, Samza)

o Processors may be on the same machine or on multiple machines

Writing a Storm Program

o Write Spout(s)
o Write Bolt(s)
o Wire them up
o Run

Write Bolts

We will use a shorthand like on the left to explain

public static class WordCount extends BaseBasicBolt {
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        // .. do something ...
        collector.emit(new Values(word, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}

Wire up and Run

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new RandomSentenceSpout(), 5);
builder.setBolt("split", new SplitSentence(), 8)
       .shuffleGrouping("spout");
builder.setBolt("count", new WordCount(), 12)
       .fieldsGrouping("split", new Fields("word"));

Config conf = new Config();
if (args != null && args.length > 0) {
    conf.setNumWorkers(3);
    StormSubmitter.submitTopologyWithProgressBar(
        args[0], conf, builder.createTopology());
} else {
    conf.setMaxTaskParallelism(3);
    LocalCluster cluster = new LocalCluster();
    cluster.submitTopology("word-count", conf, builder.createTopology());
    ...
}

Complex Event Processing

Micro Batches (e.g. Spark Streaming)

o Process data in small batches, and then combine the results to produce the final result (e.g. Spark)

o Works for simple aggregates, but tricky for complex operations (e.g. event sequences)

o Can do it with MapReduce as well if the deadlines are not too tight
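To make the micro-batch model concrete, here is a minimal word-count sketch against the Spark Streaming Java API (a sketch only, assuming Spark 2.x; the socket source, host, and port are illustrative and not from the original slides):

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

import java.util.Arrays;

public class MicroBatchWordCount {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("micro-batch-count");
        // Each micro batch covers 5 seconds of input
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // Illustrative source: one sentence per line from a socket
        JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        // A simple aggregate (count per word), computed independently per batch
        JavaDStream<String> words = lines.flatMap(x -> Arrays.asList(x.split(" ")).iterator());
        JavaPairDStream<String, Integer> pairs = words.mapToPair(w -> new Tuple2<>(w, 1));
        JavaPairDStream<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);

        counts.print();
        jssc.start();
        jssc.awaitTermination();
    }
}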

o A SQL-like data processing language (e.g. Apache Hive)

o Since many understand SQL, Hive made large-scale (Big Data) processing accessible to many

o Expressive, short, and sweet
o Defines core operations that cover 90% of problems
o Lets experts dig in when they like!

SQL Like Query Languages

o Easy to follow from SQL
o Expressive, short, and sweet
o Define core operations that cover 90% of problems
o Let experts dig in when they like!

CEP = SQL for Realtime Analytics

Pattern Implementations

Pattern 1: Preprocessing

o What? Clean up and prepare data via operations like filter, project, enrich, split, and transform

o Use cases?
  o From a Twitter data stream: extract the author, timestamp, and location fields, and then filter based on the location of the author
  o From a temperature stream: extract the temperature and room number of the sensor and filter by them

Filter

from TempStream [roomNo > 245 and roomNo <= 365]
select roomNo, temp
insert into ServerRoomTempStream ;

In Storm
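The original slide shows this as a diagram. For illustration only, the same filter as a Storm bolt might look like the sketch below (class name and field layout are assumptions, using the 2015-era backtype.storm API):

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class ServerRoomFilterBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        int roomNo = tuple.getIntegerByField("roomNo");
        double temp = tuple.getDoubleByField("temp");
        // Same condition as the Siddhi query: rooms 246..365 only
        if (roomNo > 245 && roomNo <= 365) {
            collector.emit(new Values(roomNo, temp));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("roomNo", "temp"));
    }
}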

In CEP (Siddhi)

Architecture of WSO2 CEP

CEP Event Adapters

Support for several transports (network access)
● SOAP
● HTTP
● JMS
● SMTP
● SMS
● Thrift
● Kafka
● Websocket
● MQTT

Supports database writes using Map messages
● Cassandra
● RDBMS

Supports custom event adaptors via its pluggable architecture!

Stream Definition (Data Model)

{
  'name': 'soft.drink.coop.sales',
  'version': '1.0.0',
  'nickName': 'Soft_Drink_Sales',
  'description': 'Soft drink sales',
  'metaData': [
    {'name': 'region', 'type': 'STRING'}
  ],
  'correlationData': [
    {'name': 'transactionID', 'type': 'STRING'}
  ],
  'payloadData': [
    {'name': 'brand', 'type': 'STRING'},
    {'name': 'quantity', 'type': 'INT'},
    {'name': 'total', 'type': 'INT'},
    {'name': 'user', 'type': 'STRING'}
  ]
}

Projection

define stream TempStream(deviceID long, roomNo int, temp double);

from TempStream
select roomNo, temp
insert into OutputStream ;

Inferred Streams

from TempStream
select roomNo, temp
insert into OutputStream ;

define stream OutputStream(roomNo int, temp double);

Enrich

from TempStream
select roomNo, temp, 'C' as scale
insert into OutputStream ;

define stream OutputStream(roomNo int, temp double, scale string);

from TempStream
select deviceID, roomNo, avg(temp) as avgTemp
insert into OutputStream ;

Transformation

from cseEventStream[price >= 20 and symbol == 'IBM']
select symbol, volume
insert into StockQuote ;

from TempStream
select concat(deviceID, '-', roomNo) as uid,
       toFahrenheit(temp) as tempInF,
       'F' as scale
insert into OutputStream ;

Split

from TempStream
select roomNo, temp
insert into RoomTempStream ;

from TempStream
select deviceID, temp
insert into DeviceTempStream ;

Pattern 2: Alerts and Thresholds

o What? Detects a condition and generates alerts (e.g. alarm on high temperature)
  o These alerts can be based on a simple value or on more complex conditions such as the rate of increase, etc.

o Use cases?
  o Raise an alert when a vehicle is going too fast
  o Alert when a room is too hot

Filter Alert

from TempStream [roomNo > 245 and roomNo <= 365 and temp > 40]
select roomNo, temp
insert into AlertServerRoomTempStream ;

Pattern 3: Simple Counting and Counting with Windows

o What? Aggregate functions like Min, Max, Percentiles, etc.

o Often they can be computed without storing any data

o Most useful when used with a window

o Use cases?
  o Most metrics need a time bound so we can compare (errors per day, transactions per second)
  o The Linux load average gives an idea of the overall trend by reporting the last 1, 5, and 15 minute means

Types of windows

o Sliding windows vs. Batch (tumbling) windows
o Time vs. Length windows

Also supports:
o Unique window
o First unique window
o External time window

Window

In Storm
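The Storm version is shown as a diagram on the slide. As an illustrative sketch only (names assumed, not the slide’s code), a per-room running average can be kept in bolt-local state; a true time window would need extra bookkeeping such as evicting old events on tick tuples:

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

import java.util.HashMap;
import java.util.Map;

public class AvgTempBolt extends BaseBasicBolt {
    // Per-room running sum and count; assumes fieldsGrouping on roomNo
    private final Map<Integer, double[]> stats = new HashMap<Integer, double[]>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        int roomNo = tuple.getIntegerByField("roomNo");
        double temp = tuple.getDoubleByField("temp");
        double[] s = stats.get(roomNo);
        if (s == null) {
            s = new double[2];
            stats.put(roomNo, s);
        }
        s[0] += temp; // running sum
        s[1] += 1;    // running count
        collector.emit(new Values(roomNo, s[0] / s[1]));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("roomNo", "avgTemp"));
    }
}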

Aggregation

In CEP (Siddhi)

from TempStream
select roomNo, avg(temp) as avgTemp
insert into HotRoomsStream ;

Sliding Time Window

from TempStream#window.time(1 min)
select roomNo, avg(temp) as avgTemp
insert all events into AvgRoomTempStream ;

Group By

from TempStream#window.time(1 min)
select roomNo, avg(temp) as avgTemp
group by roomNo
insert all events into HotRoomsStream ;

Batch Time Window

from TempStream#window.timeBatch(5 min)
select roomNo, avg(temp) as avgTemp
group by roomNo
insert all events into HotRoomsStream ;

Pattern 4: Joining Event Streams

o What? Create a new event stream by joining multiple streams

o The complication comes with time, so we need at least one window

o Often used with a window

o Use cases?
  o To detect when a player has kicked the ball in a football game
  o To correlate TempStream with the state of the regulator and trigger control commands

Join with Storm
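The slide shows the Storm side as a diagram. A hand-written join has to buffer one side and match on the key; below is a minimal sketch under assumed names (both streams wired into the bolt and distinguished by stream id, which is not from the original slides):

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

import java.util.HashMap;
import java.util.Map;

public class TempRegulatorJoinBolt extends BaseBasicBolt {
    // Rooms whose regulator is currently off, mapped to the regulator's deviceID
    // (the length(1) window of the Siddhi query below)
    private final Map<Integer, Long> offRegulators = new HashMap<Integer, Long>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        int roomNo = tuple.getIntegerByField("roomNo");
        if ("regulator".equals(tuple.getSourceStreamId())) {
            if (tuple.getBooleanByField("isOn")) {
                offRegulators.remove(roomNo);
            } else {
                offRegulators.put(roomNo, tuple.getLongByField("deviceID"));
            }
        } else { // temperature event
            double temp = tuple.getDoubleByField("temp");
            Long regulatorID = offRegulators.get(roomNo);
            // Same condition as the Siddhi join: hot room while regulator is off
            if (temp > 30.0 && regulatorID != null) {
                collector.emit(new Values(roomNo, regulatorID, "start"));
            }
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("roomNo", "deviceID", "action"));
    }
}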

Join

define stream TempStream(deviceID long, roomNo int, temp double);

define stream RegulatorStream(deviceID long, roomNo int, isOn bool);

In CEP (Siddhi)


from TempStream[temp > 30.0]#window.time(1 min) as T
  join RegulatorStream[isOn == false]#window.length(1) as R
  on T.roomNo == R.roomNo
select T.roomNo, R.deviceID, 'start' as action
insert into RegulatorActionStream ;


Pattern 5: Data Correlation, Missing Events, and Erroneous Data

o What? Find correlations and use them to detect and handle missing and erroneous data

o Use cases?
  o Detecting a missing event (e.g., detect a customer request that has not been responded to within 1 hour of its reception)
  o Detecting erroneous data (e.g., detecting failed sensors using a set of sensors that monitor overlapping regions; we can use the redundant data to find erroneous sensors and remove their readings from further processing)

Missing Event in Storm
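The Storm slide is a diagram. One hand-rolled approach keeps pending requests in a map and alerts when a request outlives the timeout; this is a minimal sketch with assumed stream ids and fields (a production version would use tick tuples so alerts also fire on an idle stream):

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

public class MissingResponseBolt extends BaseBasicBolt {
    private static final long TIMEOUT_MS = 60L * 60 * 1000; // 1 hour
    // Pending request id -> arrival time, kept in arrival order
    private final Map<String, Long> pending = new LinkedHashMap<String, Long>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        long now = System.currentTimeMillis();
        String id = tuple.getStringByField("id");
        if ("requests".equals(tuple.getSourceStreamId())) {
            pending.put(id, now);
        } else { // a response arrived: the request was handled in time
            pending.remove(id);
        }
        // Alert on requests older than the timeout; insertion order lets us stop early
        Iterator<Map.Entry<String, Long>> it = pending.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> e = it.next();
            if (now - e.getValue() < TIMEOUT_MS) break;
            collector.emit(new Values(e.getKey()));
            it.remove();
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("id"));
    }
}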

Missing Event in CEP

In CEP (Siddhi)

from RequestStream#window.time(1 hour)
insert expired events into ExpiryStream ;

from r1=RequestStream -> r2=Response[id == r1.id] or r3=ExpiryStream[id == r1.id]
select r1.id as id ...
having r2.id == null
insert into AlertStream ;

Pattern 6: Interacting with Databases

o What? Combine realtime data against historical data

o Use Cases?

o On a transaction, looking up the customer’s age using their ID from the customer database to detect fraud (enrichment)

o Checking a transaction against blacklists and whitelists in the database

o Receive an input from the user (e.g., the daily discount amount may be updated in the database, and then the query will pick it up automatically without human intervention)

In Storm
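The Storm version (a diagram on the slide) typically opens a connection in prepare() and queries per tuple. A minimal sketch follows; the JDBC URL, credentials, and table are placeholders, and a real bolt would cache lookups:

import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Map;

public class CardUserLookupBolt extends BaseBasicBolt {
    private transient Connection db;

    @Override
    public void prepare(Map stormConf, TopologyContext context) {
        try {
            // Placeholder JDBC settings; opened here rather than in the
            // constructor because bolt instances are serialized to workers
            db = DriverManager.getConnection("jdbc:mysql://localhost/cards", "user", "pass");
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        long cardNo = tuple.getLongByField("cardNo");
        double price = tuple.getDoubleByField("price");
        try (PreparedStatement ps =
                 db.prepareStatement("SELECT name FROM UserTable WHERE cardNum = ?")) {
            ps.setLong(1, cardNo);
            ResultSet rs = ps.executeQuery();
            if (rs.next()) { // enrich the event with the customer's name
                collector.emit(new Values(cardNo, rs.getString("name"), price));
            }
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("cardNo", "name", "price"));
    }
}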

Querying Databases

In CEP (Siddhi)

Event Table

define table CardUserTable (name string, cardNum long) ;

@from(eventtable = 'rdbms', datasource.name = 'CardDataSource',
      table.name = 'UserTable', caching.algorithm = 'LRU')
define table CardUserTable (name string, cardNum long) ;

Cache types supported
● Basic: A size-based algorithm based on FIFO
● LRU (Least Recently Used): The least recently used event is dropped when the cache is full
● LFU (Least Frequently Used): The least frequently used event is dropped when the cache is full

Join : Event Table

define stream Purchase (price double, cardNo long, place string);

define table CardUserTable (name string, cardNum long) ;

from Purchase#window.length(1) join CardUserTable
  on Purchase.cardNo == CardUserTable.cardNum
select Purchase.cardNo as cardNo, CardUserTable.name as name, Purchase.price as price
insert into PurchaseUserStream ;

Insert : Event Table

define stream FraudStream (price double, cardNo long, userName string);

define table BlacklistedUserTable (name string, cardNum long) ;

from FraudStream
select userName as name, cardNo as cardNum
insert into BlacklistedUserTable ;

Update : Event Table

define stream LoginStream (userID string, islogin bool, loginTime long);

define table LastLoginTable (userID string, time long) ;

from LoginStream
select userID, loginTime as time
update LastLoginTable
  on LoginStream.userID == LastLoginTable.userID ;

Pattern 7: Detecting Temporal Event Sequence Patterns

o What? Detect a temporal sequence of events or conditions arranged in time

o Use cases?
  o Detect suspicious activities like a small transaction immediately followed by a large transaction
  o Detect ball possession in a football game
  o Detect suspicious financial patterns like large buy and sell behaviour within a small time period

In Storm
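On the Storm side (a diagram in the original), event sequences are usually hand-coded as a small state machine keyed by the correlating field. A minimal sketch for the “small purchase followed by a large one” pattern, with assumed names and no eviction of stale state:

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

import java.util.HashMap;
import java.util.Map;

public class FraudSequenceBolt extends BaseBasicBolt {
    private static final long WINDOW_MS = 24L * 60 * 60 * 1000; // within 1 day
    // cardNo -> time of the last small (< 100) purchase
    private final Map<Long, Long> smallPurchase = new HashMap<Long, Long>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        long now = System.currentTimeMillis();
        long cardNo = tuple.getLongByField("cardNo");
        double price = tuple.getDoubleByField("price");
        Long smallAt = smallPurchase.get(cardNo);
        if (price > 10000 && smallAt != null && now - smallAt <= WINDOW_MS) {
            // Sequence hit: small purchase followed by a very large one
            collector.emit(new Values(cardNo, price, tuple.getStringByField("place")));
            smallPurchase.remove(cardNo);
        } else if (price < 100) {
            smallPurchase.put(cardNo, now);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("cardNo", "price", "place"));
    }
}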

Pattern

In CEP (Siddhi)

Pattern

define stream Purchase (price double, cardNo long, place string);

from every (a1 = Purchase[price < 100] -> a3 = ..) ->
     a2 = Purchase[price > 10000 and a1.cardNo == a2.cardNo]
within 1 day
select a1.cardNo as cardNo, a2.price as price, a2.place as place
insert into PotentialFraud ;

Pattern 8: Tracking

o What? Tracking something over space or time and detecting its overall trend

o Use cases?
  o Tracking a fleet of vehicles, making sure that they adhere to speed limits, routes, and geo-fences
  o Tracking wildlife, making sure they are alive (they will not move if they are dead) and making sure they do not go out of the reservation
  o Tracking airline luggage and making sure it has not been sent to the wrong destination
  o Tracking a logistics network and figuring out bottlenecks and unexpected conditions

TFL: Traffic Analytics

Built using TFL (Transport for London) open data feeds: http://goo.gl/9xNiCm http://goo.gl/04tX6k

Pattern 9: Detecting Trends

o What? Tracking something over space and time and detecting given conditions

o Useful in stock markets, SLA enforcement, auto scaling, predictive maintenance

o Use cases?
  o Rise and fall of values, and turns (a switch from a rise to a fall)
  o Outliers: values that deviate from the current trend by a large amount
  o Complex trends like “Triple Bottom” and “Cup and Handle” [17]

Trend in Storm

Build and apply a state machine
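For illustration, a minimal sketch of such a state machine detecting a continuously rising temperature per room (names assumed; the Siddhi sequence on the next slide is the declarative equivalent):

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

import java.util.HashMap;
import java.util.Map;

public class RisingTrendBolt extends BaseBasicBolt {
    // Per-room state: [0] = last temperature seen, [1] = temperature that started the rise
    private final Map<Integer, double[]> state = new HashMap<Integer, double[]>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        int roomNo = tuple.getIntegerByField("roomNo");
        double temp = tuple.getDoubleByField("temp");
        double[] s = state.get(roomNo);
        if (s == null || temp <= s[0]) {
            // Not rising (or first event): restart the trend at this value
            state.put(roomNo, new double[]{temp, temp});
        } else {
            // Still rising: extend the run and report initial vs. current temperature
            s[0] = temp;
            collector.emit(new Values(roomNo, s[1], temp));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("roomNo", "initialTemp", "finalTemp"));
    }
}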

In CEP (Siddhi)

Sequence

from t1=TempStream,
     t2=TempStream[(isNull(t2[last].temp) and t1.temp < temp) or
                   (t2[last].temp < temp and not(isNull(t2[last].temp)))]+
within 5 min
select t1.temp as initialTemp, t2[last].temp as finalTemp, t1.deviceID, t1.roomNo
insert into IncreasingHotRoomsStream ;

In CEP (Siddhi)

Partition

partition with (roomNo of TempStream)
begin
  from t1=TempStream,
       t2=TempStream[(isNull(t2[last].temp) and t1.temp < temp) or
                     (t2[last].temp < temp and not(isNull(t2[last].temp)))]+
  within 5 min
  select t1.temp as initialTemp, t2[last].temp as finalTemp, t1.deviceID, t1.roomNo
  insert into IncreasingHotRoomsStream ;
end;

Detecting Trends in Real Life

o The paper “A Complex Event Processing Toolkit for Detecting Technical Chart Patterns” (HPBC 2015) used this idea to identify stock chart patterns

o Used kernel regression for smoothing and detected maxima and minima

o Then any pattern can be written as a temporal event sequence

Pattern 10: Lambda Architecture

o What? Run the same query in both realtime and batch pipelines. This uses realtime analytics to fill the lag in batch analytics results.

o Also called the “Lambda Architecture”, coined by Nathan Marz; see also Jay Kreps’ “Questioning the Lambda Architecture”

o Use cases?
  o For example, if batch processing takes 15 minutes, results would always lag 15 minutes behind the current data. Here realtime processing fills the gap.

Lambda Architecture. How?

Pattern 11: Detecting and Switching to Detailed Analysis

o What? Detect a condition that suggests some anomaly, and further analyze it using historical data

o Use cases?
  o Use basic rules to detect fraud (e.g., a large transaction), then pull all transactions done against that credit card for a larger time period (e.g., 3 months of data) from the batch pipeline and run a detailed analysis
  o While monitoring weather, detect conditions like high temperature or low pressure in a given region, and then start a high-resolution localized forecast for that region
  o Detect good customers (e.g., through expenditure of more than $1000 within a month), and then run a detailed model to decide the potential of offering a deal

Pattern 11: How?

Pattern 12: Using a Machine Learning Model

o What? The idea is to train a model (often a machine learning model), and then use it within the realtime pipeline to make decisions
  o For example, you can build a model using R, export it as PMML (Predictive Model Markup Language), and use it within your realtime pipeline

o Use cases?
  o Fraud Detection
  o Segmentation
  o Predict Churn

Predictive Analytics

o Build models and use them with WSO2 CEP, BAM, and ESB using the upcoming WSO2 Machine Learner product (2015 Q2)

o Build models using R, export them as PMML, and use them within WSO2 CEP

o Call R scripts from CEP queries

In CEP (Siddhi)

PMML Model

from TransactionStream#ml:applyModel('/path/logisticRegressionModel1.xml',
                                     timestamp, amount, ip)
insert into PotentialFraudsStream ;

Pattern 13: Online Control

o What? Control something online. This involves problems like current situation awareness, predicting next value(s), and deciding on corrective actions.

o Use cases?
  o Autopilot
  o Self-driving
  o Robotics

Fraud Demo

Scaling & HA for Pattern Implementations

So how do we scale a system?

o Vertical Scaling

o Horizontal Scaling

Vertical Scaling

Horizontal Scaling

E.g. Calculate Mean

How about scaling median?

If and only if we can partition!

Scalable Realtime solutions ...

Spark Streaming

o Supports distributed processing
o Runs micro batches
o Does not support pattern & sequence detection


Apache Storm

o Supports distributed processing
o Stream processing engine

Why not use Apache Storm ?

Advantages

o Supports distributed processing

o Supports Partitioning

o Extendable

o Opensource

Disadvantages

o Need to write Java code

o Need to start from basic principles (& data structures)

o Adoption for change is slow

o No support to govern artifacts

WSO2 CEP += Apache Storm

Advantages

o Supports distributed processing

o Supports Partitioning

o Extendable

o Opensource

Disadvantages addressed

o No need to write Java code (supports an SQL-like query language)

o No need to start from basic principles (supports a high-level language)

o Adoption for change is fast

o Govern artifacts using Toolboxes

o etc ...

How do we scale?

How do we scale ...

Scaling with Storm

Siddhi QL

define stream StockStream (symbol string, volume int, price double);

@name('Filter Query')
from StockStream[price > 75]
select *
insert into HighPriceStockStream ;

@name('Window Query')
from HighPriceStockStream#window.time(10 min)
select symbol, sum(volume) as sumVolume
insert into ResultStockStream ;

Siddhi QL - with partition

define stream StockStream (symbol string, volume int, price double);

@name('Filter Query')
from StockStream[price > 75]
select *
insert into HighPriceStockStream ;

@name('Window Query')
partition with (symbol of HighPriceStockStream)
begin
  from HighPriceStockStream#window.time(10 min)
  select symbol, sum(volume) as sumVolume
  insert into ResultStockStream ;
end;

Siddhi QL - distributed

define stream StockStream (symbol string, volume int, price double);

@name('Filter Query') @dist(parallel = '3')
from StockStream[price > 75]
select *
insert into HighPriceStockStream ;

@name('Window Query') @dist(parallel = '2')
partition with (symbol of HighPriceStockStream)
begin
  from HighPriceStockStream#window.time(10 min)
  select symbol, sum(volume) as sumVolume
  insert into ResultStockStream ;
end;

On Storm UI


High Availability

HA / Persistence

o Option 1: Side by side
  o Recommended
  o Takes 2X hardware
  o Gives zero downtime

o Option 2: Snapshot and restore
  o Uses less hardware
  o Will lose events between snapshots
  o Downtime during recovery
  o ** In some scenarios you can use event tables to keep intermediate state

Siddhi Extensions

● Function extension
● Aggregator extension
● Window extension
● Transform extension

Siddhi Query : Function Extension

from TempStream
select deviceID, roomNo,
       custom:toKelvin(temp) as tempInKelvin,
       'K' as scale
insert into OutputStream ;

Siddhi Query : Aggregator Extension

from TempStream
select deviceID, roomNo, temp,
       custom:stdev(temp) as stdevTemp,
       'C' as scale
insert into OutputStream ;

Siddhi Query : Window Extension

from TempStream#window.custom:lastUnique(roomNo, 2 min)
select *
insert into OutputStream ;

Siddhi Query : Transform Extension

from XYZSpeedStream#transform.custom:getVelocityVector(v, vx, vy, vz)
select velocity, direction
insert into SpeedStream ;