Transcript
Page 1: Spark Summit - Stratio Streaming
Page 2: Spark Summit - Stratio Streaming

Stratio is the only Big Data platform able to combine, in one query, stored data with streaming data in real time (in less than 30 seconds).

We are polyglots as well: we use Spark over two NoSQL databases, Cassandra and MongoDB.

Page 3: Spark Summit - Stratio Streaming
Page 4: Spark Summit - Stratio Streaming

Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data, and in fact is represented as a sequence of RDDs, which is Spark’s abstraction of an immutable, distributed dataset.
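As a rough mental model (plain Python, not Spark's actual classes), discretization and per-batch transformation look like this:

```python
# Illustrative sketch only: a DStream pictured as a sequence of micro-batches,
# where each batch plays the role of one RDD.

def discretize(events, batch_size):
    """Split a continuous event feed into fixed-size micro-batches."""
    return [events[i:i + batch_size] for i in range(0, len(events), batch_size)]

def dstream_map(batches, func):
    """A map over a DStream applies func independently to every batch (RDD)."""
    return [[func(e) for e in batch] for batch in batches]

stream = discretize([1, 2, 3, 4, 5], batch_size=2)   # [[1, 2], [3, 4], [5]]
doubled = dstream_map(stream, lambda x: x * 2)       # [[2, 4], [6, 8], [10]]
```

This is why every DStream operation has per-batch semantics: it is really an RDD operation applied to each batch in turn.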

Shark (SQL)

Spark Streaming

MLlib (machine learning)

GraphX (graph)

Page 5: Spark Summit - Stratio Streaming

• map(func), flatMap(func), filter(func), count()

• repartition(numPartitions)

• union(otherStream)

• reduce(func), countByValue(), reduceByKey(func, [numTasks])

• join(otherStream, [numTasks]), cogroup(otherStream, [numTasks])

• transform(func)

• updateStateByKey(func)

• window(windowLength, slideInterval)

• countByWindow(windowLength, slideInterval)

• reduceByWindow(func, windowLength, slideInterval)

• reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks])

• countByValueAndWindow(windowLength, slideInterval, [numTasks])

• print()

• foreachRDD(func)

• saveAsObjectFiles(prefix, [suffix])

• saveAsTextFiles(prefix, [suffix])

• saveAsHadoopFiles(prefix, [suffix])
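The semantics of one of the windowed operations above, reduceByKeyAndWindow, can be sketched in plain Python, treating the stream as a list of micro-batches of (key, value) pairs (an illustration only, not Spark's implementation):

```python
# Sketch: for each batch interval, reduce all (key, value) pairs seen in the
# last `window_length` batches. Data shapes are invented for illustration.

def reduce_by_key_and_window(batches, func, window_length):
    results = []
    for i in range(len(batches)):
        # gather every pair in the current window of batches
        window = [pair
                  for b in batches[max(0, i - window_length + 1): i + 1]
                  for pair in b]
        acc = {}
        for k, v in window:
            acc[k] = func(acc[k], v) if k in acc else v
        results.append(acc)
    return results

batches = [[("a", 1)], [("a", 2), ("b", 3)], [("b", 4)]]
reduce_by_key_and_window(batches, lambda x, y: x + y, window_length=2)
# [{'a': 1}, {'a': 3, 'b': 3}, {'a': 2, 'b': 7}]
```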

Page 6: Spark Summit - Stratio Streaming

Complex event processing, or CEP, is event processing that combines data from multiple sources to infer events or patterns that suggest more complicated circumstances.

As a technique, CEP helps discover complex events by analyzing and correlating other events.
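As a toy illustration of that idea (event shapes and the "burst" rule are invented, not any engine's API), correlating several low-level events within a time window can yield one higher-level inferred event:

```python
# Sketch: infer a complex event when `threshold` events of a given kind
# occur within `window` time units of each other.

def detect_bursts(events, kind, threshold, window):
    """events: list of (timestamp, kind) tuples, in time order."""
    times = [t for t, k in events if k == kind]
    alerts = []
    for i in range(len(times) - threshold + 1):
        if times[i + threshold - 1] - times[i] <= window:
            alerts.append(("burst", kind, times[i + threshold - 1]))
    return alerts

events = [(0, "login_fail"), (1, "login_ok"), (2, "login_fail"), (3, "login_fail")]
detect_bursts(events, "login_fail", threshold=3, window=3)
# three login_fail events within 3 time units -> one inferred complex event
```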

Page 7: Spark Summit - Stratio Streaming

A CEP engine should provide operators over streams, keeping in mind that events and streams in a CEP are first-class citizens. In CEP, we think in terms of event streams: an event stream is a sequence of events that arrives over time.

Users provide queries to the CEP engine whose main mission is matching those queries against events coming through event streams.

A CEP engine thus has a notion of time, and it allows temporal queries that reason in terms of temporal concepts such as "time windows" or "before and after" event relationships, among others.

• Filter

• Join

• Aggregation (Avg, Sum, Min, Max, Custom)

• Group by

• Having

• Conditions and Expressions (and, or, not, true/false, ==, !=, >=, >, <=, <)

• Data types (boolean, string, int, long, float, double)

• Pattern processing

• Sequence processing (zero to many, one to many, and zero to one)

Page 8: Spark Summit - Stratio Streaming

• You still have to integrate it in your code

• There is nothing like an interactive console

• If you want to do something with the streams, you guessed it, you have to code it!

• There is no way to remotely listen to a stream

• There are no solution patterns ready-to-use with the engine

• No statistics, no auditing

• Hard to integrate with other tools (dashboarding, log stream, batch processing)

Page 9: Spark Summit - Stratio Streaming
Page 10: Spark Summit - Stratio Streaming

With this solution you can use our API to send commands to the Stratio Streaming engine from your code.

And you can also work with the interactive shell to test your queries or interact with the engine on demand.

Both tools, in fact, hide that you are sending messages to a complex engine built with Zookeeper, Kafka, Spark Streaming and the Siddhi CEP engine.
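Both the API and the shell ultimately publish command messages to the engine over Kafka. A rough sketch of that decoupling, in plain Python, with an in-process queue standing in for the broker and an invented message shape (not Stratio's real wire format):

```python
# Sketch only: queue.Queue stands in for Kafka; the JSON message fields
# ("operation", "stream", ...) are assumptions for illustration.
import json
import queue

broker = queue.Queue()

def send_command(operation, stream, **params):
    """Client side (API or shell): serialize a command and publish it."""
    broker.put(json.dumps({"operation": operation, "stream": stream, **params}))

def engine_poll():
    """Engine side: consume and decode the next pending command."""
    return json.loads(broker.get())

send_command("create", "testStream", definition="name.string,data.double")
cmd = engine_poll()
# cmd == {'operation': 'create', 'stream': 'testStream',
#         'definition': 'name.string,data.double'}
```

The point of the broker in between is that clients never talk to Spark or Siddhi directly; they only produce messages.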

Page 11: Spark Summit - Stratio Streaming

[Architecture diagram: requests and events flow through Kafka and Zookeeper into the engine; Cassandra is the data store.]

Page 12: Spark Summit - Stratio Streaming

[Diagram: Kafka and Cassandra.]

Page 13: Spark Summit - Stratio Streaming
Page 14: Spark Summit - Stratio Streaming

• create --stream testStream --definition "name.string,data.double"

• insert --stream testStream --values "name.Temperature, field.testValue, data.33"

• save cassandra start --stream testStream

• alter --stream testStream --definition "field.string"

Page 15: Spark Summit - Stratio Streaming
Page 16: Spark Summit - Stratio Streaming

CREATE --stream testStream --definition (name.string, data.double, data2.int, data3.float, data4.double, trueorfalse.boolean)

Page 17: Spark Summit - Stratio Streaming

There are a lot of CEP operators that you can use in your queries:

Filtering

Projection

In-built functions

Windows (time and length)

Join

Event Sequences

Event Patterns

Output rate limiting

Custom windows, custom functions

from sensor_grid #window.length(10) select name, ind, avg(data) as data group by name insert into sensor_grid_avg for current-events

Page 18: Spark Summit - Stratio Streaming

1. >, <, ==, >=, <=, !=

2. contains, instanceof

3. and, or, not

1. sum, avg, max, min, count: when aggregated (group by, having)

2. Field Type Conversion

3. Coalesce: if a field is null, take another field

4. IsMatch: true or false if match regex

from orders[price >= 20 and price < 100]…

from orders select * insert into ordersB…

from orders select client, price insert into ordersB…
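The filter and projection examples above map onto plain Python roughly as follows (event fields invented for illustration):

```python
# Sketch: events as dicts; a Siddhi filter is a predicate over fields,
# a projection keeps only the selected fields.
orders = [{"client": "ann", "price": 15}, {"client": "bob", "price": 40}]

# from orders[price >= 20 and price < 100] ...
filtered = [e for e in orders if 20 <= e["price"] < 100]

# from orders select client, price insert into ordersB ...
projected = [{"client": e["client"], "price": e["price"]} for e in filtered]
# projected == [{'client': 'bob', 'price': 40}]
```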

Page 19: Spark Summit - Stratio Streaming

1. Length window - a sliding window that keeps the last N events.

2. Time window - a sliding window that keeps events that have arrived within the last T time period.

3. Time and Length batch windows - same concept, but events are output only at the end of the given window

4. Unique window - keeps only the latest events that are unique according to the given unique attribute.

5. First unique window - keeps the first events that are unique according to the given unique attribute.

6. External Time Window - a sliding window that processes according to timestamps defined externally

from payments[channel == 'Paypal']#window.time(1 min)
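The two basic window types can be mimicked in a few lines of plain Python (the event representation is invented for illustration):

```python
# Sketch: events as (timestamp, payload) tuples in arrival order.

def length_window(events, n):
    """Sliding length window: keep only the last n events."""
    return events[-n:]

def time_window(events, now, period):
    """Sliding time window: keep events whose timestamp is within
    `period` time units of `now`."""
    return [(t, e) for t, e in events if now - t <= period]

events = [(0, "a"), (50, "b"), (70, "c")]
length_window(events, 2)         # [(50, 'b'), (70, 'c')]
time_window(events, 80, 30)      # [(50, 'b'), (70, 'c')]
```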

Page 20: Spark Summit - Stratio Streaming

• With “on <condition>”, joins only the events that match the condition

• With “within <time>”, joins only the events that are within said time of each other

from errorStream#window.length(1) as errorStream join allStream#window.length(1) as allStream on errorStream.numberOfErrors > allStream.totalNumberOfEvents*0.05 select * insert into alarmByThreshold;
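In plain Python, that join amounts to something like this (stream and field names come from the query; the event shapes are assumptions):

```python
# Sketch: join the latest event of each one-event window and emit an alarm
# when errors exceed 5% of total events (the on-condition from the query).

def join_latest(error_events, all_events):
    e, a = error_events[-1], all_events[-1]
    if e["numberOfErrors"] > a["totalNumberOfEvents"] * 0.05:
        return {**e, **a}   # select * over the joined event
    return None

join_latest([{"numberOfErrors": 8}], [{"totalNumberOfEvents": 100}])
# 8 > 5.0, so an alarm event is produced
```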

Page 21: Spark Summit - Stratio Streaming

from every (a1 = infoStock[action == "buy"] -> a2 = confirmOrder[command == "OK"]) -> b1 = StockExchangeStream[price > infoStock.price]
within 3000 select a1.action as action, b1.price as price insert into StockQuote

from every a1 = infoStock[action == "buy"]+, b1 = StockExchangeStream[price > 70]?, b2 = StockExchangeStream[price >= 75]
select a1[0].action as action, b1.price as priceA, b2.price as priceB insert into StockQuote

Page 22: Spark Summit - Stratio Streaming
Page 23: Spark Summit - Stratio Streaming

Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.

Stratio Ingestion is an ETL-for-Big-Data product based on Flume.

Design your workflows (WYSIWYG) with useful and improved sources and sinks, and transform your data on the fly.

• Create the stream if it doesn’t exist

• It is possible to send only filtered event flows to the streaming engine

• Built on the StratioStreaming API.

Page 24: Spark Summit - Stratio Streaming

• Call-center Real-time monitoring

Real-time detection of client churn risk

Natural Language Processing analysis to detect incidents in real time

Anomaly detection in the service based on patterns

• IT services monitoring

DoS attack detection, hotlinking, etc. in real time

Warnings in monitoring of heterogeneous services

Preventive detection of downtime based on patterns

• Sensor grid monitoring

Alarms when thresholds are reached

Complex alarms involving several sensors

Real-time monitoring (landing support devices in an airport, for example)

Data Machine Intelligence

Page 25: Spark Summit - Stratio Streaming

SELECT sum(order.quantity), company_data.country
FROM streaming.order WITH WINDOW 15 minutes
INNER JOIN batch.company_data ON order.company = company_data.company_name;


• With a powerful query planner

• Able to perform mixed queries with streaming and batch data

SQL query example, mixing real-time data (coming from Stratio Streaming Engine) and batch data (stored in a noSQL database)
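A plain-Python sketch of what such a mixed query computes (table and field names taken from the SQL above; the data itself is invented):

```python
# Sketch: join a 15-minute window of streaming orders against a static
# batch table, then sum quantities per country.

batch_company = {"acme": "ES", "globex": "US"}            # batch.company_data

window_orders = [("acme", 2), ("acme", 3), ("globex", 1)]  # streaming.order window

totals = {}
for company, qty in window_orders:
    country = batch_company[company]                 # INNER JOIN on company name
    totals[country] = totals.get(country, 0) + qty   # sum(order.quantity) per country

# totals == {'ES': 5, 'US': 1}
```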

Page 26: Spark Summit - Stratio Streaming
Page 27: Spark Summit - Stratio Streaming

We are first going to use the Shell to create streams and queries.

Page 28: Spark Summit - Stratio Streaming
Page 29: Spark Summit - Stratio Streaming
Page 30: Spark Summit - Stratio Streaming
Page 31: Spark Summit - Stratio Streaming