How Spark is Enabling the New Wave of Converged Applications

© 2016 MapR Technologies


Balaji Mohanam and Carol McDonald

September, 2016


Today’s Presenters

Carol McDonald, Solutions Architect

Balaji Mohanam, Product Manager


Agenda

• Market Trends

• What’s Needed for Converged Applications

• Customer Use Cases

• Demo of MapR Streams with Spark Streaming


Analytics & ETL: Batch or Streaming?

[Figure: Value (vertical axis) versus Time (horizontal axis)]


Analytic Categories

Descriptive | Predictive | Streaming | Prescriptive

Data-At-Rest Data-In-Motion Future

• What happened

• Why did it happen

• Discovery in nature

• Batch analytics

• What will happen

• Combines historical data with rules and algorithms

• ML (Batch + Real Time)

• What + When + Why

• Suggestions to take advantage of future opportunity or mitigate risks

• Volume, velocity and variety

• Agility is key to success.

• Analyse data as it happens

• Triggers and Alarms.

• Anomaly detection

• Continuous ETL and analytics


Decreasing Job Latencies

Hours Mins Secs Milli Secs

Data persistence on-disk

Data persistence in-memory


It was hot at 6:05 yesterday!

Why Stream Processing?

Analyze

6:01 P.M.: 72°
6:02 P.M.: 75°
6:03 P.M.: 77°
6:04 P.M.: 85°
6:05 P.M.: 90°
6:06 P.M.: 85°
6:07 P.M.: 77°
6:08 P.M.: 75°

Batch processing may be too late for some events


Why Stream Processing?

6:05 P.M.: 90°

Topic: Temperature

Turn on the air conditioning!

It’s becoming important to process events as they arrive

Stream
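The contrast between the two "Why Stream Processing?" slides can be sketched in plain Java, with no Spark or MapR Streams dependency. The readings are the ones shown on the slide; the 90° threshold and the names `TemperatureAlert` and `onEvent` are assumptions made only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: react to each reading as it arrives
// instead of scanning the whole batch the next day.
class TemperatureAlert {
    // Assumed threshold; the slides do not state one.
    static final int THRESHOLD = 90;

    private final List<String> actions = new ArrayList<>();

    // Called once per event, as soon as the event arrives.
    void onEvent(String time, int degrees) {
        if (degrees >= THRESHOLD) {
            actions.add(time + ": turn on the air conditioning");
        }
    }

    List<String> actions() {
        return actions;
    }

    public static void main(String[] args) {
        String[] times = {"6:01", "6:02", "6:03", "6:04", "6:05", "6:06", "6:07", "6:08"};
        int[] temps = {72, 75, 77, 85, 90, 85, 77, 75};

        TemperatureAlert alert = new TemperatureAlert();
        for (int i = 0; i < times.length; i++) {
            alert.onEvent(times[i], temps[i]);
        }
        System.out.println(alert.actions());
    }
}
```

The action is recorded at the moment the 6:05 reading is seen, rather than being discovered later by a batch job.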


What’s Needed for Converged Applications


The Trinity of Real Time

Real Time Producers

Topic 1

Topic 2

Global Messaging System

NoSQL Key-Value Database

Spark + MapR-DB Integration

Real Time Operational Analytics

Transformational Tier

Spark + MapR Streams Integration


MapR Converged Data Platform

Open Source Engines & Tools | Commercial Engines & Applications | Cloud and Managed Services | Custom Apps | Search and Others

Data Processing

Unified Management and Monitoring

HDFS API | POSIX, NFS | HBase API | JSON API | Kafka API

Web-Scale Storage: MapR-FS | Database: MapR-DB | Event Streaming: MapR Streams

Enterprise-Grade Platform Services: High Availability | Real Time | Unified Security | Multi-tenancy | Disaster Recovery | Global Namespace


Use Case: Time Series Data in Oil Wells

Data for real-time monitoring

read

Sensor time-stamped data

Spark processing

Spark Streaming

Stream

Topic


What Do We Need to Do?

Data Sources | Collect Data | Process Data | Store Data | Serve Data

? | ? | ? | ?


Scalable Messaging with MapR Streams

Topics are partitioned for throughput and scalability

Partition 1: Topic - Pressure

Partition 1: Topic - Temperature

Partition 1: Topic - Warning







Consumers
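As a rough illustration of why partitioning helps throughput, the sketch below maps a message key to one of several partitions, so that different keys can be handled in parallel. This is not the partitioner used by Kafka or MapR Streams clients; the class name `PartitionSketch`, the partition count, and the pump names other than COHUTTA are assumptions.

```java
// Hypothetical sketch of key-based partition assignment.
// Real Kafka and MapR Streams clients use their own partitioners;
// this only shows the idea that a key always maps to one partition.
class PartitionSketch {
    // Map a message key to a partition index in [0, numPartitions).
    static int partitionFor(String key, int numPartitions) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int partitions = 3; // assumed partition count
        // "PUMP-A" and "PUMP-B" are made-up keys for illustration.
        for (String pump : new String[] {"COHUTTA", "PUMP-A", "PUMP-B"}) {
            System.out.println(pump + " -> partition " + partitionFor(pump, partitions));
        }
    }
}
```

Because the same key always lands in the same partition, per-key ordering is preserved while the partitions are consumed independently.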


Continuous Analytics: Structured Streaming with Spark 2.0

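// This slide uses a pre-release draft of the Structured Streaming API.
// Released Spark 2.0 exposes readStream / writeStream / start() instead.
val records = sqlContext.read.format("json").stream("hdfs://input")
val counts = records.groupBy("user").count()
counts.write
  .trigger(ProcessingTime("5 sec"))
  .outputMode(UpdateInPlace("user"))
  .format("jdbc")
  .startStream("mysql://...")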

Repeated Queries

DB

User      Count
User 1    10
User 2    23
User 3    16
……..      ……..

• Store only the processed output instead of every single record.

• Query executed repeatedly as and when the data arrives.

• Read the result from persistent storage, instead of processing the entire data set, resulting in faster access.


Spark 2.0: Structured Streaming with Spark SQL

Processing Time 1

Input Table: data up to processing time 1

Result Table: output for data at 1

Program Output: Complete output OR Delta output



Output Modes

Delta: writes the records of the query result that changed since the last firing of the trigger. These are physical deltas, not logical deltas: they specify which rows were added and removed, but not the logical difference for a given row.

Append: a special case of Delta mode that does not include removals.

Update (in place): updates the result directly in place (e.g. updates a MySQL table). As with Delta, a primary key must be specified.

Complete: for each run of the query, creates a complete snapshot of the query result.
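To make the difference between these modes concrete, here is a minimal plain-Java sketch that compares two successive result tables (user to count) and derives what each mode would emit. It follows the definitions on this slide only and is not Spark's implementation; the class `OutputModes` and the "+"/"-" row notation are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the output modes described above, applied to
// two successive result tables (user -> count). Not Spark's implementation.
class OutputModes {
    // Complete: a full snapshot of the current result.
    static Map<String, Long> complete(Map<String, Long> current) {
        return new TreeMap<>(current);
    }

    // Delta: physical row changes since the last trigger. A changed row
    // appears as a removal of the old row plus an addition of the new one.
    static List<String> delta(Map<String, Long> previous, Map<String, Long> current) {
        List<String> changes = new ArrayList<>();
        for (Map.Entry<String, Long> e : new TreeMap<>(previous).entrySet()) {
            if (!e.getValue().equals(current.get(e.getKey()))) {
                changes.add("-" + e.getKey() + "=" + e.getValue());
            }
        }
        for (Map.Entry<String, Long> e : new TreeMap<>(current).entrySet()) {
            if (!e.getValue().equals(previous.get(e.getKey()))) {
                changes.add("+" + e.getKey() + "=" + e.getValue());
            }
        }
        return changes;
    }

    // Append: the delta without removals.
    static List<String> append(Map<String, Long> previous, Map<String, Long> current) {
        List<String> added = new ArrayList<>();
        for (String change : delta(previous, current)) {
            if (change.startsWith("+")) {
                added.add(change);
            }
        }
        return added;
    }

    public static void main(String[] args) {
        Map<String, Long> before = Map.of("User 1", 10L, "User 2", 23L);
        Map<String, Long> after = Map.of("User 1", 10L, "User 2", 24L, "User 3", 16L);
        System.out.println("complete: " + complete(after));
        System.out.println("delta:    " + delta(before, after));
        System.out.println("append:   " + append(before, after));
    }
}
```

Note how the update to User 2 shows up in the delta as one removed row and one added row, which is what "physical delta" means here.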


Data Sources | Collect Data | Process Data | Store Data | Serve Data

Stream

Topic


Spark Scheduling Bottleneck

User 1, User 2, User 3, ..., User n

SparkContext: Query Compilation, Storage, Scheduling

Worker 1, Worker 2, Worker 3, Worker 4, ..., Worker n


Latency vs. Concurrency

Type                        Latency     Concurrency
Batch/RTS Analytics         Very Low    Low
Interactive Applications    Very Low    High/Very High


MapR-DB (HBase API) is Designed to Scale

Fast Reads and Writes by Key

Data is automatically partitioned by Key Range

[Diagram: three key ranges, each holding a table with columns Key, Col B, Col C]
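Key-range partitioning can be sketched with a sorted map from each partition's start key to the partition that owns it: a read or write by key then needs only one lookup to find its partition. This is an illustrative sketch, not MapR-DB internals; the class `KeyRangeLookup`, the partition names, and the split points "H" and "P" are all made up.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of key-range partitioning: each partition owns the
// keys from its start key up to the next partition's start key.
// Partition names and split points are made up for illustration.
class KeyRangeLookup {
    private final TreeMap<String, String> startKeyToPartition = new TreeMap<>();

    void addPartition(String startKey, String partitionName) {
        startKeyToPartition.put(startKey, partitionName);
    }

    // Find the partition whose start key is the greatest one <= rowKey.
    String partitionFor(String rowKey) {
        Map.Entry<String, String> entry = startKeyToPartition.floorEntry(rowKey);
        if (entry == null) {
            throw new IllegalArgumentException("No partition covers key: " + rowKey);
        }
        return entry.getValue();
    }

    public static void main(String[] args) {
        KeyRangeLookup table = new KeyRangeLookup();
        table.addPartition("", "partition-1");  // keys before "H"
        table.addPartition("H", "partition-2"); // keys from "H" up to "P"
        table.addPartition("P", "partition-3"); // keys from "P" onward
        System.out.println(table.partitionFor("COHUTTA_3/10/14_1:01"));
    }
}
```

Since rows are sorted by key, rows with a common key prefix (for example one pump's readings) sit next to each other, which keeps range scans cheap.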


What Exactly Do We Need to Do?

Data Sources | Collect Data | Process Data | Store Data | Serve Data

Stream

Topic


Customer Use Cases


Customer 360 & Behavior Prediction

Website Click-Stream

Real Time/Offline ClickStream Analysis

Internal Data Sources

External Data Sources

• Prediction Modelling

• Attribution Modelling

• Cohort Analysis

• Customer Lifetime Value Analysis

• Attrition Modelling

• Response Modelling

• Churn Modelling

Eliminate latency due to data movement between clusters

Eliminate redundant storage with MapR Streams and lower the TCO

360 Degree Customer View

Customer Behavior Prediction

Better Conversion Rate and Lower Attrition $$$

Offline

Real Time

HA, DR, NFS, Snapshots, Data Protection

EDH/EDL

Topic

Support Tickets

DBMS

Email

CRM


Prescriptive Analytics: IoT & Auto Manufacturing

GPS

Telematic Data

Telephone Truck Fleet

Data generated by cars is stored locally

Data Modelling/Secondary ETL: data is converted from a proprietary format to Parquet

• Identify emission patterns

• Route optimization

• Customer service requests

• How does throttling affect other factors such as fuel consumption, emissions, etc.

• Image and video analysis

• Time series analysis for threshold breach

Topic


Interactive Analytics: Risk Analysis (Internal Users)

0-10 days old data cached in memory: 50-100 GB of data.

Data older than 10 days accessed from disk

Analytic Application to submit queries with simple to medium analytic query complexity

User 1

User 2

User 3

Concurrent requests: 3-10

Throughput: 1.5 requests per second

Latency: < 2 seconds

Representative Queries

• List of users who have spent more than $1000 in the last 3 days.

• Group users who spent more than $1000 by country.

Analytic Application

Type of Users: Internal


On-Demand

Pre-Computed

Interactive Analytics: External Customer Facing

Application

Sales Incentive Data

• 60 events/sec

• 10 MB/event

• Table-based topics

Fast Changing Data

Ex: Credit date

Append only (50% of events)

Search Application

Stale Data. Aggregates calculated using Snapshots.

Level 1 Aggregates

Level 2 Aggregates

Level 3 Aggregates

Advanced ML Analytics

Delta Aggregates

Pre-compute analytics with Spark Streaming on Data-in-motion

Topic

DB


Demo


What if BP had detected problems before the oil hit the water?

1M samples/sec

High performance at scale is necessary!


Use Case: Time Series Data


Sensor time-stamped data

Spark processing

read

Spark Streaming

Stream

Topic


Use Case: Time Series Data

Sensor time-stamped data

Stream

Topic

COHUTTA,3/10/14,1:01,10.27,1.73,881,1.56,85,1.94

COHUTTA,3/10/14,1:03,10.47,1.732,882,1.7,92,0.66

COHUTTA,3/10/14,1:02,9.67,1.731,882,0.52,87,1.79

Data: pump ID, date, time, followed by pressure and flow measurements


Schema

• All events stored; CF data could be set to expire data

• Filtered alerts put in CF alerts

• Daily summaries put in CF stats

Row key                | CF data          | CF alerts | CF stats
                       | hz  …  psi       | psi  …    | hz_avg  …  psi_min
COHUTTA_3/10/14_1:01   | 10.37  …  84     | 0         |
COHUTTA_3/10/14        |                  |           | 10  …  0

Row Key contains oil pump name, date, and a time stamp
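The row-key layout can be sketched by deriving both key shapes from one CSV event: pump, date and time for the per-event rows, and pump and date only for the daily summary row. This is a minimal plain-Java sketch; the class `RowKeys` and its method names are assumptions, and the field order is taken from the sample data shown earlier.

```java
// Hypothetical sketch: build the row keys described above from one CSV event.
// Field order follows the sample data (pump id, date, time, then measurements).
class RowKeys {
    // Key for a single event in CF data / CF alerts: pump_date_time.
    static String eventKey(String csvLine) {
        String[] p = csvLine.split(",");
        return p[0] + "_" + p[1] + "_" + p[2];
    }

    // Key for the daily summary row in CF stats: pump_date.
    static String dailyKey(String csvLine) {
        String[] p = csvLine.split(",");
        return p[0] + "_" + p[1];
    }

    public static void main(String[] args) {
        String line = "COHUTTA,3/10/14,1:01,10.27,1.73,881,1.56,85,1.94";
        System.out.println(eventKey(line));
        System.out.println(dailyKey(line));
    }
}
```

Leading the key with the pump name keeps all rows for one pump in one contiguous key range, with the daily summary sorting next to that day's events.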


Data Sources | Collect Data | Process Data | Store Data | Serve Data

Stream

Topic


read

Spark Streaming

Stream

Topic

Use Case Example Code


Sensor time-stamped data Spark processing


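KafkaProducer

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

String topic = "/streams/pump:warning";
public static KafkaProducer<String, String> kafkaProducer;

// 1. Configure KafkaProducer properties
//    (Apache Kafka clients also require key.serializer)
Properties properties = new Properties();
properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

// 2. Create KafkaProducer with properties
kafkaProducer = new KafkaProducer<String, String>(properties);

String txt = "msg text";
// 3. Create a producer record with topic and message
ProducerRecord<String, String> record = new ProducerRecord<String, String>(topic, txt);

// 4. Use the Kafka producer to send the record
kafkaProducer.send(record);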




Create a DStream

DStream: a sequence of RDDs representing a stream of data

val ssc = new StreamingContext(sparkConf, Seconds(5))
// create an input stream for the set of topics
val dStream = KafkaUtils.createDirectStream[String, String](ssc, kafkaParams, topicsSet)

batch time 0 to 1

batch time 1 to 2

batch time 2 to 3

dStream

Stored in memory as an RDD


Message Data to Sensor Object

case class Sensor(resid: String, date: String, time: String, hz: Double, disp: Double,
                  flo: Double, sedPPM: Double, psi: Double, chlPPM: Double)

// Parse a CSV string into a Sensor object
def parseSensor(str: String): Sensor = {
  val p = str.split(",")
  Sensor(p(0), p(1), p(2), p(3).toDouble, p(4).toDouble, p(5).toDouble,
         p(6).toDouble, p(7).toDouble, p(8).toDouble)
}


Process DStream

// Parse message values into Sensor objects
val sensorDStream = dStream.map(_._2).map(parseSensor)

dStream RDDs

batch time 2 to 3

batch time 1 to 2

batch time 0 to 1

sensorDStream RDDs

New RDDs created for every batch

map map map


DataFrame and SQL Operations

// for each RDD in the DStream
sensorDStream.foreachRDD { rdd =>
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._ // needed for rdd.toDF()

  // convert the RDD to a DataFrame and register it as a temporary table
  rdd.toDF().registerTempTable("sensor")

  // get the max, min and avg of the pump values
  val res = sqlContext.sql(
    """SELECT resid, date,
      |  max(hz) as maxhz, min(hz) as minhz, avg(hz) as avghz,
      |  max(disp) as maxdisp, min(disp) as mindisp, avg(disp) as avgdisp,
      |  max(flo) as maxflo, min(flo) as minflo, avg(flo) as avgflo,
      |  max(psi) as maxpsi, min(psi) as minpsi, avg(psi) as avgpsi
      |FROM sensor GROUP BY resid, date""".stripMargin)
  res.show()
}
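To show what the GROUP BY resid, date query computes, the following plain-Java sketch reproduces the max, min and avg for the hz column only, using the three sample events from the earlier slide. It is an illustration of the aggregation, not a replacement for the Spark SQL query; the class `HzStats` and its method name are assumptions.

```java
import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical plain-Java equivalent of the GROUP BY resid, date query,
// restricted to the hz column to keep the sketch short.
class HzStats {
    // Group CSV lines by "resid_date" and summarize hz (the 4th field).
    static Map<String, DoubleSummaryStatistics> hzByPumpAndDate(List<String> csvLines) {
        Map<String, DoubleSummaryStatistics> stats = new TreeMap<>();
        for (String line : csvLines) {
            String[] p = line.split(",");
            stats.computeIfAbsent(p[0] + "_" + p[1], k -> new DoubleSummaryStatistics())
                 .accept(Double.parseDouble(p[3]));
        }
        return stats;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "COHUTTA,3/10/14,1:01,10.27,1.73,881,1.56,85,1.94",
            "COHUTTA,3/10/14,1:03,10.47,1.732,882,1.7,92,0.66",
            "COHUTTA,3/10/14,1:02,9.67,1.731,882,0.52,87,1.79");
        hzByPumpAndDate(lines).forEach((key, s) ->
            System.out.println(key + " max=" + s.getMax()
                + " min=" + s.getMin() + " avg=" + s.getAverage()));
    }
}
```

In the streaming job the same aggregation runs once per batch RDD, so each result covers only the events that arrived in that batch interval.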


Streaming Application Output


Save to HBase

rdd.map(Sensor.convertToPut).saveAsHadoopDataset(jobConfig)

linesRDD DStream

sensorRDD DStream

output operation: persist data to external storage

Put objects written to HBase

batch time 2 to 3

batch time 1 to 2

batch time 0 to 1

map map map

save save save


Start Receiving Data

sensorDStream.foreachRDD { rdd =>
  . . .
}

// Start the computation
ssc.start()
// Wait for the computation to terminate
ssc.awaitTermination()


Stream Processing

Building a Complete Data Architecture

MapR File System (MapR-FS)

MapR Converged Data Platform

MapR Database (MapR-DB)

MapR Streams

Sources/Apps Bulk Processing


Q & A

Engage with us!

1. Read the explanation and download the code
– https://www.mapr.com/blog/fast-scalable-streaming-applications-mapr-streams-spark-streaming-and-mapr-db
– https://www.mapr.com/blog/spark-streaming-hbase

2. Get Started: MapR Converged Data Platform https://www.mapr.com/get-started-with-mapr

3. Get Answers: MapR Converge Community https://community.mapr.com/community/answers

4. Get Trained: MapR On-Demand Training https://learn.mapr.com

