© 2016 MapR Technologies
How Spark is Enabling the New Wave of Converged Applications
Balaji Mohanam and Carol McDonald
September, 2016
Today’s Presenters
Carol McDonald, Solutions Architect
Balaji Mohanam, Product Manager
Agenda
• Market Trends
• What’s Needed for Converged Applications
• Customer Use Cases
• Demo of MapR Streams with Spark Streaming
Analytics & ETL: Batch or Streaming?
[Figure: Value versus Time]
Analytic Categories
Descriptive | Predictive | Prescriptive | Streaming
Data-At-Rest Data-In-Motion Future
• What happened
• Why did it happen
• Discovery in nature
• Batch analytics
• What will happen
• Combines historical data with rules and algorithms
• ML (Batch + Real Time)
• What + When + Why
• Suggestions to take advantage of future opportunity or mitigate risks
• Volume, velocity and variety
• Agility is key to success.
• Analyse data as it happens
• Triggers and Alarms.
• Anomaly detection
• Continuous ETL and analytics
Decreasing Job Latencies
Hours Mins Secs Milli Secs
Data persistence on-disk
Data persistence in-memory
Why Stream Processing?
Readings: 6:01 P.M.: 72°, 6:02 P.M.: 75°, 6:03 P.M.: 77°, 6:04 P.M.: 85°, 6:05 P.M.: 90°, 6:06 P.M.: 85°, 6:07 P.M.: 77°, 6:08 P.M.: 75°
Analyze: 90°
It was hot at 6:05 yesterday!
Batch processing may be too late for some events
Why Stream Processing?
6:05 P.M.: 90°
Topic: Temperature
Turn on the air conditioning!
It’s becoming important to process events as they arrive
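A minimal sketch of the difference, using the readings from the previous slide: each reading is handled at the moment it arrives rather than in a later batch run. The class name, the method name, and the 90-degree threshold are assumptions made for this illustration only.

```java
class TemperatureAlerts {
    // Hypothetical threshold, chosen only for this illustration.
    static final int THRESHOLD = 90;

    // Called once per reading, at the moment the reading arrives.
    // Returns an action to take, or null when nothing needs to happen.
    static String onReading(String time, int degrees) {
        if (degrees >= THRESHOLD) {
            return time + ": turn on the air conditioning";
        }
        return null;
    }

    public static void main(String[] args) {
        // Readings taken from the slide, fed in arrival order.
        String[] times = {"6:01", "6:02", "6:03", "6:04", "6:05", "6:06", "6:07", "6:08"};
        int[] degrees = {72, 75, 77, 85, 90, 85, 77, 75};
        for (int i = 0; i < times.length; i++) {
            String action = onReading(times[i], degrees[i]);
            if (action != null) {
                System.out.println(action);
            }
        }
    }
}
```

With a batch job the same check would only run after all readings had been collected, which is the "too late" case the slides describe.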
What’s Needed for Converged Applications
The Trinity of Real Time
Real Time Producers
Global Messaging System: Topic 1, Topic 2
NoSQL Key Value Database
Spark + MapR Streams Integration
Spark + MapR DB Integration
Transformational Tier
Real Time Operational Analytics
MapR Converged Data Platform
Open Source Engines & Tools | Commercial Engines & Applications | Custom Apps | Cloud and Managed Services | Search and Others
Data Processing
Unified Management and Monitoring
APIs: HDFS API, POSIX, NFS, HBase API, JSON API, Kafka API
Web-Scale Storage: MapR-FS
Database: MapR-DB
Event Streaming: MapR Streams
Enterprise-Grade Platform Services: High Availability, Real Time, Unified Security, Multi-tenancy, Disaster Recovery, Global Namespace
Use Case: Time Series Data in Oil Wells
Sensor time-stamped data
Stream Topic
Spark Streaming (read)
Spark processing
Data for real-time monitoring
What Do We Need to Do?
Data Sources, Collect Data, Process Data, Store Data, Serve Data
? ? ? ?
Scalable Messaging with MapR Streams
Topics are partitioned for throughput and scalability
Partition 1: Topic - Pressure
Partition 1: Topic - Temperature
Partition 1: Topic - Warning
Partition 2: Topic - Pressure
Partition 2: Topic - Temperature
Partition 2: Topic - Warning
Partition 3: Topic - Pressure
Partition 3: Topic - Temperature
Partition 3: Topic - Warning
Consumers
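The idea behind partitioned topics can be sketched as a function from a message key to a partition number, so that messages with the same key always land in the same partition and different partitions can be read in parallel. This is an illustration only: the class and method names are hypothetical, and the hash used here is not claimed to be the partitioner that MapR Streams or Kafka actually applies.

```java
class TopicPartitioner {
    // Map a message key to one of numPartitions partitions.
    // floorMod keeps the result non-negative even when hashCode() is negative.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        // Hypothetical pump ids used as message keys; three partitions as on the slide.
        String[] pumpIds = {"COHUTTA", "PUMP-A", "PUMP-B"};
        for (String id : pumpIds) {
            System.out.println(id + " goes to partition " + partitionFor(id, 3));
        }
    }
}
```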
Continuous Analytics: Structured Streaming with Spark 2.0
// Structured Streaming API as proposed for Spark 2.0; the released API uses readStream and writeStream
val records = sqlContext.read.format("json").stream("hdfs://input")
val counts = records.groupBy("user").count()
counts.write
  .trigger(ProcessingTime("5 sec"))
  .outputMode(UpdateInPlace("user"))
  .format("jdbc")
  .startStream("mysql://...")
Repeated Queries
DB
User Count
User 1 10
User 2 23
User 3 16
…….. ……..
• Store only the processed output instead of every single record.
• Query executed repeatedly as and when the data arrives.
• Read the result from persistent storage, instead of processing the entire data set, resulting in faster access.
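The points above can be sketched as a small running aggregate: each arriving batch updates a stored per-user count, and readers consult the stored result instead of rescanning every record. The class and method names are hypothetical, and the in-memory map stands in for the persistent store shown on the slide.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class RunningUserCounts {
    // Stand-in for the result table kept in persistent storage.
    private final Map<String, Long> counts = new TreeMap<>();

    // Fold one batch of user ids into the stored result; raw records are not kept.
    void update(List<String> batch) {
        for (String user : batch) {
            counts.merge(user, 1L, Long::sum);
        }
    }

    // Serve a read from the stored result without reprocessing the input.
    long countFor(String user) {
        return counts.getOrDefault(user, 0L);
    }

    public static void main(String[] args) {
        RunningUserCounts result = new RunningUserCounts();
        result.update(Arrays.asList("User 1", "User 2", "User 1"));
        result.update(Arrays.asList("User 2", "User 3"));
        System.out.println("User 1: " + result.countFor("User 1"));
    }
}
```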
Spark 2.0: Structured Streaming with Spark SQL
Processing Time: 1, 2, 3
Input Table: Data up to proc. time 1, Data up to proc. time 2, Data up to proc. time 3
Result Table: Output for data at 1, Output for data at 2, Output for data at 3
Program Output: Complete output OR Delta output
Output Modes
Delta: Writes the records of the query result that changed since the last firing of the trigger. These are physical deltas and not logical deltas. That is to say, they specify which rows were added and removed, but not the logical difference for a given row.
Append: A special case of the Delta mode that does not include removals.
Update (in place): Updates the result directly in place (e.g. updates a MySQL table). As with Delta, a primary key must be specified.
Complete: For each run of the query, creates a complete snapshot of the query result.
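To make the difference between Complete and Delta output concrete, here is a small sketch that compares the result table at two consecutive triggers. It is not Spark code: the class, the method names, and the "+"/"-" row notation are assumptions for illustration. A changed row appears as one removal plus one addition, which is what "physical delta" means in the description above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class OutputModes {
    // Complete mode: emit a full snapshot of the current result table.
    static Map<String, Long> complete(Map<String, Long> current) {
        return new TreeMap<>(current);
    }

    // Delta mode (physical): emit rows removed ("-") and rows added ("+")
    // since the previous trigger. A changed row yields one of each.
    static List<String> delta(Map<String, Long> previous, Map<String, Long> current) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Long> e : new TreeMap<>(previous).entrySet()) {
            if (!e.getValue().equals(current.get(e.getKey()))) {
                out.add("-" + e.getKey() + "=" + e.getValue());
            }
        }
        for (Map.Entry<String, Long> e : new TreeMap<>(current).entrySet()) {
            if (!e.getValue().equals(previous.get(e.getKey()))) {
                out.add("+" + e.getKey() + "=" + e.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Long> previous = new TreeMap<>();
        previous.put("User 1", 10L);
        previous.put("User 2", 23L);
        Map<String, Long> current = new TreeMap<>(previous);
        current.put("User 2", 24L);   // existing row changed
        current.put("User 3", 16L);   // new row
        System.out.println("Complete: " + complete(current));
        System.out.println("Delta:    " + delta(previous, current));
    }
}
```

Append mode would correspond to a delta that contains only "+" rows.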
What Do We Need to Do?
Data Sources, Collect Data, Process Data, Store Data, Serve Data
Stream Topic
Spark Scheduling Bottleneck
Users: User 1, User 2, User 3, ..., User n
SparkContext: Query Compilation, Storage, Scheduling
Workers: Worker 1, Worker 2, Worker 3, Worker 4, ..., Worker n
Latency vs. Concurrency
Type | Latency | Concurrency
Batch/RTS Analytics | Very Low | Low
Interactive Applications | Very Low | High/Very High
MapR-DB (HBase API) is Designed to Scale
Fast Reads and Writes by Key
Data is automatically partitioned by Key Range
[Figure: three key-range partitions, each a table with columns Key, Col B, Col C]
What Do We Exactly Need to Do?
Data Sources, Collect Data, Process Data, Store Data, Serve Data
Stream Topic
Customer Use Cases
Customer 360 & Behavior Prediction
Website Click-Stream
Real Time/Offline ClickStream Analysis
Internal Data Sources
External Data Sources
• Prediction Modelling
• Attribution Modelling
• Cohort Analysis
• Customer Lifetime Value Analysis
• Attrition Modelling
• Response Modelling
• Churn Modelling
Eliminate latency due to data movement between clusters
Eliminate redundant storage with MapR Streams and lower the TCO
360 Degree Customer View
Customer Behavior Prediction
Better Conversion Rate and Lower Attrition $$$
Offline, Real Time
HA, DR, NFS, Snapshots, Data Protection
EDH/EDL
Topics
Support Tickets
DBMS
Email
CRM
Prescriptive Analytics: IoT & Auto Manufacturing
GPS
Telematic Data
Telephone Truck Fleet
Data generated from cars are stored locally
Data Modelling/Secondary ETL: Data is converted from proprietary to parquet format
• Identify emission patterns
• Route optimization
• Customer service requests
• How does throttling affect other factors such as fuel consumption, emissions, etc.
• Image and video analysis
• Time series analysis for threshold breach
Topics
Interactive Analytics: Risk Analysis (Internal Users)
0-10 days old data cached in memory: 50-100 GB of data.
Data older than 10 days accessed from disk
Analytic Application to submit queries with simple to medium analytic query complexity
User 1
User 2
User 3
Concurrent requests: 3-10
Throughput: 1.5 requests per second
Latency: < 2 seconds
Representative Queries
• List of users who have spent more than $1000 in last 3 days.
• Group users by country who spent more than $1000.
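As a rough illustration of the first representative query, the sketch below totals each user's purchases over the last three days and keeps those above $1000. The Purchase type, its fields, and the reading of "last 3 days" as daysAgo values 0, 1 and 2 are all assumptions; in the architecture described here the query would run against cached data through Spark rather than over an in-memory list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SpendQueries {
    // Hypothetical record of one purchase; daysAgo == 0 means today.
    static class Purchase {
        final String user;
        final double amount;
        final int daysAgo;

        Purchase(String user, double amount, int daysAgo) {
            this.user = user;
            this.amount = amount;
            this.daysAgo = daysAgo;
        }
    }

    // Users whose total spend within the last `days` days exceeds `minTotal`.
    static List<String> bigSpenders(List<Purchase> purchases, int days, double minTotal) {
        Map<String, Double> totals = new TreeMap<>();
        for (Purchase p : purchases) {
            if (p.daysAgo < days) {
                totals.merge(p.user, p.amount, Double::sum);
            }
        }
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Double> e : totals.entrySet()) {
            if (e.getValue() > minTotal) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Purchase> purchases = Arrays.asList(
            new Purchase("ann", 600.0, 0),
            new Purchase("ann", 500.0, 2),
            new Purchase("bob", 2000.0, 5),   // too old to count
            new Purchase("cat", 900.0, 1));   // below the limit
        System.out.println(bigSpenders(purchases, 3, 1000.0));
    }
}
```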
Analytic Application
Type of Users: Internal
Interactive Analytics: External Customer Facing
On-Demand
Pre-Computed
Application
Sales Incentive Data
• 60 events/sec
• 10 MB/event
• Table-based topics
Fast Changing Data
Ex: Credit date
Append only (50% of events)
Search Application
Stale Data. Aggregates calculated using Snapshots.
Level 1 Aggregates
Level 2 Aggregates
Level 3 Aggregates
Advanced ML Analytics
Delta Aggregates
Pre-compute analytics with Spark Streaming on Data-in-motion
Topics
DB
Demo
What if BP had detected problems before the oil hit the water?
1M samples/sec
High performance at scale is necessary!
Use Case: Time Series Data
Sensor time-stamped data
Stream Topic
Spark Streaming (read)
Spark processing
Data for real-time monitoring
Use Case: Time Series Data
Sensor time-stamped data
Stream
Topic
COHUTTA,3/10/14,1:01,10.27,1.73,881,1.56,85,1.94
COHUTTA,3/10/14,1:03,10.47,1.732,882,1.7,92,0.66
COHUTTA,3/10/14,1:02,9.67,1.731,882,0.52,87,1.79
Data: PumpId, Date, Time, pressure and flow measurements
Schema
• All events stored; CF data could be set to expire data
• Filtered alerts put in CF alerts
• Daily summaries put in CF stats
Row key | CF data: hz … psi | CF alerts: psi … | CF stats: hz_avg … psi_min
COHUTTA_3/10/14_1:01 | 10.37 … 84 | 0 |
COHUTTA_3/10/14 | | | 10 … 0
Row Key contains oil pump name, date, and a time stamp
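A minimal sketch of how the two kinds of row key in the schema could be derived from one CSV message. The underscore-joined format follows the example keys on the slide; the class and method names are hypothetical, and the demo application may build its keys differently.

```java
class PumpRowKeys {
    // Key for one raw event row: pump name, date and time joined by underscores.
    static String eventKey(String csvLine) {
        String[] p = csvLine.split(",");
        return p[0] + "_" + p[1] + "_" + p[2];
    }

    // Key for the daily summary row: pump name and date only.
    static String dailyKey(String csvLine) {
        String[] p = csvLine.split(",");
        return p[0] + "_" + p[1];
    }

    public static void main(String[] args) {
        // Sample message taken from the data slide.
        String line = "COHUTTA,3/10/14,1:01,10.27,1.73,881,1.56,85,1.94";
        System.out.println(eventKey(line));
        System.out.println(dailyKey(line));
    }
}
```

Because the key starts with the pump name followed by the date, all rows for one pump and day are adjacent in a store that sorts by key, which suits range scans.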
What Do We Need to Do?
Data Sources, Collect Data, Process Data, Store Data, Serve Data
Stream Topic
Use Case Example Code
Sensor time-stamped data
Stream Topic
Spark Streaming (read)
Spark processing
Data for real-time monitoring
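KafkaProducer
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// MapR Streams topic name: stream path, a colon, then the topic
String topic = "/streams/pump:warning";
// 1. Configure KafkaProducer properties
Properties properties = new Properties();
properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
// 2. Create KafkaProducer with properties
KafkaProducer<String, String> kafkaProducer = new KafkaProducer<String, String>(properties);
String txt = "msg text";
// 3. Create a producer record with topic and message
ProducerRecord<String, String> record = new ProducerRecord<String, String>(topic, txt);
// 4. Use the Kafka producer to send the record
kafkaProducer.send(record);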
Create a DStream
DStream: a sequence of RDDs representing a stream of data
val ssc = new StreamingContext(sparkConf, Seconds(5))
// create an input stream for the set of topics
val dStream = KafkaUtils.createDirectStream[String, String](ssc, kafkaParams, topicsSet)
dStream batches: batch time 0 to 1, batch time 1 to 2, batch time 2 to 3
Each batch is stored in memory as an RDD
Message Data to Sensor Object
case class Sensor(resid: String, date: String, time: String, hz: Double, disp: Double,
                  flo: Double, sedPPM: Double, psi: Double, chlPPM: Double)

// Parse CSV Strings into Sensor objects
def parseSensor(str: String): Sensor = {
  val p = str.split(",")
  Sensor(p(0), p(1), p(2), p(3).toDouble, p(4).toDouble, p(5).toDouble,
         p(6).toDouble, p(7).toDouble, p(8).toDouble)
}
Process DStream
// Parse message values into Sensor objects
val sensorDStream = dStream.map(_._2).map(parseSensor)
dStream RDDs: batch time 0 to 1, batch time 1 to 2, batch time 2 to 3
map applied to each batch
sensorDStream RDDs
New RDDs created for every batch
DataFrame and SQL Operations
// for each RDD in the DStream
sensorDStream.foreachRDD { rdd =>
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._  // needed for rdd.toDF()
  // convert RDD to DataFrame and register it as a temporary table
  rdd.toDF().registerTempTable("sensor")
  // get the max, min and avg of the pump values
  val res = sqlContext.sql(
    """SELECT resid, date,
      |  max(hz) as maxhz, min(hz) as minhz, avg(hz) as avghz,
      |  max(disp) as maxdisp, min(disp) as mindisp, avg(disp) as avgdisp,
      |  max(flo) as maxflo, min(flo) as minflo, avg(flo) as avgflo,
      |  max(psi) as maxpsi, min(psi) as minpsi, avg(psi) as avgpsi
      |FROM sensor GROUP BY resid, date""".stripMargin)
  res.show()
}
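What the GROUP BY query computes can be shown without Spark. The sketch below groups CSV sensor lines by pump and date and summarizes only the psi column, using the JDK's DoubleSummaryStatistics. The class and method names are hypothetical, and the column index follows the order of fields in the Sensor case class (psi is the eighth field).

```java
import java.util.DoubleSummaryStatistics;
import java.util.Map;
import java.util.TreeMap;

class PsiStats {
    // Group CSV sensor lines by "resid,date" and summarize the psi column (index 7).
    static Map<String, DoubleSummaryStatistics> psiByPumpAndDay(String[] csvLines) {
        Map<String, DoubleSummaryStatistics> stats = new TreeMap<>();
        for (String line : csvLines) {
            String[] p = line.split(",");
            String groupKey = p[0] + "," + p[1];
            stats.computeIfAbsent(groupKey, k -> new DoubleSummaryStatistics())
                 .accept(Double.parseDouble(p[7]));
        }
        return stats;
    }

    public static void main(String[] args) {
        // Sample messages taken from the data slide.
        String[] lines = {
            "COHUTTA,3/10/14,1:01,10.27,1.73,881,1.56,85,1.94",
            "COHUTTA,3/10/14,1:03,10.47,1.732,882,1.7,92,0.66",
            "COHUTTA,3/10/14,1:02,9.67,1.731,882,0.52,87,1.79"
        };
        for (Map.Entry<String, DoubleSummaryStatistics> e : psiByPumpAndDay(lines).entrySet()) {
            DoubleSummaryStatistics s = e.getValue();
            System.out.println(e.getKey() + " maxpsi=" + s.getMax()
                + " minpsi=" + s.getMin() + " avgpsi=" + s.getAverage());
        }
    }
}
```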
Streaming Application Output
Save to HBase
rdd.map(Sensor.convertToPut).saveAsHadoopDataset(jobConfig)
linesRDD DStream: batch time 0 to 1, batch time 1 to 2, batch time 2 to 3
map: sensorRDD DStream
save: Put objects written to HBase
Output operation: persist data to external storage
Start Receiving Data
sensorDStream.foreachRDD { rdd =>
  . . .
}
// Start the computation
ssc.start()
// Wait for the computation to terminate
ssc.awaitTermination()
Stream Processing
Building a Complete Data Architecture
MapR File System (MapR-FS)
MapR Converged Data Platform
MapR Database (MapR-DB)
MapR Streams
Sources/Apps
Bulk Processing
Q & A
Engage with us!
1. Read explanation of and download code
– https://www.mapr.com/blog/fast-scalable-streaming-applications-mapr-streams-spark-streaming-and-mapr-db
– https://www.mapr.com/blog/spark-streaming-hbase
2. Get Started: MapR Converged Data Platform https://www.mapr.com/get-started-with-mapr
3. Get Answers: MapR Converge Community https://community.mapr.com/community/answers
4. Get Trained: MapR On-Demand Training https://learn.mapr.com