Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Real-Time Processing in Hadoop: Trucking Company Use Case
Paul Hargis Solutions Engineer, Hortonworks
Agenda
• Overview of logistics industry scenario
• Quick overview of streaming architecture on HDP
• Streaming Demo
• Integrating Predictive Analytics in streaming scenarios
• Spark Demo
Scenario Overview.
Trucking company with a large fleet of trucks in the Midwest
A truck generates millions of events for a given route; an event could be:
• 'Normal' events: starting / stopping of the vehicle
• 'Violation' events: speeding, excessive acceleration and braking, unsafe tail distance
Company uses an application that monitors truck locations and violations from the truck/driver in real-time
Route? Truck? Driver?
Analysts query a broad history to understand if today’s violations are part of a larger problem with specific routes, trucks, or drivers
Trucking Company's YARN-enabled Architecture
[Diagram components: Truck Sensors; Inbound Messaging (Kafka); Stream Processing (Storm); Real-time Serving (HBase); Alerts & Events (ActiveMQ); Real-Time User Interface; Interactive Query (Hive on Tez) with SQL; Many Workloads: YARN; Distributed Storage: HDFS]
One cluster with consistent security, governance & operations
Demo - Streaming.
Streaming Demo - High Level Architecture
[Diagram components: Truck Streaming Data T(1), T(2), ..., T(N); Inbound Messaging (Kafka) with Truck Events Topic; Storm Stream Processing on YARN: Kafka Spout, HBase Bolt, HDFS Bolt, Monitoring Bolt; HBase: Dangerous Events Table; Truck Events; ActiveMQ; Web App; Distributed Storage: HDFS]
Demo – Analyzing Events with Tableau.
Analyzing Raw Events – dangerous drivers
Analyzing Raw Events – dangerous routes
Analyzing Raw Events – violations by location
Enriching truck events for analysis with Pig
[Diagram inputs: Raw Truck Events (HDFS); Raw Weather Data (Weather Data Sets); Payroll Data (HR & Payroll DBs); HCatalog (Metadata)]
[Diagram steps: Load Raw Truck Events; Clean & Filter (Cleaned Events); Transform (Transformed Events); Join with HR & weather data (Enriched Events); Store (Enriched Events); Tableau]
Analyzing Enriched Events – noncertified and fatigued drivers more dangerous
Analyzing Enriched Events – top 3 dangerous routes seem to be driven by fatigued drivers
Analyzing Enriched Events – foggy weather leads to violations
Analyzing Enriched Events – but top 3 safest routes are also foggy
Integrating Predictive Analytics
CDO’s vision: Build a Predictive Business, not a Reactive one
CDO’s Requirements
• Offline predictions: Identify investments that will increase safety and reduce company’s liabilities
• Real-time predictions: Anticipate driver violations before they happen and take precautionary actions
Data Scientist’s Response
• ♬ I’ve been waiting for this moment all my life ♬
• Verify BI tool trends against TBs of events data via machine learning
• Generate predictive models with Spark MLlib on HDP
• Plug in Spark models in Storm to predict driver violations in real-time
Integrate Predictive Analytics in Stream Processing
[Diagram components: Truck Sensors; Inbound Messaging (Kafka); Stream Processing (Storm) with Prediction Bolt; Real-time Serving (HBase); Interactive Query (Hive on Tez); Machine Learning (Spark); Millions of Enriched Truck Events; YARN; HDFS]
• Train Spark ML model with millions of truck events
• Plug Spark model into Storm bolt
Streaming Demo - Updated Architecture
[Diagram components: Truck Streaming Data T(1), T(2), ..., T(N); Inbound Messaging (Kafka) with Truck Events Topic; Storm Stream Processing on YARN: Kafka Spout, HBase Bolt, HDFS Bolt, Monitoring Bolt, Prediction Bolt; HBase: PayRoll Table; Truck Events; ActiveMQ; Web App; Distributed Storage: HDFS]
• Enrich event
• Predict violation in real time & alert via MQ
• Render real-time predictions on UI
Building the Predictive Model on HDP
1. Explore a small subset of events in Tableau to identify predictive features and make a hypothesis, e.g. “foggy weather causes driver violations”
2. Identify suitable ML algorithms to train a model – we will use classification algorithms as we have labeled events data
3. Transform enriched events data to a format that is friendly to Spark MLlib – many ML libs expect training data in a certain format
4. Train a logistic regression classification model with Spark on YARN, with the above events as training input, and iterate to fine-tune the generated model
5. Integrate the Spark MLlib model in a Storm bolt to predict violations in real time
Transforming training data for Spark MLlib

Enriched Events Data
Event Type  Is Driver Certified?  Wage Plan  Hours Driven  Miles Driven  Longitude  Latitude  Weather Foggy  Weather Rainy  Weather Windy
Normal      Yes                   Hourly     45            2721          -91.3      38.14     No             No             No
Overspeed   No                    Miles      72            4152          -94.23     37.09     Yes            Yes            No
…           …                     …          …             …             …          …         …              …              …

Spark MLlib Training Data
Label  Is Driver Certified?  Wage Plan  Hours Driven  Miles Driven  Weather Foggy  Weather Rainy  Weather Windy
0      1                     1          0.45          0.2721        0              0              0
1      0                     0          0.72          0.4152        1              1              0
…      …                     …          …             …             …              …              …

• Normal events labeled as 0 and violation events as 1
• Feature scaling applied to hours and miles to improve algorithm performance
• Features with binary values denoted as 0 and 1
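The transformation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function and field names are hypothetical, the scaling divisors (100 for hours, 10,000 for miles) are inferred from the two example rows rather than stated on the slide, and the LIBSVM-style output line assumes the training job reads that text layout.

```python
# Scaling divisors are inferred from the slide's example rows (45 -> 0.45,
# 2721 -> 0.2721); the deck does not state them. All names are hypothetical.
HOURS_SCALE = 100.0
MILES_SCALE = 10000.0


def yes_no(value):
    """Encode a Yes/No column as 1/0."""
    return 1 if value == "Yes" else 0


def to_training_row(event):
    """Turn one enriched event (a dict) into (label, feature list).

    Longitude and latitude are dropped, as in the slide's training table.
    """
    label = 0 if event["event_type"] == "Normal" else 1
    features = [
        yes_no(event["certified"]),
        1 if event["wage_plan"] == "Hourly" else 0,
        event["hours_driven"] / HOURS_SCALE,
        event["miles_driven"] / MILES_SCALE,
        yes_no(event["foggy"]),
        yes_no(event["rainy"]),
        yes_no(event["windy"]),
    ]
    return label, features


def to_libsvm_line(label, features):
    """Format as 'label index:value ...' with 1-based indices, skipping zeros."""
    pairs = [f"{i}:{v:g}" for i, v in enumerate(features, start=1) if v != 0]
    return " ".join([str(label)] + pairs)


# The two example rows from the slide.
normal_event = {
    "event_type": "Normal", "certified": "Yes", "wage_plan": "Hourly",
    "hours_driven": 45, "miles_driven": 2721,
    "foggy": "No", "rainy": "No", "windy": "No",
}
overspeed_event = {
    "event_type": "Overspeed", "certified": "No", "wage_plan": "Miles",
    "hours_driven": 72, "miles_driven": 4152,
    "foggy": "Yes", "rainy": "Yes", "windy": "No",
}
```

Calling `to_training_row` on the two example events reproduces the two training rows shown in the table.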
Running Spark ML on YARN
1. Run spark-submit script to launch a Spark job on YARN:
spark-submit --class org.apache.spark.examples.mllib.BinaryClassification \
  --master yarn-cluster --num-executors 3 \
  --driver-memory 512m --executor-memory 512m --executor-cores 1 \
  truckml.jar --algorithm LR --regType L2 --regParam 1.0 \
  /user/root/truck_training --numIterations 100
(/user/root/truck_training is the training data location on HDFS)
2. Monitor progress of the Spark job in the YARN Resource Manager UI
Interpreting Spark Logistic Regression Results
• Precision: 87.5%
• Recall: 88%
Top three predictors of violations:
1. Foggy Weather
2. Rainy Weather
3. Driver Certification
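For readers less familiar with the two metrics, here is a minimal sketch of how precision and recall are computed from predicted and actual labels (1 = violation, 0 = normal). The function name and the sample labels are made up for illustration; they are not the deck's evaluation data.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for binary labels where 1 means violation."""
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p == 1 and a == 1)
    false_pos = sum(1 for p, a in pairs if p == 1 and a == 0)
    false_neg = sum(1 for p, a in pairs if p == 0 and a == 1)
    # Precision: of the events flagged as violations, how many really were.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    # Recall: of the real violations, how many the model flagged.
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Illustrative labels only.
sample_predicted = [1, 1, 1, 1, 0, 0, 0, 0]
sample_actual = [1, 1, 1, 0, 1, 1, 0, 0]
sample_precision, sample_recall = precision_recall(sample_predicted, sample_actual)
```

In this made-up sample, three of the four flagged events are real violations and three of the five real violations are flagged.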
Integrating Spark model in Storm
[Diagram components: Kafka Spout; Storm Prediction Bolt; Real-time Serving (HBase); ActiveMQ; Ops Center; LOB Dashboards]
Storm Prediction Bolt:
1. Initialize Spark model
2. Parse truck event
3. Enrich event with HBase data
4. Predict violation with model
5. Send alert if violation predicted
Recommendations to CDO
Investment recommendations, in order of priority
1. Invest in visibility sensors and auto braking systems to deal with foggy conditions
2. Invest in slip resistant tires to fight rainy conditions
3. Invest in certifying drivers to reduce violation probability
Power of real time predictions
• 40% reduction in violation rates by predicting high risk situations in real-time and sending immediate alerts to drivers
Value of large scale ML on HDP
• Accelerate time to market/value
– Test out multiple ML algorithms against TBs of training data in reasonable time frames
• Confirm hypothesis against TBs of training data with confidence
– We confirmed that fog does impact safety and wage plans do not, whereas BI tools indicated otherwise
• Easily integrate predictive models in data driven apps
– Run predictive models in Storm or any other app in your enterprise
• Run all of the above in a multi-tenant YARN cluster
– Large scale ML on YARN respects other tenants in an HDP cluster
Backup.
Calling Spark from a Storm Bolt
The outputs of a logistic regression model are weights and an intercept value:
  import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

  val algorithm = new LogisticRegressionWithSGD()
  // clearThreshold() makes predict() return the raw score instead of a 0/1 label
  val model = algorithm.run(training).clearThreshold()
  println(model.weights)
  println(model.intercept)

  Weights [-0.40819922025591465,0.06392530395655666,-0.1346227352186122,-0.07188217286407801,0.7277326276521062,0.508779221680863,-0.024689093098281954]
  Intercept 0.0
The model can then be reconstructed in a Storm bolt with the above weights to make predictions:
  import org.apache.spark.mllib.classification.LogisticRegressionModel;
  import org.apache.spark.mllib.linalg.Vector;
  import org.apache.spark.mllib.linalg.Vectors;
  ………..
  // Rebuild the model from the trained weights and the intercept (0.0)
  Vector weights = Vectors.dense(new double[] { <array of weights like above> });
  LogisticRegressionModel model = new LogisticRegressionModel(weights, 0.0);
  double prediction = model.predict(<input features>);
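What the reconstructed model computes can be shown without Spark: a logistic regression prediction is the sigmoid of the weighted sum of the features plus the intercept. The sketch below uses the weights printed on the slide. It assumes the weight order matches the column order of the training table (certified, wage plan, hours, miles, foggy, rainy, windy), which the deck does not state explicitly, and the function names and the 0.5 cut-off are illustrative choices.

```python
import math

# Weights and intercept as printed on the slide. The feature order is an
# assumption: certified, wage plan, hours, miles, foggy, rainy, windy.
WEIGHTS = [
    -0.40819922025591465, 0.06392530395655666, -0.1346227352186122,
    -0.07188217286407801, 0.7277326276521062, 0.508779221680863,
    -0.024689093098281954,
]
INTERCEPT = 0.0


def predict_probability(features, weights=WEIGHTS, intercept=INTERCEPT):
    """Sigmoid of the dot product of weights and features, plus intercept."""
    margin = sum(w * x for w, x in zip(weights, features)) + intercept
    return 1.0 / (1.0 + math.exp(-margin))


def predict_label(features, threshold=0.5):
    """Return 1 (violation predicted) when the probability exceeds the threshold."""
    return 1 if predict_probability(features) > threshold else 0


# The two example rows from the training-data slide.
certified_clear_weather = [1, 1, 0.45, 0.2721, 0, 0, 0]
uncertified_fog_and_rain = [0, 0, 0.72, 0.4152, 1, 1, 0]
```

Under these assumptions the certified driver in clear weather scores below 0.5 and the uncertified driver in fog and rain scores above it, in line with the slide's top three predictors.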
Why Apache Kafka?
Open source real-time event stream processing platform that provides fixed, continuous & low latency processing for very high frequency streaming data
Highly scalable
• Horizontally scalable like Hadoop
• E.g.: a 3-node cluster can store 5M messages per second
Fault-tolerant
• Automatically reassigns on failed nodes
Guarantees delivery
• Supports message acknowledgements
Language agnostic
• Producers and consumers exist for many programming languages
Apache project
• Brand, governance & a large active community
Key Capabilities of Storm
Data Ingest
• Extremely high ingest rates – millions of events/second
Processing
• Ability to easily plug different processing frameworks
• Guaranteed processing – at-least-once processing semantics
Persistence
• Ability to persist data to multiple relational and non-relational data stores
Operations
• HA, fault tolerance & management support
Our preferred solution architecture
[Diagram components: Apache Kafka (real-time data feeds); HDP 2.x Data Lake: Real Time Stream Processing (Storm), Online Data Processing (HBase), Search (Solr), YARN, HDFS]
What is Real Time Event Processing?
Real Time Event Processing System
A system that processes events as they happen and generates real-time information/actions
Requirements
• Ingest data at a high rate
• Process the data while it's being collected
• Continuously running
• Low latency
Components
• Collection – process to collect raw data
• Data Flow – process to move data
• Processing – process to analyze data
• Delivery – process to deliver the extracted information
[Diagram labels: Kafka, Storm]
Kafka.
What is Kafka?
• High throughput distributed messaging system
• Publish-Subscribe semantics but re-imagined at the implementation level to operate at speed with big data volumes
[Diagram: three producers, a Kafka Cluster, three consumers]
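Kafka: Anatomy of a Topic
[Diagram: a topic with three partitions (Partition 0, Partition 1, Partition 2); each partition is an ordered sequence of messages numbered from 0, running from old to new, with writes appended at the new end; the partitions hold different numbers of messages (highest offsets 9, 11 and 12)]
Partitioning allows topics to scale beyond a single machine/node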
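The idea of a partitioned, append-only topic can be illustrated with a small in-memory model. This is a toy sketch, not Kafka's implementation or client API: the class and method names are hypothetical, and CRC32 is only a stand-in for whatever hash a real producer uses to map a message key to a partition.

```python
import zlib


class ToyTopic:
    """In-memory stand-in for a topic split into append-only partitions."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Same key always maps to the same partition, so per-key order is kept.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key, value):
        """Append to the key's partition; return (partition, offset)."""
        partition = self.partition_for(key)
        log = self.partitions[partition]
        log.append(value)
        return partition, len(log) - 1

    def read(self, partition, offset):
        """Return all messages in a partition from the given offset onward."""
        return self.partitions[partition][offset:]
```

Because each partition is an independent log, different partitions could live on different machines, which is what lets a topic grow beyond a single node.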
Storm.
Key Constructs in Apache Storm
• Tuples
• Streams
• Spouts
• Bolts
• Topology
Tuples and Streams
• What is a Tuple?
– Fundamental data structure in Storm: a named list of values that can be of any data type.
• What is a Stream?
– An unbounded sequence of tuples.
– Streams are the core abstraction in Storm and are what you “process” in Storm.
Spouts
• What is a Spout?
– Generates streams, i.e. it is a source of streams
– E.g.: JMS, Twitter, Log, Kafka Spout
– Can spin up multiple instances of a Spout and dynamically adjust as needed
Bolts
• What is a Bolt?
– Processes any number of input streams and produces output streams
– Common processing in bolts: functions, aggregations, joins, reads/writes to data stores, alerting logic
– Can spin up multiple instances of a Bolt and dynamically adjust as needed
• Bolts used in the Use Case:
1. HBaseBolt: persisting and counting in HBase
2. HDFSBolt: persisting into HDFS as Avro files using Flume
3. MonitoringBolt: reads from HBase and creates alerts via email and a message to ActiveMQ if the number of illegal driver incidents exceeds a given threshold
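The threshold logic described for the MonitoringBolt can be sketched as follows. This is not the demo's code: an in-memory counter stands in for the counts kept in HBase, a plain callback stands in for the email and ActiveMQ alerts, and all names are hypothetical.

```python
from collections import Counter


class IncidentMonitor:
    """Counts violations per driver and alerts once a threshold is exceeded."""

    def __init__(self, threshold, send_alert):
        self.threshold = threshold
        self.send_alert = send_alert  # stand-in for email / message queue
        self.counts = Counter()       # stand-in for per-driver counts in a store

    def on_event(self, driver_id, event_type):
        if event_type == "Normal":
            return  # only violation events count toward the threshold
        self.counts[driver_id] += 1
        if self.counts[driver_id] > self.threshold:
            self.send_alert(driver_id, self.counts[driver_id])


# Usage: collect alerts in a list instead of sending them anywhere.
alerts = []
monitor = IncidentMonitor(threshold=2, send_alert=lambda d, n: alerts.append((d, n)))
for event_type in ["Overspeed", "Normal", "Overspeed", "Overspeed"]:
    monitor.on_event("driver-7", event_type)
```

With a threshold of 2, the first two violations stay silent and the third one triggers the alert.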
Topology
• What is a Topology?
– A network of spouts and bolts wired together into a workflow
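As a rough mental model, a topology can be pictured as a chain of generators: a spout yields tuples and each bolt consumes one stream and yields another. The sketch below runs in a single process and is only an analogy; it does not use the Storm API, which distributes and parallelizes these components, and all names and the input format are made up.

```python
def truck_event_spout(raw_lines):
    """Spout: turn raw 'driver_id,event_type' lines into named tuples (dicts)."""
    for line in raw_lines:
        driver_id, event_type = line.split(",")
        yield {"driver_id": driver_id, "event_type": event_type}


def violation_filter_bolt(stream):
    """Bolt: pass through only violation events."""
    for tup in stream:
        if tup["event_type"] != "Normal":
            yield tup


def count_bolt(stream):
    """Bolt: emit a running violation count per driver."""
    counts = {}
    for tup in stream:
        driver_id = tup["driver_id"]
        counts[driver_id] = counts.get(driver_id, 0) + 1
        yield {"driver_id": driver_id, "violations": counts[driver_id]}


def run_topology(raw_lines):
    """Wire spout -> filter bolt -> count bolt and collect the output."""
    return list(count_bolt(violation_filter_bolt(truck_event_spout(raw_lines))))
```

Feeding it a handful of lines shows tuples flowing through the chain one at a time, with the count bolt emitting one output tuple per violation.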