Trucking demo w Spark ML - Paul Hargis - Hortonworks

Page 1: Trucking demo w Spark ML - Paul Hargis - Hortonworks

Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Real time processing in Hadoop
Trucking Company Use Case

Paul Hargis Solutions Engineer, Hortonworks

Page 2: Trucking demo w Spark ML - Paul Hargis - Hortonworks

Agenda

• Overview of logistics industry scenario
• Quick overview of streaming architecture on HDP
• Streaming Demo
• Integrating Predictive Analytics in streaming scenarios
• Spark Demo

Page 3: Trucking demo w Spark ML - Paul Hargis - Hortonworks

Scenario Overview.

Page 4: Trucking demo w Spark ML - Paul Hargis - Hortonworks

Trucking company w/ large fleet of trucks in Midwest

A truck generates millions of events for a given route; an event could be:

• ‘Normal’ events: starting / stopping of the vehicle
• ‘Violation’ events: speeding, excessive acceleration and braking, unsafe tail distance

Company uses an application that monitors truck locations and violations from the truck/driver in real-time

Route? Truck? Driver?

Analysts query a broad history to understand if today’s violations are part of a larger problem with specific routes, trucks, or drivers

Page 5: Trucking demo w Spark ML - Paul Hargis - Hortonworks

Trucking Company’s YARN-enabled Architecture

One cluster with consistent security, governance & operations

[Architecture diagram] Components shown:
• Truck Sensors
• Inbound Messaging (Kafka)
• Stream Processing (Storm)
• Real-time Serving (HBase)
• Alerts & Events (ActiveMQ)
• Real-Time User Interface
• Interactive Query (Hive on Tez), SQL
• Many Workloads: YARN
• Distributed Storage: HDFS

Page 6: Trucking demo w Spark ML - Paul Hargis - Hortonworks

Demo - Streaming.

Page 7: Trucking demo w Spark ML - Paul Hargis - Hortonworks

Streaming Demo - High Level Architecture

[Architecture diagram] Components shown:
• Truck Streaming Data: T(1), T(2), ... T(N)
• Inbound Messaging (Kafka): Truck Events Topic
• Storm Stream Processing: Kafka Spout, HBase Bolt, HDFS Bolt, Monitoring Bolt
• HBase: Dangerous Events Table
• Truck Events
• ActiveMQ
• Web App
• YARN
• Distributed Storage: HDFS

Page 8: Trucking demo w Spark ML - Paul Hargis - Hortonworks

Demo – Analyzing Events with Tableau.

Page 9: Trucking demo w Spark ML - Paul Hargis - Hortonworks

Analyzing Raw Events – dangerous drivers

Page 10: Trucking demo w Spark ML - Paul Hargis - Hortonworks

Analyzing Raw Events – dangerous routes

Page 11: Trucking demo w Spark ML - Paul Hargis - Hortonworks

Analyzing Raw Events – violations by location

Page 12: Trucking demo w Spark ML - Paul Hargis - Hortonworks

Enriching truck events for analysis with Pig

[Data flow diagram]
• Inputs: Raw Truck Events (HDFS), Raw Weather Data (Weather Data Sets), Payroll Data (HR & Payroll DBs), with metadata in HCatalog
• Pig pipeline: Load Raw Truck Events → Clean & Filter (Cleaned Events) → Transform (Transformed Events) → Join with HR & weather data (Enriched Events) → Store
• Output: Enriched Events, analyzed in Tableau

Page 13: Trucking demo w Spark ML - Paul Hargis - Hortonworks

Analyzing Enriched Events – noncertified and fatigued drivers more dangerous

Page 14: Trucking demo w Spark ML - Paul Hargis - Hortonworks

Analyzing Enriched Events – top 3 dangerous routes seem to be driven by fatigued drivers

Page 15: Trucking demo w Spark ML - Paul Hargis - Hortonworks

Analyzing Enriched Events – foggy weather leads to violations

Page 16: Trucking demo w Spark ML - Paul Hargis - Hortonworks

Analyzing Enriched Events – but top 3 safest routes are also foggy

Page 17: Trucking demo w Spark ML - Paul Hargis - Hortonworks

Integrating Predictive Analytics

Page 18: Trucking demo w Spark ML - Paul Hargis - Hortonworks

CDO’s vision: Build a Predictive Business, not a Reactive one

CDO’s Requirements
• Offline predictions: Identify investments that will increase safety and reduce company’s liabilities
• Real-time predictions: Anticipate driver violations before they happen and take precautionary actions

Data Scientist’s Response
♬ I’ve been waiting for this moment all my life ♬
• Verify BI tool trends against TBs of events data via machine learning
• Generate predictive models with Spark MLlib on HDP
• Plug in Spark models in Storm to predict driver violations in real-time

Page 19: Trucking demo w Spark ML - Paul Hargis - Hortonworks

Integrate Predictive Analytics in Stream Processing

[Architecture diagram] Components shown:
• Truck Sensors
• Inbound Messaging (Kafka)
• Stream Processing (Storm), with a Prediction Bolt: plug Spark model into Storm bolt
• Machine Learning (Spark): train Spark ML model with millions of truck events
• Millions of Enriched Truck Events
• Interactive Query (Hive on Tez)
• Real-time Serving (HBase)
• YARN
• HDFS

Page 20: Trucking demo w Spark ML - Paul Hargis - Hortonworks

Streaming Demo - Updated Architecture

[Architecture diagram] Components shown:
• Truck Streaming Data: T(1), T(2), ... T(N)
• Inbound Messaging (Kafka): Truck Events Topic
• Storm Stream Processing: Kafka Spout, HBase Bolt, HDFS Bolt, Monitoring Bolt, Prediction Bolt
• Prediction Bolt: enrich event, predict violation in real time & alert via MQ
• HBase: Payroll Table
• Truck Events
• ActiveMQ
• Web App: render real-time predictions on UI
• YARN
• Distributed Storage: HDFS

Page 21: Trucking demo w Spark ML - Paul Hargis - Hortonworks

Building the Predictive Model on HDP

1. Explore a small subset of events in Tableau to identify predictive features and make a hypothesis. E.g. hypothesis: “foggy weather causes driver violations”
2. Identify suitable ML algorithms to train a model. We will use classification algorithms, as we have labeled events data
3. Transform enriched events data to a format that is friendly to Spark MLlib. Many ML libs expect training data in a certain format
4. Train a logistic regression classification model with Spark on YARN, with the above events as training input, and iterate to fine-tune the generated model
5. Integrate the Spark MLlib model in a Storm bolt to predict violations in real time

Page 22: Trucking demo w Spark ML - Paul Hargis - Hortonworks

Transforming training data for Spark MLlib

Enriched Events Data

Event Type | Is Driver Certified? | Wage Plan | Hours Driven | Miles Driven | Longitude | Latitude | Weather Foggy | Weather Rainy | Weather Windy
Normal     | Yes                  | Hourly    | 45           | 2721         | -91.3     | 38.14    | No            | No            | No
Overspeed  | No                   | Miles     | 72           | 4152         | -94.23    | 37.09    | Yes           | Yes           | No
…          | …                    | …         | …            | …            | …         | …        | …             | …             | …

Spark MLlib Training Data

Label | Is Driver Certified? | Wage Plan | Hours Driven | Miles Driven | Weather Foggy | Weather Rainy | Weather Windy
0     | 1                    | 1         | 0.45         | 0.2721       | 0             | 0             | 0
1     | 0                    | 0         | 0.72         | 0.4152       | 1             | 1             | 0
…     | …                    | …         | …            | …            | …             | …             | …

• Normal events labeled as 0 and violation events as 1
• Feature scaling applied to hours and miles to improve algorithm performance
• Features with binary values denoted as 0 and 1
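The transformation on this slide can be sketched in plain Java. This is only an illustration under stated assumptions: the class and method names are hypothetical, the scale factors (hours / 100, miles / 10,000) and the wage plan encoding (Hourly = 1, Miles = 0) are inferred from the two sample rows rather than stated in the deck, and longitude and latitude are dropped because the training table does not list them.

```java
import java.util.Arrays;

/** Hypothetical sketch: turns one enriched event into a numeric training row. */
final class TrainingRowSketch {
    // Assumed scale factors, inferred from the sample rows (45 -> 0.45, 2721 -> 0.2721).
    static final double HOURS_SCALE = 100.0;
    static final double MILES_SCALE = 10_000.0;

    /** Binary features are denoted as 1 (true) and 0 (false). */
    static double flag(boolean value) {
        return value ? 1.0 : 0.0;
    }

    /**
     * Returns {label, certified, wagePlan, hours, miles, foggy, rainy, windy}.
     * Longitude and latitude are not part of the training row.
     */
    static double[] toTrainingRow(String eventType, boolean certified, String wagePlan,
                                  double hoursDriven, double milesDriven,
                                  boolean foggy, boolean rainy, boolean windy) {
        // Normal events are labeled 0; every other event type counts as a violation (1).
        double label = "Normal".equals(eventType) ? 0.0 : 1.0;
        // Assumed encoding taken from the sample rows: Hourly = 1, Miles = 0.
        double wage = "Hourly".equals(wagePlan) ? 1.0 : 0.0;
        return new double[] {
            label, flag(certified), wage,
            hoursDriven / HOURS_SCALE, milesDriven / MILES_SCALE,
            flag(foggy), flag(rainy), flag(windy)
        };
    }

    public static void main(String[] args) {
        // The two sample events from the Enriched Events Data table.
        System.out.println(Arrays.toString(
            toTrainingRow("Normal", true, "Hourly", 45, 2721, false, false, false)));
        System.out.println(Arrays.toString(
            toTrainingRow("Overspeed", false, "Miles", 72, 4152, true, true, false)));
    }
}
```

Scaling hours and miles into roughly the same 0 to 1 range as the binary features keeps any single feature from dominating gradient-based training.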

Page 23: Trucking demo w Spark ML - Paul Hargis - Hortonworks

Running Spark ML on YARN

1. Run spark-submit script to launch a Spark job on YARN.

    spark-submit --class org.apache.spark.examples.mllib.BinaryClassification \
      --master yarn-cluster --num-executors 3 \
      --driver-memory 512m --executor-memory 512m --executor-cores 1 \
      truckml.jar --algorithm LR --regType L2 --regParam 1.0 \
      /user/root/truck_training --numIterations 100

   Training data location on HDFS: /user/root/truck_training

2. Monitor progress of Spark job in YARN Resource Mgr UI

Page 24: Trucking demo w Spark ML - Paul Hargis - Hortonworks

Interpreting Spark Logistic Regression Results

Precision: 87.5% Recall: 88%

Top three predictors of violations
1. Foggy Weather
2. Rainy Weather
3. Driver Certification
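One way to arrive at such a ranking is to sort the features by the absolute value of their learned weights. The sketch below does this with the seven weights printed on the backup slide "Calling Spark from a Storm Bolt". It assumes that the weights follow the column order of the training-data table; the deck does not state this, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch: ranks features by the magnitude of their model weight. */
final class PredictorRanking {
    // Assumption: weights are in the same order as the training-data columns.
    static final String[] FEATURES = {
        "Is Driver Certified?", "Wage Plan", "Hours Driven", "Miles Driven",
        "Weather Foggy", "Weather Rainy", "Weather Windy"
    };
    // Weights as printed on the backup slide.
    static final double[] WEIGHTS = {
        -0.40819922025591465, 0.06392530395655666, -0.1346227352186122,
        -0.07188217286407801, 0.7277326276521062, 0.508779221680863,
        -0.024689093098281954
    };

    /** Returns the names of the k features with the largest absolute weight. */
    static List<String> topPredictors(String[] names, double[] weights, int k) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < weights.length; i++) {
            indices.add(i);
        }
        // Sort indices by descending |weight|.
        indices.sort(Comparator.comparingDouble((Integer i) -> -Math.abs(weights[i])));
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(k, indices.size()); i++) {
            top.add(names[indices.get(i)]);
        }
        return top;
    }

    public static void main(String[] args) {
        System.out.println(topPredictors(FEATURES, WEIGHTS, 3));
    }
}
```

Under that ordering assumption the three largest magnitudes belong to foggy weather, rainy weather, and driver certification, in that order. The certification weight is negative, which would mean certified drivers are less likely to commit a violation. Comparing magnitudes is only meaningful because all features were scaled to a similar range.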

Page 25: Trucking demo w Spark ML - Paul Hargis - Hortonworks

Integrating Spark model in Storm

[Flow diagram] Kafka Spout → Storm Prediction Bolt → ActiveMQ → Ops Center, LOB Dashboards; the bolt reads from Real-time Serving (HBase)

Storm Prediction Bolt
1. Initialize Spark model
2. Parse truck event
3. Enrich event with HBase data
4. Predict violation with model
5. Send alert if violation predicted

Page 26: Trucking demo w Spark ML - Paul Hargis - Hortonworks

Recommendations to CDO

Investment recommendations, in order of priority
1. Invest in visibility sensors and auto braking systems to deal with foggy conditions
2. Invest in slip resistant tires to fight rainy conditions
3. Invest in certifying drivers to reduce violation probability

Power of real time predictions
• 40% reduction in violation rates by predicting high risk situations in real-time and sending immediate alerts to drivers

Page 27: Trucking demo w Spark ML - Paul Hargis - Hortonworks

Value of large scale ML on HDP

Accelerate time to market/value
• Test out multiple ML algorithms against TBs of training data in reasonable time frames

Confirm hypothesis against TBs of training data with confidence
• We confirmed that fog does impact safety and wage plans do not, whereas BI tools indicated otherwise

Easily integrate predictive models in data driven apps
• Run predictive models in Storm or any other app in your enterprise

Run all of the above in a multi-tenant YARN cluster
• Large scale ML on YARN respects other tenants in an HDP cluster

Page 28: Trucking demo w Spark ML - Paul Hargis - Hortonworks

Backup.

Page 29: Trucking demo w Spark ML - Paul Hargis - Hortonworks

Calling Spark from a Storm Bolt

The outputs of a logistic regression model are weights and an intercept value:

    val algorithm = new LogisticRegressionWithSGD()
    val model = algorithm.run(training).clearThreshold()
    println(model.weights)
    println(model.intercept)

    Weights
    [-0.40819922025591465,0.06392530395655666,-0.1346227352186122,-0.07188217286407801,0.7277326276521062,0.508779221680863,-0.024689093098281954]
    Intercept
    0.0

The model can then be reconstructed in a Storm bolt with the above weights to make predictions:

    import org.apache.spark.mllib.classification.LogisticRegressionModel;
    import org.apache.spark.mllib.linalg.Vector;
    import org.apache.spark.mllib.linalg.Vectors;
    ………..
    // Rebuild the model from the trained weights and the intercept (0.0)
    Vector weights = Vectors.dense(new double[] { <array of weights like above> });
    LogisticRegressionModel model = new LogisticRegressionModel(weights, 0.0);
    double prediction = model.predict(<input features>);
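To show what that prediction amounts to, here is a dependency-free Java sketch of the standard logistic regression formula: the sigmoid of the weighted feature sum plus the intercept. It is not MLlib's implementation. The class name, the 0.5 decision threshold, and the two sample feature vectors (taken from the training-data slide) are assumptions made for illustration.

```java
/** Hypothetical sketch: scores a feature vector with the weights printed above. */
final class LogisticScoreSketch {
    // Weights and intercept as printed on this slide.
    static final double[] WEIGHTS = {
        -0.40819922025591465, 0.06392530395655666, -0.1346227352186122,
        -0.07188217286407801, 0.7277326276521062, 0.508779221680863,
        -0.024689093098281954
    };
    static final double INTERCEPT = 0.0;

    /** Probability of a violation: sigmoid(w . x + b). */
    static double probability(double[] weights, double intercept, double[] features) {
        if (weights.length != features.length) {
            throw new IllegalArgumentException("feature count must match weight count");
        }
        double margin = intercept;
        for (int i = 0; i < weights.length; i++) {
            margin += weights[i] * features[i];
        }
        return 1.0 / (1.0 + Math.exp(-margin));
    }

    /** Returns 1.0 when the probability exceeds the threshold, otherwise 0.0. */
    static double predict(double[] features, double threshold) {
        return probability(WEIGHTS, INTERCEPT, features) > threshold ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        // Feature order assumed: certified, wage plan, hours, miles, foggy, rainy, windy.
        double[] clearDay = {1, 1, 0.45, 0.2721, 0, 0, 0};
        double[] foggyAndRainy = {0, 0, 0.72, 0.4152, 1, 1, 0};
        System.out.println(probability(WEIGHTS, INTERCEPT, clearDay)
            + " -> " + predict(clearDay, 0.5));
        System.out.println(probability(WEIGHTS, INTERCEPT, foggyAndRainy)
            + " -> " + predict(foggyAndRainy, 0.5));
    }
}
```

With these weights the certified driver on a clear day scores below 0.5 and the uncertified driver in fog and rain scores above it, so only the second event would trigger an alert at that threshold.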

Page 30: Trucking demo w Spark ML - Paul Hargis - Hortonworks

Why Apache Kafka?

Open source real-time event stream processing platform that provides fixed, continuous & low latency processing for very high frequency streaming data

Highly scalable
• Horizontally scalable like Hadoop
• Eg: 3 node cluster can store 5M messages per second

Fault-tolerant
• Automatically reassigns on failed nodes

Guarantees delivery
• Supports message acknowledgements

Language agnostic
• Producers and consumers exist for many programming languages

Apache project
• Brand, governance & a large active community

Page 31: Trucking demo w Spark ML - Paul Hargis - Hortonworks

Key Capabilities of Storm

Data Ingest
• Extremely high ingest rates – millions of events/second

Processing
• Ability to easily plug different processing frameworks
• Guaranteed processing – at-least once processing semantics

Persistence
• Ability to persist data to multiple relational and non relational data stores

Operations
• HA, fault tolerance & management support

Page 32: Trucking demo w Spark ML - Paul Hargis - Hortonworks

Our preferred solution architecture

[Architecture diagram] HDP 2.x Data Lake:
• Apache Kafka: real-time data feeds
• Real Time Stream Processing: Storm
• Online Data Processing: HBase
• Search: Solr
• YARN
• HDFS

Page 33: Trucking demo w Spark ML - Paul Hargis - Hortonworks

What is Real Time Event Processing

Real Time Event Processing System
A system that processes the events as they happen and generates real-time information/actions

Requirements
• Ingest data at high rate
• Process the data while it's being collected
• Continuously running
• Low latency

Components
• Collection – Process to collect raw data
• Data Flow – Process to move data
• Processing – Process to analyze data
• Delivery – Process to deliver the extracted information

Kafka

Storm

Page 34: Trucking demo w Spark ML - Paul Hargis - Hortonworks

Kafka.

Page 35: Trucking demo w Spark ML - Paul Hargis - Hortonworks

What is Kafka?

High throughput distributed messaging system

Publish-Subscribe semantics but re-imagined at the implementation level to operate at speed with big data volumes

[Diagram] Three producers write to the Kafka Cluster; three consumers read from it

Page 36: Trucking demo w Spark ML - Paul Hargis - Hortonworks

Kafka: Anatomy of a Topic

[Diagram] A topic split into Partition 0, Partition 1 and Partition 2. Each partition is a sequence of messages numbered from 0 (old) upward (new), and writes are appended at the new end. The partitions hold different numbers of messages; the longest runs to 12.

Partitioning allows topics to scale beyond a single machine/node
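A toy in-memory model can make the picture concrete. The sketch below is not Kafka's API: the class, its methods, and the policy of picking a partition from the hash of a message key are all assumptions made for illustration. It only shows that each partition is an independent append-only sequence, so different partitions could live on different machines.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy in-memory model of a partitioned topic; not Kafka's API. */
final class TopicSketch {
    // One append-only list of messages per partition.
    private final List<List<String>> partitions = new ArrayList<>();

    TopicSketch(int partitionCount) {
        if (partitionCount <= 0) {
            throw new IllegalArgumentException("partitionCount must be positive");
        }
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new ArrayList<>());
        }
    }

    /** Assumed policy for this sketch: choose the partition from the key's hash. */
    int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), partitions.size());
    }

    /** Appends at the "new" end of the chosen partition and returns the message's position. */
    int append(String key, String message) {
        List<String> log = partitions.get(partitionFor(key));
        log.add(message);
        return log.size() - 1;
    }

    /** Reads the message stored at a given position of a partition. */
    String read(int partition, int position) {
        return partitions.get(partition).get(position);
    }

    /** Number of messages currently held by a partition. */
    int size(int partition) {
        return partitions.get(partition).size();
    }

    public static void main(String[] args) {
        TopicSketch topic = new TopicSketch(3);
        for (int truck = 1; truck <= 5; truck++) {
            String key = "truck-" + truck;
            int position = topic.append(key, "event from " + key);
            System.out.println(key + " -> partition " + topic.partitionFor(key)
                + ", position " + position);
        }
    }
}
```

Because every message with the same key lands in the same partition, the events of one truck stay in order even though the topic as a whole is spread across partitions.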

Page 37: Trucking demo w Spark ML - Paul Hargis - Hortonworks

Storm.

Page 38: Trucking demo w Spark ML - Paul Hargis - Hortonworks

Key Constructs in Apache Storm
• Tuples
• Streams
• Spouts
• Bolts
• Topology

Page 39: Trucking demo w Spark ML - Paul Hargis - Hortonworks

Tuples and Streams

• What is a Tuple?
– Fundamental data structure in Storm. It is a named list of values that can be of any data type.

• What is a Stream?
– An unbounded sequence of tuples.
– Core abstraction in Storm; streams are what you “process” in Storm

Page 40: Trucking demo w Spark ML - Paul Hargis - Hortonworks

Spouts

• What is a Spout?
– Generates, or is a source of, Streams
– E.g.: JMS, Twitter, Log, Kafka Spout
– Can spin up multiple instances of a Spout and dynamically adjust as needed

Page 41: Trucking demo w Spark ML - Paul Hargis - Hortonworks

Bolts

• What is a Bolt?
– Processes any number of input streams and produces output streams
– Common processing in bolts: functions, aggregations, joins, read/write to data stores, alerting logic
– Can spin up multiple instances of a Bolt and dynamically adjust as needed

• Bolts used in the Use Case:
1. HBaseBolt: persisting and counting in HBase
2. HDFSBolt: persisting into HDFS as Avro Files using Flume
3. MonitoringBolt: Read from HBase and create alerts via email and a message to ActiveMQ if the number of illegal driver incidents exceeds a given threshold.
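The threshold check described for the MonitoringBolt can be sketched without any Storm or HBase dependency. In the demo the incident counts live in HBase and the alert goes out by email and ActiveMQ; here an in-memory map stands in for the store and a boolean return value stands in for the alert. The class name, the driver IDs, and the threshold value are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the per-driver threshold check; no Storm or HBase involved. */
final class IncidentMonitorSketch {
    private final int threshold;
    // In-memory stand-in for the per-driver counts that the demo keeps in HBase.
    private final Map<String, Integer> incidentsByDriver = new HashMap<>();

    IncidentMonitorSketch(int threshold) {
        this.threshold = threshold;
    }

    /** Records one incident; returns true when the driver's count now exceeds the threshold. */
    boolean recordIncident(String driverId) {
        int count = incidentsByDriver.merge(driverId, 1, Integer::sum);
        return count > threshold;
    }

    public static void main(String[] args) {
        IncidentMonitorSketch monitor = new IncidentMonitorSketch(2);
        String[] events = {"driver-7", "driver-7", "driver-9", "driver-7"};
        for (String driverId : events) {
            if (monitor.recordIncident(driverId)) {
                System.out.println("ALERT: " + driverId + " exceeded the incident threshold");
            }
        }
    }
}
```

In a real bolt this logic would sit inside the method that handles each incoming tuple, with the alert published to the message queue instead of printed.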

Page 42: Trucking demo w Spark ML - Paul Hargis - Hortonworks

Topology

• What is a Topology?
– A network of spouts and bolts wired together into a workflow