© Hortonworks Inc. 2011 – 2014. All Rights Reserved

Real-Time Processing in Hadoop, Hadoop Summit 2015. Ali Bajwa, Partner Solutions Engineer, June 2015

Internet of things Crash Course Workshop


Page 1: Internet of things Crash Course Workshop


Real-Time Processing in Hadoop
Hadoop Summit 2015

Ali Bajwa
Partner Solutions Engineer
June 2015

Page 2: Internet of things Crash Course Workshop

© Hortonworks Inc. 2012 Professional Services

Agenda

• Introduction & about Hortonworks HDP
• Overview of logistics industry scenario
• Overview of streaming architecture on HDP
• Streaming Demo #1
• Integrating Predictive Analytics in streaming scenarios
• Streaming Demo with Predictive additions
• Q & A

Page 3: Internet of things Crash Course Workshop

Preface: Enabling Technologies

• Problems solved at scale, via fundamentally new approaches…

• Make it possible, even simple, to produce new products/applications that would have been too cost-prohibitive, or simply impossible, beforehand.

• Where foundation tech like Li-Ion batteries, retina displays, GPS & tiny HD cameras (from smartphones) have enabled Electric cars, quad-copters, VR displays, & more…

• Hadoop has similarly led to breakthroughs in big data scale & capability, and enables new real-time advanced analytic applications.

Page 4: Internet of things Crash Course Workshop


Why did Hadoop emerge?

April 2015

Page 5: Internet of things Crash Course Workshop


Traditional systems under pressure

Challenges

• Constrains data to app

• Can’t manage new data

• Costly to Scale

[Figure: business value versus data growth, from 2.8 zettabytes in 2012 to 40 zettabytes in 2020, contrasting laggards with industry leaders. Traditional data: ERP, CRM, SCM. New data: clickstream, geolocation, web data, Internet of Things, docs and emails, server logs.]

Page 6: Internet of things Crash Course Workshop


Hadoop for the Enterprise: Implement a Modern Data Architecture with HDP

Spring 2015

Hortonworks. We do Hadoop.

Page 7: Internet of things Crash Course Workshop


Hadoop for the Enterprise: Implement a Modern Data Architecture with HDP

Customer Momentum

• 330+ customers (as of year-end 2014)

Hortonworks Data Platform

• Completely open multi-tenant platform for any app & any data.

• A centralized architecture of consistent enterprise services for resource management, security, operations, and governance.

Partner for Customer Success

• Open source community leadership focus on enterprise needs

• Unrivaled world class support

• Founded in 2011

• Original 24 architects, developers, operators of Hadoop from Yahoo!

• 600+ Employees

• 1000+ Ecosystem Partners

Page 8: Internet of things Crash Course Workshop


Customer Partnerships matter

Driving our innovation through Apache Software Foundation Projects

Apache Project   Committers   PMC Members
Hadoop           27           21
Pig              5            5
Hive             18           6
Tez              16           15
HBase            6            4
Phoenix          4            4
Accumulo         2            2
Storm            3            2
Slider           11           11
Falcon           5            3
Flume            1            1
Sqoop            1            1
Ambari           34           27
Oozie            3            2
Zookeeper        2            1
Knox             13           3
Ranger           10           n/a
TOTAL            161          108

Source: Apache Software Foundation. As of 11/7/2014.

Hortonworkers are the architects and engineers that lead development of open source Apache Hadoop at the ASF

• Expertise: Uniquely capable to solve the most complex issues & ensure success with latest features

• Connection: Provide customers & partners direct input into the community roadmap

• Partnership: We partner with customers with subscription offering. Our success is predicated on yours.

[Chart: committer counts by company. 27 (label lost in extraction); Yahoo: 10; Cloudera: 11; Facebook: 5; LinkedIn: 2; IBM: 2; Others: 23]

Page 9: Internet of things Crash Course Workshop


Technology Partnerships matter

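[Table: partner relationship matrix; the per-partner check marks were lost in extraction. Column headings: Apache Project, Hortonworks Relationship, Named Partner, Certified Solution, Resells, Joint Engr. Partners listed: Microsoft, HP, SAS, SAP, IBM, Pivotal, Redhat, Teradata, Informatica, Oracle.]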

It is not just about packaging and certifying software…

Our joint engineering with our partners drives open source standards for Apache Hadoop

HDP is Apache Hadoop

Page 10: Internet of things Crash Course Workshop


HDP delivers a Centralized Architecture

Modern Data Architecture

• Unifies data and processing.

• Enables applications to have access to all your enterprise data through an efficient centralized platform

• Supported with a centralized approach to governance, security and operations

• Versatile to handle any applications and datasets no matter the size or type

[Diagram: Modern Data Architecture. Sources: existing systems (ERP, CRM, SCM), clickstream, web & social, geolocation, sensor & machine, server logs, unstructured data. Platform: HDFS (Hadoop Distributed File System) with YARN: Data Operating System running batch, interactive, real-time and partner ISV workloads, alongside EDW and MPP systems. Analytics: data marts, applications, business analytics, visualization & dashboards.]

Page 11: Internet of things Crash Course Workshop


HDP delivers a completely open data platform

Hortonworks Data Platform 2.2

Hortonworks Data Platform provides Hadoop for the Enterprise: a centralized architecture of core enterprise services, for any application and any data.

Completely Open

• HDP incorporates every element required of an enterprise data platform: data storage, data access, governance, security, operations

• All components are developed in open source and then rigorously tested, certified, and delivered as an integrated open source platform that’s easy to consume and use by the enterprise and ecosystem.

[Diagram: HDP 2.2 components, layered on HDFS (Hadoop Distributed File System) and YARN: Data Operating System (Cluster Resource Management).
Governance: Apache Falcon, Apache Sqoop, Apache Flume, Apache Kafka.
Batch, interactive & real-time data access: Apache Pig, Apache Hive, Cascading, Apache HBase, Apache Accumulo, Apache Solr, Apache Spark, Apache Storm.
Security: Apache Ranger, Apache Knox, Apache Falcon.
Operations: Apache Ambari, Apache Zookeeper, Apache Oozie.]

Page 12: Internet of things Crash Course Workshop


Real World Use Case: Trucking Company

Spring 2015

Hortonworks. We do Hadoop.

Page 13: Internet of things Crash Course Workshop


Scenario Overview.

Page 14: Internet of things Crash Course Workshop


Trucking company w/ large fleet of trucks in Midwest

A truck generates millions of events for a given route; an event could be:

‘Normal’ events: starting / stopping of the vehicle

‘Violation’ events: speeding, excessive acceleration and braking, unsafe tail distance

Company uses an application that monitors truck locations and violations from the truck/driver in real-time

Route? Truck? Driver?

Analysts query a broad history to understand if today’s violations are part of a larger problem with specific routes, trucks, or drivers
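
The split between normal and violation events, and the per-route or per-driver question analysts ask, can be sketched in Python. This is a minimal illustration rather than the workshop's code: the `TruckEvent` fields, the event-type strings and the `violations_by` helper are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical event-type names; the slides list the categories but not identifiers.
VIOLATION_TYPES = {
    "speeding",
    "excessive_acceleration",
    "excessive_braking",
    "unsafe_tail_distance",
}


@dataclass(frozen=True)
class TruckEvent:
    truck_id: int
    driver_id: int
    route_id: int
    event_type: str

    def is_violation(self) -> bool:
        """True when the event belongs to the violation category."""
        return self.event_type in VIOLATION_TYPES


def violations_by(events, key):
    """Count violation events grouped by an attribute such as 'route_id'."""
    counts = {}
    for event in events:
        if event.is_violation():
            group = getattr(event, key)
            counts[group] = counts.get(group, 0) + 1
    return counts


events = [
    TruckEvent(1, 10, 100, "start"),
    TruckEvent(1, 10, 100, "speeding"),
    TruckEvent(2, 11, 100, "unsafe_tail_distance"),
    TruckEvent(2, 11, 101, "stop"),
]
print(violations_by(events, "route_id"))   # violations per route
print(violations_by(events, "driver_id"))  # violations per driver
```

Grouping by `route_id`, `truck_id` or `driver_id` mirrors the Route? Truck? Driver? question on the slide.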

Page 15: Internet of things Crash Course Workshop


Trucking Company’s YARN-enabled Architecture

[Diagram: one cluster with consistent security, governance & operations. Components: Truck Sensors; Inbound Messaging (Kafka); Stream Processing (Storm); Real-time Serving (HBase); Alerts & Events (ActiveMQ); Real-Time User Interface; Interactive Query (Hive on Tez) via SQL. Many Workloads: YARN. Distributed Storage: HDFS.]

Page 16: Internet of things Crash Course Workshop

Trucking Company’s YARN-enabled Architecture (architecture diagram repeated from Page 15)

Page 17: Internet of things Crash Course Workshop

What is Kafka?

High throughput distributed messaging system

Publish-Subscribe semantics but re-imagined at the implementation level to operate at speed with big data volumes

Kafka @LinkedIn:
• 800 billion messages per day
• 175 terabytes of data written per day
• 650 terabytes of data read per day
• Over 13 million messages / 2.75 GB of data per second

[Diagram: three producers publish to a Kafka cluster; three consumers read from it.]

Page 18: Internet of things Crash Course Workshop

Kafka: Anatomy of a Topic

[Figure: a topic split into Partition 0, Partition 1 and Partition 2. Each partition is an ordered log of messages numbered by offset from old to new, and writes are appended at the new end. The three partitions hold offsets 0–9, 0–11 and 0–12; which partition holds which range was lost in extraction.]

Partitioning allows topics to scale beyond a single machine/node

Topics can also be replicated, for high availability.
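
The partitioned log described above can be sketched with an in-memory stand-in. This is not the Kafka client API: the `Topic` class, its methods and the CRC32 key hashing are assumptions made for illustration, and replication is left out.

```python
import zlib


class Topic:
    """In-memory stand-in for a partitioned, append-only topic (not the Kafka API)."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, message):
        """Append to the partition chosen by hashing the key; return (partition, offset)."""
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        log = self.partitions[partition]
        log.append(message)
        return partition, len(log) - 1

    def read(self, partition, offset):
        """Return every message in a partition from the given offset onward."""
        return self.partitions[partition][offset:]


topic = Topic(num_partitions=3)
for i in range(6):
    topic.publish("truck-7", f"event-{i}")  # same key, so same partition, in order

partition, offset = topic.publish("truck-7", "event-6")
print(partition, offset)
print(topic.read(partition, 4))  # a consumer resuming from offset 4
```

Because each consumer tracks its own offset, several consumers can read the same partition at different positions without affecting one another.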

Page 19: Internet of things Crash Course Workshop

Trucking Company’s YARN-enabled Architecture (architecture diagram repeated from Page 15)

Page 20: Internet of things Crash Course Workshop


Apache Storm

• Distributed, real-time, fault-tolerant stream processing platform.
• Provides processing guarantees.
• Key concepts include:
– Tuples
– Streams
– Spouts
– Bolts
– Topology

Page 21: Internet of things Crash Course Workshop


Tuples and Streams

• What is a Tuple?
– The fundamental data structure in Storm: a named list of values that can be of any data type

• What is a Stream?
– An unbounded sequence of tuples
– The core abstraction in Storm; streams are what you “process” in Storm
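
As a rough analogy in plain Python (not Storm's API), a tuple can be modeled as a named tuple and a stream as a generator that never finishes. The field names and the `truck_stream` function are hypothetical.

```python
import itertools
import random
from collections import namedtuple

# A tuple is a named list of values; the field names here are hypothetical.
TruckTuple = namedtuple("TruckTuple", ["truck_id", "event_type", "speed_mph"])


def truck_stream(seed=0):
    """Yield tuples forever: an unbounded sequence, like a stream."""
    rng = random.Random(seed)
    while True:
        yield TruckTuple(
            truck_id=rng.randint(1, 5),
            event_type=rng.choice(["normal", "speeding"]),
            speed_mph=rng.randint(40, 90),
        )


# A stream has no end, so a consumer takes what it needs instead of waiting for completion.
first_three = list(itertools.islice(truck_stream(), 3))
for t in first_three:
    print(t.truck_id, t.event_type, t.speed_mph)
```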

Page 22: Internet of things Crash Course Workshop


Spouts

• What is a Spout?
– Generates, or is a source of, streams
– E.g.: JMS, Twitter, Log, Kafka Spout
– Can spin up multiple instances of a Spout and dynamically adjust as needed

Page 23: Internet of things Crash Course Workshop


Bolts

• What is a Bolt?
– Processes any number of input streams and produces output streams
– Common processing in bolts includes functions, aggregations, joins, reads/writes to data stores, and alerting logic
– Can spin up multiple instances of a Bolt and dynamically adjust as needed

• Bolts used in the Use Case:
1. HBaseBolt: persisting and counting in HBase
2. HDFSBolt: persisting into HDFS as Avro files using Flume
3. MonitoringBolt: reads from HBase and creates alerts via email and a message to ActiveMQ if the number of illegal driver incidents exceeds a given threshold

Page 24: Internet of things Crash Course Workshop


Topology

• What is a Topology?
– A network of spouts and bolts wired together into a workflow
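
The idea of wiring a spout and bolts into a workflow can be sketched with plain Python generators. This is not Storm code and runs in one process; the class names, the dictionary tuple layout and the threshold value are assumptions loosely modeled on the counting and monitoring bolts of the use case.

```python
from collections import Counter


def event_spout(raw_events):
    """Spout: the source of the stream (a Kafka topic in the slides, a list here)."""
    for driver_id, event_type in raw_events:
        yield {"driver_id": driver_id, "event_type": event_type}


class CountingBolt:
    """Bolt: counts violations per driver and passes each tuple on."""

    def __init__(self):
        self.counts = Counter()

    def process(self, stream):
        for tup in stream:
            if tup["event_type"] != "normal":
                self.counts[tup["driver_id"]] += 1
            yield tup


class MonitoringBolt:
    """Bolt: records one alert when a driver's count exceeds the threshold."""

    def __init__(self, counts, threshold):
        self.counts = counts
        self.threshold = threshold
        self.alerts = []

    def process(self, stream):
        for tup in stream:
            driver = tup["driver_id"]
            if self.counts[driver] > self.threshold and driver not in self.alerts:
                self.alerts.append(driver)
            yield tup


# Wire spout -> counting bolt -> monitoring bolt, then drain the stream.
raw = [(1, "speeding"), (2, "normal"), (1, "speeding"), (1, "unsafe_tail_distance")]
counter = CountingBolt()
monitor = MonitoringBolt(counter.counts, threshold=2)
for _ in monitor.process(counter.process(event_spout(raw))):
    pass
print(monitor.alerts)
```

In the use case the counts live in HBase and the alert goes to email and ActiveMQ; here both are kept in memory so the wiring is the only thing on display.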

Page 25: Internet of things Crash Course Workshop

Trucking Company’s YARN-enabled Architecture (architecture diagram repeated from Page 15)

Page 26: Internet of things Crash Course Workshop

Key Constructs in Apache HBase

• HBase = Key / Value store
• Designed for petabyte scale
• Supports low latency reads, writes and updates
• Key features
– Updateable records
– Versioned records
– Distributed across a cluster of machines
– Low latency
– Caching
• Popular use cases:
– User profiles and session state
– Object store
– Sensor apps

Page 27: Internet of things Crash Course Workshop


Data Assignment

[Figure: an HBase table whose keys are divided among different RegionServers.]
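
Dividing keys among servers amounts to a sorted range lookup, sketched below. The start keys and server names are made up for illustration, and the lookup is a simplification of how a client locates a region.

```python
import bisect

# Hypothetical layout: each region starts at a row key and runs up to the next start key.
REGION_START_KEYS = ["", "driver-300", "driver-600"]
REGION_SERVERS = ["regionserver-a", "regionserver-b", "regionserver-c"]


def locate(row_key):
    """Return the server whose region's key range contains row_key."""
    index = bisect.bisect_right(REGION_START_KEYS, row_key) - 1
    return REGION_SERVERS[index]


for key in ["driver-042", "driver-300", "driver-999"]:
    print(key, "->", locate(key))
```

Because row keys are kept sorted, neighboring keys land in the same region, which is what makes range scans over a section of the table efficient.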

Page 28: Internet of things Crash Course Workshop


Data Access

• Get
– Retrieves a single cell, all cells with a matching rowkey, or all cells in a column family with a matching rowkey

• Put
– Inserts a new version of a cell

• Scan
– Reads the whole table, row by row, or a section of the table from a given start key to a given end key

• Delete
– Actually a form of Put: it adds a new version carrying a deletion marker

• SQL via Apache Phoenix
– Unique capability in the NoSQL market
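
The four operations above, including a delete implemented as a put of a deletion marker, can be sketched with a toy versioned table. This is not the HBase client API: `VersionedTable`, the counter used in place of timestamps and the `events:speeding` column name are assumptions for the example.

```python
import itertools

TOMBSTONE = object()  # deletion marker written by delete()


class VersionedTable:
    """Toy table: (row key, column) -> list of (version, value), newest last."""

    def __init__(self):
        self.cells = {}
        self.clock = itertools.count(1)  # stand-in for timestamps

    def put(self, row, column, value):
        """Insert a new version of a cell; older versions are kept."""
        self.cells.setdefault((row, column), []).append((next(self.clock), value))

    def delete(self, row, column):
        """A delete is a put of a deletion marker."""
        self.put(row, column, TOMBSTONE)

    def get(self, row, column):
        """Return the newest value of a cell, or None if absent or deleted."""
        versions = self.cells.get((row, column))
        if not versions or versions[-1][1] is TOMBSTONE:
            return None
        return versions[-1][1]

    def scan(self, start_row, stop_row):
        """Yield (row, column, value) for live cells with start_row <= row < stop_row."""
        for row, column in sorted(self.cells):
            if start_row <= row < stop_row:
                value = self.get(row, column)
                if value is not None:
                    yield row, column, value


table = VersionedTable()
table.put("driver-10", "events:speeding", 1)
table.put("driver-10", "events:speeding", 2)  # second version of the same cell
table.put("driver-11", "events:speeding", 5)
table.delete("driver-11", "events:speeding")
print(table.get("driver-10", "events:speeding"))
print(list(table.scan("driver-10", "driver-12")))
```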

Page 29: Internet of things Crash Course Workshop

Trucking Company’s YARN-enabled Architecture (architecture diagram repeated from Page 15)

Page 30: Internet of things Crash Course Workshop


[Figure: evolution of Hadoop, with timeline labels 2006, 2009 and October 23, 2013.
Hadoop w/ MapReduce: HDFS (Hadoop Distributed File System) with MapReduce on top, largely batch processing. Siloed clusters, largely batch system, difficult to integrate.
MR-279: YARN.
Hadoop 2 & YARN based architecture: HDFS with YARN: Data Operating System on top, supporting batch, interactive and real-time workloads.]

Architected & led development of YARN to enable the Modern Data Architecture

Page 31: Internet of things Crash Course Workshop


Benefits of YARN as the Data Operating System

• The container-based model allows for running nearly any workload.
– Enables the centralized architecture.
– MapReduce is no longer the only data processing engine.
– Docker containers managed by YARN. Yes please!

• Decouples resource scheduling from application lifecycle.
– Improved scalability and fault tolerance

• Dynamically allocated resources, resulting in HUGE utilization gains
– Versus static allocation of “slots” in Hadoop 1.0

Yahoo has over 30000 nodes running YARN across over 365PB of data. They calculate running about 400,000 jobs per day for about 10 million hours of compute time.

They also have estimated a 60% – 150% improvement on node usage per day since moving to YARN.
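
Why dynamic allocation can raise utilization compared with static slots is easy to see in a toy model. The node size, slot counts and task sizes below are hypothetical, and neither function reflects the real Hadoop 1 or YARN scheduler logic.

```python
def static_slots_running(pending_map, pending_reduce, map_slots, reduce_slots):
    """Hadoop 1 style: slots are typed, so idle reduce slots cannot run map tasks."""
    return min(pending_map, map_slots) + min(pending_reduce, reduce_slots)


def dynamic_containers_running(task_memory_mb, node_memory_mb):
    """YARN style: grant containers from one shared pool until memory runs out."""
    running, free = 0, node_memory_mb
    for need in task_memory_mb:
        if need <= free:
            free -= need
            running += 1
    return running


# Hypothetical node: 8 GB, statically split into 4 map slots and 4 reduce slots of 1 GB.
# Workload: 8 map tasks of 1 GB each and no reduce tasks yet.
static = static_slots_running(pending_map=8, pending_reduce=0, map_slots=4, reduce_slots=4)
dynamic = dynamic_containers_running([1024] * 8, node_memory_mb=8192)
print(static, dynamic)
```

With typed slots half of the node sits idle during the map-heavy phase, while the shared pool runs all eight tasks at once.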

Page 32: Internet of things Crash Course Workshop

Trucking Company’s YARN-enabled Architecture (architecture diagram repeated from Page 15)

Page 33: Internet of things Crash Course Workshop


Apache HDFS – Hadoop Distributed File System

• Very large scale distributed file system
• 10K nodes, tens of millions of files and PBs of data
• Supports large files

• Designed to run on commodity hardware; assumes hardware failures

• Files are replicated to handle hardware failure
• Detects failures and recovers from them automatically

• Optimized for large-scale processing
• Data locations are exposed so that computations can move to where the data resides
• Data coherency

• Write-once, read-many access pattern
• Files are broken up into chunks called ‘blocks’

• Blocks are distributed over nodes
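
Splitting a file into blocks and replicating each block across nodes can be sketched as follows. The 128 MB block size and replication factor of 3 are assumed values (the slides give neither), and the round-robin placement is a simplification; real HDFS placement is rack-aware.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size, not stated in the slides
REPLICATION = 3                 # assumed replication factor


def split_into_blocks(file_size):
    """Return the size in bytes of each block for a file of file_size bytes."""
    full, remainder = divmod(file_size, BLOCK_SIZE)
    return [BLOCK_SIZE] * full + ([remainder] if remainder else [])


def place_replicas(num_blocks, nodes):
    """Round-robin placement so each block lands on REPLICATION distinct nodes.

    Requires len(nodes) >= REPLICATION; illustrative only.
    """
    placement = {}
    for block in range(num_blocks):
        placement[block] = [nodes[(block + r) % len(nodes)] for r in range(REPLICATION)]
    return placement


def readable_after_failure(placement, failed_node):
    """True if every block still has at least one replica on a live node."""
    return all(any(n != failed_node for n in replicas) for replicas in placement.values())


blocks = split_into_blocks(300 * 1024 * 1024)  # a 300 MB file
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
print(len(blocks), placement[0])
print(readable_after_failure(placement, "node2"))
```

Exposing `placement` is what lets a scheduler send computation to a node that already holds the block, in the spirit of the data-locality bullet above.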

Page 34: Internet of things Crash Course Workshop


Streaming Demo - High Level Architecture

[Diagram components: Truck Streaming Data from trucks T(1), T(2), ... T(N); Inbound Messaging (Kafka) with a Truck Events Topic; Storm Stream Processing on YARN with a Kafka Spout, HBase Bolt, HDFS Bolt and Monitoring Bolt; HBase with a Dangerous Events Table and Truck Events; Distributed Storage: HDFS; ActiveMQ; Web App.]

Page 35: Internet of things Crash Course Workshop


Demo – Streaming Dashboard.

Page 36: Internet of things Crash Course Workshop


Lab #1: bit.ly/1L3RLMo
Lab #2: bit.ly/1FW7ENl (<-lower case L)
Lab #3: bit.ly/1L3S0ah
Shell cheatsheet: bit.ly/1JN8EsO
Slides: bit.ly/1MtVoIL (<-capital I)
Twitter demo: github.com/abajwa-hw/hdp22-twitter-demo
Custom services: github.com/hortonworks-gallery
webinars: hortonworks.com/partners/learn
email: abajwa@
IoT demo: youtube.com/watch?v=FHMMcMYhmNI