Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
How YARN Enables Multiple Data Processing Engines in Hadoop
We Do Hadoop
Eric Mizell - Director, Solution Engineering
Page 2
Agenda
• HDFS Overview - Storage
• YARN 101 - Compute – Yet Another Resource Negotiator
• Enabling a Modern Data Architecture
• YARN in action – Demo of streaming application
• Hadoop Tools – Demos
• Sample Code - https://github.com/emizell/HBase-Code-Samples
Page 3
HDFS Overview
Page 4
HDFS Overview
• Typical Hardware for DataNodes – 2@8 Core – 256GB RAM – 2@24TB Disk – 10 GbE
• Hadoop is rack aware – Data is replicated across racks to ensure no data loss
• Scale up or down – Add or remove DataNodes and HDFS auto rebalances
• HDFS is a file system – Store any kind of data – Inexpensive storage – Replication factor of 3 by default (can be changed)
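The replication default above has a direct cost in capacity planning. A back-of-the-envelope sketch of the arithmetic (the function name is ours, not an HDFS API):

```python
# Rough HDFS sizing: with 3x replication, usable space is raw disk
# capacity divided by the replication factor (ignoring OS/overhead reserves).
def usable_capacity_tb(datanodes, raw_tb_per_node, replication=3):
    """Usable HDFS capacity in TB."""
    return (datanodes * raw_tb_per_node) / replication

# e.g. 10 DataNodes, each with 2@24TB of disk (48 TB raw per node):
print(usable_capacity_tb(10, 48))  # 160.0
```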
Page 5
YARN Concepts
• Application – A job submitted to the framework – Example: a MapReduce job
• Container – Basic unit of allocation – Fine-grained resource allocation across multiple resource types (memory, CPU, disk, network, GPU, etc.) – container_0 = 2GB, 1 CPU – container_1 = 1GB, 6 CPU – Replaces the fixed map/reduce slots
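The container model above amounts to simple resource accounting per node. A toy sketch (not YARN's actual scheduler API):

```python
# Toy model of YARN's fine-grained allocation: a container request is a
# (memory_gb, vcores) pair, granted only if the node has headroom on BOTH.
def try_allocate(free, request):
    free_mem, free_cpu = free
    req_mem, req_cpu = request
    if req_mem <= free_mem and req_cpu <= free_cpu:
        return (free_mem - req_mem, free_cpu - req_cpu)  # remaining headroom
    return None  # cannot run here; the scheduler looks elsewhere or waits

node = (8, 4)                              # 8 GB, 4 vcores free
after_c0 = try_allocate(node, (2, 1))      # container_0 = 2GB, 1 CPU -> fits
after_c1 = try_allocate(after_c0, (1, 6))  # container_1 = 1GB, 6 CPU -> too many vcores
print(after_c0, after_c1)  # (6, 3) None
```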
Page 6
YARN Architecture
• Resource Manager – Global resource scheduler – Hierarchical queues – Application management
• Node Manager – Per-machine agent – Manages the life-cycle of container – Container resource monitoring
• Application Master – Per-application – Manages application scheduling and task execution – E.g. MapReduce Application Master
Page 7
YARN – Running Apps
[Diagram: Hadoop Client 1 and Hadoop Client 2 create and submit app1 and app2 to the ResourceManager, whose Scheduler partitions resources across queues and whose ApplicationsManager (ASM) creates a per-application ApplicationMaster. NodeManagers across Rack1 through RackN host AM1 with containers C1.1–C1.4 and AM2 with containers C2.1–C2.3; AMs negotiate resources with the ResourceManager and NodeManagers send it status reports.]
Page 8
Hadoop 2.x Stack – Enabled by YARN
[Diagram: the HDP stack. HDFS (Hadoop Distributed File System) provides storage; YARN is the Data Operating System (cluster resource management). Running on YARN for batch, interactive & real-time data access: Script (Pig) and SQL (Hive) on Tez, Java/Scala (Cascading) on Tez, Stream (Storm), Search (Solr), NoSQL (HBase, Accumulo) on Slider, In-Memory (Spark), and other ISV engines. Around the stack: Governance (Falcon data workflow/lifecycle, Sqoop, Flume, Kafka, NFS, WebHDFS), Operations (Ambari for provision/manage/monitor, Zookeeper, Oozie scheduling), and Security (authentication, authorization, accounting, data protection — storage: HDFS, resources: YARN, access: Hive, pipeline: Falcon, cluster: Knox and Ranger). Deployment choice: Linux, Windows, on-premises, cloud.]
YARN is the architectural center of HDP
Enables batch, interactive and real-time workloads
Provides comprehensive enterprise capabilities
The widest range of deployment options
Page 9
Hadoop 2.2.x Stack – Versions
Page 10
Enabling a Modern Data Architecture with Apache Hadoop
Hortonworks. We do Hadoop.
Page 11
Existing Siloed Data Architectures Under Pressure
[Diagram: APPLICATIONS (Business Analytics, Custom Applications, Packaged Applications) sit on siloed DATA SYSTEMS (RDBMS, EDW, MPP) fed by SOURCES (existing sources: CRM, ERP, Clickstream, Logs).]
Data growth: new data types
• OLTP, ERP, CRM systems
• Unstructured docs, emails
• Clickstream
• Server logs
• Social/web data
• Sensor/machine data
• Geolocation
85% (Source: IDC)
• Can't manage new data paradigm
• Constrains data to specific schema
• Siloed data
• Limited scalability
• Economically unfeasible
• Limited analytics
Page 12
HDP2 and YARN enable the Modern Data Architecture
Hortonworks architected and led development of YARN
Common data set, multiple applications
• Optionally land all data in a single cluster
• Batch, interactive & real-time use cases
• Support multi-tenant access, processing & segmentation of data
YARN: architectural center of Hadoop
• Consistent security, governance & operations
• Ecosystem applications certified by Hortonworks to run natively in Hadoop
[Diagram: SOURCES (existing systems, clickstream, web & social, geolocation, sensor & machine, server logs, unstructured) flow into the DATA SYSTEM — HDFS plus YARN as the Data Operating System running batch, interactive and real-time workloads alongside RDBMS, EDW and MPP — which feeds APPLICATIONS (business analytics, custom applications, packaged applications).]
Page 13
YARN in Action Hortonworks. We do Hadoop.
Page 14
Truck Sensors
Distributed Storage: HDFS
Many Workloads: YARN
Trucking Company’s YARN-enabled Architecture
Stream Processing (Storm)
Inbound Messaging (Kafka)
Microsoft Excel
Interactive Query (Hive on Tez)
Alerts & Events (ActiveMQ)
Real-Time User Interface
Real-time Serving (HBase)
Page 15
Components of the Topology
• 9 Node HDP 2.2 Cluster with Storm and HBase on YARN
• 4 Node 0.8 Kafka Cluster
• 1 Node ActiveMQ with STOMP protocol enabled
• Spring 4.0 WebMVC web app using SockJS & ActiveMQ over STOMP
Page 16
Topology Architecture
[Diagram: a truck simulator (trucks T(1)…T(N)) generates a stream of driving events via Akka; a KafkaCollector writes them to the truck_events topic on a Kafka grid (brokers BR(1)–BR(5), coordinated by ZooKeeper) that captures all driving events. A Storm topology on YARN on HDP reads the topic with a Kafka spout and fans out to an HBase bolt (writing driver dangerous events and event counts to HBase on HDP), a monitoring bolt (publishing alerts to an ActiveMQ alert topic), and a WebSocket bolt (publishing to an ActiveMQ violation-events topic consumed by a Spring web app with SockJS WebSockets — the real-time streaming driver monitoring app).]
Page 17
Demo
Page 18
Hadoop Tools Hortonworks. We do Hadoop.
Page 19
Agenda
• The Basics
• MapReduce & Java
• Pig
• Hive
• HBase, Solr & Spark
• Abstractions: .NET, Cascading and Spring XD
• Intro to the Sandbox
• Basic Hello World using Hive and Pig
• HBase and Phoenix demo and code discussion
Page 20
Hortonworks Data Platform 2.2
HDP Delivers Enterprise Hadoop
[Diagram: the HDP 2.2 stack — HDFS storage, YARN as the Data Operating System, the data-access engines (Pig and Hive on Tez, Cascading, Storm, Solr, HBase/Accumulo on Slider, Spark, ISV engines), plus governance (Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS), operations (Ambari, Zookeeper, Oozie) and security (authentication, authorization, audit, data protection via HDFS, YARN, Hive, Falcon, Ranger, Knox).]
YARN is the architectural center of HDP
• Common data set across all applications
• Batch, interactive & real-time workloads
• Multi-tenant access & processing
Provides comprehensive enterprise capabilities
• Governance
• Security
• Operations
Enables broad ecosystem adoption
• ISVs can plug directly into Hadoop
The widest range of deployment options
• Linux & Windows
• On-premises & cloud
Page 21
Hortonworks Data Platform 2.2
HDP Delivers Enterprise Hadoop
[Diagram: the same HDP 2.2 stack, with the data-access engines highlighted.]
We will cover:
• What it is & where it is used
• Basic elements
Page 22
MapReduce
MapReduce is a framework for writing applications that process large amounts of structured and unstructured data in parallel across a cluster of thousands of machines, in a reliable and fault-tolerant manner
Developers use it to…
• They don't have to anymore
• Many tools have been created to abstract this complexity
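The map/shuffle/reduce flow described above can be simulated in-process. A minimal sketch of the model (not Hadoop's Java API):

```python
from collections import defaultdict

# Word count, the "hello world" of MapReduce:
# map emits (key, value) pairs, shuffle groups them by key, reduce aggregates.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["we do hadoop", "we do yarn"])))
print(counts)  # {'we': 2, 'do': 2, 'hadoop': 1, 'yarn': 1}
```

In real MapReduce the map and reduce tasks run in parallel across the cluster, with HDFS holding the input and output; the dataflow is the same.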
[Diagram: map tasks (M) feed reduce tasks (R) in stages, reading input from and writing results back to HDFS.]
Page 23
Pig • Apache™ Pig allows you to write complex
MapReduce transformations using a simple scripting language.
• Pig Latin (the language) defines a set of transformations on a data set such as aggregate, join and sort.
• Pig Latin can be extended with UDFs (User Defined Functions) written in Java or a scripting language and then called directly from Pig Latin.
Developers use Pig for • ETL
• Basic “spreadsheet” functions
• Prepare data for data science
Page 24
Example
RAW_LOGS = LOAD '/user/paul/data/apache/access' USING TextLoader as (line:chararray);
CLICKS_RAW = LOAD '$input' USING PigStorage('|') as (sls_key:chararray, sls_item_ln_id:int, chn_id:int, loc_id:int, all_chnl_rpt_chn_id:int, all_chnl_rpt_loc_id:int, sls_bsns_dt:chararray, sku_id:int);
RECORDS = load 'config' using org.apache.hcatalog.pig.HCatLoader();
Page 25
Pig Operators
Page 26
Hive • Apache Hive is the defacto standard for SQL
queries over petabytes of data in Hadoop
• Created by a team at Facebook.
• Provides a standard SQL interface to data stored in Hadoop.
• Quickly find value in raw data files.
• Proven at petabyte scale.
• Compatible with popular BI tools such as Tableau, Excel, MicroStrategy, Business Objects, etc.
Developers use it to: • Perform SQL queries
• Interface with existing tools via JDBC/ODBC
Page 27
Sample SQL with Hive
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];
Page 28
Hive - Select Syntax
Page 29
Hive Demonstration
HDP Sandbox
• Up and running with a Hadoop environment in minutes
• Basic and advanced tutorials with discrete learning paths
• Ecosystem partner tutorials
hortonworks.com/sandbox
Page 30
HBase • Apache™ HBase is a non-relational (NoSQL)
database that runs on top of the Hadoop® Distributed File System (HDFS).
• It is columnar and provides fault-tolerant storage and quick access to large quantities of sparse data.
• It also adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts and deletes.
• HBase was created for hosting very large tables with billions of rows and millions of columns.
Developers use it to:
• Provide low-latency access to massive amounts of data (e.g. recommendation engine results)
• Document store
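HBase's data model is essentially a sparse, sorted map of row key to column-family-qualified cells. A toy illustration of that model (the table contents and key format are invented for this example, loosely echoing the trucking demo; this is not the HBase client API):

```python
# Toy sketch of HBase's data model: row key -> {"family:qualifier" -> value}.
# Rows are sparse by design — each row stores only the columns it has.
class ToyTable:
    def __init__(self):
        self.rows = {}  # real HBase keeps rows sorted and distributed

    def put(self, row_key, column, value):
        self.rows.setdefault(row_key, {})[column] = value

    def get(self, row_key, column):
        return self.rows.get(row_key, {}).get(column)

t = ToyTable()
t.put("driver42|2014-11-01", "events:speeding", "1")
print(t.get("driver42|2014-11-01", "events:speeding"))  # 1
print(t.get("driver42|2014-11-01", "events:parked"))    # None (sparse row)
```

Because rows are kept sorted by key, choosing a good row-key design (e.g. prefixing by the entity you scan by) is what makes low-latency access possible.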
Page 31
Phoenix • Apache™ Phoenix is a high performance
relational database layer over HBase for low latency applications.
• SQL queries are compiled into a series of HBase scans producing regular JDBC result sets.
• Table metadata is stored in an HBase table and versioned and can be queried by version.
• Query performance is in the millisecond-to-low-seconds range.
• The largest known table is a trillion+ rows, with query response times in the 30-second range.
Developers use it for: • Low latency queries
• SQL skin on HBase
Page 32
Phoenix Functions
Page 33
HBase/Phoenix Demonstration
HDP Sandbox
• Up and running with a Hadoop environment in minutes
• Basic and advanced tutorials with discrete learning paths
• Ecosystem partner tutorials
hortonworks.com/sandbox
Page 34
Storm • Apache™ Storm is a distributed real-time
computation system for processing fast, large streams of data. Storm adds reliable real-time data processing capabilities to Hadoop.
• Storm is extremely fast, with the ability to process over a million records per second per node on a cluster of modest size.
• Apache Kafka is a publish-subscribe messaging system that works well with Storm.
Developers use it to:
• Analyze stream data in real time
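The spout-and-bolt idea behind a Storm topology can be sketched with generators: a spout emits tuples, and each bolt transforms or filters the stream flowing through it. This is a toy pipeline, not Storm's API, and the speeding threshold is invented:

```python
# Toy spout/bolt pipeline in the spirit of a Storm topology.
def spout(events):
    for event in events:        # a real spout would read from e.g. Kafka
        yield event

def speeding_bolt(stream, limit=65):
    for truck, speed in stream:  # filter bolt: pass only violations downstream
        if speed > limit:
            yield (truck, speed)

alerts = list(speeding_bolt(spout([("T1", 72), ("T2", 55), ("T3", 80)])))
print(alerts)  # [('T1', 72), ('T3', 80)]
```

In Storm the same shape is expressed as spout and bolt classes wired into a topology, with tuples flowing continuously and in parallel rather than through an in-process generator chain.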
Page 35
Solr • Apache Solr provides full-text search and
near real-time indexing for data stored in Hadoop.
• Whether users search for tabular, text, geo-location or sensor data in Hadoop, they find it quickly with Apache Solr.
• Apache Solr indexes via XML, JSON, CSV or binary over HTTP. Users can query petabytes of data via HTTP GET and receive XML, JSON, CSV or binary results.
Developers use it to:
• Provide search capability for a cluster
• Data scientists often use it to explore data found in HDFS
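The core structure behind full-text engines like Solr is an inverted index: a map from each term to the set of documents containing it. A minimal sketch of that idea (whitespace tokenization only; real Solr analyzers also handle stemming, stop words, etc.):

```python
from collections import defaultdict

# Build an inverted index: term -> set of document ids containing it.
def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

idx = build_index({1: "yarn enables hadoop", 2: "hadoop stores data"})
print(sorted(idx["hadoop"]))  # [1, 2]
```

A query then becomes a cheap set lookup (and intersections for multi-term queries) instead of a scan over the raw documents.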
Page 36
Spark • Spark is a general-purpose engine for ad-hoc
interactive analytics, iterative machine-learning, and other use cases well-suited to interactive, in-memory data processing of GB to TB sized datasets.
• Spark loads data into memory so it can be queried repeatedly. It can keep a "shadow" of the data to be used in the next iteration of a query.
• Spark provides simple APIs for data scientists and engineers familiar with Scala (programming language) to build applications
• Spark is YARN-ready – another engine on YARN!
Developers use it to: • Data Science: machine learning
and iterative analytics
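Spark's in-memory reuse can be sketched as compute-once caching: the first access materializes the dataset, and later iterations read the cached copy instead of recomputing. A toy model, not Spark's RDD/DataFrame API:

```python
# Toy sketch of a cached dataset: compute once, reuse across iterations.
class ToyDataset:
    def __init__(self, compute):
        self._compute = compute
        self._cache = None

    def collect(self):
        if self._cache is None:          # first access materializes the data
            self._cache = self._compute()
        return self._cache               # later accesses hit memory

calls = []
ds = ToyDataset(lambda: calls.append(1) or [x * x for x in range(5)])
ds.collect()
ds.collect()                             # served from cache, no recompute
print(len(calls), ds.collect())  # 1 [0, 1, 4, 9, 16]
```

This is why iterative algorithms (e.g. machine learning loops) benefit so much: the expensive load/transform happens once, and each iteration pays only for its own work.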
Page 37
Cascading • Cascading is an application development
framework for building data applications. Converts applications into MapReduce jobs.
• The Cascading SDK provides a collection of tools, documentation, libraries, tutorials and example projects.
• Lingual. Simplifies systems integration through ANSI SQL compatibility and a JDBC driver
• Pattern. Enables various machine learning scoring algorithms through PMML compatibility
• Scalding. Enables development with Scala, a powerful language for solving functional problems
• Cascalog. Enables development with Clojure, a Lisp dialect
Developers use it to: • Build complex native Hadoop
applications without getting into MapReduce.
Page 38
.NET • The Microsoft .NET SDK for Hadoop provides
API access to HDP and Microsoft HDInsight including HDFS, HCatalog, Oozie and Ambari, and also some Powershell scripts for cluster management.
• There are also libraries for MapReduce and LINQ to Hive. The latter is really interesting as it builds on the established technology for .NET developers to access most data sources to deliver the capabilities of the de facto standard for Hadoop data query.
Developers use it to:
• Build complex Microsoft .NET Hadoop applications
Page 39
Java & Spring XD • Spring for Apache Hadoop (SHDP) provides a
developer API for Pig, Hive, Cascading and provides extensions to Spring Batch for orchestrating Hadoop based workflows.
• It integrates with other Spring ecosystem projects such as Spring Integration and Spring Batch
• These foundational parts of Spring IO platform make Hadoop development more accessible to a wider range of Java developers.
Developers use it to: • Build complex Hadoop
applications using Java and the Spring framework
Page 40
Hadoop Summit 2015
Page 41
Thank You! Eric Mizell – Director, Solutions Engineering [email protected] @ericmizell