Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Hortonworks – The Hadoop Ecosystem
Fall 2014
Powering the Modern Data Architecture
Shivaji Dutta – Sr. Partner Solutions Engineer
Agenda
• Apache Hadoop and Hortonworks Data Platform (HDP)
• HDP and Couchbase
• What’s new in HDP?
What is Hadoop
Apache Hadoop is an open-source software framework for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.
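To make "distributed processing" more concrete, the following is a minimal single-process sketch of the map, shuffle and reduce phases that Hadoop MapReduce spreads across a cluster. It is plain Python rather than the Hadoop API, and the function names are illustrative assumptions.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

On a real cluster the map and reduce calls would run in parallel on different nodes, close to the data blocks they read.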
Projects in Hadoop
Hadoop Core
– Hadoop Common
– Hadoop Distributed File System
– Hadoop YARN
– Hadoop MapReduce
Other Hadoop Key Projects
• Hive
• HBase
• Spark
• Pig
• Tez
• ZooKeeper
HDP delivers a comprehensive data management platform
Hortonworks Data Platform 2.2
[Diagram: HDP 2.2 platform stack]
• HDFS (Hadoop Distributed File System)
• YARN: Data Operating System (Cluster Resource Management)
• Batch, Interactive & Real-Time Data Access: Script (Pig), SQL (Hive), Java/Scala (Cascading), NoSQL (HBase, Accumulo), Stream (Storm), Search (Solr), In-Memory (Spark), Others (ISV Engines); Tez and Slider also shown
• Governance: Data Workflow, Lifecycle & Governance (Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS)
• Security: Authentication, Authorization, Accounting, Data Protection (Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox; Cluster: Ranger)
• Operations: Provision, Manage & Monitor (Ambari, Zookeeper); Scheduling (Oozie)
• Deployment Choice: Linux, Windows, On-Premises, Cloud
YARN is the architectural center of HDP
Enables batch, interactive and real-time workloads
Provides comprehensive enterprise capabilities
The widest range of deployment options
Delivered Completely in the OPEN
HDFS and YARN – The Core of Hadoop
The core components of HDP are YARN and Hadoop Distributed Filesystem (HDFS).
YARN is the architectural center of Hadoop that enables you to process data simultaneously in multiple ways. YARN provides the resource management and pluggable architecture for enabling a wide variety of data access methods.
HDFS provides the scalable, fault-tolerant, cost-efficient storage for big data.
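As a rough illustration of how HDFS achieves fault tolerance, the sketch below estimates how a file is split into blocks and replicated. The 128 MB block size and replication factor of 3 are the usual Hadoop 2.x defaults and are assumptions here; a given cluster may be configured differently, and the function name is hypothetical.

```python
import math

BLOCK_SIZE_MB = 128  # assumed default block size (dfs.blocksize) in Hadoop 2.x
REPLICATION = 3      # assumed default replication factor (dfs.replication)


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Estimate block count and raw storage for one file stored in HDFS."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    return {
        "blocks": blocks,
        # each block is stored on `replication` different DataNodes
        "block_replicas": blocks * replication,
        # the last block is not padded, so raw usage is size times replication
        "raw_storage_mb": file_size_mb * replication,
    }
```

For example, `hdfs_footprint(1000)` reports 8 blocks, 24 block replicas and 3000 MB of raw storage; losing any single node leaves at least two copies of every block.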
YARN extends Hadoop into data center leaders
YARN: The Architectural Center of Hadoop
• Common data platform, many applications
• Support multi-tenant access & processing
• Batch, interactive & real-time use cases
• Supports 3rd-party ISV tools (e.g. SAS, Syncsort, Actian)
YARN Ready Applications
Facilitates ongoing innovation and enterprise adoption via an ecosystem of new and existing “YARN Ready” solutions
Data Access
YARN provides the foundation for a versatile range of processing engines that empower you to interact with the same data in multiple ways, at the same time.
This means applications can interact with the data in the best way: from batch to interactive SQL or low latency access with NoSQL.
Emerging use cases for data science, search and streaming are also supported with Apache Spark, Solr and Storm.
Additionally, ecosystem partners provide even more specialized data access engines for YARN.
Data Governance and Integration
• HDP extends data access and management with powerful tools for data governance and integration.
• They provide a reliable, repeatable, and simple framework for managing the flow of data in and out of Hadoop. This control structure, along with a set of tooling to ease and automate the application of schema or metadata on sources, is critical for successful integration of Hadoop into your modern data architecture.
• Apache Sqoop
• Apache Oozie
• Apache Falcon
• Apache Flume
Security
• Authentication/ Authorization and Encryption
• Kerberos
• SSL & SASL
• Apache Knox
• Apache Ranger
• HDFS File/Directory Encryption
Operations – Apache Ambari
• Provision, manage and monitor Hadoop clusters
• A complete set of operational capabilities that provides both visibility into the health of your cluster and tooling to manage configuration and optimize performance across all data access methods.
• Apache Ambari provides APIs to integrate with existing management systems: for instance Microsoft System Center and Teradata ViewPoint
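As a hedged sketch of what integrating with Ambari's API could look like, the snippet below builds (but does not send) an HTTP request for cluster information. The `/api/v1/clusters` path, port 8080 and the `X-Requested-By` header reflect the Ambari REST API as commonly documented and should be verified against your Ambari version; the host name, cluster name and credentials are hypothetical.

```python
import base64
import urllib.request


def build_cluster_request(host, cluster=None, user="admin", password="admin", port=8080):
    """Build a GET request for all clusters, or for one named cluster."""
    path = "/api/v1/clusters" + (f"/{cluster}" if cluster else "")
    request = urllib.request.Request(f"http://{host}:{port}{path}")
    # HTTP Basic authentication; the credentials here are placeholders
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    # Ambari expects this header on modifying calls; harmless on a GET
    request.add_header("X-Requested-By", "ambari")
    return request
```

An external management tool would pass the returned object to `urllib.request.urlopen` and parse the JSON response to read cluster state.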
Enterprise Hadoop: Central Set of Services
Enables Apache Hadoop to be an Enterprise Data Platform with centralized services for:
• Governance
• Operations
• Security
Everything that plugs into Hadoop inherits these services
Governance: Load data and manage according to policy
Operations: Deploy and effectively manage the platform (Provision, Manage & Monitor: Ambari, Zookeeper; Scheduling: Oozie)
Security: Provide a layered approach to security through Authentication, Authorization, Accounting, and Data Protection
HDP IS Apache Hadoop
There is ONE Enterprise Hadoop: everything else is a vendor derivation
Hortonworks Data Platform 2.2
[Figure: component version matrix across HDP releases; per-release version numbers for each component are shown in the figure]
Releases: HDP 2.0 (October 2013), HDP 2.1 (April 2014), HDP 2.2 (October 2014)
Components: Hadoop & YARN, Pig, Hive & HCatalog, HBase, Sqoop, Oozie, Zookeeper, Ambari, Storm, Flume, Knox, Phoenix, Accumulo, Falcon, Ranger, Spark, Kafka, Tez, Slider, Solr
Groupings: Data Management, Data Access, Governance & Integration, Security, Operations
* version numbers are targets and subject to change at time of general availability in accordance with ASF release process
The Partner Ecosystem
[Diagram: partner ecosystem around HDP 2.1]
• HDP 2.1: Governance & Integration, Security, Operations, Data Access, Data Management, YARN
• Sources: Existing Systems, Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured
• Data System: RDBMS, EDW, MPP, HANA
• Applications: BusinessObjects BI
• Operational Tools
• Dev & Data Tools
• Infrastructure
Deep Partnerships
Hortonworks engages in deep engineered relationships with the leaders in the data center, such as Microsoft, Teradata, Red Hat, HP, SAS & SAP
Broad Partnerships
Over 900 partners work with us to certify their applications to work with Hadoop so they can extend big data to their users
Couchbase and HDP
• Couchbase is primarily an online operational NoSQL datastore: low latency and scalable
• It can act as both a source of data and a sink
– Example source: pulling user profiles into Hadoop for deep analytics
– Example sink: training machine learning models in Hadoop that are then cached / served from Couchbase
Couchbase and HDP
• HDP Certified Sqoop connector for batch mode export / import
• Couchbase Kafka connector enables both Producer and Consumer scenarios
• Community-supported Storm spout to persist data by writing to Couchbase Server
• Developer Preview Spark Connector
New!
What’s New in HDP 2.2
New and Improved YARN Ready Engines
• Enterprise SQL at Hadoop Scale with Stinger.next
• Enterprise Ready Spark on YARN
• Deep YARN integration for real-time engines: HBase, Accumulo, Storm
• Enabling ISVs with a general SDK and API for direct YARN integration
• Only solution to provide real-time to micro batch for analyzing the internet of things
• Other engines/tools: Solr, Cascading
Continued Innovation of Central Enterprise Services
• Centralized security administration and policy enforcement
• Ease of use and operations agility features to speed cluster deployment
• 100% uptime target with cluster rolling upgrades
Expanded Deployment Options
• Enhanced business continuity with replication/archival across on-premises and cloud storage tiers (Azure Blob, S3)
• Simultaneous ship of Windows and Linux installs
• Expand Azure support beyond HDInsight Azure to include HDP for Windows or Linux in Azure VMs
HDP 2.2: Delivering Apache Hadoop for the Enterprise
Stinger.next: Enterprise SQL at Hadoop Scale
A continuation of momentum built in Apache Hive Community to deliver Enterprise SQL at Hadoop scale
HDP Stinger/Hive Goals:
• Speed: Deliver sub-second query response times
• Scale: The only SQL interface to Hadoop designed for queries that scale from Gigabytes, to Terabytes and Petabytes
• SQL: Enable transactions and SQL:2011 Analytics
Familiar three phase delivery
Stinger delivered 390,000 lines of code to Apache Hive in 13 months from 44 companies, 145 developers
HDP 2.2 – Beyond Read Only
• Transactions with ACID, allowing insert, update & delete
• Temporary tables
• Cost Based Optimizer for star & bushy join queries
Phase 2 – Sub Second
• Sub-second queries with LLAP
• Hive-Spark Machine Learning integration
• Operational reporting w/ Hive streaming ingest & transactions
Phase 3 – Rich Analytics
• SQL:2011 Analytics
• Materialized views
• Cross-geo queries
• Workload management via YARN and LLAP integration
Apache Spark
• Apache Spark is an open source project for fast, large-scale data processing.
– Simple and expressive programming model
– Machine learning, graph computation and streaming
– In-memory compute for iterative workloads
• It does most of the processing in memory
• It supports the programming languages Java, Scala and Python
• It provides high-level modules:
– MLlib
– GraphX
– Spark Streaming
– Spark SQL
• Cluster Manager
– YARN (recommended)
– Mesos
– Spark’s own (standalone)
Enterprise Ready Spark for HDP 2.2 & beyond
HDP 2.2 – Spark on YARN
• Integrated: Hive 0.13 support
• Integrated: Basic ORCfile support
Phase 2 – Spark for HDP 2.2
• Managed: Deployment best practices with YARN Node Labels
• Managed: Ambari Stack Definition: Install/Start/Stop/Config/Quick links to Spark UI
• Security: Spark certification on Kerberized Cluster
• Security: Authentication in Spark UI against LDAP
Phase 3 – Beyond
• Managed: Enhanced workload management & improved debuggability
• Managed: Spark logs published to YARN Application Timeline
• Security: Wire Encryption and Authorization with XA/Argus
• Enhanced ORC support
Deliver a reliable and managed, enterprise grade Apache Spark that will run alongside other workloads in Hadoop via YARN
HDP Spark Goals:
• Integrated: Enterprise-grade Workload Management & Optimized multi-tenancy on YARN
• Secure: Extend comprehensive Hadoop security policy to Spark
• Managed: Provision, manage and monitor Spark along with other engines in Hadoop
HDP 2.2 Delivers more YARN Ready Engines
Bringing more applications and services to YARN and making ISV adoption easier
• Complete work for Pig with Tez
• Cascading with Tez for Java and Scala apps
• Integration of Spark on YARN
• Kafka for inbound messaging to Storm & Spark – widest range from real-time to micro batch for internet of things
Slider introduces native YARN integration for applications with long running services
• HBase, Accumulo, Storm
• SDK for 3rd-party ISVs
Indicates “new to HDP” in 2.2. All engines have been updated
Security in HDP 2.2
HDP 2.2 New Features
• Extend Authorization with Apache Ranger
• Breadth: Knox and Storm integrations
• Policy enforcement at depth: Hive, HDFS and HBase integrations
• Documentation to support community development and partner ecosystem
• Apache Hadoop Advances
• TP: HDFS Transparent Encryption – HDFS-6134
• Key Management Server - HADOOP-10433
• Key Provider API - HADOOP-10141
Continued investment in central security policy for authentication, authorization, audit, and data protection
HDP Security Goals:
• Comprehensive Security: Meet all security requirements across authentication, authorization, audit & data protection for all HDP components.
• Central Administration: Provide central administration of security policy and central viewing and management of audit across the platform.
• Consistent Integration: Integrate with other security and identity management systems, for compliance with IT policies.
Streamlining Operations in HDP 2.2
Apache Ambari 1.7.0 Delivers
• Views: A common, secure, and extensible approach for the user interface for Operators, System Administrators, Application Developers, Data Workers and ISVs
• Blueprints: Create and manage cluster templates for easy deployment
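To give a feel for what a cluster template contains, here is a minimal sketch that assembles a blueprint-style document in Python. The field names (`Blueprints`, `host_groups`, `cardinality`, `components`) follow the Ambari blueprint JSON layout as commonly documented and should be checked against your Ambari release; the helper function, blueprint name and host group names are hypothetical.

```python
import json


def make_blueprint(name, stack_version, host_groups):
    """Build a blueprint-style dict.

    host_groups maps a group name to (cardinality, [component names]).
    """
    return {
        "Blueprints": {
            "blueprint_name": name,
            "stack_name": "HDP",
            "stack_version": stack_version,
        },
        "host_groups": [
            {
                "name": group,
                "cardinality": str(cardinality),
                "components": [{"name": component} for component in components],
            }
            for group, (cardinality, components) in host_groups.items()
        ],
    }


blueprint = make_blueprint(
    "demo-cluster",  # hypothetical blueprint name
    "2.2",
    {
        "master": (1, ["NAMENODE", "RESOURCEMANAGER"]),
        "worker": (3, ["DATANODE", "NODEMANAGER"]),
    },
)
blueprint_json = json.dumps(blueprint, indent=2)
```

Because the template names roles rather than hosts, the same document can be reused to stand up identical clusters in different environments.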
Apache Ambari is advancing at light speed to enable the IT operator to more easily manage clusters
HDP Operations Goals:
• Open: Deliver a complete set of features for Hadoop operations, in public and with the community.
• Integrated: Ensure Hadoop operations integrate with existing IT tools, behind a single pane of glass.
• Intuitive: Make Hadoop’s most complex operational challenges easy to manage.
Ambari 2.0.0 delivers
• Ambari on Windows
• Native metrics and alerts
• Rolling upgrade automation
Rolling Upgrades
Allow continuous operation and up-time for applications and services on the cluster while upgrading
• Single most critical feature for streamlining operations
• HDFS provides the ability to do this today… remaining components need to follow
• Leverages native operating system tools and scripting
• Allow jobs in-flight to complete
• Provides support for rapid rollback
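The idea behind a rolling upgrade can be sketched as a loop that upgrades one node at a time, lets in-flight work drain first, and restores the previous versions if a health check fails. This is a toy simulation under stated assumptions, not how Ambari or HDFS implement it; the `Node` class, the `healthy` callback and the drain step are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    version: str
    running_jobs: int = 0


def rolling_upgrade(nodes, target, healthy=lambda node: True):
    """Upgrade nodes one by one; roll every node back if a check fails."""
    snapshot = {node.name: node.version for node in nodes}
    for node in nodes:
        # stand-in for waiting until in-flight jobs on this node complete
        node.running_jobs = 0
        # only this node is out of service; the others keep serving
        node.version = target
        if not healthy(node):
            # rapid rollback: restore the versions recorded before the upgrade
            for other in nodes:
                other.version = snapshot[other.name]
            return False
    return True
```

Since only one node changes at any moment, the rest of the cluster stays available throughout, which is what allows continuous operation during the upgrade.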
Vision: Maximize Hadoop Deployment Choice
Deployment Choice
• Linux, Windows
• On-Premises, Cloud, Hybrid
“Tethered” Clusters
• Compatible services
• An explicit “connection”
Synchronized Datasets
• Efficient sharing & access
• Governance & lineage
[Diagram: Development & POC Cluster, Production Cluster, BI or ML Cluster, Backup & Archive Cluster]
Cloudbreak with HDP
1. Pick a Blueprint
2. Choose a Cloud
3. Launch HDP!
Example Ambari Blueprints:
• IoT Apps (Storm, HBase, Hive)
• BI / Analytics (Hive)
• Data Science (Spark)
• Dev / Test (all HDP services)
Periscope with HDP
[Diagram: clusters for IoT Apps (Storm, HBase, Hive), BI / Analytics (Hive), Data Science (Spark), Dev / Test (all HDP services)]
Autoscaling Policy
• Policies based on any Ambari metrics
• Coordinates with YARN to achieve elasticity based on the policies
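A metric-driven autoscaling policy of this kind can be sketched as a threshold rule: scale out when a metric is above one bound, scale in when it is below another, and stay within fixed limits. This is an illustrative model only, not Periscope's actual policy format; the class, its fields and the metric name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    metric: str              # name of the monitored metric (hypothetical)
    scale_up_above: float    # add nodes when the metric exceeds this value
    scale_down_below: float  # remove nodes when the metric drops below this
    step: int = 1            # nodes added or removed per decision
    min_nodes: int = 1
    max_nodes: int = 10


def desired_nodes(policy, metrics, current_nodes):
    """Return the node count the policy asks for, given current metrics."""
    value = metrics[policy.metric]
    if value > policy.scale_up_above:
        return min(current_nodes + policy.step, policy.max_nodes)
    if value < policy.scale_down_below:
        return max(current_nodes - policy.step, policy.min_nodes)
    return current_nodes
```

In a real deployment the returned target would be handed to the cluster manager, which adds or decommissions nodes in coordination with YARN.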