© Hortonworks Inc. 2014
Hortonworks: We Do Hadoop “State of the Union” Webinar
Shaun Connolly, VP Strategy @shaunconnolly, @hortonworks January 22, 2014
Today’s Webinar
• Apache Hadoop & Hortonworks Overview
• Hadoop’s Role
• Hadoop Adoption: From Apps to Lake
• Enterprise Hadoop Technology Directions
Our Mission: Enable your Modern Data Architecture by Delivering Enterprise Apache Hadoop
Our Vision: More than Half the World's Data Will Be Processed by Apache Hadoop
Our Commitment
• Open Leadership: Drive innovation in the open exclusively via the Apache community-driven open source process
• Enterprise Rigor: Engineer, test and certify Apache Hadoop with the enterprise in mind
• Ecosystem Endorsement: Focus on deep integration with existing data center technologies and skills
Headquarters: Palo Alto, CA
Employees: 300+ and growing
Reseller Partners
Apache Software Foundation Guiding Principles
• Release early & often
• Transparency, respect, meritocracy
Key Roles
• PMC Members
– Managing community projects
– Mentoring new incubator projects
• Committers
– Authoring, reviewing & editing code
• Release Managers
– Testing & releasing projects
Apache Community Process: Design & Develop, Test & Patch, Release
Apache Community Projects: Apache Hadoop, Apache HBase, Apache Hive, Apache Falcon, Apache Pig, Apache Ambari, Apache Storm
Hortonworks Process for Enterprise Hadoop
Upstream Community Projects: Design & Develop, Test & Patch, Release
Apache Hadoop, Apache HBase, Apache Hive, Apache Falcon, Apache Pig, Apache Ambari, Apache Storm
Downstream Enterprise Product (HDP 2.0): Integrate & Test, Package & Certify, Distribute
Virtuous cycle when development & fixed issues done upstream & stable project releases flow downstream
• Fixed Issues
• Stable Project Releases
Certified at scale using the most advanced Hadoop test bed on the planet
• 1000’s of production nodes at Yahoo!
• Over 1500 unit & system tests
Hadoop’s Role…
“Hadoop is becoming a more ‘normal’ software market” and the “Hadoop vendor ecosystem [is] gaining critical mass”
Tony Baer, Ovum
A Traditional Approach Under Pressure
APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
DATA SYSTEM: REPOSITORIES (RDBMS, EDW, MPP)
SOURCES: Existing Sources (CRM, ERP, Clickstream, Logs), Emerging Sources (Sensor, Sentiment, Geo, Unstructured)
Source: IDC
2.8 ZB in 2012
85% from New Data Types
15x Machine Data by 2020
40 ZB by 2020
Unlock Value in New Types of Data
1. Social: Understand how people are feeling and interacting, right now
2. Clickstream: Capture and analyze website visitors’ data trails and optimize your website
3. Sensor/Machine: Discover patterns in data streaming from remote sensors and machines
4. Geographic: Analyze location-based data to manage operations where they occur
5. Server Logs: Diagnose process failures and prevent security breaches
6. Unstructured (text, video, pictures, etc.): Understand patterns in files across millions of web pages, emails, and documents
+ Online archive: Data that was once purged or moved to tape can be stored in Hadoop to discover long term trends and previously hidden value
A Modern Data Architecture Enabled
• Complement Data Systems
• Right Workload, Right Place
APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
DATA SYSTEM: REPOSITORIES (RDBMS, EDW, MPP)
SOURCES: Existing Sources (CRM, ERP, Clickstream, Logs), Emerging Sources (Sensor, Sentiment, Geo, Unstructured)
A Modern Data Architecture Applied
Diagram labels: APPLICATIONS, DATA SYSTEM, SOURCES, OPERATIONAL TOOLS, DEV & DATA TOOLS, INFRASTRUCTURE
Systems shown: RDBMS, EDW, MPP, HANA, BusinessObjects BI
SOURCES: Existing Sources (CRM, ERP, Clickstream, Logs), Emerging Sources (Sensor, Sentiment, Geo, Unstructured)
UDA Diagram
Major Vendors Have Embraced Hadoop
Teradata Portfolio for Hadoop
• Seamless data access between Teradata and Hadoop (SQL-H)
• Simple management & monitoring with Viewpoint integration
• Flexible deployment options
HDInsight & HDP for Windows
• Only Hadoop Distribution for Windows Azure & Windows Server
• Native integration with SQL Server, Excel, and System Center
• Extends Hadoop to .NET community
Complete Portfolio for Hadoop
Appliances
Instant Access + Infinite Scale
• SAP can assure their customers they are deploying an SAP HANA + Hadoop architecture fully supported by SAP
• Enables analytics apps (BOBJ) to interact with Hadoop
Hadoop Adoption
“Hadoop’s momentum is unstoppable as its open source roots grow wildly into enterprises. Its refreshingly unique approach to data management is transforming how companies store, process, analyze, and share big data”
--Mike Gualtieri, Forrester
Drivers of Hadoop Adoption
Axes: SCALE, SCOPE
New Analytic Apps (New Types of Data, LOB Driven)
20 Common Business Applications
Industry | Use Case | Type of Data
Financial Services | New Account Risk Screens | Text, Server Logs
Financial Services | Trading Risk | Server Logs
Financial Services | Insurance Underwriting | Geographic, Sensor, Text
Telecom | Call Detail Records (CDRs) | Machine, Geographic
Telecom | Infrastructure Investment | Machine, Server Logs
Telecom | Real-time Bandwidth Allocation | Server Logs, Text, Social
Retail | 360° View of the Customer | Clickstream, Text
Retail | Localized, Personalized Promotions | Geographic
Retail | Website Optimization | Clickstream
Manufacturing | Supply Chain and Logistics | Sensor
Manufacturing | Assembly Line Quality Assurance | Sensor
Manufacturing | Crowdsourced Quality Assurance | Social
Healthcare | Use Genomic Data in Medical Trials | Structured
Healthcare | Monitor Patient Vitals in Real-Time | Sensor
Pharmaceuticals | Recruit and Retain Patients for Drug Trials | Social, Clickstream
Pharmaceuticals | Improve Prescription Adherence | Social, Unstructured, Geographic
Oil & Gas | Unify Exploration & Production Data | Sensor, Geographic & Unstructured
Oil & Gas | Monitor Rig Safety in Real-Time | Sensor, Unstructured
Government | ETL Offload in Response to Federal Budgetary Pressures | Structured
Government | Sentiment Analysis for Government Programs | Social
Drivers of Hadoop Adoption
Axes: SCALE, SCOPE
New Analytic Apps (New Types of Data, LOB Driven)
More data and analytic apps
MDA/Data Lake (Cost, Insight, IT Driven)
The Journey Towards a Data Lake
Axes: DATA (TB’s to PB’s), VALUE
Risk Management, e.g., Fraud Reduction
Operational Excellence, e.g., Network Maintenance
New Business, e.g., Data as a Product
Customer Intimacy, e.g., 360 Degree View of the Customer
DATA LAKE: An architectural shift in the data center that uses Hadoop to deliver deep insight across a large, broad, diverse set of data at efficient scale
Drivers of the Data Lake
DATA LAKE
EFFICIENT SCALE (Data Management): Store and process all of your Corporate Data Assets
+ Hadoop = SCALE
• Acquire all data in original format and store in one place, cost effectively and for an unlimited time
• Scale horizontally and to petabyte scale
BROAD INSIGHT (Data Access): Access your data simultaneously in multiple ways, irrespective of the processing engine, analytical application or presentation
+ Hadoop = INSIGHT
• Allows simultaneous access by and timely insights for all your users across all your data
• Enables schema on read & an enterprise-wide pool of data
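The "schema on read" idea in the data lake drivers can be illustrated with a small sketch. It is not Hadoop code: the sample log lines, the `read_with_schema` helper, and both schemas are hypothetical, chosen only to show raw data being stored once and interpreted differently by each reader.

```python
import csv
import io

# Raw records kept exactly as they arrived; no schema is enforced at write time.
# (Sample data is made up for illustration.)
RAW_EVENTS = (
    "2014-01-22,web01,GET /index.html,200\n"
    "2014-01-22,web02,GET /missing,404\n"
)


def read_with_schema(raw_text, schema):
    """Apply a list of (column_name, converter) pairs to raw text at read time."""
    rows = []
    for fields in csv.reader(io.StringIO(raw_text)):
        # zip stops at the shorter sequence, so a narrow schema reads fewer columns.
        rows.append({name: convert(value) for (name, convert), value in zip(schema, fields)})
    return rows


# Two hypothetical consumers read the same stored bytes with different schemas.
ops_schema = [("date", str), ("host", str), ("request", str), ("status", int)]
audit_schema = [("date", str), ("host", str)]

ops_rows = read_with_schema(RAW_EVENTS, ops_schema)
audit_rows = read_with_schema(RAW_EVENTS, audit_schema)
```

The stored text never changes; only the schema supplied by each reader does, which is the property the slide attributes to an enterprise-wide pool of data.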
Data Lake Transforms Your Architecture
SOURCES: Existing Sources (CRM, ERP, Clickstream, Logs), Emerging Sources (Sensor, Sentiment, Geo, Unstructured)
DATA LAKE
APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
BROAD INSIGHT (Data Access): Access your data simultaneously in multiple ways
EFFICIENT SCALE (Data Management): Store and process all of your Corporate Data Assets
Enterprise Hadoop Technology Directions
“With Hadoop 2.0 we expect this ecosystem to grow like bamboo in spring time.”
Robin Bloor, The Bloor Group
What’s Needed for Enterprise Hadoop?
1. Key Services: Platform, Operational and Data services essential for the enterprise
2. Skills: Leverage your existing skills: development, analytics, operations
3. Integration: Interoperable with existing data center investments
OPERATIONAL SERVICES
• Schedule: OOZIE
• Cluster Mgmt: AMBARI
• Dataset Mgmt: FALCON*
• Data Security: KNOX*
DATA SERVICES
• Data Access: HIVE & HCATALOG, PIG, HBASE
• Process: MAP REDUCE, TEZ
• Data Movement: SQOOP, FLUME, NFS, WebHDFS
CORE SERVICES
• Storage: HDFS
• Resource Management: YARN
Enterprise Readiness: High Availability, Disaster Recovery, Rolling Upgrades, Security and Snapshots
OS/VM, Cloud, Appliance
What’s Needed for Enterprise Hadoop?
1. Key Services: Platform, Operational and Data services essential for the enterprise
2. Skills: Leverage your existing skills: development, analytics, operations
3. Integration: Interoperable with existing data center investments
HORTONWORKS DATA PLATFORM (HDP)
• OPERATIONAL SERVICES: AMBARI, FALCON*, OOZIE, KNOX*
• DATA SERVICES: HIVE & HCATALOG, PIG, HBASE, MAP REDUCE, TEZ
• LOAD & EXTRACT: SQOOP, FLUME, NFS, WebHDFS
• CORE SERVICES: HDFS, YARN
Enterprise Readiness: High Availability, Disaster Recovery, Rolling Upgrades, Security and Snapshots
OS/VM, Cloud, Appliance
Hadoop 2 & Beyond
details: hortonworks.com/labs
Hadoop 2: The Introduction of YARN
1st Gen of Hadoop: Single Use System (Batch Apps)
• MapReduce (cluster resource management & data processing)
• HDFS (redundant, reliable storage)
2nd Gen of Hadoop: Multi-Use Data Platform (Batch, Interactive, Online, Streaming, …)
• Classic Hadoop Apps
• Flexible Data Processing: Hive, Pig, others…
• Batch: MapReduce
• Batch & Interactive: Tez
• Online Data Processing: HBase, Accumulo
• Stream Processing: Storm
• others…
• Efficient Cluster Resource Management & Shared Services (YARN)
• Redundant, Reliable Storage (HDFS)
Store all data in one place, interact in multiple ways
Apache Hadoop YARN
The Data Operating System for Hadoop 2
• Flexible: Enables other purpose-built data processing models beyond MapReduce (batch), such as interactive and streaming
• Efficient: Double processing IN Hadoop on the same hardware while providing predictable performance & quality of service
• Shared: Provides a stable, reliable, secure foundation and shared operational services across multiple workloads
Data Processing Engines Run Natively IN Hadoop
• BATCH: MapReduce
• INTERACTIVE: Tez
• STREAMING: Storm
• IN-MEMORY: Spark
• ONLINE: HBase, Accum
• OTHER: Open Source / Commercial
HDFS: Redundant, Reliable Storage
YARN: Cluster Resource Management
Apache Tez: Modern Execution Engine
Apache Tez is a modern & more efficient alternative to MapReduce built on YARN
Supports BOTH Batch & Interactive workloads
– Used for Stinger initiative to enable interactive SQL for Apache Hive
– Hive and Pig will work on Tez
– Other solutions are considering Tez
Stack, top to bottom:
• Hive (SQL), Pig (data flow), OTHER (Open Source / Commercial)
• Tez (execution engine), MR (batch)
• YARN (cluster resource management)
• HDFS (redundant, reliable storage)
Batch AND Interactive SQL-IN-Hadoop
Value Delivered
• Enables rapid insight over big data
• Single engine for batch & interactive
• Preserves and transparently enhances existing investments in use of Hive
– Ex. Hive-based solutions get 100x faster
• SQL compliance improves integration with other data systems & tools
• New ORCFile reduces storage up to 70% while improving resource use, scale, and throughput
Stinger Initiative: Broad, community-based effort to deliver the next generation of Apache Hive
• Speed: Improve Hive query performance by 100X to allow for interactive query times (seconds)
• Scale: The only SQL interface to Hadoop designed for queries that scale from TB to PB
• SQL: Support broadest range of SQL semantics for analytic applications against Hadoop
Apache Hive
• The de facto standard for Hadoop SQL access
• Used by your current data center partners
• Built for batch AND interactive query
Speed: Delivering Interactive Query
Query | Hive 10 | Hive 0.11 (Phase 1) | Trunk (Phase 3) | Improvement
TPC-DS Query 27 | 1400s | 39s | 7.2s | 190x
TPC-DS Query 82 | 3200s | 65s | 14.9s | 200x
Query 27: Pricing Analytics using Star Schema Join
Query 82: Inventory Analytics Joining 2 Large Fact Tables
All Results at Scale Factor 200 (Approximately 200GB Data)
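The improvement factors on this slide follow from dividing the slowest reported time for each query by the fastest. A minimal sketch of that arithmetic, using only the timings shown above (the `speedup` helper name is ours, not from the deck):

```python
def speedup(baseline_seconds, improved_seconds):
    """Return how many times faster the improved run is than the baseline."""
    return baseline_seconds / improved_seconds


# Slowest and fastest times reported on the slide for each TPC-DS query.
query_27 = speedup(1400, 7.2)
query_82 = speedup(3200, 14.9)

print(f"TPC-DS Query 27: about {query_27:.0f}x faster")
print(f"TPC-DS Query 82: about {query_82:.0f}x faster")
```

The computed ratios come out slightly above the slide's figures, which suggests the deck rounds down to 190x and 200x.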
SCALE: Interactive Query at Petabyte Scale
Sustained Query Times: Apache Hive 0.12 provides sustained acceptable query times even at petabyte scale
Smaller Footprint: Better encoding with ORCFile in Apache Hive 0.12 reduces resource requirements for your cluster
File Size Comparison Across Encoding Methods (Dataset: TPC-DS Scale 500)
Encoding | File Size
Encoded with Text | 585 GB (Original Size)
Encoded with RCFile | 505 GB (14% Smaller)
Encoded with Parquet (Impala) | 221 GB (62% Smaller)
Encoded with ORCFile (Hive 12) | 131 GB (78% Smaller)
• Larger Block Sizes
• Columnar format arranges columns adjacent within the file for compression & fast access
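The "% Smaller" labels in the file size comparison can be checked against the 585 GB original size. A short sketch of that calculation, using only the sizes quoted on the slide (the helper name is illustrative):

```python
ORIGINAL_GB = 585  # size of the text-encoded dataset reported on the slide


def percent_smaller(size_gb, original_gb=ORIGINAL_GB):
    """Return the size reduction relative to the original, in percent."""
    return 100 * (1 - size_gb / original_gb)


# Encoded sizes quoted on the slide, in GB, mapped to their rounded reduction.
reductions = {size: round(percent_smaller(size)) for size in (505, 221, 131)}

for size, pct in reductions.items():
    print(f"{size} GB is {pct}% smaller than {ORIGINAL_GB} GB")
```

The rounded results match the 14%, 62% and 78% figures shown in the chart.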
SQL: Enhancing SQL Semantics
Hive SQL Datatypes | Hive SQL Semantics
INT | SELECT, INSERT
TINYINT/SMALLINT/BIGINT | GROUP BY, ORDER BY, SORT BY
BOOLEAN | JOIN on explicit join key
FLOAT | Inner, outer, cross and semi joins
DOUBLE | Sub-queries in FROM clause
STRING | ROLLUP and CUBE
TIMESTAMP | UNION
BINARY | Windowing Functions (OVER, RANK, etc.)
DECIMAL | Custom Java UDFs
ARRAY, MAP, STRUCT, UNION | Standard Aggregation (SUM, AVG, etc.)
DATE | Advanced UDFs (ngram, Xpath, URL)
VARCHAR | Sub-queries for IN/NOT IN, HAVING
CHAR | Expanded JOIN Syntax
 | INTERSECT / EXCEPT
Legend: Available, Hive 0.12 (HDP 2.0), Hive 13
SQL Compliance: Hive 12 provides a wide array of SQL datatypes and semantics so your existing tools integrate more seamlessly with Hadoop
Real-Time Streaming-IN-Hadoop
Apache Storm: A community-based effort to bring real-time processing to Hadoop
Goals:
• HADOOP INTEGRATION: Making streaming a first-class component of a modern data architecture
• ENTERPRISE CONNECTIVITY: Connecting Storm to the important streaming sources within the enterprise
• IMPROVED MULTI-TENANCY: Increasing operations usability and enabling simple programming of new flows
Project Phases
Storm: Streaming in Hadoop
• Storm-on-YARN
• Installation with Ambari
• Ganglia & Nagios based monitoring
• Kafka, HBase, HDFS & Cassandra connectors
Storm: Enterprise Connectivity
• Notification and data persistence bolts: EDWs, RDBMS, JMS etc.
• Data Ingest Spouts
• AD/LDAP plugin for authentication
• High Availability management w/Ambari
Storm: Improved Multi-Tenancy
• Declarative “wiring”
• Hive update support
• Advanced scheduler
Coming Soon
Hortonworks Investment in Apache Falcon
Simplified Data Processing for Hadoop
Apache Falcon: Create and implement reusable workflows for datasets to orchestrate movement and track lineage
Goals:
• Acquisition & Processing Data
– Direct data to processing engines or formats
– Obfuscate or transform data
• Replication & Retention Policy
– Replicate datasets
– Establish retention policies for datasets
• Redirection & Extensions of Hadoop
– Redirect data to encrypt or decrypt
– Extract segments of data and redirect to other tools
Phase 1 (Q4 2013):
• Incubate Apache Falcon
• Dataset Replication
• Dataset Retention
• Falcon Tech Preview
Phase 2 (Coming Soon):
• Hive / HCatalog integration
• Basic Dashboard for Entity Viewing
• Kerberos security support
• Ambari integration for management
Phase 3 (Coming Soon):
• Advanced Dashboard for pipeline building
• Dataset lineage
Enterprise Hadoop Security Today
Authentication: Who am I/prove it? Control access to cluster.
• Kerberos in native Apache Hadoop
• Perimeter Security with Apache Knox Gateway
Authorization: Restrict access to explicit data
Audit: Understand who did what
• Native in Apache Hadoop: MapReduce Access Control Lists, HDFS Permissions, Process Execution audit trail
• Cell level access control in Apache Accumulo
Data Protection: Encrypt data at rest & motion
• Wire encryption in native Apache Hadoop
• Orchestrated encryption with 3rd party tools
Security Investments
Hadoop Security: What’s Next?
Security in Enterprise Hadoop: Driving the next generation of Hadoop security
Goals:
• Flexible Authentication & Authorization: Improve authentication choices and provide more granular access controls for the Hadoop platform, services and data.
• Improve Data Protection: Enhance Hadoop’s audit and data protection capabilities to support broader enterprise governance and compliance needs.
• Work with Existing Systems: Integrate with existing enterprise security and identity management systems in a consistent way.
Security Phase 1:
• Strong AuthN with Kerberos
• HBase, Hive, HDFS basic AuthZ
• Encryption with SSL for NN, JT, etc.
• Wire encryption with Shuffle, HDFS, JDBC
Security Phase 2:
• Knox: Hadoop Perimeter Security
• SQL-style Hive AuthZ (GRANT, REVOKE)
• ACLs for HDFS
• SSL support for Hive Server 2
• PAM support for Hive
Security Phase 3:
• Audit event correlation and Audit viewer
• NotOnlyKerberos: Support other Token-Based Authentication
• Data Encryption in HDFS, Hive & HBase
Status labels: Delivered in HDP 2.0, Coming Soon
Operating Enterprise Hadoop at Scale
Apache Ambari is the only 100% open source framework for provisioning, managing and monitoring Apache Hadoop clusters
[Diagram: AMBARI WEB, Viewpoint, Others; REST APIs; AMBARI SERVER (PROVISION | MANAGE | MONITOR); compute & storage nodes]
Integration With Existing Operations Tools
COMING SOON!
• Ambari Stacks: AMBARI-2714
• Ambari Views: AMBARI-4234
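As a rough illustration of how an existing operations tool might talk to the Ambari REST APIs mentioned on this slide, the sketch below builds (but does not send) a request that lists clusters. The host name, credentials, port 8080 and the `/api/v1/clusters` path are assumptions for illustration and are not taken from the deck; check them against the Ambari version in use.

```python
import base64
import urllib.request


def build_cluster_list_request(host, user, password, port=8080):
    """Build an HTTP GET request for the (assumed) Ambari cluster listing endpoint."""
    # Assumption: the server listens on `port` and exposes its API under /api/v1.
    url = f"http://{host}:{port}/api/v1/clusters"
    # HTTP Basic authentication: base64 of "user:password".
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, method="GET")
    request.add_header("Authorization", f"Basic {token}")
    return request


# Hypothetical host and credentials; nothing is sent over the network here.
request = build_cluster_list_request("ambari.example.com", "admin", "admin")

# To send it against a reachable server, one could use:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Building the request separately from sending it keeps the example runnable without a live cluster and makes the URL and headers easy to inspect.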
Recap
• Hadoop's role is becoming clear
• Major vendors have recognized Hadoop’s role and are actively integrating it into their solutions
• Adoption path is consistent: from apps to lake
• Open source innovation continues unabated
– YARN opens up the platform, and as adoption deepens, the community of committers is working to mature it even further
Try Hadoop Today… Get Involved
Download the Hortonworks Sandbox
Learn Hadoop
Build Your Analytic App
Try Hadoop 2
San Jose, CA, June 3 - 5, 2014: CALL FOR PAPERS OPEN
Amsterdam, April 2 - 3, 2014: REGISTER NOW