Slides from Strata talk "Coordinating the Many Tools of Big Data"
Coordinating the Many Tools of Big Data
Alan F. Gates
@alanfgates
Strata 2013
Big Data = Terabytes, Petabytes, …
© Hortonworks 2013
Image Credit: Gizmodo
But It Is Also Complex Algorithms
• An example from a talk by Jimmy Lin at Hadoop Summit 2012 on calculations Twitter is doing via UDFs in Pig. This equation uses stochastic gradient descent to do machine learning with their data:
w(t+1) = w(t) − γ(t) ∇ℓ(f(x; w(t)), y)
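The update above can be sketched in a few lines of Python. This is a minimal illustration, not Twitter's actual UDF code: f is assumed to be a 1-D linear model f(x; w) = w·x with squared-error loss, and the data and step-size schedule are made up.

```python
def grad(w, x, y):
    # gradient of the squared-error loss (w*x - y)^2 with respect to w
    return 2.0 * (w * x - y) * x

def sgd_step(w, x, y, gamma):
    # w(t+1) = w(t) - gamma(t) * grad(loss(f(x; w(t)), y))
    return w - gamma * grad(w, x, y)

w = 0.0
# toy stream of (x, y) pairs drawn from y = 2x
for t, (x, y) in enumerate([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)] * 50):
    gamma = 0.1 / (1 + 0.01 * t)   # decaying step size gamma(t)
    w = sgd_step(w, x, y, gamma)

# w converges toward 2.0, the true slope
```

Each record updates the weights a little in the direction that reduces the loss, which is why this fits naturally into a streaming UDF: no global pass over the data is required.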
And New Tools
• Apache Hadoop brings with it a large selection of tools and paradigms
– Apache HBase, Apache Cassandra – Distributed, high-volume reads and writes of individual data records
– Apache Hive – SQL
– Apache Pig, Cascading – Data flow programming for ETL, data modeling, and exploration
– Apache Giraph – Graph processing
– MapReduce – Batch processing
– Storm, S4 – Stream processing
– Plus lots of commercial offerings
Pre-Cloud: One Tool per Machine
• Databases presented SQL or SQL-like paradigms for operating on data
• Other tools came in separate packages (e.g. R) or on separate platforms (e.g. SAS)
[Diagram: separate siloed systems – Data Warehouse, Statistical Analysis, Cube/MOLAP, OLTP, Data Mart – each on its own machine]
Cloud: Many Tools One Platform
• Users no longer want to be concerned with what platform their data is in – just apply the tool to it
• SQL no longer the only or primary data access tool
[Diagram: the same workloads – Data Warehouse, Statistical Analysis, Data Mart, Cube/MOLAP, OLTP – sharing one platform]
Upside - Pick the Right Tool for the Job
Downside – Tools Don’t Play Well Together
• Hard for users to share data between tools
– Different storage formats
– Different data models
– Different user defined function interfaces
Downside – Wasted Developer Time
• Wastes developer time, since each tool builds the same functionality redundantly
[Diagram: Pig and Hive each implement their own Parser, Optimizer, Physical Planner, Executor, and metadata handling – layers that largely overlap]
Conclusion: We Need Services
• We need to find a way to share services where we can
• Gives users the same experience across tools
• Allows developers to share effort when it makes sense
Hadoop = Distributed Data Operating System
Service                      Hadoop Component
Table management             Hive
Access to metadata           HCatalog
User authentication          Knox
Resource management          YARN
Notification                 HCatalog
REST/connectors              WebHCat, WebHDFS, Hive, HBase, Oozie
Relational data processing   Tez
(Slide legend: exists / pieces exist in this component / new project)
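The REST/connectors row refers to HTTP APIs like WebHDFS, where every operation is a URL of the form `http://<host>:<port>/webhdfs/v1/<path>?op=<OP>`. A tiny helper sketches that URL format; the `/webhdfs/v1` prefix and `op=` parameter are part of the real WebHDFS API, while the host and port below are placeholders.

```python
def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS REST URL for the given HDFS path and operation."""
    query = "&".join([f"op={op}"] + [f"{k}={v}" for k, v in sorted(params.items())])
    return f"http://{host}:{port}/webhdfs/v1/{path.lstrip('/')}?{query}"

# e.g. reading a file over plain HTTP, no Hadoop client libraries needed:
url = webhdfs_url("namenode", 50070, "/user/alan/part-00000", "OPEN")
# -> http://namenode:50070/webhdfs/v1/user/alan/part-00000?op=OPEN
```

Because the interface is just HTTP, any external system (scripts, BI tools, other clusters) can read and write Hadoop data without linking against Java client code.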
HCatalog – Table Management
• Opens up Hive's tables to other tools inside and outside Hadoop
• Presents tools with a table paradigm that abstracts away storage details
• Provides a shared data model
• Provides a shared code path for data and metadata access
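The "shared data model" idea can be illustrated with a toy metastore. This is not HCatalog's real API, just a sketch: each table's schema and storage format are defined once, and every tool consults the same definition instead of hard-coding paths and formats.

```python
class Metastore:
    """Toy stand-in for a shared table catalog (illustrative, not HCatalog's API)."""
    def __init__(self):
        self._tables = {}

    def create_table(self, name, schema, storage_format):
        self._tables[name] = {"schema": schema, "format": storage_format}

    def describe(self, name):
        return self._tables[name]

def load_for_tool(metastore, table):
    # shared code path: any tool (a stand-in for Pig's HCatLoader, Hive,
    # or MapReduce's HCatInputFormat) resolves the table the same way
    info = metastore.describe(table)
    return f"reading {table} as {info['format']} with columns {info['schema']}"

store = Metastore()
store.create_table("clicks", ["user_id", "url", "ts"], "ORC")

pig_view = load_for_tool(store, "clicks")
hive_view = load_for_tool(store, "clicks")
# both tools see one consistent table definition
```

The point of the sketch is the single `describe` call: when the storage format of `clicks` changes, no tool's code changes, only the catalog entry.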
[Diagram: Hive, Pig (via HCatLoader), and MapReduce (via HCatInputFormat) all read table definitions from the shared Metastore; external systems reach it over REST via WebHCat]
Tez – Moving Beyond MapReduce
• Low-level data-processing execution engine
• Use it as the base of MapReduce, Hive, Pig, Cascading, etc.
• Enables pipelining of jobs
• Removes task and job launch times
• Hive and Pig jobs no longer need to move to the end of the queue between steps in the pipeline
• Does not write intermediate output to HDFS
– Much lighter disk and network usage
• Built on YARN
Pig/Hive-MR versus Pig/Hive-Tez
SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state
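To make the query concrete, here is what it computes, written out in plain Python over made-up sample rows (the tables and values are illustrative, not from the talk): join a to b on id, join to c on itemId, then compute COUNT(*) and AVG(c.price) per state.

```python
a = [{"id": 1, "itemId": 10, "state": "CA"},
     {"id": 2, "itemId": 11, "state": "CA"},
     {"id": 3, "itemId": 10, "state": "OR"}]
b = [{"id": 1}, {"id": 2}, {"id": 3}]
c = [{"itemId": 10, "price": 4.0}, {"itemId": 11, "price": 6.0}]

b_ids = {row["id"] for row in b}
price = {row["itemId"]: row["price"] for row in c}

groups = {}
for row in a:
    if row["id"] in b_ids and row["itemId"] in price:   # the two joins
        cnt, total = groups.get(row["state"], (0, 0.0))
        groups[row["state"]] = (cnt + 1, total + price[row["itemId"]])

# per-state (COUNT(*), AVG(price))
result = {state: (cnt, total / cnt) for state, (cnt, total) in groups.items()}
# result == {"CA": (2, 5.0), "OR": (1, 4.0)}
```

Each join and the group-by is a separate shuffle boundary, which is exactly why the MapReduce plan below needs multiple jobs.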
[Diagram: on MapReduce (Pig/Hive - MR) this query runs as three jobs (Job 1, Job 2, Job 3) separated by I/O synchronization barriers, with intermediate writes between each]
[On Tez (Pig/Hive - Tez), the same query runs as a single job with no I/O synchronization barriers between stages]
FastQuery: Beyond Batch with YARN
• Tez generalizes MapReduce – simplified execution plans process data more efficiently
• Always-on Tez service – low-latency processing for all Hadoop data processing
Knox – Single Sign On
Today’s Access Options
• Direct access
– Access services via REST (WebHDFS, WebHCat)
– Need knowledge of and access to the whole cluster
– Security handled by each component in the cluster
– Kerberos details exposed to users
• Gateway / portal nodes
– Dedicated nodes behind the firewall
– Users SSH to the node to access Hadoop services
[Diagram: direct {REST} access from the user to the Hadoop cluster vs. SSH through a gateway node]
Knox Design Goals
• Operators can firewall the cluster without giving end users access to a "gateway node"
• Users see one cluster endpoint that aggregates capabilities for data access, metadata, and job control
• Provides perimeter security to make Hadoop security setup easier
• Enables integration with enterprise and cloud identity-management environments
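The perimeter-security idea can be sketched in miniature. This is not Knox's real code, just an illustration with made-up names: a single gateway function authenticates the caller against a stand-in user store, and only then forwards the request to in-cluster services that are never exposed directly.

```python
VALID_TOKENS = {"alice-token": "alice"}          # stand-in for LDAP / AD / KDC

SERVICES = {                                     # services behind the firewall
    "webhdfs": lambda user: f"{user}: listed HDFS files",
    "webhcat": lambda user: f"{user}: fetched table metadata",
}

def gateway(token, service):
    """Single perimeter endpoint: authenticate first, then dispatch inward."""
    user = VALID_TOKENS.get(token)
    if user is None:
        return (401, "authentication failed at perimeter")
    if service not in SERVICES:
        return (404, "unknown service")
    return (200, SERVICES[service](user))        # forward inside the firewall

ok = gateway("alice-token", "webhdfs")
denied = gateway("bad-token", "webhdfs")
```

The design point is that authentication logic lives in exactly one place, so individual cluster services need not each handle Kerberos or LDAP details for external callers.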
Perimeter Verification & Authentication
[Diagram: clients reach the cluster's services (WebHDFS, WebHCat, Hive, HCatalog, NameNode, JobTracker, DataNodes) only through the Knox Gateway's {REST} endpoint, backed by a user store / ID provider (KDC, AD, LDAP)]
• Authentication – establish identity at the gateway; authenticate with LDAP and AD
• Verification – verify the identity token; SAML, propagation of identity
Thank You