Slides from Strata talk "Coordinating the Many Tools of Big Data"
Coordinating the Many Tools of Big Data
Alan F. Gates
@alanfgates
Strata 2013
Big Data = Terabytes, Petabytes, …
© Hortonworks 2013
Image Credit: Gizmodo
But It Is Also Complex Algorithms
• An example from a talk by Jimmy Lin at Hadoop Summit 2012 on calculations Twitter is doing via UDFs in Pig. This equation uses stochastic gradient descent to do machine learning with their data:
w(t+1) = w(t) − γ(t) ∇ℓ(f(x; w(t)), y)
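The update above can be sketched in a few lines of Python. This is a minimal illustration, not Twitter's actual UDF code: f is assumed to be a 1-D linear model f(x; w) = w·x with squared-error loss, and the data and step-size schedule are made up.

```python
def grad(w, x, y):
    # gradient of the squared-error loss (w*x - y)^2 with respect to w
    return 2.0 * (w * x - y) * x

def sgd_step(w, x, y, gamma):
    # w(t+1) = w(t) - gamma(t) * grad(loss(f(x; w(t)), y))
    return w - gamma * grad(w, x, y)

w = 0.0
# toy stream of (x, y) pairs drawn from y = 2x
for t, (x, y) in enumerate([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)] * 50):
    gamma = 0.1 / (1 + 0.01 * t)   # decaying step size gamma(t)
    w = sgd_step(w, x, y, gamma)

# w converges toward 2.0, the true slope
```

Each record updates the weights a little in the direction that reduces the loss, which is why this fits naturally into a streaming UDF: no global pass over the data is required.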
And New Tools
• Apache Hadoop brings with it a large selection of tools and paradigms
– Apache HBase, Apache Cassandra – Distributed, high-volume reads and writes of individual data records
– Apache Hive – SQL
– Apache Pig, Cascading – Data flow programming for ETL, data modeling, and exploration
– Apache Giraph – Graph processing
– MapReduce – Batch processing
– Storm, S4 – Stream processing
– Plus lots of commercial offerings
Pre-Cloud: One Tool per Machine
• Databases presented SQL or SQL-like paradigms for operating on data
• Other tools came in separate packages (e.g. R) or on separate platforms (e.g. SAS)
[Diagram: separate siloed systems – Data Warehouse, Statistical Analysis, Cube/MOLAP, OLTP, Data Mart – each on its own machine]
Cloud: Many Tools One Platform
• Users no longer want to be concerned with what platform their data is in – just apply the tool to it
• SQL no longer the only or primary data access tool
[Diagram: the same workloads – Data Warehouse, Statistical Analysis, Data Mart, Cube/MOLAP, OLTP – sharing one platform]
Upside - Pick the Right Tool for the Job
Downside – Tools Don’t Play Well Together
• Hard for users to share data between tools
– Different storage formats
– Different data models
– Different user defined function interfaces
Downside – Wasted Developer Time
• Wastes developer time, since each tool builds the same functionality redundantly
[Diagram: Pig and Hive each implement their own Parser, Optimizer, Physical Planner, Executor, and metadata handling – layers that largely overlap]
Conclusion: We Need Services
• We need to find a way to share services where we can
• Gives users the same experience across tools
• Allows developers to share effort when it makes sense
Hadoop = Distributed Data Operating System
Service                      Hadoop Component
Table management             Hive
Access to metadata           HCatalog
User authentication          Knox
Resource management          YARN
Notification                 HCatalog
REST/connectors              WebHCat, WebHDFS, Hive, HBase, Oozie
Relational data processing   Tez
(Slide legend: exists / pieces exist in this component / new project)
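The REST/connectors row refers to HTTP APIs like WebHDFS, where every operation is a URL of the form `http://<host>:<port>/webhdfs/v1/<path>?op=<OP>`. A tiny helper sketches that URL format; the `/webhdfs/v1` prefix and `op=` parameter are part of the real WebHDFS API, while the host and port below are placeholders.

```python
def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS REST URL for the given HDFS path and operation."""
    query = "&".join([f"op={op}"] + [f"{k}={v}" for k, v in sorted(params.items())])
    return f"http://{host}:{port}/webhdfs/v1/{path.lstrip('/')}?{query}"

# e.g. reading a file over plain HTTP, no Hadoop client libraries needed:
url = webhdfs_url("namenode", 50070, "/user/alan/part-00000", "OPEN")
# -> http://namenode:50070/webhdfs/v1/user/alan/part-00000?op=OPEN
```

Because the interface is just HTTP, any external system (scripts, BI tools, other clusters) can read and write Hadoop data without linking against Java client code.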
HCatalog – Table Management
• Opens up Hive's tables to other tools inside and outside Hadoop
• Presents tools with a table paradigm that abstracts away storage details
• Provides a shared data model
• Provides a shared code path for data and metadata access
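The "shared data model" idea can be illustrated with a toy metastore. This is not HCatalog's real API, just a sketch: each table's schema and storage format are defined once, and every tool consults the same definition instead of hard-coding paths and formats.

```python
class Metastore:
    """Toy stand-in for a shared table catalog (illustrative, not HCatalog's API)."""
    def __init__(self):
        self._tables = {}

    def create_table(self, name, schema, storage_format):
        self._tables[name] = {"schema": schema, "format": storage_format}

    def describe(self, name):
        return self._tables[name]

def load_for_tool(metastore, table):
    # shared code path: any tool (a stand-in for Pig's HCatLoader, Hive,
    # or MapReduce's HCatInputFormat) resolves the table the same way
    info = metastore.describe(table)
    return f"reading {table} as {info['format']} with columns {info['schema']}"

store = Metastore()
store.create_table("clicks", ["user_id", "url", "ts"], "ORC")

pig_view = load_for_tool(store, "clicks")
hive_view = load_for_tool(store, "clicks")
# both tools see one consistent table definition
```

The point of the sketch is the single `describe` call: when the storage format of `clicks` changes, no tool's code changes, only the catalog entry.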
[Diagram: Hive, Pig (via HCatLoader), and MapReduce (via HCatInputFormat) all read table definitions from the shared Metastore; external systems reach it over REST via WebHCat]
Tez – Moving Beyond MapReduce
• Low-level data-processing execution engine
• Use it as the base of MapReduce, Hive, Pig, Cascading, etc.
• Enables pipelining of jobs
• Removes task and job launch times
• Hive and Pig jobs no longer need to move to the end of the queue between steps in the pipeline
• Does not write intermediate output to HDFS
– Much lighter disk and network usage
• Built on YARN
Pig/Hive-MR versus Pig/Hive-Tez
SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state
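To make the query concrete, here is what it computes, written out in plain Python over made-up sample rows (the tables and values are illustrative, not from the talk): join a to b on id, join to c on itemId, then compute COUNT(*) and AVG(c.price) per state.

```python
a = [{"id": 1, "itemId": 10, "state": "CA"},
     {"id": 2, "itemId": 11, "state": "CA"},
     {"id": 3, "itemId": 10, "state": "OR"}]
b = [{"id": 1}, {"id": 2}, {"id": 3}]
c = [{"itemId": 10, "price": 4.0}, {"itemId": 11, "price": 6.0}]

b_ids = {row["id"] for row in b}
price = {row["itemId"]: row["price"] for row in c}

groups = {}
for row in a:
    if row["id"] in b_ids and row["itemId"] in price:   # the two joins
        cnt, total = groups.get(row["state"], (0, 0.0))
        groups[row["state"]] = (cnt + 1, total + price[row["itemId"]])

# per-state (COUNT(*), AVG(price))
result = {state: (cnt, total / cnt) for state, (cnt, total) in groups.items()}
# result == {"CA": (2, 5.0), "OR": (1, 4.0)}
```

Each join and the group-by is a separate shuffle boundary, which is exactly why the MapReduce plan below needs multiple jobs.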
[Diagram: on MapReduce (Pig/Hive - MR) this query runs as three jobs (Job 1, Job 2, Job 3) separated by I/O synchronization barriers, with intermediate writes between each]
[On Tez (Pig/Hive - Tez), the same query runs as a single job with no I/O synchronization barriers between stages]
FastQuery: Beyond Batch with YARN
• Tez generalizes MapReduce – simplified execution plans process data more efficiently
• Always-on Tez service – low-latency processing for all Hadoop data processing
Knox – Single Sign On
Today’s Access Options
• Direct access
– Access services via REST (WebHDFS, WebHCat)
– Need knowledge of and access to the whole cluster
– Security handled by each component in the cluster
– Kerberos details exposed to users
• Gateway / portal nodes
– Dedicated nodes behind the firewall
– Users SSH to the node to access Hadoop services
[Diagram: direct {REST} access from the user to the Hadoop cluster vs. SSH through a gateway node]
Knox Design Goals
• Operators can firewall the cluster without giving end users access to a "gateway node"
• Users see one cluster endpoint that aggregates capabilities for data access, metadata, and job control
• Provides perimeter security to make Hadoop security setup easier
• Enables integration with enterprise and cloud identity-management environments
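The perimeter-security idea can be sketched in miniature. This is not Knox's real code, just an illustration with made-up names: a single gateway function authenticates the caller against a stand-in user store, and only then forwards the request to in-cluster services that are never exposed directly.

```python
VALID_TOKENS = {"alice-token": "alice"}          # stand-in for LDAP / AD / KDC

SERVICES = {                                     # services behind the firewall
    "webhdfs": lambda user: f"{user}: listed HDFS files",
    "webhcat": lambda user: f"{user}: fetched table metadata",
}

def gateway(token, service):
    """Single perimeter endpoint: authenticate first, then dispatch inward."""
    user = VALID_TOKENS.get(token)
    if user is None:
        return (401, "authentication failed at perimeter")
    if service not in SERVICES:
        return (404, "unknown service")
    return (200, SERVICES[service](user))        # forward inside the firewall

ok = gateway("alice-token", "webhdfs")
denied = gateway("bad-token", "webhdfs")
```

The design point is that authentication logic lives in exactly one place, so individual cluster services need not each handle Kerberos or LDAP details for external callers.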
Perimeter Verification & Authentication
[Diagram: clients reach the cluster's services (WebHDFS, WebHCat, Hive, HCatalog, NameNode, JobTracker, DataNodes) only through the Knox Gateway's {REST} endpoint, backed by a user store / ID provider (KDC, AD, LDAP)]
• Authentication – establish identity at the gateway; authenticate with LDAP and AD
• Verification – verify the identity token; SAML, propagation of identity
Thank You