How to use Hadoop for operational and transactional purposes by RODRIGO MERINO at Big Data Spain 2014

HP TRAFODION

RODRIGO MERINO, SENIOR PRESALES SOLUTION ARCHITECT, HEWLETT-PACKARD

HP © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Trafodion: How to use Hadoop for operational and transactional purposes

Enterprise-Class Operational SQL-on-Hadoop DBMS


Agenda

Current database landscape … and a prediction

How special are transactions

HP Trafodion. Trafodion Innovation

Use cases

Trafodion: an open-source project

Current database landscape

Source: https://451research.com


The current situation

Each database type has its strengths and its perfect fit

… but they also have weaknesses

You can’t use any one of them for all types of workloads!

Source: http://www.datasciencecentral.com/profiles/blogs/hadoop-vs-nosql-vs-sql-vs-newsql-by-example


Hadoop workload profiles

Operational Non-interactive

• Real-time analytics

• Data preparation

• Incremental batch processing

• Dashboards, scorecards

Interactive

• Parameterized reports

• Drilldown visualization

• Exploration

Batch

• Operational batch processing

• Enterprise reports

• Data mining

• Transactional SQL = OLTP + interactions

Response time: sub-second to hours

Current Market Focus: Data Warehousing and Analytics

Operational Optimizations

Data Integrity

Workload Management

Transaction Support

Real-time Performance

Exposes Hadoop limitations


We could have a situation like this. Sound familiar?

But if we use the right tool for each job…

[Diagram: MapReduce, MPP DBMS, NoSQL, DBMS, and In-Memory Analytics platforms, spanning HDFS-centric and traditional systems]

Large Data Movement / Replication of Data

Varying Platform Requirements

Departmental segmentation


Big Data is hard to move… because it’s BIG !!!

Source: www.pinterest.com


And there is a fair chance that something will fail

Source: www.shutterstock.com

… and a prediction


Hadoop: One platform to rule them all

Source: www.wallconvert.com

The Future of Hadoop: What Happened & What's Possible?

Operational SQL-on-Hadoop

“Transactions were something that were long thought to be out of scope for this style of platform. There are a lot of important cases for transactions. You are selling a ticket to something then you need to move money from one place to another. You need to assign a seat to someone. And you need to make sure that the money is in one place or the other. Not in both, not nowhere. And you need to at the same time assign that seat or not assign that seat. This is an important class of workload that is currently well served but not by the Hadoop platform. A year ago Google published a paper describing their internal system they have built on their platform, that is very similar to Hadoop, which does this, demonstrating that it's possible to bring online transaction processing to this style of platform. And in the past when we have seen it's possible, within a few years it happens. So I think the prediction we can make here is that it is inevitable that we will see just about every kind of workload be moved to this platform – even Online Transaction Processing.”

– Doug Cutting, Cloudera, October 30 2013

http://www.youtube.com/watch?v=_WwuZI6AhN8

How special are transactions


Characteristics of operational DBMS applications

Generalized characteristics and requirements:

• Low latency response times

• ACID (data consistency guaranteed) transactions

• Large number of users

• High concurrency

• High availability

• Scalable data volumes

• Multi-structured data

• Rapidly evolving data requirements (i.e. flexible schemas)

Expose Hadoop limitations

Operational Query Optimization

Data Integrity

Workload Management

Transaction Support





Source: michaeljswart.com


ACID properties for transactions

Atomicity: Either all operations of the transaction are properly reflected in the database or none are.

Consistency: Execution of a transaction in isolation preserves the consistency of the database.

Isolation: Although multiple transactions may execute concurrently, each transaction must be unaware of other concurrently executing transactions.

Durability: After a transaction completes successfully, the changes it has made to the database persist, even if there are system failures.


The typical bank transfer example

Transfer £50 from account A to account B

Read(A)

A = A - 50

Write(A)

Read(B)

B = B + 50

Write(B)

Atomicity: Shouldn’t take money from A without giving it to B

Consistency: Money isn’t lost or gained

Isolation: Other queries shouldn’t see A or B change until completion

Durability: The money does not go back to A

transaction
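The transfer above can be tried out with a minimal sketch. It uses Python's built-in `sqlite3` module as a stand-in for any transactional SQL engine; it is not Trafodion code, and the `accounts` table, the `transfer` helper, and the starting balances are assumptions made for illustration. The point is only that the debit and the credit either both take effect or neither does.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from `src` to `dst` atomically; roll back on failure."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                # Raising here undoes the debit that was already written.
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
        return True
    except ValueError:
        return False

def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))

# Hypothetical starting state: A holds 100, B holds 20.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100), ("B", 20)])
conn.commit()

ok = transfer(conn, "A", "B", 50)       # succeeds: both updates commit
failed = transfer(conn, "A", "B", 500)  # fails: the debit is rolled back
```

After the failed second call, the balances are the same as after the first one: no money was taken from A without being given to B.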


And a funnier example

“As the 6 a.m. deadline approached, Police Minister Toleafoa Faafisi went on national radio to tell drivers everywhere to stop their vehicles. Minutes later, Prime Minister Tuilaepa Sailele Malielegaoi broadcast the formal instructions for drivers to switch sides.”

Imagine we could do it in a SQL database:

If this “transaction” were not atomic there would be trouble!

In 2009, Samoa switched from driving on the right side of the road to the left

Source: michaeljswart.com

Trafodion

Trafodion - Introduction

Open source project to develop transactional SQL-on-HBase

Rides the unstoppable Hadoop wave!
Transforms how companies store, process, and share big data
Affordable performance, elastic scalability, availability

Open source project - downloadable for free
Eliminates vendor lock-in and licensing fees
Leverages community development resources and speed

Schema flexibility and multi-structured data
Capturing and storing all data for all business functions

Full-function ANSI SQL
Reuses existing SQL skills and improves developer productivity

Distributed ACID transaction protection
Guarantees data consistency across multiple rows, tables, SQL statements

Targeted for operational workloads!
Optimized for real-time transaction processing applications, i.e. OLTP + New Style Transactions (Interactions + Observations)

Leverages 20+ years of HP investments

HBase + Transactional SQL


Trafodion - Features

Complete: Full-function SQL
Reuse existing SQL skills and improve developer productivity

Protected: Distributed ACID transactions
Data consistency across multiple rows, tables, SQL statements

Efficient: Low-latency R/W transactions
Optimized for real-time transaction processing applications

Interoperable: Standard ODBC/JDBC access
Works with existing tools and applications

Data federation: Trafodion/HBase/Hive tables
Enables multiple data model deployment

Scalable: Elastic scale for high concurrency
Provides elastic scalability as number of users / data grows

Highly Available: For enterprise applications
Leverages HBase / Hadoop replication

Open: Hadoop and Linux distribution neutral
Easy to add to existing infrastructure with no vendor lock-in

Eco-system: Leverages large Hadoop eco-system
Can use any tool or database accessing Hadoop

Joint HP Labs & HP-IT project for transactional SQL database capabilities on Hadoop

Hadoop + Transactional SQL

HBase vs. Trafodion comparison

Attribute | HBase | Trafodion + HBase

Data abstraction | Key and value pair | Relational schema

Physical layout | Column family store where row data is stored together by cells | Same, except there is a single column family with space-saving column encoding

Column values | Uninterpreted array of bytes | Explicitly defined and enforced data types

ACID guarantee | Single-row atomicity | Multiple SQL statements, tables, and rows defined as part of a transaction

Language API | Get/put/delete | SQL (Trafodion invokes native HBase API)

Row key index | Single (string) row key | Composite (multi-column) row key

Secondary indexes | Not supported | Arbitrary secondary key columns
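To see what a composite row key on top of a single byte-string key involves, here is a minimal Python sketch. HBase orders rows by the raw bytes of the key, so a multi-column key has to be packed into bytes in a way that preserves the column-by-column sort order. The `encode_key` function, the column names, and the encoding itself are assumptions made for illustration; the slides do not describe Trafodion's actual key encoding.

```python
import struct

def encode_key(customer_id: int, order_date: str) -> bytes:
    """Pack (customer_id, order_date) into one byte string whose byte order
    matches the tuple order. Hypothetical encoding, for illustration only."""
    # Big-endian unsigned 32-bit integers sort correctly as raw bytes.
    # ISO dates (YYYY-MM-DD) are fixed width, so their ASCII bytes sort by date.
    return struct.pack(">I", customer_id) + order_date.encode("ascii")

# Hypothetical rows keyed by (customer_id, order_date).
rows = [(300, "2014-01-05"), (7, "2014-11-17"), (7, "2013-02-01"), (256, "2014-06-30")]

# Sorting by the encoded bytes gives the same order as sorting by the tuples.
by_bytes = sorted(rows, key=lambda r: encode_key(*r))
```

A little-endian or variable-width encoding would break this property, which is why the sketch uses a fixed-width big-endian integer.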

Trafodion and Hadoop – Benefits!

Leverages and extends Hadoop for transactional SQL workloads

Complete: Full-function ANSI SQL
Reuse existing SQL skills and improve developer productivity

Protected: Distributed ACID transactions
Guarantees data consistency across multiple rows, tables, SQL statements

Efficient: Optimized for low-latency read and write transactions
Supports real-time transaction processing applications

Flexible: Schema flexibility and multi-structured data
Seamlessly integrates structured, unstructured, and semi-structured data

Interoperable: Standard ODBC/JDBC access
Works with existing tools and applications

Open: Hadoop and Linux distribution neutral
Easy to add to your existing infrastructure and no vendor lock-in

Open source project sponsorship and investment from HP

Scale without complexity

Reuse SQL skills

Complements Hadoop

Reduce Costs




Innovations in Trafodion

Trafodion innovation built upon Hadoop stack

Leverages Hadoop and HBase for core modules

• Maintains API compatibility

• Inherited scalability and availability

Differentiation

• ANSI SQL via ODBC/JDBC

• Relational schema abstraction

• Distributed transaction protection

• Mature SQL technology

• Automatic parallelism

[Stack diagram, top to bottom]

Client Application using ODBC/JDBC on Windows/Linux

Client Services for ODBC and JDBC

SQL Compiler / Optimizer / Executor

Distributed Transaction Manager

Hive, HBase, HDFS, ZooKeeper

Standard Hadoop + Trafodion


Trafodion – Software architecture (3 layers)

[Architecture diagram with three layers: Client, SQL, Storage Engine]

Client: User and ISV Operational Applications; JDBC and ODBC drivers

SQL: Database Connectivity; Master; CMP (Compiler and Optimizer); ESP* (SQL Parallelism); DTM (Distributed Transaction Management); WMS (Workload Management); Future

Storage Engine (Data Store Integration):

HBase: Trafodion Tables (Relational Schema); Native HBase Tables (KVS, Columnar) via HBase API + coprocessors

Hive: Direct HDFS access to Hive tables using HCatalog

HDFS

*ESP: Executor Server Process


Process overview and SQL execution flow

Operational Application Clients.

Client connections via ODBC, JDBC, .Net (future).

Database connection service (DCS): lightweight coordination service and process control using Apache ZooKeeper.

Master: SQL execution service with an instance of the executor serving as the master for parallel SQL execution plans.

CMP (Compiler/Optimizer) component to generate the optimal execution plan.

DTM provides distributed transaction management across the cluster.

ESP: Executor server processes used for parallel execution based on plan (optional). Multiple layers of ESP may be used.

HBase-Trx (TRX) provides transactional resource management for HBase.

HBase data services responsible for accessing and maintaining database objects.

HDFS


Optimized execution plans based on statistics

Optimizer features

• Top-down, multi-pass optimizations; branch and bound plan pruning considers more potential plans

• Utilizes “equal-height” histogram statistics

• SQL pushdown considerations, e.g. predicate evaluation

• Eliminates sorts when feasible, syntactically and semantically

• In-memory vs. overflow considerations

• Optimal degree of parallelism (DOP) considerations, including non-parallel plans

Benefits

• Facilitates enhanced parallelism and SQL object handling efficiencies

• Optimizations for operational transactions and reporting workloads

[Diagram: a SQL Statement passes through the SQL Normalizer, SQL Analyzer, and Plan Generator, which draws on Table Statistics, the Cardinality Estimator, and the Cost Estimator to produce the Optimized Plan]
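The “equal-height” histogram statistics mentioned above can be illustrated with a short Python sketch. In an equal-height histogram every bucket holds roughly the same number of rows, so bucket boundaries are dense where the data is dense. The functions below are a simplified illustration under that definition, with invented names and sample data; they are not Trafodion's implementation, and a real optimizer would also interpolate within buckets and track distinct-value counts.

```python
def equal_height_histogram(values, buckets):
    """Return (upper_bound, row_count) pairs, each bucket holding about the
    same number of rows."""
    ordered = sorted(values)
    n = len(ordered)
    hist = []
    start = 0
    for b in range(1, buckets + 1):
        end = (n * b) // buckets  # exclusive end index of bucket b
        if end > start:
            hist.append((ordered[end - 1], end - start))
        start = end
    return hist

def estimate_rows_le(hist, x):
    """Estimate how many rows satisfy `col <= x` by summing whole buckets
    whose upper bound is <= x (no interpolation inside a bucket)."""
    return sum(count for upper, count in hist if upper <= x)

# Hypothetical skewed column values.
values = [1, 1, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
hist = equal_height_histogram(values, 4)
estimate = estimate_rows_le(hist, 13)
```

With these twelve values and four buckets, each bucket holds three rows, and the estimate for `col <= 13` comes from adding up the first three buckets. A cardinality estimate like this is what lets a cost-based optimizer compare candidate plans.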


Data flow SQL execution with optimized DOP

Data-flow, scheduler-driven Parallelism throughout

[Diagram: a Master process over Group By, Join, and Scan operators, annotated with degrees of parallelism 40, 30, and 20, and with operator, partitioned, and pipeline parallelism]

– Operators executed by Master or ESP

– Varying degrees of parallelism

– SQL divided into operators: nested, merge, and hash joins; unions; partial & full aggregations; sorts; input/output operations (scan, update, delete, insert)
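Partitioned parallelism with partial and full aggregation can be sketched in a few lines of Python. Each worker plays the role of an ESP and computes a partial GROUP BY over its own partition; the caller plays the role of the Master and merges the partial results. The function names, the round-robin partitioning, and the sample rows are assumptions made for illustration, and Python threads here show the structure of the plan, not its performance.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def partial_group_by(partition):
    """Partial aggregation: count rows per key within one partition."""
    return Counter(key for key, _ in partition)

def parallel_group_by(rows, dop):
    """Split rows into `dop` partitions, aggregate each in its own worker,
    then merge the partial results (the full aggregation)."""
    partitions = [rows[i::dop] for i in range(dop)]
    with ThreadPoolExecutor(max_workers=dop) as pool:
        partials = list(pool.map(partial_group_by, partitions))
    total = Counter()
    for partial in partials:
        total.update(partial)  # adds counts key by key
    return dict(total)

# Hypothetical (product_type, item_id) rows; equivalent to
# SELECT product_type, COUNT(*) ... GROUP BY product_type
rows = [("tv", 1), ("book", 2), ("tv", 3), ("tv", 4), ("book", 5), ("dvd", 6)]
result = parallel_group_by(rows, dop=3)
```

The result does not depend on the chosen degree of parallelism, which is what allows an optimizer to pick the DOP per operator, including a non-parallel plan with `dop=1`.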


Trafodion Distributed transaction protection

Multiple row inserts, updates, and deletes to a table

Multiple table and SQL insert, update, and delete statements

Distributed multiple HBase region insert, update, and delete transaction (2-phase commit)

Read-only transaction (eliminates commit overhead)

[Diagram: Trafodion transactions labelled 1 to 4 touching Tables A, B, and C, stored across HBase Regions A to D]
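The two-phase commit mentioned for multi-region transactions can be sketched as follows. This is a toy, in-memory illustration of the general protocol, not Trafodion's DTM: the `Region` class, its methods, and the row names are hypothetical, and a real coordinator would also write a durable log and handle timeouts and recovery.

```python
class Region:
    """Hypothetical participant holding rows for one key range."""

    def __init__(self, name, fail_on_prepare=False):
        self.name = name
        self.fail_on_prepare = fail_on_prepare
        self.data = {}
        self.pending = None

    def prepare(self, writes):
        # Phase 1: stage the writes and vote yes (True) or no (False).
        if self.fail_on_prepare:
            return False
        self.pending = dict(writes)
        return True

    def commit(self):
        # Phase 2 (commit): make the staged writes visible.
        if self.pending is not None:
            self.data.update(self.pending)
        self.pending = None

    def abort(self):
        # Phase 2 (abort): discard the staged writes.
        self.pending = None


def two_phase_commit(writes_by_region):
    """Apply all writes on all regions, or none of them."""
    regions = list(writes_by_region)
    votes = [region.prepare(writes) for region, writes in writes_by_region.items()]
    if all(votes):
        for region in regions:
            region.commit()
        return True
    for region in regions:
        region.abort()
    return False


# Both regions vote yes, so both writes become visible.
a, b = Region("A"), Region("B")
committed = two_phase_commit({a: {"row1": 50}, b: {"row2": 70}})

# Region D votes no, so nothing is applied on C either.
c, d = Region("C"), Region("D", fail_on_prepare=True)
second = two_phase_commit({c: {"row3": 1}, d: {"row4": 2}})
```

A read-only transaction has nothing to stage, which is why it can skip this commit overhead entirely.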


Integrating external (non-Trafodion) Hadoop tables

Benefits

• Able to run queries against external tables without needing to copy them into a Trafodion table structure

• Optimized access to external HBase and Hive tables without complex map-reduce programming

• Data can be joined across disparate data sources (e.g. Trafodion, Hive, HBase)

• Able to leverage HBase’s inherent schema flexibility capabilities

HBase tables (created outside of Trafodion by HBase)

• Schema-less format, i.e. no information in Trafodion metadata

• Accessible through Trafodion SQL in two modes

– Cell-per-row access, i.e. each row returned represents a single HBase cell

– Row-wise access, i.e. all column values of the row will be returned as a single, big varchar

Hive tables (created outside of Trafodion by Hive)

• Hive metadata, HDFS files storage, delimited data, read/append only

• Support for both SELECT and INSERT statements

• Automatic data type mapping
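The difference between the two HBase access modes described above can be shown with a small Python sketch over an in-memory stand-in for an HBase table. The table contents, the function names, and in particular the `col=value;...` packing used for the row-wise mode are assumptions made for illustration; the slides do not specify how Trafodion formats the single varchar.

```python
# Hypothetical HBase-style table: row key -> {"family:qualifier": value bytes}
table = {
    b"user1": {"cf:name": b"Ana", "cf:city": b"Madrid"},
    b"user2": {"cf:name": b"Luis"},
}

def cell_per_row(table):
    """Yield one (row_key, column, value) tuple per HBase cell."""
    for row_key in sorted(table):
        for column in sorted(table[row_key]):
            yield (row_key, column, table[row_key][column])

def row_wise(table):
    """Yield one (row_key, packed_columns) tuple per HBase row, with all
    column values packed into a single string (illustrative packing)."""
    for row_key in sorted(table):
        cells = table[row_key]
        packed = ";".join(f"{col}={cells[col].decode()}" for col in sorted(cells))
        yield (row_key, packed)

cells = list(cell_per_row(table))  # three results: one per cell
rows = list(row_wise(table))       # two results: one per row
```

Cell-per-row access suits sparse rows whose columns are not known in advance, while row-wise access returns fewer, wider results that the application then has to unpack.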

Trafodion use cases

Good fit for Trafodion

Finance: Online financial management

Telecom: Billing systems; Provisioning systems

Manufacturing: RFID tracking

Energy: Smart metering

Healthcare: Authorization and claims processing

Government: 911 emergency system

Transportation: Reservation systems

Consumer & Retail: Online shopping

Multi-Structured Data

ACID Protection, Data Integrity

Low Latency, High Concurrency

Generates Revenue. Touches the Customer. Helps Run the Business.



Operational SQL on Hadoop – Use cases

• Integrated support for structured, semi-structured, and unstructured data

• Integration of operational, historical, & external (Big) data along common master data for better insights

Structured: Item id, Description, Cost, Price, …

Semi-structured:

TV: Type, Display Size, Resolution, Brand, Model, 3D, …

Book: ISBN, Author, Publish Date, Format, Dept, …

…

Unstructured: Image, Review, …

Example query: SELECT all TVs WHERE Price > 2000 and Type = ‘Plasma’ and Display Size > ‘50’ and customer sentiment is very positive

Open distributed HDFS structures

HBase & Hive

Free at last!

Capture data directly into open file structures

Accessible for reporting & analytics with no latency

Trafodion: An open-source project

Modern open source environment

Following best practices of OpenStack project

Source code in GitHub

Build/test in OpenStack Gerrit, Zuul, Jenkins

Defect tracking in Launchpad

Documentation in MediaWiki


Building an Open Source Community

Simple installation

Meritocracy

Recruiting project contributors

Share your expertise: Developing, fixing defects, testing, writing, translating and more

Want to try?

Discover our capabilities: Download and install in your Hadoop environment and take a test-drive

www.trafodion.org


See for yourself…

Come discover and develop on Trafodion

www.trafodion.org


Thank You

17th ~ 18th NOV 2014, MADRID (SPAIN)