Deep Dive On Pivotal HD - World Class HDFS Platform Michael Goddard


© Copyright 2014 EMC Corporation. All rights reserved.


Agenda

• Pivotal

• Pivotal Business Data Lake

• Introducing Pivotal HD 2.0

• Pivotal HD 2.0 and Isilon Update

• Customer Success

• Q&A


What Matters: Apps. Data. Analytics.

• Apps power businesses, and those apps generate data
• Analytic insights from that data drive new app functionality, which in turn drives new data
• The faster you can move around that cycle, the faster you learn, innovate & pull away from the competition


How Pivotal Gets You There

• Uniquely positioned to help enterprises modernize each facet of this cycle today
• Comprehensive portfolio of products & services spanning Big Data, PaaS & Agile
• Converging these technologies into a coherent, next-gen Enterprise PaaS platform

Pivotal Labs Agile Development

Pivotal Data Fabric

Pivotal One

PaaS


Pivotal’s Big Bets for the Future

1. HDFS becomes the data substrate for the next generation of data infrastructures

2. A set of integrated, consumer-grade services must evolve on top of HDFS – stream ingestion, analytical processing, and transactional serving

3. Provisioning flexibility and elasticity become critical capabilities for this data infrastructure


Pivotal Business Data Lake

Govern where it matters
• Focus on MDM and RDM
• Enforce only when sharing
• Treat corporate as aggregation of local

Encourage local requirements
• Let the business decide what they need
• Build from the bottom
• Enable traceability to source
• Disposable data views

Distill on demand
• Select only what you want
• Business friendly tooling
• Re-usable information maps
• Rapid change cycle

Store everything
• Store everything ‘as is’
• Include structured and unstructured data
• Store it cheaply


Pivotal Business Data Lake Architecture

[Architecture diagram; labels as extracted]
• Centralized Management: system monitoring, system management
• Unified Data Management Tier: data mgmt. services, MDM, RDM, audit and policy mgmt.
• Processing Tier: workflow management, in-memory, MPP database
• Unified Sources: real-time ingestion, micro batch ingestion, batch ingestion
• Flexible Actions: real-time insights, interactive insights, batch insights
• HDFS
• Existing Sources, New Data Sources


Pivotal Business Data Lake Architecture

[Architecture diagram with product names; labels as extracted]
• Centralized Management: Command Center
• Unified Data Management Tier: Data Dispatch, MDM, RDM
• Processing Tier: Spring XD, Pivotal GemFire XD, HAWQ
• Unified Sources, Flexible Actions
• Data sources shown (under Existing Sources and New Data Sources): Clickstream, Sensor Data, Weblogs, Network Data, CRM Data, ERP Data
• Other products shown: Pivotal GemFire, Pivotal RabbitMQ, Redis, Pivotal CF, Pivotal HD


How is a Business Data Lake Different?

[Comparison table: Business Data Lake vs. EDW, by criteria]

Criteria: Common data model
– Business Data Lake: Base class = standard data; derived classes = local data
– EDW: Single class = single view across the enterprise

Criteria: Data quality
– Business Data Lake: Full spectrum
– EDW: 1 / 0 (graphic of 1s and 0s in the original)

Criteria: Data integration
– Business Data Lake: Multiple interfaces: SQL, SAS, R, MapReduce, NoSQL
– EDW: SQL access; integration with SAS, R and other analytical interfaces

Criteria: Mixed workload with varying QoS
– Business Data Lake: Support low latency, interactive and batch
– EDW: Limited QoS separation required


Introducing Pivotal HD 2.0

• Foundation for Business Data Lake

• World’s Most Advanced Real-Time Analytics Platform

• Most Extensive Set of Advanced Analytical Toolsets


Pivotal HD Architecture

[Architecture diagram; legend distinguishes Apache and Pivotal components; labels as extracted]
• Apache components: HDFS, HBase, Pig, Hive, Mahout, MapReduce, Sqoop, Flume, YARN, ZooKeeper, Oozie
• Resource Management & Workflow
• Pivotal HD Enterprise: Command Center (Configure, Deploy, Monitor, Manage), Spring XD, Virtual Extensions, GraphLab, Open MPI
• HAWQ – Advanced Database Services: ANSI SQL + Analytics, Xtension Framework, Catalog Services, Query Optimizer, Dynamic Pipelining, MADlib Algorithms
• Pivotal GemFire XD – Real-Time Database Services: ANSI SQL + In-Memory, Distributed In-memory Store, Query, Transactions, Ingestion, Processing, Hadoop Driver – Parallel with Compaction


New Apache Hadoop Features in Pivotal HD 2.0

• Apache Hadoop 2.2 enables enterprise operationalization features such as NFS and Snapshots

• Hive 0.12 is faster, has better scalability, and broader SQL data type support

• Pig 0.12 (incl. PiggyBank) increases productivity and appeal for broader set of users

• HBase 0.96 improves mean time to recovery and adds modularization for easier upgrades and reduced dependencies


Hadoop at the Center

Enabling the Data-Driven Enterprise

[Diagram; labels as extracted]
• Hadoop as a Service
• Big Data On-Demand
• GemFire XD: In-Memory Real-time Analytics
• Spring XD: Building Big Data Apps
• Open Source Algorithm Libraries
• Chorus: Big Data Collaboration
• Fastest SQL Query Engine


Real-Time Analytics

• Adds fast data ingest, and real-time event processing and query performance, enabling SQL users to rapidly analyze and react to high volumes of events on HDFS

• Enables the creation of low-latency, scale-out OLTP applications integrated out of the box with a big data store

• Creates a single platform for Analytics and OLTP, removing the need for an ETL process

• Supports changes to database tables while still complying with the immutability of HDFS
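
One common way to offer updatable tables on top of write-once files is to append change records and periodically compact them. The following is a minimal sketch of that general idea only; it is not GemFire XD's implementation, and the class and method names (`AppendOnlyTable`, `write_batch`, `compact`) are hypothetical.

```python
"""Sketch: mutable table semantics on top of append-only, immutable batches."""
from typing import List, Optional, Tuple

# A change record is (key, value); a value of None is a tombstone (delete).
ChangeRecord = Tuple[str, Optional[dict]]


class AppendOnlyTable:
    """Hypothetical table whose storage is a list of immutable batches."""

    def __init__(self) -> None:
        self._files: List[Tuple[ChangeRecord, ...]] = []

    def write_batch(self, changes: List[ChangeRecord]) -> None:
        # Existing batches are never modified; every change lands in a new one.
        self._files.append(tuple(changes))

    def read(self, key: str) -> Optional[dict]:
        # The newest record for a key wins, so scan from newest to oldest.
        for batch in reversed(self._files):
            for k, v in reversed(batch):
                if k == key:
                    return v
        return None

    def compact(self) -> None:
        # Rewrite all batches into one that holds only the latest live rows.
        latest = {}
        for batch in self._files:
            for k, v in batch:
                latest[k] = v
        live = tuple((k, v) for k, v in latest.items() if v is not None)
        self._files = [live] if live else []

    def file_count(self) -> int:
        return len(self._files)


table = AppendOnlyTable()
table.write_batch([("u1", {"plan": "basic"}), ("u2", {"plan": "pro"})])
table.write_batch([("u1", {"plan": "pro"})])  # update, written as a new record
table.write_batch([("u2", None)])             # delete, written as a tombstone
files_before = table.file_count()
table.compact()
files_after = table.file_count()
```

Reads see the latest value both before and after compaction; compaction only reduces how many batches a read has to scan.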


Real-Time Data Services on Pivotal HD

[Diagram; labels as extracted]
• Pivotal HD Enterprise: Pivotal GemFire XD, HAWQ, Pivotal Extension Framework, MapReduce, I/P & O/P Formatter, Native Persistence, Command Center, HDFS
• Shared Data
• Sensor Data / Feeds, Online Apps, Analytic Apps
• Flows shown: Model Refresh, Re-evaluate Model


Pivotal GemFire XD 1.0 Major Features

Enterprise real-time data processing platform for SLA-critical applications; enables users to rapidly and reliably analyze and react to high volumes of events while leveraging 10s of TBs of in-memory reference data.

Cloud Scale Real-Time Platform
• Very low & predictable latencies at high and variable loads
• 10s of TBs in-memory (MemScale)
• Multi-tiered caching
• Real-time event processing
• Rolling upgrade support

Optimized for Real-Time Analytics
• SQL-based queries
• Support structured data
• Java stored procedures
• Deep Spring Data integration

Seamless Pivotal HD Integration
• Scale to HDFS with policy-driven in-memory data retention
• Online and offline querying of HDFS data
• ETL-less bi-directional integration with other Pivotal HD services
• Pivotal Extension Framework Integration
• ICM Integration

Enterprise-Class Reliability
• Distributed transactions (JTA)
• HA through in-memory redundancy
• Active-active deployments across WAN
• JMX-based scalable management
• Visual monitoring through Pulse


Deep Scalable Analytics

• User Defined Functions: PL/R, PL/Java, PL/Python enable writing UDFs in additional languages that execute inside the database, improving performance

• Parquet columnar open storage format delivers significant performance and scalability improvements

• Richer set of open source machine learning algorithms helps conduct rapid data science experiments on relational data


Deep Scalable Analytics

Provides data-parallel implementations of mathematical, statistical and machine-learning methods for structured and unstructured data.
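
To illustrate what "data-parallel" can mean for a statistical method, here is a small sketch of simple linear regression in which each data partition computes its own summary statistics and only those summaries are merged. It is an assumption-laden illustration, not code from MADlib or HAWQ; the names `Stats`, `partial_stats` and `fit` are hypothetical.

```python
"""Sketch: data-parallel simple linear regression via mergeable statistics."""
from dataclasses import dataclass
from typing import Iterable, List, Tuple

Row = Tuple[float, float]  # (x, y)


@dataclass
class Stats:
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0

    def merge(self, other: "Stats") -> "Stats":
        # Merging is plain addition, so partitions can be combined in any order.
        return Stats(self.n + other.n, self.sx + other.sx, self.sy + other.sy,
                     self.sxx + other.sxx, self.sxy + other.sxy)


def partial_stats(rows: Iterable[Row]) -> Stats:
    """Runs independently on each partition; touches only local rows."""
    s = Stats()
    for x, y in rows:
        s = s.merge(Stats(1, x, y, x * x, x * y))
    return s


def fit(partitions: List[List[Row]]) -> Tuple[float, float]:
    """Combine per-partition statistics and solve for slope and intercept."""
    total = Stats()
    for part in partitions:
        total = total.merge(partial_stats(part))
    denom = total.n * total.sxx - total.sx ** 2
    slope = (total.n * total.sxy - total.sx * total.sy) / denom
    intercept = (total.sy - slope * total.sx) / total.n
    return slope, intercept


# Three partitions of points that lie on y = 2x + 1.
partitions = [[(0.0, 1.0), (1.0, 3.0)],
              [(2.0, 5.0), (3.0, 7.0)],
              [(4.0, 9.0), (5.0, 11.0)]]
slope, intercept = fit(partitions)
```

Only five numbers per partition cross the network in this scheme, which is why such methods scale with the number of data nodes.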


HAWQ 1.2 Deep Scalable Analytics

• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• K-Means
• Association Rules
• Latent Dirichlet Allocation
• Naïve Bayes
• Elastic Net Regression
• Decision Trees / Random Forest
• Support Vector Machines
• Cox Proportional Hazards Regression
• Descriptive Statistics
• ARIMA


PivotalR vs. PL/R

PivotalR
• Interface is R client
• Execution is in database
• Parallelism handled by PivotalR
• Supports a portion of R

PL/R
• Interface is SQL client
• Execution is in R
• Parallelism via SQL function invocation
• Supports all of R


HAWQ: SQL on Hadoop, Format Agnostic

Pivotal HD: HDFS Data Lake

Future formats …

ANSI SQL


HAWQ Continues to Soar

NameNode High Availability (HA) Support improves availability of query processing with full Hadoop fault tolerance

Error Table helps to debug data errors

Parquet file format: columnar data storage for HDFS

HAWQ expansion increases performance (concurrency/throughput) by expanding query processing to newly added data nodes


NameNode HA Support

• Feature:
– Automatic failover to the standby NameNode when the active NameNode fails

• Benefits:
– Fully fault tolerant to NameNode failures
– Improved availability of query processing
– Integrated into Hadoop availability model
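
From a client's point of view, failover support means a request that fails against one NameNode is retried against the other instead of failing the query. The sketch below simulates that behavior with stand-in objects; it does not use any real HDFS or HAWQ API, and `FakeNameNode`, `FailoverClient` and `list_dir` are hypothetical names.

```python
"""Sketch: client-side retry across two NameNodes (simulated, no real HDFS)."""
from typing import List


class NameNodeUnavailable(Exception):
    pass


class FakeNameNode:
    """Stand-in for a NameNode endpoint; `healthy` simulates an outage."""

    def __init__(self, name: str, healthy: bool = True) -> None:
        self.name = name
        self.healthy = healthy

    def list_dir(self, path: str) -> str:
        if not self.healthy:
            raise NameNodeUnavailable(self.name)
        return f"{self.name}:{path}"


class FailoverClient:
    def __init__(self, namenodes: List[FakeNameNode]) -> None:
        self._namenodes = list(namenodes)
        self._current = 0  # index of the node this client currently targets

    def list_dir(self, path: str) -> str:
        # Try each configured node at most once per request.
        for _ in range(len(self._namenodes)):
            node = self._namenodes[self._current]
            try:
                return node.list_dir(path)
            except NameNodeUnavailable:
                # Switch to the next node and retry the same request.
                self._current = (self._current + 1) % len(self._namenodes)
        raise NameNodeUnavailable("no NameNode reachable")


nn1, nn2 = FakeNameNode("nn1"), FakeNameNode("nn2")
client = FailoverClient([nn1, nn2])
before_failure = client.list_dir("/data")
nn1.healthy = False  # simulate the first NameNode failing
after_failure = client.list_dir("/data")
```

In a real cluster the decision about which NameNode is active is made by the cluster, not by the client; the sketch covers only the retry side.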


Error Table

• Feature:
– System table for storing non-conforming data

• Benefits:
– Eliminates erroneous data load
– Reduces retries during load
– Helps to debug errors in data structures
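
The error-table idea can be sketched as a load loop that diverts rows it cannot parse into a separate collection, with enough context to debug them, instead of aborting the whole load. This is an illustration under assumptions, not HAWQ's mechanism; `load_rows`, `parse_reading` and the record fields are hypothetical.

```python
"""Sketch: load that routes non-conforming rows to an error table."""
from typing import Callable, Dict, List, Tuple


def parse_reading(line: str) -> Dict[str, object]:
    """Expects 'sensor_id,value'; raises ValueError on malformed input."""
    sensor_id, value = line.split(",")
    return {"sensor_id": sensor_id.strip(), "value": float(value)}


def load_rows(raw_lines: List[str],
              parse: Callable[[str], Dict[str, object]]
              ) -> Tuple[List[dict], List[dict]]:
    loaded: List[dict] = []
    error_table: List[dict] = []
    for line_no, line in enumerate(raw_lines, start=1):
        try:
            loaded.append(parse(line))
        except ValueError as exc:
            # Keep the raw text and the reason so the row can be fixed later.
            error_table.append(
                {"line_no": line_no, "raw": line, "error": str(exc)})
    return loaded, error_table


raw = ["s1,20.5", "s2,oops", "s3", "s4,18.0"]
loaded, error_table = load_rows(raw, parse_reading)
```

Here the two good rows load in a single pass, and the two bad rows are kept with their line numbers for later inspection.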


Parquet

• Features:
– Open storage format
– Hybrid row/column open storage format
– Configurable Parquet or AO/CO format support
– Compression Type: Snappy and Gzip
– Additional data type support
– Parquet Input Format Reader API

• Benefits:
– Delivers significant performance and scalability improvements
– Industry standard compression: Saves storage
– Usable in MapReduce/Hive workloads
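
A toy example can show why a column-oriented layout helps analytic scans and compression: an aggregate reads one column instead of whole rows, and repeated values in a column sit next to each other. This sketch uses plain Python lists and a simple run-length encoder; it does not read or write real Parquet files, and the function names are hypothetical.

```python
"""Sketch: row layout vs. column layout, with a simple run-length encoder."""
from typing import Dict, List, Tuple


def to_columns(rows: List[dict]) -> Dict[str, list]:
    """Pivot row records into one list per column (assumes uniform keys)."""
    columns: Dict[str, list] = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values: list) -> List[Tuple[object, int]]:
    """Collapse consecutive repeats into (value, count) pairs."""
    runs: List[list] = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]


rows = [
    {"id": 1, "region": "east", "amount": 10.0},
    {"id": 2, "region": "east", "amount": 20.0},
    {"id": 3, "region": "east", "amount": 5.0},
    {"id": 4, "region": "west", "amount": 7.5},
]
columns = to_columns(rows)
total_amount = sum(columns["amount"])               # reads one column only
region_runs = run_length_encode(columns["region"])  # repeats compress well
```

The aggregate never touches `id` or `region`, and the `region` column shrinks from four values to two runs.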


HAWQ Expansion

• Features:
– Expand HAWQ nodes to additional DataNodes
– Expand # of segments per HAWQ segment host

• Benefits:
– Expand query processing
– Increase performance by utilizing maximum CPU/resources
– Increased concurrency/throughput


Big Computing and Graph Analytics

• Open MPI is one of the most mature parallel computing frameworks now available within HDFS, eliminating costly data movement and shortening data science cycles

• GraphLab is a graph-based library of machine learning algorithms – allowing Data Scientists and Analysts to leverage popular algorithms such as PageRank, collaborative filtering and computer vision in HDFS

Open MPI


Background

• Hadoop MapReduce is not a good fit for iterative applications (such as graph computing and machine learning)

• Users need to build separate systems/clusters to support those applications

• MPI is one of the most mature and widely used parallel computing frameworks
– MPI = Big Computing, Hadoop = Big Data
– MPI + Hadoop = Big Computing + Big Data


MPI Background

• What is MPI?
– “a standardized and portable message-passing system designed by a group of researchers from academia and industry to function on a wide variety of parallel computers” (Wikipedia)

• What is Open MPI?
– One of the most popular implementations of MPI, community supported

• What is Hamster?
– “Hadoop And Mpi on the same cluSTER”


GraphLab

• Topic Modeling contains applications like LDA, which can be used to cluster documents and extract topical representations.
• Graph Analytics contains applications like PageRank and triangle counting, which can be applied to general graphs to estimate community structure.
• Clustering contains standard data clustering tools such as k-means.
• Collaborative Filtering contains a collection of applications used to make predictions about users' interests and factorize large matrices.
• Graphical Models contains tools for reasoning about structured noisy data.
• Computer Vision contains a collection of tools for reasoning about images.
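
PageRank is a useful example of the iterative graph workloads discussed in this deck, since every pass depends on the ranks from the previous pass. The sketch below is a plain single-machine power iteration for illustration; it does not use GraphLab's API. The damping factor, the iteration count and the toy graph are assumptions, and it requires every link target to also appear as a key in the graph.

```python
"""Sketch: PageRank by power iteration on a small adjacency-list graph."""
from typing import Dict, List


def pagerank(graph: Dict[str, List[str]],
             damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts each pass with the "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, out_links in graph.items():
            if out_links:
                share = damping * rank[node] / len(out_links)
                for target in out_links:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Toy graph: "c" is linked from a, b and d; nothing links to "d".
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
top_node = max(ranks, key=ranks.get)
bottom_node = min(ranks, key=ranks.get)
```

Each pass of the loop needs the full result of the previous pass, which is the access pattern that makes such algorithms awkward to express as independent MapReduce jobs.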


Pivotal HD: Built for Data Science

Relational Advanced Analytics

Data Science on Pivotal HD

Graph Advanced Analytics

Languages: SQL, R, Python, Java

Custom Analytic Functions - UDFs


World’s Leading Experts Pivotal Labs – Pivotal Data Labs

On Demand Services Pivotal Data Dispatch

BATCH BATCH

INTERACTIVE INTERACTIVE HAWQ Greenplum DB

Unlimited Pivotal HD

REAL-TIME REAL-TIME GemFire XD GemFire | SQLFire


Pivotal Enables Hadoop Market Adoption

Data Lakes Unify Unstructured and Structured Data Access

Big Data Apps: Build analytic and transaction-led applications impacting top-line revenue

Data-Driven Enterprise

App Dev and Operational Management on HDFS

Data Architecture


Pivotal HD 2.0 and Isilon Update

• Isilon aligns with our Enterprise Grade Message

• Pivotal Command Center 2.2 (part of Pivotal HD 2.0)
– Works with Pivotal HD 1.1.1
– ‘Down’ status of HDFS is removed when Isilon is configured

• Isilon has accelerated their integration from Q4 to Q3 for HDFS 2.2


Large Mid-Market Financier Builds Foundation to Store All Data of Interest, Convert Insights to Value-added Services

Challenge:
• Mid-market financier seeks to maintain high margins through value-added services
• Realized that critical insights could come from many sources, but much was deleted due to storage cost
• Frustrated by lack of ability to blend data fabric, build analytics on top, and create applications on top of this

Solution:
• Data Lake provides accessibility of any information of interest through a familiar SQL-like interface
• Provides foundation for creation of analytics and applications as value-added services: forecast demand based on social media sentiment, analytics on fleet vehicle usage


Major TV Network Replaces Teradata with Pivotal, Builds Infrastructure to Capture $40 Million in Untapped Revenue

Challenge:
• Ad inventory is an inherently perishable product, and is subject to an inefficient, “traditional” selling process
• Upward trend in volume and traffic due to higher ad quality, mobile devices
• Inability to react: 7-hour lag time in communication between ad fulfillment and sales teams, exacerbated by major broadcast events

Solution:
• Reduced 7-hour lag time to under 1 hour, enabling network sales to communicate delivered impressions, forecast spend inventory and sell more effectively
• Maximized profit by selling across brands/channels, allowing the network to better leverage non-premium inventory


Home Appliance Maker Lays Foundation for “Smart” Connected Devices, Big Data-based Decisions

Challenge:
• Prepare for next-generation appliances: “smart” connected devices, controlled by mobile phone
• Siloed environment including Teradata, SAS, HP made it difficult to derive true insights across disparate data

Solution:
• Enable innovation, improve service performance through appliances that provide feedback based on output, environmental factors
• Improve marketing efficiency with targeted campaigns based on market demographics, buying indicators
• Better understand requirements for parts inventory based on current appliances' lifecycle


National Healthcare Organization Replaces Aging IBM Platform, Seeds Data Lake as Hadoop Beachhead

Challenge:
• Aging IBM infrastructure could not support new SAS Access and Visual Analytics technology
• Interest in enabling infrastructure to support a for-profit healthcare analytics-as-a-service business
• Sought to provide refined data sets to other insurance companies for their own research; needed a way to cleanse data

Solution:
• Stepwise evolution of platform onto GPDB, one of two certified platform partners for running visual analytics
• Established data lake as platform for upload, cleansing and conversion of private data into publicly consumable datasets


Aviation: Predictive Maintenance

Challenge:
• An airplane’s comprehensive “gate to gate” flight data didn’t exist in a single place for reporting
• Each individual flight can generate approximately 1 TB of data, which is economically infeasible to keep in a traditional EDW
• To maintain profitability of GE Aviation's Contract Service Agreements, new analytical methods and approaches were required

Solution:
• Ingest all data to a data lake for data discovery and model development to increase wing time and aircraft uptime, and to improve customer satisfaction and airline profitability
• Improved capacity for preventative maintenance rather than remediation, reducing expense and liability

Pivotal Solution includes: GPDB, PHD, Alpine, Chorus


Brazilian Telco Provider Establishes Foundation for Data-Driven Culture

Challenge:
• Poor call quality caused massive loss of customers, with no insight into root cause of issues
• Increased scrutiny from regulators, but infrastructure did not support the requests for information needed
• Difficulties with scale: Call Data Record generates 2 billion new records per day, no info on dropped calls due to capacity

Solution:
• New Data Warehouse infrastructure contains both dropped and completed calls for analysis, 3-month capacity
• Hadoop infrastructure with familiar SQL interface stores 5x volume at half the cost of Teradata
• Reports which took 2 months to obtain now take 1 day

Pivotal Solution includes: PHD, HAWQ


Pivotal HD 2.0 Summary

• The Foundation for Business Data Lake

• The World’s Most Advanced Hadoop Stack
– Pivotal HD now based on Apache Hadoop 2.2
– Real-time SQL, in-memory over Pivotal HD and integrated into Spring: Pivotal GemFire XD
– Enhanced Interactive SQL over Pivotal HD: HAWQ

• World’s Most Advanced Big Data Analytic Platform
– Most extensive set of machine learning libraries: MADlib, R and GraphLab


Pivotal HD 2.0 demo:

PivotalBooth


Thank You
