© Copyright 2014 EMC Corporation. All rights reserved.
Deep Dive On Pivotal HD - World Class HDFS Platform
Michael Goddard
Agenda
• Pivotal
• Pivotal Business Data Lake
• Introducing Pivotal HD 2.0
• Pivotal HD 2.0 and Isilon Update
• Customer Success
• Q&A
What Matters: Apps. Data. Analytics.
• Apps power businesses, and those apps generate data
• Analytic insights from that data drive new app functionality, which in turn drives new data
• The faster you can move around that cycle, the faster you learn, innovate & pull away from the competition
How Pivotal Gets You There
• Uniquely positioned to help enterprises modernize each facet of this cycle today
• Comprehensive portfolio of products & services spanning Big Data, PaaS & Agile
• Converging these technologies into a coherent, next-gen Enterprise PaaS platform
Pivotal Labs Agile Development
Pivotal Data Fabric
Pivotal One
PaaS
Pivotal’s Big Bets for the Future
1. HDFS becomes the data substrate for the next generation of data infrastructures
2. A set of integrated, consumer-grade services must evolve on top of HDFS – stream ingestion, analytical processing, and transactional serving
3. Provisioning flexibility and elasticity become critical capabilities for this data infrastructure
Pivotal Business Data Lake
Govern where it matters
• Focus on MDM and RDM
• Enforce only when sharing
• Treat corporate as aggregation of local
Encourage local requirements
• Let the business decide what they need
• Build from the bottom
• Enable traceability to source
• Disposable data views
Distill on demand
• Select only what you want
• Business friendly tooling
• Re-usable information maps
• Rapid change cycle
Store everything
• Store everything ‘as is’
• Include structured and unstructured data
• Store it cheaply
Pivotal Business Data Lake Architecture
[Diagram labels]
• Centralized Management: system monitoring, system management
• Unified Data Management Tier: data mgmt. services, MDM, RDM, audit and policy mgmt.
• Processing Tier: workflow management, in-memory, MPP database
• Unified Sources: existing sources, new data sources
• Ingestion: real-time, micro batch, batch
• Flexible Actions: real-time insights, interactive insights, batch insights
• HDFS
Pivotal Business Data Lake Architecture
[Diagram labels]
• Centralized Management
• Unified Data Management Tier: Data Dispatch, MDM, RDM
• Processing Tier: Spring XD, Pivotal GemFire XD, HAWQ
• Unified Sources: existing sources, new data sources
• Source data: clickstream, sensor data, weblogs, network data, CRM data, ERP data
• Flexible Actions
• Products shown: Pivotal GemFire, Pivotal RabbitMQ, Redis, Pivotal CF, Pivotal HD, Command Center
How is a Business Data Lake Different?
Common data model
• Business Data Lake: base class = standard data; derived classes = local data
• EDW: single class = single view across the enterprise
Data quality
• Business Data Lake: full spectrum
• EDW: [graphic of 0s and 1s]
Data integration
• Business Data Lake: multiple interfaces (SQL, SAS, R, MapReduce, NoSQL)
• EDW: SQL access, integration with SAS, R and other analytical interfaces
Mixed workload with varying QoS
• Business Data Lake: support low latency, interactive and batch
• EDW: limited QoS separation required
Introducing Pivotal HD 2.0
• Foundation for Business Data Lake
• World’s Most Advanced Real-Time Analytics Platform
• Most Extensive Set of Advanced Analytical Toolsets
Pivotal HD Architecture
[Diagram labels]
• Legend: Apache, Pivotal
• Pivotal HD Enterprise
• HDFS; HBase; Pig, Hive, Mahout; MapReduce; Sqoop; Flume; YARN; ZooKeeper; Oozie
• Resource Management & Workflow
• Command Center: Configure, Deploy, Monitor, Manage
• Spring XD; Spring
• HAWQ – Advanced Database Services: Xtension Framework; Catalog Services; Query Optimizer; Dynamic Pipelining; ANSI SQL + Analytics
• Pivotal GemFire XD – Real-Time Database Services: Distributed In-memory Store; Query Transactions; Ingestion Processing; Hadoop Driver – Parallel with Compaction; ANSI SQL + In-Memory
• MADlib Algorithms
• Virtual Extensions
• GraphLab, Open MPI
New Apache Hadoop Features in Pivotal HD 2.0
• Apache Hadoop 2.2 enables enterprise operationalization features such as NFS and Snapshots
• Hive 0.12 is faster, has better scalability, and broader SQL data type support
• Pig 0.12 (incl. PiggyBank) increases productivity and appeal for a broader set of users
• HBase 0.96 improves mean time to recovery and adds modularization for easier upgrades and reduced dependencies
Hadoop at the Center
Enabling the Data-Driven Enterprise
Hadoop as a Service
Big Data On-Demand
GemFire XD In-Memory Real-time Analytics
Spring XD Building Big Data Apps
Open Source
Algorithm Libraries
Chorus Big Data Collaboration
Fastest SQL Query Engine
Real-Time Analytics
• Adds fast data ingest, and real-time event processing and query performance, enabling SQL users to rapidly analyze and react to high volumes of events on HDFS
• Enables the creation of low latency, scale out OLTP applications integrated out of the box with a big data store.
• Creates a single platform for Analytics and OLTP, removing the need for an ETL process
• Supports changes to database tables while still complying with the immutability of HDFS
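The last bullet can be pictured with a small sketch. It assumes a log-structured design, in which every change is appended as a new record to a write-once file and a periodic compaction keeps only the latest version of each row. The slides do not describe the actual mechanism, so the class and method names below are hypothetical and are not the GemFire XD API.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical illustration: table changes expressed as appends to
# write-once files, which is compatible with an immutable file system.
Record = Tuple[str, Optional[dict]]  # (primary key, row, or None for a delete)


class AppendOnlyTable:
    def __init__(self) -> None:
        # Each inner list stands in for one immutable file on HDFS.
        self.files: List[List[Record]] = []

    def flush(self, batch: List[Record]) -> None:
        """Write a batch of changes as a new file; old files are never edited."""
        self.files.append(list(batch))

    def read(self) -> Dict[str, dict]:
        """Replay files oldest to newest so the latest change per key wins."""
        state: Dict[str, dict] = {}
        for records in self.files:
            for key, row in records:
                if row is None:
                    state.pop(key, None)  # tombstone: the row was deleted
                else:
                    state[key] = row
        return state

    def compact(self) -> None:
        """Rewrite all files into one new file holding only live rows."""
        live = self.read()
        self.files = [[(key, row) for key, row in live.items()]]


table = AppendOnlyTable()
table.flush([("a", {"qty": 1}), ("b", {"qty": 5})])
table.flush([("a", {"qty": 2}), ("b", None)])  # update a, delete b
before = table.read()
table.compact()
after = table.read()
```

Readers see the same rows before and after compaction; compaction only reduces the number of files that must be replayed.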
Real-Time Data Services on Pivotal HD
[Diagram labels: Online Apps; Analytic Apps; Sensor Data / Feeds; Pivotal GemFire XD; HAWQ; Pivotal Extension Framework; MapReduce; I/P & O/P Formatter; Native Persistence; Model Refresh; Re-evaluate Model; Shared Data; Command Center; Pivotal HD Enterprise; HDFS]
Pivotal GemFire XD 1.0 Major Features
Enterprise real-time data processing platform for SLA-critical applications; enables users to rapidly and reliably analyze and react to high volumes of events while leveraging 10s of TBs of in-memory reference data.
Cloud Scale Real-Time Platform
• Very low & predictable latencies at high and variable loads
• 10s of TBs in-memory (MemScale)
• Multi-tiered caching
• Real-time event processing
• Rolling upgrade support
Optimized for Real-Time Analytics
• SQL-based queries
• Support structured data
• Java stored procedures
• Deep Spring Data integration
Seamless Pivotal HD Integration
• Scale to HDFS with policy driven in-memory data retention
• Online and offline querying of HDFS data
• ETL-less bi-directional integration with other Pivotal HD services
• Pivotal Extension Framework Integration
• ICM Integration
Enterprise-Class Reliability
• Distributed transactions (JTA)
• HA through in-memory redundancy
• Active-active deployments across WAN
• JMX based scalable management
• Visual monitoring through Pulse
Deep Scalable Analytics
• User Defined Functions: PL/R, PL/Java, PL/Python enable writing UDFs in additional languages that execute inside the database, improving performance
• Parquet columnar open storage format delivers significant performance and scalability improvements
• Richer set of open source machine learning algorithms helps conduct rapid data science experiments on relational data
Deep Scalable Analytics
Provides data-parallel implementations of mathematical, statistical and machine-learning methods for structured and unstructured data.
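As a rough illustration of what "data-parallel" means here, the sketch below fits a one-variable linear regression by letting each data partition compute its own running sums, merging those sums, and solving once at the end. This is one common pattern for such methods, not the library's actual code; the function names and the sample data are made up for the example.

```python
from typing import Iterable, List, Tuple

Stats = Tuple[float, float, float, float, float]  # n, sum_x, sum_y, sum_xy, sum_xx


def partial_stats(rows: Iterable[Tuple[float, float]]) -> Stats:
    """Per-partition step: summarize local (x, y) rows as running sums."""
    n = sx = sy = sxy = sxx = 0.0
    for x, y in rows:
        n += 1
        sx += x
        sy += y
        sxy += x * y
        sxx += x * x
    return (n, sx, sy, sxy, sxx)


def merge(a: Stats, b: Stats) -> Stats:
    """Combine step: sums from two partitions add element-wise."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3], a[4] + b[4])


def fit(stats: Stats) -> Tuple[float, float]:
    """Final step: least-squares slope and intercept from the merged sums."""
    n, sx, sy, sxy, sxx = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Two partitions of points on the line y = 2x + 1 (made-up data).
partitions: List[List[Tuple[float, float]]] = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0)],
]
total: Stats = (0.0, 0.0, 0.0, 0.0, 0.0)
for part in partitions:
    total = merge(total, partial_stats(part))
slope, intercept = fit(total)
```

Because only five numbers per partition cross the network, the rows themselves never have to leave the node that stores them.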
HAWQ 1.2 Deep Scalable Analytics
• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• K-Means
• Association Rules
• Latent Dirichlet Allocation
• Naïve Bayes
• Elastic Net Regression
• Decision Trees / Random Forest
• Support Vector Machines
• Cox Proportional Hazards Regression
• Descriptive Statistics
• ARIMA
PivotalR vs. PL/R
PivotalR
• Interface is R client
• Execution is in database
• Parallelism handled by PivotalR
• Supports a portion of R
PL/R
• Interface is SQL client
• Execution is in R
• Parallelism via SQL function invocation
• Supports all of R
HAWQ: SQL on Hadoop, Format Agnostic
Pivotal HD: HDFS Data Lake
Future formats …
ANSI SQL
HAWQ Continues to Soar
NameNode High Availability (HA) Support improves availability of query processing with full Hadoop fault tolerance
Error Table helps to debug data errors
Parquet file format: columnar data storage for HDFS
HAWQ expansion increases performance (concurrency/throughput) by expanding query processing to newly added data nodes
NameNode HA Support
• Feature:
  – Automatic failover to standby NameNode when the active NameNode fails
• Benefits:
  – Fully fault tolerant to NameNode failures
  – Improved availability of query processing
  – Integrated into Hadoop availability model
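The effect on a client can be sketched as follows. This is a toy model only: the node names, the exception type, and the retry loop are assumptions made for illustration and are not Hadoop or HAWQ client code, which handles failover through its own configuration and coordination services.

```python
from typing import Callable, Dict, List


class NameNodeDown(Exception):
    """Raised by the toy cluster when a NameNode does not answer."""


def call_with_failover(nodes: List[str], request: Callable[[str], str]) -> str:
    """Try each NameNode in order and return the first successful answer."""
    last_error: Exception = NameNodeDown("no NameNodes configured")
    for node in nodes:
        try:
            return request(node)
        except NameNodeDown as err:
            last_error = err  # remember the failure and try the next node
    raise last_error


# Toy cluster state: the first NameNode has failed, the second is healthy.
healthy: Dict[str, bool] = {"nn1": False, "nn2": True}


def list_root(node: str) -> str:
    """Hypothetical request that only a healthy NameNode can serve."""
    if not healthy[node]:
        raise NameNodeDown(node)
    return f"listing served by {node}"


result = call_with_failover(["nn1", "nn2"], list_root)
```

The query layer keeps working as long as one NameNode answers, which is the availability benefit the slide describes.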
Error Table
• Feature:
  – System table for storing non-conforming data
• Benefits:
  – Eliminates erroneous data load
  – Reduces retries during load
  – Helps to debug errors in data structures
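The idea behind an error table can be sketched in a few lines: rows that fail parsing are diverted, together with the reason, instead of aborting the whole load. The two-column schema, the field names, and the sample input below are hypothetical and do not reflect HAWQ's actual error table layout.

```python
from typing import Dict, List, Tuple

# Hypothetical two-column schema: (id: int, amount: float).


def parse_row(line: str) -> Tuple[int, float]:
    """Parse one comma-separated line; raise ValueError if it does not conform."""
    fields = line.split(",")
    if len(fields) != 2:
        raise ValueError(f"expected 2 fields, got {len(fields)}")
    return int(fields[0]), float(fields[1])


def load(lines: List[str]) -> Tuple[List[Tuple[int, float]], List[Dict[str, object]]]:
    """Load conforming rows; divert the rest to an error table with a reason."""
    loaded: List[Tuple[int, float]] = []
    error_table: List[Dict[str, object]] = []
    for line_no, line in enumerate(lines, start=1):
        try:
            loaded.append(parse_row(line))
        except ValueError as err:
            error_table.append({"line": line_no, "raw": line, "error": str(err)})
    return loaded, error_table


rows, errors = load(["1,9.50", "two,3.00", "3,4.25,extra", "4,1.00"])
```

The load finishes in a single pass, and the error table records which input lines need attention and why.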
Parquet
• Features:
  – Open storage format
  – Hybrid row/column open storage format
  – Configurable Parquet or AO/CO format support
  – Compression Type: Snappy and Gzip
  – Additional data type support
  – Parquet Input Format Reader API
• Benefits:
  – Delivers significant performance and scalability improvements
  – Industry standard compression: Saves storage
  – Usable in MapReduce/Hive workloads
HAWQ Expansion
• Features:
  – Expand HAWQ nodes to additional DataNodes
  – Expand # of segments per HAWQ segment host
• Benefits:
  – Expand query processing
  – Increase performance by utilizing maximum CPU/resources
  – Increased concurrency/throughput
Big Computing and Graph Analytics
• Open MPI is one of the most mature parallel computing frameworks now available within HDFS, eliminating costly data movement and shortening data science cycles
• GraphLab is a graph-based library of machine learning algorithms – allowing Data Scientists and Analysts to leverage popular algorithms such as PageRank, collaborative filtering and computer vision in HDFS
Open MPI
Background
• Hadoop MapReduce is not a good fit for iterative applications (like graph computing, machine learning, etc.)
• Users need to build separate systems/clusters to support those applications
• MPI is (one of) the most mature/used parallel computing frameworks
  – MPI = Big Computing, Hadoop = Big Data
  – MPI + Hadoop = Big Computing + Big Data
MPI Background
• What is MPI?
  – “a standardized and portable message-passing system designed by a group of researchers from academia and industry to function on a wide variety of parallel computers” (Wikipedia)
• What is Open MPI?
  – One of the most popular implementations of MPI, community supported
• What is Hamster?
  – “Hadoop And Mpi on the same cluSTER”
GraphLab
• Topic Modeling contains applications like LDA, which can be used to cluster documents and extract topical representations.
• Graph Analytics contains applications like PageRank and triangle counting, which can be applied to general graphs to estimate community structure.
• Clustering contains standard data clustering tools such as k-means.
• Collaborative Filtering contains a collection of applications used to make predictions about users' interests and factorize large matrices.
• Graphical Models contains tools for reasoning about structured noisy data.
• Computer Vision contains a collection of tools for reasoning about images.
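PageRank is a convenient example of the iterative workloads that the Background slide says are a poor fit for MapReduce, since every round depends on the result of the previous one. The sketch below is plain Python written for illustration; it is not the GraphLab API, and the three-page graph, damping factor, and iteration count are made-up inputs.

```python
from typing import Dict, List


def pagerank(graph: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Power-iteration PageRank; assumes every vertex has at least one out-edge."""
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    for _ in range(iterations):
        # Every vertex starts each round with the teleport share.
        new_rank = {v: (1.0 - damping) / n for v in graph}
        for v, out_edges in graph.items():
            share = damping * rank[v] / len(out_edges)
            for target in out_edges:
                new_rank[target] += share
        rank = new_rank  # the next round depends on this round's result
    return rank


# Made-up three-page graph: a and b both link to c, and c links back to a.
graph = {"a": ["c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

Run as a chain of MapReduce jobs, each of the 50 rounds would write its ranks to disk and read them back, which is the overhead that in-memory graph frameworks aim to avoid.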
Pivotal HD: Built for Data Science
Relational Advanced Analytics
Data Science on Pivotal HD
Graph Advanced Analytics
Languages: SQL, R, Python, Java
Custom Analytic Functions - UDFs
World’s Leading Experts Pivotal Labs – Pivotal Data Labs
On Demand Services Pivotal Data Dispatch
BATCH BATCH
INTERACTIVE INTERACTIVE HAWQ Greenplum DB
Unlimited Pivotal HD
REAL-TIME REAL-TIME GemFire XD GemFire | SQLFire
Pivotal Enables Hadoop Market Adoption
Data Lakes Unify Unstructured and Structured Data Access
Big Data Apps: Build analytic and transaction-led applications impacting top line revenue
Data-Driven Enterprise
App Dev and Operational Management on HDFS
Data Architecture
Pivotal HD 2.0 and Isilon Update
• Isilon aligns with our Enterprise Grade Message
• Pivotal Command Center 2.2 (part of Pivotal HD 2.0)
  – Works with Pivotal HD 1.1.1
  – ‘Down’ status of HDFS is removed when Isilon is configured
• Isilon has accelerated their integration from Q4 to Q3 for HDFS 2.2
Large Mid-Market Financier Builds Foundation to Store All Data of Interest, Convert Insights to Value-added Services
Challenge:
• Mid-market financier seeks to maintain high margins through value-added services
• Realized that critical insights could come from many sources, but much was deleted due to storage cost
• Frustrated by lack of ability to blend data fabric, build analytics on top, and create applications on top of this
Solution:
• Data Lake provides accessibility of any information of interest through familiar SQL-like interface
• Provide foundation for creation of analytics and applications as value-added services: forecast demand based on social media sentiment, analytics on fleet vehicle usage
Major TV Network Replaces Teradata with Pivotal, Builds Infrastructure to Capture $40 Million in Untapped Revenue
Challenge:
• Ad inventory is an inherently perishable product, and is subject to an inefficient, “traditional” selling process
• Upward trend in volume and traffic due to higher ad quality and mobile devices
• Inability to react: 7-hour lag time in communication between ad fulfillment and sales teams, exacerbated by major broadcast events
Solution:
• Reduced 7-hour lag time to under 1 hour – enabling network sales to communicate delivered impressions, forecast spend inventory and sell more effectively
• Maximized profit by selling across brands/channels – allowing network to better leverage non-premium inventory
Home Appliance Maker Lays Foundation for “Smart” Connected Devices, Big Data-based Decisions
Challenge:
• Prepare for next generation appliances: “smart” connected devices, controlled by mobile phone
• Siloed environment including Teradata, SAS, HP made it difficult to derive true insights across disparate data
Solution:
• Enable innovation, improve service performance through appliances that provide feedback based on output, environmental factors
• Improve marketing efficiency with targeted campaigns based on market demographics, buying indicators
• Better understand requirements for parts inventory based on current appliances lifecycle
National Healthcare Organization Replaces Aging IBM Platform, Seeds Data Lake as Hadoop Beachhead
Challenge:
• Aging IBM infrastructure could not support new SAS Access and Visual Analytics technology
• Interest in enabling infrastructure to support for-profit healthcare analytics as a service business
• Sought to provide refined data sets to other insurance companies for their own research; needed a way to cleanse data
Solution:
• Stepwise evolution of platform onto GPDB, one of two certified platform partners for running visual analytics
• Established data lake as platform for upload, cleansing and conversion of private data into publicly consumable datasets
Aviation: Predictive Maintenance
Challenge:
• An airplane’s comprehensive “gate to gate” flight data didn’t exist in a single place for reporting
• Each individual flight can generate approximately 1 TB of data, which is economically infeasible to store in a traditional EDW
• To maintain profitability of GE Aviation's Contract Service Agreements, new analytical methods and approaches were required
Solution:
• Ingest all data to a data lake for data discovery and model development to increase wing time and aircraft uptime, and to improve customer satisfaction and airline profitability
• Improved capacity for preventative maintenance rather than remediation, reducing expense and liability
Pivotal Solution includes: GPDB, PHD, Alpine, Chorus
Brazilian Telco Provider Establishes Foundation for Data-Driven Culture
Challenge:
• Poor call quality caused massive loss of customers. No insight into root cause of issues.
• Increased scrutiny from regulators, but infrastructure did not support the requests for information needed
• Difficulties with scale: Call Data Record generates 2 billion new records per day, no info on dropped calls due to capacity
Solution:
• New Data Warehouse infrastructure contains both dropped and completed calls for analysis, 3 month capacity
• Hadoop infrastructure with familiar SQL interface stores 5x volume at half cost of Teradata
• Reports which took 2 months to obtain now take 1 day
Pivotal Solution includes: PHD, HAWQ
Pivotal HD 2.0 Summary
• The Foundation for Business Data Lake
• The World’s Most Advanced Hadoop Stack
  – Pivotal HD now based on Apache Hadoop 2.2
  – Real-time SQL, in-memory over Pivotal HD and integrated into Spring: Pivotal GemFire XD
  – Enhanced Interactive SQL over Pivotal HD: HAWQ
• World’s Most Advanced Big Data Analytic Platform
  – Most extensive set of machine learning libraries: MADlib, R and GraphLab
Pivotal HD 2.0 demo: Pivotal Booth
Thank You