CASE | Pivotal - Big Data Ecosystem and Analytics 360

© 2016 Pivotal Software, Inc. All rights reserved.

Big Data Ecosystem and Analytics 360: Vision, Ecosystem, Architecture

Luis Macedo Glenio Borges


The Eras of Information Technology

MAINFRAME

Accounting automation, payroll

Mainframes

ISAM

CLIENT-SERVER & WEB

Automation of paper-based processes: ERP, CRM, Email, …

Relational Databases

Minis & PCs

Cloud-Enabled Datacenter

New “Data Fabrics”

CONSUMER GRADE

New experiences, new business models

Hadoop: The Elephant in the China Shop


Hadoop Beyond Storage Needs Standards

A shared industry effort to advance the state of Apache Hadoop® and Big Data technologies for the enterprise

Greenplum Database™

The Open Source Data Warehouse


Greenplum Database Mission

“To forever change data warehousing by offering a comprehensive and proven data warehousing system in open source”

•  Fully ACID Relational Database built for Big Structured Data
•  SQL Standard Compliant
•  Cluster based system running on “Commodity” hardware & Linux OS
•  Available as an EMC appliance, Software, Cloud Deployments
•  10+ years of R&D investment
•  PostgreSQL heritage and headed to Open Source
•  Enterprise product with 1000+ install base

Greenplum


MPP Shared Nothing Architecture

Flexible framework for processing large datasets

•  Master Host and Standby Master Host
•  Master coordinates work with Segment Hosts
•  Segment Host with one or more Segment Instances
•  Segment Instances process queries in parallel
•  Segment Hosts have their own CPU, disk and memory (shared nothing)
•  High speed interconnect for continuous pipelining of data processing

[Diagram: SQL arrives at the Master Host (with a Standby Master); the Interconnect links it to Segment Hosts node1, node2, node3, … nodeN, each running several Segment Instances.]
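A minimal sketch of the shared-nothing idea, assuming a made-up four-segment cluster: rows are scattered by hashing a distribution key, each segment aggregates only its own slice, and the master merges the partial results. The names (`NUM_SEGMENTS`, `segment_for`, `master_query`) and the CRC32 hash are illustrative assumptions, not Greenplum's actual internals.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

NUM_SEGMENTS = 4  # hypothetical cluster size


def segment_for(key: str) -> int:
    """Pick a segment by hashing the distribution key (stable across runs)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SEGMENTS


def distribute(rows):
    """Scatter rows so that each segment owns a disjoint slice of the table."""
    segments = [[] for _ in range(NUM_SEGMENTS)]
    for row in rows:
        segments[segment_for(row["customer"])].append(row)
    return segments


def segment_partial_sum(segment_rows):
    """Work done locally on one segment: a partial SUM(amount) per customer."""
    partial = defaultdict(float)
    for row in segment_rows:
        partial[row["customer"]] += row["amount"]
    return dict(partial)


def master_query(segments):
    """The master fans the query out to all segments and merges the partials."""
    with ThreadPoolExecutor(max_workers=NUM_SEGMENTS) as pool:
        partials = list(pool.map(segment_partial_sum, segments))
    merged = defaultdict(float)
    for partial in partials:
        for customer, total in partial.items():
            merged[customer] += total
    return dict(merged)


rows = [
    {"customer": "ana", "amount": 10.0},
    {"customer": "bruno", "amount": 5.0},
    {"customer": "ana", "amount": 2.5},
    {"customer": "carla", "amount": 7.0},
]
totals = master_query(distribute(rows))
```

Because all rows for one customer hash to the same segment, each segment can aggregate without talking to the others; only the small partial results cross the interconnect.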



In-Database Analytics

•  Bringing the power of parallelism to commonly-used modeling and analytics functions
•  In-database analytics
-  SAS – HPA, Access, and Scoring Accelerator
-  MADlib – An open-source library of advanced analytics functions
-  Analytics extensions supported, including PostGIS - Geospatial support, PL/R - Statistical Computing, PL/Java, PL/Perl, etc.


Pivotal HDB: Hadoop Native SQL for Advanced Analytics

Based on Apache HAWQ


SQL on Hadoop Offerings Still Immature

Current SQL on Hadoop | Advanced Hadoop Native SQL
•  Complex joins not supported | •  Complex joins at performance
•  Limited advanced analytics support | •  Advanced analytics at scale within SQL
•  Interactive query latency issues | •  Fast interactive queries on large data
•  Ad-hoc query performance issues | •  Strong ad-hoc query support in optimizer
•  SQL analytic query coverage issues | •  Full analytic SQL compliance
•  Concurrent query throughput issues | •  High query throughput for mixed workloads

Pivotal HDB


Pivotal HDB Advanced SQL Query Engine

•  Pivotal Query Optimizer (future open source)
•  10+ years of advanced enterprise SQL analytics and MPP development
•  Based on GPDB


Pivotal HDB Advantage - Benefits

•  Hadoop-native advanced enterprise analytics: exceptional MPP performance, low latency, petabyte scalability, ACID reliability, fault tolerance
•  Robust ANSI SQL compliance (SQL-92, -99, -2003, OLAP): higher degree of SQL compatibility, leverages existing SQL skills
•  Query optimization: maximize performance or cost and Hadoop cluster resources; offload EDW with confidence
•  Flexible deployment (on premise, cloud, PaaS; HBase, Avro, Parquet, +): flexibility, accessibility, portability
•  Tight integration with MADlib machine learning: advanced MPP analytics, data science at scale, directly on Hadoop data



Pivotal HDB Performance

[Chart: TPC-DS comparison with regions labeled “Pivotal HDB Faster” and “Impala Faster”; numbers shown on the chart: 2 28 46 66 73 76 79 80 88 90 96]

Pivotal HDB
•  Faster on 46 of 62 TPC-DS queries completed*
•  4.55x mean avg.
•  12 hrs faster total

* Query Engine only supported 74 of 99 queries, 12 crashed mid-run


Pivotal GemFire: Scale-out, In-Memory Data Grid (NoSQL)

Based on Apache Geode


Traditional Data Stores Can’t Handle Cloud Scale Low-Latency Apps

Traditional RDBMS | Distributed In-Memory DGs
•  Disk based, high latency | •  Memory based, low latency
•  Limited scale up and out via expensive hardware and planning | •  Elastically scale up or down on demand without down time
•  Designed for monolithic applications | •  Built for cloud scale applications
•  Support complex SQL for transactions AND analytics | •  Built to optimize application-specific data access patterns
•  Built for general business applications and reporting requirements | •  Built to ensure low latency data access at any scale

GemFire


Pivotal GemFire

Vertical scalability only

Exponential cost growth

Expensive and resource-intensive replication

Active Data Management

No data outages

Active management of partitions as cluster changes size

Elastic linear scale-out, on commodity hardware and cloud

Reliable transaction and data consistency, w/ wide area/global replication

Elastic scale with lower management costs

Highly Available, Global Data Distribution

Cost savings, no infrastructure lock-in


GemFire High Level Architecture

Clients (Java, C#, C++) can embed cache with disk overflow

Locators provide both discovery and load balancing services.

Updates are sent to subscribers as objects change

Data partitioning and replication is handled transparently to clients. Redundant storage assures continuous availability (memory or disk)

Machines can be added dynamically to expand capacity

Disk-Stores for data persistence and backup

Synchronous read through, write through, or asynchronous write behind to other data sources

[Diagram: Java, C# and C++ clients; Locators; cluster members M1, M2, M3, M..; external data stores]
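The read-through and write-behind patterns named above can be sketched in a few lines. This is an illustrative toy, not the GemFire API: `BackingStore` and `WriteBehindRegion` are hypothetical names, and the queue is flushed by hand rather than by a background thread.

```python
from collections import deque


class BackingStore:
    """Stand-in for an external database (hypothetical)."""

    def __init__(self, data=None):
        self.data = dict(data or {})
        self.reads = 0  # counts how often the slow store is actually hit

    def load(self, key):
        self.reads += 1
        return self.data.get(key)

    def save(self, key, value):
        self.data[key] = value


class WriteBehindRegion:
    """In-memory region with read-through gets and write-behind puts."""

    def __init__(self, store):
        self.store = store
        self.cache = {}
        self.pending = deque()  # writes not yet applied to the store

    def get(self, key):
        # Read-through: on a miss, load from the store and keep the value.
        if key not in self.cache:
            value = self.store.load(key)
            if value is not None:
                self.cache[key] = value
            return value
        return self.cache[key]

    def put(self, key, value):
        # Write-behind: update memory now, queue the store write for later.
        self.cache[key] = value
        self.pending.append((key, value))

    def flush(self):
        # Drain queued writes to the store; returns how many were applied.
        applied = 0
        while self.pending:
            key, value = self.pending.popleft()
            self.store.save(key, value)
            applied += 1
        return applied


store = BackingStore({"k1": "v1"})
region = WriteBehindRegion(store)
first = region.get("k1")    # miss: loaded from the store
second = region.get("k1")   # hit: served from memory
region.put("k2", "v2")      # visible in memory immediately
visible_before_flush = "k2" in store.data
flushed = region.flush()
```

A write-through variant would call `store.save` inside `put` before returning, trading write latency for a store that is never behind the cache.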


Data Event Handling & Continuous Queries

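[Diagram: an Application with a Pool and a Region connects on port 40404 to a Region on a Server (Cluster Node); an Update passes through a durable, fault-tolerant subscription queue and is delivered to afterUpdate(EntryEvent<K,V> event); a continuous query fires on a matching “where clause”.]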

•  Rich API of cache events for application-specific handling
–  Example: synchronous pre-writes & changes for app processing; asynchronous write-behind for database archival
–  Asynchronous events queue for mission critical delivery
•  Continuous Query
–  Server-side “query match” event
–  Significantly reduces latency and overhead for detecting data conditions
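As a rough illustration of cache events and continuous queries, the sketch below keeps listeners and registered predicates next to the data, so subscribers are notified on write instead of polling. `EventRegion`, `add_listener` and `register_cq` are hypothetical names for this example only; the real GemFire API differs.

```python
class EventRegion:
    """Toy key/value region that pushes events to subscribers on every put."""

    def __init__(self):
        self._data = {}
        self._listeners = []   # called on every update
        self._queries = []     # (predicate, callback) pairs

    def add_listener(self, callback):
        """Register callback(key, old_value, new_value), in the spirit of afterUpdate."""
        self._listeners.append(callback)

    def register_cq(self, predicate, callback):
        """Register a continuous query: callback(key, value) fires when predicate(value) holds."""
        self._queries.append((predicate, callback))

    def put(self, key, value):
        old = self._data.get(key)
        self._data[key] = value
        for listener in self._listeners:
            listener(key, old, value)
        # The "where clause" is evaluated where the data lives, at write time.
        for predicate, callback in self._queries:
            if predicate(value):
                callback(key, value)


updates = []
alerts = []
orders = EventRegion()
orders.add_listener(lambda key, old, new: updates.append((key, old, new)))
orders.register_cq(lambda order: order["amount"] > 1000,
                   lambda key, order: alerts.append(key))

orders.put("o1", {"amount": 50})
orders.put("o2", {"amount": 5000})
orders.put("o1", {"amount": 2000})
```

Only the two large orders reach `alerts`, while `updates` records every change, which is the latency and overhead saving the slide refers to: clients never scan the data to detect a condition.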

Spring XD (Spring Cloud Data Services): Flexible, Scale-out Data Pipeline


Spring XD – Spring Cloud Data Flow

Unified, distributed, and extensible system for data ingestion, real time analytics, batch processing, and data export.

•  Data Ingestion and Pipeline Processing
–  Kafka, RabbitMQ, MQTT, JMS, HTTP, GPDB, HAWQ/HDB
–  Partition, Filter, Transform, Split, Aggregate
•  Real Time Analytics and Complex Event Processing
–  Spark Streaming, RxJava, JPMML Scoring
–  Redis, GemFire, Cassandra, etc.
•  Rapid Dashboarding
•  Batch Workflow Orchestration + ETL
–  Map Reduce, HDFS, PIG, Hive, GPDB, HAWQ/HDB, Spark
–  RDBMS, FILE, FTP, LOG, Mongo, Splunk


Spring XD - Streams

Sources: HTTP, Tail, File, Mail, Twitter, Gemfire, Syslog, TCP, UDP, JMS, RabbitMQ, MQTT, Trigger, Reactor TCP/UDP

Processors: Filter, Transformer, Object-to-JSON, JSON-to-Tuple, Splitter, Aggregator, HTTP Client, Groovy Scripts, Java Code, JPMML Evaluator

Sinks: File, HDFS, JDBC, TCP, Log, Mail, RabbitMQ, Gemfire, Splunk, MQTT, Dynamic Router, Counters
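A stream chains a source, zero or more processors, and a sink. The sketch below imitates that shape with plain Python generators; it does not use Spring XD, and the function names, the sample sensor payloads, and the 40-degree threshold are assumptions made for illustration.

```python
import json


def source(lines):
    """Source: emit raw payloads (stands in for HTTP, file tail, etc.)."""
    for line in lines:
        yield line


def json_to_dict(messages):
    """Processor: parse each JSON payload into a dict."""
    for message in messages:
        yield json.loads(message)


def filter_step(messages, predicate):
    """Processor: pass along only the messages that match the predicate."""
    for message in messages:
        if predicate(message):
            yield message


def transform(messages, fn):
    """Processor: map each message to a new shape."""
    for message in messages:
        yield fn(message)


def sink(messages, out):
    """Sink: collect results (stands in for HDFS, JDBC, log, etc.)."""
    for message in messages:
        out.append(message)
    return out


raw = [
    '{"sensor": "a", "temp": 21.5}',
    '{"sensor": "b", "temp": 48.0}',
    '{"sensor": "c", "temp": 52.25}',
]

# Equivalent in spirit to: source | json-to-dict | filter | transform | sink
stream = transform(
    filter_step(json_to_dict(source(raw)), lambda m: m["temp"] > 40),
    lambda m: {"sensor": m["sensor"], "alert": True},
)
collected = sink(stream, [])
```

Generators keep each stage lazy, so a message moves through the whole chain before the next one is read, which loosely mirrors how a pipeline processes events one at a time rather than in batches.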


Spring XD Benefits

Problems
•  Batch and Streaming are often handled by multiple platforms
•  Ecosystem is fragmented
•  Data Sources and API(s) constantly changing
•  Not all data is Hadoop bound

Unified Approach - Stream Processing and Batch Jobs
-  Hadoop Batch workflow orchestration
-  Analytics
-  Machine Learning Scoring

Runtime provides critical Non-functional requirements
-  Scalable, Distributed, Fault-Tolerant
-  Portable on prem DIY cluster, YARN, EC2, (WIP for PCF)
-  Easy to use, extend and integrate other technologies

Proven
-  Built on robust EAI and Batch Spring projects (7 years)

Eye on big picture
-  Support end-to-end scenarios

Modern Data Architectures: Lambda and Event-Driven Architecture



“We’ve found that when a host selects a price that’s within 5% of their tip, they’re nearly 4 times more likely to get booked”

“The importance of accuracy and efficiency […], will continue to rise as we expand and improve products like uberPOOL and beyond.”

“Over 75% of what people watch come from our recommendations”

Data manifests as features in an app



How are they accomplishing this?

(Data) Microservices: Loosely coupled services architecture, bounded by context

Cloud-Native Platforms: Enabling continuous delivery & automated operations

Open Source Database Innovation: Extreme scale & performance advantages, built for the cloud

Machine Learning: Use of predictive analytics to build smart apps


Today’s Enterprise Challenge

●  Silo’d and aging database systems
●  Spaghetti data pipelines
●  Expensive, proprietary data management systems
●  Lack of structured platforms for continuous software delivery
●  Monolithic application architectures
●  Batch-oriented data integration
●  Limited operationalization of analytics
●  Proprietary systems


Modern Cloud-Native Data Architecture

[Diagram labels: Apps & Microservices; Data Programming Model; Cloud-Native Platform (Microservices Framework, Platform Runtime); Stream Data Platform; DBMS (IMDG, K/V Store, Relational DB); Big Data & Machine Learning (Hadoop, Spark, DW)]


App Development

Data analytics

Cloud-native App platform

Data Science & Model building

Data Micro-service

APP

Must support scale-out query processing

Must deliver as an API

Must embrace agile development, focus on outcomes

Must support microservices, agile dev, and connect to big data analytics

A Real-World Example


Then | Now
assume reliable infrastructure | assume fragile infrastructure
release code every 3 months | release code early and often
works in my environment | shared Dev & Ops responsibility
tightly coupled | loosely coupled


Putting it All Together

[Diagram labels: DATA FEEDS; Spring XD (Ingest, Filter, Enrich, Sink); GemFire; HDFS Data Lake; HDB; GPDB; TRANSACTIONAL APPS; ANALYTIC APPS]


Let’s build something MEANINGFUL

Business
