CASE | Pivotal - Ecossistema Big Data e Analytics 360

Page 1: CASE | Pivotal - Ecossistema Big Data e Analytics 360
Page 2: CASE | Pivotal - Ecossistema Big Data e Analytics 360

© 2016 Pivotal Software, Inc. All rights reserved.

Ecossistema Big Data e Analytics 360 (Big Data Ecosystem and Analytics 360): Vision, Ecosystem, Architecture

Luis Macedo, Glenio Borges

Page 3: CASE | Pivotal - Ecossistema Big Data e Analytics 360

The Eras of Information Technology

MAINFRAME
•  Accounting automation, payroll
•  Mainframes
•  ISAM

CLIENT-SERVER & WEB
•  Automation of paper processes: ERP, CRM, Email, …
•  Mini’s & PC’s
•  Relational Databases

CONSUMER GRADE
•  New experiences, new business models
•  Cloud-Enabled Datacenter
•  New “Data-Fabrics”

Page 4: CASE | Pivotal - Ecossistema Big Data e Analytics 360

Hadoop: The Elephant in the China Shop

Page 5: CASE | Pivotal - Ecossistema Big Data e Analytics 360

Hadoop Beyond Storage Needs Standards

A shared industry effort to advance the state of Apache Hadoop® and Big Data technologies for the enterprise

Page 6: CASE | Pivotal - Ecossistema Big Data e Analytics 360

Greenplum Database™: The Open Source Data Warehouse

Page 7: CASE | Pivotal - Ecossistema Big Data e Analytics 360

Greenplum Database Mission

“To forever change data warehousing by offering a comprehensive and proven data warehousing system in open source”

•  Fully ACID Relational Database built for Big Structured Data
•  SQL Standard Compliant
•  Cluster based system running on “Commodity” hardware & Linux OS
•  Available as an EMC appliance, Software, Cloud Deployments
•  10+ years of R&D investment
•  PostgreSQL heritage and headed to Open Source
•  Enterprise product with 1000+ install base

Greenplum

Page 8: CASE | Pivotal - Ecossistema Big Data e Analytics 360

MPP Shared Nothing Architecture
Flexible framework for processing large datasets

[Diagram: SQL arrives at the Master Host, which has a Standby Master; an Interconnect links the master to Segment Hosts node1, node2, node3 through nodeN, each running four Segment Instances.]

•  Master Host and Standby Master Host
•  Master coordinates work with Segment Hosts
•  Segment Host with one or more Segment Instances
•  Segment Instances process queries in parallel
•  Segment Hosts have their own CPU, disk and memory (shared nothing)
•  High speed interconnect for continuous pipelining of data processing
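
The master/segment split on this slide can be illustrated with a small Python sketch. It is a toy model and not Greenplum code: the `customer` distribution key, the four-segment cluster, and every function name are assumptions made for illustration. Rows are hash-distributed so that each lands on exactly one segment, every segment aggregates only its own rows in parallel, and the master merges the partial results.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
import zlib

NUM_SEGMENTS = 4  # hypothetical cluster size


def segment_for(key, num_segments=NUM_SEGMENTS):
    """Pick a segment by hashing the distribution key (stable across runs)."""
    return zlib.crc32(key.encode("utf-8")) % num_segments


def distribute(rows, num_segments=NUM_SEGMENTS):
    """Load step: each row is stored on exactly one segment (shared nothing)."""
    segments = [[] for _ in range(num_segments)]
    for row in rows:
        segments[segment_for(row["customer"], num_segments)].append(row)
    return segments


def segment_scan(local_rows):
    """Runs on one segment: partial SUM(amount) GROUP BY customer, local rows only."""
    partial = defaultdict(float)
    for row in local_rows:
        partial[row["customer"]] += row["amount"]
    return dict(partial)


def master_query(segments):
    """Master sends the same plan to every segment, then merges the partial results."""
    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        partials = list(pool.map(segment_scan, segments))
    merged = defaultdict(float)
    for partial in partials:
        for customer, total in partial.items():
            merged[customer] += total
    return dict(merged)


rows = [
    {"customer": "acme", "amount": 10.0},
    {"customer": "globex", "amount": 5.0},
    {"customer": "acme", "amount": 2.5},
    {"customer": "initech", "amount": 7.0},
]
segments = distribute(rows)
totals = master_query(segments)
```

Because the distribution key is also the grouping key here, every row for a given customer sits on the same segment, so the merge step only has to combine disjoint partial results.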

Greenplum

Page 9: CASE | Pivotal - Ecossistema Big Data e Analytics 360

In-Database Analytics

•  Bringing the power of parallelism to commonly-used modeling and analytics functions
•  In-database analytics
-  SAS – HPA, Access, and Scoring Accelerator
-  MADlib – An open-source library of advanced analytics functions
-  Analytics extensions supported, including PostGIS (geospatial support), PL/R (statistical computing), PL/Java, PL/Perl, etc.

Greenplum

Page 10: CASE | Pivotal - Ecossistema Big Data e Analytics 360

Pivotal HDB: Hadoop Native SQL for Advanced Analytics, Based on Apache HAWQ

Page 11: CASE | Pivotal - Ecossistema Big Data e Analytics 360

SQL on Hadoop Offerings Still Immature

Current SQL on Hadoop | Advanced Hadoop Native SQL
•  Complex joins not supported | •  Complex joins at performance
•  Limited advanced analytics support | •  Advanced analytics at scale within SQL
•  Interactive query latency issues | •  Fast interactive queries on large data
•  Ad-hoc query performance issues | •  Strong ad-hoc query support in optimizer
•  SQL analytic query coverage issues | •  Full analytic SQL compliance
•  Concurrent query throughput issues | •  High query throughput for mixed workloads

Pivotal HDB

Page 12: CASE | Pivotal - Ecossistema Big Data e Analytics 360

Pivotal HDB Advanced SQL Query Engine

•  Pivotal Query Optimizer (future open source)
•  10+ years of advanced enterprise SQL analytics and MPP development
•  Based on GPDB

Pivotal HDB

Page 13: CASE | Pivotal - Ecossistema Big Data e Analytics 360

Pivotal HDB Advantage - Benefits

•  Hadoop-native Advanced Enterprise Analytics
–  Exceptional MPP performance, low latency, petabyte scalability, ACID reliability, fault tolerance
•  Robust ANSI SQL Compliance (SQL-92, -99, -2003, OLAP)
–  Higher degree of SQL compatibility, leverages existing SQL skills, OLAP
•  Query Optimization
–  Maximize performance or cost and Hadoop cluster resources
–  Offload EDW with confidence
•  Flexible Deployment (on premise, cloud, PaaS; HBase, Avro, Parquet, +)
–  Flexibility, accessibility, portability
•  Tightly integrate w/MADlib Machine Learning
–  Advanced MPP analytics, data science at scale, directly on Hadoop data

Pivotal HDB

Page 14: CASE | Pivotal - Ecossistema Big Data e Analytics 360

Pivotal HDB Performance

[Chart: per-query comparison with regions marked “Pivotal HDB Faster” and “Impala Faster”; legible axis labels: 2, 28, 46, 66, 73, 76, 79, 80, 88, 90, 96.]

Pivotal HDB
•  Faster on 46 of 62 TPC-DS queries completed*
•  4.55x mean avg.
•  12 hrs faster total

* Query Engine only supported 74 of 99 queries, and 12 crashed mid-run, leaving 62 completed

Pivotal HDB

Page 15: CASE | Pivotal - Ecossistema Big Data e Analytics 360

Pivotal GemFire: Scale-out, In-Memory Data Grid (NoSQL), Based on Apache Geode

Page 16: CASE | Pivotal - Ecossistema Big Data e Analytics 360

Traditional Data Stores Can’t Handle Cloud Scale Low-Latency Apps

Traditional RDBMS | Distributed In-Memory DG’s
•  Disk based, high latency | •  Memory based, low latency
•  Limited scale up and out via expensive hardware and planning | •  Elastically scale up or down on demand without down time
•  Designed for monolithic applications | •  Built for cloud scale applications
•  Support complex SQL for transactions AND analytics | •  Built to optimize application-specific data access patterns
•  Built for general business applications and reporting requirements | •  Built to ensure low latency data access at any scale

GemFire

Page 17: CASE | Pivotal - Ecossistema Big Data e Analytics 360

Pivotal GemFire

•  Vertical scalability only
•  Exponential cost growth
•  Expensive and resource-intensive replication

•  Active Data Management
•  No data outages
•  Active management of partitions as cluster changes size
•  Elastic linear scale-out, on commodity hardware and cloud
•  Reliable transaction and data consistency, w/ wide area/global replication
•  Elastic scale with lower management costs
•  Highly Available, Global Data Distribution
•  Cost savings, no infrastructure lock-in

Page 18: CASE | Pivotal - Ecossistema Big Data e Analytics 360

GemFire High Level Architecture

[Diagram: a Java Client, a C# Client and a C++ Client connect through Locators to cluster members M1, M2, M3, …, each holding Data.]

•  Clients can embed cache with disk overflow
•  Locators provide both discovery and load balancing services
•  Updates are sent to subscribers as objects change
•  Data partitioning and replication is handled transparently to clients
•  Redundant storage assures continuous availability (memory or disk)
•  Machines can be added dynamically to expand capacity
•  Disk-Stores for data persistence and backup
•  Synchronous read through, write through, or asynchronous write behind to other data sources
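
The three integration modes named on this slide (read through, write through, write behind) can be contrasted in a short Python sketch. This is a toy model under stated assumptions and not the GemFire API: `BackingStore`, `CachedRegion`, and the `flush` method are hypothetical names, and a real write-behind queue would be drained by a background thread rather than by an explicit call.

```python
from collections import deque


class BackingStore:
    """Stand-in for an external data source such as a database (hypothetical)."""

    def __init__(self, data=None):
        self.data = dict(data or {})
        self.reads = 0
        self.writes = 0

    def load(self, key):
        self.reads += 1
        return self.data.get(key)

    def save(self, key, value):
        self.writes += 1
        self.data[key] = value


class CachedRegion:
    """In-memory map in front of a store; write_behind selects the write mode."""

    def __init__(self, store, write_behind=False):
        self.store = store
        self.write_behind = write_behind
        self.entries = {}
        self.pending = deque()  # queued writes not yet applied to the store

    def get(self, key):
        # Read through: on a cache miss, load synchronously from the store.
        if key not in self.entries:
            value = self.store.load(key)
            if value is not None:
                self.entries[key] = value
            return value
        return self.entries[key]

    def put(self, key, value):
        self.entries[key] = value
        if self.write_behind:
            # Write behind: return immediately, persist later.
            self.pending.append((key, value))
        else:
            # Write through: persist before returning to the caller.
            self.store.save(key, value)

    def flush(self):
        """Drain the write-behind queue; returns the number of writes applied."""
        applied = 0
        while self.pending:
            key, value = self.pending.popleft()
            self.store.save(key, value)
            applied += 1
        return applied


store = BackingStore({"a": 1})

sync_region = CachedRegion(store)
first = sync_region.get("a")    # miss: read through to the store
second = sync_region.get("a")   # hit: served from memory
sync_region.put("b", 2)         # write through

async_region = CachedRegion(store, write_behind=True)
async_region.put("c", 3)        # queued only
visible_before_flush = "c" in store.data
flushed = async_region.flush()
```

The trade-off is the usual one: write through keeps the external store consistent at the cost of write latency, while write behind acknowledges quickly and accepts a window in which the store lags the cache.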

Page 19: CASE | Pivotal - Ecossistema Big Data e Analytics 360

Data Event Handling & Continuous Queries

[Diagram: an Application with a Pool and a local Region connects to a Server (Cluster Node, port 40404) hosting a Region. An Update that matches the continuous query’s “where clause” passes through a durable, fault tolerant subscription queue and triggers afterUpdate(EntryEvent<K,V> event).]

•  Rich API of cache events for application-specific handling
–  Example: synchronous pre-writes & changes for app processing; asynchronous write-behind for database archival
–  Asynchronous events queue for mission critical delivery
•  Continuous Query
–  Server-side “query match” event
–  Significantly reduces latency and overhead for detecting data conditions
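
The difference between a plain cache listener and a continuous query can be sketched in Python. This is an illustrative toy and not the GemFire API (the slide's `afterUpdate` callback is Java): `EventRegion`, `add_listener`, and `register_cq` are hypothetical names. The point shown is that the predicate is evaluated where the data changes, so a subscriber is notified only for matching updates instead of polling or filtering every event itself.

```python
class EventRegion:
    """Toy key/value region that pushes events to subscribers on every put."""

    def __init__(self):
        self.entries = {}
        self.listeners = []            # called for every update
        self.continuous_queries = []   # (predicate, callback) pairs

    def add_listener(self, callback):
        self.listeners.append(callback)

    def register_cq(self, predicate, callback):
        """Register a standing query: callback fires only when predicate matches."""
        self.continuous_queries.append((predicate, callback))

    def put(self, key, value):
        old_value = self.entries.get(key)
        self.entries[key] = value
        # Plain listeners see every change, with old and new values.
        for listener in self.listeners:
            listener(key, old_value, value)
        # Continuous queries are matched on the "server" side of the put.
        for predicate, callback in self.continuous_queries:
            if predicate(value):
                callback(key, value)


all_updates = []
alerts = []

trades = EventRegion()
trades.add_listener(lambda key, old, new: all_updates.append((key, old, new)))
# Standing query, roughly "WHERE price > 100" (hypothetical schema).
trades.register_cq(lambda trade: trade["price"] > 100,
                   lambda key, trade: alerts.append(key))

trades.put("t1", {"symbol": "PVTL", "price": 95})
trades.put("t2", {"symbol": "PVTL", "price": 120})
trades.put("t1", {"symbol": "PVTL", "price": 101})
```

After the three puts, the listener has seen all three updates while the continuous query has fired only for the two that crossed the threshold.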

Page 20: CASE | Pivotal - Ecossistema Big Data e Analytics 360

Spring XD (Spring Cloud Data Services): Flexible, Scale-out Data Pipeline

Page 21: CASE | Pivotal - Ecossistema Big Data e Analytics 360

Spring XD – Spring Cloud Data Flow

Unified, distributed, and extensible system for data ingestion, real time analytics, batch processing, and data export.

•  Data Ingestion and Pipeline Processing
–  Kafka, RabbitMQ, MQTT, JMS, HTTP, GPDB, HAWQ/HDB
–  Partition, Filter, Transform, Split, Aggregate
•  Real Time Analytics and Complex Event Processing
–  Spark Streaming, RxJava, JPMML Scoring
–  Redis, GemFire, Cassandra, etc.
•  Rapid Dashboarding
•  Batch Workflow Orchestration + ETL
–  Map Reduce, HDFS, PIG, Hive, GPDB, HAWQ/HDB, Spark
–  RDBMS, FILE, FTP, LOG, Mongo, Splunk
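
The pipeline operations listed above (filter, transform, split, aggregate) compose naturally as chained stages. The following is a minimal sketch in plain Python generators, not Spring XD code or its stream DSL; the stage function names and the sample log lines are assumptions for illustration only.

```python
from collections import Counter


def source(lines):
    """Source stage: emit raw events one at a time."""
    for line in lines:
        yield line


def filter_stage(events, predicate):
    """Keep only events for which predicate is true."""
    return (event for event in events if predicate(event))


def transform_stage(events, fn):
    """Map each event to a new event."""
    return (fn(event) for event in events)


def split_stage(events, splitter):
    """Turn one event into zero or more events."""
    for event in events:
        yield from splitter(event)


def aggregate_stage(events):
    """Terminal stage: count occurrences of each event."""
    return Counter(events)


raw = ["ERROR disk full", "INFO started", "ERROR disk slow", ""]

stream = source(raw)
stream = filter_stage(stream, lambda line: line.startswith("ERROR"))
stream = transform_stage(stream, str.lower)
stream = split_stage(stream, str.split)
word_counts = aggregate_stage(stream)
```

Each stage is lazy, so nothing is processed until the aggregate pulls events through the chain; that mirrors how a streaming pipeline handles events one at a time rather than materializing intermediate results.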

Page 22: CASE | Pivotal - Ecossistema Big Data e Analytics 360

Spring XD - Streams

Sources: HTTP, Tail, File, Mail, Twitter, Gemfire, Syslog, TCP, UDP, JMS, RabbitMQ, MQTT, Trigger, Reactor TCP/UDP

Processors: Filter, Transformer, Object-to-JSON, JSON-to-Tuple, Splitter, Aggregator, HTTP Client, Groovy Scripts, Java Code, JPMML Evaluator

Sinks: File, HDFS, JDBC, TCP, Log, Mail, RabbitMQ, Gemfire, Splunk, MQTT, Dynamic Router, Counters

Page 23: CASE | Pivotal - Ecossistema Big Data e Analytics 360

Spring XD Benefits

Problems
•  Batch and Streaming are often handled by multiple platforms
•  Ecosystem is fragmented
•  Data Sources and API(s) constantly changing
•  Not all data is Hadoop bound

Benefits
•  Unified Approach
-  Stream Processing and Batch Jobs
-  Hadoop Batch workflow orchestration
-  Analytics
-  Machine Learning Scoring
•  Runtime provides critical Non-functional requirements
-  Scalable, Distributed, Fault-Tolerant
-  Portable on prem DIY cluster, YARN, EC2, (WIP for PCF)
-  Easy to use, extend and integrate other technologies
•  Proven
-  Built on robust EAI and Batch spring projects (7 years)
•  Eye on big picture
-  Support end-to-end scenarios

Page 24: CASE | Pivotal - Ecossistema Big Data e Analytics 360

Modern Data Architectures: Lambda and Event-Driven Architecture

Page 25: CASE | Pivotal - Ecossistema Big Data e Analytics 360

Data manifests as features in an app

“We’ve found that when a host selects a price that’s within 5% of their tip, they’re nearly 4 times more likely to get booked”

“The importance of accuracy and efficiency […], will continue to rise as we expand and improve products like uberPOOL and beyond.”

“Over 75% of what people watch come from our recommendations”

Page 26: CASE | Pivotal - Ecossistema Big Data e Analytics 360

How are they accomplishing this?

•  (Data) Microservices: Loosely coupled services architecture, bounded by context
•  Cloud-Native Platforms: Enabling continuous delivery & automated operations
•  Open Source Database Innovation: Extreme scale & performance advantages, built for the cloud
•  Machine Learning: Use of predictive analytics to build smart apps

Page 27: CASE | Pivotal - Ecossistema Big Data e Analytics 360

Today’s Enterprise Challenge

●  Siloed and aging database systems

●  Spaghetti data pipelines

●  Expensive, proprietary data management systems

●  Lack of structured platforms for continuous software delivery

●  Monolithic application architectures

●  Batch-oriented data integration

●  Limited operationalization of analytics

●  Proprietary systems

Page 28: CASE | Pivotal - Ecossistema Big Data e Analytics 360

Modern Cloud-Native Data Architecture

[Diagram labels: Data Programming Model; Cloud-Native Platform; Microservices Framework; Platform Runtime; Stream Data Platform; Hadoop; Spark; DW; Apps & Microservices; DBMS; IMDG; K/V Store; Relational DB; Big Data & Machine Learning]

Page 29: CASE | Pivotal - Ecossistema Big Data e Analytics 360

A Real-World Example

[Diagram labels: App Development; Data analytics; Cloud-native App platform; Data Science & Model building; Data Micro-service; APP]

•  Must support scale-out query processing
•  Must deliver as an API
•  Must embrace agile development, focus on outcomes
•  Must support microservices, agile dev, and connect to big data analytics

Page 30: CASE | Pivotal - Ecossistema Big Data e Analytics 360

Then | Now
assume reliable infrastructure | assume fragile infrastructure
release code every 3 months | release code early and often
works in my environment | shared Dev & Ops responsibility
tightly coupled | loosely coupled

Page 31: CASE | Pivotal - Ecossistema Big Data e Analytics 360

Putting it All Together

[Diagram labels: DATA FEEDS; Spring XD (Ingest, Filter, Enrich, Sink); GemFire; HDFS Data Lake; HDB; GPDB; TRANSACTIONAL APPS; ANALYTIC APPS]

Page 32: CASE | Pivotal - Ecossistema Big Data e Analytics 360

Let’s build something MEANINGFUL

Page 33: CASE | Pivotal - Ecossistema Big Data e Analytics 360