© 2016 Pivotal Software, Inc. All rights reserved.
Big Data and Analytics 360 Ecosystem: Vision, Ecosystem, Architecture
Luis Macedo Glenio Borges
The Eras of Information Technology
• MAINFRAME: accounting automation, payroll; mainframes; ISAM
• CLIENT-SERVER & WEB: automation of paper-based processes (ERP, CRM, email, …); minis & PCs; relational databases
• CONSUMER GRADE: new experiences, new business models; cloud-enabled datacenter; new "data fabrics"
Hadoop: The Elephant in the China Shop
Hadoop Beyond Storage Needs Standards
A shared industry effort to advance the state of Apache Hadoop® and Big Data technologies for the enterprise
Greenplum Database™ The Open Source Data Warehouse
Greenplum Database Mission
“To forever change data warehousing by offering a comprehensive and proven data warehousing system in open source”
• Fully ACID Relational Database built for Big Structured Data
• SQL Standard Compliant
• Cluster based system running on “Commodity” hardware & Linux OS
• Available as an EMC appliance, Software, Cloud Deployments
• 10+ years of R&D investment
• PostgreSQL heritage and headed to Open Source
• Enterprise product with 1000+ install base
Greenplum
MPP Shared Nothing Architecture
Flexible framework for processing large datasets
• Master Host and Standby Master Host
• Master coordinates work with Segment Hosts
• Segment Host with one or more Segment Instances
• Segment Instances process queries in parallel
• Segment Hosts have their own CPU, disk and memory (shared nothing)
• High speed interconnect for continuous pipelining of data processing
[Diagram: SQL arrives at the Master Host (with a Standby Master); the Interconnect links it to Segment Hosts node1, node2, node3, … nodeN, each running four Segment Instances]
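The master/segment split described on this slide can be sketched in a few lines of Python. This is only an illustration of the shared-nothing idea, not Greenplum code: the names `segment_for`, `distribute`, `segment_sum`, and `master_sum_by_key`, the CRC32 hash, and the four-segment cluster are all assumptions made for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import zlib

NUM_SEGMENTS = 4  # hypothetical cluster size, chosen for the example


def segment_for(key, num_segments=NUM_SEGMENTS):
    """Pick a segment by hashing the distribution key (stable across runs)."""
    return zlib.crc32(key.encode("utf-8")) % num_segments


def distribute(rows, num_segments=NUM_SEGMENTS):
    """Place each (key, amount) row on exactly one segment: nothing is shared."""
    segments = [[] for _ in range(num_segments)]
    for key, amount in rows:
        segments[segment_for(key, num_segments)].append((key, amount))
    return segments


def segment_sum(local_rows):
    """Work done by one segment instance, using only its own rows."""
    partial = Counter()
    for key, amount in local_rows:
        partial[key] += amount
    return partial


def master_sum_by_key(rows, num_segments=NUM_SEGMENTS):
    """Master role: fan the query out to all segments, then merge the partials."""
    segments = distribute(rows, num_segments)
    with ThreadPoolExecutor(max_workers=num_segments) as pool:
        partials = list(pool.map(segment_sum, segments))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds the per-key sums
    return dict(total)


rows = [("br", 10), ("us", 5), ("br", 7), ("de", 3), ("us", 1)]
result = master_sum_by_key(rows)
```

Because every row with the same key lands on the same segment, each segment can aggregate its keys without talking to the others; the master only merges small partial results.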
In-Database Analytics
• Bringing the power of parallelism to commonly used modeling and analytics functions
• In-database analytics
  - SAS: HPA, Access, and Scoring Accelerator
  - MADlib: an open-source library of advanced analytics functions
  - Analytics extensions supported, including PostGIS (geospatial support), PL/R (statistical computing), PL/Java, PL/Perl, etc.
Pivotal HDB: Hadoop-Native SQL for Advanced Analytics, Based on Apache HAWQ
SQL on Hadoop Offerings Still Immature
Current SQL on Hadoop
• Complex joins not supported
• Limited advanced analytics support
• Interactive query latency issues
• Ad-hoc query performance issues
• SQL analytic query coverage issues
• Concurrent query throughput issues
Advanced Hadoop Native SQL
• Complex joins at performance
• Advanced analytics at scale within SQL
• Fast interactive queries on large data
• Strong ad-hoc query support in optimizer
• Full analytic SQL compliance
• High query throughput for mixed workloads
Pivotal HDB
Pivotal HDB Advanced SQL Query Engine
• Pivotal Query Optimizer (future open source)
• 10+ years of advanced enterprise SQL analytics and MPP development
• Based on GPDB
Pivotal HDB Advantage - Benefits
• Hadoop-native advanced enterprise analytics: exceptional MPP performance, low latency, petabyte scalability, ACID reliability, fault tolerance
• Robust ANSI SQL compliance (SQL-92, -99, -2003, OLAP): higher degree of SQL compatibility, leverages existing SQL skills, OLAP
• Query optimization: maximize performance or cost and Hadoop cluster resources; offload EDW with confidence
• Flexible deployment (on premise, cloud, PaaS; HBase, Avro, Parquet, +): flexibility, accessibility, portability
• Tight integration with MADlib machine learning: advanced MPP analytics, data science at scale, directly on Hadoop data
Pivotal HDB Performance
[Chart: per-query TPC-DS comparison, with bars marked "Pivotal HDB Faster" or "Impala Faster"]
Pivotal HDB
• Faster on 46 of 62 TPC-DS queries completed*
• 4.55x mean avg.
• 12 hrs faster total
* Query Engine only supported 74 of 99 queries, 12 crashed mid-run
Pivotal GemFire: Scale-out, In-Memory Data Grid (NoSQL), Based on Apache Geode
Traditional Data Stores Can’t Handle Cloud Scale Low-Latency Apps
Traditional RDBMS
• Disk based, high latency
• Limited scale up and out via expensive hardware and planning
• Designed for monolithic applications
• Support complex SQL for transactions AND analytics
• Built for general business applications and reporting requirements
Distributed In-Memory DGs
• Memory based, low latency
• Elastically scale up or down on demand without downtime
• Built for cloud scale applications
• Built to optimize application-specific data access patterns
• Built to ensure low latency data access at any scale
GemFire
Pivotal GemFire
• Vertical scalability only
• Exponential cost growth
• Expensive and resource-intensive replication
• Active Data Management
• No data outages
• Active management of partitions as cluster changes size
• Elastic linear scale-out, on commodity hardware and cloud
• Reliable transaction and data consistency, w/ wide area/global replication
• Elastic scale with lower management costs
• Highly Available, Global Data Distribution
• Cost savings, no infrastructure lock-in
GemFire High Level Architecture
• Clients (Java, C#, C++) can embed a cache with disk overflow
• Locators provide both discovery and load balancing services
• Updates are sent to subscribers as objects change
• Data partitioning and replication are handled transparently to clients. Redundant storage assures continuous availability (memory or disk)
• Machines can be added dynamically to expand capacity
• Disk stores for data persistence and backup
• Synchronous read-through, write-through, or asynchronous write-behind to other data sources
[Diagram: clients and locators in front of cluster members M1, M2, M3, …, each holding data]
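As a rough illustration of "partitioning and replication handled transparently to clients", the toy region below hashes each key to a primary member plus one redundant copy. It is a sketch under stated assumptions, not the GemFire API: the class `PartitionedRegion`, its methods, and the hashing scheme are hypothetical, and a real grid would also restore redundancy after a failure, which this example does not model.

```python
import zlib


class PartitionedRegion:
    """Toy partitioned key/value region with redundant copies (illustrative only)."""

    def __init__(self, members, redundant_copies=1):
        self.members = {name: {} for name in members}  # member name -> local store
        self.redundant_copies = redundant_copies

    def _owners(self, key):
        """Primary plus redundant members for a key, chosen by hashing."""
        names = sorted(self.members)
        start = zlib.crc32(str(key).encode("utf-8")) % len(names)
        copies = min(self.redundant_copies + 1, len(names))
        return [names[(start + i) % len(names)] for i in range(copies)]

    def put(self, key, value):
        """Write the entry to every owner; the caller never names a member."""
        for name in self._owners(key):
            self.members[name][key] = value

    def get(self, key):
        """Read from whichever surviving member still holds the key."""
        for store in self.members.values():
            if key in store:
                return store[key]
        return None

    def fail(self, name):
        """Simulate losing one member and everything it held in memory."""
        del self.members[name]


region = PartitionedRegion(["M1", "M2", "M3"], redundant_copies=1)
for i in range(20):
    region.put(f"k{i}", i)

region.fail("M2")  # one member goes away
survivors = [region.get(f"k{i}") for i in range(20)]
```

With one redundant copy, every entry lives on two distinct members, so losing any single member leaves all 20 entries readable.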
Data Event Handling & Continuous Queries
[Diagram labels: Application, Pool, Region; Server, Cluster Node, 40404, Region; Update; afterUpdate(EntryEvent<K,V> event); Durable, Fault Tolerant Subscription Queue; Continuous query: Matching “where clause”]
• Rich API of cache events for application-specific handling
  – Example: synchronous pre-writes & changes for app processing; asynchronous write-behind for database archival
  – Asynchronous event queue for mission-critical delivery
• Continuous Query
  – Server-side “query match” event
  – Significantly reduces latency and overhead for detecting data conditions
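The continuous-query idea (the server evaluates a "where clause" on every update and pushes only the matches) can be sketched as follows. This is a minimal Python illustration, not the GemFire client API: `Region`, `register_cq`, the `EntryEvent` fields, and the use of a `deque` as the subscription queue are assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class EntryEvent:
    """What a subscriber receives when an entry matching its query changes."""
    key: Any
    old_value: Any
    new_value: Any


class Region:
    """Toy server-side region that evaluates registered queries on every put."""

    def __init__(self):
        self._data = {}
        self._queries = []  # list of (predicate, subscription queue)

    def register_cq(self, predicate: Callable[[Any], bool]) -> deque:
        """Register a 'where clause' and return the queue its matches land in."""
        queue = deque()
        self._queries.append((predicate, queue))
        return queue

    def put(self, key, value):
        event = EntryEvent(key, self._data.get(key), value)
        self._data[key] = value
        # The match is detected where the data lives, so clients do not poll.
        for predicate, queue in self._queries:
            if predicate(value):
                queue.append(event)


orders = Region()
big_orders = orders.register_cq(lambda order: order["amount"] > 1000)

orders.put("o1", {"amount": 250})   # no match, nothing is queued
orders.put("o2", {"amount": 5000})  # match
orders.put("o1", {"amount": 1200})  # o1 now matches after the update

matched_keys = [event.key for event in big_orders]
```

Only the two matching updates reach the subscriber's queue, which is where the latency and overhead savings claimed on the slide come from.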
Spring XD (Spring Cloud Data Services): Flexible, Scale-out Data Pipeline
Spring XD – Spring Cloud Data Flow
Unified, distributed, and extensible system for data ingestion, real-time analytics, batch processing, and data export.
• Data Ingestion and Pipeline Processing
  – Kafka, RabbitMQ, MQTT, JMS, HTTP, GPDB, HAWQ/HDB
  – Partition, Filter, Transform, Split, Aggregate
• Real Time Analytics and Complex Event Processing
  – Spark Streaming, RxJava, JPMML Scoring
  – Redis, GemFire, Cassandra, etc.
• Rapid Dashboarding
• Batch Workflow Orchestration + ETL
  – Map Reduce, HDFS, PIG, Hive, GPDB, HAWQ/HDB, Spark
  – RDBMS, FILE, FTP, LOG, Mongo, Splunk
Spring XD - Streams
• Sources: HTTP, Tail, File, Mail, Twitter, Gemfire, Syslog, TCP, UDP, JMS, RabbitMQ, MQTT, Trigger, Reactor TCP/UDP
• Processors: Filter, Transformer, Object-to-JSON, JSON-to-Tuple, Splitter, Aggregator, HTTP Client, Groovy Scripts, Java Code, JPMML Evaluator
• Sinks: File, HDFS, JDBC, TCP, Log, Mail, RabbitMQ, Gemfire, Splunk, MQTT, Dynamic Router, Counters
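A stream chains an input module through processing steps into an output module. The Python generators below sketch that composition in miniature; they are not Spring XD modules, and the function names (`source`, `filter_step`, `transform_step`, `sink`) and the sample messages are hypothetical.

```python
import json


def source(lines):
    """Emit raw messages one at a time, as an HTTP or file source might."""
    for line in lines:
        yield line


def filter_step(messages, predicate):
    """Pass along only the messages that satisfy the predicate."""
    return (m for m in messages if predicate(m))


def transform_step(messages, fn):
    """Apply a function to every message."""
    return (fn(m) for m in messages)


def sink(messages, store):
    """Drain the stream into a destination (a list stands in for HDFS, JDBC, ...)."""
    for m in messages:
        store.append(m)
    return store


raw = [
    '{"user": "ana", "temp": 21}',
    '{"user": "bob", "temp": 35}',
    '{"user": "eva", "temp": 40}',
]

# source | JSON parse | filter | enrich | sink
parsed = transform_step(source(raw), json.loads)
hot = filter_step(parsed, lambda m: m["temp"] > 30)
labeled = transform_step(hot, lambda m: {**m, "alert": True})
collected = sink(labeled, [])
```

Each step is lazy, so a message flows through the whole chain before the next one is read, much as a streaming pipeline would process it.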
Spring XD Benefits
Problems
• Batch and streaming are often handled by multiple platforms
• Ecosystem is fragmented
• Data sources and API(s) are constantly changing
• Not all data is Hadoop bound
Unified approach
- Stream processing and batch jobs
- Hadoop batch workflow orchestration
- Analytics
- Machine learning scoring
Runtime provides critical non-functional requirements
- Scalable, distributed, fault-tolerant
- Portable: on-prem DIY cluster, YARN, EC2 (WIP for PCF)
- Easy to use, extend, and integrate with other technologies
Proven
- Built on robust EAI and Batch Spring projects (7 years)
Eye on the big picture
- Support end-to-end scenarios
Modern Data Architectures: Lambda and Event-Driven Architecture
Data manifests as features in an app
“We’ve found that when a host selects a price that’s within 5% of their tip, they’re nearly 4 times more likely to get booked”
“The importance of accuracy and efficiency […], will continue to rise as we expand and improve products like uberPOOL and beyond.”
“Over 75% of what people watch come from our recommendations”
How are they accomplishing this?
• (Data) Microservices: loosely coupled services architecture, bounded by context
• Cloud-Native Platforms: enabling continuous delivery & automated operations
• Open Source Database Innovation: extreme scale & performance advantages, built for the cloud
• Machine Learning: use of predictive analytics to build smart apps
Today’s Enterprise Challenge
● Siloed and aging database systems
● Spaghetti data pipelines
● Expensive, proprietary data management systems
● Lack of structured platforms for continuous software delivery
● Monolithic application architectures
● Batch-oriented data integration
● Limited operationalization of analytics
● Proprietary systems
Modern Cloud-Native Data Architecture
[Diagram labels: Apps & Microservices; Data Programming Model; Cloud-Native Platform (Microservices Framework, Platform Runtime); Stream Data Platform; DBMS (IMDG, K/V Store, Relational DB); Big Data & Machine Learning (Hadoop, Spark, DW)]
A Real-World Example
[Diagram labels: App Development; Data analytics; Cloud-native App platform; Data Science & Model building; Data Micro-service; APP]
• Must support scale-out query processing
• Must deliver as an API
• Must embrace agile development, focus on outcomes
• Must support microservices, agile dev, and connect to big data analytics
Then and Now
• Then: assume reliable infrastructure. Now: assume fragile infrastructure.
• Then: release code every 3 months. Now: release code early and often.
• Then: works in my environment. Now: shared Dev & Ops responsibility.
• Then: tightly coupled. Now: loosely coupled.
Putting it All Together
[Diagram labels: DATA FEEDS; Spring XD (Ingest, Filter, Enrich, Sink); GemFire; HDFS Data Lake; HDB; GPDB; TRANSACTIONAL APPS; ANALYTIC APPS]
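The lambda architecture named in this section's title answers queries by combining a periodically recomputed batch view with an incremental speed view. The sketch below illustrates that merge in Python. The class `LambdaCounts` is hypothetical, and treating the append-only log as the data lake and the in-memory counter as the low-latency store is an assumption about how the diagram's components would map, not something the slide states.

```python
from collections import Counter


class LambdaCounts:
    """Toy lambda architecture: counts per key from a batch view plus a speed view."""

    def __init__(self):
        self.master_log = []         # append-only raw events (stands in for a data lake)
        self.batch_view = Counter()  # recomputed from the full log on each batch run
        self.speed_view = Counter()  # incremental counts since the last batch run

    def ingest(self, key):
        """Every event goes to both paths: the durable log and the real-time view."""
        self.master_log.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        """Recompute the batch view from scratch, then reset the speed view."""
        self.batch_view = Counter(self.master_log)
        self.speed_view.clear()

    def query(self, key):
        """Serve a complete answer by merging both views."""
        return self.batch_view[key] + self.speed_view[key]


counts = LambdaCounts()
for key in ["a", "b", "a"]:
    counts.ingest(key)
before_batch = counts.query("a")  # served entirely from the speed view

counts.run_batch()
counts.ingest("a")                # arrives after the batch run
after_batch = counts.query("a")   # batch view plus the one newer event
```

The batch path can be slow and exhaustive because the speed path covers whatever has arrived since the last recomputation.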
Let’s build something MEANINGFUL