1 © Copyright 2012 EMC Corporation. All rights reserved.
Greenplum Database Getting Started with Big Data Analytics
Ofir Manor Pre Sales Technical Architect, EMC Greenplum
Agenda
• Introduction to Greenplum
• Greenplum Database Architecture
• Flexible Database Configuration
• Beyond SQL – Flexible Analytics
• Flexible Deployment
• Other considerations
“Big Data Is Less About Size, And More About Freedom”
― TechCrunch
“Findings: ‘Big Data’ Is More Extreme Than Volume”
― Gartner
“Big Data! It’s Real, It’s Real-time, and It’s Already Changing Your World”
― IDC
“Total data: ‘bigger’ than big data”
― 451 Group
THE ERA OF BIG DATA IS HERE
Industries Are Broadly Embracing Big Data
Retail
•CRM – Customer Scoring
•Store Siting and Layout
•Fraud Detection / Prevention
•Supply Chain Optimization
Advertising & Public Relations
•Demand Signaling
•Ad Targeting
•Sentiment Analysis
•Customer Acquisition
Financial Services
•Algorithmic Trading
•Risk Analysis
•Fraud Detection
•Portfolio Analysis
Media & Telecommunications
•Network Optimization
•Customer Scoring
•Churn Prevention
•Fraud Prevention
Manufacturing
•Product Research
•Engineering Analytics
•Process & Quality Analysis
•Distribution Optimization
Energy
•Smart Grid
•Exploration
Government
•Market Governance
•Counter-Terrorism
•Econometrics
•Health Informatics
Healthcare & Life Sciences
•Pharmaco-Genomics
•Bio-Informatics
•Pharmaceutical Research
•Clinical Outcomes Research
Extreme Performance for Analytics
• Optimized for BI and analytics
– Deep integration with statistical packages
– High performance parallel implementations
• Simple and automatic
– Just load and query like any database
– Tables are automatically distributed across nodes
• Extremely scalable
– MPP shared-nothing architecture
– All nodes can scan and process in parallel
– Linear scalability by adding nodes
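A minimal sketch of what "tables are automatically distributed across nodes" and "all nodes can scan and process in parallel" can look like, assuming rows are placed by hashing a distribution key. The CRC32 hash, the four-segment cluster, and the function names are illustrative assumptions, not Greenplum's actual implementation.

```python
import zlib
from collections import defaultdict

NUM_SEGMENTS = 4  # hypothetical cluster size


def segment_for(key: str, num_segments: int = NUM_SEGMENTS) -> int:
    """Pick a segment from a stable hash of the distribution key (assumed scheme)."""
    return zlib.crc32(key.encode("utf-8")) % num_segments


def distribute(rows, num_segments: int = NUM_SEGMENTS):
    """Place each (key, amount) row on exactly one segment."""
    segments = defaultdict(list)
    for key, amount in rows:
        segments[segment_for(key, num_segments)].append((key, amount))
    return segments


def parallel_sum(segments):
    """Each segment sums its own rows; the master adds the partial results."""
    partials = [sum(amount for _, amount in rows) for rows in segments.values()]
    return sum(partials)


rows = [(f"customer-{i}", i) for i in range(1000)]
segments = distribute(rows)
total = parallel_sum(segments)
```

The point of the sketch is that no segment needs to see another segment's rows to answer the query: only the small partial results travel to the master.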
GREENPLUM DATABASE
A Mature Enterprise Platform
CLIENT ACCESS & TOOLS
• Client access
– ODBC, JDBC, OLEDB, MapReduce, etc.
• 3rd-party tools
– BI tools, ETL tools, data mining, etc.
• Admin tools
– Greenplum Command Center
– Greenplum Package Manager
PRODUCT FEATURES
• Loading & external access
– Petabyte-Scale Loading
– Trickle Micro-Batching
– Anywhere Data Access
• Storage & data access
– Hybrid Storage & Execution (Row- & Column-Oriented)
– In-Database Compression
– Multi-Level Partitioning
– Indexes (Btree, Bitmap, etc.)
– External Table Support
• Language support
– Comprehensive SQL
– Native MapReduce
– SQL 2003 OLAP Extensions
– Programmable Analytics
– Analytics Extensions (GeoSpatial, PL/R, PL/Java, PL/Python, PL/Perl)
GREENPLUM DATABASE ADAPTIVE SERVICES
• Multi-Level Fault Tolerance (RAID, Mirroring, DR with Data Domain Boost)
• Online System Expansion
• Workload Management
CORE MPP ARCHITECTURE
• Shared-Nothing MPP
• Parallel Query Optimizer
• Polymorphic Data Storage™
• Parallel Dataflow Engine
• gNet™ Software Interconnect
• Scatter/Gather Streaming™ Data Loading
Extremely Scalable MPP Shared-Nothing Architecture
[Diagram: the SQL Client connects to the Master, which reaches the Segments over the High-Speed Interconnect]
Linear Scalability
[Diagram: SQL Client, Master, and High-Speed Interconnect, with additional Segments added to the cluster]
• Each node has its own CPU and I/O resources
• Add nodes to scale
• Rebalance happens in the background
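To make "add nodes to scale" and "rebalance" concrete, here is a rough sketch of what redistribution implies when a hash-distributed table grows from four to eight segments. The modulo-of-CRC32 placement and all names are assumptions for illustration; the slides do not describe how Greenplum actually rebalances.

```python
import zlib
from collections import Counter


def segment_for(key: str, num_segments: int) -> int:
    """Assumed placement rule: stable hash of the key, modulo the segment count."""
    return zlib.crc32(key.encode("utf-8")) % num_segments


def layout(keys, num_segments: int) -> Counter:
    """Count how many rows each segment would hold."""
    return Counter(segment_for(k, num_segments) for k in keys)


def rows_moved(keys, old_segments: int, new_segments: int) -> int:
    """Count rows whose home segment changes after the expansion."""
    return sum(
        1
        for k in keys
        if segment_for(k, old_segments) != segment_for(k, new_segments)
    )


keys = [f"order-{i}" for i in range(10_000)]
before = layout(keys, 4)   # rows per segment on the original cluster
after = layout(keys, 8)    # rows per segment once nodes are added
moved = rows_moved(keys, 4, 8)

print("rows per segment before:", sorted(before.values()))
print("rows per segment after: ", sorted(after.values()))
print("rows that changed segment:", moved)
```

Every row still lives on exactly one segment after the expansion; the rebalance only changes where, which is why it can run in the background while each segment keeps serving its current rows.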
High Availability
[Diagram: primary and standby Master servers above four Segment servers]
Master Server Data Protection
• Replicated transaction logs for server failure
• Optional RAID protection for drive failures
• Upon server failure
– Standby server activated
– Administrator alerted
– Orchestrated failover
Segment Server Data Protection
• Mirrored segments for server failures
• Optional RAID protection for drive failures
• Upon server failure
– Mirrored segments take over with no loss of service
– Fast online differential recovery
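The segment-mirroring behaviour described above can be sketched as a lookup: each segment has a primary copy and a mirror copy on a different host, and a failed host's segments are served from their mirrors. The host names (`sdw1` to `sdw4`) and the "mirror on the next host" placement are hypothetical choices for this example, not taken from the slides.

```python
HOSTS = ["sdw1", "sdw2", "sdw3", "sdw4"]  # hypothetical segment servers

# Assumed layout: segment i has its primary on host i and its mirror on the next host.
LAYOUT = {
    i: {"primary": HOSTS[i], "mirror": HOSTS[(i + 1) % len(HOSTS)]}
    for i in range(len(HOSTS))
}


def acting_hosts(layout, failed_hosts):
    """Return which host serves each segment, promoting mirrors where needed."""
    serving = {}
    for segment_id, copies in layout.items():
        if copies["primary"] not in failed_hosts:
            serving[segment_id] = copies["primary"]
        elif copies["mirror"] not in failed_hosts:
            serving[segment_id] = copies["mirror"]
        else:
            # Both copies are gone: mirroring alone cannot keep this segment online.
            raise RuntimeError(f"segment {segment_id} has no surviving copy")
    return serving


healthy = acting_hosts(LAYOUT, failed_hosts=set())
after_failure = acting_hosts(LAYOUT, failed_hosts={"sdw2"})
```

With `sdw2` down, segment 1 is served from its mirror on `sdw3` and every segment remains available, which is the "no loss of service" case. Losing two adjacent hosts in this layout would raise the error instead.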
Most Powerful Data Loading Capabilities
• Industry-leading performance at 10+ TB per hour per rack
• Scatter-Gather Streaming™ provides true linear scaling
• Support for both large-batch and continuous real-time loading strategies
• Enables complex data transformations “in-flight”
• Transparent interfaces to loading via support files, applications, and services
• Greenplum load rates scale linearly with the number of racks; others do not. For example, two racks deliver more than 20 TB per hour.
[Chart: single-rack load-rate comparison of Greenplum, Oracle Exadata, Netezza, and Teradata]
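The linear-scaling claim is simple arithmetic, sketched here under the assumption that the slide's "10+ TB per hour per rack" is treated as a floor of exactly 10 TB per hour. The function names and the 100 TB data volume are illustrative only.

```python
RATE_PER_RACK_TB_PER_HOUR = 10.0  # slide figure, taken as a lower bound


def load_rate_tb_per_hour(racks: int) -> float:
    """Aggregate load rate if throughput scales linearly with rack count."""
    return racks * RATE_PER_RACK_TB_PER_HOUR


def hours_to_load(data_tb: float, racks: int) -> float:
    """Time to load a data set at the aggregate rate."""
    return data_tb / load_rate_tb_per_hour(racks)


two_rack_rate = load_rate_tb_per_hour(2)        # 20.0 TB per hour
hours_for_100_tb = hours_to_load(100.0, racks=2)  # 5.0 hours
```

Under this model, doubling the racks halves the load window; a system that does not scale linearly would see a smaller improvement.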
Polymorphic Table Storage™
• Enable Information Lifecycle Management (ILM)
• Storage types can be mixed within a table or database
– Four table types: heap, row-oriented AO, column-oriented, external
– Block compression: Gzip (levels 1-9), QuickLZ
• Provide the choice of processing model for any table or partition
[Diagram: TABLE ‘CUSTOMER’ partitioned by month from Mar ‘11 to Nov ‘11, with row-oriented storage for HOT DATA and column-oriented storage for COLD DATA]
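As a sketch of mixing storage types within one table, the Python below builds a `CREATE TABLE` statement for a monthly-partitioned `customer` table in which older partitions are column-oriented and compressed while recent ones stay row-oriented. The column list, partition names, the `hot_months` cut-off, and compression level 5 are assumptions made for this example; the storage options follow the Greenplum `WITH (appendonly=..., orientation=..., compresstype=...)` form, which should be checked against the documentation for the installed version.

```python
from datetime import date

# Storage clause for cold partitions: column-oriented, append-only, zlib-compressed.
COLD_STORAGE = (
    "WITH (appendonly=true, orientation=column, compresstype=zlib, compresslevel=5)"
)


def month_starts(first: date, count: int):
    """Return count + 1 consecutive first-of-month dates, beginning at `first`."""
    starts, year, month = [], first.year, first.month
    for _ in range(count + 1):
        starts.append(date(year, month, 1))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return starts


def customer_ddl(first=date(2011, 3, 1), months=9, hot_months=3) -> str:
    """Build DDL where the newest `hot_months` partitions keep default row storage."""
    bounds = month_starts(first, months)
    parts = []
    for i in range(months):
        start, end = bounds[i], bounds[i + 1]
        clause = (
            f"  PARTITION m{start:%Y_%m} "
            f"START (date '{start}') INCLUSIVE END (date '{end}') EXCLUSIVE"
        )
        if i < months - hot_months:  # older data: switch to columnar storage
            clause += "\n    " + COLD_STORAGE
        parts.append(clause)
    return (
        "CREATE TABLE customer (id int, name text, created date)\n"
        "DISTRIBUTED BY (id)\n"
        "PARTITION BY RANGE (created)\n(\n" + ",\n".join(parts) + "\n);"
    )


ddl = customer_ddl()
print(ddl)
```

Generating the statement keeps the hot/cold boundary a single parameter, so the same script can be rerun as months age out of the hot set.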
In-Database Analytics
Bringing the power of parallelism to commonly-used modeling and analytics functions
In-database analytics
– SAS – HPA, Access, and Scoring Accelerator
– MADlib – An open-source library of advanced analytics functions
– Analytics extensions supported, including
▪ PostGIS - Geospatial support, PL/R - Statistical Computing, PL/Java, PL/Perl, etc.
SAS and Greenplum: A Strategic Partnership for High-Performance Computing
• Access relational data-sets for agile analysis
– SAS/ACCESS provides fast, transparent and secure access to Greenplum data.
• Leverage database scalability for rapid model deployment
– SAS Scoring Accelerator publishes models for execution in parallel across the Greenplum cluster.
• Build complex models at massive scales
– The SAS High-Performance Analytics Appliance combines SAS In-Memory Analytics with Greenplum parallelism to produce record-breaking scalability and performance.
GREENPLUM PARTNERS
MADlib
Scalable in-database analytics
• Data-parallel
– Mathematical algorithms
– Statistical algorithms
– Machine learning algorithms
– Supports structured and unstructured data
• Delivered via open-source
– Accessibility
– Skill development
– Converge business, academic, and open-source communities
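One way to read "data-parallel" for a statistical algorithm: each segment reduces its own rows to a few sums, and only those sums are combined to fit the model. The sketch below does this for simple linear regression in plain Python. It illustrates the pattern only and is not MADlib's code; the function names and the three-segment split are assumptions.

```python
def partial_stats(rows):
    """Reduce one segment's (x, y) rows to the sums needed for a least-squares fit."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in rows:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)


def merge(a, b):
    """Combine two partial results; order does not matter."""
    return tuple(p + q for p, q in zip(a, b))


def fit(stats):
    """Solve for intercept and slope from the merged sums."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Rows of y = 2x + 1, split across three hypothetical segments.
data = [(float(x), 2.0 * x + 1.0) for x in range(12)]
segments = [data[0:4], data[4:8], data[8:12]]

merged = (0.0, 0.0, 0.0, 0.0, 0.0)
for segment_rows in segments:
    merged = merge(merged, partial_stats(segment_rows))

intercept, slope = fit(merged)
```

Because `merge` is just element-wise addition, the per-segment work can run in parallel and the final fit costs the same regardless of how many rows each segment holds.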
MADlib In-Database Analytical Functions
Descriptive Statistics
• Quantile
• Profile
• CountMin (Cormode-Muthukrishnan) Sketch-based Estimator
• FM (Flajolet-Martin) Sketch-based Estimator
• MFV (Most Frequent Values) Sketch-based Estimator
• Frequency
• Histogram
• Bar Chart
• Box Plot Chart
Modeling
• Correlation Matrix
• Association Rule Mining
• K-Means Clustering
• Naïve Bayes Classification
• Linear Regression
• Logistic Regression
• Support Vector Machines
• SVD Matrix Factorisation
• Decision Trees/CART
• Latent Dirichlet Allocation Topic Modeling
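For readers unfamiliar with sketch-based estimators, here is a small Count-Min sketch in Python. It is a generic textbook-style version, not MADlib's implementation; the width, depth, and SHA-256-based hashing are arbitrary choices for the example. The `merge` method shows why such sketches suit a parallel database: sketches built independently on each segment can be added together.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter; estimates never fall below the true count."""

    def __init__(self, width: int = 256, depth: int = 4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item: str):
        """Yield one (row, column) cell per hash row for the item."""
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item: str) -> int:
        """Take the minimum across rows to limit the effect of collisions."""
        return min(self.table[row][col] for row, col in self._cells(item))

    def merge(self, other: "CountMinSketch") -> "CountMinSketch":
        """Add a sketch of the same shape, e.g. one built on another segment."""
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketches must share width and depth")
        for row in range(self.depth):
            for col in range(self.width):
                self.table[row][col] += other.table[row][col]
        return self


segment_a = CountMinSketch()
segment_b = CountMinSketch()
for word in ["apple", "apple", "apple", "pear"]:
    segment_a.add(word)
for word in ["apple", "apple"]:
    segment_b.add(word)

combined = segment_a.merge(segment_b)
apple_estimate = combined.estimate("apple")  # at least 5, the true count
```

The estimate can overshoot when different items collide in every row, but it cannot undershoot, and the memory used is fixed by `width` and `depth` rather than by the number of distinct values.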
Greenplum Analytics Labs
Packaged solutions that produce business value and actionable results
Accelerate analytics capabilities on your data with your analysts
Leverage the expertise of Greenplum’s Data Scientists
Establish a strategic vision for analytics development
Greenplum Delivers Choice & Flexibility
Greenplum Data Computing Appliance
Choose Greenplum Database and/or Hadoop modules in ¼ rack increments
Scale up by adding your choice of additional modules
Minimal time to value
Greenplum Software Solutions
Greenplum Database, Hadoop, & Chorus on your x86 hardware
Flexibility for any workload or environment
Perpetual or subscription licenses
Seamless Infrastructure Integration
• EMC Data Domain: efficient backup & restore
• EMC VMAX or VNX: SAN mirror for advanced storage management
• Isilon: scale-out storage for big data staging
• EMC VMAX SRDF and EMC Data Domain: replication for disaster recovery
GREENPLUM DCA
Simple To Manage
Greenplum Command Center
– Complete platform management and control
Greenplum Package Manager
– Automates install, uninstall, update, and query for analytics extensions
– Supports package migration during upgrade, segment recovery, expansion, and standby initialization
Innovative Companies Using Greenplum
Thank you [email protected]
Downloads, documentation, whitepapers, etc.: http://www.greenplum.com
A copy of this presentation will be available on the event’s web site.
Next Greenplum workshop in Hungary: 04 July, 2012
Register now at EMC Hungary or Avnet Hungary.