1 © Copyright 2012 EMC Corporation. All rights reserved.

Greenplum Database Getting Started with Big Data Analytics

Ofir Manor Pre Sales Technical Architect, EMC Greenplum


Agenda

• Introduction to Greenplum

• Greenplum Database Architecture

• Flexible Database Configuration

• Beyond SQL – Flexible Analytics

• Flexible Deployment

• Other considerations

“Big Data Is Less About Size, And More About Freedom”
― TechCrunch

“Findings: ‘Big Data’ Is More Extreme Than Volume”
― Gartner

“Big Data! It’s Real, It’s Real-time, and It’s Already Changing Your World”
― IDC

“Total data: ‘bigger’ than big data”
― 451 Group

THE ERA OF BIG DATA IS HERE


Industries Are Broadly Embracing Big Data

Retail
• CRM – Customer Scoring
• Store Siting and Layout
• Fraud Detection / Prevention
• Supply Chain Optimization

Advertising & Public Relations
• Demand Signaling
• Ad Targeting
• Sentiment Analysis
• Customer Acquisition

Financial Services
• Algorithmic Trading
• Risk Analysis
• Fraud Detection
• Portfolio Analysis

Media & Telecommunications
• Network Optimization
• Customer Scoring
• Churn Prevention
• Fraud Prevention

Manufacturing
• Product Research
• Engineering Analytics
• Process & Quality Analysis
• Distribution Optimization

Energy
• Smart Grid
• Exploration

Government
• Market Governance
• Counter-Terrorism
• Econometrics
• Health Informatics

Healthcare & Life Sciences
• Pharmaco-Genomics
• Bio-Informatics
• Pharmaceutical Research
• Clinical Outcomes Research


The Power of Data Co-Processing


Extreme Performance for Analytics

• Optimized for BI and analytics

– Deep integration with statistical packages

– High performance parallel implementations

• Simple and automatic

– Just load and query like any database

– Tables are automatically distributed across nodes

• Extremely scalable

– MPP shared-nothing architecture

– All nodes can scan and process in parallel

– Linear scalability by adding nodes
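The "automatically distributed" and "scan and process in parallel" points can be pictured with a small Python sketch. It is only an illustration: the CRC32 hash, the four-segment count, and the function names are assumptions made for the example, not Greenplum's actual distribution logic, and the "segments" here run one after another rather than in parallel.

```python
import zlib
from collections import defaultdict

NUM_SEGMENTS = 4  # assumption: a small illustrative cluster


def segment_for(key: str, num_segments: int = NUM_SEGMENTS) -> int:
    """Pick a segment by hashing the distribution key (illustrative hash only)."""
    return zlib.crc32(key.encode("utf-8")) % num_segments


def distribute(rows, num_segments: int = NUM_SEGMENTS):
    """Spread (key, amount) rows across segments; each row lives on one segment."""
    segments = defaultdict(list)
    for key, amount in rows:
        segments[segment_for(key, num_segments)].append((key, amount))
    return segments


def parallel_sum(segments):
    """Each segment sums only its own rows; the master adds the partial results.

    The per-segment sums are computed serially in this sketch.
    """
    partials = [sum(amount for _, amount in rows) for rows in segments.values()]
    return sum(partials)


rows = [(f"customer-{i}", i) for i in range(1000)]
segments = distribute(rows)
total = parallel_sum(segments)
```

Because every row lands on exactly one segment, the combined partial sums equal the sum over the whole table.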

GREENPLUM DATABASE


A Mature Enterprise Platform

PRODUCT FEATURES

CLIENT ACCESS & TOOLS

Multi-Level Fault Tolerance (RAID, Mirroring, DR with Data Domain Boost)

Shared-Nothing MPP

Parallel Query Optimizer

Polymorphic Data Storage™

CLIENT ACCESS

ODBC, JDBC, OLEDB,

MapReduce, etc.

CORE MPP ARCHITECTURE

Parallel Dataflow Engine

gNet™ Software Interconnect

Scatter/Gather Streaming™ Data Loading

Online System Expansion

Workload Management

GREENPLUM DATABASE ADAPTIVE SERVICES

LOADING & EXT. ACCESS

Petabyte-Scale Loading

Trickle Micro-Batching

Anywhere Data Access

STORAGE & DATA ACCESS

Hybrid Storage & Execution (Row- & Column-Oriented)

In-Database Compression

Multi-Level Partitioning

Indexes – Btree, Bitmap, etc.

External Table Support

LANGUAGE SUPPORT

Comprehensive SQL

Native MapReduce

SQL 2003 OLAP Extensions

Programmable Analytics

Analytics Extensions (GeoSpatial, PL/R, PL/Java, PL/Python, PL/Perl)

3rd PARTY TOOLS

BI Tools, ETL Tools

Data Mining, etc.

ADMIN TOOLS

Greenplum Command Center

Greenplum Package Manager


Extremely Scalable MPP Shared-Nothing Architecture

[Figure: a SQL Client connects to the Master, which reaches four Segments over the High-Speed Interconnect]


Linear Scalability

[Figure: SQL Client, Master, and High-Speed Interconnect, now with eight Segments]

• Each node has its own CPU and I/O resources

• Add nodes to scale

• Rebalance happens in the background
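A rough Python sketch of why adding nodes helps: if rows are spread by hashing a key, doubling the number of segments shrinks the share each segment has to scan. The CRC32 hash, the key names, and the 4-to-8 expansion are assumptions chosen for illustration; this is not how Greenplum's expansion utility works internally.

```python
import zlib
from collections import Counter


def rows_per_segment(keys, num_segments):
    """Count how many keys land on each segment under hash distribution."""
    return Counter(zlib.crc32(k.encode("utf-8")) % num_segments for k in keys)


keys = [f"order-{i}" for i in range(10_000)]  # hypothetical distribution keys

before = rows_per_segment(keys, 4)  # original cluster
after = rows_per_segment(keys, 8)   # after adding four segments

# The busiest segment bounds how long a full parallel scan takes.
busiest_before = max(before.values())
busiest_after = max(after.values())
```

Comparing `busiest_before` with `busiest_after` shows the per-segment workload dropping once the data has been rebalanced over the larger cluster.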


High Availability

[Figure: two Masters (active and standby) above four Segments]

Master Server Data Protection
• Replicated transaction logs for server failure
• Optional RAID protection for drive failures
• Upon server failure
– Standby server activated
– Administrator alerted
– Orchestrated failover

Segment Server Data Protection
• Mirrored segments for server failures
• Optional RAID protection for drive failures
• Upon server failure
– Mirrored segments take over with no loss of service
– Fast online differential recovery
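The segment-mirroring idea can be sketched in Python as follows. Everything here is a hypothetical model: the `SegmentPair` class, the host names, and the `fail_host` function are invented for the example and are not Greenplum interfaces. The only point illustrated is that each primary has a mirror on a different server, so the loss of one server leaves every segment reachable.

```python
from dataclasses import dataclass


@dataclass
class SegmentPair:
    """A primary segment and its mirror on a different host (hypothetical model)."""
    content_id: int
    primary_host: str
    mirror_host: str
    active: str = "primary"

    def serving_host(self) -> str:
        """Host that currently answers queries for this segment."""
        return self.primary_host if self.active == "primary" else self.mirror_host


def fail_host(pairs, failed_host):
    """Promote the mirror of every segment whose primary ran on the failed host."""
    promoted = []
    for pair in pairs:
        if pair.active == "primary" and pair.primary_host == failed_host:
            pair.active = "mirror"
            promoted.append(pair.content_id)
    return promoted


# Host names are placeholders; mirrors are placed on a different host than the primary.
pairs = [
    SegmentPair(0, "host1", "host2"),
    SegmentPair(1, "host1", "host3"),
    SegmentPair(2, "host2", "host3"),
    SegmentPair(3, "host3", "host1"),
]
promoted = fail_host(pairs, "host1")
still_served = all(p.serving_host() != "host1" for p in pairs)
```

After `host1` fails, segments 0 and 1 are served by their mirrors and no segment depends on the failed server.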


SINGLE RACK COMPARISON

Most Powerful Data Loading Capabilities

Industry-leading performance of 10+ TB per hour per rack

Scatter-Gather Streaming™ provides true linear scaling

Support for both large-batch and continuous real-time loading strategies

Enable complex data transformations “in-flight”

Transparent interfaces to loading via support files, application, and services

Greenplum load rates scale linearly with the number of racks; others do not.

For example, two racks deliver more than 20 TB per hour.
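The linear-scaling claim is simple arithmetic, worked here in Python. The 10 TB per hour figure is the lower bound quoted on the slide; the 100 TB data set is a hypothetical example, and perfect linearity is assumed rather than measured.

```python
RATE_PER_RACK_TB_PER_HOUR = 10  # lower bound quoted on the slide ("10+ TB")


def load_rate(racks: int) -> int:
    """Aggregate load rate in TB per hour, assuming perfectly linear scaling."""
    return racks * RATE_PER_RACK_TB_PER_HOUR


def hours_to_load(data_tb: float, racks: int) -> float:
    """Time needed to load a data set of the given size at the aggregate rate."""
    return data_tb / load_rate(racks)


two_rack_rate = load_rate(2)              # 20 TB per hour
hours_for_100_tb = hours_to_load(100, 2)  # hypothetical 100 TB load: 5 hours
```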

[Chart: Greenplum, Oracle Exadata, Netezza, Teradata]


Polymorphic Table Storage™

• Enable Information Lifecycle Management (ILM)

• Storage types can be mixed within a table or database

– Four table types: heap, row-oriented append-only (AO), column-oriented, external

– Block compression: Gzip (levels 1-9), QuickLZ

• Provide the choice of processing model for any table or partition

[Figure: TABLE ‘CUSTOMER’ partitioned by month, Mar ‘11 to Nov ‘11; row-oriented storage for HOT DATA, column-oriented storage for COLD DATA]
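One way to picture the hot/cold split is a rule that picks a storage model per monthly partition. The Python sketch below is an assumption-laden illustration: the three-month threshold, the storage labels, and the function names are invented for the example, and the slide does not say where the boundary between hot and cold data falls.

```python
from datetime import date

HOT_MONTHS = 3  # assumption: partitions newer than this stay row-oriented


def months_between(older: date, newer: date) -> int:
    """Whole calendar months from the older date to the newer one."""
    return (newer.year - older.year) * 12 + (newer.month - older.month)


def storage_for(partition_month: date, today: date) -> str:
    """Return an illustrative storage label for one monthly partition."""
    age = months_between(partition_month, today)
    return "row-oriented" if age < HOT_MONTHS else "column-oriented, compressed"


today = date(2011, 11, 1)
partitions = [date(2011, m, 1) for m in range(3, 12)]  # Mar '11 .. Nov '11
plan = {p.isoformat()[:7]: storage_for(p, today) for p in partitions}
```

With these assumed settings, September to November stay row-oriented and March to August become column-oriented, all within the same table.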


In-Database Analytics

Bringing the power of parallelism to commonly-used modeling and analytics functions

In-database analytics

– SAS – HPA, Access, and Scoring Accelerator

– MADLib – An open-source library of advanced analytics functions

– Analytics extensions supported, including

▪ PostGIS - Geospatial support, PL/R - Statistical Computing, PL/Java, PL/Perl, etc.


SAS and Greenplum: A Strategic Partnership for High-Performance Computing

Access relational data-sets for agile analysis
– SAS/ACCESS provides fast, transparent and secure access to Greenplum data.

Leverage database scalability for rapid model deployment
– SAS Scoring Accelerator publishes models for execution in parallel across the Greenplum cluster.

Build complex models at massive scales
– The SAS High-Performance Analytics Appliance combines SAS In-Memory Analytics with Greenplum parallelism to produce record-breaking scalability and performance.

GREENPLUM PARTNERS


MADlib

Scalable in-database analytics

Data-parallel
– Mathematical Algorithms
– Statistical Algorithms
– Machine learning Algorithms
– Supports structured and unstructured data.

Delivered via open-source
– Accessibility
– Skill development
– Converge business, academic, and open-source communities


MADlib In-Database Analytical Functions

Descriptive Statistics
• Quantile
• Profile
• CountMin (Cormode-Muthukrishnan) Sketch-based Estimator
• FM (Flajolet-Martin) Sketch-based Estimator
• MFV (Most Frequent Values) Sketch-based Estimator
• Frequency
• Histogram
• Bar Chart
• Box Plot Chart

Modeling
• Correlation Matrix
• Association Rule Mining
• K-Means Clustering
• Naïve Bayes Classification
• Linear Regression
• Logistic Regression
• Support Vector Machines
• SVD Matrix Factorisation
• Decision Trees/CART
• Latent Dirichlet Allocation Topic Modeling
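To give a feel for one of the listed functions, here is a minimal Python sketch of the CountMin (Cormode-Muthukrishnan) idea. It is not MADlib's implementation or API: the class, its parameters, and the SHA-256-based hashing are assumptions made for the example. It also hints at why such estimators suit a data-parallel system: sketches built on separate slices of the data can be merged by adding their counters.

```python
import hashlib


class CountMinSketch:
    """Fixed-size frequency estimator; estimates never undercount."""

    def __init__(self, width: int = 256, depth: int = 4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _columns(self, item: str):
        # One hash per row, derived by salting the item with the row index.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row, col in self._columns(item):
            self.table[row][col] += count

    def estimate(self, item: str) -> int:
        # Collisions only inflate counters, so the minimum is the best guess.
        return min(self.table[row][col] for row, col in self._columns(item))

    def merge(self, other: "CountMinSketch") -> None:
        """Element-wise addition combines sketches built on separate data slices."""
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketches must have the same shape")
        for row in range(self.depth):
            for col in range(self.width):
                self.table[row][col] += other.table[row][col]


# Two hypothetical data slices are sketched independently, then merged.
slice_a = ["apple"] * 30 + ["pear"] * 5
slice_b = ["apple"] * 20 + ["plum"] * 7
sketch_a, sketch_b = CountMinSketch(), CountMinSketch()
for word in slice_a:
    sketch_a.add(word)
for word in slice_b:
    sketch_b.add(word)
sketch_a.merge(sketch_b)
apple_estimate = sketch_a.estimate("apple")
```

The merged estimate for "apple" is at least its true count of 50 and can never exceed the total number of items added.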


Greenplum Analytics Labs

Packaged solutions that produce business value and actionable results

Accelerate analytics capabilities on your data with your analysts

Leverage the expertise of Greenplum’s Data Scientists

Establish a strategic vision for analytics development


Greenplum Delivers Choice & Flexibility

Greenplum Data Computing Appliance

Choose Greenplum Database and/or Hadoop modules in ¼ rack increments

Scale up by adding your choice of additional modules

Minimal time to value

Greenplum Software Solutions

Greenplum Database, Hadoop, & Chorus on your x86 hardware

Flexibility for any workload or environment

Perpetual or subscription licenses


Seamless Infrastructure Integration

• EMC Data Domain: Efficient Backup & Restore
• EMC VMAX or VNX SAN Mirror: For Advanced Storage Management
• Isilon Scale Out Storage: For Big Data Staging
• EMC VMAX SRDF, EMC Data Domain Replication: For Disaster Recovery

GREENPLUM DCA


Simple To Manage

Greenplum Command Center

– Complete platform management and control

Greenplum Package Manager

– Automates install, uninstall, update, and query for analytics extensions

– Support package migration during upgrade, segment recovery, expansion, and standby initialization


Powerful Partner Ecosystem

Discovix


Thank you [email protected]

Downloads, documentation, whitepapers, etc.: http://www.greenplum.com

A copy of this presentation will be available on the event’s web site.

Next Greenplum workshop in Hungary: 04 July, 2012. Register now at EMC Hungary or Avnet Hungary.