Transcript
Page 1: Analiza danych przy użyciu IBM Netezza Analytics

Grzegorz Puchawski

Data analysis within IBM Netezza Analytics

June 14, 2011, Warsaw, Sheraton Warsaw Hotel

Page 2: Analiza danych przy użyciu IBM Netezza Analytics

In a nutshell, what is IBM Netezza Analytics?

Page 3: Analiza danych przy użyciu IBM Netezza Analytics

Big Data Meets Big Math

Analytics Without Constraints

Page 4: Analiza danych przy użyciu IBM Netezza Analytics

Massive Data and Massive Computation

• Data Intensity: Depth of Data, Width of Data

• Computational Intensity: Computational Complexity, Model Complexity

Page 5: Analiza danych przy użyciu IBM Netezza Analytics

In-Database Analytics

• nzAnalytics: Data Prep, Data Mining, Predictive Analytics
• Open Source Analytics: R Analytics
• Spatial
• Custom: Customer / Partner Analytics

Parallel Analytic Engines

• nzMatrix
• nzEngine for R
• nzEngine for Hadoop

Software Development Kit

• nzAdaptors for C, C++, Java, Python, Fortran
• nzPlug-in for Eclipse
• nzPackage for R GUI

Streaming Accelerator

IBM Netezza AMPP™ Platform

Page 6: Analiza danych przy użyciu IBM Netezza Analytics

Who is the target audience for IBM Netezza Analytics?

Page 7: Analiza danych przy użyciu IBM Netezza Analytics

Who is the target audience for IBM Netezza Analytics?

• Line of Business Owner

– Areas of Interest – Gaining / sustaining competitive advantage, discovering new opportunities to increase revenue or decrease costs, ability to use all collected data

– Benefits – Fast results, add significant business value / big bets, leverage all the data, performance at scale

• Business Intelligence

– Areas of Interest – Analysis beyond SQL, analytics dashboards and reports

– Benefits – Rich set of analytics beyond SQL

• Data Miners

– Areas of Interest – Marketing, life sciences, fraud, network analysis

– Benefits – Ability to explore more data, quick to failure, identify new opportunities, new package of analytic tools, ability to process large data

Page 8: Analiza danych przy użyciu IBM Netezza Analytics

Who is the target audience for IBM Netezza Analytics?

• Modelers

– Areas of Interest – Logistics, yield, forecasting, risk

– Benefits – Simplification of analytic processes, ability to use new and innovative models, quick to failure, model at scale using parallelized analytics, score at scale

• Quants / Statisticians

– Areas of Interest – Risk, forecasting, descriptive statistics, correlation of factors

– Benefits – Simplification of analytic processes, quick to failure, in-database analytics

• Programmers, Developers

– Areas of Interest – Low level programming tools, multi-language environment, User Defined Functions (UDFs), User Defined Analytic Process (AEs), Eclipse

– Benefits – Power and simplification of in-database analytics, flexibility of porting analytics/application

Page 9: Analiza danych przy użyciu IBM Netezza Analytics

How is the IBM Netezza Analytics platform used?

Page 10: Analiza danych przy użyciu IBM Netezza Analytics

High Performance on Massive Data

1 Exploratory Data Analysis
• Data Exploration
• Data Cleansing
• Data Transformation

2 Embed Algorithms
• Embarrassingly Parallel Algorithms
• Heroic Computations
• Model Parallelism

3 Build Model
• Descriptive Modeling
• Predictive Modeling
• Optimization Model

4 Deploy Model
• Scoring
• Forecasting
• Decision Management

Page 11: Analiza danych przy użyciu IBM Netezza Analytics

Embed Algorithms

• User Interface: Eclipse, R GUI/CLI
• Development Env.: UDF, UDAP, Stored Procedures, Shared Libraries, nzAdaptors, nzMatrix, R

Exploratory Data Analysis

• User Interface: SQL, R GUI/CLI
• Analytics: nzAnalytics, R Analytics, Customer Analytics, Partner Analytics

Build Model

• User Interface: SQL, R GUI/CLI, Eclipse
• Analytics: nzAnalytics, R Analytics, Customer Analytics, Partner Analytics

Deploy / Score Model

• Analytics: nzAnalytics, R Analytics, Customer Analytics, Partner Analytics
• Deploy/Scoring: nzAdaptors, UDF, UDAP, Shared Library, Stored Procedures, nzPackage for R

Page 12: Analiza danych przy użyciu IBM Netezza Analytics

Embedding Algorithms

• What is it?

– The ability to run programs directly on the S-Blade

• What is it used for?

– Bringing complex computation to the Netezza data stream

• What technology does it use?

– User Interface – Eclipse, R GUI/CLI

– Development Environment - UDFs, User Defined Analytic Process, Stored Procedures, Shared Libraries, nzAdaptors, nzMatrix, R Packages (for implementing algorithms run from R GUI)

• What are the benefits?

– Ability to process data as it streams, directly on the S-Blade

– Ability to harness total compute power of a TwinFin for parallel processing

Page 13: Analiza danych przy użyciu IBM Netezza Analytics

Exploratory Data Analysis

• What is it?

– The exercise of looking at data for the purpose of coming up with hypotheses

• What is it used for?

– Exploratory data analysis – Data profiling/ Descriptive Statistics, General Diagnostic Measures, Statistics, Sampling, Histograms

– Data cleansing – Feature selection

– Data transformation – Data Prep / Transformations

• What technology does it use?

– User interface – SQL, R GUI/CLI, others

– Analytics – In-database Analytics, R Analytics, Customer/Partner Analytics

• What are the benefits?

– Discovery on more data, faster

Page 14: Analiza danych przy użyciu IBM Netezza Analytics

Build Model

• What is it?

– Choosing which method will give the best results

– Finding the best parameters to give the best predictions

• What is it used for?

– Predictive Analytics – Regression, Classification, Bayesian Networks, Model Testing, Sample Size

– Data Mining – Association Rules Mining, Clustering, Feature Selection

• What technology does it use?

– User interface – SQL, R GUI/CLI, Eclipse, ...

– Analytics – In-database Analytics, R Analytics, Customer/Partner Analytics

– Development tools – Language Adapters, UDFs, UDAP, Stored Procedures

• What are the benefits?

– Moving the computation processing to the data

– Parallel computational processing on all of the data

Page 15: Analiza danych przy użyciu IBM Netezza Analytics

Deploying / Scoring Model

• What is it?

– Parallelized application of a model using parameters from the build step

• What is it used for?

– Predictive Analytics – Regression, Classification, Bayesian Networks, Model Testing

– Data Mining – Association Rules Mining, Clustering, Feature Extraction

• What technology does it use?

– Analytics – In-database Analytics, R Analytics, Customer/Partner Analytics

– Deploying/Scoring – Language Adapters, UDFs, User Defined Analytic Process, Shared Libraries, Stored Procedures, R package

– Development tools – Language Adapters, UDFs, AE, Stored Procedures

• What are the benefits?

– Score and experiment in parallel

– Faster model scoring and therefore time to insight/value
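The idea of applying a model in parallel with parameters from the build step can be sketched in plain Python. This is only an illustration of the pattern, not Netezza code: the linear model, the parameter values, the two data slices, and the function names are all hypothetical, and Python threads here stand in for the per-slice parallelism of the appliance.

```python
from concurrent.futures import ThreadPoolExecutor


def score_slice(rows, intercept, weights):
    """Apply a linear model to every row of one data slice."""
    return [intercept + sum(w * x for w, x in zip(weights, row)) for row in rows]


def score_in_parallel(slices, intercept, weights, workers=4):
    """Score each slice independently, then concatenate results in slice order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda rows: score_slice(rows, intercept, weights), slices)
        return [score for part in parts for score in part]


# Hypothetical parameters from a build step and two hypothetical data slices.
INTERCEPT, WEIGHTS = 1.0, [2.0, -1.0]
SLICES = [[(1.0, 1.0), (2.0, 0.0)], [(0.0, 3.0)]]
SCORES = score_in_parallel(SLICES, INTERCEPT, WEIGHTS)
print(SCORES)
```

Because each slice is scored without reference to any other slice, the work splits cleanly across workers; only the model parameters need to be shared.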

Page 16: Analiza danych przy użyciu IBM Netezza Analytics

What is in IBM Netezza Analytics?

Page 17: Analiza danych przy użyciu IBM Netezza Analytics

In-Database Analytics

• nzAnalytics: Data Prep, Data Mining, Predictive Analytics
• Open Source Analytics: R Analytics
• Spatial
• Custom: Customer / Partner Analytics

Parallel Analytic Engines

• nzMatrix
• nzEngine for R
• nzEngine for Hadoop

Software Development Kit

• nzAdaptors for C, C++, Java, Python, Fortran
• nzPlug-in for Eclipse
• nzPackage for R GUI

Streaming Accelerator

IBM Netezza AMPP™ Platform

Page 18: Analiza danych przy użyciu IBM Netezza Analytics

Streaming Accelerator

• What is it?

– Our unique differentiator that combines our historical strength in fast data stream processing with powerful in-database analytics processing and new inter-node analytics processing capabilities

• What is it used for?

– Parallelizing data and analytics processing

• What technology does it use?

– FPGA

– UDFs, User Defined Analytic Process

– Message Passing Interface (MPI) for distributed processing

• What are the benefits?

– Accelerates data processing for analytics

– Accelerates parallel matrix operations on big data

– Simplifies parallelization

Streaming Accelerator

Netezza AMPP™ Platform

Page 19: Analiza danych przy użyciu IBM Netezza Analytics

IBM Netezza Matrix Engine

• What is it?

– Parallelized linear algebra package

• What is it used for?

– Building block for higher order parallelized analytics

• What technology does it use?

– Scalable Linear Algebra Package (ScaLAPACK)

– Message Passing Interface (MPI) for distributed processing

• What are the benefits?

– Simplifies analytic algorithm and model development

– Accelerates parallel matrix operations on big data

Parallel Analytic Engines: nzMatrix

Page 20: Analiza danych przy użyciu IBM Netezza Analytics

• Supports the following parallel matrix operations

– Basic Linear Algebra Subprograms (e.g., Matrix Multiplication, Matrix Dot Function)

– Solving a System of Linear Equations

– Solving Linear Least Squares Problems

– Eigenvalues and Eigenvectors

– Singular Value Decomposition (SVD)

– Matrix Factorization

– Matrix Inversion

– Matrix Element Scalar Functions

– Matrix Reduction Functions (e.g. min, max, sum of squares, sum)

– Matrix Inquiry Functions  (e.g. number of rows and columns)

– Matrix Reshaping Functions

• Call Interface

– Accessible from R, Python, Java, etc. via ODBC and Stored Procedures

Parallel Analytic Engines: nzMatrix (IBM Netezza Matrix Engine)

Page 21: Analiza danych przy użyciu IBM Netezza Analytics

IBM Netezza Engine for Hadoop

• What is it?

– Hadoop-compatible implementation of the MapReduce paradigm

• What is it used for?

– Clickstream & social data analysis

– ETL/ELT and analytics processing of key/value pairs

• What technology does it use?

– Java User Defined Analytic Process

• What are the benefits?

– Enables effective parallel processing of data from Netezza database tables

– Bringing Hadoop to database with minimal refactoring of existing Hadoop code

– Only database offering Hadoop interface (all others are home-grown)

Parallel Analytic Engines: Hadoop

Page 22: Analiza danych przy użyciu IBM Netezza Analytics

Hadoop by Apache vs Hadoop by Netezza

[Diagram: Slices 1 to 4 feed Mappers 1 to 4; a REDISTRIBUTION step passes mapper output to the Reducers.]

• Hadoop by Apache: input read from HDFS, mappers and reducers run on cluster nodes, output written to HDFS

• Hadoop by Netezza: input table (dataslices), mappers and reducers run on SPUs, output table (dataslices)

Page 23: Analiza danych przy użyciu IBM Netezza Analytics

Netezza Engine for Hadoop: Example

• Example: Clickstream analysis

– Data:

• Table containing data about users and visited pages

• User groups’ definitions

– Task:

• For each group, find all pages that have been visited by all members of this group

Parallel Analytic Engines: Hadoop

Page 24: Analiza danych przy użyciu IBM Netezza Analytics

Netezza Engine for Hadoop: Example

• Sample data: Clickstream analysis

Visited pages:

USER  URL
A     ibm.com
A     netezza.com
A     sheraton.pl
B     ibm.com
D     netezza.com
D     apache.org

User groups:

GROUP   USER
FIRST   A
FIRST   B
SECOND  A
SECOND  D

Result:

GROUP   URL
FIRST   ibm.com
SECOND  netezza.com
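The clickstream task can be sketched as a map step, a redistribution by key, and a reduce step. The sketch below is plain Python that mimics the MapReduce flow on the sample rows; it does not use the Netezza Engine for Hadoop API, and the function names are hypothetical.

```python
from collections import defaultdict

# Sample rows copied from the slide.
VISITS = [("A", "ibm.com"), ("A", "netezza.com"), ("A", "sheraton.pl"),
          ("B", "ibm.com"), ("D", "netezza.com"), ("D", "apache.org")]
GROUPS = [("FIRST", "A"), ("FIRST", "B"), ("SECOND", "A"), ("SECOND", "D")]


def map_visit(user, url, groups_of_user):
    """Map step: emit ((group, url), user) for every group the user belongs to."""
    for group in groups_of_user.get(user, ()):
        yield (group, url), user


def reduce_group_url(key, users, group_sizes):
    """Reduce step: keep the URL only if every member of the group visited it."""
    group, url = key
    if len(set(users)) == group_sizes[group]:
        yield group, url


def pages_visited_by_all(visits, groups):
    groups_of_user = defaultdict(set)
    members = defaultdict(set)
    for group, user in groups:
        groups_of_user[user].add(group)
        members[group].add(user)
    group_sizes = {g: len(m) for g, m in members.items()}

    # Redistribution: collect mapper output by (group, url) key.
    shuffled = defaultdict(list)
    for user, url in visits:
        for key, value in map_visit(user, url, groups_of_user):
            shuffled[key].append(value)

    result = []
    for key in sorted(shuffled):
        result.extend(reduce_group_url(key, shuffled[key], group_sizes))
    return result


RESULT = pages_visited_by_all(VISITS, GROUPS)
print(RESULT)  # [('FIRST', 'ibm.com'), ('SECOND', 'netezza.com')]
```

Keying the mapper output by (group, URL) means each reducer only has to count distinct users and compare that count with the group size.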

Parallel Analytic Engines: Hadoop

Page 25: Analiza danych przy użyciu IBM Netezza Analytics

IBM Netezza Engine for R

• What is it?

– Native R running pushed down onto the S-Blade for parallel analytics processing

• What is it used for?

– Exploratory data analysis, building models, scoring models, etc

• What technology does it use?

– Open Source R

– User Defined Analytic Process, Data Stream Processing

• What are the benefits?

– Accelerates and scales R to run on big data

– Leverage open-source CRAN repository of algorithms

• Supports the following parallel R operations

– R interpreter running in parallel

– R CRAN Analytics applied in parallel

• Call Interface

– Invoked via SQL (à la User Defined Analytic Process) or from R

Parallel Analytic Engines: R

Page 26: Analiza danych przy użyciu IBM Netezza Analytics

In-database Analytics

• What is it?

– Parallelized in-database analytics for data prep, data mining, prediction, and geospatial

• What is it used for?

– Building and deploying/scoring models

• What technology does it use?

– UDFs, Stored Procedures, User Defined Analytic Process, nzMatrix

• What are the benefits?

– Starter kit of parallelized analytics designed for a parallel environment and for large-scale data

In-Database Analytics: nzAnalytics

Page 27: Analiza danych przy użyciu IBM Netezza Analytics

In-database Analytics

Data Prep

Data Profiling / Descriptive Statistics

• Probability Density and Inverse Functions
– Normal
– Fisher
– Exponential
– Uniform
– Weibull
– Wilcoxon
– Mann-Whitney
– Student's t
– Chi-Square

General Diagnostic Measures

• Error Calculation
– Classification Error
– Mean Absolute Error
– Mean Squared Error
– Relative Absolute Error
– Relative Squared Error

Sampling

• Uniform Random Sampling
– Uniform Random Sampling Count
– Uniform Random Sampling Fraction

Statistics

• Histogram and Frequency Table
– Histogram
– Bivariate Frequency Table
– Univariate Frequency Table

• Quantiles
– Quantiles
– Median
– Outliers
– Quartile

• Parametric Statistics
– Chi-Square
– Student's t

• Non-Parametric Statistics
– Spearman’s Rank Correlation
– Mann-Whitney-Wilcoxon
– Wilcoxon

• Moments
– Kurtosis
– Skewness

Data Prep / Transformations

• Binning and Discretization
– Entropy Minimization
– Equal Width
– Equal Frequency

• Standardization and Normalization
– Standardization and Normalization
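The error measures listed under General Diagnostic Measures can be written out in a few lines. The sketch below uses the common textbook definitions, which is an assumption: the exact formulas used by nzAnalytics are not given in the slides. The function names and sample values are hypothetical.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Average squared difference between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def relative_absolute_error(actual, predicted):
    """Total absolute error relative to always predicting the mean of actual."""
    mean = sum(actual) / len(actual)
    baseline = sum(abs(a - mean) for a in actual)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / baseline


def relative_squared_error(actual, predicted):
    """Total squared error relative to always predicting the mean of actual."""
    mean = sum(actual) / len(actual)
    baseline = sum((a - mean) ** 2 for a in actual)
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / baseline


def classification_error(actual, predicted):
    """Fraction of records whose predicted class differs from the actual class."""
    return sum(a != p for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical sample values.
ACTUAL = [1.0, 2.0, 3.0, 4.0]
PREDICTED = [1.0, 2.0, 4.0, 5.0]
print(mean_absolute_error(ACTUAL, PREDICTED))
print(relative_squared_error(ACTUAL, PREDICTED))
print(classification_error(["a", "b", "a", "b"], ["a", "b", "b", "b"]))
```

The two relative measures compare the model with a baseline that always predicts the mean, so values below 1 indicate the model beats that baseline.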

In-Database Analytics: nzAnalytics

Page 28: Analiza danych przy użyciu IBM Netezza Analytics

Data Mining

Association Rules Mining

• Association
– FP-Growth

Clustering

• K-Means

• Hierarchical Clustering
– Divisive Clustering
– Agglomerative Clustering

Feature Extraction

• Dimension Reduction
– Principal Components Analysis

Predictive Analytics

Model Testing

• Error Calculation
– Cross Validation
– Percentage Split
– Train / Test

Regression

• Linear Regression
– Generalized Linear Models

Sample Size

• One-Way ANOVA
– Complete Randomized Design
– Randomized Block Design

Classification

• Decision Trees
– Entropy Decision Tree
– Gini Index Decision Tree
– Regression Tree

• Neighborhood Methods
– K Nearest Neighbors

Bayesian Methods

• Classifier
– Naïve Bayes

• Graphical Model
– Bayesian Networks
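Two of the model testing schemes named above, cross validation and percentage split, differ only in how row indices are divided into training and test sets. The following is a minimal sketch under that reading; the function names are hypothetical and the way nzAnalytics actually partitions rows is not described in the slides.

```python
def k_fold_indices(n_rows, k):
    """Yield (train, test) index lists for k folds built by striding over the rows."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, test


def percentage_split(n_rows, train_fraction):
    """Return (train, test) index lists: the first fraction trains, the rest tests."""
    cut = round(n_rows * train_fraction)
    return list(range(cut)), list(range(cut, n_rows))


FOLDS = list(k_fold_indices(6, 3))
SPLIT = percentage_split(8, 0.75)
print(FOLDS[0])
print(SPLIT)
```

In cross validation every row is used for testing exactly once, while a percentage split tests on a single held-out portion.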

In-Database Analytics: nzAnalytics

Page 29: Analiza danych przy użyciu IBM Netezza Analytics

What are these data mining algorithms used for?

Clustering

• Finding naturally occurring groups
– Market segmentation
– Find disease subgroups
– Distinguish normal from non-normal behavior

Association Rules Mining

• Find co-occurring items in a market basket
– Suggest product combinations
– Design better item placement on shelves

Feature Extraction

• Identify most influential attributes for a target attribute
– Factors associated with high costs, responding to an offer, etc.

Page 30: Analiza danych przy użyciu IBM Netezza Analytics

What are these data mining algorithms used for?

Classification

• Predict customers most likely to:
– Respond to a campaign or offer
– Incur the highest costs
• Target your best customers
• Develop customer profiles

Regression

• Predict a numeric value
– Predict a purchase amount or cost
– Predict the value of a home

Page 31: Analiza danych przy użyciu IBM Netezza Analytics

Association Rules Mining Example

Find co-occurring items in a market basket

Regular database

• # transactions = 71M

• # items = 250k

• Implementation in SQL

• Offline process

• Computation time: about 5 hours

IBM Netezza Analytics

Support          Time   Itemsets
1% (708 208)     1m     87
0.1% (70 828)    16m    4000
0.01% (7 082)    41m    5 583 391
0.001% (708)     51m    346 749 521

• In-database Analytics using FPGrowth algorithm

• Ability to run on-demand analysis
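The relationship between the support threshold and the number of itemsets found can be illustrated on a toy basket set. The sketch below counts itemsets by brute-force enumeration, which is deliberately not the FP-Growth algorithm named on the slide; the baskets and the function name are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: count} for itemsets present in at least min_support of baskets."""
    min_count = min_support * len(transactions)
    counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))
        for size in range(1, max_size + 1):
            counts.update(combinations(items, size))
    return {itemset: n for itemset, n in counts.items() if n >= min_count}


# Hypothetical market baskets.
BASKETS = [["bread", "milk"],
           ["bread", "butter", "milk"],
           ["butter", "milk"],
           ["bread", "butter"]]

HIGH = frequent_itemsets(BASKETS, 0.75)
LOW = frequent_itemsets(BASKETS, 0.5)
print(len(HIGH), len(LOW))  # 3 6
```

Lowering the threshold from 75% to 50% doubles the number of itemsets even on four baskets, which is the same effect the table shows at much larger scale.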

Page 32: Analiza danych przy użyciu IBM Netezza Analytics

IBM Netezza Spatial Engine

• What is it?

– Location Intelligence Extension for IBM Netezza TwinFin Appliance

• What is it used for?

– Processing queries about geographical data to perform spatial analysis

• What technology does it use?

– GGL, GEOS libraries

• What are the benefits?

– A set of functions for running GIS analysis on large volumes of data.

– Analyze spatial information all in the database.

– Better and faster analysis using spatial data.

In-Database Analytics: Open Source

Page 33: Analiza danych przy użyciu IBM Netezza Analytics

Spatial Concepts

• Goal: to process queries about geometric features or geographical data in order to perform various types of analysis.

• Examples of geographical data:

– The location of a store, a wireless service tower or other landmark

– A linear feature such as a street, river or power line

• Examples of spatial analysis:

– Identify the number of wireless calls that occur in a particular area so that you can better plan the addition of new towers to improve wireless service

– Calculate driving distance from a certain point to the nearest N fire stations to calculate the cost of an insurance premium

In-Database Analytics: Open Source

Page 34: Analiza danych przy użyciu IBM Netezza Analytics

Examples of Usage

– Area

– Distance

– Length

– Perimeter

• Because IBM Netezza Spatial functions are implemented as UDFs, they can use the full potential of Netezza’s Massively Parallel Processing architecture
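The four measures listed above (area, distance, length, perimeter) can be sketched for planar coordinates in plain Python. This is an illustration of the geometry only: it assumes flat Cartesian coordinates, the function names are hypothetical, and it is not the IBM Netezza Spatial UDF interface.

```python
import math


def distance(p, q):
    """Straight-line distance between two points given as (x, y)."""
    return math.hypot(q[0] - p[0], q[1] - p[1])


def length(line):
    """Length of a polyline given as a list of (x, y) points."""
    return sum(distance(a, b) for a, b in zip(line, line[1:]))


def perimeter(polygon):
    """Perimeter of a polygon given as a list of vertices (closed automatically)."""
    return length(polygon + polygon[:1])


def area(polygon):
    """Polygon area by the shoelace formula."""
    closed = polygon + polygon[:1]
    twice = sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in zip(closed, closed[1:]))
    return abs(twice) / 2


RECTANGLE = [(0, 0), (4, 0), (4, 3), (0, 3)]
print(area(RECTANGLE))       # 12.0
print(perimeter(RECTANGLE))  # 14.0
print(distance((0, 0), (3, 4)))  # 5.0
```

Each function works on one geometry at a time with no shared state, which is the property that lets functions of this kind run as UDFs across data slices.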

Page 35: Analiza danych przy użyciu IBM Netezza Analytics

SDK – nzAdaptors

• What is it?

– APIs that allow in-database user defined functions to be written in various languages

• What is it used for?

– Enable any program to run on the S-Blades (with minimal refactoring)

• What technology does it use?

– User Defined Analytic Process

• What are the benefits?

– Flexibility to build and deploy analytics/models in multiple languages

– Eliminates the need to rewrite and revalidate model scoring code

– Analytics can be written in a different language than the calling application

• Supports the following parallel operations

– Parallel execution of the analytic, model, application

• Call Interface

– Language-specific API

– Invoked via SQL

Software Development Kit: nzAdaptors for C, C++, Java, Python, Fortran

Page 36: Analiza danych przy użyciu IBM Netezza Analytics

SDK – nzPackage for R

• What is it?

– R packages that integrate the R GUI/CLI with Netezza

• Provide interfaces to tables, matrices, apply operations, and nzAnalytics

• What is it used for?

– Data frame integration with data warehouse, pushing analytics processing to S-Blades, scoring on S-Blades, installation of R packages, integration with SQL, Matrix integration

• What technology does it use?

– R API for creating packages, open-source CRAN packages (e.g., RODBC)

• What are the benefits?

– Ability to use S-Blades for scaling R analytics/models

– Large-scale linear algebra via Matrix

– Access to nzAnalytics from R

Software Development Kit: nzPackage for R

Page 37: Analiza danych przy użyciu IBM Netezza Analytics

SDK – nzPlug-in for Eclipse

• What is it?

– A plug-in for Eclipse that facilitates easier development of UDFs and Stored Procedures

• What is it used for?

– UDFs and Stored Procedure wizards

– Remote SSH terminal, database object explorer, SQL editors, source code control, issue management, system monitoring, documentation builder

• What technology does it use?

– Eclipse

• What are the benefits?

– Faster, more targeted development

– Leverage the many available open-source plug-ins for Eclipse

Software Development Kit: nzPlug-in for Eclipse

Page 38: Analiza danych przy użyciu IBM Netezza Analytics

Netezza plugin for Eclipse

• What’s included

– Predefined Project Perspective

– NZ Admin

– NZ Cartridge Manager

– Logs Browser

– Editors with Syntax Highlighting

– Remote Console and SSH Terminals

– Template Wizards (NZ project, UDX, UDTF, Stored Procedures, Makefile, …)

– Synchronization between local and remote projects

– Data Tools – Database Object Explorer, SQL Editors, Data Explorer, … (with support for Netezza database)

Software Development Kit: nzPlug-in for Eclipse

Page 39: Analiza danych przy użyciu IBM Netezza Analytics

What are the key points?

Page 40: Analiza danych przy użyciu IBM Netezza Analytics

Key Points

• Target Audience

– Line of Business, BI, Data Miners, Modelers, Quants, Statisticians, Programmers

• IBM Netezza Analytics Uses

– Embedding algorithms, exploratory data analysis, building model, deploying/scoring model

• 3 Major Components of IBM Netezza Analytics

– Parallel Analytic Engines, SDK, In-Database Analytics

• Streaming Accelerator

– Unique differentiator for combination of data stream processing, in-database analytics processing and inter-node processing

Page 41: Analiza danych przy użyciu IBM Netezza Analytics

Key Differentiators

• Faster and scalable analytics processing

• Parallelized in-database analytics

• Large scale matrix operations

• Rich development environment

Page 42: Analiza danych przy użyciu IBM Netezza Analytics

Key Benefits

• Eliminates inefficient analytics data processing - data remains in place

• Speeds up time to insight, action & business value

• Achieves parallelism without parallel programming

• Enables increased analytics experimentation

• Protects and leverages investment in existing analytics

• Reduces technology barriers for large scale analytics

Page 43: Analiza danych przy użyciu IBM Netezza Analytics

IBM Netezza Analytics

Big Math

Big Data

IBM Netezza Analytics

Page 44: Analiza danych przy użyciu IBM Netezza Analytics

Thank you

Your Data. Your Site. Our Appliance.
