Turbo-Charge Your Analytics with IBM Netezza and Revolution R Enterprise

Page 1: Revolution R Enterprise for IBM Netezza

© 2012 IBM Corporation

Revolution Confidential

Page 2: IBM Netezza with Revolution Analytics

High-performance, in-database analytics platform for Big Data
– Massively parallel processing delivers 10-100x performance
– Run analytics in-database and eliminate data movement
– Scalable architecture fosters experimentation

Innovation with Advanced Analytics
– Analytic modeling with most current statistical methods and 2,500+ open source packages

Enterprise-ready advanced analytics software, services & support
– Security, IDE, training, professional services
– Web Services stack enables integration with front-end presentation layer

Page 3: Revolution Analytics

March 1, 2012

Page 4: What is R?

Data analysis software

A programming language
– Development platform designed by and for statisticians
– Object-oriented: vector, matrix, model, …
– Built-in libraries of algorithms

An environment
– Huge library of algorithms for data access, data manipulation, analysis and graphics

An open-source software project
– Free, open, and active

A community
– Thousands of contributors, 2 million users
– Resources and help in every domain

Download the white paper "R is Hot": bit.ly/r-is-hot

Page 5

"The professor who invented analytic software for the experts now wants to take it to the masses"

– Most advanced statistical analysis software available
– Half the cost of commercial alternatives
– 2M+ users
– 2,500+ applications

Statistics, Predictive Analytics, Data Mining, Visualization

Finance, Life Sciences, Manufacturing, Retail, Telecom, Social Media, Government

Power, Productivity, Enterprise Readiness

Page 6: Revolution R Enterprise has the Open-Source R Engine at the core

2,500 community packages and growing exponentially

[Diagram labels]
– R Engine Language Libraries
– Open Source R Packages
– Technical Support
– Web Services API
– Big Data Analysis
– Revolution Productivity Environment
– Build Assurance
– Parallel Tools
– Multi-Threaded Math Libraries
– Technology Partners

Page 7: Working with Revolution R Enterprise for IBM Netezza

Page 8: Revolution R Enterprise for IBM Netezza inside the IBM Netezza Architecture

[Diagram label: IBM Netezza Analytics]

Page 9: In-Database Paradigms for using R

In-database Scoring
– Family of apply functions which score analytic models by using data parallelism
– Underlying truism is that there is a fact that can be applied across all data

Big Data Analytics
– Family of parallelized, in-database analytics that have R wrappers and work on the entire data set
– Underlying truism exists across all data

Grouped by Row (tapply)
– Data and Task Parallelism
  • Data flow technique to apply analytics to naturally occurring groups of data using non-parallelized analytics
– Underlying relationship in data is by a group

Examples

In-database scoring
– Customer lifetime value
– Credit score
– Affinity
– Good stock/bad stock

Big data analytics
– Clustering of all data to determine groupings
– Models that are applied across a whole data set: decision trees
– Data transformation: variable selection, correlation

Grouped by row
– Forecasting by store, stock symbol, etc.
– Build a model for each customer, product, etc.
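The difference between the scoring paradigm and the grouped paradigm can be sketched locally in base R. This is only an in-memory illustration on the built-in `iris` data set, not the Netezza API: the choice of `lm`, the formula, and the variable names `model`, `scores`, and `per_group` are assumptions made for the example.

```r
# Local, in-memory sketch; on the appliance the same pattern would run in-database.
data(iris)

# Scoring: one fitted model (the "fact") is applied to every row of the data.
model  <- lm(Sepal.Length ~ Petal.Length, data = iris)
scores <- predict(model, newdata = iris)

# Grouped: a separate model is fitted for each naturally occurring group
# (here one per species, analogous to one per customer or per store).
per_group <- lapply(
  split(iris, iris$Species),
  function(d) coef(lm(Sepal.Length ~ Petal.Length, data = d))
)

print(head(scores))
print(per_group)
```

Scoring is data-parallel because each row can be scored independently of the others; the grouped case is task-parallel because each group's model can be fitted independently.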

Page 10: Access In-Database Language Support from R

SQL, Java, Python, C, Fortran, C++

Page 11: Open Source R Package Support

Vertical
• Econometrics
• Experimental Design
• Computational Physics
• Clinical Trials
• Environmetrics
• Finance
• Genetics
• Medical Imaging
• Pharmacokinetics
• Phylogenetics
• Psychometrics
• Social Sciences

Horizontal
• Bayesian
• Cluster
• Distributions
• Graphics
• Graphical Models
• Machine Learning
• Multivariate
• Natural Language Processing
• Optimization
• Robust Statistical Metrics
• Spatial
• Survival Analysis
• Time Series

2500+ community packages

Page 12: Using Revolution R Enterprise with IBM Netezza

R packages integrate and push analytics processing in-database

[Architecture diagram labels]
– Business Intelligence, Excel or Third-Party Application
– RevoDeployR Server: Web Services Interface for R
– Revolution R Enterprise - Workstation
– Revolution R Enterprise - Server
– Connectivity: HTTP; RODBC & nzODBC
– IBM Netezza appliance: Host and five S-Blades, each running IBM Netezza Analytics

Page 13: Deploying Revolution R Enterprise to IBM Netezza

• Remote terminal connection to Host
• Create your R script
• Compile and register your R script as an AE (UDAP)
• Execute SQL that will invoke the registered AE
• Go back to the Revolution R client to retrieve results and continue additional analysis

[Diagram: Host and five S-Blades, each running IBM Netezza Analytics]

Page 14: Revolution R Enterprise Client Configuration

Revolution R Enterprise
– Productivity Environment

Netezza ODBC Drivers

‘nz’ R Packages
– nzA, nzR, nzMatrix

R Package Dependencies
– RODBC
– caTools
– tree
– bitops
– e1071
– rgl
– ca
– MASS
– XML

Page 15: IBM Netezza In-Database Analytics from Revolution R

nzR Package
– Encapsulate database and expose “R”-like constructs
– R data.frame = database table
– Apply an R function to a row of data or grouped rows of data

nzA Package
– Entry point to the nzAnalytics
– Explicitly parallelized algorithms that run in database

nzMatrix Package
– Encapsulation of matrices and operations in database
– nz.matrix construct in R to access matrices in the database
– R operations on nz.matrix translate to matrix stored procedure operations

Page 16: nzR Package

Basic Functions
Database Connection: nzConnect, nzConnectDSN
SQL Execution: nzQuery, nzScalarQuery, nzDeleteTable
Data Management: as.nz.data.frame, nz.data.frame
Apply an R function: nzApply, nzTApply, nzGroupedApply
R Package Management: nzInstallPackages, nzIsPackageInstalled

Sample Code

# load packages
library(nzr)

# connect to a database via ODBC
nzConnect("admin", "xyz", "127.0.0.1", "iclasstest")

# load the iris table
nzdf <- nz.data.frame("iris")

# run nzTApply against the nz data frame
fun <- function(x) max(x[,1])
nzTApply(nzdf, nzdf[,5], fun)
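For readers without access to an appliance, the grouped computation in the sample code can be approximated locally. The sketch below assumes that `nzTApply` calls `fun` once per group of rows defined by the index column (here column 5, the species), which is how the slides describe the grouped paradigm. It uses only base R and the built-in `iris` data set, and `groups` and `local_result` are names chosen for illustration.

```r
# Same per-group function as in the slide's sample code:
# the maximum of the first column (Sepal.Length) within a group of rows.
fun <- function(x) max(x[, 1])

# Local, in-memory analogue of the grouped apply, using base R only:
# split the rows by the 5th column (Species) and apply fun to each piece.
groups <- split(iris, iris[, 5])
local_result <- sapply(groups, fun)

print(local_result)
```

On the appliance the groups would be processed on the nodes that hold them, so the rows would not need to be pulled to the client first.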

Page 17: nzA Package

Data Manipulation
Moments: nz.moments
Quantiles: nz.quantile, nz.quartile
Outlier Detection: nz.outliers
Frequency Table: nz.bitable
Histogram: nz.hist
Pearson's Correlation: nz.corr
Spearman's Correlation: nz.spearman.corr, nz.spearman.corr.s
Covariance: nz.cov, nz.cov.matrix
Mutual Information: nz.mutualinfo
Chi-Square Test: nzChisq.test, nz.chisq.test
t-Test: t.ls.test, t.me.test, t.pmd.test, t.umd.test
Mann-Whitney-Wilcoxon Test: nz.mww.test
Wilcoxon Test: nz.wilcoxon.test
Canonical Correlation: nz.canonical.corr
One-Way ANOVA: nzAnova, nz.anova.CRD.test, nz.anova.RBD.test
Principal Component Analysis: nzPCA
Tree-Shaped Bayesian Networks: nz.TBNet Apply, nz.TBNet Grow, nz.BigBNControl, nz.TBNet1g2p, nz.TBNet1g, nz.TBNet2g

Page 18: nzA Package

Data Transformations
Discretization: nz.efdisc, nz.emdisc, nz.ewdisc
Standardization and Normalization: nz.std.norm
Data Imputation: nz.impute.data

Model Diagnostics
Misclassification Error: nz.cerror
Confusion Matrix: nz.acc, nz.CMATRIX STATS
Mean Absolute Error: nz.mae
Mean Square Error: nz.mse
Relative Absolute Error: nz.rae
Percentage Split: nz.percentage.split
Cross-Validation: nz.cross.validation

Page 19: nzA Package

Classification
Naive Bayes: nzNaiveBayes, nz.naivebayes, nz.predict.naivebayes
Decision Trees: nzDecTree, nz.dectree, nz.grow.dectree, nz.print.dectree, nz.prune.dectree, nz.predict.dectree
Nearest Neighbors: nz.knn

Regression
Linear Regression: nzLm
Regression Trees: nzRegTree, nz.regtree, nz.grow.regtree, nz.print.regtree, nz.predict.regtree

Clustering
K-Means Clustering: nzKMeans, nz.kmeans, nz.predict.kmeans
Divisive Clustering: nz.divcluster, nz.predict.divcluster

Associative Rule Mining
FP-Growth: nz.fpgrowth, nz.prepare.fpgrowth

Page 20: nzMatrix Package

Data Manipulation
Coerce or point to a nz.matrix: as.nz.matrix, as.nz.matrix.matrix, nz.matrix
Combine Matrices: nzCBind, nzRBind
Create Matrices From Tables: nzCreateMatrixFromTable, nzCreateTableFromMatrix
Create Special Matrices: nzIdentityMatrix, nzNormalMatrix, nzOnesMatrix, nzRandomMatrix, nzVecToDiag
Decomposition: nzSVD, svd, nzEigen
Delete Matrices: nzDeleteMatrix, nzDeleteMatrixByName
Dimensions: dim, NCOL, ncol, NROW, nrow
Mathematical Functions: abs, add, aubtr, ceiling, div, exp, floor, ln, log10, mod, mult, nzPowerMatrix, pow, rounding, sqrt, trunc
Matrix Engine Initialization: nzMatrixEngineInitialization
Matrix Info: is.nz.matrix, isSparse, nzExistMatrix, nzExistMatrixByName, nzGetValidMatrixName
Operators: *, +, -, <, ==, >, nzKronecker, nzPMax, nzPMin, nzSetValue, [, scale, t
Printing Matrices: print.nz.matrix
Solve: nzInv, nzSolve, nzSolveLLS
Sparse Matrices: isSparse, nzSparse2matrix
Summaries: nzAll, nzAny, nzMax, nzMin, nzSsq, nzSum, nzTr

Page 21: Demonstration: Using Revolution R with IBM Netezza

Page 22: Turbo-Charge Your Analytics with IBM Netezza and Revolution R Enterprise

Presented by:

Derek M Norton, Senior Sales Engineer

Page 23: Use Case – Credit Risk

– We have a dataset of individuals and their credit risk, stored on the Netezza appliance.
– The goal is to model whether someone is “approvable” for a loan.
– This use case will follow a modeling process (though condensed) from start to finish.
– I will discuss each of the parts, and at the end there will be a demo of the code.

Page 24: Modeling Exercise

1. Learning more about the data
2. Prepare the data for modeling
3. Fit models to the data
4. Model Performance

Page 25: 1. Learning more about the data

– Connect to the IBM Netezza appliance
– Summarize the data
– Visualize the data

[Figure: histogram titled "Continuous Variable" (x-axis: x, y-axis: Frequency) and bar chart titled "Discrete Variable" with categories High School Diploma, Bachelors Degree, Masters Degree, Professional Degree, PhD]

Page 26: 2. Prepare the data for modeling

– Split the data into 70/30 training/test sets
– Transform some variables
– Discretize numeric variables for later use
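The preparation steps above can be sketched locally as follows. The slides do not show the credit-risk table or the demo code, so this example stands in the built-in `iris` data set and base R functions. The seed, the log transform, the choice of three equal-width bins, and all variable names are assumptions made for illustration.

```r
# Local sketch on a built-in data set; the credit-risk table is not shown in the slides.
set.seed(42)                                  # make the random split reproducible

# 70/30 training/test split by sampling row indices without replacement.
n         <- nrow(iris)
train_idx <- sample(n, size = round(0.7 * n))
train     <- iris[train_idx, ]
test      <- iris[-train_idx, ]

# Transform a variable: put petal length on a log scale.
train$LogPetalLength <- log(train$Petal.Length)

# Discretize a numeric variable into three equal-width bins.
train$SepalLengthBin <- cut(train$Sepal.Length, breaks = 3,
                            labels = c("low", "mid", "high"))

print(dim(train))
print(dim(test))
print(table(train$SepalLengthBin))
```

The slides list in-database counterparts for these steps in the nzA package (for example nz.percentage.split and the nz.*disc discretization functions). Their arguments are not shown in the deck, so they are not used here.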

Page 27: 3. Fit models to the data

Build two different models to predict if an individual is “approvable”
– Decision Tree
– Naïve Bayes

Page 28: 4. Model Performance

Examine confusion matrices to determine:
– Training performance
– Test performance
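As a minimal illustration of reading a confusion matrix, the sketch below cross-tabulates actual against predicted labels in base R and derives the accuracy from the diagonal. The label vectors are made up for the example and are not results from the demo.

```r
# Hypothetical labels, for illustration only (not output from the demo).
lv        <- c("no", "yes")
actual    <- factor(c("yes", "yes", "no", "no", "yes", "no", "yes", "no"), levels = lv)
predicted <- factor(c("yes", "no",  "no", "no", "yes", "yes", "yes", "no"), levels = lv)

# Confusion matrix: rows are actual classes, columns are predicted classes.
cm <- table(Actual = actual, Predicted = predicted)

# Accuracy is the share of cases on the diagonal (correct predictions).
accuracy <- sum(diag(cm)) / sum(cm)

print(cm)
print(accuracy)
```

Computing the same table on the training set and on the held-out test set, then comparing the two accuracies, shows whether a model generalizes or merely fits the training data.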

Page 29: Demo

Page 30: Summary

Familiar environment for R developers
– World-class productivity tools
– Enterprise-class service, support and integration

Execution of analytics in-database
– Analytic computing distributed across Netezza nodes and run in a massively parallel manner
– Each Netezza node gets a data slice and analytics are pushed down from the Host to the individual nodes

Capabilities
– R code executed on Netezza nodes in row-by-row fashion or on groups of rows
– Enables access to explicitly parallelized algorithms running on entire data set
– Large-scale parallel matrix operations on database tables

Performance
– 10-100x performance improvements

Page 31: Contact Us

Derek Norton
Solutions Executive
Revolution [email protected]

www.revolutionanalytics.com
+1 (650) 646 9545
Twitter: @RevolutionR

Bill Zanine
Business Solutions Executive, Analytics Solutions
IBM [email protected]