34
Introducing SQL Server 2016 R Services and Microsoft R Server Murali Shesham

L200 Microsoft R Server and SQL Server R Services Deck · PDF filefamily distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. ... • Developers can write code once

  • Upload
    buinga

  • View
    229

  • Download
    2

Embed Size (px)

Citation preview

Page 1: L200 Microsoft R Server and SQL Server R Services Deck · PDF filefamily distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. ... • Developers can write code once

IntroducingSQL Server 2016 R Services and Microsoft R Server

Murali Shesham

Page 2: L200 Microsoft R Server and SQL Server R Services Deck · PDF filefamily distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. ... • Developers can write code once

What is R?

Language

Platform

Community

Ecosystem

• A programming language for statistics, analytics, and data science

• A data visualization framework

• Provided as Open Source

• Used by 2.5M+ data scientists, statisticians and analysts

• Taught in most university statistics programs

• Active and thriving user groups across the world

• CRAN: 7000+ freely available algorithms, test data and evaluation

• Many of these are applicable to big data if scaled

• New and recent graduates prefer it

Page 3: L200 Microsoft R Server and SQL Server R Services Deck · PDF filefamily distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. ... • Developers can write code once

20152009200420032000199719951993

Research Project in

New Zealand

Open Source Project

R-Core Group

R-1.0.0 released

R Foundation

First user

New York Times article

R-3.2.0 and R Consortium (founded by Microsoft)

History of R

Page 4: L200 Microsoft R Server and SQL Server R Services Deck · PDF filefamily distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. ... • Developers can write code once

$?

Challenges posed by open source R

Uncertain

total cost of

ownership

Inadequate

access to

important

business data

Limited

business

agility

Limited

business

value

Page 5: L200 Microsoft R Server and SQL Server R Services Deck · PDF filefamily distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. ... • Developers can write code once

R from Microsoft brings

Page 6: L200 Microsoft R Server and SQL Server R Services Deck · PDF filefamily distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. ... • Developers can write code once

6

• Free and open source R distribution

• Enhanced and distributed by Revolution Analytics

Microsoft R Open

• Built in Advanced Analytics and Stand Alone Server

Capability

• Leverages the Benefits of SQL 2016 Enterprise Edition

SQL Server R Services

Microsoft R Products

Page 7: L200 Microsoft R Server and SQL Server R Services Deck · PDF filefamily distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. ... • Developers can write code once

Microsoft R Server

• Microsoft R Server for Redhat Linux

• Microsoft R Server for SUSE Linux

• Microsoft R Server for Teradata DB

• Microsoft R Server for Hadoop on Redhat

Microsoft R Server

Page 8: L200 Microsoft R Server and SQL Server R Services Deck · PDF filefamily distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. ... • Developers can write code once

Introducing SQL Server 2016 R Services

Enterprise speed and performance

Near-DB analytics

Parallel threading and processing

Model on-premises, store in cloud—or vice versa

Hybrid memory and disk scalability

Not bound by memory-enabling limits of larger

datasets

Included in SQL Server 2016

Reuse and optimize existing R code

Eliminate data movement across machines

Write once, deploy anywhere

Page 9: L200 Microsoft R Server and SQL Server R Services Deck · PDF filefamily distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. ... • Developers can write code once

Scalable in-database analytics

Data Scientist

Interacts directly with data

Creates models

and experiments

Data Analyst/DBA

Manages data and

analytics together

Example Solutions• Fraud detection

• Sales forecasting

• Warehouse efficiency

• Predictive

maintenance

010010

100100

010101

Relational Data

Extensibility

?R

R Integration

Analytic LibraryOpen Source R

Revolution PEMA

T-SQL Interface

How is it Integrated?• T-SQL calls a Stored Procedure

• Script is run in SQL through

extensibility model

• Result sets sent through Web API

to database or applications

Benefits• Faster deployment of ML models

• Less data movement, faster

insights

• Work with large datasets: mitigate

R memory and scalability

limitations

Page 10: L200 Microsoft R Server and SQL Server R Services Deck · PDF filefamily distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. ... • Developers can write code once

Cost effectiveness

• Best Advanced Analytics Value

• R Services and Polybase are built-ino Part of SQL Server 2016 Enterprise Edition

• In DB analytics shrinks analysis cost and timeo No data movement reduces costs

• No Proprietary Hardware Requirement

o Can be installed in commodity hardware

• Integration between cloud and open source offerings

SQL SERVER 2016$ 648 K

+ $120 Per user for PowerBI

Costs based on a Server with2 proc/ 8 Cores

Page 11: L200 Microsoft R Server and SQL Server R Services Deck · PDF filefamily distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. ... • Developers can write code once

11

Page 12: L200 Microsoft R Server and SQL Server R Services Deck · PDF filefamily distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. ... • Developers can write code once

High-performance open source R plus:• Data source connectivity to big-data objects

• Big-data advanced analytics

• Multi-platform environment support

• In-Hadoop and in-Teradata predictive modeling

• Development and production environment support

• IDE for data scientist developers

• Secure, Scalable R Deployment

DeployR

R Open R Server

DevelopR

Microsoft R Server is a broadly deployable enterprise-class analytics platform based on R that is

supported, scalable and secure. Supporting a variety of big data statistics, predictive modeling and

machine learning capabilities, R Server supports the full range of analytics – exploration, analysis,

visualization and modeling

Introducing Microsoft R Server

Page 13: L200 Microsoft R Server and SQL Server R Services Deck · PDF filefamily distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. ... • Developers can write code once

R Open Microsoft R Server

DeployRDevelopR

The Microsoft R Server Platform

ConnectR• High-speed & direct

connectors

Available for:• High-performance XDF

• SAS, SPSS, delimited & fixed format text data files

• Hadoop HDFS (text & XDF)

• Teradata Database & Aster

• EDWs and ADWs

• ODBCScaleR• Ready-to-Use high-performance

big data big analytics

• Fully-parallelized analytics

• Data prep & data distillation

• Descriptive statistics & statistical tests

• Range of predictive functions

• User tools for distributing customized R algorithms across nodes

• Wide data sets supported – thousands of variables

DistributedR• Distributed computing framework

• Delivers cross-platform portability

R+CRAN• Open source R interpreter

• R 3.1.2

• Freely-available huge range of R algorithms

• Algorithms callable by RevoR

• Embeddable in R scripts

• 100% Compatible with existing R scripts, functions and packages

RevoR• Performance enhanced R

interpreter

• Based on open source R

• Adds high-performance math library to speed up linear algebra functions

Page 14: L200 Microsoft R Server and SQL Server R Services Deck · PDF filefamily distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. ... • Developers can write code once

ScaleR – Parallel + “Big Data”

Stream data in to RAM in blocks. “Big Data” can be any data

size. We handle Megabytes to Gigabytes to Terabytes…

Our ScaleR algorithms work

inside multiple cores / nodes

in parallel at high speed

Interim results are collected

and combined analytically to

produce the output on the

entire data set

XDF file format is optimised to work with the ScaleR library and

significantly speeds up iterative algorithm processing.

Page 15: L200 Microsoft R Server and SQL Server R Services Deck · PDF filefamily distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. ... • Developers can write code once

16

Page 16: L200 Microsoft R Server and SQL Server R Services Deck · PDF filefamily distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. ... • Developers can write code once

SQL Server 2016 Enterprise Edition

SQL Server R Services

Integration Facilities:

• Component Integration• Launchers• Parameter Passing• Results Return• Console Output

Return• Parallel Data Exchange

(RTM)• Stored Procedures• Package Administration

SQL Server

Query

Processor

Algorithm Library

• Data Prep

• Descriptive Stats

• Sampling

• Statistical Tests

• Predictive Models

• Variable Selection

• Clustering

• Classification

• Custom APIs for R + CRAN

• Parallel Scoring

Fast, Parallel, Storage Efficient Algorithms

Microsoft R Open

• 100% Open Source R• Fully CRAN Compatible• Accelerated Math

Open Source R Interpreter

Page 17: L200 Microsoft R Server and SQL Server R Services Deck · PDF filefamily distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. ... • Developers can write code once

Run R In-Database from TSQL

SQL

Server

2016

In-Database

Execution of R

+ CRAN

+ SQL

In-Database Execution of:

R Code

CRAN Packages

Move the

Work to

the Data

Run R From

the Query

Processor

Retrieve

Models,

Scores,

Transformed

Data,

Plots/Images

Operationalise

scoring/predictio

n in database for

data batches or

real-time

Page 18: L200 Microsoft R Server and SQL Server R Services Deck · PDF filefamily distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. ... • Developers can write code once

SQL

In-Database Execution:

Remote Execution

Parallelized Compute SQL

Server

Remote

Execution

Context

Explore and Model:

In Parallel, In-Database

Parallelize distributable R and CRAN

Operationlize:

Score In Parallel

Parallel

Worker

Tasks

Move

BIG

Work to

the DataLarge Data Sets in Chunks

Parallel

Algorithm

Iterate/ Sequence

Run Parallel Algorithms in Database from an R client

Page 19: L200 Microsoft R Server and SQL Server R Services Deck · PDF filefamily distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. ... • Developers can write code once

SQL 2016

ScaleR PEMAs: Fast, Parallel, Storage Efficient Algorithms

R Interpreter

Conceptual Flow

Page 20: L200 Microsoft R Server and SQL Server R Services Deck · PDF filefamily distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. ... • Developers can write code once

22

Page 21: L200 Microsoft R Server and SQL Server R Services Deck · PDF filefamily distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. ... • Developers can write code once

Introducing Microsoft R Server

Page 22: L200 Microsoft R Server and SQL Server R Services Deck · PDF filefamily distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. ... • Developers can write code once

Gradient Boosted Decision Trees

Naïve Bayes

Scale R – Parallelized Algorithms & Functions

Data import – Delimited, Fixed, SAS, SPSS,

OBDC

Variable creation & transformation

Recode variables

Factor variables

Missing value handling

Sort, Merge, Split

Aggregate by category (means, sums)

Min / Max, Mean, Median (approx.)

Quantiles (approx.)

Standard Deviation

Variance

Correlation

Covariance

Sum of Squares (cross product matrix for set

variables)

Pairwise Cross tabs

Risk Ratio & Odds Ratio

Cross-Tabulation of Data (standard tables & long

form)

Marginal Summaries of Cross Tabulations

Chi Square Test

Kendall Rank Correlation

Fisher’s Exact Test

Student’s t-Test

Subsample (observations & variables)

Random Sampling

Data Preparation Statistical Tests

Sampling

Descriptive Statistics Sum of Squares (cross product matrix for set

variables)

Multiple Linear Regression

Generalized Linear Models (GLM) exponential

family distributions: binomial, Gaussian, inverse

Gaussian, Poisson, Tweedie. Standard link

functions: cauchit, identity, log, logit, probit. User

defined distributions & link functions.

Covariance & Correlation Matrices

Logistic Regression

Classification & Regression Trees

Predictions/scoring for models

Residuals for all models

Predictive Models K-Means

Decision Trees

Decision Forests

Cluster Analysis

Classification

Simulation

Variable Selection

Stepwise Regression

Simulation (e.g. Monte Carlo)

Parallel Random Number Generation

Combination rxDataStep

rxExec

New

PEMA-R API Custom Algorithms

Page 23: L200 Microsoft R Server and SQL Server R Services Deck · PDF filefamily distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. ... • Developers can write code once

ScaleR - Performance comparisonMicrosoft R Server has no data size limits in relation to size of available RAM. When open source R operates on data sets that exceed RAM it will fail. In contrast Microsoft R Server scales linearly well beyond RAM limits and parallel algorithms are much faster.

US flight data for 20 years

Linear Regression on Arrival Delay

Run on 4 core laptop, 16GB RAM and 500GB SSD

Page 24: L200 Microsoft R Server and SQL Server R Services Deck · PDF filefamily distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. ... • Developers can write code once

DistributedR

ScaleR

ConnectR

DevelopR

Distributed R - Model development and model compute choice: “Write Once. Deploy Anywhere.”

Code Portability Across Platforms

In the Cloud

Workstations & Servers LinuxWindows

EDW Teradata

HadoopHortonworksClouderaMapR

+ HD Insights

+ Hadoop Spark

+ R Tools for Visual Studio

+ Azure ML

Ro

ad

map

Azure Marketplace

+ SQL Server v16

Microsoft R Server

Page 25: L200 Microsoft R Server and SQL Server R Services Deck · PDF filefamily distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. ... • Developers can write code once

DistributedR - Revolution Code Portability

### SETUP HADOOP ENVIRONMENT VARIABLES ###

myHadoopCCC <- RxHadoopMR()

### HADOOP COMPUTE CONTEXT ###

rxSetComputeContext(myHadoopCC)

### CREATE HDFS, DIRECTORY AND FILE OBJECTS ###

hdfsFS <- RxHdfsFileSystem()

AirlineDataSet <-

RxXdfData(“AirlineDemoSmall/AirlineDemoSmall.xdf”)

, fileSystem = hdfsFS)

### ANALYTICAL PROCESSING ###

### Statistical Summary of the data

rxSummary(~ArrDelay+DayOfWeek, data= AirlineDataSet, reportProgress=1)

### CrossTab the data

rxCrossTabs(ArrDelay ~ DayOfWeek, data= AirlineDataSet, means=T)

### Linear Model and plot

hdfsXdfArrLateLinMod <- rxLinMod(ArrDelay ~ DayOfWeek + 0 , data = AirlineDataSet)

plot(hdfsXdfArrLateLinMod$coefficients)

### SETUP LOCAL ENVIRONMENT VARIABLES ###

myLocalCC <- “localpar”

### LOCAL COMPUTE CONTEXT ###

rxSetComputeContext(myLocalCC)

### CREATE LINUX, DIRECTORY AND FILE OBJECTS ###

linuxFS <- RxNativeFileSystem() )

AirlineDataSet <-

RxXdfData(“AirlineDemoSmall/AirlineDemoSmall.xdf”,

fileSystem = linuxFS)

Local Parallel processing – Linux or Windows In – Hadoop

ScaleR models can be deployed from a server or edge node to run in Hadoop

without any functional R model re-coding for map-reduce

Compute

context R script

– sets where the

model will run

Functional

model R script –

does not need

to change to run

in Hadoop

Page 26: L200 Microsoft R Server and SQL Server R Services Deck · PDF filefamily distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. ... • Developers can write code once

DistributedR - In-Hadoop

Uses Hadoop nodes for R

computations

Eliminate data movement

latency on very large data

Remove data duplication

Faster model development

No MapReduce R coding

Develop better models

using all the data= Microsoft R Server

Page 27: L200 Microsoft R Server and SQL Server R Services Deck · PDF filefamily distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. ... • Developers can write code once

MRS and Hadoop Architecture options

R R R R R

R R R R R

ScaleR Production

RStudio Server Pro

Microsoft R Server

1. Copy

2. Stream

3. Send

Page 28: L200 Microsoft R Server and SQL Server R Services Deck · PDF filefamily distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. ... • Developers can write code once

DistributedR - What processing mode to use, when?

Analytic data set size and processing complexity (e.g. simple summary statistics vs iterative algorithm)

guide the use of Method 1 and 2 (Edge Node / Server Linux local processing) vs Method 3 (in-Hadoop

processing)

Low Medium High

Small Data

< 10GB

Medium Data

< 50GB

Bigger Data

> 50GB

Edge Node Linux

processingIn-Hadoop

processing

Local Linux

file-systemHadoop

file-system

LegendProcessing

Complexity

Data Size

Page 29: L200 Microsoft R Server and SQL Server R Services Deck · PDF filefamily distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. ... • Developers can write code once

While Open Source R delivers:

• Capability• 6500+ Algorithm &

Connector Packages Available for Free in CRAN

• Simplicity• R Skills Transfer / Lower cost

of Talent• Ease of Integration with Other

Analytics Packages & Data• Access to Huge Libraries of R

Analytical Algorithms

• Speed• Intel-Optimized Computation

• Peace of mind• Knowledge that your business is using a stable platform backed with

commercial support and services

• Platform longevity for more predictability around costs

• Speed and scalability• Faster decisions using advanced analytics that were previously unachievable

• In-Hadoop & In Teradata Analysis

• Efficiency• Continue getting returns on existing hardware and software investments

• Developers can write code once and deploy it anywhere, keeping costs low

• Flexibility and agility• Model data in a hybrid environment: on-premises, in the cloud, or both

• Scripting, modeling, and in-database analytics across platforms shrinks analysis time and enables agile response to business needs

SQL Server R Services and Microsoft R Server deliver:

Page 30: L200 Microsoft R Server and SQL Server R Services Deck · PDF filefamily distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. ... • Developers can write code once
Page 31: L200 Microsoft R Server and SQL Server R Services Deck · PDF filefamily distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. ... • Developers can write code once

Introducing Microsoft R Open• Enhanced Open Source R distribution

• Based on the latest Open Source R (3.1.2)

• Built, tested and distributed by Microsoft

• Enhanced by Intel MKL Library to speed up linear algebra functions

• Compatible with all R-related software• CRAN packages, RStudio, third-party R integrations, …

• Revolutions Open-Source R packages• Reproducible R Toolkit – Checkpoint , miniCRAN

• ParallelR – parallelise execution via ‘foreach’ loop

• Rhadoop – rhdfs, rhbase, ravro, rmr2, plyrmr

• AzureML – read/write data to AzureML, publish R code as ML API

• MRAN website mran.revolutionanalytics.com• Enhanced documentation and learning resources

• Discover 6500 free add-on R packages

• Open source (GPLv2 license) - 100% free to download, use and share

Page 32: L200 Microsoft R Server and SQL Server R Services Deck · PDF filefamily distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. ... • Developers can write code once

DatasizeIn-memory

In-memory In-Memory or Disk Based

Speed of AnalysisSingle threaded Multi-threaded

Multi-threaded, parallel

processing 1:N servers

SupportCommunity Community Community + Commercial

Analytic Breadth

& Depth 7500+ innovative analytic

packages7500+ innovative analytic

packages

7500+ innovative packages +

commercial parallel high-

speed functions

LicenceOpen Source

Open Source

Commercial license.

Supported release with

indemnity

CRAN, MRO, MRS ComparisonMicrosoft

R Open

Microsoft

R Server

Page 33: L200 Microsoft R Server and SQL Server R Services Deck · PDF filefamily distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. ... • Developers can write code once

More efficient and multi-threaded math computation.

Benefits math intensive processing.

No benefit to program logic and data transform

CRAN R compared to Microsoft R Open

• Matrix calculation – upto 27x faster

• Matrix functions – upto 16x faster

• Programation – 0x faster

Page 34: L200 Microsoft R Server and SQL Server R Services Deck · PDF filefamily distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. ... • Developers can write code once