R, Scikit-Learn and Apache Spark ML - What difference does it make?

Villu Ruusmann
Openscoring OÜ

Overview

● Identifying long-standing, high-value opportunities in the applied predictive analytics domain

● Thinking about problems in API terms
● Providing solutions in API terms
● Developing and applying custom tools

+ A couple of tips if you're looking to buy or sell a VW Golf

The trade-off

"More data beats better algorithms"

The state of the art

Scaling out horizontally

Elements of reproducibility

Standardized, human- and machine-readable descriptions:

● Dataset
● Data pre- and post-processing steps:
  ○ From real-life input table (SQL, CSV) to model
  ○ From model to real-life output table
● Model
● Statistics

Calling R from within Apache Spark

1. Create and initialize R runtime
2. Format and upload input RDD; upload and execute R model; download output and parse into result RDD
3. Destroy R runtime
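The three steps above amount to a create/use/destroy lifecycle that is paid once per batch of rows rather than once per row. A minimal sketch of that pattern follows, in Python for brevity. `ExternalRuntime` is a hypothetical stand-in, not a real R binding, and its "model" simply sums each row.

```python
from typing import Iterable, Iterator, List


class ExternalRuntime:
    """Hypothetical stand-in for an external R runtime (not a real binding)."""

    def __init__(self) -> None:
        self.alive = True  # step 1: create and initialize

    def score(self, rows: List[List[float]]) -> List[float]:
        # Placeholder for "upload input, execute model, download output".
        # The stand-in "model" simply sums each row.
        if not self.alive:
            raise RuntimeError("runtime already destroyed")
        return [sum(row) for row in rows]

    def close(self) -> None:
        self.alive = False  # step 3: destroy


def score_partition(rows: Iterable[List[float]]) -> Iterator[float]:
    """Score one partition, paying the runtime start-up cost only once."""
    runtime = ExternalRuntime()
    try:
        batch = list(rows)               # step 2a: format the input
        yield from runtime.score(batch)  # step 2b: execute and parse the output
    finally:
        runtime.close()                  # always release the runtime


partitions = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0]]]
results = [list(score_partition(p)) for p in partitions]
print(results)
```

In PySpark, a function shaped like `score_partition` is the kind of callable that `RDD.mapPartitions` accepts, which is one way to keep the runtime start-up and tear-down out of the per-row path.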

Calling Scikit-Learn from within Apache Spark

1. Format input RDD (e.g. using Java NIO) as numpy.array
2. Invoke Scikit-Learn via Python/C API
3. Parse output numpy.array into result RDD
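Steps 1 and 3 above are essentially a serialization round trip: rows are flattened into one contiguous, row-major buffer of doubles, and the output buffer is parsed back into rows. The sketch below illustrates that round trip with the standard `struct` module only. The little-endian float64 layout and the helper names `pack_rows` and `unpack_rows` are assumptions of this example, not something the slides specify.

```python
import struct
from typing import List, Sequence


def pack_rows(rows: Sequence[Sequence[float]]) -> bytes:
    """Serialize rows into one contiguous, row-major buffer of float64 values."""
    n_cols = len(rows[0])
    if any(len(row) != n_cols for row in rows):
        raise ValueError("all rows must have the same number of columns")
    flat = [value for row in rows for value in row]
    # "<" = little-endian, "d" = 8-byte IEEE 754 double
    return struct.pack("<%dd" % len(flat), *flat)


def unpack_rows(buffer: bytes, n_cols: int) -> List[List[float]]:
    """Parse a contiguous float64 buffer back into a list of rows."""
    count = len(buffer) // 8
    flat = struct.unpack("<%dd" % count, buffer)
    return [list(flat[i:i + n_cols]) for i in range(0, count, n_cols)]


rows = [[1.0, 2.5, float("nan")], [4.0, 5.0, 6.0]]
buffer = pack_rows(rows)
restored = unpack_rows(buffer, n_cols=3)
print(len(buffer), restored[1])
```

On the Python side, a buffer with this layout could be viewed as an array with `numpy.frombuffer(buffer, dtype="<f8").reshape(-1, n_cols)`, so no per-value parsing would be needed there.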

API prioritization

Training << Maintenance ~ Deployment

One-time activity << Repeated activities
Short-term << Long-term

JPMML - Java PMML API

● Conversion API
● Maintenance API
● Execution API
  ○ Interpreted mode
  ○ Translated + compiled ("Transpiled") mode
● Serving API
  ○ Integrations with popular Big Data frameworks
  ○ REST web service

Calling JPMML-Spark from within Apache Spark

org.jpmml.spark.TransformerBuilder pmmlTransformerBuilder = ..;

org.apache.spark.ml.Transformer pmmlTransformer = pmmlTransformerBuilder.build();

org.apache.spark.sql.Dataset<Row> input = ..;

org.apache.spark.sql.Dataset<Row> result = pmmlTransformer.transform(input);

The case study

Predicting the price of VW Golf cars using GBT algorithms:

● 71 columns:
  ○ A continuous label: log(price)
  ○ Two string and four numeric categorical features
  ○ 64 binary-like (0/1) and numeric continuous features
● 270'458 rows:
  ○ 153'978 complete cases
  ○ 116'480 incomplete (i.e. with missing values) cases
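Two properties of this dataset drive the rest of the comparison: the label is log(price) rather than price, and a large share of the rows are incomplete. The sketch below shows both on a hypothetical three-row miniature; the column names and values are made up for illustration and are not taken from the actual dataset.

```python
import math
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]

# Hypothetical miniature of the dataset; column names and values are illustrative only.
rows: List[Row] = [
    {"price": 12000.0, "age": 3.0, "mileage": 45000.0},
    {"price": 5500.0, "age": 9.0, "mileage": None},  # incomplete case
    {"price": 20000.0, "age": 1.0, "mileage": 12000.0},
]


def is_complete(row: Row) -> bool:
    """A case is complete when none of its values is missing."""
    return all(value is not None for value in row.values())


complete = [row for row in rows if is_complete(row)]
incomplete = [row for row in rows if not is_complete(row)]

# The label is log(price); predictions are mapped back to prices with exp().
labels = [math.log(row["price"]) for row in rows]
restored = [math.exp(label) for label in labels]

print(len(complete), len(incomplete))
# The row counts reported on the slide add up to the total.
print(153978 + 116480 == 270458)
```

A toolkit without missing-value support can only train on the complete cases, which here would mean discarding roughly 43% of the rows (116'480 of 270'458).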

Gradient-Boosted Trees (GBTs)

R training and conversion API

#library("caret")
library("gbm")
library("r2pmml")

# Read the tab-separated dataset, treating "N/A" as a missing value
cars = read.csv("cars.tsv", sep = "\t", na.strings = "N/A")

# Convert categorical columns to factors
factor_cols = c("category", "colour", "ac", "fuel_type", "gearbox", "interior_color", "interior_type")
for(factor_col in factor_cols){
  cars[, factor_col] = as.factor(cars[, factor_col])
}

# Doesn't work with factors with missing values
#cars.gbm = train(price ~ ., data = cars, method = "gbm", na.action = na.pass, ..)
cars.gbm = gbm(price ~ ., data = cars, n.trees = 100, shrinkage = 0.1, interaction.depth = 6)

# Convert the fitted model to a PMML document
r2pmml(cars.gbm, "gbm.pmml")

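Scikit-Learn training and conversion API

import pandas

from sklearn_pandas import DataFrameMapper
from sklearn.model_selection import GridSearchCV
from sklearn2pmml import sklearn2pmml, PMMLPipeline

# Read the tab-separated dataset, treating "N/A" and "NA" as missing values
cars = pandas.read_csv("cars.tsv", sep = "\t", na_values = ["N/A", "NA"])

mapper = DataFrameMapper(..)
regressor = ..

# Note: newer Scikit-Learn releases expect fit parameters to be passed to fit() instead
tuner = GridSearchCV(regressor, param_grid = .., fit_params = ..)
tuner.fit(mapper.fit_transform(cars), cars["price"])

# Combine the fitted mapper and the best estimator into a single pipeline
pipeline = PMMLPipeline([
  ("mapper", mapper),
  ("regressor", tuner.best_estimator_)
])

# Convert the pipeline to a PMML document
sklearn2pmml(pipeline, "pipeline.pmml", with_repr = True)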

Dataset

| | R | LightGBM | XGBoost | Scikit-Learn | Apache Spark ML |
| Abstraction | data.frame | lgb.Dataset | xgb.DMatrix | numpy.array | RDD<Vector> |
| Memory layout | Contiguous, dense | Contiguous, dense(?) | Contiguous, dense/sparse | | Distributed, dense/sparse |
| Data type | Any | double | float | float or double | double |
| Categorical values | As-is (factor) | Encoded | Binarized | Binarized | Binarized |
| Missing values | Yes | Pseudo (NaN) | Pseudo (NaN) | No | No |
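The "Encoded" and "Binarized" entries for categorical values refer to two different numeric representations of the same column. The sketch below contrasts them in plain Python. The sample values are made up, and the treatment of the missing value (NaN when encoding, an all-zero row when binarizing) is an assumption of this example rather than the documented behaviour of any particular library.

```python
from typing import Dict, List, Optional

colours: List[Optional[str]] = ["Rot", "Blau", None, "Rot", "Grau"]

# Category vocabulary in sorted order, excluding the missing value.
categories = sorted({c for c in colours if c is not None})
index: Dict[str, int] = {category: i for i, category in enumerate(categories)}

nan = float("nan")

# "Encoded": a single numeric column holding the category index.
# Assumption: a missing value is kept as NaN.
encoded = [float(index[c]) if c is not None else nan for c in colours]

# "Binarized": one 0/1 column per category.
# Assumption: a missing value becomes an all-zero row.
binarized = [
    [1.0 if c == category else 0.0 for category in categories]
    for c in colours
]

print(categories)
print(encoded)
print(binarized[0])
```

Encoding keeps the column count fixed, while binarizing grows it by the number of distinct categories, which matters for wide datasets with high-cardinality features.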

LightGBM via Scikit-Learn

from sklearn_pandas import DataFrameMapper
from sklearn2pmml.preprocessing import PMMLLabelEncoder
from lightgbm import LGBMRegressor

# Label-encode each categorical column; pass continuous columns through unchanged
mapper = DataFrameMapper(
  [(factor_column, PMMLLabelEncoder()) for factor_column in factor_columns] +
  [(continuous_columns, None)]
)
transformed_cars = mapper.fit_transform(cars)

regressor = LGBMRegressor(n_estimators = 100, learning_rate = 0.1, max_depth = 6, num_leaves = 64)

# The encoded categorical columns come first in the mapper output
regressor.fit(transformed_cars, cars["price"],
  categorical_feature = list(range(0, len(factor_columns))))

XGBoost via Scikit-Learn

from sklearn_pandas import DataFrameMapper
from sklearn2pmml.preprocessing import PMMLLabelBinarizer
from xgboost.sklearn import XGBRegressor

# Binarize each categorical column; pass continuous columns through unchanged
mapper = DataFrameMapper(
  [(factor_column, PMMLLabelBinarizer()) for factor_column in factor_columns] +
  [(continuous_columns, None)]
)
transformed_cars = mapper.fit_transform(cars)

regressor = XGBRegressor(n_estimators = 100, learning_rate = 0.1, max_depth = 6)
regressor.fit(transformed_cars, cars["price"])

GBT algorithm (training)

| | R | LightGBM | XGBoost | Scikit-Learn | Apache Spark ML |
| Abstraction | gbm | LGBMRegressor | XGBRegressor | GradientBoostingRegressor | GBTRegressor |
| Parameterizability | Medium | High | High | Medium | Medium |
| Split type | Multi-way | Binary | Binary | Binary | Binary |
| Categorical values | "set contains" | "equals" | Pseudo ("equals") | Pseudo ("equals") | "equals" |
| Missing values | First-class | Pseudo | Pseudo | No | No |
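The "set contains" and "equals" entries describe how a split tests a categorical value, and the missing-values row describes what happens when there is nothing to test. The sketch below models the two styles as plain functions. The function names, the branch numbering and the default-child flag are assumptions of this example; only the colour names are borrowed from the slides.

```python
import math
from typing import List, Optional, Set


def is_missing(value: Optional[object]) -> bool:
    """Treat both None and NaN as a missing value."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def multiway_split(value: Optional[object], branches: List[Set[str]], missing_branch: int) -> int:
    """gbm-style: each branch owns a set of categories; missing values get their own branch."""
    if is_missing(value):
        return missing_branch
    for i, members in enumerate(branches):
        if value in members:  # "set contains" predicate
            return i
    raise ValueError("unseen category: %r" % (value,))


def binary_split(value: Optional[object], category: str, default_left: bool) -> str:
    """LightGBM/XGBoost-style: one "equals" test; missing values follow a default child."""
    if is_missing(value):
        return "left" if default_left else "right"
    return "left" if value == category else "right"


branches = [{"Rot", "Violett"}, {"Beige", "Blau", "Gold"}]
print(multiway_split("Gold", branches, missing_branch=2))
print(binary_split(float("nan"), "Rot", default_left=False))
```

A single "set contains" split can separate several categories at once, whereas an "equals" split needs one node per category, which tends to produce deeper trees for the same partitioning.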

gbm-style splits

<Node id="9">

</Node>

<Array type="string">Grün Rot Violett Weiß</Array>

</SimpleSetPredicate>

</Node>

<Array type="string">Beige Blau Braun Gelb Gold Grau Orange Schwarz Silber</Array>

</SimpleSetPredicate>

</Node>

LightGBM- and XGBoost-style splits (1/3)

<Node id="39" defaultChild="76">

</Node>

</Node>

LightGBM- and XGBoost-style splits (2/3)

<Node id="39">

</CompoundPredicate>

</Node>

</Node>

</Node>

LightGBM- and XGBoost-style splits (3/3)

<Node id="39">

</CompoundPredicate>

</Node>

<True/>

</Node>

Model measurement using JPMML

org.dmg.pmml.tree.TreeModel treeModel = ..;

treeModel.accept(new org.jpmml.model.visitors.AbstractVisitor(){

  private int count = 0; // Number of Node elements
  private int maxDepth = 0; // Max "nesting depth" of Node elements

  @Override
  public VisitorAction visit(org.dmg.pmml.tree.Node node){
    this.count++;

    // Count the enclosing Node elements, stopping at the first non-Node parent
    int depth = 0;
    for(org.dmg.pmml.PMMLObject parent : getParents()){
      if(!(parent instanceof org.dmg.pmml.tree.Node)) break;
      depth++;
    }

    this.maxDepth = Math.max(this.maxDepth, depth);

    return super.visit(node);
  }
});
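The same two measurements (number of Node elements and their maximum nesting depth) can be taken from the XML directly, without a PMML class model. The sketch below does so with Python's standard `xml.etree.ElementTree`. The tree fragment is hypothetical and deliberately namespace-free; real PMML documents declare a default namespace, so the tag comparison would need to account for it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free tree fragment; real PMML documents declare a
# default namespace, so their tag names would carry a "{namespace}" prefix.
TREE_XML = """
<TreeModel>
  <Node id="1">
    <True/>
    <Node id="2">
      <SimplePredicate field="age" operator="lessThan" value="5"/>
      <Node id="4"><True/></Node>
    </Node>
    <Node id="3">
      <True/>
    </Node>
  </Node>
</TreeModel>
"""


def measure(element, depth=0):
    """Return (number of Node elements, max nesting depth of Node elements)."""
    is_node = element.tag == "Node"
    count = 1 if is_node else 0
    max_depth = depth if is_node else 0
    for child in element:
        # Only Node elements add a nesting level; the root Node has depth 0.
        child_count, child_depth = measure(child, depth + 1 if is_node else depth)
        count += child_count
        max_depth = max(max_depth, child_depth)
    return count, max_depth


count, max_depth = measure(ET.fromstring(TREE_XML))
print(count, max_depth)
```

As in the visitor, depth counts only enclosing Node elements, so the root Node has depth 0 and the fragment above measures as 4 nodes with a maximum depth of 2.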

GBT algorithm (interpretation)

| | R | LightGBM | XGBoost | Scikit-Learn | Apache Spark ML |
| Feature importances | Direct | Direct | Transformed | Transformed | Transformed |
| Decision path | No | No(?) | No(?) | Transformed | Transformed |
| Model persistence | RDS (binary) | Proprietary (text) | Proprietary (binary, text) | Pickle (binary) | SER (binary) or JSON (text) |
| Model reusability | Good | Fair(?) | Good | Fair | Fair |
| Java API | No | No | Pseudo | No | Yes |
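One reading of "Transformed" feature importances is that they are reported per transformed (e.g. binarized) column rather than per original feature, so they have to be mapped back before they can be compared with "Direct" ones. The sketch below shows such a mapping under that reading. The importance values and the "feature=value" column naming scheme are hypothetical.

```python
from collections import defaultdict
from typing import Dict

# Hypothetical importances reported per transformed (binarized) column.
# The "feature=value" naming scheme is an assumption of this sketch.
transformed_importances: Dict[str, float] = {
    "colour=Rot": 4.0,
    "colour=Blau": 6.0,
    "gearbox=Manual": 3.0,
    "age": 20.0,
}


def aggregate(importances: Dict[str, float]) -> Dict[str, float]:
    """Sum the importances of binarized columns back onto their source feature."""
    totals: Dict[str, float] = defaultdict(float)
    for column, importance in importances.items():
        feature = column.split("=", 1)[0]  # continuous columns have no "=" part
        totals[feature] += importance
    return dict(totals)


original_importances = aggregate(transformed_importances)
print(original_importances)
```

Summation is only one possible aggregation; whether it is appropriate depends on how the toolkit defines importance in the first place.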

LightGBM feature importances

Age 936

Mileage 887

Performance 738

[Category] 205

New? 179

[Type of fuel] 170

[Type of interior] 167

Airbags? 130

[Colour] 129

[Type of gearbox] 105

Model execution using JPMML

// Load the PMML document
org.dmg.pmml.PMML pmml;
try(InputStream is = ..){
  pmml = org.jpmml.model.PMMLUtil.unmarshal(is);
}

org.jpmml.evaluator.Evaluator evaluator =
  new org.jpmml.evaluator.mining.MiningModelEvaluator(pmml);

org.jpmml.evaluator.InputField inputField = selectField(evaluator.getInputFields(), ..);
org.jpmml.evaluator.TargetField targetField = selectField(evaluator.getTargetFields(), ..);

// Evaluate the model over a range of values of a single input field
for(int value = min; value <= max; value += increment){
  Map<FieldName, FieldValue> arguments =
    Collections.singletonMap(inputField.getName(), inputField.prepare(value));

  Map<FieldName, ?> result = evaluator.evaluate(arguments);

  System.out.println(result.get(targetField.getName()));
}
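The loop above sweeps a single input field across a range while every other input is left unset, which gives a quick picture of how the prediction responds to that one field. The sketch below reproduces the idea in Python. `toy_model` is a hypothetical stand-in for a real evaluator, and the field name "age" and its coefficients are made up for illustration.

```python
from typing import Callable, Dict, List, Tuple


def sweep(
    evaluate: Callable[[Dict[str, float]], float],
    field: str,
    minimum: int,
    maximum: int,
    increment: int,
) -> List[Tuple[int, float]]:
    """Evaluate the model for each value of one input field, leaving all others unset."""
    results = []
    value = minimum
    while value <= maximum:
        results.append((value, evaluate({field: float(value)})))
        value += increment
    return results


def toy_model(arguments: Dict[str, float]) -> float:
    # Hypothetical stand-in for a real evaluator: predicted log(price) drops with age.
    return 10.0 - 0.1 * arguments.get("age", 0.0)


curve = sweep(toy_model, "age", 0, 10, 5)
print(curve)
```

A sweep like this only makes sense for models that tolerate missing inputs, since every field other than the swept one is absent from the arguments.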

Lessons (to be) learned

● Limits and limitations of individual APIs
● Vertical integration vs. horizontal integration:
  ○ All capabilities on a single platform
  ○ Specialized capabilities on specialized platforms

● Ease-of-use and robustness beat raw performance in most application scenarios

● "Conventions over configuration"

Q&A

villu@openscoring.io

https://github.com/jpmml
https://github.com/openscoring
https://groups.google.com/forum/#!forum/jpmml
