Combining Machine Learning frameworks with Apache Spark


Tim Hunter, Hadoop Summit, June 2016

About me
• Apache Spark contributor (since Spark 0.6)
• Software Engineer @ Databricks
• Ph.D. in Machine Learning @ UC Berkeley


About Databricks
• Founded by the team who created Apache Spark
• Offers a hosted service:
  – Apache Spark in the cloud
  – Notebooks
  – Cluster management
  – Production environment


Apache Spark
• The most active open-source project in big data

Spark MLlib
• Large-scale machine learning on Apache Spark

MLlib's Mission
MLlib's mission is to make practical machine learning easy and scalable.

• Easy to build machine learning applications
• Capable of learning from large-scale datasets
• Easy to integrate into existing workflows


Algorithm Coverage
• Classification
  – Logistic regression
  – Naive Bayes
  – Streaming logistic regression
  – Linear SVMs
  – Decision trees
  – Random forests
  – Gradient-boosted trees
  – Multilayer perceptron

• Regression
  – Ordinary least squares
  – Ridge regression
  – Lasso
  – Isotonic regression
  – Decision trees
  – Random forests
  – Gradient-boosted trees
  – Streaming linear methods
  – Generalized linear models

• Frequent itemsets
  – FP-growth
  – PrefixSpan


• Clustering
  – Gaussian mixture models
  – K-Means
  – Streaming K-Means
  – Latent Dirichlet Allocation
  – Power Iteration Clustering
  – Bisecting K-Means

• Statistics
  – Pearson correlation
  – Spearman correlation
  – Online summarization
  – Chi-squared test
  – Kernel density estimation
  – Kolmogorov–Smirnov test
  – Online hypothesis testing
  – Survival analysis

• Linear algebra
  – Local dense & sparse vectors & matrices
  – Normal equation for least squares
  – Distributed matrices
    • Block-partitioned matrix
    • Row matrix
    • Indexed row matrix
    • Coordinate matrix
  – Matrix decompositions

• Recommendation
  – Alternating Least Squares

• Feature extraction & selection
  – Word2Vec
  – Chi-Squared selection
  – Hashing term frequency
  – Inverse document frequency
  – Normalizer
  – Standard scaler
  – Tokenizer
  – One-Hot Encoder
  – StringIndexer
  – VectorIndexer
  – VectorAssembler
  – Binarizer
  – Bucketizer
  – ElementwiseProduct
  – PolynomialExpansion
  – Quantile discretizer
  – SQL transformer

• Model import/export
• Pipelines

List based on Spark 2.0

Outline
• ML workflows are complex
• Spark as a scheduler
• Integration with single-machine frameworks
• Unified cross-language ML pipelines with MLlib


ML workflows are complex
• Specify the pipeline
• Re-run on new data
• Inspect the results
• Tune the parameters

• Usually, each step of a pipeline is easier with one particular framework than with the others


ML Workflows are Complex


[Diagram: an example workflow in which three data sources feed feature extraction and feature transforms 1–3, two models are trained, combined in an ensemble, and evaluated]

Existing tools
• Scikit-learn
  – Excellent documentation
  – Standard for Python
• R
  – Lots of packages available
• Pandas
  – Very easy to use
• A lot of investment in tooling and education
  – How to integrate big data with these tools?


Common misconceptions
• Spark is for big data only
• Spark can only work with dedicated, distributed libraries


Spark as a scheduler
• A lot of tasks in ML are "embarrassingly parallel"
• Use Spark for data management and for scheduling

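As an illustration of this pattern (a minimal sketch, not code from the original talk), Spark can distribute a grid of independent runs of an ordinary single-machine function; run_task below is a hypothetical placeholder for any plain Python workload:

    from pyspark import SparkContext

    sc = SparkContext(appName="spark-as-a-scheduler")

    def run_task(task):
        # Hypothetical single-machine workload: any plain Python code
        # (NumPy, scikit-learn, ...) can run here.
        return (task, len(str(task)))  # dummy result

    tasks = ["task-%d" % i for i in range(100)]

    # Spark handles data movement and scheduling; the tasks themselves
    # are independent ("embarrassingly parallel").
    results = sc.parallelize(tasks, len(tasks)).map(run_task).collect()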

One example: learning digits
• Learning task: given a set of images, recognize the digits
• Standard benchmark dataset in computer vision built by NIST


Training Deep Learning algorithms
• Training a neural network is hard:
  – It is a sequential procedure (present one image after the other to learn from)
  – It can be sensitive to noise and to the order of images: robustness analysis is critical
  – Tuning the training parameters (descent rate, batch sizes, etc.) is very important. Otherwise, learning is too slow or gets stuck in a local minimum. A lot of heuristics are used in practice.


TensorFlow as a training library
• A lot of algorithms have been presented for this task; we will use TensorFlow, from Google:
  – Popular choice for neural network training and deep learning
  – Competitive performance
  – Easy to experiment with
  – Python interface makes it easy to integrate with Spark


Distributing TensorFlow computations

• Even if TF is used as a single-machine library, we get speedups from Spark


[Diagram: distributed cross-validation; training tasks for Model #1, Model #2, Model #3, … run in parallel and the best model is selected]

Distributing TensorFlow computations


[Diagram: the same distributed cross-validation with six model training tasks (Models #1–#6) running in parallel before the best model is selected]
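
A minimal sketch of this idea (assumed code, not the implementation shown in the talk): each Spark task trains one model configuration with a single-machine library such as TensorFlow. The helpers load_digits and train_and_evaluate are hypothetical placeholders.

    from pyspark import SparkContext

    sc = SparkContext(appName="distributed-model-selection")

    def load_digits():
        # Hypothetical helper standing in for the digits benchmark data.
        return [], []  # (images, labels)

    # Load the data once on the driver and broadcast it, so every task
    # trains on the same dataset with a different configuration.
    data = sc.broadcast(load_digits())

    def train_and_evaluate(config):
        # Placeholder for single-machine training (e.g. TensorFlow):
        # build a 2-layer network with the given layer size and update
        # rate, train it on data.value, and return a validation accuracy.
        layer_size, update_rate = config
        accuracy = 0.0  # dummy value
        return (config, accuracy)

    configs = [(size, rate)
               for size in (64, 128, 256)
               for rate in (0.001, 0.01, 0.1)]

    results = (sc.parallelize(configs, len(configs))
                 .map(train_and_evaluate)
                 .collect())

    best_config, best_accuracy = max(results, key=lambda r: r[1])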

Results
• Running a 2-layer neural network and testing different update rates and layer sizes


[Bar chart: comparison across 1 node, 2 nodes, and 13 nodes; y-axis from 0 to 12000]

Embedding deep learning in Spark
• The best known algorithms are essentially sequential during training
• Careful selection of training parameters is critical
• Spark can help iterate quickly and find a good set of parameters


A data scientist's wish list:
• Run original code on a production environment
• Use distributed data sources
• Use familiar APIs and libraries
• Distribute ML workload piece by piece
• Only distribute as needed
• Easily switch between local & distributed settings


Example: sentiment analysis


Given a review (text), predict the user’s rating.

Data from https://snap.stanford.edu/data/web-Amazon.html

ML Workflow


[Pipeline diagram: Load data → Extract features → Train model → Evaluate]

Review: "This product doesn't seem to be made to last…"  Rating: 2
feature_vector: [0.1 -1.3 0.23 … -0.74]  rating: 2.0
Learned regression: (review: String) => Double

Load Data


Data sources for DataFrames:
• built-in: JSON, JDBC, LIBSVM, and more …
• external packages

[Pipeline diagram: Load data → Extract features → Train model → Evaluate]
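
A minimal loading sketch in PySpark (the JSON path and the column names reviewText and overall are illustrative placeholders for the Amazon reviews data, not taken from the original slides):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("load-reviews").getOrCreate()

    # JSON is one of the built-in DataFrame data sources; the path is a
    # placeholder for the Amazon reviews dataset.
    reviews = spark.read.json("/data/amazon-reviews.json")

    # Other built-in sources use the same reader API, for example:
    #   spark.read.format("jdbc").options(...).load()
    #   spark.read.format("libsvm").load("/data/sample_libsvm_data.txt")

    reviews.select("reviewText", "overall").show(5)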

Extract Features

[Diagram: the Extract features stage expands into Tokenizer → Hashed term frequency within the pipeline Load data → Extract features → Train model → Evaluate]

Review: "This product doesn't seem to be made to last…"  Rating: 2
words: [this, product, doesn't, seem, to, …]
feature_vector: [0.1 -1.3 0.23 … -0.74]
Prediction: 3.0

[Diagram repeated with the model stage filled in: Load data → Tokenizer → Hashed term frequency → Linear regression → Evaluate]
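
A minimal sketch of this pipeline in pyspark.ml (the toy reviews DataFrame, its column names, and the ratings are illustrative stand-ins, not data from the talk):

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Tokenizer, HashingTF
    from pyspark.ml.regression import LinearRegression

    spark = SparkSession.builder.appName("sentiment-pipeline").getOrCreate()

    # Toy stand-in for the reviews DataFrame (columns: review, rating).
    reviews = spark.createDataFrame([
        ("This product doesn't seem to be made to last", 2.0),
        ("Works great, highly recommended", 5.0),
        ("Average quality for the price", 3.0),
    ], ["review", "rating"])

    # Extract features: split each review into words, then hash the words
    # into a fixed-size term-frequency vector.
    tokenizer = Tokenizer(inputCol="review", outputCol="words")
    hashing_tf = HashingTF(inputCol="words", outputCol="features")

    # Train model: predict the numeric rating from the feature vector.
    lr = LinearRegression(featuresCol="features", labelCol="rating")

    pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])
    model = pipeline.fit(reviews)
    predictions = model.transform(reviews)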

Our ML workflow


[Diagram: Feature Extraction → Model Training, wrapped in Cross Validation over the regularization parameter {0.0, 0.1, …}]

Cross validation


[Diagram: cross validation; the output of Feature Extraction feeds training runs for Model #1, #2, #3, …, and the best model is selected]

Example

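The example slide was presumably a live notebook demo; as a hedged sketch, a cross-validated version of the pipeline above could look like this (assuming the reviews DataFrame with review and rating columns from the previous sketch; the regularization values follow the earlier slide):

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Tokenizer, HashingTF
    from pyspark.ml.regression import LinearRegression
    from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
    from pyspark.ml.evaluation import RegressionEvaluator

    tokenizer = Tokenizer(inputCol="review", outputCol="words")
    hashing_tf = HashingTF(inputCol="words", outputCol="features")
    lr = LinearRegression(featuresCol="features", labelCol="rating")
    pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])

    # Grid over the regularization parameter, as on the earlier slide.
    param_grid = (ParamGridBuilder()
                  .addGrid(lr.regParam, [0.0, 0.1, 0.01])
                  .build())

    evaluator = RegressionEvaluator(labelCol="rating", metricName="rmse")

    cv = CrossValidator(estimator=pipeline,
                        estimatorParamMaps=param_grid,
                        evaluator=evaluator,
                        numFolds=3)

    # Each parameter setting is trained and evaluated; the best model wins.
    cv_model = cv.fit(reviews)   # reviews: the DataFrame from the sketch above
    best_model = cv_model.bestModel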

A data scientist's wish list:
• Run original code on a production environment
• Use distributed data sources
• Use familiar APIs and libraries
• Distribute ML workload piece by piece
• Only distribute as needed
• Easily switch between local & distributed settings


DataFrame-based API for MLlib
a.k.a. the "Pipelines" API, with utilities for constructing ML Pipelines

In 2.0, the DataFrame-based API will become the primary API for MLlib.
• Voted by the community
• org.apache.spark.ml, pyspark.ml

The RDD-based API will enter maintenance mode.
• Still maintained with bug fixes, but no new features
• org.apache.spark.mllib, pyspark.mllib


Why ML persistence?


[Diagram: Data Science prototypes in Python/R and creates a model; Software Engineering re-implements the model for production (Java) and deploys it]

Why ML persistence?


[Diagram: Data Science prototypes in Python/R and creates a Pipeline (extract raw features, transform features, select key features, fit multiple models, combine results to make a prediction); Software Engineering re-implements the Pipeline for production (Java) and deploys it]

• Extra implementation work
• Different code paths
• Synchronization overhead

With ML persistence...


[Diagram: Data Science prototypes in Python/R, creates a Pipeline, and persists it with model.save("s3n://..."); Software Engineering loads the Pipeline in Scala/Java with Model.load("s3n://…") and deploys it in production]
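
A minimal persistence sketch in Python, continuing the cross-validation example above (the S3 path is a placeholder; the same saved format can then be loaded from Scala or Java as shown on the slide):

    from pyspark.ml import PipelineModel

    # Persist the fitted pipeline (e.g. the cross-validated best model above).
    best_model.save("s3n://my-bucket/sentiment-pipeline")  # placeholder path

    # Later, possibly from another application, load it back and apply it.
    loaded = PipelineModel.load("s3n://my-bucket/sentiment-pipeline")
    predictions = loaded.transform(reviews)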

ML persistence status
Near-complete coverage in all Spark language APIs:
• Scala & Java: complete (29 feature transformers, 21 models)
• Python: complete except for 2 algorithms
• R: complete for existing APIs

Single underlying implementation of models

Exchangeable data format:
• JSON for metadata
• Parquet for model data (coefficients, etc.)


A data scientist's wish list:
• Run original code on a production environment
  – Directly apply learned pipelines
  – Use MLlib as the export format
• Use distributed data sources
  – Built-in Spark conversions
• Use familiar APIs and libraries
• Distribute ML workload piece by piece
  – Easy to distribute the most common ML tasks


What's next?
Prioritized items on the 2.1 roadmap JIRA (SPARK-15581):
• Critical feature completeness for the DataFrame-based API
  – Multiclass logistic regression
  – Statistics
• Python API parity & R API expansion
• Scaling & speed tuning for key algorithms: trees & ensembles

GraphFrames
• Release for Spark 2.0
• Speed improvements (join elimination, connected components)


Get started
• Get involved: roadmap JIRA (SPARK-15581) + mailing lists
• ML persistence blog post: http://databricks.com/blog/2016/05/31
• Try out the Apache Spark 2.0 preview release: http://databricks.com/try

