Combining Machine Learning frameworks with Apache Spark. Tim Hunter, Hadoop Summit, June 2016


Page 1: Combining Machine Learning frameworks with Apache Spark

Combining Machine Learning frameworks with Apache Spark
Tim Hunter
Hadoop Summit
June 2016

Page 2: Combining Machine Learning frameworks with Apache Spark

About me
• Apache Spark contributor (since Spark 0.6)
• Software Engineer @ Databricks
• Ph.D. in Machine Learning @ UC Berkeley

Page 3: Combining Machine Learning frameworks with Apache Spark

About Databricks
• Founded by the team who created Apache Spark
• Offers a hosted service:
  – Apache Spark in the cloud
  – Notebooks
  – Cluster management
  – Production environment

Page 4: Combining Machine Learning frameworks with Apache Spark

Apache Spark
• The most active open-source project in big data

Page 5: Combining Machine Learning frameworks with Apache Spark

Spark MLlib
• Large-scale machine learning on Apache Spark

Page 6: Combining Machine Learning frameworks with Apache Spark

MLlib’s Mission
MLlib’s mission is to make practical machine learning easy and scalable.
• Easy to build machine learning applications
• Capable of learning from large-scale datasets
• Easy to integrate into existing workflows

Page 7: Combining Machine Learning frameworks with Apache Spark

Algorithm Coverage

Classification
• Logistic regression
• Naive Bayes
• Streaming logistic regression
• Linear SVMs
• Decision trees
• Random forests
• Gradient-boosted trees
• Multilayer perceptron

Regression
• Ordinary least squares
• Ridge regression
• Lasso
• Isotonic regression
• Decision trees
• Random forests
• Gradient-boosted trees
• Streaming linear methods
• Generalized Linear Models

Frequent itemsets
• FP-growth
• PrefixSpan

Clustering
• Gaussian mixture models
• K-Means
• Streaming K-Means
• Latent Dirichlet Allocation
• Power Iteration Clustering
• Bisecting K-Means

Statistics
• Pearson correlation
• Spearman correlation
• Online summarization
• Chi-squared test
• Kernel density estimation
• Kolmogorov–Smirnov test
• Online hypothesis testing
• Survival analysis

Linear algebra
• Local dense & sparse vectors & matrices
• Normal equation for least squares
• Distributed matrices
  – Block-partitioned matrix
  – Row matrix
  – Indexed row matrix
  – Coordinate matrix
• Matrix decompositions

Recommendation
• Alternating Least Squares

Feature extraction & selection
• Word2Vec
• Chi-Squared selection
• Hashing term frequency
• Inverse document frequency
• Normalizer
• Standard scaler
• Tokenizer
• One-Hot Encoder
• StringIndexer
• VectorIndexer
• VectorAssembler
• Binarizer
• Bucketizer
• ElementwiseProduct
• PolynomialExpansion
• Quantile discretizer
• SQL transformer

Model import/export
Pipelines

List based on Spark 2.0

Page 8: Combining Machine Learning frameworks with Apache Spark

Outline
• ML workflows are complex
• Spark as a scheduler
• Integration with single-machine frameworks
• Unified cross-language ML pipelines with MLlib

Page 9: Combining Machine Learning frameworks with Apache Spark

ML workflows are complex
• Specify the pipeline
• Re-run on new data
• Inspect the results
• Tune the parameters
• Usually, each step of a pipeline is easier with one framework

Page 10: Combining Machine Learning frameworks with Apache Spark

ML Workflows are Complex

[Diagram: an example workflow combining three data sources, two feature-extraction steps, feature transforms 1–3, the training of models 1 and 2, an ensemble, and a final evaluation step.]

Page 11: Combining Machine Learning frameworks with Apache Spark

Existing tools
• Scikit-learn
  – Excellent documentation
  – Standard for Python
• R
  – Lots of packages available
• Pandas
  – Very easy to use
• A lot of investment in tooling and education
  – How to integrate big data with these tools?

Page 12: Combining Machine Learning frameworks with Apache Spark

Common misconceptions
• Spark is for big data only
• Spark can only work with dedicated, distributed libraries

Page 13: Combining Machine Learning frameworks with Apache Spark

Spark as a scheduler
• A lot of tasks in ML are “embarrassingly parallel”
• Use Spark for data management and for scheduling (a minimal sketch follows)
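To make this concrete, here is a minimal, hypothetical sketch of the pattern in PySpark: each task is an ordinary single-machine function, and Spark is used only to fan the work out across the cluster and collect the results.

# A minimal sketch of the "Spark as a scheduler" pattern.
# `run_task` is a hypothetical stand-in for any single-machine
# computation (training a model, scoring, a simulation, ...).
from pyspark import SparkContext

def run_task(task_id):
    # Runs independently on one executor; no communication needed.
    return task_id, task_id ** 2

sc = SparkContext(appName="embarrassingly-parallel")
# One partition per task, so each runs as its own Spark task.
results = sc.parallelize(range(100), 100).map(run_task).collect()
sc.stop()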

Page 14: Combining Machine Learning frameworks with Apache Spark

One example: learning digits
• Learning task: given a set of images, recognize the digits they contain
• A standard benchmark dataset in computer vision, built by NIST

Page 15: Combining Machine Learning frameworks with Apache Spark

Training Deep Learning algorithms

• Training a neural network is hard:
  – It is a sequential procedure (present one image after another to learn from)
  – It can be sensitive to noise and to the order of images: robustness analysis is critical
  – Tuning the training parameters (descent rate, batch sizes, etc.) is very important; otherwise, learning is too slow or gets stuck in a local minimum. A lot of heuristics are used in practice.

Page 16: Combining Machine Learning frameworks with Apache Spark

TensorFlow as a training library
• Many algorithms have been proposed for this task; we will use TensorFlow, from Google:
  – Popular choice for neural network training and deep learning
  – Competitive performance
  – Easy to experiment with
  – Python interface makes it easy to integrate with Spark

Page 17: Combining Machine Learning frameworks with Apache Spark

Distributing TensorFlow computations

• Even if TensorFlow is used as a single-machine library, we get speedups from Spark, as sketched below

[Diagram: distributed cross-validation — Spark schedules the training of Model #1, Model #2, Model #3, … in parallel and keeps the best model.]
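Here is a sketch of how this cross-validation could be driven from PySpark. `train_and_evaluate` is a hypothetical helper: in the real workflow it would run a full single-machine TensorFlow training on the executor; a dummy score stands in for that here so the sketch stays self-contained and runnable.

# Hypothetical sketch: grid search over training parameters, with
# each single-machine training run scheduled by Spark as one task.
from pyspark import SparkContext

def train_and_evaluate(params):
    # In the real workflow: build a TensorFlow graph, train on a
    # local copy of the digits, and return a validation score.
    layer_size, learning_rate = params
    score = -abs(layer_size - 512) - 1000.0 * abs(learning_rate - 0.01)
    return score, params

sc = SparkContext(appName="tf-grid-search-sketch")
grid = [(layer_size, lr)
        for layer_size in (128, 256, 512, 1024)
        for lr in (0.001, 0.01, 0.1)]
# One Spark task per parameter combination; tuples sort by score.
best_score, best_params = (sc.parallelize(grid, len(grid))
                           .map(train_and_evaluate)
                           .max())
print("best parameters:", best_params)
sc.stop()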

Page 18: Combining Machine Learning frameworks with Apache Spark

Distributing TensorFlow computations

[Diagram: the same distributed cross-validation, now scheduling six training runs (Model #1 through Model #6) across the cluster before selecting the best model.]

Page 19: Combining Machine Learning frameworks with Apache Spark

Results
• Running a 2-layer neural network, and testing different update rates and layer sizes

[Bar chart: results for 1 node, 2 nodes, and 13 nodes; y-axis from 0 to 12000.]

Page 20: Combining Machine Learning frameworks with Apache Spark

Embedding deep learning in Spark

• Best known algorithms are essentially sequential during training
• Careful selection of training parameters is critical
• Spark can help to iterate quickly and find a good set of parameters

Page 21: Combining Machine Learning frameworks with Apache Spark

A data scientist’s wish list:
• Run original code on a production environment
• Use distributed data sources
• Use familiar APIs and libraries
• Distribute ML workload piece by piece
• Only distribute as needed
• Easily switch between local & distributed settings

Page 22: Combining Machine Learning frameworks with Apache Spark

Example: sentiment analysis


Given a review (text), predict the user’s rating.

Data from https://snap.stanford.edu/data/web-Amazon.html

Page 23: Combining Machine Learning frameworks with Apache Spark

ML Workflow

[Pipeline diagram: Load data → Extract features → Train model → Evaluate.]

Review: This product doesn't seem to be made to last… Rating: 2
feature_vector: [0.1 -1.3 0.23 … -0.74] rating: 2.0

Regression: (review: String) => Double

Page 24: Combining Machine Learning frameworks with Apache Spark

Load Data

[Diagram: data sources for DataFrames — built-in ({ JSON }, JDBC, LIBSVM, and more) plus external packages — feeding the Load data step of the pipeline (Load data → Extract features → Train model → Evaluate).]
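A sketch of this step for the sentiment example, assuming the Amazon reviews are stored as JSON records with fields named reviewText and overall (hypothetical field names, placeholder path):

# Load the reviews into a DataFrame. The same spark.read API covers
# JSON, JDBC, Parquet, LIBSVM, and external data source packages.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sentiment").getOrCreate()
reviews = spark.read.json("s3n://my-bucket/amazon-reviews/*.json")
data = reviews.selectExpr("reviewText AS review",
                          "CAST(overall AS DOUBLE) AS rating")
data.show(5)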

Page 25: Combining Machine Learning frameworks with Apache Spark

Extract Features

[Pipeline diagram: Load data → Tokenizer → Hashed term frequency → Train model → Evaluate.]

Review: This product doesn't seem to be made to last… Rating: 2
words: [this, product, doesn't, seem, to, …]
feature_vector: [0.1 -1.3 0.23 … -0.74]
Prediction: 3.0

Page 26: Combining Machine Learning frameworks with Apache Spark

Extract Features

[The same pipeline with the model step made concrete: Load data → Tokenizer → Hashed term frequency → Linear regression → Evaluate; the review becomes words, then a feature vector, then a predicted rating.]
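A sketch of these stages with the DataFrame-based API, reusing the data DataFrame (and its column names) assumed in the loading sketch above:

# Tokenize the review text, hash the words into term-frequency
# vectors, and fit a linear regression on the resulting features.
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.regression import LinearRegression

tokenizer = Tokenizer(inputCol="review", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LinearRegression(labelCol="rating", featuresCol="features")

pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
model = pipeline.fit(data)           # train
predictions = model.transform(data)  # adds a "prediction" column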

Page 27: Combining Machine Learning frameworks with Apache Spark

Our ML workflow

[Diagram: Feature Extraction → Model Training, wrapped in Cross Validation over the regularization parameter {0.0, 0.1, …}.]

Page 28: Combining Machine Learning frameworks with Apache Spark

Cross validation

[Diagram: cross validation — Feature Extraction feeds the training of Model #1, Model #2, Model #3, …, and the best model is selected.]
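A sketch of this step, reusing the pipeline and lr stages from the previous sketch; the grid sweeps the regularization parameter shown on the slide:

# Wrap the pipeline in a CrossValidator that sweeps the
# regularization parameter and keeps the best model.
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator

grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.0, 0.1, 0.01])
        .build())
cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=grid,
                    evaluator=RegressionEvaluator(labelCol="rating"),
                    numFolds=3)
cvModel = cv.fit(data)  # selects the best setting by 3-fold CV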

Page 29: Combining Machine Learning frameworks with Apache Spark

Example


Page 30: Combining Machine Learning frameworks with Apache Spark

A data scientist’s wish list:
• Run original code on a production environment
• Use distributed data sources
• Use familiar APIs and libraries
• Distribute ML workload piece by piece
• Only distribute as needed
• Easily switch between local & distributed settings

Page 31: Combining Machine Learning frameworks with Apache Spark

DataFrame-based API for MLlib
a.k.a. the “Pipelines” API, with utilities for constructing ML Pipelines

In 2.0, the DataFrame-based API will become the primary API for MLlib.
• Voted by community
• org.apache.spark.ml, pyspark.ml

The RDD-based API will enter maintenance mode.
• Still maintained with bug fixes, but no new features
• org.apache.spark.mllib, pyspark.mllib

Page 32: Combining Machine Learning frameworks with Apache Spark

Why ML persistence?

Data Science:
• Prototype (Python/R)
• Create model

Software Engineering:
• Re-implement model for production (Java)
• Deploy model

Page 33: Combining Machine Learning frameworks with Apache Spark

Why ML persistence?

Data Science:
• Prototype (Python/R)
• Create Pipeline
  – Extract raw features
  – Transform features
  – Select key features
  – Fit multiple models
  – Combine results to make prediction

Software Engineering:
• Re-implement Pipeline for production (Java)
• Deploy Pipeline

The costs of this handoff:
• Extra implementation work
• Different code paths
• Synchronization overhead

Page 34: Combining Machine Learning frameworks with Apache Spark

With ML persistence...

Data Science:
• Prototype (Python/R)
• Create Pipeline
• Persist model or Pipeline: model.save("s3n://...") (see the sketch below)

Software Engineering:
• Load Pipeline (Scala/Java): Model.load("s3n://…")
• Deploy in production
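A sketch of the round trip from Python, assuming the fitted pipeline model from the earlier sketches and a placeholder S3 path; the same saved files can be loaded from Scala or Java:

# Persist the fitted pipeline: JSON metadata plus Parquet model data.
from pyspark.ml import PipelineModel

model.write().overwrite().save("s3n://my-bucket/sentiment-model")

# Later, possibly from another application or language:
same_model = PipelineModel.load("s3n://my-bucket/sentiment-model")
predictions = same_model.transform(data)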

Page 35: Combining Machine Learning frameworks with Apache Spark

ML persistence status

Near-complete coverage in all Spark language APIs
• Scala & Java: complete (29 feature transformers, 21 models)
• Python: complete except for 2 algorithms
• R: complete for existing APIs

Single underlying implementation of models

Exchangeable data format
• JSON for metadata
• Parquet for model data (coefficients, etc.)

Page 36: Combining Machine Learning frameworks with Apache Spark

A data scientist’s wish list:
• Run original code on a production environment
  – Directly apply learned pipelines
  – Use MLlib as export format
• Use distributed data sources
  – Built-in Spark conversions
• Use familiar APIs and libraries
• Distribute ML workload piece by piece
  – Easy to distribute the most common ML tasks

Page 37: Combining Machine Learning frameworks with Apache Spark

What’s next?
Prioritized items on the 2.1 roadmap JIRA (SPARK-15581):
• Critical feature completeness for the DataFrame-based API
  – Multiclass logistic regression
  – Statistics
• Python API parity & R API expansion
• Scaling & speed tuning for key algorithms: trees & ensembles

GraphFrames
• Release for Spark 2.0
• Speed improvements (join elimination, connected components)

Page 38: Combining Machine Learning frameworks with Apache Spark

Get started
• Get involved: roadmap JIRA (SPARK-15581) + mailing lists
• ML persistence blog post: http://databricks.com/blog/2016/05/31
• Try out the Apache Spark 2.0 preview release: http://databricks.com/try