38
Commercial in Confidence Experts in numerical algorithms and HPC services Using Existing Numerical Libraries on Spark Brian Spector Chicago Spark Users Meetup June 24 th , 2015

Using Existing Numerical Libraries on Spark · 2019-02-06 · Commercial in Confidence 32 Numerical Excellence Commercial in Confidence NAG Functions that work on Spark Regression

  • Upload
    others

  • View
    7

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Using Existing Numerical Libraries on Spark · 2019-02-06 · Commercial in Confidence 32 Numerical Excellence Commercial in Confidence NAG Functions that work on Spark Regression

Commercial in Confidence

Experts in numerical algorithms and HPC services

Using Existing Numerical Libraries on Spark

Brian Spector

Chicago Spark Users Meetup

June 24th, 2015

Page 2: Using Existing Numerical Libraries on Spark · 2019-02-06 · Commercial in Confidence 32 Numerical Excellence Commercial in Confidence NAG Functions that work on Spark Regression

Commercial in Confidence 2Numerical ExcellenceCommercial in Confidence

Call algorithm with data in-memory

If data is small enough

Sample

Obtain estimates for parameters

Reformulate the problem

Map/Reduce existing functions across workers

How to use existing libraries on Spark

Page 3: Using Existing Numerical Libraries on Spark · 2019-02-06 · Commercial in Confidence 32 Numerical Excellence Commercial in Confidence NAG Functions that work on Spark Regression

Commercial in Confidence 3Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

NAG Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Page 4: Using Existing Numerical Libraries on Spark · 2019-02-06 · Commercial in Confidence 32 Numerical Excellence Commercial in Confidence NAG Functions that work on Spark Regression

Commercial in Confidence 4Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

NAG Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Page 5: Using Existing Numerical Libraries on Spark · 2019-02-06 · Commercial in Confidence 32 Numerical Excellence Commercial in Confidence NAG Functions that work on Spark Regression

Commercial in Confidence 5

The Numerical Algorithms Group

Founded in 1970 as a co-operative project out of academia in UK

Support and Maintain Numerical Library ~1700 Mathematical/Statistical Routines

NAG’s code is embedded in many vendor libraries (e.g. AMD, Intel)

Numerical Library

HPCServices

Numerical Services

Page 6: Using Existing Numerical Libraries on Spark · 2019-02-06 · Commercial in Confidence 32 Numerical Excellence Commercial in Confidence NAG Functions that work on Spark Regression

Commercial in Confidence 6

NAG Library Full Contents

Root Finding

Summation of Series

Quadrature

Ordinary Differential Equations

Partial Differential Equations

Numerical Differentiation

Integral Equations

Mesh Generation

Interpolation

Curve and Surface Fitting

Optimization

Approximations of Special Functions

Dense Linear Algebra

Sparse Linear Algebra

Correlation & Regression Analysis

Multivariate Methods

Analysis of Variance

Random Number Generators

Univariate Estimation

Nonparametric Statistics

Smoothing in Statistics

Contingency Table Analysis

Survival Analysis

Time Series Analysis

Operations Research

Page 7: Using Existing Numerical Libraries on Spark · 2019-02-06 · Commercial in Confidence 32 Numerical Excellence Commercial in Confidence NAG Functions that work on Spark Regression

Commercial in Confidence 7Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

NAG Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Page 8: Using Existing Numerical Libraries on Spark · 2019-02-06 · Commercial in Confidence 32 Numerical Excellence Commercial in Confidence NAG Functions that work on Spark Regression

Commercial in Confidence 8Numerical ExcellenceCommercial in Confidence

At NAG we have 40 years of algorithms

Linear Algebra, Regression, Optimization

Data is all in-memory call a compiled library

LAPACK/BLAS

MKL

void nag_1d_spline_interpolant (Integer m, const double x[], const double y[], Nag_Spline *spline, NagError *fail);

Efficient Algorithms

Page 9: Using Existing Numerical Libraries on Spark · 2019-02-06 · Commercial in Confidence 32 Numerical Excellence Commercial in Confidence NAG Functions that work on Spark Regression

Commercial in Confidence 9Numerical ExcellenceCommercial in Confidence

Algorithms take advantage of

AVX/AVX2

Multi-core

Modern programming languages and compilers (hopefully) take care of some efficiencies for us

Python numpy/scipy

-O2 flag for compilers

Users worry less about efficient computing and more about solving their problem.

Efficient Algorithms

Page 10: Using Existing Numerical Libraries on Spark · 2019-02-06 · Commercial in Confidence 32 Numerical Excellence Commercial in Confidence NAG Functions that work on Spark Regression

Commercial in Confidence 10Numerical ExcellenceCommercial in Confidence

…But what happens when it comes to Big Data?

x = (double*) malloc(10,000,000 * 10,000 * sizeof(double))

Past efficient algorithms break down

Need different ways of solving same problem

How do we use our existing functions (libraries) on Spark?

We must reformulate the problem

An example…

Big Data

Page 11: Using Existing Numerical Libraries on Spark · 2019-02-06 · Commercial in Confidence 32 Numerical Excellence Commercial in Confidence NAG Functions that work on Spark Regression

Commercial in Confidence 11Numerical ExcellenceCommercial in Confidence

General Problem

Given a set of input measurements 𝑥1𝑥2 …𝑥𝑝 and an

outcome measurement 𝑦, fit a linear model

𝑦 = 𝑋 𝐵 + ε

𝑦 =

𝑦1

⋮𝑦𝑝

𝑋 =

𝑥1,1 ⋯ 𝑥1,𝑝

⋮ ⋱ ⋮𝑥𝑛,1 ⋯ 𝑥𝑛,𝑝

Three ways of solving the same

problem

Linear Regression Example

Page 12: Using Existing Numerical Libraries on Spark · 2019-02-06 · Commercial in Confidence 32 Numerical Excellence Commercial in Confidence NAG Functions that work on Spark Regression

Commercial in Confidence 12Numerical ExcellenceCommercial in Confidence

Solution 1 (Optimal Solution)

𝑋 =

𝑥1,1 ⋯ 𝑥1,𝑝

⋮ ⋱ ⋮𝑥𝑛,1 ⋯ 𝑥𝑛,𝑝

= 𝑄𝑅∗ where 𝑅∗ =𝑅0

𝑅 is 𝑝 by 𝑝 Upper Triangular, 𝑄 orthogonal

𝑅 𝐵 = 𝑐1 where 𝑐 = 𝑄𝑇𝑦 …

… lots of linear algebra,

but we have an algorithm!

Linear Regression Example

Page 13: Using Existing Numerical Libraries on Spark · 2019-02-06 · Commercial in Confidence 32 Numerical Excellence Commercial in Confidence NAG Functions that work on Spark Regression

Commercial in Confidence 13Numerical ExcellenceCommercial in Confidence

Solution 2 (Normal Equations)

𝑋𝑇𝑋 𝐵 = 𝑋𝑇 𝑦

If we 3 independent variables

𝑋 =1 2 3⋮ ⋮ ⋮4 5 6

, 𝑦 =7⋮8

Linear Regression Example

Page 14: Using Existing Numerical Libraries on Spark · 2019-02-06 · Commercial in Confidence 32 Numerical Excellence Commercial in Confidence NAG Functions that work on Spark Regression

Commercial in Confidence 14Numerical ExcellenceCommercial in Confidence

𝑋𝑇𝑋 =1 ⋯ 42 ⋯ 53 ⋯ 6

1 2 3⋮ ⋮ ⋮4 5 6

=

𝑧1,1 𝑧1,2 𝑧1,3𝑧2,1 𝑧2,2 𝑧2,3𝑧3,1 𝑧3,2 𝑧3,3

𝐵 = 𝑋𝑇𝑋 −1𝑋𝑇 𝑦

This computation can be map out to slave nodes

Linear Regression Example

Page 15: Using Existing Numerical Libraries on Spark · 2019-02-06 · Commercial in Confidence 32 Numerical Excellence Commercial in Confidence NAG Functions that work on Spark Regression

Commercial in Confidence 15Numerical ExcellenceCommercial in Confidence

Solution 3 (Optimization)

min 𝑦 − 𝑋 𝐵2

Iterative over the data

Final answer is an approximate to

the solution

Can add constraints to variables

(LASSO)

MLlib Algorithms

Linear Regression Example

Page 16: Using Existing Numerical Libraries on Spark · 2019-02-06 · Commercial in Confidence 32 Numerical Excellence Commercial in Confidence NAG Functions that work on Spark Regression

Commercial in Confidence 16Numerical ExcellenceCommercial in Confidence

Machine Learning/Optimization

MLlib uses solution #3 for many applications

Classification

Regression

Gaussian Mixture Model

Kmeans

Slave nodes are used for callbacks to access data

Map, Reduce, Repeat

MLlib Algorithms

Page 17: Using Existing Numerical Libraries on Spark · 2019-02-06 · Commercial in Confidence 32 Numerical Excellence Commercial in Confidence NAG Functions that work on Spark Regression

Commercial in Confidence 17Numerical ExcellenceCommercial in Confidence

MLlib Algorithms

Master/Libraries

Slaves Slaves

Page 18: Using Existing Numerical Libraries on Spark · 2019-02-06 · Commercial in Confidence 32 Numerical Excellence Commercial in Confidence NAG Functions that work on Spark Regression

Commercial in Confidence 18Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

NAG Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Page 19: Using Existing Numerical Libraries on Spark · 2019-02-06 · Commercial in Confidence 32 Numerical Excellence Commercial in Confidence NAG Functions that work on Spark Regression

Commercial in Confidence 19Numerical ExcellenceCommercial in Confidence

Use Normal Equations (Solution 2) to compute Linear Regression on large dataset.

NAG Libraries were distributed across Master/Slave Nodes

Steps:

1. Read in data

2. Call NAG routine to compute sum-of-squares matrix on partitions (chucks)

Default partition size is 64mb

3. Call NAG routine to aggregate SSQ matrix together

4. Call NAG routine to compute optimal coefficients

NAG Example on Spark

Page 20: Using Existing Numerical Libraries on Spark · 2019-02-06 · Commercial in Confidence 32 Numerical Excellence Commercial in Confidence NAG Functions that work on Spark Regression

Commercial in Confidence 20Numerical ExcellenceCommercial in Confidence

Linear Regression Example

Master

NAG CALL

NAG Call

NAG CALL

Page 21: Using Existing Numerical Libraries on Spark · 2019-02-06 · Commercial in Confidence 32 Numerical Excellence Commercial in Confidence NAG Functions that work on Spark Regression

Commercial in Confidence 21Numerical ExcellenceCommercial in Confidence

Linear Regression Example Data

Label X1 X2 X3 X4

0.0 18.0 0.0 2.22 1

0.0 18.0 0.0 8.11 0

68050.9 42.3 12.1 2.32 1

86565.0 47.3 19.5 7.19 2

65151.5 47.3 7.4 1.68 0

78564.1 53.2 11.4 1.74 1

56556.7 34.9 10.7 6.84 0

Page 22: Using Existing Numerical Libraries on Spark · 2019-02-06 · Commercial in Confidence 32 Numerical Excellence Commercial in Confidence NAG Functions that work on Spark Regression

Commercial in Confidence 22Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

0.0 18.0 0.0 2.22 1

0.0 18.0 0.0 8.11 0

68050.9 42.3 12.1 2.32 1

86565.0 47.3 19.5 7.19 2

65151.5 47.3 7.4 1.68 0

78564.1 53.2 11.4 1.74 1

Master

Label X1 X2 X3 X4

0.0 18.0 0.0 2.22 1

0.0 18.0 0.0 8.11 0

68050.9 42.3 12.1 2.32 1

86565.0 47.3 19.5 7.19 2

65151.5 47.3 7.4 1.68 0

78564.1 53.2 11.4 1.74 1

56556.7 34.9 10.7 6.84 0

Slave 1

Slave 2

rdd.parallelize(data).cache()

Page 23: Using Existing Numerical Libraries on Spark · 2019-02-06 · Commercial in Confidence 32 Numerical Excellence Commercial in Confidence NAG Functions that work on Spark Regression

Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

0.0 18.0 0.0 2.22 1

0.0 18.0 0.0 8.11 0

68050.9 42.3 12.1 2.32 1

86565.0 47.3 19.5 7.19 2

65151.5 47.3 7.4 1.68 0

78564.1 53.2 11.4 1.74 1

Slave 1

Slave 2NAG_CALL(data)

Page 24: Using Existing Numerical Libraries on Spark · 2019-02-06 · Commercial in Confidence 32 Numerical Excellence Commercial in Confidence NAG Functions that work on Spark Regression

Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

0.0 18.0 0.0 2.22 1

0.0 18.0 0.0 8.11 0

68050.9 42.3 12.1 2.32 1

86565.0 47.3 19.5 7.19 2

65151.5 47.3 7.4 1.68 0

78564.1 53.2 11.4 1.74 1

Slave 1

Slave 2

0.0 18.0 0.0 2.22 1

0.0 18.0 0.0 8.11 0

68050.9 42.3 12.1 2.32 1

86565.0 47.3 19.5 7.19 2

65151.5 47.3 7.4 1.68 0

78564.1 53.2 11.4 1.74 1

Slave 1

Slave 2

FlatMapFunction

Page 25: Using Existing Numerical Libraries on Spark · 2019-02-06 · Commercial in Confidence 32 Numerical Excellence Commercial in Confidence NAG Functions that work on Spark Regression

Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence

finalpt = data.mapPartitions(new ComputeSSQ() ).reduce(new Function2());

static class ComputeSSQ implements FlatMapFunction<Iterator<LabeledPoint>, NAGData> {

@Override

public Iterable<NAGData> call(Iterator<LabeledPoint> iter) throws Exception {

…NAG CALL…

}

How it looks in Spark

Page 26: Using Existing Numerical Libraries on Spark · 2019-02-06 · Commercial in Confidence 32 Numerical Excellence Commercial in Confidence NAG Functions that work on Spark Regression

Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

0.0 18.0 0.0 2.22 1

0.0 18.0 0.0 8.11 0

68050.9 42.3 12.1 2.32 1

86565.0 47.3 19.5 7.19 2

65151.5 47.3 7.4 1.68 0

78564.1 53.2 11.4 1.74 1

Slave 1

Slave 2

Master

Page 27: Using Existing Numerical Libraries on Spark · 2019-02-06 · Commercial in Confidence 32 Numerical Excellence Commercial in Confidence NAG Functions that work on Spark Regression

Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence

Test 1

Data ranges in size from 2 GB – 64 GB on Amazon EC2 Cluster

Used an 8 slave xlarge cluster (16 GB RAM)

Test 2

Varied the number of slave nodes from 2 - 16

Used 16 GB of data to see how algorithms scale

NAG Example

Page 28: Using Existing Numerical Libraries on Spark · 2019-02-06 · Commercial in Confidence 32 Numerical Excellence Commercial in Confidence NAG Functions that work on Spark Regression

Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence

Cluster Utilization

Page 29: Using Existing Numerical Libraries on Spark · 2019-02-06 · Commercial in Confidence 32 Numerical Excellence Commercial in Confidence NAG Functions that work on Spark Regression

Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence

NAG Results of Linear Regression

Page 30: Using Existing Numerical Libraries on Spark · 2019-02-06 · Commercial in Confidence 32 Numerical Excellence Commercial in Confidence NAG Functions that work on Spark Regression

Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence

NAG Results of Scaling

Page 31: Using Existing Numerical Libraries on Spark · 2019-02-06 · Commercial in Confidence 32 Numerical Excellence Commercial in Confidence NAG Functions that work on Spark Regression

Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence

NAG and MLlib Results

Page 32: Using Existing Numerical Libraries on Spark · 2019-02-06 · Commercial in Confidence 32 Numerical Excellence Commercial in Confidence NAG Functions that work on Spark Regression

Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence

NAG Functions that work on Spark Regression

Linear regression (with constraints)

Logistic regression (with constraints)

Principal Component Analysis

Statistics Summary information (mean, variance, etc)

Correlation

Probabilities and deviates for normal, student-t, chi-squared, beta, and many more distributions

Random number generation

Optimization including Linear, nonlinear, quadratic, and sum of squares for the objective function

Constraints can be simple bounds, linear, or even nonlinear

These require wrapping in specific environment (java/python)

NAG on Spark

Page 33: Using Existing Numerical Libraries on Spark · 2019-02-06 · Commercial in Confidence 32 Numerical Excellence Commercial in Confidence NAG Functions that work on Spark Regression

Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence

Distributing libraries

spark-ec2/copy-dir

Setting environment variables

LD_LIBRARY_PATH Needs to be set as you submit the job

$ ./spark-submit --jars NAGJava.jar --conf "spark.executor.extraLibraryPath= ${Path-to-Libraries-on-Worker-Nodes}" simpleNAGExample.jar

NAG_KUSARI_FILE (NAG license file) Can be set in code via sc.setExecutorEnv

Debugging slave nodes

Partition sizes

Problems Encountered

Page 34: Using Existing Numerical Libraries on Spark · 2019-02-06 · Commercial in Confidence 32 Numerical Excellence Commercial in Confidence NAG Functions that work on Spark Regression

Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

NAG Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Page 35: Using Existing Numerical Libraries on Spark · 2019-02-06 · Commercial in Confidence 32 Numerical Excellence Commercial in Confidence NAG Functions that work on Spark Regression

Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence

Basic statistics

summary statistics

correlations

stratified sampling

hypothesis testing

random data generation

Classification and regression

linear models (SVMs, logistic regression, linear regression)

naive Bayes

decision trees

ensembles of trees (Random Forests and Gradient-Boosted Trees)

isotonic regression

Collaborative filtering

alternating least squares (ALS)

Clustering

k-means

Gaussian mixture

power iteration clustering (PIC)

latent Dirichlet allocation (LDA)

streaming k-means

Dimensionality reduction

singular value decomposition (SVD)

principal component analysis (PCA)

Feature extraction and transformation

Frequent pattern mining

FP-growth

Optimization (developer)

stochastic gradient descent

limited-memory BFGS (L-BFGS)

MLlib Algorithms

Page 36: Using Existing Numerical Libraries on Spark · 2019-02-06 · Commercial in Confidence 32 Numerical Excellence Commercial in Confidence NAG Functions that work on Spark Regression

Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence

DataFrames

MLlib pipeline

More and better optimizers

Improved APIs

SparkR

Contains all the components for better numerical computations Use existing R code on Spark!

Free

Huge open source community

Other Improvements

Page 37: Using Existing Numerical Libraries on Spark · 2019-02-06 · Commercial in Confidence 32 Numerical Excellence Commercial in Confidence NAG Functions that work on Spark Regression

Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence

Call algorithm with data in-memory

If data is small enough

Sample

Obtain estimates for parameters

Reformulate the problem

Map/Reduce existing functions across slaves

How to use existing libraries on Spark

Page 38: Using Existing Numerical Libraries on Spark · 2019-02-06 · Commercial in Confidence 32 Numerical Excellence Commercial in Confidence NAG Functions that work on Spark Regression

Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence

Thanks!

For more information:

Email [email protected]

Check out The NAG Blog

http://blog.nag.com/2015/02/advanced-analytics-on-apache-spark.html

NAG and Apache Spark