High-Performance Distributed Machine Learning in Heterogeneous Compute Environments
Celestine Dünner; Martin Jaggi (EPFL); Thomas Parnell, Kubilay Atasu, Manolis Sifalakis, Dimitrios Sarigiannis, Haris Pozidis (IBM Research)



Page 1:

High-Performance Distributed Machine Learning in Heterogeneous Compute Environments

Celestine Dünner

Martin Jaggi (EPFL)
Thomas Parnell, Kubilay Atasu, Manolis Sifalakis, Dimitrios Sarigiannis, Haris Pozidis (IBM Research)

Page 2:

Distributed Training of Large-Scale Linear Models

Motivation of Our Work

• fast training
• interpretable models
• training on large-scale datasets

Page 3:

How do the infrastructure and the implementation impact the performance of the algorithm?

Motivation of Our Work

Choose an Algorithm

Distributed Training of Large-Scale Linear Models

Choose an Implementation

Choose an Infrastructure

?

Page 4:

How can algorithms be optimized and implemented to achieve optimal performance on a given system?

Motivation of Our Work

Choose an Algorithm

Distributed Training of Large-Scale Linear Models

Choose an Implementation

Choose an Infrastructure

?

Page 5:

Algorithmic Challenge of Distributed Learning

$\min_{\mathbf{w}} f(A^\top \mathbf{w}) + g(\mathbf{w})$

Page 6:

Algorithmic Challenge of Distributed Learning

$\min_{\mathbf{w}} f(A^\top \mathbf{w}) + g(\mathbf{w})$

[Figure: data matrix partitioned across workers 1–4; local models are aggregated]

• The more frequently you exchange information, the faster your model converges

• Communication over the network can be very expensive

Trade-off

Page 7:

H* depends on:
• system
• implementation / framework

CoCoA Framework [SMITH, JAGGI, TAKAC, MA, FORTE, HOFMANN, JORDAN, 2013-2015]

[Figure: each worker performs H steps of its local solver between communication rounds]

$\min_{\mathbf{w}} f(A^\top \mathbf{w}) + g(\mathbf{w})$

Tunable hyper-parameter H
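To make the structure of the framework concrete, here is a minimal single-process Python sketch of the CoCoA outer loop for ridge regression: the coordinates are partitioned across K simulated workers, each worker runs H steps of a local coordinate-descent solver against a stale copy of the shared vector, and the updates are aggregated once per communication round. The function name, the choice of ridge regression as the local problem, and the conservative averaging are illustrative assumptions, not the reference implementation.

```python
import numpy as np

def cocoa_ridge(X, y, lam, K=4, H=100, rounds=20, seed=0):
    """Sketch of the CoCoA outer loop for
    min_w 1/(2n) ||X w - y||^2 + lam/2 ||w||^2, coordinates split over K workers."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    v = X @ w                          # shared vector, communicated once per round
    col_sq = (X ** 2).sum(axis=0)
    parts = np.array_split(rng.permutation(d), K)

    for _ in range(rounds):
        deltas = []
        for coords in parts:           # in a real deployment the workers run in parallel
            w_loc, v_loc = w.copy(), v.copy()   # each worker sees a stale copy of v
            for _ in range(H):         # H steps of the local solver (tunable!)
                j = int(rng.choice(coords))
                num = X[:, j] @ (y - v_loc + X[:, j] * w_loc[j])
                w_new = num / (col_sq[j] + n * lam)
                v_loc += (w_new - w_loc[j]) * X[:, j]
                w_loc[j] = w_new
            deltas.append((coords, w_loc[coords] - w[coords]))
        gamma = 1.0 / K                # conservative averaging of the local updates
        for coords, dw in deltas:      # aggregate local models
            w[coords] += gamma * dw
            v += gamma * (X[:, coords] @ dw)
    return w
```

Increasing H buys fewer, more expensive communication rounds at the price of staler local information, which is exactly the trade-off the next slides quantify.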

Page 8:

Implementation: Frameworks for Distributed Computing

MPI: High-Performance Computing Framework

• Requires advanced system knowledge
• C++
• Good performance

Apache Spark*: Open-Source Cloud Computing Framework

• Easy to use
• Powerful APIs: Python, Scala, Java, R
• Poorly understood overheads

* http://spark.apache.org/

Apache Spark, Spark and the Spark logo are trademarks of the Apache Software Foundation (ASF). Other product or service names may be trademarks or service marks of IBM or other companies.

Designed for different purposes, hence different characteristics.

Page 9:

Different implementations of CoCoA:

• (A) Spark reference implementation*
• (B) pySpark implementation
• (C) MPI implementation

Offloading the local solver to C++:

• (A*) Spark+C
• (B*) pySpark+C

(A*), (B*), (C) execute identical C++ code.

100 iterations of CoCoA for fixed H; webspam dataset, 8 Spark workers.

* https://github.com/gingsmith/cocoa

Page 10:

Communication-Computation Tradeoff

Understanding the characteristics of the framework and correctly adapting the algorithm can make a difference of orders of magnitude (here 6x) in performance!

webspam dataset, 8 Spark workers

Page 11:

To the designer: strive to design flexible algorithms that can be adapted to system characteristics.

To the user: be aware that machine learning algorithms need to be tuned to achieve good performance.

C. Dünner, T. Parnell, K. Atasu, M. Sifalakis, H. Pozidis, “Understanding and Optimizing the Performance of Distributed Machine Learning Applications on Apache Spark”, IEEE International Conference on Big Data, Boston, 2017.

Page 12:

[Figure: each worker performs H steps of its local solver]

Which local solver should we use?

Page 13:

Stochastic Primal-Dual Coordinate Descent Methods

$\min_{\mathbf{w}} f(A^\top \mathbf{w}) + \sum_i g_i(w_i)$, with $f$ smooth
(Ridge Regression, Lasso, Logistic Regression, ...)

$\min_{\boldsymbol{\alpha}} f^*(\boldsymbol{\alpha}) + \sum_i g_i^*(-A_{:i}^\top \boldsymbol{\alpha})$, with $f^*$ strongly convex
($L_2$-regularized SVM, Ridge Regression, $L_2$-regularized Logistic Regression, ...)

good convergence properties

problem: cannot leverage full power of modern CPUs or GPUs

Ridge Regression: $\min_{\mathbf{w}} \frac{1}{2n}\|A^\top \mathbf{w} - \mathbf{y}\|_2^2 + \frac{\lambda}{2}\|\mathbf{w}\|_2^2$

Lasso: $\min_{\mathbf{w}} \frac{1}{2n}\|A^\top \mathbf{w} - \mathbf{y}\|_2^2 + \lambda\|\mathbf{w}\|_1$
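As a concrete instance of the primal-dual structure above (a worked mapping added here for illustration; it is not on the original slide), Ridge Regression fits the template with

$f(\mathbf{u}) = \tfrac{1}{2n}\|\mathbf{u} - \mathbf{y}\|_2^2 \;(\text{smooth}), \qquad g_i(w_i) = \tfrac{\lambda}{2} w_i^2$
$f^*(\mathbf{v}) = \tfrac{n}{2}\|\mathbf{v}\|_2^2 + \mathbf{y}^\top \mathbf{v} \;(\text{strongly convex}), \qquad g_i^*(u) = \tfrac{u^2}{2\lambda}$

so the same model appears in both problem classes listed above.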

These solvers are inherently sequential, which motivates asynchronous implementations.

Page 14:

Asynchronous Stochastic Algorithms

$\min_{\mathbf{w}} f(A^\top \mathbf{w}) + \sum_i g_i(w_i)$

Parallelized over cores: every core updates a dedicated subset of coordinates.

Write collision on shared vector

• Recompute shared vector: Liu et al., “AsySCD” (JMLR’15)
• Memory locking: K. Tran et al., “Scaling up SDCA” (SIGKDD’15)
• Live with undefined behavior: C.-J. Hsieh et al., “PASSCoDe” (ICML’15)

!

Page 15:

Asynchronous Stochastic Algorithms

$\min_{\mathbf{w}} f(A^\top \mathbf{w}) + \sum_i g_i(w_i)$

Parallelized over cores: every core updates a dedicated subset of coordinates.

Write collision on shared vector

• Recompute shared vector: Liu et al., “AsySCD” (JMLR’15)
• Memory locking: K. Tran et al., “Scaling up SDCA” (SIGKDD’15)
• Live with undefined behavior: C.-J. Hsieh et al., “PASSCoDe” (ICML’15)

!

[Figure: Ridge Regression on webspam]
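To make the write-collision issue concrete, here is a minimal Python sketch of the "live with undefined behavior" (PASSCoDe-style) strategy for the ridge objective above: every thread owns a subset of coordinates, but all threads read and update the same shared vector without any locking. The names (async_ridge_cd, n_threads, epochs) are illustrative assumptions, not code from the talk.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def async_ridge_cd(X, y, lam, n_threads=4, epochs=10, seed=0):
    """Asynchronous parallel coordinate descent for
    min_w 1/(2n) ||X w - y||^2 + lam/2 ||w||^2.
    The shared vector v = X w is read and written by all threads without a lock,
    so concurrent updates can collide."""
    n, d = X.shape
    w = np.zeros(d)
    v = X @ w                          # shared vector, unprotected
    col_sq = (X ** 2).sum(axis=0)

    def worker(coords):
        rng = np.random.default_rng(seed + int(coords[0]))
        for _ in range(epochs * len(coords)):
            j = int(rng.choice(coords))
            # exact minimization over w_j using a possibly stale view of v
            num = X[:, j] @ (y - v + X[:, j] * w[j])
            w_new = num / (col_sq[j] + n * lam)
            delta = w_new - w[j]
            w[j] = w_new
            v += delta * X[:, j]       # not atomic: concurrent writes may be lost

    blocks = np.array_split(np.arange(d), n_threads)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(worker, blocks))
    return w
```

NumPy can release the GIL inside these vectorized operations, so the in-place updates of v really may interleave; the recompute and memory-locking strategies above remove that hazard at the cost of extra work or synchronization.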

Page 16:

2-level Parallelism of GPUs

1st level of parallelism:

• A GPU consists of streaming multiprocessors
• Thread blocks get assigned to multiprocessors and are executed asynchronously

2nd level of parallelism:

• Each thread block consists of up to 1024 threads
• Threads are grouped into warps (32 threads) which are executed as SIMD operations

[Figure: GPU with streaming multiprocessors SM1–SM8 and main memory; thread block i runs on one multiprocessor with its own shared memory]

Page 17:

GPU Acceleration

A Twice Parallel Asynchronous Stochastic Coordinate Descent (TPA-SCD) Algorithm

T. Parnell, C. Dünner, K. Atasu, M. Sifalakis, H. Pozidis, “Large-Scale Stochastic Learning Using GPUs”, IEEE International Parallel and Distributed Processing Symposium (IPDPS), Lake Buena Vista, FL, 2017.


1. Thread blocks are executed in parallel, asynchronously updating one coordinate each.

2. Update computation within a thread block is interleaved to ensure memory locality within a warp, and local memory is used to accumulate partial sums.

3. The atomic-add functionality of modern GPUs is used to update the shared vector.

webspam dataset; GPU: GeForce GTX 1080Ti; CPU: 8-core Intel Xeon E5
[Figure: roughly 10x (Ridge) and 8x (SVM) speedup of the GPU solver over the CPU]
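The following Numba CUDA kernel is a rough sketch of how these three steps can map onto the two levels of GPU parallelism for the ridge objective: one thread block per updated coordinate, a shared-memory reduction inside the block to form the coordinate update, and atomic adds to apply it to the shared vector. The kernel name, the block size of 128 threads, and the dense float64 data layout are illustrative assumptions, not the authors' TPA-SCD implementation.

```python
import numpy as np
from numba import cuda, float64

THREADS = 128  # threads per block; must be a power of two for the reduction below

@cuda.jit
def block_coordinate_update(X, y, v, w, lam, coords):
    """Each thread block asynchronously updates one coordinate j = coords[blockIdx.x]
    of min_w 1/(2n) ||X w - y||^2 + lam/2 ||w||^2, where v = X w is the shared vector."""
    j = coords[cuda.blockIdx.x]
    tid = cuda.threadIdx.x
    n = X.shape[0]

    num = cuda.shared.array(THREADS, dtype=float64)  # partial <x_j, y - v + x_j w_j>
    sq = cuda.shared.array(THREADS, dtype=float64)   # partial ||x_j||^2
    delta = cuda.shared.array(1, dtype=float64)

    # second level of parallelism: the threads of the block stride over column j
    a = 0.0
    b = 0.0
    i = tid
    while i < n:
        xij = X[i, j]
        a += xij * (y[i] - v[i] + xij * w[j])   # v may be stale: other blocks update it
        b += xij * xij
        i += THREADS
    num[tid] = a
    sq[tid] = b
    cuda.syncthreads()

    # shared-memory tree reduction of the partial sums
    s = THREADS // 2
    while s > 0:
        if tid < s:
            num[tid] += num[tid + s]
            sq[tid] += sq[tid + s]
        cuda.syncthreads()
        s //= 2

    if tid == 0:
        w_new = num[0] / (sq[0] + n * lam)      # closed-form ridge coordinate minimizer
        delta[0] = w_new - w[j]
        w[j] = w_new
    cuda.syncthreads()

    # apply the update to the shared vector with atomic adds
    i = tid
    while i < n:
        cuda.atomic.add(v, i, delta[0] * X[i, j])
        i += THREADS
```

A launch such as `block_coordinate_update[num_blocks, THREADS](X, y, v, w, lam, coords)` then updates one coordinate per block; because blocks run asynchronously, the reads of v are stale in the same sense as in the earlier multi-core variant.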

Page 18:

[Figure: each worker performs H steps of its local solver]

Which local solver should we use?

It depends on the available hardware

Page 19:

Heterogeneous System

[Figure: heterogeneous system with a multi-core CPU and a GPU]

Page 20:


Heterogeneous System

Page 21:

Dual Heterogeneous Learning [NIPS’17]

Idea: The GPU should work on the part of the data it can learn most from.

The contribution of individual data columns to the duality gap is indicative of their potential to improve the model.

A scheme to efficiently use Limited-Memory Accelerators for Linear Learning

[Figure: multi-core CPU with its memory, attached GPU with its own limited memory]

Select coordinate j with largest duality gap

[Figure: Lasso on the epsilon dataset]

Page 22:

Dual Heterogeneous Learning [NIPS’17]

Idea: The GPU should work on the part of the data it can learn most from.

The contribution of individual data columns to the duality gap is indicative of their potential to improve the model.

A scheme to efficiently use Limited-Memory Accelerators for Linear Learning


Duality Gap Computation is expensive!

• We introduce a gap memory
• Parallelization of the workload between CPU and GPU: the GPU runs the algorithm on a subset of the data while the CPU computes the importance values

$\mathrm{Gap}(\mathbf{w}) = \sum_j \left[\, w_j \langle \mathbf{x}_j, \mathbf{v} \rangle + g_j(w_j) + g_j^*(-\mathbf{x}_j^\top \mathbf{v}) \,\right]$
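A minimal Python sketch of how such per-coordinate gap values could drive the data selection, assuming the ridge case where $g_j(w_j) = \frac{\lambda}{2} w_j^2$ (so $g_j^*(u) = \frac{u^2}{2\lambda}$) and taking $\mathbf{v} = \nabla f$ at the current iterate; the function names and the plain top-k selection are illustrative assumptions, not the exact scheme of the paper.

```python
import numpy as np

def coordinate_gaps_ridge(X, y, w, lam):
    """Per-coordinate duality-gap contributions for
    min_w f(X w) + sum_j g_j(w_j),  f(u) = 1/(2n) ||u - y||^2,  g_j(w_j) = lam/2 w_j^2:
    gap_j = w_j <x_j, v> + g_j(w_j) + g_j*(-x_j^T v)  with  v = grad f(X w)."""
    n = X.shape[0]
    v = (X @ w - y) / n            # shared vector v = gradient of f at the current iterate
    xv = X.T @ v                   # all inner products <x_j, v> at once
    return w * xv + 0.5 * lam * w ** 2 + xv ** 2 / (2.0 * lam)

def select_for_gpu(X, y, w, lam, gpu_budget):
    """Pick the columns the GPU can learn most from: those with the largest gap
    contribution, up to the number of columns that fit into GPU memory."""
    gaps = coordinate_gaps_ridge(X, y, w, lam)
    return np.argsort(gaps)[::-1][:gpu_budget]
```

In DuHL the CPU keeps such importance values in the gap memory and swaps columns in and out of the accelerator accordingly, while the GPU runs the local solver on the columns currently resident in its memory.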

Page 23:

DuHL Algorithm

[Figure: DuHL algorithm, work split between the GPU and the CPU]

C. Dünner, T. Parnell, M. Jaggi, “Efficient Use of Limited-Memory Accelerators for Linear Learning on Heterogeneous Systems”, NIPS, Long Beach, CA, 2017.

Page 24:

DuHL: Performance Results

[Fig 1 (suboptimality vs. iterations, Lasso): superior convergence properties of DuHL over existing schemes]
[Fig 2 (number of swaps vs. iterations, Lasso): I/O efficiency of DuHL]

Reduced I/O cost and faster convergence add up to a 10x speedup.

ImageNet dataset (30 GB)
GPU: NVIDIA Quadro M4000 (8 GB); CPU: 8-core Intel Xeon x86 (64 GB)

Page 25:

DuHL: Performance Results

[Fig: Lasso, suboptimality vs. time (s)]
[Fig: SVM, duality gap vs. time (s)]

Reduced I/O cost and faster convergence add up to a 10x speedup.

ImageNet dataset (30 GB)
GPU: NVIDIA Quadro M4000 (8 GB); CPU: 8-core Intel Xeon x86 (64 GB)

Page 26:

Combining it all

A library for ultra-fast machine learning

Goal: remove training as a bottleneck

• Enable seamless retraining of models
• Enable agile development
• Enable training on large-scale datasets
• Enable high-quality insights

- Exploit Primal-Dual Structure of ML problems to minimize communication

- Offer GPU acceleration

- Implement DuHL for efficient utilization of limited-memory accelerators

- Improved memory management of Spark

Page 27:

Combining it all

A library for ultra-fast machine learning

Goal: remove training as a bottleneck

• Enable seamless retraining of models
• Enable agile development
• Enable training on large-scale datasets
• Enable high-quality insights

- Exploit Primal-Dual Structure of ML problems to minimize communication

- Offer GPU acceleration

- Implement DuHL for efficient utilization of limited-memory accelerators

- Improved memory management of Spark

Linear Regression (8 executors):

  Library       Time (s)   MSE    Speed-up
  spark.ml      20,000     6.1%   n/a
  our library   7          6.1%   2800x

Tera-Scale Advertising Application

Predict whether a user will click on a given advert, based on an anonymized set of features.

Training: 1 billion examples; testing: 100 million unseen examples.

POWER8 (Minsky) infrastructure, 8 x P100 GPUs

Page 28:

Summary

• Distributed algorithms that offer tunable hyper-parameters are of particular practical interest

• A user may expect orders-of-magnitude improvements from tuning such parameters

• GPUs can accelerate machine learning workloads by an order of magnitude if algorithms are carefully designed

• DuHL enables GPU acceleration even if the data exceeds the capacity of the GPU memory

• Combining all this knowledge we can remove training time as a bottleneck for ML applications

Page 29:

Questions?