53
Strategies & Tools for Parallel Machine Learning in Python PyConFR 2012 - Paris Sunday, September 16, 2012

Strategies and Tools for Parallel Machine Learning in Python

Embed Size (px)

Citation preview

Page 1: Strategies and Tools for Parallel Machine Learning in Python

Strategies & Tools for Parallel Machine Learning

in PythonPyConFR 2012 - Paris

Sunday, September 16, 2012

Page 2: Strategies and Tools for Parallel Machine Learning in Python

The Story

scikit-learn

The Cloud

stackoverflow

& kaggle

Sunday, September 16, 2012

Page 3: Strategies and Tools for Parallel Machine Learning in Python

...

Sunday, September 16, 2012

Page 4: Strategies and Tools for Parallel Machine Learning in Python

Parts of the Ecosystem

Multiple Machines with Multiple Cores ♥♥♥

Single Machine with Multiple Cores ♥♥♥

multiprocessing

Sunday, September 16, 2012

Page 5: Strategies and Tools for Parallel Machine Learning in Python

The Problem

Big CPU (Supercomputers - MPI)

Simulating stuff from models

Big Data (Google scale - MapReduce)

Counting stuff in logs / Indexing the Web

Machine Learning?

often somewhere in the middle

Sunday, September 16, 2012

Page 6: Strategies and Tools for Parallel Machine Learning in Python

Parallel ML Use Cases

• Model Evaluation with Cross Validation

• Model Selection with Grid Search

• Bagging Models: Random Forests

• Averaged Models

Sunday, September 16, 2012

Page 7: Strategies and Tools for Parallel Machine Learning in Python

Embarrassingly Parallel ML Use Cases

• Model Evaluation with Cross Validation

• Model Selection with Grid Search

• Bagging Models: Random Forests

• Averaged Models

Sunday, September 16, 2012

Page 8: Strategies and Tools for Parallel Machine Learning in Python

Inter-Process Comm. Use Cases

• Model Evaluation with Cross Validation

• Model Selection with Grid Search

• Bagging Models: Random Forests

• Averaged Models

Sunday, September 16, 2012

Page 9: Strategies and Tools for Parallel Machine Learning in Python

Cross Validation

Labels to Predict

Input Data

Sunday, September 16, 2012

Page 10: Strategies and Tools for Parallel Machine Learning in Python

Cross Validation

A B C

A B C

Sunday, September 16, 2012

Page 11: Strategies and Tools for Parallel Machine Learning in Python

Cross Validation

A B C

A B C

Subset of the data usedto train the model

Held-outtest set

for evaluation

Sunday, September 16, 2012

Page 12: Strategies and Tools for Parallel Machine Learning in Python

Cross Validation

A B C

A B C

A C B

A C B

B C A

B C A

Sunday, September 16, 2012

Page 13: Strategies and Tools for Parallel Machine Learning in Python

Model Selection

the Hyperparameters hell

p1 in [1, 10, 100]

p2 in [1e3, 1e4, 1e5]

Find the best combination of parameters that maximizes the Cross Validated Score

Sunday, September 16, 2012

Page 14: Strategies and Tools for Parallel Machine Learning in Python

Grid Search

(1, 1e3) (10, 1e3) (100, 1e3)

(1, 1e4) (10, 1e4) (100, 1e4)

(1, 1e5) (10, 1e5) (100, 1e5)

p1

p2

Sunday, September 16, 2012

Page 15: Strategies and Tools for Parallel Machine Learning in Python

(1, 1e3) (10, 1e3) (100, 1e3)

(1, 1e4) (10, 1e4) (100, 1e4)

(1, 1e5) (10, 1e5) (100, 1e5)

Sunday, September 16, 2012

Page 16: Strategies and Tools for Parallel Machine Learning in Python

Grid Search:Qualitative Results

Sunday, September 16, 2012

Page 17: Strategies and Tools for Parallel Machine Learning in Python

Grid Search:Cross Validated Scores

Sunday, September 16, 2012

Page 18: Strategies and Tools for Parallel Machine Learning in Python

Enough ML theory!

Lets go shopping^Wparallel computing!

Sunday, September 16, 2012

Page 19: Strategies and Tools for Parallel Machine Learning in Python

Single Machinewith

Multiple Cores

♥ ♥

♥ ♥

Sunday, September 16, 2012

Page 20: Strategies and Tools for Parallel Machine Learning in Python

multiprocessing>>> from multiprocessing import Pool>>> p = Pool(4)

>>> p.map(type, [1, 2., '3'])[int, float, str]

>>> r = p.map_async(type, [1, 2., '3'])>>> r.get()[int, float, str]

Sunday, September 16, 2012

Page 21: Strategies and Tools for Parallel Machine Learning in Python

multiprocessing

• Part of the standard lib

• Nice API

• Cross-Platform support (even Windows!)

• Some support for shared memory

• Support for synchronization (Lock)

Sunday, September 16, 2012

Page 22: Strategies and Tools for Parallel Machine Learning in Python

multiprocessing: limitations

• No docstrings in a stdlib module? WTF?

• Tricky / impossible to use the shared memory values with NumPy

• Bad support for KeyboardInterrupt

Sunday, September 16, 2012

Page 23: Strategies and Tools for Parallel Machine Learning in Python

Sunday, September 16, 2012

Page 24: Strategies and Tools for Parallel Machine Learning in Python

• transparent disk-caching of the output values and lazy re-evaluation (memoize pattern)

• easy simple parallel computing

• logging and tracing of the execution

Sunday, September 16, 2012

Page 25: Strategies and Tools for Parallel Machine Learning in Python

>>> from os.path.join>>> from joblib import Parallel, delayed

>>> Parallel(2)(... delayed(join)('/ect', s)... for s in 'abc')['/ect/a', '/ect/b', '/ect/c']

joblib.Parallel

Sunday, September 16, 2012

Page 26: Strategies and Tools for Parallel Machine Learning in Python

Usage in scikit-learn

• Cross Validationcross_val(model, X, y, n_jobs=4, cv=3)

• Grid SearchGridSearchCV(model, n_jobs=4, cv=3).fit(X, y)

• Random ForestsRandomForestClassifier(n_jobs=4).fit(X, y)

Sunday, September 16, 2012

Page 27: Strategies and Tools for Parallel Machine Learning in Python

>>> from joblib import Parallel, delayed>>> import numpy as np

>>> Parallel(2, max_nbytes=1e6)(... delayed(type)(np.zeros(int(i)))... for i in [1e4, 1e6])[<type 'numpy.ndarray'>, <class 'numpy.core.memmap.memmap'>]

joblib.Parallel:shared memory

Sunday, September 16, 2012

Page 28: Strategies and Tools for Parallel Machine Learning in Python

(1, 1e3) (10, 1e3) (100, 1e3)

(1, 1e4) (10, 1e4) (100, 1e4)

(1, 1e5) (10, 1e5) (100, 1e5)

Only 3 allocated datasets shared by all the concurrent workers performing

the grid search.

Sunday, September 16, 2012

Page 29: Strategies and Tools for Parallel Machine Learning in Python

Multiple Machineswith

Multiple Cores

♥ ♥

♥ ♥♥ ♥

♥ ♥♥ ♥

♥ ♥♥ ♥

♥ ♥

Sunday, September 16, 2012

Page 30: Strategies and Tools for Parallel Machine Learning in Python

MapReduce?

[ (k1, v1), (k2, v2), ...]

mapper

mapper

mapper

[ (k3, v3), (k4, v4), ...]

reducer

reducer

[ (k5, v6), (k6, v6), ...]

Sunday, September 16, 2012

Page 31: Strategies and Tools for Parallel Machine Learning in Python

Why MapReduce does not always work

Write a lot of stuff to disk for failover

Inefficient for small to medium problems

[(k, v)] mapper [(k, v)] reducer [(k, v)]

Data and model params as (k, v) pairs?

Complex to leverage for Iterative Algorithms

Sunday, September 16, 2012

Page 32: Strategies and Tools for Parallel Machine Learning in Python

• Parallel Processing Library

• Interactive Exploratory Shell

Multi Core & Distributed

IPython.parallel

Sunday, September 16, 2012

Page 33: Strategies and Tools for Parallel Machine Learning in Python

Demo!

Sunday, September 16, 2012

Page 34: Strategies and Tools for Parallel Machine Learning in Python

The AllReduce Pattern

• Compute an aggregate (average) of active node data

• Do not clog a single node with incoming data transfer

• Traditionally implemented in MPI systems

Sunday, September 16, 2012

Page 35: Strategies and Tools for Parallel Machine Learning in Python

AllReduce 0/3Initial State

Value: 2.0 Value: 0.5

Value: 1.1 Value: 3.2 Value: 0.9

Value: 1.0

Sunday, September 16, 2012

Page 36: Strategies and Tools for Parallel Machine Learning in Python

AllReduce 1/3Spanning Tree

Value: 2.0 Value: 0.5

Value: 1.1 Value: 3.2 Value: 0.9

Value: 1.0

Sunday, September 16, 2012

Page 37: Strategies and Tools for Parallel Machine Learning in Python

AllReduce 2/3Upward Averages

Value: 2.0 Value: 0.5

Value: 1.1(1.1, 1)

Value: 3.2(3.1, 1)

Value: 0.9(0.9, 1)

Value: 1.0

Sunday, September 16, 2012

Page 38: Strategies and Tools for Parallel Machine Learning in Python

AllReduce 2/3Upward Averages

Value: 2.0(2.1, 3)

Value: 0.5(0.7, 2)

Value: 1.1(1.1, 1)

Value: 3.2(3.1, 1)

Value: 0.9(0.9, 1)

Value: 1.0

Sunday, September 16, 2012

Page 39: Strategies and Tools for Parallel Machine Learning in Python

AllReduce 2/3Upward Averages

Value: 2.0(2.1, 3)

Value: 0.5(0.7, 2)

Value: 1.1(1.1, 1)

Value: 3.2(3.1, 1)

Value: 0.9(0.9, 1)

Value: 1.0(1.38, 6)

Sunday, September 16, 2012

Page 40: Strategies and Tools for Parallel Machine Learning in Python

AllReduce 3/3Downward Updates

Value: 2.0(2.1, 3)

Value: 0.5(0.7, 2)

Value: 1.1(1.1, 1)

Value: 3.2(3.1, 1)

Value: 0.9(0.9, 1)

Value: 1.38

Sunday, September 16, 2012

Page 41: Strategies and Tools for Parallel Machine Learning in Python

AllReduce 3/3Downward Updates

Value: 1.38 Value: 1.38

Value: 1.1(1.1, 1)

Value: 3.2(3.1, 1)

Value: 0.9(0.9, 1)

Value: 1.38

Sunday, September 16, 2012

Page 42: Strategies and Tools for Parallel Machine Learning in Python

AllReduce 3/3Downward Updates

Value: 1.38 Value: 1.38

Value: 1.38 Value: 1.38 Value: 1.38

Value: 1.38

Sunday, September 16, 2012

Page 43: Strategies and Tools for Parallel Machine Learning in Python

AllReduce Final State

Value: 1.38 Value: 1.38

Value: 1.38 Value: 1.38 Value: 1.38

Value: 1.38

Sunday, September 16, 2012

Page 45: Strategies and Tools for Parallel Machine Learning in Python

Working in the Cloud

• Launch a cluster of machines in one cmd: starcluster start mycluster -b 0.07

starcluster sshmaster mycluster

• Supports spotinstances!

• Ships blas, atlas, numpy, scipy!

• IPython plugin!

Sunday, September 16, 2012

Page 46: Strategies and Tools for Parallel Machine Learning in Python

Perspectives

Sunday, September 16, 2012

Page 47: Strategies and Tools for Parallel Machine Learning in Python

2012 results by Stanford / Google

Sunday, September 16, 2012

Page 48: Strategies and Tools for Parallel Machine Learning in Python

The YouTube Neuron

Sunday, September 16, 2012

Page 50: Strategies and Tools for Parallel Machine Learning in Python

Goodies

Super Ctrl-C to killall nosetests

Sunday, September 16, 2012

Page 51: Strategies and Tools for Parallel Machine Learning in Python

psutil nosetests killer script in Automator

Sunday, September 16, 2012

Page 52: Strategies and Tools for Parallel Machine Learning in Python

Bind killnose to Shift-Ctrl-C

Sunday, September 16, 2012

Page 53: Strategies and Tools for Parallel Machine Learning in Python

ps aux | grep IPython.parallel \ | grep -v grep \ | cut -d ' ' -f 2 | xargs kill

Or simpler:

Sunday, September 16, 2012