The 30,000’ Summary and 3 Cool Results
John Langford (Yahoo!) with Ron Bekkerman (LinkedIn) and Misha Bilenko (Microsoft)
http://hunch.net/~large_scale_survey

Scaling Up Machine Learning, the Tutorial, KDD 2011 (Part III Summary + GPU learning + Terascale linear learning)


This hour’s outline

1 A summary of results

2 Cool uses of GPUs

3 Terascale linear


A comparison is virtually impossible

Prediction performance varies wildly depending on the problem–method pair.

But we can try: use input complexity/time.

⇒ No credit for creating complexity then reducing it. (Ouch!)

Most interesting results reported. Some cases require a creative best-effort summary.


Unsupervised Learning

[Figure: speed per method in features/s (log scale, 10 to 1e+07), parallel vs. single-node bars, with an "800x" annotation; methods include Info-CC (MPI, RCV1), Spectral Clustering (MapReduce, RCV1), LDA (MPI, Medline), LDA (SSE, Wikipedia), LDA (ICE, News), K-Means (GPU, synthetic), and RBM (GPU, images). # = Prev. * = Next. & = New]


Supervised Training

[Figure: speed per method in features/s (log scale, 100 to 1e+09), parallel vs. single-node bars; methods include RBF-SVM (MPI, RCV1), Ensemble Tree (MPI, synthetic), RBF-SVM (TCP, MNIST 220K), Decision Tree (MapReduce, Ad-Bounce), Boosted DT (MPI, Ranking), Linear (threads, RCV1), and Linear (Hadoop+TCP, Ads). # = Prev. * = Next. & = New]


Supervised Testing (but not training)

[Figure: speed per method in features/s (log scale, 100 to 1e+06), parallel vs. single-node bars; methods include Speech-WFST (GPU, NIST), Inference (MPI, Protein), Conv. NN (FPGA, Caltech), Boosted DT (GPU, Images), and Conv. NN (ASIC sim, Caltech). # = Prev. * = Next.]


My Flow Chart for Learning Optimization

1 Choose an efficient effective algorithm

2 Use compact binary representations.

3 If (Computationally Constrained)

4 then GPU

5 else
  1 If few learning steps
  2 then Map-Reduce AllReduce
  3 else Research Problem.


This hour’s outline

1 A summary of results

2 Cool uses of GPUs
  1 RBM learning
  2 Speech Recognition

3 Terascale linear


Restricted Boltzmann Machine learning CRN11 Chapter 18

Goal: Learn weights which predict hidden state given features, and that can predict features given hidden state.*

[Figure: hidden units connected to observed units.]

1 Number of parameters = hidden × observed = quadratic pain

2 Observed to be a useful method for creating relevant features for supervised learning.

(*) Lots of extra details here.


RBM parallelization

GPU = hundreds of weak processors doing vector operations on shared memory.

1 The activation level of hidden node i is sig(∑j wij xj). A GPU is perfectly designed for a dense matrix/vector dot product.

2 Given activation levels, hidden nodes are independently randomly rounded to {0, 1}. Good for GPUs.

3 Predict features given hidden units just as in step 1. Perfect for GPUs.

4 Shift weights to make the reconstruction more accurate. Perfect for GPUs.
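To make the mapping to dense matrix operations concrete, here is a minimal sketch of one contrastive-divergence (CD-1) update, with NumPy matrix products standing in for GPU BLAS calls. Biases are omitted, and every name (cd1_step, n_hidden, the batch size, the learning rate) is illustrative rather than taken from the chapter; this is a sketch of the four steps above, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(W, x_batch, learning_rate=0.01):
    """One CD-1 update on a batch of observed (visible) vectors, one per row."""
    # Step 1: hidden activation levels sig(sum_j w_ij x_j), a dense matrix product.
    h_prob = sigmoid(x_batch @ W.T)
    # Step 2: hidden nodes independently randomly rounded to {0, 1}.
    h_sample = (rng.random(h_prob.shape) < h_prob).astype(x_batch.dtype)
    # Step 3: predict features given hidden units, the same kind of product.
    x_recon = sigmoid(h_sample @ W)
    h_recon = sigmoid(x_recon @ W.T)
    # Step 4: shift weights to make the reconstruction more accurate.
    positive = h_prob.T @ x_batch
    negative = h_recon.T @ x_recon
    W += learning_rate * (positive - negative) / x_batch.shape[0]
    return W

n_visible, n_hidden = 784, 256                       # illustrative sizes
W = 0.01 * rng.standard_normal((n_hidden, n_visible))
batch = (rng.random((64, n_visible)) < 0.1).astype(np.float64)  # fake binary data
W = cd1_step(W, batch)
```

Every line of cd1_step is a dense matrix product or an elementwise operation over a large array, which is exactly the workload described in the next slide.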


Parallelization Techniques

1 Store model in GPU memory and stream data.

2 Use existing GPU-optimized matrix operation code.

3 Use multicore GPU parallelism for the rest.

This is a best-case situation for GPUs. x10 to x55 speedups observed.

But, maybe we just sped up a slow algorithm?



GPUs for Speech Recognition CGYK11 Chapter 21

Given observed utterances, we want to reconstruct the original (hidden) sequence of words via an HMM structure.

[Figure: HMM with a hidden sequence of word states emitting the observed utterance.]

Standard method of decoding: the forward-backward algorithm, using Bayes' law to find the most probable utterance. Naively, this is trivially parallelized just as before. But it's not:

1 The observation is non-binary. The standard approach matches the observed sound with one of very many different recorded sounds via nearest-neighbor search.

2 The state transitions are commonly beam searched rather than using Bayesian integration.

3 The entire structure is compiled into a weighted finite state transducer, which is what's really optimized.
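To make "beam searched rather than Bayesian integration" concrete, here is a minimal sketch of beam-pruned Viterbi decoding over a toy HMM. The real system decodes a compiled WFST with GPU-parallel acoustic scoring; everything here (state names, probabilities, the beam_viterbi function, the beam width) is illustrative only.

```python
import math

def beam_viterbi(observations, states, log_start, log_trans, log_emit, beam=4):
    """Return an approximate most-probable state sequence using a pruned beam."""
    # hypotheses: state -> (log probability, path so far)
    hyps = {s: (log_start[s] + log_emit[s][observations[0]], [s]) for s in states}
    hyps = dict(sorted(hyps.items(), key=lambda kv: -kv[1][0])[:beam])
    for obs in observations[1:]:
        new_hyps = {}
        for s_prev, (lp, path) in hyps.items():
            for s in states:
                cand = lp + log_trans[s_prev][s] + log_emit[s][obs]
                if s not in new_hyps or cand > new_hyps[s][0]:
                    new_hyps[s] = (cand, path + [s])
        # Keep only the `beam` best states: pruning instead of exact integration.
        hyps = dict(sorted(new_hyps.items(), key=lambda kv: -kv[1][0])[:beam])
    best_state = max(hyps, key=lambda s: hyps[s][0])
    return hyps[best_state][1]

# Tiny toy model with two hidden states and two observation symbols.
states = ["A", "B"]
log_start = {"A": math.log(0.6), "B": math.log(0.4)}
log_trans = {"A": {"A": math.log(0.7), "B": math.log(0.3)},
             "B": {"A": math.log(0.4), "B": math.log(0.6)}}
log_emit = {"A": {0: math.log(0.9), 1: math.log(0.1)},
            "B": {0: math.log(0.2), 1: math.log(0.8)}}
print(beam_viterbi([0, 1, 1, 0], states, log_start, log_trans, log_emit, beam=2))
```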


The approach used

Start with a careful systematic analysis of where parallelization might help.

1 SIMD instructions: Use carefully arranged data structures so single-instruction-multiple-data works.

2 Multicore over 30 cores of GPU.

3 Use atomic instructions (atomic max, atomic swap) = thread-safe primitives.

4 Stick the model in GPU memory, using GPU memory as (essentially) a monstrous cache.

Result: x10.5 speedup. Crucially, this makes the algorithm faster than real time. GPUs help, even for highly optimized algorithms.


This hour’s outline

1 A summary of results

2 Cool uses of GPUs
  1 RBM learning
  2 Speech Recognition

3 Terascale linear


Terascale Linear Learning ACDL11

Given 2.1 Terafeatures of data, how can you learn a linear predictor fw(x) = ∑i wi xi?

1 No single-machine algorithm.

2 No multimachine algorithm requiring bandwidth ∝ Tbytes for any single machine.

It is necessary but not sufficient to have an efficient communication mechanism.
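For reference, here is a minimal sketch of the object being learned: a linear predictor fw(x) = ∑i wi xi over sparse examples, trained online so each update touches only the nonzero features, which keeps per-example cost proportional to the number of nonzeros. The feature indices, values, and learning rate are illustrative, not from the experiments.

```python
from collections import defaultdict

def predict(w, x):
    """x is a sparse example: {feature_index: value}."""
    return sum(w[i] * v for i, v in x.items())

def sgd_update(w, x, y, lr=0.1):
    """Squared-loss gradient step touching only the nonzero coordinates of x."""
    err = predict(w, x) - y
    for i, v in x.items():
        w[i] -= lr * 2.0 * err * v

w = defaultdict(float)
stream = [({0: 1.0, 7: 0.5}, 1.0), ({1: 1.0, 7: -0.5}, -1.0)]  # toy (x, y) pairs
for x, y in stream:
    sgd_update(w, x, y)
print(dict(w))
```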


MPI-style AllReduce

[Figure sequence: AllReduce on a binary tree of 7 nodes initially holding the values 1 through 7; partial sums are reduced up the tree until the root holds 28, then 28 is broadcast back down so every node ends in the final state holding 28.]

AllReduce = Reduce + Broadcast

Properties:

1 Easily pipelined so no latency concerns.

2 Bandwidth ≤ 6n.

3 No need to rewrite code!
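A minimal single-process sketch of the tree pattern in the figure: partial sums flow up a binary tree of nodes, then the total flows back down so every node ends with the same value. In the real system each node is a machine and the value is a weight vector; here everything is simulated in one list, and tree_allreduce is an illustrative name.

```python
def tree_allreduce(values):
    """Sum `values` as if each sat on a node of a binary tree; return per-node totals."""
    n = len(values)
    partial = list(values)
    # Reduce: children (2i+1, 2i+2) push their partial sums to parent i.
    for i in reversed(range(n)):
        for child in (2 * i + 1, 2 * i + 2):
            if child < n:
                partial[i] += partial[child]
    # Broadcast: the root's total is copied back down to every node.
    total = partial[0]
    return [total] * n

print(tree_allreduce([1, 2, 3, 4, 5, 6, 7]))  # -> [28, 28, 28, 28, 28, 28, 28]
```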


An Example Algorithm: Weight averaging

n = AllReduce(1)
While (pass number < max)
  1 While (examples left)
    1 Do online update.
  2 AllReduce(weights)
  3 For each weight w ← w/n

Other algorithms implemented:

1 Nonuniform averaging for online learning

2 Conjugate Gradient

3 LBFGS
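Below is a minimal simulation of the weight-averaging algorithm above: each simulated worker runs online squared-loss updates on its shard, then an AllReduce-style sum lets every worker replace its weights with the average. The data, shard sizes, learning rate, and names (online_pass, allreduce_sum) are all illustrative; the real implementation communicates between machines rather than inside one process.

```python
import numpy as np

def online_pass(w, shard, lr=0.1):
    """One pass of squared-loss SGD over a worker's shard of (x, y) pairs."""
    for x, y in shard:
        w -= lr * 2.0 * (w @ x - y) * x
    return w

def allreduce_sum(vectors):
    """Stand-in for MPI-style AllReduce: every worker receives the elementwise sum."""
    total = np.sum(vectors, axis=0)
    return [total.copy() for _ in vectors]

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])
shards = []
for _ in range(4):                       # 4 simulated workers
    X = rng.standard_normal((50, 3))
    shards.append(list(zip(X, X @ true_w)))

n = allreduce_sum([np.ones(1)] * len(shards))[0][0]    # n = AllReduce(1)
weights = [np.zeros(3) for _ in shards]
for _ in range(5):                       # passes
    weights = [online_pass(w, shard) for w, shard in zip(weights, shards)]
    weights = [w / n for w in allreduce_sum(weights)]  # average via AllReduce
print(weights[0])                        # every worker now holds the same averaged weights
```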


Approach Used: Preliminaries

Optimize so few data passes required.
Basic problem with gradient descent = confused units.
fw(x) = ∑i wi xi ⇒ ∂(fw(x) − y)²/∂wi = 2(fw(x) − y) xi, which has units of i.
But wi naturally has units of 1/i, since doubling xi implies halving wi to get the same prediction.
Crude fixes:

1 Newton: Multiply inverse Hessian (∂²/∂wi∂wj)⁻¹ by gradient to get update direction. ...but computational complexity kills you.

2 Normalize update so total step size is controlled. ...but this just works globally rather than per dimension.
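To illustrate the per-dimension alternative to these crude fixes, here is a sketch of an adaptive (AdaGrad-style) update in which feature i gets learning rate proportional to 1/√(∑ gi²) over its past gradients, so frequently-updated features take smaller steps. It shows only the adaptive ingredient; the dimensionally correct normalization used in the actual system is not reproduced, and all constants and names here are illustrative.

```python
import numpy as np

def adaptive_sgd(examples, dim, eta=0.5, eps=1e-8):
    w = np.zeros(dim)
    g_sq = np.zeros(dim)                     # running sum of squared gradients per feature
    for x, y in examples:
        g = 2.0 * (w @ x - y) * x            # gradient of (f_w(x) - y)^2
        g_sq += g * g
        w -= eta * g / np.sqrt(g_sq + eps)   # per-coordinate learning rate
    return w

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3)) * np.array([1.0, 10.0, 0.1])  # wildly different feature scales
true_w = np.array([0.5, -0.05, 5.0])
examples = list(zip(X, X @ true_w))
print(adaptive_sgd(examples, dim=3))
```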


Approach Used

1 Optimize hard so few data passes required.
  1 L-BFGS = batch algorithm that builds up an approximate inverse Hessian according to ∆w ∆wᵀ / (∆wᵀ ∆g), where ∆w is a change in weights w and ∆g is a change in the loss gradient g.
  2 Dimensionally correct, adaptive, online gradient descent for small-multiple passes.
    1 Online = update weights after seeing each example.
    2 Adaptive = learning rate of feature i according to 1/√(∑ gi²), where gi = previous gradients.
    3 Dimensionally correct = still works if you double all feature values.
  3 Use (2) to warmstart (1).

2 Use map-only Hadoop for process control and error recovery.

3 Use custom AllReduce code to sync state.

4 Always save input examples in a cachefile to speed later passes.

5 Use the hashing trick to reduce input complexity (sketched below).

Open source in Vowpal Wabbit 6.0. Search for it.
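A minimal sketch of the hashing trick from item 5: feature names are hashed straight into a fixed-size weight vector, so no feature dictionary is ever built. The hash function, table size, and example features are illustrative; Vowpal Wabbit's own hashing differs in detail.

```python
import hashlib
import numpy as np

NUM_BITS = 18                        # 2^18 weights, a few MB of floats
NUM_WEIGHTS = 1 << NUM_BITS

def hashed_index(feature_name):
    """Map an arbitrary feature string to a slot in the weight vector."""
    digest = hashlib.md5(feature_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little") % NUM_WEIGHTS

def predict(weights, features):
    """f_w(x) = sum_i w_i x_i over a sparse {feature_name: value} input."""
    return sum(weights[hashed_index(name)] * value for name, value in features.items())

weights = np.zeros(NUM_WEIGHTS)
example = {"query=kdd": 1.0, "ad_id=1234": 1.0, "age": 0.7}
print(predict(weights, example))     # 0.0 before any training
```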



Empirical Results

2.1T sparse features
17B examples
16M parameters
1K nodes
70 minutes


The end

Right now there is extreme diversity:

1 Many different notions of large scale.

2 Many different approaches.

What works generally?
What are the natural "kinds" of large-scale learning problems?
And what are good solutions for each kind?
The great diversity implies this is really the beginning.


Numbers References

RBMs Adam Coates, Rajat Raina, and Andrew Y. Ng, Large Scale Learning for Vision with GPUs, in the book.

speech Jike Chong, Ekaterina Gonina, Kisun You, and Kurt Keutzer, in the book.

Teralinear Alekh Agarwal, Olivier Chapelle, Miroslav Dudik, and John Langford, Vowpal Wabbit 6.0.

LDA II M. Hoffman, D. Blei, F. Bach, "Online Learning for Latent Dirichlet Allocation," in Neural Information Processing Systems (NIPS) 2010, Vancouver, 2010.

LDA III Amr Ahmed, Mohamed Aly, Joseph Gonzalez, Shravan Narayanamurthy, Alexander Smola, Scalable Inference in Latent Variable Models, in submission.


Online References

Adapt I H. Brendan McMahan and Matthew Streeter, Adaptive Bound Optimization for Online Convex Optimization, COLT 2010.

Adapt II John Duchi, Elad Hazan, and Yoram Singer, Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, COLT 2010.

Dim I http://www.machinedlearnings.com/2011/06/dimensional-analysis-and-gradient.html

Import Nikos Karampatziakis, John Langford, Online Importance Weight Aware Updates, UAI 2011.


Batch References

BFGS I Broyden, C. G. (1970), "The convergence of a class of double-rank minimization algorithms", Journal of the Institute of Mathematics and Its Applications 6: 76–90.

BFGS II Fletcher, R. (1970), "A New Approach to Variable Metric Algorithms", Computer Journal 13 (3): 317–322.

BFGS III Goldfarb, D. (1970), "A Family of Variable Metric Updates Derived by Variational Means", Mathematics of Computation 24 (109): 23–26.

BFGS IV Shanno, David F. (July 1970), "Conditioning of quasi-Newton methods for function minimization", Math. Comput. 24: 647–656.

L-BFGS Nocedal, J. (1980), "Updating Quasi-Newton Matrices with Limited Storage", Mathematics of Computation 35: 773–782.