Description

Abstract: Machine learning researchers and practitioners develop computer algorithms that "improve performance automatically through experience". At Google, machine learning is applied to solve many problems, such as prioritizing emails in Gmail, recommending tags for YouTube videos, and identifying different aspects from online user reviews. Machine learning on big data, however, is challenging. Some "simple" machine learning algorithms with quadratic time complexity, while running fine on hundreds of records, are almost impractical to use on billions of records.

In this talk, I will describe lessons drawn from various Google projects on developing large-scale machine learning systems. These systems build on top of Google's computing infrastructure, such as GFS and MapReduce, and attack the scalability problem through massively parallel algorithms. I will present the design decisions made in these systems and strategies for scaling and speeding up machine learning systems on web-scale data.

Speaker biography: Max Lin is a software engineer with Google Research in the New York City office. He is the tech lead of the Google Prediction API, a machine learning web service in the cloud. Prior to Google, he published research work on video content analysis, sentiment analysis, machine learning, and cross-lingual information retrieval. He holds a PhD in Computer Science from Carnegie Mellon University.
Machine Learning on Big Data: Lessons Learned from Google Projects

Max Lin, Software Engineer | Google Research

Massively Parallel Computing | Harvard CS 264 | Guest Lecture | March 29th, 2011
Outline
• Machine Learning intro
• Scaling machine learning algorithms up
• Design choices of large scale ML systems
"Machine Learning is the study of computer algorithms that improve automatically through experience." [Mitchell, 1997]
Training
• "The quick brown fox jumped over the lazy dog." → English
• "To err is human, but to really foul things up you need a computer." → English
• "No hay mal que por bien no venga." → Spanish
• "La tercera es la vencida." → Spanish

Testing
• "To be or not to be -- that is the question" → ?
• "La fe mueve montañas." → ?
Input X → Model f(x) → Output Y. For a new input x', the trained model predicts f(x') = y'.
Linear Classifier

Example input: "The quick brown fox jumped over the lazy dog."

Feature vector (one dimension per vocabulary term):
$$x = [\,x_{\text{'a'}} = 0,\ x_{\text{'aardvark'}} = 0,\ \ldots,\ x_{\text{'dog'}} = 1,\ \ldots,\ x_{\text{'the'}} = 1,\ \ldots,\ x_{\text{'montañas'}} = 0,\ \ldots\,]$$

Weight vector:
$$w = [\,0.1,\ 132,\ \ldots,\ 150,\ 200,\ \ldots,\ -153,\ \ldots\,]$$

$$f(x) = w \cdot x = \sum_{p=1}^{P} w_p x_p$$
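To make the notation concrete, here is a minimal sketch (my own illustration, not code from the talk) of a bag-of-words linear classifier; the toy vocabulary and weights are invented, and a real system would have P on the order of a billion dimensions.

```python
# Toy vocabulary and weights (hypothetical values for illustration).
vocab = {'a': 0, 'aardvark': 1, 'dog': 2, 'the': 3, 'montañas': 4}
w = [0.1, 132.0, 150.0, 200.0, -153.0]

def featurize(text):
    """Binary bag of words: x_p = 1 iff vocabulary word p occurs in text."""
    words = set(text.lower().split())
    return [1.0 if word in words else 0.0 for word in vocab]

def f(x):
    """f(x) = w · x = sum over the P dimensions of w_p * x_p."""
    return sum(wp * xp for wp, xp in zip(w, x))

x = featurize("The quick brown fox jumped over the lazy dog.")
print(f(x))  # only 'the' matches this toy vocabulary ('dog.' keeps its period), so 200.0
```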
[Figure: training data as an N × P matrix — N examples (rows) of Input X with P features (columns), each paired with an Output Y]
Typical machine learning data at Google
• N (examples): 100 billion mean / 1 billion median
• P (features): 1 billion mean / 10 million median
Classifier Training
• Training: Given {(x, y)} and f, minimize the following objective function
$$\arg\min_{w} \sum_{i=1}^{N} L(y_i, f(x_i; w)) + R(w)$$
Use Newton's method? $w_{t+1} \leftarrow w_t - H(w_t)^{-1} \nabla J(w_t)$
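A quick back-of-the-envelope check (my addition, not on the slides) shows why not: with $P \approx 10^9$ features, the Hessian $H(w_t)$ is a $P \times P$ matrix, so

$$P^2 = 10^{18} \ \text{entries} \times 8 \ \text{bytes per double} = 8 \times 10^{18} \ \text{bytes} \approx 8 \ \text{exabytes}.$$

Storing it is infeasible, let alone the roughly $O(P^3)$ work of solving against it at every iteration.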
Outline
• Machine Learning intro
• Scaling machine learning algorithms up
• Design choices of large scale ML systems
Scaling Up
• Why big data?
• Parallelize machine learning algorithms
• Embarrassingly parallel
• Parallelize sub-routines
• Distributed learning
Subsampling

[Figure: Big Data is split into Shards 1…M; a subsample (reduced N) is drawn onto a single Machine, which trains the Model]
Why not Small Data?

[Figure: learning curves from Banko and Brill, 2001 — classifier accuracy keeps climbing as training data grows toward a billion words, for every algorithm tested]
Scaling Up
• Why big data?
• Parallelize machine learning algorithms
• Embarrassingly parallel
• Parallelize sub-routines
• Distributed learning
Parallelize Estimates
• Naive Bayes Classifier
• Maximum Likelihood Estimates
$$w_{\text{'the'}|EN} = \frac{\sum_{i=1}^{N} \mathbb{1}_{EN,\text{'the'}}(x_i)}{\sum_{i=1}^{N} \mathbb{1}_{EN}(x_i)}$$

$$\arg\min_{w} \; -\sum_{i=1}^{N} \Big[ \log P(y_i; w) + \sum_{p=1}^{P} \log P(x_{ip} \mid y_i; w) \Big]$$
Word Counting
• Map: input X = "The quick brown fox ...", Y = EN; emits ('the|EN', 1), ('quick|EN', 1), ('brown|EN', 1), ...
• Reduce: input [ ('the|EN', 1), ('the|EN', 1), ('the|EN', 1) ]; C('the'|EN) = sum of values = 3

$$w_{\text{'the'}|EN} = \frac{C(\text{'the'}|EN)}{C(EN)}$$
Word Counting with MapReduce

[Figure: Big Data is split into Shards 1…M; Mappers 1…M each emit pairs such as ('the'|EN, 1), ('fox'|EN, 1), …, ('montañas'|ES, 1); the Reducer tallies the counts and updates w to produce the Model]
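A minimal sketch of this job, with assumed function names rather than Google's actual MapReduce API: mappers emit ('word|LANG', 1) pairs from each labeled document, and the reducer sums the values for each key; the local dictionary below stands in for the shuffle phase.

```python
from collections import defaultdict

def mapper(record):
    """record = (text, label); emit one ('word|label', 1) pair per token."""
    text, label = record
    for word in text.lower().split():
        yield ('%s|%s' % (word, label), 1)

def reducer(key, values):
    """All values for a key arrive together; tally them into C(word|label)."""
    yield (key, sum(values))

# Simulate the shuffle on two toy shards.
shards = [[("The quick brown fox", "EN")], [("La fe mueve montañas", "ES")]]
grouped = defaultdict(list)
for shard in shards:                      # map phase, one mapper per shard
    for record in shard:
        for key, value in mapper(record):
            grouped[key].append(value)
counts = dict(kv for key in grouped for kv in reducer(key, grouped[key]))
print(counts['the|EN'])  # C('the'|EN) = 1 in this toy corpus
```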
Parallelize Optimization
• Maximum Entropy Classifiers
• Good: J(w) is concave
• Bad: no closed-form solution like NB
• Ugly: Large N
$$\arg\min_{w} \; -\sum_{i=1}^{N} \log \frac{\exp\big(\sum_{p=1}^{P} w_p x_{ip}\big)^{y_i}}{1 + \exp\big(\sum_{p=1}^{P} w_p x_{ip}\big)}$$
[Figure: gradient descent on an error surface; source: http://www.cs.cmu.edu/~epxing/Class/10701/Lecture/lecture7.pdf]
Gradient Descent
• w is initialized as zero
• for t in 1 to T:
• Calculate gradient $\nabla J(w_t)$
• $w_{t+1} \leftarrow w_t - \eta \nabla J(w_t)$

The gradient decomposes into a sum of per-example terms, which is what makes it parallelizable:

$$\nabla J(w) = \sum_{i=1}^{N} \nabla_w L(w; x_i, y_i)$$
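A minimal sketch (my own illustration) of this loop for the logistic model above: each of the T iterations computes the full gradient, a sum over all N examples, before taking one step; this is the O(TPN) cost the next slide attacks.

```python
import math

def gradient(w, X, Y):
    """Sum of per-example gradients: (sigma(w·x_i) - y_i) * x_i."""
    grad = [0.0] * len(w)
    for x, y in zip(X, Y):
        z = sum(wp * xp for wp, xp in zip(w, x))
        sigma = 1.0 / (1.0 + math.exp(-z))
        for p, xp in enumerate(x):
            grad[p] += (sigma - y) * xp
    return grad

def train(X, Y, P, T=100, eta=0.1):
    w = [0.0] * P                          # w is initialized as zero
    for _ in range(T):                     # for t in 1 to T
        g = gradient(w, X, Y)              # O(PN) work per pass
        w = [wp - eta * gp for wp, gp in zip(w, g)]
    return w
```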
Distribute Gradient
• w is initialized as zero
• for t in 1 to T:
• Calculate gradients in parallel, then update $w_{t+1} \leftarrow w_t - \eta \nabla J(w_t)$
• Training CPU: O(TPN) to O(TPN / M)
Distribute Gradient with MapReduce

[Figure: Big Data is split into Shards 1…M; Machines 1…M each emit (dummy key, partial gradient sum); the Reducer sums the partials and updates w; the Map/Reduce round repeats until convergence, producing the Model]
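A minimal sketch of one such round, reusing the gradient() helper from the previous sketch; the function names and the single 'dummy' key are assumptions mirroring the diagram, not Google's infrastructure.

```python
def map_partial_gradient(w, shard):
    """One machine: gradient over its own shard only, O(PN/M) work."""
    X, Y = zip(*shard)
    yield ('dummy', gradient(w, X, Y))     # gradient() from the sketch above

def reduce_update(w, partials, eta=0.1):
    """Sum the M partial gradients under the dummy key, then step once."""
    total = [sum(parts) for parts in zip(*partials)]
    return [wp - eta * gp for wp, gp in zip(w, total)]

def distributed_train(shards, P, T=100):
    w = [0.0] * P
    for _ in range(T):                     # one MapReduce round per iteration
        partials = [g for shard in shards
                    for _, g in map_partial_gradient(w, shard)]
        w = reduce_update(w, partials)
    return w
```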
• Why big data?
• Parallelize machine learning algorithms
• Embarrassingly parallel
• Parallelize sub-routines
• Distributed learning
Scaling Up
Parallelize Subroutines
• Support Vector Machines: primal problem

$$\arg\min_{w,b,\zeta} \ \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{n} \zeta_i \quad \text{s.t.} \ 1 - y_i(w \cdot \phi(x_i) + b) \le \zeta_i,\ \zeta_i \ge 0$$

• Solve the dual problem

$$\arg\min_{\alpha} \ \frac{1}{2}\alpha^T Q \alpha - \alpha^T \mathbf{1} \quad \text{s.t.} \ 0 \le \alpha \le C,\ y^T \alpha = 0$$
The computational cost of the Primal-Dual Interior Point Method is O(n^3) in time and O(n^2) in memory.
Parallel SVM
• Parallel, row-wise Incomplete Cholesky Factorization (ICF) for Q
• Parallel interior point method
• Time O(n^3) becomes O(n^2 / M)
• Memory O(n^2) becomes O(n / M)
• Parallel Support Vector Machines (psvm): http://code.google.com/p/psvm/
• Implemented in MPI
Parallel ICF [Chang et al., 2007]
• Distribute Q by row into M machines
• For each dimension n < √N:
• Send local pivots to master
• Master selects the largest local pivot and broadcasts the global pivot to workers

[Figure: row-wise distribution of Q — Machine 1 holds rows 1-2, Machine 2 rows 3-4, Machine 3 rows 5-6, ...]
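A serial sketch of the pivoted incomplete Cholesky factorization at the heart of this step (my own simplification; see Chang et al., 2007 for the real parallel algorithm): Q is approximated as H Hᵀ with rank p ≈ √n. The row loop is what PSVM distributes — each machine computes its own rows of column k and proposes its best local pivot to the master.

```python
import numpy as np

def icf(Q, p):
    """Approximate a PSD matrix Q (n×n) as H @ H.T with H of rank p."""
    n = Q.shape[0]
    H = np.zeros((n, p))
    d = np.diag(Q).astype(float).copy()    # residual diagonal
    for k in range(p):
        j = int(np.argmax(d))              # pivot = largest residual diagonal
        H[j, k] = np.sqrt(d[j])
        for i in range(n):                 # row-wise: the parallelizable loop
            if i != j:
                H[i, k] = (Q[i, j] - H[i, :k] @ H[j, :k]) / H[j, k]
        d -= H[:, k] ** 2                  # shrink the residual diagonal
    return H

Q = np.array([[4.0, 2.0, 0.5],
              [2.0, 3.0, 1.0],
              [0.5, 1.0, 2.0]])
H = icf(Q, 2)
print(np.round(H @ H.T, 2))               # low-rank approximation of Q
```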
Scaling Up
• Why big data?
• Parallelize machine learning algorithms
• Embarrassingly parallel
• Parallelize sub-routines
• Distributed learning
Majority Vote
[Figure: Big Data is split into Shards 1…M; Machines 1…M each train Model 1…Model M independently; Map only, no Reduce]
• Train individual classifiers independently
• Predict by taking majority votes
• Training CPU: O(TPN) to O(TPN / M)
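A minimal sketch (assumed names; each trained model is taken to be a callable from input to label): training is map-only, one learner per shard, and prediction combines the M models by vote.

```python
from collections import Counter

def train_all(shards, train_one):
    """Map only: train one model per shard, independently, on M machines."""
    return [train_one(shard) for shard in shards]

def predict(models, x):
    """Each model labels x; the most common label wins."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]
```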
Parameter Mixture
[Figure: Big Data is split into Shards 1…M; Machines 1…M each train locally and emit (dummy key, w_m); the Reducer averages the w's into the Model] [Mann et al., 2009]

Much less network usage than distributed gradient descent: O(MN) vs. O(MNT)
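A minimal sketch of the reduce step (my own illustration): every machine trains a full weight vector on its shard, and a single reduce averages them coordinate-wise, so each w crosses the network exactly once.

```python
def parameter_mixture(shards, train_one):
    """Map: one weight vector per shard; Reduce: coordinate-wise average."""
    weight_vectors = [train_one(shard) for shard in shards]   # map phase
    M = len(weight_vectors)
    return [sum(ws) / M for ws in zip(*weight_vectors)]       # reduce phase
```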
Iterative Parameter Mixture [McDonald et al., 2010]

[Figure: same structure as Parameter Mixture, but Reduce runs after each epoch — Machines 1…M emit (dummy key, w_m) at the end of every epoch, and the averaged w is redistributed as the starting point for the next epoch, producing the Model]
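A minimal sketch of the variant above (assumed names): the same average is computed after every epoch, and the mixed weights are redistributed as the starting point for each machine's next epoch.

```python
def iterative_parameter_mixture(shards, train_one_epoch, P, epochs=10):
    w = [0.0] * P
    for _ in range(epochs):
        # Map: each machine runs one training epoch starting from the average.
        local_ws = [train_one_epoch(w, shard) for shard in shards]
        # Reduce: average, then broadcast for the next epoch.
        w = [sum(ws) / len(local_ws) for ws in zip(*local_ws)]
    return w
```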
Outline
• Machine Learning intro
• Scaling machine learning algorithms up
• Design choices of large scale ML systems
Design Choices of Large Scale ML Systems
• Scalable
• Parallel
• Accuracy
• Binary Classification
• Automatic Feature Discovery
• Fast Response
• Memory is the new hard disk.
• Algorithm + Infrastructure
• Design for Multicores
• Combiner; Multi-shard Combiner [Chandra et al., 2010]
Machine Learning on Big Data
Parallelize ML Algorithms
• Embarrassingly parallel
• Parallelize sub-routines
• Distributed learning
Parallel, Accuracy, Fast Response
Google APIs
• Prediction API
• machine learning service in the cloud
• http://code.google.com/apis/predict
• BigQuery
• interactive analysis of massive data in the cloud
• http://code.google.com/apis/bigquery