Machine Learning on Big Data: Lessons Learned from Google Projects
Max Lin, Software Engineer | Google Research
Massively Parallel Computing | Harvard CS 264 | Guest Lecture | March 29th, 2011

[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Google Projects (Max Lin, Google Research)

DESCRIPTION

Abstract: Machine learning researchers and practitioners develop computer algorithms that "improve performance automatically through experience". At Google, machine learning is applied to solve many problems, such as prioritizing emails in Gmail, recommending tags for YouTube videos, and identifying different aspects from online user reviews. Machine learning on big data, however, is challenging. Some "simple" machine learning algorithms with quadratic time complexity, while running fine with hundreds of records, are almost impractical to use on billions of records. In this talk, I will describe lessons drawn from various Google projects on developing large-scale machine learning systems. These systems build on top of Google's computing infrastructure, such as GFS and MapReduce, and attack the scalability problem through massively parallel algorithms. I will present the design decisions made in these systems and strategies for scaling and speeding up machine learning systems on web-scale data.

Speaker biography: Max Lin is a software engineer with Google Research in the New York City office. He is the tech lead of the Google Prediction API, a machine learning web service in the cloud. Prior to Google, he published research work on video content analysis, sentiment analysis, machine learning, and cross-lingual information retrieval. He holds a PhD in Computer Science from Carnegie Mellon University.


Page 1: [Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Google Projects (Max Lin, Google Research)

Machine Learning on Big Data: Lessons Learned from Google Projects

Max Lin, Software Engineer | Google Research

Massively Parallel Computing | Harvard CS 264 | Guest Lecture | March 29th, 2011

Page 2:

Outline

• Machine Learning intro

• Scaling machine learning algorithms up

• Design choices of large scale ML systems

Page 3:

Outline

• Machine Learning intro

• Scaling machine learning algorithms up

• Design choices of large scale ML systems

Page 4:

“Machine Learning is a study of computer algorithms that improve automatically through experience.”

Page 5:
Page 6:
Page 7:
Page 8:
Page 9:
Page 10:

Training (learn model f from input X and output Y):

“The quick brown fox jumped over the lazy dog.” → English
“To err is human, but to really foul things up you need a computer.” → English
“No hay mal que por bien no venga.” → Spanish
“La tercera es la vencida.” → Spanish

Testing (apply the learned model f to new input x’):

“To be or not to be -- that is the question” → ?
“La fe mueve montañas.” → ?

Model: f(x’) = y’

Page 11:

The quick brown fox jumped over the lazy dog.

Linear Classifier

Represent the sentence as a sparse feature vector x, one dimension per vocabulary word:

x = [ ‘a’: 0, ‘aardvark’: 0, ..., ‘dog’: 1, ..., ‘the’: 1, ..., ‘montañas’: 0, ... ]

and score it against a learned weight vector w:

w = [ 0.1, 132, ..., 150, 200, ..., −153, ... ]

f(x) = w · x = Σ_{p=1}^{P} w_p · x_p
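As a concrete (if toy) illustration, the sparse dot product above can be sketched in a few lines of Python; the vocabulary, weights, and function names here are invented for illustration and are not from the talk:

```python
# Minimal sketch of the linear classifier f(x) = w · x on sparse
# bag-of-words features. Toy weights: positive means "more English".

def featurize(text):
    """Map a sentence to a sparse binary bag-of-words vector (a dict)."""
    return {word.strip(".,!?"): 1 for word in text.lower().split()}

def score(w, x):
    """f(x) = sum_p w_p * x_p, iterating only over the non-zero entries."""
    return sum(w.get(word, 0.0) * value for word, value in x.items())

w = {"the": 2.0, "quick": 0.5, "montañas": -3.0}  # made-up weights

x = featurize("The quick brown fox jumped over the lazy dog.")
print(score(w, x) > 0)  # classify as English if the score is positive
```

Because x is sparse, the sum touches only the handful of words present in the sentence, not the full billion-dimensional vocabulary.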

Page 12:

Training Data

An N × P matrix: N examples (rows) by P features (columns).

Input X, Output Y

Page 13:

http://www.flickr.com/photos/mr_t_in_dc/5469563053/

Typical machine learning data at Google

N: 100 billion (mean) / 1 billion (median)
P: 1 billion (mean) / 10 million (median)

Page 14:

Classifier Training

• Training: given {(x_i, y_i)} and f, minimize the following objective function:

argmin_w Σ_{i=1}^{N} L(y_i, f(x_i; w)) + R(w)

Page 15:

http://www.flickr.com/photos/visitfinland/5424369765/

Use Newton’s method?

w_{t+1} ← w_t − H(w_t)^{−1} ∇J(w_t)

Page 16:

Outline

• Machine Learning intro

• Scaling machine learning algorithms up

• Design choices of large scale ML systems

Page 17:

Scaling Up

• Why big data?

• Parallelize machine learning algorithms

• Embarrassingly parallel

• Parallelize sub-routines

• Distributed learning

Page 18:

Machine

Subsampling

Big Data

Shard 1 Shard 2 Shard 3 ... Shard M

Model

Reduce N

Page 19:

Why not Small Data?

[Banko and Brill, 2001]

Page 20:

• Why big data?

• Parallelize machine learning algorithms

• Embarrassingly parallel

• Parallelize sub-routines

• Distributed learning

Scaling Up

Page 21:

Parallelize Estimates

• Naive Bayes Classifier

• Maximum Likelihood Estimates

w_{‘the’|EN} = Σ_{i=1}^{N} 1_{EN,‘the’}(x_i) / Σ_{i=1}^{N} 1_{EN}(x_i)

argmax_w ∏_{i=1}^{N} P(y_i; w) ∏_{p=1}^{P} P(x_{ip} | y_i; w)

Page 22:

Word Counting

Map: input X: “The quick brown fox ...”, label Y: EN
→ emit (‘the|EN’, 1), (‘quick|EN’, 1), (‘brown|EN’, 1), ...

Reduce: [ (‘the|EN’, 1), (‘the|EN’, 1), (‘the|EN’, 1) ]
→ C(‘the’|EN) = SUM of values = 3

w_{‘the’|EN} = C(‘the’|EN) / C(EN)
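The word-counting map and reduce steps can be sketched in plain Python; the functions below are ordinary code standing in for the real MapReduce pipeline, and their names are mine:

```python
from collections import defaultdict

def mapper(text, label):
    """Emit one ('word|LABEL', 1) pair per token, as on the slide."""
    for word in text.lower().split():
        yield (f"{word}|{label}", 1)

def reducer(pairs):
    """Sum the values for each key: C(word|label)."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

docs = [("the quick brown fox jumped over the lazy dog", "EN"),
        ("to err is human", "EN"),
        ("la tercera es la vencida", "ES")]

# "Shuffle" is implicit here: all pairs go to one reducer.
pairs = [kv for text, label in docs for kv in mapper(text, label)]
counts = reducer(pairs)
print(counts["the|EN"])  # 2: "the" appears twice in the first document
```

In the real system each mapper reads one shard and many reducers each own a slice of the key space; the counting logic is the same.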

Page 23:

Map

Reduce

Big Data

Mapper 1

Shard 1

Mapper 2

Shard 2

Mapper 3

Shard 3

Mapper M

Shard M

(‘the’ | EN, 1)

Reducer

Tally counts and update w

...

Word Counting

(‘fox’ | EN, 1) ... (‘montañas’ | ES, 1)

Model

Page 24:

Parallelize Optimization

• Maximum Entropy Classifiers

• Good: J(w) is concave

• Bad: no closed-form solution like NB

• Ugly: Large N

argmax_w ∏_{i=1}^{N} exp(Σ_{p=1}^{P} w_p x_{ip})^{y_i} / (1 + exp(Σ_{p=1}^{P} w_p x_{ip}))

Page 25:

Gradient Descent

http://www.cs.cmu.edu/~epxing/Class/10701/Lecture/lecture7.pdf

Page 26:

Gradient Descent

• w is initialized as zero

• for t in 1 to T

• Calculate the gradient, which decomposes into a sum of per-example terms:

∇J(w) = Σ_{i=1}^{N} g(w; x_i, y_i)

• Update: w_{t+1} ← w_t − η ∇J(w_t)
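A minimal single-machine sketch of this loop, using a logistic-loss gradient and an invented toy dataset (feature 0 acts as a bias term):

```python
import math

def predict(w, x):
    """P(y = 1 | x) under the logistic model."""
    z = sum(wp * xp for wp, xp in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def gradient(w, data):
    """Sum of per-example gradients of the negative log-likelihood."""
    g = [0.0] * len(w)
    for x, y in data:
        err = predict(w, x) - y
        for p, xp in enumerate(x):
            g[p] += err * xp
    return g

def gradient_descent(data, P, T=1000, eta=0.1):
    w = [0.0] * P                   # w is initialized as zero
    for _ in range(T):              # for t in 1 to T
        g = gradient(w, data)       # calculate the gradient
        w = [wp - eta * gp for wp, gp in zip(w, g)]  # w <- w - eta * grad
    return w

data = [([1.0, 0.0], 0), ([1.0, 1.0], 0), ([1.0, 3.0], 1), ([1.0, 4.0], 1)]
w = gradient_descent(data, P=2)
print(predict(w, [1.0, 4.0]) > 0.5)  # True: the model separates the toy data
```

Each iteration costs O(PN); over T iterations the serial training cost is the O(TPN) quoted on the next slide.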

Page 27:

Distribute Gradient

• w is initialized as zero

• for t in 1 to T

• Calculate gradients in parallel

• Training CPU: O(TPN) to O(TPN / M)

wt+1 ← wt − η∇J(w)
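The distributed step can be sketched the same way: each "machine" computes the partial gradient (logistic loss here) over its own shard, and a single reducer sums the partials and updates w. Plain Python stands in for MapReduce, and the parallel map is simulated with a list comprehension:

```python
import math

def partial_gradient(w, shard):
    """Per-shard sum of per-example logistic-loss gradients."""
    g = [0.0] * len(w)
    for x, y in shard:
        z = sum(wp * xp for wp, xp in zip(w, x))
        err = 1.0 / (1.0 + math.exp(-z)) - y
        for p, xp in enumerate(x):
            g[p] += err * xp
    return g

def distributed_step(w, shards, eta=0.1):
    partials = [partial_gradient(w, s) for s in shards]   # map, in parallel
    total = [sum(col) for col in zip(*partials)]          # reduce: sum partials
    return [wp - eta * gp for wp, gp in zip(w, total)]    # update w

shards = [[([1.0, 0.0], 0), ([1.0, 1.0], 0)],
          [([1.0, 3.0], 1), ([1.0, 4.0], 1)]]
w = [0.0, 0.0]
for _ in range(200):          # repeat M/R until convergence
    w = distributed_step(w, shards)
print(w[1] > 0.0)             # the slope has learned the positive direction
```

Because the gradient is a sum over examples, sharding the data splits the per-iteration cost by M exactly, which is where O(TPN) becomes O(TPN / M).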

Page 28:

Distribute Gradient

Map

Reduce

Big Data

Machine 1

Shard 1

Machine 2

Shard 2

Machine 3

Shard 3

Machine M

Shard M

(dummy key, partial gradient sum)

Sum and Update w

...

Model

Repeat M/R until convergence

Page 29:

• Why big data?

• Parallelize machine learning algorithms

• Embarrassingly parallel

• Parallelize sub-routines

• Distributed learning

Scaling Up

Page 30:

Parallelize Subroutines

• Support Vector Machines

• Solve the primal problem

argmin_{w,b,ζ} (1/2)‖w‖₂² + C Σ_{i=1}^{n} ζ_i
s.t. 1 − y_i(w · φ(x_i) + b) ≤ ζ_i, ζ_i ≥ 0

• or its dual

argmin_α (1/2) αᵀQα − αᵀ1
s.t. 0 ≤ α ≤ C, yᵀα = 0

Page 31:

http://www.flickr.com/photos/sea-turtle/198445204/

The computational cost of the primal-dual interior point method is O(n³) in time and O(n²) in memory.

Page 32:

Parallel SVM [Chang et al., 2007]

• Parallel, row-wise incomplete Cholesky factorization (ICF) of Q, to rank ≈ √N

• Parallel interior point method

• Time O(n³) becomes O(n² / M)

• Memory O(n²) becomes O(n / M)

• Parallel Support Vector Machines (psvm): http://code.google.com/p/psvm/

• Implemented in MPI
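A serial sketch of the pivoted incomplete Cholesky factorization at the heart of this approach. In PSVM the rows of H and the pivot search are distributed across machines; here everything runs on one machine, and the code is an illustration under that assumption, not the PSVM implementation:

```python
import numpy as np

def icf(Q, k):
    """Rank-k pivoted incomplete Cholesky: returns H with Q ~= H @ H.T.

    Serial sketch. In the parallel version each machine owns a block of
    rows; workers propose local pivots and the master broadcasts the
    global pivot, as described on the next slide.
    """
    n = Q.shape[0]
    H = np.zeros((n, k))
    d = np.diag(Q).astype(float).copy()   # residual diagonal
    pivoted = np.zeros(n, dtype=bool)
    for j in range(k):
        i = int(np.argmax(np.where(pivoted, -np.inf, d)))  # global pivot
        if d[i] <= 1e-12:
            break                          # Q is (numerically) rank-deficient
        pivoted[i] = True
        H[i, j] = np.sqrt(d[i])
        rest = ~pivoted
        H[rest, j] = (Q[rest, i] - H[rest, :j] @ H[i, :j]) / H[i, j]
        d[rest] -= H[rest, j] ** 2         # update residual diagonal
    return H

Q = np.array([[4.0, 2.0], [2.0, 3.0]])
H = icf(Q, 2)
print(np.allclose(H @ H.T, Q))  # exact when k equals the full rank n
```

With k ≈ √n the interior point method can work with the thin factor H instead of the full n × n kernel matrix Q, which is what buys the time and memory savings above.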

Page 33:

Parallel ICF

• Distribute Q by row across M machines (e.g. Machine 1: rows 1–2, Machine 2: rows 3–4, Machine 3: rows 5–6, ...)

• For each dimension n < √N:

• Each worker sends its local pivot to the master

• The master selects the largest of the local pivots and broadcasts the global pivot to the workers

Page 34:
Page 35:

• Why big data?

• Parallelize machine learning algorithms

• Embarrassingly parallel

• Parallelize sub-routines

• Distributed learning

Scaling Up

Page 36:

Majority Vote

Map

Big Data

Machine 1

Shard 1

Machine 2

Shard 2

Machine 3

Shard 3

Machine M

Shard M...

Model 1 Model 2 Model 3 ... Model M

Page 37:

Majority Vote

• Train individual classifiers independently

• Predict by taking majority votes

• Training CPU: O(TPN) to O(TPN / M)
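A sketch of the scheme with a trivial stand-in "trainer" (the majority class of the shard); any classifier trained independently per shard would do, and all names and data here are invented:

```python
from collections import Counter

def train_on_shard(shard):
    """Stand-in 'training': just memorize the shard's majority class.
    A real system would train any classifier independently per shard."""
    majority = Counter(y for _, y in shard).most_common(1)[0][0]
    return lambda x, label=majority: label

def majority_vote(models, x):
    """Predict with each model, then take the majority of the votes."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

shards = [[(None, "EN"), (None, "EN")],
          [(None, "ES"), (None, "ES")],
          [(None, "EN"), (None, "EN"), (None, "EN")]]
models = [train_on_shard(s) for s in shards]   # embarrassingly parallel
print(majority_vote(models, x=None))  # EN: two of three shard models vote EN
```

No communication is needed during training, which is why this is embarrassingly parallel; the cost is that every model sees only 1/M of the data.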

Page 38:

Parameter Mixture

Map

Reduce

Big Data

Machine 1

Shard 1

Machine 2

Shard 2

Machine 3

Shard 3

Machine M

Shard M

(dummy key, w1)

Average w

...

(dummy key, w2) ...

[Mann et al, 2009]

Model
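Parameter mixture trains one weight vector per shard and averages them once at the end (a single reduce). A sketch with a perceptron-style stand-in trainer; names and data are invented:

```python
# Sketch of parameter mixture: train w independently on each shard,
# then average the weight vectors once at the end.

def train_shard(shard, P, epochs=10):
    """Stand-in per-shard trainer: a simple perceptron."""
    w = [0.0] * P
    for _ in range(epochs):
        for x, y in shard:
            s = sum(wp * xp for wp, xp in zip(w, x))
            pred = 1 if s > 0 else 0
            if pred != y:                         # perceptron update
                for p, xp in enumerate(x):
                    w[p] += (y - pred) * xp
    return w

def parameter_mixture(shards, P):
    weights = [train_shard(s, P) for s in shards]   # map, in parallel
    M = len(weights)
    return [sum(col) / M for col in zip(*weights)]  # reduce: average w

def score(w, x):
    return sum(wp * xp for wp, xp in zip(w, x))

shards = [[([1.0, 0.0], 0), ([1.0, 4.0], 1)],
          [([1.0, 1.0], 0), ([1.0, 3.0], 1)]]
w = parameter_mixture(shards, P=2)
print(score(w, [1.0, 4.0]) > 0)   # the averaged model separates the toy data
```

Each machine ships its weight vector exactly once, which is the source of the network savings quoted on the next slide.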

Page 39:

http://www.flickr.com/photos/annamatic3000/127945652/

Much less network usage than distributed gradient descent: O(MN) vs. O(MNT)

Page 40:
Page 41:

Iterative Param Mixture

Map

Reduce after each epoch

Big Data

Machine 1

Shard 1

Machine 2

Shard 2

Machine 3

Shard 3

Machine M

Shard M

(dummy key, w1)

Average w

...

(dummy key, w2) ...

Model

[McDonald et al., 2010]
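The iterative variant averages after every epoch and broadcasts the averaged weights back to the workers before the next epoch. A sketch with a perceptron-style stand-in trainer (invented data):

```python
# Sketch of iterative parameter mixture: run one epoch per shard in
# parallel, average the weight vectors, broadcast the average back,
# and repeat.

def one_epoch(w, shard):
    """One perceptron epoch over a shard, starting from the broadcast w."""
    w = list(w)
    for x, y in shard:
        pred = 1 if sum(wp * xp for wp, xp in zip(w, x)) > 0 else 0
        if pred != y:
            for p, xp in enumerate(x):
                w[p] += (y - pred) * xp
    return w

def iterative_parameter_mixture(shards, P, epochs=10):
    w = [0.0] * P
    for _ in range(epochs):
        locals_ = [one_epoch(w, s) for s in shards]            # map
        w = [sum(col) / len(shards) for col in zip(*locals_)]  # reduce: average
    return w                                                   # broadcast, repeat

shards = [[([1.0, 0.0], 0), ([1.0, 4.0], 1)],
          [([1.0, 1.0], 0), ([1.0, 3.0], 1)]]
w = iterative_parameter_mixture(shards, P=2)
print(sum(wp * xp for wp, xp in zip(w, [1.0, 4.0])) > 0)
```

Compared with the one-shot mixture, the extra per-epoch reduce costs more network traffic but lets every shard's updates influence the others between epochs.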

Page 42:
Page 43:

Outline

• Machine Learning intro

• Scaling machine learning algorithms up

• Design choices of large scale ML systems

Page 44:

http://www.flickr.com/photos/mr_t_in_dc/5469563053/

Scalable

Page 45:

http://www.flickr.com/photos/aloshbennett/3209564747/

Parallel

Page 46:

http://www.flickr.com/photos/wanderlinse/4367261825/

Accuracy

Page 48:

http://www.flickr.com/photos/brenderous/4532934181/

Binary Classification

Page 49:

http://www.flickr.com/photos/mararie/2340572508/

Automatic Feature Discovery

Page 50:

http://www.flickr.com/photos/prunejuice/3687192643/

Fast Response

Page 51:

http://www.flickr.com/photos/jepoirrier/840415676/

Memory is the new hard disk.

Page 52:

http://www.flickr.com/photos/neubie/854242030/

Algorithm + Infrastructure

Page 53:

Design for Multicores

http://www.flickr.com/photos/geektechnique/2344029370/

Page 54:

Combiner

Page 55:
Page 56:

Multi-shard Combiner

[Chandra et al., 2010]

Page 57:

Machine Learning on Big Data

Page 58:

Parallelize ML Algorithms

• Embarrassingly parallel

• Parallelize sub-routines

• Distributed learning

Page 59:

Parallel

Accuracy

Fast Response

Page 60:

Google APIs

• Prediction API

• machine learning service on the cloud

• http://code.google.com/apis/predict

• BigQuery

• interactive analysis of massive data on the cloud

• http://code.google.com/apis/bigquery