Description

Abstract: Machine learning researchers and practitioners develop computer algorithms that "improve performance automatically through experience". At Google, machine learning is applied to solve many problems, such as prioritizing emails in Gmail, recommending tags for YouTube videos, and identifying different aspects from online user reviews. Machine learning on big data, however, is challenging. Some "simple" machine learning algorithms with quadratic time complexity, while running fine on hundreds of records, are almost impractical to use on billions of records.

In this talk, I will describe lessons drawn from various Google projects on developing large-scale machine learning systems. These systems build on top of Google's computing infrastructure, such as GFS and MapReduce, and attack the scalability problem through massively parallel algorithms. I will present the design decisions made in these systems and strategies for scaling and speeding up machine learning systems on web-scale data.

Speaker biography: Max Lin is a software engineer with Google Research in the New York City office. He is the tech lead of the Google Prediction API, a machine learning web service in the cloud. Prior to Google, he published research work on video content analysis, sentiment analysis, machine learning, and cross-lingual information retrieval. He holds a PhD in Computer Science from Carnegie Mellon University.
Machine Learning on Big Data: Lessons Learned from Google Projects

Max Lin, Software Engineer | Google Research

Massively Parallel Computing | Harvard CS 264 | Guest Lecture | March 29th, 2011
Outline
• Machine Learning intro
• Scaling machine learning algorithms up
• Design choices of large scale ML systems
"Machine Learning is the study of computer algorithms that improve automatically through experience." [Mitchell, 1997]
Training
• "The quick brown fox jumped over the lazy dog." → English
• "To err is human, but to really foul things up you need a computer." → English
• "No hay mal que por bien no venga." → Spanish
• "La tercera es la vencida." → Spanish

Testing
• "To be or not to be -- that is the question" → ?
• "La fe mueve montañas." → ?
Input X → Model f(x) → Output Y. For a new input x', the trained model predicts f(x') = y'.
Linear Classifier

Example input: "The quick brown fox jumped over the lazy dog."

Feature vector (one dimension per vocabulary term):
$$x = [\,x_{\text{'a'}} = 0,\ x_{\text{'aardvark'}} = 0,\ \ldots,\ x_{\text{'dog'}} = 1,\ \ldots,\ x_{\text{'the'}} = 1,\ \ldots,\ x_{\text{'montañas'}} = 0,\ \ldots\,]$$

Weight vector:
$$w = [\,0.1,\ 132,\ \ldots,\ 150,\ 200,\ \ldots,\ -153,\ \ldots\,]$$

$$f(x) = w \cdot x = \sum_{p=1}^{P} w_p x_p$$
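To make the notation concrete, here is a minimal sketch (my own illustration, not code from the talk) of a bag-of-words linear classifier; the toy vocabulary and weights are invented, and a real system would have P on the order of a billion dimensions.

```python
# Toy vocabulary and weights (hypothetical values for illustration).
vocab = {'a': 0, 'aardvark': 1, 'dog': 2, 'the': 3, 'montañas': 4}
w = [0.1, 132.0, 150.0, 200.0, -153.0]

def featurize(text):
    """Binary bag of words: x_p = 1 iff vocabulary word p occurs in text."""
    words = set(text.lower().split())
    return [1.0 if word in words else 0.0 for word in vocab]

def f(x):
    """f(x) = w · x = sum over the P dimensions of w_p * x_p."""
    return sum(wp * xp for wp, xp in zip(w, x))

x = featurize("The quick brown fox jumped over the lazy dog.")
print(f(x))  # only 'the' matches this toy vocabulary ('dog.' keeps its period), so 200.0
```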
[Figure: training data as an N × P matrix — N examples (rows) of Input X with P features (columns), each paired with an Output Y]
Typical machine learning data at Google
• N (examples): 100 billion mean / 1 billion median
• P (features): 1 billion mean / 10 million median
Classifier Training
• Training: Given {(x, y)} and f, minimize the following objective function
$$\arg\min_{w} \sum_{i=1}^{N} L(y_i, f(x_i; w)) + R(w)$$
Use Newton's method? $w_{t+1} \leftarrow w_t - H(w_t)^{-1} \nabla J(w_t)$
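A quick back-of-the-envelope check (my addition, not on the slides) shows why not: with $P \approx 10^9$ features, the Hessian $H(w_t)$ is a $P \times P$ matrix, so

$$P^2 = 10^{18} \ \text{entries} \times 8 \ \text{bytes per double} = 8 \times 10^{18} \ \text{bytes} \approx 8 \ \text{exabytes}.$$

Storing it is infeasible, let alone the roughly $O(P^3)$ work of solving against it at every iteration.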
Outline
• Machine Learning intro
• Scaling machine learning algorithms up
• Design choices of large scale ML systems
Scaling Up
• Why big data?
• Parallelize machine learning algorithms
• Embarrassingly parallel
• Parallelize sub-routines
• Distributed learning
Subsampling

[Figure: Big Data is split into Shards 1…M; a subsample (reduced N) is drawn onto a single Machine, which trains the Model]
Why not Small Data?

[Figure: learning curves from Banko and Brill, 2001 — classifier accuracy keeps climbing as training data grows toward a billion words, for every algorithm tested]
Scaling Up
• Why big data?
• Parallelize machine learning algorithms
• Embarrassingly parallel
• Parallelize sub-routines
• Distributed learning
Parallelize Estimates
• Naive Bayes Classifier
• Maximum Likelihood Estimates
$$w_{\text{'the'}|EN} = \frac{\sum_{i=1}^{N} \mathbb{1}_{EN,\text{'the'}}(x_i)}{\sum_{i=1}^{N} \mathbb{1}_{EN}(x_i)}$$

$$\arg\min_{w} \; -\sum_{i=1}^{N} \Big[ \log P(y_i; w) + \sum_{p=1}^{P} \log P(x_{ip} \mid y_i; w) \Big]$$
Word Counting
• Map: input X = "The quick brown fox ...", Y = EN; emits ('the|EN', 1), ('quick|EN', 1), ('brown|EN', 1), ...
• Reduce: input [ ('the|EN', 1), ('the|EN', 1), ('the|EN', 1) ]; C('the'|EN) = sum of values = 3

$$w_{\text{'the'}|EN} = \frac{C(\text{'the'}|EN)}{C(EN)}$$
Word Counting with MapReduce

[Figure: Big Data is split into Shards 1…M; Mappers 1…M each emit pairs such as ('the'|EN, 1), ('fox'|EN, 1), …, ('montañas'|ES, 1); the Reducer tallies the counts and updates w to produce the Model]
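A minimal sketch of this job, with assumed function names rather than Google's actual MapReduce API: mappers emit ('word|LANG', 1) pairs from each labeled document, and the reducer sums the values for each key; the local dictionary below stands in for the shuffle phase.

```python
from collections import defaultdict

def mapper(record):
    """record = (text, label); emit one ('word|label', 1) pair per token."""
    text, label = record
    for word in text.lower().split():
        yield ('%s|%s' % (word, label), 1)

def reducer(key, values):
    """All values for a key arrive together; tally them into C(word|label)."""
    yield (key, sum(values))

# Simulate the shuffle on two toy shards.
shards = [[("The quick brown fox", "EN")], [("La fe mueve montañas", "ES")]]
grouped = defaultdict(list)
for shard in shards:                      # map phase, one mapper per shard
    for record in shard:
        for key, value in mapper(record):
            grouped[key].append(value)
counts = dict(kv for key in grouped for kv in reducer(key, grouped[key]))
print(counts['the|EN'])  # C('the'|EN) = 1 in this toy corpus
```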
Parallelize Optimization
• Maximum Entropy Classifiers
• Good: J(w) is concave
• Bad: no closed-form solution like NB
• Ugly: Large N
$$\arg\min_{w} \; -\sum_{i=1}^{N} \log \frac{\exp\big(\sum_{p=1}^{P} w_p x_{ip}\big)^{y_i}}{1 + \exp\big(\sum_{p=1}^{P} w_p x_{ip}\big)}$$
[Figure: gradient descent on an error surface; source: http://www.cs.cmu.edu/~epxing/Class/10701/Lecture/lecture7.pdf]
Gradient Descent
• w is initialized as zero
• for t in 1 to T:
• Calculate gradient $\nabla J(w_t)$
• $w_{t+1} \leftarrow w_t - \eta \nabla J(w_t)$

The gradient decomposes into a sum of per-example terms, which is what makes it parallelizable:

$$\nabla J(w) = \sum_{i=1}^{N} \nabla_w L(w; x_i, y_i)$$
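A minimal sketch (my own illustration) of this loop for the logistic model above: each of the T iterations computes the full gradient, a sum over all N examples, before taking one step; this is the O(TPN) cost the next slide attacks.

```python
import math

def gradient(w, X, Y):
    """Sum of per-example gradients: (sigma(w·x_i) - y_i) * x_i."""
    grad = [0.0] * len(w)
    for x, y in zip(X, Y):
        z = sum(wp * xp for wp, xp in zip(w, x))
        sigma = 1.0 / (1.0 + math.exp(-z))
        for p, xp in enumerate(x):
            grad[p] += (sigma - y) * xp
    return grad

def train(X, Y, P, T=100, eta=0.1):
    w = [0.0] * P                          # w is initialized as zero
    for _ in range(T):                     # for t in 1 to T
        g = gradient(w, X, Y)              # O(PN) work per pass
        w = [wp - eta * gp for wp, gp in zip(w, g)]
    return w
```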
Distribute Gradient
• w is initialized as zero
• for t in 1 to T:
• Calculate gradients in parallel, then update $w_{t+1} \leftarrow w_t - \eta \nabla J(w_t)$
• Training CPU: O(TPN) to O(TPN / M)
Distribute Gradient with MapReduce

[Figure: Big Data is split into Shards 1…M; Machines 1…M each emit (dummy key, partial gradient sum); the Reducer sums the partials and updates w; the Map/Reduce round repeats until convergence, producing the Model]
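A minimal sketch of one such round, reusing the gradient() helper from the previous sketch; the function names and the single 'dummy' key are assumptions mirroring the diagram, not Google's infrastructure.

```python
def map_partial_gradient(w, shard):
    """One machine: gradient over its own shard only, O(PN/M) work."""
    X, Y = zip(*shard)
    yield ('dummy', gradient(w, X, Y))     # gradient() from the sketch above

def reduce_update(w, partials, eta=0.1):
    """Sum the M partial gradients under the dummy key, then step once."""
    total = [sum(parts) for parts in zip(*partials)]
    return [wp - eta * gp for wp, gp in zip(w, total)]

def distributed_train(shards, P, T=100):
    w = [0.0] * P
    for _ in range(T):                     # one MapReduce round per iteration
        partials = [g for shard in shards
                    for _, g in map_partial_gradient(w, shard)]
        w = reduce_update(w, partials)
    return w
```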
• Why big data?
• Parallelize machine learning algorithms
• Embarrassingly parallel
• Parallelize sub-routines
• Distributed learning
Scaling Up
Parallelize Subroutines
• Support Vector Machines: primal problem

$$\arg\min_{w,b,\zeta} \ \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{n} \zeta_i \quad \text{s.t.} \ 1 - y_i(w \cdot \phi(x_i) + b) \le \zeta_i,\ \zeta_i \ge 0$$

• Solve the dual problem

$$\arg\min_{\alpha} \ \frac{1}{2}\alpha^T Q \alpha - \alpha^T \mathbf{1} \quad \text{s.t.} \ 0 \le \alpha \le C,\ y^T \alpha = 0$$
The computational cost of the Primal-Dual Interior Point Method is O(n^3) in time and O(n^2) in memory.
Parallel SVM
• Parallel, row-wise Incomplete Cholesky Factorization (ICF) for Q
• Parallel interior point method
• Time O(n^3) becomes O(n^2 / M)
• Memory O(n^2) becomes O(n / M)
• Parallel Support Vector Machines (psvm): http://code.google.com/p/psvm/
• Implemented in MPI
Parallel ICF [Chang et al., 2007]
• Distribute Q by row into M machines
• For each dimension n < √N:
• Send local pivots to master
• Master selects the largest local pivot and broadcasts the global pivot to workers

[Figure: row-wise distribution of Q — Machine 1 holds rows 1-2, Machine 2 rows 3-4, Machine 3 rows 5-6, ...]
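A serial sketch of the pivoted incomplete Cholesky factorization at the heart of this step (my own simplification; see Chang et al., 2007 for the real parallel algorithm): Q is approximated as H Hᵀ with rank p ≈ √n. The row loop is what PSVM distributes — each machine computes its own rows of column k and proposes its best local pivot to the master.

```python
import numpy as np

def icf(Q, p):
    """Approximate a PSD matrix Q (n×n) as H @ H.T with H of rank p."""
    n = Q.shape[0]
    H = np.zeros((n, p))
    d = np.diag(Q).astype(float).copy()    # residual diagonal
    for k in range(p):
        j = int(np.argmax(d))              # pivot = largest residual diagonal
        H[j, k] = np.sqrt(d[j])
        for i in range(n):                 # row-wise: the parallelizable loop
            if i != j:
                H[i, k] = (Q[i, j] - H[i, :k] @ H[j, :k]) / H[j, k]
        d -= H[:, k] ** 2                  # shrink the residual diagonal
    return H

Q = np.array([[4.0, 2.0, 0.5],
              [2.0, 3.0, 1.0],
              [0.5, 1.0, 2.0]])
H = icf(Q, 2)
print(np.round(H @ H.T, 2))               # low-rank approximation of Q
```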
Scaling Up
• Why big data?
• Parallelize machine learning algorithms
• Embarrassingly parallel
• Parallelize sub-routines
• Distributed learning
Majority Vote
[Figure: Big Data is split into Shards 1…M; Machines 1…M each train Model 1…Model M independently; Map only, no Reduce]
• Train individual classifiers independently
• Predict by taking majority votes
• Training CPU: O(TPN) to O(TPN / M)
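A minimal sketch (assumed names; each trained model is taken to be a callable from input to label): training is map-only, one learner per shard, and prediction combines the M models by vote.

```python
from collections import Counter

def train_all(shards, train_one):
    """Map only: train one model per shard, independently, on M machines."""
    return [train_one(shard) for shard in shards]

def predict(models, x):
    """Each model labels x; the most common label wins."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]
```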
Parameter Mixture
[Figure: Big Data is split into Shards 1…M; Machines 1…M each train locally and emit (dummy key, w_m); the Reducer averages the w's into the Model] [Mann et al., 2009]

Much less network usage than distributed gradient descent: O(MN) vs. O(MNT)
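A minimal sketch of the reduce step (my own illustration): every machine trains a full weight vector on its shard, and a single reduce averages them coordinate-wise, so each w crosses the network exactly once.

```python
def parameter_mixture(shards, train_one):
    """Map: one weight vector per shard; Reduce: coordinate-wise average."""
    weight_vectors = [train_one(shard) for shard in shards]   # map phase
    M = len(weight_vectors)
    return [sum(ws) / M for ws in zip(*weight_vectors)]       # reduce phase
```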
Iterative Parameter Mixture [McDonald et al., 2010]

[Figure: same structure as Parameter Mixture, but Reduce runs after each epoch — Machines 1…M emit (dummy key, w_m) at the end of every epoch, and the averaged w is redistributed as the starting point for the next epoch, producing the Model]
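A minimal sketch of the variant above (assumed names): the same average is computed after every epoch, and the mixed weights are redistributed as the starting point for each machine's next epoch.

```python
def iterative_parameter_mixture(shards, train_one_epoch, P, epochs=10):
    w = [0.0] * P
    for _ in range(epochs):
        # Map: each machine runs one training epoch starting from the average.
        local_ws = [train_one_epoch(w, shard) for shard in shards]
        # Reduce: average, then broadcast for the next epoch.
        w = [sum(ws) / len(local_ws) for ws in zip(*local_ws)]
    return w
```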
Outline
• Machine Learning intro
• Scaling machine learning algorithms up
• Design choices of large scale ML systems
Design Choices of Large Scale ML Systems
• Scalable
• Parallel
• Accuracy
• Binary Classification
• Automatic Feature Discovery
• Fast Response
• Memory is the new hard disk.
• Algorithm + Infrastructure
• Design for Multicores
• Combiner; Multi-shard Combiner [Chandra et al., 2010]
Machine Learning on Big Data
Parallelize ML Algorithms
• Embarrassingly parallel
• Parallelize sub-routines
• Distributed learning
Parallel, Accuracy, Fast Response
Google APIs
• Prediction API
• machine learning service in the cloud
• http://code.google.com/apis/predict
• BigQuery
• interactive analysis of massive data in the cloud
• http://code.google.com/apis/bigquery