Carnegie Mellon
A New Parallel Framework for Machine Learning
Joseph Gonzalez
Joint work with: Yucheng Low, Aapo Kyrola, Danny Bickson, Carlos Guestrin, Guy Blelloch, Joe Hellerstein, David O’Hallaron, Alex Smola
In ML we face BIG problems:
- 48 hours of video uploaded to YouTube every minute
- 24 million Wikipedia pages
- 750 million Facebook users
- 6 billion Flickr photos
Massive data provides opportunities for rich probabilistic structure …
(Figure: Shopper 1 likes Cameras, Shopper 2 likes Cooking, and the two are connected through a social network.)
What are the tools for massive data?
Parallelism: Hope & Challenges
Wide array of different parallel architectures: GPUs, Multicore, Clusters, Mini Clouds, Clouds.
New challenges for designing machine learning algorithms: race conditions and deadlocks; managing distributed model state.
New challenges for implementing machine learning algorithms: parallel debugging and profiling; hardware-specific APIs.
Thesis: “Parallel Learning and Inference in Probabilistic Graphical Models”
(Diagram: Massive Structured Problems → Probabilistic Graphical Models → Parallel Algorithms for Probabilistic Learning and Inference → GraphLab → Advances Parallel Hardware)
How will we design and implement parallel learning systems?
We could use … Threads, Locks, & Messages: “low-level parallel primitives”.
Threads, Locks, and Messages
ML experts repeatedly solve the same parallel design challenges:
- Implement and debug a complex parallel system
- Tune for a specific parallel platform
Two months later the conference paper contains: “We implemented ______ in parallel.”
The resulting code:
- is difficult to maintain
- is difficult to extend
- couples the learning model to the parallel implementation
Graduate students
… a better answer: Map-Reduce / Hadoop
Build learning algorithms on top of high-level parallel abstractions.
MapReduce – Map Phase
Embarrassingly parallel, independent computation; no communication needed.
(Figure: CPU 1–4 each compute a feature value for their own slice of the images.)
MapReduce – Reduce Phase
(Figure: CPU 1 and CPU 2 aggregate the per-image feature values into Attractive Face Statistics and Not Attractive Face Statistics.)
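The pattern in these two phases can be sketched in a few lines (a toy illustration, not Hadoop code; the Image type and labels are made up):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy MapReduce: each "image" carries a class label and one extracted feature.
struct Image { std::string label; double feature; };

// Map phase: embarrassingly parallel; each record is processed independently,
// producing a (key, value) pair. No communication is needed.
std::pair<std::string, double> map_fn(const Image& img) {
    return {img.label, img.feature};
}

// Reduce phase: fold all values sharing a key into one aggregate (here, a sum
// per label, standing in for the per-class statistics).
std::map<std::string, double> map_reduce(const std::vector<Image>& images) {
    std::map<std::string, double> sums;
    for (const auto& img : images) {
        auto kv = map_fn(img);   // in a real system the maps run in parallel
        sums[kv.first] += kv.second;
    }
    return sums;
}
```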
Map-Reduce for Data-Parallel ML
Excellent for large data-parallel tasks!
Data-Parallel (Map Reduce): Feature Extraction, Cross Validation, Computing Sufficient Statistics
Graph-Parallel: Belief Propagation, Label Propagation, Kernel Methods, Deep Belief Networks, Neural Networks, Tensor Factorization, PageRank, Lasso
Is there more to Machine Learning?
Concrete Example: Label Propagation
Label Propagation Algorithm
Social Arithmetic example:
  Likes(Me) = 50% × (what I list on my profile)
            + 40% × (what Sue Ann likes)
            + 10% × (what Carlos likes)
  My profile: 50% Cameras, 50% Biking
  Sue Ann likes: 80% Cameras, 20% Biking
  Carlos likes: 30% Cameras, 70% Biking
  ⇒ I Like: 60% Cameras, 40% Biking
Recurrence Algorithm: Likes[i] = Σ_{j ∈ Friends[i]} W_ij × Likes[j] (with the self-weight W_ii included); iterate until convergence.
Parallelism: compute all Likes[i] in parallel.
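One step of this social arithmetic, sketched as code (toy data structures, not GraphLab’s API):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Each vertex holds an interest distribution, e.g. {P(Cameras), P(Biking)}.
using Dist = std::vector<double>;

struct Neighbor { const Dist* likes; double weight; };

// One update: Likes[i] becomes the weighted average of the vertex's own
// profile and its neighbors' current estimates (weights sum to 1).
Dist update_likes(const Dist& profile, double self_weight,
                  const std::vector<Neighbor>& friends) {
    Dist out(profile.size(), 0.0);
    for (size_t k = 0; k < profile.size(); ++k)
        out[k] = self_weight * profile[k];
    for (const auto& f : friends)
        for (size_t k = 0; k < out.size(); ++k)
            out[k] += f.weight * (*f.likes)[k];
    return out;
}
```

With the slide’s numbers (50% self on {0.5, 0.5}, 40% on {0.8, 0.2}, 10% on {0.3, 0.7}) this yields {0.6, 0.4}.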
Properties of Graph Parallel Algorithms
- Dependency graph: what I like depends on what my friends like
- Factored computation
- Iterative computation
Why not use Map-Reduce for Graph-Parallel Algorithms?
Data Dependencies
Map-Reduce does not efficiently express dependent data:
- User must code substantial data transformations
- Costly data replication of independent data rows
Iterative Algorithms
Map-Reduce does not efficiently express iterative algorithms:
(Figure: each iteration re-maps all data across CPUs and ends in a barrier; a single slow processor stalls every iteration.)
MapAbuse: Iterative MapReduce
Only a subset of the data needs computation in each iteration, yet every iteration re-processes all of it.
(Figure: the same barrier diagram; most data is unchanged between iterations.)
MapAbuse: Iterative MapReduce
The system is not optimized for iteration: every iteration pays a startup penalty and a disk penalty, since state is written to and read back from disk between iterations.
For Graph-Parallel algorithms: Map Reduce? Pregel (Giraph)?
Pregel (Giraph)
Bulk Synchronous Parallel model: compute, communicate, barrier.
Bulk synchronous computation can be highly inefficient.
Example: Loopy Belief Propagation
Loopy Belief Propagation (Loopy BP)
• Iteratively estimate the “beliefs” about vertices:
  – Read in messages
  – Update the marginal estimate (belief)
  – Send updated output messages
• Repeat for all variables until convergence
Bulk Synchronous Loopy BP
• Often considered embarrassingly parallel:
  – Associate a processor with each vertex
  – Receive all messages
  – Update all beliefs
  – Send all messages
• Proposed by: Brunton et al. CRV’06; Mendiburu et al. GECC’07; Kang et al. LDMTA’10; …
Sequential Computational Structure
Hidden Sequential Structure
• Running Time: (time for a single parallel iteration) × (number of iterations)
(Figure: evidence at both ends of a chain must propagate through every vertex.)
Optimal Sequential Algorithm
Running time on a chain of n vertices:
- Bulk Synchronous: 2n²/p, for p ≤ 2n processors
- Forward-Backward (sequential, p = 1): 2n
- Optimal Parallel (p = 2): n
Even with many processors there is a gap between bulk synchronous and the optimal schedule.
The Splash Operation
• Generalize the optimal chain algorithm to arbitrary cyclic graphs:
  1) Grow a BFS spanning tree with fixed size
  2) Forward pass computing all messages at each vertex
  3) Backward pass computing all messages at each vertex
Data-Parallel Algorithms can be Inefficient
The limitations of the Map-Reduce abstraction can lead to inefficient parallel algorithms.
(Figure: runtime in seconds vs. number of CPUs; asynchronous Splash BP substantially outperforms optimized in-memory bulk synchronous BP.)
The Need for a New Abstraction
Map-Reduce is not well suited for graph-parallelism, and neither is the bulk synchronous Pregel (Giraph) model.
What is GraphLab?
The GraphLab Framework
- Graph-Based Data Representation
- Update Functions (User Computation)
- Scheduler
- Consistency Model
Data Graph
A graph with arbitrary data (C++ objects) associated with each vertex and edge.
- Graph: social network
- Vertex Data: user profile text; current interest estimates
- Edge Data: similarity weights
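A toy version of such a data graph (illustrative structures only; the real GraphLab graph type differs):

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Arbitrary user data on vertices and edges, as in the GraphLab data graph.
struct VertexData {
    std::string profile_text;       // user profile text
    std::vector<double> interests;  // current interest estimates
};
struct EdgeData {
    double similarity;              // similarity weight
};
struct Edge { int src, dst; EdgeData data; };

// A minimal adjacency-list data graph holding both kinds of data.
struct DataGraph {
    std::vector<VertexData> vertices;
    std::vector<Edge> edges;
    std::vector<std::vector<int>> out_edges;  // vertex -> indices into edges

    int add_vertex(VertexData v) {
        vertices.push_back(std::move(v));
        out_edges.emplace_back();
        return (int)vertices.size() - 1;
    }
    void add_edge(int s, int d, EdgeData e) {
        out_edges[s].push_back((int)edges.size());
        edges.push_back({s, d, e});
    }
};
```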
Implementing the Data Graph
Multicore setting (in memory):
- Challenge: fast lookup, low overhead
- Solution: dense data structures; fixed Vdata & Edata types; immutable graph structure
Cluster setting (in memory):
- Partition the graph with ParMETIS or random cuts
- Cached ghosting: each node caches ghost copies of the boundary vertices owned by the other node
Update Functions
An update function is a user-defined program which, when applied to a vertex, transforms the data in the scope of the vertex.

label_prop(i, scope) {
  // Get neighborhood data (Likes[i], W_ij, Likes[j]) from scope

  // Update the vertex data
  Likes[i] = Σ_{j ∈ Friends[i]} W_ij × Likes[j]

  // Reschedule neighbors if needed
  if Likes[i] changes then
    reschedule_neighbors_of(i)
}
The Scheduler
The scheduler determines the order in which vertices are updated.
(Figure: a scheduler queue of vertices feeds CPU 1 and CPU 2; executing update functions may add new vertices back into the queue.)
The process repeats until the scheduler is empty.
Choosing a Schedule
GraphLab provides several different schedulers:
- Round Robin: vertices are updated in a fixed order
- FIFO: vertices are updated in the order they are added
- Priority: vertices are updated in priority order
The choice of schedule affects the correctness and parallel performance of the algorithm. Obtain different algorithms by simply changing a flag!
  --scheduler=roundrobin  --scheduler=fifo  --scheduler=priority
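The FIFO variant can be sketched as follows (a toy scheduler, not GraphLab’s implementation; note that rescheduling an already-pending vertex is a no-op):

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <set>

// Toy vertex scheduler in the spirit of a FIFO schedule. Vertices are
// deduplicated: scheduling a vertex that is already pending does nothing.
class FifoScheduler {
    std::queue<int> order_;
    std::set<int> pending_;
public:
    void schedule(int v) {
        if (pending_.insert(v).second) order_.push(v);
    }
    bool empty() const { return order_.empty(); }
    int next() {  // caller must check !empty()
        int v = order_.front(); order_.pop();
        pending_.erase(v);
        return v;
    }
};

// Drive update functions until the scheduler drains.
void run(FifoScheduler& sched,
         const std::function<void(int, FifoScheduler&)>& update) {
    while (!sched.empty()) update(sched.next(), sched);
}
```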
Ensuring Race-Free Code
How much can computation overlap?
Common problem: write-write race. Processors running adjacent update functions simultaneously modify shared data, and the final value depends on which write lands last.
Importance of Consistency
(Figure: convergence of Alternating Least Squares with and without consistency.)
Consistency also matters for the development cycle: ML development is an iterative Build → Test → Debug → Tweak Model loop.
GraphLab Ensures Sequential Consistency
For each parallel execution, there exists a sequential execution of update functions which produces the same result.
(Figure: a parallel trace on CPU 1 and CPU 2 is equivalent to some interleaving on a single CPU.)
Consistency Rules
Guaranteed sequential consistency for all update functions.
Full Consistency: the update function has read/write access to all data in its scope (the vertex, its edges, and its neighbors), so the scopes of concurrently running updates must not overlap.
Obtaining More Parallelism
Edge Consistency: read/write access to the vertex and its adjacent edges, but read-only access to adjacent vertices; adjacent updates can then safely run concurrently, since they only read the shared vertices.
Consistency Through R/W Locks
Read/write locks, acquired in a canonical lock ordering to avoid deadlock:
- Full Consistency: write-lock the center vertex and all of its neighbors
- Edge Consistency: write-lock the center vertex; read-lock its neighbors
Multicore setting: Pthread R/W locks. Distributed setting: distributed locking over the partitioned data graph. A lock pipeline prefetches locks and data, allowing computation to proceed while locks/data are requested.
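A sketch of edge-consistency locking with a canonical ordering (using C++17 `std::shared_mutex` as a stand-in for Pthread R/W locks; the function names are made up):

```cpp
#include <algorithm>
#include <cassert>
#include <shared_mutex>
#include <vector>

// Toy edge-consistency locking. Locks are always acquired in ascending
// vertex-id order (the canonical ordering), which prevents deadlock between
// adjacent update functions: the center vertex is write-locked, neighbors
// are read-locked.
void lock_scope_edge_consistent(std::vector<std::shared_mutex>& locks,
                                int center, std::vector<int> neighbors) {
    neighbors.push_back(center);
    std::sort(neighbors.begin(), neighbors.end());  // canonical order
    for (int v : neighbors) {
        if (v == center) locks[v].lock();           // write lock on the vertex
        else             locks[v].lock_shared();    // read lock on a neighbor
    }
}

void unlock_scope_edge_consistent(std::vector<std::shared_mutex>& locks,
                                  int center, const std::vector<int>& neighbors) {
    locks[center].unlock();
    for (int v : neighbors) locks[v].unlock_shared();
}
```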
Anatomy of a GraphLab Program:
1) #include <graphlab.hpp>
2) Define C++ update function
3) Build data graph using the C++ graph object
4) Set engine parameters: scheduler type, consistency model
5) Add initial vertices to the scheduler
6) Run the engine on the graph [blocking call]
7) Final answer is stored in the graph
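The control flow of steps 5–7 can be mirrored with toy stand-ins (not the real graphlab.hpp API):

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <vector>

// Toy stand-ins for the GraphLab anatomy: a graph of doubles, an update
// function, a FIFO "scheduler", and a blocking "engine" run. The final
// answer is left in the graph, as in step 7.
struct ToyGraph {
    std::vector<double> vdata;
    std::vector<std::vector<int>> nbrs;
};
using Update = std::function<void(int, ToyGraph&, std::queue<int>&)>;

void run_engine(ToyGraph& g, std::queue<int>& sched, const Update& update) {
    while (!sched.empty()) {  // blocking call: returns when the queue drains
        int v = sched.front(); sched.pop();
        update(v, g, sched);  // may schedule further vertices
    }
}
```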
Algorithms Implemented: PageRank, Loopy Belief Propagation, Gibbs Sampling, CoEM, Graphical Model Parameter Learning, Probabilistic Matrix/Tensor Factorization, Alternating Least Squares, Lasso with Sparse Features, Support Vector Machines with Sparse Features, Label Propagation, …
Shared Memory Experiments
Shared memory setting: 16-core workstation.
Loopy Belief Propagation
3D retinal image denoising
- Data Graph: 1 million vertices, 3 million edges
- Update Function: Loopy BP update equation
- Scheduler: SplashBP
- Consistency Model: edge consistency
Loopy Belief Propagation
(Figure: speedup vs. number of CPUs; SplashBP achieves a 15.5x speedup on 16 CPUs, close to optimal.)
Gibbs Sampling
Protein-protein interaction networks [Elidan et al. 2006]
- Provably correct parallelization via edge consistency
- Discrete MRF: 14K vertices, 100K edges
(Figure: protein backbone, side-chain, and interaction structure.)
Gibbs Sampling
(Figure: speedup vs. number of CPUs for the Chromatic Gibbs Sampler.)
Splash Gibbs Sampler
An asynchronous Gibbs sampler that adaptively addresses strong dependencies.
Splash Gibbs Sampler
Step 1: Grow multiple conditionally independent Splashes in parallel (trees of bounded tree-width, e.g. tree-width 1 or 2).
Step 2: Calibrate the trees in parallel.
Step 3: Sample the trees in parallel.
Experimental Results
The Splash sampler outperforms the Chromatic sampler on models with strong dependencies.
• Markov logic network with strong dependencies: 10K variables, 28K factors
(Figures: likelihood of the final sample, “mixing”, and speedup in sample generation, each favoring Splash over Chromatic.)
CoEM (Rosie Jones, 2005)
Named Entity Recognition task: is “Dog” an animal? Is “Catalina” a place?
(Figure: bipartite graph linking noun phrases such as “the dog”, “Australia”, “Catalina Island” to contexts such as “<X> ran quickly”, “travelled to <X>”, “<X> is pleasant”.)
- Vertices: 2 million; Edges: 200 million
- Hadoop: 95 cores, 7.5 hrs
CoEM (Rosie Jones, 2005)
(Figure: GraphLab CoEM speedup vs. number of CPUs, near optimal.)
- Hadoop: 95 cores, 7.5 hrs
- GraphLab: 16 cores, 30 min
15x faster with 6x fewer CPUs!
Lasso: Regularized Linear Model
(Figure: bipartite graph between the weights (d × 1) and the observations (n × 1), induced by the n × d data matrix; the example shows 5 features and 4 examples.)
Shooting Algorithm [Coordinate Descent]:
• Updates on weight vertices modify the losses on observation vertices.
• Requires the Full Consistency model.
Financial prediction dataset from Kogan et al. [2009].
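One shooting pass can be sketched as follows (a dense toy version of coordinate descent for ½‖Xw − y‖² + λ‖w‖₁; the talk’s graph formulation distributes these per-weight updates):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Soft-thresholding operator used by the shooting update.
double soft_threshold(double rho, double lambda) {
    if (rho >  lambda) return rho - lambda;
    if (rho < -lambda) return rho + lambda;
    return 0.0;
}

// One full pass of the shooting algorithm: for each coordinate j, compute
// rho_j = x_j^T (partial residual excluding j) and z_j = x_j^T x_j, then
// w_j = soft_threshold(rho_j, lambda) / z_j.
void shooting_pass(const std::vector<std::vector<double>>& X,  // n x d
                   const std::vector<double>& y,
                   double lambda, std::vector<double>& w) {
    size_t n = X.size(), d = w.size();
    for (size_t j = 0; j < d; ++j) {
        double rho = 0.0, z = 0.0;
        for (size_t i = 0; i < n; ++i) {
            double r_ij = y[i];  // residual with coordinate j left out
            for (size_t k = 0; k < d; ++k)
                if (k != j) r_ij -= X[i][k] * w[k];
            rho += X[i][j] * r_ij;
            z   += X[i][j] * X[i][j];
        }
        w[j] = soft_threshold(rho, lambda) / z;
    }
}
```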
Full Consistency
(Figure: speedup vs. number of CPUs under the full consistency model, for dense and sparse problems, at varying regularization.)
Relaxing Consistency
(Figure: speedup vs. number of CPUs improves further when consistency is relaxed, for dense and sparse problems.)
Why does this work? (See the Shotgun ICML paper.)
Experiments
Amazon EC2, high-performance nodes.
Video Cosegmentation
Segments mean the same thing across frames.
- Model: 10.5 million nodes, 31 million edges
- Gaussian EM clustering + BP on a 3D grid
(Figure: video cosegmentation speedups.)
Matrix Factorization
Netflix collaborative filtering: Alternating Least Squares matrix factorization.
- Model: 0.5 million nodes, 99 million edges
(Figure: bipartite Netflix graph between Users and Movies, factored with latent dimension d.)
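A rank-1 sketch of one ALS half-step (fix the movie factors and solve each user factor in closed form; toy code, the real model is rank-d):

```cpp
#include <cassert>
#include <vector>

// Rank-1 Alternating Least Squares half-step for a toy ratings matrix.
// With movie factors v fixed, each user factor has the closed form
// u_i = (sum_j r_ij * v_j) / (sum_j v_j^2) over the movies j that user rated.
struct Rating { int user, movie; double value; };

std::vector<double> update_users(const std::vector<Rating>& ratings,
                                 const std::vector<double>& v, int num_users) {
    std::vector<double> num(num_users, 0.0), den(num_users, 0.0);
    for (const auto& r : ratings) {
        num[r.user] += r.value * v[r.movie];
        den[r.user] += v[r.movie] * v[r.movie];
    }
    std::vector<double> u(num_users, 0.0);
    for (int i = 0; i < num_users; ++i)
        if (den[i] > 0) u[i] = num[i] / den[i];  // per-user least squares
    return u;
}
```

The movie half-step is symmetric, and the two alternate until convergence.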
(Figure: Netflix speedup as the size of the matrix factorization increases.)
The Cost of Hadoop
(Figure: cost comparison against Hadoop.)
Summary
An abstraction tailored to Machine Learning:
- Targets graph-parallel algorithms
- Naturally expresses data/computational dependencies and dynamic iterative computation
- Simplifies parallel algorithm design
- Automatically ensures data consistency
- Achieves state-of-the-art parallel performance on a variety of problems
Check out GraphLab: http://graphlab.org
Documentation… Code… Tutorials…
Questions & Feedback
Current/Future Work:
- Out-of-core storage
- Hadoop/HDFS integration: graph construction, graph storage, launching GraphLab from Hadoop, fault tolerance through HDFS checkpoints
- Sub-scope parallelism: address the challenge of very high degree nodes
- Improved graph partitioning
- Support for dynamic graph structure