A New Parallel Framework for Machine Learning
Joseph Gonzalez
Joint work with Yucheng Low, Aapo Kyrola, Danny Bickson, Carlos Guestrin, Guy Blelloch, Joe Hellerstein, David O’Hallaron, and Alex Smola
Slide 2
The foundation of computation is changing
Slide 3
Parallelism: Hope for the Future
Wide array of different parallel architectures: GPUs, Multicore, Clusters, Mini Clouds, Clouds.
New challenges for designing machine learning algorithms: race conditions and deadlocks; managing distributed model state.
New challenges for implementing machine learning algorithms: parallel debugging and profiling; hardware-specific APIs.
Slide 4
The foundation of computation is changing: the scale of machine learning is exploding.
Slide 5
27 hours of video uploaded to YouTube every minute; 13 million Wikipedia pages; 750 million Facebook users; 3.6 billion Flickr photos.
Slide 6
Massive data provides opportunities for rich probabilistic structure.
Slide 7
Example: two shoppers connected in a social network, with related interests (Cameras, Cooking).
Slide 8
How do we connect Massive Structured Problems with Advances in Parallel Hardware?
Thesis: Parallel Learning and Inference in Probabilistic Graphical Models.
Slide 9
The thesis bridges Massive Structured Problems and Advances in Parallel Hardware through a stack of three layers: Probabilistic Graphical Models, Parallel Algorithms for Probabilistic Learning and Inference, and GraphLab.
Slide 12
Question: How will we design and implement parallel learning systems?
Slide 13
We could use... Threads, Locks, & Messages: build each new learning system using low-level parallel primitives.
Slide 14
Threads, Locks, and Messages
ML experts (i.e., graduate students) repeatedly solve the same parallel design challenges: implement and debug a complex parallel system, then tune it for a specific parallel platform. Two months later the conference paper contains: "We implemented ______ in parallel." The resulting code is difficult to maintain, difficult to extend, and couples the learning model to the parallel implementation.
Slide 15
A better answer: Map-Reduce / Hadoop. Build learning algorithms on top of high-level parallel abstractions.
Slide 16
MapReduce Map Phase (Slides 16-18): CPU 1 through CPU 4 each run an embarrassingly parallel, independent computation on their own blocks of data, producing per-block values (e.g., 12.9, 42.3, 21.3, 25.8), with no communication needed.
Map-Reduce for Data-Parallel ML: excellent for large data-parallel tasks! Is there more to Machine Learning?
Data-Parallel (Map Reduce): Cross Validation, Feature Extraction, Computing Sufficient Statistics
Graph-Parallel (?): Belief Propagation, SVM, Kernel Methods, Deep Belief Networks, Neural Networks, Tensor Factorization, Sampling, Lasso
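To make the data-parallel column concrete, here is a minimal C++ sketch (not from the talk; names and the thread layout are illustrative) of computing sufficient statistics in map-reduce style: each chunk is mapped to a partial sum and count independently, then the partials are reduced.

#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Per-chunk "map": a partial sum and count, computed with no communication.
struct Stats { double sum = 0.0; std::size_t count = 0; };

Stats map_chunk(const std::vector<double>& chunk) {
    return { std::accumulate(chunk.begin(), chunk.end(), 0.0), chunk.size() };
}

int main() {
    std::vector<std::vector<double>> chunks = {{12.9, 42.3}, {21.3, 25.8}};
    std::vector<Stats> partial(chunks.size());

    // Map phase: one thread per chunk, embarrassingly parallel.
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < chunks.size(); ++i)
        workers.emplace_back([&, i] { partial[i] = map_chunk(chunks[i]); });
    for (auto& w : workers) w.join();

    // Reduce phase: combine the partial sufficient statistics into the mean.
    Stats total;
    for (const auto& p : partial) { total.sum += p.sum; total.count += p.count; }
    double mean = total.sum / static_cast<double>(total.count);
    return mean > 0.0 ? 0 : 1;
}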
Slide 21
Concrete Example: Label Propagation
Slide 22
Label Propagation Algorithm
Social arithmetic: my profile is weighted 50% by what I list on my profile, 40% by what Sue Ann likes, and 10% by what Carlos likes. With Sue Ann at 80% Cameras / 20% Biking, Carlos at 30% Cameras / 70% Biking, and my own profile at 50% Cameras / 50% Biking, I like 0.5×50% + 0.4×80% + 0.1×30% = 60% Cameras and 40% Biking.
Recurrence algorithm: Likes[i] = Σ_j W_ij × Likes[j]; iterate until convergence.
Parallelism: compute all Likes[i] in parallel.
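A minimal C++ sketch of this social-arithmetic step (plain sequential code with illustrative names, not the GraphLab API): one vertex's interest vector is recomputed as the weighted average of its own profile and its neighbors' interests.

#include <cstddef>
#include <cstdio>
#include <vector>

// One label-propagation step for a single vertex:
// Likes[i] = sum_j W_ij * Likes[j], where the weights sum to 1 and the
// vertex's own profile is included as one of the "neighbors".
std::vector<double> update_likes(const std::vector<double>& weights,
                                 const std::vector<std::vector<double>>& neighbor_likes) {
    std::vector<double> likes(neighbor_likes.front().size(), 0.0);
    for (std::size_t j = 0; j < weights.size(); ++j)
        for (std::size_t k = 0; k < likes.size(); ++k)
            likes[k] += weights[j] * neighbor_likes[j][k];
    return likes;
}

int main() {
    // Weights: 50% my profile, 40% Sue Ann, 10% Carlos.
    std::vector<double> w = {0.5, 0.4, 0.1};
    // Interest vectors: {Cameras, Biking}.
    std::vector<std::vector<double>> likes = {{0.5, 0.5},   // my profile
                                              {0.8, 0.2},   // Sue Ann
                                              {0.3, 0.7}};  // Carlos
    std::vector<double> me = update_likes(w, likes);
    std::printf("Cameras: %.0f%%  Biking: %.0f%%\n", 100 * me[0], 100 * me[1]);  // 60% / 40%
    return 0;
}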
Slide 23
Properties of Graph-Parallel Algorithms: a dependency graph (what I like depends on what my friends like), factored computation, and iterative computation.
Slide 24
Map-Reduce for Data-Parallel ML: excellent for large data-parallel tasks!
Data-Parallel (Map Reduce): Cross Validation, Feature Extraction, Computing Sufficient Statistics
Graph-Parallel (Map Reduce?): Belief Propagation, SVM, Kernel Methods, Deep Belief Networks, Neural Networks, Tensor Factorization, Sampling, Lasso
Slide 25
Why not use Map-Reduce for Graph-Parallel Algorithms?
Slide 26
Data Dependencies
Map-Reduce does not efficiently express dependent data: it assumes independent data rows, so the user must code substantial data transformations, with costly data replication.
Slide 27
Iterative Algorithms
Map-Reduce does not efficiently express iterative algorithms: each iteration re-distributes the data across the CPUs and ends at a barrier, where every processor waits for the slowest one.
Slide 28
MapAbuse: Iterative MapReduce
Often only a subset of the data needs computation in each iteration, yet every iteration still reprocesses all of the data and waits at the barrier.
Slide 29
MapAbuse: Iterative MapReduce
The system is not optimized for iteration: every iteration pays a startup penalty and a disk penalty as data is written out and read back between phases.
Slide 30
Synchronous vs. Asynchronous
Example algorithm: if a Red neighbor, then turn Red.
Synchronous computation (Map-Reduce): evaluate the condition on all vertices in every phase; 4 phases, each with 9 computations, for 36 computations total.
Asynchronous computation (wave-front): evaluate the condition only when a neighbor changes; 4 phases, each with 2 computations, for 8 computations total.
Slide 31
Data-Parallel Algorithms can be Inefficient
The limitations of the Map-Reduce abstraction can lead to inefficient parallel algorithms. (Figure: runtime of optimized in-memory MapReduce BP vs. asynchronous Splash BP.)
Slide 32
The Need for a New Abstraction
Map-Reduce is not well suited for graph-parallelism.
Data-Parallel (Map Reduce): Cross Validation, Feature Extraction, Computing Sufficient Statistics
Graph-Parallel (?): Belief Propagation, SVM, Kernel Methods, Deep Belief Networks, Neural Networks, Tensor Factorization, Sampling, Lasso
Slide 33
What is GraphLab?
Slide 34
The GraphLab Framework: Graph-Based Data Representation, Update Functions (User Computation), Scheduler, Consistency Model, Aggregates
Slide 35
Data Graph
A graph with arbitrary data (C++ objects) associated with each vertex and edge.
Example graph: a social network. Vertex data: user profile text and current interest estimates. Edge data: similarity weights.
Slide 36
The GraphLab Framework: Graph-Based Data Representation, Update Functions (User Computation), Scheduler, Consistency Model, Aggregates
Slide 37
Update Functions
An update function is a user-defined program which, when applied to a vertex, transforms the data in the scope of the vertex.

label_prop(i, scope) {
  // Get neighborhood data
  (Likes[i], W_ij, Likes[j]) ← scope;
  // Update the vertex data
  Likes[i] ← Σ_j W_ij × Likes[j];
  // Reschedule neighbors if needed
  if Likes[i] changes then
    reschedule_neighbors_of(i);
}
Slide 38
The GraphLab Framework: Graph-Based Data Representation, Update Functions (User Computation), Scheduler, Consistency Model, Aggregates
Slide 39
The Scheduler
The scheduler determines the order in which vertices are updated. CPUs pull vertices from the scheduler and apply the update function, and update functions may add new vertices back to the scheduler. The process repeats until the scheduler is empty.
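A minimal single-threaded sketch of this execute-and-reschedule loop (the queue, graph type, and update signature are illustrative placeholders, not the GraphLab engine): vertices are pulled in FIFO order, the update function is applied, and any vertices it reschedules are pushed back until the scheduler is empty.

#include <queue>
#include <unordered_set>
#include <vector>

using VertexId = int;
struct Graph { std::vector<std::vector<VertexId>> neighbors; /* plus vertex/edge data */ };

// User-defined update: transforms the data in the scope of v and returns
// the neighbors that should be rescheduled (trivial placeholder body here).
std::vector<VertexId> update(Graph& g, VertexId v) { (void)g; (void)v; return {}; }

void run_fifo(Graph& g, const std::vector<VertexId>& initial) {
    std::queue<VertexId> scheduler;          // FIFO schedule
    std::unordered_set<VertexId> queued;     // avoid duplicate scheduler entries
    for (VertexId v : initial) { scheduler.push(v); queued.insert(v); }

    while (!scheduler.empty()) {             // repeat until the scheduler is empty
        VertexId v = scheduler.front();
        scheduler.pop();
        queued.erase(v);
        for (VertexId u : update(g, v))      // the update may reschedule neighbors
            if (queued.insert(u).second) scheduler.push(u);
    }
}

int main() {
    Graph g{{{1}, {0, 2}, {1}}};             // a 3-vertex chain
    run_fifo(g, {0, 1, 2});
    return 0;
}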
Slide 40
Choosing a Schedule
GraphLab provides several different schedulers:
Round Robin: vertices are updated in a fixed order (--scheduler=roundrobin)
FIFO: vertices are updated in the order they are added (--scheduler=fifo)
Priority: vertices are updated in priority order (--scheduler=priority)
The choice of schedule affects the correctness and parallel performance of the algorithm. Obtain different algorithms by simply changing a flag!
The GraphLab Framework: Graph-Based Data Representation, Update Functions (User Computation), Scheduler, Consistency Model, Aggregates
Slide 43
GraphLab Ensures Sequential Consistency
For each parallel execution, there exists a sequential execution of update functions which produces the same result.
Slide 44
Ensuring Race-Free Code How much can computation overlap?
Slide 45
Common Problem: Write-Write Race
Processors running adjacent update functions simultaneously modify shared data; the final value depends on which CPU's write lands last.
Slide 46
Nuances of Sequential Consistency
Data consistency depends on the update function: some algorithms are robust to data races, others are not.
GraphLab solution: the user can choose from three consistency models (Full, Edge, Vertex), and GraphLab automatically enforces the user's choice.
Slide 47
Consistency Rules: guaranteed sequential consistency for all update functions.
Slide 48
Full Consistency
Only update functions at least two vertices apart are allowed to run in parallel: reduced opportunities for parallelism.
Slide 49
Obtaining More Parallelism
Not all update functions will modify the entire scope! Edge consistency is sufficient for a large number of algorithms, including label propagation.
Slide 50
Edge Consistency
Update functions get write access to their vertex and adjacent edges and read access to adjacent vertices, so the overlap between neighboring updates is a safe read.
Slide 51
Obtaining More Parallelism
Vertex consistency is sufficient for map-like operations, e.g., feature extraction on vertex data.
Slide 52
Vertex Consistency
Only the vertex's own data is protected, so updates on all vertices can run in parallel.
Slide 53
The GraphLab Framework: Graph-Based Data Representation, Update Functions (User Computation), Scheduler, Consistency Model, Aggregates
Slide 54
Global State and Computation
Not everything fits the graph metaphor: some algorithms need shared global state (e.g., a shared weight) and global computations over the entire graph.
Slide 55
Aggregates
Computed asynchronously at fixed time intervals: a user-defined map operation (extract Likes[i]), a user-defined reduce (sum) operation, and a user-defined normalization. Update functions have read-only access to a copy of the result.
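As a rough illustration (plain C++ with illustrative names, not the GraphLab aggregate API), the map/reduce/normalize pattern for averaging Likes[i] over all vertices might look like this:

#include <cstddef>
#include <vector>

// Vertex data: the current interest estimate, e.g., {Cameras, Biking}.
using Likes = std::vector<double>;

// Map: extract the quantity of interest from one vertex.
Likes map_vertex(const Likes& v) { return v; }

// Reduce: combine two mapped values (element-wise sum).
Likes reduce(Likes a, const Likes& b) {
    for (std::size_t k = 0; k < a.size(); ++k) a[k] += b[k];
    return a;
}

// Finalize: normalize by the number of vertices to get average interests.
Likes finalize(Likes total, std::size_t n) {
    for (double& x : total) x /= static_cast<double>(n);
    return total;
}

int main() {
    std::vector<Likes> vertices = {{0.6, 0.4}, {0.8, 0.2}, {0.3, 0.7}};
    Likes acc(2, 0.0);
    for (const Likes& v : vertices) acc = reduce(acc, map_vertex(v));
    Likes average = finalize(acc, vertices.size());  // update functions would see a read-only copy
    return average[0] > 0.0 ? 0 : 1;
}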
Slide 56
Anatomy of a GraphLab Program:
1) Define a C++ update function
2) Build the data graph using the C++ graph object
3) Set engine parameters: scheduler type and consistency model
4) Add initial vertices to the scheduler
5) Register global aggregates and set refresh intervals
6) Run the engine on the graph [blocking C++ call]
7) The final answer is stored in the graph
(A schematic sketch of these steps appears below.)
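A schematic C++ skeleton of these seven steps; every type and call below is an illustrative placeholder that mirrors the list, not the actual GraphLab headers or API:

#include <string>

// Illustrative pseudo-API only: graph_type, engine_type, and the calls in the
// comments are placeholders for the steps above, not real GraphLab identifiers.
struct vertex_data { std::string profile_text; double likes_cameras = 0, likes_biking = 0; };
struct edge_data   { double similarity = 0; };

int main() {
    /* 1) Define a C++ update function, e.g.
           void label_prop_update(vertex_scope& scope);                      */
    /* 2) Build the data graph using the C++ graph object, e.g.
           graph_type graph;  graph.add_vertex(...);  graph.add_edge(...);   */
    /* 3) Set engine parameters: scheduler type and consistency model, e.g.
           engine_type engine(graph, "fifo", consistency::edge);             */
    /* 4) Add initial vertices to the scheduler, e.g.
           engine.schedule_all(label_prop_update);                           */
    /* 5) Register global aggregates and set refresh intervals.              */
    /* 6) Run the engine on the graph [blocking call], e.g. engine.start();  */
    /* 7) The final answer is stored in the graph: read it back out.         */
    return 0;
}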
Slide 57
Algorithms Implemented
PageRank, Loopy Belief Propagation, Gibbs Sampling, CoEM, Graphical Model Parameter Learning, Probabilistic Matrix/Tensor Factorization, Alternating Least Squares, Lasso with Sparse Features, Support Vector Machines with Sparse Features, Label Propagation
Slide 58
Implementing the GraphLab API: Multi-core & Cloud Settings
Slide 59
Multi-core Implementation
Implemented in C++ on top of Pthreads and GCC atomics.
Consistency models implemented using read-write locks on each vertex, with canonically ordered lock acquisition (dining philosophers).
Approximate schedulers: approximate FIFO/priority ordering to reduce locking overhead.
Experimental Matlab/Java/Python support.
Nearly complete implementation available under the Apache 2.0 License at graphlab.org (a sketch of the ordered locking idea follows).
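A minimal sketch of the canonically ordered lock idea (using std::shared_mutex instead of Pthreads; names are illustrative): every thread acquires the per-vertex read-write locks of a scope in ascending vertex-ID order, which rules out the circular waits of the dining-philosophers problem.

#include <algorithm>
#include <shared_mutex>
#include <vector>

struct Vertex { std::shared_mutex lock; /* vertex data ... */ };

// Lock the scope of `center` for an edge-consistency-style update: write-lock
// the updated vertex, read-lock its neighbors, always in ascending ID order.
void lock_scope(std::vector<Vertex>& vertices, int center, std::vector<int> ids) {
    ids.push_back(center);
    std::sort(ids.begin(), ids.end());                     // canonical (ascending) order
    for (int v : ids) {
        if (v == center) vertices[v].lock.lock();          // exclusive for the updated vertex
        else             vertices[v].lock.lock_shared();   // shared for read-only neighbors
    }
}

void unlock_scope(std::vector<Vertex>& vertices, int center, const std::vector<int>& neighbors) {
    for (int v : neighbors) vertices[v].lock.unlock_shared();
    vertices[center].lock.unlock();
}

int main() {
    std::vector<Vertex> graph(4);
    std::vector<int> neighbors = {0, 2};   // neighbors of vertex 1
    lock_scope(graph, 1, neighbors);
    /* ... apply the update function to vertex 1 ... */
    unlock_scope(graph, 1, neighbors);
    return 0;
}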
Slide 60
Distributed Cloud Implementation
Implemented in C++ on top of the multi-core implementation on each node, with a custom RPC built on top of TCP/IP and MPI.
The graph is partitioned over the cluster using either ParMETIS (high-performance partitioning heuristics) or random cuts (which seem to work well on natural graphs).
Consistency models are enforced using either distributed read-write locks with pipelined acquisition, or graph coloring with phased execution.
No fault tolerance yet: we are working on a solution. Still experimental.
CoEM (Rosie Jones, 2005): a Named Entity Recognition task. Is "Dog" an animal? Is "Catalina" a place? The graph links noun phrases (the dog, Australia, Catalina Island) to the contexts they appear in (ran quickly, travelled to, is pleasant).
Vertices: 2 million; edges: 200 million. Hadoop baseline: 95 cores, 7.5 hours.
Lasso: Regularized Linear Model
Data matrix X (n x d), weight vector w (d x 1), observations y (n x 1), plus a regularization term; the example figure shows 5 features and 4 examples.
Shooting algorithm [coordinate descent]: updates on weight vertices modify the losses on observation vertices, so the algorithm requires the Full consistency model.
Financial prediction dataset from Kogan et al. [2009]. A generic coordinate-descent sketch follows.
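For context, a generic shooting-style coordinate-descent pass for the lasso objective (textbook form with illustrative names, not the code from the talk): each weight w_j is set by soft-thresholding its partial residual correlation, and changing w_j updates the residual on every observation that feature touches, which is exactly the weight-vertex-to-observation-vertex coupling described above.

#include <cstddef>
#include <vector>

// One pass of the shooting (coordinate descent) algorithm for
//   min_w 0.5 * ||y - X w||^2 + lambda * ||w||_1,
// with X stored as d columns of length n.
void shooting_pass(const std::vector<std::vector<double>>& X_cols,
                   const std::vector<double>& y,
                   std::vector<double>& w,
                   double lambda) {
    const std::size_t n = y.size(), d = X_cols.size();

    // Current residual r = y - X w.
    std::vector<double> r(y);
    for (std::size_t j = 0; j < d; ++j)
        for (std::size_t i = 0; i < n; ++i) r[i] -= X_cols[j][i] * w[j];

    for (std::size_t j = 0; j < d; ++j) {
        // rho = x_j^T (r + x_j * w_j) and z = x_j^T x_j.
        double rho = 0.0, z = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            rho += X_cols[j][i] * (r[i] + X_cols[j][i] * w[j]);
            z   += X_cols[j][i] * X_cols[j][i];
        }
        // Soft-threshold to get the new weight (z > 0 assumed).
        double w_new = 0.0;
        if      (rho >  lambda) w_new = (rho - lambda) / z;
        else if (rho < -lambda) w_new = (rho + lambda) / z;
        // Changing w_j modifies the loss (residual) on the observations it touches.
        for (std::size_t i = 0; i < n; ++i) r[i] += X_cols[j][i] * (w[j] - w_new);
        w[j] = w_new;
    }
}

int main() {
    std::vector<std::vector<double>> X = {{1, 0, 2, 1}, {0, 1, 1, 3}};  // 2 features, 4 examples
    std::vector<double> y = {1, 2, 3, 4};
    std::vector<double> w = {0, 0};
    for (int pass = 0; pass < 50; ++pass) shooting_pass(X, y, w, 0.1);
    return 0;
}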
Slide 69
Full Consistency: speedup plot for the dense and sparse problems, compared against optimal (linear) scaling.
Slide 70
Relaxing Consistency: speedup plot for the dense and sparse problems, compared against optimal (linear) scaling. Why does this work? (See the Shotgun ICML paper.)
Matrix Factorization: Netflix Collaborative Filtering
Alternating Least Squares matrix factorization model: a bipartite graph of Netflix users and movies (latent dimension d) with 0.5 million nodes and 99 million edges.
Slide 73
Netflix Speedup: speedup as the size (d) of the matrix factorization increases.
Slide 74
Netflix
Slide 75
Netflix: comparison to MPI as the number of cluster nodes increases.
Slide 76
Summary
An abstraction tailored to machine learning: targets graph-parallel algorithms; naturally expresses data and computational dependencies and dynamic iterative computation; simplifies parallel algorithm design; automatically ensures data consistency; achieves state-of-the-art parallel performance on a variety of problems.
Slide 77
Current/Future Work
Out-of-core storage.
Hadoop/HDFS integration: graph construction, graph storage, launching GraphLab from Hadoop, fault tolerance through HDFS checkpoints.
Sub-scope parallelism: address the challenge of very high degree nodes.
Stochastic scopes.
Update Functions -> Update Functors: allows update functions to send state when rescheduling.