A New Parallel Framework for Machine Learning


Carnegie Mellon

Joseph Gonzalez, joint work with Yucheng Low, Aapo Kyrola, Danny Bickson, Carlos Guestrin, Guy Blelloch, Joe Hellerstein, David O'Hallaron, and Alex Smola.

In ML we face BIG problems

• 48 hours of video uploaded to YouTube every minute
• 24 million Wikipedia pages
• 750 million Facebook users
• 6 billion Flickr photos

Massive data provides opportunities for rich probabilistic structure …


[Figure: a social network connects Shopper 1, who likes cameras, with Shopper 2, who likes cooking]

What are the tools for massive data?


Parallelism: Hope & Challenges

Wide array of different parallel architectures: GPUs, Multicore, Clusters, Mini Clouds, Clouds.

New challenges for designing machine learning algorithms:
• Race conditions and deadlocks
• Managing distributed model state

New challenges for implementing machine learning algorithms:
• Parallel debugging and profiling
• Hardware-specific APIs


Thesis: "Parallel Learning and Inference in Probabilistic Graphical Models"

Bridging massive structured problems and advances in parallel hardware:
• Massive Structured Problems
• Probabilistic Graphical Models
• Parallel Algorithms for Probabilistic Learning and Inference
• GraphLab
• Advances in Parallel Hardware

How will we design and implement parallel learning systems?

We could use … threads, locks, and messages: "low level parallel primitives".

Threads, Locks, and Messages

ML experts repeatedly solve the same parallel design challenges:
• Implement and debug a complex parallel system
• Tune for a specific parallel platform
• Two months later the conference paper contains: "We implemented ______ in parallel."

The resulting code:
• is difficult to maintain
• is difficult to extend
• couples the learning model to the parallel implementation

[Photo: graduate students]

... a better answer: Map-Reduce / Hadoop

Build learning algorithms on top of high-level parallel abstractions.

[Figure: MapReduce – Map Phase. CPUs 1–4 independently compute image features for their portions of the data — embarrassingly parallel, no communication needed.]

[Figure: MapReduce – Reduce Phase. CPUs aggregate the per-image features into attractive-face and not-attractive-face statistics.]

Map-Reduce for Data-Parallel ML

Excellent for large data-parallel tasks!
• Data-Parallel (Map Reduce): Feature Extraction, Cross Validation, Computing Sufficient Statistics
• Graph-Parallel (?): Belief Propagation, Label Propagation, Kernel Methods, Deep Belief Networks, Neural Networks, Tensor Factorization, PageRank, Lasso

Is there more to Machine Learning?

Concrete Example: Label Propagation

Label Propagation Algorithm

Social Arithmetic:
• I Like: 50% of what I list on my profile, 40% of what Sue Ann likes, 10% of what Carlos likes
• My profile: 50% Cameras, 50% Biking
• Sue Ann likes: 80% Cameras, 20% Biking
• Carlos likes: 30% Cameras, 70% Biking
• Result — I Like: 60% Cameras, 40% Biking

Recurrence Algorithm: iterate until convergence (the update is written out below).

Parallelism: compute all Likes[i] in parallel.
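The recurrence behind this example is the standard label-propagation update (the slide shows it only as an image, so this is a reconstruction; W_ij is the trust weight user i places on j, with W_ii the weight on i's own profile):

$$\text{Likes}[i] \;=\; W_{ii}\,\text{Profile}[i] \;+\; \sum_{j \in \text{Friends}(i)} W_{ij}\,\text{Likes}[j], \qquad \sum_j W_{ij} = 1 .$$

Plugging in the numbers above:

$$0.5\,(0.5,\,0.5) + 0.4\,(0.8,\,0.2) + 0.1\,(0.3,\,0.7) \;=\; (0.60,\;0.40),$$

i.e. 60% Cameras, 40% Biking, matching the slide.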

Properties of Graph-Parallel Algorithms:
• Dependency Graph
• Iterative Computation (what I like depends on what my friends like)
• Factored Computation

Map-Reduce for Data-Parallel ML

Map-Reduce is excellent for large data-parallel tasks (feature extraction, cross validation, computing sufficient statistics), but the graph-parallel algorithms — Belief Propagation, Label Propagation, Kernel Methods, Deep Belief Networks, Neural Networks, Tensor Factorization, PageRank, Lasso — remain unaddressed.

Why not use Map-Reduce for graph-parallel algorithms?

Data Dependencies

Map-Reduce does not efficiently express dependent data:
• The user must code substantial data transformations
• Costly data replication

Iterative Algorithms

Map-Reduce does not efficiently express iterative algorithms:

[Figure: in each Map-Reduce iteration, CPUs 1–3 process all of the data, with barriers between iterations]

MapAbuse: Iterative MapReduce

Only a subset of the data needs computation:

[Figure: every Map-Reduce iteration still processes all of the data on CPUs 1–3, with barriers between iterations, even though only a few values change]

MapAbuse: Iterative MapReduce

The system is not optimized for iteration:

[Figure: each Map-Reduce iteration incurs a startup penalty and a disk penalty for the data processed on CPUs 1–3]

Map-Reduce for Data-Parallel ML

Map-Reduce handles the data-parallel side (feature extraction, cross validation, computing sufficient statistics), but what covers the graph-parallel algorithms — Belief Propagation, SVM, Kernel Methods, Deep Belief Networks, Neural Networks, Tensor Factorization, PageRank, Lasso? Map Reduce? Pregel (Giraph)?

Pregel (Giraph)

Bulk Synchronous Parallel model: compute, then communicate, separated by a barrier.

Bulk synchronous computation can be highly inefficient.

Example: Loopy Belief Propagation

Loopy Belief Propagation (Loopy BP) iteratively estimates the "beliefs" about vertices:
• Read in messages
• Update the marginal estimate (belief)
• Send updated outgoing messages
Repeat for all variables until convergence (the standard update equations are sketched below).
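For reference, the standard loopy BP updates on a pairwise Markov random field (textbook form; the node potentials φ_i and edge potentials ψ_ij are the usual assumptions, since the slides show the equations only as images):

$$m_{i\to j}(x_j) \;\propto\; \sum_{x_i} \phi_i(x_i)\,\psi_{ij}(x_i, x_j) \prod_{k \in N(i)\setminus\{j\}} m_{k\to i}(x_i)$$

$$b_i(x_i) \;\propto\; \phi_i(x_i) \prod_{k \in N(i)} m_{k\to i}(x_i)$$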

Bulk Synchronous Loopy BP
• Often considered embarrassingly parallel:
  – Associate a processor with each vertex
  – Receive all messages
  – Update all beliefs
  – Send all messages
• Proposed by: Brunton et al. CRV'06, Mendiburu et al. GECC'07, Kang et al. LDMTA'10, …

Sequential Computational Structure / Hidden Sequential Structure

[Figure: a chain graphical model with evidence at both ends]

Running time = (time for a single parallel iteration) × (number of iterations)

Optimal Sequential Algorithm

Running time comparison:
• Forward–Backward (p = 1): 2n
• Bulk Synchronous (p ≤ 2n): 2n²/p
• Optimal Parallel (p = 2): n

There is a large gap between the bulk synchronous running time and the optimal.


The Splash Operation

Generalize the optimal chain algorithm to arbitrary cyclic graphs (a sketch of the schedule follows this list):
1) Grow a BFS spanning tree of fixed size
2) Forward pass: compute all messages at each vertex
3) Backward pass: compute all messages at each vertex
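A minimal, self-contained sketch of that schedule, assuming a plain adjacency-list graph and a caller-supplied per-vertex message update; this is an illustration of the three steps above, not the thesis or GraphLab implementation:

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

using AdjList = std::vector<std::vector<int>>;

// Run one Splash rooted at `root`: grow a bounded BFS tree, then sweep it
// forward and backward, calling `update_vertex` to recompute messages.
void splash(const AdjList& graph, int root, std::size_t max_size,
            const std::function<void(int)>& update_vertex) {
  // 1) Grow a BFS spanning tree of fixed size.
  std::vector<int> order;
  std::vector<bool> visited(graph.size(), false);
  std::queue<int> frontier;
  frontier.push(root);
  visited[root] = true;
  while (!frontier.empty() && order.size() < max_size) {
    int v = frontier.front();
    frontier.pop();
    order.push_back(v);
    for (int u : graph[v])
      if (!visited[u]) { visited[u] = true; frontier.push(u); }
  }
  // 2) Forward pass: compute all messages at each vertex in BFS order.
  for (std::size_t i = 0; i < order.size(); ++i) update_vertex(order[i]);
  // 3) Backward pass: compute all messages at each vertex in reverse order.
  for (std::size_t i = order.size(); i-- > 0;) update_vertex(order[i]);
}
```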

Data-Parallel Algorithms can be Inefficient

The limitations of the Map-Reduce abstraction can lead to inefficient parallel algorithms.

[Figure: runtime in seconds vs. number of CPUs — optimized in-memory bulk synchronous vs. asynchronous Splash BP]

The Need for a New Abstraction

Map-Reduce is not well suited for graph-parallelism:
• Data-Parallel (Map Reduce): Feature Extraction, Cross Validation, Computing Sufficient Statistics
• Graph-Parallel (Pregel / Giraph?): Belief Propagation, SVM, Kernel Methods, Deep Belief Networks, Neural Networks, Tensor Factorization, PageRank, Lasso

What is GraphLab?

The GraphLab Framework:
• Graph-Based Data Representation
• Update Functions (User Computation)
• Scheduler
• Consistency Model

Data Graph

A graph with arbitrary data (C++ objects) associated with each vertex and edge.
• Graph: social network
• Vertex data: user profile text, current interest estimates
• Edge data: similarity weights
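As an illustration, the per-vertex and per-edge data for this example could be declared as ordinary C++ structs roughly like the following (the field names are assumptions for this sketch; GraphLab itself only requires that such user-defined types exist):

```cpp
#include <string>
#include <vector>

// Hypothetical vertex data: the user's profile text plus the current
// interest estimates, e.g. {P(cameras), P(biking)}.
struct VertexData {
  std::string profile_text;
  std::vector<double> interests;
};

// Hypothetical edge data: the similarity weight between two users.
struct EdgeData {
  double similarity;
};
```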

Implementing the Data Graph

Multicore setting (in memory):
• Challenge: fast lookup, low overhead
• Solution: dense data structures, fixed Vdata & Edata types, immutable graph structure

Cluster setting (in memory):
• Partition the graph: ParMETIS or random cuts
• Cached ghosting across nodes

[Figure: a graph partitioned across Node 1 and Node 2, with boundary vertices ghosted on each node]


Update Functions

An update function is a user-defined program which, when applied to a vertex, transforms the data in the scope of that vertex. For example:

  label_prop(i, scope) {
    // Get neighborhood data (Likes[i], W_ij, Likes[j]) from scope

    // Update the vertex data

    // Reschedule neighbors if needed
    if Likes[i] changes then
      reschedule_neighbors_of(i);
  }
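Filling in the body of that pseudocode as a standalone C++ sketch (plain STL containers and a simple FIFO queue standing in for the scheduler; the graph layout, tolerance, and helper names are assumptions, not GraphLab's API):

```cpp
#include <cmath>
#include <cstddef>
#include <queue>
#include <vector>

struct Edge { int neighbor; double weight; };   // W_ij (include a self-edge for W_ii)
using Interests = std::vector<double>;          // e.g. {P(cameras), P(biking)}

void label_prop(int i,
                const std::vector<std::vector<Edge>>& graph,
                std::vector<Interests>& likes,
                std::queue<int>& scheduler,
                double tolerance = 1e-3) {
  // Get neighborhood data and compute Likes[i] <- sum_j W_ij * Likes[j].
  Interests updated(likes[i].size(), 0.0);
  for (const Edge& e : graph[i])
    for (std::size_t k = 0; k < updated.size(); ++k)
      updated[k] += e.weight * likes[e.neighbor][k];

  // Measure how much the vertex data changed, then commit the update.
  double change = 0.0;
  for (std::size_t k = 0; k < updated.size(); ++k)
    change += std::fabs(updated[k] - likes[i][k]);
  likes[i] = updated;

  // Reschedule the neighbors only if Likes[i] changed appreciably.
  if (change > tolerance)
    for (const Edge& e : graph[i]) scheduler.push(e.neighbor);
}
```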


The Scheduler

The scheduler determines the order in which vertices are updated.

[Figure: a queue of vertices (a, b, c, …) feeding update tasks to CPU 1 and CPU 2]

The process repeats until the scheduler is empty.

Choosing a Schedule

GraphLab provides several different schedulers:
• Round Robin: vertices are updated in a fixed order
• FIFO: vertices are updated in the order they are added
• Priority: vertices are updated in priority order

The choice of schedule affects both the correctness and the parallel performance of the algorithm.

Obtain different algorithms by simply changing a flag!
  --scheduler=roundrobin   --scheduler=fifo   --scheduler=priority


Ensuring Race-Free Code

How much can computation overlap?

Common Problem: Write-Write Race

Processors running adjacent update functions simultaneously modify shared data: CPU 1 writes, CPU 2 writes, and the final value depends on which write lands last.
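A tiny standalone illustration of such a write-write race (ordinary C++ threads on a shared value; this is a generic example, not GraphLab code):

```cpp
#include <thread>

int shared_edge_value = 0;   // data shared by two adjacent update functions

void update_vertex(int contribution) {
  // Unsynchronized read-modify-write: the two threads' operations can
  // interleave, so one contribution may be lost.
  shared_edge_value = shared_edge_value + contribution;
}

int main() {
  std::thread cpu1(update_vertex, 10);   // CPU 1 runs one vertex's update
  std::thread cpu2(update_vertex, 20);   // CPU 2 runs the adjacent vertex's update
  cpu1.join();
  cpu2.join();
  // shared_edge_value may end up 10, 20, or 30 depending on the interleaving.
}
```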

Importance of Consistency

[Figure: Alternating Least Squares convergence, illustrating the importance of consistency]

[Figure: the Build → Test → Debug → Tweak Model development cycle]

GraphLab Ensures Sequential Consistency

For each parallel execution, there exists a sequential execution of update functions which produces the same result.

[Figure: a parallel execution on CPU 1 and CPU 2 corresponds, over time, to an equivalent sequential execution on a single CPU]

Consistency Rules

Guaranteed sequential consistency for all update functions.

Full Consistency: an update function gets exclusive read/write access to its entire scope (its vertex, the adjacent edges, and the adjacent vertices), so the scopes of concurrently executing updates may not overlap.

Obtaining More Parallelism

Edge Consistency: an update function gets exclusive access to its own vertex and adjacent edges, but only safe read access to adjacent vertices, which allows more vertices to be updated concurrently.

Consistency Through R/W Locks

Read/write locks:
• Full Consistency: write-lock the center vertex and all of its neighbors (Write, Write, Write)
• Edge Consistency: write-lock the center vertex, read-lock its neighbors (Read, Write, Read)
• Locks are always acquired in a canonical ordering to avoid deadlock

Multicore setting: Pthread R/W locks. Distributed setting: distributed locking.

Prefetch Locks and Data: allow computation to proceed while locks and data are requested — a lock pipeline over each node's partition of the data graph.
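A minimal sketch of the edge-consistency locking discipline in the multicore setting, assuming one std::shared_mutex per vertex and ascending vertex ids as the canonical order (an illustration only, not GraphLab's lock manager):

```cpp
#include <algorithm>
#include <shared_mutex>
#include <vector>

// Acquire the edge-consistency scope of vertex v: exclusive lock on v itself,
// shared (read) locks on its neighbors, always locking in ascending-id order
// so that concurrent updates cannot deadlock.
void lock_edge_scope(int v, std::vector<int> neighbors,
                     std::vector<std::shared_mutex>& vertex_locks) {
  neighbors.push_back(v);
  std::sort(neighbors.begin(), neighbors.end());   // canonical lock ordering
  for (int u : neighbors) {
    if (u == v) vertex_locks[u].lock();             // write access to own data
    else        vertex_locks[u].lock_shared();      // read-only access to neighbors
  }
}

void unlock_edge_scope(int v, const std::vector<int>& neighbors,
                       std::vector<std::shared_mutex>& vertex_locks) {
  for (int u : neighbors) vertex_locks[u].unlock_shared();
  vertex_locks[v].unlock();
}
```

Full consistency would take exclusive locks on the neighbors as well.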


Anatomy of a GraphLab Program:

1) #include <graphlab.hpp>
2) Define a C++ update function
3) Build the data graph using the C++ graph object
4) Set engine parameters: scheduler type and consistency model
5) Add initial vertices to the scheduler
6) Run the engine on the graph [blocking call]
7) The final answer is stored in the graph
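Laid out as a program skeleton, those steps look roughly like this (only the #include comes from the slides; the concrete GraphLab calls are left as comments rather than spelled out, since their exact names are not given here):

```cpp
#include <graphlab.hpp>   // 1) pull in the GraphLab framework

int main(int argc, char** argv) {
  // 2) Define a C++ update function, e.g. the label_prop(i, scope) shown earlier.
  // 3) Build the data graph with the C++ graph object: add vertices, then edges,
  //    each carrying the user-defined vertex/edge data.
  // 4) Set engine parameters: the scheduler type (roundrobin / fifo / priority)
  //    and the consistency model (full / edge).
  // 5) Add the initial vertices to the scheduler.
  // 6) Run the engine on the graph -- a blocking call that returns once the
  //    scheduler is empty.
  // 7) Read the final answer out of the graph's vertex data.
  return 0;
}
```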

Algorithms Implemented: PageRank, Loopy Belief Propagation, Gibbs Sampling, CoEM, Graphical Model Parameter Learning, Probabilistic Matrix/Tensor Factorization, Alternating Least Squares, Lasso with Sparse Features, Support Vector Machines with Sparse Features, Label Propagation, …

Shared Memory Experiments

Shared memory setting: 16-core workstation

Loopy Belief Propagation

3D retinal image denoising
• Data graph: 1 million vertices, 3 million edges
• Update function: Loopy BP update equation
• Scheduler: SplashBP
• Consistency model: edge consistency

[Figure: speedup vs. number of CPUs — SplashBP achieves a 15.5x speedup, close to optimal]

Gibbs Sampling

Protein-protein interaction networks [Elidan et al. 2006]
• Discrete MRF: 14K vertices, 100K edges
• Provably correct parallelization via edge consistency

[Figure: protein backbone, side-chains, and protein interactions]

[Figure: Gibbs sampling speedup vs. number of CPUs for the Chromatic Gibbs Sampler, compared against optimal]

Splash Gibbs Sampler

An asynchronous Gibbs sampler that adaptively addresses strong dependencies.

Splash Gibbs Sampler

Step 1: Grow multiple conditionally independent Splashes in parallel (bounded tree-width, e.g. tree-width 1 or 2).
Step 2: Calibrate the trees in parallel.
Step 3: Sample the trees in parallel.

Experimental Results

The Splash sampler outperforms the Chromatic sampler on models with strong dependencies.
• Markov logic network with strong dependencies: 10K variables, 28K factors

[Figure: likelihood of the final sample, "mixing", and speedup in sample generation — Splash vs. Chromatic]

CoEM (Rosie Jones, 2005)

Named Entity Recognition task: is "Dog" an animal? Is "Catalina" a place?

[Figure: a bipartite graph linking noun phrases ("the dog", "Australia", "Catalina Island") to the contexts they appear in ("<X> ran quickly", "travelled to <X>", "<X> is pleasant")]

Graph: 2 million vertices, 200 million edges
Hadoop: 95 cores, 7.5 hrs

[Figure: GraphLab CoEM speedup vs. number of CPUs, compared against optimal]

CoEM (Rosie Jones, 2005)
• Hadoop: 95 cores, 7.5 hrs
• GraphLab: 16 cores, 30 min — 15x faster on 6x fewer CPUs!

Lasso: Regularized Linear Model
• Data matrix X: n × d; weights w: d × 1; observations y: n × 1
• Shooting Algorithm [coordinate descent]: updates on weight vertices modify the losses on the observation vertices, so it requires the full consistency model
• Financial prediction dataset from Kogan et al. [2009]

[Figure: the bipartite weight/observation graph for a small instance with 5 features and 4 examples]

The objective and the shooting update are written out below.
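For reference, the standard Lasso objective and the shooting (coordinate-descent) update, in textbook form (λ and the soft-threshold operator S are the usual definitions, not values taken from the slides):

$$\min_{w \in \mathbb{R}^d}\; \tfrac{1}{2}\,\|Xw - y\|_2^2 \;+\; \lambda \|w\|_1$$

Each shooting step updates a single coordinate j with all others held fixed:

$$w_j \;\leftarrow\; \frac{S\!\left(\sum_{i=1}^n x_{ij}\,\big(y_i - \sum_{k \neq j} x_{ik} w_k\big),\; \lambda\right)}{\sum_{i=1}^n x_{ij}^2}, \qquad S(a, \lambda) = \operatorname{sign}(a)\,\max(|a| - \lambda,\, 0).$$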

Full Consistency

[Figure: speedup vs. number of CPUs under the full consistency model, for dense and sparse problems, across regularization settings]

Relaxing Consistency

Why does this work? (See the Shotgun ICML paper.)

[Figure: speedup vs. number of CPUs with relaxed consistency, for dense and sparse problems, compared against optimal]

Experiments: Amazon EC2, high-performance nodes

Video Cosegmentation
• Goal: identify segments that mean the same thing
• Model: 10.5 million nodes, 31 million edges
• Gaussian EM clustering + BP on a 3D grid

[Figure: video cosegmentation speedups]

Matrix Factorization

Netflix collaborative filtering: Alternating Least Squares matrix factorization
• Model: 0.5 million nodes, 99 million edges

[Figure: the Netflix ratings matrix factored into rank-d user and movie factor matrices]

[Figure: Netflix speedup for increasing size d of the matrix factorization]
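For reference, the standard ALS formulation being solved here (textbook form; the rank d and regularization λ are the usual symbols, not values reported on the slides):

$$\min_{U,\,V}\;\sum_{(u,m)\,\in\,\text{ratings}} \big(R_{um} - U_u^{\top} V_m\big)^2 \;+\; \lambda\big(\|U\|_F^2 + \|V\|_F^2\big), \qquad U_u,\, V_m \in \mathbb{R}^{d}.$$

With V fixed, each user factor has the closed-form update

$$U_u \;=\; \Big(\sum_{m \in S_u} V_m V_m^{\top} + \lambda I\Big)^{-1} \sum_{m \in S_u} R_{um}\,V_m,$$

where S_u is the set of movies rated by user u; the movie factors V_m are updated symmetrically with U fixed.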

The Cost of Hadoop

[Figure: cost comparison with Hadoop]

Summary

An abstraction tailored to machine learning:
• Targets graph-parallel algorithms
• Naturally expresses data and computational dependencies, and dynamic iterative computation
• Simplifies parallel algorithm design
• Automatically ensures data consistency
• Achieves state-of-the-art parallel performance on a variety of problems

Check out GraphLab: http://graphlab.org
Documentation… Code… Tutorials…

Questions & Feedback: jegonzal@cs.cmu.edu

Current/Future Work
• Out-of-core storage
• Hadoop/HDFS integration: graph construction, graph storage, launching GraphLab from Hadoop, fault tolerance through HDFS checkpoints
• Sub-scope parallelism: address the challenge of very high-degree nodes
• Improved graph partitioning
• Support for dynamic graph structure
