SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Machine Learning at the Limit

John Canny*^

* Computer Science Division

University of California, Berkeley

^ Yahoo Research Labs

@Alpine Labs, March, 2015

Outline

Scaling Inward:

• What are the performance limits for machine learning on

single machines, and how close can we get to them?

(Single-node Roofline Design)

Scaling Out:

• What happens when we try to scale up from optimized

single nodes to clusters?

(Roofline Design for Clusters)

My Other Job(s)

Yahoo [Chen, Pavlov, Canny, KDD 2009]*

Ebay [Chen, Canny, SIGIR 2011]**

Quantcast 2011-2013

Microsoft 2014

Yahoo 2015

* Best application paper prize

** Best paper honorable mention

Data Scientist’s Workflow

Digging Around

in Data

Hypothesize

Model

Customize

Large Scale

Exploitation

Evaluate

Interpret

Sandbox

Production

Data Scientist’s Workflow

Digging Around

in Data

Hypothesize

Model

Customize

Large Scale

Exploitation

Evaluate

Interpret

Sandbox

Production

Why Build a New ML Toolkit?

• Performance: GPU performance pulling away from other

platforms for *sparse* and dense data.

Minibatch + SGD methods dominant on Big Data,…

• Customizability: Great value in customizing models (loss

functions, constraints,…)

• Explore/Deploy: Explore fast, run the same code in

prototype and production. Be able to run on clusters.

Desiderata

• Performance:

• Roofline Design (single machine and cluster)

• General Matrix Library with full CPU/GPU acceleration

• Customizability:

• Modular Learner Architecture (reusable components)

• Likelihood “Mixins”

• Explore/Deploy:

• Interactive, Scriptable, Graphical

• JVM based (Scala) w/ optimal cluster primitives

Outline

Scaling Inward:




Scaling Out:




Roofline Design (Williams, Waterman, Patterson, 2009)

• Roofline design establishes fundamental performance

limits for a computational kernel.

Operational Intensity (flops/byte)

Thro

ughput (g

flops)

1

10

100

1000

0.01 0.1 1 10 100

GPU ALU throughput

CPU ALU throughput

1000

A Tale of Two Architectures

Intel CPU NVIDIA GPU

Memory Controller

L3 Cache

Core

ALU

Core

ALU

Core

ALU

Core

ALU

L2 Cache

ALU ALU ALU ALU ALU ALU

CPU vs GPU Memory Hierarchy

Intel 8 core Sandy Bridge CPU NVIDIA GK110 GPU

8 MB L3 Cache 1.5 MB L2 Cache

4 MB register file (!) 4kB registers:

1 MB Shared Mem

2 MB L2 Cache

512K L1 Cache

1 MB Constant Mem

1 TB/s 1 TB/s

40 TB/s

13 TB/s

5 TB/s

10s GB Main Memory 4 GB Main Memory

20 GB/s

500 GB/s 500 GB/s

200 GB/s

Roofline Design – Matrix kernels

• Dense matrix multiply

• Sparse matrix multiply

Operational Intensity (flops/byte)

Thro

ughput (g

flops)

1

10

100

1000

0.01 0.1 1 10 100

GPU ALU throughput

CPU ALU throughput

1000

DataSource

(JBOD disks) Learner

Model

Optimizer

Mixins

Model

Optimizer

Mixins

GPU 1 thread 1

GPU 2 thread 2

:

:

CPU host code

data

blocks

DataSource

(Memory)

DataSource

HDFS over

network

Zhao+Canny

SIAM DM 13, KDD 13, BIGLearn 13

A Rooflined Machine Learning Toolkit

Compressed disk streaming at

~ 0.1-2 GB/s 100 HDFS nodes

30 Gflops to

2 Teraflops per GPU

Matrix + Machine Learning Layers

Written in the beautiful Scala language:

• Interpreter with JIT, scriptable.

• Open syntax +,-,*, ,, etc, math looks like math.

• Java VM + Java codebase – runs on Hadoop, Yarn, Spark.

• Hardware acceleration in C/C++ native code (CPU/GPU).

• Easy parallelism: Actors, parallel collections.

• Memory management (sort of ).

• Pre-built for multiple Platforms (Windows, MacOS, Linux).

Experience similar to Matlab, R, SciPy

Benchmarks

Recent benchmarks on some representative tasks:

• Text Classification on Reuters news data (0.5 GB)

• Click prediction on the Kaggle Criteo dataset (12 GB)

• Clustering of handwritten digit images (MNIST) (25 GB)

• Collaborative filtering on the Netflix prize dataset (4 GB)

• Topic modeling (LDA) on a NY times collection (0.5 GB)

• Random Forests on a UCI Year Prediction dataset (0.2 GB)

• Pagerank on two social network graphs at 12GB and 48GB

Benchmarks

Systems (single node)

• BIDMach

• VW (Vowpal Wabbit) from Yahoo/Microsoft

• Scikit-Learn

• LibLinear

Cluster Systems

• Spark v1.1 and v1.2

• Graphlab (academic version)

• Yahoo’s LDA cluster

Benchmarks: Single-Machine Systems

System Algorithm Dataset Dim Time

(s)

Cost

($)

Energy

(KJ)

BIDMach Logistic Reg. RCV1 103 14 0.002 3

Vowpal

Wabbit

Logistic Reg. RCV1 103 130 0.02 30

LibLinear Logistic Reg. RCV1 103 250 0.04 60

Scikit-Learn Logistic Reg. RCV1 103 576 0.08 120

RCV1: Text Classification, 103 topics (0.5GB).

Algorithms were tuned to achieve similar accuracy.

Benchmarks: Cluster Systems

System A/B Algorithm Dataset Dim Time

(s)

Cost

($)

Energy

(KJ)

Spark-72

BIDMach

Logistic Reg. RCV1 1

103

30

14

0.07

0.002

120

3

Spark-64

BIDMach

RandomForest YearPred 1 280

320

0.48

0.05

480

60

Spark-128

BIDMach

Logistic Reg. Criteo 1 400

81

1.40

0.01

2500

16

Spark-XX = System with XX cores

BIDMach ran on one node with GTX-680 GPU

Benchmarks: Cluster Systems

System A/B Algorithm Dataset Dim Time

(s)

Cost

($)

Energy

(KJ)

Spark-384

BIDMach

K-Means MNIST 4096 1100

735

9.00

0.12

22k

140

GraphLab-576

BIDMach

Matrix

Factorization

Netflix 100 376

90

16

0.015

10k

20

Yahoo-1000

BIDMach

LDA (Gibbs) NYtimes 1024 220k

300k

40k

60

4E10

6E7

Spark-XX or GraphLab-XX = System with XX cores

Yahoo-1000 had 1000 nodes

BIDMach at Scale

Latent Dirichlet Allocation

BIDMach outperforms cluster systems on this problem, and

has run up to 10 TB on one node.

Convergence on 1TB data

Benchmark Summary

• BIDMach is at least 10x faster than other single-machine

systems for comparable accuracy.

• For Random Forests or single-class regression, BIDMach on

a GPU node is comparable with 8-16 worker clusters.

• For multi-class regression, factor models, clustering etc.,

GPU-assisted BIDMach is comparable to 100-1000-worker

clusters. Larger problems correlate with larger values in this

range.

Cost and Energy Use

• BIDMach had a 10x-1000x cost advantage over the other

systems. The ratio was higher for larger-scale problems.

• Energy savings were similar to the cost savings, at 10x-

1000x.

In the Wild (Examples from Industry)

• Multilabel regression problem (summer intern project):

• Existing tool (single-machine) took ~ 1 week to build a model.

• BIDMach on a GPU node takes 1 hour (120x speedup)

• Iteration and feature engineering gave +15% accuracy.

• Auction simulation problem (cluster job):

• Existing tool simulates auction variations on log data.

• On NVIDIA 3.0 devices (64 registers/thread) we achieve a 70x

speedup over a reference implementation in Scala

• On NVIDIA 3.5 devices (256 registers/thread) we can move

auction state entirely into register storage and gain a 400x

speedup.

In the Wild (Examples from Industry)

• Classification (cluster job):

• Cluster job (logistic regression) took 8 hours.

• BIDMach version takes < 1 hour on a single node.

• SVMs for image classification (single machine)

• Large multi-label classification took 1 week with LibSVM.

• BIDMach version (SGD-based SVM) took 90 seconds.

Performance Revisited

• BIDMach had a 10x-1000x cost advantage over the other

systems. The ratio was higher for larger-scale problems.

• Energy savings were similar to the cost savings, at 10x-

1000x.

But why??

• We only expect about 10x

from GPU acceleration?

The Exponential Growth of O(1)

Once upon a Time…

a = b + c // single instruction

y = A[i] // single instruction

Now five orders

of magnitude

The Exponential Growth of O(1)

By slight abuse of notation, let O(1) denote the ratio of times

for a main memory access vs an ALU operation.

Then O(1) is not only not “small,” but its growing exponentially

with time.

A lot of non-rooflined code is too “memory-hungry” – you can

often do (much) better with custom, memory-aware kernels.

Log (Memaccess/ALUop)

Time (Years)

Performance Creep

• Code that isnt explicitly rooflined seems to be creeping

further and further from its roofline limit.

• With Intel’s profiling tools we can check the throughput of a variety of

running programs – often 10s–100s of Mflops – memory bound.

• Need well-designed kernels (probably hardware-aware

code) to get to the limits.

• Coding for performance:

Avoid:

Large hash tables

Large binary trees

Linked data structures

Use:

Sorting

Memory B-trees

Packed data structures

BIDMach ML Algorithms

1. Regression (logistic, linear)

2. Support Vector Machines

3. k-Means Clustering

4. Topic Modeling - Latent Dirichlet Allocation

5. Collaborative Filtering

6. NMF – Non-Negative Matrix Factorization

7. Factorization Machines

8. Random Forests

9. Multi-layer neural networks

10. IPTW (Causal Estimation)

11. ICA

= Likely the fastest implementation available

CPU vs GPU Memory Hierarchy

Intel 8 core Sandy Bridge CPU NVIDIA GK110 GPU

8 MB L3 Cache 1.5 MB L2 Cache

4 MB register file (!) 4kB registers:

1 MB Shared Mem

2 MB L2 Cache

512K L1 Cache

1 MB Constant Mem

1 TB/s 1 TB/s

40 TB/s

13 TB/s

5 TB/s

10s GB Main Memory 4 GB Main Memory

20 GB/s

500 GB/s 500 GB/s

200 GB/s

CoDesign: Gibbs Sampling

• Gibbs sampling is a very general method for inference in

machine learning, but typically slow.

• But it’s a particularly good match

for SIMT hardware (GPUs), not

difficult to get 10-100x speedups.

• But many popular models have slow convergence:

• Latent Dirichlet Allocation

• 10 passes of online Variational Bayes ≈

• 1000 passes for collapsed Gibbs sampling

Consequences

LDA graphical model:

• Topic samples Z are discrete and sparse: very few samples

for a specific topic in a particular document.

• This gives a high-variance random walk through parameter

space before converges.

Documents

Gibbs Parameter Estimation

Gibbs sampling in ML often applied to distributions of the

form

P(D,Z,)

Where D is the data, are the parameters of the model, and

X are other hidden states we would like to integrate over.

e.g for LDA, are the , parameters and Z are

assignments of words to topics.

• Gibbs sampling most often simply samples over both, which

is slow.

• But we really want to optimize over , and marginalize

(integrate) over X.

State-Augmented Marginal Estimation (SAME) – Doucet et al. 2002

The marginal parameter distribution is 𝑃 Θ = 𝑃 𝑋, Θ 𝑑𝑋

We can sharpen it to 𝑃𝑘 Θ as follows

𝑃𝑘 Θ = … 𝑃 𝑋, Θ𝑍𝑌𝑋

𝑃 𝑌, Θ …𝑃 𝑍, Θ 𝑑𝑋𝑑𝑌…𝑑𝑍

𝑃𝑘 Θ has the same peaks as 𝑃 Θ and corresponds to a

Gibbs distribution cooled to T = 1/k.

We modify a standard, blocked Gibbs sampler to take k

samples instead of 1 for X, and then recompute Θ from

these samples.

SAME Sampling

In the language of graphical models:

Run independent simulations with tied parameters Θ

Θ

SAME sampling as cooling

What cooling does:

Likelihood P() in model parameter space (peaks are good

models)


What cooling does:


models)


What cooling does:


models)

Making SAME sampling fast

The approach in general:

• Wherever you would take 1 sample before, take k

independent samples and average for parameter estimation.

Instead of taking distinct samples take counts of

anonymous samples – complexity independent of k

• Exact for LDA, HMM, Factorization Machines,…

• Corresponds to factored joint approximation for other

models.

Varying k allows for annealing optimization over .

Results on LDA

• We use k=100, make 10 passes over the dataset

(effectively taking 1000 total samples).

• Our SAME Gibbs sampler is within a factor of 3 of online

Variational Bayes. More than 100x faster than standard

Gibbs samplers.

• The model was more accurate than any other LDA method

we tested:

SAME sampling in general

• SAME sampling is an alternative to other message-passing

methods (VMP etc.)

• But rather than using bounds on posterior distributions it:

• Samples from the distribution of “uninteresting” variables

• Sharpens the posteriors on parameter variables.

Current Work

Inference on general graphical

models with Gibbs sampling.

Larger graphical models arise

in many applications but are

discarded for performance

reasons.

Existing tools (JAGS, Stan, Infer.net) are far from the roofline

for this problem.

Inference for Bayesian Networks

• (Main memory) Rooflined Approach: Sampling distribution

can be computed with an SpMM – 1-2GF on CPU and 5-

10GF on GPU.

• Experiment on MOOC learning concept graph with ~300

nodes and ~350 edges, 3000 samples

Getting to Warp Speed

The model for this problem

(CPTs for all nodes) fits

comfortably in GPU SHMEM.

The state of thousands of

particles fits comfortably in

GPU register memory.

It should be possible to gain another 10x to 100x in

performance with this design, and approach ALU limits.

4 MB registers

1 MB Shared Mem

Outline

Scaling Inward:




Scaling Out:




Challenges with Scaling MB algorithms

Can we combine node acceleration with clustering?

Its harder than it seems:

Dataset Minibatch

updates

Single-node Minibatch Learner

Dataset partitions across nodes

Cluster-based Learner

Reduce (Allreduce)

Allreduce

Every node j has data Uj (a local copy of a global model).

We want to compute an aggregate (e.g. sum) of all data

vectors and distribute it to all nodes, i.e. to synchronize the

model across nodes

Why Allreduce?

• Speed: Minibatch SGD and related algorithms are the fastest way to solve many of these problems (deep learning, CF, topic modeling,…). But dataset sizes (several TB) make single machine training slow (> 1 day).

• The more updates/sec the faster the method converges, but:

• In practice minibatch dof ~ model dof is near-optimal.

• That means the ideal B/W for model updates is at least the rate of data consumption (several Gb/s).

• In other words, we want an Allreduce that consumes data at roughly the full NIC bandwidth.

Why Allreduce?

• Scale: Some models wont fit on one machine (e.g.

Pagerank).

• Other models will fit but minibatch updates only touch a

small fraction of features (Click Prediction). Updating only

those is much faster than a full Allreduce.

• Sparse Allreduce: Each node j pushes updates to a

subset of features Uj and pulls a subset Vj from a

distributed model.

Why Allreduce?

• Worker driven (emergency room model) vs. aggregator-

driven (doctor’s surgery)?

• “Excuse me, worker task, the aggregator will consume

your data now”

Lower Bounds for total message B/W.

• Intuitively, every node needs to send D values somewhere, and receive D values for the final aggregate.

• The data from each node needs to be caught somewhere if allreduce happens in the same network, and the final aggregate message need to originate from somewhere.

• This suggests a naïve lower bound of 2D values in or out of each node, or 2DN for an N-node network.

• In fact more careful analysis shows the lower bound is 2D(N-1) for total bandwidth, and ~ 2D/B for latency:

Patarasuk et al. Bandwidth optimal all-reduce algorithms for clusters of workstations

http://dl.acm.org/citation.cfm?id=1482266





Why not Parameter servers?

• Potentially B/W optimal, but latency limited by net B/W of

servers:

Clients

servers

This network has at least 2D(N-1)/PB latency with P

servers, suboptimal by (N-1)/P.

A Roofline For networks

• Optimal operating point

Packet size (kBytes)

Thro

ughput (M

B/s

)

1

10

100

1000

1 10 100 1000 10000

Network capacity

100000

10000

Allreduce (example: Click Prediction)

• To run the most efficient solver (distributed SGD), we need to

synchronize models across the network.

• An Allreduce (reduce+broadcast) operation both

synchronizes and combines the updates from each node.

• For Logistic Regression, SVM and many other problems, we

need a Sparse Allreduce.

What about MapReduce?

The all-to-all Reduce implementations in Hadoop, Spark,

Powergraph etc. having scaling problems because of message

overhead. Smaller messages lower throughput.

During reduce, N nodes exchange

O(N2) messages of diminishing size.

Solutions (Dense data)

• For dense data, there are optimal (and fast in practice)

methods for Allreduce on clusters.

• They are implemented in MPI, but not always used by

default.

• In a recent Baidu paper (Wu et al.*), the authors achieved

near-linear speedup on an Infiniband cluster using MPI all-to-

all primitives.

* “Deep Image: Scaling up Image Recognition” Arxiv 1501.02876v2

Kylix: A Sparse AllReduce

for Commodity Clusters

Huasha Zhao

John Canny Department of Electrical Engineering and Computer Sciences

University of California, Berkeley

Power-Law Data

Term-Document Graph

Power-Law Data

𝑭 = 𝒓−𝜶

F: frequency

r: rank

α: power law exponent

Sparse AllReduce

• Allreduce is a general primitive for distributed graph

mining and machine learning. • Reduce, e.g. 𝑣 = 𝑣𝑖

𝑚𝑖=1 or other commutative and associative aggregate

operator such as mean, max etc.

• And redistributed back across nodes.

• Fast Allreduce makes the cluster behave like a single giant node.

• 𝑣𝑖 is sparse (power law) for big data problems, and each

cluster node may not require all of the sum v but only a

sparse subset.

• Kylix is an efficient Sparse Allreduce primitive for

computing on commodity clusters.

• Loss function

• The derivative of loss, which defines the SGD update, is a

scaled copy of Xi – same sparse non-zeros as Xi

Mini-batch Machine Learning

… Features

Examples

Xi Xi+1

Mini-batch Machine Learning

… Features

Examples

Xi Xi+1

Pull requests:

non-zeros of

next iteration

Push requests:

non-zeros of

current iteration

Kylix – Nested Heterogeneous Butterfly

• Data split to 6 machines


• Each machine is responsible for reducing 1/6 of the indices.


• Scatter-Reduce along dimension 1

• data split by 3


• Scatter-Reduce along dimension 2

• Reduce for shorter range, but denser data

grey scale indicates density


• Allgather in a reverse order

• Redistribute reduced values

grey scale indicates density


• Heterogeneous degree allows tailoring message size for each

layer in order to achieve optimal message size (larger than

the efficient message floor).


• Heterogeneous degree allows tailoring message size for each

layer – achieve optimal message size.

• Nesting allows indices to be collapsed down the network, so

that communication is greatly reduced in the lower layers –

works particularly well for power data.

Kylix - Summary

• Each network node specifies a set of input indices to pull

and a set output indices to push.

• Configuration step

• Build the routing table, once for static graphs, multiple times for

dynamic graphs

• Partitioning

• Union/Collapsing indices

• Mapping

• Reduction step

• ConfigReduce

• Can be performed in a single down-pass

Tuning Degrees of the Network

• To compute the optimal degree for that layer, we need to know the

amount of data per node. That will be the length of the partition

range at that node times the density.

• Given this data size, we find the

largest degree d such that P/d is

larger then the message floor.

• Degree / diameter trade-off

• degree^diameter = const.

Data Volumes at Each Layer

• Total communication across all layers a small constant larger than

the top layer, which is close to optimal.

• Communication volume across layers has a characteristic Kylix

shape.

Experiments (PageRank)

• Twitter Followers’ Graph

• 1.4 billion edges and 40 million vertices

• Yahoo Web Graph

• 6.7 billion edges and 1.4 billion vertices

• EC2 cluster compute node (cc2.8xlarge)

90-node Yahoo M45 64-node EC2 64-node EC2

Summary

• Roofline design exposes fundamental limits for ML tasks,

and provides guidance to reaching them.

• Roofline design gives BIDMach one to several orders of

magnitude advantage over other tools – most of which ware

well below roofline limits.

• Rooflining is a powerful tool for network primitives and

allows cluster computing to approach linear speedups

(strong scaling).

Software (version 1.0 just released)

Code: github.com/BIDData/BIDMach

Wiki: http://bid2.berkeley.edu/bid-data-project/overview/

BSD open source libs and dependencies.

In this release:

• Random Forests

• Double-precision GPU matrices

• Ipython/IScala Notebook

• Simple DNNs

Wrapper for Berkeley’s Caffe coming soon)

http://bid2.berkeley.edu/bid-data-project/overview/







Thanks

Sponsors:

Collaborators:

Data & Analytics

SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny