
Page 1: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Machine Learning at the Limit

John Canny*^

* Computer Science Division

University of California, Berkeley

^ Yahoo Research Labs

@Alpine Labs, March, 2015

Page 2: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Outline

Scaling Inward:

• What are the performance limits for machine learning on

single machines, and how close can we get to them?

(Single-node Roofline Design)

Scaling Out:

• What happens when we try to scale up from optimized

single nodes to clusters?

(Roofline Design for Clusters)

Page 3: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

My Other Job(s)

Yahoo [Chen, Pavlov, Canny, KDD 2009]*

Ebay [Chen, Canny, SIGIR 2011]**

Quantcast 2011-2013

Microsoft 2014

Yahoo 2015

* Best application paper prize

** Best paper honorable mention

Page 4: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Data Scientist’s Workflow

[Workflow diagram with stages: digging around in data, hypothesize, model, customize, large-scale exploitation, evaluate, interpret; spanning both sandbox and production]

Page 5: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Data Scientist’s Workflow

[Workflow diagram repeated: digging around in data, hypothesize, model, customize, large-scale exploitation, evaluate, interpret; spanning both sandbox and production]

Page 6: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Why Build a New ML Toolkit?

• Performance: GPU performance pulling away from other

platforms for *sparse* and dense data.

Minibatch + SGD methods dominant on Big Data,…

• Customizability: Great value in customizing models (loss

functions, constraints,…)

• Explore/Deploy: Explore fast, run the same code in

prototype and production. Be able to run on clusters.

Page 7: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Desiderata

• Performance:

• Roofline Design (single machine and cluster)

• General Matrix Library with full CPU/GPU acceleration

• Customizability:

• Modular Learner Architecture (reusable components)

• Likelihood “Mixins”

• Explore/Deploy:

• Interactive, Scriptable, Graphical

• JVM based (Scala) w/ optimal cluster primitives

Page 8: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Outline

Scaling Inward:

• What are the performance limits for machine learning on

single machines, and how close can we get to them?

(Single-node Roofline Design)

Scaling Out:

• What happens when we try to scale up from optimized

single nodes to clusters?

(Roofline Design for Clusters)

Page 9: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Roofline Design (Williams, Waterman, Patterson, 2009)

• Roofline design establishes fundamental performance

limits for a computational kernel.

[Roofline chart: throughput (Gflops, 1-1000) vs. operational intensity (flops/byte, 0.01-100), with the GPU and CPU ALU throughput ceilings marked]
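The roofline model itself is a one-line formula; a minimal sketch (the peak numbers below are placeholders, not measurements from this talk):

// Attainable throughput is capped either by the peak ALU rate or by memory bandwidth
// times operational intensity (flops/byte).
def roofline(peakGflops: Double, memBwGBs: Double, opIntensity: Double): Double =
  math.min(peakGflops, memBwGBs * opIntensity)

// Example: a kernel at 1 flop/byte on a device with 200 GB/s memory and 4000 Gflops peak
// is memory-bound at ~200 Gflops.
val attainable = roofline(4000.0, 200.0, 1.0)   // = 200.0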

Page 10: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

A Tale of Two Architectures

[Block diagrams: an Intel CPU with a few large cores, each with its own ALU, sharing an L3 cache and memory controller, vs. an NVIDIA GPU with many simple ALUs sharing an L2 cache]

Page 11: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

CPU vs GPU Memory Hierarchy

Intel 8-core Sandy Bridge CPU:
• ~4 KB registers
• 512 KB L1 cache (~1 TB/s)
• 2 MB L2 cache (~1 TB/s)
• 8 MB L3 cache (~500 GB/s)
• 10s of GB main memory (~20 GB/s)

NVIDIA GK110 GPU:
• 4 MB register file (!) (~40 TB/s)
• 1 MB shared memory (~13 TB/s)
• 1 MB constant memory (~5 TB/s)
• 1.5 MB L2 cache (~500 GB/s)
• 4 GB main memory (~200 GB/s)

Page 12: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Roofline Design – Matrix kernels

• Dense matrix multiply

• Sparse matrix multiply

[Roofline chart: throughput (Gflops) vs. operational intensity (flops/byte), with the GPU and CPU ALU throughput ceilings; dense and sparse matrix multiply kernels are marked on the chart]
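As a rough illustration of why the two kernels land where they do (back-of-envelope figures, assuming fp32, a square N x N multiply, and counting only compulsory traffic):

// Dense NxN matmul: ~2*N^3 flops over ~3*N^2 fp32 values moved, so intensity grows with N.
def denseOI(n: Double): Double = (2 * n * n * n) / (3 * n * n * 4)   // flops per byte
// Sparse matrix-vector product: ~2 flops per stored nonzero, ~12 bytes per nonzero
// (4-byte value + 4-byte index + amortized row pointers and vector traffic).
val sparseOI = 2.0 / 12.0
// denseOI(1000) ~ 167 flops/byte (compute-bound); sparseOI ~ 0.17 flops/byte (memory-bound).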

Page 13: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

A Rooflined Machine Learning Toolkit (Zhao + Canny: SIAM DM 13, KDD 13, BIGLearn 13)

[Architecture diagram: a DataSource (JBOD disks, memory, or HDFS over the network) delivers compressed data blocks at roughly 0.1-2 GB/s (or from ~100 HDFS nodes) to CPU host code, which drives one Learner per GPU thread (GPU 1 / thread 1, GPU 2 / thread 2, ...), each Learner containing a Model, an Optimizer, and Mixins, and running at 30 Gflops to 2 Teraflops per GPU]

Page 14: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Matrix + Machine Learning Layers

Written in the beautiful Scala language:

• Interpreter with JIT; scriptable.

• Open operator syntax (+, -, *, etc.): math looks like math.

• Java VM + Java codebase: runs on Hadoop, Yarn, Spark.

• Hardware acceleration in C/C++ native code (CPU/GPU).

• Easy parallelism: Actors, parallel collections.

• Memory management (sort of).

• Pre-built for multiple platforms (Windows, MacOS, Linux).

Experience similar to Matlab, R, SciPy.
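A toy illustration of the "math looks like math" point in plain Scala (this is not the BIDMat API, just a sketch of operator-based matrix syntax):

// Minimal dense matrix with operator syntax; a real toolkit adds GPU and sparse backends.
case class Mat(rows: Int, cols: Int, data: Array[Double]) {
  def apply(i: Int, j: Int): Double = data(i * cols + j)
  def +(b: Mat): Mat = Mat(rows, cols, data.zip(b.data).map { case (x, y) => x + y })
  def *(b: Mat): Mat = {                       // naive O(n^3) multiply
    val out = Array.ofDim[Double](rows * b.cols)
    for (i <- 0 until rows; k <- 0 until cols; j <- 0 until b.cols)
      out(i * b.cols + j) += this(i, k) * b(k, j)
    Mat(rows, b.cols, out)
  }
}

val a = Mat(2, 2, Array(1.0, 2.0, 3.0, 4.0))
val b = Mat(2, 2, Array(5.0, 6.0, 7.0, 8.0))
val c = a * b + a                              // reads like the math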

Page 15: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Benchmarks

Recent benchmarks on some representative tasks:

• Text Classification on Reuters news data (0.5 GB)

• Click prediction on the Kaggle Criteo dataset (12 GB)

• Clustering of handwritten digit images (MNIST) (25 GB)

• Collaborative filtering on the Netflix prize dataset (4 GB)

• Topic modeling (LDA) on a NY times collection (0.5 GB)

• Random Forests on a UCI Year Prediction dataset (0.2 GB)

• Pagerank on two social network graphs at 12GB and 48GB

Page 16: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Benchmarks

Systems (single node)

• BIDMach

• VW (Vowpal Wabbit) from Yahoo/Microsoft

• Scikit-Learn

• LibLinear

Cluster Systems

• Spark v1.1 and v1.2

• Graphlab (academic version)

• Yahoo’s LDA cluster

Page 17: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Benchmarks: Single-Machine Systems

System        | Algorithm     | Dataset | Dim | Time (s) | Cost ($) | Energy (KJ)
BIDMach       | Logistic Reg. | RCV1    | 103 | 14       | 0.002    | 3
Vowpal Wabbit | Logistic Reg. | RCV1    | 103 | 130      | 0.02     | 30
LibLinear     | Logistic Reg. | RCV1    | 103 | 250      | 0.04     | 60
Scikit-Learn  | Logistic Reg. | RCV1    | 103 | 576      | 0.08     | 120

RCV1: text classification, 103 topics (0.5 GB). Algorithms were tuned to achieve similar accuracy.

Page 18: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Benchmarks: Cluster Systems

System (A/B)        | Algorithm     | Dataset  | Dim     | Time (s)  | Cost ($)     | Energy (KJ)
Spark-72 / BIDMach  | Logistic Reg. | RCV1     | 1 / 103 | 30 / 14   | 0.07 / 0.002 | 120 / 3
Spark-64 / BIDMach  | RandomForest  | YearPred | 1       | 280 / 320 | 0.48 / 0.05  | 480 / 60
Spark-128 / BIDMach | Logistic Reg. | Criteo   | 1       | 400 / 81  | 1.40 / 0.01  | 2500 / 16

Spark-XX = system with XX cores. BIDMach ran on one node with a GTX-680 GPU.

Page 19: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Benchmarks: Cluster Systems

System (A/B)           | Algorithm            | Dataset | Dim  | Time (s)    | Cost ($)    | Energy (KJ)
Spark-384 / BIDMach    | K-Means              | MNIST   | 4096 | 1100 / 735  | 9.00 / 0.12 | 22k / 140
GraphLab-576 / BIDMach | Matrix Factorization | Netflix | 100  | 376 / 90    | 16 / 0.015  | 10k / 20
Yahoo-1000 / BIDMach   | LDA (Gibbs)          | NYtimes | 1024 | 220k / 300k | 40k / 60    | 4E10 / 6E7

Spark-XX or GraphLab-XX = system with XX cores. Yahoo-1000 had 1000 nodes.

Page 20: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

BIDMach at Scale

Latent Dirichlet Allocation

BIDMach outperforms cluster systems on this problem, and

has run up to 10 TB on one node.

[Figure: convergence on 1 TB of data]

Page 21: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Benchmark Summary

• BIDMach is at least 10x faster than other single-machine

systems for comparable accuracy.

• For Random Forests or single-class regression, BIDMach on

a GPU node is comparable with 8-16 worker clusters.

• For multi-class regression, factor models, clustering etc.,

GPU-assisted BIDMach is comparable to 100-1000-worker

clusters. Larger problems correlate with larger values in this

range.

Page 22: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Cost and Energy Use

• BIDMach had a 10x-1000x cost advantage over the other

systems. The ratio was higher for larger-scale problems.

• Energy savings were similar to the cost savings, at 10x-

1000x.

Page 23: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

In the Wild (Examples from Industry)

• Multilabel regression problem (summer intern project):

• Existing tool (single-machine) took ~ 1 week to build a model.

• BIDMach on a GPU node takes 1 hour (120x speedup)

• Iteration and feature engineering gave +15% accuracy.

• Auction simulation problem (cluster job):

• Existing tool simulates auction variations on log data.

• On NVIDIA 3.0 devices (64 registers/thread) we achieve a 70x

speedup over a reference implementation in Scala

• On NVIDIA 3.5 devices (256 registers/thread) we can move

auction state entirely into register storage and gain a 400x

speedup.

Page 24: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

In the Wild (Examples from Industry)

• Classification (cluster job):

• Cluster job (logistic regression) took 8 hours.

• BIDMach version takes < 1 hour on a single node.

• SVMs for image classification (single machine)

• Large multi-label classification took 1 week with LibSVM.

• BIDMach version (SGD-based SVM) took 90 seconds.

Page 25: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Performance Revisited

• BIDMach had a 10x-1000x cost advantage over the other

systems. The ratio was higher for larger-scale problems.

• Energy savings were similar to the cost savings, at 10x-

1000x.

But why??

• We only expect about 10x

from GPU acceleration?

Page 26: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

The Exponential Growth of O(1)

Once upon a Time…

a = b + c // single instruction

y = A[i] // single instruction

Today those two "single instructions" differ in cost by about five orders of magnitude.

Page 27: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

The Exponential Growth of O(1)

By slight abuse of notation, let O(1) denote the ratio of the time for a main-memory access to the time for an ALU operation. Then O(1) is not only not "small," it is growing exponentially with time.

A lot of non-rooflined code is too "memory-hungry": you can often do (much) better with custom, memory-aware kernels.

[Chart: log(memory-access time / ALU-op time) vs. time (years)]

Page 28: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Performance Creep

• Code that isn't explicitly rooflined seems to be creeping further and further from its roofline limit.

• With Intel's profiling tools we can check the throughput of a variety of running programs: often only 10s-100s of Mflops, i.e. memory-bound.

• Well-designed (probably hardware-aware) kernels are needed to get to the limits.

• Coding for performance (see the sketch below):

Avoid: large hash tables, large binary trees, linked data structures.

Use: sorting, memory B-trees, packed data structures.
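A minimal sketch of the "sorting over hash tables" advice: grouping and summing a large key/value stream by sorting a packed index array instead of growing a hash map (illustrative only; the key and value types are assumptions):

// Sort-based aggregation touches memory mostly sequentially, while a large hash map
// takes a cache miss on almost every update.
def sortAggregate(keys: Array[Int], vals: Array[Float]): (Array[Int], Array[Float]) = {
  val order = keys.indices.toArray.sortBy(i => keys(i))       // sort record indices by key
  val outK = scala.collection.mutable.ArrayBuffer[Int]()
  val outV = scala.collection.mutable.ArrayBuffer[Float]()
  for (idx <- order) {
    if (outK.nonEmpty && outK.last == keys(idx)) outV(outV.length - 1) += vals(idx)
    else { outK += keys(idx); outV += vals(idx) }
  }
  (outK.toArray, outV.toArray)
}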

Page 29: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

BIDMach ML Algorithms

1. Regression (logistic, linear)

2. Support Vector Machines

3. k-Means Clustering

4. Topic Modeling - Latent Dirichlet Allocation

5. Collaborative Filtering

6. NMF – Non-Negative Matrix Factorization

7. Factorization Machines

8. Random Forests

9. Multi-layer neural networks

10. IPTW (Causal Estimation)

11. ICA

Several of these are likely the fastest implementations available.

Page 30: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

CPU vs GPU Memory Hierarchy

Intel 8-core Sandy Bridge CPU:
• ~4 KB registers
• 512 KB L1 cache (~1 TB/s)
• 2 MB L2 cache (~1 TB/s)
• 8 MB L3 cache (~500 GB/s)
• 10s of GB main memory (~20 GB/s)

NVIDIA GK110 GPU:
• 4 MB register file (!) (~40 TB/s)
• 1 MB shared memory (~13 TB/s)
• 1 MB constant memory (~5 TB/s)
• 1.5 MB L2 cache (~500 GB/s)
• 4 GB main memory (~200 GB/s)

Page 31: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

CoDesign: Gibbs Sampling

• Gibbs sampling is a very general method for inference in machine learning, but it is typically slow.

• It is, however, a particularly good match for SIMT hardware (GPUs): it is not difficult to get 10-100x speedups.

• But many popular models have slow convergence, e.g. Latent Dirichlet Allocation:

• 10 passes of online Variational Bayes ≈ 1000 passes of collapsed Gibbs sampling.

Page 32: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Consequences

LDA graphical model: [plate diagram over documents]

• Topic samples Z are discrete and sparse: very few samples for a specific topic in a particular document.

• This gives a high-variance random walk through parameter space before convergence.

Page 33: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Gibbs Parameter Estimation

Gibbs sampling in ML is often applied to distributions of the form

P(D, X, Θ)

where D is the data, Θ are the parameters of the model, and X are other hidden states we would like to integrate over. E.g. for LDA, Θ are the topic and document-topic parameters, and X are the assignments Z of words to topics.

• Gibbs sampling most often simply samples over both, which is slow.

• But we really want to optimize over Θ, and marginalize (integrate) over X.

Page 34: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

State-Augmented Marginal Estimation (SAME) – Doucet et al. 2002

The marginal parameter distribution is P(Θ) = ∫ P(X, Θ) dX.

We can sharpen it to P_k(Θ) as follows:

P_k(Θ) = ∫ ⋯ ∫ P(X, Θ) P(Y, Θ) ⋯ P(Z, Θ) dX dY ⋯ dZ

P_k(Θ) has the same peaks as P(Θ) and corresponds to a Gibbs distribution cooled to T = 1/k.

We modify a standard, blocked Gibbs sampler to take k samples instead of 1 for X, and then recompute Θ from these samples.
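A one-line justification of the cooling claim, added here for clarity, using only the definitions above (the k hidden-state copies X, Y, ..., Z are integrated independently):

\[
P_k(\Theta) = \int\!\cdots\!\int P(X,\Theta)\,P(Y,\Theta)\cdots P(Z,\Theta)\, dX\,dY\cdots dZ
= \left(\int P(X,\Theta)\,dX\right)^{k} = P(\Theta)^{k},
\]

so up to normalization P_k is P(Θ) raised to the power k, i.e. a Gibbs distribution at temperature T = 1/k: the same peaks, but sharper concentration around them.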

Page 35: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

SAME Sampling

In the language of graphical models: run k independent simulations with tied parameters Θ.

[Graphical model: k replicated hidden-state plates sharing the tied parameter node Θ]

Page 36: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

SAME sampling as cooling

What cooling does:

Likelihood P(Θ) in model parameter space (peaks are good models).

Page 37: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

SAME sampling as cooling

What cooling does:

Likelihood P(Θ) in model parameter space (peaks are good models).

Page 38: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

SAME sampling as cooling

What cooling does:

Likelihood P(Θ) in model parameter space (peaks are good models).

Page 39: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Making SAME sampling fast

The approach in general:

• Wherever you would take 1 sample before, take k independent samples and average them for parameter estimation.

• Instead of taking distinct samples, take counts of anonymous samples: the complexity is then independent of k (see the sketch below).

• Exact for LDA, HMMs, Factorization Machines,…

• Corresponds to a factored joint approximation for other models.

• Varying k allows for annealing optimization over Θ.
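A minimal sketch of the "counts of anonymous samples" idea for one discrete conditional (my own illustration, not BIDMach code): instead of drawing k labelled samples from a categorical distribution, draw one multinomial count vector with total k. In a real implementation the binomial draws use a constant-time sampler, which is what makes the cost independent of k; the naive Bernoulli loop below is just for clarity.

import scala.util.Random

// Draw the counts of k anonymous samples from a categorical distribution p (sums to 1).
// Sequential binomial thinning: equivalent in distribution to k independent draws,
// but only the per-outcome counts are kept.
def multinomialCounts(k: Int, p: Array[Double], rng: Random): Array[Int] = {
  val counts = Array.ofDim[Int](p.length)
  var remaining = k
  var rest = 1.0
  for (i <- 0 until p.length - 1) {
    val q = if (rest > 0) p(i) / rest else 0.0
    var c = 0
    for (_ <- 0 until remaining) if (rng.nextDouble() < q) c += 1   // Binomial(remaining, q)
    counts(i) = c; remaining -= c; rest -= p(i)
  }
  counts(p.length - 1) = remaining
  counts
}

// Parameter estimates are then recomputed from the scaled counts, e.g.
// val expected = multinomialCounts(100, p, new Random).map(_ / 100.0)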

Page 40: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Results on LDA

• We use k = 100 and make 10 passes over the dataset (effectively taking 1000 total samples).

• Our SAME Gibbs sampler is within a factor of 3 of online Variational Bayes, and more than 100x faster than standard Gibbs samplers.

• The model was more accurate than any other LDA method we tested.

Page 41: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

SAME sampling in general

• SAME sampling is an alternative to other message-passing

methods (VMP etc.)

• But rather than using bounds on posterior distributions it:

• Samples from the distribution of “uninteresting” variables

• Sharpens the posteriors on parameter variables.

Page 42: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Current Work

Inference on general graphical

models with Gibbs sampling.

Larger graphical models arise

in many applications but are

discarded for performance

reasons.

Existing tools (JAGS, Stan, Infer.net) are far from the roofline

for this problem.

Page 43: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Inference for Bayesian Networks

• (Main memory) rooflined approach: the sampling distribution can be computed with an SpMM, reaching 1-2 Gflops on a CPU and 5-10 Gflops on a GPU.

• Experiment: a MOOC learning-concept graph with ~300 nodes and ~350 edges, 3000 samples.
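A simplified sketch of why the sampling distribution reduces to a matrix multiply (my own illustration, for a single node with a small CPT, ignoring the children's contribution to the full Gibbs conditional): if each particle's parent configuration is one-hot encoded, multiplying that matrix by the CPT gives every particle's conditional distribution in one (sparse x dense) product.

// cpt(c)(v) = P(node = v | parent configuration c); parent configurations are encoded as an index.
// parentConfig(s) = parent configuration index of particle s.
def samplingDistributions(cpt: Array[Array[Double]], parentConfig: Array[Int]): Array[Array[Double]] = {
  val numConfigs = cpt.length
  // One-hot matrix H (particles x configs): H(s)(c) = 1 if particle s has parent config c.
  val onehot = parentConfig.map { c =>
    val row = Array.ofDim[Double](numConfigs); row(c) = 1.0; row
  }
  // Dense stand-in for the SpMM: (particles x configs) * (configs x values).
  onehot.map { row =>
    val out = Array.ofDim[Double](cpt(0).length)
    for (c <- 0 until numConfigs; v <- 0 until out.length) out(v) += row(c) * cpt(c)(v)
    out
  }
}
// Each returned row is the categorical distribution a Gibbs step samples that particle's node from.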

Page 44: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Getting to Warp Speed

The model for this problem (the CPTs for all nodes) fits comfortably in GPU shared memory (1 MB), and the state of thousands of particles fits comfortably in GPU register memory (4 MB of registers).

It should be possible to gain another 10x to 100x in performance with this design, and approach ALU limits.

Page 45: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Outline

Scaling Inward:

• What are the performance limits for machine learning on

single machines, and how close can we get to them?

(Single-node Roofline Design)

Scaling Out:

• What happens when we try to scale up from optimized

single nodes to clusters?

(Roofline Design for Clusters)

Page 46: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Challenges with Scaling MB algorithms

Can we combine node acceleration with clustering? It's harder than it seems.

[Diagrams: a single-node minibatch learner streaming minibatch updates from one dataset, vs. a cluster-based learner with the dataset partitioned across nodes and the models combined with a Reduce (Allreduce) step]

Page 47: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Allreduce

Every node j has data Uj (a local copy of a global model). We want to compute an aggregate (e.g. a sum) of all data vectors and distribute it to all nodes, i.e. to synchronize the model across nodes. A minimal dense sketch follows.
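A minimal in-memory sketch of a bandwidth-optimal dense Allreduce (a simulated ring reduce-scatter followed by an allgather; this is not BIDMach/Kylix code, and the "network" is just an array of local models):

// Each "node" holds a local copy of a length-D model; after the call every node holds the
// element-wise sum. In the ring version each node sends ~2*D*(N-1)/N values, matching the
// Patarasuk et al. lower bound quoted a few slides later.
def ringAllreduce(models: Array[Array[Double]]): Unit = {
  val n = models.length
  val d = models(0).length
  def lo(c: Int) = c * d / n
  def hi(c: Int) = (c + 1) * d / n
  def mod(x: Int) = ((x % n) + n) % n
  // Reduce-scatter: after n-1 steps, node i holds the full sum of chunk (i+1) mod n.
  for (t <- 0 until n - 1; i <- 0 until n) {
    val c = mod(i - t); val dst = mod(i + 1)
    for (j <- lo(c) until hi(c)) models(dst)(j) += models(i)(j)
  }
  // Allgather: circulate the reduced chunks so every node ends with the full sum.
  for (t <- 0 until n - 1; i <- 0 until n) {
    val c = mod(i + 1 - t); val dst = mod(i + 1)
    for (j <- lo(c) until hi(c)) models(dst)(j) = models(i)(j)
  }
}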

Page 48: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Why Allreduce?

• Speed: Minibatch SGD and related algorithms are the fastest way to solve many of these problems (deep learning, CF, topic modeling,…). But dataset sizes (several TB) make single machine training slow (> 1 day).

• The more updates/sec the faster the method converges, but:

• In practice minibatch dof ~ model dof is near-optimal.

• That means the ideal B/W for model updates is at least the rate of data consumption (several Gb/s).

• In other words, we want an Allreduce that consumes data at roughly the full NIC bandwidth.

Page 49: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Why Allreduce?

• Scale: some models won't fit on one machine (e.g. Pagerank).

• Other models will fit, but minibatch updates only touch a small fraction of the features (click prediction). Updating only those is much faster than a full Allreduce.

• Sparse Allreduce: each node j pushes updates to a subset of features Uj and pulls a subset Vj from a distributed model.

Page 50: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Why Allreduce?

• Worker driven (emergency room model) vs. aggregator-

driven (doctor’s surgery)?

• “Excuse me, worker task, the aggregator will consume

your data now”

Page 51: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Lower Bounds for total message B/W.

• Intuitively, every node needs to send D values somewhere, and receive D values for the final aggregate.

• The data from each node needs to be caught somewhere if the allreduce happens in the same network, and the final aggregate messages need to originate from somewhere.

• This suggests a naive lower bound of 2D values in or out of each node, or 2DN for an N-node network.

• In fact a more careful analysis shows the lower bound is 2D(N-1) values of total bandwidth, and ~2D/B for latency (worked numbers below): Patarasuk et al., "Bandwidth optimal all-reduce algorithms for clusters of workstations."
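To put rough numbers on the bound (an illustrative calculation, not from the talk): for a model of D = 10^8 single-precision values on N = 64 nodes, each node must move about 2 × 4 × 10^8 bytes ≈ 800 MB per allreduce; on a 10 Gb/s NIC that is at least ~0.6 s per synchronization, which is why model updates need to run near full NIC bandwidth.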

Page 52: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Why not Parameter servers?

• Parameter servers are potentially bandwidth-optimal, but latency is limited by the aggregate network bandwidth of the servers.

[Diagram: clients connected to a small set of parameter servers]

This network has at least 2D(N-1)/(PB) latency with P servers, i.e. it is suboptimal by a factor of (N-1)/P.

Page 53: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

A Roofline For networks

• Optimal operating point

[Chart: throughput (MB/s, 1-100000) vs. packet size (kBytes, 1-10000), with the network capacity ceiling and the optimal operating point marked]
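The shape of that curve follows from the usual latency-bandwidth cost model (a sketch, assuming each message pays a fixed per-message overhead alpha plus size/capacity):

// Effective throughput of s-byte messages when each message costs alpha seconds of
// fixed overhead plus s/capacity seconds of transmission time.
def effectiveThroughput(sBytes: Double, alphaSec: Double, capacityBytesPerSec: Double): Double =
  sBytes / (alphaSec + sBytes / capacityBytesPerSec)

// Small packets are overhead-dominated; past the "message floor" throughput approaches capacity.
// e.g. with alpha = 50e-6 s and a 1.25e9 B/s (10 Gb/s) link:
//   effectiveThroughput(1e3, 50e-6, 1.25e9)  ~ 2.0e7 B/s   (tiny packets waste the link)
//   effectiveThroughput(1e6, 50e-6, 1.25e9)  ~ 1.18e9 B/s  (near capacity)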

Page 54: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Allreduce (example: Click Prediction)

• To run the most efficient solver (distributed SGD), we need to

synchronize models across the network.

• An Allreduce (reduce+broadcast) operation both

synchronizes and combines the updates from each node.

• For Logistic Regression, SVM and many other problems, we

need a Sparse Allreduce.

Page 55: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

What about MapReduce?

The all-to-all Reduce implementations in Hadoop, Spark, Powergraph etc. have scaling problems because of message overhead: smaller messages mean lower throughput.

During the reduce, N nodes exchange O(N^2) messages of diminishing size.

Page 56: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Solutions (Dense data)

• For dense data, there are optimal (and fast in practice)

methods for Allreduce on clusters.

• They are implemented in MPI, but not always used by

default.

• In a recent Baidu paper (Wu et al.*), the authors achieved

near-linear speedup on an Infiniband cluster using MPI all-to-

all primitives.

* “Deep Image: Scaling up Image Recognition” Arxiv 1501.02876v2

Page 57: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Kylix: A Sparse AllReduce

for Commodity Clusters

Huasha Zhao

John Canny Department of Electrical Engineering and Computer Sciences

University of California, Berkeley

Page 58: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Power-Law Data

Term-Document Graph

Page 59: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Power-Law Data

F = r^(−α)

F: frequency
r: rank
α: power-law exponent
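A quick illustration of what this distribution implies for feature frequencies (a simple calculation, with α = 1 chosen only for illustration):

// Relative frequencies F = r^(-alpha) for the top ranks; with alpha = 1 the second most
// frequent feature is half as common as the first, the tenth is a tenth, and so on.
def powerLawFreq(r: Int, alpha: Double): Double = math.pow(r, -alpha)
val top = (1 to 5).map(r => powerLawFreq(r, 1.0))   // Vector(1.0, 0.5, 0.333..., 0.25, 0.2)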

Page 60: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Sparse AllReduce

• Allreduce is a general primitive for distributed graph mining and machine learning.

• Reduce, e.g. v = Σ_{i=1..m} v_i, or another commutative and associative aggregate operator such as mean, max, etc.

• The result is then redistributed back across the nodes.

• Fast Allreduce makes the cluster behave like a single giant node.

• v_i is sparse (power law) for big-data problems, and each cluster node may not require all of the sum v but only a sparse subset.

• Kylix is an efficient Sparse Allreduce primitive for computing on commodity clusters.

Page 61: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Mini-batch Machine Learning

• Loss function (defined over minibatches Xi of examples).

• The derivative of the loss, which defines the SGD update, is a scaled copy of Xi: it has the same sparse non-zeros as Xi (see the sketch below).

[Diagram: the examples-by-features data matrix, processed in minibatches Xi, Xi+1, …]
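A minimal sketch of that observation for one sparse example under a logistic loss (illustrative only, not BIDMach code): the gradient is the example's non-zeros scaled by a single residual, so the update touches only those coordinates.

// Sparse example: parallel arrays of feature indices and values; w is the dense model.
def sgdStep(w: Array[Double], idx: Array[Int], vals: Array[Double],
            label: Double, lr: Double): Unit = {
  var dot = 0.0
  for (k <- 0 until idx.length) dot += w(idx(k)) * vals(k)        // w . x over non-zeros only
  val p = 1.0 / (1.0 + math.exp(-dot))                            // logistic prediction
  val g = p - label                                               // scalar residual
  for (k <- 0 until idx.length) w(idx(k)) -= lr * g * vals(k)     // gradient = g * x (same sparsity)
}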

Page 62: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Mini-batch Machine Learning

… Features

Examples

Xi Xi+1

Pull requests:

non-zeros of

next iteration

Push requests:

non-zeros of

current iteration

Page 63: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Kylix – Nested Heterogeneous Butterfly

• Data split to 6 machines

Page 64: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Kylix – Nested Heterogeneous Butterfly

• Each machine is responsible for reducing 1/6 of the indices.

Page 65: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Kylix – Nested Heterogeneous Butterfly

• Scatter-Reduce along dimension 1

• data split by 3

Page 66: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Kylix – Nested Heterogeneous Butterfly

• Scatter-Reduce along dimension 2

• Reduce for shorter range, but denser data

grey scale indicates density

Page 67: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Kylix – Nested Heterogeneous Butterfly

• Allgather in a reverse order

• Redistribute reduced values

grey scale indicates density

Page 68: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Kylix – Nested Heterogeneous Butterfly

• Heterogeneous degree allows tailoring message size for each

layer in order to achieve optimal message size (larger than

the efficient message floor).

Page 69: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Kylix – Nested Heterogeneous Butterfly

• Heterogeneous degree allows tailoring message size for each

layer – achieve optimal message size.

• Nesting allows indices to be collapsed down the network, so that communication is greatly reduced in the lower layers. This works particularly well for power-law data.

Page 70: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Kylix - Summary

• Each network node specifies a set of input indices to pull and a set of output indices to push (see the sketch below).

• Configuration step:

• Build the routing table: once for static graphs, multiple times for dynamic graphs

• Partitioning

• Union/collapsing of indices

• Mapping

• Reduction step

• ConfigReduce:

• Configuration and reduction can be performed in a single down-pass
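A much-simplified, single-layer sketch of the push/pull contract (an in-memory simulation only; the real Kylix is a nested butterfly, and the index handling here is just a Map for clarity):

// Each node pushes (index -> value) updates and pulls the reduced values for the indices
// it will need next. A single reducer map stands in for the partitioned, multi-layer network.
case class Node(push: Map[Long, Double], pull: Set[Long])

def sparseAllreduce(nodes: Seq[Node]): Seq[Map[Long, Double]] = {
  // Reduce: sum pushed values by index (in Kylix this is partitioned across machines).
  val reduced: Map[Long, Double] =
    nodes.flatMap(_.push.toSeq).groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).sum }
  // Redistribute: every node receives only the indices it asked to pull.
  nodes.map(n => n.pull.iterator.flatMap(i => reduced.get(i).map(i -> _)).toMap)
}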

Page 71: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Tuning Degrees of the Network

• To compute the optimal degree for a layer, we need to know the amount of data per node: the length of the partition range at that node times the density.

• Given this data size, we find the largest degree d such that P/d is larger than the message floor (see the sketch below).

• Degree/diameter trade-off: degree^diameter (= the number of machines) is roughly constant.
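A small sketch of that degree choice (assuming P is the per-node data volume in bytes and the floor is the smallest efficient message size from the network roofline):

// Largest degree d (up to the number of available peers) such that splitting the per-node
// data P into d messages keeps each message above the efficient floor.
def bestDegree(pBytes: Double, floorBytes: Double, maxDegree: Int): Int =
  (maxDegree to 1 by -1).find(d => pBytes / d >= floorBytes).getOrElse(1)

// e.g. with 64 MB per node and a 1 MB message floor, up to 64-way fan-out is still efficient:
//   bestDegree(64e6, 1e6, 90)  == 64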

Page 72: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Data Volumes at Each Layer

• The total communication across all layers is a small constant factor larger than that of the top layer, which is close to optimal.

• The communication volume across layers has a characteristic Kylix shape.

Page 73: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Experiments (PageRank)

• Twitter followers' graph: 1.4 billion edges and 40 million vertices.

• Yahoo web graph: 6.7 billion edges and 1.4 billion vertices.

• EC2 cluster compute nodes (cc2.8xlarge).

[Runtime charts for the clusters: a 90-node Yahoo M45 cluster and two 64-node EC2 clusters]

Page 74: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Summary

• Roofline design exposes fundamental limits for ML tasks, and provides guidance for reaching them.

• Roofline design gives BIDMach a one-to-several-orders-of-magnitude advantage over other tools, most of which are well below their roofline limits.

• Rooflining is a powerful tool for network primitives and allows cluster computing to approach linear speedups (strong scaling).

Page 75: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Software (version 1.0 just released)

Code: github.com/BIDData/BIDMach

Wiki: http://bid2.berkeley.edu/bid-data-project/overview/

BSD open-source libraries and dependencies.

In this release:

• Random Forests

• Double-precision GPU matrices

• IPython/IScala Notebook

• Simple DNNs

• Wrapper for Berkeley's Caffe coming soon

Page 76: SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit By Prof. John Canny

Thanks

Sponsors:

Collaborators: