Learning at Scale: Deep, Distributed and Multi-dimensional
Anima Anandkumar, Amazon AI & Caltech

Anima Anandkumar, Principal Scientist, Amazon Web Services, and Endowed Professor, Caltech, at MLconf SF 2017


Page 1:

Learning at Scale: Deep, Distributed and Multi-dimensional

Anima Anandkumar

Amazon AI & Caltech

Page 2:

Deep Learning

The "deep learning" trend of the past 10 years has significantly improved many applications across multiple domains: image understanding, speech recognition, natural language processing, and autonomy.

Page 3:

Image Classification

(Figure: feature maps at Layer 1, Layer 2, and the Output.)

• Multilevel feature extraction, from raw pixels to semantic meaning
• Convolution layers exploit spatial information

Page 4:

Image Classification

State-of-the-art networks have tens to hundreds of layers.

§ Hard to define the network: the Inception network takes >1k lines of code to define in Caffe
§ A single image requires billions of floating-point operations: Intel i7 ~500 GFLOPS; Nvidia Titan X ~5 TFLOPS
§ Memory consumption is linear in the number of layers

Page 5:

Outline

1 Introduction

2 Distributed Deep Learning Using MXNet

3 Learning in Multiple Dimensions

4 Conclusion

Page 6:

MXNet

(image credit: Wikipedia)

• Imperative and Declarative Programming
• Language Support
• Backend and Automatic Parallelization

Page 7:

Writing Parallel Programs is Painful

Each forward-backward-update pass involves O(num_layers) tensor computations and communications, where num_layers is often 100 to 1,000:

data = next_batch()

# GPU 0: forward and backward on the first half of the batch
data[gpu0].copyfrom(data[0:50])
fc1[gpu0] = FullcForward(data[gpu0], fc1_weight[gpu0])
fc2[gpu0] = FullcForward(fc1[gpu0], fc2_weight[gpu0])
fc2_ograd[gpu0] = LossGrad(fc2[gpu0], label[0:50])
fc1_ograd[gpu0], fc2_wgrad[gpu0] = FullcBackward(fc2_ograd[gpu0], fc2_weight[gpu0])
_, fc1_wgrad[gpu0] = FullcBackward(fc1_ograd[gpu0], fc1_weight[gpu0])

# GPU 1: forward and backward on the second half of the batch
data[gpu1].copyfrom(data[51:100])
fc1[gpu1] = FullcForward(data[gpu1], fc1_weight[gpu1])
fc2[gpu1] = FullcForward(fc1[gpu1], fc2_weight[gpu1])
fc2_ograd[gpu1] = LossGrad(fc2[gpu1], label[51:100])
fc1_ograd[gpu1], fc2_wgrad[gpu1] = FullcBackward(fc2_ograd[gpu1], fc2_weight[gpu1])
_, fc1_wgrad[gpu1] = FullcBackward(fc1_ograd[gpu1], fc1_weight[gpu1])

# CPU: aggregate gradients, update the weights, broadcast them back
fc1_wgrad[cpu] = fc1_wgrad[gpu0] + fc1_wgrad[gpu1]
fc2_wgrad[cpu] = fc2_wgrad[gpu0] + fc2_wgrad[gpu1]
fc1_weight[cpu] -= lr * fc1_wgrad[cpu]
fc2_weight[cpu] -= lr * fc2_wgrad[cpu]
fc1_weight[cpu].copyto(fc1_weight[gpu0], fc1_weight[gpu1])
fc2_weight[cpu].copyto(fc2_weight[gpu0], fc2_weight[gpu1])

Dependency graph for a 2-layer neural network on 2 GPUs.

Page 8:

Auto Parallelization

Write serial programs; run in parallel:

>>> import mxnet as mx
>>> A = mx.nd.ones((2,2)) * 2
>>> C = A + 2
>>> B = A + 1
>>> D = B * C
>>> D.wait_to_read()

(Dependency graph: A = 2; B = A + 1 and C = A + 2 can run in parallel; D = B × C waits on both.)

Page 9:

Data Parallelism

(Figure: workers exchange parameters with a key-value store; the examples are partitioned across workers.)

1. Read a data partition
2. Pull the parameters
3. Compute the gradient
4. Push the gradient
5. Update the parameters
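A minimal sketch of this loop with MXNet's key-value store; get_batch and grad are hypothetical stand-ins for the data pipeline and the gradient computation:

import mxnet as mx

kv = mx.kv.create('local')                 # 'dist_sync' for multiple machines
kv.set_optimizer(mx.optimizer.SGD(learning_rate=0.1))

weight = mx.nd.zeros((100, 50))
kv.init(0, weight)                         # register the parameter under key 0

for data in get_batch():                   # 1. read a data partition
    kv.pull(0, out=weight)                 # 2. pull the current parameters
    g = grad(weight, data)                 # 3. compute the gradient locally
    kv.push(0, g)                          # 4. push the gradient
    # 5. the store aggregates pushed gradients and applies the SGD update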

Page 10:

Scale to Multiple GPU Machines

(Figure: within a machine, 4 GPUs share a PCIe switch connected to the CPU; each PCIe 3.0 16x link provides 15.75 GB/s, 63 GB/s aggregate over the 4 links; machines are connected through a network switch at 1.25 GB/s over 10 Gbit Ethernet.)

Hierarchical parameter server:

(Figure: workers run on the GPUs; level-1 servers run on the CPUs inside each machine; level-2 servers aggregate across machines.)

Page 11:

Experiment Setup

✧ ImageNet: 1.2 million images with 1,000 classes
✧ ResNet 152-layer model
✧ EC2 P2.16xlarge (CPU connected through PCIe switches to GPUs 0-15)
✧ Minibatch SGD
✧ Synchronized updating

Page 12:

Scalability over Multiple Machines

(Figure: time (sec) per batch vs. number of GPUs, from 0 to 128, for batch size/GPU = 2, 4, 8, and 16, with communication cost broken out; the best configuration reaches a 115x speedup.)

Page 13:

(Figure: timeline, before 2012 through 2017, tracing imperative and symbolic programming interfaces into MXNet and, most recently, Gluon.)

Page 14:

Back-end System

✧ Optimization
  ✓ Memory optimization
  ✓ Operator fusion
✧ Scheduling
  ✓ Auto-parallelization

(Figure: front-end programs compile to a dataflow graph: a and b feed a multiply, then +1 yields c; data, weight, and bias feed fullc, then softmax.)

Front-end, imperative:

import mxnet as mx
a = mx.nd.zeros((100, 50))
b = mx.nd.ones((100, 50))
c = a * b
c += 1

Front-end, declarative:

import mxnet as mx
net = mx.symbol.Variable('data')
net = mx.symbol.FullyConnected(data=net, num_hidden=128)
net = mx.symbol.SoftmaxOutput(data=net)
texec = mx.module.Module(net)
texec.forward(data=c)
texec.backward()

Both front-ends hand a dataflow graph to the back-end for optimization and scheduling.

Page 15:

In summary

✦ Symbolic
  ❖ efficient & portable
  ❖ but hard to use

✦ Imperative
  ❖ flexible
  ❖ may be slow

✦ Gluon
  ❖ imperative for developing
  ❖ symbolic for deploying
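A minimal Gluon sketch of that develop-imperatively, deploy-symbolically flow; the two-layer network is illustrative:

import mxnet as mx
from mxnet.gluon import nn

net = nn.HybridSequential()
with net.name_scope():
    net.add(nn.Dense(128, activation='relu'))
    net.add(nn.Dense(10))
net.initialize()

x = mx.nd.ones((1, 20))
y = net(x)        # imperative: executes eagerly, easy to debug

net.hybridize()   # compile the network into a symbolic graph
y = net(x)        # later calls run the optimized graph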

Page 16:

Outline

1 Introduction

2 Distributed Deep Learning Using MXNet

3 Learning in Multiple Dimensions

4 Conclusion

Page 17:

Tensors: Beyond 2D world

Modern data is inherently multi-dimensional

Page 18:

Tensors: Beyond 2D world

Modern data is inherently multi-dimensional

(Figure: a fully connected network with Input, Hidden 1, Hidden 2, and Output layers.)

Page 19:

Tensor Contraction

Extends the notion of matrix product.

Matrix product:

    Mv = Σ_j v_j M_j    (a weighted sum of the columns of M)

Tensor contraction:

    T(u, v, ·) = Σ_{i,j} u_i v_j T_{i,j,:}    (a weighted sum of the mode-3 fibers of T)
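The same contraction written with numpy's einsum, as a concrete check (shapes illustrative):

import numpy as np

T = np.random.rand(3, 4, 5)
u = np.random.rand(3)
v = np.random.rand(4)

# T(u, v, .) = sum_{i,j} u_i v_j T[i, j, :]  ->  a length-5 vector
w = np.einsum('i,j,ijk->k', u, v, T)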

Page 20:

Employing Tensor Contractions in AlexNet

Replace the fully connected layer with a tensor contraction layer.

Page 21:

Enabling the Tensor Contraction Layer in MXNet
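The slide's code is not preserved in this transcript; below is a hypothetical sketch of a tensor contraction layer using MXNet NDArray ops, contracting each mode of the activation tensor with a small factor matrix (names and shapes are illustrative, not the paper's implementation):

import mxnet as mx

def tensor_contraction_layer(x, W1, W2, W3):
    # x: (batch, C, H, W); W1: (C, c), W2: (H, h), W3: (W, w)
    # mode-1 contraction over channels: (b, C, H, W) -> (b, c, H, W)
    y = mx.nd.dot(x.transpose((0, 2, 3, 1)), W1).transpose((0, 3, 1, 2))
    # mode-2 contraction over height: (b, c, H, W) -> (b, c, h, W)
    y = mx.nd.dot(y.transpose((0, 1, 3, 2)), W2).transpose((0, 1, 3, 2))
    # mode-3 contraction over width: (b, c, h, W) -> (b, c, h, w)
    return mx.nd.dot(y, W3)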

Page 22:

Performance of the TCL

• Trained end-to-end
• On ImageNet with VGG:
  • 65.9% space savings
  • performance drop of only 0.6%
• On ImageNet with AlexNet:
  • 56.6% space savings
  • performance improvement of 0.5%

Page 23:

Low-rank tensor regression

Tensor Regression Networks, J. Kossaifi, Z. C. Lipton, A. Khanna, T. Furlanello and A. Anandkumar, arXiv preprint.
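In outline (a paraphrase of the cited paper, not text from the slide): a tensor regression layer regresses the output directly from the activation tensor X through a weight tensor W constrained to low Tucker rank,

    y = ⟨X, W⟩ + b,    W = G ×_1 U_1 ×_2 U_2 ×_3 U_3,

so the chosen rank trades parameters against accuracy, which is what the next slide's performance-vs-rank plot examines.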

Page 24:

Performance and rank

Page 25:

Speeding up Tensor Contractions

1 Tensor contractions are a core primitive of multilinear algebra.
2 BLAS 3: unbounded compute intensity (no. of ops per I/O).

Consider single-index contractions: C = A B

(Figure: a 3-way tensor A shown as slices A(:,1,:), A(:,2,:); entries such as A_422 and B_21 combine into C_421.)

e.g. C_mnp = A_mnk B_kp
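Because m and n can be grouped into one index here, this example is exactly one matrix multiply after a reshape (a numpy sketch, shapes illustrative):

import numpy as np

m, n, k, p = 4, 5, 6, 7
A = np.random.rand(m, n, k)
B = np.random.rand(k, p)

# C_mnp = A_mnk B_kp: fold (m, n) into one index, multiply, unfold
C = (A.reshape(m * n, k) @ B).reshape(m, n, p)

assert np.allclose(C, np.einsum('mnk,kp->mnp', A, B))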

Page 26:

Speeding up Tensor Contractions

Explicit permutation dominates, especially for small tensors.

Consider C_mnp = A_km B_pkn:

1 A_km → A_mk
2 B_pkn → B_kpn
3 C_mnp → C_mpn
4 C_m(pn) = A_mk B_k(pn)
5 C_mpn → C_mnp

(Figure, top: CPU; bottom: GPU. The fraction of time spent in copies/transpositions as n grows from 100 to 500; lines are shown for 1, 2, 3, and 6 transpositions.)
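A numpy sketch of those steps (step 3 is only needed when accumulating into an existing C, so it is skipped here):

import numpy as np

m, n, p, k = 4, 5, 6, 7
A = np.random.rand(k, m)
B = np.random.rand(p, k, n)

At = A.T                                     # 1. A_km -> A_mk
Bt = B.transpose(1, 0, 2)                    # 2. B_pkn -> B_kpn
Ct = At @ Bt.reshape(k, p * n)               # 4. C_m(pn) = A_mk B_k(pn)
C = Ct.reshape(m, p, n).transpose(0, 2, 1)   # 5. C_mpn -> C_mnp

assert np.allclose(C, np.einsum('km,pkn->mnp', A, B))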

Page 27:

Existing Primitives

GEMM: suboptimal for many small matrices.

Pointer-to-Pointer BatchedGEMM, available in MKL 11.3β and cuBLAS 4.1:

C[p] = α op(A[p]) op(B[p]) + β C[p]

cublas<T>gemmBatched(cublasHandle_t handle,
                     cublasOperation_t transA, cublasOperation_t transB,
                     int M, int N, int K,
                     const T* alpha,
                     const T** A, int ldA,
                     const T** B, int ldB,
                     const T* beta,
                     T** C, int ldC,
                     int batchCount)

Page 28:

Tensor Contraction with Extended BLAS Primitives

C_mn[p] = A_mk B_kn[p]

cublasDgemmStridedBatched(handle,
                          CUBLAS_OP_N, CUBLAS_OP_N,
                          M, N, K,
                          &alpha,
                          A, ldA1, 0,       // batch stride 0: A is reused for every p
                          B, ldB1, ldB2,    // advance B by ldB2 per batch index p
                          &beta,
                          C, ldC1, ldC2,    // advance C by ldC2 per batch index p
                          P)                // batch count: the extent of p

Page 29:

Tensor Contraction with Extended BLAS Primitives

C_mnp = A_∗∗ × B_∗∗∗ (one index contracted), with C_mnp ≡ C[m + n·ldC1 + p·ldC2]

Case | Contraction | Kernel 1                    | Kernel 2
1.1  | A_mk B_knp  | C_m(np) = A_mk B_k(np)      | C_mn[p] = A_mk B_kn[p]
1.2  | A_mk B_kpn  | C_mn[p] = A_mk B_k[p]n      | C_m[n]p = A_mk B_kp[n]
1.3  | A_mk B_nkp  | C_mn[p] = A_mk B^T_nk[p]    |
1.4  | A_mk B_pkn  | C_m[n]p = A_mk B^T_pk[n]    |
1.5  | A_mk B_npk  | C_m(np) = A_mk B^T_(np)k    | C_mn[p] = A_mk B^T_n[p]k
1.6  | A_mk B_pnk  | C_m[n]p = A_mk B^T_p[n]k    |
2.1  | A_km B_knp  | C_m(np) = A^T_km B_k(np)    | C_mn[p] = A^T_km B_kn[p]
2.2  | A_km B_kpn  | C_mn[p] = A^T_km B_k[p]n    | C_m[n]p = A^T_km B_kp[n]
2.3  | A_km B_nkp  | C_mn[p] = A^T_km B^T_nk[p]  |
2.4  | A_km B_pkn  | C_m[n]p = A^T_km B^T_pk[n]  |
2.5  | A_km B_npk  | C_m(np) = A^T_km B^T_(np)k  | C_mn[p] = A^T_km B^T_n[p]k
2.6  | A_km B_pnk  | C_m[n]p = A^T_km B^T_p[n]k  |
3.1  | A_nk B_kmp  | C_mn[p] = B^T_km[p] A^T_nk  |
3.2  | A_nk B_kpm  | C_mn[p] = B^T_k[p]m A^T_nk  |
3.3  | A_nk B_mkp  | C_mn[p] = B_mk[p] A^T_nk    |
3.4  | A_nk B_pkm  |                             |
3.5  | A_nk B_mpk  | C_mn[p] = B_m[p]k A^T_nk    |
3.6  | A_nk B_pmk  |                             |
4.1  | A_kn B_kmp  | C_mn[p] = B^T_km[p] A_kn    |
4.2  | A_kn B_kpm  | C_mn[p] = B^T_k[p]m A_kn    |
4.3  | A_kn B_mkp  | C_mn[p] = B_mk[p] A_kn      |
4.4  | A_kn B_pkm  |                             |
4.5  | A_kn B_mpk  | C_mn[p] = B_m[p]k A_kn      |
4.6  | A_kn B_pmk  |                             |
5.1  | A_pk B_kmn  | C_(mn)p = B^T_k(mn) A^T_pk  | C_m[n]p = B^T_km[n] A^T_pk
5.2  | A_pk B_knm  | C_m[n]p = B^T_k[n]m A^T_pk  |
5.3  | A_pk B_mkn  | C_m[n]p = B_mk[n] A^T_pk    |
5.4  | A_pk B_nkm  |                             |
5.5  | A_pk B_mnk  | C_(mn)p = B_(mn)k A^T_pk    | C_m[n]p = B_m[n]k A^T_pk
5.6  | A_pk B_nmk  |                             |
6.1  | A_kp B_kmn  | C_(mn)p = B^T_k(mn) A_kp    | C_m[n]p = B^T_km[n] A_kp
6.2  | A_kp B_knm  | C_m[n]p = B^T_k[n]m A_kp    |
6.3  | A_kp B_mkn  | C_m[n]p = B_mk[n] A_kp      |
6.4  | A_kp B_nkm  |                             |
6.5  | A_kp B_mnk  | C_(mn)p = B_(mn)k A_kp      | C_m[n]p = B_m[n]k A_kp
6.6  | A_kp B_nmk  |                             |

Page 30:

A new primitive: StridedBatchedGEMM

Performance on par with pure GEMM (P100 and beyond).

Page 31:

Applications: Tucker Decomposition

T_mnp = G_ijk A_mi B_nj C_pk

(Figure: the tensor T (m×n×p) factors into a core G (i×j×k) and factor matrices A (m×i), B (n×j), C (p×k).)

Main steps in the algorithm (superscripts index the alternating iterations):

Y_mjk = T_mnp B^t_nj C^t_pk
Y_ink = T_mnp A^{t+1}_mi C^t_pk
Y_ijp = T_mnp B^{t+1}_nj A^{t+1}_mi

Performance on Tucker decomposition:

(Figure: running time (sec, log scale, 10^-2 to 10^6) vs. n, from 20 to 120, for TensorToolbox, BTAS, Cyclops, CPU Batched, and GPU Batched.)
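A numpy sketch of the three contraction steps in one sweep (dimensions illustrative); the point of the slide is that each step maps onto StridedBatchedGEMM instead of explicit permutations:

import numpy as np

m, n, p = 30, 30, 30     # tensor dimensions
i, j, k = 5, 5, 5        # Tucker ranks
T = np.random.rand(m, n, p)
A = np.random.rand(m, i)
B = np.random.rand(n, j)
C = np.random.rand(p, k)

Y1 = np.einsum('mnp,nj,pk->mjk', T, B, C)   # used to update A
Y2 = np.einsum('mnp,mi,pk->ink', T, A, C)   # used to update B
Y3 = np.einsum('mnp,mi,nj->ijp', T, A, B)   # used to update C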

Page 32:

Tensor Sketches

Randomized dimensionality reduction through sketching.
◮ Complexity independent of tensor order: exponential gain!

(Figure: a tensor T hashed into a short sketch s via random ±1 signs.)

Applications:
• Tensor decomposition via sketching
• Visual question and answering

(Figure: VQA pipeline: CNN image features (H × W × C) and RNN features of the question "What is the mustache made of?" are fused by MCT pooling, then average pooling, FC, ReLU, BatchNorm, FC, and a Softmax that outputs "Banana".)
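A minimal count-sketch sketch in numpy (the hashing details are assumed; real implementations fix h and s once per layer):

import numpy as np

def count_sketch(x, d, seed=0):
    rng = np.random.default_rng(seed)
    n = x.size
    h = rng.integers(0, d, size=n)        # bucket for each coordinate
    s = rng.choice([-1.0, 1.0], size=n)   # random sign for each coordinate
    out = np.zeros(d)
    np.add.at(out, h, s * x.ravel())      # scatter-add signed values
    return out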

Page 33:

MCT in Visual Question & Answering

(Figure: the same VQA pipeline, showing CNN image features and RNN question features fused by MCT pooling before the classification head.)

Page 34:

Multimodal Tensor Pooling

(Figure: the image feature map (H × W × C) goes through a spatial sketch to d1 × d2 × d3 followed by a 3D FFT; the text feature (length L) goes through a count sketch to d4 followed by a 1D FFT; the transforms are combined, with an optional 3D IFFT producing a d1 × d2 × d3 output.)
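The FFT stages rely on the standard sketching identity: the count sketch of an outer product equals the circular convolution of the individual sketches, i.e. an elementwise product in the Fourier domain. A minimal 1D sketch of that combination step (illustrative only):

import numpy as np

def combine_sketches(sx, sy):
    # count sketch of an outer product = circular convolution of the
    # sketches = elementwise product in the Fourier domain
    return np.real(np.fft.ifft(np.fft.fft(sx) * np.fft.fft(sy)))

sx = np.random.rand(512)   # sketch of the image feature (illustrative)
sy = np.random.rand(512)   # sketch of the text feature (illustrative)
s = combine_sketches(sx, sy)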

Page 35:

Tensor Decompositions

Page 36:

Extracting Topics from Documents

(Figure: documents decompose into topics and topic proportions; topics such as crime, sports, and education are distributions over words such as police, witness, and campus.)

A. Anandkumar, D. P. Foster, D. Hsu, S. M. Kakade, Y.-K. Liu, "Two SVDs Suffice: Spectral decompositions for probabilistic topic modeling and latent Dirichlet allocation," NIPS 2012.

Page 37:

Tensor Methods for Topic Modeling

Topic-word matrix: P[word = i | topic = j], with linearly independent columns.

Moment tensor: co-occurrence of word triplets.

(Figure: the triplet co-occurrence tensor decomposes as a sum of rank-1 terms, one per topic (crime, sports, education), each built from that topic's distribution over words such as campus, police, witness.)
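A naive numpy sketch of that moment tensor (assumed setup; practical estimators work with sparse counts and never materialize the dense V×V×V array):

import itertools
import numpy as np

def triplet_moment(docs, V):
    # docs: list of lists of word ids; V: vocabulary size
    # empirical co-occurrence of word triplets at distinct positions
    M3 = np.zeros((V, V, V))
    total = 0
    for doc in docs:
        for a, b, c in itertools.permutations(doc, 3):
            M3[a, b, c] += 1
            total += 1
    return M3 / max(total, 1)

Decomposing this tensor (after lower-order corrections) recovers the columns of the topic-word matrix.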

Page 38:

Tensors vs. Variational Inference

Criterion: perplexity = exp[−likelihood].

Learning topics from PubMed on Spark, 8 million articles:

(Figure: running time (×10^4 sec) and perplexity (10^3 to 10^5, log scale) for the tensor method vs. variational inference.)

Learning network communities from social network data: Facebook n ≈ 20k, Yelp n ≈ 40k, DBLP-sub n ≈ 1e5, DBLP n ≈ 1e6.

(Figure: running time (10^2 to 10^6 sec, log scale) and error (10^-2 to 10^1, log scale) on FB, YP, DBLPsub, and DBLP.)

F. Huang, U. N. Niranjan, M. Hakeem, A. Anandkumar, "Online tensor methods for training latent variable models," JMLR 2014.

Page 39:

Tensors vs. Variational Inference

(Same comparisons as the previous slide, with the takeaway overlaid: orders of magnitude faster & more accurate.)

F. Huang, U. N. Niranjan, M. Hakeem, A. Anandkumar, "Online tensor methods for training latent variable models," JMLR 2014.

Page 40:

Outline

1 Introduction

2 Distributed Deep Learning Using MXNet

3 Learning in Multiple Dimensions

4 Conclusion

Page 41:

Conclusion

Distributed Deep Learning at Scale

MXNet has many attractive features:
◮ Flexible programming
◮ Portable
◮ Highly efficient

Easy to deploy large-scale DL on the AWS cloud:
◮ Deep Learning AMI
◮ CloudFormation templates

Tensors are the future of ML:

Tensor contractions: space savings in deep architectures.

New primitives speed up tensor contractions: extended BLAS.

(Figure: the tensor contraction T(u, v, ·) and a sum-of-rank-1 tensor decomposition.)