
Microsoft Cognitive Toolkit

Outline

• Overview

• CNTK introduction

• Symbolic loop

• Batch scheduling

• Data parallel training

• Educational resources

• Conclusions



Deep Learning at Microsoft

• Microsoft Cognitive Services

• Skype Translator

• Cortana

• Bing

• HoloLens

• Microsoft Research



ImageNet: Microsoft 2015 ResNet

ImageNet classification top-5 error (%):

  ILSVRC2010 NEC America    28.2
  ILSVRC2011 Xerox          25.8
  ILSVRC2012 AlexNet        16.4
  ILSVRC2013 Clarifai       11.7
  ILSVRC2014 VGG             7.3
  ILSVRC2014 GoogLeNet       6.7
  ILSVRC2015 ResNet          3.5

Microsoft took first place in all five 2015 challenges: ImageNet classification, ImageNet localization, ImageNet detection, COCO detection, and COCO segmentation.


Microsoft Cognitive Services


Image Similarity

Goal: given a query image, find similar images.

• Customer: anonymous ISV (Azure partner)

• Task: given a retail image, find the same product on competitor websites (to compare prices).

• Existing solution: based solely on mining text from the websites of Target, Macy's, etc.

• The customer asked for individual similarity measures (e.g. texture, neck style).


Bing / Bing Ads


Microsoft Translator (http://translate.it)

PowerPoint plug-in for translating speech to subtitles


YouTube link:

https://www.youtube.com/watch?v=eu9kMIeS0wQ;start=70


Microsoft Customer Support Agent


Microsoft Cognitive Toolkit (CNTK)

• Microsoft's open-source deep-learning toolkit: https://github.com/Microsoft/CNTK

• Created by Microsoft Speech researchers (Dong Yu et al.) in 2012 as the "Computational Network Toolkit"

• On GitHub since Jan 2016 under the MIT license

• Python support since Oct 2016 (beta); rebranded as the "Cognitive Toolkit"

• External contributions, e.g. from MIT, Stanford, and Nvidia

• Over 80% of Microsoft's internal deep-learning workload runs on CNTK

• First-class on Linux and Windows, with Docker support

• APIs for Python, C++, C#, Java, and Keras

• Internal == External (the toolkit released externally is the same one used internally)


CNTK Speed

Processing time per minibatch in ms (lower is better):

                  Caffe      CNTK       MXNet      TensorFlow  Torch
FCN5 (1024)       55.329     51.038     60.448     62.044      52.154
AlexNet (256)     36.815     27.215     28.994     103.960     37.462
ResNet (32)       143.987    81.470     84.545     181.404     90.935
LSTM (256)        -          43.581     288.142    -           1130.606
  (v7 benchmark)  -          (44.917)   (284.898)  (223.547)   (906.958)

Benchmarking by HKBU, version 8: http://dlbench.comp.hkbu.edu.hk/
Single Tesla K80 GPU, CUDA 8.0, cuDNN v5.1.
Caffe 1.0rc5 (39f28e4), CNTK 2.0 Beta10 (1ae666d), MXNet 0.93 (32dc3a2), TensorFlow 1.0 (4ac9c09), Torch 7 (748f5e3).


"CNTK is production-ready: state-of-the-art accuracy, efficient, and scales to multi-GPU/multi-server."

[Chart: speed comparison (samples/second, higher is better) for CNTK, Theano, TensorFlow, Torch 7, and Caffe on 1 GPU, 1 x 4 GPUs, and 2 x 4 GPUs (8 GPUs); December 2015. Theano supports only 1 GPU. CNTK's multi-GPU scaling is achieved with the 1-bit gradient quantization algorithm.]


MICROSOFT COGNITIVE TOOLKIT
First Deep Learning Framework Fully Optimized for Pascal

AlexNet performance (images/sec), near-linear multi-GPU scaling:

  Dual-socket CPU server        78
  1x P100                    2,400
  2x P100                    3,500
  4x P100                    7,600
  DGX-1 (8x P100)           13,000   (170x faster vs. the CPU server)

AlexNet training, batch size 128, Grad Bit = 32; dual-socket E5-2699v4 CPUs (44 cores total); CNTK 2.0b3 (to be released) with cuDNN 5.1.8, NCCL 1.6.1, NVLink enabled.


Superior performance


CNTK Scalability

[Chart: ResNet50 scaling on DGX-1, tested by Nvidia, Dec. 2016.]


CNTK Other Advantages

• Python and C++ API
  • Mostly implemented in C++
  • Low-level plus high-level Python API

• Extensibility
  • User functions and learners in pure Python

• Readers
  • Distributed, highly efficient built-in data readers

• Internal == External


CNTK Other Advantages

• Keras interoperability
  • With the CNTK backend
  • Run models built for TF/Theano with CNTK (and vice versa)

• Demo: RL with DQN

• Binary evaluation
  • 10x speed-up in model execution
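For reference, selecting CNTK as the Keras backend is a one-line configuration. A minimal sketch, assuming Keras 2.x is installed alongside CNTK 2.x; the environment variable must be set before Keras is first imported:

import os
os.environ["KERAS_BACKEND"] = "cntk"   # must happen before the first Keras import

import keras   # Keras now builds and runs models on CNTK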


Reinforcement Learning with Flappy Bird


Binary evaluation

[Figure: side-by-side comparison of a Conv-Net and a Binary Conv-Net.]


Outline

• Overview

• CNTK introduction

• Symbolic loop

• Batch scheduling

• Data parallel training

• Educational resources

• Conclusions


What is CNTK?

• CNTK expresses (nearly) arbitrary neural networks by composing simple building blocks into complex computational networks, supporting the relevant network types and applications.


What is CNTK?

Example: 2-hidden-layer feed-forward NN (math on the left, CNTK Python on the right)

h1 = σ(W1 x + b1)                 h1 = sigmoid(x @ W1 + b1)
h2 = σ(W2 h1 + b2)                h2 = sigmoid(h1 @ W2 + b2)
P  = softmax(Wout h2 + bout)      P  = softmax(h2 @ Wout + bout)

with input x ∈ R^M and one-hot label y ∈ R^J,
and the cross-entropy training criterion

ce = y^T log P                    ce = cross_entropy(P, y)

Σ_corpus ce → max

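As a concrete companion to the formulas above, here is a minimal NumPy sketch of the same forward pass and criterion. It is deliberately not the CNTK API; the layer sizes are invented examples:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())              # shift for numerical stability
    return e / e.sum()

M, H, J = 784, 200, 10                   # input dim, hidden dim, classes (example sizes)
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((M, H)) * 0.01, np.zeros(H)
W2, b2 = rng.standard_normal((H, H)) * 0.01, np.zeros(H)
Wout, bout = rng.standard_normal((H, J)) * 0.01, np.zeros(J)

x = rng.standard_normal(M)               # input x in R^M
y = np.eye(J)[3]                         # one-hot label y in R^J

h1 = sigmoid(x @ W1 + b1)
h2 = sigmoid(h1 @ W2 + b2)
P = softmax(h2 @ Wout + bout)
ce = -(y * np.log(P)).sum()              # minimizing -y^T log P maximizes y^T log P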


What is CNTK?

[Figure: the computational graph. The input x feeds a (+, σ) pair parameterized by (W1, b1) producing h1, then a (+, σ) pair with (W2, b2) producing h2, then (+, softmax) with (Wout, bout) producing P; P and the label y feed cross_entropy, producing ce.]

h1 = sigmoid (x @ W1 + b1)
h2 = sigmoid (h1 @ W2 + b2)
P = softmax (h2 @ Wout + bout)
ce = cross_entropy (P, y)


What is CNTK?

[Figure: the same computational graph as before.]

• Nodes: functions (primitives)
  • can be composed into reusable composites

• Edges: values
  • incl. tensors, sparse

• Automatic differentiation
  • ∂F / ∂in = ∂F / ∂out ∙ ∂out / ∂in, applied per primitive (sketched below)

• Deferred computation (execution engine)

• Editable, clonable

LEGO-like composability allows CNTK to support a wide range of networks and applications.
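To make the automatic-differentiation bullet concrete, here is a small NumPy sketch of the reverse-mode chain rule for one sigmoid layer. It is illustrative only, with invented names, not CNTK internals:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward: out = sigmoid(x @ W + b). Suppose dF_dout is the gradient of the
# final criterion F w.r.t. this node's output, arriving from the nodes above
# as the graph is traversed in reverse.
rng = np.random.default_rng(1)
x, W, b = rng.standard_normal(4), rng.standard_normal((4, 3)), np.zeros(3)
out = sigmoid(x @ W + b)
dF_dout = rng.standard_normal(3)     # placeholder upstream gradient

# Backward: dF/din = dF/dout * dout/din, applied per primitive.
dout_dz = out * (1.0 - out)          # derivative of sigmoid at z = x @ W + b
dF_dz = dF_dout * dout_dz
dF_dW = np.outer(x, dF_dz)           # gradient w.r.t. the parameters W
dF_db = dF_dz                        # gradient w.r.t. the bias b
dF_dx = W @ dF_dz                    # gradient passed further down the graph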


CNTK Unique Features

• Symbolic loops over sequences with dynamic scheduling

• Turn graph into parallel program through minibatching

• Unique parallel training algorithms (1-bit SGD, Block Momentum)


Symbolic Loops over Sequential Data

Extend our example to a recurrent network (RNN):

h1(t) = σ(W1 x(t) + R1 h1(t−1) + b1)      h1 = sigmoid(x @ W1 + past_value(h1) @ R1 + b1)
h2(t) = σ(W2 h1(t) + R2 h2(t−1) + b2)     h2 = sigmoid(h1 @ W2 + past_value(h2) @ R2 + b2)
P(t)  = softmax(Wout h2(t) + bout)        P  = softmax(h2 @ Wout + bout)
ce(t) = L^T(t) log P(t)                   ce = cross_entropy(P, L)

Σ_corpus ce(t) → max

Note: the code on the right has no explicit notion of time; past_value(h1) symbolically refers to h1 one step earlier, and CNTK resolves the loop.


Compare to Id (Irvine Dataflow): [Arvind et al., TR114a, Dept ISC, UC Irvine, Dec 1978; "Executing a Program on the MIT Tagged-Token Dataflow Architecture", 1988]

@Function
def ip(a: Sequence[tensor], b: Sequence[tensor]):
    s0 = 0
    s_ = ForwardDeclaration()
    s = past_value(s_, initial_value=s0) + a * b
    s_.resolve_to(s)
    s = last(s)
    return s
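For intuition about what the symbolic loop computes, the recurrence of the first hidden layer can be unrolled explicitly. A NumPy sketch with toy dimensions (not the CNTK API, where the user never writes this loop):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

T, M, H = 5, 8, 4                        # sequence length, input dim, hidden dim (example sizes)
rng = np.random.default_rng(2)
x = rng.standard_normal((T, M))          # one input sequence
W1 = rng.standard_normal((M, H)) * 0.1
R1 = rng.standard_normal((H, H)) * 0.1
b1 = np.zeros(H)

# h1 = sigmoid(x @ W1 + past_value(h1) @ R1 + b1), unrolled over time:
h1 = np.zeros((T, H))
prev = np.zeros(H)                       # past_value yields 0 at the sequence start
for t in range(T):
    h1[t] = sigmoid(x[t] @ W1 + prev @ R1 + b1)
    prev = h1[t]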


Symbolic Loops over Sequential Data

[Figure: the recurrent computational graph, i.e. the feed-forward graph from before extended with delay nodes z⁻¹ that feed h1 and h2 back through R1 and R2.]

h1 = sigmoid(x @ W1 + past_value(h1) @ R1 + b1)
h2 = sigmoid(h1 @ W2 + past_value(h2) @ R2 + b2)
P = softmax(h2 @ Wout + bout)
ce = cross_entropy(P, L)

• CNTK automatically unrolls cycles at execution time
  • cycles are detected with Tarjan's algorithm
  • only the nodes inside cycles are unrolled

• efficient and composable

• cf. TensorFlow, where the user writes the time loop explicitly [https://www.tensorflow.org/versions/r1.0/tutorials/recurrent/index.html]:

lstm = rnn_cell.BasicLSTMCell(lstm_size)
state = tf.zeros([batch_size, lstm.state_size])
probabilities = []
loss = 0.0
for current_batch_of_words in words_in_dataset:
    output, state = lstm(current_batch_of_words, state)
    logits = tf.matmul(output, softmax_w) + softmax_b
    probabilities.append(tf.nn.softmax(logits))
    loss += loss_function(probabilities, target_words)


Batch-Scheduling of Variable-Length Sequences

• Minibatches containing sequences of different lengths are automatically packed and padded

• Batching is fully transparent; CNTK handles the special cases:
  • the past_value operation correctly resets state and gradient at sequence boundaries
  • non-recurrent operations just pretend there is no padding ("garbage-in/garbage-out")
  • sequence reductions mask out the padding

[Figure: a packed minibatch grid. Parallel sequences run down the rows; time steps run across the columns and are computed in parallel. Sequences 1-7 are packed into slots, with padding at the ends; scheduling a short sequence into a free slot of the same row may come for free.]

A packing-and-masking sketch follows.
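Here is the promised sketch of packing and masking in NumPy, a simplified illustration of the idea, not CNTK's actual internal layout:

import numpy as np

# Three sequences of different lengths, each a list of scalar "frames".
seqs = [np.array([1., 2., 3., 4.]), np.array([5., 6.]), np.array([7., 8., 9.])]

T = max(len(s) for s in seqs)            # minibatch width = longest sequence
batch = np.zeros((len(seqs), T))         # packed, padded minibatch
mask = np.zeros((len(seqs), T))          # 1 = real frame, 0 = padding
for i, s in enumerate(seqs):
    batch[i, :len(s)] = s
    mask[i, :len(s)] = 1.0

# Element-wise ops can ignore the padding (garbage-in/garbage-out) ...
squared = batch ** 2
# ... but reductions over time must mask it out:
seq_sums = (squared * mask).sum(axis=1)  # correct per-sequence sums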


Data-Parallel Training

• Degrees of parallelism:
  • Within-vector parallelization: "vectorized"
  • Across independent samples: "batching"
  • Across GPUs: async PCIe device-to-device transfers
  • Across servers: MPI etc., Nvidia NCCL

• Parallelization options:
  • Data-parallel
  • Model-parallel
  • Layer-parallel


Data-Parallel Training

• Data-parallelism: distribute each minibatch over workers, then all-reduce the partial gradients

[Figure: nodes 1-3 each compute partial gradients on their share of the minibatch; an all-reduce forms the sum Σ across nodes.]




• With the ring algorithm, each node communicates O(2 (K−1)/K · M) for K nodes and an M-parameter model, i.e. O(1) w.r.t. K (a data-parallel step built on all-reduce is sketched below)
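As an illustration of one data-parallel SGD step built on all-reduce, a minimal sketch using mpi4py (illustrative; CNTK's own implementation differs):

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
K = comm.Get_size()                      # number of workers

def local_gradient(params):
    # Placeholder: each worker computes a gradient on its minibatch slice.
    return np.random.default_rng(comm.Get_rank()).standard_normal(params.shape)

params = np.zeros(1000)
grad = local_gradient(params)

# Sum the partial gradients across all workers; every worker gets the result.
total = np.empty_like(grad)
comm.Allreduce(grad, total, op=MPI.SUM)

params -= 0.01 * (total / K)             # SGD step with the averaged gradient

Launched e.g. with mpiexec -n 4 python step.py, each rank contributes its partial gradient and all ranks apply the identical averaged update.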


Data-Parallel Training

• Data-parallelism: distribute the minibatch over workers, all-reduce the partial gradients

• O(1) w.r.t. K: is that enough?

• Example: DNN, minibatch size 1024, 160M model parameters
  • compute per minibatch: 1/7 second
  • communication per minibatch: 1/9 second (160M × 4 bytes = 640 MB over 6 GB/s)
  • we can't even parallelize to 2 GPUs: communication cost already dominates! (see the arithmetic check below)

• How about doing it asynchronously?
  • HogWild! [-], DistBelief ASGD [Dean et al., 2012]
  • helps with latency and jitter; a pipeline can hide some of the communication cost
  • does not change the problem fundamentally
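A quick arithmetic check of those numbers (assuming 4-byte float32 gradients and 6 GB/s effective bandwidth, as on the slide):

params = 160e6                 # model parameters
grad_bytes = params * 4        # float32 gradients: 640 MB
bandwidth = 6e9                # bytes/s over the bus
t_comm = grad_bytes / bandwidth
t_compute = 1 / 7              # seconds per minibatch (from the slide)
print(f"communication: {t_comm:.3f}s vs compute: {t_compute:.3f}s")
# communication: 0.107s vs compute: 0.143s. Two GPUs halve compute to 0.071s,
# which is already less than the 0.107s spent communicating.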


Data-Parallel Training

How to reduce communication cost: communicate less each time

• 1-bit SGD [F. Seide, H. Fu, J. Droppo, G. Li, D. Yu: "1-Bit Stochastic Gradient Descent ... Distributed Training of Speech DNNs", Interspeech 2014]
  • quantize gradients to 1 bit per value
  • trick: carry the quantization error over to the next minibatch

[Figure: nodes 1-3 exchange gradients that are 1-bit quantized with residual in both phases of the all-reduce.]
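A hedged NumPy sketch of the 1-bit idea with error feedback; this is one simplified variant of the quantizer, not CNTK's exact column-wise scheme:

import numpy as np

def one_bit_quantize(grad, residual):
    g = grad + residual                    # add back last step's quantization error
    bits = g >= 0                          # 1 bit per value: only the sign is sent
    # Reconstruction values: mean magnitude of each half, so the dequantized
    # gradient matches the quantized one on average.
    pos_mean = g[bits].mean() if bits.any() else 0.0
    neg_mean = g[~bits].mean() if (~bits).any() else 0.0
    deq = np.where(bits, pos_mean, neg_mean)
    residual = g - deq                     # carry the error to the next minibatch
    return bits, (pos_mean, neg_mean), residual

rng = np.random.default_rng(3)
residual = np.zeros(8)
for step in range(3):
    grad = rng.standard_normal(8)          # stand-in for a real gradient
    bits, scales, residual = one_bit_quantize(grad, residual)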


Data-Parallel Training

How to reduce communication cost:

Communicate less each time
• 1-bit SGD [F. Seide, H. Fu, J. Droppo, G. Li, D. Yu: "1-Bit Stochastic Gradient Descent ... Distributed Training of Speech DNNs", Interspeech 2014]
  • quantize gradients to 1 bit per value
  • trick: carry the quantization error over to the next minibatch

Communicate less often
• Automatic minibatch sizing [F. Seide, H. Fu, J. Droppo, G. Li, D. Yu: "On Parallelizability of Stochastic Gradient Descent ...", ICASSP 2014]
• Block Momentum [K. Chen, Q. Huo: "Scalable training of deep learning machines by incremental block training ...", ICASSP 2016]
  • very recent, very effective parallelization method
  • combines model averaging with the error-residual idea


Outline

• Overview

• CNTK introduction

• Symbolic loop

• Batch scheduling

• Data parallel training

• Educational resources

• Conclusions


Where to begin? Tutorials: https://www.cntk.ai/pythondocs/tutorials.html


Where to begin? Azure Notebooks: try the pre-hosted tutorials for free at https://notebooks.azure.com/cntk/libraries/tutorials


Outline

• Overview

• CNTK introduction

• Symbolic loop

• Batch scheduling

• Data parallel training

• Educational resources

• Conclusions


Conclusions

• CNTK: fast, scalable, and extensible

• CNTK unique features:
  • Symbolic loop
  • Batch scheduling
  • Data parallel training

• Educational resources
  • Azure Notebooks with CNTK (at no cost)
  • EdX "Deep Learning Explained" course

https://github.com/Microsoft/CNTK


Thanks!

https://github.com/Microsoft/CNTK


Incremental Block Training (IBT)

Data Partition

• Randomly partition the training dataset 𝒟 into S mini-batches: 𝒟 = {ℬᵢ | i = 1, 2, …, S}

• Group every τ mini-batches to form a split

• Group every N splits to form a data block

• The training dataset 𝒟 then consists of M data blocks, so S = M × N × τ

• The training dataset is processed block-by-block (a partitioning sketch follows)
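The partitioning can be pictured as a reshape. A toy NumPy sketch using the slide's symbols; the sizes are invented:

import numpy as np

tau, N, M = 2, 3, 4                 # minibatches per split, splits per block, blocks
S = M * N * tau                     # total minibatches

minibatch_ids = np.random.default_rng(4).permutation(S)    # random partition
splits = minibatch_ids.reshape(S // tau, tau)              # every tau minibatches form a split
blocks = splits.reshape(M, N, tau)                         # every N splits form a block

for t in range(M):                  # training proceeds block-by-block
    block = blocks[t]               # N splits, one per parallel worker (see IBPO below)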


Intra-Block Parallel Optimization (IBPO)

• Randomly select an unprocessed data block, denoted 𝒟ₜ

• Distribute the N splits of 𝒟ₜ to N parallel workers

• Starting from an initial model 𝑾_init(t), each worker optimizes its local model independently by one-sweep mini-batch SGD with the momentum trick

• Average the N optimized local models to get 𝑾(t)


Blockwise Model-Update Filtering (BMUF)

• Model-update resulting from data block 𝒟ₜ:
  G(t) = 𝑾(t) − 𝑾_init(t)

• Global model-update:
  Δ(t) = η_t ∙ Δ(t−1) + ς_t ∙ G(t)
  • ς_t: Block Learning Rate (BLR)
  • η_t: Block Momentum (BM)
  • with ς_t = 1 and η_t = 0, BMUF reduces to plain model averaging (MA)

• Update the global model:
  𝑾(t) = 𝑾(t−1) + Δ(t)

• Generate the initial model for the next data block:
  • Classical Block Momentum (CBM): 𝑾_init(t+1) = 𝑾(t)
  • Nesterov Block Momentum (NBM): 𝑾_init(t+1) = 𝑾(t) + η_{t+1} ∙ Δ(t)
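Putting the formulas together, one BMUF block update can be sketched in a few lines of Python, following the slide's equations; the names are illustrative, and local_models stands in for the N workers' IBPO results:

import numpy as np

def bmuf_step(W_prev, W_init, Delta_prev, local_models, eta, zeta, nesterov=True):
    W_avg = np.mean(local_models, axis=0)              # average of the N local models
    G = W_avg - W_init                                 # model-update from this block
    Delta = eta * Delta_prev + zeta * G                # blockwise filtering
    W = W_prev + Delta                                 # update the global model
    W_init_next = W + eta * Delta if nesterov else W   # NBM vs. CBM
    return W, Delta, W_init_next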


Iteration

• Repeat IBPO and BMUF until all data blocks are processed
  • this completes a so-called "sweep"

• Re-partition the training set for a new sweep and repeat the step above

• Repeat until a stopping criterion is satisfied
  • this yields the final global model 𝑾_final


Benchmark Result of Parallel Training on CNTK

[Chart: WER (%) of the CE-trained DNN vs. number of GPUs (1, 4, 8, 16, 32, 64) for 1-bit SGD and BMUF; essentially flat, all points between 12.9% and 13.3%.]

[Chart: WER (%) of the CE-trained LSTM vs. number of GPUs (1, 4, 8, 16, 32, 64) for 1-bit SGD and BMUF; essentially flat, all points between 10.5% and 11.1%.]

1-bit/BMUF speedup factors in LSTM training:

                4 GPUs   8 GPUs   16 GPUs   32 GPUs   64 GPUs
1-bit average     2.9      5.4       8.0        -         -
1-bit peak        3.3      6.7      10.8        -         -
BMUF average      3.7      6.9      13.8      25.5      43.7
BMUF peak         4.1      8.1      14.1      27.3      54.0

• Training data: 2,670 hours of speech from real traffic of VS, SMD, and Cortana

• About 16 and 20 days to train the DNN and the LSTM on 1 GPU, respectively

Credit: Yongqiang Wang, Kai Chen, Qiang Huo


Results

• Achievement: almost linear speedup without degradation of model quality

• Verified for training DNNs, CNNs, and LSTMs with up to 64 GPUs, on speech recognition, image classification, OCR, and click-prediction tasks

• Released in CNTK as a critical differentiator

• Used for enterprise-scale production data loads

• Adopted as a production tool at other companies, such as iFLYTEK and Alibaba
