Microsoft Cognitive Toolkit
Outline
• Overview
• CNTK introduction
  • Symbolic loop
  • Batch scheduling
  • Data parallel training
• Educational resources
• Conclusions
Deep Learning at Microsoft
• Microsoft Cognitive Services
• Skype Translator
• Cortana
• Bing
• HoloLens
• Microsoft Research
ImageNet: Microsoft 2015 ResNet
ImageNet classification top-5 error (%):

ILSVRC 2010 (NEC America)   28.2
ILSVRC 2011 (Xerox)         25.8
ILSVRC 2012 (AlexNet)       16.4
ILSVRC 2013 (Clarifai)      11.7
ILSVRC 2014 (VGG)            7.3
ILSVRC 2014 (GoogLeNet)      6.7
ILSVRC 2015 (ResNet)         3.5
Microsoft took first place in all five categories in 2015: ImageNet classification,
ImageNet localization, ImageNet detection, COCO detection, and COCO segmentation.
Microsoft Cognitive Services
Image Similarity
Goal: given query image, find similar images.
• Customer: Anonymous ISV (Azure Partner)
• Task: given a retail image, find same product on competitor websites (to compare price).
• Existing solution: based solely on mining text information from the websites of Target, Macy's, etc.
• The customer asked for individual similarity measures (e.g. texture, neck style, etc.).
Bing / Bing Ads
Microsoft Translator http://translate.it
PowerPoint plug-in for translating speech to subtitles
YouTube link:
https://www.youtube.com/watch?v=eu9kMIeS0wQ;start=70
Microsoft Customer Support Agent
Microsoft Cognitive Toolkit (CNTK)
• Microsoft's open-source deep-learning toolkit
  • https://github.com/Microsoft/CNTK
• Created by Microsoft Speech researchers (Dong Yu et al.) in 2012, “Computational Network Toolkit”
• On GitHub since Jan 2016 under MIT license
• Python support since Oct 2016 (beta), rebranded as “Cognitive Toolkit”
• External contributions e.g. from MIT, Stanford and Nvidia
• Over 80% of Microsoft's internal deep-learning workload runs on CNTK
• First-class on Linux and Windows, with Docker support
• Python, C++, C#, Java, Keras
• Internal == External
CNTK Speed
                 Caffe        CNTK         MXNet        TensorFlow   Torch
FCN5 (1024)      55.329ms     51.038ms     60.448ms     62.044ms     52.154ms
AlexNet (256)    36.815ms     27.215ms     28.994ms     103.960ms    37.462ms
ResNet (32)      143.987ms    81.470ms     84.545ms     181.404ms    90.935ms
LSTM (256)       -            43.581ms     288.142ms    -            1130.606ms
  (v7 benchmark) -            (44.917ms)   (284.898ms)  (223.547ms)  (906.958ms)

Benchmarking by HKBU, version 8: http://dlbench.comp.hkbu.edu.hk/
Single Tesla K80 GPU, CUDA 8.0, cuDNN v5.1
Caffe 1.0rc5 (39f28e4), CNTK 2.0 Beta10 (1ae666d), MXNet 0.93 (32dc3a2), TensorFlow 1.0 (4ac9c09), Torch 7 (748f5e3)
“CNTK is production-ready: State-of-the-art accuracy, efficient, and scales to multi-GPU/multi-server.”
Theano only supports 1 GPU
Achieved with the 1-bit gradient quantization algorithm

[Chart: speed comparison (samples/second, higher = better) of CNTK, Theano, TensorFlow, Torch 7, and Caffe on 1 GPU, 1 x 4 GPUs, and 2 x 4 GPUs (8 GPUs); note: December 2015]
MICROSOFT COGNITIVE TOOLKIT
First Deep Learning Framework Fully Optimized for Pascal

Delivering near-linear multi-GPU scaling, AlexNet performance (images/sec):

Dual-socket CPU server       78
1x P100                   2,400
2x P100                   3,500
4x P100                   7,600
DGX-1 (8x P100)          13,000

170x faster vs. the CPU server. Superior performance.

AlexNet training, batch size 128, Grad Bit = 32, dual-socket E5-2699v4 CPUs (44 cores total)
CNTK 2.0b3 (to be released) includes cuDNN 5.1.8, NCCL 1.6.1, NVLink enabled
CNTK Scalability
ResNet50 on DGX-1, tested by Nvidia, Dec. 2016
CNTK Other Advantages
• Python and C++ API
  • Mostly implemented in C++
  • Low-level plus high-level Python API
• Extensibility
  • User functions and learners in pure Python (see the sketch below)
• Readers
  • Distributed, highly efficient built-in data readers
• Internal == External
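To make "user functions in pure Python" concrete, here is a hedged sketch based on CNTK's UserFunction extension API; MySigmoid is an illustrative op, not something from the slides:

import numpy as np
import cntk as C
from cntk.ops.functions import UserFunction

class MySigmoid(UserFunction):
    # a user-defined op in pure Python, with its own forward and backward
    def __init__(self, arg, name='MySigmoid'):
        super(MySigmoid, self).__init__([arg], name=name)

    def forward(self, argument, device=None, outputs_to_retain=None):
        y = 1.0 / (1.0 + np.exp(-argument))
        return y, y                            # (state kept for backward, output)

    def backward(self, state, root_gradients):
        y = state
        return root_gradients * y * (1.0 - y)  # chain rule through the sigmoid

    def infer_outputs(self):
        return [C.output_variable(self.inputs[0].shape, self.inputs[0].dtype,
                                  self.inputs[0].dynamic_axes)]

# usage: wrap it into a graph like any built-in op
x = C.input_variable(4)
f = C.user_function(MySigmoid(x))
print(f.eval({x: np.array([[0.0, 1.0, -1.0, 2.0]], dtype=np.float32)}))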
CNTK Other Advantages
• Keras interoperability
  • With the CNTK backend (see the sketch below)
  • Try models trained with TF/Theano in CNTK (and vice versa)
• Demo: reinforcement learning (DQN) playing Flappy Bird
• Binary evaluation
  • 10x speed-up in model execution
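A hedged sketch of the Keras interoperability: with Keras 2.x installed alongside CNTK, selecting the backend is a one-line configuration, and unchanged Keras code then runs on CNTK (the model below is illustrative):

import os
os.environ['KERAS_BACKEND'] = 'cntk'    # or set "backend": "cntk" in ~/.keras/keras.json
import keras                            # prints: Using CNTK backend
from keras.models import Sequential
from keras.layers import Dense

model = Sequential([Dense(64, activation='relu', input_shape=(784,)),
                    Dense(10, activation='softmax')])
model.compile(optimizer='sgd', loss='categorical_crossentropy')
# the same script runs unchanged with the TensorFlow or Theano backend,
# which is how models can be tried across toolkits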
Binary evaluation
[Demo: Conv-Net vs. Binary Conv-Net evaluation]
Outline
• Overview
• CNTK introduction
  • Symbolic loop
  • Batch scheduling
  • Data parallel training
• Educational resources
• Conclusions
What is CNTK?

• CNTK expresses (nearly) arbitrary neural networks by composing simple building blocks into complex computational networks, supporting relevant network types and applications.
What is CNTK?

Example: 2-hidden-layer feed-forward NN

h1 = σ(W1 x + b1)                     h1 = sigmoid (x @ W1 + b1)
h2 = σ(W2 h1 + b2)                    h2 = sigmoid (h1 @ W2 + b2)
P  = softmax(Wout h2 + bout)          P  = softmax (h2 @ Wout + bout)

with input x ∈ R^M and one-hot label y ∈ R^J,
and cross-entropy training criterion

ce = y^T log P                        ce = cross_entropy (P, y)

Σ_corpus ce → max
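As a concrete, hedged sketch (not from the slides), the same network written against CNTK 2.x's Python API; the dimensions M, H, J are illustrative, and cross_entropy_with_softmax is CNTK's fused softmax-plus-cross-entropy on logits:

import cntk as C

M, H, J = 784, 400, 10                      # input / hidden / output dims (illustrative)
x = C.input_variable(M)                     # input x in R^M
y = C.input_variable(J)                     # one-hot label y in R^J

W1   = C.parameter((M, H), init=C.glorot_uniform()); b1   = C.parameter(H)
W2   = C.parameter((H, H), init=C.glorot_uniform()); b2   = C.parameter(H)
Wout = C.parameter((H, J), init=C.glorot_uniform()); bout = C.parameter(J)

h1 = C.sigmoid(C.times(x, W1) + b1)         # h1 = sigmoid(x @ W1 + b1)
h2 = C.sigmoid(C.times(h1, W2) + b2)        # h2 = sigmoid(h1 @ W2 + b2)
z  = C.times(h2, Wout) + bout               # logits; softmax is folded into the loss
ce = C.cross_entropy_with_softmax(z, y)     # ce = cross_entropy(softmax(z), y)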
What is CNTK?

h1 = sigmoid (x @ W1 + b1)
h2 = sigmoid (h1 @ W2 + b2)
P  = softmax (h2 @ Wout + bout)
ce = cross_entropy (P, y)

[Figure: the corresponding computation graph; x flows through times/plus/σ nodes (with parameters W1, b1 and W2, b2) to h1 and h2, then through times/plus/softmax (Wout, bout) to P, which meets the label y in cross_entropy to produce ce]
What is CNTK?

• Nodes: functions (primitives)
  • Can be composed into reusable composites
• Edges: values
  • Incl. tensors, sparse
• Automatic differentiation
  • ∂F / ∂in = ∂F / ∂out ∙ ∂out / ∂in
• Deferred computation → execution engine
• Editable, clonable

LEGO-like composability allows CNTK to support a wide range of networks & applications
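Deferred computation means nothing runs until values are requested; continuing the hedged sketch above, forward evaluation and automatic differentiation are one call each:

import numpy as np

x_batch = np.random.rand(8, M).astype(np.float32)                    # 8 samples
y_batch = np.eye(J, dtype=np.float32)[np.random.randint(J, size=8)]  # one-hot labels

probs = C.softmax(z).eval({x: x_batch})                  # forward pass
grads = ce.grad({x: x_batch, y: y_batch}, wrt=[W1, b1])  # backward pass via autodiff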
CNTK Unique Features

• Symbolic loops over sequences with dynamic scheduling
• Turn graph into parallel program through minibatching
• Unique parallel training algorithms (1-bit SGD, Block Momentum)
Symbolic Loops over Sequential Data

Extend our example to a recurrent network (RNN):

h1(t) = σ(W1 x(t) + R1 h1(t-1) + b1)     h1 = sigmoid(x @ W1 + past_value(h1) @ R1 + b1)
h2(t) = σ(W2 h1(t) + R2 h2(t-1) + b2)    h2 = sigmoid(h1 @ W2 + past_value(h2) @ R2 + b2)
P(t)  = softmax(Wout h2(t) + bout)        P  = softmax(h2 @ Wout + bout)
ce(t) = L^T(t) log P(t)                   ce = cross_entropy(P, L)

Σ_corpus ce(t) → max

Note that the CNTK expressions have no explicit notion of time.

Compare to Id (Irvine Dataflow) [Arvind et al., TR 114a, Dept. ICS, UC Irvine, Dec. 1978; "Executing a Program on the MIT Tagged-Token Dataflow Architecture", 1988]:

@Function
def ip(a: Sequence[tensor], b: Sequence[tensor]):
    s0 = 0
    s_ = ForwardDeclaration()
    s = past_value(s_, initial_value=s0) + a * b
    s_.resolve_to(s)
    s = last(s)
    return s
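In the Python API the layers library packages this symbolic-loop pattern; a minimal hedged sketch, using an LSTM cell in place of the plain sigmoid recurrence above, with illustrative dimensions:

import cntk as C

M, H, J = 300, 256, 10
x = C.sequence.input_variable(M)     # a variable-length sequence of vectors

# Recurrence() declares the loop symbolically; CNTK schedules it at execution time
h = C.layers.Recurrence(C.layers.LSTM(H))(x)
P = C.layers.Dense(J, activation=C.softmax)(C.sequence.last(h))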
Symbolic Loops over Sequential Data

h1 = sigmoid(x @ W1 + past_value(h1) @ R1 + b1)
h2 = sigmoid(h1 @ W2 + past_value(h2) @ R2 + b2)
P  = softmax(h2 @ Wout + bout)
ce = cross_entropy(P, L)

[Figure: the same computation graph as before, now with z⁻¹ (delay) nodes feeding h1 and h2 back through R1 and R2]

• CNTK automatically unrolls cycles at execution time
  • cycles are detected with Tarjan's algorithm
  • only nodes in cycles are unrolled
  • efficient and composable
• cf. TensorFlow: [https://www.tensorflow.org/versions/r1.0/tutorials/recurrent/index.html]

lstm = rnn_cell.BasicLSTMCell(lstm_size)
state = tf.zeros([batch_size, lstm.state_size])
probabilities = []
loss = 0.0
for current_batch_of_words in words_in_dataset:
    output, state = lstm(current_batch_of_words, state)
    logits = tf.matmul(output, softmax_w) + softmax_b
    probabilities.append(tf.nn.softmax(logits))
    loss += loss_function(probabilities, target_words)
Batch-Scheduling of Variable-Length Sequences

• Minibatches containing sequences of different lengths are automatically packed and padded
• Fully transparent batching; CNTK handles the special cases:
  • the past_value operation correctly resets state and gradient at sequence boundaries
  • non-recurrent operations just pretend there is no padding ("garbage-in/garbage-out")
  • sequence reductions mask out the padding

[Figure: a minibatch laid out as parallel sequences (rows) by time steps computed in parallel (columns); sequences 1-7 are packed into the rows, short sequences share a row (scheduled into the same slot, it may come for free!), and the leftover cells are padding]
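What transparent batching means in practice, as a hedged sketch continuing the recurrent example above: sequences of different lengths go in as a plain list, and CNTK packs, pads, and unpads them internally:

import numpy as np

# three sequences with 7, 3, and 5 time steps; one eval() call batches them
seqs = [np.random.randn(t, M).astype(np.float32) for t in (7, 3, 5)]
out = h.eval({x: seqs})      # returns one output array per input sequence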
Data-Parallel Training

• Degrees of parallelism:
  • Within-vector parallelization: "vectorized"
  • Across independent samples: "batching"
  • Across GPUs: async PCIe device-to-device transfers
  • Across servers: MPI etc., NVIDIA NCCL
• Parallelization options:
  • Data-parallel
  • Model-parallel
  • Layer-parallel
Data-Parallel Training

• Data-parallelism: distribute the minibatch over workers, then all-reduce the partial gradients

[Figure: three nodes each compute a partial gradient for their share of the minibatch; an all-reduce sums (Σ) them across the nodes]

• ring algorithm: O(2 (K-1)/K ∙ M) communication
  • O(1) w.r.t. the number of workers K
Data-Parallel Training

• Data-parallelism: distribute the minibatch over workers, all-reduce the partial gradients
• Is O(1) communication cost enough?
  • Example: DNN, MB size 1024, 160M model parameters
  • compute per MB: 1/7 second
  • communication per MB: 1/9 second (640 MB at 6 GB/s)
  • can't even parallelize to 2 GPUs: the communication cost already dominates!
• How about doing it asynchronously?
  • Hogwild! [-], DistBelief ASGD [Dean et al., 2012]
  • Helps with latency and jitter; a pipeline can hide some communication cost
  • Does not change the problem fundamentally
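The arithmetic behind that claim, as a quick sketch (all values taken from the slide):

params    = 160e6             # model parameters
payload   = params * 4        # float32 gradients: ~640 MB exchanged per minibatch
bandwidth = 6e9               # assumed 6 GB/s interconnect

t_comm = payload / bandwidth  # ~0.107 s, i.e. about 1/9 s
t_comp = 1 / 7                # ~0.143 s compute per minibatch on one GPU
t_2gpu = t_comp / 2 + t_comm  # ~0.178 s: a 2-GPU step is slower than 1 GPU
print(f"comm={t_comm:.3f}s  compute={t_comp:.3f}s  2-GPU step={t_2gpu:.3f}s")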
Data-Parallel Training

How to reduce communication cost: communicate less each time

• 1-bit SGD: [F. Seide, H. Fu, J. Droppo, G. Li, D. Yu: "1-Bit Stochastic Gradient Descent...Distributed Training of Speech DNNs", Interspeech 2014]
  • Quantize gradients to 1 bit per value
  • Trick: carry over the quantization error to the next minibatch

[Figure: each node 1-bit-quantizes its partial gradient with residual before the all-reduce, and the aggregated update is again 1-bit quantized with residual before broadcast]

How to reduce communication cost: communicate less often

• Automatic MB sizing [F. Seide, H. Fu, J. Droppo, G. Li, D. Yu: "On Parallelizability of Stochastic Gradient Descent...", ICASSP 2014]
• Block momentum [K. Chen, Q. Huo: "Scalable training of deep learning machines by incremental block training...", ICASSP 2016]
  • Very recent, very effective parallelization method
  • Combines model averaging with the error-residual idea
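The core idea of 1-bit SGD in a hedged NumPy sketch; CNTK's actual implementation uses per-column reconstruction values and fuses quantization with the all-reduce, so this only illustrates the error-residual trick:

import numpy as np

def one_bit_quantize(grad, residual):
    g = grad + residual                  # add the error carried over from last step
    scale = np.mean(np.abs(g))           # one reconstruction value (simplified)
    q = np.where(g >= 0, scale, -scale)  # 1 bit per value: sign + shared magnitude
    return q, g - q                      # new residual = quantization error

residual = np.zeros(5)
for step in range(3):
    grad = np.random.randn(5)            # stand-in for a partial gradient
    q, residual = one_bit_quantize(grad, residual)   # q is what gets all-reduced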
Outline
• Overview
• CNTK introduction
  • Symbolic loop
  • Batch scheduling
  • Data parallel training
• Educational resources
• Conclusions
Where to begin?

Tutorials: https://www.cntk.ai/pythondocs/tutorials.html
Where to begin?

Azure Notebooks: try pre-hosted tutorials for free at https://notebooks.azure.com/cntk/libraries/tutorials
Outline
• Overview
• CNTK introduction
  • Symbolic loop
  • Batch scheduling
  • Data parallel training
• Educational resources
• Conclusions
Conclusions
• CNTK: fast, scalable, and extensible
• CNTK unique features:
  • Symbolic loop
  • Batch scheduling
  • Data parallel training
• Educational resources:
  • Azure Notebooks with CNTK (at no cost)
  • EdX "Deep Learning Explained" course
https://github.com/Microsoft/CNTK
Thanks!
https://github.com/Microsoft/CNTK
Incremental Block Training (IBT)

Data Partition
• Randomly partition the training dataset 𝒟 into 𝑆 mini-batches: 𝒟 = {ℬ𝑖 | 𝑖 = 1, 2, …, 𝑆}
• Group every 𝜏 mini-batches to form a split
• Group every 𝑁 splits to form a data block
• The training dataset 𝒟 thus consists of 𝑀 data blocks: 𝑆 = 𝑀 × 𝑁 × 𝜏
• The training dataset is processed block-by-block
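The same partition as a hedged sketch; minibatches is an illustrative list of S = M × N × τ minibatches:

import random

def partition_into_blocks(minibatches, tau, N):
    # randomly partition D, then group: tau minibatches -> split, N splits -> block
    random.shuffle(minibatches)
    splits = [minibatches[i:i + tau] for i in range(0, len(minibatches), tau)]
    blocks = [splits[j:j + N] for j in range(0, len(splits), N)]
    return blocks        # M data blocks, processed block-by-block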
Intra-Block Parallel Optimization (IBPO)
• Randomly select an unprocessed data block, denoted 𝒟𝑡
• Distribute the 𝑁 splits of 𝒟𝑡 to 𝑁 parallel workers
• Starting from an initial model 𝑾𝑖𝑛𝑖𝑡(𝑡), each worker optimizes its local model independently by one-sweep mini-batch SGD with the momentum trick
• Average the 𝑁 optimized local models to get 𝑾(𝑡)
Blockwise Model-Update Filtering (BMUF)

• Generate the model update resulting from data block 𝒟𝑡:
  𝑮(𝑡) = 𝑾(𝑡) − 𝑾𝑖𝑛𝑖𝑡(𝑡)
• Calculate the global model update:
  𝜟(𝑡) = 𝜂𝑡 ∙ 𝜟(𝑡−1) + 𝜍𝑡 ∙ 𝑮(𝑡)
  • 𝜍𝑡: block learning rate (BLR)
  • 𝜂𝑡: block momentum (BM)
  • When 𝜍𝑡 = 1 and 𝜂𝑡 = 0, BMUF reduces to model averaging (MA)
• Update the global model:
  𝑾(𝑡) = 𝑾(𝑡−1) + 𝜟(𝑡)
• Generate the initial model for the next data block:
  • Classical block momentum (CBM): 𝑾𝑖𝑛𝑖𝑡(𝑡+1) = 𝑾(𝑡)
  • Nesterov block momentum (NBM): 𝑾𝑖𝑛𝑖𝑡(𝑡+1) = 𝑾(𝑡) + 𝜂𝑡+1 ∙ 𝜟(𝑡)
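Both strategies are exposed through CNTK's distributed learners; a hedged sketch in which the model, schedules, and block_size are illustrative, and the script is meant to be launched under MPI (e.g. mpiexec -n 4 python train.py):

import cntk as C

# tiny stand-in model for the DNN/LSTM on the slides
x = C.input_variable(100); y = C.input_variable(10)
model = C.layers.Dense(10)(x)
ce = C.cross_entropy_with_softmax(model, y)

local = C.momentum_sgd(model.parameters,
                       lr=C.learning_rate_schedule(0.01, C.UnitType.minibatch),
                       momentum=C.momentum_schedule(0.9))

# communicate less each time: 1-bit gradient quantization with residual
one_bit = C.train.distributed.data_parallel_distributed_learner(
    local, num_quantization_bits=1)

# communicate less often: blockwise model-update filtering (block momentum)
bmuf = C.train.distributed.block_momentum_distributed_learner(
    local, block_size=120000)

trainer = C.Trainer(model, (ce, None), [bmuf])   # or [one_bit]; pick one strategy
# ... per-worker training loop over its splits ...
C.train.distributed.Communicator.finalize()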
Iteration
• Repeat IBPO and BMUF until all data blocks are processed
  • So-called "one sweep"
• Re-partition the training set for a new sweep, and repeat the above step
• Repeat until a stopping criterion is satisfied
  • Obtain the final global model 𝑾𝑓𝑖𝑛𝑎𝑙
Benchmark Result of Parallel Training on CNTK
[Chart: WER (%) of a CE-trained DNN with different # of GPUs (up to 64), comparing 1-bit and BMUF; WER stays essentially flat, between 13.0% and 13.3%]

[Chart: WER (%) of a CE-trained LSTM with different # of GPUs (up to 64), comparing 1-bit and BMUF; WER stays essentially flat, between 10.6% and 11.1%]

1-bit/BMUF speedup factors in LSTM training:

               4 GPUs   8 GPUs   16 GPUs   32 GPUs   64 GPUs
1bit-average   2.9      5.4      8.0       -         -
1bit-peak      3.3      6.7      10.8      -         -
BMUF-average   3.7      6.9      13.8      25.5      43.7
BMUF-peak      4.1      8.1      14.1      27.3      54.0
• Training data: 2,670 hours of speech from real traffic of VS, SMD, and Cortana
• Training takes about 16 days (DNN) and 20 days (LSTM) on a single GPU

Credit: Yongqiang Wang, Kai Chen, Qiang Huo
Results
• Achievement:
  • Almost linear speedup without degradation of model quality
  • Verified for training DNNs, CNNs, and LSTMs with up to 64 GPUs, on speech recognition, image classification, OCR, and click-prediction tasks
• Released in CNTK as a critical differentiator
• Used for enterprise-scale production workloads
• In production at other companies such as iFLYTEK and Alibaba