Tutorial: Deep Learning Implementations and Frameworks
Seiya Tokui*, Kenta Oono*, Atsunori Kanemura+, Toshihiro Kamishima+
*Preferred Networks, Inc. (PFN): {tokui,oono}@preferred.jp
+National Institute of Advanced Industrial Science and Technology (AIST): [email protected], [email protected]
Overview of this tutorial

• 1st session (KO, 8:30 ‒ 10:00)
  • Introduction
  • Basics of neural networks
  • Common design of neural network implementations
• 2nd session (ST, 10:30 ‒ 12:30)
  • Differences of deep learning frameworks
  • Coding examples of frameworks
  • Conclusion
Common Design of Deep Learning Frameworks
Kenta Oono <[email protected]>
Preferred Networks, Inc.
Objective of this part

• How deep learning frameworks represent various neural networks.
• How deep learning frameworks realize the training procedure of neural networks.
• The technology stack that is common to most deep learning frameworks.
Steps for training neural networks

1. Prepare the training dataset.
2. Initialize the neural network (NN) parameters.
3. Repeat until meeting some criterion:
   a. Prepare the next (mini)batch.
   b. Define how to compute the loss of this batch.
   c. Compute the loss (forward prop).
   d. Compute the gradient (backprop).
   e. Update the NN parameters.
4. Save the NN parameters.

A minimal sketch of this loop follows.
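The steps above, written out as a minimal NumPy sketch (a linear classifier trained with plain SGD; all shapes and hyperparameters here are illustrative, not taken from the tutorial):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(1000, 20).astype(np.float32)      # 1. prepare the training dataset
y = rng.randint(0, 3, size=1000)

W = 0.01 * rng.randn(20, 3).astype(np.float32)  # 2. initialize the NN parameters
b = np.zeros(3, dtype=np.float32)
lr, batch_size = 0.1, 32

for epoch in range(10):                         # 3. repeat until some criterion
    for i in range(0, len(X), batch_size):      # 3a. prepare the next mini-batch
        xb, yb = X[i:i + batch_size], y[i:i + batch_size]
        # 3b/3c. forward prop: softmax cross-entropy loss of this batch
        logits = xb.dot(W) + b
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        loss = -np.log(p[np.arange(len(yb)), yb]).mean()
        # 3d. backprop: gradient of the loss w.r.t. W and b
        g = p.copy()
        g[np.arange(len(yb)), yb] -= 1
        g /= len(yb)
        gW, gb = xb.T.dot(g), g.sum(axis=0)
        W -= lr * gW                            # 3e. update the NN parameters
        b -= lr * gb

np.savez('params.npz', W=W, b=b)                # 4. save the NN parameters
```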
Technology stack of DL framework

| name                                  | functions                                        | example                          |
|---------------------------------------|--------------------------------------------------|----------------------------------|
| Graphical interface                   |                                                  | DIGITS, TensorBoard              |
| Machine learning workflow management  | dataset management, training loop                | Keras, Lasagne, Blocks, TF Learn |
| Computational graph management        | build computational graph, forward prop/backprop | Theano, TensorFlow, Torch.nn     |
| Multi-dimensional array library       | linear algebra                                   | NumPy, CuPy, Eigen, torch (core) |
| Numerical computation package         | matrix operation, convolution                    | BLAS, cuBLAS, cuDNN              |
| Hardware                              |                                                  | CPU, GPU                         |
Neural Network as a Computational Graph

• In its simplest form, an NN is represented as a computational graph (CG): a bipartite DAG (directed acyclic graph) consisting of data nodes and operator nodes.

y = x1 * x2
z = y - x3

Graph: x1, x2 → [mul] → y;  y, x3 → [sub] → z
(x1, x2, x3, y, z are data nodes; mul and sub are operator nodes.)
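As a toy illustration (not the API of any particular framework), such a graph can be represented with two kinds of node objects:

```python
class DataNode:
    """Holds a value; produced by at most one operator node."""
    def __init__(self, value=None, creator=None):
        self.value = value
        self.creator = creator  # the operator node that produced this data node

class OpNode:
    """Connects input data nodes to an output data node."""
    def __init__(self, fn, inputs):
        self.fn, self.inputs = fn, inputs
        self.output = DataNode(creator=self)

    def forward(self):
        self.output.value = self.fn(*(x.value for x in self.inputs))
        return self.output

# The running example: y = x1 * x2, z = y - x3
x1, x2, x3 = DataNode(2.0), DataNode(3.0), DataNode(1.0)
y = OpNode(lambda a, b: a * b, [x1, x2]).forward()
z = OpNode(lambda a, b: a - b, [y, x3]).forward()
print(z.value)  # 5.0
```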
Example: Multi-layer Perceptron (MLP)

x → [Affine; W1, b1] → h1 → [ReLU] → a1 → [Affine; W2, b2] → h2 → [ReLU] → a2 → [Softmax] → y → [CrossEntropyLoss; t] → loss

• Whether the CG includes the weights and biases as data nodes is an implementation choice.

A sketch of this MLP in code follows.
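A sketch of this MLP in Chainer (one of the frameworks covered later); the API shown is Chainer v1-style, and the layer sizes 784/100/10 are assumptions for illustration:

```python
import chainer
import chainer.functions as F
import chainer.links as L

class MLP(chainer.Chain):
    def __init__(self):
        super(MLP, self).__init__(
            l1=L.Linear(784, 100),  # Affine with W1, b1 (sizes are assumptions)
            l2=L.Linear(100, 10),   # Affine with W2, b2
        )

    def __call__(self, x, t):
        a1 = F.relu(self.l1(x))                 # h1 = Affine(x); a1 = ReLU(h1)
        a2 = F.relu(self.l2(a1))                # h2 = Affine(a1); a2 = ReLU(h2)
        return F.softmax_cross_entropy(a2, t)   # Softmax + cross-entropy loss
```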
Example: Recurrent Neural Network (RNN)

h0 → [RNN Unit; x1] → h1 → [RNN Unit; x2] → h2 → ... → [RNN Unit; xT] → hT

The RNN unit can be:
• Affine + activation function
• LSTM (Long Short-Term Memory)
• GRU (Gated Recurrent Unit)

A single unit maps the input x_t and the previous state h_{t-1} to the next state h_t, using parameters W and b.
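A minimal NumPy sketch of the simplest unit (affine + activation); the weight layout here is one arbitrary choice:

```python
import numpy as np

def rnn_unit(x_t, h_prev, Wx, Wh, b):
    # Simplest RNN unit: affine transform of (x_t, h_{t-1}) + tanh activation
    return np.tanh(x_t.dot(Wx) + h_prev.dot(Wh) + b)

def rnn_forward(xs, h0, Wx, Wh, b):
    # Unroll over a sequence x_1 ... x_T starting from h_0;
    # each iteration corresponds to one RNN-unit node in the CG.
    h, hs = h0, []
    for x_t in xs:
        h = rnn_unit(x_t, h, Wx, Wh, b)
        hs.append(h)
    return hs
```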
Example: Stacked RNN

h0 → [RNN Unit; x1] → h1 → [RNN Unit; x2] → h2 → ... → hT
z0 → [RNN Unit; h1] → z1 → [RNN Unit; h2] → z2 → ... → zT → [Affine] → [Softmax] → y

A second RNN layer consumes the hidden states of the first layer; an affine layer and softmax on top produce the output y.
Example: RNN with control-flow nodes

The same RNN can be expressed as a cyclic graph with explicit control-flow nodes: a loop-enter node initializes the state s (containing h0, the input x, and a counter i); a predicate node computes pred from s; a switch node routes s to the RNN unit while pred = True, whose output s' passes through an update node back into the loop, and to a loop-end node that emits y once pred = False. The unit's parameters W and b live outside the loop.

• TensorFlow has control-flow nodes (e.g. cond, switch, while).
• Since the CG has a loop, some mechanism is necessary that resolves the dependencies between nodes to schedule the order of calculation.

A sketch using such control flow follows.
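For illustration, a sketch using tf.while_loop from TensorFlow's graph-mode Python API (TF 1.x-style, as current at the time); the tanh update is a stand-in for a real RNN unit:

```python
import tensorflow as tf

def dynamic_rnn_like(xs, h0):
    # xs: [T, batch, dim] input sequence; h0: [batch, dim] initial state.
    T = tf.shape(xs)[0]

    def pred(i, h):                  # predicate node: keep looping while i < T
        return i < T

    def body(i, h):                  # one iteration = one RNN-unit application
        h_next = tf.tanh(xs[i] + h)  # stand-in for RNNUnit(x_i, h)
        return i + 1, h_next

    # tf.while_loop builds loop-enter/switch/loop-end control-flow nodes.
    _, hT = tf.while_loop(pred, body, [tf.constant(0), h0])
    return hT
```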
Automatic Differentiation

• Computes the gradient of some specified data node (e.g. the loss) with respect to each other data node.
• Each operator node must have a backward operation that calculates gradients w.r.t. its inputs from gradients w.r.t. its outputs (the realization of the chain rule).
  • e.g. the Function class of Chainer has a backward method.
  • e.g. each layer class of Caffe has Backward_cpu and Backward_gpu methods.
  • e.g. Autograd has a thin wrapper that adds gradient methods as closures to most NumPy functions.

A toy example of such a backward method follows.
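What such a backward operation looks like for the mul operator of the running example, as a toy sketch (modeled loosely on Chainer's Function style, not the actual class):

```python
class Mul:
    """Toy operator node with forward and backward (chain-rule) methods."""

    def forward(self, x1, x2):
        self.x1, self.x2 = x1, x2  # retain inputs for the backward pass
        return x1 * x2

    def backward(self, gy):
        # Given the gradient gy w.r.t. the output y = x1 * x2,
        # return the gradients w.r.t. each input.
        return gy * self.x2, gy * self.x1
```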
Backprop through CG

y = x1 * x2
z = y - x3

Forward graph: x1, x2 → [mul] → y;  y, x3 → [sub] → z
Gradients flow in reverse: starting from ∇z z = 1 at the output, backprop passes through [sub] to obtain ∇y z, then through [mul] to obtain ∇x1 z (and likewise for the other inputs).
Backprop as extended graphs

y = x1 * x2
z = y - x3

Forward propagation: x1, x2 → [mul] → y;  y, x3 → [sub] → z
Backward propagation (the extension): dz → [id] → dy;  dz → [neg] → dx3;  dy, x2 → [mul] → dx1;  dy, x1 → [mul] → dx2

Instead of walking the graph backwards, the CG itself is extended with nodes that compute the gradients; running forward prop on the extended graph yields both z and all gradients.
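Written out via the chain rule (a worked check of the extended graph, in the ∇ notation used above):

```latex
\begin{align*}
\nabla_z z     &= 1 \\
\nabla_y z     &= \nabla_z z       && \text{(the id node, since } \partial z / \partial y = 1\text{)} \\
\nabla_{x_3} z &= -\nabla_z z      && \text{(the neg node, since } \partial z / \partial x_3 = -1\text{)} \\
\nabla_{x_1} z &= x_2\,\nabla_y z  && \text{(a mul node, since } \partial y / \partial x_1 = x_2\text{)} \\
\nabla_{x_2} z &= x_1\,\nabla_y z  && \text{(a mul node, since } \partial y / \partial x_2 = x_1\text{)}
\end{align*}
```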
Example: Theano
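A minimal sketch of the running example in Theano (the exact code shown in the tutorial may have differed):

```python
import theano
import theano.tensor as T

# The running example y = x1 * x2, z = y - x3 as a Theano computational graph.
x1, x2, x3 = T.scalars('x1', 'x2', 'x3')
y = x1 * x2
z = y - x3

# Theano builds the CG symbolically and compiles it into a callable.
f = theano.function([x1, x2, x3], z)
print(f(2.0, 3.0, 1.0))           # 5.0

# Automatic differentiation: T.grad extends the graph with gradient nodes.
gx1, gx2, gx3 = T.grad(z, [x1, x2, x3])
g = theano.function([x1, x2, x3], [gx1, gx2, gx3])
print(g(2.0, 3.0, 1.0))           # [3.0, 2.0, -1.0]
```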
Numerical optimizer

• Many gradient-based optimization algorithms are implemented.
• Stochastic Gradient Descent (SGD) is implemented in most DL frameworks.
• Which optimizer works best depends on the concrete task.

Notation: w are the parameters of the neural network, θ the state of the optimizer, L the loss function, and Γ an optimizer-specific update function.

initialize w, θ
until the criterion is met:
    get data (x, y)
    calculate ∇w L(x, y; w)
    w, θ ← Γ(w, θ, ∇w L)

A concrete choice of Γ is sketched below.
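One concrete Γ, momentum SGD, as a Python sketch (here the optimizer state θ is the velocity v; names and defaults are illustrative):

```python
def sgd_momentum(w, v, grad, lr=0.01, mu=0.9):
    """One concrete choice of Γ: momentum SGD.

    w: parameters, v: optimizer state (velocity), grad: ∇w L.
    Returns the updated pair (w, θ).
    """
    v = mu * v - lr * grad
    return w + v, v
```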
Serialization

• Save/load a snapshot of the training process in a specified format (e.g. HDF5, npz, protobuf):
  • the model being trained (= architecture and parameters of the NN);
  • the state of the training procedure (e.g. epoch, learning rate, momentum).
• Serialization enhances the portability of models:
  • publish pre-trained models (e.g. Model Zoo (Caffe), MXNet, TensorFlow);
  • import pre-trained models from other DL frameworks (e.g. Chainer supports the BVLC-official reference models of Caffe).

A small save/load sketch follows.
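A Keras-style sketch of parameter serialization (save_weights/load_weights write HDF5; the tiny model is only for illustration, and its sizes are assumptions):

```python
from keras.models import Sequential
from keras.layers import Dense

model = Sequential([Dense(10, input_dim=20, activation='softmax')])

model.save_weights('snapshot.h5')  # NN parameters -> HDF5 file
model.load_weights('snapshot.h5')  # restore into a model with the same architecture
```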
Computational optimizer

• Converts CGs into simplified, more efficient forms.
• e.g. Theano optimizes the graph of the running example (y = x1 * x2, z = y - x3) when compiling it; one documented optimization is sketched below.
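One optimization documented by Theano is rewriting log(1 + exp(x)) into a numerically stable softplus at compilation time; a sketch:

```python
import theano
import theano.tensor as T

x = T.vector('x')
naive = T.log(1 + T.exp(x))  # overflows for large x if evaluated literally

# Compiling triggers graph optimization; Theano is documented to rewrite
# this pattern into a stable form, so large inputs do not produce inf.
f = theano.function([x], naive)
print(f([0.0, 100.0]))       # ~[0.693, 100.0] rather than [0.693, inf]
```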
Abstraction of ML workflow

• Offers typical training/validation/evaluation procedures as APIs.
• Users call a single API and do not have to write the procedure manually.
• e.g. the fit and evaluate methods of the Model class in Keras; they wrap the entire training loop shown at the beginning of this part, as the sketch below shows.
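A sketch of this abstraction with Keras (the model, data, and hyperparameters are illustrative; some argument names differ across Keras versions):

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# A toy model; sizes and hyperparameters are arbitrary choices.
model = Sequential([
    Dense(100, input_dim=20, activation='relu'),
    Dense(3, activation='softmax'),
])
model.compile(optimizer='sgd', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

X = np.random.randn(1000, 20)
y = np.random.randint(0, 3, size=1000)

# One call runs the whole loop: batching, forward prop, backprop, updates.
# (The epoch-count argument is nb_epoch in Keras 1, epochs in Keras 2.)
model.fit(X, y, batch_size=32)
loss, acc = model.evaluate(X, y)
```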
Graphical interface

• Computational graph management: editor, visualizer.
• Visualization of the training procedure:
  • visualization of feature maps, outputs of NNs, etc.;
  • transition of error and accuracy.
• Performance monitoring (e.g. throughput, latency, memory usage).
GPU support

• CUDA: a computing platform for GPGPU on NVIDIA GPUs (language extension, compiler, libraries, etc.).
• DL frameworks provide wrappers for CUDA:
  • GPU-array libraries that utilize cuBLAS, cuRAND, etc.;
  • layer implementations backed by cuDNN (e.g. convolution, sigmoid, LSTM).
• Designed to make switching between CPU and GPU easy:
  • e.g. users can write CPU/GPU-agnostic code (see the sketch below);
  • e.g. switch CPU/GPU with environment variables.
• Some frameworks support OpenCL as a GPU environment, but CUDA is more popular for now.
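A sketch of the CPU/GPU-agnostic pattern using NumPy and CuPy; get_xp here is a hypothetical helper modeled on chainer.cuda.get_array_module:

```python
import numpy as np
try:
    import cupy as cp          # GPU array library with a NumPy-like API
except ImportError:
    cp = None

def get_xp(use_gpu):
    """Pick the array module; modeled on chainer.cuda.get_array_module."""
    return cp if (use_gpu and cp is not None) else np

xp = get_xp(use_gpu=True)
x = xp.arange(6, dtype=xp.float32).reshape(2, 3)
y = xp.tanh(x.dot(x.T))        # identical code runs on NumPy (CPU) or CuPy (GPU)
```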
Multi-dimensional array library (CPU / GPU)

• In charge of the concrete calculation on data nodes.
• Heavily depends on BLAS (CPU) or CUDA / the CUDA Toolkit (GPU).
• CPU
  • third-party library: Eigen::Tensor, NumPy;
  • written from scratch: ND4J (DL4J), mshadow (MXNet).
• GPU
  • third-party library: Eigen::Tensor, PyCUDA, gpuarray;
  • written from scratch: ND4J (DL4J), mshadow (MXNet), CuPy (Chainer).
Which device to use?

• GPU is (by far) faster than CPU in most cases.
  • Most tensor calculations consist of element-wise operations, matrix multiplications, and convolutions.
• Exceptional cases:
  • the mini-batch technique is difficult to apply (e.g. variable-length training data, or NN architectures that depend on each training example);
  • the GPU calculation cannot hide the cost of transferring data to the GPU (e.g. the mini-batch size is too small).
Technology stack of Chainer

• Machine learning workflow / computational graph management: Chainer (calls cuDNN directly for GPU layer implementations)
• Multi-dimensional array library: NumPy (CPU), CuPy (GPU)
• Numerical computation package: BLAS (CPU); cuBLAS, cuRAND (GPU)
• Hardware: CPU, GPU
Technology stack of TensorFlow

• Graphical interface: TensorBoard
• Machine learning workflow management: TF Learn
• Computational graph management: TensorFlow (calls cuDNN for GPU layer implementations)
• Multi-dimensional array library: Eigen::Tensor
• Numerical computation package: BLAS (CPU); cuBLAS, cuRAND (GPU)
• Hardware: CPU, GPU
Technology stack of Theano

• Computational graph management: Theano
• Multi-dimensional array library: NumPy (CPU), libgpuarray (GPU)
• Numerical computation package: BLAS (CPU); CUDA, OpenCL, CUDA Toolkit (GPU)
• Hardware: CPU, GPU
Technology stack of Keras

• Machine learning workflow management: Keras
• Below Keras sits the technology stack of either Theano or TensorFlow (Keras runs on both as backends).
Summary

• Most DL frameworks have many components in common and can be organized into a similar technology stack.
  • At the upper layers of the stack, frameworks are designed to help users follow typical ML workflows.
  • At the middle layers, manipulations of computational graphs are automated.
  • At the lower layers, optimized tensor calculations are implemented.
• How these components are realized differs between frameworks, as we will see in the following part.
Memorandum
Training of Neural Networks

e.g. a classification problem:

argmin_w Σ_{(x, y)} L(x, y; w)

where w are the parameters, x a feature vector, y a training label, and L the loss function.

• L is designed so that its value gets smaller as the prediction becomes more accurate.
• In the deep learning context:
  • L is represented by a neural network;
  • w are the parameters of the neural network.
Layer = function + data nodes

• Layers (e.g. fully connected layers, convolutional layers) can be considered functions with parameters to be optimized.
• In most modern frameworks, the parameters of layers are themselves data nodes in the computational graph.
• The framework needs to distinguish which data nodes are parameters to be optimized and which are data points.
Execution Engine

• Calculates the dependencies between data nodes and schedules the execution of parts of the computational graph (especially in multi-node or multi-GPU settings).