GPUs for Deep Learning
Jerry Chen & Brent Oster
April 2015
The World Leader in Visual Computing
PC · Data Center · Mobile
Gaming · Design · Enterprise Virtualization · HPC & Cloud Service Providers · Autonomous Machines
Performance Continues to Accelerate
[Chart: peak double-precision GFLOPS, 2008-2014. NVIDIA GPUs (M1060, M2090, K20, K40, K80) vs. x86 CPUs (Westmere, Sandy Bridge, Ivy Bridge, Haswell).]
[Chart: peak memory bandwidth in GB/s, 2008-2014. NVIDIA GPUs (M1060, M2090, K20, K40, K80) vs. x86 CPUs (Westmere, Sandy Bridge, Ivy Bridge, Haswell).]
US to Build Two Flagship Supercomputers: Summit & Sierra (2017)
150-300 PFLOPS peak performance
IBM POWER9 CPU + NVIDIA Volta GPU
NVLink high-speed interconnect
40 TFLOPS per node, >3,400 nodes
A Brief History of CIFAR-10 (2010-2012)
10-class image classification problem; 60,000 32x32 images
Slide courtesy of Adam Coates, Baidu Research

RBM [Krizhevsky, '09]: 64.8%
3-way factored RBM [Ranzato et al., '10]: 65.3%
MC-RBM [Ranzato et al., '10]: 71.0%
Improved LCC [Yu & Zhang, '10]: 74.5%
Conv. RBM [Krizhevsky, '10]: 78.9%
K-means [Coates et al., '11]: 81.5%
Multi-column DNN [Ciresan et al., '12]: 88.8%
Natural image recognition
1.2M training images
1000 classes
ImageNet Large Scale Visual Recognition Challenge
http://www.image-net.org/challenges/LSVRC/
The ImageNet Large-Scale Visual Recognition Challenge started in 2010.
It is the best-known annual benchmark for image classification and object detection.
A classifier submits 5 predictions out of 1,000 categories; a classification counts as correct when any of the 5 guesses matches the ground-truth label.
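The top-5 scoring rule can be sketched in a few lines of numpy; the scores and labels below are synthetic, purely for illustration:

```python
# Illustrative sketch of ILSVRC top-5 scoring; scores and labels are synthetic.
import numpy as np

def top5_correct(scores, true_label):
    """True if true_label is among the 5 highest-scoring classes."""
    top5 = np.argsort(scores)[-5:]      # indices of the 5 largest scores
    return true_label in top5

# One image, 1,000 synthetic class scores; pretend class 42 is the truth.
rng = np.random.default_rng(0)
scores = rng.random(1000)
scores[42] = 2.0                        # force class 42 to rank first
print(top5_correct(scores, 42))         # True
```

Top-5 error is then simply the fraction of test images for which this check fails.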
ILSVRC Top-5 Classification Error [%]
ILSVRC 2010 (NEC):        28.20
ILSVRC 2011 (Xerox):      25.80
ILSVRC 2012 (AlexNet):    16.40
ILSVRC 2013 (Clarifai):   11.70
ILSVRC 2014 (GoogLeNet):   6.70
Jan 2015 (Baidu):          5.33
Feb 2015 (Microsoft):      4.94
Feb 2015 (Google):         4.82
Deep Learning & GPUs
Deep learning improves with scale: as data and compute grow, it keeps gaining accuracy where many previous methods plateau.
[Chart: performance vs. data & compute, from past through present to future. The deep learning curve keeps rising past the plateau of many previous methods.]
Slide courtesy of Adam Coates, Baidu Research
3 Drivers for Deep Learning
More Data · Better Models · Powerful GPU Accelerators
[Lee, Ranganath & Ng, 2007]
Why are GPUs good for deep learning?
GPUs deliver:
same or better prediction accuracy
faster results
smaller footprint
lower power
Neural networks and GPUs are well matched: both are inherently parallel, built around matrix operations, and hungry for FLOPS and memory bandwidth.
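The match can be made concrete: the core of a neural network layer is one dense matrix multiply over a mini-batch, exactly the FLOPS- and bandwidth-bound operation GPUs are built to saturate. A small numpy sketch, with sizes chosen arbitrarily for illustration:

```python
import numpy as np

# A fully connected layer over a mini-batch is one matrix multiply:
# every output neuron for every input can be computed in parallel.
rng = np.random.default_rng(0)
batch, n_in, n_out = 128, 4096, 4096
x = rng.standard_normal((batch, n_in)).astype(np.float32)   # activations
W = rng.standard_normal((n_in, n_out)).astype(np.float32)   # weights
b = np.zeros(n_out, dtype=np.float32)                       # biases

y = np.maximum(x @ W + b, 0)        # linear layer + ReLU

# FLOP count of the multiply: 2 * batch * n_in * n_out
flops = 2 * batch * n_in * n_out
print(y.shape, f"{flops / 1e9:.1f} GFLOP")   # (128, 4096) 4.3 GFLOP
```

Even this one modest layer is billions of floating-point operations per forward pass, which is why raw FLOPS and memory bandwidth dominate training throughput.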
DEEP LEARNING VISUALIZED
Example Use Cases
Image Classification, Object Detection & Localization
Face Recognition
Speech & Natural Language Processing
Medical Imaging & Interpretation
Seismic Imaging & Interpretation
Recommendation
Deep learning is revolutionizing medical research
Detecting Mitosis in Breast Cancer Cells (IDSIA)
Predicting the Toxicity of New Drugs (Johannes Kepler University)
Understanding Gene Mutation to Prevent Disease (University of Toronto)
Neuronal Tissue Segmentation
Reinforcement Learning
Building High-Level Features Using Large Scale Unsupervised
Learning
Q. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. Corrado, J. Dean, A. Ng
ICML 2012
Deep learning with COTS HPC systems
A. Coates, B. Huval, T. Wang, D. Wu, A. Ng, B. Catanzaro
ICML 2013
GOOGLE DATACENTER
1,000 CPU servers · 2,000 CPUs · 16,000 cores
600 kWatts
$5,000,000

STANFORD AI LAB
3 GPU-accelerated servers · 12 GPUs · 18,432 cores
4 kWatts
$33,000

"Now You Can Build Google's $1M Artificial Brain on the Cheap"
Unsupervised Learning
GPU-Accelerated deep learning
START-UPS
DIGITS: Deep GPU Training System for Data Scientists
Design DNNs
Visualize activations
Manage multiple trainings

Software stack:
User interface (DIGITS): configure DNN, process data, visualize layers, monitor progress
Frameworks: Caffe, Torch, Theano
Libraries: cuDNN, cuBLAS
CUDA
GPU hardware: single GPU, multi-GPU, GPU cluster, GPU HW cloud
DIGITS
Deep Learning GPU Training System
Who it is for
Deep learning researchers
Automotive
Medical Researchers
Defense
Intelligent Video Analytics
Web Companies
Startups
DIGITS Demo
Deep learning with cuDNN
cuDNN is a library of deep learning primitives.

Stack (bottom to top): GPUs (Tesla, TX-1, Titan) → cuDNN → Frameworks → Applications
cuDNN design goals
Basic deep learning subroutines: let users build DNN applications without writing any custom CUDA code
Flexible layout: handle any data layout
Memory-performance tradeoff: good performance with minimal memory, great performance with more memory
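One classic instance of that memory-performance tradeoff is lowering convolution to a matrix multiply: unrolling image patches (im2col) costs extra memory but lets the convolution run as a single fast GEMM. Below is a simplified 2-D numpy sketch of the idea, not cuDNN's actual implementation:

```python
import numpy as np

def im2col(img, k):
    """Unroll k x k patches of a 2-D image into rows (uses O(k^2) extra memory)."""
    h, w = img.shape
    out_h, out_w = h - k + 1, w - k + 1
    cols = np.empty((out_h * out_w, k * k), dtype=img.dtype)
    for i in range(out_h):
        for j in range(out_w):
            cols[i * out_w + j] = img[i:i+k, j:j+k].ravel()
    return cols

rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8)).astype(np.float32)
kern = rng.standard_normal((3, 3)).astype(np.float32)

# The convolution becomes one matrix-vector product over the unrolled patches.
out = (im2col(img, 3) @ kern.ravel()).reshape(6, 6)

# Direct sliding-window convolution for comparison.
ref = np.array([[(img[i:i+3, j:j+3] * kern).sum() for j in range(6)]
                for i in range(6)])
print(np.allclose(out, ref, atol=1e-4))   # True
```

This is the sense in which more memory can buy more speed: the unrolled patch matrix duplicates pixels, but GEMM kernels run far closer to peak FLOPS than a naive sliding window.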
cuDNN Version 2
Accelerates key routines to improve the performance of neural net training
Up to 1.8x faster on AlexNet than a baseline GPU implementation
New support for 3D convolutions
Integrated into all major deep learning frameworks: Caffe, Theano, Torch

[Chart: speedup over baseline GPU implementation. Caffe (GoogLeNet): 1.0x baseline, 1.6x with cuDNN. Torch (OverFeat): 1.0x baseline, 1.2x with cuDNN.]
Images Trained Per Day (Caffe AlexNet)
[Chart, millions of images trained per day: 16-core CPU (E5-2698 v3 @ 2.3 GHz / 3.6 GHz Turbo): 2.5M; GTX Titan: 18M; Titan Black with cuDNN v1: 23M; Titan X with cuDNN v2: 43M.]
TITAN X: The World's Fastest GPU
8 billion transistors · 3,072 CUDA cores · 7 TFLOPS SP / 0.2 TFLOPS DP · 12 GB memory
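The single-precision figure can be sanity-checked as cores × 2 FLOPs per fused multiply-add per clock × clock rate. In the sketch below the ~1.1 GHz boost clock is an assumption, not a figure from the slide:

```python
# Back-of-envelope peak-FLOPS check for a GPU (illustrative only).
cuda_cores = 3072
flops_per_core_per_clock = 2          # one fused multiply-add = 2 FLOPs
boost_clock_hz = 1.1e9                # assumed ~1.1 GHz boost clock

peak_sp = cuda_cores * flops_per_core_per_clock * boost_clock_hz
print(f"{peak_sp / 1e12:.1f} TFLOPS")   # 6.8 TFLOPS, close to the quoted 7
```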
Titan X for deep learning
[Chart: days to train AlexNet, from ~43 days on a 16-core Xeon CPU down to a matter of days on TITAN, TITAN Black with cuDNN, and TITAN X with cuDNN.]
GPUs for training
Workstation: 2x NVIDIA Tesla K40 accelerators, 2x CPUs, 64 GB system memory
Server: 4x NVIDIA Tesla K40/K80 accelerators, 2x CPUs, 256 GB system memory
Server upgrade options: 8x GPUs, or 6x GPUs + 2x IB FDR cards
GPUs for inference
Online classification ("commodity servers"): NVIDIA Tesla accelerators
Offline classification, embedded / mobile: NVIDIA TK1 / TX1
Pascal: Next-Generation Tesla GPU
Peak performance: >3 TeraFLOPS
Stacked memory: 4x higher bandwidth (~1 TB/s), larger capacity (16 GB); phenomenal memory bandwidth for applications
NVLink high-speed interconnect: 80 GB/s POWER CPU and GPU-to-GPU interconnect
Unified memory: single memory space, lower developer effort

NVLink: GPU high-speed interconnect
3D stacked memory: 4x higher bandwidth (~1 TB/s), 3x larger capacity, 4x more energy efficient per bit
Thank you!
Developer Zone: https://developer.nvidia.com/deeplearning
GPU Technology Conference: http://www.gputechconf.com/
cuDNN Download: https://developer.nvidia.com/cuDNN
DIGITS Download: https://developer.nvidia.com/digits
DIGITS Source: https://www.github.com/nvidia/digits