GPUs for Deep Learning
Jerry Chen & Brent Oster
April 2015
The World Leader in Visual Computing
PC · Data Center · Mobile
Gaming · Design · Enterprise Virtualization · HPC & Cloud Service Providers · Autonomous Machines
Performance Continues to Accelerate
[Chart: peak double-precision GFLOPS, 2008-2014. NVIDIA GPUs (M1060, M2090, K20, K40, K80) vs. x86 CPUs (Westmere, Sandy Bridge, Ivy Bridge, Haswell).]
[Chart: peak memory bandwidth in GB/s, 2008-2014. NVIDIA GPUs (M1060, M2090, K20, K40, K80) vs. x86 CPUs (Westmere, Sandy Bridge, Ivy Bridge, Haswell).]
US to Build Two Flagship Supercomputers: Summit & Sierra (2017)
150-300 PFLOPS peak performance
IBM POWER9 CPU + NVIDIA Volta GPU
NVLink high-speed interconnect
40 TFLOPS per node, >3,400 nodes
A Brief History of CIFAR-10 (2010-2012)
10-class image classification problem; 60,000 32x32 images
Slide courtesy of Adam Coates, Baidu Research

RBM [Krizhevsky, '09]: 64.8%
3-way factored RBM [Ranzato et al., '10]: 65.3%
MC-RBM [Ranzato et al., '10]: 71.0%
Improved LCC [Yu & Zhang, '10]: 74.5%
Conv. RBM [Krizhevsky, '10]: 78.9%
K-means [Coates et al., '11]: 81.5%
Multi-column DNN [Ciresan et al., '12]: 88.8%
Natural image recognition
1.2M training images
1000 classes
ImageNet Large Scale Visual Recognition Challenge
http://www.image-net.org/challenges/LSVRC/
The ImageNet Large-Scale Visual Recognition Challenge started in 2010.
It is the best-known annual benchmark for image classification and object detection.
A classifier submits 5 predictions out of 1,000 categories; a classification counts as correct when any of the 5 guesses matches the ground-truth label.
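The top-5 scoring rule can be sketched in a few lines of numpy; the scores and labels below are synthetic, purely for illustration:

```python
# Illustrative sketch of ILSVRC top-5 scoring; scores and labels are synthetic.
import numpy as np

def top5_correct(scores, true_label):
    """True if true_label is among the 5 highest-scoring classes."""
    top5 = np.argsort(scores)[-5:]      # indices of the 5 largest scores
    return true_label in top5

# One image, 1,000 synthetic class scores; pretend class 42 is the truth.
rng = np.random.default_rng(0)
scores = rng.random(1000)
scores[42] = 2.0                        # force class 42 to rank first
print(top5_correct(scores, 42))         # True
```

Top-5 error is then simply the fraction of test images for which this check fails.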
ILSVRC Top-5 Classification Error [%]
ILSVRC 2010 (NEC):        28.20
ILSVRC 2011 (Xerox):      25.80
ILSVRC 2012 (AlexNet):    16.40
ILSVRC 2013 (Clarifai):   11.70
ILSVRC 2014 (GoogLeNet):   6.70
Jan 2015 (Baidu):          5.33
Feb 2015 (Microsoft):      4.94
Feb 2015 (Google):         4.82
Deep Learning & GPUs
Deep learning improves with scale: as data and compute grow, it keeps gaining accuracy where many previous methods plateau.
[Chart: performance vs. data & compute, from past through present to future. The deep learning curve keeps rising past the plateau of many previous methods.]
Slide courtesy of Adam Coates, Baidu Research
3 Drivers for Deep Learning
More Data · Better Models · Powerful GPU Accelerators
[Lee, Ranganath & Ng, 2007]
Why are GPUs good for deep learning?
GPUs deliver:
same or better prediction accuracy
faster results
smaller footprint
lower power
Neural networks and GPUs are well matched: both are inherently parallel, built around matrix operations, and hungry for FLOPS and memory bandwidth.
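The match can be made concrete: the core of a neural network layer is one dense matrix multiply over a mini-batch, exactly the FLOPS- and bandwidth-bound operation GPUs are built to saturate. A small numpy sketch, with sizes chosen arbitrarily for illustration:

```python
import numpy as np

# A fully connected layer over a mini-batch is one matrix multiply:
# every output neuron for every input can be computed in parallel.
rng = np.random.default_rng(0)
batch, n_in, n_out = 128, 4096, 4096
x = rng.standard_normal((batch, n_in)).astype(np.float32)   # activations
W = rng.standard_normal((n_in, n_out)).astype(np.float32)   # weights
b = np.zeros(n_out, dtype=np.float32)                       # biases

y = np.maximum(x @ W + b, 0)        # linear layer + ReLU

# FLOP count of the multiply: 2 * batch * n_in * n_out
flops = 2 * batch * n_in * n_out
print(y.shape, f"{flops / 1e9:.1f} GFLOP")   # (128, 4096) 4.3 GFLOP
```

Even this one modest layer is billions of floating-point operations per forward pass, which is why raw FLOPS and memory bandwidth dominate training throughput.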
DEEP LEARNING VISUALIZED
Example Use Cases
Image Classification, Object Detection & Localization
Face Recognition
Speech & Natural Language Processing
Medical Imaging & Interpretation
Seismic Imaging & Interpretation
Recommendation
Deep learning is revolutionizing medical research
Detecting Mitosis in Breast Cancer Cells (IDSIA)
Predicting the Toxicity of New Drugs (Johannes Kepler University)
Understanding Gene Mutation to Prevent Disease (University of Toronto)
Neuronal Tissue Segmentation
Reinforcement Learning
Building High-Level Features Using Large Scale Unsupervised
Learning
Q. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. Corrado, J. Dean, A. Ng
ICML 2012
Deep learning with COTS HPC systems
A. Coates, B. Huval, T. Wang, D. Wu, A. Ng, B. Catanzaro
ICML 2013
GOOGLE DATACENTER
1,000 CPU servers · 2,000 CPUs · 16,000 cores
600 kWatts
$5,000,000

STANFORD AI LAB
3 GPU-accelerated servers · 12 GPUs · 18,432 cores
4 kWatts
$33,000

"Now You Can Build Google's $1M Artificial Brain on the Cheap"
Unsupervised Learning
GPU-Accelerated deep learning
START-UPS
DIGITS: Deep GPU Training System for Data Scientists
Design DNNs
Visualize activations
Manage multiple trainings

Software stack:
User interface (DIGITS): configure DNN, process data, visualize layers, monitor progress
Frameworks: Caffe, Torch, Theano
Libraries: cuDNN, cuBLAS
CUDA
GPU hardware: single GPU, multi-GPU, GPU cluster, GPU HW cloud
DIGITS
Deep Learning GPU Training System
Who it is for
Deep learning researchers
Automotive
Medical Researchers
Defense
Intelligent Video Analytics
Web Companies
Startups
DIGITS Demo
Deep learning with cuDNN
cuDNN is a library of deep learning primitives.

Stack (bottom to top): GPUs (Tesla, TX-1, Titan) → cuDNN → Frameworks → Applications
cuDNN design goals
Basic deep learning subroutines: let users build DNN applications without writing any custom CUDA code
Flexible layout: handle any data layout
Memory-performance tradeoff: good performance with minimal memory, great performance with more memory
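One classic instance of that memory-performance tradeoff is lowering convolution to a matrix multiply: unrolling image patches (im2col) costs extra memory but lets the convolution run as a single fast GEMM. Below is a simplified 2-D numpy sketch of the idea, not cuDNN's actual implementation:

```python
import numpy as np

def im2col(img, k):
    """Unroll k x k patches of a 2-D image into rows (uses O(k^2) extra memory)."""
    h, w = img.shape
    out_h, out_w = h - k + 1, w - k + 1
    cols = np.empty((out_h * out_w, k * k), dtype=img.dtype)
    for i in range(out_h):
        for j in range(out_w):
            cols[i * out_w + j] = img[i:i+k, j:j+k].ravel()
    return cols

rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8)).astype(np.float32)
kern = rng.standard_normal((3, 3)).astype(np.float32)

# The convolution becomes one matrix-vector product over the unrolled patches.
out = (im2col(img, 3) @ kern.ravel()).reshape(6, 6)

# Direct sliding-window convolution for comparison.
ref = np.array([[(img[i:i+3, j:j+3] * kern).sum() for j in range(6)]
                for i in range(6)])
print(np.allclose(out, ref, atol=1e-4))   # True
```

This is the sense in which more memory can buy more speed: the unrolled patch matrix duplicates pixels, but GEMM kernels run far closer to peak FLOPS than a naive sliding window.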
cuDNN Version 2
Accelerates key routines to improve the performance of neural net training
Up to 1.8x faster on AlexNet than a baseline GPU implementation
New support for 3D convolutions
Integrated into all major deep learning frameworks: Caffe, Theano, Torch

[Chart: speedup over baseline GPU implementation. Caffe (GoogLeNet): 1.0x baseline, 1.6x with cuDNN. Torch (OverFeat): 1.0x baseline, 1.2x with cuDNN.]
Images Trained Per Day (Caffe AlexNet)
[Chart, millions of images trained per day: 16-core CPU (E5-2698 v3 @ 2.3 GHz / 3.6 GHz Turbo): 2.5M; GTX Titan: 18M; Titan Black with cuDNN v1: 23M; Titan X with cuDNN v2: 43M.]
TITAN X: The World's Fastest GPU
8 billion transistors · 3,072 CUDA cores · 7 TFLOPS SP / 0.2 TFLOPS DP · 12 GB memory
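The single-precision figure can be sanity-checked as cores × 2 FLOPs per fused multiply-add per clock × clock rate. In the sketch below the ~1.1 GHz boost clock is an assumption, not a figure from the slide:

```python
# Back-of-envelope peak-FLOPS check for a GPU (illustrative only).
cuda_cores = 3072
flops_per_core_per_clock = 2          # one fused multiply-add = 2 FLOPs
boost_clock_hz = 1.1e9                # assumed ~1.1 GHz boost clock

peak_sp = cuda_cores * flops_per_core_per_clock * boost_clock_hz
print(f"{peak_sp / 1e12:.1f} TFLOPS")   # 6.8 TFLOPS, close to the quoted 7
```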
Titan X for deep learning
[Chart: days to train AlexNet, from ~43 days on a 16-core Xeon CPU down to a matter of days on TITAN, TITAN Black with cuDNN, and TITAN X with cuDNN.]
GPUs for training
Workstation: 2x NVIDIA Tesla K40 accelerators, 2x CPUs, 64 GB system memory
Server: 4x NVIDIA Tesla K40/K80 accelerators, 2x CPUs, 256 GB system memory
Server upgrade options: 8x GPUs, or 6x GPUs + 2x IB FDR cards
GPUs for inference
Online classification ("commodity servers"): NVIDIA Tesla accelerators
Offline classification, embedded / mobile: NVIDIA TK1 / TX1
Pascal: Next-Generation Tesla GPU
Peak performance: >3 TeraFLOPS
Stacked memory: 4x higher bandwidth (~1 TB/s), larger capacity (16 GB); phenomenal memory bandwidth for applications
NVLink high-speed interconnect: 80 GB/s POWER CPU and GPU-to-GPU interconnect
Unified memory: single memory space, lower developer effort

NVLink: GPU high-speed interconnect
3D stacked memory: 4x higher bandwidth (~1 TB/s), 3x larger capacity, 4x more energy efficient per bit
Thank you!
Developer Zone: https://developer.nvidia.com/deeplearning
GPU Technology Conference: http://www.gputechconf.com/
cuDNN Download: https://developer.nvidia.com/cuDNN
DIGITS Download: https://developer.nvidia.com/digits
DIGITS Source: https://www.github.com/nvidia/digits