Upload
haque
View
239
Download
0
Embed Size (px)
Citation preview
DEEP LEARNING / ARTIFICIAL INTELLIGENCE FOR ACCELERATED DATA ANALYTICS
5th December, 2017
Dr. Ettikan Kandasamy Karuppiah,Director of Developers Ecosystem,Nvidia, South East Asia Region
2
“Find where I parked my car”
AI IS EVERYWHERE, TOUCHING OUR LIVES
“Find the bag I just saw in this magazine”
“What movie should I watch next?”
Bringing grandmother closer to family by bridging
language barrier
Predicting sick baby’s vitals like heart rate, blood pressure, survival rate
Enabling the blind to “see” their surrounding, read
emotions on faces
Increasing public safety with smart video surveillance at
airports & malls
Providing intelligent services in hotels, banks and stores
Separating weeds as it harvests, reduces chemical
usage by 90%
3
EVERY INDUSTRY HAS AWOKEN TO AI
2014 2016
1,549
19,439
Higher Ed
Telecom
Healthcare
Finance
Automotive
Others
Government
Developer Tools
Organizations engaged with NVIDIA on Deep Learning
4
“mobile computing, inexpensive sensors collecting terabytes of data, rise of
machine learning that can use that data will fundamentally change the way the
global economy is organized.”
— Fortune, CEO – The Revolution is Coming March 8, 2016
5
THE 4TH WAVEDigital Lifestyle Services (DLS) - Trillion Dollar Wave
3G
4G
2G
5G
Digital Lifestyle
Services• Video & Content• Cloud Services• Cyber Security• Financial Services• IoT – Smart Cities• IoT – Connected
Vehicles• AR / VR• Mobile Gaming• Many more…
Reference : http://www.chetansharma.com/
6
Internet Startups
7
Visualization of the activations of theconvolutions over a car damage image
Visualization of the layers of a neural network and the back propagation process
Neural Network Demo
https://www.galacticar.ai/ -> Galaxy.ai : Artificial Intelligence Driven Sedan Damage Estimator
8
Frontend
9
Backend
NVIDIA METROPOLIS —EDGE TO CLOUD AI CITY PLATFORM
Camera
CLOUDTraining and Inference
EDGE AND ON-PREMISESInference
DGX Appliance
Video recorder Server
JETSON TESLA/QUADRO
JETPACK, TENSOR RT, DEEPSTREAM
AI IS THE SOLUTION TO SELF-DRIVINGPerception Reasoning
HD Map Mapping
Driving
AI Computing
Unsupervised Learning on Unstructured Data
13
3000
AI MOMENTUM
TODAY BY 2020
AI startups
85% $ 47B 20%
of all customer service interactions will
be powered by AI bots
spend on AI technologies
of companies will dedicate workers to monitor and guide neural networks
14
POTENTIAL AREAS FOR DISRUPTIONTop 8 Technical Domains
• Many Network Element are 20 year old design. Some elements can be accelerated with GPU
• E.g. Deep Packet InspectionNetwork Function Acceleration
• Virtualization removes networking interfaces, x86 is sluggish vs ASICS / FPGA
• GPU could be very promising for certain . Session Border Controllers Virtual Network Function
Acceleration
• AR / VR media servers,
• UHD Transcoding, etcAcceleration of graphical /
Video intense workload
• Connected Cars, Industrial IoT, Smart Cities
• Video Analytics High Speed AI Inference
• Network analytics, Subscriber Analytics
• Realtime Campaign Management , Self-Optimizing Network (SON)AI based Automation
• Virtualization and microservices
• Mobile Edge Computing with 4G+ an 5GEdge Cloud Computing
• RF planning and Optimization
• Acceleration of Massive MIMO 5G
• AI based network securityCyber Security for Networks
15
Telco Usecase : Fraud Detection
16
Telco Usecase : Customer Churn Prediction
17
Telco Usecase : Malware Thread Detection
Deep Learning Neural Nets Are Effective Against AI MalwareFEBRUARY 5, 2016 BY JAMIE PHAN 0 COMMENTS
Baidu, Google, and Facebook are deeply invested in deep learning via
neural networks, or networks of hardware and software that
approximate the web of neurons in the human brain.
https://www.bizety.com/2016/02/05/deep-learning-neural-nets-are-effective-against-ai-malware/
…attacks (Advanced Persistent Threats). According to the
Verizon 2015 Data Breach report, a total of over 170
million malware events were reported, with 70-90% of the
malware samples given were unique to a threat
organization. Thus, finding the malware signatures alone
are becoming more ineffective…..
Deep Instinct, an Israeli startup, claims that they are the first to employ offer a deep learning- based AV
solution using ANNs. They use data fragmentation by breaking down objects into their smallest parts, and
applying deep learning to identify unknown suspicious behavior and detect, predict, and prevent known
and unknown threats.
18
1980 1990 2000 2010 2020
GPU-Computing perf
1.5X per year
1000X
by
2025
RISE OF GPU COMPUTING
Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte,
O. Shacham, K. Olukotun, L. Hammond, and C. Batten New plot and data collected
for 2010-2015 by K. Rupp
102
103
104
105
106
107
Single-threaded perf
1.5X per year
1.1X per year
APPLICATIONS
SYSTEMS
ALGORITHMS
CUDA
ARCHITECTURE
19
Deep Learning-Inferencing:TESLA V100 DELIVERS NEEDED RESPONSIVENESS WITH UP TO 99X MORE THROUGHPUT
0
1,000
2,000
3,000
4,000
5,000
CPU Server Tesla V100
ResNet-50
4,647 i/s@ 7mslatency
47 i/s@ 21msLatency
0
600
1,200
1,800
CPU Server Tesla V100
VGG-16
1,658i/s@ 7ms
Latency
23 i/s@ 43msLatency
0
500
1,000
1,500
2,000
2,500
3,000
3,500
CPU Server Tesla V100
GoogleNet
3,270 i/s@ 7ms
Latency
136 i/s@ 7ms
LatencyThro
ughput
(Im
ages/
Sec)
Thro
ughput
(Im
ages/
Sec)
Thro
ughput
(Im
ages/
Sec)
CPU: Xeon E5-2690 V4
GPU Servers: Dual Xeon E5-2690 [email protected] with 16GB PCIe GPUs configs as shownUbuntu 14.04.5, CUDA 9.0.103, cuDNN 7.0.1.13; NCCL 2.0.4, TensorRT pre-release, data set: ImageNet,GPU Optimal batch size used to achieve 7ms latency; CPU batch size reduced to 1 if latency exceeds 7ms
20
REVOLUTIONARY AI PERFORMANCE3X Faster DL Training Performance
Over 80x DL Training Performance in 3 Years
1x K80cuDNN2
4x M40cuDNN3
8x P100cuDNN6
8x V100cuDNN7
0x
20x
40x
60x
80x
100x
Q1
15
Q3
15
Q2
17
Q2
16
Googlenet Training Performance(Speedup Vs K80)
Speedup v
s K80
85% Scale-Out EfficiencyScales to 64 GPUs with Microsoft
Cognitive Toolkit
0 5 10 15
64X V100
8X V100
8X P100
Multi-Node Training with NCCL2.0(ResNet-50)
ResNet50 Training for 90 Epochs with 1.28M images dataset | Using Caffe2 | V100 performance measured on pre-production hardware.
1 Hour
7.4 Hours
18 Hours
3X Reduction in Time to Train Over P100
0 10 20
1XV100
1XP100
2XCPU
LSTM Training(Neural Machine Translation)
Neural Machine Translation Training for 13 Epochs |German ->English, WMT15 subset | CPU = 2x Xeon E5 2699 V4 | V100 performance
measured on pre-production hardware.
15 Days
18 Hours
6 Hours
21
VOLTA DELIVERS 3X MORE INFERENCE THROUGHPUT
Low Latency performance with V100 and TensorRT
Fuse LayersCompact
Optimize Precision(FP32, FP16, INT8)
3x more throughput at 7ms latency with V100 (ResNet-50)
TensorRT CompiledReal-timeNetwork
Trained Neural
Network
0
1,000
2,000
3,000
4,000
5,000
CPU Tesla P100(TensorFlow)
Tesla P100(TensorRT)
Tesla V100(TensorRT)
Thro
ughput
@ 7
ms
(Im
ages/
Sec)
33ms
CPU Server: 2X Xeon E5-2660 V4; GPU: w/P100, w/V100 (@150W) | V100 performance measured on pre-production hardware.
3X
10ms
7ms
7ms
22
DATACENTER IN A RACK1 Rack of Tesla P100 Delivers Performance of 6,000 CPUs
QUANTUM PHYSICSMILC
WEATHERCOSMO
DEEP LEARNINGCAFFE/ALEXNET
12 NODES in RACK8 P100s per Node
MOLECULAR DYNAMICSAMBER
# of Racks1 108642 . . .40 84
638 CPUs / 186 kW
650 CPUs / 190 kW
6,000 CPUs / 1.8 MW
2,900 CPUs / 850 kW
96 P100s / 38 kW
36 NODES in RACKDUAL CPUs Per Node
. . .
23
DEEP LEARNING FRAMEWORKS
VISION SPEECH BEHAVIOR
Object Detection Voice Recognition Language TranslationRecommendation
EnginesSentiment Analysis
DEEP LEARNING
cuDNN
MATH LIBRARIES
cuBLAS cuSPARSE
MULTI-GPU
NCCL
cuFFT
Mocha.jl
Image Classification
NVIDIA DEEP LEARNING SDKHigh Performance GPU-Acceleration for Deep Learning
24
GPU-ACCELERATION HAS NO LIMITSMapD
BlazeGraph
Kinetica
Leading In-Memory DB> 50x Slower
NoSQL DB’s> 100x Slower
Aggregate of queries - Time (s)Less is better!
SQream
1403
1843
700
GPUs 700X-800X faster
than graphs in all cases
700M Edges Single Node
Xeon 2650 vs 2 K80
1.98B Edges 16 EC2
r3.xlarge vs 16 K40s
1.98B Edges 16 EC2
r3.4xlarge vs 16 K40s2
1.98B Edges Spark CPU
Baseline
1
Speed-up over baseline spark CPU configuration
Speed-u
p (
hig
her
is f
ast
er)
25
GPU DATABASES ARE EVEN FASTER1.1 Billion Taxi Ride Benchmarks
21 30
1560
80 99
1250
150269
2250
372
696
2970
0
500
1000
1500
2000
2500
3000
3500
4000
4500
5000
MapD DGX-1 MapD 4 x P100 Redshift 6-node Spark 11-node
Query 1 Query 2 Query 3 Query 4
Tim
e in M
illise
conds
Source: MapD Benchmarks on DGX from internal NVIDIA testing following guidelines of Mark Litwintschik’s blogs: Redshift, 6-node ds2.8xlarge cluster & Spark 2.1, 11 x m3.xlarge cluster w/ HDFS @marklit82
10190 8134 19624 85942
26
TESLA PLATFORMLeading Data Center Platform for HPC and AI
TESLA GPU & SYSTEMS
NVIDIA SDK
INDUSTRY TOOLS
APPLICATIONS & SERVICES
C/C++
ECOSYSTEM TOOLS & LIBRARIES
HPC
+400 More Applications
cuBLAS
cuDNN TensorRT
DeepStreamSDK
FRAMEWORKS MODELS
Cognitive Services
AI TRAINING & INFERENCE
Machine Learning Services
ResNetGoogleNetAlexNet
DeepSpeechInceptionBigLSTM
DEEP LEARNING SDK
NCCL
COMPUTEWORKS
NVLINK SYSTEM OEM CLOUDTESLA GPU
27
Instant productivity — plug-and-play, supports every AI framework
Performance optimized across the entire stack
Docker/Container Development and Deployment model
Always up-to-date via the cloud
Mixed framework environments — containerized
Direct access to NVIDIA experts
DGX STACKFully integrated Deep Learning platform
NVIDIA GPU CLOUDGPU-ACCELERATED CLOUD PLATFORMOPTIMIZED FOR DEEP LEARNING
Containerized in NVDocker
Optimization Across the Full Stack
Always Up-to-Date
Fully Tested and Maintained by NVIDIA
Now Available with Volta GPUs on AWS
Sign up: www.nvidia.com/gpu-cloud
29
Thank You