Upload
others
View
5
Download
0
Embed Size (px)
Citation preview
François FayardBAYNCORE
November 29, 2017Paris
Intel Nervanasoftware stack
libraries Intel® Math Kernel Library (MKL, MKL-DNN)
platforms
Frameworks
Intel® Data Analytics Acceleration Library
(DAAL)
hardwareMemory & Storage NetworkingCompute
Intel® Python Distribution
Mllib BigDL
Intel® Nervana™ Graph*
experiences
Intel® Nervana™ Cloud & System
Intel® Nervana™ portfolio
Intel® Nervana™ Deep Learning StudioIntel® Computer
Vision SDKMovidius™
Technology
*FutureOther names and brands may be claimed as the property of others.
*
2
libraries Intel® Math Kernel Library (MKL, MKL-DNN)
Frameworks
Intel® Data Analytics Acceleration Library
(DAAL)Intel® Python Distribution
Mllib BigDL
Intel® Nervana™ Graph*
Intel® Parallel Studio
*Other names and brands may be claimed as the property of others.
Compiler & Libraries
Intel® Math Kernel Library (MKL)
Intel® Integrated Performance Primitives (IPP)
Fortran & C/C++ Compilers
Intel® Threading Building Blocks (TBB)
Intel® MPI
Profiling Tools
Intel® Advisor
Intel® Inspector Intel® VTuneAmplifier
Intel® Trace Analyzer and Collector
Commercial Products
3
Software and Hardware Innovation
4
Focus: Open Source
Long-term commitment to Open Source Software (OSS)• Linux OS (not just drivers): Intel contributes and ranks in top-10 since years
• Other contributions: compiler enabling (GCC), math library support (SVML), language standards (OpenMP: iOMP Clang), in general: 01.org
… but Intel software product teams are committed as well• Free-of-charge Intel MKL (supported by user forum), which is also for commercial use (commercial
Premier support)
• Open Source Software: Intel TBB, Intel MKL-DNN
5
Open Source Software at Intel
Stencils• Implied embedding variables as constants, etc.
• Example: Advantages for Stencil Computation
Stencil (access-)pattern “baked” into code
Grid bounds are known constants
Code unrolling without remainder
[…]
Code specialization is an effective optimization!• Small-size problems can be effectively hard-coded; no upfront code generation needed (may also
avoid control-flow to select cases)
• Perfect match for unpredictable use cases (scripting framework, etc.)
6
Open Source Software Innovation ExampleJIT Code Specialization
Direct convolutions are more efficient than a calculation in the frequency space(There are cross-over points: direct incl. Winograd, FFT)
• Exploiting the memory bandwidth, caches, and register file is key
• ML frameworks are usually script-enabled (runtime dynamic)
• Network topology (graph structure) determines schedule,code fusion, etc.
JIT code generation: ideal for small convolutions and graph-based structure• Can further exploit topology (code fusion, etc.)
Significant advantage with JIT code generation
7
Open Source Software Innovation ExampleConvolutional Neural Networks and JIT
Innovation happens to a large extent at the ISA level• Domain specific ISA extensions (supported only at intrinsic level)
Examples: CRC32 (SSE 4.2), AES-NI, SHA,F16 LD/ST (F16C), TSX
• General (compiler support i.e., regularly generated from high-level lang.)Examples: AVX, AVX2, AVX-512
The Machine Learning domain for Intel Architecture is no exception.• Lower precision computation (fixed, variable width, etc.)
8
What about Hardware Innovation? ExampleISA Specialization
• 2x the flops by using INT16 inputs
• Similar accuracy as SP by using INT32 accumulated output
9
Variable Precision Instructions (VNNI*)
a1 a0
b1 b0
16b 16b
c032b
c0=c0+a0*b0+a1*b1
32b
Int16 inputs
Int32 accumulators
• 2x VNNI operations per port
• 4x ML performance than regular AVX512-SP
VNNI QVNNI
INT16 to INT32
QVNNI
2x 64 = 128 VNNI iops 2x 64 = 128 VNNI iops
* See Intel Architecture Reference Manual (“Icelake New Instructions”)
libraries Intel® Math Kernel Library (MKL, MKL-DNN)
platforms
Frameworks
Intel® Data Analytics Acceleration Library
(DAAL)
hardwareMemory & Storage NetworkingCompute
Intel® Python Distribution
Mllib BigDL
Intel® Nervana™ Graph*
experiences
Intel® Nervana™ Cloud & System
Intel® Nervana™ portfolio
Intel® Nervana™ Deep Learning StudioIntel® Computer
Vision SDKMovidius™
Technology
*FutureOther names and brands may be claimed as the property of others.
*
10
• 2nd generation open source machine learning framework from Google*
• Widely used in Google’s applications: search, Gmail, photos, translate, etc.
• Core system provides set of key computational kernel, extendable user ops
• Core in C++, front end wrapper is in python specifies/drives computation
• Multi-node support based on GRPC protocol (MPI added more recently)
• Own threading runtime (not OpenMP, TBB, etc.)
11
TensorFlow: Quick Summary and History
© 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.*Other names and brands may be claimed as the property of others.
Data layout matters: CNNs run over batches (N: batch size) of multi-channel (C) images e.g., RGBA images with certain convolution kernel sizes (WxH)
• Intel’s MKL imply NHCW format by default
• TF without MKL uses NHWC format
An “optimal format” (HW-specific) eventually leads to (frequent) conversions in TF’s graph due to data-reordering between “native” and “optimal” tensor/data format
• Low-level frameworks are caused to optimize end-to-end
12
Algorithmic and Optimization Challenges
Convolution ConvolutionMax PoolNative to MKL layout
MKL layout to Native
Native to MKL layout
MKL layout to Native
Hardware-Specific Transformers
HardwareAgnostic Intel® Nervana™ Graph
Intel® Nervana™ Graphenables optimizations that are applicable across multiple HW targets.
• Efficient buffer allocation• Training vs inference optimizations• Efficient scaling across multiple nodes• Efficient partitioning of subgraphs• Compounding of ops
CustomerSolutions Neon Solutions
Neon ModelsCustomer Models
Neon Deep Learning FunctionsCustomer Algorithms
The Intel® Nervana™ Graph will scale performance across hundreds of machine and deep learning frameworks
intel® Nervana™ graphHigh-Performance Execution Graph for Neural Networks
intel® Nervana™ graphHigh-Performance Execution Graph for Neural Networks
Models
Hardware
Frameworks TensorFlowCaffe{2} Torch
Lake Crest
neon
Intel® Xeon® & Xeon Phi™
Integrated Graphics
Nervana Graph
FPGA
…CNTK
Use Cases
Movidius
Libraries For AI
15
Focus: Low-Level Primitives
Intel® Math Kernel Library Compute Library for Deep Neural
Networks (clDNN)
Intel® Distribution
Intel® Data Analytics
Acceleration Library (DAAL)
Intel® Nervana Graph
Intel® MKL-DNN MKL
High Level
Overview
High-Performance execution graph for
deep learning frameworks
Open source DNN functions for high-velocity integration with deep learning
frameworks
High performance math primitives
granting low level of control
Open source DNN functions library for
accelerating inference on Intel GPUs
Most popular and fastest growing
language for machine learning
Broad data analytics acceleration object
oriented library supporting
distributed ML at the algorithm level
Primary Audience
Advanced Data Scientists or Framework Developers enhancing
frameworks
Developers of the next generation of
deep learning frameworks
Developers of higher level libraries
and Applications
Developers that tune deep learning
workloads to the kernel level
Application Developers and Data
Scientists
Wider Data Analytics and ML audience,
Algorithm level development for all
stages of data analytics
Example Usage
Customization of neural nets scaled
across various hardware
New framework with functions
developers call for max CPU
performance
Framework developers call
matrixmultiplication,
convolution functions
Custom tuning DL model for application
for convolution
Call scikit-learnk-means function for
credit card fraud detection
Call distributed alternating least
squares algorithm for a recommendation
system
Find out more at software.intel.com/ai
Used for Classical Machine Learning
Ai libraries & APIs
Intel® Math Kernel Library Compute Library for Deep Neural
Networks (clDNN)
Intel® Distribution
Intel® Data Analytics
Acceleration Library (DAAL)
Intel® Nervana Graph
Intel® MKL-DNN MKL
High Level
Overview
High-Performance execution graph for
deep learning frameworks
Open source DNN functions for high-velocity integration with deep learning
frameworks
High performance math primitives
granting low level of control
Open source DNN functions library for
accelerating inference on Intel GPUs
Most popular and fastest growing
language for machine learning
Broad data analytics acceleration object
oriented library supporting
distributed ML at the algorithm level
Primary Audience
Advanced Data Scientists or Framework Developers enhancing
frameworks
Developers of the next generation of
deep learning frameworks
Developers of higher level libraries
and Applications
Developers that tune deep learning
workloads to the kernel level
Application Developers and Data
Scientists
Wider Data Analytics and ML audience,
Algorithm level development for all
stages of data analytics
Example Usage
Customization of neural nets scaled
across various hardware
New framework with functions
developers call for max CPU
performance
Framework developers call
matrixmultiplication,
convolution functions
Custom tuning DL model for application
for convolution
Call scikit-learnk-means function for
credit card fraud detection
Call distributed alternating least
squares algorithm for a recommendation
system
Find out more at software.intel.com/ai
Used for Classical Machine Learning
Ai libraries & APIsBeta
Beta
Intel® Math Kernel Library Compute Library for Deep Neural
Networks (clDNN)
Intel® Distribution
Intel® Data Analytics
Acceleration Library (DAAL)
Intel® Nervana Graph
Intel® MKL-DNN MKL
High Level
Overview
High-Performance execution graph for
deep learning frameworks
Open source DNN functions for high-velocity integration with deep learning
frameworks
High performance math primitives
granting low level of control
Open source DNN functions library for
accelerating inference on Intel GPUs
Most popular and fastest growing
language for machine learning
Broad data analytics acceleration object
oriented library supporting
distributed ML at the algorithm level
Primary Audience
Advanced Data Scientists or Framework Developers enhancing
frameworks
Developers of the next generation of
deep learning frameworks
Developers of higher level libraries
and Applications
Developers that tune deep learning
workloads to the kernel level
Application Developers and Data
Scientists
Wider Data Analytics and ML audience,
Algorithm level development for all
stages of data analytics
Example Usage
Customization of neural nets scaled
across various hardware
New framework with functions
developers call for max CPU
performance
Framework developers call
matrixmultiplication,
convolution functions
Custom tuning DL model for application
for convolution
Call scikit-learnk-means function for
credit card fraud detection
Call distributed alternating least
squares algorithm for a recommendation
system
Find out more at software.intel.com/ai
Used for Classical Machine Learning
Ai libraries & APIs
Open Source
Open Source
Open Source
Open Source
Open Source
Fast, Scalable Code with Intel® Math Kernel Library (Intel® MKL)
19
Highly optimized, threaded, & vectorized math functions that maximize performance on each processor family
Utilizes industry-standard C and Fortran APIs for compatibility with popular BLAS, LAPACK, and FFTW functions—no code changes required
Dispatches optimized code for each processor automatically without the need to branch code
What’s New in the 2018 edition Improved small matrix multiplication performance
in GEMM & LAPACK
Improved ScaLAPACK performance for distributed computation
24 new vector math functions (“VML domain”)
Simplified license for easier adoption & redistribution
Additional distributions via YUM, APT-GET, & CondaLearn More: software.intel.com/mkl
®
Fast, Scalable Code with Intel® Math Kernel Library (Intel® MKL)
20
Highly optimized, threaded, & vectorized math functions that maximize performance on each processor family
Utilizes industry-standard C and Fortran APIs for compatibility with popular BLAS, LAPACK, and FFTW functions—no code changes required
Dispatches optimized code for each processor automatically without the need to branch code
What’s New in the 2018 edition Improved small matrix multiplication performance
in GEMM & LAPACK
Improved ScaLAPACK performance for distributed computation
24 new vector math functions (“VML domain”)
Simplified license for easier adoption & redistribution
Additional distributions via YUM, APT-GET, & CondaLearn More: software.intel.com/mkl
®
Distribution Details Open Source
Apache 2.0 License
Common DNN APIs across all Intel hardware.
Rapid release cycles, iterated with the DL community, to best support industry framework integration.
Highly vectorized & threaded for maximal performance, based on the popular Intel® MKL library.
For developers of deep learning frameworks featuring optimized performance on Intel hardware
github.com/01org/mkl-dnn
Direct 2D Convolution
Rectified linear unit neuron activation
(ReLU)Maximum
pooling Inner productLocal response normalization
(LRN)
Intel® MATH Kernel Library-dnnMath Kernel Library for Deep Neural Networks
Examples:
22
Set of functions (primitives) commonly used in DNN topologies (CNN)
Example topologies supported: AlexNet, VGG, GoogleNet, ResNet
Optimized for Intel® architectures (multi-/many-core & SIMD)
Primitives: Direct batched convolution
Inner product
Pooling: maximum, minimum and average
Normalization: local response (LRN) and batch normalization
Activation: rectified linear unit (ReLU), softmax
Data manipulation: multi-dim. transposition (conversion), split, concat, sum and scale
Intel® MATH Kernel Library-dnn
23
Two stages:1. Setup: Create operations (primitives)2. Execution: Call primitives with input, apply
conversion and forward output
Conversions only when needed (temporary arrays)
C/C++ interface
Can be used by TensorFlow and Caffe{2}
Intel® MATH Kernel Library-dnn
1.6
2.3
1.14
2.0
1.5
1.9 2.02.2
2.1
1.14
1.61.3
1.5
1.8 1.8 1.7
2.2
1.7
2.0
1.41.5
1.15
1.51.6
2.21.9
2.3
1.7
2.4
0
0.5
1
1.5
2
2.5
Caffe TensorFlow MXNet Neon
Infe
renc
e Th
roug
hput
per
form
ance
(Mea
sure
d in
imag
es/s
econ
d)re
pres
ente
d re
lativ
e to
a b
asel
ine
1.0
Hig
her i
s be
tter
24
Inference throughput of Intel® Xeon® Platinum 8180 Processorvs. Intel® Xeon® Processor E5-2699 v4
25
Training throughput of Intel® Xeon® Platinum 8180 Processorvs. Intel® Xeon® Processor E5-2699 v4
2.1
1.8 1.9
2.12.2
2.0 2.0 2.0 2.0
1.8
1.6 1.5 1.5
1.7
2.2
0.0
0.5
1.0
1.5
2.0
2.5
Caffe TensorFlow MXNet Neon
Trai
ning
Thr
ough
put p
erfo
rman
ce(M
easu
red
in im
ages
/sec
ond)
repr
esen
ted
rela
tive
to a
bas
elin
e 1.
0H
ighe
r is
bette
r
Frameworks
26
Focus: Data Scientist
software.intel.com/intel-distribution-for-python
Easy, Out-of-the-box Access to High Performance Python Prebuilt, optimized for numerical
computing, data analytics, HPC Drop in replacement for your
existing Python (no code changes required)
Drive Performance with Multiple Optimization
Techniques Accelerated NumPy/SciPy/Scikit-
Learn with Intel® MKL Data analytics with pyDAAL,
enhanced thread scheduling with TBB, Jupyter* Notebook interface, Numba, Cython
Scale easily with optimized MPI4Py and Jupyter notebooks
Faster Access to Latest Optimizations for Intel
Architecture Distribution and individual
optimized packages available through conda and Anaconda Cloud
Optimizations upstreamed back to main Python trunk
For developers using the most popular and fastest growing programming language for AI
Intel distribution for pythonAdvancing Python Performance Closer to Native Speeds
High Performance ML and Data Analytics libraryBuilding blocks for all data analytics stages, including data preparation, data mining &
machine learning
Open Source Apache 2.0 License
Common Python, Java and C++ APIs across all Intel hardware
Optimized for large data sets including streaming and distributed processing
Flexible interfaces to leading big data platforms including Spark and range of data formats (CSV, SQL, etc.)
Pre-processing Transformation Analysis Modeling Decision MakingValidation
Intel® Data Analytics Acceleration Library (Intel® DAAL)
28
29
Data Analysis: Correlation and Variance-Cov.
Matrices
Cosine/Correlation Distance Matrix
SVD/QR/Cholesky Decomposition
Principal Component Analysis (PCA)
K-Means, etc.
Machine Learning:
Neural Networks for Deep Learning
Linear/Ridge Regression
Naïve Bayes, Multiclass Classifier
SVM, etc.
Used with Hadoop*, Spark* (MLlib), SciPy, etc.
Intel® Data Analytics Acceleration Library (Intel® DAAL)
30
Configuration: 2x Intel® Xeon® E5-2660 CPU @ 2.60GHz, 128 GB, Intel® DAAL 2018; Alternating Least Squares – Users=1M Products=1M Ratings=10M Factors=100 Iterations=1 MLLib time=165.9 sec DAAL time=40.5 sec Gain=4.1x; Correlation – N=1M P=2000 size=37 GB Mllib time=169.2 sec DAAL=12.9 sec Gain=13.1x; PCA – n=10M p=1000 Partitions=360 Size=75 GB Mllib=246.6 sec DAAL (seq)=17.4 sec Gain=14.2x
Intel® Data Analytics Acceleration Library (Intel® DAAL)
Coming Soon
31
Deep Learning Deployment
ToolkitIntel® Computer
Vision SDK Fathom
Overview
Comprehensive productivity suite that
streamlines thedevelopment of enterprise-grade
deep learning solutions
Tools for exploring deep learning and
visualizing the training process and
results.
Optimize trained deep learning networks for end-point devices and utilize a unified API to
integrate inference with application logic
Develop & deploy vision-oriented
solutions that harness the full performance
of Intel CPUs and SOC accelerators
Compile, profile, validate
and run neural networks on
Movidius Myriad product family
Primary Audience
Experiencedenterprise deep
learning datascientists and teams
Students andbeginning data
scientists
Embeddeddevelopers adding
deep learning to their applications
Developers who create vision-oriented
solutions
Algorithm engineers & Embedded Software
Developers
Use CaseBuild deep learning
solutions for production use
Explore deep learning
Optimize a trained Caffe model through model compression
and weight quantization for a
smart camera
Convert a deep learning trainedmodel into an OpenVX graph
Convert trained offline neural networks into
embedded neural networks running on the ultra-low power
Myriad
Ai toolsIntel® Nervana™ Deep Learning Studio
Enterprise Starter
Intel Computer Vision SDK
https://software.intel.com/en-us/computer-vision-sdk 33
A tool to help Data Scientists get started training deep learning modelsSetup, prepare and design deep learning models with a simple UX
Load training data sets. Design models with automatically optimized
hyper-parameters.
Visually monitor training status(performance & accuracy).
Training powered by Intel optimized frameworks
Intel® Deep learning training tool
34
Imports trained models from popular DL framework regardless of training HW
Enhances model for improved execution, storage & transmission
Optimizes Inference execution for target hardware (computational graph analysis, scheduling, model compression, quantization)
Enables seamless integration with application logic
Delivers embedded friendly Inference solution
Ease of use + Embedded friendly + Extra performance boost
1
2
Convert & Optimize
Run!
Trained Model
1
2
For developers looking to run deep learning models on the edge
Intel® deep learning deployment toolkit
35
MACHINE/DEEP LEARNINGREASONING SYSTEMS
TOOLS & STANDARDSCOMPUTER VISIONProgrammable solutions
Memory/storageNetworkingcommunications 5G
Things& devices
CloudDATA Center
Accelerant Technologies
…
End-to-end aiOnly Intel Has a Complete End-to-End Portfolio