Intel Nervana software stack - Microsigma · Intel® Nervana™ Deep Learning Studio. Intel® Computer Vision SDK. ... Convolutional Neural Networks and JIT. Innovation happens to

François FayardBAYNCORE

November 29, 2017Paris

Intel Nervanasoftware stack

libraries Intel® Math Kernel Library (MKL, MKL-DNN)

platforms

Frameworks

Intel® Data Analytics Acceleration Library

(DAAL)

hardwareMemory & Storage NetworkingCompute

Intel® Python Distribution

Mllib BigDL

Intel® Nervana™ Graph*

experiences

Intel® Nervana™ Cloud & System

Intel® Nervana™ portfolio

Intel® Nervana™ Deep Learning StudioIntel® Computer

Vision SDKMovidius™

Technology

*FutureOther names and brands may be claimed as the property of others.

*

2


Frameworks


(DAAL)Intel® Python Distribution

Mllib BigDL


Intel® Parallel Studio

*Other names and brands may be claimed as the property of others.

Compiler & Libraries

Intel® Math Kernel Library (MKL)

Intel® Integrated Performance Primitives (IPP)

Fortran & C/C++ Compilers

Intel® Threading Building Blocks (TBB)

Intel® MPI

Profiling Tools

Intel® Advisor

Intel® Inspector Intel® VTuneAmplifier

Intel® Trace Analyzer and Collector

Commercial Products

3

Software and Hardware Innovation

4

Focus: Open Source

Long-term commitment to Open Source Software (OSS)• Linux OS (not just drivers): Intel contributes and ranks in top-10 since years

• Other contributions: compiler enabling (GCC), math library support (SVML), language standards (OpenMP: iOMP Clang), in general: 01.org

… but Intel software product teams are committed as well• Free-of-charge Intel MKL (supported by user forum), which is also for commercial use (commercial

Premier support)

• Open Source Software: Intel TBB, Intel MKL-DNN

5

Open Source Software at Intel

Stencils• Implied embedding variables as constants, etc.

• Example: Advantages for Stencil Computation

Stencil (access-)pattern “baked” into code

Grid bounds are known constants

Code unrolling without remainder

[…]

Code specialization is an effective optimization!• Small-size problems can be effectively hard-coded; no upfront code generation needed (may also

avoid control-flow to select cases)

• Perfect match for unpredictable use cases (scripting framework, etc.)

6

Open Source Software Innovation ExampleJIT Code Specialization

Direct convolutions are more efficient than a calculation in the frequency space(There are cross-over points: direct incl. Winograd, FFT)

• Exploiting the memory bandwidth, caches, and register file is key

• ML frameworks are usually script-enabled (runtime dynamic)

• Network topology (graph structure) determines schedule,code fusion, etc.

JIT code generation: ideal for small convolutions and graph-based structure• Can further exploit topology (code fusion, etc.)

Significant advantage with JIT code generation

7

Open Source Software Innovation ExampleConvolutional Neural Networks and JIT

Innovation happens to a large extent at the ISA level• Domain specific ISA extensions (supported only at intrinsic level)

Examples: CRC32 (SSE 4.2), AES-NI, SHA,F16 LD/ST (F16C), TSX

• General (compiler support i.e., regularly generated from high-level lang.)Examples: AVX, AVX2, AVX-512

The Machine Learning domain for Intel Architecture is no exception.• Lower precision computation (fixed, variable width, etc.)

8

What about Hardware Innovation? ExampleISA Specialization

• 2x the flops by using INT16 inputs

• Similar accuracy as SP by using INT32 accumulated output

9

Variable Precision Instructions (VNNI*)

a1 a0

b1 b0

16b 16b

c032b

c0=c0+a0*b0+a1*b1

32b

Int16 inputs

Int32 accumulators

• 2x VNNI operations per port

• 4x ML performance than regular AVX512-SP

VNNI QVNNI

INT16 to INT32

QVNNI

2x 64 = 128 VNNI iops 2x 64 = 128 VNNI iops

* See Intel Architecture Reference Manual (“Icelake New Instructions”)


platforms

Frameworks


(DAAL)

hardwareMemory & Storage NetworkingCompute

Intel® Python Distribution

Mllib BigDL


experiences

Intel® Nervana™ Cloud & System

Intel® Nervana™ portfolio

Intel® Nervana™ Deep Learning StudioIntel® Computer

Vision SDKMovidius™

Technology

*FutureOther names and brands may be claimed as the property of others.

*

10

• 2nd generation open source machine learning framework from Google*

• Widely used in Google’s applications: search, Gmail, photos, translate, etc.

• Core system provides set of key computational kernel, extendable user ops

• Core in C++, front end wrapper is in python specifies/drives computation

• Multi-node support based on GRPC protocol (MPI added more recently)

• Own threading runtime (not OpenMP, TBB, etc.)

11

TensorFlow: Quick Summary and History

© 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.*Other names and brands may be claimed as the property of others.

Data layout matters: CNNs run over batches (N: batch size) of multi-channel (C) images e.g., RGBA images with certain convolution kernel sizes (WxH)

• Intel’s MKL imply NHCW format by default

• TF without MKL uses NHWC format

An “optimal format” (HW-specific) eventually leads to (frequent) conversions in TF’s graph due to data-reordering between “native” and “optimal” tensor/data format

• Low-level frameworks are caused to optimize end-to-end

12

Algorithmic and Optimization Challenges

Convolution ConvolutionMax PoolNative to MKL layout

MKL layout to Native

Native to MKL layout

MKL layout to Native

Hardware-Specific Transformers

HardwareAgnostic Intel® Nervana™ Graph

Intel® Nervana™ Graphenables optimizations that are applicable across multiple HW targets.

• Efficient buffer allocation• Training vs inference optimizations• Efficient scaling across multiple nodes• Efficient partitioning of subgraphs• Compounding of ops

CustomerSolutions Neon Solutions

Neon ModelsCustomer Models

Neon Deep Learning FunctionsCustomer Algorithms

The Intel® Nervana™ Graph will scale performance across hundreds of machine and deep learning frameworks

intel® Nervana™ graphHigh-Performance Execution Graph for Neural Networks

intel® Nervana™ graphHigh-Performance Execution Graph for Neural Networks

Models

Hardware

Frameworks TensorFlowCaffe{2} Torch

Lake Crest

neon

Intel® Xeon® & Xeon Phi™

Integrated Graphics

Nervana Graph

FPGA

…CNTK

Use Cases

Movidius

Libraries For AI

15

Focus: Low-Level Primitives

Intel® Math Kernel Library Compute Library for Deep Neural

Networks (clDNN)

Intel® Distribution

Intel® Data Analytics

Acceleration Library (DAAL)

Intel® Nervana Graph

Intel® MKL-DNN MKL

High Level

Overview

High-Performance execution graph for

deep learning frameworks

Open source DNN functions for high-velocity integration with deep learning

frameworks

High performance math primitives

granting low level of control

Open source DNN functions library for

accelerating inference on Intel GPUs

Most popular and fastest growing

language for machine learning

Broad data analytics acceleration object

oriented library supporting

distributed ML at the algorithm level

Primary Audience

Advanced Data Scientists or Framework Developers enhancing

frameworks

Developers of the next generation of


Developers of higher level libraries

and Applications

Developers that tune deep learning

workloads to the kernel level

Application Developers and Data

Scientists

Wider Data Analytics and ML audience,

Algorithm level development for all

stages of data analytics

Example Usage

Customization of neural nets scaled

across various hardware

New framework with functions

developers call for max CPU

performance

Framework developers call

matrixmultiplication,

convolution functions

Custom tuning DL model for application

for convolution

Call scikit-learnk-means function for

credit card fraud detection

Call distributed alternating least

squares algorithm for a recommendation

system

Find out more at software.intel.com/ai

Used for Classical Machine Learning

Ai libraries & APIs


Networks (clDNN)





Intel® MKL-DNN MKL

High Level

Overview




frameworks










Primary Audience


frameworks




and Applications




Scientists




Example Usage





performance





for convolution





system



Ai libraries & APIsBeta

Beta


Networks (clDNN)





Intel® MKL-DNN MKL

High Level

Overview




frameworks










Primary Audience


frameworks




and Applications




Scientists




Example Usage





performance





for convolution





system



Ai libraries & APIs

Open Source

Open Source

Open Source

Open Source

Open Source

Fast, Scalable Code with Intel® Math Kernel Library (Intel® MKL)

19

Highly optimized, threaded, & vectorized math functions that maximize performance on each processor family

Utilizes industry-standard C and Fortran APIs for compatibility with popular BLAS, LAPACK, and FFTW functions—no code changes required

Dispatches optimized code for each processor automatically without the need to branch code

What’s New in the 2018 edition Improved small matrix multiplication performance

in GEMM & LAPACK

Improved ScaLAPACK performance for distributed computation

24 new vector math functions (“VML domain”)

Simplified license for easier adoption & redistribution

Additional distributions via YUM, APT-GET, & CondaLearn More: software.intel.com/mkl

®

Fast, Scalable Code with Intel® Math Kernel Library (Intel® MKL)

20

Highly optimized, threaded, & vectorized math functions that maximize performance on each processor family

Utilizes industry-standard C and Fortran APIs for compatibility with popular BLAS, LAPACK, and FFTW functions—no code changes required

Dispatches optimized code for each processor automatically without the need to branch code

What’s New in the 2018 edition Improved small matrix multiplication performance

in GEMM & LAPACK

Improved ScaLAPACK performance for distributed computation

24 new vector math functions (“VML domain”)

Simplified license for easier adoption & redistribution

Additional distributions via YUM, APT-GET, & CondaLearn More: software.intel.com/mkl

®

Distribution Details Open Source

Apache 2.0 License

Common DNN APIs across all Intel hardware.

Rapid release cycles, iterated with the DL community, to best support industry framework integration.

Highly vectorized & threaded for maximal performance, based on the popular Intel® MKL library.

For developers of deep learning frameworks featuring optimized performance on Intel hardware

github.com/01org/mkl-dnn

Direct 2D Convolution

Rectified linear unit neuron activation

(ReLU)Maximum

pooling Inner productLocal response normalization

(LRN)

Intel® MATH Kernel Library-dnnMath Kernel Library for Deep Neural Networks

Examples:

https://www.github.com/01org/mkl-dnn

22

Set of functions (primitives) commonly used in DNN topologies (CNN)

Example topologies supported: AlexNet, VGG, GoogleNet, ResNet

Optimized for Intel® architectures (multi-/many-core & SIMD)

Primitives: Direct batched convolution

Inner product

Pooling: maximum, minimum and average

Normalization: local response (LRN) and batch normalization

Activation: rectified linear unit (ReLU), softmax

Data manipulation: multi-dim. transposition (conversion), split, concat, sum and scale

Intel® MATH Kernel Library-dnn

23

Two stages:1. Setup: Create operations (primitives)2. Execution: Call primitives with input, apply

conversion and forward output

Conversions only when needed (temporary arrays)

C/C++ interface

Can be used by TensorFlow and Caffe{2}

Intel® MATH Kernel Library-dnn

1.6

2.3

1.14

2.0

1.5

1.9 2.02.2

2.1

1.14

1.61.3

1.5

1.8 1.8 1.7

2.2

1.7

2.0

1.41.5

1.15

1.51.6

2.21.9

2.3

1.7

2.4

0

0.5

1

1.5

2

2.5

Caffe TensorFlow MXNet Neon

Infe

renc

e Th

roug

hput

per

form

ance

(Mea

sure

d in

imag

es/s

econ

d)re

pres

ente

d re

lativ

e to

a b

asel

ine

1.0

Hig

her i

s be

tter

24

Inference throughput of Intel® Xeon® Platinum 8180 Processorvs. Intel® Xeon® Processor E5-2699 v4

25

Training throughput of Intel® Xeon® Platinum 8180 Processorvs. Intel® Xeon® Processor E5-2699 v4

2.1

1.8 1.9

2.12.2

2.0 2.0 2.0 2.0

1.8

1.6 1.5 1.5

1.7

2.2

0.0

0.5

1.0

1.5

2.0

2.5

Caffe TensorFlow MXNet Neon

Trai

ning

Thr

ough

put p

erfo

rman

ce(M

easu

red

in im

ages

/sec

ond)

repr

esen

ted

rela

tive

to a

bas

elin

e 1.

0H

ighe

r is

bette

r

Frameworks

26

Focus: Data Scientist

software.intel.com/intel-distribution-for-python

Easy, Out-of-the-box Access to High Performance Python Prebuilt, optimized for numerical

computing, data analytics, HPC Drop in replacement for your

existing Python (no code changes required)

Drive Performance with Multiple Optimization

Techniques Accelerated NumPy/SciPy/Scikit-

Learn with Intel® MKL Data analytics with pyDAAL,

enhanced thread scheduling with TBB, Jupyter* Notebook interface, Numba, Cython

Scale easily with optimized MPI4Py and Jupyter notebooks

Faster Access to Latest Optimizations for Intel

Architecture Distribution and individual

optimized packages available through conda and Anaconda Cloud

Optimizations upstreamed back to main Python trunk

For developers using the most popular and fastest growing programming language for AI

Intel distribution for pythonAdvancing Python Performance Closer to Native Speeds

https://software.intel.com/intel-distribution-for-python

High Performance ML and Data Analytics libraryBuilding blocks for all data analytics stages, including data preparation, data mining &

machine learning

Open Source Apache 2.0 License

Common Python, Java and C++ APIs across all Intel hardware

Optimized for large data sets including streaming and distributed processing

Flexible interfaces to leading big data platforms including Spark and range of data formats (CSV, SQL, etc.)

Pre-processing Transformation Analysis Modeling Decision MakingValidation

Intel® Data Analytics Acceleration Library (Intel® DAAL)

28

29

Data Analysis: Correlation and Variance-Cov.

Matrices

Cosine/Correlation Distance Matrix

SVD/QR/Cholesky Decomposition

Principal Component Analysis (PCA)

K-Means, etc.

Machine Learning:

Neural Networks for Deep Learning

Linear/Ridge Regression

Naïve Bayes, Multiclass Classifier

SVM, etc.

Used with Hadoop*, Spark* (MLlib), SciPy, etc.


30

Configuration: 2x Intel® Xeon® E5-2660 CPU @ 2.60GHz, 128 GB, Intel® DAAL 2018; Alternating Least Squares – Users=1M Products=1M Ratings=10M Factors=100 Iterations=1 MLLib time=165.9 sec DAAL time=40.5 sec Gain=4.1x; Correlation – N=1M P=2000 size=37 GB Mllib time=169.2 sec DAAL=12.9 sec Gain=13.1x; PCA – n=10M p=1000 Partitions=360 Size=75 GB Mllib=246.6 sec DAAL (seq)=17.4 sec Gain=14.2x


Coming Soon

31

Deep Learning Deployment

ToolkitIntel® Computer

Vision SDK Fathom

Overview

Comprehensive productivity suite that

streamlines thedevelopment of enterprise-grade

deep learning solutions

Tools for exploring deep learning and

visualizing the training process and

results.

Optimize trained deep learning networks for end-point devices and utilize a unified API to

integrate inference with application logic

Develop & deploy vision-oriented

solutions that harness the full performance

of Intel CPUs and SOC accelerators

Compile, profile, validate

and run neural networks on

Movidius Myriad product family

Primary Audience

Experiencedenterprise deep

learning datascientists and teams

Students andbeginning data

scientists

Embeddeddevelopers adding

deep learning to their applications

Developers who create vision-oriented

solutions

Algorithm engineers & Embedded Software

Developers

Use CaseBuild deep learning

solutions for production use

Explore deep learning

Optimize a trained Caffe model through model compression

and weight quantization for a

smart camera

Convert a deep learning trainedmodel into an OpenVX graph

Convert trained offline neural networks into

embedded neural networks running on the ultra-low power

Myriad

Ai toolsIntel® Nervana™ Deep Learning Studio

Enterprise Starter

Intel Computer Vision SDK

https://software.intel.com/en-us/computer-vision-sdk 33

A tool to help Data Scientists get started training deep learning modelsSetup, prepare and design deep learning models with a simple UX

Load training data sets. Design models with automatically optimized

hyper-parameters.

Visually monitor training status(performance & accuracy).

Training powered by Intel optimized frameworks

Intel® Deep learning training tool

34

Imports trained models from popular DL framework regardless of training HW

Enhances model for improved execution, storage & transmission

Optimizes Inference execution for target hardware (computational graph analysis, scheduling, model compression, quantization)

Enables seamless integration with application logic

Delivers embedded friendly Inference solution

Ease of use + Embedded friendly + Extra performance boost

1

2

Convert & Optimize

Run!

Trained Model

1

2

For developers looking to run deep learning models on the edge

Intel® deep learning deployment toolkit

35

MACHINE/DEEP LEARNINGREASONING SYSTEMS

TOOLS & STANDARDSCOMPUTER VISIONProgrammable solutions

Memory/storageNetworkingcommunications 5G

Things& devices

CloudDATA Center

Accelerant Technologies

…

End-to-end aiOnly Intel Has a Complete End-to-End Portfolio

Documents

Intel Nervana software stack - Microsigma · Intel® Nervana™ Deep Learning Studio. Intel® Computer Vision SDK. ... Convolutional Neural Networks and JIT. Innovation happens to