Graham McKenzie – Acceleration Systems FAE (EMEA)
Programmable Solutions Group
Driving Trends for AI
Bigger Data
Image: 1000 KB / picture
Audio: 5000 KB / song
Video: 5,000,000 KB / movie
Better Hardware
Transistor density doubles every 18 months
Cost / GB in 1995: $1000.00
Cost / GB in 2015: $0.03
Smarter Algorithms
Advances in neural networks leading to better accuracy in training models
Market Demands Scalability for Machine Learning
Embedded Analytics
• <10 Classes
• Frame Rate: 15-30 fps
• Power: 1W-5W
• Cost: Low
• Varying accuracy
• Custom form factor
Cloud Analytics
• 1000s of Classes
• Large Workloads
• Highly Efficient (Performance / W)
• Varying accuracy
• Server Form Factor
Rapidly Evolving CNN Topologies
2012 AlexNet
2014 GoogLeNet
2015 ResNet
2016 FractalNet
SqueezeNet
LeNet
HMAX, NeoCognitron
NVIDIA DriveNet
Rapidly Evolving, Computation Intensive
Convolution
Input Feature Map(Set of 2D Images)
Filter(3D Space)
Output Feature Map
Repeat for Multiple Filters to Create Multiple “Layers” of Output Feature Map
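The convolution on these slides can be sketched in plain Python. This is a minimal illustration, not the accelerator's implementation: it assumes stride 1, no padding, and the cross-correlation form that CNN frameworks commonly call convolution. The function names `conv2d_single_filter` and `conv_layer` are hypothetical.

```python
def conv2d_single_filter(ifm, filt):
    """Slide one 3D filter over a set of 2D input feature maps.

    ifm:  list of C channels, each an H x W list of lists.
    filt: list of C kernels, each a K x K list of lists.
    Returns one 2D output feature map of size (H-K+1) x (W-K+1).
    Assumes stride 1 and no padding (illustrative choice).
    """
    channels = len(ifm)
    height, width = len(ifm[0]), len(ifm[0][0])
    k = len(filt[0])
    out = []
    for y in range(height - k + 1):
        row = []
        for x in range(width - k + 1):
            acc = 0
            # Multiply-accumulate across every channel and kernel position.
            for c in range(channels):
                for dy in range(k):
                    for dx in range(k):
                        acc += ifm[c][y + dy][x + dx] * filt[c][dy][dx]
            row.append(acc)
        out.append(row)
    return out


def conv_layer(ifm, filters):
    """Repeat for multiple filters: one output feature map per filter."""
    return [conv2d_single_filter(ifm, f) for f in filters]


# Two 3x3 input channels and two filters of shape 2 (channels) x 2 x 2.
ifm = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[1, 0, 1], [0, 1, 0], [1, 0, 1]],
]
filters = [
    [[[1, 0], [0, 1]], [[1, 1], [1, 1]]],  # diagonal of channel 0 + window sum of channel 1
    [[[0, 0], [0, 0]], [[1, 0], [0, 0]]],  # top-left element of channel 1 only
]
ofm = conv_layer(ifm, filters)
print(ofm)  # [[[8, 10], [14, 16]], [[1, 0], [0, 1]]]
```

Each filter spans all input channels and yields one 2D map, so the number of filters sets the depth of the output feature map. The inner multiply-accumulate loops are independent across output positions and filters, which is the parallelism a spatial architecture can exploit.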
Why an FPGA for Deep Learning?
1 TFLOP floating point performance in Arria 10
– 35W total device power
– Enable massive parallelism, compute spatially
8 TB/s memory bandwidth: keep state on chip!
– Exceeds available external bandwidth by factor of 50*
– Random access, low latency (2 clks)
Avoid costly data movement
– Place all data in on-chip memory, compute temporally
Flexibility
– Support rapidly evolving algorithm and future architecture
– Enable accelerator pipeline for the best system efficiency
Fine-grained & low latency between compute and memory
[Diagram labels: FPGA; Kernel 1, Kernel 2, Kernel 3; IO, IO; Optional Memory, Optional Memory]
* DDR4 @ 3.2GHz, 72Bits…etc.
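The headline figures on this slide can be cross-checked with a little arithmetic. The sketch below is an illustration only. It assumes that "3.2GHz" means 3.2 GT/s and that all 72 bits of one DDR4 interface are counted (ECC included); neither is stated on the slide, and the footnote's "…etc." leaves the exact external memory configuration unspecified.

```python
# Figures quoted on the slide.
ON_CHIP_BW_BYTES_PER_S = 8 * 10**12  # 8 TB/s on-chip memory bandwidth
QUOTED_FACTOR = 50                   # "exceeds external bandwidth by factor of 50"
PEAK_FLOPS = 1 * 10**12              # 1 TFLOP floating point
DEVICE_POWER_W = 35                  # 35 W total device power

# Assumptions (not from the slide): 3.2 GT/s transfer rate, 72-bit interface.
DDR4_TRANSFERS_PER_S = 3_200_000_000
DDR4_WIDTH_BITS = 72

# External bandwidth implied by dividing the on-chip figure by the quoted factor.
implied_external_bw = ON_CHIP_BW_BYTES_PER_S // QUOTED_FACTOR

# Raw bandwidth of a single interface under the assumptions above.
one_ddr4_interface_bw = DDR4_TRANSFERS_PER_S * DDR4_WIDTH_BITS // 8

# Peak compute efficiency from the two quoted numbers.
gflops_per_watt = PEAK_FLOPS / DEVICE_POWER_W / 1e9

print(f"Implied external bandwidth: {implied_external_bw / 1e9:.0f} GB/s")
print(f"One 72-bit DDR4 interface:  {one_ddr4_interface_bw / 1e9:.1f} GB/s")
print(f"Peak efficiency:            {gflops_per_watt:.1f} GFLOPS/W")
```

Under these assumptions the factor of 50 corresponds to about 160 GB/s of external bandwidth, which is several such DDR4 interfaces rather than one.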
Deep Learning FPGA Accelerator IP
Turns FPGA into deep learning accelerator
Reconfigurable to different CNN topologies
– Use Intel Caffe to define topology
– No FPGA compile required
Optimized implementation of core primitives
– Common primitives for CNN topologies
– Can be used or bypassed to create custom graphs
ML Framework (Torch, Theano, Caffe)
MKL-DNN / DLA SW API
Convolution / Fully Connected, ReLU, Norm, MaxPool
Stream Buffer
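The idea that primitives "can be used or bypassed to create custom graphs" can be pictured as a fixed chain of stages with a per-stage bypass switch. The sketch below is purely illustrative: `STAGES`, `run_graph`, and the `bypass` parameter are hypothetical names, not the DLA IP's interface, and the 1D pooling stands in for the 2D pooling a CNN would use.

```python
def relu(values):
    """Rectified linear unit: clamp negatives to zero."""
    return [max(0, v) for v in values]


def max_pool(values, window=2):
    """Non-overlapping 1D max pooling (illustrative; real CNNs pool in 2D)."""
    return [max(values[i:i + window]) for i in range(0, len(values), window)]


# Hypothetical stage table: name -> function.
STAGES = {"relu": relu, "maxpool": max_pool}


def run_graph(values, stage_order, bypass=()):
    """Apply stages in order, skipping any stage listed in `bypass`."""
    for name in stage_order:
        if name in bypass:
            continue  # bypassed stage: data passes through unchanged
        values = STAGES[name](values)
    return values


data = [-3, 5, 2, -1, 4, 7]
full = run_graph(data, ["relu", "maxpool"])
no_pool = run_graph(data, ["relu", "maxpool"], bypass={"maxpool"})
print(full)     # [5, 2, 7]
print(no_pool)  # [0, 5, 2, 0, 4, 7]
```

The same chain serves both topologies; only the bypass set changes, which mirrors the slide's claim that a new topology needs no FPGA recompile.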
Energy Efficient Inference with Infrastructure Flexibility
Excellent energy efficiency up to 25 images/sec/watt inference on Caffe/Alexnet
Reconfigurable accelerator can be used for variety of data center workloads
Integrated FPGA with Intel® Xeon® processor fits in standard server infrastructure -OR- Discrete FPGA fits in PCIe card and embedded applications*
Intel® Arria® 10 FPGA: Superior Inference Capabilities
Offers high perf/watt for inference with low latency and flexible precision
*Xeon with Integrated FPGA refers to Broadwell Proof of Concept Configuration details on slide: 44
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit: http://www.intel.com/performance Source: Intel measured as of November 2016
Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice Revision #20110804
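The "images/sec/watt" metric converts directly to energy per inference, since one watt is one joule per second. The short sketch below works that out for the quoted "up to 25" figure. The 20 W power budget is an arbitrary example value, not a measurement from the slide, and the helper names are hypothetical.

```python
IMAGES_PER_SEC_PER_WATT = 25  # "up to" figure quoted on the slide


def joules_per_image(images_per_sec_per_watt):
    """1 W = 1 J/s, so images/s/W equals images per joule; invert it."""
    return 1.0 / images_per_sec_per_watt


def images_per_sec(power_budget_w, images_per_sec_per_watt):
    """Throughput if the quoted efficiency held at a given power budget."""
    return power_budget_w * images_per_sec_per_watt


energy = joules_per_image(IMAGES_PER_SEC_PER_WATT)
throughput = images_per_sec(20, IMAGES_PER_SEC_PER_WATT)  # 20 W: example budget only

print(f"{energy * 1000:.0f} mJ per image")
print(f"{throughput} images/s at a 20 W budget")
```

At the quoted peak efficiency this comes to roughly 40 mJ per image; real efficiency will vary with topology, batch size, and precision.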
Scalable Architecture
Intel® Deep Learning Inference Accelerator (DLIA)
Turnkey inference solution to accelerate convolutional neural networks (CNN)
Image processing applications
Available Q2’17
Turnkey Solution
• Hardware: PCIe* add-in card with Intel® Arria 10 FPGA
• Software: Optimized deep learning framework with MKL-DNN and Caffe* integration. Preloaded CNN image recognition algorithms support multiple topologies
• IP: DLA IP accelerates 6 primitives on FPGA
• Validation, warranty and support
Value Proposition
• Accelerate time to market by simplifying deployment with turnkey solution & software ecosystem.
• Reduce TCO by offloading/accelerating inference workloads to support datacenter scalability
• Unified APIs on Intel Architecture provide a consistent user experience across Intel product families
Software Architecture
CNN application: Application using AlexNet, GoogleNet or custom topology for recognition
Caffe integrated w/ MKL-DNN: Caffe* framework integrated with MKL-DNN. Unsupported primitives in MKL-DNN are padded with CPU impl in Caffe
MKL-DNN: Unified Intel deep learning API integrated with DLA IP. Unsupported primitives in DLA IP are padded with CPU primitives in MKL-DNN
DLIA SW API: SW API to expose the FPGA primitives
OpenCL Runtime: OpenCL API to access FPGA
Board Support Package: Driver to the board
DLA IP: Deep Learning Accelerator (DLA) IP. Implements 6 CNN primitives (conv, FC, relu, pooling, norm, concat)
Deep Learning SDK
Robustness, Compatibility, Ease of use
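The layered fallback on this slide (DLA IP first, then CPU primitives in MKL-DNN, then CPU implementations in Caffe) can be sketched as a simple placement function. This is a hedged illustration, not the DLIA software's code. The set `DLA_PRIMITIVES` comes from the slide, while `MKLDNN_CPU_PRIMITIVES`, `assign_backend`, `plan`, and the layer name `custom_layer` are hypothetical.

```python
# The six primitives the slide lists for the DLA IP.
DLA_PRIMITIVES = {"conv", "fc", "relu", "pooling", "norm", "concat"}

# Hypothetical example of primitives handled on the CPU by MKL-DNN.
MKLDNN_CPU_PRIMITIVES = {"softmax"}


def assign_backend(primitive):
    """Pick the first layer of the stack that supports the primitive."""
    if primitive in DLA_PRIMITIVES:
        return "fpga (DLA IP)"
    if primitive in MKLDNN_CPU_PRIMITIVES:
        return "cpu (MKL-DNN)"
    return "cpu (Caffe)"  # final fallback: framework's own CPU implementation


def plan(topology):
    """Map each layer of a topology to the backend that would run it."""
    return [(layer, assign_backend(layer)) for layer in topology]


layers = ["conv", "relu", "pooling", "fc", "softmax", "custom_layer"]
placement = plan(layers)
for layer, backend in placement:
    print(f"{layer:13s} -> {backend}")
```

Because each layer falls through to the next software level, an application can run an arbitrary topology even when only part of it maps onto the FPGA.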
Intel® Nervana™ Portfolio: Common Architecture for AI Implementations
Targeted acceleration
Intel® Xeon® Processors: Most widely deployed machine learning platform (>97%)
Intel® Xeon Phi™ Processors: Higher performance, general purpose machine learning
Intel® Xeon® Processor + FPGA: Higher perf/watt inference, programmable
Intel® Xeon® Processor + Lake Crest: Best in class neural network performance
Legal Notices and Disclaimers
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance.
Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.
This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
Statements in this document that refer to Intel’s plans and expectations for the quarter, the year, and the future, are forward-looking statements that involve a number of risks and uncertainties. A detailed discussion of the factors that could affect Intel’s results and plans is included in Intel’s SEC filings, including the annual report on Form 10-K.
All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice. The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate. © 2016 Intel Corporation. Intel, the Intel logo and others are trademarks of Intel Corporation in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others.