Scaling Convolutional Neural Networks on Reconfigurable Logic
Michaela Blott, Principal Engineer, Xilinx Research
Page 2
Mission
“Investigate & exploit novel trends in machine learning
that play to the strengths of FPGAs: Reduced Precision Neural Networks”
Nick Fraser (Xilinx & USydney)
Giulio Gambardella (Xilinx)
Yaman Umuroglu (Xilinx & NTNU)
Page 3
Executive Summary
FPGAs can do trillions of reduced-precision synaptic operations per second, and neural nets can put this to good use.
Inference accelerators on today's hardware classify tens of thousands to millions of images per second at under 25 W.
Page 4
Background
CNN computation is linear algebra, originally on floating point data types
– Demands lots of computation and lots of parameters (memory)
• For ImageNet: AlexNet 244MB & 1.5 GOPS, VGG16 552MB & 30.8 GOPS, GoogleNet 41.9MB & 3.0 GOPS per frame (AlexNet's 244MB is simply its ~61M parameters at 4 bytes each)
– Not suitable for energy-constrained computing environments
Page 5
Convolutional Neural Networks
[Figure: convolutional layers transforming an input image into the classification «cat»]

output(w,h,m) += input(w+x, h+y, d) * filter(m, x, y, d);

Challenge: billions of floating point multiply-accumulate ops
& tens of megabytes of parameter data
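To make the indexing explicit, here is a minimal scalar sketch of that operation (a plain loop nest with illustrative names and bounds; not FINN code, which maps this onto hardware as shown later):

void conv_layer(const float *input, const float *filter, float *output,
                int W, int H, int M, int K, int D) {
  // W, H: output width/height; M: output feature maps;
  // K: kernel size; D: input feature maps (depth). No padding, stride 1,
  // so the input is (W+K-1) x (H+K-1) x D, laid out row-major.
  for (int m = 0; m < M; m++)
    for (int h = 0; h < H; h++)
      for (int w = 0; w < W; w++) {
        float acc = 0.0f;
        for (int y = 0; y < K; y++)
          for (int x = 0; x < K; x++)
            for (int d = 0; d < D; d++)
              acc += input[((h + y) * (W + K - 1) + (w + x)) * D + d]
                   * filter[((m * K + y) * K + x) * D + d];
        output[(h * W + w) * M + m] = acc;
      }
}

Each output element costs K*K*D multiply-accumulates, and each output map carries its own K*K*D filter weights, which is where the billions of ops and tens of megabytes come from.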
Page 6
Increasingly Reduced Precision Networks
Floating point (FP) CNNs contain a lot of redundancy
– Reducing precision is shown to work down to 2b without loss of accuracy
• B. Dally, EMDNN 2016: ternary networks on par with FP for AlexNet top-1 and top-5, and for ResNet-20/32/44/56
Reducing to the extreme: binary and almost-binary neural networks (BNNs) work at a small loss of accuracy for large networks
Quantization                        MNIST   SVHN    CIFAR-10
Binary weights, binary activations  0.96%   2.53%   10.15%
Binary weights, FP activations      1.29%   2.30%   9.90%
FP weights, FP activations          0.94%   1.69%   7.62%
% classification error (lower is better)
Source: [4] http://arxiv.org/abs/1603.05279;
[5] https://arxiv.org/abs/1602.02830
Page 7
Accuracy of Binary Networks Improving
Published results for FP CNNs, BNNs and extreme reduced precision NNs
• BNNs are new, and accuracy results are improving rapidly through, for example, new training techniques, topological changes and other methods
[Chart: Top-5 error on ImageNet (0-60%) against publication date (2009-2017) for FP CNNs, reduced precision networks and BNNs]
Page 8
Potential of Reduced Precision on FPGAs
Cost per operation is greatly reduced
– For example, for BNNs the multiply-accumulate becomes an XNOR with a bit count (see the sketch after the table below)
Memory cost is greatly reduced
– Large networks can fit entirely into on-chip memory (OCM) (UltraRAM, BRAM)
• VU9P (16nm): 43MB
– More memory bandwidth, lower energy
Today's FPGAs have a much higher peak performance for reduced precision operations
– FPGA performance is inversely proportional to the cost per operation when applications are sufficiently parallel
– Lower cost per op & massively parallel = more ops every cycle
Precision  Cost per Op (LUT)  Cost per Op (DSP)  MB needed (AlexNet)  TOps/s (KU115)*  TOps/s (ZU19EG)*
1b         2.5                0                  7.6                  ~46              ~66
4b         16                 0                  30.5                 ~11              ~16
8b         45                 0                  61                   ~3               ~4
16b        15                 0.5                122                  ~1               ~1
32b        178                2                  244                  ~0.5             ~0.3
(~100x spread between 1b and 32b)
*Assumptions: application can fill the device to 70% (fully parallelizable) at 250 MHz; devices offer 100Ks of LUTs and Ks of DSPs
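Two back-of-envelope notes on this table. The ~46 TOps/s for 1b on the KU115 is consistent with the footnote's assumptions: roughly 663K LUTs (per the KU115 product table; treat as approximate) x 70% fill / 2.5 LUTs per op x 250 MHz ≈ 46 TOps/s. And the XNOR-with-bit-count MAC behind the 1b row can be sketched in a few lines of C++; this is an illustrative scalar model, not the FINN implementation:

#include <cstdint>

// Binary dot product over 64 synapses. Weights and activations are packed
// bit vectors where bit value 1 encodes +1 and 0 encodes -1. XNOR marks the
// positions where weight and activation agree; the signed sum of all 64
// products is then (#agreements - #disagreements) = 2*popcount - 64.
static inline int binary_mac64(uint64_t weights, uint64_t activations) {
  uint64_t agree = ~(weights ^ activations);  // bitwise XNOR
  int pop = __builtin_popcountll(agree);      // GCC/Clang popcount builtin
  return 2 * pop - 64;
}

In the fabric this popcount maps onto LUTs rather than DSP slices, which is why the 1b row shows zero DSP cost.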
Page 9
Potential of BNNs on FPGAs (ZU19EG)
[Roofline chart (ZU19EG): attainable performance vs. operational intensity, with marked points at 66, 40, 1 and 0.1 TOPS]
Fewer LUTs per op yields a higher peak performance
Staying on-chip achieves more of the peak
Assumption: operational intensity for 8b and 1b AlexNet, assuming 1.45 GOps/image with 61MB (8b) and 7.6MB (1b) of parameters
• Reduced precision allows us to scale NN performance on FPGAs to unprecedented levels
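This is the standard roofline argument: attainable performance = min(peak compute, operational intensity x memory bandwidth). Under the stated assumption that all parameters are fetched once per image, the intensities work out to roughly 1.45 GOps / 61 MB ≈ 24 Ops/byte for 8b and 1.45 GOps / 7.6 MB ≈ 190 Ops/byte for 1b, so binarization both raises the compute ceiling (fewer LUTs per op) and moves the workload to the right on the roofline; keeping all 7.6 MB in OCM removes the bandwidth ceiling altogether.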
Page 10
Exploitation of Reduced Precision Neural Networks through
FINN: A Framework for Fast, Scalable Binarized Neural Network Inference
https://arxiv.org/abs/1612.07119
http://arxiv.org/abs/1701.03400
Page 11
FINN Design Principles
Custom-tailored hardware for optimal performance and power efficiency
– Customized data types
– Customized dataflow architecture to match the network topology
– Exploit compile-time optimizations to simplify the generated hardware
Keep all parameters on-chip
– Higher energy efficiency and performance
Provide flexibility in the architecture to scale solutions
Support portability and rapid exploration through high-level design entries
– C++ and, most recently, OpenCL
Page 12
Heterogeneous Dataflow Architecture
Dataflow architecture
– Not a systolic array with a scheduling network over processing engines and looping over PEs
Customized to match each layer's compute requirement
– Equivalent throughput through all layers (see the sizing sketch below)
– Avoids one-size-fits-all penalties
Each layer consumes and produces in the same order, to minimize buffering and latency
– FIFOs, no ping-pong buffers
Layers are different instantiations of a C++ template class (MVTU)
[Figure: systolic array vs. the chosen dataflow architecture; a 1 MOPS layer gets 1 PE, a 10 MOPS layer gets 10 PEs]
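A minimal sketch of that sizing rule (illustrative; assumes one op per PE per cycle, and the real allocation also folds over SIMD lanes):

// Choose each layer's PE count so all layers take roughly the same number
// of cycles per frame: a 10 MOPS layer gets 10x the PEs of a 1 MOPS layer.
// layer_ops: operations per frame; cycle_budget: target cycles per frame.
long pe_count(long layer_ops, long cycle_budget) {
  return (layer_ops + cycle_budget - 1) / cycle_budget;  // ceiling division
}

With the budget equalized, no layer stalls its neighbors, and the FIFOs between layers can stay shallow.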
Page 13
Architecture of a Matrix-Vector Threshold Unit (MVTU)
Fully connected layers & convolutional layers are mapped onto matrix-vector multiply threshold units (MVTUs)
MVTUs support folding over OFMs (neuron folding) and over weights (synaptic folding)
Weight and output stationary (weights and popcounts are retained locally)
Max pool units are optionally placed behind MVTUs
[Figure: MVTU datapath, annotated with weight folding and OFM folding]
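Folding trades throughput for resources: with PE parallel neurons and SIMD synapses per PE per cycle, one matrix-vector product over an H x W weight matrix takes about (H/PE) x (W/SIMD) cycles. A sketch of that arithmetic, following the folding description in the FINN paper (assumes PE divides H and SIMD divides W):

// Cycles for one matrix-vector multiply under neuron and synapse folding.
// H: matrix height (neurons / OFM channels); W: matrix width (synapses);
// PE: processing elements; SIMD: synapses processed per PE per cycle.
long mvtu_cycles(long H, long W, long PE, long SIMD) {
  long neuron_fold  = H / PE;    // neurons time-multiplexed onto each PE
  long synapse_fold = W / SIMD;  // steps to stream one neuron's inputs
  return neuron_fold * synapse_fold;
}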
Page 14
Synthesizable C++ Network Description
void DoCompute(ap_uint<64> * in, ap_uint<64> * out) {
#pragma HLS DATAFLOW
  // Stream definitions
  stream<ap_uint<64> > memInStrm("memInStrm");
  stream<ap_uint<64> > InStrm("InStrm");
  // ... (remaining stream definitions elided) ...
  stream<ap_uint<64> > memOutStrm("memOutStrm");
  // Move image in from PS memory
  Mem2Stream<64, inBytesPadded>(in, memInStrm);
  // Layer instantiations, connected by streams
  StreamingMatrixVector<L0_SIMD, L0_PE, 16, L0_MW, L0_MH, L0_WMEM, L0_TMEM>
    (InStrm, inter0, weightMem0, thresMem0);
  StreamingMatrixVector<L1_SIMD, L1_PE, 16, L1_MW, L1_MH, L1_WMEM, L1_TMEM>
    (inter0, inter1, weightMem1, thresMem1);
  StreamingMatrixVector<L2_SIMD, L2_PE, 16, L2_MW, L2_MH, L2_WMEM, L2_TMEM>
    (inter1, inter2, weightMem2, thresMem2);
  StreamingMatrixVector<L3_SIMD, L3_PE, 16, L3_MW, L3_MH, L3_WMEM, L3_TMEM>
    (inter2, outstream, weightMem3, thresMem3);
  StreamingCast<ap_uint<16>, ap_uint<64> >(outstream, memOutStrm);
  // Move results to PS memory
  Stream2Mem<64, outBytesPadded>(memOutStrm, out);
}
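The per-layer template parameters are compile-time constants. Judging from their order (treat the exact meanings as assumptions), a hypothetical layer-0 configuration could look like the following, with memory depths falling out of the folding factors; all values are illustrative:

// Hypothetical configuration for layer 0 (values are illustrative only).
#define L0_SIMD 64    // input elements consumed per PE per cycle
#define L0_PE   32    // neurons computed in parallel
#define L0_MW   832   // weight matrix width  (layer inputs, padded)
#define L0_MH   1024  // weight matrix height (layer neurons)
#define L0_WMEM ((L0_MW / L0_SIMD) * (L0_MH / L0_PE))  // weight words per PE
#define L0_TMEM (L0_MH / L0_PE)                        // thresholds per PE

The third template argument (16) is plausibly the width of the popcount accumulator.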
Page 15
Work Flow for Exploration of NNs on FPGAs
• All code in C/C++
• Can execute on CPU and FPGA
  - No RTL needed
• First prototype integration with tiny-dnn and Theano (TensorFlow and Caffe in progress)
Fast workflow, integrated with standard frameworks, with the flexibility to support different topologies, sizes, rates and resources on different devices (Z7045, KU115, Z7020)
Page 16
Experimental Results
– Embedded platforms (Zynq Z7045 & 7020): ZC706, PYNQ open source platform
– Server class accelerator: ADM_PCIE_8K5 in OpenPOWER (& x86 with SDAccel)
Page 17
Input Data
• MNIST handwritten digits
• Streetview house numbers (SVHN)
• CIFAR-10: cats, dogs, etc.
• Playing cards
• ImageNet in progress now
Solitaire demo (Xilinx demo center & Embedded World)
Page 18
Test Networks
MLP: Multilayer Perceptron
– Input images: 28x28 pixels, black & white (handwritten digits)
– Number of layers: 3 FC layers, 1024 neurons each
– Compute: 0.7 - 5.8 MOPS/frame
CNV: CNN (VGG-16 derivative)
– Input images: 32x32 pixels, RGB (SVHN, CIFAR-10, traffic signs)
– Number of layers: 2 (3x3) conv + max pool + 2 (3x3) conv + max pool + 2 conv + 3 FC
– Compute: 0.1 - 1.2 GOPS/frame
Page 19
Results - Performance, Latency, Power & Resources
Z7045, max throughput:
Network          FPS     GOPS/s  BRAM   LUT            Latency [us]  Power [W]
FC-MNIST - S     12.3M   8,200   130.5  86,110 (39%)   0.31          21.2
FC-MNIST - L     1.5M    9,085   398    79,097 (36%)   2.44          22.6
CNV-CIFAR10 - S  21.9K   2,465   192    54,538 (25%)   283           11.7

Z7045, 12K FPS target:
Network          FPS     GOPS/s  BRAM   LUT            Latency [us]  Power [W]
FC-MNIST - S     12.2k   8.16    15.5   4,810 (2%)     240           8.1
FC-MNIST - L     12.2k   71      115    6,156 (3%)     282           7.9
CNV-CIFAR10 - S  11.6k   1,306   158.5  40,404 (18%)   550           10

KU115, max throughput:
Network          FPS     GOPS/s  BRAM   LUT             Latency [us]  Power [W]
CNV-CIFAR10 - L  12.0k   14,814  1814   392,947 (59%)   671           <41

Highlights: unprecedented classification rates, roughly 3x the best measured GPU numbers today; ultra-low latency (a P4 GPU is ~11 ms) for robotics, AR and UAVs; scalability down to extremely small footprints at the fixed 12K FPS target; the compute of CNV-CIFAR10 - L is comparable to AlexNet.
Page 20
Status & Next Steps
Initial proofs of concept & demos are operational and demonstrate the potential
Open source release on PYNQ
– With a Python API - https://github.com/Xilinx/BNN-PYNQ
We continue to progress the technology investigation
– Large NNs
– Higher (but no more than 8b!) & mixed precision NNs
– Improving accuracy through novel techniques
– Design space trade-offs: accuracy vs. performance vs. resources
Interested to understand system-level integration better
– How does ML plug into database systems?
– Heterogeneous at system, node, or device level?
Page 21
Thank You.