Page 1: Scaling Convolutional Neural Networks on Reconfigurable Logic
Michaela Blott, Principal Engineer, Xilinx Research
© Copyright 2017 Xilinx.

Page 2: Mission

“Investigate & exploit novel trends in machine learning that play to the strengths of FPGAs: Reduced Precision Neural Networks”

Nick Fraser (Xilinx & USydney)
Giulio Gambardella (Xilinx)
Yaman Umuroglu (Xilinx & NTNU)

Page 3: Executive Summary

FPGAs can perform trillions of reduced-precision synaptic operations per second, and neural nets can put this to good use.

Inference accelerators can classify tens of thousands to millions of images per second, at < 25 W, on today's hardware.

Page 4: Background

Page 5: Convolutional Neural Networks

CNN computation is linear algebra on data types that were originally floating point.
– Demands lots of computation and lots of parameters (memory):
  • AlexNet: 244 MB & 1.5 GOPS; VGG16: 552 MB & 30.8 GOPS; GoogleNet: 41.9 MB & 3.0 GOPS (for ImageNet).
– Not suitable for energy-constrained computing environments.


[Figure: image classified through CNN layers to the label «cat».]

The core operation:

  output(w,h,m) += input(w+x, h+y, d) * filter(m, x, y, d);

Challenge: billions of floating-point multiply-accumulate ops & tens of megabytes of parameter data.
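To make the loop structure explicit, here is a minimal plain-C++ sketch of that accumulation (a naive sketch for illustration: the dimension names, row-major layouts, and unit stride are my assumptions, not FINN code):

  // One convolutional layer as a naive loop nest.
  // W,H: output width/height; M: output feature maps;
  // K: filter size; D: input depth (input feature maps).
  void conv_layer(const float* in,   // (W+K-1) x (H+K-1) x D, row-major
                  const float* filt, // M x K x K x D, row-major
                  float* out,        // W x H x M, row-major
                  int W, int H, int M, int K, int D) {
    for (int w = 0; w < W; ++w)
      for (int h = 0; h < H; ++h)
        for (int m = 0; m < M; ++m) {
          float acc = 0.0f;
          for (int x = 0; x < K; ++x)
            for (int y = 0; y < K; ++y)
              for (int d = 0; d < D; ++d)
                acc += in[((w + x) * (H + K - 1) + (h + y)) * D + d]
                     * filt[((m * K + x) * K + y) * D + d];
          out[(w * H + h) * M + m] = acc;  // output(w,h,m)
        }
  }

The six nested loops over a multi-megabyte parameter set are exactly where the billions of multiply-accumulates per image come from.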

Page 6: Increasingly Reduced Precision Networks

Floating point (FP) CNNs contain a lot of redundancy:
– Reducing precision has been shown to work down to 2b without loss of accuracy.
– B. Dally, EMDNN 2016: ternary networks on par with FP for AlexNet top-1 and top-5, and for ResNet-20/32/44/56.

Reducing to the extreme: binary and almost-binary neural networks (BNNs) work at a small loss of accuracy for large networks.

% classification error (lower is better):

Quantization                          MNIST    SVHN    CIFAR-10
Binary weights, binary activations    0.96%    2.53%   10.15%
Binary weights, FP activations        1.29%    2.30%    9.90%
FP weights, FP activations            0.94%    1.69%    7.62%

Sources: [4] http://arxiv.org/abs/1603.05279; [5] https://arxiv.org/abs/1602.02830

Page 7: Accuracy of Binary Networks Improving

Published results for FP CNNs, BNNs, and extreme reduced precision NNs.

• BNNs are new, and accuracy results are improving rapidly through, for example, new training techniques, topological changes, and other methods.

[Chart: Top-5 error on ImageNet (0-60%) versus publication date (mid-2009 through 2017) for FP CNNs, reduced precision networks, and BNNs.]

Page 8: Potential of Reduced Precision on FPGAs

Cost per operation is greatly reduced:
– For example, for a BNN, multiply-accumulate becomes XNOR with bit counting (see the sketch after the table below).

Memory cost is greatly reduced:
– Large networks can fit entirely into on-chip memory (OCM: UltraRAM, BRAM); the VU9P (16nm) offers 43 MB.
– More memory bandwidth, lower energy.

Today's FPGAs have a much higher peak performance for reduced-precision operations:
– When applications are sufficiently parallel, FPGA performance is inversely proportional to the cost per operation.
– Lower cost per op & massively parallel = more ops every cycle.

Precision   Cost per Op   Cost per Op   MB needed   TOps/s     TOps/s
            (LUT)         (DSP)         (AlexNet)   (KU115)*   (ZU19EG)*
1b          2.5           0             7.6         ~46        ~66
4b          16            0             30.5        ~11        ~16
8b          45            0             61          ~3         ~4
16b         15            0.5           122         ~1         ~1
32b         178           2             244         ~0.5       ~0.3

(~100x spread in peak performance between 1b and 32b.)

*Assumptions: the application can fill the device to 70% (fully parallelizable) at 250 MHz; devices offer on the order of 100Ks of LUTs and Ks of DSPs.
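A minimal C++ sketch of why the 1b row is so cheap (the ±1 bit encoding and the 64-bit word width are illustrative assumptions): with weights and activations stored one bit per value, a whole word of multiply-accumulates collapses into one XNOR and one popcount.

  #include <bitset>
  #include <cstdint>

  // Binarized dot product. Bit encoding: 1 -> +1, 0 -> -1.
  // XNOR marks positions where activation and weight agree (product +1);
  // popcount tallies the agreements. Assumes the vector length is 64*nwords.
  int binary_dot(const uint64_t* act, const uint64_t* wgt, int nwords) {
    int agree = 0;
    for (int i = 0; i < nwords; ++i) {
      uint64_t x = ~(act[i] ^ wgt[i]);           // XNOR
      agree += (int)std::bitset<64>(x).count();  // popcount
    }
    // Sum of {+1,-1} products: (#agree) - (#disagree) = 2*agree - 64*nwords.
    return 2 * agree - 64 * nwords;
  }

On the FPGA, the XNOR and popcount map onto plain LUTs with no DSPs, consistent with the low 1b cost per op in the table.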

Page 9: Potential of BNNs on FPGAs (ZU19EG)

[Roofline plot with marked points at 66, 40, 1, and 0.1 TOPS.]
– Fewer LUTs per op yields a higher peak performance.
– Staying on-chip achieves more of the peak.
Assumption: operational intensity for 8b and 1b AlexNet, assuming 1.45 GOps/image with 61 MB (8b) and 7.6 MB (1b) of parameters.

• Reduced precision allows us to scale NN performance on FPGAs to unprecedented levels.
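A rough check of those operational intensities from the quoted numbers (my arithmetic, not from the slides):

  8b: 1.45 GOps / 61 MB  ≈ 24 ops per byte of parameter traffic
  1b: 1.45 GOps / 7.6 MB ≈ 190 ops per byte, and 7.6 MB fits on-chip entirely

The binarized network therefore sits far to the right on the roofline, which is what lets it approach the ~40 TOPS point instead of the memory-bound ~0.1-1 TOPS region.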

Page 10: FINN

Exploitation of reduced precision neural networks through FINN: A Framework for Fast, Scalable Binarized Neural Network Inference.

https://arxiv.org/abs/1612.07119
http://arxiv.org/abs/1701.03400

Page 11: FINN Design Principles

Custom-tailored hardware for optimal performance and power efficiency:
– Customized data types.
– Customized dataflow architecture to match the network topology.
– Compile-time optimizations to simplify the generated hardware.

Keep all parameters on-chip:
– Higher energy efficiency and performance.

Provide flexibility in the architecture to scale solutions.

Support portability and rapid exploration through high-level design entry:
– C++ and, most recently, OpenCL.

Page 12: Heterogeneous Dataflow Architecture

Dataflow architecture:
– Not a systolic array with a scheduling network on processing engines and looping over PEs.

Customized to match each layer's compute requirement:
– Equivalent throughput through all layers, avoiding one-size-fits-all penalties (see the sizing sketch below).

Each layer consumes and produces data in the same order, to minimize buffering and latency:
– FIFOs, no ping-pong buffers.

Layers are different instantiations of a C++ template class (the MVTU).

[Diagram: systolic array vs. the chosen dataflow architecture, in which PE count tracks layer compute (e.g., 1 PE for a 1 MOPS layer, 10 PEs for a 10 MOPS layer).]
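A minimal sketch of the sizing rule implied by the diagram (the proportional-allocation heuristic and function below are illustrative assumptions, not the FINN allocator): give each layer a PE count proportional to its ops per frame, so all layers take roughly the same number of cycles and no stage starves the pipeline.

  #include <algorithm>
  #include <cstdio>
  #include <vector>

  // Allocate PEs so that each layer's cycles/frame (ops / PEs) is balanced.
  std::vector<int> allocate_pes(const std::vector<double>& ops_per_frame,
                                int pe_budget) {
    double total = 0;
    for (double o : ops_per_frame) total += o;
    std::vector<int> pes;
    for (double o : ops_per_frame)
      pes.push_back(std::max(1, (int)(pe_budget * o / total)));  // >= 1 PE
    return pes;
  }

  int main() {
    // Layers at 1 MOPS and 10 MOPS with 11 PEs -> 1 and 10 PEs,
    // so both layers need ~1M cycles per frame.
    for (int p : allocate_pes({1e6, 10e6}, 11)) std::printf("%d PE\n", p);
  }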

Page 13: Architecture of a Matrix-Vector Threshold Unit (MVTU)

Fully connected layers & convolutional layers are mapped onto matrix-vector threshold units (MVTUs).

MVTUs support folding over output feature maps (OFMs/neurons) and over weights (synapses).

Weight- and output-stationary: weights and popcounts are retained locally.

Max-pool units are optionally placed behind MVTUs.

[Diagram: MVTU datapath, annotated with weight (synaptic) folding and OFM (neuron) folding.]
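A back-of-the-envelope model of those two folding factors (a simplification under my assumptions: an MH x MW weight matrix, PE engines with SIMD lanes each, one bit-level MAC per lane per cycle, pipeline fill ignored; the parameter names match the template arguments on the next slide):

  // Cycles for one matrix-vector product on an MVTU (simplified model).
  long mvtu_cycles(long MH, long MW, long PE, long SIMD) {
    long neuron_fold  = MH / PE;    // OFM (neuron) folding
    long synapse_fold = MW / SIMD;  // weight (synaptic) folding
    return neuron_fold * synapse_fold;
  }
  // Example: a 1024x1024 FC layer with PE=32, SIMD=64 takes
  // (1024/32) * (1024/64) = 32 * 16 = 512 cycles per input vector.

Choosing PE and SIMD per layer is how the throughput of each stage in the heterogeneous pipeline of the previous slide can be equalized.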

Page 14: Synthesizable C++ Network Description

void DoCompute(ap_uint<64> * in, ap_uint<64> * out) {
#pragma HLS DATAFLOW
  // Stream definitions
  stream<ap_uint<64> > memInStrm("memInStrm");
  stream<ap_uint<64> > InStrm("InStrm");
  // ... remaining stream definitions (inter0..inter2, outstream) elided ...
  stream<ap_uint<64> > memOutStrm("memOutStrm");

  // Move image in from PS memory
  Mem2Stream<64, inBytesPadded>(in, memInStrm);

  // Layer instantiations, connected by streams
  StreamingMatrixVector<L0_SIMD, L0_PE, 16, L0_MW, L0_MH, L0_WMEM, L0_TMEM>
      (InStrm, inter0, weightMem0, thresMem0);
  StreamingMatrixVector<L1_SIMD, L1_PE, 16, L1_MW, L1_MH, L1_WMEM, L1_TMEM>
      (inter0, inter1, weightMem1, thresMem1);
  StreamingMatrixVector<L2_SIMD, L2_PE, 16, L2_MW, L2_MH, L2_WMEM, L2_TMEM>
      (inter1, inter2, weightMem2, thresMem2);
  StreamingMatrixVector<L3_SIMD, L3_PE, 16, L3_MW, L3_MH, L3_WMEM, L3_TMEM>
      (inter2, outstream, weightMem3, thresMem3);

  // Move results to PS memory
  StreamingCast<ap_uint<16>, ap_uint<64> >(outstream, memOutStrm);
  Stream2Mem<64, outBytesPadded>(memOutStrm, out);
}

Page 15: Work Flow for Exploration of NNs on FPGAs

• All code in C/C++.
• Can execute on CPU and FPGA; no RTL needed.
• First prototype integration with tiny-dnn and Theano (TensorFlow and Caffe in progress).

A fast workflow, integrated with standard frameworks, with the flexibility to support different topologies, sizes, rates, and resources across different devices (Z7045, KU115, Z7020).

Page 16: Experimental Results

– Embedded platforms (Zynq Z7045 & Z7020): ZC706 and the PYNQ open-source platform.
– Server-class accelerator: ADM_PCIE_8K5 in OpenPOWER (& x86 with SDAccel).

Page 17: Input Data

• MNIST: handwritten digits.
• SVHN: Street View house numbers.
• CIFAR-10: cats, dogs, etc.
• Playing cards: Solitaire demo (Xilinx demo center & Embedded World).
• ImageNet: in progress now.

Page 18: Test Networks

MLP: Multilayer Perceptron
– Input images: 28x28 pixels, black & white (handwritten digits).
– Layers: 3 FC layers, 1024 neurons each.
– Compute: 0.7 - 5.8 MOPS/frame (see the worked count below).

CNV: CNN (VGG-16 derivative)
– Input images: 32x32 pixels, RGB (SVHN, CIFAR-10, traffic signs).
– Layers: 2 (3x3) conv + max-pool + 2 (3x3) conv + max-pool + 2 conv + 3 FC.
– Compute: 0.1 - 1.2 GOPS/frame.
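As a sanity check on the MLP figure (my arithmetic, counting one multiply and one add per synapse, and assuming the standard 784-input, 10-class MNIST shape):

  ops/frame = 2 * (784*1024 + 1024*1024 + 1024*1024 + 1024*10)
            = 2 * 2,910,208
            ≈ 5.8 MOPS

which matches the top of the quoted range; the 0.7 MOPS lower bound is consistent with a narrower variant (e.g., 256 neurons per layer gives ≈ 0.67 MOPS by the same count).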

Page 19: Results - Performance, Latency, Power & Resources

Z7045, maximum throughput:

Network          FPS      GOPS/s   BRAM    LUT             Latency [us]   Power [W]
FC-MNIST-S       12.3M    8,200    130.5   86,110 (39%)    0.31           21.2
FC-MNIST-L       1.5M     9,085    398     79,097 (36%)    2.44           22.6
CNV-CIFAR10-S    21.9K    2,465    192     54,538 (25%)    283            11.7

Z7045, 12K FPS target:

Network          FPS      GOPS/s   BRAM    LUT             Latency [us]   Power [W]
FC-MNIST-S       12.2k    8.16     15.5    4,810 (2%)      240            8.1
FC-MNIST-L       12.2k    71       115     6,156 (3%)      282            7.9
CNV-CIFAR10-S    11.6k    1,306    158.5   40,404 (18%)    550            10

KU115, maximum throughput:

Network          FPS      GOPS/s   BRAM    LUT             Latency [us]   Power [W]
CNV-CIFAR10-L    12.0k    14,814   1814    392,947 (59%)   671            <41

Highlights:
– Unprecedented classification rates at maximum throughput: ~3x the classification rate of the best measured GPU numbers today.
– Ultra-low latency (a P4 GPU sits at ~11 ms): relevant for robotics, AR, and UAVs.
– Scalability down to extremely small footprints at the 12K FPS target.
– CNV-CIFAR10-L compute per frame is comparable to AlexNet.

Page 20: Status & Next Steps

Initial proofs of concept & demos are operational and demonstrate the potential.

Open-source release on PYNQ:
– With a Python API: https://github.com/Xilinx/BNN-PYNQ

We continue to progress the technology investigation:
– Larger NNs.
– Higher (but no more than 8b!) & mixed-precision NNs.
– Improving accuracy through novel techniques.
– Design-space trade-offs: accuracy vs. performance vs. resources.

Interested in understanding system-level integration better:
– How does ML plug into database systems?
– Heterogeneous at the system, node, or device level?

Page 21: Thank You.