Scaling Convolutional Neural Networks on Reconfigurable Logic
Michaela Blott, Principal Engineer, Xilinx Research
Page 2
Mission
“Investigate & exploit novel trends in machine learning
that play to the strengths of FPGAs: Reduced Precision Neural Networks”
Nick Fraser (Xilinx & USydney)
Giulio Gambardella (Xilinx)
Yaman Umuroglu (Xilinx & NTNU)
Page 3
Executive Summary
FPGAs can do trillions of reduced-precision synaptic operations per second, and neural nets can put this to good use.
Inference accelerators on today's hardware classify tens of thousands to millions of images per second at under 25 W.
Page 4
Background
CNN computation is linear algebra, originally on floating point data types
– Demands lots of computation and lots of parameters (memory)
• For ImageNet: AlexNet 244MB & 1.5 GOPS, VGG16 552MB & 30.8 GOPS, GoogleNet 41.9MB & 3.0 GOPS per frame (AlexNet's 244MB is simply its ~61M parameters at 4 bytes each)
– Not suitable for energy-constrained computing environments
Page 5
Convolutional Neural Networks
[Figure: convolutional layers transforming an input image into the classification «cat»]

output(w,h,m) += input(w+x, h+y, d) * filter(m, x, y, d);

Challenge: billions of floating point multiply-accumulate ops
& tens of megabytes of parameter data
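To make the indexing explicit, here is a minimal scalar sketch of that operation (a plain loop nest with illustrative names and bounds; not FINN code, which maps this onto hardware as shown later):

void conv_layer(const float *input, const float *filter, float *output,
                int W, int H, int M, int K, int D) {
  // W, H: output width/height; M: output feature maps;
  // K: kernel size; D: input feature maps (depth). No padding, stride 1,
  // so the input is (W+K-1) x (H+K-1) x D, laid out row-major.
  for (int m = 0; m < M; m++)
    for (int h = 0; h < H; h++)
      for (int w = 0; w < W; w++) {
        float acc = 0.0f;
        for (int y = 0; y < K; y++)
          for (int x = 0; x < K; x++)
            for (int d = 0; d < D; d++)
              acc += input[((h + y) * (W + K - 1) + (w + x)) * D + d]
                   * filter[((m * K + y) * K + x) * D + d];
        output[(h * W + w) * M + m] = acc;
      }
}

Each output element costs K*K*D multiply-accumulates, and each output map carries its own K*K*D filter weights, which is where the billions of ops and tens of megabytes come from.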
Page 6
Increasingly Reduced Precision Networks
Floating point (FP) CNNs contain a lot of redundancy
– Reducing precision is shown to work down to 2b without loss of accuracy
• B. Dally, EMDNN 2016: ternary networks on par with FP for AlexNet top-1 and top-5, and for ResNet-20/32/44/56
Reducing to the extreme: binary and almost-binary neural networks (BNNs) work at a small loss of accuracy for large networks
Quantization                        MNIST   SVHN    CIFAR-10
Binary weights, binary activations  0.96%   2.53%   10.15%
Binary weights, FP activations      1.29%   2.30%   9.90%
FP weights, FP activations          0.94%   1.69%   7.62%
% classification error (lower is better)
Source: [4] http://arxiv.org/abs/1603.05279;
[5] https://arxiv.org/abs/1602.02830
Page 7
Accuracy of Binary Networks Improving
Published results for FP CNNs, BNNs and extreme reduced precision NNs
• BNNs are new, and accuracy results are improving rapidly through, for example, new training techniques, topological changes and other methods
[Chart: Top-5 error on ImageNet (0-60%) against publication date (2009-2017) for FP CNNs, reduced precision networks and BNNs]
Page 8
Potential of Reduced Precision on FPGAs
Cost per operation is greatly reduced
– For example, for BNNs the multiply-accumulate becomes an XNOR with a bit count (see the sketch after the table below)
Memory cost is greatly reduced
– Large networks can fit entirely into on-chip memory (OCM) (UltraRAM, BRAM)
• VU9P (16nm): 43MB
– More memory bandwidth, lower energy
Today's FPGAs have a much higher peak performance for reduced precision operations
– FPGA performance is inversely proportional to the cost per operation when applications are sufficiently parallel
– Lower cost per op & massively parallel = more ops every cycle
Precision  Cost per Op (LUT)  Cost per Op (DSP)  MB needed (AlexNet)  TOps/s (KU115)*  TOps/s (ZU19EG)*
1b         2.5                0                  7.6                  ~46              ~66
4b         16                 0                  30.5                 ~11              ~16
8b         45                 0                  61                   ~3               ~4
16b        15                 0.5                122                  ~1               ~1
32b        178                2                  244                  ~0.5             ~0.3
(~100x spread between 1b and 32b)
*Assumptions: application can fill the device to 70% (fully parallelizable) at 250 MHz; devices offer 100Ks of LUTs and Ks of DSPs
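Two back-of-envelope notes on this table. The ~46 TOps/s for 1b on the KU115 is consistent with the footnote's assumptions: roughly 663K LUTs (per the KU115 product table; treat as approximate) x 70% fill / 2.5 LUTs per op x 250 MHz ≈ 46 TOps/s. And the XNOR-with-bit-count MAC behind the 1b row can be sketched in a few lines of C++; this is an illustrative scalar model, not the FINN implementation:

#include <cstdint>

// Binary dot product over 64 synapses. Weights and activations are packed
// bit vectors where bit value 1 encodes +1 and 0 encodes -1. XNOR marks the
// positions where weight and activation agree; the signed sum of all 64
// products is then (#agreements - #disagreements) = 2*popcount - 64.
static inline int binary_mac64(uint64_t weights, uint64_t activations) {
  uint64_t agree = ~(weights ^ activations);  // bitwise XNOR
  int pop = __builtin_popcountll(agree);      // GCC/Clang popcount builtin
  return 2 * pop - 64;
}

In the fabric this popcount maps onto LUTs rather than DSP slices, which is why the 1b row shows zero DSP cost.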
Page 9
Potential of BNNs on FPGAs (ZU19EG)
[Roofline chart (ZU19EG): attainable performance vs. operational intensity, with marked points at 66, 40, 1 and 0.1 TOPS]
Fewer LUTs per op yields a higher peak performance
Staying on-chip achieves more of the peak
Assumption: operational intensity for 8b and 1b AlexNet, assuming 1.45 GOps/image with 61MB (8b) and 7.6MB (1b) of parameters
• Reduced precision allows us to scale NN performance on FPGAs to unprecedented levels
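This is the standard roofline argument: attainable performance = min(peak compute, operational intensity x memory bandwidth). Under the stated assumption that all parameters are fetched once per image, the intensities work out to roughly 1.45 GOps / 61 MB ≈ 24 Ops/byte for 8b and 1.45 GOps / 7.6 MB ≈ 190 Ops/byte for 1b, so binarization both raises the compute ceiling (fewer LUTs per op) and moves the workload to the right on the roofline; keeping all 7.6 MB in OCM removes the bandwidth ceiling altogether.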
Page 10
Exploitation of Reduced Precision Neural Networks through
FINN: A Framework for Fast, Scalable Binarized Neural Network Inference
https://arxiv.org/abs/1612.07119
http://arxiv.org/abs/1701.03400
Page 11
FINN Design Principles
Custom-tailored hardware for optimal performance and power efficiency
– Customized data types
– Customized dataflow architecture to match the network topology
– Exploit compile-time optimizations to simplify the generated hardware
Keep all parameters on-chip
– Higher energy efficiency and performance
Provide flexibility in the architecture to scale solutions
Support portability and rapid exploration through high-level design entries
– C++ and, most recently, OpenCL
Page 12
Heterogeneous Dataflow Architecture
Dataflow architecture
– Not a systolic array with a scheduling network over processing engines and looping over PEs
Customized to match each layer's compute requirement
– Equivalent throughput through all layers (see the sizing sketch below)
– Avoids one-size-fits-all penalties
Each layer consumes and produces in the same order, to minimize buffering and latency
– FIFOs, no ping-pong buffers
Layers are different instantiations of a C++ template class (MVTU)
[Figure: systolic array vs. the chosen dataflow architecture; a 1 MOPS layer gets 1 PE, a 10 MOPS layer gets 10 PEs]
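A minimal sketch of that sizing rule (illustrative; assumes one op per PE per cycle, and the real allocation also folds over SIMD lanes):

// Choose each layer's PE count so all layers take roughly the same number
// of cycles per frame: a 10 MOPS layer gets 10x the PEs of a 1 MOPS layer.
// layer_ops: operations per frame; cycle_budget: target cycles per frame.
long pe_count(long layer_ops, long cycle_budget) {
  return (layer_ops + cycle_budget - 1) / cycle_budget;  // ceiling division
}

With the budget equalized, no layer stalls its neighbors, and the FIFOs between layers can stay shallow.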
Page 13
Architecture of a Matrix-Vector Threshold Unit (MVTU)
Fully connected layers & convolutional layers are mapped onto matrix-vector multiply threshold units (MVTUs)
MVTUs support folding over OFMs (neuron folding) and over weights (synaptic folding)
Weight and output stationary (weights and popcounts are retained locally)
Max pool units are optionally placed behind MVTUs
[Figure: MVTU datapath, annotated with weight folding and OFM folding]
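Folding trades throughput for resources: with PE parallel neurons and SIMD synapses per PE per cycle, one matrix-vector product over an H x W weight matrix takes about (H/PE) x (W/SIMD) cycles. A sketch of that arithmetic, following the folding description in the FINN paper (assumes PE divides H and SIMD divides W):

// Cycles for one matrix-vector multiply under neuron and synapse folding.
// H: matrix height (neurons / OFM channels); W: matrix width (synapses);
// PE: processing elements; SIMD: synapses processed per PE per cycle.
long mvtu_cycles(long H, long W, long PE, long SIMD) {
  long neuron_fold  = H / PE;    // neurons time-multiplexed onto each PE
  long synapse_fold = W / SIMD;  // steps to stream one neuron's inputs
  return neuron_fold * synapse_fold;
}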
Page 14
Synthesizable C++ Network Description
void DoCompute(ap_uint<64> * in, ap_uint<64> * out) {
#pragma HLS DATAFLOW
  // Stream definitions
  stream<ap_uint<64> > memInStrm("memInStrm");
  stream<ap_uint<64> > InStrm("InStrm");
  // ... (remaining stream definitions elided) ...
  stream<ap_uint<64> > memOutStrm("memOutStrm");
  // Move image in from PS memory
  Mem2Stream<64, inBytesPadded>(in, memInStrm);
  // Layer instantiations, connected by streams
  StreamingMatrixVector<L0_SIMD, L0_PE, 16, L0_MW, L0_MH, L0_WMEM, L0_TMEM>
    (InStrm, inter0, weightMem0, thresMem0);
  StreamingMatrixVector<L1_SIMD, L1_PE, 16, L1_MW, L1_MH, L1_WMEM, L1_TMEM>
    (inter0, inter1, weightMem1, thresMem1);
  StreamingMatrixVector<L2_SIMD, L2_PE, 16, L2_MW, L2_MH, L2_WMEM, L2_TMEM>
    (inter1, inter2, weightMem2, thresMem2);
  StreamingMatrixVector<L3_SIMD, L3_PE, 16, L3_MW, L3_MH, L3_WMEM, L3_TMEM>
    (inter2, outstream, weightMem3, thresMem3);
  StreamingCast<ap_uint<16>, ap_uint<64> >(outstream, memOutStrm);
  // Move results to PS memory
  Stream2Mem<64, outBytesPadded>(memOutStrm, out);
}
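The per-layer template parameters are compile-time constants. Judging from their order (treat the exact meanings as assumptions), a hypothetical layer-0 configuration could look like the following, with memory depths falling out of the folding factors; all values are illustrative:

// Hypothetical configuration for layer 0 (values are illustrative only).
#define L0_SIMD 64    // input elements consumed per PE per cycle
#define L0_PE   32    // neurons computed in parallel
#define L0_MW   832   // weight matrix width  (layer inputs, padded)
#define L0_MH   1024  // weight matrix height (layer neurons)
#define L0_WMEM ((L0_MW / L0_SIMD) * (L0_MH / L0_PE))  // weight words per PE
#define L0_TMEM (L0_MH / L0_PE)                        // thresholds per PE

The third template argument (16) is plausibly the width of the popcount accumulator.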
Page 15
Work Flow for Exploration of NNs on FPGAs
• All code in C/C++
• Can execute on CPU and FPGA
  - No RTL needed
• First prototype integration with tiny-dnn and Theano (TensorFlow and Caffe in progress)
Fast workflow, integrated with standard frameworks, with the flexibility to support different topologies, sizes, rates and resources on different devices (Z7045, KU115, Z7020)
Page 16
Experimental Results
– Embedded platforms (Zynq Z7045 & 7020): ZC706, PYNQ open source platform
– Server class accelerator: ADM_PCIE_8K5 in OpenPOWER (& x86 with SDAccel)
Page 17
Input Data
• MNIST handwritten digits
• Streetview house numbers (SVHN)
• CIFAR-10: cats, dogs, etc.
• Playing cards
• ImageNet in progress now
Solitaire demo (Xilinx demo center & Embedded World)
Page 18
Test Networks
MLP: Multilayer Perceptron
– Input images: 28x28 pixels, black & white (handwritten digits)
– Number of layers: 3 FC layers, 1024 neurons each
– Compute: 0.7 - 5.8 MOPS/frame
CNV: CNN (VGG-16 derivative)
– Input images: 32x32 pixels, RGB (SVHN, CIFAR-10, traffic signs)
– Number of layers: 2 (3x3) conv + max pool + 2 (3x3) conv + max pool + 2 conv + 3 FC
– Compute: 0.1 - 1.2 GOPS/frame
Page 19
Results - Performance, Latency, Power & Resources
Z7045, max throughput:
Network          FPS     GOPS/s  BRAM   LUT            Latency [us]  Power [W]
FC-MNIST - S     12.3M   8,200   130.5  86,110 (39%)   0.31          21.2
FC-MNIST - L     1.5M    9,085   398    79,097 (36%)   2.44          22.6
CNV-CIFAR10 - S  21.9K   2,465   192    54,538 (25%)   283           11.7

Z7045, 12K FPS target:
Network          FPS     GOPS/s  BRAM   LUT            Latency [us]  Power [W]
FC-MNIST - S     12.2k   8.16    15.5   4,810 (2%)     240           8.1
FC-MNIST - L     12.2k   71      115    6,156 (3%)     282           7.9
CNV-CIFAR10 - S  11.6k   1,306   158.5  40,404 (18%)   550           10

KU115, max throughput:
Network          FPS     GOPS/s  BRAM   LUT             Latency [us]  Power [W]
CNV-CIFAR10 - L  12.0k   14,814  1814   392,947 (59%)   671           <41

Highlights: unprecedented classification rates, roughly 3x the best measured GPU numbers today; ultra-low latency (a P4 GPU is ~11 ms) for robotics, AR and UAVs; scalability down to extremely small footprints at the fixed 12K FPS target; the compute of CNV-CIFAR10 - L is comparable to AlexNet.
Page 20
Status & Next Steps
Initial proofs of concept & demos are operational and demonstrate the potential
Open source release on PYNQ
– With a Python API - https://github.com/Xilinx/BNN-PYNQ
We continue to progress the technology investigation
– Large NNs
– Higher (but no more than 8b!) & mixed precision NNs
– Improving accuracy through novel techniques
– Design space trade-offs: accuracy vs. performance vs. resources
Interested to understand system-level integration better
– How does ML plug into database systems?
– Heterogeneous at system, node, or device level?
Page 21
Thank You.