
LP-BNN: Ultra-low-Latency BNN Inference with Layer Parallelism

Tong Geng¹,², Tianqi Wang¹, Chunshu Wu¹, Chen Yang¹, Shuaiwen Leon Song², Ang Li², Martin Herbordt¹

¹Boston University   ²Pacific Northwest National Laboratory


CNN: the most popular algorithm in ML

▪ Widely used in computer vision tasks such as object classification, detection, and recognition


Birth of BNN

▪ Limitations of CNN

▪ High computation + memory intensity

▪ Long Latency

▪ 32-bit floating-point -> 8-bit fixed-point -> binary

▪ Computation and memory access are no longer intensive

▪ Potentially, much lower latency

▪ Accuracy still lower than CNN, but improving


DNN: BNN vs. CNN

✓ Easier computation: floating-point multiply-add (FMA) operations ➔ single-bit XNOR/POPCOUNT (see the sketch below)

✓ Less storage: floating-point parameters and activations ➔ single bits

✓ Energy efficient: ideal for edge devices

“Accelerating Neural Networks with Binary Arithmetic,” https://ai.intel.com/accelerating-neural-networks-binary-arithmetic.
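As a concrete illustration of the arithmetic change (a minimal sketch in Python, not material from the talk; the helper names are mine), the snippet below contrasts a conventional floating-point dot product with its binarized equivalent, where +1/-1 values are packed into bit words and the dot product collapses to XNOR plus POPCOUNT:

```python
# Minimal sketch (not from the talk): an 8-element dot product in
# floating point vs. its binarized XNOR/POPCOUNT equivalent.

def float_dot(x, w):
    """Conventional CNN-style dot product: one multiply-add per element."""
    return sum(xi * wi for xi, wi in zip(x, w))

def pack_bits(values):
    """Pack +1/-1 values into an integer bit word (+1 -> 1, -1 -> 0)."""
    word = 0
    for i, v in enumerate(values):
        if v > 0:
            word |= 1 << i
    return word

def binary_dot(x_word, w_word, n):
    """BNN dot product: XNOR the packed words, count matching bits,
    then map the popcount back to the +1/-1 domain."""
    matches = bin(~(x_word ^ w_word) & ((1 << n) - 1)).count("1")
    return 2 * matches - n          # matches minus mismatches

# The two formulations agree on +1/-1 inputs.
x = [1, -1, 1, 1, -1, -1, 1, -1]
w = [1, 1, -1, 1, -1, 1, 1, -1]
assert float_dot(x, w) == binary_dot(pack_bits(x), pack_bits(w), len(x))
```

The point is that one word-wide XNOR plus a popcount replaces many multiply-adds, which is what the bit-level resources on an FPGA are well suited for.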


Inference of BNN on different platforms

• Hardware utilization when accelerating AlexNet inference:

• Batch size of 1:

× GPU: ~1%. CPU: ~6%.

• Batch size of 10:

× GPU: ~7%. CPU: ~10%.

✓ FPGA: >60%: millions of one-bit ALUs on a single device.


Eriko Nurvitadhi, David Sheffield, Jaewoong Sim, Asit Mishra, Ganesh Venkatesh, and Debbie Marr. 2016. Accelerating binarized neural networks: Comparison of FPGA, CPU, GPU, and ASIC. In Proc. International Conference on Field-Programmable Technology (FPT). IEEE, 77–84.


Challenges

• Challenges in achieving truly low-latency inference on an FPGA:

1. The critical Normalization Layer (NL) uses full-precision floating point (i.e., 2 FP MUL/DIV + 3 FP ADD/SUB).

2. Existing works process layers sequentially, so their latencies accumulate with no overlap.

3. Optimal designs for all layers must be configured on the FPGA simultaneously, with no reconfiguration.


Intra-layer Fusion

• Addressing the 1st challenge: intra-layer fusion

• 3-to-1 fusion: ACT, NL, and BL are fused into a single Comparison layer.

• Simplified computation: the 2 comparisons from ACT and BL and the 5 floating-point operations from NL become one integer comparison.

• Less storage: the 4 floating-point variables in NL become 1 integer.
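A minimal sketch of why the fusion works (variable names are illustrative, not from the paper): batch normalization is monotonic in its input, so normalizing an integer popcount result and then binarizing it gives the same sign as comparing the raw integer against one precomputed per-channel threshold.

```python
import math

def fused_threshold(gamma, beta, mu, var, eps=1e-5):
    """Fold the four per-channel batch-norm constants into one integer.
    For gamma > 0, sign(gamma * (x - mu) / sqrt(var + eps) + beta)
    flips exactly where the integer input x crosses
    tau = mu - beta * sqrt(var + eps) / gamma."""
    tau = mu - beta * math.sqrt(var + eps) / gamma
    return math.ceil(tau)           # x is an integer popcount sum

def fused_act_norm_bin(x_int, threshold):
    """One integer comparison replaces the normalization layer's five
    floating-point ops plus the activation/binarization comparisons.
    (For gamma < 0 the comparison direction simply flips.)"""
    return 1 if x_int >= threshold else -1

# Example: both paths agree for an integer convolution output of 7.
gamma, beta, mu, var = 0.8, -0.5, 5.0, 4.0
x = 7
reference = 1 if gamma * (x - mu) / math.sqrt(var + 1e-5) + beta >= 0 else -1
assert fused_act_norm_bin(x, fused_threshold(gamma, beta, mu, var)) == reference
```

This is the "4 floating-point variables ➔ 1 integer" saving on the slide: gamma, beta, mu, and var collapse into a single threshold per output channel.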


Intra-layer Fusion for Networks with Shortcuts

• ResNet


(Figure: (A) the original ResNet block, with CONV1/CONV2/CONV3, popcount stages POP1/POP3, and full-precision NORM1/NORM3 layers, is simplified into (B) an optimized block in which the normalization and binarization steps are replaced by a Threshold Lookup (TL) feeding integer comparisons.)
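One way to read the optimized graph, sketched below under my own assumptions (the paper's exact construction may differ): with a shortcut, the binarized output depends on the sum of two normalized popcounts, so the threshold on one branch's popcount is no longer a single constant but a small table indexed by the other branch's popcount.

```python
import math

def build_threshold_lut(bn1, bn3, p1_range):
    """For each possible popcount p1 of one branch, precompute the
    threshold the other branch's popcount p3 must reach so that
    sign(BN1(p1) + BN3(p3)) is +1.  bn1 / bn3 are (gamma, beta, mu, var)
    tuples; gamma3 > 0 is assumed to keep the comparison direction fixed."""
    g1, b1, m1, v1 = bn1
    g3, b3, m3, v3 = bn3
    s1, s3 = math.sqrt(v1 + 1e-5), math.sqrt(v3 + 1e-5)
    lut = {}
    for p1 in p1_range:
        y1 = g1 * (p1 - m1) / s1 + b1          # shortcut branch, normalized
        # y1 + g3 * (p3 - m3) / s3 + b3 >= 0  <=>  p3 >= tau(p1)
        lut[p1] = math.ceil(m3 - (y1 + b3) * s3 / g3)
    return lut

def fused_residual_output(p1, p3, lut):
    """Binarized residual output via one table lookup and one comparison."""
    return 1 if p3 >= lut[p1] else -1
```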


Inter-layer Fusion

• Addressing the 2nd challenge: inter-layer fusion

• Fine-grained pipelining fuses the CONV and 1st FC layers.

• An image is processed following the data dependencies between layers.

• Layers are processed in parallel ➔ overlapped latencies (see the toy model below).
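To make the latency claim concrete, here is a toy analytical model (my own illustration with made-up numbers, not the paper's analysis): once a downstream layer can start as soon as its predecessor has produced the first rows it needs, total latency approaches the slowest layer plus a short pipeline-fill delay instead of the sum of all layers.

```python
# Toy latency model (my own illustration with made-up numbers, not the
# paper's analysis).

def sequential_latency(layer_latencies):
    """Layers run one after another: latencies simply accumulate."""
    return sum(layer_latencies)

def pipelined_latency(layer_latencies, fill_fraction=0.1):
    """Layers overlap once a small fraction of each layer's output
    (e.g. the first few rows of its feature map) is available to the
    next layer; only the pipeline-fill portions and the slowest layer
    remain on the critical path."""
    fill = sum(fill_fraction * t for t in layer_latencies[:-1])
    return fill + max(layer_latencies)

latencies = [40, 60, 55, 30]             # hypothetical per-layer latencies (us)
print(sequential_latency(latencies))     # 185
print(pipelined_latency(latencies))      # 75.5
```

This overlap is the effect the later "Single-Layer Latency vs. Whole-Network Latency" evaluation slide examines.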


Workload Balancing

• Each layer's workload: NIC * NOC * K * K

• NIC = Number of Input Channels

• NOC = Number of Output Channels

• K = Kernel size

• Terms: PIC, POC, SIC, SOC

• PIC = Parallelism over Input Channels

• POC = Parallelism over Output Channels

• SIC = Sequential iterations over Input Channels

• SOC = Sequential iterations over Output Channels

• Match the data production and consumption rates of adjacent layers by adjusting PIC, POC, SIC, and SOC (see the sketch below)
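A minimal sketch of the balancing rule under simplified assumptions (the layer shapes and parallelism factors below are hypothetical, and I take SIC = NIC / PIC and SOC = NOC / POC): adjacent layers keep each other busy when their per-image cycle counts match.

```python
# Sketch of the balancing rule (my own formulation): a layer iterates
# SIC = NIC / PIC times over input-channel groups and SOC = NOC / POC
# times over output-channel groups, so its per-image cycle count is
# roughly SIC * SOC * K * K * (output pixels).  Adjacent layers stay
# balanced when these cycle counts match.

def layer_cycles(nic, noc, k, out_h, out_w, pic, poc):
    sic = nic // pic        # sequential passes over input channels
    soc = noc // poc        # sequential passes over output channels
    return sic * soc * k * k * out_h * out_w

# Hypothetical back-to-back CONV layers with VGG-like shapes.
c1 = layer_cycles(nic=128, noc=256, k=3, out_h=28, out_w=28, pic=64, poc=64)
c2 = layer_cycles(nic=256, noc=256, k=3, out_h=28, out_w=28, pic=64, poc=128)
assert c1 == c2   # doubling POC offsets the doubled NIC, so rates match
```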


Parameterized Architecture

• Addressing the 3rd challenge: a parameterized architecture design

• The architecture is flexible enough to support the workload balancing above.

• All layers are configured on the FPGA simultaneously, using model parallelism.


Abbreviations: PE = Processing Element, MAC = Multiply-Accumulate Unit, TB = Threshold Buffer, SWM = Shared Weight Memory, VPB = Vertical Pooling Buffer, HPB = Horizontal Pooling Buffer, SIDSR = Shared Input Data Shift Registers, CP = Control Processor

Parameters: K = filter kernel size, W = width of the input feature map, P = pooling kernel size

(Figure: block diagram of the parameterized architecture. #POC processing elements each contain #PIC XNOR units feeding a POPCOUNT tree, an accumulator, and a comparison against the Threshold Buffer. Inputs arrive through K bank sets of Shared Input Data Shift Registers, weights come from K² bank sets of Shared Weight Memory, POC/32 pooling engines use the vertical and horizontal pooling buffers, and a Control Processor sequences the #SIC/#SOC iterations.)
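Reading the figure as a dataflow, a single PE's work for one output pixel can be sketched as follows (illustrative Python, not RTL; the word widths and loop structure are my own reading of the diagram):

```python
# Illustrative Python (not RTL): one PE's work for one output pixel, as
# I read the block diagram.  Word widths and loop structure are my own
# assumptions.

def pe_output_pixel(input_words, weight_words, threshold, pic):
    """input_words / weight_words: one PIC-bit packed word per
    (kernel position, input-channel group); threshold: the fused
    per-output-channel integer held in the Threshold Buffer (TB)."""
    mask = (1 << pic) - 1
    acc = 0
    for x_word, w_word in zip(input_words, weight_words):
        matches = bin(~(x_word ^ w_word) & mask).count("1")   # XNOR + POPCOUNT
        acc += 2 * matches - pic     # signed contribution of PIC products
    return 1 if acc >= threshold else -1                      # compare with TB
```

In hardware, each loop iteration corresponds to one cycle, with #POC PEs running in parallel and the threshold coming from the TB as in the intra-layer fusion sketch earlier.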


Evaluation: Latency Reduction from Intra-layer Fusion


(Chart: per-layer latency breakdown, 0–100%, split into XNOR, POPCOUNT, BN & SUM-FP, COMPARE, and OTHER components.)


Evaluation: Single-Layer Latency vs. Whole-Network Latency

(Chart: latency in μs, 0–400, for CONV1 through CONV13, FC1, and the whole VGG network.)


Evaluation: Hardware Resource Saving from Workload Balancing


Evaluation: BNN Inference Latency using LP-BNN
