
LP-BNN: Ultra-low-Latency BNN Inference with Layer Parallelism

Tong Geng¹,², Tianqi Wang¹, Chunshu Wu¹, Chen Yang¹, Shuaiwen Leon Song², Ang Li², Martin Herbordt¹

¹Boston University   ²Pacific Northwest National Laboratory


CNN: the most popular algorithm in ML

▪ Widely used in computer vision tasks such as object classification, detection, and recognition


Birth of BNN

▪ Limitations of CNN

▪ High computation + memory intensity

▪ Long Latency

▪ 32-bit floating-point -> 8-bit fixed-point -> binary

▪ Computation and memory access are no longer intensive

▪ Potentially, much lower latency

▪ Accuracy still lower than CNN, but improving


DNN: BNN vs. CNN

✓ Easier computation: floating-point multiply-add (FMA) operations ➔ single-bit XNOR/POPCOUNT (see the sketch below)

✓ Less storage: floating-point parameters and activations ➔ single bits

✓ Energy efficient: ideal for edge devices

“Accelerating Neural Networks with Binary Arithmetic,” https://ai.intel.com/accelerating-neural-networks-binary-arithmetic.
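As a concrete illustration of the arithmetic change (a minimal sketch in Python, not material from the talk; the helper names are mine), the snippet below contrasts a conventional floating-point dot product with its binarized equivalent, where +1/-1 values are packed into bit words and the dot product collapses to XNOR plus POPCOUNT:

```python
# Minimal sketch (not from the talk): an 8-element dot product in
# floating point vs. its binarized XNOR/POPCOUNT equivalent.

def float_dot(x, w):
    """Conventional CNN-style dot product: one multiply-add per element."""
    return sum(xi * wi for xi, wi in zip(x, w))

def pack_bits(values):
    """Pack +1/-1 values into an integer bit word (+1 -> 1, -1 -> 0)."""
    word = 0
    for i, v in enumerate(values):
        if v > 0:
            word |= 1 << i
    return word

def binary_dot(x_word, w_word, n):
    """BNN dot product: XNOR the packed words, count matching bits,
    then map the popcount back to the +1/-1 domain."""
    matches = bin(~(x_word ^ w_word) & ((1 << n) - 1)).count("1")
    return 2 * matches - n          # matches minus mismatches

# The two formulations agree on +1/-1 inputs.
x = [1, -1, 1, 1, -1, -1, 1, -1]
w = [1, 1, -1, 1, -1, 1, 1, -1]
assert float_dot(x, w) == binary_dot(pack_bits(x), pack_bits(w), len(x))
```

The point is that one word-wide XNOR plus a popcount replaces many multiply-adds, which is what the bit-level resources on an FPGA are well suited for.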


Inference of BNN on different platforms

• Hardware utilization when accelerating AlexNet inference:

• Batch size of 1:

× GPU: ~1%. CPU: ~6%.

• Batch size of 10:

× GPU: ~7%. CPU: ~10%.

✓ FPGA: >60%: millions of one-bit ALUs on a single device.


Eriko Nurvitadhi, David Sheffield, Jaewoong Sim, Asit Mishra, Ganesh Venkatesh, and Debbie Marr. 2016. Accelerating binarized neural networks: Comparison of FPGA, CPU, GPU, and ASIC. In Proc. International Conference on Field-Programmable Technology (FPT). IEEE, 77–84.


Challenges

• Challenges in achieving truly low-latency inference on an FPGA:

1. The critical Normalization Layer (NL) uses full-precision floating point (i.e., 2 FP MUL/DIV + 3 FP ADD/SUB).

2. Existing works process layers sequentially, so their latencies accumulate with no overlap.

3. Optimal designs for all layers must be configured on the FPGA simultaneously, with no reconfiguration.


Intra-layer Fusion

• Addressing the 1st challenge: intra-layer fusion

• 3-to-1 fusion: ACT, NL, and BL are fused into a single Comparison layer.

• Simplified computation: the 2 comparisons from ACT and BL and the 5 floating-point operations from NL become one integer comparison.

• Less storage: the 4 floating-point variables in NL become 1 integer.
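A minimal sketch of why the fusion works (variable names are illustrative, not from the paper): batch normalization is monotonic in its input, so normalizing an integer popcount result and then binarizing it gives the same sign as comparing the raw integer against one precomputed per-channel threshold.

```python
import math

def fused_threshold(gamma, beta, mu, var, eps=1e-5):
    """Fold the four per-channel batch-norm constants into one integer.
    For gamma > 0, sign(gamma * (x - mu) / sqrt(var + eps) + beta)
    flips exactly where the integer input x crosses
    tau = mu - beta * sqrt(var + eps) / gamma."""
    tau = mu - beta * math.sqrt(var + eps) / gamma
    return math.ceil(tau)           # x is an integer popcount sum

def fused_act_norm_bin(x_int, threshold):
    """One integer comparison replaces the normalization layer's five
    floating-point ops plus the activation/binarization comparisons.
    (For gamma < 0 the comparison direction simply flips.)"""
    return 1 if x_int >= threshold else -1

# Example: both paths agree for an integer convolution output of 7.
gamma, beta, mu, var = 0.8, -0.5, 5.0, 4.0
x = 7
reference = 1 if gamma * (x - mu) / math.sqrt(var + 1e-5) + beta >= 0 else -1
assert fused_act_norm_bin(x, fused_threshold(gamma, beta, mu, var)) == reference
```

This is the "4 floating-point variables ➔ 1 integer" saving on the slide: gamma, beta, mu, and var collapse into a single threshold per output channel.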


Intra-layer Fusion for Networks with Shortcuts

• ResNet


(Figure: (A) the original ResNet block, with CONV1/CONV2/CONV3, popcount stages POP1/POP3, and full-precision NORM1/NORM3 layers, is simplified into (B) an optimized block in which the normalization and binarization steps are replaced by a Threshold Lookup (TL) feeding integer comparisons.)
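One way to read the optimized graph, sketched below under my own assumptions (the paper's exact construction may differ): with a shortcut, the binarized output depends on the sum of two normalized popcounts, so the threshold on one branch's popcount is no longer a single constant but a small table indexed by the other branch's popcount.

```python
import math

def build_threshold_lut(bn1, bn3, p1_range):
    """For each possible popcount p1 of one branch, precompute the
    threshold the other branch's popcount p3 must reach so that
    sign(BN1(p1) + BN3(p3)) is +1.  bn1 / bn3 are (gamma, beta, mu, var)
    tuples; gamma3 > 0 is assumed to keep the comparison direction fixed."""
    g1, b1, m1, v1 = bn1
    g3, b3, m3, v3 = bn3
    s1, s3 = math.sqrt(v1 + 1e-5), math.sqrt(v3 + 1e-5)
    lut = {}
    for p1 in p1_range:
        y1 = g1 * (p1 - m1) / s1 + b1          # shortcut branch, normalized
        # y1 + g3 * (p3 - m3) / s3 + b3 >= 0  <=>  p3 >= tau(p1)
        lut[p1] = math.ceil(m3 - (y1 + b3) * s3 / g3)
    return lut

def fused_residual_output(p1, p3, lut):
    """Binarized residual output via one table lookup and one comparison."""
    return 1 if p3 >= lut[p1] else -1
```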


Inter-layer Fusion

• Addressing the 2nd challenge: inter-layer fusion

• Fine-grained pipelining fuses the CONV and 1st FC layers.

• An image is processed following the data dependencies between layers.

• Layers are processed in parallel ➔ overlapped latencies (see the toy model below).
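To make the latency claim concrete, here is a toy analytical model (my own illustration with made-up numbers, not the paper's analysis): once a downstream layer can start as soon as its predecessor has produced the first rows it needs, total latency approaches the slowest layer plus a short pipeline-fill delay instead of the sum of all layers.

```python
# Toy latency model (my own illustration with made-up numbers, not the
# paper's analysis).

def sequential_latency(layer_latencies):
    """Layers run one after another: latencies simply accumulate."""
    return sum(layer_latencies)

def pipelined_latency(layer_latencies, fill_fraction=0.1):
    """Layers overlap once a small fraction of each layer's output
    (e.g. the first few rows of its feature map) is available to the
    next layer; only the pipeline-fill portions and the slowest layer
    remain on the critical path."""
    fill = sum(fill_fraction * t for t in layer_latencies[:-1])
    return fill + max(layer_latencies)

latencies = [40, 60, 55, 30]             # hypothetical per-layer latencies (us)
print(sequential_latency(latencies))     # 185
print(pipelined_latency(latencies))      # 75.5
```

This overlap is the effect the later "Single-Layer Latency vs. Whole-Network Latency" evaluation slide examines.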


Workload Balancing

• Each layer's workload: NIC * NOC * K * K

• NIC = Number of Input Channels

• NOC = Number of Output Channels

• K = Kernel size

• Terms: PIC, POC, SIC, SOC

• PIC = Parallelism over Input Channels

• POC = Parallelism over Output Channels

• SIC = Sequential iterations over Input Channels

• SOC = Sequential iterations over Output Channels

• Match the data production and consumption rates of adjacent layers by adjusting PIC, POC, SIC, and SOC (see the sketch below)
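A minimal sketch of the balancing rule under simplified assumptions (the layer shapes and parallelism factors below are hypothetical, and I take SIC = NIC / PIC and SOC = NOC / POC): adjacent layers keep each other busy when their per-image cycle counts match.

```python
# Sketch of the balancing rule (my own formulation): a layer iterates
# SIC = NIC / PIC times over input-channel groups and SOC = NOC / POC
# times over output-channel groups, so its per-image cycle count is
# roughly SIC * SOC * K * K * (output pixels).  Adjacent layers stay
# balanced when these cycle counts match.

def layer_cycles(nic, noc, k, out_h, out_w, pic, poc):
    sic = nic // pic        # sequential passes over input channels
    soc = noc // poc        # sequential passes over output channels
    return sic * soc * k * k * out_h * out_w

# Hypothetical back-to-back CONV layers with VGG-like shapes.
c1 = layer_cycles(nic=128, noc=256, k=3, out_h=28, out_w=28, pic=64, poc=64)
c2 = layer_cycles(nic=256, noc=256, k=3, out_h=28, out_w=28, pic=64, poc=128)
assert c1 == c2   # doubling POC offsets the doubled NIC, so rates match
```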


Parameterized Architecture

• Addressing the 3rd challenge: a parameterized architecture design

• The architecture is flexible enough to support the workload balancing above.

• All layers are configured on the FPGA simultaneously, using model parallelism.


Abbreviations: PE = Processing Element, MAC = Multiply-Accumulate Unit, TB = Threshold Buffer, SWM = Shared Weight Memory, VPB = Vertical Pooling Buffer, HPB = Horizontal Pooling Buffer, SIDSR = Shared Input Data Shift Registers, CP = Control Processor

Parameters: K = filter kernel size, W = width of the input feature map, P = pooling kernel size

(Figure: block diagram of the parameterized architecture. #POC processing elements each contain #PIC XNOR units feeding a POPCOUNT tree, an accumulator, and a comparison against the Threshold Buffer. Inputs arrive through K bank sets of Shared Input Data Shift Registers, weights come from K² bank sets of Shared Weight Memory, POC/32 pooling engines use the vertical and horizontal pooling buffers, and a Control Processor sequences the #SIC/#SOC iterations.)
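Reading the figure as a dataflow, a single PE's work for one output pixel can be sketched as follows (illustrative Python, not RTL; the word widths and loop structure are my own reading of the diagram):

```python
# Illustrative Python (not RTL): one PE's work for one output pixel, as
# I read the block diagram.  Word widths and loop structure are my own
# assumptions.

def pe_output_pixel(input_words, weight_words, threshold, pic):
    """input_words / weight_words: one PIC-bit packed word per
    (kernel position, input-channel group); threshold: the fused
    per-output-channel integer held in the Threshold Buffer (TB)."""
    mask = (1 << pic) - 1
    acc = 0
    for x_word, w_word in zip(input_words, weight_words):
        matches = bin(~(x_word ^ w_word) & mask).count("1")   # XNOR + POPCOUNT
        acc += 2 * matches - pic     # signed contribution of PIC products
    return 1 if acc >= threshold else -1                      # compare with TB
```

In hardware, each loop iteration corresponds to one cycle, with #POC PEs running in parallel and the threshold coming from the TB as in the intra-layer fusion sketch earlier.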


Evaluation: Latency Reduction from Intra-layer Fusion


(Chart: per-layer latency breakdown, 0–100%, split into XNOR, POPCOUNT, BN & SUM-FP, COMPARE, and OTHER components.)


Evaluation: Single-Layer Latency vs. Whole-Network Latency

(Chart: latency in μs, 0–400, for CONV1 through CONV13, FC1, and the whole VGG network.)


Evaluation: Hardware Resource Saving from Workload Balancing


Evaluation: BNN Inference Latency using LP-BNN
