Improving the Performance of OpenCL-based FPGA Accelerator ... · Improving the Performance of...

Preview:

Citation preview

ImprovingthePerformanceofOpenCL-basedFPGAAcceleratorforConvolutionalNeuralNetwork

JialiangZhangandJingLi

Outline

• Background• Motivation• Balance analysis model• Proposed design• Performance• Conclusion

2

ConvolutionalNeuralNetwork

3

Marco Architecture of VGG

Imagefrom:https://blog.heuritech.com/2016/02/29/a-brief-report-of-the-heuritech-deep-learning-meetup-5/l

OpenCL Framework

Application Source CodeOpenCL Kernel

OpenCL FrameworkOpenCL CHost API

FPGA

InfrastructureKernel

FPGARuntimeFPGA Driver

Offline FPGA OpenCL Compiler

CLANG Frontend

IR Analysis &Optimization

VerilogBackend

BasicBlockLibrary

(i)

(iii)

(ii)

• OpenCLprovidesagoodFPGAabstration

• UnlikeOpenCLonGPUorCPU,OpenCLFPGAdescribesbothhardwareandsoftware

• Use #prgma to guide hardwaregeneration:– loopunrolling; SIMDfactor; compute

unitreplication

4

OpenCL FPGA Framework

CU

LocalMemory

CU

LocalMemory

CU

LocalMemory

CU

LocalMemory

2x2 PE Replication (Controlled by User) 4x CU Replication (Controlled by User)

16x Onchip Memory Replication ( Compiler Inferred )

PCIe Controller

SOC BUS

Dispather

Infrastructure Region

HOST

DDR Controller

GlobalMemory

ComputationSubsystem

Local Memory

Subsystem

Control Flow Data Flow

DDR Controller

… GlobalMemory

KernelRegion

(a) Top Level (b) Compute Unit Level (c) Processing Element Level

PE

BRAMBRAM

PE

BRAMBRAM

PE

BRAMBRAM

PE

BRAMBRAM

5

Related works• GenericCNNaccelerators: [1]-[3]• Balancecomputation and external memoryaccess: [4] -[7]• Hardware Abstraction: [7][8]• Contribution:– IdentifytheperformancebottleneckinlargescaleFPGACNNacceleratorison-chipmemorybandwidth

– A CNN accelerator achieved optimized balance among computation,on-chipmemory and external memory access

6

[1]Farabet,etal.Hardwareacceleratedconvolutional neuralnetworksforsyntheticvision systems.ISCAS2010[2]M.Peemen,etal.Memory-centricacceleratordesignforconvolutional neuralnetworks.ICCD2013[3]V.Gokhale, etal.A240G-ops/smobilecoprocessor fordeepneuralnetworks.CVPRWorkshops, 2014.[4]C.Zhang,etal.OptimizingFPGA-basedacceleratordesign fordeepconvolutional neuralnetworks.ISFPGA2015[5]N.Suda, et al. Throughput-OptimizedOpenCL-based FPGAAcceleratorforLarge-ScaleConvolutional NeuralNetworks. ISFPGA 2016[6] J. Qiu, et al. GoingDeeperwithEmbedded FPGAPlatformforConvolutional NeuralNetwork. ISFPGA 2016[7]C. Zhang Caffeine:TowardsUniformedRepresentationandAccelerationforDeepConvolutional NeuralNetworks. ICCAD 2016[8] H. Sharma, et al. FromHigh-LevelDeepNeuralModels toFPGAs. MICRO2016

Motivation

• New FPGA devices havemoreand faster DSP resources

• We should useallDSPresourceseverycycletoobtaintheGFLOPSnumberindatasheet

• The existing design cannot utilizethe DSP resources well

7

BalanceAnalysis Model• Balanceistherelationshipbetweenstorage and computation resources• Assumption:• Onlyonebottleneckinthesystem• Bandwidthbounded• Computationandmemoryaccessoverlapperfectly.

8

MachineBalance

CodeBalance

On-chip 𝑩𝒎𝑶𝒏 𝑩𝒎𝑶𝒏

Off-chip 𝑩𝒎𝑶𝒇𝒇 𝑩𝑪

𝑶𝒇𝒇

Hardwaredetermined

Algorithmdetermined

Refertoon-chipMemBWRefertooff-chipMemBW

MachineBalance

9

PE PE PE

Buffer Buffer

Computational Resources

On-chip Memory

… …

NBRAM *WIDTHBRAM *fBRAM

PE PE PEComputational Resources

… …

NDDR*𝐵𝑊))*

DDR DDROff-chip Memory

NDSP*OPS/DSP*fDSP NDSP*OPS/DSP*fDSP

CodeBalance

10

• On-chipCodeBalanceisdeterminedbyinputandoutputdatasizeofinstructions

• Forexample,MAC operation hasa

1

• Off-chipCodeBalanceisdeterminedbytheinputandoutputdatasizeofthewholeprogram

• Assumeunlimited on-chipmemorysize

BreakthememoryWall

11

Toachieve𝑃,-. ,weneedtosatisfy𝐵,122 >𝐵3

122 and 𝐵,14 >𝐵3122

Off-chipbalance

Layer 𝑩𝑪𝑶𝒇𝒇

CONV1 0.0021CONV2 0.0010CONV3 0.00062CONV4 0.00087CONV5 0.00277FC 0.5

12

TechnologyNode

𝑩𝒎𝑶𝒇𝒇

28nm 0.16820nm 0.217

14/16nm 0.17714/16nmw/HBM

0.177

• 𝑩𝑪𝑶𝒇𝒇 <𝑩𝒎

𝑶𝒇𝒇 forconvolutionlayers

• 𝑩𝑪𝑶𝒇𝒇 >𝑩𝒎

𝑶𝒇𝒇 forFC layer,butonlycontributeasmallportionofcomputation

On-chipbalance

13

TechnologyNode

𝑩𝒎𝑶𝒏

28nm 0.0033

20nm 0.0026

14/16nm 0.0010

Layer 𝑩𝑪𝑶𝒏

CONV1 1CONV2 1CONV3 1CONV4 1CONV5 1FC 1

• 𝑩𝑪𝑶𝒏 >𝑩𝒎𝑶𝒏 foralllayer

• On-chipmemorybandwidthbecomesthebottleneck

• Weneedtoincrease𝑩𝒎𝑶𝒏

DataReuseRequirement

303384

72

1000

0

200

400

600

800

1000

1200

28nm 20nm 14nmw/HMC

14nmw/DDR

• Thebalanceanalysis assumesunlimitedon-chipmemorysize(loaddataonce)

• Underlimitedon-chipmemorycapacity,weneedtoloaddatamorethanonce.

• Tosatisfiedexternalmemorybandwidthrequirement,weneedtoreusedataatleast.

14

OptimizationDirection

• Increase to utilize all DSP resources

• Satisfy thedatareuserequirementwithlimitedon-chipmemorysize.

15

OptimizationDirection

• Increase to utilize all DSP resources

• Satisfy thedatareuserequirementwithlimitedon-chipmemorysize.

16

MatrixMultiplicationKernel

BRAM

PE

BRAM

PE

BRAM

BRAM

PE PE

17

CU

LocalMemory

CU

LocalMemory

CU

LocalMemory

CU

LocalMemory

2x2 PE Replication (Controlled by User) 4x CU Replication (Controlled by User)

16x Onchip Memory Replication ( Compiler Inferred )

PCIe Controller

SOC BUS

Dispather

Infrastructure Region

HOST

DDR Controller

GlobalMemory

ComputationSubsystem

Local Memory

Subsystem

Control Flow Data Flow

DDR Controller

… GlobalMemory

KernelRegion

(a) Top Level (b) Compute Unit Level (c) Processing Element Level

PE

BRAMBRAM

PE

BRAMBRAM

PE

BRAMBRAM

PE

BRAMBRAM

EnableMulticasting

BRAMusagereduction

18

Bymulticasting,wecanincreaseon-chipmachinebalanceby 𝑁)67

ExperimentonArria10GX1150FPGA(w/ 1518 DSP and 2713 M20kMemory):

OptimizationDirection

• Increase to utilize all DSP resources

• Satisfy thedatareuserequirementwithlimitedon-chipmemorysize.

19

Kernel Design

2D DISPATHER

Compute Units

Compute Units

Compute Units

Compute Units

Buffer

PointerParameters

ShiftRegsiter

Crossbar

ExternalMemory

ExternalMemory

20

CU Design

21

BRAM32bit x512

MAC

BRAM32bit x512

MAC

BRAM32bit x512

MAC

BRAM32bit x512

BRAM32bit x512

MAC

MAC

MAC

BRAM32bit x512

MAC

MAC

MAC

INPUT A512x

32x32x

32x

32x

INPUT B512x

32x

32x

32x 32x 32x

32x32x32x

32x 32x 32x

FF

FF

FF

MUX

OutputAddress

Output512x

512x

512x

512x

8x Duplication8x

Dup

licat

ion

(a)

Work-item Scheduling

22

2D Scheduling

0,0 0,1 0,2 0,n…

1,0 1,1 1,2 1,n…

2,0 2,1 2,2 2,n…

m,0 m,1 m,2 m,n…

… … … …

0,0 0,1 0,2 0,n…

1,0 1,1 1,2 1,n…

2,0 2,1 2,2 2,n…

m,0 m,1 m,2 m,n…… … … …

m

n(a) One Dimensional Dispatching (b) Two Dimensional Dispatching

23

𝑥9

𝑥:

Minimizingexternalmemorybandwidth

• Objective:– Findtheoptimized schedulingpolicy

– To minimizeexternalmemorybandwidth

• Constraint:– On-chipmemorysize

• Basedon:– CNNlayerparameters

24

Optimizationresult

25

Theoptimizationcaneffectivelyreducetherequirementofoff-chipmemorybandwidth

Experiment Setup

• FPGA: Arria10GX1150• 1518 DSPs• 2713 M20k memory• 1GBDDR4 with17GB/sBW

• Kernel implemented as anOpenCLIPlibrarypackage usingVerilog

26

Arria 10 GX FPGA Development Kit

Implementation Result

27

Total Ours Percentage

DSP 1518 1320 86

BRAM 2713 1250 46

Logic 1506k 437k 43

Frequency 370 MHz

VGG Performance

Layers Number of Ops Duration (ms) PerformanceGOP/s

Conv 30.69G 11.9 2568

FC 0.073G 5.5 13

Total 30.76G 17.4 1790

28

Overall VGG Classification Throughput: 57 image/s

Performance ComparisonSuda et. al [1] Qiu et. al [2] Ours

Platform Stratix V GSD8 Zynq XC7Z045 Arria 10 GX1150Performance(GOP/s) 117.8 136.9 1790

PerformanceDensity(OPs/DSP/Cycle)

0.36 1.17 3.06

Power efficiency(GOP/J)

1.84 14.22 47.88

29

[1]N.Suda, et al. Throughput-OptimizedOpenCL-basedFPGAAcceleratorforLarge-ScaleConvolutionalNeuralNetworks. ISFPGA 2016[2] J. Qiu, et al. GoingDeeperwithEmbeddedFPGAPlatformforConvolutionalNeuralNetwork. ISFPGA 2016

Summary

• Motivation• Balance analysis model• Our Work:– Multicasting among PEs– 2D Scheduling

• Performance comparison

30

Thanks!

Q&A

31

Recommended