Improving the Performance of OpenCL-based FPGA Accelerator ... · Improving the Performance of...

ImprovingthePerformanceofOpenCL-basedFPGAAcceleratorforConvolutionalNeuralNetwork

JialiangZhangandJingLi

Outline

• Background• Motivation• Balance analysis model• Proposed design• Performance• Conclusion

ConvolutionalNeuralNetwork

Marco Architecture of VGG

Imagefrom:https://blog.heuritech.com/2016/02/29/a-brief-report-of-the-heuritech-deep-learning-meetup-5/l

OpenCL Framework

Application Source CodeOpenCL Kernel

OpenCL FrameworkOpenCL CHost API

InfrastructureKernel

FPGARuntimeFPGA Driver

Offline FPGA OpenCL Compiler

CLANG Frontend

IR Analysis &Optimization

VerilogBackend

BasicBlockLibrary

• OpenCLprovidesagoodFPGAabstration

• UnlikeOpenCLonGPUorCPU,OpenCLFPGAdescribesbothhardwareandsoftware

• Use #prgma to guide hardwaregeneration:– loopunrolling; SIMDfactor; compute

unitreplication

OpenCL FPGA Framework

LocalMemory

2x2 PE Replication (Controlled by User) 4x CU Replication (Controlled by User)

16x Onchip Memory Replication ( Compiler Inferred )

PCIe Controller

SOC BUS

Dispather

Infrastructure Region

DDR Controller

GlobalMemory

ComputationSubsystem

Local Memory

Subsystem

Control Flow Data Flow

DDR Controller

… GlobalMemory

KernelRegion

(a) Top Level (b) Compute Unit Level (c) Processing Element Level

BRAMBRAM

Related works• GenericCNNaccelerators: [1]-[3]• Balancecomputation and external memoryaccess: [4] -[7]• Hardware Abstraction: [7][8]• Contribution:– IdentifytheperformancebottleneckinlargescaleFPGACNNacceleratorison-chipmemorybandwidth

– A CNN accelerator achieved optimized balance among computation,on-chipmemory and external memory access

[1]Farabet,etal.Hardwareacceleratedconvolutional neuralnetworksforsyntheticvision systems.ISCAS2010[2]M.Peemen,etal.Memory-centricacceleratordesignforconvolutional neuralnetworks.ICCD2013[3]V.Gokhale, etal.A240G-ops/smobilecoprocessor fordeepneuralnetworks.CVPRWorkshops, 2014.[4]C.Zhang,etal.OptimizingFPGA-basedacceleratordesign fordeepconvolutional neuralnetworks.ISFPGA2015[5]N.Suda, et al. Throughput-OptimizedOpenCL-based FPGAAcceleratorforLarge-ScaleConvolutional NeuralNetworks. ISFPGA 2016[6] J. Qiu, et al. GoingDeeperwithEmbedded FPGAPlatformforConvolutional NeuralNetwork. ISFPGA 2016[7]C. Zhang Caffeine:TowardsUniformedRepresentationandAccelerationforDeepConvolutional NeuralNetworks. ICCAD 2016[8] H. Sharma, et al. FromHigh-LevelDeepNeuralModels toFPGAs. MICRO2016

Motivation

• New FPGA devices havemoreand faster DSP resources

• We should useallDSPresourceseverycycletoobtaintheGFLOPSnumberindatasheet

• The existing design cannot utilizethe DSP resources well

BalanceAnalysis Model• Balanceistherelationshipbetweenstorage and computation resources• Assumption:• Onlyonebottleneckinthesystem• Bandwidthbounded• Computationandmemoryaccessoverlapperfectly.

MachineBalance

CodeBalance

On-chip 𝑩𝒎𝑶𝒏 𝑩𝒎𝑶𝒏

Off-chip 𝑩𝒎𝑶𝒇𝒇 𝑩𝑪

𝑶𝒇𝒇

Hardwaredetermined

Algorithmdetermined

Refertoon-chipMemBWRefertooff-chipMemBW

MachineBalance

PE PE PE

Buffer Buffer

Computational Resources

On-chip Memory

… …

NBRAM *WIDTHBRAM *fBRAM

PE PE PEComputational Resources

… …

NDDR*𝐵𝑊))*

DDR DDROff-chip Memory

NDSP*OPS/DSP*fDSP NDSP*OPS/DSP*fDSP

CodeBalance

• On-chipCodeBalanceisdeterminedbyinputandoutputdatasizeofinstructions

• Forexample,MAC operation hasa

• Off-chipCodeBalanceisdeterminedbytheinputandoutputdatasizeofthewholeprogram

• Assumeunlimited on-chipmemorysize

BreakthememoryWall

Toachieve𝑃,-. ,weneedtosatisfy𝐵,122 >𝐵3

122 and 𝐵,14 >𝐵3122

Off-chipbalance

Layer 𝑩𝑪𝑶𝒇𝒇

CONV1 0.0021CONV2 0.0010CONV3 0.00062CONV4 0.00087CONV5 0.00277FC 0.5

TechnologyNode

𝑩𝒎𝑶𝒇𝒇

28nm 0.16820nm 0.217

14/16nm 0.17714/16nmw/HBM

• 𝑩𝑪𝑶𝒇𝒇 <𝑩𝒎

𝑶𝒇𝒇 forconvolutionlayers

• 𝑩𝑪𝑶𝒇𝒇 >𝑩𝒎

𝑶𝒇𝒇 forFC layer,butonlycontributeasmallportionofcomputation

On-chipbalance

TechnologyNode

𝑩𝒎𝑶𝒏

28nm 0.0033

20nm 0.0026

14/16nm 0.0010

Layer 𝑩𝑪𝑶𝒏

CONV1 1CONV2 1CONV3 1CONV4 1CONV5 1FC 1

• 𝑩𝑪𝑶𝒏 >𝑩𝒎𝑶𝒏 foralllayer

• On-chipmemorybandwidthbecomesthebottleneck

• Weneedtoincrease𝑩𝒎𝑶𝒏

DataReuseRequirement

303384

28nm 20nm 14nmw/HMC

14nmw/DDR

• Thebalanceanalysis assumesunlimitedon-chipmemorysize(loaddataonce)

• Underlimitedon-chipmemorycapacity,weneedtoloaddatamorethanonce.

• Tosatisfiedexternalmemorybandwidthrequirement,weneedtoreusedataatleast.

OptimizationDirection

• Increase to utilize all DSP resources

• Satisfy thedatareuserequirementwithlimitedon-chipmemorysize.

MatrixMultiplicationKernel

LocalMemory

2x2 PE Replication (Controlled by User) 4x CU Replication (Controlled by User)

16x Onchip Memory Replication ( Compiler Inferred )

PCIe Controller

SOC BUS

Dispather

Infrastructure Region

DDR Controller

GlobalMemory

ComputationSubsystem

Local Memory

Subsystem

Control Flow Data Flow

DDR Controller

… GlobalMemory

KernelRegion

(a) Top Level (b) Compute Unit Level (c) Processing Element Level

BRAMBRAM

EnableMulticasting

BRAMusagereduction

Bymulticasting,wecanincreaseon-chipmachinebalanceby 𝑁)67

ExperimentonArria10GX1150FPGA(w/ 1518 DSP and 2713 M20kMemory):

Kernel Design

2D DISPATHER

Compute Units

Buffer

PointerParameters

ShiftRegsiter

Crossbar

ExternalMemory

CU Design

BRAM32bit x512

INPUT A512x

32x32x

INPUT B512x

32x 32x 32x

32x32x32x

32x 32x 32x

OutputAddress

Output512x

8x Duplication8x

Work-item Scheduling

2D Scheduling

0,0 0,1 0,2 0,n…

1,0 1,1 1,2 1,n…

2,0 2,1 2,2 2,n…

m,0 m,1 m,2 m,n…

… … … …

0,0 0,1 0,2 0,n…

1,0 1,1 1,2 1,n…

2,0 2,1 2,2 2,n…

m,0 m,1 m,2 m,n…… … … …

n(a) One Dimensional Dispatching (b) Two Dimensional Dispatching

Minimizingexternalmemorybandwidth

• Objective:– Findtheoptimized schedulingpolicy

– To minimizeexternalmemorybandwidth

• Constraint:– On-chipmemorysize

• Basedon:– CNNlayerparameters

Optimizationresult

Theoptimizationcaneffectivelyreducetherequirementofoff-chipmemorybandwidth

Experiment Setup

• FPGA: Arria10GX1150• 1518 DSPs• 2713 M20k memory• 1GBDDR4 with17GB/sBW

• Kernel implemented as anOpenCLIPlibrarypackage usingVerilog

Arria 10 GX FPGA Development Kit

Implementation Result

Total Ours Percentage

DSP 1518 1320 86

BRAM 2713 1250 46

Logic 1506k 437k 43

Frequency 370 MHz

VGG Performance

Layers Number of Ops Duration (ms) PerformanceGOP/s

Conv 30.69G 11.9 2568

FC 0.073G 5.5 13

Total 30.76G 17.4 1790

Overall VGG Classification Throughput: 57 image/s

Performance ComparisonSuda et. al [1] Qiu et. al [2] Ours

Platform Stratix V GSD8 Zynq XC7Z045 Arria 10 GX1150Performance(GOP/s) 117.8 136.9 1790

PerformanceDensity(OPs/DSP/Cycle)

0.36 1.17 3.06

Power efficiency(GOP/J)

1.84 14.22 47.88

[1]N.Suda, et al. Throughput-OptimizedOpenCL-basedFPGAAcceleratorforLarge-ScaleConvolutionalNeuralNetworks. ISFPGA 2016[2] J. Qiu, et al. GoingDeeperwithEmbeddedFPGAPlatformforConvolutionalNeuralNetwork. ISFPGA 2016

Summary

• Motivation• Balance analysis model• Our Work:– Multicasting among PEs– 2D Scheduling

• Performance comparison

Thanks!

Improving the Performance of OpenCL-based FPGA Accelerator ... · Improving the Performance of...

Documents

Optimizing FPGA-based Accelerator Design for …cadlab.cs.ucla.edu/~cong/slides/fpga2015_chen.pdfOptimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks Chen

Intel FPGA SDK for OpenCL...Intel® FPGA SDK for OpenCL ベスト・プラクティス・ガイドインテル ® Quartus Prime開発デザインスイートの更新情報: 17.1

Convolutional Neural Network FPGA-accelerator on Intel

Intel FPGA SDK for OpenCL Programming Guide - Altera · PDF file1 Intel® FPGA SDK for OpenCL™ Overview The Intel® FPGA SDK for OpenCL™ Programming Guide provides descriptions,

Intel FPGA SDK for OpenCL Pro Edition: Programming Guide · Intel® FPGA SDK for OpenCL™ Pro Edition Programming Guide ... Channel Data Behavior.....40 5.4.3. Multiple Work-Item

Intel FPGA SDK for OpenCL - Altera · 1 Intel® FPGA SDK for OpenCL™ Custom Platform Toolkit User Guide The Intel® FPGA SDK for OpenCL™ Custom Platform Toolkit User Guide outlines

Spector: An OpenCL FPGA Benchmark Suite - UC San …kastner.ucsd.edu/wp-content/uploads/2013/08/admin/fpt16-spector.pdf · Spector: An OpenCL FPGA Benchmark Suite Quentin Gautier,

Intel FPGA SDK for OpenCL€¦ · Intel® FPGA SDK for OpenCL ™ Intel® Arria® 10 GX FPGA Development Kit Reference Platform Porting Guide Updated for Intel ® Quartus Prime Design

Intel FPGA SDK for OpenCL Pro Edition · 1. Intel® FPGA SDK for OpenCL™ Pro Edition Custom Platform Toolkit User Guide The Intel® FPGA SDK for OpenCL ™ Pro Edition Custom Platform

Gzip Compression Using Altera OpenCL · Altera’s OpenCL Compiler 3 2 1 OpenCL Single-threaded Code . FPGAs can be VERY Custom Host CPU FPGA Accelerator PCIe Load x Load y Store

Intel FPGA SDK for OpenCL Standard Edition · Send Feedback Intel FPGA SDK for OpenCL Standard Edition: Best Practices Guide 3. 9. Strategies for Optimizing FPGA Area Usage

OPTIMIZING FPGA-BASED ACCELERATOR DESIGN … · REFERENCES • Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks FPGA '15Proceedings of the 2015 ACM/SIGDA

FPGA Accelerator for Floating-Point Matrix Multiplicationtesla.rcub.bg.ac.rs/.../FPGA_matrix_multiplication.pdf · 2012-10-02 · FPGA Accelerator for Floating-Point Matrix Multiplication

Intel FPGA SDK for OpenCL Standard Edition...Intel FPGA SDK for OpenCL Standard Edition: Best Practices Guide Send Feedback 6 A processor implements an instruction set that limits

Reconfigurable Architectures FPGA as an accelerator

Throughput-Optimized OpenCL-based FPGA …Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks ... et al. Optimizing FPGA-based accelerator

Intel FPGA SDK for OpenCL Programming Guide · 1.7.2 Allocating OpenCL Buffer for Manual Partitioning of Global Memory ... The Intel FPGA SDK for OpenCL Programming Guide assumes

PipeCNN: An OpenCL-Based FPGA Accelerator for Convolution ... › portal › assets › pdf › PR022.pdf · PipeCNN is an OpenCL-based FPGA Accelerator for Large-Scale Convolutional

Intel FPGA SDK for OpenCL Standard Edition · 1.2.1. Directories and Files in an Intel FPGA SDK for OpenCL Standard Edition ... Stratix V Network Reference Platform Porting Guide

FPGA-BASED ACCELERATOR FOR THE GENERATION OF …eprints.usm.my/40702/1/Fpga-Based_Accelerator_for_The_Generation_of...FPGA-BASED ACCELERATOR FOR THE GENERATION OF PSEUDO-AMINO ACID