Architectures ARAPrototyper: Enabling Rapid Prototyping and...

ARAPrototyper: Enabling Rapid Prototyping and Center for Future

Architectures ARAPrototyper: Enabling Rapid Prototyping and Evaluation for the Accelerator-Rich Architecture

Architectures Research Evaluation for the Accelerator-Rich Architecture

Yu-Ting Chen, Jason Cong, Zhenman Fang, and Peipei Zhou

Motivation, Contributions, and Prototype Baseline Accelerator-Rich ARA Memory System Template MotivationMotivation

AcceleratorAccelerator--rich architectures (ARAs) can provide 10rich architectures (ARAs) can provide 10--100X energy 100X energy

Motivation, Contributions, and Prototype Baseline Accelerator-Rich Architecture (ARA) Prototype

ARA Memory System Template

AcceleratorAccelerator--rich architectures (ARAs) can provide 10rich architectures (ARAs) can provide 10--100X energy 100X energy

efficiency over current chip multiefficiency over current chip multi--processorsprocessors

Need methodology to efficiently and accurately evaluate an ARANeed methodology to efficiently and accurately evaluate an ARA

Limitations of fullLimitations of full--system simulation:system simulation: Limitations of fullLimitations of full--system simulation:system simulation:

•• Modeling: difficult to model the customized interconnects of an ARAModeling: difficult to model the customized interconnects of an ARA

•• Speed: 300KIPS ~ 3MIPS; > 1000x slower than native executionSpeed: 300KIPS ~ 3MIPS; > 1000x slower than native execution

ContributionsContributions

Show Show 4000 to 10,000 evaluation time saving 4000 to 10,000 evaluation time saving over fullover full--system system

simulation.simulation.simulation.simulation.

Provide a highlyProvide a highly--automated prototyping flowautomated prototyping flow

Provide an efficient evaluation framework and APIs for usersProvide an efficient evaluation framework and APIs for users Prototype platform: Prototype platform: Prototype platform: Prototype platform:

Xilinx Xilinx ZynqZynq ZC706 ZC706

•• SoCSoC

•• DualDual--core core CortexCortex--A9 A9 ARMARM Shared homogeneous buffersShared homogeneous buffers•• DualDual--core core CortexCortex--A9 A9 ARMARM

•• OnOn--board BRAMs board BRAMs

•• FPGA (accelerators andFPGA (accelerators and

interconnectsinterconnects Accelerator planeAccelerator plane

Shared homogeneous buffersShared homogeneous buffers

Partial crossbar: accelerators Partial crossbar: accelerators shared buffersshared buffers

Provide enough connectivity and guaranteed fixed latencyProvide enough connectivity and guaranteed fixed latencyinterconnectsinterconnects

1GB on1GB on--board DRAMboard DRAM

Linux Linux can be portedcan be ported

Heterogeneous accelerators + ARA memory systemHeterogeneous accelerators + ARA memory system

Processor planeProcessor plane

Cores + a shared lastCores + a shared last--level cachelevel cache

Interleaved network: shared buffers Interleaved network: shared buffers DMACsDMACs

Improve offImprove off--chip prefetching efficiencychip prefetching efficiency

IOMMU + IOTLB => improve page translation efficiencyIOMMU + IOTLB => improve page translation efficiency

Hardware Design Automation System Software Stack Program the Accelerators

Cores + a shared lastCores + a shared last--level cachelevel cache IOMMU + IOTLB => improve page translation efficiencyIOMMU + IOTLB => improve page translation efficiency

Hardware Design Automation System Software Stack Major ComponentsMajor Components

GAM GAM –– global accelerator managerglobal accelerator manager

Program the Accelerators API supports for utilizing accelerators in an ARAAPI supports for utilizing accelerators in an ARA

C/C++ based APIs for manipulating acceleratorsC/C++ based APIs for manipulating accelerators

DBA DBA –– dynamic buffer dynamic buffer allocatorallocator

TLB miss handlerTLB miss handler

Coherence Coherence ManagerManager

C/C++ based APIs for manipulating acceleratorsC/C++ based APIs for manipulating accelerators

•• reservereserve(), (), check_reservecheck_reserve() (make reservations)() (make reservations)

•• send_paramsend_param() (send parameters and start the Acc.)() (send parameters and start the Acc.)

•• ccheck_doneheck_done() (poll the done signal)() (poll the done signal)

•• Coherence Coherence ManagerManager

•• Accelerator can directly fetch data from offAccelerator can directly fetch data from off--chip DRAM with the help of chip DRAM with the help of Coherence ManagerCoherence Manager

•• ffreeree() (release the accelerator)() (release the accelerator)

Easy compilation Easy compilation –– only only gccgcc is requiredis required

Executable can be run on board with Linux directly!Executable can be run on board with Linux directly!

ARA specification file

Executable can be run on board with Linux directly!Executable can be run on board with Linux directly!

ARAPrototyper design automation flow

Accelerator kernels

ARA specification file

Accelerator kernels

Shared buffer banks # and sizesShared buffer banks # and sizes

Interconnects:(1) Partial crossbar configuration(2) Interleaved network

TLB(2) Interleaved network

Coherent at memory or L2 cache Frequency

Results & PrototypeCode Size, Evaluation Time (Our Flow

vs. Full-System Simulations)

Case Study: Medical Imaging & MachSuite Results & Prototype

Accelerator microarchitectureAccelerator microarchitecture Interconnect optimizationInterconnect optimizationvs. Full-System Simulations)

Show Show 2.9x to 42.6x 2.9x to 42.6x evaluation time saving over the fullevaluation time saving over the full--system system simulationssimulations

More than 97% of flow time: C/C++ More than 97% of flow time: C/C++ --> RTL > RTL --> FPGA > FPGA bitstreambitstream

MachSuiteMedical Imaging Applications

More than 97% of flow time: C/C++ More than 97% of flow time: C/C++ --> RTL > RTL --> FPGA > FPGA bitstreambitstream

Native execution on the prototype: Native execution on the prototype: 101033 ~ ~ 101044 faster than the fullfaster than the full--sys simulationssys simulations0

Grad Gauss Seg Rician Grad + Rician

Grad + Gauss

+ Rician

Coherence choicesCoherence choices

Inter-Acc Intra-Acc

Grad Gauss Seg Rician

Coherent at L2 cache in ARM Coherent at DRAM

ARA FPGA prototype (100MHz) vs. stateARA FPGA prototype (100MHz) vs. state--ofof--thethe--art processorsart processors Code size Code size reductionreduction

ARAPrototyperARAPrototyper and HLS toolsand HLS tools

reduce the design effortsreduce the design efforts

Energy-Efficiency Summary:Prototype: 7.44x over Intel Xeon

Acc0: Gradient

Acc : Gaussian

Acc2: Segmentation

Acc : RicianPrototype: 7.44x over Intel XeonASIC projection: 84x over Intel Xeon (Haswell @1.9GHz), according to Kuon, TCAD’07

Acc1: Gaussian Acc3: Rician

Heterogeneous AcceleratorsHeterogeneous Accelerators

Architectures ARAPrototyper: Enabling Rapid Prototyping and...

Documents

Going Deeper with Embedded FPGA Platform for Convolutional ...cadlab.cs.ucla.edu/~jaywang/papers/fpga16-cnn.pdf · In recent years, Convolutional Neural Network (CNN) based methods

Center for DomainCenter for Domain--Specific Computing ...cadlab.cs.ucla.edu/.../PROFIT_Agenda_Sep_2010/Jason_Cong_2010.… · cong@cs.ucla.edu Participating Universities: UCLA (lead),

Synthesis Challenges for Next- Generation High-Performance ...cadlab.cs.ucla.edu/~cong/slides/aspdac2k.pdf · Generation High-Performance and High-Density PLDs Synthesis Challenges

Caffeine: Towards Uniformed Representation and ...vast.cs.ucla.edu/~peipei/slides/ICCAD_HALO_Caffeine_poster_2.pdf · Caffeine: Towards Uniformed Representation and Acceleration for

Multilevel global placement with congestion control ...cadlab.cs.ucla.edu/~cong/papers/tcad03-mpg.pdfMultilevel Global Placement With Congestion Control Chin-Chih Chang, Jason Cong,

GLOBAL FABLESS/FOUNDRY COLLABORATION AND INNOVATION …cadlab.cs.ucla.edu/icsoc/protected-dir/IC-DFN_Agenda_Aug_2007... · GLOBAL FABLESS/FOUNDRY COLLABORATION AND INNOVATION Dr

Meet the Coordination Centers · 2015-12-09 · Oxford University bioCADDIE CEDAR Peipei Ping UCLA Heart BD2K Eric Deutsch Institute for Systems Biology HeartBD2K ... • Introduction

Implementation & Evaluation of the Delta- Sigma Modulation ...cadlab.cs.ucla.edu/Website/protected-dir/june_pdf_presentations/... · Multi-tone test (MTT) ... Single-tone test (STT)

Nanotechnology Opportunities from a System Application …cadlab.cs.ucla.edu/icsoc/protected-dir/IC-DFN_Agenda_Aug... · 2006-09-07 · Nanotechnology Opportunities from a System

Caffeine: Towards Uniformed Representation and Acceleration for …vast.cs.ucla.edu/~peipei/papers/iccad2016caffeine_camera... · 2016-07-25 · In this paper we design and implement

Computational Modeling and Analysis for Complex Systems ...cadlab.cs.ucla.edu/expeditions_pi_meeting/slides/Edmund_Clarke... · –First automated formal analysis Delta-Reachability:

Optimizing FPGA-based Accelerator Design for …cadlab.cs.ucla.edu/~cong/slides/fpga2015_chen.pdfOptimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks Chen

Enabling Trade: Enabling Trade in the Paciﬁc Alliancereports.weforum.org/enabling-trade-from-valuation... · 12/3/2013 · Enabling Trade Enabling Trade in the Pacific Alliance

Energy-Efﬁcient Scheduling on Heterogeneous Multi …cadlab.cs.ucla.edu/~boyuan/papers/islped149-yuan.pdf · Energy-Efﬁcient Scheduling on Heterogeneous Multi-Core Architectures

Fine Grain 3D Integration for Microarchitecture Design ...cadlab.cs.ucla.edu/~cong/papers/2007_ICCD_Eren.pdfFine Grain 3D Integration for Microarchitecture Design Through Cube Packing

Latte: Locality Aware Transformation for High-Level Synthesisvast.cs.ucla.edu/~peipei/papers/fccm18-latte_authorCopy.pdf · Latte microarchitecture featuring customizable pipelined

A case study of Amazon’s fulfillment system - · PDF fileA case study of Amazon’s fulfillment system Claude Mortel (RA6967207) PeiPei----Ling WuLing Wu (R56964079) ChengCheng-

Heterogeneous Datacenters: Options and …cadlab.cs.ucla.edu/~cong/slides/dac16_hetero_dc.pdfHeterogeneous Datacenters: Options and Opportunities Jason Cong1, Muhuan Huang1,2, Di Wu1,2,

Enabling healthcare consumerism - McKinsey on Healthcarehealthcare.mckinsey.com/.../files/Enabling-Healthcare-Consumerism_R6B.pdf · Enabling healthcare consumerism Companies that

Automatic Memory Partitioning and Scheduling for ...cadlab.cs.ucla.edu/~cong/papers/memPar.pdfAutomatic Memory Partitioning and Scheduling for Throughput and Power Optimization 15:3