1
ARAPrototyper: Enabling Rapid Prototyping and Center for Future Architectures ARAPrototyper: Enabling Rapid Prototyping and Evaluation for the Accelerator-Rich Architecture Architectures Research Yu-Ting Chen, Jason Cong, Zhenman Fang, and Peipei Zhou Motivation, Contributions, and Prototype Baseline Accelerator-Rich ARA Memory System Template Motivation Motivation Accelerator Accelerator-rich architectures (ARAs) can provide 10 rich architectures (ARAs) can provide 10-100X energy 100X energy Motivation, Contributions, and Prototype Architecture (ARA) Prototype ARA Memory System Template Accelerator Accelerator-rich architectures (ARAs) can provide 10 rich architectures (ARAs) can provide 10-100X energy 100X energy efficiency over current chip multi efficiency over current chip multi-processors processors Need methodology to efficiently and accurately evaluate an ARA Need methodology to efficiently and accurately evaluate an ARA Limitations of full Limitations of full-system simulation: system simulation: Limitations of full Limitations of full-system simulation: system simulation: Modeling: difficult to model the customized interconnects of an ARA Modeling: difficult to model the customized interconnects of an ARA Speed: 300KIPS ~ 3MIPS; > 1000x slower than native execution Speed: 300KIPS ~ 3MIPS; > 1000x slower than native execution Contributions Contributions Show Show 4000 to 10,000 evaluation time saving 4000 to 10,000 evaluation time saving over full over full-system system simulation. simulation. Provide a highly Provide a highly-automated prototyping flow automated prototyping flow Provide an efficient evaluation framework and APIs for users Provide an efficient evaluation framework and APIs for users Prototype platform: Prototype platform: Prototype platform: Prototype platform: Xilinx Xilinx Zynq Zynq ZC706 ZC706 SoC SoC Dual Dual-core core Cortex Cortex-A9 A9 ARM ARM Shared homogeneous buffers Shared homogeneous buffers Dual Dual-core core Cortex Cortex-A9 A9 ARM ARM On On-board BRAMs board BRAMs FPGA (accelerators and FPGA (accelerators and interconnects interconnects Accelerator plane Accelerator plane Shared homogeneous buffers Shared homogeneous buffers Partial crossbar: accelerators Partial crossbar: accelerators shared buffers shared buffers Provide enough connectivity and guaranteed fixed latency Provide enough connectivity and guaranteed fixed latency interconnects interconnects 1GB on 1GB on-board DRAM board DRAM Linux Linux can be ported can be ported Heterogeneous accelerators + ARA memory system Heterogeneous accelerators + ARA memory system Processor plane Processor plane Cores + a shared last Cores + a shared last-level cache level cache Interleaved network: shared buffers Interleaved network: shared buffers DMACs DMACs Improve off Improve off-chip prefetching efficiency chip prefetching efficiency IOMMU + IOTLB => improve page translation efficiency IOMMU + IOTLB => improve page translation efficiency Hardware Design Automation System Software Stack Program the Accelerators Cores + a shared last Cores + a shared last-level cache level cache IOMMU + IOTLB => improve page translation efficiency IOMMU + IOTLB => improve page translation efficiency Hardware Design Automation System Software Stack Major Components Major Components GAM GAM – global accelerator manager global accelerator manager Program the Accelerators API supports for utilizing accelerators in an ARA API supports for utilizing accelerators in an ARA C/C++ based APIs for manipulating accelerators C/C++ based APIs for manipulating accelerators DBA DBA – dynamic buffer dynamic buffer allocator allocator TLB miss handler TLB miss handler Coherence Coherence Manager Manager reserve reserve(), (), check_reserve check_reserve() (make reservations) () (make reservations) send_param send_param() (send parameters and start the Acc.) () (send parameters and start the Acc.) check_done heck_done() (poll the done signal) () (poll the done signal) Coherence Coherence Manager Manager Accelerator can directly fetch data from off Accelerator can directly fetch data from off-chip DRAM with the help of chip DRAM with the help of Coherence Manager Coherence Manager free ree() (release the accelerator) () (release the accelerator) Easy compilation Easy compilation – only only gcc gcc is required is required Executable can be run on board with Linux directly! Executable can be run on board with Linux directly! ARA specification file ARAPrototyper design automation flow Accelerator kernels ARA specification file Accelerator kernels Shared buffer banks # and sizes Shared buffer banks # and sizes Interconnects: (1) Partial crossbar configuration (2) Interleaved network TLB (2) Interleaved network TLB Coherent at memory or L2 cache Frequency Results & Prototype Code Size, Evaluation Time (Our Flow vs. Full-System Simulations) Case Study: Medical Imaging & MachSuite Results & Prototype Accelerator microarchitecture Accelerator microarchitecture Interconnect optimization Interconnect optimization vs. Full-System Simulations) Show Show 2.9x to 42.6x 2.9x to 42.6x evaluation time saving over the full evaluation time saving over the full-system system simulations simulations More than 97% of flow time: C/C++ More than 97% of flow time: C/C++ -> RTL > RTL -> FPGA > FPGA bitstream bitstream MachSuite Medical Imaging Applications 1 1.5 2 ndwidth s) More than 97% of flow time: C/C++ More than 97% of flow time: C/C++ -> RTL > RTL -> FPGA > FPGA bitstream bitstream Native execution on the prototype: Native execution on the prototype: 10 10 3 ~ ~ 10 10 4 faster than the full faster than the full-sys simulations sys simulations 0 0.5 Grad Gauss Seg Rician Grad + Rician Grad + Gauss + Rician all Memory ban (GB/s Coherence choices Coherence choices 1 2 e gain Inter-Acc Intra-Acc 0 1 Grad Gauss Seg Rician Performance Coherent at L2 cache in ARM Coherent at DRAM ARA FPGA prototype (100MHz) vs. state ARA FPGA prototype (100MHz) vs. state-of of-the the-art processors art processors Code size Code size reduction reduction ARAPrototyper ARAPrototyper and HLS tools and HLS tools reduce the design efforts reduce the design efforts Energy-Efficiency Summary: Prototype: 7.44x over Intel Xeon Acc 0 : Gradient Acc : Gaussian Acc 2 : Segmentation Acc : Rician Prototype: 7.44x over Intel Xeon ASIC projection: 84x over Intel Xeon (Haswell @1.9GHz), according to Kuon, TCAD’07 Acc 1 : Gaussian Acc 3 : Rician Heterogeneous Accelerators

Architectures ARAPrototyper: Enabling Rapid Prototyping and …cadlab.cs.ucla.edu/~peipei/files/website/ARAPrototyper... · 2016. 2. 18. · Center for Future ARAPrototyper: Enabling

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Architectures ARAPrototyper: Enabling Rapid Prototyping and …cadlab.cs.ucla.edu/~peipei/files/website/ARAPrototyper... · 2016. 2. 18. · Center for Future ARAPrototyper: Enabling

ARAPrototyper: Enabling Rapid Prototyping and Center for Future

Architectures ARAPrototyper: Enabling Rapid Prototyping and Evaluation for the Accelerator-Rich Architecture

Architectures Research Evaluation for the Accelerator-Rich Architecture

Yu-Ting Chen, Jason Cong, Zhenman Fang, and Peipei Zhou

Motivation, Contributions, and Prototype Baseline Accelerator-Rich ARA Memory System Template MotivationMotivation

AcceleratorAccelerator--rich architectures (ARAs) can provide 10rich architectures (ARAs) can provide 10--100X energy 100X energy

Motivation, Contributions, and Prototype Baseline Accelerator-Rich Architecture (ARA) Prototype

ARA Memory System Template

AcceleratorAccelerator--rich architectures (ARAs) can provide 10rich architectures (ARAs) can provide 10--100X energy 100X energy

efficiency over current chip multiefficiency over current chip multi--processorsprocessors

Need methodology to efficiently and accurately evaluate an ARANeed methodology to efficiently and accurately evaluate an ARA

Limitations of fullLimitations of full--system simulation:system simulation: Limitations of fullLimitations of full--system simulation:system simulation:

•• Modeling: difficult to model the customized interconnects of an ARAModeling: difficult to model the customized interconnects of an ARA

•• Speed: 300KIPS ~ 3MIPS; > 1000x slower than native executionSpeed: 300KIPS ~ 3MIPS; > 1000x slower than native execution

ContributionsContributions

Show Show 4000 to 10,000 evaluation time saving 4000 to 10,000 evaluation time saving over fullover full--system system

simulation.simulation.simulation.simulation.

Provide a highlyProvide a highly--automated prototyping flowautomated prototyping flow

Provide an efficient evaluation framework and APIs for usersProvide an efficient evaluation framework and APIs for users Prototype platform: Prototype platform: Prototype platform: Prototype platform:

Xilinx Xilinx ZynqZynq ZC706 ZC706

•• SoCSoC

•• DualDual--core core CortexCortex--A9 A9 ARMARM Shared homogeneous buffersShared homogeneous buffers•• DualDual--core core CortexCortex--A9 A9 ARMARM

•• OnOn--board BRAMs board BRAMs

•• FPGA (accelerators andFPGA (accelerators and

interconnectsinterconnects Accelerator planeAccelerator plane

Shared homogeneous buffersShared homogeneous buffers

Partial crossbar: accelerators Partial crossbar: accelerators shared buffersshared buffers

Provide enough connectivity and guaranteed fixed latencyProvide enough connectivity and guaranteed fixed latencyinterconnectsinterconnects

1GB on1GB on--board DRAMboard DRAM

Linux Linux can be portedcan be ported

Heterogeneous accelerators + ARA memory systemHeterogeneous accelerators + ARA memory system

Processor planeProcessor plane

Cores + a shared lastCores + a shared last--level cachelevel cache

Interleaved network: shared buffers Interleaved network: shared buffers DMACsDMACs

Improve offImprove off--chip prefetching efficiencychip prefetching efficiency

IOMMU + IOTLB => improve page translation efficiencyIOMMU + IOTLB => improve page translation efficiency

Hardware Design Automation System Software Stack Program the Accelerators

Cores + a shared lastCores + a shared last--level cachelevel cache IOMMU + IOTLB => improve page translation efficiencyIOMMU + IOTLB => improve page translation efficiency

Hardware Design Automation System Software Stack Major ComponentsMajor Components

GAM GAM –– global accelerator managerglobal accelerator manager

Program the Accelerators API supports for utilizing accelerators in an ARAAPI supports for utilizing accelerators in an ARA

C/C++ based APIs for manipulating acceleratorsC/C++ based APIs for manipulating accelerators

DBA DBA –– dynamic buffer dynamic buffer allocatorallocator

TLB miss handlerTLB miss handler

Coherence Coherence ManagerManager

C/C++ based APIs for manipulating acceleratorsC/C++ based APIs for manipulating accelerators

•• reservereserve(), (), check_reservecheck_reserve() (make reservations)() (make reservations)

•• send_paramsend_param() (send parameters and start the Acc.)() (send parameters and start the Acc.)

•• ccheck_doneheck_done() (poll the done signal)() (poll the done signal)

•• Coherence Coherence ManagerManager

•• Accelerator can directly fetch data from offAccelerator can directly fetch data from off--chip DRAM with the help of chip DRAM with the help of Coherence ManagerCoherence Manager

•• ffreeree() (release the accelerator)() (release the accelerator)

Easy compilation Easy compilation –– only only gccgcc is requiredis required

Executable can be run on board with Linux directly!Executable can be run on board with Linux directly!

ARA specification file

Executable can be run on board with Linux directly!Executable can be run on board with Linux directly!

ARAPrototyper design automation flow

Accelerator kernels

ARA specification file

Accelerator kernels

Shared buffer banks # and sizesShared buffer banks # and sizes

Interconnects:(1) Partial crossbar configuration(2) Interleaved network

TLB(2) Interleaved network

TLB

Coherent at memory or L2 cache Frequency

Results & PrototypeCode Size, Evaluation Time (Our Flow

vs. Full-System Simulations)

Case Study: Medical Imaging & MachSuite Results & Prototype

Accelerator microarchitectureAccelerator microarchitecture Interconnect optimizationInterconnect optimizationvs. Full-System Simulations)

Show Show 2.9x to 42.6x 2.9x to 42.6x evaluation time saving over the fullevaluation time saving over the full--system system simulationssimulations

More than 97% of flow time: C/C++ More than 97% of flow time: C/C++ --> RTL > RTL --> FPGA > FPGA bitstreambitstream

MachSuiteMedical Imaging Applications

0.51

1.52

Mem

ory

ban

dw

idth

(G

B/s

)

More than 97% of flow time: C/C++ More than 97% of flow time: C/C++ --> RTL > RTL --> FPGA > FPGA bitstreambitstream

Native execution on the prototype: Native execution on the prototype: 101033 ~ ~ 101044 faster than the fullfaster than the full--sys simulationssys simulations0

0.5

Grad Gauss Seg Rician Grad + Rician

Grad + Gauss

+ Rician

all

Mem

ory

ban

dw

idth

(G

B/s

)

Coherence choicesCoherence choices

1

2

Pe

rfo

rma

nce

ga

in

Inter-Acc Intra-Acc

0

1

Grad Gauss Seg Rician

Pe

rfo

rma

nce

ga

in

Coherent at L2 cache in ARM Coherent at DRAM

ARA FPGA prototype (100MHz) vs. stateARA FPGA prototype (100MHz) vs. state--ofof--thethe--art processorsart processors Code size Code size reductionreduction

ARAPrototyperARAPrototyper and HLS toolsand HLS tools

reduce the design effortsreduce the design efforts

Energy-Efficiency Summary:Prototype: 7.44x over Intel Xeon

Acc0: Gradient

Acc : Gaussian

Acc2: Segmentation

Acc : RicianPrototype: 7.44x over Intel XeonASIC projection: 84x over Intel Xeon (Haswell @1.9GHz), according to Kuon, TCAD’07

Acc1: Gaussian Acc3: Rician

Heterogeneous AcceleratorsHeterogeneous Accelerators