TPUTCACHE: HIGH-FREQUENCY, MULTI-WAY CACHE FOR HIGH-THROUGHPUT FPGA APPLICATIONS


1

TPUTCACHE: HIGH-FREQUENCY, MULTI-WAY CACHE FOR HIGH-THROUGHPUT FPGA APPLICATIONS

Aaron Severance, University of British Columbia

Advised by Guy Lemieux

2

Our Problem

• We use overlays for data processing
  • Partially/fully fixed processing elements
  • Virtual CGRAs, soft vector processors
• Memory: large register files/scratchpad in overlay
  • Low latency, local data
  • Trivial case (large DMA): burst to/from DDR
  • Non-trivial?

Scatter/Gather

• Data-dependent store/load:

    vscatter adr_ptr, idx_vect, data_vect
      for i in 1..N
        adr_ptr[idx_vect[i]] <= data_vect[i]

• Random narrow (32-bit) accesses waste bandwidth on DDR interfaces (a C sketch of this access pattern follows)
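As a rough illustration (my own, not from the slides), the vscatter above corresponds to the scalar C loop below; each store lands at a data-dependent address, so a burst-oriented DDR interface ends up moving a full burst for every 32-bit access:

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical scalar equivalent of the vscatter pseudocode above.
     * Consecutive iterations rarely touch the same DDR burst, so most
     * of each burst's bandwidth is wasted. */
    void scatter32(uint32_t *adr_ptr, const uint32_t *idx_vect,
                   const uint32_t *data_vect, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            adr_ptr[idx_vect[i]] = data_vect[i];
    }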

3

4

If Data Fits on the FPGA…

• BRAMs with interconnect network
• General network…
  • Not customized per application
  • Shared: all masters <-> all slaves
• Memory-mapped BRAM
  • Double-pump (2x clk) if possible
  • Banking/LVT/etc. for further ports (bank decoding sketched below)
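As a hedged sketch (assuming a simple word-interleaved scheme; this layout is not specified on the slides), banking splits the memory-mapped region across several physical BRAMs so independent ports can be served in parallel:

    #include <stdint.h>

    /* Hypothetical word-interleaved banking: with NUM_BANKS a power of two,
     * the low word-address bits pick the bank and the remaining bits pick
     * the word within that bank. */
    #define NUM_BANKS 4u

    static inline uint32_t bank_of(uint32_t word_addr)       { return word_addr % NUM_BANKS; }
    static inline uint32_t index_in_bank(uint32_t word_addr) { return word_addr / NUM_BANKS; }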

5

Example BRAM system

6

But if data doesn’t fit… (oversimplified)

7

So Let’s Use a Cache

• But a throughput-focused cache
  • Low-latency data held in local memories
  • Amortize latency over multiple accesses
  • Focus on bandwidth
• Replace on-chip memory or augment the memory controller?
  • Data fits on-chip: want BRAM-like speed and bandwidth, low overhead compared to shared BRAM
  • Data doesn't fit on-chip: use 'leftover' BRAMs for performance

8

9

TputCache Design Goals

• Fmax near BRAM Fmax
• Fully pipelined
• Support multiple outstanding misses
• Write coalescing
• Associativity

10

TputCache Architecture

• Replay-based architecture
  • Reinsert misses back into the pipeline
  • Separate line fill/evict logic runs in the background
  • Token FIFO for completing requests in order
• No MSHRs for tracking misses
  • Fewer muxes (only a single replay request mux)
  • 6-stage pipeline -> 6 outstanding misses
• Good performance with high hit rate: the common case is fast (behavioral sketch below)
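A minimal software model of the replay idea (my own simplification, not the presented RTL; the function names, direct-mapped tags, and single-cycle fill are assumptions): a request that misses is not parked in an MSHR, it is simply sent around the pipeline again while the background evict/fill logic installs the line, and it completes on a later trip when it finally hits.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_LINES 16

    typedef struct { bool valid; uint32_t tag; } line_t;
    static line_t cache[NUM_LINES];

    static bool lookup(uint32_t addr)
    {
        uint32_t idx = addr % NUM_LINES;
        return cache[idx].valid && cache[idx].tag == addr / NUM_LINES;
    }

    static void background_fill(uint32_t addr)   /* evict/fill logic */
    {
        uint32_t idx = addr % NUM_LINES;
        cache[idx].valid = true;
        cache[idx].tag   = addr / NUM_LINES;
    }

    /* A request circulates until it hits; each extra trip is a "replay". */
    static int access_with_replay(uint32_t addr)
    {
        int replays = 0;
        while (!lookup(addr)) {       /* miss: reinsert into the pipeline */
            background_fill(addr);    /* fill proceeds in the background  */
            replays++;
        }
        return replays;
    }

    int main(void)
    {
        printf("replays for first access:  %d\n", access_with_replay(42)); /* miss, then hit */
        printf("replays for second access: %d\n", access_with_replay(42)); /* hit            */
        return 0;
    }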

11

TputCache Architecture

12

Cache Hit

13

Cache Miss

14

Evict/Fill Logic

15

Area & Fmax Results

• Reaches 253 MHz compared to 270 MHz BRAM Fmax on Cyclone IV
• 423 MHz compared to 490 MHz BRAM Fmax on Stratix IV
• Minor degradation with increasing size and associativity
• 13% to 35% extra BRAM usage for tags and queues

16

Benchmark Setup

• TputCache
  • 128 kB, 4-way, 32-byte lines (address breakdown sketched below)
• MXP soft vector processor
  • 16 lanes, 128 kB scratchpad memory
  • Scatter/Gather memory unit: indexed loads/stores per lane
• Double-pumping port adapters
  • TputCache runs at 2x the frequency of MXP
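For the stated configuration, 128 KiB / (32 B lines x 4 ways) = 1024 sets, so a byte address splits into a 5-bit line offset, a 10-bit set index, and the remaining tag bits. A small C sketch of that split (macro and function names are mine, not from the slides):

    #include <stdint.h>

    /* 128 kB, 4-way, 32-byte lines => 128*1024 / (32*4) = 1024 sets */
    #define LINE_BYTES   32u                                       /* 5 offset bits  */
    #define NUM_WAYS     4u
    #define CACHE_BYTES  (128u * 1024u)
    #define NUM_SETS     (CACHE_BYTES / (LINE_BYTES * NUM_WAYS))   /* 1024 => 10 index bits */

    static inline uint32_t line_offset(uint32_t addr) { return addr & (LINE_BYTES - 1); }
    static inline uint32_t set_index(uint32_t addr)   { return (addr / LINE_BYTES) & (NUM_SETS - 1); }
    static inline uint32_t tag_bits(uint32_t addr)    { return addr / (LINE_BYTES * NUM_SETS); }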

MXP Soft Vector Processor

17

18

Histogram

• Instantiate a number of Virtual Processors (VPs) mapped across lanes
• Each VP histograms part of the image
• Final pass to sum the VP partial histograms (scalar sketch below)
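A hedged scalar sketch of the benchmark's structure (plain C, not MXP code; NUM_VPS and the striding scheme are assumptions): each virtual processor accumulates a private partial histogram over its slice of the image, then a final pass sums the partials.

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    #define NUM_VPS 16
    #define BINS    256

    void histogram(const uint8_t *image, size_t pixels, uint32_t hist[BINS])
    {
        static uint32_t partial[NUM_VPS][BINS];
        memset(partial, 0, sizeof partial);

        for (int vp = 0; vp < NUM_VPS; vp++)            /* each VP works on its own slice */
            for (size_t i = vp; i < pixels; i += NUM_VPS)
                partial[vp][image[i]]++;

        for (int b = 0; b < BINS; b++) {                /* final reduction pass */
            hist[b] = 0;
            for (int vp = 0; vp < NUM_VPS; vp++)
                hist[b] += partial[vp][b];
        }
    }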

19

Hough Transform

• Convert an image to 2D Hough space (angle, radius)
• Each vector element calculates the radius for a given angle
• Adds the pixel value to the counter (scalar sketch below)
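A hedged scalar sketch of the same computation (angle/radius resolutions and types are assumptions, not from the slides): for every pixel and every angle, compute the radius and add the pixel value to the corresponding accumulator.

    #include <math.h>
    #include <stdint.h>

    #define ANGLES 180
    #define RADII  1024

    void hough(const uint8_t *image, int width, int height,
               uint32_t acc[ANGLES][RADII])
    {
        const double PI = 3.14159265358979323846;
        for (int y = 0; y < height; y++)
            for (int x = 0; x < width; x++)
                for (int a = 0; a < ANGLES; a++) {       /* one vector element per angle */
                    double theta = a * PI / ANGLES;
                    int r = (int)lround(x * cos(theta) + y * sin(theta)) + RADII / 2;
                    if (r >= 0 && r < RADII)
                        acc[a][r] += image[y * width + x];  /* add pixel value to counter */
                }
    }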

20

Motion Compensation

• Load a block from the reference image, interpolate
• Offset by a small amount from its location in the current image (bilinear sketch below)
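A hedged scalar sketch of the block fetch (block size, fraction bits, and function names are assumptions for illustration): load a block from the reference image at a sub-pixel offset from the current-block position and bilinearly interpolate the four neighbouring samples.

    #include <stdint.h>

    #define B     16            /* block size            */
    #define FBITS 2             /* quarter-pel fractions */

    void mc_block(const uint8_t *ref, int stride,
                  int x0, int y0,          /* block position in current image */
                  int mvx, int mvy,        /* motion vector in quarter-pel    */
                  uint8_t out[B][B])
    {
        int ix = x0 + (mvx >> FBITS), fx = mvx & ((1 << FBITS) - 1);
        int iy = y0 + (mvy >> FBITS), fy = mvy & ((1 << FBITS) - 1);

        for (int y = 0; y < B; y++)
            for (int x = 0; x < B; x++) {
                const uint8_t *p = ref + (iy + y) * stride + (ix + x);
                int a = p[0], b = p[1], c = p[stride], d = p[stride + 1];
                /* bilinear blend of the four neighbouring reference pixels */
                out[y][x] = (uint8_t)(((4 - fx) * (4 - fy) * a + fx * (4 - fy) * b +
                                       (4 - fx) * fy * c + fx * fy * d + 8) >> 4);
            }
    }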

21

Future Work

• More ports needed for scalability
  • Share the evict/fill BRAM port with a 2nd request
  • Banking (sharing the same evict/fill logic)
  • Multiported BRAM designs
• Write cache
  • Allocate-on-write currently
  • Track dirty state of bytes in the BRAMs' 9th bit
• Non-blocking behavior
  • Multiple token FIFOs (one per requestor)?

22

FAQ

• Coherency?
  • Envisioned as the only/last-level cache; coherency is future work
• Replay loops/problems?
  • Mitigated by random replacement + associativity
• Power?
  • Expected to be not great…

23

Conclusions

• TputCache: an alternative to shared BRAM
  • Low overhead (13%-35% extra BRAM)
  • Nearly as high Fmax (253 MHz vs 270 MHz)
• More flexible than shared BRAM
  • Performance degrades gradually
  • Cache behavior instead of manual filling

24

Questions?

Thank you
