TPUTCACHE: HIGH-FREQUENCY, MULTI-WAY CACHE FOR HIGH-THROUGHPUT FPGA APPLICATIONS

Aaron Severance, University of British Columbia
Advised by Guy Lemieux




Our Problem

• We use overlays for data processing
  • Partially/fully fixed processing elements
  • Virtual CGRAs, soft vector processors
• Memory:
  • Large register files/scratchpad in overlay
    • Low latency, local data
  • Trivial (large DMA): burst to/from DDR
  • Non-trivial?


Scatter/Gather

• Data dependent store/load
  • vscatter adr_ptr, idx_vect, data_vect
  • for i in 1..N
      adr_ptr[idx_vect[i]] <= data_vect[i]
• Random narrow (32-bit) accesses
  • Waste bandwidth on DDR interfaces
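The vscatter pseudocode on this slide can be sketched in software as follows. This is only an illustrative model, assuming a flat Python list stands in for memory and that indices are element offsets from a base address; the function names `vscatter` and `vgather` and their signatures are chosen for this example and are not the MXP instruction interface.

```python
def vscatter(memory, base, idx_vect, data_vect):
    """Store data_vect[i] at memory[base + idx_vect[i]] for every element i."""
    for idx, value in zip(idx_vect, data_vect):
        # Each store address depends on data (the index vector),
        # so consecutive stores may land anywhere in memory.
        memory[base + idx] = value


def vgather(memory, base, idx_vect):
    """Load memory[base + idx_vect[i]] for every element i."""
    return [memory[base + idx] for idx in idx_vect]


memory = [0] * 16
vscatter(memory, 4, [3, 0, 7], [30, 10, 70])
print(vgather(memory, 4, [0, 3, 7]))  # prints [10, 30, 70]
```

Because each element touches its own address, a DDR interface that only moves wide bursts would fetch far more data than the single 32-bit word each access needs.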



If Data Fits on the FPGA…

• BRAMs with interconnect network
  • General network…
    • Not customized per application
    • Shared: all masters <-> all slaves
• Memory mapped BRAM
  • Double-pump (2x clk) if possible
  • Banking/LVT/etc. for further ports


Example BRAM system


But if data doesn’t fit… (oversimplified)


So Let’s Use a Cache

• But a throughput focused cache
  • Low latency data held in local memories
  • Amortize latency over multiple accesses
  • Focus on bandwidth


Replace on-chip memory or augment memory controller?

• Data fits on-chip
  • Want BRAM like speed, bandwidth
  • Low overhead compared to shared BRAM
• Data doesn’t fit on-chip
  • Use ‘leftover’ BRAMs for performance



TputCache Design Goals

• Fmax near BRAM Fmax
• Fully pipelined
• Support multiple outstanding misses
• Write coalescing
• Associativity


TputCache Architecture

• Replay based architecture
  • Reinsert misses back into pipeline
  • Separate line fill/evict logic in background
  • Token FIFO for completing requests in order
• No MSHRs for tracking misses
  • Fewer muxes (only single replay request mux)
  • 6 stage pipeline -> 6 outstanding misses
• Good performance with high hit rate
  • Common case fast
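The replay idea above can be illustrated with a small behavioral model. This is a rough sketch under stated assumptions, not the TputCache RTL: the cache is modeled as a set of resident line numbers with no eviction, the fill latency of 20 cycles is arbitrary, and the names `simulate`, `warm_lines`, and `fill_latency` are invented for this example. Only the 6-stage depth, the single replay path, and in-order completion via tokens come from the slide.

```python
from collections import deque

PIPELINE_DEPTH = 6  # from the slide: 6 stages -> 6 outstanding misses


def simulate(addresses, warm_lines=(), line_bytes=32, fill_latency=20):
    """Return (retired tokens, order in which requests hit, replay count)."""
    resident = set(warm_lines)      # lines currently in the cache (no eviction modeled)
    pending_fills = {}              # line -> cycle at which the background fill lands
    incoming = deque(enumerate(addresses))   # (token, address) pairs
    pipeline = deque([None] * PIPELINE_DEPTH)
    replay = deque()
    finished = set()
    finish_order, retired = [], []
    replays = 0
    cycle = 0
    while len(retired) < len(addresses):
        cycle += 1
        # Background fill logic: install every line whose fill has completed.
        for line in [l for l, ready in pending_fills.items() if ready <= cycle]:
            resident.add(line)
            del pending_fills[line]
        # Last pipeline stage: hit completes, miss is sent around again.
        request = pipeline.pop()
        if request is not None:
            token, address = request
            line = address // line_bytes
            if line in resident:
                finished.add(token)
                finish_order.append(token)
            else:
                pending_fills.setdefault(line, cycle + fill_latency)
                replay.append(request)
                replays += 1
        # Single replay mux: a replayed request takes priority over a new one.
        if replay:
            pipeline.appendleft(replay.popleft())
        elif incoming:
            pipeline.appendleft(incoming.popleft())
        else:
            pipeline.appendleft(None)
        # Token FIFO: release results strictly in request order.
        while len(retired) in finished:
            retired.append(len(retired))
    return retired, finish_order, replays


retired, finish_order, replays = simulate([0, 64], warm_lines={2})
print(retired, finish_order, replays)
```

In this run the second request hits a warm line and finishes first, but it is held until the replayed first request completes, so results are still released in request order.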


TputCache Architecture


Cache Hit


Cache Miss


Evict/Fill Logic


Area & Fmax Results

• Reaches 253MHz compared to 270MHz BRAM fmax on Cyclone IV
• 423MHz compared to 490MHz BRAM fmax on Stratix IV
• Minor degradation with increasing size, associativity
• 13% to 35% extra BRAM usage for tags, queues
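To put the two frequency results on a common scale, the short calculation below expresses each cache fmax as a fraction of the BRAM fmax on the same device. It uses only the four numbers quoted on this slide; the helper name `fmax_ratio` is made up for the example.

```python
# (cache fmax, BRAM fmax) in MHz, as quoted on the slide
results = {"Cyclone IV": (253, 270), "Stratix IV": (423, 490)}


def fmax_ratio(cache_mhz, bram_mhz):
    """Fraction of the raw BRAM clock rate that the cache reaches."""
    return cache_mhz / bram_mhz


for device, (cache_mhz, bram_mhz) in results.items():
    print(f"{device}: {fmax_ratio(cache_mhz, bram_mhz):.1%} of BRAM fmax")
# prints:
# Cyclone IV: 93.7% of BRAM fmax
# Stratix IV: 86.3% of BRAM fmax
```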


Benchmark Setup

• TputCache
  • 128kB, 4-way, 32-byte lines
• MXP soft vector processor
  • 16 lanes, 128kB scratchpad memory
• Scatter/Gather memory unit
  • Indexed loads/stores per lane
• Doublepumping port adapters
  • TputCache runs at 2x frequency of MXP
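The cache parameters above fix the geometry of the tag lookup. The sketch below works out the number of sets and the address field widths, and splits an address into tag, index, and offset. The 32-bit byte address width is an assumption (the slides do not state it), and `split_address` is a name chosen for this example.

```python
CACHE_BYTES = 128 * 1024   # 128kB, from the slide
WAYS = 4                   # 4-way, from the slide
LINE_BYTES = 32            # 32-byte lines, from the slide
ADDRESS_BITS = 32          # assumption: 32-bit byte addresses

NUM_LINES = CACHE_BYTES // LINE_BYTES          # 4096 lines in total
NUM_SETS = NUM_LINES // WAYS                   # 1024 sets
OFFSET_BITS = LINE_BYTES.bit_length() - 1      # 5 bits select a byte in the line
INDEX_BITS = NUM_SETS.bit_length() - 1         # 10 bits select the set
TAG_BITS = ADDRESS_BITS - INDEX_BITS - OFFSET_BITS  # 17 bits stored per line


def split_address(address):
    """Split a byte address into (tag, set index, byte offset)."""
    offset = address & (LINE_BYTES - 1)
    index = (address >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = address >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


print(NUM_SETS, OFFSET_BITS, INDEX_BITS, TAG_BITS)  # prints 1024 5 10 17
print(split_address(0x12345678))                     # prints (9320, 691, 24)
```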


MXP Soft Vector Processor



Histogram

• Instantiate a number of Virtual Processors (VPs) mapped across lanes
• Each VP histograms part of the image
• Final pass to sum VP partial histograms
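The three steps of the histogram benchmark can be sketched in plain Python as below. This is a sequential model only: the interleaved assignment of pixels to VPs, the default of 4 VPs, and the function name `histogram_with_vps` are assumptions made for illustration, not details from the slides.

```python
def histogram_with_vps(pixels, num_vps=4, num_bins=256):
    """Histogram pixels using per-VP partial histograms, then sum them."""
    # One private histogram per virtual processor, so VPs never
    # update the same counter.
    partial = [[0] * num_bins for _ in range(num_vps)]
    for i, pixel in enumerate(pixels):
        vp = i % num_vps              # assumption: pixels are interleaved across VPs
        partial[vp][pixel] += 1       # data-dependent (scatter-like) update
    # Final pass: sum the partial histograms bin by bin.
    return [sum(p[b] for p in partial) for b in range(num_bins)]


print(histogram_with_vps([0, 1, 1, 3, 3, 3], num_vps=2, num_bins=4))
# prints [1, 2, 0, 3]
```

The counter update is indexed by pixel value, which is why this workload exercises the scatter/gather path rather than streaming DMA.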


Hough Transform

• Convert an image to 2D Hough Space (angle, radius)
• Each vector element calculates the radius for a given angle
• Adds pixel value to counter
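A minimal scalar sketch of the accumulation described above follows. It uses the standard line parameterization radius = x·cos(θ) + y·sin(θ); the number of angles, the rounding, and the offset used to keep negative radii in range are assumptions for this example, and the inner loop over angles stands in for what the slides describe as one vector element per angle.

```python
import math


def hough_transform(image, num_angles=180):
    """Accumulate pixel values into (angle, radius) Hough space."""
    height, width = len(image), len(image[0])
    max_radius = math.ceil(math.hypot(width, height))
    # Radius can be negative, so counters are offset by max_radius.
    accumulator = [[0] * (2 * max_radius + 1) for _ in range(num_angles)]
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            if value == 0:
                continue
            # In the vector version each element would handle one angle.
            for a in range(num_angles):
                theta = math.pi * a / num_angles
                radius = round(x * math.cos(theta) + y * math.sin(theta))
                # Data-dependent counter address: a scatter-style update.
                accumulator[a][radius + max_radius] += value
    return accumulator, max_radius


# A vertical line at x = 2 in a 5x5 image
image = [[1 if x == 2 else 0 for x in range(5)] for y in range(5)]
accumulator, max_radius = hough_transform(image)
print(accumulator[0][2 + max_radius])  # prints 5
```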


Motion Compensation

• Load block from reference image, interpolate
• Offset by small amount from location in current image
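The block load and interpolation can be sketched as below. The slides do not say which interpolation filter is used, so this example assumes half-pixel motion vectors with a simple rounded bilinear average; the function name `motion_compensate`, its parameters, and the block size are all hypothetical. The caller is assumed to keep the block inside the reference image.

```python
def motion_compensate(reference, x, y, mv_x2, mv_y2, block=4):
    """Predict a block at (x, y) from `reference`.

    mv_x2 and mv_y2 are motion vector components in half-pixel units
    (an assumption for this sketch).
    """
    # Integer part of the offset and the half-pixel flag (0 or 1).
    x0, fx = x + mv_x2 // 2, mv_x2 % 2
    y0, fy = y + mv_y2 // 2, mv_y2 % 2
    predicted = []
    for r in range(block):
        row = []
        for c in range(block):
            # Four neighbouring reference pixels; they coincide when
            # the corresponding half-pixel flag is 0.
            a = reference[y0 + r][x0 + c]
            b = reference[y0 + r][x0 + c + fx]
            d = reference[y0 + r + fy][x0 + c]
            e = reference[y0 + r + fy][x0 + c + fx]
            row.append((a + b + d + e + 2) // 4)  # rounded average
        predicted.append(row)
    return predicted


# Horizontal ramp: pixel value is 10 * column
reference = [[10 * c for c in range(8)] for _ in range(8)]
print(motion_compensate(reference, 2, 2, 0, 0, block=2))  # prints [[20, 30], [20, 30]]
print(motion_compensate(reference, 2, 2, 1, 0, block=2))  # prints [[25, 35], [25, 35]]
```

Since the block location depends on the motion vector, the reference loads start at data-dependent addresses near the current block, which gives a cache some locality to exploit.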


Future Work

• More ports needed for scalability
  • Share evict/fill BRAM port with 2nd request
  • Banking (sharing same evict/fill logic)
  • Multiported BRAM designs
• Write cache
  • Allocate on write currently
  • Track dirty state of bytes in the BRAMs' 9th bit
• Non-blocking behavior
  • Multiple token FIFOs (one per requestor)?


FAQ

• Coherency
  • Envisioned as only/LLC
  • Future work
• Replay loops/problems
  • Random replacement + associativity
• Power expected to be not great…


Conclusions

• TputCache: alternative to shared BRAM
  • Low overhead (13%-35% extra BRAM)
  • Nearly as high fmax (253MHz vs 270MHz)
• More flexible than shared BRAM
  • Performance degrades gradually
  • Cache behavior instead of manual filling


Questions?

Thank you