Efficient Data Supply for Hardware Accelerators with ...tchen/files/accelmem-micro16-slides.pdf · Si = val[j] * vec[cols[j]]; sum = sum + Si; } 2380 8010 541C 2384 8328 5420 2388

Cornell University

Efficient Data Supply for Hardware Accelerators with Prefetching and Access/

Execute Decoupling

Tao Chen and G. Edward SuhComputer Systems Laboratory

Cornell University

Tao ChenCornell University 2

Accelerator-Rich Computing Systems

• Computing systems are becoming accelerator-rich • General-purpose cores + a large number of accelerators

• Challenge: Design and verification complexity • Non-recurring engineering (NRE) cost per accelerator

• Manual efforts are a major source of cost • Create computation pipelines • Manage data supply from memory

High-Level Synthesis (HLS)

This work: An automated framework for generating accelerators with efficient data supply


Inefficiencies in Accelerator Data Supply

Scratchpad-based accelerators• On-chip scratchpad memory (SPM)• Manually designed logic to move

data between SPM and main memory

• Pros: Good performance• Cons: High design effort,

accelerator-specific, not reusable

Accelerator

ComputeLogic

Memory Bus

SPM PreloadLogic


Inefficiencies in Accelerator Data Supply

Scratchpad-based accelerators• On-chip scratchpad memory (SPM)• Manually designed logic to move

data between SPM and main memory

• Pros: Good performance• Cons: High design effort,

accelerator-specific, not reusable

Cache-based accelerators• Pros: Low design effort, cache can be reused• Cons: Uncertain memory latency impacts performance

Accelerator

ComputeLogic

Memory Bus

Cache


Optimize Data Supply for Cache-Based Accelerators

Approach: automated framework for generating accelerators with efficient data supply

Accelerator Source

Automated Framework

Accelerator w/ Efficient Data Supply




Accelerator Source

Automated Framework


Accelerator

Cache

Memory Bus

ComputeLogic

Techniques• Prefetching• Tagging memory accesses

HWPrefetcher




Accelerator Source

Automated Framework


Accelerator

Cache

Memory Bus

Techniques• Prefetching• Tagging memory accesses

• Access/Execute Decoupling• Program slicing + architecture

templateHW

Prefetcher

AccessLogic

ExecuteLogic


Impact of Uncertain Memory Latency

• Example: Sparse Matrix Vector Multiplication (spmv) • Pipeline generated with High-Level Synthesis (HLS)

// inner loop of sparse matrix// vector multiplication

for (j = begin; j < end; j++) { #pragma HLS pipeline Si = val[j] * vec[cols[j]]; sum = sum + Si; }

LDLDLD

LDLD LD

MUL

ADDMUL

LD

ADDMUL

LDLD

ADD

LD

MUL

LDLD

ADD

Time

HLS


Impact of Uncertain Memory Latency

• Example: Sparse Matrix Vector Multiplication (spmv) • Pipeline generated with High-Level Synthesis (HLS)

// inner loop of sparse matrix// vector multiplication

for (j = begin; j < end; j++) { #pragma HLS pipeline Si = val[j] * vec[cols[j]]; sum = sum + Si; }

LDLDLD

LDLD LD

MUL

ADDMUL

LD

ADDMUL

LDLD

ADD

LD

MUL

LDLD

ADD

miss

Time

A cache miss stalls the entire accelerator pipeline

Regular stride

Regular strideIrregular

• Reduce cache misses for regular accesses• Prefetch data into the cache

• Tolerate cache misses for irregular accesses• Access/Execute Decoupling

HLS


Hardware Prefetching

• Predict future memory accesses

• PC is often used as a hint• Stream localization• Spatial correlation predictionfor (j = begin; j < end; j++) { Si = val[j] * vec[cols[j]]; sum = sum + Si; }

2380

8010541C

2384

83285420

2388

84545424

238C

81B85428

GlobalAddr

Stream

●●●




• PC is often used as a hint• Stream localization• Spatial correlation prediction

• Problem: accelerators lack a PC

• Solution: generate PC-like tags for accelerator memory accesses

for (j = begin; j < end; j++) { Si = val[j] * vec[cols[j]]; sum = sum + Si; }

2380

8010541C

2384

83285420

2388

84545424

238C

81B85428

GlobalAddr

Stream

2380

8010541C

2384

83285420

2388

84545424

238C

81B85428

Local Addr Streams

●●●

regular strides

irregular no pred

238023842388238C

PC x541C542054245428

PC y80108328845481B8

PC z




• PC is often used as a hint• Stream localization• Spatial correlation prediction

• Problem: accelerators lack a PC

• Solution: generate PC-like tags for accelerator memory accesses

for (j = begin; j < end; j++) { Si = val[j] * vec[cols[j]]; sum = sum + Si; }

2380

8010541C

2384

83285420

2388

84545424

238C

81B85428

GlobalAddr

Stream

2380

8010541C

2384

83285420

2388

84545424

238C

81B85428

Local Addr Streams

●●●

regular strides

irregular no pred

238023842388238C

PC x541C542054245428

PC y80108328845481B8

PC z

BB1

BB3

LD

LD

LD

×

+

BB2

x

yz

CDFG


Decoupled Access/Execute (DAE)

• Limitations of Hardware Prefetching• Not accurate for complex patterns / Needs warm-up time• Fundamental reason: lack of semantic information

• Decoupled Access/Execute• Allow memory accesses to run ahead to preload data

Time

MemoryAccess

ValueComp

memory latency

Accelerator

Cache

Memory Bus

HWPrefetche

r

AccessLogic

ExecuteLogic

Cache w/ DAE

ComputeManual Preload

memory latency

SPM w/ Manual Preload


Traditional DAE is not Effective for Accelerators

• Traditional DAE: access part forwards data to execute part • Problem: access pipeline stalls on misses • Throughput is limited by access pipeline

• Goal: allow access pipeline to continue to flow under misses

MUL

ADDMUL

ADDMUL

ADDMUL

ADD

LDLDLD

LDLD LD

LD LDLD

LDLDLD

LDLDLD

LDLD LD

MUL

ADDMUL

LD

ADDMUL

LDLD

ADD

LD

MUL

LDLD

ADD

missOriginal

DecoupledAccess Execute


DAE Accelerator with Decoupled Loads

• Anatomy of a load

• Solution: Delegate request/response handling

LD AGenReq

Resp

LDAGenReq

Resp

AGen

ReqResp


Memory Unit

• Proxy for handling memory requests and responses • Supports response reordering and store-to-load forwarding

Load Queueto ExeU

MemUnit

DepCheck

memreq memresp

Store Addr

Store Datafrom ExeU

Load Addr

StoreAddr

Queue

StoreData

Queue

FwdData

Queue

Load Data to AccU

FwdData

LD


Memory Unit

• Proxy for handling memory requests and responses • Supports response reordering and store-to-load forwarding

Load Queueto ExeU

MemUnit

DepCheck

memreq memresp

Store Addr

Store Datafrom ExeU

Load Addr

StoreAddr

Queue

StoreData

Queue

FwdData

Queue

Load Data to AccU

FwdData

LD

ST


Automated DAE Accelerator Generation

• Program slicing for generating access/execute slices • Architectural template with configurable parameters

accel.c

Architectural Template

access.c execute.cslicing slicing

access.v execute.v

HLS HLS

Access/Execute Decoupled Accel

HW GenerationParameters

• Queue sizes• Port width• MemUnit config• etc

Written in PyMTL


Evaluation Methodology

• Vertically integrated modeling methodology • System components: cycle-level (gem5) • Accelerators: register-transfer-level (Vivado HLS, PyMTL) • Area, power and energy: gate-level (commercial ASIC flow)

• Benchmark accelerators from MachSuiteName Description

bbgemm Blocked matrix multiplicationbfsbulk Breadth-First Searchgemm Dense matrix multiplicationmdknn Molecular dynamics (K-Nearest Neighbor)

nw Needleman-Wunsch algorithmspmvcrs Sparse matrix vector multiplicationstencil2d 2D stencil computation

viterbi Viterbi algorithm


Performance Comparison

• 2.28x speedup on average• Prefetching and DAE work in synergy


Energy Comparison

• 15% energy reduction on average because of reduced stalls• MemUnits/queues only consume a small amount of energy


More Details in the Paper

• Deadlock Avoidance • Customization of Memory Units • Baseline Validation • Power and Area Comparison • Energy, Power and Area Breakdown • Sensitivity Study on Varying Queue Sizes • Design Space Exploration: Queue Size Customization


Summary

Cache-based accelerators• Avoid the high design cost of manual data movement logic• Problem: Inefficient in handling uncertain memory latency

Approach: Automated program analysis and architectural template to generate accelerators with efficient data supply • Tagging memory requests to enable prefetching • Decoupling to enable memory accesses to run ahead

Results: High-performance cache-based accelerators with minimal manual efforts

Documents

Efficient Data Supply for Hardware Accelerators with ...tchen/files/accelmem-micro16-slides.pdf · Si = val[j] * vec[cols[j]]; sum = sum + Si; } 2380 8010 541C 2384 8328 5420 2388