Efficient Data Supply for Hardware Accelerators with Prefetching and Access/Execute Decoupling
Tao Chen and G. Edward Suh, Computer Systems Laboratory, Cornell University

Slides: …tchen/files/accelmem-micro16-slides.pdf


Page 1

Cornell University

Efficient Data Supply for Hardware Accelerators with Prefetching and Access/

Execute Decoupling

Tao Chen and G. Edward Suh, Computer Systems Laboratory

Cornell University

Page 2

Tao Chen, Cornell University

Accelerator-Rich Computing Systems

• Computing systems are becoming accelerator-rich
  • General-purpose cores + a large number of accelerators

• Challenge: design and verification complexity
  • Non-recurring engineering (NRE) cost per accelerator

• Manual effort is a major source of cost
  • Creating computation pipelines — addressed by High-Level Synthesis (HLS)
  • Managing data supply from memory — addressed by this work

This work: an automated framework for generating accelerators with efficient data supply


Page 4


Inefficiencies in Accelerator Data Supply

• Scratchpad-based accelerators
  • On-chip scratchpad memory (SPM)
  • Manually designed logic to move data between the SPM and main memory
  • Pros: good performance
  • Cons: high design effort; accelerator-specific, not reusable

• Cache-based accelerators
  • Pros: low design effort; the cache can be reused
  • Cons: uncertain memory latency hurts performance

[Figure: accelerator block diagrams — compute logic paired with an SPM plus preload logic, or with a cache, attached to the memory bus]



Page 7


Optimize Data Supply for Cache-Based Accelerators

Approach: automated framework for generating accelerators with efficient data supply

Accelerator Source → Automated Framework → Accelerator w/ Efficient Data Supply

Techniques
• Prefetching: tagging memory accesses (hardware prefetcher)
• Access/Execute decoupling: program slicing + architecture template (access logic / execute logic)

[Figure: generated accelerator with access logic, execute logic, hardware prefetcher, and cache on the memory bus]


Page 9


Impact of Uncertain Memory Latency

• Example: sparse matrix-vector multiplication (spmv)
• Pipeline generated with High-Level Synthesis (HLS)

// inner loop of sparse matrix-vector multiplication
for (j = begin; j < end; j++) {
  #pragma HLS pipeline
  Si = val[j] * vec[cols[j]];
  sum = sum + Si;
}

[Figure: timeline of the HLS-generated pipeline, with overlapping LD/MUL/ADD operations per iteration; one LD misses in the cache and every later operation waits]

A cache miss stalls the entire accelerator pipeline.

The loads fall into two classes: val[j] and cols[j] follow a regular stride, while vec[cols[j]] is irregular.

• Reduce cache misses for regular accesses: prefetch data into the cache
• Tolerate cache misses for irregular accesses: Access/Execute decoupling
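The regular/irregular distinction can be checked mechanically from an address trace. The sketch below is illustrative only (the function names `has_constant_stride` and `spmv_traces` are not from the slides): it records the per-load address streams of one spmv inner loop and tests whether each stream advances by a fixed stride.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if consecutive addresses differ by the same stride, else 0. */
static int has_constant_stride(const uintptr_t *addrs, size_t n) {
    if (n < 3) return 1;
    intptr_t stride = (intptr_t)(addrs[1] - addrs[0]);
    for (size_t i = 2; i < n; i++)
        if ((intptr_t)(addrs[i] - addrs[i - 1]) != stride) return 0;
    return 1;
}

/* Collect the address traces of two loads in the spmv inner loop. */
static void spmv_traces(const float *val, const int *cols, const float *vec,
                        int begin, int end,
                        uintptr_t *val_trace, uintptr_t *vec_trace) {
    for (int j = begin; j < end; j++) {
        val_trace[j - begin] = (uintptr_t)&val[j];       /* regular stride */
        vec_trace[j - begin] = (uintptr_t)&vec[cols[j]]; /* data-dependent */
    }
}
```

On any input where cols[] is not itself a strided sequence, the val[j] trace passes the stride check and the vec[cols[j]] trace fails it, which is exactly why the former is prefetchable and the latter is not.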



Page 12


Hardware Prefetching

• Predict future memory accesses
  • The PC is often used as a hint: stream localization, spatial correlation prediction

• Problem: accelerators lack a PC

• Solution: generate PC-like tags for accelerator memory accesses

for (j = begin; j < end; j++) {
  Si = val[j] * vec[cols[j]];
  sum = sum + Si;
}

Global address stream: 2380, 8010, 541C, 2384, 8328, 5420, 2388, 8454, 5424, 238C, 81B8, 5428, …

Localized address streams (one per tag):
• PC x: 2380, 2384, 2388, 238C — regular stride
• PC y: 541C, 5420, 5424, 5428 — regular stride
• PC z: 8010, 8328, 8454, 81B8 — irregular, no prediction

[Figure: CDFG of the loop body (basic blocks BB1–BB3) with three loads feeding a multiply and an add; each static load gets its own tag x, y, or z]
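The localized streams above can drive a simple per-tag stride predictor. The sketch below is illustrative only (the paper's prefetcher also uses spatial correlation prediction, which is not modeled here): one entry per PC-like tag, which becomes confident after observing the same stride twice. Fed the interleaved global stream, the x and y entries lock onto their strides while z never does.

```c
#include <stdint.h>

#define NTAGS 3

/* Minimal per-tag (PC-like) stride predictor: one entry per tag. */
typedef struct {
    uint32_t last_addr;
    int32_t  stride;
    int      confident;   /* set once the same stride is seen twice */
    int      seen;        /* number of addresses observed so far    */
} StrideEntry;

/* Observe one tagged access; returns the predicted next address,
   or 0 when the entry is not yet confident. */
static uint32_t observe(StrideEntry *table, int tag, uint32_t addr) {
    StrideEntry *e = &table[tag];
    if (e->seen > 0) {
        int32_t s = (int32_t)(addr - e->last_addr);
        e->confident = (e->seen > 1 && s == e->stride);
        e->stride = s;
    }
    e->last_addr = addr;
    e->seen++;
    return e->confident ? addr + (uint32_t)e->stride : 0;
}
```

Without the tags, the same predictor would see only the interleaved global stream, whose deltas never repeat, and would issue no prefetches at all; stream localization is what makes the regular components visible.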

Page 13


Decoupled Access/Execute (DAE)

• Limitations of hardware prefetching
  • Not accurate for complex patterns; needs warm-up time
  • Fundamental reason: lack of semantic information

• Decoupled Access/Execute
  • Allow memory accesses to run ahead to preload data

[Figure: timelines — with an SPM and manual preload, compute overlaps the preload and hides memory latency; with a cache and DAE, the memory-access slice runs ahead of value computation to hide the same latency. Block diagram: accelerator with access logic, execute logic, hardware prefetcher, and cache on the memory bus]

Page 14


Traditional DAE is not Effective for Accelerators

• Traditional DAE: the access part forwards data to the execute part
  • Problem: the access pipeline stalls on misses
  • Throughput is limited by the access pipeline

• Goal: allow the access pipeline to continue to flow under misses

[Figure: pipeline timelines — in the original (coupled) pipeline a single miss stalls every LD/MUL/ADD operation behind it; in the decoupled design the access pipeline of loads keeps issuing while the execute pipeline of MUL/ADD operations proceeds independently]

Page 15


DAE Accelerator with Decoupled Loads

• Anatomy of a load: address generation (AGen), memory request (Req), memory response (Resp)

• Solution: delegate request/response handling, so the access pipeline only performs address generation and does not wait for responses

[Figure: a load split into AGen, Req, and Resp stages; the Req/Resp handling is moved out of the access pipeline]


Page 17


Memory Unit

• A proxy for handling memory requests and responses
• Supports response reordering and store-to-load forwarding

[Figure: memory unit (MemUnit) internals — load addresses pass through a dependence check (DepCheck) against the store address queue; matching loads take their data from the forward-data queue, the rest issue memreq/memresp to the cache; load data returns through a load queue to the execute unit, which in turn supplies the store address and store data queues]
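A software model makes the dependence check and forwarding path concrete. The sketch below is a simplification under stated assumptions: word-granularity addresses, unbounded in-order store queues, and no response reordering (all of which the real MemUnit handles). A load scans the pending stores youngest-first and forwards matching data instead of going to memory.

```c
#include <stdint.h>

#define QSIZE 16

/* Simplified MemUnit: pending stores are buffered in address/data
   queues; a load checks them before accessing backing memory. */
typedef struct {
    uint32_t store_addr[QSIZE];
    uint32_t store_data[QSIZE];
    int      nstores;
    const uint32_t *mem;   /* backing memory, word-addressed */
} MemUnit;

/* Buffer a store; it has not yet reached memory. */
static void memunit_store(MemUnit *mu, uint32_t addr, uint32_t data) {
    mu->store_addr[mu->nstores] = addr;
    mu->store_data[mu->nstores] = data;
    mu->nstores++;
}

/* Dependence check: scan pending stores youngest-first. */
static uint32_t memunit_load(const MemUnit *mu, uint32_t addr) {
    for (int i = mu->nstores - 1; i >= 0; i--)
        if (mu->store_addr[i] == addr)
            return mu->store_data[i];   /* store-to-load forwarding */
    return mu->mem[addr];               /* no match: go to memory   */
}
```

The youngest-first scan is what preserves program order when the same address is stored twice before being loaded.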

Page 18


Automated DAE Accelerator Generation

• Program slicing for generating access/execute slices
• Architectural template with configurable parameters

accel.c → slicing → access.c + execute.c → HLS → access.v + execute.v → combined with the architectural template (written in PyMTL) → Access/Execute decoupled accelerator

Hardware generation parameters: queue sizes, port width, MemUnit configuration, etc.
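The slicing step can be sketched on the spmv inner loop in plain C, with the hardware queues modeled as arrays so the two slices can simply run one after the other (in the generated hardware they run concurrently, the access slice ahead of the execute slice). The function and queue names are illustrative, not the framework's actual output.

```c
#include <stddef.h>

#define QMAX 256

/* Access slice: performs all loads and pushes operands into queues
   (modeled here as arrays). It contains no arithmetic on values. */
static size_t spmv_access(const float *val, const int *cols, const float *vec,
                          int begin, int end,
                          float *q_val, float *q_vec) {
    size_t n = 0;
    for (int j = begin; j < end; j++) {
        q_val[n] = val[j];        /* regular-stride load */
        q_vec[n] = vec[cols[j]];  /* irregular load      */
        n++;
    }
    return n;
}

/* Execute slice: consumes operands from the queues; no memory access. */
static float spmv_execute(const float *q_val, const float *q_vec, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += q_val[i] * q_vec[i];
    return sum;
}
```

Because the execute slice touches only queue data, a cache miss in the access slice delays queue refill but never stalls a multiply or add that already has its operands.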

Page 19


Evaluation Methodology

• Vertically integrated modeling methodology
  • System components: cycle-level (gem5)
  • Accelerators: register-transfer level (Vivado HLS, PyMTL)
  • Area, power, and energy: gate-level (commercial ASIC flow)

• Benchmark accelerators from MachSuite

Name       Description
bbgemm     Blocked matrix multiplication
bfsbulk    Breadth-first search
gemm       Dense matrix multiplication
mdknn      Molecular dynamics (k-nearest neighbor)
nw         Needleman-Wunsch algorithm
spmvcrs    Sparse matrix-vector multiplication
stencil2d  2D stencil computation
viterbi    Viterbi algorithm

Page 20


Performance Comparison

• 2.28x speedup on average
• Prefetching and DAE work in synergy

Page 21


Energy Comparison

• 15% energy reduction on average because of reduced stalls
• MemUnits and queues consume only a small amount of energy

Page 22


More Details in the Paper

• Deadlock avoidance
• Customization of memory units
• Baseline validation
• Power and area comparison
• Energy, power, and area breakdown
• Sensitivity study on varying queue sizes
• Design space exploration: queue size customization

Page 23


Summary

Cache-based accelerators
• Avoid the high design cost of manual data-movement logic
• Problem: inefficient at handling uncertain memory latency

Approach: automated program analysis and an architectural template to generate accelerators with efficient data supply
• Tagging memory requests to enable prefetching
• Decoupling to let memory accesses run ahead

Results: high-performance cache-based accelerators with minimal manual effort