Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
Cornell University
Efficient Data Supply for Hardware Accelerators with Prefetching and Access/
Execute Decoupling
Tao Chen and G. Edward SuhComputer Systems Laboratory
Cornell University
Tao ChenCornell University 2
Accelerator-Rich Computing Systems
• Computing systems are becoming accelerator-rich • General-purpose cores + a large number of accelerators
• Challenge: Design and verification complexity • Non-recurring engineering (NRE) cost per accelerator
• Manual efforts are a major source of cost • Create computation pipelines • Manage data supply from memory
High-Level Synthesis (HLS)
This work: An automated framework for generating accelerators with efficient data supply
Tao ChenCornell University 3
Inefficiencies in Accelerator Data Supply
Scratchpad-based accelerators• On-chip scratchpad memory (SPM)• Manually designed logic to move
data between SPM and main memory
• Pros: Good performance• Cons: High design effort,
accelerator-specific, not reusable
Accelerator
ComputeLogic
Memory Bus
SPM PreloadLogic
Tao ChenCornell University 3
Inefficiencies in Accelerator Data Supply
Scratchpad-based accelerators• On-chip scratchpad memory (SPM)• Manually designed logic to move
data between SPM and main memory
• Pros: Good performance• Cons: High design effort,
accelerator-specific, not reusable
Cache-based accelerators• Pros: Low design effort, cache can be reused• Cons: Uncertain memory latency impacts performance
Accelerator
ComputeLogic
Memory Bus
Cache
Tao ChenCornell University 4
Optimize Data Supply for Cache-Based Accelerators
Approach: automated framework for generating accelerators with efficient data supply
Accelerator Source
Automated Framework
Accelerator w/ Efficient Data Supply
Tao ChenCornell University 4
Optimize Data Supply for Cache-Based Accelerators
Approach: automated framework for generating accelerators with efficient data supply
Accelerator Source
Automated Framework
Accelerator w/ Efficient Data Supply
Accelerator
Cache
Memory Bus
ComputeLogic
Techniques• Prefetching• Tagging memory accesses
HWPrefetcher
Tao ChenCornell University 4
Optimize Data Supply for Cache-Based Accelerators
Approach: automated framework for generating accelerators with efficient data supply
Accelerator Source
Automated Framework
Accelerator w/ Efficient Data Supply
Accelerator
Cache
Memory Bus
Techniques• Prefetching• Tagging memory accesses
• Access/Execute Decoupling• Program slicing + architecture
templateHW
Prefetcher
AccessLogic
ExecuteLogic
Tao ChenCornell University 5
Impact of Uncertain Memory Latency
• Example: Sparse Matrix Vector Multiplication (spmv) • Pipeline generated with High-Level Synthesis (HLS)
// inner loop of sparse matrix// vector multiplication
for (j = begin; j < end; j++) { #pragma HLS pipeline Si = val[j] * vec[cols[j]]; sum = sum + Si; }
LDLDLD
LDLD LD
MUL
ADDMUL
LD
ADDMUL
LDLD
ADD
LD
MUL
LDLD
ADD
Time
HLS
Tao ChenCornell University 5
Impact of Uncertain Memory Latency
• Example: Sparse Matrix Vector Multiplication (spmv) • Pipeline generated with High-Level Synthesis (HLS)
// inner loop of sparse matrix// vector multiplication
for (j = begin; j < end; j++) { #pragma HLS pipeline Si = val[j] * vec[cols[j]]; sum = sum + Si; }
LDLDLD
LDLD LD
MUL
ADDMUL
LD
ADDMUL
LDLD
ADD
LD
MUL
LDLD
ADD
miss
Time
A cache miss stalls the entire accelerator pipeline
Regular stride
Regular strideIrregular
• Reduce cache misses for regular accesses• Prefetch data into the cache
• Tolerate cache misses for irregular accesses• Access/Execute Decoupling
HLS
Tao ChenCornell University 6
Hardware Prefetching
• Predict future memory accesses
• PC is often used as a hint• Stream localization• Spatial correlation predictionfor (j = begin; j < end; j++) { Si = val[j] * vec[cols[j]]; sum = sum + Si; }
2380
8010541C
2384
83285420
2388
84545424
238C
81B85428
GlobalAddr
Stream
●●●
Tao ChenCornell University 6
Hardware Prefetching
• Predict future memory accesses
• PC is often used as a hint• Stream localization• Spatial correlation prediction
• Problem: accelerators lack a PC
• Solution: generate PC-like tags for accelerator memory accesses
for (j = begin; j < end; j++) { Si = val[j] * vec[cols[j]]; sum = sum + Si; }
2380
8010541C
2384
83285420
2388
84545424
238C
81B85428
GlobalAddr
Stream
2380
8010541C
2384
83285420
2388
84545424
238C
81B85428
Local Addr Streams
●●●
regular strides
irregular no pred
238023842388238C
PC x541C542054245428
PC y80108328845481B8
PC z
Tao ChenCornell University 6
Hardware Prefetching
• Predict future memory accesses
• PC is often used as a hint• Stream localization• Spatial correlation prediction
• Problem: accelerators lack a PC
• Solution: generate PC-like tags for accelerator memory accesses
for (j = begin; j < end; j++) { Si = val[j] * vec[cols[j]]; sum = sum + Si; }
2380
8010541C
2384
83285420
2388
84545424
238C
81B85428
GlobalAddr
Stream
2380
8010541C
2384
83285420
2388
84545424
238C
81B85428
Local Addr Streams
●●●
regular strides
irregular no pred
238023842388238C
PC x541C542054245428
PC y80108328845481B8
PC z
BB1
BB3
LD
LD
LD
×
+
BB2
x
yz
CDFG
Tao ChenCornell University 7
Decoupled Access/Execute (DAE)
• Limitations of Hardware Prefetching• Not accurate for complex patterns / Needs warm-up time• Fundamental reason: lack of semantic information
• Decoupled Access/Execute• Allow memory accesses to run ahead to preload data
Time
MemoryAccess
ValueComp
memory latency
Accelerator
Cache
Memory Bus
HWPrefetche
r
AccessLogic
ExecuteLogic
Cache w/ DAE
ComputeManual Preload
memory latency
SPM w/ Manual Preload
Tao ChenCornell University 8
Traditional DAE is not Effective for Accelerators
• Traditional DAE: access part forwards data to execute part • Problem: access pipeline stalls on misses • Throughput is limited by access pipeline
• Goal: allow access pipeline to continue to flow under misses
MUL
ADDMUL
ADDMUL
ADDMUL
ADD
LDLDLD
LDLD LD
LD LDLD
LDLDLD
LDLDLD
LDLD LD
MUL
ADDMUL
LD
ADDMUL
LDLD
ADD
LD
MUL
LDLD
ADD
missOriginal
DecoupledAccess Execute
Tao ChenCornell University 9
DAE Accelerator with Decoupled Loads
• Anatomy of a load
• Solution: Delegate request/response handling
LD AGenReq
Resp
LDAGenReq
Resp
AGen
ReqResp
Tao ChenCornell University 10
Memory Unit
• Proxy for handling memory requests and responses • Supports response reordering and store-to-load forwarding
Load Queueto ExeU
MemUnit
DepCheck
memreq memresp
Store Addr
Store Datafrom ExeU
Load Addr
StoreAddr
Queue
StoreData
Queue
FwdData
Queue
Load Data to AccU
FwdData
LD
Tao ChenCornell University 10
Memory Unit
• Proxy for handling memory requests and responses • Supports response reordering and store-to-load forwarding
Load Queueto ExeU
MemUnit
DepCheck
memreq memresp
Store Addr
Store Datafrom ExeU
Load Addr
StoreAddr
Queue
StoreData
Queue
FwdData
Queue
Load Data to AccU
FwdData
LD
ST
Tao ChenCornell University 11
Automated DAE Accelerator Generation
• Program slicing for generating access/execute slices • Architectural template with configurable parameters
accel.c
Architectural Template
access.c execute.cslicing slicing
access.v execute.v
HLS HLS
Access/Execute Decoupled Accel
HW GenerationParameters
• Queue sizes• Port width• MemUnit config• etc
Written in PyMTL
Tao ChenCornell University 12
Evaluation Methodology
• Vertically integrated modeling methodology • System components: cycle-level (gem5) • Accelerators: register-transfer-level (Vivado HLS, PyMTL) • Area, power and energy: gate-level (commercial ASIC flow)
• Benchmark accelerators from MachSuiteName Description
bbgemm Blocked matrix multiplicationbfsbulk Breadth-First Searchgemm Dense matrix multiplicationmdknn Molecular dynamics (K-Nearest Neighbor)
nw Needleman-Wunsch algorithmspmvcrs Sparse matrix vector multiplicationstencil2d 2D stencil computation
viterbi Viterbi algorithm
Tao ChenCornell University 13
Performance Comparison
• 2.28x speedup on average• Prefetching and DAE work in synergy
Tao ChenCornell University 14
Energy Comparison
• 15% energy reduction on average because of reduced stalls• MemUnits/queues only consume a small amount of energy
Tao ChenCornell University 15
More Details in the Paper
• Deadlock Avoidance • Customization of Memory Units • Baseline Validation • Power and Area Comparison • Energy, Power and Area Breakdown • Sensitivity Study on Varying Queue Sizes • Design Space Exploration: Queue Size Customization
Tao ChenCornell University 16
Summary
Cache-based accelerators• Avoid the high design cost of manual data movement logic• Problem: Inefficient in handling uncertain memory latency
Approach: Automated program analysis and architectural template to generate accelerators with efficient data supply • Tagging memory requests to enable prefetching • Decoupling to enable memory accesses to run ahead
Results: High-performance cache-based accelerators with minimal manual efforts