A Performance-Correctness Explicitly-Decoupled Architecture
Alok Garg and Michael Huang
Department of Electrical & Computer Engineering
University of Rochester
11/11/2008
"A Performance-Correctness Explicitly-Decoupled Architecture (EDA)", Alok Garg, MICRO 2008
Motivation
Performance optimization in a monolithic micro-architecture is difficult
Conservativeness in design reduces the common case efficacy
Want to explicitly decouple correctness & performance
[Figure: pipeline stages IF, ID, EX, MEM, WB, annotated with Optimization 1 (e.g. branch prediction) and Optimization 2 (e.g. out-of-order execution)]
Explicitly-Decoupled Architecture (EDA)
Design separated into performance and correctness domains
Implementation decoupled as well
Optimistic design of entire system stack
Economic correctness guarantee
Custom software-hardware interface
[Figure: performance domain (optimistic core) and correctness domain (correctness core, simple and throughput-oriented), connected by hints; layers shown: software, architectural, device]
ILP lookahead using EDA
Autonomy
Managing deviance
Minimal mutual dependence
[Figure: performance domain with the optimistic core (lookahead agent) and L0 cache; correctness domain with the correctness core (throughput engine) and L1 cache; shared L2. A static binary transformation derives a skeleton from the program (semantic) binary; a Branch Outcome Queue links the two cores.]
Outline
Architectural and software support needed
Performance optimization opportunities
Complexity reduction opportunities
Evaluation
Conclusion
Avoiding L2-miss stalls in lookahead
Feed arbitrary value
  Exact value may not matter
Conventional mechanism
  Planning against contingency
  Tagging entire dependency chain as invalid
  State check-point and recovery
Type of value substitution
  Value predictor
  Explicitly flush the dependence chain of load
Opportunity: simple “0” value substitution
  Only used when optimistic core is not too far ahead
  Zero is the most frequently occurring value
[Figure label: compare (x>f0)]
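The "0" value substitution on this slide can be sketched as a toy model. Everything named here (`LookaheadMemory`, `lead_threshold`, the dict-backed `l2` and `memory`) is hypothetical and chosen for illustration. The slides do not say how "not too far ahead" is measured, so the `lead` argument is an assumption standing in for that distance.

```python
from dataclasses import dataclass, field


@dataclass
class LookaheadMemory:
    """Toy memory model for the optimistic (lookahead) core."""
    l2: dict = field(default_factory=dict)      # addresses that hit in L2
    memory: dict = field(default_factory=dict)  # backing store behind L2
    lead_threshold: int = 64                    # assumed cutoff for "not too far ahead"
    stalls: int = 0
    substitutions: int = 0

    def load(self, addr, lead):
        """Return a value for the load; `lead` is how far ahead the core is."""
        if addr in self.l2:
            return self.l2[addr]
        if lead < self.lead_threshold:
            # L2 miss while the lead is small: feed 0 instead of stalling.
            self.substitutions += 1
            return 0
        # Far enough ahead: the core can afford to wait for the real value.
        self.stalls += 1
        value = self.memory.get(addr, 0)
        self.l2[addr] = value
        return value
```

A substituted 0 may be wrong, which is acceptable only because the lookahead result is a hint and not architectural state.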
Purging stale data
Source of stale data
  Performance optimizations
  Binary optimizations
Potential solutions
  Timer-based eviction mechanism
  Selective L0 invalidations from skeleton
Choice: do nothing
  Simply rely on cache replacement
[Figure: optimistic core (OC) with L0 cache and correctness core (CC) with L1 cache, sharing the L2]
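The "do nothing" choice can be illustrated with a small LRU model of the optimistic core's L0 cache. This is a sketch under assumptions: the class name `L0Cache`, the LRU policy, and the rule that optimistic writes stay in L0 and are dropped on eviction are not stated on the slide. A stale line is never purged explicitly; it simply ages out as other lines are touched.

```python
from collections import OrderedDict


class L0Cache:
    """Toy LRU cache private to the optimistic core (illustrative only)."""

    def __init__(self, capacity, backing):
        self.capacity = capacity
        self.backing = backing      # stands in for L1/L2 holding correct data
        self.lines = OrderedDict()  # addr -> value, least recently used first

    def read(self, addr):
        if addr in self.lines:
            self.lines.move_to_end(addr)  # mark as most recently used
            return self.lines[addr]
        value = self.backing[addr]        # miss: refill with correct data
        self._fill(addr, value)
        return value

    def write(self, addr, value):
        # Assumption: optimistic writes are kept in L0 and never written back.
        self._fill(addr, value)

    def _fill(self, addr, value):
        self.lines[addr] = value
        self.lines.move_to_end(addr)
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # evict LRU line, stale or not
```

After a stale write, reads return the stale value only until ordinary replacement evicts the line; the next read refills from the backing store.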
Complexity reduction
Optimistic core – trade off complexity to improve performance
  E.g., Load Store Queue
Correctness core – throughput-oriented design
  Accurate branch prediction from OC
  No check-pointing and selective pipeline flush required
  Cache misses are significantly mitigated
  Latency of various operations is less critical
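How the correctness core might consume branch outcomes from the optimistic core can be sketched with a bounded FIFO. The class and function names and the three status strings are hypothetical. The recovery step is reduced to a queue flush; a real design would presumably also resynchronize the optimistic core, which is not modeled here.

```python
from collections import deque


class BranchOutcomeQueue:
    """Bounded FIFO of branch outcomes passed from OC to CC (toy model)."""

    def __init__(self, capacity=512):
        self.capacity = capacity
        self.entries = deque()

    def push(self, taken):
        """Called by the optimistic core; returns False when the queue is full."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append(taken)
        return True

    def pop_prediction(self):
        """Called by the correctness core; returns None when the queue is empty."""
        return self.entries.popleft() if self.entries else None

    def flush(self):
        self.entries.clear()


def correctness_core_branch(boq, actual_taken):
    """Check one resolved branch against the OC's hint."""
    hint = boq.pop_prediction()
    if hint is None:
        return "stall"    # no hint available yet
    if hint == actual_taken:
        return "ok"       # hint served as an accurate prediction
    boq.flush()           # remaining hints follow a wrong path
    return "recover"
```

A full queue bounds how far the optimistic core can run ahead, and a mismatch discards every hint behind it.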
Load-Hit speculation
[Figure: processor pipeline timing for a load (ld) through the Issue, Reg, Reg, Ex, Ex stages, with a load miss marked]
Outline
Architectural and software support needed
Performance optimization opportunities
Complexity reduction opportunities
Evaluation
Conclusion
Evaluation Environment
Simulation strives to model EDA very faithfully
  Value driven execution for optimistic core
  Data values in the caches
  Faithful simulation of branches
  Scheduling replays
  Prefetch modeling fidelity
  Stream prefetcher
Power modeling – both switching and leakage
SPEC CPU2000 and SPLASH(2) benchmark suite
System Configuration – loosely based on Power4
  ROB/Register (INT, FP) – 128/(32, 32)
  L0 cache – 16KB, 4-way, 2 cycle
  L1 cache – 32KB, 4-way, 2 cycle
  L2 cache – 1MB, 8-way, 400 cycle
  BOQ – 512 entry
  Register copy latency – 32 cycles
Performance gain of optimism
[Figure: two charts; y-axis: speedup]
Effect on explicitly parallel programs
[Figure: chart; y-axis: speedup]
Exploiting ILP is not guaranteed to be less effective than exploiting thread level parallelism
Energy Implications
Reasons
Skeleton not the entire program
Few wrong path instructions in CC
Smaller cache hierarchy in OC
Reduce energy waste due to idling
Performance impact with reduction in in-flight capacity
Impact of simplifications and conservativeness
Removing Load-hit speculation
Making out-of-order INT issue queue in-order
10% clock freq. reduction
Other details in the paper
Related work discussion
Quantitative comparison with past works
Details on skeleton construction
Eliminating useless branches
Delayed release of prefetches
Understanding sensitivity to performance domain errors
System diagnosis
* More details left in the technical report version
Conclusion
Performance-correctness explicitly-decoupled architecture
  Independent focus on performance and correctness goals
  Each goal can be achieved more efficiently with less complexity
Demonstrated a concrete design with efficient lookahead
  Achieves good performance boosting
  Does not consume excessive energy
  Better tolerance to conservatism
Future work
  Optimization beyond ILP lookahead
  Custom design of optimistic and correctness core
Link to technical report: http://www.ece.rochester.edu/~garg/documents/micro08tr.pdf
Related Work
Dynamic verification using DIVA checker [austin99]
Lookahead techniques
Two-pass execution [sundaramoorthy00], [purser00], [zhou05], [barnes03], [mesa-martinez07], [greskamp07]
Helper-threading [dubois98], [annavaram01], [luk01], [zilles01], [chappell99], [collins01], [roth01], [moshovos01], [farcy98]
Enhancing processor’s capability to buffer more in-flight instructions [balasubramonian00], [lebeck02], [torres05], [gandhi05], [akkary03], [sethumadhavan03]
Runahead execution [mutlu03], [dundas97], [ceze04], [kirman05]
Parallelization oriented techniques [zilles02], [balakrishnan06]
Differences from DIVA
[Figure: DIVA decoupling compared with explicit decoupling (EDA). Labels: traditional core; DIVA checker & commit; communication: decoded instruction, input and output values; low-bandwidth hints; have to produce correct output; frequent repair; infrequent repair, free to perform risky optimizations]
Comparison with DCE
Sensitivity to performance domain circuit errors
Load-Store queue simplification
[Figure: instructions in age order, oldest to youngest: st1, …, st3, …, st5, …, ld7. At dispatch, ld7 goes to the load queue and st1, st3, st5 to the store queue; labels: dispatch, store-load replay]
Load queue removed
Store-load replay support not required
Priority logic replaced with simpler forwarding logic
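The contrast between priority-based and simpler forwarding can be sketched as follows. The slide does not say what the simpler logic is, so `forward_simplified` is only one possible reading (first address match, no age comparison). The `Store` tuple and both function names are hypothetical.

```python
from typing import NamedTuple, Optional, Sequence


class Store(NamedTuple):
    age: int    # smaller = older in program order
    addr: int
    value: int


def forward_with_priority(stores: Sequence[Store], load_age: int,
                          addr: int) -> Optional[int]:
    """Conventional lookup: youngest matching store that is older than the load."""
    older = [s for s in stores if s.addr == addr and s.age < load_age]
    return max(older, key=lambda s: s.age).value if older else None


def forward_simplified(stores: Sequence[Store], addr: int) -> Optional[int]:
    """Assumed simplification: first address match in queue order, age ignored."""
    for s in stores:
        if s.addr == addr:
            return s.value
    return None


# Mirrors the slide's example stream: st1, st3, st5, then ld7.
stores = [Store(1, 0xA, 10), Store(3, 0xB, 30), Store(5, 0xA, 50)]
exact = forward_with_priority(stores, load_age=7, addr=0xA)  # picks st5
approx = forward_simplified(stores, addr=0xA)                # picks st1
```

The two lookups can disagree, as they do here. In the optimistic core that costs only hint accuracy, since correctness is guaranteed by the correctness core.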