Idempotent Processor Architecture
Marc de Kruijf
Karthikeyan Sankaralingam
Vertical Research Group
UW-Madison
MICRO 2011, Porto Alegre
2
Idempotent Processor Architecture
applications naturally decompose into idempotent regions
idempotence: re-execution has no side effects (inputs are preserved)
region entry point: implicit checkpoint in the “live state” of the program
[Figure: a program divided into idempotent regions, with the “live state” at each region entry]
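To make the definition concrete, here is a minimal Python sketch. The function names and the dictionary standing in for program state are illustrative assumptions, not part of the talk. A region that only reads its inputs can be re-executed freely; one that overwrites an input cannot.

```python
def idempotent_region(state):
    # Reads "a" and "b", writes only "c": the inputs survive execution.
    state["c"] = state["a"] + state["b"]


def non_idempotent_region(state):
    # Reads "a" and then overwrites it: the input is clobbered.
    state["a"] = state["a"] + state["b"]


def same_after_reexecution(region, state):
    """Return True if running the region twice gives the same state as once."""
    once = dict(state)
    region(once)
    twice = dict(once)
    region(twice)
    return once == twice


live_state = {"a": 1, "b": 2}
print(same_after_reexecution(idempotent_region, live_state))      # True
print(same_after_reexecution(non_idempotent_region, live_state))  # False
```

The second region fails the check because its second run starts from a modified "a", which is exactly the case where the entry state no longer acts as an implicit checkpoint.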
3
Idempotent Processor Architecture
[Figure: conventional recovery (CHECKPOINT) vs. idempotent recovery]
idempotent recovery: recovery without checkpoints, just re-execute
4
Idempotent Processor Architecture
idempotent processors: simpler decode, issue, execute, and writeback
[Figure: pipeline diagrams (Fetch, Decode, Issue, RF, Exec, WB) comparing a conventional processor with an idempotent processor]
5
Presentation Overview
❶ Idempotent Recovery
❷ Idempotent Processors
❸ Evaluation
Idempotent Recovery
6
idempotent region example:
1. R2 = add R0, R1
2. R3 = ld [R2 + 4]
3. R2 = add R2, R4
4. beqz R2, NEXT
“Something went wrong executing instruction 2!” (page fault, corrupted write, etc.)
“No problem – just re-execute from instruction 1!”
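The recovery story for this region can be sketched as a small Python simulation. The register values, the memory contents, and the way the fault is injected are all assumptions made for illustration; only the four instructions come from the slide.

```python
def run_region(regs, mem, fault_at=None):
    """Execute the four-instruction example region on a register dict.

    If fault_at == 2, leave garbage in R3 and raise, mimicking a
    corrupted write or page fault at instruction 2.
    """
    # 1. R2 = add R0, R1
    regs["R2"] = regs["R0"] + regs["R1"]
    # 2. R3 = ld [R2 + 4]
    if fault_at == 2:
        regs["R3"] = -1  # garbage left behind by the faulting load
        raise RuntimeError("fault at instruction 2")
    regs["R3"] = mem[regs["R2"] + 4]
    # 3. R2 = add R2, R4
    regs["R2"] = regs["R2"] + regs["R4"]
    # 4. beqz R2, NEXT
    return regs["R2"] == 0  # True means the branch to NEXT is taken


def run_with_recovery(regs, mem, fault_at=None):
    """Recover by jumping back to instruction 1; no state is restored."""
    try:
        return run_region(regs, mem, fault_at)
    except RuntimeError:
        return run_region(regs, mem)  # just re-execute from the region entry


mem = {12: 99}
clean = {"R0": 3, "R1": 5, "R2": 0, "R3": 0, "R4": -8}
faulty = dict(clean)
taken_clean = run_with_recovery(clean, mem)
taken_faulty = run_with_recovery(faulty, mem, fault_at=2)
print(clean == faulty, taken_clean == taken_faulty)  # True True
```

Re-execution works here because the region never overwrites R0, R1, R4, or memory, so the values it read on the first attempt are still there on the second.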
7
Idempotent Recovery
idempotent “regions”: freely re-executable program regions
idempotent regions:
normal compiler: sizes inhibited by clobber antidependences (WAR after no RAW)
custom compiler: special marker instruction adds some runtime overhead (typically 2-10%)
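A clobber antidependence (a write-after-read with no preceding read-after-write) can be detected with a single pass over a region. The sketch below is a simplified illustration under stated assumptions: it tracks registers only, ignores memory and aliasing, and the `(dest, sources)` instruction encoding and the name `find_clobbers` are hypothetical. It is not the compiler analysis used in the work.

```python
def find_clobbers(region):
    """Return indices of instructions that overwrite a live-in value.

    region is a list of (dest, sources) pairs. A write is flagged when
    its destination was read earlier in the region before any write to
    it, i.e. a WAR dependence with no preceding RAW dependence.
    """
    written = set()        # registers defined inside the region so far
    live_in_reads = set()  # registers read before being written here
    clobbers = []
    for index, (dest, sources) in enumerate(region):
        for src in sources:
            if src not in written:
                live_in_reads.add(src)
        if dest is not None:
            if dest in live_in_reads and dest not in written:
                clobbers.append(index)
            written.add(dest)
    return clobbers


# The four-instruction example region: R2 is written before it is read.
example = [
    ("R2", ["R0", "R1"]),
    ("R3", ["R2"]),
    ("R2", ["R2", "R4"]),
    (None, ["R2"]),
]
# Dropping instruction 1 makes R2 a region input that instruction 3 clobbers.
without_first = example[1:]

print(find_clobbers(example))        # []
print(find_clobbers(without_first))  # [1]
```

A region boundary placed just before a flagged instruction would separate the read from the clobbering write, which is one way to picture why such dependences limit region size.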
Idempotent Recovery
8
average idempotent region sizes
[Figure: bar chart of average idempotent region size with the custom compiler, in ARM micro-ops (axis ticks 10, 100, 1000). Benchmark suites (geo-mean): SPEC INT, SPEC FP, PARSEC, Parboil. Select benchmarks: 401.bzip2, 456.hmmer. OVERALL (geo-mean). Labeled value: 43.]
Annotations:
frequent clobber antidependences: can be removed, but requires large restructuring effort
limited aliasing information
with func params marked using C/C++ “restrict”
10
Idempotent Processors: exploring the opportunity
hardware faults
exceptions & out-of-order retirement
branch misprediction
in-order, out-of-order, multi-core
[Figures: circuit labeled Vdd, In, Out; issue / execute / retire pipeline; branch]
Idempotent Processors
11
steps one, two, and three
Step 1: construct a high-performance in-order processor
examples: ARM Cortex-A8 (‘05), IBM Cell SPE (‘05), Intel Atom (‘08)
Step 2: prune out unnecessary parts (↓ power, area, & complexity)
Step 3: optimize for energy efficiency (↑ performance at low cost)
12
Step 1: Construction
v1.0
[Figure: pipeline with Fetch, Decode & Issue, functional units (Integer, Integer, Branch, Multiply, Load/Store, FP, …), and RF. Labels: Add, Ld, “Exception!”]
13
Step 1: Construction
v1.0
Staged instruction completion
Flush?
[Figure: pipeline with Fetch; Decode, Rename, & Issue; functional units (Integer, Integer, Branch, Multiply, Load/Store, FP, …); RF; Bypass]
14
Step 1: Construction
v1.0 v1.1
IEEE FP… ?
FP exceptions handled in hardware. Separate FP unit implements full IEEE 754.
Flush?
[Figure: pipeline with Fetch; Decode, Rename, & Issue; functional units (Integer, Integer, Branch, Multiply, Load/Store, FP, …, IEEE FP); RF; Bypass]
15
Step 1: Construction
v1.0 v1.1 v1.2
Load miss? Have to flush!
Replay queue
Flush? Flush? Replay?
[Figure: pipeline with Fetch; Decode, Rename, & Issue; functional units (Integer, Integer, Branch, Multiply, Load/Store, FP, …, IEEE FP); RF; Bypass; Replay queue]
16
Step 1: Construction
v1.0 v1.1 v1.2 v1.3
Replay queue
Flush? Replay?
TOO COMPLICATED!
[Figure: pipeline with Fetch; Decode, Rename, & Issue; functional units (Integer, Integer, Branch, Multiply, Load/Store, FP, …, IEEE FP); RF; Bypass; Replay queue]
17
Step 2: Simplification
idempotent edition (simple)
[Figure: pipeline with Fetch, Decode & Issue, functional units (Integer, Integer, Branch, Multiply, Load/Store, FP, …), and RF]
WHAT IS GONE?
• staging register file (6 entries)
• replay queue (8 entries)
• entire rename pipeline stage
• IEEE-compliant floating point unit
• pipeline flush for exceptions and replays
• all associated control logic
18
Step 3: Optimization
idempotent edition (fast)
[Figure: pipeline with Fetch, Decode & Issue, functional units (Integer, Integer, Branch, Multiply, Load/Store, FP, …), RF, and SDB*]
* Slice Data Buffer (SDB) – Continual Flow Pipelines. ASPLOS ‘04
details in paper…
Evaluation
20
idempotent processor performance
[Figure: bar chart of speed-up over in-order (axis ticks -25%, 0%, 25%, 50%) for Simple Idem, Fast Idem, and OoO. Benchmark suites (geo-mean): SPEC INT, SPEC FP, PARSEC, Parboil. Select benchmarks: 401.bzip2, 456.hmmer. OVERALL (geo-mean).]
21
Evaluation: summary
Processor Type | Performance (vs. In-Order) | Power/Complexity (vs. In-Order)
simple idempotent | worse by ~5% (compilation & serialization overheads) | better
fast idempotent | better by ~5% (modest OoO execution) | same or better
out-of-order | better by ~30% | worse
23
Future Work
– quantify power/complexity benefits (build real hardware prototype)
– more general error conditions (hardware faults, branch prediction, etc.)
– impact on multithreading/multiprocessors (re-execution currently assumes no interference)
– region overlapping (“region pipelining”) (analogous to overlapping checkpoints)
24
Conclusions
recovery using idempotence – recovery without checkpoints
multiple uses and multiple designs
– uses: exception, speculation, fault recovery, and more
– designs: in-order, out-of-order, multi-core, GPU, and more
in this work: exception recovery + in-order design
– simplified out-of-order execution
– better performance at equal or lower power/complexity
25
Back-up Slides
26
Optimal Idempotent Region Size?
[Figure: overhead (y-axis) vs. region size (x-axis), with curves for serialization overhead, compiler formation overhead (many factors), and re-execution overhead]
(rough sketch, graph not to scale)
27
Optimal Processor Design?
Single-issue in-order
Dual-issue in-order
Dual-issue OoO
Quad-issue OoO
Compiler overheads dominate?
Re-execution overheads dominate?
Best potential?
~ 250mW ~ 2.5W
28
Out-of-Order Issue Processors?
Some additional challenges…
Re-execution overhead high if mis-speculation frequent
– cannot restart from point of mis-speculation, and hence…
– re-execution overhead on average ≈ half the region
Example: branch misprediction
– With in-order issue, simple to flush/drain pipeline
– With out-of-order issue, we can use idempotence but…
– 5 branches/region @ 90% confidence ≈ 41% re-execution rate
1. Speculate only high-confidence branches
2. Hybrid checkpointing/idempotence
3. …?
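The 41% figure can be reproduced with a short calculation. The sketch below assumes the branches in a region are predicted independently, each correct with the stated confidence; that independence assumption and the function name are not from the slides.

```python
def reexecution_rate(branches_per_region, confidence):
    """Probability that at least one branch in a region is mispredicted,
    assuming independent branches, each correct with probability `confidence`."""
    return 1.0 - confidence ** branches_per_region


rate = reexecution_rate(5, 0.90)
print(f"{rate:.1%}")  # 41.0%

# Speculating only on higher-confidence branches lowers the rate quickly.
print(f"{reexecution_rate(5, 0.99):.1%}")  # 4.9%
```

Combining the 41% rate with the half-region estimate gives a first-order cost of roughly 0.2 regions of wasted work per region executed. That product is a back-of-envelope step, not a number from the slides.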