Idempotent Processor Architecture Marc de Kruijf Karthikeyan Sankaralingam Vertical Research Group UW-Madison MICRO 2011, Porto Alegre

Page 1: Idempotent Processor Architecture

Idempotent Processor Architecture

Marc de Kruijf

Karthikeyan Sankaralingam

Vertical Research Group

UW-Madison

MICRO 2011, Porto Alegre

Page 2: Idempotent Processor Architecture

applications naturally decompose into idempotent regions

idempotence: re-execution has no side effects (inputs are preserved)

region entry point: implicit checkpoint in the live state of the program

[Figure labels: “live state”, idempotent regions]

Page 3: Idempotent Processor Architecture

idempotent recovery

recovery without checkpoints – just re-execute

[Figure labels: CHECKPOINT, conventional recovery, idempotent recovery]

Page 4: Idempotent Processor Architecture

idempotent processors

simpler decode, issue, execute, and writeback

[Figure: pipeline diagrams (Fetch, Decode, Issue, RF, Exec, WB) for a conventional processor and an idempotent processor]

Page 5: Idempotent Processor Architecture

Presentation Overview

❶ Idempotent Recovery

❷ Idempotent Processors

❸ Evaluation

Page 6: Idempotent Processor Architecture

Idempotent Recovery

idempotent region example:

1. R2 = add R0, R1
2. R3 = ld [R2 + 4]
3. R2 = add R2, R4
4. beqz R2, NEXT

“Something went wrong executing instruction 2!” (page fault, corrupted write, etc.)

“No problem – just re-execute from instruction 1!”
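The four-instruction region on this slide can be mimicked with a small Python sketch. Everything beyond the four instructions is an assumption made for illustration: the register values, the memory contents, the `Fault` exception, and the way a "corrupted write" is modeled. The point it tries to show is that the region only reads its inputs (R0, R1, R4, and memory), so restarting at instruction 1 after a fault gives the same final state as a fault-free run.

```python
class Fault(Exception):
    """Hypothetical signal that an instruction failed inside the region."""


def run_region(regs, mem, fault_at=None):
    """Execute instructions 1-4 in place on `regs`; optionally fault at instruction 2."""
    # 1. R2 = add R0, R1
    regs["R2"] = regs["R0"] + regs["R1"]
    # 2. R3 = ld [R2 + 4]
    if fault_at == 2:
        regs["R3"] = 0xDEAD  # assumed model of a corrupted write before detection
        raise Fault("instruction 2")
    regs["R3"] = mem[regs["R2"] + 4]
    # 3. R2 = add R2, R4  (R2 was produced by instruction 1, so no input is clobbered)
    regs["R2"] = regs["R2"] + regs["R4"]
    # 4. beqz R2, NEXT
    return "NEXT" if regs["R2"] == 0 else "FALLTHROUGH"


def run_with_recovery(regs, mem, fault_at=None):
    """Idempotent recovery: on a fault, restart at instruction 1 with no checkpoint restore."""
    try:
        return run_region(regs, mem, fault_at)
    except Fault:
        return run_region(regs, mem)


# Illustrative initial state (assumed values, not from the slides).
initial = {"R0": 8, "R1": 4, "R2": 99, "R3": 99, "R4": -12}
memory = {16: 7}

clean = dict(initial)
clean_target = run_with_recovery(clean, memory)

faulty = dict(initial)
faulty_target = run_with_recovery(faulty, memory, fault_at=2)

print(clean == faulty, clean_target, faulty_target)
```

Under these assumed values both runs end with R2 = 0 and R3 = 7 and take the branch to NEXT, even though the second run observed a corrupted R3 before re-executing.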

Page 7: Idempotent Processor Architecture

Idempotent Recovery

idempotent “regions”: freely re-executable program regions

[Figure: idempotent regions under a normal compiler vs. a custom compiler]

sizes inhibited by clobber antidependences (a write-after-read with no preceding read-after-write)

special marker instruction adds some runtime overhead (typically 2-10%)
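The "WAR after no RAW" condition can be made concrete with a short Python sketch. It is a simplified illustration, not the compiler analysis from the talk: it assumes straight-line code, tracks registers only (no memory or aliasing), and uses a made-up `(dest, sources)` tuple encoding for instructions. A write is flagged when its destination was read earlier in the region before anything in the region had written it.

```python
def clobber_antidependences(region):
    """Return indices of instructions that overwrite a register the region read as an input.

    `region` is a list of (dest, sources) tuples; dest is None for
    instructions that write no register (e.g. a branch).
    """
    written = set()        # registers defined inside the region so far
    inputs_read = set()    # registers read before any write in the region
    clobbers = []
    for index, (dest, sources) in enumerate(region):
        for src in sources:
            if src not in written:
                inputs_read.add(src)   # a read of a region input (no preceding RAW)
        if dest is not None:
            if dest in inputs_read:
                clobbers.append(index)  # WAR on a region input: re-execution would see a changed value
            written.add(dest)
    return clobbers


# The four-instruction example region: R2 is written before it is read.
example_region = [
    ("R2", ["R0", "R1"]),   # 1. R2 = add R0, R1
    ("R3", ["R2"]),         # 2. R3 = ld [R2 + 4]
    ("R2", ["R2", "R4"]),   # 3. R2 = add R2, R4
    (None, ["R2"]),         # 4. beqz R2, NEXT
]

# A hypothetical variant without instruction 1: R2 is now an input that gets overwritten.
clobbering_region = example_region[1:]

print(clobber_antidependences(example_region))
print(clobber_antidependences(clobbering_region))
```

In the second list the write to R2 (index 1) overwrites a value the region needed as an input, so a region boundary would have to fall before that instruction, which is one way to see why such dependences limit region size.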

Page 8: Idempotent Processor Architecture

Idempotent Recovery

average idempotent region sizes

[Figure: bar chart of average idempotent region size in ARM micro-ops (axis ticks: 10, 100, 1000) for the custom compiler. Bars: benchmark suites (geo-mean) SPEC INT, SPEC FP, PARSEC, Parboil; select benchmarks 401.bzip2, 456.hmmer; OVERALL (geo-mean). Data label: 43.]

frequent clobber antidependences – can be removed, but requires large restructuring effort

limited aliasing information

with func params marked using C/C++ “restrict”

Page 9: Idempotent Processor Architecture

Presentation Overview

❶ Idempotent Recovery

❷ Idempotent Processors

❸ Evaluation

Page 10: Idempotent Processor Architecture

Idempotent Processors

exploring the opportunity

hardware faults

exceptions & out-of-order retirement

branch misprediction

in-order, out-of-order, multi-core

[Figure labels: Vdd, In, Out; issue, execute, retire; branch]

Page 11: Idempotent Processor Architecture

Idempotent Processors

steps one, two, and three

Step 1: construct a high-performance in-order processor

Step 2: prune out unnecessary parts

Step 3: optimize for energy efficiency

ARM Cortex-A8 (’05), IBM Cell SPE (’05), Intel Atom (’08)

↓ power, area, & complexity

↑ performance at low cost

Page 12: Idempotent Processor Architecture

Step 1: Construction (v1.0)

[Figure: pipeline with Fetch, Decode & Issue, RF, and functional units (Integer, Integer, Branch, Multiply, Load/Store, FP); annotations: Add, Ld, “Exception!”]

Page 13: Idempotent Processor Architecture

Step 1: Construction (v1.0)

Staged instruction completion

[Figure: pipeline with Fetch; Decode, Rename, & Issue; RF; Bypass; functional units (Integer, Integer, Multiply, Load/Store, Branch, FP…); annotation: “Flush?”]

Page 14: Idempotent Processor Architecture

Step 1: Construction (v1.0, v1.1)

FP exceptions handled in hardware. Separate FP unit implements full IEEE 754.

[Figure: pipeline with Fetch; Decode, Rename, & Issue; RF; Bypass; functional units (Integer, Integer, Multiply, Load/Store, Branch, FP…, IEEE FP… ?); annotation: “Flush?”]

Page 15: Idempotent Processor Architecture

Step 1: Construction (v1.0, v1.1, v1.2)

Load miss? Have to flush!

[Figure: pipeline with Fetch; Decode, Rename, & Issue; RF; Bypass; Replay queue; functional units (Integer, Integer, Multiply, Load/Store, Branch, FP…, IEEE FP…); annotations: “Flush?”, “Flush?”, “Replay?”]

Page 16: Idempotent Processor Architecture

Step 1: Construction (v1.0, v1.1, v1.2, v1.3)

[Figure: pipeline with Fetch; Decode, Rename, & Issue; RF; Bypass; Replay queue; functional units (Integer, Integer, Multiply, Load/Store, Branch, FP…, IEEE FP); annotations: “Flush?”, “Replay?”]

TOO COMPLICATED!

Page 17: Idempotent Processor Architecture

Step 2: Simplification

idempotent edition (simple)

[Figure: pipeline with Fetch, Decode & Issue, RF, and functional units (Integer, Integer, Branch, Multiply, Load/Store, FP)]

WHAT IS GONE?

• staging register file (6 entries)
• replay queue (8 entries)
• entire rename pipeline stage
• IEEE-compliant floating point unit
• pipeline flush for exceptions and replays
• all associated control logic

Page 18: Idempotent Processor Architecture

Step 3: Optimization

idempotent edition (fast)

[Figure: pipeline with Fetch, Decode & Issue, RF, functional units (Integer, Integer, Branch, Multiply, Load/Store, FP), and an added SDB*]

* Slice Data Buffer (SDB) – Continual Flow Pipelines. ASPLOS ‘04

details in paper…

Page 19: Idempotent Processor Architecture

Presentation Overview

❶ Idempotent Recovery

❷ Idempotent Processors

❸ Evaluation

Page 20: Idempotent Processor Architecture

Evaluation

idempotent processor performance

[Figure: bar chart of speed-up over in-order (axis ticks: -25%, 0%, 25%, 50%) for Simple Idem, Fast Idem, and OoO. Bars: benchmark suites (geo-mean) SPEC INT, SPEC FP, PARSEC, Parboil; select benchmarks 401.bzip2, 456.hmmer; OVERALL (geo-mean).]

Page 21: Idempotent Processor Architecture

Evaluation

summary

Processor Type    | Performance (vs. In-Order)                            | Power/Complexity (vs. In-Order)
simple idempotent | worse by ~5% (compilation & serialization overheads)  | better
fast idempotent   | better by ~5% (modest OoO execution)                  | same or better
out-of-order      | better by ~30%                                        | worse

Page 22: Idempotent Processor Architecture

Presentation Overview

❶ Idempotent Recovery

❷ Idempotent Processors

❸ Evaluation

Page 23: Idempotent Processor Architecture

Future Work

– quantify power/complexity benefits (build real hardware prototype)

– more general error conditions (hardware faults, branch prediction, etc.)

– impact on multithreading/multiprocessors (re-execution currently assumes no interference)

– region overlapping (“region pipelining”) (analogous to overlapping checkpoints)

Page 24: Idempotent Processor Architecture

Conclusions

recovery using idempotence
– recovery without checkpoints

multiple uses and multiple designs
– uses: exception, speculation, fault recovery, and more
– designs: in-order, out-of-order, multi-core, GPU, and more

in this work: exception recovery + in-order design
– simplified out-of-order execution
– better performance at equal or lower power/complexity

Page 25: Idempotent Processor Architecture

Back-up Slides

Page 26: Idempotent Processor Architecture

Optimal Idempotent Region Size?

[Figure: sketch of overhead vs. region size, with curves labeled serialization overhead, compiler formation overhead (many factors), and re-execution overhead]

(rough sketch – graph not to scale)

Page 27: Idempotent Processor Architecture

Optimal Processor Design?

Single-issue in-order

Dual-issue in-order

Dual-issue OoO

Quad-issue OoO

Compiler overheads dominate?

Re-execution overheads dominate?

Best potential?

[Figure axis labels: ~250 mW, ~2.5 W]

Page 28: Idempotent Processor Architecture

Out-of-Order Issue Processors?

Some additional challenges…

Re-execution overhead high if mis-speculation frequent
– cannot restart from point of mis-speculation, and hence…
– re-execution overhead on average ≈ half the region

Example: branch misprediction
– With in-order issue, simple to flush/drain pipeline
– With out-of-order issue, we can use idempotence but…

5 branches/region @ 90% confidence ≈ 41% re-execution rate
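The slide does not spell out how the 41% figure is obtained. One reading that reproduces it, assuming each of the five branches is predicted correctly with independent probability 0.90, is that a region must be re-executed whenever at least one branch is mispredicted, giving 1 − 0.9⁵. The function name below is made up for this sketch.

```python
def reexecution_rate(branches_per_region, confidence):
    """Probability that at least one branch in a region is mispredicted.

    Assumes independent predictions with the same per-branch confidence,
    which is an assumption of this sketch rather than a claim from the slides.
    """
    return 1.0 - confidence ** branches_per_region


rate = reexecution_rate(5, 0.90)
print(f"{rate:.1%}")  # 41.0%
```

Under the same assumption, raising per-branch confidence to 0.99 would lower the rate to about 4.9%, which is in line with the suggestion to speculate only on high-confidence branches.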

1. Speculate only high-confidence branches
2. Hybrid checkpointing/idempotence
3. …?