EECS 470 ILP and Exceptions Lecture 7 Coverage: Chapter 3


Optimizing CPU Performance

• Golden Rule: tCPU = Ninst*CPI*tCLK

• Given this, what are our options?
  – Reduce the number of instructions executed
  – Reduce the cycles to execute an instruction
  – Reduce the clock period

• Our first focus: Reducing CPI
  – Approach: Instruction Level Parallelism (ILP)
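The Golden Rule can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name `cpu_time` and the workload numbers (one million instructions, a 1 ns clock) are assumptions, not figures from the lecture.

```python
def cpu_time(n_inst, cpi, t_clk):
    """Execution time in seconds: instructions x cycles/instruction x seconds/cycle."""
    return n_inst * cpi * t_clk

# Hypothetical workload: 1,000,000 instructions on a 1 ns clock.
base = cpu_time(1_000_000, cpi=2.0, t_clk=1e-9)
better_cpi = cpu_time(1_000_000, cpi=1.0, t_clk=1e-9)

# Halving CPI halves execution time when Ninst and tCLK stay fixed.
speedup = base / better_cpi
print(f"base = {base:.6f} s, improved = {better_cpi:.6f} s, speedup = {speedup:.1f}x")
```

The three factors are independent knobs only on paper; in practice a change that lowers CPI may lengthen the clock period, which is why all three appear in the product.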

Why ILP?


• Requirements
  – Parallelism
  – Large window
  – Limited control deps
  – Eliminate “false” deps
  – Find run-time deps

How Much ILP is There?

How Large Must the “Window” Be?

ALU Operation GOOD, Branch BAD

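Expected Number of Branches Between Mispredicts

E(X) ≈ 1/(1 - p), where p is the branch prediction accuracy

E.g., p = 95%: E(X) ≈ 20 branches, or roughly 100 instructions (at about one branch per five instructions)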
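The estimate above can be reproduced numerically. This is a minimal sketch that treats each branch as an independent trial with accuracy p; the function names and the default of five instructions per branch are assumptions chosen to match the "100-ish insts" figure.

```python
def expected_branches_between_mispredicts(p):
    """E(X) = 1 / (1 - p) for prediction accuracy p, assuming independent branches."""
    if not 0.0 <= p < 1.0:
        raise ValueError("accuracy p must be in [0, 1)")
    return 1.0 / (1.0 - p)

def expected_instructions_between_mispredicts(p, insts_per_branch=5):
    """Scale the branch count by an assumed average instructions-per-branch ratio."""
    return expected_branches_between_mispredicts(p) * insts_per_branch

for accuracy in (0.90, 0.95, 0.99):
    brs = expected_branches_between_mispredicts(accuracy)
    insts = expected_instructions_between_mispredicts(accuracy)
    print(f"p = {accuracy:.2f}: about {brs:.0f} branches, {insts:.0f} instructions")
```

Going from 95% to 99% accuracy multiplies the usable window by five, which is why predictor accuracy bounds how large a window is worth building.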

How Accurate are Branch Predictors?

Impact of Physical Storage Limitations

• Each instruction “in flight” must have storage for its result
  – Really worse than this because of misspeculation…

Registers GOOD, Memory BAD

• Benefits of registers
  – Well described deps
  – Fast access
  – Finite resource

• Memory gives up these benefits in exchange for flexibility

*p = …
*q = …
… = *p    ? (which store does this load depend on? It depends on whether p and q point to the same location)
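The pointer example can be made concrete with a small sketch. It assumes addresses are plain integers that become known only at run time; the helper name `youngest_matching_store` and the addresses 0x100 and 0x200 are hypothetical.

```python
def youngest_matching_store(store_addrs, load_addr):
    """Return the index of the most recent earlier store to load_addr, or None.

    store_addrs lists the addresses written by earlier stores, in program order.
    """
    for i in range(len(store_addrs) - 1, -1, -1):
        if store_addrs[i] == load_addr:
            return i
    return None

# Case 1: p and q point to different locations, so the load of *p
# depends on the first store (index 0).
p, q = 0x100, 0x200
print(youngest_matching_store([p, q], p))   # index of the store through p

# Case 2: q aliases p, so the same load now depends on the second store.
q = p
print(youngest_matching_store([p, q], p))   # index of the store through q
```

With registers the producer of a value is visible in the instruction encoding; here the answer changes with the run-time values of p and q, which is the "find run-time deps" requirement.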

“Bottom Line” for an Ambitious Design

First Optimization: Out-of-Order Writeback

Playing by the Rules: In-order Writeback

DIV.D   IF  ID  D1  D2  D3  D4  D5  MEM WB
ADD         IF  ID  EX  MEM WB

What’s wrong with this picture? Divide by Zero!
ADD writes back before DIV.D finishes, so a divide-by-zero in DIV.D can be detected after ADD has already updated its register.

With in-order writeback:

DIV.D   IF  ID  D1  D2  D3  D4  D5  MEM WB
ADD         IF  ID  EX  MEM WB, held back by four stall cycles (stall stall stall stall) so that its WB follows DIV.D's WB
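The four stall cycles can be derived from the stage counts. The sketch below assumes a single-issue pipeline that fetches one instruction per cycle, stages IF, ID, a variable number of execute stages, MEM, WB, and at most one writeback per cycle; the function names are hypothetical.

```python
def natural_wb_cycle(fetch_cycle, ex_stages):
    """Cycle in which WB happens with no stalls: IF, ID, ex_stages x EX, MEM, WB."""
    return fetch_cycle + 1 + ex_stages + 1 + 1

def stalls_for_in_order_wb(insts):
    """insts: list of (name, ex_stages) in program order, fetched one per cycle.

    Returns (name, actual_wb_cycle, stalls) so that writebacks happen in program order.
    """
    schedule = []
    prev_wb = 0
    for i, (name, ex_stages) in enumerate(insts):
        natural = natural_wb_cycle(i + 1, ex_stages)
        actual = max(natural, prev_wb + 1)   # must write back after the older instruction
        schedule.append((name, actual, actual - natural))
        prev_wb = actual
    return schedule

# DIV.D has five execute stages (D1..D5); ADD has one (EX).
for name, wb, stalls in stalls_for_in_order_wb([("DIV.D", 5), ("ADD", 1)]):
    print(f"{name}: WB in cycle {wb}, stalls = {stalls}")
```

Under these assumptions DIV.D writes back in cycle 9 and ADD, which could have written back in cycle 6, is delayed to cycle 10.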

Another Way to Get in the Same Mess

• Many systems use microcode
  – Simplifies mapping of complex instructions to CPU resources

• IA-32 add-with-carry
  – ADC (EAX),EBX

    tmp = MEM[EAX]
    tmp = tmp + EBX + CF, update CF     <- Side Effect!
    MEM[EAX] = tmp                      <- Potential Fault!
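The hazard in the micro-op sequence can be sketched in a few lines. This is an illustrative model only, not how any real IA-32 implementation works: the register dictionary, the `writable` set standing in for page permissions, and the `PageFault` class are all assumptions.

```python
MASK32 = 0xFFFFFFFF

class PageFault(Exception):
    """Hypothetical fault raised when a store targets a non-writable address."""

def adc_mem_reg(state, mem, writable):
    """Naive micro-op sequence for ADC (EAX),EBX that updates CF before the store."""
    addr = state["EAX"]
    tmp = mem[addr]                              # tmp = MEM[EAX]
    total = tmp + state["EBX"] + state["CF"]     # tmp = tmp + EBX + CF
    state["CF"] = 1 if total > MASK32 else 0     # side effect: CF is overwritten
    tmp = total & MASK32
    if addr not in writable:
        raise PageFault(hex(addr))               # potential fault on the store
    mem[addr] = tmp                              # MEM[EAX] = tmp

state = {"EAX": 0x1000, "EBX": 1, "CF": 0}
mem = {0x1000: MASK32}

try:
    adc_mem_reg(state, mem, writable=set())      # store faults after CF changed
    faulted = False
except PageFault:
    faulted = True
cf_after_fault = state["CF"]

# The handler makes the page writable and the instruction is restarted,
# but it now consumes the CF value left behind by the first attempt.
adc_mem_reg(state, mem, writable={0x1000})
print(faulted, cf_after_fault, mem[0x1000])
```

In this model the restarted instruction stores 1, whereas 0xFFFFFFFF + 1 with the original CF of 0 should have stored 0. A precise implementation would hold the CF update until the store is known not to fault.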

Exceptions and Interrupts

Exception Type       Sync/Async    Maskable?   Restartable?
I/O request          Async         Yes         Yes
System call          Sync          No          Yes
Breakpoint           Sync          Yes         Yes
Overflow             Sync          Yes         Yes
Page fault           Sync          No          Yes
Misaligned access    Sync          No          Yes
Memory Protect       Sync          No          Yes
Machine Check        Async/Sync    No          No
Power failure        Async         No          No

Solution: Precise Interrupts

• Implementation approaches
  – Don’t
    • E.g., Cray-1
  – Force in-order WB
    • E.g., ARM SA-1
  – Force in-order checks
    • E.g., Alpha 21064
  – Buffer speculative results
    • E.g., P4, Alpha 21264
    • History buffer
    • Future file/Reorder buffer

[Figure: the instruction stream divided at the PC. Instructions before the PC are completely finished; no instruction after the PC has executed at all. Labels: Precise State, Speculative State, MEM.]

Precise Interrupts via the Reorder Buffer

• @ Alloc
  – Allocate result storage at Tail

• @ Sched
  – Get inputs (ROB T-to-H then ARF)
  – Wait until all inputs ready

• @ WB
  – Write results/fault to ROB
  – Indicate result is ready

• @ CT
  – Wait until inst @ Head is done
  – If fault, initiate handler
  – Else, write results to ARF
  – Deallocate entry from ROB

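[Figure: pipeline IF, ID, Alloc, Sched, EX, followed by the ROB, CT, and the ARF. The front end and CT operate in-order; execution may complete in any order. The ROB is marked with Head and Tail pointers.]

ROB entry fields: PC, Dst regID, Dst value, Except?

• Reorder Buffer (ROB)
  – Circular queue of spec state
  – May contain multiple definitions of same register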

Reorder Buffer Example

Code Sequence

f1 = f2 / f3
r3 = r2 + r3
r4 = r3 - r2

Initial Conditions

- reorder buffer empty
- f2 = 3.0
- f3 = 2.0
- r2 = 6
- r3 = 5

ROB over time (H = Head, T = Tail)

H T
  regID: f1   result: ?    Except: ?

H T
  regID: r3   result: ?    Except: ?

H T
  regID: r3   result: 11   Except: N
  regID: r4   result: ?    Except: ?
  regID: r8   result: 2    Except: n




Code Sequence

f1 = f2 / f3
r3 = r2 + r3
r4 = r3 - r2

ROB over time (H = Head, T = Tail)

H T
  regID: f1   result: ?    Except: y

H T
  regID: f1   result: ?    Except: y




Code Sequence

f1 = f2 / f3
r3 = r2 + r3
r4 = r3 - r2

ROB over time (H = Head, T = Tail)

H T

H T
  first inst of fault handler
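The Alloc, Sched, WB, and CT steps and the three-instruction example above can be sketched as a small model. This is a minimal illustration under stated assumptions, not the design of any particular processor: the class and method names, the PCs 0, 4, and 8, and the choice to discard every ROB entry when the Head faults are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Entry:
    seq: int                       # allocation order (older = smaller)
    pc: int
    dst: str                       # destination register ID
    value: Optional[float] = None
    done: bool = False
    fault: bool = False

class ReorderBuffer:
    def __init__(self, arf):
        self.arf = dict(arf)       # architectural register file (precise state)
        self.entries = []          # index 0 = Head (oldest), last = Tail
        self.next_seq = 0

    def alloc(self, pc, dst):
        """@ Alloc: allocate result storage at the Tail."""
        entry = Entry(self.next_seq, pc, dst)
        self.next_seq += 1
        self.entries.append(entry)
        return entry

    def read(self, reg, consumer):
        """@ Sched: search older entries Tail-to-Head, then fall back to the ARF.

        Returns (ready, value).
        """
        for entry in reversed(self.entries):
            if entry.seq < consumer.seq and entry.dst == reg:
                return entry.done, entry.value
        return True, self.arf[reg]

    def writeback(self, entry, value=None, fault=False):
        """@ WB: record the result or fault and mark the entry ready."""
        entry.value, entry.fault, entry.done = value, fault, True

    def commit(self):
        """@ CT: retire the Head if it is done. Returns 'wait', 'commit', or 'fault'."""
        if not self.entries or not self.entries[0].done:
            return "wait"
        head = self.entries[0]
        if head.fault:
            self.entries.clear()   # assumption: squash all speculative state
            return "fault"
        self.arf[head.dst] = head.value
        self.entries.pop(0)
        return "commit"

rob = ReorderBuffer({"f2": 3.0, "f3": 2.0, "r2": 6, "r3": 5})
div = rob.alloc(0, "f1")           # f1 = f2 / f3
add = rob.alloc(4, "r3")           # r3 = r2 + r3
sub = rob.alloc(8, "r4")           # r4 = r3 - r2

rob.writeback(add, rob.read("r2", add)[1] + rob.read("r3", add)[1])
first_commit = rob.commit()        # Head (the divide) is not done yet
r3_ready, r3_value = rob.read("r3", sub)   # forwarded from the ROB, not the ARF
rob.writeback(sub, r3_value - rob.read("r2", sub)[1])
rob.writeback(div, fault=True)     # the divide reports an exception
second_commit = rob.commit()
print(first_commit, r3_value, second_commit, rob.arf["r3"])
```

Although the add and subtract finished first, neither reached the ARF: when the divide faults at the Head, r3 in the ARF still holds its initial value of 5, so the handler sees precise state.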
