Advanced Microarchitecture Lecture 11: Memory Scheduling




Page 1: Advanced Microarchitecture, Lecture 11: Memory Scheduling

Page 2: Executing Memory Instructions

• If R1 != R7, then Load R8 gets correct value from cache

• If R1 == R7, then Load R8 should have gotten value from the Store, but it didn’t!

Load R3 = 0[R6]

Add R7 = R3 + R9

Store R4 → 0[R7]

Sub R1 = R1 – R2

Load R8 = 0[R1]

[Timing diagram: Load R3 issues and misses the cache; while the miss is serviced, Sub issues and Load R8 issues and hits the cache; once the miss is serviced, Add and then the Store finally issue. But there was a later load…]
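To see the same ambiguity at the source level, here is a hypothetical C analogue of the sequence above (a sketch, not from the slides; the function and variable names are made up for illustration):

```c
#include <stdio.h>

/* Whether the store through q and the later load through p touch the same
 * location depends on runtime pointer values, just as R7 may or may not
 * equal R1 in the assembly above. */
static int read_after_store(int *p, int *q) {
    *q = 4;      /* like "Store R4 -> 0[R7]": address given by q (R7)        */
    return *p;   /* like "Load R8 = 0[R1]": if p == q it must see the 4,     */
                 /* otherwise the old value in memory is the correct answer. */
}

int main(void) {
    int a = 10, b = 20;
    printf("%d\n", read_after_store(&a, &b));  /* no aliasing: prints 10 */
    printf("%d\n", read_after_store(&a, &a));  /* aliasing:    prints 4  */
    return 0;
}
```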

Page 3: Memory Disambiguation Problem

• The ordering problem is a data-dependence violation

• Why can't this happen with non-memory insts?
  – Operand specifiers in non-memory insts are absolute
    • "R1" refers to one specific location
  – Operand specifiers in memory insts are ambiguous
    • "R1" refers to a memory location specified by the value of R1; as pointers change, so does this location

• Determining whether it is safe to issue a load OOO requires disambiguating the operand specifiers


Page 4: Two Problems

• Memory disambiguation
  – Are there any earlier unexecuted stores to the same address as myself? (I'm a load)
  – Binary question: the answer is yes or no

• Store-to-load forwarding problem
  – Which earlier store do I get my value from? (I'm a load)
  – Which later load(s) do I forward my value to? (I'm a store)
  – Non-binary question: the answer is one or more instruction identifiers


Page 5: Load Store Queue (LSQ)

LSQ contents (oldest at top, youngest at bottom):

L/S   PC       Seq     Addr     Value
L     0xF048   41773   0x3290   42
S     0xF04C   41774   0x3410   25
S     0xF054   41775   0x3290   -17
L     0xF060   41776   0x3418   1234
L     0xF840   41777   0x3290   -17
L     0xF858   41778   0x3300   1
S     0xF85C   41779   0x3290   0
L     0xF870   41780   0x3410   25
L     0xF628   41781   0x3290   0
L     0xF63C   41782   0x3300   1

Data Cache contents: 0x3290 = 42, 0x3410 = 38, 0x3418 = 1234, 0x3300 = 1

(The later loads to 0x3290 and 0x3410 get -17, 0, and 25 forwarded from the earlier stores in the LSQ rather than from the data cache, which still holds 42 and 38.)
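To make the entry format concrete, here is a minimal C sketch of one LSQ entry matching the columns above; the struct and field names are assumptions for illustration, not any particular processor's design:

```c
#include <stdbool.h>
#include <stdint.h>

/* One LSQ entry, mirroring the columns of the table above (a sketch;
 * field names are illustrative). */
typedef struct {
    bool     is_store;    /* L/S column                             */
    uint64_t pc;          /* PC of the load/store                   */
    uint64_t seq;         /* Seq: program-order sequence number     */
    uint64_t addr;        /* Addr: effective address (once known)   */
    int64_t  value;       /* Value: store data or loaded value      */
    bool     addr_valid;  /* has the ea-comp executed yet?          */
    bool     executed;    /* has the access itself executed?        */
} lsq_entry_t;

#define LSQ_SIZE 32
static lsq_entry_t lsq[LSQ_SIZE];  /* allocated in program order, oldest first */
```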

Page 6: Most Conservative Policy

• No memory reordering
• LSQ still needed for forwarded data (last slide)
• Easy to schedule

[Figure: each memory instruction becomes "Ready!" and bids for a grant only after the previous one has issued.]

• Least IPC; all memory executed sequentially

Page 7: Loads OOO Between Stores

• Let loads execute OOO w.r.t. each other, but no reordering past earlier unexecuted stores

[Figure: a load becomes ready to execute only when all earlier stores have executed; earlier loads need not have.]
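A minimal C sketch of this readiness check (structure and names are assumptions, not the lecture's hardware): a load may issue only once every older store in the LSQ has executed, but it may still pass other loads.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    bool is_store;
    bool executed;
} lsq_entry_t;

/* Entries 0 .. load_idx-1 are older (in program order) than the load. */
bool load_may_issue(const lsq_entry_t *lsq, size_t load_idx) {
    for (size_t i = 0; i < load_idx; i++) {
        if (lsq[i].is_store && !lsq[i].executed)
            return false;   /* an older store has not executed yet */
    }
    return true;            /* no unexecuted older stores: safe to go OOO */
}
```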

Page 8: Loads Wait for Only STAs

• Stores normally don't "execute" until both inputs are ready: address and data
• Only the address is needed to disambiguate

[Figure: a load only needs to wait for the earlier store's address to be ready, not its data.]

Page 9: Loads Execute When Ready

• Most aggressive approach
• Relies on the fact that store→load forwarding is not the common case
• Greatest potential IPC: loads never stall
• Potential for incorrect execution


Page 10: Detecting Ordering Violations

• Case 1: older store executes before younger load
  – No problem; if same address, store→load forwarding happens
• Case 2: older store executes after younger load
  – Store scans all younger loads
  – Address match → ordering violation


Page 11: Detecting Ordering Violations (2)

L/S   PC       Seq     Addr     Value
L     0xF048   41773   0x3290   42
S     0xF04C   41774   0x3410   25
S     0xF054   41775   0x3290   -17
L     0xF060   41776   0x3418   1234
L     0xF840   41777   0x3290   -17
L     0xF858   41778   0x3300   1
S     0xF85C   41779   0x3290   0
L     0xF870   41780   0x3410   25
L     0xF628   41781   0x3290   42
L     0xF63C   41782   0x3300   1

• A store broadcasts its value, address, and sequence #, e.g. (-17, 0x3290, 41775)
• Loads CAM-match on the address, and only care if the store's seq # is lower than their own
  (Load 41773 ignores the broadcast because it has a lower seq #)
• If a younger load hadn't executed and the address matches, it grabs the broadcast value (-17)
• If a younger load has executed and the address matches, then it is an ordering violation!
  – Grab the value, flush the pipeline after the load
  – e.g. for the later broadcast (0, 0x3290, 41779)
• An instruction may be involved in more than one ordering violation
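A C sketch of the broadcast check just described (structure and names are illustrative): when a store's address and data become known, it broadcasts (value, addr, seq) and every younger load entry reacts.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    bool     is_store;
    bool     executed;     /* has this load already obtained a value?   */
    uint64_t seq;          /* program-order sequence number             */
    uint64_t addr;
    int64_t  value;
    bool     violation;    /* flag: ordering violation detected          */
} lsq_entry_t;

void store_broadcast(lsq_entry_t *lsq, size_t n,
                     uint64_t st_seq, uint64_t st_addr, int64_t st_value) {
    for (size_t i = 0; i < n; i++) {
        lsq_entry_t *e = &lsq[i];
        if (e->is_store)        continue;  /* only loads listen              */
        if (e->seq < st_seq)    continue;  /* older loads ignore the broadcast */
        if (e->addr != st_addr) continue;  /* CAM match on the address        */
        if (!e->executed) {
            e->value    = st_value;        /* grab the broadcast value        */
            e->executed = true;
        } else {
            e->violation = true;           /* load already ran with a stale   */
                                           /* value: flush after this load    */
        }
    }
}
```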

Page 12: Dealing with Misspeculations

• Instructions using the load's stale/wrong value will propagate more wrong values
• These must somehow be re-executed
• Easiest: flush all instructions after (and including?) the misspeculated load, and just refetch
  – Load uses the forwarded value
  – Correct value is propagated when instructions re-execute

Page 13: Recovery Complications

• When flushing only part of the pipeline (everything after the load), the RAT must be repaired to the state just after the load was renamed
• Solutions?
  – Checkpoint at every load
    • Not so good: between loads and branches, a very large number of checkpoints is needed
  – Roll back to the previous branch (which has its own checkpoint)
    • Make sure the load doesn't misspeculate the 2nd time around
    • Have to redo the work between the branch and the load, which was all correct the first time around
  – Works with an undo-list style of recovery


Page 14: Flushing is Expensive

• Not all later instructions are dependent on the bogus load value
• Pipeline latency due to refetch is exposed
• Hunting down RS entries to squash is tricky


Page 15: Selective Re-Execution

• Ideal case w.r.t. maintaining high IPC
• Very complicated
  – Need to hunt down only the data-dependent instructions
  – Messier because some instructions may have already executed (now in the ROB) while others may not have executed yet (still in the RS)
    • Iteratively walk the dependence graph?
    • Use some sort of load/store coloring scheme?
• The P4 uses replay for load-latency misspeculation
  – But replay wouldn't work in this case (why?)


Page 16: Load/Store Execution ("SimpleScalar" style)

[Figure: at dispatch/alloc time, a Store is cracked into an ea-comp op and a st-data op; both go into the RS (alongside ops like an Add, or a Load with its ld-data), are independently scheduled, and independently execute, writing the address (ea) and data (D) into the store's LSQ entry. When both halves are done, the store is "complete" and forwards its value to later loads. A Load is similar, but the LD-data portion is data-dependent on the LD ea-comp.]

Page 17: Complications

• LSQ needs data-capture support
  – Store data needs to capture its value
  – EA-comps can write to LSQ entries directly using the LSQ index (no associative search)

[Figure: the RS holds ops such as add, xor, L-ea, S-ea, Ld-d, and St-d alongside the LSQ. Example RS entries:]

op      dest    srcL   srcR
ADD     T17     T12    T43
St-ea   Lsq-5   T18    #0

• A store normally doesn't have a destination; overload the dest field to hold the LSQ index
• The load ea-comp is done the same way; the load's LSQ entry handles the "real" destination tag broadcast

Page 18: Complications (2)

• Load must bid/select twice
  – once for the ea-comp portion
  – once for the cache access (includes the LSQ check)

[Figure: the Ld-ea op is selected from the RS and goes to ea-comp execution; the Ld-data op is selected separately and then accesses the data cache and searches the LSQ in parallel.]

Page 19: Load/Store Execution ("Pentium" style)

[Figure: at dispatch/alloc, a Store is cracked into STA and STD micro-ops and a Load becomes an LD micro-op; all are scheduled out of the RS, with "store" and "load" entries allocated in the LSQ.]

• STA and STD still execute independently
• LSQ does not need data-capture
  – uses the RS's data-capture (for a data-capture scheduler)
  – or RS→PRF→LSQ
• Potentially adds a little delay from STD-ready to ST→LD forwarding

Page 20: Load Execution

• Only one select/bid

[Figure: the Load is selected once from the RS, performs ea-comp execution, then accesses the data cache with the LSQ search in parallel. The load queue entry doesn't "execute"; it just holds the address for detecting ordering violations.]

Page 21: Store Execution

• STA and STD independently issue from the RS
  – STA does the ea-comp
  – STD just reads its operand and moves it to the LSQ
• When both have executed and reached the LSQ, perform an LSQ search for younger loads that have already executed (i.e., ordering violations)


Page 22: LSQ Hardware in More Detail

• CAM logic: harder than a regular scheduler because we need address + age information
• Age information is not needed for physical registers, since register renaming guarantees one writer per address
  – There is no easy way to prevent more than one store to the same memory address


Page 23: Loads Checking for Earlier Matching Stores

[Figure: the LSQ holds an address bank and a data bank. A LD to 0x4000 compares its address (=) against every earlier entry; each entry is qualified by "valid store" and "addr match" signals, and priority logic selects the closest earlier match ("use this store"), e.g. the second ST 0x4000 rather than the first, while ignoring the non-matching ST 0x4120.]

• Need to adjust this so the load need not be at the bottom, and so that the LSQ can wrap around
• If |LSQ| is large, the logic can be adapted to have log delay
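A sequential C sketch of this search (real hardware uses parallel comparators plus priority logic; names are illustrative): scan from the entry just above the load toward the oldest entry and use the first valid store whose address matches.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     is_store;
    bool     addr_valid;
    uint64_t addr;
    int64_t  value;
} lsq_entry_t;

/* Returns the index of the closest earlier matching store, or -1 if the
 * load should get its value from the data cache instead. */
int find_forwarding_store(const lsq_entry_t *lsq, int load_idx, uint64_t ld_addr) {
    for (int i = load_idx - 1; i >= 0; i--) {
        if (lsq[i].is_store && lsq[i].addr_valid && lsq[i].addr == ld_addr)
            return i;   /* "use this store": youngest older match wins */
    }
    return -1;          /* no earlier matches */
}
```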

Page 24: Data Forwarding

(Similar logic to the previous slide)

[Figure: the LD to 0x4000 matches its address against the earlier stores (ST 0x4000, ST 0x4120, ST 0x4000); matching entries whose values have since been overwritten by a later matching store are filtered out, and the surviving store's value is captured from the data bank into the load.]

• This logic is ugly, complicated, slow, and power hungry!

Page 25: Alternative: Store Colors

• Each store is assigned a unique, increasing number (its color)
• Loads inherit the color of the most recently alloc'd store

[Figure: a sequence of stores and loads in which each store gets Color = 1, 2, 3, 4 and each load takes the color of the latest preceding store. Three consecutive loads get the same color: we only care about ordering w.r.t. stores, not other loads.]

• Ignore store broadcasts if the store's color is greater than your own
• Special care is needed to deal with the eventual overflow/wrap-around of the color/age counter
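A C sketch of store coloring at allocation time (illustrative; the counter wrap-around the slide warns about is ignored here):

```c
#include <stdbool.h>
#include <stdint.h>

static uint32_t current_store_color = 0;

typedef struct {
    bool     is_store;
    uint32_t color;
} mem_op_t;

void alloc_mem_op(mem_op_t *op, bool is_store) {
    if (is_store)
        op->color = ++current_store_color;  /* each store gets a new, larger color  */
    else
        op->color = current_store_color;    /* loads inherit the latest store's color */
    op->is_store = is_store;
}

/* A load ignores a store's broadcast if the store is younger than it,
 * i.e. if the store's color is greater than the load's own. */
bool load_ignores_broadcast(const mem_op_t *load, uint32_t store_color) {
    return store_color > load->color;
}
```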

Page 26: Don't Make Stores Forward

• When a load receives data, it still needs to wake up its dependents… the value isn't needed until the dependents make it to the execute stage
• Alternative timing/implementation:
  – Broadcast the address only
  – When the load wakes up, it searches the LSQ again (should hit now)


Page 27: Store→Load→Op Timing

[Timing diagram: the ideal case has std/sta in cycle i, the LD shortly after, and then the dependent add. When the load is predicted dependent on the store, it waits for the STA; with decoupled scheduling it re-searches the LSQ when it wakes up, and its dependent add still has to go through schedule (S), execute (X), and so on afterwards.]

• Even if the load value is ready, the dependent op hasn't been scheduled
• No performance benefit for direct ST→LD forwarding at the time of the address broadcast

Page 28: The LSQ is Full of Associative Searches

• We should all know by now that associative searches do not scale well
• So how do we manage this?


Page 29: Split Load Queue/Store Queue

• Stores don't need to broadcast their address to other stores
• Loads don't need to check for collisions against earlier loads

[Figure: the LSQ is split into a Store Queue (STQ) and a Load Queue (LDQ). A load's associative search for earlier stores only needs to check entries that actually contain stores; a store's associative search for later loads (for ST→LD forwarding) only needs to check entries that actually contain loads.]

Page 30: Load Execution

• Load issue → EA computation → DL1 access and LSQ search in parallel
• Typical latencies
  – DL1: 3 cycles
  – LSQ search: 1 cycle (more?)
• Remember: instructions are speculatively scheduled!


Page 31: Load Execution (2)

[Pipeline timing diagrams for a LOAD and its dependent ADD: one assumes the load hits in the LSQ, the other assumes it hits in the DL1; the ADD's schedule slot depends on which latency the load actually sees.]

• But at the time of scheduling, how do we know LSQ hit vs. DL1 hit?

Page 32: Load Execution (3)

• Can predict the latency
  – similar to predicting L1 hit vs. L2 hit vs. going to DRAM
  – If we predict LSQ hit but are wrong → scheduling replay
  – If we predict L1 hit but are wrong → waste a few cycles
• Normalize latencies
  – Make the LSQ hit and the L1 hit have the same latency
  – Greatly simplifies the scheduler
  – Loses some performance, since in theory you could do ST→LD forwarding in less time than the L1 latency
    • The loss is not too great since most loads do not hit in the LSQ


Page 33: Reducing Ordering Violations

• Dependence violations can be predicted

[Figure: when an ordering violation between store A and load B is detected, a bit is set in a prediction table (initially all zeros) indexed by the load: "make note of it". The next time around, don't let B issue before all previous STAs are known; once all previous STAs (e.g., from stores X and Z) are known, it's safe to issue B.]

• The table has a finite number of entries; eventually all will be set to "do not speculate" → equivalent to a machine with no ordering speculation
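A C sketch of the single-bit "do not speculate" table from the figure (the size and PC hash are assumptions for illustration):

```c
#include <stdbool.h>
#include <stdint.h>

#define PRED_ENTRIES 1024
static bool dont_speculate[PRED_ENTRIES];   /* initially all zero */

static unsigned pred_index(uint64_t load_pc) {
    return (unsigned)(load_pc >> 2) % PRED_ENTRIES;  /* simple PC hash */
}

/* Called when a store detects an ordering violation against this load. */
void note_violation(uint64_t load_pc) {
    dont_speculate[pred_index(load_pc)] = true;       /* "make note of it" */
}

/* Called at schedule time: may this load issue before earlier STAs are known? */
bool may_issue_early(uint64_t load_pc, bool all_earlier_stas_known) {
    if (all_earlier_stas_known)
        return true;                                  /* always safe          */
    return !dont_speculate[pred_index(load_pc)];      /* speculate only if no note */
}
```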

Page 34: Dealing with a "Full" Table

• Do something similar to branch predictors: use counters
  – asymmetric costs
    • Mispredicting a taken branch as not-taken, or a not-taken branch as taken, makes no difference; we need to flush and re-fetch either way
    • Predicting a no-conflict load as a conflict causes the load to stall unnecessarily, but other instructions may still execute
    • Predicting a conflict as no-conflict causes a pipeline flush
  – asymmetric frequencies
    • No-conflict loads are much more common than conflicting loads


Page 35: Dealing with "Full" Tables (2)

• Asymmetric updates
  – when there is no ordering violation, decrement the counter by 1
  – on an ordering violation, increment by X > 1
    • choose X based on the frequency of misspeculations and the penalty/performance cost of a misspeculation
• Periodic reset
  – Every K cycles, reset the entire table
  – Works reasonably well, with lower hardware cost than using saturating counters

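A C sketch of both options (the values of X, the conflict threshold, and the reset interval K are illustrative choices, not from the slides):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define PRED_ENTRIES 1024
#define X            8       /* large bump on a violation         */
#define THRESHOLD    4       /* predict "conflict" at or above this */
#define K            100000  /* cycles between full table resets   */

static uint8_t conflict_ctr[PRED_ENTRIES];

void update_predictor(unsigned idx, bool violation) {
    if (violation)   /* ordering violation: saturating increment by X > 1 */
        conflict_ctr[idx] = (conflict_ctr[idx] > 255 - X) ? 255
                                                          : conflict_ctr[idx] + X;
    else if (conflict_ctr[idx] > 0)
        conflict_ctr[idx]--;                 /* no violation: decay by 1 */
}

bool predict_conflict(unsigned idx) {
    return conflict_ctr[idx] >= THRESHOLD;
}

void maybe_reset(uint64_t cycle) {
    if (cycle % K == 0)
        memset(conflict_ctr, 0, sizeof conflict_ctr);  /* periodic-reset option */
}
```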

Page 36: Store-Load Pair Prediction

• Explicitly remember which load conflicted with which store

[Figure: when an ordering violation between store A and load B is detected, record the pair (A, B): "make note of it". The next time around, don't let B issue before A's STA is known (B doesn't have to wait for stores X and Z). Once A's STA is known, even though X and Z are still unknown, it's hopefully safe to issue B.]

Page 37: Store Sets Prediction

• A load may conflict with more than one previous store

[Figure: store A ("Store R1 → 0x4000") in basic block #1 and store B ("Store R4 → 0x4000") in basic block #2 both reach load C ("Load R2 = 0x4000") in basic block #3 along different control-flow paths.]

Page 38: Store Sets Prediction (2)

[Figure: an ordering violation between store A and load C is detected, so A is noted in C's store set; later, another violation between store B and C adds B as well. The next time around, don't let C issue before both A's and B's STAs are known (C doesn't have to wait for store Z). Once A's and B's STAs are known, even though Z is still unknown, it's hopefully safe to issue C.]

Page 39: Store Sets Implementation

[Figure: the PC hashes into a Store Sets Identification Table (SSIT) whose entry gives a store set ID; here instructions A, B, and C all map to SSID = 4, so they belong to the same store set. A Last Fetched Store Table (LFST), indexed by SSID, holds the LSQ index of the most recently fetched store in that set.
  - A is fetched: the SSIT lookup gives SSID = 4; update the LFST with A's LSQ index (A:L12).
  - C is fetched: the SSIT lookup gives SSID = 4; the LFST says the load should wait on LSQ entry 12 before issuing.
  - If B is fetched before C, then B waits on A and updates the LFST, and C will then wait on B.]
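A C sketch of the SSIT/LFST lookups in the figure (table sizes, the PC hash, and the "empty" encodings are assumptions for illustration):

```c
#include <stdint.h>
#include <string.h>

#define SSIT_ENTRIES 4096
#define NUM_SSIDS    64
#define NO_SSID      0xFF
#define NO_STORE     (-1)

static uint8_t ssit[SSIT_ENTRIES];   /* PC-indexed: store set ID (SSID)            */
static int     lfst[NUM_SSIDS];      /* per SSID: LSQ index of the last fetched store */

static unsigned ssit_index(uint64_t pc) { return (unsigned)(pc >> 2) % SSIT_ENTRIES; }

void init_store_sets(void) {
    memset(ssit, NO_SSID, sizeof ssit);
    for (int i = 0; i < NUM_SSIDS; i++) lfst[i] = NO_STORE;
}

/* Store fetched: it waits on the set's previous store (if any) and becomes
 * the set's new last fetched store, e.g. A:L12 writes LFST[4] = 12. */
int on_store_fetch(uint64_t pc, int lsq_index) {
    uint8_t ssid = ssit[ssit_index(pc)];
    if (ssid == NO_SSID) return NO_STORE;
    int wait_on = lfst[ssid];
    lfst[ssid] = lsq_index;
    return wait_on;
}

/* Load fetched: returns the LSQ entry it should wait on before issuing. */
int on_load_fetch(uint64_t pc) {
    uint8_t ssid = ssit[ssit_index(pc)];
    return (ssid == NO_SSID) ? NO_STORE : lfst[ssid];
}
```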

Page 40: Note on Dependence Prediction

• Few processors actually support this
  – The 21264 did; it used the "load wait table"
  – Core 2 supports this now… so this is becoming much more important
• Many machines only use the wait-for-earlier-STAs approach
  – This becomes a bottleneck as the instruction window size increases
