Mikko Lipasti University of Wisconsin-Madison Value Prediction: Are(n’t) We Done Yet?

Keynote, 2nd Value Prediction Workshop, Oct. 10, 2004 2 of 38

Definition

• What is value prediction? Broadly, three salient attributes:
  1. Generate a speculative value (predict)
  2. Consume speculative value (execute)
  3. Verify speculative value (compare/recover)
• This subsumes branch prediction
  • Focus here on operand values
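
The three attributes above can be sketched in a few lines of Python. This is only a minimal illustration, assuming a last-value predictor indexed by instruction address; the names `LastValuePredictor` and `run_with_value_prediction` are hypothetical and do not come from the talk.

```python
class LastValuePredictor:
    """Hypothetical last-value predictor: guesses that an instruction
    produces the same value it produced the previous time."""

    def __init__(self):
        self.table = {}  # instruction address (pc) -> last observed value

    def predict(self, pc):
        # Step 1: generate a speculative value (None means "no prediction").
        return self.table.get(pc)

    def update(self, pc, actual):
        self.table[pc] = actual


def run_with_value_prediction(trace, predictor):
    """trace: iterable of (pc, actual_value) pairs in program order.
    Returns (correct, mispredicted, not_predicted) counts."""
    correct = mispredicted = not_predicted = 0
    for pc, actual in trace:
        guess = predictor.predict(pc)
        # Step 2: consumers would execute early using `guess` here.
        # Step 3: verify against the real result; recover on a mismatch.
        if guess is None:
            not_predicted += 1
        elif guess == actual:
            correct += 1
        else:
            mispredicted += 1  # dependents would be squashed or re-executed
        predictor.update(pc, actual)
    return correct, mispredicted, not_predicted


if __name__ == "__main__":
    # A made-up trace for one load that mostly returns zero.
    trace = [(0x40, 0), (0x40, 0), (0x40, 0), (0x40, 7), (0x40, 0)]
    print(run_with_value_prediction(trace, LastValuePredictor()))
```

The verification step is what separates value prediction from plain memoization: a wrong guess is always caught, at the cost of recovery.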

Some History

• "Classical" value prediction
  • Independently invented by 4 groups in 1995-1996
    1. AMD (Nexgen): L. Widigen and E. Sowadsky, patent filed March 1996, inv. March 1995
    2. Technion: F. Gabbay and A. Mendelson, inv. sometime 1995, TR 11/96, US patent Sep 1997
    3. CMU: M. Lipasti, C. Wilkerson, J. Shen, inv. Oct. 1995, ASPLOS paper submitted March 1996
    4. Wisconsin: Y. Sazeides, J. Smith, Summer 1996

Why?

• Possible explanations:
  1. Natural evolution from branch prediction
  2. Natural evolution from memoization
  3. Natural evolution from rampant speculation
     • Cache hit speculation
     • Memory independence speculation
     • Speculative address generation
  4. Improvements in tracing/simulation technology
     • "There's a lot of zeroes out there." (C. Wilkerson)
     • Values, not just instructions & addresses
     • TRIP6000 [A. Martin-de-Nicolas, IBM]

Publications by Year

[Chart: cumulative publications (y-axis, 0 to 70) by year (x-axis, 1996 to 2004); series: ISCA, MICRO, HPCA, Others, Total]

Excludes journals, workshops, compiler conferences

What Happened?

• Tremendous academic interest
  • Dozens of research groups, papers, proposals
• No industry uptake
  • No present or planned CPU with value prediction
• Why?
  • Meager performance benefit (< 10%)
  • Power consumption
    • Dynamic power for extra activity
    • Static power (area) for prediction tables
  • Complexity and correctness
    • Subtle memory ordering issues [MICRO '01]
    • Misprediction recovery [HPCA '04]

Performance?

• Relationship between timely fetch and value prediction benefit [Gabbay, ISCA]
  • Value prediction doesn't help when the result can be computed before the consumer instruction is fetched
• High-bandwidth fetch helps
  • Wide trace caches studied in late 1990s
  • But, these have several negative attributes
• Recent designs focus on frequency, not ILP
• High-bandwidth fetch is a red herring
  • More important to fetch the right instructions

Future Adoption?

• Classical value prediction will only make it in the context of a very different microarchitecture
  • One that explicitly and aggressively exposes ILP
• Promising trends
  • Deep pipelining craze appears to be over
    • Can't manage the design complexity
  • High frequency mania appears to be over
    • Can't afford the power
  • Architects are pursuing ILP once again
• Value prediction has another opportunity

What Value Prediction Begat

• Value prediction catalyzed a new focus on values in computation
  • This had not been studied before
• A whole new realm of research: Value-Aware Microarchitecture
  • Spans numerous subdisciplines
  • Significant industrial impact already
  • Also, developments in supporting technologies

Value-Aware Microarchitecture

Memory Hierarchy
• Register File Compression [several]
• Cache Compression [Gupta, Alameldeen]
• Memory Compression [e.g. IBM MXT]
• Bandwidth compression
  • Address and data bus encoding [Rudolph]
  • Initialization Traffic [Lewis]

Execution Core
• Value Prediction
• Operand Significance
  • Low Power [Canal]
  • Execution bandwidth [Loh]
  • Bit-slicing [Pentium 4, Mestan]
• Instruction reuse [Sodani]
• Carry prediction [Circuit-level Speculation]

Load/Store Processing
• Load value prediction [numerous]
• Fast address calculation [Austin]
• Value-aware alias prediction [Onder]
• Memory consistency [Cain]

Cache Coherence
• Producer-side
  • Silent stores, temporally silent stores [Lepak]
  • Speculative lock elision [Rajwar; Wisc, UIUC]
• Consumer side
  • Load value prediction using stale lines [Lepak]
  • "Coherence decoupling" [Burger, Sohi; ASPLOS 04]

Supporting Technologies

• Value prediction presented some unique challenges:
  • Relatively low correct prediction rate (initially 40-50%)
  • Nontrivial misprediction rate with avoidable misprediction cost
• These drove study of:
  • Confidence prediction/estimation
    • First microarchitectural application of confidence estimation, though not widely credited or cited as such
    • Since studied for numerous applications, e.g. gating control speculation
  • Selective recovery [Sazeides Ph.D., Kim HPCA '04]
    • Numerous challenges in extending recovery to entire window
  • Both have proved to be fruitful research areas
• Also stimulated development of software technology:
  • Value profiling
  • Value-based compiler optimizations
  • Run-time code specialization
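
To make the confidence-estimation idea concrete, the sketch below gates the use of a prediction on a per-instruction saturating counter. It is an illustration of one common scheme, not the mechanism of any particular paper cited here; the counter width, the threshold, the reset-on-miss policy, and the names `ConfidenceEstimator` and `gated_outcomes` are all assumptions.

```python
class ConfidenceEstimator:
    """Hypothetical saturating-counter confidence table."""

    def __init__(self, max_count=3, threshold=2):
        self.max_count = max_count
        self.threshold = threshold
        self.counters = {}  # pc -> counter in [0, max_count]

    def is_confident(self, pc):
        return self.counters.get(pc, 0) >= self.threshold

    def train(self, pc, prediction_was_correct):
        count = self.counters.get(pc, 0)
        if prediction_was_correct:
            self.counters[pc] = min(count + 1, self.max_count)
        else:
            self.counters[pc] = 0  # assumed policy: reset on a miss


def gated_outcomes(outcomes, estimator, pc=0x40):
    """outcomes: list of booleans, True if the value prediction for `pc`
    would have been correct on that dynamic instance.
    Returns (used_correct, used_wrong, suppressed) counts."""
    used_correct = used_wrong = suppressed = 0
    for was_correct in outcomes:
        if estimator.is_confident(pc):
            if was_correct:
                used_correct += 1
            else:
                used_wrong += 1   # pays the misprediction cost
        else:
            suppressed += 1       # prediction not consumed, no recovery risk
        estimator.train(pc, was_correct)
    return used_correct, used_wrong, suppressed
```

Suppressing low-confidence predictions trades some lost coverage for fewer recoveries, which is how the "avoidable misprediction cost" above is avoided.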

Outline

• Some History
• Industry Trends
• Value-Aware Microarchitecture
• Case study: Memory Consistency [Trey Cain, ISCA 2004]
  • Conventional load queue microarchitecture
  • Value-based memory ordering
  • Replay-reduction heuristics
  • Performance evaluation
• Conclusions

Value-based Memory Consistency

• High ILP => Large instruction windows
  • Larger physical register file
  • Larger scheduler
  • Larger load/store queues
  • Result in increased access latency
• Value-based Replay
  • If load queue scalability a problem…who needs one!
  • Instead, re-execute load instructions a 2nd time in program order
  • Filter replays: heuristics reduce extra cache bandwidth to 3.5% on average

Enforcing RAW dependences

Program order (Exe order):
1. (1) store A
2. (3) store ?
3. (2) load A

• Load queue contains load addresses
• Memory independence speculation
  • Hoist load above unknown store assuming it is to a different address
  • Check correctness at store retirement
• One search per store address calculation
  • If address matches, the load is squashed
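
The associative check described above can be modeled roughly as follows. This is a simplified software sketch, assuming the search happens when a store's address resolves and that entries are tagged with a program-order age; `LoadEntry`, `LoadQueue`, and their methods are hypothetical names, not hardware described in the talk.

```python
from dataclasses import dataclass


@dataclass
class LoadEntry:
    age: int        # program-order sequence number
    address: int    # resolved load address
    squashed: bool = False


class LoadQueue:
    """Hypothetical model of an associatively searched load queue."""

    def __init__(self):
        self.entries = []   # loads that have already executed
        self.searches = 0   # number of associative searches performed

    def load_executed(self, age, address):
        self.entries.append(LoadEntry(age, address))

    def store_address_resolved(self, store_age, store_address):
        """Search for younger loads that already read this address.
        Returns the ages of the loads that must be squashed."""
        self.searches += 1
        victims = [e for e in self.entries
                   if e.age > store_age
                   and e.address == store_address
                   and not e.squashed]
        for entry in victims:
            entry.squashed = True
        return [entry.age for entry in victims]


def demo(resolved_store_address, load_address=0x100):
    """Slide example: load 3 executes before store 2's address is known."""
    queue = LoadQueue()
    queue.load_executed(age=3, address=load_address)
    return queue.store_address_resolved(2, resolved_store_address)
```

With these assumptions, `demo(0x200)` squashes nothing while `demo(0x100)` squashes the hoisted load; either way the queue is searched once per store, which is the cost the rest of the case study tries to remove.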

Enforcing memory consistency

Processor p1, program order (exe order):
1. (3) load A
2. (1) load A

Processor p2, program order (exe order):
1. (2) store A

[Diagram: edges between p2's store and p1's loads, labeled raw and war]

• Two approaches
  • Snooping: Search per incoming invalidate
  • Insulated: Search per load address calculation

Load queue implementation

[Block diagram: address CAM and load meta-data RAM; signals labeled external address, store address, load address, store age, load age; blocks labeled squash determination, queue management, external request]

• # of write ports = load address calc width
• # of read ports = load+store address calc width (+ 1)
• Current generation designs (32-48 entries, 2 write ports, 2 (3) read ports)

Load queue scaling

• Larger instruction window => larger load queue
  • Increases access latency
  • Increases energy consumption
• Wider issue width => more read/write ports
  • Also increases latency and energy

Related work: MICRO 2003

• Park et al., Purdue
  • Extra structure dedicated to enforcing memory consistency
  • Increase capacity through segmentation
• Sethumadhavan et al., UT-Austin
  • Add set of filters summarizing contents of load queue

Keep it simple…

• Throw more hardware at the problem?
  • Need to design/implement/verify
  • Execution core is already complicated
• Load queue checks for rare errors
  • Why not move error checking away from exe?

Value-based Consistency

[Pipeline diagram with stages IF1, IF2, …, D, R, Q, S, EX, WB, plus REP (replay), CMP (compare) and C (commit)]

• Replay: access the cache a second time - cheaply!
  • Almost always cache hit
  • Reuse address calculation and translation
  • Share cache port used by stores in commit stage
• Compare: compares new value to original value
  • Squash if the values differ
• This is value prediction!
  • Predict: access cache prematurely
  • Execute: as usual
  • Verify: replay load, compare value, recover if necessary

Rules of replay

1. All prior stores must have written data to the cache
   • No store-to-load forwarding
2. Loads must replay in program order
3. If a load is squashed, it should not be replayed a second time
   • Ensures forward progress
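
The three rules can be illustrated with a small commit-stage model. This is a sketch under simplifying assumptions (a dictionary stands in for the cache, untouched memory reads as 0, and a squashed load simply adopts the replayed value); the function name `commit_in_order` and the record layout are hypothetical.

```python
def commit_in_order(instructions, cache):
    """instructions: program-order list of dicts with keys
         'op'    : 'ld' or 'st'
         'addr'  : memory address
         'value' : for 'st', the data to write; for 'ld', the value the
                   load obtained when it executed prematurely
    cache: dict mapping address -> value (committed memory state).
    Returns the program-order indices of the loads that were squashed."""
    squashed = []
    for index, insn in enumerate(instructions):
        if insn['op'] == 'st':
            # Rule 1: a store writes the cache at commit, so every prior
            # store is visible before a later load replays.
            cache[insn['addr']] = insn['value']
        else:
            # Rule 2: loads replay at commit, in program order.
            replayed = cache.get(insn['addr'], 0)  # assumption: default 0
            if replayed != insn['value']:
                # Compare failed: squash. The load takes the replayed
                # value and is not replayed again (rule 3).
                squashed.append(index)
                insn['value'] = replayed
    return squashed


if __name__ == "__main__":
    A = 0x100
    program = [
        {'op': 'st', 'addr': A, 'value': 5},
        {'op': 'st', 'addr': A, 'value': 9},   # address unknown at issue
        {'op': 'ld', 'addr': A, 'value': 5},   # hoisted, saw the old value
    ]
    print(commit_in_order(program, {}))
```

Note that if the second store had written 5 again, the comparison would succeed and no squash would occur even though the load was technically reordered; an address-based load queue cannot exploit that.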

Replay reduction

• Replay costs
  • Consumes cache bandwidth (and power)
  • Increases reorder buffer occupancy
• Can we avoid these penalties?
  • Infer correctness of certain operations
  • Four replay filters
• These are used to avoid checking our value prediction when in fact no value prediction occurred (loaded value is known to be correct)
  • Similar to "constant prediction" in initial work

No-Reorder filter

• Avoid replay if load isn't reordered wrt other memory operations
• Can we do better?

Enforcing single-thread RAW dependencies

• No-Unresolved Store Address Filter
  • Load instruction i is replayed if there are prior stores with unresolved addresses when i issues
• Works for intra-processor RAW dependences
• Doesn't enforce memory consistency

Enforcing MP consistency

• No-Recent-Miss Filter
  • Avoid replay if there have been no cache line fills (to any address) while load in instruction window
• No-Recent-Snoop Filter
  • Avoid replay if there have been no external invalidates (to any address) while load in instruction window

Constraint graph

• Defined for sequential consistency by Landin et al., ISCA-18
• Directed-graph represents a multithreaded execution
  • Nodes represent dynamic instruction instances
  • Edges represent their transitive orders (program order, RAW, WAW, WAR).
• If the constraint graph is acyclic, then the execution is correct
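
The acyclicity test can be sketched as a depth-first search over the edge list. The code below is a generic cycle check, not the authors' tooling; the function name `has_cycle` and the two example executions (two stores on one processor, two loads on another) are assumptions chosen for illustration.

```python
def has_cycle(edges):
    """edges: iterable of (src, dst) pairs. Returns True if the directed
    graph contains a cycle (iterative three-color depth-first search)."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])
    WHITE, GREY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}
    for root in graph:
        if color[root] != WHITE:
            continue
        color[root] = GREY
        stack = [(root, iter(graph[root]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if color[nxt] == GREY:
                    return True          # back edge: cycle found
                if color[nxt] == WHITE:
                    color[nxt] = GREY
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                color[node] = BLACK      # all successors explored
                stack.pop()
    return False


# Assumed example: P1 runs ST A then ST B; P2 runs LD B then LD A.
INCORRECT = [
    ("P1:ST A", "P1:ST B"),   # program order
    ("P1:ST B", "P2:LD B"),   # RAW: LD B observes the new B
    ("P2:LD B", "P2:LD A"),   # program order
    ("P2:LD A", "P1:ST A"),   # WAR: LD A observed the old A
]
CORRECT = [
    ("P1:ST A", "P1:ST B"),   # program order
    ("P1:ST B", "P2:LD B"),   # RAW
    ("P2:LD B", "P2:LD A"),   # program order
    ("P1:ST A", "P2:LD A"),   # RAW: LD A observes the new A
]
```

Under these assumptions `has_cycle(INCORRECT)` is true and `has_cycle(CORRECT)` is false: seeing the new B but the old A cannot be explained by any single total order of the four operations.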

Constraint graph example - SC

[Diagram: constraint graph over Proc 1 and Proc 2 with nodes ST A, ST B, LD A, LD B, numbered 1 to 4; edges labeled Program order (two), WAR, and RAW]

Cycle indicates that execution is incorrect

Anatomy of a cycle

[Diagram: the same constraint graph over Proc 1 and Proc 2 (nodes ST A, ST B, LD A, LD B; edges labeled Program order, WAR, RAW), annotated with "Incoming invalidate" and "Cache miss"]

Enforcing MP consistency

• No-Recent-Miss Filter
  • Avoid replay if there have been no cache line fills (to any address) while load in instruction window
• No-Recent-Snoop Filter
  • Avoid replay if there have been no external invalidates (to any address) while load in instruction window

Filter Summary

[Diagram: replay policies arranged along a scale running from "Conservative" to "Aggressive"]

• Replay all committed loads
• No-Reorder Filter
• No-Unresolved Store/No-Recent-Snoop Filter
• No-Unresolved Store/No-Recent-Miss Filter
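
One way to read the four policies is as a per-load decision function. The sketch below is an interpretation of the slide text, assuming each load records a few flags while it sits in the instruction window; the `LoadInfo` fields, the policy strings, and `must_replay` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class LoadInfo:
    issued_out_of_order: bool     # reordered w.r.t. other memory operations
    unresolved_prior_store: bool  # a prior store address was unknown at issue
    fill_while_in_window: bool    # any cache line fill while load in window
    snoop_while_in_window: bool   # any external invalidate while in window


def must_replay(load, policy):
    """Return True if the load has to be replayed at commit."""
    if policy == "replay-all":
        return True
    if policy == "no-reorder":
        return load.issued_out_of_order
    if policy == "no-unresolved-store/no-recent-miss":
        # Single-thread RAW check plus the multiprocessor miss check.
        return load.unresolved_prior_store or load.fill_while_in_window
    if policy == "no-unresolved-store/no-recent-snoop":
        # Single-thread RAW check plus the multiprocessor snoop check.
        return load.unresolved_prior_store or load.snoop_while_in_window
    raise ValueError(f"unknown policy: {policy}")
```

For a load that issued out of order but saw no unresolved prior store, no fill, and no invalidate, only the first two policies request a replay, which is where the combined filters recover bandwidth.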

Outline

• Some History
• Industry Trends
• Value-Aware Microarchitecture
• Case study: Memory Consistency [Cain, ISCA]
  • Conventional load queue microarchitecture
  • Value-based memory ordering
  • Replay-reduction heuristics
  • Performance evaluation
• Conclusions

Base machine model

PHARMsim: PowerPC execute-at-execute simulator with OOO cores and aggressive split-transaction snooping coherence protocol

Out-of-order execution core:
  • 5 GHz, 15-stage, 8-wide pipeline
  • 256 entry reorder buffer, 128 entry load/store queue
  • 32 entry issue queue

Functional units (latency):
  • 8 Int ALUs (1), 3 Int MULT/DIV (3/12), 4 FP ALUs (4), 4 FP MULT/DIV (4,4)
  • 4 L1 Dcache load ports in OoO window
  • 1 L1 Dcache load/store port at commit

Front-end:
  • Combined bimodal (16k entry)/gshare (16k entry) branch predictor with 16k entry selection table, 64 entry RAS, 8k entry 4-way BTB

Memory system (latency):
  • 32k DM L1 icache (1), 32k DM L1 dcache (1)
  • 256K 8-way L2 (7), 8MB 8-way L3 (15), 64 byte cache lines
  • Memory (400 cycle/100 ns best-case latency, 10 GB/S BW)
  • Stride-based prefetcher modeled after Power4

% L1 DCache bandwidth increase

[Chart: bars for (a) replay all, (b) no-reorder filter, (c) no-recent-miss filter, (d) no-recent-snoop filter; benchmark groups: SPECint2000, SPECfp2000, commercial, multiprocessor]

On average, 3.4% bandwidth overhead using no-recent-snoop filter

Value-based replay performance (relative to constrained load queue)

[Chart: benchmark groups SPECint2000, SPECfp2000, commercial, multiprocessor]

Value-based replay 8% faster on avg than baseline using 16-entry ld queue

Does value locality help?

• Not much…
• Value locality does avoid memory ordering violations
  • 59% single-thread violations avoided
  • 95% consistency violations avoided
• But these violations rarely occur
  • ~1 single-thread violation per 100 million instr
  • 4 consistency violations per 10,000 instr

What About Power?

• Simple power model:

\[
\text{Energy} = \#\text{replays} \times \left( E_{\text{per cache access}} + E_{\text{per word comparison}} \right) + \text{replay overhead} - \left( E_{\text{per ldq search}} \times \#\text{ldq searches} \right)
\]

• Empirically: 0.02 replay loads per committed instruction
• If load queue CAM energy/insn > 0.02 × energy expenditure of a cache access and comparison: value-based implementation saves power!
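
The power model above is easy to evaluate numerically. The sketch below transcribes the formula and the break-even condition; the function names and every energy figure used to exercise them are made up for illustration, and only the 0.02 replays-per-instruction default comes from the slide.

```python
def replay_energy_delta(replays, e_cache_access, e_word_compare,
                        replay_overhead, ldq_searches, e_ldq_search):
    """Energy spent on value-based replay minus the energy of the load
    queue searches it eliminates. A negative result means replay saves
    energy. All arguments must use the same (arbitrary) energy unit."""
    return (replays * (e_cache_access + e_word_compare)
            + replay_overhead
            - e_ldq_search * ldq_searches)


def break_even_cam_energy_per_insn(e_cache_access, e_word_compare,
                                   replays_per_insn=0.02):
    """Load queue CAM energy per instruction above which the value-based
    scheme wins, ignoring the replay overhead term."""
    return replays_per_insn * (e_cache_access + e_word_compare)


if __name__ == "__main__":
    # Made-up numbers: 1000 instructions, 20 replays, 1000 ldq searches.
    print(replay_energy_delta(replays=20, e_cache_access=1.0,
                              e_word_compare=0.25, replay_overhead=0.0,
                              ldq_searches=1000, e_ldq_search=0.0625))
    print(break_even_cam_energy_per_insn(1.0, 0.25))
```

Because replays are rare, even a cache access that costs many times more than one CAM search can come out ahead once it is multiplied by 0.02.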

Value-based replay Pros/Cons

+ Eliminates associative lookup hardware
  • Load queue becomes simple FIFO
  • Negligible IPC or L1D bandwidth impact
+ Can be used to fix value prediction
  • Enforces dependence order consistency constraint [MICRO '01]
- Requires additional pipeline stages
- Requires additional cache datapath for loads

Conclusions

• Value prediction
  • Continues to generate lots of academic interest
  • Little industry uptake so far
    • Historical trends (narrow deep pipelines) minimized benefit
    • Sea-change underway on this front
  • Value prediction will be revisited in quest for ILP
    • Power consumption is key!
• Value-Aware Microarchitecture
  • Multiple fertile areas of research
  • Some has found its way into products
• Are we done yet? No!
• Questions?

Backups

Caveat: Memory Dependence Prediction

• Some predictors train using the conflicting store (e.g. store-set predictor)
• Replay mechanism is unable to pinpoint conflicting store
• Fair comparison:
  • Baseline machine: store-set predictor w/ 4k entry SSIT and 128 entry LFST
  • Experimental machine: Simple 21264-style dependence predictor w/ 4k entry history table

Load queue search energy

[Chart: access energy (nJ), 0 to 3.5, versus number of entries (16, 32, 64, 128, 256, 512); series: rd6wr6, rd4wr4, rd2wr2]

Based on 0.09 micron process technology using Cacti v. 3.2

Load queue search latency

[Chart: access latency (ns), 0 to 1.4, versus number of entries (16, 32, 64, 128, 256, 512); series: rd6wr6, rd4wr4, rd2wr2]

Based on 0.09 micron process technology using Cacti v. 3.2

Benchmarks

• MP (16-way)
  • Commercial workloads (SPECweb, TPC-H)
  • SPLASH2 scientific application (ocean)
  • Error bars signify 95% statistical confidence
• UP
  • 3 from SPECfp2000
    • Selected due to high reorder buffer utilization
    • apsi, art, wupwise
  • 3 commercial
    • SPECjbb2000, TPC-B, TPC-H
  • A few from SPECint2000

Life cycle of a load

[Animated diagram: LD ? and ST ? instructions in the OoO execution window and the load queue; an LD A followed by a matching ST A ends in "Blam!"]

Performance relative to unconstrained load queue

Good news: Replay w/ no-recent-snoop filter only 1% slower on average

Reorder-Buffer Utilization

Why focus on load queue?

• Load queue has different constraints than store queue
  • More loads than stores (30% vs 14% dynamic instructions)
  • Load queue searched more frequently (consuming more power)
  • Store-forwarding logic performance critical
• Many non-scalable structures in OoO processor
  • Scheduler
  • Physical register file
  • Register map

Prior work: formal memory model representations

• Local, WRT, global "performance" of memory ops (Dubois et al., ISCA-13)
• Acyclic graph representation (Landin et al., ISCA-18)
• Modeling memory operation as a series of sub-operations (Collier, RAPA)
• Acyclic graph + sub-operations (Adve, thesis)
• Initiation event, for modeling early store-to-load forwarding (Gharachorloo, thesis)

Some History

• "Classical" value prediction
  • Independently invented by 4 groups in 1995-1996
    1. AMD (Nexgen): L. Widigen and E. Sowadsky, patent filed March 1996, inv. March 1995
    2. Technion: F. Gabbay and A. Mendelson, inv. sometime 1995, TR 11/96, US patent Sep 1997
    3. CMU: M. Lipasti, C. Wilkerson, J. Shen, inv. Oct. 1995, ASPLOS paper submitted March, 1996
    4. Wisconsin: Y. Sazeides, J. Smith, Summer 1996

From: [email protected] (Larry Widigen)
Received: by charlie (4.1) id AA00850; Wed, 14 Aug 96 10:33:12 PDT
Date: Wed, 14 Aug 96 10:33:12 PDT
Message-Id: <9608141733.AA00850@charlie>
To: [email protected]
Subject: www location of paper
Status: RO
X-Status:
X-Keywords:
X-UID: 1

I would like to review your forthcoming paper, "Value Locality and Load Value Prediction." Could you provide a www address where it resides? I am curious as to its contents since its title suggests that it may discuss an area where I have done some work.

Cordially,

Larry Widigen
Manager of Processor Development