Memory Dependence Prediction
Andreas Moshovos
www.ece.nwu.edu/~moshovos
Electrical and Computer Engineering Department
Northwestern University
ECE Dept., Purdue University, Sept. 9, 1999
Memory Dependence Prediction - A. Moshovos
Permission to use these slides is granted provided that a reference to their origin is included.
Exploiting Program Behavior
Building faster/better machines:
1. Faster circuits -> faster execution
2. Be “smarter” about program execution:
   exploit idiosyncrasies in program behavior
Example? Caches:
   idiosyncrasies in the access pattern
   Idea: use small/fast storage for recently accessed data.
What to do next? One possibility:
   identify/exploit other idiosyncrasies in typical program behavior
This might be the right time! Technology and existing knowledge.
Memory Dependences Are Quite Regular
• Identified a new form of regularity: memory dependence locality
1. Does a load/store have a dependence?
2. Which dependence does it have? (loadPC, storePC)
Opportunity to exploit this regularity
How to exploit? Need techniques
[Figure: over time, the same store-load pair keeps recurring; e.g., in
   for i: a[i] = a[i - 1] + 1
the store of a[i] feeds the load of a[i - 1] in the next iteration.]
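The recurrence in the loop above can be replayed with a small sketch (the PCs are hypothetical labels, not from the slides): every iteration after the first exhibits the same (storePC, loadPC) pair - memory dependence locality.

```python
# Sketch: memory dependence locality in `for i: a[i] = a[i - 1] + 1`.
# LOAD_PC/STORE_PC are hypothetical labels for the loop's load and store.
LOAD_PC, STORE_PC = 0x10, 0x14

def run(n):
    last_writer = {}          # address -> PC of the most recent store
    deps = []                 # observed (storePC, loadPC) dependences
    a = [0] * n
    for i in range(1, n):
        addr_ld, addr_st = i - 1, i
        if addr_ld in last_writer:                    # the load reads a value
            deps.append((last_writer[addr_ld], LOAD_PC))  # written earlier: RAW
        a[i] = a[i - 1] + 1                            # the actual computation
        last_writer[addr_st] = STORE_PC
    return deps

deps = run(8)
# Every iteration after the first repeats the same static dependence:
assert deps == [(STORE_PC, LOAD_PC)] * 6
```

The pair is stable across iterations even though the addresses change - which is exactly why past (loadPC, storePC) behavior predicts future behavior.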
Memory Dependence Prediction
Guess:
1. Does a load/store have a dependence?
2. Which dependence does it have?
How? Past behavior -> good indicator of future behavior
[Figure: 1. Observe past store-load dependences; 2. Predict the next one (the GOAL).]
Basis for three micro-architectural techniques:
1. Exploit load/store parallelism
2. Reduce memory latency
3. Memory bandwidth amplification
Memory is a major and ever-increasing bottleneck
Roadmap
• Exploiting Load/Store Parallelism
- Two directions for improving memory performance
- Load/Store Parallelism
- Speculation/Synchronization
- Performance
• Speculative Memory Cloaking and Bypassing
• Transient Value Cache
• Other Applications
• Future Directions: Slice Prediction & Slice-Processors
In Search of Higher Performance #1
1. Higher performance: memory responds faster
[Figure: timeline of an instruction sequence - a load sends its address to memory, memory responds with the value, and the dependent computation starts. Minimize the memory response time.]
#1. Making Memory Respond Faster
Ideally: memory is large and fast
Can't have it! Technology/cost trade-off
Solution: memory hierarchy
[Figure: L1 -> L2 -> Main Memory; lower levels are larger, slower, cheaper]
OK, we did the best we could, but...
...memory is still not that fast
...and it is getting (relatively) slower
Is this the end?
In Search of Higher Performance #2
2. Higher performance: send the load request as far in advance as possible
[Figure: timelines of the original and modified sequences - in the modified sequence the load is issued well before its old position, maximizing the time between sending the load and the start of the dependent computation.]
But will the program still run correctly?
A. Can we ever move loads up?  B. How do we do it?
A. Can We Ever Move Loads Up?
An instruction A and a load can execute in any order
if the load does not use a value produced by A.
Parallelism (any execution order is valid):
   r1 = r2 + 10
   load ..., [r3 + 5]
Dependence (order must be preserved):
   r1 = r2 + 10
   load ..., [r1 + 5]
B. How To Move Loads Up?
Instruction-level parallel processors:
Step #1: grab a chunk of instructions (the instruction window)
Step #2: find the dependences (inspect names)
Step #3: execute - loads issue early, overlapping their memory latency with other work
Use parallelism to tolerate memory latency
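Step #2 - inspecting names - is straightforward for registers; a minimal sketch, with a hypothetical (dest, sources) instruction encoding:

```python
# Sketch: finding register dependences by inspecting names (Step #2).
# An instruction is (dest, sources); dest None models an instruction
# that produces no register (e.g., a store).
def depends(producer, consumer):
    dest, _ = producer
    _, srcs = consumer
    return dest is not None and dest in srcs

add = ("r1", ("r2",))              # r1 = r2 + 10
load_r3 = (None, ("r3",))          # load ..., [r3 + 5]  -> independent
load_r1 = (None, ("r1",))          # load ..., [r1 + 5]  -> dependent

assert not depends(add, load_r3)   # any execution order is valid
assert depends(add, load_r1)       # the load must wait for r1
```

Register names are known at decode, so this check is easy; the next slide shows why memory addresses are not so cooperative.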
Moving Loads Past Stores
Use of addresses hinders parallelism
[Figure: a store precedes a load - dependence? Until the store's address calculation completes, the load cannot tell. Timeline: only after "store addr" does the load execute.]
➔ Limits our effort to tolerate memory latency
Naive Memory Dependence Speculation
• Don't give up, be optimistic: guess that no dependences exist
• State of the art in modern processors
[Timeline: guess no - the load executes before the store's address is known; dependence? If yes, the load re-executes and we pay a penalty.]
Need to balance: gain vs. penalty
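The gain-vs-penalty balance can be sketched with illustrative cycle counts (assumed numbers, not measurements from the slides):

```python
# Sketch: the gain-vs-penalty balance of naive dependence speculation.
# Cycle counts are illustrative assumptions.
LATENCY_SAVED = 10     # cycles gained when "no dependence" is right
MISPEC_PENALTY = 25    # cycles lost re-executing when it is wrong

def net_gain(loads, mispec_rate):
    wrong = int(loads * mispec_rate)
    right = loads - wrong
    return right * LATENCY_SAVED - wrong * MISPEC_PENALTY

assert net_gain(1000, 0.01) > 0    # rare mispeculations: speculation wins
assert net_gain(1000, 0.50) < 0    # frequent mispeculations: it loses
```

With today's short load latencies the gain term dominates; as the latency to cover grows, so does the cost of each wrong guess.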
Dependence Speculation and Performance
Speculation may affect performance either way
[Figure: program order A, B, C, store, D, load.
 No dependence: without speculation the load waits for the store; with speculation the load issues early, effectively for free.
 Dependence: speculating costs a penalty - (a) work thrown away, (b) opportunity cost.]
Balance: gain vs. penalty
How About Future Systems?
Today - common case: guessing naively works well
Soon • memory will be (relatively) slower
➔ Need to move loads further up
   Guessing naively: the penalty becomes significant
Loads - ideal behavior:
   No dependence: execute at will
   Dependence: synchronize w/ the store
Future systems: wish dependences were known
Speculation/Synchronization
Learn from mistakes:
1. Start with naive speculation
2. Remember mispeculations and predict that next time the same will happen
   - Q1: Which loads should wait?
   - Q2: How long?
Best performing scheme:
   - A1: Loads w/ dependences
   - A2: Synchronize w/ the appropriate store
How It Works
MDPT - Mem. Dependence Prediction Table: predicts loads w/ dependences; entries (LDPC, STPC, PRED)
MDST - Mem. Dependence Synchronization Table: enforces synchronization; entries (LDPC, STPC, V, F/E)
[Figure, three steps:
 1. A load mispeculates ➔ allocate an MDPT entry for its (LDPC, STPC) pair
 2. The load is seen again - execute? No! Wait: the MDPT predicts the dependence and an MDST entry is allocated
 3. The matching store executes - synchronize ➔ the waiting load resumes]
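A minimal software model of the mechanism above - assuming one MDPT/MDST entry per (store PC, load PC) pair and eliding the valid-bit and multi-instance bookkeeping of the real tables:

```python
# Sketch: MDPT predicts which loads depend on which stores; the MDST
# holds a full/empty bit per predicted pair. Simplified model: entries
# are never recycled, one dynamic instance per static pair.
class SpecSync:
    def __init__(self):
        self.mdpt = set()    # predicted (store_pc, load_pc) pairs
        self.mdst = {}       # (store_pc, load_pc) -> full/empty bit

    def mispeculation(self, store_pc, load_pc):
        self.mdpt.add((store_pc, load_pc))        # 1. learn from the mistake

    def load_may_execute(self, load_pc):
        for st, ld in self.mdpt:
            if ld == load_pc and not self.mdst.get((st, ld), False):
                self.mdst[(st, ld)] = False       # 2. predicted: wait (empty)
                return False
        return True

    def store_executes(self, store_pc):
        for key in self.mdst:
            if key[0] == store_pc:
                self.mdst[key] = True             # 3. signal: set full

s = SpecSync()
assert s.load_may_execute(0x20)       # no history yet: speculate freely
s.mispeculation(0x10, 0x20)           # that guess was wrong once
assert not s.load_may_execute(0x20)   # now predicted dependent: wait
s.store_executes(0x10)                # the store signals through the MDST
assert s.load_may_execute(0x20)       # synchronized: the load may proceed
```

The key property: the load waits only for the specific store it is predicted to depend on, not for all earlier stores.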
Memory Dependence Speculation/Synchronization
A. Predict dependence:
   1. Load predicted ➔ allocate sync. bit
   2. Store predicted ➔ load waits on sync. bit
   3. Store signals ➔ load executes
B. Predict no dependence:
   1. Load executes immediately
   2. Store verifies
Correct prediction: loads wait only as long as necessary
Incorrect: same as naive speculation, or a needless delay
Speedup over Naive Speculation
[Chart: speedup (0%-80%, higher is better) over naive speculation for the SPEC95 benchmarks (099-147 integer, 101-146 floating point); bars: ORACLE, 2 STORES/LOAD, CENTRALIZED, SYNONYMS; configuration: 256 entries, 4K MDPT, 64 MDST.]
Evaluation - Superscalar
1. Can loads inspect store addresses before obtaining a value?
   Requires an address-based load/store scheduler (ABS)
2. Is speculation used?
• Speculation is a win (naive over no speculation):
   No ABS: ~29% int, ~113% fp - mispeculations are an issue
   ABS: 4.6% int, 5.3% fp (0-cycle scheduling latency) - virtually no mispeculations
• Next:
   1. Compare “Oracle + no ABS” with “naive + ABS”
   2. Speculation/Synchronization on “no ABS” comes close to Oracle
Speculation/Synchronization as an alternative to ABS
“Oracle + no ABS” vs. “Naive + ABS”
Base case: “ABS + no speculation”
[Chart: speedup (-20% to 40%, higher is better) for the SPEC95 benchmarks; bars: NAIVE 0-CYCLES, NAIVE 1-CYCLE, NAIVE 2-CYCLES, ORACLE.]
“Oracle + no ABS” is close to “ABS + naive”
➔ Potential for Spec./Sync.
Speculation/Synchronization
[Chart: slowdown (0%-3%, lower is better) of Spec/Sync relative to “Oracle + no ABS” for the SPEC95 benchmarks.]
On average, within 0.93% (int) and 1.001% (fp) of the oracle
Summary
• Improving Processor/Memory Performance:
move loads up in time
• Addresses Hinder Parallelism
• Memory Dependence Speculation/Synchronization
Predict dependences and move loads up in time
Classify loads into two categories:
1. No Dependence: Execute at will
2. Dependence: Synchronize with specific store
Related Work
• In the context of Dynamically Scheduled Processors
1. J. H. Hesson, J. LeBlanc, S. J. Ciavaglia, IBM Patent,‘97
Store Barrier
2. G. Chrysos, J. Emer (DEC), Superscalar Env., ISCA ‘98
This work: UW Tech. Report, March ‘96, ISCA ‘97
A. Moshovos, S. Breach, T. N. Vijaykumar, G. Sohi
• In the context of Statically Scheduled Processors
1. Static Disambiguation
2. Speculation in Software
• Dynamic Compilation
3. Software/Hardware Hybrids
Roadmap
• Exploiting Load/Store Parallelism
• Speculative Memory Cloaking and Bypassing
- Memory as a communication mechanism
- Speculative Memory Cloaking
- Other uses of Memory
- Evaluation
• Transient Value Cache
• Other Applications
• Future Directions: Slice Prediction & Slice-Processors
Memory as a Communication Mechanism
[Figure: a store and a load each perform an address calculation; the store sends its value to memory at that address, and the load reads it back from the same address.]
Memory can be an inter-operation communication mechanism
Communication specification: implicit, via addresses
Is this the best way in terms of performance?
Memory Communication Specification - Limitations
Implicit (via addresses):
   1. Calculate the address  2. Establish the dependence
   [Timeline: store value / store addr; an intervening store2 addr; load addr / load - the dependence cannot be established until the addresses are calculated, delaying the load.]
Explicit (store-load direct link): no delays
Predicting Memory Dependences
1. Build dependence history: Dependence Detection Table
   Record stores as (store PC, address); loads probe with (load PC, address)
   ➔ detected dependence (store PC, load PC)
2. Use history to predict forthcoming dependences
   Assign synonyms to detected dependences; use the synonym to locate the value
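A sketch of the detection step, assuming an unbounded table (the real Dependence Detection Table is finite and tagged):

```python
# Sketch: the Dependence Detection Table records (store PC, address);
# a later load to the same address yields a (store PC, load PC) pair,
# which is assigned a synonym naming the communication link.
class DetectionTable:
    def __init__(self):
        self.last_store = {}   # address -> store PC
        self.synonyms = {}     # (store PC, load PC) -> synonym id

    def store(self, store_pc, addr):
        self.last_store[addr] = store_pc

    def load(self, load_pc, addr):
        store_pc = self.last_store.get(addr)
        if store_pc is None:
            return None                        # no dependence detected
        pair = (store_pc, load_pc)
        self.synonyms.setdefault(pair, len(self.synonyms))
        return self.synonyms[pair]

ddt = DetectionTable()
ddt.store(0x40, addr=1000)
assert ddt.load(0x80, addr=1000) == 0   # dependence (0x40, 0x80) detected
ddt.store(0x40, addr=2000)              # same static pair, new address...
assert ddt.load(0x80, addr=2000) == 0   # ...maps to the same synonym
```

The synonym names the (store PC, load PC) link itself, so the next time the pair is seen, the value can be located without computing any address.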
Speculative Memory Cloaking
Dynamically & transparently convert implicit communication into explicit
• Dependence prediction ➔ direct store-load (or load-load) links
• Speculative - has to be eventually verified
[Figure: (1) the store PC indexes the dependence predictor for a synonym; (2) the store deposits its value in the Synonym File (value + full/empty bit); (3) the load PC yields the same synonym and reads the value speculatively; (4) the traditional address-based access through the memory hierarchy verifies it.]
Speculative Memory Bypassing
Observe: the store and load are used just to pass a value
[Figure: DEF R1 -> store R1 -> load R2 -> USE R2 becomes DEF R1 -> USE R2; via the synonym, USE R2 picks up DEF R1's rename tag (TAG1) directly.]
Takes the load-store pair (or loads) off the communication path
• Extends over multiple store-load dependences
• DEF and USE must co-exist in the instruction window
Speculative Memory Cloaking/Bypassing
Address-based memory as: storage vs. interface
Ask: 1. What is memory used for?  2. How do addresses impact the action?
[Figure: DEF RX -> STORE RX -> LOAD RY -> USE RY; LOAD RZ -> USE RZ.
 Inter-operation communication: cloaking links the store to its dependent load; bypassing links DEF directly to USE.
 Data-sharing: cloaking also links loads to later loads of the same data.]
Dynamically create direct links between producers and consumers
Cloaking Coverage
[Chart: cloaking coverage (0%-100%) per SPEC95 benchmark, broken down by DATA, STACK, and HEAP.]
Most dependences are correctly predicted
Cloaking - Mispeculation Rates
• Relatively low mispeculation rates
• Causes: unstable dependences - recursive functions
[Chart: mispeculation rates per SPEC95 benchmark, broken down by DATA, STACK, and HEAP; scales 0%-6% (int) and 0.0%-1.0% (fp).]
Performance - Selective Invalidation
[Chart: speedup (0%-14%) per SPEC95 benchmark; bars: Oracle and Selective invalidation.]
Summary
• Address-based memory ➔ unnecessary delays
• Used Memory Dependence Prediction to:
1. Improve Store-Load and Load-Load Communication
address-based ➔ dependence-based
2. Take Store-Load off the Communication path
Treat the program as a specification of what to do,
not also of how to do it.
Increasing Memory Bandwidth
• Parallelism ➔ more bandwidth ➔ more L1 ports
[Figure: CPU with a single-ported L1; CPU with a multi-ported L1; CPU with a small L0 in front of the L1.]
• A very small cache is easier to multi-port
• But it increases the latency for all accesses that miss in it
Transient Value Cache
Observe:
A. Often a RAW or RAR dependence exists with a recent store or load
B. Many recent stores are killed (overwritten before ever reaching L1)
C. A & B: the dependence status is predictable
+ A small cache can service these accesses
- Latency for other loads would increase
Predicted dependence? Yes - access the TVC in series (avoid the L1 access); No - access it in parallel (avoid the extra latency)
➔ L1 D-cache bandwidth/port requirements are reduced
• Supports multiple simultaneous load/store requests
Few loads observe a latency increase
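A toy model of the TVC steering, with an assumed 8-entry capacity and a plain set standing in for the dependence predictor:

```python
# Sketch: a Transient Value Cache consulted only for loads predicted to
# have a RAW/RAR dependence with a recent store or load; other loads go
# straight to L1 and see no extra latency. Illustrative model.
from collections import OrderedDict

TVC_SIZE = 8
tvc = OrderedDict()            # tiny cache: address -> value
predicted = set()              # load PCs predicted to hit the TVC

def store(addr, value):
    tvc[addr] = value
    tvc.move_to_end(addr)
    if len(tvc) > TVC_SIZE:
        tvc.popitem(last=False)            # evict the oldest entry

def load(load_pc, addr, l1_read):
    if load_pc in predicted and addr in tvc:
        return tvc[addr], "tvc"            # serviced by the TVC: L1 port free
    return l1_read(addr), "l1"             # no serial TVC latency added

l1 = {100: 7, 200: 9}
store(100, 7)
predicted.add(0x50)
assert load(0x50, 100, l1.get) == (7, "tvc")   # recent RAW: TVC services it
assert load(0x60, 200, l1.get) == (9, "l1")    # unpredicted: straight to L1
```

Because only predicted-dependent loads take the serial TVC path, a wrong "yes" prediction costs latency but a wrong "no" merely wastes an L1 port - the asymmetry the prediction exploits.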
TVC vs. L0 - Hit Rates
[Chart: hit rates (0%-100%) of the TVC vs. an L0 cache, per SPEC95 benchmark.]
TVC vs. L0 - “Miss” Rates
[Chart: “miss” rates of the TVC vs. an L0 cache, per SPEC95 benchmark; scales 0%-70% and 0%-8%.]
Other Applications
• Transient Value Cache: bandwidth amplification
• Prefetching of recursive data structures: load-to-load-EAC dependences
• Virtual function call precomputation: surgical extension of the previous one
Current Research
• Cloaking for multiprocessors
   - Prediction in the coherence protocol
• Slice Prediction & Slice Processors
   - Use the program to predict its actions:
     identify performance-critical computation slices,
     run them speculatively in advance,
     annotate the program with predictability information
• Embedded processor optimizations
• Multi-media application evolution
• Self-micro-reconfigurable processors
   - Run-time extraction of convertible slices
   - Interfaces with OOO
Future Directions
• Prediction is becoming prevalent:
   - Branch prediction
   - Value prediction
   - Dependence prediction
   - Coherence prediction
Thus far: perform the computation as specified, but as fast as possible
Moving toward: a predict-and-verify model of computation
Q1: What should a processor look like then?
Q2: How do we predict accurately?
Current Predictors
• Treat the program as a black box
• Observe outcomes
• Guess: soon, predictors of this kind will be very accurate - say 90% or more
• What about the remaining 10%?
   Amdahl's law -> it may dominate performance
• Slice prediction: execute SCOUT threads in parallel
   1. Predict the performance-critical computation slice
   2. Execute it as far in advance as possible
Current Predictors
• Treat the program as a black box
• Build an approximate representation of the program's function
[Figure: the program as an unknown function f(); “INPUT” (PC1, taken), (PC2, not-taken) -> “OUTPUT” (PC1, taken).]
• Guess: soon, predictors of this kind will be very accurate - say 90% or more
• What about the remaining 10%?
   Amdahl's law -> it may dominate performance
Slice Processors
• But the program itself is a representation of f()!
• USE THE PROGRAM TO PREDICT ITS ACTIONS
   while (list)
      if (list->data == KEY)
         foo0(list)
      else
         foo1(list)
      list = list->next
[Figure: over time, predicted slices - list = list->next; list->data == KEY; then-branch / else-branch - are extracted and run in advance.]
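The pointer-chasing slice above can be modeled as a scout thread that runs only `list = list->next` ahead of the main loop; prefetching is simulated here with a set of touched nodes (an illustrative model, not the proposed hardware):

```python
# Sketch: a "scout" slice extracted from the loop above - just the
# pointer chase `list = list->next` - run ahead of the main loop to
# warm the cache. Prefetch is simulated with a set of node ids.
class Node:
    def __init__(self, data, nxt=None):
        self.data, self.next = data, nxt

def scout_slice(head, prefetched):
    node = head
    while node:                   # the performance-critical slice only:
        prefetched.add(id(node))  # touch the node (simulated prefetch)
        node = node.next          # list = list->next

def main_loop(head, prefetched, key):
    hits = misses = 0
    node = head
    while node:
        if id(node) in prefetched:
            hits += 1             # the scout got here first: cache hit
        else:
            misses += 1
        _ = node.data == key      # if (list->data == KEY) ...
        node = node.next
    return hits, misses

head = None
for d in range(5):
    head = Node(d, head)
warmed = set()
scout_slice(head, warmed)         # runs as far in advance as possible
assert main_loop(head, warmed, key=3) == (5, 0)
```

The slice omits the branches and calls of the full loop, so it runs far ahead of the main thread - turning the hard-to-predict pointer chase into a sequence of hits.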