IBM T.J. Watson Research Center
RACES’12 Oct 21, 2012 © 2012 IBM Corporation
Edge Chasing Delayed Consistency: Pushing the Limits of Weak Memory Models
Harold “Trey” Cain, IBM T.J. Watson Research Center
Prof. Mikko H. Lipasti, University of Wisconsin
Gotta go back in time!
Part of Ph.D. Dissertation
– Never submitted for publication, until now.
– Looked particularly relevant when I saw the RACES CFP.
Journey back in time to the year 2004, when…
– … Mark Zuckerberg launched Facebook
– … Janet Jackson suffered a “wardrobe malfunction” during the Superbowl halftime show
– … an incumbent president was being challenged by a Massachusetts politician
88mph here we come!
Edge Chasing Delayed Consistency: Pushing the Limits of Weak Ordering
From the RACES website:
– “an approach towards scalability that reduces synchronization requirements drastically, possibly to the point of discarding them altogether.”
A hardware developer’s perspective:
– Constraints of legacy code
• What if we want to apply this principle, but have no control over the applications that are running on a system?
– Can one build a coherence protocol that avoids synchronizing cores as much as possible?
• For example, by allowing each core to use stale versions of cache lines as long as possible
• While maintaining architectural correctness; i.e. we will not break existing code
• If we do that, what will happen?
Cache-Coherent Shared-memory multiprocessors
Are ubiquitous
Coherence misses are a major source of performance loss for shared memory applications
[Figure panels: “10 years ago” and “Today”]
16MB L3 Cache Misses per 1000 inst
Edge-Chasing Delayed Consistency (ECDC)
A new hardware implementation of POWER weak ordering
– Not a new consistency model
Allows a cache line to be non-speculatively read after being invalidated.
Based on necessary conditions
– Processor must fetch new data only if causally dependent on it.
Constraint graph
Introduced for SC by Landin et al., ISCA-18
A directed graph that represents a multithreaded execution
– Nodes represent dynamic instances of instructions
– Edges represent their transitive orders (program order, RAW, WAW, WAR).
If the constraint graph is acyclic, then the execution is correct
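The acyclicity test can be sketched in a few lines. This is a minimal illustration, assuming the constraint graph is supplied as a list of (source, destination) edges; the function name `has_cycle`, the node labels, and the Dekker-style execution are hypothetical and not taken from the slides.

```python
from collections import defaultdict


def has_cycle(edges):
    """Return True if the directed graph given as (src, dst) pairs contains a cycle."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)
        graph[dst]  # touch dst so that every node becomes a key

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the DFS stack, finished
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GREY
        for nxt in graph[node]:
            if color[nxt] == GREY:  # back edge: nxt is an ancestor of node
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in list(graph))


# Hypothetical Dekker-style execution in which both loads return the old value.
program_order = [("P1.ST A", "P1.LD B"), ("P2.ST B", "P2.LD A")]
write_after_read = [("P1.LD B", "P2.ST B"), ("P2.LD A", "P1.ST A")]

sc_edges = program_order + write_after_read  # SC keeps store-to-load order
pc_edges = write_after_read                  # PC drops store-to-load order

print(has_cycle(sc_edges))  # True: the execution is illegal under SC
print(has_cycle(pc_edges))  # False: the same execution is legal under PC
```

The depth-first search is only one way to test for cycles; it is used here because it is short, not because the slides prescribe it.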
Constraint graph example - WO
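[Figure: constraint graph for a weakly ordered execution on Proc 1 and Proc 2. Nodes: LD A, ST B, LD B, ST A, and two memory barriers (MB). Edge labels: ST->MB order, LD->MB order, MB->ST order, MB->LD order, write-after-read dependence order, read-after-write dependence order. Edges are numbered 1 to 5.]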
Observation: An aggressive coherence protocol can ignore coherence messages
unless doing so will create a cycle in the constraint graph
Edge-chasing delayed consistency
Based on edge-chasing algorithms used by distributed database systems for deadlock detection
[Figure: processes P1, P2, P3, and P4, annotated “Wham-O!”]
A cycle in the wait-for graph (WFG) is detected when a locally created probe is received
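A rough sketch of edge chasing on a wait-for graph follows. It assumes a centralized simulation in which `waits_for` maps each process to the processes it is blocked on; in a real distributed database the probe would travel as a message between sites. The function name and the example graphs are illustrative.

```python
from collections import deque


def detects_deadlock(waits_for, initiator):
    """Forward a probe created by `initiator` along wait-for edges.

    A deadlock involving the initiator is reported when the probe
    arrives back at the process that created it.
    """
    pending = deque(waits_for.get(initiator, ()))
    forwarded = {initiator}
    while pending:
        proc = pending.popleft()
        if proc == initiator:  # locally created probe received: cycle found
            return True
        if proc in forwarded:  # this process already passed the probe on
            continue
        forwarded.add(proc)
        pending.extend(waits_for.get(proc, ()))
    return False


ring = {"P1": ["P2"], "P2": ["P3"], "P3": ["P4"], "P4": ["P1"]}
chain = {"P1": ["P2"], "P2": ["P3"], "P3": ["P4"]}

print(detects_deadlock(ring, "P1"))   # True
print(detects_deadlock(chain, "P1"))  # False
```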
ECDC - Basic idea
Observation: Cycles in constraint graph can be detected using a similar mechanism
Protocol:
– Upon write miss, create a “probe”
– Upon receipt of invalidation, add probe to cache line
• Continue to read stale block until the probe is re-observed on another message
– Pass probe to other processors at communication
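The three protocol steps can be mimicked in software. The sketch below is a toy model under stated assumptions, not the hardware design: the `Core` class, its method names, and the global probe counter are hypothetical, every message is assumed to carry the sender's full probe set, and there is no real coherence fabric.

```python
import itertools

_probe_ids = itertools.count(1)  # hypothetical global source of unique probe ids


class Core:
    def __init__(self, name):
        self.name = name
        self.upstream = set()  # probes this core is causally dependent on
        self.lines = {}        # addr -> [value, supplanter probe or None]

    def write_miss(self, addr, value, sharers):
        """Create a probe for this write and invalidate the other copies."""
        probe = next(_probe_ids)
        self.upstream.add(probe)
        self.lines[addr] = [value, None]
        for core in sharers:
            core.receive_invalidation(addr, probe)
        return probe

    def receive_invalidation(self, addr, probe):
        """Keep the stale data, but remember which probe supplanted it."""
        if addr in self.lines:
            self.lines[addr][1] = probe

    def receive_message(self, sender):
        """Any communication passes along the sender's probes."""
        self.upstream |= sender.upstream
        stale_no_more = [addr for addr, (_, probe) in self.lines.items()
                         if probe in self.upstream]
        for addr in stale_no_more:  # probe re-observed: stop using the stale copy
            del self.lines[addr]

    def read(self, addr):
        """Return the cached (possibly stale) value, or None on a miss."""
        entry = self.lines.get(addr)
        return entry[0] if entry is not None else None


p1, p2 = Core("P1"), Core("P2")
p1.lines["A"] = [0, None]            # P1 starts with a valid copy of A
p2.write_miss("A", 1, sharers=[p1])
print(p1.read("A"))                  # 0: the stale copy is still usable
p2.write_miss("B", 2, sharers=[])
p1.receive_message(p2)               # e.g. P1's load of B is serviced by P2
print(p1.read("A"))                  # None: P1 must now fetch the new A
```

Once P1 observes any data that P2 produced after its store to A, the stale copy of A can no longer be read without risking a cycle in the constraint graph, which is why the second read misses.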
Example – necessary miss (SC)
[Figure: two-processor example (Proc 1, Proc 2) with operations LD A, ST B, LD B, ST A, LD A linked by RAW and WAR edges. Line A is in Proc 1’s cache, first with valid bit = 1, then with valid bit = 0 and Supplanter ProbeA = RAW.]
Detecting critical writes
Some write values shouldn’t be delayed (e.g. lock releases, barriers, etc.)
Two heuristics
– Atomic primitives – any cache block that has been touched by a store-conditional should not be delayed
– Polling detection – If consecutive cache accesses have same PC and address, discard stale line
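The polling heuristic can be sketched as follows. This is an illustration only: the `PollingDetector` class, the `load` helper, and its return strings are hypothetical, and stale lines are modeled as a plain set of addresses.

```python
class PollingDetector:
    """Flags an access that repeats the previous access's PC and address."""

    def __init__(self):
        self.last = None  # (pc, addr) of the previous cache access

    def is_polling(self, pc, addr):
        current = (pc, addr)
        polling = current == self.last
        self.last = current
        return polling


def load(detector, stale_lines, pc, addr):
    """Decide how a load is serviced; `stale_lines` is a set of stale addresses."""
    polling = detector.is_polling(pc, addr)
    if addr not in stale_lines:
        return "not-stale"
    if polling:
        stale_lines.discard(addr)  # discard the stale line, forcing a refetch
        return "refetch"
    return "use-stale"


detector = PollingDetector()
stale = {0x80}
print(load(detector, stale, pc=0x400, addr=0x80))  # use-stale
print(load(detector, stale, pc=0x400, addr=0x80))  # refetch
```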
Performance Evaluation
PHARMSim – Cycle-mode Full System Simulator
– Based on SimpleMP including Sun Gigaplane-like snooping coherence protocol [Rajwar], within the SimOS-PPC full-system simulator
– Out-of-order single-threaded core
– 32k DM L1 icache (1), 32k DM L1 dcache (1), 256K 8-way L2 (7), 8MB 8-way L3 (15), 64 byte cache lines
– Memory (400 cycle/100 ns best-case latency, 10 GB/s BW based on 5 GHz clock)
– Stride-based prefetcher modeled after Power4
Lock-free list insertion microbenchmark
Full applications
– SPLASH2: fft, fmm, ocean, radix, raytrace
– Commercial: DB2/TPC-B, DB2/TPC-H, SPECjbb2000, SPECweb99
Why delayed consistency?
False sharing/Silent sharing
Convergent/Data-race tolerant algorithms
– Genetic algorithms
– Parallel equation solvers
– Sparse matrix factorization
Lock-free parallel linked data structures
Lock-free Algorithms
For example list insertion:
– New node’s next pointer set to cur
– CAS operation atomically updates prev’s next pointer to new
Increasingly common
[Figure: list nodes prev and cur, with node new to be linked in]
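The two insertion steps can be sketched with an emulated compare-and-swap. Python exposes no hardware CAS, so the `AtomicRef` class below uses a lock purely to stand in for the atomic instruction; all class and function names are hypothetical, and deletion (and therefore hazard pointers) is out of scope.

```python
import threading


class AtomicRef:
    """Emulates a CAS-capable pointer cell; the lock stands in for hardware atomicity."""

    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()

    def get(self):
        return self._value

    def compare_and_swap(self, expected, new):
        with self._lock:
            if self._value is expected:
                self._value = new
                return True
            return False


class Node:
    def __init__(self, key, nxt=None):
        self.key = key
        self.next = AtomicRef(nxt)


def insert_sorted(head, key):
    """Insert `key` into the sorted list behind the sentinel `head`."""
    while True:
        prev, cur = head, head.next.get()
        while cur is not None and cur.key < key:
            prev, cur = cur, cur.next.get()
        new = Node(key, cur)                      # new node's next pointer set to cur
        if prev.next.compare_and_swap(cur, new):  # atomically point prev.next at new
            return new
        # CAS failed: another thread changed prev.next, so search again


def keys(head):
    """Collect the keys in list order, skipping the sentinel."""
    out, cur = [], head.next.get()
    while cur is not None:
        out.append(cur.key)
        cur = cur.next.get()
    return out


head = Node(None)  # sentinel; its key is never compared
for k in (5, 1, 3):
    insert_sorted(head, k)
print(keys(head))  # [1, 3, 5]
```

A searching thread that holds a stale copy of prev's next pointer simply sees the list as it was before the insertion, which is one reason such structures tolerate delayed consistency.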
Prior work (Delayed consistency)
Invalidate-based receiver-delayed protocols, sender-delayed protocols (Dubois et al., SC ’91)
Lazy release consistency (Keleher et al., ISCA ’92)
Update-based receiver-delayed, sender-delayed protocols (Afek et al., TPLS, ’93)
Tear-off blocks in DSI (Lebeck and Wood, ISCA ’95)
Write cache for reducing bandwidth in update coherence protocol (Dahlgren and Stenstrom, JPDC ’95)
Lock-free list microbenchmark
[Figure: cycles/search (0 to 18000) versus % updates (0 to 100); series: base-1000, ecdc-1000, base-100, ecdc-100, base-10, ecdc-10]
Based on hazard-pointer lock-free list maintenance algorithm [Michael, PODC ’02]
15 threads randomly updating or searching linked list, 1 thread performing searches
Intolerable miss reduction
Left to right: a) baseline, b) ECDC base, c) ECDC merged read/write sets, d) ECDC scalar probe set
ECDC Performance (Infinite resources)
Conclusions
Of nine applications studied, performance improvement for two
– Mostly due to reduction in false sharing misses
Other applications:
– Not enough coherence misses, or
– The avoidance of those misses does not improve performance
We believe these results generalize to lock-based programs
Other programming models may have potential
– As shown, lock-free data structures
• Should also apply to transactional programming model
– But beware, “Premature Optimization is the Root of All Evil” – Donald Knuth
– Best to identify apps with a communication bottleneck before attacking
Questions?
Backup slides
Base machine model
PHARMsim: based on SimpleMP including Sun Gigaplane-like snooping coherence protocol [Rajwar], within the SimOS-PPC full-system simulator
Out-of-order execution core
15-stage, 8-wide pipeline
256 entry reorder buffer, 128 entry load/store queue
32 entry issue queue
Functional units (latency)
8 Int ALUs (1), 3 Int MULT/DIV(3/12) 4 FP ALUs (4), 4 FP MULT/DIV (4,4),
4 L1 Dcache load ports in OoO window
1 L1 Dcache load/store port at commit
Front-end Combined bimodal (16k entry)/ gshare (16k entry) branch predictor with 16k entry selection table, 64 entry RAS, 8k entry 4-way BTB
Memory system (latency)
32k DM L1 icache (1), 32k DM L1 dcache (1)
256K 8-way L2 (7), 8MB 8-way L3 (15), 64 byte cache lines
Memory (400 cycle/100 ns best-case latency, 10 GB/S BW based on 5GHZ clock)
Stride-based prefetcher modeled after Power4
Causality (Lamport)
An instruction i is causally dependent upon instruction j if there is a directed path from j to i
Two operations are concurrent if neither causally depends upon the other
Coherence misses are a significant source of performance degradation for many applications
If two operations are concurrent, why is their performance penalized?
[Figure: timeline (Time axis) for processors P1, P2, and P3 showing the operations st A, st C, ld A, st B, ld C, ld B, ld A]
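The two definitions translate directly into graph reachability. The sketch below assumes the execution is given as an adjacency map of ordering edges; the helper names and the three-processor example are hypothetical and do not reproduce the figure on this slide.

```python
def reachable(graph, src, dst):
    """True if a directed path of length >= 1 leads from src to dst."""
    stack, seen = list(graph.get(src, ())), set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return False


def causally_depends(graph, i, j):
    """i is causally dependent on j if there is a directed path from j to i."""
    return reachable(graph, j, i)


def concurrent(graph, a, b):
    """Two operations are concurrent if neither causally depends on the other."""
    return not causally_depends(graph, a, b) and not causally_depends(graph, b, a)


# Hypothetical execution: P1 stores A then B, P2 loads the new B, P3 stores C.
graph = {
    "P1.st A": ["P1.st B"],  # program order on P1
    "P1.st B": ["P2.ld B"],  # read-after-write: P2 observes P1's store
}

print(causally_depends(graph, "P2.ld B", "P1.st A"))  # True
print(concurrent(graph, "P3.st C", "P1.st A"))        # True
```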
Prior work: formal memory model representations
Local, WRT, global “performance” of memory ops (Dubois et al., ISCA-13)
Acyclic graph representation (Landin et al., ISCA-18)
Modeling memory operation as a series of sub-operations (Collier, RAPA)
Acyclic graph + sub-operations (Adve, thesis)
Initiation event, for modeling early store-to-load forwarding (Gharachorloo, thesis)
Anatomy of a cycle
[Figure: Proc 1 and Proc 2 with operations ST A, LD A, ST B, LD B connected by program order, WAR, and RAW edges; annotations: incoming invalidate, cache miss]
Other prior work
Speculative stale value usage
– LVP with Stale Values (Lepak, Ph.D. Thesis ‘03)
– Coherence Decoupling (Huh et al., ASPLOS ’04)
Delayed RFO response to improve synchronization throughput (Rajwar et al., HPCA ’00)
Constraint graph extensions
Constraint graph definition differs for other consistency models
Processor consistency
– Remove program order edges from stores to subsequent loads
– Remaining single-thread orders: edges from
• Loads to subsequent loads
• Stores to subsequent stores
• Loads to subsequent stores
Constraint graph extensions
Constraint graph definition differs for other consistency models
Weak ordering
– Remove program order edges
– Add single-thread ordering edges between
• memory barrier and preceding/following instructions
• same address reads/writes
• dependent instructions
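The per-model rules for single-thread ordering edges can be written as a small predicate. This is a simplified sketch: operations are hypothetical (kind, address) tuples, memory barriers are only given special treatment under weak ordering, and register dependences ("dependent instructions") are omitted because the tuples carry no register information.

```python
def ordered(a, b, model):
    """Should an earlier op `a` get an ordering edge to a later op `b` on the same thread?"""
    (kind_a, addr_a), (kind_b, addr_b) = a, b
    if model == "SC":
        return True                                    # full program order
    if model == "PC":
        return not (kind_a == "ST" and kind_b == "LD")  # drop store-to-load order
    if model == "WO":
        if kind_a == "MB" or kind_b == "MB":
            return True                                # barrier vs. preceding/following ops
        return addr_a == addr_b                        # same-address reads/writes
    raise ValueError(f"unknown model: {model}")


def ordering_edges(thread, model):
    """Return (i, j) index pairs, i < j, for ordered operations of one thread."""
    return [(i, j)
            for i in range(len(thread))
            for j in range(i + 1, len(thread))
            if ordered(thread[i], thread[j], model)]


dekker = [("ST", "A"), ("LD", "B")]
fenced = [("ST", "A"), ("MB", None), ("LD", "B")]

print(ordering_edges(dekker, "SC"))  # [(0, 1)]
print(ordering_edges(dekker, "PC"))  # []
print(ordering_edges(fenced, "WO"))  # [(0, 1), (1, 2)]
```

In the fenced case the store and the load are not directly connected, but the two barrier edges order them transitively.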
PC Example – Dekker’s Alg.
[Figure: Proc 1 and Proc 2 with operations ST A, ST B, LD B, LD A; edge labels: program order, write-after-read dependence order; edges numbered 1 to 4]
Lack of store-to-load order results in acyclic graph
Constraint graph example - SC
[Figure: Proc 1 and Proc 2 with operations ST A, LD A, ST B, LD B; edge labels: program order, write-after-read dependence order, read-after-write dependence order; edges numbered 1 to 4]
Cycle indicates that execution is incorrect
Constraint graph example - PC
[Figure: Proc 1 and Proc 2 with operations ST A, LD B, ST B, LD A; edge labels: program order, write-after-read dependence order, read-after-write dependence order; edges numbered 1 to 4]
ECDC Conceptual Description
Identify causal dependences (upstream probe sets)
– 1 upstream set per processor
– 2 upstream sets per cache block (read set, write set)
Communicating dependences
– Probe sets passed on response messages
– Probes attached to incoming invalidation messages
– Extra ProbePropagation messages sent at memory barriers
Identifying usable stale blocks
– Extra stable state in cache (ST)
– Supplanter probe
ECDC Operation
[Table: contents of the processor upstream set Φproc and of the per-block read|write sets Φ(read|write)A and Φ(read|write)B, shown initially and after each of the operations 1. ld A, 2. st A, 3. ld B, 4. st B, 5. ld C; the probe symbols inside the sets are not recoverable]
Finite ECDC Performance
When restricting PPB/STAB resources (220 KB per processor)
– 16k probe lifetime counter
– 128 entry STAB per processor
– 32 Entry PPB per processor/directory controller (256 PPB virtual namespace)
TPC-H/SPECweb99 performance within margin of error to infinite resources
Non-atomicity of writes
Absent from model
Effect on optimizations
– Forces unnecessary orders to exist
– Correct, but another example of over-conservatism
Hopefully, infrequent performance divot
Processor p1
st r1, [A]
Processor p2
ld r1, [A]
st r2, [r1]
Processor p3
ld r1, [B]
membar
ld r2, [A]
ECDC Base machine model
PHARMsim: based on SimpleMP including Sun Gigaplane-like snooping coherence protocol [Rajwar], within the SimOS-PPC full-system simulator
Out-of-order execution core
15-stage, 8-wide pipeline
256 entry reorder buffer, 128 entry load/store queue
32 entry issue queue
Functional units (latency)
8 Int ALUs (1), 3 Int MULT/DIV(3/12) 4 FP ALUs (4), 4 FP MULT/DIV (4,4),
4 L1 Dcache load ports in OoO window
1 L1 Dcache load/store port at commit
Front-end Combined bimodal (16k entry)/ gshare (16k entry) branch predictor with 16k entry selection table, 64 entry RAS, 8k entry 4-way BTB
Cache Hierarchy (latency)
32k DM L1 icache (1), 32k DM L1 dcache (1)
256K 8-way L2 (7), 16 MB 8-way L3 (15), 128 byte cache lines
Stride-based prefetcher modeled after Power4
Memory system (latency)
2-D static DOR routed torus interconnect. 60 cycle per link+route (40 GB/S bandwidth per link, 5GHZ clock)
Memory (400 cycle best-case latency, 10 GB/S bandwidth)
Mapping ECDC to HW
STAB – Maintains supplanting probe for each stale cache block
PPB – Maintains approximation of upstream sets
In caches – 2 extra bits for stale state and synch heuristic
[Figure: node block diagram with P, I$, D$, L2 $, NIC, MemCtr, Dir, and DRAM, plus the STAB, PPB, and Castout PPB structures]
Probe representation
Each probe represented by n-bit timer
Stale block may be used until supplanting probe timer expires
Probe set in p-processor system represented by p timers
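A small sketch of the timer representation follows. The down-counting behaviour, the class and function names, and the choice of 14 bits (suggested by a 16k lifetime) are assumptions made for illustration; the slides state only that a probe is an n-bit timer and that a probe set holds p timers.

```python
class ProbeTimer:
    """Down-counting n-bit timer standing in for one probe (assumed semantics)."""

    def __init__(self, bits):
        self.bits = bits
        self.remaining = (1 << bits) - 1  # largest value an n-bit counter can hold

    def tick(self, cycles=1):
        self.remaining = max(0, self.remaining - cycles)

    @property
    def expired(self):
        return self.remaining == 0


def stale_block_usable(supplanting_timer):
    """A stale block may be used until its supplanting probe's timer expires."""
    return not supplanting_timer.expired


def probe_set_bits(processors, bits):
    """Storage for one probe set: one n-bit timer per processor."""
    return processors * bits


timer = ProbeTimer(bits=14)
timer.tick(16000)
print(stale_block_usable(timer))  # True: 383 cycles remain
timer.tick(1000)
print(stale_block_usable(timer))  # False: the timer has expired
print(probe_set_bits(16, 14))     # 224
```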
STAB Detail
[Figure: STAB detail showing timer and address columns, the cache, incoming invalidates, and counters for p1, p2, and p3]
PPB Detail
[Figure: PPB detail showing the address hash, timer index table, shift register/probe timers, incoming upstream set, and expired upstream set]
Memory consistency review
Memory consistency model
– Specifies the programming interface to a shared memory
– i.e. the allowable interleaving of instructions
Models discussed here:
– Sequential Consistency
– Processor Consistency
• No store-to-load program order
– Weak Ordering
• Order wrt memory barriers
• Same-address order
• Dependence order
Example – necessary miss (SC)
[Figure: two-processor example (Proc 1, Proc 2) with operations LD A, ST B, LD B, ST A, LD A linked by program order (PO), RAW, and WAR edges. Block A is in Proc 1’s cache, first with valid bit = 1, then with valid bit = 0.]
Example – avoidable miss (SC)
[Figure: two-processor example (Proc 1, Proc 2) with operations LD A, ST B, LD B, ST A, LD A linked by program order (PO), RAW, and WAR edges. Block A is in Proc 1’s cache, first with valid bit = 1, then with valid bit = 0.]
Typical ReadX transaction
When sending invalidation, create probe, add to PPB
At receipt of invalidation (2b, 2c) add probe to STAB
When sending invalidate acknowledgment, add probe set to the response
When receiving invalidate acknowledgment, add incoming probe set to the PPB
[Figure: message flow among nodes R, H, S1, and S2: 1. ReadX; 2(a) Sharers/Data; 2(b) Inval; 2(c) Inval; 3(a) Inval Ack; 3(b) Inval Ack]
Invalidation to read distance
[Figure: % of load coh misses (0% to 100%) versus cycles (1 to 1E+09); series: fft, fmm, ocean, radix, raytrace, SPECjbb2000, SPECweb99, TPC-B, TPC-H]
Invalidation to read distance (synch)
[Figure: % of load coh misses (0% to 100%) versus cycles (1 to 1E+09); series: fft, fmm, ocean, radix, raytrace, SPECjbb2000, SPECweb99, TPC-B, TPC-H]
Invalidation to read distance (data)
[Figure: % of load coh misses (0% to 100%) versus cycles (1 to 1E+09); series: fft, fmm, ocean, radix, raytrace, SPECjbb2000, SPECweb99, TPC-B, TPC-H]
STAB entry death cdf
[Figure: % STAB entries deallocated (0 to 1) versus cycles (1 to 1000000); series: fft, fmm, ocean, radix, raytrace, SPECjbb2000, SPECweb99, TPC-B, TPC-H]
STAB Entry Lifetime
ECDC performance (16k probe lifetime)
ECDC Perf (128 entry STAB, 32 entry PPB, 256 entry namespace)
ProbePropagation messages
ECDC Storage Overhead
[Figure: storage (KB, 0 to 350) versus processor count (4p to 1024p)]
What about limit study?
Indicated a larger number of avoidable coherence misses
Reasons:
– Did not account for non-speculative nature of protocol (oracle ECDC could be better)
– Inaccurate measurement of critical writes
• Many loads perform polling to lines that have never been touched by a load-linked or store-conditional
– Used isolated stale data detection mechanism
What about speculative load squashes?
In a few applications, they occur frequently (SPECjbb2000, TPC-H)
Implemented/evaluated read-set-tracking w/ squash on miss
Could eliminate a large fraction of squashes
– Unfortunately, little performance improvement
– Presumably, many squashes caused by contended spin locks
ECDC and other consistency models
Stricter model => more ProbePropagation messages
Potential for release consistency
In SC/PC/TSO, ECDC benefits will probably be dominated by extra ProbePropagation messages
Cause of STAB entry deallocation
Publications
[ISCA ’04] Memory ordering: A Value-based approach.
– Selected for IEEE Micro Top Picks ’04
[PACT ’03] Constraint Graph Analysis of Multithreaded Programs.
– Selected for Best of PACT JILP Issue
[PACT ’03] Redeeming IPC as a Performance Metric for Multithreaded Programs.
[CAECW ’02] Precise and Accurate Processor Simulation
[SPAA Revue ’02] Verifying Sequential Consistency Using Vector Clocks.
[Micro ’01] Correctly Implementing Value Prediction in Microprocessors that Support Multithreading or Multiprocessing.
[WBT ’01] A Dynamic Binary Translation Approach to Architectural Simulation
[HPCA ’01] An Architectural Characterization of Java TPC-W.
[Euro-Par ’00] A Callgraph-Based Search Strategy for Automated Performance Diagnosis.
– Selected as distinguished paper
[CAECW ’00] Characterizing a Java Implementation of TPC-W