Edge Chasing Delayed Consistency: Pushing the Limits of Weak Memory Models

IBM T.J. Watson Research Center

RACES’12 Oct 21, 2012 © 2012 IBM Corporation

Harold “Trey” Cain, IBM T.J. Watson Research Center

Prof. Mikko H. Lipasti, University of Wisconsin

IBM Research

Gotta go back in time!

Part of Ph.D. Dissertation

– Never submitted for publication, until now.

– Looked particularly relevant when I saw the RACES CFP.

Journey back in time to the year 2004, when…

– … Mark Zuckerberg launched Facebook

– … Janet Jackson suffered a “wardrobe malfunction” during the Super Bowl halftime show

– … an incumbent president was being challenged by a Massachusetts politician

88mph here we come!


Edge Chasing Delayed Consistency: Pushing the Limits of Weak Ordering

From the RACES website:

– “an approach towards scalability that reduces synchronization requirements drastically, possibly to the point of discarding them altogether.”

A hardware developer’s perspective:

– Constraints of Legacy Code

• What if we want to apply this principle, but have no control over the applications that are running on a system?

– Can one build a coherence protocol that avoids synchronizing cores as much as possible?

• For example, by allowing each core to use stale versions of cache lines as long as possible

• While maintaining architectural correctness; i.e., we will not break existing code

• If we do that, what will happen?


Cache-Coherent Shared-memory multiprocessors

Are ubiquitous

Coherence misses are a major source of performance loss for shared memory applications

[Figure labels: “10 years ago” and “Today”]


16MB L3 Cache Misses per 1000 inst


Edge-Chasing Delayed Consistency (ECDC)

A new hardware implementation of POWER weak ordering

– Not a new consistency model

Allows a cache line to be non-speculatively read after being invalidated.

Based on necessary conditions

– Processor must fetch new data only if causally dependent on it.


Constraint graph

Introduced for SC by Landin et al., ISCA-18

A directed graph represents a multithreaded execution

– Nodes represent dynamic instances of instructions

– Edges represent their transitive orders (program order, RAW, WAW, WAR).

If the constraint graph is acyclic, then the execution is correct
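The acyclicity test can be sketched in a few lines of Python. This is only an illustration: the node names (`P1.stA` and so on) and the two example executions are made up for this sketch, and the edges are listed by hand instead of being derived from a real trace.

```python
from collections import defaultdict

def has_cycle(edges):
    """Return True if the directed graph given as (src, dst) pairs contains a cycle."""
    graph = defaultdict(list)
    nodes = set()
    for src, dst in edges:
        graph[src].append(dst)
        nodes.update((src, dst))

    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on the DFS stack, finished
    color = {n: WHITE for n in nodes}

    def visit(start):
        # Iterative depth-first search; a GREY successor is a back edge, i.e. a cycle.
        stack = [(start, iter(graph[start]))]
        color[start] = GREY
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if color[nxt] == GREY:
                    return True
                if color[nxt] == WHITE:
                    color[nxt] = GREY
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                color[node] = BLACK
                stack.pop()
        return False

    return any(color[n] == WHITE and visit(n) for n in nodes)

# Hypothetical SC execution: P1 stores A then B; P2 loads the new B, then a stale A.
stale_read = [
    ("P1.stA", "P1.stB"),   # program order
    ("P1.stB", "P2.ldB"),   # RAW: P2 reads the new B
    ("P2.ldB", "P2.ldA"),   # program order
    ("P2.ldA", "P1.stA"),   # WAR: P2 read A before P1's store became visible to it
]
# Same execution, but P2's load of A observes P1's store.
fresh_read = stale_read[:3] + [("P1.stA", "P2.ldA")]  # RAW instead of WAR
```

Here `has_cycle(stale_read)` reports a cycle, so that execution is not sequentially consistent, while `has_cycle(fresh_read)` does not.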


Constraint graph example - WO

[Figure: constraint graph for a weakly ordered execution on Proc 1 and Proc 2. Operations: LD A, ST B, LD B, ST A, with a memory barrier (MB) on each processor. Edge labels: ST->MB order, LD->MB order, MB->ST order, MB->LD order, write-after-read dependence order, read-after-write dependence order. Steps numbered 1 to 5.]

Observation: An aggressive coherence protocol can ignore coherence messages unless doing so will create a cycle in the constraint graph


Edge-chasing delayed consistency

Based on edge-chasing algorithms used by distributed database systems for deadlock detection

[Figure: probe passing among P1, P2, P3, P4 (“Wham-O!”)]

A cycle in the wait-for graph (WFG) is detected when a locally created probe is received
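As a rough illustration of edge chasing for deadlock detection, the sketch below forwards a probe along wait-for edges and reports a deadlock when the probe returns to its creator. It is a centralized simulation of what would be message passing between sites in a distributed database, and the function name and the `waits_for` dictionary layout are assumptions made for this example.

```python
def detect_deadlock(waits_for, initiator):
    """Edge-chasing sketch.

    waits_for maps a process to the processes it is blocked on.
    The probe carries its creator's id; each blocked process that receives
    it forwards it along its own wait-for edges. A deadlock involving the
    initiator exists if the probe arrives back at the initiator.
    """
    probe = initiator
    frontier = list(waits_for.get(initiator, []))   # first hop of the probe
    seen = set()                                    # processes that already forwarded it
    while frontier:
        proc = frontier.pop()
        if proc == probe:
            return True                             # locally created probe received
        if proc in seen:
            continue
        seen.add(proc)
        frontier.extend(waits_for.get(proc, []))
    return False

# Hypothetical wait-for graphs over four processes.
ring = {"P1": ["P2"], "P2": ["P3"], "P3": ["P4"], "P4": ["P1"]}
chain = {"P1": ["P2"], "P2": ["P3"], "P3": ["P4"]}
```

With `ring`, the probe created by P1 comes back to P1; with `chain`, it dies out at P4.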


ECDC - Basic idea

Observation: Cycles in constraint graph can be detected using a similar mechanism

Protocol:

– Upon write miss, create a “probe”

– Upon receipt of invalidation, add probe to cache line

• Continue to read stale block until the probe is re-observed on another message

– Pass probe to other processors at communication
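The three protocol steps above can be mimicked with a toy model. Everything here is a simplification assumed for illustration: the `Core` class, its method names, and the idea that every message carries the sender's whole probe set are not taken from the actual hardware design.

```python
import itertools

_probe_ids = itertools.count(1)   # hypothetical global probe id source

class Core:
    def __init__(self, name):
        self.name = name
        self.lines = {}        # addr -> [value, supplanter probe or None]
        self.upstream = set()  # probes this core has observed so far

    def fill(self, addr, value):
        self.lines[addr] = [value, None]

    def write_miss(self, addr, sharers):
        """Upon write miss, create a probe and send it with each invalidation."""
        probe = (self.name, next(_probe_ids))
        self.upstream.add(probe)
        for core in sharers:
            core.on_invalidate(addr, probe)
        return probe

    def on_invalidate(self, addr, probe):
        # Keep the block as stale and remember which probe supplanted it.
        if addr in self.lines:
            self.lines[addr][1] = probe

    def on_message(self, sender):
        """Any communication passes the sender's probes along."""
        self.upstream |= sender.upstream
        # Re-observing a supplanter probe ends the stale block's usable life.
        expired = [a for a, (_, p) in self.lines.items() if p in self.upstream]
        for addr in expired:
            del self.lines[addr]

    def read(self, addr):
        """Return the cached (possibly stale) value, or None to signal a miss."""
        line = self.lines.get(addr)
        return line[0] if line else None
```

In this model a core that was invalidated keeps hitting on the stale value until some later message brings the writer's probe back to it, at which point the next read misses.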


Example – necessary miss (SC)

[Figure: constraint graph over Proc 1 and Proc 2 with operations LD A, ST B, LD B, ST A, LD A and edges labeled RAW and WAR]

Line A is in proc 1’s cache, valid bit = 1

Line A is in proc 1’s cache, valid bit = 0, Supplanter Probe A = RAW


Detecting critical writes

Some write values shouldn’t be delayed (e.g. lock releases, barriers, etc.)

Two heuristics

– Atomic primitives – any cache block that has been touched by a store-conditional should not be delayed

– Polling detection – If consecutive cache accesses have same PC and address, discard stale line
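Both heuristics are simple enough to sketch in software. The class below is a hypothetical model, not the hardware mechanism: the name `CriticalWriteFilter`, the 64-byte default block size, and the method names are assumptions for illustration.

```python
class CriticalWriteFilter:
    """Software model of the two critical-write heuristics."""

    def __init__(self, line_bytes=64):
        self.line_bytes = line_bytes
        self.sc_blocks = set()     # blocks ever touched by a store-conditional
        self.last_access = None    # (pc, addr) of the previous cache access

    def _block(self, addr):
        return addr // self.line_bytes

    def on_store_conditional(self, addr):
        # Atomic primitives heuristic: remember the block as synchronization data.
        self.sc_blocks.add(self._block(addr))

    def may_delay(self, addr):
        """False if invalidations to this block should take effect immediately."""
        return self._block(addr) not in self.sc_blocks

    def on_load(self, pc, addr):
        """Polling heuristic: True if this access repeats the previous PC and address,
        in which case a stale copy of the line should be discarded."""
        polling = self.last_access == (pc, addr)
        self.last_access = (pc, addr)
        return polling
```

A spin loop that re-executes the same load on the same address trips `on_load` on its second iteration, while a block that once held a lock word is never delayed.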


Performance Evaluation

PHARMSim – Cycle-mode Full System Simulator

– Based on SimpleMP including Sun Gigaplane-like snooping coherence protocol [Rajwar], within the SimOS-PPC full-system simulator

– Out-of-order single-threaded core

– 32k DM L1 icache (1), 32k DM L1 dcache (1), 256K 8-way L2 (7), 8MB 8-way L3 (15), 64 byte cache lines

– Memory (400 cycle/100 ns best-case latency, 10 GB/s BW based on 5 GHz clock)

– Stride-based prefetcher modeled after Power4

Lock-free list insertion microbenchmark

Full applications

– SPLASH2: fft, fmm, ocean, radix, raytrace

– Commercial: DB2/TPC-B, DB2/TPC-H, SPECjbb2000, SPECweb99


Why delayed consistency?

False sharing/Silent sharing

Convergent/Data-race tolerant algorithms

– Genetic algorithms

– Parallel equation solvers

– Sparse matrix factorization

Lock-free parallel linked data structures


Lock-free Algorithms

For example list insertion:

– New node’s next pointer set to cur

– CAS operation atomically updates prev’s next pointer to new

Increasingly common

[Figure: node “new” being inserted between “prev” and “cur”]
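A minimal sketch of the insertion step follows. Python has no hardware compare-and-swap, so `compare_and_swap` is emulated with a lock purely for illustration; the `Node` class, the sentinel head, and the sorted-insert policy are likewise assumptions of this example, and deletion (where hazard pointers matter) is left out.

```python
import threading

class Node:
    def __init__(self, key, next=None):
        self.key = key
        self.next = next

_cas_lock = threading.Lock()   # stands in for the atomicity of a real CAS

def compare_and_swap(node, expected, new):
    """Set node.next = new only if node.next is still `expected`."""
    with _cas_lock:
        if node.next is expected:
            node.next = new
            return True
        return False

def insert_sorted(head, key):
    """Insert key after the sentinel `head`, keeping keys ascending."""
    while True:
        prev, cur = head, head.next
        while cur is not None and cur.key < key:
            prev, cur = cur, cur.next
        new = Node(key, cur)                   # new node's next pointer set to cur
        if compare_and_swap(prev, cur, new):   # atomically point prev.next at new
            return new
        # CAS failed: another thread changed prev.next, so search again.

def keys(head):
    out, cur = [], head.next
    while cur is not None:
        out.append(cur.key)
        cur = cur.next
    return out
```

Searching threads only ever follow `next` pointers, which is why a reader holding a slightly stale view of the list can still make progress.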


Prior work (Delayed consistency)

Invalidate-based receiver-delayed protocols, sender-delayed protocols (Dubois et al., SC ’91)

Lazy release consistency (Keleher et al., ISCA ’92)

Update-based receiver-delayed, sender-delayed protocols (Afek et al., TOPLAS ’93)

Tear-off blocks in DSI (Lebeck and Wood, ISCA ’95)

Write cache for reducing bandwidth in update coherence protocol (Dahlgren and Stenstrom, JPDC ’95)


Lock-free list microbenchmark

[Figure: cycles/search (y-axis, 0 to 18000) versus % updates (x-axis, 0 to 100) for series base-1000, ecdc-1000, base-100, ecdc-100, base-10, ecdc-10]

Based on hazard-pointer lock-free list maintenance algorithm [Michael, PODC ’02]

15 threads randomly updating or searching linked list, 1 thread performing searches


Intolerable miss reduction

Left to right: a) baseline, b) ECDC base, c) ECDC merged read/write sets, d) ECDC scalar probe set


ECDC Performance (Infinite resources)


Conclusions

Of nine applications studied, performance improvement for two

– Mostly due to reduction in false sharing misses

Other applications:

– Not enough coherence misses, or

– The avoidance of those misses does not improve performance

We believe these results generalize to lock-based programs

Other programming models may have potential

– As shown, lock-free data structures

• Should also apply to transactional programming model

– But beware, “Premature Optimization is the Root of All Evil” – Donald Knuth

– Best to identify apps with a communication bottleneck before attacking


Questions?


Backup slides


Base machine model

PHARMsim: Based on SimpleMP including Sun Gigaplane-like snooping coherence protocol [Rajwar], within the SimOS-PPC full-system simulator

Out-of-order execution core: 15-stage, 8-wide pipeline; 256 entry reorder buffer, 128 entry load/store queue; 32 entry issue queue

Functional units (latency): 8 Int ALUs (1), 3 Int MULT/DIV (3/12), 4 FP ALUs (4), 4 FP MULT/DIV (4,4); 4 L1 Dcache load ports in OoO window; 1 L1 Dcache load/store port at commit

Front-end: Combined bimodal (16k entry)/gshare (16k entry) branch predictor with 16k entry selection table, 64 entry RAS, 8k entry 4-way BTB

Memory system (latency): 32k DM L1 icache (1), 32k DM L1 dcache (1); 256K 8-way L2 (7), 8MB 8-way L3 (15), 64 byte cache lines; Memory (400 cycle/100 ns best-case latency, 10 GB/s BW based on 5 GHz clock); Stride-based prefetcher modeled after Power4


Causality (Lamport)

An instruction i is causally dependent upon instruction j if there is a directed path from j to i

Two operations are concurrent if neither causally depends upon the other

Coherence misses are a significant source of performance degradation for many applications

If two operations are concurrent, why is their performance penalized?
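These two definitions translate directly into graph reachability. The sketch below assumes the execution is already available as an adjacency dictionary of ordering edges; the operation names and the example graph are hypothetical.

```python
def reachable(graph, src, dst):
    """True if a directed path with at least one edge leads from src to dst."""
    stack, seen = list(graph.get(src, [])), set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, []))
    return False

def causally_depends(graph, i, j):
    """Instruction i is causally dependent on j if there is a path from j to i."""
    return reachable(graph, j, i)

def concurrent(graph, a, b):
    """Two operations are concurrent if neither causally depends on the other."""
    return not reachable(graph, a, b) and not reachable(graph, b, a)

# Hypothetical execution: P1 stores A then B; P2 reads that B, then loads A.
# P3's store to C has no ordering edge to or from the others.
example = {
    "P1.stA": ["P1.stB"],
    "P1.stB": ["P2.ldB"],
    "P2.ldB": ["P2.ldA"],
}
```

In `example`, P2's load of A causally depends on P1's store to A, so it must see the new value, whereas P3's store to C is concurrent with everything on P1 and P2.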

[Figure: timeline (time axis) of processors P1, P2, P3 with operations st A, st C, ld A, st B, ld C, ld B, ld A]


Prior work: formal memory model representations

Local, WRT, global “performance” of memory ops (Dubois et al., ISCA-13)

Acyclic graph representation (Landin et al., ISCA-18)

Modeling memory operation as a series of sub-operations (Collier, RAPA)

Acyclic graph + sub-operations (Adve, thesis)

Initiation event, for modeling early store-to-load forwarding (Gharachorloo, thesis)


Anatomy of a cycle

[Figure: cycle formed by ST A, LD A, ST B, LD B on Proc 1 and Proc 2, with two program order edges, a WAR edge, and a RAW edge. The incoming invalidate and the cache miss are marked on the cycle.]


Other prior work

Speculative stale value usage

– LVP with Stale Values (Lepak, Ph.D. Thesis ‘03)

– Coherence Decoupling (Huh et al., ASPLOS ’04)

Delayed RFO response to improve synchronization throughput (Rajwar et al., HPCA ’00)


Constraint graph extensions

Constraint graph definition differs for other consistency models

Processor consistency

– Remove program order edges from stores to subsequent loads

– Remaining single-thread orders: edges from

• Loads to subsequent loads

• Stores to subsequent stores

• Loads to subsequent stores


Constraint graph extensions

Constraint graph definition differs for other consistency models

Weak ordering

– Remove program order edges

– Add single-thread ordering edges between

• memory barrier and preceding/following instructions

• same address reads/writes

• dependent instructions
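The per-model edge rules can be written as one small function. This is a sketch under stated assumptions: operations are `(kind, addr)` tuples, dependences are supplied by the caller as index pairs, and edges are generated for all ordered pairs in a thread instead of only adjacent ones, which yields the same reachability. The PC rule follows the slide literally and drops every store-to-load edge.

```python
from itertools import combinations

def thread_edges(ops, model, deps=()):
    """Single-thread ordering edges for one thread's operations.

    ops:   list of (kind, addr), kind in {"LD", "ST", "MB"}, addr None for MB
    model: "SC", "PC" or "WO"
    deps:  (i, j) index pairs marking j as dependent on i (used by WO only)
    Returns a set of (i, j) pairs with i earlier than j in program order.
    """
    edges = set()
    for i, j in combinations(range(len(ops)), 2):
        (kind_i, addr_i), (kind_j, addr_j) = ops[i], ops[j]
        if model == "SC":
            keep = True                                   # full program order
        elif model == "PC":
            keep = not (kind_i == "ST" and kind_j == "LD")  # no store-to-load order
        elif model == "WO":
            keep = (
                "MB" in (kind_i, kind_j)                  # order wrt memory barriers
                or (addr_i is not None and addr_i == addr_j)  # same-address order
                or (i, j) in deps                         # dependence order
            )
        else:
            raise ValueError(f"unknown model: {model}")
        if keep:
            edges.add((i, j))
    return edges

dekker_thread = [("ST", "A"), ("LD", "B")]
fenced_thread = [("ST", "A"), ("MB", None), ("LD", "B")]
```

For `dekker_thread`, only SC keeps the store-to-load edge; under WO the edge reappears transitively once a barrier sits between the two operations, as in `fenced_thread`.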


PC Example – Dekker’s Alg.

[Figure: Dekker’s algorithm under processor consistency. Proc 1: ST A, LD B. Proc 2: ST B, LD A. Program order edges shown; steps numbered 1 to 4.]

Lack of store-to-load order results in acyclic graph


Constraint graph example - SC

[Figure: constraint graph on Proc 1 and Proc 2 with operations ST A, LD A, ST B, LD B and program order edges; steps numbered 1 to 4]

Cycle indicates that execution is incorrect


Constraint graph example - PC

[Figure: constraint graph on Proc 1 and Proc 2 with operations ST A, LD B, ST B, LD A and program order edges; steps numbered 1 to 4]


ECDC Conceptual Description

Identify causal dependences (upstream probe sets)

– 1 upstream set per processor

– 2 upstream sets per cache block (read set, write set)

Communicating dependences

– Probe sets passed on response messages

– Probes attached to incoming invalidation messages

– Extra ProbePropagation messages sent at memory barriers

Identifying usable stale blocks

– Extra stable state in cache (ST)

– Supplanter probe


ECDC Operation

[Table: contents of the processor upstream probe set Φ(proc, upstream) and the per-block sets Φ(read|write) for A and Φ(read|write) for B, initially and after each step of the sequence 1. ld A, 2. st A, 3. ld B, 4. st B, 5. ld C]


Finite ECDC Performance

When restricting PPB/STAB resources (220 KB per processor)

– 16k probe lifetime counter

– 128 entry STAB per processor

– 32 Entry PPB per processor/directory controller (256 PPB virtual namespace)

TPC-H/SPECweb99 performance within the margin of error of the infinite-resource results


Non-atomicity of writes

Absent from model

Effect on optimizations

– Forces unnecessary orders to exist

– Correct, but another example of over-conservatism

Hopefully, infrequent performance divot

Processor p1

st r1, [A]

Processor p2

ld r1, [A]
st r2, [r1]

Processor p3

ld r1, [B]
membar
ld r2, [A]


ECDC Base machine model

PHARMsim: Based on SimpleMP including Sun Gigaplane-like snooping coherence protocol [Rajwar], within the SimOS-PPC full-system simulator

Out-of-order execution core: 15-stage, 8-wide pipeline; 256 entry reorder buffer, 128 entry load/store queue; 32 entry issue queue

Functional units (latency): 8 Int ALUs (1), 3 Int MULT/DIV (3/12), 4 FP ALUs (4), 4 FP MULT/DIV (4,4); 4 L1 Dcache load ports in OoO window; 1 L1 Dcache load/store port at commit

Front-end: Combined bimodal (16k entry)/gshare (16k entry) branch predictor with 16k entry selection table, 64 entry RAS, 8k entry 4-way BTB

Cache Hierarchy (latency): 32k DM L1 icache (1), 32k DM L1 dcache (1); 256K 8-way L2 (7), 16 MB 8-way L3 (15), 128 byte cache lines; Stride-based prefetcher modeled after Power4

Memory system (latency): 2-D static DOR routed torus interconnect, 60 cycle per link+route (40 GB/s bandwidth per link, 5 GHz clock); Memory (400 cycle best-case latency, 10 GB/s bandwidth)


Mapping ECDC to HW

STAB – Maintains supplanting probe for each stale cache block

PPB – Maintains approximation of upstream sets

In caches – 2 extra bits for stale state and synch heuristic

[Figure: node block diagram with P, I$, D$, L2 $, NIC, MemCtr, Dir, and DRAM, plus the added STAB, PPB, and Castout PPB structures]


Probe representation

Each probe represented by n-bit timer

Stale block may be used until supplanting probe timer expires

Probe set in p-processor system represented by p timers
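One way to picture the timer representation is the sketch below. It assumes a timer that starts at its maximum n-bit value and counts down to zero; the counting direction, the class names, and the `tick` interface are assumptions made for illustration only.

```python
class ProbeTimer:
    """n-bit countdown timer standing in for one probe."""

    def __init__(self, n_bits):
        self.remaining = (1 << n_bits) - 1   # largest value an n-bit counter holds

    def tick(self, cycles=1):
        self.remaining = max(0, self.remaining - cycles)

    @property
    def expired(self):
        return self.remaining == 0

class StaleBlock:
    """A stale cache block guarded by its supplanting probe's timer."""

    def __init__(self, value, n_bits):
        self.value = value
        self.supplanter = ProbeTimer(n_bits)

    def read(self):
        # The stale value is usable only until the supplanting probe timer expires.
        return None if self.supplanter.expired else self.value

def new_probe_set(p, n_bits):
    """A probe set in a p-processor system: one timer per processor."""
    return [ProbeTimer(n_bits) for _ in range(p)]
```

Replacing explicit probe identities with timers trades precision for storage: a block may be discarded earlier than strictly necessary, but never later than the timer allows.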


STAB Detail

[Figure: STAB entries with address and timer fields, shown alongside the cache, incoming invalidates, and per-processor counters for p1, p2, p3]


PPB Detail

[Figure: PPB structure showing the address hash, timer index table, shift register/probe timers, incoming upstream set, and expired upstream set]


Memory consistency review

Memory consistency model

– Specifies the programming interface to a shared memory

– i.e. the allowable interleaving of instructions

Models discussed here:

– Sequential Consistency

– Processor Consistency

• No store-to-load program order

– Weak Ordering

• Order wrt memory barriers

• Same-address order

• Dependence order


Example – necessary miss (SC)

[Figure: constraint graph over Proc 1 and Proc 2 with operations LD A, ST B, LD B, ST A, LD A, program order (PO) edges, and edges labeled RAW and WAR]

Block A is in proc 1’s cache, valid bit = 1


Example – avoidable miss (SC)

[Figure: constraint graph over Proc 1 and Proc 2 with operations LD A, ST B, LD B, ST A, LD A, program order (PO) edges, and edges labeled RAW and WAR]


Typical ReadX transaction

When sending invalidation, create probe, add to PPB

At receipt of invalidation (2b, 2c) add probe to STAB

When sending invalidate acknowledgment, add probe set to the response

When receiving invalidate acknowledgment, add incoming probe set to the PPB

[Figure: ReadX transaction among nodes R, H, S1, and S2. Messages: 1. ReadX; 2(a) Sharers/Data; 2(b) Inval; 2(c) Inval; 3(a) Inval Ack; 3(b) Inval Ack]
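The four probe-handling rules for this transaction can be traced with a toy model. Several things here are assumptions, not facts from the slides: the `EcdcNode` class, modelling the PPB as an exact set and the STAB as a dictionary, treating the requester as the node that creates the probe, and delivering acknowledgments straight to the requester.

```python
class EcdcNode:
    def __init__(self, name):
        self.name = name
        self.ppb = set()   # stands in for the PPB: this node's upstream probes
        self.stab = {}     # stands in for the STAB: addr -> supplanting probe

def readx(requester, sharers, addr, probe_id):
    """Trace probe movement for one ReadX; returns the newly created probe."""
    # When sending the invalidation: create a probe and add it to the PPB.
    probe = (requester.name, probe_id)
    requester.ppb.add(probe)
    for sharer in sharers:
        # At receipt of the invalidation: add the probe to the sharer's STAB.
        sharer.stab[addr] = probe
        # When sending the acknowledgment: attach the sharer's probe set.
        ack_probe_set = set(sharer.ppb)
        # When receiving the acknowledgment: merge the incoming set into the PPB.
        requester.ppb |= ack_probe_set
    return probe
```

After the transaction the writer has inherited whatever each sharer had already observed, which is how dependences propagate transitively through later messages.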


Invalidation to read distance

[Figure: % of load coherence misses (y-axis, 0% to 100%) versus invalidation-to-read distance in cycles (x-axis, log scale, 1 to 1E+09) for fft, fmm, ocean, radix, raytrace, SPECjbb2000, SPECweb99, TPC-B, TPC-H]


Invalidation to read distance (synch)

[Figure: % of load coherence misses (y-axis, 0% to 100%) versus invalidation-to-read distance in cycles (x-axis, log scale, 1 to 1E+09) for fft, fmm, ocean, radix, raytrace, SPECjbb2000, SPECweb99, TPC-B, TPC-H]


Invalidation to read distance (data)

[Figure: % of load coherence misses (y-axis, 0% to 100%) versus invalidation-to-read distance in cycles (x-axis, log scale, 1 to 1E+09) for fft, fmm, ocean, radix, raytrace, SPECjbb2000, SPECweb99, TPC-B, TPC-H]


STAB entry death cdf

[Figure: fraction of STAB entries deallocated (y-axis, 0 to 1) versus cycles (x-axis, log scale, 1 to 1000000) for fft, fmm, ocean, radix, raytrace, SPECjbb2000, SPECweb99, TPC-B, TPC-H]


STAB Entry Lifetime


ECDC performance (16k probe lifetime)


ECDC Perf (128 entry STAB, 32 entry PPB, 256 entry namespace)


ProbePropagation messages


ECDC Storage Overhead

[Figure: storage in KB (y-axis, 0 to 350) versus processor count (x-axis: 4p, 8p, 16p, 32p, 64p, 128p, 256p, 512p, 1024p)]


What about limit study?

Indicated a larger number of avoidable coherence misses

Reasons:

– Did not account for non-speculative nature of protocol (oracle ECDC could be better)

– Inaccurate measurement of critical writes

• Many loads perform polling to lines that have never been touched by a load-linked or store-conditional

– Used isolated stale data detection mechanism


What about speculative load squashes?

In a few applications, they occur frequently (SPECjbb2000, TPC-H)

Implemented/evaluated read-set-tracking w/ squash on miss

Could eliminate a large fraction of squashes

– Unfortunately, little performance improvement

– Presumably, many squashes caused by contended spin locks


ECDC and other consistency models

Stricter model => more ProbePropagation messages

Potential for release consistency

In SC/PC/TSO, ECDC benefits will probably be dominated by extra ProbePropagation messages


Cause of STAB entry deallocation


Publications

[ISCA ’04] Memory ordering: A Value-based approach.

– Selected for IEEE Micro Top Picks ’04

[PACT ’03] Constraint Graph Analysis of Multithreaded Programs.

– Selected for Best of PACT JILP Issue

[PACT ’03] Redeeming IPC as a Performance Metric for Multithreaded Programs.

[CAECW ’02] Precise and Accurate Processor Simulation

[SPAA Revue ’02] Verifying Sequential Consistency Using Vector Clocks.

[Micro ’01] Correctly Implementing Value Prediction in Microprocessors that Support Multithreading or Multiprocessing.

[WBT ’01] A Dynamic Binary Translation Approach to Architectural Simulation

[HPCA ’01] An Architectural Characterization of Java TPC-W.

[Euro-Par ’00] A Callgraph-Based Search Strategy for Automated Performance Diagnosis.

– Selected as distinguished paper

[CAECW ’00] Characterizing a Java Implementation of TPC-W
