Memory Models in Software and in Hardware: Practical Considerations

Agenda

• Motivation

• Factors

• Levels of Memory Models
– Models for software: Java, CLI
– Models for hardware: IA-32, IA-64

MM Motivation and Factors

http://citeseer.nj.nec.com/adve95shared.html

MM Motivation

• Multithreaded programming
– Shared memory

• An example: producer/consumer queue

• Does it work correctly?
– The program performs the operations in the correct order!

Thread 1:
    Task t = new Task();
    queue.insert(t);

Thread 2:
    Task t = queue.get();
    t.run();
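One way to make the producer/consumer example above safe in Java is to use a queue that synchronizes internally. The sketch below is an assumption, not the slides' code: it swaps in `java.util.concurrent.LinkedBlockingQueue` for the unspecified `queue`, and the `Task` class with its `payload` and `ok` fields is hypothetical.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class ProducerConsumerDemo {
    // Hypothetical task type; the slides do not define Task.
    static class Task {
        int payload;      // plain field, written before the task is queued
        boolean ok;       // set by run()
        void run() { ok = (payload == 42); }
    }

    // Runs one producer/consumer exchange and reports whether the consumer
    // saw the fully initialized task.
    static boolean runOnce() {
        BlockingQueue<Task> queue = new LinkedBlockingQueue<>();
        Task[] consumed = new Task[1];

        Thread consumer = new Thread(() -> {     // "Thread 2"
            try {
                Task t = queue.take();           // blocks until a task arrives
                t.run();
                consumed[0] = t;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();

        try {                                    // "Thread 1"
            Task t = new Task();
            t.payload = 42;
            queue.put(t);                        // writes above happen-before the matching take()
            consumer.join();                     // join makes consumed[0] visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return consumed[0] != null && consumed[0].ok;
    }

    public static void main(String[] args) {
        System.out.println("task ran correctly: " + runOnce());
    }
}
```

The ordering guarantee here comes from the queue implementation, not from the order of statements in the program text, which is the point of the slide.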

Memory Model Levels

[Figure: three levels of memory models, with the Compiler and the VM as the layers between them]

• Programmer-Level Models: Java MM, CLI MM, SC, Coherence, Release Consistency, etc.

• Implementor-Level Models (Virtual Machine): Java Memory Model (Implementor View), Microsoft CLI

• Implementor-Level Models (Hardware): IA-32, IA-64, Alpha, PowerPC, TSO, PSO, etc.

Factors that Affect MM

• Compiler: performs optimizations

• [Virtual Machine]: yet more optimizations

• Processor: performs operations out of order

• Memory subsystem: delivers updates out of order

MM Factors: Compiler & VM

• Compilers
– Store values in registers
– Reorder operations

• Example

int x = 0, answer = 0;

Original:

    void f() {
      while (!answer) { x = x+1; }
    }

After compiler optimization:

    void f() {
      int tmp1 = x;
      int tmp2 = answer;
      while (!tmp2) { tmp1 = tmp1+1; }
      x = tmp1;
    }

Inside the optimized loop: no read from memory, no write to memory; the values are held in registers all the time
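In Java, the register-caching transformation shown above can be ruled out for the flag by declaring it `volatile`. The following is a minimal sketch under that assumption; the class name `SpinUntilAnswer` and the helper `workerStops` are hypothetical, and the slides' `int answer` is written as a `boolean` to fit Java.

```java
class SpinUntilAnswer {
    // Without 'volatile', the JIT compiler may hoist the read of 'answer'
    // out of the loop, as in the transformed f() on the slide.
    static volatile boolean answer = false;
    static int x = 0;   // written only by the spinning thread

    static void f() {
        while (!answer) {   // volatile read on every iteration
            x = x + 1;
        }
    }

    // Runs f() on a worker thread, sets the flag, and reports whether f() returned.
    static boolean workerStops() {
        answer = false;
        Thread worker = new Thread(SpinUntilAnswer::f);
        worker.setDaemon(true);      // do not keep the JVM alive if the loop never exits
        worker.start();
        try {
            Thread.sleep(50);        // let the worker spin for a while
            answer = true;           // volatile write, must become visible to the worker
            worker.join(5000);       // wait up to 5 s for the loop to exit
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return !worker.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("worker stopped: " + workerStops());
    }
}
```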

MM Factors: Processor

• Includes a lot of features that help it tolerate memory latency
– Most of them change the order of memory operations

• Examples
– Out-of-order execution: the most important performance-enabler of modern processors
– Write combining: reads/writes to the same cache line
– Read/write buffers
– Many more

MM Factors: Memory Subsystem

• Hardware
– Cache Coherence Protocols

• Software
– DSM Coherence Protocols

The Tradeoff

The more optimizations there are in the system, the less transparent it is to the programmer

[Figure: a spectrum from Sequential Consistency to Any Order; transparency is highest at the Sequential Consistency end, performance at the Any Order end]

Programmer View Models

Java – Original specification

Java – New specification

Microsoft’s CLI (.NET) specification

Java MM – Original Spec

• Java Language Specification, Chapter 17
  http://java.sun.com/docs/books/jls/

• A. Gontmakher, A. Schuster, ACM TOCS, vol. 18, no. 4, pp. 333-386
  http://www.cs.technion.ac.il/~assaf/publications/java.ps

• Defines an abstract virtual machine
– Really hard to understand
– Non-compliant implementation by Sun (!!!)
– Many other problems

Java MM: Motivation

• Built-in synchronization
– Modeled after monitors
– Integrated with memory model

• Performance: Avoid synchronization
– Immutable objects

Java MM: The Abstract Model

[Figure: Thread 1 and Thread 2 each consist of an execution engine and a local memory, and both share one main memory. The execution engine and local memory are linked by use/assign, local memory and main memory by load/store on the thread side and read/write on the main-memory side.]

Java MM: The Constraints

• read x,v, load x,v, use x,v
• assign x,v, store x,v, write x,v
• read x,v, load x,v; write x,v, store x,v
• load x,v, use x,v
• store x,v, assign x,v (not always: Prescient Stores)
• … and more

[Figure: Thread 1's execution engine, its local memory, and main memory, linked by use/assign, load/store, and read/write]

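Java MM: Applying The Model

Program: one thread executes x==1; y=1, the other executes y==1; x=1.

Thread running y==1; x=1     Thread running x==1; y=1
read y,1                     read x,1
load y,1                     load x,1
use y,1                      use x,1
assign x,1                   assign y,1
store x,1                    store y,1
write x,1                    write y,1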

Java MM: How To Deal With

• Determine the dependencies between use/assigns that follow from the constraints

• Then, ignore all the operations except for use/assigns

• Non-Operational Model!

Java MM - Views

[Figure: the use/assign, load/store, and read/write layers, annotated with four views: Programmer View (non-operational), Implementor View (non-operational), Programmer View (operational), Implementor View (operational)]

Java MM: Characterizations

• Java is stronger than Coherence
– Proof below

• Volatile variables: Sequential Consistency

• Locks: variant of Release Consistency
– Semantics of locks not SC or PC (and not stated explicitly at all).

Java MM – Characterizations 2

• Full definition: regular variables
– Based on Legal Serialization. Constraints: see figure
– Excludes Prescient Stores
– Proof: 5+ pages

[Figure: constraint diagrams over the operations r x,v, w y,w and r/w x, labeled "Same Variable rule" and "Transistor rule". Legend: sees a value written by another thread]

Java MM – Characterizations 3

• Java: full definition (regular variables only)
– Constraints: see figure
– Includes Prescient Stores
– Proof: 20+ pages!
– Coherence follows from the first Constraint

[Figure: constraint diagrams over the operations r x,v, r y,1, r y,2, w y,1, w y,2, w y,w and r/w x. Legend: writes a value seen by another thread]

Java MM – Coherence Proof 1: Java is not weaker than Coherence

• Take operations for variable X from all threads.

• Divide each thread into blocks:

load-block: load (use)*

store-block: assign (use|assign)* store (use)*

• Each block: one load/store operation.

• Sort the blocks by their memory accesses.

• Result: legal serialization of use/assigns to X.

Java MM – Coherence Proof 2: Java is stronger than Coherence

Thread 1           Thread 2
1. use x,1         1. use y,1
2. assign y,1      2. assign x,1

• Coherence: easily shown

• Java (without Prescient Stores):
– Transistor Rule: 1.1 → 1.2, 2.1 → 2.2
– Legal Serialization: 2.2 → 1.1, 2.1 ← 1.2
– Cycle of dependencies!

Java MM – Coherence Proof 3: Prescient Stores

• A store can move presciently up
– Before its corresponding assign
– But not before another load/store

• The previous execution is now valid
– But it can still be fixed…

Thread 1       Thread 2       Thread 3
read x,1       read y,1       write x,2
read y,0       read x,0       write y,2
read y,2       read x,2
write y,1      write x,1

Annotations on the figure: "Necessarily has a load"; "The store, even prescient, now cannot move up"

Java MM: Conclusions

• Programming with Locks: easy

• Programming with volatile variables: easy

• Programming with regular variables:
– Using just Coherence – OK
– Using full definition – hard
– Really accounting for Prescient Stores – nightmare

New Java MM

In progress, by Bill Pugh et al.

http://www.javasoft.com/aboutJava/communityprocess/jsr/jsr_133.html

http://www.cs.umd.edu/~pugh/java/memoryModel/semantics.pdf

New Java MM: Motivation

• Correctly synchronized programs must have SC semantics

• Incorrectly synchronized programs must have (safe) semantics
– Safety: JVM must never fail
– Security: Prevent attacks based on unsynchronized code

New Java MM: Requirements

• Backward Compatibility
– No new language constructs
– No new VM instructions
– No system-specific artifacts, e.g. garbage collection

• Clear Distinction between compiler and VM
– No optimizations in the compiler
– Thus, VM model is the same as the one visible to the programmer

• Implementability
– No unrealistic requirements on software or hardware

New Java MM: The Approach

• Exact semantics for all memory accesses
– Not really relevant
– Except that SC for Properly Labelled (no data races) programs can be shown

• Semantics for support of established idioms
– Final fields
– Volatile variables
– Locks

• Quite practical

New Semantics of Final: Immutable objects

• Many objects in Java are designed to be immutable
– Rationale: avoiding synchronization
– Best known example – java.lang.String

• The problem: String not really immutable
– Can see writes to the buffer, but not to the length and offset!

• Security hole

New Semantics of Final: Fixing immutable objects

• Solution 1: Make ALL String methods synchronized
– Serious hit at performance
– Not needed on single-processor machines

• Solution 2: Extending semantics of final fields
– Access that reads a final field, sees it initialized
– An object must not escape the constructor

• Problem: String: array elements cannot be final
– “weak acquire semantics”: reads dependent on the final field are seen initialized too
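The final-field rules above can be illustrated with a small immutable class. This is only a sketch of the pattern: the class `Name` and its fields are hypothetical, and the guarantee itself (a reader never sees the default values of the final fields, nor uninitialized array elements reached through them) cannot be demonstrated deterministically in a test.

```java
class FinalFieldDemo {
    // Hypothetical immutable type, loosely modeled on the String layout the slides describe.
    static final class Name {
        private final char[] chars;   // final reference; elements are read through it
        private final int length;     // final, so readers see the initialized value

        Name(String text) {
            // 'this' is not handed to any other thread before the constructor returns.
            this.chars = text.toCharArray();
            this.length = chars.length;
        }

        int length() { return length; }

        // Reads the array elements through the final field 'chars'.
        String text() { return new String(chars, 0, length); }
    }

    public static void main(String[] args) {
        Name n = new Name("immutable");
        System.out.println(n.text() + " has length " + n.length());
    }
}
```

If the constructor stored `this` in a shared variable before returning, the object would escape and the final-field guarantee would no longer apply.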

New Semantics for Volatile

• Previously: Sequential Consistency
– But: no relation with the regular operations
– Not really useful for synchronization (recall the producer/consumer example)

• Now: Acquire/Release Semantics
– Read works as Acquire
– Write works as Release
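The acquire/release reading of volatile can be sketched as a publication pattern: a regular write followed by a volatile write on one side, a volatile read followed by a regular read on the other. The class and field names below are hypothetical.

```java
class VolatilePublishDemo {
    static int data;                 // regular variable
    static volatile boolean ready;   // volatile flag

    // Publishes 'data' through the volatile flag and returns what the reader saw.
    static int readAfterPublish() {
        data = 0;
        ready = false;
        int[] seen = new int[1];

        Thread reader = new Thread(() -> {
            while (!ready) {
                // spin: each iteration performs a volatile read (acquire)
            }
            seen[0] = data;          // ordered after the acquire, so it sees 42
        });
        reader.start();

        data = 42;                   // regular write
        ready = true;                // volatile write (release): publishes the write to 'data'
        try {
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println("reader saw data = " + readAfterPublish());
    }
}
```

Under the original specification, where volatile accesses had no ordering relation with regular accesses, the reader would not have been guaranteed to see the write to `data`.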

New Semantics of Volatile: Double-Checked Locking

• An object s must be created the first time it is requested:
    synchronized(this) { if (s==null) s = new S(); }
– Slow! Locking on each access

• Double-Checking:
    if (s==null) {
      synchronized(this) {
        if (s==null) s = new S();
      }
    }

• The reader can reorder access to s and to its fields

• But, if s is volatile, it works!
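A complete version of the double-checked idiom with a volatile reference might look like the sketch below. The holder class, the payload type `S` with its `value` field, and the `singleInstance` helper are hypothetical names introduced for illustration.

```java
class LazyHolder {
    // Hypothetical payload type standing in for the slides' S.
    static class S {
        int value = 7;   // non-final on purpose: safety comes from the volatile reference
    }

    private volatile S s;   // volatile is what makes the idiom work

    S get() {
        S local = s;                 // first check, no lock (volatile read)
        if (local == null) {
            synchronized (this) {
                local = s;           // second check, under the lock
                if (local == null) {
                    local = new S();
                    s = local;       // volatile write publishes the initialized object
                }
            }
        }
        return local;
    }

    // Calls get() from several threads and reports whether all of them
    // saw the same, fully initialized instance.
    static boolean singleInstance(int threads) {
        LazyHolder holder = new LazyHolder();
        S[] seen = new S[threads];
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            final int id = i;
            workers[i] = new Thread(() -> seen[id] = holder.get());
            workers[i].start();
        }
        try {
            for (Thread w : workers) w.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        for (S x : seen) {
            if (x == null || x != seen[0] || x.value != 7) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("single instance: " + singleInstance(8));
    }
}
```

Copying the field into `local` keeps the fast path to a single volatile read.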

New Semantics of Volatile: Advanced Double-Checking

    static volatile boolean initialized = false;

    if (!initialized) {
      synchronized(this) {
        if (!initialized) {
          s1 = new S();
          s1.connect(…);
          initialized = true;
        }
      }
    }

Final fields won’t help

New Semantics of Locks

• Only locks on the same variable have acquire/release semantics
– Simplifies implementation
– Different locks do not synchronize anyway, so no need for acquire

• In original spec, each lock is a memory barrier
– Even synchronized(new Object()) {}
– Compiler cannot safely remove locks
– In the new semantics, recursive locks are no-op
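The rule that only locks on the same object synchronize can be sketched with a counter whose reads and writes all go through one lock. The class and method names are hypothetical.

```java
class SameLockCounter {
    private final Object lock = new Object();
    private int count;   // regular variable, always accessed while holding 'lock'

    void increment() {
        synchronized (lock) {   // acquire on entry, release on exit
            count++;
        }
    }

    int get() {
        synchronized (lock) {   // same lock object, so earlier increments are visible
            return count;
        }
    }

    // Two threads increment the same counter; returns the final count.
    static int countWithTwoThreads(int perThread) {
        SameLockCounter c = new SameLockCounter();
        Runnable work = () -> {
            for (int i = 0; i < perThread; i++) c.increment();
        };
        Thread a = new Thread(work);
        Thread b = new Thread(work);
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return c.get();
    }

    public static void main(String[] args) {
        System.out.println("count = " + countWithTwoThreads(10000));
    }
}
```

If `increment` locked a fresh `new Object()` on each call instead, the blocks would neither exclude each other nor order the accesses to `count`, which is why the new semantics allows such locks to be removed.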

CLI Memory Model

The VM for Microsoft’s .NET

http://www.ecma.ch/ecma1/STAND/ecma-335.htm

Standard ECMA-335, Common Language Infrastructure

Chapter 11.6, Memory Model and Optimizations

CLI Memory Model

• So short!!! Just 4 pages

• The system
– Flat shared memory
– Threads access the same memory

• Any reordering of operations is permitted
– Except volatile reads/writes
– Except synchronous exceptions

• Atomic access defined for some operations

• Threading APIs define synchronization semantics

CLI: Volatile Consistency

• Volatile reads and writes
– Accesses to volatile variables
– Explicit methods: Thread.VolatileRead, Thread.VolatileWrite
– Thread.MemoryBarrier – same as both VolatileRead and VolatileWrite

• Volatile read – acquire semantics, volatile write – release semantics

• Different threads can see different orders of volatile writes of different threads

CLI: Locks

• Usual locking semantics: obtaining and releasing locks
– Synchronized methods
– System.Threading.Monitor class – simulates C.A.R. Hoare’s monitor (only tries to; simulation is no more complete than in Java)

• Acquiring lock has acquire semantics, releasing – release semantics

CLI: Atomic Memory Accesses

• Word-length accesses, aligned 4-byte accesses are atomic

• System.Threading.Interlocked: atomic read-modify-write operations
– Increment, Decrement, Exchange, CompareExchange

• One and Two-byte reads are atomic. Byte writes may write the whole word

Conclusions: Using CLI

• All concurrent accesses might be synchronized using synchronized methods or Monitor class

• Volatile variables: no common order. Probably usable in the simplest cases
– Designed for accessing hardware registers. There it fits

• Atomic memory access: no memory barrier semantics
– Probably just forgotten
– Useful in some simple cases

Conclusions: Implementing CLI

• Lots of disclaimers in the spec – no unimplementable requirements. Thus, implementation is straightforward
– For instance, Alpha has no instruction to write a byte, so an atomic byte write would be problematic to implement. Java has this problem

• On the other hand, all low-level mechanisms are present (Interlocked)

Conclusions: JVM vs. CLI

• Similar semantics for locks
– Except that in Java, nested locks are no-op, thus locks can be eliminated by the compiler
– In Java, acquire/release happens only if synchronizing on same lock object. In CLI – full acquire/release.

• Similar semantics for volatiles
– Except that volatiles consistency is weaker. It is unclear if the Double Checked Locking idiom should work

• Similarly unusable semantics for regular variables
– Except for Java’s provisions for object construction (semantics of final fields)

• Adds low-level interlocked accesses

Hardware Memory Models

IA-64 and IA-32

IA-32

• Memory reads: acquire semantics
– Except that reads can see local writes early; see below

• Memory writes: release semantics
– Except that there is no global order of writes; see below

• Interlocked memory accesses: using processor lock prefix

IA-64: Memory Accesses

• Regular memory accesses – unordered

• Attributes to memory accesses: release or acquire
– Acquire: ld.acq instruction
– Release: st.rel instruction

• Memory Fence (mf)
– AKA Memory Barrier, is both acquire and release.

IA-64: Atomic Accesses

• CMPXCHG (Compare and Exchange)
– Compare memory with a given value; exchange if equal
– Can have either acquire (cmpxchg.acq) or release (cmpxchg.rel) semantics

• FAA (fetch and add)
– Also acquire or release semantics

• XCHG (Exchange)
– Only acquire semantics
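The relationship between compare-and-exchange and fetch-and-add can be sketched in Java, where `java.util.concurrent.atomic.AtomicInteger` exposes a compare-and-set operation. This is an analogy, not IA-64 code: the Java operations used here have volatile read and write semantics and offer no choice between acquire-only and release-only forms. The helper names are hypothetical.

```java
import java.util.concurrent.atomic.AtomicInteger;

class CasFetchAndAdd {
    // Fetch-and-add built from compare-and-set: retry until no other thread
    // changed the cell between the read and the update. Returns the old value.
    static int fetchAndAdd(AtomicInteger cell, int delta) {
        while (true) {
            int expected = cell.get();
            if (cell.compareAndSet(expected, expected + delta)) {
                return expected;   // the exchange happened because the values were equal
            }
            // Otherwise another thread won the race; read again and retry.
        }
    }

    // Several threads add 1 to a shared cell; returns the final total.
    static int concurrentTotal(int threads, int perThread) {
        AtomicInteger cell = new AtomicInteger(0);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int k = 0; k < perThread; k++) fetchAndAdd(cell, 1);
            });
            workers[i].start();
        }
        try {
            for (Thread w : workers) w.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return cell.get();
    }

    public static void main(String[] args) {
        System.out.println("total = " + concurrentTotal(4, 10000));
    }
}
```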

IA-64: Semantics of ld.acq, st.rel

• Constraints (>> is program order, → is execution order):
– Acquire >> X ⇒ Acquire → X
– X >> Release ⇒ X → Release
– Fence >> X ⇒ Fence → X
– X >> Fence ⇒ X → Fence

• Global order of all the strong write operations

T1              T2               T3              T4
st.rel [x]=1    ld.acq r1=[x]    st.rel [y]=1    ld.acq r3=[y]
                ld r2=[y]                        ld r4=[x]

Forbidden: r1=1, r3=1, r2=0, r4=0

IA-64 Semantics: Exceptions

• Load may see value from store buffer

• Inserting mf between st.rel and ld.acq solves the problem

• But: in Java semantics, this execution is OK!

T1 T2

st.rel [x]=1 st.rel [y]=1

ld.acq r1=[x] ld.acq r3=[y]

ld r2=[y] ld r4=[x]

Permitted: r1=1, r3=1, r2=0, r4=0
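To see why the outcome above counts as an exception, it helps to check that no sequentially consistent execution can produce it. The sketch below is an illustration, not part of the slides: it enumerates every interleaving of the two threads against a single shared memory and collects the resulting register values. The encoding of operations as small integer arrays is an assumption made for brevity.

```java
import java.util.Set;
import java.util.TreeSet;

class ScLitmus {
    // Each operation is {kind, variable, register}.
    // kind 0: store 1 to the variable; kind 1: load the variable into the register.
    // Variables: 0 = x, 1 = y. Registers: 0 = r1, 1 = r2, 2 = r3, 3 = r4.
    static final int[][] T1 = { {0, 0, -1}, {1, 0, 0}, {1, 1, 1} }; // [x]=1; r1=[x]; r2=[y]
    static final int[][] T2 = { {0, 1, -1}, {1, 1, 2}, {1, 0, 3} }; // [y]=1; r3=[y]; r4=[x]

    // Executes one operation directly against the shared memory (no store buffer).
    static void step(int[] op, int[] mem, int[] regs) {
        if (op[0] == 0) {
            mem[op[1]] = 1;
        } else {
            regs[op[2]] = mem[op[1]];
        }
    }

    // Explores every interleaving that preserves each thread's program order.
    static void explore(int i, int j, int[] mem, int[] regs, Set<String> out) {
        if (i == T1.length && j == T2.length) {
            out.add("r1=" + regs[0] + " r2=" + regs[1] + " r3=" + regs[2] + " r4=" + regs[3]);
            return;
        }
        if (i < T1.length) {
            int[] m = mem.clone();
            int[] r = regs.clone();
            step(T1[i], m, r);
            explore(i + 1, j, m, r, out);
        }
        if (j < T2.length) {
            int[] m = mem.clone();
            int[] r = regs.clone();
            step(T2[j], m, r);
            explore(i, j + 1, m, r, out);
        }
    }

    // All outcomes reachable under sequential consistency.
    static Set<String> outcomes() {
        Set<String> out = new TreeSet<>();
        explore(0, 0, new int[2], new int[4], out);
        return out;
    }

    public static void main(String[] args) {
        for (String o : outcomes()) System.out.println(o);
    }
}
```

The outcome with r2=0 and r4=0 never appears in the enumerated set, because it would require each thread's load of the other variable to precede the other thread's store while following its own. A store buffer that lets a thread read its own store early removes that cycle.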

IA-64 Semantics: Conclusion

• Simple. Clean

• Very usable: direct mapping to both Java and CLI memory models
– Especially fits the new Java Memory Model (or more reasonably, the new Java Memory Model especially fits IA-64 ;)

• IA-32: Obviously developed before MP systems became common (for Intel processors)
– Cannot change architecture now
