CS-510
Transactional Memory: Architectural Support for Lock-Free Data Structures
By Maurice Herlihy and J. Eliot B. Moss, 1993
Presented by Steve Coward, PSU Fall 2011 CS-510, 11/02/2011
Slide content heavily borrowed from Ashish Jha, PSU SP 2010 CS-510, 05/04/2010


Page 1

CS-510
Transactional Memory: Architectural Support for Lock-Free Data Structures
By Maurice Herlihy and J. Eliot B. Moss, 1993
Presented by Steve Coward, PSU Fall 2011 CS-510, 11/02/2011
Slide content heavily borrowed from Ashish Jha, PSU SP 2010 CS-510, 05/04/2010

Page 2

Agenda

Lock-based synchronization
Non-blocking synchronization
TM (Transactional Memory) concept
A HW-based TM implementation
– Core additions
– ISA additions
– Transactional cache
– Cache coherence protocol changes
Test methodology and results
Blue Gene/Q
Summary

Page 3

Lock-based synchronization

Generally easy to use, except not composable
Generally easy to reason about
Does not scale well, due to lock arbitration/communication
Pessimistic synchronization approach
Uses mutual exclusion
– Blocking: only ONE process/thread can execute the critical section at a time
High context-switch overhead

Classic failure modes:
– Priority inversion: a lo-priority Proc A holds lock X; a hi-priority Proc B preempts it and can't proceed, while a med-priority Proc C keeps the de-scheduled lock holder off the CPU
– Convoying: the holder of lock X is de-scheduled (e.g. quantum expiration, page fault, other interrupts); every other process needing X piles up behind it and can't proceed
– Deadlock: Proc A holds lock X and requests lock Y, while Proc B holds lock Y and requests lock X; neither can proceed
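The deadlock above comes from a circular wait: Proc A holds X and wants Y while Proc B holds Y and wants X. A common remedy (not from the slides) is to impose a global lock-acquisition order. A minimal sketch, with hypothetical lock_t and first_of names:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { int id; } lock_t;

/* Under an address-based total order, the lock to take first is the
   same no matter which order the caller names them in. */
const lock_t *first_of(const lock_t *a, const lock_t *b) {
    return ((uintptr_t)a < (uintptr_t)b) ? a : b;
}

static lock_t lock_x, lock_y;

/* Proc A would ask for (X, Y) and Proc B for (Y, X); both start with the
   same lock, so the circular wait that causes deadlock cannot form. */
bool no_circular_wait(void) {
    return first_of(&lock_x, &lock_y) == first_of(&lock_y, &lock_x);
}
```

Consistent ordering removes only deadlock; priority inversion and convoying need other mechanisms (priority inheritance, queuing).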

Page 4

Lock-Free Synchronization

Non-blocking: optimistic, does not use mutual exclusion
Uses RMW operations such as CAS and LL/SC
– Limited to operations on single words or double words
Avoids common problems seen with conventional techniques, such as priority inversion, convoying, and deadlock
Difficult programming logic
Even in the absence of the above problems, lock-free as implemented in SW doesn't perform as well as a lock-based approach
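As a concrete sketch of the CAS style mentioned above (using C11 stdatomic, not code from the paper), a lock-free single-word counter increment looks like:

```c
#include <stdatomic.h>

static atomic_long counter;   /* zero-initialized shared counter */

/* Lock-free increment: read, compute, and publish with compare-and-swap.
   If another thread raced us, the CAS fails, 'old' is refreshed with the
   current value, and we simply retry. */
long lockfree_increment(void) {
    long old = atomic_load(&counter);
    while (!atomic_compare_exchange_weak(&counter, &old, old + 1)) {
        /* retry with the refreshed value of 'old' */
    }
    return old + 1;
}
```

Note this only works because the counter fits in a single word; this is exactly the limitation TM removes.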

Page 5

Non-Blocking Wish List

Simple programming usage model
Avoids priority inversion, convoying, and deadlock
Equivalent or better performance than a lock-based approach
– Less data copying
No restrictions on data set size or contiguity
Composable
Wait-free

Enter Transactional Memory (TM) …

Page 6

What is a transaction (tx)?

A tx is a finite sequence of revocable operations, executed by a process, that satisfies two properties:
– Serializable
– Steps of one tx are not seen to be interleaved with the steps of another tx
– Tx's appear to all processes to execute in the same order
– Atomic
– Each tx either aborts or commits
– Abort causes all tentative changes of the tx to be discarded
– Commit causes all tentative changes of the tx to be made globally visible, effectively instantaneously
This paper assumes that a process executes at most one tx at a time
– Tx's do not nest (but nesting seems like a nice feature)
– Hence, these tx's are not composable
– Tx's do not overlap (seems less useful)

Page 7

What is Transactional Memory?

Transactional Memory (TM) is a lock-free, non-blocking concurrency control mechanism, based on tx's, that allows a programmer to define customized read-modify-write operations that apply to multiple, independently chosen memory locations
Non-blocking
– Multiple tx's optimistically execute the critical section in parallel (on different CPUs)
– If a conflict occurs, only one can succeed; the others can retry

Page 8

Basic Transaction Concept

Proc A                            Proc B
beginTx:                          beginTx:
  [A]=1;                            z=[A];
  [B]=2;                            y=[B];
  [C]=3;                            [C]=y;
  x=_VALIDATE();                    x=_VALIDATE();
  IF (x) _COMMIT();                 IF (x) _COMMIT();
  ELSE _ABORT(); GOTO beginTx;      ELSE _ABORT(); GOTO beginTx;

// _COMMIT – instantaneously make all above changes visible to all Procs
// _ABORT – discard all above changes, may try again

Atomicity: ALL or NOTHING
True concurrency, optimistic execution
How is validity determined?
Serialization ensured if only one tx commits and the others abort
Linearization ensured by the (combined) atomicity of the validate, commit, and abort operations
How to COMMIT? How to ABORT? Changes must be revocable!

Page 9

TM vs. SW Non-Blocking

Critical section structure:
LT or LTX   // READ set of locations
VALIDATE    // CHECK consistency of READ values (fail => retry)
ST          // MODIFY set of locations
COMMIT      // pass => HW makes changes PERMANENT; fail => retry

// TM
while (1) {
    curCnt = LTX(&cnt);
    // do work
    if (Validate()) {
        int c = curCnt + 1;
        ST(&cnt, c);
        if (Commit()) return;
    }
}

// Non-blocking
while (1) {
    curCnt = LL(&cnt);
    // copy object
    if (Validate(&cnt)) {
        int c = curCnt + 1;
        if (SC(&cnt, c)) return;
    }
}

Page 10

HW or SW TM?

TM may be implemented in HW or SW
– Many SW TM library packages exist: C++, C#, Java, Haskell, Python, etc.
– 2-3 orders of magnitude slower than other synchronization approaches
This paper focuses on a HW implementation
– HW offers significantly greater performance than SW for reasonable cost
– Minor tweaks to CPUs*: core, ISA, caches, bus and cache-coherency protocols
– Leverages the cache-coherency protocol to maintain tx memory consistency
– But a HW implementation w/o SW cache-overflow handling is problematic

* See Blue Gene/Q

Page 11

Core: TM Updates

Each CPU maintains two additional status register bits
– TACTIVE: flag indicating whether a transaction is in progress on this CPU
– Implicitly set upon entering a transaction
– TSTATUS: flag indicating whether the active transaction has conflicted with another transaction

TACTIVE  TSTATUS  Meaning
False    DC       No tx active
True     False    Orphan tx – executing, conflict detected, will abort
True     True     Active tx – executing, no conflict yet detected

Page 12

ISA: TM Memory Operations

LT: load from shared memory to register
ST: tentative write of register to shared memory, which becomes visible upon successful COMMIT
LTX: LT + intent to write to the same location later
– A performance optimization for early conflict detection

ISA: TM Verification Operations

VALIDATE: validate consistency of the read set
– Avoids a misbehaved orphan
ABORT: unconditionally discard all updates
– Used to handle extraordinary circumstances
COMMIT: attempt to make tentative updates permanent
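To make the buffering semantics of these operations concrete, here is a single-threaded software sketch (ours, not the paper's HW): ST buffers tentative writes in a write set and COMMIT publishes them; LTX would behave like LT here since no real conflict detection is modeled.

```c
#include <stdbool.h>

#define MAX_WRITES 16

/* Tentative write set: values written by ST live here, not in memory,
   until COMMIT makes them visible. */
static struct { long *addr; long val; } wset[MAX_WRITES];
static int n_writes;

long LT(long *addr) {                  /* transactional read */
    for (int i = 0; i < n_writes; i++)
        if (wset[i].addr == addr)      /* a tx sees its own tentative writes */
            return wset[i].val;
    return *addr;
}

void ST(long *addr, long v) {          /* tentative write, buffered */
    for (int i = 0; i < n_writes; i++)
        if (wset[i].addr == addr) { wset[i].val = v; return; }
    wset[n_writes].addr = addr;
    wset[n_writes++].val = v;
}

bool COMMIT(void) {                    /* publish the write set */
    for (int i = 0; i < n_writes; i++)
        *wset[i].addr = wset[i].val;
    n_writes = 0;
    return true;
}

void ABORT(void) { n_writes = 0; }     /* discard all tentative writes */

static long shared_cnt;

long demo_increment(void) {            /* counter tx in the paper's style */
    ST(&shared_cnt, LT(&shared_cnt) + 1);
    COMMIT();
    return shared_cnt;
}
```

Until COMMIT runs, shared_cnt itself is untouched, which is exactly what makes the tx revocable.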

Page 13

TM Conflict Detection

LT reg, [MEM]     // pure Tx READ
LTX reg, [MEM]    // Tx READ with intent to WRITE later
ST [MEM], reg     // Tx WRITE to local cache; value globally visible only after COMMIT
LOAD reg, [MEM]   // non-Tx READ
STORE [MEM], reg  // non-Tx WRITE – careful!

LOAD and STORE are supported but do not affect a tx's READ or WRITE set
– Why would a STORE be performed within a tx? Left to implementation
– Interaction between tx and non-tx operations to the same address is generally a programming error
– Consider LOAD/STORE as committed tx's with conflict potential; otherwise the outcome is non-linearizable

Tx dependencies: READ-SET + WRITE-SET = DATA-SET
Tx abort conditions* (as in the reader/writer paradigm):
– DATA-SET updated by another tx? Yes => ABORT (discard changes to WRITE-SET)
– WRITE-SET read by any other tx? Yes => ABORT
– Otherwise => COMMIT (WRITE-SET becomes visible to other processes)

* Subject to arbitration
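The abort-condition diagram reads as a simple predicate. A sketch with the sets modeled as bitmasks over memory locations (our simplification, not the paper's representation):

```c
#include <stdbool.h>

/* A tx must abort if its DATA-SET (READ-SET | WRITE-SET) was updated by
   another tx, or if its WRITE-SET was read by another tx. */
bool must_abort(unsigned read_set, unsigned write_set,
                unsigned remote_writes, unsigned remote_reads) {
    unsigned data_set = read_set | write_set;
    if (data_set & remote_writes) return true;   /* DATA-SET updated?        */
    if (write_set & remote_reads) return true;   /* WRITE-SET read remotely? */
    return false;                                /* otherwise may COMMIT     */
}
```

Note the asymmetry: a remote read of the READ-SET alone is harmless, just as multiple readers coexist in the reader/writer paradigm.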

Page 14

TM Cache Architecture

Two primary, mutually exclusive first-level caches, accessed sequentially based on instruction type
– In the absence of TM, non-Tx ops use the same caches, control logic, and coherency protocols as a non-Tx architecture
– To avoid impairing design/performance of the regular cache
– To prevent the regular cache's set size from limiting max tx size

Tx cache
– Fully-associative – otherwise, how would tx address collisions be handled?
– Single-cycle COMMIT and ABORT
– Small – size is implementation dependent (the Intel Core i5 has a first-level TLB with 32 entries!)
– Holds all tentative writes w/o propagating them to other caches or memory
– May be extended to act as a victim cache
– Upon ABORT, modified lines are set to the INVALID state
– Upon COMMIT, lines can be snooped by other processors and are written back to memory upon replacement

Figure: core with two first-level caches – L1D (direct-mapped, exclusive, 2048 lines x 8B) and Tx cache (fully-associative, exclusive, 64 lines x 8B); 1 clk to the 1st level, 4 clk toward L2D/L3D and main memory

Page 15

HW TM Leverages Cache Protocol

M – cache line only in current cache and is modified
E – cache line only in current cache and is not modified
S – cache line in current cache and possibly other caches, and is not modified
I – invalid cache line

Tx commit logic must detect the following events (akin to R/W locking)
– Local read, remote write: S -> I, E -> I
– Local write, remote write: M -> I
– Local write, remote read: M, snoop read, write back -> S

Works with bus-based (snoopy cache) or network-based (directory) architectures

http://thanglongedu.org/showthread.php?2226-Nehalem/page2
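The event classes above can be written as a next-state function over the MESI states (a sketch; the enum and function names are ours):

```c
typedef enum { MESI_M, MESI_E, MESI_S, MESI_I } mesi_t;
typedef enum { REMOTE_READ, REMOTE_WRITE } snoop_t;

/* State of a local line after snooping a remote access, per the slide:
   any remote write invalidates our copy; a remote read of a modified
   (or exclusive) line demotes it to shared, with write back if modified. */
mesi_t next_state(mesi_t s, snoop_t e) {
    if (e == REMOTE_WRITE)
        return MESI_I;                 /* S -> I, E -> I, M -> I        */
    if (s == MESI_M || s == MESI_E)
        return MESI_S;                 /* WB if M, then shared          */
    return s;                          /* S and I unchanged on a read   */
}
```

These are exactly the transitions the tx commit logic snoops for to detect read/write conflicts.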

Page 16

Protocol: Cache States

Every Tx op allocates 2 CL entries (2-cycle allocation): an XCOMMIT entry for the old value and an XABORT entry for the new value
– A single CL entry is also possible – effectively makes the tx cache twice as big
– The two-entry scheme has optimizations
– Allows rollback (and roll forward) w/o bus traffic
– LT allocates one entry (if LTX is used properly)
– The second LTX cycle can be hidden on a cache hit
– The authors' decision appears to be somewhat arbitrary
A dirty value "originally read" must either be
– WB to memory, or
– Allocated to an XCOMMIT entry as its "old" entry – avoids WBs to memory and improves performance

Cache line states
Name      Access  Shared?  Modified?
INVALID   none    —        —
VALID     R       Yes      No
DIRTY     R, W    No       Yes
RESERVED  R, W    No       No

Tx tags
Name     Meaning
EMPTY    contains no data
NORMAL   contains committed data
XCOMMIT  discard on commit
XABORT   discard on abort

On COMMIT: XCOMMIT (old value) entries are dropped; XABORT (new value) entries become NORMAL
On ABORT: XABORT (new value) entries are dropped; XCOMMIT (old value) entries become NORMAL

CL replacement: search for an EMPTY entry; if none, replace a NORMAL entry; if none, replace an XCOMMIT entry (WB first if DIRTY); if none remain, ABORT the tx / trap to SW

Page 17

ISA: TM Verification Operations

Orphan = TACTIVE==TRUE && TSTATUS==FALSE
– The tx continues to execute but will fail at commit
Commit does not force a write back to memory
– Memory is written only when a CL is evicted or invalidated
Conditions for calling ABORT
– Interrupts
– Tx cache overflow

VALIDATE [Mem]
– If TSTATUS==TRUE: return TRUE
– If TSTATUS==FALSE: set TSTATUS=TRUE, TACTIVE=FALSE; return FALSE

ABORT
– For ALL entries: 1. drop XABORT, 2. set XCOMMIT to NORMAL
– Set TSTATUS=TRUE, TACTIVE=FALSE

COMMIT
– If TSTATUS==FALSE: ABORT; return FALSE
– If TSTATUS==TRUE: for ALL entries, 1. drop XCOMMIT, 2. set XABORT to NORMAL; set TSTATUS=TRUE, TACTIVE=FALSE; return TRUE
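The tag flips that commit and abort perform over the tx cache can be sketched as a walk over an array of tags (our software model of the entry sweep, not the single-cycle HW):

```c
typedef enum { EMPTY, NORMAL, XCOMMIT, XABORT } txtag_t;

/* COMMIT keeps the new values: old-value (XCOMMIT) entries are dropped
   and new-value (XABORT) entries become committed NORMAL data. */
void tx_commit(txtag_t tag[], int n) {
    for (int i = 0; i < n; i++) {
        if (tag[i] == XCOMMIT) tag[i] = EMPTY;
        else if (tag[i] == XABORT) tag[i] = NORMAL;
    }
}

/* ABORT keeps the old values: new-value entries are dropped and
   old-value entries revert to NORMAL. */
void tx_abort(txtag_t tag[], int n) {
    for (int i = 0; i < n; i++) {
        if (tag[i] == XABORT) tag[i] = EMPTY;
        else if (tag[i] == XCOMMIT) tag[i] = NORMAL;
    }
}

int demo_commit_keeps_new(void) {
    txtag_t t[2] = { XCOMMIT, XABORT };   /* one old/new entry pair */
    tx_commit(t, 2);
    return t[0] == EMPTY && t[1] == NORMAL;
}

int demo_abort_keeps_old(void) {
    txtag_t t[2] = { XCOMMIT, XABORT };
    tx_abort(t, 2);
    return t[0] == NORMAL && t[1] == EMPTY;
}
```

Because only tags change, both outcomes need no data movement, which is what makes single-cycle commit/abort plausible in HW.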

Page 18

TM Memory Access

LT reg, [Mem]  // search tx cache
– XABORT DATA hit? Return DATA
– NORMAL DATA hit? Retag the entry XABORT and allocate an XCOMMIT copy; return DATA
– Tx cache miss: issue a T_READ cycle
– OK: allocate XABORT DATA + XCOMMIT DATA; CL state as in Goodman's protocol for LOAD (Valid); return DATA
– BUSY: ABORT the tx – for ALL entries drop XABORT, set XCOMMIT to NORMAL; TSTATUS=FALSE; return arbitrary DATA

LTX reg, [Mem]  // search tx cache
– Same as LT, except a miss issues a T_RFO cycle, and the CL state is as in Goodman's protocol for LOAD (Reserved)

ST [Mem], reg  // to tx cache only!!!
– XABORT DATA hit? Update the XABORT entry with NEW DATA
– NORMAL DATA hit? Retag the entry XCOMMIT (old DATA) and allocate an XABORT entry with NEW DATA
– Tx cache miss: issue a T_RFO cycle
– OK: allocate XCOMMIT DATA + XABORT NEW DATA; CL state as in Goodman's protocol for STORE (Dirty)
– BUSY: ABORT the tx as above; return arbitrary DATA

Bus cycles
Name    Kind     Meaning        New access
READ    regular  read value     shared
RFO     regular  read value     exclusive
WRITE   both     write back     exclusive
T_READ  Tx       read value     shared
T_RFO   Tx       read value     exclusive
BUSY    Tx       refuse access  unchanged

Tx requests are REFUSED by a BUSY response
– The tx aborts and retries (after exponential backoff?)
– Prevents deadlock or continual mutual aborts
Exponential backoff is not implemented in HW
– Performance is parameter sensitive
– Benchmarks appear not to be optimized

Page 19

TM – Snoopy Cache Actions

Both the regular and tx caches snoop on the bus
Main memory responds to all L1 read misses
Main memory responds to cache-line-replacement WRITEs
If TSTATUS==FALSE, the tx cache acts as a regular cache (for NORMAL entries)

Line replacement action – regular or tx cache
Bus request  Action
WRITE        write data on bus, written to memory

Response to snoop actions – regular cache
Bus request  Current state      Next state / action
Read         VALID/DIRTY/RESV.  VALID / return data
RFO          VALID/DIRTY/RESV.  INVALID / return data
T_Read       VALID              VALID / return data
T_Read       DIRTY/RESV.        VALID / return data
T_RFO        VALID/DIRTY/RESV.  INVALID / return data

Response to snoop actions – tx cache
Bus request  Tx tag                 Current state      Next state / action
Read         NORMAL                 VALID/DIRTY/RESV.  VALID / return data
RFO          NORMAL                 VALID/DIRTY/RESV.  INVALID / return data
T_Read       NORMAL/XCOMMIT/XABORT  VALID              VALID / return data
T_Read       NORMAL/XCOMMIT/XABORT  DIRTY/RESV.        NA / BUSY signal
T_RFO        NORMAL/XCOMMIT/XABORT  DIRTY/RESV.        NA / BUSY signal

Page 20

Test Methodology

TM implemented in Proteus – an execution-driven simulator from MIT
– Two versions of the TM implementation
– Goodman's snoopy protocol for a bus-based architecture
– Chaiken directory protocol for the (simulated) Alewife machine
– 32 processors
– Memory latency of 4 clock cycles; 1st-level cache latency of 1 clock cycle
– 2048x8B direct-mapped regular cache; 64x8B fully-associative tx cache
– Strong memory consistency model
Compare TM to 4 different implementation techniques
– SW
– TTS (test-and-test-and-set) spinlock with exponential backoff [TTS Lock]
– SW queuing [MCS Lock] – a process unable to lock puts itself in the queue, eliminating poll time
– HW
– LL/SC (LOAD_LINKED/STORE_COND) with exponential backoff [LL/SC Direct/Lock]
– HW queuing [QOSB] – queue maintenance incorporated into the cache-coherency protocol (Goodman's QOSB protocol: head in memory, elements in unused CLs)
Benchmarks
– Counting (LL/SC used directly on the single-word counter variable)
– Producer & consumer
– Doubly-linked list
– All benchmarks do a fixed amount of work

Page 21

Counting Benchmark

N processes increment a shared counter 2^16/n times, n=1 to 32
Short CS with 2 shared-mem accesses, high contention
In the absence of contention, TTS makes 5 references to memory for each increment
– RD + test-and-set to acquire the lock + RD and WR in the CS + release the lock
TM requires only 3 memory accesses
– RD & WR to the counter and then COMMIT (no bus traffic)

void process (int work)
{
    int success = 0, backoff = BACKOFF_MIN;
    unsigned wait;
    while (success < work) {
        ST(&counter, LTX(&counter) + 1);
        if (COMMIT()) {
            success++;
            backoff = BACKOFF_MIN;
        } else {
            wait = random() % (1 << backoff);
            while (wait--);
            if (backoff < BACKOFF_MAX)
                backoff++;
        }
    }
}

SOURCE: from paper

Page 22

Counting Results

LL/SC outperforms TM
– LL/SC is applied directly to the counter variable; no explicit commit is required
– For the other benchmarks this advantage is lost, as the shared object spans multiple words; then the only way to use LL/SC is as a spin lock
TM has higher throughput than all other mechanisms at most levels of concurrency
– TM uses no explicit locks and so makes fewer accesses to memory (LL/SC: 2, TM: 3, TTS: 5)

Figure (copied from paper): total cycles needed to complete the benchmark vs. concurrent processes, bus and NW architectures; curves for TTS Lock, MCS Lock (SW queue), QOSB (HW queue), LL/SC Direct, and TM

Page 23

Prod/Cons Benchmark

N processes share a bounded buffer, initially empty
– Half produce items, half consume items
The benchmark finishes when 2^16 operations have completed

typedef struct {
    Word deqs;
    Word enqs;
    Word items[QUEUE_SIZE];
} queue;

unsigned queue_deq (queue *q)
{
    unsigned head, tail, result, wait;
    unsigned backoff = BACKOFF_MIN;
    while (1) {
        result = QUEUE_EMPTY;
        tail = LTX(&q->enqs);
        head = LTX(&q->deqs);
        if (head != tail) {                          /* queue not empty? */
            result = LT(&q->items[head % QUEUE_SIZE]);
            ST(&q->deqs, head + 1);                  /* advance counter */
        }
        if (COMMIT()) break;
        wait = random() % (1 << backoff);            /* abort => backoff */
        while (wait--);
        if (backoff < BACKOFF_MAX) backoff++;
    }
    return result;
}

SOURCE: from paper

Page 24

Prod/Cons Results

In the bus architecture, throughput is almost flat for all mechanisms
– TM yields higher throughput, but not as dramatically as in the counting benchmark
In the NW architecture, all throughputs suffer as contention increases
– TM suffers the least and wins

Figure (copied from paper): cycles needed to complete the benchmark vs. concurrent processes, bus and NW architectures; curves for TTS Lock, MCS Lock (SW queue), QOSB (HW queue), LL/SC Direct, and TM

Page 25

Doubly-Linked List Benchmark

N processes share a DL list anchored by Head & Tail pointers
– A process dequeues an item by removing the item pointed to by Tail, then enqueues it by threading it onto the list at Head
– A process that removes the last item sets both Head & Tail to NULL
– A process that inserts an item into an empty list sets both Head & Tail to point to the new item
The benchmark finishes when 2^16 operations have completed

typedef struct list_elem {
    struct list_elem *next;   /* next to dequeue */
    struct list_elem *prev;   /* previously enqueued */
    int value;
} entry;

shared entry *Head, *Tail;

void list_enq (entry *new)
{
    entry *old_tail;
    unsigned backoff = BACKOFF_MIN;
    unsigned wait;
    new->next = new->prev = NULL;
    while (TRUE) {
        old_tail = (entry *) LTX(&Tail);
        if (VALIDATE()) {
            ST(&new->prev, old_tail);
            if (old_tail == NULL) ST(&Head, new);
            else ST(&old_tail->next, new);
            ST(&Tail, new);
            if (COMMIT()) return;
        }
        wait = random() % (1 << backoff);
        while (wait--);
        if (backoff < BACKOFF_MAX) backoff++;
    }
}

SOURCE: from paper

Page 26

Doubly-Linked List Results

Concurrency is difficult to exploit by conventional means
– State-dependent concurrency is not simple to recognize using locks
– An enqueuer doesn't know whether it must lock the tail pointer until after it has locked the head pointer, and vice versa for dequeuers
– Queue non-empty: each tx modifies head or tail but not both, so enqueuers can (in principle) execute without interference from dequeuers, and vice versa
– Queue empty: a tx must modify both pointers, and enqueuers and dequeuers conflict
– Locking techniques use only a single lock
– Lower throughput, as a single lock prohibits overlapping of enqueues and dequeues
TM naturally permits this kind of parallelism

Figure (copied from paper): cycles needed to complete the benchmark vs. concurrent processes, bus and NW architectures; curves for TTS Lock, MCS Lock (SW queue), QOSB (HW queue), LL/SC Direct, and TM

Page 27

Blue Gene/Q Processor

First HW implementation of transactional memory
Used in the Sequoia supercomputer, built by IBM for Lawrence Livermore Labs; due to be completed in 2012
Sequoia is a 20-petaflop machine and may use up to 100K Blue Gene/Q chips
Blue Gene/Q has 18 cores
– One dedicated to OS tasks
– One held in reserve (fault tolerance?)
4-way hyper-threaded, 64-bit PowerPC A2 based
1.6 GHz, 205 Gflops, 55 W, 1.47 B transistors, 19 mm sq

Page 28

TM on Blue Gene/Q

Transactional memory only works intra-chip; inter-chip conflicts are not detected
Uses a tag scheme on the L2 cache memory
– Tags detect load/store data conflicts in a tx
– Cache data has a 'version' tag; the cache can store multiple versions of the same data
SW commences a tx, does its work, then tells the HW to attempt a commit
– If unsuccessful, SW must retry
Appears to be similar to the approach on Sun's Rock processor
Ruud Haring on TM: "a lot of neat trickery", "sheer genius"
– Full implementation much more complex than the paper suggests?
Blue Gene/Q is the exception and does not mark wide-scale acceptance of HW TM

Page 29

Pros & Cons Summary

Pros
– TM matches or outperforms atomic-update locking techniques for simple benchmarks
– Uses no locks and thus makes fewer memory accesses
– Avoids priority inversion, convoying, and deadlock
– Easy programming semantics
– Complex NB scenarios such as the doubly-linked list are more realizable through TM
– Allows true concurrency and hence is highly scalable (for smaller tx sizes)

Cons
– TM cannot perform irrevocable operations, including most I/O
– Single-cycle commit and abort restrict the size of the 1st-level cache, and hence tx size
– Is it good for anything other than data containers?
– Portability is restricted by the transactional cache size
– Still SW dependent
– Algorithm tuning benefits from SW-based adaptive backoff
– Tx cache overflow handling
– A longer tx increases the likelihood of being aborted by an interrupt or scheduling conflict
– A tx should be able to complete within one scheduling time slot
– Weaker consistency models require explicit barriers at tx start and end, impacting performance
– Other complications make it more difficult to implement in HW
– Multi-level caches
– Nested transactions (required for composability)
– Cache-coherency complexity on many-core SMP and NUMA architectures
– Theoretically subject to starvation
– An adaptive backoff strategy is the suggested fix – the authors used exponential backoff
– Else a queuing mechanism is needed
– Poor debugger support

Page 30

Summary

TM is a novel multi-processor architecture which allows easy lock-free multi-word synchronization in HW
– Leveraging the concept of database transactions
– Overcoming the single/double-word limitation
– Exploiting cache-coherency mechanisms

Wish                                            Granted?  Comment
Simple programming model                        Yes       Very elegant
Avoids priority inversion, convoying, deadlock  Yes       Inherent in NB approach
Equivalent or better performance than locks     Yes       For very small and short tx's
No restrictions on data set size or contiguity  No        Limited by practical considerations
Composable                                      No        Possible, but would add significant HW complexity
Wait-free                                       No        Possible, but would add significant HW complexity

Page 31

References

M. P. Herlihy and J. E. B. Moss. Transactional Memory: Architectural Support for Lock-Free Data Structures. Technical Report 92/07, Digital Cambridge Research Lab, One Kendall Square, Cambridge, MA 02139, December 1992.

Page 32

Appendix: Linearizability – Herlihy's Correctness Condition

Invocation of an object/function is followed by a response
History – a sequence of invocations and responses made by a set of threads
Sequential history – a history where each invocation is followed immediately by its response
Serializable – a history that can be reordered to form a sequential history consistent with the sequential definition of the object
Linearizable – a serializable history in which each response that preceded an invocation in the history also precedes it in the sequential reordering
An object is linearizable if all of its usage histories may be linearized

An example history:
A invokes lock | B invokes lock | A fails | B succeeds

Reordering 1 – a sequential history, but not a serializable reordering:
A invokes lock | A fails | B invokes lock | B succeeds

Reordering 2 – a linearizable reordering:
B invokes lock | B succeeds | A invokes lock | A fails

Page 33

Appendix: Goodman's Write-Once Protocol

D - line present only in one cache and differs from main memory. Cache must snoop for read requests.

R - line present only in one cache and matches main memory.

V - line may be present in other caches and matches main memory.

I - Invalid.

Writes to V, I are write-through; writes to D, R are write-back

Page 34

Appendix: Tx Memory and dB Tx's

Differences with dB tx's
– Disk vs. memory – HW is better suited to tx's than dB systems
– Tx's do not need to be durable (post termination)
– A dB is a closed system; tx's interact with non-tx operations
– Tx's must be backward compatible with the existing programming environment
– Success in the database transaction field does not translate to transactional memory