
SYNCHRONIZATION
Spinlocks and all the rest

Synchronization Overview

Cache coherency
Single- versus multi-core
Under- versus oversubscribed
Atomic operations
…

Synchronization Overview

Spinlock

acquire_lock(lock) {
    // spin until TAS reports that the lock was previously free
    while (TAS(lock) == true)
        ;
}

TAS – test-and-set: writes true to the address and returns the old value
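As a concrete illustration of the pseudocode above, here is a minimal C11 sketch using atomic_flag, whose test-and-set plays the role of TAS; the spinlock_t, spin_acquire, and spin_release names are mine, not from the slides.

#include <stdatomic.h>

typedef struct { atomic_flag held; } spinlock_t;   // initialize 'held' with ATOMIC_FLAG_INIT

void spin_acquire(spinlock_t *l) {
    // atomic_flag_test_and_set returns the previous value: keep spinning while it was set
    while (atomic_flag_test_and_set(&l->held))
        ;
}

void spin_release(spinlock_t *l) {
    atomic_flag_clear(&l->held);
}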

Synchronization

Mellor-Crummey and Scott, 1991: analyzed spinlocks and barriers

○ Linear, proportional, and exponential backoff
○ Ticket locks -> “now serving”

Proposed the “MCS” lock, a queue-based lock

Overview

Synchronization Types to be Discussed
Further Developments
Implementation Details

Types to be Discussed

Mutual Exclusion
○ Spinlock
○ Mutex
○ Reader-Writer Lock

Execution Point
○ Barrier

Queues, etc. (time permitting)

Spinlocks

Spin until lock is acquired

Simple implementation
Contention on the lock

Queued Spinlock

Create a local lock
Spin on it
On release, signal the next waiter

Additional operations
Reduced contention

Mutex

Wait to acquire

May use the thread scheduler to wait
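A minimal usage sketch with POSIX threads: pthread_mutex_lock may park the calling thread in the scheduler until the mutex is free, which is the “wait to acquire” behavior above. The worker and shared_counter names are just for illustration.

#include <pthread.h>

pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
long shared_counter = 0;            // example shared state

void *worker(void *arg) {
    pthread_mutex_lock(&m);         // may block: the scheduler parks the thread until the mutex is free
    shared_counter++;               // critical section
    pthread_mutex_unlock(&m);       // wakes a waiter, if any
    return NULL;
}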

Reader Writer Lock

Readers can operate simultaneously with other readers

Only writers cause problems

Often a spinlock plus a count of readers
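One compact variant folds the lock and the reader count into a single atomic word; a hedged C11 sketch (the names and the state encoding are mine, not from the slides):

#include <stdatomic.h>

// state == 0: free; state > 0: number of active readers; state == -1: a writer holds it
typedef struct { atomic_int state; } rwlock_t;

void read_lock(rwlock_t *l) {
    for (;;) {
        int s = atomic_load(&l->state);
        if (s >= 0 && atomic_compare_exchange_weak(&l->state, &s, s + 1))
            return;                              // joined the readers
    }
}

void read_unlock(rwlock_t *l)  { atomic_fetch_sub(&l->state, 1); }

void write_lock(rwlock_t *l) {
    int expected = 0;
    while (!atomic_compare_exchange_weak(&l->state, &expected, -1))
        expected = 0;                            // retry until there are no readers and no writer
}

void write_unlock(rwlock_t *l) { atomic_store(&l->state, 0); }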

Barrier

Keep a group of threads in “sync”
The barrier has to recognize two events:

The old barrier, as some threads may not yet be active again

The new barrier, as some threads may already have reached it
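A common way to keep those two events apart is a sense-reversing centralized barrier. This is a standard textbook construction, sketched here in C11; it is not taken from these slides.

#include <stdatomic.h>

typedef struct {
    atomic_int remaining;   // threads still to arrive at the current barrier
    atomic_int sense;       // flips each barrier episode; starts at 0
    int total;              // number of participating threads
} barrier_t;

// local_sense is per-thread, initialized to 0 to match the barrier's initial sense
void barrier_wait(barrier_t *b, int *local_sense) {
    *local_sense = !*local_sense;                        // this thread now targets the "new" barrier
    if (atomic_fetch_sub(&b->remaining, 1) == 1) {       // last thread to arrive
        atomic_store(&b->remaining, b->total);           // reset for the next episode
        atomic_store(&b->sense, *local_sense);           // release everyone waiting on the "old" barrier
    } else {
        while (atomic_load(&b->sense) != *local_sense)   // spin until the sense flips
            ;
    }
}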

Further Developments

Scalable RW Lock

Modification to the MCS lock
Count of readers + writer-waiting flag
Queue of waiting threads
Readers unblock readers on acquire
Writers unblock the next thread on release

John M. Mellor-Crummey and Michael L. Scott. 1991. Scalable reader-writer synchronization for shared-memory multiprocessors. In Proceedings of the third ACM SIGPLAN symposium on Principles and practice of parallel programming (PPOPP '91). ACM, New York, NY, USA, 106-113.

Scalable RW Lock cont.

Split up the reader access

Since readers can share the lock with other readers, use multiple locks

Writers, however, need all of the reader locks (see the sketch below)

Wilson C. Hsieh and William E. Weihl. 1992. Scalable Reader-Writer Locks for Parallel Systems. In Proceedings of the 6th International Parallel Processing Symposium, Viktor K. Prasanna and Larry H. Canter (Eds.). IEEE Computer Society, Washington, DC, USA, 656-659.
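A hedged sketch of the general idea, not the paper's exact algorithm: give each reader its own lock slot so readers never contend with each other, while a writer must acquire every slot. NREADERS and the function names are my assumptions.

#include <stdatomic.h>

#define NREADERS 8                                              // assumed fixed number of reader slots

typedef struct { atomic_flag slot[NREADERS]; } dist_rwlock_t;   // each slot starts ATOMIC_FLAG_INIT

void read_lock(dist_rwlock_t *l, int my_slot) {                 // a reader only touches its own slot
    while (atomic_flag_test_and_set(&l->slot[my_slot]))
        ;
}

void read_unlock(dist_rwlock_t *l, int my_slot) {
    atomic_flag_clear(&l->slot[my_slot]);
}

void write_lock(dist_rwlock_t *l) {                             // a writer needs all of the reader locks
    for (int i = 0; i < NREADERS; i++)
        while (atomic_flag_test_and_set(&l->slot[i]))
            ;
}

void write_unlock(dist_rwlock_t *l) {
    for (int i = NREADERS - 1; i >= 0; i--)
        atomic_flag_clear(&l->slot[i]);
}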

Scalable RW Lock cont.

Or use a C-SNZI
Closable scalable nonzero indicator
Like a semaphore, but can be “closed”

What about write upgrade?

Yossi Lev, Victor Luchangco, and Marek Olszewski. 2009. Scalable reader-writer locks. In Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures (SPAA '09). ACM, New York, NY, USA, 101-110.

Biased Locks

First- and second-class “citizens”
Like readers/writers, but all access is exclusive

Secondary holders request the lock
The primary holder grants them the lock

Nalini Vasudevan, Kedar S. Namjoshi, and Stephen A. Edwards. 2010. Simple and fast biased locks. In Proceedings of the 19th international conference on Parallel architectures and compilation techniques (PACT '10). ACM, New York, NY, USA, 65-74.

MCS Extensions

Queue-based locks: what if threads are preempted?

Add a time component to the lock
Stale elements are skipped

Michael L. Scott and William N. Scherer. 2001. Scalable queue-based spin locks with timeout. In Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming (PPoPP '01). ACM, New York, NY, USA, 44-52.

B. He, W. N. Scherer III, and M. L. Scott. “Preemption Adaptivity in Time-Published Queue-Based Spin Locks,” 11th Intl. Conf. on High Performance Computing, Goa, India, Dec. 2005. 

Spinning vs Blocking

Spinning = busy-waiting
Blocking = thread scheduling

What is the trade-off between the two schemes?
Tested a Solaris pthread implementation that does both (a hybrid sketch follows the citation)

Ryan Johnson, Manos Athanassoulis, Radu Stoica, and Anastasia Ailamaki. 2009. A new look at the roles of spinning and blocking. In Proceedings of the Fifth International Workshop on Data Management on New Hardware (DaMoN '09). ACM, New York, NY, USA, 21-26.
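One common compromise between the two schemes, sketched with pthreads: spin a bounded number of times with a try-lock, then fall back to blocking. SPIN_TRIES and hybrid_lock are my names; this is not the scheme evaluated in the paper.

#include <pthread.h>

#define SPIN_TRIES 100                    // arbitrary tuning knob (assumption)

int hybrid_lock(pthread_mutex_t *m) {
    for (int i = 0; i < SPIN_TRIES; i++) {
        if (pthread_mutex_trylock(m) == 0)
            return 0;                     // acquired while busy-waiting
    }
    return pthread_mutex_lock(m);         // give up spinning; let the scheduler block us
}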

Trees, etc.

Barriers

Lots of threads all signaling a single count
Sounds bad

Signal and Wakeup trees, with different degrees

Hardware Supported Barriers

Introduce dedicated on-chip connections
Single centralized controller
Transmission lines

Jungju Oh, Milos Prvulovic, and Alenka Zajic. 2011. TLSync: support for multiple fast barriers using on-chip transmission lines. In Proceeding of the 38th annual international symposium on Computer architecture (ISCA '11). ACM, New York, NY, USA, 105-116.

Implementation Details

Architectural Primitives

Compare-and-Swap(mem, old, new)
If (*mem == old) *mem = new
Return what was in *mem

LL/SC
LL – load-linked: load a value
SC – store-conditional to the same address, succeeds only if the data is unmodified
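A hedged C11 illustration: atomic_compare_exchange_strong provides the compare-and-swap semantics above, and on LL/SC machines the compiler typically emits an LL/SC retry loop for it. The lock-free counter is just my example.

#include <stdatomic.h>

// Increment *ctr atomically using a CAS retry loop
void counter_inc(atomic_int *ctr) {
    int old = atomic_load(ctr);
    // On failure, 'old' is refreshed with the value actually found in *ctr
    while (!atomic_compare_exchange_strong(ctr, &old, old + 1))
        ;                                 // retry with the updated 'old'
}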

Test and Test-and-Set

Synchronization instructions are expensive
So don’t issue them until they are likely to succeed

Test the lock, then test-and-set the lock
Caveat emptor:

Can lead to races if used incorrectly
Can save time, much like a TryToAcquire, rather than a full acquire/release
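A hedged C11 sketch of test-and-test-and-set: spin on ordinary loads, and only attempt the expensive atomic exchange when the lock looks free. The names are mine.

#include <stdatomic.h>

void ttas_acquire(atomic_int *lock) {
    for (;;) {
        while (atomic_load(lock) != 0)        // test: cheap read, stays in the local cache
            ;
        if (atomic_exchange(lock, 1) == 0)    // test-and-set: only now pay for the atomic
            return;
    }
}

void ttas_release(atomic_int *lock) {
    atomic_store(lock, 0);
}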

Queued Spinlock Details

void acquire_queued_spinlock(void *lock, entry *me)
{
    me->next = NULL;
    me->state = UNLOCKED;
    // Atomically make ourselves the tail of the queue; prev is the old tail
    entry *prev = atomic_swap(lock, me);
    if (prev == NULL)
        return;                        // queue was empty: we hold the lock
    me->state = LOCKED;
    prev->next = me;                   // link behind the previous tail
    while (me->state == LOCKED)        // spin only on our own (local) flag
        ;
}

Queued Spinlock Details cont.

void release_queued_spinlock(void *lock, entry *me)
{
    while (me->next == NULL) {
        // No visible successor: try to swing the tail pointer back to NULL
        if (me == CAS(lock, me, NULL))
            return;                    // we were the last waiter; the lock is now free
        // CAS failed: a new waiter is enqueueing; wait for it to link itself
    }
    me->next->state = UNLOCKED;        // hand the lock to the next waiter
}

Bibliography

Dave Dice, Virendra J. Marathe, and Nir Shavit. 2011. Flat-combining NUMA locks. In Proceedings of the 23rd ACM symposium on Parallelism in algorithms and architectures (SPAA '11). ACM, New York, NY, USA, 65-74.

B. He, W. N. Scherer III, and M. L. Scott.  “Preemption Adaptivity in Time-Published Queue-Based Spin Locks,” 11th Intl. Conf. on High Performance Computing, Goa, India, Dec. 2005. 

Wilson C. Hsieh and William E. Weihl. 1992. Scalable Reader-Writer Locks for Parallel Systems. In Proceedings of the 6th International Parallel Processing Symposium, Viktor K. Prasanna and Larry H. Canter (Eds.). IEEE Computer Society, Washington, DC, USA, 656-659.

Ryan Johnson, Manos Athanassoulis, Radu Stoica, and Anastasia Ailamaki. 2009. A new look at the roles of spinning and blocking. In Proceedings of the Fifth International Workshop on Data Management on New Hardware (DaMoN '09). ACM, New York, NY, USA, 21-26.

Yossi Lev, Victor Luchangco, and Marek Olszewski. 2009. Scalable reader-writer locks. In Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures (SPAA '09). ACM, New York, NY, USA, 101-110.

Peter S. Magnusson, Anders Landin, and Erik Hagersten. 1994. Queue Locks on Cache Coherent Multiprocessors. In Proceedings of the 8th International Symposium on Parallel Processing, Howard Jay Siegel (Ed.). IEEE Computer Society, Washington, DC, USA, 165-171.

Bibliography cont.

John M. Mellor-Crummey and Michael L. Scott. 1991. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Trans. Comput. Syst. 9, 1 (February 1991), 21-65.

John M. Mellor-Crummey and Michael L. Scott. 1991. Scalable reader-writer synchronization for shared-memory multiprocessors. In Proceedings of the third ACM SIGPLAN symposium on Principles and practice of parallel programming (PPOPP '91). ACM, New York, NY, USA, 106-113.

Jungju Oh, Milos Prvulovic, and Alenka Zajic. 2011. TLSync: support for multiple fast barriers using on-chip transmission lines. In Proceeding of the 38th annual international symposium on Computer architecture (ISCA '11). ACM, New York, NY, USA, 105-116.

Michael L. Scott and William N. Scherer. 2001. Scalable queue-based spin locks with timeout. In Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming (PPoPP '01). ACM, New York, NY, USA, 44-52.

Nalini Vasudevan, Kedar S. Namjoshi, and Stephen A. Edwards. 2010. Simple and fast biased locks. In Proceedings of the 19th international conference on Parallel architectures and compilation techniques (PACT '10). ACM, New York, NY, USA, 65-74.

Lock free list

Store a head pointer
Atomically update the head

void push(node **head, node *n)
{
    node *old;
    node *now = *head;
    do {
        old = now;
        n->next = old;                        // link the new node in front of the current head
        // CAS returns the value that was in *head; success iff it still equals old
    } while ((now = CAS(head, old, n)) != old);
}
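For the ABA discussion on the next slides, a matching pop in the same pseudocode style is useful. This sketch is my addition, not on the original slides, and it deliberately exhibits the ABA weakness discussed next.

node *pop(node **head)
{
    node *old;
    node *now = *head;
    do {
        old = now;
        if (old == NULL) return NULL;             // empty list
        // Read old->next, then try to swing the head past it.
        // If the head changed from A to something else and back to A
        // in the meantime (the "ABA" problem), this CAS still succeeds.
    } while ((now = CAS(head, old, old->next)) != old);
    return old;
}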

“ABA” Problem

Push C   // pending
Pop A
Pop B
Push A

// Does Push C complete successfully now?

“ABA” Problem cont.

Pop A   // pending
Pop A
Pop B
Push A

Does Pop A succeed?