
Page 1

CS-510

The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors

By T. E. Anderson

Presented by Ashish Jha, PSU SP 2010 CS-510

05/20/2010

Page 2

Agenda

- Preview of an SMP single-bus-based system
  - $ (cache) protocol and the bus
- What is a Lock?
  - Usage and operations in a CS
- What is a Spin-Lock?
  - Usage and operations in a CS
- Problems with spin-locks on SMP systems
- Methods to improve spin-lock performance in both SW and HW
- Summary

Page 3

Preview: SMP Arch

- Shared bus: coherent, consistent, and contended memory
- Snoopy invalidation-based cache coherence protocol
  - Guarantees atomicity of a memory operation
- Sources of contention
  - Bus
  - Memory modules

[Figure: CPUs 0..N, each with a private L1D cache, sharing one bus (BSQ) and line cache (LN$). T1: a CPU executes LD reg=[M1] and its line goes Invalid to Exclusive (Invalid elsewhere). T2: a second CPU executes LD reg=[M1] and both copies become Shared. T3: a CPU executes ST [M1]=reg, leaving the writer's copy Modified and invalidating all other copies.]

Page 4

What is a Lock?

- An instruction defined and exposed by the ISA
  - Used to achieve "exclusive" access to memory
- A lock is an "atomic" RMW operation
  - The uArch guarantees atomicity
  - Achieved via the cache coherence protocol
- Used to implement a Critical Section (CS)
  - A block of code with "exclusive" access
- Examples (sketched below)
  - TSL: Test-and-Set-Lock
  - CAS: Compare-and-Swap
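A minimal C11 sketch of the two primitives' semantics (our illustration only; the paper predates C11, and these helper names are ours, not Anderson's):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* TSL: atomically set the lock word to BUSY and return its old value.
 * A return of false means the word was CLEAN, i.e. we got the lock. */
bool test_and_set(atomic_bool *lock)
{
    return atomic_exchange(lock, true);
}

/* CAS: atomically replace *addr with new_val only if it still equals
 * expected; returns true on success. */
bool compare_and_swap(atomic_int *addr, int expected, int new_val)
{
    return atomic_compare_exchange_strong(addr, &expected, new_val);
}
```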

Page 5

Lock Operation

[Flowchart: Reg = TSL [M1].
- Is the $ line already local in Exclusive/Modified state? If not: local $ miss → bus Tx → on a remote $ hit, invalidate the other CPU's $ line; otherwise invalidate the memory $ line. Either way, M1 ends up in the local $ in Modified state.
- M1 CLEAN? Y: set M1=BUSY, Reg=CLEAN → GOT LOCK! N: set M1=BUSY (unchanged), Reg=BUSY → NO LOCK.]

Page 6

Critical Section using Lock

[Flowchart: Reg = TSL [M1] (//got lock: [M1]=BUSY) → Reg = CLEAN? → Execute CS → [M1] = CLEAN (//un-lock)]

Simple, intuitive, and elegant. (A sketch follows.)
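In C11 terms, a single-attempt sketch matching this slide's shape (our rendering, not code from the paper; the retry loop is added on the next slide):

```c
#include <stdatomic.h>

atomic_bool M1;                         /* false == CLEAN, true == BUSY */

void critical_section_once(void)
{
    if (!atomic_exchange(&M1, true)) {  /* Reg = TSL [M1] */
        /* got lock ([M1]=BUSY): execute the CS with exclusive access */
        atomic_store(&M1, false);       /* [M1] = CLEAN: un-lock */
    }
    /* else: Reg was BUSY, no lock on this attempt */
}
```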

Page 7

Critical Section using Spin-Lock

[Flowchart: Reg = TSL [M1] (//[M1]=BUSY) → Reg = ? If BUSY: spin, retrying the TSL (the spin-lock loop). If CLEAN: //got lock → Execute CS → [M1] = CLEAN (//un-lock)]

Spin on test-and-set. Yet again: simple, intuitive, and elegant. (See the sketch below.)
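A spin-on-test-and-set sketch in C11 (our rendering of the flowchart, not code from the paper):

```c
#include <stdatomic.h>

/* Spin on test-and-set: retry the atomic TSL until it returns CLEAN.
 * Note that every failed TSL is still a write, so it invalidates the
 * other caches' copies of the lock line on each iteration. */
void spin_lock(atomic_bool *lock)
{
    while (atomic_exchange(lock, true))  /* Reg = TSL [M1] */
        ;                                /* Reg was BUSY: spin */
}

void spin_unlock(atomic_bool *lock)
{
    atomic_store(lock, false);           /* [M1] = CLEAN */
}
```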

Page 8

Problem with Spin-Lock?

[Same spin-lock flowchart as Page 7.]

- A lock is an RMW operation
  - A "simple?" store op
- Works well from a uniprocessor (UP) up to a few-core environment… next slide…

Page 9

Spin-Lock in Many-Core Env.

- Severe contention on the bus, with traffic from
  - Snoops
  - Invalidations
  - Regular requests
- Contended memory module
  - Data requested by different CPUs residing in the same module

[Diagram: CPUs 0..N, each with a private L1D, on the shared bus (BSQ/LN$). T1: CPU 0 executes TSL[M1] (//Lock) and holds the line Modified. T2, T3: the other CPUs spin with TSL[M1] (//Spin); each failed TSL pulls the line across the bus in Modified state and invalidates the other copies. Meanwhile unrelated work (reg=[M2], TSL[M2]) and even the release itself ([M1]=CLEAN) queue up behind the lock traffic in the bus queues Q0-Q3.]

Page 10

Spin-Lock in Many-Core Env. Cont'd

- An avalanche effect on bus and memory-module contention as the # of CPUs grows; impacts scalability
  - More snoop and coherence traffic, with a ping-pong effect on locks: unsuccessful test-and-sets and invalidations
  - More starvation: the lock has been released, but acquiring it is delayed further by contention on the bus
  - Requests conflicting in the same memory module
  - Locks and/or regular requests conflicting in the same $ line
  - Top it off with SW bugs
- Suppose lock latency were 20 core clocks
  - The bus runs as much as 10x slower
  - So the latency to acquire the lock could increase by 10x core clocks or more

[Same many-core spin-lock diagram as Page 9.]

Page 11

A better Spin-Lock

- Spin on read (test-and-test-and-set)
  - A bit better, as long as the lock is not modified while spinning on the cached value
  - Doesn't hold up as the # of CPUs scales
  - Same set of problems as before: lots of invalidations due to the TSL

[Flowchart: spin on lock RD while [M1]=BUSY; when it reads CLEAN, Reg = TSL [M1]; if Reg=BUSY, go back to spinning on the read; if CLEAN: //got lock → Execute CS → [M1] = CLEAN (//un-lock)]

(A sketch follows.)
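A test-and-test-and-set sketch in C11 (our rendering: poll the lock's cached value, and only issue the atomic TSL once it looks CLEAN):

```c
#include <stdatomic.h>

void ttas_lock(atomic_bool *lock)
{
    for (;;) {
        while (atomic_load(lock))         /* spin on lock RD: hits the    */
            ;                             /* cached copy while BUSY       */
        if (!atomic_exchange(lock, true)) /* looked CLEAN: try the TSL    */
            return;                       /* got lock                     */
        /* TSL lost the race: fall back to spinning on the read */
    }
}
```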

Page 12

Verify through Tests

- Spin-lock latency and performance with small and large amounts of contention. The results confirm:
  - Sharp degradation in performance for spin on test-and-set as the # of CPUs is scaled
  - Spin on read is slightly better
  - Both methods degrade badly (scale poorly) as CPUs are added
  - Peak performance is never reached: time to quiesce is almost linear in CPU count, hurting communication BW

Test setup:
- 20-CPU Symmetric Model B SMP
- WB-invalidate $
- Shared bus: one and the same bus for lock and regular requests
- Lock acquire-release = 5.6 usec
- Measured: elapsed time for the CPUs to execute the CS 1M times
- Each CPU loops: wait for lock, do CS, release, and delay for a randomly selected time

[Figure: time to quiesce, spin on read (usec). SOURCE: figures copied from paper.]

Page 13

What can be done?

- Can spin-lock performance be improved by
  - SW: is there an efficient algorithm for busy locks?
  - HW: is more complex HW needed?

Page 14

SW Impr. #1a: Delay TSL

- By delaying the TSL
  - Reduce the # of invalidations and bus contention
- The delay could be set
  - Statically: delay slots for each processor, which could be prioritized
  - Dynamically: as in CSMA networks, exponential back-off
- Performance is good with
  - A short delay and few spinners
  - A long delay and many spinners

(See the back-off sketch after the flowchart notes.)

[Flowcharts: left, the spin on test-and-test-and-set spin-lock from Page 11 (spin on lock RD while [M1]=BUSY, then TSL). Right, spin on test -and- delay test-and-set: spin on lock RD while [M1]=BUSY; when it reads CLEAN, DELAY before the TSL and re-check [M1]; if still CLEAN, Reg = TSL [M1] → //got lock → Execute CS → [M1] = CLEAN (//un-lock).]
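A delay-before-TSL sketch in C11 with exponential back-off (our rendering; the back-off cap and the busy-wait delay loop are our choices, not taken from the paper):

```c
#include <stdatomic.h>

void delayed_tsl_lock(atomic_bool *lock)
{
    unsigned delay = 1;                   /* back-off, in loop iterations */
    for (;;) {
        while (atomic_load(lock))         /* spin on lock RD */
            ;
        for (volatile unsigned i = 0; i < delay; i++)
            ;                             /* DELAY before the TSL */
        if (atomic_load(lock))            /* re-check after the delay */
            continue;                     /* BUSY again: back to spinning */
        if (!atomic_exchange(lock, true)) /* Reg = TSL [M1] */
            return;                       /* got lock */
        if (delay < 1024)                 /* lost the race: back off more, */
            delay *= 2;                   /* CSMA-style exponential growth */
    }
}
```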

Page 15

SW Impr. #1b: Delay after ea. Lock access

- Delay after each lock access
  - Check the lock less frequently
  - TSL: fewer misses due to invalidation, less bus contention
  - Lock RD: fewer misses due to invalidation, less bus contention
- Good for architectures with no caches
  - Keeps polling from overflowing communication (bus, NW) BW

(See the sketch after the flowchart notes.)

[Flowcharts: left, the spin on test-and-test-and-set spin-lock again. Right, delay on test -and- delay on test-and-set: 1. DELAY after each lock RD, before the TSL; 2. DELAY after each failed TSL; then retry Reg = TSL [M1] → //got lock → Execute CS → [M1] = CLEAN (//un-lock).]
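A delay-after-each-access sketch in C11 (our rendering; DELAY_SLOTS and the busy-wait loop are our placeholders, and on a cache-less machine each probe would go out over the bus or NW):

```c
#include <stdatomic.h>

#define DELAY_SLOTS 256          /* placeholder tuning constant */

void polite_lock(atomic_bool *lock)
{
    while (atomic_exchange(lock, true)) {        /* failed TSL */
        do {
            for (volatile unsigned i = 0; i < DELAY_SLOTS; i++)
                ;                                /* delay between probes */
        } while (atomic_load(lock));             /* less frequent lock RD */
    }
}
```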

Page 16

SW Impr. #2: Queuing

- To resolve contention
  - Delay uses time
  - A queue uses space
- Queue implementation
  - Basic: allocate a slot for each waiting CPU in a queue
    - Requires insertion and deletion, which are atomic ops
    - Not good for a small CS
  - Efficient: each CPU gets a unique seq# (one atomic op)
    - On completing the lock, the current CPU activates the one with the next seq#: no atomic op
- Queue performance
  - Works well (offers low contention) for bus-based archs and for NW-based archs with invalidation
  - Less valuable for bus-based archs with no caches, since each CPU's polling still contends on the bus
  - Increased lock latency under low contention, due to the overhead of attaining the lock
  - Preemption of the CPU holding the lock could further starve the CPUs waiting on it: pass the token before switching out
  - A centralized queue becomes a bottleneck as the # of CPUs increases; solutions include dividing the queue between nodes, etc.

[Diagram notes: queue slots 0 to N-1, a slot for each CPU, each in a separate $ line. Each CPU spins on its own slot: continuous polling, no coherence traffic. CPU 0 gets the lock: no atomic TSL, the lock is your slot (a slot set per lock). CPU 0, on unlocking, passes the token to another CPU (e.g. 5): requires an atomic TSL on that slot, plus some criterion for "another", e.g. priority or FIFO.]

(A sketch follows.)
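A sequence-number queue lock sketch in C11 (our rendering of the scheme the slide describes; the struct, names, slot count, and cache-line padding are our choices):

```c
#include <stdatomic.h>

#define NCPUS 32                 /* must be >= max concurrent waiters */
#define CACHE_LINE 64            /* assumed line size */

struct qlock {
    struct {
        atomic_int must_wait;                        /* 0 == holds token */
        char pad[CACHE_LINE - sizeof(atomic_int)];   /* one slot per $ line */
    } slot[NCPUS];
    atomic_uint next_seq;        /* fetch-and-increment ticket counter */
};

void qlock_init(struct qlock *q)
{
    atomic_store(&q->next_seq, 0);
    atomic_store(&q->slot[0].must_wait, 0);  /* first ticket holds the token */
    for (int i = 1; i < NCPUS; i++)
        atomic_store(&q->slot[i].must_wait, 1);
}

unsigned qlock_acquire(struct qlock *q)
{
    /* one atomic op: take a unique seq#, then spin only on my own slot */
    unsigned my = atomic_fetch_add(&q->next_seq, 1) % NCPUS;
    while (atomic_load(&q->slot[my].must_wait))
        ;
    return my;                   /* caller hands this back to release */
}

void qlock_release(struct qlock *q, unsigned my)
{
    atomic_store(&q->slot[my].must_wait, 1);               /* re-arm my slot */
    atomic_store(&q->slot[(my + 1) % NCPUS].must_wait, 0); /* pass the token */
}
```

Because each waiter spins in its own cache line, a release invalidates only the next waiter's line rather than broadcasting to every spinner.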

Page 17

SW Impr.: Test Results

- At low CPU count (low contention)
  - The queue has high latency, due to lock overhead
- At high CPU count
  - The queue performs best
  - Back-off performs slightly worse than static delays

Test setup:
- 20-CPU Symmetric Model B
- Static and dynamic delay = 0-15 usec
- TSL = 1 usec
- No atomic increment: the queue uses an explicit lock with back-off to access the seq#
- Each CPU loops 1M/#P times to acquire, do the CS, release, and compute
- Metric: spin-waiting overhead (sec) in executing the benchmark

[Figure: SOURCE: copied from paper.]

Page 18

HW Solutions

- A separate bus for lock and regular memory requests
  - As in the Balance
  - Regular requests follow invalidation-based $ coherence
  - Lock requests follow distributed-write-based $ coherence
- An expensive solution
  - Little benefit to apps that don't spend much time spin-waiting
  - How to manage if the two buses are slower?

Page 19

HW Sol. – Multistage Interconnect NW CPU

- A NUMA type of architecture
  - An "SMP view" as a "combination of memory" across the "nodes"
- Collapse all simultaneous requests for a single lock from a node into one
  - The value would be the same for all the requests anyway
  - Saves contention BW
  - But the gain could be offset by the increased latency of the "combining switches"
  - Could be beaten by a normal NW with back-off or queuing
- A HW queue
  - Such as one maintained by the cache controller
  - Uses the same method as the SW queue to pass the token to the next CPU
  - One proposal by Goodman et al. combines HW and SW to maintain the queue
  - A HW implementation, though complex, could be faster

Page 20

HW Sol. – Single-Bus CPU

- A single bus has the ping-pong problem, with constant invalidations even if the lock isn't available
  - Largely due to the "atomic" nature of RMW lock instructions
- Minimize invalidations by restricting them to when the value has really changed
  - Makes sense, and solves the problem when spinning on read
  - However, there is still an invalidation when the lock is finally released
  - The cache miss by each spinning CPU, and the further failed TSLs, consume BW
  - Time to quiesce is reduced but not fully eliminated
- Special handling of read requests, by improving the snooping and coherence protocol
  - Broadcast on a read, which could eliminate duplicate read misses
  - The first read after an invalidation (such as making the lock available) would fulfill further read requests on the same lock
  - Requires implementing fully distributed write-coherence
- Special handling of test-and-set requests in the cache and bus controllers
  - If it doesn't increase bus or cache cycle time, it should be better than SW queuing or back-off
- None of the methods achieves ideal performance, as measured and tested on Symmetry
  - The difficulty is knowing the type of atomic instruction making a request
  - The type is known and computed only in the core; the cache and the bus see everything as nothing but a "request"
  - The ability to pass such control signals along with requests could help

Page 21

Summary

- Spin locks are a common method for achieving mutually exclusive access to a shared data structure
- Multi-core CPUs are increasingly common
  - Spin-lock performance degrades as the # of spinning CPUs increases
- Efficient methods in both SW and HW can be implemented to salvage the performance degradation
  - SW
    - SW queuing: performs best at high contention
    - Ethernet-style back-off: performs best at low contention
  - HW
    - For a multistage-NW CPU, HW queuing at a node to combine requests of one type could save contention BW
    - For an SMP bus-based CPU, intelligent snooping could be implemented to reduce bus traffic
- The recommendations for spin-lock performance (above) look promising on small benchmarks
  - The benefit to "real" workloads is an open question