CS-510
The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors
By T. E. Anderson
Presented by Ashish Jha, PSU SP 2010 CS-510
05/20/2010
Agenda
- Preview of an SMP single-bus-based system: the $ protocol and the bus
- What is a Lock? Usage and operations in a CS
- What is a Spin-Lock? Usage and operations in a CS
- Problems with Spin-Locks on SMP systems
- Methods to improve Spin-Lock performance in both SW and HW
Summary
Preview: SMP Arch
- Shared bus: coherent, consistent, and contended memory
- Snoopy invalidation-based cache coherence protocol: guarantees atomicity of a memory operation
- Sources of contention: the bus and the memory modules
[Diagram: CPUs 0..N on the shared bus, each with an L1D, a BSQ, and a last-level cache (LN$).
 T1: CPU 0 executes LD reg=[M1]; its $ line goes Invalid -> Exclusive.
 T2: another CPU executes LD reg=[M1]; both copies become Shared.
 T3: a CPU executes ST [M1]=reg; its copy becomes Modified and all other copies go Invalid.]
What is a Lock?
- An instruction defined and exposed by the ISA, to achieve "exclusive" access to memory
- A lock is an "atomic" RMW operation: the uArch guarantees atomicity, achieved via the cache coherence protocol
- Used to implement a critical section: a block of code with "exclusive" access
- Examples: TSL (Test-and-Set-Lock), CAS (Compare-and-Swap)
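The two example primitives can be sketched with C11 atomics. This is our illustration, not code from the paper (which predates C11), and the function names are ours:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* TSL: atomically write BUSY (1) and return the previous value.
   Only the single winner of the race sees CLEAN (0). */
static int test_and_set(atomic_int *lock) {
    return atomic_exchange(lock, 1);
}

/* CAS: atomically replace *lock with desired only if it still holds
   expected; returns true on success. */
static bool compare_and_swap(atomic_int *lock, int expected, int desired) {
    return atomic_compare_exchange_strong(lock, &expected, desired);
}
```

Both are single atomic RMW operations at the ISA level; the cache coherence protocol supplies the exclusivity while they execute.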
Lock Operation
[Flowchart: Reg = TSL [M1]]
1. Is the $ line for M1 already in the local $ in Exclusive/Modified state?
   - N: local $ miss -> bus Tx. On a remote $ miss, invalidate the other CPU's $ line;
     otherwise invalidate the memory $ line. M1 is now in the local $ in Modified state.
2. M1 CLEAN?
   - Y: set M1=BUSY, Reg=CLEAN -- GOT LOCK!
   - N: set M1=BUSY, Reg=BUSY -- NO LOCK.
Critical Section using Lock
1. Reg = TSL [M1]                 //[M1]=BUSY
2. Reg = CLEAN?  Y -> Got Lock: Execute CS
3. [M1] = CLEAN                   //Un-Lock
Simple, Intuitive and Elegant
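The three steps above can be sketched in C11 (our hedged illustration; `try_critical_section` and `example_cs` are hypothetical names, not from the paper):

```c
#include <stdatomic.h>
#include <stdbool.h>

static void example_cs(void) { /* placeholder critical-section body */ }

/* One attempt at the critical section: a single TSL, entering the CS
   only if it returned CLEAN (0).  cs is whatever work the CS protects. */
static bool try_critical_section(atomic_int *m1, void (*cs)(void)) {
    if (atomic_exchange(m1, 1) == 0) {  /* Reg = TSL [M1]  //[M1]=BUSY */
        cs();                           /* Got Lock -> Execute CS */
        atomic_store(m1, 0);            /* [M1] = CLEAN  //Un-Lock */
        return true;
    }
    return false;                       /* Reg = BUSY -> no lock */
}
```

Note this single-attempt form gives up on BUSY; the next slide adds the spin.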
Critical Section using Spin-Lock
[Spin-Lock]
1. Reg = TSL [M1]                 //[M1]=BUSY
2. Reg = ?  BUSY -> spin: back to step 1; CLEAN -> Got Lock: Execute CS
3. [M1] = CLEAN                   //Un-Lock

Spin on Test-and-Set. Yet again: simple, intuitive, and elegant.
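A minimal C11 sketch of spin-on-test-and-set (our names, not the paper's):

```c
#include <stdatomic.h>

typedef struct { atomic_int word; } spinlock_t;  /* 0 = CLEAN, 1 = BUSY */

/* Spin on Test-and-Set: retry the atomic RMW until it returns CLEAN.
   Every failed TSL is still a full RMW bus transaction. */
static void spin_acquire(spinlock_t *l) {
    while (atomic_exchange(&l->word, 1) != 0)
        ;  /* Reg = BUSY -> spin */
}

static void spin_release(spinlock_t *l) {
    atomic_store(&l->word, 0);  /* [M1] = CLEAN  //Un-Lock */
}
```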
Problem with Spin-Lock?
(Same spin-on-TSL loop as before: Reg = TSL [M1]; BUSY -> spin, CLEAN -> Got Lock, Execute CS; [M1] = CLEAN to un-lock.)
- A Lock is a RMW operation -- a "simple?" store op
- Works well for a uniprocessor (UP) up to a few-core environment... next slide...
Spin-Lock in Many-Core Env.
- Severe contention on the bus, with traffic from snoops, invalidations, and regular requests
- Contended memory module: data requested by different CPUs residing in the same module
[Diagram: CPUs 0..N on the shared bus, each with an L1D, a BSQ, and an LN$.
 T1: CPU 0 executes TSL[M1] (Lock); its line goes Modified.
 T2/T3: the other CPUs spin with TSL[M1]; each TSL pulls the line Modified into its own $
 and Invalidates everyone else's copy, ping-ponging the line around the bus.
 The bus queues (Q0..Q3) fill with pending requests, so even CPU N-2's unrelated
 reg=[M2] access and CPU 0's release ([M1]=CLEAN) wait behind the spinning traffic.]
- An avalanche effect on bus and memory-module contention as the # of CPUs grows -- impacts scalability:
  - More snoop and coherence traffic, with a ping-pong effect on locks: unsuccessful test-and-sets and invalidations
  - More starvation: the lock has been released, but acquiring it is delayed further by contention on the bus
  - Requests conflicting in the same memory module
  - Locks and/or regular requests conflicting in the same $ line
  - Top it off with SW bugs
- Suppose lock latency were 20 core clks, and the bus runs as much as 10x slower: the latency to acquire the lock could then increase by 10x core clks or more
Spin-Lock in Many-Core Env. Cont'd
[Diagram: the same N-CPU shared-bus system as the previous slide -- CPU 0 holds the lock
 (T1: TSL[M1]), the other CPUs spin (T2/T3: TSL[M1]), and the $ lines keep ping-ponging
 between Modified and Invalid.]
A better Spin-Lock
[Flowchart: Spin on Read (Test-and-Test-and-Set)]
1. [M1]=BUSY?  Y -> spin on Lock RD (hits in the local $ while the lock is held)
2. N -> Reg = TSL [M1]            //[M1]=BUSY
3. Reg = ?  BUSY -> back to spinning on Lock RD; CLEAN -> Got Lock: Execute CS
4. [M1] = CLEAN                   //Un-Lock

- A bit better, as long as the lock is not modified while spinning on the cached value
- Doesn't hold up as the # of CPUs is scaled: same set of problems as before -- lots of invalidations due to TSL
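The spin-on-read flow above, sketched in C11 (our illustration; names are ours):

```c
#include <stdatomic.h>

typedef struct { atomic_int word; } spinlock_t;  /* 0 = CLEAN, 1 = BUSY */

/* Test-and-Test-and-Set: spin on an ordinary read, which hits in the
   local cache, and only issue the invalidating TSL when the lock
   looks CLEAN. */
static void ttas_acquire(spinlock_t *l) {
    for (;;) {
        while (atomic_load(&l->word) != 0)
            ;                                    /* spin on Lock RD */
        if (atomic_exchange(&l->word, 1) == 0)
            return;                              /* TSL won: Got Lock */
        /* TSL lost the race: back to spinning on the read */
    }
}

static void ttas_release(spinlock_t *l) {
    atomic_store(&l->word, 0);                   /* [M1] = CLEAN */
}
```

The read loop generates no bus traffic while the holder keeps the lock; the burst of TSLs on release is still the weak point the slide names.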
Verify through Tests
Spin-lock latency and performance with small and large amounts of contention. Results confirm:
- Sharp degradation in performance for spin-on-test-and-set as #CPUs is scaled; spin-on-read is slightly better
- Both methods degrade badly (scale poorly) as CPUs are added; peak performance is never reached -- time to quiesce is almost linear in CPU count, hurting communication BW

Test setup:
- 20-CPU Symmetry Model B SMP, WB-invalidate $
- Shared bus: one and the same bus for lock and regular requests
- Lock acquire-release = 5.6 usec
- Measured: elapsed time for the CPUs to execute the CS 1M times; each CPU loops: wait for lock, do CS, release, and delay for a randomly selected time

[SOURCE: figures copied from the paper -- time to quiesce, spin on read (usec)]
What can be done?
Can Spin-Lock performance be improved by:
- SW: any efficient algorithm for busy locks?
- HW: any more complex HW needed?
SW Impr. #1a: Delay TSL
- By delaying the TSL, reduce the # of invalidations and bus contention
- The delay could be set:
  - Statically: delay slots for each processor, which could be prioritized
  - Dynamically: as in CSMA networks -- exponential back-off
- Performance is good with a short delay and few spinners, or with a long delay and many spinners

[Flowchart: Spin on Test-and-Test-and-Set Spin-Lock (for comparison)]
1. [M1]=BUSY?  Y -> spin on Lock RD; N -> Reg = TSL [M1]   //[M1]=BUSY
2. Reg = ?  BUSY -> back to spinning; CLEAN -> Got Lock: Execute CS
3. [M1] = CLEAN                   //Un-Lock

[Flowchart: Spin on Test -and- Delay Test-and-Set Spin-Lock]
1. [M1]=BUSY?  Y -> spin on Lock RD
2. N -> DELAY before TSL, then re-check [M1]=BUSY? (Lock RD); if BUSY again, back to spinning
3. Reg = TSL [M1]                 //[M1]=BUSY
4. Reg = ?  BUSY -> back to spinning; CLEAN -> Got Lock: Execute CS
5. [M1] = CLEAN                   //Un-Lock
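The delayed-TSL variant with dynamic (exponential) back-off can be sketched as follows; the initial delay of 4 and cap of 4096 iterations are our arbitrary choices, not values from the paper:

```c
#include <stdatomic.h>

typedef struct { atomic_int word; } spinlock_t;  /* 0 = CLEAN, 1 = BUSY */

/* Spin on test, then DELAY before the TSL; the delay doubles after each
   failed attempt (CSMA-style exponential back-off), up to a cap. */
static void backoff_acquire(spinlock_t *l) {
    unsigned delay = 4;                          /* assumed starting value */
    for (;;) {
        while (atomic_load(&l->word) != 0)
            ;                                    /* spin on Lock RD */
        for (volatile unsigned i = 0; i < delay; i++)
            ;                                    /* DELAY before TSL */
        if (atomic_exchange(&l->word, 1) == 0)
            return;                              /* Got Lock */
        if (delay < 4096)
            delay *= 2;                          /* exponential back-off */
    }
}

static void backoff_release(spinlock_t *l) {
    atomic_store(&l->word, 0);                   /* [M1] = CLEAN */
}
```

The delay spreads the burst of TSLs that follows a release across time, so fewer invalidations hit the bus at once.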
SW Impr. #1b: Delay after ea. Lock access
- Delay after each lock access: check the lock less frequently
  - TSL: fewer misses due to invalidation, less bus contention
  - Lock RD: fewer misses due to invalidation, less bus contention
- Good for architectures with no caches, where communication (bus, NW) BW overflows

[Flowchart: Spin on Test-and-Test-and-Set Spin-Lock (for comparison)]
(as before: spin on Lock RD while [M1]=BUSY, then TSL; retry on BUSY, Execute CS on CLEAN, un-lock with [M1]=CLEAN)

[Flowchart: Delay on Test -and- Delay on Test-and-Set Spin-Lock]
1. [M1]=BUSY? (Lock RD)  Y -> DELAY, then re-check   //1. DELAY after Lock RD, before TSL
2. N -> Reg = TSL [M1]            //[M1]=BUSY
3. Reg = ?  BUSY -> DELAY, then back to checking     //2. DELAY after TSL
   CLEAN -> Got Lock: Execute CS
4. [M1] = CLEAN                   //Un-Lock
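A sketch of the delay-after-each-access variant (our illustration; a fixed caller-supplied delay stands in for the statically assigned per-processor delay slots):

```c
#include <stdatomic.h>

typedef struct { atomic_int word; } spinlock_t;  /* 0 = CLEAN, 1 = BUSY */

/* Delay after every lock access, so each probe -- whether a Lock RD or
   a TSL -- costs less aggregate bus/NW bandwidth.  Suited to
   architectures with no caches, where every poll crosses the bus. */
static void delay_each_acquire(spinlock_t *l, unsigned delay) {
    for (;;) {
        if (atomic_load(&l->word) == 0 &&        /* Lock RD */
            atomic_exchange(&l->word, 1) == 0)   /* TSL */
            return;                              /* Got Lock */
        for (volatile unsigned i = 0; i < delay; i++)
            ;                                    /* DELAY after each access */
    }
}

static void delay_each_release(spinlock_t *l) {
    atomic_store(&l->word, 0);                   /* [M1] = CLEAN */
}
```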
SW Impr. #2: Queuing
- To resolve contention: delay uses time, a queue uses space
- Queue implementation:
  - Basic: allocate a slot for each waiting CPU in a queue; requires insertion and deletion, which are atomic ops -- not good for a small CS
  - Efficient: each CPU gets a unique seq# (one atomic op); on completing the lock, the current CPU activates the one with the next seq# -- no atomic op
- Queue performance:
  - Works well (offers low contention) for bus-based archs, and for NW-based archs with invalidation
  - Less valuable for bus-based archs with no caches, as there is still contention on the bus from polling by each CPU
  - Increased lock latency under low contention, due to the overhead of attaining the lock
  - Preemption of the CPU holding the lock could further starve the CPUs waiting on it -- pass the token before switching out
  - A centralized queue becomes a bottleneck as the # of CPUs increases -- solutions: divide the queue between nodes, etc.

[Diagram:
 0-to-(N-1) queue slots              //a slot for each CPU, each in a separate $ line
 Each CPU spins on its own slot      //continuous polling, no coherence traffic
 CPU (0) gets the lock               //no atomic TSL -- the lock is your slot; a slot array per lock!
 CPU (0), on unlocking, passes the token to another CPU (5)
                                     //requires an atomic TSL on that slot
                                     //some criteria for "another", e.g. priority, FIFO]
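The "efficient" seq#-based queue can be sketched as an array-based queue lock in C11. This is our reconstruction under simplifying assumptions: a fixed `NCPUS`, token passed in FIFO order, and no per-$-line padding of the slots (which a real implementation would want):

```c
#include <stdatomic.h>

#define NCPUS 16  /* assumed fixed CPU count */

typedef struct {
    atomic_int queue_last;      /* next seq# to hand out */
    atomic_int slots[NCPUS];    /* 1 = this slot holds the token */
} qlock_t;

static void q_init(qlock_t *q) {
    atomic_store(&q->queue_last, 0);
    for (int i = 0; i < NCPUS; i++)
        atomic_store(&q->slots[i], 0);
    atomic_store(&q->slots[0], 1);   /* first arrival gets the lock */
}

/* Acquire: one atomic op total (the fetch-add), then spin on my own
   slot only -- no coherence traffic while waiting. */
static int q_acquire(qlock_t *q) {
    int my = atomic_fetch_add(&q->queue_last, 1) % NCPUS;
    while (atomic_load(&q->slots[my]) == 0)
        ;                            /* spin on my slot */
    atomic_store(&q->slots[my], 0);  /* reset for the next wrap-around */
    return my;                       /* remember my slot for release */
}

/* Release: activate the next seq# -- a plain store, no atomic RMW. */
static void q_release(qlock_t *q, int my) {
    atomic_store(&q->slots[(my + 1) % NCPUS], 1);
}
```

Each waiter burns cycles only on its own $ line, so a release invalidates exactly one spinner instead of all of them.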
SW Impr.: Test Results
- At low CPU count (low contention): the queue has high latency, due to lock overhead
- At high CPU count: the queue performs best; back-off performs slightly worse than static delays

Test setup:
- 20-CPU Symmetry Model B
- Static & dynamic delay = 0-15 usec; TSL = 1 usec
- No atomic increment: the queue uses an explicit lock with backoff to access the seq#
- Each CPU loops 1M/#P times to acquire, do CS, release, and compute
- Measured: spin-waiting overhead (sec) in executing the benchmark

[SOURCE: figure copied from the paper]
HW Solutions
- Separate buses for lock and regular memory requests, as in the Balance:
  - Regular requests follow invalidation-based $ coherence
  - Lock requests follow distributed-write-based $ coherence
- An expensive solution:
  - Little benefit to apps which don't spend much time spin-waiting
  - How to manage if the two buses are slower?
HW Sol. -- Multistage Interconnect NW CPU
- NUMA type of arch: an "SMP view" as a "combination of memory" across the "nodes"
- Collapse all simultaneous requests for a single lock from a node into one:
  - The value would be the same for all requests anyway -- saves contention BW
  - But the gain could be offset by the increased latency of "combining switches"
  - Could be beaten by a normal NW with backoff or queuing
- HW queue, such as one maintained by the cache controller:
  - Uses the same method as SW to pass the token to the next CPU
  - One proposal by Goodman et al. combines HW and SW to maintain the queue
  - A HW implementation, though complex, could be faster
HW Sol. -- Single Bus CPU
- A single bus had the ping-pong problem, with constant invalidations even when the lock wasn't available -- much of it due to the "atomic" nature of RMW lock instructions
- Minimize invalidations by restricting them to when the value has really changed: makes sense, and solves the problem when spinning on read
  - However, there would still be an invalidation when the lock is finally released
  - The cache miss by each spinning CPU, and the further failed TSLs, consume BW
  - Time to quiesce is reduced, but not fully eliminated
- Special handling of read requests, by improving the snooping and coherence protocol:
  - Broadcast on a read, which could eliminate duplicate read misses
  - The first read after an invalidation (such as the one making the lock available) would fulfill further read requests on the same lock -- requires implementing fully distributed write-coherence
- Special handling of test-and-set requests in the cache and bus controllers:
  - If it doesn't increase bus or cache cycle time, it should beat SW queuing or backoff
- None of the methods achieves ideal performance, as measured and tested on Symmetry:
  - The difficulty is knowing the type of atomic instruction making a request: the type is only known inside the core; the cache and the bus see everything as nothing other than a "request"
  - The ability to pass such control signals along with requests could help achieve the purpose
Summary
- Spin locks are a common method to achieve mutually exclusive access to a shared data structure
- Multi-core CPUs are ever more common, and spin-lock performance degrades as the # of spinning CPUs increases
- Efficient methods in both SW and HW can be implemented to salvage the performance degradation:
  - SW: queuing (performs best at high contention); Ethernet-style backoff (performs best at low contention)
  - HW: for multistage-NW CPUs, HW queuing at a node to combine requests of one type could help save contention; for SMP bus-based CPUs, intelligent snooping could reduce bus traffic
- The recommendations for spin-lock performance above look promising on small benchmarks; the benefit to "real" workloads is an open question