Computer Architecture II
Lecture 10
Today
• Synchronization for shared-memory multiprocessors
  – test&set, LL/SC, array-based locks
  – barriers
• Scalable Multiprocessors
  – What is a scalable machine?
Synchronization
• Types of synchronization
  – Mutual exclusion
  – Event synchronization
    • point-to-point
    • group
    • global (barriers)
• All solutions rely on hardware support for an atomic read-modify-write operation
• Today we look at synchronization for cache-coherent, bus-based multiprocessors
Components of a Synchronization Event
• Acquire method
  – acquire the right to the synch (e.g. enter the critical section)
• Waiting algorithm
  – wait for the synch to become available when it isn't
  – busy-waiting, blocking, or hybrid
• Release method
  – enable other processors to acquire
Performance Criteria for Synch. Ops
• Latency (time per op)
  – especially under light contention
• Bandwidth (ops per sec)
  – especially under high contention
• Traffic: load on critical resources
  – especially on failed attempts under contention
• Storage
• Fairness
Strawman Lock
lock:   ld  register, location  /* copy location to register */
        cmp register, #0        /* compare with 0 */
        bnz lock                /* if not 0, try again */
        st  location, #1        /* store 1 to mark it locked */
        ret                     /* return control to caller */
unlock: st  location, #0        /* write 0 to location */
        ret                     /* return control to caller */
Busy-Waiting
Location is initially 0
Why doesn’t the acquire method work?
Atomic Instructions
• Specifies a location, register, & atomic operation
– Value in location read into a register
– Another value (function of value read or not) stored into location
• Many variants
– Varying degrees of flexibility in second part
• Simple example: test&set
– Value in location read into a specified register
– Constant 1 stored into location
– Successful if value loaded into register is 0
– Other constants could be used instead of 1 and 0
Simple Test&Set Lock
lock:   t&s register, location  /* atomically read location and set it to 1 */
        bnz lock                /* if not 0, try again */
        ret                     /* return control to caller */
unlock: st  location, #0        /* write 0 to location */
        ret                     /* return control to caller */

The same lock code in pseudocode:
    while (not acquired)        /* lock is held by someone else */
        test&set(location);     /* try to acquire the lock */

• Condition: the architecture supports atomic test&set
  – copy location to a register and set location to 1
• Problem:
  – t&s modifies the lock variable (and hence its cache block) each time it tries to
    acquire the lock => cache block invalidations => bus traffic (especially under high contention)
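The t&s lock above can be sketched with C11 atomics, where `atomic_flag_test_and_set` plays the role of the hardware t&s instruction (the type and function names below are mine, not from the lecture):

```c
#include <stdatomic.h>

/* Minimal C11 sketch of the test&set spin lock. */
typedef struct { atomic_flag held; } ts_lock;

void ts_init(ts_lock *l) { atomic_flag_clear(&l->held); }

void ts_acquire(ts_lock *l) {
    /* every failed t&s still writes the cache line, causing invalidations */
    while (atomic_flag_test_and_set(&l->held))
        ;                          /* busy-wait */
}

void ts_release(ts_lock *l) { atomic_flag_clear(&l->held); }
```

Note that the spin itself issues the read-modify-write, which is exactly the bus-traffic problem the next slides address.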
T&S Lock Microbenchmark: SGI Challenge
[Figure: time (s) vs. number of processors (1-15) for four cases: Test&set, c = 0; Test&set with exponential backoff, c = 3.64; Test&set with exponential backoff, c = 0; Ideal]
Microbenchmark: lock; delay(c); unlock;
• Why does performance degrade?
  – bus transactions on t&s
Other read-modify-write primitives
• Fetch&op
  – atomically read a memory location, modify it (by applying op), and write it back
  – e.g. fetch&add, fetch&incr
• Compare&swap
  – three operands: a location, a register to compare with, a register to swap with
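Both primitives have direct C11 analogues; a small sketch (the wrapper names are mine, the library calls are standard C11):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* fetch&add: atomically old = *loc; *loc = old + inc; return old */
int fetch_and_add(atomic_int *loc, int inc) {
    return atomic_fetch_add(loc, inc);
}

/* compare&swap: atomically if (*loc == expected) { *loc = newval; } */
bool compare_and_swap(atomic_int *loc, int expected, int newval) {
    return atomic_compare_exchange_strong(loc, &expected, newval);
}
```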
Enhancements to Simple Lock
• Problem with t&s: lots of invalidations if the lock cannot be taken
• Reduce the frequency of issuing test&sets while waiting
  – Test&set lock with exponential backoff:
        i = 0;
        while (!acquired) {         /* lock is held by someone else */
            test&set(location);     /* try to acquire the lock */
            if (!acquired) {        /* test&set didn't succeed */
                wait(t_i);          /* sleep some time; t_i grows exponentially with i */
                i++;
            }
        }
• Fewer invalidations
• May wait longer than necessary
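A C11 sketch of the backoff variant; a crude delay loop stands in for the slide's wait(t_i), and all names and constants are illustrative choices, not from the lecture:

```c
#include <stdatomic.h>

typedef struct { atomic_flag held; } tsb_lock;

static void backoff_delay(long iters) {
    for (volatile long i = 0; i < iters; i++)
        ;                                   /* delay without touching the lock */
}

void tsb_acquire(tsb_lock *l) {
    long delay = 16;                        /* initial backoff */
    while (atomic_flag_test_and_set(&l->held)) {
        backoff_delay(delay);               /* back off instead of hammering the bus */
        if (delay < (1L << 20))
            delay *= 2;                     /* exponential growth, capped */
    }
}

void tsb_release(tsb_lock *l) { atomic_flag_clear(&l->held); }
```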
Enhancements to Simple Lock
• Reduce the frequency of issuing test&sets while waiting
  – Test-and-test&set lock:
        while (!acquired) {         /* lock is held by someone else */
            if (location == 1)      /* test with an ordinary load */
                continue;
            else {
                test&set(location);
                if (acquired)       /* succeeded */
                    break;
            }
        }
• Keep testing with an ordinary load
  – just a hint: the cached lock variable will be invalidated when a release occurs
  – if location becomes 0, use t&s to modify the variable atomically
  – on failure, start over
• Further reduces bus transactions
  – the load produces bus traffic only when the lock is released
  – t&s produces bus traffic every time it executes
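The same idea in C11 (names are mine): spin on an ordinary atomic load, which hits in the local cache, and only issue the atomic exchange when the lock looks free:

```c
#include <stdatomic.h>

typedef struct { atomic_int held; } ttas_lock;

void ttas_acquire(ttas_lock *l) {
    for (;;) {
        while (atomic_load(&l->held) == 1)
            ;                                   /* spin in the cache, no bus traffic */
        if (atomic_exchange(&l->held, 1) == 0)  /* t&s only when lock looks free */
            return;                             /* got it */
    }
}

void ttas_release(ttas_lock *l) { atomic_store(&l->held, 0); }
```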
Lock performance
• t&s
  – Latency: low under low contention; high under high contention
  – Bus traffic: a lot
  – Scalability: poor
  – Storage: low (does not increase with processor count)
  – Fair: no
• t&s with backoff
  – Latency: low under low contention (as t&s); high under high contention
  – Bus traffic: less than t&s
  – Scalability: better than t&s
  – Storage: low (does not increase with processor count)
  – Fair: no
• t&t&s
  – Latency: low under low contention, slightly higher than t&s; high under high contention
  – Bus traffic: less than t&s and t&s with backoff
  – Scalability: better than t&s and t&s with backoff
  – Storage: low (does not increase with processor count)
  – Fair: no
Improved Hardware Primitives: LL-SC
• Goals:
  – avoid the test&set problem of generating lots of bus traffic: failed read-modify-write attempts should not generate invalidations
  – nice if a single primitive can implement a range of r-m-w operations
• Load-Locked (or -Linked), Store-Conditional
  – LL reads the variable into a register
  – work on the value in the register
  – SC tries to store the register back to the location
  – SC succeeds if and only if there has been no other write to the variable since this processor's LL
    • indicated by a condition flag
• If SC succeeds, all three steps happened atomically
• If SC fails, it doesn't write or generate invalidations
  – must retry the acquire
Simple Lock with LL-SC
lock:   ll   reg1, location  /* LL location into reg1 */
        bnz  reg1, lock      /* if locked, try again */
        sc   location, reg2  /* SC reg2 (holding 1) into location */
        beqz reg2, lock      /* if SC failed, start again */
        ret
unlock: st   location, #0    /* write 0 to location */
        ret

• Can implement the atomic ops t&s, fetch&op, compare&swap by changing what's between LL & SC (exercise)
  – only a couple of instructions, so SC is likely to succeed
  – don't include instructions that would need to be undone (e.g. stores)
• SC can fail (without putting a transaction on the bus) if it:
  – detects an intervening write even before trying to get the bus
  – tries to get the bus but another processor's SC gets the bus first
• LL and SC are not lock and unlock respectively
  – they only guarantee that there is no conflicting write to the lock variable between them
  – but they can be used directly to implement simple operations on shared variables
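C has no LL/SC, but `atomic_compare_exchange` plays the same role: the store succeeds only if the location still holds the value that was read. A sketch of the fetch&add exercise in that style (the function name is mine):

```c
#include <stdatomic.h>

/* fetch&add built from a read / recompute / conditional-store loop,
   the same shape as the LL ... SC retry loop on the slide. */
int fetch_and_add_cas(atomic_int *loc, int inc) {
    int old = atomic_load(loc);              /* the "LL": read the value */
    /* the "SC": store old+inc only if *loc still equals old; on failure,
       compare_exchange refreshes old with the current value, so retry */
    while (!atomic_compare_exchange_weak(loc, &old, old + inc))
        ;
    return old;
}
```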
Advanced lock algorithms
• Problems with the approaches presented so far
  – unfair: the order of arrival does not count
  – all processors try to acquire the lock when it is released
  – several processors may incur a read miss when the lock is released
    • desirable: only one miss
Ticket Lock
• Like drawing a numbered ticket and waiting until that number is served
• Two counters per lock (next_ticket, now_serving)
  – Acquire: my_ticket = fetch&inc(next_ticket); wait until now_serving == my_ticket
    • atomic op only on arrival at the lock, not each time the lock is free (so less contention)
  – Release: increment now_serving
• Performance
  – low latency under low contention
  – O(p) read misses at release, since all spin on the same variable
  – FIFO order
  – like the simple LL-SC lock, but no invalidation when SC succeeds, and fair
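A minimal C11 sketch of the ticket lock (names are mine):

```c
#include <stdatomic.h>

typedef struct {
    atomic_int next_ticket;   /* next ticket to hand out */
    atomic_int now_serving;   /* ticket currently allowed in */
} ticket_lock;

void ticket_acquire(ticket_lock *l) {
    int my_ticket = atomic_fetch_add(&l->next_ticket, 1);  /* draw a ticket */
    while (atomic_load(&l->now_serving) != my_ticket)
        ;                                                  /* spin until called */
}

void ticket_release(ticket_lock *l) {
    atomic_fetch_add(&l->now_serving, 1);                  /* call the next ticket */
}
```

The single atomic op happens at arrival; the release is an ordinary increment, which is what gives FIFO order without contention on acquire.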
Array-based Queuing Locks
• Waiting processes poll on different locations in an array of size p
  – Acquire
    • fetch&inc to obtain the address on which to spin (next array element)
    • ensure that these addresses are in different cache lines or memories
  – Release
    • set the next location in the array, waking up the process spinning on it
  – O(1) traffic per acquire with coherent caches
  – FIFO ordering, as in the ticket lock, but O(p) space per lock
  – not so great for non-cache-coherent machines with distributed memory
    • the array location I spin on is not necessarily in my local memory
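A sketch of the array lock in C11; P, the padding size, and all names are illustrative assumptions. Each waiter spins on its own cache-line-padded slot, so a release invalidates only one processor's line:

```c
#include <stdatomic.h>

#define P    8       /* max number of processors (assumed) */
#define LINE 64      /* pad each slot to its own cache line */

typedef struct {
    struct {
        atomic_int must_wait;
        char pad[LINE - sizeof(atomic_int)];
    } slot[P];
    atomic_int next_slot;   /* fetch&inc hands each arrival its slot */
} array_lock;

void array_init(array_lock *l) {
    for (int i = 0; i < P; i++)
        atomic_store(&l->slot[i].must_wait, i == 0 ? 0 : 1);  /* slot 0 starts free */
    atomic_store(&l->next_slot, 0);
}

int array_acquire(array_lock *l) {               /* returns my slot index */
    int me = atomic_fetch_add(&l->next_slot, 1) % P;
    while (atomic_load(&l->slot[me].must_wait))
        ;                                        /* spin on my own location */
    return me;
}

void array_release(array_lock *l, int me) {
    atomic_store(&l->slot[me].must_wait, 1);             /* re-arm my slot */
    atomic_store(&l->slot[(me + 1) % P].must_wait, 0);   /* wake the next waiter */
}
```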
Lock performance
• t&s
  – Latency: low under low contention; high under high contention
  – Bus traffic: a lot
  – Scalability: poor
  – Storage: O(1)
  – Fair: no
• t&s with backoff
  – Latency: low under low contention (as t&s); high under high contention
  – Bus traffic: less than t&s
  – Scalability: better than t&s
  – Storage: O(1)
  – Fair: no
• t&t&s
  – Latency: low under low contention, slightly higher than t&s; high under high contention
  – Bus traffic: less (no traffic while waiting)
  – Scalability: better than t&s with backoff
  – Storage: O(1)
  – Fair: no
• ll/sc
  – Latency: low under low contention; under high contention better than t&t&s
  – Bus traffic: like t&t&s, plus no traffic on a missed attempt
  – Scalability: better than t&t&s
  – Storage: O(1)
  – Fair: no
• ticket
  – Latency: low under low contention; under high contention better than ll/sc
  – Bus traffic: a little less than ll/sc
  – Scalability: like ll/sc
  – Storage: O(1)
  – Fair: yes (FIFO)
• array
  – Latency: low under low contention, like t&t&s; under high contention better than ticket
  – Bus traffic: less than ticket
  – Scalability: more scalable than ticket (only one processor incurs the miss)
  – Storage: O(p)
  – Fair: yes (FIFO)
Transactional memory

Transactional memory benefits

Transactional memory drawbacks
Point-to-Point Event Synchronization
• Software methods:
  – busy-waiting: use ordinary variables as flags
  – blocking: semaphores
  – interrupts
• Full hardware support: a full-empty bit with each word in memory
  – set when the word is "full" with newly produced data (i.e. when written)
  – unset when the word is "empty" due to being consumed (i.e. when read)
  – natural for word-level producer-consumer synchronization
    • producer: write if empty, set to full
    • consumer: read if full, set to empty
  – hardware preserves read/write atomicity
  – problem: flexibility
    • multiple consumers
    • multiple updates by a producer
Barriers
• Hardware barriers
  – wired-AND line separate from the address/data bus
    • set your input to 1 on arrival, wait for the output to become 1 before leaving
  – useful when barriers are global and very frequent
  – difficult to support an arbitrary subset of processors
    • even harder with multiple processes per processor
  – difficult to dynamically change the number and identity of participants
    • e.g. the latter due to process migration
  – not common today on bus-based machines
• Software algorithms implemented using locks, flags, counters
A Simple Centralized Barrier
• Shared counter maintains the number of processes that have arrived
  – increment on arrival (under a lock), spin until it reaches p
  – Problem?

struct bar_type {
    int counter;
    struct lock_type lock;
    int flag = 0;
} bar_name;

BARRIER (bar_name, p) {
    LOCK(bar_name.lock);
    if (bar_name.counter == 0)
        bar_name.flag = 0;              /* reset flag if first to reach */
    mycount = ++bar_name.counter;       /* mycount is private */
    UNLOCK(bar_name.lock);
    if (mycount == p) {                 /* last to arrive */
        bar_name.counter = 0;           /* reset for next barrier */
        bar_name.flag = 1;              /* release waiters */
    }
    else
        while (bar_name.flag == 0) {};  /* busy-wait for release */
}
A Working Centralized Barrier
• Consecutively entering the same barrier doesn't work
  – must prevent a process from entering until all have left the previous instance
  – could use another counter, but that increases latency and contention
• Sense reversal: wait for the flag to take a different value in consecutive instances
  – toggle this value only when all processes have reached the barrier

BARRIER (bar_name, p) {
    local_sense = !(local_sense);       /* toggle private sense variable */
    LOCK(bar_name.lock);
    mycount = bar_name.counter++;       /* mycount is private */
    if (bar_name.counter == p) {
        UNLOCK(bar_name.lock);
        bar_name.counter = 0;           /* reset for next instance */
        bar_name.flag = local_sense;    /* release waiters */
    }
    else {
        UNLOCK(bar_name.lock);
        while (bar_name.flag != local_sense) {};
    }
}
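A runnable C11 sketch of the sense-reversing barrier (names are mine); a small t&s spinlock stands in for the pseudocode's LOCK/UNLOCK:

```c
#include <stdatomic.h>

typedef struct {
    atomic_flag lock;     /* protects counter */
    int counter;
    atomic_int flag;      /* the sense flag everyone spins on */
} barrier_t;

void barrier_wait(barrier_t *b, int p, int *local_sense) {
    *local_sense = !*local_sense;             /* toggle private sense */
    while (atomic_flag_test_and_set(&b->lock))
        ;                                     /* LOCK */
    b->counter++;
    if (b->counter == p) {                    /* last to arrive */
        b->counter = 0;                       /* reset for next instance */
        atomic_flag_clear(&b->lock);          /* UNLOCK */
        atomic_store(&b->flag, *local_sense); /* release the waiters */
    } else {
        atomic_flag_clear(&b->lock);          /* UNLOCK */
        while (atomic_load(&b->flag) != *local_sense)
            ;                                 /* spin on the sense flag */
    }
}
```

Each process keeps its own `local_sense`, so consecutive barrier instances wait for opposite flag values and cannot interfere.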
Centralized Barrier Performance
• Latency
  – critical path length at least proportional to p (the accesses to the critical section are serialized by the lock)
• Traffic
  – p bus transactions to obtain the lock
  – p bus transactions to modify the counter
  – 2 bus transactions for the last processor to reset the counter and release the waiting processes
  – p-1 bus transactions for the first p-1 processors to read the flag
• Storage cost
  – very low: centralized counter and flag
• Fairness
  – the same processor should not always be last to exit the barrier
• Key problems for the centralized barrier are latency and traffic
  – especially with distributed memory, where all traffic goes to the same node
Improved Barrier Algorithms for a Bus
• Software combining tree
  – only k processors access the same location, where k is the degree of the tree (k = 2 in the example)
  [Figure: flat (contention on a single location) vs. tree-structured (little contention)]
  – separate arrival and exit trees, and use sense reversal
  – valuable in a distributed network: communicate along different paths
  – on a bus, all traffic goes on the same bus, and there is no less total traffic
  – higher latency (log p steps of work, and O(p) serialized bus transactions)
  – the advantage on a bus is the use of ordinary reads/writes instead of locks
Scalable Multiprocessors
Scalable Machines
• Scalability: the capability of a system to grow by adding processors, memory, and I/O devices
• 4 important aspects of scalability
  – bandwidth increases with the number of processors
  – latency does not increase, or increases only slowly
  – cost increases slowly with the number of processors
  – physical placement of resources
Limited Scaling of a Bus
• Small configurations are cost-effective

Characteristic             | Bus
---------------------------|--------------------------
Physical length            | ~ 1 ft
Number of connections      | fixed
Maximum bandwidth          | fixed
Interface to comm. medium  | extended memory interface
Global order               | arbitration
Protection                 | virtual -> physical
Trust                      | total
OS                         | single
Comm. abstraction          | HW
Workstations in a LAN?
• No clear limit to physical scaling, little trust, no global order
• Independent failure and restart

Characteristic             | Bus                 | LAN
---------------------------|---------------------|------------
Physical length            | ~ 1 ft              | km
Number of connections      | fixed               | many
Maximum bandwidth          | fixed               | ???
Interface to comm. medium  | memory interface    | peripheral
Global order               | arbitration         | ???
Protection                 | virtual -> physical | OS
Trust                      | total               | none
OS                         | single              | independent
Comm. abstraction          | HW                  | SW
Bandwidth Scalability
• Bandwidth limitation: a single shared set of wires
• Must have many independent wires (remember bisection width?) => switches
[Figure: four processor + memory nodes attached through switches (S); typical switches: bus, multiplexers, crossbar]
Dancehall MP Organization
• Network bandwidth demand scales linearly with the number of processors
• Latency increases with the number of switch stages (remember the butterfly?)
  – adding local memory would offer fixed latency
[Figure: processors with caches on one side of a scalable switched network, memory modules on the other]
Generic Distributed Memory Multiprocessor
• Most common structure
[Figure: nodes, each containing a processor with cache ($), memory (M), and a communication assist (CA), connected via switches to a scalable network]
Bandwidth scaling requirements
• Large number of independent communication paths between nodes: many concurrent transactions using different wires
• Independent transactions
• No global arbitration
• The effect of a transaction is visible only to the nodes involved
  – broadcast is difficult (it was easy on a bus): additional transactions are needed
Latency Scaling
T(n) = Overhead + Channel Time (Channel Occupancy) + Routing Delay + Contention Time
• Overhead: processing time to initiate and complete a transfer
• Channel Time(n) = n/B, for an n-byte message on a channel of bandwidth B
• Routing Delay(h, n): a function of the number of hops h
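The latency model can be worked as a small function. All parameter values below are made-up illustrations, and modeling the routing delay as h times a per-hop delay is an assumption (the slide only says it depends on h and n):

```c
/* T(n) = overhead + channel time + routing delay + contention */
double transfer_time(double overhead,    /* fixed per-transfer overhead (s) */
                     double n,           /* message size (bytes)            */
                     double B,           /* channel bandwidth (bytes/s)     */
                     double h,           /* number of hops                  */
                     double per_hop,     /* assumed routing delay per hop   */
                     double contention) {
    return overhead + n / B + h * per_hop + contention;
}
```

For example, a 1 KB message over a 1 GB/s channel with 1 us overhead and 4 hops of 50 ns each is dominated by the fixed overhead, which is the usual story for small messages.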
Cost Scaling
• Cost(p, m) = fixed cost + incremental cost(p, m)
• Bus-based SMP
  – add more processors and memory
• Scalable machines
  – add processors, memory, network
• Parallel efficiency(p) = Speedup(p) / p
• Costup(p) = Cost(p) / Cost(1)
• Cost-effective: Speedup(p) > Costup(p)
Cost Effective?
• 2048 processors: 475-fold speedup at 206x cost
[Figure: speedup and costup vs. number of processors (0 to 2048), for Speedup = P/(1 + log P) and Costup = 1 + 0.1 P]
Physical Scaling
• Chip-level integration
  – multicore
  – Cell
• Board level
  – several multicores on a board
• System level
  – clusters, supercomputers
Chip-level integration: nCUBE/2
• Network integrated onto the chip: 14 bidirectional links => up to 8192 nodes
• Entire machine synchronous at 40 MHz
[Figure: single-chip node (DRAM interface, DMA channels, router, MMU, I-fetch & decode, 64-bit integer and IEEE floating-point execution unit, operand $), the basic module, and a 1024-node hypercube network configuration]
Chip-level integration: Cell
• PPE (Power Processor Element), 3.2 GHz
• Synergistic Processing Elements (SPEs)
Board-level integration: CM-5
• Uses standard microprocessor components (SPARC, FPU, DRAM)
• Scalable network interconnect
[Figure: machine organization with diagnostics, control, and data networks, processing partitions, control processors, and an I/O partition; node: SPARC with $ and FPU, vector units with DRAM controllers on the MBUS, and a network interface (NI) to the control and data networks]
System Level Integration
• Loose packaging
• IBM SP-2
• Cluster blades
[Figure: IBM SP-2 node: POWER2 CPU with L2 $ and memory controller to 4-way interleaved DRAM on the memory bus; an i860-based NI with DMA on the MicroChannel I/O bus; nodes joined by a general interconnection network formed from 8-port switches]
Roadrunner
• Next-generation supercomputer to be built at the Los Alamos National Laboratory in New Mexico
• 1 petaflops; funded by the US Department of Energy
• Hybrid design
  – more than 16,000 AMD Opteron cores (~2,200 IBM x3755 4U servers, each holding four dual-core Opterons, connected by InfiniBand)
  – a comparable number of Cell microprocessors
  – Red Hat Linux operating system
• When completed, it is expected to be the world's most powerful computer and to cover approximately 12,000 square feet (1,100 square meters); expected to be operational in 2008
• Purpose: simulating how nuclear materials age and whether the aging nuclear weapon arsenal of the United States is safe and reliable