Computer Architecture II, Lecture 10


Page 1: Computer architecture II, Lecture 10 (title slide)

Page 2

Today

• Synchronization for shared-memory multiprocessors
  – test&set, LL and SC, array-based locks
  – barriers
• Scalable multiprocessors
  – What is a scalable machine?

Page 3

Synchronization

• Types of synchronization
  – Mutual exclusion
  – Event synchronization
    • point-to-point
    • group
    • global (barriers)

• All solutions rely on hardware support for an atomic read-modify-write operation

• We look today at synchronization for cache-coherent, bus-based multiprocessors

Page 4

Components of a Synchronization Event

• Acquire method
  – acquire the right to the synch (e.g. enter the critical section)
• Waiting algorithm
  – wait for the synch to become available when it isn't
  – busy-waiting, blocking, or hybrid
• Release method
  – enable other processors to acquire

Page 5

Performance Criteria for Synch. Ops

• Latency (time per op)
  – especially under light contention
• Bandwidth (ops per sec)
  – especially under high contention
• Traffic: load on critical resources
  – especially on failure, under contention
• Storage
• Fairness

Page 6

Strawman Lock

lock:   ld  register, location   /* copy location to register */
        cmp location, #0         /* compare with 0 */
        bnz lock                 /* if not 0, try again (busy-waiting) */
        st  location, #1         /* store 1 to mark it locked */
        ret                      /* return control to caller */

unlock: st  location, #0         /* write 0 to location */
        ret                      /* return control to caller */

Location is initially 0.

Why doesn't the acquire method work? (The load, compare, and store are separate instructions: two processors can both read 0 and both proceed to mark the lock taken.)

Page 7

Atomic Instructions

• Specifies a location, register, & atomic operation

– Value in location read into a register

– Another value (function of value read or not) stored into location

• Many variants

– Varying degrees of flexibility in second part

• Simple example: test&set

– Value in location read into a specified register

– Constant 1 stored into location

– Successful if value loaded into register is 0

– Other constants could be used instead of 1 and 0

Page 8

Simple Test&Set Lock

lock:   t&s register, location   /* atomically read location and set it to 1 */
        bnz lock                 /* if not 0, try again */
        ret                      /* return control to caller */

unlock: st  location, #0         /* write 0 to location */
        ret                      /* return control to caller */

The same lock in pseudocode:

while (!acquired)         /* lock is acquired by another processor */
    test&set(location);   /* try to acquire the lock */

• Condition: the architecture supports an atomic test&set
  – copy location to a register and set location to 1
• Problem:
  – t&s modifies the lock variable in its cache each time it tries to acquire the lock => cache block invalidations => bus traffic (especially under high contention)

Page 9

T&S Lock Microbenchmark: SGI Challenge

lock; delay(c); unlock;

[Figure: time vs. number of processors (1 to 15) for Test&set, c = 0; Test&set with exponential backoff, c = 3.64; Test&set with exponential backoff, c = 0; and Ideal. Y-axis 0 to 20.]

• Why does performance degrade?
  – Bus transactions on T&S

Page 10

Other read-modify-write primitives

• Fetch&op
  – atomically read a memory location, modify it (using the op operation), and write it back
  – e.g. fetch&add, fetch&incr
• Compare&swap
  – three operands: a location, a register to compare with, and a register to swap with

Page 11

Enhancements to Simple Lock

• Problem of t&s: lots of invalidations if the lock cannot be taken
• Reduce the frequency of issuing test&sets while waiting
  – test&set lock with exponential backoff

i = 0;
while (!acquired) {          /* lock is acquired by another processor */
    test&set(location);
    if (!acquired) {         /* test&set didn't succeed */
        wait(t_i);           /* sleep some time, growing with i */
        i++;
    }
}

• Fewer invalidations
• May wait longer than necessary

Page 12

T&S Lock Microbenchmark: SGI Challenge

lock; delay(c); unlock;

[Figure repeated from page 9: time vs. number of processors for the test&set variants, with and without exponential backoff, against the ideal.]

• Why does performance degrade?
  – Bus transactions on T&S

Page 13

Enhancements to Simple Lock

• Reduce the frequency of issuing test&sets while waiting
  – test-and-test&set lock

while (!acquired) {           /* lock is acquired by another processor */
    if (location == 1)        /* test with ordinary load */
        continue;
    else {
        test&set(location);
        if (acquired)         /* succeeded */
            break;
    }
}

• Keep testing with an ordinary load
  – just a hint: the cached lock variable will be invalidated when the release occurs
  – if location becomes 0, use t&s to modify the variable atomically
  – on failure, start over
• Further reduces bus transactions
  – the load produces bus traffic only when the lock is released
  – t&s produces bus traffic each time it is executed
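As a C11 sketch of the test-and-test&set idea (names are illustrative): spin on an ordinary read, and attempt the atomic exchange only when the lock looks free.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Test-and-test&set lock: quiet spinning on a plain load, atomic
   exchange only when the lock appears free. */
typedef struct { atomic_bool held; } ttas_lock_t;

void ttas_lock(ttas_lock_t *l) {
    for (;;) {
        while (atomic_load(&l->held))
            ;                           /* test: read hits the cache, no RMW */
        /* looks free: now do the real test&set (atomic exchange) */
        if (!atomic_exchange(&l->held, true))
            return;                     /* old value was 0 => we got it */
        /* someone beat us to it: go back to quiet spinning */
    }
}

void ttas_unlock(ttas_lock_t *l) {
    atomic_store(&l->held, false);
}
```

While waiting, each processor spins on its cached copy; bus traffic appears only when the release invalidates that copy, matching the slide's argument.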

Page 14

Lock performance

• t&s
  – Latency: low under low contention; high under high contention
  – Bus traffic: a lot
  – Scalability: poor
  – Storage: low (does not increase with processor number)
  – Fairness: no
• t&s with backoff
  – Latency: low under low contention (same as t&s with no contention); high under high contention
  – Bus traffic: less than t&s
  – Scalability: better than t&s
  – Storage: low (does not increase with processor number)
  – Fairness: no
• t&t&s
  – Latency: low under low contention, a little higher than t&s; high under high contention
  – Bus traffic: less than t&s and t&s with backoff
  – Scalability: better than t&s and t&s with backoff
  – Storage: low (does not increase with processor number)
  – Fairness: no

Page 15

Improved Hardware Primitives: LL-SC

• Goals:
  – problem of test&set: it generates a lot of bus traffic
  – failed read-modify-write attempts shouldn't generate invalidations
  – nice if a single primitive can implement a range of r-m-w operations
• Load-Locked (or -Linked), Store-Conditional
  – LL reads the variable into a register
  – work on the value in the register
  – SC tries to store back to the location
  – succeeds if and only if there has been no other write to the variable since this processor's LL
    • indicated by a condition flag
• If SC succeeds, all three steps happened atomically
• If SC fails, it doesn't write or generate invalidations
  – must retry the acquire

Page 16

Simple Lock with LL-SC

lock:   ll    reg1, location   /* LL location into reg1 */
        bnz   reg1, lock       /* if location was not 0, try again */
        sc    location, reg2   /* SC reg2 (= 1) into location */
        beqz  reg2, lock       /* if SC failed, start again */
        ret

unlock: st    location, #0     /* write 0 to location */
        ret

• Can simulate the atomic ops t&s, fetch&op, compare&swap by changing what's between LL & SC (exercise)
  – only a couple of instructions, so the SC is likely to succeed
  – don't include instructions that would need to be undone (e.g. stores)
• SC can fail (without putting a transaction on the bus) if:
  – it detects an intervening write even before trying to get the bus
  – it tries to get the bus but another processor's SC gets the bus first
• LL and SC are not lock and unlock, respectively
  – they only guarantee no conflicting write to the lock variable between them
  – but they can be used directly to implement simple operations on shared variables
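Mainstream C exposes compare-and-swap rather than LL/SC, but a CAS retry loop plays the same role in building r-m-w operations. As a sketch, here is fetch&add built from C11 `atomic_compare_exchange_weak` (the initial load stands in for the LL, the conditional store for the SC):

```c
#include <stdatomic.h>

/* fetch&add built from a CAS retry loop -- the same pattern one would
   write with LL/SC on MIPS, Alpha, or ARM. Returns the old value. */
int fetch_and_add(atomic_int *location, int amount) {
    int old = atomic_load(location);          /* like LL: read current value */
    /* Like SC: the store succeeds only if *location still equals old;
       on failure, compare_exchange_weak reloads old and we retry. */
    while (!atomic_compare_exchange_weak(location, &old, old + amount))
        ;
    return old;
}
```

As with LL/SC, the work between the read and the conditional store is kept to a couple of instructions so that the attempt is likely to succeed.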

Page 17

Advanced lock algorithms

• Problems with the presented approaches
  – unfair: the order of arrival does not count
  – all processors try to acquire the lock when it is released
  – several processors may incur a read miss when the lock is released
• Desirable: only one miss

Page 18

Ticket Lock

• Draw a ticket with a number, wait until that number is shown
• Two counters per lock (next_ticket, now_serving)
  – Acquire: fetch&inc next_ticket; wait until now_serving equals my ticket
    • atomic op when arriving at the lock, not when it's free (so less contention)
  – Release: increment now_serving
• Performance
  – low latency under low contention
  – O(p) read misses at release, since all spin on the same variable
  – FIFO order
  – like the simple LL-SC lock, but no invalidation when SC succeeds, and fair
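The two-counter scheme can be sketched with C11 atomics (field and function names here are illustrative):

```c
#include <stdatomic.h>

/* Ticket lock: one atomic fetch&inc on arrival, plain loads while
   spinning, a plain increment on release. */
typedef struct {
    atomic_uint next_ticket;   /* next ticket to hand out */
    atomic_uint now_serving;   /* ticket currently allowed in */
} ticket_lock_t;

void ticket_lock_init(ticket_lock_t *l) {
    atomic_init(&l->next_ticket, 0);
    atomic_init(&l->now_serving, 0);
}

void ticket_lock_acquire(ticket_lock_t *l) {
    /* the only atomic RMW: draw a ticket on arrival */
    unsigned my_ticket = atomic_fetch_add(&l->next_ticket, 1);
    /* then spin with ordinary loads until our number is shown */
    while (atomic_load(&l->now_serving) != my_ticket)
        ;
}

void ticket_lock_release(ticket_lock_t *l) {
    /* only the lock holder writes now_serving */
    atomic_fetch_add(&l->now_serving, 1);
}
```

Tickets are granted in fetch&inc order, which is what gives the FIFO fairness the slide describes.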

Page 19

Array-based Queuing Locks

• Waiting processes poll on different locations in an array of size p
  – Acquire
    • fetch&inc to obtain the address on which to spin (next array element)
    • ensure that these addresses are in different cache lines or memories
  – Release
    • set the next location in the array, thus waking up the process spinning on it
  – O(1) traffic per acquire with coherent caches
  – FIFO ordering, as in the ticket lock, but O(p) space per lock
  – not so great for non-cache-coherent machines with distributed memory
    • the array location I spin on is not necessarily in my local memory
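A minimal C11 sketch of the array-based lock, assuming at most MAXPROCS threads ever queue at once and 64-byte cache lines (both assumptions are mine, as are all the names):

```c
#include <stdatomic.h>

#define MAXPROCS 8   /* assumed maximum number of queued threads */

/* Array-based queuing lock: each waiter spins on its own slot,
   padded so that slots fall in different cache lines. */
typedef struct {
    struct {
        atomic_int must_wait;
        char pad[60];                 /* pad slot to ~64 bytes */
    } slot[MAXPROCS];
    atomic_uint next;                 /* fetch&inc'd to assign slots */
} array_lock_t;

void array_lock_init(array_lock_t *l) {
    for (int i = 0; i < MAXPROCS; i++)
        atomic_init(&l->slot[i].must_wait, i == 0 ? 0 : 1);
    atomic_init(&l->next, 0);
}

/* Returns my slot index, which the caller passes back at release. */
unsigned array_lock_acquire(array_lock_t *l) {
    unsigned me = atomic_fetch_add(&l->next, 1) % MAXPROCS;
    while (atomic_load(&l->slot[me].must_wait))
        ;                             /* spin on my own location only */
    return me;
}

void array_lock_release(array_lock_t *l, unsigned me) {
    atomic_store(&l->slot[me].must_wait, 1);                   /* re-arm my slot */
    atomic_store(&l->slot[(me + 1) % MAXPROCS].must_wait, 0);  /* wake the next waiter */
}
```

Release touches exactly one other slot, so only the next waiter takes a miss, which is the O(1)-traffic property the slide claims.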

Page 20

Lock performance

• t&s
  – Latency: low under low contention; high under high contention
  – Bus traffic: a lot
  – Scalability: poor
  – Storage: O(1)
  – Fairness: no
• t&s with backoff
  – Latency: low under low contention (as t&s); high under high contention
  – Bus traffic: less than t&s
  – Scalability: better than t&s
  – Storage: O(1)
  – Fairness: no
• t&t&s
  – Latency: low under low contention, a little higher than t&s; high under high contention
  – Bus traffic: less: no traffic while waiting
  – Scalability: better than t&s with backoff
  – Storage: O(1)
  – Fairness: no
• ll/sc
  – Latency: low under low contention; under high contention, better than t&t&s
  – Bus traffic: like t&t&s, plus no traffic on a missed attempt
  – Scalability: better than t&t&s
  – Storage: O(1)
  – Fairness: no
• ticket
  – Latency: low under low contention; under high contention, better than ll/sc
  – Bus traffic: a little less than ll/sc
  – Scalability: like ll/sc
  – Storage: O(1)
  – Fairness: yes (FIFO)
• array
  – Latency: low under low contention, like t&t&s; under high contention, better than ticket
  – Bus traffic: less than ticket
  – Scalability: more scalable than ticket (only one processor incurs the miss)
  – Storage: O(p)
  – Fairness: yes (FIFO)

Pages 21-27: Transactional memory

[Figure-only slides covering transactional memory, its benefits, and its drawbacks.]

Page 28

Point-to-Point Event Synchronization

• Software methods:
  – busy-waiting: use ordinary variables as flags
  – blocking: semaphores
  – interrupts
• Full hardware support: a full-empty bit with each word in memory
  – set when the word is "full" with newly produced data (i.e. when written)
  – unset when the word is "empty" due to being consumed (i.e. when read)
  – natural for word-level producer-consumer synchronization
    • producer: write if empty, set to full
    • consumer: read if full, set to empty
  – hardware preserves read or write atomicity
  – problem: flexibility
    • multiple consumers
    • multiple updates by a producer
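The full-empty discipline can be emulated in software for a single producer and single consumer; this C11 sketch keeps the "bit" in a separate atomic flag (the hardware scheme stores it alongside each memory word, and all names here are illustrative):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Software emulation of a full-empty bit guarding one data word. */
typedef struct {
    atomic_bool full;   /* the full-empty bit */
    int value;          /* the data word */
} fe_word_t;

/* Producer: write only if empty, then mark full. Returns 0 on success,
   -1 if the word is still full (a real implementation would block/spin). */
int fe_write(fe_word_t *w, int v) {
    if (atomic_load(&w->full))
        return -1;
    w->value = v;
    atomic_store(&w->full, true);    /* publish: mark full */
    return 0;
}

/* Consumer: read only if full, then mark empty. Returns 0 on success,
   -1 if the word is still empty. */
int fe_read(fe_word_t *w, int *out) {
    if (!atomic_load(&w->full))
        return -1;
    *out = w->value;
    atomic_store(&w->full, false);   /* consume: mark empty */
    return 0;
}
```

With several producers or consumers, the separate test and set here race, which mirrors the flexibility problem the slide raises for the hardware scheme.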

Page 29

Barriers

• Hardware barriers
  – wired-AND line separate from the address/data bus
    • set input to 1 when arriving, wait for the output to be 1 before leaving
  – useful when barriers are global and very frequent
  – difficult to support an arbitrary subset of processors
    • even harder with multiple processes per processor
  – difficult to dynamically change the number and identity of participants
    • e.g. the latter due to process migration
  – not common today on bus-based machines
• Software algorithms implemented using locks, flags, counters

Page 30

A Simple Centralized Barrier

• Shared counter maintains the number of processes that have arrived
  – increment when arriving (under lock), check until it reaches numprocs
  – Problem?

struct bar_type {
    int counter;
    struct lock_type lock;
    int flag = 0;
} bar_name;

BARRIER (bar_name, p) {
    LOCK(bar_name.lock);
    if (bar_name.counter == 0)
        bar_name.flag = 0;             /* reset flag if first to reach */
    mycount = bar_name.counter++;      /* mycount is private */
    UNLOCK(bar_name.lock);
    if (mycount == p-1) {              /* last to arrive */
        bar_name.counter = 0;          /* reset for next barrier */
        bar_name.flag = 1;             /* release waiters */
    }
    else
        while (bar_name.flag == 0) {}; /* busy-wait for release */
}

Page 31

A Working Centralized Barrier

• Consecutively entering the same barrier doesn't work
  – must prevent a process from entering until all have left the previous instance
  – could use another counter, but that increases latency and contention
• Sense reversal: wait for the flag to take a different value in consecutive instances
  – toggle this value only when all processes have reached the barrier

BARRIER (bar_name, p) {
    local_sense = !(local_sense);      /* toggle private sense variable */
    LOCK(bar_name.lock);
    mycount = bar_name.counter++;      /* mycount is private */
    if (bar_name.counter == p) {       /* last to arrive */
        UNLOCK(bar_name.lock);
        bar_name.counter = 0;
        bar_name.flag = local_sense;   /* release waiters */
    }
    else {
        UNLOCK(bar_name.lock);
        while (bar_name.flag != local_sense) {};
    }
}
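A runnable version of the sense-reversing barrier can be sketched with C11 atomics; here an atomic fetch&inc stands in for the LOCK/UNLOCK pair around the counter, and all names are illustrative:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Sense-reversing centralized barrier. */
typedef struct {
    atomic_int counter;   /* number of threads that have arrived */
    atomic_bool flag;     /* global sense */
    int p;                /* number of participating threads */
} barrier_t;

void barrier_init(barrier_t *b, int p) {
    atomic_init(&b->counter, 0);
    atomic_init(&b->flag, false);
    b->p = p;
}

/* Each thread keeps its own local_sense (e.g. a stack or thread-local
   variable) and passes a pointer to it on every call. */
void barrier_wait(barrier_t *b, bool *local_sense) {
    *local_sense = !*local_sense;                 /* toggle private sense */
    /* fetch_add returns the old value, so the last arrival sees p-1 */
    if (atomic_fetch_add(&b->counter, 1) == b->p - 1) {
        atomic_store(&b->counter, 0);             /* reset for next instance */
        atomic_store(&b->flag, *local_sense);     /* release waiters */
    } else {
        while (atomic_load(&b->flag) != *local_sense)
            ;                                     /* spin until released */
    }
}
```

Because the counter is reset before the flag flips, a fast thread can re-enter the barrier immediately without corrupting the previous instance, which is exactly the problem sense reversal solves.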

Page 32

Centralized Barrier Performance

• Latency
  – critical path length at least proportional to p (the accesses to the critical section are serialized by the lock)
• Traffic
  – p bus transactions to obtain the lock
  – p bus transactions to modify the counter
  – 2 bus transactions for the last processor to reset the counter and release the waiting processes
  – p-1 bus transactions for the first p-1 processors to read the flag
• Storage cost
  – very low: a centralized counter and flag
• Fairness
  – the same processor should not always be the last to exit the barrier
• Key problems for the centralized barrier are latency and traffic
  – especially with distributed memory, where all the traffic goes to the same node

Page 33

Improved Barrier Algorithms for a Bus

Software combining tree
• Only k processors access the same location, where k is the degree of the tree (k = 2 in the example)

[Figure: flat structure (contention at a single location) vs. tree structure (little contention).]

– separate arrival and exit trees, and use sense reversal
– valuable in a distributed network: communicate along different paths
– on a bus, all traffic goes on the same bus, and there is no less total traffic
– higher latency (log p steps of work, and O(p) serialized bus transactions)
– the advantage on a bus is the use of ordinary reads/writes instead of locks

Page 34

Scalable Multiprocessors

Page 35

Scalable Machines

• Scalability: the capability of a system to grow by adding processors, memory, and I/O devices
• Four important aspects of scalability
  – bandwidth increases with the number of processors
  – latency does not increase, or increases only slowly
  – cost increases slowly with the number of processors
  – physical placement of resources

Page 36

Limited Scaling of a Bus

• Small configurations are cost-effective

Characteristic               Bus
Physical length              ~ 1 ft
Number of connections        fixed
Maximum bandwidth            fixed
Interface to comm. medium    extended memory interface
Global order                 arbitration
Protection                   virtual -> physical
Trust                        total
OS                           single
Comm. abstraction            HW

Page 37

Workstations in a LAN?

• No clear limit to physical scaling, little trust, no global order

• Independent failure and restart

Characteristic               Bus                        LAN
Physical length              ~ 1 ft                     km
Number of connections        fixed                      many
Maximum bandwidth            fixed                      ???
Interface to comm. medium    memory interface           peripheral
Global order                 arbitration                ???
Protection                   virtual -> physical        OS
Trust                        total                      none
OS                           single                     independent
Comm. abstraction            HW                         SW

Page 38

Bandwidth Scalability

• Bandwidth limitation: a single set of wires
• Must have many independent wires (remember bisection width?) => switches

[Figure: four nodes, each with a processor (P) and memory modules (M), attached through switches (S); typical switches: bus, multiplexers, crossbar.]

Page 39

Dancehall MP Organization

• Network bandwidth demand: scales linearly with the number of processors
• Latency: increases with the number of switch stages (remember the butterfly?)
  – adding local memory would offer fixed latency

[Figure: dancehall organization: processors with caches ($) on one side of a multistage scalable network of switches, memory modules (M) on the other side.]

Page 40

Generic Distributed Memory Multiprocessor

• Most common structure

[Figure: nodes, each containing a processor (P), cache ($), memory (M), and communication assist (CA), connected by a scalable network of switches.]

Page 41

Bandwidth scaling requirements

• Large number of independent communication paths between nodes: a large number of concurrent transactions using different wires
• Independent transactions
• No global arbitration
• The effect of a transaction is visible only to the nodes involved
  – broadcast is difficult (it was easy on a bus): additional transactions are needed

Page 42

Latency Scaling

T(n) = Overhead + Channel Time (Channel Occupancy) + Routing Delay + Contention Time

• Overhead: processing time in initiating and completing a transfer
• Channel Time(n) = n/B, for an n-byte transfer over a channel of bandwidth B
• Routing Delay(h, n): a function of the number of hops h
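The model above can be written out as a small function; only the shape of the formula comes from the slide, and the parameter values below are made-up numbers for illustration:

```c
/* Illustrative latency model T(n) for an n-byte transfer over h hops.
   All constants are invented for the example. */
double transfer_time(double n, double h) {
    const double overhead   = 1.0;    /* fixed cost per transfer */
    const double B          = 100.0;  /* channel bandwidth (bytes per time unit) */
    const double per_hop    = 0.05;   /* routing delay per hop */
    const double contention = 0.0;    /* assume an unloaded network */
    return overhead + n / B + h * per_hop + contention;
}
```

With these numbers, doubling the message size doubles only the channel-occupancy term, while adding hops grows only the routing term: the components scale independently, which is why the slide separates them.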

Page 43

Cost Scaling

• Cost(p, m) = fixed cost + incremental cost(p, m)
• Bus-based SMP
  – add more processors and memory
• Scalable machines
  – add processors, memory, network
• Parallel efficiency(p) = Speedup(p) / p
• Costup(p) = Cost(p) / Cost(1)
• Cost-effective: Speedup(p) > Costup(p)

Page 44

Cost Effective?

• 2048 processors: 475-fold speedup at 206x cost

[Figure: Speedup(P) = P/(1 + log P) and Costup(P) = 1 + 0.1 P plotted against the number of processors, 0 to 2000.]

Page 45

Physical Scaling

• Chip-level integration
  – multicore
  – Cell
• Board level
  – several multicores on a board
• System level
  – clusters, supercomputers

Page 46

Chip-level integration: nCUBE/2

• Network integrated onto the chip: 14 bidirectional links => up to 8192 nodes
• Entire machine synchronous at 40 MHz

[Figure: single-chip node (DRAM interface, DMA channels, router, MMU, instruction fetch & decode, 64-bit integer and IEEE floating-point units, operand cache, execution unit); basic module; 1024-node hypercube network configuration.]

Page 47

Chip-level integration: Cell

• PPE (Power Processing Element) at 3.2 GHz
• Synergistic Processing Elements (SPEs)

Page 48

Board level integration: CM-5

• Uses standard microprocessor components (SPARC)
• Scalable network interconnect

[Figure: machine organization with diagnostics network, control network, and data network spanning processing partitions, control processors, and an I/O partition; node detail: SPARC processor on the MBUS with FPU, cache controller and SRAM, vector units with DRAM controllers and DRAM banks, and a network interface (NI) to the control and data networks.]

Page 49

System Level Integration

• Loose packaging
• IBM SP-2
• Cluster blades

[Figure: IBM SP-2 node: Power 2 CPU with L2 cache on a memory bus; memory controller with 4-way interleaved DRAM; MicroChannel I/O bus carrying DMA and an i860-based NI with its own DRAM; nodes connected by a general interconnection network formed from 8-port switches.]

Page 50

Roadrunner

• Next-generation supercomputer to be built at the Los Alamos National Laboratory in New Mexico
• 1 petaflops; funded by the US Department of Energy
• Hybrid design
  – more than 16,000 AMD Opteron cores (~2200 IBM x3755 4U servers, each holding four dual-core Opterons, connected by InfiniBand)
  – a comparable number of Cell microprocessors
  – Red Hat Linux operating system
• When completed, it is expected to be the world's most powerful computer, covering approximately 12,000 square feet (1,100 square meters); expected to be operational in 2008
• Purpose: simulating how nuclear materials age and whether the aging nuclear weapon arsenal of the United States is safe and reliable