Computer Architecture II
Lecture 10

Today
- Synchronization for SMM
  - Test and set, ll and sc, array
  - barrier
- Scalable Multiprocessors
  - What is a scalable machine?

Synchronization
- Types of synchronization
  - Mutual exclusion
  - Event synchronization
    - point-to-point
    - group
    - global (barriers)
- All solutions rely on hardware support for an atomic read-modify-write operation
- Today we look at synchronization for cache-coherent, bus-based multiprocessors

Components of a Synchronization Event
- Acquire method
  - Acquire right to the synch (e.g. enter critical section)
- Waiting algorithm
  - Wait for synch to become available when it isn't
  - busy-waiting, blocking, or hybrid
- Release method
  - Enable other processors to acquire

Performance Criteria for Synch. Ops
- Latency (time per op)
  - especially under light contention
- Bandwidth (ops per sec)
  - especially under high contention
- Traffic
  - load on critical resources
  - especially on failures under contention
- Storage
- Fairness

Strawman Lock

    lock:   ld   register, location   /* copy location to register */
            cmp  register, #0         /* compare with 0 */
            bnz  lock                 /* if not 0, try again */
            st   location, #1         /* store 1 to mark it locked */
            ret                       /* return control to caller */

    unlock: st   location, #0         /* write 0 to location */
            ret                       /* return control to caller */

- Busy-waiting
- Location is initially 0
- Why doesn't the acquire method work?

Atomic Instructions
- Specifies a location, register, & atomic operation
  - Value in location read into a register
  - Another value (function of value read or not) stored into location
- Many variants
  - Varying degrees of flexibility in second part
- Simple example: test&set
  - Value in location read into a specified register
  - Constant 1 stored into location
  - Successful if value loaded into register is 0
  - Other constants could be used instead of 1 and 0

Simple Test&Set Lock

    lock:   t&s  register, location
            bnz  lock                 /* if not 0, try again */
            ret                       /* return control to caller */

    unlock: st   location, #0         /* write 0 to location */
            ret                       /* return control to caller */

The same code for lock in pseudocode:

    while (test&set(location) != 0)   /* nonzero: lock is held by another processor */
        ;                             /* try again */

- Condition: architecture supports atomic test and set
  - Copy location to register and set location to 1
- Problem: t&s modifies the variable location in its cache each time it tries to acquire the lock
  => cache block invalidations => bus traffic (especially under high contention)
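
The lock above can be sketched in portable C, assuming a C11 compiler with `<stdatomic.h>`. The names `tas_lock_t`, `tas_try_lock`, `tas_lock` and `tas_unlock` are hypothetical and chosen for illustration; `atomic_flag_test_and_set` stands in for the hardware t&s instruction.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical lock type: flag clear = free, flag set = held. */
typedef struct { atomic_flag held; } tas_lock_t;

#define TAS_LOCK_INIT { ATOMIC_FLAG_INIT }

/* One test&set attempt: returns true if the lock was free and is now ours. */
static bool tas_try_lock(tas_lock_t *l) {
    assert(l != NULL);
    /* Sets the flag and returns its previous value in one atomic step. */
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

/* Acquire: every spin iteration issues a test&set, i.e. a write. */
static void tas_lock(tas_lock_t *l) {
    while (!tas_try_lock(l)) {
        /* busy-wait */
    }
}

/* Release: write 0 to the location. */
static void tas_unlock(tas_lock_t *l) {
    assert(l != NULL);
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

Because each failed attempt is still a write, every spinning processor keeps invalidating the others' cached copies, which is the traffic problem named in the slide.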

T&S Lock Microbenchmark: SGI Challenge
[Figure: Time (ms) versus number of processors. Series: Test&set, c = 0; Test&set with exponential backoff, c = 3.64; Test&set with exponential backoff, c = 0; Ideal]
- Benchmark loop: lock; delay(c); unlock;
- Why does performance degrade?
- Bus transactions on T&S

Other read-modify-write primitives
- Fetch&op
  - Atomically read, modify (using the op operation), and write a memory location
  - E.g. fetch&add, fetch&incr
- Compare&swap
  - Three operands: location, register to compare with, register to swap with
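
As an illustration of these two primitives, here is a minimal sketch using the C11 `<stdatomic.h>` operations `atomic_fetch_add` and `atomic_compare_exchange_strong`. The wrapper names `fetch_and_add` and `compare_and_swap` are hypothetical; the slides do not prescribe an interface.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* fetch&add: atomically add delta to *loc and return the OLD value. */
static int fetch_and_add(atomic_int *loc, int delta) {
    return atomic_fetch_add(loc, delta);
}

/* compare&swap: if *loc equals expected, store desired and return true;
 * otherwise leave *loc unchanged and return false. */
static bool compare_and_swap(atomic_int *loc, int expected, int desired) {
    /* On failure the library writes the current value into the local copy
     * of expected; this wrapper simply discards it. */
    return atomic_compare_exchange_strong(loc, &expected, desired);
}
```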

Enhancements to Simple Lock
- Problem of t&s: lots of invalidations if the lock cannot be taken
- Reduce frequency of issuing test&sets while waiting
  - Test&set lock with exponential backoff

    i = 0;
    while (!acquired) {                        /* lock is held by another processor */
        acquired = (test&set(location) == 0);
        if (!acquired) {                       /* test&set didn't succeed */
            wait(t_i);                         /* sleep for some time t_i that grows with i */
            i++;
        }
    }

- Fewer invalidations
- May wait longer than necessary
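
A minimal C11 sketch of the backoff idea follows. The delay bounds `BACKOFF_MIN` and `BACKOFF_MAX`, the doubling policy, and all function names are assumptions made for illustration; the slides only say that the waiting time grows with each failed attempt.

```c
#include <assert.h>
#include <stdatomic.h>

typedef struct { atomic_flag held; } backoff_lock_t;
#define BACKOFF_LOCK_INIT { ATOMIC_FLAG_INIT }

/* Assumed delay bounds, counted in empty loop iterations. */
enum { BACKOFF_MIN = 1, BACKOFF_MAX = 1024 };

/* Exponential policy: double the delay after each failure, up to a cap. */
static unsigned next_backoff(unsigned delay) {
    return delay >= BACKOFF_MAX / 2 ? BACKOFF_MAX : delay * 2;
}

/* Crude delay loop; volatile keeps the compiler from removing it. */
static void spin_delay(unsigned iterations) {
    for (volatile unsigned i = 0; i < iterations; i++) { }
}

static void backoff_lock(backoff_lock_t *l) {
    unsigned delay = BACKOFF_MIN;
    /* test&set returns the old value: nonzero means someone else holds it. */
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire)) {
        spin_delay(delay);            /* stay off the bus for a while */
        delay = next_backoff(delay);  /* wait longer after each failure */
    }
}

static void backoff_unlock(backoff_lock_t *l) {
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

The cap is a design choice: without it a processor that has failed many times may sleep long after the lock has become free, which is the "may wait longer" cost noted in the slide.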

Enhancements to Simple Lock
- Reduce frequency of issuing test&sets while waiting
  - Test-and-test&set lock

    while (!acquired) {                        /* lock is held by another processor */
        if (location == 1)                     /* test with ordinary load */
            continue;
        else
            acquired = (test&set(location) == 0);   /* looked free: try to take it */
    }

- Keep testing with ordinary load
  - Just a hint: cached lock variable will be invalidated when release occurs
- If location becomes 0, use t&s to modify the variable atomically
- If failure, start over
- Further reduces the bus transactions
  - load produces bus traffic only when the lock is released
  - t&s produces bus traffic each time it is executed
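
The test-and-test&set loop can be sketched in C11 as follows. Since `atomic_flag` cannot be read with a plain load, this sketch assumes an `atomic_int` lock word and expresses test&set as `atomic_exchange`; the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

typedef struct { atomic_int location; } ttas_lock_t;   /* 0 = free, 1 = held */

/* test&set expressed as an atomic exchange: store 1, return the old value. */
static int test_and_set(atomic_int *location) {
    return atomic_exchange_explicit(location, 1, memory_order_acquire);
}

static void ttas_lock(ttas_lock_t *l) {
    for (;;) {
        /* Test with ordinary loads: while the lock is held these hit in the
         * local cache and put nothing on the bus. */
        while (atomic_load_explicit(&l->location, memory_order_relaxed) != 0) { }
        /* The lock looked free, so only now issue the read-modify-write. */
        if (test_and_set(&l->location) == 0)
            return;
        /* Another processor won the race: go back to testing. */
    }
}

static void ttas_unlock(ttas_lock_t *l) {
    atomic_store_explicit(&l->location, 0, memory_order_release);
}
```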

Lock performance

Lock | Latency | Bus traffic | Scalability | Storage | Fairness
-----|---------|-------------|-------------|---------|---------
t&s | Low contention: low latency. High contention: high latency | A lot | Poor | Low (does not increase with processor number) | No
t&s with backoff | Low contention: low latency (as t&s for no contention). High contention: high latency | Less than t&s | Better than t&s | Low (does not increase with processor number) | No
t&t&s | Low contention: low latency, a little higher than t&s. High contention: high latency | Less than t&s and t&s with backoff | Better than t&s and t&s with backoff | Low (does not increase with processor number) | No

Improved Hardware Primitives: LL-SC
- Goals:
  - Problem of test&set: generates a lot of bus traffic
  - Failed read-modify-write attempts don't generate invalidations
  - Nice if single primitive can implement range of r-m-w operations
- Load-Locked (or -linked), Store-Conditional
  - LL reads variable into register
  - Work on the value from the register
  - SC tries to store back to location
    - succeeds if and only if there was no other write to the variable since this processor's LL
    - indicated by a condition flag
- If SC succeeds, all three steps happened atomically
- If it fails, it doesn't write or generate invalidations
  - must retry acquire

Simple Lock with LL-SC

    lock:   ll    reg1, location      /* LL location to reg1 */
            bnz   reg1, lock          /* if location was locked (nonzero), try again */
            sc    location, reg2      /* SC reg2 into location */
            beqz  reg2, lock          /* if failed, start again */
            ret

    unlock: st    location, #0        /* write 0 to location */
            ret

- Can simulate the atomic ops t&s, fetch&op, compare&swap by changing what's between LL & SC (exercise)
  - Only a couple of instructions, so SC likely to succeed
  - Don't include instructions that would need to be undone (e.g. stores)
- SC can fail (without putting transaction on bus) if:
  - Detects intervening write even before trying to get bus
  - Tries to get bus but another processor's SC gets bus first
- LL, SC are not lock, unlock respectively
  - Only guarantee no conflicting write to lock variable between them
  - But can use directly to implement simple operations on shared variables
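
C has no direct LL/SC, but the same "read, compute, conditionally store, retry on failure" pattern can be sketched with `atomic_compare_exchange_weak` from C11. The weak form is allowed to fail spuriously, which is what lets a compiler map it onto an LL/SC pair on machines that have one. The operation below, a hypothetical `fetch_and_max`, is an example of a fetch&op built this way; it is not taken from the slides.

```c
#include <assert.h>
#include <stdatomic.h>

/* fetch&max: atomically replace *loc with max(*loc, value) and return the
 * value that was there before. */
static int fetch_and_max(atomic_int *loc, int value) {
    int old = atomic_load(loc);   /* plays the role of LL */
    /* The conditional store plays the role of SC: it succeeds only if *loc
     * still holds old. On failure, old is refreshed with the current
     * contents and the loop retries. */
    while (old < value && !atomic_compare_exchange_weak(loc, &old, value)) {
        /* retry with the refreshed value */
    }
    return old;
}
```

As the slide advises, the work between the load and the conditional store is only a comparison, so the retry window stays short.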

Advanced lock algorithms
- Problems with presented approaches
  - Unfair: the order of arrival does not count
  - All processors try to acquire the lock when it is released
  - Many processes may incur a read miss when the lock is released
    - Desirable: only one miss

Ticket Lock
- Draw a ticket with a number, wait until the number is shown
- Two counters per lock (next_ticket, now_serving)
  - Acquire: fetch&inc next_ticket to get my ticket; wait until now_serving == my ticket
    - atomic op when arriving at the lock, not when it's free (so less contention)
  - Release: increment now_serving
- Performance
  - low latency under low contention
  - O(p) read misses at release, since all spin on same variable
  - FIFO order
    - like simple LL-SC lock, but no invalidation when SC succeeds, and fair
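
A minimal C11 sketch of the ticket lock, using `atomic_fetch_add` as the fetch&inc. The field names follow the slide; the type name `ticket_lock_t` and the helper `ticket_take` are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;   /* next ticket to hand out */
    atomic_uint now_serving;   /* ticket currently allowed in */
} ticket_lock_t;

/* Draw a ticket: the only atomic read-modify-write in the algorithm. */
static unsigned ticket_take(ticket_lock_t *l) {
    return atomic_fetch_add_explicit(&l->next_ticket, 1u, memory_order_relaxed);
}

/* Acquire: draw a ticket, then spin with ordinary loads until it is shown. */
static void ticket_lock(ticket_lock_t *l) {
    unsigned my_ticket = ticket_take(l);
    while (atomic_load_explicit(&l->now_serving, memory_order_acquire) != my_ticket) { }
}

/* Release: only the holder writes now_serving, so a load and a store are
 * enough; no atomic read-modify-write is needed. */
static void ticket_unlock(ticket_lock_t *l) {
    unsigned next = atomic_load_explicit(&l->now_serving, memory_order_relaxed) + 1u;
    atomic_store_explicit(&l->now_serving, next, memory_order_release);
}
```

All waiters spin on `now_serving`, so each release still invalidates every waiter's copy; that is the O(p) read-miss cost listed above.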

Array-based Queuing Locks
- Waiting processes poll on different locations in an array of size p
- Acquire
  - fetch&inc to obtain address on which to spin (next array element)
  - ensure that these addresses are in different cache lines or memories
- Release
  - set next location in array, thus waking up process spinning on it
- O(1) traffic per acquire with coherent caches
- FIFO ordering, as in ticket lock, but O(p) space per lock
- Not so great for non-cache-coherent machines with distributed memory
  - array location I spin on not necessarily in my local memory
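
The array-based lock can be sketched in C11 as below. The slot count `MAX_PROCS`, the 64-byte `CACHE_LINE`, and all type and function names are assumptions for illustration. The sketch requires that at most `MAX_PROCS` processors contend at once, and it keeps `MAX_PROCS` a power of two so that the slot index stays consistent when the counter wraps.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum { MAX_PROCS = 8, CACHE_LINE = 64 };   /* assumed values */

typedef struct {
    /* One flag per slot, aligned so that each sits in its own cache line. */
    struct { _Alignas(CACHE_LINE) atomic_bool can_enter; } slot[MAX_PROCS];
    atomic_uint next_slot;
} array_lock_t;

static void array_lock_init(array_lock_t *l) {
    for (unsigned i = 0; i < MAX_PROCS; i++)
        atomic_init(&l->slot[i].can_enter, i == 0);   /* slot 0 starts open */
    atomic_init(&l->next_slot, 0u);
}

/* Acquire: fetch&inc picks the slot to spin on. Returns the slot index,
 * which the caller must pass to array_lock_release. */
static unsigned array_lock_acquire(array_lock_t *l) {
    unsigned my = atomic_fetch_add(&l->next_slot, 1u) % MAX_PROCS;
    while (!atomic_load_explicit(&l->slot[my].can_enter, memory_order_acquire)) { }
    return my;
}

/* Release: re-arm own slot, then open the next one. Only the processor
 * spinning on that next slot takes a miss. */
static void array_lock_release(array_lock_t *l, unsigned my) {
    atomic_store_explicit(&l->slot[my].can_enter, false, memory_order_relaxed);
    atomic_store_explicit(&l->slot[(my + 1) % MAX_PROCS].can_enter, true,
                          memory_order_release);
}
```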

Lock performance

Lock | Latency | Bus traffic | Scalability | Storage | Fairness
-----|---------|-------------|-------------|---------|---------
t&s | Low contention: low latency. High contention: high latency | A lot | Poor | O(1) | No
t&s with backoff | Low contention: low latency (as t&s). High contention: high latency | Less than t&s | Better than t&s | O(1) | No
t&t&s | Low contention: low latency, a little higher than t&s. High contention: high latency | Less: no traffic while waiting | Better than t&s with backoff | O(1) | No
ll/sc | Low contention: low latency. High contention: better than t&t&s | Like t&t&s + no traffic on missed attempt | Better than t&t&s | O(1) | No
ticket | Low contention: low latency. High contention: better than ll/sc | A little less than ll/sc | Like ll/sc | O(1) | Yes (FIFO)
array | Low contention: low latency, like t&t&s. High contention: better than ticket | Less than ticket | More scalable than ticket (one processor incurs the miss) | O(p) | Yes (FIFO)

Transactional memory

Transactional memory benefits

Transactional memory drawbacks

Point to Point Event Synchronization
- Software methods:
  - Busy-waiting: use ordinary variables as flags
  - Blocking: semaphores
  - Interrupts
- Full hardware support: full-empty bit with each word in memory
  - Set when word is "full" with newly produced data (i.e. when written)
  - Unset when word is "empty" due to being consumed (i.e. when read)
  - Natural for word-level producer-consumer synchronization
    - producer: write if empty, set to full; consumer: read if full, set to empty
  - Hardware preserves read or write atomicity
  - Problem: flexibility
    - multiple consumers
    - multiple updates by a producer
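
The full-empty protocol can be imitated in software with a flag next to the data word. The sketch below assumes C11 atomics and exactly one producer and one consumer; `fe_word_t`, `fe_try_write` and `fe_try_read` are hypothetical names, and the flag is only a stand-in for the hardware bit described above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    int value;          /* the data word */
    atomic_bool full;   /* software stand-in for the full-empty bit */
} fe_word_t;

/* Producer: write only if the word is empty, then mark it full. */
static bool fe_try_write(fe_word_t *w, int v) {
    if (atomic_load_explicit(&w->full, memory_order_acquire))
        return false;                      /* previous value not consumed yet */
    w->value = v;
    atomic_store_explicit(&w->full, true, memory_order_release);
    return true;
}

/* Consumer: read only if the word is full, then mark it empty. */
static bool fe_try_read(fe_word_t *w, int *out) {
    if (!atomic_load_explicit(&w->full, memory_order_acquire))
        return false;                      /* nothing produced yet */
    *out = w->value;
    atomic_store_explicit(&w->full, false, memory_order_release);
    return true;
}
```

With two consumers, both could see the word as full and read the same value, which is one instance of the flexibility problem the slide mentions.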

Barriers
- Hardware barriers
  - Wired-AND line separate from address/data bus
    - Set input to 1 on arrival, wait for output to be 1 to leave
  - Useful when barriers are global and very frequent
  - Difficult to support arbitrary subset of processors
    - even harder with multiple processes per processor
  - Difficult to dynamically change number and identity of participants
    - e.g. latter due to process migration
  - Not common today on bus-based machines
- Software algorithms implemented using locks, flags, counters

A Simple Centralized Barrier

    struct bar_type {
        int counter;
        struct lock_type lock;
        int flag;                           /* initially 0 */
    } bar_name;

    BARRIER (bar_name, p) {
        LOCK(bar_name.lock);
        if (bar_name.counter == 0)
            bar_name.flag = 0;              /* reset flag if first to reach */
        mycount = ++bar_name.counter;       /* mycount is private; counts this arrival */
        UNLOCK(bar_name.lock);
        if (mycount == p) {                 /* last to arrive */
            bar_name.counter = 0;           /* reset for next barrier */
            bar_name.flag = 1;              /* release waiters */
        }
        else
            while (bar_name.flag == 0) {};  /* busy wait for release */
    }

- Shared counter maintains number of processes that have arrived
  - increment on arrival (under the lock), check until it reaches numprocs
- Problem?

A Working Centralized Barrier
- Consecutively entering the same barrier doesn't work
  - Must prevent process from entering until all have left previous instance
  - Could use another counter, but increases latency and contention
- Sense reversal: wait for flag to take different value consecutive times
  - Toggle this value only when all processes reach

    BARRIER (bar_name, p) {
        local_sense = !(local_sense);       /* toggle private sense variable */
        LOCK(bar_name.lock);
        mycount = bar_name.counter++;       /* mycount is private */
        if (bar_name.counter == p) {        /* last to arrive */
            UNLOCK(bar_name.lock);
            bar_name.counter = 0;           /* reset for next barrier */
            bar_name.flag = local_sense;    /* release waiters */
        }
        else {
            UNLOCK(bar_name.lock);
            while (bar_name.flag != local_sense) {};   /* busy wait for release */
        }
    }
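
For comparison, here is a sketch of the same sense-reversing barrier in C11. It swaps the LOCK/UNLOCK pair for a single atomic fetch&inc on the counter, which is a substitution made for this example and not the slide's method. `barrier_t` and `barrier_wait` are hypothetical names, and each thread is assumed to keep its own `local_sense`, initially false.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int counter;   /* number of arrivals in the current episode */
    atomic_bool flag;     /* release flag, initially false */
    int p;                /* number of participating threads */
} barrier_t;

/* local_sense is private to each thread and must start out false. */
static void barrier_wait(barrier_t *b, bool *local_sense) {
    *local_sense = !*local_sense;                        /* toggle private sense */
    int arrived = atomic_fetch_add(&b->counter, 1) + 1;  /* count this arrival */
    if (arrived == b->p) {                               /* last to arrive */
        atomic_store(&b->counter, 0);                    /* reset for next barrier */
        atomic_store(&b->flag, *local_sense);            /* release waiters */
    } else {
        while (atomic_load(&b->flag) != *local_sense) { }   /* busy wait */
    }
}
```

Because waiters compare the flag against their own sense instead of a fixed value, a fast thread that re-enters the barrier cannot be released by the previous episode's flag value.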

Centralized Barrier Performance
- Latency
  - critical path length at least proportional to p (the accesses to the critical region are serialized by the lock)
- Traffic
  - p bus transactions to obtain the lock
  - p bus transactions to modify the counter
  - 2 bus transactions for the last processor to reset the counter and release the waiting processes
  - p-1 bus transactions for the first p-1 processors to read the flag
- Storage cost
  - Very low: centralized counter and flag
- Fairness
  - Same processor should not always be last to exit barrier
- Key problems for centralized barrier are latency and traffic
  - Especially with distributed memory, traffic goes to same node
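
Adding up the four traffic terms listed above gives the total number of bus transactions per barrier episode. The value p = 16 is an arbitrary example, not a figure from the lecture.

```latex
\begin{align*}
\text{Traffic}(p) &= \underbrace{p}_{\text{lock}} + \underbrace{p}_{\text{counter}}
                   + \underbrace{2}_{\text{reset + release}} + \underbrace{(p-1)}_{\text{flag reads}} \\
                  &= 3p + 1 \\
\text{Traffic}(16) &= 3 \cdot 16 + 1 = 49
\end{align*}
```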

Improved Barrier Algorithms for a Bus
- Software combining tree
  - Only k processors access the same location, where k is the degree of the tree
  - [Figure: example combining tree with k = 2]
- Separate arrival and exit trees, and use sense reversal
- Valuable in distributed network: communicate along different paths
- On bus, all traffic goes on same bus, and no less total traffic
- Higher latency (log p steps of work, and O(p) serialized bus transactions)
- Advantage on bus is use of ordinary reads/writes instead of locks

Scalable Multiprocessors

Scalable Machines
- Scalability: capability of a system to grow by adding processors, memory, I/O devices
- 4 important aspects of scalability
  - Bandwidth increases with number of processors
  - Latency does not increase or increases slowly
  - Cost increases slowly with number of processors
  - Physical placement of resources

Limited Scaling of a Bus
- Small configurations are cost-effective

Characteristic | Bus
---------------|----
Physical length | ~ 1 ft
Number of connections | fixed
Maximum bandwidth | fixed
Interface to comm. medium | extended memory interface
Global order | arbitration
Protection | virtual -> physical
Trust | total
OS | single
Comm. abstraction | HW

Workstations in a LAN?
- No clear limit to physical scaling, little trust, no global order
- Independent failure and restart

Characteristic | Bus | LAN
---------------|-----|----
Physical length | ~ 1 ft | km
Number of connections | fixed | many
Maximum bandwidth | fixed | ???
Interface to comm. medium | memory interface | peripheral
Global order | arbitration | ???
Protection | virtual -> physical | OS
Trust | total | none
OS | single | independent
Comm. abstraction | HW | SW

Bandwidth Scalability
- Bandwidth limitation: single set of wires
- Must have many independent wires (remember bisection width?) => switches

Dancehall MP Organization
- Network bandwidth demand: scales linearly with number of processors
- Latency: increases with number of stages of switches (remember butterfly?)
  - Adding local memory would offer fixed latency

Generic Distributed Memory Multiprocessor
- Most common structure

Bandwidth scaling requirements
- Large number of independent communication paths between nodes: large number of concurrent transactions using different wires
- Independent transactions
- No global arbitration
- Effect of a transaction only visible to the nodes involved
  - Broadcast difficult (was easy for bus): additional transactions needed

Latency Scaling
- T(n) = Overhead + Channel Time (Channel Occupancy) + Routing Delay + Contention Time
  - Overhead: processing time in initiating and completing a transfer
  - Channel Time(n) = n/B, where n is the amount of data transferred and B is the channel bandwidth
  - Routing Delay(h, n), where h is the number of hops

Cost Scaling
- Cost(p,m) = fixed cost + incremental cost(p,m)
- Bus-based SMP
  - Add more processors and memory
- Scalable machines
  - processors, memory, network
- Parallel efficiency(p) = Speedup(p) / p
- Costup(p) = Cost(p) / Cost(1)
- Cost-effective: Speedup(p) > Costup(p)

Cost Effective?
- 2048 processors: 475-fold speedup at 206x cost
[Chart: Speedup = P/(1 + log P) and Costup = 1 + 0.1 P plotted against the number of processors P, for P = 1 to 2048]
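
The 2048-processor figure can be checked by hand from the two model formulas, Speedup = P/(1 + log P) and Costup = 1 + 0.1 P. The logarithm is assumed here to be base 10, since that choice reproduces the quoted 475-fold speedup.

```latex
\begin{align*}
\text{Speedup}(2048) &= \frac{2048}{1 + \log_{10} 2048} \approx \frac{2048}{4.311} \approx 475 \\
\text{Costup}(2048)  &= 1 + 0.1 \cdot 2048 = 205.8 \approx 206 \\
\text{Efficiency}(2048) &= \frac{\text{Speedup}(2048)}{2048} \approx 0.23
\end{align*}
```

Since 475 > 206, the machine is cost-effective under the definition Speedup(p) > Costup(p), even though its parallel efficiency is only about 23%.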

Physical Scaling
- Chip-level integration
  - Multicore
  - Cell
- Board-level
  - Several multicores on a board
- System level
  - Clusters, supercomputers

Chip-level integration: nCUBE/2
- Network integrated onto the chip
- 14 bidirectional links => 8192 nodes
- Entire machine synchronous at 40 MHz
- 1024 nodes

Chip-level integration: Cell
- PPE
- 3.2 GHz
- Synergistic Processing Elements

Board level integration: CM-5
- Use standard microprocessor components
- Scalable network interconnect

System Level Integration
- Loose packaging
- IBM SP2
- Cluster blades

Roadrunner
- Next-generation supercomputer to be built at the Los Alamos National Laboratory in New Mexico
- 1 petaflops
- US Department of Energy
- Hybrid design
  - more than 16,000 AMD Opteron cores (~2200 IBM x3755 4U servers, each holding four dual-core Opterons, connected by InfiniBand)
  - a comparable number of Cell microprocessors
- Red Hat Linux operating system
- Expected to be operational in 2008; when completed, it will be the world's most powerful computer and cover approximately 12,000 square feet (1,100 square meters)
- Purpose: simulating how nuclear materials age and whether the aging nuclear weapon arsenal of the United States is safe and reliable

Notes on backoff for the ticket lock
- It can be difficult to find a good amount to delay on backoff
- Exponential backoff is not a good idea because of the FIFO order
- Backoff proportional to the distance between now_serving and the waiting process's own ticket may work well