Shared Address Space Computing: Hardware Issues
Alistair Rendell
See Chapter 2 of Lin and Snyder; Chapter 2 of Grama, Gupta, Karypis and Kumar; and also "Computer Architecture: A Quantitative Approach", J.L. Hennessy and D.A. Patterson, Morgan Kaufmann.
[Figure: Grama fig 2.5 – shared-address-space architectures built from processors (P), caches (C) and memory (M) on an interconnection network: (a) UMA (Uniform Memory Access); (b) UMA with caches; (c) NUMA (Non-Uniform Memory Access) with caches and memory local to each processor.]
Shared Address Space Systems
• Systems with caches but otherwise flat memory are generally called UMA
• If access to local memory is cheaper than access to remote memory (NUMA), this should be built into your algorithm
– How to do this, and what the O/S supports, is another matter
– (man numa on Linux gives details of NUMA support)
• A global address space is easier to program
– Read-only interactions are invisible to the programmer and coded as in a sequential program
– Read/write interactions are harder, requiring mutual exclusion for concurrent accesses (see the sketch after this list)
• Programmed using threads and directives
– We will consider pthreads and OpenMP
• Synchronization using locks and related mechanisms
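As an illustration of the read/write case, a minimal pthreads sketch (illustrative, not from the slides; NTHREADS, INCREMENTS and counter are made-up names): read-only data needs no protection, but the concurrent read-modify-write of a shared counter requires a mutex.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS   4
    #define INCREMENTS 100000

    static long counter = 0;   /* shared read/write data */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < INCREMENTS; i++) {
            pthread_mutex_lock(&lock);    /* mutual exclusion for the     */
            counter++;                    /* concurrent read-modify-write */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        /* with the mutex this always prints NTHREADS * INCREMENTS */
        printf("counter = %ld\n", counter);
        return 0;
    }

Compile with cc -pthread; removing the lock/unlock pair makes the final count nondeterministic.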
Caches on Multiprocessors
• Multiple copies of some data word may be manipulated by two or more processors at the same time
• Two requirements
– An address translation mechanism that locates a memory word in the system
– Concurrent operations on multiple copies must have well-defined semantics
• The latter is generally known as a cache coherency protocol
– (I/O using DMA on machines with caches also leads to coherency issues)
• Some machines only provide shared-address-space mechanisms and leave coherence to the programmer
– The Cray T3D provided get, put and cache invalidate
Shared-Address-Space and Shared-Memory Computers
• Shared-memory is historically used for architectures in which memory is physically shared among the processors, all of which have equal access to any memory segment
– This is identical to the UMA model
• Compare distributed-memory computers, where different memory segments are physically associated with different processing elements
• Either physical model can present the logical view of a disjoint or a shared-address-space platform
– A distributed-memory shared-address-space computer is a NUMA system
[Figure: Grama fig 2.21 – invalidate versus update protocols. P0 and P1 both load x (= 1) into their caches; P0 then executes write #3, x. Invalidate protocol: P1's copy is marked invalid. Update protocol: P1's copy is updated to 3.]
Cache Coherency Protocols: Update v Invalidate
• Update Protocol
– When a data item is written, all of its copies in the system are updated
• Invalidate Protocol (most common)
– Before a data item is written, all other copies are marked as invalid
• Comparison
– Multiple writes to the same word with no intervening reads require multiple write broadcasts in an update protocol, but only one initial invalidation
– With multiword cache blocks, each word written in a cache block must be broadcast in an update protocol, but only one invalidate per line is required
– The delay between writing a word on one processor and reading the written data on another is usually less for update
• False sharing: two processors modify different parts of the same cache line
– An invalidate protocol leads to ping-ponged cache lines
– An update protocol performs reads locally but must still broadcast every update (see the sketch below)
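A minimal sketch of false sharing (illustrative, not from the slides; the padding assumes 64-byte cache lines): two threads each increment their own counter, first with the counters on the same cache line and then with padding pushing them onto separate lines. On an invalidate-based machine the shared-line version ping-pongs the line between the two caches.

    #include <pthread.h>
    #include <stdio.h>

    #define ITERS 100000000L

    struct { long a, b; } same_line;                    /* a and b share a cache line    */
    struct { long a; char pad[64]; long b; } separate;  /* padding assumes 64-byte lines */

    static void *bump(void *p)
    {
        long *v = p;
        for (long i = 0; i < ITERS; i++)
            (*v)++;               /* each thread touches only its own counter */
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        /* time this run, then swap in &same_line.a / &same_line.b and compare */
        pthread_create(&t1, NULL, bump, &separate.a);
        pthread_create(&t2, NULL, bump, &separate.b);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("%ld %ld\n", separate.a, separate.b);
        return 0;
    }

Timing the two variants typically shows a large slowdown for the shared-line version even though the threads never touch the same word.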
Implementation Techniques
• On small-scale bus-based machines
– A processor must obtain access to the bus to broadcast a write invalidation
– With two competing processors, the first to gain access to the bus will invalidate the other's data
• A cache miss needs to locate the top (most up-to-date) copy of the data
– Easy for a write-through cache
– For write-back, each processor snoops the bus and responds by providing the data if it has the top copy
• For writes we would like to know if any other copies of the block are cached
– i.e. whether a write-back cache needs to put details on the bus
– Handled by having a tag to indicate shared status
• Minimizing processor stalls
– Either by duplication of tags or by having multiple inclusive caches
[Figure: Grama fig 2.22 – state diagram of the 3-state (MSI) protocol. States: Invalid, Shared, Modified. Local reads and writes and coherency actions (c_read, c_write) drive the transitions; leaving Modified on a coherency action flushes the dirty line.]
3 State (MSI) Cache Coherency Protocol
• read: local read
• c_read: coherency read, i.e. a read on a remote processor gives rise to the shown transition in the local cache (and similarly for write / c_write)
MSI Coherency Protocol: worked example
(time runs down the page; memory initially holds x = 5 and y = 12)

Time | Inst.@P0  | Inst.@P1  | Var/S@P0             | Var/S@P1            | Var/S@Mem
  1  | read x    | read y    | x = 5, S             | y = 12, S           | x = 5, S;  y = 12, S
  2  | x = x + 1 | y = y + 1 | x = 6, M             | y = 13, M           | x = 5, I;  y = 12, I
  3  | read y    | read x    | x = 6, S;  y = 13, S | x = 6, S; y = 13, S | x = 6, S;  y = 13, S
  4  | x = x + y | y = x + y | x = 19, M; y = 13, I | x = 6, I; y = 19, M | x = 6, I;  y = 13, I
  5  | x = x + 1 | y = y + 1 | x = 20, M            | y = 20, M           | x = 6, I;  y = 13, I
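A small C sketch of the three-state machine (an illustration, not a full coherency implementation; all names are made up): next() encodes the transitions of fig 2.22 for one line in a local cache, and main() replays x's history at P0 from the table above: read x → S, x = x + 1 → M, P1's read of x → S, x = x + y → M.

    #include <stdio.h>

    typedef enum { I, S, M } msi_t;                     /* Invalid, Shared, Modified */
    typedef enum { READ, WRITE, C_READ, C_WRITE } ev_t; /* c_* = action observed     */
                                                        /* from a remote processor   */
    static msi_t next(msi_t s, ev_t e)
    {
        switch (s) {
        case I:
            if (e == READ)  return S;   /* read miss: fetch the line            */
            if (e == WRITE) return M;   /* write miss: fetch and take ownership */
            return I;                   /* remote traffic does not concern us   */
        case S:
            if (e == WRITE)   return M; /* upgrade: invalidate the other copies */
            if (e == C_WRITE) return I; /* a remote write invalidates our copy  */
            return S;                   /* local/remote reads leave us shared   */
        case M:
            if (e == C_READ)  return S; /* flush the dirty data, then share it  */
            if (e == C_WRITE) return I; /* flush, then invalidate               */
            return M;                   /* local reads/writes stay modified     */
        }
        return I;
    }

    int main(void)
    {
        const char *name[] = { "I", "S", "M" };
        ev_t trace[] = { READ, WRITE, C_READ, WRITE };  /* x at P0, times 1-4 */
        msi_t s = I;
        for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++) {
            s = next(s, trace[i]);
            printf("time %u: x at P0 is %s\n", i + 1, name[s]);
        }
        return 0;
    }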
Snoopy Cache Systems
• All caches broadcast all transactions
– Suited to bus or ring interconnects
• All processors monitor the bus for transactions of interest
• Each processor's cache has a set of tag bits that determine the state of the cache block
– Tags are updated according to the state diagram of the relevant protocol
• E.g. when the snoop hardware detects a read for a cache block of which it holds a dirty copy, it asserts control of the bus and puts the data out
• What sort of data access characteristics are likely to perform well/badly on snoopy based systems?
[Figure: Grama fig 2.24 – a snoopy cache based system: each processor has a cache with snoop hardware, tags and a dirty bit, all attached to a shared address/data bus and to memory.]
Directory Cache Based Systems
• The need to broadcast is clearly not scalable
– The solution is to send information only to the processing elements specifically interested in that data
• This requires a directory to store the information
– Augment global memory with a presence bitmap indicating which caches each memory block is located in
Directory Based Cache Coherency
• Must handle a read miss and a write to a shared, clean cache block
• To implement the directory we must track the state of each cache block
• A simple protocol might be:
– Shared: one or more processors have the block cached, and the value in memory is up to date
– Uncached: no processor has a copy
– Exclusive: only one processor (the owner) has a copy, and the value in memory is out of date
• We also need to track the processors that have copies of a shared cache block, or ownership of an exclusive one (see the sketch below)
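A sketch of a directory entry and of the write-to-a-shared-block case (illustrative, not from the slides; send_invalidate is a stand-in for the real interconnect operation, and the 64-bit presence bitmap assumes at most 64 processors):

    #include <stdint.h>
    #include <stdio.h>

    typedef enum { UNCACHED, SHARED, EXCLUSIVE } dir_state_t;

    struct dir_entry {
        dir_state_t state;    /* Uncached / Shared / Exclusive         */
        uint64_t    presence; /* bit p set => processor p holds a copy */
    };

    /* stand-in for sending an invalidate message over the interconnect */
    static void send_invalidate(int p) { printf("invalidate -> P%d\n", p); }

    /* A write to a Shared, clean block: invalidate every other sharer, */
    /* then record the writer as the exclusive owner.                   */
    static void handle_write(struct dir_entry *d, int writer)
    {
        uint64_t sharers = d->presence & ~(1ULL << writer);
        for (int p = 0; p < 64; p++)
            if (sharers & (1ULL << p))
                send_invalidate(p);
        d->state    = EXCLUSIVE;
        d->presence = 1ULL << writer;
    }

    int main(void)
    {
        struct dir_entry e = { SHARED, 0x0D }; /* P0, P2, P3 share the block */
        handle_write(&e, 0);                   /* P0 writes it               */
        return 0;
    }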
[Figure: Grama fig 2.25 – directory based cache coherency: (a) a centralized scheme, where the directory (presence bits plus status per block) sits with global memory; (b) a distributed scheme, where each processor's local memory holds the presence bits/state for its own blocks. Processors and caches communicate over an interconnection network.]
Directory Based Systems
• How much memory is required to store the directory? (a worked estimate follows)
• What sort of data access characteristics are likely to perform well/badly on directory based systems?
– How do distributed and centralized systems compare?
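As a rough worked estimate (illustrative numbers, not from the slides): a full bit-vector directory stores one presence bit per processor, plus a few state bits, for every memory block. With 64 processors and 64-byte (512-bit) cache blocks that is about 64 bits of directory per 512 bits of data, i.e. roughly 12.5% memory overhead, and the overhead grows linearly with processor count: at 256 processors it reaches 50%.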
Costs on SGI Origin 3000 (processor clock cycles)

                                              16 CPU   > 16 CPU
Cache hit                                          1          1
Cache miss to local memory                        85         85
Cache miss to remote home directory              125        150
Cache miss to remotely cached data (3 hop)       140        170

Data from: "Computer Architecture: A Quantitative Approach", David A. Patterson, John L. Hennessy, David Goldberg, 3rd ed., Morgan Kaufmann, 2003.
Real Cache Coherency Protocols
• From Wikipedia:
– “Most modern systems use variants of the MSI protocol to reduce the amount of traffic in the coherency interconnect. The MESI protocol adds an "Exclusive" state to reduce the traffic caused by writes of blocks that only exist in one cache. The MOSI protocol adds an "Owned" state to reduce the traffic caused by write-backs of blocks that are read by other caches [The processor owner of the cache line services requests for that data]. The MOESI protocol does both of these things. The MESIF protocol uses the "Forward" state to reduce the traffic caused by multiple responses to read requests when the coherency architecture allows caches to respond to snoop requests with data.”
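As a mnemonic for the state names in the quote (a summary in code form based only on the descriptions above, not a protocol implementation; no single protocol uses all six states):

    /* cache-line states used by the MSI variants described above */
    typedef enum {
        MODIFIED,   /* dirty; the only valid copy in the system                    */
        OWNED,      /* dirty but shared; this cache services requests for the line */
        EXCLUSIVE,  /* clean and the only copy; writable without coherency traffic */
        SHARED,     /* clean; other caches may also hold copies                    */
        INVALID,    /* holds no valid data                                         */
        FORWARD     /* the shared copy designated to answer snoop read requests    */
    } line_state_t;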
MESI (on a bus)
https://www.cs.tcd.ie/Jeremy.Jones/vivio/caches/MESIHelp.htm