Shared Address Space Computing: Hardware Issues
Alistair Rendell
See Chapter 2 of Lin and Snyder; Chapter 2 of Grama, Gupta, Karypis and Kumar; and also "Computer Architecture: A Quantitative Approach", J.L. Hennessy and D.A. Patterson, Morgan Kaufmann.
[Figure: Grama fig 2.5 – shared-address-space architectures built from processors (P), caches (C) and memory (M) on an interconnection network: (a) UMA (Uniform Memory Access); (b) UMA with caches; (c) NUMA (Non-Uniform Memory Access) with caches and memory local to each processor.]
Shared Address Space Systems
• Systems with caches but otherwise flat memory are generally called UMA
• If access to local memory is cheaper than access to remote memory (NUMA), this should be built into your algorithm
– How to do this, and what the O/S supports, is another matter
– (man numa on Linux gives details of NUMA support)
• A global address space is easier to program
– Read-only interactions are invisible to the programmer and coded as in a sequential program
– Read/write interactions are harder, requiring mutual exclusion for concurrent accesses (see the sketch after this list)
• Programmed using threads and directives
– We will consider pthreads and OpenMP
• Synchronization using locks and related mechanisms
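As an illustration of the read/write case, a minimal pthreads sketch (illustrative, not from the slides; NTHREADS, INCREMENTS and counter are made-up names): read-only data needs no protection, but the concurrent read-modify-write of a shared counter requires a mutex.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS   4
    #define INCREMENTS 100000

    static long counter = 0;   /* shared read/write data */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < INCREMENTS; i++) {
            pthread_mutex_lock(&lock);    /* mutual exclusion for the     */
            counter++;                    /* concurrent read-modify-write */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        /* with the mutex this always prints NTHREADS * INCREMENTS */
        printf("counter = %ld\n", counter);
        return 0;
    }

Compile with cc -pthread; removing the lock/unlock pair makes the final count nondeterministic.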
Caches on Multiprocessors
• Multiple copies of some data word may be manipulated by two or more processors at the same time
• Two requirements
– An address translation mechanism that locates a memory word in the system
– Concurrent operations on multiple copies must have well-defined semantics
• The latter is generally known as a cache coherency protocol
– (I/O using DMA on machines with caches also leads to coherency issues)
• Some machines only provide shared-address-space mechanisms and leave coherence to the programmer
– The Cray T3D provided get, put and cache invalidate
Shared-Address-Space and Shared-Memory Computers
• Shared-memory is historically used for architectures in which memory is physically shared among the processors, all of which have equal access to any memory segment
– This is identical to the UMA model
• Compare distributed-memory computers, where different memory segments are physically associated with different processing elements
• Either physical model can present the logical view of a disjoint or a shared-address-space platform
– A distributed-memory shared-address-space computer is a NUMA system
[Figure: Grama fig 2.21 – invalidate versus update protocols. P0 and P1 both load x (= 1) into their caches; P0 then executes write #3, x. Invalidate protocol: P1's copy is marked invalid. Update protocol: P1's copy is updated to 3.]
Cache Coherency Protocols: Update v Invalidate
• Update Protocol
– When a data item is written, all of its copies in the system are updated
• Invalidate Protocol (most common)
– Before a data item is written, all other copies are marked as invalid
• Comparison
– Multiple writes to the same word with no intervening reads require multiple write broadcasts in an update protocol, but only one initial invalidation
– With multiword cache blocks, each word written in a cache block must be broadcast in an update protocol, but only one invalidate per line is required
– The delay between writing a word on one processor and reading the written data on another is usually less for update
• False sharing: two processors modify different parts of the same cache line
– An invalidate protocol leads to ping-ponged cache lines
– An update protocol performs reads locally but must still broadcast every update (see the sketch below)
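A minimal sketch of false sharing (illustrative, not from the slides; the padding assumes 64-byte cache lines): two threads each increment their own counter, first with the counters on the same cache line and then with padding pushing them onto separate lines. On an invalidate-based machine the shared-line version ping-pongs the line between the two caches.

    #include <pthread.h>
    #include <stdio.h>

    #define ITERS 100000000L

    struct { long a, b; } same_line;                    /* a and b share a cache line    */
    struct { long a; char pad[64]; long b; } separate;  /* padding assumes 64-byte lines */

    static void *bump(void *p)
    {
        long *v = p;
        for (long i = 0; i < ITERS; i++)
            (*v)++;               /* each thread touches only its own counter */
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        /* time this run, then swap in &same_line.a / &same_line.b and compare */
        pthread_create(&t1, NULL, bump, &separate.a);
        pthread_create(&t2, NULL, bump, &separate.b);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("%ld %ld\n", separate.a, separate.b);
        return 0;
    }

Timing the two variants typically shows a large slowdown for the shared-line version even though the threads never touch the same word.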
Implementation Techniques
• On small-scale bus-based machines
– A processor must obtain access to the bus to broadcast a write invalidation
– With two competing processors, the first to gain access to the bus will invalidate the other's data
• A cache miss needs to locate the top (most up-to-date) copy of the data
– Easy for a write-through cache
– For write-back, each processor snoops the bus and responds by providing the data if it has the top copy
• For writes we would like to know if any other copies of the block are cached
– i.e. whether a write-back cache needs to put details on the bus
– Handled by having a tag to indicate shared status
• Minimizing processor stalls
– Either by duplication of tags or by having multiple inclusive caches
[Figure: Grama fig 2.22 – state diagram of the 3-state (MSI) protocol. States: Invalid, Shared, Modified. Local reads and writes and coherency actions (c_read, c_write) drive the transitions; leaving Modified on a coherency action flushes the dirty line.]
3 State (MSI) Cache Coherency Protocol
• read: local read
• c_read: coherency read, i.e. a read on a remote processor gives rise to the shown transition in the local cache (and similarly for write / c_write)
MSI Coherency Protocol: worked example
(time runs down the page; memory initially holds x = 5 and y = 12)

Time | Inst.@P0  | Inst.@P1  | Var/S@P0             | Var/S@P1            | Var/S@Mem
  1  | read x    | read y    | x = 5, S             | y = 12, S           | x = 5, S;  y = 12, S
  2  | x = x + 1 | y = y + 1 | x = 6, M             | y = 13, M           | x = 5, I;  y = 12, I
  3  | read y    | read x    | x = 6, S;  y = 13, S | x = 6, S; y = 13, S | x = 6, S;  y = 13, S
  4  | x = x + y | y = x + y | x = 19, M; y = 13, I | x = 6, I; y = 19, M | x = 6, I;  y = 13, I
  5  | x = x + 1 | y = y + 1 | x = 20, M            | y = 20, M           | x = 6, I;  y = 13, I
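A small C sketch of the three-state machine (an illustration, not a full coherency implementation; all names are made up): next() encodes the transitions of fig 2.22 for one line in a local cache, and main() replays x's history at P0 from the table above: read x → S, x = x + 1 → M, P1's read of x → S, x = x + y → M.

    #include <stdio.h>

    typedef enum { I, S, M } msi_t;                     /* Invalid, Shared, Modified */
    typedef enum { READ, WRITE, C_READ, C_WRITE } ev_t; /* c_* = action observed     */
                                                        /* from a remote processor   */
    static msi_t next(msi_t s, ev_t e)
    {
        switch (s) {
        case I:
            if (e == READ)  return S;   /* read miss: fetch the line            */
            if (e == WRITE) return M;   /* write miss: fetch and take ownership */
            return I;                   /* remote traffic does not concern us   */
        case S:
            if (e == WRITE)   return M; /* upgrade: invalidate the other copies */
            if (e == C_WRITE) return I; /* a remote write invalidates our copy  */
            return S;                   /* local/remote reads leave us shared   */
        case M:
            if (e == C_READ)  return S; /* flush the dirty data, then share it  */
            if (e == C_WRITE) return I; /* flush, then invalidate               */
            return M;                   /* local reads/writes stay modified     */
        }
        return I;
    }

    int main(void)
    {
        const char *name[] = { "I", "S", "M" };
        ev_t trace[] = { READ, WRITE, C_READ, WRITE };  /* x at P0, times 1-4 */
        msi_t s = I;
        for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++) {
            s = next(s, trace[i]);
            printf("time %u: x at P0 is %s\n", i + 1, name[s]);
        }
        return 0;
    }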
Snoopy Cache Systems
• All caches broadcast all transactions
– Suited to bus or ring interconnects
• All processors monitor the bus for transactions of interest
• Each processor's cache has a set of tag bits that determine the state of the cache block
– Tags are updated according to the state diagram of the relevant protocol
• E.g. when the snoop hardware detects a read for a cache block of which it holds a dirty copy, it asserts control of the bus and puts the data out
• What sort of data access characteristics are likely to perform well/badly on snoopy based systems?
[Figure: Grama fig 2.24 – a snoopy cache based system: each processor has a cache with snoop hardware, tags and a dirty bit, all attached to a shared address/data bus and to memory.]
Directory Cache Based Systems
• The need to broadcast is clearly not scalable
– The solution is to send information only to the processing elements specifically interested in that data
• This requires a directory to store the information
– Augment global memory with a presence bitmap indicating which caches each memory block is located in
Directory Based Cache Coherency
• Must handle a read miss and a write to a shared, clean cache block
• To implement the directory we must track the state of each cache block
• A simple protocol might be:
– Shared: one or more processors have the block cached, and the value in memory is up to date
– Uncached: no processor has a copy
– Exclusive: only one processor (the owner) has a copy, and the value in memory is out of date
• We also need to track the processors that have copies of a shared cache block, or ownership of an exclusive one (see the sketch below)
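A sketch of a directory entry and of the write-to-a-shared-block case (illustrative, not from the slides; send_invalidate is a stand-in for the real interconnect operation, and the 64-bit presence bitmap assumes at most 64 processors):

    #include <stdint.h>
    #include <stdio.h>

    typedef enum { UNCACHED, SHARED, EXCLUSIVE } dir_state_t;

    struct dir_entry {
        dir_state_t state;    /* Uncached / Shared / Exclusive         */
        uint64_t    presence; /* bit p set => processor p holds a copy */
    };

    /* stand-in for sending an invalidate message over the interconnect */
    static void send_invalidate(int p) { printf("invalidate -> P%d\n", p); }

    /* A write to a Shared, clean block: invalidate every other sharer, */
    /* then record the writer as the exclusive owner.                   */
    static void handle_write(struct dir_entry *d, int writer)
    {
        uint64_t sharers = d->presence & ~(1ULL << writer);
        for (int p = 0; p < 64; p++)
            if (sharers & (1ULL << p))
                send_invalidate(p);
        d->state    = EXCLUSIVE;
        d->presence = 1ULL << writer;
    }

    int main(void)
    {
        struct dir_entry e = { SHARED, 0x0D }; /* P0, P2, P3 share the block */
        handle_write(&e, 0);                   /* P0 writes it               */
        return 0;
    }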
[Figure: Grama fig 2.25 – directory based cache coherency: (a) a centralized scheme, where the directory (presence bits plus status per block) sits with global memory; (b) a distributed scheme, where each processor's local memory holds the presence bits/state for its own blocks. Processors and caches communicate over an interconnection network.]
Directory Based Systems
• How much memory is required to store the directory? (a worked estimate follows)
• What sort of data access characteristics are likely to perform well/badly on directory based systems?
– How do distributed and centralized systems compare?
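As a rough worked estimate (illustrative numbers, not from the slides): a full bit-vector directory stores one presence bit per processor, plus a few state bits, for every memory block. With 64 processors and 64-byte (512-bit) cache blocks that is about 64 bits of directory per 512 bits of data, i.e. roughly 12.5% memory overhead, and the overhead grows linearly with processor count: at 256 processors it reaches 50%.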
Costs on SGI Origin 3000 (processor clock cycles)

                                              16 CPU   > 16 CPU
Cache hit                                          1          1
Cache miss to local memory                        85         85
Cache miss to remote home directory              125        150
Cache miss to remotely cached data (3 hop)       140        170

Data from: "Computer Architecture: A Quantitative Approach", David A. Patterson, John L. Hennessy, David Goldberg, 3rd ed., Morgan Kaufmann, 2003.
Real Cache Coherency Protocols
• From Wikipedia:
– “Most modern systems use variants of the MSI protocol to reduce the amount of traffic in the coherency interconnect. The MESI protocol adds an "Exclusive" state to reduce the traffic caused by writes of blocks that only exist in one cache. The MOSI protocol adds an "Owned" state to reduce the traffic caused by write-backs of blocks that are read by other caches [The processor owner of the cache line services requests for that data]. The MOESI protocol does both of these things. The MESIF protocol uses the "Forward" state to reduce the traffic caused by multiple responses to read requests when the coherency architecture allows caches to respond to snoop requests with data.”
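As a mnemonic for the state names in the quote (a summary in code form based only on the descriptions above, not a protocol implementation; no single protocol uses all six states):

    /* cache-line states used by the MSI variants described above */
    typedef enum {
        MODIFIED,   /* dirty; the only valid copy in the system                    */
        OWNED,      /* dirty but shared; this cache services requests for the line */
        EXCLUSIVE,  /* clean and the only copy; writable without coherency traffic */
        SHARED,     /* clean; other caches may also hold copies                    */
        INVALID,    /* holds no valid data                                         */
        FORWARD     /* the shared copy designated to answer snoop read requests    */
    } line_state_t;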
MESI (on a bus)
https://www.cs.tcd.ie/Jeremy.Jones/vivio/caches/MESIHelp.htm