
Cache Coherence Problem


Cache coherence (from Wikipedia, the free encyclopedia)

Figure: Multiple Caches of Shared Resource

In computing, cache coherence (also cache coherency) refers to the consistency of data stored in local caches of a shared resource. Cache coherence is a special case of memory coherence.

When clients in a system maintain caches of a common memory resource, problems may arise with inconsistent data. This is particularly true of CPUs in a multiprocessing system. Referring to the "Multiple Caches of Shared Resource" figure, if the top client has a copy of a memory block from a previous read and the bottom client changes that memory block, the top client could be left with an invalid cache of memory without any notification of the change. Cache coherence is intended to manage such conflicts and maintain consistency between cache and memory.

Contents


1 Definition

2 Cache coherence mechanisms

3 Coherency protocol

4 Further reading

5 See also

Definition

Coherence defines the behavior of reads and writes to the same memory location. The coherence of caches is obtained if the following conditions are met:

1. A read made by a processor P to a location X that follows a write by the same processor P to X, with no writes to X by another processor occurring between the write and the read instructions made by P, must always return the value written by P. This condition is related to program order preservation, and it must hold even in uniprocessor architectures.


2. A read made by a processor P1 to location X that follows a write by another processor P2 to X must return the value written by P2 if no other writes to X made by any processor occur between the two accesses. This condition defines the concept of a coherent view of memory. If processors can still read the old value after the write made by P2, the memory is incoherent.

3. Writes to the same location must be sequenced. In other words, if location X received two different values A and B, in this order, from any two processors, the processors can never read location X as B and then read it as A. The location X must be seen with values A and B in that order.

These conditions are defined supposing that the read and write operations are made instantaneously. However, this does not happen in computer hardware, given memory latency and other aspects of the architecture. A write by processor P1 may not be seen by a read from processor P2 if the read is made within a very small time after the write. The memory consistency model defines when a written value must be seen by a following read instruction made by the other processors.
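To make the third condition concrete, here is a minimal Python sketch (not from the source; the trace format and function name are illustrative) that checks a per-location trace for write serialization: no processor may observe the writes to a location in the reverse of the order in which they occurred.

# Hypothetical sketch: checking write serialization (condition 3) on a trace.
# Each trace entry is (processor, op, location, value); writes to a location
# define a global order, and no processor may observe that order reversed.

def writes_serialized(trace):
    write_order = {}           # location -> list of written values, in order
    last_seen = {}             # (processor, location) -> index into write_order
    for proc, op, loc, val in trace:
        if op == "write":
            write_order.setdefault(loc, []).append(val)
        elif op == "read":
            order = write_order.get(loc, [])
            if val not in order:
                continue       # reading the initial value; not covered here
            idx = order.index(val)
            prev = last_seen.get((proc, loc), -1)
            if idx < prev:     # processor saw a newer write, then an older one
                return False
            last_seen[(proc, loc)] = idx
    return True

# Location X receives A then B; P3 must never read B and then read A.
ok  = [("P1", "write", "X", "A"), ("P2", "write", "X", "B"),
       ("P3", "read",  "X", "A"), ("P3", "read",  "X", "B")]
bad = [("P1", "write", "X", "A"), ("P2", "write", "X", "B"),
       ("P3", "read",  "X", "B"), ("P3", "read",  "X", "A")]
print(writes_serialized(ok))   # True
print(writes_serialized(bad))  # False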

Cache coherence mechanisms

Directory-based coherence: In a directory-based system, the data being shared is placed in a common directory that maintains the coherence between caches. The directory acts as a filter through which the processor must ask permission to load an entry from the primary memory to its cache. When an entry is changed, the directory either updates or invalidates the other caches with that entry.

Snooping is the process where the individual caches monitor address lines for accesses to memory locations that they have cached. When a write operation is observed to a location that a cache has a copy of, the cache controller invalidates its own copy of the snooped memory location.

Snarfing is where a cache controller watches both address and data in an attempt to update its own copy of a memory location when a second master modifies a location in main memory. When a write operation is observed to a location that a cache has a copy of, the cache controller updates its own copy of the snarfed memory location with the new data.

Distributed shared memory systems mimic these mechanisms in an attempt to maintain consistency between blocks of memory in loosely coupled systems.
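The following is a minimal Python sketch, not taken from any particular machine, of the difference between snooping and snarfing: both controllers watch bus writes to addresses they have cached, but one invalidates its copy while the other updates it in place. The class and variable names are illustrative.

# Hypothetical sketch of a snoopy cache controller: every controller watches
# the shared bus; on another master's write it either invalidates its copy
# (write-invalidate snooping) or updates it in place (snarfing).

class SnoopyCache:
    def __init__(self, name, update_on_snoop=False):
        self.name = name
        self.lines = {}                         # address -> cached value
        self.update_on_snoop = update_on_snoop  # False = snoop, True = snarf

    def load(self, addr, memory):
        self.lines[addr] = memory[addr]

    def snoop_write(self, addr, value):
        if addr in self.lines:
            if self.update_on_snoop:
                self.lines[addr] = value        # snarf: grab the new data
            else:
                del self.lines[addr]            # snoop: invalidate the stale copy

class Bus:
    def __init__(self, memory):
        self.memory = memory
        self.caches = []

    def write(self, writer, addr, value):
        self.memory[addr] = value
        for cache in self.caches:               # broadcast: every cache sees it
            if cache is not writer:
                cache.snoop_write(addr, value)

memory = {0x40: 1}
bus = Bus(memory)
top, bottom = SnoopyCache("top"), SnoopyCache("bottom")
bus.caches = [top, bottom]
top.load(0x40, memory)
bus.write(bottom, 0x40, 2)
print(top.lines)                                # {} -- stale copy invalidated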

The two most common types of coherence that are typically studied are snooping and directory-based, each having its own benefits and drawbacks. Snooping protocols tend to be faster, if enough bandwidth is available, since every transaction is a request/response seen by all processors. The drawback is that snooping isn't scalable: every request must be broadcast to all nodes in a system, meaning that as the system gets larger, the size of the (logical or physical) bus and the bandwidth it provides must grow. Directories, on the other hand, tend to have longer latencies (with a three-hop request/forward/respond pattern) but use much less bandwidth, since messages are point to point rather than broadcast. For this reason, many of the larger systems (>64 processors) use this type of cache coherence.
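As a rough illustration of the three-hop directory pattern described above (requester asks the home, the home forwards to the current owner, the owner responds directly to the requester), a hypothetical Python sketch with made-up node and method names:

# Hypothetical sketch of the 3-hop directory pattern: point-to-point messages,
# no broadcast. The directory (home) filters every request for a line.

class Home:
    def __init__(self):
        self.owner = {}                         # cache line -> current owner

    def request(self, requester, line):         # hop 1: request to home
        owner = self.owner.get(line)
        if owner is None:
            self.owner[line] = requester
            return f"{line} supplied from memory"
        return owner.forward(requester, line)   # hop 2: forward to owner

class Node:
    def __init__(self, name):
        self.name = name

    def forward(self, requester, line):
        return f"{line} sent from {self.name} to {requester.name}"  # hop 3

home = Home()
a, b = Node("A"), Node("B")
home.owner["X"] = a
print(home.request(b, "X"))                     # X sent from A to B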


Cache Only Memory Architecture

The Cache Only Memory Architecture (COMA) protocol is fundamentally different from the protocols discussed earlier. COMA treats main memory as a large tertiary cache, called an attraction memory, and provides automatic migration and replication of main memory at a cache line granularity. COMA can potentially reduce the cost of processor cache misses by converting high-latency remote misses into low-latency local misses. The notion that the hardware can automatically bring needed data closer to the processor, without advance information from the programmer, is the allure of the COMA protocol.

The first COMA machine was organized as a hierarchical tree of nodes, as shown in Figure 2.19 [21]. If a cache line is not found in the local attraction memory, the protocol looks for it in a directory kept at the next level up in the hierarchy.

Figure 2.18. The SCI protocol forms a natural queue when accessing heavily contended cache lines.

Figure 2.19. The original COMA architecture (the DDM machine) was a hierarchical architecture.

Each directory holds information only for the processors beneath it in the hierarchy, and thus a single cache miss may require several directory lookups at various points in the tree. Once a directory locates the requested cache line, it forwards the request down the tree to the attraction memory containing the data. The hierarchical directory structure of this COMA implementation causes the root of the directory tree to become a bottleneck, and the multiple directory lookups yield poor performance as the machine size scales.
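A hypothetical sketch of this hierarchical lookup (names and structure are illustrative, not the DDM implementation): a miss that is not satisfied locally walks up the directory tree, and each level adds another lookup before the request can be forwarded down to an attraction memory that holds the line.

# Hypothetical sketch of the hierarchical (DDM-style) lookup: a miss walks up
# the directory tree until some directory knows of a node beneath it holding
# the line, then the request is forwarded down to that attraction memory.

class Directory:
    def __init__(self, parent=None):
        self.parent = parent
        self.holders = {}            # cache line -> attraction memory below us

    def lookup(self, line, lookups=0):
        if line in self.holders:
            return self.holders[line], lookups + 1
        if self.parent is None:
            raise KeyError(f"{line} not present anywhere in the machine")
        return self.parent.lookup(line, lookups + 1)   # try next level up

root = Directory()
left, right = Directory(parent=root), Directory(parent=root)
right.holders["X"] = "AM3"       # X lives under the right subtree
root.holders["X"] = "AM3"        # the root tracks everything beneath it

am, hops = left.lookup("X")      # miss in the left subtree
print(am, hops)                  # AM3 2 -- two directory lookups were needed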

A second version of COMA, called flat COMA or COMA-F [53], assigns a static home for the directory entries of each cache line, just as in the previous protocols. If the cache line is not in the local attraction memory, the statically assigned home is immediately consulted to find out where the data resides. COMA-F removes the disadvantages of the hierarchical directory structure, generally performs better [53], and makes it possible to implement COMA on a traditional DSM architecture. It is the COMA-F protocol that is presented in this section and the remainder of this text, and for brevity it is referred to simply as COMA.

Unlike the other protocols, COMA reserves extra memory on each node to act as an additional memory cache for remote data. COMA needs the extra memory to support cache line replication efficiently. Without extra memory, COMA could only migrate data, since any new data placed in one attraction memory would displace the last remaining copy of another cache line. The last remaining copy of a cache line is called the master copy.


With the addition of reserved memory, the attraction memory is just a large cache, and like the processor's cache it does not always have to send a message when one of its cache lines is displaced. The attraction memory need only take action if it is replacing a master copy of the cache line. All copies of the line that are not master copies can be displaced without requiring further action by the protocol. To displace a master copy, a node first sends a replacement message to the home node. It is the home node that has the responsibility to find a new node for the master copy. The best place to try initially is the node which just sent the data to the requesting node that caused this master copy to be displaced (see Figure 2.20). That node likely has "room" for data at the proper location in its attraction memory, since it just surrendered a cache line that mapped there.
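A simplified Python sketch of this displacement path, under the assumptions above (the classes and bookkeeping are invented for illustration, not the machine's actual structures): the displacing node hands the master copy to the home, and the home first tries the node that just supplied the data that caused the displacement.

# Hypothetical sketch of master-copy displacement: when an attraction memory
# evicts the last (master) copy of a line, it sends a replacement message to
# the home node; the home tries to place the master copy on the node that just
# supplied data and therefore likely has room at the matching index.

class HomeNode:
    def __init__(self):
        self.last_supplier = {}            # cache line -> node that last sent it

    def replace_master(self, line, data):
        candidate = self.last_supplier.get(line)
        if candidate is not None and candidate.accept(line, data):
            return candidate               # master copy re-homed; ack the sender
        raise RuntimeError("home must keep searching for a node with room")

class AMNode:
    def __init__(self, name):
        self.name, self.attraction_memory = name, {}

    def accept(self, line, data):
        self.attraction_memory[line] = data
        return True                        # sketch: this node always has room

home = HomeNode()
donor = AMNode("D")
home.last_supplier["X"] = donor            # D just supplied the conflicting data
print(home.replace_master("X", b"\x00" * 64).name)   # D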

Extra reserved memory is crucial in keeping the number of attraction memory displacements to a minimum. [27] shows that for many applications half of the attraction memory should be reserved memory. COMA can naturally scale up to large machine sizes by using the directory organization of any of the previous protocols, with the addition of only a few bits of state. The memory overhead of the COMA protocol is thus the same as the memory overhead of the protocol whose directory organization it mimics, plus the overhead of the extra reserved memory. In practice, since the reserved memory occupies half of the attraction memory, it is the dominant component in COMA's memory overhead.

Figure 2.21 shows a COMA protocol that uses dynamic pointer allocation as its underlying directory organization.

Figure 2.20. In COMA the home node, H, must find a new attraction memory (AM) for displaced master copies.

Figure 2.21. Data structures for the COMA protocol. COMA can use the same directory organization as other protocols (dynamic pointer allocation is pictured here), with the addition of a tag field in the directory header and some reserved memory for additional caching of data.

The only difference in the data structures is that COMA keeps a tag field in the directory header to identify which global cache line is currently in the memory cache, or attraction memory.


Because COMA must perform a tag comparison of the cache miss address with the address in the attraction memory, COMA can potentially have higher miss latencies than the previous protocols. If the line is in the local attraction memory, then ideally COMA will be a win, since a potentially slow remote miss has been converted into a fast local miss. If, however, the tag check fails and the line is not present in the local attraction memory, COMA has to go out and fetch the line as normal, but it has delayed the fetch of the remote line by the time it takes to perform the tag check. Clearly, the initial read miss on any remote line will find that it is not in the local attraction memory, and therefore COMA will pay an additional overhead on the first miss to any remote cache line. If, however, the processor misses on that line again, either due to cache capacity or conflict problems, that line will most likely still be in the local attraction memory, and COMA will have successfully converted a remote miss into a local one.
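A minimal sketch of this attraction-memory tag check, assuming a direct-mapped attraction memory with illustrative line and size parameters (not values from the text): the first miss pays the tag-check latency and still goes remote, while a repeat miss on the same line is converted into a local miss.

# Hypothetical sketch of the attraction-memory tag check: on a processor cache
# miss, COMA first compares the miss address's tag against the tag stored for
# that attraction-memory index; only on a mismatch does it go remote.

LINE = 64          # bytes per cache line (illustrative)
AM_LINES = 1024    # lines in the (direct-mapped) attraction memory (illustrative)

class AttractionMemory:
    def __init__(self):
        self.tags = [None] * AM_LINES
        self.data = [None] * AM_LINES

    def lookup(self, addr):
        index = (addr // LINE) % AM_LINES
        tag = addr // (LINE * AM_LINES)
        if self.tags[index] == tag:
            return self.data[index], "local miss"    # found in attraction memory
        return None, "remote miss (tag check added latency)"

    def fill(self, addr, data):
        index = (addr // LINE) % AM_LINES
        self.tags[index] = addr // (LINE * AM_LINES)
        self.data[index] = data

am = AttractionMemory()
_, first = am.lookup(0x123440)      # first miss: line is not local yet
am.fill(0x123440, b"\x00" * LINE)   # fetched remotely, now replicated locally
_, second = am.lookup(0x123440)     # repeat miss: converted into a local miss
print(first, "->", second)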

Aside from the additional messages and actions necessary to handle attraction memory displacement, COMA's message types and actions are the same as those of the protocol from which it borrows its directory organization. Depending on the implementation, COMA may have some additional complexity. Since main memory is treated as a cache, what happens if the processor's secondary cache is 2-way set-associative? Now the COMA protocol may need to perform two tag checks on a cache miss, potentially further increasing the miss cost. In addition, the protocol would have to keep some kind of additional information, such as LRU bits, to know which main memory block to displace when necessary. To get around this problem, some COMA implementations keep a direct-mapped attraction memory even though the caches are set-associative [27]. This can work as long as modifications are made to the protocol to allow cache lines to be in the processor's cache that are not necessarily in the local attraction memory. The COMA protocol discussed in Chapter 4 uses this scheme.

Despite the complications of extra tag checks and master copy displacements, the hope is that COMA's ability to turn remote capacity or conflict misses into local misses will outweigh any of these potential disadvantages. This trade-off lies at the core of the COMA performance results presented in Chapter 6. Several machines implement variants of the COMA protocol, including the Swedish Institute of Computer Science's Data Diffusion Machine [21] and the KSR1 [8] from Kendall Square Research.