Cache coherence and consistency models in multiprocessor architecture
Computer Architecture
Authors: Piscione Pietro, Villardita Alessio
Degree: Computer Engineering
A.Y. 2014/2015


Introduction
● Multiprocessor architecture overview
● Coherence vs. Consistency
○ Coherence protocols
○ Snooping and Directory models
○ Consistency models
○ Sequential Consistency

Why multiprocessor architecture?
● More throughput
● More efficiency
● Clock frequency wall

● Shared memory

● Distributed memory

The bus is the bottleneck

More processors (16 threads)
More cache memory (20 MB L3)
More complexity (1.86 billion transistors)
i7-990X (2011): 12 threads, 12 MB cache, 1.17 billion transistors

Processor-Memory Performance gap (figure; annotated factors: 1.25x, 1.52x, 1.20x)

Cache design factors
Traditionally, memory hierarchy designers focused on:
● Optimizing average memory access time
● Miss rate
● Miss penalty
More recently:
● Power consumption has become a major consideration
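The three traditional factors are tied together by the standard textbook relation: average memory access time = hit time + miss rate × miss penalty. The sketch below is only an illustration of that relation; the function name and all the numbers are assumptions, not measurements from these slides.

```python
def average_memory_access_time(hit_time, miss_rate, miss_penalty):
    """AMAT = hit time + miss rate * miss penalty (times in clock cycles)."""
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be a fraction between 0 and 1")
    return hit_time + miss_rate * miss_penalty


# Hypothetical numbers, chosen only to show how each factor moves the result.
baseline = average_memory_access_time(hit_time=1, miss_rate=0.05, miss_penalty=100)
lower_miss_rate = average_memory_access_time(hit_time=1, miss_rate=0.02, miss_penalty=100)
lower_penalty = average_memory_access_time(hit_time=1, miss_rate=0.05, miss_penalty=40)

print(baseline, lower_miss_rate, lower_penalty)
```

Reducing either the miss rate or the miss penalty lowers the average, which is why both appear as separate design targets.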

Miss rate vs. cache size (SPEC CPU2000), by miss type: compulsory, capacity, conflict

Cache (in)Coherence
Incoherence occurs between Pi and Pj

They are different

Consistency and coherence

● The coherence model specifies HOW memory accesses are coordinated among CPUs.

● The consistency model specifies WHEN a memory write becomes visible to another CPU.

Cache coherence: definition
Two fundamental invariants:
● Single-Writer-Multiple-Reader (SWMR): “For any given memory location, at any given (logical) time, there is either a single core that may write it (and that may also read it) or some number of cores that may read it.”
● Data-Value

Cache coherence: epochs

● Dividing a given memory location’s lifetime into epochs

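● SWMR alone is not enough: the Data-Value invariant is also needed, so that the value at the start of an epoch equals the value at the end of the last read-write epoch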
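The epoch view can be sketched as a small checker. This is a minimal illustration under assumptions: each epoch is a hypothetical tuple `(kind, cores, start_value, end_value)`, where `kind` is `"RW"` (read-write, one core) or `"RO"` (read-only, any number of cores). It is not taken from the slides.

```python
def check_epochs(epochs, initial_value=0):
    """Check one memory location's epoch sequence against both invariants."""
    last_written = initial_value
    for kind, cores, start_value, end_value in epochs:
        # SWMR: a read-write epoch belongs to exactly one core.
        if kind == "RW" and len(cores) != 1:
            return "SWMR violated"
        # A read-only epoch must not change the value.
        if kind == "RO" and start_value != end_value:
            return "read-only epoch changed the value"
        # Data-Value: an epoch starts with the value left by the last RW epoch.
        if start_value != last_written:
            return "Data-Value violated"
        if kind == "RW":
            last_written = end_value
    return "ok"


good = [("RO", {"C2", "C5"}, 0, 0), ("RW", {"C3"}, 0, 7), ("RO", {"C1", "C2", "C3"}, 7, 7)]
stale = [("RW", {"C3"}, 0, 7), ("RO", {"C1"}, 0, 0)]       # reader still sees the old value
two_writers = [("RW", {"C1", "C2"}, 0, 1)]

print(check_epochs(good), check_epochs(stale), check_epochs(two_writers))
```

The `stale` sequence satisfies SWMR but still fails, which is the reason the second invariant is needed.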

Coherence Controller

● Accepts loads and stores from the core and returns load values to it

● Initiates a coherence transaction when a cache miss occurs, by issuing a coherence request for the block requested by the core

● Receives coherence requests and coherence responses that must be processed

Coherence controller behavior

Coherence Protocols: basics
When a write occurs on a specific address, what’s next? Two alternatives:
● Write invalidate (most common): invalidate all other copies
● Write update (broadcast): update all the cached copies

Invalidate vs. Update protocols
Invalidate:
● One message to achieve coherence
● Significantly less bandwidth
● Easy to implement
Update:
● Lower read latency
● Larger messages
● More bandwidth
● More complex implementations
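The trade-off listed above can be made concrete with a deliberately simple cost model. Everything here is an assumption made for illustration: one writer stores repeatedly to a block that other caches already hold, the readers access it afterwards, and the message sizes (`ADDR_BYTES`, `WORD_BYTES`, `block_bytes`) are hypothetical.

```python
ADDR_BYTES = 8   # assumed size of an address-only message
WORD_BYTES = 8   # assumed size of the data carried by one update


def invalidate_cost(writes, later_readers, block_bytes=64):
    """One invalidation, then every later reader takes a read miss."""
    bus_bytes = ADDR_BYTES if writes > 0 else 0          # single invalidate message
    read_misses = later_readers if writes > 0 else 0     # their copies were invalidated
    bus_bytes += read_misses * (ADDR_BYTES + block_bytes)  # miss request + block reply
    return {"bus_bytes": bus_bytes, "read_misses": read_misses}


def update_cost(writes, later_readers):
    """Every write broadcasts address + data; later readers hit locally.

    `later_readers` is kept only for symmetry with invalidate_cost.
    """
    return {"bus_bytes": writes * (ADDR_BYTES + WORD_BYTES), "read_misses": 0}


inv = invalidate_cost(writes=10, later_readers=1)
upd = update_cost(writes=10, later_readers=1)
print(inv, upd)
```

With these assumed numbers the invalidate scheme moves fewer bytes, while the update scheme avoids the reader's miss; other write/read mixes can tilt the comparison.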

Coherence Protocols: basics
● Directory based: the sharing status of each physical memory block is stored in one centralized location
● Snooping: every cache that holds a copy of a physical memory block also tracks that block's sharing status

Snooping protocol: main features
● Distributed architecture
● Message broadcasting
● Not so scalable
● Total order of coherence requests across all blocks
● Interconnection network must serialize these requests into some total order

Snooping protocol: Write Invalidate
● Write to shared data:
○ An invalidate is sent to all caches, which snoop and invalidate any copy
● Read miss:
○ Write-through: memory is always up-to-date
○ Write-back: the cache holding the modified copy is forced to update main memory, then the requester snoops that value
Can use a separate invalidate bus for write traffic
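The write-invalidate behavior with write-back caches can be sketched as a toy simulation. This is a minimal model under assumptions, not the protocol of any real machine: blocks are single words, each line is either `"M"` (modified) or `"S"` (shared), an absent line counts as invalid, and the class and message names (`SnoopyBus`, `Cache`, `BusRd`, `Invalidate`) are hypothetical.

```python
class SnoopyBus:
    """Shared bus: every request is seen by all attached caches."""
    def __init__(self):
        self.caches = []
        self.memory = {}
        self.log = []            # ordered list of broadcast requests


class Cache:
    def __init__(self, name, bus):
        self.name, self.bus = name, bus
        self.lines = {}          # addr -> [state, value]; missing means invalid
        bus.caches.append(self)

    def read(self, addr):
        line = self.lines.get(addr)
        if line:                                  # hit in S or M
            return line[1]
        self.bus.log.append((self.name, "BusRd", addr))
        for other in self.bus.caches:             # broadcast the read miss
            if other is not self:
                other.snoop_read(addr)
        value = self.bus.memory.get(addr, 0)      # memory is now up to date
        self.lines[addr] = ["S", value]
        return value

    def write(self, addr, value):
        line = self.lines.get(addr)
        if not line or line[0] != "M":            # need exclusive ownership first
            self.bus.log.append((self.name, "Invalidate", addr))
            for other in self.bus.caches:
                if other is not self:
                    other.snoop_invalidate(addr)
        self.lines[addr] = ["M", value]

    def snoop_read(self, addr):
        line = self.lines.get(addr)
        if line and line[0] == "M":               # write back, keep a shared copy
            self.bus.memory[addr] = line[1]
            line[0] = "S"

    def snoop_invalidate(self, addr):
        line = self.lines.pop(addr, None)         # drop any copy
        if line and line[0] == "M":
            self.bus.memory[addr] = line[1]       # write back dirty data first


bus = SnoopyBus()
p0, p1 = Cache("P0", bus), Cache("P1", bus)
p0.read("x")
p1.read("x")
p0.write("x", 42)          # invalidates P1's copy
seen = p1.read("x")        # read miss: P0 writes back, P1 reads 42
print(seen, bus.log)
```

A second write by P0 while it still holds the line in `"M"` would generate no bus traffic, which is where the bandwidth saving of invalidation comes from.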

Snooping protocol: Write Update
● Write to shared data:
○ Broadcast on bus, processors snoop, and update copies
● Read miss:
○ Memory is always up-to-date
● Higher bandwidth (transmit data + address), but lower latency for readers (looks like write-through cache)

Directory protocol: basic idea
● Global view of cache states
● Centralized in directory
● Unicast messages
● More scalability

Directory protocol: basic idea
When a directory receives a message, what happens? It either replies or forwards.
Possible cases:

● One request-reply

● One request -> K forwards -> K replies

● Point-to-point ordering

Directory protocol: example
1. Requestor sends GetM to Directory
2. Directory sends Ack Count to Requestor
3. Directory sends K Invalidate messages to sharers
4. Sharers send an AckInv to requestor
5. The requestor modifies the block
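The five steps above can be traced with a short sketch that only records which messages are exchanged. The data layout (`Directory` with `sharers` and `owner` dictionaries) and the function name `get_m` are assumptions for illustration; data transfer and timing are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Directory:
    sharers: dict = field(default_factory=dict)   # block -> set of cache ids
    owner: dict = field(default_factory=dict)     # block -> cache id holding it in M


def get_m(directory, requestor, block):
    """Return the messages (src, dst, type, payload) of one GetM transaction."""
    messages = [(requestor, "Dir", "GetM", block)]                  # step 1
    targets = sorted(directory.sharers.get(block, set()) - {requestor})
    messages.append(("Dir", requestor, "AckCount", len(targets)))   # step 2
    for sharer in targets:                                          # step 3
        messages.append(("Dir", sharer, "Invalidate", block))
    acks_received = 0
    for sharer in targets:                                          # step 4
        messages.append((sharer, requestor, "AckInv", block))
        acks_received += 1
    # Step 5: the requestor may write only once all expected acks have arrived.
    if acks_received != len(targets):
        raise RuntimeError("missing invalidation acknowledgements")
    directory.sharers[block] = set()
    directory.owner[block] = requestor
    return messages


directory = Directory(sharers={"B": {"C1", "C2", "C3"}})
trace = get_m(directory, requestor="C0", block="B")
for message in trace:
    print(message)
```

With K sharers the transaction uses 2 + 2K unicast messages in this model, instead of a broadcast to every cache.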

Directory state

● Coarse directories

● Limited pointer directory
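The two directory organizations trade precision for storage. The sketch below compares per-block sharer-tracking bits against a full bit vector (one bit per cache); the function names and the 64-cache configuration are assumptions, and state bits or overflow handling for the limited-pointer scheme are ignored.

```python
import math


def full_bit_vector_bits(num_caches):
    """One presence bit per cache."""
    return num_caches


def coarse_directory_bits(num_caches, caches_per_bit):
    """Each bit stands for a group of caches; an invalidate goes to the whole group."""
    return math.ceil(num_caches / caches_per_bit)


def limited_pointer_bits(num_caches, num_pointers):
    """A few explicit pointers, each wide enough to name any cache."""
    return num_pointers * math.ceil(math.log2(num_caches))


n = 64   # hypothetical number of caches
print(full_bit_vector_bits(n),
      coarse_directory_bits(n, caches_per_bit=4),
      limited_pointer_bits(n, num_pointers=4))
```

Both compact forms save bits per block at the price of extra invalidations: a coarse bit over-approximates the sharers, and a limited-pointer entry needs some fallback once the pointers run out.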

Distributed directory

Snooping vs. Directory coherence
Snooping Solution (Snoopy Bus):
● Send all requests for data to all processors (broadcast)
● Scaling limited by cache miss & write traffic saturating the bus
Directory-Based Schemes:
● Send point-to-point requests to processors (unicast)
● Keep track of what is being shared in a directory
● Distributed memory => distributed directory (reducing bottlenecks)

Hybrid Designs

There are protocols that combine aspects of:
● Snooping and directory protocols
● Invalidate and update protocols
They aim to get the advantages of both solutions.

Consistency model: definition
(aka memory consistency model, or memory model)
● A specification of the allowed behavior of multithreaded programs executing with shared memory
● Multiple correct behaviors are usually allowed
One fundamental:
● Out-of-Order execution

Cache (in)consistency

Should r2 always be set to NEW? NO!

Core Might Reorder Memory Accesses
Sequential execution model (von Neumann):
● Usually, operations to the same address execute in the original program order.
Possible reorderings (to different addresses):
● Store-Store: non-FIFO write buffer
● Load-Load
● Load-Store and Store-Load: local bypass
Multiple executions allowed → Non-Determinism

(Figure: write buffer holding stores S1, S2, S7; read R1)
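The effect of a Store-Load reordering can be shown by enumerating executions of a small two-core test. This example is an assumption added for illustration (it is not the program from the slide's figure): core 1 runs `x = 1; r1 = y` and core 2 runs `y = 1; r2 = x`, with both variables initially 0. Letting a load bypass the older store in the write buffer is modeled by also trying the swapped order inside each core.

```python
from itertools import combinations

T1 = [("store", "x", 1), ("load", "y", "r1")]
T2 = [("store", "y", 1), ("load", "x", "r2")]


def interleavings(a, b):
    """All merges of a and b that keep each list's internal order."""
    n = len(a) + len(b)
    for positions in combinations(range(n), len(a)):
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in positions else next(ib) for i in range(n)]


def run(schedule):
    memory, regs = {"x": 0, "y": 0}, {}
    for kind, var, arg in schedule:
        if kind == "store":
            memory[var] = arg
        else:
            regs[arg] = memory[var]
    return regs["r1"], regs["r2"]


def outcomes(store_load_reorder):
    """Set of (r1, r2) results; optionally let each load pass the earlier store."""
    orders1 = [T1] + ([T1[::-1]] if store_load_reorder else [])
    orders2 = [T2] + ([T2[::-1]] if store_load_reorder else [])
    return {run(s) for a in orders1 for b in orders2 for s in interleavings(a, b)}


in_order = outcomes(store_load_reorder=False)
reordered = outcomes(store_load_reorder=True)
print(sorted(in_order), sorted(reordered))
```

The result `(0, 0)` appears only when the reordering is allowed, so the same program has more possible outcomes on a core that bypasses its write buffer.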

Sequential consistency: basic idea
● “The result of an execution is the same as if the operations had been executed in the order specified by the program.” (Lamport, 1979)
● Memory order must respect program order
● Every load gets its value from the last store before it (in global memory order)
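The two requirements on the global memory order can be written as a checker. This is a sketch under assumptions: the `Op` record, the helper names, and the data/flag program (core C1 stores `data` then `flag`; core C2 loads `flag` then `data`) are hypothetical, with 1 standing for both NEW and SET.

```python
from itertools import permutations
from typing import NamedTuple


class Op(NamedTuple):
    core: str
    seq: int      # position in that core's program order
    kind: str     # "load" or "store"
    addr: str
    value: int    # value written (store) or value observed (load)


def respects_program_order(memory_order):
    last_seq = {}
    for op in memory_order:
        if op.seq <= last_seq.get(op.core, -1):
            return False
        last_seq[op.core] = op.seq
    return True


def loads_see_last_store(memory_order, initial=0):
    memory = {}
    for op in memory_order:
        if op.kind == "store":
            memory[op.addr] = op.value
        elif memory.get(op.addr, initial) != op.value:
            return False
    return True


def is_sc_order(memory_order):
    return respects_program_order(memory_order) and loads_see_last_store(memory_order)


def sc_allows(ops):
    """True if some global order of ops satisfies both SC requirements."""
    return any(is_sc_order(order) for order in permutations(ops))


S1 = Op("C1", 0, "store", "data", 1)
S2 = Op("C1", 1, "store", "flag", 1)
L1 = Op("C2", 0, "load", "flag", 1)        # C2 observes the flag as set
L2_new = Op("C2", 1, "load", "data", 1)
L2_old = Op("C2", 1, "load", "data", 0)    # stale data after seeing the flag

print(sc_allows([S1, S2, L1, L2_new]), sc_allows([S1, S2, L1, L2_old]))
```

No global order explains the stale read once the flag has been observed, so under sequential consistency that outcome is ruled out. The brute-force search over permutations is only practical for a handful of operations.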

Sequential consistency: Atomicity

● Need for instructions that atomically perform a “read–modify–write” (e.g. “test-and-set”)

● Simplistic approach: the core effectively locks the memory system → sacrifices performance

● Aggressive approach: the “test-and-set” only needs to appear atomic in the total order
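A typical use of “test-and-set” is a spin lock. The sketch below only emulates the instruction: Python has no hardware test-and-set, so a `threading.Lock` stands in for the atomicity the hardware would provide, and the class names (`TestAndSetFlag`, `SpinLock`) are hypothetical.

```python
import threading
import time


class TestAndSetFlag:
    """Emulated atomic read-modify-write on one memory word."""
    def __init__(self):
        self._value = 0
        self._guard = threading.Lock()   # stands in for hardware atomicity

    def test_and_set(self):
        with self._guard:                # read and write happen as one step
            old = self._value
            self._value = 1
            return old

    def clear(self):
        with self._guard:
            self._value = 0


class SpinLock:
    def __init__(self):
        self._flag = TestAndSetFlag()

    def acquire(self):
        while self._flag.test_and_set() == 1:
            time.sleep(0)                # lock was already held: yield and retry

    def release(self):
        self._flag.clear()


def run_counter(num_threads=4, increments=200):
    lock = SpinLock()
    counter = {"value": 0}

    def worker():
        for _ in range(increments):
            lock.acquire()
            counter["value"] += 1        # critical section
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["value"]


total = run_counter()
print(total)
```

If the read and the write inside `test_and_set` could be separated by another core's access, two threads could both observe 0 and enter the critical section together, which is why the pair has to appear atomic.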

Sequential consistency: simple implementation

Sequential (in)consistency: solved
Inconsistency cannot occur anymore

Conclusions
Which protocol is the best?
It depends on:
● Technology
● Architecture
● Purposes and applications