
Multithreaded Architecture



Chip Multiprocessors (CMP)

CMP is the mantra of today's microprocessor industry:

-Intel's dual-core Pentium 4: each core is still hyperthreaded (it just reuses the existing core design)

-Intel’s quad-core Whitefield is coming up in a year or so

-For the server market, Intel has announced a dual-core Itanium 2 (code-named Montecito); again, each core is 2-way threaded

-AMD released its dual-core Opteron in 2005

-IBM released its first dual-core processor, the POWER4, circa 2001; the next-generation POWER5 also uses two cores, but each core is additionally 2-way threaded

-Sun’s UltraSPARC IV (released in early 2004) is a dual-core processor and integrates two UltraSPARC III cores.

Why CMP?

Today microprocessor designers can afford to have a lot of transistors on the die:

-Ever-shrinking feature sizes lead to dense packing.
-What would you do with so many transistors?
-Some can be invested in caches, but beyond a certain point that doesn't help.
-The natural choice was to think about a greater level of integration.
-A few chip designers decided to bring the memory and coherence controllers, along with the router, onto the die.
-The next obvious choice was to replicate the entire core; this is fairly simple: just use the existing cores and connect them through a coherent interconnect.

Moore’s Law

Moore's Law describes a long-term trend in the history of computing hardware, in which the number of transistors that can be placed inexpensively on an integrated circuit doubles approximately every two years.[1] Rather than being a naturally occurring "law" that cannot be controlled, however, Moore's Law is effectively a business practice in which the advancement of transistor counts occurs at a fixed rate.[2]

The law is named for Intel co-founder Gordon E. Moore, who introduced the concept in a 1965 paper. It has since been used in the semiconductor industry to guide long term planning and to set targets for research and development.


[Figure: plot of CPU transistor counts against dates of introduction; the curve shows counts doubling every two years.]

Consequences and Limitations

Transistor count versus computing performance

The exponential transistor growth predicted by Moore does not always translate into exponentially greater practical computing performance. For example, the higher transistor density in multi-core CPUs does not greatly speed up the many consumer applications that are not parallelized.

Wire Delay

Wires don't scale with transistor technology: wire delay becomes the bottleneck.

-Wire delay doesn't decrease with transistor size.
-Short wires are good: this dictates localized logic design.
-But superscalar processors exercise a "centralized" control requiring long wires (or pipelined long wires).
-However, to utilize the transistors well, we need to overcome the memory wall problem.
-To hide memory latency we need to extract more independent instructions, i.e. more ILP.

Extracting more ILP directly requires more available in-flight instructions:

-But for that we need a bigger ROB, which in turn requires a bigger register file.
-We also need bigger issue queues to be able to find more parallelism.
-None of these structures scales well: the main problem is wiring.
-So the best solution for utilizing these transistors effectively at low cost must not require long wires and must leverage existing technology: CMP satisfies these goals exactly (use existing processors and invest transistors in having more of them on-chip, instead of trying to scale an existing processor for more ILP).


Shared L2 Vs Tiled CMP

A chip multiprocessor (CMP) system having several processor cores may use a tiled architecture, with each tile containing a processor core, a private cache (L1), a second private or shared cache (L2), and a directory to track cached copies. Historically, these tiled architectures have used one of two styles of L2 organization; the private-L2 style is exemplified by the Intel Pentium D, dual-core Opteron, Intel Montecito, Sun UltraSPARC IV, and IBM Cell.

Due to constructive data sharing between threads, CMP systems running multi-threaded workloads may use a shared L2 cache. A shared L2 cache maximizes effective capacity because no data is duplicated, but it also increases average hit latency compared to a private L2 cache. These designs may treat the L2 cache and directory as one structure (e.g. Intel Woodcrest, Intel Conroe, Sun Niagara, IBM POWER4, and IBM POWER5).

CMP systems performing scalar and latency sensitive workloads may prefer a private L2 cache organization for latency optimization at the expense of potential reduction in effective cache capacity due to potential data replication. A private L2 cache may offer cache isolation, yet disallow cache borrowing. Cache intensive applications on some cores may not borrow cache from inactive cores or cores running small data footprint applications.

Some generic CMP systems may have three levels of caches. The L1 and L2 caches may form two private levels, while a third cache (L3) may be shared across all cores.

Differences (On the Basis of Performance)

Shared caches are often very large in CMPs.

-They are banked to avoid worst-case wire delay.
-The banks are usually distributed across the floor of the chip on an interconnect.
-In a shared cache, getting a block from a remote bank takes time proportional to the physical distance between the requester and the bank: this is a non-uniform cache architecture (NUCA).
-The same holds for private caches if the data resides in a remote cache.
-A shared cache may have a higher average hit latency than a private cache, since hopefully most hits in the latter will be local.
-Shared caches are likely to have fewer misses than private caches, because the latter waste space due to replication.
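To make this latency-versus-capacity trade-off concrete, here is a minimal back-of-the-envelope model in C++. All parameters (hit rates, per-hop latency, bank access time, memory penalty) are illustrative assumptions, not figures from the text.

```cpp
// Back-of-the-envelope model of average L2 access latency for a shared
// NUCA cache vs. a private cache. All numbers are illustrative assumptions.
#include <cstdio>

int main() {
    // Assumed parameters: 10-cycle bank access, 2 cycles per network hop,
    // 200-cycle memory penalty, 4 hops average requester-to-bank distance.
    const double bank_access = 10.0, hop = 2.0, mem = 200.0;
    const double avg_hops_shared = 4.0;

    // Shared L2: higher hit rate (no replication) but every hit pays the
    // network distance to the home bank.
    double shared_hit = 0.90;
    double shared_lat = shared_hit * (bank_access + avg_hops_shared * hop)
                      + (1 - shared_hit) * mem;

    // Private L2: lower effective capacity (replication) but local hits.
    double private_hit = 0.80;
    double private_lat = private_hit * bank_access + (1 - private_hit) * mem;

    std::printf("shared L2 avg latency : %.1f cycles\n", shared_lat);
    std::printf("private L2 avg latency: %.1f cycles\n", private_lat);
}
```

With these assumed numbers the shared cache wins because its extra misses cost far more than the extra hops; with a smaller miss-rate gap the conclusion flips, which is exactly the workload dependence described above.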


Snoopy Coherence

Snoopy Protocols

Cache coherence protocols implemented in bus-based machines are called snoopy protocols.

-The processors snoop or monitor the bus and take appropriate protocol actions based on snoop results.
-The cache controller now receives requests both from the processor and from the bus.
-Since cache state is maintained on a per-line basis, that also dictates the coherence granularity: the protocol cannot normally take a coherence action on part of a cache line.
-The coherence protocol is implemented as a finite state machine on a per-cache-line basis.
-The snoop logic in each processor grabs the address from the bus and decides if any action should be taken on the cache line containing that address (only if the line is in the cache).

Write Through Caches

There are only two cache line states:

-Invalid (I): not in the cache.
-Valid (V): present in the cache; may be present in other caches also.

A read access to a cache line in the I state generates a BusRd request on the bus.

-The memory controller responds to the request and, after reading from memory, launches the line on the bus.
-The requester matches the address, picks up the line from the bus, and fills the cache in the V state.
-A store to a line always generates a BusWr transaction on the bus (since the cache is write-through); other sharers either invalidate the line in their caches or update it with the new value.

State Transition

The finite state machine for each cache line:

On a write miss no line is allocated:

-The state remains at I: this is called write-through write-no-allocate.
-A/B means: A is generated by the processor, B is the resulting bus transaction (if any).


-What changes for write-through write-allocate?
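The two-state controller described above is small enough to sketch directly. The following C++ fragment is a minimal, hypothetical rendering of the V/I write-through write-no-allocate protocol; the type and function names are our own, and the bus transfer itself is abstracted away.

```cpp
// Minimal sketch of the two-state (V/I) write-through, write-no-allocate
// snoopy controller described above. Event and state names are our own.
enum class State { I, V };
enum class ProcOp { Read, Write };
enum class BusOp  { None, BusRd, BusWr };

struct Line { State state = State::I; };

// Processor-side access: returns the bus transaction generated, if any.
BusOp processorAccess(Line& line, ProcOp op) {
    if (op == ProcOp::Read) {
        if (line.state == State::I) {        // read miss
            line.state = State::V;           // fill in V once BusRd completes
            return BusOp::BusRd;
        }
        return BusOp::None;                  // read hit: no bus action
    }
    // Every store goes on the bus (write-through); on a miss no line is
    // allocated, so the state simply stays I (write-no-allocate).
    return BusOp::BusWr;
}

// Snoop side: an observed BusWr from another processor invalidates our copy
// (the invalidation-based variant; an update-based one would patch the data).
void snoop(Line& line, BusOp op) {
    if (op == BusOp::BusWr && line.state == State::V)
        line.state = State::I;
}
```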

Ordering Memory Operations

Assume that the bus is atomic:

-It takes up the next transaction only after finishing the previous one.

Read misses and writes appear on the bus and hence are visible to all processors

What about read hits?

-They take place transparently in the cache.
-But they are correct as long as they are correctly ordered with respect to writes.
-And all writes appear on the bus and hence are visible immediately in the presence of an atomic bus.

In general, in between writes, reads can happen in any order without violating coherence: the writes establish a partial order.

Back To Snoopy Protocols

No change to the processor or cache is needed:

-Just extend the cache controller with snoop logic and exploit the bus.

We will focus on writeback caches only.

-Possible states of a cache line: Invalid (I), Shared (S), Modified or dirty (M), Clean exclusive (E), Owned (O); not every processor supports all five states.
-The E state is equivalent to M in the sense that the line has permission to write, but in E the line is not yet modified and the copy in memory is the same as in the cache; if someone else requests the line, the memory will provide it.
-The O state is similar to S in that other caches may hold copies, but here the memory is not responsible for servicing requests to the line; the owner must supply the line (just as in the M state).
-Note that store misses really read the memory first (as opposed to just writing): the line must be fetched before it can be modified.

Stores

Let us look at stores a little more closely. There are three situations at the time a store issues: the line is not in the cache, the line is in the cache in the S state, or the line is in the cache in one of the M, E, and O states.

-If the line is in the I state, the store generates a read-exclusive request on the bus and gets the line in the M state.
-If the line is in the S or O state, the processor only has read permission for that line; the store generates an upgrade request on the bus, and the upgrade acknowledgment gives it write permission (this is a data-less transaction).


-If the line is in the M or E state, no bus transaction is generated; the cache already has write permission for the line (this is the case of a write hit; the previous two are write misses).
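A compact way to summarize these three store cases is as a transition function. The sketch below follows the transaction names used above (read-exclusive, upgrade); everything else is illustrative scaffolding rather than any particular machine's controller.

```cpp
// Sketch of the store-handling cases just described. The state is updated
// as if the bus transaction (when any) has already completed.
enum class State { I, S, M, E, O };
enum class BusTx { None, ReadX, Upgrade };

BusTx handleStore(State& s) {
    switch (s) {
    case State::I:              // write miss: fetch line with write permission
        s = State::M;
        return BusTx::ReadX;
    case State::S:
    case State::O:              // have the data but only read permission
        s = State::M;           // after the upgrade acknowledgment arrives
        return BusTx::Upgrade;  // data-less transaction
    case State::E:              // silent upgrade: no bus transaction needed
        s = State::M;
        return BusTx::None;
    case State::M:              // write hit: already have write permission
        return BusTx::None;
    }
    return BusTx::None;
}
```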

Invalidation vs. Update

There are two main classes of protocols: invalidation-based and update-based. The class dictates what action is taken on a write.

-Invalidation-based protocols invalidate sharers when a write miss (upgrade or readX) appears on the bus.
-Update-based protocols update the sharer caches with the new value on a write: this requires write transactions (carrying just the modified bytes) on the bus even on write hits (not very attractive with writeback caches).
-Advantage of update-based protocols: sharers continue to hit in the cache, while in invalidation-based protocols sharers will miss the next time they access the line.
-Advantage of invalidation-based protocols: only write misses go on the bus (suited for writeback caches), and subsequent stores to the same line are cache hits.

Which One Is Better?

Difficult to answer: it depends on program behaviour and hardware cost.

When is an update-based protocol good? For what sharing pattern? Large-scale producer/consumer sharing is one; otherwise updates would just waste bus bandwidth.

When is an invalidation-based protocol good? For a sequence of multiple writes to a cache line: it saves the intermediate write transactions.

Also think about the overhead of initiating small updates for every write in update protocols.

-Invalidation-based protocols are much more popular.
-Some systems support both, or a hybrid based on the dynamic sharing pattern of a cache line.


MSI PROTOCOL

The MSI protocol is a basic cache coherence protocol used in multiprocessor systems. As with other cache coherency protocols, the letters of the protocol name identify the possible states a cache line can be in. So, for MSI, each block contained in a cache can have one of three possible states:

Modified: The block has been modified in the cache. The data in the cache is then inconsistent with the backing store (e.g. memory). A cache with a block in the "M" state has the responsibility to write the block to the backing store when it is evicted.

Shared: This block is unmodified and exists in at least one cache. The cache can evict the data without writing it to the backing store.

Invalid: This block is invalid, and must be fetched from memory or another cache if the block is to be stored in this cache.

These coherency states are maintained through communication between the caches and the backing store. The caches have different responsibilities when blocks are read or written, or when they learn of other caches issuing reads or writes for a block.

When a read request arrives at a cache for a block in the "M" or "S" states, the cache supplies the data. If the block is not in the cache (in the "I" state), it must verify that the line is not in the "M" state in any other cache. Different caching architectures handle this differently. For example, bus architectures often perform snooping, where the read request is broadcast to all of the caches.

If another cache has the block in the "M" state, it must write back the data to the backing store and go to the "S" or "I" states. Once any "M" line is written back, the cache obtains the block from either the backing store, or another cache with the data in the "S" state. The cache can then supply the data to the requestor. After supplying the data, the cache block is in the "S" state.

When a write request arrives at a cache for a block in the "M" state, the cache modifies the data locally. If the block is in the "S" state, the cache must notify any other caches that might contain the block in the "S" state that they must evict the block. This notification may be via bus snooping or a directory, as described above. Then the data may be locally modified. If the block is in the "I" state, the cache must notify any other caches that might contain the block in the "S" or "M" states that they must evict the block. If the block is in another cache in the "M" state, that cache must either write the data to the backing store or supply it to the requesting cache. If at this point the cache does not yet have the block locally, the block is read from the backing store before being modified in the cache. After the data is modified, the cache block is in the "M" state.

For any given pair of caches, the permitted states of a given cache line are as follows:

       M   S   I
  M    x   x   √
  S    x   √   √
  I    √   √   √

Processor requests to cache: PrRd, PrWr.
Bus transactions: BusRd, BusRdX, BusUpgr, BusWB.
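Putting the pieces together, the MSI transitions can be sketched as two handlers, one for processor requests and one for snooped bus transactions. This is a simplified illustration using the request names listed above; a real controller would also arbitrate for the bus and move the data itself.

```cpp
// Simplified MSI state machine using the request names above (PrRd, PrWr
// from the processor; BusRd, BusRdX, BusUpgr observed on the bus).
#include <cstring>

enum class MSI { M, S, I };

// Processor-side transition: returns the bus transaction to issue, if any.
const char* onProcessor(MSI& s, bool isWrite) {
    if (!isWrite) {                                      // PrRd
        if (s == MSI::I) { s = MSI::S; return "BusRd"; } // read miss
        return nullptr;                                  // hit in M or S
    }
    if (s == MSI::I) { s = MSI::M; return "BusRdX"; }    // PrWr miss
    if (s == MSI::S) { s = MSI::M; return "BusUpgr"; }   // data-less upgrade
    return nullptr;                                      // write hit in M
}

// Snoop-side transition for a transaction issued by another cache.
void onSnoop(MSI& s, const char* busOp, bool& writeback) {
    writeback = (s == MSI::M);        // an M line must flush (BusWB) first
    if (std::strcmp(busOp, "BusRd") == 0) {
        if (s == MSI::M) s = MSI::S;  // another reader: downgrade to S
    } else {
        s = MSI::I;                   // BusRdX or BusUpgr: invalidate
    }
}
```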


MESI protocol

The MESI protocol (also known as the Illinois protocol due to its development at the University of Illinois at Urbana-Champaign) is a widely used cache coherency and memory coherence protocol. It is the most common protocol that supports write-back caches. Its use in personal computers became widespread with the introduction of Intel's Pentium processor, to "support the more efficient write-back cache in addition to the write-through cache previously used by the Intel 486 processor"[1].

States

Every cache line is marked with one of the four following states (coded in two additional bits):

Modified The cache line is present only in the current cache, and is dirty; it has been modified from the value in main memory. The cache is required to write the data back to main memory at some time in the future, before permitting any other read of the (no longer valid) main memory state. The write-back changes the line to the Exclusive state.

Exclusive The cache line is present only in the current cache, but is clean; it matches main memory. It may be changed to the Shared state at any time, in response to a read request. Alternatively, it may be changed to the Modified state when writing to it.

Shared Indicates that this cache line may be stored in other caches of the machine and is "clean"; it matches main memory. The line may be discarded (changed to the Invalid state) at any time.

Invalid Indicates that this cache line is invalid.

For any given pair of caches, the permitted states of a given cache line are as follows:


       M   E   S   I
  M    x   x   x   √
  E    x   x   x   √
  S    x   x   √   √
  I    √   √   √   √

Operation

In a typical system, several caches share a common bus to main memory. Each also has an attached CPU which issues read and write requests. The caches' collective goal is to minimize the use of the shared main memory.

A cache may satisfy a read from any state except Invalid. An Invalid line must be fetched (to the Shared or Exclusive states) to satisfy a read.

A write may only be performed if the cache line is in the Modified or Exclusive state. If it is in the Shared state, all other cached copies must be invalidated first. This is typically done by a broadcast operation known as Read For Ownership (RFO).

A cache may discard a non-Modified line at any time, changing to the Invalid state. A Modified line must be written back first.

A cache that holds a line in the Modified state must snoop (intercept) all attempted reads (from all of the other caches in the system) of the corresponding main memory location and insert the data that it holds. This is typically done by forcing the read to back off (i.e. retry later), then writing the data to main memory and changing the cache line to the Shared state.

A cache that holds a line in the Shared state must listen for invalidate or read-for-ownership broadcasts from other caches, and discard the line (by moving it into Invalid state) on a match.

A cache that holds a line in the Exclusive state must also snoop all read transactions from all other caches, and move the line to Shared state on a match.


The Modified and Exclusive states are always precise: i.e. they match the true cache line ownership situation in the system. The Shared state may be imprecise: if another cache discards a Shared line, this cache may become the sole owner of that cache line, but it will not be promoted to Exclusive state. Other caches do not broadcast notices when they discard cache lines, and this cache could not use such notifications without maintaining a count of the number of shared copies.

In that sense the Exclusive state is an opportunistic optimization: if the CPU wants to modify a cache line that is in state S, a bus transaction is necessary to invalidate all other cached copies. State E enables modifying a cache line with no bus transaction.
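The following small sketch captures the two transitions that make the E state pay off: a read miss fills in E when no other cache reports holding the line (the "shared" signal is an assumption of a typical snoopy implementation, not something this text specifies), and a write to an E line upgrades silently.

```cpp
// Sketch of the MESI E-state optimization: silent E -> M upgrade, and
// read-miss fill in E when no other cache holds the line.
enum class MESI { M, E, S, I };

// On a read miss, a typical snoopy bus carries a "shared" signal asserted
// by any cache that holds the line (an assumed mechanism).
MESI stateAfterReadMiss(bool otherCacheHasLine) {
    return otherCacheHasLine ? MESI::S : MESI::E;
}

bool writeNeedsBusTransaction(MESI s) {
    switch (s) {
    case MESI::M:
    case MESI::E: return false;  // E -> M happens with no bus transaction
    case MESI::S: return true;   // must invalidate sharers (RFO broadcast)
    case MESI::I: return true;   // write miss: read for ownership
    }
    return true;
}
```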


MOESI PROTOCOL

This is a full cache coherency protocol that encompasses all of the possible states commonly used in other protocols. In addition to the four common MESI protocol states, there is a fifth "Owned" state representing data that is both modified and shared. This avoids the need to write modified data back to main memory before sharing it. While the data must still be written back eventually, the write-back may be deferred.

Each cache line is in one of five states:

Modified A cache line in the modified state holds the most recent, correct copy of the data. The copy in main memory is stale (incorrect), and no other processor holds a copy. The cached data may be modified at will. The cache line may be changed to the Exclusive state by writing the modifications back to main memory.

Owned A cache line in the owned state holds the most recent, correct copy of the data. The owned state is similar to the shared state in that other processors can hold a copy of the most recent, correct data. Unlike the shared state, however, the copy in main memory can be stale (incorrect). Only one processor can hold the data in the owned state—all other processors must hold the data in the shared state. The cache line may be changed to the Modified state after invalidating all shared copies, or changed to the Shared state by writing the modifications back to main memory.

Exclusive A cache line in the exclusive state holds the most recent, correct copy of the data. The copy in main memory is also the most recent, correct copy of the data. No other processor holds a copy of the data. The cache line may be changed to the Modified state at any time in order to modify the data. It may also be discarded (changed to the Invalid state) at any time.

Shared A cache line in the shared state holds the most recent, correct copy of the data. Other processors in the system may hold copies of the data in the shared state, as well. The copy in main memory is also the most recent, correct copy of the data, if no other processor holds it in owned state. The cache line may not be written, but may be changed to the Exclusive state after invalidating all shared copies. It may also be discarded (changed to the Invalid state) at any time.

Invalid A cache line in the invalid state does not hold a valid copy of the data. Valid copies of the data might be either in main memory or another processor cache.


For any given pair of caches, the permitted states of a given cache line are as follows:

       M   O   E   S   I
  M    x   x   x   x   √
  O    x   x   x   √   √
  E    x   x   x   x   √
  S    x   √   x   √   √
  I    √   √   √   √   √

This protocol, a more elaborate version of the simpler MESI protocol, avoids the need to write modifications back to main memory when another processor tries to read it. Instead, the Owned state allows a processor to supply the modified data directly to the other processor. This is beneficial when the communication latency and bandwidth between two CPUs is significantly better than to main memory. An example would be multi-core CPUs with per-core L2 caches.

If a processor wishes to write to an Owned cache line, it must notify the other processors that are sharing that cache line. Depending on the implementation it may simply tell them to invalidate their copies (moving its own copy to the Modified state), or it may tell them to update their copies with the new contents (leaving its own copy in the Owned state).

Usages

This protocol was used in the SGI 4D machine.

The MESI protocol adds an "Exclusive" state to reduce the traffic caused by writes of blocks that only exist in one cache. The MOSI protocol adds an "Owned" state to reduce the traffic caused by write-backs of blocks that are read by other caches. The MOESI protocol does both of these things.
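The pairwise-compatibility table above can be encoded directly as an invariant check, for example inside a coherence simulator. A minimal sketch (the function and its use are our own illustration):

```cpp
// Encoding of the MOESI pairwise-compatibility table: two caches may hold
// the same line simultaneously only in the state pairs marked with a tick.
#include <cassert>

enum MOESI { M, O, E, S, I };

bool compatible(MOESI a, MOESI b) {
    if (a == I || b == I) return true;   // I is compatible with anything
    if (a == S && b == S) return true;   // multiple clean sharers are fine
    if ((a == O && b == S) || (a == S && b == O))
        return true;                     // exactly one owner plus sharers
    return false;  // M and E demand exclusivity; two owners are illegal
}

int main() {
    assert(compatible(O, S));            // owner + sharer: allowed
    assert(!compatible(M, S));           // M excludes all other copies
    assert(!compatible(O, O));           // only one owner per line
}
```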


MOSI protocol

The MOSI protocol is an extension of the basic MSI cache coherency protocol. It adds the Owned state, which indicates that the current processor owns this block and will service requests from other processors for the block.

For any given pair of caches, the permitted states of a given cache line are as follows:

       M   O   S   I
  M    x   x   x   √
  O    x   x   √   √
  S    x   √   √   √
  I    √   √   √   √


Sequential and Weak Consistency Model

Atomicity and Event Ordering (Ref. Pg 248 Hwang)

The problem of memory inconsistency arises when the memory access order differs from the program execution order. A uniprocessor system, for example, maps an SISD instruction sequence into a matching execution sequence, so memory accesses (for instructions and data) are consistent with the program execution order. This property is called sequential consistency.

In shared-memory multiprocessors there are multiple instruction streams (i.e. MIMD instruction sequences), and their memory accesses may interleave in many different orders.

Memory Consistency Issues

The behavior of a shared-memory system as observed by the processors is called the memory model. Primitive memory operations for a multiprocessor include load (read), store (write), and swap (atomic load and store).

Event orderings

The order in which shared-memory operations are performed by one process may be observed by other processes. Memory events correspond to shared-memory accesses. Consistency models specify the order in which the events from one process should be observed by other processes in the machine.

Event ordering can be used to declare whether a memory event is legal or illegal when several processes are accessing a common set of memory locations. A program order is the order in which memory accesses occur in the execution of a single process.

Dubois et al. (1986) defined three primitive memory operations for the purpose of specifying memory consistency models:

1. A load by processor Pi is considered performed with respect to processor Pk at a point in time when the issuing of a store to the same location by Pk cannot affect the value returned by the load.

2. A store by Pi is considered performed with respect to Pk at a point in time when a load issued to the same address by Pk returns the value written by this store.

3. A load is globally performed if it is performed with respect to all processors and if the store that is the source of the returned value has been performed with respect to all processors.


Atomicity

There are 2 classes for Shared-Memory Multiprocessors:

1. Atomic memory access
2. Non-atomic memory access

A shared-memory access is atomic if the memory updates become known to all processors at the same time. Thus a store is atomic if the value stored becomes readable to all processors at the same time.

A system can be non-atomic if an invalidation signal does not reach all processors at the same time. With non-atomic memory, a multiprocessor cannot be strongly ordered.

Sequential Consistency Model

[Figure: n processors connected to a shared memory system]

In this model, the loads, stores and swaps of all processors appear to execute serially in a single global memory order.

Lamport’s Definition

He defined sequential consistency as follows: a multiprocessor system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.
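Lamport's definition is usually illustrated with the "store buffering" litmus test. In the C++ sketch below, sequential consistency forbids the outcome r1 == 0 and r2 == 0, because some interleaving must place one of the stores before both loads; C++ seq_cst atomics (the default ordering) provide exactly this guarantee.

```cpp
// Store-buffering litmus test: under sequential consistency the result
// r1 == 0 && r2 == 0 is impossible. C++ default (seq_cst) atomics give SC.
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> X{0}, Y{0};
int r1, r2;

int main() {
    std::thread t1([] { X.store(1); r1 = Y.load(); });
    std::thread t2([] { Y.store(1); r2 = X.load(); });
    t1.join(); t2.join();
    // Possible SC results: (0,1), (1,0), (1,1) -- never (0,0).
    std::printf("r1=%d r2=%d\n", r1, r2);
}
```

On a TSO machine with plain (non-atomic) accesses, the forbidden (0,0) outcome can actually appear, which is precisely the relaxation discussed later in this section.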

[Figure: processors P1, P2, P3, …, Pn connected through a switch to a single-port memory]


Dubois, Scheurich, and Briggs (1986) provided the following two sufficient conditions for sequential consistency:

1. Before a load is allowed to perform with respect to any other processor, all previous load accesses must be globally performed and all previous store accesses must be performed with respect to all processors.

2. Before a store is allowed to perform with respect to any other processor, all previous load accesses must be globally performed and all previous store accesses must be performed with respect to all processors.

Lamport’s Definition sets the basic spirit of sequential consistency.

Implementation Considerations

The figure above shows that the shared memory consists of a single port that can service exactly one operation at a time, and a switch that connects this memory to one of the processors for the duration of each memory operation. The order in which the switch is thrown from one processor to another determines the global order of memory-access operations.

This model implies total ordering of stores/loads at the instruction level.

Strong ordering of all shared memory access in the sequential consistency model preserves the program order in all processors.

A processor cannot issue another access until its most recent access to shared writable memory has been globally performed.

Drawbacks

When the system becomes very large, this model reduces the scalability of a multiprocessor system (poor memory performance).

Weak Consistency Model

The DSB Model (by Dubois, Scheurich and Briggs 1986)

1. All previous synchronization accesses must be performed, before a load or a store access is allowed to perform with respect to any other processor.

2. All previous load and store accesses must be performed before a synchronization access is allowed to perform with respect to any other processor.

3. Synchronization accesses are sequentially consistent with respect to one another.


These conditions provide a weak ordering of memory-access events in a multiprocessor. The dependence conditions on shared variables are weaker in such a system because they are limited to hardware-recognized synchronizing variables.
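A rough software analogue of the DSB conditions can be written with C++ atomics: ordinary data accesses are unordered among themselves, while a sequentially consistent synchronization variable separates them (condition 3 corresponds to seq_cst ordering on the flag; the variable and function names are illustrative).

```cpp
// Rough analogue of DSB weak ordering: ordinary accesses may reorder freely
// between synchronization points; the synchronizing variable restores order.
#include <atomic>
#include <cstdio>
#include <thread>

int data_a, data_b;                 // ordinary shared data, unordered
std::atomic<bool> sync_flag{false}; // the hardware-recognized sync variable

void producer() {
    data_a = 1;                     // ordinary stores: their mutual order
    data_b = 2;                     // need not be visible elsewhere
    // Condition 2: all previous loads/stores perform before the sync access.
    sync_flag.store(true, std::memory_order_seq_cst);
}

void consumer() {
    // Condition 1: the sync access performs before any later load/store.
    while (!sync_flag.load(std::memory_order_seq_cst)) { }
    std::printf("%d %d\n", data_a, data_b);  // guaranteed to print "1 2"
}

int main() {
    std::thread p(producer), c(consumer);
    p.join(); c.join();
}
```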

TSO (Total Store Order) Weak Consistency Model

[Figure: each processor's stores and swaps enter a per-processor FIFO store buffer; a switch drains the buffers one at a time into a single-port memory]

The TSO model was developed by Sun Microsystems' SPARC architecture group. Sindhu et al. describe it as follows: the stores and swaps issued by a processor are placed in a dedicated store buffer for that processor, which is operated first-in-first-out; stores reach memory in the same order the processor issued them.

Description

A load by a processor first checks its store buffer to see if it contains a store to the same location. If it does, the load returns the value of the most recent such store; otherwise the load goes directly to memory. Since loads can thus bypass earlier buffered stores, loads in general do not appear in memory order. A processor is logically blocked from issuing further operations until the load returns a value. A swap behaves like both a load and a store: it is placed in the store buffer like a store, and it blocks the processor like a load. In other words, the swap blocks until the store buffer is empty and then proceeds to memory.
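The store buffer with load bypassing described above can be sketched in a few lines of C++. This is an illustrative model, not SPARC code; the names and types are our own.

```cpp
// One processor's FIFO store buffer with the TSO load rule: a load first
// checks its own buffer (youngest matching store wins) before going to
// the shared memory.
#include <cstdint>
#include <deque>
#include <unordered_map>
#include <utility>

struct StoreBuffer {
    std::deque<std::pair<uint64_t, int>> fifo;            // (address, value)
    std::unordered_map<uint64_t, int>* memory = nullptr;  // the shared memory

    void store(uint64_t addr, int value) { fifo.emplace_back(addr, value); }

    int load(uint64_t addr) const {
        // Forward from the most recent buffered store to the same address.
        for (auto it = fifo.rbegin(); it != fifo.rend(); ++it)
            if (it->first == addr) return it->second;
        auto m = memory->find(addr);              // otherwise go to memory
        return m == memory->end() ? 0 : m->second;
    }

    // The single-port memory (the "switch") drains one buffered store,
    // preserving the FIFO issue order of this processor's stores.
    void drainOne() {
        if (fifo.empty()) return;
        (*memory)[fifo.front().first] = fifo.front().second;
        fifo.pop_front();
    }
};
```

Because `load` can complete while older stores still sit in `fifo`, a load effectively moves ahead of earlier stores in the global memory order, which is exactly the store-buffering relaxation TSO permits.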

PSO (Partial Store Order)

PSO is a further relaxation of TSO in which stores to different locations may also complete out of program order, unless the program separates them with an explicit store barrier.

Relaxed Memory Consistency

Relaxed memory consistency models have been introduced for building scalable multiprocessors with distributed shared memory.

Processor Consistency (PC)

Goodman (1989) introduced the Processor Consistency (PC) model, in which writes issued by each individual processor are always observed in program order; however, writes from two different processors can be observed out of program order with respect to each other. In other words, the order of writes is maintained within each processor, but reads are not restricted as long as they do not involve other processors.

The PC model is thus a relaxation of the SC model.

Two conditions related to other processors are required for ensuring processor consistency:

1. Before a read is allowed to perform with respect to any other processor, all previous read accesses must be performed.

2. Before a write is allowed to perform with respect to any other processor, all previous read and write accesses must be performed.

Release Consistency (RC)

One of the most relaxed models, introduced by Gharachorloo et al. (1990).

It requires that synchronization accesses in the program be identified and classified as either acquires (e.g. locks) or releases (e.g. unlocks).

An acquire is a read operation; a release is a write operation.

The advantage of a relaxed model is the potential for increased performance by hiding as much write latency as possible. The disadvantage is increased hardware complexity and a more complex programming model.


Three conditions ensure Release consistency:

1. Before an ordinary read or write access is allowed to perform with respect to any other processor, all previous acquire accesses must be performed.

2. Before a release access is allowed to perform with respect to any other processor, all previous ordinary read and store accesses must be performed.

3. The ordering restrictions imposed by weak consistency are not present in release consistency; instead, RC only requires the synchronization accesses to be processor consistent (PC) rather than sequentially consistent (SC).
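The acquire/release discipline maps directly onto C++ atomics, which borrow these very names. A minimal sketch of a spin lock built this way (an illustration of the model, not code from the text):

```cpp
// Acquire/release in C++ atomics: the lock acquire is a read(-modify-write)
// with acquire ordering, the unlock is a write with release ordering, and
// ordinary accesses in between need no global ordering of their own.
#include <atomic>

std::atomic<bool> locked{false};
int shared_counter = 0;            // ordinary data, protected by the lock

void criticalSection() {
    // Acquire (a read operation): later ordinary accesses may not be
    // performed before it (condition 1 above).
    while (locked.exchange(true, std::memory_order_acquire)) { }

    ++shared_counter;              // ordinary accesses; free to reorder
                                   // among themselves inside the section

    // Release (a write operation): it may not perform until all previous
    // ordinary reads and writes have performed (condition 2 above).
    locked.store(false, std::memory_order_release);
}
```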

Strong Model

-Sequential Consistency: the result of any execution appears as some interleaving of the operations of the individual processors when executed on a multithreaded sequential machine.

Relaxed Models

-Processor Consistency: writes issued by each individual processor are never seen out of order, but the order of writes from two different processors can be observed differently.

-Weak Consistency: the programmer enforces consistency using synchronization operations that are guaranteed to be sequentially consistent.

-Release Consistency: synchronization operations are split into two types, acquire and release, and each type of operation is guaranteed to be processor consistent.