
Page 1

COSC 6365, Lecture 6, 2011-02-02
Lennart Johnsson
2014-02-04, COSC6365

Introduction to HPC Architecture - Memory

Lecture 6

Lennart Johnsson
Dept. of Computer Science


GProf

• Short tutorial: http://web.eecs.umich.edu/~sugih/pointers/gprof_quick.html

• On-line manual: http://www.cs.utah.edu/dept/old/texinfo/as/gprof_toc.html

Page 2

Recall: Classical DRAM Organization

[Figure: RAM cell array with a row decoder (row address) driving the word (row) lines, and a column selector & I/O circuits (column address) selecting among the bit (data) lines. Each intersection represents a 1-T DRAM cell.]

The column address selects the requested bit from the row in each plane

http://www.cse.psu.edu/research/mdl/mji/mjicourses/431/cse431-18memoryintro.ppt/view
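The row/column address split described above can be sketched in a few lines. The cell-array geometry here (14 row bits, 10 column bits) is an illustrative assumption, not a figure from the slides:

```python
# Sketch (assumed geometry): splitting a flat DRAM cell address into the
# row address and column address that drive the decoder and column selector.
ROW_BITS = 14   # hypothetical number of row-address bits
COL_BITS = 10   # hypothetical number of column-address bits

def split_address(addr):
    """Return (row, column) for a flat cell address."""
    col = addr & ((1 << COL_BITS) - 1)                 # low bits: column
    row = (addr >> COL_BITS) & ((1 << ROW_BITS) - 1)   # high bits: row
    return row, col
```

The same split is why the row and column addresses can share one set of pins on a real DRAM: they are presented in two phases (RAS, then CAS).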


Recall: SDRAM Access Timing

[Timing diagram: row access, column accesses, precharge. tRCD precedes the first column access; each read takes tCAS; tRP is the precharge time; tRAS spans the minimum row-open time; tWR follows the last write.]

http://en.wikipedia.org/wiki/SDRAM_latency

Initially, the row address is sent to the DRAM. After tRCD, the row is open and may be accessed. For SDRAM, multiple column accesses can be in progress at once. Each read takes time tCAS. When the column accesses are done, a precharge returns the SDRAM to the starting state after time tRP.

Two other time limits that must also be maintained are tRAS, the time for the refresh of the row to complete before it may be closed again, and tWR, the time that must elapse after the last write before the row may be closed.
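The rules above can be turned into a small back-of-the-envelope calculator. The 9-cycle values in the example are illustrative (roughly DDR3 CL9), not parameters from the slide:

```python
# Sketch: cycles for an isolated row activation, `reads` column reads,
# and the precharge, using the SDRAM timing parameters described above.
def random_read_latency(t_rcd, t_cas, t_rp, reads=1):
    """Cycles to open a row, perform `reads` column reads, and precharge."""
    return t_rcd + reads * t_cas + t_rp

# One isolated read with tRCD = tCAS = tRP = 9 cycles:
# 9 + 9 + 9 = 27 cycles before the next row can be opened.
```

With reads=4 the same call gives 54 cycles, i.e. 13.5 cycles per read, which is the payoff of keeping a row open for multiple column accesses.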

Page 3

Number of Ranks on a Channel is limited due to signal integrity and driver power

Source: Jacob Leverich, 2011, http://www.stanford.edu/class/ee282/handouts/lect.05.dram.pdf


Recall: DDR3 population scenarios

• Maximum capacity
– 800 MHz across 3 channels (38 GB/s)
– Up to 3 DPC (18 DIMMs total)
– Capacity: 144 GB (w/ 8 GB DIMMs)
– Virtualization environments

• Balanced performance
– 1066 MHz across 3 channels (51 GB/s)
– Up to 2 DPC (12 DIMMs)
– Capacity: 96 GB
– General-purpose enterprise workloads

• Maximum bandwidth
– 1333 MHz across 3 channels (64 GB/s)
– 1 DPC (6 DIMMs)
– Capacity: 48 GB
– HPC technical computing

[Diagram: two-socket HP DL G6 Intel-based server; each CPU has three memory channels at 10.6 GB/s (1333 MHz), 8.5 GB/s (1066 MHz) or 6.4 GB/s (800 MHz) per channel.]

http://ilinuxkernel.com/Backup/Data/DDR3.Config.Recommendations.-.August.2009.ppt
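The per-scenario bandwidth figures follow from transfers/s × 8 bytes × channels × sockets; a sketch, with the two-socket, three-channel configuration taken from the diagrams on this slide:

```python
# Sketch: aggregate DDR3 bandwidth in GB/s. DDR3 transfers 8 bytes per
# transfer, and the "MHz" figures on the slide are effective MT/s.
def ddr3_bandwidth_gbs(mt_per_s, channels=3, sockets=2):
    return mt_per_s * 8 * channels * sockets / 1000.0

# 1333 MT/s * 8 B * 3 channels * 2 sockets = ~64 GB/s, matching the
# "maximum bandwidth" scenario; 1066 gives ~51 GB/s and 800 gives ~38 GB/s.
```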

Page 4

Recall: The “Memory Wall”

• Logic vs DRAM speed gap continues to grow

[Chart, log scale 0.01 – 1000: clocks per instruction (Core) vs clocks per DRAM access (Memory), from VAX/1980 through PPro/1996 to 2010+.]

http://www.cse.psu.edu/research/mdl/mji/mjicourses/431/cse431-18memoryintro.ppt/view


Recall: Memory Hierarchy – Parallelism, Caches

[Diagram: three cores on one processing die (“CPU”), each with processing logic and private Level 1 and Level 2 caches, a shared Level 3 cache, and DRAM. Typical figures: L1 1 – 5 cycles, L2 8 – 15 cycles, L3 25 – 50 cycles; 20 – 100 GB/s core-to-cache, 20 – 50 GB/s to DRAM.]

Processing logic clock rates range from ~100+ MHz to 3+ GHz
Execution widths range from one operation to eight
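The slides quote per-level latencies; the standard average memory access time (AMAT) formula, which is not itself on the slides, shows how the levels combine. Latencies and miss rates below are illustrative values in the ranges quoted above, not measurements:

```python
# Sketch: AMAT for a multi-level hierarchy, applied recursively from the
# last level back to L1: AMAT_i = hit_i + miss_i * AMAT_(i+1).
def amat(hit_times, miss_rates, mem_latency):
    """hit_times[i], miss_rates[i] for levels L1..Ln; memory ends the chain."""
    t = mem_latency
    for hit, miss in zip(reversed(hit_times), reversed(miss_rates)):
        t = hit + miss * t
    return t

# L1: 4 cycles / 5% miss, L2: 12 cycles / 20%, L3: 40 cycles / 50%, DRAM: 200
# AMAT = 4 + 0.05*(12 + 0.2*(40 + 0.5*200)) ~= 6.0 cycles
```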

Page 5

Recall: Cache – Memory mapping

• There are three methods of block placement:

– Direct mapped: if each block has only one place it can appear in the cache, the cache is said to be direct mapped. The mapping is usually (Block address) MOD (Number of blocks in cache)

– Fully associative: if a block can be placed anywhere in the cache, the cache is said to be fully associative.

– Set associative: if a block can be placed in a restricted set of places in the cache, the cache is said to be set associative. A set is a group of blocks in the cache. A block is first mapped onto a set, and then the block can be placed anywhere within that set. The set is usually chosen by bit selection; that is, (Block address) MOD (Number of sets in cache)

Note: A direct mapped cache is simply one-way set associative, and a fully associative cache with m blocks could be called m-way set associative.

The vast majority of processor caches today are direct mapped, 2-way, 4-way, 8-way or 16-way set associative.

Adopted from http://www.cs.iastate.edu/~prabhu/Tutorial/CACHE/ex18.html
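The two MOD formulas above can be made concrete with a small sketch; the geometry (64 B blocks, 512 blocks, 4-way) is an illustrative assumption:

```python
# Sketch: where a byte address lands under the placement schemes above.
BLOCK_SIZE = 64
NUM_BLOCKS = 512
WAYS = 4
NUM_SETS = NUM_BLOCKS // WAYS   # 128 sets in the 4-way set-associative case

def block_address(byte_addr):
    return byte_addr // BLOCK_SIZE

def direct_mapped_slot(byte_addr):
    # Direct mapped: (block address) MOD (number of blocks in cache)
    return block_address(byte_addr) % NUM_BLOCKS

def set_index(byte_addr):
    # Set associative: (block address) MOD (number of sets in cache)
    return block_address(byte_addr) % NUM_SETS
```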


Recall: Cache Misses

• Compulsory: first reference to a block, a.k.a. cold-start misses
– misses that would occur even with an infinite cache

• Capacity: cache is too small to hold all data needed by the program
– misses that would occur even under a perfect replacement policy

• Conflict: misses that occur because of collisions due to the block-placement strategy
– misses that would not occur with full associativity

http://inst.eecs.berkeley.edu/~cs152/sp10/lectures/L07-MemoryII.pdf
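A toy classifier following the definitions above: a miss is compulsory if even an infinite cache would miss, capacity if a same-size fully associative LRU cache would also miss, and conflict otherwise. The direct-mapped cache and block traces are illustrative assumptions:

```python
# Sketch: classifying the three C's for a direct-mapped cache of
# `num_blocks` blocks; addresses in `block_trace` are block numbers.
from collections import OrderedDict

def classify_misses(block_trace, num_blocks):
    seen = set()          # models an infinite cache (compulsory detector)
    lru = OrderedDict()   # same-size fully associative LRU (capacity detector)
    direct = {}           # slot -> resident block in the direct-mapped cache
    counts = {"compulsory": 0, "capacity": 0, "conflict": 0, "hit": 0}
    for b in block_trace:
        slot = b % num_blocks
        if direct.get(slot) == b:
            counts["hit"] += 1
        elif b not in seen:
            counts["compulsory"] += 1
        elif b not in lru:
            counts["capacity"] += 1
        else:
            counts["conflict"] += 1
        # update all three model caches after classifying the access
        seen.add(b)
        direct[slot] = b
        lru[b] = None
        lru.move_to_end(b)
        if len(lru) > num_blocks:
            lru.popitem(last=False)   # evict least recently used
    return counts
```

For example, the trace 0, 2, 0, 2 on a 2-block cache produces only conflict misses after the cold start: both blocks map to slot 0 even though the cache could hold both.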

Page 6

Recall: Cache Parameters - Performance

• Larger cache size
+ reduces capacity and conflict misses
- hit time will increase

• Higher associativity
+ reduces conflict misses
- may increase hit time

• Larger block size
+ reduces compulsory and capacity (reload) misses
- increases conflict misses and miss penalty

http://inst.eecs.berkeley.edu/~cs152/sp10/lectures/L07-MemoryII.pdf


Multi-Level Caches

• Increases in transistor density have allowed caches to be placed inside the processor chip

• The trade-off between speed and size has resulted in several levels of cache (inside the chip)

• Internal caches have very short wires (within the chip itself) and are therefore quite fast

• Thus, modern CPUs have a super fast (1 – 4 cycle latency) Level 1 cache of modest size (16 kB – 64 kB)

• Today, chip transistor counts are sufficiently large that two additional levels of cache (Level 2 and Level 3) of increasing size are also accommodated on chip

http://inst.eecs.berkeley.edu/~cs152/sp10/lectures/L07-MemoryII.pdf

Page 7

Writing to Cache

• Must not overwrite a cache block unless main memory is up to date

• Two main problems:

– If the cache is written to, main memory is invalid; if main memory is written to, the cache is invalid
• Can occur if I/O can address main memory directly

– Multiple CPUs may have individual caches;
• once one cache is written to, all other caches are invalid (and so is main memory)

http://csciwww.etsu.edu/tarnoff/ntes4717/week_03/cache.ppt


Recall: Write through

• All writes go to main memory as well as cache

• Multiple CPUs can monitor main memory traffic to keep local (to CPU) cache up to date

• Lots of traffic

• Slows down writes

http://csciwww.etsu.edu/tarnoff/ntes4717/week_03/cache.ppt

Page 8

Recall: Write back

• Updates initially made in cache only

• Update bit for cache slot is set when update occurs

• If block is to be replaced, write to main memory only if update bit is set

• Other caches get out of sync

• I/O must access main memory through cache

• Research shows that 15% of memory references are writes

http://csciwww.etsu.edu/tarnoff/ntes4717/week_03/cache.ppt


IBM Power7

http://arstechnica.com/business/news/2010/02/two-billion-transistor-beasts-power7-and-niagara-3.ars

IBM Power7, 2010
1.2 billion transistors, 567 mm2 in 45 nm
4-way simultaneous multi-threading (SMT)
3.0 – 4.25 GHz, 24 – 34 GF/core
4, 6 or 8 cores; up to 4 chips/MCM; 12 execution units/core
L1 instr cache: 32 kB/core (4-way, 128 B cache line, 64 sets, 2-3 cycle latency)
L1 data cache: 32 kB/core (8-way, 128 B cache line, 64 sets, 2-3 cycle latency, LRU, write-through)
L2 cache: 256 kB/core (8-way, 128 B cache line, 256 sets, 8 cycle latency, “balanced” LRU, write-back)
L3 cache: 4 MB/core (8-way, 128 B cache line, 4,096 sets, 25 cycle latency, modified LRU, modified write-back)
200 W TDP, < 1.36 GF/W (peak)
http://www-03.ibm.com/systems/resources/pwrsysperf_OfGigaHertzandCPWs.pdf

http://www.ibm.com/developerworks/wikis/download/attachments/104533501/POWER7+-+The+Beat+Goes+On.pdf

http://www.tandvsolns.co.uk/DVClub/1_Nov_2010/Ludden_Power7_Verification.pdf

http://www-05.ibm.com/fr/events/hpc_summit_2010/P7_HPC_Summit_060410.pdf
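Most of the set counts in these spec lists follow from sets = capacity / (line size × associativity); a quick check of the identity (the Power7 L1 data entry as printed does not satisfy it, so only the consistent entries are verified here):

```python
# Sketch: set count implied by capacity, line size and associativity.
def num_sets(capacity_bytes, line_bytes, ways):
    return capacity_bytes // (line_bytes * ways)

# Power7 L1 instr: 32 kB, 128 B lines, 4-way  -> 64 sets
# Power7 L2:      256 kB, 128 B lines, 8-way  -> 256 sets
# Power7 L3:        4 MB, 128 B lines, 8-way  -> 4,096 sets
```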

Page 9

Itanium Quad-core

Itanium quad-core, 2.05 billion transistors, 65 nm
1.33 – 1.73 GHz
L1 cache: 16 kB + 16 kB (4-way, cache line 64 B, 64 sets, LRU, write-through, 1 cycle latency)
L2 instr cache: 512 kB (8-way, cache line 128 B, 512 sets, 7 cycle latency)
L2 data cache: 256 kB (8-way, cache line 128 B, 256 sets, LRU, write-back, 5, 7 or 9 cycle latency)
L3 cache: 6 MB/core (12-way, cache line 128 B, 6,144 sets, LRU, write-back, 15 – 16 cycle latency)
130 - 185 W
http://www.intel.com/Assets/PDF/manual/323602.pdf
http://xtreview.com/addcomment-id-4631-view-Intel-tukwila-the-power-of-30-Mb-cache.html

[Die photo: four cores and L3 cache blocks.]


Intel Sandy Bridge CPU

http://hothardware.com/articleimages/Item1751/small_sbe-die.jpg
http://en.wikipedia.org/wiki/Sandy_Bridge
http://www.anandtech.com/show/3922/intels-sandy-bridge-architecture-exposed/4

Intel Sandy Bridge, 2012
2.27 billion transistors, 435 mm2 in 32 nm
Hyper-Threading (2 threads/core)
3.3 – 3.9 GHz, 26.4 – 31.2 GF/core
6 (8) cores
L1 cache: 32 kB + 32 kB/core (8-way, cache line 64 B, 64 sets, 3 cycles, LRU, write-back)
L2 cache: 256 kB/core (8-way, cache line 64 B, 512 sets, 8 cycles, LRU, write-back)
L3 cache: 15 (20) MB shared (16-way, cache line 64 B, 15,360 (20,480) sets, 26 - 31 cycles latency, LRU, write-back)
60 - 130 W TDP

Page 10

AMD Bulldozer - Interlagos

Bulldozer (2 cores); Orochi (8 cores)

The Bulldozer module is architected to be power-efficient
– Minimize silicon area by sharing functionality between two cores
– 213 million transistors, 30.9 mm2 in 32 nm technology

All blocks and circuits have been designed to minimize power (not just in the Bulldozer core)
– Extensive flip-flop clock-gating throughout the design
– Circuits power-gated dynamically

Numerous power-saving features under firmware/software control
– Core C6 State (CC6)
– Core P-states/AMD Turbo CORE
– Application Power Management (APM)
– DRAM power management
– Message Triggered C1E

Two integer units
FPU configurable as two 128-bit wide or one 256-bit wide
Caches:
– L1 instruction cache: 64 kB/module, 64 B cache line, 2-way associative
– L1 data cache: two 16 kB/module, 64 B cache line, 4-way, LRU, write-through
– L2 cache: 2 MB, 64 B cache line, 16-way associative, 2.2 GHz, LRU

THE DIE: 315 mm2
Eight “Bulldozer” modules
Integrated Northbridge which controls:
– 8 MB of Level 3 cache, 64-byte cache line, 16-way associative, MOESI, 2.2 GHz, LRU
– Two 72-bit wide DDR3 memory channels up to DDR3-1866
– Four 16-bit receive/16-bit transmit HyperTransport™ links


IBM PowerPC

PowerPC 450, 208 M transistors, 90 nm, 16 W
4 flops/cycle, 850 MHz

L1 cache: 32 kB + 32 kB (64-way, cache line 32 B, 16 sets, round-robin replacement, write-through, 4 cycle latency, 8 B/cycle)

L2 cache: prefetch buffer, 2 kB (16 128 B lines, fully associative, 12 cycle latency, 8 B/cycle (avg. 4.6 B/cycle = 128 B/(12+128/8)), LRU, write-back)

L3 cache: shared, 4 banks of 2 MB each (each bank has an L3 directory and a 15-entry 128 B combining buffer, 35 cycle latency)

Memory: 2 GB DDR2 @ 400 MHz (4 banks of 512 MB), 86 cycle latency

http://workshops.alcf.anl.gov/gs10/files/2010/01/Morozov-BlueGeneP-Architecture.pdf

https://computing.llnl.gov/tutorials/bgp/

Page 11

ARM Cortex A15

• ARM Cortex-A15 MPCore supports
– up to 4 cores
– six independent power domains
– an optional SIMD/NEON™ unit

• L1 cache/core:
– Instruction cache: 32 kB, 2-way set associative, 64 B cache line, LRU replacement policy, parity per 16 bits
– Data cache: 32 kB, 2-way set associative, 64 B cache line, LRU replacement policy, 32-bit ECC
– Write-back and write-through, 1 – 2 cycle latency

• L2 cache - shared
– 512 kB – 4 MB
– 16-way associativity
– Cache coherence
– Random replacement policy
– L2 inclusive of L1 caches
– 3 – 8 cycle latency
– Option to include parity or ECC

• Advanced bus interface supporting up to 32 GB/s
– Exposing Accelerator Coherence Port (ACP) for enhanced peripheral and accelerator SoC integration

http://infocenter.arm.com/help/topic/com.arm.doc.ddi0438g/DDI0438G_cortex_a15_r3p2_trm.pdf


Sandy Bridge Cache Behavior

[Plots: Sandy Bridge and Ivy Bridge latency curves.]

Varying the stride of the random cyclic permutation access pattern for Ivy Bridge and Sandy Bridge. Sharp corners appear when stride >= cache line size, except for Ivy Bridge L3.

http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/

Page 12

Sandy Bridge Cache Behavior

Sandy Bridge, larger strides

http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/


STREAM TI 6678 Bandwidth test, 8 cores

[Plot: bandwidth (GB/s) vs data set size in bytes.]
L1: peak 128 GB/s, measured 125.8 GB/s
L2: peak 2*64 GB/s, measured 48 GB/s
DDR3-1333: peak 10.664 GB/s, measured 8.9 GB/s

STREAM efficiency:
L1: 98 % of peak
L2: 37 – 75 %? of peak
DDR: 83 % of peak (better than TI’s results!! Telecon comment)

Source: Gilbert Netzer, KTH
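The efficiency percentages above are simply measured over peak; a one-line sketch that reproduces the L1 and DDR figures:

```python
# Sketch: percent of peak bandwidth achieved by the STREAM measurements.
def pct_of_peak(measured_gbs, peak_gbs):
    return 100.0 * measured_gbs / peak_gbs

# L1:  125.8 / 128    -> ~98 % of peak
# DDR:   8.9 / 10.664 -> ~83 % of peak
```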

Page 13

Recall: Row/Column major

http://webster.cs.ucr.edu/AoA/Windows/HTML/Arraysa2.html

Row major ordering: C/C++
Column major ordering: Fortran

http://egret.psychol.cam.ac.uk/statistics/local_copies_of_sources_Cardinal_and_Aitken_ANOVA/Matrix_multiplication.htm

Matrix multiplication: data access along row/column
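The two orderings can be made concrete as flat memory offsets (Python used for illustration; the array shape is an arbitrary example):

```python
# Sketch: flat memory offset of element (i, j) in an n_rows x n_cols array.
def row_major_offset(i, j, n_rows, n_cols):
    return i * n_cols + j     # C/C++: elements of a row are contiguous

def col_major_offset(i, j, n_rows, n_cols):
    return j * n_rows + i     # Fortran: elements of a column are contiguous
```

Walking along a row costs stride 1 in row-major storage but stride n_rows in column-major storage, which is why the good loop order for matrix multiplication differs between C and Fortran.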


Example: Matrix multiplication on UltraSPARC IIIi

http://www.cs.utexas.edu/~pingali/CS378/2011sp/lectures/co.ppt
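These per-machine slides compare the performance of different matrix-multiply loop orders. A sketch of two of the orders such experiments typically compare, assuming row-major (C-style) storage; naive Python lists stand in for real arrays:

```python
# Sketch: two loop orders for C = A * B on n x n row-major matrices.
def matmul_ijk(A, B, n):
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += A[i][k] * B[k][j]   # B walked down a column: large stride
            C[i][j] = s
    return C

def matmul_ikj(A, B, n):
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            a = A[i][k]
            for j in range(n):
                C[i][j] += a * B[k][j]   # B and C walked along rows: unit stride
    return C
```

Both compute the same product; on real hardware the ikj order is usually much faster in row-major storage because the innermost loop has unit stride.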

Page 14

Example: Matrix multiplication on IBM Power 5

http://www.cs.utexas.edu/~pingali/CS378/2011sp/lectures/co.ppt


Example: Matrix multiplication on Intel Itanium 2

http://www.cs.utexas.edu/~pingali/CS378/2011sp/lectures/co.ppt

Page 15

Example: Matrix multiplication on Intel Xeon

http://www.cs.utexas.edu/~pingali/CS378/2011sp/lectures/co.ppt


Intel Xeon Clovertown

Ayaz Ali Code generator

Registers 16 (128 bit)

I-Cache 32K

L1 Cache 32K/core, 64B, 8way WB

L2 Cache 8M/dual, 64B, 16way WB

Impact of memory stride on FFT performance

For radix-16 codelet the performance range is about a factor of 6
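One reason large power-of-two strides hurt these codelets: strided addresses can pile onto a few cache sets. A sketch, with the cache geometry (64 B lines, 64 sets) chosen to match the L1 parameters listed above:

```python
# Sketch: how many distinct cache sets a strided access pattern touches.
# Address i*stride_bytes maps to set ((i*stride_bytes)//line_bytes) % num_sets.
def distinct_sets(stride_bytes, n, line_bytes=64, num_sets=64):
    return len({(i * stride_bytes // line_bytes) % num_sets for i in range(n)})

# A 4096-byte stride maps 64 consecutive elements onto a single set of a
# 64-set cache, so only `ways` of them can be resident at once.
```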

Page 16

Intel Itanium 2

Ayaz Ali Code generator

Registers 128 (64 bit)

I-Cache 16K

L2 Cache 256K, 128B, 8way WB

L3 Cache 6M, 128B, 12way WB

Impact of memory stride on FFT performance

For radix-16 codelet the performance range is about a factor of 6


AMD Opteron 285

Ayaz Ali Code generator

Registers 16 (128 bit)

I-Cache 64K

L1 Cache 64K/core, 64B, 2way WB

L2 Cache 1M/core, 64B, 16way WB

Impact of memory stride on FFT performance

For radix-16 codelet the performance range is about a factor of 7

Page 17

IBM Power5+

Ayaz Ali Code generator

Registers 120 (64 bit)

I-Cache 64K

L1 Cache 32K/core, 128B, 4way WT

L2 Cache 1.9M/dual, 128B, 10way WB

Impact of memory stride on FFT performance

For radix-16 codelet the performance range is about a factor of 4


UHFFT Performance Results

              Itanium2 (Quad)         Opteron (Dual)
Processor     1.5 GHz                 2.0 GHz
Cache         16K/256K/6M             64K/1M
Compiler      UHFFT2.0.1 (icc –O3)    UHFFT2.0.1 (gcc –O3)
              FFTW3.1.2 (gcc –O3)     FFTW3.1.2 (gcc –O3)

Source: Ayaz Ali

Page 18

Cache Level Design Dependencies

• Use a smaller L1 if there is also an L2
– Trade increased L1 miss rate for reduced L1 hit time and reduced L1 miss penalty
– Reduces average access energy

• Use a simpler write-through L1 with on-chip L2
– Write-back L2 cache absorbs write traffic, which doesn’t go off-chip
– At most one L1 miss request per L1 access (no dirty victim write back) simplifies pipeline control
– Simplifies coherence issues
– Simplifies error recovery in L1 (can use just parity bits in L1 and reload from L2 when a parity error is detected on an L1 read)

http://inst.eecs.berkeley.edu/~cs152/sp10/lectures/L07-MemoryII.pdf


Cache

• At any time, data is copied between only one pair of adjacent cache levels

• Caches closer to the CPU are smaller and made in faster, more expensive technology

• Conversely, caches further away are larger, slower and less costly

Page 19

Cache Summary

• Server processors today typically have three levels of on-chip cache

• Typical cache characteristics

– Level 1:
• separate data and instruction caches
• 16 – 64 kB, 2-way – 8-way (PowerPC exception: 64-way)
• 1 – 4 cycle latency

– Level 2:
• shared cache for data and instructions (Itanium exception)
• 256 kB – 512 kB, 8-way – 16-way (PowerPC exception: 2 kB, fully associative)
• – 12 cycle latency

– Level 3:
• shared for data and instructions
• 1 – 6 MB/core, typically shared among all cores, 8 MB – 32 MB/chip, 8-way – 48-way
• – 35 cycle latency

• Mobile CPUs typically do NOT have on-chip level 3 cache


Memory System Summary

• It’s important to match the cache characteristics
– caches access one block at a time (usually more than one word)

• with the DRAM characteristics
– use DRAMs that support fast multiple-word accesses, preferably ones that match the block size of the cache

• with the memory-bus characteristics
– make sure the memory bus can support the DRAM access rates and patterns
– with the goal of increasing the Memory-Bus to Cache bandwidth

http://www.cse.psu.edu/research/mdl/mji/mjicourses/431/cse431-18memoryintro.ppt/view

Page 20

Multiple Processors/Multiple Caches

• Even if a write through policy is used, other processors may have invalid data in their caches

• In other words, if a processor updates its cache and updates main memory, a second processor may have been using the same data in its own cache, which is now invalid.

http://csciwww.etsu.edu/tarnoff/ntes4717/week_03/cache.ppt


Solutions to Prevent Problems with Multiprocessor/Cache Systems

• Bus watching with write through – each cache watches the bus to see if data it contains is being written to main memory by another processor. All processors must be using the write-through policy

• Hardware transparency – a “big brother” watches all caches, and upon seeing an update to any processor’s cache, it updates main memory AND all of the caches

• Noncacheable memory – any shared memory (identified with a chip select) may not be cached.

http://csciwww.etsu.edu/tarnoff/ntes4717/week_03/cache.ppt

Page 21

Cache Coherence

Coherence defines the behavior of reads and writes to the same memory location. The coherence of caches is obtained if:

• A read made by a processor P to a location X that follows a write by the same processor P to X, with no writes of X by another processor occurring between the write and the read instructions made by P, always returns the value written by P.

• A read made by a processor P1 to location X that follows a write by another processor P2 to X returns the value written by P2 if no other writes to X made by any processor occur between the two accesses. This condition defines the concept of a coherent view of memory.

• Writes to the same location are sequenced, i.e., if location X received two different values A and B, in this order, from any two processors, the processors can never read location X as B and then read it as A. Location X must be seen with values A and B in that order.

In practice, read and write operations are not instantaneous. The memory consistency model defines when a written value must be seen by a following read instruction made by the other processors.

http://en.wikipedia.org/wiki/Cache_coherence


Cache Consistency

• The memory consistency model provides a formal specification of how the memory system will appear to the programmer, eliminating the gap between the behavior expected by the programmer and the actual behavior supported by a system. The consistency model places restrictions on the values that can be returned by a read in a shared-memory program execution. Intuitively, a read should return the value of the “last” write to the same memory location. In uniprocessors, “last” is precisely defined by program order, i.e., the order in which memory operations appear in the program. This is not the case in multiprocessors.

• There are many consistency models
– Sequential consistency: an intuitive extension of the uniprocessor model to the multiprocessor
– Release consistency
– Weak consistency
– ………..

Reference: Shared Memory Consistency Models: A Tutorial, Sarita V. Adve, Kourosh Gharachorloo, Western Research Laboratory, Report 95/7, http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-7.pdf

Page 22

Shared Memory Processors
SMP (Symmetric Multi-Processors)/UMA (Uniform Memory Access)

• The time to access main memory is the same for all processors

• In a typical SMP architecture, all memory accesses are posted to the same shared memory bus.

• Contention: as more CPUs are added, competition for access to the bus leads to a decline in performance.

• Thus, scalability is limited (to about 32 processors).

http://n1.slideserve.com/PPTFiles/ccNUMA_53663_58924.ppt


Shared Memory Processors
NUMA (Non-Uniform Memory Access)

• Unlike SMPs, all processors are not equally close to all memory locations.

• Memory is physically distributed
– Local memory access is faster than non-local memory access

• A processor’s own internal computations can be done in its local memory, leading to reduced memory contention.

• Designed to surpass the scalability limits of SMPs.

http://n1.slideserve.com/PPTFiles/ccNUMA_53663_58924.ppt

Page 23

Bus-based Shared Memory Organization

The basic picture is simple:

[Diagram: three CPUs, each with a private cache, connected by a shared bus to a shared memory.]

http://www.cs.utexas.edu/~pingali/CS378/2009sp/lectures/mesi.ppt


Organization

• Bus is usually simple physical connection (wires)

• Bus bandwidth limits no. of CPUs

• Could be multiple memory elements

• For now, assume that each CPU has only a single level of cache

http://www.cs.utexas.edu/~pingali/CS378/2009sp/lectures/mesi.ppt

Page 24

Problem of Memory Coherence

• Assume just single level caches and main memory

• Processor writes to location in its cache

• Other caches may hold shared copies - these will be out of date

• Updating main memory alone is not enough

http://www.cs.utexas.edu/~pingali/CS378/2009sp/lectures/mesi.ppt


Example

[Diagram: three CPUs (1, 2, 3), each with a cache, on a shared bus to a shared memory holding X: 24.]

Processor 1 reads X: obtains 24 from memory and caches it
Processor 2 reads X: obtains 24 from memory and caches it
Processor 1 writes 32 to X: its locally cached copy is updated
Processor 3 reads X: what value should it get?

Memory and processor 2 think it is 24
Processor 1 thinks it is 32

Notice that having write-through caches is not good enough

http://www.cs.utexas.edu/~pingali/CS378/2009sp/lectures/mesi.ppt

Page 25

Cache Coherence Protocols

• Snooping: Individual caches monitor address lines for accesses to memory locations that they have cached. When a write operation is observed to a location that a cache has a copy of, the cache controller invalidates its own copy of the snooped memory location

• Directory based coherence: The shared data is placed in a common directory that maintains the coherence between caches. Loads from memory are governed by the directory. When an entry is changed, the directory either updates or invalidates the other caches with that entry

• Snarfing: A cache controller watches both address and data to update its own copy of a memory location when another processor modifies a location in main memory.


Bus Snooping

• A scheme where every CPU knows who has a copy of its cached data is far too complex.

• So each CPU (cache system) ‘snoops’ (i.e. watches continually) for write activity concerning data addresses which it has cached.

• This assumes a bus structure which is ‘global’, i.e., all communication can be seen by all.

• More scalable solution: ‘directory-based’ coherence schemes

http://www.cs.utexas.edu/~pingali/CS378/2009sp/lectures/mesi.ppt

Page 26

Snooping Protocols
• Write Invalidate
– A CPU wanting to write to an address grabs a bus cycle and sends a ‘write invalidate’ message
– All snooping caches invalidate their copy of the appropriate cache line
– The CPU writes to its cached copy (assume for now that it also writes through to memory)
– Any shared read in another CPU will now miss in cache and re-fetch the new data

http://www.cs.utexas.edu/~pingali/CS378/2009sp/lectures/mesi.ppt


Snooping Protocols
• Write Update
– A CPU wanting to write grabs a bus cycle and broadcasts the new data as it updates its own copy
– All snooping caches update their copy

• Note that in both schemes, the problem of simultaneous writes is taken care of by bus arbitration - only one CPU can use the bus at any one time.

http://www.cs.utexas.edu/~pingali/CS378/2009sp/lectures/mesi.ppt


Update or Invalidate?
• Update looks the simplest, most obvious and fastest, but:
– Multiple writes to the same word (no intervening read) need only one invalidate message but would require an update for each
– Writes to the same (usual, multi-word) cache block require only one invalidate but would require multiple updates

http://www.cs.utexas.edu/~pingali/CS378/2009sp/lectures/mesi.ppt


Update or Invalidate?
• Due to both spatial and temporal locality, the previous cases occur often.
• Bus bandwidth is a precious commodity in shared-memory multiprocessors.
• Experience has shown that invalidate protocols use significantly less bandwidth.
• We will consider implementation details only of invalidate.

http://www.cs.utexas.edu/~pingali/CS378/2009sp/lectures/mesi.ppt
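The two cases above can be captured in a toy message-count model (the function and its event names are hypothetical, for illustration only):

```python
def bus_messages(writes_to_block, protocol):
    """Bus transactions caused by a run of writes to one cache block
    with no intervening reads from other cores (illustrative model).

    Write-invalidate: only the first write must broadcast; the line is
    then held exclusively, so the later writes are silent.
    Write-update: every write broadcasts the new data on the bus.
    """
    if writes_to_block == 0:
        return 0
    if protocol == "invalidate":
        return 1
    if protocol == "update":
        return writes_to_block
    raise ValueError(protocol)
```

For a run of 8 writes, invalidate costs 1 bus transaction against update's 8, which is the bandwidth argument the slide makes.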


Implementation Issues
• In both schemes, knowing when a cached value is not shared (i.e., no copy exists in another cache) can avoid sending any messages.
• The invalidate description assumed that a cache value update was written through to memory. If we used a ‘copy back’ scheme, other processors could re-fetch the old value on a cache miss.
• We need a protocol to handle all this.

http://www.cs.utexas.edu/~pingali/CS378/2009sp/lectures/mesi.ppt


MESI Protocol (1)
• A practical multiprocessor invalidate protocol which attempts to minimize bus usage.
• Allows use of a ‘write-back’ scheme - i.e., main memory is not updated until a ‘dirty’ cache line is displaced.
• An extension of the usual cache tags, i.e., the invalid tag and ‘dirty’ tag of a normal write-back cache.

http://www.cs.utexas.edu/~pingali/CS378/2009sp/lectures/mesi.ppt


MESI Protocol (2)

Any cache line can be in one of 4 states (2 bits):
• Modified - the cache line has been modified, is different from main memory, and is the only cached copy (multiprocessor ‘dirty’)
• Exclusive - the cache line is the same as main memory and is the only cached copy
• Shared - same as main memory, but copies may exist in other caches
• Invalid - the line data is not valid (as in a simple cache)

http://www.cs.utexas.edu/~pingali/CS378/2009sp/lectures/mesi.ppt


MESI Protocol (3)

• A cache line changes state as a function of memory access events.
• An event may be either:
– due to local processor activity (i.e., cache access)
– due to bus activity, as a result of snooping
• A cache line has its own state affected only if the address matches.

http://www.cs.utexas.edu/~pingali/CS378/2009sp/lectures/mesi.ppt


MESI Protocol (4)
• Operation can be described informally by looking at the action in the local processor:
– Read Hit
– Read Miss
– Write Hit
– Write Miss
• More formally, by a state transition diagram

http://www.cs.utexas.edu/~pingali/CS378/2009sp/lectures/mesi.ppt


MESI Local Read Hit
• The line must be in one of M, E, S
• This must be the correct local value (if M, it must have been modified locally)
• Simply return the value
• No state change

http://www.cs.utexas.edu/~pingali/CS378/2009sp/lectures/mesi.ppt


MESI Local Read Miss (1)
• No other copy in caches
– Processor makes a bus request to memory
– Value read into the local cache, marked E
• One cache has an E copy
– Processor makes a bus request to memory
– The snooping cache puts the copied value on the bus
– The memory access is abandoned
– The local processor caches the value
– Both lines are set to S

http://www.cs.utexas.edu/~pingali/CS378/2009sp/lectures/mesi.ppt


MESI Local Read Miss (2)
• Several caches have an S copy
– Processor makes a bus request to memory
– One cache puts the copied value on the bus (arbitrated)
– The memory access is abandoned
– The local processor caches the value
– The local copy is set to S
– The other copies remain S

http://www.cs.utexas.edu/~pingali/CS378/2009sp/lectures/mesi.ppt


MESI Local Read Miss (3)
• One cache has an M copy
– Processor makes a bus request to memory
– The snooping cache puts the copied value on the bus
– The memory access is abandoned
– The local processor caches the value
– The local copy is tagged S
– The source (M) value is copied back to memory
– The source value goes M -> S

http://www.cs.utexas.edu/~pingali/CS378/2009sp/lectures/mesi.ppt


MESI Local Write Hit (1)
The line must be in one of M, E, S
• M
– the line is exclusive and already ‘dirty’
– update the local cache value
– no state change
• E
– update the local cache value
– state E -> M

http://www.cs.utexas.edu/~pingali/CS378/2009sp/lectures/mesi.ppt


MESI Local Write Hit (2)
• S
– The processor broadcasts an invalidate on the bus
– Snooping processors with an S copy change S -> I
– The local cache value is updated
– Local state change S -> M

http://www.cs.utexas.edu/~pingali/CS378/2009sp/lectures/mesi.ppt


MESI Local Write Miss (1)
Detailed action depends on copies in other processors

• No other copies
– Value read from memory into the local cache
– Value updated
– Local copy state set to M

http://www.cs.utexas.edu/~pingali/CS378/2009sp/lectures/mesi.ppt


MESI Local Write Miss (2)
• Other copies, either one in state E or more in state S
– Value read from memory into the local cache - the bus transaction is marked RWITM (read with intent to modify)
– Snooping processors see this and set their copy state to I
– The local copy is updated & its state set to M

http://www.cs.utexas.edu/~pingali/CS378/2009sp/lectures/mesi.ppt


MESI Local Write Miss (3)
Another copy in state M
• The processor issues a bus transaction marked RWITM
• The snooping processor sees this and
– blocks the RWITM request
– takes control of the bus
– writes back its copy to memory
– sets its copy state to I

http://www.cs.utexas.edu/~pingali/CS378/2009sp/lectures/mesi.ppt


MESI Local Write Miss (4)
Another copy in state M (continued)
• The original local processor re-issues the RWITM request
• It is now the simple no-copy case
– Value read from memory into the local cache
– Local copy value updated
– Local copy state set to M

http://www.cs.utexas.edu/~pingali/CS378/2009sp/lectures/mesi.ppt


Putting it all together
• All of this information can be described compactly using a state transition diagram
• The diagram shows what happens to a cache line in a processor as a result of
– memory accesses made by that processor (read hit/miss, write hit/miss)
– memory accesses made by other processors that result in bus transactions observed by this snoopy cache (Mem Read, RWITM, Invalidate)

http://www.cs.utexas.edu/~pingali/CS378/2009sp/lectures/mesi.ppt
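As a cross-check of the hit/miss cases above, the locally and remotely initiated transitions can be transcribed into a small lookup table; the state letters match the slides, while the event names are this sketch's own encoding:

```python
# MESI next-state function for a single cache line, transcribed from
# the transitions described in the preceding slides.

LOCAL = {
    ("M", "read_hit"):  "M",
    ("E", "read_hit"):  "E",
    ("S", "read_hit"):  "S",
    ("I", "read_miss_shared"):    "S",   # another cache holds the line
    ("I", "read_miss_exclusive"): "E",   # no other cached copy
    ("M", "write_hit"): "M",
    ("E", "write_hit"): "M",
    ("S", "write_hit"): "M",             # broadcasts Invalidate on the bus
    ("I", "write_miss"): "M",            # issues RWITM on the bus
}

REMOTE = {                               # snooped bus transactions
    ("M", "mem_read"): "S",              # supplies data, copies back
    ("E", "mem_read"): "S",
    ("S", "mem_read"): "S",
    ("M", "rwitm"): "I",                 # copies back, then invalidates
    ("E", "rwitm"): "I",
    ("S", "rwitm"): "I",
    ("S", "invalidate"): "I",
}

def next_state(state, event, remote=False):
    """Return the line's next state; unlisted events leave it unchanged."""
    table = REMOTE if remote else LOCAL
    return table.get((state, event), state)
```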


MESI – locally initiated accesses

[State transition diagram over the states Invalid, Modified, Exclusive and Shared. Read hits leave M, E and S unchanged; a read miss takes I to S (sh: another cache holds the line) or E (ex: no other copy), each via a Mem Read bus transaction; write hits take E and S to M, the S -> M transition issuing an Invalidate bus transaction; a write miss takes I to M via an RWITM bus transaction.]

http://www.cs.utexas.edu/~pingali/CS378/2009sp/lectures/mesi.ppt


MESI – remotely initiated accesses

[State transition diagram: a snooped Mem Read takes M, E and S to Shared (with a copy back from M); a snooped Invalidate takes S to Invalid; a snooped RWITM takes M (with a copy back), E and S to Invalid.]

http://www.cs.utexas.edu/~pingali/CS378/2009sp/lectures/mesi.ppt


MESI notes
• There are minor variations (particularly to do with write miss)
• The normal ‘write back’ when a cache line is evicted is done if the line state is M
• Multi-level caches
– If the caches are inclusive, only the lowest-level cache needs to snoop on the bus

http://www.cs.utexas.edu/~pingali/CS378/2009sp/lectures/mesi.ppt


Directory Schemes

• Snoopy schemes do not scale because they rely on broadcast
• Directory-based schemes allow scaling:
– they avoid broadcasts by keeping track of all PEs caching a memory block, and then using point-to-point messages to maintain coherence
– they allow the flexibility to use any scalable point-to-point network

http://www.cs.utexas.edu/~pingali/CS378/2009sp/lectures/mesi.ppt


Basic Scheme (Censier & Feautrier)

• Assume "k" processors.
• With each cache block in memory: k presence bits and 1 dirty bit
• With each cache block in a cache: 1 valid bit and 1 dirty (owner) bit

– Read from main memory by PE-i:
• If the dirty-bit is OFF then { read from main memory; turn p[i] ON; }
• If the dirty-bit is ON then { recall the line from the dirty PE (cache state to shared); update memory; turn dirty-bit OFF; turn p[i] ON; supply the recalled data to PE-i; }
– Write to main memory:
• If the dirty-bit is OFF then { send invalidations to all PEs caching that block; turn dirty-bit ON; turn P[i] ON; ... }
• ...

[Figure: processors with caches connected through an interconnection network to memory plus a directory holding the presence bits and dirty bit for each block.]

http://www.cs.utexas.edu/~pingali/CS378/2009sp/lectures/mesi.ppt
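The read and write actions above can be sketched as a per-block directory entry with k presence bits and a dirty bit; the class name and the returned action tuples are illustrative, not the original notation:

```python
# Sketch of a Censier & Feautrier directory entry: k presence bits
# plus one dirty bit per memory block.

class DirectoryEntry:
    def __init__(self, k):
        self.present = [False] * k   # one presence bit per PE
        self.dirty = False

    def read(self, i):
        """PE-i reads the block; returns the coherence actions needed."""
        actions = []
        if self.dirty:
            owner = self.present.index(True)
            # recall the line from the dirty PE and update memory
            actions.append(("recall_and_update_memory", owner))
            self.dirty = False
        self.present[i] = True
        return actions

    def write(self, i):
        """PE-i writes the block; invalidate all other cached copies."""
        actions = [("invalidate", j)
                   for j, p in enumerate(self.present) if p and j != i]
        self.present = [False] * len(self.present)
        self.present[i] = True
        self.dirty = True
        return actions
```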


Directory based coherence

http://www.clear.rice.edu/comp422/lecture-notes/comp422-2012-Lecture10-CoherenceSynchronization.pdf


Key Issues
• Scaling of memory and directory bandwidth
– Cannot have main memory or directory memory centralized
– Need a distributed memory and directory structure
• Directory memory requirements do not scale well
– The number of presence bits grows with the number of PEs
– Many ways to get around this problem
• limited pointer schemes of many flavors
• Industry standard
– SCI: Scalable Coherent Interface

http://www.cs.utexas.edu/~pingali/CS378/2009sp/lectures/mesi.ppt
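The scaling problem with presence bits can be made concrete with a small calculation; the block size and the one-bit-per-PE full-vector layout are assumptions of this sketch:

```python
def directory_overhead(num_pes, block_bytes=64):
    """Fraction of memory consumed by a full-bit-vector directory:
    k presence bits + 1 dirty bit per block (illustrative model)."""
    entry_bits = num_pes + 1
    return entry_bits / (block_bytes * 8)
```

For 64 PEs and 64-byte blocks this is already ~13% of main memory, and the fraction grows linearly with the number of PEs, which is why limited-pointer schemes exist.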


Cache Coherence Protocols in Use

• AMD Istanbul, Magny-Cours, Interlagos: MOESI snoop protocol (MOESI = MESI + Owned)
• ARM Cortex-A9: MESI snoop protocol
• Intel Westmere: MESIF snoop protocol (MESIF = MESI + Forward)
• Intel Itanium 2: MESI snoop protocol
• IBM Power7: snoop protocol with “speculative limited-scope coherence broadcast”
• IBM Blue Gene/P (PowerPC): snoop protocol


MOESI
MOESI addresses a bandwidth problem of the MESI protocol that arises when a processor with invalid data in its cache wants to access data modified by another processor. In MESI, the processor seeking access has to wait for the processor that modified the data to write it back to main memory, which takes time and bandwidth. MOESI removes this drawback by allowing dirty sharing: when data is held by a processor in the new “Owned” state, that processor can provide other processors the modified data without, or even before, writing it to main memory. The processor holding the data in the “Owned” state remains responsible for updating main memory later, when the cache line is evicted.


MOESI
• Modified (M): The most recent copy of the data is present in the cache line, and it is not present in any other processor's cache.
• Owned (O): The cache line has the most recent correct copy of the data, which may be shared by other processors. The processor holding the line in this state is responsible for updating the correct value in main memory before the line is evicted.
• Exclusive (E): The cache line holds the most recent, correct copy of the data, which is exclusively present on this processor; a copy is present in main memory.
• Shared (S): A cache line in the shared state holds the most recent, correct copy of the data, which may be shared by other processors.
• Invalid (I): The cache line does not hold a valid copy of the data.

A detailed explanation of this protocol's implementation on AMD processors can be found in the manual “Architecture of the AMD 64-bit core”: http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html
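The dirty-sharing behavior described above can be sketched as a transition table for a snooped remote read; the tuple encoding (our new state, the requester's state, whether memory is written) is this sketch's own:

```python
# MOESI response to another core reading a line we hold. Unlike MESI,
# a read of a Modified line does NOT force a write-back: the holder
# moves to Owned and supplies the dirty data cache-to-cache.

def moesi_remote_read(holder_state):
    """Return (holder's new state, requester's state, memory_written)."""
    transitions = {
        "M": ("O", "S", False),  # dirty sharing: memory stays stale
        "O": ("O", "S", False),  # the Owner supplies the data again
        "E": ("S", "S", False),
        "S": ("S", "S", False),
    }
    return transitions[holder_state]
```

In plain MESI the M case would instead require a copy back to memory before the requester could proceed; deferring that write-back is exactly the bandwidth saving the O state buys.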


MOESI Protocol State Diagram

http://support.amd.com/us/Processor_TechDocs/24593_APM_v2.pdf


MESIF
• MESIF: Modified, Exclusive, Shared, Invalid and Forward

• The M, E, S, and I states are the same as in the MESI protocol

• The Forward (F) state designates the cache line from which copies are made when there is more than one cached copy, i.e., one of the shared copies is in the F state with all others in the S state.
– The cache line in the F state responds to requests for a copy of the cache line
– When a copy is made, the cache receiving the line marks its copy as being in the F state, and the cache issuing the copy marks its line as being in the S or I state, depending on whether a read or a write operation caused the request for the copy.

http://en.wikipedia.org/wiki/MESIF_protocol
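A minimal sketch of the F-state hand-off described above (the cache identifiers, dict encoding and function name are hypothetical; the other sharers' states are left untouched for simplicity):

```python
# MESIF Forward-state migration: among the sharers, exactly one holds F;
# on a copy request the F holder supplies the line and F moves to the
# newest sharer, per the hand-off described in the slide.

def mesif_copy(holders, requester, caused_by_write=False):
    """holders: dict cache_id -> state, with at most one 'F' entry.
    Returns the updated dict, now including the requester."""
    new = dict(holders)
    for cid, state in holders.items():
        if state == "F":
            # the supplier drops to S on a read, I on a write-intent request
            new[cid] = "I" if caused_by_write else "S"
    new[requester] = "F"   # the newest copy becomes the forwarder
    return new
```

Keeping F on the most recent sharer means the responder is likely topologically close to future requesters, and it guarantees only one cache answers a snoop even with many sharers.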


http://www.hotchips.org/archives/hc21/2_mon/HC21.24.100.ServerSystemsI-Epub/HC21.24.110.Conway-AMD-Magny-Cours.pdf


Intel Sandy Bridge

• This line of CPUs uses home snooping, in which
1) the requesting processor sends a request to the home agent,
2) the home agent sends a snoop broadcast to the caching agents in the system and possibly begins reading the cache line from memory,
3) the home node and/or any caching agents send data to the original requester.
• In home snooping the coherency management clearly resides with the home agent.
• Source snooping is lower latency, especially when the requested cache line is held in remote memory and a remote cache. This is most common for workloads that have no NUMA awareness. The benefits are greater if accessing data in a cache is substantially faster than memory.
• Home snooping is a more natural fit for inter-socket snoop filtering and directories. After receiving a request, the home agent probes the directory (or snoop filter) and only sends snoop requests to the caching agents that have a copy of the data. Home snoop protocols using directories tend to scale better, because snoops are only sent to caching agents that hold the requested data and thus consume less bandwidth across QPI.
• Intel's studies for the first generation of QPI showed that source snooping was generally faster for 1-2 socket systems and equally fast for 4-socket systems, while home snooping was better for anything larger.
• However, changes such as more cores, greater integration and better NUMA support have altered the playing field. With each additional core in a system, the amount of snoop traffic grows, since each core has its own steady stream of cache misses. This places a greater burden on the QPI links and the last-level caches, which act as snoop filters. Directories are also becoming more relevant, since they are used to avoid probing I/O devices, and future server products will have integrated I/O.
• Based on these changes, subsequent studies showed that there was no longer an advantage for source snooping in 2-socket systems. As a result, QPI 1.1 is solely a home-snooped protocol, and there is no longer any support for source snooping. An additional benefit is that Itanium can fully leverage the x86 ecosystem, since the two product lines now share the same home-snooped coherency protocol.

http://www.realworldtech.com/page.cfm?ArticleID=RWT072011132706&p=3

Page 44: Introduction to HPC Architecture - Memory Lecture 6johnsson/COSC6365_2014/Lecture06_Memory_S.pdf · Introduction to HPC Architecture - Memory Lecture 6 ... Intel Sandy ... AMD Bulldozer

44

COSC 6365Lecture 6,2011-02-02

Lennart Johnsson

2014-02-04COSC6365

Home Snoop Protocol

Introduction to Intel QuickPath Interconnect in Weaving High Performance Multiprocessor Fabric, Robert A. Maddox, Gurbir Singh, and Robert J. Safranek, Intel Press


Source Snoop Protocol

Introduction to Intel QuickPath Interconnect in Weaving High Performance Multiprocessor Fabric, Robert A. Maddox, Gurbir Singh, and Robert J. Safranek, Intel Press

Page 45: Introduction to HPC Architecture - Memory Lecture 6johnsson/COSC6365_2014/Lecture06_Memory_S.pdf · Introduction to HPC Architecture - Memory Lecture 6 ... Intel Sandy ... AMD Bulldozer

45

COSC 6365Lecture 6,2011-02-02

Lennart Johnsson

2014-02-04COSC6365

Cache Coherency – Task Migration

In SMPs, which process runs on which core is controlled by the operating system. In practice, unless explicit system calls bind a task to a specific core (this is known as CPU affinity), the likelihood is that the task will at some point migrate to a different core, along with its data as it is used.
In a literal implementation of the MESI cache coherence protocol, it is quite inefficient for a migrated task to access memory locations that are stored in the L1 (write-back) cache of another core. First the original core needs to invalidate and clean the relevant cache lines out to the next level of the memory architecture. Only once the data is available at a shared level of the memory architecture (e.g., L2 or main memory) can it be loaded into the new core.

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0228a/index.html


MESI on ARM Cortex-A9
• Direct Data Intervention (DDI): The Snoop Control Unit (SCU) keeps a copy of all cores' cache tag RAMs. This enables it to efficiently detect whether a cache line requested by a core is present in another core in the coherency domain, before looking for it in the next level of the memory hierarchy.
• Cache-to-cache Migration: If the SCU finds that the cache line requested by one CPU is present in another core, it will either copy it (if clean) or move it (if dirty) from the other CPU directly into the requesting one, without interacting with external memory.

These optimizations enhance performance and reduce memory traffic in and out of the L1 cache subsystem, in turn reducing the overall load on the interconnect and reducing power consumption by eliminating interaction with the external memories. DDI and cache-to-cache transfers particularly benefit an SMP OS, where tasks and data can migrate between cores.

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0228a/index.html

Page 46: Introduction to HPC Architecture - Memory Lecture 6johnsson/COSC6365_2014/Lecture06_Memory_S.pdf · Introduction to HPC Architecture - Memory Lecture 6 ... Intel Sandy ... AMD Bulldozer

46

COSC 6365Lecture 6,2011-02-02

Lennart Johnsson

2014-02-04COSC6365

http://community.anitaborg.org/wiki/images/9/92/GHC07-BlueGene_salapura.pdf

Blue Gene


Directory Based Shared Memory

SGI UV: 32-2048 cores; cache-coherent single system image

• Coherence
– directory-based coherence
• each 128B cache line has an entry in a directory
• directories are distributed among the compute/memory blade nodes, like the data homes
• directory size = 1/16 of main memory
• line states in a directory
– unowned: when a line is not cached
– exclusive: when only one processor has a copy
– shared: when more than one processor has a copy
• a bit vector indicates which caches may contain a copy
– invalidation-based protocol: a write invalidates copies & acquires exclusive ownership

http://www.clear.rice.edu/comp422/lecture-notes/comp422-2012-Lecture10-CoherenceSynchronization.pdf
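A quick sanity check of the numbers above: one directory entry per 128-byte line, with the directory occupying 1/16 of main memory, implies an 8-byte (64-bit) entry per line, which leaves room for the state and the sharer bit vector.

```python
# Directory entry size implied by the SGI UV figures quoted above.
LINE_BYTES = 128
DIRECTORY_FRACTION = 1 / 16          # directory size relative to memory

entry_bytes = LINE_BYTES * DIRECTORY_FRACTION   # directory bytes per line
```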


SGI UV node

http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?coll=hdwr&db=bks&fname=/SGI_EndUser/AltixUV1K_UG/ch03.html


NumaConnect

Shared Everything - One Single Operating System Image

[Block diagram: multiple nodes, each with CPUs, caches, I/O and memory, each attached through a NumaChip with its NumaCache to the NumaConnect fabric (no switch required).]


NumaConnect

[Block diagram of the NumaChip: a ccHT cave HyperTransport interface, SPI init module, coherent shared-memory controller (SCC/H2S) with microcode and CSRs, and a crossbar switch with link controllers and SERDES for the six fabric links (XA, XB, YA, YB, ZA, ZB); off-chip DRAM holds the remote cache and the remote-cache/local-memory tags, with SRAM for fast tags, on a board with voltage regulators, switch-fabric connectors and a HyperTransport connector.]

• Up to 4096 nodes (12 address bits)
• Up to 256 TBytes of memory (48 address bits)
• 6.4 GB/s HyperTransport
• 9.6 GB/s per link per direction, 6 links, switch fabric

www.numascale.com


Response to question on Prefetching


Example: Intel Xeon Phi
• Software prefetching involves static identification of memory accesses in a program and insertion of prefetch instructions for these memory accesses, such that the prefetched data is readily available in the on-chip caches when the memory access executes.
• The Intel Xeon Phi coprocessor employs a 16-stream hardware prefetcher that is enabled by default when the system starts. It observes L2 cache misses and, upon detection of a cache-miss pattern, starts issuing prefetch requests to memory.
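A toy model of such a stream prefetcher: it watches miss addresses and, after a short run of consecutive-line misses, issues prefetches ahead of the stream. The training threshold and prefetch depth here are invented for illustration and do not reflect the actual hardware parameters.

```python
# Toy model of a hardware stream prefetcher that trains on L2 misses.

class StreamPrefetcher:
    def __init__(self, line=64, train=2, depth=4):
        self.line = line        # cache line size in bytes
        self.train = train      # consecutive-line misses needed to train
        self.depth = depth      # lines prefetched ahead once trained
        self.last_miss = None   # line address of the previous miss
        self.run = 0            # length of the current sequential run

    def on_miss(self, addr):
        """Feed one miss address; return the addresses to prefetch."""
        line_addr = addr // self.line
        if self.last_miss is not None and line_addr == self.last_miss + 1:
            self.run += 1
        else:
            self.run = 0        # pattern broken: retrain
        self.last_miss = line_addr
        if self.run >= self.train:   # stream detected: run ahead of it
            return [(line_addr + i) * self.line
                    for i in range(1, self.depth + 1)]
        return []
```

A real prefetcher tracks many such streams at once (16 on Xeon Phi) and throttles against memory bandwidth, but the train-then-run-ahead structure is the same.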


STREAM Benchmark


Prefetching and Intel Xeon Phi

Compiler-Based Data Prefetching and Streaming Non-temporal Store Generation for the Intel(R) Xeon Phi(TM) Coprocessor, Krishnaiyer, R., Kultursay, E., Chawla, P., Preis, S., Zvezdin, A., Saito, H., http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6651054&tag=1


Intel Xeon Phi: Non-Temporal stores

Compiler-Based Data Prefetching and Streaming Non-temporal Store Generation for the Intel(R) Xeon Phi(TM) Coprocessor, Krishnaiyer, R., Kultursay, E., Chawla, P., Preis, S., Zvezdin, A., Saito, H., http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6651054&tag=1

NR = No-Read
NGO = No Global Ordering
CLE = Cache Line Eviction


References
• CPS104, Computer Organization and Programming, Lecture 13: The Memory System, Robert Wagner, http://www.cs.duke.edu/~raw/cps104/Lectures/L13Mem.pdf
• CSE 431 Computer Architecture, Fall 2005, Lecture 18: Memory Hierarchy Review, Mary Jane Irwin, http://www.cse.psu.edu/research/mdl/mji/mjicourses/431/cse431-18memoryintro.ppt/view
• Computer Architecture Tutorial, Gurpur M. Prabhu, http://www.cs.iastate.edu/~prabhu/Tutorial/CACHE/ex18.html
• CSCI 4717/5717 Computer Architecture, Topic: Cache Memory, http://csciwww.etsu.edu/tarnoff/ntes4717/week_03/cache.ppt
• CS 152 Computer Architecture and Engineering, Lecture 7: Memory Hierarchy-II, Krste Asanovic, Electrical Engineering and Computer Sciences, University of California at Berkeley, Spring 2010, http://inst.eecs.berkeley.edu/~cs152/sp10/lectures/L07-MemoryII.pdf
• COMP 422 Parallel Computing Platforms, John Mellor-Crummey, Rice University, Lecture 10, February 2012, http://www.clear.rice.edu/comp422/lecture-notes/comp422-2012-Lecture10-CoherenceSynchronization.pdf
• MESIF protocol, http://en.wikipedia.org/wiki/MESIF_protocol
• Of GigaHertz and CPW, Version 2, Mark Funk, Robert Gagliardi, Allan Johnson, Rick Peterson, January 30, 2010, http://www03.ibm.com/systems/resources/pwrsysperf_OfGigaHertzandCPWs.pdf
• Mainline Functional Verification of IBM's POWER7 Processor Core, John Ludden, IBM, http://www.tandvsolns.co.uk/DVClub/1_Nov_2010/Ludden_Power7_Verification.pdf
• Two billion-transistor beasts: Power7 and Niagara 3, John Stokes, 2011, http://arstechnica.com/business/news/2010/02/two-billion-transistor-beasts-power7-and-niagara-3.ars
• High Performance Computing with POWER7, Luigi Brochard, IBM, http://www05.ibm.com/fr/events/hpc_summit_2010/P7_HPC_Summit_060410.pdf


References (cont’d)• Power7 Processors: The Beat Goes on, Joel M Tendler,

http://www.ibm.com/developerworks/wikis/download/attachments/104533501/POWER7+-+The+Beat+Goes+On.pdf

• Intel Itanium Processor 9300, Series Reference Manual for Software Development and Optimization, Intel, March 2010, http://www.intel.com/Assets/PDF/manual/323602.pdf

• Intel Tukwila: The Power of 30 MB Cache, http://xtreview.com/addcomment-id-4631-view-Intel-tukwila-the-power-of-30-Mb-cache.html

• Intel Gulftown die shot, specs revealed, Ryan Shrout, February 3, 2010, PC Perspective, http://www.pcper.com/comments.php?nid=8348

• Intel Xeon Processors 5600 Series, http://www.intel.com/content/dam/www/public/us/en/documents/product-briefs/xeon-5600-brief.pdf

• Introduction to Intel QuickPath Interconnect in Weaving High Performance Multiprocessor Fabric, Robert A. Maddox, Gurbir Singh, and Robert J. Safranek, Intel Press 2009

• AMD’s 12-core Magny-Cours Opteron 6174 vs Intel’s 6-core Xeon, Johan De Gelas, March 29, 2010, http://www.anandtech.com/show/2978/amd-s-12-core-magny-cours-opteron-6174-vs-intel-s-6-core-xeon

• Intel Nehalem Westmere, http://code.google.com/p/likwid-topology/wiki/Intel_Nehalem_Westmere

• Venom GPU System with AMD's Opteron™ (aka Istanbul) http://www.3dprofessor.org/Reviews%20Folder%20Pages/Istanbul/SMISP1.htm

• Using the Dawn BG/P System, Blaise Barney, https://computing.llnl.gov/tutorials/bgp

• Blue Gene/P Architecture, Vitali Morozov, http://workshops.alcf.anl.gov/gs10/files/2010/01/Morozov-BlueGeneP-Architecture.pdf



References (cont’d)

• Intel Atom D510 – Processor Information and Comparisons, http://www.diffen.com/difference/Special:Information/Intel_Atom_D510

• A Sub-1W to 2W Low-Power IA Processor for Mobile Internet Devices and Ultra-Mobile PCs in 45nm Hi-κ Metal Gate CMOS, Gianfranco Gerosa, Steve Curtis, Mike D’Addeo, Bo Jiang, Belliappa Kuttanna, Feroze Merchant, Binta Patel, Mohammed Taufique, Haytham Samarchi, ISSCC 2008, Session 13.1, Mobile Processing, http://download.intel.com/pressroom/kits/isscc/ISSC_Intel_Paper_Silverthorne.pdf

• ARM Low Power Leadership, CMP Conference, Eric Lalardie, January 28th 2010, http://cmp.imag.fr/aboutus/slides/Slides2010/14_ARM_lalardie_2009.pdf

• Matrix Multiplication, http://egret.psychol.cam.ac.uk/statistics/local_copies_of_sources_Cardinal_and_Aitken_ANOVA/Matrix_multiplication.htm

• Cache-Oblivious Programming, Keshav Pingali, 2007, http://www.cs.utexas.edu/~pingali/CS378/2011sp/lectures/co.ppt

• Intel’s Sandy Bridge Microarchitecture, http://www.realworldtech.com/sandy-bridge

• SGI Altix UV 1000 System User's Guide, http://techpubs.sgi.com/library/tpl/cgi-bin/download.cgi?coll=hdwr&db=bks&docnumber=007-5663-003