MEMORY HIERARCHY: WAYS TO REDUCE MISSES
Goal: Illusion of large, fast, cheap memory
Fact: Large memories are slow, fast memories are small
How do we create a memory that is large, cheap and fast (most of the time)?
Hierarchy of Levels: use smaller and faster memory technologies close to the processor. Fast access time in the highest level of the hierarchy; cheap, slow memory furthest from the processor.
The aim of memory hierarchy design is to have access time close to the highest level and size equal to the lowest level.
Recap: Memory Hierarchy Pyramid
[Figure: pyramid with the Processor (CPU) at the top and Levels 1, 2, 3, ..., n below, connected by a transfer datapath (bus). Moving down the pyramid, the size of memory at each level increases, distance from the CPU increases, cost/MB decreases, and access time (memory latency) increases.]
Memory Hierarchy: Terminology
Hit: data appears in level X. Hit Rate: the fraction of memory accesses found in the upper level.
Miss: data must be retrieved from a block in the lower level (Block Y). Miss Rate = 1 - (Hit Rate).
Hit Time: time to access the upper level = time to determine hit/miss + memory access time.
Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor.
Note: Hit Time << Miss Penalty
Current Memory Hierarchy
[Figure: processor (control, datapath, registers) connected to the L1 cache, L2 cache, main memory, and secondary memory.]

              Regs     L1 cache  L2 cache  Main Memory  Secondary
Speed (ns):   0.5      2         6         100          10,000,000
Size (MB):    0.0005   0.05      1-4       100-1000     100,000
Cost ($/MB):  --       $100      $30       $1           $0.05
Technology:   Regs     SRAM      SRAM      DRAM         Disk
Memory Hierarchy: Why Does it Work? Locality!
Temporal Locality (Locality in Time): keep the most recently accessed data items closer to the processor.
Spatial Locality (Locality in Space): move blocks consisting of contiguous words to the upper levels.
[Figure: Block X transferred between the processor and upper-level memory, Block Y between the upper- and lower-level memories; plot of probability of reference vs. address space (0 to 2^n - 1), illustrating locality.]
Memory Hierarchy Technology
Random Access: "random" is good: access time is the same for all locations.
DRAM: Dynamic Random Access Memory. High density, low power, cheap, slow. Dynamic: needs to be "refreshed" regularly.
SRAM: Static Random Access Memory. Low density, high power, expensive, fast. Static: content lasts "forever" (until power is lost).
"Not-so-random" Access Technology: access time varies from location to location and from time to time. Examples: disk, CD-ROM.
Sequential Access Technology: access time is linear in location (e.g., tape).
We will concentrate on random access technology. Main memory: DRAM; caches: SRAM.
Introduction to Caches
A cache:
is a small, very fast memory (SRAM, expensive)
contains copies of the most recently accessed memory locations (data and instructions): temporal locality
is fully managed by hardware (unlike virtual memory)
storage is organized in blocks of contiguous memory locations: spatial locality
the unit of transfer to/from main memory (or L2) is the cache block
General structure: n blocks per cache, organized in s sets, b bytes per block; total cache size n*b bytes.
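The size arithmetic above can be sketched in code; this is an illustrative helper (the function name and parameters are my own, not from the slides):

```python
import math

def cache_geometry(total_bytes, block_bytes, num_sets):
    """Derive block count, associativity, and address-field widths
    from total size (n*b bytes), block size b, and set count s."""
    n_blocks = total_bytes // block_bytes      # n = total / b
    assoc = n_blocks // num_sets               # blocks per set (ways)
    offset_bits = int(math.log2(block_bytes))  # log2(b) byte-offset bits
    index_bits = int(math.log2(num_sets))      # log2(s) index bits
    return n_blocks, assoc, offset_bits, index_bits

# 64 KB cache, 16-byte blocks, 4096 sets -> 4096 blocks, direct mapped
print(cache_geometry(64 * 1024, 16, 4096))  # (4096, 1, 4, 12)
```

For the 64 KB direct-mapped cache that appears later in these slides, this yields a 4-bit offset and a 12-bit index, matching the address breakdown shown there.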
Caches
For each block:
an address tag: unique identifier
state bits: (in)valid, modified
the data: b bytes
Basic cache operation: every memory access is first presented to the cache.
Hit: the word being accessed is in the cache; it is returned to the CPU.
Miss: the word is not in the cache. A whole block is fetched from memory (L2), an "old" block is evicted from the cache (kicked out) -- which one? -- the new block is stored in the cache, and the requested word is sent to the CPU.
Cache Organization
(1) How do you know if something is in the cache?
(2) If it is in the cache, how do you find it?
The answers to (1) and (2) depend on the type, or organization, of the cache.
In a direct-mapped cache, each memory address is associated with one possible block within the cache. Therefore, we only need to look in a single location in the cache for the data, if it exists in the cache.
Simplest Cache: Direct Mapped
index determines the block in the cache: index = (address) mod (# blocks)
If the number of cache blocks is a power of 2, then the cache index is just the lower n bits of the memory address [n = log2(# blocks)]
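As a minimal sketch of this index/tag split (the function name is illustrative, not from the slides):

```python
def dm_index_tag(address, num_blocks, block_bytes):
    """Split a byte address for a direct-mapped cache:
    index = (block address) mod (# blocks); tag = remaining upper bits.
    Assumes num_blocks and block_bytes are powers of 2."""
    block_addr = address // block_bytes
    index = block_addr % num_blocks   # lower log2(num_blocks) bits
    tag = block_addr // num_blocks    # upper bits
    return tag, index

# 4-block cache, 4-byte blocks: addresses 8 and 72 share index 2
print(dm_index_tag(8, 4, 4))   # (0, 2)
print(dm_index_tag(72, 4, 4))  # (4, 2) -> same index, different tag: a conflict
```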
[Figure: 4-block direct-mapped cache. Main memory block addresses 0-15 map to cache indexes 0-3; e.g., memory blocks 0010, 0110, 1010, and 1110 all map to cache index 10. The memory block address splits into tag and index fields.]
Issues with Direct-Mapped
If block size > 1, the rightmost bits of the address are really the offset within the indexed block:
tag (to check if we have the correct block) | index (to select the block) | byte offset (to select within the block)
[Figure: 64 KB cache with 4-word (16-byte) blocks. Address bits 31..16 form the 16-bit tag, bits 15..4 the 12-bit index (4K entries), bits 3..2 the 2-bit block offset selecting one of four 32-bit words via a mux, and bits 1..0 the byte offset. Each entry holds a valid bit, a 16-bit tag, and 128 bits of data; a tag match produces the Hit signal and the selected Data word.]
Direct-mapped Cache Contd.
The direct-mapped cache is simple to design and its access time is fast (why?). Good for an L1 (on-chip) cache.
Problem: conflict misses can cause a low hit ratio. Conflict misses are misses caused by accessing different memory locations that are mapped to the same cache index.
In a direct-mapped cache there is no flexibility in where a memory block can be placed in the cache, contributing to conflict misses.
Another Extreme: Fully Associative
Fully Associative Cache (8-word block): omit the cache index; place an item in any block! Compare all cache tags in parallel.
By definition: Conflict Misses = 0 for a fully associative cache.
[Figure: fully associative cache with 32-byte blocks. The 27-bit cache tag (address bits 31..5) is compared in parallel against every block's tag (one comparator per block, each with a valid bit); the byte offset (bits 4..0) selects among bytes B0..B31 of the cache data block.]
Fully Associative Cache
Must search all tags in the cache, as an item can be in any cache block.
The search for a tag must be done by hardware in parallel (a sequential search would be too slow).
But the necessary parallel comparator hardware is very expensive.
Therefore, fully associative placement is practical only for a very small cache.
Compromise: N-way Set Associative Cache
N-way set associative: N cache blocks for each cache index. Like having N direct-mapped caches operating in parallel; select the one that gets a hit.
Example: 2-way set associative cache. The cache index selects a "set" of 2 blocks from the cache; the 2 tags in the set are compared in parallel; data is selected based on the tag result (which matched the address).
Example: 2-way Set Associative Cache
[Figure: 2-way set-associative cache. The address splits into tag, index, and offset; the index selects one set across two banks, each block holding a valid bit, cache tag, and cache data. The two tags are compared in parallel (two comparators) and a mux selects the hitting cache block.]
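A minimal, illustrative 2-way lookup can be sketched as follows (the class name and the naive evict-the-oldest choice are my own; the slides take up replacement policy properly below):

```python
class TwoWaySetAssocCache:
    """Toy 2-way set-associative lookup: each set holds up to two tags,
    which hardware would compare 'in parallel'."""
    def __init__(self, num_sets, block_bytes):
        self.num_sets = num_sets
        self.block_bytes = block_bytes
        self.sets = [[] for _ in range(num_sets)]  # each set: list of tags (max 2)

    def access(self, address):
        block_addr = address // self.block_bytes
        index = block_addr % self.num_sets
        tag = block_addr // self.num_sets
        ways = self.sets[index]
        if tag in ways:            # hit if either tag matches
            return "hit"
        if len(ways) == 2:         # set full: evict a block (which one? see LRU)
            ways.pop(0)
        ways.append(tag)
        return "miss"

cache = TwoWaySetAssocCache(num_sets=2, block_bytes=4)
# addresses 0, 16, 32 all map to set 0; 2 ways let 0 and 16 coexist
print([cache.access(a) for a in [0, 16, 0, 32, 16]])
# ['miss', 'miss', 'hit', 'miss', 'hit']
```

In a direct-mapped cache of the same size, the accesses to 0 and 16 would evict each other and every access here would miss.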
Set Associative Cache Contd.
Direct Mapped and Fully Associative can be seen as just variations of the Set Associative block placement strategy:
Direct Mapped = 1-way Set Associative Cache
Fully Associative = n-way Set Associative, for a cache with exactly n blocks

Addressing the Cache
Direct mapped cache: one block per set (s = n sets).
Set-associative mapping: n/s blocks per set.
Fully associative mapping: one set per cache (s = 1).

Direct mapping:          tag | index (log n bits) | offset (log b bits)
Set-associative mapping: tag | index (log s bits) | offset (log b bits)
Fully associative:       tag | offset (log b bits)
Alpha 21264 Cache Organization
Block Replacement Policy
N-way Set Associative and Fully Associative caches have a choice of where to place a block (and of which block to replace). Of course, if there is an invalid block, use it.
Whenever there is a cache hit, record the cache block that was touched. When a cache block must be evicted, choose one which hasn't been touched recently: "Least Recently Used" (LRU). Past is prologue: history suggests it is the least likely of the choices to be used soon.
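A hedged sketch of LRU bookkeeping for a single set, using Python's OrderedDict to track recency (names are illustrative, not from the slides):

```python
from collections import OrderedDict

class LRUSet:
    """Illustrative LRU replacement for one cache set of `ways` blocks."""
    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()  # tag -> data; order = recency (last = newest)

    def access(self, tag):
        if tag in self.blocks:
            self.blocks.move_to_end(tag)     # record the touch: now most recent
            return "hit"
        if len(self.blocks) == self.ways:
            self.blocks.popitem(last=False)  # evict the least recently used
        self.blocks[tag] = None
        return "miss"

s = LRUSet(ways=2)
print([s.access(t) for t in ["A", "B", "A", "C", "B"]])
# ['miss', 'miss', 'hit', 'miss', 'miss']  -- C evicts B, the LRU block
```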
Cache Write Strategy
There are two basic writing approaches:
Write-through: the write is done synchronously both to the cache and to the backing store.
Write-back (or write-behind): initially, writing is done only to the cache. The write to the backing store is postponed until the cache block containing the data is about to be modified/replaced by new content.
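A toy sketch of why write-back can reduce memory traffic: a single-block "cache" with a dirty bit (the class name and the one-block simplification are my own):

```python
class WriteBackCache:
    """Write-back bookkeeping sketch: writes only mark the block dirty;
    memory is updated when a dirty block is evicted."""
    def __init__(self):
        self.block = None       # (tag, dirty) or None
        self.memory_writes = 0

    def write(self, tag):
        if self.block and self.block[0] != tag and self.block[1]:
            self.memory_writes += 1  # evicting a dirty block: write it back
        self.block = (tag, True)     # the write stays in the cache, marked dirty

wb = WriteBackCache()
for t in ["A", "A", "A", "B"]:  # 3 writes to A coalesce into 1 memory write
    wb.write(t)
print(wb.memory_writes)  # 1 -- write-through would have gone to memory 4 times
```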
Review: Four Questions for Memory Hierarchy Designers
Q1: Where can a block be placed in the upper level? (Block placement) Fully Associative, Set Associative, Direct Mapped
Q2: How is a block found if it is in the upper level? (Block identification) Tag/Block
Q3: Which block should be replaced on a miss? (Block replacement) Random, LRU
Q4: What happens on a write? (Write strategy) Write Back or Write Through (with Write Buffer)
Review: Cache Performance
CPUtime = Instruction Count x (CPI_execution + Memory accesses per instruction x Miss rate x Miss penalty) x Clock cycle time
Misses per instruction = Memory accesses per instruction x Miss rate
CPUtime = IC x (CPI_execution + Misses per instruction x Miss penalty) x Clock cycle time
To Improve Cache Performance:
1. Reduce the miss rate
2. Reduce the miss penalty
3. Reduce the time to hit in the cache.
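Plugging hypothetical numbers into the CPUtime formula (all values below are made up for illustration):

```python
def cpu_time(ic, cpi_exec, mem_accesses_per_instr, miss_rate, miss_penalty, cct):
    """CPUtime = IC x (CPI_exec + mem accesses/instr x miss rate x miss penalty) x CCT."""
    misses_per_instr = mem_accesses_per_instr * miss_rate
    return ic * (cpi_exec + misses_per_instr * miss_penalty) * cct

# Hypothetical: 1M instructions, CPI 1.0, 1.5 accesses/instruction,
# 2% miss rate, 50-cycle miss penalty, 1 ns clock cycle
print(round(cpu_time(1_000_000, 1.0, 1.5, 0.02, 50, 1e-9), 6))  # 0.0025
```

With these numbers the memory stalls (0.03 misses/instruction x 50 cycles = 1.5 cycles) more than double the base CPI, which is why reducing misses matters.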
Reducing Misses
Classifying Misses: 3 Cs
Compulsory: the first access to a block is not in the cache, so the block must be brought into the cache. Also called cold start misses or first reference misses. (Misses in even an infinite cache.)
Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in a fully associative cache of size X.)
Conflict: if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. (Misses in an N-way associative cache of size X.)
3Cs Absolute Miss Rate (SPEC92)
[Figure: miss rate per type (0 to 0.14) vs. cache size (1 KB to 128 KB), with curves for 1-way, 2-way, 4-way, and 8-way associativity, broken into Conflict, Capacity, and Compulsory components. Note: the Compulsory miss component is small.]
2:1 Cache Rule
[Figure: the same miss-rate-per-type plot, illustrating the 2:1 cache rule.]
miss rate of a 1-way associative cache of size X = miss rate of a 2-way associative cache of size X/2
How Can We Reduce Misses?
3 Cs: Compulsory, Capacity, Conflict. In all cases, assume the total cache size is not changed. What happens if we:
1) Change Block Size: which of the 3Cs is obviously affected?
2) Change Associativity: which of the 3Cs is obviously affected?
3) Change Compiler: which of the 3Cs is obviously affected?
1. Reduce Misses via Larger Block Size
[Figure: miss rate (0% to 25%) vs. block size (16 to 256 bytes) for cache sizes 1K, 4K, 16K, 64K, and 256K.]
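A small simulation sketch of why larger blocks help for sequential (spatially local) access. It models only compulsory misses -- an idealization I am assuming for clarity, with a cache big enough to hold every block:

```python
def miss_rate_sequential(num_accesses, block_bytes, word_bytes=4):
    """Sequential word accesses; one compulsory miss per block touched.
    Larger blocks amortize that miss over more words (spatial locality)."""
    seen_blocks = set()
    misses = 0
    for i in range(num_accesses):
        block = (i * word_bytes) // block_bytes
        if block not in seen_blocks:
            misses += 1
            seen_blocks.add(block)
    return misses / num_accesses

for b in (16, 32, 64):
    print(b, miss_rate_sequential(1000, b))
# 16 -> 0.25, 32 -> 0.125, 64 -> 0.063: miss rate roughly halves per doubling
```

The figure's upturn at very large blocks does not appear here because this sketch ignores capacity and conflict effects: in a fixed-size cache, fewer, larger blocks eventually increase misses again.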
2. Reduce Misses via Higher Associativity
2:1 Cache Rule: miss rate of a direct-mapped cache of size N = miss rate of a 2-way cache of size N/2.
Beware: execution time is the only final measure! Hill [1988] suggested the hit time for a 2-way vs. a 1-way external cache is about +10%.
Example: Avg. Memory Access Time vs. Miss Rate
Example: assume CCT (clock cycle time) = 1.10 for 2-way, 1.12 for 4-way, 1.14 for 8-way vs. the CCT of direct mapped.

Cache Size (KB)   1-way   2-way   4-way   8-way
1                 2.33    2.15    2.07    2.01
2                 1.98    1.86    1.76    1.68
4                 1.72    1.67    1.61    1.53
8                 1.46    1.48    1.47    1.43
16                1.29    1.32    1.32    1.32
32                1.20    1.24    1.25    1.27
64                1.14    1.20    1.21    1.23
128               1.10    1.17    1.18    1.20

(In the original slide, red entries mark cases where A.M.A.T. is not improved by more associativity.)
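The table's trade-off follows from AMAT = hit time + miss rate x miss penalty; a sketch with hypothetical numbers (not taken from the table):

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate x miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Hypothetical: 2-way has a 10% slower hit (1.10 vs 1.00 clocks)
# but a lower miss rate (4% vs 5%), with a 25-clock miss penalty
print(amat(1.00, 0.05, 25))  # direct mapped: 2.25
print(amat(1.10, 0.04, 25))  # 2-way: 2.1 -- wins despite the slower hit
```

With a small miss-rate improvement or a short miss penalty, the slower hit time dominates instead, which is exactly what the red entries in the table show for larger caches.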
3. Reducing Misses via a "Victim Cache"
How can we combine the fast hit time of direct mapped and still avoid conflict misses? Add a small buffer to hold data discarded from the cache. Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct-mapped data cache. Used in Alpha and HP machines.
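An illustrative toy model of a direct-mapped cache backed by a small fully associative victim cache (the class name and details are my own; only the idea follows Jouppi [1990]):

```python
class DMWithVictim:
    """Direct-mapped cache plus a victim cache of recently evicted blocks."""
    def __init__(self, num_blocks, victim_entries=4):
        self.num_blocks = num_blocks
        self.main = {}                 # index -> tag
        self.victim = []               # list of (index, tag), max victim_entries
        self.victim_entries = victim_entries

    def access(self, block_addr):
        index = block_addr % self.num_blocks
        tag = block_addr // self.num_blocks
        if self.main.get(index) == tag:
            return "hit"
        if (index, tag) in self.victim:    # conflict miss rescued by victim cache
            self.victim.remove((index, tag))
            status = "victim-hit"
        else:
            status = "miss"
        if index in self.main:             # the evicted block goes to the victim cache
            self.victim.append((index, self.main[index]))
            self.victim = self.victim[-self.victim_entries:]
        self.main[index] = tag
        return status

c = DMWithVictim(num_blocks=4)
# blocks 0 and 4 conflict on index 0; the victim cache absorbs the ping-pong
print([c.access(b) for b in [0, 4, 0, 4]])
# ['miss', 'miss', 'victim-hit', 'victim-hit']
```

Without the victim cache, the last two accesses would be full conflict misses that go to the next level.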
4 & 5 Reducing Misses by Prefetching of Instructions & Data
Instruction prefetching: sequentially prefetch instructions from instruction memory into the instruction queue (IQ), together with branch prediction. All computers employ this.
Data prefetching: it is difficult to predict which data will be used in the future. The following questions must be answered:
1. What to prefetch? How do we know which data will be used? Unnecessary prefetches waste memory/bus bandwidth and replace useful data in the cache (the cache pollution problem), negatively impacting execution time.
2. When to prefetch? Early enough for the data to be useful, but prefetching too early will also cause cache pollution.
6. SW Prefetching
Software Prefetching: explicit instructions to prefetch data are inserted in the program. It is difficult to decide where to put them in the program; good compiler analysis is needed. Some computers already have prefetch instructions. Examples:
Register prefetch: load data into a register (HP PA-RISC loads)
Cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC v9)
Hardware Prefetching: difficult to predict and design; gives different results for different applications.
Reducing Cache Pollution, e.g., Instruction Prefetching
The Alpha 21064 fetches 2 blocks on a miss: the extra block is placed in a "stream buffer", and on a miss the stream buffer is checked.
Prefetching relies on having extra memory bandwidth that can be used without penalty.
Summary
3 Cs: Compulsory, Capacity, Conflict misses
Reducing Miss Rate:
1. Reduce misses via larger block size
2. Reduce misses via higher associativity
3. Reduce misses via a victim cache
4 & 5. Reduce misses by HW prefetching of instructions and data
6. Reduce misses by SW-controlled prefetching
7. Reduce misses by compiler optimizations
Remember the danger of concentrating on just one parameter when evaluating performance:
CPUtime = IC x (CPI_Execution + (Memory accesses / Instruction) x Miss rate x Miss penalty) x Clock cycle time
Review: Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
1. Reducing Miss Penalty: Read Priority over Write on Miss
Write-through caches with write buffers can suffer RAW (Read After Write) conflicts between buffered writes and main memory reads on cache misses.
If we simply wait for the write buffer to empty, we might increase the read miss penalty.
Instead, check the write buffer contents before the read; if there are no conflicts, let the memory access continue.
Write back? Normally, write the dirty block to memory and then do the read. Instead, copy the dirty block to a write buffer, then do the read, and then do the write.
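A sketch of the write-buffer check on a read miss (names are illustrative; real hardware does this with address comparators, not dictionaries):

```python
class WriteBuffer:
    """On a read miss, check the write buffer for a RAW conflict
    before letting the read bypass buffered writes."""
    def __init__(self):
        self.pending = {}   # address -> value buffered, not yet in memory
        self.memory = {}

    def write(self, addr, value):
        self.pending[addr] = value         # buffered; memory updated later

    def read_miss(self, addr):
        if addr in self.pending:           # RAW conflict: use the buffered value
            return self.pending[addr]      # (alternatively, drain the buffer first)
        return self.memory.get(addr, 0)    # no conflict: the read proceeds at once

buf = WriteBuffer()
buf.memory[100] = 7
buf.write(200, 42)
print(buf.read_miss(100), buf.read_miss(200))  # 7 42
```

Without the check, reading address 200 from memory before the buffered write drained would return stale data.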
Reducing Miss Penalty Summary
Five techniques:
1. Read priority over write on miss
2. Subblock placement
3. Early restart and critical word first on miss
4. Non-blocking caches (hit under miss, miss under miss)
5. Second-level cache
These can be applied recursively to multilevel caches. The danger is that the time to DRAM will grow with multiple levels in between; first attempts at L2 caches can make things worse, since an increased worst case is worse.
Assignment 3
Cache Optimization Summary

Technique                            MR   MP   HT   Complexity
Larger Block Size                    +    -         0
Higher Associativity                 +         -    1
Victim Caches                        +              2
Pseudo-Associative Caches            +              2
HW Prefetching of Instr/Data         +              2
SW/Compiler Controlled Prefetching   +              3
Compiler Reduce Misses               +              0
Priority to Read Misses                   +         1
Subblock Placement                        +    +    1
Early Restart & Critical Word 1st         +         2
Non-Blocking Caches                       +         3
Second Level Caches                       +         2

Legend: MR = miss rate, MP = miss penalty, HT = hit time