MEMORY HIERARCHY: WAYS TO REDUCE MISSES


Page 1

MEMORY HIERARCHY: WAYS TO REDUCE MISSES

Page 2

Goal: Illusion of large, fast, cheap memory

Fact: Large memories are slow, fast memories are small

How do we create a memory that is large, cheap and fast (most of the time)?

Hierarchy of levels: use smaller and faster memory technologies close to the processor. Fast access time in the highest level of the hierarchy; cheap, slow memory furthest from the processor.

The aim of memory hierarchy design is to have an access time close to that of the highest level and a size equal to that of the lowest level.

Page 3

Recap: Memory Hierarchy Pyramid

[Figure: memory hierarchy pyramid. The processor (CPU) sits at the top; Level 1, Level 2, Level 3, ..., Level n grow in size going down the pyramid. Increasing distance from the CPU means increasing access time (memory latency) and decreasing cost per MB. Levels are connected by a transfer datapath (bus).]

Page 4

Memory Hierarchy: Terminology

Hit: data appears in the upper level (Block X)
  Hit Rate: the fraction of memory accesses found in the upper level

Miss: data must be retrieved from a block in the lower level (Block Y)
  Miss Rate = 1 - (Hit Rate)

Hit Time: time to access the upper level, which consists of the time to determine hit/miss plus the memory access time

Miss Penalty: time to replace a block in the upper level plus time to deliver the block to the processor

Note: Hit Time << Miss Penalty
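These quantities combine into the standard average memory access time (AMAT) relation, AMAT = Hit time + Miss rate x Miss penalty. A minimal sketch in Python; the relation is standard, but the numeric values below are illustrative assumptions, not taken from the slides:

# Average memory access time: AMAT = hit_time + miss_rate * miss_penalty
def amat(hit_time, miss_rate, miss_penalty):
    """All times in the same unit (e.g., ns or cycles)."""
    return hit_time + miss_rate * miss_penalty

# Illustrative values only: 2 ns hit time, 5% miss rate, 100 ns penalty
print(amat(hit_time=2.0, miss_rate=0.05, miss_penalty=100.0))  # 7.0 ns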

Page 5

Current Memory Hierarchy

[Figure: the processor (control, datapath, registers) is backed by an L1 cache, an L2 cache, main memory, and secondary memory (disk).]

              Regs     L1 cache   L2 cache   Main memory   Secondary (disk)
Speed (ns):   0.5      2          6          100           10,000,000
Size (MB):    0.0005   0.05       1-4        100-1000      100,000
Cost ($/MB):  --       $100       $30        $1            $0.05
Technology:   Regs     SRAM       SRAM       DRAM          Disk

Page 6

Memory Hierarchy: Why Does it Work? Locality!

Temporal Locality (locality in time):
=> Keep most recently accessed data items closer to the processor

Spatial Locality (locality in space):
=> Move blocks consisting of contiguous words to the upper levels

[Figure: the processor exchanges words with the upper-level memory (block Blk X); on a miss, block Blk Y is transferred from the lower-level memory to the upper level. A companion plot sketches probability of reference across the address space 0 to 2^n - 1.]

Page 7

Memory Hierarchy Technology

Random Access:
  "Random" is good: access time is the same for all locations.
  DRAM: Dynamic Random Access Memory
    High density, low power, cheap, slow
    Dynamic: needs to be "refreshed" regularly
  SRAM: Static Random Access Memory
    Low density, high power, expensive, fast
    Static: content will last "forever" (until power is lost)

"Not-so-random" access technology: access time varies from location to location and from time to time. Examples: disk, CD-ROM.

Sequential access technology: access time linear in location (e.g., tape).

We will concentrate on random access technology: main memory (DRAMs) and caches (SRAMs).

Page 8

Introduction to Caches

A cache:
  is a small, very fast memory (SRAM, expensive)
  contains copies of the most recently accessed memory locations (data and instructions): temporal locality
  is fully managed by hardware (unlike virtual memory)
  organizes storage in blocks of contiguous memory locations: spatial locality
  the unit of transfer to/from main memory (or L2) is the cache block

General structure: n blocks per cache, organized in s sets, with b bytes per block; total cache size is n x b bytes.
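To make these parameters concrete, here is a small sketch (the function name and example values are my own) that derives a cache's total size and associativity from n, s, and b:

# Cache geometry from the parameters above:
#   n blocks per cache, organized in s sets, b bytes per block.
#   Total size = n * b bytes; associativity (ways) = n / s blocks per set.
def cache_geometry(n_blocks, n_sets, block_bytes):
    assert n_blocks % n_sets == 0, "blocks must divide evenly into sets"
    return {
        "total_bytes": n_blocks * block_bytes,
        "ways": n_blocks // n_sets,
    }

# Illustrative only: 128 blocks of 32 bytes in 64 sets
# -> 4096 bytes total, 2-way set associative
print(cache_geometry(128, 64, 32))  # {'total_bytes': 4096, 'ways': 2}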

Page 9

Caches

For each block:
  an address tag: unique identifier
  state bits: (in)valid, modified
  the data: b bytes

Basic cache operation: every memory access is first presented to the cache.
  Hit: the word being accessed is in the cache; it is returned to the CPU.
  Miss: the word is not in the cache:
    a whole block is fetched from memory (or L2)
    an "old" block is evicted from the cache (kicked out): which one?
    the new block is stored in the cache
    the requested word is sent to the CPU

Page 10

Cache Organization

(1) How do you know if something is in the cache?
(2) If it is in the cache, how do you find it?

The answers to (1) and (2) depend on the type, or organization, of the cache.

In a direct mapped cache, each memory address is associated with one possible block within the cache. Therefore, we only need to look in a single location in the cache for the data, if it exists in the cache.

Page 11

Simplest Cache: Direct Mapped

The index determines the block in the cache: index = (address) mod (# blocks). If the number of cache blocks is a power of 2, then the cache index is just the lower n bits of the memory address, where n = log2(# blocks); a minimal computation is sketched below.

[Figure: a 16-block main memory mapped onto a 4-block direct mapped cache (cache indexes 0-3, memory block addresses 0-15). Memory blocks whose 4-bit addresses end in the same two bits share a cache index; for example, 0010, 0110, 1010, and 1110 all map to index 2. The memory block address splits into a tag and an index.]
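A minimal sketch of this computation, assuming the block address has already been extracted from the full byte address (the function name is mine):

# Direct-mapped placement: index = (block address) mod (# blocks).
# When n_blocks is a power of 2, this is just the low log2(n_blocks)
# bits of the block address; the remaining high bits form the tag.
def dm_index_and_tag(block_address, n_blocks):
    index = block_address % n_blocks   # == block_address & (n_blocks - 1)
    tag = block_address // n_blocks    # high-order bits identify the block
    return index, tag

# Example from the 4-block cache above: memory block 13 (0b1101)
# maps to cache index 1 with tag 3.
print(dm_index_and_tag(13, 4))  # (1, 3)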

Page 12

Issues with Direct-Mapped

If the block size is greater than 1, the rightmost bits of the index are really the offset within the indexed block. The address divides into three fields:

  | tag | index | byte offset |

  tag: checked to see if we have the correct block
  index: selects the block
  byte offset: selects the byte within the block

Page 13

64 KB Cache with 4-word (16-byte) blocks

[Figure: address bit positions 31..16 give a 16-bit tag, 15..4 a 12-bit index into 4K entries, and 3..2 a 2-bit block offset (bits 1..0 are the byte offset). Each entry holds a valid bit, a 16-bit tag, and 128 bits of data. The stored tag is compared against the address tag to produce the Hit signal, and a mux uses the block offset to select one of the four 32-bit data words.]

Page 14

Direct-mapped Cache Contd.

The direct mapped cache is simple to design and its access time is fast (why?). It is good for an L1 (on-chip) cache.

Problem: conflict misses, and hence a low hit ratio. Conflict misses are misses caused by accessing different memory locations that are mapped to the same cache index.

In a direct mapped cache there is no flexibility in where a memory block can be placed in the cache, which contributes to conflict misses.

Page 15

Another Extreme: Fully Associative

Fully Associative Cache (8-word block):
  Omit the cache index; place an item in any block!
  Compare all cache tags in parallel.

By definition: conflict misses = 0 for a fully associative cache.

[Figure: fully associative cache. The address splits into a 27-bit cache tag (bits 31..5) and a byte offset (bits 4..0) into the 32-byte block (bytes B0 through B31). Each entry holds a valid bit, the cache tag, and the cache data; every stored tag is compared (=) against the address tag in parallel.]

Page 16

Fully Associative Cache

We must search all tags in the cache, as an item can be in any cache block.

The tag search must be done by hardware in parallel (other kinds of search would be too slow). But the necessary parallel comparator hardware is very expensive.

Therefore, fully associative placement is practical only for a very small cache.

Page 17

Compromise: N-way Set Associative Cache

N-way set associative:
  N cache blocks for each cache index
  Like having N direct mapped caches operating in parallel; select the one that gets a hit.

Example: 2-way set associative cache
  The cache index selects a "set" of 2 blocks from the cache.
  The 2 tags in the set are compared in parallel.
  Data is selected based on the tag result (which one matched the address).

Page 18

Example: 2-way Set Associative Cache

[Figure: the address splits into tag, index, and offset. The index selects one set; each of the two ways holds a valid bit, a cache tag, and cache data (Block 0, ...). Both stored tags are compared (=) against the address tag in parallel; the comparison results form the Hit signal and drive a mux that selects the matching cache block.]

Page 19

Set Associative Cache Contd.

Direct mapped and fully associative caches can be seen as just variations of the set associative block placement strategy:

  Direct mapped = 1-way set associative
  Fully associative = n-way set associative for a cache with exactly n blocks

Page 20

Addressing the Cache

Direct mapped cache: one block per set (s = n).
Set-associative mapping: n/s blocks per set.
Fully associative mapping: one set per cache (all n blocks in one set).

Address fields under each mapping (a sketch follows below):

  Direct mapping:          | tag | index (log n bits) | offset (log b bits) |
  Set-associative mapping: | tag | index (log s bits) | offset (log b bits) |
  Fully associative:       | tag |                      offset (log b bits) |
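A sketch of these three splits in Python, assuming power-of-two set counts and block sizes so the field widths are exact (the helper name is mine):

# Split an address into (tag, index, offset) for a cache with s sets
# and b bytes per block. Direct mapped: s = n; fully associative: s = 1
# (the index field then has zero width).
def split_address(addr, n_sets, block_bytes):
    offset_bits = (block_bytes - 1).bit_length()  # log2(b)
    index_bits = (n_sets - 1).bit_length()        # log2(s)
    offset = addr & (block_bytes - 1)
    index = (addr >> offset_bits) & (n_sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

# Illustrative: 64 sets of 16-byte blocks -> 6 index bits, 4 offset bits
print(split_address(0x12345678, n_sets=64, block_bytes=16))
# Fully associative: no index bits, the whole block address is the tag
print(split_address(0x12345678, n_sets=1, block_bytes=16))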

Page 21

Alpha 21264 Cache Organization

Page 22

Block Replacement Policy

N-way set associative and fully associative caches have a choice of where to place a block (and hence of which block to replace). Of course, if there is an invalid block, use it.

Whenever there is a cache hit, record the cache block that was touched. When a cache block must be evicted, choose one which hasn't been touched recently: "Least Recently Used" (LRU). Past is prologue: history suggests it is the least likely of the choices to be used soon.
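A minimal model of LRU bookkeeping for one set, using Python's OrderedDict (this sketches the policy itself, not the hardware that implements it):

from collections import OrderedDict

# LRU within a single set: record the touch on every hit; evict the
# least recently used tag when the set is full.
class LRUSet:
    def __init__(self, ways):
        self.ways = ways
        self.tags = OrderedDict()  # ordered oldest-touched first

    def access(self, tag):
        if tag in self.tags:                 # hit: record the touch
            self.tags.move_to_end(tag)
            return "hit"
        if len(self.tags) == self.ways:      # miss, set full: evict LRU
            self.tags.popitem(last=False)
        self.tags[tag] = True                # install the new block's tag
        return "miss"

s = LRUSet(ways=2)
print([s.access(t) for t in ["A", "B", "A", "C", "B"]])
# ['miss', 'miss', 'hit', 'miss', 'miss']: C evicts B, then B evicts A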

Page 23

Cache Write Strategy

There are two basic writing approaches:

  Write-through: the write is done synchronously both to the cache and to the backing store.

  Write-back (or write-behind): initially, the write is done only to the cache. The write to the backing store is postponed until the cache block containing the data is about to be modified or replaced by new content.
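A sketch contrasting the two policies, with a dirty bit marking write-back blocks that still owe a write to memory (the class and its structure are illustrative, not a real design):

# Write-through: every store updates both the cache and memory.
# Write-back: stores update only the cache and set a dirty bit;
# memory is updated when the dirty block is evicted.
class WritePolicyDemo:
    def __init__(self, write_back):
        self.write_back = write_back
        self.cache = {}    # addr -> (value, dirty)
        self.memory = {}

    def store(self, addr, value):
        if self.write_back:
            self.cache[addr] = (value, True)   # defer the memory write
        else:
            self.cache[addr] = (value, False)
            self.memory[addr] = value          # synchronous memory write

    def evict(self, addr):
        value, dirty = self.cache.pop(addr)
        if dirty:                              # write back only if modified
            self.memory[addr] = value

wb = WritePolicyDemo(write_back=True)
wb.store(0x100, 42)
print(0x100 in wb.memory)  # False: the memory write was deferred
wb.evict(0x100)
print(wb.memory[0x100])    # 42: written back on eviction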

Page 24

Review: Four Questions for Memory Hierarchy Designers

Q1: Where can a block be placed in the upper level? (Block placement)
    Fully associative, set associative, direct mapped

Q2: How is a block found if it is in the upper level? (Block identification)
    Tag/block

Q3: Which block should be replaced on a miss? (Block replacement)
    Random, LRU

Q4: What happens on a write? (Write strategy)
    Write back or write through (with a write buffer)

Page 25

Review: Cache Performance

CPUtime = Instruction count x (CPI_execution + Memory accesses per instruction x Miss rate x Miss penalty) x Clock cycle time

Misses per instruction = Memory accesses per instruction x Miss rate

CPUtime = IC x (CPI_execution + Misses per instruction x Miss penalty) x Clock cycle time

To improve cache performance:
1. Reduce the miss rate
2. Reduce the miss penalty
3. Reduce the time to hit in the cache
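A worked instance of the first formula with assumed numbers (the instruction count, CPI, miss rate, and penalty below are illustrative, not taken from the slides):

# CPUtime = IC x (CPI_execution +
#                 mem_accesses_per_instr x miss_rate x miss_penalty) x CCT
def cpu_time(ic, cpi_exec, mem_per_instr, miss_rate, penalty_cycles, cct):
    memory_stall_cpi = mem_per_instr * miss_rate * penalty_cycles
    return ic * (cpi_exec + memory_stall_cpi) * cct

# Illustrative: 1e9 instructions, base CPI 1.0, 1.5 accesses per
# instruction, 2% miss rate, 50-cycle miss penalty, 1 ns clock cycle.
print(cpu_time(1e9, 1.0, 1.5, 0.02, 50, 1e-9))  # 2.5 (seconds)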

Page 26

Reducing Misses

Classifying misses: the 3 Cs

Compulsory: The first access to a block is not in the cache, so the block must be brought into the cache. Also called cold start misses or first reference misses. (Misses in even an infinite cache.)

Capacity: If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in a fully associative cache of size X.)

Conflict: If the block placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. (Misses in an N-way associative cache of size X.)

Page 27

3Cs Absolute Miss Rate (SPEC92)

[Figure: miss rate per type (0 to 0.14) versus cache size (1 KB to 128 KB), with curves for 1-way, 2-way, 4-way, and 8-way associativity and components for capacity and compulsory misses; the gap between the curves is conflict misses. Note: the compulsory miss rate is small.]

Page 28

2:1 Cache Rule

[Figure: the same miss rate per type versus cache size plot (1 KB to 128 KB; 1-way through 8-way curves; capacity, compulsory, and conflict components), illustrating the rule:]

miss rate of a 1-way associative cache of size X ≈ miss rate of a 2-way associative cache of size X/2

Page 29

How Can We Reduce Misses?

3 Cs: compulsory, capacity, conflict. In all cases, assume the total cache size is not changed. What happens if we:

1) Change the block size: which of the 3 Cs is obviously affected?
2) Change the associativity: which of the 3 Cs is obviously affected?
3) Change the compiler: which of the 3 Cs is obviously affected?

Page 30

1. Reduce Misses via Larger Block Size

[Figure: miss rate (0% to 25%) versus block size (16 to 256 bytes) for cache sizes of 1K, 4K, 16K, 64K, and 256K: larger blocks reduce the miss rate at first, but for the smaller caches the miss rate climbs again at the largest block sizes.]

Page 31

2. Reduce Misses via Higher Associativity

2:1 Cache Rule: the miss rate of a direct mapped cache of size N ≈ the miss rate of a 2-way set associative cache of size N/2.

Beware: execution time is the only final measure! Hill [1988] suggested the hit time of a 2-way external cache is about 10% higher than that of a 1-way (direct mapped) cache.

Page 32

Example: Avg. Memory Access Time vs. Miss Rate

Example: assume clock cycle time (CCT) = 1.10 for 2-way, 1.12 for 4-way, and 1.14 for 8-way, relative to the CCT of a direct mapped cache.

Average memory access time by cache size and associativity:

  Cache Size (KB)   1-way   2-way   4-way   8-way
    1               2.33    2.15    2.07    2.01
    2               1.98    1.86    1.76    1.68
    4               1.72    1.67    1.61    1.53
    8               1.46    1.48    1.47    1.43
   16               1.29    1.32    1.32    1.32
   32               1.20    1.24    1.25    1.27
   64               1.14    1.20    1.21    1.23
  128               1.10    1.17    1.18    1.20

(Entries marked red on the original slide are cases where A.M.A.T. is not improved by more associativity.)

Page 33

3. Reducing Misses via a "Victim Cache"

How can we combine the fast hit time of direct mapped and still avoid conflict misses? Add a small buffer to hold data discarded from the cache. Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of the conflicts for a 4 KB direct mapped data cache. Used in Alpha and HP machines.
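A sketch of the lookup path, with a small fully associative victim buffer behind a direct mapped cache (the structure and the FIFO replacement are simplifying assumptions, not taken from Jouppi's design):

from collections import deque

# Direct mapped cache backed by a tiny fully associative victim buffer.
# On a main-cache miss that hits in the victim buffer, the block is
# promoted back and the displaced block takes its place in the buffer.
class VictimCacheDemo:
    def __init__(self, n_blocks, victim_entries=4):
        self.n_blocks = n_blocks
        self.blocks = {}                             # index -> tag
        self.victim = deque(maxlen=victim_entries)   # FIFO of (index, tag)

    def access(self, block_address):
        index = block_address % self.n_blocks
        tag = block_address // self.n_blocks
        if self.blocks.get(index) == tag:
            return "hit"
        if (index, tag) in self.victim:              # conflict caught here
            self.victim.remove((index, tag))
            self.victim.append((index, self.blocks[index]))
            self.blocks[index] = tag
            return "victim hit"
        if index in self.blocks:                     # displace into victim
            self.victim.append((index, self.blocks[index]))
        self.blocks[index] = tag
        return "miss"

c = VictimCacheDemo(n_blocks=4)
print([c.access(a) for a in [1, 5, 1, 5]])
# ['miss', 'miss', 'victim hit', 'victim hit']: blocks 1 and 5 conflict
# on index 1, but the victim buffer keeps servicing the ping-pong.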

Page 34

4 & 5. Reducing Misses by Prefetching Instructions and Data

Instruction prefetching: sequentially prefetch instructions from instruction memory (IM) into the instruction queue (IQ), together with branch prediction. All computers employ this.

Data prefetching: it is difficult to predict the data that will be used in the future. The following questions must be answered:

1. What to prefetch? How do we know which data will be used? Unnecessary prefetches waste memory/bus bandwidth and replace useful data in the cache (the cache pollution problem), with a negative impact on execution time.

2. When to prefetch? Prefetching must happen early enough for the data to be useful, but prefetching too early will also cause cache pollution.

Page 35

6. SW Prefetching

Software prefetching: explicit instructions to prefetch data are inserted into the program. It is difficult to decide where to put them; good compiler analysis is needed. Some computers already have prefetch instructions. Examples:

  Register prefetch: load data into a register (HP PA-RISC loads)
  Cache prefetch: load data into the cache (MIPS IV, PowerPC, SPARC v9)

Hardware prefetching: difficult to predict and design; gives different results for different applications.

Page 36

6. Reducing Cache Pollution, e.g., Instruction Prefetching

The Alpha 21064 fetches 2 blocks on a miss; the extra block is placed in a "stream buffer", and on a subsequent miss the stream buffer is checked first.

Prefetching relies on having extra memory bandwidth that can be used without penalty.

Page 37

Summary

3 Cs: compulsory, capacity, conflict misses

Reducing the miss rate:
1. Reduce misses via larger block size
2. Reduce misses via higher associativity
3. Reduce misses via a victim cache
4 & 5. Reduce misses by HW prefetching of instructions and data
6. Reduce misses by SW controlled prefetching
7. Reduce misses by compiler optimizations

Remember the danger of concentrating on just one parameter when evaluating performance:

CPUtime = IC x (CPI_execution + (Memory accesses / Instruction) x Miss rate x Miss penalty) x Clock cycle time

Page 38

Review: Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

Page 39

1. Reducing Miss Penalty: Read Priority over Write on Miss

Write-through caches with write buffers risk RAW (read after write) conflicts between buffered writes and main memory reads on cache misses. If we simply wait for the write buffer to empty, we might increase the read miss penalty. Instead, check the write buffer contents before the read; if there are no conflicts, let the memory access continue.

What about write back? Normally: write the dirty block to memory, and then do the read. Instead: copy the dirty block to a write buffer, then do the read, and then do the write.
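A sketch of the buffer check described above: a read miss first scans the write buffer and forwards a pending value instead of waiting for the buffer to drain (the class and method names are mine):

# Write-through cache with a write buffer. On a read miss, scan the
# buffer: if the address has a pending write, forward its value rather
# than letting the read reach memory before the older write (RAW hazard).
class WriteBufferDemo:
    def __init__(self):
        self.write_buffer = []   # (addr, value) pairs, oldest first
        self.memory = {}

    def write(self, addr, value):
        self.write_buffer.append((addr, value))   # memory update lags

    def drain_one(self):
        addr, value = self.write_buffer.pop(0)    # retire the oldest write
        self.memory[addr] = value

    def read_miss(self, addr):
        for a, v in reversed(self.write_buffer):  # newest pending write wins
            if a == addr:
                return v                          # conflict: forward from buffer
        return self.memory.get(addr, 0)           # no conflict: read proceeds

m = WriteBufferDemo()
m.write(0x40, 7)
print(m.read_miss(0x40))  # 7: forwarded from the write buffer
print(m.read_miss(0x80))  # 0: memory read proceeds without waiting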

Page 40

Reducing Miss Penalty Summary

Five techniques:
  Read priority over write on miss
  Subblock placement
  Early restart and critical word first on miss
  Non-blocking caches (hit under miss, miss under miss)
  Second level cache

These can be applied recursively to multilevel caches. The danger is that the time to DRAM will grow with multiple levels in between; first attempts at L2 caches can make things worse, since the increased worst case is worse.

CPUtime = IC x (CPI_execution + (Memory accesses / Instruction) x Miss rate x Miss penalty) x Clock cycle time

Assignment 3

Page 41

Cache Optimization Summary

Technique                               MR   MP   HT   Complexity
Larger block size                       +    -         0
Higher associativity                    +         -    1
Victim caches                           +              2
Pseudo-associative caches               +              2
HW prefetching of instr/data            +              2
SW/compiler controlled prefetching      +              3
Compiler techniques to reduce misses    +              0
Priority to read misses                      +         1
Subblock placement                           +    +    1
Early restart & critical word first          +         2
Non-blocking caches                          +         3
Second level caches                          +         2

Legend: MR = miss rate, MP = miss penalty, HT = hit time; "+" means the technique improves that factor, "-" means it hurts it.