MEMORY HIERARCHY: WAYS TO REDUCE MISSES
Goal: Illusion of large, fast, cheap memory
Fact: Large memories are slow, fast memories are small
How do we create a memory that is large, cheap and fast (most of the time)?
Hierarchy of Levels: use smaller and faster memory technologies close to the processor. Fast access time in the highest level of the hierarchy; cheap, slow memory furthest from the processor.
The aim of memory hierarchy design is to have access time close to the highest level and size equal to the lowest level.
Recap: Memory Hierarchy Pyramid
[Figure: pyramid with the Processor (CPU) at the top and Levels 1, 2, 3, ..., n below, connected by a transfer datapath (bus). Moving down the pyramid, the size of memory at each level increases, distance from the CPU increases, cost/MB decreases, and access time (memory latency) increases.]
Memory Hierarchy: Terminology
Hit: data appears in level X. Hit Rate: the fraction of memory accesses found in the upper level.
Miss: data must be retrieved from a block in the lower level (Block Y). Miss Rate = 1 - (Hit Rate).
Hit Time: time to access the upper level = time to determine hit/miss + memory access time.
Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor.
Note: Hit Time << Miss Penalty
Current Memory Hierarchy
[Figure: processor (control, datapath, registers) connected to the L1 cache, L2 cache, main memory, and secondary memory.]

              Regs     L1 cache  L2 cache  Main Memory  Secondary
Speed (ns):   0.5      2         6         100          10,000,000
Size (MB):    0.0005   0.05      1-4       100-1000     100,000
Cost ($/MB):  --       $100      $30       $1           $0.05
Technology:   Regs     SRAM      SRAM      DRAM         Disk
Memory Hierarchy: Why Does it Work? Locality!
Temporal Locality (Locality in Time): keep the most recently accessed data items closer to the processor.
Spatial Locality (Locality in Space): move blocks consisting of contiguous words to the upper levels.
[Figure: Block X transferred between the processor and upper-level memory, Block Y between the upper- and lower-level memories; plot of probability of reference vs. address space (0 to 2^n - 1), illustrating locality.]
Memory Hierarchy Technology
Random Access: "random" is good: access time is the same for all locations.
DRAM: Dynamic Random Access Memory. High density, low power, cheap, slow. Dynamic: needs to be "refreshed" regularly.
SRAM: Static Random Access Memory. Low density, high power, expensive, fast. Static: content lasts "forever" (until power is lost).
"Not-so-random" Access Technology: access time varies from location to location and from time to time. Examples: disk, CD-ROM.
Sequential Access Technology: access time is linear in location (e.g., tape).
We will concentrate on random access technology. Main memory: DRAM; caches: SRAM.
Introduction to Caches
A cache:
is a small, very fast memory (SRAM, expensive)
contains copies of the most recently accessed memory locations (data and instructions): temporal locality
is fully managed by hardware (unlike virtual memory)
storage is organized in blocks of contiguous memory locations: spatial locality
the unit of transfer to/from main memory (or L2) is the cache block
General structure: n blocks per cache, organized in s sets, b bytes per block; total cache size n*b bytes.
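The size arithmetic above can be sketched in code; this is an illustrative helper (the function name and parameters are my own, not from the slides):

```python
import math

def cache_geometry(total_bytes, block_bytes, num_sets):
    """Derive block count, associativity, and address-field widths
    from total size (n*b bytes), block size b, and set count s."""
    n_blocks = total_bytes // block_bytes      # n = total / b
    assoc = n_blocks // num_sets               # blocks per set (ways)
    offset_bits = int(math.log2(block_bytes))  # log2(b) byte-offset bits
    index_bits = int(math.log2(num_sets))      # log2(s) index bits
    return n_blocks, assoc, offset_bits, index_bits

# 64 KB cache, 16-byte blocks, 4096 sets -> 4096 blocks, direct mapped
print(cache_geometry(64 * 1024, 16, 4096))  # (4096, 1, 4, 12)
```

For the 64 KB direct-mapped cache that appears later in these slides, this yields a 4-bit offset and a 12-bit index, matching the address breakdown shown there.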
Caches
For each block:
an address tag: unique identifier
state bits: (in)valid, modified
the data: b bytes
Basic cache operation: every memory access is first presented to the cache.
Hit: the word being accessed is in the cache; it is returned to the CPU.
Miss: the word is not in the cache. A whole block is fetched from memory (L2), an "old" block is evicted from the cache (kicked out) -- which one? -- the new block is stored in the cache, and the requested word is sent to the CPU.
Cache Organization
(1) How do you know if something is in the cache?
(2) If it is in the cache, how do you find it?
The answers to (1) and (2) depend on the type, or organization, of the cache.
In a direct-mapped cache, each memory address is associated with one possible block within the cache. Therefore, we only need to look in a single location in the cache for the data, if it exists in the cache.
Simplest Cache: Direct Mapped
index determines the block in the cache: index = (address) mod (# blocks)
If the number of cache blocks is a power of 2, then the cache index is just the lower n bits of the memory address [n = log2(# blocks)]
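As a minimal sketch of this index/tag split (the function name is illustrative, not from the slides):

```python
def dm_index_tag(address, num_blocks, block_bytes):
    """Split a byte address for a direct-mapped cache:
    index = (block address) mod (# blocks); tag = remaining upper bits.
    Assumes num_blocks and block_bytes are powers of 2."""
    block_addr = address // block_bytes
    index = block_addr % num_blocks   # lower log2(num_blocks) bits
    tag = block_addr // num_blocks    # upper bits
    return tag, index

# 4-block cache, 4-byte blocks: addresses 8 and 72 share index 2
print(dm_index_tag(8, 4, 4))   # (0, 2)
print(dm_index_tag(72, 4, 4))  # (4, 2) -> same index, different tag: a conflict
```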
[Figure: 4-block direct-mapped cache. Main memory block addresses 0-15 map to cache indexes 0-3; e.g., memory blocks 0010, 0110, 1010, and 1110 all map to cache index 10. The memory block address splits into tag and index fields.]
Issues with Direct-Mapped
If block size > 1, the rightmost bits of the address are really the offset within the indexed block:
tag (to check if we have the correct block) | index (to select the block) | byte offset (to select within the block)
[Figure: 64 KB cache with 4-word (16-byte) blocks. Address bits 31..16 form the 16-bit tag, bits 15..4 the 12-bit index (4K entries), bits 3..2 the 2-bit block offset selecting one of four 32-bit words via a mux, and bits 1..0 the byte offset. Each entry holds a valid bit, a 16-bit tag, and 128 bits of data; a tag match produces the Hit signal and the selected Data word.]
Direct-mapped Cache Contd.
The direct-mapped cache is simple to design and its access time is fast (why?). Good for an L1 (on-chip) cache.
Problem: conflict misses can cause a low hit ratio. Conflict misses are misses caused by accessing different memory locations that are mapped to the same cache index.
In a direct-mapped cache there is no flexibility in where a memory block can be placed in the cache, contributing to conflict misses.
Another Extreme: Fully Associative
Fully Associative Cache (8-word block): omit the cache index; place an item in any block! Compare all cache tags in parallel.
By definition: Conflict Misses = 0 for a fully associative cache.
[Figure: fully associative cache with 32-byte blocks. The 27-bit cache tag (address bits 31..5) is compared in parallel against every block's tag (one comparator per block, each with a valid bit); the byte offset (bits 4..0) selects among bytes B0..B31 of the cache data block.]
Fully Associative Cache
Must search all tags in the cache, as an item can be in any cache block.
The search for a tag must be done by hardware in parallel (a sequential search would be too slow).
But the necessary parallel comparator hardware is very expensive.
Therefore, fully associative placement is practical only for a very small cache.
Compromise: N-way Set Associative Cache
N-way set associative: N cache blocks for each cache index. Like having N direct-mapped caches operating in parallel; select the one that gets a hit.
Example: 2-way set associative cache. The cache index selects a "set" of 2 blocks from the cache; the 2 tags in the set are compared in parallel; data is selected based on the tag result (which matched the address).
Example: 2-way Set Associative Cache
[Figure: 2-way set-associative cache. The address splits into tag, index, and offset; the index selects one set across two banks, each block holding a valid bit, cache tag, and cache data. The two tags are compared in parallel (two comparators) and a mux selects the hitting cache block.]
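A minimal, illustrative 2-way lookup can be sketched as follows (the class name and the naive evict-the-oldest choice are my own; the slides take up replacement policy properly below):

```python
class TwoWaySetAssocCache:
    """Toy 2-way set-associative lookup: each set holds up to two tags,
    which hardware would compare 'in parallel'."""
    def __init__(self, num_sets, block_bytes):
        self.num_sets = num_sets
        self.block_bytes = block_bytes
        self.sets = [[] for _ in range(num_sets)]  # each set: list of tags (max 2)

    def access(self, address):
        block_addr = address // self.block_bytes
        index = block_addr % self.num_sets
        tag = block_addr // self.num_sets
        ways = self.sets[index]
        if tag in ways:            # hit if either tag matches
            return "hit"
        if len(ways) == 2:         # set full: evict a block (which one? see LRU)
            ways.pop(0)
        ways.append(tag)
        return "miss"

cache = TwoWaySetAssocCache(num_sets=2, block_bytes=4)
# addresses 0, 16, 32 all map to set 0; 2 ways let 0 and 16 coexist
print([cache.access(a) for a in [0, 16, 0, 32, 16]])
# ['miss', 'miss', 'hit', 'miss', 'hit']
```

In a direct-mapped cache of the same size, the accesses to 0 and 16 would evict each other and every access here would miss.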
Set Associative Cache Contd.
Direct Mapped and Fully Associative can be seen as just variations of the Set Associative block placement strategy:
Direct Mapped = 1-way Set Associative Cache
Fully Associative = n-way Set Associative, for a cache with exactly n blocks

Addressing the Cache
Direct mapped cache: one block per set (s = n sets).
Set-associative mapping: n/s blocks per set.
Fully associative mapping: one set per cache (s = 1).

Direct mapping:          tag | index (log n bits) | offset (log b bits)
Set-associative mapping: tag | index (log s bits) | offset (log b bits)
Fully associative:       tag | offset (log b bits)
Alpha 21264 Cache Organization
Block Replacement Policy
N-way Set Associative and Fully Associative caches have a choice of where to place a block (and of which block to replace). Of course, if there is an invalid block, use it.
Whenever there is a cache hit, record the cache block that was touched. When a cache block must be evicted, choose one which hasn't been touched recently: "Least Recently Used" (LRU). Past is prologue: history suggests it is the least likely of the choices to be used soon.
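A hedged sketch of LRU bookkeeping for a single set, using Python's OrderedDict to track recency (names are illustrative, not from the slides):

```python
from collections import OrderedDict

class LRUSet:
    """Illustrative LRU replacement for one cache set of `ways` blocks."""
    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()  # tag -> data; order = recency (last = newest)

    def access(self, tag):
        if tag in self.blocks:
            self.blocks.move_to_end(tag)     # record the touch: now most recent
            return "hit"
        if len(self.blocks) == self.ways:
            self.blocks.popitem(last=False)  # evict the least recently used
        self.blocks[tag] = None
        return "miss"

s = LRUSet(ways=2)
print([s.access(t) for t in ["A", "B", "A", "C", "B"]])
# ['miss', 'miss', 'hit', 'miss', 'miss']  -- C evicts B, the LRU block
```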
Cache Write Strategy
There are two basic writing approaches:
Write-through: the write is done synchronously both to the cache and to the backing store.
Write-back (or write-behind): initially, writing is done only to the cache. The write to the backing store is postponed until the cache block containing the data is about to be modified/replaced by new content.
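A toy sketch of why write-back can reduce memory traffic: a single-block "cache" with a dirty bit (the class name and the one-block simplification are my own):

```python
class WriteBackCache:
    """Write-back bookkeeping sketch: writes only mark the block dirty;
    memory is updated when a dirty block is evicted."""
    def __init__(self):
        self.block = None       # (tag, dirty) or None
        self.memory_writes = 0

    def write(self, tag):
        if self.block and self.block[0] != tag and self.block[1]:
            self.memory_writes += 1  # evicting a dirty block: write it back
        self.block = (tag, True)     # the write stays in the cache, marked dirty

wb = WriteBackCache()
for t in ["A", "A", "A", "B"]:  # 3 writes to A coalesce into 1 memory write
    wb.write(t)
print(wb.memory_writes)  # 1 -- write-through would have gone to memory 4 times
```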
Review: Four Questions for Memory Hierarchy Designers
Q1: Where can a block be placed in the upper level? (Block placement) Fully Associative, Set Associative, Direct Mapped
Q2: How is a block found if it is in the upper level? (Block identification) Tag/Block
Q3: Which block should be replaced on a miss? (Block replacement) Random, LRU
Q4: What happens on a write? (Write strategy) Write Back or Write Through (with Write Buffer)
Review: Cache Performance
CPUtime = Instruction Count x (CPI_execution + Memory accesses per instruction x Miss rate x Miss penalty) x Clock cycle time
Misses per instruction = Memory accesses per instruction x Miss rate
CPUtime = IC x (CPI_execution + Misses per instruction x Miss penalty) x Clock cycle time
To Improve Cache Performance:
1. Reduce the miss rate
2. Reduce the miss penalty
3. Reduce the time to hit in the cache.
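Plugging hypothetical numbers into the CPUtime formula (all values below are made up for illustration):

```python
def cpu_time(ic, cpi_exec, mem_accesses_per_instr, miss_rate, miss_penalty, cct):
    """CPUtime = IC x (CPI_exec + mem accesses/instr x miss rate x miss penalty) x CCT."""
    misses_per_instr = mem_accesses_per_instr * miss_rate
    return ic * (cpi_exec + misses_per_instr * miss_penalty) * cct

# Hypothetical: 1M instructions, CPI 1.0, 1.5 accesses/instruction,
# 2% miss rate, 50-cycle miss penalty, 1 ns clock cycle
print(round(cpu_time(1_000_000, 1.0, 1.5, 0.02, 50, 1e-9), 6))  # 0.0025
```

With these numbers the memory stalls (0.03 misses/instruction x 50 cycles = 1.5 cycles) more than double the base CPI, which is why reducing misses matters.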
Reducing Misses
Classifying Misses: 3 Cs
Compulsory: the first access to a block is not in the cache, so the block must be brought into the cache. Also called cold start misses or first reference misses. (Misses in even an infinite cache.)
Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in a fully associative cache of size X.)
Conflict: if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. (Misses in an N-way associative cache of size X.)
3Cs Absolute Miss Rate (SPEC92)
[Figure: miss rate per type (0 to 0.14) vs. cache size (1 KB to 128 KB), with curves for 1-way, 2-way, 4-way, and 8-way associativity, broken into Conflict, Capacity, and Compulsory components. Note: the Compulsory miss component is small.]
2:1 Cache Rule
[Figure: the same miss-rate-per-type plot, illustrating the 2:1 cache rule.]
miss rate of a 1-way associative cache of size X = miss rate of a 2-way associative cache of size X/2
How Can We Reduce Misses?
3 Cs: Compulsory, Capacity, Conflict. In all cases, assume the total cache size is not changed. What happens if we:
1) Change Block Size: which of the 3Cs is obviously affected?
2) Change Associativity: which of the 3Cs is obviously affected?
3) Change Compiler: which of the 3Cs is obviously affected?
1. Reduce Misses via Larger Block Size
[Figure: miss rate (0% to 25%) vs. block size (16 to 256 bytes) for cache sizes 1K, 4K, 16K, 64K, and 256K.]
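A small simulation sketch of why larger blocks help for sequential (spatially local) access. It models only compulsory misses -- an idealization I am assuming for clarity, with a cache big enough to hold every block:

```python
def miss_rate_sequential(num_accesses, block_bytes, word_bytes=4):
    """Sequential word accesses; one compulsory miss per block touched.
    Larger blocks amortize that miss over more words (spatial locality)."""
    seen_blocks = set()
    misses = 0
    for i in range(num_accesses):
        block = (i * word_bytes) // block_bytes
        if block not in seen_blocks:
            misses += 1
            seen_blocks.add(block)
    return misses / num_accesses

for b in (16, 32, 64):
    print(b, miss_rate_sequential(1000, b))
# 16 -> 0.25, 32 -> 0.125, 64 -> 0.063: miss rate roughly halves per doubling
```

The figure's upturn at very large blocks does not appear here because this sketch ignores capacity and conflict effects: in a fixed-size cache, fewer, larger blocks eventually increase misses again.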
2. Reduce Misses via Higher Associativity
2:1 Cache Rule: miss rate of a direct-mapped cache of size N = miss rate of a 2-way cache of size N/2.
Beware: execution time is the only final measure! Hill [1988] suggested the hit time for a 2-way vs. a 1-way external cache is about +10%.
Example: Avg. Memory Access Time vs. Miss Rate
Example: assume CCT (clock cycle time) = 1.10 for 2-way, 1.12 for 4-way, 1.14 for 8-way vs. the CCT of direct mapped.

Cache Size (KB)   1-way   2-way   4-way   8-way
1                 2.33    2.15    2.07    2.01
2                 1.98    1.86    1.76    1.68
4                 1.72    1.67    1.61    1.53
8                 1.46    1.48    1.47    1.43
16                1.29    1.32    1.32    1.32
32                1.20    1.24    1.25    1.27
64                1.14    1.20    1.21    1.23
128               1.10    1.17    1.18    1.20

(In the original slide, red entries mark cases where A.M.A.T. is not improved by more associativity.)
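The table's trade-off follows from AMAT = hit time + miss rate x miss penalty; a sketch with hypothetical numbers (not taken from the table):

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate x miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Hypothetical: 2-way has a 10% slower hit (1.10 vs 1.00 clocks)
# but a lower miss rate (4% vs 5%), with a 25-clock miss penalty
print(amat(1.00, 0.05, 25))  # direct mapped: 2.25
print(amat(1.10, 0.04, 25))  # 2-way: 2.1 -- wins despite the slower hit
```

With a small miss-rate improvement or a short miss penalty, the slower hit time dominates instead, which is exactly what the red entries in the table show for larger caches.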
3. Reducing Misses via a "Victim Cache"
How can we combine the fast hit time of direct mapped and still avoid conflict misses? Add a small buffer to hold data discarded from the cache. Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct-mapped data cache. Used in Alpha and HP machines.
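An illustrative toy model of a direct-mapped cache backed by a small fully associative victim cache (the class name and details are my own; only the idea follows Jouppi [1990]):

```python
class DMWithVictim:
    """Direct-mapped cache plus a victim cache of recently evicted blocks."""
    def __init__(self, num_blocks, victim_entries=4):
        self.num_blocks = num_blocks
        self.main = {}                 # index -> tag
        self.victim = []               # list of (index, tag), max victim_entries
        self.victim_entries = victim_entries

    def access(self, block_addr):
        index = block_addr % self.num_blocks
        tag = block_addr // self.num_blocks
        if self.main.get(index) == tag:
            return "hit"
        if (index, tag) in self.victim:    # conflict miss rescued by victim cache
            self.victim.remove((index, tag))
            status = "victim-hit"
        else:
            status = "miss"
        if index in self.main:             # the evicted block goes to the victim cache
            self.victim.append((index, self.main[index]))
            self.victim = self.victim[-self.victim_entries:]
        self.main[index] = tag
        return status

c = DMWithVictim(num_blocks=4)
# blocks 0 and 4 conflict on index 0; the victim cache absorbs the ping-pong
print([c.access(b) for b in [0, 4, 0, 4]])
# ['miss', 'miss', 'victim-hit', 'victim-hit']
```

Without the victim cache, the last two accesses would be full conflict misses that go to the next level.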
4 & 5 Reducing Misses by Prefetching of Instructions & Data
Instruction prefetching: sequentially prefetch instructions from instruction memory into the instruction queue (IQ), together with branch prediction. All computers employ this.
Data prefetching: it is difficult to predict which data will be used in the future. The following questions must be answered:
1. What to prefetch? How do we know which data will be used? Unnecessary prefetches waste memory/bus bandwidth and replace useful data in the cache (the cache pollution problem), negatively impacting execution time.
2. When to prefetch? Early enough for the data to be useful, but prefetching too early will also cause cache pollution.
6. SW Prefetching
Software Prefetching: explicit instructions to prefetch data are inserted in the program. It is difficult to decide where to put them in the program; good compiler analysis is needed. Some computers already have prefetch instructions. Examples:
Register prefetch: load data into a register (HP PA-RISC loads)
Cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC v9)
Hardware Prefetching: difficult to predict and design; gives different results for different applications.
Reducing Cache Pollution, e.g., Instruction Prefetching
The Alpha 21064 fetches 2 blocks on a miss: the extra block is placed in a "stream buffer", and on a miss the stream buffer is checked.
Prefetching relies on having extra memory bandwidth that can be used without penalty.
Summary
3 Cs: Compulsory, Capacity, Conflict misses
Reducing Miss Rate:
1. Reduce misses via larger block size
2. Reduce misses via higher associativity
3. Reduce misses via a victim cache
4 & 5. Reduce misses by HW prefetching of instructions and data
6. Reduce misses by SW-controlled prefetching
7. Reduce misses by compiler optimizations
Remember the danger of concentrating on just one parameter when evaluating performance:
CPUtime = IC x (CPI_Execution + (Memory accesses / Instruction) x Miss rate x Miss penalty) x Clock cycle time
Review: Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
1. Reducing Miss Penalty: Read Priority over Write on Miss
Write-through caches with write buffers can suffer RAW (Read After Write) conflicts between buffered writes and main memory reads on cache misses.
If we simply wait for the write buffer to empty, we might increase the read miss penalty.
Instead, check the write buffer contents before the read; if there are no conflicts, let the memory access continue.
Write back? Normally, write the dirty block to memory and then do the read. Instead, copy the dirty block to a write buffer, then do the read, and then do the write.
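A sketch of the write-buffer check on a read miss (names are illustrative; real hardware does this with address comparators, not dictionaries):

```python
class WriteBuffer:
    """On a read miss, check the write buffer for a RAW conflict
    before letting the read bypass buffered writes."""
    def __init__(self):
        self.pending = {}   # address -> value buffered, not yet in memory
        self.memory = {}

    def write(self, addr, value):
        self.pending[addr] = value         # buffered; memory updated later

    def read_miss(self, addr):
        if addr in self.pending:           # RAW conflict: use the buffered value
            return self.pending[addr]      # (alternatively, drain the buffer first)
        return self.memory.get(addr, 0)    # no conflict: the read proceeds at once

buf = WriteBuffer()
buf.memory[100] = 7
buf.write(200, 42)
print(buf.read_miss(100), buf.read_miss(200))  # 7 42
```

Without the check, reading address 200 from memory before the buffered write drained would return stale data.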
Reducing Miss Penalty Summary
Five techniques:
1. Read priority over write on miss
2. Subblock placement
3. Early restart and critical word first on miss
4. Non-blocking caches (hit under miss, miss under miss)
5. Second-level cache
These can be applied recursively to multilevel caches. The danger is that the time to DRAM will grow with multiple levels in between; first attempts at L2 caches can make things worse, since an increased worst case is worse.
Assignment 3
Cache Optimization Summary

Technique                            MR   MP   HT   Complexity
Larger Block Size                    +    -         0
Higher Associativity                 +         -    1
Victim Caches                        +              2
Pseudo-Associative Caches            +              2
HW Prefetching of Instr/Data         +              2
SW/Compiler Controlled Prefetching   +              3
Compiler Reduce Misses               +              0
Priority to Read Misses                   +         1
Subblock Placement                        +    +    1
Early Restart & Critical Word 1st         +         2
Non-Blocking Caches                       +         3
Second Level Caches                       +         2

Legend: MR = miss rate, MP = miss penalty, HT = hit time