1
Lecture 13: Cache and Virtual Memory Review
Cache optimization approaches, cache miss classification,
Adapted from UCB CS252 S01
2
A typical memory hierarchy today:
Here we focus on L1/L2/L3 caches and main memory
What Is Memory Hierarchy
Proc/RegsL1-
CacheL2-Cache
Memory
Disk, Tape, etc.
Bigger Faster
L3-Cache (optional)
3
1980: no cache in µproc; 1995: 2-level cache on chip (1989: first Intel µproc with a cache on chip)
Why Memory Hierarchy?
[Figure: processor vs. DRAM performance, 1980-2000, log scale. CPU performance ("Moore's Law") improves ~60%/yr while DRAM improves only ~7%/yr, so the processor-memory performance gap grows ~50%/yr.]
4
Generations of Microprocessors
Time of a full cache miss in instructions executed:
1st Alpha: 340 ns / 5.0 ns = 68 clks x 2, or 136 instructions
2nd Alpha: 266 ns / 3.3 ns = 80 clks x 4, or 320 instructions
3rd Alpha: 180 ns / 1.7 ns = 108 clks x 6, or 648 instructions
1/2X latency x 3X clock rate x 3X instr/clock => ~4.5X
5
Area Costs of Caches
Processor          % Area (cost)   % Transistors (power)
Intel 80386        0%              0%
Alpha 21164        37%             77%
StrongArm SA110    61%             94%
Pentium Pro        64%             88%   (2 dies per package: Proc/I$/D$ + L2$)
Itanium            92%             -

Caches store redundant data only to close the performance gap
6
What Is Cache, Exactly?
Small, fast storage used to improve the average access time to slow memory; usually built from SRAM. It exploits locality, both spatial and temporal. In computer architecture, almost everything is a cache!
- The register file is the fastest place to cache variables
- The first-level cache is a cache on the second-level cache
- The second-level cache is a cache on memory
- Memory is a cache on disk (virtual memory)
- The TLB is a cache on the page table
- Branch prediction is a cache on prediction information?
- The branch-target buffer can be implemented as a cache
Beyond architecture: file cache, browser cache, proxy cache. Here we focus on L1 and L2 caches (L3 optional) as buffers to main memory.
7
Example: 1 KB Direct Mapped Cache
Assume a cache of 2^N bytes with 2^K blocks and a block size of 2^M bytes; N = M + K (number of blocks times block size)
(32 - N)-bit cache tag, K-bit cache index, and M-bit block offset
The cache stores a tag, data, and a valid bit for each block
- The cache index is used to select a block in SRAM (recall BHT, BTB)
- The block tag is compared with the input tag
- A word in the data block may be selected as the output
[Figure: 1 KB direct mapped cache with 32-byte blocks. The 32-bit address splits into a 5-bit block offset (bits 4-0, e.g. 0x00), a 5-bit cache index (bits 9-5, e.g. 0x01), and a 22-bit cache tag (bits 31-10, e.g. 0x50). The valid bit and tag are stored as part of the cache "state"; bytes 0-31 form block 0, bytes 32-63 block 1, ..., bytes 992-1023 the last block.]
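The tag/index/offset split described above can be sketched in Python. This is a hypothetical helper, assuming the slide's 1 KB direct mapped cache with 32-byte blocks (so a 5-bit offset, 5-bit index, and 22-bit tag):

```python
# Decompose a 32-bit address for a 1 KB direct mapped cache
# with 32-byte blocks: 5-bit offset, 5-bit index, 22-bit tag.
CACHE_BYTES = 1024                            # 2^N, N = 10
BLOCK_BYTES = 32                              # 2^M, M = 5
NUM_BLOCKS = CACHE_BYTES // BLOCK_BYTES       # 2^K = 32, K = 5

OFFSET_BITS = BLOCK_BYTES.bit_length() - 1    # M = 5
INDEX_BITS = NUM_BLOCKS.bit_length() - 1      # K = 5

def split_address(addr):
    offset = addr & (BLOCK_BYTES - 1)                 # bits 4-0
    index = (addr >> OFFSET_BITS) & (NUM_BLOCKS - 1)  # bits 9-5
    tag = addr >> (OFFSET_BITS + INDEX_BITS)          # bits 31-10
    return tag, index, offset

# The slide's example: tag 0x50, index 0x01, offset 0x00
tag, index, offset = split_address((0x50 << 10) | (0x01 << 5) | 0x00)
print(hex(tag), hex(index), hex(offset))   # 0x50 0x1 0x0
```

The same three fields drive the hardware in the figure: the index selects the SRAM row, the tag is compared, and the offset picks the byte within the block.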
8
Four Questions About Cache Design
- Block placement: Where can a block be placed?
- Block identification: How is a block found in the cache?
- Block replacement: If a new block is to be fetched, which of the existing blocks should be replaced? (if there are multiple choices)
- Write policy: What happens on a write?
9
Where Can a Block Be Placed
What is a block: divide the memory space into blocks in the same way the cache is divided; a memory block is the basic unit to be cached
- Direct mapped cache: there is only one place in the cache to buffer a given memory block
- N-way set associative cache: N places for a given memory block; like N direct mapped caches operating in parallel. Reduces miss rates at the cost of increased complexity, cache access time, and power consumption
- Fully associative cache: a memory block can be put anywhere in the cache
10
Set Associative Cache
Example: two-way set associative cache
- The cache index selects a set of two blocks
- The two tags in the set are compared to the input tag in parallel
- Data is selected based on the tag comparison
Set associative or direct mapped? Discussed later.
[Figure: two-way set associative cache. The cache index selects one cache block from each way; each way's stored tag is compared (Compare) against the address tag (Adr Tag), the results are ORed to form Hit, and a mux (Sel1/Sel0) selects the hitting way's data as the output cache block.]
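The parallel tag comparison and way selection in the figure can be modeled with a rough sketch (all names here are hypothetical, not from the slides):

```python
# Minimal model of a lookup in a 2-way set associative cache:
# the index selects a set, every way's tag is compared, and the
# per-way hit signals are ORed together (as in the figure).
NUM_SETS = 4
WAYS = 2

# Each entry is (valid, tag, data); initially invalid.
cache = [[(False, 0, None)] * WAYS for _ in range(NUM_SETS)]

def lookup(index, tag):
    hits = [valid and t == tag for (valid, t, _) in cache[index]]
    if any(hits):                       # the OR gate in the figure
        way = hits.index(True)          # Sel1/Sel0 mux select
        return cache[index][way][2]
    return None                         # miss

cache[1][0] = (True, 0x50, "block A")   # fill one way of set 1
print(lookup(1, 0x50))   # hit in way 0 -> block A
print(lookup(1, 0x33))   # miss -> None
```

In hardware the comparisons happen simultaneously, one comparator per way; the list comprehension only mimics that parallelism sequentially.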
11
How to Find a Cached Block
- Direct mapped cache: the stored tag for the cache block matches the input tag
- Fully associative cache: any of the N stored tags matches the input tag
- Set associative cache: any of the K stored tags for the cache set matches the input tag
The cache hit time is decided by both the tag comparison and the data access; it can be estimated with the CACTI model
12
Which Block to Replace?
Direct mapped cache: not an issue
For set associative or fully associative* caches:
- Random: select candidate blocks randomly from the cache set
- LRU (Least Recently Used): replace the block that has been unused for the longest time
- FIFO (First In, First Out): replace the oldest block
Usually LRU performs the best, but it is hard (and expensive) to implement
*Think of a fully associative cache as a set associative one with a single set
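The LRU policy above can be sketched for a single cache set; this is an illustrative model (the `LRUSet` class is invented here), not how hardware tracks recency:

```python
# Sketch of LRU replacement within one cache set, using an
# ordered dict: most recently used at the end, victim at front.
from collections import OrderedDict

class LRUSet:
    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()            # tag -> data

    def access(self, tag):
        if tag in self.blocks:                 # hit: mark as MRU
            self.blocks.move_to_end(tag)
            return True
        if len(self.blocks) >= self.ways:      # miss: evict LRU block
            self.blocks.popitem(last=False)
        self.blocks[tag] = None
        return False

s = LRUSet(ways=2)
for t in [1, 2, 1, 3]:     # the access to 3 evicts 2 (the LRU block)
    s.access(t)
print(list(s.blocks))      # [1, 3]
```

Real hardware approximates this with a few recency bits per set precisely because maintaining a full ordering, as this sketch does, is expensive.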
13
What Happens on Writes
Where to write the data if the block is found in cache?
- Write through: new data is written to both the cache block and the lower-level memory. Helps to maintain cache consistency
- Write back: new data is written only to the cache block; the lower-level memory is updated when the block is replaced, and a dirty bit is used to indicate the necessity. Helps to reduce memory traffic
What happens if the block is not found in cache?
- Write allocate: fetch the block into the cache, then write the data (usually combined with write back)
- No-write allocate: do not fetch the block into the cache (usually combined with write through)
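The memory-traffic difference between the two write policies can be made concrete with a toy comparison (function names are hypothetical; this assumes all writes hit the same cached block):

```python
# Count lower-level memory writes caused by repeated writes
# to a single cached block under each policy.
def write_through(num_writes):
    return num_writes          # every write also goes to memory

def write_back(num_writes):
    # Only the dirty block is written back, once, on eviction.
    dirty = num_writes > 0
    return 1 if dirty else 0

print(write_through(100))   # 100 memory writes
print(write_back(100))      # 1 memory write (at replacement)
```

This is why write back reduces memory traffic: repeated writes to a hot block cost one write-back instead of one memory write per store.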
14
Real Example: Alpha 21264 Caches
I-cache: 64 KB, 2-way set associative instruction cache
D-cache: 64 KB, 2-way set associative data cache
15
Alpha 21264 Data Cache
D-cache: 64 KB, 2-way set associative
- Uses the 48-bit virtual address to index the cache, and a tag from the physical address
- 48-bit virtual => 44-bit physical address
- 512 blocks per way (9-bit block index)
- Cache block size of 64 bytes (6-bit offset)
- Tag has 44 - (9 + 6) = 29 bits
- Write back and write allocate
(We will study virtual-physical address translation later)
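The slide's tag-width arithmetic can be checked directly; the few lines below just recompute the D-cache geometry from the numbers given:

```python
# Recompute the Alpha 21264 D-cache geometry from the slide.
physical_addr_bits = 44
block_bytes = 64                  # => 6-bit block offset
offset_bits = 6
index_bits = 9                    # 512 blocks per way
ways = 2

tag_bits = physical_addr_bits - (index_bits + offset_bits)
print(tag_bits)                   # 29 tag bits, as on the slide

capacity = ways * (1 << index_bits) * block_bytes
print(capacity // 1024)           # 64 (KB total)
```

Note the index and offset come from the virtual address while the 29-bit tag comes from the 44-bit physical address, which is what "virtually indexed, physically tagged" means here.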
16
Cache Performance
Calculate the average memory access time (AMAT):
AMAT = hit time + miss rate x miss penalty
Example: hit time = 1 cycle, miss penalty = 100 cycles, miss rate = 4%; then AMAT = 1 + 100 x 4% = 5 cycles
Calculate the cache impact on processor performance:
CPU time = (CPU execution cycles + memory stall cycles) x cycle time
CPU time = IC x (CPI_execution + memory stall cycles per instruction) x cycle time
Note: cycles spent on cache hits are usually counted into execution cycles
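The two formulas above can be exercised with the slide's AMAT example; the CPU-time numbers below are illustrative values chosen here, not from the slide:

```python
# AMAT = hit time + miss rate * miss penalty
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

print(amat(1, 0.04, 100))   # 5.0 cycles, matching the slide

# CPU time = IC * (CPI_execution + memory stall cycles per
# instruction) * cycle time
def cpu_time(ic, cpi_exec, mem_stalls_per_instr, cycle_time):
    return ic * (cpi_exec + mem_stalls_per_instr) * cycle_time

# Illustrative: 10^9 instructions, base CPI 1.0, 1.2 stall
# cycles per instruction, 1 ns clock => 2.2 seconds.
print(cpu_time(1e9, 1.0, 1.2, 1e-9))
```

Separating the stall term makes the slide's note concrete: hit cycles live inside `cpi_exec`, and only miss-induced stalls appear in `mem_stalls_per_instr`.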
17
Disadvantage of Set Associative Cache
Compare an n-way set associative cache with a direct mapped cache:
- n comparators vs. 1 comparator
- Extra MUX delay for the data
- Data comes after the hit/miss decision and set selection
In a direct mapped cache, the cache block is available before the hit/miss decision:
- Use the data assuming the access is a hit; recover if it is found otherwise
[Figure: the two-way set associative cache datapath from before: the index selects a set, tags are compared in parallel, the results are ORed into Hit, and a mux selects the data block.]
18
Virtual Memory
Virtual memory (VM) allows programs to have the illusion of a very large memory that is not limited by the physical memory size
- It makes main memory (DRAM) act like a cache for secondary storage (magnetic disk)
- Otherwise, application programmers would have to move data in and out of main memory themselves
- That is how virtual memory was first proposed
Virtual memory also provides the following functions:
- Allowing multiple processes to share the physical memory in a multiprogramming environment
- Providing protection for processes (compare the Intel 8086: without VM, applications can overwrite the OS kernel)
- Facilitating program relocation in the physical memory space
19
VM Example
20
Virtual Memory and Cache
VM address translation provides a mapping from the virtual address of the processor to the physical address in main memory and secondary storage.
Cache terms vs. VM terms:
- Cache block => page
- Cache miss => page fault
Tasks of hardware and OS:
- The TLB does fast address translations
- The OS handles the less frequent events: page faults and TLB misses (when a software approach is used)
Virtual Memory and Cache
Parameter          L1 Cache                      Main Memory
Block (page) size  16-128 bytes                  4 KB - 64 KB
Hit time           1-3 cycles                    50-150 cycles
Miss penalty       8-300 cycles                  1 M to 10 M cycles
Miss rate          0.1-10%                       0.00001-0.001%
Address mapping    25-45 bits => 13-21 bits      32-64 bits => 25-45 bits
4 Qs for Virtual Memory
Q1: Where can a block be placed in the upper level?
- The miss penalty for virtual memory is very high => full associativity is desirable (so blocks can be placed anywhere in memory)
- Have software determine the location while accessing the disk (10 M cycles is enough time to do sophisticated replacement)
Q2: How is a block found if it is in the upper level?
- The address is divided into a page number and a page offset
- A page table and translation buffer are used for address translation
- Q: why does full associativity not affect hit time?
23
4 Qs for Virtual Memory
Q3: Which block should be replaced on a miss?
- Want to reduce the miss rate, and can handle replacement in software
- Least Recently Used (LRU) is typically used
- A typical approximation of LRU:
  - Hardware sets reference bits
  - The OS records the reference bits and clears them periodically
  - The OS selects a page among the least-recently referenced ones for replacement
Q4: What happens on a write?
- Writing to disk is very expensive
- Use a write-back strategy
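The reference-bit approximation of LRU described in Q3 can be sketched as follows (the function names and the hardware/OS split modeled here are illustrative):

```python
import random

# Reference-bit approximation of LRU: hardware sets a bit on
# each access; the OS periodically inspects the bits, picks a
# victim among pages whose bit is clear, and resets all bits.
ref_bits = {}                # page -> reference bit ("hardware" state)

def touch(page):
    ref_bits[page] = 1       # hardware sets the bit on access

def os_pick_victim():
    # Pages with a clear bit were not referenced this epoch.
    not_recent = [p for p, bit in ref_bits.items() if bit == 0]
    pool = not_recent if not_recent else list(ref_bits)
    victim = random.choice(pool)
    for p in ref_bits:       # OS clears all bits for the next epoch
        ref_bits[p] = 0
    return victim

for p in ["A", "B", "C"]:
    ref_bits[p] = 0
touch("A"); touch("C")       # B is not referenced this epoch
print(os_pick_victim())      # B
```

The approximation is coarse: it only distinguishes "referenced this epoch" from "not referenced", which is far cheaper than exact LRU ordering over thousands of pages.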
24
Virtual-Physical Translation
A virtual address consists of a virtual page number and a page offset. The virtual page number gets translated to a physical page number; the page offset is not changed.
[Figure: a 36-bit virtual page number plus a 12-bit page offset form the virtual address; translation maps it to a 33-bit physical page number plus the same 12-bit page offset, forming the physical address.]
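The translation in the figure amounts to swapping the page number and keeping the offset; a minimal sketch with a toy page-table mapping (the mapping values are invented here):

```python
# Translate a virtual address: the virtual page number (VPN) is
# mapped to a physical page number (PPN); the 12-bit page offset
# passes through unchanged.
PAGE_OFFSET_BITS = 12
PAGE_SIZE = 1 << PAGE_OFFSET_BITS     # 4 KB pages

page_table = {0xA: 0x155}             # VPN -> PPN (toy mapping)

def translate(vaddr):
    vpn = vaddr >> PAGE_OFFSET_BITS
    offset = vaddr & (PAGE_SIZE - 1)
    ppn = page_table[vpn]             # a missing entry would be a page fault
    return (ppn << PAGE_OFFSET_BITS) | offset

print(hex(translate(0xA123)))   # 0x155123: new page number, same offset
```

Because the low 12 bits are untouched, a cache indexed only by those offset bits can be accessed in parallel with translation, which is part of why full associativity in VM does not hurt hit time.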
25
Address Translation Via Page Table
Assume the access hits in main memory
26
TLB: Improving Page Table Access
We cannot afford to access the page table for every access, including cache hits (then the cache itself would make no sense)
Again, use a cache to speed up accesses to the page table! (a cache for the cache?)
TLB stands for translation lookaside buffer; it stores frequently accessed page table entries
A TLB entry is like a cache entry:
- The tag holds portions of the virtual address
- The data portion holds the physical page number, protection field, valid bit, use bit, and dirty bit (as in a page table entry)
Usually fully associative or highly set associative
Usually 64 or 128 entries
The page table is accessed only on TLB misses
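The TLB-before-page-table flow can be sketched like this (a toy model with invented names; the replacement policy is deliberately simplistic):

```python
# Sketch of a small fully associative TLB in front of the page
# table: a hit gives a fast translation; a miss walks the page
# table and refills the TLB.
TLB_ENTRIES = 4
tlb = {}                                         # vpn -> ppn
page_table = {i: i + 0x100 for i in range(16)}   # toy page table

def tlb_translate(vpn):
    if vpn in tlb:
        return tlb[vpn], "TLB hit"
    ppn = page_table[vpn]              # slow path: page table walk
    if len(tlb) >= TLB_ENTRIES:        # crude replacement: evict one entry
        tlb.pop(next(iter(tlb)))
    tlb[vpn] = ppn                     # refill the TLB
    return ppn, "TLB miss"

print(tlb_translate(3))   # first access misses and walks the page table
print(tlb_translate(3))   # repeat access hits in the TLB
```

A real TLB would also carry the protection, valid, use, and dirty bits listed above, and its small size (64-128 entries on the slide) is what makes full associativity affordable.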
27
TLB Characteristics
The following are typical characteristics of TLBs:
- TLB size: 32 to 4,096 entries
- Block size: 1 or 2 page table entries (4 or 8 bytes each)
- Hit time: 0.5 to 1 clock cycle
- Miss penalty: 10 to 30 clock cycles (go to the page table)
- Miss rate: 0.01% to 0.1%
- Associativity: fully associative or set associative
- Write policy: write back (replacement is infrequent)