
Page 1: Lecture 13: Cache and Virtual Memory Review

Cache optimization approaches, cache miss classification

Adapted from UCB CS252 S01

Page 2: What Is a Memory Hierarchy?

A typical memory hierarchy today:

[Figure: the memory hierarchy. From fastest/smallest to slowest/biggest: Processor/Registers, L1 cache, L2 cache, L3 cache (optional), main memory, then disk, tape, etc.]

Here we focus on the L1/L2/L3 caches and main memory.

Page 3: Why Memory Hierarchy?

[Figure: processor vs. DRAM performance, 1980 to 2000, log scale. µProc performance grows 60%/yr. ("Moore's Law"), DRAM performance grows 7%/yr., and the resulting processor-memory performance gap grows 50%/year.]

1980: no cache in a µproc; 1995: 2-level cache on chip (1989: first Intel µproc with an on-chip cache)

Page 4: Generations of Microprocessors

Time of a full cache miss, in instructions executed:

1st Alpha: 340 ns / 5.0 ns = 68 clks, x2 instructions/clock, or 136 instructions
2nd Alpha: 266 ns / 3.3 ns = 80 clks, x4 instructions/clock, or 320 instructions
3rd Alpha: 180 ns / 1.7 ns = 108 clks, x6 instructions/clock, or 648 instructions

1/2X the latency x 3X the clock rate x 3X instructions/clock => 4.5X the miss cost

Page 5: Area Costs of Caches

Processor         % Area (~cost)   % Transistors (~power)
Intel 80386       0%               0%
Alpha 21164       37%              77%
StrongArm SA110   61%              94%
Pentium Pro       64%              88%
Itanium           92%

Pentium Pro has 2 dies per package: Proc/I$/D$ + L2$.

Caches store redundant data only to close the performance gap.

Page 6: What Is Cache, Exactly?

Small, fast storage used to improve the average access time to slow memory; usually built from SRAM. Exploits locality, both spatial and temporal. In computer architecture, almost everything is a cache!

The register file is the fastest place to cache variables
The first-level cache is a cache on the second-level cache
The second-level cache is a cache on memory
Memory is a cache on disk (virtual memory)
The TLB is a cache on the page table
Branch prediction is a cache on prediction information
A branch-target buffer can be implemented as a cache

Beyond architecture: file caches, browser caches, proxy caches. Here we focus on L1 and L2 caches (L3 optional) as buffers to main memory.

Page 7: Example: 1 KB Direct Mapped Cache

Assume a cache of 2^N bytes with 2^K blocks and a block size of 2^M bytes; N = M + K (cache size = #blocks times block size). The address then has a (32 - N)-bit cache tag, a K-bit cache index, and an M-bit block offset.

The cache stores a tag, data, and a valid bit for each block. The cache index is used to select a block in SRAM (recall the BHT and BTB). The block tag is compared with the input tag. A word in the data block may be selected as the output.

[Figure: a 1 KB direct-mapped cache with 32-byte blocks. The 32-bit address splits into a cache tag (bits 31-10, ex: 0x50), a cache index (bits 9-5, ex: 0x01), and a block offset (bits 4-0, ex: 0x00); the tag and index together form the block address. Each block stores a valid bit and the tag as part of the cache "state", plus 32 data bytes (Byte 0 through Byte 1023 across the whole cache).]
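As a sketch of the field arithmetic above, the following Python splits an address for this example cache (the geometry constants follow the figure; the function name is illustrative):

```python
# Split a 32-bit address for the 1 KB direct-mapped example cache:
# 2^N bytes total, 2^K blocks, 2^M bytes per block, N = M + K.
M, K = 5, 5                                 # 32-byte blocks, 32 blocks, N = 10

def split_address(addr):
    offset = addr & ((1 << M) - 1)          # low M bits: byte within block
    index  = (addr >> M) & ((1 << K) - 1)   # next K bits: which cache block
    tag    = addr >> (M + K)                # remaining (32 - N) bits
    return tag, index, offset

# Reproduce the figure's example: tag 0x50, index 0x01, offset 0x00.
addr = (0x50 << (M + K)) | (0x01 << M)
print([hex(x) for x in split_address(addr)])  # ['0x50', '0x1', '0x0']
```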

Page 8: Four Questions About Cache Design

Block placement: Where can a block be placed?

Block identification: How is a block found in the cache?

Block replacement: If a new block is to be fetched, which of the existing blocks should be replaced (if there are multiple choices)?

Write policy: What happens on a write?

Page 9: Where Can a Block Be Placed?

What is a block? The memory space is divided into blocks just as the cache is; a memory block is the basic unit to be cached.

Direct mapped cache: there is only one place in the cache to buffer a given memory block.

N-way set associative cache: N places for a given memory block. Works like N direct mapped caches operating in parallel. Reduces miss rates at the cost of increased complexity, cache access time, and power consumption.

Fully associative cache: a memory block can be put anywhere in the cache. (A sketch of the three policies follows.)
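A minimal sketch of how placement varies with associativity, assuming a cache of num_blocks blocks grouped into sets of `ways` blocks (names are illustrative, not from the slide):

```python
# Which cache slots may hold a given memory block, by associativity.
# ways = 1          -> direct mapped (one candidate slot)
# 1 < ways < blocks -> N-way set associative (N candidate slots)
# ways = blocks     -> fully associative (any slot in the cache)
def candidate_slots(block_addr, num_blocks, ways):
    num_sets = num_blocks // ways
    set_idx = block_addr % num_sets          # the set this block maps to
    return [set_idx * ways + w for w in range(ways)]

print(candidate_slots(0x1234, 64, 1))        # direct mapped: [52]
print(candidate_slots(0x1234, 64, 4))        # 4-way: [16, 17, 18, 19]
print(len(candidate_slots(0x1234, 64, 64)))  # fully associative: 64 slots
```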

Page 10: Set Associative Cache

Example: a two-way set associative cache.

The cache index selects a set of two blocks. The two tags in the set are compared with the input tag in parallel. Data is selected based on the tag comparison.

Set associative or direct mapped? Discussed later.

[Figure: two-way set associative cache. The cache index selects one (valid, tag, data) block from each way; the two tags are compared with the address tag in parallel, the comparison results are ORed to form the Hit signal, and a mux (Sel1/Sel0) selects the hitting way's cache block.]
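That lookup path can be sketched in Python as follows (illustrative structure; each way is a list of (valid, tag, data) entries indexed by set):

```python
# Two-way set associative lookup: compare both ways' tags in parallel,
# OR the per-way hits, and mux out the hitting way's data.
def lookup(ways, set_idx, addr_tag):
    entries = (ways[0][set_idx], ways[1][set_idx])
    hits = [valid and tag == addr_tag for (valid, tag, _) in entries]
    hit = hits[0] or hits[1]            # the OR gate in the figure
    if not hit:
        return False, None
    way = 0 if hits[0] else 1           # mux select (Sel0 / Sel1)
    return True, entries[way][2]

ways = [[(True, 0x50, "way0 data")], [(True, 0x7F, "way1 data")]]
print(lookup(ways, 0, 0x7F))            # (True, 'way1 data')
```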

Page 11: How to Find a Cached Block

Direct mapped cache: the stored tag for the indexed cache block matches the input tag.

Fully associative cache: any of the N stored tags matches the input tag.

Set associative cache: any of the K stored tags for the cache set matches the input tag.

Cache hit time is determined by both tag comparison and data access; it can be estimated with the CACTI model.

Page 12: Which Block to Replace?

Direct mapped cache: not an issue.

For set associative or fully associative* caches:

Random: select a candidate block randomly from the cache set
LRU (Least Recently Used): replace the block that has been unused for the longest time
FIFO (First In, First Out): replace the oldest block

Usually LRU performs best, but it is hard (and expensive) to implement.

*Think of a fully associative cache as a set associative one with a single set.
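A minimal sketch of LRU bookkeeping for one cache set, using Python's OrderedDict as the recency list (an illustration of the policy, not of a hardware implementation):

```python
from collections import OrderedDict

# One cache set with LRU replacement: the most recently used tag sits
# at the end of the OrderedDict, the least recently used at the front.
class LRUSet:
    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()           # tag -> data

    def access(self, tag, fetch):
        if tag in self.blocks:                # hit: mark most recently used
            self.blocks.move_to_end(tag)
        else:
            if len(self.blocks) >= self.ways: # miss in a full set:
                self.blocks.popitem(last=False)  # evict the LRU block
            self.blocks[tag] = fetch(tag)     # fill with the fetched block
        return self.blocks[tag]
```

Dropping the move_to_end call turns the same structure into FIFO, since blocks then stay in insertion order.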

Page 13: What Happens on Writes

Where is the data written if the block is found in the cache?

Write through: new data is written to both the cache block and the lower-level memory. Helps maintain cache consistency.

Write back: new data is written only to the cache block, and the lower-level memory is updated when the block is replaced. A dirty bit indicates whether the update is necessary. Helps reduce memory traffic.

What happens if the block is not found in the cache?

Write allocate: fetch the block into the cache, then write the data (usually combined with write back).

No-write allocate: do not fetch the block into the cache (usually combined with write through).
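The combinations above can be sketched as one write routine (a simplified model; the cache/memory objects and their lookup, fill, and read_block methods are illustrative stand-ins):

```python
# Sketch of the write policies: write through vs. write back on a hit,
# write allocate vs. no-write allocate on a miss.
def write(cache, memory, addr, data, write_back=True, write_allocate=True):
    block = cache.lookup(addr)
    if block is None and write_allocate:
        # Write allocate: fetch the block into the cache, then write.
        block = cache.fill(addr, memory.read_block(addr))
    if block is not None:
        block.write(addr, data)
        if write_back:
            block.dirty = True           # update memory at replacement time
        else:
            memory.write(addr, data)     # write through: update both levels
    else:
        memory.write(addr, data)         # no-write allocate: memory only
```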

Page 14: Real Example: Alpha 21264 Caches

[Figure: Alpha 21264 die photo with the I-cache and D-cache highlighted.]

64 KB 2-way associative instruction cache
64 KB 2-way associative data cache

Page 15: Alpha 21264 Data Cache

D-cache: 64 KB, 2-way associative
Uses the 48-bit virtual address to index the cache; the tag comes from the physical address
48-bit virtual address => 44-bit physical address
512 blocks (9-bit block index)
Cache block size of 64 bytes (6-bit offset)
The tag has 44 - (9 + 6) = 29 bits
Write back and write allocate

(We will study virtual-to-physical address translation later.)
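A quick check of that tag-width arithmetic, as a sketch in Python (the parameters are the slide's):

```python
# Alpha 21264 D-cache: tag bits are what remains of the 44-bit physical
# address after removing the block index and the block offset.
phys_addr_bits = 44     # 48-bit virtual address -> 44-bit physical address
index_bits = 9          # 512 blocks
offset_bits = 6         # 64-byte cache blocks
tag_bits = phys_addr_bits - (index_bits + offset_bits)
print(tag_bits)         # 29
```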

Page 16: Cache Performance

Calculate the average memory access time (AMAT):

Example: hit time = 1 cycle, miss penalty = 100 cycles, miss rate = 4%; then AMAT = 1 + 100 x 4% = 5 cycles.

Calculate the cache impact on processor performance:

Note that cycles spent on cache hits are usually counted as execution cycles.

AMAT = Hit time + Miss rate x Miss penalty

CPU time = (CPU execution cycles + Memory stall cycles) x Cycle time

CPU time = IC x (CPI_execution + Memory stall cycles / Instruction) x Cycle time
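A sketch of both formulas in Python; the AMAT numbers are the slide's example, while the CPI and memory-access figures in the second half are hypothetical values chosen for illustration:

```python
# AMAT = hit time + miss rate x miss penalty (slide's example).
hit_time, miss_penalty, miss_rate = 1, 100, 0.04
amat = hit_time + miss_rate * miss_penalty
print(amat)                      # 5.0 cycles

# CPU time = IC x (CPI_execution + memory stalls per instruction) x cycle time.
# Hypothetical: CPI_execution = 2.0, 1.5 memory accesses per instruction.
ic, cpi_exec, cycle_time_ns = 1_000_000, 2.0, 1.0
stalls_per_instr = 1.5 * miss_rate * miss_penalty       # = 6 cycles
cpu_time_ns = ic * (cpi_exec + stalls_per_instr) * cycle_time_ns
print(cpu_time_ns)               # 8000000.0 ns
```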

Page 17: Disadvantage of Set Associative Cache

Compare an n-way set associative cache with a direct mapped cache:

Has n comparators vs. 1 comparator
Has extra MUX delay for the data
Data comes only after the hit/miss decision and set selection

In a direct mapped cache, the cache block is available before the hit/miss decision:

Use the data assuming the access is a hit; recover if it turns out otherwise

[Figure: the two-way set associative cache organization from Page 10, repeated.]

Page 18: Virtual Memory

Virtual memory (VM) gives programs the illusion of a very large memory that is not limited by the physical memory size.

It makes main memory (DRAM) act like a cache for secondary storage (magnetic disk).

Otherwise, application programmers would have to move data in and out of main memory themselves; that is how virtual memory was first proposed.

Virtual memory also provides the following functions:

Allowing multiple processes to share the physical memory in a multiprogramming environment

Providing protection for processes (compare the Intel 8086: without VM, applications can overwrite the OS kernel)

Facilitating program relocation in the physical memory space

Page 19: VM Example

Page 20: Virtual Memory and Cache

VM address translation provides a mapping from the virtual address of the processor to the physical address in main memory and secondary storage.

Cache terms vs. VM terms:
Cache block => page
Cache miss => page fault

Tasks of hardware and the OS:
The TLB does fast address translations
The OS handles less frequent events: page faults, and TLB misses (when a software approach is used)

Page 21: Virtual Memory and Cache

Parameter          L1 Cache                     Main Memory
Block (page) size  16-128 bytes                 4 KB - 64 KB
Hit time           1-3 cycles                   50-150 cycles
Miss penalty       8-300 cycles                 1M to 10M cycles
Miss rate          0.1-10%                      0.00001-0.001%
Address mapping    25-45 bits => 13-21 bits     32-64 bits => 25-45 bits

Page 22: 4 Qs for Virtual Memory

Q1: Where can a block be placed in the upper level?

The miss penalty for virtual memory is very high => full associativity is desirable (so blocks are allowed to be placed anywhere in memory)
Software can determine the location while accessing the disk (10M cycles is enough time to do sophisticated replacement)

Q2: How is a block found if it is in the upper level?

The address is divided into a page number and a page offset
A page table and a translation buffer are used for address translation
Q: why does full associativity not affect hit time?

Page 23: 4 Qs for Virtual Memory

Q3: Which block should be replaced on a miss?

Want to reduce the miss rate, and replacement can be handled in software
Least Recently Used (LRU) is typically used
A typical approximation of LRU (sketched below):
Hardware sets reference bits
The OS records the reference bits and clears them periodically
The OS selects a page among the least-recently referenced for replacement

Q4: What happens on a write?

Writing to disk is very expensive
Use a write-back strategy
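A minimal sketch of that reference-bit approximation of LRU (illustrative Python; a real OS keeps these bits in the page table entries):

```python
import random

# Approximate LRU with reference bits: hardware sets a page's bit on each
# access; the OS periodically records which bits are clear and resets them;
# victims are chosen among pages not referenced in the last interval.
class RefBitLRU:
    def __init__(self, pages):
        self.ref_bit = {p: False for p in pages}
        self.not_recently_used = list(pages)

    def touch(self, page):                  # done by hardware on each access
        self.ref_bit[page] = True

    def os_tick(self):                      # done by the OS periodically
        self.not_recently_used = [p for p, r in self.ref_bit.items() if not r]
        for p in self.ref_bit:              # clear all reference bits
            self.ref_bit[p] = False

    def pick_victim(self):
        pool = self.not_recently_used or list(self.ref_bit)
        return random.choice(pool)          # a least-recently-referenced page
```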

Page 24: Virtual-Physical Translation

A virtual address consists of a virtual page number and a page offset. The virtual page number gets translated to a physical page number; the page offset is not changed.

[Figure: translation from virtual address to physical address. Virtual address = virtual page number (36 bits) + page offset (12 bits); physical address = physical page number (33 bits) + the unchanged page offset (12 bits).]
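A sketch of that translation with the figure's field widths (the single-entry dict stands in for the page table; the 12-bit offset implies 4 KB pages):

```python
# Translate a virtual address (36-bit VPN + 12-bit offset) into a
# physical address (33-bit PPN + the same 12-bit offset).
OFFSET_BITS = 12                            # 4 KB pages

def translate(vaddr, page_table):
    vpn = vaddr >> OFFSET_BITS              # virtual page number
    offset = vaddr & ((1 << OFFSET_BITS) - 1)
    ppn = page_table[vpn]                   # physical page number
    return (ppn << OFFSET_BITS) | offset    # offset passes through unchanged

page_table = {0x123: 0x4F}                  # stand-in one-entry "page table"
print(hex(translate(0x123ABC, page_table))) # 0x4fabc
```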

Page 25: Address Translation via Page Table

Assume the access hits in main memory.

Page 26: TLB: Improving Page Table Access

We cannot afford to access the page table for every access, including cache hits (the cache itself would then make no sense). So, again, use a cache to speed up accesses to the page table (a cache for the cache?).

The TLB (translation lookaside buffer) stores frequently accessed page table entries. A TLB entry is like a cache entry:

The tag holds portions of the virtual address
The data portion holds the physical page number, a protection field, a valid bit, a use bit, and a dirty bit (as in a page table entry)

TLBs are usually fully associative or highly set associative
Usually 64 or 128 entries
The page table is accessed only on TLB misses
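A sketch of that flow (illustrative Python; the TLB is modeled as a small dict with random eviction for brevity, where real TLBs are associative hardware):

```python
import random

TLB_ENTRIES = 64                   # a typical size from the slide
tlb = {}                           # vpn -> page table entry (stand-in)

def translate_vpn(vpn, page_table):
    if vpn in tlb:                 # TLB hit: fast translation, no table walk
        return tlb[vpn]
    entry = page_table[vpn]        # TLB miss: access the page table
    if len(tlb) >= TLB_ENTRIES:    # make room by evicting an entry
        tlb.pop(random.choice(list(tlb)))
    tlb[vpn] = entry               # refill the TLB with the new entry
    return entry
```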

Page 27: TLB Characteristics

The following are typical characteristics of TLBs:

TLB size: 32 to 4,096 entries
Block size: 1 or 2 page table entries (4 or 8 bytes each)
Hit time: 0.5 to 1 clock cycle
Miss penalty: 10 to 30 clock cycles (to go to the page table)
Miss rate: 0.01% to 0.1%
Associativity: fully associative or set associative
Write policy: write back (replace infrequently)