
  • The Memory Hierarchy

    1. Overview of the memory hierarchy

    2. Basics of Cache Memories

    3. Virtual Memory

    4. Satisfying a memory request

    5. Performance and some cache memory optimizations


  • The Memory Hierarchy

    Basic Problem: How to design a reasonable-cost memory system that can deliver data at speeds close to the CPU’s consumption rate.

    Current Answer: Construct a memory hierarchy with slow (inexpensive, large-size) components at the higher levels and with fast (most expensive, smallest) components at the lowest level.

    Migration: As information is referenced, migrate the data into and out of the lowest-level memories.

    [Figure: the memory hierarchy. The CPU/core has split L1 I-Cache and D-Cache, backed by an L2 cache, an L3 cache, main memory, and disk.]

  • How does this help?

    - Programs are well behaved and tend to follow the observed “principles of locality” for programs.
      - Principle of temporal locality: states that a referenced data object will likely be referenced again in the near future.
      - Principle of spatial locality: states that if data at location x is referenced at time t then it is likely that a nearby data location x + ∆x will be referenced at a nearby time t + ∆t.
    - Also consider the 90/10 rule: A program executes about 90% of its instructions from about 10% of its code space.
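    To make the locality principles concrete, here is a small C sketch (my own illustration, not from the slides): summing a matrix row by row touches memory sequentially and uses every word of each fetched cache block, while summing column by column strides across rows and wastes most of each block.

        #include <stdio.h>

        #define N 1024
        static double a[N][N];

        /* Row-major traversal: consecutive iterations touch adjacent
         * addresses (spatial locality); sum stays hot in a register
         * (temporal locality). */
        double sum_rows(void) {
            double sum = 0.0;
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    sum += a[i][j];
            return sum;
        }

        /* Column-major traversal: successive accesses are N * 8 bytes
         * apart, so nearly every access lands in a different block. */
        double sum_cols(void) {
            double sum = 0.0;
            for (int j = 0; j < N; j++)
                for (int i = 0; i < N; i++)
                    sum += a[i][j];
            return sum;
        }

        int main(void) { printf("%f %f\n", sum_rows(), sum_cols()); return 0; }

    Both functions compute the same result; only the access order, and hence the miss rate, differs.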

  • Cache Memories

    - Similar to paged virtual memory, with fixed-size blocks mapped into the cache (from the [physical or virtual] memory space).
    - Due to speed considerations, all operations are implemented in hardware.
    - 4 types of mapping policies:
      - fully associative
      - direct mapped
      - set associative
      - sector mapped (not discussed further)

  • Fully Associative Cache

    Any block in the (physical or virtual) memory can be mapped into any block in the cache (just like paged virtual memory). Memory addresses are formed as ⟨n | d⟩, where n is the memory tag and d is the address in the block.

    [Figure: fully associative mapping. Any of memory blocks 0..M can be placed in any of the N cache block frames; each frame stores the tag n of the block it currently holds.]

    Lookup Algorithm

    if ∃α : 0 ≤ α < N ∧ cache[α].tag = n
        then return cache[α].word[d]
        else cache-miss
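    A minimal C sketch of this lookup (illustrative only: type and field names are my own, and real hardware compares all N tags in parallel rather than in a loop):

        #include <stdbool.h>
        #include <stdint.h>

        #define N 64       /* block frames (assumed)    */
        #define WORDS 16   /* words per block (assumed) */

        struct frame {
            bool     valid;
            uint32_t tag;          /* the memory tag n */
            uint32_t word[WORDS];  /* block data       */
        };
        static struct frame cache[N];

        /* Search every frame for tag n; return word d on a hit. */
        bool fa_lookup(uint32_t n, uint32_t d, uint32_t *out) {
            for (uint32_t a = 0; a < N; a++) {   /* parallel in hardware */
                if (cache[a].valid && cache[a].tag == n) {
                    *out = cache[a].word[d];
                    return true;                 /* cache hit  */
                }
            }
            return false;                        /* cache-miss */
        }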

  • Direct Mapped Cache

    If the cache is partitioned into N fixed-size blocks, then each cache block k will contain only the (physical or virtual) memory blocks k + nN, where n = 0, 1, .... Memory addresses are formed as ⟨n | k | d⟩, where n is the memory tag, k is the cache block index, and d is the address in the block.

    [Figure: direct-mapped cache. Memory blocks 0, 1, ..., N−1, N, N+1, ... map to cache blocks 0, 1, ..., N−1, 0, 1, ...; each cache block stores the tag of the memory block it holds.]

    Lookup Algorithm

    if cache[k].tag = n
        then return cache[k].word[d]
        else cache-miss
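    A C sketch of the direct-mapped case, including one plausible slicing of an address into ⟨n | k | d⟩ (all field widths are assumptions; addresses are word-addressed for simplicity):

        #include <stdbool.h>
        #include <stdint.h>

        #define N 256      /* cache blocks: 8 index bits (assumed)     */
        #define WORDS 16   /* words per block: 4 offset bits (assumed) */

        struct frame { bool valid; uint32_t tag; uint32_t word[WORDS]; };
        static struct frame cache[N];

        /* Assumed layout: | tag n | 8-bit index k | 4-bit offset d | */
        bool dm_lookup(uint32_t addr, uint32_t *out) {
            uint32_t d = addr & 0xF;          /* address in the block */
            uint32_t k = (addr >> 4) & 0xFF;  /* cache block index    */
            uint32_t n = addr >> 12;          /* memory tag           */

            if (cache[k].valid && cache[k].tag == n) {
                *out = cache[k].word[d];      /* only one place to look */
                return true;
            }
            return false;                     /* cache-miss */
        }

    No search is needed: block k is the only frame that can hold the address, which makes direct mapping fast and cheap but prone to conflict misses.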

  • Set Associative Cache

    Midpoint between direct mapped and fully associative (at the extremes it degenerates to each). Basic idea: divide the cache into S sets with E = N/S block frames per set (N total blocks in the cache). Memory addresses are formed as ⟨n | k | d⟩, where n is the memory tag, k is the cache set index, and d is the address in the block.

    [Figure: set associative cache. The cache is divided into sets 0..S−1, each holding E tagged block frames; memory blocks 0, 1, ..., S−1, S, S+1, ... map to sets 0, 1, ..., S−1, 0, 1, ...]

    Lookup Algorithm

    if ∃α : 0 ≤ α < E ∧ cache[k].block[α].tag = n
        then return cache[k].block[α].word[d]
        else cache-miss
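    The corresponding C sketch: the set index selects a set, and only that set's E tags are searched (compared in parallel in hardware). Sizes are again assumptions:

        #include <stdbool.h>
        #include <stdint.h>

        #define S 64       /* sets: 6 index bits (assumed)             */
        #define E 4        /* frames per set, i.e. 4-way (assumed)     */
        #define WORDS 16   /* words per block: 4 offset bits (assumed) */

        struct frame { bool valid; uint32_t tag; uint32_t word[WORDS]; };
        static struct frame cache[S][E];

        /* Assumed layout: | tag n | 6-bit set index k | 4-bit offset d | */
        bool sa_lookup(uint32_t addr, uint32_t *out) {
            uint32_t d = addr & 0xF;
            uint32_t k = (addr >> 4) & 0x3F;
            uint32_t n = addr >> 10;

            for (uint32_t a = 0; a < E; a++) {   /* E comparators in hardware */
                if (cache[k][a].valid && cache[k][a].tag == n) {
                    *out = cache[k][a].word[d];
                    return true;                 /* cache hit  */
                }
            }
            return false;                        /* cache-miss */
        }

    Setting S = 1 recovers the fully associative lookup, and E = 1 the direct-mapped one, matching the remark above that the scheme degenerates to each at the extremes.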

  • Cache Policies

    - Read/Write Policies
      - Read through
      - Critical word first
      - Write through
      - Write back
      - Write allocate/no-allocate
    - Structure
      - Unified or Split Caches
      - Contents across Cache Levels
        - Multilevel inclusion (Pentium 4)
        - Multilevel exclusion (Opteron)

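    To illustrate the write policies above, a simplified C sketch of what a store does under write back with write allocate versus write through with no-allocate (my own model of these common pairings; the mem_* helpers are placeholders for the actual memory traffic):

        #include <stdbool.h>
        #include <stdint.h>

        #define WORDS 16

        struct frame { bool valid, dirty; uint32_t tag; uint32_t word[WORDS]; };

        /* Stubs standing in for real memory operations. */
        static void mem_write_block(uint32_t tag, const uint32_t *w) { (void)tag; (void)w; }
        static void mem_read_block(uint32_t tag, uint32_t *w) { (void)tag; (void)w; }
        static void mem_write_word(uint32_t tag, uint32_t d, uint32_t v) { (void)tag; (void)d; (void)v; }

        /* Write back + write allocate: stores update only the cache;
         * a dirty block is copied to memory when it is evicted. */
        void store_wb(struct frame *f, uint32_t n, uint32_t d, uint32_t v) {
            if (!(f->valid && f->tag == n)) {          /* write miss */
                if (f->valid && f->dirty)
                    mem_write_block(f->tag, f->word);  /* copy out old block */
                mem_read_block(n, f->word);            /* allocate block n   */
                f->tag = n; f->valid = true;
            }
            f->word[d] = v;
            f->dirty = true;   /* memory stays stale until eviction */
        }

        /* Write through + no-allocate: every store goes to memory
         * (often via a write buffer); the cache is updated only if
         * the block already happens to be resident. */
        void store_wt(struct frame *f, uint32_t n, uint32_t d, uint32_t v) {
            mem_write_word(n, d, v);
            if (f->valid && f->tag == n)
                f->word[d] = v;
        }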

  • Lecture break; continue next class


  • Virtual Memory

    Addresses used by programs do not correspond to the actual addresses of the program/data locations in main memory. Instead, there exists a translation mechanism between the CPU and memory that associates CPU (virtual) addresses with their actual (physical) addresses in memory.

    [Figure: the CPU issues a virtual address; an address translation step maps it to the physical address used to access main memory.]

    The two most important types:
    - Paged virtual memory
    - Segmented virtual memory

  • Demand Paged Virtual Memory

    - Logically subdivide virtual and physical spaces into fixed-size units called pages.
    - Keep virtual space on disk (swap for working/dirty pages).
    - As referenced (on demand), bring pages into main memory (and into the memory hierarchy) and update the page table.
    - Virtual pages can be mapped into any available physical page.
    - Requires some page replacement algorithm: random, FIFO, LRU, LFU (see the LRU sketch below).
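    As one concrete replacement policy, a minimal LRU sketch over physical page frames using reference timestamps (an illustration of the idea; a real O/S approximates LRU far more cheaply):

        #include <stdint.h>

        #define FRAMES 8   /* physical page frames (kept small for illustration) */

        struct pframe {
            int      vpage;  /* virtual page held; assume initialized to -1 (free) */
            uint64_t last;   /* logical time of the last reference                 */
        };
        static struct pframe frames[FRAMES];
        static uint64_t now;

        /* Record a reference to the page held in frame i. */
        void touch(int i) { frames[i].last = ++now; }

        /* Pick a victim: a free frame if any, else the least
         * recently used (smallest timestamp). */
        int lru_victim(void) {
            int victim = 0;
            for (int i = 0; i < FRAMES; i++) {
                if (frames[i].vpage == -1) return i;
                if (frames[i].last < frames[victim].last) victim = i;
            }
            return victim;
        }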

  • Paged Virtual Memory: Structure

    [Figure: pages of the virtual program space (virtual addresses) are mapped individually onto available page frames in main memory (physical addresses).]

  • Paged Virtual Memory: Address Translation

    [Figure: the virtual address splits into ⟨page number | page offset⟩. The page number indexes the page table, whose entries carry a valid bit, dirty bit, and other info; the resulting physical page number is concatenated with the unchanged page offset to form the physical address.]
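    A C sketch of this translation path (page size, table layout, and field names are assumptions; the real translation is done in hardware, as discussed later):

        #include <stdbool.h>
        #include <stdint.h>

        #define PAGE_BITS 12              /* 4 KiB pages (assumed)         */
        #define NPAGES (1u << 20)         /* 32-bit VA: 2^20 virtual pages */

        struct pte { bool valid, dirty; uint32_t pfn; };  /* page table entry */
        static struct pte page_table[NPAGES];

        /* Translate a virtual address. Returns false on a page fault
         * (invalid entry): the O/S must then bring the page in. */
        bool translate(uint32_t va, uint32_t *pa) {
            uint32_t vpn    = va >> PAGE_BITS;               /* page number */
            uint32_t offset = va & ((1u << PAGE_BITS) - 1);  /* page offset */

            if (!page_table[vpn].valid)
                return false;             /* page fault: trap to O/S */

            *pa = (page_table[vpn].pfn << PAGE_BITS) | offset;
            return true;
        }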

  • Segmented Virtual Memory

    - Identify segments in virtual space by the logical structure of the program. Using logical structure, we can assign access control to the segment.
    - Dynamically map segments into physical space as segments are referenced (on demand). Update the segment table accordingly.
    - Keep virtual space on disk (swap for working/dirty pages).
    - Need a segment placement algorithm: best fit, worst fit, first fit.
    - Need some segment replacement algorithm.

  • Segmented Virtual Memory: Structure

    [Figure: variable-size segments 0..m of the virtual program space (virtual addresses) are placed at arbitrary locations in physical main memory (physical addresses).]

  • Seg. Virt. Memory: Address Translation

    [Figure: the virtual address splits into ⟨segment number | seg. offset⟩. The segment number indexes the segment table, whose entries carry a valid bit, dirty bit, length, and other info; the segment's base address is added (+) to the offset to form the physical address.]
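    The segmented analogue of the paged sketch above: the table entry supplies a base that is added to the offset, and the entry's length field lets the hardware bounds-check the access (layout and names are assumptions):

        #include <stdbool.h>
        #include <stdint.h>

        #define NSEGS 256   /* segment table entries (assumed) */

        struct ste { bool valid; uint32_t base, length; };  /* segment table entry */
        static struct ste seg_table[NSEGS];

        /* Assumed layout: | 8-bit segment number | 24-bit offset | */
        bool seg_translate(uint32_t va, uint32_t *pa) {
            uint32_t seg    = va >> 24;
            uint32_t offset = va & 0x00FFFFFF;

            if (!seg_table[seg].valid)           return false; /* segment fault */
            if (offset >= seg_table[seg].length) return false; /* out of bounds */

            *pa = seg_table[seg].base + offset;  /* the "+" in the figure */
            return true;
        }

    The addition (instead of the concatenation used for pages) is what allows segments to be variable-sized and placed anywhere.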

  • Segmented-Paged Virtual Memory

    - Page the segments and map the pages into physical memory, considerably improving and simplifying physical memory use/management.
    - Adds an additional translation step.
    - Drastically shrinks page table sizes.
    - Retains protection capabilities of segments.
    - Mechanism used in most contemporary systems (ok, virtualization adds yet another level of indirection... we will discuss this later).

  • Segmented-Paged: Address Translation

    [Figure: the virtual address splits into ⟨seg # | page # | pg. offset⟩. The segment number indexes the segment table to locate that segment's page table (+); the page number indexes that page table; the unchanged page offset completes the physical address.]
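    A sketch of the resulting two-step walk, combining the two previous sketches (per-segment page tables; the address layout is assumed):

        #include <stdbool.h>
        #include <stdint.h>

        #define NSEGS 256
        #define PAGE_BITS 12

        struct pte { bool valid; uint32_t pfn; };
        struct ste { bool valid; struct pte *ptab; };  /* points at the segment's page table */
        static struct ste seg_table[NSEGS];

        /* Assumed layout: | 8-bit seg # | 12-bit page # | 12-bit offset | */
        bool sp_translate(uint32_t va, uint32_t *pa) {
            uint32_t seg    = va >> 24;
            uint32_t page   = (va >> PAGE_BITS) & 0xFFF;
            uint32_t offset = va & ((1u << PAGE_BITS) - 1);

            if (!seg_table[seg].valid) return false;  /* segment fault */
            struct pte *pt = seg_table[seg].ptab;     /* second translation step */
            if (!pt[page].valid)       return false;  /* page fault    */

            *pa = (pt[page].pfn << PAGE_BITS) | offset;
            return true;
        }

    Only referenced segments need page tables at all, which is where the drastic shrinkage in page table size comes from.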

  • Hardware Implications of Virtual Memory

    - Virtual memory requires a careful coordination between the hardware and O/S.
    - Virtual memory translation will be performed (at least mostly) in hardware.
    - When hardware cannot translate a virtual address, it will trap to the O/S.
    - Hardware structures for translation will limit the type of virtual memory that an O/S can deploy.

  • Lecture break; continue next class


  • The Memory Hierarchy: Who does what?

    - MMU: Memory Management Unit
    - TB: (address) Translation Buffer
    - H/W: search to Main Memory
    - O/S: handles page faults (invoking DMA operations)
      - if page being replaced is dirty, copy out first (DMA)
      - move new page to main memory (DMA)
    - O/S: context swaps on page fault
    - DMA: operates concurrently w/ other tasks

    [Figure: the memory hierarchy as before, with the MMU and TB placed between the CPU and the caches.]

  • Notes

    - Memory Management Unit
      - Manages memory subsystems to main memory
      - Translates addresses, searches caches, migrates data (to/from main memory out/in)
    - Translation Buffer
      - Small cache to assist the virtual → physical address translation process
      - Generally small (e.g., 64 entries); size does not need to correspond to cache sizes
    - Simplifying Assumptions
      - L1 & L2 use physical addresses
      - Paged virtual memory
      - Ignoring details of TB misses
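    A sketch of how the TB short-circuits translation, layered over the translate() page-walk sketch from the earlier slide (64 fully associative entries, as in the example size above; the refill policy here is deliberately crude and purely illustrative):

        #include <stdbool.h>
        #include <stdint.h>

        #define TB_ENTRIES 64
        #define PAGE_BITS 12

        struct tb_entry { bool valid; uint32_t vpn, pfn; };
        static struct tb_entry tb[TB_ENTRIES];

        bool translate(uint32_t va, uint32_t *pa);  /* page-table walk, sketched earlier */

        /* Try the TB first; fall back to the page-table walk on a TB miss. */
        bool tb_translate(uint32_t va, uint32_t *pa) {
            uint32_t vpn    = va >> PAGE_BITS;
            uint32_t offset = va & ((1u << PAGE_BITS) - 1);

            for (int i = 0; i < TB_ENTRIES; i++) {   /* parallel CAM search in hardware */
                if (tb[i].valid && tb[i].vpn == vpn) {
                    *pa = (tb[i].pfn << PAGE_BITS) | offset;
                    return true;                     /* TB hit: no walk needed */
                }
            }
            if (!translate(va, pa))                  /* TB miss: walk the page table */
                return false;                        /* page fault: trap to O/S      */

            struct tb_entry *e = &tb[vpn % TB_ENTRIES];  /* crude replacement */
            e->valid = true; e->vpn = vpn; e->pfn = *pa >> PAGE_BITS;
            return true;
        }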

  • Satisfying A Memory Request

    Satisfied in L1 cache:
    1. MMU: translate address
    2. MMU: search I or D cache as indicated by CPU, success (sometimes simultaneously with translation)
    3. MMU: read/write information to/from CPU

  • Satisfying A Memory Request

    Satisfied in L2 cache:
    1. MMU: translate address
    2. MMU: search I or D cache as indicated by CPU, failure
    3. MMU: search L2 cache, success
    4. MMU: move information between L1 & L2, critical word first?
    5. MMU: read/write information to/from CPU

  • Satisfying A Memory Request

    Satisfied in main memory:
    1. MMU: translate address
    2. MMU: search I or D cache as indicated by CPU, failure
    3. MMU: search L2 cache, failure
    4. MMU: move information between memory & L2
    5. MMU: move information between L1 & L2, critical word first?
    6. MMU: read/write information to/from CPU

  • Satisfying A Memory Request

    Not in main memory:
    1. MMU: translate address, failure; trap to O/S
    2. O/S: page fault, block task for page fault
       - O/S: if page dirty
         - O/S: initiate DMA transfer to copy page to swap
         - O/S: block task for DMA interrupt
         - O/S: invoke task scheduler
         - O/S: on interrupt continue
       - O/S: initiate DMA transfer to copy page to main memory
       - O/S: block task for DMA interrupt
       - O/S: invoke task scheduler
       - O/S: on interrupt:
         - update page table
         - return task to ready-to-run list
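    Putting the four cases together, a C-flavored sketch of the whole path for a load (all helper names are placeholders for the hardware and O/S actions listed on these slides):

        #include <stdbool.h>
        #include <stdint.h>

        bool tb_translate(uint32_t va, uint32_t *pa);  /* MMU + TB (sketched earlier) */
        bool l1_lookup(uint32_t pa, uint32_t *v);      /* I- or D-cache               */
        bool l2_lookup(uint32_t pa, uint32_t *v);
        void l2_fill_from_memory(uint32_t pa);
        void l1_fill_from_l2(uint32_t pa);             /* critical word first?        */
        void os_page_fault_handler(uint32_t va);       /* DMA transfers + scheduler   */

        uint32_t load(uint32_t va) {
            uint32_t pa, v;

            while (!tb_translate(va, &pa))   /* translation failure:         */
                os_page_fault_handler(va);   /* O/S brings the page in       */

            if (l1_lookup(pa, &v))           /* satisfied in L1              */
                return v;

            if (!l2_lookup(pa, &v))          /* satisfied in L2?             */
                l2_fill_from_memory(pa);     /* no: satisfied in main memory */

            l1_fill_from_l2(pa);             /* migrate block toward the CPU */
            l1_lookup(pa, &v);               /* retry: now a guaranteed hit  */
            return v;
        }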

  • Memory Performance Impact: Naive

    CPU exec time = (CPU clock cycles + Memory stall cycles) × Clock cycle time

    Memory stall cycles (doesn't consider hit times ≠ 1)

    = Number of misses × Miss penalty
    = IC × (Misses / Instruction) × Miss penalty
    = IC × (Memory accesses / Instruction) × Miss rate × Miss penalty
    = IC × (Memory reads / Instruction) × Read miss rate × Read miss penalty
      + IC × (Memory writes / Instruction) × Write miss rate × Write miss penalty
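    A worked instance with assumed numbers: IC = 10^9 instructions, 1.5 memory accesses per instruction, a 2% miss rate, and a 100-cycle miss penalty give

    Memory stall cycles = 10^9 × 1.5 × 0.02 × 100 = 3 × 10^9

    i.e., 3 stall cycles per instruction; on a base CPI of 1, the effective CPI becomes 4, so the machine runs at a quarter of its perfect-memory speed.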

  • Average Memory Access Time

    Average memory access time
    = Hit time + Miss rate × Miss penalty
    = Hit time_L1 + Miss rate_L1 × (Hit time_L2 + Miss rate_L2 × Miss penalty_L2)

    Accounting for Out-of-order Execution: (discussion probably premature at this time)

    Memory stall cycles / Instruction = (Misses / Instruction) × (Total miss latency − Overlapped miss latency)
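    With assumed values Hit time_L1 = 1 cycle, Miss rate_L1 = 5%, Hit time_L2 = 10 cycles, Miss rate_L2 = 20% (a local miss rate; see the next slide), and Miss penalty_L2 = 100 cycles:

    Average memory access time = 1 + 0.05 × (10 + 0.20 × 100) = 1 + 0.05 × 30 = 2.5 cycles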

  • Cache Misses

    - Miss Rate

      Local miss rate = (# misses in a cache) / (total # of memory accesses to this cache)

      Global miss rate = (# misses in a cache) / (total # of memory accesses generated by CPU)

    - Categories of Cache Misses
      - Compulsory: cold-start or first-reference misses.
      - Capacity: not all blocks needed fit into cache.
      - Conflict: some needed blocks map to same cache block.
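    A quick numeric check of the distinction (numbers assumed): if the CPU issues 1000 accesses, 40 miss in L1, and 20 of those also miss in L2, then L2's local miss rate is 20/40 = 50% while its global miss rate is only 20/1000 = 2%.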

  • Basic Cache Memory Optimizations

    - Larger block sizes to reduce miss rate
    - Larger caches to reduce miss rate
    - Higher associativity to reduce miss rate
    - Multi-level caches to reduce miss penalty
    - Giving priority to read misses over writes to reduce miss penalty
    - Avoid address translation to reduce hit time