
  • The Memory Hierarchy

    1. Overview of the memory hierarchy

    2. Basics of Cache Memories

    3. Virtual Memory

    4. Satisfying a memory request

    5. Performance and some cache memory optimizations


  • The Memory Hierarchy

    Basic Problem: How to design a reasonable-cost memory system that can deliver data at speeds close to the CPU’s consumption rate.

    Current Answer: Construct a memory hierarchy with slow (inexpensive, large-size) components at the higher levels and with fast (most expensive, smallest) components at the lowest level.

    Migration: As information is referenced, migrate the data into and out of the lowest-level memories.

    [Figure: the memory hierarchy. The CPU/core has split L1 I-Cache and D-Cache, backed by an L2 cache, an L3 cache, main memory, and disk.]

  • How does this help?

    - Programs are well behaved and tend to follow the observed “principles of locality” for programs.
      - Principle of temporal locality: states that a referenced data object will likely be referenced again in the near future.
      - Principle of spatial locality: states that if data at location x is referenced at time t then it is likely that a nearby data location x + ∆x will be referenced at a nearby time t + ∆t.
    - Also consider the 90/10 rule: A program executes about 90% of its instructions from about 10% of its code space.
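    To make the locality principles concrete, here is a small C sketch (my own illustration, not from the slides): summing a matrix row by row touches memory sequentially and uses every word of each fetched cache block, while summing column by column strides across rows and wastes most of each block.

        #include <stdio.h>

        #define N 1024
        static double a[N][N];

        /* Row-major traversal: consecutive iterations touch adjacent
         * addresses (spatial locality); sum stays hot in a register
         * (temporal locality). */
        double sum_rows(void) {
            double sum = 0.0;
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    sum += a[i][j];
            return sum;
        }

        /* Column-major traversal: successive accesses are N * 8 bytes
         * apart, so nearly every access lands in a different block. */
        double sum_cols(void) {
            double sum = 0.0;
            for (int j = 0; j < N; j++)
                for (int i = 0; i < N; i++)
                    sum += a[i][j];
            return sum;
        }

        int main(void) { printf("%f %f\n", sum_rows(), sum_cols()); return 0; }

    Both functions compute the same result; only the access order, and hence the miss rate, differs.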

  • Cache Memories

    - Similar to paged virtual memory, with fixed-size blocks mapped into the cache (from the [physical or virtual] memory space).
    - Due to speed considerations, all operations are implemented in hardware.
    - 4 types of mapping policies:
      - fully associative
      - direct mapped
      - set associative
      - sector mapped (not discussed further)

  • Fully Associative Cache

    Any block in the (physical or virtual) memory can be mapped into any block in the cache (just like paged virtual memory). Memory addresses are formed as ⟨n | d⟩, where n is the memory tag and d is the address in the block.

    [Figure: fully associative mapping. Any of memory blocks 0..M can be placed in any of the N cache block frames; each frame stores the tag n of the block it currently holds.]

    Lookup Algorithm

    if ∃α : 0 ≤ α < N ∧ cache[α].tag = n
        then return cache[α].word[d]
        else cache-miss
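    A minimal C sketch of this lookup (illustrative only: type and field names are my own, and real hardware compares all N tags in parallel rather than in a loop):

        #include <stdbool.h>
        #include <stdint.h>

        #define N 64       /* block frames (assumed)    */
        #define WORDS 16   /* words per block (assumed) */

        struct frame {
            bool     valid;
            uint32_t tag;          /* the memory tag n */
            uint32_t word[WORDS];  /* block data       */
        };
        static struct frame cache[N];

        /* Search every frame for tag n; return word d on a hit. */
        bool fa_lookup(uint32_t n, uint32_t d, uint32_t *out) {
            for (uint32_t a = 0; a < N; a++) {   /* parallel in hardware */
                if (cache[a].valid && cache[a].tag == n) {
                    *out = cache[a].word[d];
                    return true;                 /* cache hit  */
                }
            }
            return false;                        /* cache-miss */
        }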

  • Direct Mapped Cache

    If the cache is partitioned into N fixed-size blocks, then each cache block k will contain only the (physical or virtual) memory blocks k + nN, where n = 0, 1, .... Memory addresses are formed as ⟨n | k | d⟩, where n is the memory tag, k is the cache block index, and d is the address in the block.

    [Figure: direct-mapped cache. Memory blocks 0, 1, ..., N−1, N, N+1, ... map to cache blocks 0, 1, ..., N−1, 0, 1, ...; each cache block stores the tag of the memory block it holds.]

    Lookup Algorithm

    if cache[k].tag = n
        then return cache[k].word[d]
        else cache-miss
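    A C sketch of the direct-mapped case, including one plausible slicing of an address into ⟨n | k | d⟩ (all field widths are assumptions; addresses are word-addressed for simplicity):

        #include <stdbool.h>
        #include <stdint.h>

        #define N 256      /* cache blocks: 8 index bits (assumed)     */
        #define WORDS 16   /* words per block: 4 offset bits (assumed) */

        struct frame { bool valid; uint32_t tag; uint32_t word[WORDS]; };
        static struct frame cache[N];

        /* Assumed layout: | tag n | 8-bit index k | 4-bit offset d | */
        bool dm_lookup(uint32_t addr, uint32_t *out) {
            uint32_t d = addr & 0xF;          /* address in the block */
            uint32_t k = (addr >> 4) & 0xFF;  /* cache block index    */
            uint32_t n = addr >> 12;          /* memory tag           */

            if (cache[k].valid && cache[k].tag == n) {
                *out = cache[k].word[d];      /* only one place to look */
                return true;
            }
            return false;                     /* cache-miss */
        }

    No search is needed: block k is the only frame that can hold the address, which makes direct mapping fast and cheap but prone to conflict misses.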

  • Set Associative Cache

    Midpoint between direct mapped and fully associative (at the extremes it degenerates to each). Basic idea: divide the cache into S sets with E = N/S block frames per set (N total blocks in the cache). Memory addresses are formed as ⟨n | k | d⟩, where n is the memory tag, k is the cache set index, and d is the address in the block.

    [Figure: set associative cache. The cache is divided into sets 0..S−1, each holding E tagged block frames; memory blocks 0, 1, ..., S−1, S, S+1, ... map to sets 0, 1, ..., S−1, 0, 1, ...]

    Lookup Algorithm

    if ∃α : 0 ≤ α < E ∧ cache[k].block[α].tag = n
        then return cache[k].block[α].word[d]
        else cache-miss
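    The corresponding C sketch: the set index selects a set, and only that set's E tags are searched (compared in parallel in hardware). Sizes are again assumptions:

        #include <stdbool.h>
        #include <stdint.h>

        #define S 64       /* sets: 6 index bits (assumed)             */
        #define E 4        /* frames per set, i.e. 4-way (assumed)     */
        #define WORDS 16   /* words per block: 4 offset bits (assumed) */

        struct frame { bool valid; uint32_t tag; uint32_t word[WORDS]; };
        static struct frame cache[S][E];

        /* Assumed layout: | tag n | 6-bit set index k | 4-bit offset d | */
        bool sa_lookup(uint32_t addr, uint32_t *out) {
            uint32_t d = addr & 0xF;
            uint32_t k = (addr >> 4) & 0x3F;
            uint32_t n = addr >> 10;

            for (uint32_t a = 0; a < E; a++) {   /* E comparators in hardware */
                if (cache[k][a].valid && cache[k][a].tag == n) {
                    *out = cache[k][a].word[d];
                    return true;                 /* cache hit  */
                }
            }
            return false;                        /* cache-miss */
        }

    Setting S = 1 recovers the fully associative lookup, and E = 1 the direct-mapped one, matching the remark above that the scheme degenerates to each at the extremes.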

  • Cache Policies

    - Read/Write Policies
      - Read through
      - Critical word first
      - Write through
      - Write back
      - Write allocate/no-allocate
    - Structure
      - Unified or Split Caches
      - Contents across Cache Levels
        - Multilevel inclusion (Pentium 4)
        - Multilevel exclusion (Opteron)

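    To illustrate the write policies above, a simplified C sketch of what a store does under write back with write allocate versus write through with no-allocate (my own model of these common pairings; the mem_* helpers are placeholders for the actual memory traffic):

        #include <stdbool.h>
        #include <stdint.h>

        #define WORDS 16

        struct frame { bool valid, dirty; uint32_t tag; uint32_t word[WORDS]; };

        /* Stubs standing in for real memory operations. */
        static void mem_write_block(uint32_t tag, const uint32_t *w) { (void)tag; (void)w; }
        static void mem_read_block(uint32_t tag, uint32_t *w) { (void)tag; (void)w; }
        static void mem_write_word(uint32_t tag, uint32_t d, uint32_t v) { (void)tag; (void)d; (void)v; }

        /* Write back + write allocate: stores update only the cache;
         * a dirty block is copied to memory when it is evicted. */
        void store_wb(struct frame *f, uint32_t n, uint32_t d, uint32_t v) {
            if (!(f->valid && f->tag == n)) {          /* write miss */
                if (f->valid && f->dirty)
                    mem_write_block(f->tag, f->word);  /* copy out old block */
                mem_read_block(n, f->word);            /* allocate block n   */
                f->tag = n; f->valid = true;
            }
            f->word[d] = v;
            f->dirty = true;   /* memory stays stale until eviction */
        }

        /* Write through + no-allocate: every store goes to memory
         * (often via a write buffer); the cache is updated only if
         * the block already happens to be resident. */
        void store_wt(struct frame *f, uint32_t n, uint32_t d, uint32_t v) {
            mem_write_word(n, d, v);
            if (f->valid && f->tag == n)
                f->word[d] = v;
        }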

  • Lecture break; continue next class


  • Virtual Memory

    Addresses used by programs do not correspond to the actual addresses of the program/data locations in main memory. Instead, there exists a translation mechanism between the CPU and memory that associates CPU (virtual) addresses with their actual (physical) addresses in memory.

    [Figure: the CPU issues a virtual address; an address translation step maps it to the physical address used to access main memory.]

    The two most important types:
    - Paged virtual memory
    - Segmented virtual memory

  • Demand Paged Virtual Memory

    - Logically subdivide virtual and physical spaces into fixed-size units called pages.
    - Keep virtual space on disk (swap for working/dirty pages).
    - As referenced (on demand), bring pages into main memory (and into the memory hierarchy) and update the page table.
    - Virtual pages can be mapped into any available physical page.
    - Requires some page replacement algorithm: random, FIFO, LRU, LFU (see the LRU sketch below).
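    As one concrete replacement policy, a minimal LRU sketch over physical page frames using reference timestamps (an illustration of the idea; a real O/S approximates LRU far more cheaply):

        #include <stdint.h>

        #define FRAMES 8   /* physical page frames (kept small for illustration) */

        struct pframe {
            int      vpage;  /* virtual page held; assume initialized to -1 (free) */
            uint64_t last;   /* logical time of the last reference                 */
        };
        static struct pframe frames[FRAMES];
        static uint64_t now;

        /* Record a reference to the page held in frame i. */
        void touch(int i) { frames[i].last = ++now; }

        /* Pick a victim: a free frame if any, else the least
         * recently used (smallest timestamp). */
        int lru_victim(void) {
            int victim = 0;
            for (int i = 0; i < FRAMES; i++) {
                if (frames[i].vpage == -1) return i;
                if (frames[i].last < frames[victim].last) victim = i;
            }
            return victim;
        }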

  • Paged Virtual Memory: Structure

    [Figure: pages of the virtual program space (virtual addresses) are mapped individually onto available page frames in main memory (physical addresses).]

  • Paged Virtual Memory: Address Translation

    [Figure: the virtual address splits into ⟨page number | page offset⟩. The page number indexes the page table, whose entries carry a valid bit, dirty bit, and other info; the resulting physical page number is concatenated with the unchanged page offset to form the physical address.]
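    A C sketch of this translation path (page size, table layout, and field names are assumptions; the real translation is done in hardware, as discussed later):

        #include <stdbool.h>
        #include <stdint.h>

        #define PAGE_BITS 12              /* 4 KiB pages (assumed)         */
        #define NPAGES (1u << 20)         /* 32-bit VA: 2^20 virtual pages */

        struct pte { bool valid, dirty; uint32_t pfn; };  /* page table entry */
        static struct pte page_table[NPAGES];

        /* Translate a virtual address. Returns false on a page fault
         * (invalid entry): the O/S must then bring the page in. */
        bool translate(uint32_t va, uint32_t *pa) {
            uint32_t vpn    = va >> PAGE_BITS;               /* page number */
            uint32_t offset = va & ((1u << PAGE_BITS) - 1);  /* page offset */

            if (!page_table[vpn].valid)
                return false;             /* page fault: trap to O/S */

            *pa = (page_table[vpn].pfn << PAGE_BITS) | offset;
            return true;
        }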

  • Segmented Virtual Memory

    - Identify segments in virtual space by the logical structure of the program. Using logical structure, we can assign access control to the segment.
    - Dynamically map segments into physical space as segments are referenced (on demand). Update the segment table accordingly.
    - Keep virtual space on disk (swap for working/dirty pages).
    - Need a segment placement algorithm: best fit, worst fit, first fit.
    - Need some segment replacement algorithm.

  • Segmented Virtual Memory: Structure

    [Figure: variable-size segments 0..m of the virtual program space (virtual addresses) are placed at arbitrary locations in physical main memory (physical addresses).]

  • Seg. Virt. Memory: Address Translation

    [Figure: the virtual address splits into ⟨segment number | seg. offset⟩. The segment number indexes the segment table, whose entries carry a valid bit, dirty bit, length, and other info; the segment's base address is added (+) to the offset to form the physical address.]
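    The segmented analogue of the paged sketch above: the table entry supplies a base that is added to the offset, and the entry's length field lets the hardware bounds-check the access (layout and names are assumptions):

        #include <stdbool.h>
        #include <stdint.h>

        #define NSEGS 256   /* segment table entries (assumed) */

        struct ste { bool valid; uint32_t base, length; };  /* segment table entry */
        static struct ste seg_table[NSEGS];

        /* Assumed layout: | 8-bit segment number | 24-bit offset | */
        bool seg_translate(uint32_t va, uint32_t *pa) {
            uint32_t seg    = va >> 24;
            uint32_t offset = va & 0x00FFFFFF;

            if (!seg_table[seg].valid)           return false; /* segment fault */
            if (offset >= seg_table[seg].length) return false; /* out of bounds */

            *pa = seg_table[seg].base + offset;  /* the "+" in the figure */
            return true;
        }

    The addition (instead of the concatenation used for pages) is what allows segments to be variable-sized and placed anywhere.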

  • Segmented-Paged Virtual Memory

    - Page the segments and map the pages into physical memory, considerably improving and simplifying physical memory use/management.
    - Adds an additional translation step.
    - Drastically shrinks page table sizes.
    - Retains protection capabilities of segments.
    - Mechanism used in most contemporary systems (ok, virtualization adds yet another level of indirection... we will discuss this later).

  • Segmented-Paged: Address Translation

    [Figure: the virtual address splits into ⟨seg # | page # | pg. offset⟩. The segment number indexes the segment table to locate that segment's page table (+); the page number indexes that page table; the unchanged page offset completes the physical address.]
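    A sketch of the resulting two-step walk, combining the two previous sketches (per-segment page tables; the address layout is assumed):

        #include <stdbool.h>
        #include <stdint.h>

        #define NSEGS 256
        #define PAGE_BITS 12

        struct pte { bool valid; uint32_t pfn; };
        struct ste { bool valid; struct pte *ptab; };  /* points at the segment's page table */
        static struct ste seg_table[NSEGS];

        /* Assumed layout: | 8-bit seg # | 12-bit page # | 12-bit offset | */
        bool sp_translate(uint32_t va, uint32_t *pa) {
            uint32_t seg    = va >> 24;
            uint32_t page   = (va >> PAGE_BITS) & 0xFFF;
            uint32_t offset = va & ((1u << PAGE_BITS) - 1);

            if (!seg_table[seg].valid) return false;  /* segment fault */
            struct pte *pt = seg_table[seg].ptab;     /* second translation step */
            if (!pt[page].valid)       return false;  /* page fault    */

            *pa = (pt[page].pfn << PAGE_BITS) | offset;
            return true;
        }

    Only referenced segments need page tables at all, which is where the drastic shrinkage in page table size comes from.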

  • Hardware Implications of Virtual Memory

    - Virtual memory requires a careful coordination between the hardware and O/S.
    - Virtual memory translation will be performed (at least mostly) in hardware.
    - When hardware cannot translate a virtual address, it will trap to the O/S.
    - Hardware structures for translation will limit the type of virtual memory that an O/S can deploy.

  • Lecture break; continue next class


  • The Memory Hierarchy: Who does what?

    - MMU: Memory Management Unit
    - TB: (address) Translation Buffer
    - H/W: search to Main Memory
    - O/S: handles page faults (invoking DMA operations)
      - if page being replaced is dirty, copy out first (DMA)
      - move new page to main memory (DMA)
    - O/S: context swaps on page fault
    - DMA: operates concurrently w/ other tasks

    [Figure: the memory hierarchy as before, with the MMU and TB placed between the CPU and the caches.]

  • Notes

    - Memory Management Unit
      - Manages memory subsystems to main memory
      - Translates addresses, searches caches, migrates data (to/from main memory out/in)
    - Translation Buffer
      - Small cache to assist the virtual → physical address translation process
      - Generally small (e.g., 64 entries); size does not need to correspond to cache sizes
    - Simplifying Assumptions
      - L1 & L2 use physical addresses
      - Paged virtual memory
      - Ignoring details of TB misses
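    A sketch of how the TB short-circuits translation, layered over the translate() page-walk sketch from the earlier slide (64 fully associative entries, as in the example size above; the refill policy here is deliberately crude and purely illustrative):

        #include <stdbool.h>
        #include <stdint.h>

        #define TB_ENTRIES 64
        #define PAGE_BITS 12

        struct tb_entry { bool valid; uint32_t vpn, pfn; };
        static struct tb_entry tb[TB_ENTRIES];

        bool translate(uint32_t va, uint32_t *pa);  /* page-table walk, sketched earlier */

        /* Try the TB first; fall back to the page-table walk on a TB miss. */
        bool tb_translate(uint32_t va, uint32_t *pa) {
            uint32_t vpn    = va >> PAGE_BITS;
            uint32_t offset = va & ((1u << PAGE_BITS) - 1);

            for (int i = 0; i < TB_ENTRIES; i++) {   /* parallel CAM search in hardware */
                if (tb[i].valid && tb[i].vpn == vpn) {
                    *pa = (tb[i].pfn << PAGE_BITS) | offset;
                    return true;                     /* TB hit: no walk needed */
                }
            }
            if (!translate(va, pa))                  /* TB miss: walk the page table */
                return false;                        /* page fault: trap to O/S      */

            struct tb_entry *e = &tb[vpn % TB_ENTRIES];  /* crude replacement */
            e->valid = true; e->vpn = vpn; e->pfn = *pa >> PAGE_BITS;
            return true;
        }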

  • Satisfying A Memory Request

    Satisfied in L1 cache:
    1. MMU: translate address
    2. MMU: search I or D cache as indicated by CPU, success (sometimes simultaneously with translation)
    3. MMU: read/write information to/from CPU

  • Satisfying A Memory Request

    Satisfied in L2 cache:
    1. MMU: translate address
    2. MMU: search I or D cache as indicated by CPU, failure
    3. MMU: search L2 cache, success
    4. MMU: move information between L1 & L2, critical word first?
    5. MMU: read/write information to/from CPU

  • Satisfying A Memory Request

    Satisfied in main memory:
    1. MMU: translate address
    2. MMU: search I or D cache as indicated by CPU, failure
    3. MMU: search L2 cache, failure
    4. MMU: move information between memory & L2
    5. MMU: move information between L1 & L2, critical word first?
    6. MMU: read/write information to/from CPU

  • Satisfying A Memory Request

    Not in main memory:
    1. MMU: translate address, failure; trap to O/S
    2. O/S: page fault, block task for page fault
       - O/S: if page dirty
         - O/S: initiate DMA transfer to copy page to swap
         - O/S: block task for DMA interrupt
         - O/S: invoke task scheduler
         - O/S: on interrupt continue
       - O/S: initiate DMA transfer to copy page to main memory
       - O/S: block task for DMA interrupt
       - O/S: invoke task scheduler
       - O/S: on interrupt:
         - update page table
         - return task to ready-to-run list
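    Putting the four cases together, a C-flavored sketch of the whole path for a load (all helper names are placeholders for the hardware and O/S actions listed on these slides):

        #include <stdbool.h>
        #include <stdint.h>

        bool tb_translate(uint32_t va, uint32_t *pa);  /* MMU + TB (sketched earlier) */
        bool l1_lookup(uint32_t pa, uint32_t *v);      /* I- or D-cache               */
        bool l2_lookup(uint32_t pa, uint32_t *v);
        void l2_fill_from_memory(uint32_t pa);
        void l1_fill_from_l2(uint32_t pa);             /* critical word first?        */
        void os_page_fault_handler(uint32_t va);       /* DMA transfers + scheduler   */

        uint32_t load(uint32_t va) {
            uint32_t pa, v;

            while (!tb_translate(va, &pa))   /* translation failure:         */
                os_page_fault_handler(va);   /* O/S brings the page in       */

            if (l1_lookup(pa, &v))           /* satisfied in L1              */
                return v;

            if (!l2_lookup(pa, &v))          /* satisfied in L2?             */
                l2_fill_from_memory(pa);     /* no: satisfied in main memory */

            l1_fill_from_l2(pa);             /* migrate block toward the CPU */
            l1_lookup(pa, &v);               /* retry: now a guaranteed hit  */
            return v;
        }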

  • Memory Performance Impact: Naive

    CPU exec time = (CPU clock cycles + Memory stall cycles) × Clock cycle time

    Memory stall cycles (doesn't consider hit times ≠ 1)

    = Number of misses × Miss penalty
    = IC × (Misses / Instruction) × Miss penalty
    = IC × (Memory accesses / Instruction) × Miss rate × Miss penalty
    = IC × (Memory reads / Instruction) × Read miss rate × Read miss penalty
      + IC × (Memory writes / Instruction) × Write miss rate × Write miss penalty
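    A worked instance with assumed numbers: IC = 10^9 instructions, 1.5 memory accesses per instruction, a 2% miss rate, and a 100-cycle miss penalty give

    Memory stall cycles = 10^9 × 1.5 × 0.02 × 100 = 3 × 10^9

    i.e., 3 stall cycles per instruction; on a base CPI of 1, the effective CPI becomes 4, so the machine runs at a quarter of its perfect-memory speed.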

  • Average Memory Access Time

    Average memory access time
    = Hit time + Miss rate × Miss penalty
    = Hit time_L1 + Miss rate_L1 × (Hit time_L2 + Miss rate_L2 × Miss penalty_L2)

    Accounting for Out-of-order Execution: (discussion probably premature at this time)

    Memory stall cycles / Instruction = (Misses / Instruction) × (Total miss latency − Overlapped miss latency)
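    With assumed values Hit time_L1 = 1 cycle, Miss rate_L1 = 5%, Hit time_L2 = 10 cycles, Miss rate_L2 = 20% (a local miss rate; see the next slide), and Miss penalty_L2 = 100 cycles:

    Average memory access time = 1 + 0.05 × (10 + 0.20 × 100) = 1 + 0.05 × 30 = 2.5 cycles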

  • Cache Misses

    - Miss Rate

      Local miss rate = (# misses in a cache) / (total # of memory accesses to this cache)

      Global miss rate = (# misses in a cache) / (total # of memory accesses generated by CPU)

    - Categories of Cache Misses
      - Compulsory: cold-start or first-reference misses.
      - Capacity: not all blocks needed fit into cache.
      - Conflict: some needed blocks map to same cache block.
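    A quick numeric check of the distinction (numbers assumed): if the CPU issues 1000 accesses, 40 miss in L1, and 20 of those also miss in L2, then L2's local miss rate is 20/40 = 50% while its global miss rate is only 20/1000 = 2%.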

  • Basic Cache Memory Optimizations

    - Larger block sizes to reduce miss rate
    - Larger caches to reduce miss rate
    - Higher associativity to reduce miss rate
    - Multi-level caches to reduce miss penalty
    - Giving priority to read misses over writes to reduce miss penalty
    - Avoid address translation to reduce hit time