Class10 Memory

  • 8/19/2019 Class10 Memory

    The Memory Hierarchy

    CS 105: Tour of the Black Holes of Computing

    Topics
    • Storage technologies and trends
    • Locality of reference
    • Caching in the memory hierarchy

    Random-Access Memory (RAM)

    Key features
    • RAM is packaged as a chip
    • Basic storage unit is a cell (one bit per cell)
    • Multiple RAM chips form a memory

    Static RAM (SRAM)
    • Each cell stores a bit with a six-transistor circuit
    • Retains value indefinitely, as long as it is kept powered
    • Relatively insensitive to disturbances such as electrical noise
    • Faster and more expensive than DRAM

    Dynamic RAM (DRAM)
    • Each cell stores a bit with a capacitor and transistor
    • Value must be refreshed every 10-100 ms
    • Sensitive to disturbances
    • Slower and cheaper than SRAM

    Non-Volatile RAM (NVRAM)

    Key feature: keeps data when power is lost
    • Several types
    • Most important is NAND flash
    • Ongoing R&D

    NAND flash
    • Reading is similar to DRAM (though somewhat slower)
    • Writing is packed with restrictions:
      – Can’t change existing data
      – Must erase in large blocks (e.g., 64K)
      – Block dies after about 100K erases
    • Writing is slower than reading (mostly due to erase cost)
    • Chips are often packaged with a Flash Translation Layer (FTL)
      – Spreads out writes (“wear leveling”)
      – Makes chip appear like a disk drive

    Conventional DRAM Organization

    d x w DRAM: dw total bits organized as d supercells of size w bits

    [Figure: a 16 x 8 DRAM chip drawn as a 4 x 4 array of supercells (rows 0-3, cols 0-3) with supercell (2,1) marked and an internal row buffer. The memory controller (to CPU) connects to the chip through a 2-bit addr line and an 8-bit data line.]

    Reading DRAM Supercell (2,1)

    Step 1(a): Row address strobe (RAS) selects row 2
    Step 1(b): Row 2 copied from DRAM array to row buffer

    [Figure: the memory controller sends RAS = 2 over the 2-bit addr line of the 16 x 8 DRAM chip; row 2 is copied into the internal row buffer.]

    Reading DRAM Supercell (2,1), continued

    Step 2(a): Column access strobe (CAS) selects column 1
    Step 2(b): Supercell (2,1) copied from buffer to data lines, and eventually back to CPU

    [Figure: the memory controller sends CAS = 1 over the 2-bit addr line of the 16 x 8 DRAM chip; supercell (2,1) moves from the internal row buffer over the 8-bit data line to the memory controller, and from there to the CPU.]
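The two-step read above can be sketched as a small C model. This is only an illustration, not how any real controller is programmed: the struct `dram_chip` and the functions `dram_ras`, `dram_cas`, and `dram_read` are hypothetical names, and the 4 x 4 array of 8-bit supercells mirrors the 16 x 8 chip in the figures.

```c
#include <assert.h>
#include <stdint.h>

#define DRAM_ROWS 4
#define DRAM_COLS 4

/* Toy model of the 16 x 8 DRAM chip: 16 supercells of 8 bits each. */
struct dram_chip {
    uint8_t cells[DRAM_ROWS][DRAM_COLS]; /* the DRAM array */
    uint8_t row_buffer[DRAM_COLS];       /* internal row buffer */
};

/* Step 1: RAS copies the whole selected row into the row buffer. */
static void dram_ras(struct dram_chip *chip, unsigned row)
{
    assert(row < DRAM_ROWS);
    for (unsigned col = 0; col < DRAM_COLS; col++)
        chip->row_buffer[col] = chip->cells[row][col];
}

/* Step 2: CAS selects one supercell from the row buffer. */
static uint8_t dram_cas(const struct dram_chip *chip, unsigned col)
{
    assert(col < DRAM_COLS);
    return chip->row_buffer[col];
}

/* A full read sends the row address first, then the column address. */
static uint8_t dram_read(struct dram_chip *chip, unsigned row, unsigned col)
{
    dram_ras(chip, row);
    return dram_cas(chip, col);
}
```

Calling `dram_read(&chip, 2, 1)` corresponds to the RAS = 2, CAS = 1 sequence in the slides. Note that the whole row passes through the buffer even though only one supercell is returned.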

    Memory Modules

    [Figure: a 64 MB memory module consisting of eight 8M x 8 DRAMs (DRAM 0 to DRAM 7). The memory controller sends addr (row = i, col = j) to every chip, and each chip returns its supercell (i,j). The eight bytes (bits 0-7, 8-15, 16-23, 24-31, 32-39, 40-47, 48-55, 56-63) together form the 64-bit doubleword at main memory address A.]
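As a rough sketch of how the memory controller could combine the eight supercells into one doubleword, the C function below packs one byte per chip into a 64-bit value. The function name `assemble_doubleword` is hypothetical, and the assignment of chip k to bits 8k through 8k+7 is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_CHIPS 8

/* Combine the bytes returned by the eight DRAM chips into one doubleword.
 * Assumption: chip k supplies bits 8k .. 8k+7 of the result. */
static uint64_t assemble_doubleword(const uint8_t chip_byte[NUM_CHIPS])
{
    uint64_t word = 0;

    assert(chip_byte != NULL);
    for (int k = 0; k < NUM_CHIPS; k++)
        word |= (uint64_t)chip_byte[k] << (8 * k);
    return word;
}
```

Because all eight chips are addressed with the same (i, j) at once, the module delivers 64 bits per access even though each chip is only 8 bits wide.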

    Typical Bus Structure Connecting CPU and Memory

    • A bus is a collection of parallel wires that carry address, data, and control signals
    • Buses are typically shared by multiple devices

    [Figure: the CPU chip (register file, ALU, bus interface) connects over the system bus to the I/O bridge, which connects over the memory bus to main memory.]

    Memory Read Transaction

    Load operation: movl A, %eax

    (1) CPU places address A on memory bus
    (2) Main memory reads A from memory bus, retrieves word x, and places it on bus
    (3) CPU reads word x from bus and copies it into register %eax

    [Figure, one panel per step: the CPU chip (register file with %eax, ALU, bus interface), the I/O bridge, and main memory holding word x at address A. The bus carries A in step 1 and x in steps 2 and 3.]

    Memory Write Transaction

    Store operation: movl %eax, A

    (1) CPU places address A on bus; main memory reads it and waits for corresponding data word to arrive
    (2) CPU places data word y on bus
    (3) Main memory reads data word y from bus and stores it at address A

    [Figure, one panel per step: the CPU chip (register file with %eax holding y, ALU, bus interface), the I/O bridge, and main memory with address A. The bus carries A in step 1 and y in steps 2 and 3.]

    Disk Geometry

    • Disks consist of platters, each with two surfaces
    • Each surface consists of concentric rings called tracks
    • Each track consists of sectors separated by gaps

    [Figure: one surface around the spindle, showing the tracks, track k, sectors, and gaps.]

    Disk Geometry (Multiple-Platter View)

    • Aligned tracks form a cylinder

    [Figure: platters 0 to 2 on one spindle, giving surfaces 0 to 5, with cylinder k marked.]

    Disk Operation (Single-Platter View)

    • The disk surface spins at a fixed rotational rate
    • The read/write head is attached to the end of the arm and flies over the disk surface on a thin cushion of air
    • By moving radially, the arm can position the read/write head over any track

    Disk Operation (Multi-Platter View)

    • Read/write heads move in unison from cylinder to cylinder

    [Figure: the arm and its read/write heads positioned over the platters on the spindle.]

    Disk Access Time

    Average time to access some target sector approximated by:
      Taccess = Tavg seek + Tavg rotation + Tavg transfer

    Seek time (Tavg seek)
    • Time to position heads over cylinder containing target sector
    • Typical Tavg seek = 9 ms

    Rotational latency (Tavg rotation)
    • Time waiting for first bit of target sector to pass under r/w head
    • Tavg rotation = 1/2 x 1/RPM x 60 secs/1 min

    Transfer time (Tavg transfer)
    • Time to read the bits in the target sector
    • Tavg transfer = 1/RPM x 1/(avg # sectors/track) x 60 secs/1 min

    Disk Access Time Example

    Given:
    • Rotational rate = 7,200 RPM
    • Average seek time = 9 ms
    • Avg # sectors/track = 400

    Derived:
    • Tavg rotation = 1/2 x (60 secs/7,200 RPM) x 1000 ms/sec ≈ 4 ms
    • Tavg transfer = 60/7,200 RPM x 1/400 secs/track x 1000 ms/sec ≈ 0.02 ms
    • Taccess = 9 ms + 4 ms + 0.02 ms

    Important points:
    • Access time is dominated by seek time and rotational latency
    • First bit in a sector is the most expensive, the rest are free
    • SRAM access time is about 4 ns/doubleword, DRAM about 60 ns
      – Disk is about 40,000 times slower than SRAM, and 2,500 times slower than DRAM
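The access-time formulas can be checked with a few lines of C. This is a minimal sketch; the function names `avg_rotation_ms`, `avg_transfer_ms`, and `access_ms` are hypothetical and all times are in milliseconds.

```c
#include <assert.h>

/* Average rotational latency in ms: half of one full revolution. */
static double avg_rotation_ms(double rpm)
{
    assert(rpm > 0.0);
    return 0.5 * (60.0 / rpm) * 1000.0;
}

/* Average time in ms to read one sector: one revolution divided by
 * the average number of sectors per track. */
static double avg_transfer_ms(double rpm, double sectors_per_track)
{
    assert(rpm > 0.0 && sectors_per_track > 0.0);
    return (60.0 / rpm) / sectors_per_track * 1000.0;
}

/* Taccess = Tavg seek + Tavg rotation + Tavg transfer. */
static double access_ms(double seek_ms, double rpm, double sectors_per_track)
{
    return seek_ms + avg_rotation_ms(rpm)
                   + avg_transfer_ms(rpm, sectors_per_track);
}
```

With the values from the example, `access_ms(9.0, 7200.0, 400.0)` gives roughly 13.19 ms. The slide's 13.02 ms comes from rounding the rotational latency (about 4.17 ms) down to 4 ms. Either way the transfer time is a negligible part of the total.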

    Logical Disk Blocks

    Modern disks present a simpler abstract view of the complex sector geometry:
    • The set of available sectors is modeled as a sequence of b-sized logical blocks (0, 1, 2, ...)

    Mapping between logical blocks and actual (physical) sectors
    • Maintained by hardware/firmware device called disk controller
    • Converts requests for logical blocks into (surface, track, sector) triples

    Allows controller to set aside spare cylinders for each zone
    • Accounts for the difference in “formatted capacity” and “maximum capacity”
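To make the logical-block mapping concrete, here is a simplified C sketch. Real controllers use zoned recording and spare cylinders, which this deliberately ignores: it assumes every track has the same number of sectors and that blocks fill one cylinder before moving to the next. The types `disk_geometry` and `disk_location` and the function `map_block` are hypothetical.

```c
#include <assert.h>

/* Simplified geometry: the same sector count on every track (no zones). */
struct disk_geometry {
    unsigned surfaces;
    unsigned tracks_per_surface;
    unsigned sectors_per_track;
};

struct disk_location {
    unsigned surface;
    unsigned track;
    unsigned sector;
};

/* Convert a logical block number into a (surface, track, sector) triple.
 * Assumption: consecutive blocks fill one cylinder (the same track on
 * every surface) before moving to the next cylinder. */
static struct disk_location map_block(const struct disk_geometry *g,
                                      unsigned long block)
{
    unsigned long per_cylinder =
        (unsigned long)g->surfaces * g->sectors_per_track;
    struct disk_location loc;

    assert(per_cylinder > 0);
    assert(block < per_cylinder * g->tracks_per_surface);
    loc.track   = (unsigned)(block / per_cylinder);
    loc.surface = (unsigned)((block % per_cylinder) / g->sectors_per_track);
    loc.sector  = (unsigned)(block % g->sectors_per_track);
    return loc;
}
```

For a hypothetical disk with 6 surfaces and 400 sectors per track, block 400 maps to surface 1, track 0, sector 0, and block 2400 is the first block of the next cylinder.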

    I/O Bus

    [Figure: the CPU chip (register file, ALU, bus interface) connects over the system bus to the I/O bridge, which connects over the memory bus to main memory and also to the I/O bus. On the I/O bus: a USB controller (mouse, keyboard), a graphics adapter (monitor), a disk controller (disk), and expansion slots for other devices such as network adapters.]

    Reading a Disk Sector

    (1) CPU initiates disk read by writing command, logical block number, and destination memory address to a port (address) associated with disk controller
    (2) Disk controller reads sector and performs direct memory access (DMA) transfer into main memory
    (3) When the DMA transfer completes, disk controller notifies CPU with interrupt (i.e., asserts special “interrupt” pin on CPU)

    [Figure, one panel per step: the CPU chip (register file, ALU, bus interface), main memory, and the I/O bus with the USB controller (mouse, keyboard), graphics adapter (monitor), and disk controller (disk).]

    Storage Trends

    (Culled from back issues of Byte and PC Magazine)

    DRAM
    metric               1980    1985    1990    1995    2000   2000:1980
    $/MB                8,000     880     100      30       1       8,000
    access (ns)           375     200     100      70      60           6
    typical size (MB)   0.064   0.256       4      16      64       1,000

    SRAM
    metric               1980    1985    1990    1995    2000   2000:1980
    $/MB               19,200   2,900     320     256     100         190
    access (ns)           300     150      35      15       2         150

    Disk
    metric               1980    1985    1990    1995    2000   2000:1980
    $/MB                  500     100       8    0.30    0.05      10,000
    access (ms)            87      75      28      10       8          11
    typical size (MB)       1      10     160   1,000   9,000       9,000

    CPU Clock Rates

                         1980    1985    1990    1995    2000   2000:1980
    processor            8080     286     386    Pent   P-III
    clock rate (MHz)        1       6      20     150     750         750
    cycle time (ns)     1,000     166      50       6     1.6         750

    The CPU-Memory Gap

    The increasing gap between DRAM, disk, and CPU speeds.

    Locality

    Principle of Locality:
    • Programs tend to reuse data and instructions near those they have used recently, or that were recently referenced themselves
    • Spatial locality: Items with nearby addresses tend to be referenced close together in time
    • Temporal locality: Recently referenced items are likely to be referenced in the near future

    Locality Example:

        sum = 0;
        for (i = 0; i < n; i++)
            sum += a[i];
        return sum;

    • Data
      – Reference array elements in succession (stride-1 reference pattern): Spatial locality
      – Reference sum each iteration: Temporal locality
    • Instructions
      – Reference instructions in sequence: Spatial locality
      – Cycle through loop repeatedly: Temporal locality

    Locality Example

    Claim: Being able to look at code and get a qualitative sense of its locality is a key skill for a professional programmer

    Question: Does this function have good locality?

        int sumarrayrows(int a[M][N])
        {
            int i, j, sum = 0;

            for (i = 0; i < M; i++)
                for (j = 0; j < N; j++)
                    sum += a[i][j];
            return sum;
        }

    Locality Example

    Question: Does this function have good locality?

        int sumarraycols(int a[M][N])
        {
            int i, j, sum = 0;

            for (j = 0; j < N; j++)
                for (i = 0; i < M; i++)
                    sum += a[i][j];
            return sum;
        }
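One way to reason about the two questions is to look at how far apart consecutive inner-loop accesses are in memory. The sketch below measures that distance for a row-major C array. The sizes `M` and `N` are arbitrary, and `element_distance`, `rowwise_inner_stride`, and `colwise_inner_stride` are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>

#define M 4 /* rows (arbitrary size for illustration) */
#define N 8 /* columns */

static int a[M][N];

/* Distance between two elements of a, measured in int-sized elements. */
static size_t element_distance(const int *from, const int *to)
{
    assert(to >= from);
    return (size_t)((const char *)to - (const char *)from) / sizeof(int);
}

/* Inner loop over j (row-wise scan): the next access is a[i][j+1]. */
static size_t rowwise_inner_stride(void)
{
    return element_distance(&a[0][0], &a[0][1]);
}

/* Inner loop over i (column-wise scan): the next access is a[i+1][j]. */
static size_t colwise_inner_stride(void)
{
    return element_distance(&a[0][0], &a[1][0]);
}
```

The row-wise scan has stride 1, so it touches memory sequentially and has good spatial locality. The column-wise scan has stride N, so each access jumps a full row ahead and spatial locality is poor once a row is larger than a cache block.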

    Locality Example

    Question: Can you permute the loops so that the function scans the 3-d array a[] with a stride-1 reference pattern (and thus has good spatial locality)?

        int sumarray3d(int a[M][N][N])
        {
            int i, j, k, sum = 0;

            for (i = 0; i < N; i++)
                for (j = 0; j < N; j++)
                    for (k = 0; k < M; k++)
                        sum += a[k][i][j];
            return sum;
        }
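One possible answer, as a sketch: order the loops so that the rightmost index changes fastest, matching C's row-major layout. The function name `sumarray3d_stride1` and the sizes chosen for `M` and `N` are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

#define M 2 /* arbitrary sizes for illustration */
#define N 3

/* Same sum as the original function, but with the loops permuted so the
 * subscripts vary in the order k (slowest), i, j (fastest). Successive
 * accesses a[k][i][j], a[k][i][j+1], ... are then adjacent in memory. */
static int sumarray3d_stride1(int a[M][N][N])
{
    int i, j, k, sum = 0;

    assert(a != NULL);
    for (k = 0; k < M; k++)
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                sum += a[k][i][j];
    return sum;
}
```

The result is unchanged because addition order does not matter here; only the order in which memory is visited changes.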

    Memory Hierarchies

    Some fundamental and enduring properties of hardware and software:
    • Fast storage technologies cost more per byte and have less capacity
    • Gap between CPU and main memory speed is widening
    • Well-written programs tend to exhibit good locality

    These fundamental properties complement each other beautifully

    They suggest an approach for organizing memory and storage systems known as a memory hierarchy

    An Example Memory Hierarchy

    Toward L0: smaller, faster, and costlier (per byte) storage devices
    Toward L5: larger, slower, and cheaper (per byte) storage devices

    L0: registers. CPU registers hold words retrieved from L1 cache
    L1: on-chip L1 cache (SRAM). L1 cache holds cache lines retrieved from the L2 cache memory
    L2: off-chip L2 cache (SRAM). L2 cache holds cache lines retrieved from main memory
    L3: main memory (DRAM). Main memory holds disk blocks retrieved from local disks
    L4: local secondary storage (local disks). Local disks hold files retrieved from disks on remote network servers
    L5: remote secondary storage (distributed file systems, Web servers)

    Caches

    Cache: Smaller, faster storage device that acts as staging area for subset of data in a larger, slower device

    Fundamental idea of a memory hierarchy:
    • For each k, the faster, smaller device at level k serves as cache for larger, slower device at level k+1

    Why do memory hierarchies work?
    • Programs tend to access data at level k more often than they access data at level k+1
    • Thus, storage at level k+1 can be slower, and thus larger and cheaper per bit
    • Net effect: Large pool of memory that costs as little as the cheap storage near the bottom, but that serves data to programs at ≈ rate of the fast storage near the top.

    Caching in a Memory Hierarchy

    Level k: Smaller, faster, more expensive device at level k caches a subset of the blocks from level k+1 (blocks 8, 9, 14, 3 in the figure)
    Level k+1: Larger, slower, cheaper storage device at level k+1 is partitioned into blocks (blocks 0-15 in the figure)

    Data is copied between levels in block-sized transfer units (blocks 4 and 10 in the figure)

    General Caching Concepts

    Program needs object d, which is stored in some block b

    Cache hit
    • Program finds b in the cache at level k. E.g., block 14

    Cache miss
    • b is not at level k, so level k cache must fetch it from level k+1. E.g., block 12
    • If level k cache is full, then some current block must be replaced (evicted). Which one is the “victim”?
      – Placement policy: where can the new block go? E.g., b mod 4
      – Replacement policy: which block should be evicted? E.g., LRU

    [Figure: level k holds blocks 4*, 9, 14, 3 and level k+1 holds blocks 0-15. Request 14 hits at level k. Request 12 misses, so block 12 is fetched from level k+1 and replaces the victim, block 4*.]
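The hit and miss cases in the figure can be reproduced with a small direct-mapped cache model that uses the "b mod 4" placement policy mentioned above. This is a minimal sketch: the struct `cache` and the function `cache_access` are hypothetical, and with one candidate slot per block no separate replacement policy such as LRU is needed.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_SLOTS 4
#define EMPTY (-1)

/* Level k cache: each slot holds one block number, or EMPTY. */
struct cache {
    int slot[NUM_SLOTS];
};

/* Look up a block at level k. Returns true on a hit. On a miss, the
 * block is fetched from level k+1 and placed in slot (block mod 4),
 * evicting whatever block was there. */
static bool cache_access(struct cache *c, int block)
{
    assert(block >= 0);
    int idx = block % NUM_SLOTS;

    if (c->slot[idx] == block)
        return true;      /* cache hit */
    c->slot[idx] = block; /* cache miss: replace the victim */
    return false;
}
```

Starting from the contents in the figure (4, 9, 14, 3), a request for block 14 hits in slot 2, while a request for block 12 misses and evicts block 4 from slot 0, since 12 mod 4 = 4 mod 4 = 0.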

    General Caching Concepts

    Types of cache misses:

    Cold (compulsory) miss
    • Cold misses occur because the cache is empty

    Conflict miss
    • Most caches limit blocks at level k+1 to a small subset (sometimes a singleton) of the block positions at level k
      – E.g., block i at level k+1 must be placed in block (i mod 4) at level k
    • Conflict misses occur when the level k cache is large enough, but multiple data objects all map to the same level k block
      – E.g., referencing blocks 0, 8, 0, 8, 0, 8, ... would miss every time

    Capacity miss
    • Occurs when the set of active cache blocks (working set) is larger than the cache
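The conflict-miss example can be checked with a short simulation. The sketch below counts misses for a reference sequence on an initially empty four-slot cache using the (i mod 4) placement rule. The function name `count_misses` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_SLOTS 4

/* Count the misses for a sequence of block references on an initially
 * empty direct-mapped cache where block i may only go in slot (i mod 4). */
static size_t count_misses(const int *refs, size_t n)
{
    int slot[NUM_SLOTS];
    size_t misses = 0;

    for (int s = 0; s < NUM_SLOTS; s++)
        slot[s] = -1; /* empty: every first reference is a cold miss */

    for (size_t r = 0; r < n; r++) {
        assert(refs[r] >= 0);
        int idx = refs[r] % NUM_SLOTS;
        if (slot[idx] != refs[r]) {
            slot[idx] = refs[r]; /* evict the previous occupant */
            misses++;
        }
    }
    return misses;
}
```

For the sequence 0, 8, 0, 8, 0, 8 every reference misses, because blocks 0 and 8 both map to slot 0 and keep evicting each other even though three slots stay empty. For 0, 1, 0, 1, 0, 1 only the first two references miss (cold misses), and the rest hit.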

    Examples of Caching in the Hierarchy

    Cache Type             What Cached            Where Cached           Latency (cycles)   Managed By
    Registers              4-byte word            CPU registers                         0   Compiler
    TLB                    Address translations   On-Chip TLB                           0   Hardware
    L1 cache               32-byte block          On-Chip L1                            1   Hardware
    L2 cache               32-byte block          Off-Chip L2                          10   Hardware
    Virtual Memory         4-KB page              Main memory                         100   Hardware+OS
    Buffer cache           Parts of files         Main memory                         100   OS
    Network buffer cache   Parts of files         Local disk                   10,000,000   AFS/NFS client
    Browser cache          Web pages              Local disk                   10,000,000   Web browser
    Web cache              Web pages              Remote server disks       1,000,000,000   Web proxy server