CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Lecture 6 – Memory II Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste http://inst.eecs.berkeley.edu/~cs152



  • CS 152 Computer Architecture and Engineering
    CS252 Graduate Computer Architecture

    Lecture 6 – Memory II

    Krste Asanovic
    Electrical Engineering and Computer Sciences
    University of California at Berkeley

    http://www.eecs.berkeley.edu/~krste
    http://inst.eecs.berkeley.edu/~cs152

  • Last time in Lecture 6

    § Dynamic RAM (DRAM) is the main form of main memory storage in use today
    – Holds values on small capacitors, needs refreshing (hence dynamic)
    – Slow multi-step access: precharge, read row, read column

    § Static RAM (SRAM) is faster but more expensive
    – Used to build on-chip memory for caches

    § Cache holds a small set of values in fast memory (SRAM) close to the processor
    – Need a search scheme to find values in the cache, and a replacement policy to make space for newly accessed locations

    § Caches exploit two forms of predictability in memory reference streams
    – Temporal locality: the same location is likely to be accessed again soon
    – Spatial locality: neighboring locations are likely to be accessed soon

  • Recap: Replacement Policy

    In an associative cache, which line from a set should be evicted when the set becomes full?
    • Random
    • Least-Recently Used (LRU)
    – LRU cache state must be updated on every access
    – True implementation only feasible for small sets (2-way)
    – Pseudo-LRU binary tree often used for 4-8 way
    • First-In, First-Out (FIFO), a.k.a. Round-Robin
    – Used in highly associative caches
    • Not-Most-Recently Used (NMRU)
    – FIFO with an exception for the most-recently used line or lines

    This is a second-order effect. Why?
    Replacement only happens on misses.
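
The true-LRU bookkeeping above can be sketched in a few lines. This is an illustrative model of a single set, not the lecture's hardware; the function name and tags are invented.

```python
from collections import OrderedDict

def simulate_lru_set(accesses, ways):
    """Simulate one set of a `ways`-way associative cache under true LRU.

    `accesses` is a sequence of tags; returns the list of evicted tags."""
    lines = OrderedDict()   # tags ordered from least- to most-recently used
    evicted = []
    for tag in accesses:
        if tag in lines:
            lines.move_to_end(tag)        # hit: LRU state updated on every access
        else:
            if len(lines) == ways:        # set full: evict the LRU line
                victim, _ = lines.popitem(last=False)
                evicted.append(victim)
            lines[tag] = None             # fill with the newly accessed line
    return evicted
```

Note that evictions occur only on the miss path, which is why replacement policy is a second-order effect.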

  • Pseudo-LRU Binary Tree

    § For a 2-way cache, on a hit, a single LRU bit is set to point to the other way
    § For a 4-way cache, need 3 bits of state. On a cache hit, set all bits on the path down the tree to point to the other half. On a miss, the bits say which way to replace.

    [Figure: binary tree of LRU bits over Way 0, Way 1, Way 2, Way 3 – one root bit selecting a half, plus one bit per pair of ways]
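
A sketch of the 4-way, 3-bit tree just described. The bit encoding (0 = victim is in the left half) is my choice of convention, not fixed by the slide.

```python
class TreePLRU4:
    """3-bit tree pseudo-LRU for one 4-way cache set."""

    def __init__(self):
        self.root = 0   # 0: victim among ways 0/1, 1: among ways 2/3
        self.left = 0   # 0: victim is way 0, 1: way 1
        self.right = 0  # 0: victim is way 2, 1: way 3

    def access(self, way):
        """On a hit (or fill), set the bits on the path to point to the other half."""
        if way < 2:
            self.root = 1               # colder half is now the right pair
            self.left = 1 - way         # point away from the accessed way
        else:
            self.root = 0
            self.right = 1 - (way - 2)

    def victim(self):
        """On a miss, the bits say which way to replace."""
        return self.left if self.root == 0 else 2 + self.right
```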

  • CPU-Cache Interaction (5-stage pipeline)

    [Figure: 5-stage pipeline datapath. The PC addresses the primary instruction cache (hit? gates PCen and inserts a bubble into IR on a miss); decode/register fetch feeds operands A and B to the ALU (result Y); the primary data cache sits in the memory stage (addr, wdata, rdata, we) between registers MD1 and MD2, with cache refill data from lower levels of the memory hierarchy and a path to memory control.]

    Stall entire CPU on data cache miss.

  • Improving Cache Performance

    Average memory access time (AMAT) = Hit time + Miss rate × Miss penalty

    To improve performance:
    • reduce the hit time
    • reduce the miss rate
    • reduce the miss penalty

    What is the best cache design for a 5-stage pipeline?
    The biggest cache that doesn't increase hit time past 1 cycle (approx. 8-32 KB in modern technology)

    [Design issues are more complex with deeper pipelines and/or out-of-order superscalar processors]
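
A worked instance of the AMAT formula above; the numbers (1-cycle hit, 5% miss rate, 20-cycle penalty) are illustrative, not from the lecture.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Illustrative numbers: 1-cycle hit, 5% miss rate, 20-cycle miss penalty
base = amat(1.0, 0.05, 20.0)            # 1 + 0.05 * 20 = 2.0 cycles
halved_misses = amat(1.0, 0.025, 20.0)  # halving the miss rate: 1.5 cycles
```

Each of the three improvement levers (hit time, miss rate, miss penalty) maps to one term of the formula.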

  • Causes of Cache Misses: The 3 C's

    Compulsory: first reference to a line (a.k.a. cold-start misses)
    – misses that would occur even with an infinite cache

    Capacity: cache is too small to hold all data needed by the program
    – misses that would occur even under a perfect replacement policy

    Conflict: misses that occur because of collisions due to the line-placement strategy
    – misses that would not occur with ideal full associativity

  • Effect of Cache Parameters on Performance

    § Larger cache size
    + reduces capacity and conflict misses
    – hit time will increase

    § Higher associativity
    + reduces conflict misses
    – may increase hit time

    § Larger line size
    + reduces compulsory and capacity (reload) misses
    – increases conflict misses and miss penalty

  • Figure B.9 Total miss rate (top) and distribution of miss rate (bottom) for each size cache according to the three C's for the data in Figure B.8. The top diagram shows the actual data cache miss rates, while the bottom diagram shows the percentage in each category. (Space allows the graphs to show one extra cache size than can fit in Figure B.8.)

    © 2018 Elsevier Inc. All rights reserved.

  • Recap: Line Size and Spatial Locality

    A line is the unit of transfer between the cache and memory.

    [Figure: split CPU address into a Line Address (Tag, 32−b bits) and an Offset (b bits); 2^b = line size (in bytes). Example: 4-word line, b = 2, holding Word 0 … Word 3]

    Larger line size has distinct hardware advantages:
    • less tag overhead
    • exploit fast burst transfers from DRAM
    • exploit fast burst transfers over wide busses

    What are the disadvantages of increasing line size?
    Fewer lines => more conflicts. Can waste bandwidth.
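
The address split can be sketched as follows, assuming a byte-addressed offset; the function name and the b = 4 example (a 16-byte line of four 4-byte words) are mine.

```python
def split_address(addr, b):
    """Split a CPU address into (line address, offset),
    where 2**b is the line size in bytes."""
    offset = addr & ((1 << b) - 1)  # low b bits select the byte within the line
    line_addr = addr >> b           # remaining 32-b bits: tag + index
    return line_addr, offset

# For a 16-byte line (b = 4): address 0x1234 -> line 0x123, byte offset 0x4
```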

  • © 2019 Elsevier Inc. All rights reserved.

    Figure B.10 Miss rate versus block size for five different-sized caches. Note that miss rate actually goes up if the block size is too large relative to the cache size. Each line represents a cache of different size. Figure B.11 shows the data used to plot these lines. Unfortunately, SPEC2000 traces would take too long if block size were included, so these data are based on SPEC92 on a DECstation 5000 (Gee et al. 1993).

  • Write Policy Choices

    § Cache hit:
    – write-through: write both cache & memory
    • Generally higher traffic but simpler pipeline & cache design
    – write-back: write cache only; memory is written only when the entry is evicted
    • A dirty bit per line further reduces write-back traffic
    • Must handle 0, 1, or 2 accesses to memory for each load/store

    § Cache miss:
    – no-write-allocate: only write to main memory
    – write-allocate (a.k.a. fetch-on-write): fetch into cache

    § Common combinations:
    – write-through and no-write-allocate
    – write-back with write-allocate
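
A deliberately tiny single-line cache model contrasting the two common combinations above; it counts only memory-write traffic, and the still-dirty final line is never flushed in this sketch.

```python
def write_traffic(stores, policy):
    """Memory writes generated by a stream of store addresses hitting a
    one-line cache (toy model, not the lecture's implementation).

    'write-through': every store also writes memory (no-write-allocate).
    'write-back'   : with write-allocate, a line is written to memory only
                     when a dirty line is evicted (dirty bit per line).
    """
    if policy == "write-through":
        return len(stores)
    resident, dirty, writebacks = None, False, 0
    for addr in stores:
        if addr != resident:       # miss: write-allocate, evicting current line
            if dirty:
                writebacks += 1    # dirty bit forces a write-back on eviction
            resident, dirty = addr, False
        dirty = True               # the store dirties the resident line
    return writebacks
```

For a store stream with locality, e.g. [1, 1, 1, 2], write-through issues 4 memory writes while write-back issues 1, which is why write-back generally has lower traffic.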

  • Write Performance

    [Figure: cache array with 2^k lines, each holding a valid bit V, a Tag, and Data. The address is split into Tag (t bits), Index (k bits), and Offset (b bits); the indexed tag is compared (=) against the address tag to produce HIT and select the data word or byte, with WE as the data-array write enable.]

  • Reducing Write Hit Time

    Problem: Writes take two cycles in the memory stage: one cycle for tag check plus one cycle for data write if hit.

    Solutions:
    § Design data RAM that can perform read and write in one cycle; restore the old value after a tag miss
    § Pipelined writes: hold write data for a store in a single buffer ahead of the cache, and write the cache data during the next store's tag check
    § Fully-associative (CAM tag) caches: word line only enabled if hit

  • Pipelining Cache Writes

    [Figure: address and store data from the CPU. Tag and index probe the tag array while a delayed-write buffer holds the previous store's data and address; on a load, both the tag array and the delayed write address are checked (=?), a Load/Store (L/S) select steers the result, and load data returns to the CPU on a hit.]

    Data from a store hit is written into the data portion of the cache during the tag access of the subsequent store.

  • CS152 Administrivia

    § PS2 out on Wednesday, due Wednesday Feb 26
    § Monday Feb 17 is President's Day Holiday, no class!
    § Lab 1 due at start of class on Wednesday Feb 19
    § Friday's sections will review PS1 and solutions

  • CS252 Administrivia

    § Start thinking of class projects and forming teams of two
    § Proposal due Wednesday February 26th

  • Write Buffer to Reduce Read Miss Penalty

    [Figure: CPU and register file, data cache, write buffer, and unified L2 cache. The write buffer holds evicted dirty lines for a write-back cache OR all writes in a write-through cache.]

    The processor is not stalled on writes, and read misses can go ahead of writes to main memory.

    Problem: The write buffer may hold the updated value of a location needed by a read miss.
    Simple solution: on a read miss, wait for the write buffer to go empty.
    Faster solution: Check write buffer addresses against the read miss address; if no match, allow the read miss to go ahead of the writes, else return the value in the write buffer.
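
The faster solution can be sketched as follows; the function and structure names are assumptions, not from the slides.

```python
def service_read_miss(addr, write_buffer, memory):
    """Service a read miss in the presence of a write buffer.

    `write_buffer` holds (addr, value) pairs, oldest first, not yet written
    to memory. On an address match, forward the newest buffered value;
    otherwise the read miss may safely go ahead of the buffered writes."""
    for buffered_addr, value in reversed(write_buffer):  # newest entry wins
        if buffered_addr == addr:
            return value
    return memory[addr]  # no match: read miss bypasses the write buffer
```

Scanning from newest to oldest matters: if the buffer holds two pending writes to the same address, only the most recent value is architecturally correct.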

  • Reducing Tag Overhead with Sub-Blocks

    § Problem: Tags are too large, i.e., too much overhead
    – Simple solution: larger lines, but the miss penalty could be large

    § Solution: Sub-block placement (a.k.a. sector cache)
    – A valid bit is added to units smaller than the full line, called sub-blocks
    – Only read a sub-block on a miss
    – If a tag matches, is the word in the cache?

    [Example: three tags (100, 300, 204), each with one valid bit per sub-block]

  • Multilevel Caches

    Problem: A memory cannot be large and fast.
    Solution: Increasing sizes of cache at each level.

    CPU <-> L1$ <-> L2$ <-> DRAM

    Local miss rate = misses in cache / accesses to cache
    Global miss rate = misses in cache / CPU memory accesses
    Misses per instruction = misses in cache / number of instructions
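
The three metrics above as code, with illustrative counts of my own choosing (an L2 seeing 5,000 accesses, i.e. the L1 misses, out of 100,000 CPU memory accesses).

```python
def miss_metrics(misses, cache_accesses, cpu_accesses, instructions):
    """The three miss metrics from the slide, for one cache level."""
    local_rate = misses / cache_accesses     # misses / accesses to this cache
    global_rate = misses / cpu_accesses      # misses / CPU memory accesses
    per_instruction = misses / instructions  # misses / number of instructions
    return local_rate, global_rate, per_instruction

# Illustrative: L2 misses 1,000 of its 5,000 accesses; 100,000 CPU accesses
# over 80,000 instructions -> local 0.20, global 0.01, 0.0125 misses/instr.
```

The gap between local and global miss rate is why a small L2's local miss rate can look alarmingly high while its global impact is modest.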

  • © 2019 Elsevier Inc. All rights reserved.

    Figure B.14 Miss rates versus cache size for multilevel caches. Second-level caches smaller than the sum of the two 64 KiB first-level caches make little sense, as reflected in the high miss rates. After 256 KiB the single cache is within 10% of the global miss rates. The miss rate of a single-level cache versus size is plotted against the local miss rate and global miss rate of a second-level cache using a 32 KiB first-level cache. The L2 caches (unified) were two-way set associative with LRU replacement. Each had split L1 instruction and data caches that were 64 KiB two-way set associative with LRU replacement. The block size for both L1 and L2 caches was 64 bytes. Data were collected as in Figure B.4.

  • Presence of L2 influences L1 design

    § Use a smaller L1 if there is also an L2
    – Trade increased L1 miss rate for reduced L1 hit time
    – Backup L2 reduces L1 miss penalty
    – Reduces average access energy

    § Use a simpler write-through L1 with on-chip L2
    – Write-back L2 cache absorbs write traffic, doesn't go off-chip
    – At most one L1 miss request per L1 access (no dirty victim write-back) simplifies pipeline control
    – Simplifies coherence issues
    – Simplifies error recovery in L1 (can use just parity bits in L1 and reload from L2 when a parity error is detected on an L1 read)

  • © 2019 Elsevier Inc. All rights reserved.

    Figure B.15 Relative execution time by second-level cache size. The two bars are for different clock cycles for an L2 cache hit. The reference execution time of 1.00 is for an 8192 KiB second-level cache with a 1-clock-cycle latency on a second-level hit. These data were collected the same way as in Figure B.14, using a simulator to imitate the Alpha 21264.

  • Inclusion Policy

    § Inclusive multilevel cache:
    – Inner cache can only hold lines also present in outer cache
    – An external coherence snoop access need only check the outer cache

    § Exclusive multilevel caches:
    – Inner cache may hold lines not in outer cache
    – Swap lines between inner/outer caches on miss
    – Used in AMD Athlon with 64 KB primary and 256 KB secondary cache

    Why choose one type or the other?

  • Itanium-2 On-Chip Caches (Intel/HP, 2002)

    Level 1: 16 KB, 4-way s.a., 64 B line, quad-port (2 load + 2 store), single-cycle latency

    Level 2: 256 KB, 4-way s.a., 128 B line, quad-port (4 load or 4 store), five-cycle latency

    Level 3: 3 MB, 12-way s.a., 128 B line, single 32 B port, twelve-cycle latency

  • Power7 On-Chip Caches [IBM 2009]

    32 KB L1 I$/core
    32 KB L1 D$/core, 3-cycle latency
    256 KB unified L2$/core, 8-cycle latency
    32 MB unified shared L3$ in embedded DRAM (eDRAM), 25-cycle latency to local slice

  • IBM z196 Mainframe Caches 2010

    § 96 cores (4 cores/chip, 24 chips/system)
    – Out-of-order, 3-way superscalar @ 5.2 GHz

    § L1: 64 KB I-$/core + 128 KB D-$/core
    § L2: 1.5 MB private/core (144 MB total)
    § L3: 24 MB shared/chip (eDRAM) (576 MB total)
    § L4: 768 MB shared/system (eDRAM)

  • Acknowledgements

    § This course is partly inspired by previous MIT 6.823 and Berkeley CS252 computer architecture courses created by my collaborators and colleagues:
    – Arvind (MIT)
    – Joel Emer (Intel/MIT)
    – James Hoe (CMU)
    – John Kubiatowicz (UCB)
    – David Patterson (UCB)