CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Lecture 6 – Memory II Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste http://inst.eecs.berkeley.edu/~cs152



  • CS 152 Computer Architecture and Engineering
    CS252 Graduate Computer Architecture

    Lecture 6 – Memory II

    Krste Asanovic
    Electrical Engineering and Computer Sciences
    University of California at Berkeley

    http://www.eecs.berkeley.edu/~krste
    http://inst.eecs.berkeley.edu/~cs152

  • Last time in Lecture 6

    § Dynamic RAM (DRAM) is the main form of main memory storage in use today
    – Holds values on small capacitors, needs refreshing (hence dynamic)
    – Slow multi-step access: precharge, read row, read column

    § Static RAM (SRAM) is faster but more expensive
    – Used to build on-chip memory for caches

    § Cache holds a small set of values in fast memory (SRAM) close to the processor
    – Need a search scheme to find values in the cache, and a replacement policy to make space for newly accessed locations

    § Caches exploit two forms of predictability in memory reference streams
    – Temporal locality: the same location is likely to be accessed again soon
    – Spatial locality: neighboring locations are likely to be accessed soon

  • Recap: Replacement Policy

    In an associative cache, which line from a set should be evicted when the set becomes full?
    • Random
    • Least-Recently Used (LRU)
    – LRU cache state must be updated on every access
    – True implementation only feasible for small sets (2-way)
    – Pseudo-LRU binary tree often used for 4-8 way
    • First-In, First-Out (FIFO), a.k.a. Round-Robin
    – Used in highly associative caches
    • Not-Most-Recently Used (NMRU)
    – FIFO with an exception for the most-recently used line or lines

    This is a second-order effect. Why?
    Replacement only happens on misses.
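
The true-LRU bookkeeping above can be sketched in a few lines. This is an illustrative model of a single set, not the lecture's hardware; the function name and tags are invented.

```python
from collections import OrderedDict

def simulate_lru_set(accesses, ways):
    """Simulate one set of a `ways`-way associative cache under true LRU.

    `accesses` is a sequence of tags; returns the list of evicted tags."""
    lines = OrderedDict()   # tags ordered from least- to most-recently used
    evicted = []
    for tag in accesses:
        if tag in lines:
            lines.move_to_end(tag)        # hit: LRU state updated on every access
        else:
            if len(lines) == ways:        # set full: evict the LRU line
                victim, _ = lines.popitem(last=False)
                evicted.append(victim)
            lines[tag] = None             # fill with the newly accessed line
    return evicted
```

Note that evictions occur only on the miss path, which is why replacement policy is a second-order effect.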

  • Pseudo-LRU Binary Tree

    § For a 2-way cache, on a hit, a single LRU bit is set to point to the other way
    § For a 4-way cache, need 3 bits of state. On a cache hit, set all bits on the path down the tree to point to the other half. On a miss, the bits say which way to replace.

    [Figure: binary tree of LRU bits over Way 0, Way 1, Way 2, Way 3 – one root bit selecting a half, plus one bit per pair of ways]
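
A sketch of the 4-way, 3-bit tree just described. The bit encoding (0 = victim is in the left half) is my choice of convention, not fixed by the slide.

```python
class TreePLRU4:
    """3-bit tree pseudo-LRU for one 4-way cache set."""

    def __init__(self):
        self.root = 0   # 0: victim among ways 0/1, 1: among ways 2/3
        self.left = 0   # 0: victim is way 0, 1: way 1
        self.right = 0  # 0: victim is way 2, 1: way 3

    def access(self, way):
        """On a hit (or fill), set the bits on the path to point to the other half."""
        if way < 2:
            self.root = 1               # colder half is now the right pair
            self.left = 1 - way         # point away from the accessed way
        else:
            self.root = 0
            self.right = 1 - (way - 2)

    def victim(self):
        """On a miss, the bits say which way to replace."""
        return self.left if self.root == 0 else 2 + self.right
```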

  • CPU-Cache Interaction (5-stage pipeline)

    [Figure: 5-stage pipeline datapath. The PC addresses the primary instruction cache (hit? gates PCen and inserts a bubble into IR on a miss); decode/register fetch feeds operands A and B to the ALU (result Y); the primary data cache sits in the memory stage (addr, wdata, rdata, we) between registers MD1 and MD2, with cache refill data from lower levels of the memory hierarchy and a path to memory control.]

    Stall entire CPU on data cache miss.

  • Improving Cache Performance

    Average memory access time (AMAT) = Hit time + Miss rate × Miss penalty

    To improve performance:
    • reduce the hit time
    • reduce the miss rate
    • reduce the miss penalty

    What is the best cache design for a 5-stage pipeline?
    The biggest cache that doesn't increase hit time past 1 cycle (approx. 8-32 KB in modern technology)

    [Design issues are more complex with deeper pipelines and/or out-of-order superscalar processors]
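
A worked instance of the AMAT formula above; the numbers (1-cycle hit, 5% miss rate, 20-cycle penalty) are illustrative, not from the lecture.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Illustrative numbers: 1-cycle hit, 5% miss rate, 20-cycle miss penalty
base = amat(1.0, 0.05, 20.0)            # 1 + 0.05 * 20 = 2.0 cycles
halved_misses = amat(1.0, 0.025, 20.0)  # halving the miss rate: 1.5 cycles
```

Each of the three improvement levers (hit time, miss rate, miss penalty) maps to one term of the formula.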

  • Causes of Cache Misses: The 3 C's

    Compulsory: first reference to a line (a.k.a. cold-start misses)
    – misses that would occur even with an infinite cache

    Capacity: cache is too small to hold all data needed by the program
    – misses that would occur even under a perfect replacement policy

    Conflict: misses that occur because of collisions due to the line-placement strategy
    – misses that would not occur with ideal full associativity

  • Effect of Cache Parameters on Performance

    § Larger cache size
    + reduces capacity and conflict misses
    – hit time will increase

    § Higher associativity
    + reduces conflict misses
    – may increase hit time

    § Larger line size
    + reduces compulsory and capacity (reload) misses
    – increases conflict misses and miss penalty

  • Figure B.9 Total miss rate (top) and distribution of miss rate (bottom) for each size cache according to the three C's for the data in Figure B.8. The top diagram shows the actual data cache miss rates, while the bottom diagram shows the percentage in each category. (Space allows the graphs to show one extra cache size than can fit in Figure B.8.)

    © 2018 Elsevier Inc. All rights reserved.

  • Recap: Line Size and Spatial Locality

    A line is the unit of transfer between the cache and memory.

    [Figure: split CPU address into a Line Address (Tag, 32−b bits) and an Offset (b bits); 2^b = line size (in bytes). Example: 4-word line, b = 2, holding Word 0 … Word 3]

    Larger line size has distinct hardware advantages:
    • less tag overhead
    • exploit fast burst transfers from DRAM
    • exploit fast burst transfers over wide busses

    What are the disadvantages of increasing line size?
    Fewer lines => more conflicts. Can waste bandwidth.
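
The address split can be sketched as follows, assuming a byte-addressed offset; the function name and the b = 4 example (a 16-byte line of four 4-byte words) are mine.

```python
def split_address(addr, b):
    """Split a CPU address into (line address, offset),
    where 2**b is the line size in bytes."""
    offset = addr & ((1 << b) - 1)  # low b bits select the byte within the line
    line_addr = addr >> b           # remaining 32-b bits: tag + index
    return line_addr, offset

# For a 16-byte line (b = 4): address 0x1234 -> line 0x123, byte offset 0x4
```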

  • © 2019 Elsevier Inc. All rights reserved.

    Figure B.10 Miss rate versus block size for five different-sized caches. Note that miss rate actually goes up if the block size is too large relative to the cache size. Each line represents a cache of different size. Figure B.11 shows the data used to plot these lines. Unfortunately, SPEC2000 traces would take too long if block size were included, so these data are based on SPEC92 on a DECstation 5000 (Gee et al. 1993).

  • Write Policy Choices

    § Cache hit:
    – write-through: write both cache & memory
    • Generally higher traffic but simpler pipeline & cache design
    – write-back: write cache only; memory is written only when the entry is evicted
    • A dirty bit per line further reduces write-back traffic
    • Must handle 0, 1, or 2 accesses to memory for each load/store

    § Cache miss:
    – no-write-allocate: only write to main memory
    – write-allocate (a.k.a. fetch-on-write): fetch into cache

    § Common combinations:
    – write-through and no-write-allocate
    – write-back with write-allocate
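
A deliberately tiny single-line cache model contrasting the two common combinations above; it counts only memory-write traffic, and the still-dirty final line is never flushed in this sketch.

```python
def write_traffic(stores, policy):
    """Memory writes generated by a stream of store addresses hitting a
    one-line cache (toy model, not the lecture's implementation).

    'write-through': every store also writes memory (no-write-allocate).
    'write-back'   : with write-allocate, a line is written to memory only
                     when a dirty line is evicted (dirty bit per line).
    """
    if policy == "write-through":
        return len(stores)
    resident, dirty, writebacks = None, False, 0
    for addr in stores:
        if addr != resident:       # miss: write-allocate, evicting current line
            if dirty:
                writebacks += 1    # dirty bit forces a write-back on eviction
            resident, dirty = addr, False
        dirty = True               # the store dirties the resident line
    return writebacks
```

For a store stream with locality, e.g. [1, 1, 1, 2], write-through issues 4 memory writes while write-back issues 1, which is why write-back generally has lower traffic.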

  • Write Performance

    [Figure: cache array with 2^k lines, each holding a valid bit V, a Tag, and Data. The address is split into Tag (t bits), Index (k bits), and Offset (b bits); the indexed tag is compared (=) against the address tag to produce HIT and select the data word or byte, with WE as the data-array write enable.]

  • Reducing Write Hit Time

    Problem: Writes take two cycles in the memory stage: one cycle for tag check plus one cycle for data write if hit.

    Solutions:
    § Design data RAM that can perform read and write in one cycle; restore the old value after a tag miss
    § Pipelined writes: hold write data for a store in a single buffer ahead of the cache, and write the cache data during the next store's tag check
    § Fully-associative (CAM tag) caches: word line only enabled if hit

  • Pipelining Cache Writes

    [Figure: address and store data from the CPU. Tag and index probe the tag array while a delayed-write buffer holds the previous store's data and address; on a load, both the tag array and the delayed write address are checked (=?), a Load/Store (L/S) select steers the result, and load data returns to the CPU on a hit.]

    Data from a store hit is written into the data portion of the cache during the tag access of the subsequent store.

  • CS152 Administrivia

    § PS2 out on Wednesday, due Wednesday Feb 26
    § Monday Feb 17 is President's Day Holiday, no class!
    § Lab 1 due at start of class on Wednesday Feb 19
    § Friday's sections will review PS1 and solutions

  • CS252 Administrivia

    § Start thinking of class projects and forming teams of two
    § Proposal due Wednesday February 26th

  • Write Buffer to Reduce Read Miss Penalty

    [Figure: CPU and register file, data cache, write buffer, and unified L2 cache. The write buffer holds evicted dirty lines for a write-back cache OR all writes in a write-through cache.]

    The processor is not stalled on writes, and read misses can go ahead of writes to main memory.

    Problem: The write buffer may hold the updated value of a location needed by a read miss.
    Simple solution: on a read miss, wait for the write buffer to go empty.
    Faster solution: Check write buffer addresses against the read miss address; if no match, allow the read miss to go ahead of the writes, else return the value in the write buffer.
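
The faster solution can be sketched as follows; the function and structure names are assumptions, not from the slides.

```python
def service_read_miss(addr, write_buffer, memory):
    """Service a read miss in the presence of a write buffer.

    `write_buffer` holds (addr, value) pairs, oldest first, not yet written
    to memory. On an address match, forward the newest buffered value;
    otherwise the read miss may safely go ahead of the buffered writes."""
    for buffered_addr, value in reversed(write_buffer):  # newest entry wins
        if buffered_addr == addr:
            return value
    return memory[addr]  # no match: read miss bypasses the write buffer
```

Scanning from newest to oldest matters: if the buffer holds two pending writes to the same address, only the most recent value is architecturally correct.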

  • Reducing Tag Overhead with Sub-Blocks

    § Problem: Tags are too large, i.e., too much overhead
    – Simple solution: larger lines, but the miss penalty could be large

    § Solution: Sub-block placement (a.k.a. sector cache)
    – A valid bit is added to units smaller than the full line, called sub-blocks
    – Only read a sub-block on a miss
    – If a tag matches, is the word in the cache?

    [Example: three tags (100, 300, 204), each with one valid bit per sub-block]

  • Multilevel Caches

    Problem: A memory cannot be large and fast.
    Solution: Increasing sizes of cache at each level.

    CPU <-> L1$ <-> L2$ <-> DRAM

    Local miss rate = misses in cache / accesses to cache
    Global miss rate = misses in cache / CPU memory accesses
    Misses per instruction = misses in cache / number of instructions
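
The three metrics above as code, with illustrative counts of my own choosing (an L2 seeing 5,000 accesses, i.e. the L1 misses, out of 100,000 CPU memory accesses).

```python
def miss_metrics(misses, cache_accesses, cpu_accesses, instructions):
    """The three miss metrics from the slide, for one cache level."""
    local_rate = misses / cache_accesses     # misses / accesses to this cache
    global_rate = misses / cpu_accesses      # misses / CPU memory accesses
    per_instruction = misses / instructions  # misses / number of instructions
    return local_rate, global_rate, per_instruction

# Illustrative: L2 misses 1,000 of its 5,000 accesses; 100,000 CPU accesses
# over 80,000 instructions -> local 0.20, global 0.01, 0.0125 misses/instr.
```

The gap between local and global miss rate is why a small L2's local miss rate can look alarmingly high while its global impact is modest.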

  • © 2019 Elsevier Inc. All rights reserved.

    Figure B.14 Miss rates versus cache size for multilevel caches. Second-level caches smaller than the sum of the two 64 KiB first-level caches make little sense, as reflected in the high miss rates. After 256 KiB the single cache is within 10% of the global miss rates. The miss rate of a single-level cache versus size is plotted against the local miss rate and global miss rate of a second-level cache using a 32 KiB first-level cache. The L2 caches (unified) were two-way set associative with LRU replacement. Each had split L1 instruction and data caches that were 64 KiB two-way set associative with LRU replacement. The block size for both L1 and L2 caches was 64 bytes. Data were collected as in Figure B.4.

  • Presence of L2 influences L1 design

    § Use a smaller L1 if there is also an L2
    – Trade increased L1 miss rate for reduced L1 hit time
    – Backup L2 reduces L1 miss penalty
    – Reduces average access energy

    § Use a simpler write-through L1 with on-chip L2
    – Write-back L2 cache absorbs write traffic, doesn't go off-chip
    – At most one L1 miss request per L1 access (no dirty victim write-back) simplifies pipeline control
    – Simplifies coherence issues
    – Simplifies error recovery in L1 (can use just parity bits in L1 and reload from L2 when a parity error is detected on an L1 read)

  • © 2019 Elsevier Inc. All rights reserved.

    Figure B.15 Relative execution time by second-level cache size. The two bars are for different clock cycles for an L2 cache hit. The reference execution time of 1.00 is for an 8192 KiB second-level cache with a 1-clock-cycle latency on a second-level hit. These data were collected the same way as in Figure B.14, using a simulator to imitate the Alpha 21264.

  • Inclusion Policy

    § Inclusive multilevel cache:
    – Inner cache can only hold lines also present in outer cache
    – An external coherence snoop access need only check the outer cache

    § Exclusive multilevel caches:
    – Inner cache may hold lines not in outer cache
    – Swap lines between inner/outer caches on miss
    – Used in AMD Athlon with 64 KB primary and 256 KB secondary cache

    Why choose one type or the other?

  • Itanium-2 On-Chip Caches (Intel/HP, 2002)

    Level 1: 16 KB, 4-way s.a., 64 B line, quad-port (2 load + 2 store), single-cycle latency

    Level 2: 256 KB, 4-way s.a., 128 B line, quad-port (4 load or 4 store), five-cycle latency

    Level 3: 3 MB, 12-way s.a., 128 B line, single 32 B port, twelve-cycle latency

  • Power7 On-Chip Caches [IBM 2009]

    32 KB L1 I$/core
    32 KB L1 D$/core, 3-cycle latency
    256 KB unified L2$/core, 8-cycle latency
    32 MB unified shared L3$ in embedded DRAM (eDRAM), 25-cycle latency to local slice

  • IBM z196 Mainframe Caches 2010

    § 96 cores (4 cores/chip, 24 chips/system)
    – Out-of-order, 3-way superscalar @ 5.2 GHz

    § L1: 64 KB I-$/core + 128 KB D-$/core
    § L2: 1.5 MB private/core (144 MB total)
    § L3: 24 MB shared/chip (eDRAM) (576 MB total)
    § L4: 768 MB shared/system (eDRAM)

  • Acknowledgements

    § This course is partly inspired by previous MIT 6.823 and Berkeley CS252 computer architecture courses created by my collaborators and colleagues:
    – Arvind (MIT)
    – Joel Emer (Intel/MIT)
    – James Hoe (CMU)
    – John Kubiatowicz (UCB)
    – David Patterson (UCB)