CS152 Computer Architecture and Engineering
CS252 Graduate Computer Architecture

Lecture 6 – Memory II

Krste Asanovic
Electrical Engineering and Computer Sciences
University of California at Berkeley
http://www.eecs.berkeley.edu/~krste
http://inst.eecs.berkeley.edu/~cs152
Last Time in Lecture 5
§ Dynamic RAM (DRAM) is main form of main memory storage in use today
  – Holds values on small capacitors, needs refreshing (hence dynamic)
  – Slow multi-step access: precharge, read row, read column
§ Static RAM (SRAM) is faster but more expensive
  – Used to build on-chip memory for caches
§ Cache holds small set of values in fast memory (SRAM) close to processor
  – Need to develop search scheme to find values in cache, and replacement policy to make space for newly accessed locations
§ Caches exploit two forms of predictability in memory reference streams
  – Temporal locality: same location likely to be accessed again soon
  – Spatial locality: neighboring locations likely to be accessed soon
Recap: Replacement Policy
In an associative cache, which line from a set should be evicted when the set becomes full?
• Random
• Least-Recently Used (LRU)
  • LRU cache state must be updated on every access
  • True implementation only feasible for small sets (2-way)
  • Pseudo-LRU binary tree often used for 4-8 way
• First-In, First-Out (FIFO) a.k.a. Round-Robin
  • Used in highly associative caches
• Not-Most-Recently Used (NMRU)
  • FIFO with exception for most-recently used line or lines

This is a second-order effect. Why?
Replacement only happens on misses
Pseudo-LRU Binary Tree

§ For 2-way cache, on a hit, single LRU bit is set to point to other way
§ For 4-way cache, need 3 bits of state. On cache hit, on path down tree, set all bits to point to other half. On miss, bits say which way to replace

[Diagram: binary tree of LRU bits over Way 0-Way 3; the root bit selects a half, and one bit per half selects the way within it]
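The 4-way tree scheme above can be sketched as a small software model (an illustrative sketch, not a hardware design): three state bits, where each bit points toward the pseudo-least-recently-used side of the ways it covers.

```python
# Sketch of tree pseudo-LRU for one 4-way set.
# b[0] covers the halves (0 -> ways {0,1}, 1 -> ways {2,3});
# b[1] and b[2] cover the ways within the left and right half.
# Each bit points toward the pseudo-LRU side, so a hit sets the
# bits on its path to point at the *other* side, and a miss just
# follows the bits down the tree to find the victim.

class TreePLRU4:
    def __init__(self):
        self.b = [0, 0, 0]   # 3 bits of state for 4 ways

    def access(self, way):
        """On a hit (or fill) of `way`, point the path bits away from it."""
        if way < 2:                # left half used: root points right
            self.b[0] = 1
            self.b[1] = 1 - way    # leaf points at the other left way
        else:                      # right half used: root points left
            self.b[0] = 0
            self.b[2] = 3 - way    # leaf points at the other right way

    def victim(self):
        """Follow the bits down the tree to the way to replace."""
        if self.b[0] == 0:
            return self.b[1]       # way 0 or 1
        return 2 + self.b[2]       # way 2 or 3
```

After touching ways 0, 1, 2, 3 in order, the victim is way 0 (matches true LRU); after then touching way 0 again, the tree picks way 2 even though way 1 is the true LRU, which is exactly the approximation that makes this "pseudo"-LRU.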
CPU-Cache Interaction (5-stage pipeline)

[Diagram: 5-stage pipeline with the primary instruction cache accessed in the fetch stage (PC, hit?, bubble on miss) and the primary data cache accessed in the memory stage (addr, wdata, rdata, we, hit?); cache refill data comes from lower levels of the memory hierarchy via memory control]

Stall entire CPU on data cache miss
Improving Cache Performance

Average memory access time (AMAT) = Hit time + Miss rate x Miss penalty

To improve performance:
• reduce the hit time
• reduce the miss rate
• reduce the miss penalty

What is best cache design for 5-stage pipeline?
Biggest cache that doesn't increase hit time past 1 cycle (approx. 8-32 KB in modern technology)
[design issues more complex with deeper pipelines and/or out-of-order superscalar processors]
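A quick worked example of the AMAT formula above (the numbers are illustrative, not from the lecture):

```python
# AMAT = Hit time + Miss rate x Miss penalty (all times in cycles).

def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

# 1-cycle hit, 5% miss rate, 20-cycle miss penalty:
print(amat(1.0, 0.05, 20.0))   # -> 2.0 cycles

# Halving the miss rate and halving the miss penalty help equally here:
print(amat(1.0, 0.025, 20.0))  # -> 1.5 cycles
print(amat(1.0, 0.05, 10.0))   # -> 1.5 cycles
```

The formula makes the three improvement levers explicit: each term corresponds to one of the bullets above.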
Causes of Cache Misses: The 3 C's

Compulsory: first reference to a line (a.k.a. cold start misses)
  – misses that would occur even with infinite cache
Capacity: cache is too small to hold all data needed by the program
  – misses that would occur even under perfect replacement policy
Conflict: misses that occur because of collisions due to line-placement strategy
  – misses that would not occur with ideal full associativity
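The three definitions above suggest a measurement method (this is a common simulation approach sketched under assumed names, not something specified in the lecture): replay the address trace, calling a miss compulsory if the line was never referenced before, capacity if a fully-associative LRU cache of the same total size also misses, and conflict otherwise. Here the real cache is modeled as direct-mapped:

```python
from collections import OrderedDict

def classify_misses(addrs, num_lines, line_bytes=16):
    """Classify direct-mapped cache misses into the 3 C's."""
    seen = set()            # lines ever referenced (models infinite cache)
    fa = OrderedDict()      # fully-associative LRU cache of num_lines lines
    dm = {}                 # direct-mapped cache: index -> tag
    counts = {"compulsory": 0, "capacity": 0, "conflict": 0}
    for a in addrs:
        line = a // line_bytes
        idx, tag = line % num_lines, line // num_lines
        dm_hit = dm.get(idx) == tag
        fa_hit = line in fa
        if fa_hit:                      # update the LRU model
            fa.move_to_end(line)
        else:
            fa[line] = True
            if len(fa) > num_lines:
                fa.popitem(last=False)  # evict LRU line
        if not dm_hit:
            if line not in seen:
                counts["compulsory"] += 1
            elif not fa_hit:
                counts["capacity"] += 1
            else:                       # full associativity would have hit
                counts["conflict"] += 1
            dm[idx] = tag
        seen.add(line)
    return counts
```

For example, two lines that map to the same direct-mapped index ping-pong: `classify_misses([0, 32, 0, 32], num_lines=2)` reports 2 compulsory and 2 conflict misses, since a fully-associative cache of the same size would hold both lines.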
Effect of Cache Parameters on Performance

§ Larger cache size
  + reduces capacity and conflict misses
  – hit time will increase
§ Higher associativity
  + reduces conflict misses
  – may increase hit time
§ Larger line size
  + reduces compulsory and capacity (reload) misses
  – increases conflict misses and miss penalty
Figure B.9 Total miss rate (top) and distribution of miss rate (bottom) for each size cache according to the three C's for the data in Figure B.8. The top diagram shows the actual data cache miss rates, while the bottom diagram shows the percentage in each category. (Space allows the graphs to show one extra cache size than can fit in Figure B.8.)
© 2018 Elsevier Inc. All rights reserved.
Recap: Line Size and Spatial Locality

A line is the unit of transfer between the cache and memory.

[Diagram: a 4-word line (Word 0-Word 3) selected by the line address, with an offset selecting the word within the line; example shown with b=2]

Split CPU address: Tag (32-b bits) | Offset (b bits), where 2^b = line size a.k.a. block size (in bytes)

Larger line size has distinct hardware advantages:
• less tag overhead
• exploit fast burst transfers from DRAM
• exploit fast burst transfers over wide busses

What are the disadvantages of increasing line size?
Fewer lines => more conflicts. Can waste bandwidth.
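The address split above (plus the index field used once a cache has multiple sets) is just bit slicing; the field widths in this sketch are illustrative, not from the lecture.

```python
def split_address(addr, b, k=0):
    """Split a 32-bit address into (tag, index, offset).
    2^b = line size in bytes; 2^k = number of sets (k=0: fully associative)."""
    offset = addr & ((1 << b) - 1)          # low b bits select byte in line
    index  = (addr >> b) & ((1 << k) - 1)   # next k bits select the set
    tag    = addr >> (b + k)                # remaining high bits are the tag
    return tag, index, offset

# 64-byte lines (b=6), 128 sets (k=7):
tag, index, offset = split_address(0x12345678, b=6, k=7)
# the fields reassemble to the original address:
assert (tag << 13) | (index << 6) | offset == 0x12345678
```

Growing b (larger lines) for a fixed cache size shrinks k, which is exactly the "fewer lines => more conflicts" disadvantage noted above.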
© 2019 Elsevier Inc. All rights reserved.
Figure B.10 Miss rate versus block size for five different-sized caches. Note that miss rate actually goes up if the block size is too large relative to the cache size. Each line represents a cache of different size. Figure B.11 shows the data used to plot these lines. Unfortunately, SPEC2000 traces would take too long if block size were included, so these data are based on SPEC92 on a DECstation 5000 (Gee et al. 1993).
Write Policy Choices

§ Cache hit:
  – write-through: write both cache & memory
    • Generally higher traffic but simpler pipeline & cache design
  – write-back: write cache only, memory is written only when the entry is evicted
    • A dirty bit per line further reduces write-back traffic
    • Must handle 0, 1, or 2 accesses to memory for each load/store
§ Cache miss:
  – no-write-allocate: only write to main memory
  – write-allocate (a.k.a. fetch-on-write): fetch into cache
§ Common combinations:
  – write-through and no-write-allocate
  – write-back with write-allocate
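A minimal model of the write-back policy above (one line, write-allocate, with a dirty bit; an illustrative sketch, not the lecture's design) shows how the dirty bit cuts memory traffic relative to write-through:

```python
class WriteBackCache:
    """One-line write-allocate cache with a dirty bit (sketch)."""
    def __init__(self):
        self.line = None      # address of the cached line
        self.dirty = False
        self.mem_writes = 0   # lines written back to memory

    def store(self, line):
        if self.line != line:      # miss: evict, writing back if dirty
            if self.dirty:
                self.mem_writes += 1
            self.line = line       # write-allocate: fetch into cache
            self.dirty = False
        self.dirty = True          # write cache only; memory untouched

c = WriteBackCache()
for line in [0, 0, 0, 1]:   # three stores to line 0, then one to line 1
    c.store(line)
print(c.mem_writes)         # -> 1
```

A write-through cache would have performed 4 memory writes for this sequence; write-back coalesced the three stores to line 0 into a single write-back on eviction.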
Write Performance

[Diagram: cache array with the address split into Tag (t bits), Index (k bits), and Offset (b bits); 2^k lines, each with a valid bit, tag, and data; the tag compare produces HIT and selects the data word or byte, and a write enable (WE) controls the data array]
Reducing Write Hit Time

Problem: Writes take two cycles in memory stage, one cycle for tag check plus one cycle for data write if hit

Solutions:
§ Design data RAM that can perform read and write in one cycle, restore old value after tag miss
§ Pipelined writes: Hold write data for store in single buffer ahead of cache, write cache data during next store's tag check
§ Fully-associative (CAM tag) caches: Word line only enabled if hit
Pipelining Cache Writes

[Diagram: tag and data arrays with a delayed-write data/address buffer; the address and store data arrive from the CPU, tag compares select between load data and buffered store data, and load data returns to the CPU on a hit]

Data from a store hit is written into the data portion of the cache during the tag access of the subsequent store
CS152 Administrivia

§ PS2 out on Wednesday, due Wednesday Feb 26
§ Monday Feb 17 is Presidents' Day holiday, no class!
§ Lab 1 due at start of class on Wednesday Feb 19
§ Friday's sections will review PS1 and solutions
CS252 Administrivia

§ Start thinking of class projects and forming teams of two
§ Proposal due Wednesday February 26th
Write Buffer to Reduce Read Miss Penalty

[Diagram: CPU and register file access the data cache; a write buffer sits between the data cache and the unified L2 cache. The buffer holds evicted dirty lines for a write-back cache, OR all writes in a write-through cache]

Processor is not stalled on writes, and read misses can go ahead of write to main memory

Problem: Write buffer may hold updated value of location needed by a read miss
Simple solution: on a read miss, wait for the write buffer to go empty
Faster solution: Check write buffer addresses against read miss address; if no match, allow read miss to go ahead of writes, else return value in write buffer
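The "faster solution" above can be sketched as an associative search of the buffer (an illustrative model with assumed names, not the lecture's hardware):

```python
from collections import OrderedDict

class WriteBuffer:
    """Sketch of a write buffer searched on read misses."""
    def __init__(self):
        self.entries = OrderedDict()   # address -> data, in write order

    def add_write(self, addr, data):
        self.entries[addr] = data      # coalesce writes to the same address

    def drain_one(self):
        """Retire the oldest buffered write to memory."""
        return self.entries.popitem(last=False) if self.entries else None

    def read_miss(self, addr):
        """Return (source, data). On an address match the read is serviced
        from the buffer; otherwise the miss goes ahead of buffered writes."""
        if addr in self.entries:
            return ("buffer", self.entries[addr])
        return ("memory", None)

wb = WriteBuffer()
wb.add_write(0x100, 42)
print(wb.read_miss(0x100))  # -> ('buffer', 42)
print(wb.read_miss(0x200))  # -> ('memory', None)
```

The address check is what preserves correctness: without it, letting the read miss bypass the buffered write to 0x100 would return stale memory data.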
Reducing Tag Overhead with Sub-Blocks

§ Problem: Tags are too large, i.e., too much overhead
  – Simple solution: Larger lines, but miss penalty could be large
§ Solution: Sub-block placement (a.k.a. sector cache)
  – A valid bit added to units smaller than full line, called sub-blocks
  – Only read a sub-block on a miss
  – If a tag matches, is the word in the cache?

[Example: three lines with tags 100, 300, 204 and per-sub-block valid bits 1 1 1 1 / 1 1 0 0 / 0 1 0 1]
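The lookup rule implied by the question above can be sketched directly (an illustrative model, not the lecture's design): with one tag per line but a valid bit per sub-block, a tag match alone is not a hit.

```python
class SectorCacheLine:
    """One line of a sector cache: one tag, a valid bit per sub-block."""
    def __init__(self, tag, num_subblocks):
        self.tag = tag
        self.valid = [False] * num_subblocks

    def lookup(self, tag, sub):
        """Hit only if the tag matches AND the sub-block is valid."""
        return self.tag == tag and self.valid[sub]

    def refill(self, tag, sub):
        """On a miss, fetch only the needed sub-block."""
        if self.tag != tag:            # new line: old sub-blocks invalid
            self.tag = tag
            self.valid = [False] * len(self.valid)
        self.valid[sub] = True

line = SectorCacheLine(tag=0x300, num_subblocks=4)
line.refill(0x300, 2)
print(line.lookup(0x300, 2))  # -> True  (tag match, sub-block valid)
print(line.lookup(0x300, 0))  # -> False (tag match, sub-block not fetched)
```

This is how tag overhead shrinks: one tag now covers several sub-blocks, at the cost of one valid bit per sub-block and the extra miss case shown.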
Multilevel Caches

Problem: A memory cannot be large and fast
Solution: Increasing sizes of cache at each level

CPU <-> L1$ <-> L2$ <-> DRAM

Local miss rate = misses in cache / accesses to cache
Global miss rate = misses in cache / CPU memory accesses
Misses per instruction = misses in cache / number of instructions
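A worked example of the definitions above, with illustrative counts (not from the lecture): 1000 CPU memory accesses, 40 L1 misses, 20 L2 misses.

```python
cpu_accesses = 1000
l1_misses    = 40      # L1 is accessed by every CPU memory access
l2_misses    = 20      # L2 is accessed only on the 40 L1 misses

l1_local  = l1_misses / cpu_accesses   # 0.04
l2_local  = l2_misses / l1_misses      # 0.50 (looks high in isolation)
l2_global = l2_misses / cpu_accesses   # 0.02 (what the CPU actually sees)
print(l1_local, l2_local, l2_global)
```

Note that the L2 global miss rate equals the product of the L1 and L2 local miss rates, which is why a 50% local miss rate in L2 can still mean only 2% of references go to DRAM.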
© 2019 Elsevier Inc. All rights reserved.
Figure B.14 Miss rates versus cache size for multilevel caches. Second-level caches smaller than the sum of the two 64 KiB first-level caches make little sense, as reflected in the high miss rates. After 256 KiB the single cache is within 10% of the global miss rates. The miss rate of a single-level cache versus size is plotted against the local miss rate and global miss rate of a second-level cache using a 32 KiB first-level cache. The L2 caches (unified) were two-way set associative with LRU replacement. Each had split L1 instruction and data caches that were 64 KiB two-way set associative with LRU replacement. The block size for both L1 and L2 caches was 64 bytes. Data were collected as in Figure B.4.
Presence of L2 influences L1 design

§ Use smaller L1 if there is also L2
  – Trade increased L1 miss rate for reduced L1 hit time
  – Backup L2 reduces L1 miss penalty
  – Reduces average access energy
§ Use simpler write-through L1 with on-chip L2
  – Write-back L2 cache absorbs write traffic, doesn't go off-chip
  – At most one L1 miss request per L1 access (no dirty victim write-back) simplifies pipeline control
  – Simplifies coherence issues
  – Simplifies error recovery in L1 (can use just parity bits in L1 and reload from L2 when parity error detected on L1 read)
© 2019 Elsevier Inc. All rights reserved.
Figure B.15 Relative execution time by second-level cache size. The two bars are for different clock cycles for an L2 cache hit. The reference execution time of 1.00 is for an 8192 KiB second-level cache with a 1-clock-cycle latency on a second-level hit. These data were collected the same way as in Figure B.14, using a simulator to imitate the Alpha 21264.
Inclusion Policy

§ Inclusive multilevel cache:
  – Inner cache can only hold lines also present in outer cache
  – External coherence snoop access need only check outer cache
§ Exclusive multilevel caches:
  – Inner cache may hold lines not in outer cache
  – Swap lines between inner/outer caches on miss
  – Used in AMD Athlon with 64 KB primary and 256 KB secondary cache

Why choose one type or the other?
Itanium-2 On-Chip Caches (Intel/HP, 2002)

Level 1: 16 KB, 4-way s.a., 64 B line, quad-port (2 load + 2 store), single-cycle latency
Level 2: 256 KB, 4-way s.a., 128 B line, quad-port (4 load or 4 store), five-cycle latency
Level 3: 3 MB, 12-way s.a., 128 B line, single 32 B port, twelve-cycle latency
Power7 On-Chip Caches [IBM 2009]

32 KB L1 I$/core
32 KB L1 D$/core, 3-cycle latency
256 KB unified L2$/core, 8-cycle latency
32 MB unified shared L3$ in embedded DRAM (eDRAM), 25-cycle latency to local slice
IBM z196 Mainframe Caches (2010)

§ 96 cores (4 cores/chip, 24 chips/system)
  – Out-of-order, clocked at 5.2 GHz
§ L1: 64 KB I-$/core + 128 KB D-$/core
§ L2: 1.5 MB private/core (144 MB total)
§ L3: 24 MB shared/chip (eDRAM) (576 MB total)
§ L4: 768 MB shared/system (eDRAM)
Acknowledgements

§ This course is partly inspired by previous MIT 6.823 and Berkeley CS252 computer architecture courses created by my collaborators and colleagues:
  – Arvind (MIT)
  – Joel Emer (Intel/MIT)
  – James Hoe (CMU)
  – John Kubiatowicz (UCB)
  – David Patterson (UCB)