CS 152 Computer Architecture and Engineering
CS 252 Graduate Computer Architecture
Lecture 5 – Memory
Krste Asanovic
Electrical Engineering and Computer Sciences
University of California at Berkeley
http://www.eecs.berkeley.edu/~krste
http://inst.eecs.berkeley.edu/~cs152
Last time in Lecture 4
§ Handling exceptions in pipelined machines by passing exceptions down the pipeline until instructions cross the commit point in order
§ Can use values before commit through the bypass network
§ Pipeline hazards can be avoided through software techniques: scheduling, loop unrolling
§ Decoupled architectures use queues between “access” and “execute” pipelines to tolerate long memory latency
§ Regularizing all functional units to have the same latency simplifies more complex pipeline design by avoiding structural hazards, and can be expanded to in-order superscalar designs
Early Read-Only Memory Technologies
§ Punched cards: from the early 1700s through the Jacquard loom, Babbage, and then IBM
§ Punched paper tape: instruction stream in Harvard Mk 1
§ IBM Card Capacitor ROS
§ IBM Balanced Capacitor ROS
§ Diode matrix: EDSAC-2 µcode store
Early Read/Write Main Memory Technologies
§ Williams Tube, Manchester Mark 1, 1947
§ Babbage, 1800s: digits stored on mechanical wheels
§ Mercury Delay Line, Univac 1, 1951
§ Also, regenerative capacitor memory on the Atanasoff-Berry computer, and rotating magnetic drum memory on the IBM 650
Core Memory
§ Core memory was the first large-scale reliable main memory
– invented by Forrester in the late 40s/early 50s at MIT for the Whirlwind project
§ Bits stored as magnetization polarity on small ferrite cores threaded onto a two-dimensional grid of wires
§ Coincident current pulses on X and Y wires would write a cell and also sense its original state (destructive reads)
§ Robust, non-volatile storage
§ Used on space shuttle computers
§ Cores threaded onto wires by hand (25 billion a year at peak production)
§ Core access time ~1 µs
[Figure: DEC PDP-8/E board, 4K words x 12 bits (1968)]
Semiconductor Memory
§ Semiconductor memory began to be competitive in the early 1970s
– Intel formed to exploit the market for semiconductor memory
– Early semiconductor memory was Static RAM (SRAM). SRAM cell internals are similar to a latch (cross-coupled inverters).
§ First commercial Dynamic RAM (DRAM) was the Intel 1103
– 1 Kbit of storage on a single chip
– charge on a capacitor used to hold the value
Semiconductor memory quickly replaced core in the ’70s
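One-Transistor Dynamic RAM [Dennard, IBM]
[Figure: 1-T DRAM cell. The word line gates an access transistor that connects the bit line to a storage capacitor (FET gate, trench, or stack) whose other plate is held at VREF. Cross-section labels: TiN top electrode (VREF), Ta2O5 dielectric, W bottom electrode, poly word line, access transistor.]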
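DRAM Architecture
[Figure: DRAM array. An (N+M)-bit address is split into N row-address bits, which feed the row address decoder driving word lines Row 1 to Row 2^N, and M column-address bits, which feed the column decoder and sense amplifiers on bit lines Col. 1 to Col. 2^M. Each word line/bit line crossing holds a memory cell (one bit). Data D passes through the column decoder.]
§ Bits stored in 2-dimensional arrays on chip
§ Modern chips have around 4-8 logical banks on each chip
§ each logical bank physically implemented as many smaller arrays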
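The address split implied by the array organization can be sketched in a few lines. This is only an illustration: the function name is hypothetical, and placing the row bits in the upper part of the address is an assumption, since the slide only states that N bits select a row and M bits select a column.

```python
def split_dram_address(addr, n_row_bits, m_col_bits):
    """Split an (N+M)-bit cell address into (row, column).

    Assumption: the N row bits are the high-order bits and the
    M column bits are the low-order bits.
    """
    if not 0 <= addr < (1 << (n_row_bits + m_col_bits)):
        raise ValueError("address does not fit in N+M bits")
    row = addr >> m_col_bits                 # selects one of 2^N word lines
    col = addr & ((1 << m_col_bits) - 1)     # selects one of 2^M bit lines
    return row, col


# Example: N = 3 row bits, M = 4 column bits.
row, col = split_dram_address(0b1010011, n_row_bits=3, m_col_bits=4)
print(row, col)
```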
DRAM Packaging (Laptops/Desktops/Servers)
[Figure: DRAM chip interface. ~7 clock and control signals, ~12 address lines carrying the multiplexed row/column address, and a data bus (4b, 8b, 16b, or 32b).]
§ DIMM (Dual Inline Memory Module) contains multiple chips with clock/control/address signals connected in parallel (sometimes need buffers to drive signals to all chips)
§ Data pins work together to return a wide word (e.g., 64-bit data bus using 16 x 4-bit parts)
DRAM Packaging, Mobile Devices
[Figure: Apple A4 package on circuit board]
[Figure: Apple A4 package cross-section, iFixit 2010. Two stacked DRAM die; processor plus logic die.]
DRAM Operation
§ Three steps in read/write access to a given bank
§ Row access (RAS)
– decode row address, enable addressed row (often multiple Kb in row)
– bitlines share charge with storage cell
– small change in voltage detected by sense amplifiers which latch whole row of bits
– sense amplifiers drive bitlines full rail to recharge storage cells
§ Column access (CAS)
– decode column address to select small number of sense amplifier latches (4, 8, 16, or 32 bits depending on DRAM package)
– on read, send latched bits out to chip pins
– on write, change sense amplifier latches which then charge storage cells to required value
– can perform multiple column accesses on same row without another row access (burst mode)
§ Precharge
– charges bit lines to known value, required before next row access
§ Each step has a latency of around 15-20 ns in modern DRAMs
§ Various DRAM standards (DDR, RDRAM) have different ways of encoding the signals for transmission to the DRAM, but all share the same core architecture
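A rough sketch of how the three steps add up to an access latency. It assumes an open-row policy (the bank leaves the last row latched in the sense amplifiers until a different row is needed) and a flat 15 ns per step, the low end of the range quoted above. The function name and the policy are assumptions for illustration, not something the lecture specifies.

```python
STEP_NS = 15  # assumed per-step latency (low end of the 15-20 ns range)


def access_latency_ns(open_row, target_row, step_ns=STEP_NS):
    """Estimate the latency of one access to a bank.

    open_row is the row currently latched in the sense amplifiers,
    or None if the bank has already been precharged.
    """
    if open_row == target_row:
        return step_ns          # column access only (same row, no new RAS)
    if open_row is None:
        return 2 * step_ns      # row access + column access
    return 3 * step_ns          # precharge + row access + column access


print(access_latency_ns(open_row=7, target_row=7))     # same row
print(access_latency_ns(open_row=None, target_row=7))  # precharged bank
print(access_latency_ns(open_row=3, target_row=7))     # different row open
```

Under these assumptions, consecutive accesses to the same row cost one step each, which is the benefit the burst-mode bullet describes.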
Double-Data Rate (DDR2) DRAM
[Figure: DDR2 timing diagram showing Row, Column, Precharge, and Row’ commands and the resulting Data transfers. 200 MHz clock, 400 Mb/s data rate. Micron, 256 Mb DDR2 SDRAM datasheet.]
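The relation between the 200 MHz clock and the 400 Mb/s data rate is simple arithmetic: double data rate means two transfers per clock on each data pin. The sketch below also scales the per-pin rate to a 64-bit module, which is an assumed configuration taken from the DIMM example earlier in the lecture; the function name is hypothetical.

```python
def peak_transfer_rate(clock_mhz, data_pins, transfers_per_clock=2):
    """Return (per-pin rate in Mb/s, total rate in MB/s) at peak."""
    per_pin_mbps = clock_mhz * transfers_per_clock   # both clock edges used
    total_mbytes_per_s = per_pin_mbps * data_pins // 8
    return per_pin_mbps, total_mbytes_per_s


per_pin, total = peak_transfer_rate(clock_mhz=200, data_pins=64)
print(per_pin, "Mb/s per pin;", total, "MB/s across 64 data pins")
```

This is a peak figure only; it ignores the row access and precharge time between bursts.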
CPU-Memory Bottleneck
[Figure: CPU connected to Memory]
Performance of high-speed computers is usually limited by memory bandwidth & latency
§ Latency (time for a single access)
– Memory access time >> Processor cycle time
§ Bandwidth (number of accesses per unit time)
if fraction m of instructions access memory
⇒ 1+m memory references / instruction
⇒ CPI = 1 requires 1+m memory refs / cycle (assuming RISC-V ISA)
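The bandwidth requirement can be turned into a number with a short calculation: every instruction needs one fetch plus, for a fraction m of instructions, one data access. The values m = 0.25 and a 1 GHz clock below are illustrative assumptions, as is the function name.

```python
def memory_refs_per_second(m, clock_hz, cpi=1.0):
    """Memory references per second needed to sustain the given CPI.

    m is the fraction of instructions that access data memory.
    """
    refs_per_instruction = 1 + m            # one fetch + m data accesses
    instructions_per_second = clock_hz / cpi
    return refs_per_instruction * instructions_per_second


# Assumed example: a quarter of instructions are loads/stores, 1 GHz, CPI = 1.
print(memory_refs_per_second(m=0.25, clock_hz=1e9))
```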
Processor-DRAM Gap (latency)
[Figure: Performance (log scale, 1 to 1000) vs. Time (1980 to 2000). CPU (µProc) improves 60%/year; DRAM improves 7%/year. Processor-Memory Performance Gap: (growing 50%/yr).]
Four-issue 3 GHz superscalar accessing 100 ns DRAM could execute 1,200 instructions during the time for one memory access!
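Both figures quoted for the gap follow from short arithmetic, sketched here with hypothetical function names. The first multiplies issue width by the number of cycles in one DRAM access; the second divides the two yearly improvement factors to get the growth of the gap.

```python
def instructions_per_memory_access(issue_width, clock_ghz, latency_ns):
    """Peak instructions that could issue during one memory access."""
    cycles = clock_ghz * latency_ns   # GHz * ns = cycles
    return issue_width * cycles


def gap_growth_per_year(cpu_improvement, dram_improvement):
    """Yearly growth factor of the CPU/DRAM performance ratio."""
    return (1 + cpu_improvement) / (1 + dram_improvement)


print(instructions_per_memory_access(issue_width=4, clock_ghz=3, latency_ns=100))
print(round(gap_growth_per_year(0.60, 0.07), 3))  # roughly 1.5, i.e. about 50%/yr
```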
Physical Size Affects Latency
[Figure: a CPU attached to a Small Memory compared with a CPU attached to a Big Memory]
§ Signals have further to travel
§ Fan out to more locations
Relative Memory Cell Sizes
[Figure: DRAM on memory chip compared with on-chip SRAM in logic chip. Foss, “Implementing Application-Specific Memory”, ISSCC 1996.]
Memory Hierarchy
[Figure: CPU connected (A) to a Small, Fast Memory (RF, SRAM) that holds frequently used data, which is connected (B) to a Big, Slow Memory (DRAM)]
• capacity: Register << SRAM << DRAM
• latency: Register << SRAM << DRAM
• bandwidth: on-chip >> off-chip
On a data access:
if data ∈ fast memory ⇒ low latency access (SRAM)
if data ∉ fast memory ⇒ high latency access (DRAM)
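One way to see why the hierarchy pays off is to weight the two latencies by how often each case occurs. The sketch below is a simple weighted average, not a formula from the slides; the 1 ns fast-memory latency and the hit fractions are assumed values, while 100 ns matches the DRAM latency quoted earlier.

```python
def average_access_ns(hit_fraction, fast_ns, slow_ns):
    """Average latency when hit_fraction of accesses find their data
    in fast memory and the rest pay the slow-memory latency."""
    return hit_fraction * fast_ns + (1 - hit_fraction) * slow_ns


# Assumed latencies: 1 ns for on-chip SRAM, 100 ns for DRAM.
for hit_fraction in (0.5, 0.9, 0.99):
    print(hit_fraction, average_access_ns(hit_fraction, fast_ns=1, slow_ns=100))
```

The average stays dominated by the slow memory until the fast memory captures nearly all accesses.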
Management of Memory Hierarchy
§ Small/fast storage, e.g., registers
– Address usually specified in instruction
– Generally implemented directly as a register file
• but hardware might do things behind software’s back, e.g., stack management, register renaming
§ Larger/slower storage, e.g., main memory
– Address usually computed from values in register
– Generally implemented as a hardware-managed cache hierarchy (hardware decides what is kept in fast memory)
• but software may provide “hints”, e.g., don’t cache or prefetch
Real Memory Reference Patterns
[Figure: Memory address (one dot per access) vs. Time. Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)]
Typical Memory Reference Patterns
[Figure: Address vs. Time, with regions for instruction fetches, stack accesses, and data accesses. Annotations: n loop iterations, subroutine call, subroutine return, argument access, scalar accesses.]
Two predictable properties of memory references:
§ Temporal Locality: If a location is referenced it is likely to be referenced again in the near future.
§ Spatial Locality: If a location is referenced it is likely that locations near it will be referenced in the near future.
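Both properties show up even in a loop that sums an array. The sketch below builds the data-address trace for such a loop, assuming for illustration that the running total is kept in memory rather than a register; the addresses, word size, and function names are all made up for the example.

```python
def sum_loop_trace(base, n, word_bytes=4, total_addr=0x1000):
    """Data addresses touched by: for i in range(n): total += a[i]."""
    trace = []
    for i in range(n):
        trace.append(base + i * word_bytes)  # a[i]: next to a[i-1] (spatial)
        trace.append(total_addr)             # total: same address (temporal)
    return trace


def count_repeats(trace):
    """Count accesses to an address that was already accessed earlier."""
    seen, repeats = set(), 0
    for addr in trace:
        if addr in seen:
            repeats += 1
        seen.add(addr)
    return repeats


trace = sum_loop_trace(base=0x2000, n=4)
print([hex(a) for a in trace])
print("repeated accesses:", count_repeats(trace))
```

The array addresses advance by one word per iteration, while every access to the total after the first repeats an earlier address.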
Memory Reference Patterns
[Figure: Memory address (one dot per access) vs. Time, annotated with regions showing Spatial Locality and Temporal Locality. Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)]
Caches exploit both types of predictability:
§ Exploit temporal locality by remembering the contents of recently accessed locations.
§ Exploit spatial locality by fetching blocks of data around recently accessed locations.
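Inside a Cache
[Figure: a CACHE sits between the Processor and Main Memory, exchanging Address and Data with each. Each cache Line holds an Address Tag and a Data Block made up of Data Bytes. Example tags shown: 100, 304, 6848, 416. The line tagged 100 holds a copy of main memory location 100, a copy of main memory location 101, and so on.]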
Cache Algorithm (Read)
Look at Processor Address, search cache tags to find match. Then either:
§ Found in cache, a.k.a. HIT
– Return copy of data from cache
§ Not in cache, a.k.a. MISS
– Read block of data from Main Memory
– Wait …
– Return data to processor and update cache
Q: Which line do we replace?
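The read algorithm can be sketched as follows. To keep it minimal, the cache here is an unbounded dictionary keyed by block address, so the replacement question is deliberately left out; the 4-byte block size, the `bytes` object standing in for main memory, and the function name are assumptions for illustration.

```python
BLOCK_BYTES = 4  # assumed block size


def cache_read(cache, memory, addr):
    """Read one byte through the cache. Returns (byte_value, hit)."""
    block_addr = addr // BLOCK_BYTES     # acts as the tag in this sketch
    offset = addr % BLOCK_BYTES
    hit = block_addr in cache            # search the tags for a match
    if not hit:
        # MISS: read the whole block from main memory and update the cache.
        start = block_addr * BLOCK_BYTES
        cache[block_addr] = memory[start:start + BLOCK_BYTES]
    return cache[block_addr][offset], hit


memory = bytes(range(16))   # toy main memory: location i holds value i
cache = {}
print(cache_read(cache, memory, 5))   # first touch of the block: miss
print(cache_read(cache, memory, 6))   # neighbouring byte, same block: hit
```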
Placement Policy
[Figure: Memory with block numbers 0 to 31, and an 8-block Cache with blocks 0 to 7 (set numbers 0 to 3 when 2-way set associative)]
Block 12 can be placed:
§ Fully Associative: anywhere
§ (2-way) Set Associative: anywhere in set 0 (12 mod 4)
§ Direct Mapped: only into block 4 (12 mod 8)
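The three placement rules are the same modulo calculation with a different number of ways. The sketch below assumes the ways of a set occupy adjacent cache blocks (set 0 is blocks 0 and 1 in the 2-way case), which matches the figure's numbering but is otherwise just a convention; the function name is hypothetical.

```python
def candidate_blocks(block_number, num_cache_blocks, ways):
    """Cache blocks where a given memory block may be placed."""
    num_sets = num_cache_blocks // ways
    set_index = block_number % num_sets
    # Assumption: the ways of one set sit in adjacent cache blocks.
    return [set_index * ways + way for way in range(ways)]


print(candidate_blocks(12, num_cache_blocks=8, ways=1))  # direct mapped
print(candidate_blocks(12, num_cache_blocks=8, ways=2))  # 2-way set associative
print(candidate_blocks(12, num_cache_blocks=8, ways=8))  # fully associative
```

Direct mapped is the 1-way case and fully associative is the case with a single set.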
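Direct Map Address Selection
higher-order vs. lower-order address bits
[Figure: direct-mapped cache with 2^k lines, each holding a valid bit (V), a Tag, and a Data Block. The address is split into a t-bit Tag, a k-bit Index, and a b-bit Block Offset. The Index selects a line, the stored Tag is compared (=) with the address Tag to produce HIT, and the Block Offset selects the Data Word or Byte.]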
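A sketch of the address split and tag check for a direct-mapped cache with 2^k lines and 2^b-byte blocks. The field widths follow the figure (tag in the high-order bits, index in the middle, block offset in the low-order bits); the function names and the tuple layout of a cache line are assumptions for illustration.

```python
def split_address(addr, k, b):
    """Split an address into (tag, index, block_offset)."""
    offset = addr & ((1 << b) - 1)        # low b bits
    index = (addr >> b) & ((1 << k) - 1)  # next k bits select the line
    tag = addr >> (b + k)                 # remaining high-order bits
    return tag, index, offset


def is_hit(lines, addr, k, b):
    """lines[index] is a (valid, tag) pair; HIT needs valid and matching tag."""
    tag, index, _ = split_address(addr, k, b)
    valid, stored_tag = lines[index]
    return valid and stored_tag == tag


k, b = 3, 2
lines = [(False, 0)] * (1 << k)
addr = 0b101101011                  # tag 1011, index 010, offset 11
print(split_address(addr, k, b))
lines[2] = (True, 0b1011)           # pretend the block was loaded earlier
print(is_hit(lines, addr, k, b))
```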
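2-Way Set-Associative Cache
[Figure: two ways, each holding a valid bit (V), a Tag, and a Data Block per set. The address is split into a t-bit Tag, a k-bit Index, and a b-bit Block Offset. The Index selects one set, the stored Tags of both ways are compared (=) with the address Tag, a match in either way signals HIT, and the Block Offset selects the Data Word or Byte from the matching way.]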
Replacement Policy
In an associative cache, which block from a set should be evicted when the set becomes full?
• Random
• Least-Recently Used (LRU)
– LRU cache state must be updated on every access
– true implementation only feasible for small sets (2-way)
– pseudo-LRU binary tree often used for 4-8 way
• First-In, First-Out (FIFO) a.k.a. Round-Robin
– used in highly associative caches
• Not-Most-Recently Used (NMRU)
– FIFO with exception for most-recently used block or blocks
This is a second-order effect. Why?
Replacement only happens on misses
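A software sketch of true LRU for a single set, using Python's `collections.OrderedDict` to keep tags in recency order. The class name is hypothetical, and this models the policy only; it says nothing about how hardware would encode the ordering.

```python
from collections import OrderedDict


class LRUSet:
    """One cache set with true LRU replacement (policy model only)."""

    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()   # tags, least recently used first

    def access(self, tag):
        """Return (hit, evicted_tag); evicted_tag is None if nothing left."""
        if tag in self.blocks:
            self.blocks.move_to_end(tag)   # state updated on every access
            return True, None
        evicted = None
        if len(self.blocks) >= self.ways:
            evicted, _ = self.blocks.popitem(last=False)  # drop the LRU tag
        self.blocks[tag] = None
        return False, evicted


s = LRUSet(ways=2)
for tag in (1, 2, 1, 3):
    print(tag, s.access(tag))
```

In the example, tag 1 is touched again before tag 3 arrives, so tag 2 is the one evicted.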
Block Size and Spatial Locality
Block is unit of transfer between the cache and memory
[Figure: a Tag followed by a 4-word block (Word 0, Word 1, Word 2, Word 3), b = 2]
Split CPU address: block address (32-b bits), offset (b bits)
2^b = block size a.k.a. line size (in bytes)
Larger block size has distinct hardware advantages
• less tag overhead
• exploit fast burst transfers from DRAM
• exploit fast burst transfers over wide busses
What are the disadvantages of increasing block size?
Fewer blocks => more conflicts. Can waste bandwidth.
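The tag-overhead advantage and the fewer-blocks disadvantage can both be read off a small calculation. The sketch assumes a direct-mapped cache with a power-of-two capacity, 32-bit byte addresses, and ignores valid bits; the 32 KiB capacity and the function name are illustrative assumptions.

```python
def tag_storage_bits(cache_bytes, block_bytes, addr_bits=32):
    """Return (number of blocks, tag bits per block, total tag bits)
    for a direct-mapped cache with power-of-two sizes."""
    blocks = cache_bytes // block_bytes
    index_bits = blocks.bit_length() - 1        # log2(blocks)
    offset_bits = block_bytes.bit_length() - 1  # log2(block size) = b
    tag_bits = addr_bits - index_bits - offset_bits
    return blocks, tag_bits, blocks * tag_bits


for block_bytes in (16, 64):
    print(block_bytes, tag_storage_bits(32 * 1024, block_bytes))
```

At fixed capacity the tag width stays the same, so quadrupling the block size cuts total tag storage by four, but it also leaves a quarter as many blocks to hold distinct regions of memory.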