CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Lecture 5 – Memory Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste http://inst.eecs.berkeley.edu/~cs152

Source: inst.eecs.berkeley.edu/~cs152/sp18/lectures/L05-Memory.pdf



Last time in Lecture 4

§ Handling exceptions in pipelined machines by passing exceptions down pipeline until instructions cross commit point in order
§ Can use values before commit through bypass network
§ Pipeline hazards can be avoided through software techniques: scheduling, loop unrolling
§ Decoupled architectures use queues between “access” and “execute” pipelines to tolerate long memory latency
§ Regularizing all functional units to have same latency simplifies more complex pipeline design by avoiding structural hazards, can be expanded to in-order superscalar designs

Early Read-Only Memory Technologies

Punched cards, from early 1700s through Jacquard Loom, Babbage, and then IBM
Punched paper tape, instruction stream in Harvard Mk 1
IBM Card Capacitor ROS
IBM Balanced Capacitor ROS
Diode Matrix, EDSAC-2 µcode store

Early Read/Write Main Memory Technologies

Williams Tube, Manchester Mark 1, 1947
Babbage, 1800s: Digits stored on mechanical wheels
Mercury Delay Line, Univac 1, 1951
Also, regenerative capacitor memory on Atanasoff-Berry computer, and rotating magnetic drum memory on IBM 650

MIT Whirlwind Core Memory

Core Memory

§ Core memory was first large-scale reliable main memory
– invented by Forrester in late 40s/early 50s at MIT for Whirlwind project
§ Bits stored as magnetization polarity on small ferrite cores threaded onto two-dimensional grid of wires
§ Coincident current pulses on X and Y wires would write cell and also sense original state (destructive reads)
§ Robust, non-volatile storage
§ Used on space shuttle computers
§ Cores threaded onto wires by hand (25 billion a year at peak production)
§ Core access time ~1µs

[Figure: DEC PDP-8/E board, 4K words x 12 bits (1968)]

Semiconductor Memory

§ Semiconductor memory began to be competitive in early 1970s
– Intel formed to exploit market for semiconductor memory
– Early semiconductor memory was Static RAM (SRAM). SRAM cell internals similar to a latch (cross-coupled inverters).
§ First commercial Dynamic RAM (DRAM) was Intel 1103
– 1Kbit of storage on single chip
– charge on a capacitor used to hold value

Semiconductor memory quickly replaced core in ‘70s

One-Transistor Dynamic RAM [Dennard, IBM]

[Figure: 1-T DRAM cell. Schematic labels: word line, bit line, access transistor, storage capacitor (FET gate, trench, stack), VREF. Cross-section labels: TiN top electrode (VREF), Ta2O5 dielectric, W bottom electrode, poly word line, access transistor.]

Modern DRAM Structure

[Figure: Samsung, sub-70nm DRAM, 2004]

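DRAM Architecture

[Figure: DRAM array. Of the N+M address bits, N go to the row address decoder, which drives the word lines (Row 1 to Row 2^N), and M go to the column decoder and sense amplifiers on the bit lines (Col. 1 to Col. 2^M). Each word line/bit line crossing is a memory cell (one bit). Data D leaves through the column decoder.]

§ Bits stored in 2-dimensional arrays on chip
§ Modern chips have around 4-8 logical banks on each chip
– each logical bank physically implemented as many smaller arrays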
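The row/column organization can be sketched in a few lines of Python. This is only an illustration: the function name `split_dram_address`, its parameter names, and the choice to put the row number in the high-order bits are assumptions made for this example, and real parts may map address bits differently.

```python
def split_dram_address(addr: int, n_row_bits: int, m_col_bits: int) -> tuple[int, int]:
    """Split an (N+M)-bit cell address into (row, column).

    Assumption for illustration: the row number occupies the high-order N bits
    and the column number the low-order M bits.
    """
    if not 0 <= addr < (1 << (n_row_bits + m_col_bits)):
        raise ValueError("address does not fit in N+M bits")
    row = addr >> m_col_bits                 # selects one of 2**N word lines
    col = addr & ((1 << m_col_bits) - 1)     # selects one of 2**M bit lines
    return row, col


# Hypothetical array with 2**3 rows and 2**2 columns (32 one-bit cells).
print(split_dram_address(0b10110, n_row_bits=3, m_col_bits=2))  # (5, 2)
```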

DRAM Packaging (Laptops/Desktops/Servers)

§ DIMM (Dual Inline Memory Module) contains multiple chips with clock/control/address signals connected in parallel (sometimes need buffers to drive signals to all chips)
§ Data pins work together to return wide word (e.g., 64-bit data bus using 16x4-bit parts)

[Figure: DRAM chip with ~12 address lines (multiplexed row/column address), ~7 clock and control signals, and a data bus (4b, 8b, 16b, 32b)]

DRAM Packaging, Mobile Devices

[Figure: Apple A4 package cross-section, iFixit 2010: two stacked DRAM die, processor plus logic die]
[Figure: Apple A4 package on circuit board]

DRAM Packaging, Apple A10

DRAM Operation

§ Three steps in read/write access to a given bank
§ Row access (RAS)
– decode row address, enable addressed row (often multiple Kb in row)
– bitlines share charge with storage cell
– small change in voltage detected by sense amplifiers which latch whole row of bits
– sense amplifiers drive bitlines full rail to recharge storage cells
§ Column access (CAS)
– decode column address to select small number of sense amplifier latches (4, 8, 16, or 32 bits depending on DRAM package)
– on read, send latched bits out to chip pins
– on write, change sense amplifier latches which then charge storage cells to required value
– can perform multiple column accesses on same row without another row access (burst mode)
§ Precharge
– charges bit lines to known value, required before next row access
§ Each step has a latency of around 15-20ns in modern DRAMs
§ Various DRAM standards (DDR, RDRAM) have different ways of encoding the signals for transmission to the DRAM, but all share same core architecture
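The three steps can be turned into a toy timing model. The sketch below assumes a single bank that leaves the last-used row open in the sense amplifiers and charges a flat 15 ns per step; the class name `DramBank`, the constant `STEP_NS`, and the open-row policy are illustrative assumptions, not details of any particular DRAM standard.

```python
from typing import Optional

STEP_NS = 15  # assumed latency per step; the slide quotes roughly 15-20 ns


class DramBank:
    """Toy timing model of one DRAM bank with a single open row."""

    def __init__(self) -> None:
        self.open_row: Optional[int] = None  # row currently latched in the sense amps

    def access(self, row: int) -> int:
        """Return the latency in ns of a column access to `row`."""
        latency = 0
        if self.open_row != row:
            if self.open_row is not None:
                latency += STEP_NS   # precharge bit lines before a new row access
            latency += STEP_NS       # row access (RAS): latch the whole row
            self.open_row = row
        latency += STEP_NS           # column access (CAS): select bits from the latches
        return latency


bank = DramBank()
print([bank.access(r) for r in (3, 3, 3, 7)])  # [30, 15, 15, 45]
```

Repeated column accesses to row 3 pay only the CAS step, which is the effect burst mode exploits; switching to row 7 pays precharge, RAS, and CAS.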

Double-Data Rate (DDR2) DRAM

[Figure: DDR2 timing diagram showing Row, Column, Precharge, and Row’ commands followed by Data; 200MHz clock, 400Mb/s data rate. Source: Micron, 256Mb DDR2 SDRAM datasheet]
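The relation between the 200 MHz clock and the 400 Mb/s data rate is a one-line calculation, sketched below. The function names are invented for this example, and the 64-bit bus width is borrowed from the DIMM example earlier in the lecture; the result is a peak figure that ignores row access and precharge time.

```python
def ddr_pin_rate_mbps(clock_mhz: float) -> float:
    """Per-pin data rate in Mb/s: double data rate moves one bit on each clock edge."""
    return clock_mhz * 2


def peak_bandwidth_mbytes(clock_mhz: float, bus_width_bits: int) -> float:
    """Peak transfer rate in MB/s across a data bus of the given width."""
    return ddr_pin_rate_mbps(clock_mhz) * bus_width_bits / 8


print(ddr_pin_rate_mbps(200))          # 400 (Mb/s per pin, as in the figure)
print(peak_bandwidth_mbytes(200, 64))  # 3200.0 (MB/s on a 64-bit data bus)
```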

CPU-Memory Bottleneck

[Figure: CPU connected to Memory]

Performance of high-speed computers is usually limited by memory bandwidth & latency
§ Latency (time for a single access)
– Memory access time >> Processor cycle time
§ Bandwidth (number of accesses per unit time)
– if fraction m of instructions access memory ⇒ 1+m memory references/instruction
– ⇒ CPI = 1 requires 1+m memory refs/cycle (assuming RISC-V ISA)
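The bandwidth requirement can be worked through numerically. In this sketch the 30% load/store fraction is a hypothetical value, and the function names are made up for the example; the model counts one instruction fetch per instruction plus at most one data access, as in a load/store ISA.

```python
def memory_refs_per_instruction(m: float) -> float:
    """One instruction fetch plus, for a fraction m of instructions, one data access."""
    return 1 + m


def refs_per_cycle(m: float, cpi: float) -> float:
    """Memory references the memory system must sustain each cycle at a given CPI."""
    return memory_refs_per_instruction(m) / cpi


# Hypothetical mix: 30% of instructions are loads or stores.
print(refs_per_cycle(m=0.3, cpi=1.0))  # 1.3
```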

Processor-DRAM Gap (latency)

[Figure: Performance (log scale, 1 to 1000) versus Time (1980 to 2000). CPU (µProc) improves 60%/year, DRAM improves 7%/year. Processor-Memory Performance Gap: (growing 50%/yr)]

Four-issue 3GHz superscalar accessing 100ns DRAM could execute 1,200 instructions during time for one memory access!
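Both headline numbers on this slide follow from simple arithmetic, reproduced in the sketch below. The function names are illustrative; the inputs are the figures quoted on the slide.

```python
def instructions_per_access(issue_width: int, clock_ghz: float, dram_ns: float) -> float:
    """Peak instruction slots that elapse during one DRAM access."""
    cycles = clock_ghz * dram_ns      # GHz * ns = cycles
    return issue_width * cycles


def gap_growth_per_year(cpu_rate: float, dram_rate: float) -> float:
    """Yearly growth of the CPU/DRAM performance ratio, as a fraction."""
    return (1 + cpu_rate) / (1 + dram_rate) - 1


print(instructions_per_access(4, 3.0, 100.0))     # 1200.0
print(round(gap_growth_per_year(0.60, 0.07), 3))  # 0.495, i.e. roughly 50%/yr
```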

Physical Size Affects Latency

[Figure: CPU with a Small Memory versus CPU with a Big Memory]

§ Signals have further to travel
§ Fan out to more locations

Relative Memory Cell Sizes

[Figure: DRAM on memory chip versus on-chip SRAM in logic chip. Source: Foss, “Implementing Application-Specific Memory”, ISSCC 1996]

Memory Hierarchy

[Figure: CPU connected to A, a Small, Fast Memory (RF, SRAM) that holds frequently used data, which is connected to B, a Big, Slow Memory (DRAM)]

• capacity: Register << SRAM << DRAM
• latency: Register << SRAM << DRAM
• bandwidth: on-chip >> off-chip

On a data access:
if data ∈ fast memory ⇒ low latency access (SRAM)
if data ∉ fast memory ⇒ high latency access (DRAM)
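To see why holding frequently used data in the fast memory pays off, the two-case rule can be averaged over many accesses. The sketch below is a simplified model, not a formula from the slides: the 1 ns and 100 ns latencies are hypothetical, and a miss is charged only the slow-memory latency.

```python
def average_latency_ns(hit_fraction: float, fast_ns: float, slow_ns: float) -> float:
    """Expected latency when a fraction of accesses is served by fast memory."""
    if not 0.0 <= hit_fraction <= 1.0:
        raise ValueError("hit_fraction must be between 0 and 1")
    return hit_fraction * fast_ns + (1.0 - hit_fraction) * slow_ns


# Hypothetical numbers: 1 ns SRAM, 100 ns DRAM.
for h in (0.0, 0.9, 0.99):
    print(h, round(average_latency_ns(h, 1.0, 100.0), 2))
```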

CS152 Administrivia

CS252 Administrivia

Management of Memory Hierarchy

§ Small/fast storage, e.g., registers
– Address usually specified in instruction
– Generally implemented directly as a register file
• but hardware might do things behind software’s back, e.g., stack management, register renaming
§ Larger/slower storage, e.g., main memory
– Address usually computed from values in register
– Generally implemented as a hardware-managed cache hierarchy (hardware decides what is kept in fast memory)
• but software may provide “hints”, e.g., don’t cache or prefetch

Real Memory Reference Patterns

[Figure: Memory address (one dot per access) versus Time. Source: Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)]

Typical Memory Reference Patterns

[Figure: Address versus Time, with three bands: instruction fetches, stack accesses, and data accesses. Annotations: n loop iterations, subroutine call, subroutine return, argument access, scalar accesses.]

Two predictable properties of memory references:

§ Temporal Locality: If a location is referenced it is likely to be referenced again in the near future.
§ Spatial Locality: If a location is referenced it is likely that locations near it will be referenced in the near future.
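One rough way to make the two properties concrete is to classify each access in an address trace. The sketch below is an ad hoc measure invented for this example: the function name, the `window` of recent accesses, and the `near` distance are all assumed parameters, and the trace is synthetic.

```python
def locality_fractions(trace: list[int], window: int = 8, near: int = 4) -> tuple[float, float]:
    """Return (temporal, spatial) fractions for an address trace.

    temporal: accesses whose exact address appears in the previous `window` accesses
    spatial:  accesses to a new address within `near` of one of those recent addresses
    """
    if not trace:
        return 0.0, 0.0
    temporal = spatial = 0
    for i, addr in enumerate(trace):
        recent = trace[max(0, i - window):i]
        if addr in recent:
            temporal += 1
        elif any(abs(addr - r) <= near for r in recent):
            spatial += 1
    n = len(trace)
    return temporal / n, spatial / n


# Hypothetical trace: walk a 16-word array, re-reading a counter at address 1000 each step.
trace = []
for i in range(16):
    trace += [1000, 4 * i]
print(locality_fractions(trace))  # (0.46875, 0.46875)
```

In this trace the counter accounts for the temporal reuse and the array walk for the spatial reuse.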

Memory Reference Patterns

[Figure: Memory address (one dot per access) versus Time, annotated with regions of Spatial Locality and Temporal Locality. Source: Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)]

Caches exploit both types of predictability:

§ Exploit temporal locality by remembering the contents of recently accessed locations.
§ Exploit spatial locality by fetching blocks of data around recently accessed locations.

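Inside a Cache

[Figure: Processor exchanges Address and Data with the CACHE, which exchanges Address and Data with Main Memory. Each cache Line holds an Address Tag and a Data Block made of Data Bytes. Example tags: 100, 304, 6848, 416; the line tagged 100 holds a copy of main memory location 100 and a copy of main memory location 101.]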

Cache Algorithm (Read)

Look at Processor Address, search cache tags to find match. Then either

Found in cache a.k.a. HIT:
– Return copy of data from cache

Not in cache a.k.a. MISS:
– Read block of data from Main Memory
– Wait …
– Return data to processor and update cache

Q: Which line do we replace?
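The read algorithm can be sketched as a small Python class. Everything named here (`SimpleCache`, `BLOCK_SIZE`, the word-addressed memory list) is an assumption for illustration, and the victim choice is only a placeholder answer to the question above: it evicts the oldest-inserted line.

```python
BLOCK_SIZE = 4  # assumed words per block


class SimpleCache:
    """Toy cache following the read algorithm: search tags, then hit or miss."""

    def __init__(self, main_memory: list[int], num_lines: int) -> None:
        self.memory = main_memory
        self.num_lines = num_lines
        self.lines: dict[int, list[int]] = {}  # address tag -> copy of data block
        self.hits = 0
        self.misses = 0

    def read(self, addr: int) -> int:
        tag, offset = divmod(addr, BLOCK_SIZE)
        if tag in self.lines:                     # HIT: return copy from cache
            self.hits += 1
        else:                                     # MISS: read block from main memory
            self.misses += 1
            if len(self.lines) >= self.num_lines:
                # Placeholder replacement choice: evict the oldest-inserted line
                # (Python dicts preserve insertion order).
                del self.lines[next(iter(self.lines))]
            start = tag * BLOCK_SIZE
            self.lines[tag] = self.memory[start:start + BLOCK_SIZE]
        return self.lines[tag][offset]


memory = list(range(100, 164))          # 64 words of hypothetical data
cache = SimpleCache(memory, num_lines=2)
values = [cache.read(a) for a in (0, 1, 2, 40, 0, 8, 40)]
print(values, cache.hits, cache.misses)  # [100, 101, 102, 140, 100, 108, 140] 4 3
```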

Placement Policy

[Figure: Memory with block numbers 0 to 31, and an 8-block cache organized three ways: fully associative, 2-way set associative (set numbers 0 to 3), and direct mapped (block numbers 0 to 7).]

Block 12 can be placed:
Fully Associative: anywhere
(2-way) Set Associative: anywhere in set 0 (12 mod 4)
Direct Mapped: only into block 4 (12 mod 8)
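The three placement rules differ only in the number of ways per set, which the sketch below makes explicit. The function name and the slot numbering (set s owns slots s*ways through s*ways + ways - 1) are assumptions for this example.

```python
def candidate_slots(block: int, num_cache_blocks: int, ways: int) -> list[int]:
    """Cache block slots where a memory block may be placed.

    ways == 1 is direct mapped; ways == num_cache_blocks is fully associative.
    Assumes the slots of set s are numbered s*ways .. s*ways + ways - 1.
    """
    num_sets = num_cache_blocks // ways
    set_index = block % num_sets
    return [set_index * ways + w for w in range(ways)]


print(candidate_slots(12, 8, ways=1))  # [4]  direct mapped: 12 mod 8
print(candidate_slots(12, 8, ways=2))  # [0, 1]  2-way: set 0, 12 mod 4
print(candidate_slots(12, 8, ways=8))  # [0, 1, 2, 3, 4, 5, 6, 7]  fully associative
```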

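Direct-Mapped Cache

[Figure: The address is split into a t-bit Tag, a k-bit Index, and a b-bit Block Offset. The index selects one of 2^k lines, each holding a valid bit (V), a Tag, and a Data Block. The stored tag is compared (=) with the address tag to produce HIT, and the block offset selects the Data Word or Byte.]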
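A direct-mapped lookup can be sketched in Python using the t/k/b address split. The names `split_address` and `DirectMappedTags` are invented for this example, and only the valid bits and tags are modeled; data storage is left out.

```python
def split_address(addr: int, k: int, b: int) -> tuple[int, int, int]:
    """Split an address into (tag, index, block offset) with k index bits and b offset bits."""
    offset = addr & ((1 << b) - 1)
    index = (addr >> b) & ((1 << k) - 1)
    tag = addr >> (b + k)
    return tag, index, offset


class DirectMappedTags:
    """Tag array of a direct-mapped cache with 2**k lines (data storage omitted)."""

    def __init__(self, k: int, b: int) -> None:
        self.k, self.b = k, b
        self.valid = [False] * (1 << k)
        self.tags = [0] * (1 << k)

    def access(self, addr: int) -> bool:
        """Return True on a hit; on a miss, install the new tag in the indexed line."""
        tag, index, _ = split_address(addr, self.k, self.b)
        hit = self.valid[index] and self.tags[index] == tag
        if not hit:
            self.valid[index], self.tags[index] = True, tag
        return hit


print(split_address(0x1234, k=4, b=4))  # (18, 3, 4)
dm = DirectMappedTags(k=4, b=4)
print([dm.access(a) for a in (0x1234, 0x1238, 0x2234, 0x1234)])  # [False, True, False, False]
```

Addresses 0x1234 and 0x2234 share index 3 but differ in tag, so they evict each other even though the rest of the cache is empty.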

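Direct Map Address Selection: higher-order vs. lower-order address bits

[Figure: The direct-mapped cache redrawn with the Index and Tag fields of the address swapped, so the Index comes from higher-order address bits. Other labels are unchanged: field widths t, k, b; Block Offset; 2^k lines, each with V, Tag, and Data Block; tag compare (=) producing HIT; Data Word or Byte.]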

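2-Way Set-Associative Cache

[Figure: The address is split into a t-bit Tag, a k-bit Index, and a b-bit Block Offset. The index selects one set of two lines, each with a valid bit (V), a Tag, and a Data Block. Both stored tags are compared (=) with the address tag; a match raises HIT and selects the Data Word or Byte.]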

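Fully Associative Cache

[Figure: The address is split into a t-bit Tag and a b-bit Block Offset, with no index field. Every line holds a valid bit (V), a Tag, and a Data Block, and every stored tag is compared (=) with the address tag; a match raises HIT and selects the Data Word or Byte.]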

Replacement Policy

In an associative cache, which block from a set should be evicted when the set becomes full?
• Random
• Least-Recently Used (LRU)
  • LRU cache state must be updated on every access
  • true implementation only feasible for small sets (2-way)
  • pseudo-LRU binary tree often used for 4-8 way
• First-In, First-Out (FIFO) a.k.a. Round-Robin
  • used in highly associative caches
• Not-Most-Recently Used (NMRU)
  • FIFO with exception for most-recently used block or blocks

This is a second-order effect. Why?

Replacement only happens on misses
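The difference between LRU and FIFO shows up only in what happens on a hit, which the following sketch makes visible. The function `count_misses`, the use of `OrderedDict` to hold each set, and the trace are assumptions chosen for illustration; this is a software model, not how hardware tracks replacement state.

```python
from collections import OrderedDict


def count_misses(trace: list[int], num_sets: int, ways: int, policy: str) -> int:
    """Count misses for a block-address trace under 'LRU' or 'FIFO' replacement."""
    if policy not in ("LRU", "FIFO"):
        raise ValueError("policy must be 'LRU' or 'FIFO'")
    sets = [OrderedDict() for _ in range(num_sets)]  # per set: tags, oldest first
    misses = 0
    for block in trace:
        s = sets[block % num_sets]
        tag = block // num_sets
        if tag in s:
            if policy == "LRU":
                s.move_to_end(tag)       # LRU state is updated on every access
            continue
        misses += 1
        if len(s) >= ways:
            s.popitem(last=False)        # evict least recent (LRU) or oldest (FIFO)
        s[tag] = True
    return misses


# Hypothetical trace of block addresses mapping to one 2-way set.
trace_blocks = [0, 1, 0, 2, 0, 1, 0, 2]
print(count_misses(trace_blocks, num_sets=1, ways=2, policy="LRU"))   # 5
print(count_misses(trace_blocks, num_sets=1, ways=2, policy="FIFO"))  # 6
```

On this trace the two policies differ by a single miss, in line with replacement being a second-order effect.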

Block Size and Spatial Locality

Block is unit of transfer between the cache and memory

[Figure: 4 word block (Word 0, Word 1, Word 2, Word 3) with its Tag, b=2. Split CPU address: block address (32-b bits), offset (b bits). 2^b = block size a.k.a. line size (in bytes)]

Larger block size has distinct hardware advantages
• less tag overhead
• exploit fast burst transfers from DRAM
• exploit fast burst transfers over wide busses

What are the disadvantages of increasing block size?

Fewer blocks => more conflicts. Can waste bandwidth.
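The "less tag overhead" advantage can be quantified with a short calculation. The sketch below assumes a direct-mapped cache, 32-bit addresses, power-of-two sizes, and a hypothetical 32 KiB capacity; valid bits are ignored and the function name is invented for this example.

```python
def tag_overhead(capacity_bytes: int, block_bytes: int, addr_bits: int = 32) -> float:
    """Tag bits per data bit in a direct-mapped cache (valid bits ignored).

    Both sizes are assumed to be powers of two.
    """
    num_lines = capacity_bytes // block_bytes
    b = block_bytes.bit_length() - 1          # offset bits, 2**b = block size
    k = num_lines.bit_length() - 1            # index bits, 2**k = number of lines
    tag_bits = addr_bits - k - b
    return (num_lines * tag_bits) / (capacity_bytes * 8)


# Hypothetical 32 KiB cache with 32-bit addresses.
for block in (16, 32, 64, 128):
    print(block, round(tag_overhead(32 * 1024, block), 4))
```

Each doubling of the block size halves the number of lines, and with it the tag storage, while the tag width stays at 17 bits for this capacity.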

Acknowledgements

§ This course is partly inspired by previous MIT 6.823 and Berkeley CS252 computer architecture courses created by my collaborators and colleagues:
– Arvind (MIT)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)