22
11/3/2016 CS152, Fall 2016 CS 152 Computer Architecture and Engineering Lecture 18: Multi-Processors - Snoopy Caches John Wawrzynek Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~johnw http://inst.cs.berkeley.edu/~cs152

CS 152 Computer Architecture and Engineering Lecture …inst.eecs.berkeley.edu/~cs152/fa16/lectures/L17-Snoopy.pdf · 11/3/2016 CS152, Fall 2016 CS 152 Computer Architecture and Engineering

  • Upload
    hanhan

  • View
    215

  • Download
    0

Embed Size (px)

Citation preview

11/3/2016 CS152,Fall2016

CS152ComputerArchitectureandEngineering

Lecture 18:Multi-Processors- SnoopyCaches

JohnWawrzynekElectricalEngineeringandComputerSciences

UniversityofCalifornia,Berkeley

http://www.eecs.berkeley.edu/~johnwhttp://inst.cs.berkeley.edu/~cs152

11/3/2016 CS152,Fall2016

UniprocessorPerformance(SPECint)

2CS152-Spring’09

1

10

100

1000

10000

1978 1980 1982 1984 1986 1988 1990 1992 1994 1996 1998 2000 2002 2004 2006

Per

form

ance

(vs

. VA

X-1

1/78

0)

25%/year

52%/year

??%/year

• VAX : 25%/year 1978 to 1986• RISC + x86: 52%/year 1986 to 2002• RISC + x86: ??%/year 2002 to present

From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006

3X

11/3/2016 CS152,Fall2016

ParallelProcessing:Déjàvualloveragain?

§ “…today’sprocessors…arenearinganimpasseastechnologiesapproachthespeedoflight..”– DavidMitchell,TheTransputer:TheTimeIsNow(1989)

§ Transputer hadbadtiming(Uniprocessorperformance↑)⇒ Procrastinationrewarded:2Xseq.perf./1.5years

§ “Wearededicatingallofourfutureproductdevelopmenttomulticoredesigns.…Thisisaseachangeincomputing”– PaulOtellini,President,Intel(2005)

§ AllmicroprocessorcompaniesswitchtoMP(2+CPUs/2yrs)⇒ Procrastinationpenalized:2Xsequentialperf./5yrs

§ Evenhandheldsystemsmovedtomulticore– Nintendo3DS,iPhone6hastwocoreseach(plusadditionalspecializedcores),AndroidQualcommSnapdragon805hasfourcores.PlaystationPortableVitahasfourcores.

3

11/3/2016 CS152,Fall2016

SymmetricMultiprocessors(SMPs)

4

symmetric• Allmemoryisequally farawayfromallprocessors• AnyprocessorcandoanyI/O(setupaDMAtransfer)

Memory

I/Ocontroller

Graphicsoutput

CPU-Memorybus

bridge

Processor

I/Ocontroller I/Ocontroller

I/Obus

Networks

Processor

Local caches at processors makes it practical!

11/3/2016 CS152,Fall2016

Synchronization

5

Theneedforsynchronizationariseswheneverthereareconcurrentprocessesinasystemcooperatingonsometask

(eveninauniprocessor system)

Twoclassesofsynchronization:

Producer-Consumer:Aconsumerprocessmustwaituntiltheproducerprocesshasproduceddata

MutualExclusion:Ensurethatonlyone processusesaresourceatagiventime

producer

consumer

SharedResource

P1 P2

11/3/2016 CS152,Fall2016

NeedforMutualExclusion

§ Example(wikipedia):sharedlinkedlistmanagement§ Twonodes, i andi +1,beingremovedsimultaneouslyresultsinnode i +1notbeingremoved.

6

11/3/2016 CS152,Fall2016

MemoryCoherenceinSMPs

7

Suppose CPU-1 updates A to 200. write-back: memory and cache-2 have stale valueswrite-through: cache-2 has a stale value

Do these stale values matter?What is the view of shared memory for programming?

cache-1A 100

CPU-Memory bus

CPU-1 CPU-2

cache-2A 100

memoryA 100

11/3/2016 CS152,Fall2016

MaintainingCacheCoherence

§ Acachecoherenceprotocolensuresthatallwritesbyoneprocessorareeventuallyvisibletootherprocessors,foronememoryaddress– i.e.,updatesarenotlost

§Hardwaresupportisrequiredsuchthat– onlyoneprocessoratatimehaswritepermissionforalocation

– noprocessorcanloadastalecopyofthelocationafterawrite

⇒ cachecoherenceprotocols

10

11/3/2016 CS152,Fall2016

Warmup:ParallelI/O

13

(DMA stands for “Direct Memory Access”, means the I/O device can read/write memory autonomous from the CPU)

Either Cache or DMA canbe the Bus Master andeffect transfers

DISKDMA

PhysicalMemory

Proc.

R/W

Data (D) Cache

Address (A)

AD

R/W

Page transfersoccur while theProcessor is running

MemoryBus

11/3/2016 CS152,Fall2016

ProblemswithParallelI/O

14

Memory Disk: Physical memory may bestale if cache copy is dirty

Disk Memory: Cache may hold stale data and not see memory writes

DISK

DMA

PhysicalMemory

Proc.Cache

MemoryBus

Cached portionsof page

DMA transfers

11/3/2016 CS152,Fall2016

SnoopyCache, Goodman1983

§ Idea:Havecachewatch(orsnoopupon)DMAtransfers,andthen“dotherightthing”

§ Snoopycachetagsaredual-ported

15

Proc.

Cache

Snoopy read portattached to MemoryBus

Data(lines)

Tags andState

A

D

R/W

Used to drive Memory Buswhen Cache is Bus Master

A

R/W

11/3/2016 CS152,Fall2016

SnoopyCacheActionsforDMA

16

Observed Bus Cycle Cache State Cache Action

Address not cached

DMA Read Cached, unmodified

Memory Disk Cached, modified

Address not cached

DMA Write Cached, unmodified

Disk Memory Cached, modified

No action

No action

No action

Cache intervenes

Cache purges its copy

???

11/3/2016 CS152,Fall2016

SharedMemoryMultiprocessor

18

Use snoopy mechanism to keep all processors’ view of memory coherent

M1

M2

M3

SnoopyCache

DMA

PhysicalMemory

MemoryBus

SnoopyCache

SnoopyCache

DISKS

11/3/2016 CS152,Fall2016

SnoopyCacheCoherenceProtocols

19

write miss:the address is invalidated in all othercaches before the write is performed

read miss:if a dirty copy is found in some cache, a write-back is performed before the memory is read

11/3/2016 CS152,Fall2016

TheMSIprotocol

20

M: ModifiedS: SharedI: Invalid

Each cache line has state bits

Address tagstatebits

§ Modified:Theblockhasbeenmodifiedinthecache.Thedatainthecacheistheninconsistentwiththebackingstore(e.g.memory).Acachewithablockinthe"M"statehastheresponsibilitytowritetheblocktothebackingstorewhenitisevicted.AblockintheModifiedstateisexclusive(itcan’tbeinanyothercache).

§ Shared:Thisblockisunmodifiedandexistsinread-onlystateinatleastonecache.Thecachecanevictthedatawithoutwritingittothebackingstore.

§ Invalid:Thisblockiseithernotpresent inthecurrentcacheorhasbeeninvalidatedbyabusrequest,andmustbefetched frommemoryiftheblockistobestored inthiscache.

11/3/2016 CS152,Fall2016

TheMSIprotocol

21

M: ModifiedS: SharedI: Invalid

Each cache line has state bits

Address tagstatebits

§ Areadmisstoablockinacache,C1,generatesabustransaction– ifanothercache,C2,hastheblockinMstate(“exclusively”), ithastowritebacktheblockbeforememorysuppliesit.C1getsthedatafromthebusandtheblockbecomes“shared”inbothcaches.

§ AwritehittoasharedblockinC1forcesawriteback– allothercachesthathavetheblockshouldinvalidatethatblock– theblockbecomes“exclusive” inC1.

§ Awritehittoamodified(exclusive) blockdoesnotgenerateawritebackorchangeofstate.

§ Awritemiss(toaninvalidblock)inC1generatesabustransaction– Ifacache,C2,hastheblockas“shared”,C2invalidates it’scopy– Ifacache,C2,hastheblockin“modified(exclusive)”, itwritesbacktheblockandchangesitstateinC2to“invalid”.

– Ifnocachesuppliestheblock,thememorywillsupplyit.– WhenC1getstheblock,itsetsitsstateto”modified(exclusive)”

11/3/2016 CS152,Fall2016

CacheStateTransitionDiagramTheMSIprotocol

22

M

S I

M: ModifiedS: SharedI: Invalid

Each cache line has state bits

Address tagstatebits Write miss

(P1 gets line from memory)

Other processorintent to write (P1 writes back)

Read miss(P1 gets line from memory)

Other processorintent to write

Read by anyprocessor

P1 readsor writes

Cache state in processor P1

Other processor reads(P1 writes back)

11/3/2016 CS152,Fall2016

TwoProcessorExample(Readingandwritingthesamecacheline)

23

M

S I

Write miss

Readmiss

P2 intent to write

P2 reads,P1 writes back

P1 readsor writes

P2 intent to write

P1

M

S I

Write miss

Readmiss

P1 intent to write

P1 reads,P2 writes back

P2 readsor writes

P1 intent to write

P2

P1 readsP1 writesP2 readsP2 writes

P1 writesP2 writes

P1 reads

P1 writes

11/3/2016 CS152,Fall2016

Observations

§ IfalineisintheM statethennoothercachecanhaveacopyoftheline!

§ Memorystayscoherent,multipledifferingcopiescannotexist§ AwritetoalineintheSstatecausesawriteback (evenifnoothercachehasacopy!)

24

M

S I

Write miss

Other processorintent to write

Readmiss

Other processorintent to write

Read by anyprocessor

P1 readsor writesOther processor reads

P1 writes back

11/3/2016 CS152,Fall2016

MESI:AnEnhancedMSIprotocolincreasedperformanceforprivatedata

25

M E

S I

M: Modified ExclusiveE: Exclusive but unmodifiedS: SharedI: Invalid

Each cache line has a tag

Address tagstatebits

Write miss

Other processorintent to write

Read miss,shared

Other processorintent to write

P1 write

Read by anyprocessor

Other processor readsP1 writes back

P1 readP1 writeor read

Cache state in processor P1

P1 intent to write

Read miss, not sharedOther

processorreads

Other processor intent to write, P1 writes back

Write to a Exclusive line doesn’t cause a writeback.

11/3/2016 CS152,Fall2016

OptimizedSnoopwithLevel-2Caches

26

Snooper Snooper Snooper Snooper

• Processorsoftenhavetwo-levelcaches• smallL1,largeL2(usuallybothonchipnow)

• Inclusionproperty:entriesinL1mustbeinL2invalidationinL2⇒ invalidationinL1

• SnoopingonL2doesnotaffectCPU-L1bandwidth

Whatproblemcouldoccur?

CPU

L1$

L2$

CPU

L1$

L2$

CPU

L1$

L2$

CPU

L1$

L2$

11/3/2016 CS152,Fall2016

Acknowledgements

§ Theseslidescontainmaterialdeveloped andcopyrightby:– Arvind(MIT)– KrsteAsanovic(MIT/UCB)– JoelEmer(Intel/MIT)– JamesHoe(CMU)– JohnKubiatowicz(UCB)– DavidPatterson(UCB)

§ MITmaterialderivedfromcourse6.823§ UCBmaterialderivedfromcourseCS252

32