Lecture 13: Memory Consistency - 15-418/618 Fall … (15418.courses.cs.cmu.edu/.../13_consistency_slides.pdf)

Carnegie Mellon

Lecture 13: Memory Consistency

Parallel Computer Architecture and Programming, CMU 15-418/15-618, Fall 2016

CMU 15-418/618, Fall 2017

What is Correct Behavior for a Parallel Memory Hierarchy?

•  Note: side-effects of writes are only observable when reads occur
   –  so we will focus on the values returned by reads

•  Intuitive answer:
   –  reading a location should return the latest value written (by any thread)

•  Hmm… what does “latest” mean exactly?
   –  within a thread, it can be defined by program order
   –  but what about across threads?
      •  the most recent write in physical time?
         –  hopefully not, because there is no way that the hardware can pull that off
            »  e.g., if it takes >10 cycles to communicate between processors, there is no way that processor 0 can know what processor 1 did 2 clock ticks ago
      •  most recent based upon something else?
         –  Hmm…

Refining Our Intuition

   Thread 0:
      // write evens to X
      for (i=0; i<N; i+=2) { X = i; … }

   Thread 1:
      // write odds to X
      for (j=1; j<N; j+=2) { X = j; … }

   Thread 2:
      … A = X; … B = X; … C = X; …

(Assume: X=0 initially, and these are the only writes to X.)

•  What would be some clearly illegal combinations of (A,B,C)?
•  How about: (4,8,1)? (9,12,3)? (7,19,31)?

•  What can we generalize from this?
   –  writes from any particular thread must be consistent with program order
      •  in this example, observed even numbers must be increasing (ditto for odds)
   –  across threads: writes must be consistent with a valid interleaving of threads
      •  not physical time! (programmer cannot rely upon that)
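The generalization above can be turned into a small check. This is a minimal sketch, not part of the original slides: `legal_observation` is a hypothetical helper, and it assumes the three values read by Thread 2 are distinct and were each written by one of the two loops. Under that assumption, an observation fits some interleaving exactly when the even values appear in increasing order and the odd values appear in increasing order.

```cpp
#include <array>
#include <cassert>

// Returns true if the three values read by Thread 2 (in program order) are
// consistent with some interleaving of Thread 0 (writes 0, 2, 4, ...) and
// Thread 1 (writes 1, 3, 5, ...).
// Assumption (not from the slides): the values are distinct and each one
// was really written by one of the two loops.
bool legal_observation(const std::array<int, 3>& seen) {
    int last_even = -1;  // largest even value observed so far
    int last_odd = -1;   // largest odd value observed so far
    for (int v : seen) {
        assert(v >= 0);  // both loops only write non-negative values
        int& last = (v % 2 == 0) ? last_even : last_odd;
        if (v <= last) return false;  // this writer would have gone backwards
        last = v;
    }
    return true;
}
```

With this check, (4,8,1) and (7,19,31) are accepted, while (9,12,3) is rejected because Thread 1 would have had to write 3 after 9.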

Visualizing Our Intuition

•  Each thread proceeds in program order
•  Memory accesses interleaved (one at a time) to a single-ported memory
   –  rate of progress of each thread is unpredictable

[Figure: CPU 0, CPU 1, and CPU 2, each running one thread (for example Thread 2: … A = X; … B = X; … C = X; …), access Memory through a single port to memory.]

Correctness Revisited

Recall: “reading a location should return the latest value written (by any thread)”
→  “latest” means consistent with some interleaving that matches this model
   –  this is a hypothetical interleaving; the machine didn’t necessarily do this!

[Figure: CPU 0, CPU 1, and CPU 2, each running one thread, access Memory through a single port to memory.]

Part 2 of Memory Correctness: Memory Consistency Model

1.  “Cache Coherence”
    –  do all loads and stores to a given cache block behave correctly?

2.  “Memory Consistency Model” (sometimes called “Memory Ordering”)
    –  do all loads and stores, even to separate cache blocks, behave correctly?

Recall: our intuition

[Figure: CPU 0, CPU 1, and CPU 2 access Memory through a single port to memory.]

Why is this so complicated?

•  Fundamental issue:
   –  loads and stores are very expensive, even on a uniprocessor
      •  can easily take tens to hundreds of cycles

•  What programmers intuitively expect:
   –  processor atomically performs one instruction at a time, in program order

•  In reality:
   –  if the processor actually operated this way, it would be painfully slow
   –  instead, the processor aggressively reorders instructions to hide memory latency

•  Upshot:
   –  within a given thread, the processor preserves the program order illusion
   –  but this illusion has nothing to do with what happens in physical time!
   –  from the perspective of other threads, all bets are off!

Hiding Memory Latency is Important for Performance

•  Idea: overlap memory accesses with other accesses and computation

   [Figure: a timeline where “write A” is followed by “read B”, next to a timeline where “write A” and “read B” overlap.]

•  Hiding write latency is simple in uniprocessors:
   –  add a write buffer

•  (But this affects correctness in multiprocessors)

[Figure: a Processor connected to a Cache; READS go directly to the cache, WRITES pass through a write buffer.]

How Can We Hide the Latency of Memory Reads?

“Out of order” pipelining:
   –  when an instruction is stuck, perhaps there are subsequent instructions that can be executed

      x = *p;        // suffers expensive cache miss
      y = x + 1;     // stuck waiting on true dependence
      z = a + 2;     // does not need to wait
      b = c / 3;     // does not need to wait

•  Implication: memory accesses may be performed out-of-order!!!

What About Conditional Branches?

•  Do we need to wait for a conditional branch to be resolved before proceeding?
   –  No! Just predict the branch outcome and continue executing speculatively.
      •  if prediction is wrong, squash any side-effects and restart down correct path

      x = *p;
      y = x + 1;
      z = a + 2;
      b = c / 3;
      if (x != z)
         d = e - 7;
      else
         d = e + 5;
      …

   If hardware guesses that the condition (x != z) is true, then it executes the “then” part speculatively, without waiting for x or z.

How Out-of-Order Pipelining Works in Modern Processors

•  Fetch and graduate instructions in-order, but issue out-of-order

•  Intra-thread dependences are preserved, but memory accesses get reordered!

[Figure: the PC (0x10, then 0x14, 0x18, 0x1c) and the Branch Predictor drive fetches from the Inst. Cache into the Reorder Buffer, which holds:
   0x1c: b = c / 3;     issue (out-of-order)
   0x18: z = a + 2;     issue (out-of-order)
   0x14: y = x + 1;     can’t issue
   0x10: x = *p;        issue (cache miss)]

Analogy: Gas Particles in Balloons

•  Imagine that each instruction within a thread is a gas particle inside a twisty balloon
•  They were numbered originally, but then they start to move and bounce around
•  When a given thread observes memory accesses from a different thread:
   –  those memory accesses can be (almost) arbitrarily jumbled around
      •  like trying to locate the position of a particular gas particle in a balloon
•  As we’ll see later, the only thing that we can do is to put twists in the balloon

[Figure: four balloons labeled Thread 0, Thread 1, Thread 2, and Thread 3 along a Time axis, each holding particles numbered 11 to 15 in a different order. Balloon image: wikiHow.]

Uniprocessor Memory Model

•  Memory model specifies ordering constraints among accesses

•  Uniprocessor model: memory accesses atomic and in program order

      write A
      write B
      read A
      read B

•  Not necessary to maintain sequential order for correctness
   –  hardware: buffering, pipelining
   –  compiler: register allocation, code motion

•  Simple for programmers

•  Allows for high performance

[Figure: a Processor connected to a Cache; READS go directly to the cache, WRITES pass through a write buffer. Reads check for matching addresses in write buffer.]
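The write buffer with address matching can be modeled in a few lines. The sketch below is a toy model under stated assumptions, not a description of any real processor: `BufferedProcessor`, `Memory`, and `drain_one` are names invented for illustration, addresses are plain integers, and unwritten locations read as 0. It shows why the buffer is invisible to the thread that owns it (its own loads are forwarded from the buffer) yet visible to other processors (they still see the old value until the buffer drains).

```cpp
#include <cstdint>
#include <deque>
#include <unordered_map>
#include <utility>

using Memory = std::unordered_map<std::uint64_t, int>;  // hypothetical shared memory

// Toy model of one processor with a FIFO write buffer in front of memory.
struct BufferedProcessor {
    Memory& memory;                                          // shared backing store
    std::deque<std::pair<std::uint64_t, int>> write_buffer;  // oldest entry at front

    explicit BufferedProcessor(Memory& shared) : memory(shared) {}

    // A store retires into the buffer immediately; memory is updated later.
    void store(std::uint64_t addr, int value) {
        write_buffer.emplace_back(addr, value);
    }

    // A load checks the buffer first (newest matching entry wins), then memory.
    int load(std::uint64_t addr) const {
        for (auto it = write_buffer.rbegin(); it != write_buffer.rend(); ++it)
            if (it->first == addr) return it->second;
        auto hit = memory.find(addr);
        return hit == memory.end() ? 0 : hit->second;  // unwritten locations read as 0
    }

    // Drain the oldest buffered write to memory, if there is one.
    void drain_one() {
        if (write_buffer.empty()) return;
        memory[write_buffer.front().first] = write_buffer.front().second;
        write_buffer.pop_front();
    }
};
```

If processor `p0` stores 1 to an address, `p0.load` returns 1 right away, while a second `BufferedProcessor` sharing the same `Memory` keeps reading 0 until `p0.drain_one()` runs.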

In Parallel Machines (with a Shared Address Space)

•  Order between accesses to different locations becomes important

(Initially A and Ready = 0)

   P1                     P2
   A = 1;
   Ready = 1;             while (Ready != 1);
                          … = A;

How Unsafe Reordering Can Happen

•  Distribution of memory resources
   –  accesses issued in order may be observed out of order

[Figure: three Processor/Memory nodes joined by an Interconnection Network. One processor issues “… A = 1; Ready = 1;” and another runs “wait (Ready == 1); … = A;”. The messages “A = 1” and “Ready = 1” travel separately through the network; memory shows A: 0 and Ready: 0 → 1.]

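Caches Complicate Things More

•  Multiple copies of the same location

[Figure: three Processor/Cache/Memory nodes joined by an Interconnection Network.
   First processor:    A = 1;
   Second processor:   wait (A == 1); B = 1;
   Third processor:    wait (B == 1); … = A;
 The caches start with A: 0 and B: 0. The update “A = 1” reaches the second processor, which then sends “B = 1”; the third processor sees B: 1 while its cached copy of A is still 0. Oops!]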

Our Intuitive Model: “Sequential Consistency” (SC)

•  Formalized by Lamport (1979)
   –  accesses of each processor in program order
   –  all accesses appear in sequential order

•  Any order implicitly assumed by programmer is maintained

[Figure: processors P0, P1, …, Pn above a single Memory.]

Example with Sequential Consistency

Simple Synchronization:

   P0                     P1
   A = 1       (a)        x = Ready   (c)
   Ready = 1   (b)        y = A       (d)

•  all locations are initialized to 0
•  possible outcomes for (x,y):
   –  (0,0), (0,1), (1,1)
•  (x,y) = (1,0) is not a possible outcome (i.e. Ready = 1, A = 0):
   –  we know a->b and c->d by program order
   –  x == 1 implies b->c, which together with program order implies that a->d
   –  y == 0 implies d->a, which leads to a contradiction
   –  but real hardware will do this!
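The argument above can be checked by brute force. The following sketch (an illustration added here; `explore` and `sc_outcomes` are made-up names) enumerates every interleaving of (a), (b), (c), (d) that respects program order, executes each one against a single shared memory, and collects the resulting (x, y) pairs.

```cpp
#include <set>
#include <utility>

using Outcomes = std::set<std::pair<int, int>>;

// Try every way to pick the next instruction from P0 or P1.
// p0_pc and p1_pc count how many instructions each processor has executed.
void explore(int p0_pc, int p1_pc, int A, int Ready, int x, int y, Outcomes& out) {
    if (p0_pc == 2 && p1_pc == 2) {  // both programs finished
        out.emplace(x, y);
        return;
    }
    if (p0_pc == 0) explore(1, p1_pc, 1, Ready, x, y, out);      // (a) A = 1
    if (p0_pc == 1) explore(2, p1_pc, A, 1, x, y, out);          // (b) Ready = 1
    if (p1_pc == 0) explore(p0_pc, 1, A, Ready, Ready, y, out);  // (c) x = Ready
    if (p1_pc == 1) explore(p0_pc, 2, A, Ready, x, A, out);      // (d) y = A
}

// All (x, y) pairs reachable under sequential consistency.
Outcomes sc_outcomes() {
    Outcomes out;
    explore(0, 0, /*A=*/0, /*Ready=*/0, /*x=*/0, /*y=*/0, out);
    return out;
}
```

The returned set holds (0,0), (0,1), and (1,1), and never (1,0), matching the slide.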

Another Example with Sequential Consistency

Stripped-down version of a 2-process mutex (minus the turn-taking):

   P0                       P1
   want[0] = 1   (a)        want[1] = 1   (c)
   x = want[1]   (b)        y = want[0]   (d)

•  all locations are initialized to 0
•  possible outcomes for (x,y):
   –  (0,1), (1,0), (1,1)
•  (x,y) = (0,0) is not a possible outcome (i.e. want[0] = 0, want[1] = 0):
   –  a->b and c->d implied by program order
   –  x = 0 implies b->c which implies a->d
   –  a->d says y = 1 which leads to a contradiction
   –  similarly, y = 0 implies x = 1 which is also a contradiction
   –  but real hardware will do this!
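One way to see why real hardware can produce (0,0) is to add a write buffer to the model. The sketch below is a simplified illustration, not a faithful model of any processor: `State`, `explore`, and `outcomes` are invented names, each processor gets a one-entry write buffer, and a buffered store may drain to memory at any later point. Each load targets the other processor's flag, so it never matches the processor's own buffered store and reads memory directly.

```cpp
#include <array>
#include <set>
#include <utility>

using Outcomes = std::set<std::pair<int, int>>;

struct State {
    std::array<int, 2> want{0, 0};               // shared memory
    std::array<bool, 2> buffered{false, false};  // is Pi's store still in its write buffer?
    std::array<int, 2> pc{0, 0};                 // instructions executed by each processor
    std::array<int, 2> result{0, 0};             // result[0] = x, result[1] = y
};

// Explore every order of instruction steps and buffer drains.
void explore(State s, bool use_write_buffer, Outcomes& out) {
    if (s.pc[0] == 2 && s.pc[1] == 2 && !s.buffered[0] && !s.buffered[1]) {
        out.emplace(s.result[0], s.result[1]);
        return;
    }
    for (int i = 0; i < 2; ++i) {
        if (s.pc[i] == 0) {                 // want[i] = 1
            State n = s;
            n.pc[i] = 1;
            if (use_write_buffer) n.buffered[i] = true;  // store waits in the buffer
            else n.want[i] = 1;                          // store goes straight to memory
            explore(n, use_write_buffer, out);
        } else if (s.pc[i] == 1) {          // x (or y) = want[1 - i], read from memory
            State n = s;
            n.pc[i] = 2;
            n.result[i] = s.want[1 - i];
            explore(n, use_write_buffer, out);
        }
        if (s.buffered[i]) {                // the buffered store drains to memory
            State n = s;
            n.buffered[i] = false;
            n.want[i] = 1;
            explore(n, use_write_buffer, out);
        }
    }
}

Outcomes outcomes(bool use_write_buffer) {
    Outcomes out;
    explore(State{}, use_write_buffer, out);
    return out;
}
```

Here `outcomes(false)` gives the three sequentially consistent results, and `outcomes(true)` additionally contains (0,0): both stores sit in their buffers while both loads read the old values from memory.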

One Approach to Implementing Sequential Consistency

1.  Implement cache coherence
    →  writes to the same location are observed in same order by all processors

2.  For each processor, delay start of memory access until previous one completes
    →  each processor has only one outstanding memory access at a time

•  What does it mean for a memory access to complete?

When Do Memory Accesses Complete?

•  Memory Reads:
   –  a read completes when its return value is bound

      load r1 ← X                    X = ???
      (Find X in memory system)      X = 17
      r1 = 17

•  Memory Writes:
   –  a write completes when the new value is “visible” to other processors

      store 23 → X                   X = 23
      (Commit to memory order) (aka “serialize”)

•  What does “visible” mean?
   –  it does NOT mean that other processors have necessarily seen the value yet
   –  it means the new value is committed to the hypothetical serializable order (HSO)
      •  a later read of X in the HSO will see either this value or a later one
   –  (for simplicity, assume that writes occur atomically)

Summary for Sequential Consistency

•  Maintain order between shared accesses in each processor

   [Figure: the access pairs READ then READ, READ then WRITE, WRITE then READ, and WRITE then WRITE are all ordered: don’t start until previous access completes.]

•  Balloon analogy:
   –  like putting a twist between each individual (ordered) gas particle

•  Severely restricts common hardware and compiler optimizations

Performance of Sequential Consistency

•  Processor issues accesses one-at-a-time and stalls for completion

•  Low processor utilization (17% - 42%) even with caching

From Gupta et al, “Comparative evaluation of latency reducing and tolerating techniques.” In Proceedings of the 18th annual International Symposium on Computer Architecture (ISCA ’91)

Alternatives to Sequential Consistency

•  Relax constraints on memory order

   Total Store Ordering (TSO) (Similar to Intel)

   Partial Store Ordering (PSO)

See Section 8.2 of “Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A: System Programming Guide, Part 1”, http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf

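Performance Impact of TSO vs. SC

•  Can use a write buffer
•  Write latency is effectively hidden

[Figure: performance chart in which “Base” = SC and “WR” = TSO, next to a diagram of a Processor connected to a Cache, with READS going directly to the cache and WRITES passing through a write buffer.]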

But Can Programs Live with Weaker Memory Orders?

•  “Correctness”: same results as sequential consistency
•  Most programs don’t require strict ordering (all of the time) for “correctness”

   [Figure: the same sequence of operations drawn twice, once labeled “Program Order” and once labeled “Sufficient Order”:
      A = 1;
      B = 1;
      unlock L;
      lock L;
      … = A;
      … = B;]

•  But how do we know when a program will behave correctly?

Identifying Data Races and Synchronization

•  Two accesses conflict if:
   –  (i) access same location, and (ii) at least one is a write

•  Order accesses by:
   –  program order (po)
   –  dependence order (do): op1 --> op2 if op2 reads op1

      P1                      P2
      Write A
         | po
      Write Flag   --do-->    Read Flag
                                 | po
                              Read A

•  Data Race:
   –  two conflicting accesses on different processors
   –  not ordered by intervening accesses

•  Properly Synchronized Programs:
   –  all synchronizations are explicitly identified
   –  all data accesses are ordered through synchronization
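The definitions above translate fairly directly into code. The sketch below is a simplified reading of them, with invented names (`Access`, `conflict`, `has_data_race`): accesses are list entries, po and do edges are index pairs, and a conflicting pair on different processors counts as a race when no chain of edges orders it in either direction.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

struct Access {
    int processor;
    std::string location;
    bool is_write;
};

using Edge = std::pair<std::size_t, std::size_t>;  // (from, to): a po or do edge

// Two accesses conflict if they touch the same location and at least one writes.
bool conflict(const Access& a, const Access& b) {
    return a.location == b.location && (a.is_write || b.is_write);
}

// True if some conflicting pair on different processors is left unordered.
bool has_data_race(const std::vector<Access>& ops, const std::vector<Edge>& edges) {
    const std::size_t n = ops.size();
    std::vector<std::vector<bool>> reach(n, std::vector<bool>(n, false));
    for (const Edge& e : edges) reach[e.first][e.second] = true;
    // Transitive closure: i reaches j if i reaches k and k reaches j.
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)
                if (reach[i][k] && reach[k][j]) reach[i][j] = true;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = i + 1; j < n; ++j)
            if (ops[i].processor != ops[j].processor && conflict(ops[i], ops[j]) &&
                !reach[i][j] && !reach[j][i])
                return true;
    return false;
}
```

For the four accesses in the diagram, the edges po, do, po order Write A before Read A, so no race is reported. Dropping the do edge leaves both the Flag pair and the A pair unordered, and the check reports a race.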

Optimizations for Synchronized Programs

•  Intuition: many parallel programs have mixtures of “private” and “public” parts*
   –  the “private” parts must be protected by synchronization (e.g., locks)
   –  can we take advantage of synchronization to improve performance?

   READ/WRITE … READ/WRITE
   SYNCH                        Example: Grab a lock
   READ/WRITE … READ/WRITE      Insert node into data structure
                                •  Essentially a “private” activity; reordering is ok
   SYNCH                        Release the lock
                                •  Now we make it “public” to the other nodes
   READ/WRITE … READ/WRITE

*Caveat: shared data is in fact always visible to other threads.

Optimizations for Synchronized Programs

•  Exploit information about synchronization

•  properly synchronized programs should yield the same result as on an SC machine

“Weak Ordering” (WO)

   READ/WRITE … READ/WRITE
   SYNCH
   READ/WRITE … READ/WRITE
   SYNCH
   READ/WRITE … READ/WRITE

Between synchronization operations:
•  we can allow reordering of memory operations
•  (as long as intra-thread dependences are preserved)

Just before and just after synchronization operations:
•  thread must wait for all prior operations to complete

Intel’s MFENCE (Memory Fence) Operation

•  An MFENCE operation enforces the ordering seen on the previous slide:
   –  does not begin until all prior reads & writes from that thread have completed
   –  no subsequent read or write from that thread can start until after it finishes

   READ/WRITE … READ/WRITE
   MFENCE
   READ/WRITE … READ/WRITE
   MFENCE
   READ/WRITE … READ/WRITE

Balloon analogy: it is a twist in the balloon
•  no gas particles can pass through it
(Balloon image: wikiHow)

Good news: xchg does this implicitly!
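In portable C++, the closest counterpart to MFENCE is `std::atomic_thread_fence(std::memory_order_seq_cst)`; on x86, compilers typically lower it to MFENCE or a locked instruction. The sketch below is an illustration (the function name `both_read_zero` is made up here) that places such a fence between the store and the load of the stripped-down mutex from the earlier slide. With both fences in place, the C++ memory model rules out the outcome in which both threads read 0.

```cpp
#include <atomic>
#include <thread>

std::atomic<int> want[2];  // zero-initialized: static storage duration

// Runs the stripped-down mutex once; returns true if both threads read 0.
bool both_read_zero() {
    want[0].store(0, std::memory_order_relaxed);
    want[1].store(0, std::memory_order_relaxed);
    int x = -1;
    int y = -1;
    std::thread p0([&x] {
        want[0].store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);  // plays the role of MFENCE
        x = want[1].load(std::memory_order_relaxed);
    });
    std::thread p1([&y] {
        want[1].store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);  // plays the role of MFENCE
        y = want[0].load(std::memory_order_relaxed);
    });
    p0.join();
    p1.join();
    return x == 0 && y == 0;
}
```

The relaxed stores and loads stand in for ordinary memory operations; all of the ordering comes from the two fences. Whichever fence comes first in the single total order of sequentially consistent fences, the other thread's load is guaranteed to observe the store made before it.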

ARM Processors

•  ARM processors have a very relaxed consistency model

•  ARM has some great examples in their programmer’s reference:
   –  http://infocenter.arm.com/help/topic/com.arm.doc.genc007826/Barrier_Litmus_Tests_and_Cookbook_A08.pdf

•  A great list regarding relaxed memory consistency in general:
   –  http://www.cl.cam.ac.uk/~pes20/weakmemory/

Common Misconception about MFENCE

•  MFENCE operations do NOT push values out to other threads
   –  it is not a magic “make every thread up-to-date” operation

•  Instead, they simply stall the thread that performs the MFENCE

   READ/WRITE … READ/WRITE
   MFENCE
   READ/WRITE … READ/WRITE
   MFENCE
   READ/WRITE … READ/WRITE

MFENCE operations create partial orderings
•  that are observable across threads

[Figure: balloons for Thread 0, Thread 1, Thread 2, and Thread 3 along a Time axis, holding particles numbered 11 to 15; the MFENCE twists keep particles on either side from crossing.]

Earlier (Broken) Example Revisited

Where exactly should we insert MFENCE operations to fix this?

   P0                     P1
   [1: Here?]             [4: Here?]
   A = 1                  x = Ready
   [2: Here?]             [5: Here?]
   Ready = 1              y = A
   [3: Here?]             [6: Here?]
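One candidate answer, expressed in portable C++ rather than raw MFENCE, puts a fence at position 2 (between the two stores) and at position 5 (between the two loads). This is a sketch under assumptions: `run_once` is an invented name, `std::atomic_thread_fence` stands in for MFENCE, and P1 spins until it sees Ready == 1 (instead of reading Ready once, as on the slide) so that the result is deterministic.

```cpp
#include <atomic>
#include <thread>

std::atomic<int> A{0};
std::atomic<int> Ready{0};

// Runs the example once; returns the value of y that P1 reads after seeing Ready == 1.
int run_once() {
    A.store(0, std::memory_order_relaxed);
    Ready.store(0, std::memory_order_relaxed);
    int y = -1;
    std::thread p0([] {
        A.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);  // position 2
        Ready.store(1, std::memory_order_relaxed);
    });
    std::thread p1([&y] {
        while (Ready.load(std::memory_order_relaxed) != 1)    // wait until x would be 1
            std::this_thread::yield();
        std::atomic_thread_fence(std::memory_order_seq_cst);  // position 5
        y = A.load(std::memory_order_relaxed);
    });
    p0.join();
    p1.join();
    return y;
}
```

With both fences present, `run_once()` always returns 1. Removing either fence makes the outcome y == 0 legal under the C++ memory model, even though a given machine may never exhibit it.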

Exploiting Asymmetry in Synchronization: “Release Consistency”

•  Lock operation: only gains (“acquires”) permission to access data
•  Unlock operation: only gives away (“releases”) permission to access data

   Weak Ordering (WO)                  Release Consistency (RC)
   (overly conservative)

   1: READ/WRITE … READ/WRITE          1: READ/WRITE … READ/WRITE
   LOCK                                ACQUIRE
   2: READ/WRITE … READ/WRITE          2: READ/WRITE … READ/WRITE
   UNLOCK                              RELEASE
   3: READ/WRITE … READ/WRITE          3: READ/WRITE … READ/WRITE

   Under RC, block 2 must wait for the ACQUIRE, and the RELEASE must wait for blocks 1 and 2; block 1 need not complete before the ACQUIRE, and block 3 need not wait for the RELEASE.
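C++ exposes the same asymmetry through `std::memory_order_acquire` and `std::memory_order_release`. The sketch below is an illustration with invented names (`payload`, `ready`, `transfer`): the producer's release store plays the role of RELEASE (unlock) and the consumer's acquire load plays the role of ACQUIRE (lock), so the ordinary write to `payload` is guaranteed to be visible once the flag is seen.

```cpp
#include <atomic>
#include <thread>

int payload = 0;                 // ordinary (non-atomic) data
std::atomic<bool> ready{false};  // synchronization variable

// Hands one value from a producer thread to a consumer thread.
int transfer() {
    payload = 0;
    ready.store(false, std::memory_order_relaxed);
    int seen = -1;
    std::thread producer([] {
        payload = 42;                                   // data access before the release
        ready.store(true, std::memory_order_release);   // RELEASE: earlier accesses complete first
    });
    std::thread consumer([&seen] {
        while (!ready.load(std::memory_order_acquire))  // ACQUIRE: later accesses wait for it
            std::this_thread::yield();
        seen = payload;                                 // data access after the acquire
    });
    producer.join();
    consumer.join();
    return seen;
}
```

`transfer()` returns 42 on every run. Neither operation is a full fence: accesses after the release store may still move ahead of it, and accesses before the acquire load may still be delayed past it, which is the extra freedom RC offers over WO.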

Intel’s Full Set of Fence Operations

•  In addition to MFENCE, Intel also supports two other fence operations:
   –  LFENCE: serializes only with respect to load operations (not stores!)
   –  SFENCE: serializes only with respect to store operations (not loads!)
   •  Note: It does slightly more than this; see the spec for details:
      –  Section 8.2.5 of “Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A: System Programming Guide, Part 1”

•  In practice, you are most likely to use:
   –  MFENCE
   –  xchg

Take-Away Messages on Memory Consistency Models

•  DON’T use only normal memory operations for synchronization
   –  e.g., Peterson’s solution (from Synchronization #1 lecture)

      boolean want[2] = {false, false};
      int turn = 0;

      want[i] = true;
      turn = j;
      while (want[j] && turn == j)
         continue;
      … critical section …
      want[i] = false;

      Exercise for the reader: Where should we add fences (and which type) to fix this?

•  DO use either explicit synchronization operations (e.g., xchg) or fences

      while (!xchg(&lock_available, 0))
         continue;
      … critical section …
      xchg(&lock_available, 1);
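The xchg-based lock maps onto `std::atomic<int>::exchange`, which atomically swaps in a new value and returns the old one. The sketch below is an illustration, not the lecture's code: `add_many`, `run`, and `counter` are invented names, and the critical section just increments a shared counter. `exchange` defaults to sequentially consistent ordering, so no separate fence is needed.

```cpp
#include <atomic>
#include <thread>
#include <vector>

std::atomic<int> lock_available{1};  // 1 = free, 0 = held (same encoding as the slide)
long counter = 0;                    // ordinary data protected by the lock

// Each call performs `iterations` increments, one per critical section.
void add_many(int iterations) {
    for (int i = 0; i < iterations; ++i) {
        while (!lock_available.exchange(0))  // old value 1 means we took the lock
            std::this_thread::yield();
        ++counter;                           // critical section
        lock_available.exchange(1);          // release the lock
    }
}

// Runs `threads` workers and returns the final counter value.
long run(int threads, int iterations) {
    counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) pool.emplace_back(add_many, iterations);
    for (std::thread& worker : pool) worker.join();
    return counter;
}
```

Because every increment happens while the lock is held, `run(4, 10000)` returns 40000. The same loop without the lock would be a data race on `counter`.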

Summary: Relaxed Consistency

•  Motivation:
   –  obtain higher performance by allowing reordering of memory operations
      •  (reordering is not allowed by sequential consistency)

•  One cost is software complexity:
   –  the programmer or compiler must insert synchronization
      •  to ensure certain specific orderings when needed

•  In practice:
   –  complexities often encapsulated in libraries that provide intuitive primitives
      •  e.g., lock/unlock, barriers (or lower-level primitives like fence)

•  Relaxed models differ in which memory ordering constraints they ignore
