Carnegie Mellon
Lecture 13: Memory Consistency
Parallel Computer Architecture and Programming, CMU 15-418/15-618, Fall 2016
CMU 15-418/618, Fall 2017

What is Correct Behavior for a Parallel Memory Hierarchy?
• Note: side-effects of writes are only observable when reads occur
  – so we will focus on the values returned by reads
• Intuitive answer:
  – reading a location should return the latest value written (by any thread)
• Hmm… what does “latest” mean exactly?
  – within a thread, it can be defined by program order
  – but what about across threads?
    • the most recent write in physical time?
      – hopefully not, because there is no way that the hardware can pull that off
        » e.g., if it takes >10 cycles to communicate between processors, there is no way that processor 0 can know what processor 1 did 2 clock ticks ago
    • most recent based upon something else?
      – Hmm…

Refining Our Intuition
Thread 0:
  // write evens to X
  for (i=0; i<N; i+=2) { X = i; … }
Thread 1:
  // write odds to X
  for (j=1; j<N; j+=2) { X = j; … }
Thread 2:
  … A = X; … B = X; … C = X; …
(Assume: X=0 initially, and these are the only writes to X.)
• What would be some clearly illegal combinations of (A,B,C)?
• How about: (4,8,1)? (9,12,3)? (7,19,31)?
• What can we generalize from this?
  – writes from any particular thread must be consistent with program order
    • in this example, observed even numbers must be increasing (ditto for odds)
  – across threads: writes must be consistent with a valid interleaving of threads
    • not physical time! (programmer cannot rely upon that)
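The three candidate triples can be checked mechanically. The following is a minimal sketch, not part of the lecture: it assumes N = 32 and searches for any interleaving of the two writer loops in which Thread 2's reads return the given values in order. The function name and the state encoding are illustrative choices.

```python
from collections import deque

def observable_under_interleaving(reads, n):
    """Return True if Thread 2 could observe `reads` (in order) in some interleaving."""
    # State: (evens written so far, odds written so far, current X, reads matched).
    start = (0, 0, 0, 0)                   # X = 0 initially
    seen = {start}
    frontier = deque([start])
    while frontier:
        i, j, x, k = frontier.popleft()
        if k == len(reads):
            return True
        successors = []
        if 2 * i < n:                      # Thread 0 writes its next even value
            successors.append((i + 1, j, 2 * i, k))
        if 2 * j + 1 < n:                  # Thread 1 writes its next odd value
            successors.append((i, j + 1, 2 * j + 1, k))
        if x == reads[k]:                  # Thread 2 performs its next read
            successors.append((i, j, x, k + 1))
        for state in successors:
            if state not in seen:
                seen.add(state)
                frontier.append(state)
    return False

for triple in [(4, 8, 1), (9, 12, 3), (7, 19, 31)]:
    print(triple, observable_under_interleaving(triple, n=32))
```

Under these assumptions the search accepts (4,8,1) and (7,19,31) and rejects (9,12,3), because Thread 1 would have to write 3 after it has already written 9.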

Visualizing Our Intuition
[Figure: CPU0, CPU1, and CPU2, running Thread 0, Thread 1, and Thread 2 from the previous slide, share a single port to memory]
• Each thread proceeds in program order
• Memory accesses interleaved (one at a time) to a single-ported memory
  – rate of progress of each thread is unpredictable

Correctness Revisited
[Figure: the same three threads on CPU0, CPU1, and CPU2, sharing a single port to memory]
Recall: “reading a location should return the latest value written (by any thread)”
→ “latest” means consistent with some interleaving that matches this model
  – this is a hypothetical interleaving; the machine didn’t necessarily do this!

Part 2 of Memory Correctness: Memory Consistency Model
1. “Cache Coherence”
  – do all loads and stores to a given cache block behave correctly?
2. “Memory Consistency Model” (sometimes called “Memory Ordering”)
  – do all loads and stores, even to separate cache blocks, behave correctly?
Recall: our intuition
[Figure: CPU0, CPU1, and CPU2 sharing a single port to memory]

Why is this so complicated?
• Fundamental issue:
  – loads and stores are very expensive, even on a uniprocessor
    • can easily take 10’s to 100’s of cycles
• What programmers intuitively expect:
  – processor atomically performs one instruction at a time, in program order
• In reality:
  – if the processor actually operated this way, it would be painfully slow
  – instead, the processor aggressively reorders instructions to hide memory latency
• Upshot:
  – within a given thread, the processor preserves the program order illusion
  – but this illusion has nothing to do with what happens in physical time!
  – from the perspective of other threads, all bets are off!

Hiding Memory Latency is Important for Performance
• Idea: overlap memory accesses with other accesses and computation
[Figure: timeline of “write A” followed by “read B”, compared with “write A” and “read B” overlapped]
• Hiding write latency is simple in uniprocessors:
  – add a write buffer
[Figure: Processor sends READS to the Cache and WRITES through a write buffer]
• (But this affects correctness in multiprocessors)

How Can We Hide the Latency of Memory Reads?
“Out of order” pipelining:
  – when an instruction is stuck, perhaps there are subsequent instructions that can be executed
  x = *p;      // suffers expensive cache miss
  y = x + 1;   // stuck waiting on true dependence
  z = a + 2;   // these do not need to wait
  b = c / 3;   // these do not need to wait
• Implication: memory accesses may be performed out-of-order!!!

What About Conditional Branches?
• Do we need to wait for a conditional branch to be resolved before proceeding?
  – No! Just predict the branch outcome and continue executing speculatively.
    • if prediction is wrong, squash any side-effects and restart down correct path
  x = *p;
  y = x + 1;
  z = a + 2;
  b = c / 3;
  if (x != z)    // if hardware guesses that this is true, then execute “then” part (speculatively), without waiting for x or z
    d = e - 7;
  else
    d = e + 5;
  …

How Out-of-Order Pipelining Works in Modern Processors
• Fetch and graduate instructions in-order, but issue out-of-order
[Figure: the Inst. Cache and Branch Predictor fetch PC 0x10, 0x14, 0x18, 0x1c in order into the Reorder Buffer]
  0x10: x = *p;      issue (cache miss)
  0x14: y = x + 1;   can’t issue
  0x18: z = a + 2;   issue (out-of-order)
  0x1c: b = c / 3;   issue (out-of-order)
• Intra-thread dependences are preserved, but memory accesses get reordered!

Analogy: Gas Particles in Balloons
• Imagine that each instruction within a thread is a gas particle inside a twisty balloon
• They were numbered originally, but then they start to move and bounce around
• When a given thread observes memory accesses from a different thread:
  – those memory accesses can be (almost) arbitrarily jumbled around
    • like trying to locate the position of a particular gas particle in a balloon
• As we’ll see later, the only thing that we can do is to put twists in the balloon
[Figure (balloon image: wikiHow): over time, particles numbered 11 to 15 appear in a different order in each of Thread 0, Thread 1, Thread 2, and Thread 3]

Uniprocessor Memory Model
• Memory model specifies ordering constraints among accesses
• Uniprocessor model: memory accesses atomic and in program order
  write A
  write B
  read A
  read B
• Not necessary to maintain sequential order for correctness
  – hardware: buffering, pipelining
  – compiler: register allocation, code motion
• Simple for programmers
• Allows for high performance
[Figure: Processor sends READS to the Cache and WRITES through a write buffer; reads check for matching addresses in the write buffer]
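The write-buffer picture above can be sketched as a toy model. This is an illustrative sketch only: the class and method names are hypothetical, and real hardware is far more involved. It shows why a uniprocessor still sees program order, since a read that matches a buffered write gets the buffered value.

```python
class WriteBufferedCache:
    """Toy uniprocessor memory: stores wait in a FIFO buffer, loads check it first."""

    def __init__(self):
        self.cache = {}          # committed values, address -> value
        self.write_buffer = []   # pending (address, value) stores, oldest first

    def store(self, addr, value):
        # The processor continues immediately; the store is only buffered.
        self.write_buffer.append((addr, value))

    def load(self, addr):
        # The newest matching buffered store wins (store-to-load forwarding).
        for buffered_addr, value in reversed(self.write_buffer):
            if buffered_addr == addr:
                return value
        return self.cache.get(addr, 0)

    def drain_one(self):
        # Retire the oldest buffered store to the cache, if there is one.
        if self.write_buffer:
            addr, value = self.write_buffer.pop(0)
            self.cache[addr] = value

mem = WriteBufferedCache()
mem.store("A", 1)                      # write A
mem.store("B", 2)                      # write B
print(mem.load("A"), mem.load("B"))    # read A, read B: both see the buffered values
print(mem.cache)                       # nothing has reached the cache yet
```

The same thread reads its own buffered writes, but the cache (what another processor would consult) is still stale, which is the multiprocessor problem the following slides develop.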

In Parallel Machines (with a Shared Address Space)
• Order between accesses to different locations becomes important
(Initially A and Ready = 0)
  P1:
    A = 1;
    Ready = 1;
  P2:
    while (Ready != 1);
    … = A;
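
How Unsafe Reordering Can Happen
• Distribution of memory resources
  – accesses issued in order may be observed out of order
[Figure: three Processor/Memory nodes on an Interconnection Network. One processor issues “A = 1; Ready = 1;” and another runs “wait (Ready == 1); … = A;”. Memory initially holds A: 0 and Ready: 0. The “A = 1” and “Ready = 1” messages travel separately through the network, and Ready is shown updated (0 → 1) while “A = 1” is still in flight.]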

Caches Complicate Things More
• Multiple copies of the same location
[Figure: three Processor/Cache/Memory nodes on an Interconnection Network. The first processor runs “A = 1;”, the second runs “wait (A == 1); B = 1;”, and the third runs “wait (B == 1); … = A;”. The caches initially hold A: 0 and B: 0, and the copies are updated (0 → 1) at different times, so the third processor can see B == 1 while its copy of A is still 0. Oops!]

Our Intuitive Model: “Sequential Consistency” (SC)
• Formalized by Lamport (1979)
  – accesses of each processor in program order
  – all accesses appear in sequential order
• Any order implicitly assumed by programmer is maintained
[Figure: processors P0, P1, …, Pn take turns accessing a single Memory]

Example with Sequential Consistency
Simple Synchronization:
  P0                    P1
  A = 1       (a)       x = Ready   (c)
  Ready = 1   (b)       y = A       (d)
• all locations are initialized to 0
• possible outcomes for (x,y):
  – (0,0), (0,1), (1,1)
• (x,y) = (1,0) is not a possible outcome (i.e. Ready = 1, A = 0):
  – we know a->b and c->d by program order
  – b->c implies that a->d
  – y==0 implies d->a which leads to a contradiction
  – but real hardware will do this!

Another Example with Sequential Consistency
Stripped-down version of a 2-process mutex (minus the turn-taking):
  P0                      P1
  want[0] = 1   (a)       want[1] = 1   (c)
  x = want[1]   (b)       y = want[0]   (d)
• all locations are initialized to 0
• possible outcomes for (x,y):
  – (0,1), (1,0), (1,1)
• (x,y) = (0,0) is not a possible outcome (i.e. want[0] = 0, want[1] = 0):
  – a->b and c->d implied by program order
  – x = 0 implies b->c which implies a->d
  – a->d says y = 1 which leads to a contradiction
  – similarly, y = 0 implies x = 1 which is also a contradiction
  – but real hardware will do this!
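Both outcome lists can be reproduced by brute force. The sketch below is an illustration, not lecture code: it enumerates every interleaving of two short programs that keeps each program's own order, runs each one against a single shared memory, and collects the resulting (x,y) pairs. The tuple encoding of operations and the names want0 and want1 (standing in for want[0] and want[1]) are assumptions made for the example.

```python
def interleavings(p0, p1):
    """Yield every merge of p0 and p1 that keeps each program's own order."""
    if not p0 or not p1:
        yield list(p0) + list(p1)
        return
    for rest in interleavings(p0[1:], p1):
        yield [p0[0]] + rest
    for rest in interleavings(p0, p1[1:]):
        yield [p1[0]] + rest

def sc_outcomes(p0, p1, observed):
    """Run each interleaving on one shared memory; collect the observed registers."""
    outcomes = set()
    for schedule in interleavings(p0, p1):
        memory, regs = {}, {}                 # every location starts at 0
        for kind, target, source in schedule:
            if kind == "store":
                memory[target] = source       # ("store", location, constant)
            else:
                regs[target] = memory.get(source, 0)   # ("load", register, location)
        outcomes.add(tuple(regs.get(name, 0) for name in observed))
    return outcomes

flag_example = ([("store", "A", 1), ("store", "Ready", 1)],
                [("load", "x", "Ready"), ("load", "y", "A")])
mutex_example = ([("store", "want0", 1), ("load", "x", "want1")],
                 [("store", "want1", 1), ("load", "y", "want0")])

print(sorted(sc_outcomes(*flag_example, observed=("x", "y"))))
print(sorted(sc_outcomes(*mutex_example, observed=("x", "y"))))
```

With only six interleavings per example, the enumeration never produces (1,0) for the first program pair or (0,0) for the second, matching the arguments above.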

One Approach to Implementing Sequential Consistency
1. Implement cache coherence
  → writes to the same location are observed in same order by all processors
2. For each processor, delay start of memory access until previous one completes
  → each processor has only one outstanding memory access at a time
• What does it mean for a memory access to complete?

When Do Memory Accesses Complete?
• Memory Reads:
  – a read completes when its return value is bound
  load r1 ← X    (find X in memory system: X = ??? resolves to X = 17, so r1 = 17)

When Do Memory Accesses Complete? (continued)
• Memory Writes:
  – a write completes when the new value is “visible” to other processors
  store 23 → X    (commit to memory order, aka “serialize”: X = 23)
• What does “visible” mean?
  – it does NOT mean that other processors have necessarily seen the value yet
  – it means the new value is committed to the hypothetical serializable order (HSO)
    • a later read of X in the HSO will see either this value or a later one
  – (for simplicity, assume that writes occur atomically)

Summary for Sequential Consistency
• Maintain order between shared accesses in each processor
[Figure: for every ordered pair of accesses (READ→READ, READ→WRITE, WRITE→READ, WRITE→WRITE): don’t start until previous access completes]
• Balloon analogy:
  – like putting a twist between each individual (ordered) gas particle
• Severely restricts common hardware and compiler optimizations

Performance of Sequential Consistency
• Processor issues accesses one-at-a-time and stalls for completion
• Low processor utilization (17% - 42%) even with caching
From Gupta et al, “Comparative evaluation of latency reducing and tolerating techniques.” In Proceedings of the 18th Annual International Symposium on Computer Architecture (ISCA '91)

Alternatives to Sequential Consistency
• Relax constraints on memory order
[Figure: the four access pairs READ→READ, READ→WRITE, WRITE→READ, WRITE→WRITE, shown once for Total Store Ordering (TSO) (similar to Intel), which relaxes WRITE→READ, and once for Partial Store Ordering (PSO), which relaxes WRITE→READ and WRITE→WRITE]
See Section 8.2 of “Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A: System Programming Guide, Part 1”, http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf

Performance Impact of TSO vs. SC
• Can use a write buffer
• Write latency is effectively hidden
[Figure: performance chart in which “Base” = SC and “WR” = TSO; Processor sends READS to the Cache and WRITES through a write buffer]
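What a write buffer does to cross-thread outcomes can be explored with a toy model. The sketch below is a simplification assumed for illustration, not a faithful model of any real processor: each thread has a private FIFO store buffer, loads check the thread's own buffer before shared memory, and buffered stores drain to memory at arbitrary times. The names want0 and want1 stand in for want[0] and want[1] from the earlier mutex example.

```python
def replace(items, index, value):
    """Return a copy of tuple `items` with one position changed."""
    return items[:index] + (value,) + items[index + 1:]

def tso_outcomes(programs, observed):
    """Explore every schedule of a toy machine with one FIFO store buffer per thread."""
    start = ((0,) * len(programs), ((),) * len(programs), (), ())
    seen, stack, outcomes = set(), [start], set()
    while stack:
        state = stack.pop()
        if state in seen:
            continue
        seen.add(state)
        pcs, buffers, memory, regs = state
        if all(pc == len(p) for pc, p in zip(pcs, programs)):
            values = dict(regs)
            outcomes.add(tuple(values.get(name, 0) for name in observed))
        for t, program in enumerate(programs):
            if buffers[t]:
                # Drain this thread's oldest buffered store into shared memory.
                addr, value = buffers[t][0]
                new_memory = dict(memory)
                new_memory[addr] = value
                stack.append((pcs, replace(buffers, t, buffers[t][1:]),
                              tuple(sorted(new_memory.items())), regs))
            if pcs[t] < len(program):
                kind, target, source = program[pcs[t]]
                next_pcs = replace(pcs, t, pcs[t] + 1)
                if kind == "store":
                    # The store only enters the local buffer; other threads cannot see it yet.
                    grown = buffers[t] + ((target, source),)
                    stack.append((next_pcs, replace(buffers, t, grown), memory, regs))
                else:
                    # A load checks the local buffer first (newest match), then memory.
                    pending = [v for a, v in buffers[t] if a == source]
                    value = pending[-1] if pending else dict(memory).get(source, 0)
                    new_regs = dict(regs)
                    new_regs[target] = value
                    stack.append((next_pcs, buffers, memory,
                                  tuple(sorted(new_regs.items()))))
    return outcomes

mutex = ([("store", "want0", 1), ("load", "x", "want1")],
         [("store", "want1", 1), ("load", "y", "want0")])
flag = ([("store", "A", 1), ("store", "Ready", 1)],
        [("load", "x", "Ready"), ("load", "y", "A")])
print(sorted(tso_outcomes(mutex, ("x", "y"))))
print(sorted(tso_outcomes(flag, ("x", "y"))))
```

In this model the mutex example can produce (0,0), because both stores may still sit in their buffers when the loads run, while the A/Ready example still cannot produce (1,0), because each buffer drains in FIFO order.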

But Can Programs Live with Weaker Memory Orders?
• “Correctness”: same results as sequential consistency
• Most programs don’t require strict ordering (all of the time) for “correctness”
[Figure: two copies of the same two-thread code, one labeled “Program Order” and one labeled “Sufficient Order”: the first thread runs “A = 1; B = 1; unlock L;” and the second runs “lock L; … = A; … = B;”]
• But how do we know when a program will behave correctly?

Identifying Data Races and Synchronization
• Two accesses conflict if:
  – (i) access same location, and (ii) at least one is a write
• Order accesses by:
  – program order (po)
  – dependence order (do): op1 --> op2 if op2 reads op1
• Data Race:
  – two conflicting accesses on different processors
  – not ordered by intervening accesses
• Properly Synchronized Programs:
  – all synchronizations are explicitly identified
  – all data accesses are ordered through synchronization
  P1                      P2
  Write A
    | po
  Write Flag  --do-->     Read Flag
                            | po
                          Read A
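The definitions above translate into a small check. The following sketch is illustrative: the data layout, the access names, and the choice to exempt accesses marked as synchronization are assumptions made for the example. It takes the po and do edges, computes which accesses are ordered (directly or through intervening accesses), and reports conflicting data accesses on different processors that are left unordered.

```python
from itertools import combinations

def find_data_races(accesses, order_edges):
    """accesses: name -> (processor, kind, location, is_sync); order_edges: (earlier, later)."""
    # Transitive closure of the po/do edges: reach[a] holds everything ordered after a.
    reach = {name: set() for name in accesses}
    for earlier, later in order_edges:
        reach[earlier].add(later)
    changed = True
    while changed:
        changed = False
        for a in accesses:
            extra = set()
            for b in reach[a]:
                extra |= reach[b]
            if not extra <= reach[a]:
                reach[a] |= extra
                changed = True
    races = []
    for a, b in combinations(accesses, 2):
        proc_a, kind_a, loc_a, sync_a = accesses[a]
        proc_b, kind_b, loc_b, sync_b = accesses[b]
        conflict = loc_a == loc_b and "write" in (kind_a, kind_b)
        ordered = b in reach[a] or a in reach[b]
        if conflict and proc_a != proc_b and not (sync_a or sync_b) and not ordered:
            races.append((a, b))
    return races

accesses = {
    "WriteA":    ("P1", "write", "A",    False),
    "WriteFlag": ("P1", "write", "Flag", True),   # identified as synchronization
    "ReadFlag":  ("P2", "read",  "Flag", True),   # identified as synchronization
    "ReadA":     ("P2", "read",  "A",    False),
}
po_and_do = [("WriteA", "WriteFlag"), ("WriteFlag", "ReadFlag"), ("ReadFlag", "ReadA")]
print(find_data_races(accesses, po_and_do))       # ordered through Flag
print(find_data_races(accesses, [("WriteA", "WriteFlag"), ("ReadFlag", "ReadA")]))
```

With the do edge from Write Flag to Read Flag in place, the two accesses to A are ordered and no race is reported; dropping that edge leaves them unordered.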

Optimizations for Synchronized Programs
• Intuition: many parallel programs have mixtures of “private” and “public” parts*
  – the “private” parts must be protected by synchronization (e.g., locks)
  – can we take advantage of synchronization to improve performance?
[Figure: blocks of READ/WRITE … READ/WRITE operations separated by SYNCH operations]
Example:
  Grab a lock
  Insert node into data structure
    • Essentially a “private” activity; reordering is ok
  Release the lock
    • Now we make it “public” to the other nodes
* Caveat: shared data is in fact always visible to other threads.

Optimizations for Synchronized Programs
• Exploit information about synchronization
• properly synchronized programs should yield the same result as on an SC machine
“Weak Ordering” (WO)
[Figure: blocks of READ/WRITE … READ/WRITE operations separated by SYNCH operations]
Between synchronization operations:
  • we can allow reordering of memory operations
  • (as long as intra-thread dependences are preserved)
Just before and just after synchronization operations:
  • thread must wait for all prior operations to complete

Intel’s MFENCE (Memory Fence) Operation
• An MFENCE operation enforces the ordering seen on the previous slide:
  – does not begin until all prior reads & writes from that thread have completed
  – no subsequent read or write from that thread can start until after it finishes
[Figure: blocks of READ/WRITE … READ/WRITE operations separated by MFENCE operations]
Balloon analogy: it is a twist in the balloon
  • no gas particles can pass through it
  (balloon image: wikiHow)
Good news: xchg does this implicitly!

ARM Processors
• ARM processors have a very relaxed consistency model
• ARM has some great examples in their programmer’s reference:
  – http://infocenter.arm.com/help/topic/com.arm.doc.genc007826/Barrier_Litmus_Tests_and_Cookbook_A08.pdf
• A great list regarding relaxed memory consistency in general:
  – http://www.cl.cam.ac.uk/~pes20/weakmemory/

Common Misconception about MFENCE
• MFENCE operations do NOT push values out to other threads
  – it is not a magic “make every thread up-to-date” operation
• Instead, they simply stall the thread that performs the MFENCE
[Figure: blocks of READ/WRITE … READ/WRITE operations separated by MFENCE operations; over time, particles numbered 11 to 15 appear in differing orders in Thread 0, Thread 1, Thread 2, and Thread 3]
MFENCE operations create partial orderings
  • that are observable across threads

Earlier (Broken) Example Revisited
Where exactly should we insert MFENCE operations to fix this?
  P0                      P1
  [1: Here?]              [4: Here?]
  A = 1                   x = Ready
  [2: Here?]              [5: Here?]
  Ready = 1               y = A
  [3: Here?]              [6: Here?]
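Candidate placements can be tried out in a toy model. The sketch below is an assumption-laden illustration, not a model of any particular processor: a thread's accesses may be performed in any order between fences but never across one, following the fence rules stated earlier, and every resulting order is interleaved against one shared memory. It assumes each thread touches a location at most once between fences. The FENCE marker and function names are hypothetical.

```python
from itertools import permutations, product

FENCE = ("fence", None, None)

def thread_orders(program):
    """Yield each order in which one thread's accesses may be performed:
    free within a fence-delimited segment, never across a fence."""
    segments, current = [], []
    for op in program:
        if op == FENCE:
            segments.append(current)
            current = []
        else:
            current.append(op)
    segments.append(current)
    for choice in product(*(permutations(seg) for seg in segments)):
        yield [op for seg in choice for op in seg]

def interleavings(p0, p1):
    """Yield every merge of p0 and p1 that keeps each list's own order."""
    if not p0 or not p1:
        yield list(p0) + list(p1)
        return
    for rest in interleavings(p0[1:], p1):
        yield [p0[0]] + rest
    for rest in interleavings(p0, p1[1:]):
        yield [p1[0]] + rest

def weak_outcomes(p0, p1, observed):
    """Collect the observed registers over all reorderings and interleavings."""
    outcomes = set()
    for order0 in thread_orders(p0):
        for order1 in thread_orders(p1):
            for schedule in interleavings(order0, order1):
                memory, regs = {}, {}
                for kind, target, source in schedule:
                    if kind == "store":
                        memory[target] = source
                    else:
                        regs[target] = memory.get(source, 0)
                outcomes.add(tuple(regs.get(name, 0) for name in observed))
    return outcomes

st_a, st_ready = ("store", "A", 1), ("store", "Ready", 1)
ld_x, ld_y = ("load", "x", "Ready"), ("load", "y", "A")

no_fences = weak_outcomes([st_a, st_ready], [ld_x, ld_y], ("x", "y"))
fence_2_only = weak_outcomes([st_a, FENCE, st_ready], [ld_x, ld_y], ("x", "y"))
fences_2_and_5 = weak_outcomes([st_a, FENCE, st_ready], [ld_x, FENCE, ld_y], ("x", "y"))
print((1, 0) in no_fences, (1, 0) in fence_2_only, (1, 0) in fences_2_and_5)
```

In this toy model the bad outcome (1,0) survives with no fences and with a fence only at position 2, and disappears once fences sit at positions 2 and 5. Other memory models may need fewer fences than this deliberately weak one.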

Exploiting Asymmetry in Synchronization: “Release Consistency”
• Lock operation: only gains (“acquires”) permission to access data
• Unlock operation: only gives away (“releases”) permission to access data
[Figure: under Weak Ordering (WO), labeled “Overly Conservative”, READ/WRITE blocks 1, 2, and 3 are fully separated by LOCK and UNLOCK. Under Release Consistency (RC), block 2 must follow ACQUIRE and blocks 1 and 2 must complete before RELEASE, but block 1 need not complete before ACQUIRE and block 3 need not wait for RELEASE.]

Intel’s Full Set of Fence Operations
• In addition to MFENCE, Intel also supports two other fence operations:
  – LFENCE: serializes only with respect to load operations (not stores!)
  – SFENCE: serializes only with respect to store operations (not loads!)
• Note: It does slightly more than this; see the spec for details:
  – Section 8.2.5 of “Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A: System Programming Guide, Part 1”
• In practice, you are most likely to use:
  – MFENCE
  – xchg

Take-Away Messages on Memory Consistency Models
• DON’T use only normal memory operations for synchronization
  – e.g., Peterson’s solution (from Synchronization #1 lecture)
    boolean want[2] = {false, false};
    int turn = 0;
    want[i] = true;
    turn = j;
    while (want[j] && turn == j)
      continue;
    … critical section …
    want[i] = false;
  Exercise for the reader: Where should we add fences (and which type) to fix this?
• DO use either explicit synchronization operations (e.g., xchg) or fences
    while (!xchg(&lock_available, 0))
      continue;
    … critical section …
    xchg(&lock_available, 1);

Summary: Relaxed Consistency
• Motivation:
  – obtain higher performance by allowing reordering of memory operations
    • (reordering is not allowed by sequential consistency)
• One cost is software complexity:
  – the programmer or compiler must insert synchronization
    • to ensure certain specific orderings when needed
• In practice:
  – complexities often encapsulated in libraries that provide intuitive primitives
    • e.g., lock/unlock, barriers (or lower-level primitives like fence)
• Relaxed models differ in which memory ordering constraints they ignore