Name: ID:

McGill University

ECSE 425 Computer Organization and Architecture, Fall 2009

FINAL EXAMINATION SOLUTIONS

9:00 am to 12:00 pm, December 11, 2009

Duration: 180 minutes

Question 1. Short Answers (35 points) There are 2 parts to this question.

1) Part 1: There are 10 sub-questions in this part (3 points each)

For each question below, provide a short answer in 1-2 sentences.

a) What is the difference between multithreading and simultaneous multithreading (SMT)?
Multithreading exploits TLP; it can be run on any machine (single or multiple issue, uniprocessor or multiprocessor). One thread issues in each clock cycle. SMT exploits both TLP and ILP; it must be run on a multiple-issue machine with dynamic scheduling. Multiple threads issue in each clock cycle.

b) What is the shared-memory multiprocessor model? Can it be applied to distributed-memory multiprocessor systems?
The shared-memory multiprocessor model uses a shared address space among all processors. It can be applied to physically distributed memory multiprocessor systems.

c) Name two major challenges in parallel processing using multiprocessors.
One must parallelize programs; the serial portion of the program becomes the bottleneck. The latency to remote memory is longer.

d) Why is cache coherency not an issue in a uniprocessor system but an issue in shared-memory multiprocessor systems?
In a uniprocessor, only 1 processor has access to the data and no other processors can modify it. In shared-memory multiprocessor systems, other processors can access and modify the data.

e) The bus-based broadcast snooping protocol serializes all the coherence traffic. Name an advantage and a disadvantage of this serialization.
Serializing preserves memory access order, such as RAW, WAW and WAR, among all processors. However, the latency can be longer, and the bus can become the bottleneck as the number of processors increases.

f) A challenge of multiprocessor systems is to build operations that appear atomic. Give an example of a sequence of instructions that can be used for this purpose.

try: ll   R1,0(R2)
     (some operation on R1)
     sc   R1,0(R2)
     beqz R1,try
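The load-linked / store-conditional retry loop above can be mimicked in a few lines of Python. This is only a sketch under stated assumptions: the `LLSCMemory` class, its `links` set, and the `interfere` hook are hypothetical illustration devices that model a single memory word, not any real ISA simulator.

```python
class LLSCMemory:
    """Toy one-word memory with per-CPU link flags (hypothetical model)."""

    def __init__(self, value=0):
        self.value = value
        self.links = set()  # CPUs holding a valid link on the word

    def load_linked(self, cpu):
        # ll: read the word and remember that this CPU is linked to it.
        self.links.add(cpu)
        return self.value

    def store_conditional(self, cpu, new_value):
        # sc: succeed (return 1) only if no other store broke the link.
        if cpu not in self.links:
            return 0
        self.value = new_value
        self.links.clear()  # a successful store breaks every link
        return 1


def atomic_increment(mem, cpu, interfere=None):
    """Retry ll / modify / sc until sc succeeds; return the attempt count."""
    attempts = 0
    while True:
        attempts += 1
        r1 = mem.load_linked(cpu)           # ll   R1,0(R2)
        if interfere and attempts == 1:
            interfere(mem)                  # another CPU sneaks in once
        r1 += 1                             # some operation on R1
        if mem.store_conditional(cpu, r1):  # sc   R1,0(R2)
            return attempts                 # beqz R1,try falls through


mem = LLSCMemory(5)
first = atomic_increment(mem, cpu=0)
second = atomic_increment(mem, cpu=0,
                          interfere=lambda m: atomic_increment(m, cpu=1))
```

In the second call the store by CPU 1 breaks CPU 0's link, so CPU 0's first `sc` fails and the loop retries, which is the behaviour that makes the sequence appear atomic.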

g) What is the difference between write allocate and no-write allocate?
In write allocate, the block must be transferred to the cache on a write miss, followed by a write-hit action (write back or write through). In no-write allocate, there is no block transfer to the cache on a write miss and the data is written directly to memory.

h) Why can caches with a virtual index and physical tag help reduce the hit time compared to physically addressed caches?
With a virtual index and physical tag, the cache read using the virtual index and the address translation can be done in parallel. In an all-physically addressed cache, the hardware must translate the address first, then read the cache using the translated (physical) index.

i) Give an example of a technique to reduce cache miss penalty.
1. Use multilevel caches.
2. Fetch the critical word first, then the rest of the block, or use early restart (fetch in order, but let the CPU continue with the instruction as soon as the needed word arrives, while fetching the rest of the block).
3. Prioritize read misses over writes by using a write buffer.

j) What is the difference between AMAT and CPU time? Which one is a more accurate performance measure for a computer system?
AMAT: average time it takes to access memory.
CPU time: average time it takes to run a sequence of instructions.
The CPU time includes the AMAT in its formula. CPU time is a more accurate performance measure, as it gives a more realistic picture of performance since not all instructions access memory.

Part 2 (5 points)

The current focus in computer design has shifted from getting more instructions per clock cycle on a single CPU (by having multiple pipelines with multiple issue) to having multiple CPUs (each CPU is either single issue or multiple issue). Does it mean that exploiting instruction-level parallelism is no longer useful? What factors do you think will influence the success of multiple-CPU architecture? Provide your answer in a short paragraph (no more than half a page).

Answer:

Exploiting ILP is still useful. What the shift means is that current techniques for exploiting ILP on a single processor are good enough, and it is not worthwhile to invent new techniques to exploit ILP because of diminishing returns (at least with the current technology).

The current focus is on new techniques to exploit TLP, where multiprocessors seem most suitable. (Each processor, however, can implement techniques for exploiting ILP within each thread.)

In current and future systems, both ILP and TLP are going to exist. The balance, however, is unclear and depends on the application. The application is a major factor that influences the success of an architecture.

Another important factor is the software. Multiple processors provide the platform for exploiting TLP, but the software must also be parallelized to take advantage of this platform before we see major multiprocessor successes.

Technology that affects the speed of interprocessor communication is also a factor.

Power consumption is likely to increase for multiple processors, so efficient power management is another factor.

Question 2. Multi-Processors (30 points) There are 3 parts to this question.

Part a) (10 points) Consider the write-back invalidate snooping protocol with 3 states: Invalid, Shared and Exclusive. List all bus requests to a shared block in a cache and show the corresponding state transitions in the finite-state machine for this protocol.

Answer

Shared, read miss -> Shared

Shared, write miss -> Invalid

Shared, invalidate -> Invalid
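The bus-side transitions can be written down as a small lookup table. In this sketch the three `Shared` rows come from the answer above; the `Exclusive` rows follow the usual textbook behaviour of a three-state write-back invalidate protocol and are an assumption here, as are the names `BUS_TRANSITIONS` and `snoop`.

```python
# Next state of a cache block after it observes a request on the bus.
BUS_TRANSITIONS = {
    ("Shared", "read miss"): "Shared",
    ("Shared", "write miss"): "Invalid",
    ("Shared", "invalidate"): "Invalid",
    # Assumed textbook behaviour, not part of the exam answer:
    ("Exclusive", "read miss"): "Shared",    # write block back, then share it
    ("Exclusive", "write miss"): "Invalid",  # write block back, then drop it
}


def snoop(state, bus_request):
    """Return the new state; unlisted pairs leave the state unchanged."""
    return BUS_TRANSITIONS.get((state, bus_request), state)


shared_rows = {req: snoop("Shared", req)
               for req in ("read miss", "write miss", "invalidate")}
```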

Part b) (10 points) Assume that words x1 and x2 are in the same cache block, which is in the Shared state in the caches of both processors P1 and P2. Assuming the following sequence of events, identify whether each event is a hit or a miss, and each miss as a true sharing miss or a false sharing miss. Any miss that would occur if the block size were one word is designated a true sharing miss.

Time  P1        P2
1     Read x1
2               Write x2
3     Write x1
4               Read x2
5     Write x2

Answers:

Time  P1        P2        Hit/Miss? True or false sharing miss? Why?
1     Read x1             Hit, since x1 is in the Shared state
2               Write x2  True sharing miss, since P1 also has x2 in the Shared state
3     Write x1            False sharing miss, since P1 also has x1 in the Shared state
4               Read x2   False sharing miss, since P1 has the block containing x2 in the Exclusive state even though it did not modify x2
5     Write x2            True sharing miss, since P2 also has x1 in the Shared state

Part c) (30 points)

Consider a 4-processor distributed shared-memory system. Each processor has a single direct-mapped cache that holds four blocks, each containing two words with addresses separated by 4. To simplify the illustration, the cache address tag contains the full address and each word shows only two hexadecimal characters, with the least significant word on the right. The cache states are denoted M, S, and I for Modified, Shared, and Invalid. The directory states are denoted DM, DS, and DI for Directory Modified, Directory Shared, and Directory Invalid. This simple directory protocol uses the messages given in Table 2c on the next page. Assume the cache contents of the four processors and the content of the main memory as shown in Figure 2c below. A CPU operation is of the form

P#: [

each read operation? For each operation, what is the sequence of messages passed on the bus? You can use the table on the following page to help you with the bus messages.

Note: The tags are in hexadecimal.

       P0                  P1                  P2                  P3
state  tag  data    state  tag  data    state  tag  data    state  tag  data
I      100  26 10   I      100  26 10   S      120  02 20   S      120  02 20
S      108  15 08   M      128  2D 68   S      108  15 08   I      128  43 30
M      110  F7 30   I      110  6F 10   I      110  6F 10   M      130  64 00
I      118  C2 10   S      118  3E 18   I      118  C2 10   I      118  40 28

Memory
address  state  Sharers  Data
100      DI              20 00
108      DS     P0,P2    15 08
110      DM     P0       6F 10
118      DS     P1       3E 18
120      DS     P2,P3    02 20
128      DM     P1       3D 28
130      DM     P3       01 30

P0: read 130
P3: write 130

P = requesting processor number, A = requested address, and D = data contents

Table 2c. Messages for a simple directory protocol.

To show the bus messages, use the following format:

Bus{message type, requesting processor, address, data}

Example: Bus{read miss, P0, 100, }

To show the contents in the cache of a processor, use the following format:

P#{state, tag, data}

Example: P3{S, 120, 0220}

To show the contents in the memory, use the following format:

M{state, [sharers], data}

Example: M{DS, [P0,P3], 0220}

Answers:

P0: read 130

Bus{data write back, 110, F730} sent by P0 to directory
M.110{DI, , F730}
Bus{read miss, P0, 130} sent by P0 to directory
Bus{fetch, 130} sent by directory to P3
P3.B2{S, 130, 6400}
Bus{data write back, 130, 6400} sent by P3 to directory
Bus{data value reply, 6400} sent by directory to P0
P0.B2{S, 130, 6400}; returns 00
M.130{DS, {P0,P3}, 6400}

P3: write 130

Question 3. Memory Hierarchy (30 points) There are 3 parts to this question.

Part a) (5 points) Draw the finite-state machine for a 2-bit local predictor using the saturating counter in the space below.

T0 = 11, predict taken
T1 = 10, predict taken
N1 = 01, predict not taken
N0 = 00, predict not taken
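A 2-bit saturating counter is short enough to sketch directly. The state encoding below (N0 = 00, N1 = 01, T1 = 10, T0 = 11) follows the list above; the class name `TwoBitPredictor` and the helper `run` are hypothetical names chosen for this illustration.

```python
class TwoBitPredictor:
    """2-bit saturating counter: 00=N0, 01=N1, 10=T1, 11=T0."""

    NAMES = {0: "N0", 1: "N1", 2: "T1", 3: "T0"}

    def __init__(self, state=0):
        self.state = state

    def predict(self):
        # The upper bit of the counter is the prediction.
        return self.state >= 2

    def update(self, taken):
        # Count up on taken, down on not taken, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def run(predictor, outcomes):
    """Return the prediction made before each actual branch outcome."""
    predictions = []
    for taken in outcomes:
        predictions.append(predictor.predict())
        predictor.update(taken)
    return predictions


p = TwoBitPredictor(state=0)
preds = run(p, [True, True, False, True])
final_state = TwoBitPredictor.NAMES[p.state]
```

Starting in N0, two taken branches are needed before the predictor switches to predicting taken, and a single not-taken outcome from T1 flips it back; that hysteresis is what the state diagram expresses.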

Part b) (15 points) Consider a Virtual Memory/Cache system with the following properties:

Virtual address size    64 bits, byte addressable
Physical address size   30 bits, byte addressable
Block size              32 bytes
Page size               64 kbytes
Total cache data size   32 kbytes
Cache associativity     4-way set associative
TLB associativity       1-way set associative
TLB size                1024 entries in total

Name the address fields and calculate the bit size of each field in the following figure.

Field  Name                 Bit size
a      TLB tag              38
b      TLB index            10
c      Page offset          16
d      Physical page table  14
e      Cache tag            17
f      Cache index          8
g      Block offset         5
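The field widths follow from the listed properties by taking base-2 logarithms. The sketch below recomputes them; the function name `address_fields` and its parameter names are hypothetical, and field d is computed as the physical page number (physical address bits minus page-offset bits), which is an interpretation of the 14-bit field in the figure.

```python
def log2_exact(n):
    """Base-2 logarithm of a power of two, as an integer."""
    assert n > 0 and n & (n - 1) == 0, "expected a power of two"
    return n.bit_length() - 1


def address_fields(va_bits, pa_bits, block_bytes, page_bytes,
                   cache_bytes, cache_ways, tlb_entries, tlb_ways):
    page_offset = log2_exact(page_bytes)
    tlb_index = log2_exact(tlb_entries // tlb_ways)
    block_offset = log2_exact(block_bytes)
    cache_sets = cache_bytes // (block_bytes * cache_ways)
    cache_index = log2_exact(cache_sets)
    return {
        "a": va_bits - page_offset - tlb_index,      # TLB tag
        "b": tlb_index,                              # TLB index
        "c": page_offset,                            # page offset
        "d": pa_bits - page_offset,                  # physical page number
        "e": pa_bits - cache_index - block_offset,   # cache tag
        "f": cache_index,                            # cache index
        "g": block_offset,                           # block offset
    }


fields = address_fields(va_bits=64, pa_bits=30, block_bytes=32,
                        page_bytes=64 * 1024, cache_bytes=32 * 1024,
                        cache_ways=4, tlb_entries=1024, tlb_ways=1)
```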

[Figure for Part a: state diagram of the 2-bit saturating counter; the edges between states are labelled Taken and Not Taken.]

Part c) (10 points) Consider a MIPS machine with a byte-addressable main memory and the following specifications:

Data cache size   1 kB
Block size        64 B

The following C program representing a dot product (with no optimizations) is executed on this computer.

int i;
int a[256], b[256];
int c;
for ( i = 0; i < 256; i++ ) {
    c = a[i] * b[i] + c;
}

Assume that the size of each array element is one word of 4 bytes and the elements are stored in consecutive memory locations in array index order. Array a starts at address 0x0000, b at 0x0400. What is the miss rate given a 2-way set-associative cache? Show your calculations.

Answer:

Given that a[i] and b[i] will never replace each other, and given that we read every element once and in order, the number of misses will correspond to the number of blocks required to hold arrays a and b. Array a requires 16 blocks to hold its 256 words, therefore 16 blocks will be transferred from memory to the cache throughout the loop. The same applies to array b.

Throughout the loop, 32 blocks will be transferred from memory to the cache, and there will be 512 memory accesses. That makes a total of 32 misses out of 512 memory accesses and gives a miss rate of 6.25%.
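The count of 32 misses can be checked with a tiny cache simulation. This is a sketch under stated assumptions: LRU replacement is assumed (the question does not name a policy), `c` and `i` are assumed to stay in registers so that only the reads of a[i] and b[i] reach the cache, and `count_misses` is a hypothetical helper name.

```python
def count_misses(addresses, cache_bytes=1024, block_bytes=64, ways=2):
    """Count misses in a set-associative cache with LRU replacement."""
    num_sets = cache_bytes // (block_bytes * ways)
    # Each set is a list of resident block numbers, least recently used first.
    sets = [[] for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        block = addr // block_bytes
        lines = sets[block % num_sets]
        if block in lines:
            lines.remove(block)      # hit: refreshed below as most recent
        else:
            misses += 1
            if len(lines) == ways:
                lines.pop(0)         # evict the least recently used block
        lines.append(block)
    return misses


A_BASE, B_BASE, WORD_BYTES = 0x0000, 0x0400, 4

# Each loop iteration reads a[i] and then b[i].
trace = []
for i in range(256):
    trace.append(A_BASE + WORD_BYTES * i)
    trace.append(B_BASE + WORD_BYTES * i)

misses = count_misses(trace)
miss_rate = misses / len(trace)
```

Blocks of a and b with the same index map to the same set, so the two ways are exactly what keeps them from evicting each other.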

Question 4. Pipelining and Instruction-Level Parallelism (55 points) There are 4 parts to this question.

For all 4 parts, use the following snippet of code:

loop: L.D    F0,0(R1)
      ADD.D  F0,F0,F4
      L.D    F2,0(R2)
      MUL.D  F2,F0,F2
      S.D    F2,0(R2)
      DADDUI R1,R1,#-8
      DADDUI R2,R2,#-8
      BNEZ   R1,loop

Also, use the following execution time for each unit:

Functional unit   Cycles to execute
FP add            3
FP mult           6
Load/store        2
Int ALU           1

Part a) (10 points) Identify all hazards in the snippet of code.

Potential data hazards:

RAW:
L.D F0,0(R1) -> ADD.D F0,F0,F4
ADD.D F0,F0,F4 -> MUL.D F2,F0,F2
L.D F2,0(R2) -> MUL.D F2,F0,F2
MUL.D F2,F0,F2 -> S.D F2,0(R2)
DADDUI R1,R1,#-8 -> BNEZ R1,loop

WAW:
L.D F2,0(R2) -> MUL.D F2,F0,F2
L.D F0,0(R1) -> ADD.D F0,F0,F4

WAR:
L.D F0,0(R1) -> DADDUI R1,R1,#-8
L.D F2,0(R2) -> DADDUI R2,R2,#-8
S.D F2,0(R2) -> DADDUI R2,R2,#-8

Part b) (15 points)

Part b.i) Unroll the loop twice (2 iterations per new loop) and schedule it on a 5-issue VLIW machine using the provided table.

loop: L.D    F0,0(R1)
      ADD.D  F0,F0,F4
      L.D    F2,0(R2)
      MUL.D  F2,F0,F2
      S.D    F2,0(R2)
      DADDUI R1,R1,#-8
      DADDUI R2,R2,#-8
      BNEZ   R1,loop

Clock  Memory          Memory          FP              FP              Integer operation /
cycle  reference 1     reference 2     operation 1     operation 2     branch
1      L.D F0,0(R1)    L.D F6,-8(R1)
2      L.D F2,0(R2)    L.D F8,-8(R2)
3
4                                      ADD.D F0,F0,F4  ADD.D F6,F6,F4
5
6
7                                      MUL.D F2,F0,F2  MUL.D F8,F6,F8
8
9                                                                      DADDUI R1,R1,#-16
10                                                                     DADDUI R2,R2,#-16
11                                                                     BNEZ R1,loop (with delay slot)
12     S.D F2,16(R2)   S.D F8,8(R2)                                    BNEZ R1,loop (without delay slot)

Part b.ii) In this VLIW machine, at least how many times do you need to unroll the loop to get the maximum efficiency?

Answer: Without getting into too many complexities, we can unroll 6 times easily to get 6 iterations per 14 CC. We can also unroll 10 times to get 10 iterations per 19 CC by performing another 4 iterations while idling. The true maximum efficiency is reached when one of the units is always busy. Given enough registers, unrolling 32 times will give the best efficiency, where every memory reference slot will always be busy.

Assume an infinite loop, what is the average number of clock cycles per iteration?

Answer: 6 clock cycles per iteration for part b.i, and 1.5 clock cycles per iteration at maximum efficiency.

Part c) (15 points)

Part c.i) Consider a single-issue dynamically scheduled machine without hardware speculation. Assume that the functional units are pipelined and that all memory accesses hit the cache. There is a memory unit with 5 load buffers and 5 store buffers. Each load or store takes 2 cycles to execute, 1 to calculate the address, and 1 to load/store the data. There are dedicated integer functional units for effective address calculation and branch condition evaluation. The other functional units are described in the following table.

Func. unit type   Number of func. units   Number of reservation stations
Integer ALU       1                       5
FP adder          1                       3
FP multiplier     1                       2
Load              1                       5
Store             1                       5

Now dynamically schedule 2 iterations of the original loop on this machine without speculation. Show the clock cycle number of each stage of the dynamically scheduled code in the table below. Assume at least one cycle delay between successive steps of every instruction execution sequence (issue, execution start, write back).

Instruction  Operands     Issue   Execution Start  Write Back
L.D          F0,0(R1)     1       2                4
ADD.D        F0,F0,F4     2       5                8
L.D          F2,0(R2)     3       4                6
MUL.D        F2,F0,F2     4       9                15
S.D          F2,0(R2)     5       16*              17
DADDUI       R1,R1,#-8    6       7                9
DADDUI       R2,R2,#-8    7       8                10
BNEZ         R1,loop      8       10               11
L.D          F0,0(R1)     12 (1)  13               16
ADD.D        F0,F0,F4     13      17               20
L.D          F2,0(R2)     14      16               18
MUL.D        F2,F0,F2     15      21               27
S.D          F2,0(R2)     16      28               29
DADDUI       R1,R1,#-8    17      18               19
DADDUI       R2,R2,#-8    18      19               21
BNEZ         R1,loop      19      20               22

* S.D here can calculate its address while waiting for F2.
(1) L.D must wait for the branch to return the branch decision.

Part c.ii) In this dynamically scheduled code, assuming an infinite loop, what is the average number of clock cycles per original iteration?

Answer: 11 CC per iteration, given that the second iteration can only start at cycle 12.

Part d) (15 points) Now we use the dynamic scheduling hardware to build a speculative machine that can issue and commit 2 instructions per cycle. Again assume that the functional units are pipelined and that all memory accesses hit the cache. There is a memory unit with 8 load buffers. The reorder buffer has 50 entries. The reorder buffer can function as a store buffer, so there are no separate store buffers. Each load or store takes 2 cycles to execute, 1 to calculate the address, and 1 to load/store the data. Assume a branch predictor with a 0% misprediction rate. Assume there are dedicated integer functional units for effective address calculation and branch condition evaluation. The other functional units are described in the following table.

Functional unit type   Number of functional units   Number of reservation stations per functional unit
Integer ALU            2                            4
FP adder               2                            3
FP multiplier          2                            2
Load                   2                            4

Part d.i) Schedule 2 iterations of the original code on this speculative machine in the table below. Assume at least one cycle delay between successive steps of every instruction execution sequence (issue, execution start, write back, commit). Assume two common data buses.

Instruction  Operands     Issue  Execution Start  Write Back  Commit
L.D          F0,0(R1)     1      2                4           5
ADD.D        F0,F0,F4     1      5                8           9
L.D          F2,0(R2)     2      3                5           9
MUL.D        F2,F0,F2     2      9                15          16
S.D          F2,0(R2)     3      4                5           16
DADDUI       R1,R1,#-8    3      4                6           17
DADDUI       R2,R2,#-8    4      5                6           17
BNEZ         R1,loop      4      7                8           18
L.D          F0,0(R1)     5      6                9           18
ADD.D        F0,F0,F4     5      10               12          19
L.D          F2,0(R2)     6      7                9           19
MUL.D        F2,F0,F2     6      13               19          20
S.D          F2,0(R2)     7      8                10          20
DADDUI       R1,R1,#-8    7      8                10          21
DADDUI       R2,R2,#-8    8      9                11          21
BNEZ         R1,loop      8      11               12          22

Part d.ii) Assuming an infinite loop and no ROB overflow, what is the average number of clock cycles per iteration on this speculative machine? Compare with parts b and c.

Answer: It takes 4 CC per original iteration. The VLIW performs best; however, it requires software scheduling. Note also that the VLIW is 5-issue whereas the speculative machine is double-issue. Also, this double-issue speculative machine performs two loops in less than half the cycles required by the non-speculative machine in part c.

Question 5. Performance (30 points) There are 3 parts to this question.

Part a) Reliability and Amdahl's law. (10 points) Consider a system in which the components have the following MTTF (in hours):

CPU            1,000,000
Hard disk        200,000
Memory           500,000
Power supply     100,000

Part a.i) Assume that if any component fails, then the system fails. What is the system MTTF?

Part a.ii) You buy an additional hard drive and bring the total hard disk MTTF to 600,000 hours, which provides a 3 times improvement. Using Amdahl's law, compute the improvement in the whole system reliability.

The hard disk contributes ___% of the total MTTF.
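The worked numbers for this part did not survive in the text, so the following is only a sketch of one standard way to carry out the calculation. It assumes independent components with constant failure rates, so that failure rates (1/MTTF) add; the variable and function names are hypothetical.

```python
mttf_hours = {
    "CPU": 1_000_000,
    "Hard disk": 200_000,
    "Memory": 500_000,
    "Power supply": 100_000,
}


def system_mttf(components):
    """System MTTF when any single component failure fails the system."""
    total_failure_rate = sum(1 / hours for hours in components.values())
    return 1 / total_failure_rate


# Part a.i: overall MTTF of the original system.
base_mttf = system_mttf(mttf_hours)

# Share of the total failure rate that is due to the hard disk.
disk_fraction = (1 / mttf_hours["Hard disk"]) * base_mttf

# Part a.ii: Amdahl's law with the disk portion improved by a factor of 3.
amdahl_improvement = 1 / ((1 - disk_fraction) + disk_fraction / 3)

# Cross-check by recomputing the system MTTF with the improved disk.
improved = dict(mttf_hours, **{"Hard disk": 600_000})
direct_improvement = system_mttf(improved) / base_mttf
```

Under these assumptions the Amdahl estimate and the direct recomputation agree, because the disk's share is measured as a fraction of the total failure rate.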

Part b) Cache performance (10 points) Consider a memory system with a latency of 60 clocks. The transfer rate is 4 bytes per clock cycle and 30% of the transfers are dirty. There are 32 bytes per block and 25% of the instructions are data transfer instructions. There is no write buffer. In addition, the TLB takes 40 clock cycles on a TLB miss. A TLB does not slow down a cache hit. For the TLB, make the simplifying assumption that 0.5% of all references are not found in the TLB, either when addresses come directly from the CPU or when addresses come from cache misses.

If the base CPI with a perfect memory system is 1.5, what is the CPI for a 16 KB two-way set-associative unified cache using write back with a cache miss rate of 1.6%?

Compute the effective CPI for this cache with the real TLB.

Answers:

Since this is a unified cache, both instructions and data share the same cache and have the same transfer rate on block replacements.

Virtually addressed cache: An address from the CPU will go through the cache first, and only on a cache miss does it go through the TLB.

Physically addressed cache: An address from the CPU will go through the TLB first, then through the cache.

Part c) Branch prediction performance (10 points)

Suppose we have a deeply pipelined processor, for which we implement a branch-target buffer for the conditional branches and branch folding for the unconditional branches.

For the conditional branches, assume that the misprediction penalty is always 4 cycles and the buffer miss penalty is always 3 cycles. Assume a 90% branch-target buffer hit rate, 90% target address accuracy, and 15% conditional branch frequency.

For branch folding that stores the target instructions of the unconditional branches, assume also a 90% hit rate and 5% unconditional branch frequency. Assume also that the hit target instruction can bypass the fetch stage and start immediately in the decode stage.

How much faster is this processor versus a processor that has a fixed 2-cycle branch penalty for both unconditional and conditional branches? Assume a base CPI without branch stalls of 1.

Answer:

CPI of the deeply pipelined processor, assuming that the bypassing only happens for unconditional branches:

The deeply pipelined processor is 1.31 times faster than the fixed 2-cycle branch processor.
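The intermediate CPI expressions are missing from the text. The sketch below is one set of assumptions that reproduces the stated factor of 1.31: a branch-folding hit is taken to save one cycle (the target instruction skips fetch), and a branch-folding miss is taken to cost the same 3-cycle buffer miss penalty as the branch-target buffer. Both are assumptions, as are the variable names.

```python
BASE_CPI = 1.0
COND_FREQ = 0.15
UNCOND_FREQ = 0.05

BTB_HIT_RATE = 0.90
BTB_ACCURACY = 0.90
MISPREDICT_PENALTY = 4
BUFFER_MISS_PENALTY = 3

FOLD_HIT_RATE = 0.90
FOLD_HIT_GAIN = 1  # assumed: cycles saved when the folded target is used

# Conditional branches: pay on a wrong prediction or on a buffer miss.
cond_stalls = COND_FREQ * (
    BTB_HIT_RATE * (1 - BTB_ACCURACY) * MISPREDICT_PENALTY
    + (1 - BTB_HIT_RATE) * BUFFER_MISS_PENALTY
)

# Unconditional branches: gain a cycle on a hit, pay the penalty on a miss.
uncond_stalls = UNCOND_FREQ * (
    FOLD_HIT_RATE * (-FOLD_HIT_GAIN)
    + (1 - FOLD_HIT_RATE) * BUFFER_MISS_PENALTY
)

cpi_deep = BASE_CPI + cond_stalls + uncond_stalls

# Reference machine: fixed 2-cycle penalty on every branch.
cpi_fixed = BASE_CPI + (COND_FREQ + UNCOND_FREQ) * 2

speedup = cpi_fixed / cpi_deep
```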
