  • Name: ID:

    Page 1 of 14

    McGill University

    ECSE 425 Computer Organization and Architecture, Fall 2009

    FINAL EXAMINATION SOLUTIONS

    9:00 am - 12:00 pm, December 11, 2009

    Duration: 180 minutes

    Question 1. Short Answers (35 points). There are 2 parts to this question.

    1) Part 1: There are 10 sub-questions in this part (3 points each)

    For each question below, provide a short answer in 1-2 sentences.

    a) What is the difference between multithreading and simultaneous multithreading (SMT)?
       Multithreading exploits TLP; it can be run on any machine (single or multiple issue, uni- or multiprocessor). One thread at each clock cycle. SMT exploits both TLP and ILP; it must be run on a multiple-issue machine with dynamic scheduling. Multiple threads at each CC.

    b) What is the shared-memory multiprocessor model? Can it be applied to distributed-memory multiprocessor systems?
       The shared-memory multiprocessor model uses a shared address space among all processors. It can be applied to physically distributed-memory multiprocessor systems.

    c) Name two major challenges in parallel processing using multiprocessors.
       One must parallelize programs; the serial portion of the program becomes the bottleneck. The latency to remote memory is longer.

    d) Why is cache coherency not an issue in a uniprocessor system but an issue in shared-memory multiprocessor systems?
       In a uniprocessor, only 1 processor has access to the data and no other processors can modify it. In shared-memory multiprocessor systems, other processors can access and modify the data.

    e) The bus-based broadcast snooping protocol serializes all the coherence traffic. Name an advantage and a disadvantage of this serialization.
       Serializing preserves memory access order, such as RAW, WAW, and WAR, among all processors. However, the latency can be longer, and the bus can become the bottleneck with an increasing number of processors.

    f) A challenge of multiprocessor systems is to build operations that appear atomic. Give an example of a sequence of instructions that can be used for this purpose.
       try: ll   R1,0(R2)
            some operation on R1
            sc   R1,0(R2)
            beqz R1,try

    g) What is the difference between write allocate and no-write allocate?
       In write allocate, the block must be transferred to the cache on a write miss, followed by a write-hit action (write back or write through). In no-write allocate, there is no block transfer to the cache on a write miss and the data is written directly to memory.

    h) Why can caches with virtual index and physical tag help reduce the hit time compared to physically addressed caches?
       With a virtual index and physical tag, the cache read using the virtual index and the address translation can be done in parallel. In an all-physically-addressed cache, the hardware must translate the address first, then read the cache using the translated (physical) index.

    i) Give an example of a technique to reduce cache miss penalty.
       1. Use multilevel caches.
       2. Fetch the critical word first, then the rest of the block, or use early restart (fetch in order but continue the CPU instruction as soon as the needed word arrives, while fetching the rest of the block).
       3. Prioritize read misses over writes by using a write buffer.

    j) What is the difference between AMAT and CPU time? Which one is a more accurate performance measure for a computer system?
       AMAT: average time it takes to access memory.
       CPU time: average time it takes to run a sequence of instructions. The CPU time includes the AMAT in its formula.
       CPU time is a more accurate performance measure, as it gives more realistic performance since not all instructions access memory.

    Part 2 (5 points)

    The current focus in computer design has shifted from getting more instructions per clock cycle on a single CPU (by having multiple pipelines with multiple issues) to having multiple CPUs (each CPU is either single issue or multiple issue). Does it mean that exploiting instruction-level parallelism is no longer useful? What factors do you think will influence the success of multiple-CPU architecture? Provide your answer in a short paragraph (no more than half a page).

    Answer:

    Exploiting ILP is still useful. What the shift means is that current techniques for exploiting ILP on a single processor are good enough, and it is not worthwhile to invent new techniques to exploit ILP because of diminishing returns (at least with the current technology).

    The current focus is on new techniques to exploit TLP, where multiprocessors seem most suitable. (Each processor, however, can implement techniques for exploiting ILP within each thread.)

    In current and future systems, both ILP and TLP are going to exist. The balance, however, is unclear and depends on the application. The application is a major factor that influences the success of an architecture.

    Another important factor is the software. Multiple processors provide the platform for exploiting TLP, but the software must also be parallelized to take advantage of this platform before we see major multiprocessor successes.

    Technology that affects the speed of interprocessor communication is also a factor.

    Power consumption is likely to increase for multiple processors, so efficient power management is another factor.

    Question 2. Multi-Processors (30 points). There are 3 parts to this question.

    Part a) (10 points) Consider the write-back invalidate snooping protocol with 3 states: Invalid, Shared and Exclusive. List all bus requests to a shared block in a cache and show the corresponding state transitions in the finite state machine for this protocol.

    Answer

    Shared, read miss -> Shared

    Shared, write miss -> Invalid

    Shared, invalidate -> Invalid
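The three transitions listed in the answer can be written down as a small lookup table. This is only a sketch: the names `SHARED_STATE_BUS_TRANSITIONS` and `next_state_from_shared` are hypothetical, and the table covers just the bus-side requests seen by a block that is currently Shared, not the full protocol.

```python
# Hypothetical names; the three rows are exactly the transitions in the answer.
SHARED_STATE_BUS_TRANSITIONS = {
    "read miss": "Shared",    # another cache reads the block: stay Shared
    "write miss": "Invalid",  # another cache wants to write: drop our copy
    "invalidate": "Invalid",  # another cache upgrades its copy: drop ours
}


def next_state_from_shared(bus_request: str) -> str:
    """Return the new state of a Shared block after snooping a bus request."""
    return SHARED_STATE_BUS_TRANSITIONS[bus_request]


if __name__ == "__main__":
    for request in SHARED_STATE_BUS_TRANSITIONS:
        print(f"Shared, {request} -> {next_state_from_shared(request)}")
```

A table keyed by request keeps the sketch close to how the finite state machine is usually drawn: one outgoing edge per snooped bus request.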

    Part b) (10 points) Assume that words x1 and x2 are in the same cache block, which is in the Shared state in the caches of both processors P1 and P2. Assuming the following sequence of events, identify if each event is a hit or a miss, and each miss as a true sharing miss or a false sharing miss. Any miss that would occur if the block size were one word is designated a true sharing miss.

    Time  P1        P2
    1     Read x1
    2               Write x2
    3     Write x1
    4               Read x2
    5     Write x2

    Answers:

    Time  P1        P2        Hit/Miss? True or false sharing miss? Why?
    1     Read x1             Hit, since x1 is in the Shared state
    2               Write x2  True sharing miss, since P1 also has x2 in the Shared state
    3     Write x1            False sharing miss, since P1 also has x1 in the Shared state
    4               Read x2   False sharing miss, since P1 has the block containing x2 in the Exclusive state even though it did not modify x2
    5     Write x2            True sharing miss, since P2 also has x1 in the Shared state

    Part c) (30 points)

    Consider a 4-processor distributed shared-memory system. Each processor has a single direct-mapped cache that holds four blocks, each containing two words with addresses separated by 4. To simplify the illustration, the cache address tag contains the full address and each word shows only two hexadecimal characters, with the least significant word on the right. The cache states are denoted M, S, and I for Modified, Shared, and Invalid. The directory states are denoted DM, DS, and DI for Directory Modified, Directory Shared, and Directory Invalid. This simple directory protocol uses messages given in Table 2c on the next page. Assume the cache contents of the four processors and the content of the main memory as shown in Figure 2c below. A CPU operation is of the form

    P#:[

    each read operation? For each operation, what is the sequence of messages passed on the bus? You can use the table on the following page to help you with the bus messages.

    Note: The tags are in hexadecimal

    P0                   P1                   P2                   P3
    state  tag  data     state  tag  data     state  tag  data     state  tag  data
    I      100  26 10    I      100  26 10    S      120  02 20    S      120  02 20
    S      108  15 08    M      128  2D 68    S      108  15 08    I      128  43 30
    M      110  F7 30    I      110  6F 10    I      110  6F 10    M      130  64 00
    I      118  C2 10    S      118  3E 18    I      118  C2 10    I      118  40 28

    Memory
    address  state  Sharers  Data
    100      DI              20 00
    108      DS     P0,P2    15 08
    110      DM     P0       6F 10
    118      DS     P1       3E 18
    120      DS     P2,P3    02 20
    128      DM     P1       3D 28
    130      DM     P3       01 30

    P0: read 130
    P3: write 130

    P = requesting processor number, A = requested address, and D = data contents

    Table 2c. Messages for a simple directory protocol.

    To show the bus messages, use the following format:

    Bus {message type, requesting processor, address, data}

    Example: Bus {read miss, P0, 100, }

    To show the contents in the cache of a processor, use the following format:

    P# {state, tag, data}

    Example: P3 {S, 120, 02 20}

    To show the contents in the memory, use the following format:

    M {state, [sharers], data}

    Example: M {DS, [P0, P3], 02 20}

    Answers:

    P0: read 130

    Bus {data write back, 110, F7 30} sent by P0 to directory
    M.110 {DI, , F7 30}
    Bus {read miss, P0, 130} sent by P0 to directory
    Bus {fetch, 130} sent by directory to P3
    P3.B2 {S, 130, 64 00}
    Bus {data write back, 130, 64 00} sent by P3 to directory
    Bus {data value reply, 64 00} sent by directory to P0
    P0.B2 {S, 130, 64 00}; returns 00
    M.130 {DS, {P0, P3}, 64 00}

    P3: write 130

    Question 3. Memory Hierarchy (30 points). There are 3 parts to this question.

    Part a) (5 points) Draw the finite state machine for a 2-bit local predictor using the saturating counter in the space below.

    T0 = 11  taken
    T1 = 10  taken
    N1 = 01  not taken
    N0 = 00  not taken
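The four states above can be exercised with a short simulation. This is a minimal sketch that assumes the textbook saturating-counter behaviour (count up on a taken branch, down on a not-taken branch, clamped to 00..11); the function names are hypothetical and the drawn state diagram is not reproduced in this text.

```python
# State encoding from the answer: 11 = T0, 10 = T1, 01 = N1, 00 = N0.
STATE_NAMES = {0b11: "T0", 0b10: "T1", 0b01: "N1", 0b00: "N0"}


def update(counter: int, taken: bool) -> int:
    """Assumed saturating update: +1 on taken, -1 on not taken, clamped to 0..3."""
    if taken:
        return min(counter + 1, 0b11)
    return max(counter - 1, 0b00)


def predict_taken(counter: int) -> bool:
    """States 10 and 11 predict taken; states 00 and 01 predict not taken."""
    return counter >= 0b10


def run(outcomes, counter=0b00):
    """Feed branch outcomes through the predictor; return (final state, correct count)."""
    correct = 0
    for taken in outcomes:
        if predict_taken(counter) == taken:
            correct += 1
        counter = update(counter, taken)
    return counter, correct


if __name__ == "__main__":
    final_state, correct = run([True, True, True, True])
    print(STATE_NAMES[final_state], correct)
```

Starting from N0, the predictor needs two taken outcomes before it starts predicting taken, which is the hysteresis the 2-bit scheme is meant to provide.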

    Part b) (15 points) Consider a Virtual Memory/Cache system with the following properties:

    Virtual address size    64 bits, byte addressable
    Physical address size   30 bits, byte addressable
    Block size              32 bytes
    Page size               64 kbytes
    Total cache data size   32 kbytes
    Cache associativity     4-way set associative
    TLB associativity       1-way set associative
    TLB size                1024 entries in total

    Name the address fields and calculate the bit size of each field in the following figure.

    Field  Name                  Bit size
    a      TLB tag               38
    b      TLB index             10
    c      Page offset           16
    d      Physical page number  14
    e      Cache tag             17
    f      Cache index           8
    g      Block offset          5
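The bit sizes in the answer follow from the system parameters by simple subtraction. The sketch below recomputes them; the function name `address_fields` and its parameter names are hypothetical, and it assumes every size is a power of two.

```python
def log2_exact(value: int) -> int:
    """Return log2 of a power of two as an integer."""
    assert value > 0 and value & (value - 1) == 0, "expected a power of two"
    return value.bit_length() - 1


def address_fields(virtual_bits=64, physical_bits=30, block_bytes=32,
                   page_bytes=64 * 1024, cache_bytes=32 * 1024,
                   cache_ways=4, tlb_entries=1024, tlb_ways=1):
    """Compute the width in bits of each address field (a to g in the figure)."""
    page_offset = log2_exact(page_bytes)
    tlb_index = log2_exact(tlb_entries // tlb_ways)
    block_offset = log2_exact(block_bytes)
    cache_sets = cache_bytes // (block_bytes * cache_ways)
    cache_index = log2_exact(cache_sets)
    return {
        "TLB tag": virtual_bits - page_offset - tlb_index,
        "TLB index": tlb_index,
        "Page offset": page_offset,
        "Physical page number": physical_bits - page_offset,
        "Cache tag": physical_bits - cache_index - block_offset,
        "Cache index": cache_index,
        "Block offset": block_offset,
    }


if __name__ == "__main__":
    for name, bits in address_fields().items():
        print(f"{name}: {bits}")
```

With the default arguments this reproduces the widths 38, 10, 16, 14, 17, 8 and 5 given in the answer.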

    Part c) (10 points) Consider a MIPS machine with a byte-addressable main memory and the following specifications:

    Data cache size   1 kB
    Block size        64 B

    The following C program representing a dot product (with no optimizations) is executed on this computer.

        int i;
        int a[256], b[256];
        int c;
        for ( i = 0; i < 256; i++ ) {
            c = a[i] * b[i] + c;
        }

    Assume that the size of each array element is one word of size 4 bytes and the elements are stored in consecutive memory locations in array index order. Array a starts at address 0x0000, b at 0x0400. What is the miss rate given a 2-way set associative cache? Show your calculations.

    Answer:

    Given that a[i] and b[i] will never replace each other, and given that we read every element once and in order, the number of misses will correspond to the number of blocks required to hold arrays a and b. Array a requires 16 blocks to hold its 256 words, therefore 16 blocks will be transferred from memory to the cache throughout the loop. The same applies to array b.

    Throughout the loop, 32 blocks will be transferred from memory to the cache, and there will be 512 memory accesses. That makes a total of 32 misses out of 512 memory accesses and gives a miss rate of 6.25%.
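The count of 32 misses can be checked with a small cache simulation. This is a sketch under stated assumptions: LRU replacement (the problem does not name a policy), only the loads of a[i] and b[i] are counted (matching the 512 accesses in the answer), and the name `simulate_dot_product` is hypothetical.

```python
from collections import OrderedDict


def simulate_dot_product(cache_bytes=1024, block_bytes=64, ways=2,
                         a_base=0x0000, b_base=0x0400, elements=256):
    """Count misses for reading a[i] then b[i], i = 0..elements-1, with 4-byte words."""
    num_sets = cache_bytes // (block_bytes * ways)
    # One OrderedDict per set; insertion order doubles as LRU order (assumed policy).
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = accesses = 0
    for i in range(elements):
        for address in (a_base + 4 * i, b_base + 4 * i):
            accesses += 1
            block = address // block_bytes
            cache_set = sets[block % num_sets]
            if block in cache_set:
                cache_set.move_to_end(block)       # hit: mark most recently used
            else:
                misses += 1
                if len(cache_set) >= ways:
                    cache_set.popitem(last=False)  # evict least recently used
                cache_set[block] = True
    return misses, accesses


if __name__ == "__main__":
    misses, accesses = simulate_dot_product()
    print(misses, accesses, f"{100 * misses / accesses:.2f}%")
```

Setting `ways=1` in the same sketch makes a[i] and b[i] map to the same block frame, so every access misses; that contrast is why the 2-way organization matters for this address layout.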

    Question 4. Pipelining and Instruction-Level Parallelism (55 points). There are 4 parts to this question.

    For all 4 parts, use the following snippet of code

    loop: L.D    F0,0(R1)
          ADD.D  F0,F0,F4
          L.D    F2,0(R2)
          MUL.D  F2,F0,F2
          S.D    F2,0(R2)
          DADDUI R1,R1,#-8
          DADDUI R2,R2,#-8
          BNEZ   R1,loop

    Also, use the following execution time for each unit:

    Functional unit   Cycles to execute
    FP add            3
    FP mult           6
    Load/store        2
    Int ALU           1

    Part a) (10 points) Identify all hazards in the snippet of code.

    Potential data hazards:

    RAW:
      L.D F0,0(R1)     -> ADD.D F0,F0,F4
      ADD.D F0,F0,F4   -> MUL.D F2,F0,F2
      L.D F2,0(R2)     -> MUL.D F2,F0,F2
      MUL.D F2,F0,F2   -> S.D F2,0(R2)
      DADDUI R1,R1,#-8 -> BNEZ R1,loop

    WAW:
      L.D F2,0(R2)     -> MUL.D F2,F0,F2
      L.D F0,0(R1)     -> ADD.D F0,F0,F4

    WAR:
      L.D F0,0(R1)     -> DADDUI R1,R1,#-8
      L.D F2,0(R2)     -> DADDUI R2,R2,#-8
      S.D F2,0(R2)     -> DADDUI R2,R2,#-8

    Part b) (15 points)

    Part b.i) Unroll the loop twice (2 iterations per new loop) and schedule it on a 5-issue VLIW machine using the provided table.

    loop: L.D    F0,0(R1)
          ADD.D  F0,F0,F4
          L.D    F2,0(R2)
          MUL.D  F2,F0,F2
          S.D    F2,0(R2)
          DADDUI R1,R1,#-8
          DADDUI R2,R2,#-8
          BNEZ   R1,loop

    Clock  Memory           Memory           FP operation 1   FP operation 2   Integer operation/branch
    cycle  reference 1      reference 2
    1      L.D F0,0(R1)     L.D F6,-8(R1)
    2      L.D F2,0(R2)     L.D F8,-8(R2)
    3
    4                                        ADD.D F0,F0,F4   ADD.D F6,F6,F4
    5
    6
    7                                        MUL.D F2,F0,F2   MUL.D F8,F6,F8
    8
    9                                                                          DADDUI R1,R1,#-16
    10                                                                         DADDUI R2,R2,#-16
    11                                                                         BNEZ R1,loop (with delay slot)
    12     S.D F2,16(R2)    S.D F8,8(R2)                                       BNEZ R1,loop (without delay slot)

    Part b.ii) In this VLIW machine, at least how many times do you need to unroll the loop to get the maximum efficiency?

    Answer: Without getting into too many complexities, we can unroll 6 times easily to get 6 iterations per 14 CC. We can also unroll 10 times to get 10 iterations per 19 CC by performing another 4 iterations while idling. The true maximum efficiency is reached when one of the units is always busy. Given enough registers, unrolling 32 times will give the best efficiency, where every memory reference slot will always be busy.

    Assume an infinite loop. What is the average number of clock cycles per iteration?

    Answer: 6 clock cycles per iteration for part b. 1.5 clock cycles per iteration for maximum efficiency.

    Part c) (15 points)

    Part c.i) Consider a single-issue dynamically scheduled machine without hardware speculation. Assume that the functional units are pipelined and that all memory accesses hit the cache. There is a memory unit with 5 load buffers and 5 store buffers. Each load or store takes 2 cycles to execute, 1 to calculate the address, and 1 to load/store the data. There are dedicated integer functional units for effective address calculation and branch condition evaluation. The other functional units are described in the following table.

    Func. unit type   Number of func. units   Number of reservation stations
    Integer ALU       1                       5
    FP adder          1                       3
    FP multiplier     1                       2
    Load              1                       5
    Store             1                       5

    Now dynamically schedule 2 iterations of the original loop on this machine without speculation. Show the clock cycle number of each stage of the dynamically scheduled code in the table below. Assume at least one cycle delay between successive steps of every instruction execution sequence (issue, execution start, write back).

    Instruction  Operands      Issue    Execution start  Write back
    L.D          F0,0(R1)      1        2                4
    ADD.D        F0,F0,F4      2        5                8
    L.D          F2,0(R2)      3        4                6
    MUL.D        F2,F0,F2      4        9                15
    S.D          F2,0(R2)      5        16*              17
    DADDUI       R1,R1,#-8     6        7                9
    DADDUI       R2,R2,#-8     7        8                10
    BNEZ         R1,loop       8        10               11
    L.D          F0,0(R1)      12 (1)   13               16
    ADD.D        F0,F0,F4      13       17               20
    L.D          F2,0(R2)      14       16               18
    MUL.D        F2,F0,F2      15       21               27
    S.D          F2,0(R2)      16       28               29
    DADDUI       R1,R1,#-8     17       18               19
    DADDUI       R2,R2,#-8     18       19               21
    BNEZ         R1,loop       19       20               22

    * S.D here can calculate the address while waiting for F2.
    (1) L.D must wait for the branch to return the branch decision.

    Part c.ii) In this dynamically scheduled code, assume an infinite loop. What is the average number of clock cycles per old iteration?

    Answer: 11 CC per iteration, given that the second iteration can only start at cycle 12.

    Part d) (15 points) Now we use the dynamic scheduling hardware to build a speculative machine that can issue and commit 2 instructions per cycle. Again assume that the functional units are pipelined and that all memory accesses hit the cache. There is a memory unit with 8 load buffers. The reorder buffer has 50 entries. The reorder buffer can function as a store buffer, so there are no separate store buffers. Each load or store takes 2 cycles to execute, 1 to calculate the address, and 1 to load/store the data. Assume a branch predictor with 0% misprediction rate. Assume there are dedicated integer functional units for effective address calculation and branch condition evaluation. The other functional units are described in the following table.

    Functional unit type   Number of functional units   Number of reservation stations per functional unit
    Integer ALU            2                            4
    FP adder               2                            3
    FP multiplier          2                            2
    Load                   2                            4

    Part d.i) Schedule 2 iterations of the original code on this speculative machine in the table below. Assume at least one cycle delay between successive steps of every instruction execution sequence (issue, execution start, write back, commit). Assume two common data buses.

    Instruction  Operands      Issue  Execution start  Write back  Commit
    L.D          F0,0(R1)      1      2                4           5
    ADD.D        F0,F0,F4      1      5                8           9
    L.D          F2,0(R2)      2      3                5           9
    MUL.D        F2,F0,F2      2      9                15          16
    S.D          F2,0(R2)      3      4                5           16
    DADDUI       R1,R1,#-8     3      4                6           17
    DADDUI       R2,R2,#-8     4      5                6           17
    BNEZ         R1,loop       4      7                8           18
    L.D          F0,0(R1)      5      6                9           18
    ADD.D        F0,F0,F4      5      10               12          19
    L.D          F2,0(R2)      6      7                9           19
    MUL.D        F2,F0,F2      6      13               19          20
    S.D          F2,0(R2)      7      8                10          20
    DADDUI       R1,R1,#-8     7      8                10          21
    DADDUI       R2,R2,#-8     8      9                11          21
    BNEZ         R1,loop       8      11               12          22

    Part d.ii) Assume an infinite loop and no ROB overflow. What is the average number of clock cycles per iteration on this speculative machine? Compare with parts b and c.

    Answer: It takes 4 CC per old iteration. The VLIW performs best; however, it requires software scheduling. Note also that the VLIW is 5-issue whereas the speculative machine is double-issue. Also, this speculative machine with double issue performs two loop iterations in less than half the cycles required by the non-speculative machine in part c.

    Question 5. Performance (30 points). There are 3 parts to this question.

    Part a) Reliability and Amdahl's law. (10 points) Consider a system in which the components have the following MTTF (in hours):

    CPU            1,000,000
    Hard disk      200,000
    Memory         500,000
    Power supply   100,000

    Part a.i) Assume that if any component fails, then the system fails. What is the system MTTF?

    Part a.ii) You buy an additional hard drive and bring the total hard disk MTTF to 600,000 hours, which provides a 3 times improvement. Using Amdahl's law, compute the improvement in the whole system reliability.

    The hard disk contributes ___% of the total MTTF.
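The worked numbers for this part did not survive in the text, so the following is only a sketch of one standard way to set up the calculation. It assumes independent components with constant failure rates, so that the system failure rate is the sum of the component rates (1/MTTF), and it applies Amdahl's law to the hard disk's share of the total failure rate. The function names are hypothetical.

```python
# Component MTTF values in hours, taken from the problem statement.
MTTF_HOURS = {
    "CPU": 1_000_000,
    "Hard disk": 200_000,
    "Memory": 500_000,
    "Power supply": 100_000,
}


def system_mttf(component_mttf):
    """Assumption: failure rates (1/MTTF) of independent components add up."""
    total_failure_rate = sum(1.0 / hours for hours in component_mttf.values())
    return 1.0 / total_failure_rate


def amdahl_improvement(fraction, factor):
    """Amdahl's law: overall gain when `fraction` of the total improves by `factor`."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


if __name__ == "__main__":
    base = system_mttf(MTTF_HOURS)
    disk_fraction = (1.0 / MTTF_HOURS["Hard disk"]) * base  # share of failure rate
    gain = amdahl_improvement(disk_fraction, 3.0)
    print(f"system MTTF: {base:.0f} hours")
    print(f"hard disk share of failure rate: {100 * disk_fraction:.1f}%")
    print(f"overall reliability improvement: {gain:.3f}x")
```

As a cross-check under the same assumptions, recomputing `system_mttf` with the hard disk entry set to 600,000 hours gives the same ratio as the Amdahl's law expression.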

    Part b) Cache performance (10 points) Consider a memory system with a latency of 60 clocks. The transfer rate is 4 bytes per clock cycle, and 30% of the transfers are dirty. There are 32 bytes per block and 25% of the instructions are data transfer instructions. There is no write buffer. In addition, the TLB takes 40 clock cycles on a TLB miss. A TLB does not slow down a cache hit. For the TLB, make the simplifying assumption that 0.5% of all references are not found in the TLB, either when addresses come directly from the CPU or when addresses come from cache misses.

    If the base CPI with a perfect memory system is 1.5, what is the CPI for a 16 KB two-way set-associative unified cache using write back with a cache miss rate of 1.6%?

    Compute the effective CPI for this cache with the real TLB.

    Answers:

    Since this is a unified cache, both instructions and data share the same cache and have the same transfer rate on block replacements.

    Virtually addressed cache: An address from the CPU will go through the cache first, and only on a cache miss does it go through the TLB.

    Physically addressed cache: An address from the CPU will go through the TLB first, then through the cache.

    Part c) Branch prediction performance (10 points)

    Suppose we have a deeply pipelined processor, for which we implement a branch target buffer for the conditional branches and branch folding for the unconditional branches.

    For the conditional branches, assume that the misprediction penalty is always 4 cycles and the buffer miss penalty is always 3 cycles. Assume a 90% branch target buffer hit rate, 90% target address accuracy, and 15% conditional branch frequency.

    For branch folding that stores the target instructions of the unconditional branches, assume also a 90% hit rate and 5% unconditional branch frequency. Assume also that the hit target instruction can bypass the fetch stage and start immediately in the decode stage.

    How much faster is this processor versus a processor that has a fixed 2-cycle branch penalty for both unconditional and conditional branches? Assume a base CPI without branch stalls of 1.

    Answer:

    CPI of the deeply pipelined processor, assuming that the bypassing only happens for unconditional branches

    The deeply pipelined processor is 1.31 times faster than the fixed 2-cycle branch processor.
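The intermediate CPI expressions are missing from the text, so the sketch below shows one reading of the problem that reproduces the stated 1.31 figure. Two points are assumptions rather than facts from the source: a branch-folding hit is modelled as saving one cycle (the target instruction skips fetch), and a branch-folding miss is charged the same 3-cycle buffer miss penalty as a conditional branch. All names are hypothetical.

```python
BASE_CPI = 1.0
COND_FREQ = 0.15           # conditional branch frequency
UNCOND_FREQ = 0.05         # unconditional branch frequency
BUFFER_HIT_RATE = 0.90     # hit rate for both the BTB and branch folding
TARGET_ACCURACY = 0.90     # fraction of BTB hits with a correct target
MISPREDICT_PENALTY = 4     # cycles
BUFFER_MISS_PENALTY = 3    # cycles
FOLDING_HIT_GAIN = -1      # assumption: a folding hit saves one cycle
FIXED_BRANCH_PENALTY = 2   # cycles, for the comparison processor


def deep_pipeline_cpi():
    """CPI with a branch target buffer (conditional) and branch folding (unconditional)."""
    miss_rate = 1 - BUFFER_HIT_RATE
    cond_stalls = COND_FREQ * (
        miss_rate * BUFFER_MISS_PENALTY
        + BUFFER_HIT_RATE * (1 - TARGET_ACCURACY) * MISPREDICT_PENALTY
    )
    # Assumption: a folding miss costs the same 3 cycles as a BTB miss.
    uncond_stalls = UNCOND_FREQ * (
        BUFFER_HIT_RATE * FOLDING_HIT_GAIN + miss_rate * BUFFER_MISS_PENALTY
    )
    return BASE_CPI + cond_stalls + uncond_stalls


def fixed_penalty_cpi():
    """CPI when every branch costs a fixed 2-cycle penalty."""
    return BASE_CPI + (COND_FREQ + UNCOND_FREQ) * FIXED_BRANCH_PENALTY


if __name__ == "__main__":
    speedup = fixed_penalty_cpi() / deep_pipeline_cpi()
    print(f"{deep_pipeline_cpi():.3f} {fixed_penalty_cpi():.3f} {speedup:.2f}")
```

Under these assumptions the two CPIs come out to about 1.069 and 1.4, whose ratio rounds to the 1.31 quoted in the answer; a different folding-miss penalty would change the result.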