Parallel Techniques in Modeling Particle Systems Using Vulkan* API

Example of using GPU- and CPU-based solutions utilizing Vulkan graphics and compute API

Tomasz Chadzynski
Integrated Computer Solutions Inc.


Table of Contents

1 Introduction
  1.1 Assumptions and expectations
  1.2 Additional resources
  1.3 Building examples
2 Concept of a Simulated Particle System
  2.1 Generators
  2.2 Interactors
  2.3 Euler method in numerical simulation of particle movement
  2.4 More on interactors
    2.4.1 Constant force interactor
    2.4.2 Gravity point interactor
    2.4.3 Gravity planar interactor
    2.4.4 Additional thoughts
3 Vulkan Renderer for CPU-Based Simulator
  3.1 Physical device selection
    3.1.1 Validation layers: A crucial step in Vulkan application development
  3.2 Vulkan virtual device and swapchain
  3.3 Scene setup
  3.4 Scene element
  3.5 Graphics memory allocation and performance
4 CPU-Based Particle Simulator
  4.1 Particle model
  4.2 Scene interactors and generators
    4.2.1 Performance vs. maintainability
  4.3 Lifetime of the particle and the importance of sorting
5 Multithreaded Approach to Computing Particle System
  5.1 Going further with parallelization
6 Vulkan Compute in a Modeling Particle System
  6.1 Using GPU memory buffer for computation and visualization
  6.2 Moving computation from the CPU to compute shader on the GPU
  6.3 Structure of the compute shader-oriented simulation
7 Going Further
  7.1 Eliminate the CPU role in computations
  7.2 Parallelizing the performance path by double buffering
8 Conclusion


1 Introduction

Parallel processing has become one of the most important aspects of today's programming techniques. The paradigm shift forced by CPUs hitting the power wall pushed programming toward techniques that spread computations over multiple cores or processors. As it turns out, many of the computational tasks that software developers face are parallelizable. One example is the modeling of particle systems. Such systems are widely used in many fields, from physical simulation to computer games.

The Vulkan* API is a collaborative effort by the industry to meet current demands of computer graphics. It is a new approach that emphasizes hiding the CPU bottleneck through parallelism, allowing much more flexibility in application structure. Aside from components related only to graphics, the Vulkan API also defines the compute pipeline for numerical computation.

This paper discusses and compares aspects of the implementation of a particle system using CPUs and GPUs, supported by a Vulkan-based renderer example.

1.1 Assumptions and expectations

With all the performance gains that Vulkan has to offer, there is a downside in the form of a rather steep learning curve. Drawing a triangle, the simplest example available on the Internet, can take around six hundred lines of code. This is because the Vulkan API requires the developer to specify nearly all aspects of the rendering mechanism. Still, given some effort, Vulkan is a clear and straightforward API to use.

This document does not intend to teach Vulkan basics. Some basic knowledge of Vulkan is required, however. You should already know how to set up simple Vulkan applications. If you are new to Vulkan, then a good start would be to work through the LunarG tutorial.

The example code accompanying this paper is organized to emphasize architectural concepts, which leads to two important assumptions: 1) The return status from Vulkan function calls is in most cases ignored to avoid overly expanding the codebase. In production, every return status should be checked and acted on; 2) Conceptually close topics are kept within a single code flow. This reduces the need for jumping through the sources, but sometimes leads to rather long functions.

This paper is best read while looking at the accompanying code. Make sure you have the examples downloaded and use your favorite code browsing tool. To avoid long code listings, this paper references the code through a series of tags. For example, GPU_TP.15 references point fifteen in the Vulkan compute version of the codebase. Each tag is unique and easily searchable within the codebase.


Lastly, one of the major design goals of Vulkan is to give as much freedom to the application designer as possible. Therefore, there are many ways of implementing a given task depending on the design goals, target hardware, scalability, and so on. This paper’s goal is to introduce the topic and spark some creative thinking, rather than provide a final approach.

1.2 Additional resources

Vulkan has received much attention, which has resulted in many good information sources. Here are a few recommended reference materials:

1) Vulkan specification and quick reference: https://www.khronos.org/registry/vulkan/
2) OpenGL and GLSL specification. We are mostly interested in GLSL as we use it as a shader programming language: https://www.khronos.org/registry/OpenGL/index_gl.php
3) Vulkan SDK and an excellent tutorial covering Vulkan basics: https://vulkan.lunarg.com/
4) Another Vulkan tutorial that goes beyond basics, covering more advanced memory handling and textures: https://vulkan-tutorial.com/
5) Database of hardware supporting Vulkan. A great source for capabilities that are offered by the variety of hardware: http://vulkan.gpuinfo.org/
6) A large set of examples in Vulkan from basic triangle to compute shader: https://github.com/SaschaWillems/Vulkan

1.3 Building examples

The project build is organized using Visual Studio* 2017. Overall, it uses three dependencies: GLM, GLFW, and the Vulkan SDK. The first two are installed automatically through NuGet. The Vulkan SDK should be downloaded and installed. After opening the project, select the "Property Manager" tab. From there, unfold "Debug | x64" and open VulkanDir. Then go to "User Macros" and set the VK_DIR property to point to the main Vulkan SDK directory, for example "C:\VulkanSDK\1.0.51.0". The Debug and Release configurations both use the same property, so one change will affect both. From there just click “Build.”


2 Concept of a Simulated Particle System

Our goal is to model the movement of multiple objects (particles) through space. To make things more interesting, the scene will contain entities that alter the path along which a particle is moving.

A single particle is defined as a set of two vectors, position in space and velocity, and an additional scalar representing mass. This set of properties describes the state of each particle:

$$P = \{\mathbf{x}, \mathbf{v}, m\} \tag{1}$$

We can look at the way a particle moves through space with displacement given by the differential equation:

$$\frac{d\mathbf{x}}{dt} = \mathbf{v} \tag{2}$$

For now, we will treat the velocity as a constant and solve for displacement:

$$d\mathbf{x} = \mathbf{v}\,dt$$

$$\mathbf{x} = \mathbf{v}t + C \tag{3}$$

We arrive at a general formula that models how a particle moves through space. Constant velocity is assumed (but this will change soon). The constant C must also be known in order to describe our system completely. It is related to the initial conditions of this differential equation. In other words, the particle at the beginning of its life has to have a predefined state. This requires using generators.


2.1 Generators

The concept of a generator is a function which fulfills the task of "placing" a new particle into the scene in some predefined manner. This demonstration defines a single sprinkler generator, which spawns particles at a predefined point in space with a given velocity and mass.

The position of the point of origin is given directly as a parameter. The direction is defined using spherical coordinates, which are then internally converted into the Euclidean coordinate system. The angles are defined as follows:

1) θ (theta): Angle starts at the x-axis and rotates counterclockwise.
2) φ (phi): Angle starts from the z-axis towards the xy-plane.
3) The value of speed is given as a separate argument.

To convert into Euclidean coordinates, we use the following set of equations:

$$x = \sin(\phi)\cos(\theta), \quad y = \sin(\phi)\sin(\theta), \quad z = \cos(\phi) \tag{4}$$
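As a minimal sketch of equation (4), the conversion can be written as a small function. The names Vec3 and sphericalToVelocity are hypothetical and not taken from the accompanying code, and the angles are assumed to be in radians. The unit direction from equation (4) is scaled by the separately supplied speed.

```cpp
#include <cmath>

// Hypothetical minimal vector type; the accompanying project uses GLM instead.
struct Vec3 { float x, y, z; };

// Converts a direction given by spherical angles (in radians) and a speed
// into a Cartesian velocity vector, following equation (4).
// theta: angle measured from the x-axis, counterclockwise in the xy-plane.
// phi:   angle measured from the z-axis toward the xy-plane.
Vec3 sphericalToVelocity(float theta, float phi, float speed) {
    return Vec3{
        speed * std::sin(phi) * std::cos(theta),
        speed * std::sin(phi) * std::sin(theta),
        speed * std::cos(phi)
    };
}
```

With phi equal to zero the particle is emitted straight along the z-axis, regardless of theta.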

Going back to our general equation for particle motion and factoring in the generator, we can determine that at time $t = 0$ the position of a particle is $\mathbf{x}_0$:

$$\mathbf{x}_0 = \mathbf{v}_0 \cdot 0 + C \;\rightarrow\; C = \mathbf{x}_0$$

$$\mathbf{x} = \mathbf{v}t + \mathbf{x}_0 \tag{5}$$

At this point, we have a formula that we can use to describe the motion of a particle. However, our scene will not look that interesting, as all of the particles would only move in a straight line and at a constant speed. To give our scene more dynamic behavior, we introduce the concept of interactors.


2.2 Interactors

To understand how we can consistently influence the motion of a particle, we start treating velocity as a variable instead of a constant. The change of velocity is given by the following formula:

$$\frac{d\mathbf{v}}{dt} = \mathbf{a} \tag{6}$$

This equation looks similar to the previous formula for displacement. By solving this differential equation for a constant acceleration, we arrive at the formula for velocity:

$$\mathbf{v} = \mathbf{v}_0 + \mathbf{a}t \tag{7}$$

Therefore, to consistently change particle motion we influence the velocity indirectly by changing acceleration. We define the interactor as a function that, for the given particle state $p$ and given interactor properties $\mathit{in}$, returns the acceleration vector:

$$f(p, \mathit{in}) \rightarrow \mathbf{a} \tag{8}$$

Unfortunately, defining the interactor as a function becomes problematic. When we substitute it back into the velocity equation, we see that the differential equation has the potential to become unsolvable analytically:

$$\frac{d\mathbf{v}}{dt} = f(p, \mathit{in}) \tag{9}$$

This is not what we want. Our goal is to have the ability to introduce different types of interactors without being worried about breaking our model. To do that, we need to implement the Euler method of solving differential equations to approximate the particle movement numerically.


2.3 Euler method in numerical simulation of particle movement

The working principle of the Euler method is to step through the differential equation instead of solving it across the entire domain. If we assume a sufficiently small time step, we can approximate the velocity (2) and acceleration (6) as constants within the span of a single step. This allows us to fall back on equations (5) and (7), and discretize them so they can be programmed. Therefore, the final set of equations to be used is as follows:

$$\mathbf{x}[n] = \mathbf{x}[n-1] + \mathbf{v}[n]\,\Delta t$$

$$\mathbf{v}[n] = \mathbf{v}[n-1] + \mathbf{a}[n]\,\Delta t \tag{10}$$

If the math so far in this chapter has given you trouble, don't get discouraged. It is enough to understand the final equations (10) to work with the rest of the material. From now on we will treat the acceleration vector as a constant within a single iteration. Next, we dive deeper into interactors that affect the motion of a particle.

2.4 More on interactors

We established earlier that the particle motion within the scene would be influenced by changing acceleration, using Newton's second law:

$$\mathbf{a} = \frac{\mathbf{F}}{m} \tag{11}$$

We see that the task of an interactor should be to generate some force and convert it into acceleration using the mass of the particle. Moreover, we can sum up the forces coming from multiple interactors:

$$\mathbf{F}_{NET} = \sum_i \mathbf{F}_i \tag{12}$$


Therefore, our final approach is as follows:

$$\mathbf{a}_{NET} = \sum_i \mathbf{a}_i = \sum_i \frac{\mathbf{F}_i}{m} \tag{13}$$

The main reason not to factor the mass out of the sum is that interactors that simulate gravity don't use it. In other words, the acceleration due to gravity does not depend on the mass of the object on which the gravitational pull is exerted. It is also possible that some interactors which don’t model physically accurate behavior could simply ignore mass as well. This is an optimization step which could save a significant number of cycles spent on multiplication.

2.4.1 Constant force interactor

This is the simplest form of an interactor. Its function is to exert a force on a particle regardless of any other properties. It could be used to make a particle go towards some predefined direction:

$$\mathbf{a}_i = \frac{\mathbf{F}_i}{m} \tag{14}$$

2.4.2 Gravity point interactor

This is the second type of interactor. It represents a single point that exerts the force of gravity on particles in the scene. Such a point has mass and position, but in our example it does not have volume. It uses the standard gravity equation:

$$\mathbf{F} = G\,\frac{Mm}{d^2}\,\mathbf{r} \tag{15}$$

Now we can see why we chose not to include the mass of a particle within each interactor. When deriving the formula for acceleration, we end up with:

$$\mathbf{a} = G\,\frac{M}{d^2}\,\mathbf{r} \tag{16}$$

In this equation, G is the gravitational constant, M is the mass of the gravity point, d represents the distance between the particle and the gravity point, and r is the unit vector pointing toward the gravity point interactor from the location of the particle.
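Equation (16) could be sketched as follows. The function and type names are hypothetical, and the guard against a vanishing distance is our own assumption (the text does not say how the accompanying code handles a particle sitting exactly on the gravity point).

```cpp
#include <cmath>

// Hypothetical minimal vector type for illustration only.
struct Vec3 { float x, y, z; };

// Acceleration exerted by a gravity point on a particle, per equation (16):
// a = G * M / d^2 * r, where r is the unit vector from particle to point.
Vec3 gravityPointAcceleration(const Vec3& particlePos, const Vec3& pointPos,
                              float pointMass, float G) {
    const Vec3 diff{pointPos.x - particlePos.x,
                    pointPos.y - particlePos.y,
                    pointPos.z - particlePos.z};
    const float d2 = diff.x * diff.x + diff.y * diff.y + diff.z * diff.z;

    // Assumed guard: avoid dividing by zero when the particle is on the point.
    if (d2 < 1e-12f) {
        return Vec3{0.0f, 0.0f, 0.0f};
    }

    const float d = std::sqrt(d2);
    // diff / d is the unit vector r, so the scale factor is G * M / (d^2 * d).
    const float scale = G * pointMass / (d2 * d);
    return Vec3{diff.x * scale, diff.y * scale, diff.z * scale};
}
```

Note that the particle mass never appears, which is exactly the saving discussed above.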


2.4.3 Gravity planar interactor

This is the last type of interactor implemented for this example. It is modeled as an infinite plane exerting gravitational force on the particles in the scene. In this case, the r vector is always perpendicular to the plane. Therefore, for any given particle location, the distance has to be calculated as follows. Let P1 = {x1, y1, z1} represent a particle in space. The plane is described by its normal vector n = {a, b, c} and a point on that plane, P = {x, y, z}. The distance D is given by:

$$D = \frac{\left|\operatorname{dot}(\mathbf{n}, P_1) + d\right|}{\operatorname{length}(\mathbf{n})} \tag{17}$$

$$d = -\operatorname{dot}(\mathbf{n}, P) \tag{18}$$

This type of interactor can be used to model objects that are large compared to the scene size, such as the Earth.
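The distance computation in equations (17) and (18) is short enough to show in full. The names below (Vec3, dot, distanceToPlane) are hypothetical helpers written for this illustration; the normal is assumed to be non-zero but does not need to be normalized.

```cpp
#include <cmath>

// Hypothetical minimal vector type for illustration only.
struct Vec3 { float x, y, z; };

inline float dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Distance from a particle at position p1 to a plane described by its
// normal n and an arbitrary point on the plane, per equations (17) and (18).
float distanceToPlane(const Vec3& n, const Vec3& pointOnPlane, const Vec3& p1) {
    const float d = -dot(n, pointOnPlane);                       // equation (18)
    return std::fabs(dot(n, p1) + d) / std::sqrt(dot(n, n));     // equation (17)
}
```

For example, with n = {0, 2, 0}, a plane point {0, 1, 0}, and a particle at {5, 4, -3}, the particle is 3 units above the plane y = 1.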

2.4.4 Additional thoughts

Simulating physically correct behavior is not the only way to build interesting particle systems. In fact, setting up a physically accurate system in equilibrium could be very hard. A similar effect could be achieved by using non-physically accurate interactors. Such functions can look into the entire state of a particle and generate acceleration that will attempt to force certain behavior.

3 Vulkan Renderer for CPU-Based Simulator

The design of the renderer component used in visualizing the CPU version of the particle simulator will be described first. If you are not familiar with Vulkan, it is a good idea to pause here and review one of the tutorials. The LunarG tutorial covers sufficient material to understand the rest of this paper. Also, remember that Vulkan leaves the application design to the programmer and allows for a variety of ways to approach a task. Take this guide as one of the options available rather than the definitive solution.

At a high level, when programming using Vulkan, the goal is to construct a virtual device to which drawing commands will be submitted. The draw commands are submitted to constructs called “queues.” The number of queues available and their capabilities depend upon how they were selected during construction of the virtual device, and the actual capabilities of the hardware. The power of Vulkan lies in the fact that the workload submitted to queues can be assembled and sent in parallel with already executing tasks. Vulkan offers functionality to coherently maintain the resources and perform synchronization.


In our application, we took the approach of breaking up the Vulkan renderer setup into five components (see graphic below). The goal is to present an easy-to-understand codebase for the Vulkan application.

3.1 Physical device selection

The first step in every Vulkan application is to create the instance object. This functionality is delivered through VkInstance. An application must create a VkInstance before doing anything else. The VkInstance is the interface used to inspect the hardware capabilities of a computer system. This functionality is encapsulated in the AppInstance object (CPU_TP1). The AppInstance object is essentially a wrapper exposing the interface for convenient hardware device enumeration.

3.1.1 Validation layers: A crucial step in Vulkan application development

One of the most critical steps when developing with Vulkan is to enable the validation layers (CPU_TP2). By design, for performance reasons, Vulkan does not check the validity of the application code. When used incorrectly, Vulkan calls would simply fail or crash with virtually no traces that would allow for debugging. With validation layers enabled, the Vulkan loader injects a layer of code for verification. Enabling validation layers is a critical step that every developer should remember. However, that layer affects performance and should be disabled in production.


3.2 Vulkan virtual device and swapchain

The next step is to create the actual virtual device (VkDevice, called a logical device in the Vulkan specification) that will expose the command queue used to render the scene. First, a surface object must be instantiated. (The topic of the surface is out of the scope of this article, but it is covered in detail in the LunarG tutorial.) In this case, the demo code supports Windows* only, and the surface is created and exposed by the SurfaceWindows class (CPU_TP3).

Next is creating the Device object (CPU_TP4). Note that the Device class encapsulates much more than VkDevice. Starting with the constructor, the Device class exposes multiple versions, which allows more flexibility in initialization (CPU_TP5). The physical device used to create the virtual device can be "first fit" or it can be specified using a Vendor ID number, if the software is to use specific hardware. End-user systems can have multiple devices that support Vulkan, and the first one on the list is not always the device desired.

The default constructor (CPU_TP6) obtains the list of available physical devices and then runs the createDevice method. If the creation was successful, this physical device is then used. If the creation was not successful, the constructor tries the next physical device. The actual device creation is done in the createDevice method (CPU_TP7). The first step is to determine if the physical device can deliver queues with the required functionality (CPU_TP8). Devices can deliver many types of queues. Some queues may support all of the required operations. Some queues may be highly specialized for specific tasks, such as compute or transfer. In this case, the application expects the queue to allow for transfer and drawing graphics. The application will iterate over the available queue families until an index is found that matches the expected criteria.
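The search for a matching queue family can be sketched without the Vulkan headers. In the sketch below, QueueFamily is a hypothetical stand-in for VkQueueFamilyProperties, and the three constants mirror the values of VK_QUEUE_GRAPHICS_BIT, VK_QUEUE_COMPUTE_BIT, and VK_QUEUE_TRANSFER_BIT. In a real application the list would come from vkGetPhysicalDeviceQueueFamilyProperties; the function name findQueueFamily is our own.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-ins mirroring the values of the corresponding VkQueueFlagBits.
constexpr std::uint32_t kGraphicsBit = 0x1;  // VK_QUEUE_GRAPHICS_BIT
constexpr std::uint32_t kComputeBit  = 0x2;  // VK_QUEUE_COMPUTE_BIT
constexpr std::uint32_t kTransferBit = 0x4;  // VK_QUEUE_TRANSFER_BIT

// Hypothetical stand-in for VkQueueFamilyProperties.
struct QueueFamily {
    std::uint32_t flags;       // capabilities of every queue in the family
    std::uint32_t queueCount;  // number of queues the family offers
};

// Returns the index of the first family that offers at least one queue
// and supports every capability in `required`, or -1 if none matches.
int findQueueFamily(const std::vector<QueueFamily>& families,
                    std::uint32_t required) {
    for (std::size_t i = 0; i < families.size(); ++i) {
        const bool hasQueues = families[i].queueCount > 0;
        const bool supportsAll = (families[i].flags & required) == required;
        if (hasQueues && supportsAll) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

For the renderer described here, the request would be kGraphicsBit | kTransferBit, and the returned index would then be used when creating the virtual device.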

Once completed, the virtual device is created. Besides VkDevice, the object supplies the queue handles and swapchain that correspond to the virtual device. (For more details, please check the Vulkan tutorials and reference.)

3.3 Scene setup


The task of the ParticlesElement (CPU_TP9) is to draw the particle fountain. For it to be drawn, it has to belong to the scene. The role of the Scene object is to aggregate all the scene elements and to maintain the command buffer and the render pass (CPU_TP10). When the simulator is ready to render, it calls the render function (CPU_TP11). This function is responsible first for initiating the command buffer for recording, obtaining the current render target, and finally starting the render pass (CPU_TP12). Once that is done, the renderer begins iterating over every element in the scene (CPU_TP13), allowing it to record drawing commands into the command buffer. In this case, there is a single scene element that draws the particle fountain. After all scene elements finish recording, the render function submits the command buffer to the queue and presents the results on the screen.

3.4 Scene Element

The SceneElement (CPU_TP14) represents a single element or entity within the scene. The expected scope of scene element objects is to handle everything related to rendering a single entity and maintaining the required resources, like the pipeline and buffers. The ParticlesElement focuses its entire operation around two methods. First is the constructor (CPU_TP15) that initializes the rendering pipeline. Second is recordToCmdBuffer (CPU_TP16), which is invoked by the Scene class to record element-specific draw calls into the command buffer.

3.5 Graphics memory allocation and performance

There is one important aspect that has to be considered when the memory for the buffers is allocated. The Vulkan device can expose different memory types that it can use. The Memory tab in the hardware database shows that devices expose multiple heaps from which memory could be allocated. Looking at the list of memory types, each type carries two values: the memory type flags that describe its properties and the heap that is associated with it. (See the hardware database referenced in section 1.2.)

The type of heap selected could significantly affect performance. For example, some devices, mainly discrete graphics cards, expose one heap with only DEVICE_LOCAL_BIT supported, which is the fastest memory available for that device, but does not allow direct access from the CPU (more on that in a moment). Another heap with the HOST_VISIBLE_BIT allows direct access, but has potentially longer access time.

As of today, most of Intel's integrated GPUs expose heaps with both DEVICE_LOCAL_BIT and HOST_VISIBLE_BIT. The presence of the first one ensures optimal access time for the platform. The second allows the memory mapping of the allocated buffer into the CPU address space and direct access (CPU_TP17, CPU_TP18).


Selecting a memory heap is a three-step process. The first step is to specify and create the buffer (CPU_TP19) (note that vkCreateBuffer does not allocate memory for the buffer yet!). The second step is to use the Vulkan API to obtain the list of memory types that can support the defined buffer (CPU_TP20). The third step is to use the selected memory type index to allocate the buffer memory. Of particular interest is the memoryTypeBits field in VkMemoryRequirements. This field is a bit mask in which bit i is set to 1 if the memory type with index i can be used to allocate memory for the buffer. If the bit is 0, then that memory type index cannot be used with the particular object of interest.

The second step will likely report multiple candidates. Therefore, the exact memory type desired must be chosen from among them. In this case, on Intel’s GPU, all of the memory types have DEVICE_LOCAL_BIT set. HOST_VISIBLE_BIT and HOST_COHERENT_BIT also must be set. The HOST_COHERENT_BIT guarantees that memory writes are observed without the need to manually flush the buffers. If the bit in memoryTypeBits is set and the memory type at the given index supports the requested flags, then that index is selected to be used (CPU_TP21). In the third step, that index is used when allocating the actual memory on the heap (CPU_TP22).
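The selection logic described above can be sketched independently of the Vulkan headers. MemoryType below is a hypothetical stand-in for VkMemoryType, and the constants mirror the values of VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT, and VK_MEMORY_PROPERTY_HOST_COHERENT_BIT. In a real application the type list would come from vkGetPhysicalDeviceMemoryProperties and the mask from vkGetBufferMemoryRequirements; the function name findMemoryType is our own.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-ins mirroring the values of the corresponding VkMemoryPropertyFlagBits.
constexpr std::uint32_t kDeviceLocal  = 0x1;  // DEVICE_LOCAL_BIT
constexpr std::uint32_t kHostVisible  = 0x2;  // HOST_VISIBLE_BIT
constexpr std::uint32_t kHostCoherent = 0x4;  // HOST_COHERENT_BIT

// Hypothetical stand-in for VkMemoryType.
struct MemoryType {
    std::uint32_t propertyFlags;  // properties of this memory type
    std::uint32_t heapIndex;      // heap the memory type allocates from
};

// Returns the first memory type index that is allowed by memoryTypeBits
// and supports every flag in `required`, or -1 if there is none.
int findMemoryType(const std::vector<MemoryType>& types,
                   std::uint32_t memoryTypeBits,
                   std::uint32_t required) {
    for (std::size_t i = 0; i < types.size() && i < 32; ++i) {
        // Bit i of the mask tells whether type i may back the buffer.
        const bool allowed = (memoryTypeBits & (1u << i)) != 0;
        const bool hasFlags = (types[i].propertyFlags & required) == required;
        if (allowed && hasFlags) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

For the CPU-written particle buffer, the request would be kHostVisible | kHostCoherent, and the returned index would be passed on to the allocation in the third step.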

This memory allocation topic is quite important, and you are encouraged to look further into it. Topics like staging buffers or Vulkan buffer copy operations would be a good next step, especially for devices with dedicated memory.

4 CPU-Based Particle Simulator

Section 2 describes the model of the particle simulator that was implemented for this project. The implementation is based on a real-time infinite loop. The simulation loop contains three main operations:

1) Obtain the time difference between the current and previous iteration (CPU_TP23).
2) Compute the next Euler iteration for the particle model (CPU_TP24).
3) Upload and render a new frame using the Vulkan renderer (CPU_TP25).

The entire physical simulation happens within the progress method (CPU_TP26), which is composed of three major steps:

1) For every particle, the algorithm launches the interactors and adds up the acceleration vectors (CPU_TP27).
2) The algorithm sorts the particle list in such a way that all of the active particles (TTL > 0) occupy the left side of the list, and all of the inactive particles occupy the right side of the list (CPU_TP28).
3) The algorithm launches the particle generator(s) in order to add new particles to the list if the generator’s conditions have been met and there is space left in the buffer (CPU_TP29).


With this simple model, many interesting effects may be generated. Although this implementation uses a real-time approach (i.e., the actual hardware clock is used to obtain the time difference), the code can easily be modified to use a stepwise approach. An important point to remember is that the time difference has to be sufficiently small, otherwise it will break the model. The real-time approach is therefore a good choice only for a system that is capable of spinning the main loop at a sufficient speed. If that is not the case, consider switching from a real-time approach to an arbitrarily determined time step, which will not use the system clock.

4.1 Particle model

To represent the set of particles in a scene, the Buffer class (CPU_TP30) is used. It is a relatively simple class that encapsulates a memory array of particle structures (CPU_TP31). The class also delivers a few convenience functions and the array sorting algorithm.

The purpose of the sorting algorithm is to place particles into two groups within the particle array. The active particles are placed on the left side and the inactive ones on the right side. It is implemented as follows:

1) Initialize the L index at the first element and the P index at the last element.
2) Progress L toward the end of the array until a particle with TTL <= 0 is encountered.
3) Progress P toward the beginning of the array until a particle with TTL > 0 is encountered.
4) Swap tab[L] with tab[P].
5) Repeat with L+1.

If at any time L == P, the finish condition is reached.
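The steps above can be sketched as a single-pass, two-index partition. The Particle layout and the function name partitionActive are hypothetical; only the ttl field matters here. As a convenience that the text does not spell out, the sketch also returns the number of active particles, which is the value a draw call or a generator would need.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical particle with only the field relevant to sorting.
struct Particle {
    float ttl;  // remaining Time To Live; the particle is active while ttl > 0
};

// Moves active particles (ttl > 0) to the left and inactive ones to the
// right in one pass. Returns the number of active particles.
std::size_t partitionActive(std::vector<Particle>& tab) {
    if (tab.empty()) {
        return 0;
    }
    std::size_t l = 0;               // scans from the first element
    std::size_t p = tab.size() - 1;  // scans from the last element
    while (true) {
        // Advance L until an inactive particle is found.
        while (l < p && tab[l].ttl > 0.0f) ++l;
        // Move P back until an active particle is found.
        while (l < p && tab[p].ttl <= 0.0f) --p;
        if (l == p) break;           // finish condition
        std::swap(tab[l], tab[p]);
        ++l;                         // repeat with L+1
    }
    // The element where the indices met has not been counted yet.
    return tab[l].ttl > 0.0f ? l + 1 : l;
}
```

The relative order of particles within each group is not preserved, which is acceptable because only the active or inactive state matters.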

4.2 Scene interactors and generators

Both interactors and particle generators influence the scene during the simulation. Within the model, both are represented by their corresponding abstract interfaces. The interactors use BaseInteractor (CPU_TP32), and the generators use BaseGenerator (CPU_TP33). Both need to be defined before the Model object is instantiated (CPU_TP34).

In the case of the interactor, all that the model requires is to obtain the acceleration vector given the particle state and time difference. Internally, the interactor can perform arbitrary operations, and interactors can be fundamentally different from each other. For example, the ConstForceInteractor (CPU_TP35) uses only force magnitude, direction vector, and particle data.


The BaseGenerator, aside from a common interface, delivers functions to generate particle properties like speed with randomized deviations (CPU_TP36). However, it leaves the exact particle location, the rate of appearance, and other details to the derived classes. In this example, a point generator (CPU_TP37) is used. The point generator is set up in such a way that it acts as a particle sprinkler (CPU_TP38). It is given a direction in spherical (polar) coordinates specified in degrees, and emits particles according to a defined rate.

4.2.1 Performance vs. maintainability

There exists a performance bottleneck in the implementation. A specific design choice was made to prioritize maintainability over performance. The problem originates from the polymorphic call into the computeAcceleration method (CPU_TP39).

Polymorphic calls tend to be slower than regular function calls, and every interactor is executed for every particle. It is clear how this could become a significant factor for a large number of particles. However, the advantage in this case is that there is an easy way to add new interactors to the mix with no need to make changes to the Model's code.

The alternative here could be to implement some mechanism that would detect the type of interactor and execute its procedures using a conditional statement. This may improve performance, but it also increases the complexity of the code.

Which approach to use is a case-by-case decision that depends on project goals. When a small number of particles is expected with a large variety of interactors present, then the virtual method approach is a good choice. However, if the number of particles is large, then switching to other solutions could be a good way to improve performance.

4.3 Lifetime of the particle and the importance of sorting

There is one last topic related to particle generation and particle lifetime. Once a particle is moving through the scene, it can easily fly outside of the visible area. Since it is not visible, there is virtually no purpose in modeling it. Sometimes the goal is to model effects like a flame with a specific height. To implement such cases, every particle must have a property that determines its lifespan, called “Time To Live” (TTL). At each iteration, the time difference is subtracted from the TTL, and if the value is equal to or below zero, the particle is considered inactive. When particles become inactive, no operations are performed on them and they are no longer drawn.


There are two issues related to inactive particles. First, a way is needed to free up space for new particles. Second, inactive particles should no longer be drawn on the screen. This project example uses a simple rendering mechanism. However, in the case of a more complicated multi-pass renderer with more computation-heavy shaders, it is a good idea to avoid processing elements that would not be visible on the screen. This is where sorting comes in. Sorting the particle array allows for submitting draw calls that specify only the number of particles that should appear on the screen. It also simplifies the work of generators, as they do not have to search through the buffer for free slots to fill.

There is a possible alternative approach. Instead of the contiguous array, a linked list could be used. This would eliminate the need for sorting, but, on the other hand, it would slow down the upload of data to the GPU buffers because of the need to convert the linked list structure into a contiguous array.


5 Multithreaded Approach to Computing Particle System

The speed of the simulation could be improved by parallelizing the workload. There are a few cases to consider:

1) Parallelize the workload across the array of particles, calculating the interactors’ contribution and Euler step as a single work unit.
2a) Parallelize the workload across interactors, calculating the effect of each interactor at every particle as a single work unit.
2b) Further parallelize Option 1, above, by splitting interactor contributions into separate work units.
3) Parallelize the sorting algorithm.
4) Parallelize the generators’ contribution.

Option 1, above, is the recommended approach. For each particle, the algorithm computes, as a single task, the contribution from each interactor, summing the acceleration vectors and then performing the Euler step (CPU_TP40). Since each particle resides in a different memory region, and interactors usually do not modify their internal data when computing particles, the work units need no synchronization, which makes this the most efficient of the options. The OpenMP* framework was used to provide multithreading. OpenMP is an API that is based primarily on compiler directives and used for multithreaded development. It requires relatively little additional code. It can easily be used to parallelize loops, but has much more to offer. However, it might not offer the same level of fine-grained control as some other available libraries.

Option 2a is probably not recommended. Some might find this option appealing because the data that defines the properties of an interactor would have to be loaded once and then used extensively while computing each particle. This is something that will be supported by the CPU cache, significantly speeding up the process. However, there are issues with this approach. For example, acceleration must be summed up, which means there will be a need to synchronize access between threads executing the interactors and then store the acceleration vector for every particle in an additional memory region. Moreover, the entire array will require an additional pass to calculate the Euler step. These two disadvantages would produce significant synchronization overhead, likely rendering this strategy impractical.

Option 2b is an extension of Option 1. Hypothetically, if there are computationally demanding interactors, it could be beneficial to split their operations. However, the goal is to achieve minimal context switching. If every particle computation and every interactor contribution is parallelized, it will most likely lead to an explosion of threads. Switching between those threads would further decrease performance.

Option 3 suggests improving the sorting algorithm through parallelization. This strategy might improve sorting speed; however, the type of data must be considered. In this particular case, a particle is in an active or inactive state. The order of the particles that are in the same state is not important, as long as they are all placed on the correct side of the sorted array. This allows for the array to be sorted in a single pass. If a parallel sorting algorithm is used instead, it is possible that performance could be improved. However, such improvement is not guaranteed. Possible improvements may be determined through experimentation.
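The single-pass property can be illustrated with the standard library. The sketch below uses `std::partition` as a stand-in for the sample's own sorting routine; the `Particle` fields and the `compactActive` name are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical particle with only the fields needed for this example.
struct Particle {
    float remainingLife;
    bool active;
};

// Moves all active particles to the front of the array in one linear pass
// and returns how many there are. Order inside each group is not preserved.
inline std::size_t compactActive(std::vector<Particle>& particles) {
    const auto isActive = [](const Particle& p) { return p.active; };
    const auto firstInactive =
        std::partition(particles.begin(), particles.end(), isActive);
    assert(std::is_partitioned(particles.begin(), particles.end(), isActive));
    return static_cast<std::size_t>(firstInactive - particles.begin());
}
```

Because only the active/inactive boundary matters, a general comparison sort does more work than necessary, which is one reason a parallel sort is not guaranteed to win here.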

In Option 4, parallelizing the generation of particles has a high potential of improving performance, but only under certain conditions. The first consideration is to determine the workload of all of the generators. If the memory buffer is almost always full, there won’t be any space for the generators to work. Therefore, spawning additional threads to execute the generators will only create unnecessary overhead. Every generator takes as much space as needed, and what is left is assigned to the next generator in the list. This could result in some threads executing generators just to find out that there is nothing for them to do.
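The slot assignment described above can be sketched as follows. The `assignSlots` function and its parameters are hypothetical; they only model the rule that each generator takes what it needs and passes the remainder on.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Grants free buffer slots to generators in list order. requested[i] is how
// many particles generator i wants to emit; the result holds what it gets.
inline std::vector<std::size_t> assignSlots(const std::vector<std::size_t>& requested,
                                            std::size_t freeSlots) {
    std::vector<std::size_t> granted;
    granted.reserve(requested.size());
    for (std::size_t want : requested) {
        const std::size_t got = std::min(want, freeSlots);
        assert(got <= freeSlots);
        granted.push_back(got);
        freeSlots -= got;  // the remainder goes to the next generator
    }
    return granted;
}
```

With 60 free slots and requests of 50, 30, and 20, the grants are 50, 10, and 0. A thread spawned for the third generator would have nothing to do, which is the overhead the paragraph warns about.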

5.1 Going further with parallelization

There are a few more things worth mentioning that could affect performance. First is the issue of false sharing. A single particle structure was 44 bytes in size. Given that the cache line size was 64 bytes, what was fetched from lower-level memory was an entire particle structure, or part of a particle structure, that the current thread was working on, as well as part of a particle structure located right next to it. If two parallel threads each work on a single particle and those particles are located next to each other, the modification of the acceleration vector inside one thread will invalidate the entire cache line. Essentially, this will cause a cache miss on the other thread, even though the threads were assigned to different particles. This situation can be alleviated by aligning the particle structures to the 64-byte boundary. However, this will also increase memory waste.
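The trade-off can be checked at compile time. In the sketch below, the 11-float `Particle` layout is an assumption chosen only to reach the 44 bytes mentioned above, and `sharesLine` is a hypothetical helper that assumes the array starts on a cache-line boundary.

```cpp
#include <cstddef>

// Hypothetical 44-byte layout (11 floats); the sample's real fields may differ.
struct Particle {
    float position[3];
    float velocity[3];
    float acceleration[3];
    float life;
    float size;
};
static_assert(sizeof(Particle) == 44, "expected 11 tightly packed 4-byte floats");

constexpr std::size_t kCacheLine = 64;  // assumed cache line size in bytes

// Aligning each particle to a cache line removes sharing between neighbours.
struct alignas(kCacheLine) PaddedParticle {
    Particle data;
};
static_assert(sizeof(PaddedParticle) == 64, "padded up to one full cache line");

// Padding bytes paid per particle for the alignment.
constexpr std::size_t kWastePerParticle = sizeof(PaddedParticle) - sizeof(Particle);

// True when array elements i and j (each `stride` bytes, array base aligned
// to a cache line) overlap at least one common cache line.
constexpr bool sharesLine(std::size_t i, std::size_t j, std::size_t stride) {
    return (i * stride) / kCacheLine <= (j * stride + stride - 1) / kCacheLine &&
           (j * stride) / kCacheLine <= (i * stride + stride - 1) / kCacheLine;
}
```

Under these assumptions, neighbouring packed particles share a line while padded ones do not, at a cost of 20 padding bytes for every 44 bytes of data.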

The second and preferred solution is to group multiple particle structures together and direct them to be processed as a single unit. This is what OpenMP typically does by default in the "parallel for" directive, where the commonly used static schedule gives each thread one contiguous block of iterations.

OpenMP is a very good framework, but it has disadvantages. OpenMP is built into the compilers that support it. However, some compilers do not support OpenMP at all, and others support only an outdated version. A great alternative to OpenMP is the thread support library that was introduced with C++11 and should be supported in almost all C++ compilers currently available.
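A rough equivalent of the grouped processing can be written with the C++11 thread support library. The `parallelChunks` helper below is a hypothetical sketch, not part of the sample; it hands each thread one contiguous range so that neighbouring particles are mostly handled by the same thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Splits [0, count) into contiguous chunks and runs work(begin, end) on one
// std::thread per chunk. `work` must only touch elements in its own range.
template <typename Work>
void parallelChunks(std::size_t count, unsigned threadCount, Work work) {
    assert(threadCount > 0);
    const std::size_t chunk = (count + threadCount - 1) / threadCount;
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < threadCount; ++t) {
        const std::size_t begin = std::min(count, t * chunk);
        const std::size_t end = std::min(count, begin + chunk);
        if (begin == end) break;  // no elements left for further threads
        threads.emplace_back(work, begin, end);
    }
    for (std::thread& th : threads) th.join();
}
```

A call such as `parallelChunks(particles.size(), std::thread::hardware_concurrency(), updateRange)` would then play the role of the "parallel for" directive, where `updateRange` is whatever per-range update the application defines. Creating threads on every frame is costly, so a real implementation would more likely keep a pool of worker threads alive.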

6 Vulkan Compute in a Modeling Particle System

This section addresses the topic of porting a CPU-based particle simulator onto the GPU using Vulkan compute. Vulkan compute is an integral part of the Vulkan API. Although most of the simulation will be moved onto the GPU, the sorting and generation of particles will still be kept on the CPU.

The goal of this section is to present how to set up the compute pipeline, instantiate more advanced shader input constructs, and show how to send frequently changing, but small, portions of data onto the GPU without using memory mapping.

This is a good moment to mention a tool that is extremely valuable when debugging Vulkan applications. The tool, RenderDoc, can be used to capture the current application state for inspection. It can be used to examine buffer contents, shader input, textures, and much more. RenderDoc is automatically installed with the Vulkan SDK. Although not discussed in this document, it is highly recommended for readers to become familiar with this tool.

6.1 Using GPU memory buffer for computation and visualization

The first step is to repurpose the buffer structure that holds particle data. Among the buffer types available to use on a GPU is a buffer type called the shader storage buffer object (SSBO). This is a type of buffer that is similar to the uniform buffer. However, SSBOs can be much larger than uniform buffers and are writable from within the shader.

The internal storage, which was the standard C-style array contained by the Buffer class, is now replaced with an SSBO. Moreover, the same buffer memory region will be reused as a vertex buffer for rendering. The compute shader will modify the SSBO content, and then the renderer will use it as a vertex buffer object (VBO) to render the frame.

Up until now, the Buffer class was independent of Vulkan. Now, however, the entire application is dependent on Vulkan. This means that before the Model class can be instantiated, the Buffer class must be constructed. And before the Buffer class is constructed, the Vulkan logical device must be created.

To make things appear in a more organized order, the buffer, the device, the AppInstance, and the rest of the supporting code were moved into a new namespace called “base.” Now, instead of creating the Model first, the application starts by creating the AppInstance, device, and buffer objects (GPU_TP41). The code creating the vertex buffer has been moved out of ParticleElement into buffer (GPU_TP42). Note the change in the usage field: it now has two flags assigned, VERTEX_BUFFER_BIT and STORAGE_BUFFER_BIT. The rest of the procedure remains the same. The Buffer class now delivers the VkBuffer handle and an interface to access the buffer on the CPU side (GPU_TP43). The constructed buffer is then used in the instantiation of both the Model (GPU_TP44) and the ParticlesElement (GPU_TP45) objects.

6.2 Moving computation from the CPU to compute shader on the GPU

The Model class is where most of the changes will occur. In this version, the Model class will contain another pipeline which will execute the compute shader.

First, one of the biggest changes is in how the interactors are generally implemented. Although the interactor logic was moved into the compute shader, there is still a need for a mechanism that lets the application logic specify the properties of each interactor dynamically. Because of the structure of the GLSL language, it is no longer possible to implement interactors using an abstract interface. This requires the interactors to be split by their type, with specific data provided for each type. To do this, we define a structure (GPU_TP46) called Setup to contain arrays for each type of interactor. Each array defines properties for an interactor and allows an arbitrary number of interactors of a given type to be present up to a specified upper boundary. The count variables specify the actual number of active interactors.

When submitting complex structures to buffer objects, it is critical to adhere to the layout rules that apply for the given type of buffer (note the alignment statement in the structure field’s declaration (GPU_TP47)). For uniform or shader storage buffer objects, these layouts are std140 and std430. Their full specification can be found in the Vulkan specification and GLSL language specification.
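To illustrate why the alignment statements matter, the sketch below mirrors a std140-style block on the CPU side. The two interactor types, their fields, `kMaxPerType`, and `addConstantForce` are all hypothetical; the sample's actual Setup structure may look different. The rules relied on are that std140 aligns a three-component vector and every array element of structure type to 16 bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr int kMaxPerType = 4;  // assumed upper bound per interactor type

// GLSL side (assumed): struct ConstantForce { vec3 acceleration; };
struct alignas(16) ConstantForce {
    float acceleration[3];
};

// GLSL side (assumed): struct GravityPoint { vec3 position; float strength; };
struct alignas(16) GravityPoint {
    float position[3];
    float strength;  // fits into the 4 bytes left after the vec3
};

// CPU mirror of a std140 uniform block holding both arrays and their counts.
struct Setup {
    ConstantForce constantForces[kMaxPerType];
    GravityPoint gravityPoints[kMaxPerType];
    std::int32_t constantForceCount;
    std::int32_t gravityPointCount;
};

static_assert(sizeof(ConstantForce) == 16, "array stride must be 16 under std140");
static_assert(sizeof(GravityPoint) == 16, "array stride must be 16 under std140");
static_assert(offsetof(Setup, gravityPoints) == 64, "second array follows the first");
static_assert(offsetof(Setup, constantForceCount) == 128, "counts follow both arrays");
static_assert(offsetof(Setup, gravityPointCount) == 132, "ints are 4-byte aligned");

// Appends a constant-force interactor unless the upper bound is reached.
inline bool addConstantForce(Setup& setup, float x, float y, float z) {
    assert(setup.constantForceCount >= 0);
    if (setup.constantForceCount >= kMaxPerType) return false;
    ConstantForce& slot = setup.constantForces[setup.constantForceCount++];
    slot.acceleration[0] = x;
    slot.acceleration[1] = y;
    slot.acceleration[2] = z;
    return true;
}
```

Without the `alignas(16)`, the first structure would occupy 12 bytes on the CPU and every array element after the first would be read from the wrong offset by the shader.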

As in the CPU version, interactors are initialized before constructing the Model object (GPU_TP48). However, this time the initialization is based on populating the Setup structure. Later, the entire structure will be uploaded into the GPU during construction (GPU_TP49), making it available for the compute shader to access.

In the Vulkan API the compute stage and graphics stage cannot be bound into a single pipeline. As a result, two separate pipeline objects are required. Compared to the graphics pipeline, the compute pipeline (GPU_TP50) is less complicated. The primary concern is to define the proper pipeline layout. Beyond that, the process of creating the compute pipeline is nearly the same as creating the graphics pipeline.

Once the pipeline is established and the necessary data is ready, the procedure starts recording commands into the command buffer for the compute pipeline (GPU_TP51). However, there are two concepts that are worth examining.

The first concept is “push constants.” Push constants are small regions of memory residing internally in the GPU designated for fast access. The advantage of using push constants is that data delivery can be scheduled right within the command buffer (GPU_TP52) without the need for memory mapping. The memory layout of push constants needs to be defined during pipeline construction, but the process is much simpler compared to the descriptor set (GPU_TP53).

The second concept is related to the division of workload. The number of elements is split into workgroups that are scheduled to execute on the compute device. The size of the workgroup is specified within the compute shader (GPU_TP54). Then the API is given the number of workgroups scheduled (GPU_TP55) for execution of the compute job. The sizes and number of workgroups are both limited by the hardware. Those limits can be checked by looking at the maxComputeWorkGroupCount and maxComputeWorkGroupSize fields, which are part of the VkPhysicalDeviceLimits structure (which is a part of the VkPhysicalDeviceProperties structure that is obtained by calling vkGetPhysicalDeviceProperties). It is imperative to schedule a sufficient number of workgroups with an appropriate size to cover the entire set of particles. The optimal choice of sizes is dictated by the specific hardware architecture and influenced by the compute shader code (e.g., the internal memory requirements).
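Covering the entire set of particles amounts to a rounded-up division. The helpers below are a hypothetical sketch for a one-dimensional dispatch; `maxGroups` stands for the first element of maxComputeWorkGroupCount as reported by the device.

```cpp
#include <cstdint>

// Number of workgroups needed so that groups * localSize >= particleCount.
// localSize is the workgroup size declared in the compute shader.
constexpr std::uint32_t workgroupCount(std::uint32_t particleCount,
                                       std::uint32_t localSize) {
    return particleCount / localSize + (particleCount % localSize != 0 ? 1u : 0u);
}

// True when the dispatch fits the device limit for the x dimension.
constexpr bool fitsDeviceLimit(std::uint32_t groups, std::uint32_t maxGroups) {
    return groups <= maxGroups;
}

// Invocations in the last workgroup that have no particle to process.
constexpr std::uint32_t surplusInvocations(std::uint32_t particleCount,
                                           std::uint32_t localSize) {
    return workgroupCount(particleCount, localSize) * localSize - particleCount;
}
```

For 1000 particles and a workgroup size of 256, four workgroups are dispatched and 24 invocations are surplus, so the shader has to compare its global index against the particle count before touching the buffer.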

6.3 Structure of the compute shader-oriented simulation

With the transition to the compute shader, the interactors take the form of two components. The first is the set of interactor parameters, which are delivered through the uniform buffer (GPU_TP56). The second is the function set, wherein each function corresponds to a specific interactor and calculates the acceleration vector for the processed particle (GPU_TP57).

Each invocation of the compute shader first obtains a particle from the SSBO using the global ID index (GPU_TP58). The compute shader then iterates through each interactor descriptor and invokes its function as many times as the count field indicates (GPU_TP59). Once all interactors have been executed, the acceleration vectors are summed and the compute shader invokes the Euler step function (GPU_TP60). The particle is then saved back to the SSBO.
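The flow of one shader invocation can be emulated on the CPU to make the structure explicit. Everything in the sketch below is hypothetical: the two interactor types, the Setup fields, and the `invoke` and `dispatch` functions only model the steps listed above and are not the sample's shader code.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Vec3 { float x, y, z; };
struct Particle { Vec3 position; Vec3 velocity; };

inline Vec3 add(const Vec3& a, const Vec3& b) { return Vec3{a.x + b.x, a.y + b.y, a.z + b.z}; }
inline Vec3 scale(const Vec3& v, float s) { return Vec3{v.x * s, v.y * s, v.z * s}; }

constexpr int kMaxPerType = 4;

// Stand-in for the uniform data: one array and one count per interactor type.
struct Setup {
    Vec3 constantForces[kMaxPerType];
    int constantForceCount;
    float dragCoefficients[kMaxPerType];
    int dragCount;
};

// Models a single shader invocation identified by its global index.
inline void invoke(std::uint32_t gid, std::vector<Particle>& ssbo,
                   const Setup& setup, float dt) {
    if (gid >= ssbo.size()) return;  // surplus invocation in the last workgroup
    Particle p = ssbo[gid];          // load the particle from the buffer
    Vec3 a{0.0f, 0.0f, 0.0f};
    for (int i = 0; i < setup.constantForceCount; ++i)
        a = add(a, setup.constantForces[i]);
    for (int i = 0; i < setup.dragCount; ++i)
        a = add(a, scale(p.velocity, -setup.dragCoefficients[i]));
    // Euler step (assumed order): position uses the old velocity.
    p.position = add(p.position, scale(p.velocity, dt));
    p.velocity = add(p.velocity, scale(a, dt));
    ssbo[gid] = p;                   // store the particle back
}

// Runs every invocation serially; a GPU would run them concurrently.
inline void dispatch(std::uint32_t groupCount, std::uint32_t localSize,
                     std::vector<Particle>& ssbo, const Setup& setup, float dt) {
    assert(dt > 0.0f);
    for (std::uint32_t g = 0; g < groupCount; ++g)
        for (std::uint32_t l = 0; l < localSize; ++l)
            invoke(g * localSize + l, ssbo, setup, dt);
}
```

Since no invocation reads another invocation's particle, the order in which they run does not affect the result, which is what allows the GPU to execute them concurrently.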

After the computing step is done, the renderer is invoked. The renderer uses the same buffer, but accesses memory through the vertex buffer interface, instead of SSBO, to render the scene.

7 Going Further

Building on this work, there are a few options to consider that may improve performance.

7.1 Eliminate the CPU role in computations

Currently, the GPU compute shader only calculates the interactors’ influence on the particles. All sorting and generating of the particles are left to the CPU (see section 5). In the case of Intel's GPU, this approach is not penalized at all due to the uniform access to the memory. However, for other platforms, especially a discrete GPU, allocating buffers using the memory heap with the most efficient access time is a must. This could mean that the address space in such a heap might not be accessible from the CPU through memory mapping. In that case, the only way to access the data would be by copying buffers internally between heaps back and forth, or accessing it through the compute shader.

An approach in which buffers are copied back and forth will almost certainly create a performance bottleneck. Therefore, the compute shader option seems more appealing. Initially, the sorting was left in its sequential form because of the unclear gains of parallel algorithms for this specific scenario. However, a GPU offers many more computing units which could make parallel sorting algorithms, like an “odd-even sort,” deliver better results.
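The structure of an odd-even sort is sketched below as a serial CPU emulation, not as shader code. The `oddEvenSort` function and the use of 1 and 0 as active and inactive keys are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Odd-even transposition sort on keys (1 = active, 0 = inactive), placing
// active entries first. After n passes an array of n elements is sorted.
inline void oddEvenSort(std::vector<int>& keys) {
    const std::size_t n = keys.size();
    for (std::size_t pass = 0; pass < n; ++pass) {
        // All compare-exchange steps of one pass touch disjoint pairs, so on
        // a GPU each pair could be handled by a separate invocation.
        for (std::size_t i = pass % 2; i + 1 < n; i += 2) {
            if (keys[i] < keys[i + 1]) std::swap(keys[i], keys[i + 1]);
        }
        // On a GPU, all invocations would have to synchronize here before
        // the next pass begins.
    }
    assert(std::is_sorted(keys.begin(), keys.end(), std::greater<int>()));
}
```

The inner loop is where the parallelism lies, while the outer loop is inherently sequential, which leads directly to the synchronization question.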

There is still the issue of synchronization. An instance of the compute shader cannot start the next sorting pass before the previous one has completed. However, if only a single local workgroup is used, then the GLSL language provides a set of Shader Invocation Control Functions (see GLSL specification) that make it possible to set up the proper memory barriers.

Similarly, in the case of the generators, the biggest problem was that the random number generator is not thread-safe and depends on the generator’s internal state. The randomize function can be ported into GLSL, and the coherence maintained using the GLSL atomic memory functions.
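The idea of keeping shared generator state coherent through atomic operations can be sketched on the CPU with `std::atomic`. The `SharedRandom` class is hypothetical and uses a simple linear congruential generator, which is not necessarily the sample's randomize function; in GLSL the same compare-and-swap loop could be built on `atomicCompSwap`.

```cpp
#include <atomic>
#include <cstdint>

// Linear congruential generator whose state may be advanced from several
// threads at once. Multiplier and increment are the widely published
// "Numerical Recipes" constants; the modulus is 2^32 through overflow.
class SharedRandom {
public:
    explicit SharedRandom(std::uint32_t seed) : state_(seed) {}

    std::uint32_t next() {
        std::uint32_t old = state_.load(std::memory_order_relaxed);
        std::uint32_t updated;
        do {
            updated = old * 1664525u + 1013904223u;
            // On failure, `old` is refreshed with the current state and the
            // step is recomputed, so no value is handed out twice.
        } while (!state_.compare_exchange_weak(old, updated,
                                               std::memory_order_relaxed));
        return updated;
    }

private:
    std::atomic<std::uint32_t> state_;
};
```

A single shared state serializes all callers on one memory location, so a per-invocation seed derived from the global index may scale better; that alternative is outside what the text above describes.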

7.2 Parallelizing the performance path by double buffering

For this project, the approach was to first compute the scene and then render it sequentially. The complexity of the rendering in this case was fairly basic, and, therefore, its impact on performance was negligible. However, this will not always be the case. With more involved renderers it is necessary to attempt to perform computations and rendering in parallel. The Vulkan API was designed with this goal in mind and provides the means to synchronize program flow, but does not enforce it. For example, consider the point where the program loop waits for the rendering function to finish (CPU_TP61). The part of the code responsible for presentation obviously has to wait, but the model does not. In fact, the model could start computing the next step as soon as the vertex data is delivered and would only need to wait before entering the setVertexBufferData function. From here the application can be split into two threads and use a producer-consumer setup.

In the GPU version, instead of using a single SSBO, double buffering can be introduced. One buffer is used both as a render-input and compute-input. At the same time, the second buffer is populated by the compute shader with the Euler function output.
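The buffer roles can be modeled on the CPU with two arrays that swap after every step. The `PingPong` class and its one-dimensional `Particle` are hypothetical simplifications; on the GPU the swap would additionally have to wait until both the render pass and the compute pass have finished with the current buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One-dimensional particle, kept minimal for the example.
struct Particle {
    float position;
    float velocity;
};

class PingPong {
public:
    explicit PingPong(const std::vector<Particle>& initial) : front_(0) {
        buffers_[0] = initial;
        buffers_[1] = initial;
    }

    // Buffer that serves as render input and as compute input.
    const std::vector<Particle>& front() const { return buffers_[front_]; }

    // Reads the front buffer, writes the Euler output to the back buffer,
    // then swaps the roles of the two buffers.
    void step(float dt) {
        assert(dt > 0.0f);
        const std::vector<Particle>& in = buffers_[front_];
        std::vector<Particle>& out = buffers_[1 - front_];
        for (std::size_t i = 0; i < in.size(); ++i) {
            out[i].position = in[i].position + in[i].velocity * dt;
            out[i].velocity = in[i].velocity;
        }
        front_ = 1 - front_;
    }

private:
    std::vector<Particle> buffers_[2];
    std::size_t front_;
};
```

Because the compute step never writes to the buffer the renderer is reading, the two can overlap in time, at the cost of doubling the particle memory.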

It is important to remember that there is an upper threshold to what can be achieved, which is limited by the capabilities of the hardware. The compute and render tasks executed on the same GPU will not always provide performance gains because of the finite number of compute units available.

8 Conclusion

Vulkan offers the ability to bypass the CPU bottleneck that exists in previous generations of graphics APIs. However, Vulkan has a steep learning curve and is more demanding when it comes to application design. Nonetheless, despite being very explicit, it is a very consistent and straightforward API.

It is worth remembering these few points which will help with application development, as well as allow achievement of better application performance:

1) Start your development with the validation layers enabled and become familiar with tools like RenderDoc.

2) Carefully design your application as you will have to make many decisions about various details.

3) Pay attention to where your buffer memory is allocated as it might have significant performance implications.

4) Understand the hardware you are working with. Not all the techniques will work equally well on every type of GPU. Although all Vulkan devices expose the same API, their properties and limitations might be fundamentally different from each other.

Notices

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.

Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm.

This sample source code is released under the Intel Sample Source Code License Agreement.

Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

© 2017 Intel Corporation