Upload
others
View
4
Download
0
Embed Size (px)
Citation preview
ParallelTechniquesinModelingParticleSystemsUsingVulkan*API
ExampleofusingGPU-andCPU-basedsolutionsutilizingVulkangraphicsandcomputeAPI
TomaszChadzynskiIntegratedComputerSolutionsInc.
2
TableofContents1Introduction..............................................................................................................................................4
1.1Assumptionsandexpectations...........................................................................................................4
1.2Additionalresources...........................................................................................................................5
1.3Buildingexamples..............................................................................................................................5
2ConceptofaSimulatedParticleSystem....................................................................................................6
2.1Generators.........................................................................................................................................7
2.2Interactors..........................................................................................................................................8
2.3Eulermethodinnumericalsimulationofparticlemovement...........................................................9
2.4Moreoninteractors...........................................................................................................................9
2.4.1Constantforceinteractor..........................................................................................................10
2.4.2Gravitypointinteractor.............................................................................................................10
2.4.3Gravityplanarinteractor...........................................................................................................11
2.4.4Additionalthoughts...................................................................................................................11
3VulkanRendererforCPU-BasedSimulator.............................................................................................11
3.1Physicaldeviceselection..................................................................................................................12
3.1.1Validationlayers:AcrucialstepinVulkanapplicationdevelopment........................................12
3.2Vulkanvirtualdeviceandswapchain...............................................................................................13
3.4Sceneelement..................................................................................................................................14
3.5Graphicsmemoryallocationandperformance................................................................................14
4CPU-BasedParticleSimulator..................................................................................................................15
4.1Particlemodel..................................................................................................................................16
4.2Sceneinteractorsandgenerators....................................................................................................16
4.2.1Performancevs.maintainability................................................................................................17
4.3Lifetimeoftheparticleandtheimportanceofsorting....................................................................17
5MultithreadedApproachtoComputingParticleSystem........................................................................19
5.1Goingfurtherwithparallelization....................................................................................................20
6VulkanComputeinaModelingParticleSystem.....................................................................................21
6.1UsingGPUmemorybufferforcomputationandvisualization........................................................21
6.2MovingcomputationfromtheCPUtocomputeshaderontheGPU...............................................22
6.3Structureofthecomputeshader-orientedsimulation....................................................................23
7GoingFurther..........................................................................................................................................24
7.1EliminatetheCPUroleincomputations..........................................................................................24
3
7.2Parallelizingtheperformancepathbydoublebuffering.................................................................25
8Conclusion...............................................................................................................................................25
4
1Introduction
Parallelprocessinghasbecomeoneofthemostimportantaspectsoftoday'sprogrammingtechniques.TheshiftinparadigmforcedbytheCPUshittingthepowerwallenforcedprogrammingtechniquesthatemphasizedthespreadingofcomputationsovermultiplecores/processors.Asitturnsout,manyofthecomputationtasksthatsoftwaredevelopersfaceareparallelizable.Oneexampleisthecaseofmodelingparticlesystems.Suchsystemsarewidelyusedinmanyfieldsfromphysicalsimulationtocomputergames.
TheVulkan*APIisacollaborativeeffortbytheindustrytomeetcurrentdemandsofcomputergraphics.ItisanewapproachthatemphasizeshidingtheCPUbottleneckthroughparallelism,allowingmuchmoreflexibilityinapplicationstructure.Asidefromcomponentsrelatedonlytographics,theVulkanAPIalsodefinesthecomputepipelinefornumericalcomputation.
ThispaperdiscussesandcomparesaspectsoftheimplementationofaparticlesystemusingCPUsandGPUssupportedbyaVulkan-basedrendererexample.
1.1AssumptionsandexpectationsWithalltheperformancegainsthatVulkanhastooffer,thereisadownsideintheformofarathersteeplearningcurve.Drawingatriangle,thesimplestexampleavailableontheInternetcantakearoundsixhundredlinesofcode.ThisisbecauseVulkanAPIrequiresthedevelopertospecifynearlyallaspectsoftherenderingmechanism.Still,givensomeeffort,VulkanisaclearandstraightforwardexampleforusingAPI.
ThisdocumentdoesnotintendtoteachVulkanbasics.SomebasicknowledgeofVulkanisrequired,however.YoushouldalreadyknowhowtosetupsimpleVulkanapplications.IfyouarenewtoVulkan,thenagoodstartwouldbetoworkthroughtheLunarGtutorial.
Theexamplecodeaccompanyingthispaperisorganizedtoemphasizepresentingarchitecturalconceptswhichleadtotwoimportantassumptions:1)ThereturnstatusfromVulkanfunctioncallsisinmostcasesignoredtoavoidoverlyexpandingthecodebase.Inproduction,everyreturnstatusshouldbecheckedandreactedon;2)Conceptuallyclosetopicsarekeptwithinasinglecodeflow.Thisreducestheneedforjumpingthroughthesources,butsometimesleadstoratherlongfunctions.
Thispaperisrecommendedtobereadwhilelookingattheaccompanyingcode.Makesureyouhavetheexamplesdownloadedanduseyourfavoritecodebrowsingtool.Toavoidlongcodelistings,thispaperreferencesthecodethroughaseriesoftags.Forexample,GPU_TP.15referencespointfifteenintheVulkancomputeversionofthecodebase.Eachtagisuniqueandeasilysearchablewithinthecodebase.
5
Lastly,oneofthemajordesigngoalsofVulkanistogiveasmuchfreedomtotheapplicationdesigneraspossible.Therefore,therearemanywaysofimplementingagiventaskdependingonthedesigngoals,targethardware,scalability,andsoon.Thispaper’sgoalistointroduceandsparksomecreativethinking,ratherthanprovideafinalapproach.
1.2AdditionalresourcesVulkanhasreceivedmuchattention,whichhasresultedinmanygoodinformationsources.Herearefewrecommendedreferencematerials:
1) Vulkanspecificationandquickreference:https://www.khronos.org/registry/vulkan/2) OpenGLandGLSLspecification.WearemostlyinterestedinGLSLasweuseitasa
shaderprogramminglanguage:https://www.khronos.org/registry/OpenGL/index_gl.php
3) VulkanSDKandanexcellenttutorialcoveringVulkanbasics:https://vulkan.lunarg.com/4) AnotherVulkantutorialgoesbeyondbasics,coveringmoreadvancedmemoryhandling
andtextures:https://vulkan-tutorial.com/5) DatabaseofhardwaresupportingVulkan.Agreatsourceforcapabilitiesthatareoffered
bythevarietyofhardware:http://vulkan.gpuinfo.org/6) AlargesetofexamplesinVulkanfrombasictriangletocomputeshader:
https://github.com/SaschaWillems/Vulkan
1.3BuildingexamplesTheprojectbuildisorganizedusingVisualStudio*2017.Overall,itusesthreedependencies:GLM,GLFW,andVulkanSDK.ThefirsttwoareinstalledautomaticallythroughNuGet.TheVulkanSDKshouldbedownloadedandinstalled.Afteropeningtheproject,selectthe"PropertyManager"tab.Fromthere,unfoldthe"Debug|x64"andopenVulkanDir.Thengoto"UserMacros"andsettheVK_DIRpropertytopointtothemainVulkanSDKdirectory.Forexample,"C:\VulkanSDK\1.0.51.0."TheDebugandReleasebothusethesameproperty,soonechangewillaffectboth.Fromtherejustclick“Build.”
6
2ConceptofaSimulatedParticleSystem
Ourgoalistomodelthemovementofmultipleobjects(particles)throughspace.Tomakethingsmoreinteresting,thescenewillcontainentitiesthatwouldalterthepaththroughwhichtheparticleismoving.
Asingleparticleisdefinedasasetoftwovectors,positioninspace,velocity,andanadditionalscalarrepresentingmass.Thesesetofpropertiesdescribethestateofeachparticle:
P={𝒙,𝒗,m} (1)
Wecanlookatthewayaparticlemovesthroughspacewithdisplacementgivenbythedifferentialequation:
𝒅𝒙𝒅𝒕=𝒗 (2)
Fornow,wewilltreatthevelocityasaconstantandsolvefordisplacement:
𝒅𝒙= 𝒗𝐝𝐭
𝒙=𝒗t+C (3)
Wearriveatageneralformulathatmodelshowaparticleismovingthroughspace.Constantvelocityisassumed(butwillchangesoon).Also,wemusthaveconstantC,avaluewhichmustbeknown,inordertodescribeoursystemcompletely.TheconstantCisrelatedtotheinitialconditionsofthisdifferentialequation.Inotherwords,theparticleatthebeginningofitslifehastohaveapredefinedstate.Thisrequiresusinggenerators.
7
2.1GeneratorsTheconceptofageneratoristobeafunctionwhichfulfillsthetaskof"placing"anewparticleintothesceneinsomepredefinedmanner.Thisdemonstrationdefinesasinglesprinklergeneratorwithfunctionalityconsistingofspawningparticlesatapredefinedpointinspacewithgivenvelocityandmass.
Thepositionofthepointoforiginisgivendirectlyasaparameter.ThedirectionisdefinedusingsphericalcoordinateswhicharetheninternallyconvertedintotheEuclideancoordinatesystem.Theanglesaredefinedasfollows:
1)𝛳(theta):Anglestartsatx-axisandrotatescounterclockwise.2)𝜙(phi):Anglestartsfromz-axistowardsxy-plane.3)Thevalueofspeedisgivenasaseparateargument.
ToconvertintoEuclideancoordinates,weusethefollowingsetofequations:
x=sin(𝜙)cos(𝜽),y=sin(𝜙)sin(𝜽),z=cos(𝜙) (4)
Goingbacktoourgeneralequationforparticlemotionandfactoringinthegenerator,wecandeterminethatattimet=0thepositionofaparticlebecomesx0:
𝒙0=𝒗0∗0+C→C=𝒙0
𝒙=𝒗t+𝒙0 (5)
Atthispoint,wehaveaformulathatwecanusetodescribethemotionofaparticle.However,ourscenewillnotlookthatinterestingasalloftheparticleswouldonlymoveinastraightlineandataconstantspeed.Togiveourscenemoredynamicbehavior,weintroducetheconceptofinteractors.
8
2.2InteractorsTounderstandhowwecanconsistentlyinfluencethemotionofaparticle,westarttreatingvelocityasavariableinsteadofconstant.Thechangeofvelocityisgivenbythefollowingformula:
𝒅𝒗𝒅𝒕=𝒂 (6)
Thisequationlookssimilartothepreviousformulafordisplacement.Bysolvingthisdifferentialequation,wearriveattheformulaforvelocity:
𝒗=𝒗0+𝒂t (7)
Therefore,toconsistentlychangeparticlemotionweinfluencethevelocityindirectlythroughchangingacceleration.Wedefinetheinteractorasafunctionthat,forthegivenparticlestateandgiveninteractorproperties,returnstheaccelerationvector:
f(p,in)→𝒂 (8)
Unfortunately,definingtheinteractorasafunctionbecomesproblematic.Whenwesubstitutebackintothevelocityequation,weseethatthedifferentialequationhasapotentialtobecomeunsolvable:
𝒅𝒗𝒅𝒕=f(p,in)
(9)
Thisisnotwhatwewant.Ourgoalistohavetheabilitytointroducedifferenttypesofinteractorswithoutbeingworriedaboutbreakingourmodel.Todothat,weneedtoimplementtheEulermethodofsolvingdifferentialequationstoapproximatetheparticlemovementnumerically.
9
2.3EulermethodinnumericalsimulationofparticlemovementTheEulermethodworkingprincipleistostepthroughthedifferentialequationinsteadofsolvingacrosstheentiredomain.Ifweassumeasufficientlysmalltimestep,wecanapproximatethevelocity(2)andacceleration(6)tobeconstantswithinthespanofasinglestep.Thisallowsustofallbacktotheequations(5)and(7),anddiscretizethemsotheycanbeprogrammed.Therefore,thefinalsetofequationstobeusedisasfollows:
𝒙[n]=𝒙[n-1]+𝒗[n]*∆t
𝒗[n]=𝒗[n-1]+𝒂[n]*∆t (10)
Ifthemathsofarinthischapterhasgivenyoutrouble,don'tgetdiscouraged.Itisenoughtounderstandthefinalequations(10)toworkwiththerestofmaterial.Fromnowonwewilltreattheaccelerationvectorasaconstantwithinthesingleiteration.Next,wedivedeeperintointeractorsthataffectthemotionofaparticle.
2.4MoreoninteractorsWeestablishedearlierthattheparticlemotionwithinthescenewouldbeinfluencedbychangingacceleration,usingNewton'ssecondlaw:
𝒂=𝑭𝒎
(11)
Weseethatthetaskofaninteractorshouldbetogeneratesomeforceandconvertitintoaccelerationusingthemassoftheparticle.Moreover,wecansumuptheforcescomingfrommultipleinteractorstogether:
𝑭𝑵𝑬𝑻 = 𝑭𝒊𝒊
(12)
10
Therefore,ourfinalapproachisasfollows:
𝒂𝑵𝑬𝑻 = 𝒂𝒊𝒊
= 𝑭𝒊𝒎𝒊
(13)
Themainreasontonotmultiplyoutthemassisthatinteractorsthatsimulategravitydon'tuseit.Inotherwords,theaccelerationduetogravitydoesnotdependonthemassofanobjectonwhichgravitationalpullisexerted.Itisalsopossiblethatsomeinteractorswhichdon’tmodelphysicallyaccuratebehaviorcouldsimplyignoremassaswell.Thisisanoptimizationstepwhichcouldsaveasignificantnumberofcyclesspentonmultiplication.
2.4.1Constantforceinteractor
Thisisthesimplestformofaninteractor.Itsfunctionistoexerttheforceonaparticleregardlessofanyotherproperties.Itcouldbeusedtomakeaparticlegotowardssomepredefineddirection:
𝒂i=𝑭𝒊𝒎(14)
2.4.2Gravitypointinteractor
Thisisthesecondtypeofinteractor.Itrepresentsasinglepointthatexertstheforceofgravityonparticlesinthescene.Suchapointhasmassandposition,butinourexampleitdoesnothavevolume.Itusesthestandardgravityequation:
𝑭=𝑮𝑴𝒎𝒅𝟐
𝒓 (15)
Nowwecanseewhywechosenottoincludethemassofaparticlewithineachinteractor.Whenderivingtheformulaforacceleration,weendupwith:
𝒂=𝑮𝑴𝒅𝟐𝒓 (16)
Inthisequation,Gisthegravitationalconstant,Misthemassofthegravitypoint,drepresentsthedistance,andristheunitvectorpointingtowardthegravitypointinteractorfromthelocationoftheparticle.
11
2.4.3Gravityplanarinteractor
Thisisthelasttypeofinteractorimplementedforthisexample.Itismodeledasaninfiniteplaneexertinggravitationalforceontheparticlesinthescene.Inthiscase,thervectorisalwaysperpendiculartotheplane.Therefore,foranygivenparticlelocation,thedistancehastobecalculated,assuch:LetP1={x1,y1,z1}representaparticleinspace.Theplaneisdescribedbyitsnormalvectorn={a,b,c},andapointonthatplane,P={x,y,z}.ThedistanceDisgivenby:
D=𝒂𝒃𝒔(𝒅𝒐𝒕 𝒏,𝑷𝟏 G𝒅𝒍𝒆𝒏𝒈𝒕𝒉(𝒏)
(17)
d=-dot(𝒏,𝑷) (18)
ThistypeofinteractorcanbeusedtomodellargeobjectscomparedtoscenesizesuchastheEarth.
2.4.4Additionalthoughts
Simulatingphysicallycorrectbehaviorisnottheonlywaytobuildinterestingparticlesystems.Infact,settingupaphysicallyaccuratesysteminequilibriumcouldbeveryhard.Asimilareffectcouldbeachievedbyusingnon-physicallyaccurateinteractors.Suchfunctionscanlookintotheentirestateofaparticleandgenerateaccelerationthatwillattempttoforcecertainbehavior.
3VulkanRendererforCPU-BasedSimulator
ThedesignoftherenderercomponentusedinvisualizingtheCPUversionoftheparticlesimulatorwillbedescribedfirst.IfyouarenotfamiliarwithVulkan,itisagoodideatopausehereandreviewoneofthetutorials.TheLunarGtutorialcoverssufficientmaterialtounderstandtherestofthispaper.Also,rememberthatVulkanleavestheapplicationdesigntotheprogrammerandallowsforavarietyofwaystoapproachatask.Takethisguideasoneoftheoptionsavailableratherthanthedefinitivesolution.
Atahighlevel,whenprogrammingusingVulkan,thegoalistoconstructavirtualdevicetowhichdrawingcommandswillbesubmitted.Thedrawcommandsaresubmittedtoconstructscalled“queues.”Thenumberofqueuesavailableandtheircapabilitiesdependsuponhowtheywereselectedduringconstructionofthevirtualdevice,andtheactualcapabilitiesofthehardware.ThepowerofVulkanliesinthefactthattheworkloadsubmittedtoqueuescouldbeassembledandsentinparalleltoalreadyexecutingtasks.Vulkanoffersfunctionalitytocoherentlymaintaintheresourcesandperformsynchronization.
12
Inourapplication,wetooktheapproachofbreakinguptheVulkanrenderersetupintofivecomponents(seegraphicbelow).Thegoalistopresentaneasy-to-understandcodebasefortheVulkanapplication.
3.1Physicaldeviceselection
ThefirststepineveryVulkanapplicationistocreatetheInstanceObject.ThisfunctionalityisdeliveredthroughVkInstance.ItisrequiredthatanapplicationhasaVkInstancecreatedbeforestartinganythingelse.TheVkInstanceistheinterfaceusedtoinspectthehardwarecapabilitiesofacomputersystem.ThisfunctionalityisencapsulatedintheAppInstanceobject(CPU_TP1).TheAppInstanceobjectisessentiallyawrapperexposingtheinterfaceforconvenienthardwaredeviceenumeration.
3.1.1Validationlayers:AcrucialstepinVulkanapplicationdevelopment
OneofthemostcriticalstepswhendevelopingwithVulkanistoenablethevalidationlayers(CPU_TP2).Bydesign,forperformancereasons,Vulkandoesnotcheckthevalidityoftheapplicationcode.Whenusedincorrectly,Vulkancallswouldsimplyfailorcrashwithvirtuallynotracesthatwouldallowfordebugging.Withvalidationlayersenabled,theVulkanloaderinjectsalayerofcodeforverification.Theenablingofvalidationlayersisacriticalstepthateverydevelopershouldremember.However,thatlayeraffectsperformanceandshouldbedisabledinproduction.
13
3.2VulkanvirtualdeviceandswapchainThenextstepistocreatetheactualvirtualdevicethatwillexposethecommandqueueusedtorenderthescene.First,asurfaceobjectmustbeinstantiated.(Thetopicofthesurfaceisoutofthescopeofthisarticle,butitiscoveredindetailintheLunarGtutorial.)However,inthiscase,thedemocodesupportsWindows*only,andthesurfaceiscreatedandexposedbytheSurfaceWindowsclass(CPU_TP3).
NextiscreatingtheDeviceobject(CPU_TP4).NotethattheDeviceclassencapsulatesmuchmorethanVkDevice.Startingwiththeconstructor,thedeviceclassexposesmultipleversions,whichallowsmoreflexibilityininitialization(CPU_TP5).Thephysicaldeviceusedtocreatethevirtualdevicecanbe"firstfit"oritcanbespecifiedusingaVendorIDnumber,ifthesoftwareistousespecifichardware.End-usersystemscanhavemultipledevicesthatsupportVulkan,andthefirstoneonthelistisnotalwaysthedevicedesired.
Thedefaultconstructor(CPU_TP6)obtainsthelistofavailablephysicaldevicesandthenrunsthecreateDevicemethod.Ifthecreationwassuccessful,thisphysicaldeviceisthenused.Ifthecreationwasnotsuccessful,theconstructortriesthenextphysicaldevice.TheactualdevicecreationisdoneinthecreateDevicemethod(CPU_TP7).Thefirststepistodetermineifthephysicaldevicecandeliverqueueswiththerequiredfunctionality(CPU_TP8).Devicescandelivermanytypesofqueues.Somequeuesmaysupportalloftherequiredoperations.Somequeuesmaybehighlyspecializedforthespecifictasks,suchascomputeortransfer.Inthiscase,theapplicationexpectsthequeuetoallowfortransferanddrawinggraphics.Theapplicationwilliterateoveravailablequeuesuntilanindexisfoundthatmatchestheexpectedcriteria.
Oncecompleted,thevirtualdeviceiscreated.BesidesVkDevice,theobjectsuppliesthequeuehandlersandswapchainthatcorrespondtothevirtualdevice.(Formoredetails,pleasechecktheVulkantutorialsandreference.)
3.3Scenesetup
14
TheParticlesElement(CPU_TP9)taskistodrawtheparticlefountain.Forittobedrawn,ithastobelongtothescene.TheSceneobjectroleistoaggregateallthesceneelements,maintaincommandbuffer,andtherenderpass(CPU_TP10).Whenthesimulatorisreadytorender,itcallstherenderfunction(CPU_TP11).Thisfunctionisresponsiblefirstforinitiatingthecommandbufferforrecording,obtainingthecurrentrendertarget,andfinallystartingtherenderpass(CPU_TP12).Oncethatisdone,therendererbeginsiteratingovereveryelementinthescene(CPU_TP13),allowingittorecorddrawingcommandsintothecommandbuffer.Inthiscase,thereisasinglesceneelementthatdrawstheparticlefountain.Afterallsceneelementsfinishrecording,therenderfunctionsubmitsthecommandbufferintothequeueandpresentstheresultsonthescreen.
3.4SceneElementTheSceneElement(CPU_TP14)representsthesingleelementorentitywithinthescene.Theexpectedscopeofsceneelementobjectsistohandleeverythingrelatedtorenderingasingleentityandmaintainingrequiredresources,likethepipelineandbuffers.TheParticlesElementfocusesitsentireoperationaroundtwomethods.Firstistheconstructor(CPU_TP15)thatinitializestherenderingpipeline.SecondistherecordToCmdBuffer(CPU_TP16)thatisinvokedbytheSceneclasstorecordelement-specificdrawcallsintothecommandbuffer.
3.5GraphicsmemoryallocationandperformanceThereisoneimportantaspectthathastobeconsideredwhenthememoryforthebuffersisallocated.TheVulkandevicecanexposedifferentmemorytypesthatitcanuse.TheMemorytabinthehardwaredatabaseshowsthatdevicesexposemultipleheapsfromwhichmemorycouldbeallocated.Lookingatthelistofmemorytypes,eachtyperepresentstwovalues:thememorytypeflagsthatdescribepropertiesandtheheapthatisassociatedwiththem.(Seehardwaredatabasereferencedinsection1.2.)
Thetypeofheapselectedcouldsignificantlyaffectperformance.Forexample,somedevices,mainlydiscretegraphicscards,exposeoneheapwithonlyDEVICE_LOCAL_BITsupported,whichisthefastestmemoryavailableforthatdevice,butdoesnotallowdirectaccessfromtheCPU(moreonthatinamoment).AnotherheapwiththeHOST_VISIBLE_BITallowsdirectaccess,buthaspotentiallylongeraccesstime.
Asoftoday,mostofIntel'sintegratedGPUsexposeheapswithbothDEVICE_LOCAL_BITandHOST_VISIBLE_BIT.Thepresenceofthefirstoneensuresoptimalaccesstimefortheplatform.ThesecondallowsthememorymappingoftheallocatedbufferintotheCPUaddressspaceanddirectaccess(CPU_TP17,CPU_TP18).
15
Whenselectingamemoryheapthereisathree-stepprocess.Thefirststepistospecifyandcreatethebuffer(CPU_TP19)(notethatvkCreateBufferdoesnotallocatememoryforthebufferyet!).ThesecondstepistousetheVulkanAPItoobtainthelistofmemorytypesthatcansupportthedefinedbuffer(CPU_TP20).Thethirdstepistousetheselectedmemoryindextoallocatethebuffermemory.OfparticularinterestisthememoryTypeBitsfieldintheVkMemoryRequirements.Thisfieldisabitstringthathasabitsetto1atthepositionofthebitthatcorrespondstothememorytypeindexofthememorytypethatcanbeusedtoallocatebuffermemory.Ifthebitcontainsavalueof0,thenamemoryheapindexcannotbeusedwiththeparticularobjectofinterest.
Thefirststepabovewilllikelyprovidemultipleoptions.Therefore,inthesecondstep,theexactmemorytypedesiredmustbechosen.Inthiscase,onIntel’sGPU,allofthememorytypeshaveDEVICE_LOCAL_BITset.HOST_VISIBLE_BITandHOST_COHERENT_BITalsomustbeset.TheHOST_COHERENT_BITguaranteesthatmemoryoperationisobservedwithouttheneedtomanuallyflushthebuffers.IfthememoryTypeBitissetandthememorytypeatthegivenindexsupportstherequestedflags,thenthespecificindexisselectedtobeused(CPU_TP21).Inthethirdstep,thatindexisusedwhenallocatingtheactualmemoryontheheap(CPU_TP22).
Thismemoryallocationtopicisquiteimportant,andyouareencouragedtolookfurtherintoit.TopicslikestagingbufferorVulkanbuffercopyoperationswouldbeagoodnextstep,especiallyfordeviceswithdedicatedmemory.
4CPU-BasedParticleSimulator
Section2describesthemodeloftheparticlesimulatorthatwasimplementedforthisproject.Theimplementationisbasedonareal-timeinfiniteloop.Thesimulationloopcontainsthreemainoperations:
1) Obtainthetimedifferencebetweenthecurrentandpreviousiteration(CPU_TP23).2) ComputethenextEuleriterationfortheparticlemodel(CPU_TP24).3) UploadandrenderanewframeusingtheVulkanrenderer(CPU_TP25).
Theentirephysicalsimulationhappenswithintheprogressmethod(CPU_TP26)withinthealgorithm,anditiscomposedofthreemajorsteps:
4) Foreveryparticle,thealgorithmlaunchestheinteractorsandaddsuptheaccelerationvectors(CPU_TP27).
5) Thealgorithmsortstheparticlelistinsuchawaythatalloftheactiveparticles(TTL>0)occupytheleftsideofthelist,andalloftheinactiveparticlesoccupytherightsideofthelist(CPU_TP28).
6) Launchesparticlegenerator(s)inordertoaddnewparticlestothelistifthegenerator’sconditionshavebeenmet,andthereisspaceleftinthebuffer(CPU_TP29).
16
Withthissimplemodel,manyinterestingeffectsmaybegenerated.Althoughthisimplementationusesareal-timeapproach(i.e.,theactualhardwareclockisusedtoobtainthetimedifference),thecodecaneasilybemodifiedtouseastepwiseapproach.Animportantpointtorememberisthatthetimedifferencehastobesufficientlysmall,otherwiseitwillbreakthemodel.Knowingthat,itcanbedeterminedwhetherornotthisapproachisagoodchoiceforasystemthatiscapableofspinningthemainloopatasufficientspeed.Ifthatisnotthecase,considerswitchingfromareal-timeapproachtoanarbitrarilydeterminedtime-step,whichwillnotusethesystemclock.
4.1ParticlemodelTorepresentthesetofparticlesinascene,theBufferclass(CPU_TP30)isused.Itisarelativelysimpleclassthatencapsulatesamemoryarrayofparticlestructures(CPU_TP31).Theclassalsodeliversafewconveniencefunctionsandthearraysortingalgorithm.
Thepurposeofthesortingalgorithmistoplaceparticlesintotwogroupswithintheparticlearray.Theactiveparticlesareplacedontheleftsideandtheinactiveonrightside.Itisimplementedasfollows:
1) InitializeLindexatthefirstelementandPindexatthelastelement.2) ProgressLtowardtheendofthearrayuntilaparticlewithTTL<=0isencountered.3) ProgressPtowardbeginningthearrayuntilaparticlewithTTL>0isencountered.4) Swaptab[L]withtab[P].5) RepeatwithL+1.
IfatanytimeL==P,thefinishconditionisreached.
4.2Sceneinteractorsandgenerators
Bothinteractorsandparticlegeneratorsinfluencethesceneduringthesimulation.Withinthemodel,botharerepresentedbytheircorrespondingabstractinterfaces.TheinteractorsuseBaseInteractor(CPU_TP32)andgenerators,BaseGenerator(CPU_TP33).BothneedtobedefinedbeforetheModelobjectisinstantiated(CPU_TP34).
Inthecaseoftheinteractor,allthatthemodelrequiresistoobtaintheaccelerationvectorgiventheparticlestateandtimedifference.Internally,theinteractorcanperformarbitraryoperations,andinteractorscanbefundamentallydifferentfromeachother.Forexample,theConstForceInteractor(CPU_TP35)usesonlyforcemagnitude,directionvector,andparticledata.
17
TheBaseGenerator,asidefromacommoninterface,deliversfunctionstogenerateparticlepropertieslikespeedwithrandomizeddeviations(CPU_TP36).However,itleavestheexactparticlelocation,therateofappearance,andotherdetailstothederivedclasses.Inthisexample,pointgenerator(CPU_TP37)isused.Thepointgeneratorissetinsuchawaythatitactsasparticlesprinkler(CPU_TP38).Itisgivendirectioninpolarcoordinatesspecifiedindegrees,andemitsparticlesaccordingtoadefinedrate.
4.2.1Performancevs.maintainability
Thereexistsaperformancebottleneckintheimplementation.Aspecificdesignchoicewasmadetoprioritizemaintainabilityoverperformance.TheproblemoriginatesfromthepolymorphiccallintothecomputeAccelerationmethod(CPU_TP39).
Polymorphiccallstendtobeslowerthanregularfunctioncalls,andeveryinteractorisexecutedforeveryparticle.Itisclearhowthiscouldbecomeasignificantfactorforalargenumberofparticles.However,theadvantageinthiscaseisthatthereisaneasywaytoimplementnewinteractorsintothemixwithnoneedmakechangestotheModel'scode.
Thealternativeherecouldbetoimplementsomemechanismthatwoulddetectthetypeofinteractorandexecuteitsproceduresusingaconditionalstatement.Thiswillimproveperformance,butalsoincreasethecomplexityofthecode.
Whichapproachtouseisacase-by-casedecisionthatdependsonprojectgoals.Whenasmallnumberofparticlesisexpectedwithalargevarietyofinteractorspresent,thenthevirtualmethodapproachisagoodchoice.However,ifthenumberofparticlesislarge,thenswitchingtoothersolutionscouldbeagoodwaytoimproveperformance.
4.3LifetimeoftheparticleandtheimportanceofsortingThereisonelasttopicrelatedtoparticlegenerationandparticlelifetime.Onceaparticleismovingthroughthescene,itcaneasilyflyoutsideofthevisiblearea.Sinceitisnotvisible,thereisvirtuallynopurposeinmodelingit.Sometimesthegoalistomodeleffectslikeaflamewithaspecificheight.Toimplementsuchcases,everyparticlemusthaveapropertythatdeterminesitslifespancalled“TimeToLive”(TTL).Ateachiteration,thetimedifferenceissubtracted,andifthevalueisequaltoorbelowzero,theparticleisconsideredinactive.Whenparticlesbecomeinactive,nooperationsareperformedonthemandtheyarenolongerdrawn.
18
Therearetwoissuesrelatedtothecaseofinactiveparticles.First,isthatawayisneededtofreethespaceupfornewparticles.Second,isthatinactiveparticlesshouldnolongerbedrawnonthescreen.Thisprojectexampleusesasimplerenderingmechanism.However,inthecaseofamorecomplicatedmulti-passrendererwithmorecomputation-heavyshaders,itisagoodideatoavoidprocessingelementsthatwouldnotbevisibleonthescreen.Thisiswheresortingcomesin.Sortingtheparticlearrayallowsforsubmittingdrawcallsthatspecifyonlythenumberofparticlesthatshouldappearonthescreen.Italsosimplifiestheworkofgeneratorsastheydonothavetosearchthroughthebufferforfreeslotstofill.
Thereisapossiblealternativeapproach.Insteadofthecontiguousarray,alinkedlistcouldbeused.Thiswouldeliminatetheneedforsorting,but,ontheotherhand,itwouldslowdowntheuploadofdatatotheGPUbuffersbecauseoftheneedtoconvertthelinkedliststructureintoacontiguousarray.
19
5MultithreadedApproachtoComputingParticleSystem
Thespeedofthesimulationcouldbeimprovedbyparallelizingtheworkload.Thereareafewcasestoconsider:
1) Parallelizetheworkloadacrossthearrayofparticles,calculatingtheinteractors’contributionandEulerstepasasingleworkunit.
2a) Parallelizetheworkloadacrossinteractors,calculatingtheeffectofeachinteractorateveryparticleasasingleworkunit.
2b) FurtherparallelizeOption1,above,bysplittinginteractorcontributionsintoseparateworkunits.
3) Parallelizethesortingalgorithm.4) Parallelizethegenerators’contribution.
Option1,above,istherecommendedapproach.Foreachparticle,thealgorithmcomputes,asasingletask,thecontributionfromeachinteractor,summingtheaccelerationvectorsandthenperformingtheEulerstep(CPU_TP40).Sinceeachparticleresidesinadifferentmemoryregion,andinteractorsusuallydonotmodifytheirinternaldatawhencomputingparticles,thismethodprovidesthemostoptimalsolution.TheOpenMP*frameworkwasusedtoprovidemultithreading.OpenMPisanAPIthatisbasedprimarilyoncompilerdirectivesandusedformultithreaded-orienteddevelopment.Ithasloweroverheadregardingrequiredcodetobewritten.Itcouldbeeasilyusedtoparallelizeloops,buthasmuchmoretooffer.However,sometimesitmightnotdeliverthelevelofdetailassomeotheravailablelibraries.
Option2a,isprobablynotrecommended.Somemightfindthatthisoptionisappealingfromtheperspectivethatthedatathatdefinesthepropertiesofaninteractorwouldhavetobeloadedonce,andthenusedextensivelywhilecomputingeachparticle.ThisissomethingthatwillbesupportedbytheCPUcache,significantlyspeedinguptheprocess.However,thereareissueswiththisapproach.Forexample,accelerationmustbesummedup,whichmeanstherewillbeaneedtosynchronizeaccessbetweenthreadsexecutingtheinteractorsandthenstoretheaccelerationvectorforeveryparticleinanadditionalmemoryregion.Moreover,theentirearraywillrequireanadditionalpassperformedtocalculatetheEulerstep.Thesetwodisadvantageswouldproducesignificantsynchronizationoverhead,likelyrenderingthisstrategyimpractical.
Option2bisanextensionofOption1.Hypothetically,iftherearecomputationallydemandinginteractors,itcouldbebeneficialtosplittheiroperations.However,thegoalistoachieveminimalcontextswitching.Ifeveryparticlecomputationandeveryinteractorcontributionisparallelized,itwillmostlikelyleadtoanexplosionofthreads.Switchingbetweenthosethreadswouldfurtherdecreaseperformance.
20
Option3suggestsimprovingthesortingalgorithmthroughparallelization.Thisstrategymightimprovesortingspeed;however,thetypeofdatamustbeconsidered.Inthisparticularcase,aparticleisinanactiveorinactivestate.Theorderoftheparticlesthatareinthesamestateisnotimportant,aslongastheyareallplacedonthecorrectsideofthesortedarray.Thisallowsforthearraytobesortedinasinglepass.Ifaparallelsortingalgorithmisusedinstead,itispossiblethatperformancecouldbeimproved.However,suchimprovementisnotguaranteed.Possibleimprovementsmaybedeterminedthroughexperimentation.
InOption4,parallelizingthegenerationofparticleshasahighpotentialofimprovingperformance,butonlyundercertainconditions.Thefirstconsiderationistodeterminetheworkloadofallofthegenerators.Ifthememorybufferisalmostalwaysfull,therewon’tbeanyspaceforthegeneratorstowork.Therefore,spawningadditionalthreadstoexecutethegeneratorswillonlycreateunnecessaryoverhead.Everygeneratortakesasmuchspaceasneeded,andwhatisleftisassignedtothenextgeneratorinthelist.Thiscouldresultinsomethreadsexecutinggeneratorsjusttofindoutthatthereisnothingforthemtodo.
5.1GoingfurtherwithparallelizationTherearefewmorethingsworthmentioningthatcouldaffectperformance.Firstistheissueoffalsesharing.Asingleparticlestructuresizewas44bytes.Giventhatthecachelinesizewas64bytes,whatwasfetchedfromlowerlevelmemorywasanentireparticlestructure,orpartofaparticlestructure,thatthecurrentthreadwasworkingon,aswellaspartofaparticlestructurelocatedrightnexttoit.Iftwoparallelthreadseachworkonasingleparticlelocatednexttoeachother,themodificationoftheaccelerationvectorinsideonethreadwillinvalidatetheentirecacheline.Essentially,thiswillcausecachemissontheotherthread,eventhoughthethreadswereassignedtodifferentparticles.Thissituationcanbealleviatedbyaligningtheparticlestructurestothe64-bytesboundary.However,thiswillalsoincreasememorywaste.
Thesecondandpreferredsolutionistogroupmultipleparticlestructurestogetheranddirectthemtobeprocessedasasingleunit.ThisiswhatOpenMPdoesbydefaultinthe"parallelfor"directive.
OpenMPisaverygoodframework,butithasdisadvantages.OpenMPisbuiltintoanumberofcompilersthatsupportit.However,notallcompilerssupportOpenMP,orsupportanoutdatedversionofOpenMP.AgreatalternativetoOpenMPisthethreadsupportlibrarythatwasintroducedwithC++11andshouldbesupportedinalmostallC++compilerscurrentlyavailable.
21
6VulkanComputeinaModelingParticleSystem
ThissectionaddressesthetopicofportingaCPU-basedparticlesimulatorontotheGPUusingVulkancompute.VulkancomputeisanintegralpartoftheVulkanAPI.AlthoughmostofthesimulationwillbemovedontotheGPU,thesortingandgenerationofparticleswillstillbekeptontheCPU.
Thegoalofthissectionistopresenthowtosetupthecomputepipeline,instantiatemoreadvancedshaderinputconstructs,andshowhowtosendfrequentlychanging,butsmall,portionsofdataontotheGPUwithoutusingmemorymapping.
ThisisagoodmomenttomentionatoolthatisextremelyvaluablewhendebuggingVulkanapplications.Thetool,RenderDoc,canbeusedtocapturethecurrentapplicationstateforinspection.Thetoolcanbeusedtoexaminebufferscontent,shaderinput,textures,andmuchmore.RenderDocisautomaticallyinstalledwithVulkanSDK.Althoughnotdiscussedinthisdocument,itishighlyrecommendedforreaderstobecomefamiliarwiththistool.
6.1UsingGPUmemorybufferforcomputationandvisualization
Thefirststepistorepurposethebufferstructurethatholdsparticledata.AmongthebuffertypesavailabletouseonaGPUisabuffertypecalledtheshaderstoragebufferobject(SSBO).Thisisatypeofbufferthatissimilartotheuniformbuffer.However,SSBOsaremuchlargerinsizethantheuniformbufferandarewritablefromwithintheshader.
Theinternalstorage,whichwasthestandardC-stylearraycontainedbytheBufferclass,isnowreplacedwithanSSBO.Moreover,thesamebuffermemoryregionwillbereusedasavertexbufferforrendering.ThecomputeshaderwillmodifytheSSBOcontent,andthentherendererwilluseitasavertexbufferobject(VBO)torendertheframe.
Upuntilnow,theBufferclasswasindependentofVulkan.Now,however,theentireapplicationisdependentofVulkan.ThismeansthatbeforetheModelclasscanbeinstantiated,theBufferclassmustbeconstructed.AndbeforetheBufferclassisconstructed,theVulkanvirtualdevicemustbecreated.
Tomakethingsappearinamoreorganizedorder,thebuffer,thedevice,theAppInstance,andtherestofsupportingcodewasmovedintoanewnamespacecalled“base.”Now,insteadofcreatingtheModelfirst,theapplicationstartsfromcreatingtheAppInstance,device,andbufferobjects(GPU_TP41).ThecodecreatingthevertexbufferhasbeenmovedoutofParticleElementintobuffer(GPU_TP42).Notethechangeintheusagefield:itnowhastwoflagsassigned,VERTEX_BUFFER_BITandSTORAGE_BUFFER_BIT.Therestoftheprocedureremainsthesame.TheBufferclassnowdeliverstheVkBufferhandleandinterfacetoaccessthebufferontheCPUside(GPU_TP43).TheconstructedbufferisthenusedininstantiationofboththeModel(GPU_TP44)andtheParticlesElement(GPU_TP45)objects.
22
6.2MovingcomputationfromtheCPUtocomputeshaderontheGPUTheModelclassiswheremostofthechangeswilloccur.Inthisversion,theModelclasswillcontainanotherpipelinewhichwillexecutethecomputeshader.
First,oneofthebiggestchangesisinhowtheinteractorsaregenerallyimplemented.Althoughtheinteractorlogicwasmovedintothecomputeshader,thereisstillaneedforthemechanismtoallowfortheabilitytodynamicallyspecifythepropertiesofeachinteractorintheapplicationlogic.BecauseofthestructureofGLSLlanguage,itisnolongerpossibletoimplementinteractorsusinganabstractinterface.Thisrequirestheinteractorstobesplitbytheirtypeandprovidespecificdataforallofthem.Todothis,wedefineastructure(GPU_TP46)calledSetuptocontainarraysforeachtypeofinteractor.Eacharraydefinespropertiesforaninteractorandallowsanarbitrarynumberofinteractorsofagiventypetobepresentuptoaspecifiedupperboundary.Thecountvariablesspecifytheactualnumberofactiveinteractors.
Whensubmittingcomplexstructurestobufferobjects,itiscriticaltoadheretothelayoutrulesthatapplyforthegiventypeofbuffer(Notethealignmentstatementinthestructurefield’sdeclaration(GPU_TP47)).Foruniformorshaderstoragebufferobjects,theselayoutsarestd140andstd430.TheirfullspecificationcanbefoundintheVulkanspecificationandGLSLlanguagespecification.
AsintheCPUversion,interactorswereinitializedbeforeconstructingtheModelobject(GPU_TP48).However,thistimetheinitializationisbasedonpopulatingtheSetupstructure.Later,theentirestructurewillbeuploadedintotheGPUduringconstruction(GPU_TP49),makingitavailableforthecomputeshadertoaccess.
IntheVulkanAPIthecomputestageandgraphicsstagecannotbeboundintoasinglepipeline.Asaresult,twopipelineobjectsareavailable.Comparedtothegraphicspipeline,thecomputepipeline(GPU_TP50)islesscomplicated.Theprimaryconcernistodefinetheproperpipelinelayout.Beyondthat,theprocessofcreatingthecomputepipelineisnearlythesameascreatingthegraphicspipeline.
Oncethepipelineisestablishedandthenecessarydataisreadytoexecute,theprocedurestartsrecordingcommandsintothecommandbufferforthecomputepipeline(GPU_TP51).However,therearetwoconceptsthatareworthexamining.
Thefirstconceptis“pushconstants.”PushconstantsaresmallregionsofmemoryresidinginternallyintheGPUdesignatedforfastaccess.Theadvantageofusingpushconstantsisthatdatadeliverycanbescheduledrightwithinthecommandbuffer(GPU_TP52)withouttheneedformemorymapping.Thememorylayoutofpushconstantsneedstobedefinedduringpipelineconstruction,buttheprocessismuchsimplercomparedtothedescriptorset(GPU_TP53).
23
Thesecondconceptisrelatedtothedivisionofworkload.Thenumberofelementsissplitintoworkgroupsthatarescheduledtoexecuteonthecomputedevice.Thesizeoftheworkgroupisspecifiedwithinthecomputeshader(GPU_TP54).ThentheAPIisgiventhenumberofworkgroupsscheduled(GPU_TP55)forexecutionofthecomputejob.Thesizesandnumberofworkgroupsarebothlimitedbythehardware.ThoselimitscanbecheckedbylookingatthemaxComputeWorkGroupCountandmaxComputeWorkGroupSizefields,whicharepartofVkPhysicalDeviceLimitsstructure(whichisapartoftheVkPhysicalDevicePropertiesstructurethatisobtainedthroughcallingvkGetPhysicalDeviceProperties).Itisimperativetoscheduleasufficientnumberworkgroupswithappropriatesizetocovertheentiresetofparticles.Theoptimalchoiceofsizesisdictatedbythespecifichardwarearchitectureandinfluencedbythecomputeshadercode(e.g.,theinternalmemoryrequirements).
6.3Structureofthecomputeshader-orientedsimulationWiththetransitiontothecomputeshader,theinteractorstaketheformoftwocomponents.Firstaretheinteractorparameterswhicharedeliveredthroughtheuniformbuffer(GPU_TP56).Thesecondisthefunctionsetwhereineachfunctioncorrespondstoaspecificinteractorandcalculatestheaccelerationvectorfortheprocessedparticle(GPU_TP57).
ThefunctionperiterationofthecomputeshaderistoobtainaparticlefromSSBOusingtheglobalidindex(GPU_TP58).Thecomputeshaderiteratesthrougheachinteractordescriptorandinvokesitsfunctionasmanytimesasthecountfieldindicates(GPU_TP59).Onceallinteractorshavebeenexecuted,theaccelerationvectorsaresummedandthecomputeshaderinvokestheEulerstepfunction(GPU_TP60).TheparticleisthensavedbacktoSSBO.
Afterthecomputingstepisdone,therendererisinvoked.Therendererusesthesamebuffer,butaccessesmemorythroughthevertexbufferinterface,insteadofSSBO,torenderthescene.
24
7GoingFurther
Buildingonthiswork,therearefewoptionstoconsiderthatmayimproveperformance.
7.1EliminatetheCPUroleincomputations
Currently,theGPUcomputeshaderonlycalculatestheinteractor’sinfluenceontheparticle.AllsortingandgeneratingoftheparticleswerelefttotheCPU(seesection5).InthecaseofIntel'sGPU,thisapproachisnotpenalizedatallduetotheuniformaccesstothememory.However,forotherplatforms,especiallyadiscreteGPU,allocatingbuffersusingthememoryheapwithmostefficientaccesstimeisamust.ThiscouldmeanthataddressspaceinsuchaheapmightnotbeaccessiblefromtheCPUthroughmemorymapping.Inthatcase,theonlywaytoaccessthedatawouldbebycopyingbuffersinternallybetweenheapsbackandforth,oraccessingitthroughthecomputeshader.
Anapproachinwhichbuffersarecopiedbetweeneachotherwillalmostcertainlycreateaperformancebottleneck.Therefore,thecomputeshaderoptionseemsmoreappealing.Initially,thesortingwasleftinitssequentialformbecauseoftheuncleargainsofparallelalgorithmsforthisspecificscenario.However,aGPUoffersmanymorecomputingunitswhichcouldmakeparallelsortingalgorithms,likean“odd-evensort,”deliverbetterresults.
Thereisstilltheissueofsynchronization.Aninstanceofthecomputeshadercannotstartthenextsortingpassbeforethepreviousonehascompleted.However,ifonlyasinglelocalworkgroupisused,thentheGLSLlanguagedeliversasetofShaderInvocationControlFunctions(seeGLSLspecification)thatwilldeliverthecapabilityofsettinguppropermemorybarriers.
Similarly,inthecaseofthegenerators,thebiggestproblemwastherandomnumbergeneratornotbeingthread-safeanditsdependenceonthegenerator’sinternalstate.TherandomizefunctioncanbeportedintoGLSL,andthecoherencemaintainedusingtheGLSLatomicmemoryfunctions.
25
7.2ParallelizingtheperformancepathbydoublebufferingForthisproject,theapproachwastofirstcomputethesceneandthenrenderitsequentially.Thecomplexityoftherenderinginthiscasewasfairlybasic,and,therefore,itsimpactonperformancewasnegligible.However,thiswillnotalwaysbethecase.Withmoreinvolvedrenderersitisnecessarytoattempttoperformcomputationsandrenderinginparallel.TheVulkanAPIwasdesignedwiththisgoalinmindandprovidesthemeanstosynchronizeprogramflow,butdoesnotenforceit.Forexample,lookingatwheretheprogramloopiswaitingfortherenderingfunctiontofinish(CPU_TP61).Thepartofthecoderesponsibleforpresentationobviouslyhastowait,butthemodeldoesnot.Infact,themodelcouldstartcomputingthenextstepassoonasthevertexdataisdeliveredandonlyneedtowaitbeforeenteringthesetVertexBufferDatafunction.Fromheretheapplicationcanbesplitintotwothreadsanduseaproducer-consumersetup.
IntheGPUversion,insteadofusingasingleSSBO,doublebufferingcanbeintroduced.Onebufferisusedbothasarender-inputandcompute-input.Atthesametime,thesecondbufferispopulatedbythecomputeshaderwiththeEulerfunctionoutput.
Itisimportanttorememberthatthereisanupperthresholdtowhatcanbeachieved,whichislimitedbythecapabilitiesofthehardware.ThecomputeandrendertasksexecutedonthesameGPUwillnotalwaysprovideperformancegainsbecauseofthefinitenumberofcomputeunitsavailable.
8Conclusion
VulkanofferstheabilitytobypasstheCPUbottleneckthatexistsinpreviousgenerationsofgraphicsAPIs.However,Vulkanhasasteeplearningcurveandismoredemandingwhenitcomestoapplicationdesign.Nonetheless,despitebeingveryexplicit,itisaveryconsistentandstraightforwardAPI.
Itisworthrememberingthesefewpointswhichwillhelpwithapplicationdevelopment,aswellasallowachievementofbetterapplicationperformance:
1) StartyourdevelopmentwiththevalidationlayersenabledandbecomefamiliarwithtoolslikeRenderDoc.
2) Carefullydesignyourapplicationasyouwillhavetomakemanydecisionsaboutvariousdetails.
3) Payattentiontowhereyourbuffermemoryisallocatedasitmighthavesignificantperformanceimplications.
4) Understandthehardwareyouareworkingwith.NotallthetechniqueswillworkequallywelloneverytypeofGPU.AlthoughallVulkandevicesexposethesameAPI,theirpropertiesandlimitationsmightbefundamentallydifferentfromeachother.
26
NoticesNolicense(expressorimplied,byestoppelorotherwise)toanyintellectualpropertyrightsisgrantedbythisdocument.Inteldisclaimsallexpressandimpliedwarranties,includingwithoutlimitation,theimpliedwarrantiesofmerchantability,fitnessforaparticularpurpose,andnon-infringement,aswellasanywarrantyarisingfromcourseofperformance,courseofdealing,orusageintrade.Thisdocumentcontainsinformationonproducts,servicesand/orprocessesindevelopment.Allinformationprovidedhereissubjecttochangewithoutnotice.ContactyourIntelrepresentativetoobtainthelatestforecast,schedule,specificationsandroadmaps.Theproductsandservicesdescribedmaycontaindefectsorerrorsknownaserratawhichmaycausedeviationsfrompublishedspecifications.Currentcharacterizederrataareavailableonrequest.Copiesofdocumentswhichhaveanordernumberandarereferencedinthisdocumentmaybeobtainedbycalling1-800-548-4725orbyvisitingwww.intel.com/design/literature.htm.ThissamplesourcecodeisreleasedundertheIntelSampleSourceCodeLicenseAgreement.IntelandtheIntellogoaretrademarksofIntelCorporationintheU.S.and/orothercountries.*Othernamesandbrandsmaybeclaimedasthepropertyofothers.©2017IntelCorporation