The Intersection of Game Engines & GPUs: Current & Future
Johan Andersson, Rendering Architect
Agenda
  Goal
    Share and discuss current & future graphics use cases in our games and implications for graphics hardware
  Areas
    Engine overview
    Shaders
    Parallelization
    Texturing
    Raytracing
    GPU compute
  Conclusions
  Q & A
Frostbite
  DICE proprietary engine
    Xbox 360
    PS3
    Windows (Direct3D 10)
  Focus
    Large outdoor environments
    Singleplayer & multiplayer
    Destruction!
  New: Content workflows
[Figures: two BFBC screenshots]
Graph-based surface shaders
  Artist-friendly
    Easy to create, tweak & manage
  Flexible
    Programmers & artists can extend & expose features
  Data-centric
    Encapsulates resources
    Transformable
  Rich high-level shading framework
    Used by all content & systems
Shader permutations
  Generate shader permutations
    For each used combination of features/data
    HLSL vertex & pixel shaders
  Many features = permutation explosion
    Shader graphs, lighting, geometry
  Balance perf. vs permutations vs features
    Dynamic branching
    Live with many permutations
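The permutation explosion can be made concrete with a small CPU-side sketch. It assumes each shader feature is an independent on/off flag, so n features give up to 2^n variants, and it tracks only the combinations that content actually requests. The flag names and the `PermutationSet` class are hypothetical and are not taken from Frostbite.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// Hypothetical feature flags; the names are illustrative only.
enum ShaderFeature : std::uint32_t {
    kSkinning  = 1u << 0,
    kNormalMap = 1u << 1,
    kShadows   = 1u << 2,
    kFog       = 1u << 3,
};

// Upper bound on permutations for featureCount independent on/off
// features (featureCount must be below 64).
constexpr std::uint64_t maxPermutations(unsigned featureCount) {
    return std::uint64_t{1} << featureCount;
}

// Records which feature combinations are actually used, so that only
// those would need a compiled vertex/pixel shader pair.
class PermutationSet {
public:
    // Returns true when the combination has not been requested before.
    bool request(std::uint32_t featureMask) {
        return used_.insert(featureMask).second;
    }
    std::size_t count() const { return used_.size(); }

private:
    std::set<std::uint32_t> used_;
};
```

Four flags already allow 16 variants, while a scene that requests only two distinct masks compiles two. Every extra feature doubles the upper bound, which is the pressure the slide describes.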
Shader subroutines
  Next step: Static subroutine linking
    Inline in all subroutines at call site
      Similar to a switch statement
    Reduces # permutations
      Implementation moved to driver or GPU
    Doesn't work with instancing
  Future step: Dynamic subroutines
    Control function pointers inside shader
    Problem solved, but coherency important
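As a rough CPU-side analogy for the two steps (C++ rather than shader code, with hypothetical function names), static linking resembles a switch whose selector is fixed for the whole draw call, while dynamic subroutines resemble a function pointer that may differ per element. The lighting functions are placeholders chosen only to give the two paths something to call.

```cpp
#include <cassert>

enum class LightModel { Lambert, HalfLambert };

// Two interchangeable "subroutines" (illustrative lighting terms).
inline float lambert(float nDotL) { return nDotL > 0.0f ? nDotL : 0.0f; }
inline float halfLambert(float nDotL) {
    const float h = nDotL * 0.5f + 0.5f;
    return h * h;
}

// Static linking analogy: the selector is constant for the whole draw
// call, so the branch could be resolved once and the chosen body inlined.
inline float shadeStatic(LightModel model, float nDotL) {
    switch (model) {
        case LightModel::Lambert:     return lambert(nDotL);
        case LightModel::HalfLambert: return halfLambert(nDotL);
    }
    return 0.0f;
}

// Dynamic subroutine analogy: the callee is data and may vary per
// instance or per pixel, which is why coherency between neighbouring
// elements starts to matter.
using LightFn = float (*)(float);
inline float shadeDynamic(LightFn fn, float nDotL) { return fn(nDotL); }
```

With the static form one compiled shader covers both lighting models, but every instance in a draw call shares the same selector. That matches the slide's note that static linking does not work with instancing.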
Rendering & Parallelization
Jobs
  Must utilize multi-core
    6 HW threads on Xbox 360
    6 SPUs on PS3
    2-8 cores on PC
  Job definition
    Fully independent stateless function
      PS3 SPU requirement
    Graph dependencies
    Task-parallel and data-parallel
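A minimal sketch of the job model described above, assuming a job is a plain function pointer plus a list of jobs it depends on. The `Job` layout and `scheduleWaves` are hypothetical. The scheduler groups jobs into waves with Kahn's algorithm, and all jobs inside one wave are mutually independent, so they could be handed to separate cores or SPUs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A job is a stateless function: everything it reads or writes is
// passed in explicitly (hypothetical layout).
struct Job {
    void (*run)(const void* input, void* output);
    std::vector<std::size_t> dependsOn;  // indices of prerequisite jobs
};

// Groups job indices into waves. Every job in a wave has all of its
// dependencies in earlier waves, so a wave can execute in parallel.
std::vector<std::vector<std::size_t>> scheduleWaves(const std::vector<Job>& jobs) {
    std::vector<std::size_t> pending(jobs.size());
    std::vector<std::vector<std::size_t>> dependents(jobs.size());
    std::vector<std::size_t> ready;

    for (std::size_t i = 0; i < jobs.size(); ++i) {
        pending[i] = jobs[i].dependsOn.size();
        for (std::size_t d : jobs[i].dependsOn) {
            assert(d < jobs.size());
            dependents[d].push_back(i);
        }
        if (pending[i] == 0) ready.push_back(i);
    }

    std::vector<std::vector<std::size_t>> waves;
    while (!ready.empty()) {
        std::vector<std::size_t> next;
        for (std::size_t finished : ready)
            for (std::size_t waiting : dependents[finished])
                if (--pending[waiting] == 0) next.push_back(waiting);
        waves.push_back(ready);
        ready = next;
    }
    return waves;  // jobs caught in a dependency cycle never appear
}
```

For example, a culling job and a particle job with no dependencies land in the first wave, and a command buffer job that depends on both lands in the second.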
Rendering jobs
  Refactor rendering systems to jobs
  Most will move to GPU
    Eventually
    One-way data flow
    Compute shaders & stream output
  Jobs
    Decal projection
    Particle simulation
    Terrain geometry processing
    Undergrowth generation [2]
    Frustum culling
    Occlusion culling
    Command buffer generation
    PS3: Triangle culling
Parallel command buffer recording
  Dispatch draw calls and state to multiple command buffers in parallel
    Scales linearly with # cores
    1500-4000 draw calls per frame
  Super-important for all platforms, used on:
    Xbox 360
    PS3 (SPU-based)
  No support in DX10!
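The recording pattern can be sketched on the CPU with standard threads. This is an illustration under stated assumptions, not a console or Direct3D API: `DrawCall`, `CommandBuffer` and `recordParallel` are hypothetical, and a "command buffer" here is just a vector. Each worker writes only to its own buffer, so no locking is needed, and the buffers are later submitted in index order to preserve draw order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

struct DrawCall { int meshId; int materialId; };  // hypothetical payload
using CommandBuffer = std::vector<DrawCall>;

// Splits the frame's draw calls into contiguous chunks and records each
// chunk into its own command buffer on a separate thread.
std::vector<CommandBuffer> recordParallel(const std::vector<DrawCall>& draws,
                                          std::size_t workerCount) {
    assert(workerCount > 0);
    std::vector<CommandBuffer> buffers(workerCount);
    std::vector<std::thread> workers;
    const std::size_t chunk = (draws.size() + workerCount - 1) / workerCount;

    for (std::size_t w = 0; w < workerCount; ++w) {
        workers.emplace_back([&draws, &buffers, w, chunk] {
            const std::size_t begin = std::min(w * chunk, draws.size());
            const std::size_t end = std::min(begin + chunk, draws.size());
            // Private output buffer per worker: no shared mutable state.
            buffers[w].assign(
                draws.begin() + static_cast<std::ptrdiff_t>(begin),
                draws.begin() + static_cast<std::ptrdiff_t>(end));
        });
    }
    for (std::thread& t : workers) t.join();
    return buffers;  // submit buffers[0], buffers[1], ... in order
}
```

With 10 draw calls and 3 workers the chunks hold 4, 4 and 2 calls. Because the chunks are independent, adding workers shortens recording time roughly in proportion, which is the scaling the slide refers to.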
DX10 parallel command buffer rec.
  Single most important DX10 issue
    For us and many others (in the future)
  Until future API support
    Reduce draw calls with instancing
      Trade GPU performance for CPU performance
    Reduce state & constant updates
      Slow dynamic constant path
    Manual software command buffers
      Difficult to update dynamic resources efficiently in parallel due to API
PS3 geometry processing (1/2)
  Slow GPU triangle & vertex setup
  Unique situation with "free" processors
    Not fully utilized
  Solution: SPU triangle culling
    Trade SPU time for GPU performance
    Cull back faces, micro-triangles, frustum
    Sony PS3 EDGE library
  5 jobs process frame geometry in parallel
    Output is new index buffer for each draw call
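A simplified sketch of per-triangle culling that produces a new index buffer. This is not the EDGE library's interface. It assumes vertices are already projected to screen-space pixels, treats counter-clockwise winding as front-facing, and approximates micro-triangle rejection with an area threshold. Real micro-triangle tests check whether a triangle covers any sample point, and frustum culling is left out here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Vec2 { float x, y; };  // screen-space position in pixels

// Twice the signed triangle area; positive means counter-clockwise.
inline float signedArea2(Vec2 a, Vec2 b, Vec2 c) {
    return (b.x - a.x) * (c.y - a.y) - (c.x - a.x) * (b.y - a.y);
}

// Returns a compacted index buffer with only the triangles that survive
// back-face and (approximate) micro-triangle culling.
std::vector<std::uint16_t> cullTriangles(const std::vector<Vec2>& positions,
                                         const std::vector<std::uint16_t>& indices,
                                         float minArea2) {
    assert(indices.size() % 3 == 0);
    std::vector<std::uint16_t> out;
    out.reserve(indices.size());
    for (std::size_t i = 0; i + 2 < indices.size(); i += 3) {
        const float area2 = signedArea2(positions[indices[i]],
                                        positions[indices[i + 1]],
                                        positions[indices[i + 2]]);
        if (area2 <= 0.0f) continue;     // back-facing or degenerate
        if (area2 < minArea2) continue;  // too small to matter (simplified)
        out.push_back(indices[i]);
        out.push_back(indices[i + 1]);
        out.push_back(indices[i + 2]);
    }
    return out;
}
```

The vertex buffer is left untouched and only the index list shrinks, so the GPU sets up fewer triangles for the same draw call.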
PS3 geometry processing (2/2)
  Great flexibility and programmability!
  Custom processing
    Partition bounding box culling
    Triangle part culling
    Clip plane triangle trivial accept & reject
    Triangle cull volumes (inverse clip planes)
  Future: No vertex & geometry shaders
    DIY compute shaders with fixed-func tessellation and triangle setup units
    Output buffer streaming still important
Occlusion culling
  Buildings occlude objects
    Tons of objects
  Difficult to implement
    Building destruction
    Dynamic occludees
    Heavy GPU occlusion queries
  Invisible objects still have to
    Update logic & animations
    Generate command buffer
    Processed on CPU & GPU
Software occlusion culling
  Solution: Rasterize coarse zbuffer on SPU/CPU
    Low-poly occluder meshes
      100m view distance
      Max 10000 vertices/frame
      Manually conservative
    256x114 float z-buffer
    Created for PS3, now on all
  Cull all objects against zbuffer
    Before passed to all other systems = big savings
    Screen-space bbox test
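The screen-space bounding box test against the coarse z-buffer could look roughly like the sketch below. The 256x114 float resolution comes from the slide and everything else is an assumption: depth runs from 0 (near) to 1 (far), occluders are filled as rectangles instead of rasterized triangles, and `CoarseZBuffer` and `ScreenRect` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kZWidth = 256;
constexpr int kZHeight = 114;

// Inclusive pixel rectangle in z-buffer space plus one depth value.
struct ScreenRect { int x0, y0, x1, y1; float z; };

class CoarseZBuffer {
public:
    CoarseZBuffer()
        : depth_(static_cast<std::size_t>(kZWidth) * kZHeight, 1.0f) {}

    // Stand-in for occluder rasterization: r.z should be the occluder's
    // farthest depth so the buffer stays conservative.
    void addOccluder(ScreenRect r) {
        if (!clip(r)) return;
        for (int y = r.y0; y <= r.y1; ++y)
            for (int x = r.x0; x <= r.x1; ++x)
                depth_[index(x, y)] = std::min(depth_[index(x, y)], r.z);
    }

    // r.z is the object's nearest depth. The object is visible if any
    // covered pixel is not strictly closer than that depth.
    bool isVisible(ScreenRect r) const {
        if (!clip(r)) return false;  // fully outside the screen
        for (int y = r.y0; y <= r.y1; ++y)
            for (int x = r.x0; x <= r.x1; ++x)
                if (r.z <= depth_[index(x, y)]) return true;
        return false;
    }

private:
    static bool clip(ScreenRect& r) {
        if (r.x0 > r.x1 || r.y0 > r.y1) return false;
        if (r.x1 < 0 || r.y1 < 0 || r.x0 >= kZWidth || r.y0 >= kZHeight)
            return false;
        r.x0 = std::max(r.x0, 0);
        r.y0 = std::max(r.y0, 0);
        r.x1 = std::min(r.x1, kZWidth - 1);
        r.y1 = std::min(r.y1, kZHeight - 1);
        return true;
    }
    static std::size_t index(int x, int y) {
        return static_cast<std::size_t>(y) * kZWidth + static_cast<std::size_t>(x);
    }
    std::vector<float> depth_;
};
```

An object whose rectangle lies entirely behind an occluder is rejected before any other system processes it. An object that pokes out past the occluder's edge, or sits in front of it, is kept.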
GPU occlusion culling
  Want GPU rasterization & testing, but:
    Occlusion queries introduce overhead & latency
      Can be manageable, not ideal
    Conditional rendering only helps GPU
      Not CPU, frame memory or draw calls
  Future 1: Low-latency extra GPU exec context
    Rasterization and testing done on GPU
    Lockstep with CPU
  Future 2: Move entire cull & rendering to GPU
    Scene graph, cull, systems, dispatch. End goal.
Texturing
Texture formats
  Using
    DXT1/5 color maps, sRGB
    BC5 (3Dc) normal maps
    BC4 (DXT5A) for grayscale masks
  sRGB support for BC4/5 would be nice
  DXT1 replacement needed
    Low quality
      565 color bleeding
      RG/RGB masks compress badly
    HDR envmaps & lightmaps
  [Figures: RGB DXT1 mask; DXT color bleed]
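DXT1 stores its two endpoint colors per block as 16-bit RGB565 values. The sketch below shows only that quantization step, as one likely contributor to the 565 quality problems listed above and not as a full DXT1 codec. Expanding back to 8 bits by bit replication is one common convention and is an assumption here.

```cpp
#include <cassert>
#include <cstdint>

struct Rgb8 { std::uint8_t r, g, b; };

// Quantize 8-bit channels to 5/6/5 bits by dropping the low bits.
inline std::uint16_t packRgb565(Rgb8 c) {
    return static_cast<std::uint16_t>(((c.r >> 3) << 11) |
                                      ((c.g >> 2) << 5) |
                                      (c.b >> 3));
}

// Expand back to 8 bits per channel using bit replication.
inline Rgb8 unpackRgb565(std::uint16_t v) {
    const unsigned r5 = (v >> 11) & 0x1Fu;
    const unsigned g6 = (v >> 5) & 0x3Fu;
    const unsigned b5 = v & 0x1Fu;
    return Rgb8{static_cast<std::uint8_t>((r5 << 3) | (r5 >> 2)),
                static_cast<std::uint8_t>((g6 << 2) | (g6 >> 4)),
                static_cast<std::uint8_t>((b5 << 3) | (b5 >> 2))};
}
```

A neutral gray of (128, 128, 128) comes back as (132, 130, 132). Green carries one more bit than red and blue, so the channels round differently and the gray picks up a slight tint.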
Future texture sampling
  Texture sampling derivatives
    1st order texel derivatives
      2nd order as well?
    Implement in sampler unit
    Bad performance or quality with shader sampling
      Artifacts with ddx/ddy technique
    Replace normalmaps with easily compressed bumpmaps
  Bicubic upsampling
    Terrain masks
  [Figures: Terrain heightmap; Derived normals [2]]
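Deriving a normal from a heightmap amounts to estimating first-order texel derivatives. A minimal sketch using central differences over the four neighbouring height samples follows. It assumes a y-up coordinate system and uniform texel spacing, and the function name is hypothetical. This is the kind of work the slide would like the sampler unit to do instead of the shader.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

// Builds a unit normal from the heights of the four neighbours of a
// texel. texelSpacing is the world-space distance between samples.
inline Vec3 normalFromHeights(float left, float right, float down, float up,
                              float texelSpacing) {
    assert(texelSpacing > 0.0f);
    // Central differences approximate the height gradient.
    const float dhdx = (right - left) / (2.0f * texelSpacing);
    const float dhdz = (up - down) / (2.0f * texelSpacing);
    // The surface normal of y = h(x, z) is proportional to (-dh/dx, 1, -dh/dz).
    const float len = std::sqrt(dhdx * dhdx + 1.0f + dhdz * dhdz);
    return Vec3{-dhdx / len, 1.0f / len, -dhdz / len};
}
```

A flat neighbourhood yields the straight-up normal (0, 1, 0). A slope that rises two units across two texels in x tilts the normal 45 degrees toward negative x.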
Current sparse textures
  Save memory for terrain
    Static quadtree mask texture
    Dynamic sparse destruction mask
  Implementation
    Indirection texture lookup in atlas
    Arrays too small, want 8192 slices
    Correct bilinear filtering by borders
    Siggraph'07 course for details [2]
  [Figures: Source mask; Atlas texture]
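The indirection lookup can be sketched as a two-step address translation: find which tile a virtual texel falls in, then use the indirection table to locate that tile in the atlas. The structure below is hypothetical and CPU-side. It assumes square tiles, a border of duplicated texels around each atlas tile so that bilinear filtering never reads a neighbouring tile, and -1 as the marker for tiles that are not stored.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct SparseTexture {
    int tileSize;            // texels per tile side, excluding the border
    int border;              // border texels on each side of an atlas tile
    int virtualTilesPerRow;  // width of the virtual texture, in tiles
    int atlasTilesPerRow;    // width of the atlas, in tiles
    std::vector<int> indirection;  // atlas tile per virtual tile, -1 = absent
};

struct Texel { int x, y; };

// Maps a virtual texel coordinate to its atlas texel coordinate.
// Returns false when the containing tile is not present in the atlas.
inline bool virtualToAtlas(const SparseTexture& t, int vx, int vy, Texel& out) {
    assert(vx >= 0 && vy >= 0);
    const int tileX = vx / t.tileSize;
    const int tileY = vy / t.tileSize;
    assert(tileX < t.virtualTilesPerRow);
    const std::size_t slot =
        static_cast<std::size_t>(tileY) * static_cast<std::size_t>(t.virtualTilesPerRow) +
        static_cast<std::size_t>(tileX);
    assert(slot < t.indirection.size());

    const int atlasTile = t.indirection[slot];
    if (atlasTile < 0) return false;

    // Each atlas tile occupies tileSize + 2 * border texels per side.
    const int stride = t.tileSize + 2 * t.border;
    out.x = (atlasTile % t.atlasTilesPerRow) * stride + t.border + vx % t.tileSize;
    out.y = (atlasTile / t.atlasTilesPerRow) * stride + t.border + vy % t.tileSize;
    return true;
}
```

With 4-texel tiles and a 1-texel border, virtual texel (5, 2) lies in virtual tile 1. If that tile maps to atlas tile 5 in a 4-tile-wide atlas, the lookup resolves to atlas texel (8, 9).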
HW sparse textures
  Virtual texture
    HW texture filtering & mipmapping
    Fallback on non-resident tile access
      Lower mipmap, default value or shader bool
    At least 32k x 32k, fp issues with larger?
  Application-controlled tile commit/free
    ~128 x 128 tiles
    Feedback mechanism for referenced tiles
      Easy view-dependent allocation
  Future: Latency-free allocation & generation
    Alt1. CPU thread callback & block
    Alt2. Keep everything on GPU. "Command" shader?
Cached Procedural Unique Texturing
  Unique dynamic sparse texture on all objects
    Defined by texture shader graph
      Combine procedurals, compositing, streaming and uv-space geometry
  Dynamically commit & render visible tiles
    Highly complex compositing
      Thanks to high frame-to-frame coherency
    Upsample and refine
  New dynamic effects made possible
    Affect every surface
Raytracing
Raytracing
  Much recent debate & interest in RTRT (real-time raytracing)
  What we are interested in:
    Performance!!
      Rasterization for primary rays
      Deterministic
    Easy integration into engines
      Just another method for certain effects & objects
      Not replace whole pipeline
    Efficient dynamic geometry
      Procedural & manual animation (foliage, characters)
      Destruction (foliage, buildings, objects)
Mirror's Edge
  Raytraced reflections wanted
    Glass & metal
    Mostly planar surfaces
    Reflection locality
  Correct reflections for important objects
    Main character
  Simplified world geometry & shading for rest
    Common for games
    Brickmaps? [3]
  Soft reflections
  [Figure: Mirror's Edge]
GPGPU
GPGPU uses
  Effect physics
    Particle vs world soft collision
  AI pathfinding
  AI visibility
    View rasterization. Obstruction from smoke & foliage
  Procedural animation
    Trees, undergrowth, hair
  Post-processing
CUDA DOF (depth of field) post-process filter
  Thesis work at DICE [4]
    Test CUDA and performance
    Poisson disc blur
    Multi-passed diffusion
    Separable diffusion
  Good:
    Easy to learn (C)
    Map complex algorithms
    Thread & memory control
  Bad:
    Performance vs shaders
    Beta interop
    Vendor-specific
  [Figures: Circle of confusion map; Output]
GPU Compute programming model
  Wanted:
    Easy & efficient Direct3D 10 interop
      Low-latency Compute tasks
    Vendor-independent base interface
      OpenCL?
    Efficient CPU multi-core backend
      Server, older GPUs, debugging
      MCUDA [5]
    Eventually platform-independent
      Future consoles
Conclusions
  Shader subroutines
  More software-controlled pipeline
  More texture sampler functionality
  Limited-case raytracing
  GPU compute for games
Questions?
  Contact: [email protected]
References
  [1] Tatarchuk, Natasha & Andersson, Johan. "Rendering Architecture and Real-time Procedural Shading & Texturing Techniques". GDC 2007.
  [2] Andersson, Johan. "Terrain Rendering in Frostbite using Procedural Shader Splatting". Siggraph 2007.
  [3] Christensen, Per H. & Batali, Dana. "An Irradiance Atlas for Global Illumination in Complex Production Scenes". Eurographics Symposium on Rendering 2004.
  [4] Lonroth, Per & Unger, Mattias. "Advanced Real-time Post-Processing using GPGPU techniques". Master thesis, 2008.
  [5] Stratton, John, Stone, Sam & Hwu, Wen-mei. "MCUDA: An Efficient Implementation of CUDA Kernels on Multi-cores". Technical report, University of Illinois at Urbana-Champaign, IMPACT-08-01, March 2008.
Bonus slides
Real-time REYES
  Very interesting
    Displacement mapping & procedurals
    Stochastic sampling
  Potentially more efficient & general
    Compared to maxed out rasterization & tessellation on everything = pixel-sized triangles
  But
    No experience
    More research & experimentation needed
Terrain detail
  Deriving normal from heightfield good in distance
  Future: HW tessellation & procedural displacement shaders for up close ground detail
Texture arrays
  Use cases:
    Everything!
    Rich parameterized shaders
      Vary slice index per instance, triangle or texel
      Instancing without compromising on variation or perf.
    Cascaded shadow maps
      HW PCF only in DX 10.1
      Stable Cascaded Bounding Box Shadow Maps
    Sparse textures
  More slices plz
    For tile pools. 64x64x8192
Other raytracing uses
  Global Illumination & Ambient Occlusion
    Incremental Photon Mapping?
  Async collision raycasts
    AI pathfinding, gameplay, sound obstruction
    Separate collision world from visual world
    CPU job-based now