CS248: Graphics Performance, Debugging and Optimisation
Dave Oldcorn
November 13th 2007
Your Guest Instructor
• Back in the mists of time, I wrote games…
• The last ten years have all been about 3D hardware
• Since 2001 at ATI, joining forces with AMD last year
• Optimisation specialist: linking the software and the hardware
– Tweaking games
– Understanding the hardware
– Driver performance
– Shader code optimisation
– (I find assembly language fun)
Overview
• Three basic sections
– GPU Architecture
– Efficient OpenGL
– Practical Optimisation and Debugging
• There’s a lot in here
– Broad overview of all issues
– I’ve prioritised the biggest issues and the ones most likely to help with Project 3
– More details with respect to GPU architecture included as appendix
GPU Architecture
Graphics hardware architecture
• Parallel computation
• All about pipelines
• The OpenGL vertex pipeline shown right will be familiar…
Graphics hardware architecture
• Extend the top of the pipeline with some more implementation detail
• Ideally, every stage is working simultaneously
• Could also decompose to smaller blocks
• And eventually to individual hardware pipeline stages
– As shown last week, the hardware implementation may be considerably more complex than a linear pipeline
[Diagram: on the CPU, Application → API → Video Drivers → Command Buffers; on the GPU, Parser → Vertex Assembly → Vertex Operations → Primitive Assembly]
Draw commands
• Data enters the GPU pipeline via command buffers containing state and draw commands
• The draw command is a packet of primitives
• Occurs in the context of the current state
– As set by glEnable, glBlendFunc, etc.
– The full set of state is often referred to as a state vector
– The driver translates API state into hardware state
– State changes may be pipelined; different parts of the GPU pipeline may be operating with different state vectors (even to the level of per-vertex data such as glColor)
Pipeline performance
• The performance of a pipelined system is measured by throughput and latency
– Can subdivide at any level, from the full pipeline down to individual stages
• Throughput: the rate at which items enter and exit
• Latency: the time taken from entrance to exit
– Latency is not typically a major issue for API users
– It is a huge issue for GPU designers
– Even GPU-local memory reads may be hundreds of cycles
– A substantial percentage of both design effort and silicon is devoted to latency compensation
– The system will generally run at full throughput until the latency compensation is exceeded
Pipeline throughput
• Given a particular state vector, each part of the pipeline has its own throughput
• The throughput of a system can be no higher than that of the slowest part: this is a bottleneck
– More generally, if input is ready but output is not, it is a bottleneck
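The bottleneck rule can be stated in a few lines of code. This is an illustrative sketch (function and variable names are mine, not from the slides): system throughput is simply the minimum stage throughput, independent of latency.

```c
#include <assert.h>

/* Sketch: the throughput of a pipeline is the minimum throughput of any
   stage (the bottleneck).  Stage latencies do not appear anywhere. */
static double pipeline_throughput(const double *stage_rate, int n_stages)
{
    double min_rate = stage_rate[0];
    for (int i = 1; i < n_stages; ++i)
        if (stage_rate[i] < min_rate)
            min_rate = stage_rate[i];
    return min_rate; /* items per clock */
}
```

For the three-stage example on the next slide (rates 1, 0.5 and 1 per clock), this yields 0.5 items per clock: the 15-cycle latency of the last stage never enters the calculation.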
Pipeline bottlenecks
• Consider the system shown right
– Stage 1 can run at 1 per clock and is 100% utilised
– Stage 2 can only accept on every other clock; still 100% utilised
– Stage 3 is therefore starved on half of the cycles it could be working; 50% utilised
– Although stage 3 has the longest latency, it has no effect on the throughput of the system
[Diagram: items enter at 1 per clock → Stage 1 (throughput 1/clock, latency 5 cycles) → Stage 2 (throughput 1 per 2 clocks, latency 10 cycles) → Stage 3 (throughput 1/clock, latency 15 cycles); results emerge only on alternate clocks, despite stage 3’s per-clock throughput]
Pipeline bottlenecks
• A key subtlety: for this to work as shown, there must be load balancing between stages 1 and 2 (probably a FIFO)
• Once the FIFO is full, the input buffer will exert backpressure on stage 1
– Happens after equilibrium is reached
• This pipeline therefore runs at the speed of the slowest part as soon as the FIFO fills
[Diagram: as above, but with an input buffer (FIFO) between stages 1 and 2; items pass at 1 per clock and eventually queue; results still emerge only on alternate clocks]
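The fill-then-stall behaviour can be seen in a toy simulation. This is an assumed model, not real hardware: stage 1 pushes one item per clock into a FIFO, stage 2 drains one item every other clock, and stage 1 stalls once the FIFO is full.

```c
#include <assert.h>

/* Toy backpressure model: returns the clock on which stage 1 first
   stalls, as a function of the FIFO depth between stages 1 and 2.
   The net fill rate is 0.5 items/clock, so the stall arrives at
   roughly twice the FIFO depth. */
static int first_stall_clock(int fifo_depth)
{
    int fifo = 0, clock = 0;
    for (;;) {
        ++clock;
        if (fifo == fifo_depth)
            return clock;        /* stage 1 cannot push: backpressure */
        ++fifo;                  /* stage 1 pushes one item */
        if (clock % 2 == 0)
            --fifo;              /* stage 2 pops on alternate clocks */
    }
}
```

A deeper FIFO only delays the stall; it cannot raise the steady-state throughput above the bottleneck rate.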
Variable throughput
• In general, throughput is data dependent
– Example: clipping is a complex operation which often isn’t required
– Example: texture fetch depends on the filtering chosen, which is data dependent
• Some pipeline stages have different rates at the input and the output
– Example: back-face culling; primitive in, no primitive out
– Example: rasterisation of primitives to fragments; few primitives in, many fragments out
• Buffering between stages takes up the slack
Pipeline bottlenecks
• A particular state vector will tend to have a characteristic set of bottlenecks
– The input data also has an effect
• Small changes to the state vector can make substantial changes to the bottleneck
• As a state change filters through the pipeline, and for a short period afterwards, bottlenecks shift into the new equilibrium
– For usual loads, where the render time is much larger than the pipeline depth, this time can be ignored
• It can be hard to determine bottlenecks if the states in the pipe are disparate
– Smearing effect
Pipeline bottlenecks
• There may be multiple bottlenecks if the throughput is not constant at all parts of the pipeline
– In general it is not constant
• GPU buffering absorbs changes in load
– Measured in tens or hundreds of cycles at best; the whole pipeline is thousands of cycles
• The bottleneck could be outside the GPU
– Application, driver, memory management…
• Bottleneck analysis is key to hardware performance
– Not easy: bottlenecks are always present; separating expected and unexpected cases is the challenge
Flushes and synchronisation
• Some state cannot be pipelined; a flush occurs
– There are various localities of flush
– For a whole-pipeline flush, the parser waits before allowing new data into the pipe
– The CPU can carry on building and queuing command buffers
– Low cost: ~thousands of cycles (~5µs?)
• Some operations can require the CPU to wait for the GPU
– Example: the CPU wants to read memory the GPU is writing
– This is a serialising event
– Very expensive: wait for pipeline completion, flush all caches, plus the restart time taken to build the next command buffer
– You can force this with glFinish: please don’t!
Asynchronous system
• The process of rendering a typical game image is massively asynchronous
• The boxes show possible asynchronous actors; the timeline diagram’s shaded areas are the same frame
– Input / physics thread: runs continuously, using input and time (fixed or delta) to update the game world, typically including the scene graph; typical runtime 10-30ms
– Render thread: runs continuously to convert the scene graph to rendering commands; generally cannot start until the input/physics thread has processed the whole frame
– GPU renderer: runs on its command buffers
– DAC: loops over the display at 60-100Hz; a command buffer operation changes the display at end of render, picked up at the start of the next frame (unless vsync is off)
[Timeline diagram: Input, Render, GPU and DAC rows, with the same frame’s work overlapped in time]
Synchronisation
• GPUs aim to run just under two frames ahead
– Block at SwapBuffers if there is another SwapBuffers in the pipe that has not yet been reached
• Reading any GPU memory on the CPU causes a sync
– glReadPixels is one example; avoid it
• Writing to GPU memory generally does not
– The GPU, driver and memory manager work together to do uploads without serialisation
– No need to be unusually scared of glTexImage
• If you have to lock GPU memory, look for discard or write-only flags that will allow asynchronous access
Shaders
• Texture lookup operations are relatively expensive
– Competition on the GPU or system bus, cost of filtering, unpredictability
– Some of this is only a latency issue – but latency is not important…
– … until the buffering is exceeded
– Latency more than doubles for dependent texture operations
– Prefer ALU math to texture until the function is complex
– Might replace very small textures with shader constants
• The shader – typically its texture operations – is likely to be the limiting factor on performance
Shaders
• Each shader is run at a particular frequency
– Per-vertex and per-fragment now; per-primitive also exists; per-sample seems likely in the future
– Constants calculated on the CPU can be viewed as another frequency (per draw packet)
– Aim to do calculations at the lowest necessary frequency
• Issues to be aware of:
– Data passed from vertex to fragment shader is interpolated linearly in the space of the primitive (i.e. with perspective correction), so interpolators can only be used when this is appropriate (linear or nearly so); high tessellation can be a workaround
– Excessive use of interpolators can itself be a bottleneck; up to two interpolators per texture fetch, as a ballpark figure
Shader constants
• Shader constants are a large part of the state vector
– Updating hundreds on each draw call will not be free
• Prefer inline constants (known at compile time) to state vector constants
– Gives the compiler and constant manager more information
• For the same reason, avoid parameterising for its own sake
• Don’t switch shader just to change a couple of constants
Efficient OpenGL
Efficient OpenGL
• This is a data processing issue
• What data does the GPU need to render a scene?
– State data, texture data, vertices / primitives
• CPU-side performance can easily be dominated by inefficient management of this data
• Of them all, vertex data is the most problematic

Type of data        State       Vertex           Texture
Volume (per frame)  Low (~kB)   Med-high (~MB)   Very high (~GB)
Rate of change      Very high   Low-med          Very low
Efficient vertex data
• The application needs to feed mesh data in somehow
• GL provides two basic methods
– glBegin/glEnd (known as ‘immediate mode’)
– Vertex arrays
• Immediate mode is easy to use but has high overheads
– Many tiny, unaligned copies
– Non-‘v’ forms imply extra copies
– The command stream is unpredictable and irregular

glBegin(GL_TRIANGLE_FAN);
glColor4f(1, 1, 1, 1);
glVertex3f(0, 0, 0); // position + colour
glVertex3f(0, 1, 0); // position only
glColor4f(1, 0, 0, 1);
glVertex3f(1, 1, 0); // position + colour
glVertex3f(1, 0, 0); // position only
glEnd();
Vertex arrays
• Vertex arrays are an alternative
– The application probably has its data in arrays somewhere, so let GL read them en masse
– glVertexPointer, glColorPointer, etc. specify the arrays
– glDrawElements issues a draw command; it takes an index list
– Primitives are drawn using the indices into the arrays as set up by the gl*Pointer commands

glVertexPointer(3, GL_FLOAT, 16, vertex_array);
glColorPointer(4, GL_UNSIGNED_BYTE, 0, color_array);
glEnableClientState(GL_VERTEX_ARRAY);
glEnableClientState(GL_COLOR_ARRAY);
glDrawElements(GL_TRIANGLES, 12, GL_UNSIGNED_INT, indices);

• Easier for the driver and GPU to handle
– The state vector is instantiated at the glDrawElements command
– The GPU can process all the primitives in a single draw packet
Vertex arrays
• Did you hear a ‘but’?
• The vertex data still belongs to the application
– Until the glDrawElements call is entered, the GPU knows nothing of the data
– After the call completes the app can change the data
– Therefore, the driver must copy the data on every glDrawElements call
– Even if the data never changes – the GL can’t know
• Wouldn’t it be great if we could avoid the copy?
– We don’t supply textures on every call; we just upload them to the GPU and let the driver manage them…
Buffer Objects
• This facility is provided by the Vertex Buffer Objects (VBO) extension
– Allows the creation of buffer objects in GPU memory, with access mediated by the driver
– Data can be uploaded at any time with glBufferData
– As with glTexImage, done through the command buffer to avoid serialisation
– BindBuffer <-> BindTexture, BufferData <-> TexImage

// During program initialisation
GLuint buf;
glGenBuffers(1, &buf);           // binding buffer 0 would disable the VBO path
glBindBuffer(GL_ARRAY_BUFFER, buf);
glBufferData(GL_ARRAY_BUFFER, 16*4*sizeof(GLfloat), vertex_array, GL_STATIC_DRAW);
...
// In the render loop
glBindBuffer(GL_ARRAY_BUFFER, buf);
glVertexPointer(3, GL_FLOAT, 16, 0); // 0 is now a byte offset into the buffer
glEnableClientState(GL_VERTEX_ARRAY);
glDrawElements(GL_TRIANGLES, 12, GL_UNSIGNED_INT, indices);
Index data
• DrawElements only needs to send the indices
• Actually we can optimise that away too; Element Arrays allow buffer objects to contain index data
– Index data is far smaller in volume, and tends to come in larger batches if state changes are minimised, so this can be over-optimisation
• Keep batches as large as possible
– Keep state changes to a minimum
• Primarily use triangle lists
– Don’t mess with locality of reference
– Strips can be marginally more efficient
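The strip-versus-list trade-off is just index bookkeeping. A hypothetical helper (names mine): expand a strip’s index stream into the equivalent triangle list, flipping the winding of odd triangles so back-face culling still works.

```c
#include <assert.h>

/* Expand a triangle strip into a triangle list.  A strip of n indices
   encodes n-2 triangles; the list needs 3*(n-2) indices, which is why
   lists cost a little more index bandwidth but keep batching simple. */
static int strip_to_list(const unsigned *strip, int n_strip, unsigned *list)
{
    int out = 0;
    for (int i = 0; i + 2 < n_strip; ++i) {
        if (i % 2 == 0) {            /* even triangle: keep winding */
            list[out++] = strip[i];
            list[out++] = strip[i + 1];
        } else {                     /* odd triangle: swap first two */
            list[out++] = strip[i + 1];
            list[out++] = strip[i];
        }
        list[out++] = strip[i + 2];
    }
    return out;   /* number of list indices written */
}
```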
Display Lists
• These offer the driver the opportunity for unlimited optimisation
• It’s hard for the driver to do
– The list can contain literally any GL command
• Not recommended for games or other consumer apps
• Professional GL apps do make heavy use of display lists (and immediate mode)
– The effort required to optimise these efficiently is one reason professional GL cards are more expensive
Visibility optimisations
• It’s far more efficient not to render something at all
• Try to avoid sending primitives that can’t be seen
– Not in the view frustum
– Obscured
• Or send them, but have them rejected at some early point in the pipeline
– Cull primitives before rasterisation
– Reject fragments before shading
Bounds
• Use bounding boxes or spheres to reject objects wholly outside the view frustum
• Optimal methods for using these were covered in lecture 11
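The core of a bounding-sphere frustum check is a signed-distance test against each plane. A sketch, with an assumed plane convention (normalised normal, `ax + by + cz + d >= 0` is the inside half-space):

```c
#include <assert.h>

struct plane { float a, b, c, d; };

/* An object can be culled if its bounding sphere lies entirely on the
   outside of any one frustum plane: the signed distance from the plane
   to the sphere centre is more negative than the radius. */
static int sphere_outside_plane(const struct plane *p,
                                float cx, float cy, float cz, float r)
{
    float dist = p->a * cx + p->b * cy + p->c * cz + p->d;
    return dist < -r;
}
```

A full frustum test ORs this over the six planes; a sphere that fails no plane test may still be outside the frustum (near an edge), but the test never culls anything visible.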
Occlusion culling
• PVS (Potentially Visible Set) culling
– For each location in the set of locations, store which other locations might be visible
– Precalculate before the render process starts
• If you are standing anywhere in A, you absolutely cannot see C, and vice versa
– View frustum checks cannot solve this part of the problem; consider the position of the observer shown
– A frustum test is still useful; if the observer were standing in B looking the same way, bounds could cull C
• Very effective in room-based games; not so useful in outdoor games
– Fewer large-scale occluders
[Diagram: three rooms A, B and C, with B between A and C; A and C are mutually occluded]
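At runtime a PVS lookup is a single bit test. A toy sketch (the layout and the three-room table are my illustration of the slide’s A/B/C example, not id Software’s format): one bitmask per room, precomputed offline.

```c
#include <assert.h>

enum { ROOM_A, ROOM_B, ROOM_C };

/* Assumed A/B/C arrangement from the slide: A and C cannot see each
   other; B can see both.  pvs[from] has bit `to` set if room `to`
   might be visible from anywhere in room `from`. */
static const unsigned example_pvs[3] = {
    (1u << ROOM_A) | (1u << ROOM_B),                  /* from A */
    (1u << ROOM_A) | (1u << ROOM_B) | (1u << ROOM_C), /* from B */
    (1u << ROOM_B) | (1u << ROOM_C),                  /* from C */
};

static int might_be_visible(const unsigned *pvs, int from, int to)
{
    return (pvs[from] >> to) & 1u;
}
```

Everything in rooms that fail this test is skipped before any frustum or GPU work is done, which is what makes PVS culling so cheap per frame.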
Other visibility methods
• Portals – as discussed in lecture 11
• BSP – Binary Space Partition – trees
– A complex but efficient way to store large static worlds for fast frustum visibility calculations
– Combine with PVS and portals; all need a precalculation phase
• Abrash – Graphics Programming Black Book, ch. 59-64, 70
– Detailed information on these and other research he and John Carmack did on visibility while developing Quake
– Still in use today in modern FPS games (with many enhancements!)
Model LOD
• If you need to render something, render less of it
• Demonstrated two weeks ago:
– A model close to the camera requires many triangles
– Carry reduced-detail models and select one on each render
– Like mipmapping, the memory cost is not prohibitive
– Target sizes near the GPU’s high-efficiency 100-pixel region
• Visualise with wireframe
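LOD selection per render can be as simple as bucketing the model’s projected size. A hypothetical selector (thresholds are made up for illustration; tune against the ~100-pixel-triangle sweet spot mentioned above):

```c
#include <assert.h>

/* Pick a reduced-detail model index from the model's projected height
   on screen.  0 = full detail; higher = coarser.  Real thresholds
   would be tuned per model so triangles stay near the GPU's efficient
   ~100-pixel region. */
static int select_lod(float screen_height_px)
{
    if (screen_height_px > 400.0f) return 0;
    if (screen_height_px > 100.0f) return 1;
    if (screen_height_px >  25.0f) return 2;
    return 3;
}
```

Hysteresis (slightly different up/down thresholds) or morphing between levels hides the popping the next slide describes.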
Model LOD
• Non-trivial implementation
– Popping is a well-known issue; morphing or blending are common solutions
– Must generate the reduced-detail models
– Can reuse vertex data, just change indices
• Terrain offers particular challenges
– LOD systems are essential for really large worlds
– Terrain tiles must match between different LODs
• Can also solve sampling issues
– As with undersampling textures, semi-random triangles can be picked; occurs if triangles are smaller than 1 pixel
GPU primitive culling
• Degenerate primitives (example: triangles with two indices the same) will be culled at index fetch
• A primitive with all vertices outside the same clip plane will be culled
• Back-face culling is a simple optimisation and should be used for all closed opaque models
• Zero-area triangles will be culled before rasterisation
– This is rarely usefully exploitable
• Scissor rectangles cull large parts of primitives during rasterisation
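Back-face culling reduces to the sign of the projected triangle’s area. A sketch, assuming the usual `glFrontFace(GL_CCW)` convention (counter-clockwise in screen space is front-facing):

```c
#include <assert.h>

/* Signed area of the projected triangle via a 2D cross product.
   Positive = counter-clockwise = front-facing under the assumed
   convention; <= 0 covers both back faces and the zero-area
   triangles the slide notes are also culled. */
static int is_back_facing(float x0, float y0, float x1, float y1,
                          float x2, float y2)
{
    float signed_area = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0);
    return signed_area <= 0.0f;
}
```

For a closed opaque model roughly half of all triangles fail this test, which is why enabling it is close to free performance.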
GPU Z rejection
• The Z test can occur before shading
– Reduces colour read/write load as well
• Some states inhibit the early Z test
– Writing Z in the shader, obviously
– Gating Z update in the shader (pixel kill / alpha test with Z write)
– Alpha test sounds like an optimisation, but it only saves colour read/write; use it for visual effect, not performance
– Shader kill acts as a shader conditional
• The Z unit can reject at hundreds of pixels per clock
– The accept rate is lower (at the very least Z has to be written), but as fast or faster than any other post-rasteriser operation
• Stencil usually rejects at Z rates
– Having a stencil op that does something implies a stencil write
Early Z rejection
• Draw opaque geometry in roughly front-to-back order
– Do not work too hard to make this perfect; that’s what the Z buffer was created for in the first place
– Do not draw the sky first. Please!
– This assumes you’re bottlenecked in the shader
• Consider a Z pass
– If the fragment shaders are very expensive
– If at any point rendering the colour buffer you need some algorithm that requires the Z buffer
– Disable colour writes (glColorMask) or fill the colour buffer with something cheap but useful (example: ambient lighting)
– Invariance issues should be rare nowadays (but be aware)
Shader conditionals
• Conditionals can also reduce shader load
• Treat with care…
• Use mostly for high-coherency data; the conditional is unlikely to have per-pixel granularity
– An if-then-else clause can have to execute both branches
• For low-coherency data, prefer conditional-move type operations
• Typically the shader compiler and optimiser can’t know much about the likely coherency
– So it guesses
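The branch-versus-select distinction is easy to show in C (GPU shading languages have direct equivalents such as GLSL’s `mix`). These are illustrative helpers of mine, not shader code:

```c
#include <assert.h>

/* Conditional select: both values already exist; the hardware picks
   one without branching.  Good for low-coherency (per-pixel noisy)
   conditions, where a real branch would execute both sides anyway. */
static float select_f(int cond, float a, float b)
{
    return cond ? a : b;
}

/* GLSL-style mix(): blend rather than pick.  Both inputs are always
   evaluated, so this only wins when the work per side is small. */
static float mix_f(float t, float a, float b)
{
    return a * (1.0f - t) + b * t;
}
```

A real `if` only pays off when whole groups of neighbouring pixels take the same path, so the skipped side’s instructions genuinely go unexecuted.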
Triangle sizes
• Larger triangles are more efficient than small ones
• Rules of thumb:
– Over 1000 pixels is large
– 100-pixel triangles are considered typical, and the GPU should be in the ballpark of its peak performance
– Under 25 pixels is small
– Tiny triangles are likely to cause granularity losses in the GPU
• Often the type of object and the size of triangle are related
– Example: world triangles tend to be larger than entities
Bump mapping
• Can trade off geometric complexity for more expensive fragment shading
– Textures in general offer this capability; light maps are an earlier example
• Having a normal map available in the fragment shader is useful for other reasons too
– Per-pixel lighting is an obvious use
• Doom3 was an early pioneer:
– Polygon counts are low compared with other games of the time
– Bump mapping makes this hard to see, except on silhouette edges
Practical Optimisation and Debugging
Optimising Applications
• Always profile; never assume
• Target optimisations
– Better to get a small gain in something that takes half the time than a big gain in something that takes a couple of percent
– Better to do easy things than hard things
– “Low-hanging fruit”
Instrumentation for debugging
• Logging
• Visualisation: make particular rendering (more) visible
• Simple interfaces into the high-level parts of the program to make low-level testing easier
– ‘God mode’
– Skip to level N or a subpart of the level
– Saved games may seem to be an answer here, but minor changes during development usually break them
– Metadata display
• Multiple monitors and remote debugging
– Key for fullscreen applications
– Useful to have a ‘stable’ dev machine and a separate debug target
Instrumentation for performance
• Feedback on what the performance actually is
– A simple onscreen frames-per-second (FPS) and/or time-per-frame counter
– Special benchmarking modes
• Modify the performance
– Skip particular rendering passes
– Add known extra load
– Examples: new entities, particle system load, forcing postprocessing effects on
Real-world example: Doom3 engine
• Heavily instrumented, with a developer console accessed with ctrl-alt-
– Most commands are prefixed according to their functional unit
– r_ commands are to the renderer, s_ the sound system, sv_ the server, g_ the client (game), etc.
– Record demos; play back with playdemo or timedemo
– Capture individual frames with demoshot for debugging or performance
– Can also send console commands from the command line – essential for external tools
– Many debugging commands
– noclip to fly anywhere on a level
– r_showshadows 1 displays the shadow volumes
– g_showPVS 1 shows the PVS regions at work
More Doom3 convenience features
• PAK files are just ZIP files
– You can look at the ARB_fragment_program shaders Doom3 uses (glprogs/ directory in the first pakfile)
– You can also modify them: real files (e.g. under the base/glprogs directory) override the pakfiles
• Human-readable configuration files
• TAB completion on the console
– Long commands are not a problem – plus you can find the command you want!
• Key bindings
Doom3 render: multipass process
1. Z pass: set the Z buffer for the frame
2. Lighting passes: for each light in the scene
– 2A. Shadow pass: render shadow volumes into the stencil buffer
– 2B. Interaction pass: accumulate the contribution from this light to the framebuffer
– Cheap Phong algorithm (per-pixel lighting with interpolated E; Prey calculates E on a per-pixel basis for better specular)
– A vertex/fragment shader pair
3. Effects rendering; mostly blended geometry for explosions, smoke, decals, etc.
4. One or more postprocessing phases for refraction and other screen-space effects
Doom3 benchmarking tools
• Each render pass can be disabled from the console
– r_skipinteractions, r_shadows, r_skippostprocess
– Benchmark each pass individually
– Worth considering render time rather than just FPS; time is a linear quantity

Rendered         FPS     Frame time (ms)   Isolated pass   Pass time (ms)   Pass load
Everything       55.8    17.9
- postproc       58.5    17.1              Postproc        0.8              4%
- interactions   104.7   9.6               Interaction     7.5              42%
- shadows        174.5   5.7               Shadows         3.9              22%
                                           The rest        5.7              32%
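The numbers in the table follow from simple arithmetic, which is also why render time is the right quantity to compare: frame time is the reciprocal of FPS, and a pass’s cost is the difference in frame time with the pass on and off. Hypothetical helpers (names mine):

```c
#include <assert.h>
#include <math.h>

/* Frame time in milliseconds from a frames-per-second figure. */
static double frame_ms(double fps)
{
    return 1000.0 / fps;
}

/* Cost of an isolated pass: frame time with everything enabled minus
   frame time with that one pass skipped. */
static double pass_ms(double fps_all, double fps_without)
{
    return frame_ms(fps_all) - frame_ms(fps_without);
}
```

Note that differencing FPS directly would mislead: the shadows line gains 119.7 FPS but only 3.9 ms, while interactions gain 48.9 FPS yet cost 7.5 ms.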
Case study: Doom3 interaction shader
• The shader has 7 texture lookups
– Texture limited on most GPUs
– One of them was a simple function texture
– Probably originally a point of customisation, but unused
– We tested the gain by eliminating the lookup
– Replaced with a constant – note not 0 or 1, which might allow the optimiser to eliminate other code
– Provided the expected ~15% gain for the pass
– Then replaced with a couple of scalar ALU instructions
– The gain was still the same, as the scalar ALU scheduled into gaps in the existing shader
• Quake4 and later games all picked up the change
Instrumenting applications
• Be wary of profiling API calls
– Asynchronous system; SwapBuffers is probably the only point of synchronisation
– You can’t easily measure hardware performance at a finer granularity than a frame
– Don’t try to profile the cost of rendering a mesh by timing DrawElements; that only measures the time taken to validate state and fill the command buffer
– Which isn’t to say that’s never useful information
Instrumenting applications
• Don’t overprofile
– QueryPerformanceCounter has a cost
– Even RDTSC does
• Try to look at the high level and in broad terms first
– 30% physics, 20% walking the scene graph, 30% in the driver, 20% waiting for end of frame
– Rather than 15.26% inside DrawElements
• Aim to be GPU limited, then optimise the GPU workload
– Don’t waste time optimising CPU code if it’s waiting for the GPU
– Iterate as the GPU workload becomes more optimal
• Try to avoid compromising readability for performance
– Rarely necessary
– Download the Quake 3 source to see how clear really fast code can be
– The games industry is really, incredibly, bad at this
Benchmark modes
• Timed runs on repeatable scenes
• Two options:
– Fix the number and exact content of frames and time the run (could be one frame repeated N times)
– Fix the run time, render frames as fast as possible, and count the frames
• The former is more repeatable; often essential if tools require multiple runs to accumulate data
• The latter is more convenient for benchmarkers and more realistic to how games behave in the real world
• Cynical reason for benchmark modes: applications get more attention from the press (and hence driver developers)
CodeAnalyst
• CodeAnalyst is an AMD tool that allows non-intrusive profiling of an application’s CPU usage
– A profiling session spawns the application under test
– Make sure to avoid profiling startup and shutdown time
– Can drill down to individual source lines in your code and show you the cost
– Many examples on AMD’s web site use it
– Useful for all CPU-limited applications
CodeAnalyst hints
• A spike inside driver components may not be driver overhead
– The driver is probably waiting on the GPU to meet the SwapBuffers limit
– If there’s not a large spike in the driver, it’s probably the application that’s the limit
– This is complicated by the fact that the driver may choose to block if the GPU is not yet ready, so time may move from the driver to being reported as the ‘system idle process’, PID 0, or similar
• Vary the resolution and check how the traces change
– If the relative time in the driver or system idle doesn’t change, the application is not pixel limited
• Multicore systems make interpreting the results harder
– You might be best off switching a core off if you can
GPUPerfStudio
• Lets you look inside the GPU
• Hardware performance counters
– 3D busy is the most obvious, and often the most important
– Vertex / pixel load can also be seen
• The bad news: GL support is not in the currently downloadable version 1.1. Coming soon…
Shader Development
• AMD GPUShaderAnalyzer
– Available for download from the AMD web site
– Will handle all GL shader types (GLSL, ARB_fp, ARB_vp)
– A good development environment; no need to run your app to compile
– Will show output code and statistics, including estimated cycle counts, for all AMD GPUs
Scalability
• Look to create consistent performance
– Better to run at 30fps continuously than to oscillate wildly between 15fps and 100fps
– Target worst-case scenes
– You will need headroom to guarantee 60fps
• Is a particular gain useful?
– A 4% speedup won’t help anyone play your game
– Five 4% speedups would, though
– Gains in a lesser component allow more use of that component
Scalability
• The PC environment is a huge scalability challenge
– The matrix of CPUs, GPUs and render resolutions is huge
– Performance is in tension with image quality
– Adjust quality to scale for GPU power and set higher loads; when CPU limited, more pixels probably have no cost
– Adjust quality in profiling: resolution (or clock) scaling tests whether you are CPU or GPU limited
• Consoles have it easier: more fixed in every way
– Still need headroom, just less of it
– Now have resolution scaling issues – five TV resolutions in NTSC 480i, PAL 576i, 720p, 1080i/p
– 60Hz / 50Hz is a headache here
Caveats on optimisation
• Windowed mode
– GPUs can behave differently in windowed mode than in fullscreen mode
– Windowed should still be your primary development mode unless you have remote debugging
• Front buffer rendering
– May be useful for debugging, but could have similar performance implications
• Avoid misusing benchmarks
– Repeat runs – make sure everything’s ‘warm’
Guidelines for Project 3
• Concentrate on the scene graph first, GPU second, CPU cycle-picking last
– Look for algorithms that cull monsters, trees and rooms rather than triangles or pixels
• Work on model or texture data on the GPU, not the CPU
– Primarily, use the shader to do the work
– Anywhere index data, primitive count and connectivity don’t change is a candidate
– If you have to generate a texture, consider using the GPU
Guidelines for Project 3
• Short of time to write shaders?
– Write a few shaders that you use a lot
• Don’t try to do everything in this lecture
– Many techniques won’t apply to your specific case
– Even those that do often won’t matter
– Profile-guided optimisation!
Headline performance items
• Scene graph optimisations: visibility culling, model LOD
• Don’t touch model data on the CPU unless the algorithm absolutely requires it
• Use vertex arrays for complex mesh data (> 10 primitives); store static data in VBOs
• Use mipmaps for all static textures; avoid undersampling textures without mipmaps
• Render roughly front to back; don’t kill yourself trying, but give it a go for the largest geometry; draw the sky last!
• Use compressed textures by default; only disable if artifacts appear
• Disable unnecessary alpha testing; don’t do kills in shaders unless you have to
• Move work from fragment to vertex shaders where possible
• Prefer moderate math to texture lookups, particularly if they increase the dependent fetch level
Further reading
• Abrash, Mike: The Graphics Programming Black Book
– Even in 1997 the asm and register programming section was dated
– Much of the Quake documentation isn’t
– Clear explanation of BSP, PVS and some on portals
– The rest is still worth reading to show the mindset
– Skip the asm-specific bits; concentrate on the thought process
– Chapter 1 and chapter 70 are required reading
• Stencil shadows: the Wikipedia page has many links
Samples and Tools
• http://ati.amd.com/developer/
GPUPerfStudio, GPUShaderAnalyzer and The Compressonator
Tootle is also interesting: it optimises meshes both for the vertex cache and for ‘internal’ front-to-backness
Many other samples, documents and tools
http://www.amd.com/codeanalyst
Questions
• If we have time…
Appendix
Background information on more aspects of the GPU
a.k.a. “The slides I knew I didn’t have time to go through”
Texture and rendertarget tiling
• Memory interface efficiency is mostly determined by burst sizes
The more useful memory fetched in one go, the better
Avoid fetching anything that isn’t then used
– This is why mipmapping is so important: minifying a texture without mipmaps implies fetching texels that aren’t then used
• Rearranging memory into tiles increases locality of reference
64 bytes might contain 4x4 pixels instead of 16x1 pixels
Format is transparent to application
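One way to picture the tiled layout: compute where a pixel lands when the surface is stored as 4x4 tiles of 64 bytes (4-byte pixels), as the slide describes. This is a simplified illustration only; real hardware tiling formats vary and, as noted, are transparent to the application:

```c
#include <stdint.h>

/* Pixel index (in 4-byte pixels) within a surface stored as 4x4 tiles,
   i.e. 64 bytes per tile. Assumes width is a multiple of 4 for
   simplicity. Hypothetical helper for illustration. */
static uint32_t tiled_index(uint32_t x, uint32_t y, uint32_t width)
{
    uint32_t tiles_per_row = width / 4;
    uint32_t tile   = (y / 4) * tiles_per_row + (x / 4); /* which tile */
    uint32_t within = (y % 4) * 4 + (x % 4);             /* 0..15 inside */
    return tile * 16 + within;      /* 16 pixels per 64-byte tile */
}
```

Note how (0,0) and (0,1) land 4 pixels apart in the tiled layout: a vertically adjacent neighbour sits in the same 64-byte burst, whereas in a linear layout it is a whole pitch away.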
Texture Compression
• GL_ARB_texture_compression
• The S3TC / DXTC / BC algorithm is a high quality method for typical image textures Designed such that the artifacts introduced in lossy compression
tend to be smoothed out by texture filtering
Function textures and unusual use textures may not meet acceptable quality
Rearranging components can help
Use high-quality compressors - The Compressonator
• Compression isn’t just about memory bandwidth
Reduces effective latency (one fetch brings in more useful texels)
Effectively increases texture cache size
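The bandwidth and cache points follow directly from the block size: the S3TC/DXT1 scheme packs each 4x4 texel block into 8 bytes, an 8:1 saving over uncompressed RGBA8, so one cache line holds eight times as many texels. A small footprint calculation (the helper names are illustrative):

```c
#include <stdint.h>

/* Storage for a w x h texture, uncompressed 32-bit RGBA. */
static uint32_t rgba8_bytes(uint32_t w, uint32_t h)
{
    return w * h * 4;
}

/* Storage for the same texture in DXT1: each 4x4 block of texels is
   encoded in 8 bytes (two 16-bit endpoint colours plus 16 two-bit
   per-texel indices). Block counts round up for non-multiple-of-4
   dimensions. */
static uint32_t dxt1_bytes(uint32_t w, uint32_t h)
{
    uint32_t bw = (w + 3) / 4;
    uint32_t bh = (h + 3) / 4;
    return bw * bh * 8;
}
```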
Texture Filtering
• A bilinear-filtered sample is the common basic unit of work for a texture unit
Point sampling is unlikely to be any faster than bilinear; you can make this work for you in image-processing shaders (rather than point sampling and doing some constant weighted sum)
Each additional bilinear sample for trilinear or anisotropic filtering probably consumes additional time
• Smart algorithms ensure that only needed samples are taken
No need for trilinear if magnifying
No need for anisotropy if square-on
Example: walls tend to have less anisotropy than floors
• Gradient calculations may be dynamic
Necessary to handle dependent texture reads
Be wary with dependency; the gradient can be unpredictable
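The bilinear sample itself is just a weighted sum of the four surrounding texels, which is why replacing several point samples plus a constant-weight sum with fewer bilinear taps can be a win: the hardware does the weighting for free. A reference implementation of the weighting:

```c
/* Bilinear blend of the four texels around a sample point.
   t00/t10 are the upper pair, t01/t11 the lower pair; fx, fy are the
   fractional position within the texel cell, each in [0,1). */
static float bilinear(float t00, float t10, float t01, float t11,
                      float fx, float fy)
{
    float top    = t00 + (t10 - t00) * fx;   /* lerp across upper row */
    float bottom = t01 + (t11 - t01) * fx;   /* lerp across lower row */
    return top + (bottom - top) * fy;        /* lerp between rows */
}
```

By choosing fx and fy (i.e. placing the sample between texels), one fetch yields any convex weighting of a 2x2 neighbourhood, the trick alluded to for image-processing shaders.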
Render to texture
• Useful for generating extra views or postprocessing
Example: mirror in a driving game
Example: postprocessing for refraction
• glCopyTexImage copies the framebuffer to a texture
CPU-GPU serialisation is not implied; this can probably be queued into a command buffer
• Other methods exist, such as pbuffers and framebuffer extensions
Can be slightly more efficient
Can return to a rendertarget after rendering on another
More complex; don’t use without good reason
Multisample antialiasing
• The key gain is to run the fragment shader at pixel frequency rather than sample frequency
• Also saves memory bandwidth; can compress Z and colour
The buffer may need to be resolved to an uncompressed buffer for display or if used as a texture
Triangle size may be worth extra consideration with MSAA; frame buffer and Z compression rate is likely to be in roughly inverse proportion to the number of visible edges in the scene
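The pixel-frequency point can be made concrete by counting fragment-shader invocations for one full-screen layer: with MSAA the shader runs roughly once per covered pixel and the result is replicated to the covered samples, whereas supersampling shades every sample. A back-of-the-envelope sketch (ignoring edge pixels, where exact behaviour varies by hardware; the helper names are illustrative):

```c
#include <stdint.h>

/* Fragment-shader invocations to cover a w x h layer once with MSAA:
   one per pixel, regardless of the sample count. */
static uint64_t msaa_invocations(uint32_t w, uint32_t h)
{
    return (uint64_t)w * h;
}

/* The same layer with supersampling: the shader runs per sample. */
static uint64_t ssaa_invocations(uint32_t w, uint32_t h, uint32_t samples)
{
    return (uint64_t)w * h * samples;
}
```

At 1920x1080 with 4 samples the difference is roughly 2 million versus 8 million shader runs per layer; MSAA pays the 4x cost only in storage and resolve bandwidth, not in shading.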
Caching
• Many caches inside the GPU
• Different from what you might be familiar with in a CPU
More about memory bursts and latency compensation than reuse
In general you do need to hit the memory
– Example: texture mapping the whole framebuffer at 1:1; every pixel and texel will be touched exactly once
Therefore, be pessimistic: assume this is the case
Choose to compensate for memory latency with large buffers
– Rather than using the cache to dodge the accesses
• In a few places short-term ‘reuse’ is critical
Bilinear filtering is the most obvious case
Caching
• There can still be advantages to avoiding cycling data through the caches
This used to be a big thing, particularly in the days of visible caching (software-controlled rather than automatic)
It led to the sorting policy of a hard sort by material
Nowadays it is far less important, hence the rough sort by depth
In some pathological circumstances a sort by shader and depth (or a Z pass followed by a sort by shader) might be more efficient
Disclaimer & Attribution
• DISCLAIMER
• The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
• AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
• AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
• ATTRIBUTION
• © 2007 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other names are for informational purposes only and may be trademarks of their respective owners.