CS248: Graphics Performance, Debugging and Optimisation

Dave Oldcorn

November 13th 2007


Your Guest Instructor

• Back in the mists of time, I wrote games…

• The last ten years have all been about 3D hardware

• Since 2001 at ATI, joining forces with AMD last year

• Optimisation specialist: linking the software and the hardware
  – Tweaking games
  – Understanding the hardware
  – Driver performance
  – Shader code optimisation
    – (I find assembly language fun)


Overview

• Three basic sections
  – GPU Architecture
  – Efficient OpenGL
  – Practical Optimisation and Debugging

• There’s a lot in here
  – Broad overview of all issues
  – I’ve prioritised the biggest issues and the ones most likely to help with Project 3
  – More details with respect to GPU architecture included as appendix


GPU Architecture


Graphics hardware architecture

• Parallel computation

• All about pipelines

• The OpenGL vertex pipeline shown right will be familiar…


Graphics hardware architecture

• Extend the top of the pipeline with some more implementation detail

• Ideally, every stage is working simultaneously

• Could also decompose to smaller blocks

• And eventually to individual hardware pipeline stages
  – As shown last week, the hardware implementation may be considerably more complex than a linear pipeline

[Figure: pipeline diagram with boxes Application, API, Video Drivers, Command Buffers, Parser, Vertex Assembly, Vertex Operations and Primitive Assembly, split between CPU and GPU]


Draw commands

• Data enters the GPU pipeline via command buffers containing state and draw commands

• The draw command is a packet of primitives

• Occurs in the context of the current state
  – As set by glEnable, glBlendFunc, etc.
  – The full set of state is often referred to as a state vector
  – Driver translates API state into hardware state
  – State changes may be pipelined; different parts of the GPU pipeline may be operating with different state vectors (even to the level of per-vertex data such as glColor)


Pipeline performance

• The performance of a pipelined system is measured by throughput and latency
  – Can subdivide at any level from the full pipeline down to individual stages

• Throughput: the rate at which items enter and exit

• Latency: the time taken from entrance to exit
  – Latency is not typically a major issue for API users
  – It is a huge issue for GPU designers
  – Even GPU-local memory reads may be hundreds of cycles
  – Substantial percentage of both design effort and silicon is devoted to latency compensation
  – The system will generally run at full throughput until the latency compensation is exceeded
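A minimal sketch of the last point, assuming a simple model in which a unit needs (throughput × latency) items in flight to hide its latency. The function name and the numbers in the usage note are illustrative, not figures for any real GPU.

```c
/* Sustained rate (items per cycle) of a unit that can issue at most
 * peak_rate items per cycle, waits latency_cycles for each result, and
 * can hold at most buffer_entries items in flight.
 * Assumed model: items in flight = rate * latency, so the buffering
 * caps the rate at buffer_entries / latency_cycles. */
double sustained_rate(double peak_rate, double latency_cycles,
                      double buffer_entries)
{
    double latency_bound = buffer_entries / latency_cycles;
    return latency_bound < peak_rate ? latency_bound : peak_rate;
}
```

With 256 entries of buffering, a 200-cycle latency is fully hidden (rate stays at the peak of 1 per cycle); at 400 cycles the same buffering only sustains 0.64 per cycle.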


Pipeline throughput

• Given a particular state vector, each part of the pipeline has its own throughput

• The throughput of a system can be no higher than the slowest part: this is a bottleneck
  – More generally, if input is ready but output is not, it is a bottleneck


Pipeline bottlenecks

• Consider the system shown right
  – Stage 1 can run at 1 per clock and is 100% utilised
  – Stage 2 can only accept on every other clock; still 100% utilised
  – Stage 3 is therefore starved on half of the cycles it could be working; 50% utilised
  – Although stage 3 has the longest latency, it has no effect on the throughput of the system

[Figure: three-stage pipeline. Items enter at 1 per clock. Stage 1: throughput 1/clock, latency 5 cycles; items pass at 1 per clock. Stage 2: throughput 1 per 2 clocks, latency 10 cycles; half throughput, result only every alternate clock. Stage 3: still only alternate-clock results, despite per-clock throughput.]



• A key subtlety; for this to work as shown, there must be load balancing between stages 1 and 2 (probably a FIFO)

• Once the FIFO is full, the input buffer will exert backpressure on stage 1
  – Happens after equilibrium is reached

• This pipeline therefore runs at the speed of the slowest part as soon as the FIFO fills


[Figure: the same pipeline with an input buffer (FIFO) in front of stage 2. Items enter at 1 per clock and pass stage 1 at 1 per clock, eventually queuing. Stage 2: throughput 1 per 2 clocks, latency 10 cycles; half throughput, result only every alternate clock. Stage 3: still only alternate-clock results, despite per-clock throughput.]
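The equilibrium described on this slide can be sketched as a small calculation. This assumes stages in series with fixed peak rates; the struct and function names are made up for illustration.

```c
#include <stddef.h>

/* One pipeline stage, described only by its peak rate in items per clock. */
typedef struct {
    const char *name;
    double items_per_clock;
} Stage;

/* Steady-state throughput of stages in series: the rate of the slowest stage. */
double pipeline_throughput(const Stage *stages, size_t count)
{
    double slowest = stages[0].items_per_clock;
    for (size_t i = 1; i < count; ++i) {
        if (stages[i].items_per_clock < slowest)
            slowest = stages[i].items_per_clock;
    }
    return slowest;
}

/* Fraction of its peak rate that a stage achieves once the FIFO has filled. */
double stage_utilisation(const Stage *stage, double throughput)
{
    return throughput / stage->items_per_clock;
}
```

For rates of 1, 0.5 and 1 items per clock the throughput is 0.5 per clock. Stage 2 is then 100% utilised, and stages 1 and 3 drop to 50%: stage 1 because of backpressure, stage 3 because it is starved. Latency does not appear anywhere in the calculation.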


Variable throughput

• In general, throughput is data dependent
  – Example: clipping is a complex operation which often isn’t required
  – Example: texture fetch depends on the filtering chosen, which is data dependent

• Some pipeline stages require different rates at the input and the output
  – Example: back-face culling; primitive in, no primitive out
  – Example: rasterisation of primitives to fragments; few primitives in, many fragments out

• Buffering between stages takes up the slack



• A particular state vector will tend to have a characteristic set of bottlenecks
  – The input data does also have an effect

• Small changes to the state vector can make substantial changes to the bottleneck

• As a state change filters through the pipeline and for a short period afterwards, bottlenecks shift into the new equilibrium
  – For usual loads, where the render time is much larger than the pipeline depth, this time can be ignored

• Can be hard to determine bottlenecks if the states in the pipe are disparate
  – Smearing effect



• There may be multiple bottlenecks if the throughput is not constant at all parts of the pipeline
  – In general it is not constant

• GPU buffering absorbs changes in load
  – Measured in tens or hundreds of cycles at best
  – Whole pipeline is thousands of cycles

• The bottleneck could be outside the GPU
  – Application, driver, memory management…

• Bottleneck analysis is key to hardware performance
  – Not easy: bottlenecks are always present; separating expected and unexpected cases is the challenge


Flushes and synchronisation

• Some state cannot be pipelined; a flush occurs
  – Various localities of flush
  – For a whole-pipeline flush, the parser waits before allowing new data into the pipe
  – CPU can carry on building and queuing command buffers
  – Low cost ~ thousands of cycles (~5 µs?)

• Some operations can require the CPU to wait for the GPU
  – Example: CPU wants to read memory the GPU is writing
  – This is a serialising event
  – Very expensive: wait for pipeline completion, flush all caches, and the restart time taken to build the next command buffer
  – You can force this with glFinish: please don’t!


Asynchronous system

The process of rendering a typical game image is massively asynchronous

The boxes left show possible asynchronous actors

The diagram below shows a possible timeline

The shaded areas are the same frame

• Input / Physics thread
  – Runs continuously using input and time (fixed or delta) to update the game world – typically including the scene graph
  – Typical runtime 10-30ms

• Render thread
  – Runs continuously to convert the scene graph to rendering commands
  – Generally cannot start until input/physics thread has processed whole frame

• GPU Renderer
  – Runs on its command buffers

• DAC
  – Loops over the display at 60-100Hz
  – A command buffer operation changes the display at end of render; picked up at start of next frame (unless vsync is off)

[Figure: timeline with rows Input, Render, GPU and DAC]


Synchronisation

• GPUs aim to run just under two frames ahead
  – Block at SwapBuffers if there is another SwapBuffers in the pipe that is not yet reached

• Reading any GPU memory on the CPU causes a sync
  – glReadPixels is one example; avoid it

• Writing to GPU memory generally does not
  – The GPU, driver and memory manager work together to do uploads without serialisation
  – No need to be unusually scared of glTexImage

• If you have to lock GPU memory, look for discard or write-only flags that will allow asynchronous access


Shaders

• Texture lookup operations are relatively expensive
  – Competition on GPU or system bus, cost of filtering, unpredictable
  – Some of this is only a latency issue – but latency is not important…
    – … until the buffering is exceeded
    – Latency more than doubles for dependent texture operations
  – Prefer ALU math to texture until the function is complex
  – Might replace very small textures with shader constants

• Shader – typically its texture operations – likely to be the limiting factor on performance


Shaders

• Each shader is run at a particular frequency
  – Per-vertex, per-fragment now; per-primitive also exists; per-sample seems likely in the future
  – Can view constants calculated on the CPU as another frequency (per-draw packet)
  – Aim to do calculations at the lowest necessary frequency

• Issues to be aware of:
  – Data passed from vertex to fragment shader is interpolated linearly in the space of the primitive (i.e. with perspective correction) so can only use interpolators if this is appropriate (linear or nearly so); high tessellation can be a workaround
  – Excessive use of interpolators can itself be a bottleneck; up to two interpolators per texture fetch, as a ballpark figure


Shader constants

• Shader constants are a large part of the state vector
  – Updating hundreds on each draw call will not be free

• Prefer inline constants (known at compile time) to state vector constants
  – Gives the compiler and constant manager more information

• For the same reason, avoid parameterising for its own sake

• Don’t switch shader just to change a couple of constants


Efficient OpenGL


Efficient OpenGL

• This is a data processing issue

• What data does the GPU need to render a scene?
  – State data, texture data, vertices / primitives

• CPU-side performance can easily be dominated by inefficient management of this data

• Of them all, vertex data is the most problematic

Type of data          State        Vertex            Texture
Volume (per frame)    Low (~kB)    Med-high (~MB)    Very high (~GB)
Rate of change        Very high    Low-med           Very low


Efficient vertex data

• Application needs to feed mesh data in somehow

• GL provides two basic methods
  – glBegin/glEnd (known as ‘immediate mode’)
  – Vertex arrays

• Immediate mode is easy to use but has high overheads
  – Many tiny, unaligned copies
  – Non ‘v’ forms imply extra copies
  – Command stream is unpredictable and irregular

glBegin(GL_TRIANGLE_FAN);
glColor4f(1,1,1,1);
glVertex3f(0,0,0); // position + colour
glVertex3f(0,1,0); // position only
glColor4f(1,0,0,1);
glVertex3f(1,1,0); // position + colour
glVertex3f(1,0,0); // position only
glEnd();


Vertex arrays

• Vertex arrays are an alternative
  – The application probably has its data in arrays somewhere, so let GL read them en masse
  – glVertexPointer, glColorPointer, etc. specify the array
  – glDrawElements to issue a draw command; takes index list
  – Primitives are drawn using the indices into the arrays as set up by the gl*Pointer commands

glVertexPointer(3, GL_FLOAT, 16, vertex_array);
glColorPointer(4, GL_UNSIGNED_BYTE, 0, color_array);
glEnableClientState(GL_VERTEX_ARRAY);
glEnableClientState(GL_COLOR_ARRAY);
glDrawElements(GL_TRIANGLES, 12, GL_UNSIGNED_INT, indices);

• Easier for the driver and GPU to handle
  – State vector is instantiated at the glDrawElements command
  – The GPU can process all the primitives in a single draw packet
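One common way to hold array data is a single interleaved struct per vertex; the stride and pointer arguments of the gl*Pointer calls then follow from the struct layout. The sketch below is a hypothetical layout, not necessarily the one behind the slide's vertex_array and color_array, and it assumes 4-byte floats with no extra padding.

```c
#include <stddef.h>

/* Hypothetical interleaved vertex: position followed by a packed RGBA colour. */
typedef struct {
    float position[3];        /* 12 bytes: x, y, z */
    unsigned char colour[4];  /*  4 bytes: r, g, b, a */
} Vertex;

/* Byte distance between consecutive vertices: the stride argument. */
enum { VERTEX_STRIDE = sizeof(Vertex) };

/* Byte offset of each attribute within one vertex. */
enum {
    POSITION_OFFSET = offsetof(Vertex, position),
    COLOUR_OFFSET   = offsetof(Vertex, colour)
};
```

Given an array `Vertex verts[N]`, the pointers would be set up as glVertexPointer(3, GL_FLOAT, VERTEX_STRIDE, verts[0].position) and glColorPointer(4, GL_UNSIGNED_BYTE, VERTEX_STRIDE, verts[0].colour).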


Vertex arrays

• Did you hear a but?

• The vertex data still belongs to the application
  – Until the glDrawElements call is entered, the GPU knows nothing of the data
  – After the call completes the app can change the data
  – Therefore, the driver must copy the data on every glDrawElements call
  – Even if the data never changes – the GL can’t know

• Wouldn’t it be great if we could avoid the copy?
  – We don’t supply textures on every call, just upload them to the GPU and let the driver manage them…


Buffer Objects

• This facility is provided by the Vertex Buffer Object (VBO) extension
  – Allows the creation of buffer objects in GPU memory with access mediated by the driver
  – Data can be uploaded at any time with glBufferData
    – As with glTexImage, done through command buffer to avoid serialisation
  – BindBuffer <-> BindTexture, BufferData <-> TexImage

// During program initialisation
GLuint vbo;                          // buffer name; 0 means "no buffer bound"
glGenBuffers(1, &vbo);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, 16*4*sizeof(GLfloat), vertex_array, GL_STATIC_DRAW);
...
// In render loop
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glVertexPointer(3, GL_FLOAT, 16, 0); // last argument is now an offset into the buffer
glEnableClientState(GL_VERTEX_ARRAY);
glDrawElements(GL_TRIANGLES, 12, GL_UNSIGNED_INT, indices);


Index data

• DrawElements only needs to send the indices

• Actually we can optimise that away too; Element Arrays allow buffer objects to contain index data
  – Index data is far smaller in volume, and tends to come in larger batches if state changes are minimised, so this can be overoptimisation

• Keep batches as large as possible
  – Keep state changes to a minimum
  – Primarily use triangle lists
  – Don’t mess with locality of reference
  – Strips can be marginally more efficient


Display Lists

• These offer the driver opportunity for unlimited optimisation

• It’s hard for the driver to do
  – The list can contain literally any GL command

• Not recommended for games or other consumer apps

• Professional GL apps do make heavy use of display lists (and immediate mode)
  – The effort required to efficiently optimise these is one reason professional GL cards are more expensive


Visibility optimisations

• It’s far more efficient not to render something at all

• Try to avoid sending primitives that can’t be seen
  – Not in the view frustum
  – Obscured

• Send it, but have it rejected at some early point in the pipeline
  – Cull primitives before rasterisation
  – Reject fragments before shading


Bounds

• Bounding boxes or spheres to reject objects wholly outside the view frustum

• Optimal methods for using these were in lecture 11
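A basic bounding-sphere rejection test might look like the sketch below. It is a plain plane-by-plane check and not necessarily the optimal method from lecture 11; the types and the convention that plane normals are unit length and point into the frustum are assumptions of this example.

```c
#include <stddef.h>

/* Plane n.p + d = 0 with a unit-length normal pointing into the frustum. */
typedef struct { float nx, ny, nz, d; } Plane;
typedef struct { float x, y, z, radius; } Sphere;

/* Returns 1 if the sphere lies wholly outside at least one plane, so the
 * object can be skipped; returns 0 if it is inside or straddles the frustum. */
int sphere_outside_frustum(const Plane *planes, size_t plane_count,
                           const Sphere *s)
{
    for (size_t i = 0; i < plane_count; ++i) {
        float dist = planes[i].nx * s->x + planes[i].ny * s->y
                   + planes[i].nz * s->z + planes[i].d;
        if (dist < -s->radius)
            return 1;   /* centre is more than one radius behind this plane */
    }
    return 0;
}
```

The test is conservative: a sphere near a frustum corner can pass every plane test while still being invisible, which costs some wasted drawing but never removes a visible object.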


Occlusion culling

• PVS (Potentially Visible Set) culling
  – For each location in the set of locations, store which other locations might be visible
  – Precalculate before render process starts

• If you are standing anywhere in A, you absolutely cannot see C and vice versa
  – View frustum checks cannot solve this part of the problem; consider the position of the observer shown
  – A frustum test is still useful; if the observer was standing in B looking the same way, bounds could cull C

• Very effective on room-based games; not so useful on outdoor games
  – Fewer large-scale occluders

[Figure: floor plan with three locations labelled A, B and C, and an observer]
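At run time a PVS reduces to a table lookup. The sketch below stores one bit per pair of locations. The A/C entries follow the slide; the assumption that B can see both neighbours is an illustration, since the figure is not reproduced here.

```c
enum { ROOM_A, ROOM_B, ROOM_C, ROOM_COUNT };

/* Precalculated PVS: bit j of pvs[i] is set if room j might be visible
 * from somewhere inside room i. */
static const unsigned char pvs[ROOM_COUNT] = {
    (1u << ROOM_A) | (1u << ROOM_B),                   /* from A: never C */
    (1u << ROOM_A) | (1u << ROOM_B) | (1u << ROOM_C),  /* from B: assumed all */
    (1u << ROOM_B) | (1u << ROOM_C)                    /* from C: never A */
};

/* Returns 1 if to_room needs to be considered when the viewer is in from_room. */
int potentially_visible(int from_room, int to_room)
{
    return (pvs[from_room] >> to_room) & 1;
}
```

Rooms that pass this lookup would still go through the frustum test, which is how C gets culled for an observer in B who is looking away from it.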


Other visibility methods

• Portals – as discussed in lecture 11

• BSP – Binary Space Partition – trees
  – Complex but efficient way to store large static worlds for fast frustum visibility calculations
  – Combine with PVS and portals; all need precalculation phase

• Abrash - Graphics Programming Black Book ch. 59-64, 70
  – Detailed information on these and other research he and John Carmack did on visibility while developing Quake
  – Still in use today on modern FPS games (with many enhancements!)


Model LOD

• If you need to render something, render less of it

• Demonstrated two weeks ago:
  – A model close to the camera requires many triangles
  – Carry reduced detail models and select on each render
    – Like mipmapping, memory cost not prohibitive
    – Target sizes near the GPU’s high efficiency 100 pixel region

• Visualise with wireframe
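One possible selection rule, sketched under the assumption that the application can estimate a model's projected screen area in pixels: pick the most detailed model whose average triangle still covers roughly the target size. The function and parameter names are hypothetical, and the estimate ignores back faces and overdraw.

```c
#include <stddef.h>

/* triangle_counts is ordered from most detailed (index 0) to least detailed.
 * Returns the index of the most detailed LOD whose average triangle covers
 * at least target_pixels_per_triangle, or the coarsest LOD if none does.
 * lod_count must be at least 1. */
size_t select_lod(const unsigned *triangle_counts, size_t lod_count,
                  double projected_area_pixels,
                  double target_pixels_per_triangle)
{
    for (size_t i = 0; i < lod_count; ++i) {
        double pixels_per_triangle = projected_area_pixels / triangle_counts[i];
        if (pixels_per_triangle >= target_pixels_per_triangle)
            return i;
    }
    return lod_count - 1;
}
```

With models of 4000, 1000 and 250 triangles and a 100 pixel target, a model covering 500,000 pixels uses the full-detail mesh, one covering 100,000 pixels drops to the 1000-triangle mesh, and anything small falls through to the coarsest.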


Model LOD

• Non-trivial implementation
  – Popping a well known issue; morph or blend common solutions
  – Must generate reduced detail models
  – Can reuse vertex data, just change indices

• Terrain offers particular challenges
  – LOD systems essential for really large worlds
  – Terrain tiles must match between different LODs

• Can also solve sampling issues
  – As with undersampling textures, semirandom triangles can be picked; occur if triangles are smaller than 1 pixel


GPU primitive culling

• Degenerate primitives (example: triangles with two indices the same) will be culled at index fetch

• A primitive with all vertices outside the same clip plane will be culled

• Back-face culling is a simple optimisation and should be used for all closed opaque models

• Zero area triangles will be culled before rasterisation
  – This is rarely usefully exploitable

• Scissor rectangles cull large parts of primitives during rasterisation


GPU Z rejection

• The Z test can occur before shading
  – Reduces colour read/write load as well

• Some states inhibit early Z test
  – Write Z in the shader, obviously
  – Gate Z update in the shader (pixel kill / alpha test with Z write)
    – Alpha test sounds like an optimisation, but it only saves colour read/write; use it for visual effect not performance
    – Shader kill acts as a shader conditional

• Z unit can reject at hundreds of pixels per clock
  – Accept rate is lower (at the very least Z has to be written) but as fast or faster than any other post-rasteriser operation

• Stencil usually rejects at Z rates
  – Having a stencil op that does something implies a stencil write


Early Z rejection

• Draw opaque geometry in roughly front-to-back order
  – Do not work too hard to make this perfect, that’s what the Z buffer was created for in the first place
  – Do not draw the sky first. Please!
  – This assumes you’re bottlenecked in the shader

• Consider a Z pass
  – If the fragment shaders are very expensive
  – If at any point rendering the colour buffer you need some algorithm that requires the Z buffer
  – Disable colour writes (glColorMask) or fill the colour buffer with something cheap but useful (example: ambient lighting)
  – Invariance issues should be rare nowadays (but be aware)
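A rough front-to-back ordering needs nothing more than a sort of the opaque draw list by distance. The sketch below assumes a hypothetical per-object record holding a view-space depth; per-object granularity is usually enough, since the Z buffer resolves the rest.

```c
#include <stdlib.h>

/* Hypothetical draw record: an object id and its distance from the camera. */
typedef struct {
    int   object_id;
    float view_depth;   /* larger means farther away */
} DrawItem;

/* qsort comparator: nearer items sort first. */
static int compare_front_to_back(const void *a, const void *b)
{
    float da = ((const DrawItem *)a)->view_depth;
    float db = ((const DrawItem *)b)->view_depth;
    return (da > db) - (da < db);
}

/* Order opaque draws nearest first so that later, farther fragments
 * tend to be rejected by the early Z test before they are shaded. */
void sort_front_to_back(DrawItem *items, size_t count)
{
    qsort(items, count, sizeof items[0], compare_front_to_back);
}
```

Giving the sky the largest depth value makes it sort to the end of the list, which is where the slide wants it.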


Shader conditionals

• Can also reduce shader load

• Treat with care…

• Use mostly for high coherency data
  – The conditional is unlikely to have per-pixel granularity
  – An if-then-else clause may have to execute both branches

• For low coherency data, prefer conditional move type operations

• Typically the shader compiler and optimiser can’t know much about the likely coherency
  – So it guesses


Triangle sizes

• Larger triangles are more efficient than small ones

• Rules of thumb:
  – Over 1000 pixels is large
  – 100 pixel triangles are considered typical and the GPU should be into the ballpark of its peak performance
  – Under 25 pixel triangles are small
  – Tiny triangles likely to cause granularity losses in the GPU

• Often the type of object and size of triangle are related
  – Example: world triangles tend to be larger than entities


Bump mapping

• Can trade off geometric complexity for more expensive fragment shading
  – Textures in general offer this capability
    – light maps are an earlier example

• Having a normal map available in the fragment shader is useful for other reasons too
  – Per-pixel lighting is an obvious use

• Doom3 an early pioneer:
  – polygon counts are low compared with other games of the time
  – bump mapping makes it hard to see except on silhouette edges


Practical Optimisation and Debugging


Optimising Applications

• Always profile; never assume

• Target optimisations
  – Better to get a small gain in something that takes half the time than a big gain in something that takes a couple of percent
  – Better to do easy things than hard things
  – “Low-hanging fruit”


Instrumentation for debugging

• Logging

• Visualisation: make particular rendering (more) visible

• Simple interfaces into the high-level parts of the program to make low-level testing easier
  – ‘God mode’
  – Skip to level N or subpart of the level
    – Saved games may seem to be an answer here, but minor changes during development usually break them
  – Metadata display

• Multiple monitors and remote debugging
  – Key for fullscreen applications
  – Useful to have ‘stable’ dev machine and separate debug target


Instrumentation for performance

• Feedback on what the performance actually is
  – A simple onscreen frames per second (FPS) and/or time-per-frame counter
  – Special benchmarking modes

• Modify the performance
  – Skip particular rendering passes
  – Add known extra load
    – Examples: new entities, particle system load, force postprocessing effects on


Real-world example: Doom3 engine

• Heavily instrumented with developer console accessed with ctrl-alt-
  – Most commands prefixed according to their functional unit
    – r_ commands are to the renderer, s_ the sound system, sv_ the server, g_ the client (game), etc.
  – Record demos; playback with playdemo or timedemo
  – Capture individual frames with demoshot for debugging or performance
  – Can also send console commands from the command line – essential for external tools
  – Many debugging commands
    – noclip to fly anywhere on a level
    – r_showshadows 1 displays the shadow volumes
    – g_showPVS 1 to show the PVS regions at work


More Doom3 convenience features

• PAK files are just ZIP files
  – You can look at the ARB_fragment_program shaders Doom3 uses (glprogs/ directory in the first pakfile)
  – You can also modify them: real files (e.g. under the base/glprogs directory) override the pakfiles

• Human-readable configuration files

• TAB completion on the console
  – Long commands not a problem – plus you can find the command you want!

• Key bindings


Doom3 render: multipass process

1. Z pass: set the Z buffer for the frame

2. Lighting passes: for each light in the scene
   2A. Shadow pass: render shadow volumes into the stencil buffer
   2B. Interaction pass: accumulate the contribution from this light to the framebuffer
       - Cheap Phong algorithm (per-pixel lighting with interpolated E; Prey calculates E on a per-pixel basis for better specular)
       - Vertex/fragment shader pair

3. Effects rendering; mostly blended geometry for explosions, smoke, decals, etc.

4. One or more postprocessing phases for refraction and other screen-space effects


Doom3 benchmarking tools

• Each render pass can be disabled from the console
  – r_skipinteractions, r_shadows, r_skippostprocess
  – Benchmark each pass individually
  – Worth considering render time rather than just FPS; linear quantity

Rendered          FPS      Frame time (ms)   Isolated pass   Pass time (ms)   Pass load
Everything        55.8     17.9
- postproc        58.5     17.1              Postproc        0.8              4%
- interactions    104.7    9.6               Interaction     7.5              42%
- shadows         174.5    5.7               Shadows         3.9              22%
                                             The rest        5.7              32%
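The table can be reproduced from the FPS column alone, which shows why frame time is the easier quantity to reason about. A minimal sketch, with function names chosen for this example, assuming each row disables one more pass than the row above it:

```c
/* Convert a frame rate to a frame time in milliseconds. */
double frame_time_ms(double fps)
{
    return 1000.0 / fps;
}

/* Cost of one pass: frame time with the pass enabled minus frame time
 * with it disabled. */
double pass_time_ms(double fps_with_pass, double fps_without_pass)
{
    return frame_time_ms(fps_with_pass) - frame_time_ms(fps_without_pass);
}
```

For example, pass_time_ms(58.5, 104.7) gives about 7.5 ms for the interaction pass. The shadow pass comes out nearer 3.8 ms than the 3.9 ms in the table, because the table subtracts frame times that were already rounded.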


Case study: Doom3 interaction shader

• The shader has 7 texture lookups
  – Texture limited on most GPUs
  – One of them was a simple function texture
    – Probably originally a point of customisation but unused
  – We tested gain by eliminating the lookup
    – replaced with a constant – note not 0 or 1, which might allow the optimiser to eliminate other code
    – Provided the expected ~15% gain for the pass
  – Replaced with a couple of scalar ALU instructions
    – Gain was still the same, as the scalar ALU scheduled into gaps in the existing shader

• Quake4 and later games all picked up the change


Instrumenting applications

• Be wary of profiling API calls
  – Asynchronous system; SwapBuffers is probably the only point of synchronisation
  – Can’t easily measure hardware performance at a finer granularity than a frame
  – Don’t try to profile the cost of rendering a mesh by timing DrawElements; only measures the time taken to validate state and fill the command buffer
  – Which isn’t to say that’s never useful information


Instrumenting applications

• Don’t overprofile
  – QueryPerformanceCounter has a cost
  – Even RDTSC does

• Try to look at the high level and in broad terms first
  – 30% physics, 20% walking scene graph, 30% in the driver, 20% waiting for end of frame
  – Rather than 15.26% inside DrawElements

• Aim to be GPU limited, then optimise GPU workload
  – Don’t waste time optimising CPU code if it’s waiting for the GPU
  – Iterate as the GPU workload becomes more optimal

• Try to avoid compromising readability for performance
  – Rarely necessary
  – Download the Quake 3 source to see how clear really fast code can be
  – The games industry is really, incredibly, bad at this


Benchmark modes

• Timed runs on repeatable scenes
  – Two options
    – Fix the number and exact content of frames and time the run (could be one frame repeated N times)
    – Fix the run time, render frames as fast as possible, count the frames
  – Former is more repeatable; often essential if tools require multiple runs to accumulate data
  – Latter more convenient for benchmarkers and more realistic to how games behave in the real world
  – Cynical reason for benchmarks: applications get more attention from press (and hence driver developers)


CodeAnalyst

• CodeAnalyst is an AMD tool that allows non-intrusive profiling of the application’s CPU usage
  – A profiling session spawns the application under test
    – Make sure to avoid profiling startup and shutdown time
  – Can drill down to individual source lines in your code and show you the cost
  – Many examples on AMD’s web site using this
  – Useful for all CPU-limited applications


CodeAnalyst hints

• A spike inside driver components may not be driver overhead
  – The driver is probably waiting on the GPU to meet the SwapBuffers limit
  – If there's not a large spike in the driver, it's probably the application that's the limit
  – This is complicated by the fact that we may choose to block if the GPU is not yet ready, so time may move from the driver to being reported as 'system idle process', PID 0, or similar

• Vary the resolution and check how the traces change
  – If the relative time in the driver or system idle doesn't change, the application is not pixel limited

• Multicore systems make interpreting the results harder
  – You might be best off switching a core off if you can


GPUPerfStudio

• Lets you look inside the GPU

• Hardware performance counters
  – 3D busy is the most obvious and often important
  – Vertex / pixel load can also be seen

• The bad news: GL support is not in the currently downloadable version 1.1. Coming soon…


Shader Development

• AMD GPUShaderAnalyzer
  – Available to download from AMD web site
  – Will handle all GL shader types (GLSL, ARB_fp, ARB_vp)
  – Good development environment; no need to run your app to compile
  – Will show output code and statistics including estimated cycle counts for all AMD GPUs


Scalability

• Look to create consistent performance
  – Better to run at 30fps continuously than oscillate wildly between 15fps and 100fps
  – Target worst-case scenes
  – You will need headroom to guarantee 60fps

• Is a particular gain useful?
  – A 4% speedup won’t help anyone play your game
  – Five 4% speedups would, though
  – Gains in a lesser component allow more use of that component
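The arithmetic behind "five 4% speedups" can be checked with a few lines. This is a sketch assuming the gains are independent and each one multiplies the frame rate; the function name is made up.

```c
/* Combined speedup factor from `count` independent gains of `gain` each,
 * where gain is a fraction (0.04 means 4%). */
double compound_speedup(double gain, int count)
{
    double factor = 1.0;
    for (int i = 0; i < count; ++i)
        factor *= 1.0 + gain;
    return factor;
}
```

compound_speedup(0.04, 5) is roughly 1.217, so five such gains take a 30fps scene to about 36.5fps, which is a difference a player can notice.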


Scalability

• PC environment is a huge scalability challenge
  – Matrix of CPUs, GPUs and render resolutions is huge
  – Performance is in tension with image quality
  – Adjust quality to scale for GPU power and set higher loads
    – when CPU limited, more pixels probably have no cost
  – Adjust quality in profiling
    – Resolution (or clock) scaling to test if CPU or GPU limited

• Consoles have it easier: more fixed in every way
  – Still need headroom, just less of it
  – Now have resolution scaling issues: five TV resolutions in NTSC 480i, PAL 576i, 720p, 1080i/p
  – 60Hz / 50Hz is a headache here


Caveats on optimisation

• Windowed mode
  – GPUs can behave differently in windowed mode to fullscreen mode
  – Windowed should still be your primary development mode unless you have remote debugging

• Front Buffer rendering
  – May be useful for debugging, but could have similar performance implications

• Avoid misusing benchmarks
  – Repeat runs – make sure everything’s ‘warm’


Guidelines for Project 3

• Concentrate on the scene graph first, GPU second, CPU cycle picking last
  – Look for algorithms that cull monsters, trees and rooms rather than triangles or pixels

• Work on model or texture data in the GPU, not CPU
  – Primarily, use the shader to do the work
  – Anywhere index data, primitive count and connectivity don’t change is a candidate
  – If you have to generate a texture consider using the GPU


Guidelines for Project 3

• Short of time to write shaders
  – Write a few shaders that you use a lot

• Don’t try to do everything in this lecture
  – Many techniques won’t apply to your specific case
  – Even those that do often won’t matter
  – Profile-Guided Optimisation!


Headline performance items

Scene graph optimisations: visibility culling, model LOD

Don’t touch model data on the CPU unless the algorithm absolutely requires it

Use vertex arrays for complex mesh data (> 10 primitives); store static data in VBOs.

Use mipmaps for all static textures; avoid undersampling textures without mipmaps

Render roughly front to back; don’t kill yourself trying but give it a go for the largest geometry; draw the sky last!

Use compressed textures by default; only disable if artifacts appear

Disable unnecessary alpha testing; don’t do kills in shaders unless you have to

Move work from fragment to vertex shaders where possible

Prefer moderate math to texture lookups particularly if they increase the dependent fetch level


Further reading

• Abrash, Mike: The Graphics Programming Black Book
  – Even in 1997 the asm and register programming section was dated
  – Much of the Quake documentation isn’t
    – Clear explanation of BSP, PVS and some on portals
  – Rest is still worth reading to show the mindset
    – Skip asm-specific bits, concentrate on thought process
  – Chapter 1 and chapter 70 are required reading

• Stencil shadows; the Wikipedia page has many links


Samples and Tools

• http://ati.amd.com/developer/
  – GPUPerfStudio, GPUShaderAnalyzer and the Compressonator
  – Tootle is also interesting; optimise meshes both for vertex cache and ‘internal’ front-to-backness
  – Many other samples, documents and tools

• http://www.amd.com/codeanalyst


Questions

• If we have time…


Appendix

Background information on more aspects of the GPU

a.k.a. “The slides I knew I didn’t have time to go through”


Texture and rendertarget tiling

• Memory interface efficiency mostly determined by burst sizes
  – The more useful memory fetched in one go the better
  – Avoid fetching anything that isn’t then used
    – This is why mipmapping is so important: minifying a texture implies fetching memory that isn’t then used

• Rearranging memory into tiles increases locality of reference
  – 64 bytes might contain 4x4 pixels instead of 16x1 pixels
  – Format is transparent to application
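To illustrate the 4x4 example, the sketch below compares a linear address calculation with a simple tiled one for 4-byte pixels. Real hardware tiling formats differ and are hidden from the application; this layout (tiles in row-major order, pixels row-major within each tile) is only an assumption for illustration.

```c
#include <stddef.h>

enum { TILE_W = 4, TILE_H = 4, BYTES_PER_PIXEL = 4 };

/* Byte offset of pixel (x, y) in a linear, row-major surface. */
size_t linear_offset(size_t x, size_t y, size_t width)
{
    return (y * width + x) * BYTES_PER_PIXEL;
}

/* Byte offset of pixel (x, y) when the surface is stored as 4x4 tiles.
 * width is assumed to be a multiple of TILE_W. */
size_t tiled_offset(size_t x, size_t y, size_t width)
{
    size_t tiles_per_row = width / TILE_W;
    size_t tile_index    = (y / TILE_H) * tiles_per_row + (x / TILE_W);
    size_t within_tile   = (y % TILE_H) * TILE_W + (x % TILE_W);
    return (tile_index * TILE_W * TILE_H + within_tile) * BYTES_PER_PIXEL;
}
```

On a 64-pixel-wide surface, pixel (3, 3) sits at byte 780 in the linear layout but at byte 60 in the tiled one, so the whole 4x4 neighbourhood arrives in a single 64-byte burst.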


Texture Compression

• GL_ARB_texture_compression

• The S3TC / DXTC / BC algorithm is a high quality method for typical image textures
  – Designed such that the artifacts introduced in lossy compression tend to be smoothed out by texture filtering
  – Function textures and unusual use textures may not meet acceptable quality
  – Rearranging components can help
  – Use high-quality compressors - The Compressonator

• Compression isn’t just about memory bandwidth
  – Reduces effective latency (one fetch brings in more useful texels)
  – Effectively increases texture cache size


Texture Filtering

• A bilinear-filtered sample is the common basic unit of work for a texture unit
  – Unlikely that point sampling is any faster than bilinear; can make this work for you in image processing shaders (rather than point sampling and doing some constant weighted sum)
  – Each additional bilinear sample for trilinear or anisotropic filtering is probably consuming additional time

• Smart algorithms ensure that only needed samples are taken
  – No need for trilinear if magnifying
  – No need for anisotropy if square-on
  – Example: walls tend to have less anisotropy than floors

• Gradient calculations may be dynamic
  – Necessary to handle dependent texture reads
  – Be wary with dependency; gradient can be unpredictable


Render to texture

• Useful for generating extra views or postprocessing Example: mirror in driving game

Example: postprocessing for refraction

• glCopyTexImage copies the framebuffer to a texture
  – CPU-GPU serialisation is not implied; this can probably be queued into a command buffer

• Other methods exist such as pbuffers and framebuffer extensions
  – Can be slightly more efficient
  – Can return to a rendertarget after rendering on another
  – More complex; don’t use without good reason


Multisample antialiasing

• The key gain is to run the fragment shader at pixel frequency rather than sample frequency

• Also saves memory bandwidth; can compress Z and colour
  – The buffer may need to be resolved to an uncompressed buffer for display or if used as a texture
  – Triangle size may be worth extra consideration with MSAA; frame buffer and Z compression rate is likely to be in roughly inverse proportion to the number of visible edges in the scene


Caching

• Many caches inside the GPU

• Different to what you might be familiar with in a CPU
  – More about memory bursts and latency compensation than reuse
  – In general you do need to hit the memory
    – Example: texture mapping the whole framebuffer at 1:1; every pixel and texel will be touched exactly once
  – Therefore, be pessimistic: assume this
  – Choose to compensate memory latency with large buffers
    – Rather than using the cache to dodge the accesses

• In a few places short-term ‘reuse’ is critical
  – Bilinear filtering the most obvious case


Caching

• Can still be advantages to avoiding cycling
  – Used to be a big thing, particularly in the days of visible caching (software controlled rather than auto)
  – Caused sorting policy of hard sort by material
  – Nowadays far less important, hence rough sort by depth
  – In some pathological circumstances sort by shader and depth (or a Z pass followed by sort by shader) might be more efficient


Disclaimer & Attribution

• DISCLAIMER
  – The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
  – AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
  – AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

• ATTRIBUTION
  – © 2007 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other names are for informational purposes only and may be trademarks of their respective owners.
