Optimizing Direct X On Multi Core Architectures

1

Game Developers Conference 2008

Optimizing DirectX on Multi-core architectures

Leigh Davies, Senior Application Engineer, Intel

February 2008


Contributions from:

David Potages, GRIN*

Jeff Andrews, Intel®

Rita Turkowski, Intel®

Kev Gee, Microsoft*

*Other names and brands may be claimed as the property of others

3

Agenda

Graphics and the CPU

Profiling Graphics and Drivers

Threading the render thread

Case Study GRIN*

Summary


4

Graphics is CPU Intensive

D3D Runtime and Driver account for 25-40% of CPU cycles per frame

[Charts: per-frame CPU cycle breakdown for World in Conflict*, Bionic Commando*, Crysis* CPU Benchmark and Crysis* GPU Benchmark. Legend: Application, D3D Runtime, Driver, Other.]

**Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Data taken on Intel® QX6700® Processor at 2.67 GHz, NVIDIA 8800GTX GPU, 2 GB memory.

5

Designing the Rendering Pipeline

• Analyze the whole program
  - Your application
  - DirectX API usage and overheads
  - Video card driver
• Have defined performance goals
  - Use key gameplay-targeted scenarios for perf analysis
  - Build benchmarks / test levels

[Diagram: Application (render functions) -> Direct3D* Runtime -> Command Buffer -> Software Driver -> Video Card. Example title: World in Conflict*]

DX9 API Call**            Cycle count
SetVertexShader           3000-12100
SetPixelShaderConstant    1500-9000
SetTexture                2500-3100
DrawPrimitive             1050-1150
ZFUNC                     510-700

**Timings taken from msdn2.microsoft.com/en-us/library/bb172234(VS.85).aspx
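To turn the per-call cycle counts above into a frame budget, the arithmetic can be sketched as below. This is only an illustration: `CycleRange`, `total_cost` and `cycles_to_ms` are hypothetical helper names, and any call counts fed into them are assumptions to be replaced by numbers from your own profiling.

```cpp
#include <cassert>
#include <cstdint>

// Cycle range for one API call, as listed in the timing table.
struct CycleRange {
    std::uint64_t min_cycles;
    std::uint64_t max_cycles;
};

// Total cycle range for `calls` invocations of the same API call.
inline CycleRange total_cost(CycleRange per_call, std::uint64_t calls) {
    return CycleRange{per_call.min_cycles * calls, per_call.max_cycles * calls};
}

// Convert a cycle count to milliseconds at the given clock rate (in Hz).
inline double cycles_to_ms(std::uint64_t cycles, double clock_hz) {
    assert(clock_hz > 0.0);
    return static_cast<double>(cycles) / clock_hz * 1000.0;
}
```

For example, an assumed 2000 SetTexture calls per frame at 2500-3100 cycles each cost 5.0 to 6.2 million cycles, which is roughly 1.9 to 2.3 ms on a 2.67 GHz core.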


6

Balancing Future Workloads

[Figure: Intel® roadmap graphic.
 65nm (2 years): compaction/derivative: Intel Core™ Duo, Pentium-D; Intel Core™ microarchitecture: Intel Core™2 Duo, DC Intel Xeon® 5100.
 45nm (2 years): tick, compaction/derivative: Penryn; tock, new microarchitecture: Nehalem.
 Callouts: scalable and configurable cache, interconnects and memory controllers; scalable performance, 1 to 8 threads and 1 to 4 cores.]

7

Be Realistic, Rendering Costs CPU Time

• Rendering thread potential bottleneck for N-Core scaling
• Rendering costs likely to increase as you add more physics, effects or even AI objects
• Runtime and driver costs are significantly higher on the PC than the consoles
• Use Performance Analysis results to focus development efforts
• Analyze regularly and catch regressions early

Time is Money

Optimise the graphics thread. Offload as much as possible.

8

Agenda

Graphics and the CPU

Profiling Graphics and Drivers

Threading the render thread

Case Study GRIN

Summary

9

Overview of Graphics Driver Models

Windows* XP Display Model (XPDM) - DX9
- The Kernel mode driver controls threading

Windows Vista* Display Driver Model (WDDM) - DX9
- The D3D9 runtime manages creation of threads
- One is created specifically for the User Mode Driver (UMD)

Windows Vista Display Driver Model (WDDM) - DX10
- The Driver is responsible for creating threads
- Currently released drivers don't thread
- Could change in the near future

Graphics driver can have a major impact on performance and multi-core scaling.


10

Profiling Tools

Need to use a variety of tools:
- Use a repeatable workload

CPU tools:
- VTune™ Performance Analyzer
- Intel® Thread Profiler
- PIX for Windows*
- AMD CodeAnalyst™

GPU tools:
- PIX for Windows with vendor plugins
- NVIDIA* PerfHUD
- ATI* PerfStudio


11

Profiling Graphics with VTune™ Analyzer

Select Counter Monitor for a quick overview:
- Not necessary to launch the app
- Disable display of counter data unless running windowed
- Profile across a selection of configurations
  - Identify different bottlenecks based on h/w limitations
  - "Works great on my machine" isn't good enough

12

VTune™ Performance Analyzer - Sampling

• Calibration isn't needed for games
• Delay sampling allows alt-tab or bypass loading
• Tracking core usage needs to be added
• Privileged time shows time inside Kernel

13

VTune™ Analyzer Views

• Processor Usage
• Memory Usage
• Context Switching
• CPU Frequency

VTune™ Analyzer allows you to add your own counters.

14

Sampling - Display Model XPDM

[Diagram: XPDM components. Components: Application, D3D Runtime, Win32k & Dxg, Display Driver, Videoport, Miniport Driver. Regions: User Mode, Kernel Mode, Session Space.]

15

Sampling - Display Model WDDM

[Diagram: WDDM components. Components: Application, D3D Runtime, User Mode Driver, DWM, Win32k, CDD, Dxgkrnl, Kernel Driver. Regions: Application Process, DWM Process, User Mode, Kernel Mode, Session Space.]

16

Associating Symbols in VTune™ Analyzer

- Configure -> Options -> Directories -> Symbol Repository
- View Symbol Repository -> Delete unassociated modules
- In Tuning Browser select "Results" -> "Module Associations..."
- Edit symbol associations

17

Symbol Information for DX10Core.dll

Symbols taken while profiling the SoftParticle sample from the SDK

18

PIX for Windows

[Screenshot: PIX for Windows timeline with CPU and GPU tracks]

• Gathering GPU events requires Windows Vista
• Cross over between PIX and VTune™ counters
• Easy to see CPU/GPU headroom

19

Intel® PIX Plug-in: Beta Available Now

• Provides access to Intel® counters in PIX
• Rollout now to support IIG profiling

# | Metric Name | Description
1 | Frame Time | Instantaneous frame time in milliseconds.
2 | Frames per Second | Instantaneous frame rate normalized to seconds (inverted frame time).
3 | Driver Time | The amount of time spent in the display driver, normalized to milliseconds.
4 | Driver Time Stalled | The amount of time spent in the display driver either busy stalled or in a sleep state, normalized to milliseconds.
5 | Graphics Memory Used - MB | The amount of graphics memory currently utilized, normalized to MB.
6 | Graphics Memory Used - bytes | The amount of graphics memory currently utilized, normalized to bytes.
7 | Texture Memory Used | The amount of texture memory currently utilized, normalized to MB.
8 | GPU Busy | The percent utilization of the front end of the GPU. This metric describes the incoming command stream and does NOT describe the utilization of the array of execution units (cores).
9 | Cores Busy | The percentage of time that any core in the array is either actively executing instructions or stalled.
10 | Cores Active | The percentage of time that the core array is actively executing instructions.
11 | Vertex Count | The number of vertices that entered the pipeline.
12 | Triangle Count | The number of triangles that flowed through the pipeline prior to any clipping or culling.
13 | Texel Count | The number of texels that were fetched by the pipeline.
14 | Pixels Drawn | The number of pixels that were actually written to the render target.
15 | Mathbox Utilization | The aggregated percentage of time that the mathbox was actively executing instructions.
16 | Texture Unit(s) Utilization | The aggregated percentage of time that the texture units were actively processing texels.

20

Agenda

Graphics and the CPU

Profiling Graphics and Drivers

Threading the render thread

Case Study GRIN

Summary

21

Starting Points

Common issues:
- Naive ports to Windows from console models
- Excessive context switching/synchronization overhead
- Work starvation due to thread sync dependencies

General rules:
- Use only 1 heavyweight thread per core on Windows
- Manage job distribution
- The OS scheduler knows best
- Consider memory bandwidth

Multi-core and D3D usage:
- Avoid use of the D3DCREATE_MULTITHREADED flag
- You CAN manage synch costs better
- Design around a single-threaded D3D device access model
- Lock resources from main thread, manually protect access
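The "1 heavyweight thread per core" rule can be sketched as a small sizing helper. This is a minimal sketch, not a prescribed policy: `worker_thread_count` and `detect_worker_thread_count` are hypothetical names, and how many threads to treat as already reserved (main thread, D3D device thread) is an assumption that depends on the engine.

```cpp
#include <cassert>
#include <thread>

// Number of heavyweight worker threads to create, leaving room for threads
// that already exist (for example the main thread and the thread that owns
// the D3D device).
inline unsigned worker_thread_count(unsigned logical_cores, unsigned reserved_threads) {
    if (logical_cores <= reserved_threads) return 1;  // always keep one worker
    return logical_cores - reserved_threads;
}

// Query the machine. hardware_concurrency() may return 0 when the core
// count cannot be determined, so fall back to a single core in that case.
inline unsigned detect_worker_thread_count(unsigned reserved_threads) {
    unsigned cores = std::thread::hardware_concurrency();
    if (cores == 0) cores = 1;
    unsigned workers = worker_thread_count(cores, reserved_threads);
    assert(workers >= 1);
    return workers;
}
```

Lighter helper threads that mostly sleep (file streaming, for instance) need not be counted against the budget.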


22

Making the Drivers Work for You!

• Pack your DrawPrimitive2 calls together
• Frequently creating & destroying shaders, VB, IB, and surfaces will impact performance
• Avoid allocating too many system memory resources
• DrawPrimitiveUP or DrawIndexedPrimitiveUP

[Diagram: App -> D3D Runtime -> D3D Driver. Producer & consumer threads dispatch commands to GPU.]

Potential 20%+ speed gain.

Can be disabled by application behaviour.

23

Making the Drivers Work for You!

• Avoid any calls that return GPU state information; they require a CPU thread synchronization
• Driver queries are OK (calls are asynchronous)
• Do not lock threads to a specific CPU!
• Group all resource updates (texture and vertex) together: once per frame, at the beginning or end, is fine; just don't scatter them among drawing calls
• Minimize use of any locks/unlocks
• System memory vertex buffers
  - D3DUSAGE_DYNAMIC, use with D3DUSAGE_WRITEONLY
  - Lock with D3DLOCK_DISCARD or D3DLOCK_NOOVERWRITE
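One common way to choose between D3DLOCK_DISCARD and D3DLOCK_NOOVERWRITE on a dynamic, write-only vertex buffer is to append until the buffer is full and then discard. The sketch below only models that decision so it builds without d3d9.h: `LockMode`, `LockRequest` and `DynamicBufferCursor` are hypothetical names, and the two enum values stand in for the real lock flags.

```cpp
#include <cassert>
#include <cstddef>

// Stand-ins for D3DLOCK_NOOVERWRITE and D3DLOCK_DISCARD (assumption: the
// caller maps these onto the real flags when locking the buffer).
enum class LockMode { NoOverwrite, Discard };

struct LockRequest {
    std::size_t offset;  // byte offset to lock at
    LockMode mode;       // flag to use for this lock
};

// Tracks the write cursor of one dynamic vertex buffer.
class DynamicBufferCursor {
public:
    explicit DynamicBufferCursor(std::size_t capacity_bytes)
        : capacity_(capacity_bytes), cursor_(0) {}

    // Reserve `bytes` for the next batch of vertices.
    LockRequest reserve(std::size_t bytes) {
        assert(bytes <= capacity_);
        if (cursor_ + bytes > capacity_) {
            // Out of room: request a fresh buffer and restart at offset 0.
            cursor_ = bytes;
            return LockRequest{0, LockMode::Discard};
        }
        // Room left: append without touching data the GPU may still read.
        LockRequest request{cursor_, LockMode::NoOverwrite};
        cursor_ += bytes;
        return request;
    }

private:
    std::size_t capacity_;
    std::size_t cursor_;
};
```

Appending with the no-overwrite mode avoids stalling on data that is still in flight, and the discard only happens once per wrap of the buffer.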

24

Threading Issues

• Race conditions between threads
  - Object updates
  - Creation/deletion of objects
• False sharing of data between threads
• Accessing hardware resources

[Diagram: timeline. Main Thread (frame n): Move Object X, Delete Object Y. Render Thread (frame n-1): Render Object X, Render Object Y.]

25

Threading Options

[Diagram: two threading options. Pipeline: Front-end Logic and Back-end Render with EOF sync points. Consumer thread: Front-end Logic -> CmdQueue -> Back-end Render.]

• Avoiding the issues
  - Use an update queue, lightweight (lock-free?)
  - Make duplicate objects/double-buffered
  - Reference count objects
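Reference counting addresses the "Delete Object Y / Render Object Y" race: if each queued draw holds a counted reference, the object survives until the renderer has finished with it. The following is a minimal sketch using `std::shared_ptr`; `SceneObject`, `DrawCommand` and `RenderList` are hypothetical names, and the hand-off of the list between threads is deliberately left out.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

struct SceneObject {
    explicit SceneObject(int id_) : id(id_) {}
    int id;
};

// A queued draw keeps a counted reference, so the object outlives a
// delete issued by the main thread while the draw is still pending.
struct DrawCommand {
    std::shared_ptr<const SceneObject> object;
};

class RenderList {
public:
    void submit(std::shared_ptr<const SceneObject> object) {
        assert(object != nullptr);
        commands_.push_back(DrawCommand{std::move(object)});
    }

    // Called by the renderer: "draws" every queued object, then drops the
    // references. Objects with no other owner are destroyed at this point.
    std::vector<int> render_and_clear() {
        std::vector<int> drawn_ids;
        for (const DrawCommand& command : commands_) {
            drawn_ids.push_back(command.object->id);
        }
        commands_.clear();
        return drawn_ids;
    }

private:
    std::vector<DrawCommand> commands_;
};
```

The reference count itself is updated atomically by `std::shared_ptr`, but the list still needs its own synchronization or double-buffering when two threads touch it.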

26

Buffering Dynamic Data

• Partially buffered locks consume more video memory.
• Fully buffered locks consume more system memory and have an associated CPU cost for memory copying.

[Diagram: partially buffered locks.
 Main Thread (frame n): Modify Vertex Buffer0. Render Thread (frame n-1): Render Object from Vertex Buffer1.
 Main Thread (frame n+1): Modify Vertex Buffer1. Render Thread (frame n): Render Object from Vertex Buffer0.]

[Diagram: fully buffered locks.
 Main Thread: Lock Buffer, Modify Buffer, Unlock Buffer on a Local Buffer; data passes through Data Queue0 / Data Queue1.
 Render Thread: Lock Buffer, Copy Data, Unlock Buffer on the Video Buffer.]
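The alternating Buffer0/Buffer1 scheme can be sketched by selecting the copy from the frame number. This is an illustration under the assumption that the main thread runs exactly one frame ahead of the render thread; `DoubleBuffered` and `frames_are_disjoint` are hypothetical names and plain `std::vector` storage stands in for the vertex buffers.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Two copies of the dynamic data: the main thread writes one copy while
// the render thread reads the other.
template <typename T>
class DoubleBuffered {
public:
    // Copy the main thread fills while building frame `frame`.
    std::vector<T>& write_buffer(std::size_t frame) { return buffers_[frame % 2]; }

    // Copy the render thread draws from while rendering frame `frame`.
    const std::vector<T>& read_buffer(std::size_t frame) const { return buffers_[frame % 2]; }

private:
    std::array<std::vector<T>, 2> buffers_;
};

// The two threads touch different copies only when their frame numbers
// have different parity, e.g. main at frame n and renderer at frame n-1.
inline bool frames_are_disjoint(std::size_t main_frame, std::size_t render_frame) {
    return main_frame % 2 != render_frame % 2;
}
```

The end-of-frame sync point is what keeps the main thread from running two frames ahead and overwriting the copy still being rendered.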

27

Sub Threading Options

[Diagram: Front-end Logic and Back-end Render threads with an EOF sync point; each thread hands jobs to a Job Queue.]

• Job Queue offloads
  - Software visibility culling
  - Particle generation
  - Character skinning
  - Procedural updates
• Reduces path size through both front and back ends
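A job queue of this kind can be sketched with a mutex, a condition variable and a fixed set of worker threads. This is a minimal sketch rather than a tuned implementation: `JobQueue`, `submit` and `wait_idle` are hypothetical names, and a production queue might well be lock-free instead.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

class JobQueue {
public:
    explicit JobQueue(unsigned workers) {
        assert(workers > 0);
        for (unsigned i = 0; i < workers; ++i)
            threads_.emplace_back([this] { run(); });
    }

    ~JobQueue() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_all();
        for (std::thread& t : threads_) t.join();
    }

    // Queue a job (culling, skinning, particle update, ...) and return at once.
    void submit(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(std::move(job));
            ++pending_;
        }
        wake_.notify_one();
    }

    // Block until every submitted job has finished, e.g. at end of frame.
    void wait_idle() {
        std::unique_lock<std::mutex> lock(mutex_);
        idle_.wait(lock, [this] { return pending_ == 0; });
    }

private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return stopping_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // stopping and nothing left to do
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();  // run outside the lock so workers execute in parallel
            std::lock_guard<std::mutex> lock(mutex_);
            if (--pending_ == 0) idle_.notify_all();
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::condition_variable idle_;
    std::queue<std::function<void()>> jobs_;
    std::vector<std::thread> threads_;
    unsigned pending_ = 0;
    bool stopping_ = false;
};
```

Both the front-end and the back-end thread can call `submit`, then `wait_idle` before they need the results.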

28

Threading the DX API

[Diagram: DX9 Render System -> wrappers (D3D9 Wrapper, D3DDevice9 Wrapper, D3DVertexBuffer9 Wrapper) -> D3D9, D3DDevice9, D3DVertexBuffer9 -> Graphics Driver -> Graphics Device]

• Similar to DX9 threading in the runtime
  - Potentially repeating the same work
• Potential to move simple API code out of main thread, i.e. state management
• DX10 has lower runtime costs

[Charts: CPU time breakdown before and after adding a DX API thread**]

                    DX9      DX9 + API thread    DX10     DX10 + API thread
Main Thread         46.46    39.08               63.84    45.72
DX API Thread       -        7.38                -        18.12
NVIDIA driver       23.02    23.02               -        -
Physics             10.91    10.91               13.95    13.95
Other threads       19.35    19.35               21.88    21.88

DX9 main thread: 15.82% in DX9. 16% increase*
DX10 main thread: 28.39% in DX10+Driver. 39% increase*

* Theoretical increase based on amount of API work offloaded, does not include threading overhead

**Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Data taken on Intel® QX6700® Processor at 2.67 GHz, NVIDIA 8800GTX GPU, 2 GB memory.

29

Agenda

Graphics and the CPU

Profiling Graphics and Drivers

Threading the render thread

Case Study GRIN

Summary


30

Case study: Grin’s engine*


David Potages, Senior Engine Architect, GRIN

February

*Performance figures discussed in this case study refer to a pre-release version of the game. They are subject to change before release and are for illustration only.

31

Quick Engine Overview

• 3rd generation of threaded engine
• 2nd generation of threaded renderer
• Used in several games

32

Quick Engine Overview

• Not game specific: game code in Lua scripts
  - Allows hot-reload, no link time, custom debugger
  - But single threaded, a lot of memory allocations
• Deferred rendering
  - DX9 (DX10 being implemented)
• Libraries:
  - PhysX™
  - OpenAL
  - Bink*

All the technology choices have great impact on the possible parallelization!


33

Why multi-threading?

• Poor CPU usage
  - Can go down to 30%
• A lot of time spent in D3D/driver
  - 35-45%*
• But a lot of the application time is dedicated to rendering
  - Up to 37%*
  - Grand total of 53%* of frame with D3D/driver

[Chart: frame time breakdown. Legend: Application, D3D Runtime, Driver, Other. Values shown: 46%, 17%, 29%.]

*Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Data taken on Intel® QX9650® Processor at 2.33 GHz, NVIDIA 8800GTX GPU, 2 GB memory, Windows Vista™ Ultimate.


34

Why multi-threading the renderer?

Simplified pipeline (ST version)

[Diagram: pipeline stages World update, Script update, Sound, Network, Culling, Particles batch optimizations, Rendering. Libraries shown: Lua*, PhysX™, OpenAL*.]

• Some systems or the drivers they use can take advantage of multi-cores
• Rendering has low dependencies with other systems, but big data dependencies

Rendering is an easy target for multithreading: low system dependencies, 53% of frame time

But easier said than done!


35

Implementation Details

Main thread
- Entity/World updates, Animations, Input, Network, Lua, SoundSystem, Physics (main)

Renderer thread
- Culling (including software occlusion queries)
- Particle effects batch optimizations
- RenderDevice (D3D)
- Win32 messaging

Other
- File streaming
- PhysX™ threads
- Driver threads

36

Implementation Details

Messages sent to the renderer
- Non blocking: render_scene, render_frame, update_window, etc.
- Blocking: flush_pipe

flush_pipe forces the renderer to execute all the queued jobs => synchronization point
- Used between frames on main thread
- Can be used to ensure that data (e.g. textures) is ready

[Diagram: timeline of the Front-end Logic and Back-end Render threads showing Flush, Sync and Idle periods]
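The split between non-blocking messages and a blocking flush_pipe can be sketched by treating the flush as just another message that signals the caller once it is reached. This is a sketch under assumptions, not GRIN's implementation: `RendererThread` and `post` are hypothetical names, and messages are modelled as plain callables.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>

class RendererThread {
public:
    RendererThread() : thread_([this] { run(); }) {}

    ~RendererThread() {
        post([this] { running_ = false; });  // executed last, on the renderer thread
        thread_.join();
    }

    // Non-blocking: queue a message (render_scene, update_window, ...) and return.
    void post(std::function<void()> message) {
        assert(message != nullptr);
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push_back(std::move(message));
        }
        wake_.notify_one();
    }

    // Blocking: returns only after every message queued before it has executed.
    void flush_pipe() {
        auto done = std::make_shared<std::promise<void>>();
        std::future<void> reached = done->get_future();
        post([done] { done->set_value(); });
        reached.wait();
    }

private:
    void run() {
        while (running_) {
            std::function<void()> message;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return !queue_.empty(); });
                message = std::move(queue_.front());
                queue_.pop_front();
            }
            message();  // executed in FIFO order, outside the lock
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::deque<std::function<void()>> queue_;
    bool running_ = true;  // only read and written on the renderer thread
    std::thread thread_;   // declared last so the other members exist first
};
```

Because messages run in FIFO order, anything posted before `flush_pipe()` (a texture upload, say) is guaranteed to have completed when it returns.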

37

Implementation Details

• States need to be mirrored
• State changes are queued, and updated in the freeze
• The proper state is returned depending on the calling thread

This will avoid contention when data is accessed in the renderer, but mirror only what is required
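State mirroring can be sketched as one copy per thread plus a copy step during the freeze. This is a simplified illustration: `Mirrored` is a hypothetical name, the queue of state changes is reduced to a single dirty flag, and the freeze is assumed to run while both threads are synchronised.

```cpp
#include <cassert>
#include <thread>

// One mirrored value: the main thread owns `main_copy_`, the renderer
// reads `render_copy_`.
template <typename T>
class Mirrored {
public:
    Mirrored(T initial, std::thread::id render_thread)
        : main_copy_(initial), render_copy_(initial), render_thread_(render_thread) {}

    // Called from the main thread; visible to the renderer after the next freeze.
    void set(const T& value) {
        main_copy_ = value;
        dirty_ = true;
    }

    // Returns the copy that belongs to the calling thread.
    const T& get() const {
        return std::this_thread::get_id() == render_thread_ ? render_copy_ : main_copy_;
    }

    // Called during the freeze, while neither thread is using the value.
    void freeze() {
        if (dirty_) {
            render_copy_ = main_copy_;
            dirty_ = false;
        }
    }

private:
    T main_copy_;
    T render_copy_;
    std::thread::id render_thread_;
    bool dirty_ = false;
};
```

Each mirrored value costs a second copy and a copy step per frame, which is why only data the renderer actually reads should be wrapped this way.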

38

Results

• Better CPU usage
  - 40-60%*
• Better thread workload

*Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Data taken on Intel® QX6700® Processor at 2.67 GHz, NVIDIA 8800GTX GPU, 2 GB memory.

39

Results: Rendering Performance

• Better FPS
  - 4C MT is 1.88x faster than 1C*
  - 4C MT is 1.20x faster than 4C ST*
• Analysis
  - Remember that the drivers are partially threaded: we save up to 17%, plus the share of D3D/driver time that is not threaded
  - Close to 1.20x: if D3D/driver were completely threaded, the new frame time would be 1 - 0.17 = 83% of the old frame time, and the scale-up would be

    \[ \frac{\mathrm{fps}_{new}}{\mathrm{fps}_{old}} = \frac{\mathrm{time}_{old}}{\mathrm{time}_{new}} = \frac{\mathrm{time}_{old}}{0.83 \cdot \mathrm{time}_{old}} \approx 1.20 \]

  - Maximum scale-up vs. 1C is 2.12x
  - Context switches, cache misses and contention slow us down
  - Render-thread bound

[Chart: CPU usage and FPS (axis 0-100) for 1C, 2C ST, 2C MT, 4C ST, 4C MT]

• Effect on a low physics/gameplay workload

40

Improvements

• Threading some parts of the render thread
  - E.g. culling (~9-25%* of the render thread)
• Reducing contentions
  - Mainly memory
• Batch more
  - E.g. effects
• Triple buffering?



41

Scalability

While we are render-thread bound, we can push, for instance, more physics/effects or more AI

But it is hard to find the right balance between CPU and GPU workload!

Example: falling cars, aka pushing more physics

42

Scalability

- ~256 cars falling and bouncing
- 4C MT is 1.42x* faster than 4C ST, and 3.23x* faster than 1C
- PhysX™ helped us a lot to propagate the workload, but occupies the other cores quite heavily, thus preventing D3D/drivers from taking advantage of them.
- Rendering overhead was not that big with the additional units since they batch well.

[Chart: FPS (axis 0-50) for 1C, 4C ST, 4C MT]



43

Issues

• A proper benchmark system is required
  - A fly-through benchmark is not enough!
  - The CPU & GPU workloads vary a lot on different maps
• Easy to forget data that needs to be mirrored
• Lock-free algorithms are nice, but to be used with care
• Memory contention + cache misses + false sharing
• Behaviour of drivers varies quite a lot…
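False sharing is usually avoided by giving each thread's hot data its own cache line. The sketch below assumes a 64-byte line, which is typical for current x86 CPUs but not guaranteed; `kCacheLine`, `PaddedCounter` and `PerThreadCounters` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache line size in bytes (an assumption; check the target CPU).
constexpr std::size_t kCacheLine = 64;

// Each counter is aligned and padded to a full cache line, so two threads
// updating neighbouring counters do not invalidate each other's line.
struct alignas(kCacheLine) PaddedCounter {
    std::uint64_t value = 0;
};

// One counter per worker thread, indexed by worker id.
template <std::size_t N>
struct PerThreadCounters {
    PaddedCounter slots[N];

    // Sum the counters; call this only after the workers have finished.
    std::uint64_t total() const {
        std::uint64_t sum = 0;
        for (std::size_t i = 0; i < N; ++i) sum += slots[i].value;
        return sum;
    }
};
```

The cost is memory: each 8-byte counter now occupies a full line, so the padding is only worth applying to data that several threads write frequently.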

44

Agenda

Graphics and the CPU

Profiling Graphics and Drivers

Threading the render thread

Case Study GRIN

Summary


45

Summary/Conclusion

• Graphics pipeline is still very CPU intensive
• Future CPUs will have an increasing number of logical processors
• It is worth threading your renderer as much as possible if you want to be able to push more things in your game
• Hard to balance the workloads though; need to profile the whole system
• Making the most of the graphics driver is essential

46

References:

Accurately Profiling Direct3D API Calls
- msdn2.microsoft.com/en-us/library/bb172234(VS.85).aspx

Debugging Tools and Symbols: Getting Started
- www.microsoft.com/whdc/devtools/debugging/debugstart.mspx

Threading the OGRE3D Render System
- www.intel.com/cd/ids/developer/asmo-na/eng/dc/games/331359.htm
