Optimizing OptiX Applications David McAllister and James...

Optimizing OptiX Applications David McAllister

and James Bigler

Ray Tracing Regimes

Computational Power

Interactive

Real-time

Real Time Path Tracing What would it take?

— 4 rays / sample

— 50 samples / pixel

— 2M pixels / frame

— 30 frames / second

— 12B rays / second

GeForce GTX 680: 350M rays / second

— Need 35X speedup

1 shading sample

1 AA sample

9 shading samples

1 AA sample

18 shading samples

2 AA samples

72 shading samples

8 AA samples

144 shading samples

16 AA samples

36 shading samples

4 AA samples

Good enough

for games

CUDA Interoperability

CPU Fallback

Callable Programs

BVH Refinement

Texture IDs

Chapter 9

Glass Sample

Whitted-style recursive

Reflection and refraction per hit

Beer’s Law attenuation

Depth cut-off

Importance cut-off

Optimizing OptiX Device Code

Maximize rays/second

— Avoid gratuitous divergence

— Avoid memory traffic

Minimize rays needed for given result

— Improved ray tracing algorithms

MIS, QMC, MCMC, BDPT, MLT

Callable

Program

OptiX Execution Model

rtContextLaunch Ray Generation

Program

Exception

Program

Selector Visit

Program

Program Node Graph

Traversal

Acceleration

Traversal

Launch

Traverse Shade

rtTrace

Closest Hit

Program

Any Hit

Program

Intersection

Program

Minimize Continuation State

OptiX rewrites rtTrace, rtReportIntersection, etc. as:

— Find all registers needed after rtFunction (continuation state)

— Save continuation state to local memory

— Execute function body

— Restore continuation state from local memory

Minimizing continuation state can have a large impact

float3 light_pos, light_color, light_scale;

sampleLight(light_pos, light_color, light_scale); // Fill in values

optix::Ray ray = ...; // create ray given light_pos

PerRayData shadow_prd;

rtTrace(top_object, ray, shadow_prd); // Trace shadow ray

return light_color*light_pos*shadow_prd.attenuation;

light_pos and light_color

saved to local stack

float3 light_pos, light_color, light_scale;

sampleLight(light_pos, light_color, light_scale); // Fill in values

float3 scaled_light_color = light_color*light_scale;

optix::Ray ray = ...; // create ray given light_pos

PerRayData shadow_prd;

rtTrace(top_object, ray, shadow_prd); // Trace shadow ray

return scaled_light_color*shadow_prd.attenuation;

Minimize Continuation State RT_PROGRAM void closestHit() {

float3 N = rtTransformNormal( normal );

float3 P = ray.origin + t_hit * ray.direction;

float3 wo = -ray.direction;

// Compute direct lighting

float3 on_light = lightSample();

float dist = length(on_light-P)

float3 wi = (on_light - P) / length;

float3 bsdf = bsdfVal(wi, N, wo, bsdf_params);

bool is_occluded = traceShadowRay(P, wi, dist);

if( !is_occluded ) prd.result = light_col * bsdf;

// Fill in values for next path trace iteration

bsdfSample( wo, N, bsdf_params,

prd.next_wi, prd.next_bsdf_weight );

RT_PROGRAM void closestHit() {

float3 N = rtTransformNormal( normal );

float3 P = ray.origin + t_hit * ray.direction;

float3 wo = -ray.direction;

// Fill in values for next path trace iteration

bsdfSample( wo, N, bsdf_params,

prd.next_wi, prd.next_bsdf_weight );

// Compute direct lighting

float3 on_light = lightSample();

float dist = length(on_light - P)

float3 wi = (on_light - P) / length;

float3 bsdf = bsdfVal(wi, N, wo, bsdf_params);

bool is_occluded = traceShadowRay(P, wi, dist);

if( !is_occluded ) prd.result = light_col * bsdf;

Pulled above trace to

reduce stack state

DesignGarage: iterative path tracer

Closest hit programs do:

— Direct lighting (next event estimation with shadow query ray)

— Compute next ray (sample BSDF for reflected/refracted ray info)

— Return direct light and next ray info to ray gen program

Ray gen program iterates

// Calculate BSDF sample for next path ray

float3 ray_direction, ray_weight;

sampleBSDF( wo, N, ray_direction, ray_weight );

// Recurse

float3 indirect_light = tracePathRay(P, ray_direction,

ray_weight);

// Perform direct lighting

prd.result = indirect_light + direct_light;

RT_PROGRAM void rayGeneration(){

float3 ray_dir = cameraGetRayDir();

float3 result = tracePathRay( camera.pos, ray_dir, 1 );

output_buffer[ launch_index ] = result;

// Calculate BSDF sample for next path ray

float3 ray_direction, ray_weight;

sampleBSDF( wo, N, ray_direction, ray_weight );

// Return sampled ray info and let ray_gen iterate

prd.ray_dir = ray_direction;

prd.ray_origin = P;

prd.ray_weight = ray_weight;

// Perform direct lighting

prd.direct = direct_light;

RT_PROGRAM void rayGeneration() {

PerRayData prd;

prd.ray_dir = cameraGetRayDir();

prd.ray_origin = camera.position;

float3 weight = make_float3( 1.0f );

float3 result = make_float3( 0.0f );

for( i = 0; i < MAX_DEPTH; ++i ) {

traceRay( prd.ray_origin, prd.ray_dir, prd );

result += prd.direct*weight;

weight *= prd.ray_weight;

output_buffer[ launch_index ] = result;

Registers

Accessing stack allocated arrays through pointers uses local

memory not registers

— Change float v[3] to float v0, v1, v2

— Avoid accessing variables via pointer

Register spilling

— When working set of registers is too large

— Registers are stored to local memory

— Keep working set small

— Collapse terms when possible

Fracture

GPU Rigid Bodies, Max ray depth = 12, ~350,000 triangles

Acceleration Builder Options

Sbvh has world class ray tracing performance

Lbvh is extremely fast and works on very large datasets

Slow Build

Fast Render Fast Build

Slow Render

Sbvh Bvh MedianBvh Lbvh

Avoiding Acceleration Builds

Cache acceleration structure alongside model

— rtAccelerationGetData / rtAccelerationSetData

Segregate dynamic and static geometry

— Use lightweight global Acceleration

— Only rebuild dynamic geometry and global Acceleration

Add padding to account for special primitive movement

— Trade off framerate vs. rebuild/refit lag

— Use sparingly

Acceleration Structure Refinement

BVH Refinement optimizes the quality of a BVH

— Allows use of fast builder (e.g., MedianBvh) which you can refine

between frames

— Smoother scene editing

— Smoother animation

rtAccelerationSetProperty(accel, “refine”, “8” )

— 0 → rebuild or refit; never refine

— 1+ → refit and refine N times per frame

Rebuild if prim count changes

Rebuild if marked dirty

Acceleration Refitting

Refit keeps existing BVH node structure

Adjusts bounding box extents to fit moved primitives

Very fast update

Slower rendering if primitives move too much

rtAccelerationSetProperty( accel, “refit”, “1” )

— 0 → rebuild whenever dirty

— 1 → refit every frame

— 2+ → refit every frame; refine every Nth frame

Refinement for Animation

1201 4 7

SAH Cost of Fracturing Columns

hlbvh onlyhlbvh+rotaterotate onlyrotate + hlbvh every 3rd frame

Lbvh + Refine

Refine only

Refine + Lbvh every 3rd frame

API Interoperability

Share data between OptiX and

— CUDA

— OpenGL

— Direct3D

Saves host-device copies

Sharing Pointers with CUDA

rtBufferSetDevicePointer() – CUDA owns the buffer

const float* d_out_probe_buf;

cudaSetDevice(0);

cudaMalloc(&d_output_probe_buffer, moving_obj_count * sizeof(float));

rtBufferSetDevicePointer(buf, optixDevice0, d_out_probe_buf);

rtContextLaunch1D(..., moving_obj_count);

LOS_reduction<<<moving_obj_count, 1>>> (d_moving_objs, d_out_probe_buf);

Sharing Pointers with CUDA

rtBufferGetDevicePointer() – OptiX owns the buffer

rtContextLaunch1D(..., moving_obj_count);

const float* d_out_probe_buf;

rtBufferGetDevicePointer(buf, optixDevice0, &d_out_probe_buf);

cudaSetDevice(0);

LOS_reduction<<<moving_obj_count, 1>>> (d_moving_objs, d_out_probe_buf);

128x64x64 water volume

No isosurfaces

Volume ray casting

Stored in 3D float texture

Reuse RTprograms

Use variables or control flow to reuse programs

— singleSidedDiffuse closest hit and doubleSidedDiffuse closest hit

— diffuse closest hit and RTvariable do_double_sided

Use in moderation: uber shaders cause longer compilation

Careful arithmetic

nvcc --use-fast-math

Do not unintentionally use double precision math

— 1.0 != 1.0f

— cos() != cosf()

Search for “.f64” in your PTX files

Shallow Node Hierarchies

Flatten node hierarchy

— Collapse nested RTtransforms

— Pre-transform vertices

— Use RTselectors judiciously

Combine multiple meshes into single mesh

A single BVH over all geometry

Per-mesh BVHes

How OptiX Links Your Code

Ray Generation

Material Shading Material

Material Shading

Material Shading Material

Object

Intersection

Acceleration

Structures

Compiler

OptiX API CUDA C shaders from

user programs

GPU Execution via

Scheduling

Minimize graph changes between launches

Some changes cause kernel recompile

— Adding an RTprogram for PTX function not used before

— Adding a non-bindless texture to the scene

OptiX at least traverses graph to check validity

Safe changes:

— RTbuffer contents

— Rtvariable values

Bindless textures avoid recompiles

Share Graph Nodes

RTmaterial

RTprogram closestHitLambertian

RTgeometryinstance

RTvariable Kd: float3( 1, 0, 0 )

RTgeometryinstance

RTmaterial

Share Graph Nodes

RTmaterial

RTgeometryinstance

Use different RTvariables,

bound at the

RTgeometryinstance level

Leverage CUDA compute cache

On disk cache of PTX codes compiled to machine code

Force compile when convenient

— rtContextLaunch2D( entry_point, 0 )

Callable Programs Speed Up Compilation

OptiX inlines all CUDA functions

Fast execution

Large kernel to compile

Use callable programs

RT_CALLABLE_PROGRAM float3 checker_color(float3 input_color, float scale) { uint2 tile_size = make_uint2(launch_dim.x / N, launch_dim.y / N); if (launch_index.x/tile_size.x ^ launch_index.y/tile_size.y) return input_color * scale; else return input_color; }

rtCallableProgram(float3, get_color, (float3, float));

RT_PROGRAM camera() { float3 initial_color; // … trace a ray, get the initial color … float3 final_color = get_color( initial_color, 0.5f ); // … write new final color to output buffer … }

Callable programs reduce OCG compile times

Bigger wins in future with separate compilation and linking

Small rendering performance overhead

Enables shade trees and plugin rendering architectures

Visualize application using nSight

activity over time

CPU & GPU interaction

nvToolsExt

#include “nvToolsExt.h” // Found inside the Nsight installation

#pragma comment( lib, “nvToolsExt64_1.lib”)

nvtxMarkEx(“Something interesting happening”);

nvtxRangePushEx(“Doing process A”);

nvtxRangePop();

Download OptiX

— Available free: Windows, Linux, Mac

— http://developer.nvidia.com

OptiX-Help@nvidia.com

Ray tracing talks:

— http://www.gputechconf.com

— OptiX

— Iray

— Material Description Language

Optimizing OptiX Applications David McAllister and James...

Documents

System V200R012C00 Product Overview - karma-group.ru OSN 7500 II Product... · 7500, OptiX OSN 3500, OptiX OSN 1500, OptiX OSN 500, OptiX OSN 550, and OptiX RTN 900) to meet the requirements

OptiX PTN3900

SS107-GPU-Ray-Tracing-OptiX - Nvidiadeveloper.download.nvidia.com/.../SS107-GPU-Ray-Tracing-OptiX.pdf · 24 The OptiX engine works purely in compute. OptiX is not a renderer, or even

· OPTIX PTN 910 OPTIX PTN 950 OPTIX PTN 3900 OPTIX OSN OPTIX OSN 6800 ... Multiservice Optical Platform 6100 CN4200 DELIVERY & …

OptiX 8800

Optix Series...Optix Series LCD Monitor Optix G27C4 (3CA9) Optix G27CQ4 (3CB0) Optix G32CQ4 (3DB5) Optix G24C4 (3BA0) Optix G27C5 (3CA9) Optix G32C4 (3DA6) Preface iv v Safety Instructions

FoCos optix 12_reva

MSI Optix MAG273R Datasheet - mysoft.hu€¦ · MSI Optix MAG273R Datasheet Author: Micro-Star INT'L CO., LTD. Subject: MSI Optix MAG273R Monitor Keywords: MSI,Optix MAG273R,Monitor,Datasheet,--digital_sig_here--

Accelerating Ray Tracing Using OptiX - GTC On-Demand ...on-demand.gputechconf.com/...accelerating-ray-tracing-using-optix.pdf · OPTIX RAY TRACING TALKS AT GTC Nvidia Keynote –Tue

Optix - pdmsview.shawinc.com

INTRODUCTION TO OPTIX - NVIDIA · 2017. 5. 8. · 5 RELEASE TIMELINE Jan 2016 Summer 2016 TODAY! OptiX 3.9 Pascal Support OptiX 4.0 LLVM Pipeline NVLINK Scaling OptiX 4.1 Performance

Optix MAG275R

NVIDIA OptiX Ray Tracing Enginedeveloper.download.nvidia.com/assets/tools/files/optix/3.0.0/...OptiX Programming Guide Version 3.0 1 Chapter 1. Introduction 1.1. OptiX Overview GPUs

Optix Series...Optix Series LCD Monitor Optix MAG271C (3FA61) Optix MAG271CP (3FA64) Optix MAG271CR (3FA65) Optix MAG271CV (3FA61) Optix MAG241C (3EA21) Optix MAG241CP (3EA24) Preface

MSI Optix MAG274R Datasheet - compy.sk · MSI Optix MAG274R Datasheet Author: Micro-Star INT'L CO., LTD. Subject: MSI Optix MAG274R Monitor Keywords: MSI,Optix MAG274R,Monitor,Datasheet,--digital_sig_here--

Optix Series

OPTIX 500 Huawei

OPTIX OSN 7500 OPTIX OSN 1500 OPTIX OSN 1800 II OPTIX OSN 3500 OPTIX METRO OPTIX METRO 5000 (10G) OPTIX METRO 6040 OPTIX METRO 1050 OPTIX METRO 3100 OPTIX SBS 2500+ (METRO 3000)

Huawei OptiX OSN 8800 and Boards Datasheet · OptiX OSN 8800 UPS Chassis Structure Subracks are the basic working units of the OptiX OSN 8800 universal platform subrack. The OptiX

Advanced OptiX Programming - NVIDIA · OPTIX 3.5 REGISTERED DEVELOPER PROGRAM Improvements to OptiX development as it enters its 4th year: OptiX Registered Developer Program (RDP)