Cg and Hardware Accelerated Shading
Cem Cebenoyan
NVIDIA CONFIDENTIAL
Overview
Cg Overview
Where we are in hardware today
Physical Simulation on GPU
GeForce FX / Cg Demos
Advanced hair and skin rendering in “Dawn”
Adaptive subdivision surfaces and ambient occlusion shading in “Ogre”
Procedural shading in “Time Machine”
Depth of field and post-processing effects in “Toys”
Order-independent transparency (OIT)
What is Cg?
A high level language for controlling parts of the graphics pipeline of modern GPUs
Today, this includes the vertex transformation and fragment processing units of the pipeline
Very C-like
Only simpler
Native support for vectors, matrices, dot-products, reflection vectors, etc.
Similar in scope to Renderman
But notably different to handle the way hardware accelerators work
Cg Pipeline Overview
Graphics Program Written in Cg (“C” for Graphics)
↓ Compiled & Optimized
Low-Level Graphics “Assembly Code”
Graphics Data Flow
Application → Vertex Program → Fragment Program → Framebuffer
The vertex and fragment stages each run a Cg program:

// Diffuse lighting
float d = dot(normalize(frag.N), normalize(frag.L));
if (d < 0) d = 0;
c = d * f4tex2D(t, frag.uv) * diffuse;
...
Graphics Hardware Today
Fully programmable vertex processing
Full IEEE 32-bit floating point processing
Native support for mul, dp3, dp4, rsq, pow, sin, cos...
Full support for branching, looping, subroutines
Fully programmable pixel processing
IEEE 32-bit, 16-bit (s10e5) math supported
Same native math ops as vertex, plus texture fetch, and derivative instructions
No branching, but >1000 instruction limit
Floating point textures / frame buffers
No blending / filtering yet
~500 MHz core clock
Physical Simulation
Simple cellular automata-like simulations are possible on NV20 class hardware (e.g. Game of Life, Greg James’ water simulation, Mark Harris’ CML work)
Use textures to represent physical quantities (e.g. displacement, velocity, force) on a regular grid
Multiple texture lookups allow access to neighbouring values
Pixel shader calculates new values, renders results back to texture
Each rendering pass draws a single quad, calculating next time step in simulation
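The texture-based scheme above can be sketched on the CPU (a hypothetical Python sketch, not the actual demo code): a 2D array stands in for the texture, the four neighbour reads stand in for the multiple texture lookups, and each call to `step` corresponds to one full-screen-quad rendering pass, here computing a simple diffusion-style update.

```python
def step(grid):
    """One simulation pass: move each cell toward its neighbour average."""
    h, w = len(grid), len(grid[0])
    out = [[0.0] * w for _ in range(h)]   # the "render target" texture
    for y in range(h):
        for x in range(w):
            # neighbour lookups; edges replicate the centre value
            left  = grid[y][x - 1] if x > 0     else grid[y][x]
            right = grid[y][x + 1] if x < w - 1 else grid[y][x]
            up    = grid[y - 1][x] if y > 0     else grid[y][x]
            down  = grid[y + 1][x] if y < h - 1 else grid[y][x]
            out[y][x] = 0.25 * (left + right + up + down)
    return out

grid = [[0.0] * 8 for _ in range(8)]
grid[4][4] = 1.0               # initial impulse
for _ in range(10):
    grid = step(grid)          # each pass = one full-screen quad on the GPU
```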
Physical Simulation
Problem: 8 bit precision on NV20 is not enough, causes drifting, stability problems
Float precision on NV30 allows GPU physics to match CPU accuracy
New fragment programming model (longer programs, flexible dependent texture reads) allows much more interesting simulations
Example: Cloth Simulation Shader
Uses Verlet integration (see: Jakobsen, GDC 2001)
Avoids storing explicit velocity
newx = x + (x – oldx)*damping + a*dt*dt
Not always accurate, but stable!
Store current and previous position of each particle in 2 RGB float textures
Fragment program calculates new position, writes result to float buffer
Copy float buffer back to texture for next iteration (could use render-to-texture instead)
Swap current and previous textures
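The Verlet step and the current/previous ping-pong above can be sketched on the CPU as follows (a minimal one-dimensional sketch with assumed parameter values, not the demo's code):

```python
def integrate(x, oldx, a, dt, damping):
    # newx = x + (x - oldx)*damping + a*dt*dt  (velocity is implicit)
    return x + (x - oldx) * damping + a * dt * dt

dt, damping, gravity = 0.01, 0.99, -9.8   # hypothetical values
curr, prev = 1.0, 1.0                     # particle at rest at height 1

for _ in range(100):
    new = integrate(curr, prev, gravity, dt, damping)
    prev, curr = curr, new                # swap current/previous "textures"
```

On the GPU the two scalars are two float textures, and the swap is just exchanging which texture is bound as input versus render target.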
Cloth Shader Demo
Cloth Simulation Shader
2 passes:
1. Perform integration
2. Apply constraints:
Floor constraint
Sphere constraint
Distance constraints between particles
Read back float frame buffer using glReadPixels
Draw particles and constraints
Cloth Simulation Cg Code (1st pass)
void Integrate(inout float3 x, float3 oldx, float3 a,
               float timestep2, float damping)
{
    x = x + damping*(x - oldx) + a*timestep2;
}

myFragout main(v2fconnector In,
               uniform texobjRECT x_tex,
               uniform texobjRECT ox_tex,
               uniform float timestep,
               uniform float damping,
               uniform float3 gravity)
{
    myFragout Out;
    float2 s = In.TEX0.xy;

    // get current and previous position
    float3 x = f3texRECT(x_tex, s);
    float3 oldx = f3texRECT(ox_tex, s);

    // move the particle
    Integrate(x, oldx, gravity, timestep*timestep, damping);

    Out.COL.xyz = x;
    return Out;
}
Cloth Simulation Cg Code (2nd pass)
// constrain particle to be a fixed distance from another particle
void DistanceConstraint(float3 x, inout float3 newx, float3 x2,
                        float restlength, float stiffness)
{
    float3 delta = x2 - x;
    float deltalength = length(delta);
    float diff = (deltalength - restlength) / deltalength;
    newx = newx + delta*stiffness*diff;
}

// constrain particle to be outside sphere
void SphereConstraint(inout float3 x, float3 center, float r)
{
    float3 delta = x - center;
    float dist = length(delta);
    if (dist < r) {
        x = center + delta*(r / dist);
    }
}

// constrain particle to be above floor
void FloorConstraint(inout float3 x, float level)
{
    if (x.y < level) {
        x.y = level;
    }
}
Cloth Simulation Cg Code (cont.)
myFragout main(v2fconnector In,
               uniform texobjRECT x_tex,
               uniform texobjRECT ox_tex,
               uniform float dist,
               uniform float stiffness)
{
    myFragout Out;
    float2 s = In.TEX0.xy;

    // get current position
    float3 x = f3texRECT(x_tex, s);

    // satisfy constraints
    FloorConstraint(x, 0.0f);
    SphereConstraint(x, float3(0.0, 2.0, 0.0), 1.0f);

    // get positions of neighbouring particles
    float3 x1 = f3texRECT(x_tex, s + float2(1.0, 0.0));
    float3 x2 = f3texRECT(x_tex, s + float2(-1.0, 0.0));
    float3 x3 = f3texRECT(x_tex, s + float2(0.0, 1.0));
    float3 x4 = f3texRECT(x_tex, s + float2(0.0, -1.0));

    // apply distance constraints
    float3 newx = x;
    if (s.x < 31) DistanceConstraint(x, newx, x1, dist, stiffness);
    if (s.x > 0)  DistanceConstraint(x, newx, x2, dist, stiffness);
    if (s.y < 31) DistanceConstraint(x, newx, x3, dist, stiffness);
    if (s.y > 0)  DistanceConstraint(x, newx, x4, dist, stiffness);

    Out.COL.xyz = newx;
    return Out;
}
Physical Simulation – Future Work
Limitation - only one destination buffer, can only modify position of one particle at a time
Could use pack instructions to store 2 vec4h (8 half floats) in 128 bit float buffer
Could also use additional textures to encode particle masses, stiffness, constraints between arbitrary particles (rigid bodies)
“float buffer to vertex array” extension offers possibility of directly interpreting results as geometry without any CPU intervention!
Collision detection with meshes is hard
Demos Introduction
We developed four demos for the launch of GeForce FX:
“Dawn”
“Toys”
“Time Machine”
“Ogre” (Spellcraft Studio)
Characters Look Better With Hair
Rendering Hair
Two options:
1) Volumetric (texture)
2) Geometric (lines)
We have used volumetric approximations (shells and fins) in the past (e.g. Wolfman demo)
Doesn’t work well for long hair
We considered using textured ribbons (popular in Japanese video games). Alpha sorting is a pain.
Performance of GeForce FX finally lets us render hair as geometry
Rendering Hair as Lines
Each hair strand is rendered as a line strip (2-20 vertices, depending on curvature)
Problem: lines are a minimum of 1 pixel thick, regardless of distance from camera
Not possible to change line width per vertex
Can use camera-facing triangle strips, but these require twice the number of vertices, and have aliasing problems
Anti-Aliasing
Two methods of anti-aliasing lines in OpenGL
GL_LINE_SMOOTH
High quality, but requires blending, sorting geometry
GL_MULTISAMPLE
Usually lower quality, but order independent
We used multisample anti-aliasing with “alpha to coverage” mode
By fading alpha to zero at the ends of hairs, coverage and apparent thickness decreases
“SAMPLE_ALPHA_TO_COVERAGE_ARB” is part of the ARB_multisample extension
Hair Without Antialiasing
Hair With Multisample Antialiasing
Hair Shading
Hair is lit with simple anisotropic shader (Heidrich and Seidel model)
Low specular exponent, dim highlight looks best
Black hair = no shadows!
Self-shadowing hair is hard
Deep shadow maps
Opacity shadow maps
Top of head is painted black to avoid skin showing through
We also had a very short hair style, which helps
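The Heidrich–Seidel anisotropic strand-lighting terms the hair shader is based on can be sketched as below (a CPU sketch; vectors are plain unit-length 3-tuples, and the low exponent follows the "dim highlight" note above):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def heidrich_seidel(T, L, V, exponent=8.0):
    """Strand lighting from tangent T, light L, view V (all unit length)."""
    lt, vt = dot(L, T), dot(V, T)
    sin_lt = math.sqrt(max(0.0, 1.0 - lt * lt))   # sin of angle to strand
    sin_vt = math.sqrt(max(0.0, 1.0 - vt * vt))
    diffuse = sin_lt
    specular = max(0.0, sin_lt * sin_vt - lt * vt) ** exponent
    return diffuse, specular

# Light perpendicular to the strand gives full diffuse:
d, s = heidrich_seidel(T=(1, 0, 0), L=(0, 1, 0), V=(0, 0, 1))
```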
Hair Styling is Important
Hair Styling
Difficult to position 50,000 individual curves by hand
Typical solution is to define a small number of control hairs, which are then interpolated across the surface to produce render hairs
We developed a custom tool for hair styling
Commercial hair applications have poor styling tools and are not designed for real time output
Hair Styling
Scalp is defined as a polygon mesh
Hairs are represented as cubic Bezier curves
Control hairs are defined at each vertex
Render hairs are interpolated across triangles using barycentric coordinates
Number of generated hairs is based on triangle area to maintain constant density
Can add noise to interpolated hairs to add variation
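The barycentric interpolation step above can be sketched as follows (a hypothetical Python sketch; control hairs are given as point lists rather than the Bezier curves the tool actually uses):

```python
def lerp_hair(h0, h1, h2, u, v):
    """Blend three control hairs with barycentric weights (u, v, 1-u-v)."""
    w = 1.0 - u - v
    return [tuple(u * a + v * b + w * c for a, b, c in zip(p0, p1, p2))
            for p0, p1, p2 in zip(h0, h1, h2)]

# Three control hairs at the triangle's corners (two points each):
h0 = [(0, 0, 0), (0, 1, 0)]
h1 = [(1, 0, 0), (1, 1, 0)]
h2 = [(0, 0, 1), (0, 1, 1)]

# Render hair at the triangle centroid:
mid = lerp_hair(h0, h1, h2, 1/3, 1/3)
```

Jittering (u, v) per render hair, with hair count proportional to triangle area, gives the constant density described above.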
Hair Styling Tool
Provides a simple UI for styling hair
Combing tools
Lengthen / shorten
Straighten / mess up
Uses a simple physics simulation based on Verlet integration (Jakobsen, GDC 2001)
Physics is run on control hairs only
Collision detection done with ellipsoids
Dawn Demo
Show demo
The Ogre Demo
A real-time preview of Spellcraft Studio’s in-production short movie “Yeah!”
Created in 3DStudio MAX
Used Character Studio for animation, plus Stitch plug-in for cloth simulation
Original movie was rendered in Brazil with global illumination
Available at: www.yeahthemovie.de
Our aim was to recreate the original as closely as possible, in real-time
What are Subdivision Surfaces?
A curved surface defined as the limit of repeated subdivision steps on a polygonal model
Subdivision rules create new vertices, edges, faces based on neighboring features
We used the Catmull-Clark subdivision scheme (as used by Pixar)
MAX, Maya, Softimage, Lightwave all support forms of subdivision surfaces
Realtime Adaptive Tessellation
Brute force subdivision is expensive
Generates lots of polygons where they aren’t needed
Number of polygons increases exponentially with each subdivision
Adaptive tessellation subdivides patches based on screen-space patch size test
Guaranteed crack-free
Generates normals and tangents on the fly
Culls off-screen and back-facing patches
CPU-based (uses SSE where possible)
Control Mesh vs. Subdivided Mesh
4,000 faces vs. 17,000 triangles
Control Mesh Detail
Subdivided Mesh Detail
Why Use Subdivision Surfaces?
Content
Characters were modeled with subdivision in mind (using 3DSMax “MeshSmooth/NURMS” modifier)
Scalability
We wanted the demo to be scalable to lower-end hardware
“Infinite” detail
Can zoom in forever without seeing hard edges
Animation compression
Just store low-res control mesh for each frame
May be accelerated on future GPUs
Disadvantages of Realtime Subdivision
CPU intensive
But we might as well use the CPU for something!
View dependent
Requires re-tessellation for shadow map passes
Mesh topology changes from frame to frame
Makes motion blur difficult
Ambient Occlusion Shading
Helps simulate the global illumination “look” of the original movie
Self occlusion is the degree to which an object shadows itself
“How much of the sky can I see from this point?”
Simulates a large spherical light surrounding the scene
Popular in production rendering – Pearl Harbor (ILM), Stuart Little 2 (Sony)
Occlusion
How To Calculate Occlusion
Shoot rays from surface in random directions over the hemisphere (centered around the normal)
The percentage of rays that hit something is the occlusion amount
Can also keep track of average of un-occluded directions – “bent normal”
Some RenderMan-compliant renderers (e.g. Entropy) have a built-in occlusion() function that will do this
We can’t trace rays using graphics hardware (yet)
So we pre-calculate it!
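The ray-casting computation above can be sketched on the CPU as follows (a hypothetical Python sketch; the scene here is a single occluding half-space standing in for the mesh a real baker would trace against):

```python
import math, random

def random_hemisphere_dir(normal):
    """Rejection-sample a unit direction in the hemisphere around `normal`."""
    while True:
        d = (random.uniform(-1, 1), random.uniform(-1, 1), random.uniform(-1, 1))
        n2 = sum(c * c for c in d)
        if 1e-6 < n2 <= 1.0:
            d = tuple(c / math.sqrt(n2) for c in d)
            if sum(a * b for a, b in zip(d, normal)) > 0:
                return d

def occlusion(point, normal, hits_scene, n_rays=128):
    """Fraction of hemisphere rays that hit something (the occlusion amount)."""
    hit = sum(1 for _ in range(n_rays)
              if hits_scene(point, random_hemisphere_dir(normal)))
    return hit / n_rays

# A wall toward +x occludes roughly half the hemisphere above the floor:
random.seed(0)
wall = lambda p, d: d[0] > 0      # ray heads toward +x -> blocked
occ = occlusion((0, 0, 0), (0, 1, 0), wall)
```

Averaging the un-occluded directions instead of counting hits yields the "bent normal" mentioned above.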
Occlusion Baking Tool
Uses ray-tracing engine to calculate occlusion values for each vertex in control mesh
We used 128 rays / vertex
Stored as floating point scalar for each vertex and each frame of the animation
Calculation took around 5 hours for 1000 frames
Subdivision code interpolates occlusion values using cubic interpolation
Used as ambient term in shader
Ogre Demo
Show demo
Procedural Shading in Time Machine
Goals for the Time Machine demo
Overview of effects
Metallic Paint
Wood
Chrome
Techniques used
Faux-BRDF reflection
Reveal and dXdT maps
Normal and DuDv scaling
Dynamic Bump mapping
Performance Issues
Summary
Why do Time Machine?
GPUs are much more programmable
Thanks to generalized dependent texturing, more active textures (16 on GeForce FX) and (for our purposes) unlimited blend operations, high-quality animation is possible per-pixel
GeForce FX has >2x the performance of GeForce4 Ti
Executing lots of per-pixel operations isn’t just possible; it can be done in real time.
Previous per-pixel animation was limited
Animated textures
PDE / CA effects (see Mark Harris’ talk at GDC)
Goal : Full-scene per-pixel animation
Why do Time Machine? (continued)
Neglected pick-up trucks demonstrate a wide variety of surface effects, with intricate transitions and boundaries
Paint oxidizing, bleaching and rusting
Vinyl cracking
Wood splintering and fading
And more…
Not possible with just per-vertex animation!
Time Machine Effects : Paint
Effects: specular color shift, oxidation, bubbling, rusting
60 pixel shader instructions, 11 textures
Paint textures:
Paint Color
Rust LUT
Shadow Map
Spotlight Mask
Light Rust Color*
Deep Rust Color*
Ambient Light*
Bubble Height*
Reveal Time*
New Environment*
Old Environment*
(* = artist created)
Effects (cont’d) : Wood, Chrome, Glass
Wood fades and cracks: 23 instructions, 8 textures
Chrome welts and corrodes: 31 instructions, 6 textures
Headlights fog: 24 instructions, 4 textures
Procedural or Not?
Procedural shading normally replaces textures with functions of several variables.
Time Machine uses textures liberally.
The only parameter to our shaders is time.
However, turning everything into math is expensive
Time Machine’s solution
Give artist direct control (textures) over final image, use functions to control transitions
Techniques : Faux-BRDF Reflection
Many automotive paints exhibit a color shift as a function of the light and viewer directions.
This effect has been approximated with analytic BRDFs (Lafortune’s cosine lobes)
And measured by Cornell University’s graphics lab
BRDF factorization [McCool, Rusinkiewicz] is one method to use this data on graphics hardware
Efficient representation with multiple 2D textures
Closely approximates the original BRDFs
But not necessarily the most efficient method for automotive paint, and not artist-controllable.
Reflection intensity is uninteresting (largely Blinn)
Rotated/projected axes hard to visualize
Techniques : Faux-BRDF Reflection 2
Our solution: project BRDF values onto a single 2D texture, and factor out the intensity
Compute intensity in real-time, using (N.H)^s
Texture varies slowly, so it can be low-res (64x64).
Anti-aliasing texture fixes laser noise at grazing angles
For automotive paints, N.L and N.H work well for axes.
Not physically accurate, but fast and high-quality.
Easy for artists to tweak.
(Images: DuPont Cayman lacquer and Mystique lacquer)
Techniques : Reveal and dXdT maps
Artists do not want to paint hundreds of frames of animation for a surface transition (e.g., paint->rust)
Ultimately, effect is just a conditional:
if (time > n) color = rust; else color = paint;
Or an interpolation between a start and end point
paint = interpolate(paint, bleach, s*(time-n));
So all intermediate values can be generated.
For continuous effects, use dXdT (velocity) maps
Can be stored in alpha in a DXT5 texture.
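The reveal-map idea above can be sketched as follows (a hypothetical Python sketch: per pixel, a stored reveal time `n` and a rate `s` turn the global time into a blend factor, so the artist paints only the start and end states):

```python
def clamp01(x):
    return max(0.0, min(1.0, x))

def lerp(a, b, t):
    return a + (b - a) * t

def shade(paint, bleach, reveal_time, rate, time):
    """paint = interpolate(paint, bleach, s*(time - n)), clamped to [0, 1]."""
    t = clamp01(rate * (time - reveal_time))
    return lerp(paint, bleach, t)

# Before the reveal time the surface is pure paint...
early = shade(paint=0.8, bleach=0.2, reveal_time=5.0, rate=0.5, time=2.0)
# ...and well after it has fully transitioned to the bleached value.
late = shade(paint=0.8, bleach=0.2, reveal_time=5.0, rate=0.5, time=9.0)
```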
Performance Concerns
Executing large shaders is expensive.
First rule of optimization: Keep inner loops tight
Shaders are the inner loop, run >1M times per frame.
But graphics cards have many parallel units
Vertex, fragment, and texture units
Modern GPUs do a great job of hiding texture latency
Bandwidth is unimportant in long shaders
Time Machine runs at virtually the same framerate on a 500/500 GeForce FX as it does on a 500/400 or 500/550
So not using textures is wasting performance!
Performance Concerns…
What makes a good texture?
Saves math operations
8 (RGBA) or 16 (HILO) bit precision sufficient
Depends on a limited number of variables
Textures we used
Interpolating between light and dark rust layers
Required computing the difference between light and dark layers’ reveal maps, and expanding to [0..1].
Function was dependent on current and reveal time.
Used to blend two texture maps
Performance Concerns…
Textures Used, continued…
Surround Maps
Recomputing the normal requires knowing the heights of 4 texels (s-1,t), (s+1,t), (s,t+1) and (s,t-1)
Each height is only one 8-bit component
Instead of 4 dependent fetches, we can pack all four into one: S(s,t) = [H(s-1,t), H(s+1,t), H(s,t-1), H(s,t+1)]
Saved 4 math ops and 3 texture fetches + shuffle logic
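The surround-map idea can be sketched on the CPU as follows (a hypothetical Python sketch with an analytic height function standing in for the height texture; the normal's slopes fall out of the four neighbour heights as central differences):

```python
import math

def height(s, t):
    # hypothetical height field standing in for the 8-bit height texture
    return 0.1 * math.sin(s) * math.cos(t)

def surround_fetch(s, t):
    # S(s,t) = [H(s-1,t), H(s+1,t), H(s,t-1), H(s,t+1)] in a single texel
    return (height(s - 1, t), height(s + 1, t),
            height(s, t - 1), height(s, t + 1))

def bump_normal(s, t, bump_scale=1.0):
    """Normal from one surround-map fetch via central differences."""
    hl, hr, hb, ht = surround_fetch(s, t)
    dx = (hr - hl) * 0.5 * bump_scale
    dy = (ht - hb) * 0.5 * bump_scale
    n = (-dx, -dy, 1.0)
    inv = 1.0 / math.sqrt(sum(c * c for c in n))
    return tuple(c * inv for c in n)

n = bump_normal(0.0, 0.0)
```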
Time Machine demo
Show demo
Toys Demo - Simple Depth of Field
Render scene to color and depth textures
Generate mipmaps for color texture
Render full screen quad with “simpledof” shader:
Depth = tex(depthtex, texcoord)
Coc (circle of confusion) = abs(depth*scale + bias)
Color = txd(colortex, texcoord, (coc,0), (0,coc))
Scale and bias are derived from the camera:
Scale = (aperture * focaldistance * planeinfocus * (zfar – znear)) / ((planeinfocus – focaldistance) * znear * zfar)
Bias = (aperture * focaldistance * (znear – planeinfocus)) / ((planeinfocus – focaldistance) * znear)
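The circle-of-confusion math can be sketched on the CPU with hypothetical camera values as below. One assumption to flag: the bias denominator is taken as (planeinfocus – focaldistance), which is the value for which the CoC vanishes exactly at the plane in focus; `depth` is the depth-buffer value sampled from the depth texture.

```python
def coc_scale_bias(aperture, focaldistance, planeinfocus, znear, zfar):
    scale = (aperture * focaldistance * planeinfocus * (zfar - znear)) \
            / ((planeinfocus - focaldistance) * znear * zfar)
    bias = (aperture * focaldistance * (znear - planeinfocus)) \
           / ((planeinfocus - focaldistance) * znear)
    return scale, bias

def circle_of_confusion(depth, scale, bias):
    # coc = abs(depth*scale + bias), as in the shader above
    return abs(depth * scale + bias)

# hypothetical camera values
scale, bias = coc_scale_bias(aperture=0.5, focaldistance=0.1,
                             planeinfocus=10.0, znear=0.1, zfar=100.0)

# Depth-buffer value of a point exactly at the plane in focus:
d_focus = 100.0 * (10.0 - 0.1) / (10.0 * (100.0 - 0.1))
```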
Artifacts: Bilinear Interpolation/Magnification
Bilinear artifacts in extreme back- and near-ground
Solution: multiple jittered samples
Even without jittering, a 4 or 5 sample rotated grid pattern brings smaller artifacts under control
Larger artifacts need jittered samples, and more of them
Then it’s just a tradeoff between noise from the jittering and bilinear interpolation artifacts
(and of course the quality/performance tradeoff with number of samples)
Noise vs. Interpolation Artifacts
(Images: with noise vs. without noise)
Artifacts: Depth Discontinuities
Near-ground (blurry) pixels don’t properly blend out over top of mid-ground (sharp) pixels
Easy solution: Cheat!
Either don’t let objects get too far in front of the plane in focus, or blur everything a little more when they do – soft edges help hide this fairly well.
Depth Discontinuities
Fun With Color Matrices
Since we’re already rendering to a full-screen texture, it’s easy to muck with the final image.
Operations are just rotations / scales in RGB space
Color (hue) shift
Saturation
Brightness
Contrast
These are all matrices, so compose them together, and apply them as 3 dot products in the shader
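The composition above can be sketched as follows (a pure-Python sketch with standard Rec. 601 luminance weights assumed; no GPU specifics):

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def saturation(s):
    # Blend toward the luminance axis; s=1 is identity, s=0 is grayscale.
    lr, lg, lb = 0.299, 0.587, 0.114
    return [[lr + s * (1 - lr), lg * (1 - s),      lb * (1 - s)],
            [lr * (1 - s),      lg + s * (1 - lg), lb * (1 - s)],
            [lr * (1 - s),      lg * (1 - s),      lb + s * (1 - lb)]]

def brightness(b):
    return [[b, 0, 0], [0, b, 0], [0, 0, b]]

def apply(m, rgb):
    # The three dot products the fragment shader would execute.
    return tuple(sum(m[i][j] * rgb[j] for j in range(3)) for i in range(3))

m = matmul(brightness(1.2), saturation(0.0))   # grayscale, 20% brighter
gray = apply(m, (1.0, 0.0, 0.0))               # pure red in, gray out
```

Composing all operations into one matrix ahead of time keeps the per-pixel cost at three dot products regardless of how many adjustments are stacked.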
Original Image
Colorshifted Image
Black and White Image
Toys Demo
Show demo
Order Independent Transparency
Why is correct transparency hard?
Depth peeling
Two depth buffers
Enter the shadow map
Precision/invariance issues
Depth replace texture shader
Blending the layers
Other applications
Good Transparency vs. Bad Transparency
Can’t just glEnable(GL_BLEND)…
(Images: with OIT vs. without OIT)
Why is correct transparency hard?
Most hardware does object-order rendering
Correct transparency requires sorted traversal
Have to render polygons in sorted order
Not very convenient
Polygons can’t intersect
Lots of extra application work
Especially difficult for dynamic scene databases
Depth Peeling
The algorithm uses an “implicit sort” to extract multiple depth layers
The first render pass finds the front-most fragment color/depth
Each successive render pass extracts the fragment color/depth for the next-nearest fragment on a per-pixel basis
Use dual depth buffers to compare previous nearest fragment with current
Second “depth buffer” used for comparison (read only) from texture [more on this later]
(Images: depth layers 0 to 3)
Cross-section view of depth peeling
(Images: three frames over a depth axis from 0 to 1, showing layers 0, 1 and 2)
Depth peeling strips away depth layers with each successive pass. The frames above show the frontmost (leftmost) surfaces as bold black lines, hidden surfaces as thin black lines, and “peeled away” surfaces as light grey lines.
Dual Depth Buffer Pseudo-code
for (i = 0; i < num_passes; i++)
{
    clear color buffer

    depth unit 0:
        if (i == 0) { disable depth test }
        else        { enable depth test }
        bind depth buffer (i % 2)
        disable depth writes        /* read-only depth test */
        set depth func to GREATER

    depth unit 1:
        bind depth buffer ((i+1) % 2)
        clear depth buffer
        enable depth writes
        enable depth test
        set depth func to LESS

    render scene
    save color buffer RGBA as layer i
}
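The peeling loop can be sketched on the CPU for a single pixel as follows (a hypothetical Python sketch: each fragment is (depth, color), and each pass keeps the nearest fragment that is strictly deeper than the last peeled layer):

```python
def peel_layers(fragments, num_passes):
    """Extract up to num_passes depth layers for one pixel."""
    layers, last_depth = [], -1.0
    for _ in range(num_passes):
        # depth unit 0's GREATER test against the previous layer:
        survivors = [f for f in fragments if f[0] > last_depth]
        if not survivors:
            break
        # depth unit 1's LESS test: keep the nearest survivor
        nearest = min(survivors, key=lambda f: f[0])
        layers.append(nearest)
        last_depth = nearest[0]
    return layers

frags = [(0.7, "green"), (0.2, "red"), (0.5, "blue")]
layers = peel_layers(frags, num_passes=4)   # red, then blue, then green
```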
Implementation
There is no “dual depth buffer” extension to OpenGL, so what can we do?
Just need one depth test with writeable depth buffer – the other can be read-only
Shadow mapping is a read-only depth test!
Depth test can have an arbitrary camera location
Other interesting uses for clip volumes
Fast copies make this proposition reasonable
Copies will be unnecessary in the future…
Precision / Invariance issues
Using shadow mapping hardware introduces precision and invariance issues
Depth rasterization usually just needs to match output depth buffer precision, and requires no perspective correction
Texture hardware requires perspective correction and projection at high precision
Making things match would be difficult without the DEPTH_REPLACE texture shader
Computes with texture hardware at texture precision
Solves invariance problems at some extra expense
Will be cheaper in the future…
(Images: 1, 2, 3 and 4 layers)
Compositing
Each time we peel, we capture the RGBA, then as a final step, we blend all the layers together from back to front
Opaque fragments completely overwrite previous transparent ones
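The back-to-front blend above can be sketched as a standard "over" composite (a Python sketch; layers are (rgb, alpha) pairs, and an alpha of 1.0 makes an opaque layer overwrite everything behind it, as noted above):

```python
def composite(layers, background):
    """Blend peeled RGBA layers back to front over a background color."""
    color = background
    for rgb, alpha in reversed(layers):   # back to front
        color = tuple(a * alpha + c * (1 - alpha)
                      for a, c in zip(rgb, color))
    return color

layers = [((1.0, 0.0, 0.0), 0.5),   # front: half-transparent red
          ((0.0, 0.0, 1.0), 1.0)]   # back: opaque blue
out = composite(layers, background=(0.0, 0.0, 0.0))
```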
Conclusions
Results are nice!
Get correct transparency without invasive changes to internal data structures
Can be “bolted on” to existing CAD/CAM apps
Requires n scene traversals for n correctly sorted depths
n = 4 is often quite satisfactory (see previous slide)
Shadow maps are for more than shadows!
Questions?
http://developer.nvidia.com
http://developer.nvidia.com/cg/
http://www.cgshaders.org/