08-gpuSoftwareRasterLaineAndPantaleoni-BPS2011

Software Rasterization on GPUs

Samuli Laine Jacopo Pantaleoni

NVIDIA Research

Outline

Rasterization

Laine, Karras: High-Performance Software Rasterization on GPUs. Proceedings of High-Performance Graphics 2011.

Voxelization

Pantaleoni: VoxelPipe: A Programmable Pipeline for 3D Voxelization. Proceedings of High-Performance Graphics 2011.

Rationale

Build a research platform

Elbow space for game developers

Enable new algorithms

Provoke hardware architects

Flexibility of software, performance of fixed-function hardware

Programmable ROP

Stochastic rasterization

Non-linear rasterization

Non-quad derivatives

Quad merging

Decoupled sampling

Compact after discard

etc.

Building a Pipeline

We implemented a full pixel pipeline using CUDA

From triangle setup to ROP

Obey fundamental requirements of gfx pipe

Maintain input order

Hole-free rasterizer with correct rasterization rules

Make it as fast as possible!

Design Considerations

Run everything in parallel

We need a lot of threads to fill the machine

Minimize amount of synchronization

Avoid excessive use of atomics

Focus on load balancing

Graphics workloads are wild

Programmable Shading

Pipeline Structure

Chunker-style pipeline with four stages

Triangle setup, Bin raster, Coarse raster, Fine raster

Run data in large batches

Separate kernel launch for each stage

Keep data in input order all the time

No need to sort

Chunking to Bins and Tiles

Frame buffer

Bin: 16x16 tiles, 128x128 px

Tile: 8x8 px

Pixel
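The bin/tile/pixel hierarchy above can be sketched as plain index arithmetic. This is only an illustration using the sizes on the slide; the names `PixelLocation` and `locatePixel` are hypothetical and not taken from the released code.

```cpp
// Sizes from the slide: 8x8 px tiles, 16x16 tiles (128x128 px) per bin.
constexpr int kTileSize = 8;
constexpr int kTilesPerBinSide = 16;
constexpr int kBinSize = kTileSize * kTilesPerBinSide;  // 128 px

struct PixelLocation {
    int binX, binY;      // bin coordinates within the frame buffer
    int tileX, tileY;    // tile coordinates within the bin (0..15)
    int pixelX, pixelY;  // pixel coordinates within the tile (0..7)
};

// Hypothetical helper: decompose a frame-buffer pixel coordinate into
// bin, tile-in-bin, and pixel-in-tile indices. Assumes x >= 0 and y >= 0.
constexpr PixelLocation locatePixel(int x, int y) {
    return PixelLocation{
        x / kBinSize, y / kBinSize,
        (x % kBinSize) / kTileSize, (y % kBinSize) / kTileSize,
        x % kTileSize, y % kTileSize};
}
```

For example, pixel (300, 17) falls in bin (2, 0), tile (5, 2) of that bin, and pixel (4, 1) of that tile.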

Triangle Setup

Input: vertex buffer (positions, attributes), index buffer

Output: triangle data buffer (edge eqs., u/v plane equations (pleqs), zmin, etc.)
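As a rough illustration of the edge equations that triangle setup produces, the sketch below builds three edge functions of the form E(x, y) = a*x + b*y + c from integer vertex positions. It assumes counter-clockwise winding in a y-up coordinate system and ignores fill rules and subpixel precision; `Vec2i`, `EdgeEq`, `setupTriangle` and `isInside` are hypothetical names, not the pipeline's actual data layout.

```cpp
#include <array>
#include <cstdint>

struct Vec2i { std::int64_t x, y; };

// Edge function E(x, y) = a*x + b*y + c.
// E > 0 on the left side of the directed edge v0 -> v1.
struct EdgeEq {
    std::int64_t a, b, c;
    std::int64_t eval(Vec2i p) const { return a * p.x + b * p.y + c; }
};

// Expands cross(v1 - v0, p - v0) into the a, b, c coefficients.
EdgeEq makeEdge(Vec2i v0, Vec2i v1) {
    return {v0.y - v1.y, v1.x - v0.x, v0.x * v1.y - v1.x * v0.y};
}

// Hypothetical setup step: one edge equation per triangle edge.
std::array<EdgeEq, 3> setupTriangle(Vec2i v0, Vec2i v1, Vec2i v2) {
    return {{makeEdge(v0, v1), makeEdge(v1, v2), makeEdge(v2, v0)}};
}

// A point is covered when all three edge functions are non-negative.
// Points exactly on an edge count as inside here, which a real
// rasterizer would resolve with a fill rule instead.
bool isInside(const std::array<EdgeEq, 3>& edges, Vec2i p) {
    for (const EdgeEq& e : edges)
        if (e.eval(p) < 0) return false;
    return true;
}
```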

Bin Raster

Input: triangle data buffer

One bin raster instance per SM (SM 0, SM 1, ..., SM 14)

Output: IDs of triangles that overlap bin
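A serial sketch of what the bin raster output looks like: each bin gets a list of triangle IDs, and because triangles are visited in input order, every list stays in input order. The overlap test here is a simple bounding-box test, which is an assumption made for brevity; `BBox` and `binTriangles` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct BBox { int minX, minY, maxX, maxY; };  // inclusive pixel bounds

// Returns, for each bin (row-major), the IDs of triangles whose bounding
// box touches it. The default bin size of 128 px follows the slides.
std::vector<std::vector<int>> binTriangles(const std::vector<BBox>& boxes,
                                           int binsX, int binsY,
                                           int binSize = 128) {
    const int width = binsX * binSize;
    const int height = binsY * binSize;
    std::vector<std::vector<int>> bins(static_cast<std::size_t>(binsX) * binsY);
    for (int tri = 0; tri < static_cast<int>(boxes.size()); ++tri) {
        const BBox& b = boxes[tri];
        // Skip triangles that lie entirely outside the frame buffer.
        if (b.maxX < 0 || b.maxY < 0 || b.minX >= width || b.minY >= height)
            continue;
        // Clamp to the frame buffer, then convert pixels to bin indices.
        const int x0 = std::max(b.minX, 0) / binSize;
        const int x1 = std::min(b.maxX, width - 1) / binSize;
        const int y0 = std::max(b.minY, 0) / binSize;
        const int y1 = std::min(b.maxY, height - 1) / binSize;
        for (int y = y0; y <= y1; ++y)
            for (int x = x0; x <= x1; ++x)
                bins[static_cast<std::size_t>(y) * binsX + x].push_back(tri);
    }
    return bins;
}
```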

Coarse Raster

Output: IDs of triangles that overlap tile

One coarse raster SM (SM n) has exclusive access to the bin it’s processing

Fine Raster

Input: IDs of triangles that overlap tile

Output: pixel data in FB

One fine raster warp (warp n) has exclusive access to the tile it’s processing

Read tile once from DRAM to shared

Write tile once to DRAM

Tidbit 1: Coverage Calculation

Step along edge (Bresenham-like)

Use look-up tables to generate coverage masks

~50 instructions for 8x8 stamp, one edge
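To make the notion of a coverage mask concrete, here is a brute-force reference that evaluates one edge function at every pixel of an 8x8 stamp and packs the results into a 64-bit mask. It is not the edge-stepping, look-up-table method named on the slide, only a way to see what such a method has to produce. Sampling at integer pixel coordinates and the `Edge` layout are assumptions.

```cpp
#include <cstdint>

// Edge function E(x, y) = a*x + b*y + c, covered where E >= 0.
// Coordinates are relative to the stamp's origin.
struct Edge { int a, b, c; };

// Bit (y * 8 + x) of the result is set when pixel (x, y) is on the
// inside of the edge.
std::uint64_t edgeCoverage8x8(const Edge& e) {
    std::uint64_t mask = 0;
    for (int y = 0; y < 8; ++y)
        for (int x = 0; x < 8; ++x)
            if (e.a * x + e.b * y + e.c >= 0)
                mask |= std::uint64_t{1} << (y * 8 + x);
    return mask;
}

// A triangle's coverage is the intersection of its three edge masks.
std::uint64_t triangleCoverage8x8(const Edge& e0, const Edge& e1, const Edge& e2) {
    return edgeCoverage8x8(e0) & edgeCoverage8x8(e1) & edgeCoverage8x8(e2);
}
```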

Tidbit 2: Fragment Distribution

In input phase, calculate coverage and store in list

In shading phase, detect triangle changes and calculate

triangle index and fragment in triangle

Figure panels: Input Phase, Shading Phase
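One way to picture the two phases is the serial sketch below. The input phase stores per-triangle coverage and a running fragment count; the shading phase maps each fragment slot (one per thread on the GPU) back to a triangle index and a fragment index within that triangle. The prefix sum plus binary search used here is a stand-in chosen for clarity, not necessarily how the actual kernel detects triangle changes; all names are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct FragmentRef {
    int triangle;            // index into the input triangle list
    int fragmentInTriangle;  // 0-based fragment index within that triangle
};

// Number of set bits in a coverage mask.
int countBits(std::uint64_t m) {
    int n = 0;
    while (m) { m &= m - 1; ++n; }
    return n;
}

std::vector<FragmentRef> distributeFragments(const std::vector<std::uint64_t>& coverage) {
    // Input phase: inclusive prefix sum of per-triangle fragment counts.
    std::vector<int> end(coverage.size());
    int total = 0;
    for (std::size_t i = 0; i < coverage.size(); ++i) {
        total += countBits(coverage[i]);
        end[i] = total;
    }
    // Shading phase: each slot finds the first triangle whose running
    // total exceeds the slot index. Empty triangles are skipped naturally.
    std::vector<FragmentRef> out(static_cast<std::size_t>(total));
    for (int slot = 0; slot < total; ++slot) {
        const auto it = std::upper_bound(end.begin(), end.end(), slot);
        const int tri = static_cast<int>(it - end.begin());
        const int start = (tri == 0) ? 0 : end[tri - 1];
        out[slot] = FragmentRef{tri, slot - start};
    }
    return out;
}
```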

Test Scenes

Call of Juarez scene courtesy of Techland

S.T.A.L.K.E.R.: Call of Pripyat scene courtesy of GSC Game World

Performance Results

Frame rendering time in ms (depth test + color, no MSAA, no blending)

Comparison to Hardware (1/3)

– Resolution

Cannot match hardware in raster, z kill + compact

Currently support max 2K x 2K frame buffer, 4 subpixel bits

– Attributes

Fetched when used => bad latency hiding

Expensive interpolation

– Antialiasing

Hardware nearly oblivious to MSAA, we much less so


– Memory usage, buffering through DRAM

Performance implications of reduced buffering unknown

Streaming through on-chip memory would be much better

+ Shader complexity

Shader performance theoretically the same as in graphics pipe

+ Frame buffer bandwidth

Each pixel touched only once in DRAM


+ Extensibility

Need one stage to do something extra?

Need a new stage altogether?

You can actually implement it

+ Specialization to individual applications

Rip out what you don’t need, hard-code what you can

Exploration Potential

Shader performance boosters

Compact after discard, quad merging, decoupled sampling, …

Things to do with programmable ROP

A-buffering, order-independent transparency, …

Stochastic rasterization

Non-linear rasterization

(Your idea here)

The Code is Out There

http://code.google.com/p/cudaraster/

The entire codebase is open-sourced and released

VoxelPipe: A Programmable Pipeline for 3D Voxelization

What is Voxelization?

Voxelization =

Finding all voxels overlapped by each triangle in a mesh

Why shall we care?

Why is it useful?

• Shape Matching

• Collision Detection

• Fluid / Soft-body Sim

• Stress Analysis

• Level of Detail

• Ray Tracing

Why shall we care?

Interactive Indirect Illumination and Ambient Occlusion using Voxel Cone Tracing

Cyril Crassin

(I3D 2011)

Rationale

Building a full-featured pipeline for voxelization, analogous to OpenGL for 2D rasterization

• fully conservative and thin* rasterization

• arbitrary frame-buffer types

• many blending modes (additive, max, min, and, or...)

• multiple render targets

• vertex shaders

• fragment shaders
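The blending modes listed above can be pictured as small binary functors applied when a fragment lands on a voxel. The sketch below assumes a 32-bit unsigned voxel format and a serial loop; `VoxelFragment`, the `Blend*` functors and `blendFragments` are hypothetical names, not VoxelPipe's API.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical fragment record: a voxel index plus the value to blend in.
struct VoxelFragment {
    std::size_t voxel;
    std::uint32_t value;
};

// One functor per blending mode named on the slide.
struct BlendAdd { std::uint32_t operator()(std::uint32_t d, std::uint32_t s) const { return d + s; } };
struct BlendMax { std::uint32_t operator()(std::uint32_t d, std::uint32_t s) const { return std::max(d, s); } };
struct BlendMin { std::uint32_t operator()(std::uint32_t d, std::uint32_t s) const { return std::min(d, s); } };
struct BlendAnd { std::uint32_t operator()(std::uint32_t d, std::uint32_t s) const { return d & s; } };
struct BlendOr  { std::uint32_t operator()(std::uint32_t d, std::uint32_t s) const { return d | s; } };

// Applies fragments to the voxel grid in order using the chosen mode.
// Assumes every fragment's voxel index is within the grid.
template <typename BlendOp>
void blendFragments(std::vector<std::uint32_t>& voxels,
                    const std::vector<VoxelFragment>& frags, BlendOp op) {
    for (const VoxelFragment& f : frags)
        voxels[f.voxel] = op(voxels[f.voxel], f.value);
}
```

Choosing the mode at compile time through a template parameter keeps the inner loop free of branches, which is one plausible reason to expose blending this way.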

Rationale

Extended support for rendering modes:

• conventional blending-based rasterization

• A-buffer / bucketing

Challenges

• Previous research mostly concerned with binary output

=> no Shading, no ROP

• State-of-the-Art had poor load balancing

=> Huge performance hit for mixed triangle sizes

What is Rasterization?


Fragment Shading

Vertex Shading

Highly Variable Expansion Rate

source of most load balancing problems

Observations (1)

triangles: 1 2 3 4 5 6 ... n

fragments: f1,1 f1,2 f1,3 | f2,1 | f3,1 f3,2 f3,3 ... f3,1000000 | f4,1 f4,2 | ... | fn,1 fn,2

highly variable decompression rate

Rasterization =

Sorting of Compressed Batches of Elements

Observations (2)

Decompression and Sorting can be done

Hierarchically

emit per-tile fragments

sort by tile

emit per-voxel fragments

sort by voxel

blend
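The first level of this hierarchy (emit per-tile fragments, then sort by tile) can be sketched serially as follows. A stable sort keyed only on the tile ID stands in for the GPU radix sort and keeps triangles in input order within each tile. The input format, one list of overlapped tile IDs per triangle, is an assumption, and the names are hypothetical.

```cpp
#include <algorithm>
#include <vector>

struct TileFragment {
    int tile;  // ID of the tile the triangle overlaps
    int tri;   // ID of the triangle, in input order
};

// "Decompress" each triangle into (tile, tri) pairs, then sort by tile.
std::vector<TileFragment> emitAndSortByTile(
        const std::vector<std::vector<int>>& tilesPerTriangle) {
    std::vector<TileFragment> frags;
    for (int tri = 0; tri < static_cast<int>(tilesPerTriangle.size()); ++tri)
        for (int tile : tilesPerTriangle[tri])
            frags.push_back(TileFragment{tile, tri});
    // Stable: pairs with the same tile keep their emission (triangle) order.
    std::stable_sort(frags.begin(), frags.end(),
                     [](const TileFragment& a, const TileFragment& b) {
                         return a.tile < b.tile;
                     });
    return frags;
}
```

The second level would repeat the same pattern inside each tile, emitting per-voxel fragments and sorting by voxel before blending.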

Observations (2)

triangles: 1 2 3 4 5 6 ... n

fragments: f1,1 f1,2 f1,3 | f2,1 | f3,1 f3,2 f3,3 ... f3,1000000 | f4,1 f4,2 | ... | fn,1 fn,2

decompression rate is more regular

Decompression and Sorting can be done

Hierarchically

Pipeline Overview

coarse rasterizer: one tri per thread, persistent threads; outputs (tile, tri) fragments ordered by tri id

tile sorting: radix sort; outputs tile queues

fine rasterizer: one tri per thread, persistent threads, 1 CTA per tile; outputs FB tiles

Programmable Shading

simple C++ classes:

struct MyShader
{
    T eval(const Fragment frag) const;

private:
    ...
};

T can be any of the supported types!
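A hedged illustration of how a shader class of this shape might be written and invoked. The `Fragment` fields, the `ScaledDepthShader` class and the `shadeAll` driver are all hypothetical; only the `eval(const Fragment) const` signature comes from the slide.

```cpp
#include <vector>

// Hypothetical fragment layout: voxel coordinates plus the source triangle.
struct Fragment {
    int x, y, z;
    int triangleId;
};

// Example shader returning a float; the private member holds shader state.
struct ScaledDepthShader {
    explicit ScaledDepthShader(float scale) : scale_(scale) {}

    float eval(const Fragment frag) const {
        return scale_ * static_cast<float>(frag.z);
    }

private:
    float scale_;
};

// Hypothetical driver: runs any shader whose eval() returns T over a list
// of fragments. A real pipeline would call eval() from the fine rasterizer.
template <typename T, typename Shader>
std::vector<T> shadeAll(const std::vector<Fragment>& frags, const Shader& shader) {
    std::vector<T> out;
    out.reserve(frags.size());
    for (const Fragment& f : frags)
        out.push_back(shader.eval(f));
    return out;
}
```

Because the shader is a template parameter, the call to `eval` can be inlined into the rasterization loop rather than dispatched at run time.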

Performance Results


150 – 300 M tris/s

Example Application: Real-Time GI

Future Work

• Sparse Octrees

• Tessellation / Geometry Shaders

• Programmable ROP

Future Work

http://code.google.com/p/voxelpipe/

The entire codebase will be open-sourced

Thank You

Questions

Further information:

Laine, Karras: High-Performance Software Rasterization on GPUs. Proceedings of High-Performance Graphics 2011.

Pantaleoni: VoxelPipe: A Programmable Pipeline for 3D Voxelization. Proceedings of High-Performance Graphics 2011.
