Software Rasterization on GPUs Samuli Laine Jacopo Pantaleoni NVIDIA Research

08-gpuSoftwareRasterLaineAndPantaleoni-BPS2011

Page 1: 08-gpuSoftwareRasterLaineAndPantaleoni-BPS2011

Software Rasterization on GPUs

Samuli Laine Jacopo Pantaleoni

NVIDIA Research

Page 2: 08-gpuSoftwareRasterLaineAndPantaleoni-BPS2011

Outline

Rasterization
  Laine, Karras: High-Performance Software Rasterization on GPUs. Proceedings of High-Performance Graphics 2011.

Voxelization
  Pantaleoni: VoxelPipe: A Programmable Pipeline for 3D Voxelization. Proceedings of High-Performance Graphics 2011.

Page 3: 08-gpuSoftwareRasterLaineAndPantaleoni-BPS2011

Rationale

Flexibility of software, performance of fixed-function hardware

Build a research platform

Elbow space for game developers

Provoke hardware architects

Enable new algorithms
  Programmable ROP
  Stochastic rasterization
  Non-linear rasterization
  Non-quad derivatives
  Quad merging
  Decoupled sampling
  Compact after discard
  etc.

Page 4: 08-gpuSoftwareRasterLaineAndPantaleoni-BPS2011

Building a Pipeline

We implemented a full pixel pipeline using CUDA
  From triangle setup to ROP

Obey fundamental requirements of gfx pipe
  Maintain input order
  Hole-free rasterizer with correct rasterization rules

Make it as fast as possible!

Page 5: 08-gpuSoftwareRasterLaineAndPantaleoni-BPS2011

Design Considerations

Run everything in parallel
  We need a lot of threads to fill the machine

Minimize amount of synchronization
  Avoid excessive use of atomics

Focus on load balancing
  Graphics workloads are wild

Programmable Shading

Page 6: 08-gpuSoftwareRasterLaineAndPantaleoni-BPS2011

Pipeline Structure

Chunker-style pipeline with four stages
  Triangle setup, bin raster, coarse raster, fine raster

Run data in large batches
  Separate kernel launch for each stage

Keep data in input order all the time
  No need to sort

Page 7: 08-gpuSoftwareRasterLaineAndPantaleoni-BPS2011

Chunking to Bins and Tiles

Frame buffer

Bin: 16x16 tiles (128x128 px)

Tile: 8x8 px

Pixel
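The bin and tile sizes above imply a simple mapping from a pixel coordinate to its bin, its tile inside the bin, and its position inside the tile. The following is a minimal sketch of that arithmetic only; the struct and function names are hypothetical and are not taken from the presented code.

```cpp
#include <cassert>

// Sizes from the slide; every identifier here is hypothetical.
constexpr int kTileSizePx   = 8;                            // tile = 8x8 px
constexpr int kBinSizeTiles = 16;                           // bin  = 16x16 tiles
constexpr int kBinSizePx    = kTileSizePx * kBinSizeTiles;  // bin  = 128x128 px

struct PixelLocation {
    int binX, binY;    // bin coordinates within the frame buffer
    int tileX, tileY;  // tile coordinates within the bin (0..15)
    int pxX, pxY;      // pixel coordinates within the tile (0..7)
};

// Splits a frame-buffer pixel coordinate into bin, tile, and pixel parts.
inline PixelLocation locatePixel(int x, int y) {
    assert(x >= 0 && y >= 0);
    PixelLocation loc;
    loc.binX  = x / kBinSizePx;
    loc.binY  = y / kBinSizePx;
    loc.tileX = (x % kBinSizePx) / kTileSizePx;
    loc.tileY = (y % kBinSizePx) / kTileSizePx;
    loc.pxX   = x % kTileSizePx;
    loc.pxY   = y % kTileSizePx;
    return loc;
}
```

For example, pixel (300, 9) falls in bin (2, 0), tile (5, 1) of that bin, and pixel (4, 1) of that tile.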

Page 8: 08-gpuSoftwareRasterLaineAndPantaleoni-BPS2011

Triangle Setup

Input: vertex buffer (positions, attributes) and index buffer

Stage: triangle setup

Output: triangle data buffer (edge eqs., u/v pleqs (plane equations), zmin, etc.)
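As an illustration of what a setup stage might compute per triangle, here is a minimal sketch that derives three edge equations and zmin from screen-space vertices. It is an assumption-laden simplification: the struct layout and function names are hypothetical, vertices are reordered so the interior is always on the positive side (a real pipeline would more likely use winding for culling), and fill rules and attribute plane equations are left out.

```cpp
#include <algorithm>
#include <cassert>

struct Vec3 { float x, y, z; };   // screen-space position, z = depth
struct Edge { float a, b, c; };   // E(x, y) = a*x + b*y + c

// Hypothetical per-triangle record; the real triangle data buffer holds more.
struct TriangleData {
    Edge  edges[3];
    float zmin;
    bool  valid;   // false for zero-area triangles
};

// Edge function that is zero on the line p->q and positive to its left
// (in a y-up coordinate system).
inline Edge makeEdge(const Vec3& p, const Vec3& q) {
    return { p.y - q.y, q.x - p.x, p.x * q.y - p.y * q.x };
}

inline TriangleData setupTriangle(Vec3 v0, Vec3 v1, Vec3 v2) {
    TriangleData t{};
    // Twice the signed area; the sign gives the winding.
    float area2 = (v1.x - v0.x) * (v2.y - v0.y) - (v2.x - v0.x) * (v1.y - v0.y);
    t.valid = (area2 != 0.0f);
    if (!t.valid) return t;
    if (area2 < 0.0f) std::swap(v1, v2);  // make the interior the positive side
    t.edges[0] = makeEdge(v0, v1);
    t.edges[1] = makeEdge(v1, v2);
    t.edges[2] = makeEdge(v2, v0);
    t.zmin = std::min(v0.z, std::min(v1.z, v2.z));
    return t;
}

// A point is covered when all three edge functions are non-negative.
inline bool insideTriangle(const TriangleData& t, float x, float y) {
    if (!t.valid) return false;
    for (const Edge& e : t.edges)
        if (e.a * x + e.b * y + e.c < 0.0f) return false;
    return true;
}
```

Storing the equations once per triangle means later stages only evaluate `a*x + b*y + c` rather than re-reading vertices.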

Page 9: 08-gpuSoftwareRasterLaineAndPantaleoni-BPS2011

Bin Raster

Input: triangle data buffer

Stage: bin raster, running on SM 0, SM 1, ..., SM 14

Output: IDs of triangles that overlap bin
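A serial sketch of the binning idea follows: each triangle ID is appended to the list of every bin its bounding box touches, and because triangles are visited in input order, every per-bin list stays in input order without sorting. This is an assumption for illustration only. The bounding-box test is merely conservative, the names are hypothetical, and the presented implementation runs in parallel across SMs rather than in a loop like this.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

constexpr float kBinSizePx = 128.0f;  // 16x16 tiles of 8x8 px, as on the slides

struct Bounds { float minX, minY, maxX, maxY; };  // triangle bounding box in pixels

// Appends triId to the list of every bin that the bounding box touches.
// binLists must hold binsX * binsY lists, indexed as binY * binsX + binX.
inline void binTriangle(int triId, const Bounds& b, int binsX, int binsY,
                        std::vector<std::vector<int>>& binLists) {
    assert(static_cast<int>(binLists.size()) == binsX * binsY);
    // Clamp the touched bin range to the frame buffer; an off-screen box
    // yields an empty range and the loops below do nothing.
    int x0 = std::max(static_cast<int>(std::floor(b.minX / kBinSizePx)), 0);
    int y0 = std::max(static_cast<int>(std::floor(b.minY / kBinSizePx)), 0);
    int x1 = std::min(static_cast<int>(std::floor(b.maxX / kBinSizePx)), binsX - 1);
    int y1 = std::min(static_cast<int>(std::floor(b.maxY / kBinSizePx)), binsY - 1);
    for (int y = y0; y <= y1; ++y)
        for (int x = x0; x <= x1; ++x)
            binLists[y * binsX + x].push_back(triId);
}
```

Calling `binTriangle` for triangles 0, 1, 2, ... in sequence produces per-bin ID lists that a coarse raster stage could then refine to per-tile lists.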

Page 10: 08-gpuSoftwareRasterLaineAndPantaleoni-BPS2011

Coarse Raster

Stage: coarse raster, running on SM n

Output: IDs of triangles that overlap tile

One coarse raster SM has exclusive access to the bin it's processing

Page 11: 08-gpuSoftwareRasterLaineAndPantaleoni-BPS2011

Fine Raster

Input: IDs of triangles that overlap tile

Stage: fine raster, running on warp n

Output: pixel data in FB

One fine raster warp has exclusive access to the tile it's processing

Read tile once from DRAM to shared memory

Write tile once to DRAM

Page 12: 08-gpuSoftwareRasterLaineAndPantaleoni-BPS2011

Tidbit 1: Coverage Calculation

Step along edge (Bresenham-like)

Use look-up tables to generate coverage masks

~50 instructions for 8x8 stamp, one edge
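To make the notion of a coverage mask concrete, the sketch below builds a 64-bit mask for an 8x8 stamp, one bit per pixel, by evaluating an edge function at every pixel center. This is deliberately not the Bresenham-like stepping with look-up tables described above; it is a brute-force reference that such an optimized routine would be expected to agree with, up to fill rules, which are ignored here. All names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

struct Edge { float a, b, c; };  // E(x, y) = a*x + b*y + c, inside when E >= 0

// Brute-force coverage of one edge over an 8x8 stamp whose lower-left pixel
// is (tileX, tileY). Bit (y * 8 + x) is set when the pixel center is inside.
inline std::uint64_t edgeCoverage8x8(const Edge& e, int tileX, int tileY) {
    std::uint64_t mask = 0;
    for (int y = 0; y < 8; ++y) {
        for (int x = 0; x < 8; ++x) {
            float px = static_cast<float>(tileX + x) + 0.5f;
            float py = static_cast<float>(tileY + y) + 0.5f;
            if (e.a * px + e.b * py + e.c >= 0.0f)
                mask |= std::uint64_t(1) << (y * 8 + x);
        }
    }
    return mask;
}

// The covered pixels of a triangle are those inside all three edges,
// so the per-edge masks are simply ANDed together.
inline std::uint64_t triangleCoverage8x8(const Edge (&edges)[3], int tileX, int tileY) {
    return edgeCoverage8x8(edges[0], tileX, tileY) &
           edgeCoverage8x8(edges[1], tileX, tileY) &
           edgeCoverage8x8(edges[2], tileX, tileY);
}
```

Representing coverage as one 64-bit word per edge is what makes the final combination a pair of AND operations, and a population count of the result gives the number of fragments.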

Page 13: 08-gpuSoftwareRasterLaineAndPantaleoni-BPS2011

Tidbit 2: Fragment Distribution

In input phase, calculate coverage and store in list

In shading phase, detect triangle changes and calculate triangle index and fragment in triangle

[Figure: input phase and shading phase]
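One way to picture the two phases is as a prefix sum followed by a search: the input phase records how many fragments each triangle contributes, and in the shading phase each fragment slot looks up which triangle it belongs to and its index within that triangle. The sketch below is a serial stand-in under that assumption. The names are hypothetical, and the presented implementation performs this distribution cooperatively within a warp rather than with a binary search.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Counts set bits in a coverage mask (one bit per covered pixel).
inline int popcount64(std::uint64_t m) {
    int n = 0;
    while (m) { m &= m - 1; ++n; }
    return n;
}

struct FragmentRef {
    int triangle;            // index into the coverage list
    int fragmentInTriangle;  // 0-based index among that triangle's fragments
};

// Input phase: exclusive prefix sum over per-triangle fragment counts.
// offsets[i] is the first fragment slot of triangle i; offsets.back() is the total.
inline std::vector<int> fragmentOffsets(const std::vector<std::uint64_t>& coverage) {
    std::vector<int> offsets(coverage.size() + 1, 0);
    for (std::size_t i = 0; i < coverage.size(); ++i)
        offsets[i + 1] = offsets[i] + popcount64(coverage[i]);
    return offsets;
}

// Shading phase: map a fragment slot back to (triangle, fragment in triangle).
// Triangles with empty coverage are skipped automatically.
inline FragmentRef locateFragment(const std::vector<int>& offsets, int slot) {
    assert(slot >= 0 && slot < offsets.back());
    auto it = std::upper_bound(offsets.begin(), offsets.end(), slot);
    int tri = static_cast<int>(it - offsets.begin()) - 1;
    return { tri, slot - offsets[tri] };
}
```

With this layout every shading slot does useful work regardless of how unevenly fragments are spread across triangles.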

Page 14: 08-gpuSoftwareRasterLaineAndPantaleoni-BPS2011

Test Scenes

Call of Juarez scene courtesy of Techland

S.T.A.L.K.E.R.: Call of Pripyat scene courtesy of GSC Game World

Page 15: 08-gpuSoftwareRasterLaineAndPantaleoni-BPS2011

Performance Results

Frame rendering time in ms (depth test + color, no MSAA, no blending)

Page 16: 08-gpuSoftwareRasterLaineAndPantaleoni-BPS2011

Comparison to Hardware (1/3)

– Resolution

Cannot match hardware in raster, z kill + compact

Currently support max 2K x 2K frame buffer, 4 subpixel bits

– Attributes

Fetched when used, hence bad latency hiding

Expensive interpolation

– Antialiasing

Hardware nearly oblivious to MSAA, we much less so

Page 17: 08-gpuSoftwareRasterLaineAndPantaleoni-BPS2011

Comparison to Hardware (2/3)

– Memory usage, buffering through DRAM

Performance implications of reduced buffering unknown

Streaming through on-chip memory would be much better

+ Shader complexity

Shader performance theoretically the same as in graphics pipe

+ Frame buffer bandwidth

Each pixel touched only once in DRAM

Page 18: 08-gpuSoftwareRasterLaineAndPantaleoni-BPS2011

Comparison to Hardware (3/3)

+ Extensibility

Need one stage to do something extra?

Need a new stage altogether?

You can actually implement it

+ Specialization to individual applications

Rip out what you don’t need, hard-code what you can

Page 19: 08-gpuSoftwareRasterLaineAndPantaleoni-BPS2011

Exploration Potential

Shader performance boosters
  Compact after discard, quad merging, decoupled sampling, …

Things to do with programmable ROP
  A-buffering, order-independent transparency, …

Stochastic rasterization

Non-linear rasterization

(Your idea here)

Page 20: 08-gpuSoftwareRasterLaineAndPantaleoni-BPS2011

The Code is Out There

http://code.google.com/p/cudaraster/

The entire codebase is open-sourced and released

Page 21: 08-gpuSoftwareRasterLaineAndPantaleoni-BPS2011

VoxelPipe: A Programmable Pipeline for 3D Voxelization

Page 22: 08-gpuSoftwareRasterLaineAndPantaleoni-BPS2011

What is Voxelization?

Voxelization =

Finding all voxels overlapped by each triangle in a mesh

Page 23: 08-gpuSoftwareRasterLaineAndPantaleoni-BPS2011

Why shall we care?

Why is it useful?

• Shape Matching
• Collision Detection
• Fluid / Soft-body Sim
• Stress Analysis
• Level of Detail
• Ray Tracing

Page 24: 08-gpuSoftwareRasterLaineAndPantaleoni-BPS2011

Why shall we care?

Interactive Indirect Illumination and Ambient Occlusion using Voxel Cone Tracing

Cyril Crassin

(I3D 2011)

Page 25: 08-gpuSoftwareRasterLaineAndPantaleoni-BPS2011

Rationale

Building a full-featured pipeline for voxelization, analogous to OpenGL for 2D rasterization

• fully conservative and thin* rasterization
• arbitrary frame-buffer types
• many blending modes (additive, max, min, and, or, ...)
• multiple render targets
• vertex shaders
• fragment shaders

Page 26: 08-gpuSoftwareRasterLaineAndPantaleoni-BPS2011

Rationale

Extended support for rendering modes:

• conventional blending-based rasterization

• A-buffer / bucketing

Page 27: 08-gpuSoftwareRasterLaineAndPantaleoni-BPS2011

Challenges

• Previous research mostly concerned with binary output
  => no Shading, no ROP

• State-of-the-Art had poor load balancing
  => Huge performance hit for mixed triangle sizes

Page 28: 08-gpuSoftwareRasterLaineAndPantaleoni-BPS2011

What is Rasterization?

Page 29: 08-gpuSoftwareRasterLaineAndPantaleoni-BPS2011

What is Rasterization?

Vertex Shading

Highly variable expansion rate (source of most load balancing problems)

Fragment Shading

Page 30: 08-gpuSoftwareRasterLaineAndPantaleoni-BPS2011

Observations (1)

Rasterization = Sorting of Compressed Batches of Elements

triangles: 1 2 3 4 5 6 ... n

fragments: f1,1 f1,2 f1,3 | f2,1 | f3,1 f3,2 f3,3 ... f3,1000000 | f4,1 f4,2 | ... | fn,1 fn,2

Highly variable decompression rate

Page 31: 08-gpuSoftwareRasterLaineAndPantaleoni-BPS2011

Observations (2)

Decompression and Sorting can be done Hierarchically

emit per-tile fragments

sort by tile

emit per-voxel fragments

sort by voxel

blend

Page 32: 08-gpuSoftwareRasterLaineAndPantaleoni-BPS2011

Observations (2)

Decompression and Sorting can be done Hierarchically

triangles: 1 2 3 4 5 6 ... n

fragments: f1,1 f1,2 f1,3 | f2,1 | f3,1 f3,2 f3,3 ... f3,1000000 | f4,1 f4,2 | ... | fn,1 fn,2

Decompression rate is more regular

Page 33: 08-gpuSoftwareRasterLaineAndPantaleoni-BPS2011

Pipeline Overview

coarse rasterizer: one tri per thread, persistent threads
  output: (tile, tri) fragments, ordered by tri id

tile sorting: radix sort
  output: tile queues

fine rasterizer: one tri per thread, persistent threads
  output: FB tiles (1 CTA per tile)

[Figure: triangles 1 to 3 emitted as (tile, tri) fragments, sorted into per-tile queues, and rasterized into FB tiles]
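The first two stages of this overview can be sketched on the CPU as emitting one (tile, tri) pair per overlapped tile in triangle order and then sorting the pairs by tile. The sketch uses `std::stable_sort` as a stand-in for the radix sort named above; both are stable, so triangles keep their input order within each tile queue. The input format (a precomputed list of overlapped tiles per triangle) and all names are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct TileFragment {
    int tile;  // linear tile index
    int tri;   // triangle id
};

// Coarse stage stand-in: for each triangle, in input order, emit one
// (tile, tri) pair per tile it overlaps. tilesPerTriangle[i] is assumed to
// already list the tiles overlapped by triangle i.
inline std::vector<TileFragment> emitTileFragments(
        const std::vector<std::vector<int>>& tilesPerTriangle) {
    std::vector<TileFragment> out;
    for (std::size_t tri = 0; tri < tilesPerTriangle.size(); ++tri)
        for (int tile : tilesPerTriangle[tri])
            out.push_back({ tile, static_cast<int>(tri) });
    return out;
}

// Tile sorting stand-in: a stable sort keyed on the tile index groups the
// pairs into per-tile queues while preserving triangle order inside each.
inline void sortByTile(std::vector<TileFragment>& frags) {
    std::stable_sort(frags.begin(), frags.end(),
                     [](const TileFragment& a, const TileFragment& b) {
                         return a.tile < b.tile;
                     });
}
```

After sorting, each contiguous run of equal `tile` values is one tile queue that a fine rasterizer could consume independently of the others.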

Page 34: 08-gpuSoftwareRasterLaineAndPantaleoni-BPS2011

Programmable Shading

Simple C++ classes:

struct MyShader
{
    // T can be any of the supported types!
    T eval(const Fragment frag) const;

private:
    ...
};
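To show how a shader class of this shape might plug into a blending stage, here is a minimal host-side sketch. It assumes a hypothetical `Fragment` layout, a hypothetical "max" blending functor, and a dense voxel grid stored as a flat array; none of these are taken from the VoxelPipe API, and only the `eval` method mirrors the slide.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical fragment layout: voxel coordinates plus the source triangle.
struct Fragment { int x, y, z; int triId; };

// A shader in the style of the slide: a single eval() method returning the
// frame-buffer value type (float in this example).
struct TriangleIdShader {
    float eval(const Fragment& frag) const { return static_cast<float>(frag.triId); }
};

// Hypothetical blending functor implementing the "max" mode.
struct MaxBlend {
    template <class T>
    T operator()(const T& dst, const T& src) const { return std::max(dst, src); }
};

// Shades every fragment and blends the result into a dense voxel grid of
// dimX * dimY * dimZ values stored in x-major order.
template <class T, class Shader, class Blend>
void shadeAndBlend(const std::vector<Fragment>& frags, const Shader& shader,
                   const Blend& blend, int dimX, int dimY, std::vector<T>& voxels) {
    for (const Fragment& f : frags) {
        std::size_t idx = static_cast<std::size_t>(f.x + dimX * (f.y + dimY * f.z));
        assert(idx < voxels.size());
        voxels[idx] = blend(voxels[idx], shader.eval(f));
    }
}
```

Because the shader and blend types are template parameters, swapping the value type or the blending mode does not require touching the loop.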

Page 35: 08-gpuSoftwareRasterLaineAndPantaleoni-BPS2011

Performance Results

Page 36: 08-gpuSoftwareRasterLaineAndPantaleoni-BPS2011

Performance Results

150 – 300 M tris/s

Page 37: 08-gpuSoftwareRasterLaineAndPantaleoni-BPS2011

Example Application: Real-Time GI

Page 38: 08-gpuSoftwareRasterLaineAndPantaleoni-BPS2011

Future Work

• Sparse Octrees

• Tessellation / Geometry Shaders

• Programmable ROP

Page 39: 08-gpuSoftwareRasterLaineAndPantaleoni-BPS2011

Future Work

http://code.google.com/p/voxelpipe/

The entire codebase will be open-sourced

Page 40: 08-gpuSoftwareRasterLaineAndPantaleoni-BPS2011

Thank You

Questions

Further information:

Laine, Karras: High-Performance Software Rasterization on GPUs. Proceedings of High-Performance Graphics 2011.

Pantaleoni: VoxelPipe: A Programmable Pipeline for 3D Voxelization. Proceedings of High-Performance Graphics 2011.