Software Rasterization on GPUs
Samuli Laine Jacopo Pantaleoni
NVIDIA Research
Rasterization
Laine, Karras: High-Performance Software Rasterization on GPUs. Proceedings of High-Performance Graphics 2011.
Voxelization
Pantaleoni: VoxelPipe: A Programmable Pipeline for 3D Voxelization. Proceedings of High-Performance Graphics 2011.
Outline
Build a research platform
Elbow space for game developers
Enable new algorithms
Provoke hardware architects
Flexibility of software, performance of fixed-function hardware
Rationale
Programmable ROP
Stochastic rasterization
Non-linear rasterization
Non-quad derivatives
Quad merging
Decoupled sampling
Compact after discard
etc.
We implemented a full pixel pipeline using CUDA
From triangle setup to ROP
Obey fundamental requirements of gfx pipe
Maintain input order
Hole-free rasterizer with correct rasterization rules
Make it as fast as possible!
Building a Pipeline
Run everything in parallel
We need a lot of threads to fill the machine
Minimize amount of synchronization
Avoid excessive use of atomics
Focus on load balancing
Graphics workloads are wild
Programmable Shading
Design Considerations
Chunker-style pipeline with four stages
Triangle setup → Bin raster → Coarse raster → Fine raster
Run data in large batches
Separate kernel launch for each stage
Keep data in input order all the time
No need to sort
Pipeline Structure
Chunking to Bins and Tiles
Frame buffer → Bin → Tile → Pixel
Bin: 16x16 tiles (128x128 px)
Tile: 8x8 px
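As an illustration of the chunking above, here is a minimal C++ sketch that maps a pixel coordinate to its bin, tile, and position within the tile. Only the sizes (8x8 px tiles, 16x16 tiles per bin) come from the slide; `PixelLocation` and `locatePixel` are hypothetical names, not part of the released code.

```cpp
#include <cassert>

// Sizes from the slide: 8x8 px tiles, 16x16 tiles per bin (128x128 px bins).
constexpr int kTileSizePx   = 8;
constexpr int kBinSizeTiles = 16;
constexpr int kBinSizePx    = kTileSizePx * kBinSizeTiles; // 128

// Hypothetical helper type: where a pixel lands in the hierarchy.
struct PixelLocation {
    int binX, binY;     // bin index within the frame buffer
    int tileX, tileY;   // tile index within the bin (0..15)
    int pixelX, pixelY; // pixel index within the tile (0..7)
};

PixelLocation locatePixel(int x, int y) {
    assert(x >= 0 && y >= 0);
    PixelLocation loc;
    loc.binX   = x / kBinSizePx;
    loc.binY   = y / kBinSizePx;
    loc.tileX  = (x % kBinSizePx) / kTileSizePx;
    loc.tileY  = (y % kBinSizePx) / kTileSizePx;
    loc.pixelX = x % kTileSizePx;
    loc.pixelY = y % kTileSizePx;
    return loc;
}
```

For example, pixel (300, 130) lands in bin (2, 1), tile (5, 0) of that bin, at pixel (4, 2) of that tile.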
Triangle Setup
Input: vertex buffer (positions, attributes), index buffer
Output: triangle data buffer (edge eqs., u/v pleqs, zmin, etc.)
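To make the "edge eqs." part of triangle setup concrete, the following is a rough sketch of computing three integer edge functions from screen-space vertices. It assumes vertices already snapped to integer coordinates and a triangle with positive signed area. It leaves out fill rules, plane equations, and zmin, all of which the real pipeline needs. All type and function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

struct Vec2i { int32_t x, y; };

// Edge function E(x, y) = a*x + b*y + c. For the directed edge p -> q it
// equals the cross product (q - p) x ((x, y) - p).
struct EdgeEq { int64_t a, b, c; };

struct TriangleSetup { EdgeEq edges[3]; };

EdgeEq makeEdge(Vec2i p, Vec2i q) {
    EdgeEq e;
    e.a = static_cast<int64_t>(p.y) - q.y;
    e.b = static_cast<int64_t>(q.x) - p.x;
    e.c = static_cast<int64_t>(p.x) * q.y - static_cast<int64_t>(p.y) * q.x;
    return e;
}

int64_t evalEdge(const EdgeEq& e, int32_t x, int32_t y) {
    return e.a * x + e.b * y + e.c;
}

// Assumption: the triangle has positive signed area, so that all three
// edge functions are non-negative for points inside it.
TriangleSetup setupTriangle(Vec2i v0, Vec2i v1, Vec2i v2) {
    TriangleSetup t;
    t.edges[0] = makeEdge(v0, v1);
    t.edges[1] = makeEdge(v1, v2);
    t.edges[2] = makeEdge(v2, v0);
    assert(evalEdge(t.edges[0], v2.x, v2.y) > 0); // twice the signed area
    return t;
}

// Simplified inside test: points exactly on an edge count as inside,
// which is NOT a correct fill rule for shared edges.
bool insideTriangle(const TriangleSetup& t, int32_t x, int32_t y) {
    for (const EdgeEq& e : t.edges)
        if (evalEdge(e, x, y) < 0) return false;
    return true;
}
```

A hole-free rasterizer with correct rasterization rules would add a tie-breaking rule for points that lie exactly on an edge; the sketch above deliberately omits it.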
Bin Raster
Input: triangle data buffer
Bin raster runs on every SM (SM 0, SM 1, ..., SM 14)
Output: IDs of triangles that overlap bin
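A minimal sketch of the binning idea, assuming a bounding-box test rather than whatever exact overlap test the actual bin rasterizer uses: every 128x128 px bin touched by the triangle's clipped bounding box is reported. This is conservative (it may list bins the triangle does not actually cover). The function and type names are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

constexpr int kBinSizePx = 128; // bin size from the slides

struct Vec2i { int x, y; };
struct BinCoord { int x, y; };

// Returns the bins overlapped by the triangle's bounding box, clipped to
// the frame buffer. Bounding-box binning is an assumption of this sketch.
std::vector<BinCoord> binsOverlappingBBox(Vec2i v0, Vec2i v1, Vec2i v2,
                                          int fbWidth, int fbHeight) {
    assert(fbWidth > 0 && fbHeight > 0);
    int minX = std::max(std::min({v0.x, v1.x, v2.x}), 0);
    int maxX = std::min(std::max({v0.x, v1.x, v2.x}), fbWidth - 1);
    int minY = std::max(std::min({v0.y, v1.y, v2.y}), 0);
    int maxY = std::min(std::max({v0.y, v1.y, v2.y}), fbHeight - 1);

    std::vector<BinCoord> bins;
    if (minX > maxX || minY > maxY) return bins; // entirely off-screen

    for (int by = minY / kBinSizePx; by <= maxY / kBinSizePx; ++by)
        for (int bx = minX / kBinSizePx; bx <= maxX / kBinSizePx; ++bx)
            bins.push_back({bx, by});
    return bins;
}
```

In the pipeline described here, the per-bin output would be a list of triangle IDs kept in input order; the sketch only shows the geometric part.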
Coarse Raster
Output: IDs of triangles that overlap tile
One coarse raster SM has exclusive access to the bin it's processing
Fine Raster
Input: IDs of triangles that overlap tile
Output: pixel data in FB
One fine raster warp has exclusive access to the tile it's processing
Read tile once from DRAM to shared
Write tile once to DRAM
Tidbit 1: Coverage Calculation
Step along edge (Bresenham-like)
Use look-up tables to generate coverage masks
~50 instructions for 8x8 stamp, one edge
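To show what a coverage mask for an 8x8 stamp looks like, here is a small sketch that builds one 64-bit mask per edge and ANDs the three together. Note the substitution: the slide steps along the edge and uses look-up tables, whereas this sketch simply evaluates the edge function at every pixel, which is far slower but easier to follow. Sampling at integer pixel coordinates and the bit layout (bit y*8+x) are assumptions of the sketch.

```cpp
#include <cassert>
#include <cstdint>

// Edge function E(x, y) = a*x + b*y + c, non-negative on the inside.
struct EdgeEq { int64_t a, b, c; };

// Brute-force stand-in for the LUT-based method: bit (y*8 + x) is set
// when the edge function is non-negative at pixel (x0 + x, y0 + y).
uint64_t edgeCoverage8x8(const EdgeEq& e, int x0, int y0) {
    assert(x0 % 8 == 0 && y0 % 8 == 0); // stamp origin aligned to the tile grid
    uint64_t mask = 0;
    for (int y = 0; y < 8; ++y)
        for (int x = 0; x < 8; ++x)
            if (e.a * (x0 + x) + e.b * (y0 + y) + e.c >= 0)
                mask |= uint64_t(1) << (y * 8 + x);
    return mask;
}

// A pixel is covered by the triangle when all three edges cover it.
uint64_t triangleCoverage8x8(const EdgeEq (&edges)[3], int x0, int y0) {
    return edgeCoverage8x8(edges[0], x0, y0) &
           edgeCoverage8x8(edges[1], x0, y0) &
           edgeCoverage8x8(edges[2], x0, y0);
}
```

With this layout, the half-plane x >= 4 yields the mask 0xF0F0F0F0F0F0F0F0 for the stamp at the origin: the upper four bits of every row.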
Tidbit 2: Fragment Distribution
In input phase, calculate coverage and store in list
In shading phase, detect triangle changes and calculate
triangle index and fragment in triangle
[Figure: Input Phase, Shading Phase]
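The two phases can be sketched serially as follows. This is a CPU illustration under assumptions, not the warp-level scheme used on the GPU: the input phase stores one coverage mask per triangle and a running fragment count, and the shading phase maps a flat fragment index back to a triangle index and a fragment index within that triangle. All names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

int popcount64(uint64_t m) {
    int n = 0;
    while (m) { m &= m - 1; ++n; } // clear the lowest set bit each step
    return n;
}

struct FragmentRef {
    int triangle;           // index into the coverage list
    int fragmentInTriangle; // 0-based index among that triangle's fragments
    int pixelInTile;        // bit position in the 8x8 coverage mask
};

// Input phase: offsets[i] is the index of the first fragment of triangle i.
std::vector<int> fragmentOffsets(const std::vector<uint64_t>& coverage) {
    std::vector<int> offsets(coverage.size() + 1, 0);
    for (std::size_t i = 0; i < coverage.size(); ++i)
        offsets[i + 1] = offsets[i] + popcount64(coverage[i]);
    return offsets;
}

// Shading phase: find which triangle a flat fragment index belongs to.
FragmentRef locateFragment(const std::vector<uint64_t>& coverage,
                           const std::vector<int>& offsets, int fragment) {
    assert(fragment >= 0 && fragment < offsets.back());
    // Last triangle whose first-fragment offset is <= fragment; this
    // skips triangles with empty coverage.
    int tri = static_cast<int>(
        std::upper_bound(offsets.begin(), offsets.end(), fragment) -
        offsets.begin()) - 1;
    int local = fragment - offsets[tri];
    uint64_t m = coverage[tri];
    for (int i = 0; i < local; ++i) m &= m - 1; // drop `local` lowest bits
    int bit = 0;
    while (((m >> bit) & 1u) == 0) ++bit;
    return {tri, local, bit};
}
```

Because fragments are numbered in triangle order, consecutive fragment indices stay in input order, which is what the pipeline needs to maintain.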
Test Scenes
Call of Juarez scene courtesy of Techland
S.T.A.L.K.E.R.: Call of Pripyat scene courtesy of GSC Game World
Performance Results
Frame rendering time in ms (depth test + color, no MSAA, no blending)
Comparison to Hardware (1/3)
– Resolution
Cannot match hardware in raster, z kill + compact
Currently support max 2K x 2K frame buffer, 4 subpixel bits
– Attributes
Fetched when used → bad latency hiding
Expensive interpolation
– Antialiasing
Hardware nearly oblivious to MSAA, we much less so
Comparison to Hardware (2/3)
– Memory usage, buffering through DRAM
Performance implications of reduced buffering unknown
Streaming through on-chip memory would be much better
+ Shader complexity
Shader performance theoretically the same as in graphics pipe
+ Frame buffer bandwidth
Each pixel touched only once in DRAM
Comparison to Hardware (3/3)
+ Extensibility
Need one stage to do something extra?
Need a new stage altogether?
You can actually implement it
+ Specialization to individual applications
Rip out what you don’t need, hard-code what you can
Shader performance boosters
Compact after discard, quad merging, decoupled sampling, …
Things to do with programmable ROP
A-buffering, order-independent transparency, …
Stochastic rasterization
Non-linear rasterization
(Your idea here)
Exploration Potential
The Code is Out There
http://code.google.com/p/cudaraster/
The entire codebase is open-sourced and released
VoxelPipe: A Programmable Pipeline for 3D Voxelization
What is Voxelization?
Voxelization = Finding all voxels overlapped by each triangle in a mesh
Why should we care?
Why is it useful?
• Shape Matching
• Collision Detection
• Fluid / Soft-body Sim
• Stress Analysis
• Level of Detail
• Ray Tracing
Interactive Indirect Illumination and Ambient Occlusion using Voxel Cone Tracing
Cyril Crassin (I3D 2011)
Rationale
Building a full-featured pipeline for voxelization, analogous to OpenGL for 2D rasterization
• fully conservative and thin* rasterization
• arbitrary frame-buffer types
• many blending modes (additive, max, min, and, or, ...)
• multiple render targets
• vertex shaders
• fragment shaders
Extended support for rendering modes:
• conventional blending-based rasterization
• A-buffer / bucketing
Challenges
• Previous research mostly concerned with binary output
=> no shading, no ROP
• State of the art had poor load balancing
=> huge performance hit for mixed triangle sizes
What is Rasterization?
Vertex Shading → Fragment Shading
Highly variable expansion rate: source of most load balancing problems
Observations (1)
[Figure: triangles 1 ... n, each expanding into its fragments: f1,1 ... f1,3; f2,1; f3,1 ... f3,1000000; f4,1, f4,2; ...; fn,1, fn,2]
Highly variable decompression rate
Rasterization = Sorting of Compressed Batches of Elements
Observations (2)
Decompression and Sorting can be done Hierarchically
emit per-tile fragments
sort by tile
emit per-voxel fragments
sort by voxel
blend
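The first two steps of the list above (emit per-tile fragments, sort by tile) can be sketched on the CPU as below. The input format, a list of overlapped tile IDs per triangle, is a hypothetical stand-in for the coarse rasterizer, and `std::stable_sort` stands in for a GPU radix sort. A stable sort keeps triangles in input order within each tile.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One coarse fragment: triangle `tri` overlaps tile `tile`.
struct TileFragment { int tile; int tri; };

// tilesPerTriangle[i] lists the tile IDs overlapped by triangle i
// (hypothetical input; a coarse rasterizer would compute it).
std::vector<TileFragment> emitAndSortByTile(
        const std::vector<std::vector<int>>& tilesPerTriangle) {
    std::vector<TileFragment> frags;
    // "Decompression": each triangle expands into a variable number
    // of (tile, tri) fragments, emitted in triangle order.
    for (std::size_t tri = 0; tri < tilesPerTriangle.size(); ++tri) {
        for (int tile : tilesPerTriangle[tri]) {
            assert(tile >= 0);
            frags.push_back({tile, static_cast<int>(tri)});
        }
    }
    // "Sorting": group fragments by tile; stability preserves the
    // triangle order inside each tile queue.
    std::stable_sort(frags.begin(), frags.end(),
                     [](const TileFragment& a, const TileFragment& b) {
                         return a.tile < b.tile;
                     });
    return frags;
}
```

The same emit-then-sort pattern would repeat one level down, from per-tile fragments to per-voxel fragments, before blending.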
[Figure: triangles 1 ... n and their fragments, as in Observations (1)]
Decompression rate is more regular
Pipeline Overview
coarse rasterizer (one tri per thread, persistent threads) → (tile, tri) fragments by tri id
tile sorting (radix sort) → tile queues
fine rasterizer (one tri per thread, persistent threads; 1 CTA per tile) → FB tiles
Programmable Shading
simple C++ classes:
struct MyShader
{
    T eval(const Fragment frag) const;
private:
    ...
};
T can be any of the supported types!
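As a rough illustration of how a shader of this shape might be used, the sketch below defines a concrete shader with `T = float` and feeds its results into an additive blend. The `Fragment` fields, `DensityShader`, and `shadeAndBlendAdditive` are all hypothetical; the real VoxelPipe `Fragment` type and blending interface are not shown on the slide.

```cpp
#include <cassert>

// Hypothetical stand-in for the pipeline's fragment type.
struct Fragment {
    int x, y, z;  // voxel coordinates (assumed)
    float depth;  // some per-fragment value (assumed)
};

// A shader in the style of the slide: a plain struct with a const eval().
struct DensityShader {
    explicit DensityShader(float scale) : m_scale(scale) {}
    float eval(const Fragment frag) const { return m_scale * frag.depth; }
private:
    float m_scale;
};

// Illustrative "ROP": additively blends the shaded values of all
// fragments that land in one voxel.
template <typename Shader>
float shadeAndBlendAdditive(const Shader& shader,
                            const Fragment* frags, int count) {
    assert(count >= 0);
    float voxel = 0.0f;
    for (int i = 0; i < count; ++i)
        voxel += shader.eval(frags[i]);
    return voxel;
}
```

Swapping the `+=` for a max, min, and, or or would give the other blending modes listed earlier, provided the return type supports them.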
Performance Results
150 – 300 M tris/s
Example Application: Real-Time GI
Future Work
• Sparse Octrees
• Tessellation / Geometry Shaders
• Programmable ROP
http://code.google.com/p/voxelpipe/
The entire codebase will be open-sourced
Thank You
Questions
Further information:
Laine, Karras: High-Performance Software Rasterization on GPUs. Proceedings of High-Performance Graphics 2011.
Pantaleoni: VoxelPipe: A Programmable Pipeline for 3D Voxelization. Proceedings of High-Performance Graphics 2011.