A Hardware Processing Unit For Point Sets
S. Heinzle, G. Guennebaud, M. Botsch, M. Gross
Graphics Hardware 2008
Motivation
• Point-based graphics established
• Powerful algorithms
  – Representation
  – Processing
  – Manipulation
  – Rendering
• Decomposition
  – Get neighborhood
  – Operate on neighbors
Motivation
• GPUs not suited for getting neighborhood
  – SIMD – Incoherent branching
  – Dynamic data structures slow
  – Recursive calls not supported
• CPUs
  – Small number of FPUs
  – Inflexible memory caches
Courtesy of NVIDIA
Courtesy of Intel
Contributions
• Hardware architecture for point sets
  – Neighbor search module
  – Novel advanced caching mechanism
  – Reconfigurable processing module
  – Programmability using FPGA compiler
• FPGA prototype and measurements
• Small & Lean
Integration into multi-core CPU/GPU possible
Outline
• Related Work
• Spatial Searching and Caching
• Architecture and Prototype
• Results
• Conclusion
Related Work
Kd-Tree [Bentley 75]
kNN on GPUs [Ma and McCool 02]
Kd-Tree Hardware [Woop et al. 05] [Woop et al. 06]
Kd-Tree on GPUs [Popov et al. 07]
Related Work
Adaptive SPH Fluid Simulation [Adams et al. 07]
Linear Moving Least Squares [Adamson and Alexa 04]
Algebraic Moving Least Squares [Guennebaud and Gross 07]
Linear Moving Least Squares
• Implicit surface defined by a set of points
• Iterative projections onto plane
• Surface defined by points projecting onto themselves
[Figure: query point x, sample points pi with normals ni, and successive projections x', x'', x''']
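The iterative projection can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the Gaussian weight, the bandwidth `h`, and the names `mls_project`, `grid`, and `normals` are assumptions made for the example. The plane is taken through the weighted centroid of the points, with the weighted average of the normals as its normal.

```python
import math

def mls_project(x, points, normals, h=1.0, iterations=10, tol=1e-9):
    """Iteratively project x onto a plane fitted to weighted neighbors."""
    x = list(x)
    for _ in range(iterations):
        # Gaussian weights centered at the current estimate (assumed kernel).
        w = [math.exp(-sum((a - b) ** 2 for a, b in zip(x, p)) / (h * h))
             for p in points]
        total = sum(w)
        # Weighted centroid and weighted average normal define the plane.
        c = [sum(wi * p[d] for wi, p in zip(w, points)) / total
             for d in range(3)]
        n = [sum(wi * nrm[d] for wi, nrm in zip(w, normals)) for d in range(3)]
        length = math.sqrt(sum(v * v for v in n))
        n = [v / length for v in n]
        # Signed distance of x to the plane, then step onto the plane.
        dist = sum((x[d] - c[d]) * n[d] for d in range(3))
        x = [x[d] - dist * n[d] for d in range(3)]
        if abs(dist) < tol:
            break  # x projects onto itself: it lies on the surface
    return x

# Hypothetical input: samples of the plane z = 0 with upward normals.
grid = [(i * 0.5, j * 0.5, 0.0) for i in range(-2, 3) for j in range(-2, 3)]
normals = [(0.0, 0.0, 1.0)] * len(grid)
projected = mls_project((0.3, 0.2, 0.5), grid, normals)
```

Every iteration needs the neighbors of the current estimate, which is why the neighbor search dominates the cost of this operator.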
Outline
• Related Work
• Spatial Searching and Caching
• Architecture & Prototype
• Results
• Conclusion
Spatial Search
• Spatial search: kNN and NN
  – Common in most point operations
  – Based on kd-tree
• Example NN: [figure]
Spatial Search
• kNN search similar to NN search:
  – Start with infinite radius
  – Sort leaf points into priority queue
  – Shrink radius with every point sorted
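The three steps above can be sketched as a software kd-tree search. This is an illustrative Python version under stated assumptions: the tuple-based tree layout, the `leaf_size` parameter, and the function names are invented for the example, and `heapq` with negated distances stands in for the priority queue.

```python
import heapq
import random

def build_kdtree(points, depth=0, leaf_size=4):
    """Build a kd-tree as nested tuples; leaves hold small point buckets."""
    if len(points) <= leaf_size:
        return ("leaf", points)
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return ("node", axis, points[mid][axis],
            build_kdtree(points[:mid], depth + 1, leaf_size),
            build_kdtree(points[mid:], depth + 1, leaf_size))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn(tree, query, k):
    """Return the k nearest points to query, nearest first."""
    heap = []  # max-heap via negated squared distances, at most k entries

    def radius_sq():
        # Infinite until k candidates are known, then the k-th best distance.
        return float("inf") if len(heap) < k else -heap[0][0]

    def visit(node):
        if node[0] == "leaf":
            for p in node[1]:
                d = sq_dist(p, query)
                if d < radius_sq():
                    heapq.heappush(heap, (-d, p))
                    if len(heap) > k:
                        heapq.heappop(heap)  # drop the farthest: radius shrinks
            return
        _, axis, split, left, right = node
        diff = query[axis] - split
        near, far = (left, right) if diff < 0 else (right, left)
        visit(near)
        if diff * diff < radius_sq():  # far side may still hold closer points
            visit(far)

    visit(tree)
    return [p for _, p in sorted(heap, reverse=True)]

random.seed(0)
pts = [(random.random(), random.random(), random.random()) for _ in range(200)]
tree = build_kdtree(pts)
q = (0.5, 0.5, 0.5)
result = knn(tree, q, 8)
```

The recursion in `visit` is the part that the slides describe as poorly supported on GPUs.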
Coherent Neighbor Cache (NN)
• Find neighbors in slightly bigger radius
• Re-use result for spatially close query
• Re-use if [formula shown on slide]
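A sketch of the idea for fixed-radius queries, in Python. The re-use test below is derived from the triangle inequality and is an assumption of this example, not the formula from the slide: if the cached search used radius (1 + alpha) * r around a center c, every point within r of a new query q is already in the cached set whenever |q - c| <= alpha * r. The class name, `alpha`, and the brute-force search standing in for the kd-tree are hypothetical.

```python
import math

class BallQueryCache:
    """Caches one ball query made with an enlarged radius (1 + alpha) * r."""

    def __init__(self, points, alpha=0.5):
        self.points = points
        self.alpha = alpha
        self.center = None
        self.radius = None
        self.candidates = []
        self.hits = 0
        self.misses = 0

    def query(self, q, r):
        reusable = (self.center is not None and self.radius == r
                    and math.dist(q, self.center) <= self.alpha * r)
        if reusable:
            self.hits += 1
        else:
            # Full search (brute force here, standing in for the kd-tree).
            self.misses += 1
            big = (1.0 + self.alpha) * r
            self.center, self.radius = q, r
            self.candidates = [p for p in self.points
                               if math.dist(p, q) <= big]
        # Filter the cached superset down to the exact ball around q.
        return [p for p in self.candidates if math.dist(p, q) <= r]

pts = [(x * 0.25, y * 0.25) for x in range(-12, 13) for y in range(-12, 13)]
cache = BallQueryCache(pts, alpha=0.5)
first = cache.query((0.0, 0.0), 1.0)    # miss: runs the full search
second = cache.query((0.1, 0.05), 1.0)  # hit: answered from the cached set
```

A larger `alpha` raises the hit rate but makes each cached set bigger, so every hit has more candidates to filter.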
Coherent Neighbor Cache (kNN, exact)
• Find (k+1) neighbors
• Re-use result for spatially close query
• Re-use if [formula shown on slide]
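For exact kNN, one sufficient re-use test follows from the triangle inequality; it is an assumption of this sketch rather than the slide's formula. With d_k and d_(k+1) the distances of the k-th and (k+1)-th neighbor of the cached query c, a new query q has the same k nearest neighbors if |q - c| <= (d_(k+1) - d_k) / 2. The names `KnnCache` and `knn_brute` are hypothetical, and the brute-force search stands in for the kd-tree.

```python
import math

def knn_brute(points, q, k):
    """Reference search, standing in for the kd-tree traversal."""
    return sorted(points, key=lambda p: math.dist(p, q))[:k]

class KnnCache:
    def __init__(self, points, k):
        self.points, self.k = points, k
        self.center = None
        self.neighbors = []  # k + 1 nearest neighbors of self.center
        self.slack = 0.0     # (d_(k+1) - d_k) / 2
        self.hits = 0

    def query(self, q):
        if self.center is not None and math.dist(q, self.center) <= self.slack:
            self.hits += 1
        else:
            self.center = q
            self.neighbors = knn_brute(self.points, q, self.k + 1)
            d_k = math.dist(self.neighbors[self.k - 1], q)
            d_k1 = math.dist(self.neighbors[self.k], q)
            self.slack = 0.5 * (d_k1 - d_k)
        # Same k-set as an exact search; re-sort by distance to q.
        return sorted(self.neighbors[:self.k], key=lambda p: math.dist(p, q))

pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (-0.1, 0.0),
       (3.0, 0.0), (0.0, 3.0), (3.0, 3.0)]
cache = KnnCache(pts, k=4)
first = cache.query((0.0, 0.0))   # full search, slack = (3.0 - 0.1) / 2
second = cache.query((0.5, 0.5))  # within the slack: re-used
third = cache.query((2.5, 0.0))   # too far away: full search again
```

The slack shrinks to zero when the k-th and (k+1)-th neighbors are equally far away, so a cache of this kind rarely hits on uniformly dense data.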
Coherent Neighbor Cache (kNN, approximation)
• Approximation error
  – Enlarge radius
• Re-use if [formula shown on slide]
Outline
• Related Work
• Spatial Searching and Caching
• Architecture & Prototype
• Results
• Conclusion
The Architecture
[Figure: architecture block diagram; label: Host]
Coherent Neighbor Cache
• Eight cached neighborhoods
• Problem: parallel queries in kd-tree module
Interleave spatially similar queries
Kd-Tree Traversal
Node Recurse
• Kd-tree structure on chip
• 16 threads
• Pipelining and multi-threading
Stacks
• 16 stacks
• Parallel read/write
• Bounded in depth
• 6 bytes per thread per recursion
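A software analogue of traversal with a bounded stack, as a hedged Python sketch. The recursion is replaced by an explicit, preallocated stack of fixed capacity. The capacity `max_depth`, the tuple node layout, and the function names are assumptions of this example, and the stack entries here are Python tuples, not the 6-byte records mentioned on the slide.

```python
import random

def build(points, depth=0):
    """Balanced kd-tree; each node is (point, axis, left, right) or None."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid], axis,
            build(points[:mid], depth + 1),
            build(points[mid + 1:], depth + 1))

def nearest(root, query, max_depth=32):
    """Nearest neighbor search driven by a fixed-capacity stack."""
    stack = [None] * max_depth  # preallocated: (node, squared plane distance)
    top = 0
    stack[top] = (root, 0.0)
    top += 1
    best, best_d = None, float("inf")
    while top > 0:
        top -= 1
        node, plane_d = stack[top]
        if node is None or plane_d >= best_d:
            continue  # empty subtree, or it cannot contain a closer point
        point, axis, left, right = node
        d = sum((a - b) ** 2 for a, b in zip(point, query))
        if d < best_d:
            best, best_d = point, d
        diff = query[axis] - point[axis]
        near, far = (left, right) if diff < 0 else (right, left)
        # Push far first so the near child is popped (visited) first.
        stack[top] = (far, diff * diff)
        stack[top + 1] = (near, 0.0)
        top += 2
    return best

random.seed(1)
pts = [(random.random(), random.random(), random.random()) for _ in range(200)]
root = build(pts)
q = (0.2, 0.7, 0.4)
found = nearest(root, q)
```

Each popped entry is replaced by at most two new ones, so the stack never holds more than the tree depth plus a small constant. A balanced tree therefore allows a fixed capacity to be chosen in advance.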
Leaf
• 16 parallel priority queues (1-cycle ops)
• Queues store pointers and distances
• Bandwidth bottleneck
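One way a queue of pointers and distances could be organized is as a sorted array of fixed capacity, where an insert shifts the larger entries by one slot. The Python sketch below illustrates that idea only; the sorted-array layout and the class name `BoundedPriorityQueue` are assumptions, and the loop performs sequentially what hardware could do for all slots at once.

```python
class BoundedPriorityQueue:
    """Keeps the k smallest (distance, index) pairs in sorted order."""

    def __init__(self, k):
        self.k = k
        self.entries = []  # sorted ascending by distance, length <= k

    def radius(self):
        """Current search radius: infinite until the queue is full."""
        if len(self.entries) == self.k:
            return self.entries[-1][0]
        return float("inf")

    def insert(self, distance, index):
        if distance >= self.radius():
            return False  # rejected: not closer than the current k-th entry
        # Open a slot at the end, then shift larger entries to the right.
        pos = len(self.entries)
        self.entries.append((distance, index))
        while pos > 0 and self.entries[pos - 1][0] > distance:
            self.entries[pos] = self.entries[pos - 1]
            pos -= 1
        self.entries[pos] = (distance, index)
        if len(self.entries) > self.k:
            self.entries.pop()  # drop the farthest entry
        return True

queue = BoundedPriorityQueue(k=3)
for index, distance in enumerate([5.0, 1.0, 4.0, 2.0, 3.0]):
    queue.insert(distance, index)
```

Storing an index instead of the point itself mirrors the slide's "pointers and distances": the point data is fetched only for the final neighbors.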
Processing Module
• Multithreaded quad-port bank of 16 registers
• 128 threads
• Programmability using FPGA technology
Further Data
• Implemented on two FPGAs
  – 64 bit DDR DRAM
  – Interconnection: no overhead
• Resource usage regs and LUTs
  – Virtex 2 Pro 100 (kNN): 26% registers, 38% LUTs
  – Virtex 2 Pro 70 (MLS): 47% registers, 52% LUTs
• Clock frequency: 75 MHz
Outline
• Related Work
• Spatial Searching and Caching
• Architecture & Prototype
• Results
• Conclusion
Applications
• Tested on various applications
• PCI interface of prototype slow
[Weyrich et al. 04]
[Adams et al. 07]
Results kNN
[Figure: axes Number of Neighbors and Number of queries; clock frequencies 75 MHz, 1200 MHz, 2200 MHz]
CUDA: x4, CPU: x1.5, FPGA: x1
CUDA: x2.4, CPU: x1.4, FPGA: x1, CUDA w/o sort: x4.0
CUDA: x1.6, CPU: x1.1, FPGA: x1, CUDA w/o sort: x3.1
ASIC estimate, 500 MHz: x6.6
• Small hardware footprint
• FPGA slightly slower
• Realistic clock frequency
Prototype faster than CPU/GPU
Results MLS
[Figure: axes Number of Neighbors and Number of queries; clock frequencies 75 MHz, 1200 MHz, 2200 MHz]
FPGA: x1
MLS CPU: x0.4
MLS CUDA: x3.8
• FPGA faster than CPU
• kNN bottleneck
  – FPGA
  – GPU
Coherent Neighbor Cache
[Figure: axes Level of coherence and Number of queries; curves: CPU,=0.1; FPGA, exact; FPGA,=0.1]
Results Approximation Error (MLS projection)
[Figure: MLS Error against approximation; reference: no approx.]
[Figure: Cache hits against approximation]
Approximation Error (visual)
Coherent Neighbor Cache:
• Not optimal for exact queries
• Approximate queries
  – Can be tolerated in most cases
  – Greatly increases performance
  – Even for small approximations
Outline
• Related Work
• Spatial Searching and Caching
• Architecture & Prototype
• Results
• Conclusion
Conclusion
• Novel hardware architecture for
  – Nearest-neighbor searches
  – Generic meshless processing operators
• Cache exploiting spatial coherence
• Good performance considering resources
• Possible GPU integration
Future Work
• Programmable data structure
  – Support different data structures
  – Programmability in data structure
  – Construction on-chip
• ‘Real’ programmability in point processing module