A Hardware Processing Unit For Point Sets

S. Heinzle, G. Guennebaud, M. Botsch, M. Gross

Graphics Hardware 2008

Motivation

• Point-based graphics established
• Powerful algorithms
– Representation
– Processing
– Manipulation
– Rendering
• Decomposition
– Get neighborhood
– Operate on neighbors


Motivation

• GPUs not suited for getting neighborhood
– SIMD
– Incoherent branching
– Dynamic data structures slow
– Recursive calls not supported
• CPUs
– Small number of FPUs
– Inflexible memory caches


[Images courtesy of NVIDIA and Intel]

Contributions

• Hardware architecture for point sets
– Neighbor search module
– Novel advanced caching mechanism
– Reconfigurable processing module
– Programmability using FPGA compiler
• FPGA prototype and measurements
• Small & lean

Integration into multi-core CPU/GPU possible


Outline

• Related Work
• Spatial Searching and Caching
• Architecture and Prototype
• Results
• Conclusion


Related Work

Kd-Tree [Bentley 75]

kNN on GPUs [Ma and McCool 02]

Kd-Tree Hardware [Woop et al. 05] [Woop et al. 06]

Kd-Tree on GPUs [Popov et al. 07]

Related Work

Adaptive SPH Fluid Simulation [Adams et al. 07]

Linear Moving Least Squares [Adamson and Alexa 04]

Algebraic Moving Least Squares [Guennebaud and Gross 07]

Linear Moving Least Squares


• Implicit surface defined by a set of points pi with normals ni

[Figure: query point x, sample points pi with normals ni]

• Iterative projections onto plane

[Figure: successive projections x → x’ → x’’ → x’’’]

• Surface defined by points projecting onto themselves
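
The projection loop can be sketched in a few lines. This is a minimal illustration in the spirit of linear MLS, not the implementation used in the prototype: the Gaussian weight, the bandwidth `h`, the iteration limit, and the function name `mls_project` are all assumptions made for the example. For brevity it weights every sample point, whereas a real operator would only use the neighbors returned by the spatial search.

```python
import math

def mls_project(x, points, normals, h=1.0, iterations=10):
    """Project x by repeated projection onto a weighted local plane."""
    for _ in range(iterations):
        # Gaussian weight of every sample relative to the current position (assumed kernel)
        w = [math.exp(-sum((xc - pc) ** 2 for xc, pc in zip(x, p)) / (h * h))
             for p in points]
        total = sum(w)
        # Weighted centroid a(x) and weighted average normal n(x)
        a = [sum(wi * p[d] for wi, p in zip(w, points)) / total for d in range(3)]
        n = [sum(wi * m[d] for wi, m in zip(w, normals)) for d in range(3)]
        length = math.sqrt(sum(c * c for c in n))
        n = [c / length for c in n]
        # Signed distance from x to the plane through a(x) with normal n(x)
        dist = sum(nc * (xc - ac) for nc, xc, ac in zip(n, x, a))
        x = [xc - dist * nc for xc, nc in zip(x, n)]
        if abs(dist) < 1e-12:
            break  # x projects onto itself, so it lies on the surface
    return x
```

For samples taken from a flat plane, the first iteration already lands on the plane and the second iteration confirms that the point projects onto itself.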



Spatial Search

• Spatial search: kNN and NN
– Common in most point operations
– Based on kd-tree

• Example NN:


Spatial Search

• kNN search similar to NN search:
– Start with infinite radius
– Sort leaf points into priority queue
– Shrink radius with every point sorted
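
The three steps above can be sketched in software as follows. This is an illustrative version only: the tuple-based tree layout, the leaf size, and the names `build_kdtree`, `sq_dist`, and `knn` are assumptions for the example, and the recursion stands in for the hardware traversal unit.

```python
import heapq

def build_kdtree(points, depth=0, leaf_size=4):
    """Build a kd-tree as nested tuples; leaves hold small point lists."""
    if len(points) <= leaf_size:
        return ("leaf", points)
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return ("node", axis, pts[mid][axis],
            build_kdtree(pts[:mid], depth + 1, leaf_size),
            build_kdtree(pts[mid:], depth + 1, leaf_size))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn(tree, query, k):
    """Return the k points nearest to query, closest first."""
    heap = []  # max-heap via negated squared distances, at most k entries

    def radius2():
        # Squared search radius: infinite until k candidates are known
        return -heap[0][0] if len(heap) == k else float("inf")

    def visit(node):
        if node[0] == "leaf":
            for p in node[1]:
                d = sq_dist(p, query)
                if len(heap) < k:
                    heapq.heappush(heap, (-d, p))
                elif d < radius2():
                    heapq.heapreplace(heap, (-d, p))  # radius shrinks
            return
        _, axis, split, left, right = node
        diff = query[axis] - split
        near, far = (left, right) if diff < 0 else (right, left)
        visit(near)
        if diff * diff < radius2():  # far side may still hold closer points
            visit(far)

    visit(tree)
    return [p for _, p in sorted(heap, reverse=True)]
```

With `k = 1` the same routine performs a plain NN search, which is why the two query types can share one traversal.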


Coherent Neighbor Cache (NN)

• Find neighbors in slightly bigger radius
• Re-use result for spatially close query


Re-use if
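
One way to realize such a re-use test is sketched below. The condition used here is an assumption derived from the triangle inequality, not necessarily the formula shown on the slide: if the cached search around `center` used radius `(1 + alpha) * r`, then any query within `alpha * r` of `center` has its whole radius-`r` ball inside the cached ball. The class name `RangeQueryCache`, the parameter `alpha`, and the brute-force search that stands in for the kd-tree are all hypothetical.

```python
import math

class RangeQueryCache:
    """Caches one enlarged ball query and re-uses it for nearby queries."""

    def __init__(self, points, radius, alpha):
        self.points, self.radius, self.alpha = points, radius, alpha
        self.center = None   # query position of the cached search
        self.cached = []     # points within (1 + alpha) * radius of center
        self.hits = 0

    def query(self, q):
        if (self.center is not None
                and math.dist(q, self.center) <= self.alpha * self.radius):
            self.hits += 1   # ball(q, r) lies inside the cached ball
        else:
            big = (1.0 + self.alpha) * self.radius
            # Brute-force stand-in for the kd-tree search
            self.cached = [p for p in self.points if math.dist(p, q) <= big]
            self.center = q
        # Filter the cached superset down to the exact radius-r neighborhood
        return [p for p in self.cached if math.dist(p, q) <= self.radius]
```

A larger `alpha` raises the hit rate but also enlarges the cached set that must be filtered on every query.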

Coherent Neighbor Cache (kNN, exact)

• Find (k+1) neighbors
• Re-use result for spatially close query


Re-use if
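
A sufficient re-use condition for the exact kNN case can be derived from the triangle inequality; the sketch below uses it as an assumption and does not claim to reproduce the slide's formula. Let e be the distance between the new query and the cached query, and d_k, d_(k+1) the distances of the k-th and (k+1)-th cached neighbors from the cached query. The cached k neighbors are at most d_k + e from the new query, and every other point is at least d_(k+1) - e away, so the cached set is still correct whenever 2e <= d_(k+1) - d_k. The names `KnnCache` and `knn_brute` are hypothetical, and the brute-force search stands in for the kd-tree.

```python
import math

def knn_brute(points, q, k):
    """Stand-in for the kd-tree search: the k points nearest to q, closest first."""
    return sorted(points, key=lambda p: math.dist(p, q))[:k]

class KnnCache:
    """Caches k + 1 neighbors of one query; needs at least k + 1 points."""

    def __init__(self, points, k):
        self.points, self.k = points, k
        self.center = None
        self.neighbors = []  # k + 1 nearest neighbors of center, closest first
        self.hits = 0

    def query(self, q):
        if self.center is not None:
            e = math.dist(q, self.center)
            d_k = math.dist(self.neighbors[self.k - 1], self.center)
            d_k1 = math.dist(self.neighbors[self.k], self.center)
            if 2.0 * e <= d_k1 - d_k:
                self.hits += 1
                # Same neighbor set, re-sorted by distance to the new query
                return sorted(self.neighbors[:self.k],
                              key=lambda p: math.dist(p, q))
        self.neighbors = knn_brute(self.points, q, self.k + 1)
        self.center = q
        return self.neighbors[:self.k]
```

The test depends on the gap between the k-th and (k+1)-th neighbor, so densely and uniformly sampled data leaves little room for exact re-use.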


Coherent Neighbor Cache (kNN, approximation)

• Approximation error
– Enlarge radius


Re-use if




The Architecture


[Figure: architecture block diagram, including the host]

• Eight cached neighborhoods
• Problem: parallel queries in kd-tree module
– Interleave spatially similar queries

Kd-Tree Traversal



• Kd-tree structure on chip
• 16 threads
• Pipelining and multi-threading

Node Recurse

Stacks

• 16 stacks
• Parallel read/write
• Bounded in depth

• 6 bytes per thread per recursion
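
Because recursive calls are replaced by explicit per-thread stacks, the traversal has to be written iteratively. The sketch below shows one way to do this in software for a single NN query; the node layout, the depth limit `max_depth`, and the names `build` and `nearest_iterative` are assumptions for illustration and do not describe the hardware's actual stack entry format.

```python
def build(points, depth=0):
    """Kd-tree node: (point, axis, left, right), or None for an empty subtree."""
    if not points:
        return None
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return (pts[mid], axis,
            build(pts[:mid], depth + 1), build(pts[mid + 1:], depth + 1))

def nearest_iterative(root, query, max_depth=32):
    """NN search using a fixed-size stack instead of recursive calls."""
    stack = [None] * max_depth   # preallocated, like a bounded hardware stack
    top = 0
    best, best_d = None, float("inf")
    node = root
    while True:
        while node is not None:
            point, axis, left, right = node
            d = sum((a - b) ** 2 for a, b in zip(point, query))
            if d < best_d:
                best, best_d = point, d
            diff = query[axis] - point[axis]
            near, far = (left, right) if diff < 0 else (right, left)
            if far is not None:
                if top == max_depth:
                    raise OverflowError("traversal stack depth exceeded")
                stack[top] = (far, diff * diff)   # postpone the far side
                top += 1
            node = near
        # Pop entries until one can still contain a closer point
        while top > 0:
            top -= 1
            node, bound = stack[top]
            if bound < best_d:
                break
            node = None
        if node is None:
            return best
```

As a rough sizing example under the assumed depth limit of 32, 16 threads at 6 bytes per recursion level would need 16 × 32 × 6 = 3072 bytes of stack storage.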


Leaf

• 16 parallel priority queues (1-cycle ops)
• Queues store pointers and distances
• Bandwidth bottleneck


Processing Module

• Multithreaded quad-port bank of 16 registers
• 128 threads
• Programmability using FPGA technology


Further Data

• Implemented on two FPGAs
– 64 bit DDR DRAM
– Interconnection: no overhead
• Resource usage (registers and LUTs)
– Virtex 2 Pro 100 (kNN): 26% registers, 38% LUTs
– Virtex 2 Pro 70 (MLS): 47% registers, 52% LUTs
• Clock frequency: 75 MHz





Applications

• Tested on various applications

• PCI interface of prototype slow


[Weyrich et al. 04]

[Adams et al. 07]

Results kNN


[Figure: number of queries vs. number of neighbors. Bar labels in slide order: CUDA x4, CPU x1.5, FPGA x1, CUDA x2.4, CPU x1.4, FPGA x1, CUDA w/o sort x4.0, CUDA x1.6, CPU x1.1, FPGA x1, CUDA w/o sort x3.1. Clock frequencies shown: 75 MHz, 1200 MHz, 2200 MHz. ASIC estimate, 500 MHz: x6.6.]


• Small hardware footprint
• FPGA slightly slower
• Realistic clock frequency

Prototype faster than CPU/GPU

Results MLS


[Figure: number of queries vs. number of neighbors. Labels: FPGA x1, MLS CPU x0.4, MLS CUDA x3.8. Clock frequencies shown: 75 MHz, 1200 MHz, 2200 MHz.]

FPGA faster than CPU

kNN bottleneck
– FPGA
– GPU



[Figure: number of queries vs. level of coherence. Curves: CPU with approximation 0.1, FPGA exact, FPGA with approximation 0.1.]

Results Approximation Error (MLS projection)

[Figure: MLS error and cache hits, each plotted against the approximation; "no approx." marks the exact case.]

Approximation Error (visual)




Coherent Neighbor Cache:

• Not optimal for exact queries
• Approximate queries
– Can be tolerated in most cases
– Greatly increases performance
– Even for small approximations




Conclusion

• Novel hardware architecture for
– Nearest-neighbor searches
– Generic meshless processing operators
• Cache exploiting spatial coherence
• Good performance considering resources
• Possible GPU integration


Future Work

• Programmable data structure
– Support different data structures
– Programmability in data structure
– Construction on-chip
• ‘Real’ programmability in point processing module

