Upload
lulu
View
51
Download
0
Tags:
Embed Size (px)
DESCRIPTION
RPU: A Programmable Ray Processing Unit for Realtime Ray Tracing. Sven Woop Jörg Schmittler Philipp Slusallek Computer Graphics Lab Saarland University, Germany. - PowerPoint PPT Presentation
Citation preview
RPU: A Programmable Ray Processing Unit for Realtime Ray TracingSven Woop Jrg Schmittler Philipp Slusallek
Computer Graphics LabSaarland University, Germany
Some argue that in the very long term, rendering may best be solved by some variant of ray tracing, in which huge numbers of rays sample the environment for the eyes view of each frame.And there will also be colonies on Mars, underwater cities, and personal jet packs.Real-Time Rendering, 1999, 1st edition, page 391Motivation
RasterizationFundamental Operation: Project isolated triangles No global access to the sceneAll interesting visual effects need 2+ trianglesShadows, reflection, global illumination, Requires multiple passes & approximations, has many issues
Ray TracingFundamental Operation: Trace a RayGlobal scene accessIndividual rays in O(log N)Flexibility in space and time Automatic combination of visual effectsDemand drivenPhysical light simulation Embarrassingly parallel
OpenRT Project:Realtime Ray Tracing in SWResults:Exploit inherent coherenceRealtime performance (> 30x)Scalability (> 80 CPUs) Realtime indirect lighting & caustic computationLarge model visualization
OpenRT Project:Realtime Ray Tracing in SWResults:Exploit inherent coherenceRealtime performance (> 30x)Scalability (> 80 CPUs)Realtime indirect lighting & caustic computationLarge model visualization
OpenRT Project:Realtime Ray Tracing in SWResults:Exploit inherent coherenceRealtime performance (> 30x)Scalability (> 80 CPUs)Realtime indirect lighting & caustic computationLarge model visualization
OpenRT Project:Realtime Ray Tracing in SWResults:Exploit inherent coherenceRealtime performance (> 30x)Scalability (> 80 CPUs)Realtime indirect lighting & caustic computationLarge model visualization
OpenRT Project:Realtime Ray Tracing in SWResults:Exploit inherent coherenceRealtime performance (> 30x)Scalability (> 80 CPUs)Realtime indirect lighting & caustic computationLarge model visualization Compact Hardware?
Previous WorkCPUs [Parker99, Wald01]Flexible, but needs massive number of CPUsGPUs [Purcell02, Carr02]Stream programming model is too limitedCustom HW [Green91, Hall01, Kobayashi02, Schmittler04]Mostly focused on parts of the RT pipeline
RPU ApproachShading processorDesign similar to fragment processors on GPUsHighly parallel, highly efficientImproved programming modelAdd highly efficient recursion, conditional branchingAdd flexible memory access (beyond textures)Custom traversal hardwareHigh-performance kd-tree traversal
RPU DesignShader Processing Units (SPU)Intersection, Shading, Lighting, Significant vector processingHigh instruction level parallelismSPU approach: instruction set similar to GPUs4-vector SIMDDual issue & pairingArithmetic splitting (3+1 and 2+2 vector)Arithmetic + loadArithmetic + conditional jump, call, returnHigh code densityInstruction Level Parallelism+
Instruction Set of SPUShort vector instruction set mov, add, mul, mad, frac dph2, dp3, dph3, dp4Input modifiersSwizzeling, negation, masking Multiply with power of 2 Special operations (modifiers)rcp, rsq, sat
Fast 2D texture addressingtexload, texload4xRandom bulk memory accessload, load4x, storeConditional instructions (paired)if jmp label if call If return
Efficient recursionHW-managed register stack Single cycle function call
Ray traversal function calltrace()Instruction Level Parallelism+
Instruction Set of SPUShort vector instruction set mov, add, mul, mad, frac dph2, dp3, dph3, dp4Input modifiersSwizzeling, negation, masking Multiply with power of 2 Special operations (modifiers)rcp, rsq, sat
Fast 2D texture addressingtexload, texload4xRandom bulk memory accessload, load4x, storeConditional instructions (paired)if jmp label if call If return
Efficient recursionHW-managed register stack Single cycle function call
Ray traversal function calltrace()Instruction Level Parallelism+
Instruction Set of SPUShort vector instruction set mov, add, mul, mad, frac dph2, dp3, dph3, dp4Input modifiersSwizzeling, negation, masking Multiply with power of 2 Special operations (modifiers)rcp, rsq, sat
Fast 2D texture addressingtexload, texload4xRandom bulk memory accessload, load4x, storeConditional instructions (paired)if jmp label if call If return
Efficient recursionHW-managed register stack Single cycle function call
Ray traversal function calltrace()Instruction Level Parallelism+
Instruction Set of SPUShort vector instruction set mov, add, mul, mad, frac dph2, dp3, dph3, dp4Input modifiersSwizzeling, negation, masking Multiply with power of 2 Special operations (modifiers)rcp, rsq, sat
Fast 2D texture addressingtexload, texload4xRandom bulk memory accessload, load4x, storeConditional instructions (paired)if jmp label if call If return
Efficient recursionHW-managed register stack Single cycle function call
Ray traversal function calltrace()Instruction Level Parallelism+
Instruction Set of SPUShort vector instruction set mov, add, mul, mad, frac dph2, dp3, dph3, dp4Input modifiersSwizzeling, negation, masking Multiply with power of 2 Special operations (modifiers)rcp, rsq, sat
Fast 2D texture addressingtexload, texload4xRandom bulk memory accessload, load4x, storeConditional instructions (paired)if jmp label if call If return
Efficient recursionHW-managed register stack Single cycle function call
Ray traversal function calltrace()Instruction Level Parallelism+
Instruction Set of SPUShort vector instruction set mov, add, mul, mad, frac dph2, dp3, dph3, dp4Input modifiersSwizzeling, negation, masking Multiply with power of 2 Special operations (modifiers)rcp, rsq, sat
Fast 2D texture addressingtexload, texload4xRandom bulk memory accessload, load4x, storeConditional instructions (paired)if jmp label if call If return
Efficient recursionHW-managed register stack Single cycle function call
Ray traversal function calltrace()Instruction Level Parallelism+
RPU DesignShader Processing Units (SPU)Custom Ray Traversal Unit (TPU)Uses kd-trees: Flexible and adaptiveHandles general scenes well [Havran2000] Simple traversal algorithmBut many instruction dependencies in inner loopAbout 10 instructionsCPU:>100 cycles (un-optimized) ~15 cycles (optimized OpenRT)TPU approach: Optimized pipeline1 cycle throughput+Optimized PipeliningInstruction Level Parallelism+
RPU DesignShader Processing Units (SPU)Custom Ray Traversal Unit (TPU)Multi-ThreadingHigh computational & memory latencyTake advantage of thread level parallelismHigh utilization of hardware unitsNo overhead for switching threads in HW
++Thread Level ParallelismOptimized PipeliningInstruction Level Parallelism+
RPU DesignShader Processing Units (SPU)Custom Ray Traversal Unit (TPU)Multi-ThreadingChunkingTakes advantage of ray coherenceSIMD execution of threadsReduces hardware complexityShared processor infrastructureReduces external bandwidthCombining memory requestsAutomatic handling of incoherenceSplitting and masked execution++Thread Level ParallelismOptimized PipeliningInstruction Level ParallelismControl Flow Coherence++
RPU DesignShader Processing Units (SPU)Custom Ray Traversal Unit (TPU)Multi-ThreadingChunkingMailbox Processing Unit (MPU)Objects often span many kd-tree cellsRedundant intersectionsMailboxing with a small per-chunk cache (e.g. 4 entries)Up to 10x performance for some scenes
++Thread Level ParallelismOptimized PipeliningInstruction Level ParallelismControl Flow CoherenceAvoiding Redundancy++
RPU Design++Shader Processing Units (SPU)Custom Ray Traversal Unit (TPU)Multi-ThreadingChunkingMailbox Processing Unit (MPU)Thread Level ParallelismOptimized PipeliningInstruction Level ParallelismControl Flow CoherenceAvoiding Redundancy++
RPU Design++Shader Processing Units (SPU)Custom Ray Traversal Unit (TPU)Multi-ThreadingChunkingMailbox Processing Unit (MPU)Thread Level ParallelismOptimized PipeliningInstruction Level ParallelismControl Flow CoherenceAvoiding Redundancy++
RPU Design++Shader Processing Units (SPU)Custom Ray Traversal Unit (TPU)Multi-ThreadingChunkingMailbox Processing Unit (MPU)Thread Level ParallelismInstruction Level ParallelismControl Flow CoherenceAvoiding Redundancy++Optimized Pipelining
RPU Design+Shader Processing Units (SPU)Custom Ray Traversal Unit (TPU)Multi-ThreadingChunkingMailbox Processing Unit (MPU)Thread Level ParallelismOptimized PipeliningInstruction Level ParallelismControl Flow CoherenceAvoiding Redundancy+++
RPU Design++++Shader Processing Units (SPU)Custom Ray Traversal Unit (TPU)Multi-ThreadingChunkingMailbox Processing Unit (MPU)Thread Level ParallelismOptimized PipeliningInstruction Level ParallelismControl Flow CoherenceAvoiding Redundancy
RPU ArchitectureTPUMPUVector- CacheNode- CacheList- CacheExternal DDR MemoryRPUThread GeneratorinternalexternalSPUSPUSPUSPUTPUTPUTPU
RPU ArchitectureTPUMPUVector- CacheNode- CacheList- CacheExternal DDR MemoryRPUThread GeneratorinternalexternalSPUSPUSPUSPUTPUTPUTPU
RPU ArchitectureTPUMPUVector- CacheNode- CacheList- CacheExternal DDR MemoryRPUThread GeneratorinternalexternalSPUSPUSPUSPUTPUTPUTPU
RPU ArchitectureTPUMPUVector- CacheNode- CacheList- CacheExternal DDR MemoryRPUThread GeneratorinternalexternalSPUSPUSPUSPUTPUTPUTPU
RPU ArchitectureTPUMPUVector- CacheNode- CacheList- CacheExternal DDR MemoryRPUThread GeneratorinternalexternalSPUSPUSPUSPUTPUTPUTPU
RPU ArchitectureTPUMPUVector- CacheNode- CacheList- CacheExternal DDR MemoryRPUinternalexternalSPUSPUSPUSPUTPUTPUTPUThread Generator
RPU ArchitectureTPUMPUVector- CacheNode- CacheList- CacheExternal DDR MemoryRPUinternalexternalSPUSPUSPUSPUTPUTPUTPUThread Generator
Ray Tracing on an RPUThread generation: initialize SPU registers with pixel coordinates
Top-LevelObjectIntersectorTop-LevelObjectIntersectorPrimaryShaderTPU/MPUIntersectorThread Generator
Ray Tracing on an RPUThread generation: initialize SPU registers with pixel coordinatesPrimary shader generates camera ray and calls trace()
Top-LevelObjectIntersectorTop-LevelObjectIntersectorPrimaryShaderTPU/MPUIntersectorThread Generator
Ray Tracing on an RPUThread generation: initialize SPU registers with pixel coordinatesPrimary shader generates camera ray and calls trace()Ray traversal performed on TPU with mailboxing on MPU
Top-LevelObjectIntersectorTop-LevelObjectIntersectorPrimaryShaderTPU/MPUIntersectorThread Generator
Ray Tracing on an RPUThread generation: initialize SPU registers with pixel coordinatesPrimary shader generates camera ray and calls trace()Ray traversal performed on TPU with mailboxing on MPUData dependent call to object/intersection shader on SPUProgrammable geometry (triangles, spheres, quadrics, bicubic splines, )
Top-LevelObjectIntersectorTop-LevelObjectIntersectorPrimaryShaderTPU/MPUIntersectorThread Generatorshader calls
Ray Tracing on an RPUThread generation: initialize SPU registers with pixel coordinatesPrimary shader generates camera ray and calls trace()Ray traversal performed on TPU with mailboxing on MPU Data dependent call to object/intersection shader on SPUProgrammable geometry (triangles, spheres, quadrics, bicubic splines, )Nested ray traversal for object-based dynamic scenesTop-LevelObjectIntersectorTop-LevelObjectIntersectorShaderTPU/MPUTop-LevelObjectIntersectorThread GeneratorGeometryIntersectorGeometryIntersectorGeometryIntersectorTPU/MPUtop-level traversal
Ray Tracing on an RPUThread generation: initialize SPU registers with pixel coordinatesPrimary shader generates camera ray and calls trace()Ray traversal performed on TPU with mailboxing on MPU Data dependent call to object/intersection shader on SPUProgrammable geometry (triangles, spheres, quadrics, bicubic splines, )Nested ray traversal for object-based dynamic scenesTop-LevelObjectIntersectorTop-LevelObjectIntersectorShaderTPU/MPUTop-LevelObjectIntersectorThread GeneratorGeometryIntersectorGeometryIntersectorGeometryIntersectorTPU/MPUtop-level traversal
Ray Tracing on an RPUThread generation: initialize SPU registers with pixel coordinatesPrimary shader generates camera ray and calls trace()Ray traversal performed on TPU with mailboxing on MPU Data dependent call to object/intersection shader on SPUProgrammable geometry (triangles, spheres, quadrics, bicubic splines, )Nested ray traversal for object-based dynamic scenesTop-LevelObjectIntersectorTop-LevelObjectIntersectorShaderTPU/MPUTop-LevelObjectIntersectorThread GeneratorGeometryIntersectorGeometryIntersectorGeometryIntersectorTPU/MPUtop-level traversal
Ray Tracing on an RPUThread generation: initialize SPU registers with pixel coordinatesPrimary shader generates camera ray and calls trace()Ray traversal performed on TPU with mailboxing on MPU Data dependent call to object/intersection shader on SPUProgrammable geometry (triangles, spheres, quadrics, bicubic splines, )Nested ray traversal for object-based dynamic scenesTop-LevelObjectIntersectorTop-LevelObjectIntersectorShaderTPU/MPUTop-LevelObjectIntersectorThread Generatortop-level traversalbottom-level traversalGeometryIntersectorGeometryIntersectorGeometryIntersectorTPU/MPU
FPGA prototypeXilinx Virtex II 6000Usage: 99% logic, 70% on-chip memory128 MB DDR-RAM with 350 MB/s24 bit floating pointConfiguration of prototype RPU32 threads per SPU60% usageChunk size of 495% efficiency12 kB caches in total90% hit rate
PerformanceSingle FPGA at only 66 MHz4 million rays/s20 fps @ 512x384Same performance as CPU40x clock rate (2.66 GHz)Using our highly optimized software (OpenRT with SSE)Linear scalability with HW resourcesTested: 4x FPGA 4x performanceIndependent of scenes
Prototype PerformanceTechnology: FPGA versus ASIC (GPU)
Large headroom for scaling performance
RPU prototype (FPGA)GPU (ASIC)4 Gflops200 Gflops50x0.3 GB/s30 GB/s100x
Demo
ConclusionsProgrammable Architecture for Ray TracingGPU-like shading processorMore flexible programming modelFlexible control flow and memory accessEfficient recursion with HW-managed stackEfficient traversal through custom hardwareRealtime performance on FPGA prototype
OutlookASIC designEvaluate scalability with technologyRPU as a general purpose processorExplore non-rendering use of RPU
New fundamental operation: Tracing a rayBasis for next generation interactive 3D graphics
Siggraph 2005: More Realtime Ray TracingIntroduction to Realtime Ray TracingFull day course: Wednesday, Petree Hall DBooth 1155: Mercury Computer SystemsRealtime ray tracing product on PC clustersRealtime ray tracing on the Cell ProcessorRealtime previewing in Cinema-4DBooth 1511: SGIRay tracing massive model : Boeing 777
Ray Triangle Intersection; load triangle transformation load4x A.y,0
; prepare intersection comp. dp3_rcp R7.z,I2,R3 dp3 R7.y,I1,R3 dp3 R7.x,I0,R3 dph3 R6.x,I0,R2 dph3 R6.y,I1,R2 dph3 R6.z,I2,R2
; compute hit distance mul R8.z,-R6.z,S.z + if z =1 return
; hit distance closer than last one? add R8.w,R8.z,-R4.z + if w >=0 return
; save hit information mov SID,I3.x + mov MAX,R8.z mov R4.xyz,R8 + returnInputArithmetic (dot products)Multi-issue (arith. & cond.)
TPU (Traversal Processing Unit)Read NodeRead RayProject Ray(org_k split)/dir_kd < t_mind < t_minnear < hitDecisionUpdate: Node Addr, Near, Far, Stackto MPUNode Cache
Shader Processing UnitPipeliningRead InstructionRead 3 Source RegistersSwizzeling mov R0,R1* mov R2,R3 * mov R0,R2 MaskingWriteback
*
Memory AccessWriteback I0 I3***++++ClampThread ControlBranchingStack Control RCP,RSQWritebackMasking