53
Sven Woop Jörg Schmittler Philipp Slusallek Computer Graphics Lab Saarland University, Germany RPU: A Programmable Ray Processing Unit for Realtime Ray Tracing

RPU: A Programmable Ray Processing Unit for Realtime Ray Tracing

  • Upload
    lulu

  • View
    51

  • Download
    0

Embed Size (px)

DESCRIPTION

RPU: A Programmable Ray Processing Unit for Realtime Ray Tracing. Sven Woop Jörg Schmittler Philipp Slusallek Computer Graphics Lab Saarland University, Germany. - PowerPoint PPT Presentation

Citation preview

  • RPU: A Programmable Ray Processing Unit for Realtime Ray TracingSven Woop Jrg Schmittler Philipp Slusallek

    Computer Graphics LabSaarland University, Germany

  • Some argue that in the very long term, rendering may best be solved by some variant of ray tracing, in which huge numbers of rays sample the environment for the eyes view of each frame.And there will also be colonies on Mars, underwater cities, and personal jet packs.Real-Time Rendering, 1999, 1st edition, page 391Motivation

  • RasterizationFundamental Operation: Project isolated triangles No global access to the sceneAll interesting visual effects need 2+ trianglesShadows, reflection, global illumination, Requires multiple passes & approximations, has many issues

  • Ray TracingFundamental Operation: Trace a RayGlobal scene accessIndividual rays in O(log N)Flexibility in space and time Automatic combination of visual effectsDemand drivenPhysical light simulation Embarrassingly parallel

  • OpenRT Project:Realtime Ray Tracing in SWResults:Exploit inherent coherenceRealtime performance (> 30x)Scalability (> 80 CPUs) Realtime indirect lighting & caustic computationLarge model visualization

  • OpenRT Project:Realtime Ray Tracing in SWResults:Exploit inherent coherenceRealtime performance (> 30x)Scalability (> 80 CPUs)Realtime indirect lighting & caustic computationLarge model visualization

  • OpenRT Project:Realtime Ray Tracing in SWResults:Exploit inherent coherenceRealtime performance (> 30x)Scalability (> 80 CPUs)Realtime indirect lighting & caustic computationLarge model visualization

  • OpenRT Project:Realtime Ray Tracing in SWResults:Exploit inherent coherenceRealtime performance (> 30x)Scalability (> 80 CPUs)Realtime indirect lighting & caustic computationLarge model visualization

  • OpenRT Project:Realtime Ray Tracing in SWResults:Exploit inherent coherenceRealtime performance (> 30x)Scalability (> 80 CPUs)Realtime indirect lighting & caustic computationLarge model visualization Compact Hardware?

  • Previous WorkCPUs [Parker99, Wald01]Flexible, but needs massive number of CPUsGPUs [Purcell02, Carr02]Stream programming model is too limitedCustom HW [Green91, Hall01, Kobayashi02, Schmittler04]Mostly focused on parts of the RT pipeline

  • RPU ApproachShading processorDesign similar to fragment processors on GPUsHighly parallel, highly efficientImproved programming modelAdd highly efficient recursion, conditional branchingAdd flexible memory access (beyond textures)Custom traversal hardwareHigh-performance kd-tree traversal

  • RPU DesignShader Processing Units (SPU)Intersection, Shading, Lighting, Significant vector processingHigh instruction level parallelismSPU approach: instruction set similar to GPUs4-vector SIMDDual issue & pairingArithmetic splitting (3+1 and 2+2 vector)Arithmetic + loadArithmetic + conditional jump, call, returnHigh code densityInstruction Level Parallelism+

  • Instruction Set of SPUShort vector instruction set mov, add, mul, mad, frac dph2, dp3, dph3, dp4Input modifiersSwizzeling, negation, masking Multiply with power of 2 Special operations (modifiers)rcp, rsq, sat

    Fast 2D texture addressingtexload, texload4xRandom bulk memory accessload, load4x, storeConditional instructions (paired)if jmp label if call If return

    Efficient recursionHW-managed register stack Single cycle function call

    Ray traversal function calltrace()Instruction Level Parallelism+

  • Instruction Set of SPUShort vector instruction set mov, add, mul, mad, frac dph2, dp3, dph3, dp4Input modifiersSwizzeling, negation, masking Multiply with power of 2 Special operations (modifiers)rcp, rsq, sat

    Fast 2D texture addressingtexload, texload4xRandom bulk memory accessload, load4x, storeConditional instructions (paired)if jmp label if call If return

    Efficient recursionHW-managed register stack Single cycle function call

    Ray traversal function calltrace()Instruction Level Parallelism+

  • Instruction Set of SPUShort vector instruction set mov, add, mul, mad, frac dph2, dp3, dph3, dp4Input modifiersSwizzeling, negation, masking Multiply with power of 2 Special operations (modifiers)rcp, rsq, sat

    Fast 2D texture addressingtexload, texload4xRandom bulk memory accessload, load4x, storeConditional instructions (paired)if jmp label if call If return

    Efficient recursionHW-managed register stack Single cycle function call

    Ray traversal function calltrace()Instruction Level Parallelism+

  • Instruction Set of SPUShort vector instruction set mov, add, mul, mad, frac dph2, dp3, dph3, dp4Input modifiersSwizzeling, negation, masking Multiply with power of 2 Special operations (modifiers)rcp, rsq, sat

    Fast 2D texture addressingtexload, texload4xRandom bulk memory accessload, load4x, storeConditional instructions (paired)if jmp label if call If return

    Efficient recursionHW-managed register stack Single cycle function call

    Ray traversal function calltrace()Instruction Level Parallelism+

  • Instruction Set of SPUShort vector instruction set mov, add, mul, mad, frac dph2, dp3, dph3, dp4Input modifiersSwizzeling, negation, masking Multiply with power of 2 Special operations (modifiers)rcp, rsq, sat

    Fast 2D texture addressingtexload, texload4xRandom bulk memory accessload, load4x, storeConditional instructions (paired)if jmp label if call If return

    Efficient recursionHW-managed register stack Single cycle function call

    Ray traversal function calltrace()Instruction Level Parallelism+

  • Instruction Set of SPUShort vector instruction set mov, add, mul, mad, frac dph2, dp3, dph3, dp4Input modifiersSwizzeling, negation, masking Multiply with power of 2 Special operations (modifiers)rcp, rsq, sat

    Fast 2D texture addressingtexload, texload4xRandom bulk memory accessload, load4x, storeConditional instructions (paired)if jmp label if call If return

    Efficient recursionHW-managed register stack Single cycle function call

    Ray traversal function calltrace()Instruction Level Parallelism+

  • RPU DesignShader Processing Units (SPU)Custom Ray Traversal Unit (TPU)Uses kd-trees: Flexible and adaptiveHandles general scenes well [Havran2000] Simple traversal algorithmBut many instruction dependencies in inner loopAbout 10 instructionsCPU:>100 cycles (un-optimized) ~15 cycles (optimized OpenRT)TPU approach: Optimized pipeline1 cycle throughput+Optimized PipeliningInstruction Level Parallelism+

  • RPU DesignShader Processing Units (SPU)Custom Ray Traversal Unit (TPU)Multi-ThreadingHigh computational & memory latencyTake advantage of thread level parallelismHigh utilization of hardware unitsNo overhead for switching threads in HW

    ++Thread Level ParallelismOptimized PipeliningInstruction Level Parallelism+

  • RPU DesignShader Processing Units (SPU)Custom Ray Traversal Unit (TPU)Multi-ThreadingChunkingTakes advantage of ray coherenceSIMD execution of threadsReduces hardware complexityShared processor infrastructureReduces external bandwidthCombining memory requestsAutomatic handling of incoherenceSplitting and masked execution++Thread Level ParallelismOptimized PipeliningInstruction Level ParallelismControl Flow Coherence++

  • RPU DesignShader Processing Units (SPU)Custom Ray Traversal Unit (TPU)Multi-ThreadingChunkingMailbox Processing Unit (MPU)Objects often span many kd-tree cellsRedundant intersectionsMailboxing with a small per-chunk cache (e.g. 4 entries)Up to 10x performance for some scenes

    ++Thread Level ParallelismOptimized PipeliningInstruction Level ParallelismControl Flow CoherenceAvoiding Redundancy++

  • RPU Design++Shader Processing Units (SPU)Custom Ray Traversal Unit (TPU)Multi-ThreadingChunkingMailbox Processing Unit (MPU)Thread Level ParallelismOptimized PipeliningInstruction Level ParallelismControl Flow CoherenceAvoiding Redundancy++

  • RPU Design++Shader Processing Units (SPU)Custom Ray Traversal Unit (TPU)Multi-ThreadingChunkingMailbox Processing Unit (MPU)Thread Level ParallelismOptimized PipeliningInstruction Level ParallelismControl Flow CoherenceAvoiding Redundancy++

  • RPU Design++Shader Processing Units (SPU)Custom Ray Traversal Unit (TPU)Multi-ThreadingChunkingMailbox Processing Unit (MPU)Thread Level ParallelismInstruction Level ParallelismControl Flow CoherenceAvoiding Redundancy++Optimized Pipelining

  • RPU Design+Shader Processing Units (SPU)Custom Ray Traversal Unit (TPU)Multi-ThreadingChunkingMailbox Processing Unit (MPU)Thread Level ParallelismOptimized PipeliningInstruction Level ParallelismControl Flow CoherenceAvoiding Redundancy+++

  • RPU Design++++Shader Processing Units (SPU)Custom Ray Traversal Unit (TPU)Multi-ThreadingChunkingMailbox Processing Unit (MPU)Thread Level ParallelismOptimized PipeliningInstruction Level ParallelismControl Flow CoherenceAvoiding Redundancy

  • RPU ArchitectureTPUMPUVector- CacheNode- CacheList- CacheExternal DDR MemoryRPUThread GeneratorinternalexternalSPUSPUSPUSPUTPUTPUTPU

  • RPU ArchitectureTPUMPUVector- CacheNode- CacheList- CacheExternal DDR MemoryRPUThread GeneratorinternalexternalSPUSPUSPUSPUTPUTPUTPU

  • RPU ArchitectureTPUMPUVector- CacheNode- CacheList- CacheExternal DDR MemoryRPUThread GeneratorinternalexternalSPUSPUSPUSPUTPUTPUTPU

  • RPU ArchitectureTPUMPUVector- CacheNode- CacheList- CacheExternal DDR MemoryRPUThread GeneratorinternalexternalSPUSPUSPUSPUTPUTPUTPU

  • RPU ArchitectureTPUMPUVector- CacheNode- CacheList- CacheExternal DDR MemoryRPUThread GeneratorinternalexternalSPUSPUSPUSPUTPUTPUTPU

  • RPU ArchitectureTPUMPUVector- CacheNode- CacheList- CacheExternal DDR MemoryRPUinternalexternalSPUSPUSPUSPUTPUTPUTPUThread Generator

  • RPU ArchitectureTPUMPUVector- CacheNode- CacheList- CacheExternal DDR MemoryRPUinternalexternalSPUSPUSPUSPUTPUTPUTPUThread Generator

  • Ray Tracing on an RPUThread generation: initialize SPU registers with pixel coordinates

    Top-LevelObjectIntersectorTop-LevelObjectIntersectorPrimaryShaderTPU/MPUIntersectorThread Generator

  • Ray Tracing on an RPUThread generation: initialize SPU registers with pixel coordinatesPrimary shader generates camera ray and calls trace()

    Top-LevelObjectIntersectorTop-LevelObjectIntersectorPrimaryShaderTPU/MPUIntersectorThread Generator

  • Ray Tracing on an RPUThread generation: initialize SPU registers with pixel coordinatesPrimary shader generates camera ray and calls trace()Ray traversal performed on TPU with mailboxing on MPU

    Top-LevelObjectIntersectorTop-LevelObjectIntersectorPrimaryShaderTPU/MPUIntersectorThread Generator

  • Ray Tracing on an RPUThread generation: initialize SPU registers with pixel coordinatesPrimary shader generates camera ray and calls trace()Ray traversal performed on TPU with mailboxing on MPUData dependent call to object/intersection shader on SPUProgrammable geometry (triangles, spheres, quadrics, bicubic splines, )

    Top-LevelObjectIntersectorTop-LevelObjectIntersectorPrimaryShaderTPU/MPUIntersectorThread Generatorshader calls

  • Ray Tracing on an RPUThread generation: initialize SPU registers with pixel coordinatesPrimary shader generates camera ray and calls trace()Ray traversal performed on TPU with mailboxing on MPU Data dependent call to object/intersection shader on SPUProgrammable geometry (triangles, spheres, quadrics, bicubic splines, )Nested ray traversal for object-based dynamic scenesTop-LevelObjectIntersectorTop-LevelObjectIntersectorShaderTPU/MPUTop-LevelObjectIntersectorThread GeneratorGeometryIntersectorGeometryIntersectorGeometryIntersectorTPU/MPUtop-level traversal

  • Ray Tracing on an RPUThread generation: initialize SPU registers with pixel coordinatesPrimary shader generates camera ray and calls trace()Ray traversal performed on TPU with mailboxing on MPU Data dependent call to object/intersection shader on SPUProgrammable geometry (triangles, spheres, quadrics, bicubic splines, )Nested ray traversal for object-based dynamic scenesTop-LevelObjectIntersectorTop-LevelObjectIntersectorShaderTPU/MPUTop-LevelObjectIntersectorThread GeneratorGeometryIntersectorGeometryIntersectorGeometryIntersectorTPU/MPUtop-level traversal

  • Ray Tracing on an RPUThread generation: initialize SPU registers with pixel coordinatesPrimary shader generates camera ray and calls trace()Ray traversal performed on TPU with mailboxing on MPU Data dependent call to object/intersection shader on SPUProgrammable geometry (triangles, spheres, quadrics, bicubic splines, )Nested ray traversal for object-based dynamic scenesTop-LevelObjectIntersectorTop-LevelObjectIntersectorShaderTPU/MPUTop-LevelObjectIntersectorThread GeneratorGeometryIntersectorGeometryIntersectorGeometryIntersectorTPU/MPUtop-level traversal

  • Ray Tracing on an RPUThread generation: initialize SPU registers with pixel coordinatesPrimary shader generates camera ray and calls trace()Ray traversal performed on TPU with mailboxing on MPU Data dependent call to object/intersection shader on SPUProgrammable geometry (triangles, spheres, quadrics, bicubic splines, )Nested ray traversal for object-based dynamic scenesTop-LevelObjectIntersectorTop-LevelObjectIntersectorShaderTPU/MPUTop-LevelObjectIntersectorThread Generatortop-level traversalbottom-level traversalGeometryIntersectorGeometryIntersectorGeometryIntersectorTPU/MPU

  • FPGA prototypeXilinx Virtex II 6000Usage: 99% logic, 70% on-chip memory128 MB DDR-RAM with 350 MB/s24 bit floating pointConfiguration of prototype RPU32 threads per SPU60% usageChunk size of 495% efficiency12 kB caches in total90% hit rate

  • PerformanceSingle FPGA at only 66 MHz4 million rays/s20 fps @ 512x384Same performance as CPU40x clock rate (2.66 GHz)Using our highly optimized software (OpenRT with SSE)Linear scalability with HW resourcesTested: 4x FPGA 4x performanceIndependent of scenes

  • Prototype PerformanceTechnology: FPGA versus ASIC (GPU)

    Large headroom for scaling performance

    RPU prototype (FPGA)GPU (ASIC)4 Gflops200 Gflops50x0.3 GB/s30 GB/s100x

  • Demo

  • ConclusionsProgrammable Architecture for Ray TracingGPU-like shading processorMore flexible programming modelFlexible control flow and memory accessEfficient recursion with HW-managed stackEfficient traversal through custom hardwareRealtime performance on FPGA prototype

  • OutlookASIC designEvaluate scalability with technologyRPU as a general purpose processorExplore non-rendering use of RPU

    New fundamental operation: Tracing a rayBasis for next generation interactive 3D graphics

  • Siggraph 2005: More Realtime Ray TracingIntroduction to Realtime Ray TracingFull day course: Wednesday, Petree Hall DBooth 1155: Mercury Computer SystemsRealtime ray tracing product on PC clustersRealtime ray tracing on the Cell ProcessorRealtime previewing in Cinema-4DBooth 1511: SGIRay tracing massive model : Boeing 777

  • Ray Triangle Intersection; load triangle transformation load4x A.y,0

    ; prepare intersection comp. dp3_rcp R7.z,I2,R3 dp3 R7.y,I1,R3 dp3 R7.x,I0,R3 dph3 R6.x,I0,R2 dph3 R6.y,I1,R2 dph3 R6.z,I2,R2

    ; compute hit distance mul R8.z,-R6.z,S.z + if z =1 return

    ; hit distance closer than last one? add R8.w,R8.z,-R4.z + if w >=0 return

    ; save hit information mov SID,I3.x + mov MAX,R8.z mov R4.xyz,R8 + returnInputArithmetic (dot products)Multi-issue (arith. & cond.)

  • TPU (Traversal Processing Unit)Read NodeRead RayProject Ray(org_k split)/dir_kd < t_mind < t_minnear < hitDecisionUpdate: Node Addr, Near, Far, Stackto MPUNode Cache

  • Shader Processing UnitPipeliningRead InstructionRead 3 Source RegistersSwizzeling mov R0,R1* mov R2,R3 * mov R0,R2 MaskingWriteback

    *

    Memory AccessWriteback I0 I3***++++ClampThread ControlBranchingStack Control RCP,RSQWritebackMasking