Many-Core Programming with GRAMPS
Jeremy Sugerman
Stanford PPL Retreat
November 21, 2008
2
Introduction
Collaborators: Kayvon Fatahalian, Solomon Boulos, Kurt Akeley, Pat Hanrahan
Initial work appearing in ACM TOG in January, 2009
Our starting point: CPU and GPU trends… and collision?
Two research areas:
– HW/SW interface, programming model
– Future graphics API
3
Background
Problem statement / requirements: build a programming model / primitives / building blocks to drive efficient development for, and usage of, future many-core machines. Handle homogeneous cores, heterogeneous cores, programmable cores, and fixed-function units.
Status quo:
– GPU pipeline (good for GL, otherwise hard)
– CPU / C run-time (no guidance; fast is hard)
4
GRAMPS
Apps: graphs of stages and queues
– Producer-consumer, task, and data parallelism
– Initial focus on real-time rendering

[Figure: two example GRAMPS graphs. Ray Tracer: Camera → Ray Queue → Intersect → Ray Hit Queue → Shade → Fragment Queue → FB Blend. Raster Graphics: Rasterize → Shade → FB Blend, with input and output fragment queues. Legend: thread stages, shader stages, fixed-function stages; queues and stage outputs.]
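To make the "graph of stages and queues" idea concrete, here is a toy sketch of the ray tracer graph above as plain Python. This is not the GRAMPS API (which is not shown in this talk); every name below is illustrative, and the stages run sequentially rather than under a scheduler.

```python
from collections import deque

# Hypothetical sketch (not the GRAMPS API): the ray tracer expressed as
# stages connected by named queues, mirroring the figure above.
class Queue:
    def __init__(self, name):
        self.name = name
        self.items = deque()

def camera(ray_q, n):
    # Producer stage: emit one ray per pixel sample.
    for i in range(n):
        ray_q.items.append(("ray", i))

def intersect(ray_q, hit_q):
    # Consume rays, produce hits.
    while ray_q.items:
        _, i = ray_q.items.popleft()
        hit_q.items.append(("hit", i))

def shade(hit_q, frag_q):
    # Consume hits, produce fragments.
    while hit_q.items:
        _, i = hit_q.items.popleft()
        frag_q.items.append(("frag", i))

def fb_blend(frag_q, framebuffer):
    # Terminal stage: blend fragments into the framebuffer.
    while frag_q.items:
        _, i = frag_q.items.popleft()
        framebuffer[i] = 1

rays, hits, frags = Queue("Ray"), Queue("Ray Hit"), Queue("Fragment")
fb = {}
camera(rays, 4)
intersect(rays, hits)
shade(hits, frags)
fb_blend(frags, fb)
```

In a real GRAMPS implementation the stages would run concurrently and the scheduler, not the caller, would decide when each stage drains its input queue.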
5
Design Goals
– Large application scope: preferable to roll-your-own
– High performance: competitive with roll-your-own
– Optimized implementations: informs HW design
– Multi-platform: suits a variety of many-core systems
Also: tunable – expert users can optimize their apps
6
As a Graphics Evolution
Not (unthinkably) radical for 'graphics'; like fixed → programmable shading
– Pipeline undergoing massive shake-up
– Diversity of new parameters and use cases
Bigger picture than 'graphics'
– Rendering is more than GL/D3D
– Compute is more than rendering
– Some 'GPUs' are losing their innate pipeline
7
As a Compute Evolution (1)
Sounds like streaming: execution graphs, kernels, data parallelism
Streaming: "squeeze out every FLOP"
– Goals: bulk transfer, arithmetic intensity
– Intensive static analysis, custom chips (mostly)
– Bounded space, data access, execution time
8
As a Compute Evolution (2)
GRAMPS: "interesting apps are irregular"
– Goals: dynamic, data-dependent code
– Aggregate work at run-time
– Heterogeneous commodity platforms
Streaming techniques fit naturally when applicable
9
GRAMPS' Role
A 'graphics pipeline' is now an app!
Target users: engine/pipeline/run-time authors, savvy hardware-aware systems developers.
Compared to the status quo:
– More flexible, lower level than a GPU pipeline
– More guidance than bare metal
– Portability in between
– Not domain specific
10
GRAMPS Entities (1)
Data access via windows into queues/memory
Queues: dynamically allocated / managed
– Ordered or unordered
– Specified max capacity (could also spill)
– Two types: Opaque and Collection
Buffers: random access, pre-allocated
– RO, RW Private, RW Shared (not supported)
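A minimal sketch of the bounded-queue behavior described above, assuming (my assumption, not the GRAMPS interface) that a producer's push simply fails when the queue is at max capacity, giving the scheduler a natural pre-emption point:

```python
from collections import deque

# Illustrative sketch (method names are my own, not GRAMPS calls):
# a queue with a specified max capacity. A failed push is the signal
# that the producer stage should yield to consumers.
class BoundedQueue:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()

    def try_push(self, packet):
        # Returns False when full: the producer must stop and wait.
        if len(self.items) >= self.capacity:
            return False
        self.items.append(packet)
        return True

    def try_pop(self):
        # Returns None when empty: the consumer has nothing to do.
        return self.items.popleft() if self.items else None

q = BoundedQueue(capacity=2)
pushed = [q.try_push(p) for p in ("a", "b", "c")]  # third push rejected
```

The slide notes that an implementation could also spill to memory instead of rejecting the push; this sketch models only the bounded case.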
11
GRAMPS Entities (2)
Queue sets: independent sub-queues
– Instanced parallelism plus mutual exclusion
– Hard to fake with just multiple queues
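One way to picture a queue set (my own minimal model, not the GRAMPS interface): a single logical queue made of independent sub-queues, where a consumer instance claims a whole sub-queue at a time. That gives mutual exclusion within a sub-queue while distinct sub-queues drain in parallel.

```python
from collections import deque

# Sketch of a queue set. All names are illustrative.
class QueueSet:
    def __init__(self, num_subqueues):
        self.subqueues = [deque() for _ in range(num_subqueues)]
        self.claimed = [False] * num_subqueues

    def push(self, key, packet):
        # e.g. key = screen-tile ID, so all fragments touching one
        # pixel region land in the same sub-queue.
        self.subqueues[key % len(self.subqueues)].append(packet)

    def claim(self):
        # Hand an unclaimed, non-empty sub-queue to a consumer
        # instance; that instance now has exclusive access to it.
        for i, sq in enumerate(self.subqueues):
            if sq and not self.claimed[i]:
                self.claimed[i] = True
                return i, sq
        return None

qs = QueueSet(4)
qs.push(0, "frag-a")
qs.push(1, "frag-b")
qs.push(0, "frag-c")
first = qs.claim()   # sub-queue 0, with both of its fragments
second = qs.claim()  # sub-queue 1, independently claimable
```

With plain multiple queues the graph would need one statically wired consumer per queue; the queue set lets the run-time instance consumers on demand, which is the "hard to fake" part.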
12
Design Goals (Reminder)
– Large application scope
– High performance
– Optimized implementations
– Multi-platform
– (Tunable)
13
What We’ve Built (System)
14
GRAMPS Scheduling
Static inputs:
– Application graph topology
– Per-queue packet ('chunk') size
– Per-queue maximum depth / high-watermark
Dynamic inputs (currently ignored):
– Current per-queue depths
– Average execution time per input packet
Simple policy: run consumers, pre-empt producers
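The "run consumers, pre-empt producers" policy can be sketched as follows. This is my own toy rendering of the stated idea, not the GRAMPS scheduler: any queue over its high-watermark forces its consumer to run; otherwise the producer feeding the emptiest queue runs.

```python
# Illustrative scheduling policy. Each queue maps to a tuple of
# (current depth, consumer stage, producer stage).
def pick_stage(queues, watermark):
    # Drain first: a queue at or over its high-watermark means its
    # consumer runs and its producer is (implicitly) pre-empted.
    over = [(depth, consumer) for depth, consumer, _ in queues.values()
            if depth >= watermark]
    if over:
        return max(over)[1]          # consumer of the fullest queue
    # Nothing urgent: refill the emptiest queue.
    return min(queues.values())[2]   # producer of the shallowest queue

queues = {
    "Ray":      (7, "Intersect", "Camera"),   # depth 7: near capacity
    "Fragment": (1, "FB Blend",  "Shade"),
}
choice = pick_stage(queues, watermark=6)      # Ray queue is over: drain it
```

The real scheduler works per execution unit across the tiered organization described on the next slide; this single-choice function only illustrates the preference ordering.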
15
GRAMPS Scheduler Organization
Tiered scheduler: Tier-N, Tier-1, Tier-0
– Tier-N only wakes idle units; no rebalancing
– All Tier-1s compete for all queued work
– 'Fat' cores: software Tier-1 per core, Tier-0 per thread
– 'Micro' cores: single shared hardware Tier-1+0
16
What We've Built (Apps)

[Figure: two application graphs. Direct3D pipeline (with ray-tracing extension): instanced IA 1 … IA N and VS 1 … VS N stages connected by Input Vertex Queues 1 … N and Primitive Queues 1 … N, feeding RO → Rast → Sample Queue Set → PS → Fragment Queue → OM; the ray-tracing extension adds a Trace stage with Ray Queue and Ray Hit Queue feeding a PS2 stage. Ray-tracing graph: Camera → Sampler (with Tiler, Tile Queue, and Sample Queue) → Intersect (with Ray Queue and Ray Hit Queue) → Shade → Fragment Queue → FB Blend. Legend: thread stages, shader stages, fixed-function stages; queues, stage outputs, push outputs.]
17
Initial Renderer Results
– Queues are small (< 600 KB CPU, < 1.5 MB GPU)
– Parallelism is good (at least 80%; all but one case 95%+)
18
Scheduling Can Clearly Improve
19
Taking Stock: High-level Questions
Is GRAMPS a suitable GPU evolution?
– Does it enable pipelines competitive with bare metal?
– Does it enable innovation: advanced / alternative methods?
Is GRAMPS a good parallel compute model?
– Does it fulfill our design goals?
20
Possible Next Steps
Simulation / hardware fidelity improvements
– Memory model, locality
GRAMPS run-time improvements
– Scheduling, run-time overheads
GRAMPS API extensions
– On-the-fly graph modification, data sharing
More applications / workloads
– REYES, physics, finance, AI, …
– Lazy/adaptive/procedural data generation
21
Design Goals (Revisited)
– Application scope: okay (only multiple renderers so far)
– High performance: so-so (limited simulation detail)
– Optimized implementations: good
– Multi-platform: good
– (Tunable: good, but that's a separate talk)
Strategy: broaden the available apps and use them to drive the performance and simulation work for now.
22
Digression: Some Kinds of Parallelism
Task (divide) and data (conquer) parallelism
– Subdivide the algorithm into a DAG (or graph) of kernels
– Data is long-lived, manipulated in place
– Kernels are ephemeral and stateless
– Kernels only get input at entry/creation
Producer-consumer (pipeline) parallelism
– Data is ephemeral: processed as it is generated
– Bandwidth or storage costs prohibit accumulation
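The contrast between the two styles can be shown in a few lines (an illustration of mine, not from the talk). In the task/data style the intermediate data is fully materialized; in the producer-consumer style each item is consumed as it is produced, so intermediate storage stays bounded.

```python
# A producer that generates squares one at a time.
def produce(n):
    for i in range(n):
        yield i * i

# Task/data style: the whole intermediate data set exists at once,
# then a second kernel transforms it in bulk.
materialized = [x + 1 for x in list(produce(5))]

# Producer-consumer style: each square is handed to the consumer as
# it is generated; no intermediate list is ever accumulated.
streamed = []
for x in produce(5):
    streamed.append(x + 1)
```

Both compute the same result; the difference that matters at scale is that the second form never holds more than one intermediate element, which is exactly why pipeline parallelism suits data too large or too transient to accumulate.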
23
Three New Graphs
"App" 1: MapReduce run-time
– Popular parallelism-rich idiom
– Enables a variety of useful apps
App 2: cloth simulation (rendering physics)
– Inspired by the PhysBAM cloth simulation
– Demonstrates basic mechanics, collision detection
– Graph is still very much a work in progress…
App 3: real-time REYES-like renderer (Kayvon)
24
MapReduce: Specific Flavour
"ProduceReduce": minimal simplifications / constraints
– Produce/Split (1:N)
– Map (1:N)
– (Optional) Combine (N:1)
– Reduce (N:M, where M << N or often M = 1)
– Sort (N:N conceptually; implementations vary)
(Aside: REYES is MapReduce, OpenGL is MapCombine)
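A toy word count makes the "ProduceReduce" stage roles concrete. The stage names follow the slide; the code itself is my own minimal illustration, not the GRAMPS MapReduce run-time.

```python
from collections import defaultdict

def produce(text):                    # Produce/Split (1:N): one word each
    return text.split()

def map_stage(word):                  # Map (1:N): emit key/value tuples
    return [(word, 1)]

def combine(pairs):                   # Optional Combine (N:1 per key)
    out = defaultdict(int)
    for key, value in pairs:
        out[key] += value
    return list(out.items())

def reduce_stage(pairs):              # Reduce (N:M, M = distinct keys here)
    return combine(pairs)             # the same fold suffices for a count

tuples = []
for word in produce("to be or not to be"):
    tuples.extend(map_stage(word))
result = dict(reduce_stage(combine(tuples)))
```

Combine is the optional pre-aggregation before Reduce; in this toy both apply the same fold, but in general Reduce may need all values for a key while Combine only shrinks traffic. Sort is omitted here.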
25
MapReduce Graph

[Figure: Produce → Initial Tuples → Map → Intermediate Tuples → (Optional) Combine → Intermediate Tuples → Reduce → Final Tuples → Sort → Output. Legend: thread stages, shader stages; queues, stage outputs, push outputs.]

– Map output is a dynamically instanced queue set
– Combine might motivate a formal reduction shader
– Reduce is an (automatically) instanced thread stage
– Sort may actually be parallelized
26
Cloth Simulation Graph

[Figure: two coupled sub-graphs. Update: Update Mesh → Proposed Update → Resolution → Resolve, with a Fast Recollide loop. Collision Detection: BVH Nodes and Moved Nodes → Broad Collide → Candidate Pairs → Narrow Collide → Collisions. Legend: thread stages, shader stages; queues, stage outputs, push outputs.]

– Update is not producer-consumer!
– Broad Phase will actually be either a (weird) shader or multiple thread instances
– Fast Recollide details are also TBD
27
That's All, Folks
Thank you for listening. Any questions?
Actively interested in new collaborators:
– Owners or experts in some application domain (or engine / run-time system / middleware)
– Anyone interested in scheduling or details of possible hardware / core configurations
TOG paper: http://graphics.stanford.edu/papers/gramps-tog/
Backup Slides / More Details
29
Designing a Good Graph
Efficiency requires "large chunks of coherent work"
Stages separate coherency boundaries:
– Frequency of computation (fan-out / fan-in)
– Memory access coherency
– Execution coherency
Queues allow repacking and re-sorting of work from one coherency regime to another.
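A sketch of what "repacking" at a queue boundary means (my own illustration): work arrives in arrival order, and the queue groups it by a coherence key, say a material ID, so the next stage sees large coherent runs instead of an interleaved mix.

```python
# Group queued packets into coherent runs before the consumer sees them.
def repack(packets, key):
    buckets = {}
    for packet in packets:
        buckets.setdefault(key(packet), []).append(packet)
    # Emit one contiguous run per key: a "large chunk of coherent work".
    return [p for k in sorted(buckets) for p in buckets[k]]

# Fragments arrive interleaved across two materials...
arrivals = [("wood", 3), ("metal", 1), ("wood", 7), ("metal", 9)]
# ...and leave the queue grouped so shading each material is coherent.
coherent = repack(arrivals, key=lambda p: p[0])
```

The payoff is execution and memory-access coherence downstream: a shader stage processing one run touches one material's textures and code path at a time.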
30
GRAMPS Interfaces
– Host/setup: create the execution graph
– Thread: stateful, singleton
– Shader: data-parallel, auto-instanced
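The two stage flavours can be contrasted in a few lines. These class shapes are my guess at a plausible host-side model, not the real GRAMPS interface: a thread stage is one long-lived instance that may keep state across packets, while a shader stage is stateless so the run-time can instance it freely per element.

```python
class ThreadStage:
    """Stateful singleton: one instance persists for the whole run."""
    def __init__(self):
        self.seen = 0              # state carried across packets

    def run(self, packet):
        self.seen += 1
        return packet

class ShaderStage:
    """Stateless, data-parallel: the run-time may auto-instance one
    copy per input element, so run() can touch no mutable state."""
    @staticmethod
    def run(element):
        return element * 2

t = ThreadStage()
ordered = [t.run(p) for p in ("a", "b", "c")]        # sequential, stateful
instanced = [ShaderStage.run(e) for e in (1, 2, 3)]  # safely parallelizable
```

Statelessness is what buys auto-instancing: because ShaderStage.run depends only on its argument, the three calls could run in any order or all at once.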
31
GRAMPS Graph Portability
Portability really means performance.
Less portable than GL/D3D:
– The GRAMPS graph is (more) hardware sensitive
More portable than bare metal:
– Enforces modularity
– Best case: just works
– Worst case: saves boilerplate
32
Possible Next Steps: Implementation
Better scheduling
– Less bursty, better slot filling
– Dynamic priorities
– Handle graphs with loops better
More detailed costs
– Bill for scheduling decisions
– Bill for (internal) synchronization
More statistics
33
Possible Next Steps: API
Important: graph modification (state change)
Probably: data sharing / ref-counting
Maybe: blocking inter-stage calls (join)
Maybe: intra/inter-stage synchronization primitives
34
Possible Next Steps: New Workloads
– REYES, hybrid graphics pipelines
– Image / video processing
– Game physics (collision detection or particles)
– Physics and scientific simulation
– AI, finance, sort, search or database query, …
– Heavy dynamic data manipulation: k-D tree / octree / BVH build; lazy/adaptive/procedural tree or geometry generation