Many-Core Programming with GRAMPS
Jeremy Sugerman, Stanford PPL Retreat, November 21, 2008

Page 1

Many-Core Programming with GRAMPS

Jeremy Sugerman
Stanford PPL Retreat
November 21, 2008

Page 2

Introduction

Collaborators: Kayvon Fatahalian, Solomon Boulos, Kurt Akeley, Pat Hanrahan
Initial work appearing in ACM TOG in January 2009

Our starting point: CPU, GPU trends… and collision? Two research areas:
– HW/SW Interface, Programming Model
– Future Graphics API

Page 3

Background

Problem Statement / Requirements: Build a programming model / primitives / building blocks to drive efficient development for and usage of future many-core machines. Handle homogeneous, heterogeneous, programmable cores, and fixed-function units.

Status Quo:
– GPU Pipeline (good for GL, otherwise hard)
– CPU / C run-time (no guidance; fast is hard)

Page 4

GRAMPS

Apps: graphs of stages and queues
– Producer-consumer, task, and data-parallelism
– Initial focus on real-time rendering

[Diagram] Ray Tracer graph: Camera → Ray Queue → Intersect → Ray Hit Queue → Shade → Fragment Queue → FB Blend
[Diagram] Raster Graphics graph: Rasterize → Shade → FB Blend, with Input and Output Fragment Queues between the stages
(Legend: Thread, Shader, and Fixed-func stages; Queue and Stage Output edges.)

Page 5

Design Goals

– Large Application Scope: preferable to roll-your-own
– High Performance: competitive with roll-your-own
– Optimized Implementations: informs HW design
– Multi-Platform: suits a variety of many-core systems

Also: Tunable – expert users can optimize their apps

Page 6

As a Graphics Evolution

Not (unthinkably) radical for ‘graphics’. Like fixed → programmable shading:
– Pipeline undergoing massive shake-up
– Diversity of new parameters and use cases

Bigger picture than ‘graphics’:
– Rendering is more than GL/D3D
– Compute is more than rendering
– Some ‘GPUs’ are losing their innate pipeline

Page 7

As a Compute Evolution (1)

Sounds like streaming: execution graphs, kernels, data-parallelism.

Streaming: “squeeze out every FLOP”
– Goals: bulk transfer, arithmetic intensity
– Intensive static analysis, custom chips (mostly)
– Bounded space, data access, execution time

Page 8

As a Compute Evolution (2)

GRAMPS: “interesting apps are irregular”
– Goals: dynamic, data-dependent code
– Aggregate work at run-time
– Heterogeneous commodity platforms

Streaming techniques fit naturally when applicable.

Page 9

GRAMPS’ Role

A ‘graphics pipeline’ is now an app! Target users: engine/pipeline/run-time authors, savvy hardware-aware systems developers.

Compared to status quo:
– More flexible, lower level than a GPU pipeline
– More guidance than bare metal
– Portability in between
– Not domain specific

Page 10

GRAMPS Entities (1)

Data access via windows into queues/memory.

Queues: dynamically allocated / managed
– Ordered or unordered
– Specified max capacity (could also spill)
– Two types: Opaque and Collection

Buffers: random access, pre-allocated
– RO, RW Private, RW Shared (not supported)
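The bounded, packet-granular queue described above can be sketched in a few lines. This is an illustrative model only, not the GRAMPS API; the class and method names (PacketQueue, reserve_push, reserve_pop) are invented here:

```python
from collections import deque

class PacketQueue:
    """Sketch of a GRAMPS-style queue: work moves in fixed-size packets
    ('chunks'), and the queue has a specified max capacity so producers
    can be throttled instead of growing without bound."""

    def __init__(self, packet_size, max_packets, ordered=True):
        self.packet_size = packet_size   # elements per packet
        self.max_packets = max_packets   # specified max capacity
        self.ordered = ordered           # ordered vs. unordered delivery
        self._packets = deque()

    def full(self):
        return len(self._packets) >= self.max_packets

    def reserve_push(self, packet):
        # A real run-time would block or pre-empt the producer here
        # (or spill), rather than raise.
        if self.full():
            raise RuntimeError("queue at capacity: producer must wait")
        assert len(packet) <= self.packet_size
        self._packets.append(packet)

    def reserve_pop(self):
        # Returns the oldest packet, or None if the queue is empty.
        return self._packets.popleft() if self._packets else None
```

The capacity check is what lets a scheduler throttle producers; spilling, as the slide notes, would be an alternative to blocking.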

Page 11

GRAMPS Entities (2)

Queue Sets: independent sub-queues
– Instanced parallelism plus mutual exclusion
– Hard to fake with just multiple queues
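A toy model of a queue set, assuming the semantics described above: one logical queue split into keyed sub-queues, where work for the same key is serialized while different keys proceed in parallel. All names here are hypothetical, not the GRAMPS API:

```python
from collections import defaultdict, deque

class QueueSet:
    """Sketch of a queue set: each key owns an independent sub-queue.
    A sub-queue is handed to at most one consumer instance at a time
    (mutual exclusion), while distinct keys can be consumed in
    parallel (instanced parallelism)."""

    def __init__(self):
        self._subqueues = defaultdict(deque)
        self._busy = set()   # keys currently held by a consumer

    def push(self, key, item):
        self._subqueues[key].append(item)

    def acquire(self):
        # Hand out one item from a sub-queue no other consumer holds.
        for key, sq in self._subqueues.items():
            if sq and key not in self._busy:
                self._busy.add(key)
                return key, sq.popleft()
        return None              # all pending work is in busy sub-queues

    def release(self, key):
        self._busy.discard(key)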

Page 12

Design Goals (Reminder)

Large Application Scope, High Performance, Optimized Implementations, Multi-Platform (Tunable)

Page 13

What We’ve Built (System)

Page 14

GRAMPS Scheduling

Static Inputs:
– Application graph topology
– Per-queue packet (‘chunk’) size
– Per-queue maximum depth / high-watermark

Dynamic Inputs (currently ignored):
– Current per-queue depths
– Average execution time per input packet

Simple Policy: Run consumers, pre-empt producers
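The “run consumers, pre-empt producers” policy can be illustrated with a tiny priority function: prefer the stage furthest downstream that has input available, so queues drain before they fill. This is a sketch under the stated static inputs, not the actual scheduler; the graph encoding is invented here:

```python
def pick_next_stage(graph, queues):
    """graph: list of (stage_name, input_queue_name) pairs ordered from
    source to sink; a source stage has input_queue_name None.
    queues: dict mapping queue name to its pending packets.
    Scans downstream-most first, so consumers run before producers."""
    for stage_name, in_q in reversed(graph):
        if in_q is None or queues.get(in_q):
            return stage_name
    return None
```

With the ray-tracer graph from Page 4, Shade runs whenever the Ray Hit Queue has work, Intersect whenever only the Ray Queue does, and Camera only when both are drained.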

Page 15

GRAMPS Scheduler Organization

Tiered scheduler: Tier-N, Tier-1, Tier-0
– Tier-N only wakes idle units, no rebalancing
– All Tier-1s compete for all queued work
– ‘Fat’ cores: software Tier-1 per core, Tier-0 per thread
– ‘Micro’ cores: single shared hardware Tier-1+0

Page 16

What We’ve Built (Apps)

[Diagram] Direct3D Pipeline (with Ray-tracing Extension): stages IA 1…N, VS 1…N, RO, Rast, PS, and OM, connected by Input Vertex Queues 1…N, Primitive Queues 1…N, a Sample Queue Set, and a Fragment Queue; the Ray-tracing Extension adds a PS2 stage with a Ray Queue, a Trace stage, and a Ray Hit Queue.

[Diagram] Ray-tracing Graph: stages Camera, Tiler, Sampler, Intersect, Shade, and FB Blend, connected by Tile, Sample, Ray, Ray Hit, and Fragment Queues.

(Legend: Thread, Shader, and Fixed-func stages; Queue, Stage Output, and Push Output edges.)

Page 17

Initial Renderer Results

– Queues are small (< 600 KB on CPU, < 1.5 MB on GPU)
– Parallelism is good (at least 80%, all but one 95+%)

Page 18

Scheduling Can Clearly Improve

Page 19

Taking Stock: High-level Questions

Is GRAMPS a suitable GPU evolution?
– Enable a pipeline competitive with bare metal?
– Enable innovation: advanced / alternative methods?

Is GRAMPS a good parallel compute model?
– Does it fulfill our design goals?

Page 20

Possible Next Steps

Simulation / hardware fidelity improvements
– Memory model, locality
GRAMPS run-time improvements
– Scheduling, run-time overheads
GRAMPS API extensions
– On-the-fly graph modification, data sharing
More applications / workloads
– REYES, physics, finance, AI, …
– Lazy/adaptive/procedural data generation

Page 21

Design Goals (Revisited)

– Application Scope: okay (only (multiple) renderers so far)
– High Performance: so-so (limited simulation detail)
– Optimized Implementations: good
– Multi-Platform: good
– (Tunable: good, but that’s a separate talk)

Strategy: Broaden the available apps and use them to drive performance and simulation work for now.

Page 22

Digression: Some Kinds of Parallelism

Task (Divide) and Data (Conquer):
– Subdivide the algorithm into a DAG (or graph) of kernels
– Data is long-lived, manipulated in place
– Kernels are ephemeral and stateless
– Kernels only get input at entry/creation

Producer-Consumer (Pipeline) Parallelism:
– Data is ephemeral: processed as it is generated
– Bandwidth or storage costs prohibit accumulation
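The producer-consumer style maps naturally onto generators: each item is consumed as it is produced, and the full stream is never materialized. An illustrative sketch, not tied to GRAMPS:

```python
def producer():
    # Data is ephemeral: each value is yielded one at a time and
    # handed straight to the consumer, never accumulated in bulk.
    for i in range(5):
        yield i

def consumer(stream):
    # Processes items as they arrive from the producer.
    return sum(x * x for x in stream)
```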

Page 23

Three New Graphs

“App” 1: MapReduce run-time
– Popular parallelism-rich idiom
– Enables a variety of useful apps

App 2: Cloth Simulation (Rendering Physics)
– Inspired by the PhysBAM cloth simulation
– Demonstrates basic mechanics, collision detection
– Graph is still very much a work in progress…

App 3: Real-time REYES-like Renderer (Kayvon)

Page 24

MapReduce: Specific Flavour

“ProduceReduce”: minimal simplifications / constraints
– Produce/Split (1:N)
– Map (1:N)
– (Optional) Combine (N:1)
– Reduce (N:M, where M << N or often M = 1)
– Sort (N:N conceptually; implementations vary)

(Aside: REYES is MapReduce, OpenGL is MapCombine)

Page 25

MapReduce Graph

– Map output is a dynamically instanced queue set
– Combine might motivate a formal reduction shader
– Reduce is an (automatically) instanced thread stage
– Sort may actually be parallelized

[Diagram] Produce → Initial Tuples → Map → Intermediate Tuples → (Optional) Combine → Intermediate Tuples → Reduce → Final Tuples → Sort → Output
(Legend: Thread and Shader stages; Queue, Stage Output, and Push Output edges.)
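A sequential sketch of the “ProduceReduce” flavour, to make the stage roles concrete. The function names are invented for illustration, Combine is omitted, and grouping by key stands in for Sort:

```python
from collections import defaultdict

def produce_reduce(inputs, produce, map_fn, reduce_fn):
    """Produce splits the input into chunks (1:N), Map emits
    (key, value) tuples (1:N), tuples are grouped by key (the
    conceptual Sort), and Reduce folds each group (N:M)."""
    intermediate = defaultdict(list)
    for chunk in produce(inputs):            # Produce/Split
        for key, value in map_fn(chunk):     # Map
            intermediate[key].append(value)  # group by key
    return {k: reduce_fn(vs) for k, vs in intermediate.items()}  # Reduce
```

In GRAMPS terms, the intermediate dict plays the role of the dynamically instanced queue set: one sub-queue per key, so each Reduce instance sees exactly one key’s tuples.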

Page 26

Cloth Simulation Graph

– Update is not producer-consumer!
– Broad Phase will actually be either a (weird) shader or multiple thread instances
– Fast Recollide details are also TBD

[Diagram] Update sub-graph: Update Mesh, Fast Recollide, and Resolve stages, with Proposed Update and Resolution queues. Collision Detection sub-graph: Broad Collide and Narrow Collide stages, with BVH Nodes, Moved Nodes, Candidate Pairs, and Collisions queues.
(Legend: Thread and Shader stages; Queue, Stage Output, and Push Output edges.)

Page 27

That’s All Folks

Thank you for listening. Any questions?

Actively interested in new collaborators:
– Owners or experts in some application domain (or engine / run-time system / middleware)
– Anyone interested in scheduling or details of possible hardware / core configurations

TOG Paper: http://graphics.stanford.edu/papers/gramps-tog/

Page 28

Backup Slides / More Details

Page 29

Designing A Good Graph

Efficiency requires “large chunks of coherent work”.

Stages separate coherency boundaries:
– Frequency of computation (fan-out / fan-in)
– Memory access coherency
– Execution coherency

Queues allow repacking and re-sorting of work from one coherency regime to another.

Page 30

GRAMPS Interfaces

– Host/Setup: create the execution graph
– Thread: stateful, singleton
– Shader: data-parallel, auto-instanced
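The thread/shader distinction above can be sketched as follows. This is illustrative only, not the real GRAMPS interfaces; the names are invented here:

```python
class ThreadStage:
    """A thread stage is a stateful singleton: one long-lived instance
    loops over its input queue and may keep state across packets."""

    def __init__(self):
        self.count = 0                 # state persists across packets

    def run(self, in_packets):
        for packet in in_packets:      # singleton loop over input
            self.count += len(packet)
        return self.count

def shader_stage(element):
    # A shader stage is stateless and data-parallel: the run-time may
    # auto-instance it per element and run instances in any order.
    return element * 2
```

Statelessness is what lets the run-time instance shader stages freely; thread stages trade that freedom for the ability to accumulate state (e.g. a frame-buffer blend).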

Page 31

GRAMPS Graph Portability

Portability really means performance.

Less portable than GL/D3D:
– The GRAMPS graph is (more) hardware sensitive

More portable than bare metal:
– Enforces modularity
– Best case: just works
– Worst case: saves boilerplate

Page 32

Possible Next Steps: Implementation

Better scheduling:
– Less bursty, better slot filling
– Dynamic priorities
– Handle graphs with loops better

More detailed costs:
– Bill for scheduling decisions
– Bill for (internal) synchronization

More statistics

Page 33

Possible Next Steps: API

– Important: graph modification (state change)
– Probably: data sharing / ref-counting
– Maybe: blocking inter-stage calls (join)
– Maybe: intra/inter-stage synchronization primitives

Page 34

Possible Next Steps: New Workloads

– REYES, hybrid graphics pipelines
– Image / video processing
– Game physics: collision detection or particles
– Physics and scientific simulation
– AI, finance, sort, search or database query, …
– Heavy dynamic data manipulation: k-D tree / octree / BVH build; lazy/adaptive/procedural tree or geometry generation