An Efficient CUDA Implementation of a Tree-Based N-Body Algorithm

Martin Burtscher
Department of Computer Science
Texas State University-San Marcos

An Efficient CUDA Implementation of a Tree-Based N-Body … · 2013. 8. 23. · More sophisticated n-body algorithms exist ... Compute forces acting on each body with help of octree


Mapping Regular Code to GPUs

Regular codes
  Operate on array- and matrix-based data structures
  Exhibit mostly strided memory access patterns
  Have relatively predictable control flow (control flow behavior is largely determined by input size)
Many regular codes are easy to port to GPUs
  E.g., matrix codes executing many ops/word
  Dense matrix operations (level 2 and 3 BLAS)
  Stencil codes (PDE solvers)
[Image credit: LLNL]

Mapping Irregular Code to GPUs

Irregular codes
  Build, traverse, and update dynamic data structures (e.g., trees, graphs, linked lists, and priority queues)
  Exhibit data-dependent memory access patterns
  Have complex control flow (control flow behavior depends on input values and changes dynamically)
Many important scientific programs are irregular
  E.g., data clustering, SAT solving, social networks, meshing, …
Need case studies to learn how to best map irregular codes
[Image credit: FSU]

Example: N-Body Simulation

Irregular Barnes Hut algorithm
  Repeatedly builds unbalanced tree & performs complex traversals on it
Our implementation
  Designed for GPUs (not just port of CPU code)
  First GPU implementation of entire BH algorithm
Results
  GPU is 21 times faster than CPU (6 cores) on this code
  GPU has better architecture for this irregular algorithm
NVIDIA GPU Gems book chapter (2011)

Outline

Introduction
Barnes Hut algorithm
CUDA implementation
Experimental results
Conclusions
[Image credit: NASA/JPL-Caltech/SSC]

N-Body Simulation

Time evolution of physical system
  System consists of bodies
  “n” is the number of bodies
  Bodies interact via pair-wise forces
Many systems can be modeled in this way
  Star/galaxy clusters (gravitational force)
  Particles (electric force, magnetic force)
[Image credits: RUG, Cornell]

Simple O(n^2) n-Body Algorithm

Algorithm
  Initialize body masses, positions, and velocities
  Iterate over time steps {
    Accumulate forces acting on each body
    Update body positions and velocities based on force
  }
  Output result
More sophisticated n-body algorithms exist
  Barnes Hut algorithm
  Fast Multipole Method (FMM)
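The simple O(n^2) algorithm above can be sketched on the CPU as follows. This is only an illustration, not the code from the talk: the `Body` layout, the function names, the choice of units with G = 1, the softening term `eps2`, and the simple velocity-then-position update are all assumptions made for the sketch.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Body {
    double x, y, z;     // position
    double vx, vy, vz;  // velocity
    double mass;
    double ax, ay, az;  // acceleration accumulated in the current time step
};

// Accumulate the acceleration on every body from every other body (O(n^2) pairs).
// Units are chosen so that G = 1; eps2 is a softening term assumed for this sketch.
void accumulate_forces(std::vector<Body>& bodies, double eps2) {
    assert(eps2 >= 0.0);
    const std::size_t n = bodies.size();
    for (std::size_t i = 0; i < n; ++i) {
        double ax = 0.0, ay = 0.0, az = 0.0;
        for (std::size_t j = 0; j < n; ++j) {
            if (i == j) continue;  // a body exerts no force on itself
            const double dx = bodies[j].x - bodies[i].x;
            const double dy = bodies[j].y - bodies[i].y;
            const double dz = bodies[j].z - bodies[i].z;
            const double r2 = dx * dx + dy * dy + dz * dz + eps2;
            const double inv_r = 1.0 / std::sqrt(r2);
            const double s = bodies[j].mass * inv_r * inv_r * inv_r;
            ax += dx * s;
            ay += dy * s;
            az += dz * s;
        }
        bodies[i].ax = ax;
        bodies[i].ay = ay;
        bodies[i].az = az;
    }
}

// Update velocities from the accumulated accelerations, then positions from the new velocities.
void advance(std::vector<Body>& bodies, double dt) {
    for (Body& b : bodies) {
        b.vx += b.ax * dt;
        b.vy += b.ay * dt;
        b.vz += b.az * dt;
        b.x += b.vx * dt;
        b.y += b.vy * dt;
        b.z += b.vz * dt;
    }
}
```

Calling `accumulate_forces` and then `advance` once per time step mirrors the two lines inside the loop on the slide. With `eps2` set to zero, two bodies at the same position would divide by zero, which is why a small positive value is commonly used.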

Barnes Hut Idea

Precise force calculation
  Requires O(n^2) operations (O(n^2) body pairs)
  Computationally intractable for large n
Barnes and Hut (1986)
  Algorithm to approximately compute forces
    Bodies’ initial position and velocity are also approximate
  Requires only O(n log n) operations
  Idea is to “combine” far away bodies
  Error should be small because force is proportional to 1/distance^2

Barnes Hut Algorithm

Set bodies’ initial position and velocity
Iterate over time steps
  1. Compute bounding box around bodies
  2. Subdivide space until at most one body per cell
     Record this spatial hierarchy in an octree
  3. Compute mass and center of mass of each cell
  4. Compute force on bodies by traversing octree
     Stop traversal path when encountering a leaf (body) or an internal node (cell) that is far enough away
  5. Update each body’s position and velocity

Build Tree (Levels 1 to 5)

Level 1: Compute bounding box around all bodies → tree root
Levels 2 to 5: Subdivide space until at most one body per cell
[Figure: bodies inside the bounding box and the corresponding tree after each level of subdivision]

Compute Cells’ Center of Mass

For each internal cell, compute sum of mass and weighted average of position of all bodies in subtree; example shows two cells only
[Figure: bodies, tree, and the centers of mass of two example cells]
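The summary of one cell can be sketched as below. `PointMass` and `summarize` are names invented for this illustration. The sketch relies on the fact that summarizing the summaries of child cells gives the same result as summarizing all bodies in the subtree directly, which is what allows a bottom-up pass over the tree.

```cpp
#include <cassert>
#include <vector>

// A body, or the summary of a cell: position (or center of mass) and mass.
struct PointMass {
    double x, y, z, mass;
};

// Combine children into one summary: total mass and mass-weighted average position.
PointMass summarize(const std::vector<PointMass>& children) {
    PointMass c{0.0, 0.0, 0.0, 0.0};
    for (const PointMass& p : children) {
        c.x += p.mass * p.x;
        c.y += p.mass * p.y;
        c.z += p.mass * p.z;
        c.mass += p.mass;
    }
    assert(c.mass > 0.0);  // a cell is assumed to contain at least one massive body
    c.x /= c.mass;
    c.y /= c.mass;
    c.z /= c.mass;
    return c;
}
```

For example, bodies of mass 1 at x = 0 and mass 3 at x = 4 summarize to a mass of 4 located at x = 3.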

Compute Forces
  Compute force, for example, acting upon green body
Compute Force (short distance)
  Scan tree depth first from left to right; green portion already completed
Compute Force (down one level)
  Red center of mass is too close, need to go down one level
Compute Force (long distance)
  Blue center of mass is far enough away
Compute Force (skip subtree)
  Therefore, entire subtree rooted in the blue cell can be skipped
[Figure: bodies and tree with the green body, the red and blue centers of mass, and the part of the tree already traversed]
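The traversal described on these slides can be sketched as follows. The slides only say "far enough away"; the test used here, cell size divided by distance below a threshold `theta`, is the commonly used Barnes Hut criterion and is an assumption of this sketch, as are the `Node` and `Accel` types, the function name, and units with G = 1. An explicit stack replaces recursion.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Node {
    double x, y, z;             // position (body) or center of mass (cell)
    double mass;                // mass (body) or total mass of the subtree (cell)
    double size;                // side length of the cell, unused for bodies
    std::vector<int> children;  // child node indices, empty for a body
};

struct Accel {
    double ax, ay, az;
    int interactions;  // number of bodies or cell summaries that contributed
};

// Acceleration at point (px, py, pz) from the tree rooted at `root`.
Accel compute_force(const std::vector<Node>& tree, int root,
                    double px, double py, double pz, double theta) {
    assert(theta >= 0.0);
    Accel a{0.0, 0.0, 0.0, 0};
    std::vector<int> stack{root};  // explicit stack instead of recursion
    while (!stack.empty()) {
        const Node& node = tree[stack.back()];
        stack.pop_back();
        const double dx = node.x - px;
        const double dy = node.y - py;
        const double dz = node.z - pz;
        const double r2 = dx * dx + dy * dy + dz * dz;
        const bool is_body = node.children.empty();
        // Far enough: size / distance < theta, compared in squared form.
        if (is_body || node.size * node.size < theta * theta * r2) {
            if (r2 == 0.0) continue;  // assumed to be the target body itself
            const double inv_r = 1.0 / std::sqrt(r2);
            const double s = node.mass * inv_r * inv_r * inv_r;
            a.ax += dx * s;
            a.ay += dy * s;
            a.az += dz * s;
            ++a.interactions;  // the subtree below this node is skipped
        } else {
            for (int c : node.children) stack.push_back(c);  // go down one level
        }
    }
    return a;
}
```

With `theta` set to 0 no cell is ever far enough, so the traversal visits every body and reproduces the exact pairwise sum; larger values skip more subtrees at the cost of accuracy.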

Pseudocode

bodySet = ...
foreach timestep do {
  bounding_box = new Bounding_Box();
  foreach Body b in bodySet {
    bounding_box.include(b);
  }
  octree = new Octree(bounding_box);
  foreach Body b in bodySet {
    octree.Insert(b);
  }
  cellList = octree.CellsByLevel();
  foreach Cell c in cellList {
    c.Summarize();
  }
  foreach Body b in bodySet {
    b.ComputeForce(octree);
  }
  foreach Body b in bodySet {
    b.Advance();
  }
}

Complexity and Parallelism

bodySet = ...
foreach timestep do {                // O(n log n) + fully ordered sequential
  bounding_box = new Bounding_Box();
  foreach Body b in bodySet {        // O(n) parallel reduction
    bounding_box.include(b);
  }
  octree = new Octree(bounding_box);
  foreach Body b in bodySet {        // O(n log n) top-down tree building
    octree.Insert(b);
  }
  cellList = octree.CellsByLevel();
  foreach Cell c in cellList {       // O(n) + partially ordered bottom-up traversal
    c.Summarize();
  }
  foreach Body b in bodySet {        // O(n log n) fully parallel
    b.ComputeForce(octree);
  }
  foreach Body b in bodySet {        // O(n) fully parallel
    b.Advance();
  }
}


Efficient GPU Code

Coalesced main memory accesses
Little thread divergence
Enough threads per block
  Not too many registers per thread
  Not too much shared memory usage
Enough (independent) blocks
  Little synchronization between blocks
Little CPU/GPU data transfer
Efficient use of shared memory
[Image credit: Thepcreport.net]

Main BH Implementation Challenges

Uses irregular tree-based data structure
  Load imbalance
  Little coalescing
Complex recursive traversals
  Recursion not allowed*
  Lots of thread divergence
Memory-bound pointer-chasing operations
  Not enough computation to hide latency
[Figure: unbalanced tree]

Six GPU Kernels

Read initial data and transfer to GPU
for each timestep do {
  1. Compute bounding box around bodies
  2. Build hierarchical decomposition, i.e., octree
  3. Summarize body information in internal octree nodes
  4. Approximately sort bodies by spatial location (optional)
  5. Compute forces acting on each body with help of octree
  6. Update body positions and velocities
}
Transfer result from GPU and output

Global Optimizations

Make code iterative (recursion not supported*)
Keep data on GPU between kernel calls
Use array elements instead of heap nodes
  One aligned array per field for coalesced accesses
[Figure: three data layouts: objects on heap, objects in array, fields in arrays]
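The difference between "objects in array" and "fields in arrays" can be sketched as below. The type and function names are invented for this illustration, and the field set is a small assumed subset. In the second layout, neighboring threads that read the same field of neighboring bodies touch consecutive addresses, which is the access pattern a GPU can coalesce.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Objects in an array: the fields of one body are adjacent in memory.
struct BodyAoS {
    float x, y, z, mass;
};

// Fields in arrays: one array per field; element i of every array belongs to body i.
struct BodiesSoA {
    std::vector<float> x, y, z, mass;
    explicit BodiesSoA(std::size_t n) : x(n), y(n), z(n), mass(n) {}
    std::size_t size() const { return mass.size(); }
};

// Convert from the object layout to the per-field layout.
BodiesSoA to_soa(const std::vector<BodyAoS>& in) {
    BodiesSoA out(in.size());
    assert(out.size() == in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out.x[i] = in[i].x;
        out.y[i] = in[i].y;
        out.z[i] = in[i].z;
        out.mass[i] = in[i].mass;
    }
    return out;
}

// A loop over one field reads a single contiguous array.
float total_mass(const BodiesSoA& b) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < b.size(); ++i) sum += b.mass[i];
    return sum;
}
```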

Global Optimizations (cont.)

Maximize thread count (round down to warp size)
Maximize resident block count (all SMs filled)
Pass kernel parameters through constant memory
Use special allocation order
Alias arrays (56 B/node)
Use index arithmetic
Persistent blocks & threads
Unroll loops over children
[Figure: tree with cells c0 to c5 and bodies b0 to ba, and its array layout "b0 b1 b2 b3 b4 b5 b6 b7 b8 b9 ba c5 c4 c3 c2 c1 c0 . . ." with the labels "bodies (fixed)" and "cell allocation direction"]
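One way to read "special allocation order" and "index arithmetic" together with the array layout in the figure is sketched below: bodies keep fixed indices at the front of one shared index space, cells are handed out from the back, and a single comparison tells the two node kinds apart. The class and its methods are hypothetical, and `std::atomic` stands in for a GPU atomic counter.

```cpp
#include <atomic>
#include <cassert>

// One index space for both node kinds: bodies occupy [0, nbodies),
// cells are handed out from the highest index downward.
class NodeIndexSpace {
public:
    NodeIndexSpace(int nbodies, int capacity)
        : nbodies_(nbodies), capacity_(capacity), next_cell_(capacity) {
        assert(0 <= nbodies && nbodies <= capacity);
    }

    // Atomically reserve the next cell index, or return -1 when the
    // cell region would run into the body region.
    int allocate_cell() {
        const int idx = next_cell_.fetch_sub(1) - 1;
        return idx >= nbodies_ ? idx : -1;
    }

    // Index arithmetic replaces a per-node type tag.
    bool is_body(int idx) const { return idx >= 0 && idx < nbodies_; }
    bool is_cell(int idx) const { return idx >= nbodies_ && idx < capacity_; }

private:
    int nbodies_;
    int capacity_;
    std::atomic<int> next_cell_;
};
```

With 11 bodies and room for 17 nodes, as in the figure, the first cell receives index 16 and six cells fit before the space is exhausted.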

Kernel 1: Bounding Box (Regular)

Reduction operation
Optimizations
  Fully coalesced
  Fully cached
  No bank conflicts
  Minimal divergence
  Built-in min and max
  2 red/mem, 6 red/bar
  Bodies load balanced
  512*3 threads per SM
[Figure: reduction from main memory into shared memory, with the number of active threads and warps shrinking between barriers]
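The shape of such a reduction can be sketched sequentially as below for one coordinate. This is an illustration only: the function name and the `Interval` type are invented, each pass of the inner loop stands for the work that threads would do between two barriers, and the two local vectors play the role of shared memory.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Interval {
    float min, max;
};

// Tree-style reduction: in every round, element i is combined with element
// i + half, so the number of active elements roughly halves per round.
Interval reduce_min_max(const std::vector<float>& values) {
    assert(!values.empty());
    std::vector<float> lo(values), hi(values);
    std::size_t n = values.size();
    while (n > 1) {
        const std::size_t half = (n + 1) / 2;  // also handles odd sizes
        for (std::size_t i = 0; i + half < n; ++i) {
            lo[i] = std::min(lo[i], lo[i + half]);
            hi[i] = std::max(hi[i], hi[i + half]);
        }
        n = half;  // on a GPU, a barrier would separate the rounds
    }
    return {lo[0], hi[0]};
}
```

Running it once per coordinate yields the bounding box. Within a round, writes go to indices below `half` and reads come from indices at or above it, so the iterations of the inner loop are independent.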

Kernel 2: Build Octree (Irregular)

Top-down tree building
Optimizations
  Only lock leaf “pointers”
  Lock-free fast path
  Light-weight lock release on slow path
  No re-traverse after lock acquire failure
  Combined memory fence per block
  Re-compute position during traversal
  Separate init kernels for coalescing
  512*3 threads per SM

Kernel 2: Build Octree (cont.)

// initialize
cell = find_insertion_point(body); // no locks, cache cell
child = get_insertion_index(cell, body);
if (child != locked) { // skip atomic if already locked
  if (child == null) { // fast path (frequent)
    if (null == atomicCAS(&cell[child], null, body)) { // lock-free insertion
      // move on to next body
    }
  } else { // slow path (first part)
    if (child == atomicCAS(&cell[child], child, lock)) { // acquire lock
      // build subtree with new and existing body
      flag = true;
    }
  }
}
__syncthreads(); // barrier
if (threadIdx == 0) __threadfence(); // push data out if L1 cache
__syncthreads(); // barrier
if (flag) { // slow path (second part)
  cell[child] = new_subtree; // insert subtree and release lock
  // move on to next body
}
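The fast path and the lock acquisition on a single child slot can be sketched in portable C++ as below, with `std::atomic` compare-and-swap standing in for `atomicCAS`. The sentinel values, the result enum, and both function names are assumptions of this sketch. It covers only one slot and leaves out the tree descent, the barriers, and the memory fence shown in the slide code.

```cpp
#include <atomic>
#include <cassert>

constexpr int kEmpty = -1;   // the child slot holds no node
constexpr int kLocked = -2;  // the child slot is being rebuilt by another thread

enum class InsertResult { Inserted, LockAcquired, Retry };

// Try to place `body` into one child slot. On LockAcquired, `displaced`
// receives the body that was in the slot; the caller builds a subtree
// holding both bodies and then publishes it.
InsertResult try_insert(std::atomic<int>& slot, int body, int& displaced) {
    assert(body >= 0);
    const int seen = slot.load();
    if (seen == kLocked) return InsertResult::Retry;  // skip atomic if already locked
    int expected = seen;
    if (seen == kEmpty) {
        // Fast path: lock-free insertion into an empty slot.
        return slot.compare_exchange_strong(expected, body)
                   ? InsertResult::Inserted
                   : InsertResult::Retry;
    }
    // Slow path, first part: lock the slot that currently holds a body.
    if (slot.compare_exchange_strong(expected, kLocked)) {
        displaced = seen;
        return InsertResult::LockAcquired;
    }
    return InsertResult::Retry;
}

// Slow path, second part: storing the new subtree index releases the lock.
void release_with_subtree(std::atomic<int>& slot, int subtree) {
    slot.store(subtree);
}
```

A thread that receives `Retry` simply tries again later, which corresponds to the slide's point that no re-traversal is needed after a failed lock acquisition.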

Kernel 3: Summarize Subtrees (Irregular)

Bottom-up tree traversal
Optimizations
  Scan avoids deadlock
  Use mass as flag + fence
    No locks, no atomics
  Use wait-free first pass
  Cache the ready information
  Piggyback on traversal
    Count bodies in subtrees
  No parent “pointers”
  128*6 threads per SM
[Figure: tree and its array layout, with the allocation direction and the opposite scan direction marked]
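The "mass as flag" idea can be sketched sequentially as below. The sketch assumes that a negative mass marks a cell whose summary is not yet available, that cells allocated later (deeper in the tree) have lower indices, and that positions are one-dimensional to keep it short; the `Tree` type and function name are invented. In a parallel version, a thread that finds a child not ready would have to wait or revisit the cell, and a memory fence would be needed so that the position becomes visible before the mass.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Tree {
    int nbodies;                             // nodes [0, nbodies) are bodies, the rest are cells
    std::vector<double> mass;                // a negative mass marks a cell that is not ready
    std::vector<double> x;                   // 1-D position or center of mass
    std::vector<std::vector<int>> children;  // child indices per node, empty for bodies
};

// One scan over the cells opposite to the allocation direction, so that
// children are visited before their parents.
// Returns true if every cell could be summarized in this single scan.
bool summarize_cells(Tree& t) {
    assert(t.nbodies >= 0);
    bool all_ready = true;
    for (std::size_t c = static_cast<std::size_t>(t.nbodies); c < t.mass.size(); ++c) {
        double m = 0.0, mx = 0.0;
        bool ready = true;
        for (int ch : t.children[c]) {
            if (t.mass[ch] < 0.0) {  // child not summarized yet
                ready = false;
                break;
            }
            m += t.mass[ch];
            mx += t.mass[ch] * t.x[ch];
        }
        if (!ready) {
            all_ready = false;
            continue;
        }
        if (m > 0.0) t.x[c] = mx / m;
        t.mass[c] = m;  // written last: a non-negative mass doubles as the ready flag
    }
    return all_ready;
}
```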

Kernel 4: Sort Bodies (Irregular)

Top-down tree traversal
Optimizations (Similar to Kernel 3)
  Scan avoids deadlock
  Use data field as flag
    No locks, no atomics
  Use counts from Kernel 3
  Piggyback on traversal
    Move nulls to back
  Throttle flag polling requests with optional barrier
  64*6 threads per SM
[Figure: tree and its array layout, with the allocation direction and the scan direction marked]

Kernel 5: Force Calculation (Irregular)

Multiple prefix traversals
Optimizations
  Group similar work together
    Uses sorting to minimize size of prefix union in each warp
  Early out (nulls in back)
  Traverse whole union to avoid divergence (warp voting)
  Lane 0 controls iteration stack for entire warp (fits in shared mem)
  Avoid unneeded volatile accesses
  Cache tree-level-based data
  256*5 threads per SM

Architectural Support

Coalesced memory accesses & lockstep execution
  All threads in warp read same tree node at same time
  Only one mem access per warp instead of 32 accesses
Warp-based execution
  Enables data sharing in warps without synchronization
RSQRTF instruction
  Quickly computes good approximation of 1/sqrtf(x)
Warp voting instructions
  Quickly perform reduction operations within a warp

Kernel 6: Advance Bodies (Regular)

Straightforward streaming
Optimizations
  Fully coalesced, no divergence
  Load balanced, 1024*1 threads per SM
[Figure: warps streaming data from main memory back to main memory]


Evaluation Methodology

Implementations
  CUDA: irregular Barnes Hut & regular O(n^2) algorithm
  OpenMP: Barnes Hut algorithm (derived from CUDA code)
  Pthreads: Barnes Hut algorithm (from SPLASH-2 suite)
Systems and compilers
  nvcc 4.0 (-O3 -arch=sm_20 -ftz=true*)
    GeForce GTX 480, 1.4 GHz, 15 SMs, 32 cores per SM
  gcc 4.1.2 (-O3 -fopenmp* -ffast-math*)
    Xeon X5690, 3.46 GHz, 6 cores, 2 threads per core
Inputs and metric
  5k, 50k, 500k, and 5M star clusters (Plummer model)
  Best runtime of three experiments, excluding I/O

Nodes Touched per Activity (5M Input)

Kernel “activities”
  K1: pair reduction
  K2: tree insertion
  K3: bottom-up step
  K4: top-down step
  K5: prefix traversal
  K6: integration step
Max tree depth ≤ 22
Cells have 3.1 children
Prefix ≤ 6,315 nodes (≤ 0.1% of 7.4M)
  BH algorithm & sorting to minimize size of prefix union work well

neighborhood size
              min        avg      max
kernel 1        1        2.0        2
kernel 2        2       13.2       22
kernel 3        2        4.1        9
kernel 4        2        4.1        9
kernel 5      818    4,117.0    6,315
kernel 6        1        1.0        1

Available Amorphous Data Parallelism

[Figure: available parallelism [independent activities] (log scale, 1 to 10,000,000) versus kernel number and iteration (k1.1 to k1.23, k2.1 to k2.19, k3.1 to k3.20, k4.1 to k4.20, k5.1, k6.1) for 5,000, 50,000, 500,000, and 5,000,000 bodies; the kernels are labeled bounding box, tree building, summarization, sorting, force calc., and integration]

Runtime Comparison

GPU BH inefficiency
  5k input too small for 5,760 to 23,040 threads
BH vs. O(n^2) algorithm
  Regular O(n^2) code is faster with fewer than about 15,000 bodies
GPU vs. CPU (5M input)
  21.1x faster than OpenMP
  23.2x faster than Pthreads
[Figure: runtime per timestep [ms] (log scale) versus number of bodies (5k, 50k, 500k, 5M) for GPU CUDA, GPU O(n^2), CPU OpenMP, and CPU Pthreads]
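The range of 5,760 to 23,040 threads is consistent with the per-kernel "threads per SM" figures on the kernel slides and the 15 SMs of the GTX 480, as the arithmetic below shows. Reading "512*3" as threads per block times resident blocks per SM is an assumption, and kernel 2 appears in the table under the same assumption. The struct and constant names are invented for this sketch.

```cpp
#include <cassert>

struct LaunchConfig {
    const char* kernel;
    int threads_per_block;  // first factor on the slides (assumed meaning)
    int blocks_per_sm;      // second factor on the slides (assumed meaning)
};

constexpr int kNumSMs = 15;  // GeForce GTX 480

constexpr LaunchConfig kConfigs[6] = {
    {"kernel 1", 512, 3},   // bounding box
    {"kernel 2", 512, 3},   // tree building
    {"kernel 3", 128, 6},   // summarization
    {"kernel 4", 64, 6},    // sorting
    {"kernel 5", 256, 5},   // force calculation
    {"kernel 6", 1024, 1},  // integration
};

// Total number of resident threads across the whole GPU for one kernel.
constexpr int total_threads(const LaunchConfig& c) {
    return c.threads_per_block * c.blocks_per_sm * kNumSMs;
}
```

Kernel 4 gives the minimum, 64 * 6 * 15 = 5,760, and kernels 1 and 2 give the maximum, 512 * 3 * 15 = 23,040, so an input of 5,000 bodies cannot keep every thread busy.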

Kernel Performance for 5M Input

                 kernel 1   kernel 2   kernel 3   kernel 4   kernel 5   kernel 6  BarnesHut     O(n^2)
Gflops/s             71.6        5.8        2.5        n/a      240.6       33.5      228.4      897.0
GB/s                142.9       26.8       10.6       12.8        8.0      133.9        8.8        2.8
runtime [ms]          0.4       44.6       28.0       14.2     1641.2        2.2     1730.6   557421.5

Non-compliant fast single-precision version (runtimes in ms, as in the table above)
                 kernel 1   kernel 2   kernel 3   kernel 4   kernel 5   kernel 6
X5690 CPU             5.5      185.7       75.8       52.1   38,540.3       16.4
GTX 480 GPU           0.4       44.6       28.0       14.2    1,641.2        2.2
CPU/GPU              13.1        4.2        2.7        3.7       23.5        7.3

IEEE 754-compliant double-precision version (runtimes in ms)
                 kernel 1   kernel 2   kernel 3   kernel 4   kernel 5   kernel 6
X5690 CPU            10.3      193.1      101.0       51.6   47,706.4       33.1
GTX 480 GPU           0.8       46.7       31.0       14.2    7,714.6        4.2
CPU/GPU              12.7        4.1        3.3        3.6        6.2        7.9

$400 GPU delivers 228 GFlops/s on irregular code
GPU chip is 2.7 to 23.5 times faster than CPU chip
  GPU hardware is better suited for running BH than CPU hw is
  But more difficult and time consuming to program

Kernel Speedups

Optimizations that are generally applicable
bodies       avoid volatile   rsqrtf instr.   recalc. data   thread voting   full multi-threading
50,000                1.14x           1.43x          0.99x           2.04x                 20.80x
500,000               1.19x           1.47x          1.32x           2.49x                 27.99x
5,000,000             1.18x           1.46x          1.69x           2.47x                 28.85x

Optimizations for irregular kernels
bodies       throttling barrier   waitfree pre-pass   combined mem fence   sorting of bodies   sync'ed execution
50,000                    0.97x               1.02x                1.54x               3.60x               6.23x
500,000                   1.03x               1.21x                1.57x               6.28x               8.04x
5,000,000                 1.04x               1.31x                1.50x               8.21x               8.60x


Optimization Summary

Minimize thread divergence
  Group similar work together, force synchronicity
Reduce main memory accesses
  Share data within warp, combine memory fences & traversals, re-compute data, avoid volatile accesses
Implement entire algorithm on and for GPU
  Avoid data transfers & data structure inefficiencies, wait-free pre-pass, scan entire prefix union

Optimization Summary (cont.)

Exploit hardware features
  Fast synchronization and thread startup, special instructions, coalesced memory accesses, even lockstep execution
Use light-weight locking and synchronization
  Minimize locks, reuse fields, use fence + store ops
Maximize parallelism
  Parallelize every step within and across SMs

Conclusions

Irregularity does not necessarily prevent high performance on GPUs
  Entire Barnes Hut algorithm implemented on GPU
  Builds and traverses unbalanced octree
  GPU is 22.5 times (float) and 6.2 times (double) faster than high-end hyperthreaded 6-core Xeon
Code directly for GPU, do not merely adjust CPU code
  Requires different data and code structures
  Benefits from different algorithmic modifications

Acknowledgments

Collaborators
  Ricardo Alves (Universidade do Minho, Portugal)
  Molly O’Neil (Texas State University-San Marcos)
  Keshav Pingali (University of Texas at Austin)
Hardware and funding
  NVIDIA Corporation