Parallel Graph Algorithms
Rupesh Nasre, IIT Madras
NITK, February 2016
2
Graphs
● Where do we encounter graphs?
  ● Social networks, road connections, molecular interactions, planetary forces, …
  ● snap, florida, dimacs, konect, ...
● Why treat them separately?
  ● They provide structural information.
  ● They can be processed more efficiently.
● What challenges do they pose?
  ● Load imbalance, poor locality, …
  ● Irregularity
3
What is IrReguLariTy?
● Data-access or control patterns are unpredictable at compile time.

Irregular data-access:

    int a[N], b[N], c[N];
    readinput(a);
    c[5] = b[a[4]];

Irregular control-flow:

    int a[N];
    readinput(a);
    if (a[4] > 30) {
        ...
    }

Pointer-based data structures often contribute to irregularity.
Needs dynamic techniques.
4
Examples of Irregular Algorithms
[Figure: four example irregular algorithms: Delaunay mesh refinement on a triangulated mesh, points-to analysis (a = &x; b = &y; p = &a; *p = b; c = a), minimum spanning tree computation on a weighted graph, and survey propagation on a factor graph of clauses C1..C5 and variables X1..X5]

Delaunay Mesh Refinement, Points-to Analysis, Minimum Spanning Tree Computation, Survey Propagation
5
Regular vs. Irregular Algorithms
[Figure: matrix multiplication vs. shortest-paths computation on a weighted graph]

Matrix Multiplication: by knowing the matrix size and its starting address, and without knowing its values, we can determine the dynamic behavior.

Shortest Paths Computation: the dynamic behavior depends on the input graph. This results in pessimistic synchronization.
6
Irregularity is a Matter of Degree
[Scatter plot: control-flow irregularity (0% to 4.0%) vs. memory-access irregularity (10% to 80%) across applications]
8
Dynamic Techniques

[Figure: graph (nodes a to j) with an active node and its neighborhood highlighted]
Non-overlapping neighborhoods can be processed in parallel.
Overlapping neighborhoods require synchronization.
Leads to optimistic and cautious parallelizations.
Question: Given an algorithm, can you schedule its parallel processing such that neighborhoods never overlap?
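One cautious answer to the scheduling question is to pick, in each round, a batch of active nodes whose neighborhoods are pairwise disjoint. The following is an illustrative sequential sketch (the function names and the greedy selection policy are assumptions, not the talk's actual scheduler):

```cpp
#include <cassert>
#include <set>
#include <vector>

// Two active nodes can be processed in parallel only if their
// neighborhoods (sets of node ids) do not overlap.
bool neighborhoodsOverlap(const std::set<int>& a, const std::set<int>& b) {
    for (int n : a)
        if (b.count(n)) return true;   // shared node => conflict
    return false;
}

// Greedily select a conflict-free batch of active nodes for one
// parallel round; remaining nodes wait for a later round.
std::vector<int> selectParallelBatch(const std::vector<std::set<int>>& nbhd) {
    std::vector<int> batch;
    for (int i = 0; i < (int)nbhd.size(); ++i) {
        bool ok = true;
        for (int j : batch)
            if (neighborhoodsOverlap(nbhd[i], nbhd[j])) { ok = false; break; }
        if (ok) batch.push_back(i);
    }
    return batch;
}
```

Here nodes 0 and 2 (neighborhoods {1,2} and {4,5}) can run together, while node 1 ({2,3}) must wait.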
9
Parallelism Approaches

● Manual
● Libraries
  ● Galois, Ligra, LonestarGPU, Totem, ...
● Domain-Specific Languages
  ● Green-Marl, Elixir, BlackBuck, ...

[Figure: the approaches span a productivity vs. performance trade-off]
10
Specifying Parallelism
● Do not specify.
  ● Sequential input, completely automated; currently very challenging in general
● Implicit parallelism
  ● aggregates, aggregate functions, value-based processing, ...
● Explicit parallelism
  ● pthreads, MPI, ...
11
Identifying Dependence
    for (ii = 0; ii < 10; ++ii) {
        a[2 * ii] = ... a[2 * ii + 1] ...
    }

Is there a flow dependence between different iterations?

Dependence equations:

    0 <= ii_w < ii_r < 10
    2 * ii_w = 2 * ii_r + 1

which can be written as

    0 <= ii_w
    ii_w <= ii_r - 1
    ii_r <= 9
    2 * ii_w <= 2 * ii_r + 1
    2 * ii_r + 1 <= 2 * ii_w

or, in matrix form,

    [ -1   0 ]                [  0 ]
    [  1  -1 ]   [ ii_w ]     [ -1 ]
    [  0   1 ] * [ ii_r ] <=  [  9 ]
    [  2  -2 ]                [  1 ]
    [ -2   2 ]                [ -1 ]

Dependence exists if the system has an integer solution.
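For a loop this small, the system can simply be checked exhaustively (a sketch; real compilers use integer linear programming or the Omega test rather than brute force). The writes touch even indices and the reads touch odd indices, so the equation 2*ii_w = 2*ii_r + 1 has no integer solution and no flow dependence exists:

```cpp
#include <cassert>

// Brute-force check of the dependence system above: does any pair
// (ii_w, ii_r) with 0 <= ii_w < ii_r <= 9 make the written location
// 2*ii_w equal the read location 2*ii_r + 1?
bool flowDependenceExists() {
    for (int iw = 0; iw <= 9; ++iw)
        for (int ir = iw + 1; ir <= 9; ++ir)
            if (2 * iw == 2 * ir + 1)   // even value can never equal odd value
                return true;
    return false;
}
```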
12
Outline
● Introduction
➔ Graph algorithms on GPUs
● Morph algorithms on GPUs
● Synthesizing graph algorithms for GPUs
13
What is a GPU?
● Graphics Processing Unit
● Separate piece of hardware connected to the CPU over a bus
● Separate address space from that of the CPU
● Massive multithreading
● Warp-based execution
14
What is a Warp?
Source: Wikipedia
15
GPU Computation Hierarchy
[Figure: hierarchy of threads, warps, blocks, multi-processors, and the GPU]

    Thread           1
    Warp             32
    Block            1,024
    Multi-processor  tens of thousands
    GPU              hundreds of thousands
16
Challenges with GPUs
● Warp-based execution
● Locking is expensive
● Dynamic memory allocation is costly
● Limited data-cache
● Programmability issues
  ● separate address space
  ● no recursion support
  ● complex computation hierarchy
  ● exposed memory hierarchy
  ● ...
17
Challenges in Graph Algorithms
● Synchronization
  ● locks are prohibitively expensive on GPUs
  ● atomic instructions quickly become expensive
● Memory latency
  ● locality is difficult to exploit
  ● low caching support
● Thread-divergence
  ● work done per node varies with graph structure
● Uncoalesced memory accesses
  ● warp-threads access arbitrary graph elements
18
Irregular Algorithms on GPUs
[Figure: weighted example graph for single-source shortest paths]

Single-source shortest paths:

    sssp(src):
        dsrc = distance[src]
        forall edges (src, dst, wt)
            relaxEdge(src, dst, wt)
19
Irregular Algorithms on GPUs
[Figure: weighted example graph]

Single-source shortest paths:

    sssp(src):
        dsrc = distance[src]
        forall edges (src, dst, wt)
            ddst = distance[dst]
            altdist = dsrc + wt
            if altdist < ddst
                distance[dst] = altdist

This is a memory-bound kernel.

Kernel unrolling: instead of operating on a node, a thread operates on a subgraph.
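The sssp pseudocode above can be written out as a sequential CPU sketch (a Bellman-Ford-style relaxation over an edge list; names and the edge-list layout are illustrative, not the GPU kernel itself):

```cpp
#include <cassert>
#include <limits>
#include <tuple>
#include <vector>

const int INF = std::numeric_limits<int>::max();

// Sequential sketch of the sssp kernel: repeatedly relax every edge
// (src, dst, wt) until no distance changes in a full pass.
std::vector<int> sssp(int n, const std::vector<std::tuple<int, int, int>>& edges,
                      int source) {
    std::vector<int> dist(n, INF);
    dist[source] = 0;
    bool changed = true;
    while (changed) {
        changed = false;
        for (const auto& [src, dst, wt] : edges) {
            if (dist[src] == INF) continue;      // unreached source node
            int alt = dist[src] + wt;            // altdist = dsrc + wt
            if (alt < dist[dst]) {               // if altdist < ddst
                dist[dst] = alt;                 //   distance[dst] = altdist
                changed = true;
            }
        }
    }
    return dist;
}
```

On the GPU, one thread handles one edge (or one node), and the `changed` flag becomes the host-polled termination variable shown later in the hand-written BFS code.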
20
Improving Synchronization
[Figure: push-based vs. pull-based updates; threads tfive and tseven concurrently update a shared node's distance]

Atomic-free update leads to the lost-update problem; it is corrected by topology-driven processing, exploiting monotonicity.
21
Irregular Algorithms on GPUs
[Figure: weighted example graph]

Breadth-first search, Barnes-Hut n-body simulation, single-source shortest paths:

● Better memory layout
● Kernel unrolling
● Local worklists
● Improved synchronization

    Application  Speedup
    BFS          48
    BH           90
    SSSP         45
22
Outline
● Introduction
● Graph algorithms on GPUs
  ● e.g., breadth-first search, shortest paths
➔ Morph algorithms on GPUs
  ➔ e.g., mesh refinement, pointer analysis
● Synthesizing irregular algorithms for GPUs
23
Identify the Celebrity
Source: Wikipedia
24
What is a morph?
Source: Wikipedia
25
Examples of Morph Algorithms
[Figure: four example morph algorithms: Delaunay mesh refinement on a triangulated mesh, points-to analysis (a = &x; b = &y; p = &a; *p = b; c = a), minimum spanning tree computation on a weighted graph, and survey propagation on a factor graph of clauses C1..C5 and variables X1..X5]

Delaunay Mesh Refinement, Points-to Analysis, Minimum Spanning Tree Computation, Survey Propagation
26
TAO Classification
Algorithm classes on multi-core and GPU:

● Regular (matrix multiplication, stencils)
● Irregular, static-graph (breadth-first search, shortest paths)
● Morph (mesh refinement, survey propagation)

The TAO of Parallelism in Algorithms, Pingali et al. [PLDI 2011]
27
First Attempt
● Ported multi-core Galois benchmarks to CUDA.
● The GPU version's performance was comparable to that of the single-threaded version.

[Chart: execution time in seconds vs. number of CPU threads (1 to 48) for Serial (Shewchuk, Berkeley), GPU Ported, and Multicore Galois; DMR on a mesh with 10 million triangles and around 50% bad triangles]
28
Challenges in Morph Algorithms

● Synchronization
  ● locks are prohibitively expensive on GPUs
  ● atomic instructions quickly become expensive
● Memory allocation
  ● changing graph structure requires new strategies
  ● memory requirement cannot be predicted
● Load imbalance
  ● different modifications to different parts of the graph
  ● work done per node changes dynamically
  ● leads to thread-divergence and uncoalesced memory accesses
29
Performance Considerations
1. Graph representation
2. Subgraph addition
3. Subgraph deletion
4. Conflict resolution
5. Load balancing
30
Graph Representation (1)

Per-benchmark data structures:

● Delaunay Mesh Refinement: points, triangles, neighbors
● Survey Propagation: clauses-to-literals, literals-to-clauses
● Points-to Analysis: pointers, points-to sets
● Minimum Spanning Tree Computation: graph, edges, components
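Topology arrays like the ones above are commonly stored in compressed sparse row (CSR) form on GPUs, since it packs each node's neighbors contiguously for coalesced access. A minimal host-side sketch (struct and field names are illustrative assumptions):

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Minimal CSR graph: rowStart[v] .. rowStart[v+1] indexes the slice of
// colIdx that holds v's outgoing neighbors.
struct CSRGraph {
    std::vector<int> rowStart;  // size = numNodes + 1
    std::vector<int> colIdx;    // size = numEdges
};

CSRGraph buildCSR(int n, const std::vector<std::pair<int, int>>& edges) {
    CSRGraph g;
    g.rowStart.assign(n + 1, 0);
    for (auto& e : edges) ++g.rowStart[e.first + 1];        // out-degree histogram
    for (int v = 0; v < n; ++v)
        g.rowStart[v + 1] += g.rowStart[v];                 // prefix sum -> offsets
    g.colIdx.resize(edges.size());
    std::vector<int> next(g.rowStart.begin(), g.rowStart.end() - 1);
    for (auto& e : edges) g.colIdx[next[e.first]++] = e.second;  // scatter edges
    return g;
}

int outDegree(const CSRGraph& g, int v) { return g.rowStart[v + 1] - g.rowStart[v]; }
```

On the GPU the two flat arrays transfer with two cudaMemcpy calls, which is one reason CSR-style layouts dominate over pointer-based adjacency lists there.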
35
Subgraph Addition (2)

● Pre-allocation (e.g., MST, DMR)
  ● When the final graph has bounded size
  ● Overallocation
● Host-only (e.g., PTA)
  ● When the next size is predictable on the host; global allocation
  ● May be expensive, but can be optimized
● Kernel-host (e.g., DMR)
  ● When the next size is predictable; global allocation
  ● Size computation can piggyback
● Kernel-only (e.g., PTA)
  ● General purpose; local allocation
  ● Expensive, but can be optimized
36
Performance

[Chart: execution time in seconds vs. number of CPU threads (1 to 48) for Serial (Shewchuk, Berkeley), Multi-core Galois, GPU Ported, and GPU Optimized]

    Application  Speedup
    DMR          80
    MST          20
    PTA          23
    SP           25

1. Graph representation
2. Subgraph addition
3. Subgraph deletion
4. Conflict resolution
5. Load balancing
37
GPU Optimization Principles

[Diagram: GPU optimization principles grouped by computation, memory, and synchronization]

● Computation: algorithm selection, work sorting, work chunking, communication onto computation, following the parallelism profile, pipelined computation
● Memory: kernel transformations, data grouping, exploiting the memory hierarchy
● Synchronization: avoiding synchronization, coarsening synchronization, race-and-resolve mechanism, combining synchronization
38
GPU Optimization Principles
These optimization principles are critical for high-performing irregular GPU computations.
39
Barrier-based Synchronization
➔ Disjoint accesses
● Overlapping accesses
● Benign overlaps

Consider threads pushing elements into a worklist:
● atomic per element: O(e) atomics
● atomic per thread: O(t) atomics
● prefix-sum: O(log t) barriers
40
Barrier-based Synchronization
● Disjoint accesses
➔ Overlapping accesses
● Benign overlaps

Consider threads trying to own a set of elements (race and resolve):
● atomic per element
● non-atomic mark, then check (AND over the set), e.g., for owning cavities in Delaunay mesh refinement
● prioritized non-atomic mark, then check (OR), e.g., for inserting unique elements into a worklist
41
Barrier-based Synchronization
● Disjoint accesses
● Overlapping accesses
➔ Benign overlaps

Consider threads updating shared variables to the same value, e.g., level-by-level breadth-first search: the update can be performed with atomics or without atomics, since every writer stores the identical value.

[Figure: BFS tree with root a, children b, c, d, and grandchildren e, f, g]
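A sequential sketch of the benign-overlap idea in level-synchronous BFS follows (names are illustrative). In the parallel version, several threads may set the same node's level in the same round, but they all write curLevel + 1, so the race is harmless and no atomics are required:

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Level-synchronous BFS. In a parallel round, concurrent writers to
// level[dst] all store the same value (cur + 1): a benign race.
std::vector<int> bfsLevels(int n, const std::vector<std::pair<int, int>>& edges,
                           int src) {
    std::vector<int> level(n, -1);   // -1 = not yet visited
    level[src] = 0;
    for (int cur = 0;; ++cur) {
        bool changed = false;
        for (auto& e : edges) {
            if (level[e.first] == cur && level[e.second] == -1) {
                level[e.second] = cur + 1;   // same value from every writer
                changed = true;
            }
        }
        if (!changed) break;
    }
    return level;
}
```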
42
Exploiting Algebraic Properties
➔ Monotonicity
● Idempotency
● Associativity

Consider threads updating distances in a shortest-paths computation (threads tfive and tseven update a shared node): an atomic-free update can suffer the lost-update problem. Topology-driven processing corrects this by exploiting monotonicity: distances only decrease, so a lost smaller value is simply re-applied in a later round.
43
Exploiting Algebraic Properties
● Monotonicity
➔ Idempotency
● Associativity

Consider threads updating distances in a shortest-paths computation:
● a node updated by multiple threads
● multiple instances of a node in the worklist
● the same node processed by multiple threads

Because the relaxation is idempotent, such repeated processing does not affect correctness.
44
Exploiting Algebraic Properties
● Monotonicity
● Idempotency
➔ Associativity

Consider threads pushing points-to information to a pointer node: threads t1..t4 contribute the sets {x}, {y}, {z, v}, and {m, n}, which combine into {x, y, z, v, m, n}. Associativity helps push the information using a prefix-sum.
45
Other Synchronization Methods
➔ Partitioning
● Scatter-gather

Delaunay Mesh Refinement:
● Layout-based graph partitioning is expensive.
● Partitioning time may be comparable to the overall running time.

[Figure: push-based vs. pull-based updates]
Partitioning enables a pull-based approach.
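The push/pull distinction can be sketched on the CPU (an illustrative assumption, not the talk's kernel): in a push-based update a node writes into its neighbors, which may be shared, while in a pull-based update each node reads its in-neighbors and writes only its own slot, so once nodes are partitioned among threads no two threads ever write the same location:

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// One pull-based relaxation round: node v updates ONLY dist[v] by
// reading its in-neighbors, so parallel owners never conflict.
// inNbr[v] lists (u, wt) pairs for edges u -> v.
bool pullRound(std::vector<int>& dist,
               const std::vector<std::vector<std::pair<int, int>>>& inNbr) {
    const int INF = 1 << 29;
    bool changed = false;
    for (size_t v = 0; v < dist.size(); ++v) {   // each v owned by one thread
        int best = dist[v];
        for (const auto& [u, wt] : inNbr[v])
            if (dist[u] < INF) best = std::min(best, dist[u] + wt);
        if (best < dist[v]) { dist[v] = best; changed = true; }
    }
    return changed;
}
```

The trade-off: pull avoids atomics entirely but may re-read unchanged neighbors, whereas push touches only active edges but needs synchronization on shared destinations.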
46
Other Synchronization Methods
● Partitioning
➔ Scatter-gather

Consider threads pushing elements into a worklist:
● atomic per element: O(e) atomics
● atomic per thread: O(t) atomics
● prefix-sum: O(log t) barriers
● scatter, then gather
47
Outline
● Introduction
● Graph algorithms on GPUs
● Morph algorithms on GPUs
➔ Synthesizing irregular algorithms for GPUs
48
Hand-Written BFS
    ...
    hedgessrcdst = (unsigned int *)malloc(nedges * sizeof(unsigned int));
    hpsrc = (unsigned int *)calloc(nnodes, sizeof(unsigned int));
    hdist = (float *)malloc(nnodes * sizeof(float));
    ...
    // Read graph.
    ...

Memory allocation on host.
49
Hand-Written BFS

    ...
    cudaMalloc((void **)&edgessrcdst, nedges * sizeof(unsigned int));
    cudaMalloc((void **)&psrc, nnodes * sizeof(unsigned int));

    cudaMemcpy(edgessrcdst, hedgessrcdst, nedges * sizeof(unsigned int), cudaMemcpyHostToDevice);
    cudaMemcpy(edgessrcwt, hedgessrcwt, nedges * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(psrc, hpsrc, nnodes * sizeof(unsigned int), cudaMemcpyHostToDevice);
    ...

Memory allocation on device; explicit memory transfer.
50
Hand-Written BFS
    ...
    do {
        hchanged = false;
        cudaMemcpy(changed, &hchanged, sizeof(bool), cudaMemcpyHostToDevice);   // transfer to GPU
        relaxEdge <<<nblocks, blocksize>>> (dist, edgessrcdst, psrc, nedges, nnodes, changed);
        cudaThreadSynchronize();                                                // explicit barrier
        cudaMemcpy(&hchanged, changed, sizeof(bool), cudaMemcpyDeviceToHost);   // transfer from GPU
    } while (hchanged);
    ...
51
Hand-Written BFS
    ... (iteration loop as on the previous slide) ...
    cudaMemcpy(hdist, dist, nnodes * sizeof(float), cudaMemcpyDeviceToHost);    // transfer results from GPU
    ...
52
SSSP Specification in Elixir
    Graph [nodes(node: Node, dist: int),
           edges(src: Node, dst: Node, wt: int)]
    source: Node

    initDist = [nodes(node n, dist d)] →
               [d = if (n == source) 0 else ∞]

    relaxEdge = [nodes(node p, dist pd),
                 nodes(node q, dist qd),
                 edges(src p, dst q, wt w),
                 pd + w < qd] →
                [qd = pd + w]

    init = foreach initDist
    bfs = iterate relaxEdge
    main = init; bfs

Graph rewrite rules are used to generate kernels; the graph layout and the processing order are left unspecified.
53
SSSP Specification in Green-Marl
    ...
    G.dist = (G == root) ? 0 : +INF;
    G.updated = (G == root) ? True : False;
    G.dist_nxt = G.dist;
    G.updated_nxt = G.updated;

    While (!fin) {
        fin = True;
        Foreach(n: G.Nodes)(n.updated) {
            Foreach(s: n.Nbrs) {
                Edge e = s.ToEdge(); // the edge to s
                // updated_nxt becomes true only if dist_nxt is actually updated
                <s.dist_nxt; s.updated_nxt> min= <n.dist + e.len; True>;
            }
        }
        G.dist = G.dist_nxt;
        G.updated = G.updated_nxt;
        G.updated_nxt = False;
        fin = !Exist(n: G.Nodes){n.updated};
    }
    ...
54
Using Operators
[Figure: graph with nodes a..e]

    A = b @ w1
    B = a @ w2
    C = a @ w3
    D = c @ w4
    E = (d @ w5) # (b @ w6)

Assigning # to a thread is vertex-based; assigning @ to a thread is edge-based.

    BFS:  @ = +, # = min, wi = 1
    SSSP: @ = +, # = min, wi = wt
    PR:   @ = /, # = +,   wi = outdegree
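The @ / # formulation can be captured generically in code: a node's new value is the # -reduction over (neighborValue @ w_i) across its incoming edges, and each algorithm just plugs in its own operator pair. A hedged sketch (function and parameter names are assumptions):

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Generic vertex update in the @ / # formulation: reduce with `hash`
// over `at(neighborValue, w)` for each incoming (value, weight) pair.
// For SSSP: at = +, hash = min, identity = "infinity".
int applyOperators(const std::vector<std::pair<int, int>>& incoming,
                   std::function<int(int, int)> at,
                   std::function<int(int, int)> hash,
                   int identity) {
    int acc = identity;
    for (const auto& [val, w] : incoming)
        acc = hash(acc, at(val, w));
    return acc;
}
```

With at = + and hash = min this computes an SSSP relaxation; with unit weights it becomes BFS. A vertex-based schedule gives one thread the whole reduction, while an edge-based schedule gives each thread one at-application and combines results with hash.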
55
Language Features
● Automatic memory management
  ● Memory allocation on host and device
  ● Memory transfer between devices
● Automatic kernel configuration
  ● Easy overriding
● Intelligent barrier insertion
  ● Automatic synchronization
56
Performance of Synthesized BFS
    Graph [nodes(node: Node, dist: int),
           edges(src: Node, dst: Node)]
    source: Node

    initDist = [nodes(node n, dist d)] →
               [d = if (n == source) 0 else ∞]

    relaxEdge = [nodes(node p, dist pd),
                 nodes(node q, dist qd),
                 edges(src p, dst q),
                 pd + 1 < qd] →
                [qd = pd + 1]

    init = foreach initDist
    bfs = iterate relaxEdge
    main = init; bfs

Graph rewrite rules are used to generate kernels; the graph layout and the processing order are left unspecified.

Relative to hand-written multicore (100%): hand-written GPU 29%, synthesized GPU 41%.
57
The Road Ahead
● Static graph algorithms
● Dynamic graph algorithms
● Language extensions for morph algorithms
● Optimizations for special classes of graphs
● Applications
58
Parallel Graph Algorithms
Rupesh Nasre, rupesh@cse.iitm.ac.in
NITK, February 2016
59
60
Todo
● Literature survey
  ● higher-level picture
  ● irregularity definition; earlier work by Yelick, Rünger
  ● multi-core status, TAO
● Morph algorithms
  ● benchmark suite: show properties of each benchmark
  ● optimizations
● Synthesis
  ● current work: IR for multicore and GPU
  ● add-ons: data-topology, workload
● Cite papers.
● Avoid sentences at the bottom of the slides.
61
Subgraph Deletion (3)

● Marking (e.g., SP)
  ● When deletion is infrequent
  ● Deleted memory is unavailable for reuse
● Explicit deletion
  ● When deletion is frequent; local deletions
  ● Global deletion may require expensive compaction
● Recycle (e.g., DMR)
  ● When the new subgraph fits in the memory of the old subgraph
  ● Moderately expensive
62
Conflict Resolution 44
1. Mark triangles.
– barrier –
2. Priority-check markings.
– barrier –
3. Check markings.

[Figure: mesh with triangles marked 0 or 1]
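The three-phase scheme above can be sketched sequentially (an illustrative assumption: here lower worker id wins the priority check deterministically, whereas the GPU kernel races non-atomic marks and resolves ties after the barrier):

```cpp
#include <cassert>
#include <vector>

// Barrier-separated conflict resolution sketch: every worker marks the
// elements it needs with its id (lower id has priority), and after a
// barrier a worker proceeds only if it still owns ALL its elements.
std::vector<bool> resolveConflicts(int numElems,
                                   const std::vector<std::vector<int>>& need) {
    std::vector<int> mark(numElems, -1);
    // Phases 1-2: mark, then priority-check markings (lower id wins).
    for (int w = 0; w < (int)need.size(); ++w)
        for (int e : need[w])
            if (mark[e] == -1 || w < mark[e]) mark[e] = w;
    // Phase 3 (after a barrier): check markings.
    std::vector<bool> wins(need.size());
    for (int w = 0; w < (int)need.size(); ++w) {
        bool ok = true;
        for (int e : need[w])
            if (mark[e] != w) ok = false;   // lost at least one element
        wins[w] = ok;
    }
    return wins;
}
```

Losers retry in a later round, which is the "cautious" part of the parallelization: no worker mutates the mesh before owning its whole cavity.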
63
Effect of Optimizations
[Chart: speedup (0 to 80) contributed by successive optimizations for Delaunay Mesh Refinement: base with partitioning; race and resolve; with atomic-free global barrier; optimized memory layout; adaptive configuration; load balancing; single-precision arithmetic; on-demand memory reallocation]

Input: mesh with 10 million triangles with around 50% bad triangles.