Parallel Graph Algorithms
Rupesh Nasre, IIT Madras
NITK, February 2016
2
Graphs
● Where do we encounter graphs?
  ● Social networks, road connections, molecular interactions, planetary forces, …
  ● snap, florida, dimacs, konect, ...
● Why treat them separately?
  ● They provide structural information.
  ● They can be processed more efficiently.
● What challenges do they pose?
  ● Load imbalance, poor locality, …
  ● Irregularity
3
What is IrReguLariTy?
● Data-access or control patterns are unpredictable at compile time.

Irregular data-access:

    int a[N], b[N], c[N];
    readinput(a);
    c[5] = b[a[4]];

Irregular control-flow:

    int a[N];
    readinput(a);
    if (a[4] > 30) {
        ...
    }

Pointer-based data structures often contribute to irregularity.
Needs dynamic techniques.
4
Examples of Irregular Algorithms
[Figure: four example irregular algorithms: Delaunay mesh refinement on a triangulated mesh, points-to analysis (a = &x; b = &y; p = &a; *p = b; c = a), minimum spanning tree computation on a weighted graph, and survey propagation on a factor graph of clauses C1..C5 and variables X1..X5]

Delaunay Mesh Refinement, Points-to Analysis, Minimum Spanning Tree Computation, Survey Propagation
5
Regular vs. Irregular Algorithms
[Figure: matrix multiplication vs. shortest-paths computation on a weighted graph]

Matrix Multiplication: by knowing the matrix size and its starting address, and without knowing its values, we can determine the dynamic behavior.

Shortest Paths Computation: the dynamic behavior depends on the input graph. This results in pessimistic synchronization.
6
Irregularity is a Matter of Degree
[Scatter plot: control-flow irregularity (0% to 4.0%) vs. memory-access irregularity (10% to 80%) across applications]
8
Dynamic Techniques

[Figure: graph (nodes a to j) with an active node and its neighborhood highlighted]
Non-overlapping neighborhoods can be processed in parallel.
Overlapping neighborhoods require synchronization.
Leads to optimistic and cautious parallelizations.
Question: Given an algorithm, can you schedule its parallel processing such that neighborhoods never overlap?
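One cautious answer to the scheduling question is to pick, in each round, a batch of active nodes whose neighborhoods are pairwise disjoint. The following is an illustrative sequential sketch (the function names and the greedy selection policy are assumptions, not the talk's actual scheduler):

```cpp
#include <cassert>
#include <set>
#include <vector>

// Two active nodes can be processed in parallel only if their
// neighborhoods (sets of node ids) do not overlap.
bool neighborhoodsOverlap(const std::set<int>& a, const std::set<int>& b) {
    for (int n : a)
        if (b.count(n)) return true;   // shared node => conflict
    return false;
}

// Greedily select a conflict-free batch of active nodes for one
// parallel round; remaining nodes wait for a later round.
std::vector<int> selectParallelBatch(const std::vector<std::set<int>>& nbhd) {
    std::vector<int> batch;
    for (int i = 0; i < (int)nbhd.size(); ++i) {
        bool ok = true;
        for (int j : batch)
            if (neighborhoodsOverlap(nbhd[i], nbhd[j])) { ok = false; break; }
        if (ok) batch.push_back(i);
    }
    return batch;
}
```

Here nodes 0 and 2 (neighborhoods {1,2} and {4,5}) can run together, while node 1 ({2,3}) must wait.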
9
Parallelism Approaches

● Manual
● Libraries
  ● Galois, Ligra, LonestarGPU, Totem, ...
● Domain-Specific Languages
  ● Green-Marl, Elixir, BlackBuck, ...

[Figure: the approaches span a productivity vs. performance trade-off]
10
Specifying Parallelism
● Do not specify.
  ● Sequential input, completely automated; currently very challenging in general
● Implicit parallelism
  ● aggregates, aggregate functions, value-based processing, ...
● Explicit parallelism
  ● pthreads, MPI, ...
11
Identifying Dependence
    for (ii = 0; ii < 10; ++ii) {
        a[2 * ii] = ... a[2 * ii + 1] ...
    }

Is there a flow dependence between different iterations?

Dependence equations:

    0 <= ii_w < ii_r < 10
    2 * ii_w = 2 * ii_r + 1

which can be written as

    0 <= ii_w
    ii_w <= ii_r - 1
    ii_r <= 9
    2 * ii_w <= 2 * ii_r + 1
    2 * ii_r + 1 <= 2 * ii_w

or, in matrix form,

    [ -1   0 ]                [  0 ]
    [  1  -1 ]   [ ii_w ]     [ -1 ]
    [  0   1 ] * [ ii_r ] <=  [  9 ]
    [  2  -2 ]                [  1 ]
    [ -2   2 ]                [ -1 ]

Dependence exists if the system has an integer solution.
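For a loop this small, the system can simply be checked exhaustively (a sketch; real compilers use integer linear programming or the Omega test rather than brute force). The writes touch even indices and the reads touch odd indices, so the equation 2*ii_w = 2*ii_r + 1 has no integer solution and no flow dependence exists:

```cpp
#include <cassert>

// Brute-force check of the dependence system above: does any pair
// (ii_w, ii_r) with 0 <= ii_w < ii_r <= 9 make the written location
// 2*ii_w equal the read location 2*ii_r + 1?
bool flowDependenceExists() {
    for (int iw = 0; iw <= 9; ++iw)
        for (int ir = iw + 1; ir <= 9; ++ir)
            if (2 * iw == 2 * ir + 1)   // even value can never equal odd value
                return true;
    return false;
}
```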
12
Outline
● Introduction
➔ Graph algorithms on GPUs
● Morph algorithms on GPUs
● Synthesizing graph algorithms for GPUs
13
What is a GPU?
● Graphics Processing Unit
● Separate piece of hardware connected to the CPU over a bus
● Separate address space from that of the CPU
● Massive multithreading
● Warp-based execution
14
What is a Warp?
Source: Wikipedia
15
GPU Computation Hierarchy
[Figure: hierarchy of threads, warps, blocks, multi-processors, and the GPU]

    Thread           1
    Warp             32
    Block            1,024
    Multi-processor  tens of thousands
    GPU              hundreds of thousands
16
Challenges with GPUs
● Warp-based execution
● Locking is expensive
● Dynamic memory allocation is costly
● Limited data-cache
● Programmability issues
  ● separate address space
  ● no recursion support
  ● complex computation hierarchy
  ● exposed memory hierarchy
  ● ...
17
Challenges in Graph Algorithms
● Synchronization
  ● locks are prohibitively expensive on GPUs
  ● atomic instructions quickly become expensive
● Memory latency
  ● locality is difficult to exploit
  ● low caching support
● Thread-divergence
  ● work done per node varies with graph structure
● Uncoalesced memory accesses
  ● warp-threads access arbitrary graph elements
18
Irregular Algorithms on GPUs
[Figure: weighted example graph for single-source shortest paths]

Single-source shortest paths:

    sssp(src):
        dsrc = distance[src]
        forall edges (src, dst, wt)
            relaxEdge(src, dst, wt)
19
Irregular Algorithms on GPUs
[Figure: weighted example graph]

Single-source shortest paths:

    sssp(src):
        dsrc = distance[src]
        forall edges (src, dst, wt)
            ddst = distance[dst]
            altdist = dsrc + wt
            if altdist < ddst
                distance[dst] = altdist

This is a memory-bound kernel.

Kernel unrolling: instead of operating on a node, a thread operates on a subgraph.
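The sssp pseudocode above can be written out as a sequential CPU sketch (a Bellman-Ford-style relaxation over an edge list; names and the edge-list layout are illustrative, not the GPU kernel itself):

```cpp
#include <cassert>
#include <limits>
#include <tuple>
#include <vector>

const int INF = std::numeric_limits<int>::max();

// Sequential sketch of the sssp kernel: repeatedly relax every edge
// (src, dst, wt) until no distance changes in a full pass.
std::vector<int> sssp(int n, const std::vector<std::tuple<int, int, int>>& edges,
                      int source) {
    std::vector<int> dist(n, INF);
    dist[source] = 0;
    bool changed = true;
    while (changed) {
        changed = false;
        for (const auto& [src, dst, wt] : edges) {
            if (dist[src] == INF) continue;      // unreached source node
            int alt = dist[src] + wt;            // altdist = dsrc + wt
            if (alt < dist[dst]) {               // if altdist < ddst
                dist[dst] = alt;                 //   distance[dst] = altdist
                changed = true;
            }
        }
    }
    return dist;
}
```

On the GPU, one thread handles one edge (or one node), and the `changed` flag becomes the host-polled termination variable shown later in the hand-written BFS code.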
20
Improving Synchronization
[Figure: push-based vs. pull-based updates; threads tfive and tseven concurrently update a shared node's distance]

Atomic-free update leads to the lost-update problem; it is corrected by topology-driven processing, exploiting monotonicity.
21
Irregular Algorithms on GPUs
[Figure: weighted example graph]

Breadth-first search, Barnes-Hut n-body simulation, single-source shortest paths:

● Better memory layout
● Kernel unrolling
● Local worklists
● Improved synchronization

    Application  Speedup
    BFS          48
    BH           90
    SSSP         45
22
Outline
● Introduction
● Graph algorithms on GPUs
  ● e.g., breadth-first search, shortest paths
➔ Morph algorithms on GPUs
  ➔ e.g., mesh refinement, pointer analysis
● Synthesizing irregular algorithms for GPUs
23
Identify the Celebrity
Source: Wikipedia
24
What is a morph?
Source: Wikipedia
25
Examples of Morph Algorithms
[Figure: four example morph algorithms: Delaunay mesh refinement on a triangulated mesh, points-to analysis (a = &x; b = &y; p = &a; *p = b; c = a), minimum spanning tree computation on a weighted graph, and survey propagation on a factor graph of clauses C1..C5 and variables X1..X5]

Delaunay Mesh Refinement, Points-to Analysis, Minimum Spanning Tree Computation, Survey Propagation
26
TAO Classification
Algorithm classes on multi-core and GPU:

● Regular (matrix multiplication, stencils)
● Irregular, static-graph (breadth-first search, shortest paths)
● Morph (mesh refinement, survey propagation)

The TAO of Parallelism in Algorithms, Pingali et al. [PLDI 2011]
27
First Attempt
● Ported multi-core Galois benchmarks to CUDA.
● The GPU version's performance was comparable to that of the single-threaded version.

[Chart: execution time in seconds vs. number of CPU threads (1 to 48) for Serial (Shewchuk, Berkeley), GPU Ported, and Multicore Galois; DMR on a mesh with 10 million triangles and around 50% bad triangles]
28
Challenges in Morph Algorithms

● Synchronization
  ● locks are prohibitively expensive on GPUs
  ● atomic instructions quickly become expensive
● Memory allocation
  ● changing graph structure requires new strategies
  ● memory requirement cannot be predicted
● Load imbalance
  ● different modifications to different parts of the graph
  ● work done per node changes dynamically
  ● leads to thread-divergence and uncoalesced memory accesses
29
Performance Considerations
1. Graph representation
2. Subgraph addition
3. Subgraph deletion
4. Conflict resolution
5. Load balancing
30
Graph Representation (1)

Per-benchmark data structures:

● Delaunay Mesh Refinement: points, triangles, neighbors
● Survey Propagation: clauses-to-literals, literals-to-clauses
● Points-to Analysis: pointers, points-to sets
● Minimum Spanning Tree Computation: graph, edges, components
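Topology arrays like the ones above are commonly stored in compressed sparse row (CSR) form on GPUs, since it packs each node's neighbors contiguously for coalesced access. A minimal host-side sketch (struct and field names are illustrative assumptions):

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Minimal CSR graph: rowStart[v] .. rowStart[v+1] indexes the slice of
// colIdx that holds v's outgoing neighbors.
struct CSRGraph {
    std::vector<int> rowStart;  // size = numNodes + 1
    std::vector<int> colIdx;    // size = numEdges
};

CSRGraph buildCSR(int n, const std::vector<std::pair<int, int>>& edges) {
    CSRGraph g;
    g.rowStart.assign(n + 1, 0);
    for (auto& e : edges) ++g.rowStart[e.first + 1];        // out-degree histogram
    for (int v = 0; v < n; ++v)
        g.rowStart[v + 1] += g.rowStart[v];                 // prefix sum -> offsets
    g.colIdx.resize(edges.size());
    std::vector<int> next(g.rowStart.begin(), g.rowStart.end() - 1);
    for (auto& e : edges) g.colIdx[next[e.first]++] = e.second;  // scatter edges
    return g;
}

int outDegree(const CSRGraph& g, int v) { return g.rowStart[v + 1] - g.rowStart[v]; }
```

On the GPU the two flat arrays transfer with two cudaMemcpy calls, which is one reason CSR-style layouts dominate over pointer-based adjacency lists there.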
35
Subgraph Addition (2)

● Pre-allocation (e.g., MST, DMR)
  ● When the final graph has bounded size
  ● Overallocation
● Host-only (e.g., PTA)
  ● When the next size is predictable on the host; global allocation
  ● May be expensive, but can be optimized
● Kernel-host (e.g., DMR)
  ● When the next size is predictable; global allocation
  ● Size computation can piggyback
● Kernel-only (e.g., PTA)
  ● General purpose; local allocation
  ● Expensive, but can be optimized
36
Performance

[Chart: execution time in seconds vs. number of CPU threads (1 to 48) for Serial (Shewchuk, Berkeley), Multi-core Galois, GPU Ported, and GPU Optimized]

    Application  Speedup
    DMR          80
    MST          20
    PTA          23
    SP           25

1. Graph representation
2. Subgraph addition
3. Subgraph deletion
4. Conflict resolution
5. Load balancing
37
GPU Optimization Principles

[Diagram: GPU optimization principles grouped by computation, memory, and synchronization]

● Computation: algorithm selection, work sorting, work chunking, communication onto computation, following the parallelism profile, pipelined computation
● Memory: kernel transformations, data grouping, exploiting the memory hierarchy
● Synchronization: avoiding synchronization, coarsening synchronization, race-and-resolve mechanism, combining synchronization
38
GPU Optimization Principles
These optimization principles are critical for high-performing irregular GPU computations.
39
Barrier-based Synchronization
➔ Disjoint accesses
● Overlapping accesses
● Benign overlaps

Consider threads pushing elements into a worklist:
● atomic per element: O(e) atomics
● atomic per thread: O(t) atomics
● prefix-sum: O(log t) barriers
40
Barrier-based Synchronization
● Disjoint accesses
➔ Overlapping accesses
● Benign overlaps

Consider threads trying to own a set of elements (race and resolve):
● atomic per element
● non-atomic mark, then check (AND over the set), e.g., for owning cavities in Delaunay mesh refinement
● prioritized non-atomic mark, then check (OR), e.g., for inserting unique elements into a worklist
41
Barrier-based Synchronization
● Disjoint accesses
● Overlapping accesses
➔ Benign overlaps

Consider threads updating shared variables to the same value, e.g., level-by-level breadth-first search: the update can be performed with atomics or without atomics, since every writer stores the identical value.

[Figure: BFS tree with root a, children b, c, d, and grandchildren e, f, g]
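A sequential sketch of the benign-overlap idea in level-synchronous BFS follows (names are illustrative). In the parallel version, several threads may set the same node's level in the same round, but they all write curLevel + 1, so the race is harmless and no atomics are required:

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Level-synchronous BFS. In a parallel round, concurrent writers to
// level[dst] all store the same value (cur + 1): a benign race.
std::vector<int> bfsLevels(int n, const std::vector<std::pair<int, int>>& edges,
                           int src) {
    std::vector<int> level(n, -1);   // -1 = not yet visited
    level[src] = 0;
    for (int cur = 0;; ++cur) {
        bool changed = false;
        for (auto& e : edges) {
            if (level[e.first] == cur && level[e.second] == -1) {
                level[e.second] = cur + 1;   // same value from every writer
                changed = true;
            }
        }
        if (!changed) break;
    }
    return level;
}
```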
42
Exploiting Algebraic Properties
➔ Monotonicity
● Idempotency
● Associativity

Consider threads updating distances in a shortest-paths computation (threads tfive and tseven update a shared node): an atomic-free update can suffer the lost-update problem. Topology-driven processing corrects this by exploiting monotonicity: distances only decrease, so a lost smaller value is simply re-applied in a later round.
43
Exploiting Algebraic Properties
● Monotonicity
➔ Idempotency
● Associativity

Consider threads updating distances in a shortest-paths computation:
● a node updated by multiple threads
● multiple instances of a node in the worklist
● the same node processed by multiple threads

Because the relaxation is idempotent, such repeated processing does not affect correctness.
44
Exploiting Algebraic Properties
● Monotonicity
● Idempotency
➔ Associativity

Consider threads pushing points-to information to a pointer node: threads t1..t4 contribute the sets {x}, {y}, {z, v}, and {m, n}, which combine into {x, y, z, v, m, n}. Associativity helps push the information using a prefix-sum.
45
Other Synchronization Methods
➔ Partitioning
● Scatter-gather

Delaunay Mesh Refinement:
● Layout-based graph partitioning is expensive.
● Partitioning time may be comparable to the overall running time.

[Figure: push-based vs. pull-based updates]
Partitioning enables a pull-based approach.
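The push/pull distinction can be sketched on the CPU (an illustrative assumption, not the talk's kernel): in a push-based update a node writes into its neighbors, which may be shared, while in a pull-based update each node reads its in-neighbors and writes only its own slot, so once nodes are partitioned among threads no two threads ever write the same location:

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// One pull-based relaxation round: node v updates ONLY dist[v] by
// reading its in-neighbors, so parallel owners never conflict.
// inNbr[v] lists (u, wt) pairs for edges u -> v.
bool pullRound(std::vector<int>& dist,
               const std::vector<std::vector<std::pair<int, int>>>& inNbr) {
    const int INF = 1 << 29;
    bool changed = false;
    for (size_t v = 0; v < dist.size(); ++v) {   // each v owned by one thread
        int best = dist[v];
        for (const auto& [u, wt] : inNbr[v])
            if (dist[u] < INF) best = std::min(best, dist[u] + wt);
        if (best < dist[v]) { dist[v] = best; changed = true; }
    }
    return changed;
}
```

The trade-off: pull avoids atomics entirely but may re-read unchanged neighbors, whereas push touches only active edges but needs synchronization on shared destinations.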
46
Other Synchronization Methods
● Partitioning
➔ Scatter-gather

Consider threads pushing elements into a worklist:
● atomic per element: O(e) atomics
● atomic per thread: O(t) atomics
● prefix-sum: O(log t) barriers
● scatter, then gather
47
Outline
● Introduction
● Graph algorithms on GPUs
● Morph algorithms on GPUs
➔ Synthesizing irregular algorithms for GPUs
48
Hand-Written BFS
    ...
    hedgessrcdst = (unsigned int *)malloc(nedges * sizeof(unsigned int));
    hpsrc = (unsigned int *)calloc(nnodes, sizeof(unsigned int));
    hdist = (float *)malloc(nnodes * sizeof(float));
    ...
    // Read graph.
    ...

Memory allocation on host.
49
Hand-Written BFS

    ...
    cudaMalloc((void **)&edgessrcdst, nedges * sizeof(unsigned int));
    cudaMalloc((void **)&psrc, nnodes * sizeof(unsigned int));

    cudaMemcpy(edgessrcdst, hedgessrcdst, nedges * sizeof(unsigned int), cudaMemcpyHostToDevice);
    cudaMemcpy(edgessrcwt, hedgessrcwt, nedges * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(psrc, hpsrc, nnodes * sizeof(unsigned int), cudaMemcpyHostToDevice);
    ...

Memory allocation on device; explicit memory transfer.
50
Hand-Written BFS
    ...
    do {
        hchanged = false;
        cudaMemcpy(changed, &hchanged, sizeof(bool), cudaMemcpyHostToDevice);   // transfer to GPU
        relaxEdge <<<nblocks, blocksize>>> (dist, edgessrcdst, psrc, nedges, nnodes, changed);
        cudaThreadSynchronize();                                                // explicit barrier
        cudaMemcpy(&hchanged, changed, sizeof(bool), cudaMemcpyDeviceToHost);   // transfer from GPU
    } while (hchanged);
    ...
51
Hand-Written BFS
    ... (iteration loop as on the previous slide) ...
    cudaMemcpy(hdist, dist, nnodes * sizeof(float), cudaMemcpyDeviceToHost);    // transfer results from GPU
    ...
52
SSSP Specification in Elixir
    Graph [nodes(node: Node, dist: int),
           edges(src: Node, dst: Node, wt: int)]
    source: Node

    initDist = [nodes(node n, dist d)] →
               [d = if (n == source) 0 else ∞]

    relaxEdge = [nodes(node p, dist pd),
                 nodes(node q, dist qd),
                 edges(src p, dst q, wt w),
                 pd + w < qd] →
                [qd = pd + w]

    init = foreach initDist
    bfs = iterate relaxEdge
    main = init; bfs

Graph rewrite rules are used to generate kernels; the graph layout and the processing order are left unspecified.
53
SSSP Specification in Green-Marl
    ...
    G.dist = (G == root) ? 0 : +INF;
    G.updated = (G == root) ? True : False;
    G.dist_nxt = G.dist;
    G.updated_nxt = G.updated;

    While (!fin) {
        fin = True;
        Foreach(n: G.Nodes)(n.updated) {
            Foreach(s: n.Nbrs) {
                Edge e = s.ToEdge(); // the edge to s
                // updated_nxt becomes true only if dist_nxt is actually updated
                <s.dist_nxt; s.updated_nxt> min= <n.dist + e.len; True>;
            }
        }
        G.dist = G.dist_nxt;
        G.updated = G.updated_nxt;
        G.updated_nxt = False;
        fin = !Exist(n: G.Nodes){n.updated};
    }
    ...
54
Using Operators
[Figure: graph with nodes a..e]

    A = b @ w1
    B = a @ w2
    C = a @ w3
    D = c @ w4
    E = (d @ w5) # (b @ w6)

Assigning # to a thread is vertex-based; assigning @ to a thread is edge-based.

    BFS:  @ = +, # = min, wi = 1
    SSSP: @ = +, # = min, wi = wt
    PR:   @ = /, # = +,   wi = outdegree
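The @ / # formulation can be captured generically in code: a node's new value is the # -reduction over (neighborValue @ w_i) across its incoming edges, and each algorithm just plugs in its own operator pair. A hedged sketch (function and parameter names are assumptions):

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Generic vertex update in the @ / # formulation: reduce with `hash`
// over `at(neighborValue, w)` for each incoming (value, weight) pair.
// For SSSP: at = +, hash = min, identity = "infinity".
int applyOperators(const std::vector<std::pair<int, int>>& incoming,
                   std::function<int(int, int)> at,
                   std::function<int(int, int)> hash,
                   int identity) {
    int acc = identity;
    for (const auto& [val, w] : incoming)
        acc = hash(acc, at(val, w));
    return acc;
}
```

With at = + and hash = min this computes an SSSP relaxation; with unit weights it becomes BFS. A vertex-based schedule gives one thread the whole reduction, while an edge-based schedule gives each thread one at-application and combines results with hash.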
55
Language Features
● Automatic memory management
  ● Memory allocation on host and device
  ● Memory transfer between devices
● Automatic kernel configuration
  ● Easy overriding
● Intelligent barrier insertion
  ● Automatic synchronization
56
Performance of Synthesized BFS
    Graph [nodes(node: Node, dist: int),
           edges(src: Node, dst: Node)]
    source: Node

    initDist = [nodes(node n, dist d)] →
               [d = if (n == source) 0 else ∞]

    relaxEdge = [nodes(node p, dist pd),
                 nodes(node q, dist qd),
                 edges(src p, dst q),
                 pd + 1 < qd] →
                [qd = pd + 1]

    init = foreach initDist
    bfs = iterate relaxEdge
    main = init; bfs

Graph rewrite rules are used to generate kernels; the graph layout and the processing order are left unspecified.

Relative to hand-written multicore (100%): hand-written GPU 29%, synthesized GPU 41%.
57
The Road Ahead
● Static graph algorithms
● Dynamic graph algorithms
● Language extensions for morph algorithms
● Optimizations for special classes of graphs
● Applications
58
Parallel Graph Algorithms
Rupesh Nasre, rupesh@cse.iitm.ac.in
NITK, February 2016
59
60
Todo
● Literature survey
  ● higher-level picture
  ● irregularity definition; earlier work by Yelick, Rünger
  ● multi-core status, TAO
● Morph algorithms
  ● benchmark suite: show properties of each benchmark
  ● optimizations
● Synthesis
  ● current work: IR for multicore and GPU
  ● add-ons: data-topology, workload
● Cite papers.
● Avoid sentences at the bottom of the slides.
61
Subgraph Deletion (3)

● Marking (e.g., SP)
  ● When deletion is infrequent
  ● Deleted memory is unavailable for reuse
● Explicit deletion
  ● When deletion is frequent; local deletions
  ● Global deletion may require expensive compaction
● Recycle (e.g., DMR)
  ● When the new subgraph fits in the memory of the old subgraph
  ● Moderately expensive
62
Conflict Resolution 44
1. Mark triangles.
– barrier –
2. Priority-check markings.
– barrier –
3. Check markings.

[Figure: mesh with triangles marked 0 or 1]
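The three-phase scheme above can be sketched sequentially (an illustrative assumption: here lower worker id wins the priority check deterministically, whereas the GPU kernel races non-atomic marks and resolves ties after the barrier):

```cpp
#include <cassert>
#include <vector>

// Barrier-separated conflict resolution sketch: every worker marks the
// elements it needs with its id (lower id has priority), and after a
// barrier a worker proceeds only if it still owns ALL its elements.
std::vector<bool> resolveConflicts(int numElems,
                                   const std::vector<std::vector<int>>& need) {
    std::vector<int> mark(numElems, -1);
    // Phases 1-2: mark, then priority-check markings (lower id wins).
    for (int w = 0; w < (int)need.size(); ++w)
        for (int e : need[w])
            if (mark[e] == -1 || w < mark[e]) mark[e] = w;
    // Phase 3 (after a barrier): check markings.
    std::vector<bool> wins(need.size());
    for (int w = 0; w < (int)need.size(); ++w) {
        bool ok = true;
        for (int e : need[w])
            if (mark[e] != w) ok = false;   // lost at least one element
        wins[w] = ok;
    }
    return wins;
}
```

Losers retry in a later round, which is the "cautious" part of the parallelization: no worker mutates the mesh before owning its whole cavity.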
63
Effect of Optimizations
[Chart: speedup (0 to 80) contributed by successive optimizations for Delaunay Mesh Refinement: base with partitioning; race and resolve; with atomic-free global barrier; optimized memory layout; adaptive configuration; load balancing; single-precision arithmetic; on-demand memory reallocation]

Input: mesh with 10 million triangles with around 50% bad triangles.