CS267, Spring 2011 (April 7, 2011)
Parallel Graph Algorithms
Kamesh Madduri ([email protected])
Lawrence Berkeley National Laboratory
Slide 2
Lecture Outline
Applications
Designing parallel graph algorithms, performance on current systems
Case studies: graph traversal-based problems, parallel algorithms
  Breadth-First Search
  Single-source shortest paths
  Betweenness centrality
Slide 3
Routing in transportation networks
Road networks, point-to-point shortest paths: 15 seconds (naïve) vs. 10 microseconds
H. Bast et al., "Fast Routing in Road Networks with Transit Nodes," Science, 27 April 2007.
Slide 4
Internet and the WWW
The world-wide web can be represented as a directed graph
  Web search and crawl: traversal
  Link analysis, ranking: PageRank and HITS
  Document classification and clustering
Internet topologies (router networks) are naturally modeled as graphs
Slide 5
Scientific Computing
Reorderings for sparse solvers
  Fill-reducing orderings: partitioning, eigenvectors
  Heavy diagonal to reduce pivoting (matching)
Data structures for efficient exploitation of sparsity
Derivative computations for optimization: graph colorings, spanning trees
Preconditioning
  Incomplete factorizations
  Partitioning for domain decomposition
  Graph techniques in algebraic multigrid: independent sets, matchings, etc.
  Support theory: spanning trees & graph embedding techniques
B. Hendrickson, "Graphs and HPC: Lessons for Future Architectures,"
http://www.er.doe.gov/ascr/ascac/Meetings/Oct08/Hendrickson%20ASCAC.pdf
Image sources: Yifan Hu, "A gallery of large graphs"; Tim Davis, UF Sparse Matrix Collection.
Slide 6
Large-scale data analysis
Graph abstractions are very useful to analyze complex data sets.
Sources of data: petascale simulations, experimental devices, the Internet, sensor networks
Challenges: data size, heterogeneity, uncertainty, data quality
  Astrophysics: massive datasets, temporal variations
  Bioinformatics: data quality, heterogeneity
  Social informatics: new analytics challenges, data uncertainty
Image sources: (1) http://physics.nmt.edu/images/astro/hst_starfield.jpg (2, 3) www.visualComplexity.com
Slide 7
Data Analysis and Graph Algorithms in Systems Biology
Study of the interactions between various components in a biological system
Graph-theoretic formulations are pervasive:
  Predicting new interactions: modeling
  Functional annotation of novel proteins: matching, clustering
  Identifying metabolic pathways: paths, clustering
  Identifying new protein complexes: clustering, centrality
Image source: Giot et al., "A Protein Interaction Map of Drosophila melanogaster," Science 302, 1722-1736, 2003.
Slide 8
Graph-theoretic problems in social networks
  Targeted advertising: clustering and centrality
  Studying the spread of information
Image source: Nexus (Facebook application)
Slide 9
Network Analysis for Intelligence and Surveillance
[Krebs '04] Post-9/11 terrorist network analysis from public-domain information
Plot masterminds correctly identified from interaction patterns: centrality
A global view of entities is often more insightful
Detect anomalous activities by exact/approximate subgraph isomorphism
Image sources: http://www.orgnet.com/hijackers.html; T. Coffman, S. Greenblatt, S. Marcus, "Graph-based technologies for intelligence analysis," CACM 47(3), March 2004, pp. 45-47.
Slide 10
Research in Parallel Graph Algorithms
Application areas: social network analysis, WWW, computational biology, scientific computing, engineering; e.g., find central entities, community detection, network dynamics, gene regulation, metabolic pathways, genomics, marketing, social search, VLSI CAD, route planning
Methods/problems (graph algorithms): traversal, shortest paths, connectivity, max flow, graph partitioning, matching, coloring
Architectures: GPUs, FPGAs, x86 multicore servers, massively multithreaded architectures, multicore clusters, clouds
Challenges: data size, problem complexity
Slide 11
Characterizing Graph-theoretic Computations
Input data characteristics: graph sparsity (m/n ratio), static/dynamic nature, weighted/unweighted and weight distribution, vertex degree distribution, directed/undirected, simple/multi/hypergraph, problem size, granularity of computation at nodes/edges, domain-specific characteristics
Problem: find *** (paths, clusters, partitions, matchings, patterns, orderings)
Graph kernel: traversal, shortest path algorithms, flow algorithms, spanning tree algorithms, topological sort, ...
These factors influence the choice of algorithm.
Graph problems are often recast as sparse linear algebra (e.g., partitioning) or linear programming (e.g., matching) computations.
Slide 12
Lecture Outline
Applications
Designing parallel graph algorithms, performance on current systems
Case studies: graph traversal-based problems, parallel algorithms
  Breadth-First Search
  Single-source shortest paths
  Betweenness centrality
Slide 13
History
1735: Seven Bridges of Königsberg problem, resolved by Euler; one of the first graph theory results.
1966: Flynn's taxonomy.
1968: Batcher's sorting networks.
1969: Harary's Graph Theory.
1972: Tarjan's "Depth-first search and linear graph algorithms."
1975: Reghbati and Corneil, parallel connected components.
1982: Misra and Chandy, distributed graph algorithms.
1984: Quinn and Deo's survey paper on parallel graph algorithms.
Slide 14
The PRAM model
Idealized parallel shared-memory system model
  Unbounded number of synchronous processors; no synchronization or communication cost; no parallel overhead
  Variants: EREW (Exclusive Read Exclusive Write), CREW (Concurrent Read Exclusive Write)
Measuring performance: space and time complexity; total number of operations (work)
Slide 15
PRAM Pros and Cons
Pros:
  Simple and clean semantics.
  The majority of theoretical parallel algorithms are designed using the PRAM model.
  Independent of the communication network topology.
Cons:
  Not realistic; too powerful a communication model.
  The algorithm designer is misled into using inter-processor communication without hesitation.
  Synchronized processors; no local memory.
  Big-O notation is often misleading.
Slide 16
PRAM Algorithms for Connected Components
Reghbati and Corneil [1975]: O(log^2 n) time and O(n^3) processors
Wyllie [1979]: O(log^2 n) time and O(m) processors
Shiloach and Vishkin [1982]: O(log n) time and O(m) processors
Reif and Spirakis [1982]: O(log log n) time and O(n) processors (expected)
Slide 17
Building blocks of classical PRAM graph algorithms
  Prefix sums
  Symmetry breaking
  Pointer jumping
  List ranking
  Euler tours
  Vertex collapse
  Tree contraction
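As a hedged illustration of one of these building blocks (my own sketch, not code from the lecture), the C++ fragment below simulates pointer jumping for list ranking: each node repeatedly adds its successor's rank and then jumps its pointer to its successor's successor, so an n-node list is ranked in O(log n) rounds. On a PRAM every round would run for all nodes in parallel; here the rounds are simulated serially with double buffering. The function name list_rank and the example list are hypothetical.

#include <cstdio>
#include <vector>

// Pointer-jumping list ranking (PRAM-style rounds, simulated serially).
// next[i] is the successor of node i; the list tail points to itself.
// On exit, rank[i] = number of hops from node i to the tail.
void list_rank(std::vector<int>& next, std::vector<int>& rank) {
    int n = (int)next.size();
    std::vector<int> nxt2(n), rank2(n);
    for (int i = 0; i < n; ++i) rank[i] = (next[i] == i) ? 0 : 1;
    for (int round = 0; round < 32; ++round) {        // O(log n) rounds suffice
        bool changed = false;
        for (int i = 0; i < n; ++i) {                 // "in parallel" on a PRAM
            rank2[i] = rank[i] + ((next[i] == i) ? 0 : rank[next[i]]);
            nxt2[i]  = next[next[i]];                 // jump the pointer
            if (nxt2[i] != next[i]) changed = true;
        }
        rank.swap(rank2);
        next.swap(nxt2);
        if (!changed) break;                          // all pointers reached the tail
    }
}

int main() {
    // Hypothetical 6-node list: 3 -> 0 -> 4 -> 1 -> 5 -> 2 (tail).
    std::vector<int> next = {4, 5, 2, 0, 1, 2};
    std::vector<int> rank(next.size());
    list_rank(next, rank);
    for (size_t i = 0; i < rank.size(); ++i)
        std::printf("node %zu: rank %d\n", i, rank[i]);
    return 0;
}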
Slide 18
Data structures: graph representation
Static case
  Dense graphs (m = O(n^2)): adjacency matrix commonly used.
  Sparse graphs: adjacency lists.
Dynamic case: representation depends on the common-case query
  Edge insertions or deletions? Vertex insertions or deletions? Edge weight updates?
  Graph update rate
  Queries: connectivity, paths, flow, etc.
  Optimizing for locality is a key design consideration.
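For concreteness, here is a minimal sketch (my own illustration, not the lecture's code) of the array-based adjacency-list layout commonly used for static sparse graphs: compressed sparse row (CSR) storage, with an index array of length n+1 and an adjacency array with one entry per directed edge (an undirected edge is stored in both directions). The names CSRGraph and build_csr are hypothetical.

#include <vector>
#include <utility>

// Compressed Sparse Row (CSR) graph: row_ptr has n+1 entries, adj has one
// entry per directed edge. Neighbors of u are adj[row_ptr[u] .. row_ptr[u+1]-1].
struct CSRGraph {
    int n;
    std::vector<int> row_ptr;
    std::vector<int> adj;
};

// Build CSR from a directed edge list (u, v): one pass to count degrees,
// a prefix sum over the counts, and one pass to place adjacencies.
CSRGraph build_csr(int n, const std::vector<std::pair<int,int>>& edges) {
    CSRGraph g;
    g.n = n;
    g.row_ptr.assign(n + 1, 0);
    for (const auto& e : edges) g.row_ptr[e.first + 1]++;            // degree counts
    for (int i = 0; i < n; ++i) g.row_ptr[i + 1] += g.row_ptr[i];    // prefix sum
    g.adj.resize(edges.size());
    std::vector<int> cursor(g.row_ptr.begin(), g.row_ptr.end() - 1); // write positions
    for (const auto& e : edges) g.adj[cursor[e.first]++] = e.second;
    return g;
}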
Distributed graph representation
  Each processor stores the entire graph (full replication)
  Each processor stores n/p vertices and all adjacencies out of these vertices (1D partitioning)
How to create these p vertex partitions?
  Graph partitioning algorithms: recursively optimize for conductance (edge cut / size of smaller partition)
  Randomly shuffling the vertex identifiers ensures that the edge count per processor is roughly the same
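A minimal sketch of the 1D block distribution just described (my own illustration, under the assumption of contiguous blocks of ceil(n/p) vertices per processor): vertex v is owned by processor v / ceil(n/p), and each processor holds the CSR rows for its block. The random relabeling shown afterwards is one way to balance per-processor edge counts on skewed-degree graphs; both helper names are hypothetical.

#include <vector>
#include <numeric>
#include <random>
#include <algorithm>

// 1D partitioning: vertices are split into p contiguous blocks; a processor
// owns one block of vertices and all edges leaving that block.
inline int owner_of(int v, int n, int p) {
    int block = (n + p - 1) / p;   // ceil(n/p) vertices per processor
    return v / block;
}

// Random relabeling: a permutation of vertex IDs, applied before the block
// mapping, makes per-processor edge counts roughly equal for skewed graphs.
std::vector<int> random_relabeling(int n, unsigned seed = 1) {
    std::vector<int> perm(n);
    std::iota(perm.begin(), perm.end(), 0);
    std::shuffle(perm.begin(), perm.end(), std::mt19937(seed));
    return perm;   // new_id = perm[old_id]
}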
Slide 21
2D graph partitioning
Consider a logical 2D processor grid (p_r * p_c = p) and the sparse matrix representation of the graph
Assign each processor a sub-matrix (i.e., the edges within the sub-matrix)
[Figure: 9 vertices on a 3x3 grid of 9 processors; the adjacency matrix is divided into sub-matrices, and each sub-matrix is flattened into a per-processor local graph representation.]
Slide 22
Data structures in (parallel) graph algorithms
A wide range seen in graph algorithms: array, list, queue, stack, set, multiset, tree
Implementations are typically array-based for performance considerations.
Key data structure considerations in parallel graph algorithm design:
  Practical parallel priority queues
  Space efficiency
  Parallel set/multiset operations, e.g., union, intersection, etc.
Slide 23
Graph algorithms on current systems
Concurrency: simulating PRAM algorithms runs into hardware limits on memory bandwidth and the number of outstanding memory references, plus synchronization costs
Locality: try to improve cache locality, but avoid too much superfluous computation
Work-efficiency: is (parallel time) * (# of processors) = (serial work)?
Slide 24
The locality challenge
Serial performance of approximate betweenness centrality on a 2.67 GHz Intel Xeon 5560 (12 GB RAM, 8 MB L3 cache)
Input: synthetic R-MAT graphs (# of edges m = 8n)
Large memory footprint and low spatial and temporal locality impede performance: ~5X drop in performance between the case with no last-level cache (LLC) misses and the case with O(m) LLC misses
Slide 25
The parallel scaling challenge: classical parallel graph algorithms perform poorly on current parallel systems
Graph topology assumptions in classical algorithms do not match real-world datasets
Parallelization strategies are at loggerheads with techniques for enhancing memory locality
Classical work-efficient graph algorithms may not fully exploit new architectural features
  Increasing complexity of the memory hierarchy, processor heterogeneity, wide SIMD
Tuning implementations to minimize parallel overhead is non-trivial
  Shared memory: minimizing the overhead of locks and barriers
  Distributed memory: bounding message buffer sizes, bundling messages, overlapping communication with computation
Slide 26
Lecture Outline
Applications
Designing parallel graph algorithms, performance on current systems
Case studies: graph traversal-based problems, parallel algorithms
  Breadth-First Search
  Single-source shortest paths
  Betweenness centrality
Slide 27
Graph traversal (BFS) problem definition
Input: a graph and a source vertex. Output: the distance from the source vertex to every reachable vertex.
[Figure: example 10-vertex graph; vertices are labeled with BFS distances 1-4 from the source.]
Memory requirements (# of machine words):
  Sparse graph representation: m + n
  Stack of visited vertices: n
  Distance array: n
Slide 28
Optimizing BFS on cache-based multicore platforms, for networks with power-law degree distributions
Problem spec. and assumptions:
  No. of vertices/edges: 10^6 ~ 10^9
  Edge/vertex ratio: 1 ~ 100
  Static/dynamic? Static
  Diameter: O(1) ~ O(log n)
  Weighted/unweighted: Unweighted
  Vertex degree distribution: Unbalanced (power law)
  Directed/undirected? Both
  Simple/multi/hypergraph? Multigraph
  Granularity of computation at vertices/edges? Minimal
  Exploiting domain-specific characteristics? Partially
Test data: synthetic R-MAT networks; social network data from Mislove et al., IMC 2007.
Slide 29
Parallel BFS strategies
1. Expand the current frontier (level-synchronous approach, suited for low-diameter graphs): the adjacencies of all vertices in the current frontier are visited in parallel; O(D) parallel steps.
2. Stitch together multiple concurrent traversals (Ullman-Yannakakis approach, suited for high-diameter graphs): path-limited searches from super-vertices, followed by all-pairs shortest paths (APSP) between the super-vertices.
[Figure: the two strategies illustrated on the example graph, starting from the source vertex.]
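A compact sketch of the level-synchronous strategy in C++ with OpenMP, assuming the CSR layout sketched earlier; this is my own hedged illustration, not the lecture's tuned implementation. All adjacencies of the current frontier are scanned in parallel; the GCC/Clang builtins __sync_bool_compare_and_swap and __sync_fetch_and_add claim unvisited vertices and append to the next frontier (the unsynchronized read of dist[v] before the CAS is a deliberate simplification).

#include <vector>
#include <omp.h>

// Level-synchronous parallel BFS over a CSR graph (row_ptr, adj).
// dist[v] = -1 marks an unvisited vertex.
std::vector<int> parallel_bfs(int n, const std::vector<int>& row_ptr,
                              const std::vector<int>& adj, int source) {
    std::vector<int> dist(n, -1);
    std::vector<int> frontier(n), next(n);   // frontiers never exceed n vertices
    int fsize = 1, level = 0;
    dist[source] = 0;
    frontier[0] = source;

    while (fsize > 0) {
        int nsize = 0;
        // Expand: adjacencies of all frontier vertices are visited in parallel.
        #pragma omp parallel for schedule(dynamic, 64)
        for (int i = 0; i < fsize; ++i) {
            int u = frontier[i];
            for (int e = row_ptr[u]; e < row_ptr[u + 1]; ++e) {
                int v = adj[e];
                // Claim v atomically; only the first thread to see -1 wins.
                if (dist[v] == -1 &&
                    __sync_bool_compare_and_swap(&dist[v], -1, level + 1)) {
                    int pos = __sync_fetch_and_add(&nsize, 1);
                    next[pos] = v;           // append to the next frontier
                }
            }
        }
        frontier.swap(next);
        fsize = nsize;
        ++level;
    }
    return dist;
}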
Slide 30
A deeper dive into the level-synchronous strategy
Locality: where do the random accesses originate from?
1. Ordering of vertices in the current frontier array, i.e., accesses to the adjacency indexing array: cumulative accesses O(n).
2. Ordering of the adjacency list of each vertex: cumulative accesses O(m).
3. Sifting through adjacencies to check whether they have been visited or not: cumulative accesses O(m).
[Figure: example access patterns to the index array (step 1) and the d array (steps 2 and 3) for a small frontier.]
Slide 31
Performance observations
[Figures: per-level graph expansion and edge filtering on the Youtube and Flickr social networks.]
Slide 32
Improving locality: vertex relabeling
Well-studied problem, with slight differences in problem formulation:
  Linear algebra: sparse matrix column reordering to reduce bandwidth, reveal dense blocks.
  Databases/data mining: reordering bitmap indices for better compression; permuting vertices of WWW snapshots and online social networks for compression.
NP-hard problem, several known heuristics
  We require fast, linear-work approaches
  Existing ones: BFS- or DFS-based, Cuthill-McKee, Reverse Cuthill-McKee, exploiting overlap in adjacency lists, dimensionality reduction
[Figure: sparse matrix non-zero pattern before and after relabeling.]
Improving locality: cache blocking
Instead of processing the adjacencies of each vertex serially, exploit the sorted adjacency list structure with blocked accesses
Requires multiple passes through the frontier array and tuning for the optimal block size; note that the frontier array size may be O(n)
[Figure: linear processing of frontier adjacencies (d array) vs. the new cache-blocked approach, with metadata denoting the blocking pattern; blocks are tuned to the L2 cache size, and high-degree vertices are processed separately.]
Slide 35
Vertex relabeling heuristic
Similar to older heuristics, but tuned for small-world networks:
1. High percentage of vertices with (out-)degrees 0, 1, and 2 in social and information networks => store these adjacencies explicitly (in the indexing data structure). Augment the adjacency indexing data structure (with two additional words) and the frontier array (with one bit).
2. Process high-degree vertices' adjacencies in linear order, but other vertices with d-array cache blocking.
3. Form dense blocks around high-degree vertices: Reverse Cuthill-McKee, after removing degree-1 and degree-2 vertices.
Slide 36
Architecture-specific Optimizations
1. Software prefetching on the Intel Core i7 (supports 32 loads and 20 stores in flight): speculative loads of the index array and adjacencies of frontier vertices reduce compulsory cache misses.
2. Aligning adjacency lists to optimize memory accesses: 16-byte-aligned loads and stores are faster; alignment helps reduce cache misses due to fragmentation; 16-byte-aligned non-temporal stores (during creation of the new frontier) are fast.
3. SIMD SSE integer intrinsics to process high-degree vertex adjacencies.
4. Fast atomics (BFS is lock-free with low contention, and CAS-based intrinsics have very low overhead).
5. Hugepage support (significant TLB miss reduction).
6. NUMA-aware memory allocation exploiting the first-touch policy.
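As a hedged example of the first optimization (my own sketch, not the lecture's exact code), a BFS inner loop can issue _mm_prefetch hints for the index entry and adjacency-list head of a frontier vertex a few slots ahead of the one currently being processed. PREFETCH_DIST is a tunable, hypothetical lookahead distance, and the visit logic is kept serial for brevity.

#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_T0

const int PREFETCH_DIST = 8;   // tunable lookahead (hypothetical value)

// Sketch: while scanning the adjacencies of frontier[i], speculatively load
// the index entry and adjacency-list head of a vertex a few slots ahead.
void expand_with_prefetch(const int* frontier, int fsize,
                          const int* row_ptr, const int* adj,
                          int* dist, int level) {
    for (int i = 0; i < fsize; ++i) {
        if (i + PREFETCH_DIST < fsize) {
            int w = frontier[i + PREFETCH_DIST];
            _mm_prefetch((const char*)&row_ptr[w], _MM_HINT_T0);
            _mm_prefetch((const char*)&adj[row_ptr[w]], _MM_HINT_T0);
        }
        int u = frontier[i];
        for (int e = row_ptr[u]; e < row_ptr[u + 1]; ++e) {
            int v = adj[e];
            if (dist[v] == -1) dist[v] = level + 1;   // serial visit for brevity
        }
    }
}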
Slide 37
Experimental Setup
Network      n        m       Max. out-degree   % of vertices w/ out-degree 0,1,2
Orkut        3.07M    223M    32K               5
LiveJournal  5.28M    77.4M   9K                40
Flickr       1.86M    22.6M   26K               73
Youtube      1.15M    4.94M   28K               76
R-MAT        8M-64M   8n      n^0.6
Platform: Intel Xeon 5560 (Core i7, Nehalem), 2 sockets x 4 cores x 2-way SMT, 12 GB DRAM, 8 MB shared L3, 51.2 GBytes/sec peak bandwidth, 2.66 GHz.
Performance averaged over 10 different source vertices, 3 runs each.
Cache locality improvement
Theoretical count of the number of non-contiguous memory accesses: m + 3n
Performance count: # of non-contiguous memory accesses (assuming a cache line size of 16 words)
Slide 40
Parallel performance (Orkut graph)
Graph: 3.07 million vertices, 220 million edges. Single socket of Intel Xeon 5560 (Core i7).
Parallel speedup: 4.9. Speedup over baseline: 2.9. Execution time: 0.28 seconds (8 threads).
Slide 41
Performance Analysis
How well does my implementation match theoretical bounds? When can I stop optimizing my code?
  Begin with asymptotic analysis
  Express the work performed in terms of machine-independent performance counts
  Add input graph characteristics to the analysis
  Use simple kernels or micro-benchmarks to estimate achievable peak performance
  Relate observed performance to machine characteristics
Slide 42
Graph 500 Search Benchmark (graph500.org)
BFS (from a single vertex) on a static, undirected R-MAT network with average vertex degree 16.
Evaluation criteria: largest problem size that can be solved on a system, minimum execution time.
Reference MPI and shared-memory implementations are provided.
NERSC Franklin system is ranked #2 on the current list (Nov 2010): BFS using 500 nodes of Franklin.
Slide 43
Lecture Outline
Applications
Designing parallel graph algorithms, performance on current systems
Case studies: graph traversal-based problems, parallel algorithms
  Breadth-First Search
  Single-source shortest paths
  Betweenness centrality
Slide 44
Parallel Single-source Shortest Paths (SSSP) algorithms
Edge weights: concurrency is the primary challenge!
No known PRAM algorithm that runs in sub-linear time and O(m + n log n) work
Parallel priority queues: relaxed heaps [DGST88], [BTZ98]
Ullman-Yannakakis randomized approach [UY90]
Meyer and Sanders, Δ-stepping algorithm [MS03]
Distributed memory implementations based on graph partitioning
Heuristics for load balancing and termination detection
K. Madduri, D.A. Bader, J.W. Berry, and J.R. Crobak, "An Experimental Study of a Parallel Shortest Path Algorithm for Solving Large-Scale Graph Instances," Workshop on Algorithm Engineering and Experiments (ALENEX), New Orleans, LA, January 6, 2007.
Slide 45
Δ-stepping algorithm [MS03]
Label-correcting algorithm: can relax edges from unsettled vertices as well
Δ-stepping: "approximate bucket implementation" of Dijkstra's algorithm
Δ: bucket width
Vertices are ordered using buckets representing priority ranges of size Δ
Each bucket may be processed in parallel
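A compact serial sketch of the Δ-stepping bucket structure (my own illustration of the [MS03] idea, not the lecture's parallel code; the function name delta_stepping is hypothetical). Light edges (weight <= Δ) are relaxed repeatedly in phases while the current bucket is non-empty; heavy edges (weight > Δ) are relaxed once per settled vertex after the bucket empties. In the parallel algorithm, each phase's request set would be processed concurrently.

#include <vector>
#include <limits>

struct Edge { int to; double w; };

// Delta-stepping SSSP (serial sketch). graph[u] holds u's outgoing edges.
std::vector<double> delta_stepping(const std::vector<std::vector<Edge>>& graph,
                                   int source, double delta) {
    const double INF = std::numeric_limits<double>::infinity();
    int n = (int)graph.size();
    std::vector<double> d(n, INF);
    std::vector<std::vector<int>> buckets(1);

    auto relax = [&](int v, double dist) {
        if (dist < d[v]) {
            d[v] = dist;
            size_t b = (size_t)(dist / delta);     // target bucket index
            if (b >= buckets.size()) buckets.resize(b + 1);
            buckets[b].push_back(v);               // stale copies may remain elsewhere
        }
    };

    relax(source, 0.0);
    for (size_t i = 0; i < buckets.size(); ++i) {
        std::vector<int> settled;                  // the set S of deleted vertices
        while (!buckets[i].empty()) {              // one "parallel phase" per iteration
            std::vector<int> cur;
            cur.swap(buckets[i]);                  // clear the current bucket
            for (int u : cur) {
                if ((size_t)(d[u] / delta) != i) continue;   // skip stale entries
                settled.push_back(u);
                for (const Edge& e : graph[u])     // light-edge requests
                    if (e.w <= delta) relax(e.to, d[u] + e.w);
            }
        }
        for (int u : settled)                      // heavy edges: relaxed once
            for (const Edge& e : graph[u])
                if (e.w > delta) relax(e.to, d[u] + e.w);
    }
    return d;
}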
Slide 46
Δ-stepping algorithm: illustration (Δ = 0.1, say)
[Figure: example graph with 7 vertices (0-6) and edge weights between 0.01 and 0.56; the d array and buckets 0-6 are shown.]
One parallel phase, while the current bucket is non-empty:
  i) Inspect light edges
  ii) Construct a set of requests (R)
  iii) Clear the current bucket
  iv) Remember deleted vertices (S)
  v) Relax request pairs in R
Then relax heavy request pairs (from S), and go on to the next bucket.
Slide 47
Δ-stepping illustration (continued). Initialization: insert s into bucket 0, with d(s) = 0. [Animation frame: d array and bucket contents.]
Slide 48
Δ-stepping illustration (continued). [Animation frame: light edges of the vertex in the current bucket are inspected and a request (vertex 2, tentative distance 0.01) is added to R.]
Slide 49
Δ-stepping illustration (continued). [Animation frame: the current bucket is cleared and the deleted vertex is remembered in S.]
Slide 50
Δ-stepping illustration (continued). [Animation frame: the request in R is relaxed; d[2] becomes 0.01 and vertex 2 enters bucket 0.]
Slide 51
Δ-stepping illustration (continued). [Animation frame: light edges of vertex 2 generate new requests.]
Slide 52
Δ-stepping illustration (continued). [Animation frame: the bucket is cleared again and vertex 2 is added to S.]
Slide 53
Δ-stepping illustration (continued). [Animation frame: the new requests are relaxed; vertices 1 and 3 enter bucket 0 with d values 0.03 and 0.06.]
Slide 54
Δ-stepping illustration (continued). [Animation frame: once bucket 0 is empty, heavy request pairs from S are relaxed and the remaining vertices are inserted into later buckets.]
Slide 55
Slide 56
Classify edges as heavy (weight > Δ) and light (weight <= Δ)
Slide 57
Relax light edges (one phase); repeat until B[i] is empty
Slide 58
Relax heavy edges. No reinsertions in this step.
Slide 59
No. of phases (machine-independent performance count)
[Figure: phase counts for low-diameter vs. high-diameter graph families.]
Slide 60
Average shortest path weight for various graph families
~2^20 vertices, 2^22 edges, directed graph, edge weights normalized to [0,1]
Slide 61
Last non-empty bucket (machine-independent performance count)
Fewer buckets, more parallelism
Slide 62
Number of bucket insertions (machine-independent performance
count)
Slide 63
Lecture Outline
Applications
Designing parallel graph algorithms, performance on current systems
Case studies: graph traversal-based problems, parallel algorithms
  Breadth-First Search
  Single-source shortest paths
  Betweenness centrality
Slide 64
Betweenness Centrality
Centrality: quantitative measure to capture the importance of a vertex/edge in a graph: degree, closeness, eigenvalue, betweenness
Betweenness centrality: BC(v) = sum over all pairs s != v != t of sigma_st(v) / sigma_st (sigma_st: no. of shortest paths between s and t; sigma_st(v): the number of those passing through v)
Applied to several real-world networks: social interactions, WWW, epidemiology, systems biology
Slide 65
Algorithms for Computing Betweenness
All-pairs shortest path approach: compute the length and number of shortest paths between all s-t pairs (O(n^3) time), then sum up the fractional dependency values (O(n^2) space).
Brandes' algorithm (2001): augment a single-source shortest path computation to count paths; uses the Bellman criterion; O(mn) work and O(m+n) space.
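For reference, the standard definitions behind these two approaches (following Brandes) can be written as follows, where $\sigma_{st}$ is the number of shortest s-t paths, $\sigma_{st}(v)$ the number of those passing through $v$, and $P_s(w)$ the set of predecessors of $w$ in the shortest-path DAG rooted at $s$:

BC(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}},
\qquad
\delta_s(v) = \sum_{w \,:\, v \in P_s(w)} \frac{\sigma_{sv}}{\sigma_{sw}} \bigl(1 + \delta_s(w)\bigr),
\qquad
BC(v) = \sum_{s \neq v} \delta_s(v).

The dependency recurrence for $\delta_s(v)$ is what lets a single-source computation accumulate each source's contribution in O(m) extra work instead of storing all pairwise terms.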
Slide 66
Our New Parallel Algorithms
Madduri, Bader (2006): parallel algorithms for computing exact and approximate betweenness centrality on low-diameter sparse graphs (diameter D = O(log n), m = O(n log n))
  Exact algorithm: O(mn) work, O(m+n) space, O(nD + nm/p) time.
Madduri et al. (2009): new parallel algorithm with lower synchronization overhead and fewer non-contiguous memory references
  In practice, 2-3X faster than the previous algorithm
  Lock-free => better scalability on large parallel systems
Slide 67
Parallel BC Algorithm
Consider an undirected, unweighted graph
High-level idea: level-synchronous parallel breadth-first search augmented to compute centrality scores
Two steps:
  1. traversal and path counting
  2. dependency accumulation
Parallel BC Algorithm Illustration
1. Traversal step: visit adjacent vertices, update distance and path counts.
[Figure: example graph; the distance (D), path count, stack (S), and predecessor (P) structures after visiting the source's neighbors.]
Slide 71
Parallel BC Algorithm Illustration
1. Traversal step: visit adjacent vertices, update distance and path counts.
Level-synchronous approach: the adjacencies of all vertices in the current frontier can be visited in parallel.
[Figure: the D, path count, S, and P structures after the second level.]
Slide 72
Parallel BC Algorithm Illustration
1. Traversal step: at the end, we have all reachable vertices, their corresponding predecessor multisets, and D values.
Level-synchronous approach: the adjacencies of all vertices in the current frontier can be visited in parallel.
[Figure: the completed D, path count, S, and P structures for the example graph.]
Slide 73
Graph traversal step analysis
Exploit concurrency in visiting adjacencies, as we assume that the graph diameter is small: O(log n)
Upper bound on the size of each predecessor multiset: in-degree
Potential performance bottlenecks: atomic updates to predecessor multisets, atomic increments of path counts
New algorithm: based on the observation that we don't need to store predecessor vertices; instead, we store successor edges along shortest paths
  simplifies the accumulation step
  removes one atomic operation from the traversal step
  cache-friendly!
Slide 74
Graph Traversal Step: locality analysis

for all vertices u at level d in parallel do
  for all adjacencies v of u in parallel do      // adjacencies stored compactly (graph rep.)
    dv = D[v];                                   // non-contiguous memory access
    if (dv < 0)
      vis = fetch_and_add(&Visited[v], 1);       // non-contiguous memory access
      if (vis == 0)
        D[v] = d + 1;
        pS[count++] = v;                         // store to S
      fetch_and_add(&sigma[v], sigma[u]);        // non-contiguous memory access
      fetch_and_add(&Scount[u], 1);
    if (dv == d + 1)
      fetch_and_add(&sigma[v], sigma[u]);
      fetch_and_add(&Scount[u], 1);

All the vertices at a level are in a contiguous block (stack), and all the adjacencies of a vertex are stored compactly (graph representation). Better cache utilization is likely if D[v], Visited[v], and sigma[v] are stored contiguously.
Parallel BC Algorithm Illustration
2. Accumulation step: can also be done in a level-synchronous manner.
[Figure: delta (dependency) values accumulated level by level, from the deepest level back toward the source vertex.]
Slide 77
Accumulation step: locality analysis

for level d = GraphDiameter-2 to 1 do
  for all vertices v at level d in parallel do
    delta_sum_v = 0;
    for all w in S[v] in parallel do             // reduction over delta_sum_v
      delta_sum_v += (1 + delta[w]) * sigma[v] / sigma[w];
    delta[v] = delta_sum_v;
    BC[v] += delta[v];                           // accumulated over all sources

All the vertices at a level are in a contiguous block (stack), and each S[v] is a contiguous block. The accumulation expression contains the only floating-point operations in the code.
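Putting the two steps together, here is a hedged serial sketch (my own, not the lecture's multithreaded code; the function name bc_from_source is hypothetical) of one source's contribution to betweenness using successor lists, mirroring the traversal and accumulation pseudocode above. Summing the contributions over all sources (or over a sample of sources, for approximate BC) gives the betweenness scores.

#include <vector>

// One source's contribution to betweenness centrality, using successor lists
// as in the traversal/accumulation pseudocode above (serial sketch).
void bc_from_source(int n, const std::vector<std::vector<int>>& adj,
                    int s, std::vector<double>& BC) {
    std::vector<int> D(n, -1);                  // BFS level
    std::vector<double> sigma(n, 0.0);          // shortest-path counts
    std::vector<std::vector<int>> succ(n);      // successor edges on shortest paths
    std::vector<std::vector<int>> levels;       // vertices grouped by BFS level

    D[s] = 0; sigma[s] = 1.0;
    levels.push_back({s});
    for (int d = 0; !levels[d].empty(); ++d) {  // traversal and path counting
        levels.push_back({});
        for (int u : levels[d]) {
            for (int v : adj[u]) {
                if (D[v] < 0) { D[v] = d + 1; levels[d + 1].push_back(v); }
                if (D[v] == d + 1) {            // v is a successor of u
                    sigma[v] += sigma[u];
                    succ[u].push_back(v);
                }
            }
        }
    }

    // Dependency accumulation, from the deepest level back toward the source.
    std::vector<double> delta(n, 0.0);
    for (int d = (int)levels.size() - 2; d >= 1; --d)
        for (int v : levels[d])
            for (int w : succ[v])
                delta[v] += (sigma[v] / sigma[w]) * (1.0 + delta[w]);
    for (int v = 0; v < n; ++v)
        if (v != s) BC[v] += delta[v];
}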
Slide 78
Centrality Analysis applied to Protein Interaction Networks
[Figure: protein interaction network; the highlighted protein (Ensembl ID ENSG00000145332.2, Kelch-like protein 8) has 43 interactions.]
Slide 79
Designing parallel algorithms for large sparse graph analysis
[Figure: problem size (n = # of vertices/edges, 10^4 to 10^12) vs. work performed (O(n), O(n log n), O(n^2)); regimes range from RandomAccess-like to Stream-like, with strategies to improve locality, apply data reduction/compression, and use faster methods as problems approach peta-scale.]
System requirements: high bandwidth (on-chip memory, DRAM, network, I/O).
Solution: efficiently utilize the available memory bandwidth; algorithmic innovation to avoid corner cases.
Slide 80
Review of lecture
Applications: Internet and WWW, scientific computing, data analysis, surveillance
Overview of parallel graph algorithms: PRAM algorithms, graph representation
Parallel algorithm case studies:
  BFS: locality, level-synchronous approach, multicore tuning
  Shortest paths: exploiting concurrency, parallel priority queue
  Betweenness centrality: importance of locality, atomics