CS267, Spring 2011 (April 7, 2011)
Parallel Graph Algorithms
Kamesh Madduri ([email protected])
Lawrence Berkeley National Laboratory
Slide 2
Lecture Outline
Applications
Designing parallel graph algorithms, performance on current systems
Case studies: graph traversal-based problems, parallel algorithms
  Breadth-First Search
  Single-source shortest paths
  Betweenness centrality
Slide 3
Routing in transportation networks
Road networks, point-to-point shortest paths: 15 seconds (naïve) vs. 10 microseconds
H. Bast et al., "Fast Routing in Road Networks with Transit Nodes," Science, 27 April 2007.
Slide 4
Internet and the WWW
The world-wide web can be represented as a directed graph
  Web search and crawl: traversal
  Link analysis, ranking: PageRank and HITS
  Document classification and clustering
Internet topologies (router networks) are naturally modeled as graphs
Slide 5
Scientific Computing
Reorderings for sparse solvers
  Fill-reducing orderings: partitioning, eigenvectors
  Heavy diagonal to reduce pivoting (matching)
Data structures for efficient exploitation of sparsity
Derivative computations for optimization: graph colorings, spanning trees
Preconditioning
  Incomplete factorizations
  Partitioning for domain decomposition
  Graph techniques in algebraic multigrid: independent sets, matchings, etc.
  Support theory: spanning trees & graph embedding techniques
B. Hendrickson, "Graphs and HPC: Lessons for Future Architectures,"
http://www.er.doe.gov/ascr/ascac/Meetings/Oct08/Hendrickson%20ASCAC.pdf
Image sources: Yifan Hu, "A gallery of large graphs"; Tim Davis, UF Sparse Matrix Collection.
Slide 6
Large-scale data analysis
Graph abstractions are very useful to analyze complex data sets.
Sources of data: petascale simulations, experimental devices, the Internet, sensor networks
Challenges: data size, heterogeneity, uncertainty, data quality
  Astrophysics: massive datasets, temporal variations
  Bioinformatics: data quality, heterogeneity
  Social informatics: new analytics challenges, data uncertainty
Image sources: (1) http://physics.nmt.edu/images/astro/hst_starfield.jpg (2, 3) www.visualComplexity.com
Slide 7
Data Analysis and Graph Algorithms in Systems Biology
Study of the interactions between various components in a biological system
Graph-theoretic formulations are pervasive:
  Predicting new interactions: modeling
  Functional annotation of novel proteins: matching, clustering
  Identifying metabolic pathways: paths, clustering
  Identifying new protein complexes: clustering, centrality
Image source: Giot et al., "A Protein Interaction Map of Drosophila melanogaster," Science 302, 1722-1736, 2003.
Slide 8
Graph-theoretic problems in social networks
  Targeted advertising: clustering and centrality
  Studying the spread of information
Image source: Nexus (Facebook application)
Slide 9
Network Analysis for Intelligence and Surveillance
[Krebs '04] Post-9/11 terrorist network analysis from public-domain information
Plot masterminds correctly identified from interaction patterns: centrality
A global view of entities is often more insightful
Detect anomalous activities by exact/approximate subgraph isomorphism
Image sources: http://www.orgnet.com/hijackers.html; T. Coffman, S. Greenblatt, S. Marcus, "Graph-based technologies for intelligence analysis," CACM 47(3), March 2004, pp. 45-47.
Slide 10
Research in Parallel Graph Algorithms
Application areas: social network analysis, WWW, computational biology, scientific computing, engineering; e.g., find central entities, community detection, network dynamics, gene regulation, metabolic pathways, genomics, marketing, social search, VLSI CAD, route planning
Methods/problems (graph algorithms): traversal, shortest paths, connectivity, max flow, graph partitioning, matching, coloring
Architectures: GPUs, FPGAs, x86 multicore servers, massively multithreaded architectures, multicore clusters, clouds
Challenges: data size, problem complexity
Slide 11
Characterizing Graph-theoretic Computations
Input data characteristics: graph sparsity (m/n ratio), static/dynamic nature, weighted/unweighted and weight distribution, vertex degree distribution, directed/undirected, simple/multi/hypergraph, problem size, granularity of computation at nodes/edges, domain-specific characteristics
Problem: find *** (paths, clusters, partitions, matchings, patterns, orderings)
Graph kernel: traversal, shortest path algorithms, flow algorithms, spanning tree algorithms, topological sort, ...
These factors influence the choice of algorithm.
Graph problems are often recast as sparse linear algebra (e.g., partitioning) or linear programming (e.g., matching) computations.
Slide 12
Lecture Outline
Applications
Designing parallel graph algorithms, performance on current systems
Case studies: graph traversal-based problems, parallel algorithms
  Breadth-First Search
  Single-source shortest paths
  Betweenness centrality
Slide 13
History
1735: Seven Bridges of Königsberg problem, resolved by Euler; one of the first graph theory results.
1966: Flynn's taxonomy.
1968: Batcher's sorting networks.
1969: Harary's Graph Theory.
1972: Tarjan's "Depth-first search and linear graph algorithms."
1975: Reghbati and Corneil, parallel connected components.
1982: Misra and Chandy, distributed graph algorithms.
1984: Quinn and Deo's survey paper on parallel graph algorithms.
Slide 14
The PRAM model
Idealized parallel shared-memory system model
  Unbounded number of synchronous processors; no synchronization or communication cost; no parallel overhead
  Variants: EREW (Exclusive Read Exclusive Write), CREW (Concurrent Read Exclusive Write)
Measuring performance: space and time complexity; total number of operations (work)
Slide 15
PRAM Pros and Cons
Pros:
  Simple and clean semantics.
  The majority of theoretical parallel algorithms are designed using the PRAM model.
  Independent of the communication network topology.
Cons:
  Not realistic; too powerful a communication model.
  The algorithm designer is misled into using inter-processor communication without hesitation.
  Synchronized processors; no local memory.
  Big-O notation is often misleading.
Slide 16
PRAM Algorithms for Connected Components
Reghbati and Corneil [1975]: O(log^2 n) time and O(n^3) processors
Wyllie [1979]: O(log^2 n) time and O(m) processors
Shiloach and Vishkin [1982]: O(log n) time and O(m) processors
Reif and Spirakis [1982]: O(log log n) time and O(n) processors (expected)
Slide 17
Building blocks of classical PRAM graph algorithms
  Prefix sums
  Symmetry breaking
  Pointer jumping
  List ranking
  Euler tours
  Vertex collapse
  Tree contraction
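As a hedged illustration of one of these building blocks (my own sketch, not code from the lecture), the C++ fragment below simulates pointer jumping for list ranking: each node repeatedly adds its successor's rank and then jumps its pointer to its successor's successor, so an n-node list is ranked in O(log n) rounds. On a PRAM every round would run for all nodes in parallel; here the rounds are simulated serially with double buffering. The function name list_rank and the example list are hypothetical.

#include <cstdio>
#include <vector>

// Pointer-jumping list ranking (PRAM-style rounds, simulated serially).
// next[i] is the successor of node i; the list tail points to itself.
// On exit, rank[i] = number of hops from node i to the tail.
void list_rank(std::vector<int>& next, std::vector<int>& rank) {
    int n = (int)next.size();
    std::vector<int> nxt2(n), rank2(n);
    for (int i = 0; i < n; ++i) rank[i] = (next[i] == i) ? 0 : 1;
    for (int round = 0; round < 32; ++round) {        // O(log n) rounds suffice
        bool changed = false;
        for (int i = 0; i < n; ++i) {                 // "in parallel" on a PRAM
            rank2[i] = rank[i] + ((next[i] == i) ? 0 : rank[next[i]]);
            nxt2[i]  = next[next[i]];                 // jump the pointer
            if (nxt2[i] != next[i]) changed = true;
        }
        rank.swap(rank2);
        next.swap(nxt2);
        if (!changed) break;                          // all pointers reached the tail
    }
}

int main() {
    // Hypothetical 6-node list: 3 -> 0 -> 4 -> 1 -> 5 -> 2 (tail).
    std::vector<int> next = {4, 5, 2, 0, 1, 2};
    std::vector<int> rank(next.size());
    list_rank(next, rank);
    for (size_t i = 0; i < rank.size(); ++i)
        std::printf("node %zu: rank %d\n", i, rank[i]);
    return 0;
}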
Slide 18
Data structures: graph representation
Static case
  Dense graphs (m = O(n^2)): adjacency matrix commonly used.
  Sparse graphs: adjacency lists.
Dynamic case: representation depends on the common-case query
  Edge insertions or deletions? Vertex insertions or deletions? Edge weight updates?
  Graph update rate
  Queries: connectivity, paths, flow, etc.
  Optimizing for locality is a key design consideration.
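For concreteness, here is a minimal sketch (my own illustration, not the lecture's code) of the array-based adjacency-list layout commonly used for static sparse graphs: compressed sparse row (CSR) storage, with an index array of length n+1 and an adjacency array with one entry per directed edge (an undirected edge is stored in both directions). The names CSRGraph and build_csr are hypothetical.

#include <vector>
#include <utility>

// Compressed Sparse Row (CSR) graph: row_ptr has n+1 entries, adj has one
// entry per directed edge. Neighbors of u are adj[row_ptr[u] .. row_ptr[u+1]-1].
struct CSRGraph {
    int n;
    std::vector<int> row_ptr;
    std::vector<int> adj;
};

// Build CSR from a directed edge list (u, v): one pass to count degrees,
// a prefix sum over the counts, and one pass to place adjacencies.
CSRGraph build_csr(int n, const std::vector<std::pair<int,int>>& edges) {
    CSRGraph g;
    g.n = n;
    g.row_ptr.assign(n + 1, 0);
    for (const auto& e : edges) g.row_ptr[e.first + 1]++;            // degree counts
    for (int i = 0; i < n; ++i) g.row_ptr[i + 1] += g.row_ptr[i];    // prefix sum
    g.adj.resize(edges.size());
    std::vector<int> cursor(g.row_ptr.begin(), g.row_ptr.end() - 1); // write positions
    for (const auto& e : edges) g.adj[cursor[e.first]++] = e.second;
    return g;
}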
Distributed graph representation
  Each processor stores the entire graph (full replication)
  Each processor stores n/p vertices and all adjacencies out of these vertices (1D partitioning)
How to create these p vertex partitions?
  Graph partitioning algorithms: recursively optimize for conductance (edge cut / size of smaller partition)
  Randomly shuffling the vertex identifiers ensures that the edge count per processor is roughly the same
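A minimal sketch of the 1D block distribution just described (my own illustration, under the assumption of contiguous blocks of ceil(n/p) vertices per processor): vertex v is owned by processor v / ceil(n/p), and each processor holds the CSR rows for its block. The random relabeling shown afterwards is one way to balance per-processor edge counts on skewed-degree graphs; both helper names are hypothetical.

#include <vector>
#include <numeric>
#include <random>
#include <algorithm>

// 1D partitioning: vertices are split into p contiguous blocks; a processor
// owns one block of vertices and all edges leaving that block.
inline int owner_of(int v, int n, int p) {
    int block = (n + p - 1) / p;   // ceil(n/p) vertices per processor
    return v / block;
}

// Random relabeling: a permutation of vertex IDs, applied before the block
// mapping, makes per-processor edge counts roughly equal for skewed graphs.
std::vector<int> random_relabeling(int n, unsigned seed = 1) {
    std::vector<int> perm(n);
    std::iota(perm.begin(), perm.end(), 0);
    std::shuffle(perm.begin(), perm.end(), std::mt19937(seed));
    return perm;   // new_id = perm[old_id]
}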
Slide 21
2D graph partitioning
Consider a logical 2D processor grid (p_r * p_c = p) and the sparse matrix representation of the graph
Assign each processor a sub-matrix (i.e., the edges within the sub-matrix)
[Figure: 9 vertices on a 3x3 grid of 9 processors; the adjacency matrix is divided into sub-matrices, and each sub-matrix is flattened into a per-processor local graph representation.]
Slide 22
Data structures in (parallel) graph algorithms
A wide range seen in graph algorithms: array, list, queue, stack, set, multiset, tree
Implementations are typically array-based for performance considerations.
Key data structure considerations in parallel graph algorithm design:
  Practical parallel priority queues
  Space efficiency
  Parallel set/multiset operations, e.g., union, intersection, etc.
Slide 23
Graph algorithms on current systems
Concurrency: simulating PRAM algorithms runs into hardware limits on memory bandwidth and the number of outstanding memory references, plus synchronization costs
Locality: try to improve cache locality, but avoid too much superfluous computation
Work-efficiency: is (parallel time) * (# of processors) = (serial work)?
Slide 24
The locality challenge
Serial performance of approximate betweenness centrality on a 2.67 GHz Intel Xeon 5560 (12 GB RAM, 8 MB L3 cache)
Input: synthetic R-MAT graphs (# of edges m = 8n)
Large memory footprint and low spatial and temporal locality impede performance: ~5X drop in performance between the case with no last-level cache (LLC) misses and the case with O(m) LLC misses
Slide 25
The parallel scaling challenge: classical parallel graph algorithms perform poorly on current parallel systems
Graph topology assumptions in classical algorithms do not match real-world datasets
Parallelization strategies are at loggerheads with techniques for enhancing memory locality
Classical work-efficient graph algorithms may not fully exploit new architectural features
  Increasing complexity of the memory hierarchy, processor heterogeneity, wide SIMD
Tuning implementations to minimize parallel overhead is non-trivial
  Shared memory: minimizing the overhead of locks and barriers
  Distributed memory: bounding message buffer sizes, bundling messages, overlapping communication with computation
Slide 26
Lecture Outline
Applications
Designing parallel graph algorithms, performance on current systems
Case studies: graph traversal-based problems, parallel algorithms
  Breadth-First Search
  Single-source shortest paths
  Betweenness centrality
Slide 27
Graph traversal (BFS) problem definition
Input: a graph and a source vertex. Output: the distance from the source vertex to every reachable vertex.
[Figure: example 10-vertex graph; vertices are labeled with BFS distances 1-4 from the source.]
Memory requirements (# of machine words):
  Sparse graph representation: m + n
  Stack of visited vertices: n
  Distance array: n
Slide 28
Optimizing BFS on cache-based multicore platforms, for networks with power-law degree distributions
Problem spec. and assumptions:
  No. of vertices/edges: 10^6 ~ 10^9
  Edge/vertex ratio: 1 ~ 100
  Static/dynamic? Static
  Diameter: O(1) ~ O(log n)
  Weighted/unweighted: Unweighted
  Vertex degree distribution: Unbalanced (power law)
  Directed/undirected? Both
  Simple/multi/hypergraph? Multigraph
  Granularity of computation at vertices/edges? Minimal
  Exploiting domain-specific characteristics? Partially
Test data: synthetic R-MAT networks; social network data from Mislove et al., IMC 2007.
Slide 29
Parallel BFS strategies
1. Expand the current frontier (level-synchronous approach, suited for low-diameter graphs): the adjacencies of all vertices in the current frontier are visited in parallel; O(D) parallel steps.
2. Stitch together multiple concurrent traversals (Ullman-Yannakakis approach, suited for high-diameter graphs): path-limited searches from super-vertices, followed by all-pairs shortest paths (APSP) between the super-vertices.
[Figure: the two strategies illustrated on the example graph, starting from the source vertex.]
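A compact sketch of the level-synchronous strategy in C++ with OpenMP, assuming the CSR layout sketched earlier; this is my own hedged illustration, not the lecture's tuned implementation. All adjacencies of the current frontier are scanned in parallel; the GCC/Clang builtins __sync_bool_compare_and_swap and __sync_fetch_and_add claim unvisited vertices and append to the next frontier (the unsynchronized read of dist[v] before the CAS is a deliberate simplification).

#include <vector>
#include <omp.h>

// Level-synchronous parallel BFS over a CSR graph (row_ptr, adj).
// dist[v] = -1 marks an unvisited vertex.
std::vector<int> parallel_bfs(int n, const std::vector<int>& row_ptr,
                              const std::vector<int>& adj, int source) {
    std::vector<int> dist(n, -1);
    std::vector<int> frontier(n), next(n);   // frontiers never exceed n vertices
    int fsize = 1, level = 0;
    dist[source] = 0;
    frontier[0] = source;

    while (fsize > 0) {
        int nsize = 0;
        // Expand: adjacencies of all frontier vertices are visited in parallel.
        #pragma omp parallel for schedule(dynamic, 64)
        for (int i = 0; i < fsize; ++i) {
            int u = frontier[i];
            for (int e = row_ptr[u]; e < row_ptr[u + 1]; ++e) {
                int v = adj[e];
                // Claim v atomically; only the first thread to see -1 wins.
                if (dist[v] == -1 &&
                    __sync_bool_compare_and_swap(&dist[v], -1, level + 1)) {
                    int pos = __sync_fetch_and_add(&nsize, 1);
                    next[pos] = v;           // append to the next frontier
                }
            }
        }
        frontier.swap(next);
        fsize = nsize;
        ++level;
    }
    return dist;
}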
Slide 30
A deeper dive into the level-synchronous strategy
Locality: where do the random accesses originate from?
1. Ordering of vertices in the current frontier array, i.e., accesses to the adjacency indexing array: cumulative accesses O(n).
2. Ordering of the adjacency list of each vertex: cumulative accesses O(m).
3. Sifting through adjacencies to check whether they have been visited or not: cumulative accesses O(m).
[Figure: example access patterns to the index array (step 1) and the d array (steps 2 and 3) for a small frontier.]
Slide 31
Performance observations
[Figures: per-level graph expansion and edge filtering on the Youtube and Flickr social networks.]
Slide 32
Improving locality: vertex relabeling
Well-studied problem, with slight differences in problem formulation:
  Linear algebra: sparse matrix column reordering to reduce bandwidth, reveal dense blocks.
  Databases/data mining: reordering bitmap indices for better compression; permuting vertices of WWW snapshots and online social networks for compression.
NP-hard problem, several known heuristics
  We require fast, linear-work approaches
  Existing ones: BFS- or DFS-based, Cuthill-McKee, Reverse Cuthill-McKee, exploiting overlap in adjacency lists, dimensionality reduction
[Figure: sparse matrix non-zero pattern before and after relabeling.]
Improving locality: cache blocking
Instead of processing the adjacencies of each vertex serially, exploit the sorted adjacency list structure with blocked accesses
Requires multiple passes through the frontier array and tuning for the optimal block size; note that the frontier array size may be O(n)
[Figure: linear processing of frontier adjacencies (d array) vs. the new cache-blocked approach, with metadata denoting the blocking pattern; blocks are tuned to the L2 cache size, and high-degree vertices are processed separately.]
Slide 35
Vertex relabeling heuristic
Similar to older heuristics, but tuned for small-world networks:
1. High percentage of vertices with (out-)degrees 0, 1, and 2 in social and information networks => store these adjacencies explicitly (in the indexing data structure). Augment the adjacency indexing data structure (with two additional words) and the frontier array (with one bit).
2. Process high-degree vertices' adjacencies in linear order, but other vertices with d-array cache blocking.
3. Form dense blocks around high-degree vertices: Reverse Cuthill-McKee, after removing degree-1 and degree-2 vertices.
Slide 36
Architecture-specific Optimizations
1. Software prefetching on the Intel Core i7 (supports 32 loads and 20 stores in flight): speculative loads of the index array and adjacencies of frontier vertices reduce compulsory cache misses.
2. Aligning adjacency lists to optimize memory accesses: 16-byte-aligned loads and stores are faster; alignment helps reduce cache misses due to fragmentation; 16-byte-aligned non-temporal stores (during creation of the new frontier) are fast.
3. SIMD SSE integer intrinsics to process high-degree vertex adjacencies.
4. Fast atomics (BFS is lock-free with low contention, and CAS-based intrinsics have very low overhead).
5. Hugepage support (significant TLB miss reduction).
6. NUMA-aware memory allocation exploiting the first-touch policy.
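As a hedged example of the first optimization (my own sketch, not the lecture's exact code), a BFS inner loop can issue _mm_prefetch hints for the index entry and adjacency-list head of a frontier vertex a few slots ahead of the one currently being processed. PREFETCH_DIST is a tunable, hypothetical lookahead distance, and the visit logic is kept serial for brevity.

#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_T0

const int PREFETCH_DIST = 8;   // tunable lookahead (hypothetical value)

// Sketch: while scanning the adjacencies of frontier[i], speculatively load
// the index entry and adjacency-list head of a vertex a few slots ahead.
void expand_with_prefetch(const int* frontier, int fsize,
                          const int* row_ptr, const int* adj,
                          int* dist, int level) {
    for (int i = 0; i < fsize; ++i) {
        if (i + PREFETCH_DIST < fsize) {
            int w = frontier[i + PREFETCH_DIST];
            _mm_prefetch((const char*)&row_ptr[w], _MM_HINT_T0);
            _mm_prefetch((const char*)&adj[row_ptr[w]], _MM_HINT_T0);
        }
        int u = frontier[i];
        for (int e = row_ptr[u]; e < row_ptr[u + 1]; ++e) {
            int v = adj[e];
            if (dist[v] == -1) dist[v] = level + 1;   // serial visit for brevity
        }
    }
}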
Slide 37
Experimental Setup
Network      n        m       Max. out-degree   % of vertices w/ out-degree 0,1,2
Orkut        3.07M    223M    32K               5
LiveJournal  5.28M    77.4M   9K                40
Flickr       1.86M    22.6M   26K               73
Youtube      1.15M    4.94M   28K               76
R-MAT        8M-64M   8n      n^0.6
Platform: Intel Xeon 5560 (Core i7, Nehalem), 2 sockets x 4 cores x 2-way SMT, 12 GB DRAM, 8 MB shared L3, 51.2 GBytes/sec peak bandwidth, 2.66 GHz.
Performance averaged over 10 different source vertices, 3 runs each.
Cache locality improvement
Theoretical count of the number of non-contiguous memory accesses: m + 3n
Performance count: # of non-contiguous memory accesses (assuming a cache line size of 16 words)
Slide 40
Parallel performance (Orkut graph)
Graph: 3.07 million vertices, 220 million edges. Single socket of Intel Xeon 5560 (Core i7).
Parallel speedup: 4.9. Speedup over baseline: 2.9. Execution time: 0.28 seconds (8 threads).
Slide 41
Performance Analysis
How well does my implementation match theoretical bounds? When can I stop optimizing my code?
  Begin with asymptotic analysis
  Express the work performed in terms of machine-independent performance counts
  Add input graph characteristics to the analysis
  Use simple kernels or micro-benchmarks to estimate achievable peak performance
  Relate observed performance to machine characteristics
Slide 42
Graph 500 Search Benchmark (graph500.org)
BFS (from a single vertex) on a static, undirected R-MAT network with average vertex degree 16.
Evaluation criteria: largest problem size that can be solved on a system, minimum execution time.
Reference MPI and shared-memory implementations are provided.
NERSC Franklin system is ranked #2 on the current list (Nov 2010): BFS using 500 nodes of Franklin.
Slide 43
Lecture Outline
Applications
Designing parallel graph algorithms, performance on current systems
Case studies: graph traversal-based problems, parallel algorithms
  Breadth-First Search
  Single-source shortest paths
  Betweenness centrality
Slide 44
Parallel Single-source Shortest Paths (SSSP) algorithms
Edge weights: concurrency is the primary challenge!
No known PRAM algorithm that runs in sub-linear time and O(m + n log n) work
Parallel priority queues: relaxed heaps [DGST88], [BTZ98]
Ullman-Yannakakis randomized approach [UY90]
Meyer and Sanders, Δ-stepping algorithm [MS03]
Distributed memory implementations based on graph partitioning
Heuristics for load balancing and termination detection
K. Madduri, D.A. Bader, J.W. Berry, and J.R. Crobak, "An Experimental Study of a Parallel Shortest Path Algorithm for Solving Large-Scale Graph Instances," Workshop on Algorithm Engineering and Experiments (ALENEX), New Orleans, LA, January 6, 2007.
Slide 45
Δ-stepping algorithm [MS03]
Label-correcting algorithm: can relax edges from unsettled vertices as well
Δ-stepping: "approximate bucket implementation" of Dijkstra's algorithm
Δ: bucket width
Vertices are ordered using buckets representing priority ranges of size Δ
Each bucket may be processed in parallel
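A compact serial sketch of the Δ-stepping bucket structure (my own illustration of the [MS03] idea, not the lecture's parallel code; the function name delta_stepping is hypothetical). Light edges (weight <= Δ) are relaxed repeatedly in phases while the current bucket is non-empty; heavy edges (weight > Δ) are relaxed once per settled vertex after the bucket empties. In the parallel algorithm, each phase's request set would be processed concurrently.

#include <vector>
#include <limits>

struct Edge { int to; double w; };

// Delta-stepping SSSP (serial sketch). graph[u] holds u's outgoing edges.
std::vector<double> delta_stepping(const std::vector<std::vector<Edge>>& graph,
                                   int source, double delta) {
    const double INF = std::numeric_limits<double>::infinity();
    int n = (int)graph.size();
    std::vector<double> d(n, INF);
    std::vector<std::vector<int>> buckets(1);

    auto relax = [&](int v, double dist) {
        if (dist < d[v]) {
            d[v] = dist;
            size_t b = (size_t)(dist / delta);     // target bucket index
            if (b >= buckets.size()) buckets.resize(b + 1);
            buckets[b].push_back(v);               // stale copies may remain elsewhere
        }
    };

    relax(source, 0.0);
    for (size_t i = 0; i < buckets.size(); ++i) {
        std::vector<int> settled;                  // the set S of deleted vertices
        while (!buckets[i].empty()) {              // one "parallel phase" per iteration
            std::vector<int> cur;
            cur.swap(buckets[i]);                  // clear the current bucket
            for (int u : cur) {
                if ((size_t)(d[u] / delta) != i) continue;   // skip stale entries
                settled.push_back(u);
                for (const Edge& e : graph[u])     // light-edge requests
                    if (e.w <= delta) relax(e.to, d[u] + e.w);
            }
        }
        for (int u : settled)                      // heavy edges: relaxed once
            for (const Edge& e : graph[u])
                if (e.w > delta) relax(e.to, d[u] + e.w);
    }
    return d;
}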
Slide 46
Δ-stepping algorithm: illustration (Δ = 0.1, say)
[Figure: example graph with 7 vertices (0-6) and edge weights between 0.01 and 0.56; the d array and buckets 0-6 are shown.]
One parallel phase, while the current bucket is non-empty:
  i) Inspect light edges
  ii) Construct a set of requests (R)
  iii) Clear the current bucket
  iv) Remember deleted vertices (S)
  v) Relax request pairs in R
Then relax heavy request pairs (from S), and go on to the next bucket.
Slide 47
Δ-stepping illustration (continued). Initialization: insert s into bucket 0, with d(s) = 0. [Animation frame: d array and bucket contents.]
Slide 48
Δ-stepping illustration (continued). [Animation frame: light edges of the vertex in the current bucket are inspected and a request (vertex 2, tentative distance 0.01) is added to R.]
Slide 49
Δ-stepping illustration (continued). [Animation frame: the current bucket is cleared and the deleted vertex is remembered in S.]
Slide 50
Δ-stepping illustration (continued). [Animation frame: the request in R is relaxed; d[2] becomes 0.01 and vertex 2 enters bucket 0.]
Slide 51
Δ-stepping illustration (continued). [Animation frame: light edges of vertex 2 generate new requests.]
Slide 52
Δ-stepping illustration (continued). [Animation frame: the bucket is cleared again and vertex 2 is added to S.]
Slide 53
Δ-stepping illustration (continued). [Animation frame: the new requests are relaxed; vertices 1 and 3 enter bucket 0 with d values 0.03 and 0.06.]
Slide 54
Δ-stepping illustration (continued). [Animation frame: once bucket 0 is empty, heavy request pairs from S are relaxed and the remaining vertices are inserted into later buckets.]
Slide 55
Slide 56
Classify edges as heavy (weight > Δ) and light (weight <= Δ)
Slide 57
Relax light edges (one phase); repeat until B[i] is empty
Slide 58
Relax heavy edges. No reinsertions in this step.
Slide 59
No. of phases (machine-independent performance count)
[Figure: phase counts for low-diameter vs. high-diameter graph families.]
Slide 60
Average shortest path weight for various graph families
~2^20 vertices, 2^22 edges, directed graph, edge weights normalized to [0,1]
Slide 61
Last non-empty bucket (machine-independent performance count)
Fewer buckets, more parallelism
Slide 62
Number of bucket insertions (machine-independent performance
count)
Slide 63
Lecture Outline
Applications
Designing parallel graph algorithms, performance on current systems
Case studies: graph traversal-based problems, parallel algorithms
  Breadth-First Search
  Single-source shortest paths
  Betweenness centrality
Slide 64
Betweenness Centrality
Centrality: quantitative measure to capture the importance of a vertex/edge in a graph: degree, closeness, eigenvalue, betweenness
Betweenness centrality: BC(v) = sum over all pairs s != v != t of sigma_st(v) / sigma_st (sigma_st: no. of shortest paths between s and t; sigma_st(v): the number of those passing through v)
Applied to several real-world networks: social interactions, WWW, epidemiology, systems biology
Slide 65
Algorithms for Computing Betweenness
All-pairs shortest path approach: compute the length and number of shortest paths between all s-t pairs (O(n^3) time), then sum up the fractional dependency values (O(n^2) space).
Brandes' algorithm (2001): augment a single-source shortest path computation to count paths; uses the Bellman criterion; O(mn) work and O(m+n) space.
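For reference, the standard definitions behind these two approaches (following Brandes) can be written as follows, where $\sigma_{st}$ is the number of shortest s-t paths, $\sigma_{st}(v)$ the number of those passing through $v$, and $P_s(w)$ the set of predecessors of $w$ in the shortest-path DAG rooted at $s$:

BC(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}},
\qquad
\delta_s(v) = \sum_{w \,:\, v \in P_s(w)} \frac{\sigma_{sv}}{\sigma_{sw}} \bigl(1 + \delta_s(w)\bigr),
\qquad
BC(v) = \sum_{s \neq v} \delta_s(v).

The dependency recurrence for $\delta_s(v)$ is what lets a single-source computation accumulate each source's contribution in O(m) extra work instead of storing all pairwise terms.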
Slide 66
Our New Parallel Algorithms
Madduri, Bader (2006): parallel algorithms for computing exact and approximate betweenness centrality on low-diameter sparse graphs (diameter D = O(log n), m = O(n log n))
  Exact algorithm: O(mn) work, O(m+n) space, O(nD + nm/p) time.
Madduri et al. (2009): new parallel algorithm with lower synchronization overhead and fewer non-contiguous memory references
  In practice, 2-3X faster than the previous algorithm
  Lock-free => better scalability on large parallel systems
Slide 67
Parallel BC Algorithm
Consider an undirected, unweighted graph
High-level idea: level-synchronous parallel breadth-first search augmented to compute centrality scores
Two steps:
  1. traversal and path counting
  2. dependency accumulation
Parallel BC Algorithm Illustration
1. Traversal step: visit adjacent vertices, update distance and path counts.
[Figure: example graph; the distance (D), path count, stack (S), and predecessor (P) structures after visiting the source's neighbors.]
Slide 71
Parallel BC Algorithm Illustration
1. Traversal step: visit adjacent vertices, update distance and path counts.
Level-synchronous approach: the adjacencies of all vertices in the current frontier can be visited in parallel.
[Figure: the D, path count, S, and P structures after the second level.]
Slide 72
Parallel BC Algorithm Illustration
1. Traversal step: at the end, we have all reachable vertices, their corresponding predecessor multisets, and D values.
Level-synchronous approach: the adjacencies of all vertices in the current frontier can be visited in parallel.
[Figure: the completed D, path count, S, and P structures for the example graph.]
Slide 73
Graph traversal step analysis
Exploit concurrency in visiting adjacencies, as we assume that the graph diameter is small: O(log n)
Upper bound on the size of each predecessor multiset: in-degree
Potential performance bottlenecks: atomic updates to predecessor multisets, atomic increments of path counts
New algorithm: based on the observation that we don't need to store predecessor vertices; instead, we store successor edges along shortest paths
  simplifies the accumulation step
  removes one atomic operation from the traversal step
  cache-friendly!
Slide 74
Graph Traversal Step: locality analysis

for all vertices u at level d in parallel do
  for all adjacencies v of u in parallel do      // adjacencies stored compactly (graph rep.)
    dv = D[v];                                   // non-contiguous memory access
    if (dv < 0)
      vis = fetch_and_add(&Visited[v], 1);       // non-contiguous memory access
      if (vis == 0)
        D[v] = d + 1;
        pS[count++] = v;                         // store to S
      fetch_and_add(&sigma[v], sigma[u]);        // non-contiguous memory access
      fetch_and_add(&Scount[u], 1);
    if (dv == d + 1)
      fetch_and_add(&sigma[v], sigma[u]);
      fetch_and_add(&Scount[u], 1);

All the vertices at a level are in a contiguous block (stack), and all the adjacencies of a vertex are stored compactly (graph representation). Better cache utilization is likely if D[v], Visited[v], and sigma[v] are stored contiguously.
Parallel BC Algorithm Illustration
2. Accumulation step: can also be done in a level-synchronous manner.
[Figure: delta (dependency) values accumulated level by level, from the deepest level back toward the source vertex.]
Slide 77
Accumulation step: locality analysis

for level d = GraphDiameter-2 to 1 do
  for all vertices v at level d in parallel do
    delta_sum_v = 0;
    for all w in S[v] in parallel do             // reduction over delta_sum_v
      delta_sum_v += (1 + delta[w]) * sigma[v] / sigma[w];
    delta[v] = delta_sum_v;
    BC[v] += delta[v];                           // accumulated over all sources

All the vertices at a level are in a contiguous block (stack), and each S[v] is a contiguous block. The accumulation expression contains the only floating-point operations in the code.
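Putting the two steps together, here is a hedged serial sketch (my own, not the lecture's multithreaded code; the function name bc_from_source is hypothetical) of one source's contribution to betweenness using successor lists, mirroring the traversal and accumulation pseudocode above. Summing the contributions over all sources (or over a sample of sources, for approximate BC) gives the betweenness scores.

#include <vector>

// One source's contribution to betweenness centrality, using successor lists
// as in the traversal/accumulation pseudocode above (serial sketch).
void bc_from_source(int n, const std::vector<std::vector<int>>& adj,
                    int s, std::vector<double>& BC) {
    std::vector<int> D(n, -1);                  // BFS level
    std::vector<double> sigma(n, 0.0);          // shortest-path counts
    std::vector<std::vector<int>> succ(n);      // successor edges on shortest paths
    std::vector<std::vector<int>> levels;       // vertices grouped by BFS level

    D[s] = 0; sigma[s] = 1.0;
    levels.push_back({s});
    for (int d = 0; !levels[d].empty(); ++d) {  // traversal and path counting
        levels.push_back({});
        for (int u : levels[d]) {
            for (int v : adj[u]) {
                if (D[v] < 0) { D[v] = d + 1; levels[d + 1].push_back(v); }
                if (D[v] == d + 1) {            // v is a successor of u
                    sigma[v] += sigma[u];
                    succ[u].push_back(v);
                }
            }
        }
    }

    // Dependency accumulation, from the deepest level back toward the source.
    std::vector<double> delta(n, 0.0);
    for (int d = (int)levels.size() - 2; d >= 1; --d)
        for (int v : levels[d])
            for (int w : succ[v])
                delta[v] += (sigma[v] / sigma[w]) * (1.0 + delta[w]);
    for (int v = 0; v < n; ++v)
        if (v != s) BC[v] += delta[v];
}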
Slide 78
Centrality Analysis applied to Protein Interaction Networks
[Figure: protein interaction network; the highlighted protein (Ensembl ID ENSG00000145332.2, Kelch-like protein 8) has 43 interactions.]
Slide 79
Designing parallel algorithms for large sparse graph analysis
[Figure: problem size (n = # of vertices/edges, 10^4 to 10^12) vs. work performed (O(n), O(n log n), O(n^2)); regimes range from RandomAccess-like to Stream-like, with strategies to improve locality, apply data reduction/compression, and use faster methods as problems approach peta-scale.]
System requirements: high bandwidth (on-chip memory, DRAM, network, I/O).
Solution: efficiently utilize the available memory bandwidth; algorithmic innovation to avoid corner cases.
Slide 80
Review of lecture
Applications: Internet and WWW, scientific computing, data analysis, surveillance
Overview of parallel graph algorithms: PRAM algorithms, graph representation
Parallel algorithm case studies:
  BFS: locality, level-synchronous approach, multicore tuning
  Shortest paths: exploiting concurrency, parallel priority queue
  Betweenness centrality: importance of locality, atomics