Tutorial: Large-Scale Graph Processing in Shared Memory
Julian Shun (University of California, Berkeley) Laxman Dhulipala (Carnegie Mellon University) Guy Blelloch (Carnegie Mellon University)
Tutorial Agenda
• 2:00-3:30pm
  • Overview of graph processing and Ligra
  • Walk through installation
  • Do examples in Ligra
• 3:30-4:00pm
  • Break
• 4:00-5:30pm
  • Implementation details of Ligra
  • Overview of other graph processing systems
  • Exercises
PPoPP 2016 2
Tutorial website: http://jshun.github.io/ligra/
Slides available at https://github.com/jshun/ligra/tree/master/tutorial/tutorial.pdf
Graphs are everywhere!
• Can contain billions of vertices and edges! Real-world graphs range from 6.6 billion edges to 128 billion edges to ~1 trillion edges [VLDB 2015]
Graph processing challenges
• Need to efficiently analyze large graphs
  • Running time efficiency
  • Space efficiency
  • Programming efficiency
• Many random memory accesses, poor locality
• Relatively high communication-to-computation ratio
• Varying parallelism throughout execution
• Race conditions, load balancing
Ligra Graph Processing Framework
• Core primitives: EdgeMap and VertexMap
• Applications: breadth-first search, betweenness centrality, connected components, K-core decomposition, belief propagation, maximal independent set, single-source shortest paths, eccentricity estimation, (personalized) PageRank, local graph clustering, biconnected components, collaborative filtering, …
• Design goals: simplicity, performance, scalability
Graph Processing Systems
• Existing (at the time Ligra was developed): Pregel/Giraph/GPS, GraphLab, Pegasus, Knowledge Discovery Toolbox, GraphChi, and many others…
• Our system: Ligra - Lightweight graph processing system for shared memory
Takes advantage of “frontier-based” nature of many algorithms (active set is dynamic and often small)
Breadth-first Search (BFS) • Compute a BFS tree rooted at source r containing all vertices reachable from r
• Can process each level in parallel • Race conditions, load balancing
Applications: betweenness centrality, eccentricity estimation, maximum flow, web crawlers, network broadcasting, cycle detection, …
Steps for Graph Traversal
• Operate on a subset of vertices
• Map computation over subset of edges in parallel
• Return new subset of vertices
• Map computation over subset of vertices in parallel
Many graph traversal algorithms do this! We built the Ligra abstraction for these kinds of computations, expressed with flat data-parallel operators over a Graph, a VertexSubset, EdgeMap, and VertexMap, using shared-memory parallelism.
Optimizations: hybrid traversal, graph compression
Ligra Framework: VertexMap and VertexFilter
Example: given the VertexSubset {0, 4, 6, 8} and the function
bool f(v) { data[v] = data[v] + 1; return (data[v] == 1); }
VertexMap applies f to every vertex in the subset (for side effects on application data). VertexFilter does the same, but also uses the boolean return values (e.g., {T, F, F, T}) to produce a new VertexSubset containing only the vertices for which f returned true.
Ligra Framework: EdgeMap
EdgeMap takes a graph, a VertexSubset (in the example, {0, 4, 6, 8}), a function bool update(u, v) {…} applied to each edge (u, v) with u in the subset, and a function bool cond(v) {…} that filters target vertices. It returns a new VertexSubset containing the targets for which cond and update both returned true (in the example, {5, 2, 4, 7, 1}).
Breadth-first Search in Ligra

parents = {-1, …, -1};  //-1 indicates “unexplored”
procedure UPDATE(s, d):
    return compare_and_swap(parents[d], -1, s);
procedure COND(v):
    return parents[v] == -1;  //checks if “unexplored”
procedure BFS(G, r):
    parents[r] = r;
    frontier = {r};  //VertexSubset
    while (size(frontier) > 0):
        frontier = EDGEMAP(G, frontier, UPDATE, COND);
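The pseudocode above can be sketched as a minimal sequential Python program. This is a stand-in, not Ligra itself: Ligra is C++ and runs both loops in parallel, using compare_and_swap to resolve races, while here a plain conditional write and a dictionary-based graph suffice.

```python
def edgemap(graph, frontier, update, cond):
    # Apply update(s, d) over out-edges of frontier vertices;
    # keep each target d for which cond and update both succeed.
    next_frontier = set()
    for s in frontier:
        for d in graph[s]:
            if cond(d) and update(s, d):
                next_frontier.add(d)
    return next_frontier

def bfs(graph, r):
    parents = {v: -1 for v in graph}   # -1 indicates "unexplored"
    def update(s, d):
        if parents[d] == -1:           # sequential stand-in for compare_and_swap
            parents[d] = s
            return True
        return False
    def cond(d):
        return parents[d] == -1        # checks if "unexplored"
    parents[r] = r
    frontier = {r}                     # the VertexSubset
    while frontier:
        frontier = edgemap(graph, frontier, update, cond)
    return parents

# Example: parents[4] == 3; unreachable vertices keep -1
parents = bfs({0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}, 0)
```

On the example graph, vertex 3 can end up with parent 1 or 2 depending on traversal order; Ligra's parallel BFS is nondeterministic in the same way.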
Install and code examples in Ligra
Breadth-first Search (BFS) • Compute a BFS tree rooted at source r containing all vertices reachable from r
Connected Components
• Takes an unweighted, undirected graph G = (V, E)
• Returns a label array L such that L[v] = L[w] if and only if v is connected to w
Parallel Label Propagation Algorithm
• Processing all vertices in each iteration seems wasteful
• Optimization: only place vertices whose labels changed on the frontier
• Warning: this algorithm is only efficient for low-diameter graphs
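A sequential Python sketch of this frontier-based label propagation, assuming a symmetric adjacency dictionary for the undirected graph; the parallel version would use an atomic write-min in place of the plain assignment.

```python
def connected_components(graph):
    # graph: symmetric adjacency dict for an undirected graph
    labels = {v: v for v in graph}   # initial label = vertex ID
    frontier = set(graph)            # every vertex starts on the frontier
    while frontier:
        next_frontier = set()
        for v in frontier:           # push smaller labels to neighbors
            for w in graph[v]:
                if labels[v] < labels[w]:
                    labels[w] = labels[v]   # sequential stand-in for write-min
                    next_frontier.add(w)    # only changed vertices re-activate
        frontier = next_frontier
    return labels
```

Each component converges to the minimum vertex ID it contains, and the frontier shrinks to only the vertices whose labels changed in the previous round, matching the optimization above.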
Single-Source Shortest Paths
• Takes a weighted graph G = (V, E, w(E)) and starting vertex r ∈ V
• Returns a shortest paths array SP where SP[v] stores the shortest path distance from r to v (∞ if v unreachable from r)
Parallel Bellman-Ford Shortest Paths
[Figure: step-by-step run of frontier-based Bellman-Ford on a small weighted directed graph with six vertices (A-F) and negative edge weights; starting from source A with all other distances ∞, each round relaxes the out-edges of the frontier until the distances converge]
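The frontier-based Bellman-Ford can be sketched in Python as follows. This is a sequential stand-in: the edge-list graph format and round bound are illustrative, and the parallel version replaces the assignment with an atomic write-min.

```python
import math

def bellman_ford(graph, r):
    # graph: {v: [(neighbor, weight), ...]}; returns shortest distances from r
    sp = {v: math.inf for v in graph}
    sp[r] = 0
    frontier = {r}
    rounds = 0
    while frontier and rounds < len(graph):  # more than |V|-1 rounds would
        next_frontier = set()                # indicate a negative cycle
        for v in frontier:                   # relax out-edges of the frontier
            for w, wt in graph[v]:
                if sp[v] + wt < sp[w]:
                    sp[w] = sp[v] + wt       # sequential stand-in for write-min
                    next_frontier.add(w)     # distance changed: re-activate w
        frontier = next_frontier
        rounds += 1
    return sp
```

Only vertices whose distances changed stay active, which is what makes this version faster than the naive Bellman-Ford that relaxes every edge in every round.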
Parallel Bellman-Ford Performance
[Figure: running time in seconds (log scale, 0.1 to 100) vs. number of threads (1 to 80) for Bellman-Ford on rMat24, comparing our frontier-based Bellman-Ford against a naive Bellman-Ford]
K-core decomposition
• A k-core of a graph is a maximal connected subgraph in which all vertices have degree at least k
• A vertex has core number k if it belongs to a k-core but not a (k+1)-core
• Algorithm: Takes an unweighted, undirected graph G and returns the core number of each vertex in G
k = 1
while (G is not empty) {
    while (there exist vertices with degree < k in G) {
        assign a core number of k-1 to all vertices with degree < k;
        remove all vertices with degree < k from G;
    }
    k = k+1;
}
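A direct sequential Python transcription of the peeling loop above; the dictionary-based graph is illustrative, and Ligra's version expresses the inner removal step with VertexSubsets and EdgeMap.

```python
def core_numbers(graph):
    # graph: symmetric adjacency dict for an undirected graph
    deg = {v: len(graph[v]) for v in graph}
    alive = set(graph)               # vertices not yet removed from G
    core = {}
    k = 1
    while alive:
        changed = True
        while changed:               # peel until no vertex has degree < k
            changed = False
            for v in list(alive):
                if deg[v] < k:
                    core[v] = k - 1  # assign core number k-1
                    alive.remove(v)  # remove v from G
                    for w in graph[v]:
                        if w in alive:
                            deg[w] -= 1
                    changed = True
        k += 1
    return core
```

On a triangle with one pendant vertex attached, the pendant gets core number 1 and the triangle's vertices get core number 2.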
PageRank
PageRank in Ligra

p_curr = {1/|V|, …, 1/|V|};
p_next = {0, …, 0};
diff = {};
procedure UPDATE(s, d):
    return atomic_increment(p_next[d], p_curr[s] / degree(s));
procedure COMPUTE(i):
    p_next[i] = α · p_next[i] + (1 - α) · (1/|V|);
    diff[i] = abs(p_next[i] - p_curr[i]);
    p_curr[i] = 0;
    return 1;
procedure PAGERANK(G, α, ε):
    frontier = {0, …, |V|-1};
    error = ∞;
    while (error > ε):
        frontier = EDGEMAP(G, frontier, UPDATE, CONDtrue);
        VERTEXMAP(frontier, COMPUTE);
        error = sum of diff entries;
        swap(p_curr, p_next);
    return p_curr;
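A dense sequential Python sketch of the same computation: every vertex stays on the frontier each round, matching CONDtrue above. It assumes every vertex has at least one out-edge, so degree(s) is never zero; the parallel version uses atomic increments for the scatter step.

```python
def pagerank(graph, alpha=0.85, eps=1e-7):
    n = len(graph)
    p = {v: 1.0 / n for v in graph}          # p_curr = {1/|V|, ...}
    error = float('inf')
    while error > eps:
        nxt = {v: 0.0 for v in graph}        # p_next = {0, ...}
        for s in graph:                      # EdgeMap: scatter rank mass
            for d in graph[s]:
                nxt[d] += p[s] / len(graph[s])
        for v in graph:                      # VertexMap: damping + teleport
            nxt[v] = alpha * nxt[v] + (1 - alpha) / n
        error = sum(abs(nxt[v] - p[v]) for v in graph)
        p = nxt                              # swap(p_curr, p_next)
    return p
```

On a directed 3-cycle the ranks are already uniform, so the loop converges after one iteration with every vertex at 1/3.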
PageRank • Sparse version?
• PageRank-Delta: Only update vertices whose PageRank value has changed by more than some Δ-fraction (discussed in GraphLab and McSherry WWW ‘05)
PageRank-Delta in Ligra

PR = {1/|V|, …, 1/|V|};
nghSum = {0, …, 0};
Change = {};  //stores changes in PageRank values
procedure UPDATE(s, d):  //passed to EdgeMap
    return atomic_increment(nghSum[d], Change[s] / degree(s));
procedure COMPUTE(i):  //now passed to VertexFilter
    Change[i] = α · nghSum[i];
    PR[i] = PR[i] + Change[i];
    return (abs(Change[i]) > Δ);  //keep i only if its change is big enough
procedure PAGERANK(G, α, ε):
    …
    frontier = VERTEXFILTER(frontier, COMPUTE);
    …
Eccentricity estimation
• Takes an unweighted, undirected graph G = (V, E)
• Returns an estimate of the eccentricity of each vertex, where the eccentricity of a vertex v is the distance from v to the furthest reachable vertex
[Example: a seven-vertex graph with vertices a-g, where ecc(a) = 4, ecc(b) = 3, ecc(c) = 2, ecc(d) = 3, ecc(e) = 4, ecc(f) = 3, ecc(g) = 4]
Multiple BFS’s • Run multiple BFS’s from a sample of random vertices and use distance from furthest sample as eccentricity estimate
eĉc(v) = max(d(v, s1), d(v, s2), d(v, s3), d(v, s4)) for all v, where s1, …, s4 are the sampled source vertices
• In practice, need to run two sweeps to get good accuracy [KDD 2015]
Eccentricity estimation implementation
• Run all k BFS’s simultaneously
• Take advantage of bit-level parallelism to store “visited” information: each vertex keeps a k-bit vector (k/64 words), a unique bit is set for each source, and bit-vectors are bitwise-OR’ed together when visiting a neighbor
[Figure: number of edges traversed (10^7 to 10^12, log scale) vs. k (2^6 to 2^15) on com-Youtube, comparing the simultaneous k-BFS against k naive BFS’s]
Eccentricity estimation implementation
• Initial frontier = {s1, s2, …, sk}
• d = 0
• While frontier not empty:  //advance all BFS’s by one level (an EdgeMap)
  • nextFrontier = {}
  • d = d+1
  • For each vertex v in frontier:
    • For each neighbor ngh:  //pass “visited” information
      • Bitwise-OR v’s words into ngh’s words (an atomic bitwise-OR using compare-and-swap)
      • If any of ngh’s words changed:
        • eĉc(ngh) = max(eĉc(ngh), d) and place ngh on nextFrontier if not already there
  • frontier = nextFrontier
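A sequential Python sketch of this multi-BFS: arbitrary-precision Python integers stand in for the k/64 machine words, and a round-start snapshot of the bit-vectors stands in for the atomic compare-and-swap OR (giving deterministic BFS-level semantics).

```python
def ecc_estimate(graph, samples):
    # One bit per sample source; ints stand in for the k/64-word bit-vectors
    visited = {v: 0 for v in graph}
    ecc = {v: 0 for v in graph}
    for i, s in enumerate(samples):
        visited[s] |= 1 << i              # unique bit set for each source
    frontier = set(samples)
    d = 0
    while frontier:                       # advance all BFS's by one level
        d += 1
        start = dict(visited)             # round-start snapshot (the parallel
        next_frontier = set()             # version uses atomic OR instead)
        for v in frontier:
            for ngh in graph[v]:
                new_bits = visited[ngh] | start[v]   # bitwise-OR of vectors
                if new_bits != visited[ngh]:         # some word changed
                    visited[ngh] = new_bits
                    ecc[ngh] = max(ecc[ngh], d)
                    next_frontier.add(ngh)
        frontier = next_frontier
    return ecc
```

On a 4-vertex path with the two endpoints as samples, the estimates match the exact eccentricities (3, 2, 2, 3).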
• We will implement this example in Ligra for k=64
Ligra Implementation Details
VertexSubset, VertexMap, and VertexFilter
procedure VERTEXMAP(VertexSubset U, func F):
    parallel foreach v in U:
        F(v);  //side-effects application data

procedure VERTEXFILTER(VertexSubset U, bool func F):
    result = {};
    parallel foreach v in U:
        if (F(v) == 1) then:
            add v to result;
    return result;
VertexSubset has one of two representations: • Sparse integer array, e.g. {1, 4, 7} • Dense boolean array, e.g. {0,1,0,0,1,0,0,1}
Sparse or Dense EdgeMap?
• Sparse (traditional) method better for small frontiers
• Dense method better when the frontier is large and many vertices have already been visited
• Switch between the two methods based on frontier size [Beamer et al. SC ’12]
• Is this limited to BFS?
EdgeMap

procedure EDGEMAP(G, frontier, Update, Cond):
    if (above threshold) then:
        return EDGEMAP_DENSE(G, frontier, Update, Cond);
    else:
        return EDGEMAP_SPARSE(G, frontier, Update, Cond);

• Sparse: loop through outgoing edges of frontier vertices in parallel
• Dense: loop through incoming edges of “unexplored” vertices (in parallel), breaking early if possible
• More general than just BFS! Generalizes to many other problems
• Users need not worry about this
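The sparse/dense dispatch can be sketched in Python as follows. This is a sequential stand-in: sets replace Ligra's integer and byte arrays, and the threshold test on frontier size plus out-degree sum is a simplification of Ligra's rule.

```python
def edgemap_sparse(out_nbrs, frontier, update, cond):
    # Push direction: scan out-edges of frontier vertices
    nxt = set()
    for v in frontier:
        for w in out_nbrs[v]:
            if cond(w) and update(v, w):
                nxt.add(w)            # the set stands in for duplicate removal
    return nxt

def edgemap_dense(in_nbrs, frontier, update, cond):
    # Pull direction: every vertex still passing cond scans its in-edges,
    # breaking early once cond fails (e.g., once a BFS parent is found)
    nxt = set()
    for v in in_nbrs:
        for ngh in in_nbrs[v]:
            if not cond(v):
                break
            if ngh in frontier and update(ngh, v):
                nxt.add(v)
    return nxt

def edgemap(out_nbrs, in_nbrs, frontier, update, cond, threshold):
    # Hybrid dispatch on frontier size plus its out-degree sum
    work = len(frontier) + sum(len(out_nbrs[v]) for v in frontier)
    if work > threshold:
        return edgemap_dense(in_nbrs, frontier, update, cond)
    return edgemap_sparse(out_nbrs, frontier, update, cond)
```

Running the same BFS with a threshold of 0 (always dense) and a huge threshold (always sparse) reaches the same set of vertices, which is the point: the two traversal directions are interchangeable.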
EdgeMap (sparse version)
• How to represent the VertexSubset? An array of integers, e.g., U = {0, 5, 7}

procedure EDGEMAP_SPARSE(G, frontier, UPDATE, COND):
    nextFrontier = {};
    parallel foreach v in frontier:
        parallel foreach w in out_neighbors(v):
            if (COND(w) == 1 and UPDATE(v, w) == 1) then:
                add w to nextFrontier;
    remove duplicates from nextFrontier;
    return nextFrontier;
EdgeMap (dense version)
• How to represent a dense VertexSubset? A byte array, e.g., U = {1, 0, 0, 0, 0, 1, 0, 1}

procedure EDGEMAP_DENSE(G, frontier, UPDATE, COND):
    nextFrontier = {0, …, 0};
    parallel foreach v in G:
        if (COND(v) == 1) then:
            foreach ngh in in_neighbors(v):
                if (ngh in frontier and UPDATE(ngh, v) == 1) then:
                    nextFrontier[v] = 1;
                if (COND(v) == 0) then:
                    break;
    return nextFrontier;
Frontier-based approach enables sparse/dense traversal
[Figure: 40-core running times in seconds for BFS, betweenness centrality, connected components, and eccentricity estimation on the Twitter graph (41M vertices, 1.5B edges), comparing dense-only, sparse-only, and hybrid sparse/dense traversal (using the default threshold of |E|/20); bars of 30.7 and 20.7 seconds appear for the dense-only and sparse-only methods, while the hybrid is consistently fastest]
Ligra BFS Performance
[Figure: BFS running time in seconds and lines of code on the Twitter graph (41M vertices, 1.5B edges), comparing Ligra against hand-written Cilk/OpenMP code, both on a 40-core machine; the running times are close, while the hand-written code requires many more lines]
• Comparing against direction-optimizing code by Beamer et al.
Ligra PageRank Performance
[Figure: running time in seconds and lines of code for one PageRank iteration on the Twitter graph (41M vertices, 1.5B edges), comparing GraphLab (64 x 8 cores), GraphLab (40-core machine), Ligra (40-core machine), and hand-written Cilk/OpenMP (40-core machine)]
• Easy to implement “sparse” version of PageRank in Ligra
Ligra Connected Components Performance
[Figure: running time in seconds and lines of code for connected components on the Twitter graph (41M vertices, 1.5B edges), comparing GraphLab (16 x 8 cores), GraphLab (40-core machine), Ligra (40-core machine), and hand-written Cilk/OpenMP (40-core machine); line counts of 120 and 244 appear for two of the implementations]
• Performance close to hand-written code
• Faster than existing high-level frameworks at the time
• Shared-memory graph processing is very efficient
• Several shared-memory graph processing systems subsequently developed: Galois [SOSP ’13], X-Stream [SOSP ’13], PRISM [SPAA ’14], Polymer [PPoPP ’15], Ringo [SIGMOD ’15], GraphMat [VLDB ’15]
Large Graphs
• Real-world graphs range from 6.6 billion to 128 billion to ~1 trillion edges [VLDB 2015]
• Most can fit on a commodity shared-memory machine
• What if you don’t have that much memory?
[Figure: running time vs. memory required; performance degrades sharply once the memory required exceeds the available RAM]
Ligra+: Adding Graph Compression to Ligra
Ligra+: Adding Graph Compression to Ligra
• Same interface as Ligra: all changes hidden from the user!
• Graph: uses a compressed representation
• EdgeMap: decodes edges on-the-fly
• VertexSubset and VertexMap: same as before
Graph representation
• Standard representation (vertex IDs 0, 1, 2, 3, …):
  • Offsets: 0 4 5 11 ...
  • Edges: 2 7 9 16 0 1 6 9 12 ...
• Compressed edges: 2 5 2 7 -1 -1 5 3 3 ...
  • Sort each vertex’s edges and encode differences, with the first difference taken relative to the vertex ID (e.g., for vertex 0: 2 - 0 = 2, 7 - 2 = 5; for vertex 2: 1 - 2 = -1)
• Graph reordering to improve locality
  • Goal: give neighbors IDs close to the vertex ID
  • BFS, DFS, METIS, our own separator-based algorithm
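The difference encoding above can be sketched in Python as the encode/decode of one vertex's sorted neighbor list (the function names are illustrative):

```python
def delta_encode(vertex_id, sorted_nbrs):
    # First neighbor is stored relative to the vertex ID (may be negative);
    # the rest are differences between consecutive sorted neighbors.
    if not sorted_nbrs:
        return []
    deltas = [sorted_nbrs[0] - vertex_id]
    for a, b in zip(sorted_nbrs, sorted_nbrs[1:]):
        deltas.append(b - a)
    return deltas

def delta_decode(vertex_id, deltas):
    # Invert the encoding by running a prefix sum starting at the vertex ID.
    nbrs, prev = [], vertex_id
    for d in deltas:
        prev += d
        nbrs.append(prev)
    return nbrs

# Matches the slide's example:
delta_encode(0, [2, 7, 9, 16])   # -> [2, 5, 2, 7]
delta_encode(2, [1, 6, 9, 12])   # -> [-1, 5, 3, 3]
```

The deltas are small when neighbor IDs are close to the vertex ID, which is exactly what the reordering step aims for; the small values are then stored compactly with the variable-length codes described next.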
Variable-length codes
• k-bit codes
  • Encode a value in chunks of k bits
  • Use k-1 bits for data, and 1 bit as the “continue” bit
• Example: encode 401 using an 8-bit (byte) code
  • In binary: 1 1 0 0 1 0 0 0 1
  • First byte: 1 0 0 1 0 0 0 1 (“continue” bit set, then the low 7 data bits)
  • Second byte: 0 0 0 0 0 0 1 1 (“continue” bit clear, then the remaining data bits)
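A Python sketch of the byte code just described, for k = 8: seven data bits per byte plus a high-order continue bit (a variant of the classic varint encoding).

```python
def byte_encode(value):
    # Emit 7 data bits per byte, setting the high "continue" bit (0x80)
    # on every byte except the last.
    out = []
    while True:
        chunk = value & 0x7F
        value >>= 7
        if value:
            out.append(chunk | 0x80)   # more chunks follow
        else:
            out.append(chunk)
            return out

def byte_decode(data):
    # Reassemble 7-bit chunks, least-significant chunk first,
    # stopping at the first byte without the continue bit.
    value, shift = 0, 0
    for b in data:
        value |= (b & 0x7F) << shift
        shift += 7
        if not (b & 0x80):
            break
    return value

byte_encode(401)   # -> [0x91, 0x03], matching the slide's two-byte example
```

Small deltas fit in a single byte, so combined with the difference encoding most edges cost one byte instead of four or eight.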
Encoding optimization
• Another idea: get rid of “continue” bits
• Group integers by the number of bytes each requires (using run-length encoding), up to 64 per group
• Each group gets a header byte storing the size of the group and the number of bytes per integer, followed by the integers encoded in fixed byte chunks (e.g., integers x1 … x8 requiring 1 2 2 2 2 2 2 2 bytes split into a 1-byte group and a 2-byte group)
• Increases space, but makes decoding cheaper (no branch misprediction from checking “continue” bits)
Modifying EdgeMap
• Processes outgoing edges of a subset of vertices, decoding each frontier vertex’s compressed edge list on-the-fly
• All frontier vertices are processed in parallel
• What about high-degree vertices?
Handling high-degree vertices
• Split each high-degree vertex’s edge list into chunks of size T, encoding the first entry of each chunk relative to the source vertex
• All chunks can then be decoded in parallel!
• We chose T = 1000; performance and space usage are similar for a wide range of T
Ligra+: Adding Graph Compression to Ligra
• Using 8-bit codes with run-length encoding
• Cost of decoding on-the-fly?
[Figure: space and 40-core running time of Ligra+ relative to Ligra across applications; Ligra+ uses less space, and its running times range from roughly 0.6x to 1.4x those of Ligra]
• The memory bottleneck is a bigger issue as graph algorithms are memory-bound
Demo on compressed graphs in Ligra
Other Graph Processing Systems
• Pregel/Giraph/GPS, GraphLab/PowerGraph, PRISM, Pegasus, Knowledge Discovery Toolbox/CombBLAS, GraphChi, GraphX, Galois, X-Stream, Gunrock, GraphMat, Ringo, TurboGraph, FlashGraph, Grace, PathGraph, Polymer, GoFFish, Blogel, LightGraph, MapGraph, PowerLyra, PowerSwitch, XDGP, Signal/Collect, PrefEdge, Parallel BGL, KLA, Grappa, Chronos, Green-Marl, GraphHP, P++, LLAMA, Venus, Cyclops, Medusa, NScale, Neo4J, Trinity, GBase, HyperGraphDB, Horton, GSPARQL, Titan, and many others…
• Cannot list everything here. For more information, see:
  • “Systems for Big Graphs”, Khan and Elnikety, VLDB 2014 Tutorial
  • “Trade-offs in Large Graph Processing: Representations, Storage, Systems and Algorithms”, Ajwani et al., WWW 2015 Tutorial
  • “A Survey of Parallel Graph Processing Frameworks”, Doekemeijer and Varbanescu, 2014
  • “Thinking like a Vertex: A Survey of Vertex-Centric Frameworks for Large-Scale Distributed Graph Processing”, McCune et al., 2015
Pregel
• “Think like a vertex”
• Distributed-memory, uses message passing
• Bulk synchronous model
• Vertices can be either “active” or “inactive”

Pregel: A System for Large-Scale Graph Processing, Malewicz et al., SIGMOD 2010
GraphLab/PowerGraph
• “Think like a vertex”
• Vertices define Gather, Apply, and Scatter functions
• Data on edges as well as vertices
• Scheduler processes “active” vertices
• Supports asynchronous execution
  • Useful for some machine learning applications
  • Different levels of consistency
  • Different scheduling orders
• Current version is for distributed memory (original version was for shared memory)

GraphLab: A New Framework for Parallel Machine Learning, Low et al., UAI 2010
PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs, Gonzalez et al., OSDI 2012
GraphChi
• “Think like a vertex”: define an Update function for vertices
• Optimized for disk-based execution
• Divides edges into “shards”
  • Each shard has a range of target IDs
  • Each shard is sorted by source ID
• Parallel sliding windows method for efficient execution
  • One shard in memory, other shards read in streaming fashion
  • About O((V+E)/B) I/Os per iteration

GraphChi: Large-Scale Graph Computation on Just a PC, Kyrola et al., OSDI 2012
Linear algebra abstraction
• PEGASUS: Framework using matrix-vector abstraction in MapReduce [Kang et al. ICDM 2009]
• Knowledge Discovery Toolbox/CombBLAS: Uses matrix-vector and matrix-matrix routines for implementing graph algorithms; uses Python frontend, C++/MPI backend [Lugowski et al. SDM 2012]
Some other shared memory systems
• X-Stream [SOSP 2013]: edge-centric abstraction; supports disk-based execution
• Galois [SOSP 2013]: parallel programming framework based on dynamic sets with various schedulers; implements the graph abstractions of Ligra, X-Stream, and GraphLab
• PRISM [SPAA 2014]: deterministic version of GraphLab using graph coloring for scheduling; implemented in Cilk
• Polymer [PPoPP 2015]: NUMA-aware implementation of Ligra’s abstraction
• GraphMat [VLDB 2015]: uses optimized matrix-vector routines to implement graph algorithms
• Gunrock [PPoPP 2016]: abstraction for frontier-based computations on GPUs
Exercises
• Implement a version of BFS that gives a deterministic BFS tree (i.e., the Parents array is deterministic)
• Implement a version of BFS that uses a bit-vector to check the “visited” status before updating the Parents array
Thank you!
Code: https://github.com/jshun/ligra/
References:
• Ligra: A Lightweight Graph Processing Framework for Shared Memory, PPoPP 2013
• Smaller and Faster: Parallel Processing of Compressed Graphs with Ligra+, DCC 2015
• An Evaluation of Parallel Eccentricity Estimation Algorithms on Undirected Real-World Graphs, KDD 2015