Tutorial: Large-Scale Graph Processing in Shared Memory
Julian Shun (University of California, Berkeley) Laxman Dhulipala (Carnegie Mellon University) Guy Blelloch (Carnegie Mellon University)
Tutorial Agenda
• 2:00-3:30pm
  • Overview of graph processing and Ligra
  • Walk through installation
  • Do examples in Ligra
• 3:30-4:00pm
  • Break
• 4:00-5:30pm
  • Implementation details of Ligra
  • Overview of other graph processing systems
  • Exercises
PPoPP 2016 2
Tutorial website: http://jshun.github.io/ligra/
Slides available at https://github.com/jshun/ligra/tree/master/tutorial/tutorial.pdf
Graphs are everywhere!
• Can contain billions of vertices and edges! Real-world graphs range from 6.6 billion edges to 128 billion edges to ~1 trillion edges [VLDB 2015]
Graph processing challenges
• Need to efficiently analyze large graphs
  • Running time efficiency
  • Space efficiency
  • Programming efficiency
• Many random memory accesses, poor locality
• Relatively high communication-to-computation ratio
• Varying parallelism throughout execution
• Race conditions, load balancing
Ligra Graph Processing Framework
• Core primitives: EdgeMap and VertexMap
• Applications: breadth-first search, betweenness centrality, connected components, K-core decomposition, belief propagation, maximal independent set, single-source shortest paths, eccentricity estimation, (personalized) PageRank, local graph clustering, biconnected components, collaborative filtering, …
• Design goals: simplicity, performance, scalability
Graph Processing Systems
• Existing (at the time Ligra was developed): Pregel/Giraph/GPS, GraphLab, Pegasus, Knowledge Discovery Toolbox, GraphChi, and many others…
• Our system: Ligra - Lightweight graph processing system for shared memory
Takes advantage of “frontier-based” nature of many algorithms (active set is dynamic and often small)
Breadth-first Search (BFS) • Compute a BFS tree rooted at source r containing all vertices reachable from r
• Can process each level in parallel • Race conditions, load balancing
Applications: betweenness centrality, eccentricity estimation, maximum flow, web crawlers, network broadcasting, cycle detection, …
Steps for Graph Traversal
• Operate on a subset of vertices
• Map computation over subset of edges in parallel
• Return new subset of vertices
• Map computation over subset of vertices in parallel
Many graph traversal algorithms do this! We built the Ligra abstraction for these kinds of computations, expressed with flat data-parallel operators over a Graph, a VertexSubset, EdgeMap, and VertexMap, using shared-memory parallelism.
Optimizations: hybrid traversal, graph compression
Ligra Framework: VertexMap and VertexFilter
Example: given the VertexSubset {0, 4, 6, 8} and the function
bool f(v) { data[v] = data[v] + 1; return (data[v] == 1); }
VertexMap applies f to every vertex in the subset (for side effects on application data). VertexFilter does the same, but also uses the boolean return values (e.g., {T, F, F, T}) to produce a new VertexSubset containing only the vertices for which f returned true.
Ligra Framework: EdgeMap
EdgeMap takes a graph, a VertexSubset (in the example, {0, 4, 6, 8}), a function bool update(u, v) {…} applied to each edge (u, v) with u in the subset, and a function bool cond(v) {…} that filters target vertices. It returns a new VertexSubset containing the targets for which cond and update both returned true (in the example, {5, 2, 4, 7, 1}).
Breadth-first Search in Ligra

parents = {-1, …, -1};  //-1 indicates “unexplored”
procedure UPDATE(s, d):
    return compare_and_swap(parents[d], -1, s);
procedure COND(v):
    return parents[v] == -1;  //checks if “unexplored”
procedure BFS(G, r):
    parents[r] = r;
    frontier = {r};  //VertexSubset
    while (size(frontier) > 0):
        frontier = EDGEMAP(G, frontier, UPDATE, COND);
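The pseudocode above can be sketched as a minimal sequential Python program. This is a stand-in, not Ligra itself: Ligra is C++ and runs both loops in parallel, using compare_and_swap to resolve races, while here a plain conditional write and a dictionary-based graph suffice.

```python
def edgemap(graph, frontier, update, cond):
    # Apply update(s, d) over out-edges of frontier vertices;
    # keep each target d for which cond and update both succeed.
    next_frontier = set()
    for s in frontier:
        for d in graph[s]:
            if cond(d) and update(s, d):
                next_frontier.add(d)
    return next_frontier

def bfs(graph, r):
    parents = {v: -1 for v in graph}   # -1 indicates "unexplored"
    def update(s, d):
        if parents[d] == -1:           # sequential stand-in for compare_and_swap
            parents[d] = s
            return True
        return False
    def cond(d):
        return parents[d] == -1        # checks if "unexplored"
    parents[r] = r
    frontier = {r}                     # the VertexSubset
    while frontier:
        frontier = edgemap(graph, frontier, update, cond)
    return parents

# Example: parents[4] == 3; unreachable vertices keep -1
parents = bfs({0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}, 0)
```

On the example graph, vertex 3 can end up with parent 1 or 2 depending on traversal order; Ligra's parallel BFS is nondeterministic in the same way.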
Install and code examples in Ligra
Breadth-first Search (BFS) • Compute a BFS tree rooted at source r containing all vertices reachable from r
Connected Components
• Takes an unweighted, undirected graph G = (V, E)
• Returns a label array L such that L[v] = L[w] if and only if v is connected to w
Parallel Label Propagation Algorithm
• Processing all vertices in each iteration seems wasteful
• Optimization: only place vertices whose labels changed on the frontier
• Warning: this algorithm is only efficient for low-diameter graphs
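A sequential Python sketch of this frontier-based label propagation, assuming a symmetric adjacency dictionary for the undirected graph; the parallel version would use an atomic write-min in place of the plain assignment.

```python
def connected_components(graph):
    # graph: symmetric adjacency dict for an undirected graph
    labels = {v: v for v in graph}   # initial label = vertex ID
    frontier = set(graph)            # every vertex starts on the frontier
    while frontier:
        next_frontier = set()
        for v in frontier:           # push smaller labels to neighbors
            for w in graph[v]:
                if labels[v] < labels[w]:
                    labels[w] = labels[v]   # sequential stand-in for write-min
                    next_frontier.add(w)    # only changed vertices re-activate
        frontier = next_frontier
    return labels
```

Each component converges to the minimum vertex ID it contains, and the frontier shrinks to only the vertices whose labels changed in the previous round, matching the optimization above.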
Single-Source Shortest Paths
• Takes a weighted graph G = (V, E, w(E)) and starting vertex r ∈ V
• Returns a shortest paths array SP where SP[v] stores the shortest path distance from r to v (∞ if v unreachable from r)
Parallel Bellman-Ford Shortest Paths
[Figure: step-by-step run of frontier-based Bellman-Ford on a small weighted directed graph with six vertices (A-F) and negative edge weights; starting from source A with all other distances ∞, each round relaxes the out-edges of the frontier until the distances converge]
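The frontier-based Bellman-Ford can be sketched in Python as follows. This is a sequential stand-in: the edge-list graph format and round bound are illustrative, and the parallel version replaces the assignment with an atomic write-min.

```python
import math

def bellman_ford(graph, r):
    # graph: {v: [(neighbor, weight), ...]}; returns shortest distances from r
    sp = {v: math.inf for v in graph}
    sp[r] = 0
    frontier = {r}
    rounds = 0
    while frontier and rounds < len(graph):  # more than |V|-1 rounds would
        next_frontier = set()                # indicate a negative cycle
        for v in frontier:                   # relax out-edges of the frontier
            for w, wt in graph[v]:
                if sp[v] + wt < sp[w]:
                    sp[w] = sp[v] + wt       # sequential stand-in for write-min
                    next_frontier.add(w)     # distance changed: re-activate w
        frontier = next_frontier
        rounds += 1
    return sp
```

Only vertices whose distances changed stay active, which is what makes this version faster than the naive Bellman-Ford that relaxes every edge in every round.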
Parallel Bellman-Ford Performance
[Figure: running time in seconds (log scale, 0.1 to 100) vs. number of threads (1 to 80) for Bellman-Ford on rMat24, comparing our frontier-based Bellman-Ford against a naive Bellman-Ford]
K-core decomposition
• A k-core of a graph is a maximal connected subgraph in which all vertices have degree at least k
• A vertex has core number k if it belongs to a k-core but not a (k+1)-core
• Algorithm: Takes an unweighted, undirected graph G and returns the core number of each vertex in G
k = 1
while (G is not empty) {
    while (there exist vertices with degree < k in G) {
        assign a core number of k-1 to all vertices with degree < k;
        remove all vertices with degree < k from G;
    }
    k = k+1;
}
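A direct sequential Python transcription of the peeling loop above; the dictionary-based graph is illustrative, and Ligra's version expresses the inner removal step with VertexSubsets and EdgeMap.

```python
def core_numbers(graph):
    # graph: symmetric adjacency dict for an undirected graph
    deg = {v: len(graph[v]) for v in graph}
    alive = set(graph)               # vertices not yet removed from G
    core = {}
    k = 1
    while alive:
        changed = True
        while changed:               # peel until no vertex has degree < k
            changed = False
            for v in list(alive):
                if deg[v] < k:
                    core[v] = k - 1  # assign core number k-1
                    alive.remove(v)  # remove v from G
                    for w in graph[v]:
                        if w in alive:
                            deg[w] -= 1
                    changed = True
        k += 1
    return core
```

On a triangle with one pendant vertex attached, the pendant gets core number 1 and the triangle's vertices get core number 2.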
PageRank
PageRank in Ligra

p_curr = {1/|V|, …, 1/|V|};
p_next = {0, …, 0};
diff = {};
procedure UPDATE(s, d):
    return atomic_increment(p_next[d], p_curr[s] / degree(s));
procedure COMPUTE(i):
    p_next[i] = α · p_next[i] + (1 - α) · (1/|V|);
    diff[i] = abs(p_next[i] - p_curr[i]);
    p_curr[i] = 0;
    return 1;
procedure PAGERANK(G, α, ε):
    frontier = {0, …, |V|-1};
    error = ∞;
    while (error > ε):
        frontier = EDGEMAP(G, frontier, UPDATE, CONDtrue);
        VERTEXMAP(frontier, COMPUTE);
        error = sum of diff entries;
        swap(p_curr, p_next);
    return p_curr;
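A dense sequential Python sketch of the same computation: every vertex stays on the frontier each round, matching CONDtrue above. It assumes every vertex has at least one out-edge, so degree(s) is never zero; the parallel version uses atomic increments for the scatter step.

```python
def pagerank(graph, alpha=0.85, eps=1e-7):
    n = len(graph)
    p = {v: 1.0 / n for v in graph}          # p_curr = {1/|V|, ...}
    error = float('inf')
    while error > eps:
        nxt = {v: 0.0 for v in graph}        # p_next = {0, ...}
        for s in graph:                      # EdgeMap: scatter rank mass
            for d in graph[s]:
                nxt[d] += p[s] / len(graph[s])
        for v in graph:                      # VertexMap: damping + teleport
            nxt[v] = alpha * nxt[v] + (1 - alpha) / n
        error = sum(abs(nxt[v] - p[v]) for v in graph)
        p = nxt                              # swap(p_curr, p_next)
    return p
```

On a directed 3-cycle the ranks are already uniform, so the loop converges after one iteration with every vertex at 1/3.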
PageRank • Sparse version?
• PageRank-Delta: Only update vertices whose PageRank value has changed by more than some Δ-fraction (discussed in GraphLab and McSherry WWW ‘05)
PageRank-Delta in Ligra

PR = {1/|V|, …, 1/|V|};
nghSum = {0, …, 0};
Change = {};  //stores changes in PageRank values
procedure UPDATE(s, d):  //passed to EdgeMap
    return atomic_increment(nghSum[d], Change[s] / degree(s));
procedure COMPUTE(i):  //now passed to VertexFilter
    Change[i] = α · nghSum[i];
    PR[i] = PR[i] + Change[i];
    return (abs(Change[i]) > Δ);  //keep i only if its change is big enough
procedure PAGERANK(G, α, ε):
    …
    frontier = VERTEXFILTER(frontier, COMPUTE);
    …
Eccentricity estimation
• Takes an unweighted, undirected graph G = (V, E)
• Returns an estimate of the eccentricity of each vertex, where the eccentricity of a vertex v is the distance from v to the furthest reachable vertex
[Example: a seven-vertex graph with vertices a-g, where ecc(a) = 4, ecc(b) = 3, ecc(c) = 2, ecc(d) = 3, ecc(e) = 4, ecc(f) = 3, ecc(g) = 4]
Multiple BFS’s • Run multiple BFS’s from a sample of random vertices and use distance from furthest sample as eccentricity estimate
eĉc(v) = max(d(v, s1), d(v, s2), d(v, s3), d(v, s4)) for all v, where s1, …, s4 are the sampled source vertices
• In practice, need to run two sweeps to get good accuracy [KDD 2015]
Eccentricity estimation implementation
• Run all k BFS’s simultaneously
• Take advantage of bit-level parallelism to store “visited” information: each vertex keeps a k-bit vector (k/64 words), a unique bit is set for each source, and bit-vectors are bitwise-OR’ed together when visiting a neighbor
[Figure: number of edges traversed (10^7 to 10^12, log scale) vs. k (2^6 to 2^15) on com-Youtube, comparing the simultaneous k-BFS against k naive BFS’s]
Eccentricity estimation implementation
• Initial frontier = {s1, s2, …, sk}
• d = 0
• While frontier not empty:  //advance all BFS’s by one level (an EdgeMap)
  • nextFrontier = {}
  • d = d+1
  • For each vertex v in frontier:
    • For each neighbor ngh:  //pass “visited” information
      • Bitwise-OR v’s words into ngh’s words (an atomic bitwise-OR using compare-and-swap)
      • If any of ngh’s words changed:
        • eĉc(ngh) = max(eĉc(ngh), d) and place ngh on nextFrontier if not already there
  • frontier = nextFrontier
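A sequential Python sketch of this multi-BFS: arbitrary-precision Python integers stand in for the k/64 machine words, and a round-start snapshot of the bit-vectors stands in for the atomic compare-and-swap OR (giving deterministic BFS-level semantics).

```python
def ecc_estimate(graph, samples):
    # One bit per sample source; ints stand in for the k/64-word bit-vectors
    visited = {v: 0 for v in graph}
    ecc = {v: 0 for v in graph}
    for i, s in enumerate(samples):
        visited[s] |= 1 << i              # unique bit set for each source
    frontier = set(samples)
    d = 0
    while frontier:                       # advance all BFS's by one level
        d += 1
        start = dict(visited)             # round-start snapshot (the parallel
        next_frontier = set()             # version uses atomic OR instead)
        for v in frontier:
            for ngh in graph[v]:
                new_bits = visited[ngh] | start[v]   # bitwise-OR of vectors
                if new_bits != visited[ngh]:         # some word changed
                    visited[ngh] = new_bits
                    ecc[ngh] = max(ecc[ngh], d)
                    next_frontier.add(ngh)
        frontier = next_frontier
    return ecc
```

On a 4-vertex path with the two endpoints as samples, the estimates match the exact eccentricities (3, 2, 2, 3).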
• We will implement this example in Ligra for k=64
Ligra Implementation Details
VertexSubset, VertexMap, and VertexFilter
procedure VERTEXMAP(VertexSubset U, func F):
    parallel foreach v in U:
        F(v);  //side-effects application data

procedure VERTEXFILTER(VertexSubset U, bool func F):
    result = {};
    parallel foreach v in U:
        if (F(v) == 1) then:
            add v to result;
    return result;
VertexSubset has one of two representations: • Sparse integer array, e.g. {1, 4, 7} • Dense boolean array, e.g. {0,1,0,0,1,0,0,1}
Sparse or Dense EdgeMap?
• Sparse (traditional) method better for small frontiers
• Dense method better when the frontier is large and many vertices have already been visited
• Switch between the two methods based on frontier size [Beamer et al. SC ’12]
• Is this limited to BFS?
EdgeMap

procedure EDGEMAP(G, frontier, Update, Cond):
    if (above threshold) then:
        return EDGEMAP_DENSE(G, frontier, Update, Cond);
    else:
        return EDGEMAP_SPARSE(G, frontier, Update, Cond);

• Sparse: loop through outgoing edges of frontier vertices in parallel
• Dense: loop through incoming edges of “unexplored” vertices (in parallel), breaking early if possible
• More general than just BFS! Generalizes to many other problems
• Users need not worry about this
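The sparse/dense dispatch can be sketched in Python as follows. This is a sequential stand-in: sets replace Ligra's integer and byte arrays, and the threshold test on frontier size plus out-degree sum is a simplification of Ligra's rule.

```python
def edgemap_sparse(out_nbrs, frontier, update, cond):
    # Push direction: scan out-edges of frontier vertices
    nxt = set()
    for v in frontier:
        for w in out_nbrs[v]:
            if cond(w) and update(v, w):
                nxt.add(w)            # the set stands in for duplicate removal
    return nxt

def edgemap_dense(in_nbrs, frontier, update, cond):
    # Pull direction: every vertex still passing cond scans its in-edges,
    # breaking early once cond fails (e.g., once a BFS parent is found)
    nxt = set()
    for v in in_nbrs:
        for ngh in in_nbrs[v]:
            if not cond(v):
                break
            if ngh in frontier and update(ngh, v):
                nxt.add(v)
    return nxt

def edgemap(out_nbrs, in_nbrs, frontier, update, cond, threshold):
    # Hybrid dispatch on frontier size plus its out-degree sum
    work = len(frontier) + sum(len(out_nbrs[v]) for v in frontier)
    if work > threshold:
        return edgemap_dense(in_nbrs, frontier, update, cond)
    return edgemap_sparse(out_nbrs, frontier, update, cond)
```

Running the same BFS with a threshold of 0 (always dense) and a huge threshold (always sparse) reaches the same set of vertices, which is the point: the two traversal directions are interchangeable.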
EdgeMap (sparse version)
• How to represent the VertexSubset? An array of integers, e.g., U = {0, 5, 7}

procedure EDGEMAP_SPARSE(G, frontier, UPDATE, COND):
    nextFrontier = {};
    parallel foreach v in frontier:
        parallel foreach w in out_neighbors(v):
            if (COND(w) == 1 and UPDATE(v, w) == 1) then:
                add w to nextFrontier;
    remove duplicates from nextFrontier;
    return nextFrontier;
EdgeMap (dense version)
• How to represent a dense VertexSubset? A byte array, e.g., U = {1, 0, 0, 0, 0, 1, 0, 1}

procedure EDGEMAP_DENSE(G, frontier, UPDATE, COND):
    nextFrontier = {0, …, 0};
    parallel foreach v in G:
        if (COND(v) == 1) then:
            foreach ngh in in_neighbors(v):
                if (ngh in frontier and UPDATE(ngh, v) == 1) then:
                    nextFrontier[v] = 1;
                if (COND(v) == 0) then:
                    break;
    return nextFrontier;
Frontier-based approach enables sparse/dense traversal
[Figure: 40-core running times in seconds for BFS, betweenness centrality, connected components, and eccentricity estimation on the Twitter graph (41M vertices, 1.5B edges), comparing dense-only, sparse-only, and hybrid sparse/dense traversal (using the default threshold of |E|/20); bars of 30.7 and 20.7 seconds appear for the dense-only and sparse-only methods, while the hybrid is consistently fastest]
Ligra BFS Performance
[Figure: BFS running time in seconds and lines of code on the Twitter graph (41M vertices, 1.5B edges), comparing Ligra against hand-written Cilk/OpenMP code, both on a 40-core machine; the running times are close, while the hand-written code requires many more lines]
• Comparing against direction-optimizing code by Beamer et al.
Ligra PageRank Performance
[Figure: running time in seconds and lines of code for one PageRank iteration on the Twitter graph (41M vertices, 1.5B edges), comparing GraphLab (64 x 8 cores), GraphLab (40-core machine), Ligra (40-core machine), and hand-written Cilk/OpenMP (40-core machine)]
• Easy to implement “sparse” version of PageRank in Ligra
Ligra Connected Components Performance
[Figure: running time in seconds and lines of code for connected components on the Twitter graph (41M vertices, 1.5B edges), comparing GraphLab (16 x 8 cores), GraphLab (40-core machine), Ligra (40-core machine), and hand-written Cilk/OpenMP (40-core machine); line counts of 120 and 244 appear for two of the implementations]
• Performance close to hand-written code
• Faster than existing high-level frameworks at the time
• Shared-memory graph processing is very efficient
• Several shared-memory graph processing systems subsequently developed: Galois [SOSP ’13], X-Stream [SOSP ’13], PRISM [SPAA ’14], Polymer [PPoPP ’15], Ringo [SIGMOD ’15], GraphMat [VLDB ’15]
Large Graphs
• Real-world graphs range from 6.6 billion to 128 billion to ~1 trillion edges [VLDB 2015]
• Most can fit on a commodity shared-memory machine
• What if you don’t have that much memory?
[Figure: running time vs. memory required; performance degrades sharply once the memory required exceeds the available RAM]
Ligra+: Adding Graph Compression to Ligra
Ligra+: Adding Graph Compression to Ligra
• Same interface as Ligra: all changes hidden from the user!
• Graph: uses a compressed representation
• EdgeMap: decodes edges on-the-fly
• VertexSubset and VertexMap: same as before
Graph representation
• Standard representation (vertex IDs 0, 1, 2, 3, …):
  • Offsets: 0 4 5 11 ...
  • Edges: 2 7 9 16 0 1 6 9 12 ...
• Compressed edges: 2 5 2 7 -1 -1 5 3 3 ...
  • Sort each vertex’s edges and encode differences, with the first difference taken relative to the vertex ID (e.g., for vertex 0: 2 - 0 = 2, 7 - 2 = 5; for vertex 2: 1 - 2 = -1)
• Graph reordering to improve locality
  • Goal: give neighbors IDs close to the vertex ID
  • BFS, DFS, METIS, our own separator-based algorithm
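The difference encoding above can be sketched in Python as the encode/decode of one vertex's sorted neighbor list (the function names are illustrative):

```python
def delta_encode(vertex_id, sorted_nbrs):
    # First neighbor is stored relative to the vertex ID (may be negative);
    # the rest are differences between consecutive sorted neighbors.
    if not sorted_nbrs:
        return []
    deltas = [sorted_nbrs[0] - vertex_id]
    for a, b in zip(sorted_nbrs, sorted_nbrs[1:]):
        deltas.append(b - a)
    return deltas

def delta_decode(vertex_id, deltas):
    # Invert the encoding by running a prefix sum starting at the vertex ID.
    nbrs, prev = [], vertex_id
    for d in deltas:
        prev += d
        nbrs.append(prev)
    return nbrs

# Matches the slide's example:
delta_encode(0, [2, 7, 9, 16])   # -> [2, 5, 2, 7]
delta_encode(2, [1, 6, 9, 12])   # -> [-1, 5, 3, 3]
```

The deltas are small when neighbor IDs are close to the vertex ID, which is exactly what the reordering step aims for; the small values are then stored compactly with the variable-length codes described next.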
Variable-length codes
• k-bit codes
  • Encode a value in chunks of k bits
  • Use k-1 bits for data, and 1 bit as the “continue” bit
• Example: encode 401 using an 8-bit (byte) code
  • In binary: 1 1 0 0 1 0 0 0 1
  • First byte: 1 0 0 1 0 0 0 1 (“continue” bit set, then the low 7 data bits)
  • Second byte: 0 0 0 0 0 0 1 1 (“continue” bit clear, then the remaining data bits)
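A Python sketch of the byte code just described, for k = 8: seven data bits per byte plus a high-order continue bit (a variant of the classic varint encoding).

```python
def byte_encode(value):
    # Emit 7 data bits per byte, setting the high "continue" bit (0x80)
    # on every byte except the last.
    out = []
    while True:
        chunk = value & 0x7F
        value >>= 7
        if value:
            out.append(chunk | 0x80)   # more chunks follow
        else:
            out.append(chunk)
            return out

def byte_decode(data):
    # Reassemble 7-bit chunks, least-significant chunk first,
    # stopping at the first byte without the continue bit.
    value, shift = 0, 0
    for b in data:
        value |= (b & 0x7F) << shift
        shift += 7
        if not (b & 0x80):
            break
    return value

byte_encode(401)   # -> [0x91, 0x03], matching the slide's two-byte example
```

Small deltas fit in a single byte, so combined with the difference encoding most edges cost one byte instead of four or eight.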
Encoding optimization
• Another idea: get rid of “continue” bits
• Group integers by the number of bytes each requires (using run-length encoding), up to 64 per group
• Each group gets a header byte storing the size of the group and the number of bytes per integer, followed by the integers encoded in fixed byte chunks (e.g., integers x1 … x8 requiring 1 2 2 2 2 2 2 2 bytes split into a 1-byte group and a 2-byte group)
• Increases space, but makes decoding cheaper (no branch misprediction from checking “continue” bits)
Modifying EdgeMap
• Processes outgoing edges of a subset of vertices, decoding each frontier vertex’s compressed edge list on-the-fly
• All frontier vertices are processed in parallel
• What about high-degree vertices?
Handling high-degree vertices
• Split each high-degree vertex’s edge list into chunks of size T, encoding the first entry of each chunk relative to the source vertex
• All chunks can then be decoded in parallel!
• We chose T = 1000; performance and space usage are similar for a wide range of T
Ligra+: Adding Graph Compression to Ligra
• Using 8-bit codes with run-length encoding
• Cost of decoding on-the-fly?
[Figure: space and 40-core running time of Ligra+ relative to Ligra across applications; Ligra+ uses less space, and its running times range from roughly 0.6x to 1.4x those of Ligra]
• The memory bottleneck is a bigger issue as graph algorithms are memory-bound
Demo on compressed graphs in Ligra
Other Graph Processing Systems
• Pregel/Giraph/GPS, GraphLab/PowerGraph, PRISM, Pegasus, Knowledge Discovery Toolbox/CombBLAS, GraphChi, GraphX, Galois, X-Stream, Gunrock, GraphMat, Ringo, TurboGraph, FlashGraph, Grace, PathGraph, Polymer, GoFFish, Blogel, LightGraph, MapGraph, PowerLyra, PowerSwitch, XDGP, Signal/Collect, PrefEdge, Parallel BGL, KLA, Grappa, Chronos, Green-Marl, GraphHP, P++, LLAMA, Venus, Cyclops, Medusa, NScale, Neo4J, Trinity, GBase, HyperGraphDB, Horton, GSPARQL, Titan, and many others…
• Cannot list everything here. For more information, see:
  • “Systems for Big Graphs”, Khan and Elnikety, VLDB 2014 Tutorial
  • “Trade-offs in Large Graph Processing: Representations, Storage, Systems and Algorithms”, Ajwani et al., WWW 2015 Tutorial
  • “A Survey of Parallel Graph Processing Frameworks”, Doekemeijer and Varbanescu, 2014
  • “Thinking like a Vertex: A Survey of Vertex-Centric Frameworks for Large-Scale Distributed Graph Processing”, McCune et al., 2015
Pregel
• “Think like a vertex”
• Distributed-memory, uses message passing
• Bulk synchronous model
• Vertices can be either “active” or “inactive”

Pregel: A System for Large-Scale Graph Processing, Malewicz et al., SIGMOD 2010
GraphLab/PowerGraph
• “Think like a vertex”
• Vertices define Gather, Apply, and Scatter functions
• Data on edges as well as vertices
• Scheduler processes “active” vertices
• Supports asynchronous execution
  • Useful for some machine learning applications
  • Different levels of consistency
  • Different scheduling orders
• Current version is for distributed memory (original version was for shared memory)

GraphLab: A New Framework for Parallel Machine Learning, Low et al., UAI 2010
PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs, Gonzalez et al., OSDI 2012
GraphChi
• “Think like a vertex”: define an Update function for vertices
• Optimized for disk-based execution
• Divides edges into “shards”
  • Each shard has a range of target IDs
  • Each shard is sorted by source ID
• Parallel sliding windows method for efficient execution
  • One shard in memory, other shards read in streaming fashion
  • About O((V+E)/B) I/Os per iteration

GraphChi: Large-Scale Graph Computation on Just a PC, Kyrola et al., OSDI 2012
Linear algebra abstraction
• PEGASUS: Framework using matrix-vector abstraction in MapReduce [Kang et al. ICDM 2009]
• Knowledge Discovery Toolbox/CombBLAS: Uses matrix-vector and matrix-matrix routines for implementing graph algorithms; uses Python frontend, C++/MPI backend [Lugowski et al. SDM 2012]
Some other shared memory systems
• X-Stream [SOSP 2013]: edge-centric abstraction; supports disk-based execution
• Galois [SOSP 2013]: parallel programming framework based on dynamic sets with various schedulers; implements the graph abstractions of Ligra, X-Stream, and GraphLab
• PRISM [SPAA 2014]: deterministic version of GraphLab using graph coloring for scheduling; implemented in Cilk
• Polymer [PPoPP 2015]: NUMA-aware implementation of Ligra’s abstraction
• GraphMat [VLDB 2015]: uses optimized matrix-vector routines to implement graph algorithms
• Gunrock [PPoPP 2016]: abstraction for frontier-based computations on GPUs
Exercises
• Implement a version of BFS that gives a deterministic BFS tree (i.e., the Parents array is deterministic)
• Implement a version of BFS that uses a bit-vector to check the “visited” status before updating the Parents array
Thank you!
Code: https://github.com/jshun/ligra/
References:
• Ligra: A Lightweight Graph Processing Framework for Shared Memory, PPoPP 2013
• Smaller and Faster: Parallel Processing of Compressed Graphs with Ligra+, DCC 2015
• An Evaluation of Parallel Eccentricity Estimation Algorithms on Undirected Real-World Graphs, KDD 2015