Copyright© 2016 NTT Corp. All Rights Reserved.
Rabbit Order: Just-in-time Parallel Reordering for Fast Graph Analysis
Junya Arai Nippon Telegraph and Telephone Corp. (NTT)
Hiroaki Shiokawa Univ. of Tsukuba
Takeshi Yamamuro NTT
Makoto Onizuka Osaka Univ.
Sotetsu Iwamura NTT
J. Arai+, "Rabbit Order: Just-in-time Reordering for Fast Graph Analysis," IPDPS'16. Copyright© 2016 NTT Corp. All Rights Reserved. 2
Summary
• Vertex reordering has been used to improve locality of graph processing
• However, overheads of reordering tend to increase end-to-end runtime (= reordering + analysis)
• Thus, we propose a fast reordering algorithm, Rabbit Order
• Exploit community structures in real-world graphs
• Up to 3.5x speedup for PageRank
• Including reordering overheads!
• Also effective for various other graph analyses
Graph analysis
To deal with large-scale graphs, the performance of various analysis algorithms needs to be improved
Real-world graphs
• Web graphs
Over 50B pages*1
• Social graphs
1B users, 200B friendships*1
Analysis algorithms
• Community detection
• Ranking (e.g., PageRank)
• Shortest path
• Diameter
• Connected components
• k-core decomposition
• ...
*1: Andrew+, “Parallel Graph Analytics,” CACM, 59(5), ‘16
Poor locality
• Poor locality in memory accesses is a problem common to various analysis algorithms
Poor locality causes ...
• Frequent cache misses
• Frequent inter-core communications
• Memory bandwidth saturation
• Simultaneous memory access from cores
Poor scalability
Memory access example
• PageRank
until convergence do
  for each vertex $v$ do
    $s[v] = \sum_{u \in \mathrm{Neighbor}(v)} s[u] / \mathrm{degree}(u)$
Access the PageRank score s of each neighbor
• s[0], s[2], s[4], and s[7] are accessed when v = 0
[Figure: example graph with 8 vertices (IDs 0 to 7), shown with the elements of array s accessed for each vertex v]
Over all vertices v = 0 to 7, the accesses to array s show:
• Poor spatial locality
• Poor temporal locality
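The PageRank loop from the slide can be sketched as runnable code. This is a minimal sketch, not the authors' implementation: the CSR layout (`offsets`, `neighbors`), the convergence tolerance, the iteration cap, and the 4-vertex example graph are all assumptions made for illustration, and the update omits a damping factor because the slide's formula has none.

```python
def pagerank(offsets, neighbors, tol=1e-10, max_iters=200):
    """Pull-style PageRank on an undirected graph stored in CSR form (no damping)."""
    n = len(offsets) - 1
    degree = [offsets[v + 1] - offsets[v] for v in range(n)]
    s = [1.0 / n] * n
    for _ in range(max_iters):
        new_s = [0.0] * n
        for v in range(n):
            # The reads of s[u] are the irregular accesses discussed on the slide:
            # u jumps around the array unless neighbors have nearby IDs.
            for u in neighbors[offsets[v]:offsets[v + 1]]:
                new_s[v] += s[u] / degree[u]
        if sum(abs(a - b) for a, b in zip(s, new_s)) < tol:
            return new_s
        s = new_s
    return s


# Hypothetical example: triangle 0-1-2 plus vertex 3 attached to vertex 2.
offsets = [0, 2, 4, 7, 8]
neighbors = [1, 2, 0, 2, 0, 1, 3, 2]
scores = pagerank(offsets, neighbors)
```

The inner loop touches `s` at whatever indices the neighbor list holds, so the vertex numbering alone decides how far apart consecutive reads land.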
Reordering
• A preprocessing step that optimizes the vertex ordering (ID numbering)
• Requires no changes to analysis algorithms or their implementations
• Improve locality by co-locating neighboring vertices in memory
• Existing algorithms: RCM, LLP, Nested Dissection, ...
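Because reordering only renumbers vertices, it can be illustrated as applying a permutation to an adjacency list. The sketch below is an assumption-laden illustration: the 6-vertex graph, the permutation `new_id`, and the `average_gap` measure (a crude proxy for locality, not a metric used in the slides) are all hypothetical.

```python
def relabel(adjacency, new_id):
    """Apply a vertex permutation; new_id[old] is the new ID of vertex `old`."""
    reordered = [None] * len(adjacency)
    for old, nbrs in enumerate(adjacency):
        reordered[new_id[old]] = sorted(new_id[u] for u in nbrs)
    return reordered


def average_gap(adjacency):
    """Mean |v - u| over all adjacency entries; smaller means neighbors have closer IDs."""
    gaps = [abs(v - u) for v, nbrs in enumerate(adjacency) for u in nbrs]
    return sum(gaps) / len(gaps)


# Hypothetical graph: triangles {0, 2, 4} and {1, 3, 5} joined by the edge 4-5.
adjacency = [[2, 4], [3, 5], [0, 4], [1, 5], [0, 2, 5], [1, 3, 4]]
new_id = [0, 3, 1, 4, 2, 5]  # gives each triangle a contiguous ID range
reordered = relabel(adjacency, new_id)
```

The analysis code runs unchanged on `reordered`; only the IDs, and therefore the memory positions of the per-vertex data, differ.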
[Figure: the same example graph under a random ordering (left) and a high-locality ordering (right)]
On a reordered graph
[Figure: elements of array s accessed by PageRank for each vertex v = 0 to 7 on the reordered graph]
High spatial locality
High temporal locality
Problem in reordering
• Reordering tends to increase end-to-end runtime
• end-to-end = reordering + analysis (e.g., PageRank)
• ‘Speedup’ by ahead-of-time reordering
• Reordering: slow
• Analysis: fast
• The graph must be reordered again whenever it is modified
[Figure: pipeline from the original graph through reordering and analysis to the result]
[Figure: timelines of end-to-end runtime. Without reordering: analysis only. With reordering: reordering followed by analysis, which takes longer in total (slowdown)]
Our contribution: Rabbit Order
• Reordering algorithm to reduce end-to-end runtime
• Speedup by just-in-time reordering
[Figure: timelines of end-to-end runtime without reordering, with reordering, and with Rabbit Order, which combines fast reordering and high locality]
Rabbit Order
Two main techniques
1. Hierarchical community-based ordering
• For high locality
2. Parallel incremental aggregation
• For fast reordering
Community-based ordering
• Community: a group of densely connected vertices
• Common in real-world graphs (e.g., web, social, ...)
• Co-locate vertices within each community in memory (cf. [Prat-Perez ‘11][Boldi+ ‘11])
[Figure: elements of array s accessed for each vertex v = 0 to 7, with Community 1 (vertices 0 to 4) and Community 2 (vertices 5 to 7) marked on the graph and on the access pattern]
[Figure: a real social network (http://snap.stanford.edu/data/egonets-Facebook.html) and the elements of array s accessed for each vertex v under community-based ordering]
Hierarchy of communities
• A community contains inner nested communities
• e.g., social network of students
• Hierarchy of schools, grades, and classes
[Figure: real social network, http://snap.stanford.edu/data/egonets-Facebook.html]
Hierarchical community-based ordering
• Hierarchical community co-location for further locality
• Recursively co-locate vertices within each inner-community
• Inner communities produce denser blocks, higher locality
[Figure: elements of array s accessed for each vertex v under hierarchical community-based ordering, with a denser block highlighted; real social network from http://snap.stanford.edu/data/egonets-Facebook.html]
How can we obtain hierarchical communities?
Reordering time must be short!
Incremental aggregation [Shiokawa+ '13]
• Extract hierarchical communities by merging vertex pairs
• Fast since it rapidly coarsens the graph, but sequential
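Each vertex is merged into the neighbor that most improves modularity Q. For vertex 0 in the example:
• ΔQ(v0, v2) = 0.052 (largest gain)
• ΔQ(v0, v4) = 0.031
• ΔQ(v0, v7) = 0.042
Gain of modularity for merging vertices u and v [Newman+ ‘04]:
$$\Delta Q(u, v) = 2\left(\frac{w_{uv}}{2m} - \frac{\deg(u)\,\deg(v)}{(2m)^2}\right)$$
• $w_{uv}$: edge weight between vertices u and v
• $m$: total number of edges in the graph
[Figure: the example graph is coarsened by repeated merges into two communities, {0, 2, 4, 5, 7} and {1, 3, 6}]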
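The modularity gain and the choice of merge target can be written out directly. This is a minimal sketch under stated assumptions: the dictionary-based `weights` layout, the function names, and the 4-vertex unit-weight example graph are hypothetical and are not taken from the authors' code.

```python
def delta_q(w_uv, deg_u, deg_v, m):
    """Modularity gain for merging u and v: 2 * (w_uv / 2m - deg(u) deg(v) / (2m)^2)."""
    return 2.0 * (w_uv / (2.0 * m) - (deg_u * deg_v) / (2.0 * m) ** 2)


def best_neighbor(u, weights, degree, m):
    """Return (neighbor, gain) with the largest gain; weights[u] maps neighbor -> edge weight."""
    return max(
        ((v, delta_q(w, degree[u], degree[v], m)) for v, w in weights[u].items()),
        key=lambda pair: pair[1],
    )


# Hypothetical unit-weight graph with edges 0-1, 0-2, 1-2, 2-3 (m = 4).
weights = {0: {1: 1, 2: 1}, 1: {0: 1, 2: 1}, 2: {0: 1, 1: 1, 3: 1}, 3: {2: 1}}
degree = {0: 2, 1: 2, 2: 3, 3: 1}
target, gain = best_neighbor(0, weights, degree, m=4)
```

With equal edge weights the degree product decides the outcome, so a vertex prefers a low-degree neighbor over a hub.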
Parallelization issues
• Naive per-vertex parallelization causes conflicts
• Mutex: large overheads
• Fine-grained locking (per vertex) is required
• Atomic operations: operands are too small (16 bytes on x86-64)
• They cannot atomically merge vertices, which requires reattaching edges and removing one of the vertices
[Figure: Thread 1 and Thread 2 merging vertices of the same graph (vertices 0 to 6) at the same time]
Solution: lazy aggregation (1/2)
• Lightweight concurrency control by atomic operations
• Delay merges until the merged vertex is required, to reduce the size of the data that must be modified atomically
Just register vertices as community members
• This can be performed with compare-and-swap by storing the members in a singly-linked list
• All the members are virtually treated as vertex 1
[Figure: Thread 1 and Thread 2 concurrently registering vertices into the community represented by vertex 1]
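Registering a member by compare-and-swap on the head of a singly-linked list can be sketched as follows. Python has no hardware compare-and-swap, so the `AtomicRef` class below emulates one with a lock; it, `Member`, `register`, and `members` are hypothetical names for illustration, and the sketch says nothing about the data layout of the actual implementation.

```python
import threading


class AtomicRef:
    """Emulated atomic reference; the lock stands in for a hardware compare-and-swap."""

    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        with self._lock:
            if self._value is expected:
                self._value = new
                return True
            return False


class Member:
    """Node of the singly-linked member list of one community."""

    def __init__(self, vertex, next_member):
        self.vertex = vertex
        self.next = next_member


def register(head, vertex):
    """Push `vertex` onto the member list; retry if another thread changed the head."""
    while True:
        old = head.load()
        if head.compare_and_swap(old, Member(vertex, old)):
            return


def members(head):
    """Walk the list and return the registered vertex IDs."""
    node, out = head.load(), []
    while node is not None:
        out.append(node.vertex)
        node = node.next
    return out


def worker(head, base):
    for i in range(50):
        register(head, base + i)


head = AtomicRef()
threads = [threading.Thread(target=worker, args=(head, b)) for b in (0, 50, 100, 150)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Only the head pointer changes per registration, which is why a single small atomic update suffices and the expensive edge merging can be postponed.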
Solution: lazy aggregation (2/2)
Which vertex should vertex 1 be merged into?
• Actually merge the registered members, then compute ΔQ
• Only one thread is assigned to each vertex, so it can merge the members without conflicts
[Figure: the thread assigned to vertex 1 merges the registered members and computes ΔQ for the neighbors of the merged vertex]
Communities to ordering in a hierarchical community-based manner
• Construct a dendrogram while extracting communities
• Reorder vertices in DFS visit order on the dendrogram
• Vertices in each inner-community are recursively co-located
[Figure: dendrogram with leaves 1, 3, 6 (Community 2) and 5, 7, 0, 2, 4 (Community 1, which contains an inner community); DFS assigns these leaves the new IDs 5, 6, 7 and 0, 1, 2, 3, 4, respectively, and the example graph is shown before and after renumbering]
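The conversion from a dendrogram to a new ordering can be sketched with a depth-first traversal. The leaf order and the resulting IDs below follow the example on the slide, but the nested-list representation and the exact nesting inside Community 1 are assumptions, since the slide text does not preserve the shape of the tree.

```python
def dfs_order(dendrogram):
    """Return the old vertex IDs in DFS visit order; leaves are ints, inner nodes are lists."""
    order = []
    stack = [dendrogram]
    while stack:
        node = stack.pop()
        if isinstance(node, int):
            order.append(node)
        else:
            # Push children in reverse so the leftmost child is visited first.
            stack.extend(reversed(node))
    return order


def new_ids(order):
    """Map each old vertex ID to its position in the DFS order."""
    return {old: new for new, old in enumerate(order)}


community_1 = [[5, 7], [0, 2, 4]]  # inner nesting is hypothetical
community_2 = [1, 3, 6]
root = [community_1, community_2]
permutation = new_ids(dfs_order(root))
```

Every subtree occupies a contiguous ID range after the traversal, which is what places the members of each inner community next to each other.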
Evaluation
Setup
• Xeon E5-2697 v2, 12 cores x 2 sockets, 256 GB RAM
• Reordering methods for comparison

Name      Method                                                  Parallelism
Slash     SlashBurn [Lim+ TKDE’14]                                Sequential
BFS       Unordered parallel BFS [Karantasis+ SC’14]              Parallel
RCM       Unordered parallel RCM [Karantasis+ SC’14]              Parallel
ND        Multithreaded Nested Dissection [LaSalle+ IPDPS’13]     Parallel
LLP       Layered Label Propagation [Boldi+ WWW’11]               Parallel
Shingle   The shingle ordering [Chierichetti+ KDD’09]             Parallel
Degree    Ascending order of degree                               Parallel
Random    Random ordering (baseline)                              -

• Graphs

Graph      Vertices   Edges
berkstan   0.7M       7.6M
enwiki     4.2M       101.4M
ljournal   4.8M       69.0M
uk-2002    18.5M      298.1M
road-usa   23.9M      57.7M
uk-2005    39.5M      936.4M
it-2004    41.3M      1.2B
twitter    41.7M      1.5B
sk-2005    50.6M      1.9B
webbase    118.1M     1.0B
End-to-end PageRank speedup
Rabbit Order yields up to 3.5x (avg. 2.2x) speedup
The other methods degrade performance in most cases
• End-to-end speedup = $\frac{\text{PageRank runtime with random ordering}}{\text{Reordering runtime} + \text{PageRank runtime}}$
• Reordering methods and PageRank are run with 48 threads using HyperThreading
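The end-to-end speedup defined above is simple arithmetic, and a short sketch makes the break-even point explicit. The function names and all timings below are hypothetical and are not measurements from the evaluation.

```python
def end_to_end_speedup(t_random, t_reorder, t_analysis):
    """Baseline analysis time divided by reordering time plus analysis time after reordering."""
    return t_random / (t_reorder + t_analysis)


def pays_off(t_random, t_reorder, t_analysis):
    """Reordering helps only if it costs less than the analysis time it saves."""
    return t_reorder < t_random - t_analysis


# Hypothetical timings in seconds, for illustration only.
fast = end_to_end_speedup(t_random=100.0, t_reorder=10.0, t_analysis=40.0)
slow = end_to_end_speedup(t_random=100.0, t_reorder=200.0, t_analysis=40.0)
```

In the second case the analysis itself is 2.5x faster, yet the end-to-end result is a slowdown because the reordering time dominates.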
[Figure: end-to-end PageRank speedup (0 to 3.5; values below 1 are slowdowns) on berkstan, enwiki, ljournal, uk-2002, road-usa, uk-2005, it-2004, twitter, sk-2005, and webbase for Rabbit, Slash, BFS, RCM, ND, LLP, Shingle, and Degree. Best speedup: 3.5x]
Breakdown of PageRank runtime
• Rabbit Order achieves fast reordering and high locality at the same time
• It reorders a 1.2B-edge graph in about 12 sec.
[Figure: reordering and PageRank runtime [sec] (0 to 2500) on it-2004 for Random, Degree, Shingle, LLP, ND, RCM, BFS, Slash, and Rabbit]
Cache misses during PageRank
• Competitive with the best state-of-the-art algorithms
[Figure: number of L1, L2, and L3 cache misses during PageRank on it-2004 (0 to 1.8E+11) for Rabbit, Slash, BFS, RCM, ND, LLP, Shingle, Degree, and Random]
Effectiveness for other analyses
• Rabbit Order is effective for various analysis algorithms
• Efficiency is affected by computational cost of analyses
• It is difficult to amortize the reordering time over a short analysis time (e.g., that of DFS and BFS)
[Figure: average end-to-end speedup over the 10 graphs (0 to 3.5; values below 1 are slowdowns) for DFS, BFS, connected components, graph diameter, and k-core decomposition, per reordering method (Rabbit, Slash, BFS, RCM, ND, LLP, Shingle, Degree); analyses are grouped by analysis time, 1-10 sec and 10-100 sec]
Scalability of reordering
• Rabbit Order scales best with the number of threads
• Plenty of parallelism in incremental aggregation
• Lightweight concurrency control (lazy aggregation)
[Figure: average speedup of reordering time vs. 1 thread (0 to 20) with 12 threads, 24 threads, and 48 threads (HT) for Rabbit, BFS, RCM, ND, LLP, Shingle, and Degree]
Conclusion
• Reordering improves locality of graph analysis
• But existing algorithms tend to increase end-to-end runtime
• Rabbit Order reduces the end-to-end runtime by two main techniques:
1. Hierarchical community-based ordering for high locality
2. Parallel incremental aggregation for fast reordering
• Up to 3.5x speedup for PageRank
• Also effective for various analysis algorithms
Implementation available https://git.io/rabbit
(for evaluation purposes only)