Paragon: Parallel Architecture-Aware Graph Partition Refinement Algorithm
Angen Zheng, Alexandros Labrinidis, Patrick Pisciuneri, Panos K. Chrysanthis, and Peyman Givi
University of Pittsburgh
Importance of Graph Partitioning
Applications of Graph Partitioning
o Scientific Simulations
o Distributed Graph Computation
o Pregel, Hama, Giraph
o VLSI Design
o Task Scheduling
o Linear Programming
Target Workloads
★ Vertex
○ a unique identifier
○ a modifiable, user-defined value
★ Edge
○ a modifiable, user-defined value
○ a target vertex identifier
[Figure: a BSP superstep, with each partition applying a user-defined function (UDF)]
Balanced load distribution!!!
Minimizing the comm cost!!!
★ Vertex-Centric UDF
○ Change vertex/edge state
○ Send msgs to neighbors
○ Receive msgs from neighbors
○ Mutate the graph topology
○ Deactivate at the end of the superstep
○ Reactivate upon external msgs
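The vertex-centric model above can be sketched in a few lines of Python (an illustrative Pregel-style sketch; the `Vertex` class and the synchronous driver are hypothetical, not the actual API of Pregel, Hama, or Giraph):

```python
# Minimal Pregel-style vertex-centric sketch (illustrative only; the Vertex
# class and the BSP driver below are hypothetical, not a real framework API).
class Vertex:
    def __init__(self, vid, neighbors):
        self.id = vid                # unique identifier
        self.value = vid             # modifiable, user-defined value
        self.neighbors = neighbors   # target vertex identifiers
        self.active = True

def compute(v, messages, outbox):
    """UDF: propagate the minimum id seen so far (connected components)."""
    new_value = min([v.value] + messages)
    if new_value < v.value or not messages:      # superstep 0 or improvement
        v.value = new_value                      # change vertex state
        for n in v.neighbors:                    # send msgs to neighbors
            outbox.append((n, new_value))
    v.active = False                             # deactivate at end of superstep

def run(vertices):
    """Synchronous supersteps until no vertex is active and no msgs remain."""
    inbox = {vid: [] for vid in vertices}
    while True:
        outbox, stepped = [], False
        for v in vertices.values():
            msgs = inbox[v.id]
            if v.active or msgs:                 # external msgs reactivate
                v.active = True
                compute(v, msgs, outbox)
                stepped = True
        inbox = {vid: [] for vid in vertices}
        for dst, val in outbox:
            inbox[dst].append(val)
        if not stepped and not any(inbox.values()):
            break
    return {vid: v.value for vid, v in vertices.items()}
```

Here the UDF propagates the minimum vertex id, a standard connected-components example; vertices deactivate after each superstep and are reactivated by incoming msgs.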
Roadmap
Introduction ✔
Heterogeneity
State of the Art
PARAGON
Contention
Experiments
[Joke chart: "% of audience asleep" vs "# of slides"]
Minimal Edge-Cut = Minimal Data Comm ≠ Minimal Comm Cost
Relative network comm cost between nodes:
     N1  N2  N3
N1    -   1   6
N2    1   -   1
N3    6   1   -
[Figure: a 3-way partitioning across N1, N2, N3]
• 3 edge-cut
• 3 unit data comm
• 8 unit comm cost (8 = 1 × 6 + 2 × 1)
Minimal Edge-Cut = Minimal Data Comm ≠ Minimal Comm Cost
[Figure: an alternative partitioning across N1, N2, N3, same cost matrix]
• 4 edge-cut
• 4 unit data comm
• 4 unit comm cost (4 = 1 × 1 + 3 × 1)
Group neighboring vertices as close as possible!
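The arithmetic on the two preceding slides can be checked directly (the cost matrix is the one from the slides; the cut-edge lists are hypothetical stand-ins for the two pictured partitionings):

```python
# Relative network comm cost between nodes, from the slide's matrix.
COST = {("N1", "N2"): 1, ("N1", "N3"): 6, ("N2", "N3"): 1}

def link_cost(a, b):
    return 0 if a == b else COST[tuple(sorted((a, b)))]

def comm_cost(cut_edges):
    """Total comm cost of a partitioning, given its cut edges as node pairs."""
    return sum(link_cost(a, b) for a, b in cut_edges)

# 3 cut edges, but one crosses the expensive N1-N3 link: 1*6 + 2*1
print(comm_cost([("N1", "N3"), ("N1", "N2"), ("N1", "N2")]))  # prints 8
# 4 cut edges, all on cheap links: 1*1 + 3*1
print(comm_cost([("N1", "N2"), ("N2", "N3"), ("N2", "N3"), ("N2", "N3")]))  # prints 4
```

Fewer cut edges does not imply cheaper communication once links have different costs.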
Roadmap
Introduction ✔
Heterogeneity ✔
State of the Art
PARAGON
Contention
Experiments
Overview of the State-of-the-Art
Balanced Graph (Re)Partitioning
★ Partitioners (static graphs)
○ Offline Methods (High Quality, Poor Scalability): Metis
○ Online Methods (Moderate Quality, High Scalability): Fennel, ICA3PP’08, SoCC’12, TKDE’15, BigData’15, DG/LDG
★ Repartitioners (dynamic graphs)
○ Offline Methods (High Quality, Poor Scalability): Parmetis, Aragon
○ Online Methods (Moderate~High Quality, High Scalability): CatchW, Paragon, xdgp, Mizan, LogGP, Hermes
★ Heterogeneity-Aware: Aragon, Paragon, LogGP, Hermes
Our Prior Work: Aragon
A sequential architecture-aware graph partition refinement algorithm.
o Input:
• A partitioned graph
• The relative network comm cost matrix
o Output:
• A partitioning with improved mapping of the comm pattern to the underlying hardware topology.
[1]. Angen Zheng, Alexandros Labrinidis, and Panos K. Chrysanthis. Architecture-Aware Graph Repartitioning for Data-Intensive Scientific Computing. BigGraphs, 2014.
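A minimal sketch of the quantity such a refinement optimizes: the change in comm cost when a boundary vertex moves, weighted by the relative network cost matrix (names are hypothetical; Aragon's actual refinement works on partition pairs and also enforces load balance):

```python
def move_gain(v, dst, part, adj, cost, node_of):
    """Reduction in comm cost if vertex v moves from its partition to dst.

    part: vertex -> partition; adj: vertex -> neighbor list;
    node_of: partition -> node; cost[a][b]: relative comm cost (0 if a == b).
    """
    src = part[v]
    before = sum(cost[node_of[src]][node_of[part[u]]] for u in adj[v])
    after = sum(cost[node_of[dst]][node_of[part[u]]] for u in adj[v])
    return before - after
```

With the cost matrix from the earlier slide, moving a vertex off an expensive N1-N3 link yields a large positive gain; an architecture-oblivious refiner would treat all links equally.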
Our Prior Work: Aragon
[Figure: graph G partitioned into P1-P9 across nodes N1-N9; heterogeneity-aware refinement (more details in the paper)]
Aragon: N5 can hold the entire graph in memory
Prefers to work in offline mode
Roadmap
Introduction ✔
Heterogeneity ✔
State of the Art ✔
PARAGON
Contention
Experiments
Paragon
Overview:
o Parallel Architecture-Aware Graph Partition Refinement Algorithm
Goal:
o Group neighboring vertices as close as possible
Paragon vs Aragon:
○ lower overhead
○ scales to much larger graphs
Paragon: Partition Grouping
Partitions are grouped: {P1, P2, P3}, {P4, P6, P9}, {P5, P7, P8}
[Figure: partitions P1-P9 mapped onto nodes N1-N9]
Paragon: Group Server Selection
[Figure: nodes N2, N8, and N9 are selected as group servers for the three groups {P1, P2, P3}, {P4, P6, P9}, {P5, P7, P8}]
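For illustration only, grouping and server selection might be sketched as follows; the round-robin grouping and the first-free-host server rule are assumptions made here, not Paragon's actual policies (those are in the paper):

```python
def make_groups(partitions, n_groups):
    """Round-robin grouping of partitions (an assumed policy, for illustration)."""
    groups = [[] for _ in range(n_groups)]
    for i, p in enumerate(partitions):
        groups[i % n_groups].append(p)
    return groups

def pick_servers(groups, host_of):
    """Give each group a distinct server among the nodes hosting its partitions."""
    servers, used = [], set()
    for g in groups:
        # Take the first hosting node not already serving another group.
        node = next(host_of[p] for p in g if host_of[p] not in used)
        used.add(node)
        servers.append(node)
    return servers
```

Drawing each group's server from its own partitions' hosts keeps some of the group's data local to the server.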
Paragon: Sending “Partition” to Group Servers
[Figure: each group’s partitions are shipped to its group server (N2, N8, N9)]
Only send boundary vertices
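"Only send boundary vertices" is a simple filter: a vertex needs to travel to the group server only if it has a neighbor in a different partition (sketch with hypothetical names):

```python
def boundary_vertices(vertices, adj, part):
    """Vertices with at least one neighbor in a different partition.

    adj: vertex -> neighbor list; part: vertex -> partition.
    Interior vertices cannot change the cut, so they stay home.
    """
    return {v for v in vertices
            if any(part[u] != part[v] for u in adj[v])}
```

Shipping only the boundary is what lets Paragon scale to graphs far larger than any single group server's memory.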
Paragon: Parallel Refinement
[Figure: each group server (N2, N8, N9) runs Aragon on its group’s partitions in parallel]
# of Groups
○ Degree of Parallelism
Paragon: Parallel Refinement
[Figure: the same refinement with different numbers of groups]
# of Groups
○ Degree of Parallelism
○ Parallelism vs Quality
Paragon: Shuffle Refinement
[Figure: after a parallel refinement round, partitions are shuffled into new groups ({P1, P2, P3}, {P4, P6, P9}, {P5, P7, P8} → {P1, P2, P4}, {P5, P7, P9}, {P3, P6, P8}) and refined in parallel again]
Repeat k times to increase the # of partition pairs being refined!
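Putting the pieces together, the shuffle-refinement loop looks roughly like this (a structural sketch: `aragon_refine` stands in for the per-group sequential refinement, the groups are formed by simple slicing, and the shuffle is a plain random permutation; in Paragon the groups are refined in parallel on their group servers):

```python
import random

def paragon(partitions, n_groups, k, aragon_refine):
    """Structural sketch of the shuffle-refinement loop (simplified)."""
    for _ in range(k + 1):                      # initial round + k shuffles
        # Slice the current partition list into groups (illustrative policy).
        groups = [partitions[i::n_groups] for i in range(n_groups)]
        for g in groups:                        # in Paragon: in parallel,
            aragon_refine(g)                    # one group per group server
        random.shuffle(partitions)              # swap partitions across groups
    return partitions
```

Each shuffle exposes new partition pairs to refinement, trading extra rounds for quality closer to the sequential algorithm.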
Roadmap
Introduction ✔
Heterogeneity ✔
State of the Art ✔
PARAGON ✔
Contention
Experiments
Inter-Node Comm Cost ≅ Intra-Node Comm Cost
★ Dual-socket Xeon E5v2 server with
○ DDR3-1600
○ 2 FDR 4x NICs per socket
★ InfiniBand: 1.7GB/s~37.5GB/s
★ DDR3: 6.25GB/s~16.6GB/s
Revisit the Impact of the Memory Subsystem Carefully!
[2]. C. Binnig, U. Çetintemel, A. Crotty, A. Galakatos, T. Kraska, E. Zamanian, and S. B. Zdonik. The End of Slow Networks: It’s Time for a Redesign. CoRR, 2015.
Intra-Node Shared Resource Contention
[Figure: intra-node msg passing copies data from the send buffer through a shared buffer into the receive buffer, with all three buffers cached]
Multiple copies of the same data in LLC, contending for LLC and MC
Intra-Node Shared Resource Contention
[Figure: send/shared buffer cached on one socket, receive/shared buffer cached on the other]
Multiple copies of the same data in LLC, contending for LLC, MC, and QPI.
Paragon: Avoiding Contention
Intra-Node Network Comm Cost = 𝛌 × Maximal Inter-Node Network Comm Cost
𝛌 = Degree of Contention
𝛌 = 1: memory is the bottleneck (Small HPC Clusters)
𝛌 = 0: network is the bottleneck (Cloud/Large Clusters)
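One way to read this slide: intra-node comm is charged 𝛌 times the most expensive inter-node link, so contention enters through the comm cost matrix itself. A sketch under that assumption:

```python
def effective_cost(inter_cost, lam):
    """Fold the degree of contention (lam) into the comm cost matrix.

    Assumption from the slide: intra-node cost = lam * max inter-node cost,
    with lam = 1 when memory is the bottleneck and lam = 0 when the network is.
    inter_cost[a][b] is the inter-node cost, with 0 on the diagonal.
    """
    max_inter = max(max(row.values()) for row in inter_cost.values())
    return {a: {b: (lam * max_inter if a == b else c)
                for b, c in row.items()}
            for a, row in inter_cost.items()}
```

The refinement machinery stays unchanged; only the matrix it consumes differs between a small HPC cluster (𝛌 = 1) and a cloud deployment (𝛌 = 0).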
Paragon: Avoiding Contention
[Figure: inter-node path, where the send buffer on Node#1 goes through the IB HCA directly to the receive buffer on Node#2]
Roadmap
Introduction ✔
Heterogeneity ✔
State of the Art ✔
PARAGON ✔
Contention ✔
Experiments
Evaluation
MicroBenchmarks
o Degree of Refinement Parallelism
o Varying Shuffle Refinement Times
o Varying Initial Partitioners
Real-World Workloads
o Breadth First Search (BFS)
o Single Source Shortest Path (SSSP)
Billion-Edge Graph Scaling
Degree of Refinement Parallelism: Refinement Time
[Figure: refinement time vs degree of parallelism, with Aragon as the baseline]
★ com-lj: |V|=4M, |E|=69M
★ 40 partitions: two 20-core machines
★ Initial Partitioner: DG (deterministic greedy)
★ # of Shuffle Times: 0
Degree of Refinement Parallelism: Partitioning Quality
[Figure: partitioning quality vs degree of parallelism]
★ com-lj: |V|=4M, |E|=69M
★ 40 partitions: two 20-core machines
★ Initial Partitioner: DG (deterministic greedy)
★ # of Shuffle Times: 0
Varying Shuffle Refinement Times
★ com-lj: |V|=4M, |E|=69M
★ 40 partitions: two 20-core machines
★ Initial Partitioner: DG (deterministic greedy)
★ Deg. of Parallelism: 8
With # of shuffle refinement times > 10:
○ Paragon had lower refinement overhead
■ 8~10s vs 33s (Paragon vs Aragon)
○ Paragon produced better decompositions
■ 0~2.6% better (Paragon vs Aragon)
Varying Initial Partitioners
Dataset: 12 datasets from various areas
# of Parts: 40 (two 20-core machines)
Initial Partitioner: HP/DG/LDG
Deg. of Parallelism: 8
# of Refinement Times: 8
HP: Hash Partitioning
DG: Deterministic Greedy Partitioning
LDG: Linear Deterministic Greedy Partitioning
Impact of Varying Initial Partitioners: Partitioning Quality
Improv.    Max    Avg.
HP         58%    43%
DG         29%    17%
LDG        53%    36%
BFS: Platform
• PittMPICluster: 32 nodes connected via a single switch with 56Gbps FDR InfiniBand (bottleneck: memory, 𝛌=1)
• Gordon Supercomputer: 4x4x4 3D torus of switches connected via QDR InfiniBand with 16 compute nodes attached to each switch, 8Gbps link bandwidth (bottleneck: network, 𝛌=0)
BFS: Competitors
uniParagon
Initial Partitioner: DG
[Figure: the state-of-the-art taxonomy, with competitors highlighted: DG/LDG, Metis, Parmetis, Fennel, CatchW, xdgp, Mizan, LogGP, Hermes]
BFS on PittMPICluster (𝛌=1): Exec. Time
★ as-skitter: |V|=1.6M, |E|=22M
★ 60 partitions: three 20-core machines
★ deg. of parallelism: 8
★ # of shuffle refinement times: 8
[Figure: BFS execution time, with speedups of 5.9x, 6.7x, 5.9x, and 2.7x over the competitors]
BFS on PittMPICluster (𝛌=1): Comm Vol
★ as-skitter: |V|=1.6M, |E|=22M
★ 60 partitions: three 20-core machines
★ deg. of parallelism: 8
★ # of shuffle refinement times: 8
Comm volume reduction vs:
            Intra-Socket  Inter-Socket
DG              62%           55%
METIS           53%           55%
PARMETIS        15%           17%
uniPARAGON      62%           39%
[Figure: comm volume breakdown; annotations 2.5x, 1.5x, 50%, 38%]
BFS on Gordon (𝛌=0)
★ as-skitter: |V|=1.6M, |E|=22M
★ 48 partitions: three 16-core machines
★ deg. of parallelism: 8
★ # of shuffle refinement times: 8
[Figure: BFS results on Gordon]
Billion-Edge Graph Scaling: BFS Exec. Time
★ friendster: |V|=124M, |E|=3.6B
★ 60 partitions: three 20-core machines
★ deg. of parallelism: 10
★ # of shuffle refinement times: 10
[Figure: BFS execution time; 1.65x speedup at 60 cores]
Billion-Edge Graph Scaling: Refine Time
★ friendster: |V|=124M, |E|=3.6B
★ 60 partitions: three 20-core machines
★ deg. of parallelism: 10
★ # of shuffle refinement times: 10
[Figure: refinement time; 1.36x]
Conclusions
Introduced PARAGON:
o Parallel Architecture-Aware Graph Partition Refinement Algorithm
PARAGON addresses:
o Scalability
o Communication Heterogeneity
o Shared Resource Contention
Extensive experimental study:
o Achieved up to 6.7x speedups on real-world workloads
o Scaled up to 3.6 billion edges
Acknowledgments:
• Jack Lange
• Albert DeFusco
• Kim Wong
• Mark Silvis
Funding:
• NSF OIA-1028162
• NSF CBET-1250171
Thank You!
Email: [email protected] or [email protected]
http://people.cs.pitt.edu/~anz28/
ADMT: http://db.cs.pitt.edu/group/