Upload
yuichiro-yasui
View
234
Download
0
Tags:
Embed Size (px)
Citation preview
Fast & Energy-Efficient Breadth-First Search
on a Single NUMA System
Yuichiro Yasui & Katsuki Fujisawa Kyushu University & JST CREST
Yukinori Sato JAIST & JST CREST
ISC14 (International supercomputing conference 2014) Research Papers 08 ‒ Energy Efficiency, June 26, 2014
Outline 1. Background
2. Fast computation of graph processing – Related work and our previous contributions
3. Bottlenecks analysis for our previous NUMA-optimized BFS
4. Our proposal : Degree-aware BFS
5. Performance evaluation of proposal BFS – Fast for Graph500 benchmark – Energy-efficient for Green graph500 benchmark
Background • Large scale graphs in various fields
– US Road network : 58 million edges – Twitter follow-ship : 1.47 billion edges – Neuronal network : 100 trillion edges
89 billion vertices & 100 trillion edges Neuronal network @ Human Brain Project
Cyber-security Twitter
US road network 24 million vertices & 58 million edges 15 billion log entries / day
Social network
• Fast and scalable graph processing by using HPC
large
61.6 million vertices & 1.47 billion edges
• Transportation • Social network • Cyber-security • Bioinformatics
Graph analysis and important kernel BFS • The cycle of graph analysis for understanding real-networks
• concurrent search (breadth-first search) • optimization (single source shortest path) • edge-oriented (maximal independent set)
graph processing
Understanding
Application field
- SCALE- edgefactor
- SCALE- edgefactor- BFS Time- Traversed edges- TEPS
Input parameters ResultsGraph generation Graph construction
TEPSratio
ValidationBFS
64 Iterations
Relationships - SCALE- edgefactor
- SCALE- edgefactor- BFS Time- Traversed edges- TEPS
Input parameters ResultsGraph generation Graph construction
TEPSratio
ValidationBFS
64 Iterations
graph
- SCALE- edgefactor
- SCALE- edgefactor- BFS Time- Traversed edges- TEPS
Input parameters ResultsGraph generation Graph construction
TEPSratio
ValidationBFS
64 Iterations
results
Step1
Step2
Step3
• One of most important and fundamental processing • Many algorithms and applications based on exists (Max.-flow and centrality) • low arithmetic intensity & irregular memory accesses.
Breadth-first search (BFS)
Source
BFS Lv. 3
source Lv. 2 Lv. 1
Outputs:Distance (Lv.) and Predecessor for each vertex from source Inputs:Graph,
and source vertex
Target: NUMA arch. system
RAM RAM
processor core & L2 cache
8-core Xeon E5 4640shared L3 cache
RAM RAM
CPU socket(16 logical cores) + Local RAM
Memory access for Local RAM(Fast)
Memory access for Remote RAM(Slow)
NUMA node
• Reduces and avoids memory accesses for Remote RAM
• 4-way Intel Xeon E5-4640 (Sandybridge-EP) – 4 (# of CPU sockets) – 8 (# of physical cores per socket) – 2 (# of threads per core)
4 x 8 x 2 = 64 threads
NUMA node
Max.
NUMA-aware computation
Graph500 Benchmark • Fast computation of graph processing is significant topic in HPC • Graph500 benchmark measures computer performance using TEPS ratio (# of Traversed edges per second) in graph processing such as BFS (Breath-first search)
SCALE&&&edgefactor&(=16)
Median TEPS
1. Generation
- SCALE- edgefactor
- SCALE- edgefactor- BFS Time- Traversed edges- TEPS
Input parameters ResultsGraph generation Graph construction
TEPSratio
ValidationBFS
64 Iterations
- SCALE- edgefactor
- SCALE- edgefactor- BFS Time- Traversed edges- TEPS
Input parameters ResultsGraph generation Graph construction
TEPSratio
ValidationBFS
64 Iterations
- SCALE- edgefactor
- SCALE- edgefactor- BFS Time- Traversed edges- TEPS
Input parameters ResultsGraph generation Graph construction
TEPSratio
ValidationBFS
64 Iterations
3. BFS x 64 2. Construction
x 64
TEPS ratio
• Kronecker graph – synthetic scale-free network which was generated by using Recursive Kronecker product
– 2SCALE vertices and 2SCALE×edgefactor edges – e.g.) SCALE 30 and edgefactor 16 ⇒ 1 billion vertices and 17.2 billion edges
www.graph500.org
Level-synchronized parallel BFS (Top-down) • Started from source vertex and executes following two phases for each level
3) BFS iterations (timed).: This step iterates the timedBFS-phase and the untimed verify-phase 64 times. The BFS-phase executes the BFS for each source, and the verify-phaseconfirms the output of the BFS.
This benchmark is based on the TEPS ratio, which iscomputed for a given graph and the BFS output. Submissionagainst this benchmark must report five TEPS ratios: theminimum, first quartile, median, third quartile, and maxi-mum.
III. PARALLEL BFS ALGORITHM
A. Level-synchronized Parallel BFSWe assume that the input of a BFS is a graph G = (V, E)
consisting of a set of vertices V and a set of edges E.The connections of G are contained as pairs (v, w), wherev, w ∈ V . The set of edges E corresponds to a set ofadjacency lists A, where an adjacency list A(v) containsthe outgoing edges (v, w) ∈ E for each vertex v ∈ V . ABFS explores the various edges spanning all other verticesv ∈ V \{s} from the source vertex s ∈ V in a given graphG and outputs the predecessor map π, which is a map fromeach vertex v to its parent. When the predecessor map π(v)points to only one parent for each vertex v, it represents atree with the root vertex s. However, some applications, suchas the betweenness centrality [1], require all of the parentsfor each vertex, which is equivalent to the number of hopsfrom the source. Therefore, the output predecessor map isrepresented as a directed adjacency graph (DAG). In thispaper, we focus on the Graph500 benchmark, and assumethat the BFS output is a predecessor map that is representedby a tree.
Algorithm 1 is a fundamental parallel algorithm for aBFS. This requires the synchronization of each level thatis a certain number of hops away from the source. We callthis the level-synchronized parallel BFS [6]. Each traversalexplores all outgoing edges of the current frontier, which isthe set of vertices discovered at this level, and finds theirneighbors, which is the set of unvisited vertices at the nextlevel. We can describe this algorithm using a frontier queueQF and a neighbor queue QN , because unvisited verticesw are appended to the neighbor queue QN for each frontierqueue vertex v ∈ QF in parallel with the exclusive controlat each level (Algorithm 1, lines 7–12), as follows:
QN ←!w ∈ A(v) | w ̸∈ visited, v ∈ QF
". (1)
B. Hybrid BFS Algorithm of Beamer et al.The main runtime bottleneck of the level-synchronized
parallel BFS (Algorithm 1) is the exploration of all outgoingedges of the current frontier (lines 7–12). Beamer et al. [8],[9] proposed a hybrid BFS algorithm (Algorithm 2) thatreduced the number of edges explored. This algorithmcombines two different traversal kernels: top-down and
Algorithm 1: Level-synchronized Parallel BFS.Input : G = (V, A) : unweighted directed graph.
s : source vertex.Variables: QF : frontier queue.
QN : neighbor queue.visited : vertices already visited.
Output : π(v) : predecessor map of BFS tree.1 π(v)← −1, ∀v ∈ V2 π(s)← s3 visited← {s}4 QF ← {s}5 QN ← ∅6 while QF ̸= ∅ do7 for v ∈ QF in parallel do8 for w ∈ A(v) do9 if w ̸∈ visited atomic then
10 π(w)← v11 visited← visited ∪ {w}12 QN ← QN ∪ {w}
13 QF ← QN
14 QN ← ∅
bottom-up. Like the level-synchronized parallel BFS, top-down kernels traverse neighbors of the frontier. Conversely,bottom-up kernels find the frontier from vertices in candidateneighbors. In other words, a top-down method finds thechildren from the parent, whereas a bottom-up method findsthe parent from the children. For a large frontier, bottom-upapproaches reduce the number of edges explored, becausethis traversal kernel terminates once a single parent is found(Algorithm 2, lines 16–21).
Table III lists the number of edges explored at each levelusing a top-down, bottom-up, and combined hybrid (oracle)approach. For the top-down kernel, the frontier size mF atlow and high levels is much less than that at mid-levels.In the case of the bottom-up method, the frontier size mBis equal to the number of edges m = |E| in a given graphG = (V, E), and decreases as the level increases. Bottom-upkernels estimate all unvisited vertices as candidate neighborsQN , because it is difficult to determine their exact numberprior to traversal, as shown in line 15 of Algorithm 2. Thislazy estimation of candidate neighbors increases the numberof edges traversed for a small frontier. Hence, consideringthe size of the frontier and the number of neighbors, wecombine top-down and bottom-up approaches in a hybrid(oracle). This loop generally only executes once, so thealgorithm is suitable for a small-world graph that has a largefrontier, such as a Kronecker graph or an R-MAT graph.Table III shows that the total number of edges traversed bythe hybrid (oracle) is only 3% of that in the case of thetop-down kernel.
We now explain how to determine a traversal policy(Table II(a)) for the top-down and bottom-up kernels. The
395
Traversal
Swap
Frontier
Neighbor
Level k Level k+1 QF
QN
Swap … swaps the frontier QF and the neighbor QN for next level
Traversal … finds unvisited adjacency vertices from current frontier QF and append to neighbor QN!
Candidates of neighbors
前方探索と後方探索でのデータアクセスの観察• 前方探索でのデータの書込み
v→ w
v
w
Input : Directed graph G = (V, AF ), Queue QF
Data : Queue QN , visited, Tree π(v)
QN ← ∅for v ∈ QF in parallel do
for w ∈ AF (v) doif w ! visited atomic thenπ(w)← vvisited← visited ∪ {w}QN ← QN ∪ {w}
QF ← QN
• 後方探索でのデータの書込み
w→ v
v w
Input : Directed graph G = (V, AB), Queue QF
Data : Queue QN , visited, Tree π(v)
QN ← ∅for w ∈ V \ visited in parallel do
for v ∈ AB(w) doif v ∈ QF thenπ(w)← vvisited← visited ∪ {w}QN ← QN ∪ {w}break
QF ← QN
• どちらも wに関する変数 π(w)と visitedに書込みを行っている (vは点番号の参照)
6 / 12
Hybrid-BFS (Direction-optimizing BFS) Chooses one from Top-down or Bottom-up for frontier size at each level
Frontier
Neighbors
Level0k
Level0k+1
Frontier Level0k
Level0k+1 neighbors
Top-down algorithm • Efficient for small-frontier • Uses out-going edges
Bottom-up algorithm • Efficient for large-frontier • Uses in-coming edges
前方探索と後方探索でのデータアクセスの観察• 前方探索でのデータの書込み
v→ w
v
w
Input : Directed graph G = (V, AF ), Queue QF
Data : Queue QN , visited, Tree π(v)
QN ← ∅for v ∈ QF in parallel do
for w ∈ AF (v) doif w ! visited atomic thenπ(w)← vvisited← visited ∪ {w}QN ← QN ∪ {w}
QF ← QN
• 後方探索でのデータの書込み
w→ v
v w
Input : Directed graph G = (V, AB), Queue QF
Data : Queue QN , visited, Tree π(v)
QN ← ∅for w ∈ V \ visited in parallel do
for v ∈ AB(w) doif v ∈ QF thenπ(w)← vvisited← visited ∪ {w}QN ← QN ∪ {w}break
QF ← QN
• どちらも wに関する変数 π(w)と visitedに書込みを行っている (vは点番号の参照)
6 / 12
Current frontier
Unvisited neighbors
Current frontier
Beamer2012
Candidates of neighbors
Skips unnecessary edge traversal
Chooses one from Top-down or Bottom-up for a number of traversed edges at each level
Number of traversal edges of Kronecker graph with SCALE 26
Hybrid-BFS reduces unnecessary edge traversals
Beamer2012 Hybrid-BFS (Direction-optimizing BFS)
Top=down 幅優先探索に対する前方探索 (Top-down)と後方探索 (Bottom-up)
Level Top-down Bottom-up Hybrid0 2 2,103,840,895 21 66,206 1,766,587,029 66,2062 346,918,235 52,677,691 52,677,6913 1,727,195,615 12,820,854 12,820,8544 29,557,400 103,184 103,1845 82,357 21,467 21,4676 221 21,240 227
Total 2,103,820,036 3,936,072,360 65,689,631Ratio 100.00% 187.09% 3.12%
6 / 14
Bottom=up&
Top=down
Distance from source |V| = 226, |E| = 230
= |E|
NUMA-optimized BFS • Clearly separated to accessing for local and remote memory
– Edge traversal on Local RAM – All-gather of local queues and bitmaps for Remote RAM
NUMA=optimized&Top=down
NUMA=optimized&Bottom=up
Large&frontier? Aggregates&local&frontier&queues
Yes
No At each level,!
Traversal on local RAM Swap on Remote RAM
QN1
QN0
QN3 QF
QN2
QF QN2
• Searches&local&neighbors&from&local&copied&frontier&
• Out-going edges for Top-down • In-coming edges for Bottom-up
NUMA-opt. requires two CSR graphs
※&Not&same&for&undirected&graph
Local Edge Traversal V0
QF visited0 QN0
A0 V
V1
QF visited1 QN1
V A1
V3
QF visited3 QN3
A3 V
V2
QF visited2 QN2
A2 V
• partial&edges&(vi),)vj)),&(vi)∈)V)and)vj)∈)Vk)
NUMAを考慮したグラフ領域の分割
• ℓ-個の CPUソケット (ℓ-個の RAM)を持つ計算機を想定• RAMk 上に,部分点集合 Vk と隣接リストの部分集合 Ak に配置
V =!V0 | V1 | · · · | Vℓ−1
", A =
!A0 | A1 | · · · | Aℓ−1
",
• 部分点集合 Vk は単純な 1次元分割 (n: 点数, ℓ: CPUソケット数)
Vk =
#v j ∈ V | j ∈
$knℓ,
(k + 1)nℓ
%&,
• ℓ-個の CPUソケットを持つ計算機上に 2種類の隣接リストを定義AFk (v) = {w ∈ Vk ∩ A(v)} , v ∈ V
A A A AFF
0 1
F2
F3
visited1TREE1
visited0TREE0
visited2TREE2
visited3TREE3
frontier
v
w neighbors
ABk (w) = {v ∈ A(w)} , w ∈ Vk.
A A A ABB
0 1
B2
B3
visited1TREE1
visited0TREE0
visited2TREE2
visited3TREE3
frontier
neighbors
v
w
8 / 13
• partial&vertices&Vk
RAM RAM
processor core & L2 cache
8-core Xeon E5 4640shared L3 cache
RAM RAM
RAM RAM
processor core & L2 cache
8-core Xeon E5 4640shared L3 cache
RAM RAM
RAM RAM
processor core & L2 cache
8-core Xeon E5 4640shared L3 cache
RAM RAM
0th NUMA node 3th NUMA node
2nd NUMA node 1st NUMA node
k=th&NUMA&node&holds& • Local&copied&frontier&QF
CPU Affinity and local memory binding • ULIBC: Ubiquity Library for Intelligently Binding Cores
– provides some routines for CPU affinity + Local memory binding – manages each processor core (processor ID) by topology information as a tuple of (SMT ID, core ID, package ID).
All processors Online processors (allocated&to¤t&process)
CPU Affinity
1.&Detects&online&processors&&&&&&&using&sched_getaffinity&system&call
NUMA node 0
NUMA node 1
core 0 core 1 core 2
core 3
RAM
RAM
Local RAM
Use&Other&processes
Package0ID0:0index&of&CPU&socket&Core0ID0:0index&of&physical&core&in&each&CPU&socket&SMT0ID0:0index&of&thread&in&each&physical&core&
Processor0ID&index&of&logical&processor&core&
2.&Binds&each&thread&to&logical&core&&&&&&&&&using&sched_setaffinity&system&call&or&&&&&&&&&&&&&&&&&Intel&compiler&Thread&Affinity&Interface
0
5
10
15
20
25
30
35
20 21 22 23 24 25 26 27 28 29
GTE
PS
SCALE
reference codeAgawal2010Beamer2012
Yasui2013Yasui2014
Related work: TEPS ratios on a single node • Our BFS achieves 31.7 GTEPS for Kronecker graph (SCALE27)
Yasui2013
Yasui2014
x 2.2
x 2.6
x 5.9 Agarwal2010
faster
Agarwal2010 NUMA-aware Top-down BFS 4-way Intel Xeon 7560
Beamer2012 Hybrid-BFS 4-way Intel Xeon E5-8870
Yasui2013 NUMA-opt. Hybrid-BFS 4-way Intel Xeon E5-4640
0.8 GTEPS (m/n=64, 1.1GTEPS)
5.1 GTEPS
11.1 GTEPS
Yasui2014 Degree-aware NUMA-opt. BFS 4-way Intel Xeon E5-4650
31.7 GTEPS
This paper
This paper
Reference code 0.1 GTEPS
Visited vertices!Zero-degree!71,140,085!
53.0%
Top-down!283!
0.0%!
Bottom-up 63,035,833!
47.0%
Level Step Hybrid-BFS0 Top-down 22
1 Top-down 239,930
2 Bottom-up 150,006,673
3 Bottom-up 19,742,764
4 Bottom-up 139,817
5 Bottom-up 41,846
6 Top-down 260
Total – 170,171,312
% 4.0 %
Breakdown of hybrid-BFS
• Most of CPU time taken to Bottom-up step in Hybrid BFS.
• In particular, Bottom-up step in Level-2 has almost edge traversals. 99.9 %
231 = 2,147,483,648 (100 %)
for Kronecker graph with SCALE27
#Traversed edges
+ + Total vertices!134,217,728!
100.0%!=
• Most of vertex traversal taken to Bottom-up step in Hybrid BFS. • A half of number of vertices is unvisited.
Breakdown of vertex traversal
Traversed edges
88.1 %
Unvisited vertices!Isolated!41,527!
0.0% +
=227
( 8 %)
227 vertices and 231 edges
Influence of ordering for adjacency vertices • Computation complexity of Bottom-up step depends on the ordering of adjacency vertices for each vertex
Number of traversal edges for each ordering
# of traversed edges is strongly affected by each ordering in Lv. 2.
Descending&order
High-degree Low-degree A(v)
Sorted adjacency list A(v)! using out-degree of w!
w
Table 3. Number of traversed edges in a BFS for Kronecker graph with scale 27.
Hybrid algorithmLevel Top-down Bottom-up Step Ascending Randomized Descending
0 22 4,223,250,243 T 22 22 221 239,930 3,258,645,723 T 239,930 239,930 239,9302 1,040,268,126 83,878,899 B 848,743,124 150,006,673 83,878,8993 3,145,608,885 19,616,130 B 19,935,737 19,742,764 19,616,1304 37,007,608 139,606 B 139,868 139,817 139,6065 98,339 41,846 B 41,846 41,846 41,8466 260 41,586 T 260 260 260
Total 4,223,223,170 7,585,614,033 – 869,100,787 170,171,312 103,916,693% 100 % 179.6 % 20.6 % 4.0 % 2.5 %
100
101
102
103
104
105
106
107
108
60001 10 100 1000
Num
ber
offixed
vertices
Loop count ⌧
Ascending
Lv.2
Lv.3
Lv.4
Lv.5
100
101
102
103
104
105
106
107
108
1 2 3 4 5 10 20 30 60
Num
ber
offixed
vertices
Loop count ⌧
Randomized
Lv.2
Lv.3
Lv.4
Lv.5
100
101
102
103
104
105
106
107
108
1 2 3 4 5 10 20 30
Num
ber
offixed
vertices
Loop count ⌧
Descending
Lv.2
Lv.3
Lv.4
Lv.5
Fig. 3. Distribution of the loop count τ of the bottom-up step at each level in a BFSfor Kronecker graph with SCALE27 and edgefactor 16.
Table 4. Breakdown of bottleneck in a BFS of a Kronecker graph with scale 27.
(a) Ascending order
Component 0-degree Bottom-up (τ = 1) Bottom-up (τ ≥ 2) Top-down#vertices 71,140,085 25,489,401 37,331,644 215,070
Ratio 53.00 % 18.99 % 27.81 % 0.16 %
(b) Randomized order
Component 0-degree Bottom-up (τ = 1) Bottom-up (τ ≥ 2) Top-down#vertices 71,140,085 37,663,087 25,157,958 215,070
Ratio 53.00 % 28.06 % 18.74 % 0.16 %
(c) Descending order
Component 0-degree Bottom-up (τ = 1) Bottom-up (τ ≥ 2) Top-down#vertices 71,140,085 60,462,127 2,358,918 215,070
Ratio 53.00 % 45.05 % 1.76 % 0.16 %
3.2 Degree-aware BFS on Degree-aware graph representation
The reference to the adjacent vertices require many in-directing accesses via anindex array and an adjacency edge array of the compressed sparse row (CSR)format graph. However, we clarified that the major bottleneck of the hybrid BFSalgorithm is the edge traversal of the first adjacent vertex in the bottom-up stepin the previous subsection. Then, we separated the standard CSR graph into
Table 3. Number of traversed edges in a BFS for Kronecker graph with scale 27.
Hybrid algorithmLevel Top-down Bottom-up Step Ascending Randomized Descending
0 22 4,223,250,243 T 22 22 221 239,930 3,258,645,723 T 239,930 239,930 239,9302 1,040,268,126 83,878,899 B 848,743,124 150,006,673 83,878,8993 3,145,608,885 19,616,130 B 19,935,737 19,742,764 19,616,1304 37,007,608 139,606 B 139,868 139,817 139,6065 98,339 41,846 B 41,846 41,846 41,8466 260 41,586 T 260 260 260
Total 4,223,223,170 7,585,614,033 – 869,100,787 170,171,312 103,916,693% 100 % 179.6 % 20.6 % 4.0 % 2.5 %
100
101
102
103
104
105
106
107
108
60001 10 100 1000
Num
ber
offixed
vertices
Loop count ⌧
Ascending
Lv.2
Lv.3
Lv.4
Lv.5
100
101
102
103
104
105
106
107
108
1 2 3 4 5 10 20 30 60
Num
ber
offixed
vertices
Loop count ⌧
Randomized
Lv.2
Lv.3
Lv.4
Lv.5
100
101
102
103
104
105
106
107
108
1 2 3 4 5 10 20 30
Num
ber
offixed
vertices
Loop count ⌧
Descending
Lv.2
Lv.3
Lv.4
Lv.5
Fig. 3. Distribution of the loop count τ of the bottom-up step at each level in a BFSfor Kronecker graph with SCALE27 and edgefactor 16.
Table 4. Breakdown of bottleneck in a BFS of a Kronecker graph with scale 27.
(a) Ascending order
Component 0-degree Bottom-up (τ = 1) Bottom-up (τ ≥ 2) Top-down#vertices 71,140,085 25,489,401 37,331,644 215,070
Ratio 53.00 % 18.99 % 27.81 % 0.16 %
(b) Randomized order
Component 0-degree Bottom-up (τ = 1) Bottom-up (τ ≥ 2) Top-down#vertices 71,140,085 37,663,087 25,157,958 215,070
Ratio 53.00 % 28.06 % 18.74 % 0.16 %
(c) Descending order
Component 0-degree Bottom-up (τ = 1) Bottom-up (τ ≥ 2) Top-down#vertices 71,140,085 60,462,127 2,358,918 215,070
Ratio 53.00 % 45.05 % 1.76 % 0.16 %
3.2 Degree-aware BFS on Degree-aware graph representation
The reference to the adjacent vertices require many in-directing accesses via anindex array and an adjacency edge array of the compressed sparse row (CSR)format graph. However, we clarified that the major bottleneck of the hybrid BFSalgorithm is the edge traversal of the first adjacent vertex in the bottom-up stepin the previous subsection. Then, we separated the standard CSR graph into
Better
Loop0count0τ!
A(va) A(vb)
finds frontier vertex and breaks this loop … …
Bottom=up&
Skipped&adjacency&vertices Traversed&adjacency&vertices
τ=1
Analysis of loop count for each vertices
Table 3. Number of traversed edges in a BFS for Kronecker graph with scale 27.
Hybrid algorithmLevel Top-down Bottom-up Step Ascending Randomized Descending
0 22 4,223,250,243 T 22 22 221 239,930 3,258,645,723 T 239,930 239,930 239,9302 1,040,268,126 83,878,899 B 848,743,124 150,006,673 83,878,8993 3,145,608,885 19,616,130 B 19,935,737 19,742,764 19,616,1304 37,007,608 139,606 B 139,868 139,817 139,6065 98,339 41,846 B 41,846 41,846 41,8466 260 41,586 T 260 260 260
Total 4,223,223,170 7,585,614,033 – 869,100,787 170,171,312 103,916,693% 100 % 179.6 % 20.6 % 4.0 % 2.5 %
100
101
102
103
104
105
106
107
108
60001 10 100 1000
Num
ber
offixed
vertices
Loop count ⌧
Ascending
Lv.2
Lv.3
Lv.4
Lv.5
100
101
102
103
104
105
106
107
108
1 2 3 4 5 10 20 30 60
Num
ber
offixed
vertices
Loop count ⌧
Randomized
Lv.2
Lv.3
Lv.4
Lv.5
100
101
102
103
104
105
106
107
108
1 2 3 4 5 10 20 30
Num
ber
offixed
vertices
Loop count ⌧
Descending
Lv.2
Lv.3
Lv.4
Lv.5
Fig. 3. Distribution of the loop count τ of the bottom-up step at each level in a BFSfor Kronecker graph with SCALE27 and edgefactor 16.
Table 4. Breakdown of bottleneck in a BFS of a Kronecker graph with scale 27.
(a) Ascending order
Component 0-degree Bottom-up (τ = 1) Bottom-up (τ ≥ 2) Top-down#vertices 71,140,085 25,489,401 37,331,644 215,070
Ratio 53.00 % 18.99 % 27.81 % 0.16 %
(b) Randomized order
Component 0-degree Bottom-up (τ = 1) Bottom-up (τ ≥ 2) Top-down#vertices 71,140,085 37,663,087 25,157,958 215,070
Ratio 53.00 % 28.06 % 18.74 % 0.16 %
(c) Descending order
Component 0-degree Bottom-up (τ = 1) Bottom-up (τ ≥ 2) Top-down#vertices 71,140,085 60,462,127 2,358,918 215,070
Ratio 53.00 % 45.05 % 1.76 % 0.16 %
3.2 Degree-aware BFS on Degree-aware graph representation
The reference to the adjacent vertices require many in-directing accesses via anindex array and an adjacency edge array of the compressed sparse row (CSR)format graph. However, we clarified that the major bottleneck of the hybrid BFSalgorithm is the edge traversal of the first adjacent vertex in the bottom-up stepin the previous subsection. Then, we separated the standard CSR graph into
Max: 5,873 Max: 58 Max: 28
19.0% + 27.8%
• Bottom-up found 46.8 % vertices • Descending finds most vertices at first loop . Table 3. Number of traversed edges in a BFS for Kronecker graph with scale 27.
Hybrid algorithmLevel Top-down Bottom-up Step Ascending Randomized Descending
0 22 4,223,250,243 T 22 22 221 239,930 3,258,645,723 T 239,930 239,930 239,9302 1,040,268,126 83,878,899 B 848,743,124 150,006,673 83,878,8993 3,145,608,885 19,616,130 B 19,935,737 19,742,764 19,616,1304 37,007,608 139,606 B 139,868 139,817 139,6065 98,339 41,846 B 41,846 41,846 41,8466 260 41,586 T 260 260 260
Total 4,223,223,170 7,585,614,033 – 869,100,787 170,171,312 103,916,693% 100 % 179.6 % 20.6 % 4.0 % 2.5 %
100
101
102
103
104
105
106
107
108
60001 10 100 1000
Num
ber
offixed
vertices
Loop count ⌧
Ascending
Lv.2
Lv.3
Lv.4
Lv.5
100
101
102
103
104
105
106
107
108
1 2 3 4 5 10 20 30 60
Num
ber
offixed
vertices
Loop count ⌧
Randomized
Lv.2
Lv.3
Lv.4
Lv.5
100
101
102
103
104
105
106
107
108
1 2 3 4 5 10 20 30
Num
ber
offixed
vertices
Loop count ⌧
Descending
Lv.2
Lv.3
Lv.4
Lv.5
Fig. 3. Distribution of the loop count τ of the bottom-up step at each level in a BFSfor Kronecker graph with SCALE27 and edgefactor 16.
Table 4. Breakdown of bottleneck in a BFS of a Kronecker graph with scale 27.
(a) Ascending order
Component 0-degree Bottom-up (τ = 1) Bottom-up (τ ≥ 2) Top-down#vertices 71,140,085 25,489,401 37,331,644 215,070
Ratio 53.00 % 18.99 % 27.81 % 0.16 %
(b) Randomized order
Component 0-degree Bottom-up (τ = 1) Bottom-up (τ ≥ 2) Top-down#vertices 71,140,085 37,663,087 25,157,958 215,070
Ratio 53.00 % 28.06 % 18.74 % 0.16 %
(c) Descending order
Component 0-degree Bottom-up (τ = 1) Bottom-up (τ ≥ 2) Top-down#vertices 71,140,085 60,462,127 2,358,918 215,070
Ratio 53.00 % 45.05 % 1.76 % 0.16 %
3.2 Degree-aware BFS on Degree-aware graph representation
The reference to the adjacent vertices require many in-directing accesses via anindex array and an adjacency edge array of the compressed sparse row (CSR)format graph. However, we clarified that the major bottleneck of the hybrid BFSalgorithm is the edge traversal of the first adjacent vertex in the bottom-up stepin the previous subsection. Then, we separated the standard CSR graph into
Table 3. Number of traversed edges in a BFS for Kronecker graph with scale 27.
Hybrid algorithmLevel Top-down Bottom-up Step Ascending Randomized Descending
0 22 4,223,250,243 T 22 22 221 239,930 3,258,645,723 T 239,930 239,930 239,9302 1,040,268,126 83,878,899 B 848,743,124 150,006,673 83,878,8993 3,145,608,885 19,616,130 B 19,935,737 19,742,764 19,616,1304 37,007,608 139,606 B 139,868 139,817 139,6065 98,339 41,846 B 41,846 41,846 41,8466 260 41,586 T 260 260 260
Total 4,223,223,170 7,585,614,033 – 869,100,787 170,171,312 103,916,693% 100 % 179.6 % 20.6 % 4.0 % 2.5 %
100
101
102
103
104
105
106
107
108
60001 10 100 1000
Num
ber
offixed
vertices
Loop count ⌧
Ascending
Lv.2
Lv.3
Lv.4
Lv.5
100
101
102
103
104
105
106
107
108
1 2 3 4 5 10 20 30 60
Num
ber
offixed
vertices
Loop count ⌧
Randomized
Lv.2
Lv.3
Lv.4
Lv.5
100
101
102
103
104
105
106
107
108
1 2 3 4 5 10 20 30
Num
ber
offixed
vertices
Loop count ⌧
Descending
Lv.2
Lv.3
Lv.4
Lv.5
Fig. 3. Distribution of the loop count τ of the bottom-up step at each level in a BFSfor Kronecker graph with SCALE27 and edgefactor 16.
Table 4. Breakdown of bottleneck in a BFS of a Kronecker graph with scale 27.
(a) Ascending order
Component 0-degree Bottom-up (τ = 1) Bottom-up (τ ≥ 2) Top-down#vertices 71,140,085 25,489,401 37,331,644 215,070
Ratio 53.00 % 18.99 % 27.81 % 0.16 %
(b) Randomized order
Component 0-degree Bottom-up (τ = 1) Bottom-up (τ ≥ 2) Top-down#vertices 71,140,085 37,663,087 25,157,958 215,070
Ratio 53.00 % 28.06 % 18.74 % 0.16 %
(c) Descending order
Component 0-degree Bottom-up (τ = 1) Bottom-up (τ ≥ 2) Top-down#vertices 71,140,085 60,462,127 2,358,918 215,070
Ratio 53.00 % 45.05 % 1.76 % 0.16 %
3.2 Degree-aware BFS on Degree-aware graph representation
The reference to the adjacent vertices require many in-directing accesses via anindex array and an adjacency edge array of the compressed sparse row (CSR)format graph. However, we clarified that the major bottleneck of the hybrid BFSalgorithm is the edge traversal of the first adjacent vertex in the bottom-up stepin the previous subsection. Then, we separated the standard CSR graph into
28.1% + 18.7%
45.0%
τ = 1
Better
better
τ ≧ 2
First vertex of adjacency list
τ = 1 τ ≧ 2
τ = 1 τ ≧ 2 Descending order
Ascending Randomized 1.8%
3 features and Degree-aware BFS 1. A half vertices has no adjacency vertices ⇒ Suppression of zero degree vertices using renumbering technique for non-zero degree vertices
2. Computation complexity of Bottom-up depends on the ordering of adjacency vertices for each vertex ⇒ Sorted adjacency list by out-degree in descending
3. Most vertices was found at first loop of Bottom-up ⇒ Separated graph representation; highest-degree adjacency vertex list A+ and remaining CSR graph A-
High%degree Low%degree�
�������i� �������i+1�
�������i�
n�
m-n�n�
Highest%degree
��A-
High%degree Low%degree�
�������i� �������i+1�
�������i�
n�
m-n�n�
Highest%degree
��A+ = standard
CSR graph +
Zero-degree opt.
High-degree opt.
Performance improvements Degree-aware BFS is 2.68 faster than NUMA-opt.
0
5
10
15
20
25
30
NUMA-opt. + High-deg + zero-deg Degree-aware
GT
EPS
⇥1.00
⇥1.34
⇥2.03
⇥2.681.34 x 2.03 ≒
This&paper Previous&paper
Intel Xeon (64) v.s. SGI Altix UV1000 (512) for Kronecker graph with SCALE29
– 536.9 million vertices, 8.59 billion edges
Intel Xeon E5-4650 @ 2.70GHz 64-threads = 4-NUMA x 8-cores x 2-threads
0
50
100
150
200
250
300
Init Lv.0 Lv.1 Lv.2 Lv.3 Lv.4 Lv.5 Lv.6 Lv.7
CP
U ti
me
(ms)
Level
Traversal (local)Swap (all-gather)
BottomUp
BottomUp
BottomUpBottomUp 0
50
100
150
200
250
300
Init Lv.0 Lv.1 Lv.2 Lv.3 Lv.4 Lv.5 Lv.6 Lv.7
CP
U ti
me
(ms)
Level
Traversal (local)Swap (all-gather)
BottomUp
BottomUp
BottomUpBottomUp
64 threads 400 GB
Intel Xeon E7-8837 @ 2.67GHz 512-threads = 64-NUMA x 8-cores
SGI Altix UV1000 (Westmere-EX arch.) Intel Xeon (Sandybridge-EP arch.)
for&Local&RAM
for&Remote&RAM for&Remote&RAM
for&Local&RAM
31.81 GE/s 21.81 GE/s
512 GB RAM 4.0 TB RAM
512 threads 1.0 TBytes
Intel Xeon (64) v.s. SGI Altix UV1000 (512) for Kronecker graph with SCALE30
– 1.07 billion vertices, 17.18 billion edges
Intel Xeon E5-4650 @ 2.70GHz 64-threads = 4-NUMA x 8-cores x 2-threads
Intel Xeon E7-8837 @ 2.67GHz 512-threads = 64-NUMA x 8-cores
SGI Altix UV1000 (Westmere-EX arch.) Intel Xeon (Sandybridge-EP arch.)
512 GB RAM 4.0 TB RAM
0
50
100
150
200
250
300
Init Lv.0 Lv.1 Lv.2 Lv.3 Lv.4 Lv.5 Lv.6 Lv.7
CP
U ti
me
(ms)
Level
Traversal (local)Swap (all-gather)
BottomUpBottomUp
BottomUpBottomUp
512 threads 2.0 TBytes
for&Remote&RAM
for&Local&RAM
37.70 GE/s
Out of memory
Rank.50 Fastest of single-node on nov.2013 list
Strong scaling on SGI Altix UV1000
0
100
200
300
400
500
600
Init Lv.0 Lv.1 Lv.2 Lv.3 Lv.4 Lv.5 Lv.6 Lv.7
CP
U ti
me
(ms)
Level
Traversal (local)Swap (all-gather)
BottomUp
BottomUp
BottomUpBottomUpBottomUp 0
100
200
300
400
500
600
Init Lv.0 Lv.1 Lv.2 Lv.3 Lv.4 Lv.5 Lv.6 Lv.7
CP
U ti
me
(ms)
Level
Traversal (local)Swap (all-gather)
BottomUpBottomUp
BottomUpBottomUpBottomUp 0
100
200
300
400
500
600
Init Lv.0 Lv.1 Lv.2 Lv.3 Lv.4 Lv.5 Lv.6 Lv.7
CP
U ti
me
(ms)
Level
Traversal (local)Swap (all-gather)
BottomUpBottomUp
BottomUpBottomUp
512 threads (one-rack)
37.70 GE/s 26.17 GE/s
256 threads 128 threads
18.76 GE/s
Local Local Local Remote Remote Remote
L : R = 67% : 33% L : R = 80% : 20% L : R = 57% : 43%
197 ms 218 ms 182 ms
258 ms 438 ms 733 ms
• As the number of threads increases, – Improves the CPU time for Local memory access – Keeps the CPU time for Remote memory access
for Kronecker graph with SCALE30 – 1.07 billion vertices, 17.18 billion edges
Rank.50 Fastest of single-node on nov.2013 list
BFS Performances for Real networks • Suitable for small-world networks
– efficient for a low-diameter and a large-edgefactor
Twitter follow-ship network in 2009 61.6 million vertices & 1.47 billion edges
10.90 GTEPS (max. 24.09 GTEPS)
US road network 24 million vertices & 58 million edges 0.09 GTEPS (max. 0.11 GTEPS)
Small=world
Non&small=world
4.4 BFS Performance on Real-world Networks
We verify our BFS performance by using real-world networks on a Sandybridge-EP system. Table 9 shows the graph sizes, graph properties, and GTEPS usingour BFS for each network instance. Here, diam′
G indicates the approximation ofthe diameter on each network using the maximum hops for each vertex of 64 BFSiterations. USA-road-d and wiki-Talk have approximately the same edgefactor,but the latter shows 4–11 times faster than the former owing to its diameterbeing a thousand times smaller relatively. On the other hand, LiveJournal andtwitter have approximately the same diameter, but the latter shows 3–5 timesfaster than the former owing to its edgefactor being 1.68 times larger relatively.In addition, twitter and friendster show similar BFS performances of approxi-mately 10 GTEPS because they have similar edgefactor and similar diameters.Therefore, we verify whether our BFS is affected by using both the edgefactorand diameter of the network. From these numerical results, we could achievehigh performance for large-scale small-world networks with a large edgefactor.
Table 9. BFS performance of real-world network on Sandybridge-EP system.
Graph size edgefactor Diameter GTEPS
Instance n m m/n diam′G min 1/4 median 3/4 max
wiki-Talk [23, 24] 2.39 M 5.02 M 2.1 8 0.29 0.61 0.75 0.87 1.26USA-road-d [25] 23.95 M 58.33 M 2.4 8,098 0.07 0.08 0.09 0.09 0.11LiveJournal [26, 27] 4.85 M 68.99 M 14.2 16 2.76 3.76 4.07 4.32 4.94twitter [28] 61.58 M 1,468.37 M 23.8 16 7.58 10.02 10.90 12.68 24.09friendster [29] 65.61 M 1,806.07 M 27.5 25 4.89 9.61 10.74 11.29 11.81
5 Energy Efficiency of Our BFS
Thus far, we have discussed the BFS performance in terms of only the speed.This section discusses the BFS performance of our implementation in terms ofenergy efficiency. Our speedup techiques are aimed at not only fast computationbut also energy efficiency, as described in [4].
Fig. 6 shows the performance (GTEPS) and energy-efficiency (MTEPS/W) ofour Degree-aware BFS, our NUMA-optimized BFS, and the Graph500 referencecode for each CPU affinity on the SandyBridge-EP. If the number of threads isthe same, our Degree-aware BFS achieves a high TEPS and a high TEPS/Wwith a larger number of sockets. On the other hand, if the number of socketsis the same, our BFS achieves a high TEPS and a high TEPS/W with a largernumber of threads.
As an energy-efficient machine environment, we use Android-based devices,which have highly energy-efficient ARM or Snapdragon processors. We developa native binary code based on the C/C++ programming language by usingthe Android native development kit (NDK). The Android NDK supports mul-tithreaded computation using the OpenMP library and the atomic functions of
The Green Graph500 list in Nov. 2013
• Measures power-efficient using TEPS/W ratio • Our results on various systems such as Xeon servers and Android devices
http://green.graph500.org
Median TEPS
1. Generation - SCALE- edgefactor
- SCALE- edgefactor- BFS Time- Traversed edges- TEPS
Input parameters ResultsGraph generation Graph construction
TEPSratio
ValidationBFS
64 Iterations
- SCALE- edgefactor
- SCALE- edgefactor- BFS Time- Traversed edges- TEPS
Input parameters ResultsGraph generation Graph construction
TEPSratio
ValidationBFS
64 Iterations
- SCALE- edgefactor
- SCALE- edgefactor- BFS Time- Traversed edges- TEPS
Input parameters ResultsGraph generation Graph construction
TEPSratio
ValidationBFS
64 Iterations
3. BFS phase
2. Construction x 64
TEPS ratio
Watt TEPS/W
Power measurement Green Graph500
Graph500
Measuring power consumption during the BFS phase
TEPS and TEPS/W on 4-way Xeon
16 8 8 4 4
4 4
7.92 GTEPS 364.9 W
21.71 MTEPS/W
11.83 GTEPS 452.6 W
26.13 MTEPS/W
13.96 GTEPS 517.8 W
26.96 MTEPS/W
Fast
16 16
16 16
29.03 GTEPS 639.1 W
45.43 MTEPS/W
8 8
8 8
22.03 GTEPS 586.7 W
37.55 MTEPS/W
Energy efficient
#NUMA nodes = 4 #threads = 16
0
5
10
15
20
25
30
1⇥1
(1)
4⇥1
(4)
4⇥2
(8)
1⇥16
(16)
2⇥8
(16)
4⇥4
(16)
2⇥16
(32)
4⇥8
(32)
4⇥16
(64)
w/o
(64)
4⇥16
(64)
GT
EPS
`⇥ t CPU Affinity (Number of threads)
Degree-aware (GTEPS)
Reference (GTEPS)
NUMA-opt. (GTEPS)
NUMA-opt.Ref.Degree-aware
0
100
200
300
400
500
10 11 12 13 14 15 16 17 18 19 20
GT
EPS
SCALE
Reference (p = 4)Degree-aware BFS (p = 4)
Fig. 7. MTEPS of reference BFS and Degree-aware BFS on XperiaA SO-04E.
Table 10. Energy efficiency of BFS for Kronecker graph with on XperiaA SO-04E.
Implementation SCALE MTEPS watt MTEPS/WReference (p = 1) 20 3.25 3.15 1.03Reference (p = 4) 20 4.58 3.22 1.42
Degree-aware (p = 1) 20 136.29 3.23 42.25Degree-aware (p = 2) 20 248.08 2.99 82.92Degree-aware (p = 4) 20 477.63 3.12 153.17
new implementation achieved a BFS search performance of 31.65 GTEPS on37.66 GTEPS on an SGI Altix UV1000 and a 4-way Intel Xeon E5-4650 system,which were ranked as the 50th (fastest on single node) and 51st (fastest on singleserver) best performances on the Graph500 list in November 2013, respectively.In addition, our BFS performance on a Sony Xperia-A-SO-04E and ASUS Nexus7 (2013 version) was ranked as the first and second best performances in termsof energy efficiency on the Green Graph500 list in November 2013, respectively.Finally, further studies will be required to clarify the relationship between thetheoretical complexity of our BFS and the small-world property.
Acknowledgments. We appreciate the valuable comments and suggestions fromthe anonymous reviewers. This research was supported by the Core Research for Evo-lutional Science and Technology (CREST) program of the Japan Science and Tech-nology Agency (JST) and by the JAIST Research Center for Advanced ComputingInfrastructure.
References
1. Edmonds, J., Karp, R. M.: Theoretical improvements in algorithmic efficiency fornetwork flow problems, Journal of the ACM 19 (2), 248–64, (1972).
2. Cormen, T., Leiserson, C., and Rivest, R.: Introduction to Algorithms. MIT Press,Cambridge MA, 1990.
3. Brandes, U.: A Faster Algorithm for Betweenness Centrality. J. Math. Sociol., 25(2),163–177 (2001).
Green Graph500 on Xperia-A-SO-04E Manage both fast and energy-efficient
cution, suggesting that the effective power is not strongly affected by the numberof threads and the algorithm used. With regard to energy-efficient computation,our BFS is around 100 times faster than the reference code for roughly thesame effective power of 3.0 W; specifically, our BFS shows an energy-efficientperformance of 153.17 MTESP/W.
Table 10. Energy efficiency of BFS for Kronecker graph with on XperiaA SO-04E.
Implementation SCALE MTEPS watt MTEPS/WReference (p = 1) 20 3.25 3.15 1.03Reference (p = 4) 20 4.58 3.22 1.42
This study (p = 1) 20 136.29 3.23 42.25This study (p = 2) 20 248.08 2.99 82.92This study (p = 4) 20 477.63 3.12 153.17
Table 11 summarizes the energy efficiency of our BFS on various devices. TheAndroid-based Sony Xperia-A-SO-04E and ASUS Nexus 7 (2013 version) showshigh MTEPS/W values. In comparison, the Sandybridge-EP system shows highTEPS but only moderate energy efficiency. Finally, the Altix UV1000 system isfaster than the Sandybridge-EP system but requires more effective power.
Table 11. Energy efficiency of BFS for a Kronecker graph.
Machine spec. SCALE GTEPS MTEPS/WSony Xperia-A-SO-04E 20 0.478 153.17ASUS Nexus 7 (2013 version) 20 0.534 129.63Sandybridge-EP 27 29.034 45.43Sandybridge-EP (2.7GHz) 27 31.648 41.01Altix UV1000 (one-rack) 30 37.659 1.89
6 Conclusion
In this study, we investigate the major bottleneck in our previous NUMA-optimized BFS algorithm and apply degree-aware speedup techniques to it. Ournew implementation achieved a BFS search performance of 31.65 GTEPS on37.66 GTEPS on an SGI Altix UV1000 and a 4-way Intel Xeon E5-4650 system,which were ranked as the 50th (fastest on single node) and 51st (fastest on singleserver) best performances on the Graph500 list in November 2013, respectively.In addition, our BFS performance on a Sony Xperia-A-SO-04E and ASUS Nexus7 (2013 version) was ranked as the first and second best performances in termsof energy efficiency on the Green Graph500 list in November 2013, respectively.
Acknowledgments. This research was supported by the Core Research forEvolutional Science and Technology (CREST) program of the Japan Scienceand Technology Agency (JST) and by the JAIST Research Center for AdvancedComputing Infrastructure.
153.17 MTEPS/W (477.64 MTEPS)
Roughly same power-consumption
Smartphone SONY Xperia-A-SO-04E CPU : 4-core Snapdragon
RAM : 2 GB
#1 in Nov. 2013 list
# threads
1 MTEPS/W
Energy=efficient
x150 Faster and
energy efficient
Conclusion • Degree-aware BFS
– Speedup techniques considering the vertex degree – 1) Zero-degree vertex suppression – 2) Separated graph representation – 2.68 times faster than our previous algorithm
• Our BFS achieves fastest of single-node – 37.7 GTEPS for SCALE30 on SGI Altix UV1000 (one rack)
• Investigates affinity and power consumption – “4 sockets x 16 threads”-affinity is the highest MTEPS and MTEPS/W on 4-way Intel Xeon server.
– First position of small data category of 2nd Green graph500 on Android device