Fast & Energy-Efficient Breadth-First Search on a Single NUMA System

Fast & Energy-Efficient Breadth-First Search

on a Single NUMA System

Yuichiro Yasui & Katsuki Fujisawa Kyushu University & JST CREST

Yukinori Sato JAIST & JST CREST

ISC14 (International supercomputing conference 2014) Research Papers 08 ‒ Energy Efficiency, June 26, 2014

Outline 1.  Background

2.  Fast computation of graph processing –  Related work and our previous contributions

3.  Bottlenecks analysis for our previous NUMA-optimized BFS

4.  Our proposal : Degree-aware BFS

5.  Performance evaluation of proposal BFS –  Fast for Graph500 benchmark –  Energy-efficient for Green graph500 benchmark

Background •  Large scale graphs in various fields

–  US Road network : 58 million edges –  Twitter follow-ship : 1.47 billion edges –  Neuronal network : 100 trillion edges

89 billion vertices & 100 trillion edges Neuronal network @ Human Brain Project

Cyber-security Twitter

US road network 24 million vertices & 58 million edges 15 billion log entries / day

Social network

•  Fast and scalable graph processing by using HPC

large

61.6 million vertices & 1.47 billion edges

•  Transportation •  Social network •  Cyber-security •  Bioinformatics

Graph analysis and important kernel BFS •  The cycle of graph analysis for understanding real-networks

•  concurrent search (breadth-first search) •  optimization (single source shortest path) •  edge-oriented (maximal independent set)

graph processing

Understanding

Application field

- SCALE- edgefactor

- SCALE- edgefactor- BFS Time- Traversed edges- TEPS

Input parameters ResultsGraph generation Graph construction

TEPSratio

ValidationBFS

64 Iterations

Relationships - SCALE- edgefactor



TEPSratio

ValidationBFS

64 Iterations

graph

- SCALE- edgefactor



TEPSratio

ValidationBFS

64 Iterations

results

Step1

Step2

Step3

•  One of most important and fundamental processing •  Many algorithms and applications based on exists (Max.-flow and centrality) •  low arithmetic intensity & irregular memory accesses.

Breadth-first search (BFS)

Source

BFS Lv. 3

source Lv. 2 Lv. 1

Outputs：Distance (Lv.) and Predecessor for each vertex from source Inputs：Graph,

and source vertex

Target: NUMA arch. system

RAM RAM

processor core & L2 cache

8-core Xeon E5 4640shared L3 cache

RAM RAM

CPU socket（16 logical cores） + Local RAM

Memory access for Local RAM（Fast）

Memory access for Remote RAM（Slow）

NUMA node

•  Reduces and avoids memory accesses for Remote RAM

•  4-way Intel Xeon E5-4640 (Sandybridge-EP) –  4 (# of CPU sockets) –  8 (# of physical cores per socket) –  2 (# of threads per core)

4 x 8 x 2 = 64 threads

NUMA node

Max.

NUMA-aware computation

Graph500 Benchmark •  Fast computation of graph processing is significant topic in HPC •  Graph500 benchmark measures computer performance using TEPS ratio (# of Traversed edges per second) in graph processing such as BFS (Breath-first search)

SCALE&&&edgefactor&(=16)

Median TEPS

1.  Generation

- SCALE- edgefactor



TEPSratio

ValidationBFS

64 Iterations

- SCALE- edgefactor



TEPSratio

ValidationBFS

64 Iterations

- SCALE- edgefactor



TEPSratio

ValidationBFS

64 Iterations

3.  BFS x 64 2.  Construction

x 64

TEPS ratio

•  Kronecker graph –  synthetic scale-free network which was generated by using Recursive Kronecker product

–  2SCALE vertices and 2SCALE×edgefactor edges –  e.g.) SCALE 30 and edgefactor 16 ⇒ 1 billion vertices and 17.2 billion edges

www.graph500.org

Level-synchronized parallel BFS (Top-down) •  Started from source vertex and executes following two phases for each level

3) BFS iterations (timed).: This step iterates the timedBFS-phase and the untimed verify-phase 64 times. The BFS-phase executes the BFS for each source, and the verify-phaseconfirms the output of the BFS.

This benchmark is based on the TEPS ratio, which iscomputed for a given graph and the BFS output. Submissionagainst this benchmark must report five TEPS ratios: theminimum, first quartile, median, third quartile, and maxi-mum.

III. PARALLEL BFS ALGORITHM

A. Level-synchronized Parallel BFSWe assume that the input of a BFS is a graph G = (V, E)

consisting of a set of vertices V and a set of edges E.The connections of G are contained as pairs (v, w), wherev, w ∈ V . The set of edges E corresponds to a set ofadjacency lists A, where an adjacency list A(v) containsthe outgoing edges (v, w) ∈ E for each vertex v ∈ V . ABFS explores the various edges spanning all other verticesv ∈ V \{s} from the source vertex s ∈ V in a given graphG and outputs the predecessor map π, which is a map fromeach vertex v to its parent. When the predecessor map π(v)points to only one parent for each vertex v, it represents atree with the root vertex s. However, some applications, suchas the betweenness centrality [1], require all of the parentsfor each vertex, which is equivalent to the number of hopsfrom the source. Therefore, the output predecessor map isrepresented as a directed adjacency graph (DAG). In thispaper, we focus on the Graph500 benchmark, and assumethat the BFS output is a predecessor map that is representedby a tree.

Algorithm 1 is a fundamental parallel algorithm for aBFS. This requires the synchronization of each level thatis a certain number of hops away from the source. We callthis the level-synchronized parallel BFS [6]. Each traversalexplores all outgoing edges of the current frontier, which isthe set of vertices discovered at this level, and finds theirneighbors, which is the set of unvisited vertices at the nextlevel. We can describe this algorithm using a frontier queueQF and a neighbor queue QN , because unvisited verticesw are appended to the neighbor queue QN for each frontierqueue vertex v ∈ QF in parallel with the exclusive controlat each level (Algorithm 1, lines 7–12), as follows:

QN ←!w ∈ A(v) | w ̸∈ visited, v ∈ QF

". (1)

B. Hybrid BFS Algorithm of Beamer et al.The main runtime bottleneck of the level-synchronized

parallel BFS (Algorithm 1) is the exploration of all outgoingedges of the current frontier (lines 7–12). Beamer et al. [8],[9] proposed a hybrid BFS algorithm (Algorithm 2) thatreduced the number of edges explored. This algorithmcombines two different traversal kernels: top-down and

Algorithm 1: Level-synchronized Parallel BFS.Input : G = (V, A) : unweighted directed graph.

s : source vertex.Variables: QF : frontier queue.

QN : neighbor queue.visited : vertices already visited.

Output : π(v) : predecessor map of BFS tree.1 π(v)← −1, ∀v ∈ V2 π(s)← s3 visited← {s}4 QF ← {s}5 QN ← ∅6 while QF ̸= ∅ do7 for v ∈ QF in parallel do8 for w ∈ A(v) do9 if w ̸∈ visited atomic then

10 π(w)← v11 visited← visited ∪ {w}12 QN ← QN ∪ {w}

13 QF ← QN

14 QN ← ∅

bottom-up. Like the level-synchronized parallel BFS, top-down kernels traverse neighbors of the frontier. Conversely,bottom-up kernels find the frontier from vertices in candidateneighbors. In other words, a top-down method finds thechildren from the parent, whereas a bottom-up method findsthe parent from the children. For a large frontier, bottom-upapproaches reduce the number of edges explored, becausethis traversal kernel terminates once a single parent is found(Algorithm 2, lines 16–21).

Table III lists the number of edges explored at each levelusing a top-down, bottom-up, and combined hybrid (oracle)approach. For the top-down kernel, the frontier size mF atlow and high levels is much less than that at mid-levels.In the case of the bottom-up method, the frontier size mBis equal to the number of edges m = |E| in a given graphG = (V, E), and decreases as the level increases. Bottom-upkernels estimate all unvisited vertices as candidate neighborsQN , because it is difficult to determine their exact numberprior to traversal, as shown in line 15 of Algorithm 2. Thislazy estimation of candidate neighbors increases the numberof edges traversed for a small frontier. Hence, consideringthe size of the frontier and the number of neighbors, wecombine top-down and bottom-up approaches in a hybrid(oracle). This loop generally only executes once, so thealgorithm is suitable for a small-world graph that has a largefrontier, such as a Kronecker graph or an R-MAT graph.Table III shows that the total number of edges traversed bythe hybrid (oracle) is only 3% of that in the case of thetop-down kernel.

We now explain how to determine a traversal policy(Table II(a)) for the top-down and bottom-up kernels. The

395

Traversal

Swap

Frontier

Neighbor

Level k Level k+1 QF

QN

Swap … swaps the frontier QF and the neighbor QN for next level

Traversal … finds unvisited adjacency vertices from current frontier QF and append to neighbor QN!

Candidates of neighbors

前方探索と後方探索でのデータアクセスの観察• 前方探索でのデータの書込み

v→ w

v

w

Input : Directed graph G = (V, AF ), Queue QF

Data : Queue QN , visited, Tree π(v)

QN ← ∅for v ∈ QF in parallel do

for w ∈ AF (v) doif w ! visited atomic thenπ(w)← vvisited← visited ∪ {w}QN ← QN ∪ {w}

QF ← QN

• 後方探索でのデータの書込み

w→ v

v w

Input : Directed graph G = (V, AB), Queue QF


QN ← ∅for w ∈ V \ visited in parallel do

for v ∈ AB(w) doif v ∈ QF thenπ(w)← vvisited← visited ∪ {w}QN ← QN ∪ {w}break

QF ← QN

• どちらも wに関する変数 π(w)と visitedに書込みを行っている (vは点番号の参照)

6 / 12

Hybrid-BFS (Direction-optimizing BFS) Chooses one from Top-down or Bottom-up for frontier size at each level

Frontier

Neighbors

Level0k

Level0k+1

Frontier Level0k

Level0k+1 neighbors

Top-down algorithm •  Efficient for small-frontier •  Uses out-going edges

Bottom-up algorithm •  Efficient for large-frontier •  Uses in-coming edges

前方探索と後方探索でのデータアクセスの観察• 前方探索でのデータの書込み

v→ w

v

w

Input : Directed graph G = (V, AF ), Queue QF


QN ← ∅for v ∈ QF in parallel do

for w ∈ AF (v) doif w ! visited atomic thenπ(w)← vvisited← visited ∪ {w}QN ← QN ∪ {w}

QF ← QN

• 後方探索でのデータの書込み

w→ v

v w

Input : Directed graph G = (V, AB), Queue QF


QN ← ∅for w ∈ V \ visited in parallel do

for v ∈ AB(w) doif v ∈ QF thenπ(w)← vvisited← visited ∪ {w}QN ← QN ∪ {w}break

QF ← QN

• どちらも wに関する変数 π(w)と visitedに書込みを行っている (vは点番号の参照)

6 / 12

Current frontier

Unvisited neighbors

Current frontier

Beamer2012

Candidates of neighbors

Skips unnecessary edge traversal

Chooses one from Top-down or Bottom-up for a number of traversed edges at each level

Number of traversal edges of Kronecker graph with SCALE 26

Hybrid-BFS reduces unnecessary edge traversals

Beamer2012 Hybrid-BFS (Direction-optimizing BFS)

Top=down 幅優先探索に対する前方探索 (Top-down)と後方探索 (Bottom-up)

Level Top-down Bottom-up Hybrid0 2 2,103,840,895 21 66,206 1,766,587,029 66,2062 346,918,235 52,677,691 52,677,6913 1,727,195,615 12,820,854 12,820,8544 29,557,400 103,184 103,1845 82,357 21,467 21,4676 221 21,240 227

Total 2,103,820,036 3,936,072,360 65,689,631Ratio 100.00% 187.09% 3.12%

6 / 14

Bottom=up&

Top=down

Distance from source |V| = 226, |E| = 230

= |E|

NUMA-optimized BFS •  Clearly separated to accessing for local and remote memory

–  Edge traversal on Local RAM –  All-gather of local queues and bitmaps for Remote RAM

NUMA=optimized&Top=down

NUMA=optimized&Bottom=up

Large&frontier? Aggregates&local&frontier&queues

Yes

No At each level,!

Traversal on local RAM Swap on Remote RAM

QN1

QN0

QN3 QF

QN2

QF QN2

•  Searches&local&neighbors&from&local&copied&frontier&

•  Out-going edges for Top-down •  In-coming edges for Bottom-up

NUMA-opt. requires two CSR graphs

※&Not&same&for&undirected&graph

Local Edge Traversal V0

QF visited0 QN0

A0 V

V1

QF visited1 QN1

V A1

V3

QF visited3 QN3

A3 V

V2

QF visited2 QN2

A2 V

•  partial&edges&(vi),)vj)),&(vi)∈)V)and)vj)∈)Vk)

NUMAを考慮したグラフ領域の分割

• ℓ-個の CPUソケット (ℓ-個の RAM)を持つ計算機を想定• RAMk 上に,部分点集合 Vk と隣接リストの部分集合 Ak に配置

V =!V0 | V1 | · · · | Vℓ−1

", A =

!A0 | A1 | · · · | Aℓ−1

",

• 部分点集合 Vk は単純な 1次元分割 (n: 点数, ℓ: CPUソケット数)

Vk =

#v j ∈ V | j ∈

$knℓ,

(k + 1)nℓ

%&,

• ℓ-個の CPUソケットを持つ計算機上に 2種類の隣接リストを定義AFk (v) = {w ∈ Vk ∩ A(v)} , v ∈ V

A A A AFF

0 1

F2

F3

visited1TREE1

visited0TREE0

visited2TREE2

visited3TREE3

frontier

v

w neighbors

ABk (w) = {v ∈ A(w)} , w ∈ Vk.

A A A ABB

0 1

B2

B3

visited1TREE1

visited0TREE0

visited2TREE2

visited3TREE3

frontier

neighbors

v

w

8 / 13

•  partial&vertices&Vk

RAM RAM



RAM RAM

RAM RAM



RAM RAM

RAM RAM



RAM RAM

0th NUMA node 3th NUMA node

2nd NUMA node 1st NUMA node

k=th&NUMA&node&holds& •  Local&copied&frontier&QF

CPU Affinity and local memory binding •  ULIBC: Ubiquity Library for Intelligently Binding Cores

–  provides some routines for CPU affinity + Local memory binding –  manages each processor core (processor ID) by topology information as a tuple of (SMT ID, core ID, package ID).

All processors Online processors (allocated&to&current&process)

CPU Affinity

1.&Detects&online&processors&&&&&&&using&sched_getaffinity&system&call

NUMA node 0

NUMA node 1

core 0 core 1 core 2

core 3

RAM

RAM

Local RAM

Use&Other&processes

Package0ID0:0index&of&CPU&socket&Core0ID0:0index&of&physical&core&in&each&CPU&socket&SMT0ID0:0index&of&thread&in&each&physical&core&

Processor0ID&index&of&logical&processor&core&

2.&Binds&each&thread&to&logical&core&&&&&&&&&using&sched_setaffinity&system&call&or&&&&&&&&&&&&&&&&&Intel&compiler&Thread&Affinity&Interface

0

5

10

15

20

25

30

35

20 21 22 23 24 25 26 27 28 29

GTE

PS

SCALE

reference codeAgawal2010Beamer2012

Yasui2013Yasui2014

Related work: TEPS ratios on a single node • Our BFS achieves 31.7 GTEPS for Kronecker graph (SCALE27)

Yasui2013

Yasui2014

x 2.2

x 2.6

x 5.9 Agarwal2010

faster

Agarwal2010 NUMA-aware Top-down BFS 4-way Intel Xeon 7560

Beamer2012 Hybrid-BFS 4-way Intel Xeon E5-8870

Yasui2013 NUMA-opt. Hybrid-BFS 4-way Intel Xeon E5-4640

0.8 GTEPS (m/n=64, 1.1GTEPS)

5.1 GTEPS

11.1 GTEPS

Yasui2014 Degree-aware NUMA-opt. BFS 4-way Intel Xeon E5-4650

31.7 GTEPS

This paper

This paper

Reference code 0.1 GTEPS

Visited vertices!Zero-degree!71,140,085!

53.0%

Top-down!283!

0.0%!

Bottom-up 63,035,833!

47.0%

Level Step Hybrid-BFS0 Top-down 22

1 Top-down 239,930

2 Bottom-up 150,006,673

3 Bottom-up 19,742,764

4 Bottom-up 139,817

5 Bottom-up 41,846

6 Top-down 260

Total – 170,171,312

% 4.0 %

Breakdown of hybrid-BFS

•  Most of CPU time taken to Bottom-up step in Hybrid BFS.

•  In particular, Bottom-up step in Level-2 has almost edge traversals. 99.9 %

231 = 2,147,483,648 (100 %)

for Kronecker graph with SCALE27

#Traversed edges

+ + Total vertices!134,217,728!

100.0%!=

•  Most of vertex traversal taken to Bottom-up step in Hybrid BFS. •  A half of number of vertices is unvisited.

Breakdown of vertex traversal

Traversed edges

88.1 %

Unvisited vertices!Isolated!41,527!

0.0% +

=227

( 8 %)

227 vertices and 231 edges

Influence of ordering for adjacency vertices •  Computation complexity of Bottom-up step depends on the ordering of adjacency vertices for each vertex

Number of traversal edges for each ordering

# of traversed edges is strongly affected by each ordering in Lv. 2.

Descending&order

High-degree Low-degree A(v)

Sorted adjacency list A(v)! using out-degree of w!

w

Table 3. Number of traversed edges in a BFS for Kronecker graph with scale 27.

Hybrid algorithmLevel Top-down Bottom-up Step Ascending Randomized Descending

0 22 4,223,250,243 T 22 22 221 239,930 3,258,645,723 T 239,930 239,930 239,9302 1,040,268,126 83,878,899 B 848,743,124 150,006,673 83,878,8993 3,145,608,885 19,616,130 B 19,935,737 19,742,764 19,616,1304 37,007,608 139,606 B 139,868 139,817 139,6065 98,339 41,846 B 41,846 41,846 41,8466 260 41,586 T 260 260 260

Total 4,223,223,170 7,585,614,033 – 869,100,787 170,171,312 103,916,693% 100 % 179.6 % 20.6 % 4.0 % 2.5 %

100

101

102

103

104

105

106

107

108

60001 10 100 1000

Num

ber

offixed

vertices

Loop count ⌧

Ascending

Lv.2

Lv.3

Lv.4

Lv.5

100

101

102

103

104

105

106

107

108

1 2 3 4 5 10 20 30 60

Num

ber

offixed

vertices

Loop count ⌧

Randomized

Lv.2

Lv.3

Lv.4

Lv.5

100

101

102

103

104

105

106

107

108

1 2 3 4 5 10 20 30

Num

ber

offixed

vertices

Loop count ⌧

Descending

Lv.2

Lv.3

Lv.4

Lv.5

Fig. 3. Distribution of the loop count τ of the bottom-up step at each level in a BFSfor Kronecker graph with SCALE27 and edgefactor 16.

Table 4. Breakdown of bottleneck in a BFS of a Kronecker graph with scale 27.

(a) Ascending order

Component 0-degree Bottom-up (τ = 1) Bottom-up (τ ≥ 2) Top-down#vertices 71,140,085 25,489,401 37,331,644 215,070

Ratio 53.00 % 18.99 % 27.81 % 0.16 %

(b) Randomized order


Ratio 53.00 % 28.06 % 18.74 % 0.16 %

(c) Descending order


Ratio 53.00 % 45.05 % 1.76 % 0.16 %

3.2 Degree-aware BFS on Degree-aware graph representation

The reference to the adjacent vertices require many in-directing accesses via anindex array and an adjacency edge array of the compressed sparse row (CSR)format graph. However, we clarified that the major bottleneck of the hybrid BFSalgorithm is the edge traversal of the first adjacent vertex in the bottom-up stepin the previous subsection. Then, we separated the standard CSR graph into



0 22 4,223,250,243 T 22 22 221 239,930 3,258,645,723 T 239,930 239,930 239,9302 1,040,268,126 83,878,899 B 848,743,124 150,006,673 83,878,8993 3,145,608,885 19,616,130 B 19,935,737 19,742,764 19,616,1304 37,007,608 139,606 B 139,868 139,817 139,6065 98,339 41,846 B 41,846 41,846 41,8466 260 41,586 T 260 260 260

Total 4,223,223,170 7,585,614,033 – 869,100,787 170,171,312 103,916,693% 100 % 179.6 % 20.6 % 4.0 % 2.5 %

100

101

102

103

104

105

106

107

108

60001 10 100 1000

Num

ber

offixed

vertices

Loop count ⌧

Ascending

Lv.2

Lv.3

Lv.4

Lv.5

100

101

102

103

104

105

106

107

108

1 2 3 4 5 10 20 30 60

Num

ber

offixed

vertices

Loop count ⌧

Randomized

Lv.2

Lv.3

Lv.4

Lv.5

100

101

102

103

104

105

106

107

108

1 2 3 4 5 10 20 30

Num

ber

offixed

vertices

Loop count ⌧

Descending

Lv.2

Lv.3

Lv.4

Lv.5



(a) Ascending order


Ratio 53.00 % 18.99 % 27.81 % 0.16 %



Ratio 53.00 % 28.06 % 18.74 % 0.16 %



Ratio 53.00 % 45.05 % 1.76 % 0.16 %



Better

Loop0count0τ!

A(va) A(vb)

finds frontier vertex and breaks this loop … …

Bottom=up&

Skipped&adjacency&vertices Traversed&adjacency&vertices

τ=1

Analysis of loop count for each vertices



0 22 4,223,250,243 T 22 22 221 239,930 3,258,645,723 T 239,930 239,930 239,9302 1,040,268,126 83,878,899 B 848,743,124 150,006,673 83,878,8993 3,145,608,885 19,616,130 B 19,935,737 19,742,764 19,616,1304 37,007,608 139,606 B 139,868 139,817 139,6065 98,339 41,846 B 41,846 41,846 41,8466 260 41,586 T 260 260 260

Total 4,223,223,170 7,585,614,033 – 869,100,787 170,171,312 103,916,693% 100 % 179.6 % 20.6 % 4.0 % 2.5 %

100

101

102

103

104

105

106

107

108

60001 10 100 1000

Num

ber

offixed

vertices

Loop count ⌧

Ascending

Lv.2

Lv.3

Lv.4

Lv.5

100

101

102

103

104

105

106

107

108

1 2 3 4 5 10 20 30 60

Num

ber

offixed

vertices

Loop count ⌧

Randomized

Lv.2

Lv.3

Lv.4

Lv.5

100

101

102

103

104

105

106

107

108

1 2 3 4 5 10 20 30

Num

ber

offixed

vertices

Loop count ⌧

Descending

Lv.2

Lv.3

Lv.4

Lv.5



(a) Ascending order


Ratio 53.00 % 18.99 % 27.81 % 0.16 %



Ratio 53.00 % 28.06 % 18.74 % 0.16 %



Ratio 53.00 % 45.05 % 1.76 % 0.16 %



Max: 5,873 Max: 58 Max: 28

19.0% + 27.8%

•  Bottom-up found 46.8 % vertices •  Descending finds most vertices at first loop . Table 3. Number of traversed edges in a BFS for Kronecker graph with scale 27.


0 22 4,223,250,243 T 22 22 221 239,930 3,258,645,723 T 239,930 239,930 239,9302 1,040,268,126 83,878,899 B 848,743,124 150,006,673 83,878,8993 3,145,608,885 19,616,130 B 19,935,737 19,742,764 19,616,1304 37,007,608 139,606 B 139,868 139,817 139,6065 98,339 41,846 B 41,846 41,846 41,8466 260 41,586 T 260 260 260

Total 4,223,223,170 7,585,614,033 – 869,100,787 170,171,312 103,916,693% 100 % 179.6 % 20.6 % 4.0 % 2.5 %

100

101

102

103

104

105

106

107

108

60001 10 100 1000

Num

ber

offixed

vertices

Loop count ⌧

Ascending

Lv.2

Lv.3

Lv.4

Lv.5

100

101

102

103

104

105

106

107

108

1 2 3 4 5 10 20 30 60

Num

ber

offixed

vertices

Loop count ⌧

Randomized

Lv.2

Lv.3

Lv.4

Lv.5

100

101

102

103

104

105

106

107

108

1 2 3 4 5 10 20 30

Num

ber

offixed

vertices

Loop count ⌧

Descending

Lv.2

Lv.3

Lv.4

Lv.5



(a) Ascending order


Ratio 53.00 % 18.99 % 27.81 % 0.16 %



Ratio 53.00 % 28.06 % 18.74 % 0.16 %



Ratio 53.00 % 45.05 % 1.76 % 0.16 %





0 22 4,223,250,243 T 22 22 221 239,930 3,258,645,723 T 239,930 239,930 239,9302 1,040,268,126 83,878,899 B 848,743,124 150,006,673 83,878,8993 3,145,608,885 19,616,130 B 19,935,737 19,742,764 19,616,1304 37,007,608 139,606 B 139,868 139,817 139,6065 98,339 41,846 B 41,846 41,846 41,8466 260 41,586 T 260 260 260

Total 4,223,223,170 7,585,614,033 – 869,100,787 170,171,312 103,916,693% 100 % 179.6 % 20.6 % 4.0 % 2.5 %

100

101

102

103

104

105

106

107

108

60001 10 100 1000

Num

ber

offixed

vertices

Loop count ⌧

Ascending

Lv.2

Lv.3

Lv.4

Lv.5

100

101

102

103

104

105

106

107

108

1 2 3 4 5 10 20 30 60

Num

ber

offixed

vertices

Loop count ⌧

Randomized

Lv.2

Lv.3

Lv.4

Lv.5

100

101

102

103

104

105

106

107

108

1 2 3 4 5 10 20 30

Num

ber

offixed

vertices

Loop count ⌧

Descending

Lv.2

Lv.3

Lv.4

Lv.5



(a) Ascending order


Ratio 53.00 % 18.99 % 27.81 % 0.16 %



Ratio 53.00 % 28.06 % 18.74 % 0.16 %



Ratio 53.00 % 45.05 % 1.76 % 0.16 %



28.1% + 18.7%

45.0%

τ = 1

Better

better

τ ≧ 2

First vertex of adjacency list

τ = 1 τ ≧ 2

τ = 1 τ ≧ 2 Descending order

Ascending Randomized 1.8%

3 features and Degree-aware BFS 1. A half vertices has no adjacency vertices ⇒  Suppression of zero degree vertices using renumbering technique for non-zero degree vertices

2. Computation complexity of Bottom-up depends on the ordering of adjacency vertices for each vertex ⇒  Sorted adjacency list by out-degree in descending

3. Most vertices was found at first loop of Bottom-up ⇒  Separated graph representation; highest-degree adjacency vertex list A+ and remaining CSR graph A-

High%degree Low%degree�

��i� ��i+1�

��i�

n�

m-n�n�

Highest%degree

��A-

High%degree Low%degree�

��i� ��i+1�

��i�

n�

m-n�n�

Highest%degree

��A+ = standard

CSR graph +

Zero-degree opt.

High-degree opt.

Performance improvements Degree-aware BFS is 2.68 faster than NUMA-opt.

0

5

10

15

20

25

30

NUMA-opt. + High-deg + zero-deg Degree-aware

GT

EPS

⇥1.00

⇥1.34

⇥2.03

⇥2.681.34 x 2.03 ≒

This&paper Previous&paper

Intel Xeon (64) v.s. SGI Altix UV1000 (512) for Kronecker graph with SCALE29

–  536.9 million vertices, 8.59 billion edges

Intel Xeon E5-4650 @ 2.70GHz 64-threads = 4-NUMA x 8-cores x 2-threads

0

50

100

150

200

250

300

Init Lv.0 Lv.1 Lv.2 Lv.3 Lv.4 Lv.5 Lv.6 Lv.7

CP

U ti

me

(ms)

Level

Traversal (local)Swap (all-gather)

BottomUp

BottomUp

BottomUpBottomUp 0

50

100

150

200

250

300


CP

U ti

me

(ms)

Level


BottomUp

BottomUp

BottomUpBottomUp

64 threads 400 GB

Intel Xeon E7-8837 @ 2.67GHz 512-threads = 64-NUMA x 8-cores

SGI Altix UV1000 (Westmere-EX arch.) Intel Xeon (Sandybridge-EP arch.)

for&Local&RAM

for&Remote&RAM for&Remote&RAM

for&Local&RAM

31.81 GE/s 21.81 GE/s

512 GB RAM 4.0 TB RAM

512 threads 1.0 TBytes

Intel Xeon (64) v.s. SGI Altix UV1000 (512) for Kronecker graph with SCALE30

–  1.07 billion vertices, 17.18 billion edges

Intel Xeon E5-4650 @ 2.70GHz 64-threads = 4-NUMA x 8-cores x 2-threads

Intel Xeon E7-8837 @ 2.67GHz 512-threads = 64-NUMA x 8-cores

SGI Altix UV1000 (Westmere-EX arch.) Intel Xeon (Sandybridge-EP arch.)

512 GB RAM 4.0 TB RAM

0

50

100

150

200

250

300


CP

U ti

me

(ms)

Level


BottomUpBottomUp

BottomUpBottomUp

512 threads 2.0 TBytes

for&Remote&RAM

for&Local&RAM

37.70 GE/s

Out of memory

Rank.50 Fastest of single-node on nov.2013 list

Strong scaling on SGI Altix UV1000

0

100

200

300

400

500

600


CP

U ti

me

(ms)

Level


BottomUp

BottomUp

BottomUpBottomUpBottomUp 0

100

200

300

400

500

600


CP

U ti

me

(ms)

Level


BottomUpBottomUp

BottomUpBottomUpBottomUp 0

100

200

300

400

500

600


CP

U ti

me

(ms)

Level


BottomUpBottomUp

BottomUpBottomUp

512 threads (one-rack)

37.70 GE/s 26.17 GE/s

256 threads 128 threads

18.76 GE/s

Local Local Local Remote Remote Remote

L : R = 67% : 33% L : R = 80% : 20% L : R = 57% : 43%

197 ms 218 ms 182 ms

258 ms 438 ms 733 ms

•  As the number of threads increases, –  Improves the CPU time for Local memory access –  Keeps the CPU time for Remote memory access

for Kronecker graph with SCALE30 –  1.07 billion vertices, 17.18 billion edges

Rank.50 Fastest of single-node on nov.2013 list

BFS Performances for Real networks •  Suitable for small-world networks

–  efficient for a low-diameter and a large-edgefactor

Twitter follow-ship network in 2009 61.6 million vertices & 1.47 billion edges

10.90 GTEPS (max. 24.09 GTEPS)

US road network 24 million vertices & 58 million edges 0.09 GTEPS (max. 0.11 GTEPS)

Small=world

Non&small=world

4.4 BFS Performance on Real-world Networks

We verify our BFS performance by using real-world networks on a Sandybridge-EP system. Table 9 shows the graph sizes, graph properties, and GTEPS usingour BFS for each network instance. Here, diam′

G indicates the approximation ofthe diameter on each network using the maximum hops for each vertex of 64 BFSiterations. USA-road-d and wiki-Talk have approximately the same edgefactor,but the latter shows 4–11 times faster than the former owing to its diameterbeing a thousand times smaller relatively. On the other hand, LiveJournal andtwitter have approximately the same diameter, but the latter shows 3–5 timesfaster than the former owing to its edgefactor being 1.68 times larger relatively.In addition, twitter and friendster show similar BFS performances of approxi-mately 10 GTEPS because they have similar edgefactor and similar diameters.Therefore, we verify whether our BFS is affected by using both the edgefactorand diameter of the network. From these numerical results, we could achievehigh performance for large-scale small-world networks with a large edgefactor.

Table 9. BFS performance of real-world network on Sandybridge-EP system.

Graph size edgefactor Diameter GTEPS

Instance n m m/n diam′G min 1/4 median 3/4 max

wiki-Talk [23, 24] 2.39 M 5.02 M 2.1 8 0.29 0.61 0.75 0.87 1.26USA-road-d [25] 23.95 M 58.33 M 2.4 8,098 0.07 0.08 0.09 0.09 0.11LiveJournal [26, 27] 4.85 M 68.99 M 14.2 16 2.76 3.76 4.07 4.32 4.94twitter [28] 61.58 M 1,468.37 M 23.8 16 7.58 10.02 10.90 12.68 24.09friendster [29] 65.61 M 1,806.07 M 27.5 25 4.89 9.61 10.74 11.29 11.81

5 Energy Efficiency of Our BFS

Thus far, we have discussed the BFS performance in terms of only the speed.This section discusses the BFS performance of our implementation in terms ofenergy efficiency. Our speedup techiques are aimed at not only fast computationbut also energy efficiency, as described in [4].

Fig. 6 shows the performance (GTEPS) and energy-efficiency (MTEPS/W) ofour Degree-aware BFS, our NUMA-optimized BFS, and the Graph500 referencecode for each CPU affinity on the SandyBridge-EP. If the number of threads isthe same, our Degree-aware BFS achieves a high TEPS and a high TEPS/Wwith a larger number of sockets. On the other hand, if the number of socketsis the same, our BFS achieves a high TEPS and a high TEPS/W with a largernumber of threads.

As an energy-efficient machine environment, we use Android-based devices,which have highly energy-efficient ARM or Snapdragon processors. We developa native binary code based on the C/C++ programming language by usingthe Android native development kit (NDK). The Android NDK supports mul-tithreaded computation using the OpenMP library and the atomic functions of

The Green Graph500 list in Nov. 2013

•  Measures power-efficient using TEPS/W ratio •  Our results on various systems such as Xeon servers and Android devices

http://green.graph500.org

Median TEPS

1. Generation - SCALE- edgefactor



TEPSratio

ValidationBFS

64 Iterations

- SCALE- edgefactor



TEPSratio

ValidationBFS

64 Iterations

- SCALE- edgefactor



TEPSratio

ValidationBFS

64 Iterations

3.  BFS phase

2. Construction x 64

TEPS ratio

Watt TEPS/W

Power measurement Green Graph500

Graph500

Measuring power consumption during the BFS phase

TEPS and TEPS/W on 4-way Xeon

16 8 8 4 4

4 4

7.92 GTEPS 364.9 W

21.71 MTEPS/W

11.83 GTEPS 452.6 W

26.13 MTEPS/W

13.96 GTEPS 517.8 W

26.96 MTEPS/W

Fast

16 16

16 16

29.03 GTEPS 639.1 W

45.43 MTEPS/W

8 8

8 8

22.03 GTEPS 586.7 W

37.55 MTEPS/W

Energy efficient

#NUMA nodes = 4 #threads = 16

0

5

10

15

20

25

30

1⇥1

(1)

4⇥1

(4)

4⇥2

(8)

1⇥16

(16)

2⇥8

(16)

4⇥4

(16)

2⇥16

(32)

4⇥8

(32)

4⇥16

(64)

w/o

(64)

4⇥16

(64)

GT

EPS

`⇥ t CPU Affinity (Number of threads)

Degree-aware (GTEPS)

Reference (GTEPS)

NUMA-opt. (GTEPS)

NUMA-opt.Ref.Degree-aware

0

100

200

300

400

500

10 11 12 13 14 15 16 17 18 19 20

GT

EPS

SCALE

Reference (p = 4)Degree-aware BFS (p = 4)

Fig. 7. MTEPS of reference BFS and Degree-aware BFS on XperiaA SO-04E.

Table 10. Energy efficiency of BFS for Kronecker graph with on XperiaA SO-04E.

Implementation SCALE MTEPS watt MTEPS/WReference (p = 1) 20 3.25 3.15 1.03Reference (p = 4) 20 4.58 3.22 1.42

Degree-aware (p = 1) 20 136.29 3.23 42.25Degree-aware (p = 2) 20 248.08 2.99 82.92Degree-aware (p = 4) 20 477.63 3.12 153.17

new implementation achieved a BFS search performance of 31.65 GTEPS on37.66 GTEPS on an SGI Altix UV1000 and a 4-way Intel Xeon E5-4650 system,which were ranked as the 50th (fastest on single node) and 51st (fastest on singleserver) best performances on the Graph500 list in November 2013, respectively.In addition, our BFS performance on a Sony Xperia-A-SO-04E and ASUS Nexus7 (2013 version) was ranked as the first and second best performances in termsof energy efficiency on the Green Graph500 list in November 2013, respectively.Finally, further studies will be required to clarify the relationship between thetheoretical complexity of our BFS and the small-world property.

Acknowledgments. We appreciate the valuable comments and suggestions fromthe anonymous reviewers. This research was supported by the Core Research for Evo-lutional Science and Technology (CREST) program of the Japan Science and Tech-nology Agency (JST) and by the JAIST Research Center for Advanced ComputingInfrastructure.

References

1. Edmonds, J., Karp, R. M.: Theoretical improvements in algorithmic efficiency fornetwork flow problems, Journal of the ACM 19 (2), 248–64, (1972).

2. Cormen, T., Leiserson, C., and Rivest, R.: Introduction to Algorithms. MIT Press,Cambridge MA, 1990.

3. Brandes, U.: A Faster Algorithm for Betweenness Centrality. J. Math. Sociol., 25(2),163–177 (2001).

Green Graph500 on Xperia-A-SO-04E Manage both fast and energy-efficient

cution, suggesting that the effective power is not strongly affected by the numberof threads and the algorithm used. With regard to energy-efficient computation,our BFS is around 100 times faster than the reference code for roughly thesame effective power of 3.0 W; specifically, our BFS shows an energy-efficientperformance of 153.17 MTESP/W.

Table 10. Energy efficiency of BFS for Kronecker graph with on XperiaA SO-04E.

Implementation SCALE MTEPS watt MTEPS/WReference (p = 1) 20 3.25 3.15 1.03Reference (p = 4) 20 4.58 3.22 1.42

This study (p = 1) 20 136.29 3.23 42.25This study (p = 2) 20 248.08 2.99 82.92This study (p = 4) 20 477.63 3.12 153.17

Table 11 summarizes the energy efficiency of our BFS on various devices. TheAndroid-based Sony Xperia-A-SO-04E and ASUS Nexus 7 (2013 version) showshigh MTEPS/W values. In comparison, the Sandybridge-EP system shows highTEPS but only moderate energy efficiency. Finally, the Altix UV1000 system isfaster than the Sandybridge-EP system but requires more effective power.

Table 11. Energy efficiency of BFS for a Kronecker graph.

Machine spec. SCALE GTEPS MTEPS/WSony Xperia-A-SO-04E 20 0.478 153.17ASUS Nexus 7 (2013 version) 20 0.534 129.63Sandybridge-EP 27 29.034 45.43Sandybridge-EP (2.7GHz) 27 31.648 41.01Altix UV1000 (one-rack) 30 37.659 1.89

6 Conclusion

In this study, we investigate the major bottleneck in our previous NUMA-optimized BFS algorithm and apply degree-aware speedup techniques to it. Ournew implementation achieved a BFS search performance of 31.65 GTEPS on37.66 GTEPS on an SGI Altix UV1000 and a 4-way Intel Xeon E5-4650 system,which were ranked as the 50th (fastest on single node) and 51st (fastest on singleserver) best performances on the Graph500 list in November 2013, respectively.In addition, our BFS performance on a Sony Xperia-A-SO-04E and ASUS Nexus7 (2013 version) was ranked as the first and second best performances in termsof energy efficiency on the Green Graph500 list in November 2013, respectively.

Acknowledgments. This research was supported by the Core Research forEvolutional Science and Technology (CREST) program of the Japan Scienceand Technology Agency (JST) and by the JAIST Research Center for AdvancedComputing Infrastructure.

153.17 MTEPS/W (477.64 MTEPS)

Roughly same power-consumption

Smartphone SONY Xperia-A-SO-04E CPU : 4-core Snapdragon

RAM : 2 GB

#1 in Nov. 2013 list

# threads

1 MTEPS/W

Energy=efficient

x150 Faster and

energy efficient

Conclusion •  Degree-aware BFS

–  Speedup techniques considering the vertex degree –  1) Zero-degree vertex suppression –  2) Separated graph representation –  2.68 times faster than our previous algorithm

•  Our BFS achieves fastest of single-node –  37.7 GTEPS for SCALE30 on SGI Altix UV1000 (one rack)

•  Investigates affinity and power consumption –  “4 sockets x 16 threads”-affinity is the highest MTEPS and MTEPS/W on 4-way Intel Xeon server.

–  First position of small data category of 2nd Green graph500 on Android device