NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System


Yuichiro Yasui*1, Katsuki Fujisawa*1, Kazushige Goto*2

*1 Chuo University & JST CREST, *2 Intel Corporation

Outline

1.  Background
2.  Breadth-first Search (BFS)
3.  Proposal: NUMA-optimized parallel BFS
4.  Numerical Results
5.  Conclusion

Background
•  Large-scale graphs in various fields
   –  US road network: 24 million vertices & 58 million edges
   –  Twitter follow-ship network (social network): 61.6 million vertices & 1.47 billion edges
   –  Neuronal network: 89 billion vertices & 100 trillion edges
   –  Cyber-security: 15 billion log entries / day
•  Application fields
   –  Transportation
   –  Social network
   –  Cyber-security
   –  Bioinformatics
•  Fast and scalable graph processing by using HPC

Importance of graph processing
•  BFS is an important and fundamental graph processing kernel
   –  Obtains the distance (hop) relationships when used stand-alone
   –  Is a building block of many algorithms (BC, max. flow, max. independent set)
•  Types of graph processing
   –  concurrent search (breadth-first search)
   –  optimization (single-source shortest path)
   –  edge-oriented (maximal independent set)
•  Problems for fast and scalable computation of BFS
   –  low arithmetic intensity and irregular memory accesses

[Figure: workflow in three steps (Step 1, Step 2, Step 3) with the labels application field, relationships, graph, graph processing (breadth-first search), results, understanding]

Graph500 Benchmark
•  Measures computer performance using the TEPS ratio in graph processing such as BFS (breadth-first search)
•  TEPS ratio = # of traversed edges per second
•  Input parameters: SCALE and edgefactor (= 16)
•  Kronecker graph
   –  2^SCALE vertices and 2^(SCALE+4) edges
   –  synthetic scale-free network
•  Benchmark flow
   1.  Generation
   2.  Construction
   3.  BFS and validation (x 64 iterations)
•  Result: median TEPS over the 64 TEPS ratios

[Figure: Graph500 benchmark flow. Input parameters (SCALE, edgefactor), graph generation, graph construction, BFS and validation (64 iterations), results (SCALE, edgefactor, BFS time, traversed edges, TEPS)]

http://www.graph500.org
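The problem size and the reported metric follow directly from the definitions on this slide. The sketch below is only an illustration of that arithmetic; the function names are hypothetical and are not part of the Graph500 reference code.

```python
import statistics

def kronecker_size(scale, edgefactor=16):
    """Vertex and edge counts of a Kronecker graph as defined on the slide."""
    n_vertices = 2 ** scale
    n_edges = edgefactor * n_vertices  # equals 2^(SCALE+4) when edgefactor = 16
    return n_vertices, n_edges

def teps(traversed_edges, bfs_seconds):
    """TEPS ratio of a single BFS run: traversed edges per second."""
    return traversed_edges / bfs_seconds

def median_teps(runs):
    """Median TEPS over a list of (traversed_edges, bfs_seconds) pairs."""
    return statistics.median(teps(m, t) for m, t in runs)
```

For SCALE 26 this gives about 67 million vertices and about 1.07 billion edges, which matches the problem size quoted later in the numerical results.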

Contribution
•  Efficient hybrid algorithm of BFS [Beamer2011, 2012]
   –  reduces unnecessary edge traversal
   –  Hybrid BFS: 5.1 GTEPS
•  Our proposal: NUMA-optimized hybrid algorithm
   –  Improves locality of memory access
   –  Library for considering NUMA carefully
   –  Column-wise graph partitioning
•  Results on 4-way Intel Xeon E5 (64 CPU cores)
   –  Scalable: scales well up to 64 threads
   –  Fast: 11.15 GTEPS, a 2.2x speedup compared with the original hybrid algorithm

Outline

1.  Background
2.  Breadth-first Search (BFS)
3.  NUMA architecture
4.  Proposal: NUMA-optimized parallel BFS
5.  Numerical Results

Breadth-first Search (BFS)
•  Obtains the level of each vertex from the source vertex
•  Level = number of hops away from the source

Input: graph G and source
Output: tree with the source as root

[Figure: BFS from the source; the vertices are grouped into Level 1, Level 2, and Level 3]
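As a point of reference, the input/output relation above can be written as a plain sequential BFS. This is a minimal sketch, assuming the graph is given as a dictionary of adjacency lists; the names are illustrative only.

```python
from collections import deque

def bfs_levels(adj, source):
    """Return (level, parent): hop count from the source and the BFS tree."""
    level = {source: 0}
    parent = {source: source}  # the root is its own parent
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in level:  # first visit fixes level and tree edge
                level[w] = level[v] + 1
                parent[w] = v
                queue.append(w)
    return level, parent
```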

Hybrid BFS for low-diameter graphs
•  Efficient for low-diameter graphs
   –  scale-free and/or small-world property, such as social networks
•  Used by the higher-ranked entries in the Graph500 benchmark
•  Hybrid algorithm [Beamer2011, 2012]
   –  combines the top-down algorithm and the bottom-up algorithm
   –  reduces unnecessary edge traversal
   –  Top-down algorithm: efficient for a small frontier (frontier < neighbors)
   –  Bottom-up algorithm: efficient for a large frontier (frontier > neighbors)
   –  switches between the two algorithms during the search

[Figure: top-down expands from the frontier (Level k) to its neighbors (Level k+1); bottom-up searches from the neighbors (Level k+1) back to the frontier (Level k)]

Top-down algorithm
•  Explores outgoing edges of the frontier queue QF
•  Appends unvisited vertices into the neighbor queue QN
•  Efficient for a small frontier
•  Has unnecessary edge traversals for a large frontier

[Figure: top-down steps from the source (Level 0) through Level 1, Level 2, and Level 3, showing QF, QN, and the unnecessary edge traversals]
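One level of the top-down direction can be sketched as follows. This is an illustrative sequential version, not the authors' parallel code; it also counts the scanned edges so that the unnecessary traversals (edges that lead to already visited vertices) become visible.

```python
def top_down_step(adj, frontier, visited):
    """One top-down level: scan the outgoing edges of every frontier vertex.

    Returns the neighbor queue QN and the number of edges scanned.
    `visited` is a set that is updated in place.
    """
    neighbors = []
    scanned = 0
    for v in frontier:
        for w in adj[v]:
            scanned += 1
            if w not in visited:  # otherwise the edge scan was unnecessary
                visited.add(w)
                neighbors.append(w)
    return neighbors, scanned
```

On a triangle graph, the second level scans four edges and discovers nothing, which is the kind of waste the slide attributes to a large frontier.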

Bottom-up algorithm
•  Explores the frontier queue QF from unvisited vertices
•  Appends adjacent vertices into the neighbor queue QN
•  Efficient for a large frontier
•  Has unnecessary edge traversals for a small frontier

[Figure: bottom-up steps from the source through Level 1, Level 2, and Level 3, showing the unvisited vertices, QF, and QN]
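The bottom-up direction can be sketched in the same style. The version below is a simplified sequential illustration that assumes an undirected graph (so the adjacency list of a vertex also lists its incoming edges); each unvisited vertex stops scanning as soon as it finds one neighbor in the frontier.

```python
def bottom_up_step(adj, frontier, visited):
    """One bottom-up level: every unvisited vertex looks for a frontier neighbor.

    Returns the neighbor queue QN and the number of edges scanned.
    `visited` is a set that is updated in place.
    """
    in_frontier = set(frontier)
    neighbors = []
    scanned = 0
    for w in adj:
        if w in visited:
            continue
        for v in adj[w]:
            scanned += 1
            if v in in_frontier:  # found a parent, stop scanning this vertex
                neighbors.append(w)
                break
    visited.update(neighbors)
    return neighbors, scanned
```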

Hybrid BFS combines top-down and bottom-up

[Figure: the top-down algorithm and the bottom-up algorithm between the frontier (Level k) and the neighbors (Level k+1), with a switch between the two]

Hybrid algorithm of Beamer et al. [1]

•  Two different traversal kernels: top-down and bottom-up.
•  Top-down
   –  traverses the neighbors of the frontier.
   –  performance depends on the frontier size.
•  Bottom-up
   –  finds the frontier from the vertices in the candidate neighbors (all unvisited vertices).
   –  performance depends on the number of unvisited vertices.
•  This lazy estimation of the candidate neighbors increases the number of edges traversed.

Traversed edges of a Kronecker graph (SCALE 26)

Level    Top-down (mF)    Bottom-up (mB)    Hybrid (min(mF, mB))
0                    2     2,103,840,895                       2
1               66,206     1,766,587,029                  66,206
2          346,918,235        52,677,691              52,677,691
3        1,727,195,615        12,820,854              12,820,854
4           29,557,400           103,184                 103,184
5               82,357            21,467                  21,467
6                  221            21,240                     227
Total    2,103,820,036     3,936,072,360              65,689,631
Ratio          100.00%           187.09%                   3.12%

Fig: Top-down for small frontier
Fig: Bottom-up for large frontier

[1] S. Beamer et al.: Direction-optimizing breadth-first search, SC'12, 2012.
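The hybrid column of the table is close to the per-level minimum of mF and mB. The sketch below uses exactly that simplified rule to choose a direction at each level; it assumes an undirected graph stored as adjacency lists. Note that this is an illustration only: the published direction-optimizing algorithm switches with tuned thresholds rather than an exact comparison, and all names here are hypothetical.

```python
def hybrid_bfs(adj, source):
    """Level-synchronous BFS that picks the cheaper direction at each level."""
    level = {source: 0}
    frontier = [source]
    directions = []
    depth = 0
    while frontier:
        m_f = sum(len(adj[v]) for v in frontier)               # edges from the frontier
        m_b = sum(len(adj[w]) for w in adj if w not in level)  # edges from unvisited vertices
        next_frontier = []
        if m_f <= m_b:
            directions.append("top-down")
            for v in frontier:
                for w in adj[v]:
                    if w not in level:
                        level[w] = depth + 1
                        next_frontier.append(w)
        else:
            directions.append("bottom-up")
            in_frontier = set(frontier)
            for w in adj:
                if w in level:
                    continue
                for v in adj[w]:
                    if v in in_frontier:
                        level[w] = depth + 1
                        next_frontier.append(w)
                        break
        frontier = next_frontier
        depth += 1
    return level, directions
```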

Outline

1.  Background
2.  Breadth-first Search (BFS)
3.  Proposal: NUMA-optimized parallel BFS
4.  Numerical Results
5.  Conclusion

How to speed up the hybrid algorithm?
•  NUMA architecture
   –  Non-uniform memory access
   –  Each CPU socket has a local RAM
   –  Fast local RAM and slow non-local RAM
•  Frequent non-local memory accesses on a NUMA architecture
   –  The graph G and the working data (QF, QN, visited flags) are spread across the local memories

[Figure: 4-socket Intel Xeon E5 system. Each 8-core Xeon E5-4640 socket (processor cores with L1/L2 cache, shared L3 cache) has its own RAM; the sockets are linked by an interconnect, so a memory access is either local or non-local]

Difficulty of considering NUMA architecture
1.  How to distribute the graph and data to each local RAM?
    –  G = G0 ∪ G1 ∪ G2 ∪ G3
2.  How to bind each partial graph and its data to each NUMA unit?

[Figure: the graph G is split into partial graphs G0 to G3 (with data B0 to B3), and each part is assigned to one of NUMA units 0 to 3, each consisting of a CPU (CPU0 to CPU3) and its RAM (RAM0 to RAM3)]

ULIBC: Ubiquity Library for Intelligently Binding Cores

1.  NUMACTL (command line tool, library for C/C++)
2.  Intel compiler Thread Affinity Interface (API)
    → CPU affinity + local memory binding
3.  ULIBC (our library, library for C/C++)
    → CPU affinity + local memory binding + processor topology
    –  Processor ID: index of logical processor core
    –  Package ID: index of CPU socket
    –  Core ID: index of physical core in each CPU socket

•  Processor topology for each CPU core is read from the Linux device files (/sys/devices/system/*)
•  At a parallel region, each thread ID is mapped to a processor ID, package ID, and core ID
   –  sched_setaffinity system call
   –  mbind system call
•  Supports scatter and compact policies
   –  round-robin on CPU sockets
•  With ULIBC it is possible to manage NUMA carefully
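The mapping from a thread ID to a (package ID, core ID) pair can be illustrated with a small placement function. This is not ULIBC's API: the function name, its parameters, and the exact definitions of the two policies (scatter as round-robin over sockets, compact as filling one socket before the next) are assumptions made for this sketch.

```python
def place_threads(n_threads, n_packages, cores_per_package, policy="scatter"):
    """Return a list of (thread_id, package_id, core_id) placements."""
    placement = []
    for tid in range(n_threads):
        if policy == "scatter":
            # Assumed meaning: consecutive threads go to different sockets.
            package = tid % n_packages
            core = (tid // n_packages) % cores_per_package
        elif policy == "compact":
            # Assumed meaning: fill all cores of a socket before the next one.
            package = (tid // cores_per_package) % n_packages
            core = tid % cores_per_package
        else:
            raise ValueError("unknown policy: " + policy)
        placement.append((tid, package, core))
    return placement
```

With four threads on a 4-socket machine, the scatter policy places one thread on each socket, while the compact policy keeps all four on socket 0. In a real program the resulting placement would then be applied through the affinity and memory-binding system calls named on the slide.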

NUMA-opt.: Column-wise Graph Partitioning
•  Divides G = (V, A) into partial graphs Gk = (Vk, Ak) and binds each to local RAM k
   –  Ak is a set of adjacency lists that holds the incoming edges to Vk
•  Row-wise graph partitioning: O(m) mostly non-local memory accesses
•  Column-wise graph partitioning: O(m) local memory accesses only

[Figure: adjacency matrix with rows i and columns j, split into A0 to A3 row-wise and column-wise; for each partitioning the frontier (Level k), the neighbors (Level k+1), and the vertex set Vk are marked]
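The partitioning rule (Ak holds the incoming edges to Vk) can be sketched as follows. The block distribution of vertex IDs over the NUMA units and all names are assumptions of this illustration, not details given on the slide.

```python
def partition_column_wise(n_vertices, edges, n_units):
    """Split directed edges (i, j) so that unit k stores the incoming edges to V_k.

    V_k is assumed to be a contiguous block of vertex IDs.
    Returns (parts, block) where parts[k][j] lists the sources i of edges i -> j.
    """
    block = (n_vertices + n_units - 1) // n_units  # vertices per unit, rounded up
    parts = [dict() for _ in range(n_units)]
    for i, j in edges:
        k = j // block  # the owner of the destination vertex decides the partition
        parts[k].setdefault(j, []).append(i)
    return parts, block
```

Because every update of a vertex in Vk (visited flag, queue entry) is triggered by an edge stored in Ak, the writes of unit k stay inside its own block of vertices.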

NUMA-optimized Top-down
•  Explores outgoing edges of the frontier queue QF
•  Appends unvisited vertices into the neighbor queue QN
•  Efficient for a small frontier
•  Has unnecessary edge traversals for a large frontier

[Figure: top-down steps from the source (Level 0) through Level 1, Level 2, and Level 3, showing the frontier QF and the neighbors QN]

Details of NUMA-optimized Top-down
•  Explores outgoing edges Ak of the frontier queue QFk
•  Appends unvisited vertices into the neighbor queue QNk

[Figure: NUMA units 0 to 3 each hold local queues QFk and QNk (QF0/QN0, QF1/QN1, QF2/QN2, ...); an all-gather forms the frontier QF of the next level (Level 2) from the local neighbor queues]
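A sequential simulation of this step may help to see where the all-gather fits. In the sketch below, each loop iteration stands for one NUMA unit; `out_parts[k][v]` is assumed to list only those targets of v that belong to Vk, so unit k touches only its own visited flags and its own queue QNk. The data layout and names are assumptions of this illustration.

```python
def numa_top_down_step(out_parts, frontier, visited_parts):
    """One partitioned top-down level followed by an all-gather.

    out_parts[k][v]  : targets w in V_k of the edges v -> w (local to unit k)
    visited_parts[k] : visited set for V_k, updated in place
    Returns (next_frontier, local_queues).
    """
    local_queues = []
    for k, a_k in enumerate(out_parts):  # conceptually runs in parallel per unit
        qn_k = []
        for v in frontier:
            for w in a_k.get(v, []):
                if w not in visited_parts[k]:
                    visited_parts[k].add(w)
                    qn_k.append(w)
        local_queues.append(qn_k)
    # All-gather: every unit receives the concatenation of all local queues.
    next_frontier = [w for qn_k in local_queues for w in qn_k]
    return next_frontier, local_queues
```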

NUMA-optimized Bottom-up
•  Explores the frontier queue QF from unvisited vertices
•  Appends adjacent vertices into the neighbor queue QN
•  Efficient for a large frontier
•  Has unnecessary edge traversals for a small frontier

[Figure: bottom-up steps from the source through Level 1, Level 2, and Level 3, showing the frontier QF and the neighbors QN]

Details of NUMA-optimized Bottom-up
•  Explores the frontier queue QFk from unvisited vertices
•  Appends adjacent vertices into the neighbor queue QNk

[Figure: NUMA units 0 to 3 each search from their Level 1 candidates to the Level 2 frontier with local queues QFk and QNk (QF0/QN0 to QF3/QN3); an all-gather forms the frontier QF of the next level]
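The bottom-up counterpart can be simulated in the same way. Here `in_parts[k][w]` is assumed to list the sources of the incoming edges of each vertex w in Vk, so unit k scans only its own unvisited vertices and only reads the shared frontier. As before, this is a sequential illustration with hypothetical names, not the authors' implementation.

```python
def numa_bottom_up_step(in_parts, frontier, visited_parts):
    """One partitioned bottom-up level followed by an all-gather.

    in_parts[k][w]   : sources v of the edges v -> w for every w in V_k
    visited_parts[k] : visited set for V_k, updated in place
    Returns (next_frontier, local_queues).
    """
    in_frontier = set(frontier)  # read-only data shared by all units
    local_queues = []
    for k, a_k in enumerate(in_parts):  # conceptually runs in parallel per unit
        qn_k = []
        for w, sources in a_k.items():
            if w in visited_parts[k]:
                continue
            if any(v in in_frontier for v in sources):  # stops at the first hit
                qn_k.append(w)
        visited_parts[k].update(qn_k)
        local_queues.append(qn_k)
    # All-gather: every unit receives the concatenation of all local queues.
    next_frontier = [w for qn_k in local_queues for w in qn_k]
    return next_frontier, local_queues
```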

Outline

1.  Background
2.  Breadth-first Search (BFS)
3.  NUMA-optimized parallel BFS
4.  Numerical Results
5.  Conclusion

Machine specification
•  4-way Intel Xeon E5 (8-core Xeon E5-4640 per socket)
   –  CentOS 6.4 (Kernel 2.6.32)
   –  GCC 4.4.7
   –  64 logical CPU cores
   –  4 NUMA units x 16 logical cores
•  4-way AMD Opteron 6174 (12 cores per socket)
   –  Fedora 19 (Kernel 3.11.2)
   –  GCC 4.8.1
   –  48 CPU cores
   –  8 NUMA units x 6 cores

[Figure: block diagrams of both systems: sockets with shared L3 cache, local RAM per NUMA unit, and the interconnect between sockets]

TEPS ratio varied with problem size
•  Ours achieves 11.15 GTEPS for a Kronecker graph of SCALE 26 (67 million vertices and 1 billion edges)
•  Ours is 2.2x faster than the original hybrid algorithm
   –  Hybrid [Beamer2011, 2012]: 5.1 GTEPS peak performance on a 4-way Intel Xeon E7-8870 (Westmere-EX arch.)
   –  Hybrid + NUMA (this paper): 11.15 GTEPS on a 4-way Intel Xeon E5-4640 (SandyBridge-EP arch.)

[Figure: GTEPS (0 to 14) versus SCALE (20 to 29) for Hybrid and Hybrid + NUMA; higher is better]

Strong scaling on Intel/AMD systems
•  Scales well up to as many threads as there are cores
•  4-way Intel Xeon: 11.15 GTEPS
•  4-way AMD Opteron: 6.17 GTEPS

[Figure: speedup versus number of threads (1, 2, 4, 8, 12, 16, 24, 32, 48, 64) for ideal, 4-way SandyBridge-EP, and 4-way MagnyCours; annotations: "40 threads: x40" and "64 threads: x28"]

Twitter network
•  Follow-ship network 2009: an (i, j)-edge connects User i and User j
•  41 million vertices and 1.47 billion edges
•  Our NUMA-optimized BFS on the 4-way Xeon system: 180 ms / BFS → 8.1 GTEPS
•  Six degrees of separation

Frontier size in BFS with source as User 21,804,357

Lv       Frontier size    Freq. (%)    Cum. Freq. (%)
0                    1         0.00              0.00
1                    7         0.00              0.00
2                6,188         0.01              0.01
3              510,515         1.23              1.24
4           29,526,508        70.89             72.13
5           11,314,238        27.16             99.29
6              282,456         0.68             99.97
7               11,536         0.03            100.00
8                  673         0.00            100.00
9                   68         0.00            100.00
10                  19         0.00            100.00
11                  10         0.00            100.00
12                   5         0.00            100.00
13                   2         0.00            100.00
14                   2         0.00            100.00
15                   2         0.00            100.00
Total       41,652,230       100.00                 -
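The numbers on this slide can be cross-checked with a few lines of arithmetic. The sketch assumes that roughly all 1.47 billion edges count as traversed in one BFS, which is an assumption of this example rather than a statement from the slide.

```python
# Frontier sizes per level, copied from the table above.
frontier_sizes = [1, 7, 6188, 510515, 29526508, 11314238, 282456,
                  11536, 673, 68, 19, 10, 5, 2, 2, 2]

total = sum(frontier_sizes)                              # vertices reached by the BFS
freq = [100.0 * s / total for s in frontier_sizes]       # share of each level in percent

def gteps(traversed_edges, seconds):
    """Traversed edges per second, in units of 10^9."""
    return traversed_edges / seconds / 1e9

# Assumption: about 1.47 billion edges traversed in 180 ms.
rate = gteps(1.47e9, 0.180)
```

Under that assumption the rate comes out between 8.1 and 8.2 GTEPS, in line with the 8.1 GTEPS quoted above, and levels 4 and 5 together account for about 98% of the reached vertices.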

Graph500 benchmark
•  Fastest single-node system on the 4th list (June 2012)
•  Fastest CPU-based single-node system on the 6th list (June 2013)

[Figure: excerpts of the Graph500 lists with our entries marked "ours"; labels: 4-way Intel Xeon Westmere-EX, 4-way Intel Xeon SandyBridge-EP, Convey 4 FPGA + 2 CPU, Rank 26, Rank 57, 8.2 GTEPS, 11.1 GTEPS]

http://www.graph500.org

1st Green Graph500 list (June 2013)
•  Measures power efficiency using the TEPS/W ratio
•  Results on various systems such as Android, Linux, and Mac
•  Small Data category (ours)
   –  Rank 1: ASUS tablet TF700T, NVIDIA Tegra3 (4-core), Android NDK: 64.1 MTEPS/W (150 MTEPS)
   –  Rank 2: Intel NUC (Linux): 53.8 MTEPS/W (1.1 GTEPS)
   –  Rank 3: Mac mini: 53.5 MTEPS/W (1.9 GTEPS)
•  NVIDIA Tegra3 and Intel/AMD architectures with the same source code

http://green.graph500.org

Conclusion
•  NUMA-optimized hybrid BFS algorithm
   –  Reduces unnecessary edge traversals and remote RAM accesses by carefully considering NUMA
•  Numerical results on 4-way Intel Xeon
   –  scales well up to 64 threads (scalable)
   –  achieves 11.15 GTEPS (fast)
   –  2.2x speedup compared with the original hybrid algorithm
•  Graph500 & Green Graph500
   –  Fastest single-node system in June 2012
   –  Most power-efficient in June 2013

Future work
•  Further optimizing the NUMA-optimized BFS
   –  BigData2013 version: 11 GTEPS
   –  Latest version: 26 GTEPS (2.4x faster)
•  Distributed-memory parallel computation

[Figure: GTEPS (0 to 30) versus SCALE (20 to 29) for the latest version and the BigData2013 version]
