NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System

Yuichiro Yasui*1, Katsuki Fujisawa*1, Kazushige Goto*2
*1 Chuo University & JST CREST, *2 Intel Corporation


Page 1: NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System

NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System

Yuichiro Yasui*1, Katsuki Fujisawa*1, Kazushige Goto*2

*1 Chuo University & JST CREST
*2 Intel Corporation

Page 2: NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System

Outline

1. Background
2. Breadth-first Search (BFS)
3. Proposal: NUMA-optimized parallel BFS
4. Numerical Results
5. Conclusion

Page 3: NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System

Background

• Large-scale graphs in various fields
  – US road network: 24 million vertices & 58 million edges
  – Twitter follow-ship (social network): 61.6 million vertices & 1.47 billion edges
  – Neuronal network: 89 billion vertices & 100 trillion edges
  – Cyber-security: 15 billion log entries / day
• Fast and scalable graph processing by using HPC

Page 4: NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System

Importance of graph processing

• Application fields
  – Transportation
  – Social network
  – Cyber-security
  – Bioinformatics
• Graph processing
  – concurrent search (breadth-first search)
  – optimization (single source shortest path)
  – edge-oriented (maximal independent set)
• BFS is an important and fundamental graph processing kernel
  – Obtains relationships of distance (hops) as a stand-alone kernel
  – Used by many algorithms (BC, max. flow, max. independent set)
• Problems for fast & scalable computation of BFS: low arithmetic intensity & irregular memory accesses

[Figure: pipeline in three steps (Step 1, Step 2, Step 3) linking application field, relationships, graph, graph processing (breadth-first search), results, and understanding]

Page 5: NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System

Graph500 Benchmark

• Measures computer performance using the TEPS ratio in graph processing such as BFS (breadth-first search)
• TEPS ratio = # of traversed edges per second
• Kronecker graph
  – 2^SCALE vertices and 2^(SCALE+4) edges
  – synthetic scale-free network

Benchmark flow:

1. Generation (input parameters: SCALE and edgefactor (= 16))
2. Construction
3. BFS and validation, x 64 iterations (each yields BFS time, traversed edges, and TEPS)

Result: median TEPS over the 64 iterations.

http://www.graph500.org
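The benchmark arithmetic above can be sketched in a few lines. This is a minimal illustration, not the Graph500 reference code; the function names `kronecker_size`, `teps`, and `median_teps` are hypothetical and chosen only for this example.

```python
from statistics import median


def kronecker_size(scale, edgefactor=16):
    """Return (vertices, edges) for the given SCALE and edgefactor."""
    n = 2 ** scale
    return n, edgefactor * n  # edgefactor = 16 gives 2^(SCALE+4) edges


def teps(traversed_edges, bfs_seconds):
    """Traversed edges per second for a single BFS run."""
    return traversed_edges / bfs_seconds


def median_teps(runs):
    """runs: iterable of (traversed_edges, bfs_seconds), one per BFS iteration."""
    return median(teps(m, t) for m, t in runs)


n, m = kronecker_size(26)
print(n, m)  # SCALE 26: about 67 million vertices and 1 billion edges
```

With SCALE 26 this gives the "67 million vertices and 1 billion edges" instance used in the numerical results.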

Page 6: NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System

Contribution

• Efficient hybrid algorithm of BFS [Beamer2011, 2012]
  – reduces unnecessary edge traversal
  – 5.1 GTEPS
• Our proposal: NUMA-optimized hybrid algorithm (Hybrid BFS + NUMA)
  – Improves locality of memory access
  – Library for considering NUMA carefully
  – Column-wise graph partitioning
• Results on a 4-way Intel Xeon E5 (64 CPU cores)
  – Scalable: scales well up to 64 threads
  – Fast: 11.15 GTEPS, a 2.2x speedup compared with the original hybrid algorithm

Page 7: NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System

Outline

1. Background
2. Breadth-first Search (BFS)
3. NUMA architecture
4. Proposal: NUMA-optimized parallel BFS
5. Numerical Results

Page 8: NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System

Breadth-first Search (BFS)

• Obtains the level of each vertex from the source vertex
• Level = number of hops away from the source
• Input: graph G and a source vertex
• Output: tree rooted at the source

[Figure: BFS from the source, with vertices grouped into Level 1, Level 2, and Level 3]
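As a small illustration of the input and output described above, the following sketch runs a sequential queue-based BFS and returns both the level of each vertex and the BFS tree as a parent map. The adjacency-list dictionary and the name `bfs_levels` are assumptions made for this example only.

```python
from collections import deque


def bfs_levels(adj, source):
    """Return (level, parent) dictionaries for all vertices reachable from source."""
    level = {source: 0}
    parent = {source: source}  # the root is its own parent
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in level:  # first visit fixes the level and the tree edge
                level[w] = level[v] + 1
                parent[w] = v
                queue.append(w)
    return level, parent


# Small undirected example graph (each edge listed in both directions).
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
level, parent = bfs_levels(adj, 0)
print(level)
```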

Page 9: NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System

Hybrid BFS for low-diameter graph

• Efficient for low-diameter graphs
  – scale-free and/or small-world property, such as social networks
• Used at higher ranks in the Graph500 benchmark
• Hybrid algorithm [Beamer2011, 2012]
  – combines the top-down algorithm and the bottom-up algorithm
  – reduces unnecessary edge traversal
  – Top-down algorithm: efficient for a small frontier (frontier < neighbors)
  – Bottom-up algorithm: efficient for a large frontier (frontier > neighbors)

[Figure: top-down expands the frontier at Level k to the neighbors at Level k+1; bottom-up searches from the neighbors at Level k+1 back to the frontier at Level k; a switch selects between the two]

Page 10: NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System

Top-down algorithm

• Explores outgoing edges of the frontier queue QF
• Appends unvisited vertices into the neighbor queue QN

[Figure: the source at Level 0 is QF; its neighbors at Level 1 are QN]

Page 11: NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System

Top-down algorithm

• Explores outgoing edges of the frontier queue QF
• Appends unvisited vertices into the neighbor queue QN

[Figure: Level 0 (QF) to Level 1 (QN), then Level 1 (QF) to Level 2 (QN); edges that lead to already-visited vertices are unnecessary edge traversals]

Page 12: NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System

Top-down algorithm

• Explores outgoing edges of the frontier queue QF
• Appends unvisited vertices into the neighbor queue QN
• Efficient for a small frontier
• Has unnecessary edge traversals for a large frontier

[Figure: Level 0 (QF) to Level 1 (QN), Level 1 (QF) to Level 2 (QN), and Level 2 (QF) to Level 3 (QN); edges that lead to already-visited vertices are unnecessary edge traversals]
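One level of the top-down direction can be sketched as follows. This is a sequential illustration under assumed data structures (adjacency lists as Python lists, a `parent` array with -1 meaning unvisited); `top_down_step` is a hypothetical name. It also counts the edges scanned, which makes the unnecessary traversals visible.

```python
def top_down_step(adj, frontier, parent):
    """Expand one level top-down; return (QN, number of edges scanned)."""
    neighbors = []  # QN
    scanned = 0
    for v in frontier:  # QF
        for w in adj[v]:
            scanned += 1
            if parent[w] == -1:  # unvisited: claim it and queue it
                parent[w] = v
                neighbors.append(w)
    return neighbors, scanned


adj = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3]]
parent = [-1] * len(adj)
parent[0] = 0
level1, scanned1 = top_down_step(adj, [0], parent)
level2, scanned2 = top_down_step(adj, level1, parent)
print(level1, scanned1, level2, scanned2)
```

In the second step four edges are scanned but only one of them discovers a new vertex; the other three lead to vertices that are already visited.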

Page 13: NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System

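Bottom-up algorithm

• Explores the frontier queue QF from unvisited vertices
• Appends vertices adjacent to the frontier into the neighbors QN

[Figure: unvisited vertices search for the source (QF); those adjacent to it form Level 1 (QN); edges checked by vertices that are not adjacent to the frontier are unnecessary edge traversals]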

Page 14: NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System

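Bottom-up algorithm

• Explores the frontier queue QF from unvisited vertices
• Appends vertices adjacent to the frontier into the neighbors QN

[Figure: unvisited vertices search for the source (QF) to form Level 1 (QN), then search for Level 1 (QF) to form Level 2 (QN); edges checked by vertices that are not adjacent to the frontier are unnecessary edge traversals]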

Page 15: NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System

Bottom-up algorithm

• Explores the frontier queue QF from unvisited vertices
• Appends vertices adjacent to the frontier into the neighbors QN
• Efficient for a large frontier
• Has unnecessary edge traversals for a small frontier

[Figure: unvisited vertices search for the source (QF) to form Level 1 (QN), for Level 1 (QF) to form Level 2 (QN), and for Level 2 (QF) to form Level 3 (QN); edges checked by vertices that are not adjacent to the frontier are unnecessary edge traversals]
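One level of the bottom-up direction can be sketched in the same style. The sketch assumes an undirected graph stored as adjacency lists, a `parent` array with -1 for unvisited, and a boolean frontier bitmap; `bottom_up_step` is a hypothetical name.

```python
def bottom_up_step(adj, in_frontier, parent):
    """Expand one level bottom-up; return (QN, number of edges scanned)."""
    neighbors = []  # QN
    scanned = 0
    for w in range(len(adj)):
        if parent[w] != -1:
            continue  # only unvisited vertices search for a parent
        for v in adj[w]:
            scanned += 1
            if in_frontier[v]:  # found a frontier neighbor: stop scanning early
                parent[w] = v
                neighbors.append(w)
                break
    return neighbors, scanned


adj = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3]]
parent = [0, -1, -1, -1, -1]
in_frontier = [True, False, False, False, False]
qn, scanned = bottom_up_step(adj, in_frontier, parent)
print(qn, scanned)
```

With the frontier holding only the source, six edges are scanned to find two neighbors, whereas the top-down step needs only two. This is the small-frontier case in which bottom-up is wasteful.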

Page 16: NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System

Hybrid BFS combines Top-down and Bottom-up

[Figure: top-down algorithm (frontier at Level k to neighbors at Level k+1) and bottom-up algorithm (neighbors at Level k+1 to frontier at Level k), with a switch between them]

Hybrid algorithm of Beamer et al. [1]

• Two different traversal kernels: top-down and bottom-up.
• Top-down
  – traverses neighbors of the frontier.
  – performance depends on frontier size.
• Bottom-up
  – finds the frontier from vertices in candidate neighbors (all unvisited vertices).
  – performance depends on the number of unvisited vertices.
• This lazy estimation of candidate neighbors increases the number of edges traversed.

Fig: Top-down for small frontier
Fig: Bottom-up for large frontier

Traversed edges of a Kronecker graph (SCALE 26):

Level   Top-down (mF)    Bottom-up (mB)   Hybrid min(mF, mB)
0       2                2,103,840,895    2
1       66,206           1,766,587,029    66,206
2       346,918,235      52,677,691       52,677,691
3       1,727,195,615    12,820,854       12,820,854
4       29,557,400       103,184          103,184
5       82,357           21,467           21,467
6       221              21,240           227
Total   2,103,820,036    3,936,072,360    65,689,631
Ratio   100.00%          187.09%          3.12%

[Table annotations: only, switch, switch]

[1] S. Beamer et al.: Direction-optimizing breadth-first search, SC'12, 2012.
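The per-level choice between the two kernels can be sketched as below. This is a simplified illustration, not the heuristic of Beamer et al.: it assumes the direction is chosen by directly comparing mF (edges incident to the frontier) with mB (edges incident to unvisited vertices), mirroring the min(mF, mB) column, whereas the published heuristic uses tuning parameters that are omitted here. The name `hybrid_bfs` is hypothetical.

```python
def hybrid_bfs(adj, source):
    """Level-synchronous BFS that picks top-down or bottom-up at each level."""
    n = len(adj)
    parent = [-1] * n
    parent[source] = source
    frontier = [source]
    directions = []
    while frontier:
        m_f = sum(len(adj[v]) for v in frontier)
        m_b = sum(len(adj[w]) for w in range(n) if parent[w] == -1)
        nxt = []
        if m_f <= m_b:  # assumed rule: take the direction with fewer edges
            directions.append("top-down")
            for v in frontier:
                for w in adj[v]:
                    if parent[w] == -1:
                        parent[w] = v
                        nxt.append(w)
        else:
            directions.append("bottom-up")
            in_frontier = [False] * n
            for v in frontier:
                in_frontier[v] = True
            for w in range(n):
                if parent[w] == -1:
                    for v in adj[w]:
                        if in_frontier[v]:
                            parent[w] = v
                            nxt.append(w)
                            break
        frontier = nxt
    return parent, directions


# Hub vertex 0 with leaves 1..5, plus vertex 6 attached to vertex 5.
adj = [[1, 2, 3, 4, 5], [0], [0], [0], [0], [0, 6], [5]]
parent, directions = hybrid_bfs(adj, 1)
print(parent, directions)
```

On this small graph the search starts top-down and switches to bottom-up once the frontier covers most of the remaining edges.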

Page 17: NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System

Outline

1. Background
2. Breadth-first Search (BFS)
3. Proposal: NUMA-optimized parallel BFS
4. Numerical Results
5. Conclusion

Page 18: NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System

How to speed up the hybrid algorithm?

• NUMA architecture
  – Non-uniform memory access
  – Each CPU socket has a local RAM
  – Fast local RAM and slow non-local RAM

[Figure: 4-socket Intel Xeon E5 system; each 8-core Xeon E5-4640 socket has processor cores with L1/L2 caches, a shared L3 cache, and its own RAM (local); the RAM of the other sockets (non-local) is reached over the interconnect]

Page 19: NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System

How to speed up the hybrid algorithm?

• NUMA architecture
  – Non-uniform memory access
  – Each CPU socket has a local RAM
  – Fast local RAM and slow non-local RAM
• Frequent non-local memory accesses on NUMA architecture
  – The graph G and the working data (QF, QN, visited flags) are spread across the local memories

[Figure: BFS on graph G (top-down and bottom-up between the frontier at Level k and the neighbors at Level k+1) mapped onto the 4-socket Intel Xeon E5 system with local and non-local RAM]

Page 20: NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System

Difficulty of considering NUMA architecture

1. How to distribute the graph and data to each local RAM?

[Figure: G = G0 ∪ G1 ∪ G2 ∪ G3]

Page 21: NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System

Difficulty of considering NUMA architecture

1. How to distribute the graph and data to each local RAM?
2. How to bind the partial graph and data to each NUMA unit?

[Figure: G = G0 ∪ G1 ∪ G2 ∪ G3; the partial graphs G0-G3 and data B0-B3 have to be assigned to the NUMA units, each consisting of a CPU and its RAM (CPU0-CPU3, RAM0-RAM3)]

Page 22: NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System

ULIBC: Ubiquity Library for Intelligently Binding Cores

1. NUMACTL (command line tool, library for C/C++)
2. Intel compiler Thread Affinity Interface (API)
   → CPU affinity + local memory binding
3. ULIBC (our library, library for C/C++)
   → CPU affinity + local memory binding + processor topology
  – Processor ID: index of logical processor core
  – Package ID: index of CPU socket
  – Core ID: index of physical core in each CPU socket

The processor topology for each CPU core is obtained from the Linux device files (/sys/devices/system/*).

Page 23: NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System

ULIBC: Ubiquity Library for Intelligently Binding Cores

1. NUMACTL (command line tool, library for C/C++)
2. Intel compiler Thread Affinity Interface (API)
   → CPU affinity + local memory binding
3. ULIBC (our library, library for C/C++)
   → CPU affinity + local memory binding + processor topology
  – Processor ID: index of logical processor core
  – Package ID: index of CPU socket
  – Core ID: index of physical core in each CPU socket

The processor topology for each CPU core is obtained from the Linux device files (/sys/devices/system/*).

At a parallel region, each thread ID is mapped to a Processor ID, Package ID, and Core ID; the sched_setaffinity system call sets the CPU affinity and the mbind system call binds the local memory.

Page 24: NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System

ULIBC: Ubiquity Library for Intelligently Binding Cores

1. NUMACTL (command line tool, library for C/C++)
2. Intel compiler Thread Affinity Interface (API)
   → CPU affinity + local memory binding
3. ULIBC (our library, library for C/C++)
   → CPU affinity + local memory binding + processor topology
  – Processor ID: index of logical processor core
  – Package ID: index of CPU socket
  – Core ID: index of physical core in each CPU socket

The processor topology for each CPU core is obtained from the Linux device files (/sys/devices/system/*).

At a parallel region, each thread ID is mapped to a Processor ID, Package ID, and Core ID; the sched_setaffinity system call sets the CPU affinity and the mbind system call binds the local memory.

• Supports scatter and compact policies (round-robin on CPU sockets)
• ULIBC makes it possible to manage NUMA carefully.
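The mapping from a thread ID to a socket and a core can be illustrated with plain index arithmetic. The sketch below does not use or reproduce the ULIBC API; `map_thread` is a hypothetical helper, and it assumes that "scatter" places consecutive threads round-robin on the CPU sockets while "compact" fills one socket before moving to the next.

```python
def map_thread(thread_id, sockets, cores_per_socket, policy="scatter"):
    """Return (package_id, core_id) for a thread under a placement policy."""
    slot = thread_id % (sockets * cores_per_socket)
    if policy == "scatter":
        # Assumed meaning: consecutive threads go to consecutive sockets.
        return slot % sockets, slot // sockets
    if policy == "compact":
        # Assumed meaning: fill all cores of one socket before the next.
        return slot // cores_per_socket, slot % cores_per_socket
    raise ValueError(f"unknown policy: {policy}")


# 4 sockets x 16 logical cores, as in the 4-way Xeon system.
for t in (0, 1, 4, 5):
    print(t, map_thread(t, 4, 16, "scatter"), map_thread(t, 4, 16, "compact"))
```

On Linux, the resulting logical core could then be pinned with `os.sched_setaffinity`, which wraps the sched_setaffinity system call named above; that step is left out here because it depends on the machine.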

Page 25: NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System

NUMA-opt. Column-wise Graph Partitioning

• Divides G = (V, A) into partial graphs Gk = (Vk, Ak) and binds Gk to local RAM k
  – Ak is a set of adjacency lists that holds the incoming edges to Vk

[Figure: adjacency matrix split into blocks A0, A1, A2, A3 for an edge (i, j) from the frontier (Level k) to the neighbors (Level k+1).
  Row-wise graph partitioning: O(m) mostly non-local memory accesses.
  Column-wise graph partitioning: O(m) local memory accesses only.]
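The partitioning rule can be sketched as follows, assuming that V is split into contiguous, equally sized vertex ranges Vk (the slides do not state how Vk is chosen) and that each Ak stores, for every vertex of Vk, the list of its incoming edges. `partition_columnwise` is a hypothetical name, and plain dictionaries stand in for memory bound to each local RAM.

```python
def partition_columnwise(n, edges, units):
    """Split vertices 0..n-1 into `units` ranges V_k and build A_k per range."""
    size = (n + units - 1) // units  # vertices per NUMA unit (last may be short)

    def owner(v):
        return v // size

    parts = []
    for k in range(units):
        lo, hi = k * size, min((k + 1) * size, n)
        parts.append({"range": (lo, hi),
                      "incoming": {w: [] for w in range(lo, hi)}})
    for v, w in edges:  # edge v -> w is stored with the owner of w
        parts[owner(w)]["incoming"][w].append(v)
    return parts


edges = [(0, 1), (0, 5), (1, 4), (5, 2), (6, 7), (3, 7)]
parts = partition_columnwise(8, edges, 4)
for k, part in enumerate(parts):
    print(k, part["range"], part["incoming"])
```

Because every edge is stored with the unit that owns its destination, updates to a destination vertex (visited flag, parent, queue entry) touch only that unit's data.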

Page 26: NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System

NUMA-optimized Top-down

• Explores outgoing edges of the frontier queue QF
• Appends unvisited vertices into the neighbor queue QN
• Efficient for a small frontier
• Has unnecessary edge traversals for a large frontier

[Figure: Level 0 (frontier QF) to Level 1 (neighbors QN), Level 1 to Level 2, and Level 2 to Level 3; edges that lead to already-visited vertices are unnecessary edge traversals]

Page 27: NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System

Details of NUMA-optimized Top-down

• Explores outgoing edges Ak of the frontier queue QFk
• Appends unvisited vertices into the neighbor queue QNk

[Figure: NUMA units 0-3 each expand Level 1 to Level 2 with their own queues QFk and QNk; the per-unit neighbor queues are then combined by an all-gather into the Level 2 frontier QF]
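One possible reading of this step is sketched below: each NUMA unit k scans the frontier but follows only the edges whose destination lies in its own vertex range Vk, appends discoveries to its own QNk, and the QNk are concatenated (the all-gather) into the next frontier. The data layout, the sequential loop over units, and the name `numa_top_down_step` are assumptions for illustration; a real run would execute the units in parallel on their local memory.

```python
def numa_top_down_step(out_by_unit, frontier, parent):
    """out_by_unit[k][v] lists the out-neighbors of v that belong to V_k."""
    local_queues = []
    for out_k in out_by_unit:  # one iteration per NUMA unit
        qn_k = []
        for v in frontier:
            for w in out_k.get(v, ()):
                if parent[w] == -1:  # w is owned by this unit only
                    parent[w] = v
                    qn_k.append(w)
        local_queues.append(qn_k)
    # All-gather: every unit receives the concatenation of all QN_k.
    next_frontier = [w for qn_k in local_queues for w in qn_k]
    return local_queues, next_frontier


# 8 vertices, 4 units owning vertex ranges {0,1}, {2,3}, {4,5}, {6,7}.
out_by_unit = [{0: [1]}, {0: [2], 1: [3]}, {0: [5]}, {0: [6]}]
parent = [-1] * 8
parent[0] = 0
queues1, frontier1 = numa_top_down_step(out_by_unit, [0], parent)
queues2, frontier2 = numa_top_down_step(out_by_unit, frontier1, parent)
print(queues1, frontier1, queues2, frontier2)
```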

Page 28: NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System

NUMA-optimized Bottom-up

• Explores the frontier queue QF from unvisited vertices
• Appends vertices adjacent to the frontier into the neighbors QN
• Efficient for a large frontier
• Has unnecessary edge traversals for a small frontier

[Figure: unvisited vertices search for the source (frontier QF) to form Level 1 (neighbors QN), for Level 1 to form Level 2, and for Level 2 to form Level 3; edges checked by vertices that are not adjacent to the frontier are unnecessary edge traversals]

Page 29: NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System

Details of NUMA-optimized Bottom-up

• Explores the frontier queue QFk from unvisited vertices
• Appends vertices adjacent to the frontier into the neighbors QNk

[Figure: NUMA units 0-3 each search from their own unvisited vertices for the Level 1 frontier and collect the Level 2 vertices in their own queue (QF0/QN0, QF1/QN1, QF2/QN2, QF3/QN3); the per-unit queues are then combined by an all-gather into the Level 2 frontier QF]
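A matching sketch for the bottom-up direction, again as one possible reading of the slide: each NUMA unit k visits only the unvisited vertices of its own range Vk, checks their incoming edges (stored in Ak) against the frontier, and appends hits to its own QNk before the all-gather. The dictionary layout and the name `numa_bottom_up_step` are hypothetical, and the units are simulated one after another.

```python
def numa_bottom_up_step(in_by_unit, in_frontier, parent):
    """in_by_unit[k][w] lists the sources of edges entering vertex w of V_k."""
    local_queues = []
    for in_k in in_by_unit:  # one iteration per NUMA unit
        qn_k = []
        for w, sources in in_k.items():
            if parent[w] != -1:
                continue  # already visited
            for v in sources:
                if in_frontier[v]:  # stop at the first frontier neighbor
                    parent[w] = v
                    qn_k.append(w)
                    break
        local_queues.append(qn_k)
    # All-gather: every unit receives the concatenation of all QN_k.
    next_frontier = [w for qn_k in local_queues for w in qn_k]
    return local_queues, next_frontier


in_by_unit = [{0: [], 1: [0]}, {2: [0], 3: [1]}, {4: [], 5: [0]}, {6: [0], 7: []}]
parent = [0] + [-1] * 7
in_frontier = [True] + [False] * 7
queues, frontier = numa_bottom_up_step(in_by_unit, in_frontier, parent)
print(queues, frontier)
```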

Page 30: NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System

Outline

1. Background
2. Breadth-first Search (BFS)
3. NUMA-optimized parallel BFS
4. Numerical Results
5. Conclusion

Page 31: NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System

Machine specification

• 4-way Intel Xeon E5 (8-core Xeon E5-4640 per socket)
  – CentOS 6.4 (Kernel 2.6.32)
  – GCC 4.4.7
  – 64 logical CPU cores
  – 4 NUMA units x 16 logical cores
• 4-way AMD Opteron 6174 (12 cores per socket)
  – Fedora 19 (Kernel 3.11.2)
  – GCC 4.8.1
  – 48 CPU cores
  – 8 NUMA units x 6 cores

[Figure: socket diagrams of both systems, showing processor cores with L1/L2 caches, the shared L3 cache (Xeon), the RAM attached to each NUMA unit, and the interconnect]

Page 32: NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System

TEPS ratio varied with problem size

• Ours achieves 11.15 GTEPS for a Kronecker graph (SCALE 26: 67 million vertices and 1 billion edges)
• Ours is a 2.2x speedup compared with the original hybrid algorithm

[Figure: GTEPS (0-14) versus SCALE (20-29) for Hybrid and Hybrid + NUMA]

Peak performance:

• Hybrid [Beamer2011, 2012]: 5.1 GTEPS on a 4-way Intel Xeon E7-8870 (Westmere-EX arch.)
• Hybrid + NUMA (this paper): 11.15 GTEPS on a 4-way Intel Xeon E5-4640 (SandyBridge-EP arch.), x2.2 better

Page 33: NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System

Strong scaling on Intel/AMD systems

Scales well up to the number of threads equal to the number of cores.

[Figure: speedup versus number of threads (1, 2, 4, 8, 12, 16, 24, 32, 48, 64) for ideal, 4-way SandyBridge-EP, and 4-way MagnyCours]

• 4-way Intel Xeon: 11.15 GTEPS
• 4-way AMD Opteron: 6.17 GTEPS
• Annotated speedups: 40 threads: x40; 64 threads: x28

Page 34: NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System

Twitter network (follow-ship network 2009)

• 41 million vertices and 1.47 billion edges; an (i, j)-edge connects User i and User j
• Our NUMA-optimized BFS on the 4-way Xeon system: 180 ms / BFS (8.1 GTEPS)
• Six degrees of separation

Frontier size in BFS with source as User 21,804,357:

Lv      Frontier size   Freq. (%)   Cum. Freq. (%)
0       1               0.00        0.00
1       7               0.00        0.00
2       6,188           0.01        0.01
3       510,515         1.23        1.24
4       29,526,508      70.89       72.13
5       11,314,238      27.16       99.29
6       282,456         0.68        99.97
7       11,536          0.03        100.00
8       673             0.00        100.00
9       68              0.00        100.00
10      19              0.00        100.00
11      10              0.00        100.00
12      5               0.00        100.00
13      2               0.00        100.00
14      2               0.00        100.00
15      2               0.00        100.00
Total   41,652,230      100.00      -
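The frequency columns and the reported rate can be re-derived from the frontier sizes with a few lines of arithmetic. The sketch assumes that each of the 1.47 billion edges is counted once per BFS when converting 180 ms into a TEPS figure; the slides do not state the exact traversed-edge count.

```python
frontier_sizes = [1, 7, 6188, 510515, 29526508, 11314238, 282456, 11536,
                  673, 68, 19, 10, 5, 2, 2, 2]

total = sum(frontier_sizes)  # vertices reached from the source
freq = [100.0 * s / total for s in frontier_sizes]

cum = []  # cumulative share of reached vertices per level
running = 0
for s in frontier_sizes:
    running += s
    cum.append(100.0 * running / total)

edges = 1.47e9        # assumed traversed-edge count (all edges of the graph)
bfs_seconds = 0.180   # 180 ms per BFS
gteps = edges / bfs_seconds / 1e9

print(total, round(freq[4], 2), round(cum[6], 2), round(gteps, 2))
```

The cumulative share at level 6 is 99.97%, which is the "six degrees of separation" observation.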

Page 35: NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System

Graph500 benchmark

• Fastest single-node entry on the 4th list (June 2012)
• Fastest CPU-based single-node entry on the 6th list (June 2013)

[Figure: excerpts of the two Graph500 lists with annotations: ours (4-way Intel Xeon Westmere-EX); ours (4-way Intel Xeon SandyBridge-EP); 8.2 GTEPS; Rank 26; Rank 57; 11.1 GTEPS; Convey 4 FPGA + 2 CPU]

http://www.graph500.org

Page 36: NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System

1st Green Graph500 list (June 2013)

• Measures power efficiency using the TEPS/W ratio
• Results on various systems such as Android, Linux, and Mac (NVIDIA Tegra3 and Intel/AMD arch. with the same source code)

Small Data category (ours):

Rank   System                                                      Efficiency      Performance
1      ASUS tablet TF700T (NVIDIA Tegra3, 4-core; Android NDK)     64.1 MTEPS/W    150 MTEPS
2      Intel NUC (Linux)                                           53.8 MTEPS/W    1.1 GTEPS
3      Mac mini                                                    53.5 MTEPS/W    1.9 GTEPS

http://green.graph500.org

Page 37: NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System

Conclusion

• NUMA-optimized hybrid BFS algorithm (Hybrid + NUMA)
  – Reduces unnecessary edge traversals and remote RAM accesses by carefully considering NUMA
• Numerical results on 4-way Intel Xeon
  – scales well up to 64 threads (scalable)
  – achieves 11.15 GTEPS (fast)
  – 2.2x speedup compared with the original hybrid algorithm
• Graph500 & Green Graph500
  – Fastest single-node in June 2012
  – Most power-efficient in June 2013

Page 38: NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System

Future work

• Further optimizing the NUMA-optimized BFS
  – BigData2013 version: 11 GTEPS
  – Latest version: 26 GTEPS (2.4x faster)
• Distributed-memory parallel computation

[Figure: GTEPS (0-30) versus SCALE (20-29) for the latest version and the BigData2013 version]