NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System
Yuichiro Yasui*1, Katsuki Fujisawa*1, Kazushige Goto*2
*1 Chuo University & JST CREST  *2 Intel Corporation
Outline
1. Background
2. Breadth-first Search (BFS)
3. Proposal: NUMA-optimized parallel BFS
4. Numerical Results
5. Conclusion
Background
• Large-scale graphs in various fields
– US road network: 24 million vertices & 58 million edges
– Twitter follow-ship (social network): 61.6 million vertices & 1.47 billion edges
– Neuronal network: 89 billion vertices & 100 trillion edges
– Cyber-security: 15 billion log entries / day
• Fast and scalable graph processing by using HPC
• Application fields: transportation, social network, cyber-security, bioinformatics
Importance of graph processing
• BFS is an important and fundamental graph processing kernel
– Obtains the relationship of distance (hops) as a stand-alone computation
– Used in many algorithms (BC, max. flow, max. independent set)
• Kinds of graph processing
– concurrent search (breadth-first search)
– optimization (single source shortest path)
– edge-oriented (maximal independent set)
• Problems of fast & scalable BFS computation: low arithmetic intensity & irregular memory accesses
[Figure labels: application field, relationships, graph, graph processing, results, understanding; breadth-first search shown as Step 1 to Step 3]
Graph500 Benchmark
• Measures computer performance using the TEPS ratio in graph processing such as BFS (breadth-first search)
• TEPS ratio = # of traversed edges per second
• Input parameters: SCALE and edgefactor (= 16)
• Steps: 1. Generation, 2. Construction, 3. BFS and validation (x 64 iterations)
• Results: SCALE, edgefactor, BFS time, traversed edges, TEPS; the reported score is the median TEPS
• Kronecker graph
– 2^SCALE vertices and 2^(SCALE+4) edges
– synthetic scale-free network
http://www.graph500.org
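The size and scoring rules above can be sketched in a few lines. This is only an illustration of the arithmetic stated on the slide; the function names are hypothetical and are not part of the Graph500 reference code.

```python
from statistics import median

def kronecker_size(scale, edgefactor=16):
    """Return (vertices, edges) of a Graph500 Kronecker graph."""
    vertices = 2 ** scale
    edges = edgefactor * vertices   # 16 * 2^SCALE = 2^(SCALE+4)
    return vertices, edges

def teps(traversed_edges, seconds):
    """Traversed edges per second for one BFS run."""
    return traversed_edges / seconds

def median_teps(runs):
    """runs: iterable of (traversed_edges, seconds), e.g. 64 BFS iterations."""
    return median(teps(m, t) for m, t in runs)

vertices, edges = kronecker_size(26)
print(vertices, edges)  # 67108864 1073741824
```

For SCALE 26 this gives about 67 million vertices and 1 billion edges, the problem size used in the results later in the deck.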
Contribution
• Efficient hybrid algorithm of BFS [Beamer2011, 2012]
– reduces unnecessary edge traversal
– 5.1 GTEPS
• Our proposal: NUMA-optimized hybrid algorithm
– Improves locality of memory access
– Library for considering NUMA carefully
– Column-wise graph partitioning
• Results on 4-way Intel Xeon E5 (64 CPU cores)
– Scalable: scales well up to 64 threads
– Fast: 11.15 GTEPS and 2.2x speedup compared with the original hybrid algorithm
Outline
1. Background
2. Breadth-first Search (BFS)
3. NUMA architecture
4. Proposal: NUMA-optimized parallel BFS
5. Numerical Results
Breadth-first Search (BFS)
• Obtains the level of each vertex from the source vertex
• Level = number of hops away from the source
• Input: graph G and source
• Output: tree with the source as root
[Figure: BFS from the source through Level 1, Level 2, and Level 3]
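A minimal sequential sketch of the input/output contract above, assuming the graph is given as an adjacency-list dictionary (a representation chosen here for illustration, not taken from the slides):

```python
from collections import deque

def bfs(adj, source):
    """Return (level, parent): hop count and BFS-tree parent of each reached vertex."""
    level = {source: 0}
    parent = {source: source}       # the root is its own parent
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in level:      # first visit fixes level and parent
                level[w] = level[v] + 1
                parent[w] = v
                queue.append(w)
    return level, parent

# Small undirected example graph (hypothetical).
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
level, parent = bfs(adj, 0)
```

Here vertices 1 and 2 are at level 1, vertex 3 at level 2, and vertex 4 at level 3.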
Hybrid BFS for low-diameter graph
• Efficient for low-diameter graphs
– scale-free and/or small-world property, such as social networks
• Used at higher ranks in the Graph500 benchmark
• Hybrid algorithm [Beamer2011, 2012]
– combines top-down algorithm and bottom-up algorithm
– reduces unnecessary edge traversal
• Top-down algorithm: efficient for a small frontier (frontier < neighbors)
• Bottom-up algorithm: efficient for a large frontier (frontier > neighbors)
• Switches between the two directions
[Figure: frontier at Level k and neighbors at Level k+1 for each direction]
Top-down algorithm
• Explores outgoing edges of frontier queue QF
• Appends unvisited vertices into neighbor queue QN
• Efficient for a small frontier
• Has unnecessary edge traversal for a large frontier
[Figure: Level 0 (source) to Level 3; at each level QF is the frontier and QN the neighbors; unnecessary edge traversals are marked]
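One level of the top-down direction might look like the sketch below. The array-based representation (`adj` as a list of lists, `parent[w] == -1` meaning unvisited) and the edge counter are assumptions made for illustration.

```python
def top_down_step(adj, frontier, parent):
    """Scan outgoing edges of QF; return (QN, number of edges scanned)."""
    neighbors = []
    scanned = 0
    for v in frontier:
        for w in adj[v]:
            scanned += 1
            if parent[w] == -1:     # unvisited: claim it for the next level
                parent[w] = v
                neighbors.append(w)
    return neighbors, scanned

# Hypothetical undirected graph with 5 vertices.
adj = [[1, 2], [0, 2, 3], [0, 1, 3], [1, 2, 4], [3]]
parent = [-1] * len(adj)
parent[0] = 0
qn1, scanned1 = top_down_step(adj, [0], parent)   # small frontier
qn2, scanned2 = top_down_step(adj, qn1, parent)   # larger frontier
```

In the second step six edges are scanned to discover a single new vertex; the other scans hit already visited vertices, which is the unnecessary edge traversal the slide refers to.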
Bottom-up algorithm
• Explores frontier queue QF from unvisited vertices
• Appends adjacent vertices into neighbors QN
• Efficient for a large frontier
• Has unnecessary edge traversal for a small frontier
[Figure: source, Level 1 to Level 3; at each level the unvisited vertices search for a parent in QF and the found vertices form QN; unnecessary edge traversals are marked]
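One level of the bottom-up direction can be sketched in the same style. The graph is assumed undirected here, so the adjacency list of a vertex also serves as its list of incoming edges; the frontier is passed as a boolean array.

```python
def bottom_up_step(adj, in_frontier, parent):
    """Each unvisited vertex looks for any neighbor in QF; return (QN, edges scanned)."""
    neighbors = []
    scanned = 0
    for w in range(len(adj)):
        if parent[w] != -1:
            continue                # already visited
        for v in adj[w]:
            scanned += 1
            if in_frontier[v]:      # found a parent: stop scanning early
                parent[w] = v
                neighbors.append(w)
                break
    return neighbors, scanned

# Same hypothetical graph as a list of adjacency lists.
adj = [[1, 2], [0, 2, 3], [0, 1, 3], [1, 2, 4], [3]]
parent = [0, -1, -1, -1, -1]
qn1, scanned1 = bottom_up_step(adj, [True, False, False, False, False], parent)
qn2, scanned2 = bottom_up_step(adj, [False, True, True, False, False], parent)
```

With the one-vertex frontier the step scans six edges to find two vertices; with the larger frontier it scans only two edges, because each search stops at the first parent found.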
Hybrid BFS combines Top-down and Bottom-up
[Figure: top-down algorithm (frontier at Level k, neighbors at Level k+1) and bottom-up algorithm (frontier at Level k, neighbors at Level k+1), with a switch between them]
Hybrid algorithm of Beamer et al. [1]
• Two different traversal kernels: top-down and bottom-up
• Top-down
– traverses neighbors of the frontier
– performance depends on frontier size
• Bottom-up
– finds the frontier from vertices in candidate neighbors (all unvisited vertices)
– performance depends on the number of unvisited vertices
– this lazy estimation of candidate neighbors increases the number of edges traversed

Traversed edges of a Kronecker graph (SCALE 26) per level:

Level   Top-down (mF)   Bottom-up (mB)   Hybrid min(mF, mB)
0                   2    2,103,840,895                    2
1              66,206    1,766,587,029               66,206
2         346,918,235       52,677,691           52,677,691
3       1,727,195,615       12,820,854           12,820,854
4          29,557,400          103,184              103,184
5              82,357           21,467               21,467
6                 221           21,240                  227
Total   2,103,820,036    3,936,072,360           65,689,631
Ratio         100.00%          187.09%                3.12%

Fig: Top-down for small frontier
Fig: Bottom-up for large frontier
[1] S. Beamer et al.: Direction-optimizing breadth-first search, SC'12, 2012.
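The per-level choice of the cheaper direction can be sketched as follows. This is a simplification and not the switching heuristic of Beamer et al.: here mF is the sum of degrees in the frontier and mB is approximated by the sum of degrees of all unvisited vertices, which is only an upper bound on what bottom-up actually scans. The graph is assumed undirected.

```python
def hybrid_bfs(adj, source):
    """BFS that picks top-down or bottom-up per level; returns (parent, directions)."""
    n = len(adj)
    parent = [-1] * n
    parent[source] = source
    frontier, directions = [source], []
    while frontier:
        m_f = sum(len(adj[v]) for v in frontier)                       # top-down cost
        m_b = sum(len(adj[w]) for w in range(n) if parent[w] == -1)    # bottom-up bound
        nxt = []
        if m_f <= m_b:
            directions.append("top-down")
            for v in frontier:
                for w in adj[v]:
                    if parent[w] == -1:
                        parent[w] = v
                        nxt.append(w)
        else:
            directions.append("bottom-up")
            in_frontier = [False] * n
            for v in frontier:
                in_frontier[v] = True
            for w in range(n):
                if parent[w] == -1:
                    for v in adj[w]:
                        if in_frontier[v]:
                            parent[w] = v
                            nxt.append(w)
                            break
        frontier = nxt
    return parent, directions

adj = [[1, 2], [0, 2, 3], [0, 1, 3], [1, 2, 4], [3]]   # hypothetical graph
parent, directions = hybrid_bfs(adj, 0)
```

On this toy graph the search starts top-down and switches to bottom-up once the frontier's edge count exceeds that of the remaining unvisited vertices.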
Outline
1. Background
2. Breadth-first Search (BFS)
3. Proposal: NUMA-optimized parallel BFS
4. Numerical Results
5. Conclusion
How to speed up the hybrid algorithm?
• NUMA architecture
– Non-uniform memory access
– Each CPU socket has a local RAM
– Fast local RAM and slow non-local RAM
• Frequent non-local memory accesses on NUMA architecture
– Graph G and the working data (QF, QN, visited flag) of top-down and bottom-up BFS are spread across the local memories
[Figure: 4-socket Intel Xeon E5 system; each 8-core Xeon E5 4640 has processor cores with L1/L2 cache, a shared L3 cache, and its own RAM; sockets are linked by an interconnect; accesses are either local or non-local]
Difficulty of considering NUMA architecture
1. How to distribute the graph and data to each local RAM?
– G = G0 ∪ G1 ∪ G2 ∪ G3
2. How to bind each partial graph and its data to each NUMA unit?
– G0, G1, G2, G3 are bound to CPU0/RAM0, CPU1/RAM1, CPU2/RAM2, CPU3/RAM3 (NUMA units 0 to 3)
ULIBC: Ubiquity Library for Intelligently Binding Cores
1. NUMACTL (command line tool, library for C/C++)
2. Intel compiler Thread Affinity Interface (API)
→ CPU affinity + local memory binding
3. ULIBC (our library, library for C/C++)
→ CPU affinity + local memory binding + processor topology
– Processor ID: index of logical processor core
– Package ID: index of CPU socket
– Core ID: index of physical core in each CPU socket
• The processor topology for each CPU core is read from the Linux device files (/sys/devices/system/*)
• At a parallel region, each Thread ID is associated with a Processor ID, Package ID, and Core ID
– threads are bound with the sched_setaffinity system call
– memory is bound with the mbind system call
• Supports scatter and compact policy
– round-robin on CPU sockets
• ULIBC makes it possible to manage NUMA carefully
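To make the two placement policies concrete, the sketch below maps a thread ID to a (Package ID, Core ID) pair. It is not ULIBC's API; the function name and the exact mapping are assumptions, with scatter taken to mean round-robin over sockets and compact taken to mean filling one socket before moving to the next.

```python
def place_thread(thread_id, sockets, cores_per_socket, policy="scatter"):
    """Hypothetical mapping of a thread ID to (package_id, core_id)."""
    slot = thread_id % (sockets * cores_per_socket)
    if policy == "scatter":
        # Round-robin on CPU sockets: consecutive threads land on different sockets.
        return slot % sockets, slot // sockets
    if policy == "compact":
        # Fill all cores of one socket before using the next socket.
        return slot // cores_per_socket, slot % cores_per_socket
    raise ValueError("policy must be 'scatter' or 'compact'")

scatter = [place_thread(t, 4, 8, "scatter") for t in range(5)]
compact = [place_thread(t, 4, 8, "compact") for t in (0, 7, 8)]
```

On Linux, a real binding would then pin the calling thread to the chosen core, for example with `sched_setaffinity`, and bind its memory to the socket-local RAM with `mbind`, as the slide describes.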
NUMA-opt.: Column-wise Graph Partitioning
• Divides G = (V, A) into partial graphs Gk = (Vk, Ak) and binds each to local RAM k
– Ak is a set of adjacency lists that holds the incoming edges to Vk
• Row-wise graph partitioning: O(m) mostly non-local memory accesses
• Column-wise graph partitioning: O(m) local memory accesses only
[Figure: adjacency matrix split into A0 to A3, row-wise and column-wise; for an edge (i, j), the frontier is at Level k and the neighbors at Level k+1]
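A small sketch of the partitioning rule above, assuming vertices are split into contiguous, equally sized blocks Vk (the block assignment is an assumption; the slides only state that Ak holds the incoming edges to Vk):

```python
def partition_columnwise(n, edges, parts):
    """Return (A, block): A[k][w] lists the sources v of edges (v, w) with w in Vk."""
    block = (n + parts - 1) // parts          # vertices per NUMA unit (last may be smaller)
    A = [dict() for _ in range(parts)]
    for v, w in edges:
        k = w // block                        # owner is decided by the destination vertex
        A[k].setdefault(w, []).append(v)
    return A, block

edges = [(0, 1), (0, 5), (2, 5), (7, 0), (3, 6)]   # hypothetical directed edges
A, block = partition_columnwise(8, edges, 4)
```

Because every vertex w written during a BFS step belongs to exactly one Vk, the unit that owns Ak also owns all the visited flags and queue entries it updates.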
NUMA-optimized Top-down
• Explores outgoing edges of frontier queue QF
• Appends unvisited vertices into neighbor queue QN
• Efficient for a small frontier
• Has unnecessary edge traversal for a large frontier
[Figure: Level 0 (source) to Level 3 with frontier QF and neighbors QN at each level; unnecessary edge traversals are marked]
Details of NUMA-optimized Top-down
• On NUMA unit k, explores outgoing edges Ak of frontier queue QF_k
• Appends unvisited vertices into the local neighbor queue QN_k
• The local queues are combined by an all-gather to form the frontier QF of the next level
[Figure: NUMA units 0 to 3, each holding its own QF_k (Level 1) and QN_k (Level 2); all-gather produces the Level 2 frontier QF]
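The per-unit top-down step can be sketched sequentially as below. The data layout is an assumption: each unit k keeps, for every source vertex, only the outgoing edges whose destination lies in Vk, so all writes made by unit k touch vertices it owns. A real implementation would run the loop over units in parallel, one thread group per NUMA unit.

```python
def split_outgoing(adj, parts):
    """out[k][v] = destinations w of v's edges with w owned by unit k (contiguous blocks)."""
    n = len(adj)
    block = (n + parts - 1) // parts
    out = [[[] for _ in range(n)] for _ in range(parts)]
    for v in range(n):
        for w in adj[v]:
            out[w // block][v].append(w)
    return out

def numa_top_down_step(out, frontier, parent):
    """Each unit fills its local QN_k; the all-gather concatenates them into the next QF."""
    local_queues = []
    for out_k in out:                       # sequential stand-in for one NUMA unit each
        qn_k = []
        for v in frontier:
            for w in out_k[v]:
                if parent[w] == -1:         # w is owned by this unit, so the write is local
                    parent[w] = v
                    qn_k.append(w)
        local_queues.append(qn_k)
    return [w for qn_k in local_queues for w in qn_k], local_queues

adj = [[1, 2], [0, 2, 3], [0, 1, 3], [1, 2, 4], [3]]   # hypothetical graph
out = split_outgoing(adj, 2)                            # V0 = {0, 1, 2}, V1 = {3, 4}
parent = [0, -1, -1, -1, -1]
qf1, queues1 = numa_top_down_step(out, [0], parent)
qf2, queues2 = numa_top_down_step(out, qf1, parent)
```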
NUMA-optimized Bottom-up
• Explores frontier queue QF from unvisited vertices
• Appends adjacent vertices into neighbors QN
• Efficient for a large frontier
• Has unnecessary edge traversal for a small frontier
[Figure: source, Level 1 to Level 3 with frontier QF, neighbors QN, and unvisited vertices at each level; unnecessary edge traversals are marked]
Details of NUMA-optimized Bottom-up
• On NUMA unit k, explores frontier queue QF_k from the unvisited vertices
• Appends adjacent vertices into the local neighbors QN_k
• The local queues are combined by an all-gather to form the frontier QF of the next level
[Figure: NUMA units 0 to 3, each holding its own QF_k (Level 1) and QN_k (Level 2) for k = 0 to 3; all-gather produces the Level 2 frontier QF]
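The bottom-up counterpart can be sketched the same way. Assumptions: the graph is undirected, so `adj[w]` doubles as the incoming-edge list of w, vertices are owned in contiguous blocks, and the loop over units stands in for parallel execution on each NUMA unit.

```python
def numa_bottom_up_step(adj, parts, in_frontier, parent):
    """Unit k scans only the unvisited vertices it owns; returns (next QF, local QN_k lists)."""
    n = len(adj)
    block = (n + parts - 1) // parts
    local_queues = []
    for k in range(parts):                              # one NUMA unit per iteration
        qn_k = []
        for w in range(k * block, min((k + 1) * block, n)):
            if parent[w] != -1:
                continue
            for v in adj[w]:                            # incoming edges of an owned vertex
                if in_frontier[v]:
                    parent[w] = v
                    qn_k.append(w)
                    break
        local_queues.append(qn_k)
    next_frontier = [w for qn_k in local_queues for w in qn_k]   # all-gather
    return next_frontier, local_queues

adj = [[1, 2], [0, 2, 3], [0, 1, 3], [1, 2, 4], [3]]   # hypothetical graph
parent = [0, 0, 0, -1, -1]
qf, queues = numa_bottom_up_step(adj, 2, [False, True, True, False, False], parent)
```

Only the frontier membership array is read across units; the adjacency lists, parent entries, and QN_k written by unit k all belong to its own block.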
Outline
1. Background
2. Breadth-first Search (BFS)
3. NUMA-optimized parallel BFS
4. Numerical Results
5. Conclusion
Machine specification
• 4-way Intel Xeon E5 (8-core Xeon E5 4640 per socket)
– CentOS 6.4 (Kernel 2.6.32)
– GCC 4.4.7
– 64 logical CPU cores
– 4 NUMA units x 16 logical cores
• 4-way AMD Opteron 6174 (12-core Opteron 6174 per socket)
– Fedora 19 (Kernel 3.11.2)
– GCC 4.8.1
– 48 CPU cores
– 8 NUMA units x 6 cores
[Figure: block diagrams of both systems showing processor cores with L1/L2 cache, shared L3 cache (Intel), RAM, and interconnect]
TEPS ratio varied with problem size
• Ours achieves 11.15 GTEPS for a Kronecker graph (SCALE 26): 67 million vertices and 1 billion edges
• Ours is 2.2x faster than the original hybrid algorithm
– Hybrid [Beamer2011, 2012]: 5.1 GTEPS on 4-way Intel Xeon E5-8870 (Westmere-EX arch.)
– Hybrid + NUMA (this paper): 11.15 GTEPS on 4-way Intel Xeon E5-4640 (SandyBridge-EP arch.)
[Figure: GTEPS (0 to 14, higher is better) vs. SCALE (20 to 29) for Hybrid and Hybrid + NUMA, with the peak performance marked]
Strong scaling on Intel/AMD system
• Scales well up to the number of threads equal to the number of cores
• 4-way Intel Xeon: 11.15 GTEPS
• 4-way AMD Opteron: 6.17 GTEPS
• Annotated speedups: 40 threads: x40; 64 threads: x28
[Figure: speedup vs. number of threads (1, 2, 4, 8, 12, 16, 24, 32, 48, 64) for ideal, 4-way SandyBridge-EP, and 4-way MagnyCours]
Twitter network (follow-ship network 2009)
• 41 million vertices and 1.47 billion edges
• An (i, j)-edge means a follow-ship between User i and User j
• Our NUMA-optimized BFS on the 4-way Xeon system: 180 ms / BFS → 8.1 GTEPS
• Six degrees of separation

Frontier size in BFS with source as User 21,804,357:

Lv      Frontier size   Freq. (%)   Cum. Freq. (%)
0                   1        0.00             0.00
1                   7        0.00             0.00
2               6,188        0.01             0.01
3             510,515        1.23             1.24
4          29,526,508       70.89            72.13
5          11,314,238       27.16            99.29
6             282,456        0.68            99.97
7              11,536        0.03           100.00
8                 673        0.00           100.00
9                  68        0.00           100.00
10                 19        0.00           100.00
11                 10        0.00           100.00
12                  5        0.00           100.00
13                  2        0.00           100.00
14                  2        0.00           100.00
15                  2        0.00           100.00
Total      41,652,230      100.00                -
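The frequency columns and the TEPS figure can be re-derived from the numbers on this slide. The sketch below uses the rounded inputs shown there (1.47 billion edges, 180 ms), so the TEPS value is only approximate.

```python
# Frontier sizes per BFS level, copied from the slide.
frontier_sizes = [1, 7, 6188, 510515, 29526508, 11314238, 282456, 11536,
                  673, 68, 19, 10, 5, 2, 2, 2]

def frequency_table(sizes):
    """Return (rows, total); each row is (level, size, freq %, cumulative freq %)."""
    total = sum(sizes)
    rows, running = [], 0
    for lv, size in enumerate(sizes):
        running += size
        rows.append((lv, size, 100 * size / total, 100 * running / total))
    return rows, total

rows, total = frequency_table(frontier_sizes)

# TEPS = traversed edges / BFS time, with the rounded values from the slide.
gteps = 1.47e9 / 0.180 / 1e9
```

Levels 4 and 5 together hold about 98% of the reached vertices, which is where the bottom-up direction pays off; the computed rate is a little above 8 GTEPS, in line with the reported 8.1 GTEPS.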
Graph500 benchmark
• Fastest single-node entry on the 4th list (June 2012)
• Fastest CPU-based single-node entry on the 6th list (June 2013)
[Figure: Graph500 list excerpts with our entries marked. Labels: 4-way Intel Xeon Westmere-EX; 4-way Intel Xeon SandyBridge-EP; 8.2 GTEPS; Rank 26; Rank 57; 11.1 GTEPS; Convey 4 FPGA + 2 CPU]
http://www.graph500.org
1st Green Graph500 list on June 2013
• Measures power efficiency using the TEPS/W ratio
• Results on various systems such as Android, Linux, and Mac
• Small Data category (ours)
– Rank 1: ASUS tablet TF700T, NVIDIA Tegra3 (4-core), Android NDK: 64.1 MTEPS/W (150 MTEPS)
– Rank 2: Intel NUC (Linux): 53.8 MTEPS/W (1.1 GTEPS)
– Rank 3: Mac mini: 53.5 MTEPS/W (1.9 GTEPS)
• Runs on NVIDIA Tegra3 and Intel/AMD arch. with the same source code
http://green.graph500.org
Conclusion
• NUMA-optimized hybrid BFS algorithm
– Reduces unnecessary edge traversals and remote RAM accesses by carefully considering NUMA
• Numerical results on 4-way Intel Xeon
– scales well up to 64 threads (scalable)
– achieves 11.15 GTEPS (fast)
– 2.2x speedup compared with the original hybrid algorithm
• Graph500 & Green Graph500
– Fastest single-node in June 2012
– Most power-efficient in June 2013
Future work
• Further optimizing the NUMA-optimized BFS
– BigData2013 version: 11 GTEPS
– Latest version: 26 GTEPS (2.4x faster)
• Distributed-memory parallel computation
[Figure: GTEPS (0 to 30) vs. SCALE (20 to 29) for Latest version and BigData2013]