Advanced Course in High Performance
Computing
(Department of Computer Science)
also offered as a code-shared class of
High Performance Computing Technology
(Human Biology Program)
“Case study on supercomputers”
Lecture Note #9, 2015/12/09
(lectured by Taisuke Boku) taisuke@cs.tsukuba.ac.jp
TOP500 List
[Figure: TOP500 performance development over time, plotting the #1 system, the #500 system, and the sum of #1-#500. Source: http://www.top500.org]
• 1 million times faster in 20 years
• The #1 system drops off the Top500 list within about 10 years
• The #1 system reaches the current sum of the whole list in about 5 years
• 1 EXAFLOPS in 2020??
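As a back-of-envelope check (our addition, not from the lecture), these trends imply performance roughly doubles every year; a minimal sketch:

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    /* 1,000,000x in 20 years (TOP500 trend above) */
    double g20 = pow(1.0e6, 1.0 / 20.0);
    /* 10,000,000x in 30 years (historical chart below) */
    double g30 = pow(1.0e7, 1.0 / 30.0);
    printf("annual growth over 20 years: x%.2f\n", g20); /* ~x1.99 */
    printf("annual growth over 30 years: x%.2f\n", g30); /* ~x1.71 */
    return 0;
}
```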
Advance of supercomputers
• Improvement of the silicon process (Moore's law): x2 in 18 months
  – The speed of transistors no longer scales
• Innovations in architecture
  – 1976: Vector processor architecture (Cray-1)
  – 1980s: Rise of vector computers
  – 1990: Microprocessors and parallel architectures
  – 1990s: Rise of parallel computers
    (vector parallel, massively parallel, SMP)
[Figure: Performance (FLOPS) by year, 1970-2020, on a log scale from 1G through 1T and 1P to 1E: 30 years, 10 million times faster. Eras: Vector → MPP → Cluster → Accelerator. Systems plotted include ILLIAC IV, CRAY-1, 75APU, S810/20, SX-2, X-MP, YMP C90, Numerical Wind Tunnel, CM-5, SX-4, ASCI Red, ASCI White, SR8000, SX-8, Earth Simulator, BlueGene/L, Roadrunner, Tianhe-1A, TSUBAME2.0, K computer, Sequoia, and Titan, with Japanese supercomputers and the Univ. Tsukuba machines CP-PACS, PACS-CS, T2K-Tsukuba, and HA-PACS highlighted.]
Overview of HPC System (1)
• 100 MFLOPS to 1 GFLOPS (late '70s to early '80s)
  – Vector computers like the Cray-1
  – High performance from vector registers and high-bandwidth memory
  – World's first GFLOPS machine: NEC SX-2
• 10 GFLOPS to 100 GFLOPS (late '80s to early '90s)
  – Parallel vector pipelines
  – Vector computers with shared memory (several to several tens of nodes)
  – Beginning of MPP
  – 1996: #1-#3 in the Top500 are Japanese machines
• 1 TFLOPS (late '90s)
  – Massive parallelism with scalar microprocessors: the ASCI machines
  – World's first TFLOPS machine: SNL ASCI Red
NWT (Numerical Wind Tunnel) 1993
• National Aerospace Laboratory of Japan (current JAXA), developed in partnership with Fujitsu
• Vector-parallel architecture with distributed memory, connected by a crossbar network (166 nodes)
• #1 in the Top500 from 1993/11 to 1995/11 (280 GFLOPS)
• The VPP500 was developed based on the NWT and continued the vector-parallel architecture
CP-PACS
• CCS in U-Tsukuba, built by U-Tsukuba / Hitachi in 1996
• Became the world's fastest computer developed by a university (1996/11)
• Computer for computational physics
• Improved processor with pseudo-vector mechanism
• Base for the SR-2201
• 2048 CPUs, 614 GFLOPS
Overview of HPC System (2)
• 10 TFLOPS (early '00s)
  – ASCI machines
  – Earth Simulator (40 TFLOPS)
• 100 TFLOPS (mid '00s)
  – Massively parallel and energy saving: IBM BlueGene/L
• 1 PFLOPS ('08)
  – World's first PFLOPS machine: LANL Roadrunner
  – ORNL Jaguar
Earth Simulator
• JAMSTEC, built by NEC in 2002
• Vector-parallel computer
• TOP500 #1 from 2002/6 to 2004/6
• Large-scale weather simulation, etc.
• 8 vector processors share memory within a node; nodes are connected by a single crossbar network
• Base for the SX-6
• 5120 CPUs (640 nodes), 40 TFLOPS
Blue Gene/L
• Lawrence Livermore National Lab., built by IBM in 2005
• TOP500 #1 from 2004/11 to 2007/11
• Embedded low-power processors in very large numbers
• Particle simulation, fluid dynamics simulation, etc.
• 65536 CPUs, 360 TFLOPS
Roadrunner
• At Los Alamos National Lab., built by IBM in 2008
• World's first PFLOPS computer; TOP500 #1 from 2008/6 to 2009/6
• Hybrid cluster: each node combines Opteron processors and the IBM Cell Broadband Engine
• 129600 cores, 1.46 PFLOPS peak (Linpack: 1.11 PFLOPS)
TOP4 in Nov. 2014

Rank  Name        Country  Rmax (PFLOPS)  Node architecture                # of nodes  Power (MW)
1     Tianhe-2    China    33.9           2 Xeon (12C) + 3 Xeon Phi (57C)  16000       17.8
2     Titan       U.S.A.   17.6           1 Opteron (16C) + 1 K20X (14C)   18688       8.2
3     Sequoia     U.S.A.   17.2           1 PowerBQC (16C)                 98304       7.9
4     K computer  Japan    10.5           1 SPARC64 (8C)                   82944       12.7

The top 4 have remained unchanged for two years.

Tianhe-2 (天河-2)
• National University of Defense Technology, China
• Top500 2013/6 #1: 33.9 PFLOPS (efficiency 62%), 16000 nodes (2 Xeon (12-core) + 3 Xeon Phi (57-core) each), 17.8 MW, 1.9 GFLOPS/W
• TH Express-2 interconnect (performance comparable to 2x IB QDR (40 Gbps))
• CPU: Intel IvyBridge, 12 cores/chip, 2.2 GHz
• ACC: Intel Xeon Phi, 57 cores/chip, 1.1 GHz
• 162 racks (125 racks for compute, 128 nodes/rack)
Compute Node of Tianhe-2
CPU x 2: 2.2 GHz x 8 flop/cycle x 12 cores x 2 = 422.4 GFLOPS
MIC x 3: 1.1 GHz x 16 flop/cycle x 57 cores x 3 = 3.01 TFLOPS
System totals:
CPU: 422.4 GFLOPS x 16000 nodes = 6.75 PFLOPS; memory: 64 GB x 16000 = 1.0 PB
MIC: 3010 GFLOPS x 16000 nodes = 48.16 PFLOPS; memory: 8 GB x 3 x 16000 = 0.384 PB
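The peak figures above can be reproduced with a small helper (a sketch we add here; the constants come from the slide, the function name is ours):

```c
#include <stdio.h>

/* peak GFLOPS = clock (GHz) x flops/cycle x cores x chips */
static double peak_gflops(double ghz, int flops_per_cycle, int cores, int chips) {
    return ghz * flops_per_cycle * cores * chips;
}

int main(void) {
    double cpu = peak_gflops(2.2, 8, 12, 2);  /* 2 Xeon (12C):     422.4 GFLOPS */
    double mic = peak_gflops(1.1, 16, 57, 3); /* 3 Xeon Phi (57C): 3009.6 GFLOPS */
    printf("node:   %.1f GFLOPS CPU + %.1f GFLOPS MIC\n", cpu, mic);
    printf("system: %.2f PFLOPS CPU + %.2f PFLOPS MIC\n",
           cpu * 16000 / 1e6, mic * 16000 / 1e6); /* 6.76 + 48.15 PFLOPS */
    return 0;
}
```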
ORNL Titan
• Oak Ridge National Laboratory
• Cray XK7
• Top500 2012/11 #1: 17.6 PFLOPS (efficiency 65%), 18688 CPUs + 18688 GPUs, 8.2 MW, 2.14 GFLOPS/W
• Gemini interconnect (3D torus)
• CPU: AMD Opteron 6274, 16 cores/chip, 2.2 GHz
• GPU: nVIDIA K20X, 2688 CUDA cores (896 DP units)
• 200 racks
Node of XK7
CPU: 2.2 GHz x 4 flop/cycle x 16 cores = 140.8 GFLOPS
GPU: 1.46 GHz x 64 flop/cycle x 14 SMX = 1.31 TFLOPS
System totals:
CPU: 140.8 GFLOPS x 18688 nodes = 2.63 PFLOPS; memory: 32 GB x 18688 = 0.598 PB
GPU: 1310 GFLOPS x 18688 nodes = 24.48 PFLOPS; memory: 6 GB x 18688 = 0.112 PB
LLNL Sequoia
• Lawrence Livermore National Laboratory (LLNL)
• IBM BlueGene/Q
• Top500 2012/6 #1: 16.3 PFLOPS (efficiency 81%), 1.57 M cores, 7.89 MW, 2.07 GFLOPS/W
• 4 BlueGene/Q systems were listed in the top 10
• 18 cores/chip (16 used for computation), 1.6 GHz, 4-way SMT, 204.8 GFLOPS / 55 W, L2: 32 MB eDRAM, memory: 16 GB at 42.5 GB/s
• 32 chips/node board, 32 node boards/rack, 96 racks
Sequoia Organization
Core: 1.6 GHz x 8 flop/cycle = 12.8 GFLOPS
Chip: 12.8 GFLOPS x 16 cores = 204.8 GFLOPS
Node board: 204.8 GFLOPS x 32 chips = 6.5 TFLOPS; memory: 16 GB x 32 = 512 GB
Rack: 6.5 TFLOPS x 32 boards = 210 TFLOPS; memory: 512 GB x 32 = 16 TB
System: 210 TFLOPS x 96 racks = 20.1 PFLOPS; memory: 16 TB x 96 = 1.6 PB
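The same cascade can be written as a loop over the packaging hierarchy (our sketch; the level names and multipliers follow the slide above):

```c
#include <stdio.h>

int main(void) {
    const char *level[] = { "chip (x16 cores)", "node board (x32 chips)",
                            "rack (x32 boards)", "system (x96 racks)" };
    const int factor[]  = { 16, 32, 32, 96 };
    double gflops = 1.6 * 8; /* one core: 1.6 GHz x 8 flop/cycle = 12.8 GFLOPS */
    for (int i = 0; i < 4; i++) {
        gflops *= factor[i];
        printf("%-24s %12.1f GFLOPS\n", level[i], gflops);
    }
    /* last line: ~20.1e6 GFLOPS = 20.1 PFLOPS */
    return 0;
}
```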
K computer
• RIKEN, built by Fujitsu in 2012
• Each node has a SPARC64 VIIIfx CPU (8 cores) and a network chip (Tofu interconnect)
• TOP500 #1 2011/6 and 2011/11
• 705k cores, 10.5 PFLOPS (efficiency 93%)
• LINPACK power consumption: 12.7 MW
• Green500 #6 in 2011/6 (824 MFLOPS/W); Green500 #1 was a BlueGene/Q prototype at 2 GFLOPS/W (40 kW)
• 864 racks
Compute Nodes and Network
• Compute nodes (CPUs): 88,128 SPARC64 VIIIfx
• Number of cores: 705,024 (8 cores/CPU)
• Peak performance: 11.28 PFLOPS = 2 GHz x 8 flop/cycle x 8 cores x 88128
• Memory: 1.3 PB (16 GB/node)
• Logical 3-dimensional torus network for programming
  – Peak bandwidth: 5 GB/s x 2 for each direction of the logical 3D torus
  – Bisection bandwidth: > 30 TB/s
[Figure: compute node with CPU and ICC (interconnect chip). Courtesy of FUJITSU Ltd.]
Cluster computer
• Clusters in HPC
  – The old view: a cluster is the poor man's supercomputer
• From small to large scale
  – Cost effective (peak performance / cost)
  – Commodity processors and networks
• Platform for both general-purpose and application-specific use
  – 64-bit IA-32 (x86) with Linux
  – Accelerators attached via I/O
• Massively parallel
  – Scalar processors became fast, but requirements grow even faster
  – Large-scale networks built from commodity networks
  – Accelerators are easy to add
TOP500 List (2014/11)

Architecture  Count  System share (%)  Rmax (PFlops)  Rpeak (PFlops)  Cores (million)
Cluster       429    85.8              207.0          320.8           15.5
MPP           71     14.2              101.9          132.7           7.6
Total         500    100.0             308.9          453.5           23.1
Cores per socket

Cores  Count  System share (%)  Rmax (PFlops)  Cores (million)
1-4    17     3.4               4.0            0.42
6      56     11.2              20.0           1.77
8      232    46.4              96.3           7.01
10     87     17.4              34.4           2.10
12     57     11.4              71.8           5.63
14     7      1.4               4.8            0.15
16     44     8.8               77.5           6.07
Commodity CPU
• The high cost performance of COTS (Commodity Off-The-Shelf) parts changed supercomputers from special purpose to general purpose
  – Commodity: development cost is paid by general consumers worldwide
  – Vector computers / MPP: users pay the development cost
• The performance of a CPU comes from
  – Clock frequency
  – Multicore
  – SIMD
Commodity network
• Classical commodity network
  – Ethernet: 10Base, 100Base, 1000Base, 10GBase
  – High cost performance in bandwidth, but latency is insufficient
  – Basically tree topology, so low scalability
• SAN (System Area Network)
  – Myrinet, Infiniband, Quadrics, …
  – Both high bandwidth and low latency, but expensive
  – Clos networks, fat-tree networks
  – Costs have greatly decreased, and SAN is becoming commodity
  – On-board Infiniband NIC instead of on-board Ethernet NIC
Scalability in clusters
• Advances in general I/O buses
  – PCI → PCI-X → PCI-Express (Gen2 → Gen3)
  – Parallel links → multiple high-speed serial links
  – Direct links from the CPU: HyperTransport, QuickPath
• Hardware accelerators
  – ClearSpeed: TITECH TSUBAME1.2
  – Cell Broadband Engine: LANL Roadrunner
  – GRAPE: U-Tsukuba FIRST cluster
  – GPGPU: TSUBAME2.0, HA-PACS, …
Supercomputers in Japan (2014/11)

Rank  Name        Site                                                             Vendor           Rmax (GFlops)  Rpeak (GFlops)
4     K computer  RIKEN Advanced Institute for Computational Science (AICS)       Fujitsu          10510000       11280384
15    TSUBAME2.5  GSIC Center, Tokyo Institute of Technology                      NEC/HP           2785000        5735685
38    Helios      International Fusion Energy Research Centre (IFERC)             Bull SA          1237000        1524096
48    Oakleaf-FX  The University of Tokyo                                         Fujitsu          1043000        1135411
49    QUARTETTO   Kyushu University                                               Hitachi/Fujitsu  1018000        1502236
63    Aterui      National Astronomical Observatory of Japan                      Cray Inc.        801400         1058304
70    COMA        CCS, University of Tsukuba                                      Cray Inc.        745997         998502
86                Central Research Institute of Electric Power Industry (CRIEPI)  SGI              582100         670925
91    SAKURA      KEK                                                             IBM              536663         629146
92    HIMAWARI    KEK                                                             IBM              536663         629146
117   HA-PACS     CCS, University of Tsukuba                                      Cray Inc.        421600         778128
Trend of large-scale HPC clusters
• CPU performance
  – Multi-core (8-16 cores/CPU)
  – SIMD instructions (AVX: 8 flops/clock; see the sketch after this list)
• Interconnect
  – Infiniband (or another SAN) becomes popular in HPC, and its cost decreases
  – Large switches (648 ports/rack)
  – Optical cables
  – Infiniband SDR → DDR → QDR (40 Gbps) → FDR (56 Gbps) → EDR (100 Gbps)
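To illustrate how AVX reaches 8 flops/clock (our example, not from the lecture; compile with -mavx): one 4-wide double-precision multiply plus one 4-wide add issued per cycle gives 8 flops:

```c
#include <immintrin.h>
#include <stdio.h>

/* y[i] = a*x[i] + y[i], 4 doubles per iteration:
 * one 4-wide multiply + one 4-wide add = 8 flops per loop body,
 * matching "8 flops/clock" when both units issue every cycle.
 * n is assumed to be a multiple of 4. */
static void daxpy_avx(double a, const double *x, double *y, int n) {
    __m256d va = _mm256_set1_pd(a);
    for (int i = 0; i < n; i += 4) {
        __m256d vx = _mm256_loadu_pd(x + i);
        __m256d vy = _mm256_loadu_pd(y + i);
        _mm256_storeu_pd(y + i, _mm256_add_pd(_mm256_mul_pd(va, vx), vy));
    }
}

int main(void) {
    double x[8] = {1, 2, 3, 4, 5, 6, 7, 8}, y[8] = {0};
    daxpy_avx(2.0, x, y, 8);
    printf("%g %g\n", y[0], y[7]); /* prints: 2 16 */
    return 0;
}
```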
Heterogeneous Computing Platform
• Parallel systems whose nodes combine a CPU and an accelerator
  – Basic style: cluster nodes equipped with accelerators
    • ClearSpeed
    • GRAPE
    • Cell Broadband Engine
    • GPGPU (General-Purpose Graphics Processing Unit)
    • Xeon Phi (many-core processor)
  – Accelerators used to serve only special-purpose applications
  – 2008/06: Roadrunner at LANL achieved over 1 PFLOPS in LINPACK with an Opteron + Cell BE heterogeneous configuration (peak: 1.3 PFLOPS)
  – "Hybrid computing" refers to combining shared memory and distributed memory; "heterogeneous computing" is used here for combining CPUs and accelerators
TSUBAME2.0
• TITECH, built by NEC/HP in 2010
• Each node has Xeon X5670 (6 cores) x2 and GPU (Nvidia M2050) x3, connected by Infiniband QDR
• TOP500 #4 2010/11
• 1400 nodes, 1.2 PFLOPS
• Linpack power consumption: 1.4 MW
• Green500 #2 2010/11 (958 MFLOPS/W)
• TSUBAME2.5: GPUs upgraded to K20X, TOP500 #11 2013/11
• TSUBAME-KFC: 4.5 GFLOPS/W, Green500 #1 2013/11
Problems of commodity CPUs
• Computation performance
  – CPU frequency is capped, so sequential performance will not improve
  – Process technology still advances, and the number of gates will keep increasing (until ??)
  – Peak performance grows through increasing core counts
• Memory bandwidth (rich vector vs. poor scalar)
  – The gap between CPU FLOPS and memory bandwidth keeps widening
  – Hence the gap between peak performance (LINPACK etc.) and effective performance (non-cache-aware programs)
Balance of CPU : Memory : Network

System           C : M : N = GFLOPS : GB/s : GB/s  C : M : N (M = 1.0)
CP-PACS          0.3 : 1.2 : 0.3                   0.25 : 1 : 0.25
Earth Simulator  64 : 256 : 12.5                   0.25 : 1 : 0.05
PACS-CS          5.6 : 6.4 : 0.75                  0.90 : 1 : 0.1
T2K-Tsukuba      147.2 : 42.7 : 8                  3.50 : 1 : 0.2

Small C and large N relative to M are good for bandwidth.
Classical vector computers sustained "4 Byte/FLOP" (C : M = 0.25 : 1 in the table).
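A small sketch (ours) reproducing the normalized column and the Byte/FLOP figure from the raw C, M, N values:

```c
#include <stdio.h>

struct balance { const char *name; double c, m, n; }; /* GFLOPS, GB/s, GB/s */

int main(void) {
    struct balance sys[] = {
        { "CP-PACS",           0.3,   1.2,  0.3  },
        { "Earth Simulator",  64.0, 256.0, 12.5  },
        { "PACS-CS",           5.6,   6.4,  0.75 },
        { "T2K-Tsukuba",     147.2,  42.7,  8.0  },
    };
    for (int i = 0; i < 4; i++) {
        /* normalize so M = 1.0; M/C is the Byte/FLOP ratio */
        printf("%-16s C:M:N = %.2f : 1 : %.2f  (%.2f Byte/FLOP)\n",
               sys[i].name, sys[i].c / sys[i].m, sys[i].n / sys[i].m,
               sys[i].m / sys[i].c);
    }
    return 0;
}
```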
Bottleneck of memory bandwidth
• HPC applications require memory bandwidth
  – Fluid dynamics, weather forecasting
  – QCD (Quantum Chromodynamics)
  – FFT
  – Particle simulations with long-distance interactions
  – …
• These applications do not fit the current multi-core architecture
  – The required Byte/FLOP ratio is not provided
  – On-chip storage such as caches and registers cannot hold the applications' working sets
• The key is data localization, but it is not always applicable
  – Cache tuning (blocking, cache-awareness; see the sketch after this list)
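As an illustration of the cache blocking mentioned above (a generic sketch of the technique, not code from the lecture):

```c
#include <stdio.h>

#define N 512
#define B 64  /* tile size, chosen so three B x B tiles fit in cache */

static double a[N][N], b[N][N], c[N][N];

/* Cache-blocked matrix multiply: work on B x B tiles so that each tile
 * of a, b, and c is reused while it is still in cache, raising the
 * flops performed per byte moved from main memory. */
static void matmul_blocked(void) {
    for (int ii = 0; ii < N; ii += B)
        for (int kk = 0; kk < N; kk += B)
            for (int jj = 0; jj < N; jj += B)
                for (int i = ii; i < ii + B; i++)
                    for (int k = kk; k < kk + B; k++) {
                        double aik = a[i][k];
                        for (int j = jj; j < jj + B; j++)
                            c[i][j] += aik * b[k][j];
                    }
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { a[i][j] = 1.0; b[i][j] = 1.0; }
    matmul_blocked();
    printf("c[0][0] = %g (expected %d)\n", c[0][0], N);
    return 0;
}
```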
Future of system architecture
• Cluster systems
  – Multi-core / multi-socket / multi-NIC
    • Programming and performance tuning become more complicated because of the memory hierarchy and multiple network interfaces
    • Programming for a hybrid architecture of shared memory (multi-core + multi-socket) and distributed memory (interconnect)
      → currently the user must program explicitly with shared memory (e.g. OpenMP) and distributed memory (e.g. MPI); see the skeleton after this slide
      → new programming paradigms and compiler technology are required
  – Larger clusters are possible, but there are several limitations
    • Space: especially in Japan
    • Power: power per core decreases, but power per node stays almost fixed
    • Cooling: cheap clusters become a bottleneck, …
→ Is there a limit to performance improvement by large-scale clusters?
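A minimal hybrid MPI + OpenMP skeleton of the kind referred to above (our sketch; assumes an MPI library and OpenMP support, built with e.g. mpicc -fopenmp):

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank, nprocs;
    /* distributed memory across nodes: MPI, with thread support */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* shared memory inside a node: OpenMP threads over the cores */
    #pragma omp parallel
    {
        #pragma omp single
        printf("rank %d of %d runs %d threads\n",
               rank, nprocs, omp_get_num_threads());
    }

    /* combine per-process results across the whole cluster */
    double local = 1.0, global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0) printf("sum over ranks = %g\n", global);
    MPI_Finalize();
    return 0;
}
```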
Expectations for accelerators
• From special purpose to (semi-)general purpose
  – While GRAPE had special arithmetic units, SIMD instructions are general arithmetic pipelines
  – For GPGPU, standard programming tools are available (nVIDIA CUDA, the PGI compiler for GPGPU)
  – Better effective performance per watt than CPUs:
    • Intel Xeon E5645: 57.6 GFLOPS / 80 W = 0.72 GFLOPS/W
    • nVIDIA M2090: 665 GFLOPS / 225 W = 3.0 GFLOPS/W (about 4x the CPU)
    • Intel SandyBridge: 160 GFLOPS / 115 W = 1.4 GFLOPS/W
• Memory access bandwidth
  – GPGPU has high memory bandwidth
    • nVIDIA M2090: 177 GB/s (ECC off)
    • Intel Xeon E5645: 32 GB/s; Intel SandyBridge: 51.2 GB/s
Problems of current accelerators
• Bandwidth between CPU and device
  – Attached via an external bus, usually PCI-Express
    → PCI-E Gen2 x16: 8 GB/s, less than one channel of DDR3
  – Ex.: a GPGPU runs at over 70 GFLOPS on the Himeno benchmark
    → but only within a single GPGPU
    → performance degrades greatly when memory transfers are needed between CPU and GPU, or between GPUs when using multiple GPUs per node
• Bandwidth between GPUs
  – Data transfer between GPUs on different nodes requires three hops of memory copy
• Registers, on-chip memory, and SIMD instructions also somewhat limit applications
Future of parallel computer systems
• Largest problem: power consumption
  – To realize Exa-FLOPS, power consumption per core has to be greatly reduced
    • The semiconductor process is reaching its limits
    • Leakage current becomes relatively larger, which limits DVFS (Dynamic Voltage and Frequency Scaling)
    • Breakthroughs are needed at the semiconductor process and circuit levels
• Large-scale interconnection networks
  – The fat tree used in current clusters will reach its limit
    • Direct networks: exploiting neighbor communication reduces power
    • Applications must improve their algorithms to avoid long-distance communication
• Reliability
  – Dependability: MTBF with millions of processors gets worse → dynamic fault tolerance is required
  – Failure-recovery technology for massive parallelism, beyond the classical checkpoint/restart method
Green500 in Nov. 2014
[Figure: Green500 list as of Nov. 2014]
Supercomputers in CCS
• PAX series
– CP-PACS
– PACS-CS
• FIRST
• T2K Tsukuba
• HA-PACS
– base cluster
– TCA
History of the parallel computer PAX (PACS) series at U-Tsukuba
• Started by Profs. Hoshino and Kawai
• Cooperation between computational scientists and computer engineers
• Target performance driven by applications
• Continuous development with accumulated experience
• Service out in last Sep.

Year  Name     Performance
1978  PACS-9   7 KFLOPS
1980  PAXS-32  500 KFLOPS
1983  PAX-128  4 MFLOPS
1989  QCDPAX   14 GFLOPS
1996  CP-PACS  614 GFLOPS
2006  PACS-CS  14.3 TFLOPS
2012  HA-PACS  800 TFLOPS
2014  COMA     1.001 PFLOPS
CP-PACS (1996, Univ. Tsukuba)
• First large-scale massively parallel supercomputer developed in Japan
  – Scalar processors with pseudo-vector mechanism
  – Flexible and high-performance network
• Collaboration between physics and computer science
• Collaboration between university and vendor (Hitachi); Hitachi developed the SR-2201 based on CP-PACS
• Scientific breakthroughs in particle physics and astrophysics
  – First-principles calculation for QCD
  – General simulation models for fields (fluid, electromagnetic field, wave function, etc.)
Architecture of CP-PACS
[Figure: pseudo-vector processor and 3D hyper-crossbar network]
PACS-CS
• CCS in U-Tsukuba, built by U-Tsukuba / Hitachi / Fujitsu in 2006
• PC cluster + multi-link Ethernet (3D hyper-crossbar)
• For the computational sciences
• 2560 CPUs, 14.3 TFLOPS
• Service ended in September 2011
[Figures: unit chassis (19-inch x 1U); 3D hyper-crossbar network, 16 x 16 x 10 = 2560 nodes]
FIRST
• CCS in U-Tsukuba, built by U-Tsukuba / HP / Hamamatsu Metrics in 2005
• Hybrid cluster: each node has a dual-socket Xeon and a GRAPE-6 board (Blade-GRAPE)
• For computational astrophysics
• 256 nodes / 512 cores + 1024 GRAPE-6 chips
• Host: 3.1 TFLOPS; Blade-GRAPE: 33 TFLOPS
Blade GRAPE
An accelerator computing gravity for FIRST (embedded in each node)
• Full-size PCI card occupying 2 PCI slots
• 10-layer board
• 4 GRAPE-6 chips = 136.8 GFLOPS
• Electric power: 54 W
• Memory: 16 MB (260K particles)
• Implemented by Hamamatsu Metrics Co.
Interconnection Network
[Figure: interconnection network]
T2K Tsukuba
• CCS in U-Tsukuba, built by Appro International + Cray Japan in 2008
• Commodity-based PC cluster with high node performance and network performance
• For the computational sciences
• 648 nodes = 10368 CPU cores, 95 TFLOPS
• Service ended in February 2014
• T2K Open Supercomputer Alliance: University of Tsukuba, University of Tokyo, Kyoto University
Computation node and file server
Computation nodes (70 racks):
• 648 nodes (quad-core x 4 sockets / node), Opteron Barcelona B8000 CPUs
• 2.3 GHz x 4 flop/cycle x 4 cores x 4 sockets = 147.2 GFLOPS / node = 95.3 TFLOPS / system
• 20.8 TB memory / system
File server (disk array only):
• 800 TB (physical 1 PB), RAID-6
• Lustre cluster file system over Infiniband x2
• Dual MDS and OSS configuration → high reliability
[Figures: block diagram of a node; Infiniband 4xDDR x 4-rail fat-tree network]
HA-PACS
• CCS in U-Tsukuba, built by Appro in 2012
• Commodity-based PC cluster with multiple GPU accelerators per node
• For the computational sciences
• 268 nodes = 4288 CPU cores + 1072 GPUs, 802 (= 89 + 713) TFLOPS
• 40 TB of memory
Rack configuration (26 racks in total)
• Computation nodes: Appro Green Blade 8204 (8U enclosure, 4 nodes); 268 nodes (67 enclosures / 23 racks), 800 TFLOPS
• Interconnect: Mellanox IS5300 (QDR IB, 288 ports) x 2
• Login/management nodes: Appro Green Blade 8203 x 8, 10 GbE I/F
• Storage: DDN SFA10000 connected by QDR IB, Lustre file system, user area: 504 TB
Computation node (HA-PACS base cluster): 3 TFLOPS total per node
• GPU: 665 GFLOPS x 4 = 2660 GFLOPS
• CPU: 20.8 GFLOPS x 16 cores = 332.8 GFLOPS (2.6 GHz x 8 flop/clock with AVX)
• Memory: (16 GB, 12.8 GB/s) x 8 = 128 GB, 102.4 GB/s
• GPU memory: (6 GB, 177 GB/s) x 4 = 24 GB, 708 GB/s
• CPU-GPU link: 8 GB/s
HA-PACS Project
• HA-PACS (Highly Accelerated Parallel Advanced system for Computational Sciences)
  – 8th generation of the PAX/PACS series of supercomputers
• Promotion of computational science applications in key areas of our Center
  – Target fields: QCD, astrophysics, QM/MM (quantum mechanics / molecular mechanics, bioscience)
• HA-PACS is not only a "commodity GPU-accelerated PC cluster" but also an experimental platform for direct communication among accelerators
• Two parts
  – HA-PACS base cluster: for developing GPU-accelerated code for the target fields and performing production runs with it; in operation since Feb. 2012
  – HA-PACS/TCA (TCA = Tightly Coupled Accelerators): for elementary research on new technology for accelerated computing, with our original communication chip named "PEACH2"; 64 nodes installed in Oct. 2013
HA-PACS/TCA (Tightly Coupled Accelerators)
• PEACH2
  – 4 ports of PCI Express Gen2 x8 lanes
  – Direct connection between accelerators (GPUs) across nodes
  – Hardwired main data path and PCIe interface fabric
• True GPU-direct communication
  – Current GPU clusters require 3-hop communication (3-5 memory copies)
  – For strong scaling, an inter-GPU direct communication protocol is needed for lower latency and higher throughput
[Figure: two nodes, each with CPUs, memory, GPUs, and an IB HCA attached to an IB switch; PEACH2 chips link the nodes' PCIe fabrics directly.]
PEACH2 board (production version for HA-PACS/TCA)
• Main board + sub board; most parts operate at 250 MHz (the PCIe Gen2 logic runs at 250 MHz)
• PCI Express x8 card edge
• Power supply for the various voltages
• DDR3-SDRAM
• FPGA (Altera Stratix IV 530GX)
• PCIe x16 cable connector and PCIe x8 cable connector
TCA node structure
• The CPU can uniformly access all GPUs (single PCIe address space)
• PEACH2 can access every GPU
  – Kepler architecture + CUDA 5.0 "GPUDirect Support for RDMA"
  – Performance over QPI is quite bad → supported only for GPU0 and GPU1
• PEACH2 connects among 3 nodes
• This configuration is the same as the HA-PACS base cluster except for PEACH2
  – All 80 PCIe lanes provided by the two CPUs are used
[Figure: node block diagram: two Xeon E5 CPUs joined by QPI; four NVIDIA K20X GPUs (Kepler architecture) on PCIe Gen2 x16; PEACH2 on PCIe Gen2 x8; IB HCA on PCIe Gen3 x8.]
HA-PACS System
[Figure: base cluster racks and the TCA part (TCA: 5 racks x 2 rows)]
HA-PACS/TCA (computation node): 5.688 TFLOPS total per node
• GPU: 1.31 TFLOPS x 4 = 5.24 TFLOPS (4x NVIDIA K20X, PCIe Gen2 x16 each)
• CPU: 22.4 GFLOPS x 20 cores = 448.0 GFLOPS (2 Ivy Bridge sockets, 2.8 GHz x 8 flop/clock with AVX)
• Memory: (16 GB, 14.9 GB/s) x 8 = 128 GB, 119.4 GB/s (4 channels of DDR3-1866 per socket, 59.7 GB/s each)
• GPU memory: (6 GB, 250 GB/s) x 4 = 24 GB, 1 TB/s
• PEACH2 board (TCA interconnect) on PCIe Gen2 x8 x 3; CPU-GPU links 8 GB/s; plus legacy devices
HA-PACS Total System
• InfiniBand QDR 40 ports x 2 channels connect the base cluster and TCA
• Base cluster: 268 nodes on two InfiniBand QDR 324-port switches
  – 421 TFLOPS, efficiency 54%; #41 2012/6 and #73 2013/11 Top500; 1.15 GFLOPS/W
• HA-PACS/TCA: 64 nodes on two InfiniBand QDR 108-port switches
  – 277 TFLOPS, efficiency 76%; #134 2013/11 Top500; 3.52 GFLOPS/W, #3 2013/11 Green500
• Shared Lustre filesystem
COMA (PACS-IX)
• Installed 2014/4
• Cray CS300
• Intel Xeon Phi (KNC: Knights Corner)
• 393 nodes (2 Xeon E5-2670v2 + 2 Xeon Phi 7110P each)
• Mellanox InfiniBand FDR, fat tree
• File server: DDN, 1.5 PB (RAID6 + Lustre)
• 1.001 PFLOPS peak (HPL: 746 TFLOPS), June '14 TOP500 #51
• HPL efficiency 74.7%
What is COMA?
• Cluster Of Many-core Architecture processors
• COMA = the Coma Cluster
  – a large cluster of galaxies containing over 1,000 identified galaxies
  – a galaxy (a cluster of stars) = the many cores; a cluster of galaxies = the cluster
• COMA is also the 9th machine of the PACS series, so it is also called "PACS-IX"
COMA (PACS-IX) compute node (Cray 1U 1027GR)
[Figure: node photo showing Intel Xeon Phi 7110P, power supply, Intel Xeon E5-2670v2 (IvyBridge core), Mellanox Connect-X3 IB FDR, and SATA HDD (3.5-inch 1 TB x2)]
COMA node configuration
[Figure: two Xeon E5 CPUs joined by QPI; MIC0 and MIC1 (labelled 1054 each) attached via PCIe Gen2 x16; IB HCA on PCIe Gen3 x8]
Power consumption
• In LINPACK (HPL) power efficiency, COMA (2014) is about 20 times as efficient as T2K-Tsukuba (2008), 6 years apart
• What is the efficiency for real applications?

System             LINPACK (TF)  LINPACK power (kW)  Avg. power (kW)
T2K-Tsukuba        76.5          671.8               420
HA-PACS (GPU)      421           407.3               250
HA-PACS/TCA (GPU)  277           93.0                34
COMA (MIC)         746           264.8               215 (avg. usage 60%)
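A quick sketch (ours) deriving the HPL power efficiency in GFLOPS/W from the table; the exact COMA-to-T2K ratio comes out near 25x, consistent with the rough 20x claim above:

```c
#include <stdio.h>

int main(void) {
    /* LINPACK TFLOPS and LINPACK power in kW, from the table above */
    const char *name[] = { "T2K-Tsukuba", "HA-PACS (GPU)",
                           "HA-PACS/TCA (GPU)", "COMA (MIC)" };
    double tf[] = {  76.5, 421.0, 277.0, 746.0 };
    double kw[] = { 671.8, 407.3,  93.0, 264.8 };
    double eff[4];
    for (int i = 0; i < 4; i++) {
        eff[i] = tf[i] / kw[i]; /* TFLOPS/kW == GFLOPS/W */
        printf("%-18s %.2f GFLOPS/W\n", name[i], eff[i]);
    }
    printf("COMA vs. T2K-Tsukuba: x%.1f\n", eff[3] / eff[0]); /* ~x24.7 */
    return 0;
}
```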