Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
Copyright 2015 FUJITSU LIMITED
FUJITSU Supercomputer PRIMEHPC FX100
0
Copyright 2015 FUJITSU LIMITED
The K computer and the evolution of PRIMEHPC
Fujitsu has been developing supercomputers over 30 years, and will continue its development to deliver the best application performance.
SPARC64 VIIIfx: 8 cores / 128 GF11 PF, 2010~
K computer PRIMEHPC FX10SPARC64 IXfx: 16 cores / 236.5 GF23 PF, 2012~
PRIMEHPC FX100SPARC64 XIfx: 32 cores / over 1 TFOver 100 PF, 2015~
C RIKEN
Post-K computer
1
PRIMEHPC FX100 design concept
Copyright 2015 FUJITSU LIMITED
Designed for massively parallel supercomputer system
Enhance and inherit K computer features・Many-core CPU-based architecture for application productivity
Introduce new technologies to Exascale computing
・High performance for a wide range of real applications
・HPC-ACE2 : Wide SIMD enhancements・Assistant cores : Dedicated cores for non-calculation operation
・Enhanced VISIMPACT (hardware barrier synchronization, sector cache, etc.)
・HMC : Leading-edge memory technology
2
SPARC64TM XIfx
Copyright 2015 FUJITSU LIMITED
Over 1 TF high performance processor・32 compute cores
HPC-ACE2: ISA enhancements・Two 256-bit wide SIMD units per core
Tofu2 interface
Tofu2 controller
L2 cache
L2 cache
HM
C interfa
ce
HM
C in
terf
ace
PCI interface
core core
core core
core core
core core
core core
core core
core core
core core
core
core
core
core
core
core
core
core
core
core
core
core
core
core
core
core
・2 assistant cores:
Assistant core
Assistant core
Daemons, IOs, non-blocking MPI functions, etc.
・Addressing mode (stride load/store, indirect load/store)・Cross lane operation (compress, permutation)
Offloading non-calculation operations
HMC support・480GB/s/node of theoretical memory throughput
3
Tofu interconnect 2
Copyright 2015 FUJITSU LIMITED
Enhanced Tofu interconnect
CPU-integrated interconnect controller・Reduced communication latency
Optical cable connection between chassis
・Improved packaging density and energy efficiency
・Highly scalable, 6-dimensional mesh/torus topology・Increased link bandwidth by 2.5 times to 12.5GB/s
・Enable flexible installation
4
Copyright 2015 FUJITSU LIMITED
Technical Computing Suite
Linux-based OS enhanced for FX100
Technical Computing Suite
Applications
PRIMEHPC FX100
Management software Programming Environment
High Performance File SystemFEFS
System management
Job management
Lustre-based distributed file system(enhanced for FX100)
MPI, OpenMP, XPFortran
Compilers(C,C++,Fortan)Mathematical libraries
Debugging and tuning tools
Enhanced software stack developed by FUJITSU
5
Performance highlights
Copyright 2015 FUJITSU LIMITED
Node performance is 2.9 to 4.7 times higher than previous generation.
Node performance normalized by FX10
0.01.02.03.04.05.0
FX10 FX100 FX10 FX100 FX10 FX100 FX10 FX100 FX10 FX100 FX10 FX100 FX10 FX100
HPL HPCG NPB FT NTChem Mini NICAM-DCMini
CCS QCD Mini UPACS-Lite
Benchmark programs Applications
Ratio of peak performance
3.1x2.9x3.3x4.0x
3.0x4.2x 4.7x
6
Results of major benchmarks - LINPACK
The first four FX100 systems have been delivered to customers, with >90% efficiency.
Copyright 2015 FUJITSU LIMITED
LINPACK performance and efficiency
Perf
orm
ance
3Pflops
2 Pflops
1 Pflops
0 Pflops
Effi
cien
cy
100 %
90%
80 %
RIKEN1080 nodes
MeteorologicalResearch Institute
1080 nodes
Japan AerospaceExploration Agency
1296 nodes
National Institutefor Fusion Science
2592 nodes
Performance (Pflops) Efficiency (%)
7
Results of major benchmarks - HPCG
Copyright 2015 FUJITSU LIMITED
Solves large sparse linear systems, using Conjugate Gradient method.
Enhanced memory bandwidth enables higher performance, keeping ease-of-use for programming.
Node Performance (Gflops/node)
0102030405060
Mira K computer FX10 FX100 Tianhe-2(MIC x3/node)
Titan(GPU x1/node)
TSUBAME 2.5(GPU x3/node)
3.0x
8
Results of major benchmarks – NPB FT
Copyright 2015 FUJITSU LIMITED
Node performance (Gflops/node) Breakdown of execution time
0.0
0.2
0.4
0.6
0.8
1.0
FX10 (16threads)
FX100 (32threads)
2-4 inst. commited1 inst. commitedwait (others)wait (instruction)wait (calculation)wait (cache)wait (memory)0
10
20
30
40
FX10 (16threads)
FX100 (32threads)
3.3x
Solves a 3D partial differential equation using FFT.
Improved by higher cache/memory throughput, as well as increased CPU cores and SIMD width.
9
Application performance - NTChem-MINI†
Copyright 2015 FUJITSU LIMITED
† https://github.com/fiber-miniapp/ntchem-mini
0
100
200
300
400
500
FX10(12 nodes)
FX100(6 nodes)
0.0
0.2
0.4
0.6
0.8
1.0
FX10(12 nodes)
FX100(6 nodes)
2-4 inst. commited1 inst. commitedwait (others)wait (instruction)wait (calculation)wait (cache)wait (memory)
4.0x
A mini-app. of the ab-initio quantum chemistry for the molecular electronic structure calculation
4.0x faster on FX100, close to the ratio of peak performance (4.4x)
Node performance (Gflops/node) Breakdown of execution time
10
Application performance - NICAM-DC-MINI†
Copyright 2015 FUJITSU LIMITED
† https://github.com/fiber-miniapp/nicam-dc-mini
0
10
20
30
40
FX10(8 nodes) FX100(4 nodes)0.0
0.2
0.4
0.6
0.8
1.0
FX10 (8 nodes) FX100(4 nodes)
2-4 inst. commited1 inst. commitedwait (others)wait (instruction)wait (calculation)wait (cache)wait (memory)
232.1 GB/s/node
74.2GB/s/node
2.9x
Node performance (Gflops/node) Breakdown of execution time ("vi_path2" routine)
A mini-app. derived from NICAM-DC, dynamic phase of NICAM (NonhydrostaticICosahedral Atmospheric Model)
3.1x higher memory throughput in “Vertical Implicit” calculation speeds up 1.5x with half the number of FX10 nodes.
1.5x
11
Application performance – CCS QCD-MINI†
Copyright 2015 FUJITSU LIMITED
† https://github.com/fiber-miniapp/ccs-qcd
0
30
60
90
120
Sector cachedisabled
Sector cacheenabled
Sector cachedisabled
Sector cacheenabled
Sector cachedisabled
Sector cacheenabled
FX10 (1 proc. 16 threads) FX100 (1 proc. X 16 threads) FX100 (1proc. X 32 threads)
3.1x
A mini-app. code for a lattice QCD problem (324 size)
Enhanced memory bandwidth boosts the performance and Sector Cache mechanism promotes data reuse on L2$.
Node performance (Gflops/node)
12
Application performance – UPACS-Lite
Copyright 2015 FUJITSU LIMITED
Calc. time per time step (sec)
Calculation of compressible fluid dynamics, 1603 grid points per node
Enhanced memory bandwidth and other CPU improvementsboost the performance
13
0.0
0.5
1.0
1.5
2.0
2.5
FX10 FX100
4.6x
32.2 GB/s/node
150.9 GB/s/node
(Courtesy of Japan Aerospace Exploration Agency)
Example: simulation for acoustic design of a launch pad
14