FUJITSU Supercomputer PRIMEHPC FX100...A mini-app. of the ab-initio quantum chemistry for the molecular electronic structure calculation 4.0x faster on FX100, close to the ratio of

Copyright 2015 FUJITSU LIMITED

FUJITSU Supercomputer PRIMEHPC FX100

0


The K computer and the evolution of PRIMEHPC

Fujitsu has been developing supercomputers over 30 years, and will continue its development to deliver the best application performance.

SPARC64 VIIIfx: 8 cores / 128 GF11 PF, 2010~

K computer PRIMEHPC FX10SPARC64 IXfx: 16 cores / 236.5 GF23 PF, 2012~

PRIMEHPC FX100SPARC64 XIfx: 32 cores / over 1 TFOver 100 PF, 2015~

C RIKEN

Post-K computer

1

PRIMEHPC FX100 design concept


Designed for massively parallel supercomputer system

Enhance and inherit K computer features・Many-core CPU-based architecture for application productivity

Introduce new technologies to Exascale computing

・High performance for a wide range of real applications

・HPC-ACE2 : Wide SIMD enhancements・Assistant cores : Dedicated cores for non-calculation operation

・Enhanced VISIMPACT (hardware barrier synchronization, sector cache, etc.)

・HMC : Leading-edge memory technology

2

SPARC64TM XIfx


Over 1 TF high performance processor・32 compute cores

HPC-ACE2: ISA enhancements・Two 256-bit wide SIMD units per core

Tofu2 interface

Tofu2 controller

L2 cache

L2 cache

HM

C interfa

ce

HM

C in

terf

ace

PCI interface

core core

core core

core core

core core

core core

core core

core core

core core

core

core

core

core

core

core

core

core

core

core

core

core

core

core

core

core

・2 assistant cores:

Assistant core

Assistant core

Daemons, IOs, non-blocking MPI functions, etc.

・Addressing mode (stride load/store, indirect load/store)・Cross lane operation (compress, permutation)

Offloading non-calculation operations

HMC support・480GB/s/node of theoretical memory throughput

3

Tofu interconnect 2


Enhanced Tofu interconnect

CPU-integrated interconnect controller・Reduced communication latency

Optical cable connection between chassis

・Improved packaging density and energy efficiency

・Highly scalable, 6-dimensional mesh/torus topology・Increased link bandwidth by 2.5 times to 12.5GB/s

・Enable flexible installation

4


Technical Computing Suite

Linux-based OS enhanced for FX100

Technical Computing Suite

Applications

PRIMEHPC FX100

Management software Programming Environment

High Performance File SystemFEFS

System management

Job management

Lustre-based distributed file system(enhanced for FX100)

MPI, OpenMP, XPFortran

Compilers(C,C++,Fortan)Mathematical libraries

Debugging and tuning tools

Enhanced software stack developed by FUJITSU

5

Performance highlights


Node performance is 2.9 to 4.7 times higher than previous generation.

Node performance normalized by FX10

0.01.02.03.04.05.0

FX10 FX100 FX10 FX100 FX10 FX100 FX10 FX100 FX10 FX100 FX10 FX100 FX10 FX100

HPL HPCG NPB FT NTChem Mini NICAM-DCMini

CCS QCD Mini UPACS-Lite

Benchmark programs Applications

Ratio of peak performance

3.1x2.9x3.3x4.0x

3.0x4.2x 4.7x

6

Results of major benchmarks - LINPACK

The first four FX100 systems have been delivered to customers, with >90% efficiency.


LINPACK performance and efficiency

Perf

orm

ance

3Pflops

2 Pflops

1 Pflops

0 Pflops

Effi

cien

cy

100 %

90%

80 %

RIKEN1080 nodes

MeteorologicalResearch Institute

1080 nodes

Japan AerospaceExploration Agency

1296 nodes

National Institutefor Fusion Science

2592 nodes

Performance (Pflops) Efficiency (%)

7

Results of major benchmarks - HPCG


Solves large sparse linear systems, using Conjugate Gradient method.

Enhanced memory bandwidth enables higher performance, keeping ease-of-use for programming.

Node Performance (Gflops/node)

0102030405060

Mira K computer FX10 FX100 Tianhe-2(MIC x3/node)

Titan(GPU x1/node)

TSUBAME 2.5(GPU x3/node)

3.0x

8

Results of major benchmarks – NPB FT


Node performance (Gflops/node) Breakdown of execution time

0.0

0.2

0.4

0.6

0.8

1.0

FX10 (16threads)

FX100 (32threads)

2-4 inst. commited1 inst. commitedwait (others)wait (instruction)wait (calculation)wait (cache)wait (memory)0

10

20

30

40

FX10 (16threads)

FX100 (32threads)

3.3x

Solves a 3D partial differential equation using FFT.

Improved by higher cache/memory throughput, as well as increased CPU cores and SIMD width.

9

Application performance - NTChem-MINI†


† https://github.com/fiber-miniapp/ntchem-mini

0

100

200

300

400

500

FX10(12 nodes)

FX100(6 nodes)

0.0

0.2

0.4

0.6

0.8

1.0

FX10(12 nodes)

FX100(6 nodes)

2-4 inst. commited1 inst. commitedwait (others)wait (instruction)wait (calculation)wait (cache)wait (memory)

4.0x

A mini-app. of the ab-initio quantum chemistry for the molecular electronic structure calculation

4.0x faster on FX100, close to the ratio of peak performance (4.4x)

Node performance (Gflops/node) Breakdown of execution time

10

Application performance - NICAM-DC-MINI†


† https://github.com/fiber-miniapp/nicam-dc-mini

0

10

20

30

40

FX10(8 nodes) FX100(4 nodes)0.0

0.2

0.4

0.6

0.8

1.0

FX10 (8 nodes) FX100(4 nodes)

2-4 inst. commited1 inst. commitedwait (others)wait (instruction)wait (calculation)wait (cache)wait (memory)

232.1 GB/s/node

74.2GB/s/node

2.9x

Node performance (Gflops/node) Breakdown of execution time ("vi_path2" routine)

A mini-app. derived from NICAM-DC, dynamic phase of NICAM (NonhydrostaticICosahedral Atmospheric Model)

3.1x higher memory throughput in “Vertical Implicit” calculation speeds up 1.5x with half the number of FX10 nodes.

1.5x

11

Application performance – CCS QCD-MINI†


† https://github.com/fiber-miniapp/ccs-qcd

0

30

60

90

120

Sector cachedisabled

Sector cacheenabled


Sector cacheenabled


Sector cacheenabled

FX10 (1 proc. 16 threads) FX100 (1 proc. X 16 threads) FX100 (1proc. X 32 threads)

3.1x

A mini-app. code for a lattice QCD problem (324 size)

Enhanced memory bandwidth boosts the performance and Sector Cache mechanism promotes data reuse on L2$.

Node performance (Gflops/node)

12

Application performance – UPACS-Lite


Calc. time per time step (sec)

Calculation of compressible fluid dynamics, 1603 grid points per node

Enhanced memory bandwidth and other CPU improvementsboost the performance

13

0.0

0.5

1.0

1.5

2.0

2.5

FX10 FX100

4.6x

32.2 GB/s/node

150.9 GB/s/node

(Courtesy of Japan Aerospace Exploration Agency)

Example: simulation for acoustic design of a launch pad

14

Documents

FUJITSU Supercomputer PRIMEHPC FX100...A mini-app. of the ab-initio quantum chemistry for the molecular electronic structure calculation 4.0x faster on FX100, close to the ratio of