IDRIS Site Update Pascal Voury, User Support Team Institut du développement et des ressources en informatique scientifique SPXXL / ScicomP – Lugano, May 2013


www.idris.fr Institut du développement et des ressources en informatique scientifique


IDRIS Location


History of Supercomputing at IDRIS


Presenter notes: Workhorse? (French: bêtes de somme, i.e. beasts of burden.)


Parallel scalar systems

• Cray T3D (1995)
• Cray T3E (1996)
• IBM SP3 (2001)
• IBM P4 (2002)
• IBM P4+ (2003)
• IBM BG/P (2008)
• IBM P6 (2009)
• IBM BG/Q
• IBM x3750

Vector systems

• Cray C98 (1993)
• Cray C94 (1994)
• Fujitsu VPP300 (1997)
• NEC SX-5 (2000)
• NEC SX-8 (2006)

RIP Feb 2012


[Figure: Evolution of performance. Vertical axis: Performance (Gflops), logarithmic scale from 0.1 to 100,000,000. Series: No. 1, No. 500, IDRIS (total), Earth Simulator.]

Slide provided by M.A. Foujols, IPSL, CNRS


Turing, IBM BG/Q


• 4 racks
• 64 nodes per I/O node
• Everything is then proportional: 65 TB memory, 65536 cores, 836 Tflops, etc.
• 2.2 PB disc shared with the x3750, in 5 DDN SFA10K cabinets (the 50 GB/s bandwidth can theoretically be saturated by 25 I/O nodes).
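The proportionality of the Turing figures can be checked with a few lines of arithmetic. The sketch below assumes the publicly documented Blue Gene/Q characteristics, none of which are stated on the slide: 1024 compute nodes per rack, 16 compute cores and 16 GB of memory per node, a 1.6 GHz clock and 8 double-precision flops per cycle per core. The function name `bgq_totals` is illustrative only.

```python
# Assumed Blue Gene/Q characteristics (not given on the slide).
NODES_PER_RACK = 1024
CORES_PER_NODE = 16
MEMORY_GB_PER_NODE = 16
CLOCK_GHZ = 1.6
FLOPS_PER_CYCLE = 8  # 4-wide double-precision FMA per core


def bgq_totals(racks, nodes_per_io_node=64):
    """Return aggregate figures for a BG/Q configuration of `racks` racks."""
    nodes = racks * NODES_PER_RACK
    cores = nodes * CORES_PER_NODE
    return {
        "nodes": nodes,
        "cores": cores,
        "io_nodes": nodes // nodes_per_io_node,
        "memory_gb": nodes * MEMORY_GB_PER_NODE,
        "peak_tflops": cores * CLOCK_GHZ * FLOPS_PER_CYCLE / 1000.0,
    }


if __name__ == "__main__":
    for key, value in bgq_totals(racks=4).items():
        print(f"{key}: {value}")
```

Under these assumptions, 4 racks give 65536 cores and 65536 GB of memory, which matches the slide. The computed peak is about 839 Tflops; the source does not explain the small gap from the 836 Tflops it quotes.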


Ada, IBM x3750-M4


• 332 nodes, each with 4 Sandy Bridge E5-4650 processors (8 cores @ 2.7 GHz)
• 28 nodes have 256 GB memory, all the others have 128 GB
• Roughly 49 TB memory and 233 Tflops
• 2 nodes for interactive login, with discs
• Plus 4 x3870-M5 Westmere nodes @ 2.67 GHz with 1 TB memory each (and discs) for pre- and post-processing
• GPFS 3.5, LoadLeveler 5.1, poe 1.2.12
• Mellanox InfiniBand FDR10 with a 648-port switch and a second level of 36-port switches (each node has 2 links to 2 switches)
• New for us: diskless nodes, optimization of the memory requirement for the OS image. Same HW, different SW stack. And another different set for post-processing.
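As a rough cross-check of the Ada totals, the sketch below recomputes core count, compute-node memory and peak performance from the node description on the slide. The peak figure assumes 8 double-precision flops per cycle per core (AVX on Sandy Bridge), which the slide does not state; the function name `ada_totals` is illustrative.

```python
# Figures taken from the slide.
NODES = 332
SOCKETS_PER_NODE = 4
CORES_PER_SOCKET = 8
CLOCK_GHZ = 2.7
LARGE_NODES, LARGE_GB = 28, 256
SMALL_GB = 128

# Assumption (not on the slide): 8 double-precision flops per cycle per core.
FLOPS_PER_CYCLE = 8


def ada_totals():
    """Return (cores, compute-node memory in GB, peak Tflops) for Ada."""
    cores = NODES * SOCKETS_PER_NODE * CORES_PER_SOCKET
    memory_gb = LARGE_NODES * LARGE_GB + (NODES - LARGE_NODES) * SMALL_GB
    peak_tflops = cores * CLOCK_GHZ * FLOPS_PER_CYCLE / 1000.0
    return cores, memory_gb, peak_tflops


if __name__ == "__main__":
    cores, memory_gb, peak_tflops = ada_totals()
    print(f"cores: {cores}")
    print(f"memory: {memory_gb} GB")
    print(f"peak: {peak_tflops:.1f} Tflops")
```

This yields 10624 cores, 46080 GB of compute-node memory and about 229 Tflops. The slide quotes roughly 49 TB and 233 Tflops; the source does not say how those figures were derived, so the differences are left unexplained here.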


Infrastructure


Presenter notes: 4 DDN racks.


The good, the bad, the ugly


• From a User Support point of view:
• The good: BG/Q is pretty stable, as BG/P was. We were surprised to learn that IBM would not correct the bugs in the software stack provided.
• The « not so good » for BG/Q: hardware problems not detected by IBM (NaNs, QCD code); change of I/O performance strategy vs. BG/P.
• The bad for x3750:
− RDMA engine halting the whole configuration: seems to have been solved for the last 2 weeks, thanks to Mellanox expertise.
− Latency of support for Intel problems. Lack of experience on our side (Power for 12 years!). For example: how to limit the RSS memory taken by an OpenMP job?
• The ugly for x3750: the poe environment is mandatory for performance on our Intel platform. Could be OK, but we still have bugs with poe that we don’t have using Intel MPI (Burning Issue).

Presenter notes: Switch to DATAGRAM (UD) mode instead of CONNECTED (CM) mode. OFED 1.5.3: CM better, with scalability issues (many to many). OFED 2.0: UD x2.


E5 4650 internal architecture :


Political Developments


• IDRIS no longer buys its own computers for the CNRS: GENCI does.

GENCI stands for Grand Equipement National pour le Calcul Intensif. It is 49% owned by the French State, represented by the Ministry for Higher Education and Research, 20% by CEA, 20% by CNRS, 10% by the universities and 1% by INRIA. Created in 2007, GENCI provides funding and assumes ownership. It also promotes the organization of a European HPC area and participates in its achievements; GENCI is the French representative in PRACE.

• IBM did not promote its Power architecture.
• No clear visibility yet on BG’s future.
• Why would GENCI still buy from IBM for an Intel-based computer?

Presenter notes: Catherine Riviere was appointed Chair of PRACE’s Council last year (for the next 2 years). She succeeded Prof. Dr. Achim Bachem, Chairman of the Board of Directors of Germany’s Jülich Research Centre.


Future Technical Developments


• Archive: the robot is fine. Do we need a new design for our system?
− HSM on one of the supercomputers
− « Classical » design, because of the limited financial envelope, with a disc cache as big as possible; SSD discarded because of the price
• Currently used as backup for result files in a batch; should it evolve to a pure archive system?
− 2.2 PB disc WORKDIR on the computers
− Increase capacity, even at the cost of increased latency.


Grand Challenges on Ada


• DRAKKAR, Climatology : NEMO ocean model, 5 Mh (4000 cores)

ORCA12 domain with its 3600 subdomains
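To give a sense of scale for figures such as "5 Mh (4000 cores)", the sketch below converts an allocation into elapsed time. It assumes that "Mh" means million core-hours and that the whole allocation runs on the quoted core count, neither of which the slide states explicitly; the helper name is illustrative.

```python
def wall_clock_hours(core_hours, cores):
    """Elapsed hours needed to consume `core_hours` with `cores` cores running continuously."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return core_hours / cores


if __name__ == "__main__":
    hours = wall_clock_hours(5_000_000, 4000)  # DRAKKAR figures from the slide
    print(f"{hours:.0f} hours, about {hours / 24:.0f} days")
```

Under these assumptions the DRAKKAR allocation corresponds to 1250 hours, or roughly 52 days, of continuous execution on 4000 cores.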


• DEUS-PUR, Astrophysics : 4 Mh (64 to 9000 cores)


• SELTRAN, Molecular Dynamics: custom GROMACS, 1.5 Mh (64 cores)
• LIQSIM, ab-initio Molecular Dynamics: CP2K, 0.3 Mh (512 cores)

[Figures: non-aqueous ionic solution; performance of GROMACS patched with the PLUMED plug-in]


Grand Challenges on Turing


• PrecLQCD, lattice QCD: 31.5 Mh (4 racks)
• StabMat, QCD & QED, proton weight: 30 Mh (2 racks to 4 racks, Juelich)
• BigDFT, ab-initio, Li-ion batteries: 16.6 Mh


• MesoNH, Tiwi Island « Hector » storm : 13 Mh (1 rack, Global Array I/Os)

Presenter notes: Comparison with airborne LIDAR. Mesh from 800 m down to 100 m, all realistic. Impact at a larger scale of a realistic simulation.


• ZoomBHA, astrophysics (Black Hole Accretion): 12 Mh

Presenter notes: Zooming in from galaxy to black hole for short periods of time (1 million years). More turbulence and chaos than expected; general models have to change.


• GYSELA, Tokamak plasmas: 11 Mh (100 000 threads, 4 racks)

[Figure: fluctuations of electrostatic potential when turbulence starts in plasma]

Presenter notes: ITER. 95% weak-scaling efficiency between half a rack and 4 racks.


• MHDTURB, RAMSES magnetohydrodynamics code: 11 Mh (2 racks)

Presenter notes: Again astrophysics, on accretion discs.


• APAFA, AVBP LES combustion : 10 Mh (2 racks, I/O perf. problems)


• ECOPREMS, LES combustion with stratification : 9.3 Mh

400 million tetrahedra

50 million tetrahedra


Our typical Workload


2012 figures for the decommissioned computers; very few changes (except political choices).

• x3750: 246 projects, 1100 individual users, 60 Mh allocated for 2013.

26 Million hours on Power6 in 2012

Presenter notes: 99% availability.


• BG/Q : 90 projects, 400 users, 297 Mh allocated for 2013.

245 Million hours on BG/P in 2012
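The 2013 allocations can be put in proportion to what each machine could deliver in a year. The sketch below is only an estimate: it assumes "Mh" means million core-hours, a full 8760-hour year of availability, 65536 cores for the BG/Q (as stated on the Turing slide) and 10624 cores for the x3750 (332 nodes × 4 sockets × 8 cores, derived from the Ada slide). The function name is illustrative.

```python
HOURS_PER_YEAR = 8760  # 365 days, assuming the machine is available all year


def allocated_fraction(allocated_mh, cores):
    """Fraction of a machine's yearly core-hour capacity covered by an allocation in Mh."""
    capacity_mh = cores * HOURS_PER_YEAR / 1e6
    return allocated_mh / capacity_mh


if __name__ == "__main__":
    # Allocation figures from the workload slides; core counts as described above.
    for name, mh, cores in [("x3750", 60, 10624), ("BG/Q", 297, 65536)]:
        print(f"{name}: {allocated_fraction(mh, cores):.1%} of yearly capacity")
```

With these assumptions the allocations correspond to roughly 64% of the x3750's yearly capacity and roughly 52% of the BG/Q's.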

Presenter notes: 290 Mh allocated, 99% availability. Fluid mechanics went from 70 to 120 Mh in a year (less QCD).


• Thank you for your attention.

• Questions ?