Transcript
Page 1: Benchmarking Best Practices in the RAPS …...Contact: luis.kornblueh@zmaw.de CAS2K11 Slide 26 ECMWF RAPS forums Parallel I/O initiative This forum should be used for developing a

ECMWFSlide 1CAS2K11

Benchmarking Best Practices in the

RAPS Consortium

George Mozdzynski, ECMWF

RAPS Chairman

Page 2: Benchmarking Best Practices in the RAPS …...Contact: luis.kornblueh@zmaw.de CAS2K11 Slide 26 ECMWF RAPS forums Parallel I/O initiative This forum should be used for developing a

ECMWFSlide 2

Outline

Benchmark evolution

What is RAPS?

Benchmarking ‘best’ practices

RAPS futures

RAPS forum

CAS2K11

Page 3: Benchmarking Best Practices in the RAPS …...Contact: luis.kornblueh@zmaw.de CAS2K11 Slide 26 ECMWF RAPS forums Parallel I/O initiative This forum should be used for developing a

ECMWFSlide 3

Benchmark Evolution

CAS2K11

1970‟s 1980‟s 1990‟s 2000+

Whetstone/

Dhrystone

Full Application

benchmarks with

MPI + OpenMP

Linpack100 NAS Parallel

Benchmarks

SpecFP

Livermore Loops Ping-Pong/

Ping-Ping

IOzone

(top500)

HPL

homo benchmarkus

Page 4: Benchmarking Best Practices in the RAPS …...Contact: luis.kornblueh@zmaw.de CAS2K11 Slide 26 ECMWF RAPS forums Parallel I/O initiative This forum should be used for developing a

ECMWFSlide 4CAS2K11

What is RAPS?

Real Applications on Parallel Systems

European Software Initiative

RAPS Consortium (founded early 90’s)

Working group of hardware vendors

Programming model (MPI + F90/F95 + OpenMP)

“The partners of the RAPS Consortium develop portable

parallel versions of their production codes which are made

available to a Working Group of Hardware Vendors for

benchmarking and testing.”

Page 5: Benchmarking Best Practices in the RAPS …...Contact: luis.kornblueh@zmaw.de CAS2K11 Slide 26 ECMWF RAPS forums Parallel I/O initiative This forum should be used for developing a

ECMWFSlide 5CAS2K11

RAPS Consortium

CCLRC Fraunhofer SCAI/ITWM

CERFACS Met Office UK

CPTEC MPI-M

CSCS METEO-FRANCE

DWD NERC

DKRZ ONERA

ECMWF

Page 6: Benchmarking Best Practices in the RAPS …...Contact: luis.kornblueh@zmaw.de CAS2K11 Slide 26 ECMWF RAPS forums Parallel I/O initiative This forum should be used for developing a

ECMWFSlide 6CAS2K11

Working Group of

Hardware Vendors

Bull INTEL

Cray NEC

Fujitsu Oracle

HP SGI

IBM

Page 7: Benchmarking Best Practices in the RAPS …...Contact: luis.kornblueh@zmaw.de CAS2K11 Slide 26 ECMWF RAPS forums Parallel I/O initiative This forum should be used for developing a

ECMWFSlide 7CAS2K11

Why RAPS

Portability of codes (F90/F95/F2003, C/C++, MPI, OpenMP)

Availability of benchmark codes ahead of a formal procurement

RAPS community

- Some influence on vendors adopting standards

Regular meetings in past years (24 in total)

Page 8: Benchmarking Best Practices in the RAPS …...Contact: luis.kornblueh@zmaw.de CAS2K11 Slide 26 ECMWF RAPS forums Parallel I/O initiative This forum should be used for developing a

ECMWFSlide 8CAS2K11

RAPS process

RAPS benchmarks distributed by individual organizations

No official membership required for vendors

Vendors approach individual orgs for benchmarks

- Confidentiality / Licence agreement

Meetings

- Every 2 years as part of ECMWF ‘Use of HPC in Meteorology

workshop’

- Chairmanship with host organization (this will probably change)

Page 9: Benchmarking Best Practices in the RAPS …...Contact: luis.kornblueh@zmaw.de CAS2K11 Slide 26 ECMWF RAPS forums Parallel I/O initiative This forum should be used for developing a

ECMWFSlide 9CAS2K11

RAPS benchmarks

Today

- DWD

- ECMWF

- Met Office

- MPI-M

- Meteo-France

Commitment to produce up to date benchmarks reflecting key operational applications of consortium members

Page 10: Benchmarking Best Practices in the RAPS …...Contact: luis.kornblueh@zmaw.de CAS2K11 Slide 26 ECMWF RAPS forums Parallel I/O initiative This forum should be used for developing a

ECMWFSlide 10CAS2K11

Benchmark Best Practices

Using ECMWF’s IFS as a case study

Benchmark preparation

What are correct results?

Testing

- With different compilers

- On different systems

Page 11: Benchmarking Best Practices in the RAPS …...Contact: luis.kornblueh@zmaw.de CAS2K11 Slide 26 ECMWF RAPS forums Parallel I/O initiative This forum should be used for developing a

ECMWFSlide 11

Benchmark preparation

Benchmarks should be bug free

No formal way of proving a program is bug free

But we can do a lot to find bugs

- Using compiler options

- Using memory debugging tools

- Run combinations of number of MPI tasks and OpenMP threads

- Enforcing bit reproducible results in our applications

These approaches should be part of normal testing of each new cycle/release of the application and not just for benchmarking

- Doing this work post release is unproductive, back stitches …

CAS2K11

Page 12: Benchmarking Best Practices in the RAPS …...Contact: luis.kornblueh@zmaw.de CAS2K11 Slide 26 ECMWF RAPS forums Parallel I/O initiative This forum should be used for developing a

ECMWFSlide 12

Using compilers to find bugs /1

Compile with no optimization

- On IBM, –O0 or -qnooptimize

- Will pick up potential OpenMP bugs where some variables should

be PRIVATE but are by default SHARED

- We get away with it with higher compiler optimization as these

variables are more likely to be held in registers

- Remember a vendor benchmarker will probably revert to this if

they have problems with some ‘safe’ compiler options

- Object libraries built with no optimization are needed for use with

debuggers

- On IBM, for routines with OpenMP also need, –qsmp=omp:noopt

CAS2K11

Page 13: Benchmarking Best Practices in the RAPS …...Contact: luis.kornblueh@zmaw.de CAS2K11 Slide 26 ECMWF RAPS forums Parallel I/O initiative This forum should be used for developing a

ECMWFSlide 13

Using compilers to find bugs /2

Compile with subscript checking

- on IBM –qcheck or –C

- Can be done with normal (production) compiler optimization

- A few months of effort was initially required to get IFS to run with

compiler subscript checking (circa 2002)

- Many rewards of bugs detected since then

- Problem areas requiring resolution

Unused field index in array actual argument on subroutine call

Incorrect subroutine dummy argument size

- Resolution straightforward, combination of

Unused field indices set to 1 (IFS namelist variable NUNDEFLD=1)

Use of interface blocks, assumed shape arrays, modules

CAS2K11

Page 14: Benchmarking Best Practices in the RAPS …...Contact: luis.kornblueh@zmaw.de CAS2K11 Slide 26 ECMWF RAPS forums Parallel I/O initiative This forum should be used for developing a

ECMWFSlide 14

Using compilers to find bugs /3

Initialize arrays to NANS

- On IBM, -qinitauto=7FF7FFFF for automatic variables

- Plus some local code to intercept malloc’s and initialize to NANS

For IFS activated via environment variable EC_FILL_NANS=1

to catch all memory allocations (heap/stack)

- Finds the occasional bug where scalars or arrays are not initialized

correctly (or at all)

CAS2K11

Page 15: Benchmarking Best Practices in the RAPS …...Contact: luis.kornblueh@zmaw.de CAS2K11 Slide 26 ECMWF RAPS forums Parallel I/O initiative This forum should be used for developing a

ECMWFSlide 15

Using memory debugging tools to find bugs

CAS2K11

Page 16: Benchmarking Best Practices in the RAPS …...Contact: luis.kornblueh@zmaw.de CAS2K11 Slide 26 ECMWF RAPS forums Parallel I/O initiative This forum should be used for developing a

ECMWFSlide 16

Using memory debugging tools to

find bugs and leaks

Using RogueWave’s MemoryScape

Started from a Totalview debugging run (at ECMWF)

- Guard blocks

When memory is malloc’d and free’d

- Red Zones

Align each malloc on page boundary (last word)

Next page is RO, so SEGV on write attempt

Supported on Linux

Not supported on AIX

- Check for memory leaks

CAS2K11

Page 17: Benchmarking Best Practices in the RAPS …...Contact: luis.kornblueh@zmaw.de CAS2K11 Slide 26 ECMWF RAPS forums Parallel I/O initiative This forum should be used for developing a

ECMWFSlide 17

IFS RAPS11 model-only benchmark

Model resolutions T159, T399, T799, T1279 and T2049

Full outputs from IBM Power6 (all resolutions)

No output of model fields (i.e. not an I/O benchmark)

Reference job scripts

- 24 time steps

- test of correctness

Long run job scripts, for performance runs

- use same executable as for reference runs

Results ([SP,GP] norms) should be bit-reproducible when changing

- number of MPI tasks

- number of OpenMP threads

CAS2K11

Results of ERROR calculation (T159/Intel/pgi-10.6)

The error calculated from the results shows

that the calculations are correct

The maximum error is = 0.16518 %

Page 18: Benchmarking Best Practices in the RAPS …...Contact: luis.kornblueh@zmaw.de CAS2K11 Slide 26 ECMWF RAPS forums Parallel I/O initiative This forum should be used for developing a

ECMWFSlide 18

Maximum Error Correctness Check

(optimized libs v non-optimized libs on P6)

CAS2K11

0

1

2

3

4

5

6

7

1 7

13

19

25

31

37

43

49

55

61

67

73

79

85

91

97

103

109

115

121

127

133

139

145

151

157

163

169

175

181

187

193

199

205

211

217

223

229

235

Maxim

um

Err

or

%

Model Hours

T159 (125 km)

T1279 (16 km)

Maximum Error (all time steps)

computed from saved reference

norms and norms from running

model (at each atmospheric level)

using following spectral variables

-Vorticity

-Divergence

-Temperature

-Humidity

-Kinetic Energy

IFS CY37R3

(RAPS12)

Page 19: Benchmarking Best Practices in the RAPS …...Contact: luis.kornblueh@zmaw.de CAS2K11 Slide 26 ECMWF RAPS forums Parallel I/O initiative This forum should be used for developing a

ECMWFSlide 19

Maximum Error Correctness Check

(optimized libs v non-optimized libs on P6)

CAS2K11

0.0

0.2

0.4

0.6

0.8

1.0

1.2

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47

Maxim

um

Err

or

%

Model Hours

T159 (125 km)

T1279 (16 km)

T1279 TSTEP=600s

24 steps = 4 model hours

T159 TSTEP=3600s

24 steps = 24 hours

Acceptable max error is below 1 %

-at 24 model time-steps

-to keep run time short

Page 20: Benchmarking Best Practices in the RAPS …...Contact: luis.kornblueh@zmaw.de CAS2K11 Slide 26 ECMWF RAPS forums Parallel I/O initiative This forum should be used for developing a

ECMWFSlide 20

IFS RAPS11 benchmark DR_HOOK_OPT=prof output

provided for all model cases (T2047 extract)

CAS2K11

Name of the executable : /fws2/lb/work/rd/mpm/RAPS11/T2047/../bin/ifsMASTER

Number of MPI-tasks : 512

Number of OpenMP-threads : 8

Wall-times over all MPI-tasks (secs) : Min=1020.070, Max=1025.730, Avg=1022.972, StDev=1.261

Routines whose total time (i.e. sum) > 1.000 secs will be included in the listing

Avg-% Avg.time Min.time Max.time St.dev Imbal-% # of calls : Name of the routine

7.91% 80.951 71.799 86.287 1.554 16.79% 6545408 : MXMAOP

4.15% 42.488 39.259 47.132 1.530 16.70% 87748608 : CLOUDSC

3.65% 37.345 31.398 43.424 1.999 27.69% 44742137979 : CUADJTQ

3.62% 37.059 36.473 38.821 0.240 6.05% 350994432 : LAITRI

3.59% 36.672 35.747 38.017 0.342 5.97% 482818560 : VERINT

3.12% 31.937 27.864 36.514 1.297 23.69% 100352 : >MPL-TRMTOL_COMMS(807)

3.03% 30.991 15.474 70.559 10.875 78.07% 98816 : >MPL-TRLTOM_COMMS(806)

2.93% 29.964 3.977 71.212 12.580 94.42% 98816 : >MPL-TRGTOL_COMMS(803)

2.45% 25.066 12.663 37.822 5.504 66.52% 1611 : >MPL-IOSTREAMREAD_RECORD(650)

2.43% 24.853 8.287 33.075 3.370 74.94% 100352 : >MPL-TRLTOG_COMMS(805)

2.21% 22.597 21.537 23.729 0.336 9.24% 175497216 : LASCAW

2.03% 20.776 20.174 21.547 0.242 6.37% 87748608 : VDFMAIN

1.85% 18.952 18.393 20.062 0.245 8.32% 789737472 : LAITLI

1.70% 17.341 16.440 18.936 0.364 13.18% 87748608 : VDFEXCU

1.55% 15.812 13.110 22.851 1.057 42.63% 98816 : >BAR-BARRIERINSIGCHECK(718)

1.49% 15.241 6.446 25.483 4.342 74.70% 802816 : >OMP-FTINV_CTL(1639)

1.40% 14.324 9.447 23.023 1.886 58.97% 99840 : >MPL-SLCOMM1_COMMS(509)

1.39% 14.183 13.788 14.387 0.076 4.16% 175497216 : LARCHE

1.37% 14.056 8.137 25.275 2.968 67.81% 43874304 : CUBASEN

1.18% 12.059 1.922 25.705 5.039 92.52% 43874304 : CLOUDVAR

1.04% 10.594 6.716 16.324 1.017 58.86% 796672 : >OMP-LTINV_CTL-INVERSE(1647)

1.03% 10.583 10.194 11.688 0.258 12.78% 786432 : >OMP-WAMODEL2(1431)

0.97% 9.930 0.000 18.834 4.970 100.00% 1024 : >MPL-BROADCASTIOSTREAMGR(632)

0.94% 9.637 9.043 10.247 0.191 11.75% 790528 : >OMP-CPG1(1025)

0.93% 9.484 9.059 9.978 0.152 9.21% 1702400 : SRTM_SPCVRT_MCICA

0.90% 9.210 7.612 11.005 0.545 30.83% 790528 : >OMP-LTDIR_CTL-DIRECT(1645)

0.90% 9.193 8.827 9.538 0.145 7.45% 43874304 : CPDYDDH

Page 21: Benchmarking Best Practices in the RAPS …...Contact: luis.kornblueh@zmaw.de CAS2K11 Slide 26 ECMWF RAPS forums Parallel I/O initiative This forum should be used for developing a

ECMWFSlide 21

RAPS[10,12] Test of Adjoint

Correctness check before running 4D-Var

Test of the minimization (TL,AD) method

No observations required

Runs in seconds

CAS2K11

TEST OF THE ADJOINT

12345678901234567890

< F(X) , Y > = 0.29794005935791911810E+00

< X , F*(Y) > = 0.29794005935792916562E+00

THE DIFFERENCE IS 151.876 TIMES THE ZERO OF THE MACHINE

A difference value of

- a few hundred is perfectly acceptable

- a few thousand suggests a compiler „issue‟

(e.g. optimizations that replace divides by multiplies of the reciprocal)

When the difference displays a row of asterisks the test has definitely failed.

Page 22: Benchmarking Best Practices in the RAPS …...Contact: luis.kornblueh@zmaw.de CAS2K11 Slide 26 ECMWF RAPS forums Parallel I/O initiative This forum should be used for developing a

ECMWFSlide 22

RAPS10 4D-Var correctness check

CAS2K11

Page 23: Benchmarking Best Practices in the RAPS …...Contact: luis.kornblueh@zmaw.de CAS2K11 Slide 26 ECMWF RAPS forums Parallel I/O initiative This forum should be used for developing a

ECMWFSlide 23

0

1024

2048

3072

4096

5120

6144

7168

8192

9216

10240

11264

12288

13312

Sp

eed

up

User Threads

ideal

T2047L149

T1279L91

T799L91

T1279 Operations

IFS model speedup on Power6 (RAPS11)

CAS2K11

Page 24: Benchmarking Best Practices in the RAPS …...Contact: luis.kornblueh@zmaw.de CAS2K11 Slide 26 ECMWF RAPS forums Parallel I/O initiative This forum should be used for developing a

ECMWFSlide 24

RAPS futures

Improving scalability, maintainability and portability of our production applications

F90 / F95 / F2003 / C / C++ / MPI / OpenMP

- RAPS programming model today

F2008 (coarrays)

- Many thanks to Bob Numrich and John Reid

- To be adopted in the future

- Requires vendors to support

- Interested to hear F2008 plans from vendors

Use of standards are paramount

Thread checking software?

CAS2K11

Page 25: Benchmarking Best Practices in the RAPS …...Contact: luis.kornblueh@zmaw.de CAS2K11 Slide 26 ECMWF RAPS forums Parallel I/O initiative This forum should be used for developing a

ECMWFSlide 25

RAPS forum

http://raps.enes.org/

Click on register (in the top right hand corner), where you will be asked to provide some basic information (id/pw/email address) and submit your registration.

Registration could take up to a day to process (my experience was just 1 hour), after which time you will receive an email confirming that your account has been activated.

Contact: [email protected]

CAS2K11

Page 26: Benchmarking Best Practices in the RAPS …...Contact: luis.kornblueh@zmaw.de CAS2K11 Slide 26 ECMWF RAPS forums Parallel I/O initiative This forum should be used for developing a

ECMWFSlide 26

RAPS forums

Parallel I/O initiative

This forum should be used for developing a project for a portable parallel I/O library for NWP and climate modelling

Measurement tools initiative

Discuss the design and implementation of a common lightweight measurement tools library

Fortran 2003

Debate about Fortran 2003 usage

Fortran 2008

Debate about Fortran 2008 usage

CAS2K11

Page 27: Benchmarking Best Practices in the RAPS …...Contact: luis.kornblueh@zmaw.de CAS2K11 Slide 26 ECMWF RAPS forums Parallel I/O initiative This forum should be used for developing a

ECMWFSlide 27

Benchmark Evolution

CAS2K11

1970‟s 1980‟s 1990‟s 2000+

Whetstone/

Dhrystone

Full Application

benchmarks with

MPI + OpenMP

Linpack100 NAS Parallel

Benchmarks

SpecFP

Livermore Loops Ping-Pong/

Ping-Ping

IOzone

(top500)

HPL

homo benchmarkus

Page 28: Benchmarking Best Practices in the RAPS …...Contact: luis.kornblueh@zmaw.de CAS2K11 Slide 26 ECMWF RAPS forums Parallel I/O initiative This forum should be used for developing a

ECMWFSlide 28CAS2K11

Able to multi-task Limited multi-tasking skills

Page 29: Benchmarking Best Practices in the RAPS …...Contact: luis.kornblueh@zmaw.de CAS2K11 Slide 26 ECMWF RAPS forums Parallel I/O initiative This forum should be used for developing a

ECMWFSlide 29CAS2K11

Thank you for your

attention

QUESTIONS?