ECMWFSlide 1CAS2K11
Benchmarking Best Practices in the
RAPS Consortium
George Mozdzynski, ECMWF
RAPS Chairman
ECMWFSlide 2
Outline
Benchmark evolution
What is RAPS?
Benchmarking ‘best’ practices
RAPS futures
RAPS forum
CAS2K11
ECMWFSlide 3
Benchmark Evolution
CAS2K11
1970‟s 1980‟s 1990‟s 2000+
Whetstone/
Dhrystone
Full Application
benchmarks with
MPI + OpenMP
Linpack100 NAS Parallel
Benchmarks
SpecFP
Livermore Loops Ping-Pong/
Ping-Ping
IOzone
(top500)
HPL
homo benchmarkus
ECMWFSlide 4CAS2K11
What is RAPS?
Real Applications on Parallel Systems
European Software Initiative
RAPS Consortium (founded early 90’s)
Working group of hardware vendors
Programming model (MPI + F90/F95 + OpenMP)
“The partners of the RAPS Consortium develop portable
parallel versions of their production codes which are made
available to a Working Group of Hardware Vendors for
benchmarking and testing.”
ECMWFSlide 5CAS2K11
RAPS Consortium
CCLRC Fraunhofer SCAI/ITWM
CERFACS Met Office UK
CPTEC MPI-M
CSCS METEO-FRANCE
DWD NERC
DKRZ ONERA
ECMWF
ECMWFSlide 6CAS2K11
Working Group of
Hardware Vendors
Bull INTEL
Cray NEC
Fujitsu Oracle
HP SGI
IBM
ECMWFSlide 7CAS2K11
Why RAPS
Portability of codes (F90/F95/F2003, C/C++, MPI, OpenMP)
Availability of benchmark codes ahead of a formal procurement
RAPS community
- Some influence on vendors adopting standards
Regular meetings in past years (24 in total)
ECMWFSlide 8CAS2K11
RAPS process
RAPS benchmarks distributed by individual organizations
No official membership required for vendors
Vendors approach individual orgs for benchmarks
- Confidentiality / Licence agreement
Meetings
- Every 2 years as part of ECMWF ‘Use of HPC in Meteorology
workshop’
- Chairmanship with host organization (this will probably change)
ECMWFSlide 9CAS2K11
RAPS benchmarks
Today
- DWD
- ECMWF
- Met Office
- MPI-M
- Meteo-France
Commitment to produce up to date benchmarks reflecting key operational applications of consortium members
ECMWFSlide 10CAS2K11
Benchmark Best Practices
Using ECMWF’s IFS as a case study
Benchmark preparation
What are correct results?
Testing
- With different compilers
- On different systems
ECMWFSlide 11
Benchmark preparation
Benchmarks should be bug free
No formal way of proving a program is bug free
But we can do a lot to find bugs
- Using compiler options
- Using memory debugging tools
- Run combinations of number of MPI tasks and OpenMP threads
- Enforcing bit reproducible results in our applications
These approaches should be part of normal testing of each new cycle/release of the application and not just for benchmarking
- Doing this work post release is unproductive, back stitches …
CAS2K11
ECMWFSlide 12
Using compilers to find bugs /1
Compile with no optimization
- On IBM, –O0 or -qnooptimize
- Will pick up potential OpenMP bugs where some variables should
be PRIVATE but are by default SHARED
- We get away with it with higher compiler optimization as these
variables are more likely to be held in registers
- Remember a vendor benchmarker will probably revert to this if
they have problems with some ‘safe’ compiler options
- Object libraries built with no optimization are needed for use with
debuggers
- On IBM, for routines with OpenMP also need, –qsmp=omp:noopt
CAS2K11
ECMWFSlide 13
Using compilers to find bugs /2
Compile with subscript checking
- on IBM –qcheck or –C
- Can be done with normal (production) compiler optimization
- A few months of effort was initially required to get IFS to run with
compiler subscript checking (circa 2002)
- Many rewards of bugs detected since then
- Problem areas requiring resolution
Unused field index in array actual argument on subroutine call
Incorrect subroutine dummy argument size
- Resolution straightforward, combination of
Unused field indices set to 1 (IFS namelist variable NUNDEFLD=1)
Use of interface blocks, assumed shape arrays, modules
CAS2K11
ECMWFSlide 14
Using compilers to find bugs /3
Initialize arrays to NANS
- On IBM, -qinitauto=7FF7FFFF for automatic variables
- Plus some local code to intercept malloc’s and initialize to NANS
For IFS activated via environment variable EC_FILL_NANS=1
to catch all memory allocations (heap/stack)
- Finds the occasional bug where scalars or arrays are not initialized
correctly (or at all)
CAS2K11
ECMWFSlide 15
Using memory debugging tools to find bugs
CAS2K11
ECMWFSlide 16
Using memory debugging tools to
find bugs and leaks
Using RogueWave’s MemoryScape
Started from a Totalview debugging run (at ECMWF)
- Guard blocks
When memory is malloc’d and free’d
- Red Zones
Align each malloc on page boundary (last word)
Next page is RO, so SEGV on write attempt
Supported on Linux
Not supported on AIX
- Check for memory leaks
CAS2K11
ECMWFSlide 17
IFS RAPS11 model-only benchmark
Model resolutions T159, T399, T799, T1279 and T2049
Full outputs from IBM Power6 (all resolutions)
No output of model fields (i.e. not an I/O benchmark)
Reference job scripts
- 24 time steps
- test of correctness
Long run job scripts, for performance runs
- use same executable as for reference runs
Results ([SP,GP] norms) should be bit-reproducible when changing
- number of MPI tasks
- number of OpenMP threads
CAS2K11
Results of ERROR calculation (T159/Intel/pgi-10.6)
The error calculated from the results shows
that the calculations are correct
The maximum error is = 0.16518 %
ECMWFSlide 18
Maximum Error Correctness Check
(optimized libs v non-optimized libs on P6)
CAS2K11
0
1
2
3
4
5
6
7
1 7
13
19
25
31
37
43
49
55
61
67
73
79
85
91
97
103
109
115
121
127
133
139
145
151
157
163
169
175
181
187
193
199
205
211
217
223
229
235
Maxim
um
Err
or
%
Model Hours
T159 (125 km)
T1279 (16 km)
Maximum Error (all time steps)
computed from saved reference
norms and norms from running
model (at each atmospheric level)
using following spectral variables
-Vorticity
-Divergence
-Temperature
-Humidity
-Kinetic Energy
IFS CY37R3
(RAPS12)
ECMWFSlide 19
Maximum Error Correctness Check
(optimized libs v non-optimized libs on P6)
CAS2K11
0.0
0.2
0.4
0.6
0.8
1.0
1.2
1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47
Maxim
um
Err
or
%
Model Hours
T159 (125 km)
T1279 (16 km)
T1279 TSTEP=600s
24 steps = 4 model hours
T159 TSTEP=3600s
24 steps = 24 hours
Acceptable max error is below 1 %
-at 24 model time-steps
-to keep run time short
ECMWFSlide 20
IFS RAPS11 benchmark DR_HOOK_OPT=prof output
provided for all model cases (T2047 extract)
CAS2K11
Name of the executable : /fws2/lb/work/rd/mpm/RAPS11/T2047/../bin/ifsMASTER
Number of MPI-tasks : 512
Number of OpenMP-threads : 8
Wall-times over all MPI-tasks (secs) : Min=1020.070, Max=1025.730, Avg=1022.972, StDev=1.261
Routines whose total time (i.e. sum) > 1.000 secs will be included in the listing
Avg-% Avg.time Min.time Max.time St.dev Imbal-% # of calls : Name of the routine
7.91% 80.951 71.799 86.287 1.554 16.79% 6545408 : MXMAOP
4.15% 42.488 39.259 47.132 1.530 16.70% 87748608 : CLOUDSC
3.65% 37.345 31.398 43.424 1.999 27.69% 44742137979 : CUADJTQ
3.62% 37.059 36.473 38.821 0.240 6.05% 350994432 : LAITRI
3.59% 36.672 35.747 38.017 0.342 5.97% 482818560 : VERINT
3.12% 31.937 27.864 36.514 1.297 23.69% 100352 : >MPL-TRMTOL_COMMS(807)
3.03% 30.991 15.474 70.559 10.875 78.07% 98816 : >MPL-TRLTOM_COMMS(806)
2.93% 29.964 3.977 71.212 12.580 94.42% 98816 : >MPL-TRGTOL_COMMS(803)
2.45% 25.066 12.663 37.822 5.504 66.52% 1611 : >MPL-IOSTREAMREAD_RECORD(650)
2.43% 24.853 8.287 33.075 3.370 74.94% 100352 : >MPL-TRLTOG_COMMS(805)
2.21% 22.597 21.537 23.729 0.336 9.24% 175497216 : LASCAW
2.03% 20.776 20.174 21.547 0.242 6.37% 87748608 : VDFMAIN
1.85% 18.952 18.393 20.062 0.245 8.32% 789737472 : LAITLI
1.70% 17.341 16.440 18.936 0.364 13.18% 87748608 : VDFEXCU
1.55% 15.812 13.110 22.851 1.057 42.63% 98816 : >BAR-BARRIERINSIGCHECK(718)
1.49% 15.241 6.446 25.483 4.342 74.70% 802816 : >OMP-FTINV_CTL(1639)
1.40% 14.324 9.447 23.023 1.886 58.97% 99840 : >MPL-SLCOMM1_COMMS(509)
1.39% 14.183 13.788 14.387 0.076 4.16% 175497216 : LARCHE
1.37% 14.056 8.137 25.275 2.968 67.81% 43874304 : CUBASEN
1.18% 12.059 1.922 25.705 5.039 92.52% 43874304 : CLOUDVAR
1.04% 10.594 6.716 16.324 1.017 58.86% 796672 : >OMP-LTINV_CTL-INVERSE(1647)
1.03% 10.583 10.194 11.688 0.258 12.78% 786432 : >OMP-WAMODEL2(1431)
0.97% 9.930 0.000 18.834 4.970 100.00% 1024 : >MPL-BROADCASTIOSTREAMGR(632)
0.94% 9.637 9.043 10.247 0.191 11.75% 790528 : >OMP-CPG1(1025)
0.93% 9.484 9.059 9.978 0.152 9.21% 1702400 : SRTM_SPCVRT_MCICA
0.90% 9.210 7.612 11.005 0.545 30.83% 790528 : >OMP-LTDIR_CTL-DIRECT(1645)
0.90% 9.193 8.827 9.538 0.145 7.45% 43874304 : CPDYDDH
ECMWFSlide 21
RAPS[10,12] Test of Adjoint
Correctness check before running 4D-Var
Test of the minimization (TL,AD) method
No observations required
Runs in seconds
CAS2K11
TEST OF THE ADJOINT
12345678901234567890
< F(X) , Y > = 0.29794005935791911810E+00
< X , F*(Y) > = 0.29794005935792916562E+00
THE DIFFERENCE IS 151.876 TIMES THE ZERO OF THE MACHINE
A difference value of
- a few hundred is perfectly acceptable
- a few thousand suggests a compiler „issue‟
(e.g. optimizations that replace divides by multiplies of the reciprocal)
When the difference displays a row of asterisks the test has definitely failed.
ECMWFSlide 22
RAPS10 4D-Var correctness check
CAS2K11
ECMWFSlide 23
0
1024
2048
3072
4096
5120
6144
7168
8192
9216
10240
11264
12288
13312
Sp
eed
up
User Threads
ideal
T2047L149
T1279L91
T799L91
T1279 Operations
IFS model speedup on Power6 (RAPS11)
CAS2K11
ECMWFSlide 24
RAPS futures
Improving scalability, maintainability and portability of our production applications
F90 / F95 / F2003 / C / C++ / MPI / OpenMP
- RAPS programming model today
F2008 (coarrays)
- Many thanks to Bob Numrich and John Reid
- To be adopted in the future
- Requires vendors to support
- Interested to hear F2008 plans from vendors
Use of standards are paramount
Thread checking software?
CAS2K11
ECMWFSlide 25
RAPS forum
http://raps.enes.org/
Click on register (in the top right hand corner), where you will be asked to provide some basic information (id/pw/email address) and submit your registration.
Registration could take up to a day to process (my experience was just 1 hour), after which time you will receive an email confirming that your account has been activated.
Contact: [email protected]
CAS2K11
ECMWFSlide 26
RAPS forums
Parallel I/O initiative
This forum should be used for developing a project for a portable parallel I/O library for NWP and climate modelling
Measurement tools initiative
Discuss the design and implementation of a common lightweight measurement tools library
Fortran 2003
Debate about Fortran 2003 usage
Fortran 2008
Debate about Fortran 2008 usage
CAS2K11
ECMWFSlide 27
Benchmark Evolution
CAS2K11
1970‟s 1980‟s 1990‟s 2000+
Whetstone/
Dhrystone
Full Application
benchmarks with
MPI + OpenMP
Linpack100 NAS Parallel
Benchmarks
SpecFP
Livermore Loops Ping-Pong/
Ping-Ping
IOzone
(top500)
HPL
homo benchmarkus
ECMWFSlide 28CAS2K11
Able to multi-task Limited multi-tasking skills
ECMWFSlide 29CAS2K11
Thank you for your
attention
QUESTIONS?