Research Papers Issue RP0109, July 2011
Scientific Computing and Operation (SCO)

This work has been carried out with the support of the CMCC-ANS Division, namely Marcello Vichi. The research leading to these results has received funding from the Italian Ministry of Education, University and Research and the Italian Ministry of Environment, Land and Sea under the GEMINA project.

NEMO BENCHMARKING: V.3.3 VS V.3.3.1 COMPARISON

By Italo Epicoco
University of Salento, Italy
[email protected]

Silvia Mocavero
CMCC
[email protected]

Alessandro D’Anca
CMCC
[email protected]

and Giovanni Aloisio
CMCC
University of Salento, Italy
[email protected]

SUMMARY
This report describes the activities carried out within the NEMO Consortium commitment. They concern the performance evaluation and comparison of NEMO version 3.3 and NEMO version 3.3.1. The main differences between the two versions are related to memory management and the allocation of the data structures: NEMO ver. 3.3.1 replaces the static memory array allocation with a dynamic one. This approach brings some relevant benefits, such as the run-time evaluation of the best domain decomposition. On the other hand, the dynamic array allocation introduces some loss of computational performance. The aim of this work is to evaluate the difference between the two versions from a computational point of view.

Keywords: NEMO, Performance evaluation, Simultaneous Multi-Threading, Domain Decomposition

JEL: C61





    INTRODUCTION

Since the beginning, the NEMO model was designed taking into account the target machine and the available system software. The static allocation of the data structures employed in the NEMO model up to version 3.3 is justified by several reasons:

1. The optimizations made by the compiler can be more incisive when the dimensions of the arrays are known at compile time, as for example the optimizations regarding loops.

2. The allocation of the memory is performed once, during the initialization of the data structures, and it remains unchanged during the execution of the model.

In a dynamic computing environment where the available resources can change rapidly, static memory management could be a limiting factor for an efficient use of the resources. Moreover, the static approach introduces other related limitations:

1. The domain decomposition must be chosen statically, before compiling the code.

2. If the number of available computing resources changes during a long experiment, a re-compilation of the code is needed.

3. The domain decomposition must be maintained static for the whole duration of the run.

Version 3.3.1 of NEMO implements the dynamic allocation of the data structures. This work aims at evaluating the differences between the two versions from a computational point of view.

    DESCRIPTION OF THE TEST CASES

The comparison of the two versions has been performed taking into consideration the following configurations:

    1. ORCA2LIM PISCES

    2. POMME

    3. MFS16

The first two configurations belong to the NEMO distribution package. They do not represent meaningful configurations from a scientific point of view, but they are used as test cases for all of the modifications and updates to the model. The MFS16 configuration [2], instead, is a production configuration designed at INGV (Italy) with the aim of studying the Mediterranean Sea. The comparison aims mainly at:

1. Evaluating the performance of the dynamic-memory version of NEMO against the previous version, comparing the parallel execution time, the parallel scalability and the memory size used.

2. Evaluating the performance both on IBM PWR6 and on NEC SX9.

3. Estimating the efficiency of the domain decomposition algorithm.

4. Evaluating the impact of SMT (only for IBM).

5. Evaluating the exclusive allocation of computational nodes against a shared node allocation.

In accordance with the previous goals, the test plan described in Tab. 1, Tab. 2 and Tab. 3 has been designed:


    Table 1: Test plan with MFS16 configuration

Test ID  Arch  NEMO ver.  SMT  Exclusive Mode  Dom. Decomp.
T01      IBM   v3.3       ON   ON              Auto
T02      IBM   v3.3.1     ON   ON              Auto
T03      IBM   v3.3.1     OFF  ON              Auto
T04      IBM   v3.3.1     ON   OFF             Auto
T05      IBM   v3.3.1     ON   ON              Manual
T06      NEC   v3.3       -    ON              Auto
T07      NEC   v3.3.1     -    ON              Auto
T08      NEC   v3.3.1     -    OFF             Auto

Table 2: Test plan with ORCA2LIM PISCES configuration

Test ID  Arch  NEMO ver.  SMT  Exclusive Mode  Dom. Decomp.
T09      IBM   v3.3       ON   ON              Manual
T10      IBM   v3.3.1     ON   ON              Manual
T11      NEC   v3.3       -    ON              Manual
T12      NEC   v3.3.1     -    ON              Manual

    Table 3: Test plan with POMME configuration

Test ID  Arch  NEMO ver.  SMT  Exclusive Mode  Dom. Decomp.
T13      IBM   v3.3       ON   ON              Manual
T14      IBM   v3.3.1     ON   ON              Manual
T15      NEC   v3.3       -    ON              Manual
T16      NEC   v3.3.1     -    ON              Manual

    RESULTS AND COMMENTS

For each of the tests described above, we made a series of runs, considering the elapsed time and the memory allocation with a growing number of processes. In this section we analyze and describe the results.

EVALUATING THE PERFORMANCE OF THE NEMO V3.3 AGAINST NEMO V3.3.1

The comparison of the two versions has been scheduled on an IBM parallel machine based on Power6 processors and a NEC vector machine based on SX9 computing nodes. All the runs of NEMO version 3.3.1 performed on the NEC architecture failed with a floating-point zero divide error in the ldftra module. This error does not allow us to draw any conclusion about the computational performance comparison on the NEC architecture.

For comparing the computational performance of the two versions of NEMO, we took into consideration tests T01, T02, T09, T10, T13 and T14. Fig. 1 and Fig. 2 report the execution time and the measurement of the displacement in time between NEMO versions 3.3 and 3.3.1. In all of the analyzed configurations, version 3.3.1 introduces a loss of performance. However, while for the POMME configuration the difference in time is on average about 27%, it decreases to 15% for ORCA2LIM PISCES and down to 3.5% for MFS16. In general, the more computationally intensive the configuration is, the more negligible the overhead due to the dynamic memory management becomes.

Fig. 3 reports the evaluation of the maximum memory used for all of the considered configurations. For the POMME and ORCA2LIM PISCES configurations, the total allocated memory of NEMO version 3.3.1 is the same as that of version 3.3 (the mean difference is less than 2%), while for the MFS16 configuration we notice a more evident difference (about 5%) in terms of MBs of memory allocated by NEMO ver. 3.3.1. A deeper investigation is needed in order to justify these observations.

Taking into consideration only the MFS16 configuration, we can evaluate the parallel speed-up and the efficiency of the parallel algorithm.
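The time displacement quoted above is a simple relative overhead. A minimal sketch, using hypothetical timings rather than the measured ones behind the 27% / 15% / 3.5% figures:

```python
def time_displacement(t_v33, t_v331):
    """Relative overhead of ver. 3.3.1 with respect to ver. 3.3."""
    return (t_v331 - t_v33) / t_v33

# Hypothetical elapsed times in seconds (placeholders, not measurements):
print(f"{time_displacement(100.0, 127.0):.1%}")  # prints "27.0%"
```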


    (a) POMME configuration (b) ORCA2LIM PISCES configuration

    Figure 1: Performance comparison between NEMO ver. 3.3 and ver. 3.3.1

    (a) MFS16 configuration (b) Time displacement in MFS16 configuration

    Figure 2: Performance comparison between NEMO ver. 3.3 and ver. 3.3.1

    (a) POMME configuration (b) ORCA2LIM PISCES configuration (c) MFS16 configuration

    Figure 3: Comparison of the global memory used in NEMO ver. 3.3 and ver. 3.3.1


    (a) Parallel efficiency (b) Parallel speed-up

Figure 4: Comparison of the NEMO scalability between version 3.3 and 3.3.1 using the MFS16 configuration

This configuration theoretically gives us the possibility to scale the run to a higher number of processes than the other configurations. Fig. 4 reports the results of this analysis. The model with this configuration scales well up to 32 processes; beyond this limit the performance starts to decrease, reaching an acceptable limit at 64 processes with an efficiency of 30%. Comparing the speed-up and efficiency of the two versions, ver. 3.3.1 apparently seems to perform better than ver. 3.3, but actually this is not true. The speed-up has been measured taking as reference the time of the parallel model launched on 1 process, instead of the time of the best sequential algorithm. In this way we do not have the same sequential time as reference for both versions; in particular, the sequential time of NEMO ver. 3.3.1 is greater than the sequential time of NEMO ver. 3.3. We remind the reader that the sole metric to be taken into account for comparing two or more models is the execution time. Fig. 2a shows that NEMO ver. 3.3 performs better than NEMO ver. 3.3.1.
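The reference-time caveat above can be made concrete with a small sketch. The timings below are hypothetical placeholders, not the measured values: using each version's own 1-process time as the speed-up reference inflates the apparent scalability of the slower version, while a common reference (or the raw execution time) does not.

```python
def speedup(t_ref, t_p):
    """Classic speed-up: reference time over parallel time."""
    return t_ref / t_p

def efficiency(t_ref, t_p, p):
    """Parallel efficiency on p processes."""
    return speedup(t_ref, t_p) / p

# Hypothetical 1-process and 64-process times for the two versions;
# ver. 3.3.1 is slower in absolute terms at every process count.
t1_v33, t64_v33 = 1000.0, 50.0
t1_v331, t64_v331 = 1100.0, 53.0

print(speedup(t1_v33, t64_v33))    # 20.0  (ver. 3.3, own reference)
print(speedup(t1_v331, t64_v331))  # ~20.8 (ver. 3.3.1 "looks" better)
print(speedup(t1_v33, t64_v331))   # ~18.9 (common reference: it is not)
```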

ESTIMATING THE EFFICIENCY OF THE DOMAIN DECOMPOSITION ALGORITHM

The dynamic memory allocation implemented in NEMO ver. 3.3.1 also includes the implementation of an algorithm for choosing the best domain decomposition for a given number of processes. The algorithm takes as input parameters the total number of processes and the dimensions of the global domain, and it returns the number of processes along latitude (jpni) and longitude (jpnj). Moreover, given the bathymetry of the global domain, the algorithm can also identify the land sub-domains and possibly exclude them from the computation. The domain decomposition plays a crucial role in the computational performance of the model: a good choice of domain decomposition will produce higher performance.

Starting from the benchmark results described in [1], we can assert that the best domain decomposition generates sub-domains with a shape as close to square as possible. This minimizes the overlap region among neighboring processes.

In order to evaluate the impact of the automatic


(a) Execution time with different domain decomposition strategies
(b) Shape factors for different domain decomposition strategies

Figure 5: Evaluation of the impact of different domain decomposition strategies on the computational performance of NEMO ver. 3.3.1 on the IBM cluster

domain decomposition, we compared the results from tests T01 and T05. Fig. 5 reports the execution time of NEMO ver. 3.3.1 when the domain decomposition is chosen manually so as to be the optimal one, and when it is chosen automatically by the algorithm. The results show that the automatic domain decomposition is not the best for all the runs. In particular, using a manual decomposition that follows the rule of keeping the shape factor as close as possible to 1, we obtain a performance improvement of 5% on average. The automatic domain decomposition should be revised, considering that there are some cases where it introduces up to a 20% loss of performance. The MFS16 configuration has a global domain of 871x253 grid points; using 156 processes, the automatic domain decomposition gives jpni=39 and jpnj=4, which implies sub-domains with a shape factor of 2.8; with 176 processes it does even worse, giving jpni=11 and jpnj=16, which implies a shape factor of 5. A modified version of nemogcm.F90, which includes an algorithm for establishing the best domain decomposition, has been released. The evaluation of this update will be performed in the near future.
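The shape-factor argument can be sketched as follows. The chooser below is a simplified stand-in for the actual NEMO algorithm: it ignores land-point elimination and only considers exact factorizations of the process count, with the shape factor taken as the aspect ratio of a sub-domain.

```python
def shape_factor(jpiglo, jpjglo, jpni, jpnj):
    """Aspect ratio (>= 1) of a sub-domain in a jpni x jpnj decomposition."""
    sub_i = jpiglo / jpni
    sub_j = jpjglo / jpnj
    return max(sub_i, sub_j) / min(sub_i, sub_j)

def squarest_decomposition(jpiglo, jpjglo, nprocs):
    """Among exact factorizations jpni*jpnj == nprocs, pick the one whose
    sub-domains are as close to square as possible."""
    candidates = [(jpni, nprocs // jpni)
                  for jpni in range(1, nprocs + 1) if nprocs % jpni == 0]
    return min(candidates,
               key=lambda d: shape_factor(jpiglo, jpjglo, d[0], d[1]))

# The two automatic decompositions quoted in the text for the 871x253 grid:
print(round(shape_factor(871, 253, 39, 4), 1))   # 2.8 (156 processes)
print(round(shape_factor(871, 253, 11, 16), 1))  # 5.0 (176 processes)
# A much squarer alternative exists for 156 processes:
print(squarest_decomposition(871, 253, 156))     # (26, 6)
```

This simplified chooser reproduces the shape factors quoted above and shows that a near-square decomposition was available for the 156-process case.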

EVALUATING THE IMPACT OF THE SMT ON IBM POWER6 CLUSTER

Simultaneous multithreading is the ability of a single physical processor to simultaneously dispatch instructions from more than one hardware thread context. Because in the Power6 processors there are two hardware threads per

Figure 6: Evaluation of the benefit given by SMT for the MFS16 configuration, NEMO ver. 3.3.1 on the IBM cluster


    (a) Parallel speed-up (b) Execution time

    Figure 7: Analysis of scalability for NEMO ver. 3.3.1 on IBM cluster, with and without SMT

physical processor, additional instructions can run at the same time. Simultaneous multithreading allows the user to take advantage of the superscalar nature of the processor by scheduling two threads at the same time on the same processor. Of course, the advantage of SMT is based on the idea that each single thread executed simultaneously does not fully saturate the resources of the physical processor. Applications that do not benefit much from simultaneous multithreading are those in which the majority of individual software threads use a large amount of some resource in the processor or memory. For example, applications that are floating-point intensive are likely to gain little from SMT and are the ones most likely to lose performance, since they heavily use either the floating-point units or the memory bandwidth. Applications with low CPI (Cycles Per Instruction) and low cache miss rates might see some small benefit.

For these reasons, an evaluation of the benefit of SMT for the NEMO model is needed. The tests used to evaluate the SMT benefit are T01 and T03. Fig. 6 reports the comparison of the execution times for the MFS16 configuration with and without SMT. In order to evaluate the SMT benefit, we compared the execution times when the same computational resources are used, and thus we took as reference the number of allocated nodes. When SMT is activated we used 64 processes to fully exploit a node; disabling SMT, a maximum of 32 processes per node can be allocated. The results show that SMT does not provide any kind of benefit when the number of allocated nodes is greater than 1.

An analysis of scalability without SMT is reported in Fig. 7. Although the number of processes is reported along the x-axis, different policies for process mapping are applied in the two cases: when SMT is active, the local scheduler maps 2 processes to each physical core, while when SMT is disabled only 1 process is mapped to each physical core. With this in mind, we cannot use the charts in Fig. 7 to compare or evaluate the benefit of SMT. Moreover, when SMT is active, the scheduler distributes the MPI processes among the virtual cores using a stride of 2. This means that if the number of processes is less than 32, each process is mapped to its own physical core even when SMT is active.
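The stride-2 placement described above can be sketched as follows. This is a simplified model of the assumed scheduler behaviour on a 32-core Power6 node with 64 virtual cores, where virtual cores 2k and 2k+1 share physical core k; it is an illustration, not the actual scheduler logic.

```python
def map_ranks_to_physical_cores(nprocs, n_physical=32, stride=2):
    """Map MPI ranks to physical cores via a stride-2 walk over the
    virtual cores, wrapping onto the odd virtual cores after one pass."""
    n_virtual = 2 * n_physical
    physical = []
    for rank in range(nprocs):
        vcore = (rank * stride) % n_virtual + (rank * stride) // n_virtual
        physical.append(vcore // 2)  # virtual cores 2k, 2k+1 -> core k
    return physical

# Up to 32 processes, every rank gets its own physical core even with
# SMT active; beyond that, pairs of ranks share a core.
print(len(set(map_ranks_to_physical_cores(32))))  # 32
print(len(set(map_ranks_to_physical_cores(64))))  # 32 (2 ranks per core)
```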


(a) Execution time (b) Average pending time

Figure 8: Comparison of two cases: exclusive node allocation and shared node allocation, using NEMO ver. 3.3.1 on the IBM cluster

EVALUATING THE EXCLUSIVE ALLOCATION OF COMPUTATIONAL NODES AGAINST A SHARED NODES ALLOCATION

All of the previous tests have been performed using an exclusive node allocation, so even for a run with few processes a whole computing node has been allocated. This configuration can guarantee the best performance, since the resources local to the node (memory, bus and internal switches) are not shared with other jobs. On the other hand, this generally implies an inefficient use of the computing nodes in those cases where the number of processes/threads is less than the number of cores. The test presented in this section aimed at evaluating how much performance is lost if the computing nodes are not exclusively allocated for the NEMO simulations. It is worth noting here that the measures are strongly conditioned by the actual workload of the whole cluster; an exhaustive analysis would require the evaluation of the execution time over a long time period with different workloads. Several runs have been executed during a week; in Fig. 8a the execution time is reported.

As expected, the wall clock time increases considerably with a high number of processes; this is due to the spreading of the processes among several computing nodes. However, it is also interesting to evaluate the mean pending time for each submission. In Fig. 8b the pending time of jobs requiring exclusive node allocation is compared with that of jobs that do not.


    Bibliography

[1] I. Epicoco, S. Mocavero, and G. Aloisio. The NEMO oceanic model: computational performance analysis and optimization. To be published in the HPCC 2011 proceedings, 2-4 Sep 2011, Banff, Canada.

[2] P. Oddo, M. Adani, N. Pinardi, C. Fratianni, M. Tonani, and D. Pettenuzzo. A nested Atlantic-Mediterranean Sea general circulation model for operational forecasting. Ocean Sci., 5(4):461–473, 2009.

© Centro Euro-Mediterraneo per i Cambiamenti Climatici 2011
Visit www.cmcc.it for information on our activities and publications.

The Euro-Mediterranean Centre for Climate Change is a Ltd company with its registered office and administration in Lecce and local units in Bologna, Venice, Capua, Sassari and Milan. The company does not pursue profit; it aims to realize and manage the Centre, its promotion, and the coordination of research and of various scientific and applied activities in the field of climate change studies.