
Int J Parallel Prog (2007) 35:441–458 DOI 10.1007/s10766-007-0057-y

High-Scalability Parallelization of a Molecular Modeling Application: Performance and Productivity Comparison Between OpenMP and MPI Implementations

Russell Brown · Ilya Sharapov

Received: 6 November 2006 / Accepted: 26 January 2007 / Published online: 20 July 2007
© Springer Science+Business Media, LLC 2007

Abstract Important components of molecular modeling applications are estimation and minimization of the internal energy of a molecule. For macromolecules such as proteins and amino acids, energy estimation is performed using empirical equations known as force fields. Over the past several decades, much effort has been directed towards improving the accuracy of these equations, and the resulting increased accuracy has come at the expense of greater computational complexity. For example, the interactions between a protein and surrounding water molecules have been modeled with improved accuracy using the generalized Born solvation model, which increases the computational complexity to O(n^3). Fortunately, many force-field calculations are amenable to parallel execution. This paper describes the steps that were required to transform the Born calculation from a serial program into a parallel program suitable for parallel execution in both the OpenMP and MPI environments. Measurements of the parallel performance on a symmetric multiprocessor reveal that the Born calculation scales well for up to 144 processors. In some cases the OpenMP implementation scales better than the MPI implementation, but in other cases the MPI implementation scales better than the OpenMP implementation. However, in all cases the OpenMP implementation performs better than the MPI implementation, and requires less programming effort as well.

Trademark Legend Sun, Sun Microsystems, SPARC, UltraSPARC, Sun Fire, Sun Performance Library and Sun HPC Cluster Tools are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries.

R. Brown (B)
Sun Microsystems, Inc., 14 Network Circle, Menlo Park, CA 94025, USA
e-mail: [email protected]

I. Sharapov
Apple, Inc., 1 Infinite Loop, Cupertino, CA 95014, USA
e-mail: [email protected]


Keywords Parallel programming · OpenMP · MPI · Molecular modeling

1 Introduction

Molecular modeling is one of the most demanding areas of scientific computing today. Although the high computational requirements of molecular simulation can produce long computation times, parallel execution may be used to increase the size of molecules that can be analyzed in manageable time. Several software applications exist for molecular modeling and estimation of the internal energy of molecules using classical equations known as force fields [5]. An open-source application related to the well-known AMBER [21] software is Nucleic Acid Builder or NAB [16], which we use as the basis for our analysis in this work.

A typical force field includes several energy terms, one of which models the interactions between a biomolecule and the solvent, or surrounding water molecules. This energy is known as the Born free energy of solvation. Using a method known as the generalized Born [3,9,19] approximation, the electrostatic contribution to this energy is computed as the sum of pairwise interactions:

E_{\mathrm{Born}} = -\frac{1}{2} \sum_{i} \sum_{j>i} q_i q_j \left[ \frac{1 - e^{-\kappa \sqrt{d_{ij}^2 + R_i R_j\, e^{-d_{ij}^2/(4 R_i R_j)}}} \big/ \varepsilon_w}{\sqrt{d_{ij}^2 + R_i R_j\, e^{-d_{ij}^2/(4 R_i R_j)}}} \right] \qquad (1)

In this equation, d_ij represents the distance between atoms i and j, ε_w represents the dielectric constant of water, and κ represents a Debye-Hückel screening constant [18]. R_i represents the effective Born radius that is a measure of the amount by which the atom i is screened from the solvent by all of the surrounding atoms k. The effective Born radius is calculated as:

R_i^{-1} = \frac{1}{\rho_i} + \sum_{k \neq i} f(d_{ik}, \rho_i, \rho_k) \qquad (2)

In this equation, ρ_i and ρ_k represent the intrinsic radii of atoms i and k, and f() is a smooth function of the interatomic distance and the intrinsic radii [12].
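The pairwise structure of Eq. 1 maps directly onto a double loop over atom pairs. The following C sketch illustrates that structure under the assumption of precomputed effective Born radii R[] and partial charges q[]; the function name and argument list are illustrative and do not correspond to the NAB source.

#include <math.h>

/* Illustrative sketch of the pairwise sum in Eq. 1 (not the NAB implementation).
 * q[]    partial atomic charges
 * R[]    precomputed effective Born radii
 * crd[]  Cartesian coordinates, three per atom
 * kappa  Debye-Hückel screening constant
 * epsw   dielectric constant of water                                        */
double born_energy(int n, const double *q, const double *R,
                   const double *crd, double kappa, double epsw)
{
    double e = 0.0;
    for (int i = 0; i < n; i++) {
        for (int j = i + 1; j < n; j++) {
            double dx = crd[3*i]   - crd[3*j];
            double dy = crd[3*i+1] - crd[3*j+1];
            double dz = crd[3*i+2] - crd[3*j+2];
            double d2 = dx*dx + dy*dy + dz*dz;
            double f  = sqrt(d2 + R[i]*R[j]*exp(-d2/(4.0*R[i]*R[j])));
            e += q[i]*q[j] * (1.0 - exp(-kappa*f)/epsw) / f;
        }
    }
    return -0.5 * e;
}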

Because of the presence of R_i and R_j in Eq. 1, computation of the Born free energy and its first and second derivatives can involve considerable complexity. However, it is possible to reduce this computational complexity via precomputation, the details of which are beyond the scope of this paper and are reported elsewhere [7]. This precomputation produces two vectors, R and A, each of length n (where n is the number of atoms). These vectors are used to produce four matrices: N (of size n by n) and F, G and D (each of size 3n by n). Products of these matrices are used to form the Hessian matrix H (of size 3n by 3n) that contains the second derivatives. This precomputation reduces the computational complexity of the Born free energy and its first derivatives to O(n^2), as well as reducing the complexity of the second derivatives to O(n^3).

For molecular modeling applications, it is desirable to estimate the internal energy of a molecule and to minimize that energy. Minimization may be performed using the iterative Newton-Raphson method [5] that is calculated as:

x_1 = x_0 - H^{-1}(x_0)\, \nabla E(x_0) \qquad (3)

In the above equation, x_0 represents the initial Cartesian coordinates of the atomic nuclei prior to a step of Newton-Raphson iteration, x_1 represents the Cartesian coordinates after the step of Newton-Raphson iteration, ∇E(x_0) represents the gradient vector of first derivatives that are calculated from the initial Cartesian coordinates, and H^{-1}(x_0) represents the inverse of the Hessian matrix of second derivatives that are calculated from the initial Cartesian coordinates. In practice, inversion of the Hessian matrix is avoided by solving the linear system via techniques such as Cholesky factorization.
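As an illustration of this approach, the sketch below performs one Newton-Raphson step by solving H p = ∇E(x_0) with the LAPACK routine dposv and then setting x_1 = x_0 - p, rather than forming H^{-1} explicitly. The wrapper and variable names are our own, not NAB's; the sketch also assumes that H is symmetric positive definite, as Cholesky factorization requires.

#include <stdlib.h>

/* Fortran LAPACK interface: solve A*X = B for symmetric positive definite A. */
extern void dposv_(const char *uplo, const int *n, const int *nrhs,
                   double *a, const int *lda, double *b, const int *ldb,
                   int *info);

/* One Newton-Raphson step: n3 = 3n Cartesian coordinates, H is column-major
 * and is overwritten by its Cholesky factor.                                 */
int newton_step(int n3, double *H, const double *x0, const double *grad,
                double *x1)
{
    int one = 1, info = 0;
    double *p = malloc(n3 * sizeof(double));
    for (int i = 0; i < n3; i++) p[i] = grad[i];    /* dposv overwrites b with the solution */
    dposv_("L", &n3, &one, H, &n3, p, &n3, &info);  /* solve H * p = grad */
    if (info == 0)
        for (int i = 0; i < n3; i++) x1[i] = x0[i] - p[i];
    free(p);
    return info;                                     /* nonzero if H is not positive definite */
}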

2 Implementation

As shown in Fig. 1, each iteration of Newton-Raphson minimization is subdivided into four phases of computation: (1) calculation of the non-Born energy terms and their derivatives; (2) calculation of the Born free energy, its first derivatives, and the matrices N, F, G and D; (3) matrix multiplication to produce the Hessian matrix H of second derivatives; and (4) solution of the linear system via Cholesky factorization.

We can visualize the phases of an iteration by monitoring the changes in low-level activity in the system. Figure 2 gives an example of the low-level activity that illustrates the change in the cycles per instruction (CPI) measurements for different phases of the MPI version of NAB. The CPI measures the efficiency of the CPU; low CPI values indicate good performance. Superscalar processors are capable of executing multiple instructions in one cycle and therefore CPI values can be less than one for carefully tuned sections of code. In our experiments we measured the CPI and other low-level statistics, such as cache miss rates, using UltraSPARC® processor on-chip hardware counters [10].

In Phase 1 we observe fairly poor processor efficiency, which can be explained by non-contiguous memory references that occur in the computation of the non-Born energy terms. Relatively little computation is performed during this phase, which does not significantly impact the overall performance. In Phase 2 we can see five distinct regions corresponding to the five Born energy computations outlined in Fig. 1. Phase 3 updates the Hessian matrix via three matrix-matrix multiplications. Phase 4 performs Cholesky factorization.

The computation in Phase 1 has O(n) complexity and contributes only minimally to the total computation. The computation in Phases 3 and 4 has O(n^3) complexity and is performed by parallelized subroutines from a scientific library. The computation in Phase 2 exhibits O(n^2) complexity and must be parallelized in order that the total computation achieve reasonable scalability [1].


[Figure 1: flowchart of one Newton-Raphson iteration. Boxes, top to bottom: computation of the non-Born energy terms; Born energy computation, comprising precompute effective radii R, generate A (ij part), generate A (ji part), generate N, F, G and D (ij part), generate N, F, G and D (ji part); Hessian update; Cholesky factorization of the Hessian.]
Fig. 1 Phases 1–4 of Newton-Raphson. The ij and ji parts are computed separately

[Figure 2: plot of CPI versus run time (sec.), with regions labeled: first 3 terms of energy computation (1), Born energy computation (2), Hessian update (3), Cholesky factorization (4).]
Fig. 2 CPI for Phases 1–4 of Newton-Raphson minimization, using 5-way parallelism for Phase 2 and 25-way parallelism for Phases 3 and 4

We have parallelized the Phase 2 computation via an OpenMP [8] implementation (http://www.openmp.org), as well as via an MPI [11] implementation (http://www.mpi-forum.org). Prior studies have reported comparisons between OpenMP and MPI versions of applications software [2,13–15].



2.1 OpenMP

The computation of Phase 2 of Newton-Raphson minimization involves several summations such as \sum_i \sum_{j>i} or \sum_i \sum_{j \neq i}, each of which implies a loop nest with i and j as the outer and inner loop indices, respectively. Each loop nest updates a vector and a matrix, as shown for the shared vector A and the shared matrix N in the fragment of C code (4) that is (incorrectly) parallelized via OpenMP by adding a #pragma omp parallel for directive to the serial code:

#pragma omp parallel for private(j)
for (i = 0; i < n; i++) {
  for (j = i+1; j < n; j++) {
    A[i] += f1(i, j);
    A[j] += f2(i, j);    // Incorrect!
    N[i][i] += f3(i, j);
    N[i][j] += f4(i, j);
    N[j][j] += f5(i, j); // Incorrect!
    N[j][i] += f6(i, j);
  }
}
                                                        (4)

Code fragment (4) exhibits a race condition for the update of A[j] and N[j][j]. Because the j loop index is not partitioned amongst the OpenMP threads, all of the threads can potentially update A[j] and N[j][j]. There is no guarantee that these updates will occur atomically, and therefore the threads may overwrite one another's updates. This race condition is removed by splitting the loop nest to create two loop nests. The first loop nest (5) preserves i as the outer index and j as the inner index. It updates A[i], N[i][i] and N[i][j]:

#pragma omp parallel for private(j)
for (i = 0; i < n; i++) {
  for (j = i+1; j < n; j++) {
    A[i] += f1(i, j);
    N[i][i] += f3(i, j);
    N[i][j] += f4(i, j);
  }
}
                                                        (5)


The second loop nest (6) uses j as the outer index and i as the inner index. It updates A[j], N[j][j] and N[j][i]:

#pragma omp parallel for private(i)
for (j = 0; j < n; j++) {
  for (i = 0; i < j; i++) {
    A[j] += f2(i, j);
    N[j][j] += f5(i, j);
    N[j][i] += f6(i, j);
  }
}
                                                        (6)

The combination of i and j loop indices is identical for both loop nests, i.e., for either loop nest a given value of i is combined with the same values of j. It is essential that A[i] and N[i][i] be updated in the first loop nest, and that A[j] and N[j][j] be updated in the second loop nest; otherwise, a race condition will result. However, N[i][j] and N[j][i] may be updated in either loop nest. No race condition can exist for these updates because each update involves both i and j matrix addresses.

However, by updating N[i][i] and N[i][j] in the first loop nest, and by updating N[j][j] and N[j][i] in the second loop nest, the matrix elements are partitioned amongst the OpenMP threads by matrix row. Groups of r contiguous rows may be partitioned amongst the threads by adding a schedule(static, r) clause to the #pragma omp parallel for directive. This partitioning is known as row cyclic partitioning, and promotes locality of memory access by each OpenMP thread for a matrix that is allocated in row-major order.
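For reference, the schedule(static, r) variant of loop nest (5) would look as follows; r is a chunk size chosen by the programmer, and this variant is shown only for comparison with the explicit mapping discussed next:

#pragma omp parallel for private(j) schedule(static, r)  /* r contiguous rows per chunk */
for (i = 0; i < n; i++) {
  for (j = i+1; j < n; j++) {
    A[i] += f1(i, j);
    N[i][i] += f3(i, j);
    N[i][j] += f4(i, j);
  }
}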

Instead of using the schedule(static, r) clause, we map the OpenMP threads to matrix rows explicitly, as will be discussed below for MPI in code fragment (7), in order to maintain compatibility between the OpenMP and MPI implementations of NAB. This approach is an example of the SPMD OpenMP programming style that can improve the performance of OpenMP applications when used in conjunction with code tuning techniques [14]; a minimal sketch of this explicit mapping appears at the end of this subsection. A performance analysis tool such as the Sun™ Performance Analyzer (http://developers.sun.com/sunstudio) is essential to the code tuning process.

The above discussion describes the parallelization that is accomplished via OpenMP for Phase 2 of the Newton-Raphson minimization. Phases 3 and 4 are executed by the dgemm and dposv subroutines, respectively, of the Sun Performance Library™ multi-threaded scientific library.
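The explicit, SPMD-style row-cyclic mapping mentioned above might be sketched as follows, following the conventions of fragments (5) and (6) and assuming r rows per cycle; the exact mapping used in the NAB source may differ.

#include <omp.h>

#pragma omp parallel
{
  int nthreads = omp_get_num_threads();
  int tid = omp_get_thread_num();
  for (int i = 0; i < n; i++) {
    if ((i / r) % nthreads != tid) continue;  /* row cyclic: skip rows owned by other threads */
    for (int j = i + 1; j < n; j++) {
      A[i] += f1(i, j);
      N[i][i] += f3(i, j);
      N[i][j] += f4(i, j);
    }
  }
}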

2.2 MPI

The parallelization of Phases 2–4 of Newton-Raphson minimization is more complex for MPI than for OpenMP. The increased complexity is due principally to the Scalable LAPACK (or ScaLAPACK, http://www.netlib.org/scalapack) [6] scientific library that is used with the MPI implementation of NAB.


Fig. 3 ScaLAPACK matrix distribution. A 13 by 8 matrix is subdivided into blocks of 2 by 3 matrix elements, and distributed onto a 3 by 2 process grid in a block cyclic manner. The numbers along the top and left edges of the process grid indicate the row and column coordinates, respectively, of each process on the grid. The blocks are separated by thin spacing, and the grid processes are separated by thick spacing

The ScaLAPACK library does not support global, shared vectors and matrices; instead, it supports vectors and matrices that are distributed across all of the MPI processes. Under this distributed paradigm, each process has exclusive access to a unique subset of the global vector or matrix, which we will call the sub-vector or sub-matrix. Each process initializes its sub-vector or sub-matrix, and then the ScaLAPACK subroutines distribute computation such as matrix multiplication and Cholesky factorization across all of the processes.

Before a vector or matrix can be processed by a ScaLAPACK subroutine, it must be distributed onto a process grid. The process grid is a group of MPI processes that are placed on a rectangular grid of nprow rows by npcol columns. Each process has unique row and column coordinates myrow and mycol that indicate the location of the process on the grid. The matrix elements are not mapped onto the process grid in a contiguous manner but rather in a block cyclic manner [6] wherein a matrix of m rows by n columns is subdivided into blocks of mb rows by nb columns (see details below). Block cyclic mapping is used by ScaLAPACK to achieve reasonable load balancing across the MPI processes. For example, the solution of a system of linear equations is accomplished by sweeping through a matrix from upper left to lower right in such a way that an anti-diagonal processing wavefront advances along the principal diagonal of the matrix. Block cyclic mapping ensures that the wavefront encounters data from all processes at each stage of its advance.


Figure 3 illustrates the block cyclic distribution of a 13 by 8 matrix onto a 3 by 2 process grid, using 2 by 3 blocks. Scanning from top to bottom along the leftmost column of the matrix, we see that the upper left block of the matrix (which begins with element a1,1) maps to the upper left block of process (0,0). The block that begins with element a3,1 maps to the upper left block of process (1,0). The block that begins with element a5,1 maps to the upper left block of process (2,0). The block that begins with element a7,1 begins a new cycle and maps to the block beneath the upper left block of process (0,0). The block that begins with element a9,1 maps to the block beneath the upper left block of process (1,0). The block that begins with element a11,1 maps to the block beneath the upper left block of process (2,0). The (partial) block that begins with element a13,1 begins a new cycle and maps to the lower left block of process (0,0). A similar cyclic mapping in the horizontal direction fills the process grid and produces the required block cyclic distribution of the matrix onto the process grid.
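The owning process of a global element follows directly from the block cyclic parameters. The small, self-contained C program below reproduces the Fig. 3 distribution (m = 13, n = 8, mb = 2, nb = 3, nprow = 3, npcol = 2); it is illustrative only and independent of ScaLAPACK.

#include <stdio.h>

int main(void)
{
    /* Parameters of the Fig. 3 example: 13 x 8 matrix, 2 x 3 blocks, 3 x 2 grid. */
    int m = 13, n = 8, mb = 2, nb = 3, nprow = 3, npcol = 2;
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            int prow = (i / mb) % nprow;  /* process-grid row owning element (i, j)    */
            int pcol = (j / nb) % npcol;  /* process-grid column owning element (i, j) */
            printf("(%d,%d) ", prow, pcol);
        }
        printf("\n");
    }
    return 0;
}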

As mentioned above, each MPI process has exclusive access to a unique sub-vector or sub-matrix. Thus, the i and j loop indices of a loop nest must be restricted to only those values that are required to update the accessible sub-vector or sub-matrix elements. In effect, restriction of the loop indices accomplishes the necessary partitioning of the matrix elements amongst the MPI processes. However, partitioning for MPI is not as simple as for OpenMP where a #pragma omp parallel for directive suffices. For MPI, access to the elements of the distributed vector A and the distributed matrix N may be restricted to a unique subset of the i and j addresses via the following (partially correct) code fragment that tests the loop indices using divide and modulus instructions, and skips loop iterations via a continue statement:

for (i = 0; i < n; i++) {
  if ( (i/mb)%nprow != myrow ) continue;
  for (j = i+1; j < n; j++) {
    if ( (j/nb)%npcol != mycol ) continue;
    A[i] += f1(i, j);    // Incorrect!
    N[i][i] += f3(i, j); // Incorrect!
    N[i][j] += f4(i, j);
  }
}

for (j = 0; j < n; j++) {
  if ( (j/mb)%nprow != myrow ) continue;
  for (i = 0; i < j; i++) {
    if ( (i/nb)%npcol != mycol ) continue;
    A[j] += f2(i, j);    // Incorrect!
    N[j][j] += f5(i, j); // Incorrect!
    N[j][i] += f6(i, j);
  }
}
                                                        (7)


Restriction of the i and j loop indices as depicted in code fragment 7 parallelizes the computation because each process handles a unique subset of i and j that is selected by the myrow and mycol coordinates of that process. However, this code fragment is misleading for several reasons. First, although the first loop nest can access N[i][j], it cannot necessarily access N[i][i]. Similarly, although the second loop nest can access N[j][i], it cannot necessarily access N[j][j]. The diagonal elements of the matrix N are not guaranteed to belong to the process that calculates updates to those elements because each process owns only a sub-matrix of the distributed matrix N, which we will call sub_N. A solution to this problem is for each process to maintain a private copy of the diagonal elements of the matrix N. Because N is a square matrix of dimensions n by n, the private copy of the diagonal elements can be stored in a vector of length n, which we will designate by the vector diag_N. When all of the processes have finished updating their private copies diag_N, all of the copies of diag_N are combined to produce the vector Q that is then rebroadcast to each process. The MPI_Allreduce function is perfectly suited to combining (reducing) diag_N and rebroadcasting the resulting vector Q in this manner. Upon receipt of the vector Q, each process copies from Q into the diagonal elements of its sub-matrix sub_N only those elements that exist in sub_N.

The second problem with code fragment 7 arises due to a ScaLAPACK convention that requires that a distributed vector such as the vector A exist only in column zero of the process grid. Hence, only a process that exists in column zero of the grid possesses a sub-vector of the vector A, which we will call sub_A. A particular process that calculates updates to the vector A may not lie in column zero of the grid and therefore may not be able to access the elements that it needs to update. This problem is very similar to the first problem discussed above, and it has a similar solution. Each process must maintain a private copy of the distributed vector A, which we will call priv_A. When all of the processes have finished updating their private copies priv_A, all of the copies of priv_A are combined, and the resulting vector S is rebroadcast to each process via the MPI_Allreduce function. Upon receipt of the vector S, each process in column zero of the process grid copies from S into the elements of its sub-vector sub_A only those elements that exist in sub_A.

The third problem with code fragment 7 is that the matrix elements N[i][j] and N[j][i], which are accessible by a given process, are not accessed as elements of the global matrix N, but rather as elements of the sub-matrix sub_N that is owned by that process. Moreover, the sub-matrix sub_N is not addressed using the global [i][j] or [j][i] address directly. Instead, the global address is converted to an offset into the sub-matrix sub_N. The address conversion is accomplished via the following adrmap function that requires, in addition to i and j, the number of rows or local leading dimension lld of the sub-matrix so that adrmap can perform column-major addressing as required by ScaLAPACK:


size_t adrmap( int i, int j) {
  size_t lbi, lbj;
  lbi = i/mb/nprow;   /* local block-row index of global row i       */
  lbj = j/nb/npcol;   /* local block-column index of global column j */
  return ( lld*(nb*lbj + j%nb)
           + mb*lbi + i%mb );   /* column-major offset into sub_N */
}
                                                        (8)
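As a concrete check against Fig. 3, consider element a7,1, which corresponds to zero-based global indices i = 6, j = 0 and is owned by process (0,0). With mb = 2, nb = 3, nprow = 3, npcol = 2 and, for this process in the Fig. 3 example, lld = 5 local rows, adrmap gives lbi = 6/2/3 = 1 and lbj = 0/3/2 = 0, so the offset is 5*(3*0 + 0) + 2*1 + 0 = 2, i.e., local row 2 of local column 0. This agrees with Fig. 3, where the block beginning with a7,1 sits directly beneath the upper left block of process (0,0).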

Code fragment 7 may be rewritten to call the adrmap function as follows:

for (i = 0; i < n; i++) {
  if ( (i/mb)%nprow != myrow ) continue;
  for (j = i+1; j < n; j++) {
    if ( (j/nb)%npcol != mycol ) continue;
    priv_A[i] += f1(i, j);
    diag_N[i] += f3(i, j);
    sub_N[adrmap(i, j)] += f4(i, j);
  }
}

for (j = 0; j < n; j++) {
  if ( (j/mb)%nprow != myrow ) continue;
  for (i = 0; i < j; i++) {
    if ( (i/nb)%npcol != mycol ) continue;
    priv_A[j] += f2(i, j);
    diag_N[j] += f5(i, j);
    sub_N[adrmap(j, i)] += f6(i, j);
  }
}
                                                        (9)

After the vectors priv_A and diag_N have been updated by each process as indicated in code fragment 9, they are combined and rebroadcast to all processes by the MPI_Allreduce function, then copied by each process into the sub-vector sub_A and into the sub-matrix sub_N, respectively.
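The reduction and redistribution steps just described might be sketched as follows. The communicator, buffer names and local-index expression for sub_A are illustrative assumptions; the NAB code obtains its communicator and local indices from the ScaLAPACK/BLACS environment, which is not shown here. The sketch assumes that the vector A is distributed over grid column zero with row block size mb.

/* Combine the private copies held by all processes, then copy the owned
 * elements back into the local sub-matrix and sub-vector.               */
MPI_Allreduce(diag_N, Q, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
MPI_Allreduce(priv_A, S, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

for (i = 0; i < n; i++) {
  /* Diagonal element (i, i) lives on process ((i/mb)%nprow, (i/nb)%npcol). */
  if ((i/mb)%nprow == myrow && (i/nb)%npcol == mycol)
    sub_N[adrmap(i, i)] = Q[i];
  /* The distributed vector A exists only in grid column zero. */
  if (mycol == 0 && (i/mb)%nprow == myrow)
    sub_A[mb*(i/mb/nprow) + i%mb] = S[i];
}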

Code fragments 8 and 9 demonstrate that mapping from a global matrix address to a local sub-matrix address involves several integer divisions and several modulus operations for access to each element of the matrix. However, because the divisors are mb, nb, nprow and npcol (which are constants for a given matrix), and further because the dividends are i and j whose values lie in the range 0 ≤ i, j < n, all of the divisions and modulus operations may be precomputed and stored in eight lookup tables, each of length n. Because a matrix requires O(n^2) memory, these tables that require O(n) memory represent only a small fraction of the size of a typical matrix, so they offer the possibility of accelerated computation at the expense of minimal additional storage.
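A sketch of how such tables might be built is shown below, in the same fragment style as code fragments (7)–(9). The table names are ours, and the exact set of eight tables used in NAB may differ from the six shown here.

/* Precompute the divisions and modulus operations used in the ownership
 * tests and in adrmap, one entry per global index value 0 <= x < n.     */
int *row_owner = malloc(n * sizeof(int));  /* (x/mb)%nprow */
int *col_owner = malloc(n * sizeof(int));  /* (x/nb)%npcol */
int *row_block = malloc(n * sizeof(int));  /* x/mb/nprow   */
int *col_block = malloc(n * sizeof(int));  /* x/nb/npcol   */
int *row_rem   = malloc(n * sizeof(int));  /* x%mb         */
int *col_rem   = malloc(n * sizeof(int));  /* x%nb         */
for (int x = 0; x < n; x++) {
  row_owner[x] = (x/mb)%nprow;
  col_owner[x] = (x/nb)%npcol;
  row_block[x] = x/mb/nprow;
  col_block[x] = x/nb/npcol;
  row_rem[x]   = x%mb;
  col_rem[x]   = x%nb;
}
/* adrmap then reduces to table lookups:
 *   lld*(nb*col_block[j] + col_rem[j]) + mb*row_block[i] + row_rem[i]   */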

The above discussion describes the parallelization that is accomplished via MPI for Phase 2 of the Newton-Raphson minimization. Phases 3 and 4 are executed by the pdgemm and pdposv subroutines, respectively, of the ScaLAPACK scientific library.

3 Programmability

Creating two versions of the same application using different parallelization approaches allowed us to compare the effort required to implement both versions. Parallelizing Phase 2 of the computation required the splitting of nested loops in both cases. Once this splitting was completed, the OpenMP implementation was straightforward. However, creating the MPI version required substantial additional effort (required by ScaLAPACK) to map global matrices onto a two-dimensional process grid, and to modify one-dimensional row cyclic parallelization to obtain two-dimensional block cyclic parallelization.

Whereas the OpenMP version accesses global, shared matrices and vectors, the MPI version accesses distributed matrices and vectors. Each MPI process must maintain not only a sub-matrix for each distributed matrix, but also a private copy of the diagonal elements of that matrix, as well as division and modulus lookup tables for the matrix in order to facilitate address mapping. Furthermore, each MPI process must maintain a private, complete copy of each distributed vector, as well as a sub-vector for that vector.

Moreover, the ScaLAPACK version requires somewhat greater storage than the OpenMP version. This requirement arises because each process must maintain its own copies of the gradient vector and of the diagonal elements of various matrices. Also, division and modulus lookup tables must be created for each matrix in order to facilitate address mapping. However, this additional storage is of O(n) and therefore is modest compared to the O(n^2) storage required for the matrices themselves.

We have estimated the relative complexity of the two versions of NAB by counting the non-comment source code lines that are related to the Newton-Raphson minimization and to the calculation of the Born energy and its derivatives. Three categories of source code lines were counted: (1) source code lines that are required for serial execution, (2) source code lines that are required to modify the serial code for parallel execution by OpenMP, and (3) source code lines that are required to modify the serial code for parallel execution by MPI. The serial line count is 1,643. The OpenMP line count is 180. The MPI line count is 962. These line counts reveal that adaptation of the serial code for MPI and ScaLAPACK produced significantly (i.e., a factor of five) more source code than adaptation of the serial code for OpenMP. This finding suggests that the programmer's productivity may be higher when an application that relies on linear algebra is parallelized using OpenMP instead of MPI.

4 Performance and Scalability

In this section we make scalability and performance comparisons between the OpenMP and MPI implementations of the Newton-Raphson minimization. In order to obtain an accurate comparison between the two implementations of the code, we performed all of the measurements using the same server: a Sun Fire™ E25K server with 72 dual-core UltraSPARC IV processors.


Fig. 4 6,370-atom 1AKD molecule

We have compared the performance of the OpenMP and MPI implementations for up to 144 OpenMP threads and MPI processes using the 1AKD [17], 1AFS [4] and 1AMO [20] molecules from the RCSB Protein Data Bank (http://www.rcsb.org/pdb/). For our experiments, we modified these X-ray crystal structures by removing all of the water oxygens and other non-amino-acid atoms and then by adding hydrogen atoms to obtain molecular models that comprise 6,370, 10,350 and 19,030 atoms. The smallest of these models (1AKD) is shown in Fig. 4.

We start by comparing different versions of the MPI implementation of NAB. Figure 2 (see the Implementation section, above) shows the profile of our baseline MPI implementation, which does not use the table lookup optimization discussed above. The division and modulus operations that are used in address mapping between the full matrix and submatrices are relatively inefficient and inhibit pipelined execution; hence, Phase 2 is prolonged. Moreover, for this version the parallelization of Phase 2 is implemented in a row cyclic manner similar to the approach that was described above for OpenMP. For a given row of the process grid, all of the columns perform redundant computation. Therefore, when this baseline implementation executes using 25 processes on a 5 by 5 process grid, the degree of parallelization is limited to 5. The Phase 2 execution time is 269 s.

Figure 5 shows the execution profile for the improved version of the code in which the address mapping is implemented with look-up tables. The tables are small enough that they remain in processor caches during the course of the computation, and therefore the look-up is very quick.


[Figure 5: plot of CPI versus run time (sec.), with regions labeled: non-Born terms of energy computation (1), Born energy computation (2), Hessian update (3), Cholesky factorization (4).]
Fig. 5 CPI for MPI implementation with table lookup optimization, using 5-way parallelism for Phase 2 and 25-way parallelization for Phases 3–4

[Figure 6: plot of CPI versus run time (sec.), with the same four regions as Fig. 5.]
Fig. 6 CPI for the MPI implementation using 25-way parallelism for Phases 2, 3 and 4

The resulting reduction in the CPI (cycles per instruction) is significant. The Phase 2 execution time is 105 s.

Figure 6 shows the execution profile for full parallelization of Phase 2. Full parallelization is achieved by having all process columns of a particular process row perform unique (instead of redundant) computation, thus increasing the degree of parallelization from 5 to 25 for a 5 by 5 process grid. The execution time for Phase 2 is 20 s, and thus this phase has been eliminated as a barrier to the scalability of the overall computation, as required by Amdahl's law [1].

Figure 7 shows the execution profile for the OpenMP implementation of NAB that fully parallelizes Phase 2 of the Newton-Raphson minimization. The execution time for Phase 2 is 10 s, compared to 20 s for the fastest MPI implementation. This improvement can be attributed to the direct access to shared matrix elements by OpenMP, which avoids the address mapping required for distributed submatrix access in the MPI implementation.


[Figure 7: plot of CPI versus run time (sec.), with the same four regions as Fig. 5.]
Fig. 7 CPI for the OpenMP implementation using 25-way parallelism for Phases 2, 3 and 4

[Figure 8: two panels of speedup versus (a) number of OpenMP threads and (b) number of MPI processes, titled "Scalability of Phase 2 of Newton-Raphson (OpenMP)" and "(MPI)", with curves for 1AMO, 1AFS, 1AKD and Linear.]
Fig. 8 Scalability of Phase 2 of Newton-Raphson for the 1AKD, 1AFS and 1AMO models, using a 144-core Sun Fire E25K server. For both plots, the scalability at 144 OpenMP threads or MPI processes increases in the order 1AKD, 1AFS and 1AMO. The "Linear" plot represents perfect scalability

Moreover, all phases of Newton-Raphson minimization execute more rapidly for the OpenMP implementation than for the fastest MPI implementation, as can be appreciated by comparing Figs. 6 and 7. The OpenMP implementation performs one iteration of Newton-Raphson in approximately 400 s, compared to approximately 700 s for the fastest MPI implementation. The 300 s increase in execution time for the MPI implementation is due to the communication overhead required to transmit distributed submatrices between processors. Even when the communication occurs within the shared memory of a symmetric multiprocessor, this overhead can be significant.

In order to understand the performance properties of both the MPI and OpenMP implementations of NAB we examine the scalability of different phases of the computation. We start with Phase 2 because it was the focus of our attention in previous sections. Figure 8 shows that Phase 2 scales better for MPI than for OpenMP. Similar observations have been reported for other parallel applications [13].

In an attempt to understand the disparity between the scalabilities of the OpenMP and MPI implementations of NAB, we used the Sun Performance Analyzer to measure the level 2 cache miss rates during Phase 2 of the computation. We obtained these measurements for the 1AMO model and 4, 9, 16, 25 and 36 OpenMP threads or MPI processes executed on a Sun Fire E6900 server with 24 dual-core UltraSPARC IV processors.


[Figure 9: two panels of speedup versus (a) number of OpenMP threads and (b) number of MPI processes for Phase 3, with curves for 1AMO, 1AFS, 1AKD and Linear.]
Fig. 9 Scalability of Phase 3 of Newton-Raphson minimization for the 1AKD, 1AFS and 1AMO models. For the OpenMP plots, the scalability at 144 threads increases in the order 1AKD, 1AFS and 1AMO. For the MPI plots, the scalability at 144 processes increases in the order 1AFS, 1AKD and 1AMO

[Figure 10: two panels of speedup versus (a) number of OpenMP threads and (b) number of MPI processes for Phase 4, with curves for 1AMO, 1AFS, 1AKD and Linear.]
Fig. 10 Scalability of Phase 4 of Newton-Raphson for the 1AKD, 1AFS, and 1AMO models

The cache miss rates per instruction were 0.0034 for OpenMP and 0.0012 for MPI, measured for between 4 and 36 OpenMP threads or MPI processes. Hence, relative to MPI we observed a nearly three-fold increase in the cache miss rate for OpenMP.

Figure 9 shows that Phase 3 scales better for OpenMP than for MPI due to the superior scalability of the Sun Performance Library dgemm subroutine relative to the ScaLAPACK pdgemm subroutine. Figure 10 shows that Phase 4 scales better for MPI than for OpenMP due to the superior scalability of the ScaLAPACK pdposv subroutine relative to the Sun Performance Library dposv subroutine. The scalability of OpenMP peaks near 100 threads for the 1AKD model and near 121 threads for the 1AFS model, but no peak is evident for the 1AMO model. Thus the peak in the scalability of the Sun Performance Library dposv subroutine appears to move up and to the right for increasing model size. No peak is evident in the MPI plots.

Figure 11 shows that the combined phases of Newton-Raphson minimization scale better for OpenMP than for MPI, but that the OpenMP scalability begins to fall off near 100 OpenMP threads and that this effect is less pronounced for the larger models. These observations can be explained by the relative scalabilities of Phases 3 and 4 of OpenMP and MPI, as seen in Figs. 9 and 10.

Figure 12 shows that the OpenMP version executes Newton-Raphson minimization significantly more rapidly than the MPI version. In all of our experiments the OpenMP version ran a factor of 1.3 to 2.0 faster than the MPI version.


[Figure 11: two panels of speedup versus (a) number of OpenMP threads and (b) number of MPI processes for Phases 1–4, with curves for 1AMO, 1AFS, 1AKD and Linear.]
Fig. 11 Scalability of Phases 1–4 of Newton-Raphson for the 1AKD, 1AFS and 1AMO models. For both plots, the scalability at 144 threads or processes increases in the order 1AKD, 1AFS and 1AMO

[Figure 12: plot of speed (1/sec) versus number of processes, titled "Speed of Phases 1-4 of Newton-Raphson", with curves for OpenMP and MPI.]
Fig. 12 Speed of Phases 1–4 of Newton-Raphson for the 1AMO model

5 Discussion

This article is a case study of parallelizing a molecular modeling application. We have demonstrated that energy minimization computations can be implemented in a highly-scalable way and can utilize over one hundred processors efficiently. In our experiments the performance and scalability of the OpenMP version were superior to those of the MPI version. Two other studies [2,15] have found the performance of OpenMP to be inferior to, or at best equal to, that of MPI. One other study has found that the SPMD OpenMP programming style combined with careful code tuning can result in better performance for OpenMP [14]. In our experiments, it is likely that OpenMP performs better than MPI not only due to our use of the SPMD OpenMP programming style, but also due to the array indexing overhead imposed by ScaLAPACK in the MPI version.

We observe that the MPI version can execute on clusters of processors, in addition to symmetric multiprocessors. Clusters may have better nominal price/performance characteristics than large symmetric multiprocessors, although typically clusters have slower interconnects than symmetric multiprocessors.


Table 1 Crossover bandwidth for Phase 2 and combined Phases 3 and 4 of Newton-Raphson minimization, using 144 MPI processes and the 1AKD, 1AFS and 1AMO models

Molecular  Number      Phase 2                                 Phases 3–4
Model      of Atoms    Bytes          Time (s)   COBW          Bytes             Time (s)   COBW
1AKD       6,370       309,539,520    6.09       0.51          91,157,705,712    117.21     7.78
1AFS       10,350      502,808,256    19.90      0.25          240,701,741,352   459.33     5.24
1AMO       19,030      924,314,688    46.52      0.20          813,832,000,560   2,761.20   2.95

The crossover bandwidth (COBW) in gigabits per second is calculated by dividing the bytes transmitted (Bytes) by the execution time in seconds (Time), and assuming that 10 bits are transmitted per byte

In order to understand the interconnect bandwidth requirement for clusters, we have used the MPProf profiling tools (http://developers.sun.com/sunstudio) from the Sun HPC ClusterTools™ to measure the total bytes transmitted during Phase 2 and the combined Phases 3 and 4 of Newton-Raphson minimization. In addition, the execution time was measured using Sun's gethrtime high-resolution time function. These measurements were obtained for the 1AKD, 1AFS and 1AMO models using 144 MPI processes.

From these measurements we can calculate the crossover bandwidth, which we define to be the bandwidth at which the communication time equals the execution time, and is calculated by dividing the total bytes transmitted by the execution time. Under the assumption of blocking communication, the crossover bandwidth is the system bandwidth at which the processors spend equal amounts of time performing computation and waiting for communication. We took our measurements using 144 processes in order to obtain the largest possible values for the total bytes transmitted and hence produce the largest values for the crossover bandwidth. Table 1 summarizes our results. For Phase 2 the largest crossover bandwidth is 0.51 gigabits per second. For the combined Phases 3 and 4 the largest crossover bandwidth is 7.78 gigabits per second. The largest values for the crossover bandwidth are obtained using the smallest model (1AKD) that comprises 6,370 atoms. This phenomenon occurs because computational complexity grows more rapidly with increasing model size than does communication. Specifically, the computational complexity of Phase 2 is O(n^2) while the communication grows as O(n). The computational complexity of the combined Phases 3 and 4 is O(n^3) while the communication grows as O(n^2).
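As a worked example of this definition, the Phase 2 entry for 1AKD in Table 1 follows directly from the measured bytes and time:

\mathrm{COBW}_{\mathrm{1AKD,\ Phase\ 2}} = \frac{309{,}539{,}520\ \text{bytes} \times 10\ \text{bits/byte}}{6.09\ \text{s}} \approx 5.1 \times 10^{8}\ \text{bits/s} \approx 0.51\ \text{Gbit/s}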

Acknowledgements We thank David Case, Guy Delamarter, Gabriele Jost, Daryl Madura, Eugene Loh and Ruud van der Pas for helpful comments. The NAB software is distributed under the terms of the GNU General Public License (GPL), and can be obtained at http://www.scripps.edu/case/. This material is based upon work supported by DARPA under Contract No. NBCH3039002.

References

1. Amdahl, G.M.: Validity of the single-processor approach to achieving large scale computing capabilities. In: AFIPS Conference Proceedings, AFIPS Press, pp. 483–485. Reston, VA (1967)


2. Armstrong, B., Kim, S.W., Eigenmann, R.: Quantifying differences between OpenMP and MPI using a large-scale application suite. In: LNCS 1940: ISHPC International Workshop on OpenMP: Experiences and Implementations (WOMPEI 2000). Springer-Verlag, Berlin Heidelberg (2000)

3. Bashford, D., Case, D.: Generalized Born models of macromolecular solvation effects. Ann. Rev. Phys. Chem. 51, 129 (2000)

4. Bennett, M., Albert, R., Jez, J., Ma, H., Penning, T., Lewis, M.: Steroid recognition and regulation of hormone action: structure of testosterone and NADP+ bound to 3a-hydroxysteroid/dihydrodiol dehydrogenase. Structure 5, 799–812 (1997)

5. Burkert, U., Allinger, N.: Molecular Mechanics. ACS Monograph 177, American Chemical Society (1982)

6. Blackford, L., Choi, J., Cleary, A., D'Azevedo, E., Demmel, J., Dhillon, I., Dongarra, J., Hammarling, S., Henry, G., Petitet, A., Stanley, K., Walker, D., Whaley, R.C.: ScaLAPACK Users' Guide. Society for Industrial and Applied Mathematics (1997)

7. Brown, R., Case, D.: Second derivatives in generalized Born theory. J. Comput. Chem. 27, 1662–1675 (2006)

8. Chandra, R., Menon, R., Dagum, L., Kohr, D., Maydan, D., McDonald, J.: Parallel Programming in OpenMP. Morgan Kaufmann (2000)

9. Feig, M., Im, W., Brooks, C.: Implicit solvation based on generalized Born theory in different dielectric environments. J. Chem. Phys. 120, 903–911 (2004)

10. Garg, R.P., Sharapov, I.: Techniques for Optimizing Applications: High Performance Computing. Prentice Hall PTR (2001)

11. Gropp, W., Huss-Lederman, S., Lumsdaine, A., Lusk, E., Nitzberg, B., Saphir, W., Snir, M.: MPI: The Complete Reference, vol. 2, The MPI Extensions. MIT Press, Cambridge, MA (1998)

12. Hawkins, G., Cramer, C., Truhlar, D.: Parametrized models of aqueous free energies of solvation based on pairwise descreening of solute atomic charges from a dielectric medium. J. Phys. Chem. 100, 19824–19839 (1996)

13. Jin, H., Frumkin, M., Yan, J.: The OpenMP implementation of the NAS parallel benchmarks and its performance. Technical Report NAS-99-01, NASA Ames Research Center (1999)

14. Krawezik, G., Cappello, F.: Performance comparison of MPI and three OpenMP programming styles on shared memory multiprocessors. In: SPAA 2003: Proceedings of the Fifteenth Annual ACM Symposium on Parallel Algorithms, pp. 118–127. ACM, New York, NY (2003)

15. Luecke, G.R., Lin, W.H.: Scalability and performance of OpenMP and MPI on a 128-processor SGI Origin 2000. Concurr. Comp. Pract. Ex. 13(10), 905–928 (2001)

16. Macke, T.: NAB, a language for molecular manipulation. PhD Thesis, The Scripps Research Institute (1996)

17. Schlichting, I., Jung, C., Schulze, H.: Crystal structure of cytochrome P-450cam complexed with the (1S)-camphor enantiomer. FEBS Lett. 415, 253–257 (1997)

18. Srinivasan, J., Trevathan, M., Beroza, P., Case, D.: Application of a pairwise generalized Born model to proteins and nucleic acids: inclusion of salt effects. Theor. Chem. Acc. 101, 426–434 (1999)

19. Still, W., Tempczyk, A., Hawley, R., Hendrickson, T.: Semianalytical treatment of solvation for molecular mechanics and dynamics. J. Am. Chem. Soc. 112, 6127–6129 (1990)

20. Wang, M., Roberts, D., Paschke, R., Shea, T., Masters, B., Kim, J.: Three-dimensional structure of NADPH-cytochrome P450 reductase: prototype for FMN- and FAD-containing enzymes. Proc. Natl. Acad. Sci. USA 94, 8411–8416 (1997)

21. Weiner, P., Kollman, P.: AMBER: Assisted model building with energy refinement. A general program for modeling molecules and their interactions. J. Comp. Chem. 2, 287–303 (1981)
