

Parallel Boundary Elements Using Lapack and ScaLapack

Manoel T. F. Cunha, José C. F. Telles and Alvaro L. G. A. Coutinho
COPPE - Universidade Federal do Rio de Janeiro - Brazil

[email protected], [email protected], [email protected]

Abstract

The present work introduces the main steps towards the parallelization of existing Boundary Element Method (BEM) codes using available standard and portable libraries for writing parallel programs, such as LAPACK and ScaLAPACK. Here, a well-known BEM Fortran implementation is reviewed and rewritten to run on shared and distributed memory systems. This effort is the initial step to develop a new generation of parallel BEM codes to be used in many important engineering problems. Numerical experiments on an SGI Origin 2000 show the effectiveness of the proposed approach.

1. Introduction

Since there is always a limit to the performance of a single CPU, computer programs have evolved to increase their performance by using multiple CPUs. The last decade has seen a tremendous increase in the availability and affordability of parallel systems. Over the years, there has been a surge in both the quantity and the scalability of multiprocessor platforms. Parallel computing techniques can have an enormous impact on application performance, and many parallel libraries make this improvement readily accessible. Since such procedures are not yet widespread among BEM practitioners, the present authors decided to introduce these techniques into a well-documented teaching-package BEM application program, described in detail by Brebbia and Dominguez [5]. The code is herein reviewed and rewritten to run on shared and distributed memory systems using LAPACK [1], ScaLAPACK [4] and OpenMP [7]. The implementation process provides guidelines to develop parallel BEM codes applicable to many engineering problems.

The text is organized as follows: the next section presents an outline of BEM theory, and Section 3 presents parallel memory architectures and parallel libraries. Section 4 describes the selected application and the parallel code, while a performance analysis is presented in Section 5. The paper ends with a summary of the main conclusions.

2. Outline of BEM theory

The Boundary Element Method (BEM) [6][14] is a technique for the numerical solution of partial differential equations with initial and boundary conditions.

By using a weighted residual formulation, Green's third identity, Betti's reciprocal theorem or some other procedure, an equivalent integral equation can be obtained and then converted to a form that involves only surface integrals, i.e., over the boundary.

The boundary is then divided into elements and the integrals over the boundary are simply the sum of the integrations over each element, resulting in a reduced, dense and non-symmetric system of linear equations.

The discretization process also involves selecting nodes along the boundary where the unknown values are considered. An interpolation function relates one or more nodes on the element to the potential and fluxes anywhere on the element. The simplest case places a node at the center of each element and defines an interpolation function that is constant across the entire element.

Once the boundary values have been obtained, the interior points can be computed directly from the basic integral equation in a post-processing routine.

2.1. Differential Equation

Potential problems are governed by Laplace's equation

∇²u = ∂²u/∂x₁² + ∂²u/∂x₂² = 0 in Ω (1)

and boundary conditions:
u = ū on Γ1 and
q = q̄ = ∂u/∂n on Γ2 (2)
where u is the potential function, ū and q̄ are prescribed values, Γ = Γ1 + Γ2 is the boundary of domain Ω and n is the unit outward normal to surface Γ.

2.2. Integral Equation

The integral equation for potential problems reads:
u(ξ) + ∫Γ u(x) q*(ξ,x) dΓ = ∫Γ q(x) u*(ξ,x) dΓ (3)



The weighting function u* satisfies Laplace's equation and represents the field generated by a concentrated unit charge acting at the source point ξ.

The function q* is the outward normal derivative of u* along the boundary Γ with respect to the field point x, i.e., q*(ξ,x) = ∂u*(ξ,x)/∂n(x).

In order to obtain an integral equation involving only the variables on the boundary, one can take the limit of the integral equation as the point ξ tends to the boundary Γ. This limit must take into account the discontinuity of the second integral.

The resulting equation for potential problems is:
c(ξ) u(ξ) + ∫Γ u(x) q*(ξ,x) dΓ = ∫Γ q(x) u*(ξ,x) dΓ (4)
in which the second integral is computed in the Cauchy principal value sense.

The coefficient c is a function of the internal angle of the boundary Γ at the point ξ.

2.3. Discretization

The integral statement for potential problems can be written as follows:
ci ui + ∫Γ u q* dΓ = ∫Γ q u* dΓ (5)

Notice that the source point ξ was taken as the boundary node i at which the unit charge is applied, i.e., u(ξ) = ui.

After the discretization of the boundary into N elements, the integral equation can be written:

ci ui + Σⱼ₌₁ᴺ ∫Γj u q* dΓ = Σⱼ₌₁ᴺ ∫Γj q u* dΓ (6)

This equation represents, in discrete form, the relationship between the node i at which the unit charge is applied and all the j elements on the boundary, including i = j.

For the constant element case the boundary is always smooth, as the node is at the element midpoint; hence the coefficient ci is equal to ½ and the values of u and q are taken out of the integrals:

½ ui + Σⱼ₌₁ᴺ uj ∫Γj q* dΓ = Σⱼ₌₁ᴺ qj ∫Γj u* dΓ (7)

The equation can thus be written as follows:

½ ui + Σⱼ₌₁ᴺ Ĥij uj = Σⱼ₌₁ᴺ Gij qj (8)

and can be further rewritten as:

Σⱼ₌₁ᴺ Hij uj = Σⱼ₌₁ᴺ Gij qj (9)

where:

Hij = Ĥij for i ≠ j
Hij = Ĥij + ½ for i = j (10)

Equation (9) can be written in matrix form as:
H u = G q (11)

Applying the prescribed conditions, the equation can be reordered with all the unknowns on the left-hand side and a vector of known values on the right-hand side.

This gives:
A y = f (12)
where y is the vector of unknown u's and q's.

2.4. Internal Points

Once the system is solved, all the values on the boundary are known and the values of u and q at any interior point can be calculated using:
ui = ∫Γ q u* dΓ - ∫Γ u q* dΓ (13)

Internal fluxes can be derived from the potential equation, as follows:

qi = ∂u/∂x = ∫Γ q (∂u*/∂x) dΓ - ∫Γ u (∂q*/∂x) dΓ (14)

The same discretization process can be applied for the integrals above.

2.5. Fundamental Solution

For a two-dimensional isotropic domain, the fundamental solution is:

u*(ξ,x) = (1/2π) ln(1/r)    q*(ξ,x) = (1/2πr) rη (15)

where r is the distance from the source point ξ to the field point x and rη = ∂r(ξ,x)/∂n(x).

3. Parallel programming libraries

Parallel systems are usually classified as shared or distributed memory architectures. In the shared memory architecture all the processors are able to directly access all of the memory in the machine, while in distributed memory each processor in the system is only capable of addressing the memory directly associated with it [7].

3.1. LAPACK

LAPACK [1], or Linear Algebra Package, is a library of subroutines for solving systems of linear equations, least squares, eigenvalue and singular value problems. Dense and band matrices are supported, as well as real and complex data. LAPACK can also handle many associated computations, such as matrix factorizations or estimating condition numbers.



It is designed to be efficient on a wide range of modern high-performance computers, from PCs and workstations to shared memory multiprocessors and vector processors. LAPACK routines are written so that as much of the computation as possible is performed by calls to the BLAS routines.

BLAS. The BLAS, or Basic Linear Algebra Subprograms, include subroutines for common linear algebra computations such as vector, matrix-vector and matrix-matrix operations.

Highly efficient machine-specific implementations of the BLAS are available for many modern high-performance computers. The BLAS form a low-level interface between the LAPACK software and different machine architectures. Thus, the BLAS enable LAPACK routines to achieve high performance with portable code.
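As a concrete illustration (not taken from the paper's code; the array sizes are arbitrary), the Level 2 BLAS routine SGEMV used later in Section 4 computes y := alpha*A*x + beta*y with a single call:

program blas_demo
  implicit none
  integer, parameter :: n = 4
  real :: A(n,n), x(n), y(n)

  ! simple data just to have something to multiply
  A = 1.0
  x = 2.0
  y = 0.0

  ! y := 1.0*A*x + 0.0*y (single precision general matrix-vector product)
  call SGEMV('No transpose', n, n, 1.0, A, n, x, 1, 0.0, y, 1)

  print *, y   ! every entry equals 8.0
end program blas_demo

Linking this same source against a vendor-tuned BLAS is all that is needed for the call to run efficiently on a given machine.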

3.2. ScaLAPACK

ScaLAPACK [4] is the distributed-memory version of LAPACK. The goals of both libraries are efficiency, scalability, reliability, portability, flexibility and ease of use.

ScaLAPACK can solve systems of linear equations, linear least squares, eigenvalue and singular value decomposition problems. It can also handle many associated computations such as matrix factorizations or estimating condition numbers.

The building blocks of the ScaLAPACK library are distributed-memory versions of the BLAS, called the Parallel BLAS or PBLAS, and a set of Basic Linear Algebra Communication Subprograms, the BLACS.

PBLAS. The ScaLAPACK strategy for combining efficiency with portability is to construct the software so that as much of the computation as possible is performed by calls to the PBLAS. The PBLAS routines perform global computation by relying on the BLAS for local computation and on the BLACS for communication.

To simplify the design of ScaLAPACK, and because the BLAS have proven to be a useful tool outside LAPACK, the ScaLAPACK project chose to build a parallel set of BLAS, called the PBLAS, which performs message passing and whose interface is as similar to the BLAS as possible. This decision has permitted the ScaLAPACK code to be quite similar, and sometimes nearly identical, to the analogous LAPACK code.

The ScaLAPACK project hopes that the PBLAS will provide a distributed memory standard, just as the BLAS have provided a shared memory standard. This would simplify and encourage the development of high performance and portable parallel numerical software, as well as provide manufacturers with a small set of routines to be optimized.

BLACS. The BLACS, or Basic Linear Algebra Communication Subprograms, form a message-passing library designed for linear algebra.

The computation model consists of a one- or two-dimensional process grid, where each process stores pieces of the matrices and vectors. The BLACS include synchronous send/receive routines to communicate a matrix or submatrix from one process to another, to broadcast submatrices to many processes, or to compute global reductions: sums, maxima and minima.

There are also routines to construct, change or query the process grid. Since several ScaLAPACK algorithms require broadcasts or reductions among different subsets of grids, the BLACS permit a process to be a member of several overlapping or disjoint process grids, each one labeled by a context.
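The program below is an illustrative sketch only, not part of the application in Section 4: on an assumed 1 x P grid it sends a small block from the first process column to the second and then performs a global sum over the process row.

program blacs_demo
  implicit none
  integer :: IAM, NPROCS, ICTXT, NPROW, NPCOL, MYROW, MYCOL
  real :: BLK(4,4), S(1,1)

  ! create a 1 x NPROCS process grid with row-major ordering
  call BLACS_PINFO(IAM, NPROCS)
  call BLACS_GET(0, 0, ICTXT)
  call BLACS_GRIDINIT(ICTXT, 'Row', 1, NPROCS)
  call BLACS_GRIDINFO(ICTXT, NPROW, NPCOL, MYROW, MYCOL)

  if (MYCOL == 0) then
    BLK = 1.0
    ! send the 4x4 block to the process at grid position (0,1)
    if (NPCOL > 1) call SGESD2D(ICTXT, 4, 4, BLK, 4, 0, 1)
  else if (MYCOL == 1) then
    ! receive the block coming from grid position (0,0)
    call SGERV2D(ICTXT, 4, 4, BLK, 4, 0, 0)
  endif

  ! sum S over the process row; destination (-1,-1) leaves the result on every process
  S = real(MYCOL)
  call SGSUM2D(ICTXT, 'Row', ' ', 1, 1, S, 1, -1, -1)

  call BLACS_GRIDEXIT(ICTXT)
  call BLACS_EXIT(0)
end program blacs_demo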

Like LAPACK, the ScaLAPACK routines are based on block-partitioned algorithms in order to minimize the frequency of data movement between different levels of the memory hierarchy. The efficiency of the ScaLAPACK software depends on the use of block-partitioned algorithms and on efficient implementations of the BLAS and BLACS provided by computer vendors for their machines. Thus, the BLAS and BLACS form a low-level interface between the ScaLAPACK software and different machine architectures. Above this level, all of ScaLAPACK is portable.

Message Passing Libraries. In distributed memory systems, to access information in memory connected to other processors, the user must pass messages through some network connecting the processors. Such systems are usually programmed with message passing libraries such as PVM, the Parallel Virtual Machine, and MPI, the Message Passing Interface [13].

The ScaLAPACK software is designed so that it can be used with clusters of workstations through a networked environment and with a heterogeneous computing environment via MPI or PVM. Indeed, ScaLAPACK can run on any machine that supports either PVM or MPI.

3.3. OpenMP

OpenMP is a standard and portable Application Programming Interface (API) for writing shared memory and distributed shared memory programs [7].

One advantage of this parallelization model is that it can be applied incrementally. That is, the user can identify the most time-consuming parts of the code, usually loops, and parallelize just those. Another advantage of this parallelization style is that loops can be parallelized without concern about how the work is distributed across multiple processors and how data elements are accessed by each CPU.
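As a minimal sketch of this incremental style (the loop mirrors the element midpoint computation shown in Section 4; the rest of the program is illustrative only), a single directive pair is enough to distribute the iterations among threads:

program midpoints
  implicit none
  integer, parameter :: N = 1000
  integer :: i
  real :: X(N+1), Y(N+1), XM(N), YM(N)

  ! dummy geometry, just to have data to work on
  do i = 1, N+1
    X(i) = real(i)
    Y(i) = real(i) * 0.5
  enddo

  ! each iteration is independent, so the loop parallelizes as-is;
  ! a non-OpenMP compiler simply ignores the directives
  !$OMP PARALLEL DO
  do i = 1, N
    XM(i) = (X(i) + X(i+1)) * 0.5
    YM(i) = (Y(i) + Y(i+1)) * 0.5
  enddo
  !$OMP END PARALLEL DO

  print *, XM(1), YM(N)
end program midpoints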



4. A case study

Here, an existing Fortran computer program presented by Brebbia and Dominguez [5] is reviewed and rewritten. The program is a simple code for the solution of potential problems using constant elements.

The main program defines some general variables and arrays, integer and real, as follows:

program POCONBE

integer :: N,L,info
integer,parameter :: NX=8000
integer,dimension(NX) :: KODE,ipiv
real,dimension(NX+1) :: X,Y
real,dimension(NX) :: XM,YM,FI,DFI
real,dimension(NX,NX) :: G,H
real,dimension(20) :: CX,CY,POT,FLUX1,FLUX2

call INPUT
call GHMAT
call SGESV (N,1,G,NX,ipiv,DFI,NX,info)
if (info /= 0) stop
call INTER
call OUTPUT

end program POCONBE

In order to compute the G and H matrices of influence coefficients, the routine GHMAT calls two additional subroutines, EXTIN and LOCIN. While EXTIN computes the off-diagonal coefficients of H and G, LOCIN calculates only the diagonal elements of the G matrix. The EXTIN subroutine is also called by subroutine INTER to compute potentials and fluxes at internal points.

SGESV is a LAPACK routine which solves the linear system of equations Ax = b by factorizing A and overwriting b with the solution x.
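For readers unfamiliar with the interface, the standalone example below (not part of POCONBE; the 2x2 system is made up) shows the argument order and the fact that the right-hand side is overwritten with the solution:

program sgesv_demo
  implicit none
  integer, parameter :: n = 2
  integer :: ipiv(n), info
  real :: A(n,n), b(n)

  ! the system  2x +  y = 5
  !              x + 3y = 10
  A = reshape((/2.0, 1.0, 1.0, 3.0/), (/n, n/))   ! column-major storage
  b = (/5.0, 10.0/)

  ! LU-factorize A with partial pivoting and overwrite b with the solution
  call SGESV(n, 1, A, n, ipiv, b, n, info)
  if (info /= 0) stop

  print *, b   ! expected: 1.0 and 3.0
end program sgesv_demo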

In an optimized version of the program, EXTIN and LOCIN are inlined into GHMAT and loop optimizations are applied [9]:

subroutine GHMAT

integer :: i,j,k,kk
real :: ax,bx,ay,by,sl,eta1,eta2,ra,rd1,rd2,ch
real :: xco1,yco1,xco2,yco2,xco3,yco3,xco4,yco4
real :: tmp1,tmp2,tmp3,tmp4,rdn1,rdn2,rdn3,rdn4
real,parameter,dimension(4) :: &
  GI = (/0.86113631,-0.86113631, &
         0.33998104,-0.33998104/), &
  OME = (/0.34785485,0.34785485, &
          0.65214515,0.65214515/)

X(N+1) = X(1)
Y(N+1) = Y(1)
do i=1,N
  XM(i) = (X(i) + X(i+1)) * 0.5
  YM(i) = (Y(i) + Y(i+1)) * 0.5
enddo

G = 0.; H = 0.
do j=1,N
  kk = j + 1
  ax = (X(kk) - X(j)) * 0.5
  bx = (X(kk) + X(j)) * 0.5
  ay = (Y(kk) - Y(j)) * 0.5
  by = (Y(kk) + Y(j)) * 0.5
  sl = sqrt(ax * ax + ay * ay)
  eta1 = ay / sl
  eta2 = -ax / sl
  xco1 = ax * GI(1) + bx
  yco1 = ay * GI(1) + by
  ...
  xco4 = ax * GI(4) + bx
  yco4 = ay * GI(4) + by
  do i=1,N
    if (i /= j) then
      ra = sqrt((XM(i)-xco1)*(XM(i)-xco1) + &
                (YM(i)-yco1)*(YM(i)-yco1))
      tmp1 = 1. / ra
      rd1 = (xco1 - XM(i)) * tmp1
      rd2 = (yco1 - YM(i)) * tmp1
      rdn1 = rd1 * eta1 + rd2 * eta2
      ...
      ra = sqrt((XM(i)-xco4)*(XM(i)-xco4) + &
                (YM(i)-yco4)*(YM(i)-yco4))
      tmp4 = 1. / ra
      rd1 = (xco4 - XM(i)) * tmp4
      rd2 = (yco4 - YM(i)) * tmp4
      rdn4 = rd1 * eta1 + rd2 * eta2
      G(i,j) = (log(tmp1) * OME(1) + &
                ...
                log(tmp4) * OME(4)) * sl
      H(i,j) = - (rdn1 * OME(1) * tmp1 + &
                  ...
                  rdn4 * OME(4) * tmp4) * sl
    else
      G(j,j) = 2. * sl * (1. - log(sl))
      H(j,j) = 3.1415926
    endif
  enddo
enddo

do j=1,N
  if (KODE(j) > 0) then
    do i=1,N
      ch = G(i,j)
      G(i,j) = -H(i,j)
      H(i,j) = -ch
    enddo
  endif
enddo

call SGEMV &
  ('No transpose',N,N,1.,H,NX,FI,1,0.,DFI,1)

end subroutine GHMAT

The following profiler output shows the initial execution times of the whole program and of each function separately:

------------------------------------------------
[3]  100.0%  269.547   0.0%  0.000  POCONBE [3]
      70.0%  188.791                SGESV [4]
      29.9%   80.715                GHMAT [10]
       0.0%    0.041                INTER [31]
       0.0%    0.000                INPUT [87]
       0.0%    0.000                OUTPUT [527]
------------------------------------------------



The profiling results show that the SGESV routine takes the greatest portion of the execution time and should be the first part of the code to receive the programmer's attention. In larger systems, the solver has a major importance and changes in this part of the code bring the best results in overall performance.

4.1. A shared-memory implementation

Since the solver takes most of the execution time, a very simple and effective approach to implement a parallel version of the program is to use a shared-memory parallel version of the solver, obtained with a simple compiler option.

The LAPACK routines are written as a single thread of execution. However, LAPACK can accommodate shared-memory machines, provided parallel BLAS are available.

4.2. A distributed-memory implementation

ScaLAPACK provides parallel implementations of the serial LAPACK routines. Hence, the SGESV routine can be replaced by its equivalent parallel routine, PSGESV. In both cases the letter S in the routine names refers to the single precision real data type; the libraries also provide routines for the double precision, complex and double complex cases. The next two letters, GE, indicate a general, i.e., non-symmetric, matrix. ScaLAPACK can also handle banded, tridiagonal and other matrix types. Simple driver routines have names ending in SV.

Four basic steps are required to call a ScaLAPACK routine:
• Initialize the process grid.
• Distribute the matrix on the process grid.
• Call the ScaLAPACK routine.
• Release the process grid.

The application under study here can be rewritten using these steps, each one detailed below.

Process Grid. The P processes of an abstract parallel computer are often represented as a one-dimensional linear array of processes labeled 0, 1, ..., P-1. It is often more convenient to map this one-dimensional array of processes onto a two-dimensional rectangular grid. This grid has Pr rows and Pc columns, where Pr * Pc = P. A process can then be referenced by its row and column coordinates within the grid. The processes can be mapped to the process grid using row-major order, where the numbering of the processes increases sequentially across each row. Similarly, the processes can be mapped in column-major order, whereby the numbering of the processes proceeds down each column of the process grid.
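For example, with P = 6 processes arranged as a 2 x 3 grid in row-major order, process 4 occupies row 1, column 1. The short program below (illustrative only, not part of the application) reproduces this mapping with integer arithmetic:

program grid_coords
  implicit none
  integer, parameter :: NPROW = 2, NPCOL = 3   ! 2 x 3 grid, P = 6
  integer :: rank, myrow, mycol

  do rank = 0, NPROW*NPCOL - 1
    ! row-major ordering: ranks 0,1,2 fill row 0; ranks 3,4,5 fill row 1
    myrow = rank / NPCOL
    mycol = mod(rank, NPCOL)
    print *, 'process', rank, ' -> row', myrow, ' column', mycol
  enddo
end program grid_coords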

Contexts. In ScaLAPACK, and thus in the BLACS, each process grid is enclosed in a context. A context partitions the communication space: a process grid can safely communicate even if another, possibly overlapping, process grid is also communicating, but a message sent from one context cannot be received in another context.

In most aspects, the terms process grid and context can be used interchangeably. The slight difference here is that the user can define two identical process grids, but each will be enclosed in its own context, so that they are distinct in operation even though they are indistinguishable from a process grid standpoint.

A call to the BLACS routine BLACS_PINFO determines the number of processes and the current process number. The BLACS_GET and BLACS_GRIDINIT routines get a default system context and initialize the process grid using a row-major ordering. The BLACS_GRIDINFO routine allows the user to identify the process coordinates, row and column. Two BLACS routines, BLACS_GRIDEXIT and BLACS_EXIT, release the process grid and free the BLACS context. The process grid can alternatively be initialized with the routine SL_INIT.

Data Distribution. ScaLAPACK requires that all global data, vectors and matrices, be distributed across the processes prior to invoking the routines. It is the user's responsibility to perform this data distribution.

The storage schemes of global data structures in ScaLAPACK are conceptually the same as in LAPACK. Global data is mapped to the local memories of the processes assuming specific data distributions. The local data on each process is referred to as the local array.

The ScaLAPACK software library provides routines that operate on three types of matrices: in-core dense matrices, in-core narrow band matrices and out-of-core dense matrices. On entry, these routines assume that the data has been distributed on the processors according to a specific data decomposition scheme. Conventional arrays are used to store the data locally when it resides in the processor's memory.

The data layout information, as well as the local storage scheme for these different matrix operands, is conveyed to the routines via a simple array of integers called an array descriptor.

Array Descriptors. Each global array to be distributed across the process grid must be assigned an array descriptor. This array stores the information required to establish the mapping between each global array entry and its corresponding process and memory location.

This array descriptor is most easily initialized with a call to a ScaLAPACK TOOLS routine called DESCINIT and must be set prior to the invocation of a ScaLAPACK routine.
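A minimal standalone sketch is shown below, assuming a 1 x P grid; the comments spell out the DESCINIT argument order and what the nine descriptor entries hold. The dimensions N, NB and MXLLD mirror the values used in the program of Section 4.2.

program descinit_demo
  implicit none
  integer, parameter :: N = 8000, NB = 64, MXLLD = N
  integer :: IAM, NPROCS, ICTXT, NPROW, NPCOL, MYROW, MYCOL
  integer :: DESCg(9), info

  ! minimal grid setup, as in Section 4.2
  call BLACS_PINFO(IAM, NPROCS)
  call BLACS_GET(0, 0, ICTXT)
  call BLACS_GRIDINIT(ICTXT, 'Row', 1, NPROCS)
  call BLACS_GRIDINFO(ICTXT, NPROW, NPCOL, MYROW, MYCOL)

  ! DESCINIT(descriptor, global rows, global columns, row block size,
  !          column block size, first process row, first process column,
  !          context, local leading dimension, info)
  call DESCINIT(DESCg, N, N, NB, NB, 0, 0, ICTXT, MXLLD, info)
  if (info /= 0) write (*,*) 'DESCINIT : ', info

  ! the nine entries hold, in order: descriptor type, context, M, N, MB, NB,
  ! source process row, source process column and the local leading dimension
  if (IAM == 0) print *, DESCg

  call BLACS_GRIDEXIT(ICTXT)
  call BLACS_EXIT(0)
end program descinit_demo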



In-core Dense Matrices. The choice of an appropriate data distribution heavily depends on the characteristics and flow of the computation in the algorithm.

The two main issues in choosing a data layout for dense matrix computations are:
• Load balance, or splitting the work evenly among the processors.
• Use of Level 3 BLAS during computations on a single processor, to account for the memory hierarchy on each processor.

Dense matrix computations feature a large amount of parallelism, so that a wide variety of distribution schemes have the potential to achieve high performance. The block-cyclic data layout was selected for the dense algorithms implemented in ScaLAPACK principally because of its scalability, load balance and communication properties. The block-partitioned computation proceeds in consecutive order, just like a conventional serial algorithm. This essential property of the block-cyclic data layout explains why the ScaLAPACK design has been able to reuse the numerical and software expertise of the sequential LAPACK library.

The one-dimensional block-cyclic column distribution scheme divides the columns into groups of block size NB and distributes these groups in a cyclic manner. This layout includes two special cases, the one-dimensional block column distribution and the one-dimensional cyclic column distribution, where NB = N / P and NB = 1, respectively.

The two-dimensional block-cyclic distribution scheme used in the ScaLAPACK library for dense matrix computations is a mapping of a set of blocks onto the processes.
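As a small worked example (illustrative values): with N = 8 columns, block size NB = 2 and NPCOL = 2 process columns, process column 0 owns global columns 1, 2, 5 and 6, while process column 1 owns 3, 4, 7 and 8. The sketch below prints this mapping using NUMROC, the ScaLAPACK tool routine that returns the number of locally stored rows or columns, together with the same local-to-global index arithmetic employed in the distributed GHMAT further below:

program block_cyclic_demo
  implicit none
  integer, parameter :: N = 8, NB = 2, NPCOL = 2
  integer :: MYCOL, NQ, j, l
  integer, external :: NUMROC

  do MYCOL = 0, NPCOL-1
    ! number of global columns stored by this process column
    NQ = NUMROC(N, NB, MYCOL, 0, NPCOL)
    do j = 1, NQ
      ! local column j -> global column l
      l = (INT((j-1)/NB)*NPCOL + MYCOL)*NB + (j - INT((j-1)/NB)*NB)
      print *, 'process column', MYCOL, ': local', j, ' -> global', l
    enddo
  enddo
end program block_cyclic_demo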

The application can then be written as shown:

program POCONBE

integer :: N,L
integer,parameter :: NX=8000
integer,dimension(NX) :: KODE
real,dimension(NX+1) :: X,Y
real,dimension(NX) :: FI,XM,YM
real,dimension(20) :: CX,CY

! BLACS variables
integer :: IAM,NPROCS,ICTXT
integer :: NPROW,NPCOL,MYROW,MYCOL

! array descriptors
integer,parameter :: NB=64,MXLLD=NX,MXLOCC=4064
integer,dimension(9) :: DESCg,DESCh,DESCx

! local arrays
integer :: NP,NQ,i,j,k,l,info
integer,dimension(NX) :: ipiv
real,dimension(MXLLD,MXLOCC) :: H,G
real,dimension(MXLLD) :: LFI,DFI

! external function
integer,external :: NUMROC

! process number and number of processes
call BLACS_PINFO(IAM,NPROCS)

NPROW = INT(SQRT(REAL(NPROCS)))
NPCOL = NPROCS / NPROW
if (MOD(NPROCS,NPROW*NPCOL) > 0) then
  NPROW = 1
  NPCOL = NPROCS
endif

! get default system context and
! initialize the process grid
call BLACS_GET(0,0,ICTXT)
call BLACS_GRIDINIT(ICTXT,'Row',NPROW,NPCOL)
call BLACS_GRIDINFO &
  (ICTXT,NPROW,NPCOL,MYROW,MYCOL)

call INPUT

NP = NUMROC(N,NB,MYROW,0,NPROW)
NQ = NUMROC(N,NB,MYCOL,0,NPCOL)

! init array descriptors
call DESCINIT &
  (DESCg,N,N,NB,NB,0,0,ICTXT,MXLLD,info)
if (info/=0) write (*,*) 'DESCINIT (g) : ',info
call DESCINIT &
  (DESCh,N,N,NB,NB,0,0,ICTXT,MXLLD,info)
if (info/=0) write (*,*) 'DESCINIT (h) : ',info
call DESCINIT &
  (DESCx,N,1,NB,1,0,0,ICTXT,MXLLD,info)
if (info/=0) write (*,*) 'DESCINIT (x) : ',info
if (info/=0) stop

call GHMAT

! solve the linear system A X = B
call PSGESV &
  (N,1,G,1,1,DESCg,ipiv,DFI,1,1,DESCx,info)
if (info/=0) then
  write (*,*) 'PSGESV : ',info
  stop
endif

call INTER
call OUTPUT

call BLACS_GRIDEXIT(ICTXT)
call BLACS_EXIT(0)

end program POCONBE

The generation of the system of equations in the distributed-memory version must also be rewritten in order to accommodate the two-dimensional block-cyclic distribution:

subroutine GHMAT

integer :: i,j,k,l,kk
real :: ax,bx,ay,by,sl,eta1,eta2,ra,rd1,rd2,ch
real :: xco1,yco1,xco2,yco2,xco3,yco3,xco4,yco4
real :: tmp1,tmp2,tmp3,tmp4,rdn1,rdn2,rdn3,rdn4

do j=1,MXLOCC
  do i=1,MXLLD
    G(i,j) = 0.
    H(i,j) = 0.
  enddo
enddo
do i=1,MXLLD
  DFI(i) = 0.
enddo

X(N+1) = X(1)
Y(N+1) = Y(1)
do i=1,N
  XM(i) = (X(i) + X(i+1)) * 0.5
  YM(i) = (Y(i) + Y(i+1)) * 0.5
enddo

do j=1,NQ
  l = (INT((j-1)/NB)*NPCOL+MYCOL)*NB + &
      (j-INT((j-1)/NB)*NB)
  kk = l + 1
  ax = (X(kk) - X(l)) * 0.5
  bx = (X(kk) + X(l)) * 0.5
  ay = (Y(kk) - Y(l)) * 0.5
  by = (Y(kk) + Y(l)) * 0.5
  sl = sqrt(ax * ax + ay * ay)
  eta1 = ay / sl
  eta2 = -ax / sl
  xco1 = ax * GI(1) + bx
  yco1 = ay * GI(1) + by
  ...
  xco4 = ax * GI(4) + bx
  yco4 = ay * GI(4) + by
  do i=1,NP
    k = (INT((i-1)/NB)*NPROW+MYROW)*NB + &
        (i-INT((i-1)/NB)*NB)
    if (k /= l) then
      ra = sqrt((XM(k)-xco1)*(XM(k)-xco1) + &
                (YM(k)-yco1)*(YM(k)-yco1))
      tmp1 = 1. / ra
      rd1 = (xco1 - XM(k)) * tmp1
      rd2 = (yco1 - YM(k)) * tmp1
      rdn1 = rd1 * eta1 + rd2 * eta2
      ...
      ra = sqrt((XM(k)-xco4)*(XM(k)-xco4) + &
                (YM(k)-yco4)*(YM(k)-yco4))
      tmp4 = 1. / ra
      rd1 = (xco4 - XM(k)) * tmp4
      rd2 = (yco4 - YM(k)) * tmp4
      rdn4 = rd1 * eta1 + rd2 * eta2
      G(i,j) = (log(tmp1) * OME(1) + &
                ...
                log(tmp4) * OME(4)) * sl
      H(i,j) = - (rdn1 * OME(1) * tmp1 + &
                  ...
                  rdn4 * OME(4) * tmp4) * sl
    else
      G(i,j) = 2. * sl * (1. - log(sl))
      H(i,j) = 3.1415926
    endif
  enddo
enddo

do j=1,NQ
  l = (INT((j-1)/NB)*NPCOL+MYCOL)*NB + &
      (j-INT((j-1)/NB)*NB)
  if (KODE(l) > 0) then
    do i=1,NP
      ch = G(i,j)
      G(i,j) = -H(i,j)
      H(i,j) = -ch
    enddo
  endif
enddo

do i=1,NP
  k = (INT((i-1)/NB)*NPROW+MYROW)*NB + &
      (i-INT((i-1)/NB)*NB)
  LFI(i) = FI(k)
enddo

call PSGEMV('No transpose',N,N,1.,H,1,1,DESCh, &
            LFI,1,1,DESCx,1,0.,DFI,1,1,DESCx,1)

end subroutine GHMAT

4.3. The shared-memory model with OpenMP

OpenMP is a set of compiler directives that may be embedded within a program written in a standard programming language such as Fortran or C/C++.

In Fortran these directives take the form of source code comments identified by the $OMP prefix, and they are simply ignored by a non-OpenMP compiler. Thus, the same source code can be used to compile either a serial or a parallel version of the application.

The performance of the shared memory version can be enhanced by also parallelizing the generation of the system of equations using OpenMP directives. This implementation has already been presented in a previous work [8] and its performance is compared below.
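A sketch of what that parallelization can look like is given below; it is an annotated excerpt of the outer loop of the serial GHMAT from Section 4, not the exact code of [8], and the private list is an assumption based on the temporaries used inside the loop. Since each iteration of j writes only to column j of G and H, the iterations are independent:

! outer loop of GHMAT annotated with OpenMP directives (sketch)
!$OMP PARALLEL DO PRIVATE(i,kk,ax,bx,ay,by,sl,eta1,eta2,ra,rd1,rd2, &
!$OMP   rdn1,rdn2,rdn3,rdn4,tmp1,tmp2,tmp3,tmp4, &
!$OMP   xco1,yco1,xco2,yco2,xco3,yco3,xco4,yco4)
do j=1,N
  ! ... same element integration code as in the serial GHMAT ...
enddo
!$OMP END PARALLEL DO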

5. Experimental results

The applications were run on an SGI Origin 2000 with 16 R10000 250 MHz processors and 8 GB of memory. The operating system is IRIX, the compiler is MIPSpro and the profiler is SpeedShop.

A ScaLAPACK library tuned for the system was downloaded from www.netlib.org and installed in the user's home directory.

For an 8000x8000 system of equations and using a shared-memory parallel version of the solver, the program presented the following processing times:

Procs.   User       System   Wall       Speedup
1         862.341    3.587   14:30.98   -
2         878.876    3.040    7:58.58   1.82
4         913.357    3.011    4:44.13   3.07
8         948.206    3.824    3:02.79   4.76
16       1090.721    3.607    2:16.41   6.39

The distributed memory implementation presented here showed the following results:

Procs.   User       System   Wall       Speedup
1         862.341    3.587   14:30.98   -
2         944.393    5.262    7:58.12   1.93
4         998.318   10.460    4:15.14   3.41
8        1052.448   23.128    2:18.44   6.29
16       1199.217   95.945    1:27.77   9.92



It must be noticed that in the shared memory version only the solver is parallelized, while in the distributed memory version the generation of the system of equations is also parallelized.

A second shared-memory implementation, including the parallelization of the generation of the system of equations using OpenMP directives, showed the results below:

Procs.   User       System   Wall       Speedup
1         862.341    3.587   14:30.98   -
2         865.046    4.295    7:20.85   1.98
4         898.354    4.023    3:50.15   3.78
8         946.537    4.150    2:02.65   7.10
16       1151.940    4.279    1:16.35   11.40

6. Conclusions

The development of parallel applications for shared and distributed memory systems can be achieved in a simple and effortless way by using tuned parallel libraries.

The solution of the linear system of equations is usually the most time-consuming task in BEM codes, and this work shows a practical approach to the parallelization of existing or new applications using a tuned parallel solver.

The implementation of the shared memory version of the program does not demand any change in the code and brings speedups as good as those of the distributed version. The latter, however, needs considerable code rewriting to distribute data among the processes.

Since the solver is an already tuned routine, an algorithmic change can be applied to further increase the performance of the solution of the system of equations. Generally, iterative solvers have more modest storage requirements than direct methods and may also be faster, depending on the iterative method and the problem. The Generalized Minimal Residual (GMRES) algorithm is an efficient method to solve the dense non-symmetric systems of equations produced by BEM codes [2][3][10][11][12].

Improvement of program performance is the main reason for the parallelization of a computer program. However, the concepts presented here also address the problem of running applications on memory-limited systems. In this case, the size of the arrays in the program is a major concern and memory constraints may prevail over processing time.

LAPACK and ScaLAPACK are standard and portable libraries freely available for a variety of architectures. Thus, parallel boundary element codes can be developed and distributed with no requirement beyond knowledge of the data distribution schemes.

Acknowledgements

The authors are indebted to SIMEPAR - Sistema de Meteorologia do Paraná for the use of its supercomputing facilities and support during the course of this work. M.T.F. Cunha is supported by a CAPES grant from the Ministry of Education, Brazil.

References:

1. Anderson E et al. LAPACK User's Guide, 3rd ed. SIAM, 1999.

2. Barra LPS, Coutinho ALGA, Telles JCF et al. Iterative Solution of BEM Equations by GMRES Algorithm. Computers and Structures. 1992;44(6):1249-1253.

3. Barra LPS, Coutinho ALGA, Telles JCF. Multi-level Hierarchical Preconditioners for Boundary Element Systems. Engineering Analysis with Boundary Elements. 1993;12:103-109.

4. Blackford LS et al. ScaLAPACK Users' Guide. SIAM, 1997.

5. Brebbia CA, Dominguez J. Boundary Elements: An Introductory Course, 2nd ed. CMP - McGraw Hill, 1992.

6. Brebbia CA, Telles JCF, Wrobel LC. Boundary Element Techniques: Theory and Applications in Engineering. Springer-Verlag, 1984.

7. Chandra R et al. Parallel Programming in OpenMP. Academic Press, 2001.

8. Cunha MTF, Telles JCF, Coutinho ALGA. Parallel Boundary Elements Using OpenMP. 13th SBAC-PAD, Brasília, Brazil, 2001.

9. Cunha MTF, Telles JCF, Coutinho ALGA. High Performance Techniques Applied to Boundary Elements: Potential Problems. 22nd CILAMCE, Campinas, Brazil, 2001.

10. Dongarra JJ, Duff IS, Sorensen DC et al. Numerical Linear Algebra for High Performance Computers. SIAM, 1998.

11. Golub G, Ortega JM. Scientific Computing: An Introduction with Parallel Computing. Academic Press, 1993.

12. Kreienmeyer M, Stein E. Efficient Parallel Solvers for Boundary Element Equations Using Data Decomposition. Engineering Analysis with Boundary Elements. 1997;19:33-39.

13. Pacheco PS. Parallel Programming with MPI. Morgan Kaufmann, 1997.

14. Telles JCF. The Boundary Element Method Applied to Inelastic Problems. Springer-Verlag, 1984.
