10.1.1.116.9344

7/29/2019 10.1.1.116.9344


INTERNATIONAL JOURNAL FOR NUMERICAL METHODS IN ENGINEERING, VOL. 39, 1313-1340 (1996)

HIGH-PERFORMANCE PCG SOLVERS FOR FEM STRUCTURAL ANALYSIS

P. SAINT-GEORGES* AND G. WARZEE
Service des Milieux Continus, Université Libre de Bruxelles, 50 av. F.-D. Roosevelt, 1050 Bruxelles, Belgium

R. BEAUWENS AND Y. NOTAY†
Service de Métrologie Nucléaire, Université Libre de Bruxelles, 50 av. F.-D. Roosevelt, 1050 Bruxelles, Belgium

SUMMARY

The preconditioned conjugate gradient algorithm is a well-known and powerful method used to solve large sparse symmetric positive definite linear systems. Such systems are generated by the finite element discretization in structural analysis, but users of finite elements in this context generally still rely on direct methods. It is our purpose in the present work to highlight the improvement brought forward by some new preconditioning techniques and to show that the preconditioned conjugate gradient method performs better than efficient direct methods.

KEY WORDS: iterative methods for linear systems; preconditioning

1. INTRODUCTION

It seems generally accepted that Preconditioned Conjugate Gradient (PCG) solvers cannot compete with direct methods for solving realistic industrial problems arising from finite element (FE) discretizations in structural analysis. A recent comparison by Poole et al.¹ confirmed this view, leaving their low memory requirements as the only advantage that still should be granted to iterative solvers. As a matter of fact, most commercially available industrial codes use Irons' frontal method or a skyline LU factorization as solver. Our purpose here is to demonstrate that taking care of recent advances in the design of PCG algorithms reverses this conclusion, at least for large problems.

Highly effective preconditioners have recently been obtained by dynamically controlled approximate factorizations of Stieltjes matrices, as described in Section 2. The field of application of these factorizations has further been extended to more general symmetric positive definite matrices by means of spectral equivalence techniques. The application of this extension to FE structural analysis is discussed in Section 3. Section 4 presents performance comparisons between our PCG algorithm and an efficient direct method, showing amongst others that the robustness of PCG does not deteriorate for problems offering discontinuities or high values of the Poisson ratio.

All the results in this paper are restricted to elastostatics, leading after FE discretization to a system of linear equations of the form

$$K q = f \tag{1}$$

*Supported by the IRSIA (Institut pour l'Encouragement de la Recherche Scientifique dans l'Industrie et l'Agriculture)
†Supported by the FNRS (Fonds National de la Recherche Scientifique)

CCC 0029-5981/96/081313-28
© 1996 by John Wiley & Sons, Ltd.
Received 3 August 1993
Revised 26 January 1995


where $K$ is the structural stiffness matrix, $f$ is the nodal load vector and $q$ is the nodal displacement vector.

2. PCG SOLVERS FOR STIELTJES MATRICES

PCG methods for solving (1) have now reached an elaborated stage of development and can almost be used in a black-box fashion when $K$ is a Stieltjes matrix. When $K$ is not a Stieltjes matrix, an additional so-called reduction step has to be introduced, by which $K$ is first preconditioned by an approximate Stieltjes matrix $S$. The techniques for doing this have also seen recent developments, to be described in the next section.

Definition 1. $M$ is an M-matrix if $M$ is non-singular with non-positive off-diagonal entries and has a non-negative inverse.

Definition 2. $S$ is a Stieltjes matrix if $S$ is a symmetric M-matrix or, alternatively, positive definite with non-positive off-diagonal entries.

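The PCG method for solving (1) is shown in Table I, where $B$ denotes the preconditioning matrix. At each iteration step, a linear system

$$B h_{k+1} = g_{k+1} \tag{2}$$

has to be solved. Therefore, solving (2) must be cheap.

Table I. The PCG scheme

INITIALIZATION
    q_0 = q_init
    g_0 = K q_0 - f
    B h_0 = g_0
    d_0 = -h_0
ITERATION
    α_k = (g_k^T h_k) / (d_k^T K d_k)
    q_{k+1} = q_k + α_k d_k
    g_{k+1} = g_k + α_k K d_k
    B h_{k+1} = g_{k+1}
    β_k = (g_{k+1}^T h_{k+1}) / (g_k^T h_k)
    d_{k+1} = -h_{k+1} + β_k d_k

The number of iterations $i_\varepsilon$ required to reduce the $K$-norm of the initial error by a factor $\varepsilon$ is bounded by

$$i_\varepsilon \le \frac{1}{2}\sqrt{\kappa(B^{-1}K)}\,\ln\frac{2}{\varepsilon} + 1 \tag{3}$$

where $\kappa(B^{-1}K)$ denotes the ratio of the extreme eigenvalues of $B^{-1}K$,

$$\kappa(B^{-1}K) = \frac{\lambda_{\max}(B^{-1}K)}{\lambda_{\min}(B^{-1}K)} \tag{4}$$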
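As a minimal sketch of the scheme in Table I, the following Python functions run preconditioned conjugate gradients on a small dense system. The names `pcg`, `solve_B` and `iteration_bound`, the dense list-of-lists storage and the stopping test on $g_k^T h_k$ are assumptions made for illustration only; they are not taken from the solver described in this paper. `iteration_bound` evaluates the right-hand side of the bound (3), assuming its usual form.

```python
import math


def matvec(A, x):
    """Dense matrix-vector product (A is a list of rows)."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def pcg(K, f, solve_B, q_init=None, eps=1e-10, max_iter=1000):
    """PCG with the sign convention g = K q - f and d_0 = -h_0.

    solve_B(g) must return h such that B h = g (equation (2)).
    Assumed stopping test: g_k^T h_k <= eps^2 * g_0^T h_0.
    Returns the approximate solution and the number of iterations.
    """
    q = list(q_init) if q_init is not None else [0.0] * len(f)
    g = [kq - fi for kq, fi in zip(matvec(K, q), f)]
    h = solve_B(g)
    d = [-hi for hi in h]
    gh = dot(g, h)
    gh0 = gh
    for k in range(max_iter):
        if gh <= eps * eps * gh0:
            return q, k
        Kd = matvec(K, d)
        alpha = gh / dot(d, Kd)
        q = [qi + alpha * di for qi, di in zip(q, d)]
        g = [gi + alpha * kdi for gi, kdi in zip(g, Kd)]
        h = solve_B(g)
        gh_new = dot(g, h)
        beta = gh_new / gh
        d = [-hi + beta * di for hi, di in zip(h, d)]
        gh = gh_new
    return q, max_iter


def iteration_bound(kappa, eps):
    """Right-hand side of the bound (3) for a condition number kappa."""
    return 0.5 * math.sqrt(kappa) * math.log(2.0 / eps) + 1.0


# Usage: a small Stieltjes matrix with a Jacobi (diagonal) preconditioner.
K = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
f = [1.0, 2.0, 3.0]
q, iterations = pcg(K, f, lambda g: [gi / 4.0 for gi in g])
```

The preconditioner enters only through `solve_B`, so any of the approximate factorizations discussed in this section could be plugged in as a pair of triangular solves.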


Table II. The IC scheme

INITIALIZATION
    U = offdiag(K)
    P = diag(K)
ITERATION
    For r = 1, ..., N - 1
        For i = r + 1, ..., N such that (r, i) ∈ NZP
            temp = u_ri / p_r
            p_i = p_i - temp u_ri
            For j = i + 1, ..., N such that (r, j) ∈ NZP
                If (i, j) ∈ FP then u_ij = u_ij - temp u_rj

The ratio (4) is called the spectral condition number of $B^{-1}K$ (the bound (3) is sometimes called the Meinardus bound, as it refers to Reference 5). Therefore, $B$ should be a good spectral approximation of $K$, meaning hereby that $\kappa(B^{-1}K)$ should be as small as possible.

Both conditions put on $B$ (inexpensive resolution of equation (2) and small $\kappa(B^{-1}K)$) can be met by using for $B$ an approximate factorization of $K$ but, as will be seen below, considerable care has to be taken in the choice of this approximate factorization.

To begin with, one may first try to use incomplete factorizations of $K$, often called (somewhat abusively) incomplete Cholesky (IC) factorizations and derived from the exact factorization by ignoring updates of all entries (of the approximate factors) that do not belong to some given fill-in pattern. This scheme is described in Table II, where

$$B = U^T P^{-1} U \tag{5}$$

$U$ being upper triangular and $P$ diagonal with $P = \mathrm{diag}(U)$. In this table, the differences between IC and the exact factorization are put in italic mode; the fill-in pattern is written FP and NZP denotes the nonzero pattern of $U + U^T$, that is, the union of FP and the nonzero pattern of $K$. An IC method is of low order when FP is small; a more precise definition of order will be given below.

IC methods were introduced hoping that the quality of the preconditioning would rapidly increase with FP. It was later proved⁶ that it never decreases when FP increases in the case of M-matrices (then in particular for Stieltjes matrices) on the basis of Woznicki's comparison theorem.⁷ But it was very early observed by Price and Varga⁸ that, on the contrary, a quite considerable increase of the fill-in leads to only a moderate improvement of the spectral condition number.

A first issue of the Price-Varga analysis is therefore that only low-order methods are of practical interest. In the present work, we shall only consider two fill-in levels:

1. order 0: $FP(0) = \{(i, j) \text{ with } i = j\}$;
2. order 1: $FP(1) = E(K)$,

where $E(K)$ denotes the edge set of the graph of the matrix $K$, i.e. the set of couples $(i, j)$ such that $k_{ij} \neq 0$.


Higher orders of fill-in may be defined recursively by

$$FP(p+1) = FP(p) \cup E(B(p) - K) \tag{6}$$

where $B(p)$ denotes the IC factorization determined by $FP(p)$. Fractional orders may be used for intermediate levels. This definition differs only slightly from the Price-Varga convention⁶,⁸ and it is closer (although not identical) to the more usual ones.

It has to be mentioned that another issue of the Price-Varga analysis is that the IC factorization leads asymptotically to iteration numbers similar to those of the Jacobi preconditioning, with a better leading constant, but insufficient to represent a decisive improvement for large matrices.

Such an improvement came essentially from the works of Axelsson⁹ and Gustafsson¹⁰ who succeeded in analyzing a modification of the IC method, originally introduced by Buleev¹¹ and by which the diagonal entries of the triangular factors $U$ are modified so as to satisfy a row-sum criterion which may be written

$$B x = K x + \Lambda D x \tag{7}$$

where $x$ is a positive vector such that $Kx \ge 0$, $D = \mathrm{diag}(K)$ and $\Lambda = \mathrm{diag}(\lambda_i)$ is a non-negative diagonal matrix, called perturbation matrix. When $K$ is diagonally dominant, one may choose $x = \mathbf{1}$ (the vector where all the components are unity) and this will be our choice in the following without further comment.

A heuristic way to justify this modification consists in regarding it as attempting to take care of the fill-in neglected in the IC scheme by moving it to the diagonal (so that the row-sum on each line has the same value as it would have for the exact factorization).

This heuristic explanation neglects the perturbation matrix $\Lambda$, which plays an essential role in the Axelsson-Gustafsson analysis, but it turned out to be widely accepted and the corresponding method is described in Table III under the name of MIC method according to this widespread usage. We must insist, however, that it is in complete disagreement with the Axelsson-Gustafsson analysis (whose bound on $\kappa(B^{-1}K)$ becomes infinite when $\Lambda = 0$) and even that it neglects the main progress of their contributions, which was precisely to show the extreme sensitivity of (their bound on) the spectral condition number of the MIC preconditioning to the size and location of the perturbations $\lambda_i$.

Table III. The MIC scheme

INITIALIZATION
    U = offdiag(K)
    P = diag(K)
ITERATION
    For r = 1, ..., N - 1
        For i = r + 1, ..., N such that (r, i) ∈ NZP
            temp = u_ri / p_r
            p_i = p_i - temp u_ri
            For j = i + 1, ..., N such that (r, j) ∈ NZP
                If (i, j) ∈ FP then u_ij = u_ij - temp u_rj
                else p_i = p_i - temp u_rj
                     p_j = p_j - temp u_rj


Table IV. The DMIC scheme

INITIALIZATION
    U = offdiag(K)
    P = diag(K)
ITERATION
    For r = 1, ..., N - 1
        τ_0 = -(1 / p_r) Σ_{i>r} u_ri
        If τ_0 > τ then p_r = -(1 / τ) Σ_{i>r} u_ri
        For i = r + 1, ..., N such that (r, i) ∈ NZP
            temp = u_ri / p_r
            p_i = p_i - temp u_ri
            For j = i + 1, ..., N such that (r, j) ∈ NZP
                If (i, j) ∈ FP then u_ij = u_ij - temp u_rj
                else p_i = p_i - temp u_rj
                     p_j = p_j - temp u_rj

Table V. The RIC scheme

INITIALIZATION
    U = offdiag(K)
    P = diag(K)
ITERATION
    For r = 1, ..., N - 1
        For i = r + 1, ..., N such that (r, i) ∈ NZP
            temp = u_ri / p_r
            p_i = p_i - temp u_ri
            For j = i + 1, ..., N such that (r, j) ∈ NZP
                If (i, j) ∈ FP then u_ij = u_ij - temp u_rj
                else p_i = p_i - ω temp u_rj
                     p_j = p_j - ω temp u_rj

The best way to introduce these modulated perturbations became an active research area, leading to dynamic factorizations where a check is made at each stage of the factorization process, the issue of which determines the size of the corresponding perturbation.

We shall consider here two dynamic factorization schemes: the DMIC scheme described in Table IV, which is a dynamic version of Gustafsson's statically perturbed MIC method (which must be distinguished from the unperturbed MIC scheme of Table III), and the DRIC method described in Table VI, which is a dynamic method derived from mixing the RIC and DMIC methods. The RIC method, due to Axelsson and Lindskog,²⁶ is shown in Table V.
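The factorization loop shared by the IC, MIC and RIC schemes can be sketched in a few lines of Python. The sketch below follows the structure of Table V on a dense symmetric matrix with 0-based indices; the function names, the dense storage and the representation of FP as a set of index pairs are illustrative assumptions, not the authors' implementation. Setting `omega = 0` gives the IC scheme and `omega = 1` the unperturbed MIC scheme.

```python
def edge_set(K):
    """Order 1 fill-in pattern E(K), stored as pairs (i, j) with i < j."""
    n = len(K)
    return {(i, j) for i in range(n) for j in range(i + 1, n) if K[i][j] != 0.0}


def relaxed_ic(K, FP, omega):
    """Factorization loop of Table V on a dense symmetric matrix (0-based).

    Returns (u, p): the strict upper part of U and its diagonal P.
    """
    n = len(K)
    u = [[K[i][j] if j > i else 0.0 for j in range(n)] for i in range(n)]
    p = [K[i][i] for i in range(n)]
    for r in range(n - 1):
        for i in range(r + 1, n):
            if u[r][i] == 0.0:          # (r, i) not in NZP
                continue
            temp = u[r][i] / p[r]
            p[i] -= temp * u[r][i]
            for j in range(i + 1, n):
                if u[r][j] == 0.0:      # (r, j) not in NZP
                    continue
                if (i, j) in FP:
                    u[i][j] -= temp * u[r][j]
                else:                   # dropped fill-in is moved to the diagonal
                    p[i] -= omega * temp * u[r][j]
                    p[j] -= omega * temp * u[r][j]
    return u, p


def assemble_B(u, p):
    """Dense B = U^T P^{-1} U (equation (5)), with diag(U) = P."""
    n = len(p)
    w = [[p[i] if i == j else u[i][j] for j in range(n)] for i in range(n)]
    return [[sum(w[r][i] * w[r][j] / p[r] for r in range(n))
             for j in range(n)] for i in range(n)]


# 5-point Laplacian on a 2 x 2 grid: eliminating node 0 creates fill-in at (1, 2).
K = [[4.0, -1.0, -1.0, 0.0],
     [-1.0, 4.0, 0.0, -1.0],
     [-1.0, 0.0, 4.0, -1.0],
     [0.0, -1.0, -1.0, 4.0]]
u, p = relaxed_ic(K, edge_set(K), omega=1.0)
B = assemble_B(u, p)
```

With `omega = 1` the row sums of `B` and `K` coincide, which is the row-sum criterion (7) with a zero perturbation matrix and $x = \mathbf{1}$. A production code would of course use sparse storage and never assemble `B` explicitly.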


Table VI. The DRIC scheme

INITIALIZATION
    U = offdiag(K)
    P = diag(K)
ITERATION
    For r = 1, ..., N - 1
        If τ_0 > τ then ω = 2τ/τ_0 - 1 else ω = 1
        For i = r + 1, ..., N such that (r, i) ∈ NZP
            temp = u_ri / p_r
            p_i = p_i - temp u_ri
            For j = i + 1, ..., N such that (r, j) ∈ NZP
                If (i, j) ∈ FP then u_ij = u_ij - temp u_rj
                else p_i = p_i - ω temp u_rj
                     p_j = p_j - ω temp u_rj

These dynamic methods use an a priori given parameter $\tau$ and an upper spectral bound of $B^{-1}K$ to enforce the condition &(B-'K )


3. REDUCTION TECHNIQUES

3.1. A two-step preconditioning approach

Stiffness matrices generated by FEM structural analyses are usually not Stieltjes matrices and non-positive pivots may then appear during the computation of the approximate factorization, so that XIC preconditioners can no longer be considered as reliable. A simple remedy consists in adding appropriate positive values to the pivots encountered during the approximate factorization¹⁶ but there is no evidence that the resulting PCG method will still be efficient. A more reliable procedure is based on a two-step preconditioning approach:

1. Reduction step: Determine a Stieltjes matrix $S$ from the stiffness matrix $K$;
2. Approximate factorization: Determine an approximate factorization $B$ of $S$ by applying one of the XIC schemes described in the previous section to $S$.

Since

$$B^{-1}K = B^{-1}S\,S^{-1}K \tag{9}$$

rc(R-'K )


where $\mathbf{1} = \{1\ 1\ \ldots\ 1\}^T$ is chosen here for convenience. The reduced matrix is obtained by taking $A = K$ and $S = \bar{A}$.

Axelsson and Gustafsson also developed another reduction technique for 2-D membrane analyses, later extended to the study of 3-D solid structures by Shlafman and Efrat. First, the stiffness matrix is partitioned by grouping all the degrees of freedom (d.o.f.s) of the same type. The matrix below illustrates the case of a 3-D solid structure where each node has three associated d.o.f.s in the x, y and z directions, respectively:

$$K = \begin{bmatrix} K_{xx} & K_{xy} & K_{xz} \\ K_{yx} & K_{yy} & K_{yz} \\ K_{zx} & K_{zy} & K_{zz} \end{bmatrix} \tag{16}$$

A reduced matrix $K^D$ is then obtained by decoupling the d.o.f.s, that is, $K^D$ ignores any connection between d.o.f.s of different types. That decoupling will be called D-reduction.
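The D-reduction amounts to keeping only the diagonal blocks of (16). A minimal Python sketch, assuming dense storage and a hypothetical list `dof_type` that labels each unknown with its direction, could read:

```python
def d_reduction(K, dof_type):
    """Return K^D: entries coupling d.o.f.s of different types are dropped."""
    n = len(K)
    return [[K[i][j] if dof_type[i] == dof_type[j] else 0.0
             for j in range(n)] for i in range(n)]


# Two nodes with x and y d.o.f.s, numbered (x1, y1, x2, y2); values are made up.
K = [[4.0, 1.0, -2.0, -1.0],
     [1.0, 4.0, -1.0, -2.0],
     [-2.0, -1.0, 4.0, 1.0],
     [-1.0, -2.0, 1.0, 4.0]]
KD = d_reduction(K, ["x", "y", "x", "y"])
```

The labels replace the explicit regrouping of the unknowns by type, so the same function works whatever numbering the mesh uses.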

Axelsson and Gustafsson¹⁸ and Shlafman and Efrat did not consider in their theoretical studies the case of matrices $K^D$ to which an approximate factorization still cannot apply. However, in most of the FEM structural analyses, $K^D$ is not a Stieltjes matrix. For this reason, the D-reduction is incomplete and must be followed by another reduction process: in our experiments, we used the C-reduction defined by equations (12)-(15) with $A = K^D$, $\bar{A} = \bar{K}^D$ and $\hat{A} = \hat{K}^D$. We will refer in the following to that combination of reduction schemes as the DC-reduction. The reduced matrix is obtained by taking $S = \bar{K}^D$.

3.3. Spectral equivalence of K and S for the DC-reduction

Our purpose now is to prove the spectral equivalence of $K$ and $S = \bar{K}^D$ with respect to the number NEL and the size $h$ of the finite elements. The whole set of matrices used in that proof is presented with their links in Figure 1. Matrix $K$ is formed by assembling the elementary stiffness matrices $K^e$. The decoupling applies to these matrices as well as to the global matrix, generating elementary $K^{eD}$ matrices. The assembly of $K^{eD}$ restores the global $K^D$ matrix. The C-reduction may be applied to $K^D$ to form $\bar{K}^D$ but also to each $K^{eD}$, yielding elementary reduced matrices $\bar{K}^{eD}$.

Figure 1. The reduction process and related matrices (D-reduction followed by C-reduction)


HIGH-PERFORMANCE PCG SOLVERS 1321The assembly of ED rings forward a global matrix K*; t can easily be seen that in general

The property of spectral equivalence with the DC-reduction isderived from the five theoremsbelow.Theorem I . When C-reduction is applied to any symmetric positice definite matrix A (resp.non-negative definiteA'), the matricesA and A (resp.A and lie) re respectively, symmetric positivedefinite and non-negative definite (resp. both symmetric non-negative definite).Proof. Symmetry is obvious. A (resp.Ae) snon-negative definite from the Gershgorin theorembecause A1=0 (resp. A el =0) and all off-diagonal entries of A (resp.Ae) re non-positive. A is

0

K * #K D .

positive (resp.A' is non-negative) definite because A =A +A (resp. A' =A' +Ae).Theorem2. For a matrix A =1 Ae such that Qe

ker(A') =ker(g),the matricesA and A are spectrally equivalent with respect to the number offinite elementNEL, i.e.there exist aA, A >0, ndependent of NEL such that for all x #0,aAxTAx


depend on the size of the finite elements $h$. Note that the assumptions of Theorem 5 are less restrictive than those of Theorem 4.

Theorem 5. For 2-D membranes, 3-D solids, beams and $C^1$ continuity plates, the spectral bounds $\alpha_A$ and $\beta_A$ are independent of the size $h$ of the elements when C-reduction is applied to $A = K^D$.

Proof. See Appendix IV. □
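The defining equations (12)-(15) of the C-reduction are not reproduced above. The Python sketch below therefore assumes the usual compensation rule, in which every positive off-diagonal entry is cancelled and added to the diagonal of its row. This is one construction that satisfies the properties used in the proof of Theorem 1 (the compensation matrix has zero row sums and non-positive off-diagonal entries, and the reduced matrix is the sum of the original and the compensation matrix); it should not be read as the authors' exact definition. The names `c_reduction`, `A_bar` and `A_hat` are chosen here for illustration.

```python
def c_reduction(A):
    """Assumed C-reduction with x = 1: returns (A_bar, A_hat), A_bar = A + A_hat.

    Each positive off-diagonal entry a_ij is cancelled (A_hat[i][j] = -a_ij)
    and added to the diagonal of its row, so that A_hat has zero row sums.
    """
    n = len(A)
    A_hat = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and A[i][j] > 0.0:
                A_hat[i][j] = -A[i][j]
                A_hat[i][i] += A[i][j]
    A_bar = [[A[i][j] + A_hat[i][j] for j in range(n)] for i in range(n)]
    return A_bar, A_hat


# Small symmetric positive definite example with one positive coupling (made up).
A = [[4.0, 1.0, -2.0],
     [1.0, 3.0, -1.0],
     [-2.0, -1.0, 5.0]]
A_bar, A_hat = c_reduction(A)
```

Under this assumption the reduced matrix keeps the row sums of the original one and has no positive off-diagonal entry left, which is what allows the XIC schemes of Section 2 to be applied to it.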

4. NUMERICAL RESULTS

All the techniques presented in Sections 2 and 3 have been implemented in our PCG solver, which allows the user to choose the preconditioner (IC, MIC, DMIC, RIC or DRIC), its order (0 or 1) and the reduction (C or DC).

4.1. Some important remarks about the numerical results

Remark 1. We compare our PCG solver to the implementation of Irons' frontal solver FRONT of the industrial software SAMCEF (V4), distributed by SAMTECH (Liège, Belgium), generally accepted to be an efficient direct solver and widely used for aeronautic and civil engineering applications. Some numerical tests aiming to compare SAMCEF and the well-known MA28 direct solver package of the Harwell Subroutine Library are presented in Appendix V.

Remark 2. It is well-known²⁰ that the numbering of the unknowns has a significant effect on PCG performances. We do not address this problem here, relying instead on a recent study by Notay who recommends the use of level orderings, often called Reverse Cuthill-McKee (with respect to Reference 27 where one such ordering was introduced). As these orderings may be defined on arbitrary grids, we used one of them, described below:

(a) A level structure is built with one of the most connected nodes of the graph as starting node;
(b) Within each level, nodes with the smallest value of the ratio (# unnumbered neighbours)/(# neighbours) are numbered first;
(c) The obtained ordering is reversed.

On the other hand, the FRONT solver needs a frontwidth minimization, for which Sloan's frontwidth reducer has been used.

Remark 3. One generally uses a stopping criterion ensuring that the initial error (the norm of the initial gradient $g_0$) has been reduced by a given factor $\varepsilon$, but this may be unsatisfactory if the initial error is too large (or even too small). We use here the criterion

3, (B- K )qkTf2l + EgkThk
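The level ordering of Remark 2, steps (a) to (c), can be sketched as follows in Python. The graph is given as an adjacency dictionary and is assumed to be connected; recomputing the ratio after every numbered node and breaking ties by the smallest node index are assumptions of this sketch, since the text does not specify them.

```python
def level_ordering(adj):
    """Level ordering following steps (a)-(c); adj maps node -> list of neighbours."""
    # (a) level structure rooted at a node of maximal degree
    nodes = sorted(adj)
    start = max(nodes, key=lambda v: len(adj[v]))
    levels = [[start]]
    seen = {start}
    while True:
        nxt = []
        for v in levels[-1]:
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    nxt.append(w)
        if not nxt:
            break
        levels.append(nxt)

    numbered = set()
    order = []

    def ratio(v):
        unnumbered = sum(1 for w in adj[v] if w not in numbered)
        return unnumbered / len(adj[v]) if adj[v] else 0.0

    # (b) within each level, smallest ratio first
    for level in levels:
        remaining = list(level)
        while remaining:
            v = min(remaining, key=lambda x: (ratio(x), x))
            remaining.remove(v)
            numbered.add(v)
            order.append(v)

    # (c) reverse the ordering obtained
    order.reverse()
    return order


# Path graph 0 - 1 - 2 - 3 - 4
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
order = level_ordering(adj)
```

The returned list gives the nodes in their new numbering order; applying it as a symmetric permutation of $K$ is left out of the sketch.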


Remark 4. The last two points relate to some details of implementation allowing significant reductions of the computation times:

(i) The preconditioner and the system matrix are diagonally scaled by $P$ to shortcut all operations involving $P$ in the PCG iterations and to save the storage of a real vector. Because the diagonal scaling is performed only once, the benefit brought by that scaling increases if more than one right-hand side is considered;
(ii) For order 0 preconditioners, Eisenstat's algorithm²⁴ applies $P$ as preconditioner for solving

$$(U^{-T} K U^{-1})(U q) = U^{-T} f$$

instead of applying $B$ as preconditioner for solving system (1). The so-modified PCG scheme produces at each iteration the same approximate solution as the original algorithm but allows significant CPU time savings.

4.2. The range of examples

The presented numerical experiments are restricted here to 2-D membrane and 3-D solid problems, but the same kind of results have been obtained for plates and shells even if the theory has not yet been fully extended to handle these problems. The finite elements used in the following are:

REM4, REM8 (REctangular Membrane 4- or 8-node elements);
H8, H20 (Hexahedral solid 8- or 20-node elements).

4.3. Efficiency of the different preconditioning techniques on regular grids

For regular grids using one single type of FE, the number of PCG iterations $n$ can be plotted as a function of the number of d.o.f.s $N$ of the grid. The plotted curves tend to be straight lines (for $N \ge 500$ roughly) in bilogarithmic axes, which corresponds to the power law

$$n \propto N^e \tag{24}$$

In that law, the value of $e$ depends on the preconditioner and the reduction technique that are chosen but also on the type of FE used for the discretization.

Table VII(a) gives the number of iterations $n$ needed by all the IC-like order 0 preconditioners for increasing number of d.o.f.s $N$ and for different types of FE over regular square and cubic grids. For each preconditioner, DC- and C-reductions were performed separately, giving rise to two values for $n$. Table VII(b) contains the same information for order 1 preconditioners.

Comparing the number of iterations for the DC- and the C-reduction, it can be seen that the DC-reduction is much more efficient than the C-reduction. This conclusion is especially true for the MIC preconditioner, for which the number of iterations is increased by a factor 6 to 7 in some of our numerical tests when only the C-reduction was used (see e.g. REM8 for N = 38,720 in Table VII(a)). Let us remark that the choice of the reduction is more critical for the MIC preconditioner because of its lack of stability, as mentioned in Section 2.

In practice, increasing the order from 0 to 1 leads to a significant improvement in the aggregate for the MIC, DMIC and RIC factorizations, but roughly no change is obtained for the IC and DRIC preconditioners. However, even for IC and DRIC, the numbers of iterations are sometimes different. These small differences betray that there is no mathematical equivalence between order 0 and order 1 IC and DRIC preconditioners. It can only be guaranteed that the numbers of iterations are quite similar in the range of the considered examples.
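The exponent $e$ of equation (24) is the slope of the iteration counts in bilogarithmic axes. As an illustration, the Python sketch below estimates it by an ordinary least-squares fit of $\log n$ against $\log N$, restricted to $N \ge 500$. The fitting procedure actually used for Table VII(c) is not stated in the text, so the least-squares choice and the function name `fit_exponent` are assumptions; the data are the REM4 iteration counts of the DRIC(0) preconditioner with DC-reduction, as read from Table VII(a).

```python
import math


def fit_exponent(N_values, n_values, N_min=500):
    """Least-squares slope of log(n) against log(N), using points with N >= N_min."""
    pts = [(math.log(N), math.log(n))
           for N, n in zip(N_values, n_values) if N >= N_min]
    mx = sum(x for x, _ in pts) / len(pts)
    my = sum(y for _, y in pts) / len(pts)
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    return sxy / sxx


# REM4 grids, DRIC(0) with DC-reduction (values read from Table VII(a)).
N_rem4 = [220, 840, 1860, 3280, 5100, 7320, 9940, 12960, 16380]
n_dric_dc = [32, 45, 57, 65, 74, 82, 88, 94, 101]
e = fit_exponent(N_rem4, n_dric_dc)
```

With these data the fitted slope should land close to the value reported for DRIC and REM4 in Table VII(c); small differences are to be expected if the original fit used other points or weights.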


Since the improvement due to an increase from order 0 to 1 is often small, especially for DC-reduced preconditioners, order 0 is to be recommended since it requires less storage and more efficient implementations of PCG (like Eisenstat's²⁴) can then be used.

Table VII(c) shows the values of the exponent $e$ of equation (24) for all the preconditioners built with a DC-reduction. DMIC, RIC and DRIC factorizations exhibit the best asymptotical behaviour, having the smallest values for $e$. These preconditioners must thus be preferred to any other XIC factorization when very large systems are considered. The values of $e$ seem to be too close to each other to conclude which is the best preconditioner amongst DMIC, RIC and DRIC. However, it can be noticed that

(i) $e$ is always smaller for DRIC than for RIC;
(ii) $e$ is sometimes smaller for DMIC than for DRIC;
(iii) however, when DMIC(0) performs better than DRIC(0), DMIC(1) may perform worse than DRIC(0) and DRIC(1) (for H8 examples).

Table VII(a). Number of iterations obtained by performing order 0 preconditionings with DC- or C-reduction on regular grids

                  IC(0)       MIC(0)      DMIC(0)     RIC(0)      DRIC(0)
Element  N        C     DC    C     DC    C     DC    C     DC    C     DC
REM4     220      35    36    38    33    34    32    33    32    34    32
         840      67    61    80    49    64    49    58    45    60    45
         1860     98    87    144   61    98    60    84    55    87    57
         3280     130   113   222   74    143   72    113   66    122   65
         5100     161   139   320   85    184   82    144   75    153   74
         7320     195   165   419   98    233   91    167   83    190   82
         9940     225   192   564   106   279   99    211   92    234   88
         12960    257   217   690   116   337   105   243   101   275   94
         16380    294   244   826   128   393   113   280   109   23 1  101
REM8     640      96    97    71    54    68    52    73    59    70    54
         2480     178   170   137   69    110   62    113   70    111   59
         5520     264   246   226   87    157   74    158   77    155   68
         9760     346   317   350   103   215   84    202   88    197   77
         15200    439   396   487   115   274   95    246   97    249   85
         21840    523   473   660   129   337   104   287   106   300   95
         29680    610   549   871   142   403   114   333   115   358   102
         38720    688   626   1036  153   463   121   373   123   411   108
         48960    -     703   -     166   -     132   -     125   -     115
H8       540      43    43    42    41    38    41    35    36    37    37
         1344     60    57    69    49    55    46    49    43    50    42
         3630     88    74    117   64    83    53    72    50    75    50
         6084     104   86    154   73    103   57    84    57    90    54
         9450     120   98    203   82    124   62    98    61    125   57
         13872    137   110   252   91    146   65    114   68    122   61
         19494    153   123   316   102   168   69    130   72    140   64
H20      504      123   110   102   94    101   95    102   97    108   99
         1980     162   133   143   111   128   102   126   104   127   100
         3276     185   145   171   116   140   101   139   107   140   105
         5040     218   175   212   124   160   105   156   115   158   109
         7344     251   201   254   149   188   117   178   127   184   124


Table VII(b). Number of iterations obtained by performing order 1 preconditionings with DC- or C-reduction on regular grids

                  IC(1)       MIC(1)      DMIC(1)     RIC(1)      DRIC(1)
Element  N        C     DC    C     DC    C     DC    C     DC    C     DC
REM4     220      35    36    40    33    36    32    33    32    34    32
         840      67    61    90    49    69    49    63    45    60    45
         1860     98    87    167   62    112   60    94    55    87    57
         3280     130   113   263   74    159   72    128   66    122   65
         5100     161   139   398   85    204   82    165   75    153   74
         7320     195   165   532   95    266   92    196   83    190   82
         9940     225   192   691   108   325   99    245   92    234   88
         12960    257   217   878   116   392   105   282   101   275   95
         16380    294   244   1094  128   456   113   329   109   322   101
REM8     640      96    97    71    50    67    49    69    54    70    54
         2480     178   170   145   67    111   59    109   62    111   59
         5520     264   246   242   83    159   72    153   71    155   68
         9760     346   317   359   98    218   80    205   78    197   77
         15200    439   396   516   108   278   89    250   89    249   87
         21840    523   473   710   124   342   100   294   99    300   95
         29680    610   549   898   137   411   109   345   108   358   102
         38720    -     626   -     143   -     113   -     113   -     108
H8       540      43    43    39    38    37    37    35    36    37    37
         1344     60    57    64    46    56    42    50    42    50    42
         3630     88    74    111   59    82    50    76    50    75    50
         6084     104   86    150   66    102   54    90    55    90    54
         9450     120   98    191   76    125   58    107   60    106   57
         13872    137   110   239   84    147   62    122   65    123   61
         19494    -     123   -     93    -     65    -     69    -     64
H20      504      122   110   99    86    101   94    102   96    108   101
         1980     162   133   139   97    127   97    124   97    127   99
         3276     185   145   159   102   137   94    138   99    140   106
         5040     218   175   189   113   156   104   155   109   158   109
         7344     251   201   228   126   180   104   183   119   184   124

Table VII(c). Value of the exponent e of the DC-reduced preconditioners applied to regular grids for different types of FE

         Order 0 preconditioners            Order 1 preconditioners
         REM4    REM8    H8      H20        REM4    REM8    H8      H20
IC       0.4482  0.4571  0.2887  0.2168     0.4482  0.4571  0.2887  0.2168
MIC      0.3137  0.2606  0.2549  0.1524     0.3129  0.2633  0.2499  0.1339
DMIC     0.2913  0.2130  0.1458  0.0636     0.2918  0.2126  0.1597  0.0375
RIC      0.2853  0.1811  0.1928  0.0909     0.2853  0.1879  0.1823  0.0715
DRIC     0.2671  0.1799  0.1544  0.0708     0.2680  0.1799  0.1544  0.0653

In addition, in the presence of anisotropies, DRIC is also robust while DMIC presents some weakness, as shown by Notay. Since these tendencies will be confirmed for irregular grids, we consider that DRIC is the best possible choice for a black-box solver, and it will be chosen for comparison purposes with the reference direct solver.


Figure 2. Number of iterations for different C- and DC-reduced order 0 preconditioners on REM8 regular grids (horizontal axis: number of d.o.f.s)

Figure 3. Number of iterations for the DRIC(0)-DC preconditioner on regular grids using various element types (horizontal axis: number of d.o.f.s)

Equation (24) is illustrated in Figure 2, where the number of iterations is plotted for REM8 grids for different preconditioners, and in Figure 3, where the number of iterations is plotted for the DRIC(0)-DC preconditioner for different types of elements.

4.4. Comparison of iterative and direct solvers efficiencies on regular grids

For regular grids using the same type of finite element, the computational time $t$ is only a function of the number of d.o.f.s $N$, following equation (25)


Table VIII(a). CPU times for the PCG-DRIC(0)-DC and FRONT solvers for regular grids (obtained on IBM 4381)

                   PCG CPU times                                    FRONT
Example  N         (1) Iteration(a)  (2) Solving(b)  (3) Total(c)   Solving
REM4     220       0.405             0.575           0.838          1.108
         840       2.391             2.939           3.815          9.031
         1860      6.879             8.074           9.955          36.656
         3280      13.953            16.013          19.396         109.113
         5100      25.327            28.556          33.697         227.573
         7320      40.197            44.783          52.083         456.351
         9940      59.450            65.850          75.836         810.024
         12960     83.551            91.691          104.778        1497.427
         16380     112.169           122.484         138.985        2315.357
REM8     640       3.136             3.761           4.309          7.471
         2480      15.559            18.126          20.201         80.588
         5520      37.625            43.233          48.060         351.526
         9760      79.316            89.278          97.562         1138.358
         15200     130.478           145.766         158.703        2690.863
         21840     218.535           241.138         259.642        5250.614
         29680     307.051           336.728         363.226        10059.146
         38720     420.393           459.031         498.985        17542.381
         48960     582.388           632.292         689.003        29455.517
H8       540       3.099             4.060           5.427          27.175
         1344      10.861            13.439          17.377         242.312
         3630      36.960            44.575          56.450         2051.025
         6084      63.561            76.409          97.266         6859.362
         9450      107.673           128.957         162.874        19229.520
         13872     173.647           204.476         258.470        45299.115
         19494     256.410           300.513         400.626        93400.336
H20      504       13.142            14.764          16.864         34.911
         1980      63.954            71.805          83.854         765.153
         3276      117.534           131.415         152.714        2404.191
         5040      194.796           216.495         251.359        6497.124
         7344      326.620           358.411         422.500        15434.569

(a) Column (1) = CPU for PCG iterations
(b) Column (2) = Column (1) + CPU for reduction + approx. factorization + scaling
(c) Column (3) = Column (2) + CPU for the assembly

Table VIII(b). Value of the exponent s for the FRONT and PCG-DRIC(0)-DC solvers

                   Type of element
Solver             REM4    REM8    H8      H20
FRONT              1.7830  1.9147  2.2674  2.2727
PCG-DRIC(0)-DC     1.1919  1.1675  1.1838  1.1926

In that law, the exponent $s$ depends on the solver (FRONT or PCG). Table VIII(a) shows the CPU times needed by FRONT and PCG, for which only the best order 0 preconditioner, that is DRIC(0) with the DC-reduction, has been taken into account. These times were obtained on an IBM 4381 but our experience confirms that they are similar to those that would be obtained on


other workstations (IBM RS/6000, SUN SPARC, etc.). No tests were performed on supercomputers like Cray, which are used by far fewer industrial FEM users.

Three columns concern the PCG solver in Table VIII(a): the first one gives the CPU times associated with the iterative process, the second one adds to these main values the times needed to compute the reduction, the preconditioner and the scaling, and the third column shows the global times including the assembly of the system. This last time must be compared to the frontal solver time.

Table VIII(a) shows that even for small systems, the iterative scheme performs faster. The exponent $s$, given in Table VIII(b) for different types of FE, allows us to extrapolate that conclusion to very large systems:

(i) The value of $s$ lies around 1.9 and 2.3 for the frontal method, respectively for 2-D and 3-D grids;
(ii) For PCG-DRIC(0)-DC, the value of $s$ remains almost unchanged in the 2-D and 3-D applications and lies around 1.18.

3-D structures are consequently very expensive in terms of (storage and) CPU times if FRONT is chosen as solver, but this is not the case when PCG is chosen.

Let us recall here that the best result we could hope for is $s = 1$, because the calculation time cannot increase less than the number of unknowns, and note that DRIC(0)-DC yields values close to 1. Other iterative solvers like algebraic multilevel methods, combined with DC-reduction, theoretically yield $s = 1$ as asymptotical value. In any case, it is only an asymptotical behaviour and we do not have enough experience in dealing with multilevel methods to verify numerically their efficiency (i.e. to verify whether they rapidly reach their asymptotical behaviour). Therefore, the values of $s$ provided by DRIC(0)-DC seem satisfactory to us.
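To make the extrapolation argument concrete, the following Python sketch scales a measured CPU time to a larger problem, assuming that equation (25) has the power-law form $t \propto N^s$ suggested by the discussion of the exponent $s$. The function name and the target size `N_target` are hypothetical; the reference times and exponents are the H8 values as read from Tables VIII(a) and VIII(b).

```python
def extrapolate_time(t_ref, N_ref, s, N):
    """Scale a measured time t_ref at N_ref to N, assuming t proportional to N**s."""
    return t_ref * (N / N_ref) ** s


# H8 grids: largest measured case of Table VIII(a), exponents of Table VIII(b).
N_ref = 19494
t_front, s_front = 93400.336, 2.2674
t_pcg, s_pcg = 400.626, 1.1838

N_target = 100000   # hypothetical problem size, not a case from the paper
ratio_ref = t_front / t_pcg
ratio_target = (extrapolate_time(t_front, N_ref, s_front, N_target)
                / extrapolate_time(t_pcg, N_ref, s_pcg, N_target))
```

Under this assumption the FRONT/PCG time ratio grows like $N^{s_{FRONT} - s_{PCG}}$, so the advantage of the iterative solver widens with the problem size. Such figures are only indicative, since they ignore memory limits and any change of regime beyond the measured range.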

4.5. Performances of the different preconditioning techniques on non-regular grids

Excluding the influence of the frontwidth, the calculation times could still be extrapolated following equation (25) for the frontal method with the same exponent $s$, but not for the PCG methods, which are sensitive to mesh-specific parameters like element distortions, material discontinuities, etc. For this reason, each numerical test must be considered separately.

The first non-regular grid tested is presented in Figure 4; it has been generated to study the membrane stress distribution in a parking floor. In that study, the plane stress hypothesis has been assumed and REM8 elements were used. In the following, we will refer to that grid as PARK. The second grid (Figure 5) has been built under the same assumptions; it is a REM8 discretization of a portion of a wall in a mansion (called MANS). 3-D grids were also tested, like a component of an oven or an I-beam, using H8 elements. Those grids, named OVEN and BEAM, are represented in Figures 6 and 7.

Table IX(a) shows the number of iterations obtained by order 0 preconditioners, presented under the same format as in Table VII(a). Information related to order 1 preconditioners is given in Table IX(b). It can be seen that DMIC, RIC and DRIC remain the most efficient schemes and that MIC exhibits again its lack of stability and poor convergence properties in the FE context. The use of DMIC or RIC order 1 preconditioners leads to small reductions of the number of iterations (about 10 per cent reduction for MANS, 8 per cent for PARK, 5 per cent for OVEN and no reduction for BEAM) compared to order 0 results, and the rate of convergence is simply not improved by increasing the order for the DRIC preconditioner.

Once more, DC-reduction is highly superior to the C-reduction. This allows us to consider DRIC(0)-DC as the best preconditioning technique amongst those presented in this paper.


Figure 4. PARK: free mesh of a parking floor

4.6. Comparison of iterative and direct solvers efficiencies on non-regular grids

The same examples as in Section 4.5 are considered. Table X shows the total CPU times obtained for those examples by the frontal solver and the DRIC(0)-DC solver. The iterative solver remains quicker than the frontal one even for high gradients of element sizes (MANS) or large element distortions (BEAM). Let us remark that the BEAM and OVEN grids have small frontwidths, which favour the frontal solver. PCG runs only two times faster than FRONT for the BEAM example.

4.7. The effect of material discontinuities

Material discontinuities have been generated as follows: regular square (REM4 and REM8) and cubic (H8 and H20) meshes are cut into two pieces of the same shape, containing the same number of elements, one having a Young's modulus 10 times greater than the other. The number of iterations of our iterative solver on these non-uniform structures is compared in Table XI to that of the uniform structures of Table VII(a) for the DRIC(0)-DC preconditioner. The discontinuities apparently have no significant influence.


Figure 5. MANS: free mesh of a mansion wall

Table IX(a). Number of iterations obtained by performing order 0 preconditionings with DC- or C-reduction on non-regular grids

                     IC(0)        MIC(0)       DMIC(0)       RIC(0)       DRIC(0)
Example      N      C     DC     C     DC      C     DC     C     DC     C     DC
PARK      9067    453    449     a      a    503    191   422    196   423    184
MANS     24368      a      a     a   1042   1492    642  1279    680  1288    623
OVEN     23687    370    370     a      a    410    221   325    241   323    196
BEAM     19362   1169   1020     a      a   1199    677  1035    728  1073    649

a = more than 1501 iterations


Table IX(b). Number of iterations obtained by performing order 1 preconditionings with DC- or C-reduction on non-regular grids

                     IC(1)        MIC(1)       DMIC(1)       RIC(1)       DRIC(1)
Example      N      C     DC     C     DC      C     DC     C     DC     C     DC
PARK      9067    453    449     a      a    495    174   401    180   423    185
MANS     24368      a      a     a    947      …    594     …    608     …    621
OVEN     23687    370    370     a      a    399    210   347    236   325    195
BEAM     19362   1169   1020     a      a      …    679     …    725     …    647

a = more than 1501 iterations

Figure 6. OVEN: free mesh of an oven component

Table X. CPU times for the PCG and FRONT solvers for non-regular grids

                              PCG CPU times                     FRONT
Example      N    (1) Iteration(a)  (2) Solving(b)  (3) Total(c)    Solving
PARK      9067        164.060          173.339        180.378     5223.835
MANS     24368       1515.441         1539.534       1558.451     5292.520
OVEN     23687        578.947          610.050        673.929     9649.185
BEAM     19362       2277.499         2315.715       2384.131     4774.791

(a) Column (1) = CPU for PCG iterations
(b) Column (2) = Column (1) + CPU for reduction + approx. factorization + scaling
(c) Column (3) = Column (2) + CPU for the assembly


Figure 7. BEAM: free mesh of an I-beam

Table XI. Number of iterations of the DC-reduced order 0 IC-like preconditioners for uniform meshes and discontinuous meshes

Element      N    Mesh 1   Mesh 2
REM4     16380      101      103
REM8     38720      115      112
H8       19494       64       64
H20       7344      124      121

Mesh 1 = Uniform REM or H mesh with Young's modulus E1
Mesh 2 = Discontinuous mesh, Young's modulus E1 on one half of the mesh, 10E1 on the other half


Table XII. Number of iterations for different values of the Poisson ratio ν with the DRIC(0)-DC preconditioner

   ν        PARK    REM4 (N = 388)    REM4 (N = 796)
0.40000      184          45                55
0.49000      196          46                57
0.49900      197          46                57
0.49990      197          46                51
0.49999      198          46                57

4.8. The effect of the material properties and anisotropy

The quality of XIC preconditioners is generally acknowledged to be influenced by the material properties. Axelsson and Gustafsson¹⁸ have in particular emphasized the effect of the Poisson ratio ν: in their experiments, values of ν near 0.5 deteriorate the spectral equivalence bound. This may be due to the presence of (1 - ν) in the Hooke matrix.

Table XII compares the effect of ν on the number of iterations for three of our grids, including a non-regular grid, with a DRIC(0)-DC preconditioner. It seems that, thanks to the DC-reduction, the effect of ν on the number of iterations is fairly slight and can be neglected, which is an important improvement from the robustness point of view.

Following Notay, the robustness of the DRIC preconditioner extends to anisotropic problems, contrary to DMIC. Note that the Poisson ratio ν, when non-zero, introduces anisotropy in the discretized equations. However, additional experiments, not presented here for brevity, have shown that even when XIC factorizations other than DRIC are used, the effect of ν can be neglected; this highlights the fact that the insensitivity of the number of iterations with respect to ν is due to our DC reduction and not to the choice of DRIC amongst the other XIC schemes.

5. CONCLUSIONS

5.1. On the choice of an iterative scheme

Three aspects must be taken into account when choosing an approximate factorization for PCG preconditioning:

(i) The reduction: the need for an efficient reduction technique in the FEM structural analysis context has been highlighted, and reduction techniques yielding Stieltjes factorizable matrices have been presented and justified. The coupling of two reduction schemes, decoupling and diagonal compensation, leads to highly efficient preconditioners;
(ii) The approximate factorization: the preconditioner DRIC recently developed by Notay⁴ has been presented, and significant numerical tests show that, also for structural FEM problems, it is one of the most powerful of the IC family of preconditioners: IC has weak convergence properties, MIC is not robust enough, RIC and DMIC sometimes perform better at order 1 but are more sensitive to grid non-uniformity, material discontinuities or anisotropy;


(iii) The order of the factorization: the gain brought by medium-order preconditioners is often too small to make them attractive for problems arising from the FEM discretization in structural analysis, at least for DRIC-DC preconditioning. This allows us to use order 0 schemes, which have small memory requirements.

The combination of DRIC(0) and the hybrid DC reduction leads to the lowest CPU times amongst the popular preconditioning techniques considered here.
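For orientation, the way an approximate factorization enters the PCG iteration can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses the plain IC(0) factorization on a model Stieltjes matrix (the 5-point Laplacian), not the DRIC(0)-DC preconditioner studied in this paper, it stores everything densely for brevity, and the function names (`laplacian_2d`, `ic0`, `pcg`, `make_ic0_solver`) are hypothetical.

```python
import numpy as np


def laplacian_2d(m):
    """Dense 5-point Laplacian on an m x m grid (a Stieltjes matrix)."""
    T = 2.0 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
    return np.kron(np.eye(m), T) + np.kron(T, np.eye(m))


def ic0(A):
    """Incomplete Cholesky without fill-in: L keeps the pattern of tril(A)."""
    n = A.shape[0]
    L = np.tril(A).astype(float)
    pattern = L != 0.0
    for k in range(n):
        L[k, k] = np.sqrt(L[k, k])
        rows = np.nonzero(pattern[k + 1:, k])[0] + k + 1
        L[rows, k] /= L[k, k]
        for j in rows:
            # Update column j only where an entry already exists (no fill-in).
            upd = rows[rows >= j]
            upd = upd[pattern[upd, j]]
            L[upd, j] -= L[upd, k] * L[j, k]
    return L


def make_ic0_solver(A):
    """Return a function applying (L L^T)^{-1}; dense solves stand in for
    the sparse triangular sweeps a production code would use."""
    L = ic0(A)

    def apply_minv(r):
        return np.linalg.solve(L.T, np.linalg.solve(L, r))

    return apply_minv


def pcg(A, b, apply_minv, tol=1e-8, maxit=1000):
    """Preconditioned conjugate gradients; returns (solution, iterations)."""
    x = np.zeros_like(b)
    r = b.copy()
    z = apply_minv(r)
    p = z.copy()
    rz = r @ z
    b_norm = np.linalg.norm(b)
    for it in range(1, maxit + 1):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) <= tol * b_norm:
            return x, it
        z = apply_minv(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, maxit
```

Passing `lambda r: r.copy()` as `apply_minv` gives unpreconditioned CG, so the two iteration counts can be compared on the same system; swapping in another approximate factorization only requires another `apply_minv`.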

5.2. PCG versus frontal method

It is now accepted that the memory requirements of a PCG solver are much lower than those of a direct solver, but the comparison of CPU times has been a matter of controversy for many years. This is due to the parameters that influence the convergence rate of the PCG solver (for a given problem), which are as follows:

(i) The type of finite elements: this paper shows that PCG methods probably lead to one of the most efficient solvers for some types of finite element. This has been theoretically and numerically demonstrated for membrane and solid elements in this work, but numerical results have been obtained separately for plate and shell problems;
(ii) The preconditioner: the improvement brought forward by recently developed approximate factorization algorithms dramatically reduces the number of iterations;
(iii) The reduction: it is often ignored by authors, but this paper shows how to combine different reduction techniques to obtain efficient preconditioners in the structural analysis context;
(iv) The computer on which the software codes are tested: the sparse data storage schemes and the ordering of the unknowns must be completely re-thought to run PCG on vector or parallel computers and reach the same conclusions as here on a scalar computer when comparing PCG to FRONT. However, the most widely used computers in industry are scalar, single-processor machines, on which our contribution makes it possible to increase the size of the solved problems, now restricted by the use of direct methods.

DRIC(0)-DC computational times presented here are far smaller than those obtained by the frontal method, even if the grids are non-regular, with high element size gradients or shape distortions. This conclusion remains valid for 2-D structures and is reinforced for 3-D structures, and it seems that the PCG method should always be preferred to a direct method.

ACKNOWLEDGEMENTS

We are grateful to SAMTECH for having provided their software code SAMCEF and to the referees for their helpful comments, which have greatly contributed to the completion of this paper.

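APPENDIX I: PROOF OF THEOREM 2

C-reduction applied to $A$ and $A^e$ requires equations (26)-(33) to be satisfied:

$$A = \bar{A} - \hat{A} \tag{26}$$
$$\operatorname{offdiag}(\bar{A}) = \min(\operatorname{offdiag}(A), 0) \tag{27}$$
$$\operatorname{offdiag}(\hat{A}) = -\max(\operatorname{offdiag}(A), 0) \tag{28}$$
$$\bar{A}\mathbf{1} = A\mathbf{1} \tag{29}$$
$$A^e = \bar{A}^e - \hat{A}^e \tag{30}$$
$$\operatorname{offdiag}(\bar{A}^e) = \min(\operatorname{offdiag}(A^e), 0) \tag{31}$$
$$\operatorname{offdiag}(\hat{A}^e) = -\max(\operatorname{offdiag}(A^e), 0) \tag{32}$$
$$\bar{A}^e\mathbf{1} = A^e\mathbf{1} \tag{33}$$

and the elementary matrices are tied to the global ones by equations (34) and (35):

$$A = \sum_e A^e \tag{34}$$
$$A^* = \sum_e \bar{A}^e \tag{35}$$

Upper spectral equivalence bound $\beta_A$. Equation (26) yields

$$x^T A x = x^T(\bar{A} - \hat{A})x$$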
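The splitting used by the C-reduction can be checked numerically on a small example. The sketch below is a minimal illustration, assuming the usual diagonal compensation: positive off-diagonal entries are dropped and added to the diagonal so that row sums are preserved. The function name `c_reduction` and the labels `A_bar` (reduced matrix) and `A_hat` (remainder) are chosen here for illustration.

```python
import numpy as np


def c_reduction(A):
    """Split a symmetric matrix as A = A_bar - A_hat by diagonal compensation.

    A_bar keeps the non-positive off-diagonal entries of A; each positive
    off-diagonal entry is removed and lumped onto the diagonal, so that
    A_bar and A have the same row sums.
    """
    off = A - np.diag(np.diag(A))
    positive = np.maximum(off, 0.0)
    A_bar = np.minimum(off, 0.0) + np.diag(np.diag(A) + positive.sum(axis=1))
    A_hat = A_bar - A
    return A_bar, A_hat


A = np.array([[4.0, 1.0, -1.0],
              [1.0, 3.0, -2.0],
              [-1.0, -2.0, 5.0]])
A_bar, A_hat = c_reduction(A)
```

In this example `A_hat` has zero row sums and non-positive off-diagonal entries, so it is nonnegative definite and `x @ A @ x` never exceeds `x @ A_bar @ x`, which is the mechanism behind the upper spectral bound.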


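APPENDIX II: PROOF OF ASSUMPTION (18): $\ker(\bar{A}^e) = \ker(A^e)$

The proof is made for $A^e = K^{eD}$. Without any loss of generality, let us consider a problem with two translation d.o.f.s $x$, $y$ and one rotation d.o.f. $\theta$:

$$K^{eD} = \sum_i [K^{eD}_{ii}] = \begin{bmatrix} K^{eD}_{xx} & 0 & 0 \\ 0 & K^{eD}_{yy} & 0 \\ 0 & 0 & K^{eD}_{\theta\theta} \end{bmatrix} \tag{43}$$

$$\bar{K}^{eD} = \sum_i [\bar{K}^{eD}_{ii}] = \begin{bmatrix} \bar{K}^{eD}_{xx} & 0 & 0 \\ 0 & \bar{K}^{eD}_{yy} & 0 \\ 0 & 0 & \bar{K}^{eD}_{\theta\theta} \end{bmatrix} \tag{44}$$

where $[K^{eD}_{ii}]$ and $[\bar{K}^{eD}_{ii}]$ have the same entries as $K^{eD}_{ii}$ and $\bar{K}^{eD}_{ii}$, padded with zeroes to have the same size as $K^{eD}$ and $\bar{K}^{eD}$. Let us take a look at the kernel of each block:

Translation d.o.f.s: $\ker(K^{eD}_{ii}) \subset \operatorname{span}(\mathbf{1})$. These kernels would be reduced to $\{0\}$ if boundary conditions were applied to the corresponding d.o.f.s;
Rotation d.o.f.s: $\ker(K^{eD}_{\theta\theta}) = \{0\}$.

Moreover, equation (33) implies

$$\bar{K}^{eD}_{ii}\mathbf{1} = K^{eD}_{ii}\mathbf{1} \tag{45}$$

and therefore

$$\ker(K^{eD}) \subset \ker(\bar{K}^{eD}) \tag{46}$$

But since

$$x^T \bar{K}^{eD} x = x^T (K^{eD} + \hat{K}^{eD}) x \ge x^T K^{eD} x$$

from Theorem 1, if $\bar{K}^{eD}x = 0$, then $K^{eD}x = 0$ and

$$\ker(\bar{K}^{eD}) \subset \ker(K^{eD}) \tag{47}$$

Equations (46) and (47) provide

$$\ker(\bar{K}^{eD}) = \ker(K^{eD}) \qquad \square$$

APPENDIX III: $(A^* - \bar{A})$ IS NONNEGATIVE DEFINITE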

From equations (35), (33), (34) and (29) successively, one has

$$A^*\mathbf{1} = \sum_e \bar{A}^e\mathbf{1} = \sum_e A^e\mathbf{1} = A\mathbf{1} = \bar{A}\mathbf{1}$$

which leads to

$$-\sum_{j \ne i} (A^* - \bar{A})_{ij} = (A^* - \bar{A})_{ii}$$

On the other hand, it can be seen from equations (26)-(35) that

$$(A^*)_{ii} = \sum_e \sum_j \max((A^e)_{ij}, 0)$$

and therefore

$$(A^*)_{ij} \ge (\bar{A})_{ij} \quad \text{for } i = j$$
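The difference between reducing element by element and reducing the assembled matrix can be illustrated on a toy assembly. The sketch below is only an example under assumptions: the two 3-d.o.f. element matrices are invented for the purpose, diagonal compensation is implemented as row-sum-preserving lumping of positive off-diagonal entries, and the names `c_reduction` and `assemble` are hypothetical.

```python
import numpy as np


def c_reduction(A):
    """Reduced matrix: non-positive off-diagonals kept, positive ones lumped
    onto the diagonal so that row sums are unchanged."""
    off = A - np.diag(np.diag(A))
    return np.minimum(off, 0.0) + np.diag(np.diag(A) + np.maximum(off, 0.0).sum(axis=1))


def assemble(elements, n):
    """Sum elementary matrices into an n x n global matrix."""
    A = np.zeros((n, n))
    for dofs, Ae in elements:
        A[np.ix_(dofs, dofs)] += Ae
    return A


# Two made-up elements sharing d.o.f.s 1 and 2; their couplings between
# these d.o.f.s have opposite signs and partly cancel after assembly.
e1 = np.array([[2.0, -1.0, 0.0], [-1.0, 2.0, 1.0], [0.0, 1.0, 2.0]])
e2 = np.array([[2.0, -1.5, 0.0], [-1.5, 2.0, -1.0], [0.0, -1.0, 2.0]])
elements = [([0, 1, 2], e1), ([1, 2, 3], e2)]

A_bar = c_reduction(assemble(elements, 4))                             # reduce after assembly
A_star = assemble([(dofs, c_reduction(Ae)) for dofs, Ae in elements], 4)  # reduce each element
diff = A_star - A_bar
```

Here `diff` has zero row sums, a nonnegative diagonal and non-positive off-diagonal entries, hence it is nonnegative definite, in line with the statement of this appendix.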


where $H$ is the Hooke matrix; $N$ is the matrix of the shape functions; $J$ is the Jacobian matrix of the isoparametric transformation that can be written, $N_i$ being the shape functions at node $i$,

in which all entries are proportional to $h$ thanks to equation (48). The value of $|\det J|$ is then proportional to $h^{\dim}$ with $\dim = 1$, 2 or 3 for 1-D, 2-D or 3-D finite elements respectively; $D$ is a matrix of differential operators using $x$-, $y$- and $z$-derivatives. As, for instance,

$$\frac{\partial}{\partial x} = \left(\frac{\partial \xi}{\partial x}\right)\frac{\partial}{\partial \xi} + \left(\frac{\partial \eta}{\partial x}\right)\frac{\partial}{\partial \eta} + \left(\frac{\partial \zeta}{\partial x}\right)\frac{\partial}{\partial \zeta}$$

in which all terms between brackets are entries of $J^{-1}$, and thus proportional to $h^{-1}$, all first derivatives are proportional to $h^{-1}$. Moreover, order $d$ derivatives are proportional to $h^{-d}$.

If $D$ is homogeneous (all the derivatives being of the same order) of order $d$, it follows from equation (52) that

$$(K^e)_h = h^{\dim - 2d} K^p$$

and obviously, for matrices $A^e$, $A^p$ computed from $K^e$, $K^p$ by a D-reduction (for instance),

$$(A^e)_h = h^{\dim - 2d} A^p \tag{53}$$
$$(\bar{A}^e)_h = h^{\dim - 2d} \bar{A}^p \tag{54}$$

Therefore, as it can always be written that for some positive $\alpha^e$, $\alpha^e x^T \bar{A}^p x$


Table XIII. Comparison of FRONT and MA28 CPU times for some grids

Example     FRONT       MA28
PARK        24.21      79.90
MANS       116.32     528.70
BEAM       109.81    1895.80

with $A = E\Omega h^{-1}$, $B = 12EIh^{-3}$, $C = 6EIh^{-2}$ and $D = 2EIh^{-1}$, $E$ being the Young's modulus, $I$ the inertia of the beam and $\Omega$ its section. The corresponding $K^{eD}$ and $\bar{K}^{eD}$ are partitioned into three (2,2) diagonal blocks $[K^{eD}_{ii}]$ and $[\bar{K}^{eD}_{ii}]$ corresponding to $x$, $y$ and $\theta$ d.o.f.s, as proposed in equations (43) and (44). Each of these blocks contains only entries that are proportional to $h^{-1}$ ($x$ and $\theta$ blocks) or $h^{-3}$ ($y$ block).

$$\alpha_e\, x^T [\bar{K}^{eD}_{ii}] x \le x^T [K^{eD}_{ii}] x \quad \forall x \in \mathbb{R}^n,\ \forall i \tag{57}$$

In equation (57) the size parameter $h$ can always be considered as a leading constant with a given exponent. The value of $\alpha_e$ in equation (57) is then independent of $h$. Equation (37) is finally obtained by summing equation (57) for all $i$, with respect to equations (43) and (44) and by taking A =Po. $\square$
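The scaling of the decoupled blocks with the element size can be verified on the textbook two-node Euler-Bernoulli plane frame element, which is assumed here to be the element whose entries are denoted $A$, $B$, $C$ and $D$ above. The sketch below is illustrative only; the function names, the d.o.f. ordering and the default material data are assumptions.

```python
import numpy as np


def frame_element(E, area, inertia, h):
    """Stiffness of a 2-node plane frame element of length h.

    D.o.f. order: x1, y1, theta1, x2, y2, theta2.
    """
    a = E * area / h                 # axial term, proportional to 1/h
    b = 12.0 * E * inertia / h**3    # transverse term, proportional to 1/h^3
    c = 6.0 * E * inertia / h**2     # translation-rotation coupling
    d = 2.0 * E * inertia / h        # rotation term, proportional to 1/h
    return np.array([
        [a, 0.0, 0.0, -a, 0.0, 0.0],
        [0.0, b, c, 0.0, -b, c],
        [0.0, c, 2.0 * d, 0.0, -c, d],
        [-a, 0.0, 0.0, a, 0.0, 0.0],
        [0.0, -b, -c, 0.0, b, -c],
        [0.0, c, d, 0.0, -c, 2.0 * d],
    ])


def decoupled_blocks(K):
    """Diagonal (2,2) blocks kept by the decoupling of x, y and theta d.o.f.s."""
    index = {"x": [0, 3], "y": [1, 4], "theta": [2, 5]}
    return {name: K[np.ix_(dofs, dofs)] for name, dofs in index.items()}


def scaling_exponents(E=210e9, area=1e-2, inertia=1e-5, h=1.0):
    """Estimate p such that each block scales like h**p, by halving h."""
    coarse = decoupled_blocks(frame_element(E, area, inertia, h))
    fine = decoupled_blocks(frame_element(E, area, inertia, h / 2.0))
    return {name: -np.log2(fine[name][0, 0] / coarse[name][0, 0]) for name in coarse}
```

Because every entry of a given block carries the same power of $h$, rescaling the element multiplies each block by a constant, which is why a bound such as (57) does not depend on $h$.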

APPENDIX V

The direct solver used here is the frontal solver of the industrial software code SAMCEF. Some CPU times given in Table XIII allow a comparison of FRONT and the frontal solver of the MA28 package of the Harwell Subroutine Library, which is more popular in the research community. For technical reasons, the experiments could not be made on an IBM 4381 with SAMCEF (V4); they were performed on a SUN SPARC20/514 with SAMCEF (V5), but the same qualitative conclusions are expected. Partial pivoting was used for the MA28 solver. It has been found that FRONT is always more efficient than MA28, which seems not to be optimized for industrial use.

REFERENCES

1. E. L. Poole, N. F. Knight and D. D. Davis Jr, High performance equation solvers and their impact on finite element analysis problems, Int. j. numer. methods eng., 33, 858-868 (1992).
2. O. Axelsson and V. A. Barker, Finite Element Solution of Boundary Value Problems: Theory and Computation, Academic Press, New York, 1984.
3. R. Beauwens, Modified incomplete factorization strategies, in O. Axelsson and L. Kolotilina (eds.), PCG Methods, Lecture Notes in Mathematics, 1457, Springer, Berlin, 1990, pp. 1-16.
4. Y. Notay, A dynamic version of the RIC method, Numer. Linear Algebra Appl., 1(4), (1994).
5. G. Meinardus, Ueber eine Verallgemeinerung einer Ungleichung von L. V. Kantorowitsch, Numer. Math., 5, 14-23 (1963).
6. R. Beauwens, Factorization iterative methods, M-operators and H-operators, Numer. Math., 31, 335-357 (1979).
7. S. Woznicki, Two-sweep iteration methods for solving large linear systems and their application to the numerical solution of multi-group multi-dimensional neutron diffusion equations, Ph.D. Dissertation, Report 1447/CYFRONET/PM/A, Institute of Nuclear Research, Swierk, Poland, 1973.
8. H. S. Price and R. S. Varga, Incomplete primitive factorizations, unpublished manuscript, 1964.
9. O. Axelsson, A generalized SSOR method, BIT, 13, 443-467 (1972).
10. I. Gustafsson, A class of first order factorization methods, BIT, 18, 142-156 (1978).
11. N. I. Buleev, A numerical method for the solution of two-dimensional and three-dimensional equations of diffusion, Math. Sb., 51, 227-238 (1960); English translation in Report BNL-TR-551, Brookhaven National Laboratory, Upton, NY, 1973.
12. I. Gustafsson, Modified incomplete Cholesky (MIC) methods, in D. Evans (ed.), Preconditioning Methods, Theory and Applications, Gordon and Breach, NY, 1983, pp. 265-293.
13. M. M. Magolu, Taking advantage of the potentialities of dynamically modified block incomplete factorizations, Report IT/IF/14-13, Service de Métrologie Nucléaire, Université Libre de Bruxelles.
14. O. Axelsson and L. Kolotilina, Diagonally compensated reduction and related preconditioning methods, Numer. Linear Algebra Appl., 1(2), 155-177 (1994).
15. R. Beauwens and R. Wilmet, Conditioning analysis of positive definite matrices by approximate factorizations, J. Comput. Appl. Math., 26, 257-269 (1989).
16. J. K. Dickinson and P. A. Forsyth, Preconditioned conjugate gradient methods for three-dimensional linear elasticity, Int. j. numer. methods eng., 37, 2211-2234 (1994).
17. O. Axelsson, Iterative solution methods, Cambridge University Press, Cambridge, 1994.
18. O. Axelsson and I. Gustafsson, Iterative methods for the solution of the Navier equations of elasticity, Comput. Methods Appl. Mech. Eng., 15, 241-258 (1978).
19. S. Shlafman and I. Efrat, Using Korn's inequality for an efficient iterative solution of structural analysis problems, in R. Beauwens and P. de Groen (eds.), Iterative Methods in Linear Algebra, North-Holland, Amsterdam, 1992, pp. 575-581.
20. I. S. Duff and G. A. Meurant, The effect of ordering on preconditioned conjugate gradients, BIT Comput. Sci. Numer. Math., 29, 635-657 (1989).
21. Y. Notay, Ordering methods for approximate factorization preconditioning, Technical Report IT/IF/14-11, Service de Métrologie Nucléaire, Université Libre de Bruxelles, 1993.
22. S. W. Sloan, An algorithm for profile and wavefront reduction of sparse matrices, Int. j. numer. methods eng., 23, 239-251 (1986).
23. Y. Notay, Résolution itérative de systèmes linéaires par factorisations approchées, Ph.D. Thesis, Service de Métrologie Nucléaire, Université Libre de Bruxelles, Belgium, 1991.
24. S. Eisenstat, Efficient implementation of a class of preconditioned conjugate gradient methods, SIAM J. Sci. Statist. Comput., 2, 1-4 (1981).
25. O. C. Zienkiewicz and R. L. Taylor, The finite element method, 4th edn, vol. 1 and 2, McGraw Hill, New York, 1989.
26. O. Axelsson and G. Lindskog, On the rate of convergence of the preconditioned conjugate gradient method, Numer. Math., 48, 499-523 (1986).
27. E. Cuthill and J. McKee, Reducing the bandwidth of sparse symmetric matrices, Proc. 24th Nat. Conf. of the Assoc. for Computing Machinery, Brandon Press, N.J., 1969, pp. 157-172.
28. M. R. Hestenes and E. Stiefel, Methods of conjugate gradient for solving linear systems, J. Res. Nat. Bureau Standards Sect. B, 49, 409-436 (1952).
