
Paper submitted for the Special Issue of Computing Systems in Engineering for the Second Symposium on Parallel Computational Methods for Large Scale Structural Analysis and Design, NASA Langley, USA

Subdomain Generation for Parallel Finite Element Analysis

A. I. Khan† and B.H.V. Topping‡

†Lecturer, ‡Professor of Structural Engineering
Heriot-Watt University, Riccarton, Edinburgh, EH14 4AS, United Kingdom

Abstract

This paper describes an optimization and artificial intelligence based approach for solving the mesh partitioning problem in parallel finite element analysis. The problem of domain decomposition with reference to the mesh partitioning approach is described. Some current mesh partitioning approaches are discussed with respect to their limitations. The formulation of the optimization problem is presented. The theory for the mesh partitioning approach using an optimization and a predictive module is also described. It is shown that a genetic algorithm linked to a neural network [16] predictive module may be used successfully to limit the computational load and the number of design variables for the decomposition problem. This approach does not suffer from the limitations of some current domain decomposition approaches where an overall mesh is first generated and then partitioned. It is shown that by partitioning the coarse initial background mesh, near optimal partitions for finer graded (adaptive) meshes may be obtained economically. The use of the genetic algorithm for the optimization module and neural networks as the predictive module is described. Finally a comparison between some current mesh partitioning algorithms and the proposed method is made with the aid of three examples, thus illustrating the feasibility of the method.

1 Introduction

Parallel finite element analysis is based on the concept of dividing a large and computationally time consuming finite element problem into smaller, more manageable sub-problems which may be solved efficiently. This process of breaking up a problem into smaller sub-problems is called domain decomposition. In the finite element problem, domain decomposition may be undertaken using one of the following approaches:

• The Explicit Approach: where a finite element mesh is physically partitioned to form subdomains for parallel processing.

• The Implicit Approach: where the system of assembled equations is partitioned for parallel analysis.

In this paper the explicit approach to the domain decomposition problem will be considered. With this approach a target finite element mesh (a discretized domain) is divided into a finite number of subdomains such that the computational load (or effort) per subdomain is approximately the same; these subdomains are then solved concurrently on different processors. Hence large scale analysis problems may be solved using this approach at much greater speeds by networking multiple processors.

Unstructured finite element meshes, adaptive or uniform, generally comprise a single element type. Computational load balancing using unstructured meshes for explicit time-stepping finite element analysis may be ensured by fulfilling the following requirements:

• An equal number of elements should be assigned to each subdomain.

• The number of boundary interface nodes of each subdomain should be minimized.

In the case of an implicit scheme requiring the solution of a system of equations, the computational load balancing is governed by the bandwidth of each of the subdomains and the number of boundary interface nodes of each subdomain. In this paper the problem of unstructured mesh partitioning is addressed with respect to explicit time-stepping finite element analysis.

With the advent of parallel finite element analysis a variety of mesh partitioning algorithms have been proposed; some of these are listed below:

• Recursive bisection algorithms: Simon [1] has presented the recursive spectral bisection (RSB) algorithm and compared its effectiveness with recursive bisection algorithms such as:

◦ the recursive coordinate bisection (RCB); and

◦ the recursive graph bisection (RGB).

• Design optimization based algorithms: Using simulated annealing as the optimization method, Flower et al [2] and Nour-Omid et al [3] have attempted to solve the mesh partitioning problem.

• Greedy algorithms: Farhat [4] has proposed a mesh partitioning method using a greedy algorithm approach.


• Heuristic approaches: Kernighan and Lin [5] have presented a heuristic approach for solving the combinatorial problem of partitioning a graph into subsets. The objective of the algorithm is to minimize the total cost of the cut edges subject to an upper limit on the members (in this case elements) of the subsets.

A common feature of the above mesh-partitioning algorithms [1, 3, 4] is that all commence with the supposition that a large finite element mesh is to be mapped on to the network of parallel processors. In MIMD (multiple instruction multiple data) distributed memory architectures the available memory in the network increases with the number of networked processors; hence, theoretically, large scale analysis may be performed without experiencing memory constraints in these systems. However a bottleneck is created at the central processor (i.e. the ROOT processor in the case of transputer-based systems) because storing a complete mesh for partitioning purposes induces high memory requirements in that processor. It is not always possible to increase the size of the ROOT processor's memory, hence mass storage devices (e.g. hard disks) are attached, which in turn reduce computational efficiency on account of their slow data transfer rates.

Mesh generation in general, and adaptive re-meshing [9, 10] in particular, becomes computationally expensive as the size of the domain is increased with respect to element sizes. From a user's point of view the finite element analysis commences when the data specifying the geometry of the domain etc. is supplied to the computer, and finishes when the required output, such as stresses and displacements, has been calculated within the discretized domain. Hence the time spent in mesh generation, mesh partitioning and solution of the mesh must all be considered when reviewing computational efficiency. In the case of parallel processing machines the efficiency of the algorithm is seriously affected by the proportion of sequential code present in the overall parallel code.

A Subdomain Generation Method (SGM) is presented in this paper which uses an optimization and artificial intelligence based approach to alleviate the large memory requirement for the ROOT processor. The partitions obtained using the SGM are compared with those obtained by using both Farhat's method [4] and Simon's method (RSB) [1] for three example meshes. Comparisons have been made on the basis of the sequential computation time for each partitioning method because the two comparative methods require the assembly of the subdomains from the parallel mesh generator before the meshes may be partitioned. With the SGM the generation of an overall mesh for the whole of the domain is avoided and the finite element discretization exists only in its distributed form. In practice the parallel mesh generator [9, 10] may be used to generate meshes for each of the partitioning methods; however, the SGM has the advantage that the subdomains of the meshes are partitioned before mesh generation and therefore the complete mesh is not required to be assembled and processed on the ROOT processor. In general, the parallel mesh generator [9, 10] sends elements of the coarse mesh to the processors in the arbitrary order that they are listed in the data structure. In the case of the SGM the elements must be sent to the processor on which the subdomain is to be remeshed.


2 Development of the Subdomain Generation Method

The Subdomain Generation Method proposed in this paper was implemented for planar convex finite element subdomains using adaptive triangular unstructured meshes. The method comprises the following main components:

• a Genetic Algorithm-based optimization module; and

• a Neural Network-based predictive module.

The SGM has been developed for transputer-based parallel systems, which deliver very competitive computational power and are well within the reach of even modest engineering organisations. It has been shown [6] that finite element solvers implemented on a PC-based transputer system achieved computational speed-ups close to that of a Cray X-MP.

2.1 Memory Considerations

MIMD architectures such as transputer-based systems usually have a central or ROOT processor within the processor network. The ROOT processor is generally linked to a host micro- or mini-computer for I/O support. The host is responsible for loading the executable binary code over the transputer network through the ROOT processor and provides I/O support to the transputer network through the ROOT processor. The ROOT processor differs from the remaining transputers in the network because it is usually equipped with a larger memory module (RAM). The transputer is capable of linearly addressing 4 GBytes [11] of memory. However, because of practical considerations regarding the increase in size of the transputer module (TRAM) and the associated cost of the increase in the size of the memory module, present day transputers are usually provided with a maximum of 16 MBytes of RAM. The usual size of the memory module for the other transputers (Workers) in the network is 1 or 2 MBytes. Hence it is apparent that when the number of transputers in the network is increased beyond a certain number, the ROOT processor will require out-of-core storage to handle the data structures to be distributed among the Worker processors [6]. Thus, in order to avoid computational overheads due to the slow transfer rates of out-of-core storage devices, it is necessary that the large memory requirements at the ROOT processor level are alleviated.

2.2 Performance Considerations

Parallel systems are affected by the proportion of sequential processing that may be involved within the parallel code. Hence if N instructions are to be executed on a processor which takes time t to execute a single instruction, then the total time taken for executing N instructions sequentially would be:

Tseq = Nt (1)


Figure 1. Graph of the "speed-up" versus the proportion of the sequential code for 16 processors (the speed-up at p = 0.5 is S = 1.882)

If N parallel processors were available then the time to execute N instructions in parallel would be t units. If only a proportion p of the N instructions can be executed in parallel then the parallel execution time may be represented by the following expression:

Tparr = (1 − p)Nt + pt (2)

The speed-up S may be calculated as:

S = Tseq / Tparr = N / ((1 − p)N + p) (3)

It may be seen from Figure 1 that when the proportion of the parallel and sequential code is equal, i.e. p = 50%, then theoretically the speed-up cannot be more than 1.882 for 16 processors. Hence in order to fully exploit the parallel architecture it is imperative to reduce the proportion of sequential instructions in the overall parallel code.
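Equation (3) can be evaluated directly. The following sketch (not part of the original paper) computes the theoretical speed-up for a given parallel proportion p and processor count N:

```c
/* Theoretical speed-up from equation (3): S = N / ((1 - p)N + p),
 * where p is the proportion of the code that can be executed in
 * parallel and N is the number of processors. */
double speed_up(double p, int n)
{
    return (double)n / ((1.0 - p) * (double)n + p);
}
```

For p = 0.5 and N = 16 this gives 16/8.5 ≈ 1.882, the limit shown in Figure 1; for p = 1 the full N-fold speed-up is recovered.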

The pre-processing that precedes parallel finite element analysis will comprise the following:

• Discretization of the finite element domain by mesh generation; and

• Partitioning of the generated mesh into subdomains for mapping onto parallel processors.


The above pre-processing procedures are generally carried out sequentially in a parallel architecture, and from equation 3 it may be seen that this induces gross inefficiencies in the overall parallel finite element codes comprising pre-processing and finite element analysis. Hence the proportion of sequential code due to the pre-processing needs to be curtailed.

3 The Subdomain Generation Method

As already mentioned, the load balancing problem for finite element discretizations involving a single type of element may be addressed by ensuring that the elements in the domain are equally shared among the processors and that the boundary interface nodes per subdomain are minimized.

Optimization algorithms such as simulated annealing have been used to determine optimum finite element mesh partitions [2, 3]. After working with genetic algorithms (an optimization procedure similar to simulated annealing based upon directed random searches) it was demonstrated that stopping the algorithm after a few iterations would not guarantee good partitions [2, 3]. This is because with such methods good solutions cannot always be obtained after the first few iterations, since the starting point for these methods is determined at random. Thus when the number of elements in the mesh is increased the design space increases and the probability of finding a minimum in the first few iterations decreases. Full scale optimization, that is performing an adequate number of iterations, is not always feasible: it may take more time to optimally partition the mesh than it would to solve the finite element problem sequentially. However if the number of elements in the finite element mesh is restricted to between 50 and 100, reasonable partitions may be obtained economically by using a convergence criterion, which is discussed in this paper.

An h-type adaptive refinement procedure such as outlined in references [7, 8, 9, 10] comprises the following steps:

• Performing finite element analysis on an initial coarse background mesh (usually comprising 50 to 100 elements) and calculating the re-meshing parameters.

• Generating the final mesh and solving it to obtain a relatively accurate solution.

Hence if there is advance knowledge concerning the number of elements that would be generated in the final mesh, then it is possible to apply recursive bisections on the initial mesh Mi = (Ei, Ci), consisting of Ei elements and Ci element edges, and partition the final (predicted) mesh Mf = (Ef, Cf), with Ef elements and Cf element edges, as follows:

Divide Mi into M1i and M2i such that,

Ef = E1f ∪ E2f (4)

E1f ∩ E2f = ∅ (5)

and the interfacing edges Ccf,

| Ccf | = | C1f ∩ C2f | (6)


are minimized. The advance knowledge regarding the number of elements in the final mesh Mf is obtained by training a neural network to predict the number of elements that may be generated per element of the initial mesh Mi.

The method of recursive bisections is applied to the initial mesh Mi, which is divided into two subdomains by using a variation of the genetic algorithm regulated by an objective function which attains its maximum value when the number of generated elements is equal in both subdomains and the number of interfacing edges is at its minimum. This procedure is then applied recursively to each subdomain.

The coarse initial mesh, thus partitioned into the desired number of subdomains, may be mapped on to parallel processors and adaptive mesh generation may be performed concurrently, producing the final distributed mesh for use in a parallel finite element analysis.

4 Implementation of the Subdomain Generation Method

The memory and the performance requirements for the finite element analysis in MIMD systems are fulfilled under the SGM as follows:

• The memory requirements for large finite element discretizations at the ROOT processor are curtailed by generating subdomain meshes independently within the parallel processors.

• The above process automatically addresses the problem of sequential mesh generation by providing an opportunity for remeshing the subdomains concurrently within the parallel processors.

• By using optimization techniques such as genetic algorithms, which lend themselves easily to parallelisation, for the mesh partitioning, the proportion of sequential code may be further reduced.

For generating subdomain meshes in isolation from one another it is important that compatibility at the interfacing boundary nodes is maintained. This problem is addressed on the basis of the technique used in reference [9], where the nodal mesh parameters δn were used instead of the element mesh parameters δe for maintaining compatibility at the subdomain mesh interfaces.

The conventional mesh-partitioning methods use the overall mesh, which is usually available in the ROOT processor, to obtain optimal partitions of it. With the SGM the overall mesh is not formed and hence a different strategy must be adopted. The strategy used for the SGM relates to the adaptive mesh generation method described in reference [9], where the adaptive solution meshes, in other words the final meshes, are generated using the following information:

• A coarse initial (background) mesh.

• The element mesh parameters δe for controlling the local mesh density in the finite element domain.


The nodal mesh parameters δn may be readily calculated by nodal averaging. Thus the initial mesh may be divided into suitable subdomains, and each subdomain, with its corresponding δn values, may be used to generate part of the resultant adaptive finite element mesh in isolation. The use of the nodal mesh parameters δn ensures that boundary node compatibility is always maintained among the generated subdomain meshes. Thus, using the available coarse initial mesh and the mesh parameters, finer adaptive meshes may be generated in parallel.
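The nodal averaging step can be sketched as follows. This is a hypothetical illustration (the function name and data layout are assumptions, not the paper's code): each triangular element contributes its element mesh parameter δe to its three nodes, and the nodal value δn is the mean of the contributions it receives.

```c
#include <stdlib.h>
#include <stddef.h>

/* Compute nodal mesh parameters delta_n by nodal averaging of the
 * element mesh parameters delta_e.  elem_nodes[e] lists the three node
 * indices of triangular element e. */
void nodal_average(int (*elem_nodes)[3], const double *delta_e,
                   size_t n_elems, double *delta_n, size_t n_nodes)
{
    int *count = calloc(n_nodes, sizeof *count);
    size_t e, n;

    for (n = 0; n < n_nodes; n++)
        delta_n[n] = 0.0;
    for (e = 0; e < n_elems; e++) {
        int k;
        for (k = 0; k < 3; k++) {
            int node = elem_nodes[e][k];
            delta_n[node] += delta_e[e];   /* accumulate contributions */
            count[node]++;
        }
    }
    for (n = 0; n < n_nodes; n++)
        if (count[n] > 0)
            delta_n[n] /= count[n];        /* mean over incident elements */
    free(count);
}
```

For two triangles sharing an edge, the shared nodes receive the mean of the two element values, which is what keeps the interface meshes compatible when subdomains are remeshed in isolation.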

Thus in order to practically implement the SGM the initial mesh must be partitioned into a suitable number of parts or subdomains such that each subdomain will generate approximately equal numbers of elements and the number of interfacing boundary edges will be minimized.

A genetic algorithm with a trained neural network module was used to determine optimal partitions for the initial or background mesh.

5 Genetic Algorithm for Mesh Partitioning

The genetic algorithm, through the principle of survival-of-the-fittest, selects the direction towards the optimal solution through directed random sampling. The central theme of research on genetic algorithms has been described by Goldberg [12] as robustness. The genetic algorithm also forms an attractive candidate for parallel processing [13, 14].

Robustness of the algorithm is an essential requirement for parallel processing since it is sometimes difficult to implement, and unsafe to execute, distributed algorithms which are not robust. Errors caused by the failure of part of the algorithm on a processor may sometimes be difficult to trace or control, hence validation of the results becomes difficult under these circumstances. As a consequence of the robustness of these algorithms and their potential for parallel processing they were selected for the optimization module in the mesh partitioning method.

The genetic algorithm works with a complete population of candidate designs, (i = 1, . . . , size of the population). The size of the population determines the size of the sampling space: larger population sizes afford a better possibility of finding the optimal solution, but the procedure becomes computationally expensive.

The population will comprise a number of individuals. These individuals are essentially concatenated strings or arrays representing a set of design variables. Each design variable representation forming part of an individual string or array is known as a chromosome.

5.1 Crossover

For the variation of the genetic algorithm used in this study an individual is defined as an array of unsigned 16 bit integers, with the size of the array being equal to the number of design variables. Hence each chromosome representing a design variable has the ability to store numerical values in the range 0, . . . , 65535. The numerical values stored in the chromosome do not represent the actual design variable values but require a translating function to map the chromosome values over to the specified ranges of the design variables.
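Such a translating function can be sketched as a simple linear map (the function name is hypothetical; the paper does not list its mapping):

```c
/* Map a 16-bit chromosome value (0..65535) linearly onto the
 * specified design-variable range [lo, hi]. */
double map_chromosome(unsigned int chrom, double lo, double hi)
{
    return lo + (hi - lo) * ((double)chrom / 65535.0);
}
```

A chromosome value of 0 maps to the lower bound and 65535 to the upper bound; intermediate values give a resolution of (hi − lo)/65535 per step.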


From the population, pairs of individual designs are selected and crossed over using a single point crossover. As an example of a crossover we take a pair of individuals A and B, each comprising three chromosome variables. The chromosome variables represent the three design variables, say the x and y coordinates and the angle of inclination of a vector in two-dimensional space. Initially random values in the range 0, . . . , 65535 are assigned to each chromosome variable; this is shown in the set of equations below pertaining to the individuals A and B:

chromosome[1]A = 1024

chromosome[2]A = 10

chromosome[3]A = 230

chromosome[1]B = 15

chromosome[2]B = 2000

chromosome[3]B = 70

The binary representations of the above individuals are shown in Figure 2 (a). Figure 2 (b) shows the results of a single point crossover performed between the individuals. Individuals A and B undergo crossover at the seventh bit position of chromosome[2]A and chromosome[2]B. It may be seen from Figure 2 (b) that bit-wise manipulation is required within chromosome[2]A and chromosome[2]B, whereas chromosome[3]A and chromosome[3]B merely require swapping of their values. The C programming language provides bitwise operation facilities which enable the crossovers to be performed efficiently. This is accomplished in two steps: the chromosome pair which falls within the crossover location undergoes bitwise crossover, and the chromosomes after the bitwise-crossed chromosome undergo swapping of their respective integer values. The C code for performing the crossover is given in Appendix A.
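Appendix A is not reproduced here, but the two-step procedure might be sketched as follows. This is an assumed implementation: the bit-ordering convention (which half of the split chromosome is exchanged) is not recoverable from the garbled figure, so here the bits at and above the crossover position are exchanged.

```c
#include <stdint.h>
#include <stddef.h>

/* Single point crossover between two individuals, each an array of
 * n_chrom unsigned 16-bit chromosomes.  The point lies at bit `bit`
 * of chromosome `chrom`: that chromosome is crossed bitwise, and every
 * chromosome after it is swapped whole. */
void crossover(uint16_t *a, uint16_t *b, size_t n_chrom,
               size_t chrom, unsigned bit)
{
    uint16_t high = (uint16_t)(0xFFFFu << bit); /* bits to exchange */
    uint16_t tmp;
    size_t i;

    /* step 1: bitwise crossover within the chromosome holding the point */
    tmp      = (uint16_t)((a[chrom] & high) | (b[chrom] & ~high));
    b[chrom] = (uint16_t)((b[chrom] & high) | (a[chrom] & ~high));
    a[chrom] = tmp;

    /* step 2: swap every chromosome after the crossover chromosome */
    for (i = chrom + 1; i < n_chrom; i++) {
        tmp = a[i]; a[i] = b[i]; b[i] = tmp;
    }
}
```

With the example individuals above (A = {1024, 10, 230}, B = {15, 2000, 70}) and the point at bit 7 of chromosome 2, the first chromosomes are untouched, the second pair exchange their high bits, and the third pair simply swap.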

5.2 Objective Function

The objective function is based upon equations 4, 5 and 6 of section 3. The minimum and maximum corner coordinates of the domain are determined, and the design variables define a vector D = {x, y, θxy}T within the confines of the domain, as shown in Figure 3.

The elements of the domain are divided into two sets by virtue of their centroidal locations, i.e. either to the left or to the right side of the generated vector. Using a trained neural network module (which shall be described in the following section) the number of elements which would result from each element of the mesh is estimated. The total numbers of predicted elements in the divided mesh, i.e. E1i and E2i, are calculated. The cumulative value C′cf of the interfacing edges | Ccf | in the final mesh is determined as follows:

C′cf = Σk=1..|Cci| (LCci)k / ((δn1 + δn2)/2) (7)


Figure 2. (a) Array representation of three 16 bit chromosomes for two individuals A & B (b) Resulting chromosomes for new individuals A & B after a single point crossover


Figure 3. Dividing vector D generated within the domain

where

Cci are the interfacing edges in the initial mesh

δn1 and δn2 are the nodal mesh parameters at the two ends of (Cci)k.

(LCci)k is the length of (Cci)k.

The objective function is defined as:

z = | Ef | − | (| E1f | − | E2f |) | − C′cf (8)

where | Ef | = | E1f | + | E2f |, subject to: if z ≤ 0 then z = tolerance.

A small value of the order of 10−10 is specified as the tolerance in order to prevent the occurrence of zero or negative values of the objective function.
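Equation (8) with its tolerance guard transcribes directly to code; the sketch below uses hypothetical names (e1 and e2 for the predicted subdomain element counts |E1f| and |E2f|, c_cf for the interface term C′cf of equation (7)):

```c
/* Objective function of equation (8):
 * z = |Ef| - ||E1f| - |E2f|| - C'cf, with a small positive tolerance
 * substituted for non-positive values. */
double objective(double e1, double e2, double c_cf)
{
    double diff = (e1 > e2) ? e1 - e2 : e2 - e1; /* ||E1f| - |E2f|| */
    double z = (e1 + e2) - diff - c_cf;          /* |Ef| = |E1f| + |E2f| */
    return (z <= 0.0) ? 1e-10 : z;
}
```

The function is maximized when the two subdomains generate equal numbers of elements (diff = 0) and the interface term is small, which is exactly the load-balancing requirement of section 3.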

5.3 Selection Scheme

The selection scheme was changed from the usual roulette wheel selection to stochastic remainder selection without replacement, owing to the inferiority of roulette wheel selection as described in [12] and to experience of testing an arbitrary objective function, where it was noted that the stochastic remainder selection scheme without replacement generally converged to better results than the roulette wheel selection scheme.
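Stochastic remainder selection without replacement can be sketched as follows (an assumed implementation after the description in [12]; the interface, in particular passing the uniform draws in as an array to keep the sketch deterministic, is not the paper's):

```c
#include <stddef.h>

/* Each individual i receives floor(f_i / f_mean) mating slots
 * deterministically; the fractional remainder is then used once as a
 * Bernoulli probability ("without replacement") to award at most one
 * further slot.  u supplies one uniform [0,1) draw per individual.
 * Returns the number of slots filled in `selected`. */
size_t stochastic_remainder(const double *fitness, size_t n,
                            const double *u,
                            size_t *selected, size_t max_sel)
{
    double mean = 0.0;
    size_t i, k, out = 0;

    for (i = 0; i < n; i++)
        mean += fitness[i];
    mean /= (double)n;

    for (i = 0; i < n; i++) {
        double e = fitness[i] / mean;       /* expected copy count */
        for (k = 0; k < (size_t)e && out < max_sel; k++)
            selected[out++] = i;            /* integer part: deterministic */
        if (out < max_sel && u[i] < e - (double)(size_t)e)
            selected[out++] = i;            /* fractional part: one trial */
    }
    return out;
}
```

Compared with roulette wheel selection, the deterministic integer part removes most of the sampling variance, which is the property that motivated the change of scheme.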


5.4 Scaling

Scaling back of the objective function evaluations was undertaken in order to prevent the dominance of a few super-individuals in the initial stages of the algorithm. Provision was made for scaling the objective function evaluations to produce more contrast between them in the final (converged) stage of the algorithm. This was done using the linear scaling technique described in reference [12].
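The linear scaling of [12] can be sketched as follows (an assumed implementation of the standard technique, not the paper's code; it presumes fmax > favg > fmin): scaled fitness is f′ = a·f + b, with a and b chosen so that the scaled average equals the raw average and the scaled maximum equals c_mult times the average (c_mult ≈ 2); when that line would drive the minimum negative, it is pivoted so the minimum maps to zero instead.

```c
/* Compute linear scaling coefficients a, b for f' = a*f + b. */
void linear_scale_coeffs(double fmax, double favg, double fmin,
                         double c_mult, double *a, double *b)
{
    double delta;
    if (fmin > (c_mult * favg - fmax) / (c_mult - 1.0)) {
        delta = fmax - favg;               /* normal scaling */
        *a = (c_mult - 1.0) * favg / delta;
        *b = favg * (fmax - c_mult * favg) / delta;
    } else {
        delta = favg - fmin;               /* pivot: minimum maps to zero */
        *a = favg / delta;
        *b = -fmin * favg / delta;
    }
}
```

In the early generations this damps super-individuals (the best is held to c_mult times average); near convergence, when raw values cluster, the same map stretches them apart and restores selective pressure.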

5.5 Convergence

The number of iterations performed using the genetic algorithm is the number of generations which a given population undergoes. With the genetic algorithm a given population of possible designs is processed through a sufficient number of generations to determine the optimum design. For problems of design optimization the shape of the design space is usually not known in advance, and the genetic algorithm performs directed random searches over the design variable space to find better solutions. Hence it is not possible to state at any one stage that the solution obtained by the genetic algorithm is the best possible solution. Given a very large sample space (population) and a very large number of generations, one may say with a certain degree of confidence that the solution reached by the genetic algorithm is either globally optimum or quite close to it. However for problems such as mesh partitioning for parallel finite element analysis it is not possible to incur large computational effort, since doing so would defeat the whole purpose of performing parallel finite element analysis. Keeping in mind that the genetic algorithm may not be run over a very large number of generations, the following convergence criterion was employed. For a population size of 50, chromosome length of 10, crossover probability of 0.6 and mutation probability of 0.0333, the genetic algorithm was stopped if no improvement of the best design was noticed within 5 consecutive generations. In addition to this criterion a maximum limit of 200 on the number of generations was set, but this limit was never invoked during the test runs of the SGM. The genetic algorithm generally provided good partitions within the first 50 generations based upon the above stopping criterion.
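The stopping rule described above can be sketched as a small helper called once per generation (the function name and state layout are assumptions):

```c
/* Stop when the best objective value has not improved for 5 consecutive
 * generations, or after a hard cap of 200 generations.  The objective is
 * maximized.  prev_best and stall are caller-supplied state; returns
 * nonzero when the algorithm should stop. */
int should_stop(double best, double *prev_best, int *stall,
                int generation)
{
    if (best > *prev_best) {
        *prev_best = best;   /* improvement: reset the stall counter */
        *stall = 0;
    } else {
        (*stall)++;
    }
    return (*stall >= 5) || (generation >= 200);
}
```

Initialise prev_best to a very large negative value and stall to zero before the first generation; the 200-generation cap matches the limit that, as noted above, was never reached in the test runs.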

6 Neural Networks

Neural networks are based upon the concept of emulating human learning abilities on computers. The research carried out in the field of neural computing has resulted in fundamental computer-based learning algorithms which attempt to model the powerful human traits of remembering and problem-solving. The computer modelling of human brain activities such as remembering and learning is done by simulating the function of neurons.

6.1 Training Strategy

The information on the number of elements that may be generated per element of theinitial background mesh forms the pivitol factor in determining the effectiveness of theproposed method. In adaptive unstructured mesh generation the mesh generator using the



Figure 4. Unscaled and scaled representation of the input data

background mesh and the mesh parameter generates a finer graded mesh which provides a better discretization of the finite element domain. The unstructured mesh generation method originally proposed in reference [8] was re-formulated for parallel unstructured mesh generation in reference [9]. The modified method re-meshed the background mesh using subdomains, where each element of the background mesh was treated as a subdomain. It was shown in reference [9] that, in order to re-mesh each subdomain individually while maintaining boundary node compatibility at the adjoining subdomains, it was necessary to use nodal mesh parameters. Thus, for carrying out unstructured mesh generation per element of the background mesh, the input data comprise the nodal coordinates and the nodal mesh parameter values for each triangular element of the background mesh. Hence the data input requirements for generating compatible meshes within the elements of the background mesh determine the input stimuli for the neural networks.

Thus for training purposes neural networks with 9 inputs (6 input values for the two-dimensional nodal coordinates of the triangular element plus three values of the nodal mesh parameters) and one output (the number of elements generated) were foreseen. However, for the actual training of the neural network it was noted that each triangular element could be represented by the lengths of its three sides and its three internal angles. By using these parameters to represent the geometry of the triangular element to the neural networks, one of the input parameters could be de-activated by making it a constant. This was accomplished by noting that the nodal mesh parameters actually represent the size of the triangles to be generated; hence scaling the three sides and the three nodal mesh parameters by one of the nodal mesh parameters renders that mesh parameter constant in the data set of the input stimuli. If the three sides of a triangular element are L1, L2 and L3 and the three nodal mesh parameters are δ1, δ2 and δ3, then scaling the above input values by δ1 renders δ1 constant (equal to unity), as shown in Figure 4.
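This scaling may be sketched as follows; the array layout is an assumption made here for illustration:

```c
/* Scale the three side lengths L[] and the three nodal mesh parameters
   delta[] of a background-mesh triangle by delta[0] (delta1 above), so
   that the first mesh parameter becomes exactly 1.0 and can be held
   constant in the network's input stimuli.  A sketch of the normalization
   described in the text; the array layout is an assumption. */
void scale_triangle_inputs(double L[3], double delta[3])
{
    double d1 = delta[0];
    int i;

    for (i = 0; i < 3; i++) {
        L[i]     /= d1;   /* side lengths expressed in units of delta1 */
        delta[i] /= d1;   /* delta[0] becomes unity                    */
    }
}
```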

A single hidden layer was used and the number of processing elements in the


hidden layer was determined using the following equation as a guideline:

h = cases / (10(m + n))    (9)

Where:

cases is the number of training input sets provided.

m is the number of the processing elements in the output layer.

n is the number of the processing elements in the input layer.

h is the maximum number of processing elements recommended for use in the hidden layer.
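Equation (9) can be read as an integer guideline; for the 407-set training file described below, with n = 9 inputs and m = 1 output, it gives h ≈ 4, of the same order as the 5 hidden elements actually used:

```c
/* Guideline of equation (9): the recommended maximum number of hidden-layer
   processing elements is the number of training cases divided by ten times
   the total number of output (m) and input (n) elements.  The integer part
   is taken here, so this is only an order-of-magnitude guide. */
int max_hidden_units(int cases, int m, int n)
{
    return cases / (10 * (m + n));
}
```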

Supervised training is required in predictive neural networks. The primary network for prediction is called a feedforward network with non-linear elements. There are a number of ways to train this network. Some of these training methods are as follows:

Back propagation: Back-propagation training is based upon propagating the errors calculated at the output layer backward through the connections to the previous layer. This process is repeated until the input layer is reached. Back-propagation is based upon a steepest descent approach to the minimization of the prediction error with respect to the connection weights in the network.

DBD method: The delta-bar-delta (DBD) method attempts to increase the speed of convergence by applying heuristics, based upon the previous values of the gradients, for inferring the curvature of the local error surface.

EDBD method: The extended-delta-bar-delta (EDBD) technique applies heuristic adjustment of the momentum term in DBD-based networks. The momentum is a term which is added to the weight change and is proportional to the previous weight change. This heuristic is designed to reinforce positive learning trends and dampen oscillations.

DRS method: This method differs from back-propagation in that it does not utilize back-tracking features to make weight adjustments, but incorporates an element of uncertainty (randomness) while keeping track of previously successful directions. This method has been found useful for small but complicated networks [16].

The learning method based upon the EDBD (extended-delta-bar-delta) method was selected for the training of the predictive neural network owing to its heuristically controlled learning and momentum coefficient selections, which reduced the training time considerably.
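The momentum term that the DBD and EDBD heuristics adapt can be illustrated by a single connection-weight update. This is a generic sketch of gradient descent with momentum, not the NeuralWorks implementation; the names lr (learning rate) and mu (momentum coefficient) are illustrative:

```c
/* One connection-weight update: a steepest-descent step plus a momentum
   term proportional to the previous weight change.  Plain back-propagation
   keeps lr and mu fixed; the DBD/EDBD heuristics vary them per connection
   from the gradient history.  A generic sketch, not the NeuralWorks code. */
double momentum_update(double w, double grad, double lr,
                       double mu, double *prev_dw)
{
    double dw = -lr * grad + mu * (*prev_dw);
    *prev_dw = dw;          /* remember this change for the next step */
    return w + dw;
}
```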

The training set for the neural network was based upon the combination of only two background meshes, shown in Figures 5 and 6.

The background mesh shown in Figure 5 was analysed using the different load cases shown in Table 1, with nodes 1 and 2 restrained in the x and y directions. From



Figure 5. Background mesh comprising 34 elements


Figure 6. Background mesh comprising 56 elements


load case   magnitude   nodes loaded with P (one at a time)
Px          100         4 5 6 7 12 13 14 18 19 20
Py          100         3 15 16 17

Table 1. The load cases for the mesh shown in Figure 5

the adaptivity module the mesh parameters per element of the background mesh were calculated; these mesh parameters were then fed into the parallel adaptive mesh generator, which produced the re-meshings for each of the load cases specified in Table 1. The mesh generator also produced statistics on the number of elements generated within each element of the background mesh. Thus a training file comprising the following information per element of the background mesh was generated.

• The side lengths of the element

• The internal angles of the element

• The nodal mesh parameters of the element

• The number of elements in the refined mesh generated from the element in the background mesh

The side lengths and the nodal mesh parameters were normalized as discussed above. It was noted that a number of generated elements equal to unity occurred quite frequently in the input data. From experience it was noted that a greater than usual number of these unit values for the number of generated elements reduced the accuracy of the network. Therefore many of these unit values were removed from the training data set, leaving a few such values (between 2 and 3) per load case.

The mesh shown in Figure 6 was analysed under a single vertical load applied at node 16. Nodes 7 and 25 were restrained in the x and y directions. The input data file was generated on the same basis as in the case of the mesh shown in Figure 5.

The data input files from these two background meshes were combined to form a general data file. The general data file thus formed consisted of 407 training data sets (each pertaining to an element of one of the two background meshes under consideration).

The neural network comprised 16 processing elements: 9 processing elements and a bias element for the input layer, 5 elements for the hidden layer and one processing element for the output layer. This network was created and trained using NeuralWorks Professional II/Plus1.

The neural network was trained with respect to limiting R.M.S. error [16] values. Convergence was assumed at percentage R.M.S. error values of 1.6%, 1.5% and 1.2%. The neural networks resulting from these convergence R.M.S. error values were tested by predicting the resultant number of elements for the initial meshes shown in Figures 7 (left), 8 (left), 9 (left), 10 (left), 11 (left) and 12 (left).

1NeuralWare, Inc. Building IV, Suite 227, Penn Center West, Pittsburgh, PA 15276.


Figure 7. The initial and the final meshes comprising 46 and 412 elements respectively

Figure 8. The initial and the final meshes comprising 72 and 476 elements respectively

Percentage Convergence R.M.S. Errors For The Neural Network

Test Mesh                   1.6%     1.5%     1.2%
errorRMS for Figure 7       2.7024   2.276    2.329
errorRMS for Figure 8       2.347    1.394    2.803
errorRMS for Figure 9       3.676    3.0314   3.44
errorRMS for Figure 10      3.234    4.238    3.784
errorRMS for Figure 11      2.896    2.716    2.604
errorRMS for Figure 12      2.747    2.796    1.883
Total (Averaged) errorRMS   2.937    2.742    2.687

Table 2. Performance (errorRMS) of the Neural Network with different convergence R.M.S. error values


Figure 9. The initial and the final meshes comprising 46 and 719 elements respectively

Figure 10. The initial and the final meshes comprising 98 and 674 elements respectively


Figure 11. The initial and the final meshes comprising 153 and 1172 elements respectively

Figure 12. The initial and the final meshes comprising 44 and 794 elements respectively


Figure 13. A square shaped domain with in-plane load

An overall R.M.S. error based upon the difference between the predicted and the actual number of elements was calculated for each percentage convergence R.M.S. error value for the network, using the following expression:

errorRMS = sqrt( Σ_{i=1}^{ne} (p_i − a_i)² / ne )    (10)

Where:

ne: is the number of elements tested for element generation.

pi: is the number of elements predicted, per element, by the neural network.

ai: is the number of actual elements generated per element.

From the total averaged values of the errorRMS calculated for the various meshes, as shown in Table 2, it may be seen that the neural network exhibits consistent improvement in predicting the number of elements as the acceptable convergence R.M.S. error value decreases.

7 Example 1

A square shaped domain, shown in Figure 13, with an in-plane horizontal concentrated load at the top left corner node and the bottom corner nodes restrained, was uniformly meshed; the initial mesh comprised 46 elements as shown in Figure 7 (left).

Adaptive finite element analysis of this mesh was carried out, which required only a few seconds on a single 20 MHz T800 transputer. The results from the adaptive analysis were used to generate the final mesh comprising 412 elements, as shown in Figure 7 (right).


Figure 14. Example 1: The initial mesh (46 elements) of Figure 7 (left) divided into 4 subdomains using the SGM

Subdomain   Generated           Generated
No.         Elements (Actual)   Elements (Required)   diff   %age diff
1           99                  103                   -4     -3.88
2           108                 103                    5      4.85
3           97                  103                   -6     -5.825
4           108                 103                    5      4.85

Table 3. Example 1: Comparison of the actual number of generated elements per subdomain versus the ideal number of elements that should have been generated in the example shown in Figure 7 (right)

It should be pointed out that the final mesh referred to in this example was produced for comparison with mesh partitioning methods other than the SGM.

The proposed method was applied to the initial mesh shown in Figure 7 (left) and the mesh was divided into 4 subdomains, as shown in Figure 14, by performing two recursive bisections. The subdomains obtained were independently re-meshed (as they would have been processed in a multiple processor environment) and the remeshed subdomains are shown in Figure 15.


Figure 15. Example 1: The remeshed subdomains for the mesh (412 elements) shown in Figure 7 (left), partitioned using the SGM


Figure 16. Example 1: The mesh partitions delivered by Farhat’s method for the mesh (412 elements) shown in Figure 7 (right)

Method            Interfaces Ccf   time (min.)
SGM               58               1.067
Farhat’s Method   82               0.067
Simon’s Method    50               3.8

Table 4. Example 1: Comparison of interfaces between the subdomains and the run times on a single T800 transputer for the partitioning of the meshes shown in Figure 7

The number of elements generated within each subdomain versus the ideal number of elements which should have been generated per subdomain is shown in Table 3.

The mesh partitions delivered by Farhat’s and Simon’s methods for the mesh shown in Figure 7 (right) are shown in Figures 16 and 17 respectively. The number of interfaces Ccf for the above partitionings of the mesh, obtained using the SGM, Farhat’s method [4] and Simon’s method (RSB) [1], and the corresponding times taken for mesh partitioning, are shown in Table 4.

It may be seen from Table 3 that the maximum positive load imbalance occurs in subdomains 2 and 4 and is equal to 4.85%. Both Simon’s and Farhat’s methods provide the exact number of elements per subdomain, ensuring an equal element distribution.


Figure 17. Example 1: The mesh partitions delivered by Simon’s method for the mesh (412 elements) shown in Figure 7 (right)


Figure 18. Example 2: An L-shaped domain with in-plane load

Subdomain No.   Gen. Elements (Actual)   Gen. Elements (Required)   diff     %age diff
1               159                      168.75                     -9.75    -1.44
2               167                      168.75                     -1.75    -0.26
3               165                      168.75                     -3.75    -0.56
4               184                      168.75                     15.25     2.26

Table 5. Example 2: Comparison of the actual number of generated elements per subdomain versus the ideal number of elements that should have been generated using the SGM

Table 4 shows that the proposed method provides better results than Farhat’s method. Simon’s method provides the best results but, as may be seen from Table 4, its computational cost is relatively higher.

8 Example 2

The adaptive finite element analysis was performed for the L-shaped domain shown in Figure 18. Initial and final meshes comprising 126 and 666 elements respectively were generated.

The SGM was applied to the initial mesh and the mesh was divided into 4 subdomains, as shown in Figure 19. The subdomains obtained were independently re-meshed and are shown in Figure 20.

The number of elements generated within each subdomain versus the ideal number of elements which should have been generated per subdomain is shown in Table 5.

The mesh partitions delivered by Farhat’s and Simon’s methods for the mesh shown


Figure 19. Example 2: The initial mesh (126 elements) divided into 4 subdomains using the SGM

Method            Interfaces Ccf   time (min.)
Proposed Method   72               4.267
Farhat’s Method   148              0.16
Simon’s Method    78               7.8

Table 6. Example 2: Comparison of interfaces between the subdomains and the run times on a single T800 transputer for the partitioning of the mesh shown in Figure 20


Figure 20. Example 2: The remeshed subdomains for the mesh (666 elements) shown in Figure 19, partitioned using the SGM


Figure 21. Example 2: The mesh partitions delivered by Farhat’s method for the final mesh (666 elements)


Figure 22. Example 2: The mesh partitions delivered by Simon’s method for the final mesh (666 elements)


Figure 23. Example 3: Domain with cut-out and chamfer

in Figure 20 (the final mesh of 666 elements) are shown in Figures 21 and 22 respectively. The number of interfaces Ccf for the above partitionings of the mesh, obtained using the SGM, Farhat’s method [4] and Simon’s method (RSB) [1], and the corresponding times taken for mesh partitioning, are shown in Table 6.

It may be seen from Table 5 that the maximum positive load imbalance occurs in subdomain 4 and is equal to 2.26%. Both Simon’s and Farhat’s methods provide the exact number of elements per subdomain, ensuring an equal element distribution. Table 6 shows that the SGM provides the best results.

9 Example 3

The adaptive finite element analyses were performed for the domain with cut-out and chamfer shown in Figure 23. Adaptive analysis was performed on the initial mesh of 153 elements shown in Figure 11 (left), and this mesh was divided into 8 subdomains, as shown in Figure 24, using the SGM. The subdomains were remeshed and the resultant subdomains are shown in Figure 25. The final adaptive mesh comprising 1172 elements, as shown in Figure 11 (right), was also generated for comparison with other mesh partitioning methods.

The maximum positive imbalance of 3.75% occurred in subdomain 5, as shown in Table 7. The mesh partitions delivered by Farhat’s method for the mesh shown in Fig-


Figure 24. Example 3: The initial mesh (153 elements) of Figure 11 (left) divided into 8 subdomains using the SGM

Figure 25. Example 3: The remeshed domain (1172 elements) of Figure 11 (left), showing the SGM generated subdomains


Subdomain No.   Gen. Elements (Actual)   Gen. Elements (Required)   diff     %age diff
1               128                      146.5                      -18.5    -12.62
2               150                      146.5                       3.5      2.39
3               150                      146.5                       3.5      2.39
4               144                      146.5                      -2.5     -1.707
5               152                      146.5                       5.5      3.75
6               149                      146.5                       2.5      1.707
7               150                      146.5                       3.5      2.39
8               147                      146.5                       0.5      0.341

Table 7. Example 3: Comparison of the actual number of generated elements per subdomain versus the ideal number of elements that should have been generated

Figure 26. Example 3: The mesh partitions delivered by Farhat’s method for the mesh (1172 elements) shown in Figure 11 (right)

Method            Interfaces Ccf   time (min.)
Proposed Method   168              4.267
Farhat’s Method   242              0.3833
Simon’s Method    164              16.833

Table 8. Example 3: Comparison of interfaces between the subdomains and the run times on a single T800 transputer for the partitioning of the mesh shown in Figure 25


Figure 27. Example 3: The mesh partitions delivered by Simon’s method for the mesh (1172 elements) shown in Figure 11 (right)

ure 11 (right) is shown in Figure 26. Simon’s method provides the best results with 164 interfaces. The SGM follows

closely with 168 interfaces. The time taken by Simon’s method is, however, approximately four times higher than that of the SGM in this case.

10 Conclusions and Remarks

From the theory and the examples presented in the previous sections it is apparent that the proposed subdomain generation method’s effectiveness is controlled by the optimization and the predictive modules. The genetic algorithm was chosen as the optimization module on account of its robustness and an algorithmic structure which lends itself perfectly to parallel processing. A neural network was chosen as the predictive module on account of the ability of neural networks to afford solutions to complex predictive problems very economically. However, the effectiveness of the neural networks depends on the extent of the training undertaken. The training of the neural network in this research was based on a set comprising only two background meshes. It was shown that even with this limited training set the neural network performance was generally found to be satisfactory for a variety of meshes.

The Subdomain Generation Method has been designed for coarse grained parallel finite element processing systems, where an initial background mesh comprising approximately 150 elements may deliver feasible partitions for up to 8 subdomains or processors. The number of elements in the initial mesh is important on two accounts. In the first instance, if the number of elements is too low compared to the number of subdomains to be formed, then it may not be possible for the optimization module to deliver optimum partitions, since the numbers of generated elements per subdomain are allocated, per element of the initial mesh, to the subdomains. On the other hand, if the number of elements in the


initial mesh is increased then the computational cost for the formation of the subdomains increases. Thus the SGM in its present sequential form is suitable for coarse grained finite element processing, where the initial meshes may have up to 200 elements for generating up to 16 subdomains. However, with the use of a parallel genetic algorithm it may be possible to partition higher density initial meshes, allowing the generation of subdomain data for a greater number of parallel processors.

The characteristics of the subdomain generation method proposed in this paper may be summarised as follows:

1. It is possible, using this method, to generate a completely distributed finite element discretization which eliminates the bottle-neck caused by the high storage requirements at the ROOT or the central processor of the processor network.

2. With this method the finite element discretization only exists in its distributed form; thus the requirement of generating an overall mesh and then partitioning it for parallel processing is eliminated.

3. Parallel adaptive/uniform mesh generation constitutes an important feature of the method. It has been shown in reference [14] that genetic algorithms may be efficiently parallelised. Hence the whole procedure for subdomain generation may be effectively parallelised, from the partitioning of the initial mesh by the genetic algorithm to the re-meshing of the subdomains on the parallel processors.

4. The computational time for determining subdomains is independent of the number of elements that may be generated in the final mesh and depends only upon the number of elements in the coarse initial mesh. Hence large scale discretizations comprising thousands of elements may be partitioned by partitioning their coarse initial background meshes (usually comprising a few hundred elements); this is not possible with conventional mesh partitioning methods (such as in references [1, 4]), where the mesh partitioning time is directly dependent on the number of elements in the mesh to be analysed.

5. The method does not determine exact mesh partitions with respect to the number of elements per subdomain, but the imbalance may be reduced by improving the performance of the neural network.

Further research may be undertaken in determining objective functions which may cater for non-convex 3-D domains. Various possibilities in the type, topology and training of the neural networks need to be explored to increase their accuracy. A parallel implementation of the SGM, using a parallel genetic algorithm optimizer and parallel mesh generation, may be incorporated into a parallel finite element solver.

11 C Functions

The function for the bitwise crossover, in C, is given as follows:

void bit_cross(unsigned parent1, unsigned parent2,
               int left_offset, unsigned *child1, unsigned *child2)
{
    unsigned a, b, siz;

    siz = sizeof(unsigned)*8;          /* chromosome length in bits */

    /* child1: low (siz - left_offset) bits of parent1,
       high left_offset bits of parent2 */
    a = parent1; b = parent2;
    a <<= left_offset; a >>= left_offset;
    b >>= siz - left_offset; b <<= siz - left_offset;
    *child1 = a | b;

    /* child2: high left_offset bits of parent1,
       low (siz - left_offset) bits of parent2 */
    a = parent1; b = parent2;
    a >>= siz - left_offset; a <<= siz - left_offset;
    b <<= left_offset; b >>= left_offset;
    *child2 = a | b;
}

Where parent1 and parent2 represent the two chromosome variables before the crossover and child1 and child2 represent the resultant chromosome variables. The left_offset equals the length of the chromosome variable in bits less the bit location of the crossover point.
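The function can be exercised on 32-bit chromosomes; the fragment below repeats bit_cross so that it is self-contained, and the parent values are arbitrary illustrative bit patterns:

```c
/* bit_cross repeated from above so this fragment compiles on its own. */
void bit_cross(unsigned parent1, unsigned parent2,
               int left_offset, unsigned *child1, unsigned *child2)
{
    unsigned a, b, siz = sizeof(unsigned) * 8;

    a = parent1; b = parent2;
    a <<= left_offset; a >>= left_offset;              /* low bits of parent1  */
    b >>= siz - left_offset; b <<= siz - left_offset;  /* high bits of parent2 */
    *child1 = a | b;

    a = parent1; b = parent2;
    a >>= siz - left_offset; a <<= siz - left_offset;  /* high bits of parent1 */
    b <<= left_offset; b >>= left_offset;              /* low bits of parent2  */
    *child2 = a | b;
}
```

With 32-bit unsigned integers and left_offset = 24 (a crossover point 8 bits from the low end), parents 0x12345678 and 0x9ABCDEF0 give children 0x9ABCDE78 and 0x123456F0: each child takes the low 8 bits of one parent and the high 24 bits of the other.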

To enable the chromosome variable values to be translated into the design variable values, a range for the design variable values must be specified.

For a specified set of design variable ranges the design variable values may be determined using the following C function:

double chrom_map_x(unsigned chrom, double min, double max, int lchrom)
{
    double range, add_percent, result;

    /*
       chrom  = integer value of the chromosome variable
       lchrom = length of the chromosome in bits
       min    = starting value of the design variable
       max    = highest permissible value for the design variable
       result = value of the design variable
    */

    range = max - min;
    add_percent = ((double) chrom / (pow(2., lchrom) - 1)) * range;
    result = min + add_percent;
    return(result);
}


References and Bibliography

[1] H. D. Simon, “Partitioning of unstructured problems for parallel processing”, Computing Systems in Engineering, vol. 2, no. 2/3, 1991.

[2] J. Flower, S. Otto, M. Salma, “Optimal mapping of irregular finite element domains to parallel processors”, Parallel Computations and Their Impact on Mechanics, AMD-Vol. 86, 239-250, ASME, New York, 1986.

[3] B. Nour-Omid, A. Raefsky, G. Lyzenga, “Solving finite element equations on concurrent computers”, Parallel Computations and Their Impact on Mechanics, AMD-Vol. 86, 209-227, ASME, New York, 1986.

[4] C. Farhat, “A simple and efficient automatic FEM domain decomposer”, Computers & Structures, vol. 28, no. 5, 579-602, 1988.

[5] B. W. Kernighan, S. Lin, “An efficient heuristic procedure for partitioning graphs”, The Bell System Technical Journal, 49, 291-307, 1970.

[6] J. Favenesi, A. Daniel, J. Tombello, J. Watson, “Distributed Finite Element using a Transputer Network”, Computing Systems in Engineering, vol. 1, nos. 2-4, 171-182, 1990.

[7] O. C. Zienkiewicz, J. Z. Zhu, “A simple error estimator and adaptive procedure for practical engineering analysis”, International Journal for Numerical Methods in Engineering, vol. 24, 337-357, 1987.

[8] J. Peraire, M. Vahdati, K. Morgan, O. C. Zienkiewicz, “Adaptive remeshing for compressible flow computations”, Journal of Computational Physics, vol. 72, 449-466, 1987.

[9] A. I. Khan, B. H. V. Topping, “Parallel adaptive mesh generation”, Computing Systems in Engineering, vol. 2, no. 1, 75-102, 1991.

[10] B. H. V. Topping, A. I. Khan, “Parallel Finite Element Computations”, Saxe-Coburg Publications, Edinburgh, to be published 1993.

[11] INMOS Limited, “Transputer Reference Manual”, Prentice-Hall, Hertfordshire, U.K.,1988.

[12] D. E. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley Publishing Company, Inc., 1989.

[13] E. J. Anderson, M. C. Ferris, “Parallel Genetic Algorithms in Optimization”, Proceedings of the Fourth SIAM Conference on Parallel Processing for Scientific Computing, Society for Industrial and Applied Mathematics, 1990.

[14] T. Fogarty, R. Huang, “Implementing the genetic algorithm on transputer-based parallel processing systems”, Lecture Notes in Computer Science, vol. 496, 145-149, 1991.


[15] R. Beale, T. Jackson, Neural Computing: An Introduction, Adam Hilger, IOP Publishing Ltd, Bristol, 1990.

[16] Neural Computing, NeuralWare Inc., Technical Publishing Group, Building IV, Suite 227, Penn Center West, Pittsburgh, 1991.
