
GPU-acceleration of stiffness matrix calculation and efficient initialization of EFG meshless methods


A. Karatarakis, P. Metsis, M. Papadrakakis
Institute of Structural Analysis and Antiseismic Research, National Technical University of Athens, Zografou Campus, Athens 15780, Greece

Article info

Article history: Received 5 October 2012; Received in revised form 20 January 2013; Accepted 12 February 2013; Available online 4 March 2013

Keywords: Meshless methods; Element free Galerkin (EFG); Preprocessing; Stiffness matrix assembly; Parallel computing; GPU acceleration

Abstract

Meshless methods have a number of virtues in problems concerning crack growth and propagation, large displacements, strain localization and complex geometries, among others. Despite the fact that they do not rely on a mesh, meshless methods require a preliminary step for the identification of the correlation between nodes and Gauss points before building the stiffness matrix. This is implicitly performed with the mesh generation in FEM but must be explicitly done in EFG methods and can be time-consuming. Furthermore, the resulting matrices are more densely populated and the computational cost for the formulation and solution of the problem is much higher than in the conventional FEM. This is mainly attributed to the vast increase in interactions between nodes and integration points due to their extended domains of influence. For these reasons, computing the stiffness matrix in EFG meshless methods is a very computationally demanding task which needs special attention in order to be affordable in real-world applications. In this paper, we address the pre-processing phase, dealing with the problem of defining the necessary correlations between nodes and Gauss points and between interacting nodes, as well as the computation of the stiffness matrix. A novel approach is proposed for the formulation of the stiffness matrix which exhibits several computational merits, one of which is its amenability to parallelization which allows the utilization of graphics processing units (GPUs) to accelerate computations.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

In meshless methods (MMs) there is no need to construct a mesh, as in the finite element method (FEM), which is often in conflict with the real physical compatibility condition that a continuum possesses [1]. Moreover, stresses obtained using FEM are discontinuous and less accurate, while a considerable loss of accuracy is observed when dealing with large deformation problems because of element distortion. Furthermore, due to the underlying structure of the classical mesh-based methods, they are not well suited for treating problems with discontinuities that do not align with element edges. MMs were developed with the objective of eliminating part of the above mentioned difficulties [2]. With MMs, manpower time is limited to a minimum due to the absence of mesh and mesh-related phenomena. Complex geometries are handled easily with the use of scattered nodes.

One of the first and most prominent meshless methods is the element free Galerkin (EFG) method introduced by Belytschko et al. [3]. EFG requires only nodal data; no element connectivity is needed to construct the shape functions. However, a global background cell structure is necessary for the numerical integration.

Moreover, since the number of interactions between nodes and/or integration points is heavily increased, due to large domains of influence, the resulting matrices are more densely populated and the computational cost for the formulation and solution of the problem is much higher than in the conventional FEM [3].

To improve the computational efficiency of MMs, parallel implementations like the MPI parallel paradigm have been used in large scale applications [4,5] and several alternative methodologies have been proposed concerning the formulation of the problem. The smoothed FEM (SFEM) [6] couples FEM with meshless methods by incorporating a strain smoothing operation used in the mesh-free nodal integration method. The linear point interpolation method (PIM) [7] obtains the partial derivatives of shape functions effortlessly due to the local character of the radial basis functions. A coupled EFG/boundary element scheme [8] takes advantage of both the EFG and the boundary element method. Furthermore, solvers which perform an improved factorization of the stiffness matrix and use special algorithms for realizing the matrix–vector multiplication are proposed in [9,10]. Divo and Kassab [11] presented a domain decomposition scheme on a meshless collocation method, where collocation expressions are used at each subdomain with artificially created interfaces. Wang et al. [7] presented a parallel reproducing kernel particle method (RKPM), using a particle overlapping scheme which significantly increases the number of shared particles and the time for communicating information between them.

Recently, a novel approach for reducing the computational cost of EFG methods has been proposed by employing domain decomposition techniques on the physical as well as on the algebraic domains [12]. In that work the solution of the resulting algebraic problems is performed with the dual domain decomposition FETI method with and without overlapping between the subdomains. The non-overlapping scheme has led to a significant decrease of the overall computational cost.

Applications of graphics processing units (GPUs) to scientific computations are attracting a lot of attention due to their low cost in conjunction with their inherently remarkable performance features. Parametric tests on 2D and 3D elasticity problems revealed the potential of the proposed approach as a result of the exploitation of multi-core CPU hardware resources and the intrinsic software and hardware features of the GPUs.

Driven by the demands of the gaming industry, graphics hardware has substantially evolved over the years with remarkable floating point arithmetic performance. In the early years, these operations had to be programmed indirectly, by mapping them to graphic manipulations and using graphic libraries such as openGL and DirectX. This approach of solving general purpose problems is known as general purpose computing on GPUs (GPGPU). GPU programming was greatly facilitated with the initial release of the CUDA-SDK [13–15], which resulted in a rapid development of GPU computing and the appearance of GPU-powered clusters on the Top500 supercomputers [16]. Unlike CPUs, GPUs have an inherent parallel throughput architecture that focuses on executing many concurrent threads slowly, rather than executing a single thread very fast.

Work pertaining to GPUs has extended to a large spectrum of applications even before CUDA made their use easier. A number of studies in engineering applications have been recently reported on a variety of GPU platforms using implicit computational algorithms: in fluid mechanics [17–21], molecular dynamics [22,23], topology optimization [24], wave propagation [25], Helmholtz problems [26], and neurosurgical simulations [27]. Linear algebra applications have also been a topic of scientific interest for GPU implementations. Dense linear algebra algorithms are reported in [28], while a thorough analysis of algorithmic performance of basic linear algebra operations can be found in [29]. The performance of iterative solvers is analyzed in [30], and a parametric study of the PCG solver is performed on multi-GPU CUDA clusters in [31,32]. A hybrid CPU–GPU implementation of domain decomposition methods is presented in [33], where speedups of the order of 40x have been achieved. It should be noted that all implementations prior to CUDA 1.3 were performed in single precision, since support for double-precision floating point operations was added in CUDA 1.3. This has caused some misinterpretations in a number of published comparisons between the GPU and the CPU, usually in favor of the GPU.

The present work aims at a drastic reduction of the computational effort required for the initialization phase and for assembling the stiffness matrix by implementing a novel node pair-wise procedure. It is believed that with the proposed computational handling of the pre-processing phase and the accelerated formulation of the stiffness matrix, together with recent improvements on the solution of the resulting algebraic equations [12], MMs are becoming computationally competitive and are expected to demonstrate their inherent advantages in solving real, large-scale engineering problems.

2. Basic ingredients of the meshless EFG method

The approximation of a scalar function u in terms of Lagrangian coordinates in the meshless EFG method can be written as

u(x, t) = \sum_{i \in S} \Phi_i(x) u_i(t)    (1)

where Φ_i are the shape functions, u_i are the nodal values at particle i located at position x_i, and S is the set of nodes i for which Φ_i(x) ≠ 0. The shape functions in Eq. (1) are only approximants and not interpolants, since u_i ≠ u(x_i).

The shape functions Φ_i are obtained from the weight coefficients w_i, which are functions of a distance parameter r = ‖x_i − x‖/d_i, where d_i defines the domain of influence (doi) of node i. The domain of influence is crucial to solution accuracy, stability and computational cost, as it defines the degree of continuity between the nodes and the bandwidth of the system matrices.
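The paper does not restate which particular weight function is used. Purely as an illustration, a common choice in EFG is the cubic spline weight written in terms of the normalized distance r defined above; the function name and the return of both w and dw/dr below are our assumptions, not the authors' code.

```cpp
// Illustrative only: cubic spline weight often used in EFG, in terms of the
// normalized distance r = ||x_i - x|| / d_i. Returns the weight and dw/dr;
// derivatives with respect to x then follow from the chain rule through r.
#include <cmath>

struct Weight { double w, dw_dr; };

Weight cubicSplineWeight(double r) {
    if (r <= 0.5)
        return { 2.0 / 3.0 - 4.0 * r * r + 4.0 * r * r * r,
                 -8.0 * r + 12.0 * r * r };
    if (r <= 1.0)
        return { 4.0 / 3.0 - 4.0 * r + 4.0 * r * r - (4.0 / 3.0) * r * r * r,
                 -4.0 + 8.0 * r - 4.0 * r * r };
    return { 0.0, 0.0 };   // outside the domain of influence
}
```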

The approximation u^h is expressed as a polynomial of length m with non-constant coefficients. The local approximation around a point x̄, evaluated at a point x, is given by

u_L^h(x, \bar{x}) = p^T(x) a(\bar{x})    (2)

where p(x) is a complete polynomial of length m and a(x̄) contains non-constant coefficients that depend on x:

a(\bar{x}) = [a_0(x) \; a_1(x) \; a_2(x) \; \cdots \; a_m(x)]^T    (3)

In two dimensional problems, the linear basis p(x) is given by

p^T(x) = [1 \; x \; y], \quad m = 3    (4)

and the quadratic basis by

p^T(x) = [1 \; x \; y \; x^2 \; y^2 \; xy], \quad m = 6    (5)

The unknown parameters a_j(x) are determined at any point x by minimizing a functional J(x) defined by a weighted average over all nodes i ∈ 1, ..., n:

J(x) = \sum_{i=1}^{n} w(x - x_i) \left[ u_L^h(x_i, x) - u_i \right]^2 = \sum_{i=1}^{n} w(x - x_i) \left[ p^T(x_i) a(x) - u_i \right]^2    (6)

where the parameters u_i are specified by the difference between the local approximation u_L^h(x, x̄) and the value u_i, while the weight function satisfies the condition w(x − x_i) ≠ 0. An extremum of J(x) with respect to the coefficients a_j(x) can be obtained by setting the derivative of J with respect to a(x) equal to zero. This condition gives the following relation:

A(x) a(x) = W(x) u    (7)

where

A(x) = \sum_{i=1}^{n} w(x - x_i) \, p(x_i) \, p^T(x_i)    (8)

W(x) = [\, w(x - x_1) p(x_1) \;\; w(x - x_2) p(x_2) \;\; \cdots \;\; w(x - x_n) p(x_n) \,]    (9)

Solving for a(x) in Eq. (7) and substituting into Eq. (2), the approximants u^h can be defined as follows:

u^h(x) = p^T(x) [A(x)]^{-1} W(x) u    (10)

which together with Eq. (1) leads to the derivation of the shape function Φ_i associated with node i at point x:

\Phi_i(x) = p^T(x) [A(x)]^{-1} W(x_i)    (11)

A solution of a local problem A(x)z = p(x) of size m × m is performed whenever the shape functions are to be evaluated. This constitutes a drawback of moving least squares-based (MLS-based) MMs, since the computational cost can be substantial and it is possible for the moment matrix A(x) to be ill-conditioned [2].


The Galerkin weak form of the above formulation gives the discrete algebraic equation

K u = f    (12)

with

K_{ij} = \int_{\Omega} B_i^T E B_j \, d\Omega    (13)

f_i = \int_{\Gamma_t} \Phi_i \bar{t} \, d\Gamma + \int_{\Omega} \Phi_i b \, d\Omega    (14)

In 2D problems the matrix B is given by

B_i = \begin{bmatrix} \Phi_{i,x} & 0 \\ 0 & \Phi_{i,y} \\ \Phi_{i,y} & \Phi_{i,x} \end{bmatrix}    (15)

and in 3D problems by

B_i = \begin{bmatrix} \Phi_{i,x} & 0 & 0 \\ 0 & \Phi_{i,y} & 0 \\ 0 & 0 & \Phi_{i,z} \\ \Phi_{i,y} & \Phi_{i,x} & 0 \\ 0 & \Phi_{i,z} & \Phi_{i,y} \\ \Phi_{i,z} & 0 & \Phi_{i,x} \end{bmatrix}    (16)

Due to the lack of the Kronecker delta property of the shape functions, the essential boundary conditions cannot be imposed in the same way as in FEM. Several techniques are available, such as Lagrange multipliers, penalty methods, and EFG–FEM coupling.

For the integration of Eq. (13), virtual background cells are considered by dividing the problem domain into integration cells over which a Gaussian quadrature is performed:

\int_{\Omega} f(x) \, d\Omega = \sum_{J} f(\xi_J) \, \omega_J \, \det J_\xi(\xi)    (17)

where ξ are the local coordinates and det J_ξ(ξ) is the determinant of the Jacobian.

Fig. 1. Domain of influence of a Gauss point in (a) EFG; (b) FEM, for the same number of nodes and Gauss points.

3. Gauss point-wise formulation of the stiffness matrix

The stiffness matrix of Eq. (13) is usually formed by adding the contributions of the products B_G^T E B_G of all Gauss points G to the stiffness matrix according to the formula

K = \sum_{G} B_G^T E B_G = \sum_{G} Q_G    (18)

where the deformation matrix B_G is computed at the corresponding Gauss point. The summation is performed for each Gauss point and affects all nodes within its domain of influence. Compared to FEM, the amount of calculations for performing this task is significantly higher since the domains of influence of Gauss points are much larger than the corresponding domains in FEM, as is schematically shown in Fig. 1 for a domain discretized with EFG and FEM having an equal number of nodes and Gauss points. Throughout this paper we do not address the issue of accuracy obtained by the two methods with the same number of nodes and Gauss points.

In FEM, each Gauss point is typically involved in element-level computations for the formation of the element stiffness matrix, which is then added to the appropriate positions of the global stiffness matrix. Moreover, the shape functions and their derivatives are predefined for each element type and need to be evaluated on all combinations of nodes and Gauss points within each element. In EFG methods, however, the contribution of each Gauss point is directly added to the global stiffness matrix, while the shape functions are not predefined and span across larger domains with a significantly higher amount of Gauss point-node interactions.

Although in EFG methods there is no need to construct a mesh, the correlation between nodes and Gauss points needs to be defined. This preliminary step before building the stiffness matrix is implicitly performed with the mesh creation in FEM, but must be explicitly done in EFG methods and can be time-consuming if not appropriately handled. For the aforementioned reasons, computing the stiffness matrix in EFG meshless methods is a very computationally demanding task which needs special attention in order to be affordable in real-world applications.

3.1. Node-Gauss point correlation

In the initialization step, the basic entities are created, namely the nodes and the Gauss points together with their domains of influence. The domains of influence define the correlation between nodes and Gauss points. With the absence of an element mesh, the correlation of Gauss points and nodes must be established explicitly at the initialization phase.

Table 1. Computing time required for all node-Gauss point correlations (search time in seconds).

Example  Nodes    Gauss points  Global serial  Regioned serial  Regioned parallel
2D-1     25,921   102,400       23             1.3              0.5
2D-2     75,625   300,304       300            3.4              1.0
2D-3     126,025  501,264       836            5.4              1.4
3D-1     9,221    64,000        7              3.7              0.9
3D-2     19,683   140,608       45             7.8              1.7
3D-3     35,937   262,144       157            15.7             3.3

A first approach is to search on the global physical domain for the Gauss points belonging to the domain of influence of each node. This approach performs a large amount of unnecessary calculations since the domains of influence are localized areas. In order to reduce the time spent for identifying the interaction between Gauss points and nodes, the search can be performed on Gauss regions.

A rectangular grid is created and we refer to each of the regions defined as a Gauss region. Each Gauss region contains a group of Gauss points (Fig. 3). Given the coordinates of a particular node, it is immediately known in which region it is located. The search per node is conducted over the neighboring Gauss regions only, instead of the global domain. Thus, regardless of the size of the problem, the search per node is restricted to a small number of Gauss regions.

In order to quickly decide whether a neighboring Gauss region will be searched or not, the centroid of each Gauss region is used as a representative point for the whole region. If the centroid of a Gauss region lies inside the domain of influence of a node, then all Gauss points of that region will be processed for possible interaction with the node; otherwise they will be ignored. However, there may be cases of Gauss points which are inside the domain of influence of a node but are ignored because the centroid of their Gauss region lies outside the domain of influence, as can be seen in Fig. 3. In order to account for such cases, the centroids are tested with regard to an extended domain of influence. The extended domain of influence is only used for the centroids, so the contribution of Gauss points is evaluated based on the actual domain of influence of the node.

The extended domain of influence should be large enough to include the centroids of regions that would be outside the actual domain of influence and small enough to avoid false positives, i.e. regions that test true but contain no influencing Gauss points. In order to accomplish this, the maximum distance between the centroid and a point on the border of the respective Gauss region is computed. The extended domain of influence is then defined by adding this distance to the initial domain of influence. Gauss regions can be formed from a cluster of Gauss cells or can be entirely unrelated to Gauss cells.
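As an illustration of this regioned search, the following C++ sketch (our own simplification with assumed data structures, not the authors' code) tests region centroids against the extended domain of influence and then tests only the Gauss points of accepted regions against the actual domain, building the node-to-Gauss-point and Gauss-point-to-node lists used later. A rectangular (box) domain of influence is assumed, as in the paper's examples.

```cpp
#include <vector>

struct Point { double x, y; };

struct GaussRegion {
    Point centroid;
    double reach;                 // max distance from the centroid to the region border
    std::vector<int> gaussIds;    // Gauss points contained in this region
};

// Rectangular domain-of-influence test of half-width d around the node.
static bool inside(const Point& p, const Point& node, double d) {
    const double dx = p.x - node.x, dy = p.y - node.y;
    return dx >= -d && dx <= d && dy >= -d && dy <= d;
}

// For one node, collect all influencing Gauss points by scanning only the
// neighbouring regions whose centroid falls inside the extended domain.
void correlateNode(int nodeId, const Point& node, double doi,
                   const std::vector<GaussRegion>& neighbourRegions,
                   const std::vector<Point>& gaussPoints,
                   std::vector<int>& gpOfNode,                    // influencing Gauss points of this node
                   std::vector<std::vector<int>>& nodesOfGp) {    // influenced nodes of each Gauss point
    for (const GaussRegion& region : neighbourRegions) {
        // Extended domain: actual doi enlarged by the centroid-to-border distance,
        // so that no region containing influencing Gauss points is skipped.
        if (!inside(region.centroid, node, doi + region.reach)) continue;
        for (int g : region.gaussIds)
            if (inside(gaussPoints[g], node, doi)) {               // actual domain of influence
                gpOfNode.push_back(g);
                nodesOfGp[g].push_back(nodeId);
            }
    }
}
```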

The time required to define correlations in three 2D and three 3D elasticity problems with varying numbers of degrees of freedom (dof) is shown in Table 1. The 2D problems correspond to square domains and the 3D to cubic domains, with rectangular domains of influence (doi) with dimensionless parameter 2.5. These domains maximize the number of correlations and consequently the computational cost for the given number of nodes. In these examples, each Gauss region is equivalent to a single Gauss cell. Thus, in the 2D examples each Gauss cell contains 16 Gauss points (4 × 4 rule) and in the 3D examples 64 Gauss points (4 × 4 × 4 rule). The examples are run on a Core i7-980X which has six physical cores (12 logical cores) at 3.33 GHz and 12 MB cache. Each node can define its correlations independently of other nodes, which is amenable to parallel computations.

Table 2. Influences per node and Gauss point for EFG and FEM.

                                    2D EFG (doi = 2.5)  2D FEM (QUAD4)  3D EFG (doi = 2.5)  3D FEM (HEXA8)
Gauss points influencing a node     100                 16              1000                64
Nodes influenced by a Gauss point   25                  4               125                 8

When each node checks all Gauss points of the domain, the time complexity is O(n·n_G), where n is the number of nodes and n_G is the number of Gauss points. As a result, the time needed to define the correlations when searching globally quickly becomes prohibitive. In the case of Gauss regions, each node needs to check a constant number of Gauss points regardless of the size of the problem, so the time complexity is O(n).

With the implementation of Gauss regions, the initialization phase of EFG methods in complex domains takes less time than FEM, since the generation of a finite element mesh can sometimes be laborious and time consuming [34]. At the end of the initialization step each node has a list of influencing Gauss points and each Gauss point has a list of influenced nodes.

3.2. Comparison to FEM for equal number of nodes and Gauss points

Table 2 shows the number of Gauss points influencing a single node and the number of nodes influenced by a single Gauss point in typical 2D and 3D problems. The numbers displayed for EFG correspond to the majority of nodes (side/corner nodes or Gauss points have a lower number of influences).

Table 3 shows the total number of correlations for the six examples considered. The significantly higher number in EFG methods is a direct consequence of the larger domain of influence, as shown in Figs. 1 and 2.

Table 3. Total number of node-Gauss point correlations in EFG and FEM.

Example  Nodes    Gauss points  EFG         FEM        Ratio
2D-1     25,921   102,400       2,534,464   409,600    6.2
2D-2     75,625   300,304       7,463,824   1,201,216  6.2
2D-3     126,025  501,264       12,475,024  2,005,056  6.2
3D-1     9,221    64,000        7,077,888   512,000    13.8
3D-2     19,683   140,608       16,003,008  1,124,864  14.2
3D-3     35,937   262,144       30,371,328  2,097,152  14.5

3.3. Computation of stiffness contribution for each Gauss point

3.3.1. Shape function derivative calculation
The shape functions in the EFG formulation span across larger domains of influence than in FEM and their evaluation is performed over a large number of correlated Gauss points and nodes. For the evaluation of the deformation matrix B, the shape functions and their derivatives are calculated with the following procedure for each Gauss point: (i) calculate the weight function coefficients w, w_{,x}, w_{,y}, w_{,z} for each node in the domain of influence of the Gauss point; (ii) calculate the moment matrix A of Eq. (8) and its derivatives A_x, A_y, A_z of the Gauss point with contributions from all influenced nodes; (iii) use the moment matrix and its derivatives along with the weight coefficients to calculate the shape function and derivative values for all influenced nodes of the Gauss point.

The moment matrix and its derivatives are functions of the polynomial p, which is a complete polynomial of order q for any material point of the domain. In the case of a linear basis (Eq. (4)) the moment matrix A and its derivatives are 3 × 3 or 4 × 4 matrices for 2D and 3D elasticity problems, respectively. The contribution of each node to the moment matrix and its derivatives is related to the product pp^T. According to Eq. (8), the moment matrix and its derivatives are given by

A = \sum_i w_i (p p^T)_i, \quad A_x = \sum_i (w_x)_i (p p^T)_i, \quad A_y = \sum_i (w_y)_i (p p^T)_i, \quad A_z = \sum_i (w_z)_i (p p^T)_i, \quad \forall i \in \text{Infl. Nodes}    (19)

Thus, the moment matrix consists of the following terms:

A = \begin{bmatrix} \sum_i w_i & \sum_i w_i x_i & \sum_i w_i y_i \\ & \sum_i w_i x_i^2 & \sum_i w_i x_i y_i \\ \text{sym} & & \sum_i w_i y_i^2 \end{bmatrix}    (20)

Fig. 2. Domain of influence of a node in (a) EFG; (b) FEM, for the same number of nodes and Gauss points.

while similar expressions define its derivatives A_x, A_y, A_z. The shape function value Φ_i(x) associated with node i at point x is expressed according to Eq. (11), while the derivatives Φ_{i,x}, Φ_{i,y}, Φ_{i,z} are given by

\Phi_{i,x} = w_{i,x} p_G^T (A^{-1} p_i) + w_i \{0\; 1\; 0\; 0\} (A^{-1} p_i) + (-w_i) p_G^T A^{-1} A_x (A^{-1} p_i)
\Phi_{i,y} = w_{i,y} p_G^T (A^{-1} p_i) + w_i \{0\; 0\; 1\; 0\} (A^{-1} p_i) + (-w_i) p_G^T A^{-1} A_y (A^{-1} p_i)
\Phi_{i,z} = w_{i,z} p_G^T (A^{-1} p_i) + w_i \{0\; 0\; 0\; 1\} (A^{-1} p_i) + (-w_i) p_G^T A^{-1} A_z (A^{-1} p_i)    (21)

where the polynomials p_G and p_i are evaluated at the Gauss point G and the influenced node i, respectively.

In Eqs. (11) and (21) the following operations are repeated for all influenced nodes of a Gauss point:

p_A^T = p_G^T A^{-1}, \quad p_{Ax}^T = p_A^T A_x A^{-1}, \quad p_{Ay}^T = p_A^T A_y A^{-1}, \quad p_{Az}^T = p_A^T A_z A^{-1}    (22)

These matrix–vector multiplications can be reused in several calculations for every influenced node of a particular Gauss point. For a large size of the moment matrix A, the direct computation of its inverse is burdensome, so an LU factorization is typically performed [2]. In this implementation, an explicit algorithm is used for the inversion of the moment matrix in order to minimize the calculations.

For each influenced node i, the following three groups of calculations are then performed:

\Phi_i = w_i p_A^T p_i
\Phi^{(1)}_{i,x} = w_{i,x} p_A^T p_i, \quad \Phi^{(2)}_{i,x} = w_i \{0\; 1\; 0\; 0\} (A^{-1} p_i), \quad \Phi^{(3)}_{i,x} = -w_i p_{Ax}^T p_i
\Phi^{(1)}_{i,y} = w_{i,y} p_A^T p_i, \quad \Phi^{(2)}_{i,y} = w_i \{0\; 0\; 1\; 0\} (A^{-1} p_i), \quad \Phi^{(3)}_{i,y} = -w_i p_{Ay}^T p_i
\Phi^{(1)}_{i,z} = w_{i,z} p_A^T p_i, \quad \Phi^{(2)}_{i,z} = w_i \{0\; 0\; 0\; 1\} (A^{-1} p_i), \quad \Phi^{(3)}_{i,z} = -w_i p_{Az}^T p_i    (23)
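To make the per-Gauss-point flow concrete, the sketch below (our own illustration with assumed names, restricted to the 2D linear basis p = [1 x y] so that A is 3 × 3; not the authors' code) assembles A and A_x per Eq. (19), inverts A with an explicit formula, and forms the reusable vectors p_A and p_{Ax} of Eq. (22). The y-derivative and the groups of Eq. (23) follow the same pattern.

```cpp
#include <array>
#include <vector>

using Vec3 = std::array<double, 3>;
using Mat3 = std::array<Vec3, 3>;

struct InflNode { double x, y, w, wx, wy; };   // coordinates, weight and weight derivatives at this Gauss point

static Vec3 basis(double x, double y) { return {1.0, x, y}; }             // p(x) = [1 x y]

static void addOuter(Mat3& A, const Vec3& p, double c) {                   // A += c * p p^T, Eq. (19)
    for (int r = 0; r < 3; ++r)
        for (int s = 0; s < 3; ++s) A[r][s] += c * p[r] * p[s];
}

static Mat3 inverse3(const Mat3& A) {                                      // explicit 3x3 inverse (adjugate / det)
    Mat3 inv{};
    const double det =
        A[0][0] * (A[1][1] * A[2][2] - A[1][2] * A[2][1]) -
        A[0][1] * (A[1][0] * A[2][2] - A[1][2] * A[2][0]) +
        A[0][2] * (A[1][0] * A[2][1] - A[1][1] * A[2][0]);
    for (int r = 0; r < 3; ++r)
        for (int s = 0; s < 3; ++s) {
            const int r1 = (r + 1) % 3, r2 = (r + 2) % 3, s1 = (s + 1) % 3, s2 = (s + 2) % 3;
            inv[s][r] = (A[r1][s1] * A[r2][s2] - A[r1][s2] * A[r2][s1]) / det;  // transposed cofactor
        }
    return inv;
}

static Vec3 mul(const Mat3& M, const Vec3& v) {
    return { M[0][0]*v[0] + M[0][1]*v[1] + M[0][2]*v[2],
             M[1][0]*v[0] + M[1][1]*v[1] + M[1][2]*v[2],
             M[2][0]*v[0] + M[2][1]*v[1] + M[2][2]*v[2] };
}

void gaussPointSetup(double gx, double gy, const std::vector<InflNode>& nodes,
                     Vec3& pA, Vec3& pAx) {
    Mat3 A{}, Ax{};
    for (const InflNode& n : nodes) {                  // Eq. (19): contributions of all influenced nodes
        const Vec3 p = basis(n.x, n.y);
        addOuter(A,  p, n.w);
        addOuter(Ax, p, n.wx);
    }
    const Mat3 Ainv = inverse3(A);
    const Vec3 pG = basis(gx, gy);
    pA  = mul(Ainv, pG);                               // p_A^T  = p_G^T A^{-1}      (A symmetric)
    pAx = mul(Ainv, mul(Ax, pA));                      // p_Ax^T = p_A^T A_x A^{-1}  (A, A_x symmetric)
}
```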

3.3.2. B^T E B calculation
A fast computation of the matrix product

Q_G = B_G^T E B_G    (24)

of Eq. (18) is important because it is repeated at each integration point. This may not be so critical in FEM compared to the total simulation time, but it is very important in EFG meshless methods where the number of Gauss points and the number of influenced nodes per Gauss point are both significantly greater.

Fig. 3. Identifying the influencing Gauss points of a node.

The computations of Eq. (24) can be broken into smaller operations for each combination of influenced nodes i, j belonging to the domain of influence of the Gauss point:

Q_{ij} = B_i^T E B_j = Q_{ji}^T    (25)

Once a submatrix Q_{ij} is calculated, it is added to the corresponding positions of K (Eq. (18)). The computation of Q_{ij} together with the associated indexing to access the entries of K dominate the total effort for the formulation of the global stiffness matrix [35].

The Q_{ij} for an isotropic material in 3D elasticity takes the form:

Q_{ij}\,(3{\times}3) = B_i^T\,(3{\times}6)\; E\,(6{\times}6)\; B_j\,(6{\times}3)
= \begin{bmatrix} \Phi_{i,x} & 0 & 0 & \Phi_{i,y} & 0 & \Phi_{i,z} \\ 0 & \Phi_{i,y} & 0 & \Phi_{i,x} & \Phi_{i,z} & 0 \\ 0 & 0 & \Phi_{i,z} & 0 & \Phi_{i,y} & \Phi_{i,x} \end{bmatrix}
\begin{bmatrix} M & \lambda & \lambda & & & \\ \lambda & M & \lambda & & & \\ \lambda & \lambda & M & & & \\ & & & \mu & & \\ & & & & \mu & \\ & & & & & \mu \end{bmatrix}
\begin{bmatrix} \Phi_{j,x} & 0 & 0 \\ 0 & \Phi_{j,y} & 0 \\ 0 & 0 & \Phi_{j,z} \\ \Phi_{j,y} & \Phi_{j,x} & 0 \\ 0 & \Phi_{j,z} & \Phi_{j,y} \\ \Phi_{j,z} & 0 & \Phi_{j,x} \end{bmatrix}

Q_{ij} = \begin{bmatrix}
\Phi_{i,x}\Phi_{j,x} M + \Phi_{i,y}\Phi_{j,y}\mu + \Phi_{i,z}\Phi_{j,z}\mu &
\Phi_{i,x}\Phi_{j,y}\lambda + \Phi_{i,y}\Phi_{j,x}\mu &
\Phi_{i,x}\Phi_{j,z}\lambda + \Phi_{i,z}\Phi_{j,x}\mu \\
\Phi_{i,y}\Phi_{j,x}\lambda + \Phi_{i,x}\Phi_{j,y}\mu &
\Phi_{i,y}\Phi_{j,y} M + \Phi_{i,x}\Phi_{j,x}\mu + \Phi_{i,z}\Phi_{j,z}\mu &
\Phi_{i,y}\Phi_{j,z}\lambda + \Phi_{i,z}\Phi_{j,y}\mu \\
\Phi_{i,z}\Phi_{j,x}\lambda + \Phi_{i,x}\Phi_{j,z}\mu &
\Phi_{i,z}\Phi_{j,y}\lambda + \Phi_{i,y}\Phi_{j,z}\mu &
\Phi_{i,z}\Phi_{j,z} M + \Phi_{i,y}\Phi_{j,y}\mu + \Phi_{i,x}\Phi_{j,x}\mu
\end{bmatrix}    (26)

E and B_i/B_j are never formed. Instead, three values for E, namely the two Lamé parameters λ, μ and the P-wave modulus M = 2μ + λ, and three values for B_i, specifically Φ_{i,x}, Φ_{i,y}, Φ_{i,z}, are stored. Since some of the multiplications are repeated, the calculations in Eq. (26) can be efficiently performed with 30 multiplications and 12 additions.
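As an illustration of how Eq. (26) can be evaluated without forming E or B, the following sketch (assumed names, not the authors' code) accumulates one Gauss point contribution into a 3 × 3 block using only the six stored shape-function derivatives and the three material constants, reusing the repeated products on the diagonal.

```cpp
// Adds one Gauss point contribution B_i^T E B_j (Eq. (26), isotropic 3D
// elasticity) to the 3x3 block Q for the node pair (i, j).
struct Deriv { double x, y, z; };            // Φ_i,x , Φ_i,y , Φ_i,z at one Gauss point
struct Material { double lambda, mu, M; };   // Lamé constants and P-wave modulus M = 2*mu + lambda

void addQij(const Deriv& di, const Deriv& dj, const Material& m, double Q[3][3]) {
    // Products reused across several diagonal entries of Eq. (26).
    const double xx = di.x * dj.x, yy = di.y * dj.y, zz = di.z * dj.z;
    Q[0][0] += m.M * xx + m.mu * (yy + zz);
    Q[1][1] += m.M * yy + m.mu * (xx + zz);
    Q[2][2] += m.M * zz + m.mu * (xx + yy);
    Q[0][1] += m.lambda * di.x * dj.y + m.mu * di.y * dj.x;
    Q[1][0] += m.lambda * di.y * dj.x + m.mu * di.x * dj.y;
    Q[0][2] += m.lambda * di.x * dj.z + m.mu * di.z * dj.x;
    Q[2][0] += m.lambda * di.z * dj.x + m.mu * di.x * dj.z;
    Q[1][2] += m.lambda * di.y * dj.z + m.mu * di.z * dj.y;
    Q[2][1] += m.lambda * di.z * dj.y + m.mu * di.y * dj.z;
}
```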

3.3.3. Summation of Gauss point contributions
Contrary to FEM, where the stiffness matrices are built on the element level by integrating over the element Gauss points before assembling the global stiffness matrix, the absence of elements in EFG meshless methods necessitates each Gauss point to directly append its contribution to the global stiffness matrix. Since there are considerably more Gauss points in EFG and each Gauss point influences many more nodes, indexing time during the creation of the stiffness matrix is an important factor in EFG simulations. Thus, an efficient implementation for building the stiffness matrix in sparse format is needed in the Gauss point-wise approach. The procedure requires updating previous values of the matrix, so a sparse matrix type that allows lookups is needed. Updates happen a large number of times for every non-zero element of the matrix, so they consume a large portion of the total effort. A sparse matrix format suitable for this method is the dictionary of keys (DOK) [36] and our implementation is based on hash tables [37].

Table 4. Computing time for the formulation of the stiffness matrix in the CPU implementations of the Gauss point-wise approach.

Example  dof      Gauss points  Conventional GP (s)  Proposed GP (s)  Ratio
2D-1     51,842   102,400       107                  12               9
2D-2     152,250  300,304       313                  34               9
2D-3     252,050  501,264       502                  53               9
3D-1     27,783   64,000        2,374                241              10
3D-2     59,049   140,608       6,328                616              10
3D-3     107,811  262,144       13,302               1,165            11

Table 5. Comparison of the proposed Gauss point-wise method for the formulation of the stiffness matrix when using sparse and skyline format.

Example  dof      Gauss points  Sparse (s)  Skyline (s)  Ratio
2D-1     51,842   102,400       12          7            1.6
2D-2     152,250  300,304       34          20           1.7
2D-3     252,050  501,264       53          31           1.7
3D-1     27,783   64,000        241         68           3.5
3D-2     59,049   140,608       616         174          3.5
3D-3     107,811  262,144       1,165       329          3.5

Table 6. Number of stored stiffness elements when using skyline and sparse format.

Example  dof      Gauss points  Skyline        Sparse      Ratio
2D-1     51,842   102,400       66,221,715     4,110,003   16
2D-2     152,250  300,304       331,150,875    12,129,675  27
2D-3     252,050  501,264       713,161,275    20,287,275  35
3D-1     27,783   64,000        136,041,444    21,734,532  6
3D-2     59,049   140,608       486,852,444    49,932,576  10
3D-3     107,811  262,144       1,343,011,428  95,696,604  14
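The DOK format mentioned in Section 3.3.3 can be sketched as follows; this is our own minimal illustration using a hash map keyed by packed row/column indices, not the authors' implementation, and the class and method names are assumptions.

```cpp
// Dictionary-of-keys (DOK) sparse accumulator: every Gauss point contribution
// is added to K(row, col) through a hash-table lookup, which is why lookup and
// indexing cost dominate the Gauss point-wise assembly.
#include <cstdint>
#include <unordered_map>

class DokMatrix {
public:
    void add(std::uint32_t row, std::uint32_t col, double value) {
        // Pack (row, col) into one 64-bit key; insert or update in place.
        data_[(static_cast<std::uint64_t>(row) << 32) | col] += value;
    }
    double get(std::uint32_t row, std::uint32_t col) const {
        auto it = data_.find((static_cast<std::uint64_t>(row) << 32) | col);
        return it == data_.end() ? 0.0 : it->second;
    }
    std::size_t nonZeros() const { return data_.size(); }
private:
    std::unordered_map<std::uint64_t, double> data_;
};
```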

3.4. Performance of the Gauss point-wise approach

The performance of the Gauss point-wise approach in the CPU is shown in Table 4. The proposed Gauss point-wise (GP) approach is compared with the "conventional" one without the improvements described in this section.

The Gauss point-wise approach is heavily influenced by indexing time, especially in the 3D examples. A matrix format with better indexing properties would benefit the Gauss point-wise approach. The quick identification of interacting node pairs, as described in Section 4.1, allows the fast prediction of the non-zero coefficients of the stiffness matrix, as demonstrated in Table 9. This leads to the calculation of the indexes of the skyline format of the matrix. The skyline format exhibits faster indexing time but increased memory requirements compared to a sparse matrix format which contains only non-zero elements.

Table 5 compares the proposed Gauss point-wise approach for building the stiffness matrix when using sparse and skyline format. The difference highlights the importance of indexing time in EFG methods, where access to the stiffness matrix is performed a large number of times during its formulation.

However, the skyline format stores a higher number of elements, as shown in Table 6, and thus requires a larger amount of memory.

Note that the skyline format is dependent on the numbering of nodes in the domain, and an ideal numbering is used in the presented examples. This dependency may lead to excessive amounts of zeros being stored and further exacerbate the memory requirements of the skyline format, whereas the sparse format always has the same number of elements regardless of numbering.

4. Node pair-wise formulation of the stiffness matrix

An alternative way to perform the computation of the global stiffness matrix is the proposed node pair-wise approach. The computation of the global stiffness coefficient K_ij is performed for all interacting nodes i, j and is formed from contributions of the shared Gauss points of their domains of influence. Fig. 4 depicts two interacting nodes as a result of having common Gauss points in the intersection of their domains of influence, and one node that is not interacting with the other two.

4.1. Interacting node pairs and their shared Gauss points

The interacting node pairs approach requires an extra initialization step for identifying the interacting node pairs and their shared Gauss points. The identification of the Gauss points associated with each interacting node pair is beneficial since it accelerates the computation of the stiffness matrix and all node pair related calculations. More importantly, it also enables an efficient parallel implementation and particularly the utilization of massively parallel processing, including GPUs.

In FEM the nodes interact through neighboring elements only and thus the interacting node pairs can be easily defined from the element-node connectivity (Fig. 5b). In EFG meshless methods, however, a node pair contributes non-zero entries to the stiffness matrix, and is therefore active, if there is at least one Gauss point whose domain of influence includes both nodes (Fig. 5a). The naive approach is to look for all possible combinations of node pairs, find their shared Gauss points and keep those node pairs that are interacting together with the corresponding shared Gauss points. The shared Gauss points are located in the intersection of the domains of influence of two interacting nodes (Fig. 4). This approach, however, takes a prohibitive amount of time because it needs to calculate the shared Gauss points for all possible n(n + 1)/2 combinations of node pairs, where n is the number of nodes. Table 7 shows all possible combinations of node pairs and those that are interacting, as well as the associated computing time for a naive identification.

Thus, identification of the shared Gauss points is expensive, unless the unnecessary searches for Gauss points of non-interacting nodes are avoided. This is accomplished by first identifying the interacting nodes. As in the naive approach, all n(n + 1)/2 combinations can be checked and, if there is at least one Gauss point in common, the node pair is marked as interacting. However, this is still an O(n^2) process, so it does not scale well and quickly grows into an unacceptable amount of time.


Fig. 4. Intersection of domains of influence.

Fig. 5. Interacting nodes: (a) EFG; (b) FEM.


By taking advantage of the previous initialization step described in Section 3.1, the identification of interacting node pairs can be accelerated considerably. Each node has a list of influencing Gauss points and each Gauss point has a list of influenced nodes. Therefore, each node looks for interacting nodes in the lists of influenced nodes of its Gauss points. Fig. 6 shows node A, which is influenced by Gauss points i, j, k. Each of these Gauss points influences various nodes, including node A. Those nodes are guaranteed to interact with A since there is at least one Gauss point in common between them. In Fig. 6 the interactions are: AA, AB, AC, AD, AE.

Fig. 6. Identifying interacting node pairs for node A. A, B, C, D, E represent nodes whereas i, j, k represent Gauss points.

Table 7. Computing time required for a naive identification of interacting nodes and their shared Gauss points.

Example  Nodes    All combinations  Interacting  Time (s)
2D-1     25,921   335,962,081       1,033,981    771
2D-2     75,625   2,859,608,125     3,051,325    6,908
2D-3     126,025  7,941,213,325     5,103,325    23,380
3D-1     9,221    42,518,031        2,418,035    608
3D-2     19,683   193,720,086       5,554,625    3,021
3D-3     35,937   645,751,953       10,644,935   16,290

Table 8. Computing time for the identification of interacting nodes.

Example  Serial (s)  Parallel (s)
2D-1     1.5         0.2
2D-2     4.5         0.7
2D-3     9.8         1.6
3D-1     20.1        2.8
3D-2     42.6        5.6
3D-3     85.6        11.2

Fig. 7. Identifying interacting node pairs by considering Gauss points near the border of the domain of influence.

Table 9. Computing time for the identification of interacting nodes by only inspecting Gauss points near the border.

Example  Serial (s)  Parallel (s)
2D-1     0.2         <0.1
2D-2     0.5         <0.1
2D-3     0.8         <0.1
3D-1     0.5         <0.1
3D-2     0.9         0.2
3D-3     1.6         0.3

Table 10. Computing time to identify the shared Gauss points of an interacting node pair.

Example  Serial (s)  Parallel (s)
2D-1     2.1         0.4
2D-2     6.1         1.2
2D-3     8.8         1.5
3D-1     46.6        7.4
3D-2     135.6       18.8
3D-3     315.7       45.8

The corresponding computing times for this process are shown in Table 8. As previously, the examples are run on a Core i7-980X which has six physical cores (12 logical cores) at 3.33 GHz. Each node can search for interacting nodes independently of other nodes, so parallelism offers very good acceleration.

With this approach the identification of interacting nodes is improved, but it can be further accelerated by noting that an interacting node may be in the lists of several Gauss points of A, as is node B in Fig. 6. Since the number of influencing Gauss points of a node is large (1000 for the majority of nodes in our 3D examples), there will be a large number of duplicates in the process, which are discarded. To reduce the number of duplicates, we only inspect those Gauss points that are near the border of the domain of influence of the node (Fig. 7). These Gauss points define the interactions with the nodes further away while still covering all closer nodes. This considerably reduces the time, as can be seen in Table 9.
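The list-based pair identification just described can be sketched as follows; this is our own illustration with assumed containers, and the border-point filtering of Fig. 7 and the per-node parallelization are omitted for brevity.

```cpp
#include <algorithm>
#include <vector>

// For every node, gather the nodes influenced by its own influencing Gauss
// points; duplicates are removed so that each interacting pair is kept once.
// gpOfNode[a]  : Gauss points influencing node a
// nodesOfGp[g] : nodes influenced by Gauss point g
std::vector<std::vector<int>> findInteractingPairs(
        const std::vector<std::vector<int>>& gpOfNode,
        const std::vector<std::vector<int>>& nodesOfGp) {
    const int n = static_cast<int>(gpOfNode.size());
    std::vector<std::vector<int>> partners(n);
    for (int a = 0; a < n; ++a) {                       // independent per node, hence parallelizable
        std::vector<int>& list = partners[a];
        for (int g : gpOfNode[a])
            for (int b : nodesOfGp[g])
                if (b >= a) list.push_back(b);          // keep each unordered pair (a, b) once, incl. a == b
        std::sort(list.begin(), list.end());
        list.erase(std::unique(list.begin(), list.end()), list.end());  // discard duplicates
    }
    return partners;
}
```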

Following the identification of the interacting node pairs, the determination of shared Gauss points is performed the least possible number of times, i.e., only once for every interacting node pair, in contrast to the n(n + 1)/2 times of the naive approach. This leads to a vast reduction of the required computing time compared to the naive approach (Table 7), as can be seen in Table 10.

For further improvement, regioning (Fig. 8) can be utilized and the results are shown in Table 11. The Gauss regions may be the same as those in the initialization phase (Section 3.1) or can be different. Shared Gauss points are only searched within regions shared by both nodes of the pair. In both intersection identifications, with and without regions, each node pair can identify its shared Gauss points independently of other node pairs, so parallelism offers very good accelerations, as shown in Tables 10 and 11.

In the 2D examples considered, each region has 16 Gauss points and the results are slightly worse with regioning, because skipping 16 Gauss points per skipped region was not enough to compensate for the added overhead. A higher number of Gauss points per region eventually makes regioning worthwhile in the 2D examples as well. In the 3D examples, the extra dimension and the fact that each region has 64 Gauss points make regioning more important. Regioning benefits become greater as the number of Gauss points per region increases.

Fig. 8. Region-wise search for interacting nodes. Only the shaded regions are inspected for shared Gauss points.

Table 11. Computing time to identify the shared Gauss points of an interacting node pair with regioning.

Example  Serial (s)  Parallel (s)
2D-1     2.4         0.6
2D-2     6.8         1.6
2D-3     11.0        2.8
3D-1     24.9        4.8
3D-2     57.9        10.7
3D-3     118.0       22.4

Table 12. Number of interacting node pairs in EFG and FEM.

Example  Nodes    EFG         FEM      Ratio
2D-1     25,921   1,033,981   128,641  8.0
2D-2     75,625   3,051,325   376,477  8.1
2D-3     126,025  5,103,325   627,997  8.1
3D-1     9,221    2,418,035   118,121  20.5
3D-2     19,683   5,554,625   256,361  21.7
3D-3     35,937   10,644,935  474,305  22.4

4.2. Comparison to FEM for equal number of nodes and Gauss points

Table 12 shows the number of interacting node pairs in EFG and FEM for an equal number of nodes and Gauss points. Interactions in EFG extend over much larger regions than in FEM, as shown in Fig. 5. Furthermore, the numbers are indicative of the total non-zeros of the corresponding stiffness matrices. The total non-zeros can be calculated by

NZ = 4\, NP - n \;\; \text{(2D)}, \qquad NZ = 9\, NP - 3 n \;\; \text{(3D)}    (27)

where NP is the number of interacting node pairs and n is the number of nodes.
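As a consistency check (ours, not stated in the paper), applying Eq. (27) to example 2D-1 of Table 12 reproduces the number of sparse-stored stiffness entries reported in Table 6:

NZ = 4 \times 1{,}033{,}981 - 25{,}921 = 4{,}110{,}003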


Table 13. Total Gauss point contributions for EFG and FEM.

Example  Gauss points  EFG            FEM        Ratio
2D-1     102,400       32,725,544     1,024,000  32.0
2D-2     300,304       96,647,624     3,003,040  32.2
2D-3     501,264       161,681,224    5,012,640  32.3
3D-1     64,000        408,317,728    2,304,000  177.2
3D-2     140,608       942,981,088    5,061,888  186.3
3D-3     262,144       1,813,006,048  9,437,184  192.1


Each interacting node pair corresponds to a non-zero submatrix of the stiffness matrix, whose size is equal to the number of dof of each node. To calculate the corresponding coefficients, contributions from several Gauss points are summed to form the final submatrix. The total number of Gauss point contributions for the whole problem is shown in Table 13.

From the above tables it is clear that the computational effort required for EFG methods is much higher than in FEM.

4.3. Computation of global stiffness coefficients for each interacting node pair

The computation of the stiffness elements for each interacting node pair is split into two phases. In the first phase, the shape function derivatives for each influenced node of every Gauss point are calculated as described in Section 3.3.1 for the Gauss point-wise method. Then, instead of continuing with the calculation of the stiffness matrix coefficients corresponding to a particular Gauss point, the shape function derivatives are stored for the calculation of the Q_ij matrices in the next phase. The required storage of all shape function derivatives is small, so storing them temporarily is not an issue.

In the second phase, the stiffness matrix coefficients of each interacting node pair are computed. For each interacting node pair ij, the matrix Q_ij of Eq. (25) is calculated over all shared Gauss points and summed to form the final values of the corresponding coefficients of the global matrix:

K_{ij} = \sum_{G} Q_{ij} = \sum_{G} B_i^T E B_j    (28)

The calculation of the Q_ij matrices is performed as described in Section 3.3.2. The matrices B_i, B_j contain the shape function derivative values calculated in the first phase, and each pre-calculated shape function derivative is used a large number of times.

Both phases are amenable to parallelization, the first with respect to Gauss points and the second with respect to interacting node pairs, and involve no race conditions or need for synchronization, which makes the interacting node pairs approach an ideal method for massively parallel systems.

4.4. Sparse matrix format for the interacting node pairs approach

The final values of each K_ij submatrix are calculated and written once in the corresponding positions of the global stiffness matrix, instead of being gradually updated as in the Gauss point-wise approach. Apart from the reduced number of accesses to the matrix, this method does not require lookups, which allows the use of a simpler and more efficient sparse matrix format, like the coordinate list (COO) format [38]. A simple implementation with three arrays, one for row indexes, one for column indexes and one for the value of each non-zero matrix coefficient, is sufficient and is easily applied both in the CPU and the GPU, while also requiring less memory than a format that allows lookups. Note that the node pair-wise method has no indexing time due to its nature, in contrast to the Gauss point-wise approach described in Section 3.3.3. This is why the computing times shown for the interacting node pairs approach in Table 14 are lower than those of the Gauss point-wise approach in the serial CPU implementations.
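A minimal sketch of this write-once COO assembly follows (assumed names; a 2D case with two dof per node so that each interacting pair yields a 2 × 2 block; not the authors' code).

```cpp
#include <vector>

struct CooMatrix {                       // coordinate-list sparse format: three parallel arrays
    std::vector<int> rows, cols;
    std::vector<double> values;
    void append(int r, int c, double v) { rows.push_back(r); cols.push_back(c); values.push_back(v); }
};

// For each interacting node pair (a, b), the 2x2 block kab[4] has already been
// summed over the shared Gauss points; it is written exactly once, so no
// lookups or in-place updates are needed.
void appendPairBlock(CooMatrix& K, int a, int b, const double kab[4]) {
    const int ra = 2 * a, rb = 2 * b;    // first dof index of each node
    K.append(ra,     rb,     kab[0]);
    K.append(ra,     rb + 1, kab[1]);
    K.append(ra + 1, rb,     kab[2]);
    K.append(ra + 1, rb + 1, kab[3]);
}
```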

4.5. Parallelization features of the interacting node pairs approach

The interacting node pairs approach has certain advantages compared to the Gauss point-wise approach. The most important one is related to its amenability to parallelism, in contrast to the Gauss point-wise approach. The Gauss point-wise approach can be visualized in Fig. 9, where the contributions of three Gauss points to the stiffness matrix are schematically depicted.

Table 14. Computing time for the formulation of the stiffness matrix in the serial CPU implementations of the Gauss point-wise (GP) and node pair-wise (NP) approaches.

Example  dof      Gauss points  Conventional GP (s)  Proposed GP (s)  Proposed NP (s)
2D-1     51,842   102,400       107                  12               11
2D-2     152,250  300,304       313                  34               28
2D-3     252,050  501,264       502                  53               47
3D-1     27,783   64,000        2,374                241              134
3D-2     59,049   140,608       6,328                616              328
3D-3     107,811  262,144       13,302               1,165            645

Fig. 9. Schematic representation of the contribution of 3 Gauss points to the global stiffness matrix.
Fig. 10. Scatter parallelism required for the Gauss point-wise approach.
Fig. 11. Gather parallelism implemented in the interacting node pairs approach.

Since in EFG methods each Gauss point affects a large number of nodes, each K_ij submatrix is formed by a large number of stiffness contributions. Parallelizing the Gauss point-wise approach involves scatter parallelism, which is schematically shown in Fig. 10 for two Gauss points C and D. Each part of the sum can be calculated in parallel, but there are conflicting updates to the same element of the stiffness matrix. These race conditions can be avoided with proper synchronization, but in massively parallel systems like the GPU, where thousands of threads may be working concurrently, this is very detrimental to performance because all updates are serialized with atomic operations [39].

In the interacting node pairs approach, instead of constantly updating the matrix, the final values for the submatrix of each interacting node pair are calculated and then appended to the matrix. For the calculation of a submatrix, all contributions of the Gauss points belonging to the intersection of the domains of influence of two interacting nodes should be summed together. Thus, the interacting node pairs approach utilizes gather parallelism, as shown schematically in Fig. 11.

In a parallel implementation, each working unit, or thread, prepares a submatrix K_ij related to a specific interacting node pair ij. It gathers all contributions from the Gauss points and writes to a specific memory location accessed by no other thread. Thus, this method requires no synchronization or atomic operations. An important benefit of this approach concerns the indexing cost of the stiffness matrix elements: in the Gauss point-wise method each stiffness matrix element is updated a large number of times, while in the proposed interacting node pair approach the final value is calculated and written only once.

5. GPU programming

Graphics processing units (GPUs) are parallel devices of the SIMD (single instruction, multiple data) classification, which describes devices with multiple processing elements that perform the same operation on multiple data simultaneously and exploit data-level parallelism. Programming in openCL or CUDA is easier than legacy general purpose computing on GPUs (GPGPU), since it only involves learning a few extensions to C and thus requires no graphics-specific knowledge. In the openCL/CUDA context, the CPU is also referred to as the host and the GPU is also referred to as the device. The general processing flow of GPU programming is depicted in Fig. 12. GPUs have a large number of streaming processors (SPs), which can collectively offer significantly more gigaflops than current high-end CPUs.

5.1. GPU threads

The GPU applies the same functions on a large number of data. These data-parallel functions are called kernels. Kernels generate a large number of threads in order to exploit data parallelism, hence the single instruction multiple thread (SIMT) paradigm. A thread is the smallest unit of processing that can be scheduled by an operating system. Threads in GPUs take very few clock cycles to generate and schedule due to the GPU's underlying hardware support, unlike CPUs where thousands of clock cycles are required. All threads generated by a kernel define a grid and are organized in groups which are commonly referenced as thread blocks [in CUDA] or thread groups [in openCL]. A grid consists of a number of blocks (all equal in size), and each block consists of a number of threads (Fig. 13).

There is another type of thread grouping called warps, which are the units of thread scheduling in the GPU. The number of threads in a warp is specific to the particular hardware implementation; it depends on how many threads the available hardware can process at the same time. The purpose of warps is to ensure high hardware utilization. For example, if a warp initiates a long-latency operation and is waiting for results in order to continue, it is put on hold and another warp is selected for execution in order to avoid having idle processors while waiting for the operation to complete. When the long-latency operation completes, the original warp will eventually resume execution. With a sufficient number of warps, the processors are likely to have a continuous workload in spite of the long-latency operations. It is recommended that the number of threads per block should be chosen as a multiple of the warp size [15].

The number of threads in each block is subject to refinement. It should be a power of 2 and, in contemporary hardware, at most 1024. The warp size of the cards used in the present study is 32; hence, the number should ideally be 32 or higher.

5.2. GPU memory

GPGPU devices have a variety of different memories that can be utilized by programmers in order to achieve high performance. Fig. 14 shows a simplified representation of the different memories. The global memory is the memory responsible for interaction with the host/CPU. The data to be processed by the device/GPU is first transferred from the host memory to the device global memory. Also, output data from the device needs to be placed here before being passed over to the host. Global memory is large in size and off-chip. Constant memory also provides interaction with the host, but the device is only allowed to read from it and not write to it. It is small, but provides fast access for data needed by all threads.

Fig. 12. GPU processing flow paradigm: (1) data transfer to GPU memory, (2) CPU instructions to GPU, (3) GPU parallel processing, (4) result transfer to main memory.
Fig. 13. Thread organization.
Fig. 14. Visual representation of GPU memory model and scope.

There are also other types of memories which cannot be accessed by the host. Data in these memories can be accessed in a highly efficient manner. The memories differ depending on which threads have access to them. Registers [CUDA] or private memories [openCL] are thread-bound, meaning that each thread can only access its own registers. Registers are typically used for holding variables that need to be accessed frequently but that do not need to be shared with other threads. Shared memories [CUDA] or local memories [openCL] are allocated to thread blocks/groups instead of single threads, which allows all threads in a block to access variables in the shared memory locations allocated specifically for that block. Shared memories are almost as fast as registers while also allowing cooperation between threads of the same block.

5.3. Reductions in the GPU

In several parts of the GPU implementation, reductions need to be performed in order to calculate a sum. On a sequential processor, the summation operation would be implemented by writing a simple loop with a single accumulator variable to construct the sum of all elements in sequence. On a parallel machine, using a single accumulator variable would create a global serialization point and lead to very poor performance. In order to overcome this problem, a parallel reduction strategy is implemented where each parallel thread sums a fixed-length sub-sequence of the input. Then, these partial sums are gathered by summing pairs of partial sums in parallel. Each step of this pair-wise summation divides the number of partial sums by half and ultimately produces the final sum after log2(N) steps, as shown in Fig. 15.

In order to calculate the sum of several vectors into a single vector, a similar process is performed but each thread sums two vectors instead of two values in every step.
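A minimal CUDA sketch of this tree-like reduction within one block is given below, assuming the block size is a power of 2. For brevity each thread loads a single value rather than first summing a sub-sequence, and a second pass (or a second kernel launch) would combine the per-block results.

// Pair-wise (tree) summation within a block, as in Fig. 15.
// Launch with blockDim.x * sizeof(float) bytes of dynamic shared memory.
__global__ void blockReduceSum(const float* g_in, float* g_blockSums, int n)
{
    extern __shared__ float s_data[];                    // one slot per thread

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    s_data[threadIdx.x] = (gid < n) ? g_in[gid] : 0.0f;  // load (0 pads the tail)
    __syncthreads();

    // Each step halves the number of partial sums: log2(blockDim.x) steps in total.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            s_data[threadIdx.x] += s_data[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)                                // one partial sum per block
        g_blockSums[blockIdx.x] = s_data[0];
}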

6. GPU implementation of the node pair-wise approach

Contrary to the Gauss point-wise approach, the interacting node pair approach for the formation of the stiffness matrix in EFG simulations is well suited for the GPU. Each one of the two phases described in Section 13 is calculated with its own kernel and exhibits


Fig. 15. Parallel summation using a tree-like structure.

Fig. 17. Phase 1 – concurrency level for the calculation of shape function values in the GPU.


different levels of parallelism. The implementations in this work are written in OpenCL for greater portability.

6.1. Phase 1 – calculation of shape function and derivative values

In the first phase the shape function and its derivatives are calculated for all influenced nodes of every Gauss point. The calculations in this phase are described in detail in Section 9. There are two levels of parallelism: the major one over the Gauss points and the minor one over the influenced nodes. A thread block/group is assigned to each Gauss point and each thread handles one influenced node at a time. This is schematically shown in Fig. 16, where it is assumed that each thread handles a single influencing node (this is for demonstration purposes only and not mandatory). Since the number of threads should be a power of 2 and the number of influenced nodes can be anything, some threads will not produce useful results.

For the most part of this phase all threads of a block are busy. The exceptions are the inversion of the moment matrix A and the reductions which are used to sum the contributions of all influenced nodes in the moment matrix A and the vectors pA, pAx, pAy, pAz. The process is shown schematically in Fig. 17.

Since each Gauss point has its own thread block, all values related to a particular Gauss point are stored in the shared/local memory. This includes the moment matrix and all vectors (pA, pAx, pAy, pAz). The interaction with the global memory is performed only at the beginning of the process, where each thread reads the coordinates of the corresponding Gauss point and influenced node and stores them in registers, and at the end of the process, where the resulting shape function values are written to the global memory. Constant memory is used for storing the ranges of the influence domains. As a result, all calculations are performed with data found in fast memories, which is very beneficial from a performance point of view.

Fig. 16. Thread organization in phase 1.
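To make the mapping concrete, the following structural sketch (in CUDA for brevity; the actual kernels are written in OpenCL) assigns one block per Gauss point and strides the threads over its influenced nodes. All array names are hypothetical, and the body is reduced to a placeholder: the real kernel accumulates the moment matrix A and the vectors pA, pAx, pAy, pAz in shared memory with block reductions, inverts A, and only then writes the shape function values and derivatives to global memory.

// Phase 1 (structural sketch): one thread block per Gauss point,
// threads stride over the influenced nodes of that point.
__global__ void phase1ShapeFunctions(const float3* gaussPts,
                                     const float3* nodes,
                                     const int*    influenceStart,  // prefix index per Gauss point
                                     const int*    influenceList,   // influenced node ids
                                     float*        shapeValues)     // one value per (GP, node) pair
{
    int gp    = blockIdx.x;                          // this block's Gauss point
    int first = influenceStart[gp];
    int count = influenceStart[gp + 1] - first;

    float3 xg = gaussPts[gp];                        // read once, kept in registers

    // Strided loop: the block handles any number of influenced nodes,
    // even though the block size is a fixed power of 2.
    for (int j = threadIdx.x; j < count; j += blockDim.x) {
        float3 xn = nodes[influenceList[first + j]];
        float dx = xg.x - xn.x, dy = xg.y - xn.y, dz = xg.z - xn.z;

        // Placeholder: the real kernel evaluates the MLS weight and basis here,
        // accumulates A, pA, pAx, pAy, pAz in shared memory, and writes the
        // shape function value only after A has been inverted.
        shapeValues[first + j] = sqrtf(dx * dx + dy * dy + dz * dz);
    }
}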

6.2. Phase 2 – calculation of the global stiffness coefficients

In the second phase, there are also two levels of parallelism, the major one being on the level of interacting node pairs and the minor one on the Gauss points. A thread block/group is assigned to each node pair and each thread of the block handles one Gauss point at a time. This is schematically shown in Fig. 18, where it is assumed that each thread handles 3 shared Gauss points (this is for demonstration purposes only and not mandatory). Since the number of threads should be a power of 2 and the number of shared Gauss points can be anything, some threads will process only 2 shared Gauss points.

In this phase, all threads of a block go through all available shared Gauss points of the node pair and calculate the Q_ij submatrices (Eq. (25)) as described in Section 10. Each thread t of the block sums contributions from different shared Gauss points and updates its own partial K^t_ij, so there is no need for atomic operations. After all shared Gauss points have been processed, the partial K^t_ij matrices of each thread of the block are summed with a reduction into the final values of the stiffness coefficients K_ij. The process is shown in Fig. 19.
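The sketch below (again CUDA for brevity, with hypothetical names) shows the corresponding structure: one block per node pair, a strided loop of each thread over the shared Gauss points accumulating its private K^t_ij in registers, and a final tree reduction in shared memory that produces K_ij without any atomic operations. The per-Gauss-point contribution is collapsed to a single placeholder value; the real kernel forms the full Q_ij submatrix of Eq. (25).

// Phase 2 (structural sketch): one thread block per interacting node pair.
__global__ void phase2Stiffness(const int*   pairGpStart,  // prefix index per node pair
                                const int*   pairGpList,   // shared Gauss point ids
                                const float* gpWeight,     // placeholder per-GP data
                                float*       Kij)          // 9 coefficients (3x3) per pair
{
    __shared__ float s_K[128][9];                  // per-thread partials, blockDim.x <= 128

    int pair  = blockIdx.x;
    int first = pairGpStart[pair];
    int count = pairGpStart[pair + 1] - first;

    float k[9] = {0.0f};                           // this thread's partial K^t_ij (registers)
    for (int g = threadIdx.x; g < count; g += blockDim.x) {
        float q = gpWeight[pairGpList[first + g]]; // placeholder for the Q_ij contribution
        for (int c = 0; c < 9; ++c) k[c] += q;
    }
    for (int c = 0; c < 9; ++c) s_K[threadIdx.x][c] = k[c];
    __syncthreads();

    // Tree reduction over the threads of the block (blockDim.x a power of 2).
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            for (int c = 0; c < 9; ++c)
                s_K[threadIdx.x][c] += s_K[threadIdx.x + stride][c];
        __syncthreads();
    }

    if (threadIdx.x == 0)                          // thread 0 writes the final K_ij
        for (int c = 0; c < 9; ++c)
            Kij[pair * 9 + c] = s_K[0][c];
}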

7. Numerical results in 2D and 3D elasticity problems

The two procedures elaborated in this work for the computation of the stiffness matrix in large-scale EFG meshless simulations are


Fig. 18. Thread organization in phase 2.

Table 15. Computing time for the formulation of the stiffness matrix in the GPU implementation of the interacting node-pair approach.

Example   dof       Gauss points   NP GPU time (s)
                                   Kernel 1   Kernel 2   Total
2D-1      51,842    102,400        0.05       0.19       0.2
2D-2      152,250   300,304        0.13       0.56       0.7
2D-3      252,050   501,264        0.21       0.89       1.1
3D-1      27,783    64,000         0.17       2.41       2.6
3D-2      59,049    140,608        0.32       6.17       6.5
3D-3      107,811   262,144        0.62       12.31      12.9

Table 16. Relative speedup ratios of the GPU implementation compared to the CPU implementations.

Example   Conventional GP   Proposed GP   Proposed NP
2D-1      450               50            46
2D-2      457               50            41
2D-3      456               48            43
3D-1      921               93            52
3D-2      975               95            50
3D-3      1,028             90            50

Table 17. Total serial CPU computing time for the conventional initialization phase and formulation of the stiffness matrix.

Example   Conventional time (s)
          Initialization   Formulation   Total
2D-1      23               107           130
2D-2      300              313           613
2D-3      836              502           1338
3D-1      7                2374          2381
3D-2      45               6328          6373
3D-3      157              13,302        13,459


tested for the same 2D and 3D elasticity problems already used for testing throughout this paper. The geometric domains of these problems maximize the number of correlations and consequently the computational cost for the given number of nodes. The examples are run on the following hardware. CPU: a Core i7-980X with six physical cores (12 logical cores) at 3.33 GHz and 12 MB cache. GPU: a GeForce GTX680 with 1536 CUDA cores and 2 GB GDDR5 memory.

The performance of the Gauss point-wise (GP) and node pair-wise (NP) approaches in the CPU is given in Table 14. The proposed Gauss point-wise approach is compared with the "conventional" one without the previously described improvements. The performance of the GPU implementation of the node pair-wise method is shown in Table 15. Speedup ratios of the GPU implementation compared to the CPU implementations are given in Table 16. The total elapsed time for the initialization phase and formulation of the stiffness matrix with the conventional way is shown in Table 17. By applying all techniques proposed in this paper and utilizing one GPU, we can achieve the results of Table 18, which also shows the speedup compared to the conventional implementation.

Fig. 19. Phase 2 – concurrency level for the calculation of stiffness coefficients in the GPU.

The identification of node pairs is performed in the CPU and the formulation of the stiffness matrix in the GPU. Therefore, it is possible to have the CPU producing tasks (node pairs) and the GPU processing them concurrently. This producer–consumer model can be expanded to utilize all available hardware, as shown schematically in Fig. 20. The production can be done at domain or subdomain level so that it is performed by the hardware assigned to them. Processing can be done by any available CPUs, GPUs or other processing units thanks to the huge number of interacting node pairs and the fact that each node pair is completely independent of other node pairs. Example results are shown in Table 19, where the speedup ratios refer to the conventional implementations.
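A host-side sketch of this producer-consumer scheme is shown below (plain C++, as it would appear in the host part of a CUDA/OpenCL program). The batch size, the pair-generation loop and processBatchOnGpu are all hypothetical placeholders; the point is only that pair production and pair processing communicate through a thread-safe queue and can therefore run on whatever hardware is available.

#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

struct PairBatch { std::vector<std::pair<int, int>> pairs; };

std::queue<PairBatch> g_queue;     // node-pair batches waiting to be processed
std::mutex g_mtx;
std::condition_variable g_cv;
bool g_done = false;

// Producer: identifies interacting node pairs on the CPU (placeholder loop).
void producePairs(int nNodes, std::size_t batchSize)
{
    PairBatch batch;
    for (int i = 0; i < nNodes; ++i)
        for (int j = i; j < nNodes; ++j) {         // real code keeps only pairs with shared Gauss points
            batch.pairs.emplace_back(i, j);
            if (batch.pairs.size() == batchSize) {
                { std::lock_guard<std::mutex> lk(g_mtx); g_queue.push(std::move(batch)); }
                g_cv.notify_one();
                batch = PairBatch{};
            }
        }
    {
        std::lock_guard<std::mutex> lk(g_mtx);
        if (!batch.pairs.empty()) g_queue.push(std::move(batch));
        g_done = true;
    }
    g_cv.notify_all();
}

// Consumer: e.g. a GPU worker that launches the phase-2 kernel on each batch.
void consumePairs()
{
    for (;;) {
        std::unique_lock<std::mutex> lk(g_mtx);
        g_cv.wait(lk, [] { return !g_queue.empty() || g_done; });
        if (g_queue.empty()) return;               // production finished, queue drained
        PairBatch batch = std::move(g_queue.front());
        g_queue.pop();
        lk.unlock();
        // processBatchOnGpu(batch);               // placeholder: copy the batch and launch the kernel
    }
}

int main()
{
    std::thread producer(producePairs, 1000, std::size_t{4096});
    std::thread consumer(consumePairs);            // more consumers could be added (CPUs, GPUs, ...)
    producer.join();
    consumer.join();
}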

7.1. Turbine blade example

A real-world example of a turbine blade is tested. The geometry of the example was taken from the training examples of FEMAP. The EFG model has 31,512 degrees of freedom and 29,135 Gauss points. The geometry of the turbine blade is shown in Fig. 21 and the node placement is shown in Fig. 22. The same hardware as in the previous examples is used here, namely a Core i7-980X CPU and a GeForce GTX680 GPU. The challenges associated with node and Gauss point generation and selection of an appropriate domain of influence will be investigated in a future work.

The Gauss points used for integration are depicted in Fig. 23. Their density clearly demonstrates the large number of Gauss points required in EFG methods.

The total elapsed time for the initialization phase and formulation of the stiffness matrix with the conventional way and with the improved initialization and the proposed Gauss point-wise method is shown in Table 20.


Table 18. Best achieved total time for the initialization phase and formulation of the stiffness matrix.

Example   Best achieved time (s)                                            Speedup
          Initialization     Node pairs       Formulation     Total
          (CPU parallel)     (CPU parallel)   (GPU)
2D-1      0.5                0.6              0.2             1.4           93
2D-2      1.0                1.6              0.7             3.2           191
2D-3      1.4                2.8              1.1             5.3           252
3D-1      0.9                4.8              2.6             8.2           289
3D-2      1.7                10.9             6.5             19.1          334
3D-3      3.3                22.7             12.9            38.9          346

Fig. 20. Schematic representation of the processing of node pairs utilizing all available hardware.

Table 19. Best achieved total time for the initialization phase and formulation of the stiffness matrix when using CPU and GPU concurrently.

Example   Best achieved time (s)                                  Speedup
          Initialization     NP + formulation       Total
          (CPU parallel)     (Hybrid CPU/GPU)
2D-1      0.5                0.6                    1.2          110
2D-2      1.0                1.6                    2.6          236
2D-3      1.4                2.9                    4.3          310
3D-1      0.9                5.0                    5.9          403
3D-2      1.7                11.5                   13.2         481
3D-3      3.3                24.0                   27.3         494

Fig. 21. Geometry of the turbine blade.

Fig. 22. Turbine blade example: Position of the 10,504 nodes.

Fig. 23. Turbine blade example: Position of the 29,135 Gauss points.

Table 20. Turbine blade example: Total serial CPU computing time for the initialization phase and formulation of the stiffness matrix with the Gauss point-wise method.

               CPU time (s)
               Initialization   Formulation   Total
Conventional   3                341           344
Proposed       1.4              36.9          38.3



Table 21. Turbine blade example: Best achieved total time for the initialization phase and formulation of the stiffness matrix.

Best achieved time (s)                                              Speedup
Initialization     Node pairs       Formulation     Total
(CPU parallel)     (CPU parallel)   (GPU)
0.5                0.7              0.5             1.7             202


By applying all techniques proposed in this paper and utilizing one GPU, we can achieve a speedup of more than two orders of magnitude compared to the conventional implementation, as demonstrated in Table 21.

8. Concluding remarks

The proposed improvements of the initialization phase through the utilization of Gauss regions significantly reduce the time required to create the necessary correlations between the entities of the meshless methods. With Gauss regions, the process scales very well, in contrast to global searching, and the initialization takes only a small percentage of the problem formulation time.

The improvements in the Gauss point-wise approach for assembling the stiffness matrix offer an order of magnitude speedup compared to the conventional approach. This is attributed to the reduced number of calculations in all parts of the process and the use of an efficient sparse matrix format with an implementation specifically tailored to the formulation phase of the stiffness matrix. Indexing is a major factor affecting the computational cost. The skyline format is therefore faster due to its lower indexing cost; however, its significantly higher memory requirement makes it problematic for larger problems, where a sparse format is preferable or mandatory.

The proposed node pair-wise approach has several benefits over the Gauss point-wise approach, the most important being its amenability to parallelism, especially in massively parallel systems such as GPUs. Each node pair can be processed separately by any available processor in order to compute the corresponding stiffness submatrix. The node pair approach is characterized as "embarrassingly parallel" since it requires no synchronization whatsoever between node pairs.

A GPU implementation is applied to the node pair-wise approach, offering great speedups compared to CPU implementations. The node pairs keep the GPU constantly busy with calculations, resulting in high hardware utilization, which is evidenced by the high speedup ratios of approximately two orders of magnitude in the test examples presented. The node pair-wise approach can be applied as is to any available hardware, achieving even lower computing times. This includes using many GPUs, hybrid CPU(s)/GPU(s) implementations and generally any available processing unit. The importance of the latter becomes apparent when considering contemporary and future developments like heterogeneous systems architecture (HSA).

In conclusion, the parametric tests performed in the framework of this study showed that, with the proposed implementation and the exploitation of currently available low-cost hardware, the cost of the expensive formulation of the stiffness matrix in meshless EFG methods can be reduced by orders of magnitude. The presented node pair approach enables the efficient utilization of any available hardware and, in conjunction with fast initialization and its inherent parallelization features, can accomplish high speedup ratios, which convincingly addresses the main shortcoming of meshless methods, making them computationally competitive in solving large-scale engineering problems.

Acknowledgments

This work has been supported by the European Research Council Advanced Grant "MASTER – Mastering the computational challenges in numerical modeling and optimum design of CNT reinforced composites" (ERC-2011-ADG_20110209).

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.cma.2013.02.011.

References

[1] S. Li, W.K. Liu, Meshfree and particle methods and their applications, Appl. Mech. Rev. 55 (2002) 1–34.

[2] V.P. Nguyen, T. Rabczuk, S. Bordas, M. Duflot, Meshless methods: A review and computer implementation aspects, Math. Comput. Simul. 79 (2008) 763–813.

[3] T. Belytschko, Y. Krongauz, D. Organ, M. Fleming, P. Krysl, Meshless methods: An overview and recent developments, Comput. Methods Appl. Mech. Engrg. 139 (1996) 3–47.

[4] K.T. Danielson, S. Hao, W.K. Liu, R.A. Uras, S. Li, Parallel computation of meshless methods for explicit dynamic analysis, Int. J. Numer. Methods Engrg. 47 (2000) 1323–1341.

[5] K.T. Danielson, R.A. Uras, M.D. Adley, S. Li, Large-scale application of some modern CSM methodologies by parallel computation, Adv. Engrg. Software 31 (2000) 501–509.

[6] G.R. Liu, K.Y. Dai, T.T. Nguyen, A smoothed finite element method for mechanics problems, Comput. Mech. 39 (2007) 859–877.

[7] J.G. Wang, G.R. Liu, A point interpolation meshless method based on radial basis functions, Int. J. Numer. Methods Engrg. 54 (2002) 1623–1648.

[8] Y.T. Gu, G.R. Liu, A coupled element free Galerkin/boundary element method for stress analysis of two-dimensional solids, Comput. Methods Appl. Mech. Engrg. 190 (2001) 4405–4419.

[9] W.-R. Yuan, P. Chen, K.-X. Liu, High performance sparse solver for unsymmetrical linear equations with out-of-core strategies and its application on meshless methods, Appl. Math. Mech. (Engl. Ed.) 27 (2006) 1339–1348.

[10] S.C. Wu, H.O. Zhang, C. Zheng, J.H. Zhang, A high performance large sparse symmetric solver for the meshfree Galerkin method, Int. J. Comput. Methods 5 (2008) 533–550.

[11] E. Divo, A. Kassab, Iterative domain decomposition meshless method modeling of incompressible viscous flows and conjugate heat transfer, Engrg. Anal. Bound. Elem. 30 (2006) 465–478.

[12] P. Metsis, M. Papadrakakis, Overlapping and non-overlapping domain decomposition methods for large-scale meshless EFG simulations, Comput. Methods Appl. Mech. Engrg. 229–232 (2012) 128–141.

[13] J. Sanders, E. Kandrot, CUDA by Example: An Introduction to General-Purpose GPU Programming, Addison-Wesley Professional, 2010.

[14] D.B. Kirk, W.W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach, Morgan Kaufmann, 2010.

[15] NVIDIA Corporation, CUDA C Best Practices Guide, NVIDIA GPU Computing Documentation | NVIDIA Developer Zone, NVIDIA, 2012.

[16] TOP500 Supercomputing Sites. Available: <http://www.top500.org/>.

[17] I.C. Kampolis, X.S. Trompoukis, V.G. Asouti, K.C. Giannakoglou, CFD-based analysis and two-level aerodynamic optimization on graphics processing units, Comput. Methods Appl. Mech. Engrg. 199 (2010) 712–722.

[18] E. Elsen, P. LeGresley, E. Darve, Large calculation of the flow over a hypersonic vehicle using a GPU, J. Comput. Phys. 227 (2008) 10148–10161.

[19] J.C. Thibault, I. Senocak, Accelerating incompressible flow computations with a Pthreads-CUDA implementation on small-footprint multi-GPU platforms, J. Supercomput. 59 (2012) 693–719.

[20] M. De La Asunción, J.M. Mantas, M.J. Castro, Simulation of one-layer shallow water systems on multicore and CUDA architectures, J. Supercomput. 58 (2011) 206–214.

[21] H. Zhou, G. Mo, F. Wu, J. Zhao, M. Rui, K. Cen, GPU implementation of lattice Boltzmann method for flows with curved boundaries, Comput. Methods Appl. Mech. Engrg. 225–228 (2012) 65–73.

[22] A. Sunarso, T. Tsuji, S. Chono, GPU-accelerated molecular dynamics simulation for study of liquid crystalline flows, J. Comput. Phys. 229 (2010) 5486–5497.

[23] J.A. Anderson, C.D. Lorenz, A. Travesset, General purpose molecular dynamics simulations fully implemented on graphics processing units, J. Comput. Phys. 227 (2008) 5342–5359.

[24] E. Wadbro, M. Berggren, Megapixel topology optimization on a graphics processing unit, SIAM Rev. 51 (2009) 707–721.

[25] D. Komatitsch, G. Erlebacher, D. Göddeke, D. Michéa, High-order finite-element seismic wave propagation modeling with MPI on a large GPU cluster, J. Comput. Phys. 229 (2010) 7692–7714.

[26] T. Takahashi, T. Hamada, GPU-accelerated boundary element method for Helmholtz' equation in three dimensions, Int. J. Numer. Methods Engrg. 80 (2009) 1295–1321.



[27] G.R. Joldes, A. Wittek, K. Miller, Real-time nonlinear finite element computations on GPU – Application to neurosurgical simulation, Comput. Methods Appl. Mech. Engrg. 199 (2010) 3305–3314.

[28] S. Tomov, J. Dongarra, M. Baboulin, Towards dense linear algebra for hybrid GPU accelerated manycore systems, Parallel Comput. 36 (2010) 232–240.

[29] O. Schenk, M. Christen, H. Burkhart, Algorithmic performance studies on graphics processing units, J. Parallel Distrib. Comput. 68 (2008) 1360–1369.

[30] J.M. Elble, N.V. Sahinidis, P. Vouzis, GPU computing with Kaczmarz’s and other iterative algorithms for linear systems, Parallel Comput. 36 (2010) 215–231.

[31] A. Cevahir, A. Nukada, S. Matsuoka, Fast conjugate gradients with multiple GPUs, in: 9th International Conference on Computational Science, ICCS 2009, Baton Rouge, LA, 2009, pp. 893–903.

[32] A. Cevahir, A. Nukada, S. Matsuoka, High performance conjugate gradient solver on multi-GPU clusters using hypergraph partitioning, Comput. Sci.-Res. Develop. 25 (2010) 83–91.

[33] M. Papadrakakis, G. Stavroulakis, A. Karatarakis, A new era in scientific computing: Domain decomposition methods in hybrid CPU–GPU architectures, Comput. Methods Appl. Mech. Engrg. 200 (2011) 1490–1508.

[34] R. Trobec, M. Šterk, B. Robič, Computational complexity and parallelization of the meshless local Petrov–Galerkin method, Comput. Struct. 87 (2009) 81–90.

[35] C. Felippa, Chapter 15-Solid Elements: Overview, Advanced Finite Element Methods (ASEN 6367) Course Material, University of Colorado, 2011.

[36] Sparse matrix: Dictionary of keys (DOK), Wikipedia, the free encyclopedia, Sep 2012.

[37] Hash table, Wikipedia, the free encyclopedia, Sep 2012.

[38] Sparse matrix: Coordinate list (COO), Wikipedia, the free encyclopedia, Sep 2012.

[39] W.W. Hwu, D.B. Kirk, Parallelism Scalability, Programming and tUning Massively Parallel Systems (PUMPS), Barcelona, 2011.