Solution of dynamic solid deformation using hybrid ...miguelvargas/Papers/Slides.Substructuration.pdf · Finite element domain (left), domain discretization (center), partitioning

Solution of dynamic solid deformation using hybrid parallelization with MPI and OpenMP

MSc. Miguel Vargas-FélixISUM 2012

1/24

Problem description

Problem description

We want to solve large scale dynamic problems with linear deformation modeled with the finite element method.

ε=(∂∂ x

0 0

0 ∂∂ y

0

0 0 ∂∂ z

∂∂ y

∂∂ x

0

0 ∂∂ z

∂∂ y

∂∂ z

0 ∂∂ x

)(u1

u 2

u3)

σ=D (ε−ε 0 )+σ 0

Where u is the displacement vector, ε the stain, σ the stress. D is called the constitutive matrix.The solution is found using the finite element method with the Galerkin weighted residuals.

2/24

Schur substructuring method


This is a domain decomposition method without overlapping [Krui04].

Γ f

Γd

Ω

i j

Finite element domain (left), domain discretization (center), partitioning (right).

We start with a system of equations resulting from a finite element problemKd=f , (1)

where K is a symmetric positive definite matrix of size n×n.

If we divide the geometry into p partitions, the idea is to split the workload to let each partition to be handled by a computer in the cluster.

3/24


K 1II

K 1IB

K 2II

K 2IB

K 3IIK 3

IB

KBB

K 2IB

(K1II 0 0 K1

IB

0 K2II 0 K2

IB

0 0 K3II K3

IB

K1BI K 2

BI K3BI KBB)

Figure 1. Partitioning example.

We can arrange (reorder variables) of the system of equations to have the following form

(K1

II 0 K1IB

K2II K2

IB

0 K3II K3

IB

⋮ ⋱ ⋮K p

II K pIB

K1BI K2

BI K3BI ⋯ K p

BI K BB)(d1

I

d2I

d3I

⋮d p

I

dB)=(

f 1I

f 2I

f 3I

⋮f p

I

f B). (2)

The superscript II denotes entries that capture the relationship between nodes inside a partition. BB is used to indicate entries in the matrix that relate nodes on the boundary. Finally IB and BI are used for entries with values dependent of nodes in the boundary and nodes inside the partition.

4/24


Thus, the system can be separated in p different systems,

(KiII K i

IB

K iBI KBB)(di

I

dB)=( f iI

f B), i=1… p.

For each partition i the vector of unknowns diI as

diI=(Ki

II)−1( f iI−Ki

IBdB). (3)

After applying Gaussian elimination by blocks on (2), the reduced system of equations becomes

(KBB−∑i=1

p

K iBI (Ki

II)−1K i

IB)dB=fB−∑i=1

p

KiBI(Ki

II)−1f i

I. (4)

Once the vector dB is computed using (4), we can calculate the internal unknowns diI with (3).

It is not necessary to calculate the inverse in (4).

Let’s define K iBB=Ki

BI(K iII)−1K i

IB, to calculate it [Sori00], we proceed column by column using an extra vector t, and solving for c=1…n

K iII t=[Ki

IB]c, (5)note that many [K i

IB]c are null. Next we can complete K iBB with,

[K iBB]c=K i

BI t.

5/24


Now lets define f iB=K i

BI(K iII)−1 f i

I, in this case only one system has to be solvedK i

II t=f iI, (6)

and thenf i

B=K iBI t.

Each K iBB and f i

B holds the contribution of each partition to (4), this can be written as

(KBB−∑i=1

p

K iBB)dB=fB−∑

i=1

p

f iB, (7)

once (7) is solved, we can calculate the inner results of each partition using (3).

Since KiII is sparse and has to be solved many times in (5), a efficient way to proceed is to use a

Cholesky factorization of K iII. To reduce memory usage and increase speed a sparse Cholesky

factorization has to be implemented, this method is explained below.

In case of (7), KBB is sparse, but K iBB are not. To solve this system of equations an sparse version

of conjugate gradient was implemented, the matrix (KBB−∑i=1p K i

BB) is not assembled, but maintained distributed.

6/24

Matrix storage

Matrix storage

An efficient method to store and operate matrices of this kind of problems is the Compressed Row Storage (CRS) [Saad03 p362]. This method is suitable when we want to access entries of each row of a matrix A sequentially.

For each row i of A we will have two vectors, a vector viA that will contain the non-zero values of

the row, and a vector j iA with their respective column indexes. For example a matrix A and its

CRS representation

A=(8 4 0 0 0 00 0 1 3 0 02 0 1 0 7 00 9 3 0 0 10 0 0 0 0 5

), 8 41 21 33 42 11 3

75

9 32 3

16

56

v4A=(9,3, 1)

j 4A=(2,3, 6)

The size of the row will be denoted by ∣viA∣ or by ∣ j iA∣. Therefore the q th non zero value of the

row i of A will be denoted by (viA)q and the index of this value as ( j iA)q, with q=1,… ,∣vi

A∣.

7/24

Cholesky factorization for sparse matrices


For full matrices the computational complexity of Cholesky factorization A=LLT is O (n3).

To calculate entries of L

Li j=1

L j j (Ai j−∑k=1

j−1

Li k L j k), for i> j

L j j=√A j j−∑k=1

j−1

L j k2 .

We use four strategies to reduce time and memory usage when performing this factorization on sparse matrices:

1. Reordering of rows and columns of the matrix to reduce fill-in in L. This is equivalent to use a permutation matrix to reorder the system (P A PT)(P x )=(P b ).

2. Use symbolic Cholesky factorization to obtain an exact L factor (non zero entries in L).3. Organize operations to improve cache usage.4. Parallelize the factorization.

8/24


Matrix reorderingWe want to reorder rows and columns of A, in a way that the number of non-zero entries of L are reduced. η(L ) indicates the number of non-zero entries of L.

A= L=

The stiffness matrix to the left A∈ℝ556×556, with η( A)=1810. To the right the lower triangular matrix L, with η(L )=8729.There are several heuristics like the minimum degree algorithm [Geor81] or a nested dissection method [Kary99].

9/24


By reordering we have a matrix A ' with η(A ' )=1810 and its factorization L ' with η(L ' )=3215. Both factorizations solve the same system of equations.

A '= L '=

We reduce the factorization fill-in byη(L ' )=3215η(L )=8729

=0.368.

To determine a “good” reordering for a matrix A that minimize the fill-in of L is an NP complete problem [Yann81].

10/24


Symbolic Cholesky factorizationThe algorithm to determine the Li j entries that area non-zero is called symbolic Cholesky factorization [Gall90].Let be, for all columns j=1…n,

a j ≝ {k> j ∣ Ak j≠0 },l j ≝ {k> j ∣ Lk j≠0 }.

The sets r j will register the columns of L which structure will affect the column j of L.

r j ← ∅, j ← 1…nfor j ← 1…nl j ← a j

for i∈r j

l j ← l j∪l i∖ { j }end_for

p ← {min {i∈l j } if l j≠∅j other

r p ← r p∪{ j }end_for

A=(

a1 1 a1 2 a1 6

a21 a2 2 a2 3 a2 4

a3 2 a33 a35

a4 2 a4 4

a53 a55 a5 6

a61 a65 a6 6

)a2= {3,4 }

L=(

l 1 1

l 2 1 l 2 2

l 3 2 l 3 3

l 4 2 l 4 3 l 4 4

l 5 3 l 5 4 l55

l 6 1 l 6 2 l 6 3 l 6 4 l65 l 66

)l 2={3,4,6}

This algorithm is very efficient, its complexity in time and space has an order of O (η(L)).11/24


Parallelization of the factorizationThe calculation of the non-zero Li j entries can be done in parallel if we fill L column by column [Heat91].Let J (i ) be the indexes of the non-zero values of the row i of L. Formulae to calculate Li j are:

Li j=1

L j j (Ai j− ∑k∈(J (i)∩J ( j ))

k < j

Li k L j k), para i> j

L j j=√A j j− ∑k∈J ( j )

k < j

L j k2

.

Core 1

Core 2

Core N

The paralellization was made using the OpenMP schema.12/24


How efficient is it?The next table shows results solving a 2D Poisson equation problem, comparing Cholesky and conjugate gradient with Jacobi preconditioning. Several discretizations where used.

Equations nnz(A) nnz(L) Cholesky [s] CGJ [s]1,006 6,140 14,722 0.086 0.0813,110 20,112 62,363 0.137 0.103

10,014 67,052 265,566 0.309 0.18431,615 215,807 1’059,714 1.008 0.454

102,233 705,689 4’162,084 3.810 2.891312,248 2’168,286 14’697,188 15.819 19.165909,540 6’336,942 48’748,327 69.353 89.660

3’105,275 21’681,667 188’982,798 409.365 543.11010’757,887 75’202,303 743’643,820 2,780.734 3,386.609

1,000 10,000 100,000 1,000,000 10,000,0000

0

1

10

100

1,000

10,000

0.1 0.10.2

0.5

2.9

19.2

89.7

543.1

3,386.6

0.10.1

0.3

1.0

3.8

15.8

69.4

409.4

2,780.7

CholeskyCGJ

Equations

Tim

e [s

]

1,000 10,000 100,000 1,000,000 10,000,0001E+5

1E+6

1E+7

1E+8

1E+9

1E+10

417,838

1,314,142

4,276,214

13,581,143

44,071,337

134,859,928

393,243,516

1,343,496,475

4,656,139,711

707,464

2,632,403

10,168,743

38,186,672

145,330,127

512,535,099

1,747,287,767

7,134,437,212

32,703,892,477

CholeskyCGJ

Equations

Mem

ory

[byt

es]

13/24

Numerical experiment, building deformation


We useda cluster with 15 nodes, each one with two dual core Intel Xeon E5502 (1.87GHz) processors, a total of 60 cores.

The problem tested is a 3D solid model of a building that is deformed due to self weight. The geometry is divided in 1’336,832 elements, with 1’708,273 nodes, with three degrees of freedom per node the resulting system of equations has 5’124,819 unknowns. Tolerance used is 1x10-10.

Substructuration of the domain (left) resulting deformation (right)

14/24


Number of processes

Partitioningtime [s]

Inversion time (Cholesky) [s]

Schur complement time (CG) [s]

CG steps Total time [s]

14 47.6 18520.8 4444.5 6927 23025.028 45.7 6269.5 2444.5 8119 8771.656 44.1 2257.1 2296.3 9627 4608.9

14 28 560

5000

10000

15000

20000Schur complement time (CG) [s]Inversion time (Cholesky) [s]Partitioning time [s]

Number of processes

Tim

e [s

]

14 28 560

10

20

30

40

50

60

70

80 Slave processes [GB]

Number of processesM

emor

y [G

iga

byte

s]

Number of processes

Master process

[GB]

Slave processes

[GB]

Total memory

[GB]14 1.89 73.00 74.8928 1.43 67.88 69.3256 1.43 62.97 64.41

15/24

Larger systems of equations


To test solution times in larger systems of equations we set a simple geometry. We calculated the temperature distribution of a metalic square with Dirichlet conditions on all boundaries.

1°C2°C3°C4°C

The domain was discretized using quadrilaterals with nine nodes, the discretization made was from 25 million nodes up to 150 million nodes.In all cases we divided the domain into 116 partitions, each partition is solved in one core.

16/24


25,010,001 50,027,329 75,012,921 100,020,001 125,014,761 150,038,0010

50

100

150

200

250

300Schur complement time (CG) [min]Inversion time (Cholesky) [min]Partitioning time [min]Total time [min]

Number of equations

Tim

e [m

in]

Equations Partitioningtime [min]

Inversion time

(Cholesky) [min]

Schur complement

time (CG) [min]

CG steps Total time [min]

25,010,001 6.2 17.3 4.7 872.0 29.450,027,329 13.3 43.7 6.3 1012.0 65.475,012,921 20.6 80.2 4.3 1136.0 108.3

100,020,001 28.5 115.1 5.4 1225.0 152.9125,014,761 38.3 173.5 7.5 1329.0 224.2150,038,001 49.3 224.1 8.9 1362.0 288.5

17/24


25,010,001 50,027,329 75,012,921 100,020,001 125,014,761 150,038,0010

50

100

150

200

250

300

350 Slave processes [GB]Master process [GB]

Number of equations

Mem

ory [

Gig

a by

tes]

Equations Master process [GB]

Average slave processes [GB]

Slave processes [GB]

Total memory [GB]

25,010,001 4.05 0.41 47.74 51.7950,027,329 8.10 0.87 101.21 109.3175,012,921 12.15 1.37 158.54 170.68

100,020,001 16.20 1.88 217.51 233.71125,014,761 20.25 2.38 276.04 296.29150,038,001 24.30 2.92 338.29 362.60

18/24

Dynamic problem formulation

Dynamic problem formulation

The Hilber-Hughes-Taylor (HHT) method [Hilb77] is used to solve a dynamic problem, equations of motion have the form

Mu+Cu+Ku=F ( t ),where M, C and K are the mass, damping and stiffness matrices of size n×n, these are constant, F is the time depending force.The integration formulas depend on two parameters β and γ,

ui+1=ui+h ui+h2

2 [(1−2β) ui+2β ui+1],

ui+1=ui+h [(1−γ ) ui+γ ui+1 ],they are used to discretize the equations of motion at time t i+1, h is the step size,

Mui+1+(1+α )Cui+1−αCui+(1+α )Kui+1−αKui=F ( t i+1),where t i+1=t n+(1+α )h.

The HTT method is second order accurate and unconditionally stable with−1 /3≤α≤0, β=(1−α )2/4 and γ=(1−2α ) /2.

19/24

Infante Henrique bridge over the Douro river, Portugal


20/24


SimulationSimulation of a 18 wheels 36 metric tons truckcrossing the Infante D. Henrique Bridge.Pre and post-process using GiD (http://gid.cimne.upc.es).

Nodes 332,462Elements 1’381,944Element type TetrahedronTime steps 372HHT alpha factor 0Rayleigh damping a 0.5Rayleigh damping b 0.5Degrees of freedom 997,386nnz(K) 38’302,119Partitioning time 32.9 sFactorization time 87.3 sTime per step (CGJ) 132.9 s

Total time 13.7 h

21/24

http://en.structurae.de/structures/data/index.cfm?id=s0004697

http://gid.cimne.upc.es/

Thank you!Questions?

Thank you!Questions?

[email protected]://www.cimat.mx/~miguelvargas

22/24

http://www.cimat.mx/~miguelvargas

mailto:[email protected]

References

References

[Gall90] K. A. Gallivan, M. T. Heath, E. Ng, J. M. Ortega, B. W. Peyton, R. J. Plemmons, C. H. Romine, A. H. Sameh, R. G. Voigt, Parallel Algorithms for Matrix Computations, SIAM, 1990.[Geor81] A. George, J. W. H. Liu. Computer solution of large sparse positive definite systems. Prentice-Hall, 1981[Heat91] M T. Heath, E. Ng, B. W. Peyton. Parallel Algorithms for Sparse Linear Systems. SIAM Review, Vol. 33, No. 3, pp. 420-460, 1991.[Hilb77] H. M. Hilber, T. J. R. Hughes, and R. L. Taylor. Improved numerical dissipation for time integration algorithms in structural dynamics. Earthquake Eng. and Struct. Dynamics, 5:283–292, 1977.[Kary99] G. Karypis, V. Kumar. A Fast and Highly Quality Multilevel Scheme for Partitioning Irregular Graphs. SIAM Journal on Scientific Computing, Vol. 20-1, pp. 359-392, 1999.[Krui04] J. Kruis. “Domain Decomposition Methods on Parallel Computers”. Progress in Engineering Computational Technology, pp 299-322. Saxe-Coburg Publications. Stirling, Scotland, UK. 2004.[MPIF08]Message Passing Interface Forum. MPI: A Message-Passing Interface Standard, Version 2.1. University of Tennessee, 2008.[Saad03] Y. Saad. Iterative Methods for Sparse Linear Systems. SIAM, 2003.

23/24

References

[Sori00] M. Soria-Guerrero. Parallel multigrid algorithms for computational fluid dynamics and heat transfer. Universitat Politècnica de Catalunya. Departament de Màquines i Motors Tèrmics. 2000. http://www.tesisenred.net/handle/10803/6678[Yann81] M. Yannakakis. Computing the minimum fill-in is NP-complete. SIAM Journal on Algebraic Discrete Methods, Volume 2, Issue 1, pp 77-79, March, 1981.

24/24

http://www.tesisenred.net/handle/10803/6678

Documents

Solution of dynamic solid deformation using hybrid ...miguelvargas/Papers/Slides.Substructuration.pdf · Finite element domain (left), domain discretization (center), partitioning