Upload
others
View
5
Download
0
Embed Size (px)
Citation preview
Solution of dynamic solid deformation using hybrid parallelization with MPI and OpenMP
MSc. Miguel Vargas-FélixISUM 2012
1/24
Problem description
Problem description
We want to solve large scale dynamic problems with linear deformation modeled with the finite element method.
ε=(∂∂ x
0 0
0 ∂∂ y
0
0 0 ∂∂ z
∂∂ y
∂∂ x
0
0 ∂∂ z
∂∂ y
∂∂ z
0 ∂∂ x
)(u1
u 2
u3)
σ=D (ε−ε 0 )+σ 0
Where u is the displacement vector, ε the stain, σ the stress. D is called the constitutive matrix.The solution is found using the finite element method with the Galerkin weighted residuals.
2/24
Schur substructuring method
Schur substructuring method
This is a domain decomposition method without overlapping [Krui04].
Γ f
Γd
Ω
i j
Finite element domain (left), domain discretization (center), partitioning (right).
We start with a system of equations resulting from a finite element problemKd=f , (1)
where K is a symmetric positive definite matrix of size n×n.
If we divide the geometry into p partitions, the idea is to split the workload to let each partition to be handled by a computer in the cluster.
3/24
Schur substructuring method
K 1II
K 1IB
K 2II
K 2IB
K 3IIK 3
IB
KBB
K 2IB
(K1II 0 0 K1
IB
0 K2II 0 K2
IB
0 0 K3II K3
IB
K1BI K 2
BI K3BI KBB)
Figure 1. Partitioning example.
We can arrange (reorder variables) of the system of equations to have the following form
(K1
II 0 K1IB
K2II K2
IB
0 K3II K3
IB
⋮ ⋱ ⋮K p
II K pIB
K1BI K2
BI K3BI ⋯ K p
BI K BB)(d1
I
d2I
d3I
⋮d p
I
dB)=(
f 1I
f 2I
f 3I
⋮f p
I
f B). (2)
The superscript II denotes entries that capture the relationship between nodes inside a partition. BB is used to indicate entries in the matrix that relate nodes on the boundary. Finally IB and BI are used for entries with values dependent of nodes in the boundary and nodes inside the partition.
4/24
Schur substructuring method
Thus, the system can be separated in p different systems,
(KiII K i
IB
K iBI KBB)(di
I
dB)=( f iI
f B), i=1… p.
For each partition i the vector of unknowns diI as
diI=(Ki
II)−1( f iI−Ki
IBdB). (3)
After applying Gaussian elimination by blocks on (2), the reduced system of equations becomes
(KBB−∑i=1
p
K iBI (Ki
II)−1K i
IB)dB=fB−∑i=1
p
KiBI(Ki
II)−1f i
I. (4)
Once the vector dB is computed using (4), we can calculate the internal unknowns diI with (3).
It is not necessary to calculate the inverse in (4).
Let’s define K iBB=Ki
BI(K iII)−1K i
IB, to calculate it [Sori00], we proceed column by column using an extra vector t, and solving for c=1…n
K iII t=[Ki
IB]c, (5)note that many [K i
IB]c are null. Next we can complete K iBB with,
[K iBB]c=K i
BI t.
5/24
Schur substructuring method
Now lets define f iB=K i
BI(K iII)−1 f i
I, in this case only one system has to be solvedK i
II t=f iI, (6)
and thenf i
B=K iBI t.
Each K iBB and f i
B holds the contribution of each partition to (4), this can be written as
(KBB−∑i=1
p
K iBB)dB=fB−∑
i=1
p
f iB, (7)
once (7) is solved, we can calculate the inner results of each partition using (3).
Since KiII is sparse and has to be solved many times in (5), a efficient way to proceed is to use a
Cholesky factorization of K iII. To reduce memory usage and increase speed a sparse Cholesky
factorization has to be implemented, this method is explained below.
In case of (7), KBB is sparse, but K iBB are not. To solve this system of equations an sparse version
of conjugate gradient was implemented, the matrix (KBB−∑i=1p K i
BB) is not assembled, but maintained distributed.
6/24
Matrix storage
Matrix storage
An efficient method to store and operate matrices of this kind of problems is the Compressed Row Storage (CRS) [Saad03 p362]. This method is suitable when we want to access entries of each row of a matrix A sequentially.
For each row i of A we will have two vectors, a vector viA that will contain the non-zero values of
the row, and a vector j iA with their respective column indexes. For example a matrix A and its
CRS representation
A=(8 4 0 0 0 00 0 1 3 0 02 0 1 0 7 00 9 3 0 0 10 0 0 0 0 5
), 8 41 21 33 42 11 3
75
9 32 3
16
56
v4A=(9,3, 1)
j 4A=(2,3, 6)
The size of the row will be denoted by ∣viA∣ or by ∣ j iA∣. Therefore the q th non zero value of the
row i of A will be denoted by (viA)q and the index of this value as ( j iA)q, with q=1,… ,∣vi
A∣.
7/24
Cholesky factorization for sparse matrices
Cholesky factorization for sparse matrices
For full matrices the computational complexity of Cholesky factorization A=LLT is O (n3).
To calculate entries of L
Li j=1
L j j (Ai j−∑k=1
j−1
Li k L j k), for i> j
L j j=√A j j−∑k=1
j−1
L j k2 .
We use four strategies to reduce time and memory usage when performing this factorization on sparse matrices:
1. Reordering of rows and columns of the matrix to reduce fill-in in L. This is equivalent to use a permutation matrix to reorder the system (P A PT)(P x )=(P b ).
2. Use symbolic Cholesky factorization to obtain an exact L factor (non zero entries in L).3. Organize operations to improve cache usage.4. Parallelize the factorization.
8/24
Cholesky factorization for sparse matrices
Matrix reorderingWe want to reorder rows and columns of A, in a way that the number of non-zero entries of L are reduced. η(L ) indicates the number of non-zero entries of L.
A= L=
The stiffness matrix to the left A∈ℝ556×556, with η( A)=1810. To the right the lower triangular matrix L, with η(L )=8729.There are several heuristics like the minimum degree algorithm [Geor81] or a nested dissection method [Kary99].
9/24
Cholesky factorization for sparse matrices
By reordering we have a matrix A ' with η(A ' )=1810 and its factorization L ' with η(L ' )=3215. Both factorizations solve the same system of equations.
A '= L '=
We reduce the factorization fill-in byη(L ' )=3215η(L )=8729
=0.368.
To determine a “good” reordering for a matrix A that minimize the fill-in of L is an NP complete problem [Yann81].
10/24
Cholesky factorization for sparse matrices
Symbolic Cholesky factorizationThe algorithm to determine the Li j entries that area non-zero is called symbolic Cholesky factorization [Gall90].Let be, for all columns j=1…n,
a j ≝ {k> j ∣ Ak j≠0 },l j ≝ {k> j ∣ Lk j≠0 }.
The sets r j will register the columns of L which structure will affect the column j of L.
r j ← ∅, j ← 1…nfor j ← 1…nl j ← a j
for i∈r j
l j ← l j∪l i∖ { j }end_for
p ← {min {i∈l j } if l j≠∅j other
r p ← r p∪{ j }end_for
A=(
a1 1 a1 2 a1 6
a21 a2 2 a2 3 a2 4
a3 2 a33 a35
a4 2 a4 4
a53 a55 a5 6
a61 a65 a6 6
)a2= {3,4 }
L=(
l 1 1
l 2 1 l 2 2
l 3 2 l 3 3
l 4 2 l 4 3 l 4 4
l 5 3 l 5 4 l55
l 6 1 l 6 2 l 6 3 l 6 4 l65 l 66
)l 2={3,4,6}
This algorithm is very efficient, its complexity in time and space has an order of O (η(L)).11/24
Cholesky factorization for sparse matrices
Parallelization of the factorizationThe calculation of the non-zero Li j entries can be done in parallel if we fill L column by column [Heat91].Let J (i ) be the indexes of the non-zero values of the row i of L. Formulae to calculate Li j are:
Li j=1
L j j (Ai j− ∑k∈(J (i)∩J ( j ))
k < j
Li k L j k), para i> j
L j j=√A j j− ∑k∈J ( j )
k < j
L j k2
.
Core 1
Core 2
Core N
The paralellization was made using the OpenMP schema.12/24
Cholesky factorization for sparse matrices
How efficient is it?The next table shows results solving a 2D Poisson equation problem, comparing Cholesky and conjugate gradient with Jacobi preconditioning. Several discretizations where used.
Equations nnz(A) nnz(L) Cholesky [s] CGJ [s]1,006 6,140 14,722 0.086 0.0813,110 20,112 62,363 0.137 0.103
10,014 67,052 265,566 0.309 0.18431,615 215,807 1’059,714 1.008 0.454
102,233 705,689 4’162,084 3.810 2.891312,248 2’168,286 14’697,188 15.819 19.165909,540 6’336,942 48’748,327 69.353 89.660
3’105,275 21’681,667 188’982,798 409.365 543.11010’757,887 75’202,303 743’643,820 2,780.734 3,386.609
1,000 10,000 100,000 1,000,000 10,000,0000
0
1
10
100
1,000
10,000
0.1 0.10.2
0.5
2.9
19.2
89.7
543.1
3,386.6
0.10.1
0.3
1.0
3.8
15.8
69.4
409.4
2,780.7
CholeskyCGJ
Equations
Tim
e [s
]
1,000 10,000 100,000 1,000,000 10,000,0001E+5
1E+6
1E+7
1E+8
1E+9
1E+10
417,838
1,314,142
4,276,214
13,581,143
44,071,337
134,859,928
393,243,516
1,343,496,475
4,656,139,711
707,464
2,632,403
10,168,743
38,186,672
145,330,127
512,535,099
1,747,287,767
7,134,437,212
32,703,892,477
CholeskyCGJ
Equations
Mem
ory
[byt
es]
13/24
Numerical experiment, building deformation
Numerical experiment, building deformation
We useda cluster with 15 nodes, each one with two dual core Intel Xeon E5502 (1.87GHz) processors, a total of 60 cores.
The problem tested is a 3D solid model of a building that is deformed due to self weight. The geometry is divided in 1’336,832 elements, with 1’708,273 nodes, with three degrees of freedom per node the resulting system of equations has 5’124,819 unknowns. Tolerance used is 1x10-10.
Substructuration of the domain (left) resulting deformation (right)
14/24
Numerical experiment, building deformation
Number of processes
Partitioningtime [s]
Inversion time (Cholesky) [s]
Schur complement time (CG) [s]
CG steps Total time [s]
14 47.6 18520.8 4444.5 6927 23025.028 45.7 6269.5 2444.5 8119 8771.656 44.1 2257.1 2296.3 9627 4608.9
14 28 560
5000
10000
15000
20000Schur complement time (CG) [s]Inversion time (Cholesky) [s]Partitioning time [s]
Number of processes
Tim
e [s
]
14 28 560
10
20
30
40
50
60
70
80 Slave processes [GB]
Number of processesM
emor
y [G
iga
byte
s]
Number of processes
Master process
[GB]
Slave processes
[GB]
Total memory
[GB]14 1.89 73.00 74.8928 1.43 67.88 69.3256 1.43 62.97 64.41
15/24
Larger systems of equations
Larger systems of equations
To test solution times in larger systems of equations we set a simple geometry. We calculated the temperature distribution of a metalic square with Dirichlet conditions on all boundaries.
1°C2°C3°C4°C
The domain was discretized using quadrilaterals with nine nodes, the discretization made was from 25 million nodes up to 150 million nodes.In all cases we divided the domain into 116 partitions, each partition is solved in one core.
16/24
Larger systems of equations
25,010,001 50,027,329 75,012,921 100,020,001 125,014,761 150,038,0010
50
100
150
200
250
300Schur complement time (CG) [min]Inversion time (Cholesky) [min]Partitioning time [min]Total time [min]
Number of equations
Tim
e [m
in]
Equations Partitioningtime [min]
Inversion time
(Cholesky) [min]
Schur complement
time (CG) [min]
CG steps Total time [min]
25,010,001 6.2 17.3 4.7 872.0 29.450,027,329 13.3 43.7 6.3 1012.0 65.475,012,921 20.6 80.2 4.3 1136.0 108.3
100,020,001 28.5 115.1 5.4 1225.0 152.9125,014,761 38.3 173.5 7.5 1329.0 224.2150,038,001 49.3 224.1 8.9 1362.0 288.5
17/24
Larger systems of equations
25,010,001 50,027,329 75,012,921 100,020,001 125,014,761 150,038,0010
50
100
150
200
250
300
350 Slave processes [GB]Master process [GB]
Number of equations
Mem
ory [
Gig
a by
tes]
Equations Master process [GB]
Average slave processes [GB]
Slave processes [GB]
Total memory [GB]
25,010,001 4.05 0.41 47.74 51.7950,027,329 8.10 0.87 101.21 109.3175,012,921 12.15 1.37 158.54 170.68
100,020,001 16.20 1.88 217.51 233.71125,014,761 20.25 2.38 276.04 296.29150,038,001 24.30 2.92 338.29 362.60
18/24
Dynamic problem formulation
Dynamic problem formulation
The Hilber-Hughes-Taylor (HHT) method [Hilb77] is used to solve a dynamic problem, equations of motion have the form
Mu+Cu+Ku=F ( t ),where M, C and K are the mass, damping and stiffness matrices of size n×n, these are constant, F is the time depending force.The integration formulas depend on two parameters β and γ,
ui+1=ui+h ui+h2
2 [(1−2β) ui+2β ui+1],
ui+1=ui+h [(1−γ ) ui+γ ui+1 ],they are used to discretize the equations of motion at time t i+1, h is the step size,
Mui+1+(1+α )Cui+1−αCui+(1+α )Kui+1−αKui=F ( t i+1),where t i+1=t n+(1+α )h.
The HTT method is second order accurate and unconditionally stable with−1 /3≤α≤0, β=(1−α )2/4 and γ=(1−2α ) /2.
19/24
Infante Henrique bridge over the Douro river, Portugal
Infante Henrique bridge over the Douro river, Portugal
20/24
Infante Henrique bridge over the Douro river, Portugal
SimulationSimulation of a 18 wheels 36 metric tons truckcrossing the Infante D. Henrique Bridge.Pre and post-process using GiD (http://gid.cimne.upc.es).
Nodes 332,462Elements 1’381,944Element type TetrahedronTime steps 372HHT alpha factor 0Rayleigh damping a 0.5Rayleigh damping b 0.5Degrees of freedom 997,386nnz(K) 38’302,119Partitioning time 32.9 sFactorization time 87.3 sTime per step (CGJ) 132.9 s
Total time 13.7 h
21/24
References
References
[Gall90] K. A. Gallivan, M. T. Heath, E. Ng, J. M. Ortega, B. W. Peyton, R. J. Plemmons, C. H. Romine, A. H. Sameh, R. G. Voigt, Parallel Algorithms for Matrix Computations, SIAM, 1990.[Geor81] A. George, J. W. H. Liu. Computer solution of large sparse positive definite systems. Prentice-Hall, 1981[Heat91] M T. Heath, E. Ng, B. W. Peyton. Parallel Algorithms for Sparse Linear Systems. SIAM Review, Vol. 33, No. 3, pp. 420-460, 1991.[Hilb77] H. M. Hilber, T. J. R. Hughes, and R. L. Taylor. Improved numerical dissipation for time integration algorithms in structural dynamics. Earthquake Eng. and Struct. Dynamics, 5:283–292, 1977.[Kary99] G. Karypis, V. Kumar. A Fast and Highly Quality Multilevel Scheme for Partitioning Irregular Graphs. SIAM Journal on Scientific Computing, Vol. 20-1, pp. 359-392, 1999.[Krui04] J. Kruis. “Domain Decomposition Methods on Parallel Computers”. Progress in Engineering Computational Technology, pp 299-322. Saxe-Coburg Publications. Stirling, Scotland, UK. 2004.[MPIF08]Message Passing Interface Forum. MPI: A Message-Passing Interface Standard, Version 2.1. University of Tennessee, 2008.[Saad03] Y. Saad. Iterative Methods for Sparse Linear Systems. SIAM, 2003.
23/24
References
[Sori00] M. Soria-Guerrero. Parallel multigrid algorithms for computational fluid dynamics and heat transfer. Universitat Politècnica de Catalunya. Departament de Màquines i Motors Tèrmics. 2000. http://www.tesisenred.net/handle/10803/6678[Yann81] M. Yannakakis. Computing the minimum fill-in is NP-complete. SIAM Journal on Algebraic Discrete Methods, Volume 2, Issue 1, pp 77-79, March, 1981.
24/24