Parco 2005
Auto-optimization of linear algebra parallel routines: the Cholesky factorization

Luis-Pedro García
Servicio de Apoyo a la Investigación Tecnológica
Universidad Politécnica de Cartagena, Spain
[email protected]

Javier Cuenca
Departamento de Ingeniería y Tecnología de Computadores
Universidad de Murcia, Spain
[email protected]

Domingo Giménez
Departamento de Informática y Sistemas
Universidad de Murcia, Spain
[email protected]
Outline

- Introduction
- Parallel routine for the Cholesky factorization
- Experimental Results
- Conclusions
Introduction

Our goal: to obtain linear algebra parallel routines with auto-optimization capacity.

The approach: model the behavior of the algorithm.

This work: improve the model for the communication costs when:
- The routine uses different types of MPI communication mechanisms
- The system has more than one interconnection network
- The communication parameters vary with the volume of the communication
Introduction

- Theoretical and experimental study of the algorithm. AP selection.
- In linear algebra parallel routines the typical AP are b, p = r x c and the basic library; the typical SP are k1, k2, k3, ts and tw.
- An analytical model of the execution time: T(n) = f(n, AP, SP)
Parallel Cholesky factorization

The n x n matrix is mapped through a block-cyclic 2-D distribution onto a two-dimensional mesh of p = r x c processes (in ScaLAPACK style).

Figure 1. Work distribution in the first three steps, with n/b = 6 and p = 2 x 3: (a) first step, (b) second step, (c) third step.
Parallel Cholesky factorization

The general model: t(n) = f(n, AP, SP)

Problem size:
- n: matrix size

Algorithmic parameters (AP):
- b: block size
- p = r x c: processes

System parameters (SP), with SP = g(n, AP):
- k(n,b,p): k2,potf2, k3,trsm, k3,gemm and k3,syrk, the costs of the basic arithmetic operations
- ts(p): start-up time
- tws(n,p), twd(n,p): word-sending times for the different types of communications
- Communication cost: tcom(n,p) = ts(p) + n tw(n,p)
Parallel Cholesky factorization

Theoretical model: the execution time is the sum of the arithmetic cost and the communication cost:

T = tarit + tcom
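The detailed expressions for tarit and tcom appeared as equations on the original slide and are not recovered here. As a hedged sketch, a model of this kind usually takes the following form for a block Cholesky factorization on an r x c mesh; the precise term structure below is an assumption of this sketch, built only from the SP definitions above, not the slide's exact formulas:

```latex
% Hedged sketch: plausible form of the Cholesky cost model, not the
% slide's exact formulas. n/b diagonal factorizations of b x b blocks,
% panel updates spread over one dimension of the mesh, and the
% trailing-matrix update spread over all p processes.
\begin{align*}
t_{\mathrm{arit}} &\approx k_{2,\mathrm{potf2}}\,\frac{n b^{2}}{3}
  + k_{3,\mathrm{trsm}}\,\frac{n^{2} b}{2c}
  + \bigl(k_{3,\mathrm{syrk}} + k_{3,\mathrm{gemm}}\bigr)\,\frac{n^{3}}{3p} \\
t_{\mathrm{com}} &\approx \frac{n}{b}\,\bigl(t_{s}(p) + n\,b\,t_{w}(n,p)\bigr),
  \qquad T = t_{\mathrm{arit}} + t_{\mathrm{com}}
\end{align*}
```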
Experimental Results

Systems:
- P4net: a network of four Intel Pentium 4 nodes with a FastEthernet switch, enabling parallel communications between them. The MPI library used is MPICH.
- HPC160: a network of four HP AlphaServer quad-processor nodes, using Shared Memory (HPC160smp), MemoryChannel (HPC160mc) or both (HPC160smp-mc) for the communications between processes. An MPI library optimized for Shared Memory and for MemoryChannel has been used.
Experimental Results

How to estimate the arithmetic SPs:
- With routines performing some basic operation (dgemm, dsyrk, dtrsm) with the same data access scheme used in the algorithm.

How to estimate the communication SPs:
- With routines that communicate rows or columns in the logical mesh of processes:
  - A broadcast of an MPI derived data type between processes in the same column
  - A broadcast of an MPI predefined data type between processes in the same row

In both cases the experiments are repeated several times to obtain an average value (see the sketch below).
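As an illustration of the communication benchmark, here is a minimal MPI sketch that times a broadcast of a predefined data type at two message sizes and solves for ts and tw from the linear model tcom(n,p) = ts(p) + n tw(n,p). The constants (message sizes, repetition count) and the use of MPI_COMM_WORLD instead of a row communicator of the mesh are assumptions of this sketch:

```c
/* Sketch: estimating broadcast start-up (ts) and word-sending (tw) times.
 * Assumptions: MPI_COMM_WORLD stands in for a row of the logical mesh;
 * N_SMALL, N_LARGE and NREPS are illustrative, not from the slides. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static double bcast_time(double *buf, int n, int reps) {
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++)
        MPI_Bcast(buf, n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    return (MPI_Wtime() - t0) / reps;   /* average over repetitions */
}

int main(int argc, char **argv) {
    enum { N_SMALL = 1500, N_LARGE = 4096, NREPS = 100 };
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double *buf = calloc(N_LARGE, sizeof(double));
    double t_small = bcast_time(buf, N_SMALL, NREPS);
    double t_large = bcast_time(buf, N_LARGE, NREPS);
    /* Model t(n) = ts + n*tw: two message sizes give two equations. */
    double tw = (t_large - t_small) / (N_LARGE - N_SMALL);
    double ts = t_small - N_SMALL * tw;
    if (rank == 0)
        printf("ts = %g s, tw = %g s/word\n", ts, tw);
    free(buf);
    MPI_Finalize();
    return 0;
}
```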
Experimental Results

The lowest execution times are obtained with the optimized versions of BLAS and LAPACK for Pentium 4 and for Alpha.

Table 1. Values of arithmetic system parameters (in µsec) in Pentium 4 with BLASopt

Block size    32        64        128       256
k3,dgemm      0.001862  0.000937  0.000572  0.000467
k3,dsyrk      0.003492  0.001484  0.001228  0.000762
k3,dtrsm      0.011719  0.006527  0.003785  0.002325

Table 2. Values of arithmetic system parameters (in µsec) in Alpha with CXML

Block size    32        64        128       256
k3,dgemm      0.000824  0.000658  0.000610  0.000580
k3,dsyrk      0.001628  0.001164  0.000807  0.000688
k3,dtrsm      0.001617  0.001110  0.000841  0.000706
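Values like those in Tables 1 and 2 can be obtained with a benchmark of the kind sketched below, which times dgemm on b x b blocks and divides by the operation count. The use of a CBLAS interface, the repetition count and the 2b³ flop count are assumptions of this sketch:

```c
/* Sketch: estimating k3,dgemm by timing DGEMM on b x b blocks with the
 * same data access scheme as the algorithm. Assumes a CBLAS interface
 * and an operation count of 2*b^3 flops; both are assumptions here. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cblas.h>

int main(void) {
    int bs[] = {32, 64, 128, 256};
    for (int i = 0; i < 4; i++) {
        int b = bs[i], reps = 10;
        double *A = calloc((size_t)b * b, sizeof(double));
        double *B = calloc((size_t)b * b, sizeof(double));
        double *C = calloc((size_t)b * b, sizeof(double));
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int r = 0; r < reps; r++)  /* repeat and average, as in the slides */
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                        b, b, b, 1.0, A, b, B, b, 1.0, C, b);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        double k3 = secs / reps / (2.0 * b * b * b);  /* time per flop */
        printf("b = %3d: k3,dgemm = %g usec\n", b, k3 * 1e6);
        free(A); free(B); free(C);
    }
    return 0;
}
```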
Experimental Results

But other SPs can depend on n and b; for example, k2,potf2:

Table 3. Values of k2,potf2 (in µsec) in Pentium 4 with BLASopt

b \ n    512      1024     2048
32       0.0045   0.0054   0.0067
64       0.0034   0.046    0.0049
128      0.0063   0.0077   0.0076
256      0.0086   0.0103   0.0100

Table 4. Values of k2,potf2 (in µsec) in Alpha with CXML

b \ n    1024     2048     4096
32       0.0028   0.0147   0.0101
64       0.0024   0.0082   0.0034
128      0.0033   0.0052   0.0025
256      0.0027   0.0040   0.0023
Experimental Results

Communication system parameters: broadcast cost for MPI predefined data type, tws.

Table 5. Values of tws (in µsec) in P4net

p \ message size    1500    2048    > 4000
2                   0.61    0.77    0.84
4                   1.22    1.45    1.68

Table 6. Values of tws (in µsec) in HPC160

p    Shared Memory    MemoryChannel
2    0.011            0.072
4    0.025            0.14
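Table 5 is why tws must be treated as a function of the message volume. One direct way to carry that into the model is a piecewise lookup such as the sketch below, which simply encodes the measured P4net values; the placement of the breakpoints between the measured message sizes is an assumption of this sketch:

```c
#include <stdio.h>

/* Sketch: volume-dependent word-sending time for P4net, encoding the
 * measured values of Table 5 (in µsec per word). The breakpoints between
 * the measured message sizes are an assumption of this sketch. */
static double tws_p4net(int msg_size, int p) {
    static const double t[2][3] = {
        {0.61, 0.77, 0.84},   /* p = 2 */
        {1.22, 1.45, 1.68}    /* p = 4 */
    };
    int row = (p <= 2) ? 0 : 1;
    int col = (msg_size <= 1500) ? 0 : (msg_size <= 4000) ? 1 : 2;
    return t[row][col];
}

int main(void) {
    printf("tws(8192, 4) = %.2f usec\n", tws_p4net(8192, 4));
    return 0;
}
```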
Experimental Results

Communication system parameters: word-sending time of a broadcast for MPI derived data type, twd.

Table 7. Values of twd (in µsec) obtained experimentally for different b and p

             P4net                    HPC160smp                    HPC160mc
p \ b    32    64    128   256    32     64     128    256    32     64     128    256
2        0.97  0.84  1.00  1.10   0.019  0.024  0.020  0.019  0.095  0.091  0.089  0.090
4        1.60  1.90  1.60  1.64   0.047  0.048  0.045  0.041  0.190  0.176  0.179  0.183
Experimental Results

Communication system parameters: start-up time of the MPI broadcast, ts. It can be considered that ts(n,p) ≈ ts(p).

Table 8. Values of ts (in µsec) obtained experimentally for different numbers of processes

p    P4net    HPC160smp    HPC160mc
2    55       4.88         4.88
4    121      9.77         9.77
Experimental Results: P4net

[Figure: theoretical and experimental execution times (in seconds) on P4net for n = 4096 and n = 5120, plotted against block size b (32, 64, 128, 256) for process meshes 1x1, 1x2, 2x1, 2x2, 1x4 and 4x1.]
Experimental Results: HPC160smp

[Figure: theoretical and experimental execution times (in seconds) on HPC160smp for n = 5120 and n = 7168, plotted against block size b (32, 64, 128, 256) for process meshes 1x1, 1x2, 2x1, 2x2, 1x4 and 4x1.]
Experimental Results: parameters selection in P4net

[Table 9. Parameters selection for the Cholesky factorization in P4net.]
Experimental Results: parameters selection in HPC160

[Table 10. Parameters selection for the Cholesky factorization in HPC160 with Shared Memory (HPC160smp), MemoryChannel (HPC160mc) and both (HPC160smp-mc).]
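Tables 9 and 10 summarize the selected parameters. The selection step itself can be pictured as the exhaustive evaluation sketched below, which tries every candidate block size and every factorization p = r x c of the available processes and keeps the combination with the lowest modelled time. model_time() is a hypothetical stand-in for the full model of the slides, and the SP values are illustrative only:

```c
/* Sketch of AP selection: evaluate the model T(n) = f(n, AP, SP) for each
 * candidate (b, r, c) and keep the minimum. model_time() is a hypothetical
 * simplified stand-in for the full cost model. */
#include <stdio.h>

static double model_time(int n, int b, int r, int c,
                         double k3, double ts, double tw) {
    int p = r * c;
    double arit = k3 * (double)n * n * n / (3.0 * p);        /* arithmetic term */
    double com  = (double)n / b * (ts + (double)n * b * tw); /* communication term */
    return arit + com;
}

int main(void) {
    int n = 4096, p = 4;
    int blocks[] = {32, 64, 128, 256};
    /* Illustrative SP values only (in seconds), not measured data. */
    double k3 = 0.000467e-6, ts = 121e-6, tw = 1.68e-6;
    double best = 1e300;
    int best_b = 0, best_r = 0, best_c = 0;
    for (int i = 0; i < 4; i++)
        for (int r = 1; r <= p; r++)
            if (p % r == 0) {                     /* every mesh r x c with r*c = p */
                int c = p / r;
                double t = model_time(n, blocks[i], r, c, k3, ts, tw);
                if (t < best) { best = t; best_b = blocks[i]; best_r = r; best_c = c; }
            }
    printf("selected b = %d, p = %d x %d (modelled time %.2f s)\n",
           best_b, best_r, best_c, best);
    return 0;
}
```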
Conclusions

- The method has been applied successfully to the Cholesky factorization and can be applied to other linear algebra routines.
- It is necessary to use different costs for the different types of MPI communication mechanisms, and different costs for the communication parameters in systems with more than one interconnection network.
- It is necessary to decide the optimal allocation of processes per node according to the speed of the interconnection networks (hybrid systems).