1
Parallel Software for SemiDefinite Programming with Sparse Schur Complement Matrix
Makoto Yamashita @ Tokyo-Tech
Katsuki Fujisawa @ Chuo University
Mituhiro Fukuda @ Tokyo-Tech
Yoshiaki Futakata @ University of Virginia
Kazuhiro Kobayashi @ National Maritime Research Institute
Masakazu Kojima @ Tokyo-Tech
Kazuhide Nakata @ Tokyo-Tech
Maho Nakata @ RIKEN
ISMP 2009 @ Chicago [2009/08/26]
2
Extremely Large SDPs Arising from various fields
Quantum Chemistry Sensor Network Problems Polynomial Optimization Problems
Most of the computation time is spent on the Schur complement matrix (SCM)
[SDPARA] Parallel computation for the SCM, in particular a sparse SCM
3
Outline
1. SemiDefinite Programming and Schur complement matrix
2. Parallel Implementation
3. Parallel computation for the sparse Schur complement
4. Numerical Results
5. Future works
4
Standard form of SDP
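In the standard (textbook) form, with symmetric data matrices \(C, A_1, \dots, A_m\) and \(X \bullet Y = \mathrm{Tr}(XY)\) (SDPA's own convention may label the primal and dual the other way around):

\[
\begin{aligned}
\text{(P)}\quad & \min_{X}\ C \bullet X \quad \text{s.t.}\ A_i \bullet X = b_i \ (i = 1,\dots,m),\ X \succeq 0,\\
\text{(D)}\quad & \max_{y,\,Z}\ b^{\mathsf T} y \quad \text{s.t.}\ \sum_{i=1}^{m} y_i A_i + Z = C,\ Z \succeq 0.
\end{aligned}
\]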
5
Primal-Dual Interior-Point Methods
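In rough outline (the specific variant implemented in SDPA differs in details), each iteration linearizes the perturbed optimality conditions and takes a damped step that keeps the iterates positive definite:

\[
A_i \bullet X = b_i,\qquad \sum_{i=1}^{m} y_i A_i + Z = C,\qquad X Z = \mu I \ (\mu \downarrow 0),
\]
\[
(X, y, Z) \leftarrow (X, y, Z) + \alpha\,(\Delta X, \Delta y, \Delta Z),\qquad X, Z \succ 0.
\]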
6
Computation for Search Direction
Schur complement matrix ⇒ Cholesky factorization
Exploitation of sparsity in:
1. ELEMENTS
2. CHOLESKY
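As a sketch, for the HKM search direction used in SDPA, \(\Delta y\) is obtained from the linear system below; ELEMENTS denotes the evaluation of all entries of \(B\), and CHOLESKY its factorization:

\[
B\,\Delta y = r,\qquad B_{ij} = \mathrm{Tr}\!\left(A_i X A_j Z^{-1}\right)\ (1 \le i, j \le m),\qquad B = L L^{\mathsf T},
\]

where \(B\) is the \(m \times m\) symmetric positive definite Schur complement matrix and \(r\) is the right-hand-side vector.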
7
Bottlenecks on Single Processor
Apply Parallel Computation to the Bottlenecks
Time in seconds on an Opteron 246 (2.0GHz):

                LiOH            HF
m              10592          15018
ELEMENTS        6150 ( 43%)   16719 ( 35%)
CHOLESKY        7744 ( 54%)   20995 ( 44%)
TOTAL          14250 (100%)   47483 (100%)
8
SDPARA: the parallel version of SDPA (a generic SDP solver), based on MPI & ScaLAPACK
Row-wise distribution for ELEMENTS
Parallel Cholesky factorization for CHOLESKY
http://sdpa.indsys.chuo-u.ac.jp/sdpa/
9
Row-wise distribution for evaluation of the Schur complement matrix
When 4 CPUs are available, each CPU computes only its assigned rows.
No communication between CPUs
Efficient memory management
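A minimal sketch of the idea (not SDPARA's actual code), assuming a cyclic row assignment; each MPI rank evaluates and stores only its own rows of the Schur complement matrix:

```cpp
// Sketch: row-wise (cyclic) distribution of the Schur complement matrix.
// Each rank evaluates only its assigned rows, so ELEMENTS needs no
// communication and memory is limited to the locally owned rows.
#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const int m = 1000;                   // number of constraints (illustrative)
  std::vector<int> myRows;
  for (int i = rank; i < m; i += size)  // cyclic (round-robin) assignment
    myRows.push_back(i);

  // Each rank would now compute B(i, j) for i in myRows, j = 0..m-1.
  std::printf("rank %d owns %zu rows\n", rank, myRows.size());

  MPI_Finalize();
  return 0;
}
```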
10
Parallel Cholesky factorization
We adopt ScaLAPACK for the Cholesky factorization of the Schur complement matrix.
We redistribute the matrix from the row-wise distribution to a two-dimensional block-cyclic distribution.
Redistribution
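A small sketch of the standard two-dimensional block-cyclic mapping that ScaLAPACK expects (block size nb on a Pr x Pc process grid); the function and parameter names are illustrative:

```cpp
// Sketch: owner of global entry (i, j) under a 2D block-cyclic distribution
// with block size nb on a Pr x Pc process grid.
#include <cstdio>

struct Owner { int prow, pcol; };

Owner blockCyclicOwner(int i, int j, int nb, int Pr, int Pc) {
  // Block indices of (i, j), wrapped cyclically over the process grid.
  return { (i / nb) % Pr, (j / nb) % Pc };
}

int main() {
  // Example: entry (10, 37) with nb = 8 on a 2 x 3 grid -> process (1, 1).
  Owner o = blockCyclicOwner(10, 37, 8, 2, 3);
  std::printf("process (%d, %d)\n", o.prow, o.pcol);
  return 0;
}
```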
11
Computation time on SDP from Quantum Chemistry [LiOH]
[Chart: computation time in seconds (log scale) vs. #processors (1, 4, 16, 64) for TOTAL, ELEMENTS, and CHOLESKY; on 1 processor TOTAL 14250s, ELEMENTS 6150s, CHOLESKY 7744s]
AIST super cluster, Opteron 246 (2.0GHz), 6GB memory/node
12
Scalability on SDP from Quantum Chemistry [NF]
[Chart: speed-up (scalability) vs. #processors (1-64) for TOTAL, ELEMENTS, and CHOLESKY]
Speed-up: TOTAL 29x, ELEMENTS 63x, CHOLESKY 39x
ELEMENTS is very effective
13
Sparse Schur complement matrix
Schur complement matrix becomes very sparse for some applications.
⇒ The simple row-wise distribution loses its efficiency.
from Control Theory (density 100%)    from Sensor Network (density 2.12%)
14
Sparseness of Schur complement matrix
Many applications have a diagonal block structure
15
Exploitation of Sparsity in SDPA
We select one of three formulas row by row, depending on the sparsity of the data matrices:
F1
F2
F3
16
ELEMENTS for Sparse Schur complement
[Figure: Schur complement matrix with diagonal block structure; estimated per-row computation costs 150, 40, 30, 20, 135, 20, 70, 10, 50, 5, 30, 3]
Load on each CPU
CPU1:190
CPU2:185
CPU3:188
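A minimal sketch of one way to obtain such a balance (a greedy largest-cost-first assignment; SDPARA's actual scheme may differ). With the row costs from the figure it reproduces the loads 190 / 188 / 185, i.e. the same totals as above up to the ordering of CPUs:

```cpp
// Sketch (illustrative, not SDPARA's actual code): greedy load balancing of
// the rows of a sparse Schur complement matrix.  Each row has an estimated
// evaluation cost; the heaviest unassigned row goes to the least-loaded CPU.
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
  // Per-row costs taken from the figure above.
  std::vector<int> cost = {150, 40, 30, 20, 135, 20, 70, 10, 50, 5, 30, 3};
  const int nCPU = 3;

  // Process rows in decreasing order of cost.
  std::vector<int> order(cost.size());
  for (size_t i = 0; i < order.size(); ++i) order[i] = (int)i;
  std::sort(order.begin(), order.end(),
            [&](int a, int b) { return cost[a] > cost[b]; });

  std::vector<int> load(nCPU, 0);
  std::vector<int> owner(cost.size());
  for (int r : order) {
    int cpu = (int)(std::min_element(load.begin(), load.end()) - load.begin());
    owner[r] = cpu;          // assign row r to the currently least-loaded CPU
    load[cpu] += cost[r];
  }
  for (int c = 0; c < nCPU; ++c)
    std::printf("CPU%d: %d\n", c + 1, load[c]);   // 190, 188, 185
  return 0;
}
```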
17
CHOLESKY for Sparse Schur complement
Parallel sparse Cholesky factorization implemented in MUMPS
MUMPS adopts the multifrontal method
[Figure: the same sparse Schur complement matrix with per-row costs as in the ELEMENTS slide]
Memory storage on each processor should be consecutive.
The distribution used for ELEMENTS matches this method.
18
Computation time for SDPs from Polynomial Optimization Problem
[Chart: computation time in seconds (log scale) vs. #processors (1-32) for TOTAL, ELEMENTS, and CHOLESKY]
tsubasa, Xeon E5440 (2.83GHz), 8GB memory/node
Parallel sparse Cholesky achieves mild scalability. ELEMENTS attains a 24x speed-up on 32 CPUs.
19
ELEMENTS Load-balance on 32 CPUs
Only the first processor has a slightly heavier computation load.
[Chart: per-processor computation time in seconds (left axis, 0-0.4) and number of distributed elements (right axis, 0-1,400,000) for each of the 32 processors]
20
Automatic selection of sparse / dense SCM
Dense parallel Cholesky achieves higher scalability than sparse parallel Cholesky,
so dense becomes better for many processors.
We estimate both computation times using computation cost and scalability.
[Chart: computation time in seconds vs. #processors (1-32) for auto, dense, and sparse CHOLESKY]
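A minimal sketch of such an automatic selection under an assumed cost model; the constants and the model itself are illustrative placeholders, not SDPARA's actual estimator:

```cpp
// Sketch: choose dense vs. sparse parallel Cholesky for the Schur complement
// matrix from an estimated operation count and an assumed scalability factor.
#include <cstdio>

// Estimated time = serial_flops / (flop_rate * modeled speed-up on p CPUs).
double estimate(double serialFlops, int p, double parEff, double flopRate) {
  double speedup = 1.0 + parEff * (p - 1);   // crude scalability model (assumption)
  return serialFlops / (flopRate * speedup);
}

int main() {
  const int m = 20000;            // size of the Schur complement matrix (example)
  const double fillRatio = 0.05;  // assumed nonzero fraction of the sparse factor
  const int p = 16;               // number of processors

  double denseFlops  = m * (double)m * m / 3.0;            // ~m^3/3 for dense Cholesky
  double sparseFlops = denseFlops * fillRatio * fillRatio; // rough sparse estimate

  // Assumption: dense ScaLAPACK Cholesky scales better than sparse MUMPS Cholesky.
  double tDense  = estimate(denseFlops,  p, 0.9, 1e9);
  double tSparse = estimate(sparseFlops, p, 0.4, 1e9);

  std::printf("dense %.2fs  sparse %.2fs  -> use %s\n",
              tDense, tSparse, tDense < tSparse ? "dense" : "sparse");
  return 0;
}
```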
21
Sparse/Dense CHOLESKY for a small SDP from POP
[Chart: computation time in seconds (log scale) vs. #processors (1-32) for auto, dense, and sparse CHOLESKY]
tsubasa, Xeon E5440 (2.83GHz), 8GB memory/node
Only on 4 CPUs does the automatic selection fail (since the scalability of the sparse Cholesky is unstable on 4 CPUs).
22
Numerical Results
Comparison with PCSDP
  Sensor Network Problem generated by SFSDP
Multi Threading
  Quantum Chemistry
23
SDPs from Sensor Network

#sensors 1,000 (m = 16,450, density 1.23%)
#CPU          1      2      4      8     16
SDPARA     28.2   22.1   16.7   13.8   27.3
PCSDP      M.O.   1527    887    591    368

#sensors 35,000 (m = 527,096)
#CPU          1      2      4      8     16
SDPARA     1080    845    614    540    506
PCSDP      memory overflow if #sensors >= 4,000

(time unit: second; M.O. = memory overflow)
24
MPI + Multi Threading for Quantum Chemistry
N.4P.DZ.pqgt11t2p (m = 7,230)
[Chart: computation time in seconds (log scale) vs. #nodes (1-16) for PCSDP, SDPARA(1), SDPARA(2), SDPARA(4), and SDPARA(8)]
64x speed-up with 16 nodes × 8 threads
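A minimal sketch of the hybrid pattern (MPI ranks across nodes, threads inside each rank sharing the assigned rows); it is illustrative only, not SDPARA's actual code, and assumes OpenMP threading over locally owned rows:

```cpp
// Sketch: hybrid MPI + OpenMP evaluation of Schur complement rows.
// Rows are distributed cyclically over MPI ranks; within a rank, OpenMP
// threads share the work on the assigned rows.
#include <mpi.h>
#include <omp.h>
#include <vector>
#include <cstdio>

int main(int argc, char** argv) {
  int provided;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const int m = 7230;                      // number of constraints (from the slide)
  std::vector<int> myRows;
  for (int i = rank; i < m; i += size) myRows.push_back(i);

  double localWork = 0.0;
  #pragma omp parallel for reduction(+:localWork)
  for (int k = 0; k < (int)myRows.size(); ++k) {
    // Each thread would evaluate row myRows[k] of the Schur complement here.
    localWork += 1.0;                      // placeholder for the real row cost
  }

  std::printf("rank %d: %d rows, %d threads\n",
              rank, (int)myRows.size(), omp_get_max_threads());
  MPI_Finalize();
  return 0;
}
```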
25
Concluding Remarks & Future works
1. New parallel schemes for sparse Schur complement matrix
2. Reasonable scalability
3. Extremely large-scale SDPs with sparse Schur complement matrix
Future work: improvement of multi-threading for the sparse Schur complement matrix