1
Parallel Software for SemiDefinite Programming with Sparse Schur Complement Matrix
Makoto Yamashita @ Tokyo-Tech
Katsuki Fujisawa @ Chuo University
Mituhiro Fukuda @ Tokyo-Tech
Yoshiaki Futakata @ University of Virginia
Kazuhiro Kobayashi @ National Maritime Research Institute
Masakazu Kojima @ Tokyo-Tech
Kazuhide Nakata @ Tokyo-Tech
Maho Nakata @ RIKEN
ISMP 2009 @ Chicago [2009/08/26]
2
Extremely Large SDPs Arising from various fields
Quantum Chemistry Sensor Network Problems Polynomial Optimization Problems
Most of the computation time is spent on the Schur complement matrix (SCM)
[SDPARA] Parallel computation for the SCM, in particular a sparse SCM
3
Outline
1. SemiDefinite Programming and Schur complement matrix
2. Parallel Implementation
3. Parallel computation for the sparse Schur complement
4. Numerical Results
5. Future works
4
Standard form of SDP
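In the standard (textbook) form, with symmetric data matrices \(C, A_1, \dots, A_m\) and \(X \bullet Y = \mathrm{Tr}(XY)\) (SDPA's own convention may label the primal and dual the other way around):

\[
\begin{aligned}
\text{(P)}\quad & \min_{X}\ C \bullet X \quad \text{s.t.}\ A_i \bullet X = b_i \ (i = 1,\dots,m),\ X \succeq 0,\\
\text{(D)}\quad & \max_{y,\,Z}\ b^{\mathsf T} y \quad \text{s.t.}\ \sum_{i=1}^{m} y_i A_i + Z = C,\ Z \succeq 0.
\end{aligned}
\]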
5
Primal-Dual Interior-Point Methods
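In rough outline (the specific variant implemented in SDPA differs in details), each iteration linearizes the perturbed optimality conditions and takes a damped step that keeps the iterates positive definite:

\[
A_i \bullet X = b_i,\qquad \sum_{i=1}^{m} y_i A_i + Z = C,\qquad X Z = \mu I \ (\mu \downarrow 0),
\]
\[
(X, y, Z) \leftarrow (X, y, Z) + \alpha\,(\Delta X, \Delta y, \Delta Z),\qquad X, Z \succ 0.
\]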
6
Computation for Search Direction
Schur complement matrix ⇒ Cholesky factorization
Exploitation of sparsity in:
1. ELEMENTS
2. CHOLESKY
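As a sketch, for the HKM search direction used in SDPA, \(\Delta y\) is obtained from the linear system below; ELEMENTS denotes the evaluation of all entries of \(B\), and CHOLESKY its factorization:

\[
B\,\Delta y = r,\qquad B_{ij} = \mathrm{Tr}\!\left(A_i X A_j Z^{-1}\right)\ (1 \le i, j \le m),\qquad B = L L^{\mathsf T},
\]

where \(B\) is the \(m \times m\) symmetric positive definite Schur complement matrix and \(r\) is the right-hand-side vector.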
7
Bottlenecks on Single Processor
Apply Parallel Computation to the Bottlenecks
Time in seconds on an Opteron 246 (2.0GHz):

                LiOH            HF
m              10592          15018
ELEMENTS        6150 ( 43%)   16719 ( 35%)
CHOLESKY        7744 ( 54%)   20995 ( 44%)
TOTAL          14250 (100%)   47483 (100%)
8
SDPARA: the parallel version of SDPA (a generic SDP solver), based on MPI & ScaLAPACK
Row-wise distribution for ELEMENTS
Parallel Cholesky factorization for CHOLESKY
http://sdpa.indsys.chuo-u.ac.jp/sdpa/
9
Row-wise distribution for evaluation of the Schur complement matrix
When 4 CPUs are available, each CPU computes only its assigned rows.
No communication between CPUs
Efficient memory management
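A minimal sketch of the idea (not SDPARA's actual code), assuming a cyclic row assignment; each MPI rank evaluates and stores only its own rows of the Schur complement matrix:

```cpp
// Sketch: row-wise (cyclic) distribution of the Schur complement matrix.
// Each rank evaluates only its assigned rows, so ELEMENTS needs no
// communication and memory is limited to the locally owned rows.
#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const int m = 1000;                   // number of constraints (illustrative)
  std::vector<int> myRows;
  for (int i = rank; i < m; i += size)  // cyclic (round-robin) assignment
    myRows.push_back(i);

  // Each rank would now compute B(i, j) for i in myRows, j = 0..m-1.
  std::printf("rank %d owns %zu rows\n", rank, myRows.size());

  MPI_Finalize();
  return 0;
}
```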
10
Parallel Cholesky factorization
We adopt ScaLAPACK for the Cholesky factorization of the Schur complement matrix.
We redistribute the matrix from the row-wise distribution to a two-dimensional block-cyclic distribution.
Redistribution
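A small sketch of the standard two-dimensional block-cyclic mapping that ScaLAPACK expects (block size nb on a Pr x Pc process grid); the function and parameter names are illustrative:

```cpp
// Sketch: owner of global entry (i, j) under a 2D block-cyclic distribution
// with block size nb on a Pr x Pc process grid.
#include <cstdio>

struct Owner { int prow, pcol; };

Owner blockCyclicOwner(int i, int j, int nb, int Pr, int Pc) {
  // Block indices of (i, j), wrapped cyclically over the process grid.
  return { (i / nb) % Pr, (j / nb) % Pc };
}

int main() {
  // Example: entry (10, 37) with nb = 8 on a 2 x 3 grid -> process (1, 1).
  Owner o = blockCyclicOwner(10, 37, 8, 2, 3);
  std::printf("process (%d, %d)\n", o.prow, o.pcol);
  return 0;
}
```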
11
Computation time on SDP from Quantum Chemistry [LiOH]
[Chart: computation time in seconds (log scale) vs. #processors (1, 4, 16, 64) for TOTAL, ELEMENTS, and CHOLESKY; on 1 processor TOTAL 14250s, ELEMENTS 6150s, CHOLESKY 7744s]
AIST super cluster, Opteron 246 (2.0GHz), 6GB memory/node
12
Scalability on SDP from Quantum Chemistry [NF]
[Chart: speed-up (scalability) vs. #processors (1-64) for TOTAL, ELEMENTS, and CHOLESKY]
Speed-up: TOTAL 29x, ELEMENTS 63x, CHOLESKY 39x
ELEMENTS is very effective
13
Sparse Schur complement matrix
Schur complement matrix becomes very sparse for some applications.
⇒ The simple row-wise distribution loses its efficiency.
from Control Theory (density 100%)    from Sensor Network (density 2.12%)
14
Sparseness of Schur complement matrix
Many applications have a diagonal block structure
15
Exploitation of Sparsity in SDPA
We select one of three formulas row by row, depending on the sparsity of the data matrices:
F1
F2
F3
16
ELEMENTS for Sparse Schur complement
[Figure: Schur complement matrix with diagonal block structure; estimated per-row computation costs 150, 40, 30, 20, 135, 20, 70, 10, 50, 5, 30, 3]
Load on each CPU
CPU1:190
CPU2:185
CPU3:188
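A minimal sketch of one way to obtain such a balance (a greedy largest-cost-first assignment; SDPARA's actual scheme may differ). With the row costs from the figure it reproduces the loads 190 / 188 / 185, i.e. the same totals as above up to the ordering of CPUs:

```cpp
// Sketch (illustrative, not SDPARA's actual code): greedy load balancing of
// the rows of a sparse Schur complement matrix.  Each row has an estimated
// evaluation cost; the heaviest unassigned row goes to the least-loaded CPU.
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
  // Per-row costs taken from the figure above.
  std::vector<int> cost = {150, 40, 30, 20, 135, 20, 70, 10, 50, 5, 30, 3};
  const int nCPU = 3;

  // Process rows in decreasing order of cost.
  std::vector<int> order(cost.size());
  for (size_t i = 0; i < order.size(); ++i) order[i] = (int)i;
  std::sort(order.begin(), order.end(),
            [&](int a, int b) { return cost[a] > cost[b]; });

  std::vector<int> load(nCPU, 0);
  std::vector<int> owner(cost.size());
  for (int r : order) {
    int cpu = (int)(std::min_element(load.begin(), load.end()) - load.begin());
    owner[r] = cpu;          // assign row r to the currently least-loaded CPU
    load[cpu] += cost[r];
  }
  for (int c = 0; c < nCPU; ++c)
    std::printf("CPU%d: %d\n", c + 1, load[c]);   // 190, 188, 185
  return 0;
}
```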
17
CHOLESKY for Sparse Schur complement
Parallel sparse Cholesky factorization implemented in MUMPS
MUMPS adopts the multifrontal method
[Figure: the same sparse Schur complement matrix with per-row costs as in the ELEMENTS slide]
Memory storage on each processor should be consecutive.
The distribution used for ELEMENTS matches this method.
18
Computation time for SDPs from Polynomial Optimization Problem
[Chart: computation time in seconds (log scale) vs. #processors (1-32) for TOTAL, ELEMENTS, and CHOLESKY]
tsubasa, Xeon E5440 (2.83GHz), 8GB memory/node
Parallel sparse Cholesky achieves mild scalability. ELEMENTS attains a 24x speed-up on 32 CPUs.
19
ELEMENTS Load-balance on 32 CPUs
Only the first processor has a slightly heavier computation load.
[Chart: per-processor computation time in seconds (left axis, 0-0.4) and number of distributed elements (right axis, 0-1,400,000) for each of the 32 processors]
20
Automatic selection of sparse / dense SCM
Dense parallel Cholesky achieves higher scalability than sparse parallel Cholesky,
so dense becomes better for many processors.
We estimate both computation times using computation cost and scalability.
[Chart: computation time in seconds vs. #processors (1-32) for auto, dense, and sparse CHOLESKY]
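A minimal sketch of such an automatic selection under an assumed cost model; the constants and the model itself are illustrative placeholders, not SDPARA's actual estimator:

```cpp
// Sketch: choose dense vs. sparse parallel Cholesky for the Schur complement
// matrix from an estimated operation count and an assumed scalability factor.
#include <cstdio>

// Estimated time = serial_flops / (flop_rate * modeled speed-up on p CPUs).
double estimate(double serialFlops, int p, double parEff, double flopRate) {
  double speedup = 1.0 + parEff * (p - 1);   // crude scalability model (assumption)
  return serialFlops / (flopRate * speedup);
}

int main() {
  const int m = 20000;            // size of the Schur complement matrix (example)
  const double fillRatio = 0.05;  // assumed nonzero fraction of the sparse factor
  const int p = 16;               // number of processors

  double denseFlops  = m * (double)m * m / 3.0;            // ~m^3/3 for dense Cholesky
  double sparseFlops = denseFlops * fillRatio * fillRatio; // rough sparse estimate

  // Assumption: dense ScaLAPACK Cholesky scales better than sparse MUMPS Cholesky.
  double tDense  = estimate(denseFlops,  p, 0.9, 1e9);
  double tSparse = estimate(sparseFlops, p, 0.4, 1e9);

  std::printf("dense %.2fs  sparse %.2fs  -> use %s\n",
              tDense, tSparse, tDense < tSparse ? "dense" : "sparse");
  return 0;
}
```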
21
Sparse/Dense CHOLESKY for a small SDP from POP
[Chart: computation time in seconds (log scale) vs. #processors (1-32) for auto, dense, and sparse CHOLESKY]
tsubasa, Xeon E5440 (2.83GHz), 8GB memory/node
Only on 4 CPUs does the automatic selection fail (since the scalability of the sparse Cholesky is unstable on 4 CPUs).
22
Numerical Results
Comparison with PCSDP
  Sensor Network Problem generated by SFSDP
Multi Threading
  Quantum Chemistry
23
SDPs from Sensor Network

#sensors 1,000 (m = 16,450, density 1.23%)
#CPU          1      2      4      8     16
SDPARA     28.2   22.1   16.7   13.8   27.3
PCSDP      M.O.   1527    887    591    368

#sensors 35,000 (m = 527,096)
#CPU          1      2      4      8     16
SDPARA     1080    845    614    540    506
PCSDP      memory overflow if #sensors >= 4,000

(time unit: second; M.O. = memory overflow)
24
MPI + Multi Threading for Quantum Chemistry
N.4P.DZ.pqgt11t2p (m = 7,230)
[Chart: computation time in seconds (log scale) vs. #nodes (1-16) for PCSDP, SDPARA(1), SDPARA(2), SDPARA(4), and SDPARA(8)]
64x speed-up with 16 nodes × 8 threads
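A minimal sketch of the hybrid pattern (MPI ranks across nodes, threads inside each rank sharing the assigned rows); it is illustrative only, not SDPARA's actual code, and assumes OpenMP threading over locally owned rows:

```cpp
// Sketch: hybrid MPI + OpenMP evaluation of Schur complement rows.
// Rows are distributed cyclically over MPI ranks; within a rank, OpenMP
// threads share the work on the assigned rows.
#include <mpi.h>
#include <omp.h>
#include <vector>
#include <cstdio>

int main(int argc, char** argv) {
  int provided;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const int m = 7230;                      // number of constraints (from the slide)
  std::vector<int> myRows;
  for (int i = rank; i < m; i += size) myRows.push_back(i);

  double localWork = 0.0;
  #pragma omp parallel for reduction(+:localWork)
  for (int k = 0; k < (int)myRows.size(); ++k) {
    // Each thread would evaluate row myRows[k] of the Schur complement here.
    localWork += 1.0;                      // placeholder for the real row cost
  }

  std::printf("rank %d: %d rows, %d threads\n",
              rank, (int)myRows.size(), omp_get_max_threads());
  MPI_Finalize();
  return 0;
}
```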
25
Concluding Remarks & Future works
1. New parallel schemes for sparse Schur complement matrix
2. Reasonable scalability
3. Extremely large-scale SDPs with sparse Schur complement matrix
Future work: improvement of multi-threading for the sparse Schur complement matrix