
The MPACK : Multiple precision version of BLAS and LAPACK


DESCRIPTION

We are interested in the accuracy of linear algebra operations: the accuracy of the solution of linear equations, of eigenvalues and eigenvectors of matrices, etc. This is why we have been developing MPACK. MPACK consists of MBLAS and MLAPACK, multiple precision versions of BLAS and LAPACK, respectively. Features of MPACK are: (i) based on LAPACK 3.x, (ii) provides a reference implementation and an API, (iii) written in C++, rewritten from FORTRAN 77, (iv) supports GMP, MPFR, DD/QD, and binary128 as multiple precision arithmetic libraries, and (v) portable. The current version of MPACK is 0.7.0, which supports 76 MBLAS routines and 100 MLAPACK routines. The matrix-matrix multiplication routine has been accelerated using an NVIDIA C2050 GPU. All source code is available at: http://mplapack.sourceforge.net/


Page 1: The MPACK : Multiple precision version of BLAS and LAPACK


The MPACK : Multiple precision version of BLAS and LAPACK

NAKATA, Maho

RIKEN, Advanced Center for Computer and Communication

SIAM Conference on Applied Linear Algebra, Valencia, Spain, 2012/6/18-22, MS51, 11:25-11:50, June 21, Room 2.0

NAKATA, Maho The MPACK : Multiple precision version of BLAS and LAPACK

Page 2: The MPACK : Multiple precision version of BLAS and LAPACK

The MPACK : Multiple precision version of BLAS and LAPACK. http://mplapack.sourceforge.net/ NAKATA, Maho @ RIKEN

MPACK: multiple precision version of BLAS and LAPACK.

Providing building blocks, a reference implementation, and an Application Program Interface (API)

Version 0.7.0 (2012/6/16); status: MBLAS completed, and 100 MLAPACK routines.

Extensive testing: preparing test cases for all calculations.

Multi-platform: Linux/BSD/Mac/Win

Supported multiple precision types: GMP, MPFR, quadruple precision (binary128), DD, and QD, plus plain double

Written in C++: easier programming, faster programming.

Distributed under the 2-clause BSD license; redistribution and modification are permitted.


Page 3: The MPACK : Multiple precision version of BLAS and LAPACK

Overview

Introduction: Why do we need more accuracy?

Floating point numbers and multiple precision libraries.

Introduction of BLAS, LAPACK, and MPACK.

Summary.


Page 4: The MPACK : Multiple precision version of BLAS and LAPACK

Introduction: Why do we need more accuracy?


Page 5: The MPACK : Multiple precision version of BLAS and LAPACK

More accuracy is needed towards peta and exa scale computing

Exa scale computing: 10^23 FLOP for just one week of calculation.

Scientific computing may suffer from accuracy loss.


Page 8: The MPACK : Multiple precision version of BLAS and LAPACK

More accuracy is needed towards Peta and Exa scale computing

Iterative methods in double precision sometimes do not even converge [Hasegawa 2007].


Page 10: The MPACK : Multiple precision version of BLAS and LAPACK

More accuracy is needed towards peta and exa scale computing

Semidefinite programming (SDP): the condition number diverges at the optimum. Therefore, it can be very hard to obtain an accurate solution [Nakata et al. 2008], [Nakata 2009], [Waki-Nakata-Muramatsu].

[Figure: the 1-norm and the estimated 1-norm condition number of the Schur complement matrix (y-axis: 1e-10 to 1e+20, log scale) vs. number of iterations (0 to 90).]


Page 14: The MPACK : Multiple precision version of BLAS and LAPACK

Floating point numbers and multiple precision libraries.


Page 15: The MPACK : Multiple precision version of BLAS and LAPACK

The double precision: the most widely used floating point number format

“754-2008 IEEE Standard for Floating-Point Arithmetic”

The binary64 (aka double precision) format has 16 decimal significant digits

Widely used and very fast. Core i7 920: ~40 GFLOPS; RADEON HD7970: ~1000 GFLOPS; K computer: over 10 PFLOPS. Rounding error may occur for every arithmetic operation.


Page 16: The MPACK : Multiple precision version of BLAS and LAPACK

Dealing with round-off error by multiple precision calculation

Multiple precision: a brute force method against round-off error

Floating point numbers: approximations of the real numbers on a computer.

a + (b + c) ≠ (a + b) + c

Round-off error can occur in each arithmetic operation.

The double precision has only 16 decimal significant digits

1 + 0.0000000000000001 = 1

One solution: higher/multiple precision calculation.


Page 21: The MPACK : Multiple precision version of BLAS and LAPACK

What is a multiple precision arithmetic?

There are several ways to treat multiple precision on computers

GMP is a free library for arbitrary precision arithmetic, operating on signed integers, rational numbers, and floating point numbers: http://gmplib.org/

Significant digits can be arbitrarily large.

One of the fastest libraries, but arithmetic operations are still very slow compared to hardware doubles.


Page 23: The MPACK : Multiple precision version of BLAS and LAPACK

Other multiple/arbitrary precision arithmetic libraries

Other multiple/arbitrary precision arithmetic libraries:

The QD library: double-double (quad-double) precision: 32 (64) significant decimal digits, and FAST.

binary128: quadruple precision, defined in IEEE 754-2008.

IEEE 754-style multiple precision libraries: MPFR (real) and MPC (complex).


Page 25: The MPACK : Multiple precision version of BLAS and LAPACK

Introduction of BLAS, LAPACK, and MPACK.


Page 26: The MPACK : Multiple precision version of BLAS and LAPACK

What is BLAS and LAPACK?

BLAS: reference implementation of various types of vector-vector, matrix-vector, and matrix-matrix operations. Faster implementations are available: OpenBLAS (GotoBLAS2), Intel MKL, ATLAS, etc.

LAPACK: solves linear equations, eigenvalue problems, least squares fitting, singular value decomposition.

De facto standard libraries; often used without even noticing.

LAPACK web hits: 110,343,542 (Mon Dec 10 16:20:25 EST 2012). BLAS and LAPACK are very, very important libraries.


Page 28: The MPACK : Multiple precision version of BLAS and LAPACK

MPACK 0.7.0: Multiple precision version of BLAS and LAPACK. http://mplapack.sourceforge.net/ NAKATA, Maho @ RIKEN

MPACK: multiple precision version of BLAS and LAPACK.

Providing building blocks, a reference implementation, and an Application Program Interface (API)

Version 0.7.0 (2012/6/16); status: MBLAS completed, and 100 MLAPACK routines.

Extensive testing: preparing test cases for all calculations.

Multi-platform: Linux/BSD/Mac/Win

Supported multiple precision types: GMP, MPFR, quadruple precision (binary128), DD, and QD, plus plain double

Written in C++: easier programming, faster programming.

Distributed under the 2-clause BSD license; redistribution and modification are permitted.


Page 29: The MPACK : Multiple precision version of BLAS and LAPACK

MPACK 0.7.0: capability and non-capability

Version 0.7.0 (2012/6/16); status: MBLAS completed, and 100 MLAPACK routines.

Rgemm (matrix-matrix multiplication): OpenMP acceleration.

Rgemm with GPU acceleration (upcoming 0.8.0).

What MLAPACK can do: diagonalization of symmetric (Hermitian) matrices, LU decomposition, Cholesky decomposition, estimation of condition numbers, matrix inversion.

What MLAPACK cannot do yet: diagonalization of non-symmetric matrices, singular value decomposition, least squares fitting, QR factorization, etc.


Page 30: The MPACK : Multiple precision version of BLAS and LAPACK

Providing Application Program Interface: naming rule

Change in prefix: float, double → “R”eal; complex, double complex → “C”omplex.

daxpy, zaxpy→ Raxpy, Caxpy

dgemm, zgemm→ Rgemm, Cgemm

dsterf, dsyev→ Rsterf, Rsyev

dzabs1, dzasum→ RCabs1, RCasum


Page 31: The MPACK : Multiple precision version of BLAS and LAPACK

Supported MBLAS 0.7.0 routines (completed)

LEVEL1 MBLAS: Crotg Cscal Rrotg Rrot Rrotm CRrot Cswap Rswap CRscal Rscal Ccopy Rcopy Caxpy Raxpy Rdot Cdotc Cdotu RCnrm2 Rnrm2 Rasum iCasum iRamax RCabs1 Mlsame Mxerbla

LEVEL2 MBLAS: Cgemv Rgemv Cgbmv Rgbmv Chemv Chbmv Chpmv Rsymv Rsbmv Rspmv Ctrmv Rtrmv Ctbmv Ctpmv Rtpmv Ctrsv Rtrsv Ctbsv Rtbsv Ctpsv Rger Cgeru Cgerc Cher Chpr Cher2 Chpr2 Rsyr Rspr Rsyr2 Rspr2

LEVEL3 MBLAS: Cgemm Rgemm Csymm Rsymm Chemm Csyrk Rsyrk Cherk Csyr2k Rsyr2k Cher2k Ctrmm Rtrmm Ctrsm Rtrsm


Page 32: The MPACK : Multiple precision version of BLAS and LAPACK

Supported MLAPACK 0.7.0 routines: 100 routines

Mutils Rlamch Rlae2 Rlaev2 Claev2 Rlassq Classq Rlanst Clanht Rlansy Clansy Clanhe Rlapy2 Rlarfg Rlapy3 Rladiv Cladiv Clarfg Rlartg Clartg Rlaset Claset Rlasr Clasr Rpotf2 Clacgv Cpotf2 Rlascl Clascl Rlasrt Rsytd2 Chetd2 Rsteqr Csteqr Rsterf Rlarf Clarf Rorg2l Cung2l Rorg2r Cung2r Rlarft Clarft Rlarfb Clarfb Rorgqr Cungqr Rorgql Cungql Rlatrd Clatrd Rsytrd Chetrd Rorgtr Cungtr Rsyev Cheev Rpotrf Cpotrf Clacrm Rtrti2 Ctrti2 Rtrtri Ctrtri Rgetf2 Cgetf2 Rlaswp Claswp Rgetrf Cgetrf Rgetri Cgetri Rgetrs Cgetrs Rgesv Cgesv Rtrtrs Ctrtrs Rlasyf Clasyf Clahef Clacrt Claesy Crot

Cspmv Cspr Csymv Csyr iCmax1 RCsum1 Rpotrs Rposv Rgeequ Rlatrs Rlange Rgecon Rlauu2 Rlauum Rpotri Rpocon


Page 33: The MPACK : Multiple precision version of BLAS and LAPACK

Providing APIs: difference in calling

The difference is call by value vs. call by reference. MBLAS/MLAPACK:

Rgemm("n", "n", n, n, n, alpha, A, n, B, n, beta, C, n);

Rgetrf(n, n, A, n, ipiv, &info);

Rgetri(n, A, n, ipiv, work, lwork, &info);

Rsyev("V", "U", n, A, n, w, work, &lwork, &info);

BLAS/LAPACK:

dgemm_f77("N", "N", &n, &n, &n, &One, A, &n, A, &n, &Zero, C, &n);

dgetri_f77(&n, A, &n, ipiv, work, &lwork, &info);


Page 34: The MPACK : Multiple precision version of BLAS and LAPACK

Programming model

Required types: INTEGER, REAL, COMPLEX, LOGICAL.

Switching MP libs by “typedef”: REAL → mpf_class, qd_real, dd_real, etc.

Requires elementary functions (log, sin, etc.); for these, double accuracy is usually enough.

Currently supported MP libs: GMP, MPFR, QD, DD, binary128, and double.

Intermediate functions absorb the differences between MP libs.

You can program using MP types almost the same as “double” in C++ (cf. SDPA-DD and SDPA-GMP).


Page 35: The MPACK : Multiple precision version of BLAS and LAPACK

Extraction from MBLAS codes

Caxpy: Complex version of axpy

void Caxpy(INTEGER n, COMPLEX ca, COMPLEX *cx, INTEGER incx, COMPLEX *cy, INTEGER incy)
{
    REAL Zero = 0.0;
    if (n <= 0)
        return;
    if (RCabs1(ca) == Zero)
        return;
    INTEGER ix = 0;
    INTEGER iy = 0;
    if (incx < 0)
        ix = (-n + 1) * incx;
    if (incy < 0)
        iy = (-n + 1) * incy;
    for (INTEGER i = 0; i < n; i++) {
        cy[iy] = cy[iy] + ca * cx[ix];
        ix = ix + incx;
        iy = iy + incy;
    }
}


Page 36: The MPACK : Multiple precision version of BLAS and LAPACK

Extraction from MLAPACK source code

Rsyev; diagonalization of real symmetric matrices

    Rlascl(uplo, 0, 0, One, sigma, n, n, A, lda, info);
}
//Call DSYTRD to reduce symmetric matrix to tridiagonal form.
inde = 1;
indtau = inde + n;
indwrk = indtau + n;
llwork = *lwork - indwrk + 1;
Rsytrd(uplo, n, &A[0], lda, &w[0], &work[inde - 1], &work[indtau - 1],
       &work[indwrk - 1], llwork, &iinfo);
//For eigenvalues only, call DSTERF. For eigenvectors, first call
//DORGTR to generate the orthogonal matrix, then call DSTEQR.
if (!wantz) {
    Rsterf(n, &w[0], &work[inde - 1], info);
} else {
    Rorgtr(uplo, n, A, lda, &work[indtau - 1], &work[indwrk - 1], llwork,
           &iinfo);
    Rsteqr(jobz, n, w, &work[inde - 1], A, lda, &work[indtau - 1], info);
}
//If matrix was scaled, then rescale eigenvalues appropriately.
if (iscale == 1) {
    if (*info == 0) {


Page 37: The MPACK : Multiple precision version of BLAS and LAPACK

Facts of MPACK (MBLAS/MLAPACK)

A Google search for “Multiple precision BLAS” returns only my pages or related pages.

Download count: 2520 (2012/6/21)


Page 38: The MPACK : Multiple precision version of BLAS and LAPACK

Quality assurance of MBLAS

BLAS uses only algebraic manipulations

Input possible values and check against BLAS; this can detect algorithmic bugs. That's almost OK.

for (int k = MIN_K; k < MAX_K; k++) {
  for (int n = MIN_N; n < MAX_N; n++) {
    for (int m = MIN_M; m < MAX_M; m++) {
      ...
      for (int lda = minlda; lda < MAX_LDA; lda++) {
        for (int ldb = minldb; ldb < MAX_LDB; ldb++) {
          for (int ldc = max(1, m); ldc < MAX_LDC; ldc++) {
            Rgemm(transa, transb, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);
            dgemm_f77(transa, transb, &m, &n, &k, &alphad, Ad, &lda,
                      Bd, &ldb, &betad, Cd, &ldc);
            ...
            diff = vec_diff(C, Cd, MAT_A(ldc, n), 1);
            if (fabs(diff) > EPSILON) {
              printf("#error %lf!!\n", diff);
              errorflag = TRUE;
            }


Page 39: The MPACK : Multiple precision version of BLAS and LAPACK

Quality assurance of MLAPACK

Very difficult: LAPACK introduces “convergence”

Input possible values and compare the results of MLAPACK and LAPACK. LAPACK introduces “convergence”.

The two are essentially different, but many routines still use only algebraic operations.

Bugs can also be detected when MPACK is used in actual research (Waki et al.).


Page 40: The MPACK : Multiple precision version of BLAS and LAPACK

Performance of Raxpy

on Intel Core i7 920 (2.6GHz) / Ubuntu 10.04 / gcc 4.4.3

y ← αx + y

Raxpy performance in FLOPS; multithreaded (OpenMP) results in parentheses.

MP library (sign. digits)   FLOPS (OpenMP)
DD (32)                     130 (570) M
QD (64)                     13.7 (67) M
GMP (77)                    11.3 (45) M
GMP (154)                   7.6 (32) M
MPFR (154)                  3.7 (17) M
GotoBLAS (16)               1.5 G


Page 41: The MPACK : Multiple precision version of BLAS and LAPACK

Performance of Rgemv

on Intel Core i7 920 (2.6GHz) / Ubuntu 10.04 / gcc 4.4.3

y ← αAx + βy

Rgemv performance in FLOPS.

MP library (sign. digits)   FLOPS
DD (32)                     140 M
QD (64)                     13 M
GMP (77)                    11.1 M
MPFR (77)                   4.7 M
GMP (154)                   7.1 M
MPFR (154)                  3.7 M
GotoBLAS (16)               3.8 G


Page 42: The MPACK : Multiple precision version of BLAS and LAPACK

Performance of Rgemm

on Intel Core i7 920 (2.6GHz) / Ubuntu 10.04 / gcc 4.4.3

Rgemm performance in Flops.

C ← αAB + βC

MP library (sign. digits)   FLOPS (OpenMP)
DD (32)                     136 (605) M
QD (64)                     13.9 (63) M
GMP (77)                    11.5 (44) M
MPFR (77)                   4.6 (20) M
GMP (154)                   7.2 (28) M
MPFR (154)                  3.7 (16) M
GotoBLAS (16)               42.5 G


Page 43: The MPACK : Multiple precision version of BLAS and LAPACK

Performance of Rgemm: double-double precision on Westmere-EP

Intel Composer, Intel Westmere-EP, 40 cores, 2.4 GHz: approx. 5 GFLOPS


Page 44: The MPACK : Multiple precision version of BLAS and LAPACK

Performance of Rgemm: GMP (154 decimal digits) on Westmere-EP

Intel Composer, Intel Westmere-EP, 40 cores, 2.4 GHz: approx. 0.2 GFLOPS


Page 45: The MPACK : Multiple precision version of BLAS and LAPACK

Performance of Rgemm: double-double (quasi quadruple precision) on Magny-Cours, 48 cores

GCC 4.6, Magny-Cours 2.4 GHz, 48 cores: approx. 3 GFLOPS


Page 46: The MPACK : Multiple precision version of BLAS and LAPACK

Performance of Rgemm: binary128 (true quadruple precision) on Magny-Cours, 48 cores

GCC 4.6, Magny-Cours 2.4 GHz, 48 cores: approx. 0.3 GFLOPS


Page 47: The MPACK : Multiple precision version of BLAS and LAPACK

Performance of Rgemm: GMP (154 decimal digits) on Magny-Cours, 48 cores

GCC 4.6, Magny-Cours 2.4 GHz, 48 cores: approx. 0.15 GFLOPS


Page 48: The MPACK : Multiple precision version of BLAS and LAPACK

Performance of Rgemm: double-double precision on NVIDIA C2050

CUDA 3.2, NVIDIA C2050: 16 GFLOPS, fast and stable!

[Figure: Rgemm double-double performance on NVIDIA C2050: GFLOPS (0 to 16) vs. matrix dimension (0 to 6000), with NN/NT/TN/TT kernel and total curves.]


Page 49: The MPACK : Multiple precision version of BLAS and LAPACK

Performance of Rsyev

on Intel Core i7 920 (2.6GHz) / Ubuntu 10.04 / gcc 4.4.3

Rsyev performance (symmetric 300x300 matrix, obtaining eigenvalues and eigenvectors), in seconds

AX = X diag[λ1, λ2, · · · , λN]

MP library (sign. digits)   seconds
DD (32)                     2.4
QD (64)                     25.6
GMP (77)                    36.9
MPFR (77)                   78.9
GMP (154)                   64.0
MPFR (154)                  111
GotoBLAS (16)               0.1


Page 50: The MPACK : Multiple precision version of BLAS and LAPACK

MPACK 0.7.0: Multiple precision version of BLAS and LAPACK

http://mplapack.sourceforge.net/ NAKATA, Maho @ RIKEN

MPACK: multiple precision version of BLAS and LAPACK.

Providing building blocks, a reference implementation, and an Application Program Interface (API)

Version 0.7.0 (2012/6/16); status: MBLAS completed, and 100 MLAPACK routines.

2500+ downloads so far.

A faster double-double implementation of Rgemm on GPU is available.
