HiCMA: Hierarchical Computations on Manycore Architectures

Hatem Ltaief
Extreme Computing Research Center
King Abdullah University of Science and Technology
Thuwal, Saudi Arabia

NVIDIA GTC - San Jose, April 5th, 2016


Outline

Motivations

QR-based Dynamically Weighted Halley for SVD

Level 3 BLAS

H-Matrices

HiCMA in a nutshell


Students/Collaborators/Support

Academic:

• Extreme Computing Research Center @ KAUST: W. Boukaram, A. Charara, G. Chavez, D. Keyes, D. Sukkari and G. Turkiyyah

• Tokyo Institute of Technology: R. Yokota

• Institut Polytechnique de Bordeaux - INRIA Bordeaux: M. Faverge

• Innovative Computing Laboratory - UTK

Industry:

Motivations


Hardware/Software Trends

• Flops are free

• On/off-chip network bandwidth is limited

• Increasing gap between flops and bandwidth

• Data movement is the most energy-consuming operation

• Synchronization-reducing

• Communication-reducing


Going hierarchical all the way down the software stack

• Recursive formulation (increase data locality)

• Old concept!

• Tree structure (depth-first vs. breadth-first tree traversal)

• Reduce vertical/horizontal data motion

• Without compromising concurrency

• Trade-off between data reuse and parallelism

QR-based Dynamically Weighted Halley for SVD


Standard SVD solver

One stage reduction:

Figure: Computational stages of the standard SVD algorithm: (a) bidiagonal reduction, (b) bidiagonal SVD solver and (c) back transformation.


Two-stage SVD solver

Two-stage reduction:

Figure: Reduction of a general dense matrix to bidiagonal form using a two-stage approach.


QDWH-SVD [1,2] solver

Three computational stages:

• Polar decomposition A = U_p H: iterative procedure using the matrix inversion free formulation based on QR/Cholesky factorization

• Symmetric eigensolver H = V Σ V^T to calculate the singular values and the right singular vectors

• Matrix-matrix multiplication U = U_p V to calculate the left singular vectors

[1] Y. Nakatsukasa and N. J. Higham, Stable and Efficient Spectral Divide and Conquer Algorithms for the Symmetric Eigenvalue Decomposition and the SVD, SISC, 35 (2013), pp. A1325-A1349.

[2] D. Sukkari, H. Ltaief and D. Keyes, A High Performance QDWH-SVD Solver Using Hardware Accelerators, submitted to Trans. on Math. Soft., 2015.
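The three stages compose into an SVD as in the short derivation below. It is a sketch for the real case, assuming U_p has orthonormal columns and H is symmetric positive semidefinite, so that the eigenvalues of H are the singular values of A. The iterative procedure that produces U_p is not shown.

```latex
\begin{align*}
A &= U_p H, & U_p^{T} U_p &= I, \quad H = H^{T} \succeq 0 \\
H &= V \Sigma V^{T}, & V^{T} V &= I \\
A &= U_p \left( V \Sigma V^{T} \right) = \left( U_p V \right) \Sigma V^{T} = U \Sigma V^{T}, & U &= U_p V \\
U^{T} U &= V^{T} U_p^{T} U_p V = V^{T} V = I
\end{align*}
```

The last line confirms that U = U_p V has orthonormal columns, so U Σ V^T is a valid SVD of A.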


Divide-and-Conquer

Figure: The recursive QDWH-SVD algorithm. The matrix A_{i,j} corresponds to the submatrix indexed j at the i-th level of recursion.


Performance results

[Plots: log of time (s) versus matrix size (1024 to 15360). (a) Ill-conditioned matrix: MKL-DGESVD, MKL-DGESDD and MAGMA-QDWH-SVD, annotated 2.3x. (b) Well-conditioned matrix: MKL-DGESVD and MAGMA-QDWH-SVD, annotated 3.5x.]

Figure: Performance comparisons of MAGMA-QDWH-SVD (GPU) against Intel MKL (CPU).


Performance results

[Plots: log of time (s) versus matrix size (1024 to 15360). (a) Ill-conditioned matrix: MAGMA-DGESVD, MAGMA-DGESDD and MAGMA-QDWH-SVD, annotated 18%. (b) Well-conditioned matrix: MAGMA-DGESVD and MAGMA-QDWH-SVD, annotated 2.1x.]

Figure: Performance comparisons against existing MAGMA SVD solvers (GPU).

Level 3 BLAS


Recursive formulation

• Usually used for Level 2 BLAS algorithms (e.g., panel factorization)

• Increase data locality

• Run at the cache level speed

• Again, not new and literature is quite rich

• And it does pay off for Level 3 BLAS too!


Triangular matrix-matrix multiplication (TRMM)

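[Diagram: B, with M rows, split into column blocks Bl and Br of widths N1 and N2; the triangular matrix A split into blocks Au, All and Ar with dimensions N1 and N2. Operations are labeled 1-RecTRMM, 2-GEMM and 3-RecTRMM.]

Figure: Illustrating a Right-Lower-NonTranspose-NonUnit recursive TRMM, splitting along the vertical direction. Operations are performed according to their numbering.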
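The numbered operations in the figure can be sketched in plain Python as follows. This is a minimal illustration, not the KBLAS implementation: the function names, the list-of-lists storage and the `base` cutoff are assumptions made for the example, and the mapping of the figure's Au, All and Ar blocks to the upper-left, lower-left and lower-right blocks of A is inferred from the Right-Lower case.

```python
def _trmm_base(B, A, lo, hi):
    """Non-recursive kernel: B[:, lo:hi] := B[:, lo:hi] * A[lo:hi, lo:hi] in place.

    A is lower triangular, so output column j only needs input columns k >= j.
    Sweeping j upward therefore never reads a value that was already overwritten.
    """
    for row in B:
        for j in range(lo, hi):
            row[j] = sum(row[k] * A[k][j] for k in range(j, hi))


def rec_trmm_right_lower(B, A, lo=0, hi=None, base=2):
    """Recursive in-place B := B * A for a lower triangular A (no transpose).

    B is an M x N list of rows, A is an N x N list of rows.
    """
    if hi is None:
        hi = len(A)
    n = hi - lo
    if n <= base:
        _trmm_base(B, A, lo, hi)
        return
    mid = lo + n // 2
    # 1-RecTRMM: Bl := Bl * A[lo:mid, lo:mid]
    rec_trmm_right_lower(B, A, lo, mid, base)
    # 2-GEMM: Bl += Br * A[mid:hi, lo:mid]  (Br is still unmodified here)
    for row in B:
        for j in range(lo, mid):
            row[j] += sum(row[k] * A[k][j] for k in range(mid, hi))
    # 3-RecTRMM: Br := Br * A[mid:hi, mid:hi]
    rec_trmm_right_lower(B, A, mid, hi, base)
```

The ordering matters for the in-place update: the GEMM in step 2 consumes the original Br, so Br can only be overwritten afterwards in step 3. In this sketch most of the work for large N lands in the GEMM step, which fits the slide's point that the recursive formulation pays off for Level 3 BLAS.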


Triangular matrix-matrix multiplication

[Diagram: tree with seven GEMM nodes above eight TRMM leaves.]

Figure: A hypothetical tree representing the operations executed by the recursive algorithm. Operations are to be executed by traversing the tree in depth-first order.


Performance results on NVIDIA GPUs

[Plot: performance (GFlop/s, 0 to 1200) versus matrix dimension (x1024, 1 to 14). Series: Theo-Peak, DGEMM, cuBLAS_DTRMM (OOP), KBLAS_DTRMM (IP), cuBLAS_DTRMM (IP).]

Figure: Performance comparisons of KBLAS DTRMM against that of IP and OOP cuBLAS DTRMM (Integration to CUDA 8.0).


Performance results on higher DLA computations using GPUs

[Plot: performance (GFlop/s, 0 to 1600) versus matrix dimension (x1024, 1 to 14). Series: Theo-Peak, DGEMM, DPOTRI + KBLAS_TRMM, DPOTRI + cuBLAS_TRMM.]

Figure: Performance speedup of matrix inversion in MAGMA library (DPOTRI) using KBLAS DTRMM vs. using cuBLAS DTRMM.

H-Matrices


Low Rank Approximation using H-Matrices

• Introduced by E. Tyrtyshnikov and revisited later by W. Hackbusch [1,2]

• R = U X V^T

[1] W. Hackbusch, A Sparse Matrix Arithmetic based on H-Matrices (Part I), Computing, 62(2), pp. 89-108, 1999.

[2] W. Hackbusch and B. Khoromskij, A Sparse H-Matrix Arithmetic (Part II): Application to multi-dimensional problems, Computing, 64(1), pp. 21-47, 2000.


Nice Properties of H-Matrices

• Memory footprint saving: from O(n^2) to O(k n log(n))

• Near-linear arithmetic complexity:
  MVM: from O(n^2) to O(k n log(n))
  MMM and A^{-1}: from O(n^3) to O(k^2 n log^2(n))
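To make the memory claim concrete, the sketch below counts stored entries for a deliberately simplified hierarchical layout. The layout is an assumption for illustration only: diagonal blocks are split in two until they reach size n_min and are then stored dense, while every off-diagonal block is stored in the form U X V^T with rank k. Real H-matrix partitions depend on the admissibility condition of the application, and the function names here are hypothetical.

```python
def dense_entries(n):
    """Entries stored for a dense n x n matrix."""
    return n * n


def hierarchical_entries(n, k, n_min):
    """Entries stored by the simplified hierarchical layout described above.

    Assumes n is n_min times a power of two so that blocks halve evenly.
    """
    if n <= n_min:
        return n * n                      # dense diagonal leaf
    half = n // 2
    # One off-diagonal block: U (half x k), X (k x k), V (half x k).
    per_block = half * k + k * k + half * k
    # Two diagonal sub-blocks (recursion) plus two off-diagonal low-rank blocks.
    return 2 * hierarchical_entries(half, k, n_min) + 2 * per_block


if __name__ == "__main__":
    for n in (1024, 4096, 16384):
        d = dense_entries(n)
        h = hierarchical_entries(n, k=8, n_min=32)
        print(n, d, h, round(d / h, 1))
```

Each level of the recursion adds about 2 k n entries and there are log2(n / n_min) levels, which is where the k n log(n) term comes from. The values k = 8 and n_min = 32 are borrowed from the H-MVM performance figure later in the deck.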


Examples of H-Matrices

[Image: block-partitioned matrix whose blocks carry integer labels between 3 and 9.]

Figure: Example of H-matrix approximation for BEM.


Examples of H-Matrices

[Image: block-partitioned matrices whose blocks carry integer labels between 9 and 32.]

Figure: Examples of H-matrix approximation for covariance matrix.


Tree Structure


H-MVM


Implementation Details

Dense MVM:

• Calculate the products V^T x for the leaves in the tree (batch MVM operation)

Upsweep:

• Sweep up the column basis tree, calculating the products of the inner nodes from the products of their children (block SpMV, BSR)

Mult:

• Also a block SpMV (BSR), per level of the tree

Downsweep:

• Transpose operation of the upsweep phase (block SpMV, BSR)

Pipelining:

• Overlapping computations is possible within the Dense MVM / Upsweep / Mult phases. Downsweep, however, requires a sync point!
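The first two phases can be sketched for a small column basis tree as below. This is a sequential, recursive illustration, whereas the slides describe batched, level-by-level GPU kernels. The `Node` layout, the per-child transfer matrices E (so that a parent basis is the stack of V_child E_child over its children) and the function names are assumptions made for the example.

```python
class Node:
    """Cluster-tree node (hypothetical layout for this sketch)."""

    def __init__(self, rows=None, basis=None, children=None, transfer=None):
        self.rows = rows          # (lo, hi) index range, leaves only
        self.basis = basis        # leaf basis V: one row per index, k columns
        self.children = children  # list of child nodes, inner nodes only
        self.transfer = transfer  # transfer matrix E linking this node to its parent


def mat_t_vec(M, v):
    """Return M^T v for a matrix M stored as a list of rows."""
    return [sum(M[i][j] * v[i] for i in range(len(M))) for j in range(len(M[0]))]


def upsweep(node, x):
    """Return V_node^T x restricted to the node's index range."""
    if node.children is None:
        lo, hi = node.rows
        # Dense MVM phase: V^T x at a leaf.
        return mat_t_vec(node.basis, x[lo:hi])
    acc = None
    for child in node.children:
        # Upsweep phase: the parent product is built from the child products,
        # each mapped through the transpose of the child's transfer matrix.
        contrib = mat_t_vec(child.transfer, upsweep(child, x))
        acc = contrib if acc is None else [a + c for a, c in zip(acc, contrib)]
    return acc
```

Calling `upsweep(root, x)` on a tree of such nodes returns the root-level coefficients without ever forming the basis of an inner node explicitly. The Mult and Downsweep phases are not covered by this sketch.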


Performance Results

Figure: Performance (GB/s) of H-MVM using k = 8 and n_min = 32.


Advanced Hierarchical Matrix Operations

Context:

• Very small sizes!

• Batch operation executions at each level of the tree

• (usually) Fixed sizes

• Recursive formulation, stressing register usage

• State-of-the-art implementations not well optimized for this scope or not supported

• NVIDIA K40 GPU (single GPU)


Advanced Hierarchical Matrix Operations

H-Matrix compression:

• Batch QR factorizations (square and tall-and-skinny)

• Batch SVD

H-Matrix computations:

• Level 3 BLAS: SYRK, TRSM

• Factorizations: POTRF

• Solves: POTRS, POSV


Performance Results (preliminary)

[Plot: Batch QR (square matrix). Performance (GFLOP/s, log2 scale, 0.125 to 128) versus matrix size (8 to 256). Series: KBLAS_10000, KBLAS_1000, CUBLAS_10000, CUBLAS_1000, MAGMA_10000, MAGMA_1000.]


Performance Results (preliminary)


Performance Results (preliminary)


Performance Results (preliminary)

[Plot: DSYRK_Batch. Performance (GFlop/s, log2 scale, 1 to 512) versus matrix size (8 to 512). Series: KBLAS_10240, KBLAS_1024, MAGMA_10240, MAGMA_1024.]


Performance Results (preliminary)

[Plot: DTRSM_Batch. Performance (GFlop/s, log2 scale, 0.125 to 512) versus matrix size (8 to 512). Series: KBLAS_10240, KBLAS_1024, MAGMA_IP_10240, MAGMA_IP_1024, CUBLAS_10240, CUBLAS_1024.]


Performance Results (preliminary)

[Plot: DPOTRF_Batch. Performance (GFlop/s, log2 scale, 1 to 256) versus matrix size (8 to 256). Series: KBLAS_1024, KBLAS_10240, MAGMA_1024, MAGMA_10240.]


Performance Results (preliminary)

[Plot: DPOTRS_Batch. Performance (GFlop/s, log2 scale, 0.25 to 512) versus matrix size (8 to 512). Series: KBLAS_10240, KBLAS_1024, MAGMA_10240, MAGMA_1024.]


Performance Results (preliminary)

[Plot: DPOSV_Batch. Performance (GFlop/s, log2 scale, 0.25 to 512) versus matrix size (8 to 512). Series: KBLAS_10240, KBLAS_1024, MAGMA_10240, MAGMA_1024.]

HiCMA in a nutshell


HiCMA’s Scope

The hierarchical computations for manycore architectures library aims to:

• Develop high performance numerical solvers: Dense / Data-Sparse (H)

• Increase data reuse thanks to a recursive/hierarchical formulation

• Exploit high level of concurrency

• Perform asynchronous execution

• Target various architectures: Shared/Distributed-memory, Accelerators/Co-processors, ARM


HiCMA Software Stack


HiCMA’s Backbone


HiCMA’s Horsepower


HiCMA’s MoC