Toward Fast Eigensolvers for Electronic Structure Calculations using Low-rank Approximations

K. Akbudak, A. Charara, D. Keyes, H. Ltaief, A. Mikhalev, and D. Sukkari

Extreme Computing Research Center
King Abdullah University of Science and Technology

SIAM Conference on Computational Science and Engineering
Spokane, WA, USA, Feb 25 - Mar 1, 2019

H. Ltaief 1 / 41



Acknowledgments

Students/Collaborators

Extreme Computing Research Center @ KAUST

KAUST Spintronics Theory Group: S. Laref and A. Manchon

KAUST Computational Physics & Materials Science: U. Schwingenschlögl and N. Singh

KAUST Supercomputing Lab: S. Feki, B. Hadri, and Z. Zhu

INRIA/INP/LaBRI Bordeaux, France: Runtime/HiePACS Teams

Innovative Computing Laboratory @ UTK: PLASMA/MAGMA/SLATE/PaRSEC Teams

American University of Beirut, Lebanon: G. Turkiyyah

Max-Planck Institute @ Leipzig, Germany: R. Kriemann

University of Oxford: Y. Nakatsukasa

University of Manchester: N. Higham

EPFL, Switzerland: D. Kressner


Acknowledgments

Vendors

NVIDIA GPU Research Center

Intel Parallel Computing Center

Cray Center of Excellence


Acknowledgments

The Hourglass Revisited

@KAUST_ECRC

https://www.facebook.com/ecrckaust


Acknowledgments

Having fun at SIAMCSE19!


A Hostile Hardware Landscape

High Performance Computing: The Top500 List


A Hostile Hardware Landscape

It is getting Moore and Moore hot here!


A Hostile Hardware Landscape

Hardware Trends: Energy / Data Movement Matters!

Operation             2011       2018
DP FLOP               100 pJ     10 pJ
DP DRAM Read          4800 pJ    1920 pJ
Local interconnect    7500 pJ    2500 pJ
Cross system          9000 pJ    3500 pJ

Source: John Shalf, LBNL
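The table reads more sharply when each cost is expressed in units of one double-precision FLOP of the same year. The sketch below does only that arithmetic; the dictionary layout and the function name are illustrative choices, not something taken from the slides.

```python
# Energy per operation in picojoules, copied from the table above.
ENERGY_PJ = {
    "DP FLOP": {2011: 100, 2018: 10},
    "DP DRAM Read": {2011: 4800, 2018: 1920},
    "Local interconnect": {2011: 7500, 2018: 2500},
    "Cross system": {2011: 9000, 2018: 3500},
}


def cost_in_flops(operation: str, year: int) -> float:
    """Energy of one `operation`, in units of one DP FLOP of the same year."""
    return ENERGY_PJ[operation][year] / ENERGY_PJ["DP FLOP"][year]


if __name__ == "__main__":
    for op in ENERGY_PJ:
        print(f"{op:20s} 2011: {cost_in_flops(op, 2011):6.1f}"
              f"   2018: {cost_in_flops(op, 2018):6.1f}")
```

By this measure a DRAM read went from 48 FLOPs' worth of energy to 192, so arithmetic became cheaper much faster than data movement did.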


A Hostile Hardware Landscape

Algorithmic Trends

Figure: algorithmic trends plotted against hardware trends, 2012 to 2018 (empty axes).

A Hostile Hardware Landscape

Algorithmic Trends

Figure: algorithmic trends plotted against hardware trends, 2012 to 2018.

Algorithmic trends: block algorithms, tile algorithms, dynamic runtime systems, data motion reduction, synchronization reduction, fine-grained parallelism.

Hardware trends (Intel): SNB AVX, HSW AVX2, KNL AVX512, SKL AVX512.

Hardware trends (NVIDIA GPUs): K20/K40/K80, P100, V100.

A Hostile Hardware Landscape

Algorithmic Trends

Figure: the algorithmic-versus-hardware trends chart, 2012 to 2018, with the same labels as on the previous slide, now annotated "100x Faster".

A Hostile Hardware Landscape

Algorithmic Trends

Figure: the algorithmic-versus-hardware trends chart, 2012 to 2018, annotated "100x Faster", with two entries added to the algorithmic trends: Approximations and Mixed Precisions, and Batch BLAS.

A Hostile Hardware Landscape

The HiCMA Library

Available at http://github.com/ecrc/hicma


A Hostile Hardware Landscape

Algorithmic Recipes For Exascale Computing

Fine-grain parallelism

Reducing data motion

Reducing synchronization

Approximations and Mixed Precisions

Power efficiency

Dynamic runtime systems

10^18

Solving Challenging Scientific Problems

The KAUST Supercomputer: Shaheen-2 #20


Solving Challenging Scientific Problems

Materials Science

Structural and vibrational analysis applied to problems in computational physics and chemistry, such as electronic and band structure calculations

(a) Problem Definition. (b) Electronic structure.

Figure: Design of new materials.

w/ S. Laref, A. Manchon, N. Singh, U. Schwingenschlögl, and Z. Zhu


Solving Challenging Scientific Problems

The Self-Consistency Cycle: VASP, Fleur, and Wien2K

Generate A operator

Ax = λBx

Generate B operator

20 iterations at least
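Each pass of the cycle solves the generalized eigenproblem Ax = λBx. Assuming A is symmetric and B is symmetric positive definite (the slide does not state this), one standard way to reduce it to an ordinary symmetric eigenproblem is through a Cholesky factor L of B. The derivation is given here as background, not as the specific method used by VASP, Fleur, or Wien2K.

```latex
\begin{aligned}
B &= L L^{T}, \\
A x = \lambda B x
  &\iff \left(L^{-1} A L^{-T}\right)\left(L^{T} x\right) = \lambda \left(L^{T} x\right), \\
C y &= \lambda y, \qquad C = L^{-1} A L^{-T}, \quad y = L^{T} x .
\end{aligned}
```

Under these assumptions the factorization of B is paid again on every pass that regenerates B, which is one reason a fast Cholesky matters for the whole cycle.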


Solving Challenging Scientific Problems

The Big Picture


The Common Denominator

This Highly Ranked Guy!


The Common Denominator

The Cholesky Factorization

The Cholesky factorization of an N × N real symmetric, positive-definite matrix A has the form

A = L L^T,

where L is an N × N real lower triangular matrix with positive diagonal elements.
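As a concrete reference point, the definition above can be turned into a few lines of unblocked code. This is a minimal pure-Python sketch for small matrices, written for readability; it is not how DPOTRF or any of the libraries named in this talk implement the factorization, and the function name is an illustrative choice.

```python
import math
from typing import List

Matrix = List[List[float]]


def cholesky(a: Matrix) -> Matrix:
    """Return lower-triangular L with a = L * L^T, for symmetric positive-definite a."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: what is left of a[j][j] after removing earlier columns.
        diag = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
        if diag <= 0.0:
            raise ValueError("matrix is not positive definite")
        l[j][j] = math.sqrt(diag)
        # Entries below the diagonal in column j.
        for i in range(j + 1, n):
            off = a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = off / l[j][j]
    return l


if __name__ == "__main__":
    a = [[4.0, 12.0, -16.0], [12.0, 37.0, -43.0], [-16.0, -43.0, 98.0]]
    for row in cholesky(a):
        print(row)
```

The positive square root on the diagonal is what makes L unique, matching the "positive diagonal elements" condition in the definition.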


The Common Denominator

PLASMA/MAGMA/CHAMELEON DPOTRF from this century

Figure: Tile Algorithms.
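The figure itself did not survive extraction, so the sketch below lists the kernel calls that a tile Cholesky typically issues. It assumes the common right-looking ordering and uses LAPACK/BLAS kernel names as labels; the function names and the tuple encoding of tile indices are illustrative, and the code only enumerates tasks without doing any arithmetic.

```python
from typing import Dict, List, Tuple

Task = Tuple[str, Tuple[int, ...]]


def tile_cholesky_tasks(nt: int) -> List[Task]:
    """List the kernel calls of a right-looking tile Cholesky on an nt x nt tile grid.

    Only tiles in the lower triangle are referenced.
    """
    tasks: List[Task] = []
    for k in range(nt):
        tasks.append(("POTRF", (k, k)))            # factor diagonal tile (k, k)
        for m in range(k + 1, nt):
            tasks.append(("TRSM", (m, k)))         # solve for tile (m, k) below it
        for m in range(k + 1, nt):
            tasks.append(("SYRK", (m, k)))         # update diagonal tile (m, m) with (m, k)
            for n in range(k + 1, m):
                tasks.append(("GEMM", (m, n, k)))  # update tile (m, n) with (m, k), (n, k)
    return tasks


def count_by_kernel(nt: int) -> Dict[str, int]:
    """Number of tasks per kernel type."""
    counts: Dict[str, int] = {}
    for name, _ in tile_cholesky_tasks(nt):
        counts[name] = counts.get(name, 0) + 1
    return counts


if __name__ == "__main__":
    print(count_by_kernel(4))
```

The GEMM count grows cubically with the number of tiles per dimension, so the trailing updates dominate the work; these are also the fine-grained tasks a dynamic runtime system can schedule as soon as their inputs are ready.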


The Common Denominator

Exploiting the hierarchical low-rankness of these matrices!

Ubiquitous in computational science and engineering

Symmetric, positive-definite matrix structure

(Apparently) Dense matrices

Often data-sparse

Decay of parameter correlations with distance

Hierarchically of low rank
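A toy case makes "apparently dense, often data-sparse" concrete. The sketch builds a small covariance-like matrix from the 1D exponential kernel exp(-|x_i - x_j|) and estimates the numerical rank of a diagonal and an off-diagonal tile. The kernel, the point spacing, the tile size, and the use of Gaussian elimination with full pivoting as a cheap stand-in for an SVD are all assumptions made for this illustration. In this idealized setting the off-diagonal tile has rank exactly 1, which real application matrices will not match.

```python
import math
from typing import List

Matrix = List[List[float]]


def exp_kernel_matrix(n: int, h: float) -> Matrix:
    """Dense n x n matrix with entries exp(-|i - j| * h) for equispaced 1D points."""
    return [[math.exp(-abs(i - j) * h) for j in range(n)] for i in range(n)]


def tile(a: Matrix, row0: int, col0: int, nb: int) -> Matrix:
    """Copy the nb x nb tile whose top-left corner is (row0, col0)."""
    return [row[col0:col0 + nb] for row in a[row0:row0 + nb]]


def numerical_rank(t: Matrix, tol: float) -> int:
    """Count full-pivoting elimination pivots larger than tol times the first pivot."""
    a = [row[:] for row in t]
    rows, cols = len(a), len(a[0])
    rank, first = 0, None
    for s in range(min(rows, cols)):
        # Largest remaining entry becomes the pivot.
        p, q = max(((i, j) for i in range(s, rows) for j in range(s, cols)),
                   key=lambda ij: abs(a[ij[0]][ij[1]]))
        pivot = abs(a[p][q])
        if first is None:
            first = pivot
            if first == 0.0:
                return 0
        if pivot <= tol * first:
            break
        a[s], a[p] = a[p], a[s]                      # row swap
        for row in a:
            row[s], row[q] = row[q], row[s]          # column swap
        for i in range(s + 1, rows):                 # eliminate below the pivot
            f = a[i][s] / a[s][s]
            for j in range(s, cols):
                a[i][j] -= f * a[s][j]
        rank += 1
    return rank


if __name__ == "__main__":
    k = exp_kernel_matrix(16, 0.1)
    print("diagonal tile rank:    ", numerical_rank(tile(k, 0, 0, 8), 1e-8))
    print("off-diagonal tile rank:", numerical_rank(tile(k, 8, 0, 8), 1e-8))
```

Every entry of the matrix is nonzero, so it looks dense, yet the off-diagonal tile can be stored as a single outer product.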


The Tile Low-Rank Cholesky Matrix Factorization

The HiCMA Library

Available at http://github.com/ecrc/hicma


The Tile Low-Rank Cholesky Matrix Factorization

Matrix Rank X-ray: Rank Distribution

Hamiltonian matrix w/ nb=16 and 1e-8 accuracy threshold

Figure: rank distribution plot (horizontal axis 2 to 16, vertical axis 0 to 4 × 10^4).

The Tile Low-Rank Cholesky Matrix Factorization

Matrix Rank X-ray: Rank Distribution

Hamiltonian matrix w/ nb=16 and 1e-6 accuracy threshold

Figure: rank distribution plot (horizontal axis 2 to 16, vertical axis 0 to 2 × 10^4).

The Tile Low-Rank Cholesky Matrix Factorization

Matrix Rank X-ray: Rank Distribution

Hamiltonian matrix w/ nb=16 and 1e-4 accuracy threshold

Figure: rank distribution plot (horizontal axis 2 to 12, vertical axis 0 to 16000).

The Tile Low-Rank Cholesky Matrix Factorization

Dense Linear Algebra Renaissance: Tile Low-Rank as a Pragmatic Approach

T. Mary, PhD Dissertation, Block Low-Rank multifrontal solvers: complexity, performance, and scalability, 2017.

C. Weisbecker, PhD Dissertation, Improving multifrontal solvers by means of algebraic Block Low-Rank representations, 2013.


The Tile Low-Rank Cholesky Matrix Factorization

HiCMA Vs Intel MKL on Shared-Memory Systems

Geospatial statistic w/ square exp. kernel and acc=1e-8

Figure: time (s, log scale from 10^0 to 10^3) versus matrix size (27K, 40K, 54K, 68K, 81K, 108K, 135K, 176K, 230K, 297K). Series: MKL-SNB, MKL-HSW, MKL-SKL, HiCMA-SNB, HiCMA-HSW, HiCMA-SKL.

The Tile Low-Rank Cholesky Matrix Factorization

HiCMA Vs ScaLAPACK on Distributed-Memory Systems

Figure: time (s, log scale from 10^0 to 10^4) versus matrix size (54K, 81K, 108K, 135K, 189K, 270K, 351K, 459K, 594K). Series: ScaLAPACK on 16, 32, 64, 128, and 256 nodes; HiCMA-TLR Cholesky-16.

K. Akbudak, H. Ltaief, A. Mikhalev, A. Charara, and D. E. Keyes, Exploiting Data Sparsity for Large-Scale Matrix Computations, EuroPar, 2018.

The Tile Low-Rank Cholesky Matrix Factorization

Strong Scalability on Shaheen-2 Cray Haswell

Figure: time (minutes, axis ticks from 3 to 133) versus matrix size (1M, 2M, 4M, 5M, 6M, 8M, 11M). Series: HiCMA-16, HiCMA-32, HiCMA-64, HiCMA-128, HiCMA-256, HiCMA-512.

K. Akbudak, H. Ltaief, A. Mikhalev, A. Charara, and D. E. Keyes, Exploiting Data Sparsity for Large-Scale Matrix Computations, EuroPar, 2018.

The Tile Low-Rank Cholesky Matrix Factorization

Strong Scalability on Cray Skylake: Turbo ON

Figure: time (minutes, axis ticks from 3 to 133) versus matrix size (1M, 2M, 4M, 5M, 6M, 8M, 11M). Series: HiCMA-16, HiCMA-32, HiCMA-64, HiCMA-128, HiCMA-256.

K. Akbudak, H. Ltaief, A. Mikhalev, A. Charara, and D. E. Keyes, Exploiting Data Sparsity for Large-Scale Matrix Computations, EuroPar, 2018.

The Tile Low-Rank Cholesky Matrix Factorization

Traces Chameleon: Dense dpotrf time=18.1s on 4 nodes of Shaheen-2 with a matrix size of 54K

The Tile Low-Rank Cholesky Matrix Factorization

Traces HiCMA: Data-sparse dpotrf time=1.8s on 4 nodes of Shaheen-2 with a matrix size of 54K

The Tile Low-Rank Cholesky Matrix Factorization

AL4SAN: Abstraction Layer For Standardizing APIs of Task-Based Engines (https://github.com/ecrc/al4san)

The abstraction layer for standardizing APIs of task-based engines (AL4SAN) is designed as a lightweight software library, which provides a collection of APIs to unify the expression of tasks and their data dependencies from existing dynamic engines. AL4SAN supports various dynamic runtime systems relying on compiler infrastructure technology or on library-defined APIs. It features an abstraction of task-based engines and, therefore, enables a single-code application to assess various runtimes and their respective scheduling components. The goal of AL4SAN is not to create yet another runtime system, but to further leverage the user-obliviousness of the underlying complex hardware architectures at the dawn of the Exascale age.
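To make the "single-code application, several runtimes" idea tangible, here is a toy sketch of such a layer. Every name in it (`Backend`, `insert_task`, `wait_all`, `EagerBackend`, `DeferredBackend`, `run_app`) is hypothetical and chosen for this illustration; none of it is AL4SAN's actual API, and the two backends are simple stand-ins, not wrappers around StarPU, PaRSEC, QUARK, or OpenMP.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Sequence, Set, Tuple


class Backend(ABC):
    """Hypothetical common interface: one subclass per task-based engine."""

    @abstractmethod
    def insert_task(self, func: Callable[[], None],
                    reads: Sequence[str], writes: Sequence[str]) -> None: ...

    @abstractmethod
    def wait_all(self) -> None: ...


class EagerBackend(Backend):
    """Runs each task immediately, in insertion order."""

    def __init__(self) -> None:
        self.log: List[str] = []

    def insert_task(self, func, reads, writes) -> None:
        func()
        self.log.append(func.__name__)

    def wait_all(self) -> None:
        pass


class DeferredBackend(Backend):
    """Queues tasks, infers dependencies from data accesses, runs them at wait_all()."""

    def __init__(self) -> None:
        self.tasks: List[Tuple[Callable[[], None], Set[int]]] = []
        self.last_writer: Dict[str, int] = {}
        self.readers: Dict[str, List[int]] = {}
        self.log: List[str] = []

    def insert_task(self, func, reads, writes) -> None:
        idx, deps = len(self.tasks), set()
        for h in reads:                               # read-after-write
            if h in self.last_writer:
                deps.add(self.last_writer[h])
        for h in writes:                              # write-after-write, write-after-read
            if h in self.last_writer:
                deps.add(self.last_writer[h])
            deps.update(self.readers.get(h, []))
        for h in reads:
            self.readers.setdefault(h, []).append(idx)
        for h in writes:
            self.last_writer[h], self.readers[h] = idx, []
        self.tasks.append((func, deps))

    def wait_all(self) -> None:
        done: Set[int] = set()
        pending = set(range(len(self.tasks)))
        while pending:
            # Deliberately pick the newest ready task to show reordering is legal.
            i = max(t for t in pending if self.tasks[t][1] <= done)
            self.tasks[i][0]()
            self.log.append(self.tasks[i][0].__name__)
            done.add(i)
            pending.remove(i)
        self.tasks, self.last_writer, self.readers = [], {}, {}


def run_app(backend: Backend) -> Dict[str, int]:
    """Application code written once against the common interface."""
    data = {"x": 1, "y": 0, "z": 0}

    def double_x() -> None:
        data["x"] *= 2

    def set_z() -> None:
        data["z"] = 7

    def copy_x_to_y() -> None:
        data["y"] = data["x"]

    backend.insert_task(double_x, reads=["x"], writes=["x"])
    backend.insert_task(set_z, reads=[], writes=["z"])
    backend.insert_task(copy_x_to_y, reads=["x"], writes=["y"])
    backend.wait_all()
    return data


if __name__ == "__main__":
    for b in (EagerBackend(), DeferredBackend()):
        print(type(b).__name__, run_app(b), b.log)
```

The application function never names a backend, so swapping the scheduler changes the execution order but not the result, as long as the declared reads and writes are honored.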

Software Infrastructure: Applications, AL4SAN Frontend, AL4SAN Backends, Runtimes (StarPU, PaRSEC, QUARK, OpenMP), OS/Hardware.

Runtime Support
- OpenMP-LLVM
- PaRSEC
- StarPU
- QUARK

AL4SAN v1.0 Features
- Standardizing task-based runtime systems
- Using a lightweight abstraction layer
- Improving user productivity
- Supporting different hardware architectures
- Performing with a relatively limited overhead (up to 10%)

AL4SAN Roadmap
- Extending to more engines
- Leveraging data abstraction
- Integrating C++ constructs
- Composing across dynamic runtime systems
- Adding support to more algorithms and applications

Other poster panels: Cholesky Pseudo-Code, Task Interface (contents not recoverable).

Performance Assessment

Figure panels: Dense Cholesky on Skylake; Low Rank Cholesky on Skylake; Low Rank Cholesky on Shaheen-2; Dense Cholesky on 8x Nvidia K80 GPUs.

Figure: Overhead Breakdown Across Runtimes on Skylake. Cumulative time [s] (0 to 20) versus matrix size (81000, 108000, 148500, 162000, 202500), broken down into Runtime Management, Task Insert, and Task Unpacking.

Figure: Task Scheduling Distribution on Skylake. Task count versus time [s], comparing PaRSEC with AL4SAN-PaRSEC, QUARK with AL4SAN-QUARK, StarPU with AL4SAN-StarPU, and OpenMP with AL4SAN-OpenMP (left: native runtime, right: AL4SAN).

Main Reference: AL4SAN: Abstraction Layer For Standardizing APIs of Task-Based Engines, R. Alomairy, H. Ltaief, M. Abduljabbar, and D. Keyes, submitted to IPDPS'19. Available at http://hdl.handle.net/10754/629718

The Tile Low-Rank Cholesky Matrix Factorization

Dense Linear Algebra Renaissance


The Tile Low-Rank Cholesky Matrix Factorization

Introducing BLAS for batched TLR LA on GPUs: KBLAS

Context:

Very small sizes ⇒ arithmetic intensity is low.

Humongous number of independent operations ⇒ kernel launch overhead is high.

Limited GPU occupancy ⇒ GPU CUDA cores are idle.

Solution: batched executions for TLR LA operations
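The interface change behind batching can be shown without a GPU. In the toy sketch below, `ToyDevice` is a hypothetical stand-in that only counts launches; it is not the KBLAS API, and it models no timing. The point is that a batched call hands the device the whole set of small, independent products in one launch.

```python
from typing import List

Matrix = List[List[float]]


def _matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain dense product, used as the per-matrix kernel body."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


class ToyDevice:
    """Hypothetical stand-in for a GPU: counts how many kernel launches were issued."""

    def __init__(self) -> None:
        self.launches = 0

    def gemm(self, a: Matrix, b: Matrix) -> Matrix:
        """One launch per small product."""
        self.launches += 1
        return _matmul(a, b)

    def gemm_batched(self, batch_a: List[Matrix], batch_b: List[Matrix]) -> List[Matrix]:
        """One launch for the whole batch of independent products."""
        self.launches += 1
        return [_matmul(a, b) for a, b in zip(batch_a, batch_b)]


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[0.0, 1.0], [1.0, 0.0]]
    batch_a, batch_b = [a] * 1000, [b] * 1000

    one_by_one = ToyDevice()
    results_loop = [one_by_one.gemm(x, y) for x, y in zip(batch_a, batch_b)]

    batched = ToyDevice()
    results_batch = batched.gemm_batched(batch_a, batch_b)

    print(one_by_one.launches, batched.launches, results_loop == results_batch)
```

On real hardware the single launch also lets the many small products share the device at once, which addresses the occupancy point in the list above.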


The Tile Low-Rank Cholesky Matrix Factorization

TLR-GEMM Variants

TLR-GEMM-LLD: update dense C.

TLR-GEMM-LLL: update TLR C.

Figure (one per variant): matrices A, B, and C with dimension labels M, N, and K; the nt-th outer product is highlighted.

A. Charara, D. E. Keyes, and H. Ltaief, Batched Tile Low-Rank GEMM on GPUs, EuroPar, 2018.
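For a single pair of tiles, the LLD variant (low-rank A, low-rank B, dense C) can be sketched as follows. The sketch assumes each low-rank tile is stored as a pair of tall factors, A = Ua Va^T and B = Ub Vb^T, and uses the update C + A·B; the storage layout, the sign, the absence of transposes, and all function names are assumptions for illustration, not the KBLAS interface.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def add(a: Matrix, b: Matrix) -> Matrix:
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def tlr_gemm_lld(c: Matrix, ua: Matrix, va: Matrix, ub: Matrix, vb: Matrix) -> Matrix:
    """Return C + (Ua Va^T)(Ub Vb^T) without ever forming A or B densely."""
    core = matmul(transpose(va), ub)       # small ka x kb matrix
    left = matmul(ua, core)                # nb x kb
    return add(c, matmul(left, transpose(vb)))


def dense_reference(c: Matrix, ua: Matrix, va: Matrix, ub: Matrix, vb: Matrix) -> Matrix:
    """Same update computed the expensive way, for checking."""
    a = matmul(ua, transpose(va))
    b = matmul(ub, transpose(vb))
    return add(c, matmul(a, b))


if __name__ == "__main__":
    ua = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 x 2, rank-2 tile A
    va = [[1.0, 2.0], [0.0, 1.0], [1.0, 0.0]]
    ub = [[2.0], [1.0], [0.0]]                  # 3 x 1, rank-1 tile B
    vb = [[1.0], [1.0], [3.0]]
    c = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print(tlr_gemm_lld(c, ua, va, ub, vb) == dense_reference(c, ua, va, ub, vb))
```

Multiplying through the small core first keeps every intermediate thin until the final product against the dense tile.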


The Tile Low-Rank Cholesky Matrix Factorization

Batched TLR GEMM: uniform ranks

Update dense tile: higher memory footprint.

Update low-rank tile: requires re-compression (QR + SVD + GEMM).

Figure: batches of A, B, and C tiles for each of the two cases.

W. H. Boukaram, G. Turkiyyah, H. Ltaief, and D. E. Keyes, Batched QR and SVD Algorithms on GPUs with Applications in Hierarchical Matrix Compression, Parallel Computing, vol. 74, pp. 19-33, 2018.
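The trade-off between the two cases comes down to simple counting. The sketch assumes a low-rank nb × nb tile is stored as two nb × k factors, and that adding a low-rank product to a low-rank C is done by concatenating factors, so the rank grows until re-compression. The tile size and ranks in the example are hypothetical values chosen only to make the arithmetic visible, and the function names are illustrative.

```python
def dense_tile_words(nb: int) -> int:
    """Storage of a dense nb x nb tile, in matrix entries."""
    return nb * nb


def lowrank_tile_words(nb: int, k: int) -> int:
    """Storage of a rank-k tile kept as two nb x k factors."""
    return 2 * nb * k


def rank_before_recompression(kc: int, ka: int, kb: int) -> int:
    """Rank bound after adding a product of rank-ka and rank-kb tiles to a rank-kc tile.

    The product has rank at most min(ka, kb); concatenating its factors with
    those of C gives at most kc + min(ka, kb) columns.
    """
    return kc + min(ka, kb)


if __name__ == "__main__":
    nb, k = 1024, 32                      # hypothetical tile size and uniform rank
    grown = rank_before_recompression(k, k, k)
    print("dense tile:           ", dense_tile_words(nb))
    print("low-rank tile:        ", lowrank_tile_words(nb, k))
    print("rank after one update:", grown)
    print("storage at that rank: ", lowrank_tile_words(nb, grown))
```

With these numbers the low-rank tile is 16 times smaller than the dense one, but a single update doubles its rank; left unchecked, repeated updates would erase the saving, which is why the re-compression step is needed.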


The Tile Low-Rank Cholesky Matrix Factorization

TLR POTRF: Uniform Ranks

Tile low-rank Cholesky factorization:

Uniform ranks.

Generate, compress, and factorize on-the-fly.

Single Pascal GPU P100.

Figure: time (s, log scale from 0.0625 to 64) versus matrix size for HiCMA-TLR-36cores-CPU, MAGMA-Dense-1-GPU, and HiCMA-TLR-1-GPU, with a "7X" annotation.

A. Charara, H. Ltaief, K. Akbudak, A. Mikhalev, and D. E. Keyes, Accelerating Tile Low-Rank Cholesky Factorization on GPUs, to be submitted to IEEE TPDS, 2019.

What’s Next?

Future works

KBLAS support for batched TLR LA operations on GPUs using non-uniform ranks

KBLAS support for batched TLR LA operations on x86 using Intel MKL / libxsmm

HiCMA support for HODLR/H (non-nested bases) data compression format

HiCMA support for Stochastic Gradient Descent using approximation of matrix inversion w/ P. Richtarik (KAUST)

Runtime support for rank growth

Runtime support for batched kernel executions


What’s Next?

Thank You!

Questions?
