Toward Fast Eigensolvers for Electronic Structure Calculations using …icl.utk.edu/bblas/siam-cse19/files/04-ltaief-siamcse.pdf · 2019-03-05 · Toward Fast Eigensolvers for Electronic

Toward Fast Eigensolvers for Electronic Structure Calculations using Low-rank Approximations

K. Akbudak, A. Charara, D. Keyes, H. Ltaief, A. Mikhalev, and D. Sukkari

Extreme Computing Research Center, King Abdullah University of Science and Technology

SIAM Conference on Computational Science and Engineering

Spokane, WA, USA, Feb 25 - Mar 1, 2019

H. Ltaief 1 / 41

Acknowledgments

Students/Collaborators

Extreme Computing Research Center @ KAUST

KAUST Spintronics Theory Group: S. Laref and A. Manchon

KAUST Computational Physics & Materials Science: U. Schwingenschlögl and N. Singh

KAUST Supercomputing Lab: S. Feki, B. Hadri, and Z. Zhu

INRIA/INP/LaBRI Bordeaux, France: Runtime/HiePACS Teams

Innovative Computing Laboratory @ UTK: PLASMA/MAGMA/SLATE/PaRSEC Teams

American University of Beirut, Lebanon: G. Turkiyyah

Max-Planck Institute@Leipzig, Germany: R. Kriemann

University of Oxford: Y. Nakatsukasa

University of Manchester: N. Higham

EPFL, Switzerland: D. Kressner


Acknowledgments

Vendors

NVIDIA GPU Research Center

Intel Parallel Computing Center

Cray Center of Excellence


Acknowledgments

The Hourglass Revisited

@KAUST_ECRC

https://www.facebook.com/ecrckaust


Acknowledgments

Having fun at SIAMCSE19!


A Hostile Hardware Landscape

High Performance Computing: The Top500 List



It is getting Moore and Moore hot here!



Hardware Trends: Energy / Data Movement Matters!

Operation             2011       2018
DP FLOP               100 pJ     10 pJ
DP DRAM read          4800 pJ    1920 pJ
Local interconnect    7500 pJ    2500 pJ
Cross system          9000 pJ    3500 pJ

Source: John Shalf, LBNL
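To make the point of this slide concrete, the short sketch below expresses each data movement cost as a number of double-precision flops, using only the picojoule figures listed above. The function name cost_in_flops is made up for this illustration.

```python
# Energy per operation in picojoules, copied from the slide (2011 and 2018 columns).
energy_pj = {
    "DP FLOP": {2011: 100, 2018: 10},
    "DP DRAM read": {2011: 4800, 2018: 1920},
    "Local interconnect": {2011: 7500, 2018: 2500},
    "Cross system": {2011: 9000, 2018: 3500},
}

def cost_in_flops(operation, year):
    """Energy of one operation divided by the energy of one DP flop that year."""
    return energy_pj[operation][year] / energy_pj["DP FLOP"][year]

for op in energy_pj:
    if op != "DP FLOP":
        print(f"{op:20s} 2011: {cost_in_flops(op, 2011):5.0f} flops"
              f"   2018: {cost_in_flops(op, 2018):5.0f} flops")
```

By these figures a DRAM read costs as much energy as 48 flops in 2011 and 192 flops in 2018: arithmetic became cheaper faster than data movement did.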


Algorithmic Trends

Figure: algorithmic trends (vertical axis) plotted against hardware trends (horizontal axis), 2012 to 2018, built up over four slides.

Hardware trends, CPUs: SNB (AVX), HSW (AVX2), KNL (AVX512), SKL (AVX512)

Hardware trends, GPUs: K20/K40/K80, P100, V100 (annotated "100x Faster")

Algorithmic trends: block algorithms, tile algorithms, dynamic runtime systems, data motion reduction, synchronization reduction, fine-grained parallelism, approximations and mixed precisions, batch BLAS


The HiCMA Library

Available at http://github.com/ecrc/hicma



Algorithmic Recipes For Exascale Computing

Fine-grain parallelism

Reducing data motion

Reducing synchronization

Approximations and mixed precisions

Power efficiency

Dynamic runtime systems

10^18 (exascale)

Solving Challenging Scientific Problems

The KAUST Supercomputer: Shaheen-2 #20



Materials Science

Structural and vibrational analysis applied to problems in computational physics and chemistry, such as electronic and band structure calculations

(a) Problem Definition. (b) Electronic structure.

Figure: Design of new materials.

w/ S. Laref, A. Manchon, N. Singh, U. Schwingenschlögl, and Z. Zhu


The Self-Consistency Cycle: VASP, Fleur, and Wien2K

Generate A operator

Ax = λBx

Generate B operator

20 iterations at least
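The slides do not say how the generalized problem Ax = λBx is solved inside each iteration. A common textbook route, assumed here, is to reduce it to a standard symmetric eigenproblem through a Cholesky factorization of B, which would explain why Cholesky is the focus of the rest of the talk. The function name and the random test matrices are illustrative only.

```python
import numpy as np

def solve_generalized_eigenproblem(A, B):
    """Solve A x = lambda B x for symmetric A and symmetric positive-definite B."""
    L = np.linalg.cholesky(B)                # B = L L^T
    # Build C = L^{-1} A L^{-T} with two solves instead of explicit inverses.
    Y = np.linalg.solve(L, A)                # Y = L^{-1} A
    C = np.linalg.solve(L, Y.T).T            # C = Y L^{-T}
    C = 0.5 * (C + C.T)                      # remove round-off asymmetry
    eigenvalues, Z = np.linalg.eigh(C)       # standard problem C z = lambda z
    X = np.linalg.solve(L.T, Z)              # back-transform x = L^{-T} z
    return eigenvalues, X

# Small random test problem (illustrative data, not from the slides).
rng = np.random.default_rng(0)
n = 6
M = rng.standard_normal((n, n))
A = M + M.T                                  # symmetric
N = rng.standard_normal((n, n))
B = N @ N.T + n * np.eye(n)                  # symmetric positive definite

lam, X = solve_generalized_eigenproblem(A, B)
print(np.max(np.abs(A @ X - (B @ X) * lam)))  # residual of A x = lambda B x
```

A production code would use triangular solves rather than the general np.linalg.solve used here for brevity.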



The Big Picture


The Common Denominator

This Highly Ranked Guy!



The Cholesky Factorization

The Cholesky factorization of an N × N real symmetric, positive-definite matrix A has the form

A = LL^T,

where L is an N × N real lower triangular matrix with positive diagonal elements.
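As a worked instance of this definition, the sketch below computes L column by column for a small 3 × 3 matrix using only the Python standard library. It is an unblocked illustration of the definition, not the algorithm used by the libraries discussed in the talk.

```python
import math

def cholesky(A):
    """Return lower-triangular L with A = L L^T for a symmetric positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: what remains of A[j][j] after removing earlier columns.
        d = A[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        L[j][j] = math.sqrt(d)
        # Entries below the diagonal in column j.
        for i in range(j + 1, n):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = s / L[j][j]
    return L

A = [[4.0, 12.0, -16.0],
     [12.0, 37.0, -43.0],
     [-16.0, -43.0, 98.0]]
L = cholesky(A)
for row in L:
    print(row)
# [2.0, 0.0, 0.0]
# [6.0, 1.0, 0.0]
# [-8.0, 5.0, 3.0]
```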



PLASMA/MAGMA/CHAMELEON DPOTRF from this century

Figure: Tile Algorithms.
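The figure for this slide is lost, so here is a minimal sketch of what a tile Cholesky looks like, assuming the usual right-looking formulation built from four kernels (POTRF, TRSM, SYRK, GEMM) acting on nb × nb tiles. It runs the tile tasks sequentially in NumPy; the libraries named above schedule them as a task graph, which this sketch does not attempt.

```python
import numpy as np

def tile_cholesky(A, nb):
    """Right-looking tile Cholesky: returns lower-triangular L with A = L L^T.

    Assumes A is symmetric positive definite and its order is a multiple of nb.
    """
    n = A.shape[0]
    if n % nb != 0:
        raise ValueError("matrix order must be a multiple of the tile size")
    nt = n // nb
    W = np.array(A, dtype=float)             # working copy, overwritten tile by tile

    def t(i):                                # index range of tile row/column i
        return slice(i * nb, (i + 1) * nb)

    for k in range(nt):
        # POTRF: factor the diagonal tile.
        W[t(k), t(k)] = np.linalg.cholesky(W[t(k), t(k)])
        Lkk = W[t(k), t(k)]
        for i in range(k + 1, nt):
            # TRSM: solve X Lkk^T = W[i,k] (a triangular solve in a real library).
            W[t(i), t(k)] = np.linalg.solve(Lkk, W[t(i), t(k)].T).T
        for i in range(k + 1, nt):
            for j in range(k + 1, i + 1):
                # SYRK when i == j, GEMM otherwise: trailing-matrix update.
                W[t(i), t(j)] -= W[t(i), t(k)] @ W[t(j), t(k)].T
    return np.tril(W)                        # discard the untouched upper tiles

rng = np.random.default_rng(0)
M = rng.standard_normal((8, 8))
A = M @ M.T + 8 * np.eye(8)                  # illustrative SPD test matrix
L = tile_cholesky(A, nb=2)
print(np.allclose(L @ L.T, A))
```

Each kernel call touches only a few tiles, which is what lets a runtime system run independent tile tasks concurrently.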


Exploiting the hierarchical low-rankness of these matrices!

Ubiquitous in computational science and engineering

Symmetric, positive-definite matrix structure

(Apparently) Dense matrices

Often data-sparse

Decay of parameter correlations with distance

Hierarchically of low rank
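The properties listed above can be observed on a toy matrix. The sketch below builds a squared exponential kernel matrix on 256 equally spaced points (an assumed test problem, not the Hamiltonian from the talk) and reports the numerical rank of each off-diagonal tile. The tile size, length scale, and the use of an absolute singular value threshold are all assumptions of this illustration.

```python
import numpy as np

def tile_rank_map(K, nb, tol):
    """Numerical rank of every off-diagonal nb x nb tile of K.

    A tile's rank is the number of singular values above tol (absolute
    threshold). Diagonal tiles are reported as nb because they stay dense.
    """
    nt = K.shape[0] // nb
    ranks = np.full((nt, nt), nb, dtype=int)
    for i in range(nt):
        for j in range(nt):
            if i != j:
                tile = K[i * nb:(i + 1) * nb, j * nb:(j + 1) * nb]
                s = np.linalg.svd(tile, compute_uv=False)
                ranks[i, j] = int(np.sum(s > tol))
    return ranks

# Squared exponential kernel on [0, 1] with length scale 0.1 (assumed values).
x = np.linspace(0.0, 1.0, 256)
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2 * 0.1 ** 2))

ranks = tile_rank_map(K, nb=32, tol=1e-8)
print(ranks)
```

The matrix is fully dense, yet tiles next to the diagonal have rank well below the tile size and ranks shrink further away from it, which is the decay of correlation with distance mentioned above.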


The Tile Low-Rank Cholesky Matrix Factorization

The HiCMA Library

Available at http://github.com/ecrc/hicma



Matrix Rank X-ray: Rank Distribution

Hamiltonian matrix w/ nb=16 and 1e-8 accuracy threshold

Figure: rank distribution plots over three slides; only the axis ticks survive (horizontal 2 to 16, 2 to 16, and 2 to 12; vertical up to 4x10^4, 2x10^4, and 16000).


Dense Linear Algebra Renaissance: Tile Low-Rank as a Pragmatic Approach

T. Mary, PhD Dissertation, Block Low-Rank multifrontal solvers: complexity, performance, and scalability, 2017.

C. Weisbecker, PhD Dissertation, Improving multifrontal solvers by means of algebraic Block Low-Rank representations, 2013.


HiCMA Vs Intel MKL on Shared-Memory Systems

Geospatial statistic w/ square exp. kernel and acc=1e-8

Figure: time (s), log scale from 10^0 to 10^3, versus matrix size (27K to 297K) for MKL-SNB, MKL-HSW, MKL-SKL, HiCMA-SNB, HiCMA-HSW, and HiCMA-SKL.


HiCMA Vs ScaLAPACK on Distributed-Memory Systems

Figure: time (s), log scale from 10^0 to 10^4, versus matrix size (54K to 594K) for ScaLAPACK on 16, 32, 64, 128, and 256 nodes and HiCMA-TLR Cholesky-16.

K. Akbudak, H. Ltaief, A. Mikhalev, A. Charara, and D. E. Keyes, Exploiting Data Sparsity for Large-Scale Matrix Computations, EuroPar, 2018.


Strong Scalability on Shaheen-2 Cray Haswell

Figure: time (minutes, ticks from 3 to 133) versus matrix size (1M to 11M) for HiCMA-16, HiCMA-32, HiCMA-64, HiCMA-128, HiCMA-256, and HiCMA-512.


Strong Scalability on Cray Skylake: Turbo ON

Figure: time (minutes, ticks from 3 to 133) versus matrix size (1M to 11M) for HiCMA-16, HiCMA-32, HiCMA-64, HiCMA-128, and HiCMA-256.


Traces Chameleon: Dense dpotrf time=18.1s on 4 nodes of Shaheen-2 with a matrix size of 54K


Traces HiCMA: Data-sparse dpotrf time=1.8s on 4 nodes of Shaheen-2 with a matrix size of 54K


AL4SAN: Abstraction Layer For Standardizing APIs of Task-Based Engines – https://github.com/ecrc/al4san

The abstraction layer for standardizing APIs of task-based engines (AL4SAN) is designed as a lightweight software library, which provides a collection of APIs to unify the expression of tasks and their data dependencies from existing dynamic engines. AL4SAN supports various dynamic runtime systems relying on compiler infrastructure technology or on library-defined APIs. It features an abstraction of task-based engines and, therefore, enables a single-code application to assess various runtimes and their respective scheduling components. The goal of AL4SAN is not to create yet another runtime system, but to further leverage the user-obliviousness of the underlying complex hardware architectures at the dawn of the Exascale age.
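The idea of one application code running on interchangeable task engines can be sketched in a few lines. Everything below is hypothetical and written for illustration: the class names, insert_task, and wait_all are not the AL4SAN API, and the sketch leaves out the data-dependency tracking that the real engines provide.

```python
from concurrent.futures import ThreadPoolExecutor

class SerialBackend:
    """Hypothetical backend: runs every task immediately on the calling thread."""
    def __init__(self):
        self._results = []

    def insert_task(self, func, *args):
        self._results.append(func(*args))

    def wait_all(self):
        results, self._results = self._results, []
        return results

class ThreadPoolBackend:
    """Hypothetical backend: hands tasks to a pool of worker threads."""
    def __init__(self, workers=4):
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._futures = []

    def insert_task(self, func, *args):
        self._futures.append(self._pool.submit(func, *args))

    def wait_all(self):
        results = [f.result() for f in self._futures]
        self._futures = []
        return results

def sum_of_squares(backend, tiles):
    """Application code written once against insert_task/wait_all."""
    for tile in tiles:
        backend.insert_task(lambda t: sum(v * v for v in t), tile)
    return sum(backend.wait_all())

tiles = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
serial_total = sum_of_squares(SerialBackend(), tiles)
threaded_total = sum_of_squares(ThreadPoolBackend(), tiles)
print(serial_total, threaded_total)  # 91.0 91.0
```

The application function never names a specific engine, so swapping the backend object is the only change needed to compare schedulers.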

Runtime Support
OpenMP-LLVM
PaRSEC
StarPU
QUARK

AL4SAN v1.0 Features
Standardizing task-based runtime systems
Using a lightweight abstraction layer
Improving user productivity
Supporting different hardware architectures
Performing with a relatively limited overhead (up to 10%)

AL4SAN Roadmap
Extending to more engines
Leveraging data abstraction
Integrating C++ constructs
Composing across dynamic runtime systems
Adding support to more algorithms and applications

Software Infrastructure
Figure: layered stack, from top to bottom: Applications, AL4SAN Frontend, AL4SAN Backends, Runtimes (StarPU, PaRSEC, QUARK, OpenMP), OS/Hardware.

Cholesky Pseudo-Code and Task Interface
Figure: panel contents did not survive extraction.

Performance Assessment
Figure: four panels (Dense Cholesky on Skylake, Low Rank Cholesky on Skylake, Low Rank Cholesky on Shaheen-2, Dense Cholesky on 8x Nvidia K80 GPUs).

Overhead Breakdown Across Runtimes on Skylake
Figure: cumulative time [s] (0 to 20) versus matrix size (81000 to 202500), split into Runtime Management, Task Insert, and Task Unpacking. In each panel the native runtime is on the left and its AL4SAN counterpart on the right, for PaRSEC, QUARK, StarPU, and OpenMP.

Task Scheduling Distribution on Skylake
Figure: task count versus time [s] for PaRSEC and AL4SAN-PaRSEC, QUARK and AL4SAN-QUARK, StarPU and AL4SAN-StarPU, and OpenMP and AL4SAN-OpenMP.

Main Reference
AL4SAN: Abstraction Layer For Standardizing APIs of Task-Based Engines, R. Alomairy, H. Ltaief, M. Abduljabbar, and D. Keyes, Submitted to IPDPS'19, Available at http://hdl.handle.net/10754/629718


Dense Linear Algebra Renaissance



Introducing BLAS for batched TLR LA on GPUs: KBLAS

Context:

Very small sizes => arithmetic intensity is low.

Humongous number of independent operations => kernel launch overhead is high.

Limited GPU occupancy => GPU CUDA cores are idle.

Solution: batched executions for TLR LA operations
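The contrast between many tiny calls and one batched call can be shown with a CPU analogy in NumPy. This is not the KBLAS interface; the batch size and tile dimensions are arbitrary values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, m, k, n = 1000, 8, 4, 8               # many very small products
A = rng.standard_normal((batch, m, k))
B = rng.standard_normal((batch, k, n))

# One call per tiny product: on a GPU, each iteration would be its own kernel launch.
C_loop = np.stack([A[i] @ B[i] for i in range(batch)])

# One batched call that multiplies matching pairs along the leading dimension.
C_batched = np.matmul(A, B)

print(C_batched.shape, np.allclose(C_loop, C_batched))
```

Both versions do the same arithmetic; the batched form pays the per-call overhead once instead of a thousand times.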



TLR-GEMM Variants

TLR-GEMM-LLD: update dense C.

TLR-GEMM-LLL: update TLR C.

Figure: for each variant, the nt-th outer product of tiles from A and B updating C, with dimensions labeled M, N, and K.

A. Charara, D. E. Keyes, and H. Ltaief, Batched Tile Low-Rank GEMM on GPUs, EuroPar, 2018.
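A minimal sketch of the dense-C variant follows, under the assumption that A and B are each stored as a product of two thin factors (A = Ua Va^T, B = Ub Vb^T) and that the update is C += alpha * A B. The factor names, the update convention, and the function name tlr_gemm_lld are choices made for this illustration rather than details taken from the cited paper.

```python
import numpy as np

def tlr_gemm_lld(C, Ua, Va, Ub, Vb, alpha=1.0):
    """In-place update C += alpha * (Ua Va^T)(Ub Vb^T) without forming A or B densely."""
    core = Va.T @ Ub                         # small ra x rb matrix
    C += alpha * ((Ua @ core) @ Vb.T)        # dense m x n result
    return C

rng = np.random.default_rng(0)
m, k, n, ra, rb = 64, 64, 64, 5, 3
Ua, Va = rng.standard_normal((m, ra)), rng.standard_normal((k, ra))
Ub, Vb = rng.standard_normal((k, rb)), rng.standard_normal((n, rb))
C0 = rng.standard_normal((m, n))

C = tlr_gemm_lld(C0.copy(), Ua, Va, Ub, Vb, alpha=-1.0)
C_ref = C0 - (Ua @ Va.T) @ (Ub @ Vb.T)       # same update with dense A and B
print(np.allclose(C, C_ref))
```

Multiplying the thin factors in this order keeps every intermediate at most m × rb, so the cost scales with the ranks rather than with the inner dimension k.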



Batched TLR GEMM: uniform ranks

Update dense tile.

Higher memory footprint.

Update Low-rank tile:

Requires re-compression: QR + SVD + GEMM.

Figure: batches of A, B, and C tiles for the two update variants.

W. H. Boukaram, G. Turkiyyah, H. Ltaief, and D. E. Keyes, Batched QR and SVD Algorithms on GPUs with Applications in Hierarchical Matrix Compression, Parallel Computing, vol. 74, pp. 19-33, 2018.
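The QR + SVD + GEMM sequence named on this slide can be sketched as follows, assuming the low-rank tile C = Uc Vc^T receives an update that is already available in factored form Ua Vb^T. This is the standard recompression scheme for sums of low-rank matrices; the function name and the absolute truncation threshold are assumptions of this illustration.

```python
import numpy as np

def add_and_recompress(Uc, Vc, Ua, Vb, tol):
    """Return low-rank factors of Uc Vc^T + Ua Vb^T, dropping singular values <= tol."""
    U = np.hstack([Uc, Ua])                  # ranks add up before recompression
    V = np.hstack([Vc, Vb])
    Qu, Ru = np.linalg.qr(U)                 # QR of both tall-and-skinny factors
    Qv, Rv = np.linalg.qr(V)
    X, s, Yt = np.linalg.svd(Ru @ Rv.T)      # SVD of a small square matrix
    k = max(1, int(np.sum(s > tol)))         # keep at least rank 1
    U_new = Qu @ (X[:, :k] * s[:k])          # GEMM: fold singular values into U
    V_new = Qv @ Yt[:k, :].T                 # GEMM
    return U_new, V_new

rng = np.random.default_rng(0)
Uc, Vc = rng.standard_normal((64, 4)), rng.standard_normal((64, 4))
# Update that lives in the column and row spaces C already has (illustrative choice).
Ua, Vb = Uc[:, :3].copy(), Vc[:, :3].copy()

U_new, V_new = add_and_recompress(Uc, Vc, Ua, Vb, tol=1e-8)
dense = Uc @ Vc.T + Ua @ Vb.T
print(U_new.shape[1], np.allclose(U_new @ V_new.T, dense))
```

Without recompression the stacked factors would have 7 columns; here the truncated SVD recovers that the sum still has rank 4.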



TLR POTRF: Uniform Ranks

Tile low-rank Cholesky factorization:

Uniform ranks.

Generate, compress and factorize on-the-fly.

Single Pascal GPU P100.

Figure: time (s), log scale from 0.0625 to 64, versus matrix size for HiCMA-TLR-36cores-CPU, MAGMA-Dense-1-GPU, and HiCMA-TLR-1-GPU; annotated "7X".

A. Charara, H. Ltaief, K. Akbudak, A. Mikhalev, and D. E. Keyes, Accelerating Tile Low-Rank Cholesky Factorization on GPUs, To be submitted at IEEE TPDS, 2019.

What’s Next?

Future works

KBLAS support for batched TLR LA operations on GPUs using non-uniform ranks

KBLAS support for batched TLR LA operations on x86 using Intel MKL / libxsmm

HiCMA support for HODLR/H (non-nested bases) data compression format

HiCMA support for Stochastic Gradient Descent using approximation of matrix inversion w/ P. Richtarik (KAUST)

Runtime support for rank growth

Runtime support for batched kernel executions


What’s Next?

Thank You!

Questions?
