Toward Fast Eigensolvers for Electronic Structure Calculations using Low-rank Approximations
K. Akbudak, A. Charara, D. Keyes, H. Ltaief, A. Mikhalev, and D. Sukkari
Extreme Computing Research Center, King Abdullah University of Science and Technology
SIAM Conference on Computational Science and Engineering
Spokane, WA, USA, Feb 25 - Mar 1, 2019
H. Ltaief 1 / 41
Acknowledgments
Students/Collaborators
Extreme Computing Research Center @ KAUST
KAUST Spintronics Theory Group: S. Laref and A. Manchon
KAUST Computational Physics & Materials Science: U. Schwingenschlögl and N. Singh
KAUST Supercomputing Lab: S. Feki, B. Hadri, and Z. Zhu
INRIA/INP/LaBRI Bordeaux, France: Runtime/HiePACS Teams
Innovative Computing Laboratory @ UTK: PLASMA/MAGMA/SLATE/PaRSEC Teams
American University of Beirut, Lebanon: G. Turkiyyah
Max-Planck Institute @ Leipzig, Germany: R. Kriemann
University of Oxford: Y. Nakatsukasa
University of Manchester: N. Higham
EPFL, Switzerland: D. Kressner
Acknowledgments
Vendors
NVIDIA GPU Research Center
Intel Parallel Computing Center
Cray Center of Excellence
Acknowledgments
The Hourglass Revisited
@KAUST_ECRC
https://www.facebook.com/ecrckaust
Acknowledgments
Having fun at SIAMCSE19!
A Hostile Hardware Landscape
High Performance Computing: The Top500 List
A Hostile Hardware Landscape
It is getting Moore and Moore hot here!
A Hostile Hardware Landscape
Hardware Trends: Energy / Data Movement Matters!
Operation            2011       2018
DP FLOP              100 pJ     10 pJ
DP DRAM Read         4800 pJ    1920 pJ
Local interconnect   7500 pJ    2500 pJ
Cross system         9000 pJ    3500 pJ
Source: John Shalf, LBNL
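The trend in the table is easier to read as a ratio: how many double-precision flops cost the same energy as one data movement. The sketch below uses only the figures quoted in the table; the dictionary layout and the helper name `cost_in_flops` are illustrative.

```python
# Energy per operation in picojoules, copied from the table above.
energy_pj = {
    "DP FLOP":            {2011: 100.0,  2018: 10.0},
    "DP DRAM read":       {2011: 4800.0, 2018: 1920.0},
    "Local interconnect": {2011: 7500.0, 2018: 2500.0},
    "Cross system":       {2011: 9000.0, 2018: 3500.0},
}

def cost_in_flops(operation, year):
    """Energy of `operation` in units of one DP FLOP of the same year."""
    return energy_pj[operation][year] / energy_pj["DP FLOP"][year]

for op in energy_pj:
    if op == "DP FLOP":
        continue
    print(f"{op:20s} 2011: {cost_in_flops(op, 2011):5.0f} flops   "
          f"2018: {cost_in_flops(op, 2018):5.0f} flops")
```

With these numbers a DRAM read costs as much energy as 48 flops in 2011 and 192 flops in 2018: every operation became cheaper in absolute terms, but data movement became more expensive relative to arithmetic.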
A Hostile Hardware Landscape
Algorithmic Trends
Figure: algorithmic trends plotted against hardware trends, both on a 2012 to 2018 timeline.
Hardware trends, CPUs: SNB AVX, HSW AVX2, KNL AVX512, SKL AVX512
Hardware trends, GPUs: K20/K40/K80, P100, V100
Algorithmic trends:
Tile algorithms
Block algorithms
Dynamic runtime systems
Data motion reduction
Synchronization reduction
Fine-grained parallelism
Approximations and Mixed Precisions
Batch BLAS
100x Faster
A Hostile Hardware Landscape
The HiCMA Library
Available at http://github.com/ecrc/hicma
A Hostile Hardware Landscape
Algorithmic Recipes For Exascale Computing
Figure: recipes converging on 10^18 (exascale):
Fine-grain parallelism
Reducing data motion
Reducing synchronization
Approximations and Mixed Precisions
Power efficiency
Dynamic runtime systems
Solving Challenging Scientific Problems
The KAUST Supercomputer: Shaheen-2 #20
Solving Challenging Scientific Problems
Materials Science
Structural and vibrational analysis applied to problems in computational physics and chemistry, such as electronic and band structure calculations
(a) Problem Definition. (b) Electronic structure.
Figure: Design of new materials.
w/ S. Laref, A. Manchon, N. Singh, U. Schwingenschlögl, and Z. Zhu
Solving Challenging Scientific Problems
The Self-Consistency Cycle: VASP, Fleur, and Wien2K
Generate A operator
Ax = λBx
Generate B operator
20 iterations at least
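Each pass of the cycle solves the generalized eigenproblem Ax = λBx. The slides do not show the solver internals, so the following is only a sketch of one standard route for a symmetric A and a symmetric positive-definite B: factor B = LL^T and reduce to a standard eigenproblem. Random matrices stand in for the generated operators.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6

# Stand-ins for the operators: A symmetric, B symmetric positive definite.
M = rng.standard_normal((n, n))
A = (M + M.T) / 2
G = rng.standard_normal((n, n))
B = G @ G.T + n * np.eye(n)

# Step 1: Cholesky factorization B = L L^T.
L = np.linalg.cholesky(B)

# Step 2: reduce to standard form C = L^{-1} A L^{-T}.
Y = np.linalg.solve(L, A)          # Y = L^{-1} A
C = np.linalg.solve(L, Y.T).T      # C = Y L^{-T}
C = (C + C.T) / 2                  # remove rounding asymmetry

# Step 3: standard symmetric eigenproblem C y = lambda y.
lam, Yvec = np.linalg.eigh(C)

# Step 4: back-transform the eigenvectors, x = L^{-T} y.
X = np.linalg.solve(L.T, Yvec)

residual = np.linalg.norm(A @ X - B @ X @ np.diag(lam))
print("eigenvalues:", lam)
print("residual ||AX - BX diag(lam)||:", residual)
```

In this route the Cholesky factorization of B is the first step of every eigensolve, which is why its cost matters when the cycle repeats many times.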
Solving Challenging Scientific Problems
The Big Picture
The Common Denominator
This Highly Ranked Guy!
The Common Denominator
The Cholesky Factorization
The Cholesky factorization of an N × N real symmetric, positive-definite matrix A has the form
$A = L L^{T}$,
where L is an N × N real lower triangular matrix with positive diagonal elements.
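The definition can be checked on a small example. The sketch below is a plain, unblocked column-by-column Cholesky written for readability, not the routine used by any of the libraries named in these slides; the function name `cholesky_lower` is illustrative.

```python
import numpy as np

def cholesky_lower(A):
    """Unblocked Cholesky: return lower-triangular L with A = L @ L.T."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    L = np.zeros_like(A)
    for j in range(n):
        # Diagonal entry: what is left of A[j, j] after earlier columns.
        d = A[j, j] - L[j, :j] @ L[j, :j]
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        L[j, j] = np.sqrt(d)
        # Entries below the diagonal in column j.
        L[j + 1:, j] = (A[j + 1:, j] - L[j + 1:, :j] @ L[j, :j]) / L[j, j]
    return L

A = np.array([[4.0, 2.0, 2.0],
              [2.0, 5.0, 3.0],
              [2.0, 3.0, 6.0]])
L = cholesky_lower(A)
print(L)
print("A == L L^T:", np.allclose(L @ L.T, A))
```

For this matrix the factor is L = [[2, 0, 0], [1, 2, 0], [1, 1, 2]], which is lower triangular with a positive diagonal as the definition requires.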
The Common Denominator
PLASMA/MAGMA/CHAMELEON DPOTRF from this century
Figure: Tile Algorithms.
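A tile algorithm splits the matrix into nb × nb tiles and expresses the factorization as a sequence of small kernels on those tiles. The sketch below is a sequential NumPy version of the usual right-looking ordering, with comments naming the corresponding LAPACK/BLAS kernels (POTRF, TRSM, SYRK, GEMM). It assumes nb divides the matrix order, and the function name `tile_cholesky` is illustrative; the libraries above run these kernels as tasks rather than in a fixed loop order.

```python
import numpy as np

def tile_cholesky(A, nb):
    """Right-looking tile Cholesky; returns lower-triangular L with A = L @ L.T.

    Assumes A is symmetric positive definite and nb divides its order.
    """
    n = A.shape[0]
    assert n % nb == 0, "this sketch assumes nb divides n"
    nt = n // nb
    L = A.astype(float)            # working copy, overwritten tile by tile

    def tile(i, j):
        # View (not a copy) of tile (i, j), so updates happen in place.
        return L[i * nb:(i + 1) * nb, j * nb:(j + 1) * nb]

    for k in range(nt):
        # POTRF: factorize the diagonal tile.
        tile(k, k)[:] = np.linalg.cholesky(tile(k, k))
        Lkk = tile(k, k)
        for i in range(k + 1, nt):
            # TRSM: tile(i, k) <- tile(i, k) * Lkk^{-T}.
            tile(i, k)[:] = np.linalg.solve(Lkk, tile(i, k).T).T
        for i in range(k + 1, nt):
            # SYRK: symmetric update of the diagonal tile.
            tile(i, i)[:] -= tile(i, k) @ tile(i, k).T
            for j in range(k + 1, i):
                # GEMM: update of the off-diagonal tile.
                tile(i, j)[:] -= tile(i, k) @ tile(j, k).T
    return np.tril(L)              # discard the untouched upper tiles

rng = np.random.default_rng(1)
G = rng.standard_normal((8, 8))
A = G @ G.T + 8 * np.eye(8)        # symmetric positive definite test matrix
L_tile = tile_cholesky(A, nb=2)
print("matches np.linalg.cholesky:", np.allclose(L_tile, np.linalg.cholesky(A)))
```

Within one step k, the TRSM calls are independent of each other, and so are the SYRK and GEMM updates, which is the parallelism a dynamic runtime system can exploit.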
The Common Denominator
Exploiting the hierarchical low-rankness of these matrices!
Ubiquitous in computational science and engineering
Symmetric, positive-definite matrix structure
(Apparently) Dense matrices
Often data-sparse
Decay of parameter correlations with distance
Hierarchically of low rank
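The link between decaying correlations and low rank can be seen on a toy problem. The sketch below builds a dense kernel matrix from points on a line (a hypothetical stand-in, not one of the application matrices from these slides), takes a tile that couples distant points, and compresses it with a truncated SVD. The helper name `compress` and the choice of a tolerance relative to the largest singular value are assumptions.

```python
import numpy as np

def compress(tile, tol):
    """Truncated SVD: return U, V with tile ~= U @ V.T, dropping singular
    values below tol relative to the largest one."""
    U, s, Vt = np.linalg.svd(tile, full_matrices=False)
    k = max(1, int(np.sum(s > tol * s[0])))
    return U[:, :k] * s[:k], Vt[:k].T

# Hypothetical test problem: 1D points with a squared exponential kernel,
# so correlation decays with distance.
n, nb = 512, 128
x = np.linspace(0.0, 1.0, n)
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2 * 0.1 ** 2))

far_tile = K[3 * nb:, :nb]           # interaction between distant points
U, V = compress(far_tile, tol=1e-8)
rank = U.shape[1]
error = np.linalg.norm(far_tile - U @ V.T) / np.linalg.norm(far_tile)
print(f"rank {rank} of {nb}, relative error {error:.1e}")
print(f"storage: {U.size + V.size} values instead of {far_tile.size}")
```

The matrix looks dense entry by entry, but the tile is stored as two thin factors, which is the sense in which such matrices are data-sparse.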
The Tile Low-Rank Cholesky Matrix Factorization
The HiCMA Library
Available at http://github.com/ecrc/hicma
The Tile Low-Rank Cholesky Matrix Factorization
Matrix Rank X-ray: Rank Distribution
Hamiltonian matrix w/ nb=16 and 1e-8 accuracy threshold
Figure: rank distribution (horizontal axis from 2 to 16, vertical axis from 0 to 4 × 10^4).
The Tile Low-Rank Cholesky Matrix Factorization
Matrix Rank X-ray: Rank Distribution
Hamiltonian matrix w/ nb=16 and 1e-6 accuracy threshold
Figure: rank distribution (horizontal axis from 2 to 16, vertical axis from 0 to 2 × 10^4).
The Tile Low-Rank Cholesky Matrix Factorization
Matrix Rank X-ray: Rank Distribution
Hamiltonian matrix w/ nb=16 and 1e-4 accuracy threshold
Figure: rank distribution (horizontal axis from 2 to 12, vertical axis from 0 to 16000).
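A rank distribution of this kind can be produced by computing the numerical rank of every off-diagonal tile and counting how often each rank occurs. The sketch below does this for the three thresholds shown on these slides, using a hypothetical kernel matrix as a stand-in because the Hamiltonian matrix itself is not available here. Measuring the threshold relative to each tile's largest singular value is an assumption, and `tile_ranks` is an illustrative name.

```python
import numpy as np

def tile_ranks(A, nb, tol):
    """Numerical rank of every strictly-lower tile of A.

    A singular value counts if it exceeds tol times the largest
    singular value of its tile (an assumed convention).
    """
    nt = A.shape[0] // nb
    ranks = []
    for i in range(nt):
        for j in range(i):
            block = A[i * nb:(i + 1) * nb, j * nb:(j + 1) * nb]
            s = np.linalg.svd(block, compute_uv=False)
            ranks.append(int(np.sum(s > tol * s[0])))
    return np.array(ranks)

# Hypothetical stand-in matrix (not the Hamiltonian from the slides).
n, nb = 512, 16
x = np.linspace(0.0, 1.0, n)
A = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2 * 0.1 ** 2))

histograms = {}
for tol in (1e-8, 1e-6, 1e-4):
    ranks = tile_ranks(A, nb, tol)
    # histograms[tol][r] = number of tiles whose rank is r.
    histograms[tol] = np.bincount(ranks, minlength=nb + 1)
    print(f"tol={tol:.0e}  mean rank={ranks.mean():.2f}  max rank={ranks.max()}")
```

Loosening the threshold can only lower the rank of a tile, so the distribution shifts toward smaller ranks as the accuracy goes from 1e-8 to 1e-4.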
The Tile Low-Rank Cholesky Matrix Factorization
Dense Linear Algebra Renaissance: Tile Low-Rank as a Pragmatic Approach
T. Mary, PhD Dissertation, Block Low-Rank multifrontal solvers: complexity, performance, and scalability, 2017.
C. Weisbecker, PhD Dissertation, Improving multifrontal solvers by means of algebraic Block Low-Rank representations, 2013.
The Tile Low-Rank Cholesky Matrix Factorization
HiCMA Vs Intel MKL on Shared-Memory Systems
Geospatial statistics w/ squared exponential kernel and acc=1e-8
Figure: time (s), log scale from 10^0 to 10^3, versus matrix size (27K, 40K, 54K, 68K, 81K, 108K, 135K, 176K, 230K, 297K). Curves: MKL-SNB, MKL-HSW, MKL-SKL, HiCMA-SNB, HiCMA-HSW, HiCMA-SKL.
The Tile Low-Rank Cholesky Matrix Factorization
HiCMA Vs ScaLAPACK on Distributed-Memory Systems
Figure: time (s), log scale from 10^0 to 10^4, versus matrix size (54K, 81K, 108K, 135K, 189K, 270K, 351K, 459K, 594K). Curves: ScaLAPACK on 16, 32, 64, 128, and 256 nodes; HiCMA-TLR Cholesky-16.
K. Akbudak, H. Ltaief, A. Mikhalev, A. Charara, and D. E. Keyes, Exploiting Data Sparsity for Large-Scale Matrix Computations, EuroPar, 2018.
The Tile Low-Rank Cholesky Matrix Factorization
Strong Scalability on Shaheen-2 Cray Haswell
Figure: time (minutes) versus matrix size (1M, 2M, 4M, 5M, 6M, 8M, 11M). Curves: HiCMA-16, HiCMA-32, HiCMA-64, HiCMA-128, HiCMA-256, HiCMA-512.
K. Akbudak, H. Ltaief, A. Mikhalev, A. Charara, and D. E. Keyes, Exploiting Data Sparsity for Large-Scale Matrix Computations, EuroPar, 2018.
The Tile Low-Rank Cholesky Matrix Factorization
Strong Scalability on Cray Skylake: Turbo ON
Figure: time (minutes) versus matrix size (1M, 2M, 4M, 5M, 6M, 8M, 11M). Curves: HiCMA-16, HiCMA-32, HiCMA-64, HiCMA-128, HiCMA-256.
K. Akbudak, H. Ltaief, A. Mikhalev, A. Charara, and D. E. Keyes, Exploiting Data Sparsity for Large-Scale Matrix Computations, EuroPar, 2018.
The Tile Low-Rank Cholesky Matrix Factorization
Traces Chameleon: Dense dpotrf time=18.1s on 4 nodes of Shaheen-2 with a matrix size of 54K
The Tile Low-Rank Cholesky Matrix Factorization
Traces HiCMA: Data-sparse dpotrf time=1.8s on 4 nodes of Shaheen-2 with a matrix size of 54K
The Tile Low-Rank Cholesky Matrix Factorization
AL4SAN: Abstraction Layer For Standardizing APIs of Task-Based Engines (https://github.com/ecrc/al4san)
The abstraction layer for standardizing APIs of task-based engines (AL4SAN) is designed as a lightweight software library, which provides a collection of APIs to unify the expression of tasks and their data dependencies from existing dynamic engines. AL4SAN supports various dynamic runtime systems relying on compiler infrastructure technology or on library-defined APIs. It features an abstraction of task-based engines and, therefore, enables a single-code application to assess various runtimes and their respective scheduling components. The goal of AL4SAN is not to create yet another runtime system, but to further leverage the user-obliviousness of the underlying complex hardware architectures at the dawn of the Exascale age.
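The idea of one task-insertion interface in front of interchangeable engines can be illustrated with a toy model. Everything in the sketch below is hypothetical: the class and method names are invented for illustration and are not the AL4SAN API, and the two "engines" are trivial sequential stand-ins rather than real runtime systems.

```python
from typing import Callable, List, Tuple

class ImmediateBackend:
    """Hypothetical engine that runs each task as soon as it is inserted."""
    def insert(self, func: Callable, args: Tuple) -> None:
        func(*args)
    def wait_all(self) -> None:
        pass

class DeferredBackend:
    """Hypothetical engine that queues tasks and runs them on wait_all(),
    in insertion order (a valid order for sequential task flow)."""
    def __init__(self) -> None:
        self.queue: List[Tuple[Callable, Tuple]] = []
    def insert(self, func: Callable, args: Tuple) -> None:
        self.queue.append((func, args))
    def wait_all(self) -> None:
        for func, args in self.queue:
            func(*args)
        self.queue.clear()

class TaskFrontend:
    """Single entry point used by the application, whatever the engine."""
    def __init__(self, backend) -> None:
        self.backend = backend
    def insert_task(self, func: Callable, *args) -> None:
        self.backend.insert(func, args)
    def wait_all(self) -> None:
        self.backend.wait_all()

def axpy(alpha, x, y):
    """Example task: y <- alpha * x + y, updating the list y in place."""
    for i in range(len(y)):
        y[i] += alpha * x[i]

def application(frontend):
    """Application code written once against the frontend."""
    x, y = [1.0, 2.0], [10.0, 20.0]
    frontend.insert_task(axpy, 2.0, x, y)
    frontend.insert_task(axpy, 1.0, x, y)
    frontend.wait_all()
    return y

results = [application(TaskFrontend(b))
           for b in (ImmediateBackend(), DeferredBackend())]
print(results)
```

The application function never names an engine, so switching engines is a one-line change where the frontend is constructed. That is the property the paragraph above describes as a single-code application assessing various runtimes.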
Software Infrastructure: Applications, AL4SAN Frontend, AL4SAN Backends, Runtimes (StarPU, PaRSEC, QUARK, OpenMP), OS/Hardware
Runtime Support:
OpenMP-LLVM
PaRSEC
StarPU
QUARK
AL4SAN v1.0 Features:
Standardizing task-based runtime systems
Using a lightweight abstraction layer
Improving user productivity
Supporting different hardware architectures
Performing with a relatively limited overhead (up to 10%)
AL4SAN Roadmap:
Extending to more engines
Leveraging data abstraction
Integrating C++ constructs
Composing across dynamic runtime systems
Adding support to more algorithms and applications
Figure panels: Cholesky Pseudo-Code; Task Interface.
Performance Assessment panels: Dense Cholesky on Skylake; Low Rank Cholesky on Skylake; Low Rank Cholesky on Shaheen-2; Dense Cholesky on 8x Nvidia K80 GPUs. Panel captions: Left: PaRSEC, right: AL4SAN-PaRSEC; Left: QUARK, right: AL4SAN-QUARK; Left: StarPU, right: AL4SAN-StarPU; Left: OpenMP, right: AL4SAN-OpenMP.
Figure: Overhead Breakdown Across Runtimes on Skylake. Cumulative time [s] (0 to 20) versus matrix size (81000, 108000, 148500, 162000, 202500), split into Runtime Management, Task Insert, and Task Unpacking.
Figure: Task Scheduling Distribution on Skylake. Task count versus time [s] for PaRSEC, QUARK, StarPU, and OpenMP, each compared with its AL4SAN counterpart.
Main Reference: AL4SAN: Abstraction Layer For Standardizing APIs of Task-Based Engines, R. Alomairy, H. Ltaief, M. Abduljabbar, and D. Keyes, Submitted to IPDPS'19, Available at http://hdl.handle.net/10754/629718
The Tile Low-Rank Cholesky Matrix Factorization
Dense Linear Algebra Renaissance
The Tile Low-Rank Cholesky Matrix Factorization
Introducing BLAS for batched TLR LA on GPUs: KBLAS
Context:
Very small sizes ⇒ arithmetic intensity is low.
Humongous number of independent operations ⇒ kernel launch overhead is high.
Limited GPU occupancy ⇒ GPU CUDA cores are idle.
Solution: batched executions for TLR LA operations
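The batching idea does not depend on the GPU: many small independent operations are handed to the library in one call instead of one call each. The sketch below shows the pattern on the CPU with NumPy, which accepts a stack of matrices in `np.linalg.cholesky`. It is a stand-in for the concept only and does not use KBLAS.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, nb = 1000, 8

# A batch of small symmetric positive-definite tiles.
G = rng.standard_normal((batch, nb, nb))
tiles = G @ np.swapaxes(G, 1, 2) + nb * np.eye(nb)

# One call per tile: many tiny operations, each paying its own call overhead.
looped = np.stack([np.linalg.cholesky(t) for t in tiles])

# One batched call over the leading dimension.
batched = np.linalg.cholesky(tiles)

print("max difference:", np.abs(looped - batched).max())
```

Both routes return the same factors; the batched call simply amortizes the per-call overhead over the whole stack, which is the effect the slide attributes to batched kernels on GPUs.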
The Tile Low-Rank Cholesky Matrix Factorization
TLR-GEMM Variants
TLR-GEMM-LLD: update dense C.
TLR-GEMM-LLL: update TLR C.
Figure (both variants): matrices A, B, and C with dimensions M, N, and K, highlighting the nt-th outer product.
A. Charara, D. E. Keyes, and H. Ltaief, Batched Tile Low-Rank GEMM on GPUs, EuroPar, 2018.
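For the LLD variant, the two inputs are low-rank and the output is dense. A minimal single-tile sketch, assuming each low-rank tile is stored as a pair of thin factors (A = Ua Va^T, B = Ub Vb^T) and that the update is C ← C + AB; the storage convention, the sign of the update, and the variable names are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, K, ra, rb = 64, 48, 56, 4, 5

# Low-rank tiles stored as factors: A = Ua @ Va.T (M x K), B = Ub @ Vb.T (K x N).
Ua, Va = rng.standard_normal((M, ra)), rng.standard_normal((K, ra))
Ub, Vb = rng.standard_normal((K, rb)), rng.standard_normal((N, rb))
C = rng.standard_normal((M, N))      # dense output tile

# Reference result that forms A and B densely.
reference = C + (Ua @ Va.T) @ (Ub @ Vb.T)

# Low-rank route: never form A or B.
W = Va.T @ Ub                        # small ra x rb core
C_updated = C + Ua @ (W @ Vb.T)

print("max difference:", np.abs(C_updated - reference).max())
```

The only product that involves the long dimension K is the small core W, so the cost scales with the ranks rather than with the tile size.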
The Tile Low-Rank Cholesky Matrix Factorization
Batched TLR GEMM: uniform ranks
Update dense tile:
Higher memory footprint.
Update low-rank tile:
Requires re-compression: QR + SVD + GEMM.
W. H. Boukaram, G. Turkiyyah, H. Ltaief, and D. E. Keyes, Batched QR and SVD Algorithms on GPUs with Applications in Hierarchical Matrix Compression, Parallel Computing, vol. 74, pp. 19-33, 2018.
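The re-compression step for a low-rank output tile can be sketched as follows. Adding a low-rank update to a low-rank tile concatenates their factors, so the rank grows until it is truncated again; the QR + SVD + GEMM sequence below is a common way to do that truncation. The function name `recompress`, the relative tolerance, and the test data are assumptions for illustration.

```python
import numpy as np

def recompress(U, V, tol):
    """Recompress U @ V.T with QR + SVD + GEMM; tol is relative to the
    largest singular value."""
    Qu, Ru = np.linalg.qr(U)                 # QR of both tall factors
    Qv, Rv = np.linalg.qr(V)
    Us, s, Vst = np.linalg.svd(Ru @ Rv.T)    # SVD of the small core
    k = max(1, int(np.sum(s > tol * s[0])))
    # GEMMs rebuild the truncated tall factors.
    return (Qu @ Us[:, :k]) * s[:k], Qv @ Vst[:k].T

rng = np.random.default_rng(0)
M, N, r = 64, 48, 4

# Current low-rank tile C = Uc @ Vc.T and a low-rank update P = Up @ Vp.T
# built to share C's column space, so the sum still has rank r.
Uc, Vc = rng.standard_normal((M, r)), rng.standard_normal((N, r))
Up, Vp = Uc @ rng.standard_normal((r, r)), rng.standard_normal((N, r))

# Adding low-rank tiles concatenates factors: rank 2r before recompression.
U_sum, V_sum = np.hstack([Uc, Up]), np.hstack([Vc, Vp])
U_new, V_new = recompress(U_sum, V_sum, tol=1e-8)

print("rank before:", U_sum.shape[1], "rank after:", U_new.shape[1])
print("max difference:", np.abs(U_new @ V_new.T - U_sum @ V_sum.T).max())
```

The SVD is taken of a core whose size depends only on the ranks, which keeps the step cheap as long as the ranks stay small.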
The Tile Low-Rank Cholesky Matrix Factorization
TLR POTRF: Uniform Ranks
Tile low-rank Cholesky factorization:
Uniform ranks.
Generate, compress, and factorize on-the-fly.
Single Pascal GPU P100.
Figure: time (s), log scale from 0.0625 to 64, versus matrix size. Curves: HiCMA-TLR-36cores-CPU, MAGMA-Dense-1-GPU, HiCMA-TLR-1-GPU. Annotation: 7X.
A. Charara, H. Ltaief, K. Akbudak, A. Mikhalev, and D. E. Keyes, Accelerating Tile Low-Rank Cholesky Factorization on GPUs, To be submitted at IEEE TPDS, 2019.
What’s Next?
Future work
KBLAS support for batched TLR LA operations on GPUs using non-uniform ranks
KBLAS support for batched TLR LA operations on x86 using Intel MKL / libxsmm
HiCMA support for HODLR/H (non-nested bases) data compression format
HiCMA support for Stochastic Gradient Descent using approximation of matrix inversion w/ P. Richtarik (KAUST)
Runtime support for rank growth
Runtime support for batched kernel executions
What’s Next?
Thank You!
Questions?