Scalable Algorithms for Tensor Computations
Edgar Solomonik
Department of Computer Science, University of Illinois at Urbana-Champaign
SIAM PP, Seattle WA
Laboratory for Parallel Numerical Algorithms
Recent/ongoing research topics
parallel matrix computations: QR factorization, triangular solve, eigenvalue problems
tensor computations: tensor decomposition, sparse tensor kernels, tensor completion
simulation of quantum systems: tensor networks, quantum chemistry, quantum circuits
fast bilinear algorithms: convolution algorithms, tensor symmetry, fast matrix multiplication

http://lpna.cs.illinois.edu
Outline
1 Introduction
2 Matrix Factorizations
3 Tensor Computations
4 Software Libraries
5 Conclusion
3D algorithms for matrix computations
For Cholesky factorization with p processors, BSP (critical path) costs are
F = Θ(n³/p), W = Θ(n²/√(cp)), S = Θ(√(cp))

using c copies of the matrix (the processor grid is 2D for c = 1 and 3D for c = p^(1/3)).
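For instance, plugging in the extremes of this tradeoff: c = 1 recovers the 2D costs W = Θ(n²/√p) and S = Θ(√p), while c = p^(1/3) reduces communication to W = Θ(n²/p^(2/3)) at the price of S = Θ(p^(2/3)) synchronizations.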
Achieving similar costs for LU, QR, and the symmetric eigenvalue problem requires algorithmic changes:
triangular solve:        square TRSM ✓¹        rectangular TRSM ✓²
LU with pivoting:        pairwise pivoting ✓³   tournament pivoting ✓⁴
QR factorization:        Givens on square ✓³    Householder on rectangular ✓⁵
symmetric eigenproblem:  eigenvalues only ✓⁵    eigenvectors ✗

✓ means costs attained (synchronization within polylog factors).
1. B. Lipshitz, MS thesis, 2013
2. T. Wicky, E.S., T. Hoefler, IPDPS 2017
3. A. Tiskin, FGCS 2007
4. E.S., J. Demmel, Euro-Par 2011
5. E.S., G. Ballard, T. Hoefler, J. Demmel, SPAA 2017
New algorithms can circumvent lower bounds
For TRSM, we can achieve a lower synchronization/communication cost by performing triangular inversion on diagonal blocks:
decreases synchronization cost by a factor of O(p^(2/3)) on p processors with respect to known algorithms
optimal communication for any number of right-hand sides
MS thesis work by Tobias Wicky¹
1. T. Wicky, E.S., T. Hoefler, IPDPS 2017
Cholesky-QR2 for rectangular matrices
Cholesky-QR2¹ with 3D Cholesky gives a practical 3D QR algorithm²

Compute A = QR using the Cholesky factorization AᵀA = RᵀR

Correct the computed factorization by a Cholesky-QR of Q

Attains full accuracy so long as cond(A) < 1/√(ε_mach)
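To make the idea concrete, here is a minimal sequential NumPy sketch of Cholesky-QR2 (illustrative only; this is not the parallel 3D implementation):

```python
import numpy as np

def cholesky_qr(A):
    # Gram-matrix QR: A^T A = R^T R with R upper triangular
    L = np.linalg.cholesky(A.T @ A)   # A^T A = L L^T, so R = L^T
    Q = np.linalg.solve(L, A.T).T     # Q = A R^{-1}
    return Q, L.T

def cholesky_qr2(A):
    # a second Cholesky-QR pass re-orthogonalizes Q; attains full
    # accuracy provided cond(A) < 1/sqrt(machine epsilon)
    Q1, R1 = cholesky_qr(A)
    Q, R2 = cholesky_qr(Q1)
    return Q, R2 @ R1                 # A = Q (R2 R1)
```

For a reasonably conditioned tall-skinny input, e.g. Q, R = cholesky_qr2(np.random.randn(1000, 64)), the orthogonality error ‖QᵀQ − I‖ is near machine precision.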
[Figure: QR weak scaling (16384×2048 matrix at 64 cores), runtime (s) vs. number of KNL cores (64 to 32768), comparing ScaLAPACK and CQR2.]
1. T. Fukaya, Y. Nakatsukasa, Y. Yanagisawa, Y. Yamamoto, 2014
2. E. Hutter, E.S., IPDPS 2019
Tridiagonalization
Reducing the symmetric matrix A ∈ ℝ^(n×n) to a tridiagonal matrix T = QᵀAQ by an orthogonal similarity transformation is the most costly step in eigenvalue computation (the SVD is similar).
can be done by successive column QR factorizations,

T = Q_nᵀ ··· Q_1ᵀ A Q_1 ··· Q_n,  where Qᵀ = Q_nᵀ ··· Q_1ᵀ and Q = Q_1 ··· Q_n
two-sided updates are harder to manage than one-sided ones

can use n/b QR factorizations on panels of b columns to reduce to bandwidth b + 1
b = 1 gives direct tridiagonalization
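For illustration, a minimal dense NumPy sketch of direct (b = 1) Householder tridiagonalization; the algorithms discussed here are blocked and parallel, but the two-sided update has the same structure:

```python
import numpy as np

def tridiagonalize(A):
    # reduce symmetric A to tridiagonal T = Q^T A Q by Householder
    # reflectors applied as two-sided updates (the b = 1 case)
    T = A.copy()
    n = T.shape[0]
    for k in range(n - 2):
        x = T[k+1:, k]
        nx = np.linalg.norm(x)
        if nx == 0.0:
            continue
        v = x.copy()
        v[0] += np.copysign(nx, x[0])   # sign choice avoids cancellation
        v /= np.linalg.norm(v)
        # apply H = I - 2 v v^T from the left and from the right
        T[k+1:, :] -= 2.0 * np.outer(v, v @ T[k+1:, :])
        T[:, k+1:] -= 2.0 * np.outer(T[:, k+1:] @ v, v)
    return T
```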
Multi-stage tridiagonalization
Writing the orthogonal transformation in Householder form, we get
QᵀAQ = (I − UTUᵀ)ᵀ A (I − UTUᵀ) = A − UVᵀ − VUᵀ

where the columns of U contain the Householder vectors and

Vᵀ = TUᵀA + (1/2) TᵀUᵀ(AU) TUᵀ

in which forming the product AU is the bottleneck
if b = 1, U is a column vector, and computing AU is dominated by vertical communication cost (moving A between memory and cache)

idea: reduce to a banded matrix (b ≫ 1) first¹
1. T. Auckenthaler, H. Bungartz, T. Huckle, L. Kramer, B. Lang, P. Willems, 2011
Successive band reduction (SBR)
After reducing to a banded matrix, we need to transform the banded matrix to a tridiagonal one:

fewer nonzeros lead to lower computational cost, F = O(n²b/p)

however, the transformations introduce fill/bulges

bulges must be chased down the band¹

a communication- and synchronization-efficient 1D SBR algorithm is known for small bandwidth²
1. B. Lang, 1993; C.H. Bischof, B. Lang, X. Sun, 2000
2. G. Ballard, J. Demmel, N. Knight, 2012
Communication-efficient eigenvalue computation
Previous work (state-of-the-art): two-stage tridiagonalization

implemented in ELPA, can outperform ScaLAPACK¹

with bandwidth b = n/√p, 1D SBR gives² W = O(n²/√p), S = O(√p log²(p))
New results³: many-stage tridiagonalization

Θ(log(p)) intermediate bandwidths to achieve W = O(n²/p^(2/3))
communication-efficient rectangular QR with processor groups
3D SBR (each QR and matrix-multiplication update parallelized)

1. T. Auckenthaler, H. Bungartz, T. Huckle, L. Kramer, B. Lang, P. Willems, 2011
2. G. Ballard, J. Demmel, N. Knight, 2012
3. E.S., G. Ballard, J. Demmel, T. Hoefler, 2017
Symmetric eigensolver results summary
Algorithm            W            Q                   S
ScaLAPACK            n²/√p        n³/p                n log(p)
ELPA                 n²/√p        –                   n log(p)
two-stage + 1D-SBR   n²/√p        n² log(n)/√p        √p (log²(p) + log(n))
many-stage           n²/p^(2/3)   n² log(p)/p^(2/3)   p^(2/3) log²(p)
costs are asymptotic (same computational cost F for eigenvalues)
W – horizontal (interprocessor) communication
Q – vertical (memory–cache) communication, excluding the term W + F/√H, where H is the cache size
S – synchronization cost (number of supersteps)
Tensor Decompositions
A tensor of order N has N modes and dimensions s × ··· × s

Canonical polyadic (CP) tensor decomposition¹

Alternating least squares (ALS) is the most widely used method

monotonic, linear convergence

The Gauss-Newton method is an emerging alternative

non-monotonic, but can achieve a superlinear convergence rate
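For reference, a rank-R CP decomposition of an order-3 tensor approximates t_ijk ≈ Σ_r u_ir v_jr w_kr; ALS fixes all factor matrices but one and solves a linear least-squares problem for that factor, cycling through the factors in turn.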
1. T.G. Kolda, B.W. Bader, SIAM Review 2009
Accelerating Alternating Least Squares
[Figure: left, ALS sweep time (seconds, log scale) vs. dimension size s (50 to 800) for PP-initialization, ALS, and PP-approximate; right, fitness vs. time (seconds) for ALS and PP.]
New algorithm: pairwise perturbation (PP)¹ approximates ALS

accurate when the ALS sweep over the factor matrices stagnates

rank R < s^(N−1) CP decomposition: ALS sweep cost O(s^N R) ⇒ O(s²R), up to 600x speed-up

rank R < s Tucker decomposition: ALS sweep cost O(s^N R) ⇒ O(s²R^(N−1))
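For example, for an order-3 tensor a CP-ALS sweep drops from O(s³R) to O(s²R) work, a factor-of-s reduction.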
1. L. Ma, E.S., arXiv:1811.10573
Regularization and Parallelism for Gauss-Newton
[Figure: left, convergence probability vs. rank for a random low-rank tensor with s = 4, comparing GN with varying regularization, GN with λ = 0.001, GN with diagonal regularization, and ALS; right, fitness vs. time (seconds) for a 3-H2O system with R = 200, comparing GN with varying λ, GN with λ = 10⁻³, and ALS.]
New regularization scheme¹ for Gauss-Newton CP with implicit CG²

Oscillates the regularization parameter geometrically between lower and upper thresholds

Achieves higher convergence likelihood

More accurate than ALS in applications

Faster than ALS both sequentially and in parallel
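A minimal sketch of the oscillation schedule described above; the thresholds, factor, and loop structure are illustrative assumptions, not the parameters of the paper:

```python
# illustrative schedule: the damping parameter is multiplied or divided
# by a constant factor each iteration, reversing direction at thresholds
lam, lo, hi, factor = 1e-3, 1e-6, 1.0, 2.0
shrinking = True
for it in range(100):
    # ... one Gauss-Newton step with Tikhonov damping lam goes here ...
    lam = lam / factor if shrinking else lam * factor
    if lam <= lo:
        shrinking = False
    elif lam >= hi:
        shrinking = True
```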
1. N. Singh, L. Ma, H. Yang, E.S., arXiv:1910.12331
2. P. Tichavsky, A.H. Phan, A. Cichocki, 2013
Symmetry in Tensor Contractions
[Figure: left, performance of nested AB + BA on a full KNL node, effective Gflop/s (2n³b³/time) vs. effective matrix dimension n·b, comparing the symmetry-preserving algorithm and direct evaluation for n = 16 and n = 64; right, time (seconds) vs. cores of KNL (threads or MPI processes) for a contraction with 9 k-points and 18 molecular orbitals, comparing manual block-by-block contraction with NumPy and irrep-aligned Cyclops (threaded/OpenMP and distributed/MPI).]
New algebraic transformations for contractions to exploit permutational symmetry¹ and group symmetry², prevalent in quantum chemistry/physics, e.g.,

C = AB + BA ⇒ c_ij = Σ_k [(a_ij + a_ik + a_jk)(b_ij + b_ik + b_jk)] − …
1. E.S., J. Demmel, CMAM 2020
2. Collaboration with Yang Gao, Phillip Helms, and Garnet Chan at Caltech
Library for Massively-Parallel Tensor Computations
Cyclops Tensor Framework¹: sparse/dense generalized tensor algebra
Cyclops is a C++ library that distributes each tensor over MPI
Used in quantum chemistry (PySCF, QChem)², quantum circuit simulation (IBM/LLNL)³, and graph analysis (betweenness centrality)⁴
Summations and contractions specified via Einstein notation
E["aixbjy"] += X["aixbjy"] - U["abu"]*V["iju"]*W["xyu"]
Best distributed contraction algorithm selected at runtime via models
Support for Python (numpy.ndarray backend), OpenMP, and GPU
Simple interface to core ScaLAPACK matrix factorization routines
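As a minimal sketch, the same style of contraction can be written through the Python interface via ctf.einsum (the dimensions below are illustrative assumptions):

```python
import ctf

n, R = 8, 4
X = ctf.random.random((n, n, n, n, n, n))   # order-6 input tensor
U = ctf.random.random((n, n, R))
V = ctf.random.random((n, n, R))
W = ctf.random.random((n, n, R))
# same contraction as the C++ Einstein-notation line above:
# E["aixbjy"] += X["aixbjy"] - U["abu"]*V["iju"]*W["xyu"]
E = X - ctf.einsum("abu,iju,xyu->aixbjy", U, V, W)
```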
1. https://github.com/cyclops-community/ctf
2. E.S., D. Matthews, J. Hammond, J.F. Stanton, J. Demmel, JPDC 2014
3. E. Pednault, J.A. Gunnels, G. Nannicini, L. Horesh, T. Magerlein, E.S., E. Draeger, E. Holland, R. Wisnieff, 2017
4. E.S., M. Besta, F. Vella, T. Hoefler, SC 2017
Sparsity in Tensor Contractions
[Figure: left, weak scaling of MP3 with no = 40, nv = 160: seconds per iteration vs. number of cores (24 to 6144) for dense and for 10%, 1%, and 0.1% sparse variants (sparse×dense and sparse×sparse); right, weak scaling of PySCF CCSD using Cyclops: seconds per CCSD iteration vs. number of KNL nodes (1 to 128) and occupied orbitals o (50 to 168), for dense, sparse-integral, and sparse-amplitude variants.]
Cyclops supports sparse representations of tensors¹

Choice of representation is specified in the tensor constructor

CSR or DCSR² (2-index CSF³) representations are used locally for contractions
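A minimal sketch of constructing and contracting a sparse tensor through the Python interface (dimensions and density are illustrative assumptions):

```python
import ctf

T = ctf.tensor((200, 200, 200), sp=True)   # request sparse storage
T.fill_sp_random(0., 1., 1e-4)             # fill ~0.01% of entries randomly
A = ctf.random.random((200, 200))
# contraction with a sparse operand; Cyclops selects the algorithm
B = ctf.einsum("ijk,kj->ik", T, A)
```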
1. E.S., T. Hoefler, 2015
2. A. Buluc, J.R. Gilbert, 2008
3. S. Smith, G. Karypis, 2015
All-at-Once Multi-Tensor Contraction
[Figure: left, Cyclops MTTKRP with m = 1B nonzeros, R = 50 on 4096 cores of Stampede2: time (seconds) vs. tensor dimension I = J = K and negative log density −log₂(m/(I·J·K)), comparing hypersparse, sparse, dense, all-at-once, and SPLATT; right, Cyclops TTTP with m = 1B, R = 60 on 4096 cores of Stampede2: time (seconds) vs. tensor dimension, comparing TTTP by dense contraction and all-at-once TTTP.]
With sparsity, all-at-once contraction¹ of multiple tensors can be faster².

Sparse tensor CP decomposition is dominated by the MTTKRP,

u_ir = Σ_{j,k} t_ijk v_jr w_kr

All-at-once sparse MTTKRP needs less communication than pairwise contraction

The tensor-times-tensor product (TTTP),

r_ijk = Σ_r t_ijk u_ir v_jr w_kr,

enables sparse residual computation and CP tensor completion

Cost and memory footprint are reduced asymptotically
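A minimal sketch of a sparse MTTKRP expressed as a single multi-tensor ctf.einsum call (dimensions, rank, and density are illustrative assumptions; whether it is executed all-at-once is up to the library):

```python
import ctf

I = J = K = 500
R = 50
T = ctf.tensor((I, J, K), sp=True)
T.fill_sp_random(0., 1., 1e-4)
V = ctf.random.random((J, R))
W = ctf.random.random((K, R))
# MTTKRP u_ir = sum_jk t_ijk v_jr w_kr, passed as one multi-tensor
# contraction rather than two pairwise products
U = ctf.einsum("ijk,jr,kr->ir", T, V, W)
```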
1. S. Smith, J. Park, G. Karypis, 2018
2. Z. Zhang, X. Wu, N. Zhang, S. Zhang, E.S., arXiv:1910.02371
Conclusion
Communication-avoiding matrix factorizations

3D algorithms move a factor of O(p^(1/6)) less data on p processors

Faster algorithms for TRSM, Cholesky, LU, QR, and the symmetric eigensolve

Optimization methods for tensor decomposition

Pairwise perturbation approximates ALS sweeps with asymptotically less cost

Gauss-Newton methods with a new regularization scheme are more effective than ALS

Algorithms and software for sparse/symmetric tensors

Algebraic reorganizations to exploit permutational and group symmetry

Cyclops library for dense/sparse generalized distributed tensor algebra in C++/Python
Acknowledgements
Laboratory for Parallel Numerical Algorithms (LPNA) at the University of Illinois

Key external collaborators: James Demmel, Torsten Hoefler, Garnet Chan

NSF OAC CSSI funding (award #1931258)

DOE Computational Science Graduate Fellowship

Stampede2 resources at TACC via XSEDE
http://lpna.cs.illinois.edu