New Directions in Numerical Linear Algebra and High Performance Computing: Celebrating the 70th Birthday of Jack Dongarra, July 7-8, 2021
Numerical methods and benchmarking across scales, precisions, and hardware platforms
Piotr Luszczek
July 7, 2021
University of Tennessee
If at first you don’t succeed...
[…] the problem with simulations is that they are doomed to succeed […]
Rodney Brooks
Whimsical Non-Sequitur
[…] the problem with benchmarks is that they are doomed to succeed […]
Anonymous
Measurement as a Tool for Science
I often say that when you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind; it may be the beginning of knowledge, but you have scarcely in your thoughts advanced to the state of Science, whatever the matter may be.
Lord Kelvin, PLA Vol.1 Electrical Units of Measure
If you can't measure it, you can't improve it.
Peter Drucker, Management
Scientific Code before the Disco Era

      subroutine dgefa(a,lda,n,ipvt,info)
      integer lda,n,ipvt(1),info
      double precision a(lda,1)
      double precision t
      integer idamax,j,k,kp1,l,nm1
      info = 0
      nm1 = n - 1
      if (nm1 .lt. 1) go to 70
      do 60 k = 1, nm1
         kp1 = k + 1
         l = idamax(n-k+1,a(k,k),1) + k - 1
         ipvt(k) = l
         if (a(l,k) .eq. 0.0d0) go to 40
         if (l .eq. k) go to 10
            t = a(l,k)
            a(l,k) = a(k,k)
            a(k,k) = t
   10    continue
         t = -1.0d0/a(k,k)
         call dscal(n-k,t,a(k+1,k),1)
         do 30 j = kp1, n
            t = a(l,j)
            if (l .eq. k) go to 20
               a(l,j) = a(k,j)
               a(k,j) = t
   20       continue
            call daxpy(n-k,t,a(k+1,k),1,a(k+1,j),1)
   30    continue
         go to 50
   40    continue
         info = k
   50    continue
   60 continue
   70 continue
      ipvt(n) = n
      if (a(n,n) .eq. 0.0d0) info = n
      return
      end
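The same partial-pivoting factorization can be sketched in modern NumPy. This is a hypothetical `lu_factor` helper, not part of LINPACK, and it uses full-row interchanges rather than dgefa's column-restricted swaps:

```python
import numpy as np

def lu_factor(a):
    """LU factorization with partial pivoting, in the spirit of dgefa.

    Returns (lu, piv): `lu` holds L (unit diagonal, strictly below) and
    U (on and above the diagonal); piv[k] is the row swapped with row k.
    """
    a = np.array(a, dtype=float)
    n = a.shape[0]
    piv = np.arange(n)
    for k in range(n - 1):
        # Pivot search in column k (dgefa's idamax call).
        l = k + int(np.argmax(np.abs(a[k:, k])))
        piv[k] = l
        if a[l, k] == 0.0:
            continue  # singular column; dgefa records info = k here
        if l != k:
            a[[k, l]] = a[[l, k]]   # row interchange
        a[k + 1:, k] /= a[k, k]     # multipliers (the dscal step)
        # Rank-1 trailing update (the daxpy loop over columns).
        a[k + 1:, k + 1:] -= np.outer(a[k + 1:, k], a[k, k + 1:])
    return a, piv
```

Reconstructing P·A from the pivot sequence and comparing it against L·U is a quick way to check the factorization.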
c********************************************
c***  KERNEL 5  TRI-DIAGONAL ELIMINATION, BELOW DIAGONAL (NO VECTORS)
c********************************************
cdir$ novector
 1005 DO 5 i = 2,n
    5 X(i) = Z(i) * (Y(i) - X(i-1))
cdir$ vector
c********************************************
c***  KERNEL 7  EQUATION OF STATE FRAGMENT
c********************************************
cdir$ ivdep
 1007 DO 7 k = 1,n
      X(k) = U(k) + R*( Z(k) + R*Y(k)) +
     1       T*( U(k+3) + R*( U(k+2) + R*U(k+1)) +
     2       T*( U(k+6) + Q*( U(k+5) + Q*U(k+4))))
    7 CONTINUE
c********************************************
c***  KERNEL 21  MATRIX*MATRIX PRODUCT
c********************************************
 1021 DO 21 k = 1,25
      DO 21 i = 1,25
      DO 21 j = 1,n
      PX(i,j) = PX(i,j) + VY(i,k) * CX(k,j)
   21 CONTINUE
c********************************************
c***  KERNEL 23  2-D IMPLICIT HYDRODYNAMICS
c********************************************
      fw = 0.17500d0
 1023 DO 23 j = 2,6
      DO 23 k = 2,n
      QA = ZA(k,j+1)*ZR(k,j) + ZA(k,j-1)*ZB(k,j) +
     1     ZA(k+1,j)*ZU(k,j) + ZA(k-1,j)*ZV(k,j) + ZZ(k,j)
   23 ZA(k,j) = ZA(k,j) + fw*(QA - ZA(k,j))
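A rough NumPy transcription of two of these kernels (with hypothetical helper names) shows why the compiler directives differ: kernel 5 carries a dependence through X(i-1) and must run sequentially, while kernel 7 is a pure whole-array expression:

```python
import numpy as np

def kernel5(x, y, z):
    """Tri-diagonal elimination recurrence: X(i) depends on X(i-1),
    so the loop cannot be vectorized (the slide's `novector` case)."""
    for i in range(1, len(x)):
        x[i] = z[i] * (y[i] - x[i - 1])
    return x

def kernel7(u, z, y, r, t, q, n):
    """Equation-of-state fragment: no loop-carried dependence, so the
    whole update is one data-parallel array expression (`ivdep` case)."""
    k = np.arange(n)
    return (u[k] + r * (z[k] + r * y[k])
            + t * (u[k + 3] + r * (u[k + 2] + r * u[k + 1])
            + t * (u[k + 6] + q * (u[k + 5] + q * u[k + 4]))))
```

Note that `kernel7` reads u[k] through u[k+6], so the input array must carry six extra trailing elements, just as the Fortran original does.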
Conditioning of Random Matrices after WWII
● With probability ≈ 1, κ < 10N
  – [von Neumann and Goldstine 1947]
● For a “random matrix” of order N the expectation value [of κ] has been shown to be about N.
  – [von Neumann 1963, p. 14]
● […] we choose two different values of κ, namely N and N√10
  – [von Neumann 1963, p. 477]
● Von Neumann’s goal was to pick test matrices with “rules of thumb”
● Modern random matrix theory
  – [Edelman and Sutton 2004] [Azaïs and Wschebor 2004] [Viswanath and Trefethen 1998] [Yeung and Chan 1997] [Trefethen and Schreiber 1990]
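Von Neumann's rule of thumb is easy to probe empirically today. A minimal NumPy sketch (the sample count and the use of the median are arbitrary choices here, not part of any cited result):

```python
import numpy as np

def median_condition(n, trials=20, seed=42):
    """Median 2-norm condition number over `trials` Gaussian random
    matrices of order n."""
    rng = np.random.default_rng(seed)
    return float(np.median([np.linalg.cond(rng.standard_normal((n, n)))
                            for _ in range(trials)]))

# kappa grows roughly linearly with N, consistent with the
# "expectation value of about N" quoted above.
for n in (50, 100, 200):
    print(n, median_condition(n))
```

Modern random matrix theory makes this precise; the experiment only illustrates the linear-in-N trend.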
Dangers of Single Value of Merit
Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.
Charles Goodhart
When a measure becomes a target, it ceases to be a good measure.
Marilyn Strathern
High Performance Conjugate Gradients
● Multigrid
● PDE: L[u] ≡ ∇²u = f
● Sparse matrix based on a 27-point stencil: A u = f
● Multigrid and Gauss-Seidel
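The smoother at the heart of HPCG can be sketched on a toy system. Here a dense forward Gauss-Seidel sweep on a 1-D Poisson matrix stands in for the sparse 27-point-stencil operator; this is a simplification for illustration, not the benchmark's implementation (HPCG uses a symmetric sweep on the sparse 3-D operator):

```python
import numpy as np

def gauss_seidel(A, b, x, sweeps=1):
    """Forward Gauss-Seidel sweeps for A x = b (dense sketch)."""
    n = len(b)
    for _ in range(sweeps):
        for i in range(n):
            # Solve row i exactly using the latest values of x.
            x[i] = (b[i] - A[i, :i] @ x[:i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
    return x

# Toy 1-D Poisson system (stencil [-1, 2, -1]) as a stand-in operator.
n = 8
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
x = np.zeros(n)
r_before = np.linalg.norm(b - A @ x)
x = gauss_seidel(A, b, x, sweeps=100)
r_after = np.linalg.norm(b - A @ x)
```

Repeated sweeps drive the residual down, which is exactly the smoothing behavior multigrid relies on between grid transfers.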
NVIDIA Ampere A100 Highlights

[SM diagram: per-partition INT32, FP32, and FP64 units plus a Tensor Core; LD/ST units and SFU; Register File (16,384 x 32 bits); Dispatch Unit and Warp Scheduler (32 threads/clock); L0 Instruction Cache]

[Bar chart: peak rates by precision for FP64, FP32, TF32, FP16, BF16, annotated with target workloads (graphics and gaming vs. science, ML, and AI)]

Precision   Width  Exponent bits  Mantissa bits  Epsilon    Max
Quadruple   128    15             112            O(10^-34)  1.2x10^4932
Extended    80     15             64             O(10^-19)
Double      64     11             52             O(10^-16)  1.8x10^308
Single      32     8              23             O(10^-7)   3.4x10^38
Half*       16     5              10             O(10^-3)   65504
BFloat      16     8              7              O(10^-2)   3.4x10^38
*Only the storage format is specified

The IEEE 754-2019 standard update includes a 16-bit format for computing
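The table's epsilon and max values can be spot-checked for the formats NumPy ships; bfloat16 (and TF32) are hardware formats NumPy lacks, so their entries are derived from the exponent/mantissa widths instead:

```python
import numpy as np

# Formats NumPy provides directly: query machine parameters.
for dtype, name in [(np.float16, "Half"), (np.float32, "Single"),
                    (np.float64, "Double")]:
    fi = np.finfo(dtype)
    print(f"{name}: eps = {fi.eps:.1e}, max = {fi.max:.3e}")

# BFloat16: 8 exponent bits, 7 mantissa bits -> FP32's range with far
# less precision, which is why it suits ML workloads.
bf16_eps = 2.0 ** -7                      # ~7.8e-3, i.e. O(10^-2)
bf16_max = (2 - 2.0 ** -7) * 2.0 ** 127   # ~3.39e38, essentially FP32's max
```

The derived bfloat16 numbers match the table: its epsilon is four orders of magnitude coarser than FP32's, but its dynamic range is nearly identical.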
The Landscape of Mixed-Precision Hardware
● Mixed-precision startup hardware
  – GraphCore: Colossus
  – Habana Labs: Gaudi
  – Cerebras: Wafer Scale Chip
  – Blaize: Graph Streaming Processor
  – Groq: Tensor Streaming Processor
  – SambaNova: Cardinal
  – Tenstorrent: Grayskull
● NVIDIA mixed-precision hardware
  – Pascal: FP16 units only
  – Volta: Tensor Cores and FP16
  – Turing: Tensor Cores and FP16
  – Ampere: Tensor Cores for FP16 and FP64
HPL-AI Benchmark Overview
● Half-Precision LINPACK for Accelerator Introspection
● Exploits low-precision hardware to take advantage of its performance: FP16, BF16, TF32
● Uses Carson–Higham three-precision iterative refinement
  – Dense LU in the lowest precision: A16 = L16 U16
  – GMRES preconditioned with the L16 and U16 factors, run in 64-bit precision
● 16-bit preconditioning with dense LU factorization

GMRES(A, x0, b, M^-1):
  for k = 0, 1, 2, …
    r_k ← b − A x_k
    z_k ← M^-1 r_k
    β ← ‖z_k‖₂
    V_:,0 ← z_k / β
    s ← [β, 0, 0, …, 0]^T
    for j = 0, 1, 2, …
      w ← M^-1 A V_:,j
      (w, H_:,j) ← orthogonalize(w, V_:,j)
      H_{j+1,j} ← ‖w‖₂
      V_:,j+1 ← w / ‖w‖₂
      H_:,j ← G_0 G_1 … G_{j−1} H_:,j
      G_j ← rotation_matrix(H_:,j)
      H_:,j ← G_j H_:,j
      s ← G_j s
    u_k ← V H^-1 s
    x_{k+1} ← x_k + u_k
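The refinement idea can be sketched in a few lines of NumPy. This is a simplification under two stated substitutions: float32 stands in for the FP16/BF16/TF32 factorization precision (NumPy has no half-precision LU), and a plain low-precision correction solve stands in for the preconditioned GMRES inner solver:

```python
import numpy as np

def refine(A, b, iters=5):
    """Iterative refinement sketch: solve in low precision, accumulate
    residuals and updates in high precision.  Each np.linalg.solve call
    re-factorizes; the real benchmark reuses the L16/U16 factors."""
    A32 = A.astype(np.float32)
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                       # residual in float64
        d = np.linalg.solve(A32, r.astype(np.float32))  # low-precision correction
        x = x + d.astype(np.float64)        # update in float64
    return x

# Well-conditioned demo system: low-precision start, float64-level answer.
rng = np.random.default_rng(1)
n = 100
A = rng.standard_normal((n, n)) + n * np.eye(n)
x_true = rng.standard_normal(n)
rel_err = np.linalg.norm(refine(A, A @ x_true) - x_true) / np.linalg.norm(x_true)
```

The point of the benchmark is exactly this: the expensive O(n³) factorization runs at low-precision speed, while a cheap high-precision refinement loop recovers a 64-bit-accurate solution.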