


COMPUTING THE MATRIX EXPONENTIAL WITH AN OPTIMIZED TAYLOR POLYNOMIAL APPROXIMATION

PHILIPP BADER∗, SERGIO BLANES†, FERNANDO CASAS‡

Abstract. We present a new way to compute the Taylor polynomial of the matrix exponential which reduces the number of matrix multiplications in comparison with the de-facto standard Paterson–Stockmeyer method for polynomial evaluation. Combined with the scaling and squaring procedure, this reduction is sufficient to make the Taylor method superior in performance to Padé approximants over a range of values of the matrix norms, and thus we propose its replacement in standard software kits. An efficient adjustment to make the method robust against overscaling is also proposed. The new methods reduce the number of matrix products by up to 40% for matrices of small norm. For large norms, when scaling and squaring is used, one product can be saved for 2/3 of the possible matrix norms. Moreover, the computational times are reduced to one third compared with the standard Matlab implementation of expm, due to a more cost-effective use of norm estimators. Numerical experiments show the performance and accuracy of the method in comparison with state-of-the-art implementations.

Keywords: exponential of a matrix, scaling, squaring, matrix polynomials.

1. Introduction. The search for efficient algorithms to compute the exponential of a square matrix has a tremendous interest, given the wide range of its applications in many branches of science. Its importance is clearly showcased by the great impact achieved by various reference reviews devoted to the subject, e.g. [10, 19, 21, 28]. The problem has been tackled of course by many authors and a wide variety of techniques have been proposed. Among them, the scaling and squaring procedure is perhaps the most popular when the dimension of the corresponding matrix runs well into the hundreds. As a matter of fact, this technique is incorporated in popular computing packages such as Matlab (expm) and Mathematica (MatrixExp) in combination with Padé approximants [2, 10, 12].

Specifically, for a given matrix $A \in \mathbb{C}^{N \times N}$, the scaling and squaring technique is based on the key property

(1.1)   $e^A = \left( e^{A/2^s} \right)^{2^s}, \qquad s \in \mathbb{N}.$

The exponential $e^{A/2^s}$ is then replaced by a rational approximation $r_m(A/2^s)$, namely the $[m/m]$ diagonal Padé approximant to $e^x$. The optimal choice of both parameters, $s$ and $m$, is determined in such a way that full machine accuracy is achieved with the minimal computational cost [2].

Diagonal Padé approximants have the form

(1.2)   $r_m(A) = \dfrac{p_m(A)}{p_m(-A)},$

where the polynomial $p_m(x)$ is given by

$p_m(x) = \displaystyle\sum_{j=0}^{m} \dfrac{(2m-j)!\, m!}{(2m)!\, (m-j)!}\, \dfrac{x^j}{j!},$

∗ DEPARTAMENT DE MATEMÀTIQUES, UNIVERSITAT JAUME I, 12071 CASTELLÓN, SPAIN. [email protected]
† INSTITUTO DE MATEMÁTICA MULTIDISCIPLINAR, UNIVERSITAT POLITÈCNICA DE VALÈNCIA, E-46022 VALENCIA, SPAIN. [email protected]
‡ IMAC AND DEPARTAMENT DE MATEMÀTIQUES, UNIVERSITAT JAUME I, 12071 CASTELLÓN, SPAIN. [email protected]



so that $r_m(A) = e^A + \mathcal{O}(A^{2m+1})$. In practice, the evaluation of $p_m(A)$ and $p_m(-A)$ is carried out with a reduced number of matrix products. For estimating the computational effort required, the cost of the inverse is taken as 4/3 the cost of one matrix product: it requires one LU factorization at the cost of 1/3 products, plus $N$ solutions of upper and lower triangular systems by forward and backward substitution, at the cost of one product. For illustration, the computation procedure for $r_5(A)$ is given by [10]

(1.3)   $u_5 = A\left( b_5 A_4 + b_3 A_2 + b_1 I \right), \qquad v_5 = b_4 A_4 + b_2 A_2 + b_0 I, \qquad (-u_5 + v_5)\, r_5(A) = u_5 + v_5,$

with appropriate coefficients $b_j$, whereas for $r_{13}(A)$ one has

(1.4)   $u_{13} = A\left( A_6 (b_{13} A_6 + b_{11} A_4 + b_9 A_2) + b_7 A_6 + b_5 A_4 + b_3 A_2 + b_1 I \right),$
        $v_{13} = A_6 (b_{12} A_6 + b_{10} A_4 + b_8 A_2) + b_6 A_6 + b_4 A_4 + b_2 A_2 + b_0 I,$
        $(-u_{13} + v_{13})\, r_{13}(A) = u_{13} + v_{13}.$

Here $A_2 = A^2$, $A_4 = A_2^2$ and $A_6 = A_2 A_4$. Written in this form, it is clear that only three and six matrix multiplications and one inversion are required to obtain approximations of order 10 and 26 to the exponential, respectively. Diagonal Padé approximants $r_m(A)$ with $m = 3, 5, 7, 9$ and $13$ are in fact used by the function expm in Matlab.
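As a concrete illustration, the following minimal Matlab sketch (ours, not the authors' code; the random test matrix is only for checking) evaluates $r_5(A)$ via (1.3), with the coefficients $b_j$ obtained from the formula for $p_m$ in (1.2):

    % r_5(A) as in (1.3): two products for A2, A4, one product for u5,
    % and one inverse; the b_j are the coefficients of p_5 in (1.2).
    m = 5;  A = randn(6)/10;  I = eye(6);
    j = 0:m;
    b = factorial(2*m-j)*factorial(m) ./ (factorial(2*m)*factorial(m-j).*factorial(j));
    A2 = A*A;  A4 = A2*A2;                 % two products
    u5 = A*(b(6)*A4 + b(4)*A2 + b(2)*I);   % third product (odd part of p_5)
    v5 = b(5)*A4 + b(3)*A2 + b(1)*I;       % even part of p_5
    r5 = (-u5 + v5) \ (u5 + v5);           % one LU-based solve
    disp(norm(r5 - expm(A), 1))            % r_5(A) = e^A + O(A^11)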

In a straightforward implementation of the scaling and squaring algorithm, the choice of the optimal order of the approximation and the scaling parameter for a given matrix $A$ are based on the control of the backward error [10]. More specifically, given an approximation $t_n(A)$ to the exponential of order $n$, i.e., $t_n(A) = e^A + \mathcal{O}(A^{n+1})$, if one defines the function $h_{n+1}(x) = \log(e^{-x} t_n(x))$, then $t_n(2^{-s}A) = e^{2^{-s}A + h_{n+1}(2^{-s}A)}$ and

$\left( t_n(2^{-s}A) \right)^{2^s} = e^{A + 2^s h_{n+1}(2^{-s}A)} \equiv e^{A + \Delta A},$

where $\Delta A = 2^s h_{n+1}(2^{-s}A)$ is the backward error originating in the approximation of $e^A$. If in addition $h_{n+1}$ has a power series expansion

$h_{n+1}(x) = \displaystyle\sum_{k=n+1}^{\infty} c_k x^k$

with radius of convergence $\omega$, then it is clear that $\|h_{n+1}(A)\| \le \tilde{h}_{n+1}(\|A\|)$, where

$\tilde{h}_{n+1}(x) = \displaystyle\sum_{k=n+1}^{\infty} |c_k|\, x^k,$

and thus

(1.5)   $\dfrac{\|\Delta A\|}{\|A\|} = \dfrac{\|h_{n+1}(2^{-s}A)\|}{\|2^{-s}A\|} \le \dfrac{\tilde{h}_{n+1}(\|2^{-s}A\|)}{\|2^{-s}A\|}.$

Given a prescribed accuracy $u$ (for instance, $u = 2^{-53} \simeq 1.1 \cdot 10^{-16}$, the unit roundoff in double precision), one computes

(1.6)   $\theta_n = \max\{\theta : \tilde{h}_{n+1}(\theta)/\theta \le u\}.$

Then $\|\Delta A\| \le u \|A\|$ if $s$ is chosen so that $\|2^{-s}A\| \le \theta_n$, and $\left( t_n(2^{-s}A) \right)^{2^s}$ is used to approximate $e^A$.
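The thresholds $\theta_n$ can be computed numerically by truncating the series of $\tilde{h}_{n+1}$ (this is how the values of Table 2 below are obtained, using 150 terms). The following minimal Matlab sketch, which is ours and not part of the paper, does this for the Taylor approximant $t_n = T_n$ by building the series of $e^{-x} t_n(x)$ and of its logarithm through power-series recurrences:

    % theta_n of (1.6) for t_n = T_n: series of h_{n+1} = log(e^{-x} T_n(x))
    % truncated after K terms; then the scalar equation h~(x)/x = u is solved.
    function theta = theta_taylor(n, u)              % e.g. theta_taylor(18, 2^-53)
        K  = 150;
        tn = [1./factorial(0:n), zeros(1, K-n-1)];   % coefficients of T_n(x)
        ex = (-1).^(0:K-1) ./ factorial(0:K-1);      % coefficients of e^{-x}
        g  = conv(ex, tn);  g = g(1:K);              % e^{-x} T_n(x) = 1 + O(x^{n+1})
        h  = zeros(1, K);                            % log g, from g' = h' g
        for m = 1:K-1
            s = 0;
            for i = 1:m-1, s = s + i*h(i+1)*g(m-i+1); end
            h(m+1) = g(m+1) - s/m;
        end
        htilde = @(x) polyval(abs(h(end:-1:1)), x);  % series with |c_k|
        theta  = fzero(@(x) htilde(x)/x - u, [1e-10, 10]);
    end

For n = 18 and u = 2^-53 this should reproduce the value θ18 ≈ 1.09 of Table 2.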


Table 1
Values of θ2m for the diagonal Padé approximant rm of order 2m with the minimum number of products π for single and double precision. In bold, we have highlighted the asymptotically optimal order at which it is advantageous to apply scaling and squaring, since the increase of θ per extra product is smaller than the factor 2 from squaring. The values for double precision are taken from [12, Table A.1].

  π           0        1        2        3        4        5        6
  2m          2        4        6       10       14       18       26
  u ≤ 2^-24   8.46e-4  8.09e-2  4.26e-1  1.88     3.93     6.25     11.2
  u ≤ 2^-53   3.65e-8  5.32e-4  1.50e-2  2.54e-1  9.50e-1  2.10     5.37

In the particular case that tn is the [m/m] diagonal Padé approximant rm, then n = 2m, and the values of θ2m are collected in Table 1 when u is the unit roundoff in single and double precision, for m = 1, 2, 3, 5, 7, 9, 13. According to Higham [10], m = 13, and therefore r13, is the optimal choice in double precision when scaling is required. When ‖A‖ ≤ θ26, the algorithm in [10] takes the first m ∈ {3, 5, 7, 9, 13} such that ‖A‖ ≤ θ2m. This algorithm is later referred to as expm2005.

Padé approximants are not, of course, the only option one has in this setting. Another approach to the problem consists in using the Taylor polynomial of degree m as the underlying approximation tn to the exponential, i.e., n = m and $T_m(A) = \sum_{k=0}^{m} A^k/k!$, computed in an efficient way. Early attempts in this direction trace back to the work by Knuth [16] and especially by Paterson & Stockmeyer [22]. When Tm(A) is evaluated according to the Paterson–Stockmeyer (P-S) procedure, the number of matrix products is reduced and the overall performance is improved for matrices of small norm, although it is less efficient for matrices with large norm [11, 12, 23, 24, 25, 26].

In more detail, if the P-S technique is carried out in a Horner-like fashion, the maximal attainable degree is m = (k + 1)² by using 2k matrix products. The optimal choices for most cases then correspond to k = 2 (four products) and k = 3 (six products), i.e., to degree m = 9 and m = 16, respectively. The corresponding polynomials are then computed as

(1.7)   $T_9(A) = \displaystyle\sum_{i=0}^{9} c_i A^i = f_0 + \left( f_1 + (f_2 + c_9 A_3) A_3 \right) A_3,$
        $T_{16}(A) = \displaystyle\sum_{i=0}^{16} c_i A^i = g_0 + \left( g_1 + (g_2 + (g_3 + c_{16} A_4) A_4) A_4 \right) A_4,$

where $c_i = 1/i!$ and

$f_i = \displaystyle\sum_{k=0}^{2} c_{3i+k} A_k, \quad i = 0, 1, 2, \qquad\qquad g_i = \displaystyle\sum_{k=0}^{3} c_{4i+k} A_k, \quad i = 0, 1, 2, 3,$

respectively. Here, $A_0 = I$, $A_1 = A$, and, as before, $A_2 = A^2$, $A_3 = A_2 A$, $A_4 = A_2 A_2$.
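For concreteness, the following minimal Matlab sketch (ours, not the authors' code) evaluates $T_9(A)$ by the P-S scheme (1.7) with four matrix products:

    % T_9(A) via (1.7): products A2, A3, then two Horner-like recombinations.
    function T = taylor9_ps(A)
        I  = eye(size(A));
        c  = 1 ./ factorial(0:9);               % c_i = 1/i!
        A2 = A*A;  A3 = A2*A;                   % products 1 and 2
        P  = {I, A, A2};                        % A_0, A_1, A_2
        f  = cell(3,1);
        for i = 0:2                             % f_i = sum_k c_{3i+k} A_k
            f{i+1} = c(3*i+1)*P{1} + c(3*i+2)*P{2} + c(3*i+3)*P{3};
        end
        T = f{1} + (f{2} + (f{3} + c(10)*A3)*A3)*A3;   % products 3 and 4
    end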


In Table 2, we collect the values of the corresponding thresholds θm in (1.6) needed to select the best scheme for a given accuracy. They are computed by truncating the series of the corresponding functions $\tilde{h}_{m+1}(\theta)$ after 150 terms.

Table 2
Values of θm in (1.6) for the Taylor polynomial Tm of degree m with the minimum number of products π for single and double precision. In bold, we have highlighted the asymptotically optimal order at which it is advantageous to apply scaling and squaring, since the increase of θ per extra product is smaller than the factor 2 from squaring. We have included degree 24 to illustrate that the gain is marginal over scaling and squaring for double precision (θ24 − 2θ18 = 0.04) and negative for single precision.

  π           0         1        2        3        4        5        6
  m           1         2        4        8       12       18       24
  u ≤ 2^-24   1.19e-7   5.98e-4  5.12e-2  5.80e-1  1.46     3.01     4.65
  u ≤ 2^-53   2.22e-16  2.58e-8  3.40e-4  4.99e-2  2.99e-1  1.09     2.22

It is the main goal of this work to show that it is indeed possible to organize the computation of the Taylor polynomial of the matrix exponential function in a more efficient way than the Paterson–Stockmeyer technique, so that with the same number of matrix products one can construct a polynomial of higher degree. With the consequent cost reduction, a method for computing eA based on scaling and squaring together with the new implementation of the Taylor polynomial is proposed and shown to be more efficient than Padé approximants for a wide range of matrices A. In Section 2, we outline the new procedure to compute matrix polynomials reducing the number of matrix products. In Section 3, we apply it to the Taylor polynomial of the exponential and discuss maximal attainable degrees at a given number of matrix multiplications. Numerical experiments in Section 4 show the superior performance of the new scheme when compared to a standard Padé implementation using scaling and squaring (cf. Matlab Release R2013a). Section 5 is dedicated to error estimates and adaptations to reduce overscaling along the lines of the more recent algorithms [2] (cf. Matlab Release R2016a), whereas execution times of several methods are included in Section 6. Finally, Section 7 contains some concluding remarks.

2. A generalized recursive algorithm. Clearly, the most economic way to construct polynomials of degree $2^k$ is by applying the following sequence, which involves only k products:

(2.1)   $A_1 = A,$
        $A_2 = (\delta_1 I + x_1 A_1)(\delta_2 I + x_2 A_1),$
        $A_4 = (\delta_3 I + x_3 A_1 + x_4 A_2)(\delta_4 I + x_5 A_1 + x_6 A_2),$
        $A_8 = (\delta_5 I + x_7 A_1 + x_8 A_2 + x_9 A_4)(\delta_6 I + x_{10} A_1 + x_{11} A_2 + x_{12} A_4),$
        $\vdots$

with $\delta_i = 0, 1$. These polynomials are then linearly combined to form

$T_{2^k} = y_0 I + y_1 A_1 + y_2 A_2 + y_3 A_4 + y_4 A_8 + \cdots + y_{k+1} A_{2^k}.$

Notice that the indices in $A_{2^k}$ are chosen to indicate the highest attainable power, i.e., $A_{2^k} = \mathcal{O}(A^{2^k})$. A simple counting tells us that with k products one has $(k+1)^2 + 1$ parameters to construct a polynomial of degree $2^k$ containing $2^k + 1$ coefficients. It is then clear that the number of coefficients grows faster than the number of parameters, so that this procedure cannot be used to obtain high degree polynomials, as already noticed in [22]. Even worse, in general, not all parameters are independent and this simple estimate does not suffice to guarantee the existence of solutions with real coefficients.

Nevertheless, this procedure can be modified in such a way that additional parameters are introduced, at the price, of course, of including some extra products. In particular, we could include new terms of the form

$(\gamma_1 I + z_1 A_1)(\gamma_2 I + z_2 A_1 + z_3 A_2),$

not only in the previous $A_k$, $k > 2$, but also in $T_{2^k}$, which would allow us to introduce a cubic term and an additional parameter.

Although the Paterson–Stockmeyer technique is arguably the most efficient procedure to evaluate a general polynomial, there are relevant classes of polynomials for which the P-S rule involves more products than strictly necessary. To illustrate this feature, let us consider the evaluation of

(2.2)   $\Psi(k, A) = I + A + \cdots + A^{k-1},$

a problem addressed in [30]. Polynomial (2.2) appears in connection with the integral of the state transition matrix and the analysis of multirate sampled data systems. In [30] it is shown that with three matrix products one can evaluate Ψ(7, A) (as with the P-S rule), whereas with four products it is possible to compute Ψ(11, A) (one degree higher than using the P-S rule). In general, the savings with respect to the P-S technique grow with the degree k. The procedure was further improved and analysed in [17], where the following conjecture was posed: the minimum number of products to evaluate Ψ(k, A) is $2\lfloor \log_2 k \rfloor - 2 + i_{j-1}$, where $k = (i_j, i_{j-1}, \ldots, i_1, i_0)_2$ (written in binary), i.e., $i_{j-1}$ is the second most significant bit.

This conjecture is not true in general, as is illustrated by the following algorithm of type (2.1), that allows one to compute Ψ(9, A) by using only three matrix products:

(2.3)   $A_2 = A^2, \qquad B = x_1 I + x_2 A + x_3 A_2,$
        $A_4 = x_4 I + x_5 A + B^2,$
        $A_8 = (x_9 A_2 + A_4)\, A_4,$
        $\Psi(9, A) = x_6 I + x_7 A + x_8 A_2 + A_8,$

with

$x_1 = \dfrac{-5 + 6\sqrt{7}}{32}, \quad x_2 = -\dfrac{1}{4}, \quad x_3 = -1, \quad x_4 = \dfrac{3(169 + 20\sqrt{7})}{1024}, \quad x_5 = \dfrac{3(5 + 2\sqrt{7})}{64},$

$x_6 = \dfrac{1695}{4096}, \quad x_7 = \dfrac{267}{512}, \quad x_8 = \dfrac{21}{64}, \quad x_9 = \dfrac{3\sqrt{7}}{4}.$

Notice that, since the coefficients of the algorithm must satisfy a system of nonlinear equations, they are irrational numbers.
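The scheme can be checked numerically; the following Matlab sketch (ours) evaluates (2.3) and compares it with the naive sum:

    % Psi(9,A) = I + A + ... + A^8 via (2.3), with three matrix products.
    n  = 8;  A = randn(n)/n;  I = eye(n);
    s7 = sqrt(7);
    x  = [(-5+6*s7)/32, -1/4, -1, 3*(169+20*s7)/1024, 3*(5+2*s7)/64, ...
          1695/4096, 267/512, 21/64, 3*s7/4];
    A2 = A*A;                                 % product 1
    B  = x(1)*I + x(2)*A + x(3)*A2;
    A4 = x(4)*I + x(5)*A + B*B;               % product 2
    A8 = (x(9)*A2 + A4)*A4;                   % product 3
    Psi9 = x(6)*I + x(7)*A + x(8)*A2 + A8;
    ref = I;  P = I;
    for k = 1:8, P = P*A; ref = ref + P; end  % naive reference (8 products)
    disp(norm(Psi9 - ref, 1)/norm(ref, 1))    % of the order of machine epsilon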

Although by following this approach it is not possible to achieve degree 16 with four products, there are other polynomials of degree 16 that can indeed be computed with only four products. This is the case, in particular, of the truncated Taylor expansion of the function cos(A):

$T_{16} = \displaystyle\sum_{i=0}^{8} \dfrac{(-1)^i A^{2i}}{(2i)!} = \cos(A) + \mathcal{O}(A^{17}).$

Taking $B = A^2$, we obtain a polynomial of degree 8 in B that can be evaluated with three additional products in a similar way as in the computation of Ψ(9, A), but with different coefficients.

3. A new procedure to evaluate the Taylor polynomial approximation Tn(A). Algorithm (2.1) can be conveniently modified along the lines exposed in the previous section to compute the truncated Taylor expansion of the matrix exponential function

(3.1)   $T_n(A) = \displaystyle\sum_{i=0}^{n} \dfrac{A^i}{i!} = e^A + \mathcal{O}(A^{n+1}),$


for different values of n using the minimum number of products. In practice, we proceed in the reverse order: given a number k, we find a convenient (modified) sequence of type (2.1) that allows one to construct the highest degree polynomial Tn(A) using only k matrix products. The coefficients in the sequence satisfy a relatively large system of algebraic nonlinear equations. Here several possibilities may occur: (i) the system has no solution; (ii) there is a finite number of real and/or complex solutions; or (iii) there are families of solutions depending on parameters. In addition, if there are several solutions, we take a solution with small coefficients to avoid large round-off errors due to products of large and small numbers.

Remark 1. Notice that, in general, there is a multitude of ways to decompose a given polynomial, but for our purposes we only need one solution with real coefficients. Furthermore, the procedure described in (2.1) is modified below using an additional product $A_3 = A_2 A$ to reach degrees higher than 8.

With k = 0, 1, 2 products we can evaluate Tn for n = 1, 2, 4, in a similar way as the P-S rule, whereas for k = 3, 4, 5 and 6 products the situation is detailed next.

k = 3 products. In this case, only T6 can be determined with the P-S rule, whereas the following algorithm allows one to evaluate T8:

(3.2)   $A_2 = A^2,$
        $A_4 = A_2 (x_1 A + x_2 A_2),$
        $A_8 = (x_3 A_2 + A_4)(x_4 I + x_5 A + x_6 A_2 + x_7 A_4),$
        $T_8(A) = y_0 I + y_1 A + y_2 A_2 + A_8.$

Algorithm (3.2) is a particular example of the sequence (2.1) with some of the coefficients fixed to zero to avoid unnecessary redundancies. The parameters $x_i, y_i$ are then determined of course by requiring that $T_8(A) = \sum_{i=0}^{8} A^i/i!$.

With this sequence we get two families of solutions depending on a free parameter, x3, which is chosen to (approximately) minimize the 1-norm of the vector of parameters. The reasoning behind this approach is to avoid multiplications of high powers of A by large coefficients, in a similar vein as in the Horner procedure. The coefficients in (3.2) are given by

$x_1 = x_3\, \dfrac{1 + \sqrt{177}}{88}, \quad x_2 = \dfrac{1 + \sqrt{177}}{352}\, x_3, \quad x_4 = \dfrac{-271 + 29\sqrt{177}}{315\, x_3}, \quad x_5 = \dfrac{11(-1 + \sqrt{177})}{1260\, x_3},$

$x_6 = \dfrac{11(-9 + \sqrt{177})}{5040\, x_3}, \quad x_7 = \dfrac{89 - \sqrt{177}}{5040\, x_3^2}, \qquad y_0 = 1, \quad y_1 = 1, \quad y_2 = \dfrac{857 - 58\sqrt{177}}{630},$

$x_3 = 2/3.$

Perhaps surprisingly, T7(A) requires at least 4 products, so T8 may be considered a singular polynomial.
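The evaluation (3.2) with these coefficients can again be checked directly; the Matlab sketch below (ours, with a small-norm random test matrix) compares it with expm:

    % T_8(A) via (3.2) with three matrix products.
    A  = randn(6);  A = 0.05*A/norm(A,1);  I = eye(6);
    r  = sqrt(177);  x3 = 2/3;
    x1 = x3*(1+r)/88;           x2 = x3*(1+r)/352;
    x4 = (-271+29*r)/(315*x3);  x5 = 11*(-1+r)/(1260*x3);
    x6 = 11*(-9+r)/(5040*x3);   x7 = (89-r)/(5040*x3^2);
    y2 = (857-58*r)/630;
    A2 = A*A;                                         % product 1
    A4 = A2*(x1*A + x2*A2);                           % product 2
    A8 = (x3*A2 + A4)*(x4*I + x5*A + x6*A2 + x7*A4);  % product 3
    T8 = I + A + y2*A2 + A8;
    disp(norm(T8 - expm(A), 1))   % O(A^9) truncation error, here near eps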

k = 4 products. Although polynomials up to degree 16 can in principle be constructed by applying the sequence (2.1), our analysis suggests that the Taylor polynomial (3.1) corresponding to eA does not belong to that family. In consequence, as pointed out previously, some variations in our strategy have to be introduced. More specifically, we take Tn(A) for a given value of n and decompose it as a product of two polynomials of lower degree, plus a lower degree polynomial (that will be used to evaluate the higher degree polynomials). The highest value we have managed to reach is n = 12. It is important to note that this ansatz gives many different ways to write the sought polynomial. The following sequence, in particular, has been empirically shown to be comparable to Padé and Horner methods with respect to relative errors:

(3.3)   $A_2 = A^2,$
        $A_3 = A_2 A,$
        $B_1 = a_{0,1} I + a_{1,1} A + a_{2,1} A_2 + a_{3,1} A_3,$
        $B_2 = a_{0,2} I + a_{1,2} A + a_{2,2} A_2 + a_{3,2} A_3,$
        $B_3 = a_{0,3} I + a_{1,3} A + a_{2,3} A_2 + a_{3,3} A_3,$
        $B_4 = a_{0,4} I + a_{1,4} A + a_{2,4} A_2 + a_{3,4} A_3,$
        $A_6 = B_3 + B_4^2,$
        $T_{12}(A) = B_1 + (B_2 + A_6) A_6.$

This ansatz has four families of solutions with three free parameters, which can be obtained in closed form. Using the free parameters, we have minimized the 1-norm of the coefficients $\sum_{i,j} |a_{i,j}|$ and obtained

  a0,1 = −0.01860232051462055322,   a0,2 = 4.60000000000000000000,
  a0,3 = 0.21169311829980944294,    a0,4 = 0,
  a1,1 = −0.00500702322573317730,   a1,2 = 0.99287510353848683614,
  a1,3 = 0.15822438471572672537,    a1,4 = −0.13181061013830184015,
  a2,1 = −0.57342012296052226390,   a2,2 = −0.13244556105279963884,
  a2,3 = 0.16563516943672741501,    a2,4 = −0.02027855540589259079,
  a3,1 = −0.13339969394389205970,   a3,2 = 0.00172990000000000000,
  a3,3 = 0.01078627793157924250,    a3,4 = −0.00675951846863086359.

Although we report here 20 digits for the coefficients, they can in fact be determined with arbitrary accuracy.

k = 5 products. With 5 products, n = 18 is the highest value we have been able to achieve. We write T18 as the product of two polynomials of degree 9, that are further decomposed into polynomials of lower degree. The polynomial is evaluated through the following sequence:

(3.4)   $A_2 = A^2, \qquad A_3 = A_2 A, \qquad A_6 = A_3^2,$
        $B_1 = a_{0,1} I + a_{1,1} A + a_{2,1} A_2 + a_{3,1} A_3,$
        $B_2 = b_{0,1} I + b_{1,1} A + b_{2,1} A_2 + b_{3,1} A_3 + b_{6,1} A_6,$
        $B_3 = b_{0,2} I + b_{1,2} A + b_{2,2} A_2 + b_{3,2} A_3 + b_{6,2} A_6,$
        $B_4 = b_{0,3} I + b_{1,3} A + b_{2,3} A_2 + b_{3,3} A_3 + b_{6,3} A_6,$
        $B_5 = b_{0,4} I + b_{1,4} A + b_{2,4} A_2 + b_{3,4} A_3 + b_{6,4} A_6,$
        $A_9 = B_1 B_5 + B_4,$
        $T_{18}(A) = B_2 + (B_3 + A_9) A_9,$


with coefficients

  a0,1 = 0,                          a1,1 = −0.10036558103014462001,
  a2,1 = −0.00802924648241156960,    a3,1 = −0.00089213849804572995,
  b0,1 = 0,                          b1,1 = 0.39784974949964507614,
  b2,1 = 1.36783778460411719922,     b3,1 = 0.49828962252538267755,
  b6,1 = −0.00063789819459472330,    b0,2 = −10.9676396052962062593,
  b1,2 = 1.68015813878906197182,     b2,2 = 0.05717798464788655127,
  b3,2 = −0.00698210122488052084,    b6,2 = 0.00003349750170860705,
  b0,3 = −0.09043168323908105619,    b1,3 = −0.06764045190713819075,
  b2,3 = 0.06759613017704596460,     b3,3 = 0.02955525704293155274,
  b6,3 = −0.00001391802575160607,    b0,4 = 0,
  b1,4 = 0,                          b2,4 = −0.09233646193671185927,
  b3,4 = −0.01693649390020817171,    b6,4 = −0.00001400867981820361.

k = 6 products. With six products we can reconstruct the Taylor polynomial up to degree n = 22 by applying the same strategy. We have also explored different alternatives, considering decompositions based on the previous computation of low powers of the matrix, such as A2, A3, A4, A8, etc., to achieve degree n = 24, but all our attempts have been in vain. Nevertheless, we should remark that even if one could construct T24(A) with only six products, this would not lead to a significant advantage with respect to considering one scaling and squaring (s = 1 in eq. (1.1)) applied to the previous decomposition for T18.

In Table 3 we show the number of products required to evaluate Tn by applying the P-S rule and the new decomposition strategy. The improvement for k ≥ 3 products is apparent.

At this point, some remarks are in order:

Remark 2. As it should be clear from Tables 1 and 2, T18 (for the Taylor method) and r13 (for the Padé scheme) are the default choices when scaling and squaring is needed.

Remark 3. A seemingly obvious observation would be that the same optimization technique we have presented here for polynomial evaluation could also be applied to the numerator and denominator of Padé approximants. That this is not the case for the exponential can be grasped by noticing that the Padé scheme

$r_n(A) = \dfrac{p_n(A)}{p_n(-A)}$

requires the simultaneous evaluation of two polynomials for which better optimizations exist, cf. (1.3) [3 products], (1.4) [6 products]. Notice that our improvements for polynomial evaluation start from degree 8 onwards; then, we could compute

$\dfrac{p_{17}(A)}{p_{17}(-A)} = \dfrac{u_8(B) + A\, v_8(B)}{u_8(B) - A\, v_8(B)}, \qquad B = A^2,$

for some polynomials of degree eight, u8, v8, which requires 7 products (A2, A v8, 3 products for u8(B) and only 2 for v8 since B2 is reused). At one extra product, we thus increase the threshold to θ17 = 9.44, which is less than 2θ13 = 10.74 at the same cost. The addition of this method to a scheme could therefore only improve the stability, because it would avoid scaling for θ13 < ‖A‖ ≤ θ17, but it does not have an impact on the computational cost and is therefore not examined further in this work. For completeness, using T12, r25 can be computed using 8 products, but again θ25 = 18.7 < 2θ17 = 18.9 < 4θ13 = 21.5. With T18, we can compute r37, but then θ37 = 33.7 < 2²θ17 = 37.8 < 2³θ13 = 41.3.

Table 3
Minimal number of products to evaluate the Taylor polynomial approximation to eA of a given degree by applying the P-S technique and the new decomposition strategy.

  Paterson–Stockmeyer:   Products   1   2   3   4    5    6
                         Degree     2   4   6   9    12   16

  New decomposition:     Products   1   2   3   4    5    6
                         Degree     2   4   8   12   18   22

Remark 4. Concerning the effects of rounding errors on the evaluation of the Taylor polynomials for the exponential with the new decomposition, it is not

difficult to obtain error bounds similar to those existing when applying the Horner and Paterson–Stockmeyer techniques [9], [11, Theorem 4.5]. More specifically, if $\widehat{T}_m$ (m = 8, 12, 18) denotes the computed polynomial obtained by applying algorithm (3.2), (3.3) and (3.4), respectively, and $\tilde{\gamma}_n \equiv cnu/(1 - cnu)$, with c a small integer constant (whose concrete value is not relevant), then

$\|T_m(A) - \widehat{T}_m(A)\|_1 \le \tilde{\gamma}_{mN}\, T_m(\|A\|_1).$

Thus, if $\|A\|_1 \le \theta_m$, then, proceeding as in [11], one gets

$\|T_m(A) - \widehat{T}_m(A)\|_1 \approx \tilde{\gamma}_{mN}\, e^{\|A\|_1} \le \tilde{\gamma}_{mN}\, \|e^A\|_1\, e^{2\|A\|_1} \approx \tilde{\gamma}_{mN}\, \|T_m(A)\|_1\, e^{2\|A\|_1} \le \tilde{\gamma}_{mN}\, \|T_m(A)\|_1\, e^{2\theta_m},$

so that

(3.5)   $\dfrac{\|T_m(A) - \widehat{T}_m(A)\|_1}{\|T_m(A)\|_1} \le \tilde{\gamma}_{mN}\, e^{2\theta_m},$

and the relative error is bounded approximately by $\tilde{\gamma}_{mN}\, e^{2\theta_m}$. Notice that the bound (3.5) holds irrespective of the sign of the coefficients in (3.2)-(3.4).

If one considers the important case of an essentially non-negative matrix A (i.e., $A = (a_{ij})$ is such that $a_{ij} \ge 0$ for all $i \ne j$), since $T_m(2^{-s}A)$ approximates $e^{2^{-s}A}$ with a truncation error less than the machine precision, we can suppose that $T_m(2^{-s}A)$ is a non-negative matrix, and thus it will remain non-negative after the $s$ successive squarings. In this way, the presence of some negative coefficients in (3.3) and (3.4) should not alter the stability of the method [3].

4. Numerical performance. We assess the performance of three approximations based on scaling and squaring for the matrix exponential. Here, all methods first estimate the matrix norm ‖A‖; more precisely, the matrix 1-norm has been used in all algorithms. This value is then used to choose the optimal order n of each approximant using Tables 1 and 2 and, if necessary, the scaling parameter s by applying the error bound following (1.6). Specifically, $s = \lceil \log_2(\|A\|_1/\theta_n) \rceil$, where $\lceil \cdot \rceil$ denotes the rounding-up operation.

In Fig. 1, we compare methods using the Taylor polynomial based on the Paterson–Stockmeyer (P-S) rule (with orders 2, 4, 6, 9, 16) as well as on the new decompositions from Sect. 3, which we call expm2 (with orders 1, 2, 4, 8, 12, 18).

Algorithm 1 (expm2).
Input: square matrix A.
orders := {1, 2, 4, 8, 12}.
for k in orders:
    if ‖A‖1 ≤ θk: return Tk(A).
Apply scaling: s = ⌈log2(‖A‖1/θ18)⌉, As = 2^{−s}A.
Compute exponential: E = T18(As).
Apply s squarings to E and return the result.
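In Matlab, Algorithm 1 may be sketched as follows (a minimal sketch: the helper taylor_deg, standing for the degree-m decompositions of Section 3, and the hard-coded θ values from Table 2 are our assumptions, not the authors' implementation):

    % Skeleton of Algorithm 1 (expm2), double precision thresholds of Table 2.
    function E = expm2_sketch(A)
        ms    = [1 2 4 8 12];
        theta = [2.22e-16 2.58e-8 3.40e-4 4.99e-2 2.99e-1];
        nrmA  = norm(A, 1);
        for i = 1:numel(ms)                  % cheapest unscaled approximant
            if nrmA <= theta(i), E = taylor_deg(A, ms(i)); return, end
        end
        theta18 = 1.09;                      % scaling regime: degree 18
        s = ceil(log2(nrmA/theta18));
        E = taylor_deg(A/2^s, 18);           % evaluated via (3.4)
        for i = 1:s, E = E*E; end            % s squarings
    end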


For reference, we also include the Padé approximants from [10] (with orders 6, 10, 14, 18, 26). The procedure, which we call expm2005, is implemented as expm in Matlab Release R2013a. We plot ‖A‖ versus the cost, measured as the number of matrix products necessary to approximate eA, both in double (top) and single precision (bottom). For small values of ‖A‖, the new approach to compute the Taylor approximant shows the best performance, whereas for higher values it has similar performance as the Padé method.

The right panels show the scaling regime. We conclude that expm2 is superior (by 1/3 matrix products) for 2/3 of matrix norms and inferior (by 1/3) for the remaining third. The percentage of the norm range where expm2 is cheaper can be computed by determining the points where both methods transition into the scaling regime: for expm2 at order 18 from log10(‖A‖) > 0.037, and for expm2005 at order 26 for log10(‖A‖) > 0.732. The difference is approximately 1/3 modulo log10(2). For single precision, approximately the same conclusion holds. The corresponding points are log10(‖A‖) > 0.594 for expm2005 (order 14) and log10(‖A‖) > 0.479 for expm2 (order 18).

In Fig. 1, we have limited ourselves to values of log10 ‖A‖ of the order of 1, since for larger matrix norms all the procedures require scaling and squaring.

In Fig. 2, we compare expm2005 with our method expm2 for a range of test matrices. The color coding indicates the set to which the test matrix belongs: we have used 9 matrices of type (5.1) where b = 1, 10, . . . , 10^8 (green), 46 different special matrices from the Matlab matrix gallery, as has been done in [10], sampled at two different norms (blue), as well as 130 randomly generated matrices (red), all of dimension 16 × 16. Notice that the same experiment repeated with matrices of larger dimension (up to 60 × 60) or from a larger matrix test set shows virtually identical results.

For each matrix, we computed the exponential condition number [2]

$\kappa_{\exp}(A) := \lim_{\varepsilon \to 0}\ \sup_{\|E\| \le \varepsilon \|A\|} \dfrac{\|e^{A+E} - e^A\|}{\varepsilon \|e^A\|},$

and the reference solution eA was obtained using Mathematica with 100 digits of precision. From the top left panel, we see that both methods have relative errors ‖tn − eA‖/‖eA‖ close to machine accuracy for a wide range of matrices, and the bottom right panel illustrates that the relative errors of the two methods lie within a range of two digits of each other. The top right panel confirms that, due to the smaller θ values of the Taylor polynomials, more scalings are needed for expm2. From the bottom left panel, it follows that the new method expm2 is expected to be faster for most matrices within the test set. In Fig. 3 (top row), we provide performance profiles for a larger test set comparing the methods expm2 and expm2005. The bottom row anticipates the results from the methods expm3 and expm2009[2] described below, which incorporate norm estimators to avoid overscaling. From these plots, we can observe a clear improvement over the reference methods in terms of computational cost with only a minor impact on accuracy.

5. Refinement of error bounds to avoid overscaling. It has been noticed, in particular by [2], that the scaling and squaring technique based on the Padé approximant suffers from a phenomenon called overscaling: in some cases a value of s much larger than necessary is chosen, with the result of a loss of accuracy in floating point arithmetic. This feature is well illustrated by the following simple matrix proposed in [2]:

(5.1)   $A = \begin{pmatrix} 1 & b \\ 0 & -1 \end{pmatrix}, \qquad \text{so that} \qquad e^A = \begin{pmatrix} e & \frac{b}{2}(e - e^{-1}) \\ 0 & e^{-1} \end{pmatrix}.$


Fig. 1. Optimal orders and their respective cost vs. the norm of the matrix exponent. The numbers indicate the order of the approximants. Thick lines (right panels) show the cost when scaling and squaring is used, and the corresponding orders are highlighted in bold-face.

[Figure 1: four panels (double precision, top; single precision, bottom) plotting computational cost against log10(‖A‖) for expm2005, P-S and expm2; the right panels show the scaling regime.]

For |b| ≫ 1, we have $\|A\| \approx \|e^A\| \ll e^{\|A\|}$, and the error bound previously considered in Sect. 1, namely

$\|h_\ell(A)\| \le \tilde{h}_\ell(\|A\|),$

leads to an unnecessarily large value of the scaling parameter s: the algorithm chooses a large value of s in order to verify the condition $\|2^{-s}A\| < \theta_\ell$.

Needless to say, exactly the same considerations apply if we use instead a Taylor polynomial as the underlying approximation in the scaling and squaring algorithm.

This phenomenon can be alleviated by applying the strategy proposed in [2]. The idea consists in using the backward error analysis that underlies the algorithm in terms of the sequence $\{\|A^k\|^{1/k}\}$ instead of ‖A‖, the reasoning being that for certain classes of matrices (in particular, for matrices of type (5.1)) $\|A^k\|^{1/k}$, k > 1, is in fact much smaller than ‖A‖.
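This effect is easy to observe directly; the following Matlab snippet (ours) prints $\|A^k\|_1^{1/k}$ for the matrix (5.1) with b = 10^8:

    % For (5.1) one has A^2 = I, so ||A^k||^{1/k} collapses for k > 1
    % while ||A|| ~ b stays large.
    b = 1e8;  A = [1 b; 0 -1];
    d = zeros(1, 6);
    for k = 1:6, d(k) = norm(A^k, 1)^(1/k); end
    disp(norm(A, 1)), disp(d)   % d(2) = 1, d(3) ~ b^(1/3), d(4) = 1, ...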

If ρ(A) denotes the spectral radius of A, one has in general

$\rho(A) \le \|A^k\|^{1/k} \le \|A\|, \qquad k = 1, 2, 3, \ldots,$


Fig. 2. Comparison of expm2 and expm2005 for matrices of dimension ≤ 16 × 16. The color coding corresponds to different test matrices: special matrices (5.1) (green), Matlab matrix gallery (blue), random matrices (red). Top left panel: the logarithmic relative errors of the two methods show similar values around machine accuracy (solid horizontal line). We appreciate that for some special matrices, both methods have the potential to reach errors below machine accuracy. The same holds for slightly larger errors: for some matrices, especially of large condition number, both methods produce relative errors above machine accuracy. Top right panel: we observe that, as expected, the Padé based expm2005 uses fewer scalings due to the larger θ values from Table 1 compared with Table 2; the crosses are generally below the squares. Bottom left: the relative cost of the method, counting matrix multiplications and inversions only, indicates that the new method is cheaper for a wide range of matrices, as expected from Fig. 1. Bottom right: the difference of accurate digits is shown. The two methods lie within two digits of each other. For the relative errors, we have taken the maximum between the obtained value and the machine accuracy, in order to avoid misleading large differences at insignificant digits that can be appreciated in the top left panel.

[Figure 2: four panels plotting, against log10 of the exponential condition number, the relative errors, the number of scalings, the error ratio expm2/expm2005 and the relative cost expm2/expm2005.]

and moreover [2]

$\|h_\ell(A)\| \le \tilde{h}_\ell\big( \|A^t\|^{1/t} \big),$

where $\|A^t\|^{1/t} = \max\{ \|A^k\|^{1/k} : k \ge \ell \ \text{and}\ c_k \ne 0 \}$. Since the value of t is not easy to determine, the following results are more useful in practice.

Lemma 1 ([2]). For any $k \ge 1$ such that $k = pm_1 + qm_2$ with $p, q \in \mathbb{N}$ and $m_1, m_2 \in \mathbb{N} \cup \{0\}$,

$\|A^k\|^{1/k} \le \max\big( \|A^p\|^{1/p}, \|A^q\|^{1/q} \big).$


Fig. 3. Performance profiles to compare expm2 vs. expm2005 and expm3 vs. expm2009[2] for matrices of dimension ≤ 64 × 64. We have used 3000 matrices sampled as before from the Matlab matrix gallery, special matrices that are prone to overscaling, nilpotent matrices and random matrices. They have been scaled to cover a wide range of condition numbers. The left panels show the percentage of matrices that have a relative logarithmic error smaller than the abscissa. In the center panel, we see the percentage of matrices that can be computed using factor-times the number of products of the cheapest method for each problem. The right panel is a repetition of this experiment but shows the computation times (averaged over 10 runs) instead of the number of products.

[Figure 3: two rows of performance profiles (relative error, number of products, computation times) for expm2 vs. expm2005 (top) and expm3 vs. expm2009[2] (bottom).]

comp. times

Theorem 2 ([2]). Assuming that ρ(A) < ω and p ∈ ℕ, then

(a) $\|h_\ell(A)\| \le \tilde{h}_\ell(\delta_{p,p+1})$ if $\ell \ge p(p-1)$, with

$\delta_{p,p+1} \equiv \max\big( \|A^p\|^{1/p}, \|A^{p+1}\|^{1/(p+1)} \big);$

(b) $\|h_\ell(A)\| \le \tilde{h}_\ell(\delta_{2p,2p+2})$ if $\ell \ge 2p(p-1)$ and $h_\ell$ is even, with

$\delta_{2p,2p+2} \equiv \max\big( \|A^{2p}\|^{1/(2p)}, \|A^{2p+2}\|^{1/(2p+2)} \big).$

Denoting $d_k = \|A^k\|^{1/k}$ and $\delta_{i,j} = \max(d_i, d_j)$, applying Theorem 2 to the analysis of Sect. 1 allows one to replace the previous condition $\|2^{-s}A\| < \theta_n$ by

(a)   $2^{-s} \min\big( \delta_{p,p+1} : p(p-1) \le \ell \big) < \theta_\ell,$

or, if $h_\ell$ is an even function, as is the case with Padé approximants,

(b)   $2^{-s} \min\big( \delta_{2p,2p+2} : 2p(p-1) \le \ell \big) < \theta_\ell.$


The algorithm proposed in [2], which is implemented from Matlab R2016a onwards, incorporates the estimates from Theorem 2 (b) into the expm2005 (Padé) method; specifically, it computes δ4,6, δ6,8 and δ8,10.

This process requires computing ‖A^r‖ for some values of r for which A^r has not been previously evaluated in the algorithm. In these cases, one may use the fast block 1-norm estimation algorithm of Higham and Tisseur [13] to get norm estimates correct within a factor 3 for N × N matrices. In this section, we consider the cost of these estimations to scale with O(N²) and therefore neglect their contribution to the computational effort.

In order to reduce overscaling for the proposed algorithm expm2, we can only apply part (a) of Theorem 2, since the error expansion $h_\ell$ for the Taylor method is not an even function. It is not difficult to see, however, that the choices (a) and (b) in Theorem 2 are not the only options, and that other alternatives might provide even sharper bounds. We have to find pairs $(p_i, q_i)$ such that any $k \ge m$ can be written as $k = m_{1,i} p_i + m_{2,i} q_i$ for $m_{1,i}, m_{2,i} \in \mathbb{N} \cup \{0\}$. Then, for Taylor polynomials of degree n we have the following possible choices, where obviously every pair also serves to decompose higher degrees:

n = 2: we have the pair (2, 3).
n = 4: we have additionally the pair (2, 5).
n = 8: the following pairs are also valid: (2, 7), (2, 9), (3, 4), (3, 5).
n = 12: additionally, we obtain (2, 11), (2, 13), (3, 7), (4, 5).
n = 18: additionally, (2, 15), (2, 17), (2, 19), (3, 8), (3, 10), (4, 7).

This idea is further developed in [20]. We stress that for generic matrices for which $\{\|A^k\|^{1/k}\}$, k > 1, is not so different from ‖A‖, the use of such estimates based on $\{d_k\}$ to avoid overscaling leads to an increased computational cost at small to no benefit. In consequence, we propose the following adjustments to our algorithm to compromise between extra computational cost and reduction of overscaling.

Algorithm 2 (expm3). Same as Algorithm 1 (expm2) for ‖A‖1 ≤ θ18. For larger norms, degree 18 is selected, which requires the computation of the powers A2, A3, A6. Now, calculate the roots of the norms of these products, d2, d3, d6. Set η = max(d2, d3). Heuristically, we say that there is a large decay in the norms of powers when

(5.2)   $\min\left( \dfrac{d_2}{d_1}, \dfrac{d_3}{d_1}, \dfrac{d_6}{d_1} \right) \le \dfrac{1}{2^4}.$

This means that we can save at least 4 squarings using the right estimate.¹ If this is the case, then we also estimate d9 and update η = min(η, max(d2, d9)).

Finally, the number of scalings is given by s = ⌈log2(η/θ18)⌉, and the order-18 decomposition is applied.
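A minimal Matlab sketch of this heuristic follows (ours, not the authors' implementation: T18 stands for the degree-18 scheme (3.4), d9 is obtained here with an explicit extra product rather than a norm estimator, and a real implementation would rescale the stored powers of A instead of recomputing them after scaling):

    % Skeleton of Algorithm 2 (expm3) for ||A||_1 > theta_18.
    function E = expm3_sketch(A)
        d1 = norm(A, 1);
        A2 = A*A;  A3 = A2*A;  A6 = A3*A3;   % powers reused by T18
        d2 = norm(A2,1)^(1/2);  d3 = norm(A3,1)^(1/3);  d6 = norm(A6,1)^(1/6);
        eta = max(d2, d3);
        if min([d2 d3 d6]/d1) <= 2^(-4)      % large decay: refine with d9
            d9  = norm(A6*A3, 1)^(1/9);
            eta = min(eta, max(d2, d9));
        end
        theta18 = 1.09;                      % Table 2, double precision
        s = max(0, ceil(log2(eta/theta18)));
        E = T18(A/2^s);                      % degree-18 evaluation (3.4)
        for i = 1:s, E = E*E; end
    end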

Remark 5. For small matrix dimensions, it can be more efficient to explicitly compute matrix powers instead of using a sophisticated norm estimation algorithm. This is implemented in Matlab for dimensions smaller than 150 × 150. We point out that the additional products that are computed in this case can be used to increase the order of the Taylor methods following the procedures described in this work. The new problem formulation is then: given a set of matrix powers, which is the largest degree (Taylor) polynomial that can be computed at a given number of extra matrix products?

Algorithm expm3 is compared next with the refined method expm2009[2], which is based on Theorem 2 (b) for the Padé approximants and additionally implements further estimates to refine the scaling parameter.

¹ To be precise, we need min(d3/d2, d9/d2) < 1/2^k to reduce the required squarings by k. In practice, however, the available fractions in (5.2) are sufficient to estimate the behavior at high powers.


Fig. 4. Comparison of algorithms expm3 and expm2009[2] for matrices of dimension ≤ 16 × 16. Cf. Fig. 2.

[Figure 4: four panels analogous to Fig. 2, plotting relative errors, scalings, error ratio and relative cost of expm3 versus expm2009[2] against the exponential condition number.]

The experimental setup is identical to the one of Fig. 2. Recall that the computational cost of matrix norm estimates is not taken into account for either algorithm. The results are shown in Fig. 4. The relative errors of the two methods are very similar. From the lower left panel, a clear cost advantage over the algorithm based on Padé approximants is apparent.

6. Execution times. For modern architectures, the computational cost is not as straightforward to assess since, very often, computation time, memory access and other considerations can become important. In order to provide a more detailed assessment of the performance of the algorithm proposed here, we next compare the different procedures and some reference implementations for the same test matrix sets as before, but now with a much larger number of matrices, cf. Tab. 5.

The compared methods are tabulated in Tab. 4. Essentially, we have organized the algorithms in three groups: simple algorithms without norm estimates of type dk incorporated (expm2005, expm2); sophisticated algorithms to avoid overscaling (expm2009[2], expm3); and an optimized official implementation from Matlab (expm2016), which contains several additional tweaks and modifications for expm2009[2]. The computation times reported in Tab. 5 lead to several conclusions.


Table 4
Algorithms for timing.

  expm2005    Padé alg. [10] as expm in Matlab Release R2013a.
  expm2       Our Alg. 1 using the decomposition of this work.
  expm3       Our Alg. 2, modified to avoid overscaling.
  expm2009    Alg. 3.2 from [2], based on Padé approximants and norm estimations through Th. 2.
  expm2016    Padé alg. from [2] as expm in Matlab Release R2016a. This implementation contains special tweaks for small matrices, symmetric matrices, Schur decompositions and block structures.

Firstly, the implementation expm2005 has become slower by roughly a factor of three through the incorporation of the norm estimators dk (expm2009[2]), with a precision gain for matrices that tend to be overscaled. Second, this loss in speed has been mitigated to a factor of two by additional tweaks and support for special cases of matrices in the reference implementation expm2016. Third, the algorithms proposed here are overall the fastest, by roughly 10% without adjustments for overscaling (expm2), and by nearly 50% when estimates to deal with overscaling are incorporated (expm3).

Notice that if only matrices that tend to be overscaled (i.e., dk ≪ d1) were to be considered, the picture changes: expm3 is then approximately 10% slower than the expm2009[2] algorithm, and even 35% slower than the reference implementation expm2016. Arguably, such a scenario is very contrived. If we only omit the random matrices in the tests, the results are very close to the original experiment, with slightly amplified gains for the new methods. In Fig. 5, we show the performance profiles of the stated methods on a larger test set of matrices, predominantly of dimension n × n ≤ 1024 × 1024. The gain in computational speed is evident, but less pronounced for some matrices, due to the overhead from the norm estimations, which scales with O(n²). For example, we can deduce that for 50% of the test matrices, the Padé methods require at least 50% more computational time.

A highly relevant application of the matrix exponential is the (numerical) solution of differential equations. Numerical methods such as exponential integrators, Magnus expansions or splitting methods typically force small matrix norms due to the small time-step inherent to the methods. In this case, the proposed algorithm expm2 should be highly advantageous.

Table 5
Computation times for 5 iterations over 2000 random matrices, 700 matrices multiplied by random scalars from the Matlab matrix gallery, and matrices of type (5.1). The two right columns are extreme cases, where the matrix gallery and/or the random matrices have been omitted. The number of iterations over the smaller test sets has been increased to 1000 for (5.1) and to 100 without random matrices. The experiments have been conducted several times under identical conditions on an Intel i3-4130 2-core processor with 4GB of RAM and Matlab Release R2016a, yielding virtually identical results (within 100 ms).

                 16×16   64×64   128×128   only (5.1)   w/o rand.
  expm2005       1.41s   4.9s    7.42s     0.90s        9.97s
  expm2          1.25s   4.4s    6.92s     0.78s        7.21s
  expm2009[2]    8.57s   15.7s   16.8s     1.77s        66.5s
  expm3          1.75s   4.7s    7.50s     1.98s        10.7s
  expm2016       2.87s   8.6s    12.1s     1.20s        14.5s


Fig. 5. Performance profiles for matrices of dimension ≤ 1024 × 1024. We have used 3000 matrices from the Matlab matrix gallery, special matrices that are prone to overscaling, nilpotent matrices and random matrices. Since the computation of the exact solution for this large test set was too costly, we only illustrate the performance plots for the number of required matrix products and the corresponding run times.

[Figure 5: performance profiles (number of products, left; computation time, right) for expm2016, expm2009[2], expm2005, expm2 and expm3.]

Remark 6. After the submission of this work, we were made aware of related results on the decomposition of the polynomial [27]. As is clear from the general form (2.1), there are possibly many ways to write a given polynomial, and some alternatives to our decompositions can be found in [27].

7. Concluding remarks. We have presented a new way to construct the Taylor polynomial approximating the matrix exponential function up to degree 18, requiring fewer products than the Paterson–Stockmeyer technique. This reduction of matrix products is due to a particular factorization of the polynomial in terms of polynomials of lower degree, where the coefficients of the factorization satisfy a system of algebraic equations.

More specifically, the proposed algorithm uses up to three times fewer matrix multiplications than the reference Padé implementations and, on average, the computational times are reduced by the same factor.

In combination with scaling and squaring, this yields a procedure to compute the matrix exponential up to the desired accuracy at lower computational cost than the standard Padé method for a wide range of matrices. Based on estimates of the form ‖A^k‖^{1/k} and the use of scaling and squaring, we have presented a modification of our algorithm to reduce overscaling that is still more efficient than state-of-the-art implementations, with a slightly lower accuracy for some matrices but higher accuracy for others. The loss in accuracy can be attributed to possible overscalings due to a reduced number of norm estimations compared with Padé methods.

In practice, the maximal degree considered for the Taylor polynomial is 18, and the reduction in the number of matrix products required is due to a particular factorization of the polynomial in terms of polynomials of lower degree whose coefficients satisfy an algebraic system of nonlinear equations.

Although this technique, combined with scaling and squaring, can be applied in principle to any matrix, it is clear that, in some cases, one can take advantage of the very particular structure of the matrix A and then design especially tailored (and very often more efficient) methods for such problems [7, 8, 15]. For example, if one can find an additive decomposition A = B + C such that ‖C‖ is small and B is easy to exponentiate, i.e., e^B is sparse and exactly solvable (or can be accurately and cheaply approximated numerically), and C is a dense matrix, then more efficient methods can be found in [4, 21]. Exponentials of upper or lower triangular matrices A have been treated in [2], where it is shown that it is advantageous to exploit the fact that the diagonal elements of the exponential are exactly known. It is then more efficient to replace the diagonal elements obtained numerically by the exact solution before squaring the matrix. This technique can also be extended to the first super- (or sub-)diagonal elements.

For the convenience of the reader, we provide Matlab implementations of the proposed schemes on our website [1].

Acknowledgments. The authors would like to thank the anonymous referees for many insightful remarks that have contributed to improve the paper. This work has been funded by Ministerio de Economía, Industria y Competitividad (Spain) through project MTM2016-77660-P (AEI/FEDER, UE). P.B. has been additionally supported by a contract within the Program Juan de la Cierva Formación (Spain).

REFERENCES

[1] http://www.gicas.uji.es/Research/MatrixExp.html
[2] A.H. Al-Mohy and N.J. Higham, A new scaling and squaring algorithm for the matrix exponential, SIAM J. Matrix Anal. Appl., 31 (2009), pp. 970–989.
[3] M. Arioli, B. Codenotti, and C. Fassino, The Padé method for computing the matrix exponential, Linear Algebra Appl., 240 (1996), pp. 111–130.
[4] P. Bader, S. Blanes, and M. Seydaoglu, The scaling, splitting and squaring method for the exponential of perturbed matrices, SIAM J. Matrix Anal. Appl., 36 (2015), pp. 594–614.
[5] S. Blanes, F. Casas, J.A. Oteo, and J. Ros, The Magnus expansion and some of its applications, Phys. Rep., 470 (2009), pp. 151–238.
[6] S. Blanes, F. Casas, and A. Murua, An efficient algorithm based on splitting for the time integration of the Schrödinger equation, J. Comput. Phys., 303 (2015), pp. 396–412.
[7] E. Celledoni and A. Iserles, Approximating the exponential from a Lie algebra to a Lie group, Math. Comput., 69 (2000), pp. 1457–1480.
[8] E. Celledoni and A. Iserles, Methods for the approximation of the matrix exponential in a Lie-algebraic setting, IMA J. Numer. Anal., 21 (2001), pp. 463–488.
[9] N.J. Higham, Accuracy and Stability of Numerical Algorithms, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA (1996).
[10] N.J. Higham, The scaling and squaring method for the matrix exponential revisited, SIAM J. Matrix Anal. Appl., 26 (2005), pp. 1179–1193.
[11] N.J. Higham, Functions of Matrices: Theory and Computation, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA (2008).
[12] N.J. Higham and A.H. Al-Mohy, Computing matrix functions, Acta Numerica, 19 (2010), pp. 159–208.
[13] N.J. Higham and F. Tisseur, A block algorithm for matrix 1-norm estimation, with an application to 1-norm pseudospectra, SIAM J. Matrix Anal. Appl., 21 (2000), pp. 1185–1201.
[14] M. Hochbruck and A. Ostermann, Exponential integrators, Acta Numerica, 19 (2010), pp. 209–286.
[15] A. Iserles, H.Z. Munthe-Kaas, S.P. Nørsett, and A. Zanna, Lie-group methods, Acta Numerica, 9 (2000), pp. 215–365.
[16] D. Knuth, Evaluation of polynomials by computer, Comm. ACM, 5 (1962), pp. 595–599.
[17] L. Lei and T. Nakamura, A fast algorithm for evaluating the matrix polynomial I + A + · · · + A^{N−1}, IEEE Trans. Circuits Syst. I: Fund. Theory Appl., 39 (1992), pp. 299–300.
[18] C. Lubich, From Quantum to Classical Molecular Dynamics: Reduced Models and Numerical Analysis, European Mathematical Society (2008).
[19] C.B. Moler and C.F. Van Loan, Nineteen dubious ways to compute the exponential of a matrix, twenty-five years later, SIAM Review, 45 (2003), pp. 3–49.
[20] P. Nadukandi and N.J. Higham, Computing the wave-kernel matrix functions, MIMS EPrint 2018.4, Manchester Institute for Mathematical Sciences, The University of Manchester, UK, Feb. 2018.
[21] I. Najfeld and T.F. Havel, Derivatives of the matrix exponential and their computation, Adv. Appl. Math., 16 (1995), pp. 321–375.
[22] M.S. Paterson and L.J. Stockmeyer, On the number of nonscalar multiplications necessary to evaluate polynomials, SIAM J. Comput., 2 (1973), pp. 60–66.
[23] P. Ruiz, J. Sastre, J. Ibáñez, and E. Defez, High performance computing of the matrix exponential, J. Comput. Appl. Math., 291 (2016), pp. 370–379.
[24] J. Sastre, J. Ibáñez, E. Defez, and P. Ruiz, Accurate matrix exponential computation to solve coupled differential models in engineering, Math. Comput. Modelling, 54 (2011), pp. 1835–1840.
[25] J. Sastre, J. Ibáñez, P. Ruiz, and E. Defez, Accurate and efficient matrix exponential computation, Int. J. Comput. Math., 91 (2014), pp. 97–112.
[26] J. Sastre, J. Ibáñez, P. Ruiz, and E. Defez, New scaling-squaring Taylor algorithms for computing the matrix exponential, SIAM J. Sci. Comput., 37 (2015), pp. A439–A455.
[27] J. Sastre, Efficient evaluation of matrix polynomials, Linear Algebra Appl., 539 (2018), pp. 229–250.
[28] R.B. Sidje, Expokit: a software package for computing matrix exponentials, ACM Trans. Math. Software, 24 (1998), pp. 130–156.
[29] C. Van Loan, A note on the evaluation of matrix polynomials, IEEE Trans. Automat. Control, 24 (1979), pp. 320–321.
[30] D. Westreich, Evaluating the matrix polynomial I + A + · · · + A^{N−1}, IEEE Trans. Circuits Syst., 36 (1989), pp. 162–164.
