
Iterative methods with special structures


DESCRIPTION

In a talk at the Institute for Physics and Computational Mathematics in Beijing, I discuss two types of special structure in iterative methods: circulant structure in tensors, and localized structure in the matrix exponential.

Text of Iterative methods with special structures

1. Iterative methods with special structures. David F. Gleich, Purdue University.

2. Two projects. (1) Circulant structure in tensors and a linear system solver. (2) Localized structure in the matrix exponential and relaxation methods.

3. Circulant algebra: introduction. Kilmer, Martin, and Perrone (2008) presented a circulant algebra: a set of operations that generalizes matrix algebra to three-way data, and provided an SVD. The essence of this approach amounts to viewing three-dimensional objects as two-dimensional arrays (i.e., matrices) of one-dimensional arrays (i.e., vectors). Braman (2010) developed spectral and other decompositions. We have extended this algebra with the ingredients required for iterative methods such as the power method and the Arnoldi method, and have characterized the behavior of these algorithms. Joint work with Chen Greif and James Varah at UBC.

4. Three-way arrays. Given an m-by-n-by-k table of data, we view it as an m-by-n matrix in which each scalar is a vector of length k: A in K_k^{m x n}. We denote the space of length-k scalars by K_k; these scalars interact like circulant matrices.

5. The circ operation. Circulant matrices form a commutative class that is closed under the standard matrix operations. A scalar alpha = {alpha_1 ... alpha_k} in K_k corresponds to the k-by-k circulant matrix

    circ(alpha) = [ alpha_1   alpha_k      ...  alpha_2
                    alpha_2   alpha_1      ...  alpha_3
                      ...       ...        ...    ...
                    alpha_k   alpha_{k-1}  ...  alpha_1 ],

so that alpha + beta corresponds to circ(alpha) + circ(beta), and alpha * beta corresponds to circ(alpha) circ(beta). The additive and multiplicative identities are 0 = {0 0 ... 0} and 1 = {1 0 ... 0}. We'll see more of their properties shortly.

6. Scalars to matrix-vector products: the cft and icft operations. We define the circulant Fourier transform cft : K_k -> C^{k x k} and its inverse icft : C^{k x k} -> K_k by

    cft(alpha) = diag[lambda_1, ..., lambda_k] = F^* circ(alpha) F,    icft(diag[lambda_1, ..., lambda_k]) = alpha,

where the lambda_j are the eigenvalues of circ(alpha) as produced by the Fourier transform. These transformations satisfy icft(cft(alpha)) = alpha and provide a convenient way of moving between operations in K_k and the more familiar environment of diagonal matrices in C^{k x k}. More operations are simplified in Fourier space too; because the lambda_j are the eigenvalues of circ(alpha), we have

    abs(alpha)   = icft(diag[|lambda_1|, ..., |lambda_k|]),
    conj(alpha)  = icft(diag[conj(lambda_1), ..., conj(lambda_k)]) = icft(cft(alpha)^*),
    angle(alpha) = icft(diag[lambda_1/|lambda_1|, ..., lambda_k/|lambda_k|]).

For a matrix A in K_k^{m x n} and a vector x in K_k^n, the product A x has entries (A x)_i = sum_{j=1}^n A_{i,j} * x_j, and circ extends to matrices blockwise:

    circ(A) = [ circ(A_{1,1}) ... circ(A_{1,n})
                     ...      ...      ...
                circ(A_{m,1}) ... circ(A_{m,n}) ].

7. The special structure. This circulant structure is our special structure for the first problem. We look at two types of iterative methods: (1) the power method and (2) the Arnoldi method.

8. A perplexing result! Example: run the power method on

    A = [ {2 3 1}  {0 0 0}
          {0 0 0}  {3 1 1} ].

Result: lambda = (1/3) {10 4 4}.

9. Some understanding through decoupling. (Figure slide; back to the example.)
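Since the decoupling works entirely through the cft and icft operations above, a small numerical aside may help before the worked example. The following is a minimal MATLAB sketch of the K_k scalar operations under the standard fft/ifft convention; it is my own illustration (the variable names and the toeplitz construction are mine), not the camat package.

    k = 3;
    alpha = [2; 3; 1];                        % the scalar {2 3 1} from slide 8
    beta  = [3; 1; 1];                        % the scalar {3 1 1}
    C = toeplitz(alpha, alpha([1, k:-1:2]));  % circ(alpha): circulant with first column alpha
    lam = fft(alpha);                         % diagonal of cft(alpha) = eigenvalues of C
    ab  = ifft(fft(alpha) .* fft(beta));      % the scalar product alpha*beta (a cyclic convolution)
    abs_alpha = real(ifft(abs(lam)));         % abs(alpha) = icft(diag[|lambda_j|]); real() strips round-off

With this convention, icft of a diagonal matrix is just ifft of its vector of eigenvalues; for instance, ifft([6; 2; 2]) returns (1/3){10 4 4}, the eigenvalue the power method finds on slide 8.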
10. Some understanding through decoupling (continued). Example: let

    A = [ {2 3 1}   {8 -2 0}
          {-2 0 2}  {3 1 1} ].

The circ and cft operations give

    circ(A) = [  2  1  3   8  0 -2
                 3  2  1  -2  8  0
                 1  3  2   0 -2  8
                -2  2  0   3  1  1
                 0 -2  2   1  3  1
                 2  0 -2   1  1  3 ].

Conjugating each block by the DFT matrix F, i.e. forming (I (x) F)^* circ(A) (I (x) F), diagonalizes every block; permuting to group the Fourier components then gives cft(A), which is block diagonal with the 2-by-2 blocks

    A_1 = [ 6  6 ;  0  5 ],
    A_2 = [ -sqrt(3)i   9+sqrt(3)i ;  -3+sqrt(3)i   2 ],
    A_3 = [  sqrt(3)i   9-sqrt(3)i ;  -3-sqrt(3)i   2 ].

11. Some understanding through decoupling (the earlier example). For A = [ {2 3 1} {0 0 0} ; {0 0 0} {3 1 1} ],

    A_1 = [ 6  0 ;  0  5 ],   A_2 = [ -sqrt(3)i  0 ;  0  2 ],   A_3 = [ sqrt(3)i  0 ;  0  2 ].

Combining one eigenvalue from each Fourier component gives, for example,

    lambda_1 = icft(diag[6, 2, 2])                 = (1/3) {10 4 4},
    lambda_2 = icft(diag[5, -sqrt(3)i, sqrt(3)i])  = (1/3) {5 8 2},
    lambda_3 = icft(diag[6, -sqrt(3)i, sqrt(3)i])  = {2 3 1},
    lambda_4 = icft(diag[5, 2, 2])                 = {3 1 1},

with corresponding eigenvectors

    x_1 = [ {1/3 1/3 1/3} ; {2/3 -1/3 -1/3} ],   x_2 = [ {2/3 -1/3 -1/3} ; {1/3 1/3 1/3} ],
    x_3 = [ {1 0 0} ; {0 0 0} ],                 x_4 = [ {0 0 0} ; {1 0 0} ].

12. The power method converges, and convergence is in terms of the individual blocks. Theorem: let A in K_k^{n x n} have a canonical set of eigenvalues lambda_1, ..., lambda_n with abs(lambda_1) > abs(lambda_2); then the power method in the circulant algebra converges to an eigenvector x_1 with eigenvalue lambda_1. Here we use the ordering alpha < beta iff cft(alpha) < cft(beta) elementwise. Definition: a canonical set of eigenvalues and eigenvectors is a set of minimum size, ordered so that abs(lambda_1) >= abs(lambda_2) >= ... , which contains the information needed to reproduce any eigenvalue or eigenvector of A. In the example above there are more eigenvalue combinations than the dimension of the matrix, e.g.

    lambda_5 = icft(diag[6, -sqrt(3)i, 2]),   lambda_6 = icft(diag[6, 2, sqrt(3)i]),
    lambda_7 = icft(diag[5, -sqrt(3)i, 2]),   lambda_8 = icft(diag[5, 2, sqrt(3)i]),

but the only canonical set here is {(lambda_1, x_1), (lambda_2, x_2)}: we need two pairs, and abs(lambda_1) >= abs(lambda_2). (Figure: the convergence behavior of the power method in the circulant algebra; the gray lines show the error in each eigenvalue in Fourier space, and the curves track the predictions made from the eigenvalues as discussed in the text.)

13. The Arnoldi method. (Figure: the convergence behavior of a GMRES procedure using the circulant Arnoldi process; the gray lines show the error in each Fourier component and the red line shows the magnitude of the residual. We observe poor convergence in one component.) Background, the standard Arnoldi process: let A be an n-by-n matrix with real-valued entries. The Arnoldi method builds an orthogonal basis for the Krylov subspace K_t(A, v) = span{v, Av, ..., A^{t-1} v}, where v is an initial vector, and produces the decomposition A Q_t = Q_{t+1} H_{t+1,t}, where Q_t is an n-by-t matrix with orthonormal columns and H_{t+1,t} is a (t+1)-by-t upper Hessenberg matrix.
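As a reference point for the statement that follows, here is a minimal MATLAB sketch of the standard Arnoldi process just described (modified Gram-Schmidt). It is an illustrative sketch, not the camat code, and the function name is mine.

    function [Q, H] = arnoldi_sketch(A, v, t)
    % Build an orthonormal basis Q of K_t(A,v) and an upper Hessenberg H
    % with A*Q(:,1:t) = Q*H, where Q is n-by-(t+1) and H is (t+1)-by-t.
    n = size(A, 1);
    Q = zeros(n, t+1);
    H = zeros(t+1, t);
    Q(:,1) = v / norm(v);
    for j = 1:t
        w = A * Q(:,j);
        for i = 1:j                     % modified Gram-Schmidt orthogonalization
            H(i,j) = Q(:,i)' * w;
            w = w - H(i,j) * Q(:,i);
        end
        H(j+1,j) = norm(w);
        if H(j+1,j) == 0, break; end    % hit an invariant subspace
        Q(:,j+1) = w / H(j+1,j);
    end
    end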
Using our repertoire of operations, the Arnoldi method in the circulant algebra is equivalent to individual Arnoldi processes on each matrix A_j; that is, it is equivalent to a block Arnoldi process. Using the cft and icft operations, we produce an Arnoldi factorization in the algebra: A Q_t = Q_{t+1} H_{t+1,t}.

14. A number of interesting mathematical results come from this algebra. (1) A case study of how decoupled block iterations arise and are meaningful for an application. (2) It is a beautiful algebra: for example, abs, conj, and angle are all defined through the Fourier transform (slide 6), and proofs are simple, e.g. conj(angle(alpha)) * angle(alpha) = 1. (Live Matlab demo.)

15. Conclusion to the circulant algebra. Paper available from https://www.cs.purdue.edu/homes/dgleich/: "The power and Arnoldi methods in an algebra of circulants," Gleich, Greif, and Varah, NLA 2013. Code available from https://www.cs.purdue.edu/homes/dgleich/codes/camat

16. Project 2: fast relaxation methods to estimate a column of the matrix exponential. With Kyle Kloster.

17. Matrix exponentials. exp(A) is defined as sum_{k=0}^{infinity} (1/k!) A^k, which always converges; here A is n-by-n and real. It is the evolution operator for an ODE: dx/dt = A x(t) iff x(t) = exp(tA) x(0). The exponential is a special case of a function of a matrix f(A); other examples are f(x) = 1/x and f(x) = sinh(x).

18. Matrix exponentials on large networks. If A is the adjacency matrix, then A^k counts the number of length-k paths between node pairs, so large entries of exp(A) = sum_k (1/k!) A^k denote important nodes or edges; this is used for link prediction and centrality [Estrada 2000; Farahat et al. 2002, 2006]. If P is a transition matrix (column stochastic), then P^k gives the probability of a length-k walk between node pairs, and exp(P) = sum_k (1/k!) P^k is used for link prediction, kernels, and clustering or community detection [Kondor & Lafferty 2002; Kunegis & Lommatzsch 2009; Chung 2007].

19. This talk: a column of the matrix exponential, x = exp(P) e_c, where x is the solution, P is the matrix, and e_c indicates the column.

20. The same column, annotated: the solution x is localized; the matrix P is large, sparse, and stochastic; e_c selects the column.

21. Uniformly localized solutions in livejournal. (Figure: the magnitudes of the entries of x = exp(P) e_c for the livejournal graph, and the 1-norm error as only the largest nonzeros are retained; nnz(x) = 4,815,948.) Gleich & Kloster, arXiv:1310.3423.

22. Our mission: exploit the structure. Find the solution with work roughly proportional to the localization, not to the size of the matrix.

23. Our algorithms for uniform localization (www.cs.purdue.edu/homes/dgleich/codes/nexpokit). (Figure: 1-norm error versus number of nonzeros for gexpm, gexpmq, and expmimv.) For 1-norm error eps,

    work = O( log(1/eps) (1/eps)^{3/2} d^2 (log d)^2 ),
    nnz  = O( log(1/eps) (1/eps)^{3/2} d log d ).
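For orientation, here is a minimal MATLAB sketch that computes the column by a plain truncated Taylor series and then measures how localized it is, in the spirit of the livejournal plot on slide 21. It is a naive baseline, not the gexpm/gexpmq/expmimv algorithms above; the sparse column-stochastic matrix P, the seed node c, and the truncation order N = 11 are assumptions of the sketch.

    n = size(P, 1);
    N = 11;                             % truncation order (an assumption here)
    term = sparse(c, 1, 1, n, 1);       % term = P^0 e_c / 0!
    x = term;
    for k = 1:N
        term = (P * term) / k;          % term = P^k e_c / k!
        x = x + term;                   % x approximates exp(P) e_c
    end
    xs = sort(nonzeros(x), 'descend');  % entries are nonnegative for stochastic P
    coverage = cumsum(xs) / sum(xs);    % 1-norm mass captured by the largest entries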
24. Matrix exponentials on large networks: is a single column interesting? Yes! exp(P) e_c = sum_{k=0}^{infinity} (1/k!) P^k e_c gives link-prediction scores for node c and a community relative to node c. But modern networks are large (~10^9 nodes), sparse (~10^11 edges), and constantly changing, so we would like speed over accuracy.

25. "Nineteen Dubious Ways to Compute the Exponential of a Matrix, Twenty-Five Years Later," Cleve Moler and Charles Van Loan, SIAM Review 45(1), pp. 3-49, 2003.

26. Our underlying method: direct expansion,

    x = exp(P) e_c  ~  sum_{k=0}^{N} (1/k!) P^k e_c  =  x_N.

A few matvecs suffice, but sparsity is quickly lost due to fill-in. This method is stable for stochastic P: no cancellation, no unbounded norms, etc.

27. Our underlying method as a linear system. The same truncated expansion can be written as a block bidiagonal system:

    [  I                     ] [ v_0 ]   [ e_c ]
    [ -P/1   I               ] [ v_1 ]   [  0  ]
    [       -P/2   I         ] [ ... ] = [ ... ]
    [             ...   ...  ] [ ... ]   [ ... ]
    [                -P/N  I ] [ v_N ]   [  0  ]

with x_N = sum_{i=0}^{N} v_i; compactly, (I_{N+1} (x) I - S_N (x) P) v = e_1 (x) e_c, where S_N carries the coefficients 1/j on its subdiagonal. Lemma: we approximate x_N well if we approximate v well.

28. Our mission (2): approximately solve A x = b when A and b are sparse and x is localized.

29. Coordinate descent, Gauss-Southwell, Gauss-Seidel, relaxation, and push methods. Procedurally:

    Solve(A, b)
      x = sparse(size(A,1), 1)
      r = b
      while (1)
        pick j where r(j) != 0
        z = r(j)
        x(j) = x(j) + z
        for i where A(i,j) != 0
          r(i) = r(i) - z*A(i,j)

Algebraically: with r^(k) = b - A x^(k),

    x^(k+1) = x^(k) + e_j e_j^T r^(k),
    r^(k+1) = r^(k) - r_j^(k) A e_j.

30. Back to the exponential. Solve the block system

    (I_{N+1} (x) I - S_N (x) P) v = e_1 (x) e_c,    x_N = sum_{i=0}^{N} v_i,

via the same relaxation method. Optimization 1: build the system implicitly. Optimization 2: do not store the iterates v_i, just store their running sum x_N. (A standalone sketch of this combined procedure appears after slide 34 below.)

31. Error analysis for Gauss-Southwell. Theorem: assume P is column stochastic and v^(0) = 0. (Nonnegativity, easy) The iterates and residuals stay nonnegative: v^(l) >= 0 and r^(l) >= 0. (Convergence, annoying) The residual goes to 0:

    ||r^(l)||_1 <= prod_{k=1}^{l} (1 - 1/(2dk)) ~ l^{-1/(2d)},

where d is the largest degree.

32. Proof sketch. Gauss-Southwell picks the largest residual entry; bound the update by the average number of nonzeros in the residual (sloppy). This gives algebraic convergence with a slow rate, but each update is REALLY fast: O(d_max log n). If d is log log n, then our method runs in sublinear time (but so does just about anything).

33. Overall error analysis. The components: truncation to N terms, the residual-to-error relationship, and the approximate solve. Theorem: after l steps of Gauss-Southwell,

    ||x_N^(l) - x||_1 <= 1/(N! N) + (1 + e) l^{-1/(2d)}.

34. More recent error analysis. Theorem (Gleich and Kloster, 2013, arXiv:1310.3423): consider computing the matrix exponential with the Gauss-Southwell relaxation method on a graph whose degrees follow a Zipf law with exponent p = 1 and max degree d; then the work involved in getting a solution with 1-norm error eps is

    work = O( log(1/eps) (1/eps)^{3/2} d^2 (log d)^2 ).
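To tie slides 27-33 together, here is a minimal MATLAB sketch of the relaxation idea applied to the block system, with both optimizations (implicit system, accumulate only the sum). It is an illustrative sketch under my own simplifications, using dense residual storage and a first-nonzero selection rule rather than the Gauss-Southwell largest-entry rule; it is not the gexpm/gexpmq code from nexpokit. The sparse column-stochastic P, the seed node c, the truncation order N, and the tolerance tol are assumed inputs.

    n = size(P, 1);
    x = zeros(n, 1);                   % running sum of the blocks v_0, ..., v_N
    R = zeros(n, N+1);                 % R(:, j+1) holds the residual of block j
    R(c, 1) = 1;                       % initial residual e_1 (x) e_c
    while sum(R(:)) > tol              % residuals stay nonnegative for stochastic P
        [i, j] = find(R, 1);           % pick a nonzero residual entry (Gauss-Southwell
                                       % would pick the largest one instead)
        z = R(i, j);
        R(i, j) = 0;                   % relaxing v_{j-1}(i) += z zeroes this residual...
        x(i) = x(i) + z;               % ...and only the sum x_N is accumulated
        if j <= N
            R(:, j+1) = R(:, j+1) + (z/j) * P(:, i);   % push z forward to block j, scaled by 1/j
        end
    end
    % x approximates x_N = sum_i v_i; the remaining error is controlled by the
    % residual norm plus the Taylor truncation error (slide 33).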
35. Problem size and runtimes. (Figure: the median runtime of our methods for the seven graphs over 100 trials, plotted against |V| + nnz(P); legend: expmv, half, gexpmq, gexpm, expmimv.) Table 3: the real-world datasets we use in our experiments span three orders of magnitude in size.

    Graph          |V|           nnz(P)          nnz(P)/|V|
    itdk0304           190,914       1,215,220        6.37
    dblp-2010          226,413       1,432,920        6.33
    flickr-scc         527,476       9,357,071       17.74
    ljournal-2008    5,363,260      77,991,514       14.54
    webbase-2001   118,142,155   1,019,903,190        8.63
    twitter-2010    33,479,734   1,394,440,635       41.65
    friendster      65,608,366   3,612,134,270       55.06

Real-world networks. The datasets used are summarized in Table 3. They include a version of the flickr graph from [Bonchi et al., 2012] containing just the largest strongly connected component of the original graph; dblp-2010 from [Boldi et al., 2011]; itdk0304 from [The Cooperative Association for Internet Data Analysis, 2005]; ljournal-2008 from [Boldi et al., 2011; Chierichetti et al., 2009]; twitter-2010 from [Kwak et al., 2010]; webbase-2001 from [Hirai et al., 2000; Boldi and Vigna, 2005]; and the friendster graph from [Yang and Leskovec, 2012].

Implementation details. All experiments were performed on either a dual-processor Xeon E5-2670 system with 16 cores (total) and 256 GB of RAM, or a single-processor Intel i7-990X (3.47 GHz) with 24 GB of RAM. Our algorithms were implemented in C++ using the Matlab MEX interface. All data structures used are memory-efficient: the solution and residual are stored as hash tables using Google's sparsehash package. The precise code for the algorithms and the experiments below is available via https://www.cs.purdue.edu/homes/dgleich/codes/nexpokit/.

Comparison. We compare our implementation with a state-of-the-art Matlab function for computing the exponential of a matrix times a vector, expmv [Al-Mohy and Higham, 2011]. We customized this method with the knowledge that ||P||_1 = 1; this single change results in a great improvement to the runtime of their code. In each experiment, we use as the true solution the result of a call to expmv with the 'single' option, which guarantees a 1-norm error bounded by 2^{-24}, or, for smaller problems, a Taylor approximation with the number of terms predicted by Lemma 12.

36. References and ongoing work. Kloster and Gleich, Workshop on Algorithms for the Web-graph, 2013; also see the journal version on arXiv. Code: www.cs.purdue.edu/homes/dgleich/codes/nexpokit. Ongoing: error analysis using the queue (almost done), better linear systems for faster convergence, asynchronous coordinate descent methods, and scaling up to billion-node graphs (done). Supported by NSF CAREER 1149756-CCF. www.cs.purdue.edu/homes/dgleich