
Fast approximation of matrix coherence and statistical leverage

Petros Drineas [email protected]
Dept. of Computer Science, Rensselaer Polytechnic Institute, Troy, NY 12180 USA

Malik Magdon-Ismail [email protected]
Dept. of Computer Science, Rensselaer Polytechnic Institute, Troy, NY 12180 USA

Michael W. Mahoney [email protected]
Dept. of Mathematics, Stanford University, Stanford, CA 94305 USA

David P. Woodruff [email protected]
IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120 USA

Abstract

The statistical leverage scores of a data matrix are the squared row-norms of any matrix whose columns are obtained by orthogonalizing the columns of the data matrix; and the coherence is the largest leverage score. These quantities play an important role in several machine learning algorithms because they capture the key structural nonuniformity of the data matrix that must be dealt with in developing efficient randomized algorithms. Our main result is a randomized algorithm that takes as input an arbitrary $n \times d$ matrix $A$, with $n \gg d$, and returns, as output, relative-error approximations to all $n$ of the statistical leverage scores. The proposed algorithm runs in $O(nd \log n)$ time, as opposed to the $O(nd^2)$ time required by the naïve algorithm that involves computing an orthogonal basis for the range of $A$. This resolves an open question from (Drineas et al., 2006) and (Mohri & Talwalkar, 2011); and our result leads to immediate improvements in coreset-based $\ell_2$-regression, the estimation of the coherence of a matrix, and several related low-rank matrix problems. Interestingly, to achieve our result we judiciously apply random projections on both sides of $A$.

Appearing in Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012. Copyright 2012 by the author(s)/owner(s).

1. Introduction

The concept of statistical leverage measures the extent to which the singular vectors of a matrix are correlated with the standard basis, and as such it has found usefulness recently in large-scale data analysis and in the analysis of randomized matrix algorithms (Mahoney & Drineas, 2009; Drineas et al., 2008). A related notion is that of matrix coherence, which has been of interest in recently popular problems such as matrix completion and Nyström-based low-rank matrix approximation (Candes & Recht, 2009; Talwalkar & Rostamizadeh, 2010). Statistical leverage scores have a long history in statistical data analysis, where they have been used for outlier detection in regression diagnostics (Hoaglin & Welsch, 1978; Chatterjee & Hadi, 1986). Statistical leverage scores have also proved crucial recently in the development of improved worst-case randomized matrix algorithms that are also amenable to high-quality numerical implementation and that are useful to domain scientists (Drineas et al., 2008; Mahoney & Drineas, 2009; Sarlós, 2006; Drineas et al., 2010); see (Mahoney, 2011) for a detailed discussion. The naïve and best previously existing algorithm to compute these scores would compute an orthogonal basis for the dominant part of the spectrum of $A$, e.g., the basis provided by the Singular Value Decomposition (SVD) or a basis provided by a QR decomposition, and then use that basis to compute the leverage scores by taking the norms of the rows.

We present a randomized algorithm to compute relative-error approximations to every statistical leverage score in time qualitatively faster than the time required to compute an orthogonal basis. For the case of an arbitrary $n \times d$ matrix $A$, with $n \gg d$, our main algorithm runs in $O(nd \log n / \epsilon^2)$ time (under assumptions on the precise values of $n$ and $d$; see Theorem 1 for an exact statement). This is the first algorithm to break the $O(nd^2)$ barrier required by the naïve algorithm, and it provides accuracy to within relative error. As a corollary, our algorithm provides a relative-error approximation to the coherence of an arbitrary matrix in the same time. In addition, we discuss several practically-important extensions of the basic idea underlying our main algorithm: computing so-called cross-leverage scores; computing leverage scores for "fat" matrices with $n \approx d$ with respect to a low-rank parameter $k$; and computing leverage scores in data streaming environments.

1.1. Overview and definitions

We start with the following definition.

Definition 1. Given an arbitrary $n \times d$ matrix $A$, with $n > d$, let $U$ denote the $n \times d$ matrix consisting of the $d$ left singular vectors of $A$, and let $U_{(i)}$ denote the $i$-th row of the matrix $U$ as a row vector. Then, the statistical leverage scores of the rows of $A$ are given by
\[ \ell_i = \left\| U_{(i)} \right\|_2^2 , \quad \text{for } i \in \{1, \ldots, n\}; \]
the coherence of the rows of $A$ is
\[ \gamma = \max_{i \in \{1, \ldots, n\}} \ell_i , \]
i.e., it is the largest statistical leverage score of $A$; and the $(i, j)$-cross-leverage scores $c_{ij}$ are
\[ c_{ij} = \left\langle U_{(i)}, U_{(j)} \right\rangle , \]
i.e., they are the dot products between the $i$-th row and the $j$-th row of $U$.

Although we have defined these quantities in terms of a particular basis, they clearly do not depend on that particular basis, but only on the space spanned by that basis. To see this, let $P_A$ denote the projection matrix onto the span of the columns of $A$. Then,
\[ \ell_i = \left\| U_{(i)} \right\|_2^2 = \left( U U^T \right)_{ii} = \left( P_A \right)_{ii} . \]
That is, the statistical leverage scores of a matrix $A$ are equal to the diagonal elements of the projection matrix onto the span of its columns. Similarly, the $(i, j)$-cross-leverage scores are equal to the off-diagonal elements of this projection matrix, i.e.,
\[ c_{ij} = \left( P_A \right)_{ij} = \left\langle U_{(i)}, U_{(j)} \right\rangle . \]

Clearly, $O(nd^2)$ time suffices to compute all the statistical leverage scores exactly: simply perform the SVD or compute a QR decomposition of $A$ in order to obtain any orthogonal basis for the range of $A$, and then compute the Euclidean norm of the rows of the resulting matrix. Thus, in this paper, we are interested in algorithms that run in $o(nd^2)$ time.
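For concreteness, this naïve baseline is only a few lines of numpy (the function name is ours):

```python
import numpy as np

def exact_leverage_scores(A):
    """Naive O(nd^2) baseline: orthogonalize A, then take squared row norms."""
    Q, _ = np.linalg.qr(A)                 # columns of Q span the range of A
    ell = np.einsum("ij,ij->i", Q, Q)      # ell[i] = ||Q_(i)||_2^2 = (P_A)_ii
    coherence = ell.max()
    cross = Q @ Q.T                        # (P_A)_ij = c_ij, the cross-leverage scores
    return ell, coherence, cross

# basis-independence: QR and SVD orthogonalizers give identical scores
A = np.random.randn(1000, 20)
U = np.linalg.svd(A, full_matrices=False)[0]
assert np.allclose(exact_leverage_scores(A)[0], np.einsum("ij,ij->i", U, U))
```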

1.2. Our main result

Our main result is a randomized algorithm for computing relative-error approximations to every statistical leverage score, as well as an additive-error approximation to all of the large cross-leverage scores, of an arbitrary $n \times d$ matrix, with $n \gg d$, in time qualitatively faster than the time required to compute an orthogonal basis for the range of that matrix. Our main algorithm for computing approximations to the statistical leverage scores (see Algorithm 1 in Section 3) amounts to constructing a "randomized sketch" of the input matrix and then computing the Euclidean norms of the rows of that sketch. This sketch can also be used to compute approximations to the large cross-leverage scores (see Algorithm 2 of Section 3).

The following is our main result for Algorithm 1.

Theorem 1. Let $A$ be a full-rank $n \times d$ matrix, with $n \gg d$; let $\epsilon \in (0, 1/2]$ be an error parameter; and recall the definition of the statistical leverage scores $\ell_i$ from Definition 1. Then, there exists a randomized algorithm (Algorithm 1 of Section 3 below) that returns values $\tilde{\ell}_i$, for all $i \in \{1, \ldots, n\}$, such that with probability at least $0.8$,
\[ \left| \ell_i - \tilde{\ell}_i \right| \le \epsilon \, \ell_i \]
holds for all $i \in \{1, \ldots, n\}$. Assuming $d \le n \le e^d$, the running time of the algorithm is
\[ O\!\left( nd \ln(d \epsilon^{-1}) + nd \epsilon^{-2} \ln n + d^3 \epsilon^{-2} (\ln n) \ln(d \epsilon^{-1}) \right) . \]

Algorithm 1 provides a relative-error approximation to all of the statistical leverage scores $\ell_i$ of $A$ and, assuming $d \ln d = o\!\left( \frac{n}{\ln n} \right)$, $\ln n = o(d)$, and treating $\epsilon$ as a constant, its running time is $o(nd^2)$, as desired. As a corollary, the largest leverage score (and thus the coherence) is approximated to relative error in $o(nd^2)$ time.

The following is our main result for Algorithm 2.

Theorem 2. Let $A$ be a full-rank $n \times d$ matrix, with $n \gg d$; let $\epsilon \in (0, 1/2]$ be an error parameter; let $\kappa > 1$ be a parameter; and recall the definition of the cross-leverage scores $c_{ij}$ from Definition 1. Then, there exists a randomized algorithm (Algorithm 2 of Section 3 below) that returns the pairs $\{(i, j)\}$ together with estimates $\{\tilde{c}_{ij}\}$ such that, with probability at least $0.8$: (1) If $c_{ij}^2 \ge \frac{d}{\kappa} + 12 \epsilon \, \ell_i \ell_j$, then $(i, j)$ is returned; if $(i, j)$ is returned, then $c_{ij}^2 \ge \frac{d}{30 \kappa} - \epsilon \, \ell_i \ell_j$. (2) For all pairs $(i, j)$ that are returned,
\[ \tilde{c}_{ij}^2 - 30 \epsilon \, \ell_i \ell_j \le c_{ij}^2 \le \tilde{c}_{ij}^2 + 12 \epsilon \, \ell_i \ell_j . \]
This algorithm runs in $O(\epsilon^{-2} n \kappa \ln n + \epsilon^{-3} d \kappa \ln^2 n)$ time.

Note that by setting $\kappa = n \ln n$, we can compute all the "large" cross-leverage scores, i.e., those satisfying $c_{ij}^2 \ge \frac{d}{n \ln n}$, to within additive error in $O(n d \ln^3 n)$ time (treating $\epsilon$ as a constant). If $\ln^3 n = o(d)$, the overall running time is $o(nd^2)$, as desired.

Due to space limitations, numerous details of our analysis are omitted; a full presentation, including full details of the proofs, can be found in the technical report version of this paper (Drineas et al., 2011).

1.3. Significance and related work

Significance in theoretical computer science. The statistical leverage scores define the key structural nonuniformity that must be dealt with (i.e., either rapidly approximated or rapidly uniformized at the preprocessing step) in developing fast randomized algorithms for matrix problems such as least-squares regression (Sarlós, 2006; Drineas et al., 2010) and low-rank matrix approximation (Sarlós, 2006; Drineas et al., 2008; Mahoney & Drineas, 2009; Boutsidis et al., 2009). Roughly, the best random sampling algorithms use these scores (or the generalized leverage scores relative to the best rank-$k$ approximation to $A$) as an importance sampling distribution to sample with respect to. On the other hand, the best random projection algorithms rotate to a basis where these scores are approximately uniform and thus in which uniform sampling is appropriate. See (Mahoney, 2011) for a detailed discussion.

Applications in statistics. The statistical leverage scores equal the diagonal elements of the so-called "hat matrix" (Hoaglin & Welsch, 1978; Chatterjee & Hadi, 1986). As such, they have a natural statistical interpretation in terms of the "leverage" or "influence" associated with each of the data points. Historically, these quantities have been widely used for outlier identification in diagnostic regression analysis.

Applications in machine learning. Matrix coherence arises in many other applications. For example, (Candes & Recht, 2009) is interested in the problem of matrix completion; (Talwalkar & Rostamizadeh, 2010) is interested in Nyström-based low-rank matrix approximation; and (Mohri & Talwalkar, 2011) explicitly ask the question of whether matrix coherence can be efficiently and accurately estimated; thus, our main result provides a positive answer to their question. (In other applications, the largest cross-leverage score is called the coherence (Tropp, 2004; Mougeot et al.); thus our results also provide a bound for this quantity.)

2. Linear algebra and fast projections

2.1. Basic linear algebra and notation

Let $[n]$ denote the set of integers $\{1, 2, \ldots, n\}$. For any matrix $A \in \mathbb{R}^{n \times d}$, let $A_{(i)}$, $i \in [n]$, denote the $i$-th row of $A$ as a row vector, and let $A^{(j)}$, $j \in [d]$, denote the $j$-th column of $A$ as a column vector. Let $\|A\|_F^2 = \sum_{i=1}^{n} \sum_{j=1}^{d} A_{ij}^2$ denote the square of the Frobenius norm of $A$, and let $\|A\|_2 = \sup_{\|x\|_2 = 1} \|Ax\|_2$ denote the spectral norm of $A$. Relatedly, for any vector $x \in \mathbb{R}^n$, its Euclidean norm (or $\ell_2$-norm) is the square root of the sum of the squares of its elements. The dot product between two vectors $x, y \in \mathbb{R}^n$ will be denoted $\langle x, y \rangle$, or alternatively as $x^T y$. Finally, let $e_i \in \mathbb{R}^n$, for all $i \in [n]$, denote the standard basis vectors for $\mathbb{R}^n$ and let $I_n$ denote the $n \times n$ identity matrix.

Let the rank of $A$ be $\rho \le \min\{n, d\}$, in which case the SVD of $A$ is denoted by $A = U \Sigma V^T$, where $U \in \mathbb{R}^{n \times \rho}$, $\Sigma \in \mathbb{R}^{\rho \times \rho}$, and $V \in \mathbb{R}^{d \times \rho}$. (For a general matrix $X$, we will write $X = U_X \Sigma_X V_X^T$.) Let $\sigma_i(A)$, $i \in [\rho]$, denote the $i$-th singular value of $A$, and let $\sigma_{\max}(A)$ and $\sigma_{\min}(A)$ denote the maximum and minimum singular values of $A$, respectively. The Moore-Penrose pseudoinverse of $A$ is the $d \times n$ matrix defined by $A^\dagger = V \Sigma^{-1} U^T$.

2.2. The Fast JL Transform (FJLT)

Given $\epsilon > 0$ and a set of points $x_1, \ldots, x_n$ with $x_i \in \mathbb{R}^d$, an $\epsilon$-Johnson-Lindenstrauss Transform ($\epsilon$-JLT), denoted $\Pi \in \mathbb{R}^{r \times d}$, is a projection of the points into $\mathbb{R}^r$ such that
\[ (1 - \epsilon) \|x_i\|_2^2 \le \|\Pi x_i\|_2^2 \le (1 + \epsilon) \|x_i\|_2^2 . \qquad (1) \]
To construct an $\epsilon$-JLT with high probability, simply choose every entry of $\Pi$ independently, equal to $\pm\sqrt{3/r}$ with probability $1/6$ each and zero otherwise (with probability $2/3$) (Achlioptas, 2003). Let $\Pi_{JLT}$ be a matrix drawn from such a distribution over $r \times d$ matrices. Then, the following lemma holds.

Lemma 1 (Theorem 1.1 of (Achlioptas, 2003)). Let $x_1, \ldots, x_n$ be an arbitrary (but fixed) set of points, where $x_i \in \mathbb{R}^d$, and let $0 < \epsilon \le 1/2$ be an accuracy parameter. If
\[ r \ge \frac{1}{\epsilon^2} \left( 12 \ln n + 6 \ln \frac{1}{\delta} \right) , \]
then, with probability at least $1 - \delta$, $\Pi_{JLT} \in \mathbb{R}^{r \times d}$ is an $\epsilon$-JLT.
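A minimal numpy sketch of this sparse-sign construction (the helper name is ours):

```python
import numpy as np

def achlioptas_jlt(r, d, rng=np.random.default_rng()):
    """r x d matrix with i.i.d. entries +/- sqrt(3/r) w.p. 1/6 each, 0 w.p. 2/3."""
    signs = rng.choice([-1.0, 0.0, 1.0], size=(r, d), p=[1 / 6, 2 / 3, 1 / 6])
    return np.sqrt(3.0 / r) * signs

# squared norms are preserved to 1 +/- eps once r = Omega(ln n / eps^2)
x = np.random.randn(10_000)
Pi = achlioptas_jlt(400, 10_000)
print(np.linalg.norm(Pi @ x) ** 2 / np.linalg.norm(x) ** 2)  # close to 1
```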

For our main results, we will also need a stronger requirement than the simple $\epsilon$-JLT, and so we will use a version of the Fast Johnson-Lindenstrauss Transform (FJLT), which was originally introduced in (Ailon & Chazelle, 2009). Consider an orthogonal matrix $U \in \mathbb{R}^{n \times d}$, viewed as $d$ vectors in $\mathbb{R}^n$. An FJLT projects the vectors from $\mathbb{R}^n$ to $\mathbb{R}^r$, while preserving the orthogonality of $U$; moreover, it does so very quickly. Specifically, given $\epsilon > 0$, $\Pi \in \mathbb{R}^{r \times n}$ is an $\epsilon$-FJLT for $U$ if
\[ \left\| I_d - U^T \Pi^T \Pi U \right\|_2 \le \epsilon , \]
and, given any $X \in \mathbb{R}^{n \times d}$, the matrix product $\Pi X$ can be computed in $O(nd \ln r)$ time. The next lemma follows from the definition of an $\epsilon$-FJLT, and its proof can be found in (Drineas et al., 2006; 2010).

Lemma 2. Let $A$ be any matrix in $\mathbb{R}^{n \times d}$ with $n \gg d$ and $\operatorname{rank}(A) = d$. Let the SVD of $A$ be $A = U \Sigma V^T$, let $\Pi$ be an $\epsilon$-FJLT for $U$ (with $0 < \epsilon \le 1/2$), and let $\Phi = \Pi U = U_\Phi \Sigma_\Phi V_\Phi^T$. Then, all the following hold:
\[ \operatorname{rank}(\Pi A) = \operatorname{rank}(\Pi U) = \operatorname{rank}(U) = \operatorname{rank}(A) ; \qquad (2) \]
\[ \left\| I - \Sigma_\Phi^{-2} \right\|_2 \le \frac{\epsilon}{1 - \epsilon} ; \text{ and} \qquad (3) \]
\[ (\Pi A)^\dagger = V \Sigma^{-1} (\Pi U)^\dagger . \qquad (4) \]

2.3. The Subsampled Randomized Hadamard Transform (SRHT)

One can use a Randomized Hadamard Transform (RHT) to construct, with high probability, an $\epsilon$-FJLT. Our main algorithm will use this efficient construction in a crucial way. Recall that the (unnormalized) $n \times n$ matrix of the Hadamard transform $\hat{H}_n$ is defined recursively by
\[ \hat{H}_{2n} = \begin{pmatrix} \hat{H}_n & \hat{H}_n \\ \hat{H}_n & -\hat{H}_n \end{pmatrix} , \quad \text{with } \hat{H}_1 = 1 . \]
The $n \times n$ normalized matrix of the Hadamard transform is equal to $H_n = \hat{H}_n / \sqrt{n}$. From now on, for simplicity and without loss of generality, we assume that $n$ is a power of $2$, and we will suppress $n$ and just write $H$. Let $D \in \mathbb{R}^{n \times n}$ be a random diagonal matrix with independent diagonal entries $D_{ii} = +1$ with probability $1/2$ and $D_{ii} = -1$ with probability $1/2$. The product $HD$ is an RHT and it has three useful properties. First, when applied to a vector, it "spreads out" its energy. Second, computing the product $HDx$ for any vector $x \in \mathbb{R}^n$ takes $O(n \log_2 n)$ time. Third, if we only need to access $r$ elements in the transformed vector, then those $r$ elements can be computed in $O(n \log_2 r)$ time (Ailon & Liberty, 2008). The Subsampled Randomized Hadamard Transform (SRHT) randomly samples (according to the uniform distribution) a set of $r$ rows of an RHT.

Using the sampling matrix formalism described previously (Drineas et al., 2006; 2008; 2010), we will represent the operation of randomly sampling $r$ rows of an $n \times d$ matrix $A$ using an $r \times n$ linear sampling operator $S^T$. Let the matrix $\Pi_{FJLT} = S^T H D$ be generated using the SRHT. The most important property of the distribution $\Pi_{FJLT}$ is that if $r$ is large enough, then, with high probability, $\Pi_{FJLT}$ generates an $\epsilon$-FJLT. We summarize this discussion in the following lemma (which is essentially a combination of Lemmas 3 and 4 from (Drineas et al., 2010), restated to fit our notation).

Lemma 3. Let $\Pi_{FJLT} \in \mathbb{R}^{r \times n}$ be generated using the SRHT as described above, and let $U \in \mathbb{R}^{n \times d}$ ($n \gg d$) be an (arbitrary but fixed) orthogonal matrix. If
\[ r \ge \frac{14^2 \, d \ln(40 n d)}{\epsilon^2} \ln\!\left( \frac{30^2 \, d \ln(40 n d)}{\epsilon^2} \right) , \]
then, with probability at least $0.9$, $\Pi_{FJLT}$ is an $\epsilon$-FJLT for $U$.
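A compact numpy sketch of the SRHT (helper names are ours; the in-place fast Walsh-Hadamard transform realizes the $O(n \log_2 n)$ product $HDx$, and the $\sqrt{n/r}$ rescaling of the sampled rows is our convention for the sampling operator $S^T$):

```python
import numpy as np

def fwht(x):
    """In-place unnormalized fast Walsh-Hadamard transform; len(x) a power of 2."""
    n, h = len(x), 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                x[j], x[j + h] = x[j] + x[j + h], x[j] - x[j + h]
        h *= 2
    return x

def srht(A, r, rng=np.random.default_rng()):
    """Apply Pi_FJLT = S^T H D to the n x d matrix A (n a power of 2)."""
    n, d = A.shape
    D = rng.choice([-1.0, 1.0], size=n)              # random diagonal sign flips
    HDA = np.apply_along_axis(fwht, 0, D[:, None] * A) / np.sqrt(n)
    rows = rng.choice(n, size=r, replace=False)      # uniform row sampling (S^T)
    return np.sqrt(n / r) * HDA[rows]
```

(The Python-loop FWHT is for clarity only; a vectorized or compiled transform is needed to actually realize the $O(nd \ln r)$ bound.)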

3. Our main algorithmic results

3.1. Outline of our basic approach

Recall that our first goal is to approximate, for all $i \in [n]$, the quantities
\[ \ell_i = \left\| U_{(i)} \right\|_2^2 = \left\| e_i^T U \right\|_2^2 , \qquad (5) \]

where $e_i$ is a standard basis vector. The hard part of computing the scores $\ell_i$ according to Eqn. (5) is computing an orthogonal matrix $U$ spanning the range of $A$, which takes $O(nd^2)$ time. Since $U U^T = A A^\dagger$, it follows that
\[ \ell_i = \left\| e_i^T U U^T \right\|_2^2 = \left\| e_i^T A A^\dagger \right\|_2^2 = \left\| (A A^\dagger)_{(i)} \right\|_2^2 , \qquad (6) \]
where the first equality follows from the orthogonality of (the columns of) $U$. The hard part of computing the scores $\ell_i$ according to Eqn. (6) is two-fold: first, computing the pseudoinverse; and second, performing the matrix-matrix multiplication of $A$ and $A^\dagger$. Both of these procedures take $O(nd^2)$ time. As we will see, we can get around both of these bottlenecks by the judicious application of random projections to Eqn. (6).

To get around the bottleneck of $O(nd^2)$ time due to computing $A^\dagger$ in Eqn. (6), we will compute the pseudoinverse of a "smaller" matrix that approximates $A$. A necessary condition for such a smaller matrix is that it preserves rank. So, naïve ideas such as uniformly sampling $r_1 \ll n$ rows from $A$ and computing the pseudoinverse of this sampled matrix will not work well for an arbitrary $A$. For example, this idea will fail (with high probability) to return a meaningful approximation for matrices consisting of $n - 1$ identical rows and a single row with a nonzero component in the direction perpendicular to that of the identical rows; finding that "outlying" row is crucial to obtaining a relative-error approximation. This is where the SRHT enters, since it preserves important structures of $A$, in particular its rank, by first rotating $A$ to a random basis and then uniformly sampling rows from the rotated matrix (see (Drineas et al., 2010) for more details). More formally, recall that the SVD of $A$ is $U \Sigma V^T$, and let $\Pi_1 \in \mathbb{R}^{r_1 \times n}$ be an $\epsilon$-FJLT for $U$ (using, for example, the SRHT of Lemma 3 with the appropriate choice for $r_1$). One could approximate the $\ell_i$'s of Eqn. (6) by
\[ \hat{\ell}_i = \left\| e_i^T A (\Pi_1 A)^\dagger \right\|_2^2 , \qquad (7) \]
where we approximated the $n \times d$ matrix $A$ by the $r_1 \times d$ matrix $\Pi_1 A$. Computing $A (\Pi_1 A)^\dagger$ in this way takes $O(n d r_1)$ time, which is not efficient because $r_1 > d$ (from Lemma 3).

To get around this bottleneck, recall that we only need the Euclidean norms of the rows of the matrix $A (\Pi_1 A)^\dagger \in \mathbb{R}^{n \times r_1}$. Thus, we can further reduce the dimensionality of this matrix by using an $\epsilon$-JLT to reduce the dimension $r_1 = \Omega(d)$ to $r_2 = O(\ln n)$. Specifically, let $\Pi_2^T \in \mathbb{R}^{r_2 \times r_1}$ be an $\epsilon$-JLT for the rows of $A (\Pi_1 A)^\dagger$ (viewed as $n$ vectors in $\mathbb{R}^{r_1}$) and consider the matrix $\Omega = A (\Pi_1 A)^\dagger \Pi_2$. This $n \times r_2$ matrix $\Omega$ may be viewed as our "randomized sketch" of the rows of $A A^\dagger$. Then, we can compute and return
\[ \tilde{\ell}_i = \left\| e_i^T A (\Pi_1 A)^\dagger \Pi_2 \right\|_2^2 , \qquad (8) \]
for each $i \in [n]$, which is essentially what Algorithm 1 below does. Not surprisingly, the sketch $A (\Pi_1 A)^\dagger \Pi_2$ can be used in other ways: for example, by considering the dot product between two different rows of this randomized sketching matrix, Algorithm 2 approximates the large cross-leverage scores of $A$.

Algorithm 1 Approximating the (diagonal) statistical leverage scores $\ell_i$.

Input: $A \in \mathbb{R}^{n \times d}$ (with SVD $A = U \Sigma V^T$), error parameter $\epsilon \in (0, 1/2]$.
Output: $\tilde{\ell}_i$, $i \in [n]$.

1. Construct $\Pi_1 \in \mathbb{R}^{r_1 \times n}$ to be an $\epsilon$-FJLT for $U$, using Lemma 3 with $r_1 = O\!\left( \frac{d \ln n}{\epsilon^2} \ln\!\left( \frac{d \ln n}{\epsilon^2} \right) \right)$.

2. Compute $\Pi_1 A \in \mathbb{R}^{r_1 \times d}$ and its SVD, $\Pi_1 A = U_{\Pi_1 A} \Sigma_{\Pi_1 A} V_{\Pi_1 A}^T$. Let $R^{-1} = V_{\Pi_1 A} \Sigma_{\Pi_1 A}^{-1} \in \mathbb{R}^{d \times d}$. (Alternatively, $R$ could be computed by a QR factorization of $\Pi_1 A$.)

3. View the normalized rows of $A R^{-1} \in \mathbb{R}^{n \times d}$ as $n$ vectors in $\mathbb{R}^d$, and construct $\Pi_2 \in \mathbb{R}^{d \times r_2}$ to be an $\epsilon$-JLT for $n^2$ vectors (the aforementioned $n$ vectors and their $n^2 - n$ pairwise sums), using Lemma 1 with $r_2 = O(\epsilon^{-2} \ln n)$.

4. Construct the matrix product $\Omega = A R^{-1} \Pi_2$.

5. For all $i \in [n]$, compute and return $\tilde{\ell}_i = \left\| \Omega_{(i)} \right\|_2^2$.

3.2. Approximating all the leverage scores

Our first main result is Algorithm 1, which takes as input an $n \times d$ matrix $A$ and an error parameter $\epsilon \in (0, 1/2]$, and returns as output numbers $\tilde{\ell}_i$, $i \in [n]$. Although the basic idea to approximate $\|(A A^\dagger)_{(i)}\|_2$ was described in the previous section, we can improve the efficiency of our approach by avoiding the full sketch of the pseudoinverse. In particular, let $\hat{A} = \Pi_1 A$ and let its SVD be $\hat{A} = U_{\hat{A}} \Sigma_{\hat{A}} V_{\hat{A}}^T$. Let $R^{-1} = V_{\hat{A}} \Sigma_{\hat{A}}^{-1}$ and note that $R^{-1} \in \mathbb{R}^{d \times d}$ is an orthogonalizer for $\hat{A}$, since $U_{\hat{A}} = \hat{A} R^{-1}$ is an orthogonal matrix. In addition, note that $A R^{-1}$ is approximately orthogonal. Thus, we can compute $A R^{-1}$ and use it as an approximate orthogonal basis for $A$, and then compute $\hat{\ell}_i$ as the squared row-norms of $A R^{-1}$. The next lemma states that this is exactly what our main algorithm does.

Lemma 4. Let $R^{-1}$ be such that $Q = \Pi_1 A R^{-1}$ is an orthogonal matrix with $\operatorname{rank}(Q) = \operatorname{rank}(\Pi_1 A)$. Then, $\left\| (A R^{-1})_{(i)} \right\|_2^2 = \hat{\ell}_i$.
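Putting the pieces together, here is a minimal, self-contained numpy sketch of Algorithm 1 (the function name and the constants behind $r_1, r_2$ are ours; for readability it substitutes dense random-sign projections for the SRHT-based $\Pi_1$ of Lemma 3, which changes the running time but not the structure):

```python
import numpy as np

def approx_leverage_scores(A, r1=None, r2=None, seed=0):
    """Sketch of Algorithm 1: estimates of all n statistical leverage scores."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    r1 = r1 or 4 * d                       # Lemma 3: r1 = O(eps^-2 d ln n ln(eps^-2 d ln n))
    r2 = r2 or max(1, int(8 * np.log(n)))  # Lemma 1: r2 = O(eps^-2 ln n)
    # Steps 1-2: sketch A, then orthogonalize the sketch: R^{-1} = V Sigma^{-1}.
    Pi1 = rng.choice([-1.0, 1.0], size=(r1, n)) / np.sqrt(r1)  # stand-in for the SRHT
    _, S, Vt = np.linalg.svd(Pi1 @ A, full_matrices=False)
    R_inv = Vt.T / S                       # d x d; A @ R_inv is approximately orthogonal
    # Steps 3-4: second projection. The parenthesization matters: A @ (R_inv @ Pi2)
    # costs O(n d r2), whereas (A @ R_inv) @ Pi2 would cost O(n d^2).
    Pi2 = rng.choice([-1.0, 1.0], size=(d, r2)) / np.sqrt(r2)
    Omega = A @ (R_inv @ Pi2)              # n x r2 randomized sketch
    # Step 5: squared Euclidean norms of the rows of the sketch.
    return np.einsum("ij,ij->i", Omega, Omega)
```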

3.3. Approximating large cross-leverage scores

By combining Lemmas 6 and 7 (in Section 4.1 below) with the triangle inequality, one immediately obtains the following lemma.

Lemma 5. Let $\Omega$ be either the sketching matrix constructed by Algorithm 1, i.e., $\Omega = A R^{-1} \Pi_2$, or $\Omega = A (\Pi_1 A)^\dagger \Pi_2$ as described in Section 3.1. Then, the pairwise dot-products of the rows of $\Omega$ are additive-error approximations to the leverage scores and cross-leverage scores:
\[ \left| \langle U_{(i)}, U_{(j)} \rangle - \langle \Omega_{(i)}, \Omega_{(j)} \rangle \right| \le \frac{3 \epsilon}{1 - \epsilon} \left\| U_{(i)} \right\|_2 \left\| U_{(j)} \right\|_2 . \]

That is, if one were interested in obtaining an approximation to all the cross-leverage scores to within additive error (and thus the diagonal statistical leverage scores to relative error), then the algorithm which first computes $\Omega$ followed by all the pairwise inner products achieves this in time $T(\Omega) + O(n^2 r_2)$, where $T(\Omega)$ is the time to compute $\Omega$ from Section 3.2 and $r_2 = O(\epsilon^{-2} \ln n)$. The challenge is to avoid the $n^2$ computational complexity, and this can be done if one is interested only in the large cross-leverage scores.

Our second main result is provided by Algorithms 2 and 3. Algorithm 2 takes as input an $n \times d$ matrix $A$, a parameter $\kappa > 1$, and an error parameter $\epsilon \in (0, 1/2]$, and returns as output a subset of $[n] \times [n]$ and estimates $\tilde{c}_{ij}$ satisfying Theorem 2. The first step of the algorithm is to compute the matrix $\Omega = A R^{-1} \Pi_2$ constructed by Algorithm 1. Then, the algorithm calls Algorithm 3 as a subroutine to compute "heavy hitter" pairs of rows from a matrix.

4. Proofs of our main theorems

4.1. Proof of Theorem 1

We condition all our analysis on the events that $\Pi_1 \in \mathbb{R}^{r_1 \times n}$ is an $\epsilon$-FJLT for $U$ and $\Pi_2 \in \mathbb{R}^{r_1 \times r_2}$ is an $\epsilon$-JLT for $n^2$ points in $\mathbb{R}^{r_1}$. Define $\hat{u}_i = e_i^T A (\Pi_1 A)^\dagger$ and $\tilde{u}_i = e_i^T A (\Pi_1 A)^\dagger \Pi_2$. Then, $\hat{\ell}_i = \|\hat{u}_i\|_2^2$ and $\tilde{\ell}_i = \|\tilde{u}_i\|_2^2$. The proof will follow from the following two lemmas.

Algorithm 2 Approximating the large (off-diagonal) cross-leverage scores $c_{ij}$.

Input: $A \in \mathbb{R}^{n \times d}$ and parameters $\kappa > 1$, $\epsilon \in (0, 1/2]$.
Output: The set $H$ consisting of pairs $(i, j)$ together with estimates $\tilde{c}_{ij}$ satisfying Theorem 2.

1. Compute the $n \times r_2$ matrix $\Omega = A R^{-1} \Pi_2$ from Algorithm 1.

2. Use Algorithm 3 with inputs $\Omega$ and $\kappa' = \kappa (1 + 30 \epsilon d)$ to obtain the set $H$ containing all the $\kappa'$-heavy pairs of $\Omega$.

3. Return the pairs in $H$ as the $\kappa$-heavy pairs of $A$.

Lemma 6. For $i, j \in [n]$,
\[ \left| \langle U_{(i)}, U_{(j)} \rangle - \langle \hat{u}_i, \hat{u}_j \rangle \right| \le \epsilon_1 \left\| U_{(i)} \right\|_2 \left\| U_{(j)} \right\|_2 , \quad \text{where } \epsilon_1 = \frac{\epsilon}{1 - \epsilon} . \qquad (9) \]

Lemma 7. For $i, j \in [n]$,
\[ \left| \langle \hat{u}_i, \hat{u}_j \rangle - \langle \tilde{u}_i, \tilde{u}_j \rangle \right| \le \epsilon_2 \, \|\hat{u}_i\|_2 \|\hat{u}_j\|_2 , \quad \text{where } \epsilon_2 = 2 \epsilon . \qquad (10) \]

Lemma 6 states that $\langle \hat{u}_i, \hat{u}_j \rangle$ is an additive-error approximation to all the cross-leverage scores ($i \neq j$) and a relative-error approximation for the diagonals ($i = j$). Similarly, Lemma 7 shows that these cross-leverage scores are preserved by $\Pi_2$. Indeed, with $i = j$, from Lemma 6 we have $|\hat{\ell}_i - \ell_i| \le \epsilon_1 \ell_i$, and from Lemma 7 we have $|\hat{\ell}_i - \tilde{\ell}_i| \le \epsilon_2 \hat{\ell}_i$. Using the triangle inequality and $\epsilon \le 1/2$:
\[ |\ell_i - \tilde{\ell}_i| = |\ell_i - \hat{\ell}_i + \hat{\ell}_i - \tilde{\ell}_i| \le |\ell_i - \hat{\ell}_i| + |\hat{\ell}_i - \tilde{\ell}_i| \le (\epsilon_1 + \epsilon_2) \ell_i \le 4 \epsilon \ell_i . \]
The theorem follows after rescaling $\epsilon$.

Running Times. By Lemma 4, we can use $V_{\Pi_1 A} \Sigma_{\Pi_1 A}^{-1}$ instead of $(\Pi_1 A)^\dagger$ and obtain the same estimates. Since $\Pi_1$ is an $\epsilon$-FJLT, the product $\Pi_1 A$ can be computed in $O(nd \ln r_1)$ time, while its SVD takes an additional $O(r_1 d^2)$ time to return $V_{\Pi_1 A} \Sigma_{\Pi_1 A}^{-1} \in \mathbb{R}^{d \times d}$. Since $\Pi_2 \in \mathbb{R}^{d \times r_2}$, we obtain $V_{\Pi_1 A} \Sigma_{\Pi_1 A}^{-1} \Pi_2 \in \mathbb{R}^{d \times r_2}$ in an additional $O(r_2 d^2)$ time. Finally, premultiplying by $A$ takes $O(n d r_2)$ time, and computing and returning the squared row-norms of $\Omega = A V_{\Pi_1 A} \Sigma_{\Pi_1 A}^{-1} \Pi_2 \in \mathbb{R}^{n \times r_2}$ takes $O(n r_2)$ time. So, the total running time is the sum of all these operations, which is
\[ O\!\left( nd \ln r_1 + n d r_2 + r_1 d^2 + r_2 d^2 \right) . \]
For our implementations of the $\epsilon$-JLTs and $\epsilon$-FJLTs ($\delta = 0.1$), $r_1 = O\!\left( \epsilon^{-2} d (\ln n) \ln(\epsilon^{-2} d \ln n) \right)$ and $r_2 = O(\epsilon^{-2} \ln n)$.

Algorithm 3 Computing heavy pairs of a matrix.

Input: $X \in \mathbb{R}^{n \times r}$ with rows $x_1, \ldots, x_n$ and a parameter $\kappa > 1$.
Output: $H = \{(i, j), \tilde{c}_{ij}\}$ containing all heavy (unordered) pairs. The pair $(i, j), \tilde{c}_{ij} \in H$ if and only if $\tilde{c}_{ij}^2 = \langle x_i, x_j \rangle^2 \ge \left\| X^T X \right\|_F^2 / \kappa$.

1: Compute the norms $\|x_i\|_2$ and sort the rows according to norm, so that $\|x_1\|_2 \le \cdots \le \|x_n\|_2$.
2: $H \leftarrow \{\}$; $z_1 \leftarrow n$; $z_2 \leftarrow 1$.
3: while $z_2 \le z_1$ do
4:   while $\|x_{z_1}\|_2^2 \|x_{z_2}\|_2^2 < \left\| X^T X \right\|_F^2 / \kappa$ do
5:     $z_2 \leftarrow z_2 + 1$.
6:     if $z_2 > z_1$ then
7:       return $H$.
8:     end if
9:   end while
10:  for each pair $(i, j)$ where $i = z_1$ and $j \in \{z_2, z_2 + 1, \ldots, z_1\}$ do
11:    $\tilde{c}_{ij}^2 = \langle x_i, x_j \rangle^2$.
12:    if $\tilde{c}_{ij}^2 \ge \left\| X^T X \right\|_F^2 / \kappa$ then
13:      add $(i, j)$ and $\tilde{c}_{ij}$ to $H$.
14:    end if
15:  end for
16:  $z_1 \leftarrow z_1 - 1$.
17: end while
18: return $H$.
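A direct Python transcription of Algorithm 3 (names are ours; the rows are sorted by ascending norm and the two pointers mirror lines 2-17 of the box):

```python
import numpy as np

def heavy_pairs(X, kappa):
    """All (unordered) pairs (i, j) with <x_i, x_j>^2 >= ||X^T X||_F^2 / kappa."""
    n, _ = X.shape
    thresh = np.linalg.norm(X.T @ X, "fro") ** 2 / kappa
    norms2 = np.einsum("ij,ij->i", X, X)
    order = np.argsort(norms2)              # ascending squared row norms
    Xs, ns = X[order], norms2[order]
    H = {}
    z1, z2 = n - 1, 0                       # 0-indexed analogues of z1 = n, z2 = 1
    while z2 <= z1:
        while ns[z1] * ns[z2] < thresh:     # advance z2 to the first norm-heavy pair
            z2 += 1
            if z2 > z1:
                return H
        for j in range(z2, z1 + 1):         # every (z1, j) with j >= z2 is norm-heavy
            c = float(Xs[z1] @ Xs[j])
            if c * c >= thresh:             # verify that the pair is actually heavy
                H[(int(order[z1]), int(order[j]))] = c
        z1 -= 1
    return H
```

As in the analysis of Section 4.2, at most $n + s$ norm-heaviness checks are performed and at most $s \le \kappa r$ inner products are verified.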

It follows that the asymptotic running time is
\[ O\!\left( nd \ln(d \epsilon^{-1}) + n d \epsilon^{-2} \ln n + d^3 \epsilon^{-2} (\ln n) \ln(d \epsilon^{-1}) \right) . \]
To simplify, suppose that $d \le n \le e^d$ and treat $\epsilon$ as a constant. Then, the asymptotic running time is
\[ O\!\left( nd \ln n + d^3 (\ln n)(\ln d) \right) . \]

4.2. Proof of Theorem 2

We first construct an algorithm to estimate the large inner products among the rows of an arbitrary matrix $X \in \mathbb{R}^{n \times r}$ with $n > r$. This general algorithm will be applied to the matrix $\Omega = A V_{\Pi_1 A} \Sigma_{\Pi_1 A}^{-1} \Pi_2$. Let $x_1, \ldots, x_n$ denote the rows of $X$; for a given $\kappa > 1$, the pair $(i, j)$ is heavy if
\[ \langle x_i, x_j \rangle^2 \ge \frac{1}{\kappa} \left\| X^T X \right\|_F^2 . \]
By the Cauchy-Schwarz inequality, this implies that
\[ \|x_i\|_2^2 \|x_j\|_2^2 \ge \frac{1}{\kappa} \left\| X^T X \right\|_F^2 , \qquad (11) \]
so it suffices to find all the pairs $(i, j)$ for which Eqn. (11) holds. We will call such pairs norm-heavy. Let $s$ be the number of norm-heavy pairs satisfying Eqn. (11). We first bound the number of such pairs.

Lemma 8. Using the above notation, $s \le \kappa r$.

Algorithm 3 starts by computing the norms $\|x_i\|_2^2$ for all $i \in [n]$ and sorting them (in $O(nr + n \ln n)$ time), so that we can assume that $\|x_1\|_2 \le \cdots \le \|x_n\|_2$. Then, we initialize the set of norm-heavy pairs to $H = \{\}$ and we also initialize two pointers $z_1 = n$ and $z_2 = 1$. The basic loop in the algorithm checks if $z_2 > z_1$ and stops if that is the case. Otherwise, we increment $z_2$ to the first pair $(z_1, z_2)$ that is norm-heavy. If none of the pairs is norm-heavy (i.e., $z_2 > z_1$ occurs), then we stop and output $H$; otherwise, we add $(z_1, z_2), (z_1, z_2 + 1), \ldots, (z_1, z_1)$ to $H$. This basic loop computes all pairs $(z_1, i)$ with $i \le z_1$ that are norm-heavy. Next, we decrease $z_1$ by one, and if $z_1 < z_2$ we stop and output $H$; otherwise, we repeat the basic loop. Note that in the basic loop $z_2$ is always incremented; this occurs whenever the pair $(z_1, z_2)$ is not norm-heavy. Since $z_2$ can be incremented at most $n$ times, the number of times we check whether a pair is norm-heavy and fail is at most $n$. Every successful check results in the addition of at least one norm-heavy pair into $H$, and thus the number of times we check if a pair is norm-heavy (a constant-time operation) is at most $n + s$. The number of pair additions into $H$ is exactly $s$, and thus the total running time is $O(nr + n \ln n + s)$. Finally, we must check each norm-heavy pair to verify whether or not it is actually heavy by computing $s$ inner products of vectors in $\mathbb{R}^r$; this can be done in $O(sr)$ time. Using $s \le \kappa r$, we get the following lemma.

Lemma 9. Algorithm 3 returns $H$ including all the heavy pairs of $X$ in $O(nr + \kappa r^2 + n \ln n)$ time.

To complete the proof, we apply Algorithm 3 with $\Omega = A V_{\Pi_1 A} \Sigma_{\Pi_1 A}^{-1} \Pi_2 \in \mathbb{R}^{n \times r_2}$, where $r_2 = O(\epsilon^{-2} \ln n)$. Let $\tilde{u}_1, \ldots, \tilde{u}_n$ denote the rows of $\Omega$ and recall that $A = U \Sigma V^T$. Let $u_1, \ldots, u_n$ denote the rows of $U$; then, from Lemma 5,
\[ \langle u_i, u_j \rangle - \gamma \le \langle \tilde{u}_i, \tilde{u}_j \rangle \le \langle u_i, u_j \rangle + \gamma , \qquad (12) \]
where $\gamma = \frac{3 \epsilon}{1 - \epsilon} \|u_i\|_2 \|u_j\|_2$. Given $\kappa$, $\epsilon$, assume that for the pair of vectors $u_i$ and $u_j$,
\[ \langle u_i, u_j \rangle^2 \ge \frac{1}{\kappa} \left\| U^T U \right\|_F^2 + 12 \epsilon \eta = \frac{d}{\kappa} + 12 \epsilon \eta , \]
where $\eta = \|u_i\|_2^2 \|u_j\|_2^2$, and where the last equality follows from $\left\| U^T U \right\|_F^2 = \|I_d\|_F^2 = d$. By Eqn. (12), after squaring and using $\epsilon < 0.5$,
\[ \langle u_i, u_j \rangle^2 - 12 \epsilon \eta \le \langle \tilde{u}_i, \tilde{u}_j \rangle^2 \le \langle u_i, u_j \rangle^2 + 30 \epsilon \eta , \qquad (13) \]
where $\eta = \|u_i\|_2^2 \|u_j\|_2^2$. Thus, $\langle \tilde{u}_i, \tilde{u}_j \rangle^2 \ge d/\kappa$, and summing Eqn. (13) over all $i, j$ we get $\left\| \Omega^T \Omega \right\|_F^2 \le d + 30 \epsilon d^2$, or, equivalently,
\[ d \ge \frac{\left\| \Omega^T \Omega \right\|_F^2}{1 + 30 \epsilon d} . \]
We conclude that $\langle u_i, u_j \rangle^2 \ge \frac{d}{\kappa} + 12 \epsilon \|u_i\|_2^2 \|u_j\|_2^2$ implies
\[ \langle \tilde{u}_i, \tilde{u}_j \rangle^2 \ge \frac{d}{\kappa} \ge \frac{\left\| \Omega^T \Omega \right\|_F^2}{\kappa (1 + 30 \epsilon d)} . \qquad (14) \]
By construction, Algorithm 3 is invoked with $\kappa' = \kappa (1 + 30 \epsilon d)$, and thus it finds all pairs with $\langle \tilde{u}_i, \tilde{u}_j \rangle^2 \ge \left\| \Omega^T \Omega \right\|_F^2 / \kappa'$. By Eqn. (14), this set contains all pairs for which $\langle u_i, u_j \rangle^2 \ge \frac{d}{\kappa} + 12 \epsilon \|u_i\|_2^2 \|u_j\|_2^2$. Further, since every pair returned satisfies $\langle \tilde{u}_i, \tilde{u}_j \rangle^2 \ge d / \kappa'$, by Eqn. (13), $c_{ij}^2 \ge \frac{d}{30 \kappa} - \epsilon \ell_i \ell_j$. This proves the first claim of the theorem; the second claim follows analogously from Eqn. (13).

Using Lemma 9, the running time of our approach is $O(n r_2 + \kappa' r_2^2 + n \ln n)$. Since $r_2 = O(\epsilon^{-2} \ln n)$ and, by Eqn. (14), $\kappa' = \kappa (1 + 30 \epsilon d) \ge \kappa \left\| \Omega^T \Omega \right\|_F^2 / d$, the overall running time is $O(\epsilon^{-2} \kappa n \ln n + \epsilon^{-3} \kappa d \ln^2 n)$.

5. Extension to general matrices and streaming environments

Our main result can be extended to the computation of the statistical leverage scores for general "fat" matrices, i.e., matrices $A \in \mathbb{R}^{n \times d}$ where both $n$ and $d$ are large, e.g., $d = n$ or $d = \Theta(n)$, when a rank parameter $k \ll \min\{n, d\}$ is specified. In this case, we wish to obtain the statistical leverage scores $\ell_i = \left\| (U_k)_{(i)} \right\|_2^2$ for $A_k = U_k \Sigma_k V_k^T$, the best rank-$k$ approximation to $A$. As stated, this is an ill-posed problem, and thus the main technical challenge is to deal with this ill-posedness. To do so, we note that the leverage scores are used to approximate the matrix in some way, and thus we only care that the estimated leverage scores are a good approximation to the leverage scores of some good low-rank approximation to $A$. Thus, we can define the set of matrices that are good approximations, e.g., with respect to the spectral norm or Frobenius norm, to the best rank-$k$ approximation to $A$, and we can prove that we can efficiently approximate the leverage scores for some matrix in this set.

Our main result can also be extended to estimate the leverage scores and related statistics in data stream environments. In this context, one is interested in computing statistics of the data stream while making one pass over the data from external storage and using only a small amount of additional space. For an $n \times d$ matrix $A$, with $n \gg d$, small additional space means that the space complexity depends only logarithmically on the high dimension $n$ and polynomially on the low dimension $d$. The general strategy behind our algorithms is as follows. First, as the data streams by, compute $TA$, for an appropriate problem-dependent linear sketching matrix $T$, and also compute $\Pi A$, for a random projection matrix $\Pi$. Second, after the first pass over the data, compute the matrix $R^{-1}$, as described in Algorithm 1, corresponding to $\Pi A$ (or compute the pseudoinverse of $\Pi A$ or the $R$ matrix from any other QR decomposition of $\Pi A$). Third, compute $T A R^{-1} \Pi_2$, for a random projection matrix $\Pi_2$, such as the one used by Algorithm 1.

With the procedure outlined above, the matrix $T$ is effectively applied to the rows of $A R^{-1} \Pi_2$, i.e., to the sketch of $A$ that has rows with Euclidean norms approximately equal to the row norms of $U$, and pairwise inner products approximately equal to those in $U$. Thus, statistics related to $U$ can be extracted. For example, in one pass we can: find the indices of all rows of $A$ with "large" leverage scores and compute a $(1 + \epsilon)$-approximation to the leverage scores of these large rows; approximately compute statistics such as the entropy of the distribution of leverage scores of $A$; and obtain samples of rows of $A$ with probability proportional to their leverage score distribution.
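To make the first step concrete, here is a minimal single-pass accumulator for the sketches $TA$ and $\Pi A$; the Gaussian maps are our stand-ins for the problem-dependent sketching matrix $T$ and the projection $\Pi$:

```python
import numpy as np

def one_pass_sketches(rows, d, t, r1, seed=0):
    """One pass over the rows of A, accumulating T @ A (t x d) and Pi @ A (r1 x d).
    Column i of T and of Pi is regenerated on the fly from a seeded RNG, so the
    working space is O((t + r1) d), independent of the high dimension n."""
    rng = np.random.default_rng(seed)
    TA, PiA = np.zeros((t, d)), np.zeros((r1, d))
    for row in rows:                        # row = A_(i), seen exactly once
        TA += np.outer(rng.standard_normal(t), row)
        PiA += np.outer(rng.standard_normal(r1), row)
    return TA, PiA
```

After the pass, $R^{-1}$ is computed from the small matrix $\Pi A$ as in Algorithm 1, and $T A R^{-1} \Pi_2$ is then formed from small matrices only.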

More details can be found in the technical report version of this paper (Drineas et al., 2011).

References

Achlioptas, D. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671–687, 2003.

Ailon, N. and Chazelle, B. The fast Johnson-Lindenstrauss transform and approximate nearest neighbors. SIAM Journal on Computing, 39(1):302–322, 2009.

Ailon, N. and Liberty, E. Fast dimension reduction using Rademacher series on dual BCH codes. In Proceedings of the 19th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1–9, 2008.

Boutsidis, C., Mahoney, M.W., and Drineas, P. An improved approximation algorithm for the column subset selection problem. In Proceedings of the 20th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 968–977, 2009.

Candes, E.J. and Recht, B. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.

Chatterjee, S. and Hadi, A.S. Influential observations, high leverage points, and outliers in linear regression. Statistical Science, 1(3):379–393, 1986.

Drineas, P., Mahoney, M.W., and Muthukrishnan, S. Sampling algorithms for $\ell_2$ regression and applications. In Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1127–1136, 2006.

Drineas, P., Mahoney, M.W., and Muthukrishnan, S. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30:844–881, 2008.

Drineas, P., Mahoney, M.W., Muthukrishnan, S., and Sarlós, T. Faster least squares approximation. Numerische Mathematik, 117(2):219–249, 2010.

Drineas, P., Magdon-Ismail, M., Mahoney, M. W., and Woodruff, D. P. Fast approximation of matrix coherence and statistical leverage. Technical report, 2011. Preprint: arXiv:1109.3843 (2011).

Hoaglin, D.C. and Welsch, R.E. The hat matrix in regression and ANOVA. The American Statistician, 32(1):17–22, 1978.

Mahoney, M. W. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning. NOW Publishers, Boston, 2011. Also available at: arXiv:1104.5557.

Mahoney, M.W. and Drineas, P. CUR matrix decompositions for improved data analysis. Proc. Natl. Acad. Sci. USA, 106:697–702, 2009.

Mohri, M. and Talwalkar, A. Can matrix coherence be efficiently and accurately estimated? In Proceedings of the 14th International Workshop on Artificial Intelligence and Statistics, 2011.

Mougeot, M., Picard, D., and Tribouley, K. Learning out of leaders. Technical report. Preprint: arXiv:1001.1919 (2010).

Sarlós, T. Improved approximation algorithms for large matrices via random projections. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science, pp. 143–152, 2006.

Talwalkar, A. and Rostamizadeh, A. Matrix coherence and the Nyström method. In Proceedings of the 26th Conference in Uncertainty in Artificial Intelligence, 2010.

Tropp, J.A. Greed is good: Algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50(10):2231–2242, 2004.