
Computational Statistics & Data Analysis 18 (1994) 457-467

North-Holland


Resistant lower rank approximation of matrices by iterative majorization ¹

Peter Verboon and Willem J. Heiser

University of Leiden, Leiden, The Netherlands

Received September 1992; revised May 1993

Abstract: It is commonly known that many techniques for data analysis based on the least squares criterion are very sensitive to outliers in the data. Gabriel and Odoroff (1984) suggested a resistant approach for lower rank approximation of matrices. In this approach, weights are used to diminish the influence of outliers on the low-dimensional representation. The present paper uses iterative majorization to provide a general algorithm for such resistant lower rank approximations which guarantees convergence. It is shown that the weights can be chosen in different ways, corresponding to different objective functions. Some possible extensions of the algorithm are discussed.

Keywords: Lower rank approximation; Resistance; Robustness; Huber function; Biweight function; Iteratively reweighted least squares; Majorization.

1. Introduction

Finding a low-dimensional representation of a high-dimensional data matrix is a well-known method in data analysis. For instance, the biplot (Gabriel, 1971) and principal component analysis (PCA) are among the most important multivariate techniques in data exploration. The criterion used to examine how well the data are represented is usually defined in terms of the squared residuals. However, when the least squares criterion is used and there are outliers in the data, the low-dimensional representation may not be the most interesting one, and will tend to be unstable. This problem has been investigated by Hawkins and Fatti (1984) for outliers that deflate the correlations and dominate the last few principal components. A general discussion of the influence of outliers in PCA is given in Jolliffe (1986, ch. 9).

Correspondence to: Peter Verboon, Department of Data Theory, Faculty of Social Sciences, University of Leiden, P.O. Box 9555, 2300 RB Leiden, The Netherlands.
¹ This research was supported by a PSYCHON grant (560-267-029) of the Netherlands organization for scientific research (NWO) for the first author.



Gabriel and Zamir (1979) show how a weighted least squares algorithm can be used to find a low-dimensional representation of both column and row points of a data matrix. In this approach each entry in the data matrix is separately weighted with some prechosen nonnegative quantity. An extension of this basic idea is given by Gabriel and Odoroff (1984), who suggest using these weights to decrease the influence of outliers on the representation. In this extension, the weights are related to the residuals from the low-dimensional representation, which yields an iterative scheme, known as iteratively reweighted least squares (IRLS), in which the representation and the weights are alternatingly updated.

In the present paper we will use a majorization argument to prove convergence of IRLS algorithms that obtain a low-dimensional representation which is not influenced by outliers. It will be shown that iterative majorization can be used for a variety of resistant loss functions, by merely choosing the weights differently. This makes the iteratively reweighted Gabriel-Zamir algorithm widely applicable.

2. The method of successive dyadic fitting

Let Z = {z_ij} be the observed data matrix of order n × m. A p-dimensional (p ≤ m) representation of Z is given by

Z \approx XA',   (1)

where ≈ represents the least squares approximation. The row markers or component (object) scores are in the matrix X of order n × p (p ≤ m), and the matrix A (m × p) contains the column markers or component loadings.

Finding X and A in (1) implies that we must minimize the following loss function:

\sigma(X, A) = \operatorname{tr}(Z - XA')'(Z - XA'),   (2)

with the normalization constraint X'X = nI_p. The normalization constraint is necessary for identification, since the product XA' is unique up to linear transformations of X and A. First consider a p = 1 approximation of Z. In this case X reduces to the column vector x_1 and A to the column vector a_1. The loss function for the first principal component then becomes

\sigma(x_1, a_1) = \operatorname{tr}(Z - x_1 a_1')'(Z - x_1 a_1').   (3)

Both unknown vectors x_1 and a_1 are easily computed via simple regression equations (Good, 1969), which is called dyadic fitting (Gabriel & Zamir, 1979). For the first component we may iterate between x_1 and a_1 until the solution stabilizes. After convergence the product x_1 a_1' is the best least squares rank-one approximation of Z.

Subsequent dimensions can be found as follows. The data are replaced by the residuals from the rank-one approximation by subtracting the approximation from the original data; thus

Z^{(-1)} = Z - \hat{Z},

where \hat{Z} = x_1 a_1' represents the rank-one approximation. With this Z^{(-1)}, (3) can be solved again for the second dimension, which yields new x_2, a_2, and Z^{(-2)}. In this way all p dimensions (principal components) can be computed. This stepwise fitting is possible because successive columns of X, and also successive columns of A, are orthogonal (or, in the case of multiple singular values of Z, they can be chosen to be orthogonal).
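For illustration, the successive scheme can be sketched in a few lines of NumPy. The helper names dyadic_fit and successive_fit are ours, the starting vector is arbitrary, and the normalization X'X = nI_p is omitted; the sketch is meant to show the alternating regressions and the deflation step, not to reproduce the authors' implementation.

```python
import numpy as np

def dyadic_fit(Z, n_iter=100, tol=1e-10):
    """Rank-one least squares fit Z ~ x a' by alternating simple regressions."""
    n, m = Z.shape
    x = np.random.default_rng(0).standard_normal(n)   # arbitrary starting vector
    a = np.zeros(m)
    for _ in range(n_iter):
        a = Z.T @ x / (x @ x)          # regress the columns of Z on x
        x_new = Z @ a / (a @ a)        # regress the rows of Z on a
        converged = np.linalg.norm(x_new - x) < tol
        x = x_new
        if converged:
            break
    return x, a

def successive_fit(Z, p):
    """p-dimensional approximation by successive dyadic fits on residuals."""
    R = np.asarray(Z, dtype=float).copy()
    xs, avs = [], []
    for _ in range(p):
        x, a = dyadic_fit(R)
        xs.append(x)
        avs.append(a)
        R = R - np.outer(x, a)         # replace the data by the residuals
    return np.column_stack(xs), np.column_stack(avs)
```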

3. Weighted cyclic dyadic fitting

Let W = {w_ij} be a matrix of weights of order n × m, so that each w_ij corresponds with an observation z_ij in the data. Furthermore, let V_j (j = 1, ..., m) be a diagonal matrix with the elements w_ij (i = 1, ..., n) for some j on the diagonal. The weighted least squares loss function is now written as

\sigma(X, A) = \sum_{j=1}^{m} (z_j - Xa_j)' V_j (z_j - Xa_j),   (4)

where aj and zj are the jth column of A’ and Z, respectively. It is not hard to see that minimizing (4) over X and A can be done by alternatingly solving a weighted least squares problem. For given Z and X it follows directly from (4) that each aj is found independently by projecting zj on the space spanned by the columns of X in the metric Vj, giving regression weights aj. Rewriting (4) as a summation over rows shows that for given Z and A the rows of X are also regression weights that are found by projecting zi on the space spanned by the columns of A in the metric Vi, since

\sigma(X, A) = \sum_{i=1}^{n} (z_i - Ax_i)' V_i (z_i - Ax_i),   (5)

where V_i (m × m) is diagonal with the elements of the ith row of W on the diagonal; x_i and z_i are the ith row of X and Z, respectively. It is important to note that this weighted alternating least squares procedure can no longer be applied successively, but must be carried out cyclically (Gabriel & Zamir, 1979). Thus, we may start as in the unweighted case by first computing the solution for the first dimension, subtracting this solution from the data and continuing with these residualized values to compute the next dimension. But we have to cycle through this process again. In general, the elements a_jk are updated by

a_{jk} = (x_k' V_j x_k)^{-1} x_k' V_j \tilde{z}_j,   (6)

and the elements x_ik by

x_{ik} = (a_k' V_i a_k)^{-1} a_k' V_i \tilde{z}_i,   (7)


where x_k and a_k are the kth column of X and A, respectively, and \tilde{z}_j and \tilde{z}_i are the jth column and the ith row of the residualized matrix

\tilde{Z}_k = Z - \sum_{l \neq k} x_l a_l',

in which the contribution of the dyadic fits of the other dimensions has been subtracted from the data matrix. Cycling is necessary because the successive a_k's (as well as the x_k's) are generally not orthogonal, except when the weights are equal.
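A compact sketch of these cyclic updates, exploiting that V_j and V_i are diagonal so that (6) and (7) reduce to elementwise weighted sums, could look as follows. The name weighted_cyclic_fit, the random starting values, and the small constant guarding against columns or rows whose weights are all zero are our additions, not part of the original algorithm.

```python
import numpy as np

def weighted_cyclic_fit(Z, W, p, n_cycles=50):
    """Weighted cyclic dyadic fitting: updates (6) and (7) for k = 1, ..., p."""
    n, m = Z.shape
    rng = np.random.default_rng(0)
    X = rng.standard_normal((n, p))
    A = rng.standard_normal((m, p))
    eps = 1e-12                                   # guard against all-zero weight rows/columns
    for _ in range(n_cycles):
        for k in range(p):
            # residualized data with the kth dyad left out: Z~_k = Z - sum_{l != k} x_l a_l'
            Zk = Z - X @ A.T + np.outer(X[:, k], A[:, k])
            xk = X[:, k]
            # (6): a_jk = (x_k' V_j x_k)^{-1} x_k' V_j z~_j, for every column j at once
            A[:, k] = (W * Zk).T @ xk / (W.T @ (xk ** 2) + eps)
            ak = A[:, k]
            # (7): x_ik = (a_k' V_i a_k)^{-1} a_k' V_i z~_i, for every row i at once
            X[:, k] = (W * Zk) @ ak / (W @ (ak ** 2) + eps)
    return X, A
```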

4. Resistant loss functions

The vulnerability of the least squares criterion in the presence of outliers is well known. In the context of estimating a location parameter and in regression analysis, alternative loss functions have been introduced, such as Huber's function (Huber, 1964; 1981) and Tukey's biweight function (Beaton & Tukey, 1974). These functions have proved to be a good alternative to least squares when there are outliers in the data. In the present paper we will show that these functions can also be applied in the situation where a low-dimensional approximation of a data matrix is required. To formulate the problem in a general way, we rewrite it as a summation over residual elements

\sigma(X, A) = \sum_{i=1}^{n} \sum_{j=1}^{m} f(r_{ij}),   (8)

where the residuals are defined by r_{ij} = z_{ij} - \sum_{k=1}^{p} x_{ik} a_{jk}. The ordinary least squares function is of course given by f(r_{ij}) = r_{ij}^2. The Huber function, for each separate residual element, is defined as

f_H(r_{ij}) = \begin{cases} \tfrac{1}{2} r_{ij}^2 & \text{if } |r_{ij}| < c, \\ c\,|r_{ij}| - \tfrac{1}{2}c^2 & \text{if } |r_{ij}| \ge c, \end{cases}   (9)

where the constant c is called the tuning constant. For small residuals the ordinary least squares function is used, while for relatively large residuals the least absolute residuals criterion is inserted. This differential treatment of residuals implies that with the Huber function the influence of large deviations is reduced compared to least squares. If c is made very small, so that all residuals are larger than c, minimizing the Huber function amounts to minimizing the L_1 norm. The biweight function is even more radical in downweighting the large residuals:

f_B(r_{ij}) = \begin{cases} \tfrac{c^2}{6}\bigl(1 - (1 - (r_{ij}/c)^2)^3\bigr) & \text{if } |r_{ij}| \le c, \\ \tfrac{c^2}{6} & \text{if } |r_{ij}| > c. \end{cases}   (10)

Tukey's biweight is called a hard redescending function, because its first derivative first ascends and then descends, while it becomes exactly zero for large residuals, which implies that it has a relatively high tolerance towards large deviations and that it is indifferent beyond c. It follows that outliers in the data may become associated with large residuals. They can be arbitrarily far away from the model, since their contribution to the loss is constant. Consequently they have no further influence upon the solution, which is presumably entirely based on "good" points only.
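Both functions can be written down directly. The sketch below uses the conventional default tuning constants 1.345 and 4.685, which are common choices in the robustness literature and are not prescribed by the paper.

```python
import numpy as np

def huber_loss(r, c=1.345):
    """Huber function (9): quadratic for |r| < c, linear beyond c."""
    a = np.abs(r)
    return np.where(a < c, 0.5 * a ** 2, c * a - 0.5 * c ** 2)

def biweight_loss(r, c=4.685):
    """Tukey biweight function (10): redescending, constant c^2/6 beyond c."""
    u = np.clip((np.asarray(r, dtype=float) / c) ** 2, 0.0, 1.0)
    return (c ** 2 / 6.0) * (1.0 - (1.0 - u) ** 3)
```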

5. Minimization by iterative majorization

To minimize the Huber and biweight functions, the iterative majorization method will be used. This method was first explicitly used in this context by Heiser (1987) and recently in a slightly different context by Verboon and Heiser (1992). In the present paper it will be shown that majorization leads to Gabriel and Odoroff's (1984) iteratively reweighted least squares (IRLS) algorithm, with two main steps. In one step, a weighted least squares problem is solved for a fixed set of weights, and in the other, the weights are chosen as a monotonically decreasing function of the residuals from the previous step.

The principle of iterative majorization relies on a family of functions μ(·) for which the following inequality holds:

f(r_{ij}) \le \mu(r_{ij}; w_{ij}).   (11)

The notation μ(r_ij; w_ij) says that μ(r_ij; w_ij) is a function of the residuals r_ij for some set of fixed weights w_ij based on residuals r*_ij that have been derived in the previous step of the algorithm. The majorizing function μ(r_ij; w_ij) should be chosen in such a way that it is much easier to minimize than the loss function itself. At each step in the algorithm μ(r_ij; w_ij) is adapted, using the residual of the previous step as a so-called supporting point. To identify μ(r_ij; w_ij) the following equality should also hold:

\mu(r_{ij}^{*}; w_{ij}) = f(r_{ij}^{*}).   (12)

Together with (11), equality (12) implies that at r*_ij both functions have the same first derivative (if it exists). If this derivative is not zero (in which case the minimum would be attained), then we can always find new residuals r+_ij, which minimize μ(r_ij; w_ij), such that

\mu(r_{ij}^{+}; w_{ij}) < \mu(r_{ij}^{*}; w_{ij}).   (13)

The updated residual will be used as the new supporting point, except when μ(r+_ij; w_ij) = μ(r*_ij; w_ij), in which case the algorithm stops. Combining (11), (12), and (13) yields f(r+_ij) ≤ f(r*_ij), with equality only at the minimum, which implies that each step decreases the value of the objective function.
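In other words, the three relations combine into the sandwich

f(r_{ij}^{+}) \le \mu(r_{ij}^{+}; w_{ij}) \le \mu(r_{ij}^{*}; w_{ij}) = f(r_{ij}^{*}),

where the first step uses (11), the second (13), and the last (12); the decrease is strict as long as (13) holds strictly.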

For this argument to apply in the case of resistant loss functions, it must be verified that there exist majorizing functions for f_H(r_ij) and f_B(r_ij) defined in (9) and (10). For instance, a majorizing function μ_B(·) for minimizing the loss components in the biweight can be defined as

\mu_B(r_{ij}; w_{ij}) = \tfrac{c^2}{6}\bigl(1 - 3w_{ij}(1 - (r_{ij}/c)^2) + 2w_{ij}^{3/2}\bigr).   (14)

From (14) it is clear that this majorizing function is a weighted quadratic function of the residuals. After dropping all irrelevant constants, this yields the same minimization problem as the one given in (4) and (5). It follows that in the IRLS algorithm we can minimize (4) and (5) for some fixed weights by applying alternating least squares, and in the other step we update the weights.

Different choices for the weight function correspond to different resistant functions. For the Huber function the weights have to be computed as

w_{ij} = \begin{cases} 1 & \text{if } |r_{ij}^{*}| < c, \\ c/|r_{ij}^{*}| & \text{if } |r_{ij}^{*}| \ge c. \end{cases}   (15)

The weights for the biweight function are found by

w_{ij} = \begin{cases} \bigl(1 - (r_{ij}^{*}/c)^2\bigr)^2 & \text{if } |r_{ij}^{*}| \le c, \\ 0 & \text{if } |r_{ij}^{*}| > c. \end{cases}   (16)

In both cases, we obtain a set of weights (0 ≤ w_ij ≤ 1) that is monotonically decreasing with respect to the absolute values of the previous residuals, as can easily be verified from (15) and (16), and which can be used as diagnostics. Weights close to 1 are assigned to data that fit the model well, while badly fitting points (outliers) will have small weights. A short and informal overview of the algorithm is presented in Figure 1.
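In code, the weight updates (15) and (16) are one-liners on the matrix of previous residuals. As before, the function names and the default tuning constants are illustrative rather than taken from the paper.

```python
import numpy as np

def huber_weights(R_prev, c=1.345):
    """Weights (15): 1 for small residuals, c/|r*| for large ones."""
    a = np.abs(R_prev)
    return np.where(a < c, 1.0, c / np.maximum(a, 1e-12))

def biweight_weights(R_prev, c=4.685):
    """Weights (16): smoothly decreasing in |r*|, exactly zero beyond c."""
    u = (np.asarray(R_prev, dtype=float) / c) ** 2
    return np.where(np.abs(R_prev) <= c, (1.0 - u) ** 2, 0.0)
```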

It will now be shown that μ_B(r_ij; w_ij) is indeed a majorizing function for the biweight function given in (10). The two conditions (11) and (12) must hold for a proper majorizing function.

Lemma 1. For the previously found parameter matrices, yielding residuals r*_ij, the value of the biweight function is equal to that of function (14), i.e. f_B(r*_ij) = μ_B(r*_ij; w_ij).

Proof. For notational convenience, we will set c = 1. Considering one loss component, substitution of (16) in (14) yields for the first part of the function:

\mu_B(r_{ij}^{*}; w_{ij}) = \tfrac{1}{6}\bigl(1 - 3(1 - r_{ij}^{*2})^2(1 - r_{ij}^{*2}) + 2(1 - r_{ij}^{*2})^3\bigr) = \tfrac{1}{6}\bigl(1 - (1 - r_{ij}^{*2})^3\bigr).

The second part of both functions is equal to 1/6 because w_ij = 0. Since the equality can be proved for any component, it consequently has been proved for the summation over the components too. □


START
initialization
while still improvements of overall loss found do
    compute new weights w_ij using (15) or (16)
    while still improvements of loss found for fixed weights w_ij do
        for j = 1 to m do
            if measurement level of jth variable is ordinal or nominal then
                compute unrestricted update q_j+
                constrain q_j+ to q_j in Γ_j
                set q_j equal to standardized q_j
            endif
        endfor
        while still improvements for X found do
            for k = 1 to p do
                update x_k according to (7)
            endfor
        endwhile
        for j = 1 to m do
            update a_j according to (6)
        endfor
    endwhile
endwhile
STOP

Fig. 1. Schematic overview of the iterative majorization algorithm.
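Putting the pieces together, the outer IRLS loop of Figure 1 (without the optimal scaling step for ordinal or nominal variables) could be outlined as follows. It reuses the hypothetical helpers weighted_cyclic_fit and biweight_weights sketched above and is only an outline of the scheme, not the authors' implementation; a convergence check on the loss would replace the fixed iteration counts.

```python
import numpy as np

def resistant_lowrank(Z, p, c=4.685, n_outer=30):
    """IRLS outline: alternate the weighted fit of Section 3 with weight update (16)."""
    Z = np.asarray(Z, dtype=float)
    W = np.ones_like(Z)                         # start from ordinary least squares
    for _ in range(n_outer):
        X, A = weighted_cyclic_fit(Z, W, p)     # inner weighted least squares step
        R = Z - X @ A.T                         # residuals become the new supporting point
        W = biweight_weights(R, c)              # small weights flag potential outliers
    return X, A, W
```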

Lemma 2. The value of function (14) is never smaller than the value of the biweight function, i.e. f_B(r_ij) ≤ μ_B(r_ij; w_ij).

Proof. There are two situations: (i) |r_ij| > 1 and (ii) |r_ij| ≤ 1. In situation (i) we have

\tfrac{1}{6} \le \tfrac{1}{6}\bigl(1 - 3w_{ij}(1 - r_{ij}^2) + 2w_{ij}^{3/2}\bigr).   (17)

This inequality is true since w_ij(1 - r_ij²) ≤ 0 and w_ij^{3/2} ≥ 0; thus, the term -3w_ij(1 - r_ij²) + 2w_ij^{3/2} is nonnegative.

In situation (ii) we have

\tfrac{1}{6}\bigl(1 - (1 - r_{ij}^2)^3\bigr) \le \tfrac{1}{6}\bigl(1 - 3w_{ij}(1 - r_{ij}^2) + 2w_{ij}^{3/2}\bigr).   (18)

We start from the general inequality (a - b)² ≥ 0, which gives a² ≥ 2ab - b². Using this inequality with a = 1 - r_ij² and b = 1 - r*_ij², and recalling that w_ij = (1 - r*_ij²)², we may write

(1 - r_{ij}^2)^2 \ge 2(1 - r_{ij}^2)(1 - r_{ij}^{*2}) - w_{ij}.   (19)

Next both sides are multiplied by the nonnegative quantity (1 - r_ij²), yielding

(1 - r_{ij}^2)^3 \ge 2(1 - r_{ij}^2)^2(1 - r_{ij}^{*2}) - w_{ij}(1 - r_{ij}^2).   (20)

Substituting the right-hand side of (19) for the term (1 - r_ij²)² does not change the inequality:

(1 - r_{ij}^2)^3 \ge 2\bigl[2(1 - r_{ij}^2)(1 - r_{ij}^{*2}) - w_{ij}\bigr](1 - r_{ij}^{*2}) - w_{ij}(1 - r_{ij}^2).   (21)

Working out this expression yields

(1 - r_{ij}^2)^3 \ge 3(1 - r_{ij}^2)w_{ij} - 2w_{ij}^{3/2}.   (22)

Subtracting both sides from one and multiplying by 1/6 gives

\tfrac{1}{6}\bigl(1 - (1 - r_{ij}^2)^3\bigr) \le \tfrac{1}{6}\bigl(1 - 3w_{ij}(1 - r_{ij}^2) + 2w_{ij}^{3/2}\bigr),   (23)

which proves the inequality. Again, if this inequality is true for any element, it is also true for the summation. □

From these two lemmas it follows that μ_B(r_ij; w_ij) can be used as a majorizing function for the biweight function. Since (10) is bounded from below, majorization theory guarantees that at least a local minimum is attained by alternating repeatedly between minimizing (14) and updating the weights through (16), in the case of the biweight.

For the Huber function, the simplest majorizing function is

\mu_H(r_{ij}; w_{ij}) = \begin{cases} \tfrac{1}{2} w_{ij} r_{ij}^2 & \text{if } |r_{ij}^{*}| < c, \\ \tfrac{1}{2} w_{ij} r_{ij}^2 + \tfrac{1}{2}c\,|r_{ij}^{*}| - \tfrac{1}{2}c^2 & \text{if } |r_{ij}^{*}| \ge c, \end{cases}   (24)

which is also quadratic in the residuals. This function can be used to majorize Huber's function because of the following lemmas:

Lemma 3. f_H(r*_ij) = μ_H(r*_ij; w_ij).

Lemma 4. f_H(r_ij) ≤ μ_H(r_ij; w_ij).

The proofs of Lemmas 3 and 4 can be found in Heiser (1987).


6. Aggregating the residuals

In the previous section, different weight matrices V_j were considered, one for each variable. An interesting special case occurs when we assume that all V_j are equal to each other; thus, V_j = V for j = 1, ..., m. This implies we have n weights, one for each object, instead of n × m weights, one for each cell in the data matrix. Thus, instead of applying the loss function to each residual element r_ij and summing to obtain the overall loss, we aggregate residuals over rows and apply the loss function to these n aggregated values, denoted as d_i (i = 1, ..., n). Handling the residuals rowwise, we should first compute the residuals per row, d_i. The value d_i is computed as the Euclidean distance between an object in the m-dimensional space and its model values that satisfy the rank-p restrictions; thus

d_i = \Bigl( \sum_{j=1}^{m} \bigl( z_{ij} - \sum_{k=1}^{p} x_{ik} a_{jk} \bigr)^2 \Bigr)^{1/2}.   (25)

Now, the Huber or biweight function can be applied to these values to obtain a loss per row; a summation of these row losses yields the total loss.

For least squares both ways of handling the residuals are equivalent, but for the Huber or biweight function these two approaches lead to two different situations. Elementwise weighting is more flexible, since small weights could be assigned to separate scores of an object, leaving its other scores unaffected. On the other hand rowwise weighting might conceptually (and computationally) be more attractive, since it considers whole objects as possible outliers. The latter is suitable in the context of PCA, which usually considers a data matrix of objects and variables. The elementwise approach is conceptually more suitable in the bilinear analysis of tables, in which rows and columns play a more symmetric role.
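A rowwise variant only changes the weight step: one weight per object is computed from d_i in (25) and then, because all V_j are equal, expanded to a constant row of W so that the same cellwise fitting routine can be reused. The sketch below assumes the biweight and uses the same illustrative tuning constant as before.

```python
import numpy as np

def rowwise_biweight_weights(Z, X, A, c=4.685):
    """One weight per object, based on the rowwise residual length d_i of (25)."""
    d = np.linalg.norm(Z - X @ A.T, axis=1)           # Euclidean residual per row
    u = (d / c) ** 2
    w = np.where(d <= c, (1.0 - u) ** 2, 0.0)         # biweight applied to d_i
    return np.repeat(w[:, None], Z.shape[1], axis=1)  # all V_j equal to V
```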

7. Extensions

In this section some extensions are discussed of the general minimization problem formulated in (4). We have seen that (4) can be solved by weighted cyclic dyadic fitting procedures when there are no restrictions. Now suppose Z is a matrix whose columns represent categorical variables, which we may assume are measured on an ordinal or nominal level. The additional objective is to find optimal transformations of the variables, where optimal is defined in terms of the loss function (Young, 1981). The optimally transformed variables are denoted as q_j (j = 1, ..., m). This yields, as the nonlinear variant of (4), the following function:

\sigma(Q, X, A) = \sum_{j=1}^{m} (q_j - Xa_j)' V_j (q_j - Xa_j),   (26)


which has to be minimized over X, A, and q_1, ..., q_m, satisfying q_j'q_j = n and q_j ∈ Γ_j, where Γ_j indicates the set of admissible transformations of the given variable z_j. The class of transformations may be defined differently for each variable, and includes nominal, monotonic, and linear transformations. In fact, (26) is a generalization of the PRINCIPALS program (Young et al., 1978), in which V_j = I for all j = 1, ..., m.

Up to some irrelevant constants, the function in (26) can still be seen as a majorizing function for one of the resistant functions. Minimizing (26) is of course more complex than minimizing (4), but both lead to a minimum of an objective function (see Verboon et al., 1991). It follows that, from a technical point of view, optimal scaling can easily be added to the problem of finding a low-dimensional representation of a data matrix.
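For an ordinal variable, the inner step amounts to a weighted monotone regression of the current model column Xa_j onto the order of the observed variable, followed by rescaling so that q_j'q_j = n. The pool-adjacent-violators sketch below is a generic implementation of that idea, not the authors' code; tie handling and the exact standardization conventions are simplified.

```python
import numpy as np

def pava(y, w):
    """Weighted pool-adjacent-violators: monotone nondecreasing fit to y."""
    vals, wts, sizes = [], [], []
    for yi, wi in zip(y, w):
        vals.append(float(yi)); wts.append(float(wi)); sizes.append(1)
        while len(vals) > 1 and vals[-2] > vals[-1]:
            wtot = wts[-2] + wts[-1]
            vals[-2] = (wts[-2] * vals[-2] + wts[-1] * vals[-1]) / max(wtot, 1e-12)
            wts[-2] = wtot
            sizes[-2] += sizes[-1]
            vals.pop(); wts.pop(); sizes.pop()
    return np.concatenate([np.full(s, v) for v, s in zip(vals, sizes)])

def ordinal_update(z_j, target, w_j):
    """Monotone transformation of variable j toward its model column, scaled to q'q = n."""
    order = np.argsort(z_j, kind="stable")    # ties kept in data order (a simplification)
    q = np.empty(len(z_j), dtype=float)
    q[order] = pava(np.asarray(target, dtype=float)[order],
                    np.asarray(w_j, dtype=float)[order])
    return q * np.sqrt(len(q) / (q @ q))      # enforce q_j' q_j = n
```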

Another extension is the possibility to deal with missing values in the data. Missing values are excluded from the minimization problem by weighting them with zero. For this we need a set of binary diagonal matrices M_j, indicating missing values by 0 and observed values by 1. Each regression problem is now defined in the metric M_jV_j. It follows that the quadratic part of the majorizing function to be minimized becomes:

\sum_{j=1}^{m} (z_j - Xa_j)' M_j V_j (z_j - Xa_j).   (27)

Obviously, the matrices Mj are fixed throughout the algorithm. When there are no missing values, Mj = I for all variables, and (27) equals (4).
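In practice the missing-data indicators can simply be folded into the weight matrix, since M_j and V_j are both diagonal. The sketch below assumes missing entries are coded as NaN; the helper name is ours.

```python
import numpy as np

def with_missing(Z_raw, W):
    """Sketch of (27): give missing cells weight zero so they drop out of the fit."""
    M = (~np.isnan(Z_raw)).astype(float)   # 1 for observed, 0 for missing
    Z = np.nan_to_num(Z_raw)               # value at missing cells is irrelevant once weighted by 0
    return Z, W * M                        # effective cellwise weights correspond to M_j V_j
```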

8. Discussion

In the present paper a very general approach based on majorization has been discussed for fitting lower rank approximations of matrices. Many different weight functions can be applied instead of (15) or (16), provided that they yield a proper majorizing function, which guarantees monotonic convergence. Examples are Hampel's three-part hard redescender (Hampel, 1968), Eilers' soft redescender (Eilers, 1987), or simply a trimming function, which assigns weight 0 to residuals larger than a particular value and 1 otherwise. In all cases, the whole problem reduces to repeatedly applying the algorithm proposed by Gabriel and Zamir (1979), since the choice of the weights poses no problems.

References

Beaton, A.E. & Tukey, J.W. (1974), The fitting of power series, meaning polynomials, illustrated on band-spectroscopic data. Technometrics, 16, 147-185.

Eilers, P.H.C. (1987), Adaptieve gewichten, een exploratieve techniek voor uitbijters en mengsels van regressiemodellen [Adaptive weights, an exploratory technique for outliers and mixtures of regression models]. Kwantitatieve Methoden, 23, 63-83.

Gabriel, K.R. (1971), The biplot graphic display of matrices with application to principal component analysis. Biometrika, 58, 453-467.

Gabriel, K.R. & Odoroff, L. (1984), Resistant lower rank approximation of matrices. In: E. Diday et al. (Eds.), Data Analysis and Statistics III (pp. 23-30). Amsterdam: North-Holland.

Gabriel, K.R. & Zamir, S. (1979), Lower rank approximation of matrices by least squares with any choice of weights. Technometrics, 21, 489-498.

Good, I.J. (1969), Some applications of the singular value decomposition of a matrix. Technometrics, 11, 823-831.

Hampel, F.R. (1968), Contributions to the theory of robust estimation. Ph.D. thesis, University of California, Berkeley.

Hawkins, D.M. & Fatti, L.P. (1984), Exploring multivariate data using the minor principal components. The Statistician, 33, 325-338.

Heiser, W.J. (1987), Correspondence analysis with least absolute residuals. Computational Statistics and Data Analysis, 5, 337-356.

Huber, P.J. (1964), Robust estimation of a location parameter. Annals of Mathematical Statistics, 35, 73-101.

Huber, P.J. (1981), Robust Statistics. New York: Wiley.

Jolliffe, I.T. (1986), Principal Component Analysis. New York: Springer-Verlag.

Verboon, P. & Heiser, W.J. (1992), Resistant orthogonal Procrustes analysis. Journal of Classification, 9, 237-256.

Verboon, P., Van der Lans, I. & Heiser, W.J. (1991), The multipals algorithm. Research Report 91-05. Leiden: Department of Data Theory.

Young, F.W. (1981), Quantitative analysis of qualitative data. Psychometrika, 46, 347-388.

Young, F.W., Takane, Y. & De Leeuw, J. (1978), The principal components of mixed measurement level multivariate data: an alternating least squares method with optimal scaling features. Psychometrika, 43, 279-281.