
Stochastic algorithms for solving structured low-rank matrix approximation problems

J.W. Gillard*, A.A. Zhigljavsky
Cardiff School of Mathematics, Cardiff University, United Kingdom

Commun Nonlinear Sci Numer Simulat 21 (2015) 70–88
http://dx.doi.org/10.1016/j.cnsns.2014.08.023
1007-5704/© 2014 Elsevier B.V. All rights reserved.

* Corresponding author. E-mail addresses: [email protected] (J.W. Gillard), [email protected] (A.A. Zhigljavsky).

Article history: Available online 2 October 2014

Keywords: Structured low rank approximation; Hankel matrix; Global optimization

Abstract

In this paper, we investigate the complexity of the numerical construction of the Hankel structured low-rank approximation (HSLRA) problem, and develop a family of algorithms to solve this problem. Briefly, HSLRA is the problem of finding the closest (in some pre-defined norm) rank $r$ approximation of a given Hankel matrix, where the approximation is itself required to be of Hankel structure. We demonstrate that finding optimal solutions of this problem is very hard. For example, we argue that if HSLRA is considered as a problem of estimating parameters of damped sinusoids, then the associated optimization problem is basically unsolvable. We discuss what is known as the orthogonality condition, which solutions to the HSLRA problem should satisfy, and describe how any approximation may be corrected to achieve this orthogonality. Unlike many other methods described in the literature, the family of algorithms we propose has the property of guaranteed convergence.

© 2014 Elsevier B.V. All rights reserved.

1. Introduction

1.1. Statement of the problem

Let $L$, $K$ and $r$ be given positive integers such that $1 \le r < L \le K$. Denote the set of all real-valued $L \times K$ matrices by $\mathbb{R}^{L\times K}$. Let $\mathcal{M}_r = \mathcal{M}_r^{L\times K} \subset \mathbb{R}^{L\times K}$ be the subset of $\mathbb{R}^{L\times K}$ containing all matrices with rank $\le r$, and let $\mathcal{H} = \mathcal{H}^{L\times K} \subset \mathbb{R}^{L\times K}$ be the subset of $\mathbb{R}^{L\times K}$ containing matrices of some known structure. The set of structured $L\times K$ matrices of rank $\le r$ is $\mathcal{A} = \mathcal{M}_r \cap \mathcal{H}$.

Assume we are given a matrix $X_\ast \in \mathcal{H}$. The problem of structured low rank approximation (SLRA) is:

$$ f(X) \to \min_{X \in \mathcal{A}}, \qquad (1) $$

where $f(X) = \rho^2(X, X_\ast)$ is a squared distance on $\mathbb{R}^{L\times K} \times \mathbb{R}^{L\times K}$.

In this paper we only consider the case where $\mathcal{H}$ is the set of Hankel matrices and thus refer to (1) as HSLRA. Recall that a matrix $X = (x_{lk}) \in \mathbb{R}^{L\times K}$ is called Hankel if $x_{lk} = \mathrm{const}$ for all pairs $(l,k)$ such that $l + k = \mathrm{const}$; that is, all elements on the anti-diagonals of $X$ are equal. There is a one-to-one correspondence between $L\times K$ Hankel matrices and vectors of size $N = L + K - 1$. For a vector $Y = (y_1, \ldots, y_N)^T$, the matrix $X = \mathcal{H}(Y) = (x_{lk}) \in \mathbb{R}^{L\times K}$ with elements $x_{lk} = y_{l+k-1}$ is Hankel, and vice versa: for any matrix $X \in \mathcal{H}$, we may define $Y = \mathcal{H}^{-1}(X)$ so that $X = \mathcal{H}(Y)$.

We consider the distances $\rho$ defined by the semi-norms

$$ \|A\|_W^2 = \mathrm{tr}\, A W A^T \qquad \big(\text{so that } f(X) = \mathrm{tr}(X - X_\ast) W (X - X_\ast)^T\big), \qquad (2) $$


where $A \in \mathbb{R}^{L\times K}$ and $W$ is a symmetric non-negative definite matrix of size $K\times K$. Moreover, in our main application the weight matrix $W$ is diagonal:

$$ W = \mathrm{diag}(w_1, \ldots, w_K), \qquad (3) $$

where $w_1, \ldots, w_K$ are some positive numbers.
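Throughout the paper the pair $(Y, X)$ is used interchangeably via $X = \mathcal{H}(Y)$. As a minimal illustration, the following Python (NumPy) sketch builds $\mathcal{H}(Y)$ and recovers $Y = \mathcal{H}^{-1}(X)$ from an exact Hankel matrix; the helper names are ours, and the series is the one used in De Moor's example of Section 6.1.

```python
import numpy as np

def hankel_from_vector(y, L):
    """H(Y): the L x K Hankel matrix with x_{lk} = y_{l+k-1} (0-based: X[l, k] = y[l + k])."""
    y = np.asarray(y, dtype=float)
    K = len(y) - L + 1
    return np.array([[y[l + k] for k in range(K)] for l in range(L)])

def vector_from_hankel(X):
    """H^{-1}(X) for an exact Hankel matrix: first column followed by the rest of the last row."""
    return np.concatenate([X[:, 0], X[-1, 1:]])

Y = np.array([3., 4., 2., 1., 5., 6., 7., 1., 2.])   # the series from De Moor's example (Section 6.1)
X = hankel_from_vector(Y, L=4)                        # 4 x 6 Hankel matrix, N = L + K - 1 = 9
assert np.allclose(vector_from_hankel(X), Y)
```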

1.2. Background

The aim of low-rank approximation methods is to approximate a matrix containing observed data by a matrix of pre-specified lower rank $r$. The rank of the matrix containing the original data can be viewed as the order of complexity required to fit the data exactly, and a matrix of lower complexity (lower rank) 'close' to the original matrix is often required. A further requirement is that if the original matrix of the observed data is of a particular structure, then the approximation should also have this structure. An example is the HSLRA problem as defined in the previous section.

HSLRA is a very important problem with applications in a number of different areas. In addition to the clear connection with time series analysis and signal processing, HSLRA has been extensively used in system identification (modeling dynamical systems) [1], in speech and audio processing [2], in modal and spectral analysis [3] and in image processing [4]. Some discussion on the relationship of HSLRA with some well known subspace-based methods of time series analysis and signal processing is given in [5]. Similar structures used in (1) include Toeplitz, block Hankel and block Toeplitz structures. In image processing, there is much use of Hankel-block Hankel structures. Further details, references and specific applications of SLRA are provided in [6–8].

1.3. Notation

The following list contains the main notation used in this paper.

$N, L, K, r$ — positive integers with $1 \le r < L \le K < N$, $N = L + K - 1$
$\mathbb{R}^{L\times K}$ — set of $L\times K$ matrices
$\mathbb{R}^N$ — set of vectors of length $N$
$\mathcal{H}^{L\times K}$ or $\mathcal{H}$ — set of $L\times K$ Hankel matrices
$\mathcal{M}_r$ — set of $L\times K$ matrices of rank $\le r$
$\mathcal{A} = \mathcal{M}_r \cap \mathcal{H}$ — set of $L\times K$ Hankel matrices of rank $\le r$
$Y = (y_1, \ldots, y_N)^T$ — vector in $\mathbb{R}^N$
$\mathcal{H}(Y)$ — Hankel matrix in $\mathcal{H}^{L\times K}$ associated with the vector $Y \in \mathbb{R}^N$
$X_\ast \in \mathcal{H}^{L\times K}$ — given matrix
$Y_\ast = (y_{1\ast}, \ldots, y_{N\ast})^T$ — vector in $\mathbb{R}^N$ such that $\mathcal{H}(Y_\ast) = X_\ast$ (vector of observed values)
$\pi_{\mathcal{H}}(X)$ — projection of the matrix $X \in \mathbb{R}^{L\times K}$ onto the set $\mathcal{H}$
$\pi^{(r)}(X)$ — projection of a matrix $X \in \mathbb{R}^{L\times K}$ onto the set $\mathcal{M}_r$
$I_p$ — identity matrix of size $p\times p$

1.4. Structure of the paper and the main results

The structure of the paper is as follows. In Section 2 we formally define the HSLRA problem (1) as an optimization problem in the space of matrices and introduce a generic norm defining the objective function $f(\cdot)$. In Section 2 we also describe the projection operators onto $\mathcal{H}$ and, especially, onto $\mathcal{M}_r$, which are used by the majority of the algorithms introduced in this paper in the process of solving the HSLRA problem. In Section 3 we study the relations between different norms which define the objective function in two different setups. In Section 3, we also discuss some computational aspects of dealing with infinite and infinitesimal numbers. In Section 4 we study the so-called orthogonality condition which optimal solutions of (1) should satisfy, and describe how an approximation may be corrected to achieve this orthogonality. Section 5 considers some algorithms for solving the HSLRA problem represented as optimization problems in the set of Hankel matrices $\mathcal{H}$. We start by formulating a well-known algorithm based on alternating projections to the spaces $\mathcal{M}_r$ and $\mathcal{H}$, which we call AP. This is followed by an improved version of this algorithm which we call 'Orthogonalized Alternating Projections' (OAP). In Section 5.2 we introduce a family of algorithms which incorporate randomization, backtracking, evolution and selection. The algorithms described in this section have guaranteed convergence to the optimum, unlike the other methods described in the literature. The main algorithm introduced and studied in the paper is called APBR (an abbreviation for 'Alternating Projections with Backtracking and Randomization'). Examples provided in Section 6 show that APBR significantly outperforms AP, as well as some other methods. In Appendix A, we consider the HSLRA problem (1) by associating matrices $X \in \mathcal{A}$ with vectors whose elements can be represented as sums of damped sinusoids; this approach is popular in the signal processing literature. We demonstrate that the resulting objective function can be very complex, which means that the associated optimization problems are basically unsolvable. Section 7 concludes the paper.


2. HSLRA as an optimization problem

2.1. HSLRA as a problem of parameter estimation in non-linear regression

In the signal processing literature, it is customary to represent HSLRA as a problem of estimating the parameters of a nonlinear regression model with terms given by damped sinusoids, see for example [9,10]. This can be formulated as follows.

Consider a general non-linear regression model where it is assumed that each element of the observed vector $Y_\ast$ may be written as

$$ y_{j\ast} = y_j(\theta) + \epsilon_j \quad (j = 1, \ldots, N), \qquad (4) $$

where $\theta$ is a vector of unknown parameters, $y_j(\theta)$ is a function nonlinear in $\theta$, and $\epsilon_1, \ldots, \epsilon_N$ is a series of noise terms such that $\mathbb{E}\epsilon_j = 0$ and $\mathbb{E}\epsilon_i\epsilon_j = 0$ for $i \ne j$.

Parameter estimation in a general (weighted) non-linear regression model is usually reduced to solving the minimization problem

$$ F(\theta) = \sum_{j=1}^{N} s_j \left( y_{j\ast} - y_j(\theta) \right)^2 \to \min_{\theta}. \qquad (5) $$

Assuming the variances $\sigma_j^2 = \mathbb{E}\epsilon_j^2$ of the $\epsilon_j$'s are known, the weights $s_j$ can naturally be chosen as $s_j = 1/\sigma_j^2$.

The estimator $\hat\theta$ is defined as $\hat\theta = \arg\min_\theta F(\theta)$. In the case of damped sinusoids, the function $y_n(\theta)$ has the form

$$ y_n(\theta) = \sum_{i=1}^{q} a_i \exp(d_i n) \sin(2\pi \omega_i n + \phi_i), \quad n = 1, \ldots, N, \qquad (6) $$

where $\theta = (a, d, \omega, \phi)$ with $a = (a_1, \ldots, a_q)$, $d = (d_1, \ldots, d_q)$, $\omega = (\omega_1, \ldots, \omega_q)$, and $\phi = (\phi_1, \ldots, \phi_q)$.

The correspondence between $q$ in (6) and $r$ in (1) is as follows: if $\omega_i \ne 0$ then the term $a_i \exp(d_i n)\sin(2\pi\omega_i n + \phi_i)$ in (6) adds 2 to the rank of the corresponding matrix $X \in \mathcal{A}$, while if $\omega_i = 0$ (so that the term is simply $a_i \exp(d_i n)$) then this term only adds 1 to the rank of this $X \in \mathcal{A}$. Each vector $Y$ of the form (6) generates a low-rank Hankel matrix $X = \mathcal{H}(Y)$. Note, however, that the set of vectors $\{Y = \mathcal{H}^{-1}(X),\, X \in \mathcal{A}\}$ is slightly richer than the set of vectors of the form (6) with appropriate values of $q$ and forms of the terms in this formula; see [8] or [11].

Let $Y_\ast = (y_{1\ast}, \ldots, y_{N\ast})^T$, $Y(\theta) = (y_1(\theta), \ldots, y_N(\theta))^T$ and let $S$ be the diagonal matrix $S = \mathrm{diag}(s_1, \ldots, s_N) \in \mathbb{R}^{N\times N}$. Then the objective function in (5) can be written as

$$ F(\theta) = (Y_\ast - Y(\theta))^T S (Y_\ast - Y(\theta)). \qquad (7) $$
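To make the parameterization concrete, the following Python (NumPy) sketch evaluates the damped-sinusoid model (6) and the weighted objective (5)/(7); the function names and the tuple representation of $\theta$ are our own illustrative choices.

```python
import numpy as np

def damped_sinusoids(theta, N):
    """Model (6): y_n(theta) = sum_i a_i exp(d_i n) sin(2 pi omega_i n + phi_i), n = 1, ..., N.
    theta = (a, d, omega, phi), four sequences of equal length q."""
    a, d, omega, phi = theta
    n = np.arange(1, N + 1)
    return sum(a_i * np.exp(d_i * n) * np.sin(2 * np.pi * w_i * n + p_i)
               for a_i, d_i, w_i, p_i in zip(a, d, omega, phi))

def F(theta, y_obs, s=None):
    """Weighted least-squares objective (5)/(7); s are the weights s_j (equal weights by default)."""
    y_obs = np.asarray(y_obs, dtype=float)
    s = np.ones(len(y_obs)) if s is None else np.asarray(s, dtype=float)
    r = y_obs - damped_sinusoids(theta, len(y_obs))
    return float(r @ (s * r))
```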

The papers [10,12] contain discussions of the behavior of the objective function (5), with $y_j(\theta)$ defined through (6). In [12] it was observed that the objective function $F$ is multiextremal; the function $F$ was decomposed into three different components and it was numerically demonstrated that the part of the objective function with the observation noise removed dominates the shape of the objective function. An extension of this analysis is given in Appendix A.1. Note that up until now, only the weights $s_j = 1$ have been considered in the literature devoted to the optimization problem defined by (5) and (6).

This optimization problem is very difficult, with the objective function possessing many local minima. The objective function has very large Lipschitz constants which increase with $N$, the number of observations. Additionally, the number of local minima in the neighborhood of the global minimum increases linearly in $N$. Adding noise to the observed data increases the complexity of the objective function and moves the global minimizer away from the true value; for more details see Sections A.1 and A.2 in Appendix A. The nature of the objective function (5) suggests much potential for interval-arithmetic-based optimization for non-linear regression as described in [13], but this topic is reserved for future work.

2.2. Matrix optimal solution and its approximation

Consider the HSLRA problem (1). Since $0 \le f(X) < \infty$ for any $X \in \mathcal{A}$, $f(X)$ is a continuous function on $\mathcal{A}$ and $f(X) \to \infty$ as $\|X\| \to \infty$, a solution to (1) always exists. However, the solution is not necessarily unique. Set

$$ \mathbb{X}^\ast = \Big\{ X^\ast = \arg\min_{X\in\mathcal{A}} f(X) \Big\} \quad\text{and}\quad f^\ast = f(X^\ast) = \min_{X\in\mathcal{A}} f(X). $$

Any algorithm designed to solve the optimization problem (1) should return a matrix $X_{\mathrm{appr}}$ which can be considered as an approximation to one of the solutions $X^\ast \in \mathbb{X}^\ast$; approximations to the value $f^\ast$ alone (without approximations to $X^\ast$) are not sufficient. Ideally, the matrix $X_{\mathrm{appr}}$ should belong to the set of matrices $\mathcal{A}$.

A typical optimization algorithm designed for solving the problem (1) can be represented as a procedure which generates a sequence of matrices $X_0, X_1, \ldots$ such that some of the matrices $X_n$ for large $n$ can be considered as approximations to $X^\ast$, a solution of (1). This matrix sequence must have at least one limiting point (matrix) in the set $\mathcal{A}$. Denote by $X_\infty$ the limiting point of the algorithm which has the smallest value of $f$ among all its limiting points belonging to $\mathcal{A}$. In general, $f(X_\infty) \ge f^\ast$. The algorithm (theoretically) converges to the optimal solution if $f(X_\infty) = f^\ast$; that is, if $X_\infty \in \mathbb{X}^\ast$.


In many optimization algorithms attempting to solve the SLRA problem (1) and operating in matrix spaces, the projections onto the spaces $\mathcal{H}$ and $\mathcal{M}_r$ are of prime importance. Let us consider these two projections.

2.3. Projection to $\mathcal{H}$

The space $\mathcal{H} = \mathcal{H}^{L\times K}$ of $L\times K$ Hankel matrices is a linear subspace of $\mathbb{R}^{L\times K}$. The closest Hankel matrix (for a variety of norms and semi-norms including (2)) to any given matrix is obtained by the diagonal averaging procedure.

Recall that every $L\times K$ Hankel matrix $X \in \mathcal{H}$ is in one-to-one correspondence with some vector $Y = (y_1,\ldots,y_N)^T$, with $N = L+K-1$. This correspondence is described by the function $\mathcal{H}: \mathbb{R}^N \to \mathcal{H}^{L\times K}$ defined by $\mathcal{H}(Y) = \|y_{l+k-1}\|_{l,k=1}^{L,K}$ for $Y = (y_1,\ldots,y_N)^T$. Each element of the vector $Y$ is repeated in $X = \mathcal{H}(Y)$ several times. Let $E = (e_{lk}) \in \mathbb{R}^{L\times K}$ be the matrix consisting entirely of ones. We can compute the sum of each anti-diagonal of $E$, denoted $t_n$, as

$$ t_n = \sum_{l+k=n+1} e_{lk} = \begin{cases} n & \text{for } n = 1,\ldots,L-1,\\ L & \text{for } n = L,\ldots,K-1,\\ N-n+1 & \text{for } n = K,\ldots,N. \end{cases} \qquad (8) $$

The value $t_n$ is the number of times the element $y_n$ of the vector $Y$ is repeated in the Hankel matrix $\mathcal{H}(Y)$.

Let $\pi_{\mathcal{H}}(X)$ denote the projection of $X \in \mathbb{R}^{L\times K}$ onto the space $\mathcal{H}$. Then the element $\tilde{x}_{ij}$ of $\pi_{\mathcal{H}}(X)$ is given by

$$ \tilde{x}_{ij} = t_{i+j-1}^{-1} \sum_{l+k=i+j} x_{lk}. $$

The squared distance between the matrix $X$ and the space $\mathcal{H}$ is

$$ \rho^2(X, \mathcal{H}) = \min_{X' \in \mathcal{H}} \rho^2(X, X') = \rho^2(X, \pi_{\mathcal{H}}(X)) = \|X - \pi_{\mathcal{H}}(X)\|^2. $$

Since projecting onto $\mathcal{H}$ is very easy, using this subspace as the feasible domain for the HSLRA problem (1) is more natural than using the original space $\mathbb{R}^{L\times K}$. Based on the results of Appendix A we also argue that this leads to more tractable optimization problems than the approach based on the use of the damped sinusoids model.

2.4. Projection to $\mathcal{M}_r$

Unlike $\mathcal{H}$, the set $\mathcal{M}_r$ is clearly not convex. However, the projections from $\mathbb{R}^{L\times K}$ onto $\mathcal{M}_r$ for the semi-norms (2) can be easily computed using the singular value decomposition (SVD) of an appropriate matrix.

Let $A$ be some matrix in $\mathbb{R}^{L\times K}$ and suppose that we need to compute the projection of this matrix onto $\mathcal{M}_r$.

2.4.1. Frobenius norm

Assume first that $W$ is the $K\times K$ identity matrix, so that the semi-norm in (2) is the usual Frobenius norm

$$ \|A\|_F^2 = \sum_{l=1}^{L}\sum_{k=1}^{K} a_{lk}^2 \quad\text{for } A = (a_{lk})_{l,k=1}^{L,K} \in \mathbb{R}^{L\times K}. \qquad (9) $$

Then for any $r$, a projection onto $\mathcal{M}_r$ can be computed with the help of the SVD of $A$ as follows. Let $\sigma_i = \sigma_i(A)$, the singular values of $A$, be ordered so that $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_L$. Denote $\Sigma_0 = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_L)$ and $\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_r, 0, \ldots, 0)$. Then the SVD of $A$ can be written as $A = U\Sigma_0 V^T$, where the columns $U_l$ of the matrix $U \in \mathbb{R}^{L\times L}$ are the left singular vectors of $A$ and the columns $V_l$ of the matrix $V \in \mathbb{R}^{K\times L}$ are the right singular vectors $V_l = A^T U_l / \sigma_l$ (if for some $l$ the singular value $\sigma_l = 0$ then $V_l \in \mathbb{R}^K$ can be chosen arbitrarily). The matrix

$$ \pi^{(r)}(A) = U\Sigma V^T = \sum_{i=1}^{r} U_i Z_i^T \quad\text{with } Z_i^T = U_i^T A $$

belongs to $\mathcal{M}_r$ and minimizes the squared distance $\|A - \tilde{A}\|_F^2 = \mathrm{tr}(A-\tilde{A})(A-\tilde{A})^T$ over $\tilde{A} \in \mathcal{M}_r$, see [14] or [11]. The projection $\pi^{(r)}(A)$ of $A$ onto $\mathcal{M}_r$ is uniquely defined if and only if $\sigma_r > \sigma_{r+1}$. The squared distance between the matrix $A$ and $\mathcal{M}_r$ is

$$ \rho^2(A, \mathcal{M}_r) = \min_{\tilde{A}\in\mathcal{M}_r} \rho^2(A, \tilde{A}) = \rho^2(A, \pi^{(r)}(A)) = \|A - \pi^{(r)}(A)\|_F^2 = \sum_{i=r+1}^{L} \sigma_i^2(A). $$

2.4.2. The weighted semi-norm (2)

In the more general case of the semi-norm (2), one needs to compute the SVD of the matrix $B = AW^{1/2}$ rather than that of $A$. Note that for a non-singular weight matrix $W^{1/2}$ the considered problem is a special case of Theorem 3.18 in [16].


Let $\sigma_i' = \sigma_i(B)$ be the ordered singular values of $B$ and let $\Sigma_0' = \mathrm{diag}(\sigma_1', \sigma_2', \ldots, \sigma_L')$ and $\Sigma' = \mathrm{diag}(\sigma_1', \sigma_2', \ldots, \sigma_r', 0, \ldots, 0)$. Then the SVD of $B$ is $B = U'\Sigma_0'(V')^T$ and the matrix

$$ \pi^{(r)}(B) = U'\Sigma'(V')^T = \sum_{i=1}^{r} U_i'(Z_i')^T \quad\text{with } (Z_i')^T = (U_i')^T B $$

belongs to $\mathcal{M}_r$ and minimizes the squared distance $\|B - \tilde{B}\|_F^2 = \mathrm{tr}(B-\tilde{B})(B-\tilde{B})^T$ over $\tilde{B} \in \mathcal{M}_r$.

Define

$$ \pi_W^{(r)}(A) = \pi^{(r)}(B)\,W^{-1/2} = \sum_{i=1}^{r} U_i'(T_i')^T \in \mathcal{M}_r \quad\text{with } (T_i')^T = (U_i')^T B W^{-1/2}. \qquad (10) $$

Then

$$ \|A - \pi_W^{(r)}(A)\|_W^2 = \mathrm{tr}\big(A - \pi_W^{(r)}(A)\big) W \big(A - \pi_W^{(r)}(A)\big)^T = \mathrm{tr}\big(AW^{1/2} - \pi_W^{(r)}(A)W^{1/2}\big)\big(AW^{1/2} - \pi_W^{(r)}(A)W^{1/2}\big)^T = \mathrm{tr}\big(B - \pi^{(r)}(B)\big)\big(B - \pi^{(r)}(B)\big)^T = \|B - \pi^{(r)}(B)\|_F^2. $$

Note that this equality holds even for singular $W$. Note also that one can show $(A - \pi_W^{(r)}(A)) = (B - \pi^{(r)}(B))(W^{1/2})^\dagger$, where $(\cdot)^\dagger$ denotes the pseudoinverse. This identity may help circumvent some of the problems described at the end of Section 3.2.

The squared weighted distance between the matrix $A$ and $\mathcal{M}_r$ is

$$ \rho^2(A, \mathcal{M}_r) = \min_{\tilde{A}\in\mathcal{M}_r} \|A - \tilde{A}\|_W^2 = \|A - \pi_W^{(r)}(A)\|_W^2 = \|B - \pi^{(r)}(B)\|_F^2 = \sum_{i=r+1}^{L} \sigma_i^2(B). $$

3. Distances defining the objective function f in (1)

In accordance with (2) the objective function f in (1) is

$$ f(X) = \mathrm{tr}(X - X_\ast) W (X - X_\ast)^T, \qquad (11) $$

where $W$ is a diagonal matrix of weights, $W = \mathrm{diag}(w_1, \ldots, w_K)$. In this section, we discuss the choice of the matrix $W$ in (11) by making associations between the squared distances (7) and (11).

Note that $Y_\ast$ in (7) is in one-to-one correspondence with $X_\ast = \mathcal{H}(Y_\ast)$ in (11), and $Y(\theta)$ in (7) defines a matrix $X = \mathcal{H}(Y(\theta)) \in \mathcal{A}$ provided that the values of $q$ in (7) and $r$ in (1) are in agreement, as discussed in Section 2.1. Given this, the main difference between the optimization problem (1) with the objective function (11) and problem (5) is that in (5) all the constraints (Hankel structure and rank of the matrices) are incorporated into the objective function.

3.1. Frobenius norm

The Frobenius norm (9) is the most commonly used norm for defining the distance in (1); see for example [11,15,16]. This frequent occurrence of the Frobenius norm can be explained by the following two reasons:

(i) The Frobenius norm is very natural in non-structured low-rank approximation (when $\mathcal{A} = \mathcal{M}_r$), and structured low-rank approximation problems are often considered simply as extensions of the non-structured approximation problems;

(ii) The very important operation of projecting a matrix onto the set of matrices of a given rank (the set $\mathcal{M}_r$) is simplest when the chosen norm is the Frobenius norm, see Section 2.4.

Consideration of the Frobenius norm in (1) corresponds to defining the objective function (7) with $S = T$, where $T = \mathrm{diag}(t_1, \ldots, t_N)$ is the diagonal matrix with the elements $t_1, \ldots, t_N$ defined in (8). This does not look natural if we view the HSLRA problem as a problem of time series analysis or signal processing.

3.2. Uniform weights for each observation

If the uncertainty for all observations $y_{j\ast}$ is approximately the same, then from the viewpoint of a time series analyst the most natural definition of the HSLRA objective function would be (5) with $s_j = 1$ for all $j$; that is, (7) with $S = I_N$, the $N\times N$ identity matrix. An important question thus arises: 'can we find a matrix $W$ such that the semi-norm in (11) coincides with the norm in (7) with $S = I_N$ (provided $r$ in (1) and $q$ in (5) match)?' The answer is given in the following lemma.

Lemma 1. Let $Y \in \mathbb{R}^N$ and $X = \mathcal{H}(Y) \in \mathbb{R}^{L\times K}$. If $h = N/L$ is an integer, then $\mathrm{tr}\, X W^{(0)} X^T = Y^T Y$, where $W^{(0)}$ is a diagonal matrix with diagonal elements

$$ w_{kk}^{(0)} = \begin{cases} 1 & \text{if } k = jL+1 \text{ for some } j = 0, \ldots, h-1,\\ 0 & \text{otherwise.} \end{cases} $$

Proof. Assume that $h = N/L$ is an integer, so that $N = hL$ and $K = N - L + 1 = (h-1)L + 1$. By definition, the elements of the $L\times K$ matrix $X$ are $x_{lk} = y_{l+k-1}$. This gives

$$ \mathrm{tr}\, X W^{(0)} X^T = \sum_{l=1}^{L}\sum_{k=1}^{K}\sum_{k'=1}^{K} x_{lk}\, w_{kk'}^{(0)}\, x_{lk'} = \sum_{l=1}^{L}\sum_{k=1}^{K} w_{kk}^{(0)} x_{lk}^2 = \sum_{l=1}^{L}\sum_{j=0}^{h-1} x_{l,jL+1}^2 = \sum_{l=1}^{L}\sum_{j=0}^{h-1} y_{l+jL}^2 = \sum_{n=1}^{N} y_n^2. \qquad \square $$

Thus, the answer to the question raised above is positive provided that $N$ is divisible by $L$. Very often this is not a serious restriction, as typically there is some freedom in the choice of $L$ and, if needed, the first few values of $Y$ can be ignored (to alter the value of $N$).
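Lemma 1 is easy to check numerically; a small Python sketch (variable names ours):

```python
import numpy as np

L, h = 3, 4
N = L * h                                   # N divisible by L
K = N - L + 1
Y = np.random.randn(N)
X = np.array([[Y[l + k] for k in range(K)] for l in range(L)])
# W^(0): ones at the (1-based) positions k = jL + 1, zeros elsewhere
w0 = np.array([1.0 if k % L == 0 else 0.0 for k in range(K)])
assert np.isclose(np.trace(X @ np.diag(w0) @ X.T), Y @ Y)
```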

Note that the matrix $W^{(0)}$ has zeros on the diagonal and therefore (10) cannot be used as such. To be able to use the technique of Section 2.4.2 we either need to replace 0 with a small number or use the methodology outlined in Section 3.5.

3.3. Arbitrary observation weights

Consider now the general case of the HSLRA objective function (5) with arbitrary $s_j > 0$. Now, the answer to the question 'can we find a matrix $W$ such that the semi-norm in (11) coincides with the norm in (7) with an arbitrary diagonal matrix $S$?' is generally negative. The reason is as follows. First, it is easy to see that allowing the matrix $W$ to be non-diagonal will not increase our freedom in improving the quality of approximation of the sum $Y^T S Y$ by $\mathrm{tr}\, X W X^T$, where $Y \in \mathbb{R}^N$ is arbitrary and $X = \mathcal{H}(Y)$. Second, for a diagonal $W$, $\mathrm{tr}\, X W X^T$ is a weighted sum of squares of the $y_j$'s, but the number of degrees of freedom (that is, diagonal elements in $W$) is $K$, which is less than the number of diagonal elements in $S$.

The next question is: 'how can we approximate the sum of squares $Y^T S Y$ by $\mathrm{tr}\, X W X^T$ for arbitrary $Y$ in the best way?' To answer this question, we shall use Least Squares (LS) approximation.

Let $S \in \mathbb{R}^{N\times N}$ and $W \in \mathbb{R}^{K\times K}$ be diagonal matrices with diagonals determined by the vectors $S = (s_1, \ldots, s_N)^T$ and $W = (w_1, \ldots, w_K)^T$, respectively. The vector $S$ is given and chosen, for example, as discussed in Section 2.1. The vector $W$ is unknown and has to be determined by trying to match the sum of squares $SS_1(Y) = \mathrm{tr}\, X W X^T$ to $SS_2(Y) = Y^T S Y$ for all $Y \in \mathbb{R}^N$.

Lemma 2. Let $Y \in \mathbb{R}^N$, $X = \mathcal{H}(Y)$, and let $W$ be a diagonal matrix with diagonal $W = (w_1, \ldots, w_K)^T$. Then we have $SS_1(Y) = \sum_{n=1}^{N} \tilde{s}_n y_n^2$, where $\tilde{s}_n = \sum_{k=1}^{K} c_{nk} w_k$ and the coefficients $c_{nk}$ are given by

$$ c_{nk} = \begin{cases} 1 & \text{if } 1 \le n \le K \text{ and } \max\{1, n-L+1\} \le k \le n,\\ 1 & \text{if } K+1 \le n \le N \text{ and } n-L+1 \le k \le K,\\ 0 & \text{otherwise.} \end{cases} $$

Proof. Using the substitution of indices $l \to n = l + k - 1$ (so that $1 \le n \le N$) we obtain

$$ SS_1(Y) = \mathrm{tr}\, X W X^T = \sum_{l=1}^{L}\sum_{k=1}^{K} w_k x_{lk}^2 = \sum_{l=1}^{L}\sum_{k=1}^{K} w_k y_{l+k-1}^2 = \sum_{n=1}^{N} y_n^2 \sum_{\substack{1\le k\le K\\ 1\le n-k+1\le L}} w_k = \sum_{n=1}^{N} y_n^2 \sum_{k=1}^{K} c_{nk} w_k = \sum_{n=1}^{N} \tilde{s}_n y_n^2, $$

where the coefficients $c_{nk}$ are as above. $\square$

Set $\tilde{S} = (\tilde{s}_1, \ldots, \tilde{s}_N)^T \in \mathbb{R}^N$ and $C = (c_{nk}) \in \mathbb{R}^{N\times K}$, where $c_{nk}$ are the coefficients defined above. Then $\tilde{S} = CW$ and thus we have the approximation problem $S \simeq CW$. We write this problem in the form of a linear regression $S = CW + \epsilon$. Note that $C$ has the following banded structure:

$$ C = \begin{pmatrix} 1 & 0 & \cdots & 0\\ \vdots & \ddots & \ddots & \vdots\\ 1 & & \ddots & 0\\ 0 & \ddots & & 1\\ \vdots & \ddots & \ddots & \vdots\\ 0 & \cdots & 0 & 1 \end{pmatrix}. $$


Here $S$ plays the role of the vector of observations and the vector $W$ is the vector of unknown parameters. We use a general least squares estimator (LSE) for $W$:

$$ \hat{W} = \left( C^T R^{-1} C \right)^{-1} C^T R^{-1} S, \qquad (12) $$

where $R$ is an arbitrary positive definite $N\times N$ matrix. The LSE of the vector $S$ is $\hat{S} = C\hat{W}$. The matrix $R$ in (12) plays the role of a covariance matrix determining the precision of the estimates of the elements of the vector $S$. In accordance with the results of Section 3.2, if $R = I_N$, the vector $S$ is the vector of 1's and $N$ is divisible by $L$, then we achieve the equality $\hat{S} = S$.

The vectors $\hat{W} = (\hat{w}_1, \ldots, \hat{w}_K)^T$ and $\hat{S} = (\hat{s}_1, \ldots, \hat{s}_N)^T$ define the norms (or semi-norms) we use instead of the norm we would have liked to use: for any $Y \in \mathbb{R}^N$,

$$ Y^T S Y \simeq Y^T \hat{S} Y = \mathrm{tr}\, X \hat{W} X^T, $$

where $X = \mathcal{H}(Y)$, $\hat{S} = \mathrm{diag}(\hat{s}_1, \ldots, \hat{s}_N)$ and $\hat{W} = \mathrm{diag}(\hat{w}_1, \ldots, \hat{w}_K)$.
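The construction of $C$ (Lemma 2) and of the estimator (12) is straightforward to implement; a sketch for diagonal $R$ (helper names ours, with $R$ defaulting to the identity):

```python
import numpy as np

def anti_diagonal_design(N, L):
    """The N x K matrix C of Lemma 2: c_{nk} = 1 exactly when the weight w_k multiplies y_n^2."""
    K = N - L + 1
    C = np.zeros((N, K))
    for n in range(N):
        for k in range(K):
            if 0 <= n - k <= L - 1:         # 1 <= n - k + 1 <= L in 1-based indexing
                C[n, k] = 1.0
    return C

def matrix_weights(s, L, R=None):
    """LSE (12): diagonal matrix weights W_hat approximating the observation weights s,
    together with the fitted vector S_hat = C W_hat."""
    s = np.asarray(s, dtype=float)
    N = len(s)
    C = anti_diagonal_design(N, L)
    R_inv = np.eye(N) if R is None else np.linalg.inv(R)
    W_hat = np.linalg.solve(C.T @ R_inv @ C, C.T @ R_inv @ s)
    return W_hat, C @ W_hat
```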

3.4. Dealing with missing values

We now explain how one can formulate the HSLRA problem when there are missing observations among the $y_{j\ast}$. Denote by $J \subset \{1, 2, \ldots, N\}$ the set of indices $j$ such that the observations $y_{j\ast}$ are missing.

Let us return to the discussion in Section 2.1 just after formula (5). In that discussion, we mentioned that a natural choice of $s_j$ is $s_j = 1/\sigma_j^2$, where $\sigma_j^2$ is the variance of the observation error of $y_j$. If $j \in J$ then we can use any number as $y_{j\ast}$ but assume $\sigma_j^2 = \infty$ for this $j$. This would yield $s_j = 0$ for all $j \in J$. This is, however, not sufficient, as we also want to achieve $\hat{s}_j = 0$ for all $j \in J$. Note that in formula (12) we may need to multiply large numbers by $s_j$ and do not want the results to vanish. We therefore assume that the $s_j$ are very small (infinitesimal) numbers for all $j \in J$ rather than zeros.

Assume $R$ is diagonal: $R = \mathrm{diag}(R_{11}, \ldots, R_{NN})$. As we want to achieve $\hat{s}_j = 0$ for all $j \in J$, we need to choose $R_{jj} = 0$ for all $j \in J$, implying $R_{jj}^{-1} = \infty$ for $j \in J$. This would guarantee $\hat{s}_j = 0$ for all $j \in J$. The main problem here is the organization of the calculations, as in the process of computing the estimator (12) we need to multiply and divide by infinity many times. This problem is discussed in the next section (Section 3.5). Note that it is also possible to achieve $\hat{s}_j = 0$ for all $j \in J$ by using the method known as constrained least squares [38].

Note that in the special case when the missing observations are at the end of the series, the problem of estimating the missing observations is known as the problem of forecasting. To forecast using low-rank approximations we can proceed in two ways: (i) solve the HSLRA problem, consequently build the model (6) using the available observations, and use the model (6) for forecasting, or (ii) solve the HSLRA problem with a larger value of $N$, treating the observations at the end of the series as missing. An important question arises of whether the two HSLRA models and the associated models (6) are the same. The answer to this question is positive provided that we deal with infinity as explained in Section 3.5 (that is, avoiding any approximate computations); see Example 1′ for an illustration of this phenomenon.

3.5. Computations involving infinite and infinitesimal numbers

As explained in Section 3.4, if there are missing values in the series $Y_\ast$, then to compute the weighting matrix $W$ defining the objective function in the HSLRA problem (1) we need to perform many operations with infinity (including division by 0). The usual way would be to replace infinity with a very large number and 0 with a very small number, see the example below. This creates difficulties as we do not know in advance how large or small the numbers replacing infinity and zero should be, and we thus need to try several numbers to be sure that we have reached an acceptable accuracy. Also, in this way we would never be able to obtain an exact answer such as the one derived in Section 3.2.

There is, however, a novel methodology for dealing with numerical infinity; that is, with infinite and infinitesimal quantities. This methodology has been developed by Sergeyev and published in a small book [17] and a series of papers; see, for example, [18–20]. The place of Sergeyev's methodology for dealing with infinitesimals and infinities among other mathematical approaches is discussed in a historical survey [21].

Following Sergeyev, we denote a numerical infinity by the grossone symbol ① and a numerical infinitesimal quantity by ①$^{-1}$. The numeral ① satisfies certain axioms; see the above-cited papers of Sergeyev and [22] for a discussion of Sergeyev's axioms and their modifications.

If the 'Infinity Computer' of Sergeyev existed, then we would be able to perform all operations with infinite and infinitesimal numbers numerically (rather than symbolically, which is computationally very demanding).

Let us consider a simple example of constructing the norm (2) (to be used for defining the objective function (11)) in the case of missing data.

Example 1. Assume $N = 10$ and $L = 3$, so that $K = 8$. Assume we have 2 missing observations, namely $y_{3\ast}$ and $y_{5\ast}$ (the results are very similar with respect to the location of the missing values and even with respect to the number of missing values). The vector $S$ defining the 'ideal' norm (7) that we have to approximate is $S_{\mathrm{ideal}} = (1,1,0,1,0,1,1,1,1,1)^T$. In view of the discussion in Section 3.4, we set $S = (1,1,a,1,a,1,1,1,1,1)^T$, where $a > 0$ is a very small (infinitesimal) number.


According to the recommendation of Section 3.4 we may choose the matrix $R = \mathrm{diag}(R_{11}, \ldots, R_{NN})$ so that $R_{33} = R_{55} = 0$ and all other diagonal elements $R_{jj} = 1$ for $j \ne 3, 5$. Since we need to invert $R$, we cannot set $R_{33} = R_{55} = 0$ and therefore we set $R_{33} = R_{55} = b$, where $b > 0$ is a very small positive number (it may, and generally should, differ from $a$). Straightforward calculations using (12) give

$$ \hat{W} = \frac{1}{b+10}\,\big[\, c_1,\; 6-2a,\; b+11a-12,\; 12-6a,\; b+5a,\; 0,\; 6-2a,\; c_1 \,\big]^T, $$
$$ \hat{S} = C\hat{W} = \frac{1}{b+10}\,\big[\, c_1,\; c_2,\; 2b+10a,\; c_1,\; 2b+10a,\; c_2,\; c_1,\; c_2,\; c_2,\; c_1 \,\big]^T, $$

where $c_1 = 6+2a+b$ and $c_2 = 12-a+b$. An important observation here is that $\hat{S}[3]$ and $\hat{S}[5]$ tend to 0 as $a \to 0$ and $b \to 0$. In this particular case, we may set $a = b = ①^{-1}$. Then we have

$$ \hat{S} = \frac{3}{5}\,[1,2,0,1,0,2,1,2,2,1]^T + \frac{3}{25\,①}\,[2,-1,10,2,10,-1,2,-1,-1,2]^T + O\!\left(\frac{1}{①^2}\right), \qquad (13) $$

where $O(①^{-2})$ indicates terms of order $①^{-2}$ or less. In addition to the limiting vector of weights, the expansion (13) gives the exact rate of convergence to this limiting vector (as $a$ and $b$ tend to 0 at the same speed).
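Example 1 can be reproduced approximately by replacing the infinitesimals $a$ and $b$ with small positive numbers and reusing matrix_weights from the sketch in Section 3.3, a rough floating-point surrogate for the exact grossone arithmetic:

```python
import numpy as np

a = b = 1e-8
N, L = 10, 3
s = np.ones(N); s[[2, 4]] = a               # S = (1, 1, a, 1, a, 1, 1, 1, 1, 1)^T
R = np.eye(N); R[2, 2] = R[4, 4] = b
W_hat, S_hat = matrix_weights(s, L, R=R)
print(np.round(S_hat, 6))                   # the 3rd and 5th fitted weights are close to 0
```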

Example 1′. As in Example 1, assume $N = 10$ and $L = 3$, but assume that we place the two missing observations at the end of the series; that is, we assume that $y_{9\ast}$ and $y_{10\ast}$ are missing. Similarly to (13) we obtain $\hat{S} = S_{\mathrm{as}} + O(①^{-1})$, where $S_{\mathrm{as}} = \frac{3}{7}\,[2,2,3,2,2,3,2,2,0,0]^T$. If we set $N = 8$, $L = 3$, $R = I_8$ and no observations are missing, then we obtain $\hat{S} = \frac{3}{7}\,[2,2,3,2,2,3,2,2]^T$, with no infinitesimals involved. This vector gives the first eight components of $S_{\mathrm{as}} \in \mathbb{R}^{10}$ above. This illustrates the statement made at the end of Section 3.4.

4. Orthogonality condition and associated algorithms

4.1. Orthogonality condition

In this section, we consider the so-called orthogonality condition which any locally optimal solution to (1) should satisfy.

Theorem 1. Let $Y = \mathcal{H}^{-1}(X)$ and $Y_\ast = \mathcal{H}^{-1}(X_\ast)$ be non-zero vectors in $\mathbb{R}^N$, let $W \in \mathbb{R}^{K\times K}$ be a diagonal weight matrix (with diagonal given by a vector $W$) defining the norm (2), and let $S = CW \in \mathbb{R}^N$ be the associated vector (computed as $\tilde{S}$ in Lemma 2) defining the squared norm $Y^T S Y = \sum_{n=1}^{N} s_n y_n^2 = \mathrm{tr}\, X W X^T$, where $S \in \mathbb{R}^{N\times N}$ is a diagonal matrix with the vector $S$ on the diagonal. Set

$$ b_\ast = Y^T S Y_\ast / Y^T S Y = \mathrm{tr}\, X W X_\ast^T / \mathrm{tr}\, X W X^T. \qquad (14) $$

Then we have

$$ (b_\ast Y - Y_\ast)^T S Y = \mathrm{tr}\,(b_\ast X - X_\ast) W X^T = 0. \qquad (15) $$

Proof. From the statement of the theorem it is straightforward to show that $\mathrm{tr}\,(b_\ast X - X_\ast)WX^T = (b_\ast Y - Y_\ast)^T S Y$. We have

$$ (b_\ast Y - Y_\ast)^T S Y = \left( \frac{Y^T S Y_\ast}{Y^T S Y}\, Y - Y_\ast \right)^T S Y = \frac{1}{Y^T S Y}\Big[ (Y^T S Y_\ast)\, Y - (Y^T S Y)\, Y_\ast \Big]^T S Y = \frac{1}{Y^T S Y}\Big[ (Y^T S Y_\ast)\, Y^T S Y - (Y^T S Y)\, Y_\ast^T S Y \Big] = 0. \qquad \square $$

In the particular case $W = I_K$ we have $S = T$, where $T = (t_1, \ldots, t_N)^T$ is the vector with elements given by (8). In this particular case, the equality

$$ (\tilde{Y} - Y_\ast)^T S \tilde{Y} = 0 \qquad (16) $$

is called the orthogonality condition for $\tilde{Y}$, see [15]. It was proved in [15] that if $\tilde{Y}$ is a local minimum of $SS(Y) = (Y - Y_\ast)^T S (Y - Y_\ast)$, then $\tilde{Y}$ must satisfy the orthogonality condition. By analogy, we shall call (16) the orthogonality condition even if $S \ne T$. In Theorem 1, we have shown that we can achieve orthogonality simply by multiplying all elements of either the Hankel matrix, or its associated vector, by a suitable constant. Note that the space $\mathcal{A}$ is closed with respect to multiplication by a non-zero constant.
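The correction of Theorem 1 is a single rescaling of the current approximation; a minimal sketch (diagonal $W$, function name ours):

```python
import numpy as np

def orthogonality_correction(X, X_star, w=None):
    """Multiply X by b_* = tr(X W X_*^T) / tr(X W X^T) from (14), so that the rescaled
    matrix satisfies the orthogonality condition (15). W = diag(w); identity by default."""
    W = np.eye(X.shape[1]) if w is None else np.diag(np.asarray(w, dtype=float))
    b = np.trace(X @ W @ X_star.T) / np.trace(X @ W @ X.T)
    return b * X
```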


4.2. Associated algorithms with $W = I_K$

In this section we consider two existing algorithms described in the literature for $W = I_K$. Algorithms for general $W$ follow similarly.

4.2.1. De Moor's method

Following De Moor [15], introduce the Lagrangian function as follows. Let $c_1 \in \mathbb{R}^{L\times r}$, $c_2 \in \mathbb{R}$, and $R \in \mathbb{R}^{K\times(K-r)}$. The Lagrangian function corresponding to HSLRA with the unweighted distance (9) is given by

$$ F_1(X, R, c_1, c_2) = \|X - X_\ast\|_F^2 + c_1^T X R + c_2 \big(R^T R - I_{K-r}\big). \qquad (17) $$

The constraint $\mathrm{rank}(X) = r$ is expressed through its kernel representation $XR = 0$, where $R = (R_0, -I_{K-r})^T$ for some matrix $R_0 \in \mathbb{R}^{r\times(K-r)}$. The constraint $R^T R = I_{K-r}$ is introduced to ensure the identifiability of $R$. By setting all derivatives of the Lagrangian to zero, De Moor derived the condition (14) for the solution of the HSLRA problem. Moreover, by manipulating the Lagrangian, De Moor represented the solution of (17) as a non-linear generalized SVD, and subsequently developed an algorithm for numerically approximating this non-linear generalized SVD problem, which approximates the solution of the HSLRA problem; convergence to the optimal solution is, however, not guaranteed. For full details we refer to De Moor [15]. For the comparative examples discussed in Section 6 we shall refer to this method as the 'DM' method.

4.2.2. I. Markovsky's method

The body of work by Markovsky and some of his co-authors (see [7,8,16]) has concentrated on developing computationally efficient methods for approximating the solution of (1), where the structure is not necessarily Hankel. In the case of HSLRA, Markovsky defines the problem (1) through the minimization of the objective function

$$ F_2(X, R) = \|X - X_\ast\|_F^2 \quad\text{such that}\quad XR = 0, \;\; R^T R = I_{K-r}, \qquad (18) $$

using again the unweighted distance (9); $R$ is as in (17). This objective function has a direct analogy with (17). The objective function (18) is non-convex, and in [8] the function (18) is locally optimized from an initial starting matrix (or vector). The latest software implementation for optimizing (18) is given and described in [23]. In this software, by default, the initial starting point is the unstructured rank $r$ approximation of $X_\ast$, which is computed by $\pi^{(r)}(X_\ast)$. Consequently there is no guarantee that a global minimum of (18) is found. We shall use this software for the comparative examples discussed in Section 6; note that this software can only be used when $r = L-1$. We will refer to the minimization of (18), via the software described in [23], as the 'IM method'.

Some other methods are known; see [10,24,25]. As far as the authors are aware, none of the methods known in the literature on HSLRA have the theoretical property of convergence, and hence the construction of reliable methods for solving the HSLRA problem remains an open problem. Moreover, most of the known algorithms require the condition $r = L - 1$. Except for the IM method, we failed to find reliable software implementing these other methods.

5. Algorithms based on the use of alternating projections

In this section we consider algorithms for solving the HSLRA problem represented as optimization problems using alternating projections between the spaces $\mathcal{H}$ and $\mathcal{M}_r$. We restrict our attention to the distance function associated with the matrix Frobenius norm (9); that is, we take $W = I$ in (11).

5.1. Classical algorithms and their modifications

5.1.1. Alternating projections (AP)

The algorithm (19) below is the direct implementation of alternating projections; for brevity we will refer to this algorithm as AP:

$$ X_0 = X_\ast, \qquad X_{n+1} = \pi_{\mathcal{H}}\!\big(\pi^{(r)}(X_n)\big) \quad\text{for } n = 0, 1, \ldots \qquad (19) $$

These projections have also been studied in [26] and are sometimes known as Cadzow iterations [27]. Here we simply alternate projections onto the space $\mathcal{M}_r$ with projections onto the space $\mathcal{H}$. In this form of alternating projections, we have $X_n \in \mathcal{H}$ for all $n = 0, 1, \ldots$.

General information about algorithms that use alternating projections (where the sets are not necessarily convex) is provided in Andersson and Carlsson [28] and Andersson et al. [29]. AP can be considered as a particular instance of the Alternating Least Squares method used in signal processing, see Section 3.3 in [30]. Note also that one iteration of AP for HSLRA corresponds to the basic version of the technique of time series analysis known as singular spectrum analysis (SSA), see [31]; for further details regarding the link between AP and SSA see, for example, Gillard [27].
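A minimal sketch of the AP iterations (19), reusing project_to_rank and project_to_hankel from the earlier sketches (the function name and fixed iteration count are ours):

```python
def alternating_projections(X_star, r, n_iter=200):
    """AP (19): alternate the rank-r projection and the Hankel projection, starting from X_*."""
    X = X_star.copy()
    for _ in range(n_iter):
        X = project_to_hankel(project_to_rank(X, r))
    return X
```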


5.1.2. Orthogonalized Alternating Projections (OAP)

The following algorithm is a slight improvement over AP (19):

$$ X_0 = X_\ast, \qquad X_{n+1} = \frac{\mathrm{tr}\, X_n X_\ast^T}{\mathrm{tr}\, X_n X_n^T}\, \pi_{\mathcal{H}}\!\big(\pi^{(r)}(X_n)\big) \quad\text{for } n = 0, 1, \ldots \qquad (20) $$

The algorithm (20) uses the coefficient (14) to improve the approximation at each iteration. We shall refer to the algorithm (20) as 'Orthogonalized Alternating Projections' (abbreviated OAP in the discussion and examples below).
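OAP differs from AP only by the rescaling of Section 4.1; a sketch in which the coefficient is computed from the projected matrix, consistent with the APBR update (22) below (helper names ours):

```python
def orthogonalized_alternating_projections(X_star, r, n_iter=200):
    """OAP (20): AP step followed by the orthogonality correction of Theorem 1."""
    X = X_star.copy()
    for _ in range(n_iter):
        Z = project_to_hankel(project_to_rank(X, r))
        X = orthogonality_correction(Z, X_star)
    return X
```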

5.1.3. Discussion on the convergence of AP and OAP

Despite AP often appearing to be myopic and too greedy, aiming only at minimizing the distance $\rho^2(X, \mathcal{M}_r)$, it is very popular in practice. The popularity of AP is explained by the simplicity of the algorithm and by the fact that convergence to the space $\mathcal{A}$ is guaranteed, see [32]. However, as seen in the examples provided in Section 6, AP often converges to a matrix which is far away from the set of optimal solutions $\mathbb{X}^\ast$.

As shown in [28, Th. 6.1], AP converges linearly; that is, there exist constants $c < 1$ and $A > 0$ such that $\rho^2(X_\infty, X_n) < A c^n$ for all $n$, where $X_\infty$ is some matrix in $\mathcal{A}$. Moreover, it is easy to prove the monotonicity of the AP iterations. As derived by Chu et al. [32], we have

$$ \|X_{n+1} - \pi^{(r)}(X_{n+1})\|_F^2 \le \|X_{n+1} - \pi^{(r)}(X_n)\|_F^2 \le \|X_n - \pi^{(r)}(X_n)\|_F^2. $$

Similarly to AP (19), the algorithm OAP (20) converges to some matrix in $\mathcal{A}$. Numerical results show that the resulting approximation is never worse than the approximation obtained by (19) (usually it is slightly better), and it also converges faster to the set $\mathcal{A}$. The examples of Section 6 show that typically both AP and OAP do not converge to the set of optimal solutions $\mathbb{X}^\ast$.

5.2. Alternating Projections with Backtracking and Randomization

In this section, we describe a family of algorithms which can be run as a random multistart-type algorithm, as a multistage algorithm and also as an evolutionary method. The main steps of this algorithm are summarized by its title, 'Alternating Projections with Backtracking and Randomization', and we abbreviate this algorithm as APBR. Here we describe two versions of this algorithm, Multistart APBR and APBR with selection. APBR with selection significantly reduces the number of computations by terminating non-prospective trajectories at early stages.

The underpinning idea for the family of APBR algorithms was suggested by the authors in [12], where the potential of Multistart APBR was demonstrated on a number of examples.

5.2.1. Multistart APBR

The multistart version of APBR is described as follows. Let $U$ denote a realization of a random number with uniform distribution on $[0,1]$ and let $\tilde{X}$ denote a random Hankel matrix which corresponds to a realization of a white noise Gaussian process $\tilde{Y} = (\xi_1, \ldots, \xi_N)$ with $\xi_i$, $i = 1, \ldots, N$, independent Gaussian random variables with mean 0 and variance $s^2 \ge 0$.

In Multistart APBR, we run $M$ independent trajectories in the space $\mathcal{H}$ starting at random Hankel matrices

$$ X_{0,j} = (1 - s_0) X_\ast + s_0 \tilde{X}, \qquad (21) $$

with some $s_0$ ($0 \le s_0 \le 1$), and use the updating formula

$$ X_{n+1,j} = \Big( \mathrm{tr}\, Z_{n,j} X_\ast^T \big/ \mathrm{tr}\, Z_{n,j} Z_{n,j}^T \Big)\, Z_{n,j}, \qquad (22) $$

where $j = 1, \ldots, M$,

$$ Z_{n,j} = (1 - \delta_n)\, \pi_{\mathcal{H}}\!\big(\pi^{(r)}(X_{n,j})\big) + \delta_n X_\ast + r_n \tilde{X} \qquad (23) $$

and

$$ \begin{cases} \delta_n = U/(n+1)^p, \;\; r_n = c/(n+1)^q, & \text{if } \rho^2(X_{n,j}, \mathcal{M}_r) \ge \epsilon,\\ \delta_n = 0, \;\; r_n = 0, & \text{otherwise.} \end{cases} \qquad (24) $$

Each trajectory is either run until convergence or for a pre-specified number of iterations. $U$ can be either random or simply set to 1, $c \in \{0, 1\}$, and the positive numbers $p$, $q$ and $\epsilon$ can be chosen arbitrarily; see Section 6 concerning the choice of $U$, $c$, $p$ and $q$. A MATLAB implementation of this version of APBR, developed by the authors, is available at [33].

If $s_0 = \delta_n = r_n = 0$ then the iterations in (22) coincide with the iterations of OAP (20). If $s_0 > 0$ then the $j$th trajectory of the algorithm starts at a random matrix in a neighborhood of $X_\ast$ (the width of this neighborhood is controlled by the parameter $s_0$). If $r_n > 0$ then there is a 'random mutation' at the $n$th iteration (22). When $\delta_n > 0$, the current approximation 'backtracks' towards $X_\ast$, provided that the backtracking does not worsen the distance $\rho^2(X_{n,j}, X_\ast)$. If $\rho^2(X_{n,j}, \mathcal{M}_r) < \epsilon$, we set


$\delta_n = 0$ and $r_n = 0$. That is, in the final stage of any trajectory of APBR we perform OAP iterations (20) to achieve faster convergence to $\mathcal{A}$.

Note that in APBR the initial value of $\rho^2(X_{0,j}, X_\ast)$ may be large but the resulting $j$th trajectory may nevertheless be very good, see Fig. 3(b).
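A compact Python sketch of Multistart APBR (21)–(24), reusing the helpers above. It re-draws $U$ at every iteration, keeps $\tilde{X}$ fixed within a trajectory, switches $\delta_n$ and $r_n$ off after iteration $P$ (as in the variant used in Section 6) and omits the acceptance check on the backtracking step; parameter defaults follow Section 6.1 and all names are ours.

```python
import numpy as np

def multistart_apbr(X_star, r, M=1000, n_iter=600, P=500, s0=0.25, s=1.0,
                    c=1.0, p=0.5, q=1.5, eps=1e-8, seed=0):
    """Multistart APBR (21)-(24): M randomized trajectories of backtracked, perturbed OAP steps."""
    rng = np.random.default_rng(seed)
    L, K = X_star.shape
    N = L + K - 1
    best, best_f = None, np.inf
    for _ in range(M):
        X_tilde = hankel_from_vector(s * rng.standard_normal(N), L)   # random Hankel matrix
        X = (1 - s0) * X_star + s0 * X_tilde                          # (21)
        for n in range(n_iter):
            far = np.linalg.norm(X - project_to_rank(X, r), 'fro')**2 >= eps
            delta = rng.uniform() / (n + 1)**p if (far and n <= P) else 0.0
            rand  = c / (n + 1)**q             if (far and n <= P) else 0.0
            Z = ((1 - delta) * project_to_hankel(project_to_rank(X, r))
                 + delta * X_star + rand * X_tilde)                    # (23)
            X = (np.trace(Z @ X_star.T) / np.trace(Z @ Z.T)) * Z       # (22)
        f = np.linalg.norm(X - X_star, 'fro')**2
        if f < best_f:
            best, best_f = X, f
    return best, best_f
```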

5.2.2. APBR with selection

Step I (initialization). Run OAP until convergence and record the distance $F^\ast = \rho^2(X_\ast, X_{\mathrm{OAP}})$, where $X_{\mathrm{OAP}}$ is the approximation of $X_\ast$ obtained by OAP (20).

Step II (main iterations). For some pre-specified numbers $M$ and $Q$, we compute the first $Q$ terms of $M$ independently run trajectories in the space $\mathcal{H}$ starting at random Hankel matrices $X_{0,j} = \tilde{X}$, so that we use (21) with $s_0 = 1$, and apply the updating formula (22). Having reached the matrix $X_{Q,j}$ of the $j$th trajectory ($j = 1, \ldots, M$), we test the prospectiveness of this $j$th trajectory. Any trajectory with $\rho^2(X_\ast, X_{Q,j}) \ge F^\ast$ is considered non-prospective and terminated. We then choose the $k$ trajectories corresponding to the $k$ smallest distances $\rho^2(X_\ast, X_{Q,j})$, $j = 1, \ldots, M$, and perform OAP iterations until convergence (here $k$ is a predefined small number). Let $X_{\infty,i}$, $i = 1, \ldots, k$, be the $k$ HSLRA approximations obtained. Update the record value $F^\ast \leftarrow \min_{i=1,\ldots,k}\{F^\ast, \rho^2(X_\ast, X_{\infty,i})\}$. The trajectories which are not run until convergence and not terminated are halted and kept in a buffer. Step II is repeated with the updated record value $F^\ast$.

Step III (buffer check). Consider all trajectories halted at Step II and held in the buffer. First, remove all matrices $X_{Q,j}$ with $\rho^2(X_\ast, X_{Q,j}) \ge F^\ast$ for the updated value of $F^\ast$. We then compare all the halted approximations $X_{Q,j}$ in terms of their distances to $X_\ast$ and to the space $\mathcal{M}_r$. If for some $i \ne j$ we have $\rho^2(X_{Q,i}, X_\ast) < \rho^2(X_{Q,j}, X_\ast)$ and $\rho^2(X_{Q,i}, \mathcal{M}_r) < \rho^2(X_{Q,j}, \mathcal{M}_r)$ then the matrix $X_{Q,i}$ dominates the matrix $X_{Q,j}$ and the $j$th trajectory can be discontinued. All trajectories left after this sieving are run to convergence using OAP.

Remark 1. In Step II, prior to running AP for the $k$ most prospective trajectories, we may 'split' them into further trajectories by adding small random Hankel matrices to the matrices $X_{Q,j}$ from these prospective trajectories.

Remark 2. Using several matrices (with $n \ge Q$) from any trajectory $\{X_{n,j}\}_n$ created by APBR with selection, we can estimate the exact value of the limit $F_{\infty,j} = \lim_{n\to\infty} \rho^2(X_{n,j}, X_\ast)$ in the following way. Consider a Hankel matrix $X_{n,j}$ with some $j$ and $n > Q$. Let $X_{\infty,j}$ be the (unknown) limiting matrix for the trajectory $\{X_{i,j}\}_i$. Assuming that $X_{n,j}$ is close enough to the set $\mathcal{A}$, we have geometrically fast convergence of $X_{i,j}$ to $X_{\infty,j}$ for $i = n, n+1, \ldots$. Indeed, at all iterations with $n > Q$ the APBR with selection only performs alternating projections (20) onto the spaces $\mathcal{H}$ and $\mathcal{M}_r$. Therefore, as follows from the results stated in Section 5.1.3, we have geometrically fast convergence of $\pi^{(r)}(X_{i,j})$ to $X_{\infty,j}$, with the sequence $\{\rho^2(X_{i,j}, X_\ast)\}_i$ increasing and the sequence $\{\rho^2(\pi^{(r)}(X_{i,j}), X_\ast)\}_i$ decreasing. (It is straightforward to check that $\rho^2(X_{i,j}, X_\ast) \le \rho^2(\pi^{(r)}(X_{i,j}), X_\ast)$ for any $i$ and $j$.) Then $F_{\infty,j}$ is the uniquely defined constant $a$ such that the sequence of ratios $[\rho^2(X_{i,j}, X_\ast) - a]/[\rho^2(X_{i+1,j}, X_\ast) - a]$ tends to a constant $c < 1$ as $i \to \infty$. Similarly, $F_{\infty,j}$ is the uniquely defined constant $b$ such that the sequence of ratios $[\rho^2(\pi^{(r)}(X_{i,j}), X_\ast) - b]/[\rho^2(\pi^{(r)}(X_{i+1,j}), X_\ast) - b]$ tends to a constant $c' < 1$ as $i \to \infty$.

Using the estimators $\hat{F}_j$ of $F_{\infty,j}$ we can terminate non-prospective trajectories as soon as $\hat{F}_j \ge F^\ast$; this may happen earlier than the event $\rho^2(X_\ast, X_{Q,j}) \ge F^\ast$ which has been used in Steps II and III for the termination of non-prospective trajectories.

5.3. Convergence of algorithms

Consider the APBR algorithms defined by formulas (21)–(24); these formulas underpin both versions of APBR defined above. These algorithms are typical stochastic global optimization algorithms and hence their theoretical convergence properties can be studied using the common tools readily available in the literature on global random search; see, for example, [34, Section 2.1.3].

The APBR algorithms, considered as optimization algorithms for the objective function (1), benefit from both the globality and the locality of the search. Local convergence to the set $\mathcal{A}$ is a consequence of the fact that OAP iterations always locally converge (linearly) to $\mathcal{A}$, see [28,32]. Global convergence is guaranteed if we assume that $s_0 = 1$ and $M \to \infty$. In this case, the set of starting matrices for APBR becomes everywhere dense. This is a necessary condition of convergence for any global optimization algorithm employing local descent iterations which does not use any properties of the objective function other than its continuity. It also becomes a sufficient condition if the iterations are monotonic, so that the value of the objective function is non-increasing. For the algorithms considered, this is a consequence of the condition used in (22).


6. Examples

In the examples below we shall use the following additional notation:

$X_{\mathrm{AP}} = \mathcal{H}(Y_{\mathrm{AP}})$ — approximation obtained by AP (19)
$X_{\mathrm{OAP}} = \mathcal{H}(Y_{\mathrm{OAP}})$ — approximation obtained by OAP (20)
$X_{\mathrm{IM}} = \mathcal{H}(Y_{\mathrm{IM}})$ — approximation obtained by the software of Markovsky and Usevich [23]
$X_{\mathrm{DM}} = \mathcal{H}(Y_{\mathrm{DM}})$ — approximation obtained by De Moor in [15]
$X_{\mathrm{APBR}} = \mathcal{H}(Y_{\mathrm{APBR}})$ — approximation obtained by Multistage APBR
Min(APBR) — minimal distance to $X_\ast$ obtained by Multistage APBR
Med(APBR) — median distance to $X_\ast$ obtained by Multistage APBR

In all examples we have used Multistage APBR, with $\rho^2$ defined by the Frobenius norm (9), and the number of trajectories was chosen as $M = 1000$. Unless otherwise stated, $U$ is a realization of a random number with uniform distribution on $[0,1]$. For ease of exposition, (24) is defined as

$$ \begin{cases} \delta_n = U/(n+1)^p, \;\; r_n = c/(n+1)^q, & \text{for } n = 0, 1, \ldots, P,\\ \delta_n = 0, \;\; r_n = 0, & \text{for } n > P. \end{cases} $$

6.1. Numerical results for the example of De Moor

In this section we consider the data given by De Moor [15] to demonstrate the sub-optimality of AP (19). De Moor's data and parameter settings are as follows: $Y_\ast = (3,4,2,1,5,6,7,1,2)^T$, $N = 9$, $L = 4$, and $r = 3$. Let $X_\ast = \mathcal{H}(Y_\ast)$. Table 1 contains the Frobenius distances to $X_\ast$ obtained using AP (19), OAP (20) and the IM method, for different values of the parameters $L$ and $r$. Table 1 also contains the result obtained by De Moor. The approximations achieved by AP and by De Moor, $Y_{\mathrm{AP}}$ and $Y_{\mathrm{DM}}$, are provided in [15] (to four decimal places only). One of the best Multistage APBR approximations for De Moor's data is $Y_{\mathrm{APBR}} = (3.451346, 3.533941, 2.002535, 1.487395, 4.039565, 7.078974, 5.995627, 1.720123, 1.613392)^T$. This Multistage APBR solution is slightly different from $Y_{\mathrm{DM}}$, while the values of the Frobenius distances to $X_\ast$ coincide (to the precision provided by De Moor). We have no access to the software implementing the method of De Moor and therefore cannot perform a comparative study involving this method. Markovsky's latest software implementation [23] can only be applied in the case $r = L-1$, unless one reshapes the Hankel matrix. Note that if an $L\times(N-L+1)$ Hankel matrix is of rank $r$ then the reshaped $(r+1)\times(N-r)$ Hankel matrix is also of rank $r$, see for example [39]. Also note that this reshaping approach can only be applied to Hankel matrices.

In this example, the parameters of Multistage APBR are selected to be $M = 1000$, $P = 500$, $c = 1$, $s_0 = 0.25$, $s = 1$, $p = 0.5$ and $q = 1.5$. The total number of iterations was set at 600. OAP (20) was marginally better than AP (19). Multistage APBR yields solutions similar to those obtained by the method of De Moor. For all values of $L$ and $r$, Multistage APBR provides better (or, in one case, similar) solutions than the other methods considered.

Table 2 contains the minimum Frobenius distance to $X_\ast$ using Multistage APBR, varying the parameters $p$ and $q$ in (22). With $s_0 = 0.25$ and $s = 1$ it can be seen that there are many $(p, q)$ parameter pairs which yield similar solutions, with Frobenius distances to $X_\ast$ comparable to that achieved by De Moor's method. Summarizing the numerical results obtained in this example (and similar ones), we can give the following recommendations concerning the choice of the parameters $p$ and $q$ of the Multistage APBR algorithm (22).

Backtracking (regulated by the parameter $p$) is extremely important. It is worth noting that in many examples the use of the random variable $U$ in the formula for $\delta_n$ in (22) often works marginally better than a constant. For the data considered in this example, Multistage APBR performed best with values of $p$ between 0.25 and 1, implying that the rate at which backtracking decreases should be slow.

Table 1
Frobenius distances of the approximations to $X_\ast$ using Multistage APBR, AP (19), OAP (20), the IM method (18) and De Moor (DM).

            AP         OAP        IM         Med (APBR)  Min (APBR)  DM
L = 4
  r = 1     110.3142   110.3141   110.0095   110.0101    110.0095    –
  r = 2     73.6980    73.6955    72.8526    72.8550     72.8530     –
  r = 3     14.8251    14.8218    14.1482    14.1481     14.1478     14.1478
L = 5
  r = 1     111.8552   111.8552   111.5625   111.5690    111.5625    –
  r = 2     73.3795    73.3786    73.1739    73.1790     73.1740     –
  r = 3     15.6168    15.6160    14.9518    14.9597     14.9519     –
  r = 4     3.4535     3.4535     3.4535     3.4535      3.4509      –


Table 2
Minimum Frobenius distances to $X_\ast$ using Multistage APBR, varying the parameters $p$ and $q$; $L = 4$, $r = 3$.

  p \ q     0.25      0.5       0.75      1         1.5       2
  0.25      14.2678   14.1528   14.1483   14.1478   14.1478   14.1478
  0.5       14.3230   14.1540   14.1486   14.1478   14.1478   14.1478
  0.75      16.6083   14.2294   14.1535   14.1479   14.1478   14.1478
  1         17.1857   14.4427   14.1721   14.1498   14.1478   14.1478
  1.5       36.1126   16.3141   14.3893   14.4292   14.1976   14.2249
  2         38.4927   18.0746   14.9541   14.9674   14.4582   14.7119


Randomization (regulated by the parameter $q$) can be useful too. Note also that the mechanism of this stochastic mutation in Multistage APBR resembles the mechanism of regularization of the Alternating Least Squares methods used in signal processing and, in particular, in tensor decompositions, see [30]. Randomization appears to be beneficial both at the start of the iterations and throughout the running of the algorithm, but, as illustrated in Table 2, the rate at which randomization decreases should be slightly faster than for backtracking (that is, we recommend choosing $p < q$). For the data considered in this example, Multistage APBR performed best with values of $q$ between 1 and 2.

6.2. Numerical results for other examples

In this example we introduce a parametric family based on the data originally studied in [35]. Let $N = 11$ and $Y_\ast^{(m)} = (0,\, 3-2m,\, 0,\, -1,\, 0,\, m,\, 0,\, -1,\, 0,\, 3-2m,\, 0)^T$, where $m = -1, 0, 1, 2, 3$. We fix $L = 3$ and $r = 2$. Set $X_\ast^{(m)} = \mathcal{H}\big(Y_\ast^{(m)}\big)$. We have $\mathrm{rank}\big(X_\ast^{(1)}\big) = 2$, while the rank of the other matrices $X_\ast^{(m)}$ (for $m = -1, 0, 2, 3$) is equal to 3.

We compare the results obtained from performing AP (19), OAP (20), Multistage APBR (22) and the IM method from [23]. The parameters of (22) are $M = 1000$, $c = 1$, $s_0 = 0.25$, $s = 1$, $p = 0.5$ and $q = 1.5$. The total number of iterations was fixed at 250 with $P = 200$. Table 3 contains the Frobenius distances to $X_\ast^{(m)}$ using AP (19), OAP (20), Multistage APBR (22) and the IM method. The results show that Multistage APBR gives better results than the other methods.

Fig. 1 contains plots of the original data $Y_\ast^{(m)}$, for $m = -1, 2, 3$, and the approximations obtained from performing Cadzow iterations (19), Multistart APBR (22) and the IM method. The difference between the results obtained using the IM method and AP is not visible for $m = -1$ and $m = 2$. Results obtained by the different methods for $m = 0$ and $m = 1$ are indistinguishable in the figures and so are not included.

Fig. 2 contains the Frobenius distances from $X_\ast$ for AP and Multistart APBR as a function of $m$, where $-2 \le m \le 4$, for $L = 3$ and $L = 4$. Note a slight improvement in the overall trend for AP at $m = 3$. Fig. 2 shows, in particular, that AP provides consistently poor results in our example.

Table 3
Frobenius distances to $X_\ast^{(m)}$ using AP (19), OAP (20) and the IM method. The minimal and median Frobenius distances to $X_\ast^{(m)}$ using Multistage APBR (22) with $M = 1000$ are also provided.

  m      AP        OAP       IM        Med (APBR)  Min (APBR)
  -1     68.3077   68.1548   68.3077   56.8699     56.7487
  0      17.0769   17.0769   17.0769   17.0769     17.0769
  1      0.0000    0.0000    0.0000    0.0000      0.0000
  2      17.0769   17.0769   17.0769   12.9900     12.8791
  3      50.1888   50.1873   49.9663   36.2506     36.2357

Fig. 1. Plots of $Y_\ast^{(m)}$ for $m = -1, 2, 3$ (points) with approximations using AP (dashed line), the IM method (dotted line) and the best approximation achieved using Multistart APBR (solid line).


Fig. 2. Frobenius distances from X_* for AP (solid line) and the best approximation achieved using Multistart APBR (dashed line) as a function of m, evaluated at increments of 0.01, with solutions for the case L = 3 in the region 2.8 ≤ m ≤ 3.2. The solution corresponding to m = 3 is shown in the dashed line.

Fig. 3. Plots of the Frobenius distances ||X_n − X_*||_F^2 as functions of n for AP (bold line) and three randomly selected Multistart APBR iterations (dot-dash line), for different s_0.


consistently poor results in our example. Fig. 2(c) contains a plot of the solutions obtained by AP in the region 2.8 ≤ m ≤ 3.2. The solution at m = 3 is different from the other solutions obtained in the regions 2.8 ≤ m < 3 and 3 < m ≤ 3.2. Solutions in the regions 2.8 ≤ m < 3 and 3 < m ≤ 3.2 are very similar.

In this example we can see that AP (19) is the poorest. OAP (20) and the IM method give a marginal improvement on AP. For the cases m = 0 and m = 1, all algorithms give identical solutions. In these cases, the SLRA problem is quite simple; for example, for m = 1 the first AP iteration already yields the optimal SLRA approximation.

Consider some results for the case m = 3. Fig. 3 contains plots of the Frobenius distances ||X_n − X_*||_F^2 as functions of n (for AP and three randomly selected Multistart APBR iterations), for different s_0. For small s_0 the distances are initially small, but then grow as the algorithm iterates towards a solution. For large s_0 the distances are likely to be large initially, but the inclusion of backtracking in the Multistart APBR algorithm eventually makes these distances smaller. In Multistart APBR,

Table 4. Minimum and median Frobenius distances to X^{(3)}_* using Multistart APBR, varying the parameters p and q.

Min (APBR)
p \ q     0.25      0.5       0.75      1         1.5       2
0.25      36.8456   37.1360   37.1899   36.2055   37.2088   37.2088
0.5       36.1380   37.1232   36.2359   36.2359   36.2357   36.2358
0.75      36.1970   37.1932   36.2620   36.2661   36.2744   36.2749
1         36.4415   36.4320   36.2829   36.2912   36.2998   36.2999
2         36.4580   36.4960   36.3614   36.3178   36.3672   36.3253

Median (APBR)
p \ q     0.25      0.5       0.75      1         1.5       2
0.25      37.5396   37.2330   37.2111   36.2178   37.2129   37.2132
0.5       37.7568   37.3393   36.2489   36.2448   36.2408   36.2410
0.75      40.1898   37.6974   36.3167   36.2876   36.2791   36.2794
1         44.0458   37.8351   36.5178   36.3345   36.3069   36.3055
2         47.2857   42.0455   39.1298   38.1299   37.7234   37.5386


Table 5. Frobenius distances to X_* obtained using AP (19), OAP (20) and Multistart APBR (22).

AP       OAP      Med (APBR)   Min (APBR)
9.9652   9.9652   9.8606       9.8578

Fig. 4. Plot of the log-transformed Air Passenger data (gray) and the rank 2 AP (dashed line) and Multistart APBR (solid line) approximations.


after performing P = 200 iterations with randomization and backtracking, we perform 50 iterations of the OAP algorithm. The effect of using OAP within Multistart APBR is very clear: the distances from X_* increase while the algorithm converges to some matrix in A.

Table 4 contains the minimum and median Frobenius distances to X^{(3)}_* (m = 3) using Multistart APBR, for varying parameters p and q in (22). With s_0 = 0.25 and s = 1 it can be seen that there are a number of (p, q) parameter pairs which yield similar solutions. Consistent with the previous example, we recommend choosing 0 < p < q. We can also see that the results are very stable with respect to the values of p and q.

6.3. Application to ‘Air Passenger’ data

In this example we consider the celebrated 'Air Passenger' data (available from [36]), which consists of monthly counts of airline passengers, measured in thousands, for the period January 1949 through December 1960. We denote this data by Y_*, and let X_* = H(Y_*). We include this example to demonstrate that Multistart APBR may be used for the time series analysis of real data. For general guidance concerning the choice of the parameters L and r, see for example [5]. We wish to find an r = 2 approximation to the log-transformed data using AP (19), OAP (20) and Multistart APBR (22). Table 5 contains the Frobenius distances to X_* obtained using these methods with L = 24. The parameters of Multistart APBR (22) are selected to be M = 1000, c = 1, s_0 = 0.25, s = 1, p = 0.5, q = 1.5, and P = 600. The total number of iterations was 800. Fig. 4 contains a plot of the log-transformed data and the rank 2 AP and Multistart APBR approximations.
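For readers who wish to reproduce the AP baseline of this example, the following minimal sketch (our illustration, not the authors' implementation; their software is available from [33]) performs plain alternating projections (Cadzow iterations) for the unweighted Frobenius norm, where the projection back onto the Hankel set reduces to anti-diagonal averaging. The log-transformed series would be passed as y with L = 24 and r = 2.

```python
import numpy as np

def hankel(y, L):
    """L x K Hankel matrix built from the series y, K = len(y) - L + 1."""
    K = len(y) - L + 1
    return np.array([[y[l + k] for k in range(K)] for l in range(L)], dtype=float)

def diag_average(X):
    """Nearest Hankel matrix in the Frobenius norm: average along anti-diagonals."""
    L, K = X.shape
    y = np.zeros(L + K - 1)
    for s in range(L + K - 1):
        idx = range(max(0, s - K + 1), min(L, s + 1))
        y[s] = np.mean([X[l, s - l] for l in idx])
    return y

def cadzow(y, L, r, n_iter=800):
    """Alternating projections between the rank-r matrices and the Hankel matrices."""
    x = np.asarray(y, dtype=float)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(hankel(x, L), full_matrices=False)
        Xr = (U[:, :r] * s[:r]) @ Vt[:r]   # best rank-r approximation (Eckart-Young)
        x = diag_average(Xr)               # back to a series / Hankel representation
    return x
```

The Frobenius distance reported for AP would then be computed as the distance between H(x), with x the returned series, and X_*.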

7. Conclusion

This paper has been devoted to the construction of numerical methods for solving the HSLRA problem. Finding optimal solutions to the HSLRA problem is very difficult: if HSLRA is considered as a problem of estimating parameters of damped sinusoids, then the associated objective function becomes extremely complex and the corresponding optimization problem is basically unsolvable. This leads us to the conclusion that algorithms for solving the HSLRA problem whose trajectories lie in the space H of Hankel matrices are more tractable and have higher chances of success. This is the approach we took in the main body of the paper.

In Section 4 we discussed the so-called orthogonality condition which optimal solutions should satisfy, and described how any approximation may be corrected to achieve this orthogonality. In Section 5 we introduced our family of algorithms called APBR, which can be viewed as a global random search extension of AP. The examples provided in Section 6 show that Multistart APBR significantly outperforms AP and some other methods, and that APBR with selection is much more efficient than Multistart APBR. Many other, possibly more efficient, techniques could be adapted for solving the HSLRA problem, but this is a theme for future research.



Appendix A. Complexity of the HSLRA optimization problem when using parametric forms of the solution

A.1. Objective function and its split into components

In this Appendix, we consider the HSLRA problem (1) as an optimization problem after a parametric representation of the solution has been set. The most popular parameterization of the set A = M_r ∩ H, often found in the signal processing literature (see for example [25]), is obtained by associating matrices X ∈ A with vectors Y(θ) = (y_1(θ), ..., y_N(θ))^T whose elements can be represented as sums of damped sinusoids (other parametric forms are known, but they lead to even more difficult optimization problems).

The definition of the objective function was given in (5), with the elements of the vector Y(θ) = (y_1(θ), ..., y_N(θ))^T given by (6).

Let Θ be a parameter space, so that θ ∈ Θ. The ranges for the parameters a_i and d_i are (−∞, ∞), whilst the ranges for ω_i and φ_i are [0, 1) and [0, π/2) respectively. If we assume that there is a true signal represented in the form (6), such as in the standard 'signal plus noise' model (4), then we denote the true values of the parameters by θ^{(0)} = (a^{(0)}, d^{(0)}, ω^{(0)}, φ^{(0)}), where a^{(0)} = (a_1^{(0)}, ..., a_q^{(0)}), d^{(0)} = (d_1^{(0)}, ..., d_q^{(0)}), ω^{(0)} = (ω_1^{(0)}, ..., ω_q^{(0)}) and φ^{(0)} = (φ_1^{(0)}, ..., φ_q^{(0)}). The associated true signal values are y_n^{(0)}, n = 1, ..., N. If the observations are noise-free, then the vector of observations Y = (y_1, ..., y_N)^T coincides with the signal vector Y^{(0)} = (y_1^{(0)}, ..., y_N^{(0)})^T; otherwise Y is different from Y^{(0)}.

Given an observed vector Y = (y_1, ..., y_N)^T, we define the objective function f(θ) as follows:

$$ f(\theta) = \sum_{n=1}^{N} s_n e_n^2(\theta) = \sum_{n=1}^{N} s_n \bigl(y_n - y_n(\theta)\bigr)^2, \qquad (A.1) $$

where 0 ≤ s_n ≤ ∞, n = 1, ..., N, are weights and e_n(θ) = y_n − y_n(θ).

A comprehensive study of the objective function (A.1) with the parameterization (6) has been undertaken in [35]. Some analysis of the objective function (A.1) has also been reported in [10], but we show below that it is possible to extend the analysis of [10].

Let θ̄ = (ā, d̄, ω̄, φ̄) denote either an estimated value of θ^{(0)} or simply any value in Θ. Then we may write

$$ e_n(\theta) = y_n - y_n(\theta) = y_n - y_n(\bar{\theta}) + y_n(\bar{\theta}) - y_n(\theta) = e_n(\bar{\theta}) + \bigl(y_n(\bar{\theta}) - y_n(\theta)\bigr). \qquad (A.2) $$

Hence we may write (A.1) as

$$ f(\theta) = f(\bar{\theta}) + f_1(\theta, \bar{\theta}) + f_2(\theta, \bar{\theta}), \qquad (A.3) $$

where

$$ f_1(\theta, \bar{\theta}) = \sum_{n=1}^{N} s_n \bigl(y_n(\bar{\theta}) - y_n(\theta)\bigr)^2 \quad \text{and} \quad f_2(\theta, \bar{\theta}) = 2 \sum_{n=1}^{N} s_n e_n(\bar{\theta}) \bigl(y_n(\bar{\theta}) - y_n(\theta)\bigr). $$
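For completeness, squaring (A.2), multiplying by s_n and summing over n gives

$$ f(\theta) = \sum_{n=1}^{N} s_n \bigl( e_n(\bar{\theta}) + y_n(\bar{\theta}) - y_n(\theta) \bigr)^2 = \underbrace{\sum_{n=1}^{N} s_n e_n^2(\bar{\theta})}_{f(\bar{\theta})} + \underbrace{\sum_{n=1}^{N} s_n \bigl(y_n(\bar{\theta}) - y_n(\theta)\bigr)^2}_{f_1(\theta,\bar{\theta})} + \underbrace{2 \sum_{n=1}^{N} s_n e_n(\bar{\theta}) \bigl(y_n(\bar{\theta}) - y_n(\theta)\bigr)}_{f_2(\theta,\bar{\theta})}, $$

which is exactly (A.3).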

Here we consider each component of (A.3) in turn. The component f(θ̄) is a constant representing the sum of squares between the observed vector Y = (y_1, ..., y_N)^T and the vector Y(θ̄) = (y_1(θ̄), ..., y_N(θ̄))^T. The function f_1(θ, θ̄) is a sum of squares, but f_2(θ, θ̄) is a sum of terms which may have alternating signs. This observation led the authors of [10] to suggest that the following could be a common phenomenon: if θ is not a good approximation to the true parameter θ^{(0)}, then f_1(θ, θ̄) dominates the shape of the objective function f(θ), and the contribution of f_2(θ, θ̄) is almost negligible. However, this seems to be true only if the vector Y is observed either without noise or when the noise is very small; that is, when Y ≃ Y^{(0)}. In the next paragraph we argue, and then demonstrate in the examples that follow, that f_2(θ, θ̄) often makes a considerable and important contribution to the objective function f(θ) for all θ ∈ Θ, especially if θ̄ is not a good approximation of θ^{(0)}.

Consider the function f_2(θ, θ̄). Represent it as f_2(θ, θ̄) = C(θ̄) − f_3(θ, θ̄), where

$$ C(\bar{\theta}) = 2 \sum_{n=1}^{N} s_n \bigl(y_n - y_n(\bar{\theta})\bigr)\, y_n(\bar{\theta}) \quad \text{and} \quad f_3(\theta, \bar{\theta}) = 2 \sum_{n=1}^{N} s_n \bigl(y_n - y_n(\bar{\theta})\bigr)\, y_n(\theta). $$

Appealing to standard results concerning the orthogonality of residuals [37], if θ̄ has been obtained as a least squares estimator of θ^{(0)}, then f_3(θ, θ̄) ≈ 0 and f_3 does not make a large contribution to the shape of the objective function f(θ), at least in the region where θ ≃ θ̄. However, if θ is not a good approximation to θ^{(0)}, then f_3(θ, θ̄) may be significantly different from 0, and we can expect f_3(θ, θ̄) to contribute significantly to the shape of f(θ). Moreover, due to the autocorrelation inherent in Y, some oscillatory or seasonal patterns are likely to be observed in f_2(θ, θ̄). This phenomenon is confirmed by the results in the example below.



A.2. Example

Similarly to [10], let us consider the parametrization (6) with q = 1 and the objective function defined by (A.1). Assume that we generate a vector of N = 10 observations Y^{(0)} = (y_1^{(0)}, ..., y_N^{(0)})^T from (6) with ω^{(0)} = 0.4, φ^{(0)} = π/2, d^{(0)} = 0, a^{(0)} = 2, and create a vector of observations Y = (y_1, ..., y_N)^T such that y_n = y_n^{(0)} + ε_n. The noise terms ε_n are assumed to be normally distributed with mean 0 and variance σ². We take w_1 = ... = w_N = 1; a similar phenomenon to the one demonstrated in this example can be observed for different weights.

In this example we assume that d^{(0)}, φ^{(0)} and a^{(0)} are known but ω^{(0)} is not (a similar case was considered in [10]). This yields θ = ω and the objective function becomes

$$ f(\omega) = \sum_{n=1}^{N} \bigl( y_n - a^{(0)} \exp(d^{(0)} n) \sin(2\pi\omega n + \phi^{(0)}) \bigr)^2 . $$

The feasible domain for ω can be chosen as Θ = [0, 1); in this interval, the function f has many local minimizers (in addition to the global minimizer). As mentioned in [35], the number of local minima of f(ω) is linear in N, and therefore the complexity of the objective function f(·) increases with N.

Fig. A.5 contains plots of f(ω), f(ω^{(0)}), f_1(ω, ω^{(0)}) and f_2(ω, ω^{(0)}) for particular realizations of the noise and varying values of σ². Adding noise to the observed data increases the complexity of the function f(ω) and moves the global minimizer of f(ω) away from the true value ω^{(0)} = 0.4. As ω̄ we use the global minimizer of f(ω).
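A minimal NumPy sketch of this experiment (our illustration; the seed and grid resolution are arbitrary choices, and σ² is set here to 2.5, the value used later in the example) generates the noisy observations, evaluates f(ω) on a grid, takes the grid minimizer as ω̄ and computes the components f_1 and f_2 of the decomposition (A.3):

```python
import numpy as np

rng = np.random.default_rng(0)
N, a0, d0, w0, phi0, sigma2 = 10, 2.0, 0.0, 0.4, np.pi / 2, 2.5
n = np.arange(1, N + 1)

def model(w):
    """y_n(omega) with a, d and phi fixed at their true values."""
    return a0 * np.exp(d0 * n) * np.sin(2 * np.pi * w * n + phi0)

y = model(w0) + rng.normal(scale=np.sqrt(sigma2), size=N)   # 'signal plus noise' observations

def f(w):
    """Objective (A.1) with unit weights."""
    return np.sum((y - model(w)) ** 2)

grid = np.linspace(0.0, 1.0, 10001)
fvals = np.array([f(w) for w in grid])
w_bar = grid[np.argmin(fvals)]            # grid approximation to the global minimizer

# Decomposition (A.3) around theta_bar = w_bar: f(w) = f(w_bar) + f1(w, w_bar) + f2(w, w_bar)
f1 = np.array([np.sum((model(w_bar) - model(w)) ** 2) for w in grid])
f2 = fvals - f(w_bar) - f1
print(w_bar, f(w_bar))
```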

Fig. A.6 contains plots of Y and the estimated reconstructed signal. The effect of adding noise is to increase the complexity of the objective function f, and as such it is possible to obtain estimates of ω^{(0)} that are far away from the true value. The consequences of measurement error are clearly seen in this figure.

Now set σ² = 2.5. Fig. A.7 contains plots of f(ω), f(ω̄), f_1(ω, ω̄) and f_2(ω, ω̄) with θ̄ = ω̄ perturbed away from the true value ω^{(0)} = 0.4. Theoretical work by Lemmerling and Van Huffel [10] suggests that it is sufficient to study f_1(ω, ω̄), as this component dominates the shape of the objective function f(ω), and the contribution of f_2(ω, ω̄) is negligible, particularly when ω̄ is not close to the true value of θ. However, it can be seen in Fig. A.7 that f_2(ω, ω̄) is a very significant component of the objective function f(·), whatever the choice of ω̄. We can also see that the inclusion of noise in the vector Y always moves the global minimizer of f(·) away from the true value ω^{(0)} (which is the minimizer of f(·) in the noise-free situation).

A.3. Derivatives and simplification of the objective function (5), parameterized by (6) with q = 1

Let q = 1 and consider the optimization problem (5) defined by the model (6). For brevity, let x_n = exp(dn) sin(2πωn + φ) and w_1 = ... = w_N = 1. Eq. (5) may be written as

$$ f(a, d, \omega, \phi) = \sum_{n=1}^{N} (y_n - a x_n)^2. \qquad (A.4) $$

Fig. A.5. Function f(ω) in solid line, f(ω^{(0)}) in solid line (horizontal), f_1(ω, ω^{(0)}) in dashed line and f_2(ω, ω^{(0)}) in gray, for different values of σ. The estimated global minimum ω̄ and the corresponding value of the objective function f(ω̄) are also given.


Fig. A.6. Plots of Y and the estimated reconstructed signal a^{(0)} exp(d^{(0)} n) sin(2πω̄n + φ^{(0)}) in black, with the true signal a^{(0)} exp(d^{(0)} n) sin(2πω^{(0)} n + φ^{(0)}) in gray, corresponding to the estimated global minimizer shown in Fig. A.5.

Fig. A.7. ω^{(0)} = 0.4; function f(ω) in solid line, f(ω̄) in solid line (horizontal), f_1(ω, ω̄) in dashed line and f_2(ω, ω̄) in gray; with σ² = 2.5.


Since

$$ \frac{\partial f(a, d, \omega, \phi)}{\partial a} = -2 \sum_{n=1}^{N} (y_n - a x_n)\, x_n, $$

we may obtain an explicit estimator for a, which we denote â. This estimator is a function of the remaining parameters d, ω and φ:

$$ \hat{a} = \frac{\sum_{n=1}^{N} y_n x_n}{\sum_{n=1}^{N} x_n^2}. $$

Substituting â into (A.4) gives a new objective function, which we denote g(d, ω, φ):

$$ g(d, \omega, \phi) = \sum_{n=1}^{N} \left( y_n - x_n \frac{\sum_{k=1}^{N} y_k x_k}{\sum_{k=1}^{N} x_k^2} \right)^2 = \sum_{n=1}^{N} \varepsilon_n^2, \quad \text{where } \varepsilon_n = y_n - x_n \frac{\sum_{k=1}^{N} y_k x_k}{\sum_{k=1}^{N} x_k^2}. $$

It is possible to compute the derivative of the objective function with respect to each of the unknown parameters. However, even for simple cases the derivatives cannot be written in a neat form. Here we state the first derivatives of g(d, ω, φ) with respect to each of the unknown parameters:

unknown parameters. However, even for simple cases the derivatives cannot be written in a neat form. Here we state thefirst derivatives of gðd;x;/Þ with respect to each of the unknown parameters:



$$ \frac{\partial g}{\partial d} = -2 \sum_{n=1}^{N} \varepsilon_n \left\{ x_n \frac{\sum_{k=1}^{N} k\, y_k x_k}{\sum_{k=1}^{N} x_k^2} + x_n \frac{\sum_{k=1}^{N} y_k x_k}{\left(\sum_{k=1}^{N} x_k^2\right)^2} \left[ n \sum_{k=1}^{N} x_k^2 - 2 \sum_{k=1}^{N} k\, x_k^2 \right] \right\}. $$

Let $c_1^{(n)} = n e^{dn} \cos(2\pi\omega n + \phi) \sum_{k=1}^{N} x_k^2 - 2 x_n \sum_{k=1}^{N} k\, x_k e^{dk} \cos(2\pi\omega k + \phi)$; then

$$ \frac{\partial g}{\partial \omega} = -4\pi \sum_{n=1}^{N} \varepsilon_n \left\{ x_n \frac{\sum_{k=1}^{N} k\, y_k e^{dk} \cos(2\pi\omega k + \phi)}{\sum_{k=1}^{N} x_k^2} + \frac{\sum_{k=1}^{N} y_k x_k}{\left(\sum_{k=1}^{N} x_k^2\right)^2}\, c_1^{(n)} \right\}. $$

Let $c_2^{(n)} = e^{dn} \cos(2\pi\omega n + \phi) \sum_{k=1}^{N} x_k^2 - 2 x_n \sum_{k=1}^{N} x_k e^{dk} \cos(2\pi\omega k + \phi)$; then

$$ \frac{\partial g}{\partial \phi} = -2 \sum_{n=1}^{N} \varepsilon_n \left\{ x_n \frac{\sum_{k=1}^{N} y_k e^{dk} \cos(2\pi\omega k + \phi)}{\sum_{k=1}^{N} x_k^2} + \frac{\sum_{k=1}^{N} y_k x_k}{\left(\sum_{k=1}^{N} x_k^2\right)^2}\, c_2^{(n)} \right\}. $$
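Since these closed-form derivatives are unwieldy, in practice one may prefer to evaluate g after profiling out a and to approximate its gradient numerically. The following sketch (our illustration, not part of the paper) computes â, g(d, ω, φ) and a central-difference gradient:

```python
import numpy as np

def g_and_grad(y, d, w, phi, h=1e-6):
    """Reduced objective g(d, omega, phi) with the amplitude a profiled out,
    plus a central-difference approximation to its gradient."""
    y = np.asarray(y, dtype=float)
    n = np.arange(1, len(y) + 1)

    def g(params):
        d_, w_, phi_ = params
        x = np.exp(d_ * n) * np.sin(2 * np.pi * w_ * n + phi_)   # x_n
        a_hat = np.dot(y, x) / np.dot(x, x)                      # explicit estimator of a
        return np.sum((y - a_hat * x) ** 2)

    p = np.array([d, w, phi], dtype=float)
    grad = np.zeros(3)
    for i in range(3):
        e = np.zeros(3)
        e[i] = h
        grad[i] = (g(p + e) - g(p - e)) / (2.0 * h)
    return g(p), grad
```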

References

[1] Markovsky I, Willems JC, Van Huffel S, De Moor B, Pintelon R. Application of structured total least squares for system identification and model reduction. IEEE Trans Automat Control 2005;50(10):1490–500.
[2] Lemmerling P, Mastronardi N, Van Huffel S. Efficient implementation of a structured total least squares based speech compression method. Linear Algebra Appl 2003;366:295–315.
[3] Yeredor A. Multiple delays estimation for chirp signals using structured total least squares. Linear Algebra Appl 2004;391:261–86.
[4] Pruessner A, O'Leary DP. Blind deconvolution using a regularized structured total least norm algorithm. SIAM J Matrix Anal Appl 2003;24(4):1018–37.
[5] Golyandina N. On the choice of parameters in singular spectrum analysis and related subspace-based methods. Stat Interface 2010;3:259–79.
[6] Markovsky I. Structured low-rank approximation and its applications. Automatica 2008;44(4):891–909.
[7] Markovsky I. Bibliography on total least squares and related methods. Stat Interface 2010;3(3):329–34.
[8] Markovsky I. Low rank approximation: algorithms, implementation, applications. Springer; 2012.
[9] Bishop WB, Djuric PM. Model order selection of damped sinusoids in noise by predictive densities. IEEE Trans Signal Process 1996;44(3):611–9.
[10] Lemmerling P, Van Huffel S. Analysis of the structured total least squares problem for Hankel/Toeplitz matrices. Numer Algorithms 2001;27(1):89–114.
[11] Golyandina N, Nekrutkin V, Zhigljavsky A. Analysis of time series structure: SSA and related techniques. New York–London: Chapman & Hall/CRC; 2001.
[12] Gillard J, Zhigljavsky A. Optimization challenges in the structured low rank approximation problem. J Global Optim 2012:1–19.
[13] Zilinskas A, Zilinskas J. Interval arithmetic based optimization in nonlinear regression. Informatica 2010;21(1):149–58.
[14] Eckart C, Young G. The approximation of one matrix by another of lower rank. Psychometrika 1936;1(3):211–8.
[15] De Moor B. Structured total least squares and L2 approximation problems. Linear Algebra Appl 1993;188–189(1036):163–205.
[16] Markovsky I, Willems JC, Van Huffel S, De Moor B. Exact and approximate modeling of linear systems. Philadelphia: SIAM; 2006.
[17] Sergeyev YD. Arithmetic of infinity. CS: Edizioni Orizzonti Meridionali; 2003.
[18] Sergeyev YD. A new applied approach for executing computations with infinite and infinitesimal quantities. Informatica 2008;19(4):567–96.
[19] Sergeyev YD. Numerical point of view on calculus for functions assuming finite, infinite, and infinitesimal values over finite, infinite, and infinitesimal domains. Nonlinear Anal Theory Methods Appl 2009;71(12):1688–707.
[20] Sergeyev YD. Numerical computations and mathematical modelling with infinite and infinitesimal numbers. J Appl Math Comput 2009;29(1–2):177–95.
[21] Lolli G. Infinitesimals and infinites in the history of mathematics: a brief survey. Appl Math Comput 2012;218(16):7979–88.
[22] Zhigljavsky A. Computing sums of conditionally convergent and divergent series using the concept of grossone. Appl Math Comput 2012;218(16):8064–76.
[23] Markovsky I, Usevich K. Software for weighted structured low-rank approximation. Tech. Rep. 339974. ECS, Univ. of Southampton; 2012. <http://eprints.soton.ac.uk/339974/>.
[24] Park H, Zhang L, Rosen JB. Low rank approximation of a Hankel matrix by structured total least norm. BIT Numer Math 1999;39(4):757–79.
[25] Van Huffel S. Enhanced resolution based on minimum variance estimation and exponential data modeling. Signal Process 1993;33(3):333–55.
[26] Cadzow JA. Signal enhancement: a composite property mapping algorithm. IEEE Trans Acoust Speech Signal Process 1988;36:1070–87.
[27] Gillard J. Cadzow's basic algorithm, alternating projections and singular spectrum analysis. Stat Interface 2010;3(3):335–43.
[28] Andersson F, Carlsson M. Alternating projections on non-tangential manifolds. arXiv:1107.4055.
[29] Andersson F, Carlsson M, Ivert P-A. A fast alternating projection method for complex frequency estimation. arXiv:1107.2028.
[30] Comon P, Luciani X, de Almeida ALF. Tensor decompositions, alternating least squares and other tales. J Chemometrics 2009;23:393–405.
[31] Golyandina N, Zhigljavsky AA. Singular spectrum analysis for time series. Springer Briefs in Statistics. Springer; 2013.
[32] Chu MT, Funderlic RE, Plemmons RJ. Structured low rank approximation. Linear Algebra Appl 2003;366:157–72.
[33] Gillard JW, Zhigljavsky AA. Software for alternating projections with backtracking and randomization. <http://www.jonathangillard.co.uk>.
[34] Zhigljavsky A, Zilinskas A. Stochastic global optimization. New York: Springer; 2008.
[35] Gillard J, Zhigljavsky AA. Analysis of structured low rank approximation as an optimization problem. Informatica 2011;22(4):489–505.
[36] Hyndman RJ. Time series data library. <http://data.is/TSDLdemo>.
[37] Fuller WA. Introduction to statistical time series. 2nd ed. New York: Wiley & Sons; 1996.
[38] Lawson CL, Hanson RJ. Solving least squares problems, vol. 161. SIAM; 1974.
[39] Heinig G, Rost K. Algebraic methods for Toeplitz-like matrices and operators. Springer; 1984.