
Contents lists available at ScienceDirect

Signal Processing

Signal Processing 96 (2014) 406–419


journal homepage: www.elsevier.com/locate/sigpro

Review

Recent progress on variable projection methods for structured low-rank approximation

Ivan Markovsky*

Department ELEC, Vrije Universiteit Brussel, Pleinlaan 2, Building K, B-1050 Brussels, Belgium

Article info

Article history: Received 19 July 2013. Received in revised form 21 September 2013. Accepted 26 September 2013. Available online 7 October 2013.

Keywords: Low-rank approximation; Total least squares; Structured matrix; Variable projection; Missing data; System identification; Reproducible research; Literate programming; noweb

0165-1684/$ - see front matter © 2013 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.sigpro.2013.09.021

* Tel.: +32 2 6292947. E-mail address: [email protected]

Abstract

Rank deficiency of a data matrix is equivalent to the existence of an exact linear model for the data. For the purpose of linear static modeling, the matrix is unstructured and the corresponding modeling problem is an approximation of the matrix by another matrix of lower rank. In the context of linear time-invariant dynamic models, the appropriate data matrix is Hankel and the corresponding modeling problem becomes structured low-rank approximation. Low-rank approximation has applications in system identification, signal processing, machine learning, and computer algebra, where different types of structure and constraints occur.

This paper gives an overview of recent progress in efficient local optimization algorithms for solving weighted mosaic-Hankel structured low-rank approximation problems. In addition, the data matrix may have missing elements and elements may be specified as exact. The described algorithms are implemented in a publicly available software package. Their application to system identification, approximate common divisor, and data-driven simulation problems is described in this paper and is illustrated by reproducible simulation examples. As a data modeling paradigm, the low-rank approximation setting is closely related to the behavioral approach in systems and control, total least squares, errors-in-variables modeling, principal component analysis, and rank minimization.

© 2013 Elsevier B.V. All rights reserved.

Contents

1. Introduction
2. Examples of properties equivalent to rank deficiency of a matrix
3. Structured low-rank approximation
4. Related paradigms, problems, and algorithms
   4.1. Structured total least squares
   4.2. Behavioral paradigm
   4.3. Errors-in-variables model
   4.4. Principal component analysis
   4.5. Rank minimization
   4.6. Other related problems
5. Applications
6. Variable projection-type algorithms for structured low-rank approximation
7. Numerical examples
   7.1. Autonomous system identification
   7.2. Autonomous system identification with fixed poles
   7.3. Autonomous system identification with missing data
   7.4. Identification of an input/output system
   7.5. Data-driven simulation
   7.6. Approximate greatest common divisor of N polynomials
8. Conclusions
Acknowledgments
Appendix A. Algorithmic details
   A.1. Analytical solution of the inner minimization problem
   A.2. Fast cost function evaluation
References

1. Introduction

Science aims at discovering patterns in empirical observations. The simplest and most basic form of a pattern is the linear relation, and a convenient data structure for organizing the observations is the two-dimensional array of numbers, the matrix. The problem of finding linear relations among the columns or rows of a matrix is closely related to the numerical linear algebra problem of computing the rank of a matrix. Linear relations exist if and only if the matrix is "low-rank", i.e., its rank is less than the smaller of its dimensions.

The problem of discovering more complicated data patterns, such as nonlinear and dynamical relations, can also be posed and solved as a rank computation problem for a matrix obtained from the data via a problem-dependent transformation. We call such matrices structured. There is a direct correspondence between the types of patterns considered in the applications and the types of structured matrices used in the mathematical problem formulation described in this paper. Specific examples illustrating this statement are given in Section 2.

Noise in the data, however, can disguise the exact properties present (or hypothesized) in the noise-free data. This makes estimation and approximation issues critical in real-life data modeling applications. In Section 2, we show that the problem of discovering exact patterns in the data is equivalent to computing the rank of a structured matrix constructed from the data. In Section 3, we make a transition to the more realistic problem of structured low-rank approximation, where the aim is to find the "best" approximation of the given data by data that is consistent with the desired exact property.

The emphasis in this paper is on applications in systems, control, and signal processing. A general introduction to applications of the structured low-rank approximation methods is given in Section 5. Details about the signal processing applications and the numerical examples are given in Section 7. The examples include sum-of-damped-exponentials modeling, using prior knowledge about pole locations, data modeling with missing data, and approximate common divisor computation. An interesting application of missing data estimation, shown in Section 7.5, is data-driven simulation and control [38,33], i.e., direct computation of a signal of interest (simulated response or feed-forward control) from data without explicit model identification.

As explained in Section 4, the structured low-rank approximation setting is closely related to other data modeling approaches: structured total least squares [15,42,39], the behavioral paradigm [48], errors-in-variables modeling [54], principal component analysis [17,20,18], and rank minimization [30,10]. We refer the reader to the overview papers [52,42,32] and the book [44] for an extensive literature survey about these connections and a bibliography. This paper is an update of [32,42] with new applications, methods for missing data estimation, fast approximation of mosaic-Hankel-like matrices, and availability of robust and efficient methods implementing the theory.

2. Examples of properties equivalent to rank deficiency of a matrix

The general concept outlined in the introduction is illustrated and motivated in this section on five specific examples of data patterns (in data modeling, a pattern of the data is called a model and a collection of related patterns, e.g., linear, conic section, etc., is called a model class):

E1. N q-dimensional vectors d_1, …, d_N belong to a subspace of dimension m < q,

E2. N 2-dimensional vectors d_1, …, d_N belong to a conic section,

E3. the scalar sequence (y(1), …, y(T)) is a sum of n < T/2 damped exponentials,

E4. the q-dimensional vector sequence (w(1), …, w(T)) is a trajectory of a bounded complexity linear time-invariant system, and

E5. two polynomials a and b have a common divisor of degree at least ℓ.

Each example is expressed as a rank constraint

$$\operatorname{rank}(D) \le r, \qquad (1)$$

for a suitably chosen matrix D and a natural number r. The matrix D is structured, i.e., it is a function of the data. Its size and structure depend on the example and are specified later on. The observation that examples as diverse as E1–E5 lead to (1) motivates the statement that the low-rank property of a structured matrix constructed from the data is a general property in data modeling. Each example is next briefly discussed.

Example E1. In the simplest example, E1, D is the matrix of the (horizontally) stacked vectors

$$D = [d_1 \ \cdots \ d_N]$$

and the upper bound r on the rank of D is equal to the subspace dimension m. Indeed, the equivalence of statement E1 and the rank condition (1) is a basic linear algebra fact. In the case of subspace fitting, the data matrix D is unstructured, i.e., the observed variables d_ij enter directly as elements of the data matrix D. As shown next, this is not the case in the more complicated examples.

Example E2. The notion of rank also plays a central role in the problem of detecting whether the points {d_1, …, d_N} ⊂ R² belong to a conic section B ⊂ R². Although a conic section B is a nonlinear curve, it is linearly parameterized,

$$\mathcal{B} = \left\{ d = \begin{bmatrix} a \\ b \end{bmatrix} \;\Big|\; [a^2 \ \ ab \ \ a \ \ b^2 \ \ b \ \ 1]\,\theta = 0 \right\}, \quad \text{for some } \theta \ne 0,$$

by the monomials a², ab, a, b², b, and 1 of degree up to 2. We have

$$\{d_1, \ldots, d_N\} \subset \mathcal{B} \;\Longleftrightarrow\; \theta^\top D = 0,$$

where D is the 6 × N matrix

$$D = \begin{bmatrix}
a_1^2 & a_2^2 & \cdots & a_N^2 \\
a_1 b_1 & a_2 b_2 & \cdots & a_N b_N \\
a_1 & a_2 & \cdots & a_N \\
b_1^2 & b_2^2 & \cdots & b_N^2 \\
b_1 & b_2 & \cdots & b_N \\
1 & 1 & \cdots & 1
\end{bmatrix}, \qquad (2)$$

whose columns are nonlinear transformations of the corresponding data vectors d_i = (a_i, b_i). Thus, statement E2 is equivalent to the rank condition (1) with D given in (2) and r = 5. The matrix (2) is an example of a nonlinear structure.
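As a quick numerical illustration of E2 (not part of the paper's software; the points and variable names are chosen for the illustration), the following MATLAB sketch samples points on the unit circle a² + b² = 1, builds the matrix (2), and verifies that its rank is 5 and that a left-kernel vector recovers the conic's coefficients.

% Points on the unit circle a^2 + b^2 = 1 (a conic section).
N = 20;
t = linspace(0, 2*pi, N);
a = cos(t);  b = sin(t);

% The 6 x N matrix (2) of monomials of degree up to 2.
D = [a.^2; a.*b; a; b.^2; b; ones(1, N)];

rank(D)            % expected: 5, i.e., the rank condition of E2 holds
theta = null(D')'  % 1 x 6 left-kernel vector, proportional to [1 0 0 1 0 -1]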

In the rest of the paper, we consider only affinely structured matrices, which allow effective solution methods (see Section 6). As illustrated by examples E3–E5, the affine structure is general enough to cover important classes of applications in systems, control, and signal processing. (As explained in Section 3, the affine structure is still too general for the development of practically efficient algorithms; the best compromise between generality and efficiency is the mosaic-Hankel-like structure.)

Example E3. A sum-of-damped-exponentials sequence (y(1), …, y(T)), i.e., a sequence of the form

$$y(t) = c_1 z_1^t + \cdots + c_n z_n^t, \quad \text{for } t = 1, \ldots, T, \qquad (3)$$

where c_1, …, c_n and z_1, …, z_n are given complex numbers, satisfies a difference equation

$$R_0 y(t) + R_1 y(t+1) + \cdots + R_n y(t+n) = 0, \quad \text{for } t = 1, \ldots, T - n, \qquad (4)$$

with parameter vector

$$R := [R_0 \ R_1 \ \cdots \ R_n] \ne 0.$$

The relation (4) is again linear in the parameters, so that the condition for existence of a solution is a rank constraint of the type (1). The difference equation (4), however, results in a matrix equation RD = 0, with an (n+1) × (T−n) Hankel structured matrix D,

$$\mathcal{H}_{n+1,\,T-n}(y) := \begin{bmatrix}
y(1) & y(2) & y(3) & \cdots & y(T-n) \\
y(2) & y(3) & y(4) & \cdots & y(T-n+1) \\
y(3) & y(4) & y(5) & \cdots & y(T-n+2) \\
\vdots & \vdots & \vdots & & \vdots \\
y(n+1) & y(n+2) & y(n+3) & \cdots & y(T)
\end{bmatrix}, \qquad (5)$$

i.e., the elements on the antidiagonals of the matrix are equal to each other. The upper bound r on the rank of D is equal to the number n of damped exponentials.
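The equivalence in E3 is easy to check numerically. A minimal MATLAB sketch (illustration only, with real poles so that the data is real): build the Hankel matrix (5) from a sum of n = 2 damped exponentials, check that its rank is n, and recover the poles from a left-kernel vector.

% Sum of n = 2 real damped exponentials, y(t) = c1*z1^t + c2*z2^t.
T = 50;  n = 2;
z = [0.9; -0.5];  c = [1; 2];
t = 1:T;
y = c(1)*z(1).^t + c(2)*z(2).^t;

% Hankel matrix (5) with n+1 rows and T-n columns.
H = hankel(y(1:n+1), y(n+1:T));

rank(H)                  % expected: 2 = n
R = null(H')';           % 1 x (n+1) vector of difference equation coefficients (4)
roots(fliplr(R))         % roots of R0 + R1*z + R2*z^2: approximately 0.9 and -0.5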

Example E4. The sum-of-damped-exponentials model (3) is a special case of a linear time-invariant dynamical system. In the behavioral approach to system theory [62], a dynamical system is defined as a set of trajectories. A system can be specified by different representations. For example, a linear time-invariant system can always be represented by a vector constant-coefficients difference equation

$$R_0 w(t) + R_1 w(t+1) + \cdots + R_\ell w(t+\ell) = 0, \quad \text{for } t = 1, \ldots, T - \ell, \qquad (6)$$

where the coefficients R_i ∈ R^{p×q} are such that the matrix polynomial R(z) := Σ_{i=0}^{ℓ} R_i z^i is full row rank. In this case, (6) is a system of p linearly independent equations. Using the notation H, defined in (5) for scalar sequences, the difference equation (6) can be written as a matrix equation

$$[R_0 \ R_1 \ \cdots \ R_\ell]\, D = 0, \quad \text{where } D = \mathcal{H}_{\ell+1,\,T-\ell}(w).$$

Note that, in this case, the data matrix D is Hankel with q × 1 block elements (a block-Hankel matrix). This shows that statement E4 is also a special case of (1) for D block-Hankel and r = (ℓ+1)q − p. The bound r on the rank of D bounds the complexity of the model.

Table 1. Summary of data matrices D and rank bounds r in the examples.

Example                             Data matrix D                           Structure       Rank bound r
E1  subspace                        [d_1 ⋯ d_N]                             Unstructured    subspace dimension m
E2  conic section                   (2)                                     Polynomial      5
E3  sum of damped exponentials      H_{ℓ+1, T−ℓ}(y)                         Hankel          number of exponentials n
E4  linear time-invariant system    H_{ℓ+1, T−ℓ}(w)                         Block Hankel    ℓq + m
E5  greatest common divisor         [S_{nb−ℓ+1}(a); S_{na−ℓ+1}(b)]          Sylvester       na + nb − 2ℓ + 1 (ℓ = GCD degree)

Example E5. Finally, in the polynomial common divisor example E5, there is a relation

$$\operatorname{rank}(D) \le r := n_a + n_b - 2\ell + 1$$

between the degree ℓ of a common divisor of the polynomials

$$a(z) = a_0 + a_1 z + \cdots + a_{n_a} z^{n_a} \quad \text{and} \quad b(z) = b_0 + b_1 z + \cdots + b_{n_b} z^{n_b}$$

and the rank of the Sylvester sub-resultant matrix [5] of a and b:

$$D = \begin{bmatrix} \mathcal{S}_{n_b - \ell + 1}(a) \\ \mathcal{S}_{n_a - \ell + 1}(b) \end{bmatrix}, \quad \text{where} \quad
\mathcal{S}_k(a) := \begin{bmatrix}
a_0 & a_1 & \cdots & a_{n_a} & & \\
& \ddots & \ddots & & \ddots & \\
& & a_0 & a_1 & \cdots & a_{n_a}
\end{bmatrix} \in \mathbb{R}^{k \times (n_a + k)}. \qquad (7)$$

(By default, all blank entries in a matrix are zeros.) A summary of the matrices D and their maximum ranks r in the subspace, conic section, sum-of-damped-exponentials, linear time-invariant dynamical system, and common divisor examples is given in Table 1.
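The rank condition of E5 can also be verified directly. The sketch below (MATLAB, illustration only; multmat is a small hypothetical helper implementing S_k(·) from (7)) builds the Sylvester sub-resultant of two quadratics with one exact common root and checks the rank bound.

% S_k(a) from (7): the k x (na+k) multiplication matrix of a (increasing powers of z).
multmat = @(a, k) toeplitz([a(1); zeros(k-1, 1)], [a(:)' zeros(1, k-1)]);

% a(z) = (1 - z)(5 - z), b(z) = (1 - z)(3 - z): common divisor of degree 1.
a = conv([1 -1], [5 -1]);  na = 2;
b = conv([1 -1], [3 -1]);  nb = 2;
ell = 1;                                 % tested common divisor degree

D = [multmat(a, nb - ell + 1);           % Sylvester sub-resultant of E5 / Table 1
     multmat(b, na - ell + 1)];

rank(D)    % expected: 3 = na + nb - 2*ell + 1, so the bound of E5 holds with equality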

Summary of the examples. The examples have the following common feature:

a property of the data is equivalent to the low-rank property of a matrix constructed from the data.

The problem of checking the presence or absence of a property of the data is then reduced to the standard numerical linear algebra problem of computing the rank of a matrix. Existing algorithms and software are, therefore, readily available for solving the original problem. Note that a range of applications are equivalent to a single abstract problem, for which robust and efficient methods and software are readily available. The key step in applying the approach sketched above and further developed in this paper is to find the relevant matrix D and rank constraint r in (1) for the data property of interest.

3. Structured low-rank approximation

Examples E1–E5 are qualitative statements about properties of the data. In practice, the data is often noise corrupted and the properties are generically not satisfied. Even for exact data, in finite precision arithmetic, the problem of computing the rank of a matrix is notoriously difficult and should be avoided.

A more general and more useful problem than establishing the presence or absence of an exact property is finding a quantitative measure of how far the data is from satisfying the property. One way to formalize this problem is to define a distance measure from the data to the subset of the data space where the property of interest is satisfied. As shown above, the subset of the data space for which the property is satisfied is a manifold of low-rank matrices, so that the distance computation problem is a low-rank approximation problem.

For the examples considered, the low-rank approximation problem leads to classical problems: principal component analysis [17] and total least squares [15,42], in the case of E1; orthogonal regression, also called geometric fitting [12,45], in the case of E2; output error system identification [55,27], in the case of E3; errors-in-variables system identification [54], in the case of E4; and approximate common divisor computation [22], in the case of E5. Similarities and differences of the low-rank approximation setting relative to other modeling paradigms are explained in Section 4 and an overview of applications is given in Section 5. Next, we define the low-rank approximation problem formally.

The general problem considered in this paper is to find the nearest (in some specified sense) matrix D̂ to a given matrix D, where D̂ has rank less than or equal to a specified number r and the same structure as D. A matrix structure is a class of matrices, which can be defined as the image {S(p) | p ∈ R^{n_p}} of a function S : R^{n_p} → R^{m×n}, e.g., the affine structures

$$\mathcal{S}(p) = S_0 + \sum_{i=1}^{n_p} S_i p_i, \quad \text{where } S_i \in \mathbb{R}^{m \times n}. \qquad (8)$$

An example of an affine matrix structure S is the Hankel structure (5). (This representation of a Hankel matrix is memory and computation inefficient and is not used in practical implementations.) For example,

$$\mathcal{H}_2(p) = \begin{bmatrix} p_1 & p_2 \\ p_2 & p_3 \end{bmatrix}$$

is represented by (8) with

$$S_0 = 0, \quad S_1 = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}, \quad S_2 = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}, \quad \text{and} \quad S_3 = \begin{bmatrix} 0 & 0 \\ 0 & 1 \end{bmatrix}.$$

For a given structure S, a vector p ∈ R^{n_p}, referred to as the structure parameter vector, specifies a structured matrix S(p). The distance between matrices with the same structure is expressed equivalently as the distance between their structure parameter vectors. In what follows, we use


the weighted (semi-)norm

$$\|p\|_w^2 := \sum_{i=1}^{n_p} w_i p_i^2,$$

defined by the nonnegative vector w = [w_1 ⋯ w_{n_p}]^T. The structured low-rank approximation problem is defined as follows.

Problem 1 (Structured low-rank approximation problem). Given a matrix structure specification S, a vector of structure parameters p, a nonnegative vector w, and a desired rank r, find a structure parameter vector p̂, such that the corresponding matrix S(p̂) has rank at most r and is as close as possible to p in the sense of the weighted semi-norm ‖·‖_w, i.e.,

$$\text{minimize over } \widehat{p} \in \mathbb{R}^{n_p} \quad \|p - \widehat{p}\|_w^2 \quad \text{subject to } \operatorname{rank}(\mathcal{S}(\widehat{p})) \le r. \qquad \text{(SLRA)}$$

Without loss of generality, it is assumed that m ≤ n. A solution method for the general affine structure (8) is presented in Section 6. As illustrated by examples E3–E5, applications in signal processing, system identification, and computer algebra lead to special linearly structured low-rank approximation problems. The block-Hankel and Sylvester structures, appearing in system identification and computer algebra, are instances of a mosaic-Hankel-like structure [16]

$$\mathcal{S}(p) := \Phi\, \mathcal{H}_{\mathbf{m}, \mathbf{n}}(p), \qquad (9)$$

where H_{m,n} is the block matrix with Hankel blocks (see (5))

$$\mathcal{H}_{\mathbf{m}, \mathbf{n}}(p) = \begin{bmatrix}
\mathcal{H}_{m_1, n_1}(p^{(11)}) & \cdots & \mathcal{H}_{m_1, n_N}(p^{(1N)}) \\
\vdots & & \vdots \\
\mathcal{H}_{m_q, n_1}(p^{(q1)}) & \cdots & \mathcal{H}_{m_q, n_N}(p^{(qN)})
\end{bmatrix}, \quad \mathbf{m} := [m_1 \ \cdots \ m_q], \quad \mathbf{n} := [n_1 \ \cdots \ n_N], \qquad (10)$$

Φ is a full row rank matrix, and the p^{(ij)}'s are subvectors of p. A fast algorithm for mosaic-Hankel low-rank approximation is outlined in Section A.2.

In regular problems, the weight vector w is positive; however, it is useful to also consider the extreme cases of infinite and zero weights in the approximation criterion, so that, in general, w_i ∈ [0, +∞) ∪ {+∞}. An infinite weight w_i = +∞ imposes the equality constraint p̂_i = p_i on the optimization (SLRA), and a zero weight w_i = 0 makes the parameter p_i irrelevant. Indeed, in the case of an infinite weight w_i = +∞, the cost function ‖p − p̂‖_w is infinite unless p̂_i = p_i, so the infinite weight implicitly specifies a hard constraint fixing an element of the approximation vector p̂.

In the case of a zero weight w_i = 0, the error term p_i − p̂_i is excluded from the cost function, so p_i does not influence the solution. Parameter values corresponding to zero weights are marked with the symbol NaN (not a number) and are treated as missing values.

4. Related paradigms, problems, and algorithms

4.1. Structured total least squares

The total least squares method is introduced by Golub and Van Loan [13,15] as a solution technique for an overdetermined system of linear equations AX ≈ B, where A ∈ R^{m×n} and B ∈ R^{m×(m−r)} are the given data and X ∈ R^{n×(m−r)} is unknown. The total least squares approximate solution X̂ of the system AX ≈ B is a minimizer of the following optimization problem:

$$\text{minimize over } X, \widehat{A}, \text{ and } \widehat{B} \quad \|[A \ \ B] - [\widehat{A} \ \ \widehat{B}]\|_F \quad \text{subject to } \widehat{A} X = \widehat{B}, \qquad (11)$$

where ‖·‖_F denotes the Frobenius norm. Problem (11) is generically equivalent to approximation of the matrix D := [A B] by a rank-r matrix D̂ := [Â B̂] in the Frobenius norm:

$$\text{minimize over } \widehat{D} \quad \|D - \widehat{D}\|_F^2 \quad \text{subject to } \operatorname{rank}(\widehat{D}) \le r. \qquad (12)$$

While a solution to the low-rank approximation problem (12) always exists, the total least squares problem (11) may fail to have a solution. The nongeneric case of lack of a total least squares solution occurs when the optimal approximating matrix D̂ of (12) cannot be represented in the form ÂX = B̂.
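For unstructured data, both (12) and, when it exists, the solution of (11) are computable from the singular value decomposition. The MATLAB sketch below uses the classical (unstructured) total least squares setup with dimension choices made for the illustration, so it only shows the connection between the two problems, not the structured case treated in this paper.

% Unstructured total least squares via the SVD (illustration only).
rng(0);
N = 100;  r = 2;  d = 1;                  % N equations, r columns in A, d in B
A0 = randn(N, r);  X0 = randn(r, d);
A = A0 + 0.01*randn(N, r);                % both A and B are perturbed
B = A0*X0 + 0.01*randn(N, d);

% Rank-r approximation of [A B] in the Frobenius norm, cf. (12).
[U, S, V] = svd([A B], 'econ');
Dh = U(:, 1:r) * S(1:r, 1:r) * V(:, 1:r)';
Ah = Dh(:, 1:r);  Bh = Dh(:, r+1:end);

% TLS solution of (11), when it exists: Ah*Xtls = Bh (classical closed form).
Xtls = -V(1:r, r+1:end) / V(r+1:end, r+1:end);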

The generalization of the total least squares problem to systems of linear equations with structured matrices A and B is called structured total least squares [9]. The structured total least squares problem is generically equivalent to the structured low-rank approximation problem. Many algorithms for solving the structured total least squares problem have been proposed in the literature [2,51,28,36,43]. The variable projection-like method presented in Section 6 is related to the structured total least squares algorithm of [43].

4.2. Behavioral paradigm

The system of equations ÂX ≈ B̂ implies a functional relation among the variables of Â and B̂ (B̂ is a function of Â). In system theory and signal processing, such a relation defines what is called an input–output representation of a model. The rank constraint in (12) does not impose an a priori fixed functional relation or input–output partition on the variables of the matrix D̂; however, the rank deficiency of D̂ implies that such relations exist. They can be inferred from D̂ after the approximation problem is solved. The representation-free approach to data modeling is known in the systems and control literature as the behavioral paradigm [62,63]. (SLRA) is a computational method for data modeling in the behavioral setting.

Related methods for system identification in the behavioral setting are developed by Roorda and Heij [49,50]. From a system theoretic point of view, the inner minimization (16) of the Hankel structured low-rank approximation problem is a Kalman smoothing problem [35]. This link allows alternative methods for fast cost function and derivative evaluation.

4.3. Errors-in-variables model

In the classical regression model, an input–output partitioning of the variables is imposed and the input variables (regressors) are assumed to be noise free. This setup is inadequate for applications where all variables are obtained through measurements and should be treated on an equal footing. The relevant statistical setup in this case is the errors-in-variables model [54]

$$d_i = \bar{d}_i + \tilde{d}_i, \quad \text{for } i = 1, \ldots, N,$$

where d_i is the vector of measured variables, d̄_i is its true value, and d̃_i is the measurement noise. The low-rank approximation problem defines a maximum likelihood estimator in the errors-in-variables setup, under the assumption that the measurement noise is zero mean, normally distributed, with covariance matrix known up to a scaling factor [32]. Therefore, low-rank approximation yields a statistically optimal estimator in the errors-in-variables setting. Conversely, the errors-in-variables setting provides the relevant testbed for the methods developed in this paper.

4.4. Principal component analysis

The principal component analysis method [47,20,18] is defined as follows: given a set of vectors

$$\mathcal{D} = \{d_1, \ldots, d_N\} \subset \mathbb{R}^q,$$

drawn from a zero mean distribution, find a subspace B = ker(R) of dimension r which maximizes the empirical variance of the data projected on the subspace. The problem is equivalent to approximation of the data matrix

$$D = [d_1 \ \cdots \ d_N]$$

by a rank-r matrix. Principal component analysis thus gives a statistical interpretation of the deterministic low-rank approximation problem.

4.5. Rank minimization

In the low-rank approximation problem, the fitting error ‖p − p̂‖_w is minimized subject to a hard constraint on the rank of the approximating matrix S(p̂). In the related structured rank minimization problem [30,10],

$$\text{minimize over } \widehat{p} \quad \operatorname{rank}(\mathcal{S}(\widehat{p})) \quad \text{subject to } \|p - \widehat{p}\|_w \le e,$$

the rank of S(p̂) is minimized subject to a hard constraint on the fitting error. Both problems are NP-hard and are equivalent in the sense that a method for solving one of them can be used for solving the other.

More generally, one can consider the bi-objective problem

$$\text{minimize over } \widehat{p} \quad \begin{bmatrix} \|p - \widehat{p}\|_w \\ \operatorname{rank}(\mathcal{S}(\widehat{p})) \end{bmatrix}.$$

Then, the rank minimization and low-rank approximation problems can be viewed as ways of scalarizing the bi-objective problem. Another scalarization of the bi-objective problem is the regularized problem

$$\text{minimize over } \widehat{p} \quad \|p - \widehat{p}\|_w + \gamma\, \operatorname{rank}(\mathcal{S}(\widehat{p})).$$

The solutions of all three problems above trace the same Pareto-optimal curve when the corresponding hyperparameters (r in the low-rank approximation problem, e in the rank minimization problem, and γ in the regularized problem) are varied.

Suboptimal solutions to these NP-hard problems can be computed by relaxing them to convex optimization problems. A popular convex relaxation of the rank constraint is the nuclear norm, defined as the sum of the singular values. An upper bound on the nuclear norm can be expressed as a linear matrix inequality, so that the relaxed problems are semi-definite programming problems. Consequently, they can be solved by readily available algorithms and software.
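For reference, the nuclear norm is just the sum of the singular values, and its proximal operator, singular value soft-thresholding, is the basic step of nuclear-norm-based heuristics. A short MATLAB illustration (not from the paper; matrix sizes are arbitrary):

X = randn(8, 6)*randn(6, 10);            % a rank-6 matrix
nuclear_norm = sum(svd(X));              % ||X||_*: convex surrogate for rank(X)

% Singular value soft-thresholding: the proximal operator of gamma*||.||_*.
gamma = 1;
[U, S, V] = svd(X, 'econ');
Xs = U * diag(max(diag(S) - gamma, 0)) * V';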

4.6. Other related problems

A subarea of low-rank approximation that is not covered in this overview is tensor low-rank approximation [64,29]. Tensor methods are used in higher order statistical signal processing problems, such as independent component analysis, and in multidimensional signal processing, such as spatio-temporal modeling and video processing, to name a few. Other areas of research on low-rank approximation that we do not cover are nonnegative low-rank approximation and matrix completion [34]. Nonnegativity constraints appear, for example, in chemometrics, image processing, and text processing [7]. These constraints are imposed in solution methods by a rank revealing factorization with nonnegative factors. In addition, upper bounds on the elements are imposed in the method of [21].

5. Applications

Many data-driven modeling problems in systems and control, signal processing, and machine learning, as well as problems in computer algebra (see Fig. 1), can be posed as a structured low-rank approximation (SLRA) for a suitably defined structure S, weight vector w, and rank constraint r. As illustrated by examples E1–E5, there is a correspondence between the matrix structure S of the low-rank approximation (SLRA) and the model class in the data modeling problem (see Table 2).

The advantages of reformulating problems as the structured low-rank approximation (SLRA) are: (1) links among seemingly unrelated problems are established, and (2) generic methods, algorithms, and software can be reused in different application domains. In order to justify this claim, we apply the generic software for solving (SLRA), described in Section 6, to three vastly different applications: system identification, data-driven simulation, and computation of an approximate common divisor of three polynomials.

In system theory, it is well known that there is a connection between model reduction and system identification. At the heart of both problems is the fact (illustrated by example E3) that a time series is a trajectory of a linear time-invariant system of bounded complexity if and only if a Hankel matrix constructed from the time series is rank deficient.


Fig. 1. Structured low-rank approximation as a generic problem for data modeling.

Table 2. Correspondence between the matrix structure S of the low-rank approximation (SLRA) and the model class in the data modeling problem.

Structure S                      Model class
Unstructured                     Linear static
Hankel                           Scalar linear time-invariant
q × 1 Hankel                     q-variate linear time-invariant
q × N Hankel                     N equal length trajectories
Mosaic-Hankel                    N general trajectories
[Hankel  unstructured]           Finite impulse response
Block-Hankel Hankel-block        2-dimensional linear shift-invariant


Also, there are known connections among signal processing methods, such as forward-backward linear prediction [24,57,1] and harmonic retrieval [53], on one hand, and system identification methods, such as the prediction error methods [27], on the other hand.

Cross-disciplinary links among computer algebra, machine learning, and system theory are, however, less explored. The low-rank approximation setting offers a platform for establishing such links on a global scale. Indeed, formulating a new problem as a low-rank approximation problem with a specific structure relates this problem to all problems that are known to be equivalent to low-rank approximation with the same type of structured matrix.

Another advantage of formulating a problem as the structured low-rank approximation (SLRA) is that a range of different solution methods become readily available for that problem. The main types of methods, with a few representatives for each type, are:

• global solution methods (see [58, Section 3]),
  ○ semi-definite programming relaxations of rational function minimization [19],
  ○ methods based on solving systems of polynomial equations [56];
• local optimization methods,
  ○ variable projection [43],
  ○ alternating projections [61]; and
• heuristics based on convex relaxations,
  ○ multistage methods [60],
  ○ the nuclear norm heuristic [10].

The methods have different properties, which may be advantages or disadvantages in different applications. Multistage methods are, in general, faster but less accurate than the optimization-based methods. Local optimization methods require initial approximations; the solution obtained and the amount of computation may be strongly affected by a bad initialization. Global optimization methods are currently impractical for realistic signal processing applications (e.g., large data sets or real-time applications), where the required computational effort is an important consideration.

In our experience, the most promising approach for batch model identification is local optimization using an approach similar to the variable projection principle [14]. This approach is very effective in data modeling problems where the model complexity (number of inputs and lag or order of the system) is much smaller than the amount of data. This is a typical situation in control and signal processing. The method described in the following section is of the variable projection family and is therefore intended for data approximation by low-complexity models. In Section 7, we show numerical examples with a publicly available software implementation of the method.

6. Variable projection-type algorithms for structured low-rank approximation

The rank constraint in (SLRA) can be expressed as the condition that the dimension of the left kernel of S(p̂) is at least m − r:

$$\operatorname{rank}(\mathcal{S}(\widehat{p})) \le r \;\Longleftrightarrow\; \text{there is a full row rank matrix } R \in \mathbb{R}^{(m-r)\times m} \text{ such that } R\,\mathcal{S}(\widehat{p}) = 0. \qquad \text{(rankR)}$$

Using (rankR), (SLRA) is rewritten in the following equivalent form:

$$\text{minimize over } \widehat{p} \in \mathbb{R}^{n_p} \text{ and } R \in \mathbb{R}^{(m-r)\times m} \quad \|p - \widehat{p}\|_w^2 \quad \text{subject to } R\,\mathcal{S}(\widehat{p}) = 0 \text{ and } R \text{ full row rank.} \qquad (13)$$

Problem (13) is a nonlinear least squares problem which, in general, admits no analytic solution. However, the variable p̂ can be eliminated by representing (13) as a double minimization problem

$$\text{minimize over full row rank } R \in \mathbb{R}^{(m-r)\times m} \quad M(R), \qquad (14)$$

where

$$M(R) := \min_{\widehat{p}} \; \|p - \widehat{p}\|_w^2 \quad \text{subject to } R\,\mathcal{S}(\widehat{p}) = 0. \qquad (15)$$

The computation of M(R) for given R is referred to as the inner minimization problem, and the minimization (14) of the function M over R is referred to as the outer minimization problem.

As shown in Section A.1, the inner minimization (15) is a linear least squares problem

$$M(R) = \min_{\widehat{p}} \; \|\operatorname{diag}(\sqrt{w})(p - \widehat{p})\|_2^2 \quad \text{subject to } G(R)(\widehat{p} - p) = \operatorname{vec}(R\,\mathcal{S}(p)), \qquad (16)$$

where

$$G(R) := [\operatorname{vec}(R S_1) \ \cdots \ \operatorname{vec}(R S_{n_p})] \in \mathbb{R}^{(m-r)n \times n_p}.$$

Here vec is the column-wise vectorization operator: for an m × n matrix A = [a_1 ⋯ a_n], vec(A) is the mn-vector [a_1^T ⋯ a_n^T]^T. Problem (16) admits an analytic solution, though special care is needed for fixed values (w_i = +∞) and missing values (w_i = 0), see Section A.1. Furthermore, in Section A.2, it is shown that fast evaluation of M can be performed for mosaic-Hankel-like structured matrices (9).

The main advantage of the reformulation of (13) as (14) is the elimination of the optimization variable p̂. Indeed, (14) has fewer optimization variables than (13): m(m−r) versus n_p + m(m−r). In applications of modeling data by a low-complexity model, p̂ is high dimensional and R is small dimensional. Therefore, the elimination of p̂ leads to a big reduction in the number of optimization variables.

The approach described above for solving (13) is similar to the variable projection method [14] for the solution of separable unconstrained nonlinear least squares problems

$$\text{minimize over } x \text{ and } \theta \quad \|A(\theta)x - b\|_2^2, \qquad (17)$$

where the matrix valued function A and the vector b are the given data and the vector x is the to-be-found parameter. The low-rank approximation problem (13), however, is different from the nonlinear least squares problem (17). Instead of the explicit function b̂ = A(θ)x, where x is unconstrained, an implicit relation R S(p̂) = 0 is considered, where the variable R is constrained to have full row rank. This fact requires a new type of algorithm, in which the nonlinear least squares problem is an optimization problem on a Grassmann manifold, see [3,4,37,6].

In (14), the cost function M is minimized over the set of full row rank matrices R ∈ R^{(m−r)×m}. It can be seen from the definition (15) of M(R) that the cost function depends only on the space spanned by the rows of R:

rows of R′ and R″ span the same subspace  ⟹  M(R′) = M(R″).

Therefore, (14) is a minimization problem on the Grassmann manifold Gr(m−r, m) of all (m−r)-dimensional subspaces of R^m. In order to find a minimum of M, the search space in (14) can be replaced by a set of matrices R ∈ R^{(m−r)×m} that represent all subspaces from Gr(m−r, m), e.g., matrices satisfying the constraint

$$R R^\top = I_{m-r}. \qquad (18)$$

In the software package of [40], (SLRA) is solved numerically by a function with interface

[ph, info] = slra(p, s, r, opt)

The input arguments p, s, and r correspond to the vector of structure parameters p, the structure specification S, and the bound r on the rank of the approximating matrix, respectively. The output argument ph is a locally optimal solution p̂ of (SLRA). The argument s is a structure with fields m, n, phi, and w. The fields m and n are the vectors m and n that specify the mosaic-Hankel matrix (10), and phi is the Φ matrix (default value I). The field w is the weight vector w ∈ R^{n_p} (default value a vector of all ones).

The arguments opt and info contain input options and output information for the optimization solver, respectively. For example, the optional field opt.Rini is used as an initial approximation for the optimization method. The minimizer R̂ of the outer problem (14) is provided in the field info.Rh.
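For concreteness, a hypothetical call solving the scalar Hankel problem (21) of Section 7.1 with this interface might look as follows; this is a sketch only, assuming the slra package [40] is installed and on the MATLAB path, using only the fields documented above and leaving w at its default of all ones.

% Hankel low-rank approximation (21): approximate a noisy scalar record y
% by a trajectory of an autonomous linear time-invariant system with lag ell.
T = 100;  ell = 2;
t = (1:T)';
y0 = 0.9.^t .* cos(0.5*t);                % exact low-complexity data
y  = y0 + 0.05*randn(T, 1);               % noisy structure parameters p

s.m = ell + 1;                            % mosaic-Hankel specification: m = ell+1
s.n = T - ell;                            %                              n = T-ell
r   = ell;                                % rank bound

opt = struct();                           % default options
[ph, info] = slra(y, s, r, opt);          % ph: optimal parameters, info.Rh: kernel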

In the software package, problem (14) is solved by local minimization over parameters θ ∈ R^{n_θ}, such that R = R(θ):

$$\text{minimize over } \theta \in \mathbb{R}^{n_\theta} \quad M(R(\theta)) \quad \text{subject to } R(\theta) R^\top(\theta) = I_{m-r}. \qquad (\mathrm{SLRA}_\theta)$$

(SLRAθ) is a standard nonlinear least squares problem, for which existing methods, e.g., the Levenberg–Marquardt algorithm [31] implemented in the GNU Scientific Library [11], are used.

The function R allows specification of a linear structure of the matrix R:

$$R = R(\theta) := \operatorname{vec}^{-1}_{m-r}(\theta \Psi), \qquad (19)$$

where vec⁻¹(·) is the inverse of the column-wise vectorization operator and Ψ ∈ R^{n_θ × (m−r)m} is a full row rank matrix that is passed via the optional parameter opt.Psi (default value I). Applications leading to structured low-rank approximation with a linearly structured kernel are identification with fixed poles (see Section 7.2) and approximate common divisor computation for three or more polynomials (see Section 7.6).


7. Numerical examples

In this section, we describe the application of (SLRA) to system identification (Sections 7.1–7.4), data-driven simulation (Section 7.5), and approximate common divisor computation (Section 7.6). Numerical examples with the variable projection method, described in this paper and implemented in [40], are presented. The reported numerical results are reproducible in the sense of [8] by downloading the slra package [40] from

https://github.com/slra/slra

and following the instructions in /doc/overview-readme.txt.

7.1. Autonomous system identification

In system theoretic and signal processing terminology, the difference equation (4) defines an autonomous linear time-invariant dynamical model. The problem of finding the model from an observed trajectory of the system is called a system identification problem [27,55]. The particular identification problem considered in this section is defined as follows. Given a time series

y = (y(1), …, y(T)) ∈ R^T

and a natural number ℓ,

$$\begin{aligned}
&\text{minimize over } \widehat{y} \in \mathbb{R}^T \text{ and } [R_0 \ R_1 \ \cdots \ R_\ell] \ne 0 \quad \|y - \widehat{y}\|_2^2 \\
&\text{subject to } R_0 \widehat{y}(t) + R_1 \widehat{y}(t+1) + \cdots + R_\ell \widehat{y}(t+\ell) = 0, \quad \text{for } t = 1, \ldots, T - \ell.
\end{aligned} \qquad (20)$$

The constraints of (20) are equivalent to

$$\operatorname{rank}(\mathcal{H}_{\ell+1,\,T-\ell}(\widehat{y})) \le \ell,$$

so that (20) is equivalent to the Hankel low-rank approximation problem

$$\text{minimize over } \widehat{y} \in \mathbb{R}^T \quad \|y - \widehat{y}\|_2^2 \quad \text{subject to } \operatorname{rank}(\mathcal{H}_{\ell+1,\,T-\ell}(\widehat{y})) \le \ell, \qquad (21)$$

which is a special case of (SLRA) with

m = ℓ + 1 and n = T − ℓ.

Numerical example. The sequence ȳ, shown in Fig. 2, left (solid line), is an exact trajectory of an autonomous linear time-invariant system with lag ℓ = 6. Indeed, the left kernel of the Hankel matrix H_{ℓ+1, T−ℓ}(ȳ) has dimension equal to one, and a basis vector R̄ of the left kernel of H_{ℓ+1, T−ℓ}(ȳ) gives a difference equation representation (4), with lag ℓ, of the exact model for ȳ.

In the simulation example, however, the identification data y = ȳ + ỹ is corrupted by zero mean white Gaussian noise ỹ, so that an exact model does not exist for the data. A heuristic method for approximate autonomous system identification is Kung's method [25]. The model obtained by Kung's method is suboptimal in the output error criterion ‖y − ŷ‖ and is used as an initial approximation for the structured low-rank optimization. Fig. 2, left, shows the optimal fit (dashed line) obtained by the slra function, the noisy data (dotted line), and the true data (solid line).

In order to validate the predictive ability of the identified model, we forecast the next T_v = 100 samples

y_v := (ȳ_v(T+1), …, ȳ_v(T+T_v))

of the time series ȳ, shown in Fig. 2, right (solid line). The prediction ŷ_v of the model is obtained by iterating the recursion (4) with initial conditions ŷ(T−ℓ+1), …, ŷ(T). Fig. 2, right, shows the fit of the validation data (dashed line) and the prediction (dashed-dotted line) by a model obtained in the next section.
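The local optimization needs an initial kernel estimate. The examples in this paper use Kung's method [25]; the MATLAB sketch below substitutes a cruder alternative, the left singular vector of the noisy Hankel matrix associated with its smallest singular value, which can equally be passed to the solver through opt.Rini (the generated data only stands in for the record of Fig. 2).

% Noisy trajectory of an autonomous system with lag ell = 6 (three damped
% cosines give six poles), standing in for the identification data.
T = 100;  ell = 6;  t = (1:T)';
y = 0.98.^t.*cos(0.7*t) + 0.95.^t.*cos(1.4*t) + 0.90.^t.*cos(2.1*t) ...
    + 0.05*randn(T, 1);

% Unstructured initial estimate of the kernel vector R of (rankR).
H = hankel(y(1:ell+1), y(ell+1:T));
[U, ~, ~] = svd(H);
opt.Rini = U(:, end)';                    % 1 x (ell+1) initial approximation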

7.2. Autonomous system identification with fixed poles

The complex numbers z_1, …, z_n in the sum-of-damped-exponentials model (3) describe the behavior of the model and are called poles. The poles of an autonomous linear time-invariant model, defined by a difference equation (4), are equal to the roots of the polynomial

$$R(z) := R_0 + R_1 z + \cdots + R_\ell z^\ell.$$

An (ℓ+1)-dimensional vector R = [R_0 R_1 ⋯ R_ℓ] corresponds to a polynomial R(z) of degree ℓ and, vice versa, a polynomial of degree ℓ corresponds to the (ℓ+1)-dimensional vector formed of its coefficients in, say, increasing order of the degree.

Consider the Hankel low-rank approximation problem (21) and let R be a basis for the left kernel of the approximating matrix H_{ℓ+1, T−ℓ}(ŷ). Denote by z_1, …, z_ℓ the zeros of R(z). Since R is real, the zeros appear in complex conjugate pairs. Otherwise, the poles z_1, …, z_ℓ are unconstrained by the approximation (21). In this section, we impose a constraint that some zeros of R(z) are fixed at predefined locations z_{f,1}, …, z_{f,ℓ_f} in the complex plane (observing the complex conjugate symmetry) and the other zeros are unconstrained. The fixed zeros define a real polynomial

$$R_f(z) := \prod_{i=1}^{\ell_f} (z - z_{f,i})$$

and the free zeros define a polynomial θ(z) of degree ℓ − ℓ_f. With this notation, the fixed zeros constraint is

$$R(z) = \theta(z) R_f(z),$$

or, in matrix notation,

$$R = \theta\, \mathcal{S}_{n_\theta}(R_f), \quad \text{where } n_\theta = \ell - \ell_f + 1.$$

The resulting Hankel low-rank approximation problem with fixed poles is of the form (19), with the Ψ matrix being the polynomial multiplication matrix S_{n_θ}(R_f), see (7).

Numerical example. As a numerical example of identification with fixed poles, consider the problem of trend estimation. The data y is modeled as the sum of a ramp function and a trajectory of an autonomous linear time-invariant system. The full model is an autonomous linear time-invariant system with a double pole at one. Therefore, it has ℓ − 2 free poles and two poles fixed at one.


Fig. 3. System identification with periodically missing data (crosses on the x-axis). Noisy trajectory (dotted black), optimal approximation (dashed blue), and true trajectory (solid red). (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)

Fig. 2. Autonomous system identification example. Noisy trajectory (dotted black), optimal approximation (dashed blue), true trajectory (solid red), and prediction with fixed poles ŷ′_v (black dashed-dotted). (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)


The true time series ȳ, defined in the previous section, is generated as the sum of a ramp function and a trajectory of a 4th order autonomous linear time-invariant system. Using this prior knowledge, we obtain a better approximation of the true data generating system. The identified model is validated by predicting the validation data ȳ_v. Fig. 2, right, shows the result obtained (dashed-dotted line). Although the fixed double pole at one leads to a worse fit of the identification data, it improves the fit of the validation data.

7.3. Autonomous system identification with missing data

A block of ℓ or more missing values splits the data into two independent datasets. With N − 1 such blocks of missing data, the problem is equivalent to identification from N datasets. Note, however, that in general the datasets have different numbers of samples T_1, …, T_N. System identification from multiple datasets with different lengths is a mosaic-Hankel low-rank approximation problem with

m = ℓ + 1 and n = [T_1 − ℓ ⋯ T_N − ℓ].

In the more general case of arbitrarily distributed missing values, the identification problem is posed as a weighted Hankel low-rank approximation problem. Let I_m be the set of indices of the missing samples and Ī_m the set of indices of the given samples. The appropriate weight vector is

w ∈ R^T, with w(I_m) = 0 and w(Ī_m) = 1.
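A short MATLAB sketch of the corresponding problem setup (field names as in Section 6; missing entries of the parameter vector are marked NaN, as described in Section 3, and receive zero weight):

% Periodically missing samples with period ell+1 in a length-T record.
T = 120;  ell = 5;  t = (1:T)';
y = 0.95.^t .* cos(0.4*t) + 0.05*randn(T, 1);     % observed noisy samples

Im    = (ell+1):(ell+1):T;                        % indices of missing samples
y(Im) = NaN;                                      % missing values marked NaN
w     = ones(T, 1);
w(Im) = 0;                                        % ... and excluded from the cost

s.m = ell + 1;  s.n = T - ell;  s.w = w;          % weighted Hankel problem; p = y and
                                                  % r = ell are then passed to the solver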

Numerical example. Standard identification methods require at least ℓ + 1 sequential data points of a trajectory of the system. In the example shown in Fig. 3, data points are periodically missing with a period ℓ + 1. Although the classical methods are not applicable in this case, methods for solving (SLRA) can be used by assigning zero weights in the cost function to the missing data elements. The noisy trajectory with missing data points (dotted black), the optimal approximation obtained by slra (dashed blue), and the true trajectory (solid red) are plotted in Fig. 3.

7.4. Identification of an input/output system

The examples presented so far illustrate different aspects of the scalar autonomous linear time-invariant system identification problem (example E3 from the introduction). Next, we consider the general case of a dynamical system with inputs (example E4). In (6), m := q − p variables (elements of w) are free, i.e., unrestricted by the model. In system theory, the free variables are called inputs and the remaining variables outputs of the system. Although the separation of the variables into inputs and outputs is, in general, not unique, the number of inputs does not depend on the input/output partition. The pair of integers (ℓ, m), the lag ℓ of Eq. (6) and the number of inputs m, describes the complexity of the system.

Given a time series

w = (w(1), …, w(T)) ∈ (R^q)^T


Fig. 4. Identification and data-driven simulation examples. True s(t) (red solid) and identified ŝ(t) (blue dashed) model step responses. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)


and a complexity, specified by a pair of natural numbers (m, ℓ),

$$\begin{aligned}
&\text{minimize over } \widehat{w} \in (\mathbb{R}^q)^T \text{ and } [R_0 \ R_1 \ \cdots \ R_\ell] \ne 0 \quad \|w - \widehat{w}\|_2^2 \\
&\text{subject to } R_0 \widehat{w}(t) + R_1 \widehat{w}(t+1) + \cdots + R_\ell \widehat{w}(t+\ell) = 0, \quad \text{for } t = 1, \ldots, T - \ell.
\end{aligned} \qquad (22)$$

In the special case m = 0, (22) reduces to the output-only identification problem (20) and, in the special case ℓ = 0, (22) reduces to static modeling (example E1). This seamless transition from one problem to another is an important advantage of the adopted framework.

Although, from the application point of view, the input–output identification problem is a significant generalization of the output-only identification problem, from the numerical linear algebra point of view, it is a trivial one. The general identification problem (22) is also a Hankel low-rank approximation problem,

$$\text{minimize over } \widehat{w} \in (\mathbb{R}^q)^T \quad \|w - \widehat{w}\|_2^2 \quad \text{subject to } \operatorname{rank}(\mathcal{H}_{\ell+1,\,T-\ell}(\widehat{w})) \le q\ell + m. \qquad (23)$$

More specifically, problem (23) is a special case of (SLRA) with

m = [ℓ+1 ⋯ ℓ+1] (q entries) and n = T − ℓ.

Numerical example. A linear time-invariant system, with an input/output partition of the variables w = (u, y), can be represented by the difference equation

$$P_0 y(t) + P_1 y(t+1) + \cdots + P_\ell y(t+\ell) = Q_0 u(t) + Q_1 u(t+1) + \cdots + Q_\ell u(t+\ell). \qquad (24)$$

We consider a simulation example with the second order single-input single-output system with parameters

Q = [1  −1  1] and P = [0.81  −1.456  1].

The identification data w = (u, y) is a random trajectory of the system, perturbed by additive noise (errors-in-variables setup) with noise standard deviation 0.05.

The true data generating system and the system identified by the slra method are compared by plotting their step responses. Fig. 4 shows the true system's step response s(t) (solid line) and the identified system's step response ŝ(t) (dashed line).

7.5. Data-driven simulation

The numerical example in the previous section shows model-based simulation: first, a model is identified from data and, second, the model's step response is simulated. In this section, we show how an arbitrary response of an unknown system can be obtained directly from (noisy) data of the system, without identifying a model in an intermediate step.

Consider two input/output trajectories

w′ = (u′, y′) ∈ (R^q)^{T′} and w″ = (u″, y″) ∈ (R^q)^{T″}

of an unknown linear time-invariant system B with lag ℓ. The trajectory w′ specifies the system implicitly. The trajectory w″ is specified by its first ℓ samples (initial conditions)

w″_p = (w″(1), …, w″(ℓ))

and its input component

u″_f = (u″(ℓ+1), …, u″(T″)).

The problem is to find the output of B,

y″_f = (y″(ℓ+1), …, y″(T″)),

which corresponds to the given initial conditions and input, directly from the data and without identifying the system B.

This problem is a mosaic-Hankel structured low-rank approximation problem,

$$\text{minimize over } \widehat{w}' \text{ and } y''_f \quad \|w' - \widehat{w}'\|_2^2 \quad \text{subject to } \operatorname{rank}\big([\mathcal{H}_{\ell+1}(\widehat{w}') \ \ \mathcal{H}_{\ell+1}(w'')]\big) \le 2\ell + 1, \qquad (25)$$

with exact data, the initial conditions w″_p and the input u″_f, and with missing data, the to-be-simulated response y″_f.

Numerical examples. We use the same simulation setup as in Section 7.4. The to-be-simulated trajectory w″ is the step response s of the system, i.e., the response under zero initial conditions and step input:

u″ = (0, …, 0, 1, 1, …, 1)  (ℓ zeros as initial conditions, followed by the step input)

and

y″ = (0, …, 0, s(1), s(2), …, s(T″ − ℓ))  (ℓ zeros as initial conditions, followed by the step response).

The step response ŝ, estimated by solving (25), is equal (up to numerical errors) to the one obtained by the model-based simulation procedure in Section 7.4.


7.6. Approximate greatest common divisor of N polynomials

Our last example is about computations with polynomials that have inexact coefficients. Consider N polynomials p_1, …, p_N of degrees n_1, …, n_N, respectively. Similarly to the two-polynomial case in example E5, the degree of the greatest common divisor of p_1, …, p_N can be expressed as

$$\operatorname{degree}\big(\gcd(p_1, \ldots, p_N)\big) = n_1 + \cdots + n_N - \operatorname{rank}\big(\mathcal{S}_0(p_1, \ldots, p_N)\big),$$

where

$$\mathcal{S}_\ell(p_1, \ldots, p_N) := \begin{bmatrix}
\mathcal{S}_{n_1-\ell}(p_2) & \cdots & \mathcal{S}_{n_1-\ell}(p_N) \\
\mathcal{S}_{n_2-\ell}(p_1) & & \\
& \ddots & \\
& & \mathcal{S}_{n_N-\ell}(p_1)
\end{bmatrix} \qquad (26)$$

is the generalized (multipolynomial) Sylvester sub-resultant matrix [5,26] for p1;…; pN with parameter ℓ.The structured matrix Sℓðp1;…; pNÞ is a mosaic-Hankel-likematrix ΦHm;n with specification of exact (zero) elements.For example, with N¼3 and n1 ¼ n2 ¼ n3 ¼ 2

$$\Phi = \begin{bmatrix} I_4 & \\ 0_2 & I_2 \end{bmatrix} \qquad\text{and}\qquad \mathcal H_{[2\ 6],\,6} = \begin{bmatrix} [\,S_1(p_2)\ \ S_1(p_3)\,] \\ S_6(p_1) \end{bmatrix}.$$

The approximate common divisor problem for N polynomials is defined as follows:

$$\begin{aligned} &\text{minimize over } \widehat p_1 \in \mathbb{R}^{n_1+1}, \ldots, \widehat p_N \in \mathbb{R}^{n_N+1} \quad \left\lVert \begin{bmatrix} p_1 \\ \vdots \\ p_N \end{bmatrix} - \begin{bmatrix} \widehat p_1 \\ \vdots \\ \widehat p_N \end{bmatrix} \right\rVert_2^2 \\ &\text{subject to}\quad \operatorname{degree}\big(\gcd(\widehat p_1, \ldots, \widehat p_N)\big) \ge \ell \end{aligned}$$

and is equivalent to the low-rank approximation problem with the rank constraint

$$\operatorname{rank}\big(S_\ell(\widehat p_1,\ldots,\widehat p_N)\big) \le n_1+\cdots+n_N - N\ell - 1.$$

The motivation for this problem formulation is to modify the given polynomial coefficients as little as possible in order to ensure that the modified polynomials have a common factor of a desired order. Then, the exact common factor of the modified polynomials is by definition the approximate common factor for the original polynomials.
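To make the rank-degree relation concrete, here is a small sketch for the classical two-polynomial case (example E5): it forms the Sylvester matrix of p₁ and p₂ and recovers the gcd degree from its numerical rank. The construction below uses one common Sylvester-matrix convention, stated in the comments; the N-polynomial sub-resultant (26) used in the paper generalizes this idea.

```python
import numpy as np

def sylvester(p, q):
    """Classical Sylvester matrix of two polynomials.

    p and q are coefficient vectors ordered from the constant to the highest-degree
    term.  The matrix stacks deg(q) shifted copies of p over deg(p) shifted copies
    of q, and rank(S) = deg(p) + deg(q) - degree(gcd(p, q)).
    """
    n, m = len(p) - 1, len(q) - 1
    S = np.zeros((n + m, n + m))
    for i in range(m):                       # m rows with shifted copies of p
        S[i, i:i + n + 1] = p
    for i in range(n):                       # n rows with shifted copies of q
        S[m + i, i:i + m + 1] = q
    return S

# p1 = (1 - z)(5 - z) and p2 = (1 - z)(3 - z) share the common factor (1 - z).
p1 = np.array([5.0, -6.0, 1.0])
p2 = np.array([3.0, -4.0, 1.0])
S = sylvester(p1, p2)
rank = np.linalg.matrix_rank(S, tol=1e-8)
print((len(p1) - 1) + (len(p2) - 1) - rank)  # degree of the gcd: prints 1
```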

7.6.1. Numerical examples
Consider the polynomials

$$\begin{aligned} p_1(z) &= 5 - 6z + z^2 = (1-z)(5-z), \\ p_2(z) &= 5.72 - 6.3z + z^2 = (1.1-z)(5.2-z), \\ p_3(z) &= 6.48 - 6.6z + z^2 = (1.2-z)(5.4-z). \end{aligned}$$

In the example, we are looking for an approximate common divisor of degree 1.

The approximation polynomials

$$\begin{aligned} \widehat p_1(z) &= 4.9989 - 6.0057z + 0.9705z^2, \\ \widehat p_2(z) &= 5.7200 - 6.2999z + 1.0004z^2, \\ \widehat p_3(z) &= 6.4811 - 6.5944z + 1.0289z^2, \end{aligned}$$

computed by the slra method, have a common root 0.1924.

8. Conclusions

Applications as diverse as linear time-invariant system identification, algebraic curve fitting, approximate common divisor computation, and recommender systems can be posed and solved as instances of a single core computational problem: low-rank matrix approximation. This fact allows reuse of ideas, algorithms, and software from one application domain to another. In general, the low-rank approximation problem is nonconvex. We presented a method based on the variable projection principle, which is applicable to general linearly structured matrices and weighted 2-norm approximation criteria. In addition, the method can take into account fixed and missing data values. In the case of mosaic-Hankel-like matrices, the developed algorithm for low-rank approximation has linear computational complexity in the number of data points. This makes it efficient for data modeling by low-complexity linear time-invariant dynamical models.

Acknowledgments

Sections 6 and 7.6 and Appendix A are based on joint work with K. Usevich. The research leading to these results has received funding from the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013) / ERC Grant agreement number 258581 “Structured low-rank approximation: Theory, algorithms, and applications”.

Appendix A. Algorithmic details

A.1. Analytical solution of the inner minimization problem

In this section, we use the following additional notation:

• A_{I,J} is the submatrix of A with rows in I and columns in J (the row/column index can be replaced by the symbol “:”, in which case all rows/columns are selected).
• n_m is the number and I_m := {i | w_i = 0} is the set of missing values.
• n_f is the number and I_f := {i | w_i = +∞} is the set of fixed values.
• I_mf := I_m ∪ I_f is the set of missing and fixed values, and Ī := {1, …, n_p} \ I is the complement of an index set I.
• W_r := diag(w_{Ī_mf}) is the part of the weight matrix related to the given non-fixed data.
• A⁺ is the pseudo-inverse of A and A^⊥ is a matrix whose rows form a basis for the left null space of A.

Consider the inner minimization problem (15). Using the change of variables

$$\Delta p_{\mathrm r} := p_{\bar I_{\mathrm{mf}}} - \widehat p_{\bar I_{\mathrm{mf}}}$$

and (16), we have

$$R\mathcal S(\widehat p) = 0 \quad\Longleftrightarrow\quad \big[\,G_{:,\bar I_{\mathrm{mf}}}\ \ G_{:,I_{\mathrm f}}\ \ G_{:,I_{\mathrm m}}\,\big] \begin{bmatrix} p_{\bar I_{\mathrm{mf}}} - \Delta p_{\mathrm r} \\ p_{I_{\mathrm f}} \\ \widehat p_{\mathrm m} \end{bmatrix} = 0.$$


Therefore, (15) is equivalent to the generalized least norm problem [46]

$$M(R) := \min_{\Delta p_{\mathrm r},\, \widehat p_{\mathrm m}}\ \lVert \Delta p_{\mathrm r} \rVert_{W_{\mathrm r}}^2 \quad\text{subject to}\quad \big[\,G_{:,\bar I_{\mathrm{mf}}}\ \ G_{:,I_{\mathrm m}}\,\big] \begin{bmatrix} \Delta p_{\mathrm r} \\ -\widehat p_{\mathrm m} \end{bmatrix} = G_{:,\bar I_{\mathrm m}}\, p_{\bar I_{\mathrm m}}. \qquad (27)$$

The following theorem is adapted from [41] and gives the solution to (27).

Theorem 2. Under the following assumptions:

1. G_{:,I_m} is full column rank,
2. 1 ≤ (m − r)n − n_m ≤ n_p − n_f − n_m, and
3. G_r := G^⊥_{:,I_m} G_{:,Ī_mf} is full row rank,

problem (27) has a unique minimum

$$M(R) = \Delta p_{\mathrm r}^{\top} W_{\mathrm r}\, \Delta p_{\mathrm r} = h^{\top} \big(G_{\mathrm r} W_{\mathrm r}^{-1} G_{\mathrm r}^{\top}\big)^{-1} h, \quad\text{where}\quad h := G^{\perp}_{:,I_{\mathrm m}}\, G_{:,\bar I_{\mathrm m}}\, p_{\bar I_{\mathrm m}}. \qquad (28)$$

The minimum is attained by

$$\widehat p_{\mathrm m} = -G^{+}_{:,I_{\mathrm m}} \big(G_{:,\bar I_{\mathrm m}}\, p_{\bar I_{\mathrm m}} - G_{:,\bar I_{\mathrm{mf}}}\, \Delta p_{\mathrm r}\big) \qquad\text{and}\qquad \Delta p_{\mathrm r} = W_{\mathrm r}^{-1} G_{\mathrm r}^{\top} \big(G_{\mathrm r} W_{\mathrm r}^{-1} G_{\mathrm r}^{\top}\big)^{-1} h.$$
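Theorem 2 translates directly into a dense linear-algebra evaluation of M(R). The following sketch is a plain NumPy rendering of (28), assuming G = G(R), the weight vector w (with 0 marking missing and inf marking fixed values), and the data vector p are given; it does not exploit any structure and is not the implementation of [40,41], whose fast algorithm is outlined in Appendix A.2. All names are illustrative.

```python
import numpy as np

def left_nullspace(A, tol=1e-12):
    """Matrix whose rows form a basis for the left null space of A (the A-perp of the text)."""
    U, s, _ = np.linalg.svd(A)
    rank = int((s > tol * s.max()).sum()) if s.size else 0
    return U[:, rank:].T

def cost_M(G, w, p):
    """Evaluate M(R) via (28); w_i = 0 marks missing and w_i = inf marks fixed values."""
    Im = np.where(w == 0)[0]                           # missing values
    Imf_bar = np.where((w > 0) & np.isfinite(w))[0]    # given non-fixed values
    Im_bar = np.where(w != 0)[0]                       # all given values (fixed and non-fixed)

    G_perp = left_nullspace(G[:, Im]) if Im.size else np.eye(G.shape[0])
    h = G_perp @ G[:, Im_bar] @ p[Im_bar]              # h := G_perp G_{:,Im-bar} p_{Im-bar}
    G_r = G_perp @ G[:, Imf_bar]                       # G_r := G_perp G_{:,Imf-bar}
    Wr_inv = np.diag(1.0 / w[Imf_bar])                 # inverse of W_r = diag(w_{Imf-bar})
    return h @ np.linalg.solve(G_r @ Wr_inv @ G_r.T, h)
```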

A.2. Fast cost function evaluation

“… in good design, there must be an underlying correspondence between the structure of the problem and the structure of the solution.” R. Gabriel

structure of the (SLRA) problem        structure of the cost function
S and w                                Γ (see (29))
↕ (in the mosaic-Hankel case)          ↕ (in the mosaic-Hankel case)
represented by m, n, w                 represented by V_k (see (30))

In [59] it is shown that for the mosaic-Hankel structure (10) and positive weights (no missing values), the cost function M (and its derivatives) can be evaluated efficiently. The algorithm is implemented in a C++ software package [40].

Let γ_i := w_i^{-1}, for i = 1, …, n_p, where (+∞)^{-1} := 0. By (28), the cost function is equal to

$$M(R) = \operatorname{vec}^{\top}\big(R\mathcal S(p)\big)\, \Gamma^{-1}(R)\, \operatorname{vec}\big(R\mathcal S(p)\big), \quad\text{where}\quad \Gamma(R) := G(R)\operatorname{diag}(\gamma)\, G^{\top}(R) \in \mathbb{R}^{(m-r)n \times (m-r)n}. \qquad (29)$$

The evaluation of M(R), for a given R, requires the solution of the system of linear equations

$$\Gamma(R)\, u = \operatorname{vec}\big(R\mathcal S(p)\big).$$

For structure of the form ΦH_{m,n}, the matrix Γ(R) is block-diagonal,

$$\Gamma(R) = \operatorname{diag}\big(\Gamma^{(1)}(R\Phi), \ldots, \Gamma^{(N)}(R\Phi)\big),$$

with Γ^{(k)}(R) being the matrix Γ for the structure H_{m,n_k} and the corresponding subvector of w.
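For orientation, the following sketch evaluates M(R) through (29) in the most naive way for a scalar Hankel structure S(p) = H_{m,n}(p): it forms G(R) column by column from the structure matrices S_i (column i being vec(R S_i)), builds Γ(R) = G diag(γ) G^⊤ densely, and solves Γu = vec(RS(p)). This is only meant to make the formula concrete; the fast algorithm of [59] never forms Γ explicitly and instead exploits its banded block-Toeplitz structure. All names are illustrative.

```python
import numpy as np

def hankel_structure(m, n):
    """Structure matrices S_1, ..., S_np of the Hankel structure H_{m,n}(p), np = m + n - 1."""
    Ss = []
    for k in range(m + n - 1):
        S = np.zeros((m, n))
        for i in range(m):
            if 0 <= k - i < n:
                S[i, k - i] = 1.0          # parameter p_{k+1} sits on the k-th anti-diagonal
        Ss.append(S)
    return Ss

def cost_M_naive(R, p, w, m, n):
    """Naive evaluation of M(R) through (29) for the Hankel structure H_{m,n}(p)."""
    gamma = 1.0 / w                                               # reciprocal weights (no missing/fixed values here)
    Ss = hankel_structure(m, n)
    G = np.column_stack([(R @ S).ravel(order="F") for S in Ss])   # columns vec(R S_i)
    Gamma = G @ np.diag(gamma) @ G.T                              # Gamma(R) = G diag(gamma) G^T, formed densely
    Sp = sum(pi * S for pi, S in zip(p, Ss))                      # S(p) = H_{m,n}(p)
    r_vec = (R @ Sp).ravel(order="F")                             # vec(R S(p))
    return r_vec @ np.linalg.solve(Gamma, r_vec)                  # solves Gamma u = vec(R S(p))

# Example: p is a noisy exponential 0.8^t, whose exact Hankel matrix has rank 1.
m, n = 3, 8
p = 0.8 ** np.arange(m + n - 1) + 0.01 * np.random.default_rng(2).standard_normal(m + n - 1)
w = np.ones(m + n - 1)                                            # uniform weights
R = np.array([[0.64, -1.6, 1.0]])                                 # R H_{m,n}(p) = 0 for the noise-free sequence
print(cost_M_naive(R, p, w, m, n))
```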

The efficiency of the algorithm is based on the fact that Γ is block banded with bandwidth s = max(m₁, …, m_N). In addition, for uniform weights, Γ is block-diagonal with block-Toeplitz diagonal blocks

$$\Gamma(R) := \begin{bmatrix} \Gamma_{1,1} & \cdots & \Gamma_{1,s} & & 0 \\ \vdots & \ddots & & \ddots & \\ \Gamma_{s,1} & & \ddots & & \Gamma_{n-s+1,n} \\ & \ddots & & \ddots & \vdots \\ 0 & & \Gamma_{n,n-s+1} & \cdots & \Gamma_{n,n} \end{bmatrix},$$

where Γ_{i,j} = R V_{i,j} R^⊤ ∈ R^{(m−r)×(m−r)}.

The matrices V_{i,j} ∈ R^{m×m} depend only on the structure specification and the weights, so they can be precomputed outside of the optimization loop. For mosaic-Hankel structured problems H_{m,n}, where m = [m₁ … m_q], the matrices V_{i,j}, i ≥ j, are

$$V_{i,j} = \begin{cases} (U_m)^{\,j-i}, & \text{if } w = 1 \text{ (uniform weights), and} \\ \operatorname{diag}(\gamma_{i,j})\,(U_m)^{\,j-i}, & \text{otherwise,} \end{cases} \qquad (30)$$

where γ_{i,j} ∈ R^m is a subvector of γ, and

$$U_m := \operatorname{diag}\big(U_{m_1}, \ldots, U_{m_q}\big), \qquad\text{with } U_k \text{ the } k \times k \text{ shift matrix}\quad U_k := \begin{bmatrix} 0 & 1 & & 0 \\ \vdots & \ddots & \ddots & \\ \vdots & & \ddots & 1 \\ 0 & \cdots & \cdots & 0 \end{bmatrix}.$$

For general weights, the cost function evaluation has complexity O(smn) and for uniform weights the complexity is O(mn), assuming that an efficient method (e.g., the Schur-type algorithms [23]) is used for the Cholesky factorization of the block-Toeplitz matrices Γ^{(i)}(RΦ).

References

[1] J. Antoni, L. Garibaldi, A reduced-rank spectral approach to multi-sensor enhancement of large numbers of sinusoids in unknown noise fields, Signal Processing 87 (3) (2007) 453–478.

[2] T. Abatzoglou, J. Mendel, G. Harada, The constrained total least squares technique and its application to harmonic superresolution, IEEE Transactions on Signal Processing 39 (1991) 1070–1087.

[3] P.-A. Absil, R. Mahony, R. Sepulchre, Optimization Algorithms on Matrix Manifolds, Princeton University Press, Princeton, NJ, 2008.

[4] P.-A. Absil, R. Mahony, R. Sepulchre, P. Van Dooren, A Grassmann–Rayleigh quotient iteration for computing invariant subspaces, SIAM Review 44 (1) (2002) 57–73.

[5] M. Agarwal, P. Stoica, P. Åhgren, Common factor estimation and two applications in signal processing, Signal Processing 84 (2004) 421–429.

[6] N. Boumal, P.-A. Absil, RTRMC: a Riemannian trust-region method for low-rank matrix completion, in: J. Shawe-Taylor, R.S. Zemel, P. Bartlett, F.C.N. Pereira, K.Q. Weinberger (Eds.), Advances in Neural Information Processing Systems, vol. 24 (NIPS), 2011, pp. 406–414.

[7] M. Berry, M. Browne, A. Langville, P. Pauca, R. Plemmons, Algorithms and applications for approximate nonnegative matrix factorization, Computational Statistics & Data Analysis 52 (1) (2007) 155–173.

[8] J. Buckheit, D. Donoho, Wavelab and reproducible research, in: Wavelets and Statistics, Springer-Verlag, Berlin, New York, 1995.

[9] B. De Moor, Structured total least squares and L2 approximation problems, Linear Algebra and its Applications 188–189 (1993) 163–207.

[10] M. Fazel, Matrix rank minimization with applications (Ph.D. thesis), Electrical Engineering Department, Stanford University, 2002.

[11] M. Galassi, et al., GNU Scientific Library Reference Manual ⟨http://www.gnu.org/software/gsl/⟩.


[12] W. Gander, G. Golub, R. Strebel, Fitting of circles and ellipses: least squares solution, BIT 34 (1994) 558–578.

[13] G. Golub, Some modified matrix eigenvalue problems, SIAM Review 15 (1973) 318–344.

[14] G. Golub, V. Pereyra, Separable nonlinear least squares: the variable projection method and its applications, Inverse Problems 19 (2003) 1–26.

[15] G. Golub, C. Van Loan, An analysis of the total least squares problem, SIAM Journal on Numerical Analysis 17 (1980) 883–893.

[16] G. Heinig, Generalized inverses of Hankel and Toeplitz mosaic matrices, Linear Algebra and its Applications 216 (1995) 43–59.

[17] H. Hotelling, Analysis of a complex of statistical variables into principal components, Journal of Educational Psychology 24 (7) (1933) 498–520.

[18] J. Jackson, A User's Guide to Principal Components, Wiley, 2003.

[19] D. Jibetean, E. de Klerk, Global optimization of rational functions: a semidefinite programming approach, Mathematical Programming 106 (2006) 93–109.

[20] I. Jolliffe, Principal Component Analysis, Springer-Verlag, 2002.

[21] R. Kannan, M. Ishteva, H. Park, Bounded matrix low rank approximation, in: 12th International Conference on Data Mining, 2012, pp. 319–328.

[22] N. Karmarkar, Y. Lakshman, On approximate GCDs of univariate polynomials, in: S. Watt, H. Stetter (Eds.), Journal of Symbolic Computation 26 (1998) 653–666 (special issue on Symbolic-Numeric Algebra for Polynomials).

[23] T. Kailath, A. Sayed, Displacement structure: theory and applications, SIAM Review 37 (3) (1995) 297–386.

[24] R. Kumaresan, D. Tufts, Estimating the parameters of exponentially damped sinusoids and pole-zero modeling in noise, IEEE Transactions on Acoustics, Speech, and Signal Processing 30 (6) (1982) 833–840.

[25] S. Kung, A new identification method and model reduction algorithm via singular value decomposition, in: Proceedings of the 12th Asilomar Conference on Circuits, Systems, and Computers, Pacific Grove, 1978, pp. 705–714.

[26] E. Kaltofen, Z. Yang, L. Zhi, Approximate greatest common divisors of several polynomials with linearly constrained coefficients and singular polynomials, in: Proceedings of the International Symposium on Symbolic and Algebraic Computation, 2006, pp. 169–176.

[27] L. Ljung, System Identification: Theory for the User, Prentice-Hall, Upper Saddle River, NJ, 1999.

[28] P. Lemmerling, N. Mastronardi, S. Van Huffel, Fast algorithm for solving the Hankel/Toeplitz structured total least squares problem, Numerical Algorithms 23 (2000) 371–392.

[29] L. De Lathauwer, B. De Moor, J. Vandewalle, On the best rank-1 and rank-(R1, R2, …, RN) approximation of higher-order tensors, SIAM Journal on Matrix Analysis and Applications 21 (4) (2000) 1324–1342.

[30] Z. Liu, L. Vandenberghe, Interior-point method for nuclear norm approximation with application to system identification, SIAM Journal on Matrix Analysis and Applications 31 (3) (2009) 1235–1256.

[31] D. Marquardt, An algorithm for least-squares estimation of nonlinear parameters, SIAM Journal on Applied Mathematics 11 (1963) 431–441.

[32] I. Markovsky, Structured low-rank approximation and its applications, Automatica 44 (4) (2008) 891–909.

[33] I. Markovsky, Closed-loop data-driven simulation, International Journal of Control 83 (10) (2010) 2134–2139.

[34] I. Markovsky, Algorithms and literate programs for weighted low-rank approximation with missing data, in: Springer Proceedings in Mathematics, vol. 3, Springer, 2011, pp. 255–273.

[35] I. Markovsky, B. De Moor, Linear dynamic filtering with noisy input and output, Automatica 41 (1) (2005) 167–171.

[36] N. Mastronardi, P. Lemmerling, S. Van Huffel, Fast structured total least squares algorithm for solving the basic deconvolution problem, SIAM Journal on Matrix Analysis and Applications 22 (2000) 533–553.

[37] J. Manton, R. Mahony, Y. Hua, The geometry of weighted low-rank approximations, IEEE Transactions on Signal Processing 51 (2) (2003) 500–514.

[38] I. Markovsky, P. Rapisarda, Data-driven simulation and control, International Journal of Control 81 (12) (2008) 1946–1959.

[39] I. Markovsky, D. Sima, S. Van Huffel, Total least squares methods, Wiley Interdisciplinary Reviews: Computational Statistics 2 (2) (2010) 212–217.

[40] I. Markovsky, K. Usevich, Software for weighted structured low-rank approximation, Journal of Computational and Applied Mathematics 256 (2013) 278–292.

[41] I. Markovsky, K. Usevich, Structured low-rank approximation with missing data, SIAM Journal on Matrix Analysis and Applications 34 (2) (2013) 814–830.

[42] I. Markovsky, S. Van Huffel, Overview of total least squares methods, Signal Processing 87 (2007) 2283–2302.

[43] I. Markovsky, S. Van Huffel, R. Pintelon, Block-Toeplitz/Hankel structured total least squares, SIAM Journal on Matrix Analysis and Applications 26 (4) (2005) 1083–1099.

[44] I. Markovsky, J.C. Willems, S. Van Huffel, B. De Moor, Exact and approximate modeling of linear systems: a behavioral approach, in: Monographs on Mathematical Modeling and Computation, Number 11, SIAM, 2006.

[45] Y. Nievergelt, Hyperspheres and hyperplanes fitting seamlessly by algebraic constrained total least-squares, Linear Algebra and its Applications 331 (2001) 43–59.

[46] C.C. Paige, Computer solution and perturbation analysis of generalized linear least squares problems, Mathematics of Computation 33 (145) (1979) 171–183.

[47] K. Pearson, On lines and planes of closest fit to points in space, Philosophical Magazine 2 (1901) 559–572.

[48] J. Polderman, J.C. Willems, Introduction to Mathematical Systems Theory, Springer-Verlag, New York, 1998.

[49] B. Roorda, C. Heij, Global total least squares modeling of multivariate time series, IEEE Transactions on Automatic Control 40 (1) (1995) 50–63.

[50] B. Roorda, Algorithms for global total least squares modelling of finite multivariable time series, Automatica 31 (3) (1995) 391–404.

[51] J. Rosen, H. Park, J. Glick, Total least norm formulation and solution of structured problems, SIAM Journal on Matrix Analysis and Applications 17 (1996) 110–126.

[52] L. Scharf, The SVD and reduced rank signal processing, Signal Processing 25 (2) (1991) 113–133.

[53] P. Stoica, R. Moses, Spectral Analysis of Signals, Prentice Hall, 2005.

[54] T. Söderström, Errors-in-variables methods in system identification, Automatica 43 (2007) 939–958.

[55] T. Söderström, P. Stoica, System Identification, Prentice Hall, 1989.

[56] H. Stetter, Numerical Polynomial Algebra, SIAM, 2004.

[57] D. Tufts, A. Shah, Estimation of a signal waveform from noisy data using low-rank approximation to a data matrix, IEEE Transactions on Signal Processing 41 (4) (1993) 1716–1721.

[58] K. Usevich, I. Markovsky, Structured low-rank approximation as a rational function minimization, in: Proceedings of the 16th IFAC Symposium on System Identification, Brussels, 2012, pp. 722–727.

[59] K. Usevich, I. Markovsky, Variable projection for affinely structured low-rank approximation in weighted 2-norms, Journal of Computational and Applied Mathematics (2013), http://dx.doi.org/10.1016/j.cam.2013.04.034.

[60] P. Van Overschee, B. De Moor, Subspace Identification for Linear Systems: Theory, Implementation, Applications, Kluwer, Boston, 1996.

[61] P. Wentzell, D. Andrews, D. Hamilton, K. Faber, B. Kowalski, Maximum likelihood principal component analysis, Journal of Chemometrics 11 (1997) 339–366.

[62] J.C. Willems, From time series to linear system—Part I. Finite dimensional linear time invariant systems, Automatica 22 (1986) 561–580; J.C. Willems, From time series to linear system—Part II. Exact modelling, Automatica 22 (1986) 675–694; J.C. Willems, From time series to linear system—Part III. Approximate modelling, Automatica 23 (1987) 87–115.

[63] J.C. Willems, The behavioral approach to open and interconnected systems: modeling by tearing, zooming, and linking, Control Systems Magazine 27 (2007) 46–99. Available online ⟨http://homes.esat.kuleuven.be/~jwillems/Articles/JournalArticles/2007.1.pdf⟩.

[64] S. Weiland, F. Van Belzen, Singular value decompositions and low rank approximations of tensors, IEEE Transactions on Signal Processing 58 (3) (2010) 1171–1182.