
Foundations and Trends® in Machine Learning
Vol. 9, No. 4-5 (2016) 249–429
© 2017 A. Cichocki et al.
DOI: 10.1561/2200000059

Tensor Networks for Dimensionality Reduction and Large-Scale Optimization

Part 1: Low-Rank Tensor Decompositions

Andrzej Cichocki
RIKEN Brain Science Institute (BSI), Japan and Skolkovo Institute of Science and Technology (SKOLTECH)
[email protected]

Namgil Lee
RIKEN BSI
[email protected]

Ivan Oseledets
Skolkovo Institute of Science and Technology (SKOLTECH) and Institute of Numerical Mathematics of the Russian Academy of Sciences
[email protected]

Anh-Huy Phan
RIKEN BSI
[email protected]

Qibin Zhao
RIKEN BSI
[email protected]

Danilo P. Mandic
Department of Electrical and Electronic Engineering, Imperial College London
[email protected]


Contents

1 Introduction and Motivation
1.1 Challenges in Big Data Processing
1.2 Tensor Notations and Graphical Representations
1.3 Curse of Dimensionality and Generalized Separation of Variables for Multivariate Functions
1.4 Advantages of Multiway Analysis via Tensor Networks
1.5 Scope and Objectives

2 Tensor Operations and Tensor Network Diagrams
2.1 Basic Multilinear Operations
2.2 Graphical Representation of Fundamental Tensor Networks
2.3 Generalized Tensor Network Formats

3 Constrained Tensor Decompositions: From Two-way to Multiway Component Analysis
3.1 Constrained Low-Rank Matrix Factorizations
3.2 The CP Format
3.3 The Tucker Tensor Format
3.4 Higher Order SVD (HOSVD) for Large-Scale Problems
3.5 Tensor Sketching Using Tucker Model
3.6 Multiway Component Analysis (MWCA)


3.7 Nonlinear Tensor Decompositions – Infinite Tucker

4 Tensor Train Decompositions: Graphical Interpretations and Algorithms
4.1 Tensor Train Decomposition – Matrix Product State
4.2 Matrix TT Decomposition – Matrix Product Operator
4.3 Links Between CP, BTD Formats and TT/TC Formats
4.4 Quantized Tensor Train (QTT) – Blessing of Dimensionality
4.5 Basic Operations in TT Formats
4.6 Algorithms for TT Decompositions

5 Discussion and Conclusions

Acknowledgements

References


Abstract

Modern applications in engineering and data science are increasingly based on multidimensional data of exceedingly high volume, variety, and structural richness. However, standard machine learning algorithms typically scale exponentially with data volume and with the complexity of cross-modal couplings (the so-called curse of dimensionality), which is prohibitive to the analysis of large-scale, multi-modal and multi-relational datasets. Given that such data are often efficiently represented as multiway arrays or tensors, it is therefore timely and valuable for the multidisciplinary machine learning and data analytics communities to review low-rank tensor decompositions and tensor networks as emerging tools for dimensionality reduction and large-scale optimization problems. Our particular emphasis is on elucidating that, by virtue of the underlying low-rank approximations, tensor networks have the ability to alleviate the curse of dimensionality in a number of applied areas. In Part 1 of this monograph we provide innovative solutions to low-rank tensor network decompositions and easy-to-interpret graphical representations of the mathematical operations on tensor networks. Such a conceptual insight allows for seamless migration of ideas from the flat-view matrices to tensor network operations and vice versa, and provides a platform for further developments, practical applications, and non-Euclidean extensions. It also permits the introduction of various tensor network operations without an explicit notion of mathematical expressions, which may be beneficial for many research communities that do not directly rely on multilinear algebra. Our focus is on the Tucker and tensor train (TT) decompositions and their extensions, and on demonstrating the ability of tensor networks to provide linearly or even super-linearly (e.g., logarithmically) scalable solutions, as illustrated in detail in Part 2 of this monograph.

A. Cichocki et al. Tensor Networks for Dimensionality Reduction and Large-Scale Optimization. Part 1: Low-Rank Tensor Decompositions. Foundations and Trends® in Machine Learning, vol. 9, no. 4-5, pp. 249–429, 2016. DOI: 10.1561/2200000059.


1 Introduction and Motivation

This monograph aims to present a coherent account of ideas and methodologies related to tensor decompositions (TDs) and tensor network (TN) models. Tensor decompositions decompose complex data tensors of exceedingly high dimensionality into their factor (component) tensors and matrices, while tensor networks decompose higher-order tensors into sparsely interconnected small-scale factor matrices and/or low-order core tensors. These low-order core tensors are called "components", "blocks", "factors" or simply "cores". In this way, large-scale data can be approximately represented in highly compressed and distributed formats.

In this monograph, the TDs and TNs are treated in a unified way, by considering TDs as simple tensor networks or sub-networks; the terms "tensor decompositions" and "tensor networks" will therefore be used interchangeably. Tensor networks can be thought of as special graph structures which break down high-order tensors into a set of sparsely interconnected low-order core tensors, thus allowing for both enhanced interpretation and computational advantages. Such an approach is valuable in many application contexts which require the computation of eigenvalues and the corresponding eigenvectors of extremely high-dimensional linear or nonlinear operators. These operators typically describe the coupling between many degrees of freedom within real-world physical systems; such degrees of freedom are often only weakly coupled. Indeed, quantum physics provides evidence that couplings between multiple data channels usually do not exist among all the degrees of freedom but mostly locally, whereby "relevant" information, of relatively low dimensionality, is embedded into very large-dimensional measurements (Verstraete et al., 2008; Schollwöck, 2013; Orús, 2014; Murg et al., 2015).

Tensor networks offer a theoretical and computational framework for the analysis of computationally prohibitive large volumes of data, by "dissecting" such data into the "relevant" and "irrelevant" information, both of lower dimensionality. In this way, tensor network representations often allow for super-compression of datasets as large as $10^{50}$ entries, down to the affordable levels of $10^7$ or even fewer entries (Oseledets and Tyrtyshnikov, 2009; Dolgov and Khoromskij, 2013; Kazeev et al., 2013a, 2014; Kressner et al., 2014a; Vervliet et al., 2014; Dolgov and Khoromskij, 2015; Liao et al., 2015; Bolten et al., 2016).

With the emergence of the big data paradigm, it is therefore both timely and important to provide the multidisciplinary machine learning and data analytics communities with a comprehensive overview of tensor networks, together with example-rich guidance on their application in several generic optimization problems for huge-scale structured data. Our aim is also to unify the terminology, notation, and algorithms for tensor decompositions and tensor networks which are being developed not only in machine learning, signal processing, numerical analysis and scientific computing, but also in quantum physics/chemistry for the representation of, e.g., quantum many-body systems.

1.1 Challenges in Big Data Processing

The volume and structural complexity of modern datasets are becoming exceedingly high, to the extent which renders standard analysis methods and algorithms inadequate. Apart from the huge Volume, the other features which characterize big data include Veracity, Variety and Velocity (see Figures 1.1(a) and (b)). Each of the "V features" represents a research challenge in its own right. For example, high Volume implies the need for algorithms that are scalable; high Velocity requires the processing of big data streams in near real-time; high Veracity calls for robust and predictive algorithms for noisy, incomplete and/or inconsistent data; high Variety demands the fusion of different data types, e.g., continuous, discrete, binary, time series, images, video, text, probabilistic or multi-view. Some applications give rise to additional "V challenges", such as Visualization, Variability and Value. The Value feature is particularly interesting and refers to the extraction of high quality and consistent information, from which meaningful and interpretable results can be obtained.

Owing to the increasingly affordable recording devices, extreme-scale volumes and variety of data are becoming ubiquitous across the science and engineering disciplines. In the case of multimedia (speech, video), remote sensing and medical/biological data, the analysis also requires a paradigm shift in order to efficiently process massive datasets within tolerable time (velocity). Such massive datasets may have billions of entries and are typically represented in the form of huge block matrices and/or tensors. This has spurred a renewed interest in the development of matrix/tensor algorithms that are suitable for very large-scale datasets. We show that tensor networks provide a natural sparse and distributed representation for big data, and address both established and emerging methodologies for tensor-based representations and optimization. Our particular focus is on low-rank tensor network representations, which allow for huge data tensors to be approximated (compressed) by interconnected low-order core tensors.

1.2 Tensor Notations and Graphical Representations

Tensors are multi-dimensional generalizations of matrices. A matrix (2nd-order tensor) has two modes, rows and columns, while an $N$th-order tensor has $N$ modes (see Figures 1.2–1.7); for example, a 3rd-order tensor (with three modes) looks like a cube (see Figure 1.2). Subtensors are formed when a subset of tensor indices is fixed. Of particular interest are fibers, which are vectors obtained by fixing every tensor index but one, and matrix slices, which are two-dimensional sections (matrices) of a tensor, obtained by fixing all the tensor indices but two. It should be noted that block matrices can also be represented by tensors, as illustrated in Figure 1.3 for 4th-order tensors.


Figure 1.1: A framework for extremely large-scale data analysis. (a) The 4V challenges for big data (Volume, Velocity, Variety, Veracity). (b) A unified framework for the 4V challenges and the potential applications based on tensor decomposition approaches.


Figure 1.2: A 3rd-order tensor $\underline{\mathbf{X}} \in \mathbb{R}^{I \times J \times K}$, with entries $x_{i,j,k} = \underline{\mathbf{X}}(i, j, k)$, and its subtensors: horizontal, lateral and frontal slices $\underline{\mathbf{X}}(i,:,:)$, $\underline{\mathbf{X}}(:,j,:)$, $\underline{\mathbf{X}}(:,:,k)$ (middle), and column (mode-1), row (mode-2) and tube (mode-3) fibers (bottom). All fibers are treated as column vectors.

We adopt the notation whereby tensors (for $N \geq 3$) are denoted by bold underlined capital letters, e.g., $\underline{\mathbf{X}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$. For simplicity, we assume that all tensors are real-valued, but it is, of course, possible to define tensors as complex-valued or over arbitrary fields. Matrices are denoted by boldface capital letters, e.g., $\mathbf{X} \in \mathbb{R}^{I \times J}$, and vectors (1st-order tensors) by boldface lowercase letters, e.g., $\mathbf{x} \in \mathbb{R}^{J}$. For example, the columns of the matrix $\mathbf{A} = [\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_R] \in \mathbb{R}^{I \times R}$ are the vectors denoted by $\mathbf{a}_r \in \mathbb{R}^{I}$, while the elements of a matrix (scalars) are denoted by lowercase letters, e.g., $a_{ir} = \mathbf{A}(i, r)$ (see Table 1.1).

Figure 1.3: A block matrix and its representation as a 4th-order tensor, created by reshaping (or a projection) of blocks in the rows into lateral slices of 3rd-order tensors.

Figure 1.4: Graphical representation of multiway array (tensor) data of increasing structural complexity and "Volume" (see (Olivieri, 2008) for more detail).

A specific entry of an $N$th-order tensor $\underline{\mathbf{X}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ is denoted by $x_{i_1, i_2, \ldots, i_N} = \underline{\mathbf{X}}(i_1, i_2, \ldots, i_N) \in \mathbb{R}$. The order of a tensor is the number of its "modes", "ways" or "dimensions", which can include space, time, frequency, trials, classes, and dictionaries. The term "size" stands for the number of values that an index can take in a particular mode. For example, the tensor $\underline{\mathbf{X}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ is of order $N$ and size $I_n$ in all modes $n$ $(n = 1, 2, \ldots, N)$. Lowercase letters, e.g., $i, j$, are used for the subscripts of running indices, and capital letters $I, J$ denote the upper bound of an index, i.e., $i = 1, 2, \ldots, I$ and $j = 1, 2, \ldots, J$. For a positive integer $n$, the shorthand notation $\langle n \rangle$ denotes the set of indices $\{1, 2, \ldots, n\}$.

Figure 1.5: Graphical representation of tensor manipulations. (a) Basic building blocks for tensor network diagrams: a scalar $a$, a vector $\mathbf{a}$, a matrix $\mathbf{A}$, a 3rd-order tensor $\underline{\mathbf{A}}$ and a 3rd-order diagonal tensor $\underline{\mathbf{\Lambda}}$. (b) Tensor network diagrams for matrix-vector multiplication (top), matrix-by-matrix multiplication (middle) and contraction of two tensors (bottom), e.g., $\sum_{k=1}^{K} a_{i,j,k}\, b_{k,l,m,p} = c_{i,j,l,m,p}$. The order of reading of indices is anti-clockwise, from the left position.


Table 1.1: Basic matrix/tensor notation and symbols.

$\underline{\mathbf{X}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ – $N$th-order tensor of size $I_1 \times I_2 \times \cdots \times I_N$

$x_{i_1, i_2, \ldots, i_N} = \underline{\mathbf{X}}(i_1, i_2, \ldots, i_N)$ – $(i_1, i_2, \ldots, i_N)$th entry of $\underline{\mathbf{X}}$

$x$, $\mathbf{x}$, $\mathbf{X}$ – scalar, vector and matrix

$\underline{\mathbf{G}}$, $\underline{\mathbf{S}}$, $\underline{\mathbf{G}}^{(n)}$, $\underline{\mathbf{X}}^{(n)}$ – core tensors

$\underline{\mathbf{\Lambda}} \in \mathbb{R}^{R \times R \times \cdots \times R}$ – $N$th-order diagonal core tensor with nonzero entries $\lambda_r$ on the main diagonal

$\mathbf{A}^{\mathrm{T}}$, $\mathbf{A}^{-1}$, $\mathbf{A}^{\dagger}$ – transpose, inverse and Moore–Penrose pseudo-inverse of a matrix $\mathbf{A}$

$\mathbf{A} = [\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_R] \in \mathbb{R}^{I \times R}$ – matrix with $R$ column vectors $\mathbf{a}_r \in \mathbb{R}^{I}$, with entries $a_{ir}$

$\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$, $\mathbf{A}^{(n)}$, $\mathbf{B}^{(n)}$, $\mathbf{U}^{(n)}$ – component (factor) matrices

$\mathbf{X}_{(n)} \in \mathbb{R}^{I_n \times I_1 \cdots I_{n-1} I_{n+1} \cdots I_N}$ – mode-$n$ matricization of $\underline{\mathbf{X}} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$

$\mathbf{X}_{\langle n \rangle} \in \mathbb{R}^{I_1 I_2 \cdots I_n \times I_{n+1} \cdots I_N}$ – mode-$(1, \ldots, n)$ matricization of $\underline{\mathbf{X}} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$

$\underline{\mathbf{X}}(:, i_2, i_3, \ldots, i_N) \in \mathbb{R}^{I_1}$ – mode-1 fiber of a tensor $\underline{\mathbf{X}}$, obtained by fixing all indices but one (a vector)

$\underline{\mathbf{X}}(:, :, i_3, \ldots, i_N) \in \mathbb{R}^{I_1 \times I_2}$ – slice (matrix) of a tensor $\underline{\mathbf{X}}$, obtained by fixing all indices but two

$\underline{\mathbf{X}}(:, :, :, i_4, \ldots, i_N)$ – subtensor of $\underline{\mathbf{X}}$, obtained by fixing several indices

$R$, $(R_1, \ldots, R_N)$ – tensor rank $R$ and multilinear rank

$\circ$, $\odot$, $\otimes$ – outer, Khatri–Rao and Kronecker products

$\otimes_L$, $|\otimes|$ – Left Kronecker and strong Kronecker products

$\mathbf{x} = \mathrm{vec}(\underline{\mathbf{X}})$ – vectorization of $\underline{\mathbf{X}}$

$\mathrm{tr}(\cdot)$ – trace of a square matrix

$\mathrm{diag}(\cdot)$ – diagonal matrix


Table 1.2: Terminology used for tensor networks across the machine learning/scientific computing and quantum physics/chemistry communities.

Machine Learning – Quantum Physics

$N$th-order tensor – rank-$N$ tensor

high/low-order tensor – tensor of high/low dimension

ranks of TNs – bond dimensions of TNs

unfolding, matricization – grouping of indices

tensorization – splitting of indices

core – site

variables – open (physical) indices

ALS Algorithm – one-site DMRG or DMRG1

MALS Algorithm – two-site DMRG or DMRG2

column vector $\mathbf{x} \in \mathbb{R}^{I \times 1}$ – ket $|\Psi\rangle$

row vector $\mathbf{x}^{\mathrm{T}} \in \mathbb{R}^{1 \times I}$ – bra $\langle\Psi|$

inner product $\langle \mathbf{x}, \mathbf{x} \rangle = \mathbf{x}^{\mathrm{T}}\mathbf{x}$ – $\langle\Psi|\Psi\rangle$

Tensor Train (TT) – Matrix Product State (MPS) with Open Boundary Conditions (OBC)

Tensor Chain (TC) – MPS with Periodic Boundary Conditions (PBC)

Matrix TT – Matrix Product Operator (MPO) with OBC

Hierarchical Tucker (HT) – Tree Tensor Network State (TTNS) with rank-3 tensors


Notations and terminology used for tensors and tensor networks differ across the scientific communities (see Table 1.2); to this end we employ a unifying notation particularly suitable for machine learning and signal processing research, which is summarized in Table 1.1.

Even with the above notation conventions, a precise description of tensors and tensor operations is often tedious and cumbersome, given the multitude of indices involved. To this end, in this monograph, we grossly simplify the description of tensors and their mathematical operations through diagrammatic representations borrowed from physics and quantum chemistry (see (Orús, 2014) and references therein). In this way, tensors are represented graphically by nodes of any geometrical shape (e.g., circles, squares, dots), while each outgoing line ("edge", "leg", "arm") from a node represents the indices of a specific mode (see Figure 1.5(a)). In our adopted notation, each scalar (zero-order tensor), vector (1st-order tensor), matrix (2nd-order tensor), 3rd-order tensor or higher-order tensor is represented by a circle (or rectangle), while the order of a tensor is determined by the number of lines (edges) connected to it. According to this notation, an $N$th-order tensor $\underline{\mathbf{X}} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ is represented by a circle (or any shape) with $N$ branches, each of size $I_n$, $n = 1, 2, \ldots, N$ (see Section 2). An interconnection between two circles designates a contraction of tensors, which is a summation of products over a common index (see Figure 1.5(b) and Section 2).

Block tensors, where each entry (e.g., of a matrix or a vector) is an individual subtensor, can be represented in a similar graphical form, as illustrated in Figure 1.6. Hierarchical (multilevel block) matrices are also naturally represented by tensors and vice versa, as illustrated in Figure 1.7 for 4th-, 5th- and 6th-order tensors. All mathematical operations on tensors can therefore equally be performed on block matrices.

In this monograph, we make extensive use of tensor network diagrams as an intuitive and visual way to efficiently represent tensor decompositions. Such graphical notations are of great help in studying and implementing sophisticated tensor operations. We highlight the significant advantages of such diagrammatic notations in the description of tensor manipulations, and show that most tensor operations can be visualized through changes in the architecture of a tensor network diagram.

Figure 1.6: Graphical representations and symbols for higher-order block tensors. Each block represents either a 3rd-order tensor or a 2nd-order tensor. The outer circle indicates the global structure of the block tensor (e.g., a vector, a matrix, a 3rd-order block tensor), while the inner circle reflects the structure of each element within the block tensor. For example, in the top diagram a vector of 3rd-order tensors is represented by an outer circle with one edge (a vector) which surrounds an inner circle with three edges (a 3rd-order tensor), so that the whole structure designates a 4th-order tensor.

1.3 Curse of Dimensionality and Generalized Separation of Variables for Multivariate Functions

1.3.1 Curse of Dimensionality

The term curse of dimensionality was coined by Bellman (1961) to indicate that the number of samples needed to estimate an arbitrary function with a given level of accuracy grows exponentially with the number of variables, that is, with the dimensionality of the function. In the general context of machine learning and the underlying optimization problems, the "curse of dimensionality" may also refer to an exponentially increasing number of parameters required to describe the data/system, or to an extremely large number of degrees of freedom.

The term "curse of dimensionality", in the context of tensors, refers to the phenomenon whereby the number of elements, $I^N$, of an $N$th-order tensor of size $I \times I \times \cdots \times I$ grows exponentially with the tensor order, $N$. The tensor volume can therefore easily become prohibitively big for multiway arrays for which the number of dimensions ("ways" or "modes") is very high, thus requiring enormous computational and memory resources to process such data. The understanding and handling of the inherent dependencies among the excessive degrees of freedom create both difficult-to-solve problems and fascinating new opportunities, but this comes at the price of reduced accuracy, owing to the necessity to involve various approximations.

Figure 1.7: Hierarchical matrix structures and their symbolic representation as tensors. (a) A 4th-order tensor representation for a block matrix $\mathbf{X} \in \mathbb{R}^{R_1 I_1 \times R_2 I_2}$ (a matrix of matrices), which comprises block matrices $\mathbf{X}_{r_1, r_2} \in \mathbb{R}^{I_1 \times I_2}$. (b) A 5th-order tensor. (c) A 6th-order tensor.

We show that the curse of dimensionality can be alleviated or even fully dealt with through tensor network representations; these naturally cater for the excessive volume, veracity and variety of data (see Figure 1.1) and are supported by efficient tensor decomposition algorithms which involve relatively simple mathematical operations. Another desirable aspect of tensor networks is their relatively small-scale and low-order core tensors, which act as "building blocks" of tensor networks. These core tensors are relatively easy to handle and visualize, and enable super-compression of the raw, incomplete, and noisy huge-scale datasets. This also suggests a solution to a more general quest for new technologies for processing of exceedingly large datasets within affordable computation times.

To address the curse of dimensionality, this work mostly focuses on approximate low-rank representations of tensors, the so-called low-rank tensor approximations (LRTA) or low-rank tensor network decompositions.

1.3.2 Separation of Variables and Tensor Formats

A tensor is said to be in a full format when it is represented as an original (raw) multidimensional array (Klus and Schütte, 2015); however, distributed storage and processing of high-order tensors in their full format is infeasible due to the curse of dimensionality. The sparse format is a variant of the full tensor format which stores only the nonzero entries of a tensor, and is used extensively in software tools such as the Tensor Toolbox (Bader and Kolda, 2015) and in the sparse grid approach (Garcke et al., 2001; Bungartz and Griebel, 2004; Hackbusch, 2012).

As already mentioned, the problem of huge dimensionality can be alleviated through various distributed and compressed tensor network formats, achieved by low-rank tensor network approximations. The underpinning idea is that by employing tensor network formats, both computational costs and storage requirements may be dramatically reduced through distributed storage and computing resources. It is important to note that, except for very special data structures, a tensor cannot be compressed without incurring some compression error, since a low-rank tensor representation is only an approximation of the original tensor.

The concept of compression of multidimensional large-scale data by tensor network decompositions can be intuitively explained as follows. Consider the approximation of an $N$-variate function $f(\mathbf{x}) = f(x_1, x_2, \ldots, x_N)$ by a finite sum of products of individual functions, each depending on only one or a very few variables (Bebendorf, 2011; Dolgov, 2014; Cho et al., 2016; Trefethen, 2017). In the simplest scenario, the function $f(\mathbf{x})$ can be (approximately) represented in the following separable form

$$f(x_1, x_2, \ldots, x_N) \cong f^{(1)}(x_1)\, f^{(2)}(x_2) \cdots f^{(N)}(x_N). \qquad (1.1)$$

In practice, when an $N$-variate function $f(\mathbf{x})$ is discretized into an $N$th-order array, or a tensor, the approximation in (1.1) then corresponds to the representation by rank-1 tensors, also called elementary tensors (see Section 2). Observe that with $I_n$, $n = 1, 2, \ldots, N$, denoting the size of each mode and $I = \max_n \{I_n\}$, the memory requirement to store such a full tensor is $\prod_{n=1}^{N} I_n \leq I^N$, which grows exponentially with $N$. On the other hand, the separable representation in (1.1) is completely defined by its factors, $f^{(n)}(x_n)$, $(n = 1, 2, \ldots, N)$, and requires only $\sum_{n=1}^{N} I_n \ll I^N$ storage units. If $x_1, x_2, \ldots, x_N$ are statistically independent random variables, their joint probability density function is equal to the product of marginal probabilities, $f(\mathbf{x}) = f^{(1)}(x_1)\, f^{(2)}(x_2) \cdots f^{(N)}(x_N)$, in an exact analogy to outer products of elementary tensors. Unfortunately, the form of separability in (1.1) is rather rare in practice.
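To make the storage argument concrete, the following minimal NumPy sketch (an illustrative example, not code from the original text, with arbitrarily chosen mode sizes and factor functions) discretizes a separable 4-variate function, stores only its univariate factors, and compares the $\sum_n I_n$ storage of the factors with the $\prod_n I_n$ storage of the full tensor reconstructed via outer products.

```python
import numpy as np

# Illustrative mode sizes I_1, ..., I_4.
dims = (20, 30, 25, 15)
grids = [np.linspace(0.0, 1.0, I) for I in dims]

# Univariate factors f^(n)(x_n) of a separable function, sampled on the grids.
factors = [np.exp(-grids[0]),
           np.sin(np.pi * grids[1]),
           1.0 + grids[2] ** 2,
           np.cos(grids[3])]

# Full Nth-order tensor built as the outer product of the factors
# (a rank-1, or "elementary", tensor), cf. Eq. (1.1).
full = np.einsum('i,j,k,l->ijkl', *factors)

print('full tensor entries:', full.size)    # prod_n I_n = 225000
print('factor entries     :', sum(dims))    # sum_n  I_n = 90
```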


The concept of tensor networks rests upon generalized (full or partial) separability of the variables of a high-dimensional function. This can be achieved in different tensor formats, including:

• The Canonical Polyadic (CP) format (see Section 3.2), where

$$f(x_1, x_2, \ldots, x_N) \cong \sum_{r=1}^{R} f_r^{(1)}(x_1)\, f_r^{(2)}(x_2) \cdots f_r^{(N)}(x_N), \qquad (1.2)$$

in an exact analogy to (1.1). In a discretized form, the above CP format can be written as an $N$th-order tensor

$$\underline{\mathbf{F}} \cong \sum_{r=1}^{R} \mathbf{f}_r^{(1)} \circ \mathbf{f}_r^{(2)} \circ \cdots \circ \mathbf{f}_r^{(N)} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}, \qquad (1.3)$$

where $\mathbf{f}_r^{(n)} \in \mathbb{R}^{I_n}$ denotes a discretized version of the univariate function $f_r^{(n)}(x_n)$, the symbol $\circ$ denotes the outer product, and $R$ is the tensor rank.

• The Tucker format, given by

$$f(x_1, \ldots, x_N) \cong \sum_{r_1=1}^{R_1} \cdots \sum_{r_N=1}^{R_N} g_{r_1, \ldots, r_N}\, f_{r_1}^{(1)}(x_1) \cdots f_{r_N}^{(N)}(x_N), \qquad (1.4)$$

and its distributed tensor network variants (see Section 3.3).

• The Tensor Train (TT) format (see Section 4.1), in the form

$$f(x_1, x_2, \ldots, x_N) \cong \sum_{r_1=1}^{R_1} \sum_{r_2=1}^{R_2} \cdots \sum_{r_{N-1}=1}^{R_{N-1}} f_{r_1}^{(1)}(x_1)\, f_{r_1 r_2}^{(2)}(x_2) \cdots f_{r_{N-2} r_{N-1}}^{(N-1)}(x_{N-1})\, f_{r_{N-1}}^{(N)}(x_N), \qquad (1.5)$$

with the equivalent compact matrix representation

$$f(x_1, x_2, \ldots, x_N) \cong \mathbf{F}^{(1)}(x_1)\, \mathbf{F}^{(2)}(x_2) \cdots \mathbf{F}^{(N)}(x_N), \qquad (1.6)$$

where $\mathbf{F}^{(n)}(x_n) \in \mathbb{R}^{R_{n-1} \times R_n}$, with $R_0 = R_N = 1$ (a small numerical sketch of this matrix-product evaluation is given after this list of formats).

• The Hierarchical Tucker (HT) format (also known as the Hierarchical Tensor format) can be expressed via a hierarchy of nested separations in the following way. Consider nested nonempty disjoint subsets $u$, $v$, and $t = u \cup v \subseteq \{1, 2, \ldots, N\}$; then, for some $1 \leq N_0 < N$, with $u_0 = \{1, \ldots, N_0\}$ and $v_0 = \{N_0 + 1, \ldots, N\}$, the HT format can be expressed as

$$f(x_1, \ldots, x_N) \cong \sum_{r_{u_0}=1}^{R_{u_0}} \sum_{r_{v_0}=1}^{R_{v_0}} g^{(12 \cdots N)}_{r_{u_0}, r_{v_0}}\, f^{(u_0)}_{r_{u_0}}(\mathbf{x}_{u_0})\, f^{(v_0)}_{r_{v_0}}(\mathbf{x}_{v_0}),$$

$$f^{(t)}_{r_t}(\mathbf{x}_t) \cong \sum_{r_u=1}^{R_u} \sum_{r_v=1}^{R_v} g^{(t)}_{r_u, r_v, r_t}\, f^{(u)}_{r_u}(\mathbf{x}_u)\, f^{(v)}_{r_v}(\mathbf{x}_v),$$

where $\mathbf{x}_t = \{x_i : i \in t\}$. See Section 2.2.1 for more detail.

Example. In the particular case of $N = 4$, the HT format can be expressed by

$$f(x_1, x_2, x_3, x_4) \cong \sum_{r_{12}=1}^{R_{12}} \sum_{r_{34}=1}^{R_{34}} g^{(1234)}_{r_{12}, r_{34}}\, f^{(12)}_{r_{12}}(x_1, x_2)\, f^{(34)}_{r_{34}}(x_3, x_4),$$

$$f^{(12)}_{r_{12}}(x_1, x_2) \cong \sum_{r_1=1}^{R_1} \sum_{r_2=1}^{R_2} g^{(12)}_{r_1, r_2, r_{12}}\, f^{(1)}_{r_1}(x_1)\, f^{(2)}_{r_2}(x_2),$$

$$f^{(34)}_{r_{34}}(x_3, x_4) \cong \sum_{r_3=1}^{R_3} \sum_{r_4=1}^{R_4} g^{(34)}_{r_3, r_4, r_{34}}\, f^{(3)}_{r_3}(x_3)\, f^{(4)}_{r_4}(x_4).$$

The Tree Tensor Network States (TTNS) format, which is an extension of the HT format, can be obtained by generalizing the two subsets, $u$, $v$, into a larger number of disjoint subsets $u_1, \ldots, u_m$, $m \geq 2$. In other words, the TTNS can be obtained by more flexible separations of variables through products of larger numbers of functions at each hierarchical level (see Section 2.2.1 for graphical illustrations and more detail).

All the above approximations adopt the form of a "sum-of-products" of single-dimensional functions, a procedure which plays a key role in all tensor factorizations and decompositions.
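As an illustration of the TT representation in (1.6), the following minimal NumPy sketch (a toy example with randomly filled cores, not code from the original text) stores a multivariate function as a set of TT cores and evaluates it at a given grid point through a chain of small matrix products.

```python
import numpy as np

rng = np.random.default_rng(0)

# Mode sizes I_1..I_4 and TT ranks R_0..R_4 with R_0 = R_4 = 1.
dims  = (10, 12, 8, 9)
ranks = (1, 3, 4, 2, 1)

# TT cores G_n of shape (R_{n-1}, I_n, R_n), filled with random values
# purely for illustration.
cores = [rng.standard_normal((ranks[n], dims[n], ranks[n + 1]))
         for n in range(len(dims))]

def tt_eval(cores, idx):
    """Evaluate the TT-represented tensor at a multi-index, cf. Eq. (1.6):
    f(x_1, ..., x_N) = F^(1)(x_1) F^(2)(x_2) ... F^(N)(x_N),
    where F^(n)(x_n) = G_n[:, i_n, :] is an R_{n-1} x R_n matrix."""
    result = np.eye(1)                  # 1 x 1 "matrix", since R_0 = 1
    for G, i in zip(cores, idx):
        result = result @ G[:, i, :]    # chain of small matrix products
    return result.item()                # final result is 1 x 1, i.e. a scalar

print(tt_eval(cores, (3, 7, 0, 5)))
```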

Indeed, in many applications based on multivariate functions, very good approximations are obtained with a surprisingly small number of factors; this number corresponds to the tensor rank, $R$, or tensor network ranks, $\{R_1, R_2, \ldots, R_N\}$ (if the representations are exact and minimal). However, for some specific cases this approach may fail to obtain sufficiently good low-rank TN approximations. The concept of generalized separability has already been explored in numerical methods for high-dimensional density function equations (Liao et al., 2015; Trefethen, 2017; Cho et al., 2016) and within a variety of huge-scale optimization problems (see Part 2 of this monograph).

To illustrate how tensor decompositions address excessive volumes of data, if all computations are performed in the CP tensor format in (1.3), rather than on the raw $N$th-order data tensor itself, then instead of the original, exponentially growing, data dimensionality of $I^N$, the number of parameters in the CP representation reduces to $NIR$, which scales linearly in the tensor order $N$ and size $I$ (see Table 4.4). For example, the discretization of a 5-variate function over 100 sample points on each axis would require the storage of $100^5 = 10{,}000{,}000{,}000$ sample points, while a rank-2 CP representation would require only $5 \times 2 \times 100 = 1000$ sample points.
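The parameter count can be checked directly; the short NumPy sketch below (illustrative only, with hypothetical random factor matrices) prints the counts for the example above and, on a deliberately tiny size, reconstructs a rank-$R$ CP tensor from its factors as in (1.3).

```python
import numpy as np

N, I, R = 5, 100, 2                        # order, mode size, CP rank

# Parameter counts only; the full tensor (I**N entries) would not fit in memory.
print('full tensor entries:', I ** N)      # 100**5 = 10,000,000,000
print('CP parameters      :', N * I * R)   # 5*2*100 = 1,000

# Sanity check on a toy size: reconstruct a rank-R CP tensor from its
# factor matrices via einsum (the columns of A^(n) are the vectors f_r^(n)).
rng = np.random.default_rng(1)
dims = (4, 5, 3, 6, 2)                     # small mode sizes on purpose
factors = [rng.standard_normal((In, R)) for In in dims]
F = np.einsum('ar,br,cr,dr,er->abcde', *factors)
print('toy CP tensor shape:', F.shape)     # (4, 5, 3, 6, 2)
```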

Although the CP format in (1.2) effectively bypasses the curse of dimensionality, the CP approximation may involve numerical problems for very high-order tensors: in addition to the intrinsic uncloseness of the CP format (i.e., the difficulty of arriving at a canonical format), the corresponding best low-rank CP approximation problems are often ill-posed (de Silva and Lim, 2008). As a remedy, greedy approaches may be considered which, for enhanced stability, perform consecutive rank-1 corrections (Lim and Comon, 2010). On the other hand, many efficient and stable algorithms exist for the more flexible Tucker format in (1.4); however, this format is not practical for tensor orders $N > 5$, because the number of entries of both the original data tensor and the core tensor (expressed in (1.4) by the elements $g_{r_1, r_2, \ldots, r_N}$) scales exponentially in the tensor order $N$ (curse of dimensionality).

In contrast to CP decomposition algorithms, the TT tensor network format in (1.5) exhibits both very good numerical properties and the ability to control the error of approximation, so that a desired accuracy of approximation is obtained relatively easily. The main advantage of the TT format over the CP decomposition is its ability to provide stable quasi-optimal rank reduction, achieved through, for example, truncated singular value decompositions (tSVD) or adaptive cross-approximation (Oseledets and Tyrtyshnikov, 2010; Bebendorf, 2011; Khoromskij and Veit, 2016). This makes the TT format one of the most stable and simple approaches to separate latent variables in a sophisticated way, while the associated TT decomposition algorithms provide full control over low-rank TN approximations¹. In this monograph, we therefore make extensive use of the TT format for low-rank TN approximations and employ the TT toolbox software for efficient implementations (Oseledets et al., 2012). The TT format will also serve as a basic prototype for high-order tensor representations, while we also consider the Hierarchical Tucker (HT) and the Tree Tensor Network States (TTNS) formats (having more general tree-like structures) whenever advantageous in applications.

Furthermore, we address in depth the concept of tensorization of structured vectors and matrices to convert a wide class of huge-scale optimization problems into much smaller-scale interconnected optimization sub-problems which can be solved by existing optimization methods (see Part 2 of this monograph).

The tensor network optimization framework is therefore realized through the following two main steps:

• Tensorization of data vectors and matrices into a high-order tensor, followed by a distributed approximate representation of a cost function in a specific low-rank tensor network format.

• Execution of all computations and analysis in tensor network formats (i.e., using only core tensors) that scale linearly, or even sub-linearly (quantized tensor networks), in the tensor order $N$. This yields both reduced computational complexity and distributed memory requirements.

¹ Although similar approaches have been known in quantum physics for a long time, their rigorous mathematical analysis is still a work in progress (see (Oseledets, 2011; Orús, 2014) and references therein).


1.4 Advantages of Multiway Analysis via Tensor Networks

In this monograph, we focus on two main challenges in huge-scale data analysis which are addressed by tensor networks: (i) an approximate representation of a specific cost (objective) function by a tensor network while maintaining the desired accuracy of approximation, and (ii) the extraction of physically meaningful latent variables from data in a sufficiently accurate and computationally affordable way. The benefits of multiway (tensor) analysis methods for large-scale datasets then include:

• Ability to perform all mathematical operations in tractable tensor network formats;

• Simultaneous and flexible distributed representations of both the structurally rich data and complex optimization tasks;

• Efficient compressed formats of large multidimensional data achieved via tensorization and low-rank tensor decompositions into low-order factor matrices and/or core tensors;

• Ability to operate with noisy and missing data by virtue of numerical stability and robustness to noise of low-rank tensor/matrix approximation algorithms;

• A flexible framework which naturally incorporates various diversities and constraints, thus seamlessly extending the standard, flat-view, Component Analysis (2-way CA) methods to multiway component analysis;

• Possibility to analyze linked (coupled) blocks of large-scale matrices and tensors in order to separate common/correlated from independent/uncorrelated components in the observed raw data;

• Graphical representations of tensor networks allow us to express mathematical operations on tensors (e.g., tensor contractions and reshaping) in a simple and intuitive way, and without the explicit use of complex mathematical expressions.


In that sense, this monograph both reviews current research in this area and complements optimization methods, such as the Alternating Direction Method of Multipliers (ADMM) (Boyd et al., 2011).

Tensor decompositions (TDs) have already been adopted in widely diverse disciplines, including psychometrics, chemometrics, biometrics, quantum physics/information, quantum chemistry, signal and image processing, machine learning, and brain science (Smilde et al., 2004; Tao et al., 2007; Kroonenberg, 2008; Kolda and Bader, 2009; Hackbusch, 2012; Favier and de Almeida, 2014; Cichocki et al., 2009, 2015b). This is largely due to their advantages in the analysis of data that exhibit not only large volume but also very high variety (see Figure 1.1), as is the case in bio- and neuroinformatics and in computational neuroscience, where various forms of data collection include sparse tabular structures and graphs or hyper-graphs.

Moreover, tensor networks have the ability to efficiently parameterize, through structured compact representations, very general high-dimensional spaces which arise in modern applications (Kressner et al., 2014b; Cichocki, 2014; Zhang et al., 2015; Corona et al., 2015; Litsarev and Oseledets, 2016; Khoromskij and Veit, 2016; Benner et al., 2016). Tensor networks also naturally account for intrinsic multidimensional and distributed patterns present in data, and thus provide the opportunity to develop very sophisticated models for capturing multiple interactions and couplings in data; these are more physically insightful and interpretable than standard pair-wise interactions.

1.5 Scope and Objectives

Review and tutorial papers (Kolda and Bader, 2009; Lu et al., 2011; Grasedyck et al., 2013; Cichocki et al., 2015b; de Almeida et al., 2015; Sidiropoulos et al., 2016; Papalexakis et al., 2016; Bachmayr et al., 2016) and books (Smilde et al., 2004; Kroonenberg, 2008; Cichocki et al., 2009; Hackbusch, 2012) dealing with TDs and TNs already exist; however, they typically focus on standard models, with no explicit links to very large-scale data processing topics or connections to a wide class of optimization problems. The aim of this monograph is therefore to extend beyond the standard Tucker and CP tensor decompositions, and to demonstrate the perspective of TNs in extremely large-scale data analytics, together with their role as a mathematical backbone in the discovery of hidden structures in prohibitively large-scale data. Indeed, we show that TN models provide a framework for the analysis of linked (coupled) blocks of tensors with millions and even billions of non-zero entries.

We also demonstrate that TNs provide natural extensions of 2-way (matrix) Component Analysis (2-way CA) methods to multiway component analysis (MWCA), which deals with the extraction of desired components from multidimensional and multimodal data. This paradigm shift requires new models and associated algorithms capable of identifying core relations among the different tensor modes, while guaranteeing linear/sub-linear scaling with the size of datasets².

Furthermore, we review tensor decompositions and the associated algorithms for very large-scale linear/multilinear dimensionality reduction problems. The related optimization problems often involve structured matrices and vectors with over a billion entries (see (Grasedyck et al., 2013; Dolgov, 2014; Garreis and Ulbrich, 2016) and references therein). In particular, we focus on the Symmetric Eigenvalue Decomposition (EVD/PCA) and Generalized Eigenvalue Decomposition (GEVD) (Dolgov et al., 2014; Kressner et al., 2014a; Kressner and Uschmajew, 2016), the SVD (Lee and Cichocki, 2015), solutions of overdetermined and underdetermined systems of linear algebraic equations (Oseledets and Dolgov, 2012; Dolgov and Savostyanov, 2014), the Moore–Penrose pseudo-inverse of structured matrices (Lee and Cichocki, 2016b), and Lasso problems (Lee and Cichocki, 2016a). Tensor networks for extremely large-scale multi-block (multi-view) data are also discussed, especially TN models for orthogonal Canonical Correlation Analysis (CCA) and related Partial Least Squares (PLS) problems. For convenience, all these problems are reformulated as constrained optimization problems which are then, by virtue of low-rank tensor networks, reduced to manageable lower-scale optimization sub-problems. The enhanced tractability and scalability are achieved through tensor network contractions and other tensor network transformations.

² Usually, we assume that huge-scale problems operate on at least $10^7$ parameters.

The methods and approaches discussed in this work can be considered both an alternative and a complement to other emerging methods for huge-scale optimization problems, such as the random coordinate descent (RCD) scheme (Nesterov, 2012; Richtárik and Takáč, 2016), sub-gradient methods (Nesterov, 2014), the alternating direction method of multipliers (ADMM) (Boyd et al., 2011), and proximal gradient descent methods (Parikh and Boyd, 2014) (see also (Cevher et al., 2014; Hong et al., 2016) and references therein).

This monograph systematically introduces TN models and the associated algorithms for TNs/TDs, and illustrates many potential applications of TDs/TNs. The dimensionality reduction and optimization frameworks (see Part 2 of this monograph) are considered in detail, and we also illustrate the use of TNs in other challenging problems for huge-scale datasets which can be solved using the tensor network approach, including anomaly detection, tensor completion, compressed sensing, clustering, and classification.


2 Tensor Operations and Tensor Network Diagrams

Tensor operations benefit from the power of multilinear algebra, which is structurally much richer than linear algebra, and even some basic properties, such as the rank, have a more complex meaning. We next introduce the background on fundamental mathematical operations in multilinear algebra, a prerequisite for the understanding of higher-order tensor decompositions. A unified account of both the definitions and properties of tensor network operations is provided, including the outer, multilinear, Kronecker, and Khatri–Rao products. For clarity, graphical illustrations are provided, together with example-rich guidance for tensor network operations and their properties. To avoid any confusion that may arise given the numerous options for tensor reshaping, both mathematical operations and their properties are expressed directly in their native multilinear contexts, supported by graphical visualizations.

2.1 Basic Multilinear Operations

The following symbols are used for the most common tensor multiplications: $\otimes$ for the Kronecker product, $\odot$ for the Khatri–Rao product, $\circledast$ for the Hadamard (componentwise) product, $\circ$ for the outer product and $\times_n$ for the mode-$n$ product. Basic tensor operations are summarized in Table 2.1, and illustrated in Figures 2.1–2.13.

272


Table 2.1: Basic tensor/matrix operations.

$\underline{\mathbf{C}} = \underline{\mathbf{A}} \times_n \mathbf{B}$ – mode-$n$ product of a tensor $\underline{\mathbf{A}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ and a matrix $\mathbf{B} \in \mathbb{R}^{J \times I_n}$ yields a tensor $\underline{\mathbf{C}} \in \mathbb{R}^{I_1 \times \cdots \times I_{n-1} \times J \times I_{n+1} \times \cdots \times I_N}$, with entries $c_{i_1, \ldots, i_{n-1}, j, i_{n+1}, \ldots, i_N} = \sum_{i_n=1}^{I_n} a_{i_1, \ldots, i_n, \ldots, i_N}\, b_{j, i_n}$

$\underline{\mathbf{C}} = \llbracket \underline{\mathbf{G}};\ \mathbf{B}^{(1)}, \ldots, \mathbf{B}^{(N)} \rrbracket$ – multilinear (Tucker) product of a core tensor, $\underline{\mathbf{G}}$, and factor matrices $\mathbf{B}^{(n)}$, which gives $\underline{\mathbf{C}} = \underline{\mathbf{G}} \times_1 \mathbf{B}^{(1)} \times_2 \mathbf{B}^{(2)} \cdots \times_N \mathbf{B}^{(N)}$

$\underline{\mathbf{C}} = \underline{\mathbf{A}} \,\bar{\times}_n\, \mathbf{b}$ – mode-$n$ product of a tensor $\underline{\mathbf{A}} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ and a vector $\mathbf{b} \in \mathbb{R}^{I_n}$ yields a tensor $\underline{\mathbf{C}} \in \mathbb{R}^{I_1 \times \cdots \times I_{n-1} \times I_{n+1} \times \cdots \times I_N}$, with entries $c_{i_1, \ldots, i_{n-1}, i_{n+1}, \ldots, i_N} = \sum_{i_n=1}^{I_n} a_{i_1, \ldots, i_{n-1}, i_n, i_{n+1}, \ldots, i_N}\, b_{i_n}$

$\underline{\mathbf{C}} = \underline{\mathbf{A}} \times_N^1 \underline{\mathbf{B}} = \underline{\mathbf{A}} \times^1 \underline{\mathbf{B}}$ – mode-$(N, 1)$ contracted product of tensors $\underline{\mathbf{A}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ and $\underline{\mathbf{B}} \in \mathbb{R}^{J_1 \times J_2 \times \cdots \times J_M}$, with $I_N = J_1$, yields a tensor $\underline{\mathbf{C}} \in \mathbb{R}^{I_1 \times \cdots \times I_{N-1} \times J_2 \times \cdots \times J_M}$, with entries $c_{i_1, \ldots, i_{N-1}, j_2, \ldots, j_M} = \sum_{i_N=1}^{I_N} a_{i_1, \ldots, i_N}\, b_{i_N, j_2, \ldots, j_M}$

$\underline{\mathbf{C}} = \underline{\mathbf{A}} \circ \underline{\mathbf{B}}$ – outer product of tensors $\underline{\mathbf{A}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ and $\underline{\mathbf{B}} \in \mathbb{R}^{J_1 \times J_2 \times \cdots \times J_M}$ yields an $(N + M)$th-order tensor $\underline{\mathbf{C}}$, with entries $c_{i_1, \ldots, i_N, j_1, \ldots, j_M} = a_{i_1, \ldots, i_N}\, b_{j_1, \ldots, j_M}$

$\underline{\mathbf{X}} = \mathbf{a} \circ \mathbf{b} \circ \mathbf{c} \in \mathbb{R}^{I \times J \times K}$ – outer product of vectors $\mathbf{a}$, $\mathbf{b}$ and $\mathbf{c}$ forms a rank-1 tensor, $\underline{\mathbf{X}}$, with entries $x_{ijk} = a_i b_j c_k$

$\underline{\mathbf{C}} = \underline{\mathbf{A}} \otimes_L \underline{\mathbf{B}}$ – (Left) Kronecker product of tensors $\underline{\mathbf{A}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ and $\underline{\mathbf{B}} \in \mathbb{R}^{J_1 \times J_2 \times \cdots \times J_N}$ yields a tensor $\underline{\mathbf{C}} \in \mathbb{R}^{I_1 J_1 \times \cdots \times I_N J_N}$, with entries $c_{\overline{i_1 j_1}, \ldots, \overline{i_N j_N}} = a_{i_1, \ldots, i_N}\, b_{j_1, \ldots, j_N}$

$\mathbf{C} = \mathbf{A} \odot_L \mathbf{B}$ – (Left) Khatri–Rao product of matrices $\mathbf{A} = [\mathbf{a}_1, \ldots, \mathbf{a}_J] \in \mathbb{R}^{I \times J}$ and $\mathbf{B} = [\mathbf{b}_1, \ldots, \mathbf{b}_J] \in \mathbb{R}^{K \times J}$ yields a matrix $\mathbf{C} \in \mathbb{R}^{IK \times J}$, with columns $\mathbf{c}_j = \mathbf{a}_j \otimes_L \mathbf{b}_j \in \mathbb{R}^{IK}$
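As a concrete illustration of two entries of Table 2.1, the following NumPy sketch (a minimal illustration, not code from the original text) implements the mode-$n$ product of a tensor with a matrix and the mode-$(N, 1)$ contraction of two tensors using tensordot.

```python
import numpy as np

def mode_n_product(A, B, n):
    """Mode-n product C = A x_n B of tensor A with matrix B (J x I_n):
    contracts mode n of A with the columns of B and places the new
    mode of size J back at position n."""
    C = np.tensordot(A, B, axes=([n], [1]))   # contracted axis moves to the end
    return np.moveaxis(C, -1, n)

def contract_N1(A, B):
    """Mode-(N,1) contracted product: sums over the last mode of A
    and the first mode of B (their sizes must match)."""
    return np.tensordot(A, B, axes=([A.ndim - 1], [0]))

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4, 5))      # I1 x I2 x I3
B = rng.standard_normal((7, 4))         # J  x I2
T = rng.standard_normal((5, 6, 2))      # J1 x J2 x J3, with J1 = I3

print(mode_n_product(A, B, 1).shape)    # (3, 7, 5)
print(contract_N1(A, T).shape)          # (3, 4, 6, 2)
```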


Figure 2.1: Tensor reshaping operations: matricization, vectorization and tensorization. Matricization refers to converting a tensor into a matrix, vectorization to converting a tensor or a matrix into a vector, while tensorization refers to converting a vector, a matrix or a low-order tensor into a higher-order tensor.

We refer to (Kolda and Bader, 2009; Cichocki et al., 2009; Lee and Cichocki, 2016c) for more detail regarding the basic notations and tensor operations. For convenience, general operations, such as $\mathrm{vec}(\cdot)$ or $\mathrm{diag}(\cdot)$, are defined in a way similar to the MATLAB syntax.

Multi-indices: By a multi-index $i = \overline{i_1 i_2 \cdots i_N}$ we refer to an index which takes all possible combinations of values of the indices $i_1, i_2, \ldots, i_N$, for $i_n = 1, 2, \ldots, I_n$, $n = 1, 2, \ldots, N$, and in a specific order. Multi-indices can be defined using two different conventions (Dolgov and Savostyanov, 2014):

1. Little-endian convention (reverse lexicographic ordering)

$$\overline{i_1 i_2 \cdots i_N} = i_1 + (i_2 - 1) I_1 + (i_3 - 1) I_1 I_2 + \cdots + (i_N - 1) I_1 \cdots I_{N-1}.$$

2. Big-endian convention (colexicographic ordering)

$$\overline{i_1 i_2 \cdots i_N} = i_N + (i_{N-1} - 1) I_N + (i_{N-2} - 1) I_N I_{N-1} + \cdots + (i_1 - 1) I_2 \cdots I_N.$$


The little-endian convention is used, for example, in Fortran and MATLAB, while the big-endian convention is used in the C language. Given the complex and non-commutative nature of tensors, the basic definitions, such as the matricization, vectorization and the Kronecker product, should be consistent with the chosen convention¹. In this monograph, unless otherwise stated, we will use the little-endian notation.
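The two orderings are easy to check numerically; the sketch below (an illustrative snippet using zero-based NumPy indices, so the formulas are shifted by one relative to the 1-based definitions above) compares them with NumPy's Fortran-order and C-order linear indices.

```python
import numpy as np

def little_endian(idx, dims):
    """Zero-based little-endian (reverse lexicographic) linear index:
    i1 + i2*I1 + i3*I1*I2 + ..."""
    lin, stride = 0, 1
    for i, I in zip(idx, dims):
        lin += i * stride
        stride *= I
    return lin

def big_endian(idx, dims):
    """Zero-based big-endian (colexicographic) linear index:
    iN + i_{N-1}*I_N + i_{N-2}*I_N*I_{N-1} + ..."""
    return little_endian(idx[::-1], dims[::-1])

dims = (3, 4, 5)
idx = (2, 1, 3)

# Little-endian coincides with Fortran/MATLAB column-major ordering,
# big-endian with C row-major ordering.
print(little_endian(idx, dims), np.ravel_multi_index(idx, dims, order='F'))
print(big_endian(idx, dims),    np.ravel_multi_index(idx, dims, order='C'))
```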

Matricization. The matricization operator, also known as the unfolding or flattening, reorders the elements of a tensor into a matrix (see Figure 2.2). Such a matrix is re-indexed according to the choice of multi-index described above, and the following two fundamental matricizations are used extensively.

The mode-$n$ matricization. For a fixed index $n \in \{1, 2, \ldots, N\}$, the mode-$n$ matricization of an $N$th-order tensor, $\underline{\mathbf{X}} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$, is defined as the ("short" and "wide") matrix

$$\mathbf{X}_{(n)} \in \mathbb{R}^{I_n \times I_1 I_2 \cdots I_{n-1} I_{n+1} \cdots I_N}, \qquad (2.1)$$

with $I_n$ rows and $I_1 I_2 \cdots I_{n-1} I_{n+1} \cdots I_N$ columns, the entries of which are

$$(\mathbf{X}_{(n)})_{i_n,\, \overline{i_1 \cdots i_{n-1} i_{n+1} \cdots i_N}} = x_{i_1, i_2, \ldots, i_N}.$$

Note that the columns of the mode-$n$ matricization, $\mathbf{X}_{(n)}$, of a tensor $\underline{\mathbf{X}}$ are the mode-$n$ fibers of $\underline{\mathbf{X}}$.

The mode-$\{n\}$ canonical matricization. For a fixed index $n \in \{1, 2, \ldots, N\}$, the mode-$(1, 2, \ldots, n)$ matricization, or simply mode-$n$ canonical matricization, of a tensor $\underline{\mathbf{X}} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ is defined as the matrix

$$\mathbf{X}_{\langle n \rangle} \in \mathbb{R}^{I_1 I_2 \cdots I_n \times I_{n+1} \cdots I_N}, \qquad (2.2)$$

with $I_1 I_2 \cdots I_n$ rows and $I_{n+1} \cdots I_N$ columns, and the entries

$$(\mathbf{X}_{\langle n \rangle})_{\overline{i_1 i_2 \cdots i_n},\, \overline{i_{n+1} \cdots i_N}} = x_{i_1, i_2, \ldots, i_N}.$$

¹ Note that when using the colexicographic ordering, the vectorization of an outer product of two vectors, $\mathbf{a}$ and $\mathbf{b}$, yields their Kronecker product, that is, $\mathrm{vec}(\mathbf{a} \circ \mathbf{b}) = \mathbf{a} \otimes \mathbf{b}$, while when using the reverse lexicographic ordering, for the same operation, we need to use the Left Kronecker product, $\mathrm{vec}(\mathbf{a} \circ \mathbf{b}) = \mathbf{b} \otimes \mathbf{a} = \mathbf{a} \otimes_L \mathbf{b}$.


Figure 2.2: Matricization (flattening, unfolding) used in tensor reshaping. (a) Mode-1, mode-2, and mode-3 matricizations of a 3rd-order tensor, from the top to the bottom panel. (b) Tensor network diagram for the mode-$n$ matricization of an $N$th-order tensor, $\underline{\mathbf{A}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, into a short and wide matrix, $\mathbf{A}_{(n)} \in \mathbb{R}^{I_n \times I_1 \cdots I_{n-1} I_{n+1} \cdots I_N}$. (c) Mode-$\{1, 2, \ldots, n\}$th (canonical) matricization of an $N$th-order tensor, $\underline{\mathbf{A}}$, into a matrix $\mathbf{A}_{\langle n \rangle} = \mathbf{A}(\overline{i_1 \cdots i_n};\ \overline{i_{n+1} \cdots i_N}) \in \mathbb{R}^{I_1 I_2 \cdots I_n \times I_{n+1} \cdots I_N}$.


Figure 2.3: Tensorization of a vector into a matrix, a 3rd-order tensor and a 4th-order tensor (e.g., a vector $\mathbf{x} \in \mathbb{R}^{8K}$ reshaped into a matrix in $\mathbb{R}^{4K \times 2}$, a 3rd-order tensor in $\mathbb{R}^{2K \times 2 \times 2}$ and a 4th-order tensor in $\mathbb{R}^{K \times 2 \times 2 \times 2}$).

The matricization operator in the MATLAB notation (reverse lexicographic ordering) is given by

$$\mathbf{X}_{\langle n \rangle} = \mathrm{reshape}\left(\underline{\mathbf{X}},\ I_1 I_2 \cdots I_n,\ I_{n+1} \cdots I_N\right). \qquad (2.3)$$

As special cases we immediately have (see Figure 2.2)

$$\mathbf{X}_{\langle 1 \rangle} = \mathbf{X}_{(1)}, \qquad \mathbf{X}_{\langle N-1 \rangle} = \mathbf{X}_{(N)}^{\mathrm{T}}, \qquad \mathbf{X}_{\langle N \rangle} = \mathrm{vec}(\underline{\mathbf{X}}). \qquad (2.4)$$

The tensorization of a vector or a matrix can be considered as a reverse process to the vectorization or matricization (see Figures 2.1 and 2.3).
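Because the little-endian convention coincides with column-major (Fortran/MATLAB) ordering, both matricizations and tensorization can be reproduced in NumPy with order='F' reshapes; the following minimal sketch (illustrative only, not code from the original text) mirrors (2.1)–(2.4).

```python
import numpy as np

dims = (3, 4, 2, 5)                                     # I1, I2, I3, I4
X = np.arange(np.prod(dims)).reshape(dims, order='F')   # little-endian layout

def mode_n_matricization(X, n):
    """X_(n): mode n becomes the rows; the remaining modes are merged
    (in little-endian order) into the columns."""
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1, order='F')

def canonical_matricization(X, n):
    """X_<n>: the first n modes are merged into rows, the rest into columns."""
    return X.reshape(int(np.prod(X.shape[:n])), -1, order='F')

print(mode_n_matricization(X, 2).shape)      # (2, 60)
print(canonical_matricization(X, 2).shape)   # (12, 10)
print(np.array_equal(canonical_matricization(X, 4).ravel(),
                     X.reshape(-1, order='F')))   # X_<N> = vec(X): True

# Tensorization: the reverse reshape, e.g. a vector of length 8
# folded into a 3rd-order 2 x 2 x 2 tensor (cf. Figure 2.3).
x = np.arange(8.0)
X3 = x.reshape(2, 2, 2, order='F')
```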

Kronecker, strong Kronecker, and Khatri–Rao products of matrices and tensors. For an $I \times J$ matrix $\mathbf{A}$ and a $K \times L$ matrix $\mathbf{B}$, the standard (Right) Kronecker product, $\mathbf{A} \otimes \mathbf{B}$, and the Left Kronecker product, $\mathbf{A} \otimes_L \mathbf{B}$, are the following $IK \times JL$ matrices

$$\mathbf{A} \otimes \mathbf{B} = \begin{bmatrix} a_{1,1}\mathbf{B} & \cdots & a_{1,J}\mathbf{B} \\ \vdots & \ddots & \vdots \\ a_{I,1}\mathbf{B} & \cdots & a_{I,J}\mathbf{B} \end{bmatrix}, \qquad \mathbf{A} \otimes_L \mathbf{B} = \begin{bmatrix} \mathbf{A}\, b_{1,1} & \cdots & \mathbf{A}\, b_{1,L} \\ \vdots & \ddots & \vdots \\ \mathbf{A}\, b_{K,1} & \cdots & \mathbf{A}\, b_{K,L} \end{bmatrix}.$$

Observe that $\mathbf{A} \otimes_L \mathbf{B} = \mathbf{B} \otimes \mathbf{A}$, so that the Left Kronecker product will be used in most cases in this monograph, as it is consistent with the little-endian notation.


Figure 2.4: Illustration of the strong Kronecker product of two block matrices, $\mathbf{A} = [\mathbf{A}_{r_1, r_2}] \in \mathbb{R}^{R_1 I_1 \times R_2 J_1}$ and $\mathbf{B} = [\mathbf{B}_{r_2, r_3}] \in \mathbb{R}^{R_2 I_2 \times R_3 J_2}$, which is defined as a block matrix $\mathbf{C} = \mathbf{A} \,|\!\otimes|\, \mathbf{B} \in \mathbb{R}^{R_1 I_1 I_2 \times R_3 J_1 J_2}$, with the blocks $\mathbf{C}_{r_1, r_3} = \sum_{r_2=1}^{R_2} \mathbf{A}_{r_1, r_2} \otimes_L \mathbf{B}_{r_2, r_3} \in \mathbb{R}^{I_1 I_2 \times J_1 J_2}$, for $r_1 = 1, \ldots, R_1$, $r_2 = 1, \ldots, R_2$ and $r_3 = 1, \ldots, R_3$.

Using the Left Kronecker product, the strong Kronecker product of two block matrices, $\mathbf{A} \in \mathbb{R}^{R_1 I \times R_2 J}$ and $\mathbf{B} \in \mathbb{R}^{R_2 K \times R_3 L}$, given by

$$\mathbf{A} = \begin{bmatrix} \mathbf{A}_{1,1} & \cdots & \mathbf{A}_{1,R_2} \\ \vdots & \ddots & \vdots \\ \mathbf{A}_{R_1,1} & \cdots & \mathbf{A}_{R_1,R_2} \end{bmatrix}, \qquad \mathbf{B} = \begin{bmatrix} \mathbf{B}_{1,1} & \cdots & \mathbf{B}_{1,R_3} \\ \vdots & \ddots & \vdots \\ \mathbf{B}_{R_2,1} & \cdots & \mathbf{B}_{R_2,R_3} \end{bmatrix},$$

can be defined as a block matrix (see Figure 2.4 for a graphical illustration)

$$\mathbf{C} = \mathbf{A} \,|\!\otimes|\, \mathbf{B} \in \mathbb{R}^{R_1 IK \times R_3 JL}, \qquad (2.5)$$

with blocks $\mathbf{C}_{r_1, r_3} = \sum_{r_2=1}^{R_2} \mathbf{A}_{r_1, r_2} \otimes_L \mathbf{B}_{r_2, r_3} \in \mathbb{R}^{IK \times JL}$, where $\mathbf{A}_{r_1, r_2} \in \mathbb{R}^{I \times J}$ and $\mathbf{B}_{r_2, r_3} \in \mathbb{R}^{K \times L}$ are the blocks of matrices within $\mathbf{A}$ and $\mathbf{B}$, respectively (de Launey and Seberry, 1994; Kazeev et al., 2013a,b). Note that the strong Kronecker product is similar to the standard block matrix multiplication, but is performed using Kronecker products of the blocks instead of the standard matrix-matrix products. The above definitions of Kronecker products can be naturally extended to tensors (Phan et al., 2012) (see Table 2.1), as shown below.
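Before turning to the tensor extensions, the following NumPy sketch (a toy implementation under the block-size assumptions above, not code from the original text) makes the analogy between the strong Kronecker product and block matrix multiplication explicit.

```python
import numpy as np

def left_kron(X, Y):
    # Left Kronecker product: X (x)_L Y = Y (x) X.
    return np.kron(Y, X)

def strong_kron(A_blocks, B_blocks):
    """Strong Kronecker product of block matrices given as nested lists:
    A_blocks[r1][r2] is I x J, B_blocks[r2][r3] is K x L.
    Block (r1, r3) of the result is sum over r2 of A[r1][r2] (x)_L B[r2][r3]."""
    R1, R2 = len(A_blocks), len(A_blocks[0])
    R3 = len(B_blocks[0])
    C_blocks = [[sum(left_kron(A_blocks[r1][r2], B_blocks[r2][r3])
                     for r2 in range(R2))
                 for r3 in range(R3)]
                for r1 in range(R1)]
    return np.block(C_blocks)

rng = np.random.default_rng(0)
I, J, K, L = 2, 3, 4, 2
R1, R2, R3 = 2, 3, 2
A = [[rng.standard_normal((I, J)) for _ in range(R2)] for _ in range(R1)]
B = [[rng.standard_normal((K, L)) for _ in range(R3)] for _ in range(R2)]

print(strong_kron(A, B).shape)   # (R1*I*K, R3*J*L) = (16, 12)
```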

The Kronecker product of tensors. The (Left) Kronecker product of two $N$th-order tensors, $\underline{\mathbf{A}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ and $\underline{\mathbf{B}} \in \mathbb{R}^{J_1 \times J_2 \times \cdots \times J_N}$, yields a tensor $\underline{\mathbf{C}} = \underline{\mathbf{A}} \otimes_L \underline{\mathbf{B}} \in \mathbb{R}^{I_1 J_1 \times \cdots \times I_N J_N}$ of the same order but enlarged in size, with entries $c_{\overline{i_1 j_1}, \ldots, \overline{i_N j_N}} = a_{i_1, \ldots, i_N}\, b_{j_1, \ldots, j_N}$, as illustrated in Figure 2.5.


Figure 2.5: The Left Kronecker product of two 4th-order tensors, $\underline{\mathbf{A}}$ and $\underline{\mathbf{B}}$, yields a 4th-order tensor, $\underline{\mathbf{C}} = \underline{\mathbf{A}} \otimes_L \underline{\mathbf{B}} \in \mathbb{R}^{I_1 J_1 \times \cdots \times I_4 J_4}$, with entries $c_{k_1,k_2,k_3,k_4} = a_{i_1,\ldots,i_4}\, b_{j_1,\ldots,j_4}$, where $k_n = \overline{i_n j_n}$ $(n = 1, 2, 3, 4)$. Note that the order of tensor $\underline{\mathbf{C}}$ is the same as the order of $\underline{\mathbf{A}}$ and $\underline{\mathbf{B}}$, but the size in every mode within $\underline{\mathbf{C}}$ is a product of the respective sizes of $\underline{\mathbf{A}}$ and $\underline{\mathbf{B}}$.

The mode-$n$ Khatri–Rao product of tensors. The mode-$n$ Khatri–Rao product of two $N$th-order tensors, $\underline{\mathbf{A}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_n \times \cdots \times I_N}$ and $\underline{\mathbf{B}} \in \mathbb{R}^{J_1 \times J_2 \times \cdots \times J_n \times \cdots \times J_N}$, for which $I_n = J_n$, yields a tensor $\underline{\mathbf{C}} = \underline{\mathbf{A}} \odot_n \underline{\mathbf{B}} \in \mathbb{R}^{I_1 J_1 \times \cdots \times I_{n-1} J_{n-1} \times I_n \times I_{n+1} J_{n+1} \times \cdots \times I_N J_N}$, with subtensors $\underline{\mathbf{C}}(:, \ldots, :, i_n, :, \ldots, :) = \underline{\mathbf{A}}(:, \ldots, :, i_n, :, \ldots, :) \otimes \underline{\mathbf{B}}(:, \ldots, :, i_n, :, \ldots, :)$.

The mode-2 and mode-1 Khatri–Rao product of matrices. The above definition simplifies to the standard Khatri–Rao (mode-2) product of two matrices, $\mathbf{A} = [\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_R] \in \mathbb{R}^{I \times R}$ and $\mathbf{B} = [\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_R] \in \mathbb{R}^{J \times R}$, or in other words a "column-wise Kronecker product". Therefore, the standard Right and Left Khatri–Rao products for matrices are respectively given by²

$$\mathbf{A} \odot \mathbf{B} = [\mathbf{a}_1 \otimes \mathbf{b}_1, \mathbf{a}_2 \otimes \mathbf{b}_2, \ldots, \mathbf{a}_R \otimes \mathbf{b}_R] \in \mathbb{R}^{IJ \times R}, \qquad (2.6)$$
$$\mathbf{A} \odot_L \mathbf{B} = [\mathbf{a}_1 \otimes_L \mathbf{b}_1, \mathbf{a}_2 \otimes_L \mathbf{b}_2, \ldots, \mathbf{a}_R \otimes_L \mathbf{b}_R] \in \mathbb{R}^{IJ \times R}. \qquad (2.7)$$

²For simplicity, the mode-2 subindex is usually neglected, i.e., $\mathbf{A} \odot_2 \mathbf{B} = \mathbf{A} \odot \mathbf{B}$.


Analogously, the mode-1 Khatri–Rao product of two matrices, $\mathbf{A} \in \mathbb{R}^{I \times R}$ and $\mathbf{B} \in \mathbb{R}^{I \times Q}$, is defined as

$$\mathbf{A} \odot_1 \mathbf{B} = \begin{bmatrix} \mathbf{A}(1,:) \otimes \mathbf{B}(1,:) \\ \vdots \\ \mathbf{A}(I,:) \otimes \mathbf{B}(I,:) \end{bmatrix} \in \mathbb{R}^{I \times RQ}. \qquad (2.8)$$
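As an illustration of (2.6)–(2.8), a minimal NumPy sketch of the column-wise (mode-2) and row-wise (mode-1) Khatri–Rao products is given below (the helper names are ours, not from the monograph).

```python
import numpy as np

def khatri_rao(A, B):
    # Mode-2 (column-wise) Khatri-Rao product: [a_1 (x) b_1, ..., a_R (x) b_R].
    I, R = A.shape
    J, R2 = B.shape
    assert R == R2
    return np.einsum('ir,jr->ijr', A, B).reshape(I * J, R)

def khatri_rao_mode1(A, B):
    # Mode-1 (row-wise) Khatri-Rao product: rows A(i,:) (x) B(i,:).
    I, R = A.shape
    I2, Q = B.shape
    assert I == I2
    return np.einsum('ir,iq->irq', A, B).reshape(I, R * Q)

A, B = np.random.randn(4, 3), np.random.randn(5, 3)
print(khatri_rao(A, B).shape)                            # (20, 3)
print(khatri_rao_mode1(A, np.random.randn(4, 2)).shape)  # (4, 6)
```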

Direct sum of tensors. A direct sum of $N$th-order tensors $\underline{\mathbf{A}} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ and $\underline{\mathbf{B}} \in \mathbb{R}^{J_1 \times \cdots \times J_N}$ yields a tensor $\underline{\mathbf{C}} = \underline{\mathbf{A}} \oplus \underline{\mathbf{B}} \in \mathbb{R}^{(I_1+J_1) \times \cdots \times (I_N+J_N)}$, with entries $\underline{\mathbf{C}}(k_1, \ldots, k_N) = \underline{\mathbf{A}}(k_1, \ldots, k_N)$ if $1 \le k_n \le I_n$, $\forall n$; $\underline{\mathbf{C}}(k_1, \ldots, k_N) = \underline{\mathbf{B}}(k_1 - I_1, \ldots, k_N - I_N)$ if $I_n < k_n \le I_n + J_n$, $\forall n$; and $\underline{\mathbf{C}}(k_1, \ldots, k_N) = 0$, otherwise (see Figure 2.6(a)).

Partial (mode-$n$) direct sum of tensors. A partial direct sum of tensors $\underline{\mathbf{A}} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ and $\underline{\mathbf{B}} \in \mathbb{R}^{J_1 \times \cdots \times J_N}$, with $I_n = J_n$, yields a tensor $\underline{\mathbf{C}} = \underline{\mathbf{A}} \oplus_{\bar{n}} \underline{\mathbf{B}} \in \mathbb{R}^{(I_1+J_1) \times \cdots \times (I_{n-1}+J_{n-1}) \times I_n \times (I_{n+1}+J_{n+1}) \times \cdots \times (I_N+J_N)}$, where $\underline{\mathbf{C}}(:, \ldots, :, i_n, :, \ldots, :) = \underline{\mathbf{A}}(:, \ldots, :, i_n, :, \ldots, :) \oplus \underline{\mathbf{B}}(:, \ldots, :, i_n, :, \ldots, :)$, as illustrated in Figure 2.6(b).

Concatenation of $N$th-order tensors. A concatenation along mode-$n$ of tensors $\underline{\mathbf{A}} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ and $\underline{\mathbf{B}} \in \mathbb{R}^{J_1 \times \cdots \times J_N}$, for which $I_m = J_m$, $\forall m \neq n$, yields a tensor $\underline{\mathbf{C}} = \underline{\mathbf{A}} \oplus_n \underline{\mathbf{B}} \in \mathbb{R}^{I_1 \times \cdots \times I_{n-1} \times (I_n+J_n) \times I_{n+1} \times \cdots \times I_N}$, with subtensors $\underline{\mathbf{C}}(i_1, \ldots, i_{n-1}, :, i_{n+1}, \ldots, i_N) = \underline{\mathbf{A}}(i_1, \ldots, i_{n-1}, :, i_{n+1}, \ldots, i_N) \oplus \underline{\mathbf{B}}(i_1, \ldots, i_{n-1}, :, i_{n+1}, \ldots, i_N)$, as illustrated in Figure 2.6(c). For a concatenation of two tensors of suitable dimensions along mode $n$, we will use the equivalent notations $\underline{\mathbf{C}} = \underline{\mathbf{A}} \oplus_n \underline{\mathbf{B}} = \underline{\mathbf{A}} \boxplus_n \underline{\mathbf{B}}$. A small code sketch for both operations follows the next paragraph on convolution operators.
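Both operations can be realized with a few lines of NumPy; the sketch below (our illustration, with the hypothetical helper direct_sum) follows the entry-wise definitions above.

```python
import numpy as np

def direct_sum(A, B):
    # Direct sum C = A (+) B: A and B sit on the "diagonal" of a zero tensor
    # of size (I1+J1) x ... x (IN+JN).
    C = np.zeros(tuple(i + j for i, j in zip(A.shape, B.shape)))
    C[tuple(slice(0, i) for i in A.shape)] = A
    C[tuple(slice(i, i + j) for i, j in zip(A.shape, B.shape))] = B
    return C

A, B = np.random.randn(2, 3, 4), np.random.randn(5, 2, 3)
print(direct_sum(A, B).shape)                       # (7, 5, 7)

# Concatenation along mode n requires all other mode sizes to match.
A2, B2 = np.random.randn(2, 3, 4), np.random.randn(2, 6, 4)
print(np.concatenate([A2, B2], axis=1).shape)       # (2, 9, 4)
```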

3D Convolution. For simplicity, consider two 3rd-order tensors $\underline{\mathbf{A}} \in \mathbb{R}^{I_1 \times I_2 \times I_3}$ and $\underline{\mathbf{B}} \in \mathbb{R}^{J_1 \times J_2 \times J_3}$. Their 3D convolution yields a tensor $\underline{\mathbf{C}} = \underline{\mathbf{A}} * \underline{\mathbf{B}} \in \mathbb{R}^{(I_1+J_1-1) \times (I_2+J_2-1) \times (I_3+J_3-1)}$, with entries $\underline{\mathbf{C}}(k_1, k_2, k_3) = \sum_{j_1} \sum_{j_2} \sum_{j_3} \underline{\mathbf{B}}(j_1, j_2, j_3)\; \underline{\mathbf{A}}(k_1 - j_1, k_2 - j_2, k_3 - j_3)$, as illustrated in Figures 2.7 and 2.8.
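A quick way to check the size and entries of a full 3D convolution is scipy.signal.convolve, which implements exactly this "full" convolution; the snippet below is an illustrative sketch, not code from the monograph.

```python
import numpy as np
from scipy.signal import convolve

A = np.random.randn(4, 5, 6)             # I1 x I2 x I3
B = np.random.randn(2, 3, 3)             # J1 x J2 x J3
C = convolve(A, B, mode='full')          # (I1+J1-1) x (I2+J2-1) x (I3+J3-1)
print(C.shape)                           # (5, 7, 8)

# Entry-wise check of C(k) = sum_j B(j) A(k - j) for one index k.
k = (2, 3, 4)
acc = 0.0
for j1 in range(B.shape[0]):
    for j2 in range(B.shape[1]):
        for j3 in range(B.shape[2]):
            i = (k[0] - j1, k[1] - j2, k[2] - j3)
            if all(0 <= i[m] < A.shape[m] for m in range(3)):
                acc += B[j1, j2, j3] * A[i]
assert np.isclose(C[k], acc)
```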

Partial (mode-$n$) Convolution. For simplicity, consider two 3rd-order tensors $\underline{\mathbf{A}} \in \mathbb{R}^{I_1 \times I_2 \times I_3}$ and $\underline{\mathbf{B}} \in \mathbb{R}^{J_1 \times J_2 \times J_3}$. Their mode-2 (partial)


Figure 2.6: Illustration of the direct sum, partial direct sum and concatenation operators for two 3rd-order tensors. (a) Direct sum. (b) Partial (mode-1, mode-2, and mode-3) direct sum. (c) Concatenations along mode-1, 2, 3.


Figure 2.7: Illustration of the 2D convolution operator, performed through a sliding window operation along both the horizontal and vertical index.


Figure 2.8: Illustration of the 3D convolution operator, performed through a sliding window operation along all three indices.

convolution yields a tensor $\underline{\mathbf{C}} = \underline{\mathbf{A}} \boxdot_2 \underline{\mathbf{B}} \in \mathbb{R}^{I_1 J_1 \times (I_2+J_2-1) \times I_3 J_3}$, the subtensors (vectors) of which are $\underline{\mathbf{C}}(k_1, :, k_3) = \underline{\mathbf{A}}(i_1, :, i_3) * \underline{\mathbf{B}}(j_1, :, j_3) \in \mathbb{R}^{I_2+J_2-1}$, where $k_1 = \overline{i_1 j_1}$ and $k_3 = \overline{i_3 j_3}$.

Outer product. The central operator in tensor analysis is the outer or tensor product, which for the tensors $\underline{\mathbf{A}} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ and $\underline{\mathbf{B}} \in \mathbb{R}^{J_1 \times \cdots \times J_M}$ gives the tensor $\underline{\mathbf{C}} = \underline{\mathbf{A}} \circ \underline{\mathbf{B}} \in \mathbb{R}^{I_1 \times \cdots \times I_N \times J_1 \times \cdots \times J_M}$ with entries $c_{i_1,\ldots,i_N,\,j_1,\ldots,j_M} = a_{i_1,\ldots,i_N}\, b_{j_1,\ldots,j_M}$.

Note that for 1st-order tensors (vectors), the tensor product reduces to the standard outer product of two nonzero vectors, $\mathbf{a} \in \mathbb{R}^I$ and $\mathbf{b} \in \mathbb{R}^J$, which yields a rank-1 matrix, $\mathbf{X} = \mathbf{a} \circ \mathbf{b} = \mathbf{a}\mathbf{b}^{\mathrm{T}} \in \mathbb{R}^{I \times J}$. The outer product of three nonzero vectors, $\mathbf{a} \in \mathbb{R}^I$, $\mathbf{b} \in \mathbb{R}^J$ and $\mathbf{c} \in \mathbb{R}^K$, gives a 3rd-order rank-1 tensor (called a pure or elementary tensor), $\underline{\mathbf{X}} = \mathbf{a} \circ \mathbf{b} \circ \mathbf{c} \in \mathbb{R}^{I \times J \times K}$, with entries $x_{ijk} = a_i\, b_j\, c_k$.


Rank-1 tensor. A tensor, $\underline{\mathbf{X}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, is said to be of rank-1 if it can be expressed exactly as the outer product, $\underline{\mathbf{X}} = \mathbf{b}^{(1)} \circ \mathbf{b}^{(2)} \circ \cdots \circ \mathbf{b}^{(N)}$, of nonzero vectors, $\mathbf{b}^{(n)} \in \mathbb{R}^{I_n}$, with the tensor entries given by $x_{i_1,i_2,\ldots,i_N} = b^{(1)}_{i_1} b^{(2)}_{i_2} \cdots b^{(N)}_{i_N}$.

Kruskal tensor, CP decomposition. For further discussion, it is important to highlight that any tensor can be expressed as a finite sum of rank-1 tensors, in the form

$$\underline{\mathbf{X}} = \sum_{r=1}^{R} \mathbf{b}^{(1)}_r \circ \mathbf{b}^{(2)}_r \circ \cdots \circ \mathbf{b}^{(N)}_r = \sum_{r=1}^{R} \left( \mathop{\circ}_{n=1}^{N} \mathbf{b}^{(n)}_r \right), \qquad \mathbf{b}^{(n)}_r \in \mathbb{R}^{I_n}, \qquad (2.9)$$

which is exactly the form of the Kruskal tensor, illustrated in Figure 2.9, also known under the names of CANDECOMP/PARAFAC, Canonical Polyadic Decomposition (CPD), or simply the CP decomposition in (1.2). We will use the acronyms CP and CPD.
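For illustration, a Kruskal (CP) tensor of the form (2.9) can be assembled from factor matrices with a few lines of NumPy; kruskal_to_full below is a hypothetical helper, not a routine from the monograph.

```python
import numpy as np

def kruskal_to_full(factors):
    # factors[n] has shape (I_n, R); returns sum_r b_r^(1) o b_r^(2) o ... o b_r^(N).
    R = factors[0].shape[1]
    X = np.zeros(tuple(B.shape[0] for B in factors))
    for r in range(R):
        rank1 = factors[0][:, r]
        for B in factors[1:]:
            rank1 = np.multiply.outer(rank1, B[:, r])   # chain of outer products
        X += rank1
    return X

# A 3rd-order tensor of CP rank 2.
factors = [np.random.randn(4, 2), np.random.randn(5, 2), np.random.randn(3, 2)]
print(kruskal_to_full(factors).shape)                   # (4, 5, 3)
```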

Tensor rank. The tensor rank, also called the CP rank, is a natural extension of the matrix rank and is defined as a minimum number, $R$, of rank-1 terms in an exact CP decomposition of the form in (2.9).

Although the CP decomposition has already found many practical applications, its limiting theoretical property is that the best rank-$R$ approximation of a given data tensor may not exist (see de Silva and Lim (2008) for more detail). However, a rank-$R$ tensor can be approximated arbitrarily well by a sequence of tensors for which the CP ranks are strictly less than $R$. For these reasons, the concept of border rank was proposed (Bini, 1985), which is defined as the minimum number of rank-1 tensors which provides the approximation of a given tensor with an arbitrary accuracy.

Symmetric tensor decomposition. A symmetric tensor (sometimes called a super-symmetric tensor) is invariant to the permutations of its indices. A symmetric tensor of $N$th-order has equal sizes, $I_n = I$, $\forall n$, in all its modes, and the same value of entries for every permutation of its indices. For example, for vectors $\mathbf{b}^{(n)} = \mathbf{b} \in \mathbb{R}^I$, $\forall n$, the rank-1 tensor, constructed by $N$ outer products, $\mathop{\circ}_{n=1}^{N} \mathbf{b}^{(n)} = \mathbf{b} \circ \mathbf{b} \circ \cdots \circ \mathbf{b} \in \mathbb{R}^{I \times I \times \cdots \times I}$, is symmetric. Moreover, every symmetric tensor can be expressed as a


Figure 2.9: The CP decomposition for a 4th-order tensor $\underline{\mathbf{X}}$ of rank $R$. Observe that the rank-1 subtensors are formed through the outer products of the vectors $\mathbf{b}^{(1)}_r, \ldots, \mathbf{b}^{(4)}_r$, $r = 1, \ldots, R$.

linear combination of such symmetric rank-1 tensors through the so-called symmetric CP decomposition, given by

$$\underline{\mathbf{X}} = \sum_{r=1}^{R} \lambda_r\, \mathbf{b}_r \circ \mathbf{b}_r \circ \cdots \circ \mathbf{b}_r, \qquad \mathbf{b}_r \in \mathbb{R}^{I}, \qquad (2.10)$$

where $\lambda_r \in \mathbb{R}$ are the scaling parameters for the unit length vectors $\mathbf{b}_r$, while the symmetric tensor rank is the minimal number $R$ of rank-1 tensors that is necessary for its exact representation.

Multilinear products. The mode-$n$ (multilinear) product, also called the tensor-times-matrix product (TTM), of a tensor, $\underline{\mathbf{A}} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$, and a matrix, $\mathbf{B} \in \mathbb{R}^{J \times I_n}$, gives the tensor

$$\underline{\mathbf{C}} = \underline{\mathbf{A}} \times_n \mathbf{B} \in \mathbb{R}^{I_1 \times \cdots \times I_{n-1} \times J \times I_{n+1} \times \cdots \times I_N}, \qquad (2.11)$$

with entries

$$c_{i_1,i_2,\ldots,i_{n-1},\,j,\,i_{n+1},\ldots,i_N} = \sum_{i_n=1}^{I_n} a_{i_1,i_2,\ldots,i_N}\, b_{j,i_n}. \qquad (2.12)$$

From (2.12) and Figure 2.10, the equivalent matrix form is $\mathbf{C}_{(n)} = \mathbf{B}\mathbf{A}_{(n)}$, which allows us to employ established fast matrix-by-vector and matrix-by-matrix multiplications when dealing with very large-scale tensors. Efficient and optimized algorithms for TTM are, however, still emerging (Li et al., 2015; Ballard et al., 2015a,b).
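The TTM product (2.11)–(2.12) can be realized with a single tensordot call, as in the following sketch (the function name mode_n_product is ours); it is equivalent to the matricized form $\mathbf{C}_{(n)} = \mathbf{B}\mathbf{A}_{(n)}$ up to the choice of unfolding convention.

```python
import numpy as np

def mode_n_product(A, B, n):
    # C = A x_n B for a tensor A and a matrix B of shape (J, I_n):
    # contract mode n of A with the columns of B, then move the new
    # mode of size J back to position n.
    C = np.tensordot(A, B, axes=([n], [1]))
    return np.moveaxis(C, -1, n)

A = np.random.randn(3, 4, 5)
B = np.random.randn(7, 4)                 # J x I_2
print(mode_n_product(A, B, 1).shape)      # (3, 7, 5)
```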

Full multilinear (Tucker) product. A full multilinear product, also called the Tucker product, of an $N$th-order tensor, $\underline{\mathbf{G}} \in \mathbb{R}^{R_1 \times R_2 \times \cdots \times R_N}$,


Figure 2.10: Illustration of the multilinear mode-$n$ product, also known as the TTM (Tensor-Times-Matrix) product, performed in the tensor format (left) and the matrix format (right). (a) Mode-1 product of a 3rd-order tensor, $\underline{\mathbf{A}} \in \mathbb{R}^{I_1 \times I_2 \times I_3}$, and a factor (component) matrix, $\mathbf{B} \in \mathbb{R}^{J \times I_1}$, yields a tensor $\underline{\mathbf{C}} = \underline{\mathbf{A}} \times_1 \mathbf{B} \in \mathbb{R}^{J \times I_2 \times I_3}$. This is equivalent to a simple matrix multiplication formula, $\mathbf{C}_{(1)} = \mathbf{B}\mathbf{A}_{(1)}$. (b) Graphical representation of a mode-$n$ product of an $N$th-order tensor, $\underline{\mathbf{A}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, and a factor matrix, $\mathbf{B} \in \mathbb{R}^{J \times I_n}$.

and a set of $N$ factor matrices, $\mathbf{B}^{(n)} \in \mathbb{R}^{I_n \times R_n}$ for $n = 1, 2, \ldots, N$, performs the multiplications in all the modes and can be compactly written as (see Figure 2.11(b))

$$\underline{\mathbf{C}} = \underline{\mathbf{G}} \times_1 \mathbf{B}^{(1)} \times_2 \mathbf{B}^{(2)} \cdots \times_N \mathbf{B}^{(N)} = \llbracket \underline{\mathbf{G}};\, \mathbf{B}^{(1)}, \mathbf{B}^{(2)}, \ldots, \mathbf{B}^{(N)} \rrbracket \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}. \qquad (2.13)$$

Observe that this format corresponds to the Tucker decomposition (Tucker, 1964, 1966; Kolda and Bader, 2009) (see Section 3.3).
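Accordingly, the full multilinear (Tucker) product (2.13) amounts to applying a mode-$n$ product for every mode, as in this self-contained sketch (illustrative code, not from the monograph).

```python
import numpy as np

def mode_n_product(A, B, n):
    # C = A x_n B with B of shape (J, I_n); same construction as above.
    return np.moveaxis(np.tensordot(A, B, axes=([n], [1])), -1, n)

G = np.random.randn(2, 3, 4)                       # core tensor
Bs = [np.random.randn(5, 2), np.random.randn(6, 3), np.random.randn(7, 4)]
C = G
for n, B in enumerate(Bs):                         # C = G x_1 B1 x_2 B2 x_3 B3
    C = mode_n_product(C, B, n)
print(C.shape)                                     # (5, 6, 7)
```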

Multilinear product of a tensor and a vector (TTV). In a similar way, the mode-$n$ multiplication of a tensor, $\underline{\mathbf{A}} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$, and a


vector, $\mathbf{b} \in \mathbb{R}^{I_n}$ (tensor-times-vector, TTV), yields a tensor

$$\underline{\mathbf{C}} = \underline{\mathbf{A}} \,\bar{\times}_n\, \mathbf{b} \in \mathbb{R}^{I_1 \times \cdots \times I_{n-1} \times I_{n+1} \times \cdots \times I_N}, \qquad (2.14)$$

with entries

$$c_{i_1,\ldots,i_{n-1},\,i_{n+1},\ldots,i_N} = \sum_{i_n=1}^{I_n} a_{i_1,\ldots,i_{n-1},\,i_n,\,i_{n+1},\ldots,i_N}\, b_{i_n}. \qquad (2.15)$$

Note that the mode-$n$ multiplication of a tensor by a matrix does not change the tensor order, while the multiplication of a tensor by vectors reduces its order, with the mode $n$ removed (see Figure 2.11).

Multilinear products of tensors by matrices or vectors play a key role in deterministic methods for the reshaping of tensors and dimensionality reduction, as well as in probabilistic methods for randomization/sketching procedures and in random projections of tensors into matrices or vectors. In other words, we can also perform reshaping of a tensor through random projections that change its entries, dimensionality or size of modes, and/or the tensor order. This is achieved by multiplying a tensor by random matrices or vectors, transformations which preserve its basic properties (Sun et al., 2006; Drineas and Mahoney, 2007; Lu et al., 2011; Li and Monga, 2012; Pham and Pagh, 2013; Wang et al., 2015; Kuleshov et al., 2015; Sorber et al., 2016) (see Section 3.5 for more detail).

Tensor contractions. Tensor contraction is a fundamental and the most important operation in tensor networks, and can be considered a higher-dimensional analogue of matrix multiplication, inner product, and outer product.

In a way similar to the mode-$n$ multilinear product³, the mode-$({}^{m}_{n})$ product (tensor contraction) of two tensors, $\underline{\mathbf{A}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ and $\underline{\mathbf{B}} \in \mathbb{R}^{J_1 \times J_2 \times \cdots \times J_M}$, with common modes, $I_n = J_m$, yields an $(N+M-2)$th-order tensor, $\underline{\mathbf{C}} \in \mathbb{R}^{I_1 \times \cdots \times I_{n-1} \times I_{n+1} \times \cdots \times I_N \times J_1 \times \cdots \times J_{m-1} \times J_{m+1} \times \cdots \times J_M}$, in the form (see Figure 2.12(a))

$$\underline{\mathbf{C}} = \underline{\mathbf{A}} \times_n^m \underline{\mathbf{B}}, \qquad (2.16)$$

³In the literature, the symbol $\times_n$ is sometimes replaced by $\bullet_n$.


Figure 2.11: Multilinear tensor products in a compact tensor network notation. (a) Transforming and/or compressing a 4th-order tensor, $\underline{\mathbf{G}} \in \mathbb{R}^{R_1 \times R_2 \times R_3 \times R_4}$, into a scalar, vector, matrix and 3rd-order tensor, by multilinear products of the tensor and vectors. Note that a mode-$n$ multiplication of a tensor by a matrix does not change the order of a tensor, while a multiplication of a tensor by a vector reduces its order by one. For example, a multilinear product of a 4th-order tensor and four vectors (top diagram) yields a scalar. (b) Multilinear product of a tensor, $\underline{\mathbf{G}} \in \mathbb{R}^{R_1 \times R_2 \times \cdots \times R_5}$, and five factor (component) matrices, $\mathbf{B}^{(n)} \in \mathbb{R}^{I_n \times R_n}$ $(n = 1, 2, \ldots, 5)$, yields the tensor $\underline{\mathbf{C}} = \underline{\mathbf{G}} \times_1 \mathbf{B}^{(1)} \times_2 \mathbf{B}^{(2)} \times_3 \mathbf{B}^{(3)} \times_4 \mathbf{B}^{(4)} \times_5 \mathbf{B}^{(5)} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_5}$. This corresponds to the Tucker format. (c) Multilinear product of a 4th-order tensor, $\underline{\mathbf{G}} \in \mathbb{R}^{R_1 \times R_2 \times R_3 \times R_4}$, and three vectors, $\mathbf{b}_n \in \mathbb{R}^{R_n}$ $(n = 1, 2, 3)$, yields the vector $\mathbf{c} = \underline{\mathbf{G}} \,\bar{\times}_1\, \mathbf{b}_1 \,\bar{\times}_2\, \mathbf{b}_2 \,\bar{\times}_3\, \mathbf{b}_3 \in \mathbb{R}^{R_4}$.


Figure 2.12: Examples of contractions of two tensors. (a) Multilinear product of two tensors is denoted by $\underline{\mathbf{A}} \times_n^m \underline{\mathbf{B}}$. (b) Inner product of two 3rd-order tensors yields a scalar $c = \langle \underline{\mathbf{A}}, \underline{\mathbf{B}} \rangle = \underline{\mathbf{A}} \times_{1,2,3}^{1,2,3} \underline{\mathbf{B}} = \sum_{i_1,i_2,i_3} a_{i_1,i_2,i_3}\, b_{i_1,i_2,i_3}$. (c) Tensor contraction of two 4th-order tensors, along mode-3 in $\underline{\mathbf{A}}$ and mode-2 in $\underline{\mathbf{B}}$, yields a 6th-order tensor, $\underline{\mathbf{C}} = \underline{\mathbf{A}} \times_3^2 \underline{\mathbf{B}} \in \mathbb{R}^{I_1 \times I_2 \times I_4 \times J_1 \times J_3 \times J_4}$, with entries $c_{i_1,i_2,i_4,j_1,j_3,j_4} = \sum_{i_3} a_{i_1,i_2,i_3,i_4}\, b_{j_1,i_3,j_3,j_4}$. (d) Tensor contraction of two 5th-order tensors, along the modes 3, 4, 5 in $\underline{\mathbf{A}}$ and 1, 2, 3 in $\underline{\mathbf{B}}$, yields a 4th-order tensor, $\underline{\mathbf{C}} = \underline{\mathbf{A}} \times_{5,4,3}^{1,2,3} \underline{\mathbf{B}} \in \mathbb{R}^{I_1 \times I_2 \times J_4 \times J_5}$.

for which the entries are computed as

$$c_{i_1, \ldots, i_{n-1},\, i_{n+1}, \ldots, i_N,\, j_1, \ldots, j_{m-1},\, j_{m+1}, \ldots, j_M} = \sum_{i_n=1}^{I_n} a_{i_1, \ldots, i_{n-1},\, i_n,\, i_{n+1}, \ldots, i_N}\; b_{j_1, \ldots, j_{m-1},\, i_n,\, j_{m+1}, \ldots, j_M}. \qquad (2.17)$$

This operation is referred to as a contraction of two tensors in a single common mode.

Tensors can be contracted in several modes or even in all modes, as illustrated in Figure 2.12. For convenience of presentation, the super- or sub-index, e.g., $m$, $n$, will be omitted in a few special cases. For example, the multilinear product of the tensors, $\underline{\mathbf{A}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ and $\underline{\mathbf{B}} \in \mathbb{R}^{J_1 \times J_2 \times \cdots \times J_M}$, with a common mode, $I_N = J_1$, can be written as

$$\underline{\mathbf{C}} = \underline{\mathbf{A}} \times_N^1 \underline{\mathbf{B}} = \underline{\mathbf{A}} \times^1 \underline{\mathbf{B}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_{N-1} \times J_2 \times \cdots \times J_M}, \qquad (2.18)$$


for which the entries

$$c_{i_1,i_2,\ldots,i_{N-1},\,j_2,j_3,\ldots,j_M} = \sum_{i_N=1}^{I_N} a_{i_1,i_2,\ldots,i_N}\; b_{i_N,j_2,\ldots,j_M}.$$

In this notation, the multiplications of matrices and vectors can be written as $\mathbf{A} \times_2^1 \mathbf{B} = \mathbf{A} \times^1 \mathbf{B} = \mathbf{A}\mathbf{B}$, $\mathbf{A} \times_2^2 \mathbf{B} = \mathbf{A}\mathbf{B}^{\mathrm{T}}$, $\mathbf{A} \times_{1,2}^{1,2} \mathbf{B} = \langle \mathbf{A}, \mathbf{B} \rangle$, and $\mathbf{A} \times_2^1 \mathbf{x} = \mathbf{A} \times^1 \mathbf{x} = \mathbf{A}\mathbf{x}$.

Note that tensor contractions are, in general, not associative or commutative, since when contracting more than two tensors, the order has to be precisely specified (defined), for example, $\underline{\mathbf{A}} \times_a^b (\underline{\mathbf{B}} \times_c^d \underline{\mathbf{C}})$ for $b < c$.

It is also important to note that a matrix-by-vector product, $\mathbf{y} = \mathbf{A}\mathbf{x} \in \mathbb{R}^{I_1 \cdots I_N}$, with $\mathbf{A} \in \mathbb{R}^{I_1 \cdots I_N \times J_1 \cdots J_N}$ and $\mathbf{x} \in \mathbb{R}^{J_1 \cdots J_N}$, can be expressed in a tensorized form via the contraction operator as $\underline{\mathbf{Y}} = \underline{\mathbf{A}} \,\bar{\times}\, \underline{\mathbf{X}}$, where the symbol $\bar{\times}$ denotes the contraction of all modes of the tensor $\underline{\mathbf{X}}$ (see Section 4.5).

Unlike the matrix-by-matrix multiplications, for which several efficient parallel schemes have been developed, the number of efficient algorithms for tensor contractions is rather limited. In practice, due to the high computational complexity of tensor contractions, especially for tensor networks with loops, this operation is often performed approximately (Lubasch et al., 2014; Di Napoli et al., 2014; Pfeifer et al., 2014; Kao et al., 2015).
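In NumPy, single-mode and all-mode contractions of the kind shown in Figure 2.12 map directly onto tensordot; the sketch below is an illustration under our own choice of sizes.

```python
import numpy as np

# Single-mode contraction C = A x_3^2 B (mode 3 of A against mode 2 of B),
# as in Figure 2.12(c); the result is a 6th-order tensor.
A = np.random.randn(2, 3, 4, 5)            # I1 x I2 x I3 x I4
B = np.random.randn(6, 4, 7, 8)            # J1 x J2 x J3 x J4 with J2 = I3
C = np.tensordot(A, B, axes=([2], [1]))
print(C.shape)                             # (2, 3, 5, 6, 7, 8)

# Contraction in all modes (inner product) of two 3rd-order tensors.
X, Y = np.random.randn(3, 4, 5), np.random.randn(3, 4, 5)
c = np.tensordot(X, Y, axes=3)
assert np.isclose(c, np.sum(X * Y))
```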

Tensor trace. Consider a tensor with partial self-contraction modes, where the outer (or open) indices represent physical modes of the tensor, while the inner indices indicate its contraction modes. The Tensor Trace operator performs the summation of all inner indices of the tensor (Gu et al., 2009). For example, a tensor $\underline{\mathbf{A}}$ of size $R \times I \times R$ has two inner indices, modes 1 and 3 of size $R$, and one open mode of size $I$. Its tensor trace yields a vector of length $I$, given by

$$\mathbf{a} = \operatorname{Tr}(\underline{\mathbf{A}}) = \sum_{r} \underline{\mathbf{A}}(r, :, r),$$

the elements of which are the traces of its lateral slices $\mathbf{A}_i \in \mathbb{R}^{R \times R}$ $(i = 1, 2, \ldots, I)$, that is (see bottom of Figure 2.13),

$$\mathbf{a} = \left[\operatorname{tr}(\mathbf{A}_1), \ldots, \operatorname{tr}(\mathbf{A}_i), \ldots, \operatorname{tr}(\mathbf{A}_I)\right]^{\mathrm{T}}. \qquad (2.19)$$
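The partial tensor trace of a 3rd-order tensor is a one-liner with einsum, as this small check illustrates (our example, not from the monograph).

```python
import numpy as np

R, I = 4, 6
A = np.random.randn(R, I, R)
a = np.einsum('rir->i', A)                 # a_i = sum_r A[r, i, r]
assert np.allclose(a, [np.trace(A[:, i, :]) for i in range(I)])
```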


Figure 2.13: Tensor network notation for the traces of matrices (panels 1-4 from the top), and a (partial) tensor trace (tensor self-contraction) of a 3rd-order tensor (bottom panel). Note that graphical representations of the trace of matrices intuitively explain the permutation property of the trace operator, e.g., $\operatorname{tr}(\mathbf{A}_1\mathbf{A}_2\mathbf{A}_3\mathbf{A}_4) = \operatorname{tr}(\mathbf{A}_3\mathbf{A}_4\mathbf{A}_1\mathbf{A}_2)$.

A tensor can have more than one pair of inner indices, e.g., the tensor $\underline{\mathbf{A}}$ of size $R \times I \times S \times S \times I \times R$ has two pairs of inner indices, modes 1 and 6 and modes 3 and 4, and two open modes (2 and 5). The tensor trace of $\underline{\mathbf{A}}$ therefore returns a matrix of size $I \times I$, defined as

$$\operatorname{Tr}(\underline{\mathbf{A}}) = \sum_{r} \sum_{s} \underline{\mathbf{A}}(r, :, s, s, :, r).$$

A variant of the Tensor Trace (Lee and Cichocki, 2016c) for the case of the partial tensor self-contraction considers a tensor $\underline{\mathbf{A}} \in \mathbb{R}^{R \times I_1 \times I_2 \times \cdots \times I_N \times R}$ and yields a reduced-order tensor $\widetilde{\underline{\mathbf{A}}} = \operatorname{Tr}(\underline{\mathbf{A}}) \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, with entries


$$\widetilde{\underline{\mathbf{A}}}(i_1, i_2, \ldots, i_N) = \sum_{r=1}^{R} \underline{\mathbf{A}}(r, i_1, i_2, \ldots, i_N, r). \qquad (2.20)$$

Figure 2.14: Illustration of the decomposition of a 9th-order tensor, $\underline{\mathbf{X}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_9}$, into different forms of tensor networks (TNs). In general, the objective is to decompose a very high-order tensor into sparsely (weakly) connected low-order and small size tensors, typically 3rd-order and 4th-order tensors called cores. Top: The Tensor Chain (TC) model, which is equivalent to the Matrix Product State (MPS) with periodic boundary conditions (PBC). Middle: The Projected Entangled-Pair States (PEPS), also with PBC. Bottom: The Tree Tensor Network State (TTNS).

Conversions of tensors to scalars, vectors, matrices or tensors with reshaped modes and/or reduced orders are illustrated in Figures 2.11–2.13.

2.2 Graphical Representation of Fundamental Tensor Networks

Tensor networks (TNs) represent a higher-order tensor as a set of sparsely interconnected lower-order tensors (see Figure 2.14), and


in this way provide computational and storage benefits. The lines (branches, edges) connecting core tensors correspond to the contracted modes, while their weights (or numbers of branches) represent the rank of a tensor network⁴, whereas the lines which do not connect core tensors correspond to the "external" physical variables (modes, indices) within the data tensor. In other words, the number of free (dangling) edges (with weights larger than one) determines the order of a data tensor under consideration, while the set of weights of internal branches represents the TN rank.

2.2.1 Hierarchical Tucker (HT) and Tree Tensor Network State (TTNS) Models

Hierarchical Tucker (HT) decompositions (also called hierarchical tensor representation) have been introduced in (Hackbusch and Kühn, 2009) and also independently in (Grasedyck, 2010), see also (Hackbusch, 2012; Lubich et al., 2013; Uschmajew and Vandereycken, 2013; Kressner and Tobler, 2014; Bachmayr et al., 2016) and references therein⁵. Generally, the HT decomposition requires splitting the set of modes of a tensor in a hierarchical way, which results in a binary tree containing a subset of modes at each branch (called a dimension tree); examples of binary trees are given in Figures 2.15, 2.16 and 2.17. In tensor networks based on binary trees, all the cores are of order three or less. Observe that the HT model does not contain any cycles (loops), i.e., no edges connecting a node with itself. The splitting operation of the set of modes of the original data tensor by binary tree edges is performed through a suitable matricization.

Choice of dimension tree. The dimension tree within the HT format is chosen a priori and defines the topology of the HT decomposition. Intuitively, the dimension tree specifies which groups of modes

⁴Strictly speaking, the minimum set of internal indices $\{R_1, R_2, R_3, \ldots\}$ is called the rank (bond dimensions) of a specific tensor network.

⁵The HT model was developed independently, from a different perspective, in the chemistry community under the name MultiLayer Multi-Configurational Time-Dependent Hartree method (ML-MCTDH) (Wang and Thoss, 2003). Furthermore, the PARATREE model, developed independently for signal processing applications (Salmi et al., 2009), is quite similar to the HT model (Grasedyck, 2010).


Figure 2.15: The standard Tucker decomposition of an 8th-order tensor into a core tensor (red circle) and eight factor matrices (green circles), and its transformation into an equivalent Hierarchical Tucker (HT) model using interconnected smaller size 3rd-order core tensors and the same factor matrices.

are "separated" from other groups of modes, so that a sequential HT decomposition can be performed via a (truncated) SVD applied to a suitably matricized tensor. One of the simplest and most straightforward choices of a dimension tree is the linear and unbalanced tree, which gives rise to the tensor-train (TT) decomposition, discussed in detail in Section 2.2.2 and Section 4 (Oseledets, 2011; Oseledets and Tyrtyshnikov, 2009).

Using mathematical formalism, a dimension tree is a binary tree $T_N$, $N > 1$, which satisfies that

(i) all nodes $t \in T_N$ are non-empty subsets of $\{1, 2, \ldots, N\}$,

(ii) the set $t_{\mathrm{root}} = \{1, 2, \ldots, N\}$ is the root node of $T_N$, and

(iii) each non-leaf node has two children $u, v \in T_N$ such that $t$ is a disjoint union $t = u \cup v$.

The HT model is illustrated through the following Example.

Example. Suppose that the dimension tree $T_7$ is given, which gives


Figure 2.16: Examples of HT/TT models (formats) for distributed Tucker decompositions with 3rd-order cores, for different orders of data tensors (orders 3 to 8). Green circles denote factor matrices (which can be absorbed by core tensors), while red circles indicate cores. Observe that the representations are not unique.

the HT decomposition illustrated in Figure 2.17. The HT decomposition of a tensor $\underline{\mathbf{X}} \in \mathbb{R}^{I_1 \times \cdots \times I_7}$ with a given set of integers $\{R_t\}_{t \in T_7}$ can


Figure 2.17: Example illustrating the HT decomposition for a 7th-order data tensor.

be expressed in the tensor and vector/matrix forms as follows. Let intermediate tensors $\underline{\mathbf{X}}^{(t)}$ with $t = \{n_1, \ldots, n_k\} \subset \{1, \ldots, 7\}$ have the size $I_{n_1} \times I_{n_2} \times \cdots \times I_{n_k} \times R_t$. Let $\underline{\mathbf{X}}^{(t)}_{r_t} = \underline{\mathbf{X}}^{(t)}(:, \ldots, :, r_t)$ denote the subtensor of $\underline{\mathbf{X}}^{(t)}$ and $\mathbf{X}^{(t)} = \mathbf{X}^{(t)}_{<k>} \in \mathbb{R}^{I_{n_1} I_{n_2} \cdots I_{n_k} \times R_t}$ denote the corresponding unfolded matrix. Let $\underline{\mathbf{G}}^{(t)} \in \mathbb{R}^{R_u \times R_v \times R_t}$ be core tensors, where $u$ and $v$ denote respectively the left and right children of $t$.

The HT model shown in Figure 2.17 can then be described mathematically in the vector form as

$$\operatorname{vec}(\underline{\mathbf{X}}) = \left(\mathbf{X}^{(123)} \otimes_L \mathbf{X}^{(4567)}\right) \operatorname{vec}\!\left(\mathbf{G}^{(12\cdots7)}\right),$$
$$\mathbf{X}^{(123)} = \left(\mathbf{B}^{(1)} \otimes_L \mathbf{X}^{(23)}\right) \mathbf{G}^{(123)}_{<2>}, \qquad \mathbf{X}^{(4567)} = \left(\mathbf{X}^{(45)} \otimes_L \mathbf{X}^{(67)}\right) \mathbf{G}^{(4567)}_{<2>},$$
$$\mathbf{X}^{(23)} = \left(\mathbf{B}^{(2)} \otimes_L \mathbf{B}^{(3)}\right) \mathbf{G}^{(23)}_{<2>}, \qquad \mathbf{X}^{(45)} = \left(\mathbf{B}^{(4)} \otimes_L \mathbf{B}^{(5)}\right) \mathbf{G}^{(45)}_{<2>},$$
$$\mathbf{X}^{(67)} = \left(\mathbf{B}^{(6)} \otimes_L \mathbf{B}^{(7)}\right) \mathbf{G}^{(67)}_{<2>}.$$


An equivalent, more explicit form, using tensor notations, becomes

$$\underline{\mathbf{X}} = \sum_{r_{123}=1}^{R_{123}} \sum_{r_{4567}=1}^{R_{4567}} g^{(12\cdots7)}_{r_{123},\, r_{4567}}\; \underline{\mathbf{X}}^{(123)}_{r_{123}} \circ \underline{\mathbf{X}}^{(4567)}_{r_{4567}},$$

$$\underline{\mathbf{X}}^{(123)}_{r_{123}} = \sum_{r_1=1}^{R_1} \sum_{r_{23}=1}^{R_{23}} g^{(123)}_{r_1,\, r_{23},\, r_{123}}\; \mathbf{b}^{(1)}_{r_1} \circ \underline{\mathbf{X}}^{(23)}_{r_{23}},$$

$$\underline{\mathbf{X}}^{(4567)}_{r_{4567}} = \sum_{r_{45}=1}^{R_{45}} \sum_{r_{67}=1}^{R_{67}} g^{(4567)}_{r_{45},\, r_{67},\, r_{4567}}\; \underline{\mathbf{X}}^{(45)}_{r_{45}} \circ \underline{\mathbf{X}}^{(67)}_{r_{67}},$$

$$\underline{\mathbf{X}}^{(23)}_{r_{23}} = \sum_{r_2=1}^{R_2} \sum_{r_3=1}^{R_3} g^{(23)}_{r_2,\, r_3,\, r_{23}}\; \mathbf{b}^{(2)}_{r_2} \circ \mathbf{b}^{(3)}_{r_3},$$

$$\underline{\mathbf{X}}^{(45)}_{r_{45}} = \sum_{r_4=1}^{R_4} \sum_{r_5=1}^{R_5} g^{(45)}_{r_4,\, r_5,\, r_{45}}\; \mathbf{b}^{(4)}_{r_4} \circ \mathbf{b}^{(5)}_{r_5},$$

$$\underline{\mathbf{X}}^{(67)}_{r_{67}} = \sum_{r_6=1}^{R_6} \sum_{r_7=1}^{R_7} g^{(67)}_{r_6,\, r_7,\, r_{67}}\; \mathbf{b}^{(6)}_{r_6} \circ \mathbf{b}^{(7)}_{r_7}.$$

The TT/HT decompositions lead naturally to a distributed Tucker decomposition, where a single core tensor is replaced by interconnected cores of lower order, resulting in a distributed network in which only some cores are connected directly with factor matrices, as illustrated in Figure 2.15. Figure 2.16 illustrates exemplary HT/TT structures for data tensors of various orders (Tobler, 2012; Kressner and Tobler, 2014). Note that for a 3rd-order tensor there is only one HT tensor network representation, while for a 5th-order tensor we have 5, and for a 10th-order tensor there are 11 possible HT architectures.

A simple approach to reduce the size of a large-scale core tensor in the standard Tucker decomposition (typically, for $N > 5$) would be to apply the concept of distributed tensor networks (DTNs).


Figure 2.18: The Tree Tensor Network State (TTNS) with 3rd-order and 4th-order cores for the representation of 24th-order data tensors. The TTNS can be considered both as a generalization of the HT/TT format and as a distributed model for the Tucker-$N$ decomposition (see Section 3.3).

The DTNs assume two kinds of cores (blocks): (i) the internal cores (nodes), which are connected only to other cores and have no free edges, and (ii) external cores, which do have free edges representing physical modes (indices) of a given data tensor (see also Section 2.2.4). Such distributed representations of tensors are not unique.

The tree tensor network state (TTNS) model, whereby all nodes are of 3rd-order or higher, can be considered as a generalization of the TT/HT decompositions, as illustrated by two examples in Figure 2.18 (Nakatani and Chan, 2013). A more detailed mathematical description of the TTNS is given in Section 3.3.

2.2.2 Tensor Train (TT) Network

The Tensor Train (TT) format can be interpreted as a special case of the HT format, where all nodes (TT-cores) of the underlying tensor network are connected in cascade (or train), i.e., they are aligned, while factor matrices corresponding to the leaf modes are assumed to be identities and thus need not be stored. The TT format was first proposed in numerical analysis and scientific computing in (Oseledets, 2011; Oseledets and Tyrtyshnikov, 2009). Figure 2.19 presents the concept of


Figure 2.19: Concepts of the tensor train (TT) and tensor chain (TC) decompositions (MPS with OBC and PBC, respectively) for an $N$th-order data tensor, $\underline{\mathbf{X}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$. (a) The Tensor Train (TT) can be mathematically described as $x_{i_1,i_2,\ldots,i_N} = \mathbf{G}^{(1)}_{i_1} \mathbf{G}^{(2)}_{i_2} \cdots \mathbf{G}^{(N)}_{i_N}$, where (bottom panel) the slice matrices of the TT-cores $\underline{\mathbf{G}}^{(n)} \in \mathbb{R}^{R_{n-1} \times I_n \times R_n}$ are defined as $\mathbf{G}^{(n)}_{i_n} = \underline{\mathbf{G}}^{(n)}(:, i_n, :) \in \mathbb{R}^{R_{n-1} \times R_n}$, with $R_0 = R_N = 1$. (b) For the Tensor Chain (TC), the entries of a tensor are expressed as $x_{i_1,i_2,\ldots,i_N} = \operatorname{tr}\!\big(\mathbf{G}^{(1)}_{i_1} \mathbf{G}^{(2)}_{i_2} \cdots \mathbf{G}^{(N)}_{i_N}\big) = \sum_{r_1=1}^{R_1} \sum_{r_2=1}^{R_2} \cdots \sum_{r_N=1}^{R_N} g^{(1)}_{r_N, i_1, r_1}\; g^{(2)}_{r_1, i_2, r_2} \cdots g^{(N)}_{r_{N-1}, i_N, r_N}$, where (bottom panel) the lateral slices of the TC-cores are defined as $\mathbf{G}^{(n)}_{i_n} = \underline{\mathbf{G}}^{(n)}(:, i_n, :) \in \mathbb{R}^{R_{n-1} \times R_n}$ and $g^{(n)}_{r_{n-1}, i_n, r_n} = \underline{\mathbf{G}}^{(n)}(r_{n-1}, i_n, r_n)$ for $n = 1, 2, \ldots, N$, with $R_0 = R_N > 1$. Notice that TC/MPS is effectively a TT with a single loop connecting the first and the last core, so that all TC-cores are of 3rd-order.


TT decomposition for an $N$th-order tensor, the entries of which can be computed as a cascaded (multilayer) multiplication of appropriate matrices (slices of TT-cores). The weights of internal edges (denoted by $\{R_1, R_2, \ldots, R_{N-1}\}$) represent the TT-rank. In this way, the so aligned sequence of core tensors represents a "tensor train", where the role of "buffers" is played by the TT-core connections. It is important to highlight that TT networks can be applied not only for the approximation of tensorized vectors but also for scalar multivariate functions, matrices, and even large-scale low-order tensors, as illustrated in Figure 2.20 (for more detail see Section 4).
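The cascaded matrix products defining a TT representation can be evaluated entry-wise as in the following sketch (randomly generated cores, with boundary ranks $R_0 = R_N = 1$); the helper tt_entry is ours, not a routine from the monograph.

```python
import numpy as np

shapes = [(1, 4, 3), (3, 5, 2), (2, 6, 1)]        # (R_{n-1}, I_n, R_n) per core
cores = [np.random.randn(*s) for s in shapes]

def tt_entry(cores, idx):
    # x[i1,...,iN] = G1[:, i1, :] @ G2[:, i2, :] @ ... @ GN[:, iN, :]
    out = np.eye(1)
    for G, i in zip(cores, idx):
        out = out @ G[:, i, :]
    return out.item()                              # 1x1 matrix -> scalar

print(tt_entry(cores, (0, 2, 4)))
```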

In the quantum physics community, the TT format is known as the Matrix Product State (MPS) representation with Open Boundary Conditions (OBC), and was introduced in 1987 as the ground state of the 1D AKLT model (Affleck et al., 1987). It was subsequently extended by many researchers⁶ (see (White, 1993; Vidal, 2003; Perez-Garcia et al., 2007; Verstraete et al., 2008; Schollwöck, 2013; Huckle et al., 2013; Orús, 2014) and references therein).

Advantages of TT formats. An important advantage of the TT/MPS format over the HT format is its simpler practical implementation, as no binary tree needs to be determined (see Section 4). Another attractive property of the TT decomposition is its simplicity when performing basic mathematical operations on tensors directly in the TT format (that is, employing only core tensors). These include matrix-by-matrix and matrix-by-vector multiplications, tensor addition, and the entry-wise (Hadamard) product of tensors. These operations produce tensors, also in the TT format, which generally exhibit increased TT-ranks. A detailed description of basic operations supported by the TT format is given in Section 4.5. Moreover, only

⁶In fact, the TT was rediscovered several times under different names: MPS, valence bond states, and density matrix renormalization group (DMRG) (White, 1993). The DMRG usually refers not only to a tensor network format but also the efficient computational algorithms (see also (Schollwöck, 2011; Hubig et al., 2015) and references therein). Also, in quantum physics the ALS algorithm is called the one-site DMRG, while the Modified ALS (MALS) is known as the two-site DMRG (for more detail, see Part 2).


Figure 2.20: Forms of tensor train decompositions for a vector, $\mathbf{a} \in \mathbb{R}^I$, a matrix, $\mathbf{A} \in \mathbb{R}^{I \times J}$, and a 3rd-order tensor, $\underline{\mathbf{A}} \in \mathbb{R}^{I \times J \times K}$ (by applying a suitable tensorization), with $I = I_1 I_2 \cdots I_N$, $J = J_1 J_2 \cdots J_N$ and $K = K_1 K_2 \cdots K_N$.

TT-cores need to be stored and processed, which makes the number of parameters scale linearly in the tensor order, $N$, of a data tensor; all mathematical operations are then performed only on the low-order and relatively small size core tensors.

The TT rank is defined as an $(N-1)$-tuple of the form

$$\operatorname{rank}_{\mathrm{TT}}(\underline{\mathbf{X}}) = \mathbf{r}_{\mathrm{TT}} = \{R_1, \ldots, R_{N-1}\}, \qquad R_n = \operatorname{rank}\!\left(\mathbf{X}_{<n>}\right), \qquad (2.21)$$

where $\mathbf{X}_{<n>} \in \mathbb{R}^{I_1 I_2 \cdots I_n \times I_{n+1} \cdots I_N}$ is an $n$th canonical matricization of the tensor $\underline{\mathbf{X}}$. Since the TT rank determines memory requirements of a tensor train, it has a strong impact on the complexity, i.e., the suitability of the tensor train representation for a given raw data tensor.

The number of data samples to be stored scales linearly in the tensor order, $N$, and the size, $I$, and quadratically in the maximum TT rank bound, $R$, that is,

$$\sum_{n=1}^{N} R_{n-1} R_n I_n = \mathcal{O}(N R^2 I), \qquad R := \max_n \{R_n\}, \quad I := \max_n \{I_n\}. \qquad (2.22)$$
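A quick numerical illustration of (2.22): for a modest 10th-order tensor, the TT format stores a few hundred parameters instead of roughly a million (the sizes below are our own illustrative choice).

```python
# Storage of a TT representation versus the full tensor, following Eq. (2.22).
N, I, R = 10, 4, 5                        # order, mode size, TT-rank bound
ranks = [1] + [R] * (N - 1) + [1]         # R_0 = R_N = 1
tt_params = sum(ranks[n] * I * ranks[n + 1] for n in range(N))
full_params = I ** N
print(tt_params, full_params)             # 840 versus 1048576
```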

This is why it is crucially important to have low-rank TT approximations⁷. A drawback of the TT format is that the ranks of a tensor train

⁷In the worst case scenario, the TT ranks can grow up to $I^{(N/2)}$ for an $N$th-order tensor.


Figure 2.21: Class of 1D and 2D tensor train networks with open boundary conditions (OBC): the Matrix Product State (MPS) or (vector) Tensor Train (TT), the Matrix Product Operator (MPO) or Matrix TT, the Projected Entangled-Pair States (PEPS) or Tensor Product State (TPS), and the Projected Entangled-Pair Operators (PEPO).

decomposition depend on the ordering (permutation) of the modes, which gives different sizes of cores for different orderings. To solve this challenging permutation problem, we can estimate the mutual information between individual TT cores pairwise (see (Barcza et al., 2011; Ehlers et al., 2015)). The procedure can be arranged in the following three steps: (i) perform a rough (approximate) TT decomposition with relatively low TT-rank and calculate the mutual information between all pairs of cores, (ii) order the TT cores in such a way that the mutual information matrix is close to a diagonal matrix, and finally, (iii) perform the TT decomposition again using the so optimised order of TT cores (see also Part 2).

2.2.3 Tensor Networks with Cycles: PEPS, MERA and Honey-Comb Lattice (HCL)

An important issue in tensor networks is the rank-complexity trade-off in the design. Namely, the main idea behind TNs is to dramatically


reduce computational cost and provide distributed storage and computation through low-rank TN approximation. However, the TT/HT ranks, $R_n$, of 3rd-order core tensors sometimes increase rapidly with the order of a data tensor and/or an increase of the desired approximation accuracy, for any choice of a tree of the tensor network. The ranks can often be kept under control through hierarchical two-dimensional TT models called the PEPS (Projected Entangled Pair States⁸) and PEPO (Projected Entangled Pair Operators) tensor networks, which contain cycles, as shown in Figure 2.21. In the PEPS and PEPO, the ranks are kept considerably smaller at a cost of employing 5th- or even 6th-order core tensors and the associated higher computational complexity with respect to the order (Verstraete et al., 2008; Evenbly and Vidal, 2009; Schuch et al., 2010).

Even with the PEPS/PEPO architectures, for very high-order tensors, the ranks (internal sizes of cores) may increase rapidly with an increase in the desired accuracy of approximation. For further control of the ranks, alternative tensor networks can be employed, such as: (1) the Honey-Comb Lattice (HCL), which uses 3rd-order cores, and (2) the Multi-scale Entanglement Renormalization Ansatz (MERA), which consists of both 3rd- and 4th-order core tensors (see Figure 2.22) (Giovannetti et al., 2008; Orús, 2014; Matsueda, 2016). The ranks are often kept considerably small through special architectures of such TNs, at the expense of higher computational complexity with respect to tensor contractions due to many cycles.

Compared with the PEPS and PEPO formats, the main advantage of the MERA formats is that the order and size of each core tensor in the internal tensor network structure is often much smaller, which dramatically reduces the number of free parameters and provides more efficient distributed storage of huge-scale data tensors. Moreover, TNs with cycles, and especially the MERA tensor network, allow us to model more complex functions and interactions between variables.

⁸An "entangled pair state" is a tensor that cannot be represented as an elementary rank-1 tensor. The state is called "projected" because it is not a real physical state but a projection onto some subspace. The term "pair" refers to the entanglement being considered only for maximally entangled state pairs (Orús, 2014; Handschuh, 2015).


Figure 2.22: Examples of TN architectures with loops. (a) Honey-Comb Lattice (HCL) for a 16th-order tensor. (b) MERA for a 32nd-order tensor.

2.2.4 Concatenated (Distributed) Representation of TT Networks

Complexity of algorithms for computation (contraction) on tensor networks typically scales polynomially with the rank, $R_n$, or size, $I_n$, of the core tensors, so that the computations quickly become intractable with the increase in $R_n$. A step towards reducing storage and computational requirements would therefore be to reduce the size (volume) of core tensors by increasing their number through distributed tensor networks (DTNs), as illustrated in Figure 2.22. The underpinning idea is that each core tensor in an original TN is replaced by another TN (see Figure 2.23 for TT networks), resulting in a distributed TN in which only some core tensors are associated with physical (natural) modes of the original data tensor (Hübener et al., 2010). A DTN consists of two kinds of relatively small-size cores (nodes): internal nodes, which have no free edges, and external nodes, which have free edges representing natural (physical) indices of a data tensor.

The obvious advantage of DTNs is that the size of each core tensor in the internal tensor network structure is usually much smaller than the size of the initial core tensor; this allows for a better management of distributed storage, and often results in the reduction of the total number of network parameters through distributed computing. However,


Figure 2.23: Graphical representation of a large-scale data tensor via its TT model (top panel), the PEPS model of the TT (third panel), and its transformation to a distributed 2D (second from bottom panel) and 3D (bottom panel) tensor train networks.


Table 2.2: Links between tensor networks (TNs) and graphical models used in Machine Learning (ML) and Statistics. The corresponding categories are not exactly the same, but have general analogies.

Tensor Networks | Neural Networks and Graphical Models in ML/Statistics
TT/MPS | Hidden Markov Models (HMM)
HT/TTNS | Deep Learning Neural Networks, Gaussian Mixture Model (GMM)
PEPS | Markov Random Field (MRF), Conditional Random Field (CRF)
MERA | Wavelets, Deep Belief Networks (DBN)
ALS, DMRG/MALS Algorithms | Forward-Backward Algorithms, Block Nonlinear Gauss-Seidel Methods

compared to initial tree structures, the contraction of the resulting distributed tensor network becomes much more difficult because of the loops in the architecture.

2.2.5 Links between TNs and Machine Learning Models

Table 2.2 summarizes the conceptual connections of tensor networks with graphical and neural network models in machine learning and statistics (Morton, 2012; Critch, 2013; Critch and Morton, 2014; Kazeev et al., 2014; Novikov and Rodomanov, 2014; Yang and Hospedales, 2016; Cohen et al., 2016; Cohen and Shashua, 2016; Evenbly and White, 2016). More research is needed to establish deeper and more precise relationships.

2.2.6 Changing the Structure of Tensor Networks

An advantage of the graphical (graph) representation of tensor networks is that the graphs allow us to perform complex mathematical operations on core tensors in an intuitive and easy to understand way,


without the need to resort to complicated mathematical expressions. Another important advantage is the ability to modify (optimize) the topology of a TN, while keeping the original physical modes intact. The so optimized topologies yield simplified or more convenient graphical representations of a higher-order data tensor and facilitate practical applications (Zhao et al., 2010; Hübener et al., 2010; Handschuh, 2015). In particular:

• A change in topology to a HT/TT tree structure provides reduced computational complexity, through sequential contractions of core tensors, and enhanced stability of the corresponding algorithms;

• The topology of TNs with cycles can be modified so as to completely eliminate the cycles or to reduce their number;

• Even for vastly diverse original data tensors, topology modifications may produce identical or similar TN structures, which makes it easier to compare and jointly analyze blocks of interconnected data tensors. This provides the opportunity to perform joint group (linked) analysis of tensors by decomposing them into TNs.

It is important to note that, due to the iterative way in which tensor contractions are performed, the computational requirements associated with tensor contractions are usually much smaller for tree-structured networks than for tensor networks containing many cycles. Therefore, for stable computations, it is advantageous to transform a tensor network with cycles into a tree structure.

Tensor Network transformations. In order to modify tensor network structures, we may perform sequential core contractions, followed by the unfolding of these contracted tensors into matrices, matrix factorizations (typically truncated SVD) and finally reshaping of such matrices back into new core tensors, as illustrated in Figure 2.24.

The example in Figure 2.24(a) shows that, in the first step, a contraction of two core tensors, $\underline{\mathbf{G}}^{(1)} \in \mathbb{R}^{I_1 \times I_2 \times R}$ and $\underline{\mathbf{G}}^{(2)} \in \mathbb{R}^{R \times I_3 \times I_4}$, is


Figure 2.24: Illustration of basic transformations on a tensor network. (a) Contraction, matricization, matrix factorization (SVD) and reshaping of matrices back into tensors. (b) Transformation of a Honey-Comb lattice into a Tensor Chain (TC) via tensor contractions and the SVD.

performed to give the tensor

$$\underline{\mathbf{G}}^{(1,2)} = \underline{\mathbf{G}}^{(1)} \times^1 \underline{\mathbf{G}}^{(2)} \in \mathbb{R}^{I_1 \times I_2 \times I_3 \times I_4}, \qquad (2.23)$$

with entries $g^{(1,2)}_{i_1,i_2,i_3,i_4} = \sum_{r=1}^{R} g^{(1)}_{i_1,i_2,r}\, g^{(2)}_{r,i_3,i_4}$. In the next step, the tensor $\underline{\mathbf{G}}^{(1,2)}$ is transformed into a matrix via matricization, followed by a low-rank matrix factorization using the SVD, to give

$$\mathbf{G}^{(1,2)}_{\overline{i_1 i_4},\, \overline{i_2 i_3}} \cong \mathbf{U}\mathbf{S}\mathbf{V}^{\mathrm{T}} \in \mathbb{R}^{I_1 I_4 \times I_2 I_3}. \qquad (2.24)$$

In the final step, the factor matrices, $\mathbf{U}\mathbf{S}^{1/2} \in \mathbb{R}^{I_1 I_4 \times R_1}$ and $\mathbf{S}^{1/2}\mathbf{V}^{\mathrm{T}} \in \mathbb{R}^{R_1 \times I_2 I_3}$, are reshaped into new core tensors, $\underline{\mathbf{G}}'^{(1)} \in \mathbb{R}^{I_1 \times R_1 \times I_4}$ and $\underline{\mathbf{G}}'^{(2)} \in \mathbb{R}^{R_1 \times I_2 \times I_3}$.
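The contraction–matricization–SVD–reshaping cycle of Figure 2.24(a) can be prototyped in a few lines of NumPy, as in the sketch below (our illustration; the exact index ordering of the matricization is a convention choice).

```python
import numpy as np

I1, I2, I3, I4, R, R1 = 3, 4, 5, 6, 2, 4
G1 = np.random.randn(I1, I2, R)
G2 = np.random.randn(R, I3, I4)

# Contraction over the common mode R gives G12 of size I1 x I2 x I3 x I4.
G12 = np.tensordot(G1, G2, axes=([2], [0]))

# Matricization with row index (i1, i4) and column index (i2, i3).
M = G12.transpose(0, 3, 1, 2).reshape(I1 * I4, I2 * I3)

# Truncated SVD, splitting the singular values evenly between the factors.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
U, s, Vt = U[:, :R1], s[:R1], Vt[:R1, :]
G1_new = (U * np.sqrt(s)).reshape(I1, I4, R1).transpose(0, 2, 1)   # I1 x R1 x I4
G2_new = (np.sqrt(s)[:, None] * Vt).reshape(R1, I2, I3)            # R1 x I2 x I3
print(G1_new.shape, G2_new.shape)
```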


Figure 2.25: Transformation of the closed-loop Tensor Chain (TC) into the open-loop Tensor Train (TT). This is achieved by suitable contractions, reshaping and decompositions of core tensors.

The above tensor transformation procedure is quite general, and is applied in Figure 2.24(b) to transform a Honey-Comb lattice into a tensor chain (TC), while Figure 2.25 illustrates the conversion of a tensor chain (TC) into TT/MPS with OBC.

To convert a TC into TT/MPS, in the first step, we perform a contraction of two tensors, $\underline{\mathbf{G}}^{(1)} \in \mathbb{R}^{I_1 \times R_4 \times R_1}$ and $\underline{\mathbf{G}}^{(2)} \in \mathbb{R}^{R_1 \times R_2 \times I_2}$, as

$$\underline{\mathbf{G}}^{(1,2)} = \underline{\mathbf{G}}^{(1)} \times^1 \underline{\mathbf{G}}^{(2)} \in \mathbb{R}^{I_1 \times R_4 \times R_2 \times I_2},$$

for which the entries $g^{(1,2)}_{i_1,r_4,r_2,i_2} = \sum_{r_1=1}^{R_1} g^{(1)}_{i_1,r_4,r_1}\, g^{(2)}_{r_1,r_2,i_2}$. In the next step, the tensor $\underline{\mathbf{G}}^{(1,2)}$ is transformed into a matrix, followed by a truncated SVD

$$\mathbf{G}^{(1,2)}_{(1)} \cong \mathbf{U}\mathbf{S}\mathbf{V}^{\mathrm{T}} \in \mathbb{R}^{I_1 \times R_4 R_2 I_2}.$$

Finally, the matrices, $\mathbf{U} \in \mathbb{R}^{I_1 \times R'_1}$ and $\mathbf{S}\mathbf{V}^{\mathrm{T}} \in \mathbb{R}^{R'_1 \times R_4 R_2 I_2}$, are reshaped back into the core tensors, $\underline{\mathbf{G}}'^{(1)} = \mathbf{U} \in \mathbb{R}^{1 \times I_1 \times R'_1}$ and $\underline{\mathbf{G}}'^{(2)} \in$


$\mathbb{R}^{R'_1 \times R_4 \times R_2 \times I_2}$. The procedure is repeated all over again for different pairs of cores, as illustrated in Figure 2.25.

2.3 Generalized Tensor Network Formats

The fundamental TNs considered so far assume that the links between the cores are expressed by tensor contractions. In general, links between the core tensors (or tensor sub-networks) can also be expressed via other mathematical linear/multilinear or nonlinear operators, such as the outer (tensor) product, Kronecker product, Hadamard product and convolution operator. For example, the use of the outer product leads to the Block Term Decomposition (BTD) (De Lathauwer, 2008; De Lathauwer and Nion, 2008; De Lathauwer, 2011; Sorber et al., 2013), while the use of Kronecker products yields the Kronecker Tensor Decomposition (KTD) (Ragnarsson, 2012; Phan et al., 2012, 2013b). Block term decompositions (BTD) are closely related to constrained Tucker formats (with a sparse block Tucker core) and the Hierarchical Outer Product Tensor Approximation (HOPTA), which can be employed for very high-order data tensors (Cichocki, 2014).

Figure 2.26 illustrates such a BTD model for a 6th-order tensor, where the links between the components are expressed via outer products, while Figure 2.27 shows a more flexible Hierarchical Outer Product Tensor Approximation (HOPTA) model suitable for very high-order tensors.

Observe that the fundamental operator in the HOPTA generalized tensor networks is the outer (tensor) product, which for two tensors $\underline{\mathbf{A}} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ and $\underline{\mathbf{B}} \in \mathbb{R}^{J_1 \times \cdots \times J_M}$, of arbitrary orders $N$ and $M$, is defined as an $(N+M)$th-order tensor $\underline{\mathbf{C}} = \underline{\mathbf{A}} \circ \underline{\mathbf{B}} \in \mathbb{R}^{I_1 \times \cdots \times I_N \times J_1 \times \cdots \times J_M}$, with entries $c_{i_1,\ldots,i_N,\,j_1,\ldots,j_M} = a_{i_1,\ldots,i_N}\, b_{j_1,\ldots,j_M}$. This standard outer product of two tensors can be generalized to a nonlinear outer product as follows

$$\left(\underline{\mathbf{A}} \circ_f \underline{\mathbf{B}}\right)_{i_1,\ldots,i_N,\,j_1,\ldots,j_M} = f\!\left(a_{i_1,\ldots,i_N},\, b_{j_1,\ldots,j_M}\right), \qquad (2.25)$$

where $f(\cdot, \cdot)$ is a suitably designed nonlinear function with associative and commutative properties. In a similar way, we can define other nonlinear tensor products, for example, Hadamard, Kronecker or Khatri–Rao products, and employ them in generalized nonlinear tensor


Figure 2.26: Block term decomposition (BTD) of a 6th-order block tensor, to yield $\underline{\mathbf{X}} = \sum_{r=1}^{R} \underline{\mathbf{A}}_r \circ \big(\mathbf{b}^{(1)}_r \circ \mathbf{b}^{(2)}_r \circ \mathbf{b}^{(3)}_r\big)$ (top panel); for more detail see (De Lathauwer, 2008; Sorber et al., 2013). BTD in the tensor network notation (bottom panel). Therefore, the 6th-order tensor $\underline{\mathbf{X}}$ is approximately represented as a sum of $R$ terms, each of which is an outer product of a 3rd-order tensor, $\underline{\mathbf{A}}_r$, and another 3rd-order, rank-1 tensor, $\mathbf{b}^{(1)}_r \circ \mathbf{b}^{(2)}_r \circ \mathbf{b}^{(3)}_r$ (in the dashed circle), which itself is an outer product of three vectors.

networks. The advantage of the HOPTA model over other TN models is its flexibility and the ability to model more complex data structures by approximating very high-order tensors through a relatively small number of low-order cores.
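A generalized outer product of the form (2.25) is easy to prototype via broadcasting; below is a minimal sketch (our illustration) with $f(x, y) = \max(x, y)$ as an example of an associative and commutative choice.

```python
import numpy as np

def nonlinear_outer(A, B, f):
    # (A o_f B)[i..., j...] = f(a[i...], b[j...]), implemented by broadcasting
    # A over the modes of B.
    Af = A.reshape(A.shape + (1,) * B.ndim)
    return f(Af, B)

A, B = np.random.randn(2, 3), np.random.randn(4)
C = nonlinear_outer(A, B, np.maximum)
print(C.shape)                                  # (2, 3, 4)
assert np.isclose(C[1, 2, 3], max(A[1, 2], B[3]))
```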

The BTD and KTD models can be expressed mathematically, for example, in simple nested (hierarchical) forms, given by

$$\text{BTD:} \quad \underline{\mathbf{X}} = \sum_{r=1}^{R} \left(\underline{\mathbf{A}}_r \circ \underline{\mathbf{B}}_r\right), \qquad (2.26)$$

$$\text{KTD:} \quad \underline{\mathbf{X}} = \sum_{r=1}^{R} \left(\underline{\mathbf{A}}_r \otimes \underline{\mathbf{B}}_r\right), \qquad (2.27)$$

where, e.g., for the BTD, each factor tensor can be represented recursively as $\underline{\mathbf{A}}_r = \sum_{r_1=1}^{R_1} \big(\underline{\mathbf{A}}^{(1)}_{r_1} \circ \underline{\mathbf{B}}^{(1)}_{r_1}\big)$ or $\underline{\mathbf{B}}_r = \sum_{r_2=1}^{R_2} \underline{\mathbf{A}}^{(2)}_{r_2} \circ \underline{\mathbf{B}}^{(2)}_{r_2}$.

Note that the $2N$th-order subtensors, $\underline{\mathbf{A}}_r \circ \underline{\mathbf{B}}_r$ and $\underline{\mathbf{A}}_r \otimes \underline{\mathbf{B}}_r$, have the same elements, just arranged differently. For example, if $\underline{\mathbf{X}} = \underline{\mathbf{A}} \circ \underline{\mathbf{B}}$


Figure 2.27: Conceptual model of the HOPTA generalized tensor network, illustrated for data tensors of different orders. For simplicity, we use the standard outer (tensor) products, but conceptually nonlinear outer products (see Eq. (2.25)) and other tensor product operators (Kronecker, Hadamard) can also be employed. Each component (core tensor), $\underline{\mathbf{A}}_r$, $\underline{\mathbf{B}}_r$ and/or $\underline{\mathbf{C}}_r$, can be further hierarchically decomposed using suitable outer products, so that the HOPTA models can be applied to very high-order tensors.


and $\underline{\mathbf{X}}_1 = \underline{\mathbf{A}} \otimes \underline{\mathbf{B}}$, where $\underline{\mathbf{A}} \in \mathbb{R}^{J_1 \times J_2 \times \cdots \times J_N}$ and $\underline{\mathbf{B}} \in \mathbb{R}^{K_1 \times K_2 \times \cdots \times K_N}$, then $x_{j_1,j_2,\ldots,j_N,\,k_1,k_2,\ldots,k_N} = (x_1)_{k_1+K_1(j_1-1),\, \ldots,\, k_N+K_N(j_N-1)}$.

The definition of the tensor Kronecker product in the KTD model assumes that both core tensors, $\underline{\mathbf{A}}_r$ and $\underline{\mathbf{B}}_r$, have the same order. This is not a limitation, given that vectors and matrices can also be treated as tensors, e.g., a matrix of dimension $I \times J$ is also a 3rd-order tensor of dimension $I \times J \times 1$. In fact, from the BTD/KTD models, many existing and new TDs/TNs can be derived by changing the structure and orders of the factor tensors, $\underline{\mathbf{A}}_r$ and $\underline{\mathbf{B}}_r$. For example:

• If $\underline{\mathbf{A}}_r$ are rank-1 tensors of size $I_1 \times I_2 \times \cdots \times I_N$, and $\underline{\mathbf{B}}_r$ are scalars, $\forall r$, then (2.27) represents the rank-$R$ CP decomposition;

• If $\underline{\mathbf{A}}_r$ are rank-$L_r$ tensors of size $I_1 \times I_2 \times \cdots \times I_R \times 1 \times \cdots \times 1$, in the Kruskal (CP) format, and $\underline{\mathbf{B}}_r$ are rank-1 tensors of size $1 \times \cdots \times 1 \times I_{R+1} \times \cdots \times I_N$, $\forall r$, then (2.27) expresses the rank-$(L_r \circ 1)$ BTD;

• If $\underline{\mathbf{A}}_r$ and $\underline{\mathbf{B}}_r$ are expressed by KTDs, we arrive at the Nested Kronecker Tensor Decomposition (NKTD), a special case of which is the Tensor Train (TT) decomposition. Therefore, the BTD model in (2.27) can also be used for recursive TT-decompositions.

The generalized tensor network approach caters for a large variety of tensor decomposition models, which may find applications in scientific computing, signal processing or deep learning (see, e.g., (De Lathauwer, 2011; Cichocki, 2013b, 2014; Phan et al., 2015b; Cohen and Shashua, 2016)).

In this monograph, we will mostly focus on the more established Tucker and TT decompositions (and some of their extensions), due to their conceptual simplicity, the availability of stable and efficient algorithms for their computation, and the possibility to naturally extend these models to more complex tensor networks. In other words, the Tucker and TT models are considered here as the simplest prototypes, which can then serve as building blocks for more sophisticated tensor networks.


3 Constrained Tensor Decompositions: From Two-way to Multiway Component Analysis

The component analysis (CA) framework usually refers to the application of constrained matrix factorization techniques to observed mixed signals in order to extract components with specific properties and/or estimate the mixing matrix (Cichocki and Amari, 2003; Cichocki et al., 2009; Comon and Jutten, 2010; De la Torre, 2012; Hyvärinen, 2013). In machine learning practice, to aid the well-posedness and uniqueness of the problem, component analysis methods exploit prior knowledge about the statistics and diversities of the latent variables (hidden sources) within the data. Here, by diversities we refer to different characteristics, features or morphology of the latent variables which allow us to extract the desired components or features, for example, sparse or statistically independent components.

3.1 Constrained Low-Rank Matrix Factorizations

Two-way Component Analysis (2-way CA), in its simplest form, can be formulated as a constrained matrix factorization of typically low rank, in the form

$$ X = A \Lambda B^{T} + E = \sum_{r=1}^{R} \lambda_r \, a_r \circ b_r + E = \sum_{r=1}^{R} \lambda_r \, a_r b_r^{T} + E, \qquad (3.1) $$

where $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_R)$ is an optional diagonal scaling matrix.


The potential constraints imposed on the factor matrices, $A$ and/or $B$, include orthogonality, sparsity, statistical independence, nonnegativity or smoothness. In the bilinear 2-way CA in (3.1), $X \in \mathbb{R}^{I \times J}$ is a known matrix of observed data, $E \in \mathbb{R}^{I \times J}$ represents residuals or noise, $A = [a_1, a_2, \ldots, a_R] \in \mathbb{R}^{I \times R}$ is the unknown mixing matrix with $R$ basis vectors $a_r \in \mathbb{R}^{I}$, and, depending on the application, $B = [b_1, b_2, \ldots, b_R] \in \mathbb{R}^{J \times R}$ is the matrix of unknown components, factors, latent variables, or hidden sources, represented by the vectors $b_r \in \mathbb{R}^{J}$ (see Figure 3.2).

It should be noted that 2-way CA has an inherent symmetry. Indeed, Eq. (3.1) could also be written as $X^{T} \cong B A^{T}$, thus interchanging the roles of sources and mixing process.

Algorithmic approaches to 2-way (matrix) component analysis are well established, and include Principal Component Analysis (PCA), Robust PCA (RPCA), Independent Component Analysis (ICA), Nonnegative Matrix Factorization (NMF), Sparse Component Analysis (SCA) and Smooth Component Analysis (SmCA) (Bach and Jordan, 2003; Cichocki et al., 2009; Bruckstein et al., 2009; Comon and Jutten, 2010; Kauppi et al., 2015; Yokota et al., 2016). These techniques have become standard tools in blind source separation (BSS), feature extraction, and classification paradigms. The columns of the matrix $B$, which represent different latent components, are then determined by the specific chosen constraints and should be, for example: (i) as statistically mutually independent as possible for ICA; (ii) as sparse as possible for SCA; (iii) as smooth as possible for SmCA; (iv) take only nonnegative values for NMF.

The singular value decomposition (SVD) of the data matrix $X \in \mathbb{R}^{I \times J}$ is a special, very important, case of the factorization in Eq. (3.1), and is given by

$$ X = U S V^{T} = \sum_{r=1}^{R} \sigma_r \, u_r \circ v_r = \sum_{r=1}^{R} \sigma_r \, u_r v_r^{T}, \qquad (3.2) $$

where $U \in \mathbb{R}^{I \times R}$ and $V \in \mathbb{R}^{J \times R}$ are column-wise orthogonal matrices and $S \in \mathbb{R}^{R \times R}$ is a diagonal matrix containing only nonnegative singular values $\sigma_r$, arranged in monotonically non-increasing order.

According to the well-known Eckart–Young theorem, the truncated SVD provides the optimal, in the least-squares (LS) sense, low-rank matrix approximation¹. The SVD therefore forms the backbone of low-rank matrix approximations (and, consequently, of low-rank tensor approximations).
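To make the role of the truncated SVD concrete, the following is a minimal NumPy sketch (not from the original text); the matrix sizes and the rank $R$ are arbitrary illustrative choices.

```python
import numpy as np

def truncated_svd_approx(X, R):
    """Best rank-R approximation of X in the least-squares (Frobenius) sense."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Keep only the R leading singular triplets (Eckart-Young theorem).
    return U[:, :R] @ np.diag(s[:R]) @ Vt[:R, :]

# Illustrative usage on a random low-rank-plus-noise matrix.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5)) @ rng.standard_normal((5, 80))
X_hat = truncated_svd_approx(X + 0.01 * rng.standard_normal(X.shape), R=5)
print(np.linalg.norm(X - X_hat) / np.linalg.norm(X))  # small relative error
```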

Another virtue of component analysis comes from the ability to perform simultaneous matrix factorizations,

$$ X_k \cong A_k B_k^{T}, \qquad (k = 1, 2, \ldots, K), \qquad (3.3) $$

on several data matrices, $X_k$, which represent linked datasets, subject to various constraints imposed on the linked (interrelated) component (factor) matrices. In the case of orthogonality or statistical independence constraints, the problem in (3.3) can be related to models of group PCA/ICA through suitable pre-processing, dimensionality reduction and post-processing procedures (Esposito et al., 2005; Groves et al., 2011; Cichocki, 2013a; Smith et al., 2014; Zhou et al., 2016b). The terms “group component analysis” and “joint multi-block data analysis” are used interchangeably to refer to methods which aim to identify links (correlations, similarities) between hidden components in data. In other words, the objective of group component analysis is to analyze the correlation, variability, and consistency of the latent components across multi-block datasets. The field of 2-way CA is maturing and has generated efficient algorithms for 2-way component analysis, especially for sparse/functional PCA/SVD, ICA, NMF and SCA (Bach and Jordan, 2003; Cichocki and Amari, 2003; Comon and Jutten, 2010; Zhou et al., 2012; Hyvärinen, 2013).

The rapidly emerging field of tensor decompositions is the next important step, which naturally generalizes 2-way CA/BSS models and algorithms. Tensors, by virtue of multilinear algebra, offer enhanced flexibility in CA, in the sense that not all components need to be statistically independent; they can instead be smooth, sparse, and/or non-negative (e.g., spectral components). Furthermore, additional constraints can be used to reflect physical properties and/or diversities of spatial distributions, spectral and temporal patterns.

¹ (Mirsky, 1960) has generalized this optimality to arbitrary unitarily invariant norms.

Page 72: TensorNetworksforDimensionalityReduction andLarge …mandic/Tensor_Networks_Cichocki... · 2017. 1. 2. · A.Cichockiet al. Tensor Networks for Dimensionality Reduction and Large-Scale

3.2. The CP Format 317

We proceed to show how constrained matrix factorizations or 2-way CA models can be extended to multilinear models using tensor decompositions, such as the Canonical Polyadic (CP) and the Tucker decompositions, as illustrated in Figures 3.1, 3.2 and 3.3.

3.2 The CP Format

The CP decomposition (also called the CANDECOMP, PARAFAC, or Canonical Polyadic decomposition) decomposes an $N$th-order tensor, $X \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, into a linear combination of terms, $b^{(1)}_r \circ b^{(2)}_r \circ \cdots \circ b^{(N)}_r$, which are rank-1 tensors, and is given by (Hitchcock, 1927; Harshman, 1970; Carroll and Chang, 1970)

$$ X \cong \sum_{r=1}^{R} \lambda_r \; b^{(1)}_r \circ b^{(2)}_r \circ \cdots \circ b^{(N)}_r = \Lambda \times_1 B^{(1)} \times_2 B^{(2)} \cdots \times_N B^{(N)} = \llbracket \Lambda;\, B^{(1)}, B^{(2)}, \ldots, B^{(N)} \rrbracket, \qquad (3.4) $$

where $\lambda_r$ are the non-zero entries of the diagonal core tensor $\Lambda \in \mathbb{R}^{R \times R \times \cdots \times R}$ and $B^{(n)} = [b^{(n)}_1, b^{(n)}_2, \ldots, b^{(n)}_R] \in \mathbb{R}^{I_n \times R}$ are factor matrices (see Figure 3.1 and Figure 3.2).

Via the Khatri–Rao products (see Table 2.1), the CP decomposition can be equivalently expressed in a matrix/vector form as

$$ X_{(n)} \cong B^{(n)} \Lambda \big( B^{(N)} \odot \cdots \odot B^{(n+1)} \odot B^{(n-1)} \odot \cdots \odot B^{(1)} \big)^{T} = B^{(n)} \Lambda \big( B^{(1)} \odot_L \cdots \odot_L B^{(n-1)} \odot_L B^{(n+1)} \odot_L \cdots \odot_L B^{(N)} \big)^{T} \qquad (3.5) $$

and

$$ \operatorname{vec}(X) \cong \big[ B^{(N)} \odot B^{(N-1)} \odot \cdots \odot B^{(1)} \big] \lambda = \big[ B^{(1)} \odot_L B^{(2)} \odot_L \cdots \odot_L B^{(N)} \big] \lambda, \qquad (3.6) $$

where $\lambda = [\lambda_1, \lambda_2, \ldots, \lambda_R]^{T}$ and $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_R)$ is a diagonal matrix.

The rank of a tensor $X$ is defined as the smallest $R$ for which the CP decomposition in (3.4) holds exactly.
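As a quick numerical illustration of Eqs. (3.4)–(3.6) (a sketch, not part of the original text; the tensor sizes and rank are arbitrary), the following NumPy snippet builds a 3rd-order CP tensor from its factor matrices and verifies a Khatri–Rao form of the mode-1 unfolding. The reshape convention used here (NumPy's default C-order) pairs with this particular Khatri–Rao ordering; other unfolding conventions simply permute the columns.

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker (Khatri-Rao) product of A (I x R) and B (J x R)."""
    I, R = A.shape
    J, _ = B.shape
    return (A[:, None, :] * B[None, :, :]).reshape(I * J, R)

rng = np.random.default_rng(1)
I, J, K, R = 4, 5, 6, 3
A, B, C = rng.standard_normal((I, R)), rng.standard_normal((J, R)), rng.standard_normal((K, R))
lam = rng.standard_normal(R)

# CP tensor X = sum_r lam_r a_r o b_r o c_r, cf. Eq. (3.4).
X = np.einsum('r,ir,jr,kr->ijk', lam, A, B, C)

# Mode-1 unfolding expressed through a Khatri-Rao product, cf. Eq. (3.5).
X1 = X.reshape(I, J * K)                       # unfolding in row-major (C-order) layout
X1_kr = A @ np.diag(lam) @ khatri_rao(B, C).T  # same column (index) ordering
print(np.allclose(X1, X1_kr))  # True
```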

Algorithms to compute the CP decomposition. In real-world applications, the signals of interest are corrupted by noise, so that the CP decomposition is rarely exact and has to be estimated by minimizing a suitable cost function.

Figure 3.1: Representations of the CP decomposition. The objective of the CP decomposition is to estimate the factor matrices $B^{(n)} \in \mathbb{R}^{I_n \times R}$ and the scaling coefficients $\{\lambda_1, \lambda_2, \ldots, \lambda_R\}$. (a) Standard block diagram for the CP decomposition of a 3rd-order tensor in the form $X \cong \Lambda \times_1 A \times_2 B \times_3 C = \sum_{r=1}^{R} \lambda_r \, a_r \circ b_r \circ c_r = G_c \times_1 A \times_2 B$, with $G_c = \Lambda \times_3 C$. (b) The CP decomposition of a 4th-order tensor, in tensor network notation, in the form $X \cong \Lambda \times_1 B^{(1)} \times_2 B^{(2)} \times_3 B^{(3)} \times_4 B^{(4)} = \sum_{r=1}^{R} \lambda_r \, b^{(1)}_r \circ b^{(2)}_r \circ b^{(3)}_r \circ b^{(4)}_r$.

Page 74: TensorNetworksforDimensionalityReduction andLarge …mandic/Tensor_Networks_Cichocki... · 2017. 1. 2. · A.Cichockiet al. Tensor Networks for Dimensionality Reduction and Large-Scale

3.2. The CP Format 319

Figure 3.2: Analogy between a low-rank matrix factorization, $X \cong A \Lambda B^{T} = \sum_{r=1}^{R} \lambda_r \, a_r \circ b_r$ (top), and a simple low-rank tensor factorization (CP decomposition), $X \cong \Lambda \times_1 A \times_2 B \times_3 C = \sum_{r=1}^{R} \lambda_r \, a_r \circ b_r \circ c_r$ (bottom).

Such cost functions are typically of the Least-Squares (LS) type, in the form of the Frobenius norm,

$$ J_2(B^{(1)}, B^{(2)}, \ldots, B^{(N)}) = \| X - \llbracket \Lambda;\, B^{(1)}, B^{(2)}, \ldots, B^{(N)} \rrbracket \|_F^2, \qquad (3.7) $$

or of the Least Absolute Error (LAE) type (Vorobyov et al., 2005),

$$ J_1(B^{(1)}, B^{(2)}, \ldots, B^{(N)}) = \| X - \llbracket \Lambda;\, B^{(1)}, B^{(2)}, \ldots, B^{(N)} \rrbracket \|_1. \qquad (3.8) $$

The Alternating Least Squares (ALS) based algorithms minimize the cost function iteratively, by optimizing each component (factor matrix $B^{(n)}$) individually while keeping the other factor matrices fixed (Harshman, 1970; Kolda and Bader, 2009).

To illustrate the ALS principle, assume that the diagonal matrix $\Lambda$ has been absorbed into one of the component matrices; then, by taking advantage of the Khatri–Rao structure in Eq. (3.5), the component matrices, $B^{(n)}$, can be updated sequentially as

$$ B^{(n)} \leftarrow X_{(n)} \Big( \bigodot_{k \neq n} B^{(k)} \Big) \Big( \circledast_{k \neq n} \big( B^{(k)T} B^{(k)} \big) \Big)^{\dagger}. \qquad (3.9) $$

The main challenge (or bottleneck) in implementing ALS and Gradient Descent (GD) techniques for CP decomposition therefore lies in multiplying a matricized tensor by a Khatri–Rao product of factor matrices (Phan et al., 2013a; Choi and Vishwanathan, 2014), and in the computation of the pseudo-inverse of $(R \times R)$ matrices (for the basic ALS, see Algorithm 1).

Algorithm 1: Basic ALS for the CP decomposition of a 3rd-order tensor
Input: Data tensor $X \in \mathbb{R}^{I \times J \times K}$ and rank $R$
Output: Factor matrices $A \in \mathbb{R}^{I \times R}$, $B \in \mathbb{R}^{J \times R}$, $C \in \mathbb{R}^{K \times R}$, and scaling vector $\lambda \in \mathbb{R}^{R}$
1: Initialize $A, B, C$
2: while not converged or iteration limit is not reached do
3:   $A \leftarrow X_{(1)} (C \odot B)(C^{T}C \circledast B^{T}B)^{\dagger}$
4:   Normalize the column vectors of $A$ to unit length (by computing the norm of each column vector and dividing each element of the vector by this norm)
5:   $B \leftarrow X_{(2)} (C \odot A)(C^{T}C \circledast A^{T}A)^{\dagger}$
6:   Normalize the column vectors of $B$ to unit length
7:   $C \leftarrow X_{(3)} (B \odot A)(B^{T}B \circledast A^{T}A)^{\dagger}$
8:   Normalize the column vectors of $C$ to unit length, and store the norms in the vector $\lambda$
9: end while
10: return $A$, $B$, $C$ and $\lambda$.
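A minimal NumPy sketch of the basic ALS updates of Algorithm 1 is given below (not from the original text; the unfolding convention, random initialization and fixed iteration count are simplifications chosen for brevity, and the function names are my own).

```python
import numpy as np

def khatri_rao(A, B):
    # Column-wise Kronecker product of A (I x R) and B (J x R) -> (I*J x R).
    return (A[:, None, :] * B[None, :, :]).reshape(-1, A.shape[1])

def cp_als(X, R, n_iter=100):
    """Basic CP-ALS for a 3rd-order tensor X (I x J x K), in the spirit of Algorithm 1."""
    I, J, K = X.shape
    rng = np.random.default_rng(0)
    A, B, C = (rng.standard_normal((d, R)) for d in (I, J, K))
    # Unfoldings consistent with the Khatri-Rao products used below (C-order).
    X1 = X.reshape(I, -1)
    X2 = np.moveaxis(X, 1, 0).reshape(J, -1)
    X3 = np.moveaxis(X, 2, 0).reshape(K, -1)
    for _ in range(n_iter):
        A = X1 @ khatri_rao(B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        A /= np.linalg.norm(A, axis=0)
        B = X2 @ khatri_rao(A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        B /= np.linalg.norm(B, axis=0)
        C = X3 @ khatri_rao(A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
        lam = np.linalg.norm(C, axis=0)   # absorb the scaling into lambda
        C /= lam
    return A, B, C, lam

# Illustrative usage: recover a random rank-3 tensor.
rng = np.random.default_rng(1)
A0, B0, C0 = rng.random((6, 3)), rng.random((7, 3)), rng.random((8, 3))
X = np.einsum('ir,jr,kr->ijk', A0, B0, C0)
A, B, C, lam = cp_als(X, R=3)
X_hat = np.einsum('r,ir,jr,kr->ijk', lam, A, B, C)
print(np.linalg.norm(X - X_hat) / np.linalg.norm(X))  # typically a small relative error
```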

The ALS approach is attractive for its simplicity, and it often provides satisfactory performance for well-defined problems with high SNRs and well-separated, non-collinear components. For ill-conditioned problems, advanced algorithms are required, which typically exploit the rank-1 structure of the terms within the CP decomposition in order to perform efficient computation and storage of the Jacobian and Hessian of the cost function (Phan et al., 2013c; Sorber et al., 2013; Phan et al., 2015a). An implementation of a parallel ALS algorithm over distributed memory for very large-scale tensors was proposed in (Choi and Vishwanathan, 2014; Karlsson et al., 2016).

Multiple random projections, tensor sketching and GigaTensor. Most of the existing algorithms for the computation of the CP decomposition are based on the ALS or GD approaches; however, these can be too computationally expensive for huge tensors. Indeed, algorithms for tensor decompositions have generally not yet reached the level of maturity and efficiency of low-rank matrix factorization (LRMF) methods. In order to employ efficient LRMF algorithms for tensors, we need to either: (i) reshape the tensor at hand into a set of matrices using traditional matricizations, (ii) employ reduced randomized unfolding matrices, or (iii) perform suitable random multiple projections of a data tensor onto lower-dimensional subspaces. The principles of approaches (i) and (ii) are self-evident, while approach (iii) employs a multilinear product of an $N$th-order tensor and $(N-2)$ random vectors, which are either chosen uniformly from a unit sphere or assumed to be i.i.d. Gaussian vectors (Kuleshov et al., 2015).

For example, for a 3rd-order tensor, $X \in \mathbb{R}^{I_1 \times I_2 \times I_3}$, we can use the set of random projections, $X_3 = X \,\bar{\times}_3\, \omega_3 \in \mathbb{R}^{I_1 \times I_2}$, $X_2 = X \,\bar{\times}_2\, \omega_2 \in \mathbb{R}^{I_1 \times I_3}$ and $X_1 = X \,\bar{\times}_1\, \omega_1 \in \mathbb{R}^{I_2 \times I_3}$, where the vectors $\omega_n \in \mathbb{R}^{I_n}$, $n = 1, 2, 3$, are suitably chosen random vectors. Note that random projections in such a case are non-typical: instead of using projections for dimensionality reduction, they are used to reduce a tensor of any order to matrices, and consequently to transform the CP decomposition problem into a constrained matrix factorization problem, which can be solved via simultaneous (joint) matrix diagonalization (De Lathauwer, 2006; Chabriel et al., 2014). It was shown that even a small number of random projections, such as $O(\log R)$, is sufficient to preserve the spectral information in a tensor. This mitigates the problem of the dependence on the eigen-gap² that plagued earlier tensor-to-matrix reductions. Although uniform random sampling may experience problems for tensors with spiky elements, it often outperforms the standard CP-ALS decomposition algorithms.
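The following small NumPy illustration (not from the original text) reduces a 3rd-order tensor to matrices by contracting it with random vectors along single modes; the i.i.d. Gaussian choice of the vectors $\omega_n$ is one of the options mentioned above, and the tensor sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
I1, I2, I3 = 10, 12, 14
X = rng.standard_normal((I1, I2, I3))

# Random vectors, one per mode (here i.i.d. Gaussian).
w1, w2, w3 = (rng.standard_normal(d) for d in (I1, I2, I3))

# Mode-n contractions of the tensor with a vector, each yielding a matrix.
X3 = np.einsum('ijk,k->ij', X, w3)   # I1 x I2
X2 = np.einsum('ijk,j->ik', X, w2)   # I1 x I3
X1 = np.einsum('ijk,i->jk', X, w1)   # I2 x I3
print(X3.shape, X2.shape, X1.shape)
```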

Alternative algorithms for the CP decomposition of huge-scale tensors include tensor sketching, a random mapping technique which exploits kernel methods and regression (Pham and Pagh, 2013; Wang et al., 2015), and the class of distributed algorithms such as DFacTo (Choi and Vishwanathan, 2014) and GigaTensor, which is based on the Hadoop/MapReduce paradigm (Kang et al., 2012).

² In linear algebra, the eigen-gap of a linear operator is the difference between two successive eigenvalues, where the eigenvalues are sorted in ascending order.

Constraints. Under rather mild conditions, the CP decomposition is generally unique by itself (Kruskal, 1977; Sidiropoulos and Bro, 2000). It does not require additional constraints on the factor matrices to achieve uniqueness, which makes it a powerful and useful tool for tensor factorization. Of course, if the components in one or more modes are known to possess certain properties, e.g., they are known to be nonnegative, orthogonal, statistically independent or sparse, such prior knowledge may be incorporated into the algorithms for computing the CPD and, at the same time, relax the uniqueness conditions. More importantly, such constraints may enhance the accuracy and stability of CP decomposition algorithms and also facilitate better physical interpretability of the extracted components (Sidiropoulos, 2004; Dhillon, 2009; Sørensen et al., 2012; Kim et al., 2014; Zhou and Cichocki, 2012b; Liavas and Sidiropoulos, 2015).

Applications. The CP decomposition has already been established as an advanced tool for blind signal separation in vastly diverse branches of signal processing and machine learning (Acar and Yener, 2009; Kolda and Bader, 2009; Mørup, 2011; Anandkumar et al., 2014; Wang et al., 2015; Tresp et al., 2015; Sidiropoulos et al., 2016). It is also routinely used in exploratory data analysis, where the rank-1 terms capture essential properties of dynamically complex datasets, while in wireless communication systems, signals transmitted by different users correspond to rank-1 terms in the case of line-of-sight propagation and therefore admit analysis in the CP format. Another potential application is in harmonic retrieval and direction-of-arrival problems, where real or complex exponentials have rank-1 structures, for which the use of the CP decomposition is quite natural (Sidiropoulos et al., 2000; Sidiropoulos, 2001; Sørensen and De Lathauwer, 2013).

Page 78: TensorNetworksforDimensionalityReduction andLarge …mandic/Tensor_Networks_Cichocki... · 2017. 1. 2. · A.Cichockiet al. Tensor Networks for Dimensionality Reduction and Large-Scale

3.3. The Tucker Tensor Format 323

3.3 The Tucker Tensor Format

Compared to the CP decomposition, the Tucker decomposition provides a more general factorization of an $N$th-order tensor into a relatively small-size core tensor and factor matrices, and can be expressed as follows:

$$ X \cong \sum_{r_1=1}^{R_1} \cdots \sum_{r_N=1}^{R_N} g_{r_1 r_2 \cdots r_N} \, \big( b^{(1)}_{r_1} \circ b^{(2)}_{r_2} \circ \cdots \circ b^{(N)}_{r_N} \big) = G \times_1 B^{(1)} \times_2 B^{(2)} \cdots \times_N B^{(N)} = \llbracket G;\, B^{(1)}, B^{(2)}, \ldots, B^{(N)} \rrbracket, \qquad (3.10) $$

where $X \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ is the given data tensor, $G \in \mathbb{R}^{R_1 \times R_2 \times \cdots \times R_N}$ is the core tensor, and $B^{(n)} = [b^{(n)}_1, b^{(n)}_2, \ldots, b^{(n)}_{R_n}] \in \mathbb{R}^{I_n \times R_n}$ are the mode-$n$ factor (component) matrices, $n = 1, 2, \ldots, N$ (see Figure 3.3). The core tensor (typically with $R_n \ll I_n$) models a potentially complex pattern of mutual interactions between the vectors in different modes. The model in (3.10) is often referred to as the Tucker-$N$ model.
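As a concrete illustration of the multilinear product in Eq. (3.10) (a sketch, not part of the original text; the sizes and ranks are arbitrary), a Tucker tensor can be assembled from a core tensor and factor matrices via a sequence of mode-n products:

```python
import numpy as np

def mode_n_product(T, M, mode):
    """Mode-n product T x_n M: contracts mode `mode` of tensor T with the columns of M."""
    T = np.moveaxis(T, mode, 0)                # bring mode n to the front
    out = np.tensordot(M, T, axes=([1], [0]))  # (I_n x R_n) times (R_n x ...)
    return np.moveaxis(out, 0, mode)

rng = np.random.default_rng(0)
R1, R2, R3 = 2, 3, 4
I1, I2, I3 = 5, 6, 7
G = rng.standard_normal((R1, R2, R3))
B1, B2, B3 = rng.standard_normal((I1, R1)), rng.standard_normal((I2, R2)), rng.standard_normal((I3, R3))

# X = G x_1 B1 x_2 B2 x_3 B3, cf. Eq. (3.10).
X = mode_n_product(mode_n_product(mode_n_product(G, B1, 0), B2, 1), B3, 2)
print(X.shape)  # (5, 6, 7)
```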

The CP and Tucker decompositions have a long history. For recent surveys and more detailed information we refer to (Kolda and Bader, 2009; Grasedyck et al., 2013; Comon, 2014; Cichocki et al., 2015b; Sidiropoulos et al., 2016).

Using the properties of the Kronecker tensor product, the Tucker-$N$ decomposition in (3.10) can be expressed in an equivalent matrix and vector form as

$$ X_{(n)} \cong B^{(n)} G_{(n)} \big( B^{(1)} \otimes_L \cdots \otimes_L B^{(n-1)} \otimes_L B^{(n+1)} \otimes_L \cdots \otimes_L B^{(N)} \big)^{T} = B^{(n)} G_{(n)} \big( B^{(N)} \otimes \cdots \otimes B^{(n+1)} \otimes B^{(n-1)} \otimes \cdots \otimes B^{(1)} \big)^{T}, \qquad (3.11) $$

$$ X_{<n>} \cong \big( B^{(1)} \otimes_L \cdots \otimes_L B^{(n)} \big)\, G_{<n>} \big( B^{(n+1)} \otimes_L \cdots \otimes_L B^{(N)} \big)^{T} = \big( B^{(n)} \otimes \cdots \otimes B^{(1)} \big)\, G_{<n>} \big( B^{(N)} \otimes B^{(N-1)} \otimes \cdots \otimes B^{(n+1)} \big)^{T}, \qquad (3.12) $$

$$ \operatorname{vec}(X) \cong \big[ B^{(1)} \otimes_L B^{(2)} \otimes_L \cdots \otimes_L B^{(N)} \big] \operatorname{vec}(G) = \big[ B^{(N)} \otimes B^{(N-1)} \otimes \cdots \otimes B^{(1)} \big] \operatorname{vec}(G), \qquad (3.13) $$


where the multi-indices are ordered in reverse lexicographic order (little-endian).

Table 3.1 and Table 3.2 summarize the fundamental mathematical representations of the CP and Tucker decompositions for 3rd-order and $N$th-order tensors.

The Tucker decomposition is said to be in an independent Tucker format if all the factor matrices, $B^{(n)}$, have full column rank, while a Tucker format is termed orthonormal if, in addition, all the factor matrices, $B^{(n)} = U^{(n)}$, are orthogonal. The standard Tucker model often has orthogonal factor matrices.

Multilinear rank. The multilinear rank of an $N$th-order tensor $X \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ corresponds to the $N$-tuple $(R_1, R_2, \ldots, R_N)$ consisting of the dimensions of the different subspaces. If the Tucker decomposition (3.10) holds exactly, it is mathematically defined as

$$ \operatorname{rank}_{ML}(X) = \{ \operatorname{rank}(X_{(1)}), \operatorname{rank}(X_{(2)}), \ldots, \operatorname{rank}(X_{(N)}) \}, \qquad (3.14) $$

with $X_{(n)} \in \mathbb{R}^{I_n \times I_1 \cdots I_{n-1} I_{n+1} \cdots I_N}$ for $n = 1, 2, \ldots, N$. The rank of a Tucker decomposition can be determined using information criteria (Yokota et al., 2017), or through the number of dominant eigenvalues when an approximation accuracy of the decomposition or a noise level is given (see Algorithm 8).
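A direct way to read off the multilinear rank of a (noise-free) tensor, in the spirit of Eq. (3.14), is to compute the matrix rank of each mode-n unfolding; the snippet below is an illustrative sketch, not from the original text, with arbitrary sizes.

```python
import numpy as np

def multilinear_rank(X, tol=1e-10):
    """Multilinear rank: rank of each mode-n unfolding, cf. Eq. (3.14)."""
    ranks = []
    for n in range(X.ndim):
        Xn = np.moveaxis(X, n, 0).reshape(X.shape[n], -1)  # mode-n unfolding
        ranks.append(np.linalg.matrix_rank(Xn, tol=tol))
    return tuple(ranks)

# Example: a Tucker tensor with a 2 x 3 x 2 core has multilinear rank (2, 3, 2).
rng = np.random.default_rng(0)
G = rng.standard_normal((2, 3, 2))
B = [rng.standard_normal((d, r)) for d, r in zip((6, 7, 8), (2, 3, 2))]
X = np.einsum('abc,ia,jb,kc->ijk', G, *B)
print(multilinear_rank(X))  # (2, 3, 2)
```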

The independent Tucker format has the following important properties if the equality in (3.10) holds exactly (see, e.g., (Jiang et al., 2016) and references therein):

1. The tensor (CP) rank of any tensor, $X = \llbracket G;\, B^{(1)}, B^{(2)}, \ldots, B^{(N)} \rrbracket \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, and the rank of its core tensor, $G \in \mathbb{R}^{R_1 \times R_2 \times \cdots \times R_N}$, are exactly the same, i.e.,

$$ \operatorname{rank}_{CP}(X) = \operatorname{rank}_{CP}(G). \qquad (3.15) $$

2. If a tensor, $X \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N} = \llbracket G;\, B^{(1)}, B^{(2)}, \ldots, B^{(N)} \rrbracket$, admits an independent Tucker format with multilinear rank $\{R_1, R_2, \ldots, R_N\}$, then

$$ R_n \leq \prod_{p \neq n}^{N} R_p, \quad \forall n. \qquad (3.16) $$


Figure 3.3: Illustration of the Tucker and Tucker-CP decompositions, where the objective is to compute the factor matrices, $B^{(n)}$, and the core tensor, $G$. (a) Standard block diagrams of the Tucker (top) and Tucker-CP (bottom) decompositions of a 3rd-order tensor, $X \cong G \times_1 B^{(1)} \times_2 B^{(2)} \times_3 B^{(3)}$. In some applications, the core tensor can be further approximately factorized using the CP decomposition as $G \cong \sum_{r=1}^{R} a_r \circ b_r \circ c_r$ (bottom diagram), or alternatively using TT/HT decompositions. (b) Tensor network diagram of the Tucker and Tucker-CP decompositions of a 4th-order tensor, $X \cong G \times_1 B^{(1)} \times_2 B^{(2)} \times_3 B^{(3)} \times_4 B^{(4)} = \llbracket G;\, B^{(1)}, B^{(2)}, B^{(3)}, B^{(4)} \rrbracket \cong (\Lambda \times_1 A^{(1)} \times_2 A^{(2)} \times_3 A^{(3)} \times_4 A^{(4)}) \times_1 B^{(1)} \times_2 B^{(2)} \times_3 B^{(3)} \times_4 B^{(4)} = \llbracket \Lambda;\, B^{(1)} A^{(1)}, B^{(2)} A^{(2)}, B^{(3)} A^{(3)}, B^{(4)} A^{(4)} \rrbracket$.


Moreover, without loss of generality, under the assumption $R_1 \leq R_2 \leq \cdots \leq R_N$, we have

$$ R_1 \leq \operatorname{rank}_{CP}(X) \leq R_2 R_3 \cdots R_N. \qquad (3.17) $$

3. If a data tensor is symmetric and admits an independent Tucker format, $X = \llbracket G;\, B, B, \ldots, B \rrbracket \in \mathbb{R}^{I \times I \times \cdots \times I}$, then its core tensor, $G \in \mathbb{R}^{R \times R \times \cdots \times R}$, is also symmetric, with $\operatorname{rank}_{CP}(X) = \operatorname{rank}_{CP}(G)$.

4. For the orthonormal Tucker format, that is, $X = \llbracket G;\, U^{(1)}, U^{(2)}, \ldots, U^{(N)} \rrbracket \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, with $U^{(n)T} U^{(n)} = I$, $\forall n$, the Frobenius norms and the Schatten $p$-norms³ of the data tensor, $X$, and of its core tensor, $G$, are equal, i.e.,

$$ \|X\|_F = \|G\|_F, \qquad \|X\|_{S_p} = \|G\|_{S_p}, \quad 1 \leq p < \infty. $$

Thus, the computation of the Frobenius norm can be performed with $O(R^N)$ complexity, where $R = \max\{R_1, \ldots, R_N\}$, instead of the usual $O(I^N)$ complexity (typically $R \ll I$).

Note that the CP decomposition can be considered as a special case of the Tucker decomposition, whereby the cube core tensor has nonzero elements only on its main diagonal (see Figure 3.1). In contrast to the CP decomposition, the unconstrained Tucker decomposition is not unique. However, constraints imposed on all factor matrices and/or the core tensor can reduce the indeterminacies inherent in CA to only column-wise permutation and scaling, thus yielding a unique core tensor and factor matrices (Zhou and Cichocki, 2012a).

The Tucker-$N$ model in which $(N - K)$ factor matrices are identity matrices is called the Tucker-$(K, N)$ model.

³ The Schatten $p$-norm of an $N$th-order tensor $X$ is defined as the average of the Schatten norms of its mode-$n$ unfoldings, i.e., $\|X\|_{S_p} = (1/N) \sum_{n=1}^{N} \|X_{(n)}\|_{S_p}$, where $\|X\|_{S_p} = (\sum_r \sigma_r^p)^{1/p}$ and $\sigma_r$ is the $r$th singular value of the matrix $X$. For $p = 1$, the Schatten norm of a matrix $X$ is called the nuclear norm or the trace norm, while for $p = 0$ the Schatten norm is the rank of $X$, which can be replaced by the surrogate function $\log \det(X X^{T} + \varepsilon I)$, $\varepsilon > 0$.


Table 3.1: Different forms of the CP and Tucker representations of a 3rd-order tensor $X \in \mathbb{R}^{I \times J \times K}$, where $\lambda = [\lambda_1, \lambda_2, \ldots, \lambda_R]^{T}$ and $\Lambda = \operatorname{diag}\{\lambda_1, \lambda_2, \ldots, \lambda_R\}$.

Scalar representation:
  CP:     $x_{ijk} = \sum_{r=1}^{R} \lambda_r \, a_{ir} b_{jr} c_{kr}$
  Tucker: $x_{ijk} = \sum_{r_1=1}^{R_1} \sum_{r_2=1}^{R_2} \sum_{r_3=1}^{R_3} g_{r_1 r_2 r_3} \, a_{i r_1} b_{j r_2} c_{k r_3}$

Tensor representation, outer products:
  CP:     $X = \sum_{r=1}^{R} \lambda_r \, a_r \circ b_r \circ c_r$
  Tucker: $X = \sum_{r_1=1}^{R_1} \sum_{r_2=1}^{R_2} \sum_{r_3=1}^{R_3} g_{r_1 r_2 r_3} \, a_{r_1} \circ b_{r_2} \circ c_{r_3}$

Tensor representation, multilinear products:
  CP:     $X = \Lambda \times_1 A \times_2 B \times_3 C = \llbracket \Lambda;\, A, B, C \rrbracket$
  Tucker: $X = G \times_1 A \times_2 B \times_3 C = \llbracket G;\, A, B, C \rrbracket$

Matrix representations:
  CP:     $X_{(1)} = A \Lambda (B \odot_L C)^{T}$, $X_{(2)} = B \Lambda (A \odot_L C)^{T}$, $X_{(3)} = C \Lambda (A \odot_L B)^{T}$
  Tucker: $X_{(1)} = A G_{(1)} (B \otimes_L C)^{T}$, $X_{(2)} = B G_{(2)} (A \otimes_L C)^{T}$, $X_{(3)} = C G_{(3)} (A \otimes_L B)^{T}$

Vector representation:
  CP:     $\operatorname{vec}(X) = (A \odot_L B \odot_L C)\,\lambda$
  Tucker: $\operatorname{vec}(X) = (A \otimes_L B \otimes_L C)\operatorname{vec}(G)$

Matrix slices $X_k = X(:, :, k)$:
  CP:     $X_k = A \operatorname{diag}(\lambda_1 c_{k,1}, \ldots, \lambda_R c_{k,R})\, B^{T}$
  Tucker: $X_k = A \big( \sum_{r_3=1}^{R_3} c_{k r_3}\, G(:, :, r_3) \big) B^{T}$


Table 3.2: Different forms of the CP and Tucker representations of an $N$th-order tensor $X \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$.

Scalar product:
  CP:     $x_{i_1,\ldots,i_N} = \sum_{r=1}^{R} \lambda_r \, b^{(1)}_{i_1,r} \cdots b^{(N)}_{i_N,r}$
  Tucker: $x_{i_1,\ldots,i_N} = \sum_{r_1=1}^{R_1} \cdots \sum_{r_N=1}^{R_N} g_{r_1,\ldots,r_N} \, b^{(1)}_{i_1,r_1} \cdots b^{(N)}_{i_N,r_N}$

Outer product:
  CP:     $X = \sum_{r=1}^{R} \lambda_r \, b^{(1)}_r \circ \cdots \circ b^{(N)}_r$
  Tucker: $X = \sum_{r_1=1}^{R_1} \cdots \sum_{r_N=1}^{R_N} g_{r_1,\ldots,r_N} \, b^{(1)}_{r_1} \circ \cdots \circ b^{(N)}_{r_N}$

Multilinear product:
  CP:     $X = \Lambda \times_1 B^{(1)} \times_2 B^{(2)} \cdots \times_N B^{(N)} = \llbracket \Lambda;\, B^{(1)}, B^{(2)}, \ldots, B^{(N)} \rrbracket$
  Tucker: $X = G \times_1 B^{(1)} \times_2 B^{(2)} \cdots \times_N B^{(N)} = \llbracket G;\, B^{(1)}, B^{(2)}, \ldots, B^{(N)} \rrbracket$

Vectorization:
  CP:     $\operatorname{vec}(X) = \big( \bigodot_{n=N}^{1} B^{(n)} \big)\lambda$
  Tucker: $\operatorname{vec}(X) = \big( \bigotimes_{n=N}^{1} B^{(n)} \big)\operatorname{vec}(G)$

Matricization:
  CP:     $X_{(n)} = B^{(n)} \Lambda \big( \bigodot_{m=N,\, m \neq n}^{1} B^{(m)} \big)^{T}$, $\quad X_{<n>} = \big( \bigodot_{m=n}^{1} B^{(m)} \big)\Lambda \big( \bigodot_{m=N}^{n+1} B^{(m)} \big)^{T}$
  Tucker: $X_{(n)} = B^{(n)} G_{(n)} \big( \bigotimes_{m=N,\, m \neq n}^{1} B^{(m)} \big)^{T}$, $\quad X_{<n>} = \big( \bigotimes_{m=n}^{1} B^{(m)} \big) G_{<n>} \big( \bigotimes_{m=N}^{n+1} B^{(m)} \big)^{T}$

Slice representation (with $k_3 = \overline{i_3 i_4 \cdots i_N}$):
  CP:     $X(:, :, k_3) = B^{(1)} \widetilde{D}_{k_3} B^{(2)T}$, where $\widetilde{D}_{k_3} = \operatorname{diag}(d_{11}, \ldots, d_{RR}) \in \mathbb{R}^{R \times R}$ with entries $d_{rr} = \lambda_r b^{(3)}_{i_3,r} \cdots b^{(N)}_{i_N,r}$
  Tucker: $X(:, :, k_3) = B^{(1)} \widetilde{G}_{k_3} B^{(2)T}$, where $\widetilde{G}_{k_3} = \sum_{r_3} \cdots \sum_{r_N} b^{(3)}_{i_3,r_3} \cdots b^{(N)}_{i_N,r_N}\, G_{:,:,r_3,\ldots,r_N}$ is a weighted sum of frontal slices.


In the simplest scenario, for a 3rd-order tensor $X \in \mathbb{R}^{I \times J \times K}$, the Tucker-$(2,3)$, or simply Tucker-2, model can be described as⁴

$$ X \cong G \times_1 A \times_2 B \times_3 I = G \times_1 A \times_2 B, \qquad (3.18) $$

or in the equivalent matrix form

$$ X_k = A G_k B^{T}, \qquad (k = 1, 2, \ldots, K), \qquad (3.19) $$

where $X_k = X(:, :, k) \in \mathbb{R}^{I \times J}$ and $G_k = G(:, :, k) \in \mathbb{R}^{R_1 \times R_2}$ are, respectively, the frontal slices of the data tensor $X$ and of the core tensor $G \in \mathbb{R}^{R_1 \times R_2 \times R_3}$, and $A \in \mathbb{R}^{I \times R_1}$, $B \in \mathbb{R}^{J \times R_2}$.

Generalized Tucker format and its links to the TTNS model. For high-order tensors, $X \in \mathbb{R}^{I_{1,1} \times \cdots \times I_{1,K_1} \times I_{2,1} \times \cdots \times I_{N,K_N}}$, the Tucker-$N$ format can be naturally generalized by replacing the factor matrices, $B^{(n)} \in \mathbb{R}^{I_n \times R_n}$, by higher-order tensors $B^{(n)} \in \mathbb{R}^{I_{n,1} \times I_{n,2} \times \cdots \times I_{n,K_n} \times R_n}$, to give

$$ X = \llbracket G;\, B^{(1)}, B^{(2)}, \ldots, B^{(N)} \rrbracket, \qquad (3.20) $$

where the entries of the data tensor are computed as

$$ X(\overline{i_1}, \ldots, \overline{i_N}) = \sum_{r_1=1}^{R_1} \cdots \sum_{r_N=1}^{R_N} G(r_1, \ldots, r_N)\, B^{(1)}(\overline{i_1}, r_1) \cdots B^{(N)}(\overline{i_N}, r_N), $$

with the multi-indices $\overline{i_n} = \overline{i_{n,1} i_{n,2} \ldots i_{n,K_n}}$ (Lee and Cichocki, 2016c).

Furthermore, the nested (hierarchical) form of such a generalized Tucker decomposition leads to the Tree Tensor Network States (TTNS) model (Nakatani and Chan, 2013) (see Figure 2.15 and Figure 2.18), with a possibly varying order of cores, which can be formulated as

$$ X = \llbracket G_1;\, B^{(1)}, B^{(2)}, \ldots, B^{(N_1)} \rrbracket, \quad G_1 = \llbracket G_2;\, A^{(1,2)}, A^{(2,2)}, \ldots, A^{(N_2,2)} \rrbracket, \quad \ldots, \quad G_{P-1} = \llbracket G_P;\, A^{(1,P-1)}, A^{(2,P-1)}, \ldots, A^{(N_{P-1},P-1)} \rrbracket, \qquad (3.21) $$

⁴ For a 3rd-order tensor, the Tucker-2 model is equivalent to the TT model. The case where the factor matrices and the core tensor are nonnegative is referred to as the NTD-2 (Nonnegative Tucker-2 decomposition).


where $G_p \in \mathbb{R}^{R^{(p)}_1 \times R^{(p)}_2 \times \cdots \times R^{(p)}_{N_p}}$ and $A^{(n_p,\,p)} \in \mathbb{R}^{R^{(p-1)}_{l_{n_p}} \times R^{(p-1)}_{m_{n_p}} \times R^{(p)}_{n_p}}$, with $p = 2, \ldots, P-1$.

Note that some factor tensors, $A^{(n,1)}$ and/or $A^{(n_p,p)}$, can be identity tensors, which yields an irregular structure, possibly with a varying order of tensors. This follows from the simple observation that a mode-$n$ product may have, e.g., the following form

$$ X \times_n B^{(n)} = \llbracket X;\, I_{I_1}, \ldots, I_{I_{n-1}}, B^{(n)}, I_{I_{n+1}}, \ldots, I_{I_N} \rrbracket. $$

The efficiency of this representation strongly relies on an appropriate choice of the tree structure. It is usually assumed that the tree structure of TTNS is given or fixed a priori, and recent efforts aim to find an optimal tree structure from a subset of tensor entries and without any a priori knowledge of the tree structure. This is achieved using so-called rank-adaptive cross-approximation techniques, which approximate a tensor by hierarchical tensor formats (Ballani and Grasedyck, 2014; Ballani et al., 2014).

Operations in the Tucker format. If large-scale data tensors admit an exact or approximate representation in their Tucker formats, then most mathematical operations can be performed more efficiently using the much smaller core tensors and factor matrices. Consider the $N$th-order tensors $X$ and $Y$ in the Tucker format, given by

$$ X = \llbracket G_X;\, X^{(1)}, \ldots, X^{(N)} \rrbracket \quad \text{and} \quad Y = \llbracket G_Y;\, Y^{(1)}, \ldots, Y^{(N)} \rrbracket, \qquad (3.22) $$

for which the respective multilinear ranks are $\{R_1, R_2, \ldots, R_N\}$ and $\{Q_1, Q_2, \ldots, Q_N\}$; then the following mathematical operations can be performed directly in the Tucker format⁵, which admits a significant reduction in computational costs (Phan et al., 2013b, 2015b; Lee and Cichocki, 2016c):

• The addition of two Tucker tensors of the same order and sizes:

$$ X + Y = \llbracket G_X \oplus G_Y;\, [X^{(1)}, Y^{(1)}], \ldots, [X^{(N)}, Y^{(N)}] \rrbracket, \qquad (3.23) $$

where $\oplus$ denotes the direct sum of two tensors, and $[X^{(n)}, Y^{(n)}] \in \mathbb{R}^{I_n \times (R_n + Q_n)}$, $X^{(n)} \in \mathbb{R}^{I_n \times R_n}$ and $Y^{(n)} \in \mathbb{R}^{I_n \times Q_n}$, $\forall n$.

⁵ Similar operations can be performed in the CP format, assuming that the core tensors are diagonal.


• The Kronecker product of two Tucker tensors of arbitrary orders and sizes:

$$ X \otimes Y = \llbracket G_X \otimes G_Y;\, X^{(1)} \otimes Y^{(1)}, \ldots, X^{(N)} \otimes Y^{(N)} \rrbracket. \qquad (3.24) $$

• The Hadamard or element-wise product of two Tucker tensors of the same order and the same sizes:

$$ X \circledast Y = \llbracket G_X \otimes G_Y;\, X^{(1)} \odot_1 Y^{(1)}, \ldots, X^{(N)} \odot_1 Y^{(N)} \rrbracket, \qquad (3.25) $$

where $\odot_1$ denotes the mode-1 Khatri–Rao product, also called the transposed Khatri–Rao product or row-wise Kronecker product.

• The inner product of two Tucker tensors of the same order and sizes can be reduced to the inner product of two smaller tensors by exploiting the Kronecker product structure in the vectorized form, as follows (a numerical illustration is given after this list):

$$ \langle X, Y \rangle = \operatorname{vec}(X)^{T} \operatorname{vec}(Y) = \operatorname{vec}(G_X)^{T} \Big( \bigotimes_{n=1}^{N} X^{(n)} \Big)^{T} \Big( \bigotimes_{n=1}^{N} Y^{(n)} \Big) \operatorname{vec}(G_Y) = \operatorname{vec}(G_X)^{T} \Big( \bigotimes_{n=1}^{N} X^{(n)T} Y^{(n)} \Big) \operatorname{vec}(G_Y) = \big\langle \llbracket G_X;\, (X^{(1)T} Y^{(1)}), \ldots, (X^{(N)T} Y^{(N)}) \rrbracket,\; G_Y \big\rangle. \qquad (3.26) $$

• The Frobenius norm can be computed in a particularly simple way if the factor matrices are orthogonal, since then all the products $X^{(n)T} X^{(n)}$, $\forall n$, become identity matrices, so that

$$ \|X\|_F^2 = \langle X, X \rangle = \operatorname{vec}\big( \llbracket G_X;\, (X^{(1)T} X^{(1)}), \ldots, (X^{(N)T} X^{(N)}) \rrbracket \big)^{T} \operatorname{vec}(G_X) = \operatorname{vec}(G_X)^{T} \operatorname{vec}(G_X) = \|G_X\|_F^2. \qquad (3.27) $$

• The N-D discrete convolution of tensors $X \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ and $Y \in \mathbb{R}^{J_1 \times \cdots \times J_N}$ in their Tucker formats can be expressed as

$$ Z = X * Y = \llbracket G_Z;\, Z^{(1)}, \ldots, Z^{(N)} \rrbracket \in \mathbb{R}^{(I_1 + J_1 - 1) \times \cdots \times (I_N + J_N - 1)}. \qquad (3.28) $$


If $\{R_1, R_2, \ldots, R_N\}$ is the multilinear rank of $X$ and $\{Q_1, Q_2, \ldots, Q_N\}$ the multilinear rank of $Y$, then the core tensor is $G_Z = G_X \otimes G_Y \in \mathbb{R}^{R_1 Q_1 \times \cdots \times R_N Q_N}$ and the factor matrices are

$$ Z^{(n)} \in \mathbb{R}^{(I_n + J_n - 1) \times R_n Q_n}, \qquad (3.29) $$

where the columns are the pairwise convolutions $Z^{(n)}(:, s_n) = X^{(n)}(:, r_n) * Y^{(n)}(:, q_n) \in \mathbb{R}^{(I_n + J_n - 1)}$ for $s_n = \overline{r_n q_n} = 1, 2, \ldots, R_n Q_n$.

• The super fast discrete Fourier transform (MATLAB functions fftn(X) and fft(X^{(n)}, [], 1)) of a tensor in the Tucker format is

$$ \mathcal{F}(X) = \llbracket G_X;\, \mathcal{F}(X^{(1)}), \ldots, \mathcal{F}(X^{(N)}) \rrbracket. \qquad (3.30) $$

Note that if the data tensor admits a low multilinear rank approximation, then performing the FFT on factor matrices of relatively small size, $X^{(n)} \in \mathbb{R}^{I_n \times R_n}$, instead of on a large-scale data tensor, considerably decreases the computational complexity. This approach is referred to as the super fast Fourier transform in Tucker format.
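To illustrate Eqs. (3.26)–(3.27), the following NumPy sketch (not from the original text) computes the inner product of two Tucker tensors directly from their cores and small matrices $X^{(n)T} Y^{(n)}$, and checks it against the full-tensor computation; all sizes and ranks are arbitrary choices.

```python
import numpy as np

def tucker_to_full(G, factors):
    """Reconstruct a 3rd-order Tucker tensor from core G and factor matrices."""
    return np.einsum('abc,ia,jb,kc->ijk', G, *factors)

def tucker_inner(GX, Xf, GY, Yf):
    """Inner product <X, Y> computed in the Tucker format, cf. Eq. (3.26)."""
    M = [Xn.T @ Yn for Xn, Yn in zip(Xf, Yf)]         # small R_n x Q_n matrices X^(n)T Y^(n)
    GY_proj = np.einsum('abc,da,eb,fc->def', GY, *M)  # G_Y multiplied by M_n in each mode
    return float(np.sum(GX * GY_proj))

rng = np.random.default_rng(0)
I = (6, 7, 8)
GX = rng.standard_normal((2, 3, 2)); Xf = [rng.standard_normal((i, r)) for i, r in zip(I, GX.shape)]
GY = rng.standard_normal((3, 2, 4)); Yf = [rng.standard_normal((i, q)) for i, q in zip(I, GY.shape)]

X, Y = tucker_to_full(GX, Xf), tucker_to_full(GY, Yf)
print(np.allclose(tucker_inner(GX, Xf, GY, Yf), np.sum(X * Y)))  # True

# With orthogonal factors, ||X||_F = ||G_X||_F, cf. Eq. (3.27).
Qf = [np.linalg.qr(Xn)[0] for Xn in Xf]
print(np.allclose(np.linalg.norm(tucker_to_full(GX, Qf)), np.linalg.norm(GX)))  # True
```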

3.4 Higher Order SVD (HOSVD) for Large-Scale Problems

The multilinear singular value decomposition (MLSVD), also called the higher-order SVD (HOSVD), can be considered as a special form of the constrained Tucker decomposition (De Lathauwer et al., 2000a,b), in which all the factor matrices, $B^{(n)} = U^{(n)} \in \mathbb{R}^{I_n \times I_n}$, are orthogonal and the core tensor, $G = S \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, is all-orthogonal (see Figure 3.4).

The orthogonality properties of the core tensor are defined through the following conditions:

1. All-orthogonality: the slices in each mode are mutually orthogonal, e.g., for a 3rd-order tensor and its lateral slices,

$$ \langle S_{:,k,:},\, S_{:,l,:} \rangle = 0, \quad \text{for } k \neq l. \qquad (3.31) $$

2. Pseudo-diagonality: the Frobenius norms of the slices in each mode are decreasing with increasing running index, e.g., for a 3rd-order tensor and its lateral slices,

$$ \| S_{:,k,:} \|_F \geq \| S_{:,l,:} \|_F, \quad k \leq l. \qquad (3.32) $$

These norms play a role similar to that of the singular values in the standard matrix SVD.

In practice, the orthogonal matrices $U^{(n)} \in \mathbb{R}^{I_n \times R_n}$, with $R_n \leq I_n$, can be computed by applying either the randomized or the standard truncated SVD to the unfolded mode-$n$ matrices, $X_{(n)} = U^{(n)} S_n V^{(n)T} \in \mathbb{R}^{I_n \times I_1 \cdots I_{n-1} I_{n+1} \cdots I_N}$. After obtaining the orthogonal matrices $U^{(n)}$ of left singular vectors of $X_{(n)}$, for each $n$, the core tensor $G = S$ can be computed as

$$ S = X \times_1 U^{(1)T} \times_2 U^{(2)T} \cdots \times_N U^{(N)T}, \qquad (3.33) $$

so that

$$ X = S \times_1 U^{(1)} \times_2 U^{(2)} \cdots \times_N U^{(N)}. \qquad (3.34) $$

Due to the orthogonality of the core tensor $S$, its slices are mutually orthogonal.

Analogous to the standard truncated SVD, a large-scale data tensor, $X$, can be approximated by discarding the multilinear singular vectors and the slices of the core tensor that correspond to small multilinear singular values. Figure 3.4 and Algorithm 2 outline the truncated HOSVD, for which any optimized matrix SVD procedure can be applied.

For large-scale tensors, the unfolding matrices, $X_{(n)} \in \mathbb{R}^{I_n \times \bar{I}_n}$ (with $\bar{I}_n = I_1 \cdots I_{n-1} I_{n+1} \cdots I_N$), may become prohibitively large (with $\bar{I}_n \gg I_n$), easily exceeding the memory of standard computers. Using a direct and simple divide-and-conquer approach, the truncated SVD of an unfolding matrix, $X_{(n)} = U^{(n)} S_n V^{(n)T}$, can be partitioned into $Q$ slices, as $X_{(n)} = [X_{1,n}, X_{2,n}, \ldots, X_{Q,n}] = U^{(n)} S_n [V_{1,n}^{T}, V_{2,n}^{T}, \ldots, V_{Q,n}^{T}]$. Next, the orthogonal matrices $U^{(n)}$ and the diagonal matrices $S_n$ can be obtained from the eigenvalue decompositions $X_{(n)} X_{(n)}^{T} = U^{(n)} S_n^2 U^{(n)T} = \sum_q X_{q,n} X_{q,n}^{T} \in \mathbb{R}^{I_n \times I_n}$, allowing the terms $V_{q,n} = X_{q,n}^{T} U^{(n)} S_n^{-1}$ to be computed separately.


Figure 3.4: Graphical illustration of the truncated SVD and HOSVD. (a) The exact and truncated standard matrix SVD, $X = U S V^{T}$. (b) The truncated (approximative) HOSVD for a 3rd-order tensor, calculated as $X \cong S_t \times_1 U^{(1)} \times_2 U^{(2)} \times_3 U^{(3)}$. (c) Tensor network notation for the HOSVD of a 4th-order tensor, $X \cong S_t \times_1 U^{(1)} \times_2 U^{(2)} \times_3 U^{(3)} \times_4 U^{(4)}$. All the factor matrices, $U^{(n)} \in \mathbb{R}^{I_n \times R_n}$, and the core tensor, $S_t = G \in \mathbb{R}^{R_1 \times \cdots \times R_N}$, are orthogonal.


This enables us to optimize the size of the $q$th slice, $X_{q,n} \in \mathbb{R}^{I_n \times (\bar{I}_n / Q)}$, so as to match the available computer memory. Such a simple approach to computing the matrices $U^{(n)}$ and/or $V^{(n)}$ does not require loading the entire unfolding matrices at once into computer memory; instead, the access to the datasets is sequential. For current standard sizes of computer memory, the dimension $I_n$ is typically less than 10,000, while there is no limit on the dimension $\bar{I}_n = \prod_{k \neq n} I_k$.

For very large-scale and low-rank matrices, instead of the standard truncated SVD approach, we can alternatively apply the randomized SVD algorithm, which reduces the original data matrix $X$ to a relatively small matrix by random sketching, i.e., through multiplication with a random sampling matrix $\Omega$ (see Algorithm 3). Note that we explicitly allow the rank of the data matrix $X$ to be overestimated (that is, $\tilde{R} = R + P$, where $R$ is the true but unknown rank and $P$ is the over-sampling parameter), because it is easier to obtain an accurate approximation of this form. The performance of the randomized SVD can be further improved by integrating multiple random sketches, that is, by multiplying the data matrix $X$ by a set of random matrices $\Omega_p$, for $p = 1, 2, \ldots, P$, and integrating the leading low-dimensional subspaces by applying a Monte Carlo integration method (Chen et al., 2016).

Using special random sampling matrices, for instance a subsampled random Fourier transform, a substantial gain in execution time can be achieved, together with an asymptotic complexity of $O(IJ \log(R))$. Unfortunately, this approach is not accurate enough for matrices whose singular values decay slowly (Halko et al., 2011).

The truncated HOSVD can be optimized and implemented in several alternative ways. For example, if $R_n \ll I_n$, the truncated tensor $Z \leftarrow X \times_1 U^{(1)T}$ yields a smaller unfolding matrix $Z_{(2)} \in \mathbb{R}^{I_2 \times R_1 I_3 \cdots I_N}$, so that the multiplication $Z_{(2)} Z_{(2)}^{T}$ can be faster in the subsequent iterations (Vannieuwenhoven et al., 2012; Austin et al., 2015).

Furthermore, since the unfolding matrices $Y_{(n)}^{T}$ are typically very “tall and skinny”, a huge-scale truncated SVD and other constrained low-rank matrix factorizations can be computed efficiently based on the Hadoop/MapReduce paradigm (Constantine et al., 2014; Constantine and Gleich, 2011; Benson et al., 2014).


Algorithm 2: Sequentially Truncated HOSVD (Vannieuwenhoven et al., 2012)
Input: $N$th-order tensor $X \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ and approximation accuracy $\varepsilon$
Output: HOSVD in the Tucker format, $\hat{X} = \llbracket S;\, U^{(1)}, \ldots, U^{(N)} \rrbracket$, such that $\|X - \hat{X}\|_F \leq \varepsilon$
1: $S \leftarrow X$
2: for $n = 1$ to $N$ do
3:   $[U^{(n)}, S, V] \leftarrow \text{truncated\_svd}\big( S_{(n)}, \, \varepsilon / \sqrt{N} \big)$
4:   $S \leftarrow V S$
5: end for
6: $S \leftarrow \text{reshape}\big( S, [R_1, \ldots, R_N] \big)$
7: return Core tensor $S$ and orthogonal factor matrices $U^{(n)} \in \mathbb{R}^{I_n \times R_n}$.
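Below is a compact NumPy sketch in the spirit of Algorithm 2 (not from the original text). For simplicity it truncates to prescribed multilinear ranks rather than to an error tolerance, and it uses explicit mode-n products instead of the reshape-based bookkeeping of the algorithm.

```python
import numpy as np

def mode_n_unfold(T, n):
    return np.moveaxis(T, n, 0).reshape(T.shape[n], -1)

def mode_n_matrix_product(T, M, n):
    """Mode-n product T x_n M for a matrix M of size (rows x T.shape[n])."""
    out = np.tensordot(M, np.moveaxis(T, n, 0), axes=([1], [0]))
    return np.moveaxis(out, 0, n)

def st_hosvd(X, ranks):
    """Sequentially truncated HOSVD: returns core S and factors U^(n)."""
    S, factors = X, []
    for n, Rn in enumerate(ranks):
        Un = np.linalg.svd(mode_n_unfold(S, n), full_matrices=False)[0][:, :Rn]
        factors.append(Un)                          # leading Rn left singular vectors
        S = mode_n_matrix_product(S, Un.T, n)       # project (truncate) mode n
    return S, factors

# Reconstruction X_hat = S x_1 U^(1) ... x_N U^(N), cf. Eq. (3.34).
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 9, 10))
S, Us = st_hosvd(X, ranks=(8, 9, 10))               # no truncation: exact
X_hat = S
for n, Un in enumerate(Us):
    X_hat = mode_n_matrix_product(X_hat, Un, n)
print(np.allclose(X, X_hat))  # True
```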

Algorithm 3: Randomized SVD (rSVD) for large-scale and low-rank matrices with a single sketch (Halko et al., 2011)
Input: A matrix $X \in \mathbb{R}^{I \times J}$, desired or estimated rank $R$, oversampling parameter $P$ or overestimated rank $\tilde{R} = R + P$, and exponent $q$ of the power method ($q = 0$ or $q = 1$)
Output: An approximate rank-$\tilde{R}$ SVD, $X \cong U S V^{T}$, i.e., orthogonal matrices $U \in \mathbb{R}^{I \times \tilde{R}}$, $V \in \mathbb{R}^{J \times \tilde{R}}$ and a diagonal matrix $S \in \mathbb{R}^{\tilde{R} \times \tilde{R}}$ of singular values
1: Draw a random Gaussian matrix $\Omega \in \mathbb{R}^{J \times \tilde{R}}$
2: Form the sample matrix $Y = (X X^{T})^{q}\, X\, \Omega \in \mathbb{R}^{I \times \tilde{R}}$
3: Compute the QR decomposition $Y = Q R$
4: Form the matrix $A = Q^{T} X \in \mathbb{R}^{\tilde{R} \times J}$
5: Compute the SVD of the small matrix $A$ as $A = \hat{U} S V^{T}$
6: Form the matrix $U = Q \hat{U}$.
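A direct NumPy rendering of Algorithm 3 might look as follows (a sketch, not from the original text; the test matrix sizes are illustrative).

```python
import numpy as np

def randomized_svd(X, R, P=10, q=1, seed=0):
    """Randomized SVD with a single Gaussian sketch and optional power iteration."""
    rng = np.random.default_rng(seed)
    R_tilde = R + P                                      # overestimated rank R~ = R + P
    Omega = rng.standard_normal((X.shape[1], R_tilde))   # step 1: Gaussian test matrix
    Y = X @ Omega
    for _ in range(q):                                   # step 2: (X X^T)^q X Omega
        Y = X @ (X.T @ Y)
    Q, _ = np.linalg.qr(Y)                               # step 3
    A = Q.T @ X                                          # step 4
    U_hat, s, Vt = np.linalg.svd(A, full_matrices=False) # step 5
    return Q @ U_hat, s, Vt                              # step 6: U = Q U_hat

rng = np.random.default_rng(1)
X = rng.standard_normal((2000, 40)) @ rng.standard_normal((40, 500))
U, s, Vt = randomized_svd(X, R=40)
print(np.linalg.norm(X - (U * s) @ Vt) / np.linalg.norm(X))  # near machine precision
```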

Low multilinear rank approximation is always well-posed; however, in contrast to the standard truncated SVD for matrices, the truncated HOSVD does not yield the best multilinear rank approximation, but satisfies the quasi-best approximation property (De Lathauwer et al., 2000a)

$$ \| X - \llbracket S;\, U^{(1)}, \ldots, U^{(N)} \rrbracket \| \leq \sqrt{N}\, \| X - X_{\text{Best}} \|, \qquad (3.35) $$

where $X_{\text{Best}}$ is the best multilinear rank approximation of $X$, for a specific tensor norm $\|\cdot\|$.

Algorithm 4: Higher Order Orthogonal Iteration (HOOI) (De Lathauwer et al., 2000b; Austin et al., 2015)
Input: $N$th-order tensor $X \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ (usually in Tucker/HOSVD format)
Output: Improved Tucker approximation using the ALS approach, with orthogonal factor matrices $U^{(n)}$
1: Initialization via the standard HOSVD (see Algorithm 2)
2: repeat
3:   for $n = 1$ to $N$ do
4:     $Z \leftarrow X \times_{p \neq n} \{ U^{(p)T} \}$
5:     $C \leftarrow Z_{(n)} Z_{(n)}^{T} \in \mathbb{R}^{I_n \times I_n}$
6:     $U^{(n)} \leftarrow$ leading $R_n$ eigenvectors of $C$
7:   end for
8:   $G \leftarrow Z \times_N U^{(N)T}$
9: until the cost function $\big( \|X\|_F^2 - \|G\|_F^2 \big)$ ceases to decrease
10: return $\llbracket G;\, U^{(1)}, U^{(2)}, \ldots, U^{(N)} \rrbracket$

When it comes to the problem of finding the best approximation, the ALS-type algorithm called the Higher Order Orthogonal Iteration (HOOI) exhibits both the advantages and the drawbacks of ALS algorithms for CP decomposition. For the HOOI algorithms, see Algorithm 4 and Algorithm 5. For more sophisticated algorithms for Tucker decompositions with orthogonality and nonnegativity constraints, suitable for large-scale data tensors, see (Phan and Cichocki, 2011; Zhou et al., 2012; Constantine et al., 2014; Jeon et al., 2016).
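For concreteness, a short NumPy sketch of the HOOI updates of Algorithm 4 is given below (not from the original text; it runs a fixed number of sweeps instead of monitoring the cost function, and it initializes with truncated-HOSVD factors).

```python
import numpy as np

def unfold(T, n):
    return np.moveaxis(T, n, 0).reshape(T.shape[n], -1)

def multi_mode_product(T, mats, skip=None):
    """Apply U^(p)T (given as mats[p]) to every mode p except `skip`."""
    for p, M in enumerate(mats):
        if p == skip:
            continue
        T = np.moveaxis(np.tensordot(M.T, np.moveaxis(T, p, 0), axes=([1], [0])), 0, p)
    return T

def hooi(X, ranks, n_sweeps=10):
    # Initialization: leading left singular vectors of each unfolding (truncated HOSVD).
    U = [np.linalg.svd(unfold(X, n), full_matrices=False)[0][:, :R] for n, R in enumerate(ranks)]
    for _ in range(n_sweeps):
        for n, R in enumerate(ranks):
            Z = multi_mode_product(X, U, skip=n)     # X x_{p != n} U^(p)T
            C = unfold(Z, n) @ unfold(Z, n).T        # I_n x I_n
            w, V = np.linalg.eigh(C)
            U[n] = V[:, np.argsort(w)[::-1][:R]]     # leading R_n eigenvectors
    G = multi_mode_product(X, U)                     # core G = X x_n U^(n)T over all modes
    return G, U
```

Each sweep touches every mode once; in practice, convergence is monitored via the increase of $\|G\|_F$ (equivalently, the decrease of $\|X\|_F^2 - \|G\|_F^2$), as in step 9 of Algorithm 4.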

When a data tensor $X$ is very large and cannot be stored in computer memory, another challenge is to compute a core tensor $G = S$ directly, using formula (3.33). Such a computation is performed sequentially by fast matrix-by-matrix multiplications⁶, as illustrated in Figure 3.5(a) and (b).

⁶ Efficient and parallel (state-of-the-art) algorithms for multiplications of such very large-scale matrices are proposed in (Li et al., 2015; Ballard et al., 2015a).


Table 3.3: Basic multiway component analysis (MWCA)/low-rank tensor approximation (LRTA) models and related multiway dimensionality reduction models. The symbol $X \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ denotes a noisy data tensor, while $Y = G \times_1 B^{(1)} \times_2 B^{(2)} \cdots \times_N B^{(N)}$ is the general constrained Tucker model with latent factor matrices $B^{(n)} \in \mathbb{R}^{I_n \times R_n}$ and core tensor $G \in \mathbb{R}^{R_1 \times R_2 \times \cdots \times R_N}$. In the special case of a CP decomposition, the core tensor is diagonal, $G = \Lambda \in \mathbb{R}^{R \times \cdots \times R}$, so that $Y = \sum_{r=1}^{R} \lambda_r (b^{(1)}_r \circ b^{(2)}_r \circ \cdots \circ b^{(N)}_r)$.

Multilinear (sparse) PCA (MPCA)
  Cost function: $\max_{u^{(n)}_r} \big( X \,\bar{\times}_1\, u^{(1)}_r \,\bar{\times}_2\, u^{(2)}_r \cdots \bar{\times}_N\, u^{(N)}_r - \gamma \sum_{n=1}^{N} \| u^{(n)}_r \|_1 \big)$
  Constraints: $u^{(n)T}_r u^{(n)}_r = 1$, $\forall (n, r)$; $u^{(n)T}_r u^{(n)}_q = 0$ for $r \neq q$

HOSVD/HOOI
  Cost function: $\min_{U^{(n)}} \| X - G \times_1 U^{(1)} \times_2 U^{(2)} \cdots \times_N U^{(N)} \|_F^2$
  Constraints: $U^{(n)T} U^{(n)} = I_{R_n}$, $\forall n$

Multilinear ICA
  Cost function: $\min_{B^{(n)}} \| X - G \times_1 B^{(1)} \times_2 B^{(2)} \cdots \times_N B^{(N)} \|_F^2$
  Constraints: vectors of $B^{(n)}$ statistically as independent as possible

Nonnegative CP/Tucker decomposition (NTF/NTD) (Cichocki et al., 2009)
  Cost function: $\min_{B^{(n)}} \| X - G \times_1 B^{(1)} \cdots \times_N B^{(N)} \|_F^2 + \gamma \sum_{n=1}^{N} \sum_{r_n=1}^{R_n} \| b^{(n)}_{r_n} \|_1$
  Constraints: entries of $G$ and $B^{(n)}$, $\forall n$, are nonnegative

Sparse CP/Tucker decomposition
  Cost function: $\min_{B^{(n)}} \| X - G \times_1 B^{(1)} \cdots \times_N B^{(N)} \|_F^2 + \gamma \sum_{n=1}^{N} \sum_{r_n=1}^{R_n} \| b^{(n)}_{r_n} \|_1$
  Constraints: sparsity constraints imposed on $B^{(n)}$

Smooth CP/Tucker decomposition (SmCP/SmTD) (Yokota et al., 2016)
  Cost function: $\min_{B^{(n)}} \| X - \Lambda \times_1 B^{(1)} \cdots \times_N B^{(N)} \|_F^2 + \gamma \sum_{n=1}^{N} \sum_{r=1}^{R} \| L\, b^{(n)}_r \|_2$
  Constraints: smoothness imposed on the vectors $b^{(n)}_r$ of $B^{(n)} \in \mathbb{R}^{I_n \times R}$, $\forall n$, via a difference operator $L$

We have shown that for very large-scale problems, it is useful to divide a data tensor $X$ into small blocks $X_{[k_1, k_2, \ldots, k_N]}$. In a similar way, we can partition the orthogonal factor matrices $U^{(n)T}$ into corresponding blocks of matrices $U^{(n)T}_{[k_n, p_n]}$, as illustrated in Figure 3.5(c) for 3rd-order tensors (Wang et al., 2005; Suter et al., 2013). For example, the blocks within the resulting tensor $G^{(n)}$ can be computed sequentially or in parallel, as follows:

$$ G^{(n)}_{[k_1, k_2, \ldots, q_n, \ldots, k_N]} = \sum_{k_n=1}^{K_n} X_{[k_1, k_2, \ldots, k_n, \ldots, k_N]} \times_n U^{(n)T}_{[k_n, q_n]}. \qquad (3.36) $$

Algorithm 5: HOOI using randomization for large-scale data (Zhou et al., 2015)
Input: $N$th-order tensor $X \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ and multilinear rank $\{R_1, R_2, \ldots, R_N\}$
Output: Approximate representation of the tensor in the Tucker format, with orthogonal factor matrices $U^{(n)} \in \mathbb{R}^{I_n \times R_n}$
1: Initialize the factor matrices $U^{(n)}$ as random Gaussian matrices; repeat steps 2–6 only two times:
2: for $n = 1$ to $N$ do
3:   $Z = X \times_{p \neq n} \{ U^{(p)T} \}$
4:   Compute $Z^{(n)} = Z_{(n)} \Omega^{(n)} \in \mathbb{R}^{I_n \times R_n}$, where $\Omega^{(n)} \in \mathbb{R}^{\prod_{p \neq n} R_p \times R_n}$ is a random matrix drawn from a Gaussian distribution
5:   Compute $U^{(n)}$ as an orthonormal basis of $Z^{(n)}$, e.g., by using the QR decomposition
6: end for
7: Construct the core tensor as $G = X \times_1 U^{(1)T} \times_2 U^{(2)T} \cdots \times_N U^{(N)T}$
8: return $X \cong \llbracket G;\, U^{(1)}, U^{(2)}, \ldots, U^{(N)} \rrbracket$

Algorithm 6: Tucker decomposition with constrained factor matrices via 2-way CA/LRMF
Input: $N$th-order tensor $X \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, multilinear rank $\{R_1, \ldots, R_N\}$ and the desired constraints imposed on the factor matrices $B^{(n)} \in \mathbb{R}^{I_n \times R_n}$
Output: Tucker decomposition with constrained factor matrices $B^{(n)}$, using LRMF and a simple unfolding approach
1: Initialize randomly or via the standard HOSVD (see Algorithm 2)
2: for $n = 1$ to $N$ do
3:   Compute a specific LRMF or 2-way CA (e.g., RPCA, ICA, NMF) of the unfolding, $X_{(n)}^{T} = A^{(n)} B^{(n)T}$ or $X_{(n)} = B^{(n)} A^{(n)T}$
4: end for
5: Compute the core tensor $G = X \times_1 [B^{(1)}]^{\dagger} \times_2 [B^{(2)}]^{\dagger} \cdots \times_N [B^{(N)}]^{\dagger}$
6: return Constrained Tucker decomposition $X \cong \llbracket G;\, B^{(1)}, \ldots, B^{(N)} \rrbracket$


Figure 3.5: Computation of a multilinear (Tucker) product for large-scale HOSVD. (a) Standard sequential computation of multilinear products (TTM), $G = S = (((X \times_1 U^{(1)T}) \times_2 U^{(2)T}) \times_3 U^{(3)T})$. (b) Distributed implementation through fast matrix-by-matrix multiplications. (c) An alternative method for large-scale problems using the “divide and conquer” approach, whereby a data tensor, $X$, and factor matrices, $U^{(n)T}$, are partitioned into suitable small blocks: subtensors $X_{[k_1,k_2,k_3]}$ and block matrices $U^{(1)T}_{[k_1,p_1]}$. The blocks of the tensor $Z = G^{(1)} = X \times_1 U^{(1)T}$ are computed as $Z_{[q_1,k_2,k_3]} = \sum_{k_1=1}^{K_1} X_{[k_1,k_2,k_3]} \times_1 U^{(1)T}_{[k_1,q_1]}$ (see Eq. (3.36) for the general case).


Applications. We have shown that the Tucker/HOSVD decomposition may be considered as a multilinear extension of PCA (Kroonenberg, 2008); it therefore generalizes signal subspace techniques and finds application in areas including multilinear blind source separation, classification, feature extraction, and subspace-based harmonic retrieval (Vasilescu and Terzopoulos, 2002; Haardt et al., 2008; Phan and Cichocki, 2010; Lu et al., 2011). In this way, a low multilinear rank approximation achieved through the Tucker decomposition may yield a higher Signal-to-Noise Ratio (SNR) than that of the original raw data tensor, which also makes the Tucker decomposition a natural tool for signal compression and enhancement.

It was recently shown that HOSVD can also perform simultaneous subspace selection (data compression) and K-means clustering, both unsupervised learning tasks (Huang et al., 2008; Papalexakis et al., 2013). This is important, as a combination of these methods can both identify and classify “relevant” data, and in this way not only reveal desired information but also simplify feature extraction.

Anomaly detection using HOSVD. Anomaly detection refers to the discrimination of specific patterns, signals, outliers or features that do not conform to certain expected behaviors, trends or properties (Chandola et al., 2009; Fanaee-T and Gama, 2016). While such analysis can be performed in different domains, it is most frequently based on spectral methods such as PCA, whereby high-dimensional data are projected onto a lower-dimensional subspace in which the anomalies may be identified more easily. The main assumption within such approaches is that the normal and abnormal patterns, which may be difficult to distinguish in the original space, appear significantly different in the projected subspace. When considering very large datasets, since the basic Tucker decomposition model generalizes PCA and SVD, it offers a natural framework for anomaly detection via HOSVD, as illustrated in Figure 3.6. To handle the exceedingly large dimensionality, we may first compute tensor decompositions for sampled (pre-selected) small blocks of the original large-scale 3rd-order tensor, followed by the analysis of changes in specific factor matrices $U^{(n)}$.


Figure 3.6: Conceptual model for performing the HOSVD of a very large-scale 3rd-order data tensor, achieved by dividing the tensor into blocks $X_k \cong G \times_1 U^{(1)} \times_2 U^{(2)}_k \times_3 U^{(3)}$, $(k = 1, 2, \ldots, K)$. It is assumed that the data tensor $X \in \mathbb{R}^{I_1 \times I_2 \times I_3}$ is sampled by sliding the block $X_k$ from left to right (with an overlapping sliding window). The model can be used for anomaly detection by fixing the core tensor and some factor matrices while monitoring the changes along one or more specific modes (in our case, mode two). Tensor decomposition is then first performed for a sampled (pre-selected) small block, followed by the analysis of changes in the specific smaller-dimensional factor matrices $U^{(n)}$.

A simpler form is straightforwardly obtained by fixing the core tensor and some factor matrices, while monitoring the changes along one or more specific modes as the block tensor moves from left to right, as shown in Figure 3.6.

3.5 Tensor Sketching Using Tucker Model

The notion of a sketch refers to replacing the original huge matrix or tensor by a new matrix or tensor of significantly smaller size or greater compactness, which nevertheless approximates well the original matrix/tensor. Finding such sketches in an efficient way is important for the analysis of big data, as a computer processor (and memory) is often incapable of handling the whole dataset in a feasible amount of time. For these reasons, the computation is often spread among a set of processors, for which standard “all-in-one” SVD algorithms are unfeasible.

Given a very large-scale tensor $X$, a useful approach is to compute a sketch tensor $Z$, or a set of sketch tensors $Z_n$, of significantly smaller size than the original tensor.

There exist several matrix and tensor sketching approaches: sparsification, random projections, fiber subset selection, iterative sketching techniques and distributed sketching. We review the main sketching approaches which are promising for tensors.

1. Sparsification generates a sparser version of the tensor which, in general, can be stored more efficiently and admits faster multiplications by factor matrices. This is achieved by decreasing the number of non-zero entries and by quantizing or rounding up entries. A simple technique is element-wise sparsification, which zeroes out all sufficiently small elements (below some threshold) of a data tensor, keeps all sufficiently large elements, and randomly samples the remaining elements of the tensor with sample probabilities proportional to the square of their magnitudes (Nguyen et al., 2015).

2. Random projection based sketching randomly combines the fibers of a data tensor in all or in selected modes, and is related to the concept of a randomized subspace embedding, which is used to solve a variety of numerical linear algebra problems (see (Tropp et al., 2016) and references therein).

3. Fiber subset selection, also called tensor cross approximation (TCA), finds a small subset of fibers which approximates the entire data tensor. For the matrix case, this problem is known as the Column/Row Subset Selection or CUR problem, which has been thoroughly investigated and for which there exist several algorithms with almost matching lower bounds (Mahoney, 2011; Desai et al., 2016; Ghashami et al., 2016).

3.5.1 Tensor Sketching via Multiple Random Projections

The random projection framework has been developed for computing structured low-rank approximations of a data tensor from (random) linear projections of much lower dimensions than the data tensor itself (Caiafa and Cichocki, 2015; Tropp et al., 2016). Such techniques have many potential applications in large-scale numerical multilinear algebra and optimization problems.

Notice that for an $N$th-order tensor $X \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, we can compute the following sketches

$$ Z = X \times_1 \Omega_1 \times_2 \Omega_2 \cdots \times_N \Omega_N \qquad (3.37) $$


Figure 3.7: Illustration of tensor sketching using random projections of a data tensor. (a) Sketches of a 3rd-order tensor $X \in \mathbb{R}^{I_1 \times I_2 \times I_3}$ given by $Z_1 = X \times_2 \Omega_2 \times_3 \Omega_3 \in \mathbb{R}^{I_1 \times R_2 \times R_3}$, $Z_2 = X \times_1 \Omega_1 \times_3 \Omega_3 \in \mathbb{R}^{R_1 \times I_2 \times R_3}$, $Z_3 = X \times_1 \Omega_1 \times_2 \Omega_2 \in \mathbb{R}^{R_1 \times R_2 \times I_3}$, and $Z = X \times_1 \Omega_1 \times_2 \Omega_2 \times_3 \Omega_3 \in \mathbb{R}^{R_1 \times R_2 \times R_3}$. (b) Sketches for an $N$th-order tensor $X \in \mathbb{R}^{I_1 \times \cdots \times I_N}$.


and

$$ Z_n = X \times_1 \Omega_1 \cdots \times_{n-1} \Omega_{n-1} \times_{n+1} \Omega_{n+1} \cdots \times_N \Omega_N, \qquad (3.38) $$

for $n = 1, 2, \ldots, N$, where $\Omega_n \in \mathbb{R}^{R_n \times I_n}$, with $R_n \ll I_n$, are statistically independent random matrices, usually called test (or sensing) matrices.

A sketch can be implemented using test matrices drawn from var-ious distributions. The choice of a distribution leads to some tradeoffs(Tropp et al., 2016), especially regarding (i) the costs of randomization,computation, and communication to generate the test matrices; (ii) thestorage costs for the test matrices and the sketch; (iii) the arithmeticcosts for sketching and updates; (iv) the numerical stability of recon-struction algorithms; and (v) the quality of a priori error bounds. Themost important distributions of random test matrices include:

• Gaussian random projections which generate random ma-trices with standard normal distribution. Such matrices usuallyprovide excellent performance in practical scenarios and accuratea priori error bounds.

• Random matrices with orthonormal columns that spanuniformly distributed random subspaces of dimensions Rn. Suchmatrices behave similar to Gaussian case, but usually exhibit evenbetter numerical stability, especially when Rn are large.

• Rademacher and super-sparse Rademacher random projections, which have independent Rademacher entries taking the values $\pm 1$ with equal probability. Their properties are similar to those of standard normal test matrices, but they offer some savings in storage cost and computational complexity. In a special case, we may use ultra-sparse Rademacher test matrices, whereby in each column of a test matrix independent Rademacher random variables are placed only in very few uniformly random locations determined by a sampling parameter $s$; the remaining entries are set to zero. In the extreme case of maximum sparsity, $s = 1$, and each column of a test matrix has exactly one nonzero entry.


• Subsampled randomized Fourier transforms, based on test matrices of the form

$$\boldsymbol{\Omega}_n = \mathbf{P}_n \mathbf{F}_n \mathbf{D}_n, \qquad (3.39)$$

where $\mathbf{D}_n$ are diagonal square matrices with independent Rademacher entries, $\mathbf{F}_n$ are discrete cosine transform (DCT) or discrete Fourier transform (DFT) matrices, and the entries of the matrix $\mathbf{P}_n$ are drawn at random from a uniform distribution.

Example. The concept of tensor sketching via random projections is illustrated in Figure 3.7 for a 3rd-order tensor and for the general case of $N$th-order tensors. For a 3rd-order tensor with volume (number of entries) $I_1 I_2 I_3$ we have four possible sketches, which are subtensors of much smaller sizes, e.g., $I_1 R_2 R_3$, with $R_n \ll I_n$, if the sketching is performed along mode-2 and mode-3, or $R_1 R_2 R_3$, if the sketching is performed along all three modes (Figure 3.7(a), bottom right). From these subtensors we can reconstruct any huge tensor, provided that it has low multilinear rank (not exceeding $\{R_1, R_2, \ldots, R_N\}$).

In a more general scenario, it can be shown (Caiafa and Cichocki, 2015) that an $N$th-order data tensor $\mathbf{X}$ with sufficiently low multilinear rank can be reconstructed perfectly from the sketch tensors $\mathbf{Z}_n$, $n = 1, 2, \ldots, N$, as follows

$$\mathbf{X} = \mathbf{Z} \times_1 \mathbf{B}^{(1)} \times_2 \mathbf{B}^{(2)} \cdots \times_N \mathbf{B}^{(N)}, \qquad (3.40)$$

where $\mathbf{B}^{(n)} = [\mathbf{Z}_n]_{(n)} \mathbf{Z}_{(n)}^{\dagger}$ for $n = 1, 2, \ldots, N$ (for more detail see the next section).
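To make the sketching and reconstruction steps concrete, the following minimal NumPy sketch builds a synthetic tensor of low multilinear rank, computes the sketches (3.37)–(3.38) with Gaussian test matrices, and reconstructs the tensor via (3.40). The helper functions mode_n_product and unfold, the chosen sizes and the Gaussian test matrices are illustrative assumptions rather than part of any particular toolbox.

```python
import numpy as np

def mode_n_product(T, M, mode):
    """Mode-n product T x_n M, where M has shape (new_size, old_size)."""
    return np.moveaxis(np.tensordot(M, T, axes=(1, mode)), 0, mode)

def unfold(T, mode):
    """Mode-n unfolding, with rows indexed by the given mode."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

# Synthetic tensor with multilinear rank (2, 3, 2), built in Tucker form
rng = np.random.default_rng(0)
I, mlrank = (20, 30, 40), (2, 3, 2)
X = rng.standard_normal(mlrank)
A = [rng.standard_normal((I[n], mlrank[n])) for n in range(3)]
for n in range(3):
    X = mode_n_product(X, A[n], n)

# Gaussian test matrices Omega_n of size (R_n x I_n), with R_n >= multilinear rank
R = (4, 5, 4)
Omega = [rng.standard_normal((R[n], I[n])) for n in range(3)]

# Sketch Z = X x_1 Omega_1 x_2 Omega_2 x_3 Omega_3, Eq. (3.37)
Z = X
for n in range(3):
    Z = mode_n_product(Z, Omega[n], n)

# Sketches Z_n: project along every mode except mode n, Eq. (3.38)
Z_skip = []
for n in range(3):
    Zn = X
    for m in range(3):
        if m != n:
            Zn = mode_n_product(Zn, Omega[m], m)
    Z_skip.append(Zn)

# Reconstruction, Eq. (3.40): B^(n) = [Z_n]_(n) Z_(n)^+ and X = Z x_1 B^(1) x_2 B^(2) x_3 B^(3)
X_hat = Z
for n in range(3):
    B_n = unfold(Z_skip[n], n) @ np.linalg.pinv(unfold(Z, n))
    X_hat = mode_n_product(X_hat, B_n, n)

print("relative error:", np.linalg.norm(X_hat - X) / np.linalg.norm(X))
```

For an exactly low multilinear rank tensor and generic Gaussian test matrices, the printed relative error is at the level of machine precision, in agreement with the reconstruction result of (Caiafa and Cichocki, 2015).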

3.5.2 Matrix/Tensor Cross-Approximation (MCA/TCA)

Huge-scale matrices can be factorized using the Matrix Cross-Approximation (MCA) method, which is also known under the names of Pseudo-Skeleton or CUR matrix decompositions (Goreinov et al., 1997b,a; Mahoney et al., 2008; Mahoney and Drineas, 2009; Oseledets and Tyrtyshnikov, 2010; Bebendorf, 2011; Bebendorf et al., 2015; Khoromskij and Veit, 2016). The main idea behind the MCA is to provide reduced dimensionality of data through a linear combination


Figure 3.8: Principle of the matrix cross-approximation, which decomposes a huge matrix $\mathbf{X}$ into a product of three matrices, whereby only a small-size core matrix $\mathbf{U}$ needs to be computed.

of only a few “meaningful” components, which are exact replicas of columns and rows of the original data matrix. Such an approach is based on the fundamental assumption that large datasets are highly redundant and can therefore be approximated by low-rank matrices, which significantly reduces computational complexity at the cost of a marginal loss of information.

The MCA method factorizes a data matrix $\mathbf{X} \in \mathbb{R}^{I \times J}$ as (Goreinov et al., 1997a,b) (see Figure 3.8)

$$\mathbf{X} = \mathbf{C}\mathbf{U}\mathbf{R} + \mathbf{E}, \qquad (3.41)$$

where $\mathbf{C} \in \mathbb{R}^{I \times C}$ is a matrix constructed from $C$ suitably selected columns of the data matrix $\mathbf{X}$, the matrix $\mathbf{R} \in \mathbb{R}^{R \times J}$ consists of $R$ appropriately selected rows of $\mathbf{X}$, and the matrix $\mathbf{U} \in \mathbb{R}^{C \times R}$ is calculated so as to minimize the norm of the error $\mathbf{E} \in \mathbb{R}^{I \times J}$.

A simple modification of this formula, whereby the matrix $\mathbf{U}$ is absorbed into either $\mathbf{C}$ or $\mathbf{R}$, yields the so-called CR matrix factorization or Column/Row Subset selection:

$$\mathbf{X} \cong \mathbf{C}\tilde{\mathbf{R}} = \tilde{\mathbf{C}}\mathbf{R}, \qquad (3.42)$$

for which the bases can be either the columns, $\mathbf{C}$, or the rows, $\mathbf{R}$, while $\tilde{\mathbf{R}} = \mathbf{U}\mathbf{R}$ and $\tilde{\mathbf{C}} = \mathbf{C}\mathbf{U}$.

For dimensionality reduction, $C \ll J$ and $R \ll I$, and the columns and rows of $\mathbf{X}$ should be chosen optimally, in the sense of providing a


Figure 3.9: The principle of the tensor cross-approximation (TCA) algorithm, illustrated for a large-scale 3rd-order tensor $\mathbf{X} = \mathbf{U} \times_1 \mathbf{C} \times_2 \mathbf{R} \times_3 \mathbf{T} = \llbracket \mathbf{U};\, \mathbf{C}, \mathbf{R}, \mathbf{T} \rrbracket$, where $\mathbf{U} = \mathbf{W} \times_1 \mathbf{W}_{(1)}^{\dagger} \times_2 \mathbf{W}_{(2)}^{\dagger} \times_3 \mathbf{W}_{(3)}^{\dagger} = \llbracket \mathbf{W};\, \mathbf{W}_{(1)}^{\dagger}, \mathbf{W}_{(2)}^{\dagger}, \mathbf{W}_{(3)}^{\dagger} \rrbracket \in \mathbb{R}^{P_2 P_3 \times P_1 P_3 \times P_1 P_2}$ and $\mathbf{W} \in \mathbb{R}^{P_1 \times P_2 \times P_3}$. For simplicity of illustration, we assume that the selected fibers are permuted, so as to become clustered as subtensors, $\mathbf{C} \in \mathbb{R}^{I_1 \times P_2 \times P_3}$, $\mathbf{R} \in \mathbb{R}^{P_1 \times I_2 \times P_3}$ and $\mathbf{T} \in \mathbb{R}^{P_1 \times P_2 \times I_3}$.

high “statistical leverage” and the best low-rank fit to the data matrix, while at the same time minimizing the cost function $\|\mathbf{E}\|_F^2$. For a given set of columns, $\mathbf{C}$, and rows, $\mathbf{R}$, the optimal choice for the core matrix is $\mathbf{U} = \mathbf{C}^{\dagger}\mathbf{X}\mathbf{R}^{\dagger}$. This requires access to all the entries of $\mathbf{X}$ and is not practical or feasible for large-scale data. In such cases, a pragmatic choice for the core matrix would be $\mathbf{U} = \mathbf{W}^{\dagger}$, where the matrix $\mathbf{W} \in \mathbb{R}^{R \times C}$ is composed of the intersections of the selected rows and columns. It should be noted that for $\mathrm{rank}(\mathbf{X}) \leq \min\{C, R\}$ the cross-approximation is exact. For the general case, it has been proven that when the intersection submatrix $\mathbf{W}$ is of maximum volume⁷, the matrix cross-approximation is close to the optimal SVD solution. The problem of finding a submatrix with maximum volume has exponential complexity; however, suboptimal matrices can be found using fast greedy algorithms (Wang and Zhang, 2013; Mikhalev and Oseledets, 2015; Rakhuba and Oseledets, 2015; Anderson et al., 2015).
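The following short NumPy sketch illustrates the MCA/CUR factorization (3.41) with the pragmatic core choice $\mathbf{U} = \mathbf{W}^{\dagger}$ on a synthetic low-rank matrix; the uniformly random selection of columns and rows is a simplifying assumption (in practice, leverage-score or maximum-volume based selection is preferable).

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic matrix of rank 5
I, J, rank = 200, 150, 5
X = rng.standard_normal((I, rank)) @ rng.standard_normal((rank, J))

# Column and row index sets (chosen at random here, for illustration only)
C_idx = rng.choice(J, size=8, replace=False)     # C = 8 columns
R_idx = rng.choice(I, size=8, replace=False)     # R = 8 rows

C = X[:, C_idx]                    # (I x C)
R = X[R_idx, :]                    # (R x J)
W = X[np.ix_(R_idx, C_idx)]        # intersection submatrix (R x C)

# Pragmatic core matrix U = W^+ in Eq. (3.41)
U = np.linalg.pinv(W)
X_cur = C @ U @ R

print("relative error:", np.linalg.norm(X - X_cur) / np.linalg.norm(X))
```

Since $\mathrm{rank}(\mathbf{X}) \leq \min\{C, R\}$ and the intersection submatrix is generically of the same rank as the data, the cross-approximation above is exact up to rounding errors.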

The concept of MCA can be generalized to tensor cross-approximation (TCA) (see Figure 3.9) through several approaches, including:

⁷The volume of a square submatrix $\mathbf{W}$ is defined as $|\det(\mathbf{W})|$.


Figure 3.10: The Tucker decomposition of a low multilinear rank 3rd-order tensor using the cross-approximation approach. (a) Standard block diagram. (b) Transformation from the TCA in the Tucker format, $\mathbf{X} = \mathbf{U} \times_1 \mathbf{C} \times_2 \mathbf{R} \times_3 \mathbf{T}$, into a standard Tucker representation, $\mathbf{X} = \mathbf{W} \times_1 \mathbf{B}^{(1)} \times_2 \mathbf{B}^{(2)} \times_3 \mathbf{B}^{(3)} = \llbracket \mathbf{W};\, \mathbf{C}\mathbf{W}_{(1)}^{\dagger}, \mathbf{R}\mathbf{W}_{(2)}^{\dagger}, \mathbf{T}\mathbf{W}_{(3)}^{\dagger} \rrbracket$, with a prescribed core tensor $\mathbf{W}$.


• Applying the MCA decomposition to a matricized version of the tensor data (Mahoney et al., 2008);

• Operating directly on fibers of a data tensor which admits a low-rank Tucker approximation, an approach termed the Fiber Sampling Tucker Decomposition (FSTD) (Caiafa and Cichocki, 2010, 2013, 2015).

Real-life structured data often admit good low multilinear rank approximations, and the FSTD provides such a low-rank Tucker decomposition, which is practical as it is directly expressed in terms of a relatively small number of fibers of the data tensor.

For example, for a 3rd-order tensor, $\mathbf{X} \in \mathbb{R}^{I_1 \times I_2 \times I_3}$, for which an exact rank-$(R_1, R_2, R_3)$ Tucker representation exists, the FSTD selects $P_n \geq R_n$, $n = 1, 2, 3$, indices in each mode; this determines an intersection subtensor, $\mathbf{W} \in \mathbb{R}^{P_1 \times P_2 \times P_3}$, so that the following exact Tucker representation can be obtained (see Figure 3.10)

$$\mathbf{X} = \llbracket \mathbf{U};\, \mathbf{C}, \mathbf{R}, \mathbf{T} \rrbracket, \qquad (3.43)$$

where the core tensor is computed as $\mathbf{U} = \mathbf{G} = \llbracket \mathbf{W};\, \mathbf{W}_{(1)}^{\dagger}, \mathbf{W}_{(2)}^{\dagger}, \mathbf{W}_{(3)}^{\dagger} \rrbracket$, while the factor matrices, $\mathbf{C} \in \mathbb{R}^{I_1 \times P_2 P_3}$, $\mathbf{R} \in \mathbb{R}^{I_2 \times P_1 P_3}$, $\mathbf{T} \in \mathbb{R}^{I_3 \times P_1 P_2}$, contain the fibers which are the respective subsets of the columns, rows and tubes of the data tensor. An equivalent Tucker representation is then given by

$$\mathbf{X} = \llbracket \mathbf{W};\, \mathbf{C}\mathbf{W}_{(1)}^{\dagger}, \mathbf{R}\mathbf{W}_{(2)}^{\dagger}, \mathbf{T}\mathbf{W}_{(3)}^{\dagger} \rrbracket. \qquad (3.44)$$

Observe that for $N = 2$, the TCA model simplifies into the MCA for the matrix case, $\mathbf{X} = \mathbf{C}\mathbf{U}\mathbf{R}$, for which the core matrix is $\mathbf{U} = \llbracket \mathbf{W};\, \mathbf{W}_{(1)}^{\dagger}, \mathbf{W}_{(2)}^{\dagger} \rrbracket = \mathbf{W}^{\dagger}\mathbf{W}\mathbf{W}^{\dagger} = \mathbf{W}^{\dagger}$.

For the general case of an $N$th-order tensor, we can show (Caiafa and Cichocki, 2010) that a tensor, $\mathbf{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, with a low multilinear rank $\{R_1, R_2, \ldots, R_N\}$, where $R_n \leq I_n$, $\forall n$, can be fully reconstructed via the TCA FSTD, $\mathbf{X} = \llbracket \mathbf{U};\, \mathbf{C}^{(1)}, \mathbf{C}^{(2)}, \ldots, \mathbf{C}^{(N)} \rrbracket$, using only $N$ factor matrices $\mathbf{C}^{(n)} \in \mathbb{R}^{I_n \times P_n}$ $(n = 1, 2, \ldots, N)$, built up from the fibers of the data tensor, together with the core tensor $\mathbf{U} = \mathbf{G} = \llbracket \mathbf{W};\, \mathbf{W}_{(1)}^{\dagger}, \mathbf{W}_{(2)}^{\dagger}, \ldots, \mathbf{W}_{(N)}^{\dagger} \rrbracket$, under the condition that the subtensor $\mathbf{W} \in \mathbb{R}^{P_1 \times P_2 \times \cdots \times P_N}$, with $P_n \geq R_n$, $\forall n$, has multilinear rank $\{R_1, R_2, \ldots, R_N\}$.


The selection of a minimum number of suitable fibers depends upon a chosen optimization criterion. A strategy which requires access to only a small subset of entries of a data tensor, achieved by selecting the entries with maximum modulus within each single fiber, is given in (Caiafa and Cichocki, 2010). These entries are selected sequentially using a deflation approach, thus making the tensor cross-approximation FSTD algorithm suitable for the approximation of very large-scale but relatively low-order tensors (including tensors with missing fibers or entries).

It should be noted that an alternative efficient way to estimate the subtensors $\mathbf{W}$, $\mathbf{C}$, $\mathbf{R}$ and $\mathbf{T}$ is to apply random projections as follows

$$\mathbf{W} = \mathbf{Z} = \mathbf{X} \times_1 \boldsymbol{\Omega}_1 \times_2 \boldsymbol{\Omega}_2 \times_3 \boldsymbol{\Omega}_3 \in \mathbb{R}^{P_1 \times P_2 \times P_3},$$
$$\mathbf{C} = \mathbf{Z}_1 = \mathbf{X} \times_2 \boldsymbol{\Omega}_2 \times_3 \boldsymbol{\Omega}_3 \in \mathbb{R}^{I_1 \times P_2 \times P_3},$$
$$\mathbf{R} = \mathbf{Z}_2 = \mathbf{X} \times_1 \boldsymbol{\Omega}_1 \times_3 \boldsymbol{\Omega}_3 \in \mathbb{R}^{P_1 \times I_2 \times P_3},$$
$$\mathbf{T} = \mathbf{Z}_3 = \mathbf{X} \times_1 \boldsymbol{\Omega}_1 \times_2 \boldsymbol{\Omega}_2 \in \mathbb{R}^{P_1 \times P_2 \times I_3}, \qquad (3.45)$$

where $\boldsymbol{\Omega}_n \in \mathbb{R}^{P_n \times I_n}$, with $P_n \geq R_n$ for $n = 1, 2, 3$, are independent random matrices. We explicitly allow the multilinear rank $\{P_1, P_2, \ldots, P_N\}$ of the approximation to be somewhat larger than the true multilinear rank $\{R_1, R_2, \ldots, R_N\}$ of the target tensor, because it is then easier to obtain an accurate approximation in this form.
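A minimal numerical sketch of the FSTD idea is given below: actual fibers of a synthetic, exactly low multilinear rank tensor are sampled (here with randomly chosen indices, rather than a maximum-volume or deflation-based strategy), and the tensor is rebuilt via the equivalent Tucker form (3.44). All sizes and helper functions are illustrative assumptions.

```python
import numpy as np

def unfold(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_n_product(T, M, mode):
    return np.moveaxis(np.tensordot(M, T, axes=(1, mode)), 0, mode)

# Synthetic 3rd-order tensor with exact multilinear rank (2, 2, 2)
rng = np.random.default_rng(2)
I1, I2, I3, r = 15, 16, 17, 2
X = rng.standard_normal((r, r, r))
for n, I in enumerate((I1, I2, I3)):
    X = mode_n_product(X, rng.standard_normal((I, r)), n)

# Select P_n >= R_n indices per mode (at random here); W is the intersection subtensor
P = (3, 3, 3)
idx = [rng.choice(X.shape[n], size=P[n], replace=False) for n in range(3)]
W = X[np.ix_(*idx)]

# Fiber subtensors C, R, T: full mode-n fibers through the selected index sets
C = X[np.ix_(np.arange(I1), idx[1], idx[2])]
R = X[np.ix_(idx[0], np.arange(I2), idx[2])]
T = X[np.ix_(idx[0], idx[1], np.arange(I3))]

# Equivalent Tucker form, Eq. (3.44): X = [[ W ; C_(1) W_(1)^+, R_(2) W_(2)^+, T_(3) W_(3)^+ ]]
B = [unfold(C, 0) @ np.linalg.pinv(unfold(W, 0)),
     unfold(R, 1) @ np.linalg.pinv(unfold(W, 1)),
     unfold(T, 2) @ np.linalg.pinv(unfold(W, 2))]
X_hat = W
for n in range(3):
    X_hat = mode_n_product(X_hat, B[n], n)

print("relative error:", np.linalg.norm(X_hat - X) / np.linalg.norm(X))
```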

3.6 Multiway Component Analysis (MWCA)

3.6.1 Multilinear Component Analysis Using Constrained Tucker Decomposition

The great success of 2-way component analyses (PCA, ICA, NMF, SCA) is largely due to the existence of very efficient algorithms for their computation and the possibility to extract components with a desired physical meaning, provided by the various flexible constraints exploited in these methods. Without these constraints, matrix factorizations would be less useful in practice, as the components would have only mathematical but not physical meaning.

Similarly, to exploit the full potential of tensor factorizations/decompositions, it is a prerequisite to impose suitable constraints


on the desired components. In fact, there is much more flexibility for tensors, since different constraints can be imposed on the matrix factorizations in every mode $n$ of a matricized tensor $\mathbf{X}_{(n)}$ (see Algorithm 6 and Figure 3.11).

Such physically meaningful representation through flexible mode-wise constraints underpins the concept of multiway component analysis (MWCA). The Tucker representation of MWCA naturally accommodates such diversities in different modes. Besides orthogonality, alternative constraints in the Tucker format include statistical independence, sparsity, smoothness and nonnegativity (Vasilescu and Terzopoulos, 2002; Cichocki et al., 2009; Zhou and Cichocki, 2012a; Cichocki et al., 2015b) (see Table 3.3).

The multiway component analysis (MWCA) based on the Tucker-$N$ model can be computed directly in two or three steps:

1. For each mode $n$ $(n = 1, 2, \ldots, N)$, perform model reduction and matricization of the data tensor sequentially, then apply a suitable set of 2-way CA/BSS algorithms to the so reduced unfolding matrices, $\mathbf{X}_{(n)}$. In each mode, we can apply different constraints and different 2-way CA algorithms.

2. Compute the core tensor using, e.g., the inversion formula, $\mathbf{G} = \mathbf{X} \times_1 \mathbf{B}^{(1)\,\dagger} \times_2 \mathbf{B}^{(2)\,\dagger} \cdots \times_N \mathbf{B}^{(N)\,\dagger}$. This step is quite important because core tensors often model the complex links among the multiple components in different modes.

3. Optionally, perform fine tuning of the factor matrices and the core tensor by ALS minimization of a suitable cost function, e.g., $\|\mathbf{X} - \llbracket \mathbf{G};\, \mathbf{B}^{(1)}, \ldots, \mathbf{B}^{(N)} \rrbracket\|_F^2$, subject to the specific imposed constraints.
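A simplified numerical sketch of this three-step procedure is given below for a 3rd-order tensor, using SVD (orthogonality) in mode 1, FastICA (statistical independence) in mode 2 and, as a placeholder, SVD again in mode 3, where a sparse CA or NMF could be substituted. The synthetic data, the chosen ranks and the use of scikit-learn's FastICA are illustrative assumptions, and the model-reduction step is omitted.

```python
import numpy as np
from sklearn.decomposition import FastICA

def unfold(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_n_product(T, M, mode):
    return np.moveaxis(np.tensordot(M, T, axes=(1, mode)), 0, mode)

# Synthetic 3rd-order data tensor with low multilinear rank plus a small amount of noise
rng = np.random.default_rng(3)
ranks = (4, 4, 4)
A1 = np.linalg.qr(rng.standard_normal((30, ranks[0])))[0]   # orthogonal components
A2 = rng.uniform(-1, 1, (40, ranks[1]))                     # non-Gaussian components
A3 = rng.standard_normal((50, ranks[2]))
X = rng.standard_normal(ranks)
for n, A in enumerate((A1, A2, A3)):
    X = mode_n_product(X, A, n)
X += 0.01 * rng.standard_normal(X.shape)

# Step 1: a different 2-way CA per mode, applied to the unfoldings X_(n)
B1 = np.linalg.svd(unfold(X, 0), full_matrices=False)[0][:, :ranks[0]]          # orthogonality
B2 = FastICA(n_components=ranks[1], max_iter=1000).fit(unfold(X, 1).T).mixing_  # independence
B3 = np.linalg.svd(unfold(X, 2), full_matrices=False)[0][:, :ranks[2]]          # placeholder for SCA/NMF

# Step 2: core tensor via the inversion formula G = X x_1 B1^+ x_2 B2^+ x_3 B3^+
G = X
for n, B in enumerate((B1, B2, B3)):
    G = mode_n_product(G, np.linalg.pinv(B), n)

# Check of the overall fit (Step 3 would further fine-tune the factors by ALS)
X_hat = G
for n, B in enumerate((B1, B2, B3)):
    X_hat = mode_n_product(X_hat, B, n)
print("relative error:", np.linalg.norm(X_hat - X) / np.linalg.norm(X))
```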

3.6.2 Analysis of Coupled Multi-block Matrix/Tensors – Linked Multiway Component Analysis (LMWCA)

We have shown that TDs provide natural extensions of blind source separation (BSS) and 2-way (matrix) Component Analysis to multiway component analysis (MWCA) methods.


Figure 3.11: Multiway Component Analysis (MWCA) for a third-order tensor via constrained matrix factorizations, assuming that the components are: orthogonal in the first mode, statistically independent in the second mode, and sparse in the third mode.

In addition, TDs are suitable for the coupled multiway analysis of multi-block datasets, possibly with missing values and corrupted by noise. To illustrate the simplest scenario for multi-block analysis, consider the block matrices, $\mathbf{X}^{(k)} \in \mathbb{R}^{I \times J}$, which need to be approximately jointly factorized as

$$\mathbf{X}^{(k)} \cong \mathbf{A}\mathbf{G}^{(k)}\mathbf{B}^{\mathrm{T}}, \qquad (k = 1, 2, \ldots, K), \qquad (3.46)$$

where $\mathbf{A} \in \mathbb{R}^{I \times R_1}$ and $\mathbf{B} \in \mathbb{R}^{J \times R_2}$ are common factor matrices and $\mathbf{G}^{(k)} \in \mathbb{R}^{R_1 \times R_2}$ are reduced-size matrices, while the number of data matrices $K$ can be huge (hundreds of millions or more matrices). Such a simple model is referred to as the Population Value Decomposition (PVD) (Crainiceanu et al., 2011). Note that the PVD is equivalent to the unconstrained or constrained Tucker-2 model, as illustrated in Figure 3.12. In a special case with square diagonal matrices, $\mathbf{G}^{(k)}$, the model is equivalent to the CP decomposition and is related to joint matrix diagonalization (De Lathauwer, 2006; Tichavský and Yeredor, 2009; Chabriel et al., 2014). Furthermore, if $\mathbf{A} = \mathbf{B}$, then the PVD model is equivalent to the RESCAL model (Nickel et al., 2016).

Observe that the PVD/Tucker-2 model is quite general and flexible, since any high-order tensor, $\mathbf{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ (with $N > 3$), can be reshaped and optionally permuted into a “skinny and tall” 3rd-order tensor, $\tilde{\mathbf{X}} \in \mathbb{R}^{I \times J \times K}$, with, e.g., $I = I_1$, $J = I_2$ and $K = I_3 I_4 \cdots I_N$, for which the PVD/Tucker-2 Algorithm 8 can be applied.


Algorithm 7: Population Value Decomposition (PVD) with orthogonality constraints

Input: A set of matrices $\mathbf{X}_k \in \mathbb{R}^{I \times J}$, for $k = 1, \ldots, K$ (typically, $K \gg \max\{I, J\}$)
Output: Factor matrices $\mathbf{A} \in \mathbb{R}^{I \times R_1}$, $\mathbf{B} \in \mathbb{R}^{J \times R_2}$ and $\mathbf{G}_k \in \mathbb{R}^{R_1 \times R_2}$, with orthogonality constraints $\mathbf{A}^{\mathrm{T}}\mathbf{A} = \mathbf{I}_{R_1}$ and $\mathbf{B}^{\mathrm{T}}\mathbf{B} = \mathbf{I}_{R_2}$

1: for $k = 1$ to $K$ do
2:   Perform the truncated SVD, $\mathbf{X}_k = \mathbf{U}_k \mathbf{S}_k \mathbf{V}_k^{\mathrm{T}}$, using the $R$ largest singular values
3: end for
4: Construct the short and wide matrices $\mathbf{U} = [\mathbf{U}_1\mathbf{S}_1, \ldots, \mathbf{U}_K\mathbf{S}_K] \in \mathbb{R}^{I \times KR}$ and $\mathbf{V} = [\mathbf{V}_1\mathbf{S}_1, \ldots, \mathbf{V}_K\mathbf{S}_K] \in \mathbb{R}^{J \times KR}$
5: Perform the SVD (or QR) of the matrices $\mathbf{U}$ and $\mathbf{V}$, and obtain the common orthogonal matrices $\mathbf{A}$ and $\mathbf{B}$ as the left singular matrices of $\mathbf{U}$ and $\mathbf{V}$, respectively
6: for $k = 1$ to $K$ do
7:   Compute $\mathbf{G}_k = \mathbf{A}^{\mathrm{T}}\mathbf{X}_k\mathbf{B}$
8: end for


As previously mentioned, various constraints, including sparsity, nonnegativity or smoothness, can be imposed on the factor matrices, $\mathbf{A}$ and $\mathbf{B}$, to obtain physically meaningful and unique components.

A simple SVD/QR based algorithm for the PVD with orthogonality constraints is presented in Algorithm 7 (Crainiceanu et al., 2011; Constantine et al., 2014; Wang et al., 2016). It should be noted, however, that this algorithm does not provide an optimal solution in the sense of the absolute minimum of the cost function $\sum_{k=1}^{K} \|\mathbf{X}_k - \mathbf{A}\mathbf{G}_k\mathbf{B}^{\mathrm{T}}\|_F^2$, and for data corrupted by Gaussian noise, better performance can be achieved using the HOOI-2 given in Algorithm 4, for $N = 3$. An improved PVD algorithm, referred to as the Tucker-2 algorithm, is given in Algorithm 8 (Phan et al., 2016).


Figure 3.12: Concept of the Population Value Decomposition (PVD). (a) Principle of simultaneous multi-block matrix factorizations. (b) Equivalent representation of the PVD as the constrained or unconstrained Tucker-2 decomposition, $\mathbf{X} = \mathbf{G} \times_1 \mathbf{A} \times_2 \mathbf{B}$. The objective is to find the common factor matrices, $\mathbf{A}$, $\mathbf{B}$, and the core tensor, $\mathbf{G} \in \mathbb{R}^{R_1 \times R_2 \times K}$.



Algorithm 8: Orthogonal Tucker-2 decomposition with a prescribed approximation accuracy (Phan et al., 2016)

Input: A 3rd-order tensor $\mathbf{X} \in \mathbb{R}^{I \times J \times K}$ (typically, $K \gg \max\{I, J\}$) and estimation accuracy $\varepsilon$
Output: Orthogonal matrices $\mathbf{A} \in \mathbb{R}^{I \times R_1}$, $\mathbf{B} \in \mathbb{R}^{J \times R_2}$ and a core tensor $\mathbf{G} \in \mathbb{R}^{R_1 \times R_2 \times K}$, which satisfy the constraint $\|\mathbf{X} - \mathbf{G} \times_1 \mathbf{A} \times_2 \mathbf{B}\|_F^2 \leq \varepsilon^2$, s.t. $\mathbf{A}^{\mathrm{T}}\mathbf{A} = \mathbf{I}_{R_1}$ and $\mathbf{B}^{\mathrm{T}}\mathbf{B} = \mathbf{I}_{R_2}$

1: Initialize $\mathbf{A} = \mathbf{I}_I \in \mathbb{R}^{I \times I}$, $R_1 = I$
2: while not converged or iteration limit is not reached do
3:   Compute the tensor $\mathbf{Z}^{(1)} = \mathbf{X} \times_1 \mathbf{A}^{\mathrm{T}} \in \mathbb{R}^{R_1 \times J \times K}$
4:   Compute the EVD of the small matrix $\mathbf{Q}_1 = \mathbf{Z}^{(1)}_{(2)} \mathbf{Z}^{(1)\,\mathrm{T}}_{(2)} \in \mathbb{R}^{J \times J}$ as $\mathbf{Q}_1 = \mathbf{B}\,\mathrm{diag}(\lambda_1, \ldots, \lambda_{R_2})\,\mathbf{B}^{\mathrm{T}}$, such that $\sum_{r_2=1}^{R_2} \lambda_{r_2} \geq \|\mathbf{X}\|_F^2 - \varepsilon^2 \geq \sum_{r_2=1}^{R_2-1} \lambda_{r_2}$
5:   Compute the tensor $\mathbf{Z}^{(2)} = \mathbf{X} \times_2 \mathbf{B}^{\mathrm{T}} \in \mathbb{R}^{I \times R_2 \times K}$
6:   Compute the EVD of the small matrix $\mathbf{Q}_2 = \mathbf{Z}^{(2)}_{(1)} \mathbf{Z}^{(2)\,\mathrm{T}}_{(1)} \in \mathbb{R}^{I \times I}$ as $\mathbf{Q}_2 = \mathbf{A}\,\mathrm{diag}(\lambda_1, \ldots, \lambda_{R_1})\,\mathbf{A}^{\mathrm{T}}$, such that $\sum_{r_1=1}^{R_1} \lambda_{r_1} \geq \|\mathbf{X}\|_F^2 - \varepsilon^2 \geq \sum_{r_1=1}^{R_1-1} \lambda_{r_1}$
7: end while
8: Compute the core tensor $\mathbf{G} = \mathbf{X} \times_1 \mathbf{A}^{\mathrm{T}} \times_2 \mathbf{B}^{\mathrm{T}}$
9: return $\mathbf{A}$, $\mathbf{B}$ and $\mathbf{G}$.
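The following NumPy sketch follows the structure of Algorithm 8; the convergence test is replaced by a fixed iteration budget, and the rank selection is implemented via cumulative sums of the eigenvalues, so it should be read as an illustrative approximation of the algorithm rather than a reference implementation.

```python
import numpy as np

def unfold(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_n_product(T, M, mode):
    return np.moveaxis(np.tensordot(M, T, axes=(1, mode)), 0, mode)

def tucker2(X, eps, n_iter=10):
    """Orthogonal Tucker-2 with a prescribed accuracy eps (a fixed iteration
    budget replaces the convergence test of Algorithm 8)."""
    normX2 = np.linalg.norm(X) ** 2
    A = np.eye(X.shape[0])
    for _ in range(n_iter):
        # EVD of Q1 = Z1_(2) Z1_(2)^T, with Z1 = X x_1 A^T, to update B and R2
        Z1 = unfold(mode_n_product(X, A.T, 0), 1)
        lam, V = np.linalg.eigh(Z1 @ Z1.T)
        lam, V = lam[::-1], V[:, ::-1]
        R2 = int(np.searchsorted(np.cumsum(lam), normX2 - eps ** 2)) + 1
        B = V[:, :R2]
        # EVD of Q2 = Z2_(1) Z2_(1)^T, with Z2 = X x_2 B^T, to update A and R1
        Z2 = unfold(mode_n_product(X, B.T, 1), 0)
        lam, V = np.linalg.eigh(Z2 @ Z2.T)
        lam, V = lam[::-1], V[:, ::-1]
        R1 = int(np.searchsorted(np.cumsum(lam), normX2 - eps ** 2)) + 1
        A = V[:, :R1]
    G = mode_n_product(mode_n_product(X, A.T, 0), B.T, 1)
    return A, B, G

# Example on a tensor with exact Tucker-2 rank (3, 4)
rng = np.random.default_rng(6)
X = np.einsum('ir,js,rsk->ijk', rng.standard_normal((20, 3)),
              rng.standard_normal((30, 4)), rng.standard_normal((3, 4, 50)))
A, B, G = tucker2(X, eps=1e-6 * np.linalg.norm(X))
X_hat = mode_n_product(mode_n_product(G, A, 0), B, 1)
print(A.shape, B.shape, np.linalg.norm(X - X_hat) / np.linalg.norm(X))
```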

Linked MWCA. Consider the analysis of multi-modal high-dimensional data collected under the same or very similar conditions, for example, a set of EEG and MEG or EEG and fMRI signals recorded for different subjects over many trials and under the same experimental configurations and mental tasks. Such data share some common latent (hidden) components but can also have their own independent features. As a result, it is advantageous and natural to analyze such data in a linked way instead of treating them independently. In such a scenario, the PVD model can be generalized to multi-block matrix/tensor datasets (Cichocki, 2013a; Zhou et al., 2016a,b).

The linked multiway component analysis (LMWCA) for multi-block tensor data can therefore be formulated as a set of approximate simultaneous (joint) Tucker-$(1, N)$ decompositions of a set of data tensors, $\mathbf{X}^{(k)} \in \mathbb{R}^{I_1^{(k)} \times I_2^{(k)} \times \cdots \times I_N^{(k)}}$, with $I_1^{(k)} = I_1$ for $k = 1, 2, \ldots, K$, in the form (see Figure 3.13)

$$\mathbf{X}^{(k)} = \mathbf{G}^{(k)} \times_1 \mathbf{B}^{(1,k)}, \qquad (k = 1, 2, \ldots, K), \qquad (3.47)$$

where each factor (component) matrix, $\mathbf{B}^{(1,k)} = [\mathbf{B}_C^{(1)},\, \mathbf{B}_I^{(1,k)}] \in \mathbb{R}^{I_1 \times R_k}$, comprises two sets of components: (1) components $\mathbf{B}_C^{(1)} \in \mathbb{R}^{I_1 \times C}$ (with $0 \leq C \leq R_k$), $\forall k$, which are common for all the available blocks and correspond to identical or maximally correlated components, and (2) components $\mathbf{B}_I^{(1,k)} \in \mathbb{R}^{I_1 \times (R_k - C)}$, which are different independent processes for each block $k$; these can be, for example, latent variables independent of excitations or stimuli/tasks.


Figure 3.13: Linked Multiway Component Analysis (LMWCA) for coupled 3rd-order data tensors $\mathbf{X}^{(1)}, \ldots, \mathbf{X}^{(K)}$; these can have different dimensions in every mode, except for mode-1 for which the size is $I_1$ for all $\mathbf{X}^{(k)}$. Linked Tucker-1 decompositions are then performed in the form $\mathbf{X}^{(k)} = \mathbf{G}^{(k)} \times_1 \mathbf{B}^{(1,k)}$, where the partially correlated factor matrices are $\mathbf{B}^{(1,k)} = [\mathbf{B}_C^{(1)}, \mathbf{B}_I^{(1,k)}] \in \mathbb{R}^{I_1 \times R_k}$, $(k = 1, 2, \ldots, K)$. The objective is to find the common components, $\mathbf{B}_C^{(1)} \in \mathbb{R}^{I_1 \times C}$, and the individual components, $\mathbf{B}_I^{(1,k)} \in \mathbb{R}^{I_1 \times (R_k - C)}$, where $C \leq \min\{R_1, \ldots, R_K\}$ is the number of common components in mode-1.

The objective is therefore to estimate the common (strongly correlated) components, $\mathbf{B}_C^{(1)}$, and the statistically independent (individual) components, $\mathbf{B}_I^{(1,k)}$ (Cichocki, 2013a).


Figure 3.14: Conceptual models of generalized Linked Multiway Component Analysis (LMWCA) applied to the cores of high-order TNs. The objective is to find a suitable tensor decomposition which yields the maximum number of cores that are as correlated as possible. (a) Linked Tensor Train (TT) networks. (b) Linked Hierarchical Tucker (HT) networks, with the correlated cores indicated by ellipses in broken lines.


If $\mathbf{B}^{(n,k)} = \mathbf{B}_C^{(n)} \in \mathbb{R}^{I_n \times R_n}$ for a specific mode $n$ (in our case $n = 1$), and under the additional assumption that the block tensors are of the same order and size, the problem simplifies into generalized Common Component Analysis or tensor Population Value Decomposition (PVD), and can be solved by concatenating all data tensors along one mode, followed by constrained Tucker or CP decompositions (Phan and Cichocki, 2010).

In a more general scenario, when $C_n < R_n$, we can unfold each data tensor $\mathbf{X}^{(k)}$ in the common mode, and perform a set of simultaneous matrix factorizations, e.g., $\mathbf{X}^{(k)}_{(1)} \cong \mathbf{B}_C^{(1)} \mathbf{A}_C^{(1,k)} + \mathbf{B}_I^{(1,k)} \mathbf{A}_I^{(1,k)}$, through solving the constrained optimization problems

$$\min \sum_{k=1}^{K} \left\| \mathbf{X}^{(k)}_{(1)} - \mathbf{B}_C^{(1)} \mathbf{A}_C^{(1,k)} - \mathbf{B}_I^{(1,k)} \mathbf{A}_I^{(1,k)} \right\|_F + P(\mathbf{B}_C^{(1)}), \quad \text{s.t.} \quad \mathbf{B}_C^{(1)\,\mathrm{T}} \mathbf{B}_I^{(1,k)} = \mathbf{0} \;\; \forall k, \qquad (3.48)$$

where the symbol $P$ denotes penalty terms which impose additional constraints on the common components, $\mathbf{B}_C^{(1)}$, in order to extract as many common components as possible. In the special case of orthogonality constraints, the problem can be transformed into a generalized eigenvalue problem. The key point is to assume that the common factor submatrices, $\mathbf{B}_C^{(1)}$, are present in all data blocks and hence reflect structurally complex latent (hidden) and intrinsic links between the data blocks. In practice, the number of common components, $C$, is unknown and should be estimated (Zhou et al., 2016a).

The linked multiway component analysis (LMWCA) model complements currently available techniques for group component analysis and feature extraction from multi-block datasets, and is a natural extension of group ICA, PVD, and CCA/PLS methods (see (Cichocki, 2013a; Zhao et al., 2013a; Zhou et al., 2016a,b) and references therein). Moreover, the concept of LMWCA can be generalized to tensor networks, as illustrated in Figure 3.14.


3.7 Nonlinear Tensor Decompositions – Infinite Tucker

The Infinite Tucker model and its modification, the Distributed Infinite Tucker (DinTucker), generalize the standard Tucker decomposition to infinite-dimensional feature spaces using kernel and Bayesian approaches (Tang et al., 2013; Xu et al., 2012; Zhe et al., 2016).

Consider the classic Tucker-$N$ model of an $N$th-order tensor $\mathbf{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$, given by

$$\mathbf{X} = \mathbf{G} \times_1 \mathbf{B}^{(1)} \times_2 \mathbf{B}^{(2)} \cdots \times_N \mathbf{B}^{(N)} = \llbracket \mathbf{G};\, \mathbf{B}^{(1)}, \mathbf{B}^{(2)}, \ldots, \mathbf{B}^{(N)} \rrbracket \qquad (3.49)$$

in its vectorized version

$$\mathrm{vec}(\mathbf{X}) = \big(\mathbf{B}^{(1)} \otimes_L \cdots \otimes_L \mathbf{B}^{(N)}\big)\, \mathrm{vec}(\mathbf{G}).$$
Furthermore, assume that the noisy data tensor is modeled as

$$\mathbf{Y} = \mathbf{X} + \mathbf{E}, \qquad (3.50)$$

where $\mathbf{E}$ represents the tensor of additive Gaussian noise. Using the Bayesian framework and tensor-variate Gaussian processes (TGP) for Tucker decomposition, a standard normal prior can be assigned over each entry, $g_{r_1, r_2, \ldots, r_N}$, of the $N$th-order core tensor, $\mathbf{G} \in \mathbb{R}^{R_1 \times \cdots \times R_N}$, in order to marginalize out $\mathbf{G}$ and express the probability density function of the tensor $\mathbf{X}$ (Chu and Ghahramani, 2009; Xu et al., 2012; Zhe et al., 2016) in the form

$$p\big(\mathbf{X}\,|\,\mathbf{B}^{(1)}, \ldots, \mathbf{B}^{(N)}\big) = \mathcal{N}\big(\mathrm{vec}(\mathbf{X});\, \mathbf{0},\, \mathbf{C}^{(1)} \otimes_L \cdots \otimes_L \mathbf{C}^{(N)}\big) = \frac{\exp\!\left(-\tfrac{1}{2}\big\|\llbracket \mathbf{X};\, (\mathbf{C}^{(1)})^{-1/2}, \ldots, (\mathbf{C}^{(N)})^{-1/2} \rrbracket\big\|_F^2\right)}{(2\pi)^{I/2}\, \prod_{n=1}^{N} |\mathbf{C}^{(n)}|^{I/(2 I_n)}}, \qquad (3.51)$$

where $I = \prod_n I_n$ and $\mathbf{C}^{(n)} = \mathbf{B}^{(n)} \mathbf{B}^{(n)\,\mathrm{T}} \in \mathbb{R}^{I_n \times I_n}$ for $n = 1, 2, \ldots, N$.

In order to model unknown, complex, and potentially nonlinear interactions between the latent factors, each row, $\mathbf{b}^{(n)}_{i_n} \in \mathbb{R}^{1 \times R_n}$, within $\mathbf{B}^{(n)}$, is replaced by a nonlinear feature transformation $\Phi(\mathbf{b}^{(n)}_{i_n})$ using the kernel trick (Zhao et al., 2013b), whereby the nonlinear covariance matrix $\mathbf{C}^{(n)} = k(\mathbf{B}^{(n)}, \mathbf{B}^{(n)})$ replaces the standard covariance


matrix, $\mathbf{B}^{(n)}\mathbf{B}^{(n)\,\mathrm{T}}$. Using such a nonlinear feature mapping, the original Tucker factorization is performed in an infinite feature space, while Eq. (3.51) defines a Gaussian process (GP) on a tensor, called the Tensor-variate GP (TGP), where the inputs come from a set of factor matrices $\{\mathbf{B}^{(1)}, \ldots, \mathbf{B}^{(N)}\} = \{\mathbf{B}^{(n)}\}$.

For a noisy data tensor $\mathbf{Y}$, the joint probability density function is given by

$$p(\mathbf{Y}, \mathbf{X}, \{\mathbf{B}^{(n)}\}) = p(\{\mathbf{B}^{(n)}\})\, p(\mathbf{X}\,|\,\{\mathbf{B}^{(n)}\})\, p(\mathbf{Y}\,|\,\mathbf{X}). \qquad (3.52)$$

To improve scalability, the observed noisy tensor $\mathbf{Y}$ can be split into $K$ subtensors $\{\mathbf{Y}_1, \ldots, \mathbf{Y}_K\}$, whereby each subtensor $\mathbf{Y}_k$ is sampled from its own GP-based model with factor matrices $\{\mathbf{B}^{(n)}_k\} = \{\mathbf{B}^{(1)}_k, \ldots, \mathbf{B}^{(N)}_k\}$. The factor matrices can then be merged via a prior distribution

$$p(\{\mathbf{B}^{(n)}_k\}\,|\,\{\mathbf{B}^{(n)}\}) = \prod_{n=1}^{N} p(\mathbf{B}^{(n)}_k\,|\,\mathbf{B}^{(n)}) = \prod_{n=1}^{N} \mathcal{N}\big(\mathrm{vec}(\mathbf{B}^{(n)}_k)\,|\,\mathrm{vec}(\mathbf{B}^{(n)}),\, \lambda \mathbf{I}\big), \qquad (3.53)$$

where $\lambda > 0$ is a variance parameter which controls the similarity between the corresponding factor matrices. The above model is referred to as DinTucker (Zhe et al., 2016).

The full covariance matrix, $\mathbf{C}^{(1)} \otimes \cdots \otimes \mathbf{C}^{(N)} \in \mathbb{R}^{\prod_n I_n \times \prod_n I_n}$, may have a prohibitively large size and can be extremely sparse. For such cases, an alternative nonlinear tensor decomposition model has recently been developed, which does not, either explicitly or implicitly, exploit the Kronecker structure of covariance matrices (Cichocki et al., 2015a). Within this model, for each tensor entry, $x_{i_1, \ldots, i_N} = x_{\mathbf{i}}$, with $\mathbf{i} = (i_1, i_2, \ldots, i_N)$, an input vector $\mathbf{b}_{\mathbf{i}}$ is constructed by concatenating the corresponding row vectors of the factor (latent) matrices, $\mathbf{B}^{(n)}$, for all $N$ modes, as

$$\mathbf{b}_{\mathbf{i}} = [\mathbf{b}^{(1)}_{i_1}, \ldots, \mathbf{b}^{(N)}_{i_N}] \in \mathbb{R}^{1 \times \sum_{n=1}^{N} R_n}. \qquad (3.54)$$

We can formalize an (unknown) nonlinear transformation as

$$x_{\mathbf{i}} = f(\mathbf{b}_{\mathbf{i}}) = f\big([\mathbf{b}^{(1)}_{i_1}, \ldots, \mathbf{b}^{(N)}_{i_N}]\big), \qquad (3.55)$$


for which a zero-mean multivariate Gaussian distribution is determined by $\mathbf{B}_S = \{\mathbf{b}_{\mathbf{i}_1}, \ldots, \mathbf{b}_{\mathbf{i}_M}\}$ and $\mathbf{f}_S = \{f(\mathbf{b}_{\mathbf{i}_1}), \ldots, f(\mathbf{b}_{\mathbf{i}_M})\}$. This allows us to construct the following probability function

$$p\big(\mathbf{f}_S\,|\,\{\mathbf{B}^{(n)}\}\big) = \mathcal{N}\big(\mathbf{f}_S\,|\,\mathbf{0},\, k(\mathbf{B}_S, \mathbf{B}_S)\big), \qquad (3.56)$$

where $k(\cdot, \cdot)$ is a nonlinear covariance function which can be expressed as $k(\mathbf{b}_{\mathbf{i}}, \mathbf{b}_{\mathbf{j}}) = k\big([\mathbf{b}^{(1)}_{i_1}, \ldots, \mathbf{b}^{(N)}_{i_N}], [\mathbf{b}^{(1)}_{j_1}, \ldots, \mathbf{b}^{(N)}_{j_N}]\big)$, and $S = [\mathbf{i}_1, \ldots, \mathbf{i}_M]$.

In order to assign a standard normal prior over the factor matrices,

$\{\mathbf{B}^{(n)}\}$, we assume that for selected entries, $\mathbf{x} = [x_{\mathbf{i}_1}, \ldots, x_{\mathbf{i}_M}]$, of the tensor $\mathbf{X}$, the noisy entries, $\mathbf{y} = [y_{\mathbf{i}_1}, \ldots, y_{\mathbf{i}_M}]$, of the observed tensor $\mathbf{Y}$ are sampled from the following joint probability model

$$p(\mathbf{y}, \mathbf{x}, \{\mathbf{B}^{(n)}\}) = \prod_{n=1}^{N} \mathcal{N}\big(\mathrm{vec}(\mathbf{B}^{(n)})\,|\,\mathbf{0}, \mathbf{I}\big)\; \mathcal{N}\big(\mathbf{x}\,|\,\mathbf{0}, k(\mathbf{B}_S, \mathbf{B}_S)\big)\; \mathcal{N}\big(\mathbf{y}\,|\,\mathbf{x}, \beta^{-1}\mathbf{I}\big), \qquad (3.57)$$

where $\beta^{-1}$ represents the noise variance. These nonlinear and probabilistic models can potentially be applied to data tensors or function-related tensors comprising a large number of entries, typically with millions of non-zero entries and billions of zero entries. Even if only nonzero entries are used, exact inference of the above nonlinear tensor decomposition models may still be intractable. To alleviate this problem, a distributed variational inference algorithm has been developed, which is based on sparse GP, together with an efficient MapReduce framework which uses a small set of inducing points to break up the dependencies between random function values (Zhe et al., 2016; Titsias, 2009).
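To illustrate the construction in (3.54)–(3.56), the short sketch below concatenates the rows of the factor matrices into input vectors $\mathbf{b}_{\mathbf{i}}$ and assembles the GP covariance $k(\mathbf{B}_S, \mathbf{B}_S)$ for a small set of observed entries; the RBF kernel and all sizes are arbitrary illustrative choices, not prescriptions of the model.

```python
import numpy as np

def concat_features(factors, index):
    """Input vector b_i of Eq. (3.54): concatenated rows of the factor matrices
    selected by the multi-index i = (i_1, ..., i_N)."""
    return np.concatenate([B[i_n, :] for B, i_n in zip(factors, index)])

def rbf_kernel(b1, b2, gamma=0.5):
    # one possible choice of nonlinear covariance function k(., .)
    return np.exp(-gamma * np.sum((b1 - b2) ** 2))

rng = np.random.default_rng(4)
shape, ranks = (10, 11, 12), (3, 4, 2)
B = [rng.standard_normal((I, R)) for I, R in zip(shape, ranks)]

# A small set S of observed multi-indices and the GP covariance k(B_S, B_S)
S = [(0, 1, 2), (3, 4, 5), (9, 10, 11), (2, 0, 7)]
feats = [concat_features(B, i) for i in S]
K = np.array([[rbf_kernel(bi, bj) for bj in feats] for bi in feats])
print(K.shape, np.allclose(K, K.T))
```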


4 Tensor Train Decompositions: Graphical Interpretations and Algorithms

Efficient implementation of the various operations in tensor train (TT) formats requires compact and easy-to-understand mathematical and graphical representations (Cichocki, 2013b, 2014). To this end, we next present mathematical formulations of the TT decompositions and demonstrate their advantages in both theoretical and practical scenarios.

4.1 Tensor Train Decomposition – Matrix Product State

The tensor train (TT/MPS) representation of an $N$th-order data tensor, $\mathbf{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, can be described in several equivalent forms (see Figures 4.1, 4.2 and Table 4.1), listed below:

1. The entry-wise scalar form, given by

$$x_{i_1, i_2, \ldots, i_N} = \sum_{r_1, r_2, \ldots, r_{N-1}=1}^{R_1, R_2, \ldots, R_{N-1}} g^{(1)}_{1,\, i_1,\, r_1}\, g^{(2)}_{r_1,\, i_2,\, r_2} \cdots g^{(N)}_{r_{N-1},\, i_N,\, 1}. \qquad (4.1)$$

2. The slice representation (see Figure 2.19) in the form

$$x_{i_1, i_2, \ldots, i_N} = \mathbf{G}^{(1)}_{i_1}\, \mathbf{G}^{(2)}_{i_2} \cdots \mathbf{G}^{(N)}_{i_N}, \qquad (4.2)$$

where the slice matrices are defined as $\mathbf{G}^{(n)}_{i_n} = \mathbf{G}^{(n)}(:, i_n, :) \in \mathbb{R}^{R_{n-1} \times R_n}$, $i_n = 1, 2, \ldots, I_n$, with $\mathbf{G}^{(n)}_{i_n}$ being the $i_n$th lateral slice of the core $\mathbf{G}^{(n)} \in \mathbb{R}^{R_{n-1} \times I_n \times R_n}$, $n = 1, 2, \ldots, N$, and $R_0 = R_N = 1$ (a small numerical illustration of this slice-product evaluation is given at the end of this subsection).


Figure 4.1: TT decomposition of a 4th-order tensor, $\mathbf{X}$, for which the TT rank is $R_1 = 3$, $R_2 = 4$, $R_3 = 5$. (a) (Upper panel) Representation of the TT via a multilinear product of the cores, $\mathbf{X} = \mathbf{G}^{(1)} \times^1 \mathbf{G}^{(2)} \times^1 \mathbf{G}^{(3)} \times^1 \mathbf{G}^{(4)} = \langle\langle \mathbf{G}^{(1)}, \mathbf{G}^{(2)}, \mathbf{G}^{(3)}, \mathbf{G}^{(4)} \rangle\rangle$, and (lower panel) an equivalent representation via the outer product of mode-2 fibers (sum of rank-1 tensors) in the form $\mathbf{X} = \sum_{r_1=1}^{R_1} \sum_{r_2=1}^{R_2} \sum_{r_3=1}^{R_3} \sum_{r_4=1}^{R_4} \big(\mathbf{g}^{(1)}_{r_1} \circ \mathbf{g}^{(2)}_{r_1, r_2} \circ \mathbf{g}^{(3)}_{r_2, r_3} \circ \mathbf{g}^{(4)}_{r_3}\big)$. (b) TT decomposition in a vectorized form, represented via strong Kronecker products of block matrices, $\mathbf{x} = \tilde{\mathbf{G}}^{(1)} \,|\otimes|\, \tilde{\mathbf{G}}^{(2)} \,|\otimes|\, \tilde{\mathbf{G}}^{(3)} \,|\otimes|\, \tilde{\mathbf{G}}^{(4)} \in \mathbb{R}^{I_1 I_2 I_3 I_4}$, where the block matrices are defined as $\tilde{\mathbf{G}}^{(n)} \in \mathbb{R}^{R_{n-1} I_n \times R_n}$, with block vectors $\mathbf{g}^{(n)}_{r_{n-1}, r_n} \in \mathbb{R}^{I_n \times 1}$, $n = 1, \ldots, 4$, and $R_0 = R_4 = 1$.


Figure 4.2: TT/MPS decomposition of an $N$th-order data tensor, $\mathbf{X}$, for which the TT rank is $\{R_1, R_2, \ldots, R_{N-1}\}$. (a) Tensorization of a huge-scale vector, $\mathbf{x} \in \mathbb{R}^{I}$, into an $N$th-order tensor, $\mathbf{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$. (b) The data tensor can be represented exactly or approximately via a tensor train (TT/MPS), consisting of 3rd-order cores in the form $\mathbf{X} = \mathbf{G}^{(1)} \times^1 \mathbf{G}^{(2)} \times^1 \cdots \times^1 \mathbf{G}^{(N)} = \langle\langle \mathbf{G}^{(1)}, \mathbf{G}^{(2)}, \ldots, \mathbf{G}^{(N)} \rangle\rangle$, where $\mathbf{G}^{(n)} \in \mathbb{R}^{R_{n-1} \times I_n \times R_n}$ for $n = 1, 2, \ldots, N$, with $R_0 = R_N = 1$. (c) Equivalently, using strong Kronecker products, the TT tensor can be expressed in a vectorized form, $\mathbf{x} = \tilde{\mathbf{G}}^{(1)} |\otimes| \tilde{\mathbf{G}}^{(2)} |\otimes| \cdots |\otimes| \tilde{\mathbf{G}}^{(N)} \in \mathbb{R}^{I_1 I_2 \cdots I_N}$, where the block matrices are defined as $\tilde{\mathbf{G}}^{(n)} \in \mathbb{R}^{R_{n-1} I_n \times R_n}$, with blocks $\mathbf{g}^{(n)}_{r_{n-1}, r_n} \in \mathbb{R}^{I_n \times 1}$.


Table 4.1: Equivalent representations of the Tensor Train decomposition (MPS with open boundary conditions) approximating an $N$th-order tensor $\mathbf{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$. It is assumed that the TT rank is $\mathbf{r}_{TT} = \{R_1, R_2, \ldots, R_{N-1}\}$, with $R_0 = R_N = 1$.

Tensor representation: multilinear products of TT-cores
$$\mathbf{X} = \mathbf{G}^{(1)} \times^1 \mathbf{G}^{(2)} \times^1 \cdots \times^1 \mathbf{G}^{(N)} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N},$$
with the 3rd-order cores $\mathbf{G}^{(n)} \in \mathbb{R}^{R_{n-1} \times I_n \times R_n}$, $(n = 1, 2, \ldots, N)$.

Tensor representation: outer products
$$\mathbf{X} = \sum_{r_1, r_2, \ldots, r_{N-1}=1}^{R_1, R_2, \ldots, R_{N-1}} \mathbf{g}^{(1)}_{1, r_1} \circ \mathbf{g}^{(2)}_{r_1, r_2} \circ \cdots \circ \mathbf{g}^{(N-1)}_{r_{N-2}, r_{N-1}} \circ \mathbf{g}^{(N)}_{r_{N-1}, 1},$$
where $\mathbf{g}^{(n)}_{r_{n-1}, r_n} = \mathbf{G}^{(n)}(r_{n-1}, :, r_n) \in \mathbb{R}^{I_n}$ are fiber vectors.

Vector representation: strong Kronecker products
$$\mathbf{x} = \tilde{\mathbf{G}}^{(1)} |\otimes| \tilde{\mathbf{G}}^{(2)} |\otimes| \cdots |\otimes| \tilde{\mathbf{G}}^{(N)} \in \mathbb{R}^{I_1 I_2 \cdots I_N},$$
where $\tilde{\mathbf{G}}^{(n)} \in \mathbb{R}^{R_{n-1} I_n \times R_n}$ are block matrices with blocks $\mathbf{g}^{(n)}_{r_{n-1}, r_n} \in \mathbb{R}^{I_n}$.

Scalar representation
$$x_{i_1, i_2, \ldots, i_N} = \sum_{r_1, r_2, \ldots, r_{N-1}=1}^{R_1, R_2, \ldots, R_{N-1}} g^{(1)}_{1, i_1, r_1}\, g^{(2)}_{r_1, i_2, r_2} \cdots g^{(N-1)}_{r_{N-2}, i_{N-1}, r_{N-1}}\, g^{(N)}_{r_{N-1}, i_N, 1},$$
where $g^{(n)}_{r_{n-1}, i_n, r_n}$ are entries of a 3rd-order core $\mathbf{G}^{(n)} \in \mathbb{R}^{R_{n-1} \times I_n \times R_n}$.

Slice (MPS) representation
$$x_{i_1, i_2, \ldots, i_N} = \mathbf{G}^{(1)}_{i_1}\, \mathbf{G}^{(2)}_{i_2} \cdots \mathbf{G}^{(N)}_{i_N},$$
where $\mathbf{G}^{(n)}_{i_n} = \mathbf{G}^{(n)}(:, i_n, :) \in \mathbb{R}^{R_{n-1} \times R_n}$ are lateral slices of $\mathbf{G}^{(n)} \in \mathbb{R}^{R_{n-1} \times I_n \times R_n}$.


Table 4.2: Equivalent representations of the Tensor Chain (TC) decomposition (MPS with periodic boundary conditions) approximating an $N$th-order tensor $\mathbf{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$. It is assumed that the TC rank is $\mathbf{r}_{TC} = \{R_1, R_2, \ldots, R_{N-1}, R_N\}$.

Tensor representation: trace of multilinear products of cores
$$\mathbf{X} = \mathrm{Tr}\big(\mathbf{G}^{(1)} \times^1 \mathbf{G}^{(2)} \times^1 \cdots \times^1 \mathbf{G}^{(N)}\big) \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N},$$
with the 3rd-order cores $\mathbf{G}^{(n)} \in \mathbb{R}^{R_{n-1} \times I_n \times R_n}$, $R_0 = R_N$, $n = 1, 2, \ldots, N$.

Tensor/Vector representation: outer/Kronecker products
$$\mathbf{X} = \sum_{r_1, r_2, \ldots, r_N=1}^{R_1, R_2, \ldots, R_N} \mathbf{g}^{(1)}_{r_N, r_1} \circ \mathbf{g}^{(2)}_{r_1, r_2} \circ \cdots \circ \mathbf{g}^{(N)}_{r_{N-1}, r_N} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N},$$
$$\mathbf{x} = \sum_{r_1, r_2, \ldots, r_N=1}^{R_1, R_2, \ldots, R_N} \mathbf{g}^{(1)}_{r_N, r_1} \otimes_L \mathbf{g}^{(2)}_{r_1, r_2} \otimes_L \cdots \otimes_L \mathbf{g}^{(N)}_{r_{N-1}, r_N} \in \mathbb{R}^{I_1 I_2 \cdots I_N},$$
where $\mathbf{g}^{(n)}_{r_{n-1}, r_n} = \mathbf{G}^{(n)}(r_{n-1}, :, r_n) \in \mathbb{R}^{I_n}$ are fiber vectors.

Vector representation: strong Kronecker products
$$\mathbf{x} = \sum_{r_N=1}^{R_N} \big(\tilde{\mathbf{G}}^{(1)}_{r_N} |\otimes| \tilde{\mathbf{G}}^{(2)} |\otimes| \cdots |\otimes| \tilde{\mathbf{G}}^{(N-1)} |\otimes| \tilde{\mathbf{G}}^{(N)}_{r_N}\big) \in \mathbb{R}^{I_1 I_2 \cdots I_N},$$
where $\tilde{\mathbf{G}}^{(n)} \in \mathbb{R}^{R_{n-1} I_n \times R_n}$ are block matrices with blocks $\mathbf{g}^{(n)}_{r_{n-1}, r_n} \in \mathbb{R}^{I_n}$, $\tilde{\mathbf{G}}^{(1)}_{r_N} \in \mathbb{R}^{I_1 \times R_1}$ is a matrix with blocks (columns) $\mathbf{g}^{(1)}_{r_N, r_1} \in \mathbb{R}^{I_1}$, and $\tilde{\mathbf{G}}^{(N)}_{r_N} \in \mathbb{R}^{R_{N-1} I_N \times 1}$ is a block vector with blocks $\mathbf{g}^{(N)}_{r_{N-1}, r_N} \in \mathbb{R}^{I_N}$.

Scalar representations
$$x_{i_1, i_2, \ldots, i_N} = \mathrm{tr}\big(\mathbf{G}^{(1)}_{i_1}\, \mathbf{G}^{(2)}_{i_2} \cdots \mathbf{G}^{(N)}_{i_N}\big) = \sum_{r_N=1}^{R_N} \big(\mathbf{g}^{(1)\,\mathrm{T}}_{r_N, i_1, :}\, \mathbf{G}^{(2)}_{i_2} \cdots \mathbf{G}^{(N-1)}_{i_{N-1}}\, \mathbf{g}^{(N)}_{:, i_N, r_N}\big),$$
where $\mathbf{g}^{(1)}_{r_N, i_1, :} = \mathbf{G}^{(1)}(r_N, i_1, :) \in \mathbb{R}^{R_1}$ and $\mathbf{g}^{(N)}_{:, i_N, r_N} = \mathbf{G}^{(N)}(:, i_N, r_N) \in \mathbb{R}^{R_{N-1}}$.



3. The (global) tensor form, based on multilinear products (contractions) of cores (see Figure 4.1(a)), given by

$$\mathbf{X} = \mathbf{G}^{(1)} \times^1 \mathbf{G}^{(2)} \times^1 \cdots \times^1 \mathbf{G}^{(N-1)} \times^1 \mathbf{G}^{(N)} = \langle\langle \mathbf{G}^{(1)}, \mathbf{G}^{(2)}, \ldots, \mathbf{G}^{(N-1)}, \mathbf{G}^{(N)} \rangle\rangle, \qquad (4.3)$$

where the 3rd-order cores¹ $\mathbf{G}^{(n)} \in \mathbb{R}^{R_{n-1} \times I_n \times R_n}$, $n = 1, 2, \ldots, N$, and $R_0 = R_N = 1$ (see also Figure 4.2(b)).

4. The tensor form, expressed as a sum of rank-1 tensors (see Figure 4.1(a))

$$\mathbf{X} = \sum_{r_1, r_2, \ldots, r_{N-1}=1}^{R_1, R_2, \ldots, R_{N-1}} \mathbf{g}^{(1)}_{1, r_1} \circ \mathbf{g}^{(2)}_{r_1, r_2} \circ \cdots \circ \mathbf{g}^{(N-1)}_{r_{N-2}, r_{N-1}} \circ \mathbf{g}^{(N)}_{r_{N-1}, 1}, \qquad (4.4)$$
where $\mathbf{g}^{(n)}_{r_{n-1}, r_n} = \mathbf{G}^{(n)}(r_{n-1}, :, r_n) \in \mathbb{R}^{I_n}$ are mode-2 fibers, $n = 1, 2, \ldots, N$, and $R_0 = R_N = 1$.

5. A vector form, expressed by Kronecker products of the fibers

$$\mathbf{x} = \sum_{r_1, r_2, \ldots, r_{N-1}=1}^{R_1, R_2, \ldots, R_{N-1}} \mathbf{g}^{(1)}_{1, r_1} \otimes_L \mathbf{g}^{(2)}_{r_1, r_2} \otimes_L \cdots \otimes_L \mathbf{g}^{(N-1)}_{r_{N-2}, r_{N-1}} \otimes_L \mathbf{g}^{(N)}_{r_{N-1}, 1}, \qquad (4.5)$$
where $\mathbf{x} = \mathrm{vec}(\mathbf{X}) \in \mathbb{R}^{I_1 I_2 \cdots I_N}$.

6. An alternative vector form, produced by strong Kronecker products of block matrices (see Figure 4.1(b) and Figure 4.2(c)), given by

$$\mathbf{x} = \tilde{\mathbf{G}}^{(1)} |\otimes| \tilde{\mathbf{G}}^{(2)} |\otimes| \cdots |\otimes| \tilde{\mathbf{G}}^{(N)}, \qquad (4.6)$$
where the block matrices $\tilde{\mathbf{G}}^{(n)} \in \mathbb{R}^{R_{n-1} I_n \times R_n}$, for $n = 1, 2, \ldots, N$, consist of blocks $\mathbf{g}^{(n)}_{r_{n-1}, r_n} \in \mathbb{R}^{I_n \times 1}$, with $R_0 = R_N = 1$, and where the symbol $|\otimes|$ denotes the strong Kronecker product.

¹Note that the cores $\mathbf{G}^{(1)}$ and $\mathbf{G}^{(N)}$ are now two-dimensional arrays (matrices), but for a uniform representation, we assume that these matrices are treated as 3rd-order cores of sizes $1 \times I_1 \times R_1$ and $R_{N-1} \times I_N \times 1$, respectively.



Analogous relationships can be established for the Tensor Chain (i.e., MPS with PBC, see Figure 2.19(b)), as summarized in Table 4.2.
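The slice (matrix-product) representation (4.2) is illustrated by the following NumPy sketch, which evaluates a single entry of a TT tensor as a product of lateral slice matrices and checks it against a full contraction of the train; all core sizes are random illustrative choices.

```python
import numpy as np

def tt_entry(cores, index):
    """Evaluate one entry of a tensor in TT format via Eq. (4.2):
    x[i1,...,iN] = G1[:, i1, :] @ G2[:, i2, :] @ ... @ GN[:, iN, :],
    where core n has shape (R_{n-1}, I_n, R_n) and R_0 = R_N = 1."""
    out = np.ones((1, 1))
    for G, i_n in zip(cores, index):
        out = out @ G[:, i_n, :]          # product of lateral slice matrices
    return out.item()

# Random TT cores for a 4th-order tensor with TT ranks (3, 4, 5)
rng = np.random.default_rng(0)
shape, ranks = (6, 7, 8, 9), (1, 3, 4, 5, 1)
cores = [rng.standard_normal((ranks[n], shape[n], ranks[n + 1])) for n in range(4)]

# Reconstruct the full tensor by contracting the train, then compare one entry
X = cores[0]
for G in cores[1:]:
    X = np.tensordot(X, G, axes=(-1, 0))   # contract the shared TT rank index
X = X.reshape(shape)                        # drop the boundary ranks R0 = R4 = 1

print(tt_entry(cores, (1, 2, 3, 4)), X[1, 2, 3, 4])
```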

4.2 Matrix TT Decomposition – Matrix Product Operator

The matrix tensor train, also called the Matrix Product Operator with open boundary conditions (TT/MPO), is an important TN model which first represents huge-scale structured matrices, $\mathbf{X} \in \mathbb{R}^{I \times J}$, as $2N$th-order tensors, $\mathbf{X} \in \mathbb{R}^{I_1 \times J_1 \times I_2 \times J_2 \times \cdots \times I_N \times J_N}$, where $I = I_1 I_2 \cdots I_N$ and $J = J_1 J_2 \cdots J_N$ (see Figures 4.3, 4.4 and Table 4.3). The matrix TT/MPO then converts such a $2N$th-order tensor into a chain (train) of 4th-order cores². It should be noted that the matrix TT decomposition is equivalent to the vector TT, created by merging all index pairs $(i_n, j_n)$ into a single index ranging from 1 to $I_n J_n$, in a reverse lexicographic order.

Similarly to the vector TT decomposition, a large-scale $2N$th-order tensor, $\mathbf{X} \in \mathbb{R}^{I_1 \times J_1 \times I_2 \times J_2 \times \cdots \times I_N \times J_N}$, can be represented in the TT/MPO format via the following mathematical representations:

1. The scalar (entry-wise) form

$$x_{i_1, j_1, \ldots, i_N, j_N} = \sum_{r_1=1}^{R_1} \sum_{r_2=1}^{R_2} \cdots \sum_{r_{N-1}=1}^{R_{N-1}} g^{(1)}_{1, i_1, j_1, r_1}\, g^{(2)}_{r_1, i_2, j_2, r_2} \cdots g^{(N-1)}_{r_{N-2}, i_{N-1}, j_{N-1}, r_{N-1}}\, g^{(N)}_{r_{N-1}, i_N, j_N, 1}. \qquad (4.7)$$

2. The slice representation

$$x_{i_1, j_1, \ldots, i_N, j_N} = \mathbf{G}^{(1)}_{i_1, j_1}\, \mathbf{G}^{(2)}_{i_2, j_2} \cdots \mathbf{G}^{(N)}_{i_N, j_N}, \qquad (4.8)$$

where $\mathbf{G}^{(n)}_{i_n, j_n} = \mathbf{G}^{(n)}(:, i_n, j_n, :) \in \mathbb{R}^{R_{n-1} \times R_n}$ are slices of the cores $\mathbf{G}^{(n)} \in \mathbb{R}^{R_{n-1} \times I_n \times J_n \times R_n}$, $n = 1, 2, \ldots, N$, and $R_0 = R_N = 1$.

²The cores $\mathbf{G}^{(1)}$ and $\mathbf{G}^{(N)}$ are in fact three-dimensional arrays; however, for a uniform representation, we treat them as 4th-order cores of sizes $1 \times I_1 \times J_1 \times R_1$ and $R_{N-1} \times I_N \times J_N \times 1$, respectively.


3. The compact tensor form, based on multilinear products (Figure 4.4(b))

$$\mathbf{X} = \mathbf{G}^{(1)} \times^1 \mathbf{G}^{(2)} \times^1 \cdots \times^1 \mathbf{G}^{(N)} = \langle\langle \mathbf{G}^{(1)}, \mathbf{G}^{(2)}, \ldots, \mathbf{G}^{(N)} \rangle\rangle, \qquad (4.9)$$
where the TT-cores are defined as $\mathbf{G}^{(n)} \in \mathbb{R}^{R_{n-1} \times I_n \times J_n \times R_n}$, $n = 1, 2, \ldots, N$, and $R_0 = R_N = 1$.

4. A matrix form, based on strong Kronecker products of block matrices (Figures 4.3(b) and 4.4(c))

$$\mathbf{X} = \tilde{\mathbf{G}}^{(1)} |\otimes| \tilde{\mathbf{G}}^{(2)} |\otimes| \cdots |\otimes| \tilde{\mathbf{G}}^{(N)} \in \mathbb{R}^{I_1 \cdots I_N \times J_1 \cdots J_N}, \qquad (4.10)$$
where $\tilde{\mathbf{G}}^{(n)} \in \mathbb{R}^{R_{n-1} I_n \times R_n J_n}$ are block matrices with blocks $\mathbf{G}^{(n)}_{r_{n-1}, r_n} \in \mathbb{R}^{I_n \times J_n}$, and the number of blocks is $R_{n-1} \times R_n$. In the special case when the TT ranks $R_n = 1$, $\forall n$, the strong Kronecker products simplify into standard (left) Kronecker products.

The strong Kronecker product representation of a TT is probably the most comprehensive and useful form for displaying tensor trains in their vector/matrix form, since it allows us to perform many operations using relatively small block matrices.

Example. For two matrices (in the TT format) expressed via strong Kronecker products, $\mathbf{A} = \mathbf{A}^{(1)} |\otimes| \mathbf{A}^{(2)} |\otimes| \cdots |\otimes| \mathbf{A}^{(N)}$ and $\mathbf{B} = \mathbf{B}^{(1)} |\otimes| \mathbf{B}^{(2)} |\otimes| \cdots |\otimes| \mathbf{B}^{(N)}$, their Kronecker product can be efficiently computed as $\mathbf{A} \otimes_L \mathbf{B} = \mathbf{A}^{(1)} |\otimes| \cdots |\otimes| \mathbf{A}^{(N)} |\otimes| \mathbf{B}^{(1)} |\otimes| \cdots |\otimes| \mathbf{B}^{(N)}$. Furthermore, if the matrices $\mathbf{A}$ and $\mathbf{B}$ have the same mode sizes³, then their linear combination, $\mathbf{C} = \alpha\mathbf{A} + \beta\mathbf{B}$, can be compactly expressed as (Oseledets, 2011; Kazeev et al., 2013a,b)

$$\mathbf{C} = \big[\mathbf{A}^{(1)}\;\; \mathbf{B}^{(1)}\big] \;|\otimes|\; \begin{bmatrix} \mathbf{A}^{(2)} & \mathbf{0} \\ \mathbf{0} & \mathbf{B}^{(2)} \end{bmatrix} \;|\otimes|\; \cdots \;|\otimes|\; \begin{bmatrix} \mathbf{A}^{(N-1)} & \mathbf{0} \\ \mathbf{0} & \mathbf{B}^{(N-1)} \end{bmatrix} \;|\otimes|\; \begin{bmatrix} \alpha\mathbf{A}^{(N)} \\ \beta\mathbf{B}^{(N)} \end{bmatrix}.$$

Consider its reshaped tensor $\mathbf{C} = \langle\langle \mathbf{C}^{(1)}, \mathbf{C}^{(2)}, \ldots, \mathbf{C}^{(N)} \rangle\rangle$ in the TT format; then its cores $\mathbf{C}^{(n)} \in \mathbb{R}^{R_{n-1} \times I_n \times J_n \times R_n}$, $n = 1, 2, \ldots, N$, can be

³Note that, while the original matrices $\mathbf{A} \in \mathbb{R}^{I_1 \cdots I_N \times J_1 \cdots J_N}$ and $\mathbf{B} \in \mathbb{R}^{I_1 \cdots I_N \times J_1 \cdots J_N}$ must have the same mode sizes, the corresponding core tensors, $\mathbf{A}^{(n)} \in \mathbb{R}^{R^A_{n-1} \times I_n \times J_n \times R^A_n}$ and $\mathbf{B}^{(n)} \in \mathbb{R}^{R^B_{n-1} \times I_n \times J_n \times R^B_n}$, may have arbitrary TT rank sizes.


Figure 4.3: TT/MPO decomposition of a matrix, $\mathbf{X} \in \mathbb{R}^{I \times J}$, reshaped as an 8th-order tensor, $\mathbf{X} \in \mathbb{R}^{I_1 \times J_1 \times \cdots \times I_4 \times J_4}$, where $I = I_1 I_2 I_3 I_4$ and $J = J_1 J_2 J_3 J_4$. (a) Basic TT representation via multilinear products (tensor contractions) of cores, $\mathbf{X} = \mathbf{G}^{(1)} \times^1 \mathbf{G}^{(2)} \times^1 \mathbf{G}^{(3)} \times^1 \mathbf{G}^{(4)}$, with $\mathbf{G}^{(n)} \in \mathbb{R}^{R_{n-1} \times I_n \times J_n \times R_n}$ for $R_1 = 3$, $R_2 = 4$, $R_3 = 5$, $R_0 = R_4 = 1$. (b) Representation of a matrix or a matricized tensor via strong Kronecker products of block matrices, in the form $\mathbf{X} = \tilde{\mathbf{G}}^{(1)} |\otimes| \tilde{\mathbf{G}}^{(2)} |\otimes| \tilde{\mathbf{G}}^{(3)} |\otimes| \tilde{\mathbf{G}}^{(4)} \in \mathbb{R}^{I_1 I_2 I_3 I_4 \times J_1 J_2 J_3 J_4}$.


Figure 4.4: Representation of huge matrices by “linked” block matrices. (a) Tensorization of a huge-scale matrix, $\mathbf{X} \in \mathbb{R}^{I \times J}$, into a $2N$th-order tensor, $\mathbf{X} \in \mathbb{R}^{I_1 \times J_1 \times \cdots \times I_N \times J_N}$. (b) The TT/MPO decomposition of a huge matrix, $\mathbf{X}$, expressed by 4th-order cores, $\mathbf{G}^{(n)} \in \mathbb{R}^{R_{n-1} \times I_n \times J_n \times R_n}$. (c) An alternative graphical representation of the matrix $\mathbf{X} \in \mathbb{R}^{I_1 I_2 \cdots I_N \times J_1 J_2 \cdots J_N}$ via strong Kronecker products of block matrices, $\tilde{\mathbf{G}}^{(n)} \in \mathbb{R}^{R_{n-1} I_n \times R_n J_n}$, for $n = 1, 2, \ldots, N$, with $R_0 = R_N = 1$.


Table 4.3: Equivalent forms of the matrix Tensor Train decomposition (MPO with open boundary conditions) for a $2N$th-order tensor $\mathbf{X} \in \mathbb{R}^{I_1 \times J_1 \times I_2 \times J_2 \times \cdots \times I_N \times J_N}$. It is assumed that the TT rank is $\{R_1, R_2, \ldots, R_{N-1}\}$, with $R_0 = R_N = 1$.

Tensor representation: multilinear products (tensor contractions)
$$\mathbf{X} = \mathbf{G}^{(1)} \times^1 \mathbf{G}^{(2)} \times^1 \cdots \times^1 \mathbf{G}^{(N-1)} \times^1 \mathbf{G}^{(N)},$$
with 4th-order cores $\mathbf{G}^{(n)} \in \mathbb{R}^{R_{n-1} \times I_n \times J_n \times R_n}$, $(n = 1, 2, \ldots, N)$.

Tensor representation: outer products
$$\mathbf{X} = \sum_{r_1, r_2, \ldots, r_{N-1}=1}^{R_1, R_2, \ldots, R_{N-1}} \mathbf{G}^{(1)}_{1, r_1} \circ \mathbf{G}^{(2)}_{r_1, r_2} \circ \cdots \circ \mathbf{G}^{(N-1)}_{r_{N-2}, r_{N-1}} \circ \mathbf{G}^{(N)}_{r_{N-1}, 1},$$
where $\mathbf{G}^{(n)}_{r_{n-1}, r_n} \in \mathbb{R}^{I_n \times J_n}$ are blocks of $\tilde{\mathbf{G}}^{(n)} \in \mathbb{R}^{R_{n-1} I_n \times R_n J_n}$.

Matrix representation: strong Kronecker products
$$\mathbf{X} = \tilde{\mathbf{G}}^{(1)} |\otimes| \tilde{\mathbf{G}}^{(2)} |\otimes| \cdots |\otimes| \tilde{\mathbf{G}}^{(N)} \in \mathbb{R}^{I_1 \cdots I_N \times J_1 \cdots J_N},$$
where $\tilde{\mathbf{G}}^{(n)} \in \mathbb{R}^{R_{n-1} I_n \times R_n J_n}$ are block matrices with blocks $\mathbf{G}^{(n)}(r_{n-1}, :, :, r_n)$.

Scalar representation
$$x_{i_1, j_1, i_2, j_2, \ldots, i_N, j_N} = \sum_{r_1, r_2, \ldots, r_{N-1}=1}^{R_1, R_2, \ldots, R_{N-1}} g^{(1)}_{1, i_1, j_1, r_1}\, g^{(2)}_{r_1, i_2, j_2, r_2} \cdots g^{(N)}_{r_{N-1}, i_N, j_N, 1},$$
where $g^{(n)}_{r_{n-1}, i_n, j_n, r_n}$ are entries of a 4th-order core $\mathbf{G}^{(n)} \in \mathbb{R}^{R_{n-1} \times I_n \times J_n \times R_n}$.

Slice (MPS) representation
$$x_{i_1, j_1, i_2, j_2, \ldots, i_N, j_N} = \mathbf{G}^{(1)}_{i_1, j_1}\, \mathbf{G}^{(2)}_{i_2, j_2} \cdots \mathbf{G}^{(N)}_{i_N, j_N},$$
where $\mathbf{G}^{(n)}_{i_n, j_n} = \mathbf{G}^{(n)}(:, i_n, j_n, :) \in \mathbb{R}^{R_{n-1} \times R_n}$ are slices of $\mathbf{G}^{(n)} \in \mathbb{R}^{R_{n-1} \times I_n \times J_n \times R_n}$.


expressed through their unfolding matrices, $\mathbf{C}^{(n)}_{\langle n \rangle} \in \mathbb{R}^{R_{n-1} I_n \times R_n J_n}$, or equivalently by the lateral slices, $\mathbf{C}^{(n)}_{i_n, j_n} \in \mathbb{R}^{R_{n-1} \times R_n}$, as follows
$$\mathbf{C}^{(n)}_{i_n, j_n} = \begin{bmatrix} \mathbf{A}^{(n)}_{i_n, j_n} & \mathbf{0} \\ \mathbf{0} & \mathbf{B}^{(n)}_{i_n, j_n} \end{bmatrix}, \quad n = 2, 3, \ldots, N-1, \qquad (4.11)$$

while for the border cores

$$\mathbf{C}^{(1)}_{i_1, j_1} = \big[\mathbf{A}^{(1)}_{i_1, j_1} \;\; \mathbf{B}^{(1)}_{i_1, j_1}\big], \qquad \mathbf{C}^{(N)}_{i_N, j_N} = \begin{bmatrix} \alpha \mathbf{A}^{(N)}_{i_N, j_N} \\ \beta \mathbf{B}^{(N)}_{i_N, j_N} \end{bmatrix} \qquad (4.12)$$

for $i_n = 1, 2, \ldots, I_n$, $j_n = 1, 2, \ldots, J_n$, $n = 1, 2, \ldots, N$.

Note that the various mathematical and graphical representations of TT/MPS and TT/MPO can be used interchangeably for different purposes or applications. With these representations, all basic mathematical operations in the TT format can be performed on the constitutive block matrices, even without the need to explicitly construct the core tensors (Oseledets, 2011; Dolgov, 2014).
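The core-wise construction (4.11)–(4.12) of a linear combination of two matrices in TT/MPO format is illustrated by the following NumPy sketch; the helper functions mpo_to_matrix and mpo_sum are our own illustrative names, and the full-matrix contraction is used only to verify the result on a tiny example.

```python
import numpy as np

def mpo_to_matrix(cores):
    """Contract TT/MPO cores (each of shape (R_{n-1}, I_n, J_n, R_n)) into the
    full matrix of size (I_1...I_N) x (J_1...J_N)."""
    T = cores[0]
    for G in cores[1:]:
        T = np.tensordot(T, G, axes=(-1, 0))
    T = T.reshape(T.shape[1:-1])                   # drop boundary ranks (= 1)
    N = len(cores)
    T = T.transpose([2 * n for n in range(N)] + [2 * n + 1 for n in range(N)])
    I = np.prod([c.shape[1] for c in cores])
    J = np.prod([c.shape[2] for c in cores])
    return T.reshape(I, J)

def mpo_sum(A_cores, B_cores, alpha=1.0, beta=1.0):
    """Cores of C = alpha*A + beta*B via Eqs. (4.11)-(4.12)."""
    N = len(A_cores)
    C_cores = []
    for n, (A, B) in enumerate(zip(A_cores, B_cores)):
        Ra0, I, J, Ra1 = A.shape
        Rb0, _, _, Rb1 = B.shape
        if n == 0:
            C = np.concatenate([A, B], axis=3)                 # [A^(1)  B^(1)]
        elif n == N - 1:
            C = np.concatenate([alpha * A, beta * B], axis=0)  # stacked last core
        else:
            C = np.zeros((Ra0 + Rb0, I, J, Ra1 + Rb1))
            C[:Ra0, :, :, :Ra1] = A                            # block-diagonal slices
            C[Ra0:, :, :, Ra1:] = B
        C_cores.append(C)
    return C_cores

# Random MPO cores for two (2x2x2) x (2x2x2) matrices (N = 3), with TT ranks 2 and 3
rng = np.random.default_rng(5)
A_cores = [rng.standard_normal(s) for s in [(1, 2, 2, 2), (2, 2, 2, 2), (2, 2, 2, 1)]]
B_cores = [rng.standard_normal(s) for s in [(1, 2, 2, 3), (3, 2, 2, 3), (3, 2, 2, 1)]]

C_cores = mpo_sum(A_cores, B_cores, alpha=2.0, beta=-0.5)
lhs = mpo_to_matrix(C_cores)
rhs = 2.0 * mpo_to_matrix(A_cores) - 0.5 * mpo_to_matrix(B_cores)
print(np.allclose(lhs, rhs))   # True
```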

Remark. In the TT/MPO paradigm, compression of large matrices is not performed by global (standard) low-rank matrix approximations, but by low-rank approximations of block matrices (submatrices) arranged in a hierarchical (linked) fashion. However, to achieve a low-rank TT, and consequently a good compression ratio, the ranks of all the corresponding unfolding matrices of a specific structured data tensor must be low, i.e., their singular values must decrease rapidly to zero. While this is true for many structured matrices, unfortunately, in general, this assumption does not hold.

4.3 Links Between CP, BTD Formats and TT/TC Formats

It is important to note that any specific TN format can be converted into the TT format. This very useful property is next illustrated for two simple but important cases, which establish links between the CP and TT formats and between the BTD and TT formats.


Figure 4.5: Links between the TT format and other tensor network formats. (a) Representation of the CP decomposition of an $N$th-order tensor, $\mathbf{X} = \mathbf{I} \times_1 \mathbf{A}^{(1)} \times_2 \mathbf{A}^{(2)} \cdots \times_N \mathbf{A}^{(N)}$, in the TT format. (b) Representation of the BTD model given by Eqs. (4.15) and (4.16) in the TT/MPO format. Observe that the TT-cores are very sparse and the TT ranks are $\{R, R, \ldots, R\}$. Similar relationships can be established straightforwardly for the TC format.


1. A tensor in the CP format, given by

$$\mathbf{X} = \sum_{r=1}^{R} \mathbf{a}^{(1)}_r \circ \mathbf{a}^{(2)}_r \circ \cdots \circ \mathbf{a}^{(N)}_r, \qquad (4.13)$$

can be straightforwardly converted into the TT/MPS format as follows. Since each of the $R$ rank-1 tensors can be represented in the TT format with TT rank $(1, 1, \ldots, 1)$, using formulas (4.11) and (4.12), we have

$$\mathbf{X} = \sum_{r=1}^{R} \langle\langle \mathbf{a}^{(1)\,\mathrm{T}}_r, \mathbf{a}^{(2)\,\mathrm{T}}_r, \ldots, \mathbf{a}^{(N)\,\mathrm{T}}_r \rangle\rangle = \langle\langle \mathbf{G}^{(1)}, \mathbf{G}^{(2)}, \ldots, \mathbf{G}^{(N-1)}, \mathbf{G}^{(N)} \rangle\rangle, \qquad (4.14)$$
where the TT-cores $\mathbf{G}^{(n)} \in \mathbb{R}^{R \times I_n \times R}$ have diagonal lateral slices $\mathbf{G}^{(n)}(:, i_n, :) = \mathbf{G}^{(n)}_{i_n} = \mathrm{diag}(a_{i_n, 1}, a_{i_n, 2}, \ldots, a_{i_n, R}) \in \mathbb{R}^{R \times R}$ for $n = 2, 3, \ldots, N-1$, while $\mathbf{G}^{(1)} = \mathbf{A}^{(1)} \in \mathbb{R}^{I_1 \times R}$ and $\mathbf{G}^{(N)} = \mathbf{A}^{(N)\,\mathrm{T}} \in \mathbb{R}^{R \times I_N}$ (see Figure 4.5(a) and the code sketch at the end of this subsection).

2. A more general Block Term Decomposition (BTD) for a 2Nth-order data tensor

$$\mathbf{X} = \sum_{r=1}^{R} \big(\mathbf{A}^{(1)}_r \circ \mathbf{A}^{(2)}_r \circ \cdots \circ \mathbf{A}^{(N)}_r\big) \in \mathbb{R}^{I_1 \times J_1 \times \cdots \times I_N \times J_N}, \qquad (4.15)$$

with full rank matrices, $\mathbf{A}^{(n)}_r \in \mathbb{R}^{I_n \times J_n}$, $\forall r$, can be converted into a matrix TT/MPO format, as illustrated in Figure 4.5(b). Note that (4.15) can be expressed in a matricized (unfolding) form via strong Kronecker products of block diagonal matrices (see formulas (4.11)), given by

$$\mathbf{X} = \sum_{r=1}^{R} \big(\mathbf{A}^{(1)}_r \otimes_L \mathbf{A}^{(2)}_r \otimes_L \cdots \otimes_L \mathbf{A}^{(N)}_r\big) = \tilde{\mathbf{G}}^{(1)} |\otimes| \tilde{\mathbf{G}}^{(2)} |\otimes| \cdots |\otimes| \tilde{\mathbf{G}}^{(N)} \in \mathbb{R}^{I_1 \cdots I_N \times J_1 \cdots J_N}, \qquad (4.16)$$

with the TT rank $R_n = R$ for $n = 1, 2, \ldots, N-1$, and the block diagonal matrices, $\tilde{\mathbf{G}}^{(n)} = \mathrm{diag}(\mathbf{A}^{(n)}_1, \mathbf{A}^{(n)}_2, \ldots, \mathbf{A}^{(n)}_R) \in \mathbb{R}^{R I_n \times R J_n}$, for $n = 2, 3, \ldots, N-1$, while $\tilde{\mathbf{G}}^{(1)} = [\mathbf{A}^{(1)}_1, \mathbf{A}^{(1)}_2, \ldots, \mathbf{A}^{(1)}_R] \in$


$\mathbb{R}^{I_1 \times R J_1}$ is a row block matrix, and $\tilde{\mathbf{G}}^{(N)} = \big[\mathbf{A}^{(N)\,\mathrm{T}}_1, \ldots, \mathbf{A}^{(N)\,\mathrm{T}}_R\big]^{\mathrm{T}} \in \mathbb{R}^{R I_N \times J_N}$

a column block matrix (see Figure 4.5(b)).

Several algorithms exist for decompositions in the forms (4.15) and (4.16) (Salmi et al., 2009; Batselier et al., 2015; Batselier and Wong, 2015). In this way, TT/MPO decompositions for huge-scale structured matrices can be constructed indirectly.
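The CP-to-TT conversion (4.14) is illustrated by the following NumPy sketch, which builds TT cores with diagonal lateral slices from the CP factor matrices and verifies the result against the full CP tensor on a small random example; the helper name cp_to_tt and all sizes are illustrative.

```python
import numpy as np

def cp_to_tt(factors):
    """Convert a CP model X = sum_r a_r^(1) o ... o a_r^(N) (factor matrices of
    size I_n x R) into TT cores with TT ranks (R, ..., R), following Eq. (4.14)."""
    N = len(factors)
    R = factors[0].shape[1]
    cores = [factors[0][None, :, :]]                 # G^(1): 1 x I_1 x R
    for A in factors[1:-1]:
        G = np.zeros((R, A.shape[0], R))
        for i in range(A.shape[0]):
            G[:, i, :] = np.diag(A[i, :])            # diagonal lateral slices
        cores.append(G)
    cores.append(factors[-1].T[:, :, None])          # G^(N): R x I_N x 1
    return cores

# Quick check against the full CP tensor
rng = np.random.default_rng(1)
shape, R = (4, 5, 6, 7), 3
A = [rng.standard_normal((I, R)) for I in shape]

X_cp = np.zeros(shape)
for r in range(R):
    X_cp += np.einsum('i,j,k,l->ijkl', A[0][:, r], A[1][:, r], A[2][:, r], A[3][:, r])

cores = cp_to_tt(A)
X_tt = cores[0]
for G in cores[1:]:
    X_tt = np.tensordot(X_tt, G, axes=(-1, 0))
X_tt = X_tt.reshape(shape)

print(np.allclose(X_cp, X_tt))   # True
```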

4.4 Quantized Tensor Train (QTT) – Blessing of Dimensionality

The procedure of creating a higher-order tensor from lower-order original data is referred to as tensorization, while in the special case where each mode has a very small size, 2, 3 or 4, it is referred to as quantization. In addition to vectors and matrices, lower-order tensors can also be reshaped into higher-order tensors. By virtue of quantization, low-rank TN approximations with high compression ratios can be obtained, which is not possible to achieve with the original raw data formats (Oseledets, 2010; Khoromskij, 2011a).

Therefore, quantization can be considered as a special form of tensorization where the size of each mode is very small, typically 2 or 3. The concept of quantized tensor networks (QTN) was first proposed in (Oseledets, 2010) and (Khoromskij, 2011a), whereby small-size 3rd-order cores are sparsely interconnected via tensor contractions. The so obtained model often provides an efficient, highly compressed, and low-rank representation of a data tensor, and helps to mitigate the curse of dimensionality, as illustrated below.

Example. The quantization of a huge vector, $\mathbf{x} \in \mathbb{R}^{I}$, $I = 2^K$, can be achieved through reshaping to give a $(2 \times 2 \times \cdots \times 2)$ tensor $\mathbf{X}$ of order $K$, as illustrated in Figure 4.6. For structured data, such a quantized tensor, $\mathbf{X}$, often admits a low-rank TN approximation, so that a good compression of a huge vector $\mathbf{x}$ can be achieved by enforcing the maximum possible low-rank structure on the tensor network.


Figure 4.6: Concept of tensorization/quantization of a large-scale vector into a higher-order quantized tensor. In order to achieve a good compression ratio, we need to apply a suitable tensor decomposition, such as the quantized TT (QTT) using 3rd-order cores, $\mathbf{X} \cong \mathbf{G}^{(1)} \times^1 \mathbf{G}^{(2)} \times^1 \cdots \times^1 \mathbf{G}^{(6)}$.

Even more generally, an $N$th-order tensor, $\mathbf{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$, with $I_n = q^{K_n}$, can be quantized in all modes simultaneously to yield a $(q \times q \times \cdots \times q)$ quantized tensor of higher order and with a small value of $q$.
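The following NumPy sketch illustrates quantization of a vector of length $2^K$ into a $K$th-order $(2 \times 2 \times \cdots \times 2)$ tensor, followed by a simplified TT-SVD compression; the sample signal, the accuracy threshold and the tt_svd helper are illustrative assumptions, and the truncation rule follows the standard TT-SVD recipe only in a simplified form.

```python
import numpy as np

def tt_svd(T, eps=1e-10):
    """Simplified TT-SVD: sequential truncated SVDs of the unfoldings.
    Returns TT cores G^(n) of shape (R_{n-1}, I_n, R_n)."""
    shape = T.shape
    delta = eps / np.sqrt(max(len(shape) - 1, 1)) * np.linalg.norm(T)
    cores, r_prev, M = [], 1, T.reshape(shape[0], -1)
    for I_n in shape[:-1]:
        M = M.reshape(r_prev * I_n, -1)
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        # smallest rank whose discarded tail energy is below delta
        r = max(1, int(np.sum(np.sqrt(np.cumsum(s[::-1] ** 2))[::-1] > delta)))
        cores.append(U[:, :r].reshape(r_prev, I_n, r))
        M = s[:r, None] * Vt[:r]            # carry the remainder forward
        r_prev = r
    cores.append(M.reshape(r_prev, shape[-1], 1))
    return cores

# Quantization: a vector of length 2^K, sampled from a smooth signal,
# reshaped into a K-th order (2 x 2 x ... x 2) tensor
K = 16
t = np.linspace(0.0, 1.0, 2 ** K)
x = np.exp(-t) * np.sin(20 * t)             # structured (smooth) signal
X_q = x.reshape([2] * K)                    # quantized tensor

cores = tt_svd(X_q, eps=1e-8)
qtt_ranks = [G.shape[2] for G in cores[:-1]]
n_params = sum(G.size for G in cores)
print("QTT ranks:", qtt_ranks)
print("compression: %d -> %d parameters" % (x.size, n_params))
```

For such smooth, exponentially modulated signals the QTT ranks stay very small, which is precisely the super-compression effect discussed below.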

Example. Since large-scale tensors (even of low order) cannot be loaded directly into computer memory, our approach to this problem is to represent the huge-scale data by tensor networks in a distributed and compressed TT format, so as to avoid the explicit requirement for unfeasibly large computer memory.

In the example shown in Figure 4.7, the tensor train of a huge 3rd-order tensor is expressed by the strong Kronecker products of block tensors with relatively small 3rd-order tensor blocks. The QTT is mathematically represented in a distributed form via strong Kronecker products of block 5th-order tensors. Recall that the strong Kronecker product of two block core tensors, $\tilde{\mathbf{G}}^{(n)} \in \mathbb{R}^{R_{n-1} I_n \times R_n J_n \times K_n}$ and $\tilde{\mathbf{G}}^{(n+1)} \in \mathbb{R}^{R_n I_{n+1} \times R_{n+1} J_{n+1} \times K_{n+1}}$, is defined as the block tensor, $\mathbf{C} = \tilde{\mathbf{G}}^{(n)} |\otimes| \tilde{\mathbf{G}}^{(n+1)} \in \mathbb{R}^{R_{n-1} I_n I_{n+1} \times R_{n+1} J_n J_{n+1} \times K_n K_{n+1}}$, with 3rd-order tensor blocks, $\mathbf{C}_{r_{n-1}, r_{n+1}} = \sum_{r_n=1}^{R_n} \mathbf{G}^{(n)}_{r_{n-1}, r_n} \otimes_L \mathbf{G}^{(n+1)}_{r_n, r_{n+1}} \in \mathbb{R}^{I_n I_{n+1} \times J_n J_{n+1} \times K_n K_{n+1}}$, where $\mathbf{G}^{(n)}_{r_{n-1}, r_n} \in \mathbb{R}^{I_n \times J_n \times K_n}$ and


$\mathbf{G}^{(n+1)}_{r_n, r_{n+1}} \in \mathbb{R}^{I_{n+1} \times J_{n+1} \times K_{n+1}}$ are the block tensors of $\tilde{\mathbf{G}}^{(n)}$ and $\tilde{\mathbf{G}}^{(n+1)}$, respectively.

In practice, a fine ($q = 2, 3, 4$) quantization is desirable to create as many virtual (additional) modes as possible, thus allowing for the implementation of efficient low-rank tensor approximations. For example, the binary encoding ($q = 2$) reshapes an $N$th-order tensor with $(2^{K_1} \times 2^{K_2} \times \cdots \times 2^{K_N})$ elements into a tensor of order $(K_1 + K_2 + \cdots + K_N)$, with the same number of elements. In other words, the idea is to quantize each of the $n$ “physical” modes (dimensions) by replacing them with $K_n$ “virtual” modes, provided that the corresponding mode sizes, $I_n$, are factorized as $I_n = I_{n,1} I_{n,2} \cdots I_{n,K_n}$. This, in turn, corresponds to reshaping the $n$th mode of size $I_n$ into $K_n$ modes of sizes $I_{n,1}, I_{n,2}, \ldots, I_{n,K_n}$.

The TT decomposition applied to quantized tensors is referred to as the QTT, Quantics-TT or Quantized-TT, and was first introduced as a compression scheme for large-scale matrices (Oseledets, 2010), and also independently for more general settings.

The attractive properties of QTT are:

1. Not only are the QTT ranks typically small (usually, below 20), but they are also almost independent⁴ of the data size (even for $I = 2^{50}$), thus providing a logarithmic (sub-linear) reduction of storage requirements from $O(I^N)$ to $O(N R^2 \log_q(I))$, which is referred to as super-compression (Khoromskij, 2011a; Kazeev and Khoromskij, 2012; Kazeev et al., 2013a; Dolgov and Khoromskij, 2013; Dolgov et al., 2014). Comparisons of the storage complexity of various tensor formats are given in Table 4.4.

2. Compared to the TT decomposition (without quantization), the QTT format often represents deep structures in the data by introducing “virtual” dimensions or modes. For data which exhibit high degrees of structure, the high compressibility of the QTT approximation is a consequence of the better separability properties of the quantized tensor.

⁴At least uniformly bounded.

Page 135: TensorNetworksforDimensionalityReduction andLarge …mandic/Tensor_Networks_Cichocki... · 2017. 1. 2. · A.Cichockiet al. Tensor Networks for Dimensionality Reduction and Large-Scale

380 Tensor Train Decompositions


Figure 4.7: Tensorization/quantization of a huge-scale 3rd-order tensor into a higher-order tensor and its TT representation. (a) Example of tensorization/quantization of a 3rd-order tensor, $\mathbf{X} \in \mathbb{R}^{I \times J \times K}$, into a $3N$th-order tensor, assuming that the mode sizes can be factorized as $I = I_1 I_2 \cdots I_N$, $J = J_1 J_2 \cdots J_N$ and $K = K_1 K_2 \cdots K_N$. (b) Decomposition of the high-order tensor via a generalized tensor train and its representation by the strong Kronecker product of block tensors as $\mathbf{X} = \tilde{\mathbf{G}}^{(1)} \,|\otimes|\, \tilde{\mathbf{G}}^{(2)} \,|\otimes|\, \cdots \,|\otimes|\, \tilde{\mathbf{G}}^{(N)} \in \mathbb{R}^{I_1 \cdots I_N \times J_1 \cdots J_N \times K_1 \cdots K_N}$, where each block $\tilde{\mathbf{G}}^{(n)} \in \mathbb{R}^{R_{n-1} I_n \times R_n J_n \times K_n}$ is itself a block tensor with 3rd-order blocks of size $(I_n \times J_n \times K_n)$, for $n = 1, 2, \ldots, N$, with $R_0 = R_N = 1$. In the special case when $J = K = 1$, the model simplifies into the standard TT/MPS model.


Table 4.4: Storage complexities of tensor decomposition models for an $N$th-order tensor, $\mathbf{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, for which the original storage complexity is $\mathcal{O}(I^N)$, where $I = \max\{I_1, I_2, \ldots, I_N\}$, while $R$ is the upper bound on the ranks of the tensor decompositions considered, that is, $R = \max\{R_1, R_2, \ldots, R_{N-1}\}$ or $R = \max\{R_1, R_2, \ldots, R_N\}$.

1. Full (raw) tensor format: $\mathcal{O}(I^N)$
2. CP: $\mathcal{O}(NIR)$
3. Tucker: $\mathcal{O}(NIR + R^N)$
4. TT/MPS: $\mathcal{O}(NIR^2)$
5. TT/MPO: $\mathcal{O}(NI^2R^2)$
6. Quantized TT/MPS (QTT): $\mathcal{O}(NR^2 \log_q(I))$
7. QTT+Tucker: $\mathcal{O}(NR^2 \log_q(I) + NR^3)$
8. Hierarchical Tucker (HT): $\mathcal{O}(NIR + NR^3)$

3. The fact that the QTT ranks are often moderate or even low⁵ offers unique advantages in the context of big data analytics (see (Khoromskij, 2011a,b; Kazeev et al., 2013a) and references therein), together with the high efficiency of multilinear algebra within the TT/QTT algorithms, which rests upon the well-posedness of the low-rank TT approximations.

The ranks of the QTT format often grow dramatically with data size, but with a linear increase in the approximation accuracy. To overcome this problem, Dolgov and Khoromskij proposed the QTT-Tucker format (Dolgov and Khoromskij, 2013) (see Figure 4.8), which exploits the TT approximation not only for the Tucker core tensor, but also for the factor matrices. This model naturally admits distributed computation, and often yields bounded ranks, thus avoiding the curse of dimensionality.

⁵The TT/QTT ranks are constant or growing linearly with respect to the tensor order $N$, and are constant or growing logarithmically with respect to the dimension of the tensor modes $I$.



Figure 4.8: The QTT-Tucker or, alternatively, QTC-Tucker (Quantized Tensor-Chain-Tucker) format. (a) Distributed representation of a matrix $\mathbf{A}_n \in \mathbb{R}^{I_n \times R_n}$ with a very large value of $I_n$ via QTT, by tensorization to a high-order quantized tensor, followed by QTT decomposition. (b) Distributed representation of a large-scale Tucker-$N$ model, $\mathbf{X} \cong \mathbf{G} \times_1 \mathbf{A}_1 \times_2 \mathbf{A}_2 \cdots \times_N \mathbf{A}_N$, via a quantized TC model in which the core tensor $\mathbf{G} \in \mathbb{R}^{R_1 \times R_2 \times \cdots \times R_N}$ and optionally all large-scale factor matrices $\mathbf{A}_n$ ($n = 1, 2, \ldots, N$) are represented by MPS models (for more detail see (Dolgov and Khoromskij, 2013)).


The TT/QTT tensor networks have already found application in very large-scale problems in scientific computing, such as in eigenanalysis, super-fast Fourier transforms, and in solving huge systems of large linear equations (see (Dolgov and Khoromskij, 2013; Huckle et al., 2013; Dolgov et al., 2014; Wahls et al., 2014; Kressner et al., 2014a; Kressner and Uschmajew, 2016) and references therein).

4.5 Basic Operations in TT Formats

For big tensors in their TT formats, basic mathematical operations, such as the addition, inner product, computation of tensor norms, Hadamard and Kronecker products, and matrix-by-vector and matrix-by-matrix multiplications, can be performed very efficiently using block (slice) matrices of the individual (relatively small) core tensors.

Consider two $N$th-order tensors in the TT format

$\mathbf{X} = \langle\langle \mathbf{X}^{(1)}, \mathbf{X}^{(2)}, \ldots, \mathbf{X}^{(N)} \rangle\rangle \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$
$\mathbf{Y} = \langle\langle \mathbf{Y}^{(1)}, \mathbf{Y}^{(2)}, \ldots, \mathbf{Y}^{(N)} \rangle\rangle \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$,

for which the TT ranks are $\mathbf{r}_X = \{R_1, R_2, \ldots, R_{N-1}\}$ and $\mathbf{r}_Y = \{\tilde{R}_1, \tilde{R}_2, \ldots, \tilde{R}_{N-1}\}$. The following operations can then be performed directly in the TT format.

Tensor addition. The sum of two tensors

$\mathbf{Z} = \mathbf{X} + \mathbf{Y} = \langle\langle \mathbf{Z}^{(1)}, \mathbf{Z}^{(2)}, \ldots, \mathbf{Z}^{(N)} \rangle\rangle \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$   (4.17)

has the TT rank $\mathbf{r}_Z = \mathbf{r}_X + \mathbf{r}_Y$ and can be expressed via the lateral slices of the cores $\mathbf{Z}^{(n)} \in \mathbb{R}^{(R_{n-1}+\tilde{R}_{n-1}) \times I_n \times (R_n+\tilde{R}_n)}$ as

$\mathbf{Z}^{(n)}_{i_n} = \begin{bmatrix} \mathbf{X}^{(n)}_{i_n} & \mathbf{0} \\ \mathbf{0} & \mathbf{Y}^{(n)}_{i_n} \end{bmatrix}, \quad n = 2, 3, \ldots, N-1.$   (4.18)

For the border cores, we have

$\mathbf{Z}^{(1)}_{i_1} = \begin{bmatrix} \mathbf{X}^{(1)}_{i_1} & \mathbf{Y}^{(1)}_{i_1} \end{bmatrix}, \qquad \mathbf{Z}^{(N)}_{i_N} = \begin{bmatrix} \mathbf{X}^{(N)}_{i_N} \\ \mathbf{Y}^{(N)}_{i_N} \end{bmatrix}$   (4.19)

for $i_n = 1, 2, \ldots, I_n$, $n = 1, 2, \ldots, N$.
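A minimal MATLAB sketch of (4.18)-(4.19) is given below; it assumes the convention (chosen here for illustration only) that the TT-cores are stored as a cell array of 3rd-order arrays of size $R_{n-1} \times I_n \times R_n$, and the function name tt_add is hypothetical.

function Zc = tt_add(Xc, Yc)
% TT addition Z = X + Y: the slices of the interior cores are block-diagonal,
% while the border cores are concatenated along the rank modes.
N = numel(Xc);
Zc = cell(1, N);
Zc{1} = cat(3, Xc{1}, Yc{1});          % [X1_i1  Y1_i1]
Zc{N} = cat(1, Xc{N}, Yc{N});          % [XN_iN; YN_iN]
for n = 2:N-1
    Rx1 = size(Xc{n}, 1); In = size(Xc{n}, 2); Rx2 = size(Xc{n}, 3);
    Ry1 = size(Yc{n}, 1);                      Ry2 = size(Yc{n}, 3);
    Zc{n} = zeros(Rx1+Ry1, In, Rx2+Ry2);
    Zc{n}(1:Rx1,     :, 1:Rx2)     = Xc{n};    % block-diagonal slices
    Zc{n}(Rx1+1:end, :, Rx2+1:end) = Yc{n};
end
end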

Hadamard product. The computation of the Hadamard (element-wise) product, $\mathbf{Z} = \mathbf{X} \circledast \mathbf{Y}$, of two tensors, $\mathbf{X}$ and $\mathbf{Y}$, of the same order and the same size can be performed very efficiently in the TT format by expressing the slices of the cores, $\mathbf{Z}^{(n)} \in \mathbb{R}^{R_{n-1}\tilde{R}_{n-1} \times I_n \times R_n\tilde{R}_n}$, as

$\mathbf{Z}^{(n)}_{i_n} = \mathbf{X}^{(n)}_{i_n} \otimes \mathbf{Y}^{(n)}_{i_n}, \quad n = 1, \ldots, N, \quad i_n = 1, \ldots, I_n.$   (4.20)

This increases the TT ranks of the tensor $\mathbf{Z}$ to at most $R_n \tilde{R}_n$, $n = 1, 2, \ldots, N$, but the associated computational complexity can be reduced from being exponential in $N$, $\mathcal{O}(I^N)$, to being linear in both $I$ and $N$, $\mathcal{O}(NI(R\tilde{R})^2)$.
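Under the same (assumed) cell-array core storage, the slice-wise Kronecker construction (4.20) can be sketched in MATLAB as follows; note how the TT ranks multiply.

function Zc = tt_hadamard(Xc, Yc)
% Hadamard product Z = X .* Y in the TT format: each slice of the new core
% is the Kronecker product of the corresponding slices of X and Y.
N = numel(Xc);
Zc = cell(1, N);
for n = 1:N
    Rx1 = size(Xc{n}, 1); In = size(Xc{n}, 2); Rx2 = size(Xc{n}, 3);
    Ry1 = size(Yc{n}, 1);                      Ry2 = size(Yc{n}, 3);
    Zc{n} = zeros(Rx1*Ry1, In, Rx2*Ry2);
    for i = 1:In
        Ki = kron(reshape(Xc{n}(:, i, :), Rx1, Rx2), ...
                  reshape(Yc{n}(:, i, :), Ry1, Ry2));
        Zc{n}(:, i, :) = reshape(Ki, [Rx1*Ry1, 1, Rx2*Ry2]);
    end
end
end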

Super fast Fourier transform of a tensor in the TT format (MATLAB functions: fftn(X) and fft(X(n), [], 2)) can be computed as

$\mathcal{F}(\mathbf{X}) = \langle\langle \mathcal{F}(\mathbf{X}^{(1)}), \mathcal{F}(\mathbf{X}^{(2)}), \ldots, \mathcal{F}(\mathbf{X}^{(N)}) \rangle\rangle = \mathcal{F}(\mathbf{X}^{(1)}) \times^1 \mathcal{F}(\mathbf{X}^{(2)}) \times^1 \cdots \times^1 \mathcal{F}(\mathbf{X}^{(N)}).$   (4.21)

It should be emphasized that performing the FFT on the relatively small core tensors $\mathbf{X}^{(n)} \in \mathbb{R}^{R_{n-1} \times I_n \times R_n}$ dramatically reduces the computational complexity, under the condition that the data tensor admits a low-rank TT approximation. This approach is referred to as the super fast Fourier transform (SFFT) in the TT format. Wavelets, DCT, and other linear integral transformations admit a form similar to the SFFT in (4.21); for example, for the wavelet transform in the TT format, we have

$\mathcal{W}(\mathbf{X}) = \langle\langle \mathcal{W}(\mathbf{X}^{(1)}), \mathcal{W}(\mathbf{X}^{(2)}), \ldots, \mathcal{W}(\mathbf{X}^{(N)}) \rangle\rangle = \mathcal{W}(\mathbf{X}^{(1)}) \times^1 \mathcal{W}(\mathbf{X}^{(2)}) \times^1 \cdots \times^1 \mathcal{W}(\mathbf{X}^{(N)}).$   (4.22)
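Since the transform acts independently on each core, a TT-format FFT reduces to one call of fft along the physical (second) mode of every core, as in the following sketch (again assuming the cell-array core storage introduced above; the function name is ours).

function Fc = tt_fft(Xc)
% FFT of a tensor in the TT format, Eq. (4.21): apply the FFT along the
% physical mode (of size I_n) of every 3rd-order core.
N = numel(Xc);
Fc = cell(1, N);
for n = 1:N
    Fc{n} = fft(Xc{n}, [], 2);
end
end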

The N-D discrete convolution in the TT format of tensors $\mathbf{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ with TT rank $\{R_1, R_2, \ldots, R_{N-1}\}$ and $\mathbf{Y} \in \mathbb{R}^{J_1 \times \cdots \times J_N}$ with TT rank $\{Q_1, Q_2, \ldots, Q_{N-1}\}$ can be computed as

$\mathbf{Z} = \mathbf{X} \ast \mathbf{Y} = \langle\langle \mathbf{Z}^{(1)}, \mathbf{Z}^{(2)}, \ldots, \mathbf{Z}^{(N)} \rangle\rangle \in \mathbb{R}^{(I_1+J_1-1) \times (I_2+J_2-1) \times \cdots \times (I_N+J_N-1)},$   (4.23)

with the TT-cores $\mathbf{Z}^{(n)} \in \mathbb{R}^{(R_{n-1}Q_{n-1}) \times (I_n+J_n-1) \times (R_nQ_n)}$   (4.24)

given, equivalently, by the standard 1-D convolution of the mode-2 fibers

$\mathbf{Z}^{(n)}(s_{n-1}, :, s_n) = \mathbf{X}^{(n)}(r_{n-1}, :, r_n) \ast \mathbf{Y}^{(n)}(q_{n-1}, :, q_n) \in \mathbb{R}^{I_n+J_n-1},$

for $s_n = 1, 2, \ldots, R_n Q_n$ and $n = 1, 2, \ldots, N$, with $R_0 = R_N = 1$.

Inner product. The computation of the inner (scalar, dot) product of two $N$th-order tensors, $\mathbf{X} = \langle\langle \mathbf{X}^{(1)}, \mathbf{X}^{(2)}, \ldots, \mathbf{X}^{(N)} \rangle\rangle \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ and $\mathbf{Y} = \langle\langle \mathbf{Y}^{(1)}, \mathbf{Y}^{(2)}, \ldots, \mathbf{Y}^{(N)} \rangle\rangle \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, is given by

$\langle \mathbf{X}, \mathbf{Y} \rangle = \langle \mathrm{vec}(\mathbf{X}), \mathrm{vec}(\mathbf{Y}) \rangle = \sum_{i_1=1}^{I_1} \cdots \sum_{i_N=1}^{I_N} x_{i_1 \cdots i_N}\, y_{i_1 \cdots i_N}$   (4.25)

and has the complexity of $\mathcal{O}(I^N)$ in the raw tensor format. In the TT format, the inner product can be computed with the reduced complexity of only $\mathcal{O}(NI(R^2\tilde{R} + R\tilde{R}^2))$, when the inner product is calculated by moving the TT-cores from left to right and performing calculations on relatively small matrices, $\mathbf{S}_n = \big(\mathbf{X}^{(n)}_{<2>}\big)^{\mathrm{T}} \big(\mathbf{Y}^{(n)} \times^1 \mathbf{S}_{n-1}\big)_{<2>} \in \mathbb{R}^{R_n \times \tilde{R}_n}$, i.e., by contracting $\mathbf{X}^{(n)}$ and $\mathbf{Y}^{(n)} \times^1 \mathbf{S}_{n-1}$ over their first two modes, for $n = 1, 2, \ldots, N$. The results are then sequentially multiplied by the next core $\mathbf{Y}^{(n+1)}$ (see Algorithm 9).
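The left-to-right sweep of Algorithm 9 can be sketched in MATLAB as follows (hypothetical function name and the same assumed cell-array core storage); at every step only a small $R_n \times \tilde{R}_n$ matrix is carried.

function s = tt_dot(Xc, Yc)
% Inner product <X, Y> of two tensors in the TT format (Algorithm 9).
N = numel(Xc);
S = 1;                                                % S_0 = 1
for n = 1:N
    Ry1 = size(Yc{n}, 1); In = size(Yc{n}, 2); Ry2 = size(Yc{n}, 3);
    Rx1 = size(Xc{n}, 1);                      Rx2 = size(Xc{n}, 3);
    % Z = S_{n-1} * Y{n}_(1), reshaped back to an Rx_{n-1} x In x Ry_n tensor
    Z = reshape(S * reshape(Yc{n}, Ry1, In*Ry2), Rx1, In, Ry2);
    % S_n = X{n}_<2>^T * Z_<2>, an Rx_n x Ry_n matrix
    S = reshape(Xc{n}, Rx1*In, Rx2)' * reshape(Z, Rx1*In, Ry2);
end
s = S;                                                % 1 x 1 at the end
end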

Computation of the Frobenius norm. In a similar way, we can efficiently compute the Frobenius norm of a tensor, $\|\mathbf{X}\|_F = \sqrt{\langle \mathbf{X}, \mathbf{X} \rangle}$, in the TT format. For the so-called $n$-orthogonal⁶ TT format, it is easy to show that

$\|\mathbf{X}\|_F = \|\mathbf{X}^{(n)}\|_F.$   (4.26)

Matrix-by-vector multiplication. Consider a huge-scale matrix equation (see Figure 4.9 and Figure 4.10)

$\mathbf{A}\mathbf{x} = \mathbf{y},$   (4.27)

⁶An $N$th-order tensor $\mathbf{X} = \langle\langle \mathbf{X}^{(1)}, \mathbf{X}^{(2)}, \ldots, \mathbf{X}^{(N)} \rangle\rangle$ in the TT format is called $n$-orthogonal if all the cores to the left of the core $\mathbf{X}^{(n)}$ are left-orthogonalized and all the cores to the right of the core $\mathbf{X}^{(n)}$ are right-orthogonalized (see Part 2 for more detail).


Algorithm 9: Inner product of two large-scale tensors in the TT format (Oseledets, 2011; Dolgov, 2014)
Input: $N$th-order tensors, $\mathbf{X} = \langle\langle \mathbf{X}^{(1)}, \ldots, \mathbf{X}^{(N)} \rangle\rangle \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ and $\mathbf{Y} = \langle\langle \mathbf{Y}^{(1)}, \ldots, \mathbf{Y}^{(N)} \rangle\rangle \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ in TT formats, with TT-cores $\mathbf{X}^{(n)} \in \mathbb{R}^{R_{n-1} \times I_n \times R_n}$ and $\mathbf{Y}^{(n)} \in \mathbb{R}^{\tilde{R}_{n-1} \times I_n \times \tilde{R}_n}$, and $R_0 = \tilde{R}_0 = R_N = \tilde{R}_N = 1$
Output: Inner product $\langle \mathbf{X}, \mathbf{Y} \rangle = \mathrm{vec}(\mathbf{X})^{\mathrm{T}} \mathrm{vec}(\mathbf{Y})$
1: Initialization $\mathbf{S}_0 = 1$
2: for $n = 1$ to $N$ do
3:   $\mathbf{Z}^{(n)}_{(1)} = \mathbf{S}_{n-1} \mathbf{Y}^{(n)}_{(1)} \in \mathbb{R}^{R_{n-1} \times I_n \tilde{R}_n}$
4:   $\mathbf{S}_n = \big(\mathbf{X}^{(n)}_{<2>}\big)^{\mathrm{T}} \mathbf{Z}^{(n)}_{<2>} \in \mathbb{R}^{R_n \times \tilde{R}_n}$
5: end for
6: return Scalar $\langle \mathbf{X}, \mathbf{Y} \rangle = \mathbf{S}_N \in \mathbb{R}^{R_N \times \tilde{R}_N} = \mathbb{R}$, with $R_N = \tilde{R}_N = 1$

where $\mathbf{A} \in \mathbb{R}^{I \times J}$, $\mathbf{x} \in \mathbb{R}^{J}$ and $\mathbf{y} \in \mathbb{R}^{I}$ are represented approximately in the TT format, with $I = I_1 I_2 \cdots I_N$ and $J = J_1 J_2 \cdots J_N$. As shown in Figure 4.9(a), the cores are defined as $\mathbf{A}^{(n)} \in \mathbb{R}^{P_{n-1} \times I_n \times J_n \times P_n}$, $\mathbf{X}^{(n)} \in \mathbb{R}^{R_{n-1} \times J_n \times R_n}$ and $\mathbf{Y}^{(n)} \in \mathbb{R}^{Q_{n-1} \times I_n \times Q_n}$.

Upon representing the entries of the matrix $\mathbf{A}$ and the vectors $\mathbf{x}$ and $\mathbf{y}$ in their tensorized forms, given by (with $\circ$ denoting the outer product)

$\mathbf{A} = \sum_{p_1, p_2, \ldots, p_{N-1}=1}^{P_1, P_2, \ldots, P_{N-1}} \mathbf{A}^{(1)}_{1, p_1} \circ \mathbf{A}^{(2)}_{p_1, p_2} \circ \cdots \circ \mathbf{A}^{(N)}_{p_{N-1}, 1}$

$\mathbf{X} = \sum_{r_1, r_2, \ldots, r_{N-1}=1}^{R_1, R_2, \ldots, R_{N-1}} \mathbf{x}^{(1)}_{r_1} \circ \mathbf{x}^{(2)}_{r_1, r_2} \circ \cdots \circ \mathbf{x}^{(N)}_{r_{N-1}}$   (4.28)

$\mathbf{Y} = \sum_{q_1, q_2, \ldots, q_{N-1}=1}^{Q_1, Q_2, \ldots, Q_{N-1}} \mathbf{y}^{(1)}_{q_1} \circ \mathbf{y}^{(2)}_{q_1, q_2} \circ \cdots \circ \mathbf{y}^{(N)}_{q_{N-1}},$

we arrive at a simple formula for the tubes of the tensor $\mathbf{Y}$, in the form

$\mathbf{y}^{(n)}_{q_{n-1}, q_n} = \mathbf{y}^{(n)}_{\overline{r_{n-1} p_{n-1}},\, \overline{r_n p_n}} = \mathbf{A}^{(n)}_{p_{n-1}, p_n}\, \mathbf{x}^{(n)}_{r_{n-1}, r_n} \in \mathbb{R}^{I_n},$

with $Q_n = P_n R_n$ for $n = 1, 2, \ldots, N$.



Figure 4.9: Linear systems represented by arbitrary tensor networks (left) and TT networks (right) for (a) $\mathbf{A}\mathbf{x} = \mathbf{y}$ and (b) $\mathbf{A}\mathbf{X} = \mathbf{Y}$.



Figure 4.10: Representation of typical cost functions by arbitrary TNs and by TT networks: (a) $J_1(\mathbf{x}) = \mathbf{y}^{\mathrm{T}}\mathbf{A}\mathbf{x}$ and (b) $J_2(\mathbf{x}) = \mathbf{x}^{\mathrm{T}}\mathbf{A}^{\mathrm{T}}\mathbf{A}\mathbf{x}$. Note that the tensors $\mathbf{A}$, $\mathbf{X}$ and $\mathbf{Y}$ can, in general, be approximated by any TNs that provide good low-rank representations.

Furthermore, by representing the matrix $\mathbf{A}$ and the vectors $\mathbf{x}$, $\mathbf{y}$ via the strong Kronecker products

$\mathbf{A} = \mathbf{A}^{(1)} \,|\otimes|\, \mathbf{A}^{(2)} \,|\otimes|\, \cdots \,|\otimes|\, \mathbf{A}^{(N)}$
$\mathbf{x} = \mathbf{X}^{(1)} \,|\otimes|\, \mathbf{X}^{(2)} \,|\otimes|\, \cdots \,|\otimes|\, \mathbf{X}^{(N)}$   (4.29)
$\mathbf{y} = \mathbf{Y}^{(1)} \,|\otimes|\, \mathbf{Y}^{(2)} \,|\otimes|\, \cdots \,|\otimes|\, \mathbf{Y}^{(N)},$

with $\mathbf{A}^{(n)} \in \mathbb{R}^{P_{n-1} I_n \times J_n P_n}$, $\mathbf{X}^{(n)} \in \mathbb{R}^{R_{n-1} J_n \times R_n}$ and $\mathbf{Y}^{(n)} \in \mathbb{R}^{Q_{n-1} I_n \times Q_n}$, we can establish the simple relationship

$\mathbf{Y}^{(n)} = \mathbf{A}^{(n)} \,|\bullet|\, \mathbf{X}^{(n)} \in \mathbb{R}^{R_{n-1} P_{n-1} I_n \times R_n P_n}, \quad n = 1, \ldots, N,$   (4.30)



Figure 4.11: Graphical illustration of the C product of two block matrices.

where the operator $|\bullet|$ represents the C (Core) product of two block matrices.

The C product of a block matrix $\mathbf{A}^{(n)} \in \mathbb{R}^{P_{n-1} I_n \times P_n J_n}$, with blocks $\mathbf{A}^{(n)}_{p_{n-1}, p_n} \in \mathbb{R}^{I_n \times J_n}$, and a block matrix $\mathbf{B}^{(n)} \in \mathbb{R}^{R_{n-1} J_n \times R_n K_n}$, with blocks $\mathbf{B}^{(n)}_{r_{n-1}, r_n} \in \mathbb{R}^{J_n \times K_n}$, is defined as $\mathbf{C}^{(n)} = \mathbf{A}^{(n)} \,|\bullet|\, \mathbf{B}^{(n)} \in \mathbb{R}^{Q_{n-1} I_n \times Q_n K_n}$, the blocks of which are given by $\mathbf{C}^{(n)}_{q_{n-1}, q_n} = \mathbf{A}^{(n)}_{p_{n-1}, p_n} \mathbf{B}^{(n)}_{r_{n-1}, r_n} \in \mathbb{R}^{I_n \times K_n}$, where $q_n = \overline{p_n r_n}$, as illustrated in Figure 4.11.
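A direct MATLAB sketch of the C product is given below; the block sizes I, J, K are passed explicitly, and the combined block index is formed as q = (p-1)*R + r, which is one of the two natural orderings of the pair (p, r) — the choice is a convention that only permutes the blocks.

function C = c_product(A, B, I, J, K)
% C (Core) product of two block matrices (see Figure 4.11):
%   A is a (P1*I) x (P2*J) block matrix with I x J blocks A_{p1,p2},
%   B is a (R1*J) x (R2*K) block matrix with J x K blocks B_{r1,r2},
%   C is a (P1*R1*I) x (P2*R2*K) block matrix with blocks A_{p1,p2}*B_{r1,r2}.
P1 = size(A, 1) / I;  P2 = size(A, 2) / J;
R1 = size(B, 1) / J;  R2 = size(B, 2) / K;
C  = zeros(P1*R1*I, P2*R2*K);
for p1 = 1:P1
  for p2 = 1:P2
    Ablk = A((p1-1)*I+1 : p1*I, (p2-1)*J+1 : p2*J);
    for r1 = 1:R1
      for r2 = 1:R2
        Bblk = B((r1-1)*J+1 : r1*J, (r2-1)*K+1 : r2*K);
        q1 = (p1-1)*R1 + r1;      % combined row-block index
        q2 = (p2-1)*R2 + r2;      % combined column-block index
        C((q1-1)*I+1 : q1*I, (q2-1)*K+1 : q2*K) = Ablk * Bblk;
      end
    end
  end
end
end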

Note that, equivalently to Eq. (4.30), for $\mathbf{A}\mathbf{x} = \mathbf{y}$ we can use a slice representation, given by

$\mathbf{Y}^{(n)}_{i_n} = \sum_{j_n=1}^{J_n} \big(\mathbf{A}^{(n)}_{i_n, j_n} \otimes_L \mathbf{X}^{(n)}_{j_n}\big),$   (4.31)

which can be implemented by fast matrix-by-matrix multiplication algorithms (see Algorithm 10). In practice, for very large-scale data, we usually perform the TT core contractions (MPO-MPS product) approximately, with reduced TT ranks, e.g., via the "zip-up" method proposed by (Stoudenmire and White, 2010).

In a similar way, the matrix equation

$\mathbf{Y} = \mathbf{A}\mathbf{X},$   (4.32)

where $\mathbf{A} \in \mathbb{R}^{I \times J}$, $\mathbf{X} \in \mathbb{R}^{J \times K}$, $\mathbf{Y} \in \mathbb{R}^{I \times K}$, with $I = I_1 I_2 \cdots I_N$, $J = J_1 J_2 \cdots J_N$ and $K = K_1 K_2 \cdots K_N$, can be represented in TT


Table 4.5: Basic operations on tensors in TT formats, where $\mathbf{X} = \mathbf{X}^{(1)} \times^1 \mathbf{X}^{(2)} \times^1 \cdots \times^1 \mathbf{X}^{(N)} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, $\mathbf{Y} = \mathbf{Y}^{(1)} \times^1 \mathbf{Y}^{(2)} \times^1 \cdots \times^1 \mathbf{Y}^{(N)} \in \mathbb{R}^{J_1 \times J_2 \times \cdots \times J_N}$, and $\mathbf{Z} = \mathbf{Z}^{(1)} \times^1 \mathbf{Z}^{(2)} \times^1 \cdots \times^1 \mathbf{Z}^{(N)} \in \mathbb{R}^{K_1 \times K_2 \times \cdots \times K_N}$.

Addition, $\mathbf{Z} = \mathbf{X} + \mathbf{Y}$: $\big(\mathbf{X}^{(1)} \oplus_2 \mathbf{Y}^{(1)}\big) \times^1 \big(\mathbf{X}^{(2)} \oplus_2 \mathbf{Y}^{(2)}\big) \times^1 \cdots \times^1 \big(\mathbf{X}^{(N)} \oplus_2 \mathbf{Y}^{(N)}\big)$; the cores $\mathbf{Z}^{(n)} = \mathbf{X}^{(n)} \oplus_2 \mathbf{Y}^{(n)}$ have TT core slices $\mathbf{Z}^{(n)}_{i_n} = \mathbf{X}^{(n)}_{i_n} \oplus \mathbf{Y}^{(n)}_{i_n}$ ($I_n = J_n = K_n$, $\forall n$).

Direct sum, $\mathbf{Z} = \mathbf{X} \oplus \mathbf{Y}$: $\big(\mathbf{X}^{(1)} \oplus \mathbf{Y}^{(1)}\big) \times^1 \big(\mathbf{X}^{(2)} \oplus \mathbf{Y}^{(2)}\big) \times^1 \cdots \times^1 \big(\mathbf{X}^{(N)} \oplus \mathbf{Y}^{(N)}\big)$.

Hadamard product, $\mathbf{Z} = \mathbf{X} \circledast \mathbf{Y}$: $\big(\mathbf{X}^{(1)} \otimes_2 \mathbf{Y}^{(1)}\big) \times^1 \cdots \times^1 \big(\mathbf{X}^{(N)} \otimes_2 \mathbf{Y}^{(N)}\big)$; the cores $\mathbf{Z}^{(n)} = \mathbf{X}^{(n)} \otimes_2 \mathbf{Y}^{(n)}$ (mode-2 partial Kronecker product) have TT core slices $\mathbf{Z}^{(n)}_{i_n} = \mathbf{X}^{(n)}_{i_n} \otimes \mathbf{Y}^{(n)}_{i_n}$ ($I_n = J_n = K_n$, $\forall n$).

Kronecker product, $\mathbf{Z} = \mathbf{X} \otimes \mathbf{Y}$: $\big(\mathbf{X}^{(1)} \otimes \mathbf{Y}^{(1)}\big) \times^1 \cdots \times^1 \big(\mathbf{X}^{(N)} \otimes \mathbf{Y}^{(N)}\big)$; the cores $\mathbf{Z}^{(n)} = \mathbf{X}^{(n)} \otimes \mathbf{Y}^{(n)}$ have TT core slices $\mathbf{Z}^{(n)}_{k_n} = \mathbf{X}^{(n)}_{i_n} \otimes \mathbf{Y}^{(n)}_{j_n}$, with $k_n = \overline{i_n j_n}$.

Convolution, $\mathbf{Z} = \mathbf{X} \ast \mathbf{Y}$: the cores $\mathbf{Z}^{(n)} \in \mathbb{R}^{(R_{n-1}Q_{n-1}) \times (I_n+J_n-1) \times (R_nQ_n)}$ have mode-2 fibers $\mathbf{Z}^{(n)}(s_{n-1}, :, s_n) = \mathbf{X}^{(n)}(r_{n-1}, :, r_n) \ast \mathbf{Y}^{(n)}(q_{n-1}, :, q_n) \in \mathbb{R}^{I_n+J_n-1}$, for $s_n = 1, 2, \ldots, R_n Q_n$ and $n = 1, 2, \ldots, N$, $R_0 = R_N = 1$.

Mode-$n$ matrix product, $\mathbf{Z} = \mathbf{X} \times_n \mathbf{A}$: $\mathbf{X}^{(1)} \times^1 \cdots \times^1 \mathbf{X}^{(n-1)} \times^1 \big(\mathbf{X}^{(n)} \times_2 \mathbf{A}\big) \times^1 \mathbf{X}^{(n+1)} \times^1 \cdots \times^1 \mathbf{X}^{(N)}$.

Inner product, $z = \langle \mathbf{X}, \mathbf{Y} \rangle$: $\mathbf{Z}^{(1)} \times^1 \mathbf{Z}^{(2)} \times^1 \cdots \times^1 \mathbf{Z}^{(N)} = \mathbf{Z}^{(1)} \mathbf{Z}^{(2)} \cdots \mathbf{Z}^{(N)}$, with matrices $\mathbf{Z}^{(n)} = \big(\mathbf{X}^{(n)} \otimes_2 \mathbf{Y}^{(n)}\big) \bar{\times}_2 \mathbf{1}_{I_n} = \sum_{i_n} \mathbf{X}^{(n)}_{i_n} \otimes \mathbf{Y}^{(n)}_{i_n}$ ($I_n = J_n$, $\forall n$).


Table 4.6: Basic operations in the TT format expressed via the strong Kronecker and C products of block matrices, where $\mathbf{A} = \tilde{\mathbf{A}}^{(1)} \,|\otimes|\, \tilde{\mathbf{A}}^{(2)} \,|\otimes|\, \cdots \,|\otimes|\, \tilde{\mathbf{A}}^{(N)}$, $\mathbf{B} = \tilde{\mathbf{B}}^{(1)} \,|\otimes|\, \cdots \,|\otimes|\, \tilde{\mathbf{B}}^{(N)}$, $\mathbf{x} = \tilde{\mathbf{X}}^{(1)} \,|\otimes|\, \cdots \,|\otimes|\, \tilde{\mathbf{X}}^{(N)}$, $\mathbf{y} = \tilde{\mathbf{Y}}^{(1)} \,|\otimes|\, \cdots \,|\otimes|\, \tilde{\mathbf{Y}}^{(N)}$, and the block matrices are $\tilde{\mathbf{A}}^{(n)} \in \mathbb{R}^{R^A_{n-1} I_n \times J_n R^A_n}$, $\tilde{\mathbf{B}}^{(n)} \in \mathbb{R}^{R^B_{n-1} J_n \times K_n R^B_n}$, $\tilde{\mathbf{X}}^{(n)} \in \mathbb{R}^{R^x_{n-1} I_n \times R^x_n}$, $\tilde{\mathbf{Y}}^{(n)} \in \mathbb{R}^{R^y_{n-1} I_n \times R^y_n}$.

Matrix addition, $\mathbf{Z} = \mathbf{A} + \mathbf{B}$: $\begin{bmatrix} \tilde{\mathbf{A}}^{(1)} & \tilde{\mathbf{B}}^{(1)} \end{bmatrix} \,|\otimes|\, \begin{bmatrix} \tilde{\mathbf{A}}^{(2)} & \mathbf{0} \\ \mathbf{0} & \tilde{\mathbf{B}}^{(2)} \end{bmatrix} \,|\otimes|\, \cdots \,|\otimes|\, \begin{bmatrix} \tilde{\mathbf{A}}^{(N-1)} & \mathbf{0} \\ \mathbf{0} & \tilde{\mathbf{B}}^{(N-1)} \end{bmatrix} \,|\otimes|\, \begin{bmatrix} \tilde{\mathbf{A}}^{(N)} \\ \tilde{\mathbf{B}}^{(N)} \end{bmatrix}$

Kronecker product, $\mathbf{Z} = \mathbf{A} \otimes \mathbf{B}$: $\tilde{\mathbf{A}}^{(1)} \,|\otimes|\, \cdots \,|\otimes|\, \tilde{\mathbf{A}}^{(N)} \,|\otimes|\, \tilde{\mathbf{B}}^{(1)} \,|\otimes|\, \cdots \,|\otimes|\, \tilde{\mathbf{B}}^{(N)}$

Inner product, $z = \mathbf{x}^{\mathrm{T}}\mathbf{y} = \langle \mathbf{x}, \mathbf{y} \rangle$: $\big(\tilde{\mathbf{X}}^{(1)} \,|\bullet|\, \tilde{\mathbf{Y}}^{(1)}\big) \,|\otimes|\, \cdots \,|\otimes|\, \big(\tilde{\mathbf{X}}^{(N)} \,|\bullet|\, \tilde{\mathbf{Y}}^{(N)}\big)$; $\tilde{\mathbf{Z}}^{(n)} = \tilde{\mathbf{X}}^{(n)} \,|\bullet|\, \tilde{\mathbf{Y}}^{(n)} \in \mathbb{R}^{R^x_{n-1} R^y_{n-1} \times R^x_n R^y_n}$, with core slices $\mathbf{Z}^{(n)} = \sum_{i_n} \mathbf{X}^{(n)}_{i_n} \otimes \mathbf{Y}^{(n)}_{i_n}$

Matrix-by-vector product, $\mathbf{z} = \mathbf{A}\mathbf{x}$: $\big(\tilde{\mathbf{A}}^{(1)} \,|\bullet|\, \tilde{\mathbf{X}}^{(1)}\big) \,|\otimes|\, \cdots \,|\otimes|\, \big(\tilde{\mathbf{A}}^{(N)} \,|\bullet|\, \tilde{\mathbf{X}}^{(N)}\big)$; $\tilde{\mathbf{Z}}^{(n)} = \tilde{\mathbf{A}}^{(n)} \,|\bullet|\, \tilde{\mathbf{X}}^{(n)}$, with blocks (vectors) $\mathbf{z}^{(n)}_{s_{n-1}, s_n} = \mathbf{A}^{(n)}_{r^A_{n-1}, r^A_n} \mathbf{x}^{(n)}_{r^x_{n-1}, r^x_n}$ ($s_n = \overline{r^A_n r^x_n}$)

Matrix-by-matrix product, $\mathbf{Z} = \mathbf{A}\mathbf{B}$: $\big(\tilde{\mathbf{A}}^{(1)} \,|\bullet|\, \tilde{\mathbf{B}}^{(1)}\big) \,|\otimes|\, \cdots \,|\otimes|\, \big(\tilde{\mathbf{A}}^{(N)} \,|\bullet|\, \tilde{\mathbf{B}}^{(N)}\big)$; $\tilde{\mathbf{Z}}^{(n)} = \tilde{\mathbf{A}}^{(n)} \,|\bullet|\, \tilde{\mathbf{B}}^{(n)}$, with blocks $\mathbf{Z}^{(n)}_{s_{n-1}, s_n} = \mathbf{A}^{(n)}_{r^A_{n-1}, r^A_n} \mathbf{B}^{(n)}_{r^B_{n-1}, r^B_n}$ ($s_n = \overline{r^A_n r^B_n}$)

Quadratic form, $z = \mathbf{x}^{\mathrm{T}}\mathbf{A}\mathbf{x} = \langle \mathbf{x}, \mathbf{A}\mathbf{x} \rangle$: $\big(\tilde{\mathbf{X}}^{(1)} \,|\bullet|\, \tilde{\mathbf{A}}^{(1)} \,|\bullet|\, \tilde{\mathbf{X}}^{(1)}\big) \,|\otimes|\, \cdots \,|\otimes|\, \big(\tilde{\mathbf{X}}^{(N)} \,|\bullet|\, \tilde{\mathbf{A}}^{(N)} \,|\bullet|\, \tilde{\mathbf{X}}^{(N)}\big)$; $\tilde{\mathbf{Z}}^{(n)} = \tilde{\mathbf{X}}^{(n)} \,|\bullet|\, \tilde{\mathbf{A}}^{(n)} \,|\bullet|\, \tilde{\mathbf{X}}^{(n)} \in \mathbb{R}^{R^x_{n-1} R^A_{n-1} R^x_{n-1} \times R^x_n R^A_n R^x_n}$, with blocks (entries) $z^{(n)}_{s_{n-1}, s_n} = \big\langle \mathbf{x}^{(n)}_{r^x_{n-1}, r^x_n},\, \mathbf{A}^{(n)}_{r^A_{n-1}, r^A_n} \mathbf{x}^{(n)}_{r^y_{n-1}, r^y_n} \big\rangle$ ($s_n = \overline{r^x_n r^A_n r^y_n}$)


Algorithm 10: Computation of a Matrix-by-Vector Product in the TT Format
Input: Matrix $\mathbf{A} \in \mathbb{R}^{I \times J}$ and vector $\mathbf{x} \in \mathbb{R}^{J}$ in their respective TT formats, $\mathbf{A} = \langle\langle \mathbf{A}^{(1)}, \mathbf{A}^{(2)}, \ldots, \mathbf{A}^{(N)} \rangle\rangle \in \mathbb{R}^{I_1 J_1 \times I_2 J_2 \times \cdots \times I_N J_N}$ and $\mathbf{X} = \langle\langle \mathbf{X}^{(1)}, \mathbf{X}^{(2)}, \ldots, \mathbf{X}^{(N)} \rangle\rangle \in \mathbb{R}^{J_1 \times J_2 \times \cdots \times J_N}$, with TT-cores $\mathbf{X}^{(n)} \in \mathbb{R}^{R_{n-1} \times J_n \times R_n}$ and $\mathbf{A}^{(n)} \in \mathbb{R}^{R^A_{n-1} \times I_n \times J_n \times R^A_n}$
Output: Matrix-by-vector product $\mathbf{y} = \mathbf{A}\mathbf{x}$ in the TT format, $\mathbf{Y} = \langle\langle \mathbf{Y}^{(1)}, \mathbf{Y}^{(2)}, \ldots, \mathbf{Y}^{(N)} \rangle\rangle \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, with cores $\mathbf{Y}^{(n)} \in \mathbb{R}^{R^Y_{n-1} \times I_n \times R^Y_n}$
1: for $n = 1$ to $N$ do
2:   for $i_n = 1$ to $I_n$ do
3:     $\mathbf{Y}^{(n)}_{i_n} = \sum_{j_n=1}^{J_n} \mathbf{A}^{(n)}_{i_n, j_n} \otimes_L \mathbf{X}^{(n)}_{j_n}$
4:   end for
5: end for
6: return Vector $\mathbf{y} \in \mathbb{R}^{I_1 I_2 \cdots I_N}$ in the TT format $\mathbf{Y} = \langle\langle \mathbf{Y}^{(1)}, \mathbf{Y}^{(2)}, \ldots, \mathbf{Y}^{(N)} \rangle\rangle$

formats. This is illustrated in Figure 4.9(b) for the corresponding TT-cores defined as

$\mathbf{A}^{(n)} \in \mathbb{R}^{P_{n-1} \times I_n \times J_n \times P_n}$
$\mathbf{X}^{(n)} \in \mathbb{R}^{R_{n-1} \times J_n \times K_n \times R_n}$
$\mathbf{Y}^{(n)} \in \mathbb{R}^{Q_{n-1} \times I_n \times K_n \times Q_n}.$

It is straightforward to show that when the matrices $\mathbf{A} \in \mathbb{R}^{I \times J}$ and $\mathbf{X} \in \mathbb{R}^{J \times K}$ are represented in their TT formats, they can be expressed via a strong Kronecker product of block matrices as $\mathbf{A} = \mathbf{A}^{(1)} \,|\otimes|\, \mathbf{A}^{(2)} \,|\otimes|\, \cdots \,|\otimes|\, \mathbf{A}^{(N)}$ and $\mathbf{X} = \mathbf{X}^{(1)} \,|\otimes|\, \mathbf{X}^{(2)} \,|\otimes|\, \cdots \,|\otimes|\, \mathbf{X}^{(N)}$, where the factor matrices are $\mathbf{A}^{(n)} \in \mathbb{R}^{P_{n-1} I_n \times J_n P_n}$ and $\mathbf{X}^{(n)} \in \mathbb{R}^{R_{n-1} J_n \times K_n R_n}$. Then the matrix $\mathbf{Y} = \mathbf{A}\mathbf{X}$ can also be expressed via the strong Kronecker products, $\mathbf{Y} = \mathbf{Y}^{(1)} \,|\otimes|\, \cdots \,|\otimes|\, \mathbf{Y}^{(N)}$, where $\mathbf{Y}^{(n)} = \mathbf{A}^{(n)} \,|\bullet|\, \mathbf{X}^{(n)} \in \mathbb{R}^{Q_{n-1} I_n \times K_n Q_n}$, $(n = 1, 2, \ldots, N)$, with blocks $\mathbf{Y}^{(n)}_{q_{n-1}, q_n} = \mathbf{A}^{(n)}_{p_{n-1}, p_n} \mathbf{X}^{(n)}_{r_{n-1}, r_n}$, where $Q_n = R_n P_n$ and $q_n = \overline{p_n r_n}$, $\forall n$.


Similarly, a quadratic form, $z = \mathbf{x}^{\mathrm{T}}\mathbf{A}\mathbf{x}$, for a huge symmetric matrix $\mathbf{A}$, can be computed by first computing (in TT formats) the vector $\mathbf{y} = \mathbf{A}\mathbf{x}$, followed by the inner product $\mathbf{x}^{\mathrm{T}}\mathbf{y}$.

Basic operations in the TT format are summarized in Table 4.5, while Table 4.6 presents these operations expressed via strong Kronecker and C products of block matrices of TT-cores. For more advanced and sophisticated operations in TT/QTT formats, see (Kazeev et al., 2013a,b; Lee and Cichocki, 2016c).

4.6 Algorithms for TT Decompositions

We have shown that a major advantage of the TT decomposition is the existence of efficient algorithms for an exact representation of higher-order tensors and/or their low-rank approximate representations with a prescribed accuracy. Similarly to the quasi-best approximation property of the HOSVD, the TT approximation $\hat{\mathbf{X}} = \langle\langle \hat{\mathbf{X}}^{(1)}, \hat{\mathbf{X}}^{(2)}, \ldots, \hat{\mathbf{X}}^{(N)} \rangle\rangle \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ (with core tensors denoted by $\hat{\mathbf{X}}^{(n)} = \mathbf{G}^{(n)}$), obtained by the TT-SVD algorithm, satisfies the following inequality

$\|\mathbf{X} - \hat{\mathbf{X}}\|_2^2 \;\le\; \sum_{n=1}^{N-1} \; \sum_{j=R_n+1}^{I_n} \sigma_j^2\big(\mathbf{X}_{\langle n \rangle}\big),$   (4.33)

where the $\ell_2$-norm of a tensor is defined via its vectorization and $\sigma_j(\mathbf{X}_{\langle n \rangle})$ denotes the $j$th largest singular value of the unfolding matrix $\mathbf{X}_{\langle n \rangle}$ (Oseledets, 2011).

The two basic approaches to efficiently perform TT decompositions are based on: (1) low-rank matrix factorizations (LRMF), and (2) constrained Tucker-2 decompositions.

4.6.1 Sequential SVD/LRMF Algorithms

The most important algorithm for the TT decomposition is the TT-SVD algorithm (see Algorithm 11) (Vidal, 2003; Oseledets and Tyrtyshnikov, 2009), which applies the truncated SVD sequentially to the unfolding matrices, as illustrated in Figure 4.12. Instead of the SVD, alternative and efficient LRMF algorithms can be used (Corona et al., 2015),



Figure 4.12: The TT-SVD algorithm for a 5th-order data tensor using truncated SVD. Instead of the SVD, any alternative LRMF algorithm can be employed, such as randomized SVD, RPCA, CUR/CA, NMF, SCA, ICA. Top panel: a 5th-order tensor $\mathbf{X}$ of size $I_1 \times I_2 \times \cdots \times I_5$ is first reshaped into a long matrix $\mathbf{M}_1$ of size $I_1 \times I_2 \cdots I_5$. Second panel: the truncated SVD is performed to produce the low-rank matrix factorization $\mathbf{M}_1 \cong \mathbf{U}_1 \mathbf{S}_1 \mathbf{V}_1^{\mathrm{T}}$, with the $I_1 \times R_1$ factor matrix $\mathbf{U}_1$ and the $R_1 \times I_2 \cdots I_5$ matrix $\mathbf{S}_1 \mathbf{V}_1^{\mathrm{T}}$. Third panel: the matrix $\mathbf{U}_1$ becomes the first core $\mathbf{X}^{(1)} \in \mathbb{R}^{1 \times I_1 \times R_1}$, while the matrix $\mathbf{S}_1 \mathbf{V}_1^{\mathrm{T}}$ is reshaped into the $R_1 I_2 \times I_3 I_4 I_5$ matrix $\mathbf{M}_2$. Remaining panels: perform the truncated SVD $\mathbf{M}_2 \cong \mathbf{U}_2 \mathbf{S}_2 \mathbf{V}_2^{\mathrm{T}}$, reshape $\mathbf{U}_2$ into an $R_1 \times I_2 \times R_2$ core $\mathbf{X}^{(2)}$, and repeat the procedure until all five cores are extracted (bottom panel). The same procedure applies to higher-order tensors of any order.

see also Algorithm 12). For example, in (Oseledets and Tyrtyshnikov, 2010) a new approximate formula for the TT decomposition is proposed, where an $N$th-order data tensor $\mathbf{X}$ is interpolated using a special form of cross-approximation. In fact, the TT-Cross-Approximation is analogous to the TT-SVD algorithm, but uses adaptive cross-approximation instead of the computationally more expensive SVD. The complexity


Algorithm 11: TT-SVD Decomposition using truncated SVD (tSVD) or randomized SVD (rSVD) (Vidal, 2003; Oseledets, 2011)
Input: $N$th-order tensor $\mathbf{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ and approximation accuracy $\varepsilon$
Output: Approximate representation of the tensor in the TT format, $\hat{\mathbf{X}} = \langle\langle \hat{\mathbf{X}}^{(1)}, \hat{\mathbf{X}}^{(2)}, \ldots, \hat{\mathbf{X}}^{(N)} \rangle\rangle$, such that $\|\mathbf{X} - \hat{\mathbf{X}}\|_F \le \varepsilon$
1: Unfold the tensor $\mathbf{X}$ in mode-1: $\mathbf{M}_1 = \mathbf{X}_{(1)}$
2: Initialization $R_0 = 1$
3: for $n = 1$ to $N-1$ do
4:   Perform the tSVD: $[\mathbf{U}_n, \mathbf{S}_n, \mathbf{V}_n] = \mathrm{tSVD}(\mathbf{M}_n, \varepsilon/\sqrt{N-1})$
5:   Estimate the $n$th TT rank: $R_n = \mathrm{size}(\mathbf{U}_n, 2)$
6:   Reshape the orthogonal matrix $\mathbf{U}_n$ into a 3rd-order core: $\hat{\mathbf{X}}^{(n)} = \mathrm{reshape}(\mathbf{U}_n, [R_{n-1}, I_n, R_n])$
7:   Reshape the matrix $\mathbf{V}_n$ into a matrix: $\mathbf{M}_{n+1} = \mathrm{reshape}\big(\mathbf{S}_n\mathbf{V}_n^{\mathrm{T}}, [R_n I_{n+1}, \prod_{p=n+2}^{N} I_p]\big)$
8: end for
9: Construct the last core as $\hat{\mathbf{X}}^{(N)} = \mathrm{reshape}(\mathbf{M}_N, [R_{N-1}, I_N, 1])$
10: return $\langle\langle \hat{\mathbf{X}}^{(1)}, \hat{\mathbf{X}}^{(2)}, \ldots, \hat{\mathbf{X}}^{(N)} \rangle\rangle$.

of the cross-approximation algorithms scales linearly with the order $N$ of a data tensor.
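For illustration, a compact MATLAB sketch of the TT-SVD procedure of Algorithm 11 is given below; here the tolerance eps is treated as a relative accuracy (the per-step threshold is $\varepsilon\|\mathbf{X}\|_F/\sqrt{N-1}$) and the cores are returned as a cell array — both are conventions chosen for this sketch, and the function name tt_svd is hypothetical.

function Xc = tt_svd(X, eps)
% TT-SVD decomposition by sequential truncated SVDs of the unfoldings.
sz = size(X);  N = numel(sz);
delta = (eps / sqrt(N-1)) * norm(X(:));       % absolute threshold per step
Xc = cell(1, N);
R  = 1;                                       % R_0 = 1
M  = X;
for n = 1:N-1
    M = reshape(M, R*sz(n), []);              % current unfolding
    [U, S, V] = svd(M, 'econ');
    s  = diag(S);
    % smallest rank whose discarded tail energy is below delta^2
    Rn = find(sqrt(cumsum(s.^2, 'reverse')) > delta, 1, 'last');
    if isempty(Rn), Rn = 1; end
    Xc{n} = reshape(U(:, 1:Rn), R, sz(n), Rn);
    M = S(1:Rn, 1:Rn) * V(:, 1:Rn)';          % carry the remainder forward
    R = Rn;
end
Xc{N} = reshape(M, R, sz(N), 1);
end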

4.6.2 Tucker-2/PVD Algorithms for Large-Scale TT Decompositions

The key idea in this approach is to reshape any $N$th-order data tensor, $\mathbf{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ with $N > 3$, into a suitable 3rd-order tensor, e.g., $\tilde{\mathbf{X}} \in \mathbb{R}^{I_1 \times I_N \times I_2 I_3 \cdots I_{N-1}}$, in order to apply the Tucker-2 decomposition as follows (see Algorithm 8 and Figure 4.13(a))

$\tilde{\mathbf{X}} \cong \mathbf{G}^{(2,N-1)} \times_1 \mathbf{X}^{(1)} \times_2 \mathbf{X}^{(N)} = \mathbf{X}^{(1)} \times^1 \mathbf{G}^{(2,N-1)} \times^1 \mathbf{X}^{(N)},$   (4.34)

which, by using the frontal slices of the involved tensors, can also be expressed in the matrix form

$\mathbf{X}_{k_1} = \mathbf{X}^{(1)} \mathbf{G}_{k_1} \mathbf{X}^{(N)\,\mathrm{T}}, \quad k_1 = 1, 2, \ldots, I_2 I_3 \cdots I_{N-1}.$   (4.35)

Such representations allow us to compute the tensor $\mathbf{G}^{(2,N-1)}$, the first TT-core $\mathbf{X}^{(1)}$, and the last TT-core $\mathbf{X}^{(N)}$. The procedure can


Algorithm 12: TT Decomposition using any efficient LRMF
Input: Tensor $\mathbf{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ and the approximation accuracy $\varepsilon$
Output: Approximate tensor representation in the TT format $\hat{\mathbf{X}} = \langle\langle \hat{\mathbf{X}}^{(1)}, \hat{\mathbf{X}}^{(2)}, \ldots, \hat{\mathbf{X}}^{(N)} \rangle\rangle$
1: Initialization $R_0 = 1$
2: Unfold the tensor $\mathbf{X}$ in mode-1 as $\mathbf{M}_1 = \mathbf{X}_{(1)}$
3: for $n = 1$ to $N-1$ do
4:   Perform LRMF, e.g., CUR, RPCA, ...: $[\mathbf{A}_n, \mathbf{B}_n] = \mathrm{LRMF}(\mathbf{M}_n, \varepsilon)$, i.e., $\mathbf{M}_n \cong \mathbf{A}_n \mathbf{B}_n^{\mathrm{T}}$
5:   Estimate the $n$th TT rank: $R_n = \mathrm{size}(\mathbf{A}_n, 2)$
6:   Reshape the matrix $\mathbf{A}_n$ into a 3rd-order core, as $\hat{\mathbf{X}}^{(n)} = \mathrm{reshape}(\mathbf{A}_n, [R_{n-1}, I_n, R_n])$
7:   Reshape the matrix $\mathbf{B}_n$ into the $(n+1)$th unfolding matrix: $\mathbf{M}_{n+1} = \mathrm{reshape}\big(\mathbf{B}_n^{\mathrm{T}}, [R_n I_{n+1}, \prod_{p=n+2}^{N} I_p]\big)$
8: end for
9: Construct the last core as $\hat{\mathbf{X}}^{(N)} = \mathrm{reshape}(\mathbf{M}_N, [R_{N-1}, I_N, 1])$
10: return TT-cores: $\langle\langle \hat{\mathbf{X}}^{(1)}, \hat{\mathbf{X}}^{(2)}, \ldots, \hat{\mathbf{X}}^{(N)} \rangle\rangle$.

be repeated sequentially for the reshaped tensors $\tilde{\mathbf{G}}_n = \mathbf{G}^{(n+1, N-n)}$ for $n = 1, 2, \ldots$, in order to extract subsequent TT-cores in their matricized forms, as illustrated in Figure 4.13(b). See also the detailed step-by-step procedure shown in Figure 4.13(c).

Such a simple recursive procedure for TT decomposition can be used in conjunction with any efficient algorithm for Tucker-2/PVD decompositions or the nonnegative Tucker-2 decomposition (NTD-2) (see also Section 3).

4.6.3 Tensor Train Rounding – TT Recompression

Mathematical operations in the TT format produce core tensors with ranks which are not guaranteed to be optimal with respect to the desired approximation accuracy. For example, matrix-by-vector or matrix-by-matrix products considerably increase the TT ranks, which quickly become computationally prohibitive, so that truncation or low-rank TT approximation is necessary for mathematical tractability. To this end, TT rounding (also called truncation or recompression) may be used as a post-processing procedure to reduce the TT ranks. The TT



Figure 4.13: TT decomposition based on the Tucker-2/PVD model. (a) Extraction of the first and the last core. (b) The procedure can be repeated sequentially for reshaped 3rd-order tensors $\tilde{\mathbf{G}}_n$ (for $n = 2, 3, \ldots$ and $p = N-1, N-2, \ldots$). (c) Illustration of a TT decomposition for a 5th-order data tensor, using an algorithm based on sequential Tucker-2/PVD decompositions.


Algorithm 13: TT Rounding (Recompression) (Oseledets, 2011)
Input: $N$th-order tensor $\mathbf{X} = \langle\langle \mathbf{X}^{(1)}, \mathbf{X}^{(2)}, \ldots, \mathbf{X}^{(N)} \rangle\rangle \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, in a TT format with an overestimated TT rank, $\mathbf{r}_{TT} = \{R_1, R_2, \ldots, R_{N-1}\}$, and TT-cores $\mathbf{X}^{(n)} \in \mathbb{R}^{R_{n-1} \times I_n \times R_n}$, absolute tolerance $\varepsilon$, and maximum rank $R_{\max}$
Output: $N$th-order tensor $\hat{\mathbf{X}}$ with a reduced TT rank; the cores are rounded (reduced) according to the input tolerance $\varepsilon$ and/or ranks bounded by $R_{\max}$, such that $\|\mathbf{X} - \hat{\mathbf{X}}\|_F \le \varepsilon \|\mathbf{X}\|_F$
1: Initialization $\hat{\mathbf{X}} = \mathbf{X}$ and $\delta = \varepsilon/\sqrt{N-1}$
2: for $n = 1$ to $N-1$ do
3:   QR decomposition $\mathbf{X}^{(n)}_{<2>} = \mathbf{Q}_n \mathbf{R}$, with $\mathbf{X}^{(n)}_{<2>} \in \mathbb{R}^{R_{n-1} I_n \times R_n}$
4:   Replace the cores $\mathbf{X}^{(n)}_{<2>} = \mathbf{Q}_n$ and $\mathbf{X}^{(n+1)}_{<1>} \leftarrow \mathbf{R}\, \mathbf{X}^{(n+1)}_{<1>}$, with $\mathbf{X}^{(n+1)}_{<1>} \in \mathbb{R}^{R_n \times I_{n+1} R_{n+1}}$
5: end for
6: for $n = N$ to $2$ do
7:   Perform the $\delta$-truncated SVD $\mathbf{X}^{(n)}_{<1>} \cong \mathbf{U}\, \mathrm{diag}\{\boldsymbol{\sigma}\}\, \mathbf{V}^{\mathrm{T}}$
8:   Determine the minimum rank $\hat{R}_{n-1}$ such that $\sum_{r > \hat{R}_{n-1}} \sigma_r^2 \le \delta^2 \|\boldsymbol{\sigma}\|_2^2$
9:   Replace the cores $\hat{\mathbf{X}}^{(n-1)}_{<2>} \leftarrow \hat{\mathbf{X}}^{(n-1)}_{<2>} \hat{\mathbf{U}}\, \mathrm{diag}\{\hat{\boldsymbol{\sigma}}\}$ and $\hat{\mathbf{X}}^{(n)}_{<1>} = \hat{\mathbf{V}}^{\mathrm{T}}$
10: end for
11: return $N$th-order tensor $\hat{\mathbf{X}} = \langle\langle \hat{\mathbf{X}}^{(1)}, \hat{\mathbf{X}}^{(2)}, \ldots, \hat{\mathbf{X}}^{(N)} \rangle\rangle \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, with reduced cores $\hat{\mathbf{X}}^{(n)} \in \mathbb{R}^{\hat{R}_{n-1} \times I_n \times \hat{R}_n}$

rounding algorithms are typically implemented via QR/SVD, with the aim of approximating, with a prescribed accuracy, the TT core tensors $\mathbf{G}^{(n)} = \mathbf{X}^{(n)}$ by other core tensors with the minimum possible TT ranks (see Algorithm 13). Note that TT rounding is mathematically equivalent to the TT-SVD, but is more efficient owing to the use of the TT format.

The complexity of TT-rounding procedures is only $\mathcal{O}(NIR^3)$, since all operations are performed in the TT format, which requires the SVD to be computed only for a relatively small matricized core tensor at each iteration. A similar approach has been developed for the HT format (Grasedyck, 2010; Grasedyck et al., 2013; Kressner and Tobler, 2014; Espig et al., 2012).
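A compact MATLAB sketch of the QR/SVD-based rounding procedure of Algorithm 13 is given below (cell-array core storage as in the earlier sketches, eps interpreted as a relative accuracy, function name hypothetical); it performs a left-to-right QR sweep followed by a right-to-left truncated-SVD sweep.

function Xc = tt_round(Xc, eps)
% TT rounding (recompression): ||X - Xhat||_F <= eps * ||X||_F.
N = numel(Xc);
% ---- left-to-right orthogonalization sweep (QR) ----
for n = 1:N-1
    R1 = size(Xc{n}, 1); In = size(Xc{n}, 2); R2 = size(Xc{n}, 3);
    [Q, Rm] = qr(reshape(Xc{n}, R1*In, R2), 0);        % thin QR
    Rnew  = size(Q, 2);
    Xc{n} = reshape(Q, R1, In, Rnew);
    R3 = size(Xc{n+1}, 3);
    Xc{n+1} = reshape(Rm * reshape(Xc{n+1}, R2, []), Rnew, size(Xc{n+1}, 2), R3);
end
% after the sweep, the norm of X is carried by the last core
delta = eps * norm(reshape(Xc{N}, [], 1)) / sqrt(N-1);
% ---- right-to-left truncation sweep (SVD) ----
for n = N:-1:2
    R1 = size(Xc{n}, 1); In = size(Xc{n}, 2); R2 = size(Xc{n}, 3);
    [U, S, V] = svd(reshape(Xc{n}, R1, In*R2), 'econ');
    s  = diag(S);
    Rn = find(sqrt(cumsum(s.^2, 'reverse')) > delta, 1, 'last');
    if isempty(Rn), Rn = 1; end
    Xc{n} = reshape(V(:, 1:Rn)', Rn, In, R2);
    US = U(:, 1:Rn) * S(1:Rn, 1:Rn);                   % absorb into the previous core
    Xc{n-1} = reshape(reshape(Xc{n-1}, [], R1) * US, size(Xc{n-1}, 1), size(Xc{n-1}, 2), Rn);
end
end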


4.6.4 Orthogonalization of Tensor Train Network

The orthogonalization of core tensors is an essential procedure in many algorithms for the TT formats (Oseledets, 2011; Holtz et al., 2012; Dolgov et al., 2014; Dolgov, 2014; Kressner et al., 2014a; Steinlechner, 2015, 2016).

For convenience, we divide a TT network, which represents a tensor $\hat{\mathbf{X}} = \langle\langle \hat{\mathbf{X}}^{(1)}, \hat{\mathbf{X}}^{(2)}, \ldots, \hat{\mathbf{X}}^{(N)} \rangle\rangle \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, into sub-trains. In this way, a large-scale task is replaced by easier-to-handle sub-tasks, whereby the aim is to extract a specific TT core or its slices from the whole TT network. For this purpose, the TT sub-trains can be defined as follows

$\hat{\mathbf{X}}^{<n} = \langle\langle \hat{\mathbf{X}}^{(1)}, \hat{\mathbf{X}}^{(2)}, \ldots, \hat{\mathbf{X}}^{(n-1)} \rangle\rangle \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_{n-1} \times R_{n-1}}$   (4.36)
$\hat{\mathbf{X}}^{>n} = \langle\langle \hat{\mathbf{X}}^{(n+1)}, \hat{\mathbf{X}}^{(n+2)}, \ldots, \hat{\mathbf{X}}^{(N)} \rangle\rangle \in \mathbb{R}^{R_n \times I_{n+1} \times \cdots \times I_N}$   (4.37)

while the corresponding unfolding matrices, also called interface matrices, are defined by

$\hat{\mathbf{X}}_{\le n} \in \mathbb{R}^{I_1 I_2 \cdots I_n \times R_n}, \qquad \hat{\mathbf{X}}_{>n} \in \mathbb{R}^{R_n \times I_{n+1} \cdots I_N}.$   (4.38)

The left and right unfoldings of the cores are defined as

$\hat{\mathbf{X}}^{(n)}_L = \hat{\mathbf{X}}^{(n)}_{<2>} \in \mathbb{R}^{R_{n-1} I_n \times R_n} \quad \text{and} \quad \hat{\mathbf{X}}^{(n)}_R = \hat{\mathbf{X}}^{(n)}_{<1>} \in \mathbb{R}^{R_{n-1} \times I_n R_n}.$

The $n$-orthogonality of tensors. An $N$th-order tensor in a TT format, $\hat{\mathbf{X}} = \langle\langle \hat{\mathbf{X}}^{(1)}, \ldots, \hat{\mathbf{X}}^{(N)} \rangle\rangle$, is called $n$-orthogonal, with $1 \le n \le N$, if

$\big(\hat{\mathbf{X}}^{(m)}_L\big)^{\mathrm{T}} \hat{\mathbf{X}}^{(m)}_L = \mathbf{I}_{R_m}, \quad m = 1, \ldots, n-1$   (4.39)
$\hat{\mathbf{X}}^{(m)}_R \big(\hat{\mathbf{X}}^{(m)}_R\big)^{\mathrm{T}} = \mathbf{I}_{R_{m-1}}, \quad m = n+1, \ldots, N.$   (4.40)

The tensor is called left-orthogonal if $n = N$ and right-orthogonal if $n = 1$.

When considering the $n$th TT core, it is usually assumed that all cores to the left are left-orthogonalized, and all cores to the


Algorithm 14: Left-orthogonalization, right-orthogonalization and $n$-orthogonalization of a tensor in the TT format
Input: $N$th-order tensor $\hat{\mathbf{X}} = \langle\langle \hat{\mathbf{X}}^{(1)}, \hat{\mathbf{X}}^{(2)}, \ldots, \hat{\mathbf{X}}^{(N)} \rangle\rangle \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, with TT cores $\hat{\mathbf{X}}^{(n)} \in \mathbb{R}^{R_{n-1} \times I_n \times R_n}$ and $R_0 = R_N = 1$
Output: The cores $\hat{\mathbf{X}}^{(1)}, \ldots, \hat{\mathbf{X}}^{(n-1)}$ become left-orthogonal, while the remaining cores are right-orthogonal, except for the core $\hat{\mathbf{X}}^{(n)}$
1: for $m = 1$ to $n-1$ do
2:   Perform the QR decomposition $[\mathbf{Q}, \mathbf{R}] \leftarrow \mathrm{qr}(\hat{\mathbf{X}}^{(m)}_L)$ of the unfolded cores $\hat{\mathbf{X}}^{(m)}_L \in \mathbb{R}^{R_{m-1} I_m \times R_m}$
3:   Replace the cores $\hat{\mathbf{X}}^{(m)}_L \leftarrow \mathbf{Q}$ and $\hat{\mathbf{X}}^{(m+1)} \leftarrow \hat{\mathbf{X}}^{(m+1)} \times_1 \mathbf{R}$
4: end for
5: for $m = N$ to $n+1$ do
6:   Perform the QR decomposition $[\mathbf{Q}, \mathbf{R}] \leftarrow \mathrm{qr}\big((\hat{\mathbf{X}}^{(m)}_R)^{\mathrm{T}}\big)$ of the unfolded cores $\hat{\mathbf{X}}^{(m)}_R \in \mathbb{R}^{R_{m-1} \times I_m R_m}$
7:   Replace the cores $\hat{\mathbf{X}}^{(m)}_R \leftarrow \mathbf{Q}^{\mathrm{T}}$ and $\hat{\mathbf{X}}^{(m-1)} \leftarrow \hat{\mathbf{X}}^{(m-1)} \times_3 \mathbf{R}^{\mathrm{T}}$
8: end for
9: return Left-orthogonal TT cores with $\big(\hat{\mathbf{X}}^{(m)}_L\big)^{\mathrm{T}} \hat{\mathbf{X}}^{(m)}_L = \mathbf{I}_{R_m}$ for $m = 1, 2, \ldots, n-1$ and right-orthogonal cores $\hat{\mathbf{X}}^{(m)}_R \big(\hat{\mathbf{X}}^{(m)}_R\big)^{\mathrm{T}} = \mathbf{I}_{R_{m-1}}$ for $m = N, N-1, \ldots, n+1$.

right are right-orthogonalized. Notice that if a TT-tensor⁷ $\hat{\mathbf{X}}$ is $n$-orthogonal, then the "left" and "right" interface matrices have orthonormal columns and rows, respectively, that is

$\big(\hat{\mathbf{X}}_{<n}\big)^{\mathrm{T}} \hat{\mathbf{X}}_{<n} = \mathbf{I}_{R_{n-1}}, \qquad \hat{\mathbf{X}}_{>n} \big(\hat{\mathbf{X}}_{>n}\big)^{\mathrm{T}} = \mathbf{I}_{R_n}.$   (4.41)

A tensor in a TT format can be orthogonalized efficiently using recursive QR and LQ decompositions (see Algorithm 14). From the above definition, for $n = N$ the algorithm performs left-orthogonalization and for $n = 1$ right-orthogonalization of the whole TT network.

⁷By a TT-tensor we refer to a tensor represented in the TT format.
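The recursive QR-based n-orthogonalization of Algorithm 14 can be sketched in MATLAB as follows (cell-array core storage and the function name are assumptions of this sketch; the LQ step is implemented via a QR decomposition of the transposed right unfolding).

function Xc = tt_orthogonalize(Xc, n)
% n-orthogonalization of a TT network: cores 1..n-1 become left-orthogonal,
% cores n+1..N become right-orthogonal, core n is left untouched.
N = numel(Xc);
for m = 1:n-1                                   % left-to-right QR sweep
    R1 = size(Xc{m}, 1); Im = size(Xc{m}, 2); R2 = size(Xc{m}, 3);
    [Q, R] = qr(reshape(Xc{m}, R1*Im, R2), 0);
    Rnew  = size(Q, 2);
    Xc{m} = reshape(Q, R1, Im, Rnew);
    Xc{m+1} = reshape(R * reshape(Xc{m+1}, R2, []), ...
                      Rnew, size(Xc{m+1}, 2), size(Xc{m+1}, 3));
end
for m = N:-1:n+1                                % right-to-left sweep
    R1 = size(Xc{m}, 1); Im = size(Xc{m}, 2); R2 = size(Xc{m}, 3);
    [Q, R] = qr(reshape(Xc{m}, R1, Im*R2)', 0); % QR of the transposed right unfolding
    Rnew  = size(Q, 2);
    Xc{m} = reshape(Q', Rnew, Im, R2);
    Xc{m-1} = reshape(reshape(Xc{m-1}, [], R1) * R', ...
                      size(Xc{m-1}, 1), size(Xc{m-1}, 2), Rnew);
end
end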


4.6.5 Improved TT Decomposition Algorithm – Alternating Single Core Update (ASCU)

Finally, we present an efficient algorithm for TT decomposition, referred to as the Alternating Single Core Update (ASCU), which sequentially optimizes a single TT-core tensor while keeping the other TT-cores fixed, in a manner similar to the modified ALS (Phan et al., 2016).

Assume that the TT-tensor $\hat{\mathbf{X}} = \langle\langle \hat{\mathbf{X}}^{(1)}, \hat{\mathbf{X}}^{(2)}, \ldots, \hat{\mathbf{X}}^{(N)} \rangle\rangle$ is left- and right-orthogonalized up to $\hat{\mathbf{X}}^{(n)}$, i.e., the left unfoldings $\hat{\mathbf{X}}^{(k)}_{<2>}$ for $k = 1, \ldots, n-1$ have orthonormal columns, and the right unfoldings $\hat{\mathbf{X}}^{(m)}_{(1)}$ for $m = n+1, \ldots, N$ have orthonormal rows. Then the Frobenius norm of the TT-tensor $\hat{\mathbf{X}}$ is equal to the Frobenius norm of $\hat{\mathbf{X}}^{(n)}$, that is, $\|\hat{\mathbf{X}}\|_F^2 = \|\hat{\mathbf{X}}^{(n)}\|_F^2$, so that the Frobenius norm of the approximation error between a data tensor $\mathbf{X}$ and its approximate representation in the TT format $\hat{\mathbf{X}}$ can be written as

$J(\hat{\mathbf{X}}^{(n)}) = \|\mathbf{X} - \hat{\mathbf{X}}\|_F^2$   (4.42)
$\quad = \|\mathbf{X}\|_F^2 + \|\hat{\mathbf{X}}\|_F^2 - 2\langle \mathbf{X}, \hat{\mathbf{X}} \rangle = \|\mathbf{X}\|_F^2 + \|\hat{\mathbf{X}}^{(n)}\|_F^2 - 2\langle \mathbf{C}^{(n)}, \hat{\mathbf{X}}^{(n)} \rangle$
$\quad = \|\mathbf{X}\|_F^2 - \|\mathbf{C}^{(n)}\|_F^2 + \|\mathbf{C}^{(n)} - \hat{\mathbf{X}}^{(n)}\|_F^2, \quad n = 1, \ldots, N,$

where $\mathbf{C}^{(n)} \in \mathbb{R}^{R_{n-1} \times I_n \times R_n}$ represents a tensor contraction of $\mathbf{X}$ and $\hat{\mathbf{X}}$ along all modes but mode-$n$, as illustrated in Figure 4.14. The tensor $\mathbf{C}^{(n)}$ can be efficiently computed through left contractions along the first $(n-1)$ modes and right contractions along the last $(N-n)$ modes, expressed as

$\mathbf{L}^{<n} = \hat{\mathbf{X}}^{<n} \,\bullet^{\,n-1}\, \mathbf{X}, \qquad \mathbf{C}^{(n)} = \mathbf{L}^{<n} \,\bullet_{N-n}\, \hat{\mathbf{X}}^{>n}.$   (4.43)

The symbols $\bullet^{\,n}$ and $\bullet_{m}$ stand for the tensor contractions between two $N$th-order tensors along their first $n$ modes and last $m = N-n$ modes, respectively.

The optimization problem in (4.42) is usually performed subject to the following constraint

$\|\mathbf{X} - \hat{\mathbf{X}}\|_F^2 \le \varepsilon^2$   (4.44)


such that the TT-rank of $\hat{\mathbf{X}}$ is minimum.

Observe that the constraint in (4.44) for left- and right-orthogonalized TT-cores is equivalent to the set of sub-constraints

$\|\mathbf{C}^{(n)} - \hat{\mathbf{X}}^{(n)}\|_F^2 \le \varepsilon_n^2, \quad n = 1, \ldots, N,$   (4.45)

whereby the $n$th core $\hat{\mathbf{X}}^{(n)} \in \mathbb{R}^{R_{n-1} \times I_n \times R_n}$ should have minimum ranks $R_{n-1}$ and $R_n$. Furthermore, $\varepsilon_n^2 = \varepsilon^2 - \|\mathbf{X}\|_F^2 + \|\mathbf{C}^{(n)}\|_F^2$ is assumed to be non-negative. Finally, we can formulate the following sequential optimization problem

$\min \;(R_{n-1} R_n), \quad \text{s.t.} \quad \|\mathbf{C}^{(n)} - \hat{\mathbf{X}}^{(n)}\|_F^2 \le \varepsilon_n^2, \quad n = 1, 2, \ldots, N.$   (4.46)

By expressing the TT-core tensor $\hat{\mathbf{X}}^{(n)}$ as a TT-tensor of three factors, i.e., in a Tucker-2 format given by

$\hat{\mathbf{X}}^{(n)} = \mathbf{A}_n \times^1 \mathbf{X}^{(n)} \times^1 \mathbf{B}_n,$

the above optimization problem with the constraint (4.45) reduces to performing a Tucker-2 decomposition (see Algorithm 8). The aim is to compute $\mathbf{A}_n$, $\mathbf{B}_n$ (orthogonal factor matrices) and a core tensor $\mathbf{X}^{(n)}$ which approximates the tensor $\mathbf{C}^{(n)}$ with a minimum TT-rank-$(\tilde{R}_{n-1}, \tilde{R}_n)$, such that

$\|\mathbf{C}^{(n)} - \mathbf{A}_n \times^1 \mathbf{X}^{(n)} \times^1 \mathbf{B}_n\|_F^2 \le \varepsilon_n^2,$

where $\mathbf{A}_n \in \mathbb{R}^{R_{n-1} \times \tilde{R}_{n-1}}$ and $\mathbf{B}_n \in \mathbb{R}^{\tilde{R}_n \times R_n}$, and the ranks are updated as $R_{n-1} \leftarrow \tilde{R}_{n-1}$ and $R_n \leftarrow \tilde{R}_n$.

Note that the new estimate of $\hat{\mathbf{X}}$ is still of $N$th order, because the factor matrices $\mathbf{A}_n$ and $\mathbf{B}_n$ can be embedded into $\hat{\mathbf{X}}^{(n-1)}$ and $\hat{\mathbf{X}}^{(n+1)}$ as follows

$\hat{\mathbf{X}} = \hat{\mathbf{X}}^{(1)} \times^1 \cdots \times^1 \big(\hat{\mathbf{X}}^{(n-1)} \times^1 \mathbf{A}_n\big) \times^1 \mathbf{X}^{(n)} \times^1 \big(\mathbf{B}_n \times^1 \hat{\mathbf{X}}^{(n+1)}\big) \times^1 \cdots \times^1 \hat{\mathbf{X}}^{(N)}.$

In this way, the three TT-cores $\hat{\mathbf{X}}^{(n-1)}$, $\hat{\mathbf{X}}^{(n)}$ and $\hat{\mathbf{X}}^{(n+1)}$ are updated. Since $\mathbf{A}_n$ and $\mathbf{B}_n^{\mathrm{T}}$ have, respectively, orthonormal columns and rows,



Figure 4.14: Illustration of the contraction of tensors in the Alternating Single Core Update (ASCU) algorithm (see Algorithm 15). All the cores to the left of $\mathbf{X}^{(n)}$ are left-orthogonal and all the cores to its right are right-orthogonal.

the newly adjusted cores $\big(\hat{\mathbf{X}}^{(n-1)} \times^1 \mathbf{A}_n\big)$ and $\big(\mathbf{B}_n \times^1 \hat{\mathbf{X}}^{(n+1)}\big)$ obey the left- and right-orthogonality conditions. Algorithm 15 outlines such a single-core update algorithm based on the Tucker-2 decomposition. In the pseudo-code, the left contracted tensor $\mathbf{L}^{<n}$ is computed efficiently through a progressive contraction of the form (Schollwöck, 2011; Hubig et al., 2015)

$\mathbf{L}^{<n} = \hat{\mathbf{X}}^{(n-1)} \,\bullet^{2}\, \mathbf{L}^{<(n-1)},$   (4.47)

where $\mathbf{L}^{<1} = \mathbf{X}$.

Alternatively, instead of adjusting the two TT ranks, $R_{n-1}$ and $R_n$, of $\hat{\mathbf{X}}^{(n)}$, we can update only one rank, either $R_{n-1}$ or $R_n$, corresponding to the right-to-left or left-to-right update order. Assuming that the core tensors are updated in the left-to-right order, we need to find $\hat{\mathbf{X}}^{(n)}$ which has a minimum rank $R_n$ and satisfies the constraints

$\|\mathbf{C}^{(n)} - \hat{\mathbf{X}}^{(n)} \times^1 \mathbf{B}_n\|_F^2 \le \varepsilon_n^2, \quad n = 1, \ldots, N.$

This problem reduces to the truncated SVD of the mode-$\{1,2\}$ matricization of $\mathbf{C}^{(n)}$ with accuracy $\varepsilon_n^2$, that is

$[\mathbf{C}^{(n)}]_{<2>} \cong \mathbf{U}_n\, \boldsymbol{\Sigma}\, \mathbf{V}_n^{\mathrm{T}},$


Algorithm 15: The Alternating Single-Core Update Algorithm (two-sided rank adjustment) (Phan et al., 2016)
Input: Data tensor $\mathbf{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ and approximation accuracy $\varepsilon$
Output: TT-tensor $\hat{\mathbf{X}} = \hat{\mathbf{X}}^{(1)} \times^1 \hat{\mathbf{X}}^{(2)} \times^1 \cdots \times^1 \hat{\mathbf{X}}^{(N)}$ of minimum TT-rank such that $\|\mathbf{X} - \hat{\mathbf{X}}\|_F^2 \le \varepsilon^2$
1: Initialize $\hat{\mathbf{X}} = \langle\langle \hat{\mathbf{X}}^{(1)}, \hat{\mathbf{X}}^{(2)}, \ldots, \hat{\mathbf{X}}^{(N)} \rangle\rangle$
2: repeat
3:   for $n = 1, 2, \ldots, N-1$ do
4:     Compute the contracted tensor $\mathbf{C}^{(n)} = \mathbf{L}^{<n} \,\bullet_{N-n}\, \hat{\mathbf{X}}^{>n}$
5:     Solve a Tucker-2 decomposition $\|\mathbf{C}^{(n)} - \mathbf{A}_n \times^1 \hat{\mathbf{X}}^{(n)} \times^1 \mathbf{B}_n\|_F^2 \le \varepsilon^2 - \|\mathbf{X}\|_F^2 + \|\mathbf{C}^{(n)}\|_F^2$
6:     Adjust the adjacent cores $\hat{\mathbf{X}}^{(n-1)} \leftarrow \hat{\mathbf{X}}^{(n-1)} \times^1 \mathbf{A}_n$, $\hat{\mathbf{X}}^{(n+1)} \leftarrow \mathbf{B}_n \times^1 \hat{\mathbf{X}}^{(n+1)}$
7:     Perform left-orthogonalization of $\hat{\mathbf{X}}^{(n)}$
8:     Update the left-side contracted tensors $\mathbf{L}^{<n} \leftarrow \mathbf{A}_n^{\mathrm{T}} \times^1 \mathbf{L}^{<n}$, $\mathbf{L}^{<(n+1)} \leftarrow \hat{\mathbf{X}}^{(n)} \,\bullet^{2}\, \mathbf{L}^{<n}$
9:   end for
10:  for $n = N, N-1, \ldots, 2$ do
11:    Compute the contracted tensor $\mathbf{C}^{(n)} = \mathbf{L}^{<n} \,\bullet_{N-n}\, \hat{\mathbf{X}}^{>n}$
12:    Solve a constrained Tucker-2 decomposition $\|\mathbf{C}^{(n)} - \mathbf{A}_n \times^1 \hat{\mathbf{X}}^{(n)} \times^1 \mathbf{B}_n\|_F^2 \le \varepsilon^2 - \|\mathbf{X}\|_F^2 + \|\mathbf{C}^{(n)}\|_F^2$
13:    $\hat{\mathbf{X}}^{(n-1)} \leftarrow \hat{\mathbf{X}}^{(n-1)} \times^1 \mathbf{A}_n$, $\hat{\mathbf{X}}^{(n+1)} \leftarrow \mathbf{B}_n \times^1 \hat{\mathbf{X}}^{(n+1)}$
14:    Perform right-orthogonalization of $\hat{\mathbf{X}}^{(n)}$
15:  end for
16: until a stopping criterion is met
17: return $\langle\langle \hat{\mathbf{X}}^{(1)}, \hat{\mathbf{X}}^{(2)}, \ldots, \hat{\mathbf{X}}^{(N)} \rangle\rangle$.

where $\boldsymbol{\Sigma} = \mathrm{diag}(\sigma_{n,1}, \ldots, \sigma_{n,R_n})$. Here, for the new optimized rank $\tilde{R}_n$, the following holds

$\sum_{r=1}^{\tilde{R}_n} \sigma_{n,r}^2 \;\ge\; \|\mathbf{X}\|_F^2 - \varepsilon^2 \;>\; \sum_{r=1}^{\tilde{R}_n - 1} \sigma_{n,r}^2.$   (4.48)

The core tensor $\hat{\mathbf{X}}^{(n)}$ is then updated by reshaping $\mathbf{U}_n$ into a 3rd-order tensor of size $R_{n-1} \times I_n \times \tilde{R}_n$, while the core $\hat{\mathbf{X}}^{(n+1)}$ needs to be adjusted


Algorithm 16: The Alternating Single-Core Update Algorithm (one-sided rank adjustment) (Phan et al., 2016)
Input: Data tensor $\mathbf{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ and approximation accuracy $\varepsilon$
Output: TT-tensor $\hat{\mathbf{X}} = \hat{\mathbf{X}}^{(1)} \times^1 \hat{\mathbf{X}}^{(2)} \times^1 \cdots \times^1 \hat{\mathbf{X}}^{(N)}$ of minimum TT-rank such that $\|\mathbf{X} - \hat{\mathbf{X}}\|_F^2 \le \varepsilon^2$
1: Initialize the TT-cores $\hat{\mathbf{X}}^{(n)}$, $\forall n$
2: repeat
3:   for $n = 1, 2, \ldots, N-1$ do
4:     Compute the contracted tensor $\mathbf{C}^{(n)} = \mathbf{L}^{<n} \,\bullet_{N-n}\, \hat{\mathbf{X}}^{>n}$
5:     Truncated SVD: $\|[\mathbf{C}^{(n)}]_{<2>} - \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\mathrm{T}}\|_F^2 \le \varepsilon^2 - \|\mathbf{X}\|_F^2 + \|\mathbf{C}^{(n)}\|_F^2$
6:     Update $\hat{\mathbf{X}}^{(n)} = \mathrm{reshape}(\mathbf{U}, [R_{n-1}, I_n, R_n])$
7:     Adjust the adjacent core $\hat{\mathbf{X}}^{(n+1)} \leftarrow (\boldsymbol{\Sigma}\mathbf{V}^{\mathrm{T}}) \times^1 \hat{\mathbf{X}}^{(n+1)}$
8:     Update the left-side contracted tensor $\mathbf{L}^{<(n+1)} \leftarrow \hat{\mathbf{X}}^{(n)} \,\bullet^{2}\, \mathbf{L}^{<n}$
9:   end for
10:  for $n = N, N-1, \ldots, 2$ do
11:    Compute the contracted tensor $\mathbf{C}^{(n)} = \mathbf{L}^{<n} \,\bullet_{N-n}\, \hat{\mathbf{X}}^{>n}$
12:    Truncated SVD: $\|[\mathbf{C}^{(n)}]_{(1)} - \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\mathrm{T}}\|_F^2 \le \varepsilon^2 - \|\mathbf{X}\|_F^2 + \|\mathbf{C}^{(n)}\|_F^2$
13:    $\hat{\mathbf{X}}^{(n)} = \mathrm{reshape}(\mathbf{V}^{\mathrm{T}}, [R_{n-1}, I_n, R_n])$
14:    $\hat{\mathbf{X}}^{(n-1)} \leftarrow \hat{\mathbf{X}}^{(n-1)} \times^1 (\mathbf{U}\boldsymbol{\Sigma})$
15:  end for
16: until a stopping criterion is met
17: return $\langle\langle \hat{\mathbf{X}}^{(1)}, \hat{\mathbf{X}}^{(2)}, \ldots, \hat{\mathbf{X}}^{(N)} \rangle\rangle$.

accordingly as

$\hat{\mathbf{X}}^{(n+1)} \leftarrow \big(\boldsymbol{\Sigma}\mathbf{V}_n^{\mathrm{T}}\big) \times^1 \hat{\mathbf{X}}^{(n+1)}.$   (4.49)

When the algorithm updates the core tensors in the right-to-left order, we update $\hat{\mathbf{X}}^{(n)}$ by using the $\tilde{R}_{n-1}$ leading right singular vectors of the mode-1 matricization of $\mathbf{C}^{(n)}$, and adjust the core $\hat{\mathbf{X}}^{(n-1)}$ accordingly, that is,

$[\mathbf{C}^{(n)}]_{(1)} \cong \mathbf{U}_n\, \boldsymbol{\Sigma}\, \mathbf{V}_n^{\mathrm{T}}$
$\hat{\mathbf{X}}^{(n)} = \mathrm{reshape}\big(\mathbf{V}_n^{\mathrm{T}}, [\tilde{R}_{n-1}, I_n, R_n]\big)$
$\hat{\mathbf{X}}^{(n-1)} \leftarrow \hat{\mathbf{X}}^{(n-1)} \times^1 (\mathbf{U}_n \boldsymbol{\Sigma}).$   (4.50)

To summarise, the ASCU method performs a sequential update of one core and adjusts (or rotates) another core. Hence, it updates two cores at a time (for details, see Algorithm 16).

The ASCU algorithm can be implemented in an even more efficient way if the data tensor X is already given in a TT format (with non-optimal TT ranks for the prescribed accuracy). Detailed MATLAB implementations and other variants of the TT decomposition algorithm are provided in (Phan et al., 2016).


5 Discussion and Conclusions

In Part 1 of this monograph, we have provided a systematic and example-rich guide to the basic properties and applications of tensor network methodologies, and have demonstrated their promise as a tool for the analysis of extreme-scale multidimensional data. Our main aim has been to illustrate that, owing to the intrinsic compression ability that stems from the distributed way in which they represent data and process information, TNs can be naturally employed for linear/multilinear dimensionality reduction. Indeed, current applications of TNs include generalized multivariate regression, compressed sensing, multi-way blind source separation, sparse representation and coding, feature extraction, classification, clustering and data fusion.

With multilinear algebra as their mathematical backbone, TNs have been shown to have intrinsic advantages over the flat two-dimensional view provided by matrices, including the ability to model both strong and weak couplings among multiple variables, and to cater for multi-modal, incomplete and noisy data.

In Part 2 of this monograph we introduce a scalable framework for distributed implementation of optimization algorithms, in order to transform huge-scale optimization problems into linked small-scale optimization sub-problems of the same type. In that sense, TNs can be seen as a natural bridge between small-scale and very large-scale optimization paradigms, which allows any efficient standard numerical algorithm to be applied to such local optimization sub-problems.


Although research on tensor networks for dimensionality reduction and optimization problems is only emerging, given that in many modern applications multiway arrays (tensors) arise, either explicitly or indirectly, through the tensorization of vectors and matrices, we foresee this material serving as a useful foundation for further studies on a variety of machine learning problems for data of otherwise prohibitively large volume, variety, or veracity. We also hope that the readers will find the approaches presented in this monograph helpful in advancing seamlessly from numerical linear algebra to numerical multilinear algebra.


Acknowledgements

The authors wish to express their sincere gratitude to the anonymous reviewers for their very constructive comments. We also appreciate the deep insight of Prof Anthony Constantinides (Imperial College London), Dr Tatsuya Yokota (Nagoya Institute of Technology, Japan), and Dr Vladimir Lucic (Barclays and Imperial College London), the useful suggestions of Dr Kim Batselier (University of Hong Kong) and Claudius Hubig (Ludwig-Maximilians-University, Munich), and the help with the artwork and numerous mathematical details of Zhe Sun (RIKEN BSI), and Alexander Stott, Thiannithi Thanthawaritthisai and Sithan Kanna (Imperial College London). The work by I. Oseledets was supported by the Russian Science Foundation grant 14-11-00659, and the work by D. Mandic was partially supported by the EPSRC grant EP/K025643/1.


References

E. Acar and B. Yener. Unsupervised multiway data analysis: A literature survey. IEEE Transactions on Knowledge and Data Engineering, 21:6–20, 2009.

I. Affleck, T. Kennedy, E. H. Lieb, and H. Tasaki. Rigorous results on valence-bond ground states in antiferromagnets. Physical Review Letters, 59(7):799, 1987.

A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15:2773–2832, 2014.

D. Anderson, S. Du, M. Mahoney, C. Melgaard, K. Wu, and M. Gu. Spectral gap error bounds for improving CUR matrix decomposition and the Nyström method. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, pages 19–27, 2015.

W. Austin, G. Ballard, and T. G. Kolda. Parallel tensor compression for large-scale scientific data. arXiv preprint arXiv:1510.06689, 2015.

F. R. Bach and M. I. Jordan. Kernel independent component analysis. The Journal of Machine Learning Research, 3:1–48, 2003.

M. Bachmayr, R. Schneider, and A. Uschmajew. Tensor networks and hierarchical tensors for the solution of high-dimensional partial differential equations. Foundations of Computational Mathematics, pages 1–50, 2016.

B. W. Bader and T. G. Kolda. MATLAB tensor toolbox version 2.6, February 2015. URL http://csmr.ca.sandia.gov/~tgkolda/TensorToolbox/.


J. Ballani and L. Grasedyck. Tree adaptive approximation in the hierarchical tensor format. SIAM Journal on Scientific Computing, 36(4):A1415–A1431, 2014.

J. Ballani, L. Grasedyck, and M. Kluge. A review on adaptive low-rank approximation techniques in the hierarchical tensor format. In Extraction of Quantifiable Information from Complex Systems, pages 195–210. Springer, 2014.

G. Ballard, A. R. Benson, A. Druinsky, B. Lipshitz, and O. Schwartz. Improving the numerical stability of fast matrix multiplication algorithms. arXiv preprint arXiv:1507.00687, 2015a.

G. Ballard, A. Druinsky, N. Knight, and O. Schwartz. Brief announcement: Hypergraph partitioning for parallel sparse matrix-matrix multiplication. In Proceedings of the 27th ACM on Symposium on Parallelism in Algorithms and Architectures, pages 86–88. ACM, 2015b.

G. Barcza, Ö. Legeza, K. H. Marti, and M. Reiher. Quantum information analysis of electronic states at different molecular structures. Physical Review A, 83(1):012508, 2011.

K. Batselier and N. Wong. A constructive arbitrary-degree Kronecker product decomposition of tensors. arXiv preprint arXiv:1507.08805, 2015.

K. Batselier, H. Liu, and N. Wong. A constructive algorithm for decomposing a tensor into a finite sum of orthonormal rank-1 terms. SIAM Journal on Matrix Analysis and Applications, 36(3):1315–1337, 2015.

M. Bebendorf. Adaptive cross-approximation of multivariate functions. Constructive Approximation, 34(2):149–179, 2011.

M. Bebendorf, C. Kuske, and R. Venn. Wideband nested cross approximation for Helmholtz problems. Numerische Mathematik, 130(1):1–34, 2015.

R. E. Bellman. Adaptive Control Processes. Princeton University Press, Princeton, NJ, 1961.

P. Benner, V. Khoromskaia, and B. N. Khoromskij. A reduced basis approach for calculation of the Bethe–Salpeter excitation energies by using low-rank tensor factorisations. Molecular Physics, 114(7-8):1148–1161, 2016.

A. R. Benson, J. D. Lee, B. Rajwa, and D. F. Gleich. Scalable methods for nonnegative matrix factorizations of near-separable tall-and-skinny matrices. In Proceedings of Neural Information Processing Systems (NIPS), pages 945–953, 2014. URL http://arxiv.org/abs/1402.6964.

D. Bini. Tensor and border rank of certain classes of matrices and the fast evaluation of determinant inverse matrix and eigenvalues. Calcolo, 22(1):209–228, 1985.


M. Bolten, K. Kahl, and S. Sokolović. Multigrid methods for tensor structured Markov chains with low rank approximation. SIAM Journal on Scientific Computing, 38(2):A649–A667, 2016. URL http://adsabs.harvard.edu/abs/2016arXiv160506246B.

S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

A. Bruckstein, D. Donoho, and M. Elad. From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM Review, 51(1):34–81, 2009.

H.-J. Bungartz and M. Griebel. Sparse grids. Acta Numerica, 13:147–269, 2004.

C. Caiafa and A. Cichocki. Generalizing the column-row matrix decomposition to multi-way arrays. Linear Algebra and its Applications, 433(3):557–573, 2010.

C. Caiafa and A. Cichocki. Computing sparse representations of multidimensional signals using Kronecker bases. Neural Computation, 25(1):186–220, 2013.

C. Caiafa and A. Cichocki. Stable, robust, and super-fast reconstruction of tensors using multi-way projections. IEEE Transactions on Signal Processing, 63(3):780–793, 2015.

J. D. Carroll and J.-J. Chang. Analysis of individual differences in multidimensional scaling via an N-way generalization of "Eckart-Young" decomposition. Psychometrika, 35(3):283–319, 1970.

V. Cevher, S. Becker, and M. Schmidt. Convex optimization for big data: Scalable, randomized, and parallel algorithms for big data analytics. IEEE Signal Processing Magazine, 31(5):32–43, 2014.

G. Chabriel, M. Kleinsteuber, E. Moreau, H. Shen, P. Tichavský, and A. Yeredor. Joint matrix decompositions and blind source separation: A survey of methods, identification, and applications. IEEE Signal Processing Magazine, 31(3):34–43, 2014.

V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3):15, 2009.

T.-L. Chen, D. D. Chang, S.-Y. Huang, H. Chen, C. Lin, and W. Wang. Integrating multiple random sketches for singular value decomposition. arXiv e-prints, 2016.


H. Cho, D. Venturi, and G. E. Karniadakis. Numerical methods for high-dimensional probability density function equations. Journal of Computational Physics, 305:817–837, 2016.

J. H. Choi and S. Vishwanathan. DFacTo: Distributed factorization of tensors. In Advances in Neural Information Processing Systems, pages 1296–1304, 2014.

W. Chu and Z. Ghahramani. Probabilistic models for incomplete multi-dimensional arrays. In JMLR Workshop and Conference Proceedings Volume 5: AISTATS 2009, volume 5, pages 89–96. Microtome Publishing / Journal of Machine Learning Research, 2009.

A. Cichocki. Tensor decompositions: A new concept in brain data analysis? arXiv preprint arXiv:1305.0395, 2013a.

A. Cichocki. Era of big data processing: A new approach via tensor networks and tensor decompositions (invited). In Proceedings of the International Workshop on Smart Info-Media Systems in Asia (SISA2013), September 2013b. URL http://arxiv.org/abs/1403.2048.

A. Cichocki. Tensor networks for big data analytics and large-scale optimization problems. arXiv preprint arXiv:1407.3124, 2014.

A. Cichocki and S. Amari. Adaptive Blind Signal and Image Processing: Learning Algorithms and Applications. John Wiley & Sons, Ltd, 2003.

A. Cichocki, R. Zdunek, A.-H. Phan, and S. Amari. Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation. Wiley, Chichester, 2009.

A. Cichocki, S. Cruces, and S. Amari. Log-determinant divergences revisited: Alpha-beta and gamma log-det divergences. Entropy, 17(5):2988–3034, 2015a.

A. Cichocki, D. Mandic, A.-H. Phan, C. Caiafa, G. Zhou, Q. Zhao, and L. De Lathauwer. Tensor decompositions for signal processing applications: From two-way to multiway component analysis. IEEE Signal Processing Magazine, 32(2):145–163, 2015b.

N. Cohen and A. Shashua. Convolutional rectifier networks as generalized tensor decompositions. In Proceedings of the 33rd International Conference on Machine Learning, pages 955–963, 2016.

N. Cohen, O. Sharir, and A. Shashua. On the expressive power of deep learning: A tensor analysis. In 29th Annual Conference on Learning Theory, pages 698–728, 2016.

P. Comon. Tensors: A brief introduction. IEEE Signal Processing Magazine, 31(3):44–53, 2014.


P. Comon and C. Jutten. Handbook of Blind Source Separation: Independent Component Analysis and Applications. Academic Press, 2010.

P. G. Constantine and D. F. Gleich. Tall and skinny QR factorizations in MapReduce architectures. In Proceedings of the Second International Workshop on MapReduce and its Applications, pages 43–50. ACM, 2011.

P. G. Constantine, D. F. Gleich, Y. Hou, and J. Templeton. Model reduction with MapReduce-enabled tall and skinny singular value decomposition. SIAM Journal on Scientific Computing, 36(5):S166–S191, 2014.

E. Corona, A. Rahimian, and D. Zorin. A Tensor-Train accelerated solver for integral equations in complex geometries. arXiv preprint arXiv:1511.06029, November 2015.

C. Crainiceanu, B. Caffo, S. Luo, V. Zipunnikov, and N. Punjabi. Population value decomposition, a framework for the analysis of image populations. Journal of the American Statistical Association, 106(495):775–790, 2011. URL http://pubs.amstat.org/doi/abs/10.1198/jasa.2011.ap10089.

A. Critch and J. Morton. Algebraic geometry of matrix product states. Symmetry, Integrability and Geometry: Methods and Applications (SIGMA), 10:095, 2014.

A. J. Critch. Algebraic Geometry of Hidden Markov and Related Models. PhD thesis, University of California, Berkeley, 2013.

A. L. F. de Almeida, G. Favier, J. C. M. Mota, and J. P. C. L. da Costa. Overview of tensor decompositions with applications to communications. In R. F. Coelho, V. H. Nascimento, R. L. de Queiroz, J. M. T. Romano, and C. C. Cavalcante, editors, Signals and Images: Advances and Results in Speech, Estimation, Compression, Recognition, Filtering, and Processing, chapter 12, pages 325–355. CRC Press, 2015.

F. De la Torre. A least-squares framework for component analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(6):1041–1055, 2012.

L. De Lathauwer. A link between the canonical decomposition in multilinear algebra and simultaneous matrix diagonalization. SIAM Journal on Matrix Analysis and Applications, 28:642–666, 2006.

L. De Lathauwer. Decompositions of a higher-order tensor in block terms — Part I and II. SIAM Journal on Matrix Analysis and Applications, 30(3):1022–1066, 2008. URL http://publi-etis.ensea.fr/2008/De08e. Special Issue on Tensor Decompositions and Applications.


L. De Lathauwer. Blind separation of exponential polynomials and the decomposition of a tensor in rank-$(L_r, L_r, 1)$ terms. SIAM Journal on Matrix Analysis and Applications, 32(4):1451–1474, 2011.

L. De Lathauwer and D. Nion. Decompositions of a higher-order tensor in block terms — Part III: Alternating least squares algorithms. SIAM Journal on Matrix Analysis and Applications, 30(3):1067–1083, 2008.

L. De Lathauwer, B. De Moor, and J. Vandewalle. A multilinear singular value decomposition. SIAM Journal on Matrix Analysis and Applications, 21:1253–1278, 2000a.

L. De Lathauwer, B. De Moor, and J. Vandewalle. On the best rank-1 and rank-$(R_1, R_2, \ldots, R_N)$ approximation of higher-order tensors. SIAM Journal on Matrix Analysis and Applications, 21(4):1324–1342, 2000b.

W. de Launey and J. Seberry. The strong Kronecker product. Journal of Combinatorial Theory, Series A, 66(2):192–213, 1994. URL http://dblp.uni-trier.de/db/journals/jct/jcta66.html#LauneyS94.

V. de Silva and L.-H. Lim. Tensor rank and the ill-posedness of the best low-rank approximation problem. SIAM Journal on Matrix Analysis and Applications, 30:1084–1127, 2008.

A. Desai, M. Ghashami, and J. M. Phillips. Improved practical matrix sketching with guarantees. IEEE Transactions on Knowledge and Data Engineering, 28(7):1678–1690, 2016.

I. S. Dhillon. Fast Newton-type methods for nonnegative matrix and tensor approximation. The NSF Workshop, Future Directions in Tensor-Based Computation and Modeling, 2009.

E. Di Napoli, D. Fabregat-Traver, G. Quintana-Ortí, and P. Bientinesi. Towards an efficient use of the BLAS library for multilinear tensor contractions. Applied Mathematics and Computation, 235:454–468, 2014.

S. V. Dolgov. Tensor Product Methods in Numerical Simulation of High-dimensional Dynamical Problems. PhD thesis, Faculty of Mathematics and Informatics, University Leipzig, Germany, 2014.

S. V. Dolgov and B. N. Khoromskij. Two-level QTT-Tucker format for optimized tensor calculus. SIAM Journal on Matrix Analysis and Applications, 34(2):593–623, 2013.

S. V. Dolgov and B. N. Khoromskij. Simultaneous state-time approximation of the chemical master equation using tensor product formats. Numerical Linear Algebra with Applications, 22(2):197–219, 2015.

S. V. Dolgov and D. V. Savostyanov. Alternating minimal energy methods for linear systems in higher dimensions. SIAM Journal on Scientific Computing, 36(5):A2248–A2271, 2014.

S. V. Dolgov, B. N. Khoromskij, I. V. Oseledets, and D. V. Savostyanov. Computation of extreme eigenvalues in higher dimensions using block tensor train format. Computer Physics Communications, 185(4):1207–1216, 2014.

P. Drineas and M. W. Mahoney. A randomized algorithm for a tensor-based generalization of the singular value decomposition. Linear Algebra and its Applications, 420(2):553–571, 2007.

G. Ehlers, J. Sólyom, Ö. Legeza, and R. M. Noack. Entanglement structure of the Hubbard model in momentum space. Physical Review B, 92(23):235116, 2015.

M. Espig, M. Schuster, A. Killaitis, N. Waldren, P. Wähnert, S. Handschuh, and H. Auer. TensorCalculus library, 2012. URL http://gitorious.org/tensorcalculus.

F. Esposito, T. Scarabino, A. Hyvärinen, J. Himberg, E. Formisano, S. Comani, G. Tedeschi, R. Goebel, E. Seifritz, and F. Di Salle. Independent component analysis of fMRI group studies by self-organizing clustering. NeuroImage, 25(1):193–205, 2005.

G. Evenbly and G. Vidal. Algorithms for entanglement renormalization. Physical Review B, 79(14):144108, 2009.

G. Evenbly and S. R. White. Entanglement renormalization and wavelets. Physical Review Letters, 116(14):140403, 2016.

H. Fanaee-T and J. Gama. Tensor-based anomaly detection: An interdisciplinary survey. Knowledge-Based Systems, 2016.

G. Favier and A. de Almeida. Overview of constrained PARAFAC models. EURASIP Journal on Advances in Signal Processing, 2014(1):1–25, 2014.

J. Garcke, M. Griebel, and M. Thess. Data mining with sparse grids. Computing, 67(3):225–253, 2001.

S. Garreis and M. Ulbrich. Constrained optimization with low-rank tensors and applications to parametric problems with PDEs. SIAM Journal on Scientific Computing (accepted), 2016.

M. Ghashami, E. Liberty, and J. M. Phillips. Efficient frequent directions algorithm for sparse matrices. arXiv preprint arXiv:1602.00412, 2016.

V. Giovannetti, S. Montangero, and R. Fazio. Quantum multiscale entanglement renormalization ansatz channels. Physical Review Letters, 101(18):180503, 2008.

S. A. Goreinov, E. E. Tyrtyshnikov, and N. L. Zamarashkin. A theory of pseudo-skeleton approximations. Linear Algebra and its Applications, 261:1–21, 1997a.

S. A. Goreinov, N. L. Zamarashkin, and E. E. Tyrtyshnikov. Pseudo-skeleton approximations by matrices of maximum volume. Mathematical Notes, 62(4):515–519, 1997b.

L. Grasedyck. Hierarchical singular value decomposition of tensors. SIAM Journal on Matrix Analysis and Applications, 31(4):2029–2054, 2010.

L. Grasedyck, D. Kressner, and C. Tobler. A literature survey of low-rank tensor approximation techniques. GAMM-Mitteilungen, 36:53–78, 2013.

A. R. Groves, C. F. Beckmann, S. M. Smith, and M. W. Woolrich. Linked independent component analysis for multimodal data fusion. NeuroImage, 54(1):2198–2217, 2011.

Z.-C. Gu, M. Levin, B. Swingle, and X.-G. Wen. Tensor-product representations for string-net condensed states. Physical Review B, 79(8):085118, 2009.

M. Haardt, F. Roemer, and G. Del Galdo. Higher-order SVD based subspace estimation to improve the parameter estimation accuracy in multi-dimensional harmonic retrieval problems. IEEE Transactions on Signal Processing, 56:3198–3213, July 2008.

W. Hackbusch. Tensor Spaces and Numerical Tensor Calculus, volume 42 of Springer Series in Computational Mathematics. Springer, Heidelberg, 2012.

W. Hackbusch and S. Kühn. A new scheme for the tensor representation. Journal of Fourier Analysis and Applications, 15(5):706–722, 2009.

N. Halko, P. Martinsson, and J. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.

S. Handschuh. Numerical Methods in Tensor Networks. PhD thesis, Faculty of Mathematics and Informatics, University Leipzig, Germany, 2015.

R. A. Harshman. Foundations of the PARAFAC procedure: Models and conditions for an explanatory multimodal factor analysis. UCLA Working Papers in Phonetics, 16:1–84, 1970.

F. L. Hitchcock. Multiple invariants and generalized rank of a p-way matrix or tensor. Journal of Mathematics and Physics, 7:39–79, 1927.

S. Holtz, T. Rohwedder, and R. Schneider. The alternating linear scheme for tensor optimization in the tensor train format. SIAM Journal on Scientific Computing, 34(2), 2012. URL http://dx.doi.org/10.1137/100818893.

M. Hong, M. Razaviyayn, Z. Q. Luo, and J. S. Pang. A unified algorithmic framework for block-structured optimization involving big data with applications in machine learning and signal processing. IEEE Signal Processing Magazine, 33(1):57–77, 2016.

H. Huang, C. Ding, D. Luo, and T. Li. Simultaneous tensor subspace selection and clustering: The equivalence of high order SVD and K-means clustering. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 327–335. ACM, 2008.

R. Hübener, V. Nebendahl, and W. Dür. Concatenated tensor network states. New Journal of Physics, 12(2):025004, 2010.

C. Hubig, I. P. McCulloch, U. Schollwöck, and F. A. Wolf. Strictly single-site DMRG algorithm with subspace expansion. Physical Review B, 91(15):155115, 2015.

T. Huckle, K. Waldherr, and T. Schulte-Herbrüggen. Computations in quantum tensor networks. Linear Algebra and its Applications, 438(2):750–781, 2013.

A. Hyvärinen. Independent component analysis: Recent advances. Philosophical Transactions of the Royal Society A, 371(1984):20110534, 2013.

I. Jeon, E. E. Papalexakis, C. Faloutsos, L. Sael, and U. Kang. Mining billion-scale tensors: Algorithms and discoveries. The VLDB Journal, pages 1–26, 2016.

B. Jiang, F. Yang, and S. Zhang. Tensor and its Tucker core: The invariance relationships. arXiv e-prints arXiv:1601.01469, January 2016.

U. Kang, E. E. Papalexakis, A. Harpale, and C. Faloutsos. GigaTensor: Scaling tensor analysis up by 100 times – algorithms and discoveries. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’12), pages 316–324, August 2012.

Y.-J. Kao, Y.-D. Hsieh, and P. Chen. Uni10: An open-source library for tensor network algorithms. In Journal of Physics: Conference Series, volume 640, page 012040. IOP Publishing, 2015.

L. Karlsson, D. Kressner, and A. Uschmajew. Parallel algorithms for tensor completion in the CP format. Parallel Computing, 57:222–234, 2016.

J.-P. Kauppi, J. Hahne, K. R. Müller, and A. Hyvärinen. Three-way analysis of spectrospatial electromyography data: Classification and interpretation. PLoS One, 10(6):e0127231, 2015.

V. A. Kazeev and B. N. Khoromskij. Low-rank explicit QTT representation of the Laplace operator and its inverse. SIAM Journal on Matrix Analysis and Applications, 33(3):742–758, 2012.

V. A. Kazeev, B. N. Khoromskij, and E. E. Tyrtyshnikov. Multilevel Toeplitz matrices generated by tensor-structured vectors and convolution with logarithmic complexity. SIAM Journal on Scientific Computing, 35(3):A1511–A1536, 2013a.

V. A. Kazeev, O. Reichmann, and C. Schwab. Low-rank tensor structure of linear diffusion operators in the TT and QTT formats. Linear Algebra and its Applications, 438(11):4204–4221, 2013b.

V. A. Kazeev, M. Khammash, M. Nip, and C. Schwab. Direct solution of the chemical master equation using quantized tensor trains. PLoS Computational Biology, 10(3):e1003359, 2014.

B. N. Khoromskij. O(d log N)-quantics approximation of N-d tensors in high-dimensional numerical modeling. Constructive Approximation, 34(2):257–280, 2011a.

B. N. Khoromskij. Tensors-structured numerical methods in scientific computing: Survey on recent advances. Chemometrics and Intelligent Laboratory Systems, 110(1):1–19, 2011b. URL http://www.mis.mpg.de/de/publications/preprints/2010/prepr2010-21.html.

B. N. Khoromskij and A. Veit. Efficient computation of highly oscillatory integrals by using QTT tensor approximation. Computational Methods in Applied Mathematics, 16(1):145–159, 2016.

H.-J. Kim, E. Ollila, V. Koivunen, and H. V. Poor. Robust iteratively reweighted Lasso for sparse tensor factorizations. In IEEE Workshop on Statistical Signal Processing (SSP), pages 420–423, 2014.

S. Klus and C. Schütte. Towards tensor-based methods for the numerical approximation of the Perron-Frobenius and Koopman operator. arXiv e-prints arXiv:1512.06527, December 2015.

T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.

D. Kressner and C. Tobler. Algorithm 941: HTucker – A MATLAB toolbox for tensors in hierarchical Tucker format. ACM Transactions on Mathematical Software, 40(3):22, 2014.

D. Kressner and A. Uschmajew. On low-rank approximability of solutions to high-dimensional operator equations and eigenvalue problems. Linear Algebra and its Applications, 493:556–572, 2016.

D. Kressner, M. Steinlechner, and A. Uschmajew. Low-rank tensor methods with subspace correction for symmetric eigenvalue problems. SIAM Journal on Scientific Computing, 36(5):A2346–A2368, 2014a.

D. Kressner, M. Steinlechner, and B. Vandereycken. Low-rank tensor completion by Riemannian optimization. BIT Numerical Mathematics, 54(2):447–468, 2014b.

P. M. Kroonenberg. Applied Multiway Data Analysis. John Wiley & Sons Ltd, New York, 2008.

J. B. Kruskal. Three-way arrays: Rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear Algebra and its Applications, 18(2):95–138, 1977.

V. Kuleshov, A. T. Chaganty, and P. Liang. Tensor factorization via matrix factorization. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, pages 507–516, 2015.

N. Lee and A. Cichocki. Estimating a few extreme singular values and vectors for large-scale matrices in Tensor Train format. SIAM Journal on Matrix Analysis and Applications, 36(3):994–1014, 2015.

N. Lee and A. Cichocki. Tensor train decompositions for higher order regression with LASSO penalties. In Workshop on Tensor Decompositions and Applications (TDA2016), 2016a.

N. Lee and A. Cichocki. Regularized computation of approximate pseudoinverse of large matrices using low-rank tensor train decompositions. SIAM Journal on Matrix Analysis and Applications, 37(2):598–623, 2016b. URL http://adsabs.harvard.edu/abs/2015arXiv150601959L.

N. Lee and A. Cichocki. Fundamental tensor operations for large-scale data analysis using tensor network formats. Multidimensional Systems and Signal Processing (accepted), 2016c.

J. Li, C. Battaglino, I. Perros, J. Sun, and R. Vuduc. An input-adaptive and in-place approach to dense tensor-times-matrix multiply. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, page 76. ACM, 2015.

M. Li and V. Monga. Robust video hashing via multilinear subspace projections. IEEE Transactions on Image Processing, 21(10):4397–4409, 2012.

S. Liao, T. Vejchodský, and R. Erban. Tensor methods for parameter estimation and bifurcation analysis of stochastic reaction networks. Journal of the Royal Society Interface, 12(108):20150233, 2015.

A. P. Liavas and N. D. Sidiropoulos. Parallel algorithms for constrained tensor factorization via alternating direction method of multipliers. IEEE Transactions on Signal Processing, 63(20):5450–5463, 2015.

L. H. Lim and P. Comon. Multiarray signal processing: Tensor decomposition meets compressed sensing. Comptes Rendus Mecanique, 338(6):311–320, 2010.

M. S. Litsarev and I. V. Oseledets. A low-rank approach to the computation of path integrals. Journal of Computational Physics, 305:557–574, 2016.

H. Lu, K. N. Plataniotis, and A. N. Venetsanopoulos. A survey of multilinear subspace learning for tensor data. Pattern Recognition, 44(7):1540–1551, 2011.

M. Lubasch, J. I. Cirac, and M.-C. Banuls. Unifying projected entangled pair state contractions. New Journal of Physics, 16(3):033014, 2014. URL http://stacks.iop.org/1367-2630/16/i=3/a=033014.

C. Lubich, T. Rohwedder, R. Schneider, and B. Vandereycken. Dynamical approximation of hierarchical Tucker and tensor-train tensors. SIAM Journal on Matrix Analysis and Applications, 34(2):470–494, 2013.

M. W. Mahoney. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3(2):123–224, 2011.

M. W. Mahoney and P. Drineas. CUR matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences, 106:697–702, 2009.

M. W. Mahoney, M. Maggioni, and P. Drineas. Tensor-CUR decompositions for tensor-based data. SIAM Journal on Matrix Analysis and Applications, 30(3):957–987, 2008.

H. Matsueda. Analytic optimization of a MERA network and its relevance to quantum integrability and wavelet. arXiv preprint arXiv:1608.02205, 2016.

A. Y. Mikhalev and I. V. Oseledets. Iterative representing set selection for nested cross-approximation. Numerical Linear Algebra with Applications, 2015.

L. Mirsky. Symmetric gauge functions and unitarily invariant norms. The Quarterly Journal of Mathematics, 11:50–59, 1960.

J. Morton. Tensor networks in algebraic geometry and statistics. Lecture at Networking Tensor Networks, Centro de Ciencias de Benasque Pedro Pascual, Benasque, Spain, 2012.

M. Mørup. Applications of tensor (multiway array) factorizations and decompositions in data mining. Wiley Interdisciplinary Review: Data Mining and Knowledge Discovery, 1(1):24–40, 2011.

V. Murg, F. Verstraete, R. Schneider, P. R. Nagy, and O. Legeza. Tree tensor network state with variable tensor order: An efficient multireference method for strongly correlated systems. Journal of Chemical Theory and Computation, 11(3):1027–1036, 2015.

N. Nakatani and G. K. L. Chan. Efficient tree tensor network states (TTNS) for quantum chemistry: Generalizations of the density matrix renormalization group algorithm. The Journal of Chemical Physics, 2013. URL http://arxiv.org/pdf/1302.2298.pdf.

Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.

Y. Nesterov. Subgradient methods for huge-scale optimization problems. Mathematical Programming, 146(1-2):275–297, 2014.

N. H. Nguyen, P. Drineas, and T. D. Tran. Tensor sparsification via a bound on the spectral norm of random tensors. Information and Inference, page iav004, 2015.

M. Nickel, K. Murphy, V. Tresp, and E. Gabrilovich. A review of relational machine learning for knowledge graphs. Proceedings of the IEEE, 104(1):11–33, 2016.

A. Novikov and R. A. Rodomanov. Putting MRFs on a tensor train. In Proceedings of the International Conference on Machine Learning (ICML ’14), 2014.

A. C. Olivieri. Analytical advantages of multivariate data processing. One, two, three, infinity? Analytical Chemistry, 80(15):5713–5720, 2008.

R. Orús. A practical introduction to tensor networks: Matrix product states and projected entangled pair states. Annals of Physics, 349:117–158, 2014.

I. V. Oseledets. Approximation of 2^d × 2^d matrices using tensor decomposition. SIAM Journal on Matrix Analysis and Applications, 31(4):2130–2145, 2010.

I. V. Oseledets. Tensor-train decomposition. SIAM Journal on Scientific Computing, 33(5):2295–2317, 2011.

I. V. Oseledets and S. V. Dolgov. Solution of linear systems and matrix inversion in the TT-format. SIAM Journal on Scientific Computing, 34(5):A2718–A2739, 2012.

I. V. Oseledets and E. E. Tyrtyshnikov. Breaking the curse of dimensionality, or how to use SVD in many dimensions. SIAM Journal on Scientific Computing, 31(5):3744–3759, 2009.

I. V. Oseledets and E. E. Tyrtyshnikov. TT cross-approximation for multidimensional arrays. Linear Algebra and its Applications, 432(1):70–88, 2010.

I. V. Oseledets, S. V. Dolgov, V. A. Kazeev, D. Savostyanov, O. Lebedeva, P. Zhlobich, T. Mach, and L. Song. TT-Toolbox, 2012. URL https://github.com/oseledets/TT-Toolbox.

E. E. Papalexakis, N. Sidiropoulos, and R. Bro. From K-means to higher-way co-clustering: Multilinear decomposition with sparse latent factors. IEEE Transactions on Signal Processing, 61(2):493–506, 2013.

E. E. Papalexakis, C. Faloutsos, and N. D. Sidiropoulos. Tensors for data mining and data fusion: Models, applications, and scalable algorithms. ACM Transactions on Intelligent Systems and Technology (TIST), 8(2):16, 2016.

N. Parikh and S. P. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014.

D. Perez-Garcia, F. Verstraete, M. M. Wolf, and J. I. Cirac. Matrix product state representations. Quantum Information & Computation, 7(5):401–430, July 2007. URL http://dl.acm.org/citation.cfm?id=2011832.2011833.

R. Pfeifer, G. Evenbly, S. Singh, and G. Vidal. NCON: A tensor network contractor for MATLAB. arXiv preprint arXiv:1402.0939, 2014.

N. Pham and R. Pagh. Fast and scalable polynomial kernels via explicit feature maps. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 239–247. ACM, 2013.

A.-H. Phan and A. Cichocki. Extended HALS algorithm for nonnegative Tucker decomposition and its applications for multiway analysis and classification. Neurocomputing, 74(11):1956–1969, 2011.

A.-H. Phan, P. Tichavský, and A. Cichocki. Fast alternating LS algorithms for high order CANDECOMP/PARAFAC tensor factorizations. IEEE Transactions on Signal Processing, 61(19):4834–4846, 2013a.

A.-H. Phan, P. Tichavský, and A. Cichocki. Tensor deflation for CANDECOMP/PARAFAC – Part I: Alternating subspace update algorithm. IEEE Transactions on Signal Processing, 63(22):5924–5938, 2015a.

A.-H. Phan, A. Cichocki, A. Uschmajew, P. Tichavský, G. Luta, and D. Mandic. Tensor networks for latent variable analysis. Part I: Algorithms for tensor train decomposition. arXiv e-prints, 2016.

A.-H. Phan and A. Cichocki. Tensor decompositions for feature extraction and classification of high dimensional datasets. Nonlinear Theory and its Applications, IEICE, 1(1):37–68, 2010.

A.-H. Phan, A. Cichocki, P. Tichavský, D. Mandic, and K. Matsuoka. On revealing replicating structures in multiway data: A novel tensor decomposition approach. In Proceedings of the 10th International Conference LVA/ICA, Tel Aviv, March 12–15, pages 297–305. Springer, 2012.

A.-H. Phan, A. Cichocki, P. Tichavský, R. Zdunek, and S. R. Lehky. From basis components to complex structural patterns. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2013, Vancouver, BC, Canada, May 26–31, 2013, pages 3228–3232, 2013b. URL http://dx.doi.org/10.1109/ICASSP.2013.6638254.

A.-H. Phan, P. Tichavský, and A. Cichocki. Low complexity damped Gauss-Newton algorithms for CANDECOMP/PARAFAC. SIAM Journal on Matrix Analysis and Applications (SIMAX), 34(1):126–147, 2013c. URL http://arxiv.org/pdf/1205.2584.pdf.

A.-H. Phan, P. Tichavský, and A. Cichocki. Low rank tensor deconvolution. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, pages 2169–2173, April 2015b. URL http://dx.doi.org/10.1109/ICASSP.2015.7178355.

S. Ragnarsson. Structured Tensor Computations: Blocking Symmetries and Kronecker Factorization. PhD dissertation, Cornell University, Department of Applied Mathematics, 2012.

M. V. Rakhuba and I. V. Oseledets. Fast multidimensional convolution in low-rank tensor formats via cross-approximation. SIAM Journal on Scientific Computing, 37(2):A565–A582, 2015.

P. Richtárik and M. Takáč. Parallel coordinate descent methods for big data optimization. Mathematical Programming, 156:433–484, 2016.

J. Salmi, A. Richter, and V. Koivunen. Sequential unfolding SVD for tensors with applications in array signal processing. IEEE Transactions on Signal Processing, 57:4719–4733, 2009.

U. Schollwöck. The density-matrix renormalization group in the age of matrix product states. Annals of Physics, 326(1):96–192, 2011.

U. Schollwöck. Matrix product state algorithms: DMRG, TEBD and relatives. In Strongly Correlated Systems, pages 67–98. Springer, 2013.

N. Schuch, I. Cirac, and D. Pérez-García. PEPS as ground states: Degeneracy and topology. Annals of Physics, 325(10):2153–2192, 2010.

N. Sidiropoulos, R. Bro, and G. Giannakis. Parallel factor analysis in sensor array processing. IEEE Transactions on Signal Processing, 48(8):2377–2388, 2000.

N. D. Sidiropoulos. Generalizing Caratheodory’s uniqueness of harmonic parameterization to N dimensions. IEEE Transactions on Information Theory, 47(4):1687–1690, 2001.

N. D. Sidiropoulos. Low-rank decomposition of multi-way arrays: A signal processing perspective. In Proceedings of the IEEE Sensor Array and Multichannel Signal Processing Workshop (SAM 2004), July 2004. URL http://www.sandia.gov/~tgkolda/tdw2004/Nikos04.pdf.

N. D. Sidiropoulos and R. Bro. On the uniqueness of multilinear decomposition of N-way arrays. Journal of Chemometrics, 14(3):229–239, 2000.

N. D. Sidiropoulos, L. De Lathauwer, X. Fu, K. Huang, E. E. Papalexakis, and C. Faloutsos. Tensor decomposition for signal processing and machine learning. arXiv e-prints arXiv:1607.01668, 2016.

A. Smilde, R. Bro, and P. Geladi. Multi-way Analysis: Applications in the Chemical Sciences. John Wiley & Sons Ltd, New York, 2004.

S. M. Smith, A. Hyvärinen, G. Varoquaux, K. L. Miller, and C. F. Beckmann. Group-PCA for very large fMRI datasets. NeuroImage, 101:738–749, 2014.

L. Sorber, M. Van Barel, and L. De Lathauwer. Optimization-based algorithms for tensor decompositions: Canonical Polyadic Decomposition, decomposition in rank-(Lr, Lr, 1) terms and a new generalization. SIAM Journal on Optimization, 23(2), 2013.

L. Sorber, I. Domanov, M. Van Barel, and L. De Lathauwer. Exact line and plane search for tensor optimization. Computational Optimization and Applications, 63(1):121–142, 2016.

M. Sørensen and L. De Lathauwer. Blind signal separation via tensor decomposition with Vandermonde factor. Part I: Canonical polyadic decomposition. IEEE Transactions on Signal Processing, 61(22):5507–5519, 2013. URL http://dx.doi.org/10.1109/TSP.2013.2276416.

M. Sørensen, L. De Lathauwer, P. Comon, S. Icart, and L. Deneire. Canonical Polyadic Decomposition with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 33(4):1190–1213, 2012.

M. Steinlechner. Riemannian optimization for high-dimensional tensor completion. Technical Report MATHICSE 5.2015, EPF Lausanne, Switzerland, 2015.

M. M. Steinlechner. Riemannian Optimization for Solving High-Dimensional Problems with Low-Rank Tensor Structure. PhD thesis, École Polytechnique Fédérale de Lausanne, 2016.

E. M. Stoudenmire and S. R. White. Minimally entangled typical thermal state algorithms. New Journal of Physics, 12(5):055026, 2010.

J. Sun, D. Tao, and C. Faloutsos. Beyond streams and graphs: Dynamic tensor analysis. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 374–383. ACM, 2006.

S. K. Suter, M. Makhynia, and R. Pajarola. TAMRESH – tensor approximation multiresolution hierarchy for interactive volume visualization. Computer Graphics Forum, 32(3):151–160, 2013.

Y. Tang, R. Salakhutdinov, and G. Hinton. Tensor analyzers. In Proceedings of the 30th International Conference on Machine Learning (ICML 2013), Atlanta, USA, 2013.

D. Tao, X. Li, X. Wu, and S. Maybank. General tensor discriminant analysis and Gabor features for gait recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(10):1700–1715, 2007.

P. Tichavský and A. Yeredor. Fast approximate joint diagonalization incorporating weight matrices. IEEE Transactions on Signal Processing, 57(3):878–891, 2009.

M. K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, pages 567–574, 2009.

C. Tobler. Low-rank tensor methods for linear systems and eigenvalue problems. PhD thesis, ETH Zürich, 2012.

L. N. Trefethen. Cubature, approximation, and isotropy in the hypercube. SIAM Review, forthcoming, 2017. URL https://people.maths.ox.ac.uk/trefethen/hypercube.pdf.

V. Tresp, C. Esteban, Y. Yang, S. Baier, and D. Krompaß. Learning with memory embeddings. arXiv preprint arXiv:1511.07972, 2015.

J. A. Tropp, A. Yurtsever, M. Udell, and V. Cevher. Randomized single-view algorithms for low-rank matrix approximation. arXiv e-prints, 2016.

L. R. Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311, 1966.

L. R. Tucker. The extension of factor analysis to three-dimensional matrices. In H. Gulliksen and N. Frederiksen, editors, Contributions to Mathematical Psychology, pages 110–127. Holt, Rinehart and Winston, New York, 1964.

A. Uschmajew and B. Vandereycken. The geometry of algorithms using hierarchical tensors. Linear Algebra and its Applications, 439:133–166, 2013.

N. Vannieuwenhoven, R. Vandebril, and K. Meerbergen. A new truncation strategy for the higher-order singular value decomposition. SIAM Journal on Scientific Computing, 34(2):A1027–A1052, 2012.

M. A. O. Vasilescu and D. Terzopoulos. Multilinear analysis of image ensembles: TensorFaces. In Proceedings of the European Conference on Computer Vision (ECCV), volume 2350, pages 447–460, Copenhagen, Denmark, May 2002.

F. Verstraete, V. Murg, and I. Cirac. Matrix product states, projected entangled pair states, and variational renormalization group methods for quantum spin systems. Advances in Physics, 57(2):143–224, 2008.

N. Vervliet, O. Debals, L. Sorber, and L. De Lathauwer. Breaking the curse of dimensionality using decompositions of incomplete tensors: Tensor-based scientific computing in big data analysis. IEEE Signal Processing Magazine, 31(5):71–79, 2014.

G. Vidal. Efficient classical simulation of slightly entangled quantum computations. Physical Review Letters, 91(14):147902, 2003.

S. A. Vorobyov, Y. Rong, N. D. Sidiropoulos, and A. B. Gershman. Robust iterative fitting of multilinear models. IEEE Transactions on Signal Processing, 53(8):2678–2689, 2005.

S. Wahls, V. Koivunen, H. V. Poor, and M. Verhaegen. Learning multidimensional Fourier series with tensor trains. In IEEE Global Conference on Signal and Information Processing (GlobalSIP), pages 394–398. IEEE, 2014.

D. Wang, H. Shen, and Y. Truong. Efficient dimension reduction for high-dimensional matrix-valued data. Neurocomputing, 190:25–34, 2016.

H. Wang and M. Thoss. Multilayer formulation of the multiconfiguration time-dependent Hartree theory. Journal of Chemical Physics, 119(3):1289–1299, 2003.

H. Wang, Q. Wu, L. Shi, Y. Yu, and N. Ahuja. Out-of-core tensor approximation of multi-dimensional matrices of visual data. ACM Transactions on Graphics, 24(3):527–535, 2005.

S. Wang and Z. Zhang. Improving CUR matrix decomposition and the Nyström approximation via adaptive sampling. The Journal of Machine Learning Research, 14(1):2729–2769, 2013.

Y. Wang, H.-Y. Tung, A. Smola, and A. Anandkumar. Fast and guaranteed tensor decomposition via sketching. In Advances in Neural Information Processing Systems, pages 991–999, 2015.

S. R. White. Density-matrix algorithms for quantum renormalization groups. Physical Review B, 48(14):10345, 1993.

Z. Xu, F. Yan, and Y. Qi. Infinite Tucker decomposition: Nonparametric Bayesian models for multiway data analysis. In Proceedings of the 29th International Conference on Machine Learning (ICML), ICML ’12, pages 1023–1030. Omnipress, July 2012.

Y. Yang and T. Hospedales. Deep multi-task representation learning: A tensor factorisation approach. arXiv preprint arXiv:1605.06391, 2016. URL http://adsabs.harvard.edu/abs/2016arXiv160506391Y.

T. Yokota, Q. Zhao, and A. Cichocki. Smooth PARAFAC decomposition for tensor completion. IEEE Transactions on Signal Processing, 64(20):5423–5436, 2016.

T. Yokota, N. Lee, and A. Cichocki. Robust multilinear tensor rank estimation using higher order singular value decomposition and information criteria. IEEE Transactions on Signal Processing, accepted, 2017.

Z. Zhang, X. Yang, I. V. Oseledets, G. E. Karniadakis, and L. Daniel. Enabling high-dimensional hierarchical uncertainty quantification by ANOVA and tensor-train decomposition. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 34(1):63–76, 2015.

H. H. Zhao, Z. Y. Xie, Q. N. Chen, Z. C. Wei, J. W. Cai, and T. Xiang. Renormalization of tensor-network states. Physical Review B, 81(17):174411, 2010.

Q. Zhao, C. Caiafa, D. P. Mandic, Z. C. Chao, Y. Nagasaka, N. Fujii, L. Zhang, and A. Cichocki. Higher order partial least squares (HOPLS): A generalized multilinear regression method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(7):1660–1673, 2013a.

Q. Zhao, G. Zhou, T. Adali, L. Zhang, and A. Cichocki. Kernelization of tensor-based models for multiway data analysis: Processing of multidimensional structured data. IEEE Signal Processing Magazine, 30(4):137–148, 2013b.

Q. Zhao, G. Zhou, S. Xie, L. Zhang, and A. Cichocki. Tensor ring decomposition. CoRR, abs/1606.05535, 2016. URL http://arxiv.org/abs/1606.05535.

S. Zhe, Y. Qi, Y. Park, Z. Xu, I. Molloy, and S. Chari. DinTucker: Scaling up Gaussian process models on large multidimensional arrays. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 2016. URL http://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/view/11959/11888.

G. Zhou and A. Cichocki. Fast and unique Tucker decompositions via multiway blind source separation. Bulletin of Polish Academy of Science, 60(3):389–407, 2012a.

G. Zhou and A. Cichocki. Canonical Polyadic Decomposition based on a single mode blind source separation. IEEE Signal Processing Letters, 19(8):523–526, 2012b.

G. Zhou, A. Cichocki, and S. Xie. Fast nonnegative matrix/tensor factorization based on low-rank approximation. IEEE Transactions on Signal Processing, 60(6):2928–2940, June 2012.

G. Zhou, A. Cichocki, Q. Zhao, and S. Xie. Efficient nonnegative Tucker decompositions: Algorithms and uniqueness. IEEE Transactions on Image Processing, 24(12):4990–5003, 2015.

G. Zhou, A. Cichocki, Y. Zhang, and D. P. Mandic. Group component analysis for multiblock data: Common and individual feature extraction. IEEE Transactions on Neural Networks and Learning Systems, (in print), 2016a.

G. Zhou, Q. Zhao, Y. Zhang, T. Adali, S. Xie, and A. Cichocki. Linked component analysis from matrices to high-order tensors: Applications to biomedical data. Proceedings of the IEEE, 104(2):310–331, 2016b.