Une journée autour de Ron DeVore, Sep. 13, 2019

Approximation with tree tensor networks

Anthony Nouy

Centrale Nantes, Laboratoire de Mathématiques Jean Leray
Anthony Nouy Centrale Nantes 1 / 42
Statistical learning
Two typical tasks of statistical learning are to

approximate a random variable Y by a function of a set of variables X = (X_1, ..., X_d), from samples of the pair Z = (X, Y) (supervised learning),

approximate the probability distribution of a random vector Z = (Z_1, ..., Z_d) from samples of the distribution (unsupervised learning).
Risk
A classical approach is to introduce a risk functional R(v) whose minimizer over the set of functions v is the target function u, and such that

R(v) − R(u)

measures some distance between the target u and the function v.

The risk is defined as an expectation

R(v) = E(γ(v, Z))

where γ is called a contrast (or loss) function.

For least-squares regression in supervised learning, R(v) = E((Y − v(X))²), u(X) = E(Y|X), and

R(v) − R(u) = E((u(X) − v(X))²) = ‖u − v‖²_{L²_µ} with X ∼ µ.

For unsupervised learning with L²-loss, R(v) = E(‖v‖²_{L²_µ} − 2v(Z)), and R(v) − R(u) = ‖u − v‖²_{L²_µ} is the L² distance between v and the probability density u of Z with respect to a reference measure µ.
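The least-squares identity above can be checked numerically. A minimal Monte Carlo sketch (the model Y = X² + noise and the competitor v are illustrative assumptions, not from the slides):

```python
import numpy as np

# Monte Carlo check of R(v) - R(u) = E((u(X) - v(X))^2) for the
# least-squares risk R(v) = E((Y - v(X))^2), with u(X) = E(Y|X).
rng = np.random.default_rng(0)
n = 1_000_000
X = rng.uniform(0.0, 1.0, n)
Y = X**2 + 0.3 * rng.standard_normal(n)   # so that u(x) = E(Y|X=x) = x^2

u = lambda x: x**2                        # target (conditional expectation)
v = lambda x: 0.5 * x                     # an arbitrary competitor

R = lambda f: np.mean((Y - f(X)) ** 2)    # empirical version of the risk
lhs = R(v) - R(u)
rhs = np.mean((u(X) - v(X)) ** 2)
print(lhs, rhs)                           # agree up to Monte Carlo error
```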
Risk
A classical approach is to introduce a risk functional R(v) whose minimizer over the setof functions v is the target function u and such that
R(v)−R(u)
measures some distance between the target u and the function v .
The risk is defined as an expectation
R(v) = E(γ(v ,Z))
where γ is called a contrast (or loss) function.
For least-squares regression in supervised learning, R(v) = E((Y − v(X ))2),u(X ) = E(Y |X ) and
R(v)−R(u) = E((u(X )− v(X ))2) = ‖u − v‖2L2µwith X ∼ µ.
For unsupervised learning with L2-loss, R(v) = E(‖v‖2L2µ − 2v(Z)) and
R(v)−R(u) = ‖u − v‖2L2µ is the L2 distance between v and the probability density
u of Z with respect to a reference measure µ.
Anthony Nouy Centrale Nantes 3 / 42
Risk
A classical approach is to introduce a risk functional R(v) whose minimizer over the setof functions v is the target function u and such that
R(v)−R(u)
measures some distance between the target u and the function v .
The risk is defined as an expectation
R(v) = E(γ(v ,Z))
where γ is called a contrast (or loss) function.
For least-squares regression in supervised learning, R(v) = E((Y − v(X ))2),u(X ) = E(Y |X ) and
R(v)−R(u) = E((u(X )− v(X ))2) = ‖u − v‖2L2µwith X ∼ µ.
For unsupervised learning with L2-loss, R(v) = E(‖v‖2L2µ − 2v(Z)) and
R(v)−R(u) = ‖u − v‖2L2µ is the L2 distance between v and the probability density
u of Z with respect to a reference measure µ.
Anthony Nouy Centrale Nantes 3 / 42
Risk
A classical approach is to introduce a risk functional R(v) whose minimizer over the setof functions v is the target function u and such that
R(v)−R(u)
measures some distance between the target u and the function v .
The risk is defined as an expectation
R(v) = E(γ(v ,Z))
where γ is called a contrast (or loss) function.
For least-squares regression in supervised learning, R(v) = E((Y − v(X ))2),u(X ) = E(Y |X ) and
R(v)−R(u) = E((u(X )− v(X ))2) = ‖u − v‖2L2µwith X ∼ µ.
For unsupervised learning with L2-loss, R(v) = E(‖v‖2L2µ − 2v(Z)) and
R(v)−R(u) = ‖u − v‖2L2µ is the L2 distance between v and the probability density
u of Z with respect to a reference measure µ.
Anthony Nouy Centrale Nantes 3 / 42
Risk
Variational methods for PDEs [Eigel et al. 2018]: with Z uniformly distributed on D = (0, 1)^d and a risk

R(v) = E(|∇v(Z)|² − 2 v(Z) f(Z)),

the target function u ∈ H¹_0 is such that

−∆u = f on D,

and R(v) − R(u) = ‖v − u‖²_{H¹_0}.
Empirical risk minimization
Given i.i.d. samples {z_i}_{i=1}^n of Z, an approximation u_F^n of u is obtained by minimizing the empirical risk

R_n(v) = (1/n) ∑_{i=1}^n γ(v, z_i)

over a certain model class F.

Denoting by u_F the minimizer of the risk over F, the error decomposes as

R(u_F^n) − R(u) = R(u_F^n) − R(u_F) (estimation error) + R(u_F) − R(u) (approximation error).

For a given sample, taking larger and larger model classes makes the approximation error decrease while the estimation error increases.

Methods should be proposed for selecting a model class that makes the best use of the available information.
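A minimal sketch of this trade-off, using nested polynomial model classes for least-squares regression (the target cos(πx), the noise level, and the sample sizes are illustrative assumptions): the empirical risk can only decrease as the class grows, while the true risk eventually deteriorates.

```python
import numpy as np

# Empirical risk minimization (least squares) over nested polynomial classes.
rng = np.random.default_rng(0)
n = 30
x = rng.uniform(-1.0, 1.0, n)
y = np.cos(np.pi * x) + 0.1 * rng.standard_normal(n)   # samples of (X, Y)

x_test = rng.uniform(-1.0, 1.0, 5000)                  # proxy for the true risk
y_test = np.cos(np.pi * x_test)

emp, true = [], []
for degree in range(1, 16):                            # nested model classes
    A = np.vander(x, degree + 1)
    coef = np.linalg.lstsq(A, y, rcond=None)[0]        # empirical risk minimizer
    emp.append(np.mean((A @ coef - y) ** 2))           # R_n: non-increasing
    true.append(np.mean((np.vander(x_test, degree + 1) @ coef - y_test) ** 2))

print(emp[0], emp[-1])                                 # in-sample error shrinks
```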
Outline
1 Tree tensor networks
2 Estimation error
3 Approximation error
4 Learning algorithms
Tensor ranks
We consider a space H of functions defined on the set X = X_1 × ... × X_d, equipped with a product measure µ = µ_1 ⊗ ... ⊗ µ_d.

For a subset of variables α ⊂ {1, ..., d} := D, a function v ∈ H can be identified with a bivariate function

v(x_α, x_{α^c})

where x_α and x_{α^c} are complementary groups of variables.

The canonical rank of this bivariate function is called the α-rank of v, denoted rank_α(v). It is the minimal integer r_α such that

v(x) = ∑_{k=1}^{r_α} v_k^α(x_α) w_k^{α^c}(x_{α^c}).
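On a grid, the α-rank can be read off as the matrix rank of an unfolding of the discretized tensor. A minimal sketch (the function sin(x₁+x₂)cos(x₃) is an illustrative assumption): across the split {1,2}|{3} it has rank 1, while across {1,3}|{2} it has rank 2, since sin(x₁+x₂) = sin x₁ cos x₂ + cos x₁ sin x₂.

```python
import numpy as np

# alpha-rank of a discretized function = rank of the corresponding unfolding.
n = 20
x = np.linspace(0.0, 1.0, n)
X1, X2, X3 = np.meshgrid(x, x, x, indexing="ij")
V = np.sin(X1 + X2) * np.cos(X3)                  # tensor of shape (n, n, n)

# alpha = {1, 2}: group (x1, x2) as rows, x3 as columns -> rank 1
r12 = np.linalg.matrix_rank(V.reshape(n * n, n))

# alpha = {1, 3}: permute axes so (x1, x3) are rows, x2 is columns -> rank 2
r13 = np.linalg.matrix_rank(np.transpose(V, (0, 2, 1)).reshape(n * n, n))

print(r12, r13)  # 1 2
```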
Tensor formats
For T ⊂ 2^D a collection of subsets of D and a given tuple r = (r_α)_{α∈T}, a tensor format is defined by

T_r^T(H) = {v ∈ H : rank_α(v) ≤ r_α, α ∈ T}.

In the particular case where T is a dimension partition tree, T_r^T is a tree-based tensor format.

[Figure: three dimension partition trees over {1, 2, 3, 4, 5}: the Tucker format (root {1,2,3,4,5} with leaves {1}, ..., {5}), the Tensor Train format (linear tree with interior nodes {1,2}, {1,2,3}, {1,2,3,4}), and the Hierarchical Tucker format (balanced tree with interior nodes {1,2,3}, {2,3}, {4,5}).]
Tree-based formats as tensor networks
Consider a tensor space H = H_1 ⊗ ... ⊗ H_d of functions in L²_µ(X), and let {φ^ν_{i_ν} : i_ν ∈ I^ν} be a basis of H_ν ⊂ L²_{µ_ν}(X_ν), typically polynomials, wavelets...

A function v in T_r^T(H) = {v ∈ H : rank_T(v) ≤ r} admits an explicit representation

v(x) = ∑_{i_α ∈ I^α, α ∈ L(T)} ∑_{1 ≤ k_β ≤ r_β, β ∈ T} ∏_{α ∈ T\L(T)} a^α_{(k_β)_{β∈S(α)}, k_α} ∏_{α ∈ L(T)} a^α_{i_α, k_α} φ^α_{i_α}(x_α)

where each parameter a^α is in a tensor space R^{K_α}.

[Figure: tensor network diagram with parameters a^{1,2,3,4,5}, a^{1,2,3}, a^{2,3}, a^{4,5} at interior nodes and a^1, ..., a^5 at the leaves.]

Representation complexity:

C(T, r, H) = ∑_{α ∈ T\L(T)} r_α ∏_{β ∈ S(α)} r_β + ∑_{α ∈ L(T)} r_α dim(H_α).
Tree-based tensor format as a deep neural network
By identifying a tensor a^{(α)} ∈ R^{n_1 × ... × n_s × r_α} with an R^{r_α}-valued multilinear function

f^{(α)} : R^{n_1} × ... × R^{n_s} → R^{r_α},

a function v in T_r^T admits a representation as a tree-structured composition of multilinear functions {f^{(α)}}_{α∈T}.

[Figure: tree of functions f^{1,2,3,4,5}, f^{1,2,3}, f^{2,3}, f^{4,5} at interior nodes and f^1, ..., f^5 at the leaves.]

v(x) = f^D(f^{1,2,3}(f^1(Φ^1(x_1)), f^{2,3}(f^2(Φ^2(x_2)), f^3(Φ^3(x_3)))), f^{4,5}(f^4(Φ^4(x_4)), f^5(Φ^5(x_5))))

where Φ^ν(x_ν) = (φ^ν_{i_ν}(x_ν))_{i_ν ∈ I^ν} ∈ R^{#I^ν}.

This corresponds to a deep network with a sparse architecture (given by T), a depth bounded by d − 1, and a width at level ℓ related to the α-ranks of the nodes α at level ℓ.
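A minimal sketch of this tree-structured evaluation (the node encoding, tensor shapes, and random parameters are illustrative assumptions), for a two-leaf tree over {1, 2} with polynomial feature maps:

```python
import numpy as np

# Evaluate a tree tensor network at x by tree-structured composition of
# multilinear maps. Assumed encoding: interior node alpha stores a tensor of
# shape (r_child1, ..., r_childk, r_alpha); leaf {nu} stores a matrix of
# shape (#basis, r_nu).
def evaluate(node, tensors, children, phi, x):
    if node not in children:                    # leaf: f^nu(Phi^nu(x_nu))
        (nu,) = node
        return phi(x[nu - 1]) @ tensors[node]   # vector of length r_nu
    a = tensors[node]
    for child in children[node]:                # contract children one by one
        z = evaluate(child, tensors, children, phi, x)
        a = np.tensordot(z, a, axes=(0, 0))
    return a                                    # vector of length r_node

f = frozenset
children = {f({1, 2}): [f({1}), f({2})]}        # a two-leaf tree
rng = np.random.default_rng(0)
tensors = {f({1}): rng.standard_normal((3, 2)),
           f({2}): rng.standard_normal((3, 2)),
           f({1, 2}): rng.standard_normal((2, 2, 1))}
phi = lambda t: np.array([1.0, t, t * t])       # polynomial feature map Phi
v = evaluate(f({1, 2}), tensors, children, phi, [0.3, 0.7])
print(v.shape)                                  # (1,)
```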
Estimation error
Given a model class F, a minimizer u_F of the risk R over F and a minimizer u_F^n of the empirical risk R_n over F, the estimation error satisfies

R(u_F^n) − R(u_F) ≤ R(u_F^n) − R_n(u_F^n) + R_n(u_F) − R(u_F),

so that

E(R(u_F^n) − R(u_F)) ≤ E(sup_{v∈F} |R_n(v) − R(v)|) = E(sup_{v∈F} |(1/n) ∑_{i=1}^n γ(v, z_i) − E(γ(v, Z))|).
Estimation error
Assume that F is compact in L^∞ and that, for all v, w ∈ F, the contrast is uniformly bounded and Lipschitz:

|γ(v, Z)| ≤ M, |γ(v, Z) − γ(w, Z)| ≤ L‖v − w‖_{L^∞}.

Then, using a standard concentration inequality (here Hoeffding's) together with a covering argument, we obtain

P(sup_{v∈F} |(1/n) ∑_{i=1}^n γ(v, z_i) − E(γ(v, Z))| ≥ εM) ≤ 2 N_{εM/(2L)} e^{−nε²/2} := η(n, ε, F),

where N_{εM/(2L)} = N(εM/(2L), F, ‖·‖_{L^∞}) is the covering number of F.

We deduce that, for any ε,

E(R(u_F^n) − R(u_F)) ≤ εM + 2Mη(n, ε, F),

and a bound can be obtained by taking the infimum over ε.
Metric entropy of tree-based tensor formats [with Bertrand Michel]

Assume H ⊂ L^∞(X) with basis functions {φ_i}_{i∈I} normalized in L^∞(X).

For any representation of v ∈ T_r^T(H) with parameters {f^α : α ∈ T}, the multilinearity of the parametrization implies

‖v‖_{L^∞} ≤ ∏_α ‖f^α‖_{α,∞},

with a suitable choice of norms

‖f^α‖_{α,∞} = sup_{‖z^β‖_∞ ≤ 1} ‖f^α((z^β)_{β∈S(α)})‖_∞.

Considering the model class

F = R F_1, F_1 = {v ∈ T_r^T(H) : max_{α∈T} ‖f^α‖_{α,∞} ≤ 1},

we obtain an upper bound on the metric entropy

log N(ε, F, ‖·‖_{L^∞}) ≤ C(T, r, H) log(3 ε^{−1} R #T),

where C(T, r, H) is the representation complexity of elements of F.
Coming back to estimation error
We conclude that

E(R(u_F^n) − R(u_F)) ≤ εM + 2Mη

for

n ≥ 2 ε^{−2} (log(2 η^{−1}) + C(T, r, H) log(6 ε^{−1} R L M^{−1} #T)),

and

E(R(u_F^n) − R(u_F)) ≲ M √(C(T, r, H) / n)

up to log factors.
Improving estimation error
Improved bounds may be obtained using refined concentration inequalities for suprema of empirical processes

sup_{v∈F} |(1/n) ∑_{i=1}^n γ(v, z_i) − E(γ(v, Z))|.

Better performance can be obtained by using as empirical risk the weighted version

R_n(v) = (1/n) ∑_{i=1}^n w_i γ(v, z_i),

where the z_i are i.i.d. samples from a measure d\tilde{P}_Z = w(z) dP_Z and the weights are w_i = w(z_i)^{−1}.

The choice of sampling measure should be adapted to the risk and the model class, and may be deduced from concentration inequalities. See [Cohen and Migliorati 2017] for least-squares regression and linear models.

For tree tensor networks, use a different sample for each parameter. Link to empirical PCA [Nouy 2019] [Haberstich, Nouy, Perrin 2020].
Approximation properties of tree tensor networks
We want to quantify the approximation error
min_{v ∈ T_r^T} R(v) − R(u)

for a target function u in a given function class, i.e. to study the expressive power of tree tensor networks and compare it with other approximation tools.
Approximation properties of tree tensor networks
We consider least-squares regression and L² density estimation, where

min_{v ∈ T_r^T} R(v) − R(u) = min_{v ∈ T_r^T} ‖u − v‖²_{L²_µ} := e_{T,r}(u)².

Since

T_r^T = ⋂_{α∈T} {v : rank_α(v) ≤ r_α},

a lower bound is given by

e_{T,r}(u) ≥ max_{α∈T} e_{α,r_α}(u),

where

e_{α,r_α}(u) = min_{rank_α(v) ≤ r_α} ‖u − v‖_{L²_µ}.
Singular value decomposition
u(x_1, ..., x_d) can be identified with a bivariate function u(x_α, x_{α^c}) in L²_{µ_α ⊗ µ_{α^c}}(X_α × X_{α^c}), which admits a singular value decomposition

u(x_α, x_{α^c}) = ∑_{k=1}^{rank_α(u)} σ^α_k v^α_k(x_α) v^{α^c}_k(x_{α^c}).

The problem of best approximation of u by a function with α-rank r_α admits as a solution the truncated singular value decomposition u_{r_α} of u,

u_{r_α}(x_α, x_{α^c}) = ∑_{k=1}^{r_α} σ^α_k v^α_k(x_α) v^{α^c}_k(x_{α^c}),

where {v^α_1, ..., v^α_{r_α}} are the r_α α-principal components of u, and

e_{α,r_α}(u) = √(∑_{k>r_α} (σ^α_k)²).
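A minimal sketch on a grid (the kernel exp(x·y) is an illustrative assumption): the truncated SVD gives the best α-rank-r approximation, and its error matches √(∑_{k>r} σ_k²) in the Frobenius (discrete L²) norm.

```python
import numpy as np

# Best alpha-rank-r approximation of a discretized bivariate function via
# truncated SVD, checking e_{alpha,r}(u) = sqrt(sum_{k>r} sigma_k^2).
n = 50
x = np.linspace(0.0, 1.0, n)
U = np.exp(np.outer(x, x))                       # u(x_a, x_ac) = exp(x_a x_ac)

W, s, Vt = np.linalg.svd(U, full_matrices=False)
r = 3
Ur = W[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]       # truncated SVD u_r

err = np.linalg.norm(U - Ur)                     # Frobenius norm of the error
print(err, np.sqrt(np.sum(s[r:] ** 2)))          # the two values coincide
```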
Linear widths of multivariate functions
The subspace of principal components U_α = span{v^α_1, ..., v^α_{r_α}} is a solution of

min_{dim(U_α)=r_α} ∫ ‖u(·, x_{α^c}) − P_{U_α} u(·, x_{α^c})‖²_{L²_{µ_α}} dµ_{α^c}(x_{α^c}) = e_{α,r_α}(u)²,

where P_{U_α} is the orthogonal projection onto U_α.

Considering the set of partial evaluations of u,

K_α(u) = {u(·, x_{α^c}) : x_{α^c} ∈ X_{α^c}} ⊂ L²_{µ_α}(X_α),

we have

e_{α,r_α}(u) ≤ min_{dim(U_α)=r_α} sup_{v ∈ K_α(u)} ‖v − P_{U_α} v‖_{L²_{µ_α}} = d_{r_α}(K_α(u))_{L²_{µ_α}},

the upper bound being the Kolmogorov r_α-width of K_α(u).

Furthermore, since e_{α,r_α}(u) = e_{α^c,r_α}(u), we have

e_{α,r_α}(u) ≤ min{d_{r_α}(K_α(u))_{L²_{µ_α}}, d_{r_α}(K_{α^c}(u))_{L²_{µ_{α^c}}}}.
Upper bound through higher order singular value decomposition
Given the spaces of principal components U_α, α ∈ T, we can define an approximation

u_r = ∏_{α∈T} P_{U_α} u ∈ T_r^T,

obtained by successive orthogonal projections (suitably ordered), such that [Grasedyck 2010]

e_{T,r}(u)² ≤ ‖u − u_r‖² ≤ ∑_{α∈T} e_{α,r_α}(u)².

Error bounds can then be obtained from information on the α-singular values of u, or on Kolmogorov widths of the sets K_α(u) and K_{α^c}(u) of partial evaluations of u.
Expressive power of tree tensor networks
For standard regularity classes, tree tensor networks perform almost as well as standard approximation tools.

For example, for u ∈ H^k_mix((0,1)^d), K_α(u) ⊂ H^k_mix((0,1)^{#α}) for any α. From bounds on Kolmogorov widths of Sobolev balls,

e_{α,r_α}(u) ≤ d_{r_α}(K_α(u)) ≲ r_α^{−k} log(r_α)^{k(#α−1)},

we obtain that the complexity to achieve a precision ε (with binary trees) is

C(ε) ≲ ε^{−3/k} d^{1+3/(2k)}, up to powers of log(ε^{−1}).

This performs almost as well as hyperbolic cross approximation (sparse tensors).

Similar results in [Schneider and Uschmajew 2014], using results on bilinear approximation [Temlyakov 1989]. See also [Griebel and Harbrecht 2019].
Expressive power of tree tensor networks [with M. Bachmayr and R. Schneider]

But they can perform much better for non-standard classes of functions, e.g. a tree-structured composition of regular functions {f_α : α ∈ T}; see [Mhaskar, Liao, Poggio 2016] for deep neural networks.

f_{1,2,3,4}(f_{1,2}(f_1(x_1), f_2(x_2)), f_{3,4}(f_3(x_3), f_4(x_4)))

[Figure: balanced tree over {1, 2, 3, 4} with interior nodes {1,2} and {3,4}.]

Assuming that the functions f_α ∈ W^{k,∞} with ‖f_α‖_{L^∞} ≤ 1 and ‖f_α‖_{W^{k,∞}} ≤ B, the complexity to achieve an accuracy ε is

C(ε) ≲ ε^{−3/k} (L + 1)³ B^{3L} d^{1+3/(2k)},

with L = log_2(d) for a balanced tree and L + 1 = d for a linear tree.

• Bad influence of the depth through the norm B of the functions f_α (roughness).
• For B ≤ 1 (and even for 1-Lipschitz functions), the complexity scales only polynomially in d: no curse of dimensionality!
Expressive power of tree tensor networks
A function in canonical format (a shallow network),

u(x) = ∑_{k=1}^r u^1_k(x_1) ⋯ u^d_k(x_d),

can be represented in tree-based format with a similar complexity.

Conversely, a typical function in tree-based format T_r^T has a canonical rank depending exponentially on d.

Deep is better!

For a balanced or linear binary tree T, the subset of tensors v in T_r^T(R^{n×...×n}) with canonical rank less than min{n, r}^{d/2} is of Lebesgue measure 0 [Cohen et al. 2016, Khrulkov et al. 2018].

But a typical function in T_r^T may admit a representation complexity exponential in d when using another tree.
Influence of the tree
As an example, consider the probability distribution f(x) = P(X = x) of a Markov chain X = (X_1, ..., X_d), given by

f(x) = f_1(x_1) f_{2|1}(x_2|x_1) ⋯ f_{d|d−1}(x_d|x_{d−1}),

where the bivariate functions f_{i|i−1} have a rank bounded by r.

With the linear tree T containing the interior nodes {1,2}, {1,2,3}, ..., {1,...,d−1}, f admits a representation in tree-based format with storage complexity in r^4.

The canonical rank of f is exponential in d.

But when considering the linear tree T_σ = {σ(α) : α ∈ T} obtained by applying the permutation σ = (1, 3, ..., d−1, 2, 4, ..., d) to the tree T, the storage complexity in tree-based format is also exponential in d.
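This ordering effect can be observed numerically. A minimal sketch (the chain length, state-space size, and random transition kernels are illustrative assumptions): the joint tensor of a Markov chain has small unfolding rank for a prefix split such as {1,2}, but a much larger rank for an interleaved split such as {1,3,5}.

```python
import numpy as np

# Joint law of a Markov chain: small ranks for prefix splits, large ranks
# for interleaved splits (i.e. for a badly permuted tree).
rng = np.random.default_rng(0)
d, m = 6, 3                                     # chain length, state-space size

normalize = lambda A, ax: A / A.sum(axis=ax, keepdims=True)
p0 = normalize(rng.random(m), 0)                # initial law f_1
P = [normalize(rng.random((m, m)), 1) for _ in range(d - 1)]  # f_{i|i-1}

F = p0                                          # build f(x_1, ..., x_d)
for T in P:
    F = F[..., None] * T                        # f(..., x_i) * f(x_{i+1}|x_i)

# prefix split {1,2} | {3,...,6}: only the edge (2,3) crosses -> rank <= m
r_prefix = np.linalg.matrix_rank(F.reshape(m * m, -1))

# interleaved split {1,3,5} | {2,4,6}: every chain edge crosses -> large rank
Fp = np.transpose(F, (0, 2, 4, 1, 3, 5)).reshape(m**3, m**3)
r_inter = np.linalg.matrix_rank(Fp)

print(r_prefix, r_inter)
```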
Approximation properties of tree tensor networks
Choosing a good tree (the architecture of the network) is a crucial but combinatorial problem...

[Figure: four different dimension partition trees over {1, ..., 8}, illustrating the many possible architectures.]
Learning algorithm for tree tensor networks
A function v in the model class T_r^T(H) has a representation v(x) = Ψ(x)((a^α)_{α∈T}), where each parameter a^α is in a tensor space R^{K_α} and Ψ(x) is a multilinear map.

The empirical risk minimization problem over the nonlinear model class T_r^T,

min_{(a^α)_{α∈T}} (1/n) ∑_{i=1}^n γ(Ψ(·)((a^α)_{α∈T}), z_i),

can be solved using an alternating minimization algorithm, solving at each step an empirical risk minimization problem with a linear model

Ψ(x)((a^α)_{α∈T}) = ∑_{k∈K_α} Ψ^α_k(x) a^α_k,

with functions Ψ^α_k(x) depending on the fixed parameters a^β, β ≠ α.

In an L² setting, a re-parametrization is possible to obtain orthonormal functions Ψ^α_k(x).

Sparsity in the tensors a^α can be exploited in different ways, e.g. by proposing different sparsity patterns and using a model selection technique (e.g. based on validation).

For a leaf node ν, the approximation space H_ν can be selected from a candidate sequence of nested spaces H^ν_0 ⊂ ... ⊂ H^ν_L ⊂ ...
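A minimal sketch of the alternating strategy for least-squares regression with a rank-1 separable model v(x) = (a·Φ(x₁))(b·Φ(x₂)) (the target, features, and sample size are illustrative assumptions): with one factor fixed, each step reduces to an ordinary linear least-squares problem.

```python
import numpy as np

# Alternating minimization for the empirical least-squares risk over the
# rank-1 model class v(x) = (a . phi(x1)) * (b . phi(x2)): with one factor
# fixed, the subproblem is linear in the other factor.
rng = np.random.default_rng(0)
n, p = 200, 4
X = rng.uniform(size=(n, 2))
y = (1 + 2 * X[:, 0]) * (3 - X[:, 1])               # an exactly rank-1 target

phi = lambda t: np.vander(t, p, increasing=True)    # monomial features
P1, P2 = phi(X[:, 0]), phi(X[:, 1])

a, b = rng.standard_normal(p), rng.standard_normal(p)
for _ in range(50):                                 # alternating least squares
    a = np.linalg.lstsq(P1 * (P2 @ b)[:, None], y, rcond=None)[0]
    b = np.linalg.lstsq(P2 * (P1 @ a)[:, None], y, rcond=None)[0]

risk = np.mean(((P1 @ a) * (P2 @ b) - y) ** 2)
print(risk)   # the target is exactly representable, so the risk drops to ~0
```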
Learning algorithm for tree tensor networks
Selecting an optimal model class T_r^T(H) is a combinatorial problem.

An algorithm is proposed in [Grelier, Nouy, Chevreuil 2018] that performs adaptations of the tree T (architecture), the ranks r (widths) and the approximation space H.

Start with an initial tree T and learn an approximation v ∈ T_r^T(H) with rank r = (1, ..., 1). Then repeat:

Increase some ranks r_α based on estimates of the truncation errors

min_{rank_α(v) ≤ r_α} R(v) − R(u)

Learn an approximation v in T_r^T(H), with adaptive selection of H

Optimize the tree to reduce the storage complexity of v (stochastic algorithm using a suitable distribution over the set of trees):

min_T C(T, rank_T(v), H)
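The rank-increase step can be mimicked with singular values of matricizations: the tail of the singular value spectrum of the α-matricization estimates the error committed by truncating rank_α. A sketch under this assumption (the actual criterion in [Grelier, Nouy, Chevreuil 2018] may differ):

```python
import numpy as np

def truncation_errors(A, alpha):
    """Relative errors of the best rank-k truncations of the
    alpha-matricization of A, for k = 0, 1, 2, ..."""
    rest = [m for m in range(A.ndim) if m not in alpha]
    rows = int(np.prod([A.shape[m] for m in alpha]))
    M = np.transpose(A, alpha + rest).reshape(rows, -1)
    s = np.linalg.svd(M, compute_uv=False)
    # tail[k] = sqrt(sum of squared singular values beyond the first k)
    tail = np.sqrt(np.cumsum(s[::-1] ** 2))[::-1]
    return tail / np.linalg.norm(s)

# a nearly rank-one 3rd-order tensor
rng = np.random.default_rng(0)
A = np.einsum('i,j,k->ijk', *(rng.standard_normal(4) for _ in range(3)))
A += 0.1 * rng.standard_normal((4, 4, 4))

# greedy step: increase the rank of the node with largest truncation error
ranks = {(0,): 1, (1,): 1, (2,): 1}
errs = {a: truncation_errors(A, list(a))[r] for a, r in ranks.items()}
worst = max(errs, key=errs.get)
ranks[worst] += 1
print(worst, ranks)
```

In the learning setting the tensor A is not available; the slides' algorithm estimates such errors from the current approximation and the data instead.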
Anthony Nouy Centrale Nantes 31 / 42
Example in supervised learning: composition of functions
Consider a tree-structured composition of functions
u(X) = h(h(h(X1, X2), h(X3, X4)), h(h(X5, X6), h(X7, X8))),

where h(t, s) = (2 + ts)^2/9 is a bivariate function and the d = 8 random variables X1, . . . , X8 are independent and uniform on [−1, 1].
[Figure: balanced binary tree of compositions of h, with leaves X1, . . . , X8.]
We use polynomial approximation spaces H (with adaptive selection of the degree), so that the function u could (in principle) be recovered exactly for any choice of tree with a sufficiently high rank.
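For reference, the target and training samples are straightforward to generate (a minimal sketch; names are illustrative):

```python
import numpy as np

def h(t, s):
    # bivariate building block from the slide
    return (2 + t * s) ** 2 / 9

def u(X):
    # tree-structured composition matching the figure (d = 8)
    X1, X2, X3, X4, X5, X6, X7, X8 = X.T
    return h(h(h(X1, X2), h(X3, X4)), h(h(X5, X6), h(X7, X8)))

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 8))   # X_i i.i.d. uniform on [-1, 1]
y = u(X)                                  # training outputs (noise-free)
print(y.shape, float(y.min()), float(y.max()))
```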
Anthony Nouy Centrale Nantes 32 / 42
Example in supervised learning: composition of functions
We consider the tree T1 coinciding with the structure of u, for which

C(T1, rankT1(u), H) = 2427
[Figure: (a) the tree T1, with leaves {1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}; (b) the permuted tree T1σ, with leaves {8}, {1}, {6}, {4}, {7}, {2}, {3}, {5}.]
By considering a permutation T1σ = {σ(α) : α ∈ T1} of T1, with σ = (8, 1, 6, 4, 7, 2, 3, 5), we have a complexity

C(T1σ, rankT1σ(u), H) ≥ 9·10^6
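The storage complexities above can be reproduced in spirit by counting parameters, assuming C(T, r, H) sums the sizes of all parameter tensors: r_α ∏_{β∈S(α)} r_β for each interior node α with children S(α), plus r_ν dim(H^ν) for each leaf ν. A sketch on a toy tree (the exact complexity measure used in the slides is an assumption here):

```python
def storage_complexity(children, ranks, leaf_dims):
    """children: interior node -> list of its children (root has rank 1);
    ranks: node -> r_alpha; leaf_dims: leaf -> dim(H_nu)."""
    c = 0
    for node, kids in children.items():
        size = ranks[node]
        for kid in kids:
            size *= ranks[kid]          # one core of size r_a * prod of r_b
        c += size
    for leaf, dim in leaf_dims.items():
        c += ranks[leaf] * dim          # leaf basis coefficients
    return c

# toy balanced binary tree over {1, 2, 3, 4} with illustrative ranks/dims
children = {'1234': ['12', '34'], '12': ['1', '2'], '34': ['3', '4']}
ranks = {'1234': 1, '12': 2, '34': 2, '1': 2, '2': 2, '3': 2, '4': 2}
leaf_dims = {'1': 5, '2': 5, '3': 5, '4': 5}
print(storage_complexity(children, ranks, leaf_dims))  # 4 + 8 + 8 + 40 = 60
```

Under this count, ranks that are small on one tree can become large after a permutation of the leaves, which is what drives the gap between 2427 and 9·10^6.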
Anthony Nouy Centrale Nantes 33 / 42
Example in supervised learning: composition of functions
We consider a linear tree T2 and start the algorithm from a tree T2σ = {σ(α) : α ∈ T2} obtained by applying a random permutation σ to T2.
[Figure: (e) the linear tree T2, with leaves {1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}; (f) the tree T2σ for σ = (6, 8, 4, 5, 1, 7, 2, 3).]
Anthony Nouy Centrale Nantes 34 / 42
Example in supervised learning: composition of functions

Behavior of the algorithm with a sample size n = 10^5:

Iteration  rank r                                           εtest(v)      C(T, r, H)
1          (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)   3.38 10−2     79
2          (1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1)   2.95 10−2     100
3          (1, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1)   2.45 10−2     121
4          (1, 1, 2, 1, 2, 1, 1, 1, 2, 1, 2, 1, 2, 2, 1)   1.85 10−2     142
5          (1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 2, 2)   8.97 10−3     163
6          (1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2)   8.89 10−3     188
7          (1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2)   8.87 10−3     188
8          (1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2)   3.97 10−3     188
9          (1, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 3, 3)   1.55 10−4     308
10         (1, 3, 3, 3, 3, 2, 3, 3, 3, 2, 3, 3, 3, 3, 3)   1.18 10−4     364
11         (1, 3, 4, 3, 4, 2, 4, 3, 4, 2, 4, 3, 4, 4, 4)   6.65 10−6     520
12         (1, 3, 5, 3, 5, 3, 5, 3, 5, 3, 5, 3, 5, 5, 5)   1.19 10−6     723
13         (1, 4, 5, 4, 5, 3, 5, 4, 5, 3, 5, 4, 5, 5, 5)   1.72 10−7     865
14         (1, 4, 6, 4, 6, 3, 6, 4, 6, 3, 6, 4, 6, 6, 6)   1.47 10−8     1113
15         (1, 5, 6, 5, 6, 3, 6, 5, 6, 3, 6, 5, 6, 6, 6)   7.02 10−9     1311
16         (1, 5, 7, 5, 7, 3, 7, 5, 7, 3, 7, 5, 7, 7, 7)   1.27 10−10    1643
17         (1, 5, 8, 5, 8, 3, 8, 5, 8, 3, 8, 5, 8, 8, 8)   3.87 10−12    2015
18         (1, 5, 9, 5, 9, 3, 9, 5, 9, 3, 9, 5, 9, 9, 9)   2.95 10−14    2427

[Figures: intermediate trees explored by the algorithm at each iteration, converging to the structure of u.]

Anthony Nouy Centrale Nantes 35 / 42
Example in supervised learning: composition of functions
Behavior of the algorithm for different sample sizes n.
n      P(T = T1)   εtest(v)                  C(T, r, H)
10^3   90%         [1.75 10−5, 1.75 10−4]    [360, 1062]
10^4   90%         [2.15 10−8, 4.10 10−3]    [185, 2741]
10^5   100%        [4.67 10−15, 8.92 10−3]   [163, 2594]

Table: training sample size n, estimated probability of obtaining the ideal tree T1, and ranges (over the 10 trials) for the test error and the storage complexity.
Anthony Nouy Centrale Nantes 36 / 42
Example in unsupervised learning
We consider a truncated normal distribution with zero mean and covariance matrix Σ. Its support is X = ×_{ν=1}^{6} [−5σν, 5σν], with σν² = Σνν, and its density (with respect to the Lebesgue measure) is

f(x) ∝ exp(−(1/2) x^T Σ^{-1} x) 1_{x∈X}.
We consider polynomial approximation spaces Hν (with adaptive selection of the degree).
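Samples of this truncated distribution can be drawn by rejection: sample from N(0, Σ) and keep the points falling in X (with a 5σ box, almost all points are kept). A minimal sketch with an illustrative 2×2 covariance:

```python
import numpy as np

Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])                 # illustrative covariance
sig = np.sqrt(np.diag(Sigma))
lo, hi = -5 * sig, 5 * sig                     # support: a 5-sigma box

rng = np.random.default_rng(0)
Z = rng.multivariate_normal(np.zeros(2), Sigma, size=20000)
keep = np.all((Z >= lo) & (Z <= hi), axis=1)
X = Z[keep]                                    # samples of the truncated law

def density_unnorm(x):
    # unnormalized density exp(-x^T Sigma^{-1} x / 2) restricted to the box
    q = np.einsum('...i,ij,...j->...', x, np.linalg.inv(Sigma), x)
    return np.exp(-0.5 * q) * np.all((x >= lo) & (x <= hi), axis=-1)

print(keep.mean())                             # acceptance rate close to 1
```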
Anthony Nouy Centrale Nantes 37 / 42
Example in unsupervised learning
Consider

Σ =
  [ 2    0    0.5  1    0    0.5 ]
  [ 0    1    0    0    0.5  0   ]
  [ 0.5  0    2    0    0    1   ]
  [ 1    0    0    3    0    0   ]
  [ 0    0.5  0    0    1    0   ]
  [ 0.5  0    1    0    0    2   ]

After a permutation (3, 6, 1, 4, 2, 5) of its rows and columns, we obtain the block-diagonal matrix

  [ 2    1    0.5  0    0    0   ]
  [ 1    2    0.5  0    0    0   ]
  [ 0.5  0.5  2    1    0    0   ]
  [ 0    0    1    3    0    0   ]
  [ 0    0    0    0    1    0.5 ]
  [ 0    0    0    0    0.5  1   ]
(X1, X3, X4, X6) and (X2, X5) are independent, as are X4 and (X3, X6), so that

f(x) = f1,3,4,6(x1, x3, x4, x6) f2,5(x2, x5) = f4,1(x4, x1) f1,3,6(x1, x3, x6) f2,5(x2, x5)

with rank{2,5}(f) = rank{1,3,4,6}(f) = 1 and rank{1,4}(f) = rank{1}(f1,3,6) = rank{3,6}(f1,3,6) = rank{3,6}(f).
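The block-diagonal structure after permutation, which underlies these independence and rank statements, can be checked numerically with the Σ from the previous slide:

```python
import numpy as np

Sigma = np.array([
    [2.0, 0.0, 0.5, 1.0, 0.0, 0.5],
    [0.0, 1.0, 0.0, 0.0, 0.5, 0.0],
    [0.5, 0.0, 2.0, 0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0, 3.0, 0.0, 0.0],
    [0.0, 0.5, 0.0, 0.0, 1.0, 0.0],
    [0.5, 0.0, 1.0, 0.0, 0.0, 2.0],
])
perm = [2, 5, 0, 3, 1, 4]        # permutation (3, 6, 1, 4, 2, 5), 0-based
P = Sigma[np.ix_(perm, perm)]
# zero off-diagonal blocks: (X1, X3, X4, X6) is uncorrelated with (X2, X5)
print(np.all(P[:4, 4:] == 0) and np.all(P[4:, :4] == 0))   # True
```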
Anthony Nouy Centrale Nantes 38 / 42
Example in unsupervised learning

f(x) = f4,1(x4, x1) f1,3,6(x1, x3, x6) f2,5(x2, x5)

n      Risk × 10−2       L2-error       T         C(T, r, H)
10^2   [−5.50, 119]      [0.53, 4.06]   Fig. (a)  [311, 311]
10^3   [−7.29, −5.93]    [0.22, 0.47]   Fig. (b)  [311, 637]
10^4   [−7.60, −6.85]    [0.11, 0.33]   Fig. (c)  [521, 911]
10^5   [−7.68, −7.66]    [0.04, 0.07]   Fig. (c)  [911, 1213]
10^6   [−7.70, −7.69]    [0.01, 0.01]   Fig. (c)  [1283, 1546]

Table: Ranges over 10 trials
Figure: (a) Best tree over 10 trials for n = 10^2: a linear tree {1,2,3,4,5,6} ⊃ {2,3,4,5,6} ⊃ {2,4,5,6} ⊃ {2,4,5} ⊃ {4,5}, with leaves {1}, {3}, {6}, {2}, {4}, {5}.
Anthony Nouy Centrale Nantes 39 / 42
Figure: (b) Best tree over 10 trials for n = 10^3: root {1,2,3,4,5,6} with children {1,3,4,6} and {2,5}; {1,3,4,6} → {3,4,6}, {1}; {3,4,6} → {3,6}, {4}; {3,6} → {3}, {6}; {2,5} → {2}, {5}.
Figure: (c) Best tree over 10 trials for n = 10^4, 10^5, 10^6: root {1,2,3,4,5,6} with children {1,3,4,6} and {2,5}; {1,3,4,6} → {1,4}, {3,6}; {1,4} → {1}, {4}; {3,6} → {3}, {6}; {2,5} → {2}, {5}.
The convergence rate is close to O(n^{-1/2}), the minimax rate for the estimation of analytic densities.
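This rate can be read off the table: a log-log fit of the midpoints of the reported L2-error ranges against n (excluding the smallest sample size, where the estimates are unreliable) gives a slope close to −1/2:

```python
import numpy as np

n = np.array([1e3, 1e4, 1e5, 1e6])
# midpoints of the L2-error ranges reported in the table above
err = np.array([(0.22 + 0.47) / 2, (0.11 + 0.33) / 2,
                (0.04 + 0.07) / 2, (0.01 + 0.01) / 2])
slope = np.polyfit(np.log(n), np.log(err), 1)[0]
print(round(slope, 2))    # close to -0.5, i.e. an O(n^(-1/2)) decay
```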
Concluding remarks
Open questions for a complete learning theory...
The obtained theoretical results concern the minimizer u_F^n of the empirical risk over the model class, but the available algorithms are not guaranteed to find a solution of

min_{v ∈ T_r^T} Rn(v)

Algorithms generate a sequence of estimations in different model classes. Model selection strategies should be proposed (for selecting the tree, the ranks and the background approximation space) that guarantee oracle inequalities.

Convexification of tree tensor networks?
Anthony Nouy Centrale Nantes 40 / 42
Concluding remarks
A fundamental problem would be to characterize approximation classes Aγ of tree tensor networks, i.e. those functions for which tree tensor networks achieve a certain performance:

inf_{v ∈ T_r^T(H)} R(v) − R(u) ≲ γ(C)^{−1}

for some growth function γ, with C a measure of complexity of T_r^T(H).
These are not standard regularity classes (see [Dahmen et al 2016, Bachmayr andDahmen 2015]).
For an approximation class Aγ, we would like to devise (black box) algorithms that select a model class T_r^T(H) and provide an estimation û ∈ T_r^T(H) such that

E(R(û) − R(u)) ≲ γ(C)^{−1},

possibly with γ replaced by another growth function.
Anthony Nouy Centrale Nantes 41 / 42
Thank you for your attention
References
A. Falco, W. Hackbusch, and A. Nouy. Tree-based tensor formats. SeMA Journal, Oct 2018.

E. Grelier, A. Nouy, and M. Chevreuil. Learning with tree-based tensor formats. ArXiv e-prints, Nov 2018.

A. Nouy. Higher-order principal component analysis for the approximation of tensors in tree-based low-rank formats. Numerische Mathematik, 141(3):743–789, Mar 2019.

E. Grelier, R. Lebrun, and A. Nouy. Learning high-dimensional probability distributions using tree tensor networks. Coming soon...
Anthony Nouy Centrale Nantes 42 / 42