Une journée autour de Ron DeVore, Sep. 13, 2019

Approximation with tree tensor networks

Anthony Nouy

Centrale Nantes, Laboratoire de Mathématiques Jean Leray
Anthony Nouy Centrale Nantes 1 / 42
Statistical learning
Two typical tasks of statistical learning are to

approximate a random variable Y by a function of a set of variables X = (X_1, ..., X_d), from samples of the pair Z = (X, Y) (supervised learning),

approximate the probability distribution of a random vector Z = (Z_1, ..., Z_d) from samples of the distribution (unsupervised learning).
Risk
A classical approach is to introduce a risk functional R(v) whose minimizer over the set of functions v is the target function u, and such that

R(v) − R(u)

measures some distance between the target u and the function v.

The risk is defined as an expectation

R(v) = E(γ(v, Z))

where γ is called a contrast (or loss) function.

For least-squares regression in supervised learning, R(v) = E((Y − v(X))²), u(X) = E(Y|X), and

R(v) − R(u) = E((u(X) − v(X))²) = ‖u − v‖²_{L²_µ} with X ∼ µ.

For unsupervised learning with L²-loss, R(v) = E(‖v‖²_{L²_µ} − 2v(Z)), and R(v) − R(u) = ‖u − v‖²_{L²_µ} is the L² distance between v and the probability density u of Z with respect to a reference measure µ.
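The least-squares identity above can be checked numerically. A minimal Monte Carlo sketch (the model Y = X² + noise and the competitor v are illustrative assumptions, not from the slides):

```python
import numpy as np

# Monte Carlo check of R(v) - R(u) = E((u(X) - v(X))^2) for the
# least-squares risk R(v) = E((Y - v(X))^2), with u(X) = E(Y|X).
rng = np.random.default_rng(0)
n = 1_000_000
X = rng.uniform(0.0, 1.0, n)
Y = X**2 + 0.3 * rng.standard_normal(n)   # so that u(x) = E(Y|X=x) = x^2

u = lambda x: x**2                        # target (conditional expectation)
v = lambda x: 0.5 * x                     # an arbitrary competitor

R = lambda f: np.mean((Y - f(X)) ** 2)    # empirical version of the risk
lhs = R(v) - R(u)
rhs = np.mean((u(X) - v(X)) ** 2)
print(lhs, rhs)                           # agree up to Monte Carlo error
```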
Risk
A classical approach is to introduce a risk functional R(v) whose minimizer over the setof functions v is the target function u and such that
R(v)−R(u)
measures some distance between the target u and the function v .
The risk is defined as an expectation
R(v) = E(γ(v ,Z))
where γ is called a contrast (or loss) function.
For least-squares regression in supervised learning, R(v) = E((Y − v(X ))2),u(X ) = E(Y |X ) and
R(v)−R(u) = E((u(X )− v(X ))2) = ‖u − v‖2L2µwith X ∼ µ.
For unsupervised learning with L2-loss, R(v) = E(‖v‖2L2µ − 2v(Z)) and
R(v)−R(u) = ‖u − v‖2L2µ is the L2 distance between v and the probability density
u of Z with respect to a reference measure µ.
Anthony Nouy Centrale Nantes 3 / 42
Risk
A classical approach is to introduce a risk functional R(v) whose minimizer over the setof functions v is the target function u and such that
R(v)−R(u)
measures some distance between the target u and the function v .
The risk is defined as an expectation
R(v) = E(γ(v ,Z))
where γ is called a contrast (or loss) function.
For least-squares regression in supervised learning, R(v) = E((Y − v(X ))2),u(X ) = E(Y |X ) and
R(v)−R(u) = E((u(X )− v(X ))2) = ‖u − v‖2L2µwith X ∼ µ.
For unsupervised learning with L2-loss, R(v) = E(‖v‖2L2µ − 2v(Z)) and
R(v)−R(u) = ‖u − v‖2L2µ is the L2 distance between v and the probability density
u of Z with respect to a reference measure µ.
Anthony Nouy Centrale Nantes 3 / 42
Risk
A classical approach is to introduce a risk functional R(v) whose minimizer over the setof functions v is the target function u and such that
R(v)−R(u)
measures some distance between the target u and the function v .
The risk is defined as an expectation
R(v) = E(γ(v ,Z))
where γ is called a contrast (or loss) function.
For least-squares regression in supervised learning, R(v) = E((Y − v(X ))2),u(X ) = E(Y |X ) and
R(v)−R(u) = E((u(X )− v(X ))2) = ‖u − v‖2L2µwith X ∼ µ.
For unsupervised learning with L2-loss, R(v) = E(‖v‖2L2µ − 2v(Z)) and
R(v)−R(u) = ‖u − v‖2L2µ is the L2 distance between v and the probability density
u of Z with respect to a reference measure µ.
Anthony Nouy Centrale Nantes 3 / 42
Risk
Variational methods for PDEs [Eigel et al. 2018]: with Z uniformly distributed on D = (0, 1)^d and a risk

R(v) = E(|∇v(Z)|² − 2 v(Z) f(Z)),

the target function u ∈ H¹_0 is such that

−∆u = f on D,

and R(v) − R(u) = ‖v − u‖²_{H¹_0}.
Empirical risk minimization
Given i.i.d. samples {z_i}_{i=1}^n of Z, an approximation u_F^n of u is obtained by minimizing the empirical risk

R_n(v) = (1/n) ∑_{i=1}^n γ(v, z_i)

over a certain model class F.

Denoting by u_F the minimizer of the risk over F, the error decomposes as

R(u_F^n) − R(u) = R(u_F^n) − R(u_F) (estimation error) + R(u_F) − R(u) (approximation error).

For a given sample, taking larger and larger model classes makes the approximation error decrease while the estimation error increases.

Methods should be proposed for selecting a model class that makes the best use of the available information.
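A minimal sketch of this trade-off, using nested polynomial model classes for least-squares regression (the target cos(πx), the noise level, and the sample sizes are illustrative assumptions): the empirical risk can only decrease as the class grows, while the true risk eventually deteriorates.

```python
import numpy as np

# Empirical risk minimization (least squares) over nested polynomial classes.
rng = np.random.default_rng(0)
n = 30
x = rng.uniform(-1.0, 1.0, n)
y = np.cos(np.pi * x) + 0.1 * rng.standard_normal(n)   # samples of (X, Y)

x_test = rng.uniform(-1.0, 1.0, 5000)                  # proxy for the true risk
y_test = np.cos(np.pi * x_test)

emp, true = [], []
for degree in range(1, 16):                            # nested model classes
    A = np.vander(x, degree + 1)
    coef = np.linalg.lstsq(A, y, rcond=None)[0]        # empirical risk minimizer
    emp.append(np.mean((A @ coef - y) ** 2))           # R_n: non-increasing
    true.append(np.mean((np.vander(x_test, degree + 1) @ coef - y_test) ** 2))

print(emp[0], emp[-1])                                 # in-sample error shrinks
```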
Outline
1 Tree tensor networks
2 Estimation error
3 Approximation error
4 Learning algorithms
Tensor ranks
We consider a space H of functions defined on the set X = X_1 × ... × X_d, equipped with a product measure µ = µ_1 ⊗ ... ⊗ µ_d.

For a subset of variables α ⊂ {1, ..., d} := D, a function v ∈ H can be identified with a bivariate function

v(x_α, x_{α^c})

where x_α and x_{α^c} are complementary groups of variables.

The canonical rank of this bivariate function is called the α-rank of v, denoted rank_α(v). It is the minimal integer r_α such that

v(x) = ∑_{k=1}^{r_α} v_k^α(x_α) w_k^{α^c}(x_{α^c}).
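On a grid, the α-rank can be read off as the matrix rank of an unfolding of the discretized tensor. A minimal sketch (the function sin(x₁+x₂)cos(x₃) is an illustrative assumption): across the split {1,2}|{3} it has rank 1, while across {1,3}|{2} it has rank 2, since sin(x₁+x₂) = sin x₁ cos x₂ + cos x₁ sin x₂.

```python
import numpy as np

# alpha-rank of a discretized function = rank of the corresponding unfolding.
n = 20
x = np.linspace(0.0, 1.0, n)
X1, X2, X3 = np.meshgrid(x, x, x, indexing="ij")
V = np.sin(X1 + X2) * np.cos(X3)                  # tensor of shape (n, n, n)

# alpha = {1, 2}: group (x1, x2) as rows, x3 as columns -> rank 1
r12 = np.linalg.matrix_rank(V.reshape(n * n, n))

# alpha = {1, 3}: permute axes so (x1, x3) are rows, x2 is columns -> rank 2
r13 = np.linalg.matrix_rank(np.transpose(V, (0, 2, 1)).reshape(n * n, n))

print(r12, r13)  # 1 2
```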
Tensor formats
For T ⊂ 2^D a collection of subsets of D and a given tuple r = (r_α)_{α∈T}, a tensor format is defined by

T_r^T(H) = {v ∈ H : rank_α(v) ≤ r_α, α ∈ T}.

In the particular case where T is a dimension partition tree, T_r^T is a tree-based tensor format.

[Figure: three dimension partition trees over {1, 2, 3, 4, 5}: the Tucker format (root {1,2,3,4,5} with leaves {1}, ..., {5}), the Tensor Train format (linear tree with interior nodes {1,2}, {1,2,3}, {1,2,3,4}), and the Hierarchical Tucker format (balanced tree with interior nodes {1,2,3}, {2,3}, {4,5}).]
Tree-based formats as tensor networks
Consider a tensor space H = H_1 ⊗ ... ⊗ H_d of functions in L²_µ(X), and let {φ^ν_{i_ν} : i_ν ∈ I^ν} be a basis of H_ν ⊂ L²_{µ_ν}(X_ν), typically polynomials, wavelets...

A function v in T_r^T(H) = {v ∈ H : rank_T(v) ≤ r} admits an explicit representation

v(x) = ∑_{i_α ∈ I^α, α ∈ L(T)} ∑_{1 ≤ k_β ≤ r_β, β ∈ T} ∏_{α ∈ T\L(T)} a^α_{(k_β)_{β∈S(α)}, k_α} ∏_{α ∈ L(T)} a^α_{i_α, k_α} φ^α_{i_α}(x_α)

where each parameter a^α is in a tensor space R^{K_α}.

[Figure: tensor network diagram with parameters a^{1,2,3,4,5}, a^{1,2,3}, a^{2,3}, a^{4,5} at interior nodes and a^1, ..., a^5 at the leaves.]

Representation complexity:

C(T, r, H) = ∑_{α ∈ T\L(T)} r_α ∏_{β ∈ S(α)} r_β + ∑_{α ∈ L(T)} r_α dim(H_α).
Tree-based tensor format as a deep neural network
By identifying a tensor a^{(α)} ∈ R^{n_1 × ... × n_s × r_α} with an R^{r_α}-valued multilinear function

f^{(α)} : R^{n_1} × ... × R^{n_s} → R^{r_α},

a function v in T_r^T admits a representation as a tree-structured composition of multilinear functions {f^{(α)}}_{α∈T}.

[Figure: tree of functions f^{1,2,3,4,5}, f^{1,2,3}, f^{2,3}, f^{4,5} at interior nodes and f^1, ..., f^5 at the leaves.]

v(x) = f^D(f^{1,2,3}(f^1(Φ^1(x_1)), f^{2,3}(f^2(Φ^2(x_2)), f^3(Φ^3(x_3)))), f^{4,5}(f^4(Φ^4(x_4)), f^5(Φ^5(x_5))))

where Φ^ν(x_ν) = (φ^ν_{i_ν}(x_ν))_{i_ν ∈ I^ν} ∈ R^{#I^ν}.

This corresponds to a deep network with a sparse architecture (given by T), a depth bounded by d − 1, and a width at level ℓ related to the α-ranks of the nodes α at level ℓ.
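A minimal sketch of this tree-structured evaluation (the node encoding, tensor shapes, and random parameters are illustrative assumptions), for a two-leaf tree over {1, 2} with polynomial feature maps:

```python
import numpy as np

# Evaluate a tree tensor network at x by tree-structured composition of
# multilinear maps. Assumed encoding: interior node alpha stores a tensor of
# shape (r_child1, ..., r_childk, r_alpha); leaf {nu} stores a matrix of
# shape (#basis, r_nu).
def evaluate(node, tensors, children, phi, x):
    if node not in children:                    # leaf: f^nu(Phi^nu(x_nu))
        (nu,) = node
        return phi(x[nu - 1]) @ tensors[node]   # vector of length r_nu
    a = tensors[node]
    for child in children[node]:                # contract children one by one
        z = evaluate(child, tensors, children, phi, x)
        a = np.tensordot(z, a, axes=(0, 0))
    return a                                    # vector of length r_node

f = frozenset
children = {f({1, 2}): [f({1}), f({2})]}        # a two-leaf tree
rng = np.random.default_rng(0)
tensors = {f({1}): rng.standard_normal((3, 2)),
           f({2}): rng.standard_normal((3, 2)),
           f({1, 2}): rng.standard_normal((2, 2, 1))}
phi = lambda t: np.array([1.0, t, t * t])       # polynomial feature map Phi
v = evaluate(f({1, 2}), tensors, children, phi, [0.3, 0.7])
print(v.shape)                                  # (1,)
```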
Estimation error
Given a model class F, a minimizer u_F of the risk R over F and a minimizer u_F^n of the empirical risk R_n over F, the estimation error satisfies

R(u_F^n) − R(u_F) ≤ R(u_F^n) − R_n(u_F^n) + R_n(u_F) − R(u_F),

so that

E(R(u_F^n) − R(u_F)) ≤ E(sup_{v∈F} |R_n(v) − R(v)|) = E(sup_{v∈F} |(1/n) ∑_{i=1}^n γ(v, z_i) − E(γ(v, Z))|).
Estimation error
Assume that F is compact in L^∞ and that, for all v, w ∈ F, the contrast is uniformly bounded and Lipschitz:

|γ(v, Z)| ≤ M, |γ(v, Z) − γ(w, Z)| ≤ L‖v − w‖_{L^∞}.

Then, using a standard concentration inequality (here Hoeffding's) together with a covering argument, we obtain

P(sup_{v∈F} |(1/n) ∑_{i=1}^n γ(v, z_i) − E(γ(v, Z))| ≥ εM) ≤ 2 N_{εM/(2L)} e^{−nε²/2} := η(n, ε, F),

where N_{εM/(2L)} = N(εM/(2L), F, ‖·‖_{L^∞}) is the covering number of F.

We deduce that, for any ε,

E(R(u_F^n) − R(u_F)) ≤ εM + 2Mη(n, ε, F),

and a bound can be obtained by taking the infimum over ε.
Metric entropy of tree-based tensor formats [with Bertrand Michel]

Assume H ⊂ L^∞(X) with basis functions {φ_i}_{i∈I} normalized in L^∞(X).

For any representation of v ∈ T_r^T(H) with parameters {f^α : α ∈ T}, the multilinearity of the parametrization implies

‖v‖_{L^∞} ≤ ∏_α ‖f^α‖_{α,∞},

with a suitable choice of norms

‖f^α‖_{α,∞} = sup_{‖z^β‖_∞ ≤ 1} ‖f^α((z^β)_{β∈S(α)})‖_∞.

Considering the model class

F = R F_1, F_1 = {v ∈ T_r^T(H) : max_{α∈T} ‖f^α‖_{α,∞} ≤ 1},

we obtain an upper bound on the metric entropy

log N(ε, F, ‖·‖_{L^∞}) ≤ C(T, r, H) log(3 ε^{−1} R #T),

where C(T, r, H) is the representation complexity of elements of F.
Coming back to estimation error
We conclude that

E(R(u_F^n) − R(u_F)) ≤ εM + 2Mη

for

n ≥ 2 ε^{−2} (log(2 η^{−1}) + C(T, r, H) log(6 ε^{−1} R L M^{−1} #T)),

and

E(R(u_F^n) − R(u_F)) ≲ M √(C(T, r, H) / n)

up to log factors.
Improving estimation error
Improved bounds may be obtained using refined concentration inequalities for suprema of empirical processes

sup_{v∈F} |(1/n) ∑_{i=1}^n γ(v, z_i) − E(γ(v, Z))|.

Better performance can be obtained by using as empirical risk the weighted version

R_n(v) = (1/n) ∑_{i=1}^n w_i γ(v, z_i),

where the z_i are i.i.d. samples from a measure d\tilde{P}_Z = w(z) dP_Z and the weights are w_i = w(z_i)^{−1}.

The choice of sampling measure should be adapted to the risk and the model class, and may be deduced from concentration inequalities. See [Cohen and Migliorati 2017] for least-squares regression and linear models.

For tree tensor networks, use a different sample for each parameter. Link to empirical PCA [Nouy 2019] [Haberstich, Nouy, Perrin 2020].
Approximation properties of tree tensor networks
We want to quantify the approximation error
min_{v ∈ T_r^T} R(v) − R(u)

for a target function u in a given function class, i.e. to study the expressive power of tree tensor networks and compare it with other approximation tools.
Approximation properties of tree tensor networks
We consider least-squares regression and L² density estimation, where

min_{v ∈ T_r^T} R(v) − R(u) = min_{v ∈ T_r^T} ‖u − v‖²_{L²_µ} := e_{T,r}(u)².

Since

T_r^T = ⋂_{α∈T} {v : rank_α(v) ≤ r_α},

a lower bound is given by

e_{T,r}(u) ≥ max_{α∈T} e_{α,r_α}(u),

where

e_{α,r_α}(u) = min_{rank_α(v) ≤ r_α} ‖u − v‖_{L²_µ}.
Singular value decomposition
u(x_1, ..., x_d) can be identified with a bivariate function u(x_α, x_{α^c}) in L²_{µ_α ⊗ µ_{α^c}}(X_α × X_{α^c}), which admits a singular value decomposition

u(x_α, x_{α^c}) = ∑_{k=1}^{rank_α(u)} σ^α_k v^α_k(x_α) v^{α^c}_k(x_{α^c}).

The problem of best approximation of u by a function with α-rank r_α admits as a solution the truncated singular value decomposition u_{r_α} of u,

u_{r_α}(x_α, x_{α^c}) = ∑_{k=1}^{r_α} σ^α_k v^α_k(x_α) v^{α^c}_k(x_{α^c}),

where {v^α_1, ..., v^α_{r_α}} are the r_α α-principal components of u, and

e_{α,r_α}(u) = √(∑_{k>r_α} (σ^α_k)²).
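A minimal sketch on a grid (the kernel exp(x·y) is an illustrative assumption): the truncated SVD gives the best α-rank-r approximation, and its error matches √(∑_{k>r} σ_k²) in the Frobenius (discrete L²) norm.

```python
import numpy as np

# Best alpha-rank-r approximation of a discretized bivariate function via
# truncated SVD, checking e_{alpha,r}(u) = sqrt(sum_{k>r} sigma_k^2).
n = 50
x = np.linspace(0.0, 1.0, n)
U = np.exp(np.outer(x, x))                       # u(x_a, x_ac) = exp(x_a x_ac)

W, s, Vt = np.linalg.svd(U, full_matrices=False)
r = 3
Ur = W[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]       # truncated SVD u_r

err = np.linalg.norm(U - Ur)                     # Frobenius norm of the error
print(err, np.sqrt(np.sum(s[r:] ** 2)))          # the two values coincide
```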
Linear widths of multivariate functions
The subspace of principal components U_α = span{v^α_1, ..., v^α_{r_α}} is a solution of

min_{dim(U_α)=r_α} ∫ ‖u(·, x_{α^c}) − P_{U_α} u(·, x_{α^c})‖²_{L²_{µ_α}} dµ_{α^c}(x_{α^c}) = e_{α,r_α}(u)²,

where P_{U_α} is the orthogonal projection onto U_α.

Considering the set of partial evaluations of u,

K_α(u) = {u(·, x_{α^c}) : x_{α^c} ∈ X_{α^c}} ⊂ L²_{µ_α}(X_α),

we have

e_{α,r_α}(u) ≤ min_{dim(U_α)=r_α} sup_{v ∈ K_α(u)} ‖v − P_{U_α} v‖_{L²_{µ_α}} = d_{r_α}(K_α(u))_{L²_{µ_α}},

the upper bound being the Kolmogorov r_α-width of K_α(u).

Furthermore, since e_{α,r_α}(u) = e_{α^c,r_α}(u), we have

e_{α,r_α}(u) ≤ min{d_{r_α}(K_α(u))_{L²_{µ_α}}, d_{r_α}(K_{α^c}(u))_{L²_{µ_{α^c}}}}.
Upper bound through higher order singular value decomposition
Given the spaces of principal components U_α, α ∈ T, we can define an approximation

u_r = ∏_{α∈T} P_{U_α} u ∈ T_r^T,

obtained by successive orthogonal projections (suitably ordered), such that [Grasedyck 2010]

e_{T,r}(u)² ≤ ‖u − u_r‖² ≤ ∑_{α∈T} e_{α,r_α}(u)².

Error bounds can then be obtained from information on the α-singular values of u, or on Kolmogorov widths of the sets K_α(u) and K_{α^c}(u) of partial evaluations of u.
Expressive power of tree tensor networks
For standard regularity classes, tree tensor networks perform almost as well as standard approximation tools.

For example, for u ∈ H^k_mix((0,1)^d), K_α(u) ⊂ H^k_mix((0,1)^{#α}) for any α. From bounds on Kolmogorov widths of Sobolev balls,

e_{α,r_α}(u) ≤ d_{r_α}(K_α(u)) ≲ r_α^{−k} log(r_α)^{k(#α−1)},

we obtain that the complexity to achieve a precision ε (with binary trees) is

C(ε) ≲ ε^{−3/k} d^{1+3/(2k)}, up to powers of log(ε^{−1}).

This performs almost as well as hyperbolic cross approximation (sparse tensors).

Similar results in [Schneider and Uschmajew 2014], using results on bilinear approximation [Temlyakov 1989]. See also [Griebel and Harbrecht 2019].
Expressive power of tree tensor networks [with M. Bachmayr and R. Schneider]

But they can perform much better for non-standard classes of functions, e.g. a tree-structured composition of regular functions {f_α : α ∈ T}; see [Mhaskar, Liao, Poggio 2016] for deep neural networks.

f_{1,2,3,4}(f_{1,2}(f_1(x_1), f_2(x_2)), f_{3,4}(f_3(x_3), f_4(x_4)))

[Figure: balanced tree over {1, 2, 3, 4} with interior nodes {1,2} and {3,4}.]

Assuming that the functions f_α ∈ W^{k,∞} with ‖f_α‖_{L^∞} ≤ 1 and ‖f_α‖_{W^{k,∞}} ≤ B, the complexity to achieve an accuracy ε is

C(ε) ≲ ε^{−3/k} (L + 1)³ B^{3L} d^{1+3/(2k)},

with L = log_2(d) for a balanced tree and L + 1 = d for a linear tree.

• Bad influence of the depth through the norm B of the functions f_α (roughness).
• For B ≤ 1 (and even for 1-Lipschitz functions), the complexity scales only polynomially in d: no curse of dimensionality!
Expressive power of tree tensor networks
A function in canonical format (a shallow network),

u(x) = ∑_{k=1}^r u^1_k(x_1) ⋯ u^d_k(x_d),

can be represented in tree-based format with a similar complexity.

Conversely, a typical function in tree-based format T_r^T has a canonical rank depending exponentially on d.

Deep is better!

For a balanced or linear binary tree T, the subset of tensors v in T_r^T(R^{n×...×n}) with canonical rank less than min{n, r}^{d/2} is of Lebesgue measure 0 [Cohen et al. 2016, Khrulkov et al. 2018].

But a typical function in T_r^T may admit a representation complexity exponential in d when using another tree.
Influence of the tree
As an example, consider the probability distribution f(x) = P(X = x) of a Markov chain X = (X_1, ..., X_d), given by

f(x) = f_1(x_1) f_{2|1}(x_2|x_1) ⋯ f_{d|d−1}(x_d|x_{d−1}),

where the bivariate functions f_{i|i−1} have a rank bounded by r.

With the linear tree T containing the interior nodes {1,2}, {1,2,3}, ..., {1,...,d−1}, f admits a representation in tree-based format with storage complexity in r^4.

The canonical rank of f is exponential in d.

But when considering the linear tree T_σ = {σ(α) : α ∈ T} obtained by applying the permutation σ = (1, 3, ..., d−1, 2, 4, ..., d) to the tree T, the storage complexity in tree-based format is also exponential in d.
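This ordering effect can be observed numerically. A minimal sketch (the chain length, state-space size, and random transition kernels are illustrative assumptions): the joint tensor of a Markov chain has small unfolding rank for a prefix split such as {1,2}, but a much larger rank for an interleaved split such as {1,3,5}.

```python
import numpy as np

# Joint law of a Markov chain: small ranks for prefix splits, large ranks
# for interleaved splits (i.e. for a badly permuted tree).
rng = np.random.default_rng(0)
d, m = 6, 3                                     # chain length, state-space size

normalize = lambda A, ax: A / A.sum(axis=ax, keepdims=True)
p0 = normalize(rng.random(m), 0)                # initial law f_1
P = [normalize(rng.random((m, m)), 1) for _ in range(d - 1)]  # f_{i|i-1}

F = p0                                          # build f(x_1, ..., x_d)
for T in P:
    F = F[..., None] * T                        # f(..., x_i) * f(x_{i+1}|x_i)

# prefix split {1,2} | {3,...,6}: only the edge (2,3) crosses -> rank <= m
r_prefix = np.linalg.matrix_rank(F.reshape(m * m, -1))

# interleaved split {1,3,5} | {2,4,6}: every chain edge crosses -> large rank
Fp = np.transpose(F, (0, 2, 4, 1, 3, 5)).reshape(m**3, m**3)
r_inter = np.linalg.matrix_rank(Fp)

print(r_prefix, r_inter)
```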
Approximation properties of tree tensor networks
Choosing a good tree (the architecture of the network) is a crucial but combinatorial problem...

[Figure: four different dimension partition trees over {1, ..., 8}, illustrating the many possible architectures.]
Learning algorithm for tree tensor networks
A function v in the model class T_r^T(H) has a representation v(x) = Ψ(x)((a^α)_{α∈T}), where each parameter a^α is in a tensor space R^{K_α} and Ψ(x) is a multilinear map.

The empirical risk minimization problem over the nonlinear model class T_r^T,

min_{(a^α)_{α∈T}} (1/n) ∑_{i=1}^n γ(Ψ(·)((a^α)_{α∈T}), z_i),

can be solved using an alternating minimization algorithm, solving at each step an empirical risk minimization problem with a linear model

Ψ(x)((a^α)_{α∈T}) = ∑_{k∈K_α} Ψ^α_k(x) a^α_k,

with functions Ψ^α_k(x) depending on the fixed parameters a^β, β ≠ α.

In an L² setting, a re-parametrization is possible to obtain orthonormal functions Ψ^α_k(x).

Sparsity in the tensors a^α can be exploited in different ways, e.g. by proposing different sparsity patterns and using a model selection technique (e.g. based on validation).

For a leaf node ν, the approximation space H_ν can be selected from a candidate sequence of nested spaces H^ν_0 ⊂ ... ⊂ H^ν_L ⊂ ...
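A minimal sketch of the alternating strategy for least-squares regression with a rank-1 separable model v(x) = (a·Φ(x₁))(b·Φ(x₂)) (the target, features, and sample size are illustrative assumptions): with one factor fixed, each step reduces to an ordinary linear least-squares problem.

```python
import numpy as np

# Alternating minimization for the empirical least-squares risk over the
# rank-1 model class v(x) = (a . phi(x1)) * (b . phi(x2)): with one factor
# fixed, the subproblem is linear in the other factor.
rng = np.random.default_rng(0)
n, p = 200, 4
X = rng.uniform(size=(n, 2))
y = (1 + 2 * X[:, 0]) * (3 - X[:, 1])               # an exactly rank-1 target

phi = lambda t: np.vander(t, p, increasing=True)    # monomial features
P1, P2 = phi(X[:, 0]), phi(X[:, 1])

a, b = rng.standard_normal(p), rng.standard_normal(p)
for _ in range(50):                                 # alternating least squares
    a = np.linalg.lstsq(P1 * (P2 @ b)[:, None], y, rcond=None)[0]
    b = np.linalg.lstsq(P2 * (P1 @ a)[:, None], y, rcond=None)[0]

risk = np.mean(((P1 @ a) * (P2 @ b) - y) ** 2)
print(risk)   # the target is exactly representable, so the risk drops to ~0
```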
Learning algorithm for tree tensor networks
Selecting an optimal model class T_r^T(H) is a combinatorial problem.

An algorithm is proposed in [Grelier, Nouy, Chevreuil 2018] that performs adaptations of the tree T (architecture), the ranks r (widths) and the approximation space H.

Start with an initial tree T and learn an approximation v ∈ T_r^T(H) with rank r = (1, ..., 1). Then repeat:

Increase some ranks r_α based on estimates of the truncation errors

min_{rank_α(v) ≤ r_α} R(v) − R(u)

Learn an approximation v in T_r^T(H), with adaptive selection of H

Optimize the tree to reduce the storage complexity of v (stochastic algorithm using a suitable distribution over the set of trees):

min_T C(T, rank_T(v), H)
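The rank-increase step can be mimicked with singular values of matricizations: the tail of the singular value spectrum of the α-matricization estimates the error committed by truncating rank_α. A sketch under this assumption (the actual criterion in [Grelier, Nouy, Chevreuil 2018] may differ):

```python
import numpy as np

def truncation_errors(A, alpha):
    """Relative errors of the best rank-k truncations of the
    alpha-matricization of A, for k = 0, 1, 2, ..."""
    rest = [m for m in range(A.ndim) if m not in alpha]
    rows = int(np.prod([A.shape[m] for m in alpha]))
    M = np.transpose(A, alpha + rest).reshape(rows, -1)
    s = np.linalg.svd(M, compute_uv=False)
    # tail[k] = sqrt(sum of squared singular values beyond the first k)
    tail = np.sqrt(np.cumsum(s[::-1] ** 2))[::-1]
    return tail / np.linalg.norm(s)

# a nearly rank-one 3rd-order tensor
rng = np.random.default_rng(0)
A = np.einsum('i,j,k->ijk', *(rng.standard_normal(4) for _ in range(3)))
A += 0.1 * rng.standard_normal((4, 4, 4))

# greedy step: increase the rank of the node with largest truncation error
ranks = {(0,): 1, (1,): 1, (2,): 1}
errs = {a: truncation_errors(A, list(a))[r] for a, r in ranks.items()}
worst = max(errs, key=errs.get)
ranks[worst] += 1
print(worst, ranks)
```

In the learning setting the tensor A is not available; the slides' algorithm estimates such errors from the current approximation and the data instead.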
Anthony Nouy Centrale Nantes 31 / 42
Example in supervised learning: composition of functions
Consider a tree-structured composition of functions
u(X) = h(h(h(X1, X2), h(X3, X4)), h(h(X5, X6), h(X7, X8))),

where h(t, s) = (2 + ts)^2/9 is a bivariate function and the d = 8 random variables X1, . . . , X8 are independent and uniform on [−1, 1].
[Figure: balanced binary tree of compositions of h, with leaves X1, . . . , X8.]
We use polynomial approximation spaces H (with adaptive selection of the degree), so that the function u could (in principle) be recovered exactly for any choice of tree with a sufficiently high rank.
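For reference, the target and training samples are straightforward to generate (a minimal sketch; names are illustrative):

```python
import numpy as np

def h(t, s):
    # bivariate building block from the slide
    return (2 + t * s) ** 2 / 9

def u(X):
    # tree-structured composition matching the figure (d = 8)
    X1, X2, X3, X4, X5, X6, X7, X8 = X.T
    return h(h(h(X1, X2), h(X3, X4)), h(h(X5, X6), h(X7, X8)))

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 8))   # X_i i.i.d. uniform on [-1, 1]
y = u(X)                                  # training outputs (noise-free)
print(y.shape, float(y.min()), float(y.max()))
```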
Anthony Nouy Centrale Nantes 32 / 42
Example in supervised learning: composition of functions
We consider the tree T1 coinciding with the structure of u, for which

C(T1, rankT1(u), H) = 2427
[Figure: (a) the tree T1, with leaves {1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}; (b) the permuted tree T1σ, with leaves {8}, {1}, {6}, {4}, {7}, {2}, {3}, {5}.]
By considering a permutation T1σ = {σ(α) : α ∈ T1} of T1, with σ = (8, 1, 6, 4, 7, 2, 3, 5), we have a complexity

C(T1σ, rankT1σ(u), H) ≥ 9·10^6
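The storage complexities above can be reproduced in spirit by counting parameters, assuming C(T, r, H) sums the sizes of all parameter tensors: r_α ∏_{β∈S(α)} r_β for each interior node α with children S(α), plus r_ν dim(H^ν) for each leaf ν. A sketch on a toy tree (the exact complexity measure used in the slides is an assumption here):

```python
def storage_complexity(children, ranks, leaf_dims):
    """children: interior node -> list of its children (root has rank 1);
    ranks: node -> r_alpha; leaf_dims: leaf -> dim(H_nu)."""
    c = 0
    for node, kids in children.items():
        size = ranks[node]
        for kid in kids:
            size *= ranks[kid]          # one core of size r_a * prod of r_b
        c += size
    for leaf, dim in leaf_dims.items():
        c += ranks[leaf] * dim          # leaf basis coefficients
    return c

# toy balanced binary tree over {1, 2, 3, 4} with illustrative ranks/dims
children = {'1234': ['12', '34'], '12': ['1', '2'], '34': ['3', '4']}
ranks = {'1234': 1, '12': 2, '34': 2, '1': 2, '2': 2, '3': 2, '4': 2}
leaf_dims = {'1': 5, '2': 5, '3': 5, '4': 5}
print(storage_complexity(children, ranks, leaf_dims))  # 4 + 8 + 8 + 40 = 60
```

Under this count, ranks that are small on one tree can become large after a permutation of the leaves, which is what drives the gap between 2427 and 9·10^6.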
Anthony Nouy Centrale Nantes 33 / 42
Example in supervised learning: composition of functions
We consider a linear tree T2 and start the algorithm from a tree T2σ = {σ(α) : α ∈ T2} obtained by applying a random permutation σ to T2.
[Figure: (e) the linear tree T2, with leaves {1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}; (f) the tree T2σ for σ = (6, 8, 4, 5, 1, 7, 2, 3).]
Anthony Nouy Centrale Nantes 34 / 42
Example in supervised learning: composition of functions

Behavior of the algorithm with a sample size n = 10^5:

Iteration  rank r                                           εtest(v)      C(T, r, H)
1          (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)   3.38 10−2     79
2          (1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1)   2.95 10−2     100
3          (1, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1)   2.45 10−2     121
4          (1, 1, 2, 1, 2, 1, 1, 1, 2, 1, 2, 1, 2, 2, 1)   1.85 10−2     142
5          (1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 2, 2)   8.97 10−3     163
6          (1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2)   8.89 10−3     188
7          (1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2)   8.87 10−3     188
8          (1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2)   3.97 10−3     188
9          (1, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 3, 3)   1.55 10−4     308
10         (1, 3, 3, 3, 3, 2, 3, 3, 3, 2, 3, 3, 3, 3, 3)   1.18 10−4     364
11         (1, 3, 4, 3, 4, 2, 4, 3, 4, 2, 4, 3, 4, 4, 4)   6.65 10−6     520
12         (1, 3, 5, 3, 5, 3, 5, 3, 5, 3, 5, 3, 5, 5, 5)   1.19 10−6     723
13         (1, 4, 5, 4, 5, 3, 5, 4, 5, 3, 5, 4, 5, 5, 5)   1.72 10−7     865
14         (1, 4, 6, 4, 6, 3, 6, 4, 6, 3, 6, 4, 6, 6, 6)   1.47 10−8     1113
15         (1, 5, 6, 5, 6, 3, 6, 5, 6, 3, 6, 5, 6, 6, 6)   7.02 10−9     1311
16         (1, 5, 7, 5, 7, 3, 7, 5, 7, 3, 7, 5, 7, 7, 7)   1.27 10−10    1643
17         (1, 5, 8, 5, 8, 3, 8, 5, 8, 3, 8, 5, 8, 8, 8)   3.87 10−12    2015
18         (1, 5, 9, 5, 9, 3, 9, 5, 9, 3, 9, 5, 9, 9, 9)   2.95 10−14    2427

[Figures: intermediate trees explored by the algorithm at each iteration, converging to the structure of u.]

Anthony Nouy Centrale Nantes 35 / 42
Example in supervised learning: composition of functions
Behavior of the algorithm for different sample sizes n.
n      P(T = T1)   εtest(v)                  C(T, r, H)
10^3   90%         [1.75 10−5, 1.75 10−4]    [360, 1062]
10^4   90%         [2.15 10−8, 4.10 10−3]    [185, 2741]
10^5   100%        [4.67 10−15, 8.92 10−3]   [163, 2594]

Table: training sample size n, estimated probability of obtaining the ideal tree T1, and ranges (over the 10 trials) for the test error and the storage complexity.
Anthony Nouy Centrale Nantes 36 / 42
Example in unsupervised learning
We consider a truncated normal distribution with zero mean and covariance matrix Σ. Its support is X = ×_{ν=1}^{6} [−5σν, 5σν], with σν² = Σνν, and its density (with respect to the Lebesgue measure) is

f(x) ∝ exp(−(1/2) x^T Σ^{-1} x) 1_{x∈X}.
We consider polynomial approximation spaces Hν (with adaptive selection of the degree).
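Samples of this truncated distribution can be drawn by rejection: sample from N(0, Σ) and keep the points falling in X (with a 5σ box, almost all points are kept). A minimal sketch with an illustrative 2×2 covariance:

```python
import numpy as np

Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])                 # illustrative covariance
sig = np.sqrt(np.diag(Sigma))
lo, hi = -5 * sig, 5 * sig                     # support: a 5-sigma box

rng = np.random.default_rng(0)
Z = rng.multivariate_normal(np.zeros(2), Sigma, size=20000)
keep = np.all((Z >= lo) & (Z <= hi), axis=1)
X = Z[keep]                                    # samples of the truncated law

def density_unnorm(x):
    # unnormalized density exp(-x^T Sigma^{-1} x / 2) restricted to the box
    q = np.einsum('...i,ij,...j->...', x, np.linalg.inv(Sigma), x)
    return np.exp(-0.5 * q) * np.all((x >= lo) & (x <= hi), axis=-1)

print(keep.mean())                             # acceptance rate close to 1
```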
Anthony Nouy Centrale Nantes 37 / 42
Example in unsupervised learning
Consider

Σ =
  [ 2    0    0.5  1    0    0.5 ]
  [ 0    1    0    0    0.5  0   ]
  [ 0.5  0    2    0    0    1   ]
  [ 1    0    0    3    0    0   ]
  [ 0    0.5  0    0    1    0   ]
  [ 0.5  0    1    0    0    2   ]

After a permutation (3, 6, 1, 4, 2, 5) of its rows and columns, we obtain the block-diagonal matrix

  [ 2    1    0.5  0    0    0   ]
  [ 1    2    0.5  0    0    0   ]
  [ 0.5  0.5  2    1    0    0   ]
  [ 0    0    1    3    0    0   ]
  [ 0    0    0    0    1    0.5 ]
  [ 0    0    0    0    0.5  1   ]
(X1, X3, X4, X6) and (X2, X5) are independent, as are X4 and (X3, X6), so that

f(x) = f1,3,4,6(x1, x3, x4, x6) f2,5(x2, x5) = f4,1(x4, x1) f1,3,6(x1, x3, x6) f2,5(x2, x5)

with rank{2,5}(f) = rank{1,3,4,6}(f) = 1 and rank{1,4}(f) = rank{1}(f1,3,6) = rank{3,6}(f1,3,6) = rank{3,6}(f).
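The block-diagonal structure after permutation, which underlies these independence and rank statements, can be checked numerically with the Σ from the previous slide:

```python
import numpy as np

Sigma = np.array([
    [2.0, 0.0, 0.5, 1.0, 0.0, 0.5],
    [0.0, 1.0, 0.0, 0.0, 0.5, 0.0],
    [0.5, 0.0, 2.0, 0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0, 3.0, 0.0, 0.0],
    [0.0, 0.5, 0.0, 0.0, 1.0, 0.0],
    [0.5, 0.0, 1.0, 0.0, 0.0, 2.0],
])
perm = [2, 5, 0, 3, 1, 4]        # permutation (3, 6, 1, 4, 2, 5), 0-based
P = Sigma[np.ix_(perm, perm)]
# zero off-diagonal blocks: (X1, X3, X4, X6) is uncorrelated with (X2, X5)
print(np.all(P[:4, 4:] == 0) and np.all(P[4:, :4] == 0))   # True
```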
Anthony Nouy Centrale Nantes 38 / 42
Example in unsupervised learning

f(x) = f4,1(x4, x1) f1,3,6(x1, x3, x6) f2,5(x2, x5)

n      Risk × 10−2       L2-error       T         C(T, r, H)
10^2   [−5.50, 119]      [0.53, 4.06]   Fig. (a)  [311, 311]
10^3   [−7.29, −5.93]    [0.22, 0.47]   Fig. (b)  [311, 637]
10^4   [−7.60, −6.85]    [0.11, 0.33]   Fig. (c)  [521, 911]
10^5   [−7.68, −7.66]    [0.04, 0.07]   Fig. (c)  [911, 1213]
10^6   [−7.70, −7.69]    [0.01, 0.01]   Fig. (c)  [1283, 1546]

Table: Ranges over 10 trials
Figure: (a) Best tree over 10 trials for n = 10^2: a linear tree {1,2,3,4,5,6} ⊃ {2,3,4,5,6} ⊃ {2,4,5,6} ⊃ {2,4,5} ⊃ {4,5}, with leaves {1}, {3}, {6}, {2}, {4}, {5}.
Anthony Nouy Centrale Nantes 39 / 42
Figure: (b) Best tree over 10 trials for n = 10^3: root {1,2,3,4,5,6} with children {1,3,4,6} and {2,5}; {1,3,4,6} → {3,4,6}, {1}; {3,4,6} → {3,6}, {4}; {3,6} → {3}, {6}; {2,5} → {2}, {5}.
Figure: (c) Best tree over 10 trials for n = 10^4, 10^5, 10^6: root {1,2,3,4,5,6} with children {1,3,4,6} and {2,5}; {1,3,4,6} → {1,4}, {3,6}; {1,4} → {1}, {4}; {3,6} → {3}, {6}; {2,5} → {2}, {5}.
The convergence rate is close to O(n^{-1/2}), the minimax rate for the estimation of analytic densities.
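This rate can be read off the table: a log-log fit of the midpoints of the reported L2-error ranges against n (excluding the smallest sample size, where the estimates are unreliable) gives a slope close to −1/2:

```python
import numpy as np

n = np.array([1e3, 1e4, 1e5, 1e6])
# midpoints of the L2-error ranges reported in the table above
err = np.array([(0.22 + 0.47) / 2, (0.11 + 0.33) / 2,
                (0.04 + 0.07) / 2, (0.01 + 0.01) / 2])
slope = np.polyfit(np.log(n), np.log(err), 1)[0]
print(round(slope, 2))    # close to -0.5, i.e. an O(n^(-1/2)) decay
```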
Concluding remarks
Open questions for a complete learning theory...
The obtained theoretical results concern the minimizer u_F^n of the empirical risk over the model class, but the available algorithms are not guaranteed to find a solution of

min_{v ∈ T_r^T} Rn(v)

Algorithms generate a sequence of estimations in different model classes. Model selection strategies should be proposed (for selecting the tree, the ranks and the background approximation space) that guarantee oracle inequalities.

Convexification of tree tensor networks?
Anthony Nouy Centrale Nantes 40 / 42
Concluding remarks
A fundamental problem would be to characterize approximation classes Aγ of tree tensor networks, i.e. those functions for which tree tensor networks achieve a certain performance:

inf_{v ∈ T_r^T(H)} R(v) − R(u) ≲ γ(C)^{−1}

for some growth function γ, with C a measure of complexity of T_r^T(H).
These are not standard regularity classes (see [Dahmen et al 2016, Bachmayr andDahmen 2015]).
For an approximation class Aγ, we would like to devise (black box) algorithms that select a model class T_r^T(H) and provide an estimation û ∈ T_r^T(H) such that

E(R(û) − R(u)) ≲ γ(C)^{−1},

possibly with γ replaced by another growth function.
Anthony Nouy Centrale Nantes 41 / 42
Thank you for your attention
References
A. Falco, W. Hackbusch, and A. Nouy. Tree-based tensor formats. SeMA Journal, Oct 2018.

E. Grelier, A. Nouy, and M. Chevreuil. Learning with tree-based tensor formats. ArXiv e-prints, Nov 2018.

A. Nouy. Higher-order principal component analysis for the approximation of tensors in tree-based low-rank formats. Numerische Mathematik, 141(3):743–789, Mar 2019.

E. Grelier, R. Lebrun, and A. Nouy. Learning high-dimensional probability distributions using tree tensor networks. Coming soon...
Anthony Nouy Centrale Nantes 42 / 42