Nonsmooth, nonconvex optimization: implications for deep learning
S. Gerchinovitz¹, F. Malgouyres¹, E. Pauwels² & N. Thome³
1 Institut de Mathématiques de Toulouse, Université Toulouse 3 Paul Sabatier.
2 Institut de recherche en informatique de Toulouse, Université Toulouse 3 Paul Sabatier.
3 Centre d'étude et de recherche en informatique et communication, Conservatoire national des arts et métiers.
28 Jan. 2019
S. Gerchinovitz, F. Malgouyres, E. Pauwels & N. Thome (IMT-IRIT), Nonsmooth, nonconvex optimization, 28 Jan. 2019.
Acknowledgements
Jérôme Bolte (Toulouse School of Economics): an important contributor, who helped prepare this course.
Plan
1 Introduction
2 Convergence to local minima for Morse-Functions
3 On the structure of deep learning training loss
4 Convergence to critical points for tame functions
5 Approaching critical point with noise
6 Extensions to nonsmooth settings
Training a deep network
Finite-dimensional optimization problem:

min_{w,b} (1/n) ∑_{i=1}^n L(f_{w,b}(x_i), y_i)

((x_i, y_i))_{i=1}^n: training set in X × Y.
L: loss.
(w, b): network parameters (linear maps and offsets).
f_{w,b} : X → Y: neural network.
Notations:

F : R^p → R, θ ↦ (1/n) ∑_{i=1}^n l_i(θ)   (P)

θ = (w, b), l_i(θ) = L(f_{w,b}(x_i), y_i), i = 1, …, n.
Main question
min_{θ∈R^p} F(θ) = (1/n) ∑_{i=1}^n l_i(θ)   (1)

Compositional structure of deep networks: computing a (stochastic) (sub)gradient of F has a cost comparable to evaluating F.
Deep nets are trained with variants of gradient descent.
θk+1 = θk −αk ∇F(θk ) (GD)
αk > 0
Long term behaviour for this recursion?
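As a minimal sketch of the recursion (GD), here is a hypothetical one-dimensional example (not from the slides): F(x) = (x² − 1)², whose gradient is 4x(x² − 1), run with a constant step size.

```python
import numpy as np

# Toy illustration of the recursion (GD) on a nonconvex function
# F(x) = (x^2 - 1)^2 (a hypothetical example, not from the slides).
def grad_F(x):
    return 4.0 * x * (x**2 - 1.0)

x = 0.3            # initial point
alpha = 0.05       # constant step size alpha_k = alpha
for k in range(200):
    x = x - alpha * grad_F(x)

# The iterates approach a critical point: here the local minimum x = 1.
print(x)
```

Whether such a recursion always settles at a critical point, and at which kind, is exactly the long-term behaviour question above.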
Non convexity, non smoothness
Roadmap
Main difficulty: the objective is not convex and may not be smooth.

Such questions have been of interest to the mathematical community for a long time. Foundations come from two mathematical fields:

Smooth dynamical systems: Poincaré, Hadamard, Lyapunov, Hirsch, Smale, Shub, Hartman, Grobman, Thom…

Favorable geometric structure of F (semi-algebraic/tame geometry): Łojasiewicz, Hironaka, Grothendieck, van den Dries, Shiota…
Main results:
Convergence to second-order critical points for Morse functions (60's).
Favorable structure of deep learning landscapes (60's).
Convergence to critical points under the Łojasiewicz assumption (60's).
Approaching critical points with stochastic subgradients (ODE method, 70's).
Plan
1 Introduction
2 Convergence to local minima for Morse-Functions
3 On the structure of deep learning training loss
4 Convergence to critical points for tame functions
5 Approaching critical point with noise
6 Extensions to nonsmooth settings
Main idea
Smooth dynamical systems
ẋ = S(x)   (flow)

x_{k+1} = T(x_k)   (discrete)

S, T : R^p → R^p, T a local diffeomorphism (differentiable with differentiable inverse).

Long-term behaviour: convergence, bifurcation, chaos…

Generic results: nonlinear dynamics behave qualitatively in the same way as their linear approximations.

Lemma: Let F be C². If ∇F is L-Lipschitz, then the gradient mapping T : x ↦ x − α∇F(x) is a diffeomorphism for 0 < α < 1/L.
The gradient mapping is a diffeomorphism
Constructive proof:
For any x ∈ R^p, the Jacobian ∇T(x) = I − α∇²F(x) is positive definite (exercise). We get a local diffeomorphism as a consequence of the inverse function theorem.

Explicitly, for any x, y ∈ R^p such that T(x) = T(y),

‖x − y‖ = α‖∇F(x) − ∇F(y)‖ ≤ Lα‖x − y‖ < ‖x − y‖ unless x = y, hence x = y.

Explicit inverse: the solution of the strictly convex problem

prox_{−αF} : z ↦ argmin_{y∈R^p} −αF(y) + (1/2)‖y − z‖²,

x = prox_{−αF}(z) ⇔ z = x − α∇F(x).
Linear isomorphisms
Convergence to 0?
x0 ∈ R, a ∈ C, a 6= 0, xk+1 = axk .
x0 ∈ Rp, D ∈ Rp×p, diagonal, no zero entry, xk+1 = Dxk .
x0 ∈ Rp, M ∈ Rp×p diagonalisable over C, xk+1 = Mxk .
Symmetric real matrix: if M has no eigenvalue λ such that |λ| = 1, then one can write

R^p = E_s ⊕ E_u,

where E_s is the stable space of M: all x such that M^k x → 0 as k → ∞. If dim(E_u) > 0, then divergence is the generic behaviour (for almost every x).
Extension to any square matrix using Jordan normal form.
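A small numerical sketch of this dichotomy, with an assumed diagonal M so that E_s and E_u are coordinate axes (an illustrative example, not from the slides):

```python
import numpy as np

# Hypothetical illustration: x_{k+1} = M x_k with diagonalisable M.
# Eigenvalues inside the unit circle give convergence to 0; one eigenvalue
# of modulus > 1 makes divergence generic (almost every x0 has a
# component along the unstable eigenvector).
M = np.diag([0.5, 1.5])            # E_s = span(e1), E_u = span(e2)

x_stable = np.array([1.0, 0.0])    # lies exactly in the stable space E_s
x_generic = np.array([1.0, 1e-6])  # tiny generic perturbation off E_s
for k in range(100):
    x_stable = M @ x_stable
    x_generic = M @ x_generic

print(np.linalg.norm(x_stable))    # ~0: converges
print(np.linalg.norm(x_generic))   # huge: diverges
```

The stable initialization is a measure-zero exception; the perturbed one exhibits the generic divergent behaviour.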
Stable manifold theorem
Idea dates back to Hadamard, Lyapunov and Perron. This is a difficult result.
Theorem (e.g. Shub's book [S1987]): Let T : R^p → R^p be a local diffeomorphism and x̄ a fixed point of T such that ∇T(x̄) has no eigenvalue on the unit circle and at least one eigenvalue of modulus > 1.

Then there exists a neighborhood U of x̄ such that

W^s(T, x̄) = {x_0 ∈ U : T^n(x_0) → x̄ as n → ∞},
W^u(T, x̄) = {x_0 ∈ U : T^n(x_0) → x̄ as n → −∞},

are differentiable manifolds tangent to the stable and unstable spaces of ∇T(x̄). In particular, W^s(T, x̄) has dimension < p.
With a picture
Obayashi et al. (2016). Formation mechanism of a basin of attraction for passivedynamic walking induced by intrinsic hyperbolicity. Proceedings of the Royal Society A.
Convergence to local minima on Morse functions

Assume that F : R^p → R is C² with L-Lipschitz gradient, and that x̄ ∈ R^p satisfies:

∇F(x̄) = 0,
∇²F(x̄) has no zero eigenvalue,
∇²F(x̄) has at least one strictly negative eigenvalue.

Assume that x_0 is drawn at random (absolutely continuous w.r.t. Lebesgue, e.g. Gaussian) and that (x_k)_{k∈N} is given by gradient descent starting at x_0 with α < 1/L. Then, with respect to the random choice of the initialization,

P[x_k → x̄] = 0.

Proof: The gradient mapping T : x ↦ x − α∇F(x) satisfies the hypotheses of the stable manifold theorem. If x_k → x̄, then after a finite number of steps K, x_k ∈ U for all k ≥ K, which implies x_k ∈ W^s(T, x̄) for all k ≥ K. Hence

{x_0 ∈ R^p : T^k(x_0) → x̄ as k → ∞} = ∪_{K∈N} T^{−K}(W^s(T, x̄)).

W^s(T, x̄) has Lebesgue measure 0, images of measure-zero sets by diffeomorphisms have measure 0, and a countable union of measure-zero sets has measure 0.
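A numerical sketch of this statement on an assumed Morse function F(x) = (x₁² − x₂²)/2 (illustrative, not from the slides), whose only critical point is a saddle at the origin with stable manifold the x₁-axis, a measure-zero set:

```python
import numpy as np

# Sketch: gradient descent on the Morse function F(x) = (x1^2 - x2^2)/2,
# whose only critical point is a saddle at 0. The stable manifold is the
# x1-axis, a measure-zero set, so a Gaussian initialization escapes a.s.
rng = np.random.default_rng(0)
alpha = 0.1
escaped = 0
for trial in range(100):
    x = rng.standard_normal(2)
    for k in range(500):
        grad = np.array([x[0], -x[1]])   # ∇F(x) = (x1, -x2)
        x = x - alpha * grad
    if np.linalg.norm(x) > 1.0:          # moved away from the saddle
        escaped += 1
print(escaped)  # every random initialization escapes the saddle
```

The x₂ component is multiplied by 1 + α at each step, so any nonzero x₂ component (a probability-one event) is amplified away from the saddle.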
Extension: Gradient Descent Only Converges to Minimizers
Lee, Simchowitz, Jordan, Recht [LSJR2016]: drop the full rank assumption on theHessian.
Plan
1 Introduction
2 Convergence to local minima for Morse-Functions
3 On the structure of deep learning training loss
4 Convergence to critical points for tame functions
5 Approaching critical point with noise
6 Extensions to nonsmooth settings
Deep learning training loss
θ = (w, b), l_i(θ) = L(f_{w,b}(x_i), y_i), i = 1, …, n.

F : R^p → R, θ ↦ (1/n) ∑_{i=1}^n l_i(θ)   (P)

Consider L(ŷ, y) = (ŷ − y)² or L(ŷ, y) = |ŷ − y| and a ReLU network: the activation function is the positive part max(0, ·).
Then F has a highly favorable structure: it is “piecewise” polynomial.
Semi-algebraic sets and functions
Semi-algebraic set in R^p: a finite union of solution sets of systems of the form

{x ∈ R^p : P(x) = 0, Q_1(x) > 0, …, Q_l(x) > 0}

for some polynomial functions P, Q_1, …, Q_l over R^p.

Semi-algebraic map R^p → R^{p′}: a map F : R^p → R^{p′} whose graph

graph_F = {(x, z) ∈ R^{p+p′} : z = F(x)}

is semi-algebraic.
Semi-algebraic set in R: Union of finitely many intervals.
Semi-algebraic functions: examples
Polynomials: P(x)
“Piecewise polynomials”: P(x) if x > 0 , Q(x) otherwise
Rational functions: 1/P(x)
Rational powers: P(x)q , q ∈Q.
Absolute value: ‖ · ‖1.
‖ · ‖0 pseudo-norm.
Rank of matrices
. . .
Tarski–Seidenberg Theorem

Theorem: Let A ⊂ R^{p+1} be semi-algebraic and let π denote the projection onto the first p coordinates; then π(A) is a semi-algebraic set. In other words,

{x ∈ R^p : ∃ y ∈ R, (x, y) ∈ A}

is semi-algebraic and can be described by finitely many polynomial inequalities in x only.
Consequences: Any set or function which can be described using a first orderformula, with real variables, semi-algebraic objects, addition, multiplication, equalityand inequality signs, is semi-algebraic. For example
The image and pre-image of a semi-algebraic map.
The interior, closure and boundary of a semi-algebraic set.
The derivatives of a differentiable semi-algebraic function.
The set of discontinuity points and the set of non-differentiability points of a semi-algebraic function.
For more: Michel Coste’s Introduction to semi-algebraic geometry [C2002].
Example: Morse-Sard theorem
Theorem: Let f : R → R be a semi-algebraic differentiable function; then the set of critical values of f,

crit_f = f({x ∈ R : f′(x) = 0}),

is finite.

Proof: Set C = {x ∈ R : f′(x) = 0}. Since f′ is semi-algebraic, C is semi-algebraic, so there exist m ∈ N and intervals J_1, …, J_m such that C = ∪_{i=1}^m J_i.

For each i = 1, …, m, since J_i is an interval, for any a, b ∈ J_i we have

f(b) − f(a) = ∫_a^b f′(t) dt = 0.

Hence f is constant on each J_i and f(C) has at most m values.
A feature of this theory: some results have short, simple proofs but rely on a deep technical construction.
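The theorem can be illustrated on a hypothetical instance, f(x) = x³ − 3x, whose critical set is finite (the choice of f is ours, not the slides'):

```python
import numpy as np

# Worked instance of the semi-algebraic Morse-Sard statement:
# f(x) = x^3 - 3x is polynomial, hence semi-algebraic; its critical set
# {x : f'(x) = 0} = {-1, 1} is a finite union of (degenerate) intervals,
# and f maps it to finitely many critical values.
f = lambda x: x**3 - 3*x
critical_points = np.roots([3, 0, -3])   # roots of f'(x) = 3x^2 - 3
critical_values = sorted(set(round(f(x), 9) for x in critical_points.real))
print(critical_values)                   # finitely many critical values
```

Here f′ vanishes on the two points ±1 and f takes only the two values f(±1) on that set.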
Extension to o-minimal structures (van den Dries, Shiota)

o-minimal structure, axiomatic definition: M = ∪_{p∈N} M_p, where each M_p is a family of subsets of R^p such that

if A, B ∈ M_p, then A ∪ B, A ∩ B and R^p \ A belong to M_p;
if A ∈ M_p and B ∈ M_{p′}, then A × B ∈ M_{p+p′};
each M_p contains the semi-algebraic sets of R^p;
if A ∈ M_{p+1}, denoting π the projection onto the first p coordinates, π(A) ∈ M_p;
M_1 consists of finite unions of intervals.

Tame function: a function whose graph belongs to an o-minimal structure.

Examples: semi-algebraic sets (Tarski–Seidenberg), exp-definable sets (Wilkie), restrictions of analytic functions to bounded sets (Gabrielov).
Consequences: Many results which hold for semi-algebraic sets actually hold fortame functions.
For more: van den Dries and Miller [VdD1998, VdDM1996], Shiota [S1995], Coste’sintroduction to o-minimal geometry [C2000].
Deep learning training loss
θ = (w, b), l_i(θ) = L(f_{w,b}(x_i), y_i), i = 1, …, n.

F : R^p → R, θ ↦ (1/n) ∑_{i=1}^n l_i(θ)   (P)

With L(·) = (·)² or L(·) = |·| and a ReLU network, F is semi-algebraic. More generally, this holds for any semi-algebraic L and activation functions.
For most choices of L and activation functions, F is tame.
Plan
1 Introduction
2 Convergence to local minima for Morse-Functions
3 On the structure of deep learning training loss
4 Convergence to critical points for tame functions
5 Approaching critical point with noise
6 Extensions to nonsmooth settings
Introduction
F : R^p → R is C¹ with L-Lipschitz gradient, α ∈ (0, 1/L].

x_{k+1} = x_k − α∇F(x_k)
Convergence of the iterates?
(Figure: three possible behaviours of the iterates: non-convergence, jittering, convergence.)
KL property (Łojasiewicz 63, Kurdyka 98)
Desingularizing functions on (0, r_0):

ϕ ∈ C([0, r_0), R_+),
ϕ ∈ C¹(0, r_0), ϕ′ > 0,
ϕ concave and ϕ(0) = 0.

Definition

Let F : R^p → R be C¹. F has the KL property at x̄ (with F(x̄) = 0) if there exist ε > 0 and a desingularizing function ϕ such that

‖∇(ϕ∘F)(x)‖ ≥ 1, for all x with ‖x − x̄‖ < ε and F(x̄) < F(x) < F(x̄) + ε.
Theorem
KL inequality holds for
differentiable semi-algebraic functions (Łojasiewicz 1963 [Ł1963]).
differentiable tame functions (Kurdyka 1998, [K1998]).
nonsmooth tame functions (Bolte-Daniilidis-Lewis-Shiota 2007 [BDLS2007]).
Illustration: F and ϕ∘F

(Figure: surface plots of F and of ϕ∘F over the same domain; reparameterizing the values of F with ϕ sharpens the function around its critical set.)
KL inequality examples
Trivial outside critical points: if ∇F(x̄) ≠ 0, one can take ϕ to be multiplication by a suitable positive constant.

Univariate analytic functions: F : x ↦ ∑_{i=I}^{+∞} a_i x^i with I ≥ 1 and a_I ≠ 0. F is differentiable and, around 0,

|F′| ≥ c|F|^θ, with c > 0 and θ = 1 − 1/I.

This is the original form of Łojasiewicz's gradient inequality; it corresponds to ϕ : t ↦ t^{1−θ} / ((1−θ)c).

µ-strongly convex functions: x* realizes the minimum of F,

F(x*) ≥ F(x) + ⟨∇F(x), x* − x⟩ + (µ/2)‖x − x*‖²   for all x ∈ R^p.

Minimizing the right-hand side (its minimum over the last argument is attained at x − (1/µ)∇F(x)) gives

2µ(F(x) − F(x*)) ≤ ‖∇F(x)‖²,

θ = 1/2, ϕ(·) = √· (up to a constant).
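The strongly convex case can be checked numerically; the sketch below samples random points for an assumed quadratic F(x) = ½xᵀAx with A positive definite (an illustrative instance, not from the slides):

```python
import numpy as np

# Numerical check of the inequality 2*mu*(F(x) - F(x*)) <= ||grad F(x)||^2
# for a mu-strongly convex quadratic F(x) = 0.5 * x^T A x, whose minimum
# is F(x*) = 0 at x* = 0, with mu = lambda_min(A).
rng = np.random.default_rng(1)
B = rng.standard_normal((5, 5))
A = B @ B.T + np.eye(5)                  # positive definite
mu = np.linalg.eigvalsh(A).min()

F = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

checks = []
for _ in range(1000):
    x = rng.standard_normal(5)
    checks.append(2*mu*F(x) <= np.linalg.norm(grad(x))**2 + 1e-9)
print(all(checks))
```

The inequality holds exactly here because ‖Ax‖² = xᵀA²x ≥ µ xᵀAx whenever A ⪰ µI.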
Non-convex examples: θ = 1/2, ϕ(·) = √·

Quadratics: F : x ↦ (1/2)(x − b)ᵀA(x − b), A symmetric, b ∈ R^p. Let µ be the smallest nonzero positive eigenvalue of A and of −A; then µA ⪯ A² and −µA ⪯ A², hence

2µ(F(x) − F*) ≤ ‖∇F(x)‖².

Morse functions: F of class C² such that ∇F(x) = 0 implies that ∇²F(x) is nonsingular.

Non-convex example:

F : x ↦ (x_1² − x_2²)²,   ∇F(x) = (4x_1(x_1² − x_2²), −4x_2(x_1² − x_2²)),

and, locally around the critical points, (F(x) − F*) ≤ ‖∇F(x)‖².
Convergence under KL assumption
Theorem (Absil, Mahony, Andrews 2005 [AMA2005])
Let F : R^p → R be bounded below, C¹ with L-Lipschitz gradient, and satisfy the KL property. Assume that x_0 ∈ R^p is such that {x ∈ R^p : F(x) ≤ F(x_0)} is compact and, for all k ∈ N,

x_{k+1} = x_k − α∇F(x_k)

with α = 1/L. Then x_k → x̄ as k → ∞, where ∇F(x̄) = 0, and ∑_{k∈N} ‖x_{k+1} − x_k‖ is finite.

Note that the KL assumption holds automatically if F is tame, which is the case for deep network training losses.
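A sketch of the finite-length conclusion on an assumed toy function (a strongly convex quadratic, which satisfies the KL property with ϕ(·) = √· up to a constant; the instance is ours, not the slides'):

```python
import numpy as np

# Gradient descent with alpha = 1/L on F(x) = 0.5 * x^T A x: the theorem
# predicts convergence to a critical point with finite trajectory length
# sum_k ||x_{k+1} - x_k||.
A = np.diag([1.0, 10.0])
grad = lambda x: A @ x
L = 10.0                          # Lipschitz constant of the gradient
alpha = 1.0 / L

x = np.array([3.0, -2.0])
length = 0.0
for k in range(2000):
    x_new = x - alpha * grad(x)
    length += np.linalg.norm(x_new - x)
    x = x_new

print(np.linalg.norm(grad(x)))    # ~0: converged to the critical point 0
print(length)                     # finite trajectory length
```

Finite length is stronger than mere convergence of the iterates: it rules out ever-slower spiralling around the limit.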
Proof sketch of convergence under the KL assumption

From the descent lemma, we have

∑_{i=0}^k ‖∇F(x_i)‖² ≤ 2L(F(x_0) − F(x_{k+1})) ≤ 2L(F(x_0) − F*).

The sum converges, ∇F(x_k) → 0 as k → ∞, and all accumulation points are critical points.

Let Ω be the set of accumulation points of (x_k)_{k∈N} (nonempty). Ω is compact, F is constant on Ω (= 0 after normalization), and dist(x_k, Ω) → 0 as k → ∞. F satisfies the KL property at each x̄ ∈ Ω.

Construct a global desingularizing function ϕ on a neighborhood of Ω: for each x̄ ∈ Ω, consider the ε_x̄ > 0 and the function ϕ_x̄ given by the KL property. The balls B(x̄, ε_x̄) cover the compact Ω; extract a finite subcover B(x̄_i, ε_i), i = 1, …, m.

We have

0 < ε_0 := inf {dist(y, Ω) : y ∈ R^p, ‖y − x̄_i‖ ≥ ε_i, i = 1, …, m}.

Setting ε = min_{i=0,…,m} ε_i and ϕ = ∑_{i=1}^m ϕ_i, ϕ is a global desingularizing function on

U = {y ∈ R^p : dist(y, Ω) < ε, 0 < F(y) < ε},

and x_k ∈ U for all k ≥ K for some K ∈ N. Discard the first terms so that K = 0.
Proof sketch of convergence under KL assumption
For all k ∈ N,

F(x_{k+1}) ≤ F(x_k) − (1/2)‖x_{k+1} − x_k‖ ‖∇F(x_k)‖,

hence, by concavity of ϕ,

ϕ(F(x_{k+1})) ≤ ϕ(F(x_k) − (1/2)‖x_{k+1} − x_k‖ ‖∇F(x_k)‖)
≤ ϕ(F(x_k)) − ϕ′(F(x_k)) (1/2)‖x_{k+1} − x_k‖ ‖∇F(x_k)‖
= ϕ(F(x_k)) − (1/2)‖x_{k+1} − x_k‖ ‖∇(ϕ∘F)(x_k)‖
≤ ϕ(F(x_k)) − (1/2)‖x_{k+1} − x_k‖.

Summing, ∑_{k∈N} ‖x_{k+1} − x_k‖ ≤ 2ϕ(F(x_0)).
Generalizations
The idea dates back to Łojasiewicz in the 60's [Ł1963]. Thanks to the nonsmooth KL inequality [BDLS2007], the result and proof technique extend to many algorithms of gradient type, including
Forward-backward or proximal gradient for smooth + non smooth.
Projected gradient for smooth under constraint.
Proximal point algorithm.
Block alternating variants
Inertial variants
Sequential convex programs for composite objectives
. . .
A more careful analysis also provides convergence rates; see for example [BA2009, ABS2013, BST2014].
Plan
1 Introduction
2 Convergence to local minima for Morse-Functions
3 On the structure of deep learning training loss
4 Convergence to critical points for tame functions
5 Approaching critical point with noise
6 Extensions to nonsmooth settings
Stochastic gradient
θ = (w, b), l_i(θ) = L(f_{w,b}(x_i), y_i), i = 1, …, n.

F : R^p → R, θ ↦ (1/n) ∑_{i=1}^n l_i(θ)   (P)

Stochastic gradient: (i_k)_{k∈N} i.i.d. RVs uniform on {1, …, n},

θ_{k+1} = θ_k − α_k ∇l_{i_k}(θ_k)   (SG),   α_k > 0.

Stochastic gradient, abstract form: let (M_k)_{k∈N} be a martingale difference sequence,

θ_{k+1} = θ_k − α_k (∇F(θ_k) + M_{k+1})   (SG),
E[M_{k+1} | past] = 0,   α_k > 0.

(The first recursion is the special case M_{k+1} = ∇l_{i_k}(θ_k) − ∇F(θ_k).)
Stochastic approximation: Robbins and Monro 1951 [RM1951].
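A minimal sketch of (SG) with Robbins–Monro step sizes on an assumed scalar least-squares loss (all names and constants are illustrative, not from the slides):

```python
import numpy as np

# (SG) with step sizes alpha_k = c/(k+1), which satisfy
# sum alpha_k = inf and sum alpha_k^2 < inf, on the toy loss
# F(theta) = (1/n) sum_i (theta - y_i)^2, minimized at mean(y).
rng = np.random.default_rng(2)
y = rng.standard_normal(100)          # "training set"
theta = 5.0
for k in range(20000):
    i = rng.integers(len(y))          # i_k uniform on {1, ..., n}
    grad_i = 2.0 * (theta - y[i])     # gradient of l_i
    theta -= (0.5 / (k + 1)) * grad_i

print(abs(theta - y.mean()))          # small: theta approaches argmin F
```

With this particular step-size constant the recursion reduces to a running average of the sampled targets, a concrete instance of the noise averaging the ODE method formalizes.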
The ODE method
Averaging out the noise: vanishing step sizes, ∑_{k∈N} α_k = +∞, ∑_{k∈N} α_k² < +∞.

Differentiable F (Ljung 1977 [L1977]): the sequence (θ_k)_{k∈N} behaves in the limit as the solutions of the differential equation

θ̇ = −∇F(θ).

Developments: Benaïm [B1996], Kushner and Yin [KY1997]…

Nonsmooth case: Benaïm, Hofbauer, Sorin (2005) [BHS2005], via the differential inclusion

θ̇ ∈ −∂F(θ).
Invariant sets
Definition (Invariant sets)
A set A is invariant if for any θ_0 ∈ A there is a solution of θ̇ = −∇F(θ) with θ(0) = θ_0 and θ(R) ⊂ A.

A set A is chain transitive if for any T > 0 and ε > 0, and any θ, µ ∈ A, there exist N ∈ N, solutions θ_i, i = 1, …, N, of θ̇ = −∇F(θ), and times t_i ≥ T such that

θ_i(t) ∈ A for all 0 ≤ t ≤ t_i, i = 1, …, N;
‖θ_i(t_i) − θ_{i+1}(0)‖ ≤ ε for all i = 1, …, N−1;
‖θ_1(0) − θ‖ ≤ ε and ‖θ_N(t_N) − µ‖ ≤ ε (draw a picture).
From [B1999], a chain transitive set is invariant.
A result of Benaïm
θ_{k+1} = θ_k − α_k (∇F(θ_k) + M_{k+1})   (SG),
E[M_{k+1} | past] = 0,   α_k > 0.

Square summable increments: there exists M ≥ 0 such that

sup_{k∈N} E[‖M_{k+1}‖² | past] ≤ M.

Vanishing step sizes: ∑_{k∈N} α_k = +∞, ∑_{k∈N} α_k² < +∞. This implies that ∑_{i=0}^k α_i M_{i+1} converges almost surely.

Theorem (Benaïm 1996 [B1996])

Conditionally on (θ_k)_{k∈N} being bounded, the set of accumulation points is almost surely compact, connected and chain transitive, hence invariant.
Corollary for deep learning
Stochastic gradient: (i_k)_{k∈N} i.i.d. RVs uniform on {1, …, n},

θ_{k+1} = θ_k − α_k ∇l_{i_k}(θ_k)   (SG),   α_k > 0.

Conditionally on (θ_k)_{k∈N} being bounded, almost surely F(θ_k) converges and any accumulation point θ̄ satisfies ∇F(θ̄) = 0.
Proof sketch
Lemma
If F is tame, any compact chain transitive set A satisfies
F is constant on A
∇F(θ) = 0 for all θ ∈ A.
Proof sketch

For any solution of θ̇ = −∇F(θ) and any t ∈ R, we have

(d/dt) F(θ(t)) = ∇F(θ(t))ᵀ θ̇(t) = −‖∇F(θ(t))‖².

Set L− = min_{θ∈A} F(θ) and L+ = max_{θ∈A} F(θ). Both are critical values since A is invariant. Let us prove that L− = L+. Normalize L− = 0. F is γ-Lipschitz on A.

Morse–Sard theorem: there is δ > 0 such that (0, 2δ] contains no critical value of F. Choose ε = δ/γ: for x, y ∈ A, ‖x − y‖ ≤ ε implies |F(x) − F(y)| ≤ δ. Set

Δ = min {‖∇F(θ)‖ : θ ∈ A, δ ≤ F(θ) ≤ 2δ} > 0,

and put T = δ/Δ².

For any θ_0 ∈ A such that F(θ_0) ≤ 2δ, any solution of θ̇ = −∇F(θ) starting at θ_0 satisfies F(θ(t)) ≤ δ for all t ≥ T.

For any θ ∈ A with F(θ) = 0 and any µ ∈ A, chain transitivity then gives F(µ) ≤ 2δ.

Taking δ arbitrarily small, F(µ) = 0 for all µ ∈ A, hence L+ = 0.
Plan
1 Introduction
2 Convergence to local minima for Morse-Functions
3 On the structure of deep learning training loss
4 Convergence to critical points for tame functions
5 Approaching critical point with noise
6 Extensions to nonsmooth settings
Stochastic subgradient
θ = (w, b), l_i(θ) = L(f_{w,b}(x_i), y_i), i = 1, …, n.

F : R^p → R, θ ↦ (1/n) ∑_{i=1}^n l_i(θ)   (P)

Stochastic subgradient descent: let (M_k)_{k∈N} be a martingale difference sequence,

θ_{k+1} = θ_k − α_k (v_k + M_{k+1})   (SG),
v_k ∈ ∂F(θ_k),   E[M_{k+1} | past] = 0,   α_k > 0.
∂ is a suitable generalization of the gradient.
Subgradients: F : R^p → R Lipschitz continuous
Convex (only for F convex; global lower tangents):

∂_conv F(x) = {v ∈ R^p : F(y) ≥ F(x) + vᵀ(y − x), ∀y ∈ R^p}.

Fréchet (local lower tangents):

∂_Fréchet F(x) = {v ∈ R^p : liminf_{y→x, y≠x} [F(y) − F(x) − vᵀ(y − x)] / ‖y − x‖ ≥ 0}.

Limiting (sequential closure):

∂_lim F(x) = {v ∈ R^p : ∃(y_k, v_k)_{k∈N}, y_k → x, v_k → v, v_k ∈ ∂_Fréchet F(y_k) for all k ∈ N}.

Clarke (convex closure):

∂_Clarke F(x) = conv(∂_lim F(x)).
Subgradients: F : R^p → R Lipschitz continuous
Example: F : x ↦ −|x|.

∂_conv F(0) = ∅,
∂_Fréchet F(0) = ∅,
∂_lim F(0) = {−1, 1},
∂_Clarke F(0) = [−1, 1].

0 is a local maximum; it is critical only for the most general notions of subgradient that we have seen.

∂_Fréchet F(x) ⊂ ∂_lim F(x) ⊂ ∂_Clarke F(x) for all x.

Fermat rule: if x is a local minimum of F, then 0 ∈ ∂_Fréchet F(x).
We will work with Clarke subgradients.
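The example F(x) = −|x| can be probed numerically: away from 0, F is differentiable with F′(y) = −sign(y), and letting y → 0 from both sides recovers the limiting and Clarke subgradients (a toy sketch with a hypothetical helper name):

```python
# F(x) = -|x| is differentiable away from 0 with F'(y) = -sign(y).
# The limiting subgradient at 0 collects the limits {-1, +1}; the Clarke
# subgradient is their convex hull, the interval [-1, 1].
def frechet_grad_away_from_zero(y):
    return -1.0 if y > 0 else 1.0        # F'(y) = -sign(y), y != 0

limits = {frechet_grad_away_from_zero(y) for y in (1e-8, -1e-8)}
print(sorted(limits))                     # the limiting subgradient at 0
clarke = (min(limits), max(limits))
print(clarke)                             # endpoints of the Clarke interval
```

Note that 0 lies in the Clarke interval, so the local maximum at 0 is Clarke-critical, as stated above.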
Absolute continuity
Definition (Absolutely continuous map)
Let g : I → R^p be AC on an interval I. This means that for any ε > 0, there exists δ > 0 such that for any collection ([u_k, v_k])_{k∈N} of pairwise disjoint subintervals of I,

∑_{k∈N} |u_k − v_k| ≤ δ ⟹ ∑_{k∈N} ‖g(u_k) − g(v_k)‖ ≤ ε.

Lipschitz functions are AC, and the composition of a Lipschitz function with an AC function is AC.

Equivalently, there exist a Lebesgue-integrable function y : I → R^p and a ∈ I such that, for all t ∈ I,

g(t) = g(a) + ∫_a^t y(s) ds.
Most importantly, AC functions are differentiable almost everywhere.
Differential inclusion solutions
Definition
Given θ0 ∈ Rp, a solution of the problem
θ̇ ∈ −∂F(θ),   θ(0) = θ_0,

is any absolutely continuous map θ : R → R^p such that (d/dt)θ(t) ∈ −∂F(θ(t)) for almost all t and θ(0) = θ_0.
Example: absolute value.
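For this example the solution can be written in closed form: starting from θ_0, θ(t) = sign(θ_0) max(|θ_0| − t, 0) reaches 0 in finite time and then stays there, which is allowed because 0 ∈ ∂|·|(0) = [−1, 1]. A toy sketch comparing this with an explicit Euler discretization of the inclusion:

```python
import numpy as np

# Closed-form solution of the inclusion d/dt theta ∈ -∂|theta|.
def solution(theta0, t):
    return np.sign(theta0) * max(abs(theta0) - t, 0.0)

# Explicit Euler discretization with a selection in the subdifferential.
theta, dt = 2.0, 1e-3
for k in range(int(3.0 / dt)):                  # integrate on [0, 3]
    v = np.sign(theta) if theta != 0 else 0.0   # v ∈ ∂|theta|
    theta = theta - dt * v

print(theta, solution(2.0, 3.0))                # both ~0
```

The discretization oscillates around 0 with amplitude at most dt once the kink is reached, while the continuous-time solution sits exactly at 0.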
Clarke subdifferential for locally Lipschitz functions
If F is locally Lipschitz, then
The graph of ∂F is closed:

graph(∂F) = {(x, y) ∈ R^p × R^p : y ∈ ∂F(x)}.

Equivalently, x_k → x and y_k → y with y_k ∈ ∂F(x_k) for all k ∈ N implies y ∈ ∂F(x). In particular, ∂F is closed-valued.

∂F(x) is nonempty, convex and compact.

Slightly stronger assumption (linear growth): there is C > 0 such that, for all x ∈ R^p, sup_{z∈∂F(x)} ‖z‖ ≤ C(1 + ‖x‖).
Existence of solutions: these hypotheses suffice to ensure that the differential inclusion problem has (typically non-unique) solutions (Aubin and Cellina [AC1984], Clarke et al. [CLSW1998]).
Invariant sets
Definition (Invariant sets)
A set A is invariant if for any θ_0 ∈ A there is a solution of θ̇ ∈ −∂F(θ) with θ(0) = θ_0 and θ(R) ⊂ A.

A set A is chain transitive if for any T > 0 and ε > 0, and any θ, µ ∈ A, there exist N ∈ N, solutions θ_i, i = 1, …, N, of θ̇ ∈ −∂F(θ), and times t_i ≥ T such that

θ_i(t) ∈ A for all 0 ≤ t ≤ t_i, i = 1, …, N;
‖θ_i(t_i) − θ_{i+1}(0)‖ ≤ ε for all i = 1, …, N−1;
‖θ_1(0) − θ‖ ≤ ε and ‖θ_N(t_N) − µ‖ ≤ ε (draw a picture).
From [BHS2005], a bounded chain transitive set is invariant.
Stochastic approximation and differential inclusions
θ_{k+1} = θ_k − α_k (v_k + M_{k+1})   (SG),
v_k ∈ ∂F(θ_k),   E[M_{k+1} | past] = 0,   α_k > 0.

Square summable increments: there exists M ≥ 0 such that

sup_{k∈N} E[‖M_{k+1}‖² | past] ≤ M.

Vanishing step sizes: ∑_{k∈N} α_k = +∞, ∑_{k∈N} α_k² < +∞. This implies that ∑_{i=0}^k α_i M_{i+1} converges almost surely.

Theorem (Benaïm, Hofbauer, Sorin [BHS2005])

Conditionally on (θ_k)_{k∈N} being bounded, the set of accumulation points is almost surely compact, connected and chain transitive, hence invariant.
Chain rule
Lemma (Convex chain rule)
If F is convex and θ : R → R^p is a solution of θ̇ ∈ −∂F(θ), then for almost all t ∈ R_+,

(d/dt)(F∘θ)(t) = ⟨v, θ̇(t)⟩ for every v ∈ ∂F(θ(t)), and this common value equals −dist(0, ∂F(θ(t)))².
Example: the ℓ1 norm. See Brezis 1973 [B1973].
Chain rule

F is locally Lipschitz and θ is AC, so the composition F∘θ is also AC, and we may choose t_0 such that both θ and F∘θ are differentiable at t_0. Let θ_0 = θ(t_0) and θ̇_0 = θ̇(t_0); we have

θ(t_0 + h) = θ_0 + hθ̇_0 + o(h),
θ(t_0 − h) = θ_0 − hθ̇_0 + o(h).

For any v ∈ ∂F(θ_0), it holds that ⟨v, y − θ_0⟩ ≤ F(y) − F(θ_0) for all y ∈ R^p. Now, imposing h > 0, we have

⟨v, θ(t_0 + h) − θ_0⟩ / h = ⟨v, θ̇_0⟩ + o(1) ≤ [F(θ(t_0 + h)) − F(θ_0)] / h → (d/dt)(F∘θ)(t_0) as h → 0.

On the other hand, still with h > 0,

⟨v, θ(t_0 − h) − θ_0⟩ / (−h) = ⟨v, θ̇_0⟩ + o(1) ≥ [F(θ(t_0 − h)) − F(θ_0)] / (−h) → (d/dt)(F∘θ)(t_0) as h → 0.

Hence ⟨v, θ̇_0⟩ = (d/dt)(F∘θ)(t_0) for every v ∈ ∂F(θ_0), which proves the first identity. Moreover, −θ̇_0 ∈ ∂F(θ_0), and since ⟨v, θ̇_0⟩ is the same for all v ∈ ∂F(θ_0), −θ̇_0 is the minimum-norm element of ∂F(θ_0), which gives the second identity.
Corollary for deep learning
Stochastic subgradient: (i_k)_{k∈N} i.i.d. RVs uniform on {1, …, n},

θ_{k+1} = θ_k − α_k (v_k + M_{k+1})   (SG),
v_k ∈ ∂F(θ_k),   E[M_{k+1} | past] = 0,   α_k > 0.

Conditionally on (θ_k)_{k∈N} being bounded, almost surely F(θ_k) converges and any accumulation point θ̄ satisfies 0 ∈ ∂F(θ̄).
Proof arguments: same ideas as in the smooth case.

Accumulation points form chain transitive invariant sets [BHS2005].
If F is tame, we have a nonsmooth Morse–Sard theorem [BDLS2007].
If F is tame, F satisfies a chain rule [DDKL2018], using the projection formula of [BDLS2007] for stratifiable functions.
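A minimal sketch of the corollary on an assumed nonsmooth tame loss, F(θ) = (1/n)∑_i |θ − y_i|, a scalar median problem (illustrative only, not the slides' network):

```python
import numpy as np

# Stochastic subgradient method (SG) for the nonsmooth tame loss
# F(theta) = (1/n) sum_i |theta - y_i|, minimized at a median of the y_i.
# A subgradient of l_i is sign(theta - y_i) (any value in [-1, 1] at the kink).
rng = np.random.default_rng(3)
y = np.sort(rng.standard_normal(101))     # odd n: unique median y[50]
theta = 4.0
for k in range(200000):
    i = rng.integers(len(y))
    v = np.sign(theta - y[i])             # element of ∂l_i(theta)
    theta -= (1.0 / (k + 1)) * v

print(abs(theta - y[50]))                 # small: theta approaches the median
```

The loss is piecewise linear, hence semi-algebraic (tame), so the corollary applies: accumulation points of the iterates are Clarke-critical, here the median.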
P.A. Absil, R. Mahony and B. Andrews (2005).Convergence of the iterates of descent methods for analytic cost functions.SIAM Journal on Optimization, 16(2), 531–547.
H. Attouch, J. Bolte and B.F. Svaiter (2013).Convergence of descent methods for semi-algebraic and tame problems: proximalalgorithms, forward-backward splitting, and regularized Gauss-Seidel methods.Mathematical Programming, 137(1-2), 91–129.
J.P. Aubin and A. Cellina (1984).Differential inclusions: set-valued maps and viability theory. Springer Science &Business Media.
M. Benaïm (1996).A dynamical system approach to stochastic approximations.SIAM Journal on Control and Optimization, 34(2), 437–472.
M. Benaïm (1999).Dynamics of stochastic approximation algorithms.Séminaire de probabilités XXXIII (pp. 1–68). Springer, Berlin, Heidelberg.
M. Benaïm, J. Hofbauer and S. Sorin (2005).Stochastic approximations and differential inclusions.SIAM Journal on Control and Optimization, 44(1), 328–348.
J. Bolte, A. Daniilidis, A. Lewis and M. Shiota (2007).Clarke subgradients of stratifiable functions.SIAM Journal on Optimization, 18(2), 556-572.
H. Attouch and J. Bolte (2009).On the convergence of the proximal algorithm for nonsmooth functions involvinganalytic features.Mathematical Programming, 116(1-2), 5-16.
J. Bolte, S. Sabach and M. Teboulle (2014).Proximal alternating linearized minimization for nonconvex and nonsmooth problems.Mathematical Programming, 146(1-2), 459–494.
H. Brezis (1973).Opérateurs maximaux monotones et semi-groupes de contractions dans les espaces de Hilbert (Vol. 5). Elsevier.
F.H. Clarke, Y.S. Ledyaev, R.J. Stern and P.R. Wolenski (1998).Nonsmooth analysis and control theory.Springer Science & Business Media.
M. Coste (2000).An introduction to o-minimal geometry.Pisa: Istituti editoriali e poligrafici internazionali.
M. Coste (2002).An introduction to semialgebraic geometry.Pisa: Istituti editoriali e poligrafici internazionali.
D. Davis, D. Drusvyatskiy, S. Kakade and J. D. Lee (2018).Stochastic subgradient method converges on tame functions.arXiv preprint arXiv:1804.07795.
L. van den Dries and C. Miller (1996).Geometric categories and o-minimal structures.Duke Math. J., 84(2), 497–540.
L. van den Dries (1998).Tame topology and o-minimal structures (Vol. 248). Cambridge University Press.
K. Kurdyka (1998).On gradients of functions definable in o-minimal structures.Annales de l'institut Fourier, 48(3), 769–784.
H. Kushner and G.G. Yin (1997). Stochastic approximation and recursivealgorithms and applications. Springer Science & Business Media.
J.D. Lee and M. Simchowitz and M.I. Jordan and B. Recht (2016).Gradient descent only converges to minimizers.Conference on Learning Theory (pp. 1246–1257).
L. Ljung (1977).Analysis of recursive stochastic algorithms.IEEE transactions on automatic control, 22(4), 551–575.
S. Łojasiewicz (1963).Une propriété topologique des sous-ensembles analytiques réels.Les équations aux dérivées partielles, 117, 87–89.
H. Robbins and S. Monro (1951).A Stochastic Approximation Method.The Annals of Mathematical Statistics, 22(3), 400–407.
M. Shiota (1995).Geometry of subanalytic and semialgebraic sets. Springer Science & BusinessMedia.
M. Shub (1987).Global stability of dynamical systems. Springer Science & Business Media.