
Nonsmooth, nonconvex optimization, implications for deep-learning

S. Gerchinovitz1, F. Malgouyres1, E. Pauwels2 & N. Thome3

1 Institut de Mathématiques de Toulouse, Université Toulouse 3 Paul Sabatier.
2 Institut de recherche en informatique de Toulouse, Université Toulouse 3 Paul Sabatier.

3 Centre d'étude et de recherche en informatique et communication, Conservatoire national des arts et métiers.

28 Jan. 2019


Acknowledgements

Jérôme Bolte (Toulouse School of Economics): important contributor, who helped with the preparation of this course.


Plan

1 Introduction

2 Convergence to local minima for Morse-Functions

3 On the structure of deep learning training loss

4 Convergence to critical points for tame functions

5 Approaching critical point with noise

6 Extensions to nonsmooth settings


Training a deep network

Finite dimensional optimization problem

min_{w,b}  (1/n) ∑_{i=1}^n L(f_{w,b}(x_i), y_i)

((x_i, y_i))_{i=1}^n: training set in X × Y.

L: loss.

(w, b): network parameters (linear maps and offsets).

f_{w,b} : X → Y: the neural network.

Notations:

F : R^p → R
θ ↦ (1/n) ∑_{i=1}^n l_i(θ)   (P)

θ = (w, b),  l_i(θ) = L(f_{w,b}(x_i), y_i),  i = 1, . . . , n.


Main question

min_{θ∈R^p} F(θ) = (1/n) ∑_{i=1}^n l_i(θ)   (1)

Compositional structure of deep networks: computing a (stochastic) (sub)gradient of F has a cost comparable to evaluating F.

Deep nets are trained with variants of gradient descent.

θ_{k+1} = θ_k − α_k ∇F(θ_k)   (GD)

α_k > 0

Long term behaviour for this recursion?
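As a concrete illustration (not part of the original slides), here is a minimal Python sketch of the recursion (GD) on a toy nonconvex objective; the choice of F, step size and initialization are arbitrary assumptions made for the example.

```python
import numpy as np

def grad_F(theta):
    # Gradient of the toy nonconvex objective F(t) = t^4/4 - t^2/2.
    return theta**3 - theta

theta = np.array([0.1])   # arbitrary initialization
alpha = 0.1               # constant step size alpha_k = alpha

# Recursion (GD): theta_{k+1} = theta_k - alpha_k * grad F(theta_k)
for k in range(200):
    theta = theta - alpha * grad_F(theta)

print(theta)  # long-term behaviour here: convergence to a critical point (t = 1)
```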


Non convexity, non smoothness


Roadmap

Main difficulty: the objective F is not convex and may not be smooth.

Such questions have been of interest to the mathematical community for a long time. Foundations come from two mathematical fields:

Smooth dynamical systems: Poincaré, Hadamard, Lyapunov, Hirsch, Smale, Shub, Hartman, Grobman, Thom . . .

Favorable geometric structure of F (semi-algebraic/tame geometry): Łojasiewicz, Hironaka, Grothendieck, van den Dries, Shiota . . .

Main results:

Convergence to second order critical point for Morse functions (60’s).

Favorable structure of deep learning landscapes (60’s).

Convergence to critical points under Łojasiewicz assumption (60’s).

Approaching critical point with stochastic subgradient (ODE method, 70’s).


Plan

1 Introduction

2 Convergence to local minima for Morse-Functions

3 On the structure of deep learning training loss

4 Convergence to critical points for tame functions

5 Approaching critical point with noise

6 Extensions to nonsmooth settings


Main idea

Smooth dynamical systems

ẋ = S(x)   (flow)

x_{k+1} = T(x_k)   (discrete)

S, T : R^p → R^p, local diffeomorphisms (differentiable with differentiable inverse).

Long term behaviour: convergence, bifurcation, chaos . . .

Generic results: nonlinear dynamics behave qualitatively in the same way as their linear approximations.

Lemma: Let F be C², with ∇F L-Lipschitz. Then the gradient mapping T : x ↦ x − α∇F(x) is a diffeomorphism for 0 < α < 1/L.


The gradient mapping is a diffeomorphism

Constructive proof:

For any x ∈ R^p, the Jacobian ∇T(x) = I − α∇²F(x) is positive definite (exercise). This gives a local diffeomorphism as a consequence of the inverse function theorem.

Explicitly, injectivity: for any x, y ∈ R^p such that T(x) = T(y),

‖x − y‖ = α‖∇F(x) − ∇F(y)‖ ≤ αL‖x − y‖ < ‖x − y‖ unless x = y; hence x = y.

Explicit inverse: solution of the strictly convex problem

prox_{−αF} : z ↦ argmin_{y∈R^p}  −αF(y) + (1/2)‖y − z‖²₂,

x = prox_{−αF}(z) ⇔ z = x − α∇F(x).
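A hedged sketch of the "exercise" above, written out in LaTeX; the eigenvalue bound is the only ingredient used.

```latex
% Since \nabla F is L-Lipschitz, the eigenvalues of \nabla^2 F(x) lie in [-L, L], so
\begin{align*}
  \operatorname{eig}\big(\nabla T(x)\big)
  = \operatorname{eig}\big(I - \alpha \nabla^2 F(x)\big)
  \subset [\,1 - \alpha L,\ 1 + \alpha L\,] \subset (0, 2)
  \quad \text{for } 0 < \alpha < 1/L,
\end{align*}
% hence \nabla T(x) \succ 0; in particular it is invertible, and T is a local
% diffeomorphism around every x by the inverse function theorem.
```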


Linear isomorphisms

Convergence to 0?

x_0 ∈ R, a ∈ C, a ≠ 0, x_{k+1} = a x_k.

x_0 ∈ R^p, D ∈ R^{p×p} diagonal with no zero entry, x_{k+1} = D x_k.

x_0 ∈ R^p, M ∈ R^{p×p} diagonalizable over C, x_{k+1} = M x_k.

Symmetric real matrix: if M has no eigenvalue λ such that |λ| = 1, then one can write

R^p = E_s ⊕ E_u

where E_s is the stable space of M: all x such that M^k x → 0 as k → ∞. If dim(E_u) > 0, then the divergence behaviour is generic (for almost every x).

Extension to any square matrix using the Jordan normal form.
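A small numerical sketch (my own toy, not from the slides): one stable and one unstable eigenvalue, with a generic random initialization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear recursion x_{k+1} = M x_k with eigenvalues 0.5 (stable) and 1.5 (unstable).
M = np.diag([0.5, 1.5])

x = rng.standard_normal(2)    # generic initialization (almost surely not in E_s)
for k in range(50):
    x = M @ x

print(x)   # the stable component has vanished, the unstable one has blown up
```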


Stable manifold theorem

Idea dates back to Hadamard, Lyapunov and Perron. This is a difficult result.

Theorem (e.g. Shub's book [S1987]): Let T : R^p → R^p be a local diffeomorphism and x̄ a fixed point of T such that ∇T(x̄) has no eigenvalue on the unit circle and at least one eigenvalue of modulus > 1.

Then there exists a neighborhood U of x̄ such that

W^s(T, x̄) = {x_0 ∈ U : T^n(x_0) → x̄, n → +∞},
W^u(T, x̄) = {x_0 ∈ U : T^n(x_0) → x̄, n → −∞},

are differentiable manifolds tangent to the stable and unstable spaces of ∇T(x̄). In particular, W^s(T, x̄) has dimension < p.


With a picture

Obayashi et al. (2016). Formation mechanism of a basin of attraction for passive dynamic walking induced by intrinsic hyperbolicity. Proceedings of the Royal Society A.


Convergence to local minima on Morse functions

Assume that F : R^p → R is C², with L-Lipschitz gradient. Assume that x̄ ∈ R^p satisfies:

∇F(x̄) = 0

∇²F(x̄) has no zero eigenvalue

∇²F(x̄) has at least one strictly negative eigenvalue.

Assume that x_0 is drawn at random (absolutely continuously w.r.t. Lebesgue, e.g. Gaussian) and that (x_k)_{k∈N} is given by gradient descent starting at x_0 with α < 1/L. Then, with respect to the random choice of the initialization,

P[x_k → x̄] = 0.

Proof: The gradient mapping T : x ↦ x − α∇F(x) satisfies the hypotheses of the stable manifold theorem. If x_k → x̄, then after a finite number of steps K, x_k ∈ U for all k ≥ K, which implies that x_k ∈ W^s(T, x̄) for all k ≥ K. Hence

{x_0 ∈ R^p : T^k(x_0) → x̄ as k → ∞} = ∪_{K∈N} T^{−K}(W^s(T, x̄)).

W^s(T, x̄) has Lebesgue measure 0, images of zero-measure sets by diffeomorphisms have measure 0, and a countable union of measure-0 sets has measure 0.
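A minimal numerical companion to the statement above (my own toy example, not from the slides): gradient descent on the Morse function F(x) = (x_1² − x_2²)/2, whose only critical point is the saddle x̄ = 0.

```python
import numpy as np

# F(x) = 0.5 * (x1^2 - x2^2): grad F(x) = (x1, -x2), Hessian diag(1, -1),
# so L = 1 and the origin is a saddle (one strictly negative eigenvalue).
def grad_F(x):
    return np.array([x[0], -x[1]])

rng = np.random.default_rng(1)
alpha = 0.1                       # step size < 1/L
x = rng.standard_normal(2)        # random (Gaussian) initialization

for k in range(200):
    x = x - alpha * grad_F(x)

print(x)   # x[0] -> 0 but |x[1]| grows: the iterates do not converge to the saddle
```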


Extension: Gradient Descent Only Converges to Minimizers

Lee, Simchowitz, Jordan, Recht [LSJR2016]: drop the full rank assumption on the Hessian.


Plan

1 Introduction

2 Convergence to local minima for Morse-Functions

3 On the structure of deep learning training loss

4 Convergence to critical points for tame functions

5 Approaching critical point with noise

6 Extensions to nonsmooth settings


Deep learning training loss

θ = (w, b),  l_i(θ) = L(f_{w,b}(x_i), y_i),  i = 1, . . . , n.

F : R^p → R
θ ↦ (1/n) ∑_{i=1}^n l_i(θ)   (P)

Consider L : (y, y′) ↦ (y − y′)² or L : (y, y′) ↦ |y − y′| and a ReLU network: the activation function is the positive part max(0, ·).

Then F has a highly favorable structure: it is “piecewise” polynomial.
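To make the "piecewise polynomial" structure concrete, here is a hedged Python sketch of a single term l_i(θ) for a tiny one-hidden-layer ReLU network with squared loss; the sizes, data and parameter layout are hypothetical choices for the example.

```python
import numpy as np

def loss_i(theta, x_i, y_i):
    # theta = (W1, b1, w2, b2): one hidden ReLU layer, scalar output.
    W1, b1, w2, b2 = theta
    h = np.maximum(0.0, W1 @ x_i + b1)   # ReLU: positive part, piecewise linear in theta
    pred = w2 @ h + b2                   # f_{w,b}(x_i)
    return (pred - y_i) ** 2             # L(f_{w,b}(x_i), y_i)

# On each region where the sign pattern of W1 @ x_i + b1 is fixed, max(0, .) is
# linear, so loss_i is a polynomial in the entries of theta; the regions are cut
# out by finitely many polynomial inequalities, hence F = (1/n) sum_i l_i is
# semi-algebraic ("piecewise" polynomial).
rng = np.random.default_rng(0)
x_i, y_i = rng.standard_normal(3), 0.7
theta = (rng.standard_normal((4, 3)), rng.standard_normal(4),
         rng.standard_normal(4), rng.standard_normal())
print(loss_i(theta, x_i, y_i))
```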


Semi-algebraic sets and functions

Semi-algebraic set in R^p: union of finitely many solution sets of systems of the form

{x ∈ R^p : P(x) = 0, Q_1(x) > 0, . . . , Q_l(x) > 0}

for some polynomial functions P, Q_1, . . . , Q_l over R^p.

Semi-algebraic map R^p → R^{p′}: a map F : R^p → R^{p′} whose graph

graph F = {(x, z) ∈ R^{p+p′} : z = F(x)}

is semi-algebraic.

Semi-algebraic set in R: Union of finitely many intervals.

S. Gerchinovitz1, F. Malgouyres1, E. Pauwels2 & N. Thome3 (IMT-IRIT)Nonsmooth, nonconvex optimization 28 Jan. 2019 18 / 58

Page 19: Nonsmooth, nonconvex optimization, implications for deep ...fmalgouy/enseignement/downlo… · Nonsmooth, nonconvex optimization, implications for deep-learning S. Gerchinovitz1,

Semi-algebraic functions: examples

Polynomials: P(x)

“Piecewise polynomials”: P(x) if x > 0, Q(x) otherwise

Rational functions: 1/P(x)

Rational powers: P(x)^q, q ∈ Q.

Absolute value: ‖ · ‖1.

‖ · ‖0 pseudo-norm.

Rank of matrices

. . .


Tarski-Seidenberg Theorem

Theorem: Let A ⊂ R^{p+1} be a semi-algebraic set and let π denote the projection onto the first p coordinates; then π(A) is a semi-algebraic set. In other words,

{x ∈ R^p : ∃y ∈ R, (x, y) ∈ A}

is semi-algebraic and can be described by finitely many polynomial inequalities in x only.
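A one-line illustration (my example, not on the slide): projecting the parabola eliminates the quantifier.

```latex
% With A = \{(x, y) \in \mathbb{R}^2 : x = y^2\} (semi-algebraic) and \pi(x, y) = x,
\begin{align*}
  \pi(A) = \{\, x \in \mathbb{R} \ :\ \exists y \in \mathbb{R},\ (x, y) \in A \,\}
         = \{\, x \in \mathbb{R} \ :\ x \ge 0 \,\},
\end{align*}
% which is indeed described by a single polynomial inequality in x only.
```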

Consequences: any set or function which can be described by a first-order formula, with real variables, semi-algebraic objects, addition, multiplication, equality and inequality signs, is semi-algebraic. For example:

The image and pre-image of a semi-algebraic set by a semi-algebraic map.

The interior, closure and boundary of a semi-algebraic set.

The derivatives of a differentiable semi-algebraic function.

The set of non-continuity or non-differentiability points of a semi-algebraic function.

For more: Michel Coste’s Introduction to semi-algebraic geometry [C2002].


Example: Morse-Sard theorem

Theorem: Let f : R → R be a semi-algebraic differentiable function; then the set of critical values of f is finite:

crit f = f({x ∈ R : f′(x) = 0}).

Proof: Set C = {x ∈ R : f′(x) = 0}. Since f′ is semi-algebraic, C is semi-algebraic, so there exist m ∈ N and intervals J_1, . . . , J_m such that C = ∪_{i=1}^m J_i.

For each i = 1, . . . , m, since J_i is an interval, for any a, b ∈ J_i we have

f(b) − f(a) = ∫_a^b f′(t) dt = 0.

Hence f is constant on J_i for all i = 1, . . . , m, and f(C) has at most m values.

Feature of this theory: some results have simple, short proofs but rely on a deep technical construction.


Extension to o-minimal structures (van den Dries, Shiota)

o-minimal structure, axiomatic definition: M = ∪_{p∈N} M_p, where each M_p is a family of subsets of R^p such that

if A, B ∈ M_p, then so do A ∪ B, A ∩ B and R^p \ A.

if A ∈ M_p and B ∈ M_{p′}, then A × B ∈ M_{p+p′}.

each M_p contains the semi-algebraic sets in R^p.

if A ∈ M_{p+1}, denoting π the projection on the first p coordinates, π(A) ∈ M_p.

M_1 consists of all finite unions of intervals.

Tame function: a function whose graph is an element of an o-minimal structure.

Examples: semi-algebraic sets (Tarski-Seidenberg), exp-definable sets (Wilkie), restrictions of analytic functions to bounded sets (Gabrielov).

Consequences: many results which hold for semi-algebraic sets actually hold for tame functions.

For more: van den Dries and Miller [VdD1998, VdDM1996], Shiota [S1995], Coste's introduction to o-minimal geometry [C2000].


Deep learning training loss

θ = (w, b),  l_i(θ) = L(f_{w,b}(x_i), y_i),  i = 1, . . . , n.

F : R^p → R
θ ↦ (1/n) ∑_{i=1}^n l_i(θ)   (P)

For L(·) = (·)² or L(·) = |·| and a ReLU network, F is semi-algebraic. More generally, this holds for any semi-algebraic L and activation functions.

For most choices of L and activation functions, F is tame.


Plan

1 Introduction

2 Convergence to local minima for Morse-Functions

3 On the structure of deep learning training loss

4 Convergence to critical points for tame functions

5 Approaching critical point with noise

6 Extensions to nonsmooth settings


Introduction

F : R^p → R is C¹ with L-Lipschitz gradient,

α ∈ (0, 1/L],  x_{k+1} = x_k − α∇F(x_k).

Convergence of the iterates?

[Figure: three possible behaviours of the iterates as a function of k — non-convergence, jittering, convergence.]


KL property (Łojasiewicz 63, Kurdyka 98)

Desingularizing functions on (0, r0)

ϕ ∈ C([0, r_0), R_+),

ϕ ∈ C¹(0, r_0), ϕ′ > 0,

ϕ concave and ϕ(0) = 0.

Definition

Let F : R^p → R be C¹. F has the KL property at x̄ (with F(x̄) = 0) if there exist ε > 0 and a desingularizing function ϕ such that

‖∇(ϕ∘F)(x)‖₂ ≥ 1,   for all x such that ‖x − x̄‖ < ε and F(x̄) < F(x) < F(x̄) + ε.

Theorem

KL inequality holds for

differentiable semi-algebraic functions (Łojasiewicz 1963 [Ł1963]).

differentiable tame functions (Kurdyka 1998, [K1998]).

nonsmooth tame functions (Bolte-Daniilidis-Lewis-Shiota 2007 [BDLS2007]).


Illustration: F and ϕ∘F

[Figure: graph of F (left) and of ϕ∘F (right) — reparameterizing with ϕ sharpens the function.]


KL inequality examples

Trivial outside critical points: if ∇F(x̄) ≠ 0, then one can take ϕ to be multiplication by a suitable positive constant.

Univariate analytic functions: F : x ↦ ∑_{i=I}^{+∞} a_i x^i with I ≥ 1 and a_I ≠ 0. F is differentiable and, around 0,

|F′| ≥ c|F|^θ,   c > 0,  θ = 1 − 1/I.

This is the original form of Łojasiewicz's gradient inequality; it corresponds to ϕ : t ↦ t^{1−θ} / ((1−θ)c).

µ-strongly convex functions: x* realises the minimum of F.

F(x*) ≥ F(x) + ⟨∇F(x), x* − x⟩ + (µ/2)‖x* − x‖²₂   ∀x ∈ R^p.

Lower-bounding the right-hand side by its minimum over x* (attained at x* = x − (1/µ)∇F(x)) gives

2µ(F(x) − F(x*)) ≤ ‖∇F(x)‖²,

θ = 1/2,  ϕ(·) = √· (up to a constant).
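A worked check of the strongly convex case on the simplest quadratic (a sketch, with F* = 0 and my own choice of constants):

```latex
% Take F(x) = \tfrac{\mu}{2}\|x\|_2^2, so x^* = 0, F(x^*) = 0, \nabla F(x) = \mu x,
% and choose \varphi(t) = \sqrt{2t/\mu} (i.e. \varphi(\cdot) = \sqrt{\cdot} up to a constant).
\begin{align*}
  \|\nabla(\varphi \circ F)(x)\|_2
   = \varphi'\big(F(x)\big)\,\|\nabla F(x)\|_2
   = \frac{\mu \|x\|_2}{\sqrt{2\mu F(x)}}
   = \frac{\mu \|x\|_2}{\sqrt{\mu^2 \|x\|_2^2}} = 1,
\end{align*}
% so the KL inequality \|\nabla(\varphi \circ F)(x)\| \ge 1 holds (with equality)
% at every x \ne x^*, matching \theta = 1/2.
```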


Non convex examples: θ = 1/2, ϕ(·) =√·

Quadratics: F : x ↦ (1/2)(x − b)^T A(x − b), A symmetric, b ∈ R^p:

µ: smallest nonzero positive eigenvalue of A and of −A (i.e. the smallest nonzero |λ|).

µA ⪯ A² and −µA ⪯ A², hence

2µ(F(x) − F*) ≤ ‖∇F(x)‖².

Morse functions: F C² such that ∇F(x) = 0 implies ∇²F(x) is nonsingular.

Nonconvex example:

F : x ↦ (x_1/2 − x_2²)²,   ∇F(x) = ( x_1/2 − x_2² ,  −4x_2 (x_1/2 − x_2²) ),

(F(x) − F*) ≤ ‖∇F(x)‖²,  since ‖∇F(x)‖² = (1 + 16x_2²)(x_1/2 − x_2²)² ≥ F(x) − F* (with F* = 0).


Convergence under KL assumption

Theorem (Absil, Mahony, Andrews 2005 [AMA2005])

Let F : R^p → R be bounded below, C¹ with L-Lipschitz gradient, and satisfy the KL property. Assume that x_0 ∈ R^p is such that {x ∈ R^p : F(x) ≤ F(x_0)} is compact and, for all k ∈ N,

x_{k+1} = x_k − α∇F(x_k),

with α = 1/L. Then x_k → x̄ as k → ∞, where ∇F(x̄) = 0, and ∑_{k∈N} ‖x_{k+1} − x_k‖ is finite.

Note that the KL assumption holds automatically if F is tame, which is the case for deep network training losses.
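A numerical illustration of the theorem (not a proof; the function F(x) = x⁴ and the constants are my own choices): the trajectory length ∑‖x_{k+1} − x_k‖ stays bounded.

```python
# Gradient descent with alpha = 1/L on the semi-algebraic function F(x) = x^4,
# started in the compact sublevel set [-1, 1]; we track the trajectory length.
def grad_F(x):
    return 4.0 * x**3

x = 1.0
L = 12.0            # Lipschitz constant of grad F on [-1, 1]
alpha = 1.0 / L
length = 0.0
for k in range(100000):
    x_new = x - alpha * grad_F(x)
    length += abs(x_new - x)
    x = x_new

print(x, length)    # x tends to the critical point 0; the accumulated length stays finite
```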


Proof sketch of convergence under KL assumption

From the descent lemma, we have

∑_{i=0}^k ‖∇F(x_i)‖²₂ ≤ 2L(F(x_0) − F(x_{k+1})) ≤ 2L(F(x_0) − F*).

The sum converges and ∇F(x_k) → 0 as k → ∞; all accumulation points are critical points.

Ω: set of accumulation points of (x_k)_{k∈N} (nonempty). Ω is compact, F is constant on Ω (= 0), and dist(x_k, Ω) → 0 as k → ∞. F satisfies the KL property at each x̄ ∈ Ω.

Construct a global desingularizing function ϕ for Ω: for each x̄ ∈ Ω, consider ε_x̄ > 0 and the function ϕ_x̄ given by the KL property. The corresponding balls cover Ω, which is compact, so extract a finite family of balls (B(x̄_i, ε_i))_{i=1}^m which covers Ω.

We have

0 < ε_0 := inf { dist(y, Ω) : y ∈ R^p such that ‖y − x̄_i‖ ≥ ε_i, i = 1, . . . , m }.

Setting ε = min_{i=0,...,m} ε_i and ϕ = ∑_{i=1}^m ϕ_i, ϕ is a global desingularizing function on

U = {y ∈ R^p : dist(y, Ω) < ε, 0 < F(y) < ε},

and x_k ∈ U for all k ≥ K for some K ∈ N. Discard the first terms so that K = 0.


Proof sketch of convergence under KL assumption

For all k ∈ N,

F(x_{k+1}) ≤ F(x_k) − (1/2)‖x_{k+1} − x_k‖₂ ‖∇F(x_k)‖

ϕ(F(x_{k+1})) ≤ ϕ( F(x_k) − (1/2)‖x_{k+1} − x_k‖₂ ‖∇F(x_k)‖ )
 ≤ ϕ(F(x_k)) − ϕ′(F(x_k)) (1/2)‖x_{k+1} − x_k‖₂ ‖∇F(x_k)‖   (ϕ concave)
 = ϕ(F(x_k)) − (1/2)‖x_{k+1} − x_k‖₂ ‖∇(ϕ∘F)(x_k)‖
 ≤ ϕ(F(x_k)) − (1/2)‖x_{k+1} − x_k‖₂   (KL inequality)

Finally, summing, ∑_{k∈N} ‖x_{k+1} − x_k‖₂ ≤ 2ϕ(F(x_0)).


Generalizations

The idea dates back to Łojasiewicz in the 60's [Ł1963]. Thanks to the nonsmooth KL inequality [BDLS2007], the result and proof technique extend to many algorithms of gradient type, including

Forward-backward or proximal gradient for smooth + non smooth.

Projected gradient for smooth under constraint.

Proximal point algorithm.

Block alternating variants

Inertial variants

Sequential convex programs for composite objectives

. . .

A more careful analysis also provides convergence rates. See for example [BA2009, ABS2013, BST2014].


Plan

1 Introduction

2 Convergence to local minima for Morse-Functions

3 On the structure of deep learning training loss

4 Convergence to critical points for tame functions

5 Approaching critical point with noise

6 Extensions to nonsmooth settings


Stochastic gradient

θ = (w, b),  l_i(θ) = L(f_{w,b}(x_i), y_i),  i = 1, . . . , n.

F : R^p → R
θ ↦ (1/n) ∑_{i=1}^n l_i(θ)   (P)

Stochastic gradient: (i_k)_{k∈N} i.i.d. random variables, uniform on {1, . . . , n}.

θ_{k+1} | θ_k = θ_k − α_k ∇l_{i_k}(θ_k)   (SG)

α_k > 0

Stochastic gradient, abstract form: let (M_k)_{k∈N} be a martingale difference sequence.

θ_{k+1} | past = θ_k − α_k (∇F(θ_k) + M_{k+1})   (SG)

E[M_{k+1} | past] = 0

α_k > 0

Stochastic approximation: Robbins and Monro 1951 [RM1951].
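A minimal sketch of (SG) with Robbins-Monro step sizes (∑α_k = +∞, ∑α_k² < +∞) on synthetic quadratic losses; the data, losses and step-size schedule are assumptions made for this toy example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
A = rng.standard_normal((n, p))
y = rng.standard_normal(n)

def grad_l_i(theta, i):
    # Gradient of l_i(theta) = 0.5 * (a_i^T theta - y_i)^2.
    return (A[i] @ theta - y[i]) * A[i]

theta = np.zeros(p)
for k in range(20000):
    i = rng.integers(n)               # i_k uniform on {1, ..., n}
    alpha_k = 1.0 / (k + 10)          # vanishing steps: sum = inf, sum of squares < inf
    theta = theta - alpha_k * grad_l_i(theta, i)

print(theta)   # approaches the unique critical point of F = (1/n) sum_i l_i
```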


The ODE method

Averaging out the noise: vanishing step sizes, ∑_{k∈N} α_k = +∞, ∑_{k∈N} α_k² < +∞.

Differentiable F (Ljung 1977 [L1977]): the sequence (θ_k)_{k∈N} behaves in the limit as solutions to the differential equation

θ̇ = −∇F(θ).

Developments: Benaïm [B1996], Kushner and Yin [KY1997] . . .

Nonsmooth case: Benaïm, Hofbauer, Sorin (2005) [BHS2005]. Differential inclusion

θ̇ ∈ −∂F(θ).


Invariant sets

Definition (Invariant sets)

A set A is invariant if for any θ_0 ∈ A, there is a solution to θ̇ = −∇F(θ) with θ(0) = θ_0 and θ(R) ⊂ A.

A set A is chain transitive if for any T > 0 and ε > 0, and for any θ, µ ∈ A, there exist N ∈ N, solutions θ_i, i = 1, . . . , N, of θ̇ = −∇F(θ), and times t_i ≥ T such that

• θ_i(t) ∈ A for all 0 ≤ t ≤ t_i, i = 1, . . . , N;
• for all i = 1, . . . , N−1, ‖θ_i(t_i) − θ_{i+1}(0)‖ ≤ ε;
• ‖θ_1(0) − θ‖ ≤ ε and ‖θ_N(t_N) − µ‖ ≤ ε (draw a picture).

From [B1999], a chain transitive set is invariant.


A result of Benaïm

θ_{k+1} | past = θ_k − α_k (∇F(θ_k) + M_{k+1})   (SG)

E[M_{k+1} | past] = 0

α_k > 0

Square-summable increments: there exists M ≥ 0 such that

sup_{k∈N} E[ ‖M_k‖² | past ] ≤ M.

Vanishing step sizes: ∑_{k∈N} α_k = +∞, ∑_{k∈N} α_k² < +∞. This implies that ∑_{i=0}^k α_i M_{i+1} converges almost surely.

Theorem (Benaïm 1996 [B1996])

Conditionally on (θ_k)_{k∈N} being bounded, the set of accumulation points is almost surely compact, connected, chain transitive and invariant.


Corollary for deep learning

Stochastic gradient: (i_k)_{k∈N} i.i.d. random variables, uniform on {1, . . . , n}.

θ_{k+1} | θ_k = θ_k − α_k ∇l_{i_k}(θ_k)   (SG)

α_k > 0

Conditionally on (θ_k)_{k∈N} being bounded, almost surely F(θ_k) converges and any accumulation point θ̄ satisfies ∇F(θ̄) = 0.


Proof sketch

Lemma

If F is tame, any compact chain transitive set A satisfies

F is constant on A

∇F(θ) = 0 for all θ ∈ A.


Proof sketch

For any solution of θ̇ = −∇F(θ) and any t ∈ R we have

(d/dt) F(θ(t)) = ∇F(θ(t))^T θ̇(t) = −‖∇F(θ(t))‖²₂.

Set L_− = min_{θ∈A} F(θ) and L_+ = max_{θ∈A} F(θ). Both are critical values since A is invariant. Let us prove that L_− = L_+; set L_− = 0. F is γ-Lipschitz on A.

Morse-Sard theorem: there is δ > 0 such that (0, 2δ] contains no critical value of F. Choose ε = δ/γ: for x, y ∈ A, ‖x − y‖ ≤ ε implies |F(x) − F(y)| ≤ δ. Set

∆ = min_{θ∈A, δ≤F(θ)≤2δ} ‖∇F(θ)‖₂ > 0,

and put T = δ/∆².

For any θ_0 ∈ A such that F(θ_0) ≤ 2δ, any solution of θ̇ = −∇F(θ) starting at θ_0 satisfies F(θ(t)) ≤ δ for any t ≥ T.

For any θ ∈ A with F(θ) = 0 and any µ ∈ A, by chain transitivity, F(µ) ≤ 2δ.

Taking δ arbitrarily small, we get F(µ) = 0 for all µ ∈ A, and L_+ = 0.


Plan

1 Introduction

2 Convergence to local minima for Morse-Functions

3 On the structure of deep learning training loss

4 Convergence to critical points for tame functions

5 Approaching critical point with noise

6 Extensions to nonsmooth settings


Stochastic subgradient

θ = (w, b),  l_i(θ) = L(f_{w,b}(x_i), y_i),  i = 1, . . . , n.

F : R^p → R
θ ↦ (1/n) ∑_{i=1}^n l_i(θ)   (P)

Stochastic subgradient descent. Let (M_k)_{k∈N} be a martingale difference sequence.

θ_{k+1} | past = θ_k − α_k (v + M_{k+1})   (SG)

E[M_{k+1} | past] = 0

v ∈ ∂F(θ_k)

α_k > 0

∂ is a suitable generalization of the gradient.


Subgradients: F : R^p → R Lipschitz continuous

Convex: only for F convex, global lower tangent.

∂_conv F(x) = { v ∈ R^p : F(y) ≥ F(x) + v^T(y − x), ∀y ∈ R^p }.

Fréchet: local lower tangent.

∂_Fréchet F(x) = { v ∈ R^p : liminf_{y→x, y≠x} [F(y) − F(x) − v^T(y − x)] / ‖y − x‖ ≥ 0 }.

Limiting: sequential closure.

∂_lim F(x) = { v ∈ R^p : ∃(y_k, v_k)_{k∈N}, y_k → x, v_k → v, v_k ∈ ∂_Fréchet F(y_k), k ∈ N }.

Clarke: convex closure.

∂_Clarke F(x) = conv(∂_lim F(x)).
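A companion computation (my own example) for the convex function F(x) = |x| at x = 0, where all four notions coincide:

```latex
% For F(x) = |x| on \mathbb{R}, at x = 0:
\begin{align*}
  \partial_{\mathrm{conv}} F(0)
  = \partial_{\mathrm{Frechet}} F(0)
  = \partial_{\mathrm{lim}} F(0)
  = \partial_{\mathrm{Clarke}} F(0)
  = [-1, 1],
\end{align*}
% to be contrasted with the nonconvex example F(x) = -|x| on the next slide,
% where the four sets differ.
```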


Subgradients: F : R^p → R Lipschitz continuous

Example: F : x ↦ −|x|.

∂_conv F(0) = ∅

∂_Fréchet F(0) = ∅

∂_lim F(0) = {−1, 1},  ∂_Clarke F(0) = [−1, 1].

0 is a local maximum; it is critical only for the most general notion of subgradient which we have seen . . .

∂_Fréchet F(x) ⊂ ∂_lim F(x) ⊂ ∂_Clarke F(x) for all x.

Fermat rule: if x is a local minimum of F, then 0 ∈ ∂_Fréchet F(x).

We will work with Clarke subgradients.


Absolute continuity

Definition (Absolutely continuous map)

Let g : I → R^p be AC on an interval I. This means that for any ε > 0, there exists δ > 0 such that for any collection of pairwise disjoint subintervals ([u_k, v_k])_{k∈N} of I, we have

∑_{k∈N} |u_k − v_k| ≤ δ  ⇒  ∑_{k∈N} ‖g(u_k) − g(v_k)‖ ≤ ε.

Lipschitz functions are AC, and compositions of AC functions are AC.

Equivalently, there exist a Lebesgue integrable function y : I → R^p and a ∈ I such that, for all t ∈ I,

g(t) = g(a) + ∫_a^t y(s) ds.

Most importantly, AC functions are differentiable almost everywhere.


Differential inclusion solutions

Definition

Given θ_0 ∈ R^p, a solution of the problem

θ̇ ∈ −∂F(θ),  θ(0) = θ_0,

is any absolutely continuous map θ : R → R^p such that (d/dt) θ(t) ∈ −∂F(θ(t)) for almost all t and θ(0) = θ_0.

Example: absolute value.
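A sketch of the announced example, F(θ) = |θ| on R (the explicit formula is mine, easily checked against the definition above):

```latex
% \partial F(\theta) = \{\operatorname{sign}(\theta)\} for \theta \ne 0 and [-1, 1] at 0.
% An absolutely continuous solution of \dot{\theta} \in -\partial F(\theta), \theta(0) = \theta_0, is
\begin{align*}
  \theta(t) = \operatorname{sign}(\theta_0)\,\max\big(|\theta_0| - t,\ 0\big),
  \qquad t \ge 0,
\end{align*}
% which slides to 0 at unit speed and then stays at 0, where 0 \in -\partial F(0) = [-1, 1].
```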


Clarke subdifferential for locally Lipschitz functions

If F is locally Lipschitz, then

The graph of ∂F is closed:

graph(∂F) = { (x, y) ∈ R^p × R^p : y ∈ ∂F(x) }.

Equivalently, x_k → x and y_k → y with y_k ∈ ∂F(x_k) for all k ∈ N imply y ∈ ∂F(x). In particular, ∂F(x) is closed-valued.

∂F(x) is nonempty, convex and compact.

Slightly stronger assumption (linear growth): there is C > 0 such that for all x ∈ R^p, sup_{z∈∂F(x)} ‖z‖ ≤ C(1 + ‖x‖).

Existence of solutions: these hypotheses suffice to ensure that the differential inclusion problem has (typically non-unique) solutions (Aubin and Cellina [AC1984], Clarke et al. [CLSW1998]).


Invariant sets

Definition (Invariant sets)

A set A is invariant if for any θ_0 ∈ A, there is a solution to θ̇ ∈ −∂F(θ) with θ(0) = θ_0 and θ(R) ⊂ A.

A set A is chain transitive if for any T > 0 and ε > 0, and for any θ, µ ∈ A, there exist N ∈ N, solutions θ_i, i = 1, . . . , N, of θ̇ ∈ −∂F(θ), and times t_i ≥ T such that

• θ_i(t) ∈ A for all 0 ≤ t ≤ t_i, i = 1, . . . , N;
• for all i = 1, . . . , N−1, ‖θ_i(t_i) − θ_{i+1}(0)‖ ≤ ε;
• ‖θ_1(0) − θ‖ ≤ ε and ‖θ_N(t_N) − µ‖ ≤ ε (draw a picture).

From [BHS2005], a bounded chain transitive set is invariant.


Stochastic approximation and differential inclusions

θ_{k+1} | past = θ_k − α_k (v + M_{k+1})   (SG)

E[M_{k+1} | past] = 0

v ∈ ∂F(θ_k)

α_k > 0

Square-summable increments: there exists M ≥ 0 such that

sup_{k∈N} E[ ‖M_k‖² | past ] ≤ M.

Vanishing step sizes: ∑_{k∈N} α_k = +∞, ∑_{k∈N} α_k² < +∞. This implies that ∑_{i=0}^k α_i M_{i+1} converges almost surely.

Theorem (Benaïm, Hofbauer, Sorin [BHS2005])

Conditionally on (θ_k)_{k∈N} being bounded, the set of accumulation points is almost surely compact, connected, chain transitive and invariant.


Chain rule

Lemma (Convex chain rule)

If F is convex and θ : R → R^p is a solution of θ̇ ∈ −∂F(θ), then for almost all t ∈ R_+,

(d/dt) (F∘θ)(t) = ⟨∂F(θ(t)), θ̇(t)⟩ = −dist(0, ∂F(θ(t)))².

Example: ℓ1 norm. See Brézis 1973 [B1973].


Chain rule

F is locally Lipschitz and θ is AC, so the composition F∘θ is also AC, and we may choose t_0 such that both θ and F∘θ are differentiable at t_0. Let θ_0 = θ(t_0) and θ̇_0 = θ̇(t_0); we have

θ(t_0 + h) = θ_0 + h θ̇_0 + o(h)

θ(t_0 − h) = θ_0 − h θ̇_0 + o(h)

For any v ∈ ∂F(θ_0), it holds that ⟨v, y − θ_0⟩ ≤ F(y) − F(θ_0) for all y ∈ R^p. Now, imposing h > 0, we have

⟨v, θ(t_0 + h) − θ_0⟩ / h = ⟨v, θ̇_0⟩ + o(1) ≤ [F(θ(t_0 + h)) − F(θ_0)] / h  →_{h→0}  (d/dt)(F∘θ)(t_0).

On the other hand, still considering h positive,

⟨v, θ(t_0 − h) − θ_0⟩ / (−h) = ⟨v, θ̇_0⟩ + o(1) ≥ [F(θ(t_0 − h)) − F(θ_0)] / (−h)  →_{h→0}  (d/dt)(F∘θ)(t_0).

This proves the first identity. Moreover −θ̇_0 ∈ ∂F(θ_0) and θ̇_0 is "orthogonal" to ∂F(θ_0) (the value ⟨v, θ̇_0⟩ does not depend on v ∈ ∂F(θ_0)), so that −θ̇_0 is the minimum-norm element.


Corollary for deep learning

Stochastic subgradient: (i_k)_{k∈N} i.i.d. random variables, uniform on {1, . . . , n}.

θ_{k+1} | past = θ_k − α_k (v + M_{k+1})   (SG)

E[M_{k+1} | past] = 0

v ∈ ∂F(θ_k)

α_k > 0

Conditionally on (θ_k)_{k∈N} being bounded, almost surely F(θ_k) converges and any accumulation point θ̄ satisfies 0 ∈ ∂F(θ̄).

Proof arguments: same idea as in the smooth case.

Accumulation points form chain transitive invariant sets [BHS2005].

If F is tame, we have a nonsmooth Morse-Sard theorem [BDLS2007].

If F is tame, F satisfies a chain rule [DDKL2018], using the projection formula of [BDLS2007] for stratifiable functions.


P.-A. Absil, R. Mahony and B. Andrews (2005). Convergence of the iterates of descent methods for analytic cost functions. SIAM Journal on Optimization, 16(2), 531–547.

H. Attouch, J. Bolte and B.F. Svaiter (2013). Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods. Mathematical Programming, 137(1-2), 91–129.

J.-P. Aubin and A. Cellina (1984). Differential inclusions: set-valued maps and viability theory. Springer Science & Business Media.

M. Benaïm (1996). A dynamical system approach to stochastic approximations. SIAM Journal on Control and Optimization, 34(2), 437–472.

M. Benaïm (1999). Dynamics of stochastic approximation algorithms. Séminaire de probabilités XXXIII (pp. 1–68). Springer, Berlin, Heidelberg.


M. Benaïm, J. Hofbauer and S. Sorin (2005). Stochastic approximations and differential inclusions. SIAM Journal on Control and Optimization, 44(1), 328–348.

J. Bolte, A. Daniilidis, A. Lewis and M. Shiota (2007). Clarke subgradients of stratifiable functions. SIAM Journal on Optimization, 18(2), 556–572.

H. Attouch and J. Bolte (2009). On the convergence of the proximal algorithm for nonsmooth functions involving analytic features. Mathematical Programming, 116(1-2), 5–16.

J. Bolte, S. Sabach and M. Teboulle (2014). Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming, 146(1-2), 459–494.

H. Brézis (1973). Opérateurs maximaux monotones et semi-groupes de contractions dans les espaces de Hilbert (Vol. 5). Elsevier.


F.H. Clarke, Y.S. Ledyaev, R.J. Stern and P.R. Wolenski (1998). Nonsmooth analysis and control theory. Springer Science & Business Media.

M. Coste (2000). An introduction to o-minimal geometry. Pisa: Istituti editoriali e poligrafici internazionali.

M. Coste (2002). An introduction to semialgebraic geometry. Pisa: Istituti editoriali e poligrafici internazionali.

D. Davis, D. Drusvyatskiy, S. Kakade and J.D. Lee (2018). Stochastic subgradient method converges on tame functions. arXiv preprint arXiv:1804.07795.

L. van den Dries and C. Miller (1996). Geometric categories and o-minimal structures. Duke Mathematical Journal, 84(2), 497–540.

L. van den Dries (1998). Tame topology and o-minimal structures (Vol. 248). Cambridge University Press.


K. Kurdyka (1998). On gradients of functions definable in o-minimal structures. Annales de l'institut Fourier, 48(3), 769–784.

H. Kushner and G.G. Yin (1997). Stochastic approximation and recursive algorithms and applications. Springer Science & Business Media.

J.D. Lee, M. Simchowitz, M.I. Jordan and B. Recht (2016). Gradient descent only converges to minimizers. Conference on Learning Theory (pp. 1246–1257).

L. Ljung (1977). Analysis of recursive stochastic algorithms. IEEE Transactions on Automatic Control, 22(4), 551–575.

S. Łojasiewicz (1963). Une propriété topologique des sous-ensembles analytiques réels. Les équations aux dérivées partielles, 117, 87–89.

H. Robbins and S. Monro (1951). A Stochastic Approximation Method. The Annals of Mathematical Statistics, 22(3), 400–407.


M. Shiota (1995). Geometry of subanalytic and semialgebraic sets. Springer Science & Business Media.

M. Shub (1987). Global stability of dynamical systems. Springer Science & Business Media.
