An introduction to optimization on manifolds


August 30, 2021

Geometric methods in optimization and sampling, boot camp at the Simons Institute

Nicolas Boumal – OPTIM, Institute of Mathematics, EPFL


nicolasboumal.net/book

Step 0 in optimization

It all starts with a set $S$ and a function $f\colon S \to \mathbb{R}$:

$$\min_{x \in S} f(x)$$

These bare objects fully specify the problem.

Any additional structure on $S$ and $f$ may (and should) be exploited for algorithmic purposes but is not part of the problem.


Classical unconstrained optimization

The search space is a linear space, e.g., $S = \mathbb{R}^n$:

$$\min_{x \in \mathbb{R}^n} f(x)$$

We can choose to turn $\mathbb{R}^n$ into a Euclidean space: $\langle u, v\rangle = u^\top v$.

If $f$ is differentiable, this provides gradients $\nabla f$ and Hessians $\nabla^2 f$. These objects underpin algorithms: gradient descent, Newton's method, ...

$$\langle \nabla f(x), v\rangle = \mathrm{D}f(x)[v] = \lim_{t \to 0} \frac{f(x + tv) - f(x)}{t}$$

$$\nabla^2 f(x)[v] = \mathrm{D}(\nabla f)(x)[v] = \lim_{t \to 0} \frac{\nabla f(x + tv) - \nabla f(x)}{t}$$

Extend to optimization on manifolds

The search space is a smooth manifold, $S = \mathcal{M}$:

$$\min_{x \in \mathcal{M}} f(x)$$

We can choose to turn $\mathcal{M}$ into a Riemannian manifold.

If $f$ is differentiable, this provides Riemannian gradients and Hessians. These objects underpin algorithms: gradient descent, Newton's method, ...

Around since the 70s; practical since the 90s.

What is a manifold? Take one:


[Figure: three example sets, labeled Yes, Yes, and No.]

What is a manifold? Take two:

A few manifolds that come up in the wild


Pictures: https://www.popsci.com/new-roomba-knows-location and http://emanual.robotis.com/docs/en/platform/turtlebot3/slam

Orthonormal frames and rotations

Stiefel manifold: $\mathcal{M} = \{X \in \mathbb{R}^{n \times p} : X^\top X = I_p\}$

Rotation group: $\mathcal{M} = \{X \in \mathbb{R}^{3 \times 3} : X^\top X = I_3 \text{ and } \det(X) = +1\}$

Applications in sparse PCA, Structure-from-Motion, SLAM (robotics)...

The singularities of Euler angles (gimbal lock) are artificial: the rotation group is smooth.
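As a concrete illustration (a minimal Matlab sketch of my own, not from the slides): a standard way to map an arbitrary 3-by-3 matrix to the nearest rotation matrix combines an SVD with a determinant correction, avoiding Euler angles altogether.

M = randn(3);                           % arbitrary 3x3 matrix
[U, ~, V] = svd(M);                     % M = U*S*V'
R = U * diag([1, 1, det(U*V')]) * V';   % nearest rotation: R'*R = I and det(R) = +1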

Subspaces and fixed-rank matrices

Grassmann manifold: $\mathcal{M} = \{\text{subspaces of dimension } d \text{ in } \mathbb{R}^n\}$

Fixed-rank matrices: $\mathcal{M} = \{X \in \mathbb{R}^{m \times n} : \operatorname{rank}(X) = r\}$

Applications to linear dimensionality reduction, data completion and denoising, large-scale matrix equations, ...

Optimization allows us to go beyond PCA (least-squares loss ≡ truncated SVD): we can handle outlier-robust loss functions and missing data.

Picture: https://365datascience.com/tutorials/python-tutorials/principal-components-analysis/

Positive matrices and hyperbolic space

Positive definite matrices: $\mathcal{M} = \{X \in \mathbb{R}^{n \times n} : X = X^\top \text{ and } X \succ 0\}$

Hyperbolic space: $\mathcal{M} = \{x \in \mathbb{R}^{n+1} : x_0^2 = 1 + x_1^2 + \cdots + x_n^2\}$

Used in metric learning, Gaussian mixture models, tree-like embeddings...

With appropriate metrics, these are Cartan-Hadamard manifolds: complete, simply connected, with non-positive (intrinsic) curvature. A great playground for geodesic convexity.

Picture: https://bjlkeng.github.io/posts/hyperbolic-geometry-and-poincare-embeddings

A tour of technical tools (restricted to embedded submanifolds)

• What is a manifold?
• Tangent spaces
• Smooth maps
• Differentials
• Retractions
• Riemannian manifolds
• Gradients
• Hessians


What is a manifold? Take three:

A subset $\mathcal{M}$ of a linear space $\mathcal{E}$ of dimension $d$ is a smooth embedded submanifold of dimension $n$ if:

For all $x \in \mathcal{M}$, there exists a neighborhood $U$ of $x$ in $\mathcal{E}$, an open set $V \subseteq \mathbb{R}^d$ and a diffeomorphism $\psi\colon U \to V$ such that $\psi(U \cap \mathcal{M}) = V \cap E$, where $E$ is a linear subspace of dimension $n$.

We call $\mathcal{E}$ the embedding space.

[Figure: a chart $\psi$ maps a neighborhood $U$ of $x$ in $\mathcal{E} \equiv \mathbb{R}^d$ onto $\psi(U) = V \subseteq \mathbb{R}^d$, flattening $U \cap \mathcal{M}$ onto a linear subspace.]

What is a manifold? Quick facts:

Matrix sets in our list are manifolds: orthonormal, fixed-rank, positive definite, ...

Linear subspaces are manifolds.

Open subsets of manifolds are manifolds.

Products of manifolds are manifolds.

[Figure: a manifold $\mathcal{M}$ with the tangent space $\mathrm{T}_x\mathcal{M}$ at a point $x$.]

Tangent vectors of $\mathcal{M}$ embedded in $\mathcal{E}$

A tangent vector at $x$ is the velocity
$$c'(0) = \lim_{t \to 0} \frac{c(t) - c(0)}{t}$$
of a curve $c\colon \mathbb{R} \to \mathcal{M}$ with $c(0) = x$.

The tangent space $\mathrm{T}_x\mathcal{M}$ is the set of all tangent vectors of $\mathcal{M}$ at $x$. It is a linear subspace of $\mathcal{E}$ of the same dimension as $\mathcal{M}$.

If $\mathcal{M} = \{x : h(x) = 0\}$ with $h\colon \mathcal{E} \to \mathbb{R}^k$ smooth and $\operatorname{rank} \mathrm{D}h(x) = k$, then $\mathrm{T}_x\mathcal{M} = \ker \mathrm{D}h(x)$.

Example (sphere): $h(x) = x^\top x - 1 = 0$, and $\ker \mathrm{D}h(x) = \{v : x^\top v = 0\}$.

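To make the velocity definition concrete on the sphere, here is a minimal Matlab sketch (my own, not part of the slides): for a tangent direction v at x, the curve c(t) = (x + t v)/||x + t v|| stays on the sphere, and a finite difference at t = 0 recovers v.

n = 4;
x = randn(n, 1); x = x / norm(x);      % point on the sphere
v = randn(n, 1); v = v - (x'*v)*x;     % tangent direction: x'*v = 0
c = @(t) (x + t*v) / norm(x + t*v);    % curve on the sphere with c(0) = x
t = 1e-6;
fprintf('||(c(t) - c(0))/t - v|| = %g\n', norm((c(t) - c(0))/t - v));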

Smooth maps on/to manifolds

Let $\mathcal{M}, \mathcal{M}'$ be (smooth, embedded) submanifolds of linear spaces $\mathcal{E}, \mathcal{E}'$.

A map $F\colon \mathcal{M} \to \mathcal{M}'$ is smooth if it has a smooth extension, i.e., if there exists a neighborhood $U$ of $\mathcal{M}$ in $\mathcal{E}$ and a smooth map $\bar{F}\colon U \to \mathcal{E}'$ such that $F = \bar{F}|_{\mathcal{M}}$.

Example: a cost function $f\colon \mathcal{M} \to \mathbb{R}$ is smooth if it is the restriction of a smooth $\bar{f}\colon U \to \mathbb{R}$.

Composition preserves smoothness.

[Figure: $\bar{f}$ is defined on a neighborhood $U$ of $\mathcal{M}$ in $\mathcal{E}$, with values in $\mathbb{R}$, and $f = \bar{f}|_{\mathcal{M}}$.]

Differential of a smooth map $F\colon \mathcal{M} \to \mathcal{M}'$

The differential of $F$ at $x$ is the map $\mathrm{D}F(x)\colon \mathrm{T}_x\mathcal{M} \to \mathrm{T}_{F(x)}\mathcal{M}'$ defined by:

$$\mathrm{D}F(x)[v] = (F \circ c)'(0) = \lim_{t \to 0} \frac{F(c(t)) - F(x)}{t},$$

where $c\colon \mathbb{R} \to \mathcal{M}$ satisfies $c(0) = x$ and $c'(0) = v$.

Claim: $\mathrm{D}F(x)$ is well defined and linear, and we have a chain rule. If $\bar{F}$ is a smooth extension of $F$, then $\mathrm{D}F(x) = \mathrm{D}\bar{F}(x)|_{\mathrm{T}_x\mathcal{M}}$.
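As an illustration (a small Matlab sketch with an example map of my own choosing, not from the slides): take the sphere and F(x) = x x', mapping into symmetric matrices; its differential is DF(x)[v] = x v' + v x', which a finite difference along a curve through x with velocity v confirms.

n = 4;
x = randn(n, 1); x = x / norm(x);          % point on the sphere
v = randn(n, 1); v = v - (x'*v)*x;         % tangent vector at x
F = @(y) y*y';                             % smooth map from the sphere to symmetric matrices
c = @(t) (x + t*v) / norm(x + t*v);        % curve with c(0) = x and c'(0) = v
t = 1e-6;
fd = (F(c(t)) - F(x)) / t;                 % finite-difference approximation of DF(x)[v]
fprintf('||fd - (x*v'' + v*x'')||_F = %g\n', norm(fd - (x*v' + v*x'), 'fro'));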


Retractions: moving around on $\mathcal{M}$

The tangent bundle is the set
$$\mathrm{T}\mathcal{M} = \{(x, v) : x \in \mathcal{M} \text{ and } v \in \mathrm{T}_x\mathcal{M}\}.$$
Claim: $\mathrm{T}\mathcal{M}$ is a smooth manifold embedded in $\mathcal{E} \times \mathcal{E}$.

A retraction is a smooth map $R\colon \mathrm{T}\mathcal{M} \to \mathcal{M} : (x, v) \mapsto R_x(v)$ such that each curve $c(t) = R_x(tv)$ satisfies $c(0) = x$ and $c'(0) = v$.

E.g., metric projection: $R_x(v)$ is the projection of $x + v$ to $\mathcal{M}$.
$\mathcal{M} = \mathbb{R}^n$: $R_x(v) = x + v$;
$\mathcal{M} = \{x : \|x\| = 1\}$: $R_x(v) = \frac{x + v}{\|x + v\|}$;
$\mathcal{M} = \{X : \operatorname{rank}(X) = r\}$: $R_X(V) = \mathrm{SVD}_r(X + V)$.

[Figure: a point $x$, a tangent vector $v$, and the retraction of $x + v$ back onto $\mathcal{M}$.]

Riemannian manifolds

Each tangent space $\mathrm{T}_x\mathcal{M}$ is a linear space. Endow each one with an inner product: $\langle u, v\rangle_x$ for $u, v \in \mathrm{T}_x\mathcal{M}$.

A vector field is a map $V\colon \mathcal{M} \to \mathrm{T}\mathcal{M}$ such that $V(x)$ is tangent at $x$ for all $x$. We say the inner products $\langle \cdot, \cdot\rangle_x$ vary smoothly with $x$ if $x \mapsto \langle U(x), V(x)\rangle_x$ is smooth for all smooth vector fields $U, V$.

If the inner products vary smoothly with $x$, they form a Riemannian metric.

A Riemannian manifold is a smooth manifold with a Riemannian metric.


Riemannian structure and optimization

A Riemannian manifold is a smooth manifold with a smoothly varying choice of inner product on each tangent space.

A manifold can be endowed with many different Riemannian structures.

A problem $\min_{x \in \mathcal{M}} f(x)$ is defined independently of any Riemannian structure.

We choose a metric for algorithmic purposes. Akin to preconditioning.


Riemannian submanifolds

Let the embedding space of $\mathcal{M}$ be a Euclidean space $\mathcal{E}$ with metric $\langle \cdot, \cdot\rangle$. For example: $\mathcal{E} = \mathbb{R}^n$ and $\langle u, v\rangle = u^\top v$ for all $u, v \in \mathbb{R}^n$.

A convenient choice of Riemannian structure for $\mathcal{M}$ is to let:

$$\langle u, v\rangle_x = \langle u, v\rangle.$$

This is well defined because $u, v \in \mathrm{T}_x\mathcal{M}$ are, in particular, elements of $\mathcal{E}$.

This is a Riemannian metric. With it, $\mathcal{M}$ is a Riemannian submanifold of $\mathcal{E}$.

!! A Riemannian submanifold is not just a submanifold that is Riemannian !!

Riemannian gradients

The Riemannian gradient of a smooth $f\colon \mathcal{M} \to \mathbb{R}$ is the vector field $\operatorname{grad} f$ defined by:

$$\forall (x, v) \in \mathrm{T}\mathcal{M}, \quad \langle \operatorname{grad} f(x), v\rangle_x = \mathrm{D}f(x)[v].$$

Claim: $\operatorname{grad} f$ is a well-defined smooth vector field.

If $\mathcal{M}$ is a Riemannian submanifold of a Euclidean space $\mathcal{E}$, then

$$\operatorname{grad} f(x) = \mathrm{Proj}_x\big(\nabla \bar{f}(x)\big),$$

where $\mathrm{Proj}_x$ is the orthogonal projector from $\mathcal{E}$ to $\mathrm{T}_x\mathcal{M}$ and $\bar{f}$ is a smooth extension of $f$.


Riemannian Hessians

The Riemannian Hessian of $f$ at $x$ should be a symmetric linear map $\operatorname{Hess} f(x)\colon \mathrm{T}_x\mathcal{M} \to \mathrm{T}_x\mathcal{M}$ describing gradient change.

Since $\operatorname{grad} f\colon \mathcal{M} \to \mathrm{T}\mathcal{M}$ is a smooth map from one manifold to another, a natural first attempt is:

$$\operatorname{Hess} f(x)[v] \overset{?}{=} \mathrm{D}\operatorname{grad} f(x)[v].$$

However, this does not produce tangent vectors in general. To overcome this issue, we need a new derivative for vector fields: a Riemannian connection.

If $\mathcal{M}$ is a Riemannian submanifold of a Euclidean space, then:

$$\operatorname{Hess} f(x)[v] = \mathrm{Proj}_x\big(\mathrm{D}\operatorname{grad} f(x)[v]\big) = \mathrm{Proj}_x\big(\nabla^2 \bar{f}(x)[v]\big) + W\big(v, \mathrm{Proj}_x^{\perp} \nabla \bar{f}(x)\big),$$
where $W$ is the Weingarten map of $\mathcal{M}$.

Example: Rayleigh quotient optimization

Compute the smallest eigenvalue of a symmetric matrix $A \in \mathbb{R}^{n \times n}$:
$$\min_{x \in \mathcal{M}} \frac{1}{2} x^\top A x \quad \text{with} \quad \mathcal{M} = \{x \in \mathbb{R}^n : x^\top x = 1\}$$

The cost function $f\colon \mathcal{M} \to \mathbb{R}$ is the restriction to $\mathcal{M}$ of the smooth function $\bar{f}(x) = \frac{1}{2} x^\top A x$ on $\mathbb{R}^n$.
Tangent spaces: $\mathrm{T}_x\mathcal{M} = \{v \in \mathbb{R}^n : x^\top v = 0\}$.
Make $\mathcal{M}$ into a Riemannian submanifold of $\mathbb{R}^n$ with $\langle u, v\rangle = u^\top v$.
Projection to $\mathrm{T}_x\mathcal{M}$: $\mathrm{Proj}_x(z) = z - (x^\top z)\, x$.
Gradient of $\bar{f}$: $\nabla \bar{f}(x) = Ax$.
Gradient of $f$: $\operatorname{grad} f(x) = \mathrm{Proj}_x\big(\nabla \bar{f}(x)\big) = Ax - (x^\top A x)\, x$.
Differential of $\operatorname{grad} f$: $\mathrm{D}\operatorname{grad} f(x)[v] = Av - (v^\top A x + x^\top A v)\, x - (x^\top A x)\, v$.
Hessian of $f$: $\operatorname{Hess} f(x)[v] = \mathrm{Proj}_x\big(\mathrm{D}\operatorname{grad} f(x)[v]\big) = \mathrm{Proj}_x\big(Av - (x^\top A x)\, v\big)$.

The following are equivalent for $x \in \mathcal{M}$: $x$ is a global minimizer; $x$ is a unit-norm eigenvector of $A$ for the least eigenvalue; $\operatorname{grad} f(x) = 0$ and $\operatorname{Hess} f(x) \succeq 0$.
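Here is a minimal Matlab sketch of these formulas (my own, not part of the slides), checking them numerically at an eigenvector of A:

n = 50;
A = randn(n); A = (A + A')/2;                  % symmetric matrix
proj  = @(x, z) z - (x'*z)*x;                  % orthogonal projector onto T_x M
gradf = @(x) A*x - (x'*A*x)*x;                 % Riemannian gradient
hessf = @(x, v) proj(x, A*v - (x'*A*x)*v);     % Riemannian Hessian applied to v
[Q, D] = eig(A);
[~, i] = min(diag(D));
x = Q(:, i);                                   % unit-norm eigenvector for the least eigenvalue
fprintf('||grad f(x)|| at a global minimizer: %g\n', norm(gradf(x)));
v = proj(x, randn(n, 1));                      % a random tangent vector at x
fprintf('<Hess f(x)[v], v> = %g (nonnegative at a minimizer)\n', v' * hessf(x, v));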

Basic optimization algorithms

Algorithms hop around the manifold using a retraction:
$$x_{k+1} = R_{x_k}(s_k)$$
with some algorithm-specific tangent vector $s_k \in \mathrm{T}_{x_k}\mathcal{M}$.
E.g., gradient descent: $s_k = -t_k \operatorname{grad} f(x_k)$;
Newton's method: $\operatorname{Hess} f(x_k)[s_k] = -\operatorname{grad} f(x_k)$.

Convergence analyses rely on Taylor expansions of $f$ along retractions. For second-order retractions (e.g., metric projection on a Riemannian submanifold):
$$f(R_x(s)) = f(x) + \langle \operatorname{grad} f(x), s\rangle_x + \frac{1}{2}\langle \operatorname{Hess} f(x)[s], s\rangle_x + O(\|s\|_x^3)$$


Gradient descent on $\mathcal{M}$

A1: $f(x) \geq f_{\mathrm{low}}$ for all $x \in \mathcal{M}$.
A2: $f(R_x(s)) \leq f(x) + \langle s, \operatorname{grad} f(x)\rangle_x + \frac{L}{2}\|s\|_x^2$.

Algorithm: $x_{k+1} = R_{x_k}\big(-\tfrac{1}{L}\operatorname{grad} f(x_k)\big)$.

Complexity: $\min_{k<K} \|\operatorname{grad} f(x_k)\|_{x_k} \leq \sqrt{\frac{2L\,(f(x_0) - f_{\mathrm{low}})}{K}}$ (same as the Euclidean case).

Proof: A2 $\Rightarrow$ $f(x_{k+1}) \leq f(x_k) - \frac{1}{L}\|\operatorname{grad} f(x_k)\|_{x_k}^2 + \frac{1}{2L}\|\operatorname{grad} f(x_k)\|_{x_k}^2$, hence $f(x_k) - f(x_{k+1}) \geq \frac{1}{2L}\|\operatorname{grad} f(x_k)\|_{x_k}^2$.
A1 $\Rightarrow$ $f(x_0) - f_{\mathrm{low}} \geq f(x_0) - f(x_K) = \sum_{k=0}^{K-1} \big(f(x_k) - f(x_{k+1})\big) \geq \frac{K}{2L} \min_{k<K} \|\operatorname{grad} f(x_k)\|_{x_k}^2$.

[Figure: from $x$, the tangent vector $s$ leads to the retracted point $R_x(s)$.]
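A minimal Matlab sketch of this scheme for the Rayleigh quotient example above (my own, not from the slides; taking L = norm(A) as a rough stand-in for the constant in A2 is an assumption):

n = 100;
A = randn(n); A = (A + A')/2;                 % symmetric matrix
gradf = @(x) A*x - (x'*A*x)*x;                % Riemannian gradient of f(x) = 0.5*x'*A*x on the sphere
retr  = @(x, s) (x + s) / norm(x + s);        % metric-projection retraction
L = norm(A);                                  % rough stand-in for the constant in A2
x = randn(n, 1); x = x / norm(x);             % initial point on the sphere
for k = 1:2000
    x = retr(x, -(1/L) * gradf(x));           % x_{k+1} = R_{x_k}(-grad f(x_k)/L)
end
fprintf('0.5*x''*A*x = %g vs 0.5*min(eig(A)) = %g\n', 0.5*x'*A*x, 0.5*min(eig(A)));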


Manopt: user-friendly software

Manopt is a family of toolboxes for Riemannian optimization. Go to www.manopt.org for code, a tutorial, a forum, and a list of other software.

Matlab example for $\min_{\|x\| = 1} x^\top A x$:

problem.M = spherefactory(n);   % the unit sphere in R^n

problem.cost = @(x) x'*A*x;     % cost to minimize

problem.egrad = @(x) 2*A*x;     % Euclidean gradient; Manopt projects it to obtain the Riemannian gradient

x = trustregions(problem);      % Riemannian trust-region method
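For the snippet to run, n and A must be defined beforehand; a minimal setup (my own) would be:

n = 1000;
A = randn(n); A = .5*(A + A');   % random symmetric matrix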


Manopt (Matlab): with Bamdev Mishra, P.-A. Absil & R. Sepulchre

Pymanopt (Python): led by J. Townsend, N. Koep & S. Weichwald

Manopt.jl (Julia): led by Ronny Bergmann

Active research directions

• More algorithms: nonsmooth, stochastic, parallel, quasi-Newton, ...
• Constrained optimization on manifolds
• Applications, old and new (electronic structure, deep learning)
• Complexity (upper and lower bounds)
• Role of curvature
• Geodesic convexity
• Randomized algorithms
• Broader generalizations: manifolds with a boundary, algebraic varieties
• Benign non-convexity


"... in fact, the great watershed in optimization isn't between linearity and nonlinearity, but convexity and non-convexity."

R. T. Rockafellar, in SIAM Review, 1993

Non-convex just means not convex.


Non-convexity can be benign

This can mean various things. Theorem templates are on a spectrum:

"If {conditions}, necessary optimality conditions are sufficient."
⋮
"If {conditions}, we can initialize a specific algorithm well."

The conditions (often on data) may be generous (e.g., genericity) or less so (e.g., a high-probability event for a non-adversarial distribution).

Geometry and symmetry seem to play an outsized role. See for example Zhang, Qu & Wright, arXiv:2007.06753, for a review.


Recommended:

nicolasboumal.net/book
press.princeton.edu/absil
manopt.org