
Page 1: Geometry and Regularization in Nonconvex Quadratic Inverse Problems

Yuejie Chi

Department of Electrical and Computer Engineering

Princeton Workshop, May 2018

Page 2: Acknowledgements

Thanks to my collaborators:

Yuxin Chen (Princeton), Cong Ma (Princeton), Kaizheng Wang (Princeton), Yuanxin Li (CMU/OSU)

This research is supported by NSF, ONR and AFOSR.

Page 3: Nonconvex optimization in theory

There may be exponentially many local optima, e.g., in a single neuron model (Auer, Herbster, Warmuth '96).


Page 5: Nonconvex optimization in practice

Using simple algorithms such as gradient descent, e.g., "backpropagation" for training deep neural networks...

Page 6: Statistical context is important

Data/measurements follow certain statistical models and hence are not worst-case instances.

\[
\underset{x}{\text{minimize}}\quad f(x) = \frac{1}{m}\sum_{i=1}^{m} \ell(y_i; x) \;\overset{m\to\infty}{\Longrightarrow}\; \mathbb{E}\big[\ell(y; x)\big]
\]

empirical risk ≈ population risk (often nice!)

[Figure: contour plots over (θ₁, θ₂) ∈ [−3, 3]² of the empirical risk, with minimizer θ̂_n = [0.816, −0.268], and of the population risk, with minimizer θ₀ = [1, 0]. Figure credit: Mei, Bai and Montanari]


Page 9: Putting together...

statistical models ⟹ benign landscape ⟹ global convergence

Page 10: Computational efficiency?

statistical models ⟹ benign landscape ⟹ global convergence

But how fast?

Page 11: Overview and question we aim to address

[Diagram. Statistical: sample complexity — efficient vs. inefficient. Computational: critical points (saddle points, nonsmoothness make methods inefficient) and smoothness — efficient vs. inefficient. Regularized methods achieve both kinds of efficiency; unregularized methods: ?]

Can we simultaneously achieve statistical and computational efficiency using unregularized methods?


Page 16: Quadratic inverse problem

[Figure: the sampling operator A maps the signal X to measurements y, entrywise y_i = ‖a_i^⊤X‖₂².]

Recover X^♮ ∈ ℝ^{n×r} from m "random" quadratic measurements

\[
y_i = \big\| a_i^\top X^\natural \big\|_2^2, \qquad i = 1, \ldots, m.
\]

The rank-1 case is the famous phase retrieval problem.
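To make the measurement model concrete, here is a minimal NumPy sketch of generating the m quadratic measurements (the dimensions and variable names are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, m = 50, 3, 2000                      # dimension, rank, number of measurements

X_nat = rng.standard_normal((n, r))        # unknown planted signal X^natural
A = rng.standard_normal((m, n))            # rows are i.i.d. Gaussian vectors a_i

y = np.sum((A @ X_nat) ** 2, axis=1)       # y_i = ||a_i^T X^natural||_2^2

# r = 1 recovers the classical phase retrieval model y_i = (a_i^T x^natural)^2
```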

Page 17: Geometric interpretation

[Figure: sampling vectors a_i, a_j and the subspace of X^♮, with the observed magnitudes ‖a_i^⊤X^♮‖₂ and ‖a_j^⊤X^♮‖₂.]

If X^♮ is orthonormal, we lose a generalized notion of "phase" in ℝ^r:

\[
\frac{a_i^\top X^\natural}{\big\| a_i^\top X^\natural \big\|_2}.
\]


Page 19: Shallow neural network

[Figure: a one-hidden-layer network mapping input a through hidden-layer weights X^♮ to output y.]

Set X^♮ = [x₁, x₂, …, x_r]; then

\[
y = \sum_{i=1}^{r} \sigma\big( a^\top x_i \big).
\]

Page 20: Shallow neural network with quadratic activation

[Figure: the same one-hidden-layer network, now with quadratic activation σ(z) = z².]

Set X^♮ = [x₁, x₂, …, x_r]; with σ(z) = z²,

\[
y = \sum_{i=1}^{r} \sigma\big( a^\top x_i \big) = \sum_{i=1}^{r} \big( a^\top x_i \big)^2 = \big\| a^\top X^\natural \big\|_2^2 .
\]
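As a quick numerical sanity check of this identity, the following sketch (illustrative names) verifies that the quadratic-activation network output coincides with the quadratic measurement:

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 50, 3
X_nat = rng.standard_normal((n, r))        # columns x_1, ..., x_r (hidden weights)
a = rng.standard_normal(n)                 # input vector

y_net = np.sum((a @ X_nat) ** 2)           # sum_i sigma(a^T x_i) with sigma(z) = z^2
y_quad = np.linalg.norm(a @ X_nat) ** 2    # quadratic measurement ||a^T X^natural||_2^2
assert np.isclose(y_net, y_quad)           # the two coincide
```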

Page 21: A natural least squares formulation

Given: y_k = ‖a_k^⊤X^♮‖₂², 1 ≤ k ≤ m.

\[
\underset{X \in \mathbb{R}^{n\times r}}{\text{minimize}}\quad f(X) = \frac{1}{4m}\sum_{k=1}^{m}\left( \big\| a_k^\top X \big\|_2^2 - y_k \right)^2
\]

• Use r = 1 as a running example.


Page 23: Wirtinger flow (Candès, Li, Soltanolkotabi '14)

Empirical risk minimization:

\[
\underset{x}{\text{minimize}}\quad f(x) = \frac{1}{4m}\sum_{k=1}^{m}\left[ \big( a_k^\top x \big)^2 - y_k \right]^2
\]

• Initialization by the spectral method

• Gradient iterations: for t = 0, 1, …

\[
x_{t+1} = x_t - \eta\, \nabla f(x_t)
\]
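Below is a minimal sketch of WF for the rank-1 (phase retrieval) case under the Gaussian design used throughout the slides. The spectral initialization scaling and the constant step size are one reasonable instantiation, not the exact constants from the paper:

```python
import numpy as np

def wirtinger_flow(A, y, eta=0.1, iters=500):
    """Vanilla GD on f(x) = (1/4m) * sum_k [(a_k^T x)^2 - y_k]^2."""
    m, n = A.shape
    # Spectral initialization: top eigenvector of (1/m) sum_k y_k a_k a_k^T,
    # rescaled so ||x_0||^2 matches mean(y) ~ ||x^natural||_2^2 (Gaussian design).
    _, V = np.linalg.eigh((A.T * y) @ A / m)
    x = V[:, -1] * np.sqrt(y.mean())
    for _ in range(iters):
        z = A @ x
        x = x - eta * A.T @ ((z ** 2 - y) * z) / m   # gradient step
    return x

rng = np.random.default_rng(0)
n, m = 100, 1000
x_nat = rng.standard_normal(n)
A = rng.standard_normal((m, n))
y = (A @ x_nat) ** 2

x_hat = wirtinger_flow(A, y)
# Error up to the unavoidable global sign ambiguity
err = min(np.linalg.norm(x_hat - x_nat), np.linalg.norm(x_hat + x_nat))
print("relative error:", err / np.linalg.norm(x_nat))
```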


Page 25: Geometry of loss surface

[Figure: the nonconvex loss surface/contours for phase retrieval, marking the global minimizer x^♮, the saddle point, and the spectral initialization.]

Page 26: Gradient descent theory

f is said to be α-strongly convex and β-smooth if

\[
0 \prec \alpha I \preceq \nabla^2 f(x) \preceq \beta I, \qquad \forall x.
\]

ℓ₂ error contraction: GD with η = 1/β obeys

\[
\| x_{t+1} - x^\natural \|_2 \le \left( 1 - \frac{\alpha}{\beta} \right) \| x_t - x^\natural \|_2 .
\]

• The condition number β/α determines the rate of convergence.

• Attains ε-accuracy within O((β/α) log(1/ε)) iterations.
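This contraction can be checked numerically on a toy strongly convex quadratic; the sketch below (with assumed values α = 1, β = 25) verifies that each GD step with η = 1/β shrinks the ℓ₂ error by at least the factor 1 − α/β:

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha, beta = 20, 1.0, 25.0
# A quadratic whose Hessian H satisfies alpha I <= H <= beta I
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
H = Q @ np.diag(np.linspace(alpha, beta, n)) @ Q.T
x_nat = rng.standard_normal(n)

x, eta = np.zeros(n), 1.0 / beta
for t in range(5):
    before = np.linalg.norm(x - x_nat)
    x = x - eta * H @ (x - x_nat)              # GD step: grad f(x) = H (x - x_nat)
    # each step contracts the error by at least 1 - alpha/beta
    print(np.linalg.norm(x - x_nat) / before, "<=", 1 - alpha / beta)
```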


Page 28: What does this optimization theory say about WF?

Gaussian designs: a_k i.i.d. ∼ N(0, I_n), 1 ≤ k ≤ m.

Population level (infinite samples):

\[
\mathbb{E}\big[\nabla^2 f(x)\big] = 3\big( \|x\|_2^2\, I + 2\, x x^\top \big) - \big( \|x^\natural\|_2^2\, I + 2\, x^\natural x^{\natural\top} \big),
\]

which is locally positive definite and well-conditioned:

\[
I_n \preceq \mathbb{E}\big[\nabla^2 f(x)\big] \preceq 10\, I_n .
\]

Consequence: WF converges within O(log(1/ε)) iterations if m → ∞.

Finite-sample level (m ≍ n log n): ∇²f(x) ≻ 0 but ill-conditioned, with condition number ≍ n (even locally):

\[
\frac{1}{2}\, I_n \preceq \nabla^2 f(x) \preceq O(n)\, I_n .
\]

Consequence (Candès et al. '14): WF attains ε-accuracy within O(n log(1/ε)) iterations with η ≍ 1/n if m ≍ n log n.

Regularization helps, e.g., TWF (Chen and Candès '15), but is WF really that bad?


Page 34: Numerical efficiency with η_t = 0.1

[Figure: relative error vs. iteration count on a semilog scale, dropping from 10⁰ to 10⁻¹⁵ within about 500 iterations.]

Vanilla GD (WF) can proceed much more aggressively!

Page 35: Numerical efficiency with η_t = 0.1

[Same figure as the previous slide.]

Generic optimization theory is too pessimistic!

Page 36: A second look at gradient descent theory

Which region enjoys both strong convexity and smoothness?

\[
\nabla^2 f(x) = \frac{1}{m}\sum_{k=1}^{m}\left[ 3\big( a_k^\top x \big)^2 - \big( a_k^\top x^\natural \big)^2 \right] a_k a_k^\top
\]

• Not smooth if x and a_k are too close (coherent).

The Hessian is well-behaved when both of the following hold:

• x is not far away from x^♮;

• x is incoherent w.r.t. the sampling vectors (incoherence region): |a_k^⊤(x − x^♮)| ≲ √(log n) · ‖x − x^♮‖₂ for all k.

Then

\[
(1/2) \cdot I_n \preceq \nabla^2 f(x) \preceq O(\log n) \cdot I_n .
\]
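The dichotomy can be seen numerically: the sketch below evaluates the empirical Hessian at an incoherent (random-direction) perturbation of x^♮ and at a coherent perturbation aligned with a₁; the top eigenvalue is typically markedly larger in the coherent direction (all sizes here are illustrative):

```python
import numpy as np

def hessian(A, x, x_nat):
    """(1/m) sum_k [3 (a_k^T x)^2 - (a_k^T x_nat)^2] a_k a_k^T."""
    c = 3 * (A @ x) ** 2 - (A @ x_nat) ** 2
    return (A.T * c) @ A / A.shape[0]

rng = np.random.default_rng(0)
n = 200
m = int(n * np.log(n))                         # m on the order of n log n
A = rng.standard_normal((m, n))
x_nat = rng.standard_normal(n)
x_nat /= np.linalg.norm(x_nat)

delta = 0.5
u_inc = rng.standard_normal(n); u_inc /= np.linalg.norm(u_inc)  # random direction
u_coh = A[0] / np.linalg.norm(A[0])                             # aligned with a_1

for name, u in [("incoherent", u_inc), ("coherent", u_coh)]:
    w = np.linalg.eigvalsh(hessian(A, x_nat + delta * u, x_nat))
    # the coherent direction inflates the largest eigenvalue
    print(f"{name}: eigenvalues in [{w.min():.2f}, {w.max():.2f}]")
```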


Page 41: A second look at gradient descent theory

[Figure: the ℓ₂ ball around x^♮ vs. the region of local strong convexity + smoothness.]

• Generic optimization theory only ensures that iterates remain in the ℓ₂ ball, but not in the incoherence region.

• Existing algorithms enforce regularization, or apply sample splitting, to promote incoherence.


Page 50: Our findings: GD is implicitly regularized

[Figure: the GD trajectory staying inside the region of local strong convexity + smoothness.]

GD implicitly forces iterates to remain incoherent, even without regularization.
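This is easy to observe empirically: the sketch below runs vanilla WF (same setup as the earlier sketch) and tracks the incoherence measure max_k |a_k^⊤(x_t − x^♮)|/√(log n) along the trajectory, which stays on the order of ‖x^♮‖₂ = 1 without any explicit regularization:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, eta = 100, 1000, 0.1
x_nat = rng.standard_normal(n); x_nat /= np.linalg.norm(x_nat)
A = rng.standard_normal((m, n))
y = (A @ x_nat) ** 2

_, V = np.linalg.eigh((A.T * y) @ A / m)       # spectral initialization
x = V[:, -1] * np.sqrt(y.mean())
if x @ x_nat < 0:
    x = -x                                     # fix global sign for monitoring

for t in range(101):
    if t % 25 == 0:
        incoh = np.max(np.abs(A @ (x - x_nat))) / np.sqrt(np.log(n))
        print(f"t={t:3d}  error={np.linalg.norm(x - x_nat):.2e}  "
              f"max_k |a_k^T (x_t - x_nat)| / sqrt(log n) = {incoh:.2f}")
    z = A @ x
    x = x - eta * A.T @ ((z ** 2 - y) * z) / m
```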


Page 55: Theoretical guarantees

Theorem (Phase retrieval). Under i.i.d. Gaussian design, WF achieves

• max_k |a_k^⊤(x_t − x^♮)| ≲ √(log n) · ‖x^♮‖₂  (incoherence)

• ‖x_t − x^♮‖₂ ≲ (1 − η/2)^t · ‖x^♮‖₂  (near-linear convergence)

provided that the step size η ≍ 1/log n and the sample size m ≳ n log n.

Big computational saving: WF attains ε-accuracy within O(log n · log(1/ε)) iterations with η ≍ 1/log n if m ≍ n log n.


Page 58: Theoretical guarantees for the low-rank case

Theorem (Quadratic sampling). Under i.i.d. Gaussian design, GD achieves

• max_l ‖a_l^⊤(X_t Q_t − X^♮)‖₂ ≲ √(log n) · σ_r²(X^♮)/‖X^♮‖_F  (incoherence)

• ‖X_t Q_t − X^♮‖_F ≲ (1 − σ_r²(X^♮)·η/2)^t · ‖X^♮‖_F  (linear convergence)

provided that η ≍ 1/((log n ∨ r)² σ_r²(X^♮)) and m ≳ n r⁴ log n. (Here Q_t is the orthonormal matrix best aligning X_t with X^♮.)

Big computational saving: GD attains ε-accuracy within O((log n ∨ r)² log(1/ε)) iterations if m ≍ n r⁴ log n.

Prior result (Sanghavi, Ward, White '15): O(n⁴ r² log⁴ n · log(1/ε)) iterations if m ≍ n r⁶ log² n.


Page 63: Key ingredient: leave-one-out analysis

How to establish |a_l^⊤(x_t − x^♮)| ≲ √(log n) · ‖x^♮‖₂?

Technical difficulty: x_t is statistically dependent with a_l.

Leave-one-out trick: for each 1 ≤ l ≤ m, introduce leave-one-out iterates x^{t,(l)} by dropping the l-th sample.

[Figure: the design matrix A and measurements y, with the l-th row/sample removed.]


Page 66: Key ingredient: leave-one-out analysis

[Figure: the leave-one-out iterate x^{t,(l)} lies in the incoherence region w.r.t. a_l and stays close to the true iterate x_t.]

• Leave-one-out iterates x^{t,(l)} are independent of a_l, and are hence incoherent w.r.t. a_l with high probability.

• Leave-one-out iterates x^{t,(l)} ≈ true iterates x_t.

• Finish by the triangle inequality:

\[
\big| a_l^\top (x_t - x^\natural) \big| \le \big| a_l^\top (x_{t,(l)} - x^\natural) \big| + \big| a_l^\top (x_t - x_{t,(l)}) \big| .
\]
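A minimal sketch of the construction: drop sample l from the loss, rerun the same GD, and compare. For brevity both runs share one spectral initialization, whereas the formal analysis also builds a leave-one-out initialization; all names and constants are illustrative:

```python
import numpy as np

def gd(A, y, x0, eta=0.1, iters=100):
    """Vanilla GD on the phase retrieval loss."""
    x = x0.copy()
    for _ in range(iters):
        z = A @ x
        x = x - eta * A.T @ ((z ** 2 - y) * z) / len(y)
    return x

rng = np.random.default_rng(0)
n, m, l = 100, 1000, 0                       # l: index of the dropped sample
x_nat = rng.standard_normal(n); x_nat /= np.linalg.norm(x_nat)
A = rng.standard_normal((m, n))
y = (A @ x_nat) ** 2

_, V = np.linalg.eigh((A.T * y) @ A / m)     # shared spectral initialization
x0 = V[:, -1] * np.sqrt(y.mean())
if x0 @ x_nat < 0:
    x0 = -x0

keep = np.arange(m) != l
x_t = gd(A, y, x0)                            # true iterate (all m samples)
x_tl = gd(A[keep], y[keep], x0)               # leave-one-out iterate (sample l dropped)

print("||x_t - x_t(l)||_2       =", np.linalg.norm(x_t - x_tl))       # small
print("|a_l^T (x_t - x_t(l))|   =", abs(A[l] @ (x_t - x_tl)))         # small
print("|a_l^T (x_t(l) - x_nat)| =", abs(A[l] @ (x_tl - x_nat)))       # sqrt(log n) scale
```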


Page 69: Incoherence region in high dimensions

[Figure: the incoherence region around x^♮ in 2 dimensions vs. in high dimensions.]

In high dimensions, the incoherence region is vanishingly small.

Page 70: This recipe is quite general

Page 71: Low-rank matrix completion

[Figure: a partially observed matrix with known entries marked ✗ and missing entries marked ?. Fig. credit: Candès]

Given partial samples of a low-rank matrix M in an index set Ω, fill in the missing entries.

Applications: recommendation systems, ...

Page 72: Incoherence

\[
\underbrace{\begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 0 \end{bmatrix}}_{\text{hard, } \mu = n}
\quad \text{vs.} \quad
\underbrace{\begin{bmatrix} 1 & 1 & \cdots & 1 \\ 1 & 1 & \cdots & 1 \\ \vdots & \vdots & & \vdots \\ 1 & 1 & \cdots & 1 \end{bmatrix}}_{\text{easy, } \mu = 1}
\]

Definition (Incoherence for matrix completion). A rank-r matrix M^♮ with eigendecomposition M^♮ = U^♮ Σ^♮ U^♮⊤ is said to be µ-incoherent if

\[
\big\| U^\natural \big\|_{2,\infty} \le \sqrt{\frac{\mu}{n}}\, \big\| U^\natural \big\|_F = \sqrt{\frac{\mu r}{n}} .
\]
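For U^♮ with orthonormal columns the definition reduces to µ = (n/r) · max_j ‖e_j^⊤U^♮‖₂²; a tiny sketch evaluating this on the two extreme examples above:

```python
import numpy as np

def mu(U):
    """mu = (n/r) * max_j ||e_j^T U||_2^2, for U with orthonormal columns."""
    n, r = U.shape
    return n * np.max(np.sum(U ** 2, axis=1)) / r

n = 100
U_hard = np.zeros((n, 1)); U_hard[0, 0] = 1.0     # spiked matrix e_1 e_1^T
U_easy = np.ones((n, 1)) / np.sqrt(n)             # all-ones matrix (rank 1)
print(mu(U_hard), mu(U_easy))                     # -> 100.0 (mu = n) and 1.0 (mu = 1)
```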

Page 73: Prior theory

\[
\underset{X \in \mathbb{R}^{n\times r}}{\text{minimize}}\quad f(X) = \sum_{(j,k)\in\Omega} \big( e_j^\top X X^\top e_k - M_{j,k} \big)^2
\]

Existing theory promotes incoherence explicitly:

• regularized loss (solve min_X f(X) + R(X) instead), e.g., Keshavan, Montanari, Oh '10; Sun, Luo '14; Ge, Lee, Ma '16

• projection onto the set of incoherent matrices, e.g., Chen, Wainwright '15; Zheng, Lafferty '16

Our theory provides guarantees on vanilla / unregularized gradient descent.
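A minimal sketch of vanilla GD on this loss, under assumed toy settings (a planted incoherent factor, Bernoulli sampling, and an initialization near the truth standing in for the spectral method):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, p = 60, 2, 0.3                              # size, rank, sampling rate

U = rng.standard_normal((n, r)) / np.sqrt(n)      # incoherent planted factor
M = U @ U.T
mask = np.triu(rng.random((n, n)) < p)
Omega = mask | mask.T                             # symmetric observation pattern

# near-truth initialization (stand-in for the spectral method)
X = U + 0.1 * rng.standard_normal((n, r)) / np.sqrt(n)
eta = 0.2
for _ in range(300):
    R = Omega * (X @ X.T - M)                     # residual on observed entries
    X = X - eta * (R + R.T) @ X                   # grad of f, constant absorbed in eta
print("relative error:", np.linalg.norm(X @ X.T - M) / np.linalg.norm(M))
```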


Page 77: Conclusions

Optimization theory + statistical model: vanilla gradient descent is "implicitly regularized" and runs fast!

• Statistical: near-optimal sample complexity

• Computational: near dimension-free iteration complexity

• Works for a few other problems such as low-rank matrix completion and blind deconvolution.

• "Leave-one-out" arguments are useful for decoupling weak dependency, allowing a finer characterization of GD trajectories.

Page 78: References

1. Implicit Regularization for Nonconvex Statistical Estimation, C. Ma, K. Wang, Y. Chi and Y. Chen, arXiv:1711.10467.

2. Nonconvex Matrix Factorization from Rank-One Measurements, Y. Li, C. Ma, Y. Chen, and Y. Chi, arXiv:1802.06286.

3. Gradient Descent with Random Initialization: Fast Global Convergence for Nonconvex Phase Retrieval, Y. Chen, Y. Chi, J. Fan and C. Ma, arXiv:1803.07726.

Thank you!