Solving composite optimization problems, with applications to phase retrieval
John Duchi (based on joint work with Feng Ruan)
Outline
Composite optimization problems
Methods for composite optimization
Application: robust phase retrieval
Experimental evaluation
Large scale composite optimization?
What I hope to accomplish today
- Investigate problem structures that are not quite convex but still amenable to elegant solution approaches
- Show how we can leverage stochastic structure to turn hard non-convex problems into "easy" ones [Keshavan, Montanari, Oh 10; Loh & Wainwright 12]
- Consider large scale versions of these problems
Composite optimization problems
The problem:
    minimize_x  f(x) := h(c(x))
where h : R^m → R is convex and c : R^n → R^m is smooth
Motivation: the exact penalty
    minimize_x  f(x)  subject to  x ∈ X
equivalent (for all large enough λ) to
    minimize_x  f(x) + λ dist(x, X)
Motivation: the exact penalty
    minimize_x  f(x)  subject to  c(x) = 0
equivalent (for all large enough λ) to
    minimize_x  f(x) + λ ‖c(x)‖
where λ ‖c(x)‖ = h(c(x)) with h(z) = λ ‖z‖
[Fletcher & Watson 80, 82; Burke 85]
Motivation: nonlinear measurements and modeling
- Have true signal x⋆ ∈ R^n and measurement vectors a_i ∈ R^n
- Observe nonlinear measurements
    b_i = φ(⟨a_i, x⋆⟩) + ξ_i,  i = 1, . . . , m
  for φ(·) a nonlinear but smooth function
An objective:
    f(x) = (1/m) Σ_{i=1}^m (φ(⟨a_i, x⟩) − b_i)^2
Nonlinear least squares [Nocedal & Wright 06; Plan & Vershynin 15; Oymak & Soltanolkotabi 16]
(Robust) Phase retrieval
[Candes, Li, Soltanolkotabi 15]
Observations (usually) b_i = ⟨a_i, x⋆⟩^2 yield the objective
    f(x) = (1/m) Σ_{i=1}^m |⟨a_i, x⟩^2 − b_i|
Optimization methods
How do we solve optimization problems?
1. Build a "good" but simple local model of f
2. Minimize the model (perhaps regularizing)
Gradient descent: Taylor (first-order) model
    f(y) ≈ f_x(y) := f(x) + ∇f(x)^T (y − x)
Newton's method: Taylor (second-order) model
    f(y) ≈ f_x(y) := f(x) + ∇f(x)^T (y − x) + (1/2)(y − x)^T ∇^2 f(x) (y − x)
Modeling composite problems
Now we make a convex model of f(x) = h(c(x)) by linearizing c:
    f_x(y) := h(c(x) + ∇c(x)^T (y − x))
where c(x) + ∇c(x)^T (y − x) = c(y) + O(‖x − y‖^2)
[Burke 85; Drusvyatskiy, Ioffe, Lewis 16]
Example: f(x) = |x^2 − 1|, h(z) = |z| and c(x) = x^2 − 1
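To make the model concrete, here is the example worked at a single point; the choice x = 1.3 is illustrative and not from the slides. At x = 1.3 we have c(x) = 0.69 and ∇c(x) = 2.6, so
    f_x(y) = |0.69 + 2.6 (y − 1.3)|,
a convex, piecewise-linear model that agrees with f to first order at x and vanishes at y = 1.3 − 0.69/2.6 ≈ 1.035, already much closer to the minimizer x⋆ = 1 than x itself.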
The prox-linear method [Burke, Drusvyatskiy et al.]
Iteratively (1) form the regularized convex model and (2) minimize it:
    x_{k+1} = argmin_{x ∈ X} { f_{x_k}(x) + (1/2α) ‖x − x_k‖_2^2 }
            = argmin_{x ∈ X} { h(c(x_k) + ∇c(x_k)^T (x − x_k)) + (1/2α) ‖x − x_k‖_2^2 }
On the one-dimensional running example, successive iterates satisfy
    |x_k − x⋆| = 0.3, 0.024, 3·10^{−4}, 4·10^{−8}
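A minimal numpy sketch of these iterations on the one-dimensional example f(x) = |x^2 − 1|. The closed-form solution of the scalar subproblem (a soft-thresholding step) is a standard computation derived here, not taken from the slides, and the step size and starting point are illustrative, so the printed distances will not exactly match the numbers above.

```python
import numpy as np

def prox_linear_1d(x0, alpha=1.0, iters=6):
    """Prox-linear iterations for f(x) = |x^2 - 1|, h(z) = |z|, c(x) = x^2 - 1.

    Each step minimizes |c(xk) + c'(xk)(x - xk)| + (x - xk)^2 / (2*alpha),
    which for scalar h = |.| reduces to a soft-thresholding step.
    """
    x = x0
    history = [abs(x - 1.0)]
    for _ in range(iters):
        c, g = x**2 - 1.0, 2.0 * x                      # c(x_k) and its derivative
        q = g**2
        if q == 0.0:                                    # degenerate linearization: no move
            d = 0.0
        else:
            u = np.sign(c) * max(abs(c) - alpha * q, 0.0)  # prox of |.| (parameter alpha*q) at c
            d = (u - c) / q * g                         # step along the linearization direction
        x = x + d
        history.append(abs(x - 1.0))
    return x, history

x, hist = prox_linear_1d(x0=1.3)
print(hist)  # distances to x* = 1 shrink roughly quadratically
```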
Robust phase retrieval problems
A nice application for these composite methods
Robust phase retrieval problems
Data model: true signal x⋆ ∈ R^n; for p_fail < 1/2 observe
    b_i = ⟨a_i, x⋆⟩^2 + ξ_i  where  ξ_i = { 0 w.p. ≥ 1 − p_fail;  arbitrary otherwise }
Goal: solve
    minimize_x  f(x) = (1/m) Σ_{i=1}^m |⟨a_i, x⟩^2 − b_i|
Composite problem: f(x) = (1/m) ‖φ(Ax) − b‖_1 = h(c(x)), where φ(·) is the elementwise square,
    h(z) = (1/m) ‖z‖_1,   c(x) = φ(Ax) − b
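A minimal numpy sketch of evaluating this objective and its convex model; the variable names A, b, x, y are assumptions for illustration.

```python
import numpy as np

def phase_retrieval_objective(x, A, b):
    """Robust phase retrieval objective f(x) = (1/m) || (Ax)^2 - b ||_1."""
    m = A.shape[0]
    c = (A @ x) ** 2 - b          # c(x) = phi(Ax) - b, phi = elementwise square
    return np.abs(c).sum() / m    # h(z) = (1/m) ||z||_1

def convex_model(y, x, A, b):
    """Convex model f_x(y) = h(c(x) + grad c(x)^T (y - x)) used by the prox-linear method."""
    m = A.shape[0]
    Ax = A @ x
    lin = Ax ** 2 - b + 2.0 * Ax * (A @ (y - x))   # linearization of c at x
    return np.abs(lin).sum() / m
```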
A convergence theorem
Three key ingredients.
(1) Stability: f(x) − f(x⋆) ≥ λ ‖x − x⋆‖_2 ‖x + x⋆‖_2
(2) Close models: |f_x(y) − f(y)| ≤ (1/m) ‖A^T A‖_op ‖x − y‖_2^2
(3) A good initialization
- Measurement matrix A = [a_1 · · · a_m]^T ∈ R^{m×n} and (1/m) A^T A = (1/m) Σ_{i=1}^m a_i a_i^T
- Convex model f_x of f at x defined by f_x(y) = h(c(x) + ∇c(x)^T (y − x)); here
    f_x(y) = (1/m) Σ_{i=1}^m |⟨a_i, x⟩^2 − b_i + 2 ⟨a_i, x⟩ ⟨a_i, y − x⟩|
Theorem (D. & Ruan 17)
Define dist(x, x⋆) = min{‖x − x⋆‖_2, ‖x + x⋆‖_2}. Let x_k be generated by the prox-linear method and L = (1/m) ‖A^T A‖_op. Then
    dist(x_k, x⋆) ≤ ((2L/λ) dist(x_0, x⋆))^{2^k}.
Unpacking the convergence theorem
Theorem (D. & Ruan 17)
Define dist(x, x⋆) = min{‖x − x⋆‖_2, ‖x + x⋆‖_2}. Let x_k be generated by the prox-linear method and L = (1/m) ‖A^T A‖_op. Then
    dist(x_k, x⋆) ≤ ((2L/λ) dist(x_0, x⋆))^{2^k}.
- Quadratic convergence: for all intents and purposes, 6 iterations
- Requires solving explicit convex optimization problems (quadratic programs) with no tuning parameters
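To illustrate what "explicit convex subproblem" means here, each prox-linear step for the phase retrieval model can be handed to an off-the-shelf convex solver. A minimal sketch; using cvxpy is an assumption of this sketch, not something the slides prescribe.

```python
import cvxpy as cp
import numpy as np

def prox_linear_step(xk, A, b, alpha):
    """One prox-linear step for f(x) = (1/m) || (Ax)^2 - b ||_1 via a convex solver."""
    m, n = A.shape
    Axk = A @ xk
    x = cp.Variable(n)
    # Linearized residuals: c(xk) + grad c(xk)^T (x - xk)
    lin = Axk**2 - b + 2.0 * cp.multiply(Axk, A @ (x - xk))
    objective = cp.sum(cp.abs(lin)) / m + cp.sum_squares(x - xk) / (2.0 * alpha)
    cp.Problem(cp.Minimize(objective)).solve()
    return x.value
```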
Ingredients in convergence: stability
1. Stability: (cf. Eldar and Mendelson 14)
    f(x) − f(x⋆) ≥ λ ‖x − x⋆‖_2 ‖x + x⋆‖_2
What is necessary?
Proposition (D. & Ruan 17)
Assume the uniformity condition: for all u, v ∈ R^n and a ∼ P,
    P(|u^T a a^T v| ≥ ε_0 ‖u‖_2 ‖v‖_2) ≥ c > 0.
Then f is (ε_0/2)-stable with probability at least 1 − e^{−cm}.
(Gaussians satisfy this)
Ingredients in convergence: stability
Growth condition (stability):
    ⟨a_i, x⟩^2 − ⟨a_i, x⋆⟩^2 = ⟨a_i, x − x⋆⟩ ⟨a_i, x + x⋆⟩
and under random a_i with uniform enough support,
    f(x) = (1/m) Σ_{i=1}^m |(x − x⋆)^T a_i a_i^T (x + x⋆)| ≳ ‖x − x⋆‖_2 ‖x + x⋆‖_2
Ingredients in convergence
2. Approximation: need (1/m) ‖A^T A‖_op = O(1)
What is necessary?
Proposition (Vershynin 11)
If the measurement vectors a_i are sub-Gaussian, then
    (1/m) ‖A^T A‖_op ≤ O(1) · (√(n/m) + t)   w.p. ≥ 1 − e^{−m t^2}.
Heavy-tailed data gets (1/m) ‖A^T A‖_op = O(1) with reasonable probability for m a bit larger
Ingredients in convergence: spectral initialization
Insight: [Wang, Giannakis, Eldar 16] Most vectors a_i ∈ R^n are nearly orthogonal to x⋆, so
    X_init := Σ_{i : b_i ≤ median(b)} a_i a_i^T
satisfies X_init ≈ E[a_i a_i^T] − c d⋆ d⋆^T,  where d⋆ = x⋆ / ‖x⋆‖_2
Ingredients in convergence: spectral initialization
3. Initialization: We need dist(x_0, x⋆) ≲ (1/2) ‖x⋆‖_2
Estimate direction d ≈ x⋆ / ‖x⋆‖_2 and radius r by
    X_init := Σ_{i : b_i ≤ median(b)} a_i a_i^T,   d = argmin_{d ∈ S^{n−1}} d^T X_init d,
    r := ((1/m) Σ_{i=1}^m b_i)^{1/2} ≈ ‖x⋆‖_2
Proposition (D. & Ruan 17)
Under appropriate orthogonality conditions, x_0 = r d satisfies
    dist(x_0, x⋆) ≲ √(n/m) + t
with probability at least 1 − e^{−m t^2}
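A compact numpy sketch of this initialization; it follows the radius estimate r ≈ ‖x⋆‖_2 as reconstructed above, and the variable names A and b are illustrative assumptions.

```python
import numpy as np

def spectral_init(A, b):
    """Spectral initialization x0 = r * d for robust phase retrieval.

    d: eigenvector of X_init = sum_{i: b_i <= median(b)} a_i a_i^T with the
       smallest eigenvalue (the direction the small-measurement vectors avoid).
    r: ((1/m) sum_i b_i)^(1/2), an estimate of ||x*||_2.
    """
    keep = b <= np.median(b)
    X_init = A[keep].T @ A[keep]           # sum of a_i a_i^T over indices with small b_i
    evals, evecs = np.linalg.eigh(X_init)  # eigenvalues in ascending order
    d = evecs[:, 0]                        # minimizer of d^T X_init d over the sphere
    r = np.sqrt(np.mean(b))
    return r * d
```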
Take-home result
- Stability: measurements a_i are uniform enough in direction
- Closeness: a_i are sub-Gaussian or normalized
- Sufficient conditions for initialization: for v ∈ S^{n−1},
    E[a_i a_i^T | ⟨a_i, v⟩^2 ≤ ‖v‖_2^2] = I_n − c v v^T + E
  where c > 0 and E is a small error
- Measurement failure probability p_fail ≤ 1/4
Theorem (D. & Ruan 17)
If these conditions hold and m/n ≳ 1, then the spectral initialization succeeds and the iterates x_k of the prox-linear algorithm satisfy
    dist(x_k, x⋆) ≤ (O(1) · dist(x_0, x⋆))^{2^k}
Experiments
1. Random (Gaussian) measurements
2. Adversarially chosen outliers
3. Real images
Experiment 1: random Gaussian measurements
- Data generation: dimension n = 3000,
    a_i iid∼ N(0, I_n)  and  b_i = ⟨a_i, x⋆⟩^2
- Compare to Wang, Giannakis, Eldar's Truncated Amplitude Flow (best performing non-convex approach)
- Look at success probability against m/n (note that m ≥ 2n − 1 is necessary for injectivity)
Experiment 1: random Gaussian measurements
[Figure: probability of success vs. m/n (1.8 to 2.2) for the prox-linear method (Prox) and TAF]
Experiment 1: random Gaussian measurements
[Figure: fraction of one-sided success vs. m/n (1.8 to 2.2) for Prox and TAF]
Experiment 2: corrupted measurements
- Data generation: dimension n = 200,
    a_i iid∼ N(0, I_n)  and  b_i = { 0 w.p. p_fail;  ⟨a_i, x⋆⟩^2 otherwise }
  (this corruption most confuses our initialization method)
- Compare to Zhang, Chi, Liang's Median-Truncated Wirtinger Flow (designed specially for standard Gaussian measurements)
- Look at success probability against m/n (note that m ≥ 2n − 1 is necessary for injectivity)
Experiment 2: corrupted measurements
[Figure: success probability as a function of p_fail (0 to 0.3) and m/n (1.8 to 8.0) for the two methods]
Experiment 3: digit recovery
- Data generation: handwritten 16×16 grayscale digits, sensing matrix
    A = [H_n S_1; H_n S_2; H_n S_3] ∈ R^{3n×n}
  where n = 256, the S_l are diagonal random sign matrices, and H_n is the Hadamard transform matrix (a construction sketch follows this list)
- Observe
    b = (A x⋆)^2 + ξ  where  ξ_i = { 0 w.p. 1 − p_fail;  Cauchy otherwise }
- Other non-convex approaches designed for Gaussian data; unclear how to parameterize them
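A sketch of how such a structured sensing matrix could be formed; scipy's Hadamard matrix is used as an assumption here, and whether the slides' experiments normalize H_n is not specified.

```python
import numpy as np
from scipy.linalg import hadamard

def hadamard_sensing_matrix(n=256, blocks=3, rng=np.random.default_rng(0)):
    """Stack H_n S_l for l = 1..blocks, with S_l random diagonal sign matrices."""
    H = hadamard(n)                                  # n must be a power of 2
    rows = []
    for _ in range(blocks):
        signs = rng.choice([-1.0, 1.0], size=n)      # diagonal of S_l
        rows.append(H * signs)                       # H_n S_l (scales columns of H)
    return np.vstack(rows)                           # shape (blocks*n, n)

A = hadamard_sensing_matrix()
print(A.shape)  # (768, 256)
```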
Experiment 3: digit recovery
Left: true image. Middle: spectral initialization. Right: solution.
Experiment 3: digit recovery
[Figure: success probability and matrix multiply count vs. p_fail (0 to 0.2)]
Performance of composite optimization scheme versus failure probability
Experiment 4: real images
Signal size n = 2^22, measurements m = 3 · 2^24
Composite optimization at scale
Question: What if we have composite problems with a really big sample?
- Typical stochastic optimization setup,
    f(x) = E[F(x; S)]  where  F(x; S) = h(c(x; S); S)
- Example: large scale (robust) nonlinear regression
    f(x) = (1/m) Σ_{i=1}^m |φ(⟨a_i, x⟩) − b_i|
A stochastic composite method
- Define the (random) convex approximation
    F_x(y; s) = h(c(x; s) + ∇c(x; s)^T (y − x); s),   where c(x; s) + ∇c(x; s)^T (y − x) ≈ c(y; s)
- Then iterate for k = 1, 2, . . .
    S_k iid∼ P
    x_{k+1} = argmin_{x ∈ X} { F_{x_k}(x; S_k) + (1/2α_k) ‖x − x_k‖_2^2 }
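For the robust nonlinear regression example F(x; s) = |φ(⟨a_i, x⟩) − b_i|, the single-sample subproblem has a closed form (a soft-thresholding step along the linearization direction). The derivation is a standard computation done here, not given on the slides, and φ and its derivative below default to the phase-retrieval square as an illustration.

```python
import numpy as np

def stochastic_prox_linear_step(x, a_i, b_i, alpha,
                                phi=np.square, dphi=lambda t: 2.0 * t):
    """One stochastic prox-linear step for F(x; s) = |phi(<a_i, x>) - b_i|.

    Solves  min_y |c + g^T (y - x)| + ||y - x||^2 / (2*alpha)  in closed form,
    where c = phi(<a_i, x>) - b_i and g = dphi(<a_i, x>) * a_i.
    """
    t = a_i @ x
    c = phi(t) - b_i
    g = dphi(t) * a_i
    q = g @ g
    if q == 0.0:
        return x
    u = np.sign(c) * max(abs(c) - alpha * q, 0.0)   # prox of |.| with parameter alpha*q at c
    return x + (u - c) / q * g
```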
Understanding convergence behavior
Ordinary differential equations (gradient flow):
    ẋ = −∇f(x),  i.e.  (d/dt) x(t) = −∇f(x(t))
Ordinary differential inclusions (subgradient flow):
    ẋ ∈ −∂f(x),  i.e.  (d/dt) x(t) ∈ −∂f(x(t))
The differential inclusion
For the stochastic function
    f(x) := E[F(x; S)] = E[h(c(x; S); S)] = ∫ h(c(x; s); s) dP(s)
the generalized subgradient (for non-convex, non-smooth f) is [D. & Ruan 17]
    ∂f(x) = ∫ ∇c(x; s) ∂h(c(x; s); s) dP(s)
Theorem (D. & Ruan 17)
For the stochastic composite problem, the subdifferential inclusion ẋ ∈ −∂f(x) has a unique trajectory for all time and
    f(x(t)) − f(x(0)) ≤ −∫_0^t ‖∂f(x(τ))‖^2 dτ.
It also has limit points and they are stationary.
The limiting differential inclusion
Recall our iteration
    x_{k+1} = argmin_x { F_{x_k}(x; S_k) + (1/2α_k) ‖x − x_k‖_2^2 }.
Optimality conditions: using F_x(y; s) = h(c(x; s) + ∇c(x; s)^T (y − x); s),
    0 ∈ ∇c(x_k; s) ∂h(c(x_k; s) + ∇c(x_k; s)^T (x_{k+1} − x_k); s) + (1/α_k)[x_{k+1} − x_k]
where c(x_k; s) + ∇c(x_k; s)^T (x_{k+1} − x_k) = c(x_k; s) ± O(‖x_k − x_{k+1}‖_2), i.e.
    (1/α_k)[x_{k+1} − x_k] ∈ −∇c(x_k; s) ∂h(c(x_k; s); s) + subgradient mess + Noise
                           = −∂f(x_k) + subgradient mess + Noise
Graphical example
Iterate x_{k+1} = argmin_x { F_{x_k}(x; S_k) + (1/2α_k) ‖x − x_k‖_2^2 }
A convergence guarantee
Consider the stochastic composite optimization problem
    minimize_{x ∈ X}  f(x) := E[F(x; S)]  where  F(x; s) = h(c(x; s); s).
Use the iteration
    x_{k+1} = argmin_{x ∈ X} { F_{x_k}(x; S_k) + (1/2α_k) ‖x − x_k‖_2^2 }.
Theorem (D. & Ruan 17)
Assume X is compact and Σ_{k=1}^∞ α_k = ∞, Σ_{k=1}^∞ α_k^2 < ∞. Then the sequence {x_k} satisfies
(1) f(x_k) converges
(2) All cluster points of x_k are stationary
Experiment: noiseless phase retrieval
[Figure: error vs. iteration (log scale, 10^-8 to 10^-2); legend entries: prox, sprox, sgd]
Conclusions
1. Broadly interesting structures for non-convex problems that are still approximable
2. Statistical modeling allows solution of non-trivial, non-smooth, non-convex problems
3. Large scale efficient methods still important
References
- Solving (most) of a set of quadratic equalities: Composite optimization for robust phase retrieval, arXiv:1705.02356
- Stochastic Methods for Composite Optimization Problems, arXiv:1703.08570