Solving composite optimization problems, with applications to phase retrieval
John Duchi (based on joint work with Feng Ruan)
Outline
Composite optimization problems
Methods for composite optimization
Application: robust phase retrieval
Experimental evaluation
Large scale composite optimization?
What I hope to accomplish today
- Investigate problem structures that are not quite convex but still amenable to elegant solution approaches
- Show how we can leverage stochastic structure to turn hard non-convex problems into "easy" ones [Keshavan, Montanari, Oh 10; Loh & Wainwright 12]
- Consider large scale versions of these problems
Composite optimization problems
The problem:
    minimize_x  f(x) := h(c(x))
where h : R^m → R is convex and c : R^n → R^m is smooth
Motivation: the exact penalty
    minimize_x  f(x)  subject to  x ∈ X
equivalent (for all large enough λ) to
    minimize_x  f(x) + λ dist(x, X)
Motivation: the exact penalty
    minimize_x  f(x)  subject to  c(x) = 0
equivalent (for all large enough λ) to
    minimize_x  f(x) + λ ‖c(x)‖
where λ ‖c(x)‖ = h(c(x)) with h(z) = λ ‖z‖
[Fletcher & Watson 80, 82; Burke 85]
Motivation: nonlinear measurements and modeling
- Have true signal x⋆ ∈ R^n and measurement vectors a_i ∈ R^n
- Observe nonlinear measurements
    b_i = φ(⟨a_i, x⋆⟩) + ξ_i,  i = 1, . . . , m
  for φ(·) a nonlinear but smooth function
An objective:
    f(x) = (1/m) Σ_{i=1}^m (φ(⟨a_i, x⟩) − b_i)^2
Nonlinear least squares [Nocedal & Wright 06; Plan & Vershynin 15; Oymak & Soltanolkotabi 16]
(Robust) Phase retrieval
[Candes, Li, Soltanolkotabi 15]
Observations (usually) b_i = ⟨a_i, x⋆⟩^2 yield the objective
    f(x) = (1/m) Σ_{i=1}^m |⟨a_i, x⟩^2 − b_i|
Optimization methods
How do we solve optimization problems?
1. Build a "good" but simple local model of f
2. Minimize the model (perhaps regularizing)
Gradient descent: Taylor (first-order) model
    f(y) ≈ f_x(y) := f(x) + ∇f(x)^T (y − x)
Newton's method: Taylor (second-order) model
    f(y) ≈ f_x(y) := f(x) + ∇f(x)^T (y − x) + (1/2)(y − x)^T ∇^2 f(x) (y − x)
Modeling composite problems
Now we make a convex model of f(x) = h(c(x)) by linearizing c:
    f_x(y) := h(c(x) + ∇c(x)^T (y − x))
where c(x) + ∇c(x)^T (y − x) = c(y) + O(‖x − y‖^2)
[Burke 85; Drusvyatskiy, Ioffe, Lewis 16]
Example: f(x) = |x^2 − 1|, h(z) = |z| and c(x) = x^2 − 1
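To make the model concrete, here is the example worked at a single point; the choice x = 1.3 is illustrative and not from the slides. At x = 1.3 we have c(x) = 0.69 and ∇c(x) = 2.6, so
    f_x(y) = |0.69 + 2.6 (y − 1.3)|,
a convex, piecewise-linear model that agrees with f to first order at x and vanishes at y = 1.3 − 0.69/2.6 ≈ 1.035, already much closer to the minimizer x⋆ = 1 than x itself.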
The prox-linear method [Burke, Drusvyatskiy et al.]
Iteratively (1) form the regularized convex model and (2) minimize it:
    x_{k+1} = argmin_{x ∈ X} { f_{x_k}(x) + (1/2α) ‖x − x_k‖_2^2 }
            = argmin_{x ∈ X} { h(c(x_k) + ∇c(x_k)^T (x − x_k)) + (1/2α) ‖x − x_k‖_2^2 }
On the one-dimensional running example, successive iterates satisfy
    |x_k − x⋆| = 0.3, 0.024, 3·10^{−4}, 4·10^{−8}
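A minimal numpy sketch of these iterations on the one-dimensional example f(x) = |x^2 − 1|. The closed-form solution of the scalar subproblem (a soft-thresholding step) is a standard computation derived here, not taken from the slides, and the step size and starting point are illustrative, so the printed distances will not exactly match the numbers above.

```python
import numpy as np

def prox_linear_1d(x0, alpha=1.0, iters=6):
    """Prox-linear iterations for f(x) = |x^2 - 1|, h(z) = |z|, c(x) = x^2 - 1.

    Each step minimizes |c(xk) + c'(xk)(x - xk)| + (x - xk)^2 / (2*alpha),
    which for scalar h = |.| reduces to a soft-thresholding step.
    """
    x = x0
    history = [abs(x - 1.0)]
    for _ in range(iters):
        c, g = x**2 - 1.0, 2.0 * x                      # c(x_k) and its derivative
        q = g**2
        if q == 0.0:                                    # degenerate linearization: no move
            d = 0.0
        else:
            u = np.sign(c) * max(abs(c) - alpha * q, 0.0)  # prox of |.| (parameter alpha*q) at c
            d = (u - c) / q * g                         # step along the linearization direction
        x = x + d
        history.append(abs(x - 1.0))
    return x, history

x, hist = prox_linear_1d(x0=1.3)
print(hist)  # distances to x* = 1 shrink roughly quadratically
```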
Robust phase retrieval problems
A nice application for these composite methods
Robust phase retrieval problems
Data model: true signal x⋆ ∈ R^n; for p_fail < 1/2 observe
    b_i = ⟨a_i, x⋆⟩^2 + ξ_i  where  ξ_i = { 0 w.p. ≥ 1 − p_fail;  arbitrary otherwise }
Goal: solve
    minimize_x  f(x) = (1/m) Σ_{i=1}^m |⟨a_i, x⟩^2 − b_i|
Composite problem: f(x) = (1/m) ‖φ(Ax) − b‖_1 = h(c(x)), where φ(·) is the elementwise square,
    h(z) = (1/m) ‖z‖_1,   c(x) = φ(Ax) − b
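A minimal numpy sketch of evaluating this objective and its convex model; the variable names A, b, x, y are assumptions for illustration.

```python
import numpy as np

def phase_retrieval_objective(x, A, b):
    """Robust phase retrieval objective f(x) = (1/m) || (Ax)^2 - b ||_1."""
    m = A.shape[0]
    c = (A @ x) ** 2 - b          # c(x) = phi(Ax) - b, phi = elementwise square
    return np.abs(c).sum() / m    # h(z) = (1/m) ||z||_1

def convex_model(y, x, A, b):
    """Convex model f_x(y) = h(c(x) + grad c(x)^T (y - x)) used by the prox-linear method."""
    m = A.shape[0]
    Ax = A @ x
    lin = Ax ** 2 - b + 2.0 * Ax * (A @ (y - x))   # linearization of c at x
    return np.abs(lin).sum() / m
```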
A convergence theorem
Three key ingredients.
(1) Stability: f(x) − f(x⋆) ≥ λ ‖x − x⋆‖_2 ‖x + x⋆‖_2
(2) Close models: |f_x(y) − f(y)| ≤ (1/m) ‖A^T A‖_op ‖x − y‖_2^2
(3) A good initialization
- Measurement matrix A = [a_1 · · · a_m]^T ∈ R^{m×n} and (1/m) A^T A = (1/m) Σ_{i=1}^m a_i a_i^T
- Convex model f_x of f at x defined by f_x(y) = h(c(x) + ∇c(x)^T (y − x)); here
    f_x(y) = (1/m) Σ_{i=1}^m |⟨a_i, x⟩^2 − b_i + 2 ⟨a_i, x⟩ ⟨a_i, y − x⟩|
Theorem (D. & Ruan 17)
Define dist(x, x⋆) = min{‖x − x⋆‖_2, ‖x + x⋆‖_2}. Let x_k be generated by the prox-linear method and L = (1/m) ‖A^T A‖_op. Then
    dist(x_k, x⋆) ≤ ((2L/λ) dist(x_0, x⋆))^{2^k}.
Unpacking the convergence theorem
Theorem (D. & Ruan 17)
Define dist(x, x⋆) = min{‖x − x⋆‖_2, ‖x + x⋆‖_2}. Let x_k be generated by the prox-linear method and L = (1/m) ‖A^T A‖_op. Then
    dist(x_k, x⋆) ≤ ((2L/λ) dist(x_0, x⋆))^{2^k}.
- Quadratic convergence: for all intents and purposes, 6 iterations
- Requires solving explicit convex optimization problems (quadratic programs) with no tuning parameters
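To illustrate what "explicit convex subproblem" means here, each prox-linear step for the phase retrieval model can be handed to an off-the-shelf convex solver. A minimal sketch; using cvxpy is an assumption of this sketch, not something the slides prescribe.

```python
import cvxpy as cp
import numpy as np

def prox_linear_step(xk, A, b, alpha):
    """One prox-linear step for f(x) = (1/m) || (Ax)^2 - b ||_1 via a convex solver."""
    m, n = A.shape
    Axk = A @ xk
    x = cp.Variable(n)
    # Linearized residuals: c(xk) + grad c(xk)^T (x - xk)
    lin = Axk**2 - b + 2.0 * cp.multiply(Axk, A @ (x - xk))
    objective = cp.sum(cp.abs(lin)) / m + cp.sum_squares(x - xk) / (2.0 * alpha)
    cp.Problem(cp.Minimize(objective)).solve()
    return x.value
```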
Ingredients in convergence: stability
1. Stability: (cf. Eldar and Mendelson 14)
    f(x) − f(x⋆) ≥ λ ‖x − x⋆‖_2 ‖x + x⋆‖_2
What is necessary?
Proposition (D. & Ruan 17)
Assume the uniformity condition: for all u, v ∈ R^n and a ∼ P,
    P(|u^T a a^T v| ≥ ε_0 ‖u‖_2 ‖v‖_2) ≥ c > 0.
Then f is (ε_0/2)-stable with probability at least 1 − e^{−cm}.
(Gaussians satisfy this)
Ingredients in convergence: stability
Growth condition (stability):
    ⟨a_i, x⟩^2 − ⟨a_i, x⋆⟩^2 = ⟨a_i, x − x⋆⟩ ⟨a_i, x + x⋆⟩
and under random a_i with uniform enough support,
    f(x) = (1/m) Σ_{i=1}^m |(x − x⋆)^T a_i a_i^T (x + x⋆)| ≳ ‖x − x⋆‖_2 ‖x + x⋆‖_2
Ingredients in convergence
2. Approximation: need (1/m) ‖A^T A‖_op = O(1)
What is necessary?
Proposition (Vershynin 11)
If the measurement vectors a_i are sub-Gaussian, then
    (1/m) ‖A^T A‖_op ≤ O(1) · (√(n/m) + t)   w.p. ≥ 1 − e^{−m t^2}.
Heavy-tailed data gets (1/m) ‖A^T A‖_op = O(1) with reasonable probability for m a bit larger
Ingredients in convergence: spectral initialization
Insight: [Wang, Giannakis, Eldar 16] Most vectors a_i ∈ R^n are nearly orthogonal to x⋆, so
    X_init := Σ_{i : b_i ≤ median(b)} a_i a_i^T
satisfies X_init ≈ E[a_i a_i^T] − c d⋆ d⋆^T,  where d⋆ = x⋆ / ‖x⋆‖_2
Ingredients in convergence: spectral initialization
3. Initialization: We need dist(x_0, x⋆) ≲ (1/2) ‖x⋆‖_2
Estimate direction d ≈ x⋆ / ‖x⋆‖_2 and radius r by
    X_init := Σ_{i : b_i ≤ median(b)} a_i a_i^T,   d = argmin_{d ∈ S^{n−1}} d^T X_init d,
    r := ((1/m) Σ_{i=1}^m b_i)^{1/2} ≈ ‖x⋆‖_2
Proposition (D. & Ruan 17)
Under appropriate orthogonality conditions, x_0 = r d satisfies
    dist(x_0, x⋆) ≲ √(n/m) + t
with probability at least 1 − e^{−m t^2}
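A compact numpy sketch of this initialization; it follows the radius estimate r ≈ ‖x⋆‖_2 as reconstructed above, and the variable names A and b are illustrative assumptions.

```python
import numpy as np

def spectral_init(A, b):
    """Spectral initialization x0 = r * d for robust phase retrieval.

    d: eigenvector of X_init = sum_{i: b_i <= median(b)} a_i a_i^T with the
       smallest eigenvalue (the direction the small-measurement vectors avoid).
    r: ((1/m) sum_i b_i)^(1/2), an estimate of ||x*||_2.
    """
    keep = b <= np.median(b)
    X_init = A[keep].T @ A[keep]           # sum of a_i a_i^T over indices with small b_i
    evals, evecs = np.linalg.eigh(X_init)  # eigenvalues in ascending order
    d = evecs[:, 0]                        # minimizer of d^T X_init d over the sphere
    r = np.sqrt(np.mean(b))
    return r * d
```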
Take-home result
- Stability: measurements a_i are uniform enough in direction
- Closeness: a_i are sub-Gaussian or normalized
- Sufficient conditions for initialization: for v ∈ S^{n−1},
    E[a_i a_i^T | ⟨a_i, v⟩^2 ≤ ‖v‖_2^2] = I_n − c v v^T + E
  where c > 0 and E is a small error
- Measurement failure probability p_fail ≤ 1/4
Theorem (D. & Ruan 17)
If these conditions hold and m/n ≳ 1, then the spectral initialization succeeds and the iterates x_k of the prox-linear algorithm satisfy
    dist(x_k, x⋆) ≤ (O(1) · dist(x_0, x⋆))^{2^k}
Experiments
1. Random (Gaussian) measurements
2. Adversarially chosen outliers
3. Real images
Experiment 1: random Gaussian measurements
- Data generation: dimension n = 3000,
    a_i iid∼ N(0, I_n)  and  b_i = ⟨a_i, x⋆⟩^2
- Compare to Wang, Giannakis, Eldar's Truncated Amplitude Flow (best performing non-convex approach)
- Look at success probability against m/n (note that m ≥ 2n − 1 is necessary for injectivity)
Experiment 1: random Gaussian measurements
[Figure: probability of success vs. m/n (1.8 to 2.2) for the prox-linear method (Prox) and TAF]
Experiment 1: random Gaussian measurements
[Figure: fraction of one-sided success vs. m/n (1.8 to 2.2) for Prox and TAF]
Experiment 2: corrupted measurements
- Data generation: dimension n = 200,
    a_i iid∼ N(0, I_n)  and  b_i = { 0 w.p. p_fail;  ⟨a_i, x⋆⟩^2 otherwise }
  (this corruption most confuses our initialization method)
- Compare to Zhang, Chi, Liang's Median-Truncated Wirtinger Flow (designed specially for standard Gaussian measurements)
- Look at success probability against m/n (note that m ≥ 2n − 1 is necessary for injectivity)
Experiment 2: corrupted measurements
[Figure: success probability as a function of p_fail (0 to 0.3) and m/n (1.8 to 8.0) for the two methods]
Experiment 3: digit recovery
- Data generation: handwritten 16×16 grayscale digits, sensing matrix
    A = [H_n S_1; H_n S_2; H_n S_3] ∈ R^{3n×n}
  where n = 256, the S_l are diagonal random sign matrices, and H_n is the Hadamard transform matrix (a construction sketch follows this list)
- Observe
    b = (A x⋆)^2 + ξ  where  ξ_i = { 0 w.p. 1 − p_fail;  Cauchy otherwise }
- Other non-convex approaches designed for Gaussian data; unclear how to parameterize them
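A sketch of how such a structured sensing matrix could be formed; scipy's Hadamard matrix is used as an assumption here, and whether the slides' experiments normalize H_n is not specified.

```python
import numpy as np
from scipy.linalg import hadamard

def hadamard_sensing_matrix(n=256, blocks=3, rng=np.random.default_rng(0)):
    """Stack H_n S_l for l = 1..blocks, with S_l random diagonal sign matrices."""
    H = hadamard(n)                                  # n must be a power of 2
    rows = []
    for _ in range(blocks):
        signs = rng.choice([-1.0, 1.0], size=n)      # diagonal of S_l
        rows.append(H * signs)                       # H_n S_l (scales columns of H)
    return np.vstack(rows)                           # shape (blocks*n, n)

A = hadamard_sensing_matrix()
print(A.shape)  # (768, 256)
```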
Experiment 3: digit recovery
Left: true image. Middle: spectral initialization. Right: solution.
Experiment 3: digit recovery
[Figure: success probability and matrix multiply count vs. p_fail (0 to 0.2)]
Performance of composite optimization scheme versus failure probability
Experiment 4: real images
Signal size n = 2^22, measurements m = 3 · 2^24
Composite optimization at scale
Question: What if we have composite problems with a really big sample?
- Typical stochastic optimization setup,
    f(x) = E[F(x; S)]  where  F(x; S) = h(c(x; S); S)
- Example: large scale (robust) nonlinear regression
    f(x) = (1/m) Σ_{i=1}^m |φ(⟨a_i, x⟩) − b_i|
A stochastic composite method
- Define the (random) convex approximation
    F_x(y; s) = h(c(x; s) + ∇c(x; s)^T (y − x); s),   where c(x; s) + ∇c(x; s)^T (y − x) ≈ c(y; s)
- Then iterate for k = 1, 2, . . .
    S_k iid∼ P
    x_{k+1} = argmin_{x ∈ X} { F_{x_k}(x; S_k) + (1/2α_k) ‖x − x_k‖_2^2 }
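For the robust nonlinear regression example F(x; s) = |φ(⟨a_i, x⟩) − b_i|, the single-sample subproblem has a closed form (a soft-thresholding step along the linearization direction). The derivation is a standard computation done here, not given on the slides, and φ and its derivative below default to the phase-retrieval square as an illustration.

```python
import numpy as np

def stochastic_prox_linear_step(x, a_i, b_i, alpha,
                                phi=np.square, dphi=lambda t: 2.0 * t):
    """One stochastic prox-linear step for F(x; s) = |phi(<a_i, x>) - b_i|.

    Solves  min_y |c + g^T (y - x)| + ||y - x||^2 / (2*alpha)  in closed form,
    where c = phi(<a_i, x>) - b_i and g = dphi(<a_i, x>) * a_i.
    """
    t = a_i @ x
    c = phi(t) - b_i
    g = dphi(t) * a_i
    q = g @ g
    if q == 0.0:
        return x
    u = np.sign(c) * max(abs(c) - alpha * q, 0.0)   # prox of |.| with parameter alpha*q at c
    return x + (u - c) / q * g
```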
Understanding convergence behavior
Ordinary differential equations (gradient flow):
    ẋ = −∇f(x),  i.e.  (d/dt) x(t) = −∇f(x(t))
Ordinary differential inclusions (subgradient flow):
    ẋ ∈ −∂f(x),  i.e.  (d/dt) x(t) ∈ −∂f(x(t))
The differential inclusion
For the stochastic function
    f(x) := E[F(x; S)] = E[h(c(x; S); S)] = ∫ h(c(x; s); s) dP(s)
the generalized subgradient (for non-convex, non-smooth f) is [D. & Ruan 17]
    ∂f(x) = ∫ ∇c(x; s) ∂h(c(x; s); s) dP(s)
Theorem (D. & Ruan 17)
For the stochastic composite problem, the subdifferential inclusion ẋ ∈ −∂f(x) has a unique trajectory for all time and
    f(x(t)) − f(x(0)) ≤ −∫_0^t ‖∂f(x(τ))‖^2 dτ.
It also has limit points and they are stationary.
The limiting differential inclusion
Recall our iteration
    x_{k+1} = argmin_x { F_{x_k}(x; S_k) + (1/2α_k) ‖x − x_k‖_2^2 }.
Optimality conditions: using F_x(y; s) = h(c(x; s) + ∇c(x; s)^T (y − x); s),
    0 ∈ ∇c(x_k; s) ∂h(c(x_k; s) + ∇c(x_k; s)^T (x_{k+1} − x_k); s) + (1/α_k)[x_{k+1} − x_k]
where c(x_k; s) + ∇c(x_k; s)^T (x_{k+1} − x_k) = c(x_k; s) ± O(‖x_k − x_{k+1}‖_2), i.e.
    (1/α_k)[x_{k+1} − x_k] ∈ −∇c(x_k; s) ∂h(c(x_k; s); s) + subgradient mess + Noise
                           = −∂f(x_k) + subgradient mess + Noise
Graphical example
Iterate x_{k+1} = argmin_x { F_{x_k}(x; S_k) + (1/2α_k) ‖x − x_k‖_2^2 }
A convergence guarantee
Consider the stochastic composite optimization problem
    minimize_{x ∈ X}  f(x) := E[F(x; S)]  where  F(x; s) = h(c(x; s); s).
Use the iteration
    x_{k+1} = argmin_{x ∈ X} { F_{x_k}(x; S_k) + (1/2α_k) ‖x − x_k‖_2^2 }.
Theorem (D. & Ruan 17)
Assume X is compact and Σ_{k=1}^∞ α_k = ∞, Σ_{k=1}^∞ α_k^2 < ∞. Then the sequence {x_k} satisfies
(1) f(x_k) converges
(2) All cluster points of x_k are stationary
Experiment: noiseless phase retrieval
[Figure: error vs. iteration (log scale, 10^-8 to 10^-2); legend entries: prox, sprox, sgd]
Conclusions
1. Broadly interesting structures for non-convex problems that are still approximable
2. Statistical modeling allows solution of non-trivial, non-smooth, non-convex problems
3. Large scale efficient methods still important
References
- Solving (most) of a set of quadratic equalities: Composite optimization for robust phase retrieval, arXiv:1705.02356
- Stochastic Methods for Composite Optimization Problems, arXiv:1703.08570