
Sparse Optimization

Lecture 1: Review of Convex Optimization

Instructor: Wotao Yin

July 2013

online discussions on piazza.com

Those who complete this lecture will know

• convex optimization background

• various standard concepts and terminology

• reformulating ℓ1 optimization and its optimality conditions


Resources for convex optimization

• Book: Convex Analysis by R. T. Rockafellar

• Book: Convex Optimization by S. Boyd and L. Vandenberghe, along with online videos and slides

• Book: Introductory Lectures on Convex Optimization: A Basic Course by Y. Nesterov

• A large number of lecture slides, notes, and videos available online


Review: mathematical optimization

Formulation

minimize_x   f0(x)
subject to   fi(x) ≤ 0,  i = 1, . . . , m,
             hj(x) = 0,  j = 1, . . . , p.

• decision variables: x = (x1, . . . , xn)

• objective function: f0 : Rn → R

• functions defining inequality constraints: fi : Rn → R, i = 1, . . . ,m

• functions defining equality constraints: hj : Rn → R, j = 1, . . . , p
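As a concrete illustration of the standard form (my addition, not from the slides), here is a minimal sketch using scipy.optimize.minimize on a made-up instance; note that SciPy's 'ineq' convention is g(x) ≥ 0, so each fi(x) ≤ 0 is passed as −fi:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical instance: minimize f0(x) = ||x||^2
# subject to f1(x) = 1 - x1 - x2 <= 0 and h1(x) = x1 - 2*x2 = 0.
f0 = lambda x: np.sum(x**2)

constraints = [
    # SciPy expects g(x) >= 0 for 'ineq', so pass -f1.
    {"type": "ineq", "fun": lambda x: -(1 - x[0] - x[1])},
    {"type": "eq",   "fun": lambda x: x[0] - 2 * x[1]},
]

res = minimize(f0, x0=np.zeros(2), constraints=constraints)
print(res.x)  # approximately (2/3, 1/3)
```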


Terminology

• feasible solutions: all points x satisfying the constraints fi(x) ≤ 0 (i = 1, . . . , m) and hj(x) = 0 (j = 1, . . . , p).

• feasible set: the set of all feasible solutions, often denoted by X.

• (global) (optimal) solution: a feasible solution x∗ that achieves the minimum objective value among all feasible solutions.

• local (optimal) solution: a feasible solution x∗ that achieves the minimum objective value over a neighborhood of x∗, say, the set {x : ‖x − x∗‖ ≤ δ} ∩ X for some δ > 0.


Some examples

• Find two nonnegative numbers whose sum adds up to 9 and so that the product of the two numbers is a maximum.

• Find the largest area of a rectangular region given that its perimeter is no greater than 100.

• Given a sequence of nonnegative numbers, find a start point and an end point so that the partial sum of the sequence between the two points is a maximum (a linear-time algorithm is sketched below).
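The third example is the classic maximum partial-sum (subarray) problem. As a sketch (my addition), Kadane's linear-time scan solves it even when entries may be negative; for nonnegative sequences the whole sequence is trivially optimal:

```python
def max_partial_sum(seq):
    """Kadane's algorithm: return (best_sum, start, end) with end exclusive."""
    best_sum, best_start, best_end = float("-inf"), 0, 0
    cur_sum, cur_start = 0, 0
    for i, v in enumerate(seq):
        if cur_sum <= 0:          # a nonpositive prefix never helps; restart here
            cur_sum, cur_start = v, i
        else:
            cur_sum += v
        if cur_sum > best_sum:
            best_sum, best_start, best_end = cur_sum, cur_start, i + 1
    return best_sum, best_start, best_end

print(max_partial_sum([1, -3, 4, -1, 2, -5, 3]))  # (5, 2, 5) -> subsequence [4, -1, 2]
```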


Solving optimization problems

In general, everything is optimization, but optimization problems are generally not solvable, even by the most powerful computers.

Some classes of problems can be solved efficiently and reliably, for example:

• least-squares problems

• linear programming problems

• quadratic programming problems

• convex optimization problems

• a subclass of network-flow problems

• submodular function minimization

(.... more, but not much more...)

• some sparse optimization problems


Least squares

minimize_x   ‖Ax − b‖₂²

• analytic solution x∗ = (AᵀA)⁻¹Aᵀb if A has independent columns (checked numerically below)

• reliable and efficient algorithms and software packages

• computation time proportional to n²k (A ∈ R^{k×n}), less if structured

• a mature technology (unless A is huge and/or distributed)
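A quick numerical sketch (my addition, assuming NumPy): the normal-equations formula above matches numpy.linalg.lstsq, which is the numerically preferred route since it avoids forming AᵀA:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 5))   # k x n with independent columns
b = rng.standard_normal(50)

x_normal = np.linalg.solve(A.T @ A, A.T @ b)       # x* = (A^T A)^{-1} A^T b
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)    # QR/SVD based, better conditioned

print(np.allclose(x_normal, x_lstsq))  # True
```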


Linear programming (LP)

minimize_x   cᵀx
subject to   aᵢᵀx ≤ bᵢ,  i = 1, . . . , m

• no analytic formula for solutions

• reliable and efficient algorithms and software packages

• computation time proportional to n²m if m ≥ n, less with structured data

• a mature technology

• a few standard tricks used to convert problems (with ℓ1 or ℓ∞ norms, or piecewise linear functions) into linear programs
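A minimal sketch (my addition) of the slide's LP form with scipy.optimize.linprog; note that linprog defaults to x ≥ 0, so free variables need explicit bounds:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical data: minimize c^T x subject to A x <= b.
c = np.array([1.0, 2.0])
A = np.array([[-1.0, 0.0], [0.0, -1.0], [1.0, 1.0]])
b = np.array([0.0, 0.0, 1.0])

res = linprog(c, A_ub=A, b_ub=b, bounds=[(None, None)] * 2)
print(res.x)  # [0., 0.] here, since c >= 0 and x = 0 is feasible
```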


Convex optimization

minimize_x   f0(x)
subject to   fi(x) ≤ 0,  i = 1, . . . , m,
             Ax = b,

where the objective and constraint functions are convex, i.e.,

fi(θx1 + (1 − θ)x2) ≤ θfi(x1) + (1 − θ)fi(x2)

for all i = 0, 1, . . . , m, θ ∈ (0, 1), and x1, x2 ∈ dom fi.

• no analytic solution

• relatively reliable and efficient algorithms and software packages

• computation time (roughly) proportional to max{n³, n²m, F}, where F is the cost of evaluating the fi's and their first and second derivatives

• almost a technology

Least-squares problems and linear programs are special cases of convex programs.
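As an illustration (my addition, assuming the CVXPY modeling package), a convex problem in this standard form can be specified and solved directly:

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 5))
b = A @ np.ones(5)                          # makes Ax = b feasible

x = cp.Variable(5)
objective = cp.Minimize(cp.sum_squares(x))  # convex f0
constraints = [cp.norm(x, 1) - 10 <= 0,     # convex f1(x) <= 0
               A @ x == b]                  # affine equality
prob = cp.Problem(objective, constraints)
prob.solve()
print(prob.value, x.value)
```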


Non-convex optimization problems

General optimization problems are non-convex

minimize_x   f0(x)
subject to   fi(x) ≤ 0,  i = 1, . . . , m

Local optimization methods

• find a solution which minimizes f0 among feasible solutions near it

• fast; can handle large problems

• require initial guess

• provide no information about the distance to global optima

Global optimization methods

• find the global solution

• worst-case complexity grows exponentially with problem size.

These methods are often based on solving convex subproblems.


Brief history of convex optimization

theory (convex analysis): 1900–1970s

algorithms

• 1947: simplex algorithm for linear programming (Dantzig)

• 1960s: early interior-point methods (Fiacco & McCormick, Dikin, . . . )

• 1970s: ellipsoid method and other subgradient methods

• 1980s: polynomial-time interior-point methods for linear programming (Karmarkar 1984)

• late 1980s–2000s: polynomial-time interior-point methods for nonlinear convex optimization (Nesterov & Nemirovski 1994)

• recently: revived interest in first-order (gradient-based) algorithms for solving big-data problems

applications

• before 1990: mostly in operations research; few in engineering

• since 1990: many new applications in engineering (control, signal processing, communications, circuit design, . . . ); new problem classes (semidefinite and second-order cone programming, robust optimization, sparse optimization)

Convex set

A set C is called convex if the segment between any two points in C lies entirely in C. Formally, C is convex if for any x1, x2 ∈ C and θ ∈ (0, 1), we have

θx1 + (1 − θ)x2 ∈ C.

Examples:

• Euclidean balls: B(xc, r) = {x : ‖x − xc‖₂ ≤ r}

• ellipsoid: {x : (x − xc)ᵀP⁻¹(x − xc) ≤ 1} with P symmetric positive definite

• polyhedra: {x : Ax ≤ b, Cx = d} with A ∈ R^{m×n}, C ∈ R^{p×n}

• several operations preserve convexity: intersection; affine functions; perspective function; linear-fractional functions

Most of the time, recognizing a convex set is not difficult; a quick numerical sanity check follows.
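A small numerical sanity check (my addition, assuming NumPy): convex combinations of points in a Euclidean ball remain in the ball:

```python
import numpy as np

rng = np.random.default_rng(0)
xc, r = np.zeros(3), 1.0                          # ball B(xc, r)
in_ball = lambda x: np.linalg.norm(x - xc) <= r + 1e-12

def random_point_in_ball():
    d = rng.standard_normal(3)
    return xc + (rng.random() * r) * d / np.linalg.norm(d)

for _ in range(1000):
    x1, x2 = random_point_in_ball(), random_point_in_ball()
    theta = rng.random()
    assert in_ball(theta * x1 + (1 - theta) * x2)  # convex combination stays in C
print("ok")
```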


Convex functions

A function f : Rⁿ → R is convex if dom f is convex and for any x1, x2 ∈ dom f and θ ∈ (0, 1), we have

f(θx1 + (1 − θ)x2) ≤ θf(x1) + (1 − θ)f(x2).

f is concave if (−f) is convex.

f is strictly convex if dom f is convex and, for any x1 ≠ x2 in dom f and θ ∈ (0, 1),

f(θx1 + (1 − θ)x2) < θf(x1) + (1 − θ)f(x2).


Examples of convex functions

Examples in Rⁿ

• affine function f(x) = aᵀx + b

• norms: ‖x‖p = (∑_{i=1}^n |xi|^p)^{1/p} for p ≥ 1; ‖x‖∞ = maxᵢ |xi|

Examples in R^{m×n}

• affine function f(X) = tr(AᵀX) + b = ∑_{i=1}^m ∑_{j=1}^n Aij Xij + b

• spectral norm (maximum singular value) f(X) = ‖X‖₂ = σmax(X) = (λmax(XᵀX))^{1/2}

• nuclear norm f(X) = ‖X‖∗ = ∑_{i=1}^{min{m,n}} σi(X) (both matrix norms are checked numerically below)
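A numerical check (my addition, assuming NumPy): both matrix norms come straight out of the SVD:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 6))
sigma = np.linalg.svd(X, compute_uv=False)   # singular values, descending

spectral = sigma[0]                          # ||X||_2 = sigma_max(X)
nuclear = sigma.sum()                        # ||X||_* = sum of singular values

print(np.isclose(spectral, np.linalg.norm(X, 2)))      # True
print(np.isclose(nuclear, np.linalg.norm(X, 'nuc')))   # True
```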


Terminology

• extended value: f may take on the value +∞, reducing the need to track dom f

• proper: there exists x such that f(x) is finite

• lower semi-continuous (LSC): lim inf_{x→x0} f(x) ≥ f(x0)

• closed: f has a closed epigraph epi f = {(x, µ) : µ ∈ R, µ ≥ f(x)}

• Lemma: a proper convex function is closed if and only if it is LSC

• subdifferential (a numerical check follows): ∂f(x) = {p : f(y) ≥ f(x) + 〈p, y − x〉 ∀y}

  - each p ∈ ∂f(x) is called a subgradient

  - if f ∈ C¹ near x, then ∂f(x) = {∇f(x)}
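To make the definition concrete (my addition): for f(x) = ‖x‖1, any p with pi = sign(xi) where xi ≠ 0 and pi ∈ [−1, 1] where xi = 0 is a subgradient. A random test of the subgradient inequality:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.abs(x).sum()                 # f(x) = ||x||_1

x = np.array([1.5, 0.0, -2.0])
p = np.where(x != 0, np.sign(x), rng.uniform(-1, 1, x.size))

# subgradient inequality: f(y) >= f(x) + <p, y - x> for all y
for _ in range(1000):
    y = rng.standard_normal(x.size) * 5
    assert f(y) >= f(x) + p @ (y - x) - 1e-12
print("p is a subgradient at x")
```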


First-order condition

f is differentiable if the gradient

∇f(x) = [∂f(x)/∂x1, ∂f(x)/∂x2, . . . , ∂f(x)/∂xn]ᵀ

exists at every x ∈ dom f.

first-order condition: a differentiable f with convex domain is convex iff

f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) for all x, y ∈ dom f

first-order condition: a subdifferentiable f with convex domain is convex iff

f(y) ≥ f(x) + pᵀ(y − x) for all x, y ∈ dom f, p ∈ ∂f(x)

first-order optimality condition: x∗ ∈ argmin f(x) ⟺ 0 ∈ ∂f(x∗) (applied below to derive soft-thresholding)
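As an illustration of 0 ∈ ∂f(x∗) (my addition): for the scalar problem minimize_x λ|x| + ½(x − b)², the condition 0 ∈ λ∂|x∗| + (x∗ − b) yields the soft-thresholding formula central to ℓ1 optimization:

```python
import numpy as np

def soft_threshold(b, lam):
    """Solve min_x lam*|x| + 0.5*(x - b)^2 via 0 in lam*d|x| + (x - b)."""
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

# brute-force check on a fine grid
b, lam = 1.3, 0.5
grid = np.linspace(-3, 3, 200001)
obj = lam * np.abs(grid) + 0.5 * (grid - b) ** 2
print(soft_threshold(b, lam), grid[np.argmin(obj)])  # both ~0.8
```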


Second-order condition

f is twice differentiable if the Hessian ∇²f(x) ∈ Sⁿ, defined by

∇²f(x)ij = ∂²f(x)/∂xi∂xj,  i, j = 1, . . . , n,

exists at every x ∈ dom f.

second-order condition: a twice differentiable f with convex domain is convex iff

∇²f(x) ⪰ 0 for all x ∈ dom f.

Furthermore, if ∇²f(x) ≻ 0 for all x ∈ dom f, then f is strictly convex.

Very useful in general convex optimization, but less so in sparse optimization.
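A quick numerical illustration (my addition): for a quadratic f(x) = ½xᵀQx + cᵀx the Hessian is Q everywhere, so convexity reduces to checking that Q's eigenvalues are nonnegative:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
Q = A.T @ A                      # Gram matrices are positive semidefinite

# f(x) = 0.5 x^T Q x + c^T x has Hessian Q for all x
eigvals = np.linalg.eigvalsh(Q)  # eigenvalues of a symmetric matrix
print(eigvals)                   # all >= 0, hence f is convex
```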


Convex optimization formulation

Standard-form convex optimization problem

minimize_x   f0(x)
subject to   fi(x) ≤ 0,  i = 1, . . . , m,
             Ax = b.

- f0, f1, . . . , fm are convex; equality constraints are affine.

- the feasible set of a convex optimization problem is convex.


Local and global solutions

Theorem

Any local solution of a convex problem is a global solution.

Proof.

Suppose that x is a local solution, y is a global solution, and f0(y) < f0(x).

Consider z = θy + (1 − θ)x. By convexity,

f0(z) ≤ θf0(y) + (1 − θ)f0(x) < f0(x)

for any θ ∈ (0, 1). Since ‖x − z‖ can be made arbitrarily small by taking θ small, x cannot be a local solution.


Optimality criterion for differentiable f0

Since the feasible set is convex and

f0(y) ≥ f0(x) + ∇f0(x)ᵀ(y − x),

x is optimal iff it is feasible and

∇f0(x)ᵀ(y − x) ≥ 0 for all feasible y.


• unconstrained problem: x is optimal if and only if

  x ∈ dom f0,  ∇f0(x) = 0

• equality constrained problem:

  minimize_x  f0(x)  subject to  Ax = b

  x is optimal if and only if there exists a vector ν such that

  x ∈ dom f0,  Ax = b,  ∇f0(x) + Aᵀν = 0

• minimization over the nonnegative orthant (verified numerically below):

  minimize_x  f0(x)  subject to  x ≥ 0

  x is optimal if and only if

  x ∈ dom f0,  x ≥ 0,  and for each i:  ∇f0(x)i ≥ 0 if xi = 0;  ∇f0(x)i = 0 if xi > 0
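A numerical sanity check (my addition): for f0(x) = ½‖x − c‖₂² over x ≥ 0 the solution is the projection x∗ = max(c, 0), and the componentwise conditions above hold:

```python
import numpy as np

c = np.array([1.0, -2.0, 0.5, -0.1])
x_star = np.maximum(c, 0.0)        # projection of c onto the orthant
grad = x_star - c                  # gradient of 0.5*||x - c||^2 at x*

# where x_i > 0 the gradient vanishes; where x_i = 0 it is nonnegative
assert np.allclose(grad[x_star > 0], 0.0)
assert np.all(grad[x_star == 0] >= 0.0)
print(x_star, grad)
```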


Unconstrained problem with nondifferentiable f0

g is a subgradient of a convex function f at x ∈ dom f if

f(y) ≥ f(x) + gᵀ(y − x)  ∀y ∈ dom f.

The subdifferential ∂f(x) of f at x is the set of all subgradients:

∂f(x) = {g : gᵀ(y − x) ≤ f(y) − f(x)  ∀y ∈ dom f}

x∗ minimizes f0(x) if and only if

0 ∈ ∂f0(x∗)


Optimality criteria in the general case

Standard form problem (not necessarily convex)

minimize_x   f0(x)
s.t.         fi(x) ≤ 0,  i = 1, . . . , m
             hj(x) = 0,  j = 1, . . . , p

domain D, optimal value p∗

Lagrangian: L : Rⁿ × Rᵐ × Rᵖ → R with dom L = D × Rᵐ × Rᵖ,

L(x, λ, ν) = f0(x) + ∑_{i=1}^m λi fi(x) + ∑_{j=1}^p νj hj(x)

• λi is Lagrange multiplier associated with fi(x) ≤ 0

• νj is Lagrange multiplier associated with hj(x) = 0


Lagrange dual function

Lagrange dual function: g : Rᵐ × Rᵖ → R,

g(λ, ν) = inf_{x∈D} L(x, λ, ν)
        = inf_{x∈D} ( f0(x) + ∑_{i=1}^m λi fi(x) + ∑_{j=1}^p νj hj(x) )

g is concave and can be −∞ for some (λ, ν).

Lower bound property: if λ ⪰ 0, then g(λ, ν) ≤ p∗. A worked example follows.
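A worked example (my addition): for the LP minimize cᵀx subject to Ax ≤ b, the dual function has a closed form, since a nonzero linear function of x is unbounded below:

```latex
% Dual function of the LP: minimize c^T x subject to Ax <= b.
\begin{aligned}
L(x,\lambda) &= c^\top x + \lambda^\top (Ax - b)
              = (c + A^\top \lambda)^\top x - b^\top \lambda,\\
g(\lambda) &= \inf_x L(x,\lambda)
 = \begin{cases}
     -b^\top \lambda & \text{if } A^\top \lambda + c = 0,\\
     -\infty & \text{otherwise.}
   \end{cases}
\end{aligned}
```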


Dual problem

Lagrange dual problem

maximize_{λ,ν}   g(λ, ν)
subject to       λ ⪰ 0

• finds the best lower bound on p∗

• a convex optimization problem; optimal value denoted d∗

• (λ, ν) is dual feasible if λ ⪰ 0 and (λ, ν) ∈ dom g

Strong duality: d∗ = p∗

• does not hold in general

• (usually) holds for convex problems

• conditions that guarantee strong duality in convex problems are called constraint qualifications


Slater’s constraint qualification

Strong duality holds for a convex problem

minimize_x   f0(x)
subject to   fi(x) ≤ 0,  i = 1, . . . , m,
             Ax = b

if it is strictly feasible, i.e.,

∃x ∈ int D :  fi(x) < 0, i = 1, . . . , m,  Ax = b

• also guarantees that the dual optimum is attained (if p∗ > −∞)

• linear inequalities do not need to hold with strict inequality

• there are many other types of constraint qualifications

• some non-convex optimization problems may have strong duality


Complementary slackness

Assume strong duality holds, x∗ is primal optimal, and (λ∗, ν∗) is dual optimal. Then

f0(x∗) = g(λ∗, ν∗) = inf_x ( f0(x) + ∑_{i=1}^m λ∗i fi(x) + ∑_{j=1}^p ν∗j hj(x) )

       ≤ f0(x∗) + ∑_{i=1}^m λ∗i fi(x∗) + ∑_{j=1}^p ν∗j hj(x∗)

       ≤ f0(x∗)

Hence both inequalities hold with equality, which gives:

• x∗ minimizes L(x, λ∗, ν∗)

• λ∗i fi(x∗) = 0 for i = 1, . . . , m (complementary slackness)


Karush-Kuhn-Tucker (KKT) conditions

KKT conditions for a problem with differentiable fi, hj :

• primal constraints: fi(x) ≤ 0, i = 1, . . . ,m, hj(x) = 0, j = 1, . . . , p

• dual constraints: λ ⪰ 0

• complementary slackness: λi fi(x) = 0, i = 1, . . . , m

• gradient of the Lagrangian with respect to x vanishes:

∇f0(x) + ∑_{i=1}^m λi ∇fi(x) + ∑_{j=1}^p νj ∇hj(x) = 0

If x̃, λ̃, ν̃ satisfy the KKT conditions for a convex problem, then they are optimal.
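A numerical illustration (my addition, assuming CVXPY): solvers expose dual variables, so complementary slackness can be checked directly on a small problem:

```python
import cvxpy as cp
import numpy as np

x = cp.Variable(2)
f1 = cp.sum_squares(x - np.array([2.0, 0.0])) - 1.0   # f1(x) <= 0: unit ball at (2, 0)
prob = cp.Problem(cp.Minimize(cp.sum_squares(x)), [f1 <= 0])
prob.solve()

lam = prob.constraints[0].dual_value
print(lam, f1.value)             # lam > 0 and f1(x*) ~= 0: constraint is active
print(lam * f1.value)            # ~0: complementary slackness
```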


Exercise: constrained/unconstrained ℓ1 problem

Consider two ℓ1 problems

minimize_x   ‖x‖1
subject to   Ax = b,
             l ≤ x ≤ u

and

minimize_x   ‖x‖1 + (λ/2)‖Ax − b‖₂²
subject to   l ≤ x ≤ u

Exercises: derive their

• LP or QP formulations

• Lagrange dual problems

• KKT conditions
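The derivations can be sanity-checked numerically (my addition, assuming CVXPY) by solving both problems directly and comparing against your LP/QP reformulations:

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 10))
b = A @ rng.uniform(-1, 1, 10)   # guarantees a feasible point within the bounds
l, u, lam = -2.0, 2.0, 10.0

x = cp.Variable(10)
constrained = cp.Problem(cp.Minimize(cp.norm1(x)),
                         [A @ x == b, l <= x, x <= u])
penalized = cp.Problem(cp.Minimize(cp.norm1(x) + lam / 2 * cp.sum_squares(A @ x - b)),
                       [l <= x, x <= u])
print(constrained.solve(), penalized.solve())
```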


Exercise: total variation problem∗

The discrete total variation of a vector x ∈ Rⁿ is

TV(x) = ∑_{i=1}^{n−1} |x_{i+1} − x_i|.

Consider the problem

minimize_x   TV(x) + (λ/2)‖Ax − b‖₂²
subject to   l ≤ x ≤ u

Exercises: derive its

• SOCP formulation (refer to Sec. 4.2.2 of Boyd & Vandenberghe)

• Lagrange dual problem

• KKT conditions
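As a quick check of the definition and a direct solve (my addition, assuming NumPy/CVXPY with hypothetical data):

```python
import cvxpy as cp
import numpy as np

x0 = np.array([0.0, 0.0, 1.0, 1.0, 0.5])
tv = np.abs(np.diff(x0)).sum()      # TV(x0) = |0-0| + |1-0| + |1-1| + |0.5-1| = 1.5
print(tv)

# solve the TV-regularized problem directly
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 5))
b = A @ x0
lam, l, u = 10.0, -1.0, 2.0

x = cp.Variable(5)
TV = cp.sum(cp.abs(x[1:] - x[:-1]))
prob = cp.Problem(cp.Minimize(TV + lam / 2 * cp.sum_squares(A @ x - b)),
                  [l <= x, x <= u])
prob.solve()
print(x.value)
```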
