Multiobjective Optimization

Frederico Gadelha Guimarães
fredericoguimaraes@ufmg.br

Department of Electrical Engineering
Universidade Federal de Minas Gerais

Belo Horizonte, Brazil

Introduction

General formulation of optimization problems

$$\min_x f(x) \in \mathbb{R}, \quad x \in \Phi$$

$$\Phi = \{\, x \in X : g_i(x) \le 0,\ i = 1, \dots, p;\ h_j(x) = 0,\ j = 1, \dots, q \,\}$$

Introduction

General formulation of optimization problems

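In vector notation, with $g$ and $h$ collecting all constraints:

$$\min_x f(x) \in \mathbb{R}, \quad x \in \Phi$$

$$\Phi = \{\, x \in X : g(x) \le 0,\ h(x) = 0 \,\}$$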

Definitions

Objective function

● The objective function (cost function or optimization criterion) is a function to be optimized (minimized) by the optimization algorithm.

Global optimum (global minimum)

● A point $x^* \in X$ is a global optimum of $f : X \subset \mathbb{R}^n \to Y \subset \mathbb{R}$ if

$$f(x^*) \le f(x), \quad \forall x \in X,\ x \ne x^*$$

Definitions

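Strong local minimum (strict local minimum)

● A strict local minimum $x^*$ of $f : X \subset \mathbb{R}^n \to Y \subset \mathbb{R}$ is defined in terms of its neighborhood $V_\epsilon(x^*)$:

$$f(x^*) < f(x), \quad \forall x \ne x^* \wedge x \in V_\epsilon(x^*)$$

Weak local minimum (non-strict local minimum)

● A weak local minimum satisfies the same condition with a non-strict inequality:

$$f(x^*) \le f(x), \quad \forall x \ne x^* \wedge x \in V_\epsilon(x^*)$$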

Definitions

Convex sets

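● A set $Q \subset \mathbb{R}^n$ is convex if

$$z = \lambda x_1 + (1 - \lambda) x_2, \quad z \in Q, \quad \forall x_1, x_2 \in Q \wedge 0 \le \lambda \le 1$$

● Examples

$$A = \{ (x_1, x_2) : x_1^2 + x_2^2 \le 4 \} \subset \mathbb{R}^2$$

$$A = \{ x : Ax \le b \}$$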

Definitions

Convex functions

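● A function $f : X \subset \mathbb{R}^n \to Y \subset \mathbb{R}$ is convex if

$$f[\lambda x_1 + (1 - \lambda) x_2] \le \lambda f(x_1) + (1 - \lambda) f(x_2), \quad \forall x_1, x_2 \in X \wedge 0 \le \lambda \le 1$$

● A function is strictly convex if the above inequality is strict for $x_1 \ne x_2$ and $0 < \lambda < 1$.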

Definitions

Convex functions

Let $f : X \subset \mathbb{R}^n \to Y \subset \mathbb{R}$.

● Sub-level region

$$R(f, \alpha) = \{ x \in X : f(x) \le \alpha \}$$

● The sub-level region of a convex function is a convex set.

● Level surface (or level curve)

$$S(f, \alpha) = \{ x \in X : f(x) = \alpha \}$$

Definitions

Unimodal function

● A function is unimodal if its sub-level region is a connected set for all values of alpha.

Multimodal function

● A function is multimodal if the associated sub-level region is disconnected for some value of alpha.

Definitions

Attraction basin

● Around local minima, there are regions in which the function behaves as if it were unimodal. Such regions are named attraction basins.

● The attraction basin $B(x^*)$ of a local minimum $x^*$ is defined by the greatest connected sub-level region $R(f, \alpha)$ that contains $x^*$.

● A local search method converges to the local minimum of an attraction basin if $x_0 \in B(x^*)$.

Definitions

Differentiable functions

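● A function is differentiable in the domain if there is a Gradient vector defined by

$$\nabla f(x) = \left( \frac{\partial f}{\partial x_1},\ \frac{\partial f}{\partial x_2},\ \cdots,\ \frac{\partial f}{\partial x_n} \right)', \quad x \in X$$

● A function is differentiable to the second order in the domain if there is a Hessian matrix given by

$$H(x) = \begin{pmatrix} \dfrac{\partial^2 f}{\partial x_1^2} & \cdots & \dfrac{\partial^2 f}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \dfrac{\partial^2 f}{\partial x_n^2} \end{pmatrix}, \quad x \in X$$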

Definitions

Differentiable functions

● Calculate the Gradient and the Hessian of the following function

$$f(x) = 10 (x_2 - x_1^2)^2 + (1 - x_1)^2$$
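As a worked check for this exercise, differentiating $f(x) = 10 (x_2 - x_1^2)^2 + (1 - x_1)^2$ term by term should give the following (a sketch of the calculation, worth verifying by hand):

$$\nabla f(x) = \begin{pmatrix} -40 x_1 (x_2 - x_1^2) - 2 (1 - x_1) \\ 20 (x_2 - x_1^2) \end{pmatrix}$$

$$H(x) = \begin{pmatrix} 120 x_1^2 - 40 x_2 + 2 & -40 x_1 \\ -40 x_1 & 20 \end{pmatrix}$$

At $x = (1, 1)$ the Gradient vanishes and the Hessian is positive definite, which is consistent with $(1, 1)$ being the minimum, where $f = 0$.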

Optimality conditions

Taylor series

● A continuously differentiable function can be approximated locally by its expansion in Taylor series:

$$f(x) = f(x_0) + \nabla f(x_0)' (x - x_0) + \tfrac{1}{2} (x - x_0)' H(x_0) (x - x_0) + O(\|x - x_0\|^3)$$

Optimality conditions

Theorem (Necessary condition of 1st order)

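● The Gradient at the local minimum is null.

Proof

$$f(x^*) \le f(x), \quad \forall x \in V_\epsilon(x^*)$$

$$f(x) \ge f(x^*)$$

$$f(x^*) + \nabla f^{*\prime} (x - x^*) + O(\|x - x^*\|^2) \ge f(x^*)$$

Choosing $x - x^* = -\alpha \nabla f^*$, with $\alpha > 0$:

$$\nabla f^{*\prime} (-\alpha \nabla f^*) + O(\|x - x^*\|^2) \ge 0$$

$$-\alpha \|\nabla f^*\|^2 + O(\|x - x^*\|^2) \ge 0$$

Dividing by $\alpha$ and taking the limit:

$$\lim_{x \to x^*} \frac{O(\|x - x^*\|^2)}{\alpha} = 0 \ \Rightarrow\ \nabla f^* = 0$$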

Optimality conditions

Theorem (Necessary condition of 2nd order)

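● The Hessian matrix at the local minimum is positive semidefinite.

Proof

$$f(x^*) \le f(x), \quad \forall x \in V_\epsilon(x^*)$$

$$f(x) \ge f(x^*)$$

Since $\nabla f^* = 0$, the Taylor expansion gives:

$$f(x^*) + \tfrac{1}{2} (x - x^*)' H^* (x - x^*) + O(\|x - x^*\|^3) \ge f(x^*)$$

$$\tfrac{1}{2} (x - x^*)' H^* (x - x^*) + O(\|x - x^*\|^3) \ge 0$$

Dividing by $\|x - x^*\|^2$:

$$\tfrac{1}{2} \frac{(x - x^*)'}{\|x - x^*\|} H^* \frac{(x - x^*)}{\|x - x^*\|} + \frac{O(\|x - x^*\|^3)}{\|x - x^*\|^2} \ge 0$$

Taking the limit $x \to x^*$, for any unit vector $u$:

$$u' H^* u \ge 0$$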

Optimality conditions

● What about constrained problems?

● How can we determine the optimality conditions?

Optimality conditions

Consider the following constrained problem:

$$\min f(x), \quad \text{subject to } h(x) = 0$$

At the minimum, the Gradients of the objective function and the constraint are parallel (why?):

$$\nabla f(x^*) = -\lambda^* \nabla h(x^*)$$

$$\nabla f(x^*) + \lambda^* \nabla h(x^*) = 0$$

$$\nabla [ f(x^*) + \lambda^* h(x^*) ] = 0$$

$$\nabla_x L(x^*, \lambda^*) = 0$$

Optimality conditions

The solution of the constrained problem:

$$\min f(x), \quad \text{subject to } h(x) = 0$$

is a critical point of the Lagrangian function:

$$L(x, \lambda) = f(x) + \lambda h(x)$$

$$\nabla_x L(x^*, \lambda^*) = 0$$

$$\nabla_\lambda L(x^*, \lambda^*) = 0$$

Optimality conditions

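Now consider the constrained problem (inequality constraint):

$$\min f(x), \quad \text{subject to } g(x) \le 0$$

With a slack variable $z$, the inequality becomes an equality:

$$g(x) + z^2 = 0$$

The Lagrangian function is given by:

$$L(x, z, \mu) = f(x) + \mu [ g(x) + z^2 ]$$

$$\nabla_x L(x^*, z^*, \mu^*) = 0 \ \Rightarrow\ \nabla f(x^*) + \mu^* \nabla g(x^*) = 0$$

$$\nabla_z L(x^*, z^*, \mu^*) = 0 \ \Rightarrow\ 2 \mu^* z^* = 0$$

$$\nabla_\mu L(x^*, z^*, \mu^*) = 0 \ \Rightarrow\ g(x^*) + z^{*2} = 0$$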

Optimality conditions

The second condition implies either $z^* = 0$ or $\mu^* = 0$.

● If $z^* = 0$, the solution lies at the border of the feasible region and we say that $g$ is an active constraint.

● If $\mu^* = 0$, then $z^*$ can be different from 0. The constraint is satisfied with slack at the solution and we say that $g$ is inactive.

$$\nabla_x L(x^*, z^*, \mu^*) = 0 \ \Rightarrow\ \nabla f(x^*) + \mu^* \nabla g(x^*) = 0$$

$$\nabla_z L(x^*, z^*, \mu^*) = 0 \ \Rightarrow\ 2 \mu^* z^* = 0$$

$$\nabla_\mu L(x^*, z^*, \mu^*) = 0 \ \Rightarrow\ g(x^*) + z^{*2} = 0$$

Optimality conditions

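We can replace the second condition by an equivalent one, thus eliminating the need for the slack variable $z$:

$$2 \mu^* z^* = 0 \ \Rightarrow\ \mu^* g(x^*) = 0$$

$$\nabla_x L(x^*, z^*, \mu^*) = 0 \ \Rightarrow\ \nabla f(x^*) + \mu^* \nabla g(x^*) = 0$$

$$\nabla_z L(x^*, z^*, \mu^*) = 0 \ \Rightarrow\ 2 \mu^* z^* = 0 \ \Rightarrow\ \mu^* g(x^*) = 0$$

$$\nabla_\mu L(x^*, z^*, \mu^*) = 0 \ \Rightarrow\ g(x^*) + z^{*2} = 0$$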

Optimality conditions

Karush-Kuhn-Tucker optimality conditions

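For the general problem

$$\min_x f(x) \in \mathbb{R}, \quad x \in \Phi$$

$$\Phi = \{\, x \in X : g_i(x) \le 0,\ i = 1, \dots, p;\ h_j(x) = 0,\ j = 1, \dots, q \,\}$$

the following conditions hold at the solution $x^*$:

$$\nabla f(x^*) + \sum_{i=1}^{p} \mu_i^* \nabla g_i(x^*) + \sum_{j=1}^{q} \lambda_j^* \nabla h_j(x^*) = 0$$

$$\mu_i^* g_i(x^*) = 0, \quad \mu_i^* \ge 0$$

$$g_i(x^*) \le 0, \quad h_j(x^*) = 0$$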
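The Karush-Kuhn-Tucker conditions can be checked numerically at a candidate point. The sketch below is illustrative only: the helper name `kkt_residuals` and the small test problem (minimize $x_1^2 + x_2^2$ subject to $1 - x_1 - x_2 \le 0$) are assumptions, not part of the original slides, and only inequality constraints are handled.

```python
def kkt_residuals(grad_f, grads_g, g_vals, mus):
    """Return (stationarity norm, worst complementarity, worst violation).

    Hypothetical helper: all gradients and constraint values are
    evaluated by the caller at the candidate point x*.
    """
    n = len(grad_f)
    stationarity = list(grad_f)
    for mu, grad_g in zip(mus, grads_g):
        for i in range(n):
            stationarity[i] += mu * grad_g[i]      # grad f + sum mu_i grad g_i
    stat_norm = sum(s * s for s in stationarity) ** 0.5
    complementarity = max(abs(mu * g) for mu, g in zip(mus, g_vals))
    violation = max(max(g, 0.0) for g in g_vals)   # g_i(x*) <= 0 expected
    return stat_norm, complementarity, violation


# Assumed example: min x1^2 + x2^2  subject to  g(x) = 1 - x1 - x2 <= 0
x_star = (0.5, 0.5)
mu_star = 1.0
grad_f = (2.0 * x_star[0], 2.0 * x_star[1])
grad_g = (-1.0, -1.0)
g_val = 1.0 - x_star[0] - x_star[1]
residuals = kkt_residuals(grad_f, [grad_g], [g_val], [mu_star])
```

In this example the constraint is active ($g(x^*) = 0$) and the multiplier is non-negative, so all three residuals should vanish.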

Deterministic optimization methods

● Derivative methods

– Gradient method (steepest descent method);

– Newton method;

– Marquardt method;

– Quasi-Newton methods;

– Conjugate Gradients methods;

● Non-derivative methods

– Nelder-Mead Simplex;

– Hooke-Jeeves method (pattern search);

Deterministic optimization methods

General structure of derivative methods (search direction based):

$$x_{k+1} \leftarrow x_k + \alpha_k d_k$$

● Methods vary in the way the step size $\alpha_k$ and the search direction $d_k$ are calculated.

Gradient method

● The Gradient method (or Cauchy method or Steepest Descent Method) is the simplest one among the derivative methods.

● It was developed by Cauchy in 1847.

A. L. Cauchy, Méthode générale pour la résolution des systèmes d’équations simultanées, Comptes Rendus de l’Academie des Sciences, Paris, Vol. 25, pp. 536–538, 1847.

$$d_k = -\nabla f(x_k)$$

Gradient method

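Algorithm: Gradient (Cauchy) method

Input: $x_0 \in X$, $f : X \to Y$

1. $k \leftarrow 0$

2. while stop criterion is not met do

3.   Calculate $\nabla f(x_k)$

4.   $d_k \leftarrow -\nabla f(x_k)$

5.   $\alpha_k \leftarrow \arg\min_\alpha f(x_k + \alpha d_k)$

6.   $x_{k+1} \leftarrow x_k + \alpha_k d_k$

7.   $k \leftarrow k + 1$

8. end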
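The gradient method can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the line search is a golden-section search restricted to the interval $[0, 1]$ (assumed to contain the minimizing step and to be unimodal there), the stop criterion is the Gradient norm, and the names `golden_section` and `gradient_method` are hypothetical.

```python
import math


def golden_section(phi, a=0.0, b=1.0, tol=1e-8):
    """Minimize a one-dimensional function phi on [a, b] (assumed unimodal)."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    while abs(b - a) > tol:
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if phi(c) < phi(d):
            b = d          # minimum lies in [a, d]
        else:
            a = c          # minimum lies in [c, b]
    return 0.5 * (a + b)


def gradient_method(f, grad, x0, eps=1e-6, max_iter=1000):
    """Steepest descent: d_k = -grad f(x_k), step size from a line search."""
    x = list(x0)
    iterations = 0
    while iterations < max_iter:
        g = grad(x)
        if math.sqrt(sum(gi * gi for gi in g)) <= eps:   # stop criterion
            break
        d = [-gi for gi in g]
        alpha = golden_section(
            lambda a: f([xi + a * di for xi, di in zip(x, d)]))
        x = [xi + alpha * di for xi, di in zip(x, d)]
        iterations += 1
    return x, iterations
```

On a poorly scaled quadratic such as $(x_1 - 1)^2 + 10 (x_2 + 2)^2$ the iterates approach the minimum in a zigzag pattern, which is why many iterations are needed even for a simple function.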

Gradient method

● The algorithm generates a monotonic sequence $\{x_k, f(x_k)\}$ such that $\nabla f(x_k) \to 0$ when $k \to \infty$.

● The step size is a non-negative scalar that minimizes the function in the search direction from the current solution, i.e., it represents a step in the minimizing direction.

● In practice, the step size should be calculated with a one-dimensional minimization method (line search).

Gradient method

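Numerical evaluation of the Gradient

● Finite difference approximation:

$$\left. \frac{\partial f}{\partial x_i} \right|_x \approx \frac{f(x + \delta_i e_i) - f(x)}{\delta_i}, \quad i = 1, \dots, n$$

● Central finite difference approximation:

$$\left. \frac{\partial f}{\partial x_i} \right|_x \approx \frac{f(x + \delta_i e_i) - f(x - \delta_i e_i)}{2 \delta_i}, \quad i = 1, \dots, n$$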
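Both approximations are straightforward to code. The sketch below assumes a single perturbation $\delta$ for all variables (the formulas allow a different $\delta_i$ per variable), and the function names are hypothetical. Note the cost: the forward formula needs $n + 1$ function evaluations, the central formula $2n$.

```python
def forward_gradient(f, x, delta=1e-6):
    """Forward finite difference approximation of the Gradient."""
    fx = f(x)
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_plus[i] += delta                 # x + delta * e_i
        grad.append((f(x_plus) - fx) / delta)
    return grad


def central_gradient(f, x, delta=1e-6):
    """Central finite difference approximation of the Gradient."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += delta                 # x + delta * e_i
        x_minus[i] -= delta                # x - delta * e_i
        grad.append((f(x_plus) - f(x_minus)) / (2.0 * delta))
    return grad
```

The central formula is usually more accurate for the same $\delta$, at roughly twice the number of function evaluations.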

Gradient method

Possible stop criteria

● Gradient close to zero

$$\|\nabla f(x_k)\| \le \epsilon$$

● Stabilization of the variables

$$\|x_k - x_{k-1}\| \le \epsilon$$

● Stabilization of function values

$$|f(x_k) - f(x_{k-1})| \le \epsilon$$

Gradient method

Difficulties

● Slow convergence;

● Zig zag effect;

● Trapped by non-differentiable regions;

One-dimensional minimization

Newton method

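Let $f : X \to Y$, $f \in C^2$.

● From its Taylor series expansion:

$$f(x) = f(x_k) + \nabla f(x_k)' (x - x_k) + \tfrac{1}{2} (x - x_k)' H(x_k) (x - x_k) + O(\|x - x_k\|^3)$$

$$f(x) \approx f(x_k) + \nabla f(x_k)' (x - x_k) + \tfrac{1}{2} (x - x_k)' H(x_k) (x - x_k)$$

● Setting the Gradient of this quadratic approximation to zero at $x = x_{k+1}$:

$$\nabla f(x_k) + H(x_k) (x_{k+1} - x_k) = 0$$

$$x_{k+1} = x_k - H^{-1}(x_k) \nabla f(x_k)$$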

Newton method

● If the objective function is quadratic, the Newton method gives the solution in one step;

● The inverse of the Hessian can be interpreted as a “correction” applied to the Gradient direction, considering the curvature of the function;

● In the general case, for non-quadratic functions, the step size should be computed:

$$x_{k+1} = x_k + \alpha_k d_k, \quad d_k = -H^{-1}(x_k) \nabla f(x_k)$$

Newton method

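Algorithm: Newton method

Input: $x_0 \in X$, $f : X \to Y$

1. $k \leftarrow 0$

2. while stop criterion is not met do

3.   Calculate $\nabla f(x_k)$

4.   Calculate $H(x_k)$

5.   $d_k \leftarrow -H^{-1}(x_k) \nabla f(x_k)$

6.   $\alpha_k \leftarrow \arg\min_\alpha f(x_k + \alpha d_k)$

7.   $x_{k+1} \leftarrow x_k + \alpha_k d_k$

8.   $k \leftarrow k + 1$

9. end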
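A minimal sketch of the Newton iteration follows. It makes two simplifying assumptions that differ from the algorithm above: the step size is fixed at $\alpha_k = 1$ (no line search), and the direction is obtained by solving $H(x_k) d_k = -\nabla f(x_k)$ with a small Gaussian elimination routine instead of forming the inverse. The names `solve` and `newton_method` are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]    # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):                 # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def newton_method(grad, hess, x0, eps=1e-8, max_iter=50):
    """Pure Newton iteration x_{k+1} = x_k - H^{-1}(x_k) grad f(x_k)."""
    x = list(x0)
    iterations = 0
    while iterations < max_iter:
        g = grad(x)
        if sum(gi * gi for gi in g) ** 0.5 <= eps:  # Gradient close to zero
            break
        d = solve(hess(x), [-gi for gi in g])       # Newton direction
        x = [xi + di for xi, di in zip(x, d)]       # alpha_k = 1 (assumption)
        iterations += 1
    return x, iterations
```

For a quadratic function with a positive definite Hessian, the loop should stop after a single step, which illustrates the one-step property of the Newton method.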

Newton method

● The method has quadratic convergence;

● Convergence is guaranteed under two assumptions:

– The Hessian is nonsingular, so its inverse exists;

– The Hessian is positive definite, which guarantees a descent direction;

Newton method

Numerical evaluation of the Hessian

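● Finite difference approximation:

$$\left. \frac{\partial^2 f}{\partial x_i \partial x_j} \right|_x \approx \frac{f(x + \delta_i e_i + \delta_j e_j) - f(x + \delta_i e_i) - f(x + \delta_j e_j) + f(x)}{\delta_i \delta_j}$$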
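The formula translates directly into code. The sketch below assumes the same perturbation $\delta$ for every variable and exploits the symmetry of the Hessian to compute only the upper triangle; the function name is hypothetical.

```python
def numerical_hessian(f, x, delta=1e-4):
    """Forward finite difference approximation of the Hessian matrix."""
    n = len(x)

    def f_shifted(i, j=None):
        # Evaluate f at x + delta*e_i (and additionally + delta*e_j if given).
        y = list(x)
        y[i] += delta
        if j is not None:
            y[j] += delta
        return f(y)

    fx = f(x)
    f_single = [f_shifted(i) for i in range(n)]
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            H[i][j] = (f_shifted(i, j) - f_single[i] - f_single[j] + fx) / delta ** 2
            H[j][i] = H[i][j]                      # symmetry
    return H
```

The approximation requires on the order of $n^2 / 2$ additional function evaluations, and its accuracy depends strongly on the choice of $\delta$: too large gives truncation error, too small amplifies rounding error.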

Newton method

Difficulties

● Requires the computation of the inverse of the Hessian matrix;

● Numerical ill-conditioning of the Hessian matrix makes the computation of the inverse difficult in practice;

● Numerical derivatives: greater numerical errors and many function evaluations required by finite difference approximation;

● Trapped by non-differentiable regions;

Marquardt method

Motivation

● The Cauchy method reduces the function value faster when the design vector is away from the optimum;

● The Newton method, on the other hand, converges fast when close to the optimum point;

● The Marquardt method attempts to take advantage of both;

D. Marquardt, An algorithm for least squares estimation of nonlinear parameters, SIAM Journal of Applied Mathematics, Vol. 11, No. 2, pp. 431–441, 1963.

Marquardt method

● The Marquardt method changes the diagonal of the Hessian:

$$\tilde{H}(x_k) = H(x_k) + \gamma_k I, \quad \gamma_k > 0$$

● When $\gamma_k$ is large (~$10^4$), the diagonal dominates and the inverse is given by

$$\tilde{H}^{-1}(x_k) = [H(x_k) + \gamma_k I]^{-1} \approx [\gamma_k I]^{-1} = \frac{1}{\gamma_k} I$$

Marquardt method

● The search direction of the Marquardt method is:

$$d_k = -[\tilde{H}(x_k)]^{-1} \nabla f(x_k)$$

$$d_k = -[H(x_k) + \gamma_k I]^{-1} \nabla f(x_k)$$

Marquardt method

Algorithm: Marquardt method

Input: $x_0 \in X$, $f : X \to Y$, $\gamma_0$

1. $k \leftarrow 0$

2. while stop criterion is not met do

3.   Calculate $\nabla f(x_k)$

4.   Calculate $H(x_k)$

5.   $d_k \leftarrow -[H(x_k) + \gamma_k I]^{-1} \nabla f(x_k)$

6.   $\alpha_k \leftarrow \arg\min_\alpha f(x_k + \alpha d_k)$

7.   $x_{k+1} \leftarrow x_k + \alpha_k d_k$

8.   If the function decreased, decrease $\gamma_k$ else increase $\gamma_k$

9.   $k \leftarrow k + 1$

10. end
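The following sketch shows one common way to code the Marquardt idea. It is a simplified variant and not the exact algorithm above: the line search is replaced by a unit step that is accepted only if the function decreases, $\gamma_k$ is divided or multiplied by 10 (an assumed factor), and the names `solve` and `marquardt` are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def marquardt(f, grad, hess, x0, gamma=1.0, eps=1e-6, max_iter=500):
    """Marquardt iteration with d_k = -[H(x_k) + gamma_k I]^{-1} grad f(x_k)."""
    x = list(x0)
    fx = f(x)
    n = len(x)
    for _ in range(max_iter):
        g = grad(x)
        if sum(gi * gi for gi in g) ** 0.5 <= eps:
            break
        H = hess(x)
        H_mod = [[H[i][j] + (gamma if i == j else 0.0) for j in range(n)]
                 for i in range(n)]
        d = solve(H_mod, [-gi for gi in g])
        x_new = [xi + di for xi, di in zip(x, d)]
        f_new = f(x_new)
        if f_new < fx:
            x, fx = x_new, f_new
            gamma /= 10.0      # function decreased: move towards Newton
        else:
            gamma *= 10.0      # no decrease: move towards the Gradient method
    return x
```

With a large $\gamma_k$ the step is short and close to the negative Gradient direction; as $\gamma_k$ shrinks the step approaches the Newton step.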

Marquardt method

Difficulties

● Numerical derivatives: greater numerical errors and many function evaluations required by finite difference approximation;

● Trapped by non-differentiable regions;

quasi-Newton methods

Motivation

● These methods approximate the inverse of the Hessian matrix, avoiding the computation of the inverse and the computation of the Hessian itself;

● Avoid numerical calculation of second derivatives;

● Maintain quadratic convergence of the Newton method;

C. G. Broyden, Quasi-Newton methods and their application to function minimization, Mathematics of Computation, Vol. 21, p. 368, 1967.

quasi-Newton methods

Approximating the inverse of the Hessian

● It is possible to approximate the inverse of the Hessian iteratively by using the recursive formula:

$$B_{k+1} = B_k + c\, z z', \quad B_k \to H^{-1}(x_k)$$

● The update of this estimate is built in terms of the Gradient vectors and the points in previous iterations.

quasi-Newton methods

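An approximate inverse of the Hessian is to be computed. Using the Taylor series expansion to expand the Gradient:

$$\nabla f(x) \approx \nabla f(x_0) + H(x_0) (x - x_0)$$

$$\nabla f(x_{k+1}) = \nabla f(x_0) + A_k (x_{k+1} - x_0)$$

$$\nabla f(x_k) = \nabla f(x_0) + A_k (x_k - x_0)$$

$$A_k (x_{k+1} - x_k) = \nabla f(x_{k+1}) - \nabla f(x_k)$$

With $d_k = x_{k+1} - x_k$ and $g_k = \nabla f(x_{k+1}) - \nabla f(x_k)$:

$$A_k d_k = g_k$$

$$d_k = [A_k]^{-1} g_k = B_k g_k, \quad B_k = [A_k]^{-1} \approx [H(x_0)]^{-1}$$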

quasi-Newton methods

Rank 1 update

$$B_{k+1} = B_k + c\, z z', \quad B_k \to H^{-1}(x_k)$$

$$d_k = B_{k+1} g_k$$

$$d_k = (B_k + c\, z z') g_k = B_k g_k + c\, z (z' g_k)$$

$$c\, z = \frac{d_k - B_k g_k}{z' g_k}$$

$$c = \frac{1}{z' g_k}, \quad z = d_k - B_k g_k$$

quasi-Newton methods

Rank 1 update

● Substituting $c$ and $z$ into $B_{k+1} = B_k + c\, z z'$ leads to the update formula:

$$B_{k+1} = B_k + \frac{(d_k - B_k g_k)(d_k - B_k g_k)'}{(d_k - B_k g_k)' g_k}$$

● This formula is attributed to Broyden.

quasi-Newton methods

Rank 2 update

● Rank 1 update formula guarantees symmetry but not positive definiteness;

● Rank 2 update formulas were developed to guarantee both symmetry and positive definiteness and are more robust in minimizing general nonlinear functions;

● A rank 2 update can be obtained as:

$$B_{k+1} = B_k + c_1 z_1 z_1' + c_2 z_2 z_2', \quad B_k \to H^{-1}(x_k)$$

● Following a similar procedure, rank 2 update formulas can be derived.

quasi-Newton methods

Davidon-Fletcher-Powell (DFP) formula

$$B_{k+1} = B_k + \frac{d_k d_k^T}{d_k^T g_k} - \frac{(B_k g_k)(B_k g_k)^T}{(B_k g_k)^T g_k}$$

$$d_k = x_{k+1} - x_k$$

$$g_k = \nabla f(x_{k+1}) - \nabla f(x_k)$$

quasi-Newton methods

Broyden-Fletcher-Goldfarb-Shanno (BFGS) formula

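$$B_{k+1} = B_k + \frac{d_k d_k^T}{d_k^T g_k} \left( 1 + \frac{g_k^T B_k g_k}{d_k^T g_k} \right) - \frac{B_k g_k d_k^T}{d_k^T g_k} - \frac{d_k g_k^T B_k}{d_k^T g_k}$$

$$d_k = x_{k+1} - x_k$$

$$g_k = \nabla f(x_{k+1}) - \nabla f(x_k)$$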
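The BFGS update of the inverse Hessian estimate can be coded term by term from the formula. The sketch below covers only the update itself, not a complete minimization loop; the function name `bfgs_update` and the plain list-of-lists matrix representation are assumptions made for illustration.

```python
def bfgs_update(B, d, g):
    """Return B_{k+1} from B_k, d_k = x_{k+1} - x_k and g_k = grad difference."""
    n = len(d)
    dg = sum(di * gi for di, gi in zip(d, g))                       # d' g
    Bg = [sum(B[i][j] * g[j] for j in range(n)) for i in range(n)]   # B g
    gB = [sum(g[i] * B[i][j] for i in range(n)) for j in range(n)]   # g' B
    gBg = sum(gi * bgi for gi, bgi in zip(g, Bg))                    # g' B g
    scale = (1.0 + gBg / dg) / dg
    return [[B[i][j]
             + scale * d[i] * d[j]          # d d' / (d' g) * (1 + g' B g / d' g)
             - Bg[i] * d[j] / dg            # B g d' / (d' g)
             - d[i] * gB[j] / dg            # d g' B / (d' g)
             for j in range(n)]
            for i in range(n)]
```

A useful sanity check is the relation $B_{k+1} g_k = d_k$, which the updated matrix should satisfy by construction. The update is only well defined when $d_k^T g_k \ne 0$, and it preserves positive definiteness when $d_k^T g_k > 0$.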

quasi-Newton methods

● Numerical experience indicates that the BFGS method is the best unconstrained method and is less influenced by errors in finding the optimal step size compared to the DFP method;

Conjugate gradient methods

● Presented first in 1908 by Schmidt, reinvented independently in 1948 and improved in the 1950s;

● Initially it was developed for solving linear systems of equations, still used for sparse matrices;

● In 1964, Fletcher and Reeves generalized the method to solve unconstrained nonlinear optimization problems.

R. Fletcher and C. M. Reeves, Function minimization by conjugate gradients, Computer Journal , Vol. 7, No. 2, pp. 149–154, 1964.

Conjugate gradient methods

Conjugate directions

● Let $A$ be a symmetric matrix. A set of $n$ vectors (or directions) $\{d_i\}$ is said to be conjugate (A-conjugate) if

$$d_i^T A d_j = 0, \quad \forall i \ne j, \ i = 1, \dots, n, \ j = 1, \dots, n$$

● Orthogonal directions are a special case of conjugate directions ($A = I$).

Quadratically Convergent Method

● If a minimization method, using exact arithmetic, can find the minimum point in $n$ steps while minimizing a quadratic function in $n$ variables, the method is called a quadratically convergent method.

Conjugate gradient methods

Theorem

● If a quadratic function

$$q(x) = \tfrac{1}{2} x^T A x + B^T x + C$$

is minimized sequentially, once along each direction of a set of $n$ mutually conjugate directions, the minimum of the function will be found at or before the $n$th step irrespective of the starting point.

Conjugate gradient methods

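Algorithm: Conjugate gradient method

Input: $x_0 \in X$, $f : X \to Y$

1. $k \leftarrow 0$; $r_0 \leftarrow -\nabla f(x_0)$; $d_0 \leftarrow r_0$

2. while stop criterion is not met do

3.   $\alpha_k \leftarrow \arg\min_\alpha f(x_k + \alpha d_k)$

4.   $x_{k+1} \leftarrow x_k + \alpha_k d_k$

5.   $r_{k+1} \leftarrow -\nabla f(x_{k+1})$

6.   Calculate $\beta_k$

7.   Calculate new conjugate direction $d_{k+1} \leftarrow r_{k+1} + \beta_k d_k$

8.   $k \leftarrow k + 1$

9. end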

Conjugate gradient methods

● Two well-known formulas, with $r_k = -\nabla f(x_k)$, are:

– Fletcher-Reeves:

$$\beta_k^{FR} = \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k}$$

– Polak-Ribière:

$$\beta_k^{PR} = \frac{r_{k+1}^T (r_{k+1} - r_k)}{r_k^T r_k}$$
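The $n$-step property can be illustrated on a quadratic $q(x) = \tfrac{1}{2} x^T A x + b^T x$. The sketch below is specialized to that case: instead of a general line search it uses the closed-form exact step $\alpha_k = r_k^T d_k / (d_k^T A d_k)$, which is valid only for quadratics. The function name and the `rule` argument are hypothetical.

```python
def conjugate_gradient_quadratic(A, b, x0, rule="FR"):
    """Run n conjugate gradient steps on q(x) = 0.5 x'Ax + b'x (A symmetric PD)."""
    n = len(x0)

    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    def matvec(v):
        return [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]

    x = list(x0)
    r = [-(ax + bi) for ax, bi in zip(matvec(x), b)]   # r_0 = -grad q(x_0)
    d = list(r)
    for _ in range(n):
        rr = dot(r, r)
        if rr == 0.0:                                  # already at the minimum
            break
        alpha = dot(r, d) / dot(d, matvec(d))          # exact step for a quadratic
        x = [xi + alpha * di for xi, di in zip(x, d)]
        r_new = [-(ax + bi) for ax, bi in zip(matvec(x), b)]
        if rule == "FR":                               # Fletcher-Reeves
            beta = dot(r_new, r_new) / rr
        else:                                          # Polak-Ribiere
            beta = dot(r_new, [rn - ro for rn, ro in zip(r_new, r)]) / rr
        d = [rn + beta * di for rn, di in zip(r_new, d)]
        r = r_new
    return x
```

For a quadratic with exact steps, both formulas generate the same directions, so either rule should reach the minimizer (the solution of $A x = -b$) within $n$ steps, up to rounding.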

Conjugate gradient methods

● For quadratic functions, the method converges in n iterations. In non-quadratic functions, directions are no longer conjugate.

● Since the method is based on the generation of n conjugate directions in an n-dimensional space, it should be restarted at every n iterations in non-quadratic problems;

● In general, quasi-Newton methods converge in fewer iterations but require more computation and more memory per iteration. Therefore, conjugate gradient methods are recommended for large-scale problems.

Nelder-Mead Simplex

● Derivative methods converge faster, but can only be used for problems characterized by differentiable functions;

● Moreover, in problems with many variables, the numerical errors introduced by numerical derivatives can become significant;

Nelder-Mead Simplex

● Nelder-Mead Simplex was developed in 1965 for nonlinear optimization;

● The method works with n+1 points at every iteration, eliminating the “worst” point;

● The n+1 points produce a simplex which “moves” in the search space;

J. A. Nelder and R. Mead, A simplex method for function minimization, Computer Journal, Vol. 7, p. 308, 1965.

Nelder-Mead Simplex

Convex hull

● The convex hull of a set A is given by the intersection of all convex sets that contain A.

Polytope

● The convex hull of a finite set of points is called polytope.

Simplex

● If n+1 n-dimensional points form n linearly independent vectors, then the convex hull of this set of points is a simplex.

Nelder-Mead Simplex

Notation

● Index of the vertex with the best objective function value: $b \in \{1, \dots, n+1\}$, vertex $x_b$

● Index of the vertex with the worst objective function value: $w \in \{1, \dots, n+1\}$, vertex $x_w$

● Index of the vertex with the second worst objective function value: $s \in \{1, \dots, n+1\}$, vertex $x_s$

● Centroid of the opposite face to the worst vertex:

$$\hat{x} = \frac{1}{n} \sum_{i=1,\, i \ne w}^{n+1} x_i$$

Nelder-Mead Simplex

● Reflection: reflects the worst solution and moves the simplex towards the direction of improvement:

$$x_r = \hat{x} + \alpha (\hat{x} - x_w), \quad \alpha = 1$$

● Expansion: expands the simplex towards the direction of improvement:

$$x_e = \hat{x} + \gamma (\hat{x} - x_w), \quad \gamma = 2$$

● Outer contraction:

$$x_{oc} = \hat{x} + \beta (\hat{x} - x_w), \quad \beta = 0.5$$

● Inner contraction:

$$x_{ic} = \hat{x} - \beta (\hat{x} - x_w), \quad \beta = 0.5$$

Nelder-Mead Simplex

Algorithm: Nelder-Mead Simplex method

Input: $x_0 \in X$, $f : X \to Y$

1. $k \leftarrow 0$

2. while stop criterion is not met do

3.   Perform reflection $x_r = \hat{x} + \alpha (\hat{x} - x_w)$

4.   if $f(x_r) < f(x_b)$ then

5.     calculate and evaluate $x_e = \hat{x} + \gamma (\hat{x} - x_w)$

6.     if $f(x_e) < f(x_r)$ then accept $x_e$ else accept $x_r$

7.   else if $f(x_r) < f(x_s)$ then accept $x_r$

8.   else if $f(x_r) < f(x_w)$ then

9.     calculate and evaluate $x_{oc} = \hat{x} + \beta (\hat{x} - x_w)$

10.     if $f(x_{oc}) \le f(x_w)$ then accept $x_{oc}$

Nelder-Mead Simplex

Algorithm: Nelder-Mead Simplex method (continued)

Input: $x_0 \in X$, $f : X \to Y$

11.   else if $f(x_r) \ge f(x_w)$ then

12.     calculate and evaluate $x_{ic} = \hat{x} - \beta (\hat{x} - x_w)$

13.     if $f(x_{ic}) \le f(x_w)$ then accept $x_{ic}$

14.     else shrink simplex

15.   end

16. end
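The Nelder-Mead steps can be sketched as follows. Several details are assumptions rather than part of the algorithm as stated: the initial simplex is built by perturbing $x_0$ along each coordinate axis, the stop criterion is the size of the simplex (instead of its volume), the simplex is also shrunk when an outer contraction is rejected, and the shrink step halves the distance of every vertex to the best one. The function name is hypothetical.

```python
def nelder_mead(f, x0, step=1.0, alpha=1.0, gamma=2.0, beta=0.5,
                tol=1e-8, max_iter=2000):
    """Nelder-Mead Simplex sketch; returns the best vertex and its value."""
    n = len(x0)
    simplex = [list(x0)]
    for i in range(n):                       # orthogonal perturbations of x0
        vertex = list(x0)
        vertex[i] += step
        simplex.append(vertex)
    fvals = [f(v) for v in simplex]
    for _ in range(max_iter):
        order = sorted(range(n + 1), key=lambda i: fvals[i])
        b, s, w = order[0], order[-2], order[-1]   # best, second worst, worst
        size = max(abs(simplex[i][j] - simplex[b][j])
                   for i in range(n + 1) for j in range(n))
        if size <= tol:                      # assumed stop criterion
            break
        xw = simplex[w]
        centroid = [sum(simplex[i][j] for i in range(n + 1) if i != w) / n
                    for j in range(n)]

        def along(t):
            # Point centroid + t * (centroid - worst vertex).
            return [c + t * (c - xwj) for c, xwj in zip(centroid, xw)]

        xr = along(alpha)
        fr = f(xr)
        accepted = None
        if fr < fvals[b]:                    # try expansion
            xe = along(gamma)
            fe = f(xe)
            accepted = (xe, fe) if fe < fr else (xr, fr)
        elif fr < fvals[s]:                  # plain reflection
            accepted = (xr, fr)
        else:                                # outer or inner contraction
            xc = along(beta if fr < fvals[w] else -beta)
            fc = f(xc)
            if fc <= fvals[w]:
                accepted = (xc, fc)
        if accepted is not None:
            simplex[w], fvals[w] = accepted
        else:                                # shrink towards the best vertex
            xb = simplex[b]
            for i in range(n + 1):
                if i != b:
                    simplex[i] = [xbj + 0.5 * (vj - xbj)
                                  for xbj, vj in zip(xb, simplex[i])]
                    fvals[i] = f(simplex[i])
    best = min(range(n + 1), key=lambda i: fvals[i])
    return simplex[best], fvals[best]
```

Only function values are compared, so no derivative information is needed at any point.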

Nelder-Mead Simplex

● Stop criterion is based on the volume of the simplex;

● Convergence for convex functions was proven only recently;

● Initialization: orthogonal perturbations of the initial solution;

● This is the method used by the function fminsearch in MATLAB;

● Can be coupled with one-dimensional search methods.

Pattern search method

● Pattern search method tests pattern points from the current solution;

● It alternates search directions parallel to the coordinate axes and search directions of the kind $x_{k+1} - x_k$.

R. Hooke and T. A. Jeeves, Direct search solution of numerical and statistical problems, Journal of the ACM, Vol. 8, No. 2, pp. 212–229, 1961.

M.J.D. Powell, An efficient method for finding the minimum of a function of several variables without calculating derivatives, Computer Journal, Vol. 7, No. 4, pp. 303–307, 1964.

Pattern search method

● Given the initial point: $x_0$, $y_0 = x_0$, $k = 0$

● The algorithm tests coordinate directions $y_i \pm \lambda e_{i+1}$, making a move when the function improves or staying at the current point.

● After searching all coordinates, we end up at $x_{k+1}$.

● Perform a search in the direction $x_{k+1} - x_k$: $y_0 = x_{k+1} + \alpha (x_{k+1} - x_k)$

● Restart the search from this point. If the function does not decrease, reduce the step size.

Pattern search method

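Algorithm: Hooke-Jeeves method

Input: $x_0 \in X$, $f : X \to Y$, $\lambda$, $\alpha$

1. $k \leftarrow 0$, $y_0 \leftarrow x_0$

2. while stop criterion is not met do

3.   foreach $i = 0, \dots, n-1$ do

4.     if $f(y_i \pm \lambda e_{i+1}) < f(y_i)$ then $y_{i+1} \leftarrow y_i \pm \lambda e_{i+1}$ else $y_{i+1} \leftarrow y_i$

5.   end

6.   if $f(y_n) < f(x_k)$ then $x_{k+1} \leftarrow y_n$; $y_0 \leftarrow x_{k+1} + \alpha (x_{k+1} - x_k)$

7.   else $\lambda \leftarrow \lambda / 2$; $x_{k+1} \leftarrow x_k$; $y_0 \leftarrow x_k$

8.   end

9.   $k \leftarrow k + 1$

10. end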
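The Hooke-Jeeves steps map almost line by line to code. In the sketch below the stop criterion is an assumption (the loop ends when the step size $\lambda$ falls below `lam_min`), the positive direction is tried before the negative one in each coordinate, and the function name is hypothetical.

```python
def hooke_jeeves(f, x0, lam=1.0, alpha=1.0, lam_min=1e-8, max_iter=10000):
    """Hooke-Jeeves pattern search sketch; returns the best point found."""
    n = len(x0)
    x = list(x0)          # current base point x_k
    y = list(x0)          # exploration point y_0
    for _ in range(max_iter):
        if lam < lam_min:                      # assumed stop criterion
            break
        # Exploratory moves: test +lambda and -lambda along each coordinate.
        for i in range(n):
            fy = f(y)
            for sign in (1.0, -1.0):
                trial = list(y)
                trial[i] += sign * lam
                if f(trial) < fy:
                    y = trial
                    break
        if f(y) < f(x):
            # Success: accept y_n and make a pattern move along x_{k+1} - x_k.
            x_new = y
            y = [xn + alpha * (xn - xo) for xn, xo in zip(x_new, x)]
            x = x_new
        else:
            # Failure: reduce the step size and restart from the base point.
            lam /= 2.0
            y = list(x)
    return x
```

For clarity the sketch re-evaluates $f$ more often than necessary; caching function values would reduce the number of evaluations.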

Pattern search method

● Pattern search method is very easy to code and computationally competitive with other methods.

● Possible modifications include: different step sizes for each variable, or coupling with one-dimensional search methods.
