Lecture notes
Numerical Optimization with Applications
Eran Treister
Computer Science Department,
Ben-Gurion University of the Negev.
Contents
1 Background – unconstrained optimization 1
1.1 Optimality conditions for unconstrained optimization . . . . . . . . . . . . . 4
1.2 Convexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Iterative methods for unconstrained optimization 12
2.1 Steepest descent (SD) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 A descent direction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Newton, quasi-Newton . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4 Line-search methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.5 Coordinate descent methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.6 Non-linear Conjugate Gradient methods . . . . . . . . . . . . . . . . . . . . 23
2.7 Inexact Newton Methods - Newton-PCG . . . . . . . . . . . . . . . . . . . . 24
3 Constrained Optimization 25
3.1 Equality-constrained optimization – Lagrange multipliers . . . . . . . . . . . 26
3.2 The KKT optimality conditions for general constrained optimization . . . . . 31
3.3 Penalty and Barrier methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.4 The projected Steepest Descent method . . . . . . . . . . . . . . . . . . . . 41
4 Robust statistics in least squares problems 45
5 Stochastic optimization 48
5.1 Stochastic Gradient descent . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.2 Example: SGD for minimizing a convex function . . . . . . . . . . . . . . 52
5.3 Upgrades to SGD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
6 Minimization of Neural Networks for Classification 55
6.1 Linear classifiers: Logistic regression . . . . . . . . . . . . . . . . . . . . . . 55
6.1.1 Computing the gradient of logistic regression . . . . . . . . . . . . . . 57
6.1.2 Adding the bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
6.2 Linear classifiers: Multinomial logistic (softmax) regression . . . . . . . . . . 59
6.2.1 Computing the gradient of softmax regression . . . . . . . . . . . . . 60
6.3 The general structure of a neural network . . . . . . . . . . . . . . . . . . . 61
6.4 Computing the gradient of NN’s: back-propagation . . . . . . . . . . . . . . 63
6.5 Computing derivatives of Neural Networks . . . . . . . . . . . . . . . . . . . 65
6.5.1 The derivatives of matrix-vector and matrix-matrix multiplications . 65
6.5.2 The derivatives of standard neural networks . . . . . . . . . . . . . . 67
6.5.3 The derivatives of residual networks . . . . . . . . . . . . . . . . . . . 68
6.6 Gradient and Jacobian verification . . . . . . . . . . . . . . . . . . . . . . . 69
1 Background – unconstrained optimization
An optimization problem is a problem where a cost function f(x) : Rn → R needs to be
minimized or maximized. We will mostly refer to minimization, as maximization can always
be achieved by minimizing −f . The function f() is called the “objective”. In most cases,
we wish to find a point x∗ that minimizes f . We will denote this by
x∗ = arg min_{x∈Rn} f(x). (1)
This optimization problem is not the most general problem we will deal with. This problem
is called an unconstrained optimization problem, and we will deal with this problem
first. We will deal with constrained optimization in the next section.
Optimization problems arise in a huge variety of applications in fields like the natural sciences,
engineering, economics, statistics and more. Optimization arises from our nature: we usually
wish to make things better in some sense. In recent years, large amounts of data have been
collected, and consequently optimization methods arise from the need to characterize this
data by computational models. In this course we will deal with continuous optimization
problems, as opposed to discrete optimization which is a completely different field with
different challenges.
Until now we met some optimization problems, such as least squares and A−norm min-
imization problems. Those problems had quadratic objectives, and setting their gradient to
zero resulted in a linear system. Understanding quadratic optimization problems is the key
for understanding general optimization problems, and often, the iterative methods used for
general optimization problems are just a generalization of methods for solving linear systems.
We’ve seen a bit of this concept in Steepest Descent and Gauss-Seidel. Another strong con-
nection between quadratic and general optimization problems is that every objective function
f is locally quadratic. We will now see that via the Taylor expansion.
Multivariate Taylor expansion Assume a smooth one-variable function f(x). The
one-dimensional Taylor expansion is given by
f(x + ε) = f(x) + f′(x)ε + ½f′′(x)ε² + (1/3!)f′′′(c)ε³,  c ∈ [x, x + ε].
Assuming that f′′′ is bounded, we will usually refer to the last term simply as O(ε³), and all of
our derivations will focus on the first two or three terms. If f has two variables, f = f(x1, x2),
then the two-dimensional Taylor expansion is given by
f(x1 + ε1, x2 + ε2) = f(x1, x2) + (∂f/∂x1)ε1 + (∂f/∂x2)ε2 + ½(∂²f/∂x1²)ε1² + (∂²f/∂x1∂x2)ε1ε2 + ½(∂²f/∂x2²)ε2² + O(‖ε‖³).
Let us recall the definition of the gradient, which in two-variables is given by
∇f = [∂f/∂x1, ∂f/∂x2]ᵀ.

This is a 2 × 1 vector, and if we denote ε = [ε1, ε2]ᵀ, then
〈∇f, ε〉 = (∂f/∂x1)ε1 + (∂f/∂x2)ε2.
Now we will define the two-dimensional Hessian matrix – a two-dimensional second deriva-
tive:
∇²f = H = [ ∂²f/∂x1²     ∂²f/∂x1∂x2
            ∂²f/∂x1∂x2   ∂²f/∂x2²  ].
It can be seen that ½〈ε, Hε〉 = ½(∂²f/∂x1²)ε1² + (∂²f/∂x1∂x2)ε1ε2 + ½(∂²f/∂x2²)ε2², and therefore the Taylor
expansion can be written as

f(x + ε) = f(x) + 〈∇f, ε〉 + ½〈ε, Hε〉 + O(‖ε‖³).
This expansion is also suitable for n-dimensional functions, where the Hessian matrix H ∈ Rn×n is defined by

Hij = ∂²f/∂xi∂xj.
Note that if we assume that f is smooth and twice differentiable, then H is a symmetric
matrix (but not necessarily positive definite).
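This local quadratic behavior is easy to check numerically. The following sketch (in Python, with an arbitrarily chosen smooth function) compares f(x + ε) with the second-order Taylor model; the O(‖ε‖³) remainder shows up as a tiny mismatch:

```python
import numpy as np

# Illustrative smooth function f(x) = exp(x1) + x1*x2^2 with hand-coded
# gradient and Hessian (note that the Hessian is symmetric, as stated above).
f = lambda x: np.exp(x[0]) + x[0] * x[1] ** 2
grad = lambda x: np.array([np.exp(x[0]) + x[1] ** 2, 2 * x[0] * x[1]])
hess = lambda x: np.array([[np.exp(x[0]), 2 * x[1]],
                           [2 * x[1],     2 * x[0]]])

x = np.array([0.3, -0.7])
eps = np.array([1e-3, 2e-3])

# Second-order Taylor model: f(x) + <grad f, eps> + 0.5 <eps, H eps>
model = f(x) + grad(x) @ eps + 0.5 * eps @ (hess(x) @ eps)
err = abs(f(x + eps) - model)
print(err)  # on the order of ||eps||^3
```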
A Taylor expansion of a function vector In some cases, derivatives may be complicated
and we need to know how to calculate the derivative of a vector of functions, as opposed to the scalar
functions that we dealt with so far. Consider f(x) : Rn → Rm. That is, x ∈ Rn, and
f(x) ∈ Rm. The first order approximation of each of the functions fi(x) is given by
fi(x + ε) ≈ fi(x) + 〈∇fi(x), ε〉.
For the whole function vector f(x), we can write
δf = f(x + ε)− f(x) ≈ Jε,
where J ∈ Rm×n is the Jacobian matrix, comprised of the gradients ∇fi(x)ᵀ as rows:

J = [ − ∇f1(x)ᵀ −
      − ∇f2(x)ᵀ −
          ...
      − ∇fm(x)ᵀ − ],   Ji,j = ∂fi/∂xj.
For the sake of gradient calculation we will focus on the case where ε → 0 and assume
equality in the definition of δf .
Example 1. (The Jacobian of Ax) Suppose that f = Ax, where A ∈ Rm×n. It is clear that

δf = f(x + ε) − f(x) = A(x + ε) − Ax = Aε,

and therefore J = A.
Example 2. (The Jacobian of φ(x), where φ is a scalar function) Suppose that f = φ(x),
where φ : R → R is a scalar function, i.e., fi(x) = φ(xi). In this case the vector Taylor
expansion is just a vector of one dimensional expansions.
δf = φ(x + ε) − φ(x) ≈ diag(φ′(x))ε = diag(φ′(x))δx.
This means that J = diag(φ′(x)), which is a diagonal matrix such that Jii = φ′(xi).
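This diagonal Jacobian can be verified against finite differences; a small Python sketch with φ = tanh (an arbitrary choice of elementwise function):

```python
import numpy as np

# Elementwise map f(x)_i = tanh(x_i); its Jacobian should be diag(1 - tanh(x)^2).
phi = np.tanh
dphi = lambda x: 1.0 - np.tanh(x) ** 2

x = np.array([0.5, -1.0, 2.0])
J = np.diag(dphi(x))

# Finite-difference Jacobian, built column by column with central differences.
h = 1e-6
J_fd = np.column_stack([(phi(x + h * e) - phi(x - h * e)) / (2 * h)
                        for e in np.eye(len(x))])
print(np.max(np.abs(J - J_fd)))  # small, up to finite-difference accuracy
```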
1.1 Optimality conditions for unconstrained optimization
We now ask ourselves: what is a minimum point of f? We will consider two kinds of minimum
points.
Definition 1 (Local minimum). A point x∗ will be called a local minimum of a function f()
if there exists r > 0 s.t.

f(x) ≥ f(x∗) for all x s.t. ‖x − x∗‖ < r.
If we use a strict inequality, then we have a strict local minimum.
Figure 1: Global and local minimum
Definition 2 (Global minimum). A point x∗ will be called a global minimum of a function
f() if
f(x) ≥ f(x∗) for all x ∈ Rn.
Again, if we have a strict inequality, we will also have a strict global minimum.
We will now ask ourselves: how can we identify or compute local and global minima of
a function? It turns out that a local minimum is possible to characterize, while a global one is harder.
In this course we will focus on iterative methods that find a local minimum, hoping that we are
not so far from the global minimum. There are many ways to handle the issue of global vs. local
minima in the case of general functions (e.g., adding regularization, or choosing a
good starting point), but we will not deal with that in this course.
It turns out that there is a quite large family of functions, where any local minimum is
also a global minimum. These functions are called convex functions.
1.2 Convexity
A set Ω ⊆ Rn is called convex if for any x1, x2 ∈ Ω and any α ∈ [0, 1], the point
αx1 + (1 − α)x2 is also in Ω. A function f is called convex over a convex set Ω if for any
x1, x2 ∈ Ω and α ∈ [0, 1],

f(αx1 + (1 − α)x2) ≤ αf(x1) + (1 − α)f(x2).

The convex set definition is illustrated in Fig. 2, and the convex function definition can be
illustrated in 1D; see Figure 3. A 2D example of a convex and a non-convex function is given in
Fig. 4.
Figure 2: Convexity of sets
Figure 3: Convexity: the image uses t instead of α
Figure 4: Convexity in 2D: the left image shows a non-convex function, while the right one shows a convex (and quadratic) function.
Alternative definitions for convexity
1. Assume that f(x) is differentiable. f(x) is a convex function over a convex region Ω if
and only if for all x1, x2 ∈ Ω

f(x1) ≥ f(x2) + 〈∇f(x2), x1 − x2〉.

This definition means that any tangent plane of the function has to lie below the
function. This definition is equivalent to the previous traditional definition.

2. Assume that f(x) is twice differentiable. f(x) is a convex function over a convex region
Ω if and only if the Hessian is positive semi-definite, ∇²f(x) ⪰ 0, for all x ∈ Ω. It is easy to
show that the two definitions are equivalent through a Taylor series:

f(x1) = f(x2) + 〈∇f(x2), x1 − x2〉 + ½〈x1 − x2, ∇²f(c)(x1 − x2)〉,  c ∈ [x1, x2].
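For a convex quadratic f(x) = ½〈x, Ax〉 with A ⪰ 0, the first of these characterizations can be checked numerically at random points; a minimal Python sketch (the matrix and test points are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = B.T @ B  # positive semi-definite by construction, so f is convex

f = lambda x: 0.5 * x @ (A @ x)
grad = lambda x: A @ x

# First-order characterization: f(x1) >= f(x2) + <grad f(x2), x1 - x2>
ok = True
for _ in range(100):
    x1, x2 = rng.standard_normal(4), rng.standard_normal(4)
    ok &= f(x1) >= f(x2) + grad(x2) @ (x1 - x2) - 1e-12  # small slack for roundoff
print(ok)  # True
```

Indeed, here f(x1) − f(x2) − 〈∇f(x2), x1 − x2〉 = ½〈x1 − x2, A(x1 − x2)〉 ≥ 0 exactly.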
Example 3 (The definitions of convexity for a quadratic function). Consider the quadratic
function f(x) = ½〈x, Ax〉 − 〈b, x〉 with a symmetric matrix A. Here ∇f(x) = Ax − b and
∇²f(x) = A, so by the second definition above, f is convex if and only if A is positive
semi-definite.
When the objective function is convex, local and global minimizers are simple to characterize.
Theorem 1. When f is convex, any local minimizer x∗ is a global minimizer of f . If in
addition f is differentiable, then any stationary point x∗ such that ∇f(x∗) = 0 is a global
minimizer of f .
We will only consider the proof of the first part of the Theorem here. Suppose that x∗ is a
local but not a global minimizer. Then there exists a point z with f(z) < f(x∗). By convexity,
for any α ∈ (0, 1],

f(x∗ + α(z − x∗)) ≤ (1 − α)f(x∗) + αf(z) < f(x∗),

so every neighborhood of x∗ contains points with a smaller objective value, contradicting
the local minimality of x∗.
The proof of the second part of the Theorem is obtained by contradiction.
These results provide the foundation of unconstrained optimization algorithms. If we
have a convex function, then every local minimizer is also a global minimizer, and hence if
we have methods that reach local minimizers, we can reach the global minimizer. This is
why convex problems are very popular in science, although in some cases there is no choice
but to solve non-convex problems. In all algorithms we seek a point x∗ where ∇f(x∗) = 0.
2 Iterative methods for unconstrained optimization
Just like linear systems, optimization can be carried out by iterative methods. Actually,
in optimization, it is usually not possible to directly find a minimum of a function, and so
iterative methods are much more essential in optimization than in numerical linear algebra.
On the other hand, we have learned about minimizing quadratic functions, which is equiv-
alent to solving linear systems. Since every function is locally approximated by a quadratic
function (and specifically near its minimum) everything we learned about iterative methods
for linear systems can be useful in the context of optimization.
2.1 Steepest descent (SD)
Earlier, we met the steepest descent method
x(k+1) = x(k) − α(k)∇f(x(k)),
which we analysed for positive definite linear systems Ax = b. In fact, we saw that if we
look at the equivalent quadratic minimization, then the matrix A has to be positive definite,
which means that the quadratic function f has to be convex (this does not mean that a
general f has to be convex for Steepest Descent to work). Just like in the quadratic (algebraic)
case, α(k) can be chosen sufficiently small, whether constant (α(k) = α) or not, so that the
method converges to a local minimum. However, a more common approach is to view α(k) as
a step size, and to choose it in some wise way.
2.2 A descent direction
In the method of SD, we chose the direction
dSD = −∇f(x(k)), (2)
as the most obvious choice for a descent direction (which requires a suitable step-size).
Among all the directions that we could choose from x(k), this is the one in which f decreases
most rapidly. To verify this claim, let us first define a descent direction: a direction in which
the value of f decreases. To define such a direction we use Taylor’s theorem. Let d be a
direction of norm 1 (‖d‖ = 1), and let α > 0 be a positive step size. We have:
f(x(k) + αd) = f(x(k)) + α〈∇f(x(k)), d〉 + ½α²〈d, ∇²f(x(k) + td)d〉,  t ∈ (0, α).
Now, for α > 0 sufficiently small, we will have that
f(x(k) + αd) < f(x(k)) ⇔ 〈∇f(x(k)),d〉 < 0, (3)
which is the condition for d to qualify as a descent direction.
Corollary 1. Any direction d = −M∇f such that M ≻ 0 (symmetric positive definite) is a
descent direction. Indeed, in this case 〈∇f, d〉 = −〈∇f, M∇f〉 < 0 whenever ∇f ≠ 0.
It is easy to see that since we chose ‖d‖ = 1 we have
〈∇f,d〉 = ‖∇f‖‖d‖ cos(θ) = ‖∇f‖ cos(θ),
where θ is the angle between the vectors ∇f and d. To maximize the decrease in f we should
choose d such that cos(θ) = −1, which is satisfied at θ = 180°. That is the opposite
direction to the gradient, −∇f. This is the reason why (2) qualifies as the “steepest
descent” direction.
Interpretation of SD as quadratic minimization We will look at more general methods
next, but first observe that the steepest descent method (with a pre-defined step size α)
can be viewed as a step that minimizes a quadratic approximation of f around x(k):

d_SD(k) = arg min_d { f(x(k)) + 〈∇f(x(k)), d〉 + (1/(2α))〈d, d〉 } = −α∇f(x(k)).

Then:

x(k+1) = x(k) + d_SD(k) = x(k) − α∇f(x(k)).
This derivation assumes that α is either constant or known a-priori. We will now see some
other methods that follow the same pattern, and later see a method for determining the step
length α through linesearch.
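For a convex quadratic f(x) = ½〈x, Ax〉 − 〈b, x〉 this pattern is easy to run end-to-end; the Python sketch below (matrix, step size, and iteration count chosen for illustration) applies SD with a fixed step α, which converges when α < 2/λmax(A):

```python
import numpy as np

# Steepest descent with a fixed step on a convex quadratic whose minimizer
# solves the linear system Ax = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])  # symmetric positive definite
b = np.array([1.0, -1.0])
grad = lambda x: A @ x - b

alpha = 0.3  # fixed step; must satisfy alpha < 2 / lambda_max(A) to converge
x = np.zeros(2)
for _ in range(200):
    x = x - alpha * grad(x)

print(x, np.linalg.solve(A, b))  # the iterates approach A^{-1} b
```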
2.3 Newton, quasi-Newton
One of the most important methods outside of steepest descent is Newton’s method, which
is much more powerful than SD, but also may be more expensive. Recall again the Taylor
expansion
f(x(k) + ε) = f(x(k)) + 〈∇f(x(k)), ε〉 + ½〈ε, ∇²f(x(k))ε〉 + O(‖ε‖³). (4)
Now we choose the next step x(k+1) = x(k) + ε by minimizing Eq. (4) without the O(‖ε‖³)
term with respect to ε. That is, the Newton direction d_N is chosen by the following quadratic
minimization:

d_N(k) = arg min_d { f(x(k)) + 〈∇f(x(k)), d〉 + ½〈d, ∇²f(x(k))d〉 } = −(∇²f(x(k)))⁻¹∇f(x(k)).
This minimization is similar to the one mentioned in SD, but now it has ∇²f(x(k)) instead
of the matrix (1/α)I. This shows us that in SD we essentially approximate the real Hessian with
a scaled identity matrix.
The Newton procedure leads to a search direction
d_N = −(∇²f(x(k)))⁻¹∇f(x(k)),
which is Newton’s direction. This is one of the best search directions used in practice, but
it is relatively hard to obtain: first, the function f should be twice differentiable. Second,
one has to compute the Hessian matrix or at least know how to solve a non-trivial linear
system involving the Hessian matrix. This has to be computed at each iteration, which may
be costly. Solving the linear system can be done directly through an LU decomposition,
or iteratively using an appropriate method. If we choose a constant step length, then one
can show that the best step length for Newton’s method is asymptotically 1.0. However,
throughout the iterations it is better to perform a linesearch over the direction d_N to
guarantee decreasing values of f(x(k)) and convergence.
It is important to note that Newton’s method should be used with care. There is no
guarantee that d_N is a descent direction, or that the matrix ∇²f(x(k)) is invertible. If the
problem is convex and the Hessian is positive definite, we have nothing to worry about.
Quasi-Newton methods Applying Newton’s method is expensive. In quasi-Newton
methods we approximate the Hessian by some matrix M(x(k)), and at each step minimize

d_QN(k) = arg min_d { f(x(k)) + 〈∇f(x(k)), d〉 + ½〈d, M(x(k))d〉 } = −(M(x(k)))⁻¹∇f(x(k)). (5)
M can be chosen to be diagonal, e.g., M = cI or M = diag(∇²f), which is easily invertible, or in
other ways. We will not go into this further in this course, but the limited-memory Broyden-
Fletcher-Goldfarb-Shanno (L-BFGS) method is one of the most popular quasi-Newton methods
in the literature. It iteratively builds a low-rank approximation of the Hessian from
previous search directions, which is somewhat similar to some Krylov methods that we’ve
seen earlier in this course (e.g., GMRES).
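As a toy illustration of (5) with the diagonal choice M = diag(∇²f), the following Python sketch (objective and starting point chosen arbitrarily) runs the resulting iteration with a unit step:

```python
import numpy as np

# Quasi-Newton with M = diag(Hessian) on an illustrative objective.
f = lambda x: x[0] ** 4 + 5 * x[1] ** 2
grad = lambda x: np.array([4 * x[0] ** 3, 10 * x[1]])
diagH = lambda x: np.array([12 * x[0] ** 2, 10.0])  # diagonal of the true Hessian

x = np.array([1.0, 1.0])
for _ in range(50):
    d = -grad(x) / diagH(x)  # solving the diagonal system M d = -grad f
    x = x + d                # unit step; in general a linesearch would be safer
print(x)  # approaches the minimizer [0, 0]
```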
All the methods above are one-point iterations, and are summarized in Algorithm 1.

Algorithm: General one-point iterative method
# Input: Objective f(x) to minimize.
for k = 1, ..., maxIter do
    Compute the gradient ∇f(x(k)).
    Define the search direction d(k), essentially by solving a quadratic minimization.
    Choose a step-length α(k) (possibly by linesearch).
    Apply a step: x(k+1) = x(k) + α(k)d(k).
    if ‖∇f(x(k+1))‖/‖∇f(x(1))‖ < ε, or alternatively ‖x(k+1) − x(k)‖/‖x(k)‖ < ε, then
        Convergence is reached; stop the iterations.
    end
end
Return x(k+1) as the solution.
Algorithm 1: General one-point iterative method for unconstrained optimization
2.4 Line-search methods
Line-search methods are among the most common methods in optimization. Assume we are
at iteration k, at point x(k). Assume that we have a descent direction d(k) in which the
function decreases. Line-search methods perform
x(k+1) = x(k) + α(k)d(k) (6)
and choose α(k) such that f(x(k) + αd(k)) is exactly or approximately minimized over α. In
the quadratic SD case, we had a closed-form expression for the minimizing step size.
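To recall that closed form: for a quadratic f(x) = ½〈x, Ax〉 − 〈b, x〉 and a descent direction d, minimizing f(x + αd) over α gives α = 〈r, d〉/〈d, Ad〉 with r = b − Ax. A quick Python check (with an arbitrary SPD matrix chosen for illustration):

```python
import numpy as np

# Exact linesearch for a quadratic objective along a descent direction.
A = np.array([[4.0, 1.0], [1.0, 3.0]])  # symmetric positive definite
b = np.array([1.0, 2.0])
x = np.zeros(2)
r = b - A @ x          # r = -grad f(x)
d = r.copy()           # steepest-descent direction
alpha = (r @ d) / (d @ (A @ d))

phi = lambda a: 0.5 * (x + a * d) @ (A @ (x + a * d)) - b @ (x + a * d)
# alpha minimizes phi(a): nearby step sizes give no smaller value.
print(phi(alpha) <= phi(0.9 * alpha), phi(alpha) <= phi(1.1 * alpha))
```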
When computing the step length α(k), we are essentially minimizing a one-dimensional
function
φ(α) = f(x(k) + αd(k)),   α(k) = arg min_α φ(α),
with respect to α. All 1-D minimization methods can be used here, but we face a tradeoff.
We would like to choose α(k) to substantially reduce f , but at the same time we do not
want to spend too much time making the choice. In general, it is too expensive to identify
a minimizer, as it may require too many evaluations of the objective f . We assume here
that the computation of f(x(k) + αd(k)) is just as expensive as computing f(w) for some
arbitrary vector w. If this is not the case, we may be able to allow ourselves to invest more
iterations in minimizing φ(α). However, note that minimizing φ is only locally optimal (a
greedy choice), and there is no guarantee that an exact linesearch minimization is indeed
the right choice for the fastest convergence, not to mention the cost of doing that.
More practical strategies perform an inexact line search to identify a step length that
achieves adequate reductions in f at minimal cost. Typical line search algorithms try out
a sequence of candidate values for α, stopping to accept one of these values when certain
conditions are satisfied. Sophisticated line search algorithms can be quite complicated. We
now show a practical termination condition for the line search algorithm. We will later show,
using a simple quadratic example, that effective step lengths need not lie near the minimizers
of φ(α).
Backtracking line search using the Armijo condition We will now show one of the
most common line search algorithms, called “backtracking” line search, also known as the
Armijo rule. It involves starting with a relatively large estimate of the step size α, and
iteratively shrinking the step size (i.e., “backtracking”) until a sufficient decrease of the
objective function is observed. The motivation is to choose α(k) to be as large as possible
while having a sufficient decrease of the objective at the same time. We are not minimizing
φ. This is done by choosing α0 rather large, and choosing a decrease factor 0 < β < 1. We
will examine the values of φ(αj) for decreasing values αj = β^j α0 (using j = 0, 1, 2, ...) until
the following condition is satisfied:

f(x(k) + αj d(k)) ≤ f(x(k)) + c αj 〈∇f, d(k)〉. (7)

Here 0 < c < 1 is usually chosen small (about 10⁻⁴), and β is usually chosen to be about 0.5. We
are guaranteed that such a point exists, because we assume that d(k) is a descent direction.
That is, we know that 〈∇f, d(k)〉 < 0, and using the Taylor theorem we have

f(x(k) + αd(k)) = f(x(k)) + α〈∇f, d(k)〉 + O(α²‖d(k)‖²).

It is clear that for some α small enough we will have

f(x(k)) − f(x(k) + αd(k)) = −α〈∇f, d(k)〉 + O(α²‖d(k)‖²) > 0.
In our stopping rule we usually set c to be small, so we choose α to be relatively large.
Algorithm 2 summarizes the backtracking linesearch procedure. Common “upgrades” for this
procedure are

• Choose α0 for iteration k as (1/β)α(k−1) or (1/β²)α(k−1).

• Instead of only reducing the αj’s, also try to enlarge them by αj+1 = (1/β)αj as long as
the Armijo condition is satisfied.
Example 4 (Newton vs. Steepest Descent). We will now consider the minimization of the
Rosenbrock “banana” function f(x) = (a − x1)² + b(x2 − x1²)², where a, b are parameters that
determine the “difficulty” of the problem. Here we choose b = 5 and a = 1. The minimum
of this function is at [a, a²], where f = 0 (in our case it is the point [1, 1]).

The gradient and Hessian of this function are

∇f(x) = [ −4bx1(x2 − x1²) − 2(a − x1) ;  2b(x2 − x1²) ],

∇²f(x) = [ −4b(x2 − x1²) + 8bx1² + 2    −4bx1
            −4bx1                        2b   ].
In the code below we apply the SD and Newton’s algorithms starting from the point [−1.4, 2].
Algorithm: Armijo Linesearch
# Input: Iterate x(k), objective f(x), gradient ∇f(x(k)), descent direction d(k).
# Constants: α0 > 0, 0 < β < 1, 0 < c < 1.
for j = 0, ..., maxIter do
    Compute the objective φ(αj) = f(x(k) + αj d(k)).
    if f(x(k) + αj d(k)) ≤ f(x(k)) + c αj 〈∇f, d(k)〉 then
        Return α = αj as the chosen step-length.
    else
        αj+1 = β αj
    end
end
# If maxIter iterations were reached, α is too small and we are probably stuck.
# In this case terminate the minimization.
Algorithm 2: Armijo backtracking linesearch procedure.
After 100 iterations, SD reached [0.957531, 0.915136] on its way to [1, 1]. Newton’s method
solved the problem in only 9 iterations.
Figure 6: The convergence path of steepest descent and Newton’s method for minimizing the “banana” function.
using PyPlot; using LinearAlgebra; # LinearAlgebra provides dot()
close("all");
xx = -1.5:0.01:1.6;
yy = -0.5:0.01:2.5;
X = repeat(collect(xx)', length(yy), 1); # grid of x-values (repmat is deprecated)
Y = repeat(collect(yy), 1, length(xx));  # grid of y-values
a = 1; b = 5;
F = (a .- X).^2 + b*(Y - X.^2).^2;
figure(); contour(X,Y,F,500);
xlabel("x"); ylabel("y"); title("SD vs Newton for minimizing the Banana function.")
f = (x)->((a-x[1])^2 + b*(x[2]-x[1]^2)^2)
g = (x)->[-4*b*x[1]*(x[2]-x[1]^2)-2*(a-x[1]); 2*b*(x[2]-x[1]^2)];
H = (x)->[-4*b*(x[2]-x[1]^2)+8*b*x[1]^2+2 -4*b*x[1]; -4*b*x[1] 2*b];
## Armijo parameters:
alpha0 = 1.0; beta = 0.5; c = 1e-4;
function linesearch(f::Function,x,d,gk,alpha0,beta,c)
alphaj = alpha0;
for jj = 1:10
x_temp = x + alphaj*d;
if f(x_temp) <= f(x) + alphaj*c*dot(d,gk)
break;
else
alphaj = alphaj*beta;
end
end
return alphaj;
end
println("*********** SD iterations *****************")
x_SD = [-1.4;2.0]; # initial guess
## SD Iterations
for k=1:100
gk = g(x_SD);
d_SD = -gk;
x_prev = copy(x_SD);
alpha_SD = linesearch(f,x_SD,d_SD,gk,0.25,beta,c);
x_SD=x_SD+alpha_SD*d_SD;
plot([x_SD[1];x_prev[1]],[x_SD[2];x_prev[2]],"g",linewidth=2.0); println(x_SD);
end;
println("*********** Newton iterations *****************")
x_N = [-1.4;2.0]; # initial guess
for k=1:10
gk = g(x_N);
Hk = H(x_N);
d_N = -(Hk)\gk;
x_prev = copy(x_N);
alpha_N = linesearch(f,x_N,d_N,gk,alpha0,beta,c);
x_N=x_N+alpha_N*d_N;
plot([x_N[1];x_prev[1]],[x_N[2];x_prev[2]],"k",linewidth=2.0); println(x_N);
end;
legend(("SD","Newton"))
2.5 Coordinate descent methods
Coordinate descent methods are similar in principle to the Gauss-Seidel method for linear
systems. When we minimize f(x), we iterate over all entries i of x, and change each scalar
entry xi so that f(x) is minimized with respect to xi, given that the rest of the
variables are fixed.
Algorithm: Coordinate Descent
# Input: Objective f(x) to minimize.
for k = 1, ..., maxIter do
    # Sweep through all scalar variables xi and (approximately) minimize f(x)
    # with respect to each xi in turn, assuming the rest of the variables are fixed.
    for i = 1, ..., n do
        xi(k+1) ← arg min_{xi} f(x1(k+1), x2(k+1), ..., x(i−1)(k+1), xi, x(i+1)(k), ..., xn(k))
    end
    # Check if convergence is reached.
end
Return x(k+1) as the solution.
Algorithm 3: Coordinate descent
By doing this, we essentially zero the i-th entry of the gradient, i.e., we update xi such that

∂f/∂xi = 0.
Similarly to the variational property of Gauss-Seidel, the value of f is monotonically non-
increasing with each update. If f is bounded from below, the sequence f(x(k)) converges.
This does not guarantee convergence to a minimizer, but it is a good property to have.
There are many rules for the order in which to choose the variables xi. For example, lexicographic
or random orders are common. One of the requirements to guarantee convergence is that all
the variables are visited periodically. That is, we cannot guarantee convergence if some
of the variables are never visited.
Coordinate descent methods are most common in non-smooth optimization, and espe-
cially effective when one-dimensional minimization is easy to apply.
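For a convex quadratic f(x) = ½〈x, Ax〉 − 〈b, x〉, each one-dimensional minimization has a closed form and the sweep is exactly Gauss-Seidel; a Python sketch (illustrative matrix and right-hand side):

```python
import numpy as np

# Coordinate descent on a convex quadratic: minimizing over x_i with the rest
# fixed solves A[i,i] x_i = b[i] - sum_{j != i} A[i,j] x_j  (a Gauss-Seidel sweep).
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])  # symmetric positive definite
b = np.array([1.0, 0.0, -1.0])

x = np.zeros(3)
for sweep in range(100):
    for i in range(3):
        x[i] = (b[i] - A[i] @ x + A[i, i] * x[i]) / A[i, i]

print(x, np.linalg.solve(A, b))  # the sweeps converge to A^{-1} b
```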
Example 5 (Coordinate descent for the “banana” function). Assume again that we are
minimizing f(x) = (a − x1)² + b(x2 − x1²)², where a, b are parameters; here b = 10 and a = 1.
We’ve seen that

∇f(x) = [ −4bx1(x2 − x1²) − 2(a − x1) ;  2b(x2 − x1²) ],

and in coordinate descent we’re solving each of these equations in turn. We start with the
second variable, because the updates regarding the second variable are easy to define (assuming
b ≠ 0):

∂f/∂x2 = 2b(x2 − x1²) = 0  ⇒  x2 = x1².
For the first variable, we have

∂f/∂x1 = −4bx1(x2 − x1²) − 2(a − x1) = 0  ⇒  x1 = ? [no closed-form solution]

This time there is no closed-form solution, and in fact there may be more than one minimum
point here. A common option in such cases is to apply a one-dimensional Newton step for x1:

x1_new = x1 − α (f′_x1 / f″_x1) = x1 − α · (−4bx1(x2 − x1²) − 2(a − x1)) / (−4b(x2 − x1²) + 8bx1² + 2),
where α is obtained by linesearch with respect to minimizing f(x1 + αd1, x2) and d1 is the
Newton search direction for x1.
The code for the CD updates appears next; it works in addition to the code in the previous
examples. After 100 iterations, the CD iterates reached [0.997156, 0.99432]ᵀ on their way to
[1, 1], which is significantly better than SD but worse than Newton’s method.
Figure 7: The convergence path of the coordinate descent method for minimizing the “banana” function.
## CD iterations (this code comes in addition to the first section of code
## in the previous example)
x_CD = [-1.4;2.0];
println("*********** CD iterations *****************")
for k=1:100
# first variable Newton update:
gk = -4*b*x_CD[1]*(x_CD[2]-x_CD[1]^2)-2*(a-x_CD[1]); # This is g[1]
Hk = -4*b*(x_CD[2]-x_CD[1]^2)+8*b*x_CD[1]^2+2; # This is H[1,1]
d_1 = -gk/Hk; ## Hk is a scalar
f1 = (x)->((a-x).^2 + b*(x_CD[2]-x.^2).^2) # f1 accepts a scalar.
x_prev = copy(x_CD);
alpha_1 = linesearch(f1,x_CD[1],d_1,gk,alpha0,beta,c);
x_CD[1] = x_CD[1]+alpha_1*d_1;
# second variable update
x_CD[2] = x_CD[1]^2;
plot([x_CD[1];x_prev[1]],[x_CD[2];x_prev[2]],"k",linewidth=2.0); println(x_CD);
end;
2.6 Non-linear Conjugate Gradient methods
Earlier we’ve seen the Conjugate Gradient method for linear systems. This method has
a lot of nice properties and advantages. It turns out that the linear CG method can be
extended to non-linear unconstrained optimization. This extension has many variants, and
all of them reduce to the linear CG method when the function is quadratic. We will not
study these methods in this course, but they are quite efficient and worthy of consideration.
See https://en.wikipedia.org/wiki/Nonlinear_conjugate_gradient_method for
an intuitive explanation.
2.7 Inexact Newton Methods - Newton-PCG
In this approach we can utilize the power of Newton’s method when the dimension of the
problem is too large for an exact inversion of the Hessian matrix. There are other reasons
why the exact solve may not be possible or efficient, but they are usually application specific. In this
method, instead of solving the system

∇²f(x(k))d = −∇f(x(k))

directly, we apply an iterative method to solve it approximately. The common choice is Conjugate
Gradients, either with or without a preconditioner M. This way, the nice properties
of the linear CG method are in place and the solution is rather fast. This method is suitable
for ill-conditioned problems where a Hessian-vector product (which is necessary in CG) is
easy to apply. This method is usually better than Steepest Descent (if it is applicable),
because (1) the gradient has to be computed only a few times, and (2) it inherits the good
properties of linear CG. The method is often comparable in performance to non-linear CG, but is more
stable and easier to understand.
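A minimal sketch of this idea in Python (unpreconditioned CG, and a quadratic test objective so every quantity is explicit; the matrix and iteration counts are illustrative): only Hessian-vector products are needed to obtain the Newton direction.

```python
import numpy as np

# Inexact Newton: solve  H d = -grad  approximately with CG, using only
# Hessian-vector products. For clarity the objective is the quadratic
# f(x) = 0.5 <x, Ax> - <b, x>, so H = A and grad = Ax - b.
A = np.array([[5.0, 1.0], [1.0, 4.0]])
b = np.array([1.0, -2.0])
grad = lambda x: A @ x - b
hess_vec = lambda x, v: A @ v  # the only access to the Hessian

def cg(matvec, rhs, iters=10, tol=1e-12):
    """A few CG iterations for  matvec(d) = rhs  (SPD operator)."""
    d = np.zeros_like(rhs)
    r = rhs.copy()
    p = r.copy()
    rr = r @ r
    for _ in range(iters):
        if rr < tol:
            break
        Ap = matvec(p)
        a = rr / (p @ Ap)
        d += a * p
        r -= a * Ap
        rr_new = r @ r
        p = r + (rr_new / rr) * p
        rr = rr_new
    return d

x = np.zeros(2)
for _ in range(5):  # (inexact) Newton iterations
    d = cg(lambda v: hess_vec(x, v), -grad(x), iters=5)
    x = x + d       # unit step; exact for a quadratic objective
print(x, np.linalg.solve(A, b))
```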
3 Constrained Optimization
In this section we will see the theory of constrained optimization. The theory is somewhat deep, and
we will see only a few of the main results in this course. A general formulation of constrained
optimization problems has the form

min_{x∈Rn} f(x)  subject to  c_j^eq(x) = 0, j = 1, ..., m_eq;   c_l^ieq(x) ≤ 0, l = 1, ..., m_ieq,  (8)

where f, the c_j^eq, and the c_l^ieq are all smooth functions. f(x), as before, is the objective that we wish
to minimize. The c_j^eq(x) are equality constraints, and the c_l^ieq(x) are inequality constraints. We define
the feasible set to be the set of all points that satisfy the constraints:

Ω = { x | c_j^eq(x) = 0, j = 1, ..., m_eq ;  c_l^ieq(x) ≤ 0, l = 1, ..., m_ieq }.
We can write (8) compactly as min_{x∈Ω} f(x). The focus in this section is to characterize the
solutions of (8). Recall that for unconstrained minimization we had the following sufficient
optimality condition: Any point x∗ at which ∇f(x∗) = 0 and ∇2f(x∗) is positive definite
is a strong local minimizer of f . Our goal now is to define similar optimality conditions to
constrained optimization problems.
We have seen already that global solutions are difficult to find even when there are no
constraints. The situation may be improved when we add constraints, since the feasible set
might exclude many of the local minima, and it may be easy to pick the global minimum
from those that remain. However, constraints can also make things much more difficult. As
an example, consider the problem

min_{x∈Rn} ‖x‖₂²  s.t.  ‖x‖₂² ≥ 1.

Without the constraint, this is a convex quadratic problem with the unique minimizer x = 0.
When the constraint is added, any vector x with ‖x‖₂ = 1 solves the problem. There are
infinitely many such vectors (hence, infinitely many local minima) whenever n ≥ 2.
Definitions of the different types of local solutions are simple extensions of the corre-
sponding definitions for the unconstrained case, except that now we restrict consideration
to the feasible points in the neighborhood of x∗. We have the following definition.
Figure 8: The single equality constraint example.
Definition 3 (Local solution of the constrained problem). A vector x∗ is a local solution of
the problem (8) if x∗ ∈ Ω (x∗ satisfies the constraints) and there is a neighborhood N of x∗
such that f(x) ≥ f(x∗) for x ∈ N ∩ Ω.
3.1 Equality-constrained optimization – Lagrange multipliers
To introduce the basic principles behind the characterization of solutions of general con-
strained optimization problems, we will first consider only equality constraints. We start
with a relatively simple case of constrained optimization with a single equality constraint.
Our first example is a two-variable problem with a single equality constraint.
Example 6 (A single equality constraint). Consider the following problem:
min_{x∈R^n} x_1 + x_2   subject to   x_1² + x_2² − 2 = 0.        (9)

In the language of Equation (8) we have m_eq = 1 and m_ieq = 0, with c^eq_1(x) = x_1² + x_2² − 2.
We can see by inspecting Fig. 8 that the feasible set for this problem is the circle of radius √2 centered at the origin (just the boundary of this circle, not its interior). The solution x* is obviously [−1,−1]^T. From any other point on the circle, it is easy to find a way to move that stays feasible (that is, remains on the circle) while decreasing f.
We also see that at the optimal point x*, the objective gradient ∇f is parallel to the constraint gradient ∇c^eq_1(x*). In other words, there exists a scalar λ_1 such that

∇f(x*) + λ_1 ∇c^eq_1(x*) = 0,        (10)

and in this particular case, where ∇f = [1, 1]^T and ∇c^eq_1 = [2x_1, 2x_2]^T, we have λ_1 = 1/2 at x = [−1,−1]^T.
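The condition (10) is easy to check numerically. The code in these notes is Julia, but the following quick NumPy check (a throwaway snippet of ours, not part of the notes' code) confirms the multiplier condition at the solution:

```python
import numpy as np

# Check the multiplier condition (10) at x* = [-1, -1] for
# f(x) = x1 + x2 with constraint c(x) = x1^2 + x2^2 - 2 = 0.
x_star = np.array([-1.0, -1.0])
grad_f = np.array([1.0, 1.0])        # gradient of f is constant
grad_c = 2.0 * x_star                # gradient of c is [2 x1, 2 x2]
lam = 0.5                            # the multiplier found above

residual = grad_f + lam * grad_c     # vanishes at the solution
print(residual)
```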
We can reach the same conclusion using the Taylor expansion. Assume a feasible point
x. To retain feasibility, any small movement in the direction d from x has to satisfy the
constraint
0 = ceq1 (x + d) ≈ ceq1 (x) + 〈∇ceq1 (x),d〉 = 〈∇ceq1 (x),d〉.
So any movement in the direction d from the point x has to satisfy
〈∇ceq1 (x),d〉 = 0. (11)
On the other hand, we have seen earlier that any descent direction has to satisfy

〈∇f(x),d〉 < 0        (12)

because f(x + d) − f(x) ≈ 〈∇f(x),d〉. If there exists a direction d satisfying (11)-(12), then we can say that improvement in f
is possible under the constraints. It follows that a necessary condition for optimality for our
problem is that there is no direction d satisfying both (11)-(12). The only way that such a direction cannot exist is if ∇f(x) and ∇c^eq_1(x) are parallel; otherwise

d = −(I − (1/(∇c^eq_1(x)^T ∇c^eq_1(x))) ∇c^eq_1(x) ∇c^eq_1(x)^T) ∇f(x)

is a descent direction which satisfies the constraint, at least to first order. To eliminate this option we demand that there is no such direction, which also means that

(1/(∇c^eq_1(x)^T ∇c^eq_1(x))) ∇c^eq_1(x) ∇c^eq_1(x)^T ∇f(x) = ∇f(x).
The equation above can be interpreted through the least-squares problem

λ_1 = arg min_λ ‖∇c^eq_1(x) λ − ∇f(x)‖²_2,

whose solution is λ_1 = (1/(∇c^eq_1(x)^T ∇c^eq_1(x))) ∇c^eq_1(x)^T ∇f(x). The equation states exactly that this least-squares problem is solved with an objective value of 0, i.e., ∇c^eq_1(x) λ_1 = ∇f(x). It means that there exists a λ_1 satisfying (10).
Because of the reasoning above, we can “pack” the objective and its equality constraints
into one function called the Lagrangian function
L(x, λ_1) = f(x) + λ_1 c^eq_1(x),

for which ∇_x L = ∇_x f(x) + λ_1 ∇_x c^eq_1(x). Hence, at the solution x* there exists a scalar λ_1
such that
∇xL(x∗, λ∗1) = 0.
Note that the condition ∇λ1L = 0 provides the constraint ceq1 (x) = 0. This observation
suggests that we can search for solutions of the equality-constrained problem by searching for
stationary points of the Lagrangian function. The scalar quantity λ1 is called a Lagrange
multiplier for the constraint ceq1 (x) = 0. The condition above appears to be necessary for an
optimal solution of the problem, but it is clearly not sufficient. For example, the point [1, 1]>
also satisfies it in our example, but it is the maximum of the function under the constraint.
Back to our general equality constrained problem, which we write as
min_{x∈R^n} f(x)   subject to   c^eq(x) = 0,        (13)

where c^eq(x) : R^n → R^{m_eq} is the constraints vector function, that is,

c^eq(x) = [c^eq_1(x), c^eq_2(x), ..., c^eq_{m_eq}(x)]^T = 0.
To obtain the optimality conditions we will first make an assumption on a point x∗ regarding
the constraints.
Definition 4 (LICQ for equality constrained optimization). Given the point x*, we say that the linear independence constraint qualification (LICQ) holds if the set of constraint gradients {∇c^eq_i(x*)} is linearly independent or, equivalently, if the Jacobian of the constraints vector, J^eq(x*), has full row rank.
Note that if this condition holds, none of the constraint gradients can be zero. This assumption serves to simplify our derivations and is not really necessary from a practical point of view. For example, it breaks if we write one of the constraints twice, which clearly should not have any practical influence on our ability to solve problems.
A generalization of our conclusion from the previous example is as follows. In order for
x∗ to be a stationary point, there shouldn’t be a direction d such that 〈∇f(x),d〉 < 0 and
〈∇ceqi (x),d〉 = 0 for i = 1, ...,meq . In other words, the gradient cannot have a component
that is orthogonal to all the constraints’ gradients. More explicitly let
J^eq = [ (∇c^eq_1(x))^T ; (∇c^eq_2(x))^T ; ... ; (∇c^eq_{m_eq}(x))^T ]

be the Jacobian of the constraints. The requirement above holds only if (similarly to the single-constraint case) the orthogonal projection of the gradient onto the orthogonal complement of the span of the rows is zero (dropping the eq notation):

(I − J^T(JJ^T)^{−1}J)∇f = 0   ⇒   min_λ ‖J^T λ − ∇f‖²_2 = 0.
This also means that ∇f(x) is a linear combination of the rows of J^eq, and we get the condition:

∇f(x) + Σ_{i=1}^{m_eq} λ_i ∇c^eq_i(x) = ∇f(x) + (J^eq)^T λ = 0.
Consequently, given an equality constrained problem (13), we write its Lagrangian function
as
L(x,λ) = f(x) + λ>ceq(x),
where the vector λ ∈ R^{m_eq} is the Lagrange multipliers vector.
Theorem 2 (Lagrange multipliers for equality constrained minimization). Let x∗ be a local
solution of (13), in particular satisfying the equality constraints ceq(x∗) = 0, and the LICQ
condition. Then there exists a unique vector λ* ∈ R^{m_eq}, called the Lagrange multipliers vector,
which satisfies
∇_x L(x*, λ*) = ∇f(x*) + (J^eq(x*))^T λ* = 0.
We will not prove this theorem, but note that ∇λL(x∗,λ∗) = 0 ⇒ ceq(x∗) = 0, which
allows us to formulate a method for solving equality constrained optimization.
Back to our example of a single equality constraint
min_{x∈R^n} x_1 + x_2   subject to   x_1² + x_2² − 2 = 0.        (14)

The Lagrangian of this problem is L(x, λ_1) = x_1 + x_2 + λ_1(x_1² + x_2² − 2), and the solution of this problem is given by

∇L = 0   ⇒   { 1 + 2λ_1 x_1 = 0,   1 + 2λ_1 x_2 = 0,   x_1² + x_2² − 2 = 0 }
         ⇒   { x_1 = −1/(2λ_1),   x_2 = −1/(2λ_1),   1/(4λ_1²) + 1/(4λ_1²) = 2 }
         ⇒   { λ_1 = ±1/2,   x_1 = ∓1,   x_2 = ∓1 }.
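The two stationary points can be verified by substituting them back into ∇L = 0. A small Python check of ours (the notes' own code is in Julia):

```python
import numpy as np

# Verify the two stationary points of the Lagrangian of (14):
# L(x, lam) = x1 + x2 + lam*(x1^2 + x2^2 - 2).
def grad_L(x1, x2, lam):
    # gradient w.r.t. (x1, x2, lam): stationarity + the constraint itself
    return np.array([1 + 2*lam*x1, 1 + 2*lam*x2, x1**2 + x2**2 - 2])

for lam, x in [(0.5, (-1.0, -1.0)), (-0.5, (1.0, 1.0))]:
    print(lam, grad_L(*x, lam))   # both gradients vanish
```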
As discussed before, we got two stationary points of the Lagrangian. How do we decide which one of them is a local minimum? The following theorem says that the Hessian of L with respect to x has to be positive semidefinite on the directions that satisfy the constraints (that is, the directions which are orthogonal to the constraints' gradients).
Theorem 3 (2nd order necessary conditions for equality constrained minimization). Let x∗
be a minimum solution and let λ∗ be a corresponding Lagrange multiplier vector satisfying
∇L(x∗,λ∗) = 0. Also assume that the LICQ condition is satisfied. Then the following must
hold:
y^T ∇²_x L(x*, λ*) y ≥ 0   ∀ y ∈ R^n  s.t.  J^eq y = 0.
In our example, L(x, λ_1) = x_1 + x_2 + λ_1(x_1² + x_2² − 2), and

∇²_x L(x, λ_1) = [2λ_1  0 ;  0  2λ_1],   so   ∇²_x L(x*, λ*_1 = ±1/2) = [±1  0 ;  0  ±1].
It follows that the minimum is obtained at λ*_1 = 1/2, where ∇²_x L(x*, λ*) = I ≻ 0. This matrix is positive definite with respect to any vector, and in particular with respect to those vectors that are orthogonal to the constraint gradient. At the other point (λ_1 = −1/2), the Hessian is negative definite, and hence we have a local maximum.
3.2 The KKT optimality conditions for general constrained opti-
mization
Suppose now that we are considering the general case of constrained optimization, allowing
inequality constraints as well. Recall the following problem definition:
min_{x∈R^n} f(x)   subject to   c^eq_j(x) = 0,  j = 1,...,m_eq,
                                c^ieq_l(x) ≤ 0,  l = 1,...,m_ieq.        (15)
It turns out that for this problem we should make a distinction between active inequality
constraints and inactive inequality constraints. Assume a local solution x∗. The active
constraints are those constraints for which cieql (x∗) = 0 and the inactive constraints are those
constraints for which cieql (x∗) < 0.
Example 7. Consider the following problem which is similar to the previous one, only now
we have an inequality constraint:
min_{x∈R^n} x_1 + x_2   subject to   x_1² + x_2² − 2 ≤ 0.        (16)
Similarly to the previous problem, the solution of this problem lies at the point [−1,−1]^T, where the constraint is active. The Lagrangian is again L(x, λ_1) = x_1 + x_2 + λ_1(x_1² + x_2² − 2), and the minimum is obtained at the point satisfying ∇_x L(x*, λ*_1) = 0. But now, it turns out that the sign of the Lagrange multiplier λ*_1 = 1/2 has a significant role.
Similarly to before, we need any feasible search direction d to satisfy

c^ieq_1(x + d) ≈ c^ieq_1(x) + 〈∇c^ieq_1(x),d〉 ≤ 0.

If the constraint c^ieq_1 is inactive at the point x, we can move in any direction; in particular, we can move in the descent direction −∇f(x) and decrease the objective. However, if the
constraint is active, then cieq1 (x) = 0, and to satisfy the constraint we have the condition
〈∇cieq1 (x),d〉 ≤ 0. (17)
In order for d to be a legal descent direction it needs to satisfy both 〈∇f(x),d〉 < 0 and (17). If we wish that no d is a descent direction, we need that the two inner products cannot both be negative. For this we must require that −∇f and ∇c^ieq_1 point in the same direction. That is:

∇f + λ_1 ∇c^ieq_1(x*) = 0,

with λ_1 ≥ 0.
From the example above we learn that active inequality constraints act like equality constraints, but require the Lagrange multiplier to be nonnegative at stationary points. If we have multiple inequality constraints, {c^ieq_l(x) ≤ 0}_{l=1}^{m_ieq}, then we require that a descent direction d satisfies 〈∇c^ieq_l(x),d〉 ≤ 0 for every active constraint. But if

∇f + Σ_{l active} λ_l ∇c^ieq_l(x*) = 0,

with λ_l ≥ 0, then

〈∇f(x),d〉 + Σ_{l active} λ_l 〈∇c^ieq_l(x*),d〉 = 0,

and hence 〈∇f(x),d〉 ≥ 0 must hold. This means that d cannot be a descent direction under these conditions.
We will now state the necessary conditions for a solution of a general constrained mini-
mization problem.
Definition 5 (Active set). The active set A(x) at any feasible x is the set of inequality constraints which are satisfied with exact equality: A(x) = {l : c^ieq_l(x) = 0}.
We define the Lagrangian of the problem (15) by
L(x,λeq ,λieq) = f(x) + (λeq)>ceq(x) + (λieq)>cieq(x),
where λ^eq ∈ R^{m_eq} and λ^ieq ∈ R^{m_ieq} are the Lagrange multiplier vectors for the equality
and inequality constraints respectively. The next theorem states the first order necessary
conditions, known as the Karush-Kuhn-Tucker (KKT) conditions.
Theorem 4 (First-Order Necessary Conditions (KKT)). Suppose that x∗ is a local solution
of (15), and that the LICQ holds at x*. Then there are Lagrange multiplier vectors λ^eq* ∈ R^{m_eq} and λ^ieq* ∈ R^{m_ieq}, such that the following hold:

∇_x L(x*, λ^eq*, λ^ieq*) = ∇f(x*) + (J^eq)^T λ^eq* + (J^ieq)^T λ^ieq* = 0        (18)
c^eq(x*) = 0        (19)
c^ieq(x*) ≤ 0        (20)
λ^ieq* ≥ 0        (21)
λ^ieq*_l c^ieq_l(x*) = 0,  l = 1,...,m_ieq   (complementary slackness)        (22)

The conditions (20)-(22) may be replaced with λ^ieq*_l ≥ 0 for the active constraints l ∈ A(x*) and λ^ieq*_l = 0 for the inactive ones.
To know whether a stationary point is a minimum or a maximum we have the following
theorem:
Theorem 5 (2nd order necessary conditions for general constrained minimization). Let x∗
be a minimum solution satisfying the KKT conditions and let λeq∗, λieq∗ be the corresponding
Lagrange multiplier vectors. Also assume that the LICQ condition is satisfied. Then the
following must hold:
y^T ∇²_x L(x*, λ^eq*, λ^ieq*) y ≥ 0   ∀ y ∈ R^n  s.t.  J^eq y = 0  and  (∇c^ieq_l(x*))^T y = 0 for l ∈ A(x*).
Example 8. Consider the following problem
min_{x∈R^n} (x_1 − 3/2)² + (x_2 − 1/8)⁴   s.t.   x_1 + x_2 − 1 ≤ 0,
                                                 x_1 − x_2 − 1 ≤ 0,
                                                 −x_1 + x_2 − 1 ≤ 0,
                                                 −x_1 − x_2 − 1 ≤ 0.        (23)
This time, we have four inequality constraints. In principle, we should try every combination of active and inactive constraints and see if the resulting x* and λ* that are achieved
by ∇xL = 0 satisfy the KKT conditions. It turns out that the solution here is x∗ = [1, 0]>,
and that the first and second constraints are active at this point. Denoting them by c1 and
c2 we have
∇f(x*) = [−1 ; −1/128],   ∇c_1(x*) = [1 ; 1],   ∇c_2(x*) = [1 ; −1],

and the multipliers vector is λ* = [129/256, 127/256, 0, 0]^T.
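The stationarity condition (18) and the sign conditions can be checked numerically at this point. A small NumPy sketch of ours (the notes use Julia; variable names here are our own):

```python
import numpy as np

# KKT stationarity check for Example 8 at x* = [1, 0]:
# grad f(x*) + J^T lam* should vanish, with lam* >= 0.
x = np.array([1.0, 0.0])
grad_f = np.array([2*(x[0] - 1.5), 4*(x[1] - 0.125)**3])
J = np.array([[1., 1.], [1., -1.], [-1., 1.], [-1., -1.]])  # constraint gradients
lam = np.array([129/256, 127/256, 0.0, 0.0])

print(grad_f + J.T @ lam)   # vanishes: the KKT stationarity condition holds
```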
3.3 Penalty and Barrier methods
In the previous section we saw that in order to solve a constrained optimization using the
KKT conditions, we may need to solve several large linear systems, each time checking a
different combination of active and inactive inequality constraints. This is an option for solving the problem, but it does not align with the optimization methods we have seen so far. In this section, we will see two closely related approaches for solving constrained optimization: the penalty and barrier approaches. Both of these approaches convert the constrained problem
into a somewhat equivalent unconstrained problem which can be solved by SD, Newton, or
any other method for unconstrained optimization.
Penalty methods. We again assume that we have the general constrained optimization
problem
min_{x∈R^n} f(x)   subject to   c^eq_j(x) = 0,  j = 1,...,m_eq,
                                c^ieq_l(x) ≤ 0,  l = 1,...,m_ieq,        (24)
and now we wish to transform it to an equivalent unconstrained problem. In the penalty
approach, we rewrite the problem as
min_{x∈R^n} f(x) + µ ( Σ_{j=1}^{m_eq} ρ_j(c^eq_j(x)) + Σ_{l=1}^{m_ieq} ρ_l(max{0, c^ieq_l(x)}) )        (25)
where µ > 0 is a balancing penalty parameter, and ρj(x) and ρl(x) are scalar penalty
functions which are bounded from below and get their minimum at 0 (it is preferable that
these are non-negative but this is not mandatory). In addition, these functions should
monotonically increase as we move away from zero. The common choice for a penalty is the quadratic function ρ(x) = x², which has a minimum at 0. Using this choice, if we let µ → ∞ then the solution of (25) will be equal to that of (24). We will focus on this choice in this course.
The problem (25) is an unconstrained optimization problem which can be solved using the methods we have learned so far. For ρ(x) = x², however, it turns out that the problem becomes more and more ill-conditioned as µ → ∞, which makes standard first-order approaches like SD and CD slow. Therefore, in practice we will use a continuation strategy, where we solve the problem for iteratively increasing values of the penalty µ. That is, we will define
µ0 < µ1 < ... <∞
and iteratively solve the problem for each of those µ′s, each time using the previous solution
as an initial guess for the next problem. We will stop when µ is large enough such that the
constraints are reasonably fulfilled. That is, for our largest value of µ, it may be that for
example one of the equality constraints satisfies c(x∗) = ε. Whether ε is small enough or
not will depend on the application that yielded the optimization problem.
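As a rough illustration of this continuation loop, here is a hypothetical NumPy sketch applied to Example 6 (min x_1 + x_2 s.t. x_1² + x_2² = 2) with the quadratic penalty. The step-size scaling and iteration counts are ad-hoc choices of ours, not a prescription (the notes' own examples use Julia with an Armijo linesearch):

```python
import numpy as np

# Quadratic-penalty continuation: for each mu we minimize
# f(x) + mu*c(x)^2 by plain gradient descent, warm-starting
# from the previous solution.
f_grad = lambda x: np.array([1.0, 1.0])       # gradient of f(x) = x1 + x2
c      = lambda x: x[0]**2 + x[1]**2 - 2.0    # equality constraint value
c_grad = lambda x: 2.0 * x                    # gradient of c

x = np.array([-0.5, -0.5])                    # initial guess
for mu in [0.1, 1.0, 10.0, 100.0]:            # increasing penalty parameters
    for _ in range(2000):
        g = f_grad(x) + mu * 2.0 * c(x) * c_grad(x)
        x = x - 0.01 / (1.0 + mu) * g         # crude step shrinking with mu
print(x)   # approaches the constrained solution [-1, -1] as mu grows
```

Note how the step-size must shrink as µ grows: this is exactly the ill-conditioning mentioned above.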
Remark 1. Another popular choice for a penalty is the exact penalty function ρ(x) = |x| (absolute value). It is popular because, unlike with the quadratic function, the constraints will be exactly fulfilled for some moderate µ and not only for µ → ∞. The downside here is that this function is non-smooth and significantly complicates the optimization process. We will not consider this choice further in this chapter.
Example 9. Consider again the following problem
min_{x∈R^n} (x_1 − 3/2)² + (x_2 − 1/8)⁴   s.t.   x_1 + x_2 − 1 ≤ 0,
                                                 x_1 − x_2 − 1 ≤ 0,
                                                 −x_1 + x_2 − 1 ≤ 0,
                                                 −x_1 − x_2 − 1 ≤ 0.        (26)
Previously, we had four inequality constraints, and only the first two were active at the solution. Using the Lagrange multipliers approach we had to check several combinations of active/inactive constraints. We will now use the penalty version of this approach using ρ(x) = x², which in vector form becomes
min_{x∈R^n} f_µ(x) = (x_1 − 3/2)² + (x_2 − 1/8)⁴ + µ‖max{Ax − 1, 0}‖²_2,        (27)
where the matrix A is
A = [ 1  1 ;  1  −1 ;  −1  1 ;  −1  −1 ].
.We will solve this problem using Steepest Descent (SD), and the gradient for SD is given by
∇f_µ = [ 2(x_1 − 3/2) ;  4(x_2 − 1/8)³ ] + 2µ A^T (I_{Ax−1>0} ⊙ (Ax − 1)),

where ⊙ is the Hadamard point-wise vector product (same as the operator .* in Julia/Matlab), and I_{Ax−1>0} is an indicator vector:

(I_{Ax−1>0})_i = { 1 if (Ax)_i − 1 > 0 ;  0 otherwise }.
using PyPlot;
close("all");
xx = -0.5:0.01:1.5
yy= -1.0:0.01:1.0;
m = length(xx);
X = repmat(xx ,1,length(yy))’;
Y = repmat(yy ,1,length(xx));
oVec = ones(4);
A = [1.0 1 ; 1 -1 ; -1 1 ; -1 -1];
f = (x,mu)->(x[1]-3/2)^2 + (x[2]-1/8)^4 + mu*norm(max(A*x - oVec,0.0)).^2;
g = (x,mu)->[2*(x[1]-3/2) ; 4*(x[2]-1/8).^3] + mu*2*A’*(((A*x- oVec).>0.0).*(A*x - oVec));
F = (X-3/2).^2 + (Y-1/8).^4
C = (max(X + Y - 1,0).^2 + max(-X + Y - 1,0).^2 + max(-X - Y - 1,0).^2 + max(X - Y - 1,0).^2);
figure();
mu = 0.0;
Fc = F + mu*C;
subplot(3,2,1); contour(X,Y,Fc,200); #hold on; axis image;
xlabel("x"); ylabel("y"); title(string("Penalty for mu = ",mu))
mu = 0.01;
Fc = F + mu*C;
subplot(3,2,2); contour(X,Y,Fc,200); #hold on; axis image;
xlabel("x"); ylabel("y"); title(string("Penalty for mu = ",mu))
mu = 0.1;
Fc = F + mu*C;
subplot(3,2,3); contour(X,Y,Fc,200); #hold on; axis image;
xlabel("x"); ylabel("y"); title(string("Penalty for mu = ",mu))
mu = 1;
Fc = F + mu*C;
subplot(3,2,4); contour(X,Y,Fc,200); #hold on; axis image;
xlabel("x"); ylabel("y"); title(string("Penalty for mu = ",mu))
mu = 10;
Fc = F + mu*C;
subplot(3,2,5); contour(X,Y,Fc,500); #hold on; axis image;
xlabel("x"); ylabel("y"); title(string("Penalty for mu = ",mu))
mu = 100;
Fc = F + mu*C;
subplot(3,2,6); contour(X,Y,Fc,1000); #hold on; axis image;
xlabel("x"); ylabel("y"); title(string("Penalty for mu = ",mu))
Figure 9: The influence of the penalty parameter on the objective. As µ grows, the feasible set is more and more obvious in the function values.
## Armijo parameters:
alpha0 = 1.0; beta = 0.5; c = 1e-1;
## See the Armijo linesearch function in previous examples.
## Starting point
x_SD = [-0.5;0.8];
muVec = [0.1; 1.0 ; 10.0 ; 100.0]
fvals = []; cvals = [];figure();
## SD Iterations
for jmu = 1:length(muVec)
mu = muVec[jmu]
fmu = (x)->f(x,mu);
println("mu = ",mu)
Fc = F + mu*C;
subplot(2,2,jmu); contour(X,Y,Fc,1000); #hold on; axis image;
xlabel("x"); ylabel("y"); title(string("SD for mu = ",mu))
for k=1:50
gk = g(x_SD,mu);
if norm(gk) < 1e-3
break;
end
d_SD = -gk;
x_prev = copy(x_SD);
alpha_SD = linesearch(fmu,x_SD,d_SD,gk,0.5,beta,c);
x_SD=x_SD+alpha_SD*d_SD;
plot([x_SD[1];x_prev[1]],[x_SD[2];x_prev[2]],"g",linewidth=2.0);
println(x_SD,",",norm(gk),",",alpha_SD,",",f(x_SD,mu));
fvals = [fvals;f(x_SD,mu)];
cvals = [cvals;norm(max(A*x_SD - oVec,0.0))]
end;
end
figure();subplot(1,2,1);
plot(fvals);title("function values")
subplot(1,2,2);
semilogy(cvals,"*");title("Constraints violation values ||max(Ax-1,0)||")
Barrier methods We will now see the “sister” of the penalty method which offers a
different penalty approach. This approach is relevant only for inequality constraints, and is
used when we wish to require that the iterates will absolutely be inside the feasible domain.
The idea is to choose a “barrier” function that goes to∞ as we move towards the boundaries
of the feasible domain. Unlike the penalty approach we do not even define the objective
outside the feasible domain, and do not allow our iterates to go there.
There are two main “barrier functions”: the log-barrier function and the inverse-barrier
40 3. Constrained Optimization
Figure 10: The convergence paths of the SD method for minimizing the objective for various penalty parameters. Each solve is initialized with the solution for the previous (smaller) penalty parameter µ.
Figure 11: The function value and constraints violation for the SD iterations. At µ = 100 we get about three digits of accuracy for the constraints.
function. Suppose we wish to enforce a constraint φ(x) ≤ 0; then the two following functions can be used:

b_log(x) = − log(−φ(x)),        b_inv(x) = − 1/φ(x).
Figure 12 shows an example of the two barrier functions. Using these functions, the minimizer will only be inside the domain, and as we apply the iterates, we have to make sure that the next step is always inside the domain. If it is not, we reduce the steplength parameter until the next iterate is inside the domain (in an iterative manner similar to the Armijo process). Since these functions go to ∞, the barrier method is not too sensitive to the choice of the penalty parameter. Note, however, that 1) these barrier functions are not 0 inside the domain, and hence influence the solution even if the constraint is not active; 2) these are highly non-quadratic functions with quite large high-order derivatives (the third derivative, for example). An optimization process that works best on quadratic penalties usually struggles with these functions.
3.4 The projected Steepest Descent method
The projected Steepest Descent method is an attempt to avoid all the difficulties involved
with the penalty and barrier methods. In this method we assume that the iterates x(k) are
inside or on the boundaries of the feasible domain. In this method we find a Steepest Descent
Figure 12: Two barrier functions for the constraint x − 2 ≤ 0. Both of the functions go to ∞ as x → 2. The log function is bounded from below only in a closed region, so it is not suitable for cases where only a one-sided barrier is used.
step y, followed by a projection of y into the feasible domain. That is, we replace y with the closest point that fulfils the constraints. This is not always an easy task; therefore this method is most useful when the constraints are simple (linear, for example).
More explicitly, assume the problem
min_{x∈Ω} f(x),

where Ω is a subset of R^n. Given a point y, we define the projection operator ΠΩ to be the solution of the following problem:

ΠΩ(y) = arg min_{x∈Ω} ‖x − y‖,

for some vector norm (in most cases the squared ℓ_2 norm is chosen).

The projected SD method is defined by

x^{(k+1)} = ΠΩ( x^{(k)} − α∇f(x^{(k)}) ),

where α is obtained by linesearch.
Example 10. Define the projected steepest descent method for the linearly constrained minimization

min_{x∈R^n} f(x),   subject to   Ax = b,

assuming A ∈ R^{m×n} with m < n is full rank.

All we need to define is the projection operator with respect to some norm, and then set x^{(k+1)} = ΠΩ(x^{(k)} − α∇f(x^{(k)})). We will choose the squared ℓ_2 norm:

ΠΩ(y) = arg min_{x∈R^n} (1/2)‖x − y‖²_2,   subject to   Ax = b.

The Lagrangian of this problem is given by

L(x, λ) = (1/2)‖x − y‖²_2 + λ^T(Ax − b).

To solve the problem we solve the system

∇_x L = x − y + A^T λ = 0   ⇒   x* = y − A^T λ.

The Lagrange multiplier λ* is determined by the constraint:

Ax* = b   ⇒   A(y − A^T λ) = b   ⇒   λ* = (AA^T)^{−1}(Ay − b).

Here we see that we need to invert AA^T ∈ R^{m×m}, which is invertible because we assumed that A is full rank. To get the final solution for the projection we set

x* = y − A^T λ* = y − A^T(AA^T)^{−1}(Ay − b).

Now we set y = x^{(k)} − α∇f(x^{(k)}), and assume that x^{(k)} satisfies the constraint (Ax^{(k)} = b):

x^{(k+1)} = x^{(k)} − α∇f(x^{(k)}) − A^T(AA^T)^{−1}(A(x^{(k)} − α∇f(x^{(k)})) − b)
          = x^{(k)} − α(I − A^T(AA^T)^{−1}A)∇f(x^{(k)}).

This way we get the projected SD method, where α is chosen by linesearch. The operator I − A^T(AA^T)^{−1}A is called an orthogonal projection operator. Using this method, every step
x(k+1) − x(k) will be in the null space of A, that is A(x(k+1) − x(k)) = 0.
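The null-space property, together with idempotence, is easy to verify numerically; a small NumPy sketch of ours (random data of our own choosing, not from the notes):

```python
import numpy as np

# P = I - A^T (A A^T)^{-1} A projects onto the null space of A:
# A @ (P v) = 0 for any v, and P is idempotent (P @ P = P).
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 5))          # full row rank with m < n (a.s.)
P = np.eye(5) - A.T @ np.linalg.solve(A @ A.T, A)

v = rng.standard_normal(5)
print(np.linalg.norm(A @ (P @ v)))       # ~0: the step stays feasible
print(np.linalg.norm(P @ P - P))         # ~0: projection is idempotent
```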
Example 11. The box-constrained minimization
min_{x∈R^n} f(x),   subject to   a ≤ x ≤ b,

where the bound vectors satisfy a < b (element-wise).
This time, the constraints admit a very simple solution to the projection problem. The Lagrangian is given by

L(x, λ_1, λ_2) = (1/2)‖x − y‖²_2 + λ_1^T(x − b) + λ_2^T(a − x),

and its gradient is given by

∇_x L(x, λ_1, λ_2) = x − y + λ_1 − λ_2.
But here, there are active and inactive constraints that we need to incorporate. If no con-
straint is active we will get x∗ = y. The problem is separable, so if ai ≤ yi ≤ bi then we
can set x∗i = yi without breaking the constraint, and hence (λ∗1)i = (λ∗2)i = 0, because the
constraints are inactive. If y_i < a_i, then the lower bound constraint is active and the upper bound is not. We set x*_i = a_i and

x*_i − y_i − (λ_2)_i = 0   ⇒   (λ*_2)_i = a_i − y_i > 0.

We get a positive Lagrange multiplier, as required. If y_i > b_i, then the upper bound constraint is active and the lower bound is not. We set x*_i = b_i and

x*_i − y_i + (λ_1)_i = 0   ⇒   (λ*_1)_i = y_i − b_i > 0.
Overall, the projected steepest descent step is given by:

z = x^{(k)} − α∇f(x^{(k)}),        x^{(k+1)}_i = { a_i,  z_i < a_i ;   b_i,  z_i > b_i ;   z_i,  otherwise }.
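For the box case the projection is just a component-wise clip, so the projected step is a one-liner per iteration. A toy NumPy sketch of ours (the quadratic objective, bounds, and fixed step are our own illustrative choices, not from the notes):

```python
import numpy as np

# Projected steepest descent for a box constraint a <= x <= b.
# Toy objective f(x) = ||x - t||^2 whose unconstrained minimizer t
# lies partially outside the box, so the clip becomes active.
a, b = np.zeros(3), np.ones(3)
t = np.array([0.5, 2.0, -1.0])
grad = lambda x: 2.0 * (x - t)

x = np.array([0.5, 0.5, 0.5])
for _ in range(100):
    x = np.clip(x - 0.1 * grad(x), a, b)  # gradient step, then projection
print(x)   # converges to [0.5, 1.0, 0.0]: interior, upper bound, lower bound
```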
4 Robust statistics in least squares problems
In many applications we need to recover some property from given data measurements and (maybe) some additional information or prior knowledge. In most cases we have noise or error in the measurements, which we assume to belong to some distribution. The most common statistical tool that we have is the Gaussian distribution, which leads to least-squares type problems. However, what if some of the measurements are really corrupted? (Such measurements are called "outliers".)
Example 12 (Linear regression with outliers.). Suppose that you are given m measurements (x_i, y_i) and wish to find the line that predicts y_i given x_i. That is, find scalars a, b so that a x_i + b ≈ y_i for all i in an optimal way. Previously, we used the ℓ_2 norm and got the LS problem

min_{a,b} Σ_i (a x_i + b − y_i)²,
or, in matrix form,

min_{a,b} ‖ [x_1 1 ; x_2 1 ; ... ; x_m 1] [a ; b] − [y_1 ; y_2 ; ... ; y_m] ‖²_2.
It turns out that since we are squaring the error, the approximation is very sensitive to points with large errors. Such points appear in many cases of "real data", and they severely influence the approximated properties that we wish to learn about the system (a and b in our case).
As an alternative, we may use a different distance measure, which is quadratic around 0 (to fit the Gaussian assumption) but is linear as we move away from 0. The new function is called the "Huber" loss function and is defined by

h_δ(x) = { (1/2)x²,  |x| ≤ δ ;   δ(|x| − (1/2)δ),  otherwise }.
This function is non-smooth, and is often replaced by the following smooth "pseudo-Huber" function:

h_δ(x) = δ√(x² + δ²) − δ².
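The two regimes of the pseudo-Huber function are easy to check numerically; a tiny Python sketch of ours (the notes' examples are in Julia):

```python
import numpy as np

# Pseudo-Huber: h_d(x) = d*sqrt(x^2 + d^2) - d^2.
# Quadratic near 0 (~ x^2/2 for |x| << d) and linear far away (~ d*|x|).
d = 1.0
h = lambda x: d * np.sqrt(x**2 + d**2) - d**2

print(h(0.01) / (0.5 * 0.01**2))   # ratio ~1: quadratic regime
print(h(1000.0) / (d * 1000.0))    # ratio ~1: linear regime
```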
Figure 13: A comparison between Quadratic and Huber loss functions.
Figure 13 shows these two functions compared to the standard loss (1/2)x². All functions behave the same around zero, but the Huber functions grow like δ|x| as x grows, in order not to over-penalize outliers.
The following code shows an example of linear regression, where the noise standard devi-
ation is set to 0.2. To solve this problem we use LS and get a reasonable answer. However,
now we set the noise level in 2 random entries of the data to have much larger error of stan-
dard deviation 20 (still Gaussian). Even though the corrupted data is just 1% of the data,
the LS estimation is completely wrong and is heavily biased because of the corrupted data.
Approximating the variables a, b using (pseudo) Huber loss generates a much more accurate
result, regardless of the outliers. The estimation is demonstrated in Fig. 14. The downside is
that the approximation cannot be computed in closed form and requires an iterative process.
In this code example we use Steepest Descent but any other general purpose method can be
used here as the function is smooth and convex.
x = linspace(0.01,1.0,200);a = 0.8;b = 0.4;sigma_noise = 0.2;
epsilon = sigma_noise*randn(length(x));
B = a.*x + b + epsilon;
A = [x ones(length(x))];
opt_ab = (A’*A) \ A’*B; # can be obtained using opt_ab = A\B;
figure();
subplot(1,2,1)
plot(x,B,".b");
plot(x,a*x + b,"-r");
plot(x,opt_ab[1]*x + opt_ab[2],"-g");
title("Least Squares: linear regression example");
legend(("Measurments","True line","Estimated line"));
corrupt_ind = randperm(200)[1:2];
epsilon[corrupt_ind] = 20.0*randn(length(corrupt_ind));
B = a.*x + b + epsilon;
opt_ab = (A’*A) \ A’*B; # can be obtained using opt_ab = A\B;
delta = sigma_noise;
hub = (x)->(r = A*x-B; f = sum(delta*(sqrt(r.^2 + delta^2)) - delta^2); return f;); ## pseudo-Huber of the residual r (not of x)
g_hub = (x)->(r = A*x-B; return A’*(delta*(r./sqrt(r.^2 + delta^2))); );
x_SD = [0.4;0.2];
println("*********** SD *****************")
## SD Iterations
for k=1:1000
gk = g_hub(x_SD);
d_SD = -gk;
x_prev = copy(x_SD);
alpha_SD = linesearch(hub,x_SD,d_SD,gk,1.0,beta,c); ## see linesearch() in previous examples.
x_SD=x_SD+alpha_SD*d_SD;
if norm(gk) < 1e-2
break;
end
end;
subplot(1,2,2)
plot(x,B,".b");
plot(x,a*x + b,"-r");
plot(x,opt_ab[1]*x + opt_ab[2],"-g");
plot(x,x_SD[1]*x + x_SD[2],"-k");
title("Least Squares: linear regression example with 1% outliers");
legend(("Measurments","True line","LS estimate","Huber estimate"));
Figure 14: A comparison between LS and Huber estimations with 1% corrupted data.
5 Stochastic optimization
The term "stochastic optimization" refers to a family of optimization methods that apply some randomness along their way to obtaining the optimal solution. We will focus on stochastic gradient descent (SGD) methods, which have become very popular recently as the work-horse for training deep neural networks.
Notation used in the literature in this context. Two definitions appear in the literature in this context: f is ρ-Lipschitz continuous if

|f(x_1) − f(x_0)| ≤ ρ‖x_1 − x_0‖,
which is a more general property than continuous differentiability (for example, the absolute value function is Lipschitz continuous but not differentiable at 0). Also, a function is called ρ-smooth if its gradient is Lipschitz continuous:

∀ x_0, x_1 :  ‖∇f(x_1) − ∇f(x_0)‖ ≤ ρ‖x_1 − x_0‖.

For a twice differentiable function, if one uses the ℓ_2 norm above, this also means that the Hessian is bounded: ∇²f ⪯ ρI.
A differentiable function f is µ-strongly convex if for every x, y:

f(y) ≥ f(x) + ∇f(x)^T(y − x) + (µ/2)‖y − x‖²        (28)

for some µ > 0.
A more general condition than strong convexity is the Polyak-Lojasiewicz (PL) inequality, which is quite useful for establishing linear convergence of many algorithms:

∀ x :  (1/2)‖∇f(x)‖² ≥ µ(f(x) − f(x*)).        (29)
(28)⇒(29) is obtained by minimizing both sides of (28) with respect to y.
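To spell out that step: the right-hand side of (28) is a convex quadratic in y, minimized at y = x − ∇f(x)/µ, so minimizing both sides over y gives

```latex
f(x^*) \;\ge\; \min_{y}\left\{ f(x) + \nabla f(x)^\top (y-x) + \frac{\mu}{2}\|y-x\|^2 \right\}
        \;=\; f(x) - \frac{1}{2\mu}\|\nabla f(x)\|^2,
```

and rearranging yields exactly (29).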
Another property implied by (28) is

‖∇f(x) − ∇f(y)‖ ≥ µ‖x − y‖.

This also means that if f is twice differentiable, then the Hessian's eigenvalues are bounded from below: ∇²f ⪰ µI.
5.1 Stochastic Gradient descent
SGD was originally designed to save computations, but it actually saves computations only in rather unique and limited cases. Later, it was shown that the power of SGD actually lies in handling non-convex problems. Hence, we will distinguish between the performance of SGD in solving convex problems versus solving (highly) non-convex problems.
To present SGD, we will revert to a notation that is more common in machine learning
applications: xi will be data samples (columns/rows of a matrix), and the vector w will be
a vector of unknown weights that we will optimize for (up until now the vector of unknowns
was denoted by x and the matrix rows/columns by ai).
Assume that you have some data {x_i}_{i=1}^m to characterize. We can assume that {x_i}_{i=1}^m is drawn from some multivariate probability distribution. Also assume that you can somehow characterize this data by finding some vector of weights w ∈ R^n, which is found by minimizing an objective according to some suitable model. Such an optimization problem typically has the form
arg min_{w∈R^n} F(w) = (1/m) Σ_{i=1}^m f(w, x_i) = (1/m) Σ_{i=1}^m f_i(w),        (30)

where w is the unknown vector of parameters, and f() is a function derived from the assumed model. Since {x_i}_{i=1}^m is drawn from a multivariate probability distribution, for m → ∞ we have

F(w) = E_x(f(w, x)),        (31)
where E_x is the true expected value with respect to the distribution of the x_i's. Now suppose that we wish to use gradient descent, for which we need to compute the gradient (with respect to the unknown w, and not the data x):

∇_w F = (1/m) Σ_{i=1}^m ∇_w f_i(w) = E_x(∇_w f(w, x)) = ∇_w E_x(f(w, x)).   (32)

All these equalities stem from the linearity of the expectation and gradient operations. We essentially get that the gradients of (30) and (31) are equal as m → ∞.
With the above derivations in mind, one way to interpret SGD would be: we wish to minimize (30), but let's assume that we really want to minimize (31), or that they are similar enough. For that we need to estimate the gradient of the expectation, which can be estimated using fewer data samples.
So, we can pick a (small) random subset of the x_i at each step, and estimate ∇_w E_x(f(w, x)) according to it (even one sample is sufficient for this). Then, we are essentially minimizing F(w) = E_x(f(w, x)). The resulting SGD algorithm can be summarized as:

Pick S ⊂ {1, ..., m} :   w^(k+1) = w^(k) − α^(k) (1/|S|) Σ_{i∈S} ∇f_i(w^(k)),   (33)

where S is the so-called "mini-batch". In pure SGD, we pick S to be a single data sample.
Since we only estimate the gradient of (31), and since every estimation has an error, we essentially get that

(1/|S|) Σ_{i∈S} ∇f_i(w^(k)) = ∇_w F(w^(k)) + η,

where η is noise with a mean of 0 (an unbiased estimator). So, SGD is simply a noisy version of standard gradient descent if one looks at minimizing (30). If we're looking at minimizing (31), then the noise in the gradient is inevitable anyway.
As in the previous cases, a key issue with SGD in (33) is the choice of the step size α^(k), which is also called the "learning rate". Theory says that for convergence in convex problems, we need to choose α^(k) small and decaying (α^(k) → 0), but still satisfying

lim_{k→∞} Σ_{j=1}^k α^(j) = ∞.

A possible choice is α^(k) = 1/k. In practice, α is often just chosen to be a sufficiently small constant, or a set of constants for bunches of iterations, which is called "learning rate scheduling". For example, up to k = 100, α = 0.01, and from k = 100 on, α = 0.001.
In practice, we wish to divide the iterations into random sweeps over the whole data in
mini-batches. Each such sweep is called an “epoch”. Algorithm 4 summarizes the method.
Algorithm: Stochastic Gradient Descent
  # Input: Objective F(w) = (1/m) Σ_{i=1}^m f_i(w) to minimize.
  for k = 1, ..., maxIter do   # loop over epochs
    Divide the data indices {1, ..., m} into random mini-batches (mb) {S_j}_{j=1}^{#mb}.
    Denote w^(k,1) = w^(k).
    for j = 1, ..., #mb do
      Compute the mini-batch gradient: g^(k,j) = (1/|S_j|) Σ_{i∈S_j} ∇f_i(w^(k,j)).
      Choose a step length α^(k,j) (a "learning rate").
      Apply a step: w^(k,j+1) ← w^(k,j) − α^(k,j) g^(k,j).
    end
    Denote w^(k+1) = w^(k,#mb+1).   # may be replaced by averaging over j
    # Check convergence by some criterion.
  end
  Return w^(k+1) as the solution.   # may be replaced by some averaging

Algorithm 4: Stochastic Gradient Descent
5.2 Example: SGD for minimizing a convex function
Assume that we have a least squares problem, which we will write a bit differently than before:

arg min_w F(w) = (1/2m) Σ_{i=1}^m (x_i^T w − y_i)² = (1/2m) ‖Xw − y‖²₂,
where X is the data matrix whose rows are xi, and yi are given scalars.
Since f_i(w) = (1/2)(x_i^T w − y_i)², then ∇_w f_i = (x_i^T w − y_i) x_i. SGD with a random mini-batch of size 1 will yield

w^(k+1) = w^(k) − α^(k) (x_i^T w^(k) − y_i) x_i.

Interestingly enough, if we choose α^(k) = 1/(x_i^T x_i), then we get a particular method for solving linear systems called the "randomized Kaczmarz method". Essentially, the iteration above
is often used to solve linear systems of the form
Xw = y,
by reformulating the problem as
XX^T u = y,   w = X^T u,
and applying Gauss-Seidel updates with a random selection of variables. Since XX^T is a symmetric positive semidefinite matrix, the Gauss-Seidel iterations converge (assuming that there is a solution), and so does SGD as long as α^(k) is sufficiently small (note that it does not need to decay to zero as k → ∞).
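A minimal Python sketch of the randomized Kaczmarz iteration (not the lecture's Julia code; data and sizes are made up for illustration): SGD on the least squares objective with mini-batch size 1 and the special step size α = 1/(x_i^T x_i), applied to a small consistent system Xw = y:

```python
import random

# Randomized Kaczmarz: project onto a randomly chosen row's hyperplane
# x_i^T w = y_i at every step. Converges for consistent systems.
random.seed(1)
X = [[2.0, 1.0], [1.0, 3.0]]          # rows x_i of the data matrix
w_true = [1.0, -1.0]
y = [sum(Xij*wj for Xij, wj in zip(row, w_true)) for row in X]

w = [0.0, 0.0]
for _ in range(2000):
    i = random.randrange(len(X))      # pick one random row (mini-batch of size 1)
    xi = X[i]
    r = sum(a*b for a, b in zip(xi, w)) - y[i]       # residual x_i^T w - y_i
    step = r / sum(a*a for a in xi)                  # alpha = 1/(x_i^T x_i)
    w = [wj - step*xij for wj, xij in zip(w, xi)]    # w <- w - alpha * r * x_i

assert all(abs(wj - wtj) < 1e-8 for wj, wtj in zip(w, w_true))
print("Kaczmarz converged:", w)
```

Each step is an orthogonal projection onto one equation's hyperplane, which is exactly the Gauss-Seidel interpretation described above.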
using LinearAlgebra, Random

function compareSGDandCG(m,n)
    X = randn(m,n);
    U,S,V = svd(X);
    S = exp.(1.5*randn(min(m,n)));   # prescribe the singular values
    X = U*Diagonal(S)*V';
    println("Cond: ",maximum(S)/minimum(S))
    sol = randn(n);
    y = X*sol + 0.05*randn(m);
    lambda = 0.01;
    I_n = Matrix(1.0I, n, n);
    # the exact solution of the regularized normal equations:
    sol = ((1.0/m)*(X'*X) + lambda*I_n)\((1.0/m)*X'*y);
    w = zeros(n);
    batch = 30;
    norms = norm(y);
    alpha = 1.0/maximum(eigvals((1.0/m)*(X'*X) + lambda*I_n));
    println("alpha: ",alpha)
    max_epochs = 1000;
    lr = alpha;
    for epoch = 1:max_epochs
        idxs = randperm(m);
        # other options: lr = alpha; lr = alpha/sqrt(epoch)
        if mod(epoch,100)==0   # learning rate scheduling
            lr = lr*0.1;
            println("lr: ",lr)
        end
        for k=0:div(m,batch)-1
            Ib = idxs[(k*batch+1):((k+1)*batch)];
            Xb = X[Ib,:];
            grad = (1.0/batch)*Xb'*(Xb*w - y[Ib]) + lambda*w;
            w = w - lr.*grad;
        end
        nn = norm((1/m)*X'*(X*w - y) + lambda*w);
        norms = [norms;nn];
    end
    wcg = zeros(n);
    # solve the same (scaled) system with CG for comparison:
    x,normsCG = CG((1.0/m)*(X'*X) + lambda*I_n, (1.0/m)*X'*y, wcg, max_epochs, sol);
    return norms,normsCG
end
using Random; using PyPlot; close("all");
include("cg.jl");
n = 500;
for m = [300;500;1000]
figure();
norms,normsCG = compareSGDandCG(m,n);
semilogy(norms)
semilogy(normsCG)
legend(("SGD","CG"));
title(string("m = ",m,", n = ",n));
end
println("Done.");
5.3 Upgrades to SGD
SGD is the workhorse for many data science applications in which the amount of data is much larger than the number of parameters, and the problems are well conditioned. One way to reduce the noise in the SGD gradients is by averaging iterates/gradients or by re-scaling them.
Momentum  One common way is to use the momentum strategy:

g^(k) = (1/|S|) Σ_{i∈S} ∇f_i(w^(k))
m^(k) = γ m^(k−1) + α^(k) g^(k)
w^(k+1) = w^(k) − m^(k),

where m^(0) is set to zero. The parameter γ is usually chosen to be between 0.5 and 0.9, and generally, the learning rate α needs to be chosen smaller than in standard SGD. The momentum drives the iterations in the direction of the averaged gradients. One common pitfall of the method is that the algorithm often fails to stop upon convergence (because the momentum does not decay to zero).
AdaGrad  Another approach is to rescale the gradients (the exponentially-weighted variant below is often called RMSProp):

g^(k) = (1/|S|) Σ_{i∈S} ∇f_i(w^(k))
v^(k) = γ v^(k−1) + (1 − γ)(g^(k) ⊙ g^(k))
w^(k+1) = w^(k) − α^(k) (1/√v^(k)) ⊙ g^(k).

The symbol ⊙ denotes the Hadamard (element-wise) product, and the square root is also applied element-wise.
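The two update rules above can be sketched in a few lines of Python (an illustration only, using the full gradient of a toy quadratic in place of the mini-batch estimate g^(k); the names gamma, lr, eps are ours):

```python
# Toy quadratic f(w) = (1/2) sum_i a_i w_i^2 with an ill-conditioned Hessian.
a = [1.0, 10.0]
def grad(w): return [ai*wi for ai, wi in zip(a, w)]
def f(w): return 0.5*sum(ai*wi**2 for ai, wi in zip(a, w))

gamma, lr, eps = 0.9, 0.05, 1e-8
w_mom, m = [1.0, 1.0], [0.0, 0.0]
w_rms, v = [1.0, 1.0], [0.0, 0.0]
for k in range(100):
    g = grad(w_mom)                       # momentum: m accumulates past gradients
    m = [gamma*mi + lr*gi for mi, gi in zip(m, g)]
    w_mom = [wi - mi for wi, mi in zip(w_mom, m)]

    g = grad(w_rms)                       # rescaling: running average of g (.) g
    v = [gamma*vi + (1-gamma)*gi*gi for vi, gi in zip(v, g)]
    w_rms = [wi - lr*gi/((vi + eps)**0.5) for wi, gi, vi in zip(w_rms, g, v)]

assert f(w_mom) < f([1.0, 1.0]) and f(w_rms) < f([1.0, 1.0])
print("f after momentum:", f(w_mom), " f after rescaled step:", f(w_rms))
```

Note the small eps added under the square root; in practice it is needed to avoid division by zero when v^(k) has tiny entries.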
Figure 15: MNIST data set: Given handwritten images and their labels (a "training set" of 60,000 labeled images), learn how to classify digits.
6 Minimization of Neural Networks for Classification
In this section we will discuss how to learn the parameters of a simple neural network. As an example, we will take a case of classification. In classification we wish to predict a discrete value (say, "0" or "1") for given data, say an image. A famous data set is the MNIST data set illustrated in Fig. 15, where we need to learn a function (or a computer program) that can say which number appears in a given image. We "teach" the program to predict the numbers for the images using examples of labeled images, for which we train parameters. This is done by defining an objective function whose unknowns are the parameters that we wish to "learn", and minimizing this objective function using an iterative method.
6.1 Linear classifiers: Logistic regression
Logistic regression is a "linear classifier", which means that it predicts discrete values based on an inner product x^T w (a linear function), more or less like least squares problems, which are also called "linear regression". The main difference is that based on this inner product (a continuous scalar), logistic regression predicts a discrete value, 0 or 1. It does so by a function called the "logistic" or "sigmoid" function:

σ(x^T w) = 1/(1 + exp(−x^T w)) ∈ [0, 1].   (34)
The function σ(x^T w) may be interpreted as the probability of the label of x being 0 or 1, given the distribution parameter w. If we decide that

P(y = 1|x) = σ(x^T w),   P(y = 0|x) = 1 − σ(x^T w),   (35)

then if σ(x^T w) is close to 1, we will say that with high probability y = 1, and if not, then y = 0. The labels can be chosen the other way around, and then the w that we learn will just switch signs (with −w, the roles of the two labels are swapped). We often refer to (34) as the "(probabilistic) model" with which we classify the images.
In "supervised learning" we are given labeled data {(x_i, y_i)}_{i=1}^m, such that x_i ∈ R^n and the y_i are discrete categorical labels (they may even be "cat" and "dog"). In logistic regression, we assume that we have only two labels (say, "cat" and "dog"), and we wish to train w based on

F(w) = −(1/m) Σ_{i=1}^m ( I_{y_i=dog} log(σ(x_i^T w)) + I_{y_i=cat} log(1 − σ(x_i^T w)) ),   (36)

which is also the maximum likelihood estimator of w that corresponds to the above probability. I_{y_i=dog} is an indicator function that equals 1 if y_i = dog and 0 otherwise. Once we minimize this "loss" function and get the estimated parameters w*, we can predict! Given a new image x, we compute σ(x^T w*). If the value is larger than 0.5, then we'll predict that x should be labeled as "dog", and otherwise as a "cat".
We can write the objective in matrix-vector form. This way, it will be more efficient to compute the objective in languages that are vectorized, like Matlab, Python or R. Let X ∈ R^{n×m} be the data matrix

X = [x₁|x₂|x₃|...|x_m],

and let c₁, c₂ ∈ {0, 1}^m be the class vectors

c₁ = [I_{y₁=dog}, I_{y₂=dog}, I_{y₃=dog}, ..., I_{y_m=dog}]^T,   (37)
c₂ = [I_{y₁=cat}, I_{y₂=cat}, I_{y₃=cat}, ..., I_{y_m=cat}]^T.   (38)
The objective in (36) can be written in matrix form as

F(w) = −(1/m) ( c₁^T log(σ(X^T w)) + c₂^T log(1 − σ(X^T w)) ),   (39)

where the scalar function σ operates on each vector component separately. We may also write c₂ = 1 − c₁, but we keep it this way for the derivation of the softmax function.
Recall the following example:
Example 13. (The Jacobian of φ(x), where φ is a scalar function) Suppose that f = φ(x), where φ : R → R is a scalar function, i.e., f_i(x) = φ(x_i). In this case the vector Taylor expansion is just a vector of one-dimensional expansions:

δf = φ(x + ε) − φ(x) ≈ diag(φ′(x))ε = diag(φ′(x))δx.

This means that J = diag(φ′(x)), which is a diagonal matrix such that J_ii = φ′(x_i).
6.1.1 Computing the gradient of logistic regression
First,

c^T φ(X^T(w + δw)) ≈ c^T φ(X^T w) + c^T diag(φ′(X^T w)) X^T δw,

and hence:

∇_w (c^T φ(X^T w)) = X diag(φ′(X^T w)) c.   (40)

Next, it is not hard to show that for (34) we have

σ′(t) = σ(t)(1 − σ(t)),   (log(σ(t)))′ = 1 − σ(t),   (log(1 − σ(t)))′ = −σ(t).
In the first term of (39) we have φ₁(X^T w) = log(σ(X^T w)), and for the second one we have φ₂(X^T w) = log(1 − σ(X^T w)). Hence we get

∇_w F = −(1/m) X [ diag(φ₁′(X^T w)) c₁ + diag(φ₂′(X^T w)) c₂ ]   (41)
      = −(1/m) X [ diag(1 − σ(X^T w)) c₁ − diag(σ(X^T w)) c₂ ]   (42)
      = −(1/m) X [ diag(1 − σ(X^T w)) c₁ − diag(σ(X^T w))(1 − c₁) ]   (43)
      = (1/m) X [ σ(X^T w) − c₁ ].   (44)
The Hessian  The Hessian of (39), which is also the Jacobian of (41), can be computed using computations similar to (40). We get:

∇²F = (1/m) X D X^T,

where D = diag(σ′(X^T w)) = diag(σ(X^T w)(1 − σ(X^T w))) is a diagonal matrix with only positive numbers on the diagonal. Overall, the Hessian must be a positive semidefinite matrix, and hence the logistic regression objective is convex; strict convexity depends on whether the matrix X is of full rank n or not. It can be shown using Taylor series that, up to a constant,

F(w + δw) = (1/m) ‖X^T δw − g‖²_D + O(‖δw‖³)

for some vector g, and hence SGD will asymptotically behave as it behaves for least squares problems, which we saw earlier.
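The gradient formula (44) can be checked numerically. The following Python sketch (tiny made-up data, for illustration only) compares the analytic gradient with central finite differences:

```python
import math

def sigma(t): return 1.0/(1.0 + math.exp(-t))

xs = [[1.0, 0.5], [-1.0, 2.0], [0.3, -0.7], [2.0, 1.0]]   # samples x_i
c1 = [1, 0, 1, 0]                                          # indicator of class "dog"
m = len(xs)

def F(w):   # objective (36)/(39)
    s = 0.0
    for xi, ci in zip(xs, c1):
        p = sigma(sum(a*b for a, b in zip(xi, w)))
        s += ci*math.log(p) + (1-ci)*math.log(1-p)
    return -s/m

def gradF(w):   # formula (44): (1/m) X [sigma(X^T w) - c1]
    g = [0.0, 0.0]
    for xi, ci in zip(xs, c1):
        r = sigma(sum(a*b for a, b in zip(xi, w))) - ci
        g = [gj + r*xij/m for gj, xij in zip(g, xi)]
    return g

w = [0.3, -0.2]
h = 1e-6
for j in range(2):          # central finite differences in each coordinate
    wp = list(w); wp[j] += h
    wm = list(w); wm[j] -= h
    fd = (F(wp) - F(wm))/(2*h)
    assert abs(fd - gradF(w)[j]) < 1e-6
print("analytic gradient matches finite differences:", gradF(w))
```

This is exactly the kind of "gradient test" discussed in Section 6.6 below, applied to the logistic regression objective.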
6.1.2 Adding the bias
In some sense, the logistic regression model (34) separates the domain into two sides by a linear function. If x^T w > 0 the probability will come out larger than 0.5 and we will classify x as "1", while if x^T w < 0, we will classify x as "0". This means that we are separating the entire R^n domain into two half-spaces by a linear function - all points x that lie on the same side of that plane will be classified together. As we know, a plane (linear function) is usually defined by x^T w + b with a free parameter b ∈ R. Our case is no different, and in practice we need to add b to the parameters of the logistic regression classifier. This term is called the "bias". To keep all the derivations above, one can add the bias artificially by adding a pixel of value 1 to every image x, and adding one parameter to w.
6.2 Linear classifiers: Multinomial logistic (softmax) regression
The logistic regression objective was able to classify data images into two classes. What happens if we have more than two? Say, "dog", "cat" and "horse" (in Fig. 15 we have 10 labels)? The answer is multinomial logistic regression, also dubbed the "softmax" function, which is a simple generalization of the logistic regression model to l labels.

Assume again that we have the labeled data {(x_i, y_i)}_{i=1}^m, where this time y_i ∈ {1, ..., l}. We will again use the model (34), but now we will have l weight vectors {w_i}_{i=1}^l. For each image, we will have a probability of that image being any of the labels (l probabilities). To have them all sum to 1, we will define, somewhat similarly to (35),

P(y = j|x) = exp(x^T w_j) / Σ_{i=1}^l exp(x^T w_i) ∈ [0, 1].   (45)
The objective function can be written similarly to (39):

F({w_k}_{k=1}^l) = −(1/m) Σ_{i=1}^m ( Σ_{k=1}^l (c_k)_i log( (Σ_{j=1}^l exp(x_i^T w_j))^{−1} exp(x_i^T w_k) ) )   (46)
               = −(1/m) Σ_{k=1}^l c_k^T log( diag( Σ_{j=1}^l exp(X^T w_j) )^{−1} exp(X^T w_k) ).   (47)

In softmax, one also needs to add the bias terms, just like in logistic regression. We will have one bias term for each label.
Remark 2. The diag(·)^{−1} term above is written this way so that the expression conforms to the rules of linear algebra. Computing diag(v)^{−1}u "as is" in code for two large vectors v, u is not efficient. Instead, we may perform an element-wise division. In code it often looks like u./v in most scientific languages (Matlab, Julia, etc.).
Computing extreme exponentials "safely"  The terms above should sum to 1. However, as w^T x is not bounded, the exponent may be computed by the computer as Inf (numerical overflow), leaving the whole computation undetermined. As it happens, we can instead compute

exp(x^T w_j) / Σ_{i=1}^l exp(x^T w_i) = exp(x^T w_j − η) / Σ_{i=1}^l exp(x^T w_i − η),

for some η, and in particular we can choose η = max_j {x^T w_j}, guaranteeing that the computation will be performed without overflow.
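The shifted computation above can be sketched in a few lines of Python (an illustration; the helper name softmax is ours):

```python
import math

# Subtracting eta = max_j t_j before exponentiating leaves the probabilities
# unchanged but ensures every exponent is <= 0, so exp never overflows.
def softmax(t):
    eta = max(t)
    e = [math.exp(tj - eta) for tj in t]
    Z = sum(e)
    return [ej / Z for ej in e]

p = softmax([1000.0, 1001.0, 1002.0])     # a naive exp(1000.0) would overflow
assert abs(sum(p) - 1.0) < 1e-12
print(p)
```

Without the shift, math.exp(1000.0) raises an OverflowError in Python (and evaluates to Inf in floating-point hardware), so the naive formula is unusable for large inner products.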
6.2.1 Computing the gradient of softmax regression
The computation of the gradient is quite similar to that of logistic regression. We will use the following equality:

∂/∂t_p [ Σ_{k=1}^l a_k log( exp(t_k) / Σ_j exp(t_j) ) ] = a_p − Σ_{k=1}^l a_k · exp(t_p) / Σ_j exp(t_j).
Using this equality and the fact that Σ_k c_k = 1 (element-wise), we get

∇_{w_p} F = −(1/m) X [ c_p − Σ_{k=1}^l diag( Σ_j exp(X^T w_j) )^{−1} exp(X^T w_p) ⊙ c_k ]
          = −(1/m) X [ c_p − diag( Σ_j exp(X^T w_j) )^{−1} exp(X^T w_p) ⊙ Σ_{k=1}^l c_k ]
          = (1/m) X [ diag( Σ_j exp(X^T w_j) )^{−1} exp(X^T w_p) − c_p ].

Remark 3. The softmax objective (46) is redundant, as one computes a weight vector w_k
for each label. Since we make sure all probabilities sum to 1, one can compute probabilities for all but one label, and that label completes the sum of probabilities to 1. That is the case in (36) for two labels, where we learn only one weight vector w. Note that by comparing the probabilities (34) and (45), one can see that (34) can be obtained from the formula in (45) with 2 vectors w_j where the second one is identically zero. Hence, one can remove one weight vector from (46) and set it to zero without losing anything. Otherwise, we just have a redundant function with many equivalent local minima. That does not necessarily need to bother us, and in practice many people use the redundant version of softmax.
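As with logistic regression, the softmax gradient derived above can be checked against finite differences. A Python sketch (tiny made-up data; here W holds the vectors w_k as rows, which is our own layout choice for the illustration):

```python
import math

xs = [[1.0, 0.5], [-1.0, 2.0], [0.3, -0.7], [2.0, 1.0]]
labels = [0, 2, 1, 0]                    # y_i in {0, ..., l-1}
l, m = 3, len(xs)

def probs(W, xi):                        # softmax probabilities for one sample
    t = [sum(a*b for a, b in zip(wk, xi)) for wk in W]
    eta = max(t)                         # the "safe" shift from above
    e = [math.exp(tj - eta) for tj in t]
    Z = sum(e)
    return [ej/Z for ej in e]

def F(W):                                # objective (46)
    return -sum(math.log(probs(W, xi)[yi]) for xi, yi in zip(xs, labels))/m

def gradWp(W, p):                        # (1/m) X [softmax_p(X^T w) - c_p]
    g = [0.0, 0.0]
    for xi, yi in zip(xs, labels):
        r = probs(W, xi)[p] - (1.0 if yi == p else 0.0)
        g = [gj + r*xij/m for gj, xij in zip(g, xi)]
    return g

W = [[0.1, -0.2], [0.0, 0.3], [-0.1, 0.2]]
h = 1e-6
for p in range(l):                       # finite-difference check, all entries
    for j in range(2):
        Wp = [row[:] for row in W]; Wp[p][j] += h
        Wm = [row[:] for row in W]; Wm[p][j] -= h
        fd = (F(Wp) - F(Wm))/(2*h)
        assert abs(fd - gradWp(W, p)[j]) < 1e-6
print("softmax gradient matches finite differences")
```

Note that at W = 0 the objective equals log l, since every sample then gets uniform probability 1/l for each label.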
Remark 4. The derivative of softmax with respect to the data. This derivative is necessary for combining the softmax loss function at the end of a forward pass of a neural network, described in the next section. Using computations similar to the above, we have in matrix form (this has to be verified):

∇_X F = (1/m) W [ exp(W^T X) ⊘ ( Σ_j exp(w_j^T X) ) − C ] ∈ R^{n×m},

where ⊘ is element-wise division (which also requires a broadcast or a repmat), W ∈ R^{n×l} is the matrix of all weights (arranged by columns), and C ∈ R^{l×m} is the matrix of all labels (arranged by rows):

C = [ − c₁^T − ; − c₂^T − ; ... ; − c_l^T − ].
6.3 The general structure of a neural network
A neural network is typically defined out of several building blocks that are chained one after the other, starting from a data measurement and ending with a loss function. A network typically has the following architecture:

F({θ^(l)}_{l=1}^L) = (1/m) Σ_{i=1}^m ℓ(θ^(L), x_i^(L), y_i)   (48)
s.t.  x_i^(l+1) = f_l(θ^(l), x_i^(l)),   l = 1, ..., L − 1,   i = 1, ..., m,   (49)

where ℓ() is the softmax objective function defined in the previous section, and x_i^(1) denotes the given data sample i. We will denote by θ^(l) the set of weights of layer l that we need to learn. For example, for the last layer L we have the softmax weights:

θ^(L) = { {w_j^(L)}_{j=1}^{nlabels}, w_j^(L) ∈ R^n, and the bias vector b^(L) ∈ R^{nlabels} }.

For each data sample x_i, the predicted label is the label with the highest probability in the softmax function.
Figure 16: Commonly used activation functions σ(x).
The step x_i^(l+1) = f_l(θ^(l), x_i^(l)) may be defined in various ways. The classical neural network step is defined as

f(θ, x) = f(W, b, x) = σ(Wx + b),

where σ is some non-linear "activation function". Theoretically, any non-linear function can be used as an activation. Typical activation functions are the hyperbolic tangent tanh, the sigmoid that we've seen earlier, and the ReLU(x) = max(x, 0) function illustrated in Fig. 16.

A more sophisticated step is that of residual neural networks (ResNets), defined as

f(W₁, W₂, b, x) = x + W₂σ(W₁x + b),   (50)

for which the dimension of x is the same as the dimension of f, or

f(W₁, W₂, b, x) = W₂x + σ(W₁x + b),   (51)

for which the dimensions are different. There are several other types of steps in common NNs, but we will stop here.
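The two kinds of steps can be sketched as plain Python functions (an illustration with made-up sizes and weights, using tanh as the activation):

```python
import math

def matvec(W, x):
    return [sum(wij*xj for wij, xj in zip(row, x)) for row in W]

def classic_step(W, b, x):               # f(W,b,x) = sigma(Wx + b)
    return [math.tanh(t + bi) for t, bi in zip(matvec(W, x), b)]

def residual_step(W1, W2, b, x):         # f = x + W2*sigma(W1x + b), eq. (50)
    h = [math.tanh(t + bi) for t, bi in zip(matvec(W1, x), b)]
    return [xi + t for xi, t in zip(x, matvec(W2, h))]

x = [0.5, -1.0]
W1 = [[0.2, -0.1], [0.4, 0.3]]
W2 = [[1.0, 0.0], [0.0, 1.0]]
b = [0.1, -0.2]
print("classical:", classic_step(W1, b, x))
print("residual: ", residual_step(W1, W2, b, x))
```

A forward pass of the network simply chains such steps: x^(l+1) = f_l(θ^(l), x^(l)). Note that the residual step with W₂ = 0 is the identity, which is one reason such layers are easy to train.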
We will focus on the residual network. In the language of Eq. (48), the unknowns that we have for the residual layers (or blocks of layers) l = L − 1, ..., 1 are

θ^(l) = { W₁^(l) ∈ R^{n×n}, W₂^(l) ∈ R^{n×n}, b^(l) ∈ R^n },

which need to be learned as part of the training process (minimization of (48) for the training set).
Remark: the definition of a layer  Some people define each operation in the function above as a "layer" (for example, the addition of the bias, the multiplication by W, the non-linear function σ, etc., each one is a different "layer"). The whole residual network step is then defined as a block of layers.
6.4 Computing the gradient of NN’s: back-propagation
The constrained objective (48) can be minimized in various ways. The most common way to solve it is by SGD. To apply SGD, we need to be able to compute the gradient, and somehow handle the constraint. For practical reasons, the constraint is handled by elimination, removing the x^(l) as unknowns. We get a nested chain of functions of the type

F({θ^(i)}_{i=1}^3) = f₃(x^(3), θ^(3)) = f₃(f₂(x^(2), θ^(2)), θ^(3)) = f₃(f₂(f₁(x^(1), θ^(1)), θ^(2)), θ^(3)),   (52)

where in this example, L = 3, and x^(1) is the given data.
Let's assume that all the quantities above are scalars, i.e., f_i(x, θ) : (R, R) → R, so each derivative will also be a scalar. We wish to get the derivative of f₃ in (52) with respect to θ^(1), θ^(2), θ^(3). We get a special structure using inner derivatives:

∂F/∂θ^(3) = ∂f₃/∂θ   (53)
∂F/∂θ^(2) = ∂f₃/∂x · ∂f₂/∂θ   (54)
∂F/∂θ^(1) = ∂f₃/∂x · ∂f₂/∂x · ∂f₁/∂θ.   (55)
The order of multiplications in (53)-(55) could have been different, as these are all scalars, but it turns out (as we will see later) that the right way to put it is

∂F/∂θ^(3) = ∂f₃/∂θ   (56)
∂F/∂θ^(2) = ∂f₂/∂θ · ∂f₃/∂x   (57)
∂F/∂θ^(1) = ∂f₁/∂θ · ∂f₂/∂x · ∂f₃/∂x.   (58)
This is the right order if all the functions were vector functions of vector unknowns. To compute (52) we start from f₁ and end at f₃. That is a forward pass. To compute the gradient of F, we start from f₃ and propagate backwards all the way to f₁. That is called the backpropagation pass.

To illustrate the right order, we will rewrite (52) in vector form for L = 2, but also include the scalar loss function at the end. We denote ℓ = ℓ(x, θ) and f_i = f_i(x, θ). That is, each function f_i is defined for two unknowns and is not "aware" of its location in the network:

F({θ^(i)}_{i=1}^3) = ℓ(f₂(x^(2), θ^(2)), θ^(3)) = ℓ(f₂(f₁(x^(1), θ^(1)), θ^(2)), θ^(3)).   (59)
We wish to compute the gradient of F with respect to all the parameters (in order to use an optimization algorithm):

∇F = [ ∇_{θ^(3)}F ; ∇_{θ^(2)}F ; ∇_{θ^(1)}F ].

Since θ^(3) appears only in ℓ, the gradient with respect to θ^(3) is simply

∇_{θ^(3)}F = ∇_θ ℓ.

The rest of the gradient requires the chain rule, because it involves the f_i inside ℓ. We know that the simple Taylor series is given by

ℓ(x + δx, θ) ≈ ℓ(x, θ) + δx^T ∇_x ℓ.

Also, for every x and θ,

f₂(x + δx, θ) ≈ f₂(x, θ) + (∂f₂/∂x) δx.
Now, assume that instead of x + δx, we put f₁(x^(1), θ^(1) + δθ^(1)) = f₁(x^(1), θ^(1)) + δf₁. We get

f₂(f₁(x^(1), θ^(1) + δθ^(1)), θ^(2)) ≈ f₂(f₁(x^(1), θ^(1)), θ^(2)) + (∂f₂/∂x) δf₁
                                   ≈ f₂(f₁(x^(1), θ^(1)), θ^(2)) + (∂f₂/∂x)(∂f₁/∂θ) δθ^(1).
So, overall δf₂ = (∂f₂/∂x)(∂f₁/∂θ) δθ^(1). Now we plug in:

ℓ(f₂ + δf₂, θ^(3)) ≈ ℓ(f₂, θ^(3)) + δf₂^T ∇_x ℓ
                  = ℓ(f₂, θ^(3)) + ( (∂f₂/∂x)(∂f₁/∂θ) δθ^(1) )^T ∇_x ℓ
                  = ℓ(f₂, θ^(3)) + (δθ^(1))^T (∂f₁/∂θ)^T (∂f₂/∂x)^T ∇_x ℓ.

From here, the gradient of (59) is

∇_{θ^(2)}F = (∂f₂/∂θ)^T ∇_x ℓ
∇_{θ^(1)}F = (∂f₁/∂θ)^T (∂f₂/∂x)^T ∇_x ℓ.
To sum up, for each layer f(x, θ), we need to compute the transpose of the Jacobian (with respect to each of the parameters) multiplied by some vector. We will define this operation for our residual network.
6.5 Computing derivatives of Neural Networks
6.5.1 The derivatives of matrix-vector and matrix-matrix multiplications
We will start with a rather simple auxiliary computation that will help us later. Assume

y = Wx;   W ∈ R^{t×n},  y ∈ R^t,  x ∈ R^n.

We've already seen earlier in the course that ∂y/∂x = W, and hence simply (∂y/∂x)^T = W^T.
Now we'll examine the case of the derivative with respect to W:

y + δy = (W + δW)x ⇒ δy = δWx.

Written differently, we wish to have

δWx ≡ (∂y/∂W) vec(δW),

where ∂y/∂W ∈ R^{t×tn} is a Jacobian matrix, and the operation vec(W) is the concatenation of the columns of W one after the other into a long vector of size tn (also known as the column-stack representation, achieved by the operator [:] in Julia). We need to know how to define ∂y/∂W and to compute its transpose times a vector.
To obtain this, we will start with a few definitions that let us work with matrices and their derivatives. To deal with matrices as unknowns, we will transfer them into vectors first, compute the derivative, and transfer them back. In particular, we will use the following identities:

(A ⊗ B)^T = A^T ⊗ B^T   (60)
AXB = C ⇒ (B^T ⊗ A) vec(X) = vec(C).   (61)
Back to our auxiliary computation:

(∂y/∂W) vec(δW) = δWx,

then by (61) we have

I δW x = (x^T ⊗ I) vec(δW) ⇒ ∂y/∂W = x^T ⊗ I.   (62)

By the identity in (60), the Jacobian transposed times some vector is given by

(x^T ⊗ I)^T v = (x ⊗ I)v = [after de-column-stacking] v x^T ∈ R^{t×n},   (63)

where v is some vector of the same size as y. Note that v x^T has the same size as W before column-stacking. As in the case of logistic regression, suppose we have many data samples x_i stacked in a matrix column-wise:

X = [x₁|x₂|x₃|...|x_m],

and we wish to take the derivative of Y = WX with respect to W. Similarly to before, the Jacobian transposed times a matrix V of the same size as Y is given by

(X ⊗ I) vec(V) = [after de-column-stacking] Σ_i v_i x_i^T = V X^T

(this has to be verified).
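Identity (63) can indeed be verified numerically. A pure-Python sketch (small made-up sizes; the kron helper is ours):

```python
# Check that (x^T (x) I)^T v = (x (x) I) v, which, after de-column-stacking,
# equals the rank-one matrix v x^T.
def kron(A, B):                          # Kronecker product of two matrices
    return [[aij*bkl for aij in arow for bkl in brow]
            for arow in A for brow in B]

def matvec(M, v):
    return [sum(mij*vj for mij, vj in zip(row, v)) for row in M]

t, n = 2, 3
x = [1.0, 2.0, 3.0]                      # x in R^n
v = [4.0, 5.0]                           # v in R^t, same size as y
I = [[1.0, 0.0], [0.0, 1.0]]

xK = kron([[xi] for xi in x], I)         # (x (x) I) in R^{tn x t}
w = matvec(xK, v)                        # long vector of length t*n

# de-column-stack w into a t-by-n matrix, column by column
M = [[w[j*t + i] for j in range(n)] for i in range(t)]
VX = [[vi*xj for xj in x] for vi in v]   # v x^T
assert M == VX
print("identity (63) verified:", M)
```

The same style of check can be repeated with a matrix V in place of v to confirm the V X^T formula above.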
6.5.2 The derivatives of standard neural networks
Let's look at our network step:

y = f(θ, x) = σ(Wx + b).

In the backpropagation, we should multiply different vectors v of a size similar to y with the transpose of the Jacobian with respect to each of the weights and unknowns x ∈ R^n, W ∈ R^{k×n}, b ∈ R^k.
The derivative with respect to b ∈ R^k:

y + δy = σ(Wx + b + δb) ≈ y + diag(σ′(Wx + b)) δb
⇒ ∂y/∂b = diag(σ′(Wx + b)) ∈ R^{k×k}.

The derivative with respect to W ∈ R^{k×n}:

y + δy = σ((W + δW)x + b) ≈ y + diag(σ′(Wx + b)) δWx
⇒ ∂y/∂W = diag(σ′(Wx + b))(x^T ⊗ I) ∈ R^{k×kn}.
In the last equation we used (62), and to multiply the transpose of this matrix with a vector
v, we need to follow (63).
The derivative with respect to x:

y + δy = σ(W(x + δx) + b) ≈ y + diag(σ′(Wx + b)) W δx
⇒ ∂y/∂x = diag(σ′(Wx + b)) W ∈ R^{k×n}.
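The Jacobian-transpose products for the classical layer can be sketched in Python (an illustration with tanh as σ and made-up sizes; we verify the b-derivative by differentiating the scalar v^T y with finite differences):

```python
import math

def matvec(W, x):
    return [sum(wij*xj for wij, xj in zip(row, x)) for row in W]

def layer(W, b, x):                      # y = sigma(Wx + b), sigma = tanh
    return [math.tanh(t + bi) for t, bi in zip(matvec(W, x), b)]

def jac_t_b(W, b, x, v):                 # (dy/db)^T v = diag(sigma'(Wx+b)) v
    u = [t + bi for t, bi in zip(matvec(W, x), b)]
    return [(1.0 - math.tanh(ui)**2)*vi for ui, vi in zip(u, v)]

def jac_t_W(W, b, x, v):                 # (dy/dW)^T v, de-column-stacked per (63)
    d = jac_t_b(W, b, x, v)              # ... equals (diag(sigma')v) x^T
    return [[di*xj for xj in x] for di in d]

W = [[0.2, -0.5], [0.7, 0.1]]
b = [0.3, -0.4]
x = [1.0, 2.0]
v = [0.6, -1.1]
h = 1e-6
for j in range(2):                       # finite-difference check of the b-derivative
    bp = list(b); bp[j] += h
    bm = list(b); bm[j] -= h
    fd = sum(vi*(yp - ym) for vi, yp, ym in
             zip(v, layer(W, bp, x), layer(W, bm, x)))/(2*h)
    assert abs(fd - jac_t_b(W, b, x, v)[j]) < 1e-6
print("transpose-Jacobian products verified:", jac_t_W(W, b, x, v))
```

Here tanh′(u) = 1 − tanh²(u) plays the role of σ′, and the W-derivative follows recipe (63): the product of diag(σ′)v with x^T, already reshaped back to the size of W.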
6.5.3 The derivatives of residual networks
Let's look again at our residual network step:

y = f(θ, x) = x + W₂σ(W₁x + b).

In the backpropagation, we should multiply different vectors v of a size similar to f(θ, x) with the transpose of the Jacobian with respect to each of the weights and unknowns x, W₂, W₁, b.
The derivative with respect to b:

y + δy = x + W₂σ(W₁x + b + δb) ≈ y + W₂ diag(σ′(W₁x + b)) δb
⇒ ∂y/∂b = W₂ diag(σ′(W₁x + b)) ∈ R^{n×n}.

The derivative with respect to W₁:

y + δy = x + W₂σ((W₁ + δW₁)x + b) ≈ y + W₂ diag(σ′(W₁x + b)) δW₁x
⇒ ∂y/∂W₁ = W₂ diag(σ′(W₁x + b))(x^T ⊗ I) ∈ R^{n×n²}.
In the last equation we used (62), and to multiply the transpose of this matrix with a vector
v, we need to follow (63).
The derivative with respect to W₂:

y + δy = x + (W₂ + δW₂)σ(W₁x + b) = y + δW₂σ(W₁x + b)
⇒ ∂y/∂W₂ = σ(W₁x + b)^T ⊗ I ∈ R^{n×n²}.

Here, again, we used (62), only with respect to σ(W₁x + b) instead of x. To multiply the transpose of this matrix with a vector v, we need to follow (63).

The derivative with respect to x:

y + δy = x + δx + W₂σ(W₁(x + δx) + b) ≈ y + δx + W₂ diag(σ′(W₁x + b)) W₁ δx
⇒ ∂y/∂x = I + W₂ diag(σ′(W₁x + b)) W₁ ∈ R^{n×n}.
6.6 Gradient and Jacobian verification
Each time we compute a non-linear function and its derivative, we need to make sure that our derivations are correct. It is better not to test this through the whole optimization process of our system, but to do it part by part. This way, (1) we will be able to identify errors in the right places, and (2) we won't be confused in the case where our optimization algorithm is really not supposed to work (the lack of minimization is a "feature instead of a bug").
The gradient test: The simplest case is that of the gradient of a scalar function. Let d be a random vector, such that ‖d‖ = O(1). Then, we know that

f(x + εd) = f(x) + εd^T∇f(x) + O(ε²).

Suppose that we have a code grad(x) for computing ∇f(x). Then, we can compare

|f(x + εd) − f(x)| = O(ε)   (64)

versus

|f(x + εd) − f(x) − εd^T grad(x)| = O(ε²).   (65)
We will test these two values for decreasing values of ε, for example ε_i = (0.5)^i ε₀ for some ε₀. The values of (64) should decrease linearly (be smaller by a factor of two for each increment of i), and the values of (65) should (if the code is right) decrease quadratically (be smaller by a factor of four for each increment of i).
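The gradient test can be sketched in a few lines of Python (for illustration we use f(x) = Σᵢ exp(xᵢ), whose exact gradient is exp(xᵢ), and fixed deterministic x and d):

```python
import math

def f(x): return sum(math.exp(xi) for xi in x)
def grad(x): return [math.exp(xi) for xi in x]

x = [0.1, -0.3, 0.5, 0.7, -0.2]
d = [0.9, 0.4, 0.7, 0.2, 0.5]            # fixed direction with ||d|| = O(1)

eps = 0.5
prev = None
for i in range(8):                       # eps_i = (0.5)^i * eps_0
    xe = [xi + eps*di for xi, di in zip(x, d)]
    e0 = abs(f(xe) - f(x))                                    # (64): O(eps)
    e1 = abs(f(xe) - f(x) - eps*sum(di*gi for di, gi in zip(d, grad(x))))  # (65)
    if prev is not None:
        assert prev[0]/e0 > 1.5           # e0 shrinks roughly linearly
        assert prev[1]/e1 > 3.0           # e1 shrinks roughly quadratically
    prev = (e0, e1)
    eps *= 0.5
print("gradient test passed")
```

If the gradient code were wrong, the second quantity would decrease only linearly (like the first one), which is exactly what the test detects.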
The Jacobian test: The same technique can be used to test the computation of the Jacobian. Here, we do not necessarily compute the matrix explicitly, but compute its multiplication with a vector. Assume that you have the code JacMV(x,v), that computes the multiplication of the Jacobian with a vector v, i.e., (∂f/∂x)v. The derivative is computed at the point x. Similarly to before, we will use a vector v = εd and compare

‖f(x + εd) − f(x)‖ = O(ε)

versus

‖f(x + εd) − f(x) − JacMV(x, εd)‖ = O(ε²).

The test is exactly the same procedure as the gradient test.
The transpose test: In this test we assume that we have a routine JacTMV(x,v) that computes the multiplication (∂f/∂x)^T v. We will assume that our JacMV function is correct, and has passed the Jacobian test. Then we will use the equality

u^T J v = v^T J^T u,

which holds for any two vectors u, v. To pass the test, we will verify that

|u^T JacMV(x,v) − v^T JacTMV(x,u)| < ε

for random vectors u, v, where ε is of the order of the machine precision.
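The transpose test can be sketched for an explicit linear map J (an illustration; here JacMV and JacTMV simply multiply by a fixed made-up matrix and its transpose, with no dependence on x):

```python
J = [[1.0, 2.0, -1.0], [0.5, -0.3, 4.0]]      # a 2x3 "Jacobian" for illustration

def JacMV(v):                                 # J v
    return [sum(jij*vj for jij, vj in zip(row, v)) for row in J]

def JacTMV(u):                                # J^T u
    return [sum(J[i][j]*u[i] for i in range(len(J))) for j in range(len(J[0]))]

u = [0.7, -1.2]
v = [0.3, 2.0, -0.5]
lhs = sum(ui*yi for ui, yi in zip(u, JacMV(v)))   # u^T (J v)
rhs = sum(vj*zj for vj, zj in zip(v, JacTMV(u)))  # v^T (J^T u)
assert abs(lhs - rhs) < 1e-12
print("transpose test passed:", lhs, rhs)
```

In a real network code, JacTMV would be the hand-derived transpose product from Sections 6.5.2-6.5.3, and any mismatch beyond rounding error indicates a bug in it.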