Nonlinear Optimization
Benny Yakir

These notes are based on the help files of MATLAB’s optimization toolbox
and on the book Linear and Nonlinear Programming by D.G. Luenberger.
No originality is claimed.

Contents
1 The General Optimization Problem
2 Basic properties of solutions and algorithms
2.1 Necessary conditions for a local optimum
2.2 Global convergence of descent algorithms
2.3 Homework
3 Basic MATLAB
3.1 Files and Directories in UNIX
3.2 Other UNIX Commands
3.3 Starting and quitting MATLAB
3.4 Matrices
3.5 Graphics
3.6 Scripts and functions
3.7 Files
3.8 Homework
4 Basic descent methods
4.1 Fibonacci and Golden Section Search
4.2 Newton’s method
4.3 Applying line-search methods
4.4 Quadratic interpolation
4.5 Cubic fit
4.6 Homework
5 The method of steepest descent
5.1 The quadratic case
5.2 Applying the method in Matlab
5.3 Homework
6 Newton and quasi-Newton methods
6.1 Newton’s method
6.2 Extensions
6.3 The Davidon-Fletcher-Powell (DFP) method
6.4 The Broyden-Fletcher-Goldfarb-Shanno (BFGS) method
6.5 The function fminunc
6.6 Examples
6.7 Homework
6.8 Project 1
7 Constrained Minimization Conditions
7.1 Necessary conditions (equality constraints)
7.2 Examples
7.3 Necessary conditions (inequality constraints)
7.4 Sufficient conditions
7.5 Sensitivity
7.6 Homework
8 Lagrange methods
8.1 Quadratic programming
8.1.1 Equality constraints
8.1.2 Inequality constraints
8.2 Sequential Quadratic Programming
8.3 Newton’s Method
8.4 Structured Methods
8.5 Merit function
8.6 Enlargement of the feasible region
8.7 The Han–Powell method
8.8 Constrained minimization in Matlab
8.9 Constrained minimization in Matlab (using the function fmincon)
8.10 Homework
9 Large scale problems
9.1 Basic issues
9.2 Minimization with no constraints. Hessian provided
9.3 Minimization with no constraints. Hessian not provided
9.4 Minimization with constraints
9.5 Project 2
10 Penalty and Barrier Methods
10.1 Penalty method
10.2 Barrier method

1 The General Optimization Problem
The general optimization problem has the form:
$$\min_{x \in \mathbb{R}^d} f(x)$$
subject to:
$$g_i(x) = 0, \quad i = 1, \ldots, m_e,$$
$$g_i(x) \le 0, \quad i = m_e + 1, \ldots, m,$$
$$x_l \le x \le x_u.$$
In particular, if m = 0, the problem is called an unconstrained optimization
problem. In this course we intend to introduce and investigate algorithms for
solving this problem. We will concentrate, in general, on algorithms which
are used by the Optimization Toolbox of MATLAB.
We intend to cover the following sections:

Basic properties of solutions and algorithms: In this section we consider
general conditions for the existence of a solution. We also define the
terms algorithm, iterative algorithm and descent algorithm.

Basic MATLAB: Here we introduce the basic features and structure of the
MATLAB system.

Line descent methods: Here we deal with algorithms for finding the minimum
in the case where d = 1. These algorithms are the basic building blocks
when solving more complex optimization problems.

The method of steepest descent: In each iteration, a line search is
performed in the direction of the steepest descent.

Newton and Quasi-Newton methods: In the Newton method the function is
approximated (locally) by a quadratic form. The direction of the search is
chosen based on this form. Quasi-Newton methods replace the Hessian with
terms which are easier to evaluate.

Conditions in constrained minimization: The conditions that were considered
earlier for unconstrained problems are modified in order to deal with
constraints. The Lagrange multipliers and the Kuhn-Tucker conditions are
described.

Lagrange methods: These methods are based on the Lagrange first-order
conditions of a solution. The method is applied to quadratic programming.

Sequential Quadratic Programming: At each iteration the function and
Lagrange multipliers are approximated by a quadratic programming problem.

Penalty and barrier methods: A sequence of unconstrained minimization
problems is solved. The solutions to these problems converge to the
solution of the original problem.

Time permitting, we would also like to deal with other optimization
problems. Examples include: the EM algorithm, discrete optimization using
dynamic programming, stochastic approximation.
2 Basic properties of solutions and algorithms
2.1 Necessary conditions for a local optimum
Assume that the function $f$ is defined over $\Omega \subset \mathbb{R}^d$.

Definition: A point $x^* \in \Omega$ is said to be a relative minimum point or a
local minimum point of $f$ if there is an $\epsilon > 0$ such that $f(x^*) \le f(x)$
for all $x \in \Omega$ such that $\|x - x^*\| < \epsilon$. If the inequality is strict for
all $x \ne x^*$ then $x^*$ is said to be a strict relative minimum point.
Definition: A point $x^* \in \Omega$ is said to be a global minimum point of $f$ if
$f(x^*) \le f(x)$ for all $x \in \Omega$. If the inequality is strict for all $x \ne x^*$
then $x^*$ is said to be a strict global minimum point.
In practice, the algorithms we will consider in the better part of this
course converge to a local minimum. We may indicate in some cases how
the global minimum can be attained.
Definition: Given $x \in \Omega$, a vector $d$ is a feasible direction at $x$ if there
exists an $\bar\alpha > 0$ such that $x + \alpha d \in \Omega$ for all $0 \le \alpha \le \bar\alpha$.
Theorem 2.1 (First-order necessary conditions.) Let $f \in C^1$. If $x^*$
is a relative minimum, then for any vector $d$ which is feasible at $x^*$, we
have $\dot f(x^*)' d \ge 0$. ($\dot f(x^*)$ is the vector of partial derivatives of $f$ at $x^*$.)
Corollary 2.1 If $x^*$ is a relative minimum and if $x^* \in \Omega^0$ (the interior
of $\Omega$) then $\dot f(x^*) = 0$.
Example 2.1 Consider the function $f(x, y) = x^2 - xy + y^2 - 3y$, with
$\Omega = \mathbb{R}^2$. From the first-order conditions we get that $x^* = 1$ and $y^* = 2$.
This is a global minimum.
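The first-order conditions of Example 2.1 form a linear system, so they can be checked numerically. The following is a minimal sketch, not part of the original notes; the variable names H, c, xstar and lambda are chosen here for illustration.

```matlab
% Example 2.1: f(x,y) = x^2 - x*y + y^2 - 3*y.
% The gradient is H*[x;y] - c, so the first-order conditions read H*[x;y] = c.
H = [2 -1; -1 2];    % Hessian of f (constant, since f is quadratic)
c = [0; 3];
xstar  = H \ c       % stationary point, equal to [1; 2]
lambda = eig(H)      % both eigenvalues are positive, so f is strictly convex
```

Since the Hessian is positive definite everywhere, the stationary point is the global minimum, in agreement with the example.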
Example 2.2 Consider the function $f(x, y) = x^2 - x + y + xy$, with
$\Omega = (\mathbb{R}^+)^2$. The global minimum is at $x^* = 0.5$ and $y^* = 0$. At this point,
$\dot f(0.5, 0) = (0, 3/2)'$.
Example 2.3 Let the production function be $f(x_1, \ldots, x_d)$, where the $x_i$ are the
inputs. The unit price of the produced commodity is $q$ and the unit price of
the $i$th input is $p_i$. The producer wants to maximize
$$q f(x_1, \ldots, x_d) - p_1 x_1 - \cdots - p_d x_d.$$
The first-order conditions, $q\,\partial f/\partial x_i = p_i$, can be interpreted as stating
that the marginal value increase from the $i$th input must be equal to $p_i$.
Example 2.4 We observe $g(x)$ at the points $x_1, \ldots, x_m$. We want to approximate
the function with a polynomial of the form $h(x) = \sum_{j=0}^{d-1} a_j x^j$, for
$d < m$. Consider the minimization problem
$$\min_{a \in \mathbb{R}^d} \sum_{k=1}^m [g(x_k) - h(x_k)]^2 = \min_{a \in \mathbb{R}^d} \sum_{k=1}^m \Big[g(x_k) - \sum_{j=0}^{d-1} a_j x_k^j\Big]^2 = \min_{a \in \mathbb{R}^d} f(a).$$
Let $q_{ij} = \sum_{k=1}^m (x_k)^{i+j}$, $b_j = \sum_{k=1}^m g(x_k)(x_k)^j$ and $c = \sum_{k=1}^m g(x_k)^2$. Then
$$f(a) = a'Qa - 2b'a + c,$$
and the first-order conditions are
$$Qa = b.$$
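The normal equations of Example 2.4 can be assembled and solved directly. The following is a sketch under illustrative assumptions: the sample points xk and the quadratic used to generate gk are made up here, and the matrix V of powers is a helper that does not appear in the notes.

```matlab
% Least-squares polynomial fit via the normal equations Q*a = b (Example 2.4).
% The data below are illustrative: g is sampled from a known quadratic.
xk = (0:0.25:2)';               % m = 9 observation points
gk = 1 - 2*xk + 0.5*xk.^2;      % observed values g(x_k)
d  = 3;                         % number of coefficients a_0, ..., a_{d-1}
V  = zeros(length(xk), d);
for j = 0:d-1
    V(:, j+1) = xk.^j;          % column j+1 holds x_k^j
end
Q = V' * V;                     % q_ij = sum_k x_k^(i+j)
b = V' * gk;                    % b_j  = sum_k g(x_k) x_k^j
a = Q \ b                       % solves the first-order conditions
```

Because the data come from a polynomial of degree 2 and d = 3, the fit recovers the generating coefficients (1, -2, 0.5) up to rounding.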
Second-order conditions deal with functions with continuous second partial
derivatives and use the Hessian matrix $\ddot f(x^*)$ of (mixed) partial derivatives.
Theorem 2.2 (Second-order necessary conditions.) Let $f \in C^2$. Let
$x^*$ be a relative minimum. For any vector $d$ which is feasible at $x^*$, if
$\dot f(x^*)' d = 0$ then $d' \ddot f(x^*) d \ge 0$.
Corollary 2.2 If $x^*$ is a relative minimum and if $x^* \in \Omega^0$ then
$\dot f(x^*)' d = 0$ and $d' \ddot f(x^*) d \ge 0$ for all $d$.
Example 2.5 Consider the function $f(x, y) = x^3 - x^2 y + 2y^2$, with
$\Omega = (\mathbb{R}^+)^2$. The first-order conditions are
$$3x^2 - 2xy = 0, \qquad -x^2 + 4y = 0.$$
There are two solutions: $(0, 0)$ and $(6, 9)$. However, the second is not a
relative minimum since the Hessian matrix
$$\ddot f(6, 9) = \begin{pmatrix} 18 & -12 \\ -12 & 4 \end{pmatrix}$$
is not positive semi-definite.
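The claim about the Hessian can be checked through its eigenvalues. This is a small sketch, assuming the function $f(x, y) = x^3 - x^2 y + 2y^2$ that matches the first-order conditions above; the second derivatives are coded by hand.

```matlab
% Hessian of f(x,y) = x^3 - x^2*y + 2*y^2 at the stationary point (6, 9).
x = 6; y = 9;
H = [6*x - 2*y, -2*x; -2*x, 4];   % [f_xx f_xy; f_xy f_yy]
lambda = eig(H)                   % one eigenvalue is negative
```

The determinant of the Hessian is 18*4 - 144 = -72 < 0, so the two eigenvalues have opposite signs and the matrix cannot be positive semi-definite.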
Theorem 2.3 (Second-order sufficient conditions.) Let $f \in C^2$. Assume
that $x^* \in \Omega^0$. If $\dot f(x^*) = 0$ and $\ddot f(x^*)$ is positive definite then $x^*$ is a
strict relative minimum.
2.2 Global convergence of descent algorithms
The algorithms we consider are iterative descent algorithms. By iterative we
mean, roughly, that the algorithm generates a series of points, each point
being calculated on the basis of the points preceding it. By descent we
mean that the sequence of values of some function, calculated at the points
generated by the algorithm, is a monotone decreasing sequence.
Definition: An algorithm $A$ is a mapping that assigns, to each point, a
subset of the space.
Iterative algorithm: The specific sequence is constructed by choosing a
point in the subset and iterating the process. Thus the algorithm generates
a series of points, and each point is calculated on the basis of the points
preceding it.
Descent algorithm: As each new point is generated, the corresponding
value of some function decreases. Specifically, there exists a
continuous function $Z$ such that if $A$ is the algorithm and $\Gamma$ is the
solution set then
1. If $x \notin \Gamma$ and $y \in A(x)$, then $Z(y) < Z(x)$.
2. If $x \in \Gamma$ and $y \in A(x)$, then $Z(y) \le Z(x)$.
Definition: An algorithm is said to be globally convergent if, for any
starting point, it generates a sequence that converges to a solution.
Definition: A point-to-set map $A$ is said to be closed at $x$ if
1. $x_k \to x$ and
2. $y_k \to y$, $y_k \in A(x_k)$, imply
3. $y \in A(x)$.
The map $A$ is closed if it is closed at each point of the space.
Example 2.6 Suppose for $x \in \mathbb{R}$ we define $A(x) = [-|x|/2, |x|/2]$. Starting
at $x_0 = 100$, each of the sequences
100, 50, 25, 12, −6, −2, 1, 1/2, . . .
100, −40, 20, −5, −2, 1, 1/4, 1/8, . . .
100, 10, 1/16, 1/100, −1/1000, 1/10000, . . .
might be generated from iterative application of the algorithm. The given
algorithm is closed.
Example 2.7 If $A$ is point-to-point and continuous then $A$ is closed.
Theorem 2.4 If $A$ is a descent iterative algorithm which is closed outside
of the solution set $\Gamma$ and if the sequence of points is contained in a compact
set then any converging subsequence converges to a solution.
2.3 Homework
1. To approximate the function $g$ over the interval $[0, 1]$ by a polynomial $h$
of degree $n$ (or less), we use the criterion
$$f(a) = \int_0^1 [g(x) - h(x)]^2\,dx,$$
where $a \in \mathbb{R}^{n+1}$ are the coefficients of $h$. Find the equations satisfied by
the optimal solution.
2.(a) Using first-order necessary conditions, find the minimum of the function
$$f(x, y, z) = 2x^2 + xy + y^2 + yz + z^2 - 6x - 7y - 8z + 9.$$
(b) Verify the point is a relative minimum by checking the second-order
conditions.
3. In a control problem one is interested in finding numbers $u_0, \ldots, u_n$ that
minimize the objective function
$$J = \sum_{k=0}^n \{(x_0 + u_0 + \cdots + u_{k-1})^2 + u_k^2\},$$
for a given $x_0$. Find the equations that determine the first-order conditions.
4. Define the point-to-set mapping on $\mathbb{R}^n$ by
$$A(x) = \{y : y'x \le b\},$$
where $b$ is a fixed constant. Is $A$ closed?
3 Basic MATLAB
The name MATLAB stands for matrix laboratory. It is an interactive system
for technical computing whose basic data element is an array that does not
require dimensioning.
3.1 Files and Directories in UNIX
The home directory is "~", the current directory is "." and one directory
up is "..".
mkdir dirname       creates a directory dirname.
cd path/dirname     moves the working directory to dirname.
ls                  lists the content of the working directory.
less filename       lists the content of the file filename.
cp file1 file2      copies file1 into file2.
cp file dir         copies file into dir.
rm file             deletes file.
3.2 Other UNIX Commands
man command         help on command.
arrows              move through commands typed in the past.
tab                 tries to complete commands.
Control-c           kills a running job.
emacs file &        edits file with emacs.
pico file           edits file with pico.
3.3 Starting and quitting MATLAB
Starting on pluto:  /applic/matlab.5.3/bin/matlab
Starting on shum:   matlab
Quitting:           quit
Set the display:    setenv DISPLAY xterm:0
Help:               help functionname, helpwin, lookfor topic
3.4 Matrices
MATLAB is case sensitive. Memory is allocated automatically.

>> A = [16 3 2 13; 5 10 11 8; 9 6 7 12; 4 15 14 1]
A =
16 3 2 13
5 10 11 8
9 6 7 12
4 15 14 1
>> sum(A)
ans =
34 34 34 34
>> sum(A')
ans =
34 34 34 34
>> sum(diag(A))
ans =
34
>> A(1,4) + A(2,3) + A(3,2) + A(4,1)
ans =
34
>> sum(A(1:4,4));
>> sum(A(:,end));
>> A(~isprime(A)) = 0
A =
0 3 2 13
5 0 11 0
0 0 7 0
0 0 0 0
>> sum(1:16)/4;
>> pi:-pi/4:0
ans =
3.1416 2.3562 1.5708 0.7854 0
>> B = [fix(10*rand(1,5));randn(1,5)]
B =
4.0000 9.0000 9.0000 4.0000 8.0000
0.1139 1.0668 0.0593 -0.0956 -0.8323
>> B(2:2:10)=[]
B =
4 9 9 4 8
>> s = 1 -1/2 + 1/3 - 1/4 + 1/5 - 1/6 + 1/7 ...
-1/8 + 1/9 -1/10
s =
0.6456
>> A = [16 3 2 13; 5 10 11 8; 9 6 7 12; 4 15 14 1];
>> A'*A
ans =
378 212 206 360
212 370 368 206
206 368 370 212
360 206 212 378
>> det(A)
ans =
0
>> eig(A)
ans =
34.0000
8.0000
0.0000
-8.0000
>> (A/34)^5
ans =
0.2507 0.2495 0.2494 0.2504
0.2497 0.2501 0.2502 0.2500
0.2500 0.2498 0.2499 0.2503
0.2496 0.2506 0.2505 0.2493
>> A'.*A
ans =
256 15 18 52
15 100 66 120
18 66 49 168
52 120 168 1
>> n = (0:3)';
>> pows = [n n.^2 2.^n]
pows =
0 0 1
1 1 2
2 4 4
3 9 8
3.5 Graphics
>> t = 0:pi/100:2*pi;
>> y = sin(t);
>> plot(t,y)
>> y2 = sin(t-0.25);
>> y3 = sin(t-0.5);
>> plot(t,y,t,y2,t,y3)
>> [x,y,z]=peaks;
>> contour(x,y,z,20,'k')
>> hold on
>> pcolor(x,y,z)
>> hold off
>> [x,y] = meshgrid(-8:.5:8);

>> R = sqrt(x.^2 + y.^2) + eps;
>> Z = sin(R)./R;
>> mesh(x,y,Z)
3.6 Scripts and functions
M-files are text files containing MATLAB code. M-file names end with the
extension .m. Functions are M-files that can accept input arguments and
return output arguments. Variables, in general, are local. MATLAB provides
many functions. You can also write your own function in an M-file:
function h = falling(t)
global GRAVITY
h = 1/2*GRAVITY*t.^2;
Save it as falling.m and run it from MATLAB:
>> global GRAVITY
>> GRAVITY = 32;
>> y = falling((0:.1:5)');
>> falling(0:5)
ans =
0 16 64 144 256 400
3.7 Files
The MATLAB environment includes a set of variables built up during the
session (the workspace) and disk files containing programs and data
that persist between sessions. Variables can be saved in MAT-files.
>> save B A
>> A = 0
A =
0
>> load B
>> A
A =
16 3 2 13
5 10 11 8
9 6 7 12
4 15 14 1
To obtain efficiency it is important to vectorize the computations. For
example, write the M-file logtab1.m:
function t = logtab1(n)
x = 0.01;
for k = 1:n
    y(k) = log10(x);
    x = x + 0.01;
end
t = y;
and the M-file logtab2.m:
function t = logtab2(n)
x = 0.01:0.01:(n*0.01);
t = log10(x);
Then run in Matlab:
>> tic; logtab1(1000); toc

elapsed_time =
0.6476
>> tic; logtab2(1000); toc
elapsed_time =
0.0343
3.8 Homework
1. Let $f(x) = ax^2 - 2bx + c$. Under which conditions does $f$ have a minimum?
What is the minimizing $x$?
2. Let $f(x) = x'Ax - 2b'x + c$, with $A$ an $n \times n$ matrix, $b$ and $c$ $n$-vectors.
Under which conditions does $f$ have a minimum? A unique minimum?
What is the minimizing $x$?
3. Write a MATLAB function that finds the location and value of the minimum
of a quadratic function.
4. Plot, using MATLAB, a contour plot of the function $f$ with A = [1 3; -1 2],
b = [5 2]' and c = [1 3]'. Mark, on the plot, the location of the minimum.
4 Basic descent methods
We consider now algorithms for locating a local minimum in the optimization
problem with no constraints. All methods have in common the basic
structure: in each iteration a direction $d_n$ is chosen from the current
location $x_n$. The next location, $x_{n+1}$, is the minimum of the function along
the line that passes through $x_n$ in the direction $d_n$. Before discussing the
different approaches for choosing directions, we will deal with the problem
of finding the minimum of a function of one variable, known as “line search”.
4.1 Fibonacci and Golden Section Search
These approaches assume only that the function is unimodal. Hence, if
the interval is divided by the points $x_0 < x_1 < \cdots < x_N < x_{N+1}$ and we
find that, among these points, $x_k$ minimizes the function then the overall
minimum is in the interval $(x_{k-1}, x_{k+1})$.
The Fibonacci sequence ($F_n = F_{n-1} + F_{n-2}$, $F_0 = F_1 = 1$) is the basis
for choosing altogether $N$ points sequentially such that the length
$x_{k+1} - x_{k-1}$ of the final interval is minimized. The length of the final
interval is $(x_{N+1} - x_0)/F_N$.
The solution of the Fibonacci equation is $F_N = A\tau_1^N + B\tau_2^N$, where
$$\tau_1 = \frac{1 + \sqrt{5}}{2} = \frac{1}{0.618}, \qquad \tau_2 = \frac{1 - \sqrt{5}}{2}.$$
It follows that $F_{N-1}/F_N \approx 0.618$ and the rate of convergence of this line
search approach is linear.
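The golden section search replaces the Fibonacci ratios by their limit 0.618, so that each iteration shrinks the interval by the same factor and reuses one function value. The following is a minimal sketch, not taken from the notes: the test function, the interval and the tolerance are illustrative assumptions.

```matlab
% Golden section search for a unimodal function on [lo, hi].
f  = @(x) (x - 2).^2 + 1;        % illustrative unimodal function, minimum at x = 2
lo = 0; hi = 5;
r  = (sqrt(5) - 1)/2;            % 0.618..., the limit of F_{N-1}/F_N
x1 = hi - r*(hi - lo);  f1 = f(x1);
x2 = lo + r*(hi - lo);  f2 = f(x2);
while hi - lo > 1e-6
    if f1 < f2                   % minimum lies in [lo, x2]
        hi = x2;  x2 = x1;  f2 = f1;
        x1 = hi - r*(hi - lo);  f1 = f(x1);
    else                         % minimum lies in [x1, hi]
        lo = x1;  x1 = x2;  f1 = f2;
        x2 = lo + r*(hi - lo);  f2 = f(x2);
    end
end
xmin = (lo + hi)/2
```

Only one new function evaluation is needed per iteration, because the interior point that survives the cut is already in golden-section position inside the reduced interval.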
4.2 Newton’s method
The best known method of line search is Newton’s method. Assume not
only that the function is continuous but also that it is smooth. Given the
first and second derivatives of the function at $x_n$, one can write the Taylor
expansion:
$$f(x) \approx q(x) = f(x_n) + f'(x_n)(x - x_n) + f''(x_n)(x - x_n)^2/2.$$
The minimum of $q(x)$ is attained at
$$x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}.$$
(Note that this approach can be generalized to the problem of finding the
zeros of a function $g(x)$; here $g(x) = f'(x)$.)
We can expect that the solution of an iterative procedure of this type
will satisfy
$$x^* = x^* - \frac{f'(x^*)}{f''(x^*)} \quad\Rightarrow\quad f'(x^*) = 0.$$
We say that an algorithm converges at rate at least $p$ to a solution $x^*$ if
$$\lim_{n \to \infty} \frac{\|x_{n+1} - x^*\|}{\|x_n - x^*\|^p} < \infty,$$
where $\|\cdot\|$ is an appropriate norm. Note that when $p = 1$ and the limit is
smaller than 1, the error actually decreases exponentially in $n$.
Theorem 4.1 Let the function $g$ have a continuous second derivative and
let $x^*$ be such that $g(x^*) = 0$ and $g'(x^*) \ne 0$. Then the Newton method
converges with an order of convergence of at least two, provided that $x_0$ is
sufficiently close to $x^*$.
Proof: Take $g = f'$. Denote $G(x) = x - f'(x)/f''(x)$ and let $x^*$ be a solution of
$G(x) = x$, so that $f'(x^*) = 0$. A second-order Taylor expansion of $f'$ around $x_n$ gives
$$x_{n+1} - x^* = x_n - x^* - \frac{f'(x_n) - f'(x^*)}{f''(x_n)} \approx \frac{f^{(3)}(x^*)}{2 f''(x^*)}(x_n - x^*)^2.$$
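The quadratic convergence can be observed numerically. The following is a sketch on an illustrative function that is not in the notes, $f(x) = x^4/4 - x$, whose derivatives are coded by hand; the starting point is assumed to be close enough to the minimizer $x^* = 1$.

```matlab
% Newton's method for a one-dimensional minimization.
% Illustrative function: f(x) = x^4/4 - x, with f'(x) = x^3 - 1, f''(x) = 3*x^2,
% so the minimizer is x* = 1.
df  = @(x) x.^3 - 1;
d2f = @(x) 3*x.^2;
x = 2;                           % starting point, assumed close enough to x*
for n = 1:8
    x = x - df(x)/d2f(x);        % x_{n+1} = x_n - f'(x_n)/f''(x_n)
    fprintf('%d  x = %.12f  error = %.2e\n', n, x, abs(x - 1));
end
```

In the printed table the error is roughly squared from one iteration to the next once the iterate is near the solution, which is what an order of convergence of two means.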
4.3 Applying line-search methods
In order for Matlab to be able to read/write files in disk D you should use
the command
>> cd d:
Now you can write the function:
function y = humps(x)
y = 1./((x - 0.3).^2 + 0.01) + 1./((x - 0.9).^2 + 0.04) - 6;
in the M-file humps.m in directory D.
>> fplot('humps', [-5 5])
>> grid on
>> fplot('humps', [-5 5 -10 25])
>> grid on
>> fplot('[2*sin(x+3), humps(x)]', [-5 5])
>> fmin('humps',0.3,1)
ans =
0.6370
>> fmin('humps',0.3,1,1)
Func evals x f(x) Procedure
1 0.567376 12.9098 initial
2 0.732624 13.7746 golden
3 0.465248 25.1714 golden
4 0.644416 11.2693 parabolic
5 0.6413 11.2583 parabolic
6 0.637618 11.2529 parabolic
7 0.636985 11.2528 parabolic
8 0.637019 11.2528 parabolic
9 0.637052 11.2528 parabolic
ans =
0.6370
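In more recent MATLAB releases the function fminbnd takes the place of fmin. A sketch of the equivalent call follows; the anonymous function humps_fun is a stand-in for the M-file humps.m above, and the Display option is set through optimset instead of a fourth positional argument.

```matlab
% The same line search with fminbnd instead of fmin.
humps_fun = @(x) 1./((x - 0.3).^2 + 0.01) + 1./((x - 0.9).^2 + 0.04) - 6;
opts = optimset('Display', 'iter');      % print one line per function evaluation
[xmin, fval] = fminbnd(humps_fun, 0.3, 1, opts)
```

The second output returns the function value at the minimizer, so a separate call to the function is not needed.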
4.4 Quadratic interpolation
Assume we are given $x_1 < x_2 < x_3$ and the values of $f(x_i)$, $i = 1, 2, 3$, which
satisfy
$$f(x_2) < f(x_1) \quad\text{and}\quad f(x_2) < f(x_3).$$
The quadratic passing through these points is given by
$$q(x) = \sum_{i=1}^3 f(x_i) \frac{\prod_{j \ne i}(x - x_j)}{\prod_{j \ne i}(x_i - x_j)}.$$
The minimum of this function is attained at the point
$$x_4 = \frac{1}{2}\,\frac{\beta_{23} f(x_1) + \beta_{31} f(x_2) + \beta_{12} f(x_3)}{\gamma_{23} f(x_1) + \gamma_{31} f(x_2) + \gamma_{12} f(x_3)},$$
with $\beta_{ij} = x_i^2 - x_j^2$ and $\gamma_{ij} = x_i - x_j$. An algorithm $A : \mathbb{R}^3 \to \mathbb{R}^3$ can
be defined by such a pattern. If we start from an initial 3-point pattern
$x = (x_1, x_2, x_3)$ the algorithm $A$ can be constructed in such a way that
$A(x)$ has the same pattern. The algorithm is continuous, hence closed. It
is descending with respect to the function $Z(x) = f(x_1) + f(x_2) + f(x_3)$. It
follows that the algorithm converges to the solution set
$\Gamma = \{x^* : f'(x_i^*) = 0,\ i = 1, 2, 3\}$. It can be shown that the order of
convergence to the solution is (approximately) 1.3.
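A single interpolation step can be written directly from the formula for $x_4$. The following is a sketch with an illustrative quadratic test function, for which the fit is exact; the helper names beta_ij and gamma_ij are chosen here and are not part of the notes.

```matlab
% One step of quadratic interpolation through a three-point pattern.
f  = @(x) (x - 2).^2 + 1;        % illustrative function; a quadratic is fitted exactly
x  = [0 1 4];                    % x1 < x2 < x3 with f(x2) < f(x1) and f(x2) < f(x3)
fx = f(x);
beta_ij  = @(i, j) x(i)^2 - x(j)^2;
gamma_ij = @(i, j) x(i) - x(j);
num = beta_ij(2,3)*fx(1)  + beta_ij(3,1)*fx(2)  + beta_ij(1,2)*fx(3);
den = gamma_ij(2,3)*fx(1) + gamma_ij(3,1)*fx(2) + gamma_ij(1,2)*fx(3);
x4  = 0.5*num/den                % minimizer of the interpolating quadratic
```

For this data num = -48 and den = -12, so x4 = 2, the exact minimizer. For a general function the step is repeated, keeping a three-point pattern around the new point.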
4.5 Cubic fit
Given $x_1$ and $x_2$, together with $f(x_1)$, $f'(x_1)$, $f(x_2)$ and $f'(x_2)$, one can
consider a cubic polynomial of the form
$$q(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3.$$
The local minimum is determined by the solution of the equation
$$q'(x) = a_1 + 2a_2 x + 3a_3 x^2 = 0,$$
which satisfies
$$q''(x) = 2a_2 + 6a_3 x > 0.$$
It follows that the appropriate interpolation is given by
$$x_3 = x_2 - (x_2 - x_1)\frac{f'(x_2) + \beta_2 - \beta_1}{f'(x_2) - f'(x_1) + 2\beta_2},$$
where
$$\beta_1 = f'(x_1) + f'(x_2) - 3\,\frac{f(x_1) - f(x_2)}{x_1 - x_2}, \qquad \beta_2 = (\beta_1^2 - f'(x_1) f'(x_2))^{1/2}.$$
The order of convergence of this algorithm is 2.
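The cubic fit step can be tried on a function that is itself a cubic, where one step should land on the local minimum. This is a sketch with an illustrative test function and hand-coded derivative, neither of which comes from the notes.

```matlab
% One step of the cubic fit from two points with derivative information.
f  = @(x) x.^3 - 3*x;            % illustrative function with a local minimum at x = 1
df = @(x) 3*x.^2 - 3;
x1 = 0; x2 = 2;
beta1 = df(x1) + df(x2) - 3*(f(x1) - f(x2))/(x1 - x2);
beta2 = sqrt(beta1^2 - df(x1)*df(x2));
x3 = x2 - (x2 - x1)*(df(x2) + beta2 - beta1)/(df(x2) - df(x1) + 2*beta2)
```

Here beta1 = 3 and beta2 = 6, so x3 = 2 - 2*(12/24) = 1. Note that f'(x1) < 0 < f'(x2), which guarantees that the square root is real.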
4.6 Homework
1. Consider the iterative process
$$x_{n+1} = \frac{1}{2}\Big(x_n + \frac{a}{x_n}\Big),$$
where $a > 0$. Assuming the process converges, to what does it converge?
What is the order of convergence?
2. Find the minimum of the function -humps. Use different ranges.
3.(a) Given $f(x_n)$, $f'(x_n)$ and $f'(x_{n-1})$, show that
$$q(x) = f(x_n) + f'(x_n)(x - x_n) + \frac{f'(x_{n-1}) - f'(x_n)}{x_{n-1} - x_n} \cdot \frac{(x - x_n)^2}{2}$$
has the same derivatives as $f$ at $x_n$ and $x_{n-1}$ and is equal to $f$ at $x_n$.
(b) Construct a line search algorithm based on this quadratic fit.
4. What conditions on the values and derivatives at two points guarantee
that a cubic fit will have a minimum between the two points? Use the
answer to develop a search scheme that is globally convergent for unimodal
functions.
5. Suppose the continuous real-valued function $f$ satisfies
$$\min_{0 \le x} f(x) < f(0).$$
Starting at any $x > 0$ show that, through a series of halving and doubling
of $x$ and evaluation of the corresponding $f(x)$’s, a three-point pattern can
be determined.
6. Consider the function
$$f(x, y) = e^x (4x^2 + 2y^2 + 4xy + 2y + 1).$$
Use the function fmin to plot the function
$$g(y) = \min_x f(x, y).$$
5 The method of steepest descent
The method of steepest descent is a method of searching for the minimum of
a function of many variables $f$. In each iteration of this algorithm a line
search is performed in the direction of the steepest descent of the function at
the current location. In other words,
$$x_{n+1} = x_n - \alpha_n \dot f(x_n),$$
where $\alpha_n$ is the nonnegative scalar that minimizes $f(x_n - \alpha \dot f(x_n))$. It can
be shown that relative to the solution set $\{x^* : \dot f(x^*) = 0\}$, the algorithm is
descending and closed, thus converging.
5.1 The quadratic case
Assume
$$f(x) = \frac{1}{2} x'Qx - x'b = \frac{1}{2}(x - x^*)'Q(x - x^*) - \frac{1}{2}(x^*)'Qx^*,$$
where $Q$ is a positive definite and symmetric matrix and $x^* = Q^{-1}b$ is the
minimizer of $f$. Note that in this case $\dot f(x) = Qx - b$ and
$$f(x_n - \alpha \dot f(x_n)) = \frac{1}{2}(x_n - \alpha \dot f(x_n))'Q(x_n - \alpha \dot f(x_n)) - (x_n - \alpha \dot f(x_n))'b,$$
which is minimized at
$$\alpha_n = \frac{\dot f(x_n)' \dot f(x_n)}{\dot f(x_n)' Q \dot f(x_n)}.$$
It follows that
$$\frac{1}{2}(x_{n+1} - x^*)'Q(x_{n+1} - x^*) = \left\{1 - \frac{(\dot f(x_n)' \dot f(x_n))^2}{\dot f(x_n)' Q \dot f(x_n)\; \dot f(x_n)' Q^{-1} \dot f(x_n)}\right\} \times \frac{1}{2}(x_n - x^*)'Q(x_n - x^*).$$
Theorem 5.1 (Kantorovich inequality) Let $Q$ be a positive definite and
symmetric matrix and let $0 < a = \lambda_1 \le \lambda_2 \le \cdots \le \lambda_n = A$ be the
eigenvalues. Then
$$\frac{(x'x)^2}{(x'Qx)(x'Q^{-1}x)} \ge \frac{4aA}{(a + A)^2}.$$
Proof: By a change of variables $Q$ is diagonal. We assume it is. In which
case
$$\frac{(x'x)^2}{(x'Qx)(x'Q^{-1}x)} = \frac{(\sum_{i=1}^d x_i^2)^2}{(\sum_{i=1}^d \lambda_i x_i^2)(\sum_{i=1}^d x_i^2/\lambda_i)}.$$
Denoting $\xi_i = x_i^2 / \sum_{j=1}^d x_j^2$ the above becomes
$$\frac{1/\sum_{i=1}^d \xi_i \lambda_i}{\sum_{i=1}^d (\xi_i/\lambda_i)} = \frac{\phi(\xi_1, \ldots, \xi_d)}{\psi(\xi_1, \ldots, \xi_d)}.$$
The numerator is a point on the curve $1/x$. The denominator is a convex
combination of points from that curve. The minimal ratio is achieved for
some $\lambda = \xi_1 \lambda_1 + \xi_d \lambda_d$, where $\xi_1 + \xi_d = 1$. Hence,
$$\frac{\phi(\xi_1, \ldots, \xi_d)}{\psi(\xi_1, \ldots, \xi_d)} \ge \min_{\lambda_1 \le \lambda \le \lambda_d} \frac{1/\lambda}{(\lambda_1 + \lambda_d - \lambda)/(\lambda_1 \lambda_d)}.$$
The minimum is achieved when $\xi_1 = \xi_d = 1/2$, and the proof follows.
Theorem 5.2 For the quadratic case
$$\frac{1}{2}(x_{n+1} - x^*)'Q(x_{n+1} - x^*) \le \left(\frac{A - a}{A + a}\right)^2 \frac{1}{2}(x_n - x^*)'Q(x_n - x^*).$$
Proof:
$$1 - \frac{4aA}{(a + A)^2} = \left(\frac{A - a}{A + a}\right)^2.$$
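The bound of Theorem 5.2 can be compared with the contraction actually observed along a steepest descent run. The following is a sketch under illustrative assumptions: the 2 x 2 matrix Q, the vector b and the starting point are made up here, and E denotes the error function $\frac{1}{2}(x - x^*)'Q(x - x^*)$.

```matlab
% Steepest descent with exact line search on f(x) = x'*Q*x/2 - x'*b.
Q = [3 1; 1 2];                  % illustrative symmetric positive definite matrix
b = [1; 1];
xstar = Q \ b;
lam   = eig(Q);
bound = ((max(lam) - min(lam))/(max(lam) + min(lam)))^2;   % Theorem 5.2
E = @(x) 0.5*(x - xstar)'*Q*(x - xstar);                   % error function
x = [4; -3];
ratios = zeros(1, 10);
for n = 1:10
    g     = Q*x - b;             % gradient at x_n
    alpha = (g'*g)/(g'*Q*g);     % exact minimizing step length
    xnew  = x - alpha*g;
    ratios(n) = E(xnew)/E(x);    % observed contraction of the error
    x = xnew;
end
max(ratios) <= bound             % displays 1 (true) if the bound holds
```

The bound depends only on the extreme eigenvalues of Q, while the observed ratios also depend on the starting point, so they may be strictly smaller than the bound.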
5.2 Applying the method in Matlab
>> Q = [0.78 -0.02 -0.12 -0.14; -0.02 0.86 -0.04 0.06; ...
-0.12 -0.04 0.72 -0.08; -0.14 0.06 -0.08 0.74]
Q =
0.7800 -0.0200 -0.1200 -0.1400
-0.0200 0.8600 -0.0400 0.0600
-0.1200 -0.0400 0.7200 -0.0800
-0.1400 0.0600 -0.0800 0.7400
>> b = [0.76 0.08 1.12 0.68];
>> eig(Q)
ans =
0.8800
0.9400
0.7600
0.5200
>> ((0.94 - 0.52)/(0.94 + 0.52))^2
ans =
0.0828
Write the M-file quad1.m:
function f = quad1(x)
global Q
global b
f = x*Q*x' - 2*x*b';
In the Matlab session:
>> global Q
Warning: The value of local variables may have been changed to match the
globals. Future versions of MATLAB will require that you declare
a variable to be global before you use that variable.
>> global b
Warning: The value of local variables may have been changed to match the
globals. Future versions of MATLAB will require that you declare
a variable to be global before you use that variable.
>> fminu('quad1',x)
...
ans =
1.5350 0.1220 1.9752 1.4130
>> options(6) = 2;
>> fminu('quad1',x,options)
...
ans =
1.5350 0.1220 1.9752 1.4130

Write the M-file fun.m:
function f=fun(x)
f = exp(x(1))*(4*x(1)^2+2*x(2)^2+4*x(1)*x(2)+2*x(2)+1);
In the Matlab session:
>> x=[-1,1];
>> x=fminu('fun',x)
x =
0.5000 -1.0000
>> fun(x)
ans =
1.3029e-010
>> x=[-1,1];
>> options(6)=2;
>> x = fminu('fun',x,options)
x =
0.5000 -1.0000
Write the M-file fun1.m:
function f = fun1(x)
f = 100*(x(2)-x(1)^2)^2 + (1 - x(1))^2;
and in the Matlab session:
>> x=[-1,1];
>> options(1)=1;
>> options(6)=0;
>> x = fminu('fun1',x,options)
f-COUNT FUNCTION STEP-SIZE GRAD/SD
4 4 0.500001 -16
9 3.56611e-009 0.500001 0.0208
14 7.36496e-013 0.000915682 -3.1e-006
21 1.93583e-013 9.12584e-005 -1.13e-006
24 1.55454e-013 4.56292e-005 -7.16e-007
Optimization Terminated Successfully
Search direction less than 2*options(2)
Gradient in the search direction less than 2*options(3)
NUMBER OF FUNCTION EVALUATIONS=24
x =
1.0000 1.0000
>> x=[-1,1];
>> options(6)=2;
>> x = fminu('fun1',x,options)
f-COUNT FUNCTION STEP-SIZE GRAD/SD
4 4 0.500001 -16
9 3.56611e-009 0.500001 0.0208
15 1.11008e-012 0.000519178 -4.82e-006
Warning: Matrix is close to singular or badly scaled.
Results may be inaccurate. RCOND = 1.931503e-017.
> In c:\matlab\toolbox\optim\cubici2.m at line 10
In c:\matlab\toolbox\optim\searchq.m at line 54
In c:\matlab\toolbox\optim\fminu.m at line 257
....
192 4.56701e-013 -4.52912e-006 -1.02e-006

195 4.5539e-013 2.26456e-006 -4.03e-007
198 4.55537e-013 -1.13228e-006 -1.02e-006
201 4.55336e-013 5.66141e-007 -4.03e-007
Maximum number of function evaluations exceeded;
increase options(14).
x =
1.0000 1.0000
5.3 Homework
1. Investigate the function
$$f(x, y) = 100(y - x^2)^2 + (1 - x)^2.$$
Why doesn’t the steepest descent algorithm converge?
6 Newton and quasi-Newton methods
6.1 Newton’s method
Based on the Taylor expansion
$$f(x) \approx f(x_n) + \dot f(x_n)'(x - x_n) + \frac{1}{2}(x - x_n)' \ddot f(x_n)(x - x_n)$$
one can derive, just as for the line-search problem, Newton’s method:
$$x_{n+1} = x_n - (\ddot f(x_n))^{-1} \dot f(x_n).$$
Theorem 6.1 (Newton’s method) Assume the target function is in $C^3$,
$x^*$ is a local minimum, and the Hessian $\ddot f(x^*)$ is positive definite. Then if $x_0$
is close enough to $x^*$ the order of convergence is at least 2.
Proof: Since $\dot f(x^*) = 0$,
$$\|x_{n+1} - x^*\| = \|x_n - x^* - \ddot f(x_n)^{-1} \dot f(x_n)\| = \|\ddot f(x_n)^{-1}[\dot f(x^*) - \dot f(x_n) - \ddot f(x_n)(x^* - x_n)]\| \le C\|x_n - x^*\|^2,$$
for some constant $C$.

A modification of this approach is to set
$$x_{n+1} = x_n - \alpha_n (\ddot f(x_n))^{-1} \dot f(x_n),$$
where $\alpha_n$ minimizes the function $f(x_n - \alpha(\ddot f(x_n))^{-1} \dot f(x_n))$.
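The pure Newton iteration (with $\alpha_n = 1$) is easy to try on the function of Section 5.2, $f(x) = 100(x_2 - x_1^2)^2 + (1 - x_1)^2$. This is a sketch, not the toolbox implementation: the gradient and Hessian are coded by hand, and the helper names grad and hess are chosen here.

```matlab
% Pure Newton iteration on f(x) = 100*(x2 - x1^2)^2 + (1 - x1)^2.
% Gradient and Hessian are coded by hand; the minimizer is (1, 1).
grad = @(x) [-400*x(1)*(x(2) - x(1)^2) - 2*(1 - x(1)); 200*(x(2) - x(1)^2)];
hess = @(x) [1200*x(1)^2 - 400*x(2) + 2, -400*x(1); -400*x(1), 200];
x = [-1; 1];
for n = 1:5
    x = x - hess(x) \ grad(x);   % x_{n+1} = x_n - inv(Hessian)*gradient
    fprintf('%d  x = (%.10f, %.10f)\n', n, x(1), x(2));
end
```

From this starting point the first step moves to (1, -3), where the function value is larger than at the start, and the second step reaches (1, 1). The pure iteration is therefore not a descent method, which is one motivation for the line search over $\alpha$ described above.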
6.2 Extensions
Consider the approach of choosing
$$x_{n+1} = x_n - \alpha_n S_n \dot f(x_n),$$
where $S_n$ is some symmetric and positive-definite matrix and $\alpha_n$ is the
nonnegative scalar that minimizes $f(x_n - \alpha S_n \dot f(x_n))$. It can be shown that
since $S_n$ is positive-definite the algorithm is descending.
Assume
$$f(x) = \frac{1}{2} x'Qx - x'b = \frac{1}{2}(x - x^*)'Q(x - x^*) - \frac{1}{2}(x^*)'Qx^*,$$
where $Q$ is a positive definite and symmetric matrix and $x^* = Q^{-1}b$ is the
minimizer of $f$. Note that in this case $\dot f(x) = Qx - b$ and
$$f(x_n - \alpha S_n \dot f(x_n)) = \frac{1}{2}(x_n - \alpha S_n \dot f(x_n))'Q(x_n - \alpha S_n \dot f(x_n)) - (x_n - \alpha S_n \dot f(x_n))'b,$$
which is minimized at
$$\alpha_n = \frac{\dot f(x_n)' S_n \dot f(x_n)}{\dot f(x_n)' S_n Q S_n \dot f(x_n)}.$$
It follows that
$$\frac{1}{2}(x_{n+1} - x^*)'Q(x_{n+1} - x^*) = \left\{1 - \frac{(\dot f(x_n)' S_n \dot f(x_n))^2}{\dot f(x_n)' S_n Q S_n \dot f(x_n)\; \dot f(x_n)' Q^{-1} \dot f(x_n)}\right\} \times \frac{1}{2}(x_n - x^*)'Q(x_n - x^*).$$
Theorem 6.2 For the quadratic case
$$\frac{1}{2}(x_{n+1} - x^*)'Q(x_{n+1} - x^*) \le \left(\frac{B_n - b_n}{B_n + b_n}\right)^2 \frac{1}{2}(x_n - x^*)'Q(x_n - x^*),$$
where $B_n$ and $b_n$ are the largest and smallest eigenvalues of $S_n Q$.
6.3 The Davidon-Fletcher-Powell (DFP) method
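This is a rank-two correction procedure. The algorithm starts with some
positive-definite matrix $S_0$ and an initial point $x_0$:
1. Minimize $f(x_n - \alpha S_n \dot f(x_n))$ over $\alpha \ge 0$ to obtain $x_{n+1}$,
$\Delta_n x = x_{n+1} - x_n = -\alpha_n S_n \dot f(x_n)$, $\dot f(x_{n+1})$ and $\Delta_n \dot f = \dot f(x_{n+1}) - \dot f(x_n)$.
2. Set
$$S_{n+1} = S_n + \frac{\Delta_n x\, \Delta_n x'}{\Delta_n x' \Delta_n \dot f} - \frac{S_n \Delta_n \dot f\, \Delta_n \dot f' S_n}{\Delta_n \dot f' S_n \Delta_n \dot f}.$$
3. Go to 1.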
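It follows, since $\Delta_n x' \dot f(x_{n+1}) = 0$, that if $S_n$ is positive definite then so
is $S_{n+1}$.
Proof: Define $\Delta x = x_{n+1} - x_n$, $\Delta \dot f = \dot f(x_{n+1}) - \dot f(x_n)$. Since
$$S_{n+1} = S_n + \frac{(\Delta x)(\Delta x)'}{(\Delta x)'(\Delta \dot f)} - \frac{S_n (\Delta \dot f)(\Delta \dot f)' S_n}{(\Delta \dot f)' S_n (\Delta \dot f)},$$
it follows that
$$y' S_{n+1} y = y' S_n y - \frac{(y' S_n (\Delta \dot f))^2}{(\Delta \dot f)' S_n (\Delta \dot f)} + \frac{(y'(\Delta x))^2}{(\Delta x)'(\Delta \dot f)} = a'a - \frac{(a'b)^2}{b'b} + \frac{(y'(\Delta x))^2}{(\Delta x)'(\Delta \dot f)}, \qquad (1)$$
where $a = S_n^{1/2} y$ and $b = S_n^{1/2} (\Delta \dot f)$.
Next, since $x_{n+1}$ is computed by minimizing the function $f$ in the given
direction, it follows that $(\dot f(x_n))' S_n (\dot f(x_{n+1})) = 0$. However, $\Delta x = -\alpha_n S_n \dot f(x_n)$.
It can be concluded that $(\Delta x)'(\Delta \dot f) = \alpha_n (\dot f(x_n))' S_n (\dot f(x_n)) > 0$.
The first difference in (1) is nonnegative by the Cauchy-Schwarz inequality,
and strictly positive unless $y \propto (\Delta \dot f)$. Assume it is. In which case the
last term is proportional to $((\Delta \dot f)'(\Delta x))^2$, where
$(\Delta \dot f)'(\Delta x) = \alpha_n (\dot f(x_n))' S_n (\dot f(x_n))$, thus positive.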
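The DFP update can be exercised on a quadratic, where the exact line search has the closed form given in Section 6.2. The following is a sketch under illustrative assumptions: the 3 x 3 matrix Q, the vector b and the choice S_0 = I are made up here, and the variable names dx and dg stand for $\Delta_n x$ and $\Delta_n \dot f$.

```matlab
% DFP iteration with exact line search on f(x) = x'*Q*x/2 - x'*b.
Q = [4 1 0; 1 3 1; 0 1 2];       % illustrative symmetric positive definite matrix
b = [1; 2; 3];
x = zeros(3, 1);
S = eye(3);                      % S_0: any symmetric positive definite matrix
g = Q*x - b;                     % gradient at x_0
for n = 1:3
    d     = -S*g;                          % search direction
    alpha = -(g'*d)/(d'*Q*d);              % exact line search for a quadratic
    dx    = alpha*d;                       % Delta x
    x     = x + dx;
    gnew  = Q*x - b;
    dg    = gnew - g;                      % change in the gradient
    S     = S + (dx*dx')/(dx'*dg) - (S*dg)*(dg'*S)/(dg'*S*dg);
    g     = gnew;
end
disp(norm(x - Q\b))              % distance to the minimizer
disp(norm(S - inv(Q)))           % distance of S_3 from the inverse Hessian
```

For a quadratic in dimension 3 with exact line searches, three steps are enough: both printed distances are at the level of rounding error, which illustrates that $S_n$ builds up an approximation of the inverse Hessian.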
6.4 The Broyden-Fletcher-Goldfarb-Shanno (BFGS) method
In this method the Hessian is approximated. This is a rank-two correction
procedure as well. The algorithm starts with some positive-definite matrix
$H_0$ and an initial point $x_0$:
1. Minimize $f(x_n - \alpha H_n^{-1} \dot f(x_n))$ over $\alpha \ge 0$ to obtain $x_{n+1}$,
$\Delta_n x = x_{n+1} - x_n = -\alpha_n H_n^{-1} \dot f(x_n)$, $\dot f(x_{n+1})$ and $\Delta_n \dot f = \dot f(x_{n+1}) - \dot f(x_n)$.
2. Set
$$H_{n+1} = H_n + \frac{\Delta_n \dot f\, \Delta_n \dot f'}{\Delta_n \dot f' \Delta_n x} - \frac{H_n \Delta_n x\, \Delta_n x' H_n}{\Delta_n x' H_n \Delta_n x}.$$
3. Go to 1.
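The BFGS update can be tried on a quadratic in the same way, with the difference that the matrix being updated approximates the Hessian itself and the direction is obtained by solving a linear system. This is a sketch with illustrative data (Q, b and H_0 = I are assumptions made here), not the implementation used by the toolbox.

```matlab
% BFGS iteration with exact line search on f(x) = x'*Q*x/2 - x'*b.
Q = [4 1 0; 1 3 1; 0 1 2];       % illustrative symmetric positive definite matrix
b = [1; 2; 3];
x = zeros(3, 1);
H = eye(3);                      % H_0: initial approximation of the Hessian
g = Q*x - b;                     % gradient at x_0
for n = 1:3
    d     = -(H \ g);                      % direction -inv(H_n)*gradient
    alpha = -(g'*d)/(d'*Q*d);              % exact line search for a quadratic
    dx    = alpha*d;                       % Delta x
    x     = x + dx;
    gnew  = Q*x - b;
    dg    = gnew - g;                      % change in the gradient
    H     = H + (dg*dg')/(dg'*dx) - (H*dx)*(dx'*H)/(dx'*H*dx);
    g     = gnew;
end
disp(norm(x - Q\b))              % distance to the minimizer
disp(norm(H - Q))                % distance of H_3 from the Hessian
```

After three steps on this three-dimensional quadratic the iterate agrees with the minimizer and H agrees with Q up to rounding.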
6.5 The function fminunc
FMINUNC Finds the minimum of a function of several variables.

X=FMINUNC(FUN,X0) starts at the point X0 and finds a minimum
X of the function described in FUN. X0 can be a scalar, vector or matrix.
The function FUN (usually an M-file or inline object) should return a scalar
function value F evaluated at X when called with feval: F=feval(FUN,X).
See the examples below for more about FUN.
X=FMINUNC(FUN,X0,OPTIONS) minimizes with the default optimization
parameters replaced by values in the structure OPTIONS, an argument
created with the OPTIMSET function. See OPTIMSET for details. Used
options are Display, TolX, TolFun, DerivativeCheck, Diagnostics, GradObj,
HessPattern, LineSearchType, Hessian, HessUpdate, MaxFunEvals, MaxIter,
DiffMinChange and DiffMaxChange, LargeScale, MaxPCGIter, PrecondBandWidth,
TolPCG, TypicalX. Use the GradObj option to specify that FUN can be
called with two output arguments where the second, G, is the vector of
partial derivatives of the function, df/dX, at the point X:
[F,G] = feval(FUN,X). Use the Hessian option to specify that FUN can be
called with three output arguments where the second, G, is the vector of
partial derivatives df/dX, and the third, H, is the matrix of second
partial derivatives (the Hessian) at the point X: [F,G,H] = feval(FUN,X).
The Hessian is only used by the large-scale method, not the line-search
method.
X=FMINUNC(FUN,X0,OPTIONS,P1,P2,...) passes the problem-dependent
parameters P1,P2,... directly to the function FUN, e.g. FUN would be
called using feval as in: feval(FUN,X,P1,P2,...). Pass an empty matrix for
OPTIONS to use the default values.
[X,FVAL]=FMINUNC(FUN,X0,...) returns the value of the objective
function FUN at the solution X.
[X,FVAL,EXITFLAG]=FMINUNC(FUN,X0,...) returns a value EXITFLAG that
describes the exit condition of FMINUNC. If EXITFLAG is:
> 0 then FMINUNC converged to a solution X.
0 then the maximum number of function evaluations was reached.
< 0 then FMINUNC did not converge to a solution.
[X,FVAL,EXITFLAG,OUTPUT]=FMINUNC(FUN,X0,...) returns a structure
OUTPUT with the number of iterations taken in OUTPUT.iterations, the
number of function evaluations in OUTPUT.funcCount, the algorithm used
in OUTPUT.algorithm, the number of CG iterations (if used) in
OUTPUT.cgiterations, and the first-order optimality (if used) in
OUTPUT.firstorderopt.
[X,FVAL,EXITFLAG,OUTPUT,GRAD]=FMINUNC(FUN,X0,...) returns
the value of the gradient of FUN at the solution X.
[X,FVAL,EXITFLAG,OUTPUT,GRAD,HESSIAN]=FMINUNC(FUN,X0,...)
returns the value of the Hessian of the objective function FUN at the solution
X.
6.6 Examples
Create a file myfun.m:
function f = myfun(x)
f = 3*x(1)^2 + 2*x(1)*x(2) + x(2)^2; % cost function
Then call fminunc to find a minimum of 'myfun' near [1,1]:
>> x0 = [1,1];
>> [x,fval] = fminunc('myfun',x0)
After a couple of iterations, the solution, x, and the value of the function at
x, fval, are returned:
x =
1.0e-008 *
-0.7914 0.2260
fval =
1.5722e-016
To minimize this function with the gradient provided, modify the M-file
myfun.m so the gradient is the second output argument
function [f,g] = myfun(x)

f = 3*x(1)^2 + 2*x(1)*x(2) + x(2)^2; % cost function
if nargout > 1
g(1) = 6*x(1)+2*x(2);
g(2) = 2*x(1)+2*x(2);
end
and indicate that the gradient value is available by creating an optimization
options structure with options.GradObj set to 'on' using optimset:
>> options = optimset('GradObj','on');
>> x0 = [1,1];
>> [x,fval] = fminunc('myfun',x0,options)
After several iterations the solution x and fval, the value of the function at
x, are returned:
x =
1.0e-015 *
-0.6661 0
fval =
1.3312e-030
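Before handing a gradient to an optimizer it is worth comparing it with finite differences, which is roughly what the DerivativeCheck option is for. The sketch below is plain Python rather than MATLAB and only mirrors the function myfun above; the helper numeric_gradient and the step size h are assumptions made for this illustration.

```python
def myfun(x):
    """Value and analytic gradient of f(x) = 3*x1^2 + 2*x1*x2 + x2^2."""
    f = 3 * x[0] ** 2 + 2 * x[0] * x[1] + x[1] ** 2
    g = [6 * x[0] + 2 * x[1], 2 * x[0] + 2 * x[1]]
    return f, g

def numeric_gradient(fun, x, h=1e-6):
    """Central finite-difference approximation of the gradient of fun at x."""
    grad = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        grad.append((fun(xp)[0] - fun(xm)[0]) / (2 * h))
    return grad

x0 = [1.0, 1.0]
f0, g0 = myfun(x0)
g_num = numeric_gradient(myfun, x0)
max_error = max(abs(a - b) for a, b in zip(g0, g_num))
print(f0, g0, g_num, max_error)
```

A large value of max_error would point to a mistake in the hand-coded derivatives.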
6.7 Homework
1. Use the function fminu with options(6)=0 (BFGS), options(6)=1 (DFP)
and options(6)=2 (steepest descent) to compare the performance of the
algorithms. Apply the function to minimize f(x) = x'Qx, where Q is a
diagonal matrix. Use different ratios between the smallest and the largest
eigenvalues and different dimensions.
2. Investigate the rate of convergence of the algorithm
\[ x_{n+1} = x_n - [\delta I + (\ddot f(x_n))^{-1}]\dot f(x_n). \]
What is the rate if $\delta$ is larger than the smallest eigenvalue of $(\ddot f(x^*))^{-1}$?
3. Use the formula
\[ [A + ab']^{-1} = A^{-1} - \frac{A^{-1}ab'A^{-1}}{1 + b'A^{-1}a} \]
in order to get a direct updating formula for the inverse of $H_n$ in the
BFGS method.
4. Read the help file on the function fminu. Investigate the effect of
supplying the gradients with the parameter grad on the performance of the
procedure. Compare, in particular, the functions bilinear and fun1.
6.8 Project 1
Consider the function
\[ f(x) = \sum_{i=1}^{n-1}\left[ (x_i^2)^{(x_{i+1}^2+1)} + (x_{i+1}^2)^{(x_i^2+1)} \right]. \]
Use the function fminunc to identify local minima of the function. Try
the procedure with and without providing the gradient. Try it for different
n's. Which is the largest n for which the convergence was successful?
7 Constrained Minimization Conditions
The general (constrained) optimization problem has the form:
\[ \min_{x \in R^d} f(x) \]
subject to:
gi (x) = 0 i = 1, . . . , me
gi (x) ≤ 0 i = me + 1, . . . , m
xl ≤ x ≤ xu
The first me constraints are called equality constraints and the last m − me
constraints are the inequality constraints.
7.1 Necessary conditions (equality constraints)
We assume first that $m_e = m$, i.e. all constraints are equality constraints.
Let $x^*$ be a solution of the optimization problem. Let $g = (g_1, \ldots, g_m)$.
Note that g is a (non-linear) transformation from $R^d$ into $R^m$. The set
$\{x \in R^d : g(x) = 0\}$ is a surface in $R^d$. This surface is approximated near
$x^*$ by $x^* + M$, where
\[ M = \{y : \dot g(x^*)'y = 0\}. \]
In order for this approximation to hold, $x^*$ should be a regular point of the
constraint, i.e. the gradients $\dot g_1(x^*), \ldots, \dot g_m(x^*)$ should be linearly independent.
Theorem 7.1 (Lagrange multipliers) Let $x^*$ be a local extremum point
of f subject to the constraint g = 0. Assume that $x^*$ is a regular point of
these constraints. Then there is a $\lambda \in R^m$ such that
\[ \dot f(x^*) + \dot g(x^*)\lambda = \dot f(x^*) + \sum_{j=1}^m \lambda_j\dot g_j(x^*) = 0. \]
Given $\lambda$, one can consider the Lagrangian:
\[ l(x, \lambda) = f(x) + g(x)\lambda. \]
The necessary conditions can be formulated as $\dot l = 0$. The matrix of partial
second derivatives of l (with respect to x) at $x^*$ is
\[ \ddot l_x(x^*) = \ddot f(x^*) + \ddot g(x^*)\lambda = \ddot f(x^*) + \sum_{j=1}^m \ddot g_j(x^*)\lambda_j. \]
We say that this matrix is positive semidefinite over M if $x'\ddot l_x(x^*)x \ge 0$ for
all $x \in M$.
Theorem 7.2 (Second-order condition) Let $x^*$ be a local minimum point
of f subject to the constraint g = 0. Assume that $x^*$ is a regular point of
these constraints, and let $\lambda \in R^m$ be such that
\[ \dot f(x^*) + \dot g(x^*)\lambda = \dot f(x^*) + \sum_{j=1}^m \lambda_j\dot g_j(x^*) = 0. \]
Then the matrix $\ddot l_x(x^*)$ is positive semidefinite over M.
7.2 Examples
We now give some applications of the above theory.
Example 7.1 Maximize f(x, y, z) = xy + yz + xz, subject to x + y + z = 3.
The necessary conditions become:
y+z+λ = 0
x+z+λ = 0
x+y+λ = 0
x + y + z = 3.
Solving this system gives x = y = z = 1, λ = −2.
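The four necessary conditions of Example 7.1 are linear in (x, y, z, λ), so they can be solved directly. The following plain-Python sketch does so with a small Gaussian elimination routine (the helper solve is written for this illustration and is not part of any toolbox).

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

# Unknowns are (x, y, z, lambda); the rows are the four necessary conditions.
A = [[0, 1, 1, 1],
     [1, 0, 1, 1],
     [1, 1, 0, 1],
     [1, 1, 1, 0]]
b = [0, 0, 0, 3]
x, y, z, lam = solve(A, b)
print(x, y, z, lam)
```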
Example 7.2 A discrete random variable takes the values $x_1, \ldots, x_d$, with
probabilities $p_1, \ldots, p_d$. For a given mean value m, find the distribution
which maximizes the entropy
\[ f(p_1, \ldots, p_d) = -\sum_{i=1}^d p_i\log(p_i). \]
The problem can be formulated as
\[ \max f(p_1, \ldots, p_d) \]
subject to:
\[ \sum_{i=1}^d p_i = 1, \qquad \sum_{i=1}^d x_ip_i = m, \qquad 0 \le p_i, \; i = 1, \ldots, d. \]
Ignoring the set-constraints, the Lagrangian becomes
\[ l(p_1, \ldots, p_d, \lambda_1, \lambda_2) = \sum_{i=1}^d \{-p_i\log(p_i) + \lambda_1p_i + \lambda_2x_ip_i\} - \lambda_1 - m\lambda_2. \]
The necessary conditions are
\[ -\log(p_i) - 1 + \lambda_1 + \lambda_2x_i = 0, \quad i = 1, \ldots, d, \]
which leads to
\[ p_i = \exp\{(\lambda_1 - 1) + \lambda_2x_i\}, \qquad \sum_{i=1}^d p_i = 1, \qquad \sum_{i=1}^d x_ip_i = m. \]
Note that the solution satisfies the set-constraints.
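The exponential form of the solution leaves only the multipliers to be determined. A minimal sketch in plain Python, assuming the values are those of a die and the mean is 4.5 (both chosen only for illustration): since the mean of the distribution is increasing in λ2, bisection finds λ2, and λ1 is then fixed by the normalization.

```python
import math

def maxent_distribution(values, m, lo=-50.0, hi=50.0, iters=200):
    """Find p_i proportional to exp(lam2 * x_i) whose mean equals m."""
    def mean_for(lam2):
        w = [math.exp(lam2 * v) for v in values]
        return sum(v * wi for v, wi in zip(values, w)) / sum(w)
    for _ in range(iters):            # bisection on the monotone map lam2 -> mean
        mid = 0.5 * (lo + hi)
        if mean_for(mid) < m:
            lo = mid
        else:
            hi = mid
    lam2 = 0.5 * (lo + hi)
    w = [math.exp(lam2 * v) for v in values]
    total = sum(w)                    # total = exp(-(lam1 - 1)) by normalization
    return [wi / total for wi in w], lam2

values = [1, 2, 3, 4, 5, 6]
p, lam2 = maxent_distribution(values, 4.5)
mean = sum(v * pi for v, pi in zip(values, p))
p_uniform, lam2_uniform = maxent_distribution(values, 3.5)
print(p, lam2, mean)
```

When the prescribed mean equals the plain average of the values (3.5 here), the procedure returns λ2 = 0 and the uniform distribution.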
Example 7.3 A chain is suspended from two hooks that are t meters apart
on a horizontal line. The chain consists of d links. Each link is 1 meter
long (measured from the inside). What is the shape of the chain?
We intend to minimize the potential energy of the chain. We let link i span
a horizontal distance $x_i$ and a vertical distance $y_i$. The potential energy of
a link is its weight times its vertical distance (from some reference point).
The potential energy of the chain is the sum of the potential energies of the
links. Take the top as the reference and assume that the mass of each link is
concentrated at the center of the link. The potential energy is proportional
to
\[ f(y_1, \ldots, y_d) = 0.5y_1 + (y_1 + 0.5y_2) + \cdots + (y_1 + \cdots + y_{d-1} + 0.5y_d) = \sum_{i=1}^d (d - i + 0.5)y_i. \]
The constraints are:
\[ \sum_{i=1}^d y_i = 0, \qquad \sum_{i=1}^d x_i = \sum_{i=1}^d \sqrt{1 - y_i^2} = t. \]
The first order necessary conditions are
\[ (d - i + 0.5) + \lambda_1 - \frac{\lambda_2y_i}{(1 - y_i^2)^{1/2}} = 0, \quad \mbox{for } i = 1, \ldots, d, \]
which leads to
\[ y_i = -\frac{d - i + 0.5 + \lambda_1}{[\lambda_2^2 + (d - i + 0.5 + \lambda_1)^2]^{1/2}}. \]
7.3 Necessary conditions (inequality constraints)
We now consider the case where me < m. Let x∗ be a solution of the

constrained optimization problem. A constraint gj is active at x∗ if gj (x∗ ) =
0 and it is inactive if gj (x∗ ) < 0. Note that all equality constraints are active.
Denote by J the set of all active constraints.
For the consideration of necessary conditions when inequality constraints
are present the definition of a regular point should be extended. We say now
that x∗ is regular if {ġj (x∗ ) : j ∈ J} are linearly independent.
Theorem 7.3 (Kuhn-Tucker Conditions) Let $x^*$ be a local minimum
point of f subject to the constraints $g_j(x) = 0$, $1 \le j \le m_e$, and $g_j(x) \le 0$,
$m_e + 1 \le j \le m$. Assume that $x^*$ is a regular point of these constraints.
Then there is a $\lambda \in R^m$ such that $\lambda_j \ge 0$, for all $j > m_e$, and
\[ \dot f(x^*) + \sum_{j=1}^m \lambda_j\dot g_j(x^*) = 0, \]
\[ \sum_{j=m_e+1}^m \lambda_jg_j(x^*) = 0. \]
Let
\[ \ddot l_x(x^*) = \ddot f(x^*) + \sum_{j=1}^m \lambda_j\ddot g_j(x^*). \]
Theorem 7.4 (Second-order condition) Let $x^*$ be a local minimum point
of f subject to the constraints $g_j(x) = 0$, $1 \le j \le m_e$, and $g_j(x) \le 0$,
$m_e + 1 \le j \le m$. Assume that $x^*$ is a regular point of these constraints, and
let $\lambda \in R^m$ be such that $\lambda_j \ge 0$, for all $j > m_e$, and
\[ \dot f(x^*) + \sum_{j=1}^m \lambda_j\dot g_j(x^*) = 0. \]
Then the matrix $\ddot l_x(x^*)$ is positive semidefinite on the tangent subspace of
the active constraints.
7.4 Sufficient conditions
Sufficient conditions are based on second-order conditions:
Theorem 7.5 (Equality constraints) Suppose there is a point $x^*$
satisfying $g(x^*) = 0$, and a $\lambda \in R^m$ such that
\[ \dot f(x^*) + \sum_{j=1}^m \lambda_j\dot g_j(x^*) = 0. \]
Suppose also that the matrix $\ddot l_x(x^*)$ is positive definite on M. Then $x^*$ is a
strict local minimum for the constrained optimization problem.
Theorem 7.6 (Inequality constraints) Suppose there is a point $x^*$ that
satisfies the constraints. A sufficient condition for $x^*$ to be a strict local
minimum for the constrained optimization problem is the existence of a
$\lambda \in R^m$ such that $\lambda_j \ge 0$, for $m_e < j \le m$, and
\[ \dot f(x^*) + \sum_{j=1}^m \lambda_j\dot g_j(x^*) = 0, \qquad (2) \]
\[ \sum_{j=m_e+1}^m \lambda_jg_j(x^*) = 0, \qquad (3) \]
and the Hessian matrix $\ddot l_x(x^*)$ is positive definite on the subspace
\[ M' = \{y : \dot g_j(x^*)'y = 0, \; j \in J\}, \]
where $J = \{j : g_j(x^*) = 0, \; \lambda_j > 0\}$.
Example 7.4 Consider the problem:
\[ \begin{array}{ll} \mbox{minimize} & 2x^2 + 2xy + y^2 - 10x - 10y \\ \mbox{subject to} & x^2 + y^2 \le 5 \\ & 3x + y \le 6. \end{array} \]
The first order necessary conditions are
\[ \begin{array}{l} 4x + 2y - 10 + 2\lambda_1x + 3\lambda_2 = 0 \\ 2x + 2y - 10 + 2\lambda_1y + \lambda_2 = 0 \\ \lambda_1 \ge 0, \; \lambda_2 \ge 0 \\ \lambda_1(x^2 + y^2 - 5) = 0 \\ \lambda_2(3x + y - 6) = 0. \end{array} \]
One should check different subsets of active and inactive constraints. For
example, if we set J = {1} then $\lambda_2 = 0$ and
\[ \begin{array}{l} 4x + 2y - 10 + 2\lambda_1x = 0 \\ 2x + 2y - 10 + 2\lambda_1y = 0 \\ x^2 + y^2 = 5, \end{array} \]
which has the solution x = 1, y = 2, $\lambda_1 = 1$. This yields 3x + y = 5, and
hence the second constraint is satisfied. Thus, since $\lambda_1 > 0$, we conclude
that this solution satisfies the first order necessary conditions.
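The candidate point of Example 7.4 can be checked numerically. The sketch below (plain Python, written for this illustration) evaluates stationarity, complementarity and feasibility at x = 1, y = 2 with λ1 = 1 and λ2 = 0.

```python
x, y = 1.0, 2.0
lam1, lam2 = 1.0, 0.0

grad_f = (4 * x + 2 * y - 10, 2 * x + 2 * y - 10)   # gradient of the objective
g1, grad_g1 = x ** 2 + y ** 2 - 5, (2 * x, 2 * y)   # first constraint and its gradient
g2, grad_g2 = 3 * x + y - 6, (3.0, 1.0)             # second constraint and its gradient

# Stationarity: grad f + lam1 * grad g1 + lam2 * grad g2 = 0.
stationarity = [grad_f[k] + lam1 * grad_g1[k] + lam2 * grad_g2[k] for k in range(2)]
# Complementarity: lam_j * g_j = 0 for each inequality constraint.
complementarity = (lam1 * g1, lam2 * g2)
feasible = g1 <= 0 and g2 <= 0
print(stationarity, complementarity, feasible)
```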
7.5 Sensitivity
The Lagrange multipliers can be interpreted as the price of incremental
change in the constraints. Consider the class of problems:
\[ \begin{array}{ll} \mbox{minimize} & f(x) \\ \mbox{subject to} & g(x) = c. \end{array} \]
For each c, assume the existence of a solution point $x^*(c)$. Under appropriate
regularity conditions the function $x^*(c)$ is well behaved with $x^*(0) = x^*$.
Theorem 7.7 (Sensitivity Theorem) Let $f, g \in C^2$ and consider the
family of problems defined above. Suppose that for c = 0 there is a local
solution $x^*$ that is a regular point and that, together with its associated
Lagrange multiplier vector $\lambda$, satisfies the second-order sufficient conditions for
a strict local minimum. Then for every c in a neighborhood of 0 there is
$x^*(c)$, continuous in c, such that $x^*(0) = x^*$, $x^*(c)$ is a local minimum of
the constrained problem indexed by c, and
\[ \nabla_c f(x^*(c))\,|_{c=0} = -\lambda. \]
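A small numerical check of the sensitivity relation, on a problem chosen only for illustration (it does not appear in the notes): minimize x^2 + y^2 subject to x + y - 2 = c. The stationarity conditions 2x + λ = 0 and 2y + λ = 0 give x = y = (2 + c)/2 in closed form, so the derivative of the optimal value with respect to c can be compared with -λ.

```python
def solve_problem(c):
    """Minimize x^2 + y^2 subject to x + y - 2 = c (illustrative problem).

    Stationarity, 2x + lam = 0 and 2y + lam = 0, gives x = y = (2 + c) / 2.
    Returns the optimal value and the Lagrange multiplier.
    """
    x = y = (2.0 + c) / 2.0
    lam = -2.0 * x
    return x * x + y * y, lam

value0, lam0 = solve_problem(0.0)
h = 1e-6
# Central finite difference of the optimal value with respect to c at c = 0.
slope = (solve_problem(h)[0] - solve_problem(-h)[0]) / (2 * h)
print(slope, -lam0)
```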
7.6 Homework
1. Consider the constraints $x_1 \ge 0$, $x_2 \ge 0$ and $x_2 - (x_1 - 1)^2 \le 0$. Show that
(1, 0) is feasible but not regular.
2. Find the rectangle of given perimeter that has greatest area by solving
the first-order necessary conditions. Verify that the second-order sufficient
conditions are satisfied.
3. Three types of items are to be stored. Item A costs one dollar, item B costs
two dollars and item C costs 4 dollars. The demands for the three items
are independent and uniformly distributed in the range [0, 3000]. How
many of each type should be stored if the total budget is 4,000 dollars?
4. Let A be an m × n matrix of rank m and let L be an n × n matrix that
is symmetric and positive-definite on the subspace M = {y : Ay = 0}.
Show that the (n + m) × (n + m) matrix
\[ \left( \begin{array}{cc} L & A' \\ A & 0 \end{array} \right) \]
is non-singular.
5. Consider the quadratic program
\[ \begin{array}{ll} \mbox{minimize} & x'Qx - 2b'x \\ \mbox{subject to} & Ax = c. \end{array} \]
Prove that $x^*$ is a local minimum point if and only if it is a global minimum
point.
6. Maximize $14x - x^2 + 6y - y^2 + 7$ subject to $x + y \le 2$, $x + 2y \le 3$.
8 Lagrange methods
The Lagrange methods for dealing with constrained optimization problems
are based on solving the Lagrange first-order necessary conditions. In
particular, for solving the problem with equality constraints only:
\[ \begin{array}{ll} \mbox{minimize} & f(x) \\ \mbox{subject to} & g(x) = 0, \end{array} \]
the algorithms look for solutions of the system:
\[ \dot f(x) + \sum_{j=1}^m \lambda_j\dot g_j(x) = 0, \qquad g(x) = 0. \]
8.1 Quadratic programming
An important special case is when the target function f is quadratic and
the constraints are linear:
\[ \begin{array}{ll} \mbox{minimize} & (1/2)x'Qx + x'c \\ \mbox{subject to} & a_i'x = b_i, \quad 1 \le i \le m_e \\ & a_i'x \le b_i, \quad m_e + 1 \le i \le m \end{array} \]
with Q a symmetric matrix.
8.1.1 Equality constraints
In the particular case where $m_e = m$ the above becomes
\[ \begin{array}{ll} \mbox{minimize} & (1/2)x'Qx + x'c \\ \mbox{subject to} & Ax = b, \end{array} \]
and the Lagrange necessary conditions become
\[ Qx + A'\lambda + c = 0, \qquad Ax - b = 0. \]
This system is nonsingular if A has full row rank and Q is positive definite
on the subspace M = {x : Ax = 0}. If Q is nonsingular then the solution becomes:
\[ x = Q^{-1}A'(AQ^{-1}A')^{-1}[AQ^{-1}c + b] - Q^{-1}c, \]
\[ \lambda = -(AQ^{-1}A')^{-1}[AQ^{-1}c + b]. \]
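The closed-form solution can be tried on a tiny instance. The sketch below is plain Python with hand-written 2 × 2 helpers; the data Q, c, A, b are illustrative values, not taken from the notes. It evaluates the two formulas and then checks that the residual of the necessary conditions vanishes.

```python
def inv2(Q):
    """Inverse of a 2 x 2 matrix given as a list of rows."""
    (q11, q12), (q21, q22) = Q
    det = q11 * q22 - q12 * q21
    return [[q22 / det, -q12 / det], [-q21 / det, q11 / det]]

def matvec(M, v):
    return [sum(M[i][k] * v[k] for k in range(len(v))) for i in range(len(M))]

# minimize (1/2) x'Qx + x'c  subject to  A x = b   (illustrative data)
Q = [[2.0, 0.0], [0.0, 2.0]]
c = [-2.0, -5.0]
A = [1.0, 1.0]          # a single constraint row: x1 + x2 = b
b = 1.0

Qinv = inv2(Q)
Qinv_At = matvec(Qinv, A)                        # Q^{-1} A'
Qinv_c = matvec(Qinv, c)                         # Q^{-1} c
AQA = sum(A[i] * Qinv_At[i] for i in range(2))   # A Q^{-1} A' (a scalar here)
rhs = sum(A[i] * Qinv_c[i] for i in range(2)) + b    # A Q^{-1} c + b
lam = -rhs / AQA
x = [Qinv_At[i] * rhs / AQA - Qinv_c[i] for i in range(2)]

# Residual of Qx + A'lam + c = 0 and of the constraint Ax = b.
Qx = matvec(Q, x)
residual = [Qx[i] + A[i] * lam + c[i] for i in range(2)]
constraint_residual = A[0] * x[0] + A[1] * x[1] - b
print(x, lam, residual, constraint_residual)
```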
8.1.2 Inequality constraints
For the general quadratic programming problem, the active set method is
used. A working set of constraints $W_n$ is updated in each iteration. The set
$W_n$ contains all constraints that are suspected to satisfy an equality relation
at the solution point. In particular, it contains the equality constraints. An
algorithm for solving the general quadratic problem is:
1. Start with a feasible point $x_0$ and a working set $W_0$. Set n = 0.
2. Solve the quadratic problem
\[ \begin{array}{ll} \mbox{minimize} & (1/2)d'Qd + (c + Qx_n)'d \\ \mbox{subject to} & a_i'd = 0, \quad i \in W_n. \end{array} \]
Denote the solution by $d_n^*$. If $d_n^* = 0$ go to 4.
3. Set $x_{n+1} = x_n + \alpha_nd_n^*$, where
\[ \alpha_n = \min_{a_i'd_n^* > 0} \left\{ 1, \frac{b_i - a_i'x_n}{a_i'd_n^*} \right\}. \]
If $\alpha_n < 1$, adjoin the minimizing index above to $W_n$ to form $W_{n+1}$. Set
n = n + 1 and return to step 2.
4. Compute the Lagrange multipliers of the problem in step 2 and let
$\lambda_n = \min\{\lambda_i : i \in W_n, i > m_e\}$. If $\lambda_n \ge 0$, stop; $x_n$ is a solution. Otherwise,
drop the constraint corresponding to $\lambda_n$ from $W_n$ to form $W_{n+1}$ and return to step 2.
Example 8.1 Consider the problem
\[ \begin{array}{lll} \mbox{minimize} & 2x^2 + xy + y^2 - 12x - 10y & \\ \mbox{subject to} & (1) & x + y \le 4, \\ & (2) & -x \le 0, \\ & (3) & -y \le 0. \end{array} \]
Take $x_0 = (0, 0)'$ and $W_0 = \{2, 3\}$. Then $d_0^* = (0, 0)'$. Both Lagrange
multipliers are negative, but the one corresponding to (2) is more negative.
Drop that constraint, and put $W_1 = \{3\}$. Minimizing along the line y = 0
leads to $x_1 = (3, 0)'$. The Lagrange multiplier of the active constraint is
negative, thus $W_2 = \emptyset$. Also, $d_1^* = (-1, 4)'$, the direction to the overall
optimum at $(2, 4)'$. We move until we hit the constraint (1), and write $W_3 = \{1\}$.
Finally, we move along this constraint to the solution.
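The path described in Example 8.1 can be traced numerically. The sketch below is plain Python and assumes that constraint (1) reads x + y ≤ 4, the right-hand side that is consistent with the path described in the example; the closed-form steps in the comments are worked out by hand for this illustration.

```python
def grad(p):
    """Gradient of 2x^2 + xy + y^2 - 12x - 10y."""
    x, y = p
    return (4 * x + y - 12, x + 2 * y - 10)

# At x0 = (0, 0) with W0 = {2, 3}: grad f + lam2*(-1, 0) + lam3*(0, -1) = 0,
# so the multipliers equal the components of the gradient.
lam2, lam3 = grad((0.0, 0.0))

# Drop constraint (2) and minimize along y = 0: 4x - 12 = 0.
x1 = (3.0, 0.0)
lam3_at_x1 = grad(x1)[1]      # multiplier of the remaining constraint (3)

# With an empty working set the direction points to the unconstrained optimum (2, 4).
d = (2.0 - x1[0], 4.0 - x1[1])
# Largest step that keeps x + y <= 4 (assumed right-hand side, see above).
alpha = min(1.0, (4.0 - (x1[0] + x1[1])) / (d[0] + d[1]))
x2 = (x1[0] + alpha * d[0], x1[1] + alpha * d[1])

# Minimize along x + y = 4: substituting y = 4 - x gives 2x^2 - 6x - 24.
x_star = (1.5, 2.5)
lam1 = -grad(x_star)[0]       # from grad f + lam1 * (1, 1) = 0
print(lam2, lam3, lam3_at_x1, alpha, x2, x_star, lam1)
```

The final multiplier is non-negative, so under the stated assumption the algorithm stops at (1.5, 2.5).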
8.2 Sequential Quadratic Programming
Let us go back to the general problem
minimize f (x)
subject to gi (x) = 0, i = 1, . . . , me
gi (x) ≤ 0, i = me + 1, . . . , m.
The SQP method solves this problem by solving a sequence of QP problems

where the Lagrangian function l is approximated by a quadratic function
and the constraints are approximated by a linear hyper-space.
8.3 Newton’s Method
Consider the case of equality constraints only. At each iteration the problem
\[ \begin{array}{ll} \mbox{minimize} & (1/2)d'\ddot l_x(x_n, \lambda_n)d + \dot l(x_n, \lambda_n)'d \\ \mbox{subject to} & \dot g_i(x_n)'d + g_i(x_n) = 0, \quad i = 1, \ldots, m \end{array} \]
is solved. It can be shown that the rate of convergence of this algorithm
is 2 (at least) if the starting point $(x_0, \lambda_0)$ is close enough to the solution
$(x^*, \lambda^*)$. A disadvantage of this approach is the need to compute the Hessian.
8.4 Structured Methods
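These methods are modifications of the basic Newton method, with
approximations replacing the Hessian. One can rewrite the solution to the
Newton step in the form
\[ \left( \begin{array}{c} x_{n+1} \\ \lambda_{n+1} \end{array} \right) = \left( \begin{array}{c} x_n \\ \lambda_n \end{array} \right) - \left( \begin{array}{cc} \ddot l_n & \dot g_n' \\ \dot g_n & 0 \end{array} \right)^{-1} \left( \begin{array}{c} \dot l_n \\ g_n \end{array} \right). \]
Instead, one can use the formula
\[ \left( \begin{array}{c} x_{n+1} \\ \lambda_{n+1} \end{array} \right) = \left( \begin{array}{c} x_n \\ \lambda_n \end{array} \right) - \alpha_n \left( \begin{array}{cc} H_n & \dot g_n' \\ \dot g_n & 0 \end{array} \right)^{-1} \left( \begin{array}{c} \dot l_n \\ g_n \end{array} \right), \]
with $\alpha_n$ and $H_n$ appropriately chosen.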
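The pure Newton iteration (α_n = 1, exact Hessian of the Lagrangian) can be sketched on a small problem. The example below is plain Python; the test problem, minimize x + y subject to x^2 + y^2 - 2 = 0, and the starting point are assumptions made for this illustration. Its solution is (-1, -1) with multiplier 1/2.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def newton_lagrange(x, y, lam, iterations=20):
    """Newton iteration on the first-order conditions of
    minimize x + y subject to x^2 + y^2 - 2 = 0 (illustrative problem)."""
    for _ in range(iterations):
        # Gradient of the Lagrangian l = x + y + lam*(x^2 + y^2 - 2), then the constraint.
        F = [1 + 2 * lam * x, 1 + 2 * lam * y, x * x + y * y - 2]
        # Block matrix [[l_xx, g_x'], [g_x, 0]].
        J = [[2 * lam, 0.0, 2 * x],
             [0.0, 2 * lam, 2 * y],
             [2 * x, 2 * y, 0.0]]
        step = solve(J, F)
        x, y, lam = x - step[0], y - step[1], lam - step[2]
    return x, y, lam

x, y, lam = newton_lagrange(-1.2, -0.8, 0.6)
print(x, y, lam)
```

The starting point is deliberately close to the solution; as noted above, the fast local convergence is only guaranteed near the solution.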
8.5 Merit function
In order to choose $\alpha_n$ and to assure that the algorithm converges,
a merit function is associated with the problem, such that a solution of
the constrained problem is a (local) minimum of the merit function. The
algorithm should be descending with respect to the merit function.
Consider, for example, the problem with inequality constraints only:
\[ \begin{array}{ll} \mbox{minimize} & f(x) \\ \mbox{subject to} & g_i(x) \le 0, \quad i = 1, \ldots, m. \end{array} \]
The absolute-value merit function is given by
\[ Z(x) = f(x) + c\sum_{i=1}^m g_i(x)^+, \]
where $g^+ = \max\{0, g\}$. The parameter $\alpha$ is chosen by minimizing the merit
function in the direction chosen by the algorithm.
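To make the role of the merit function concrete, here is a small sketch in plain Python. It uses the objective and constraints of Example 7.4; the penalty constant c = 2, the starting point (2, 2), the direction (-1, 0) and the grid search over α are all assumptions chosen for this illustration, not the line search used by any toolbox.

```python
def f(p):
    x, y = p
    return 2 * x ** 2 + 2 * x * y + y ** 2 - 10 * x - 10 * y

def g(p):
    x, y = p
    return (x ** 2 + y ** 2 - 5, 3 * x + y - 6)

def merit(p, c):
    """Absolute-value merit function Z(x) = f(x) + c * sum_i max(0, g_i(x))."""
    return f(p) + c * sum(max(0.0, gi) for gi in g(p))

c = 2.0
start = (2.0, 2.0)            # infeasible point: both constraints are violated
direction = (-1.0, 0.0)       # hypothetical search direction

# Crude line search: evaluate Z on a grid of step sizes in [0, 1].
alphas = [k / 100 for k in range(101)]
best_alpha = min(
    alphas,
    key=lambda a: merit((start[0] + a * direction[0], start[1] + a * direction[1]), c),
)
print(merit(start, c), best_alpha, merit((1.0, 2.0), c))
```

At a feasible point the merit function coincides with f; at the infeasible starting point it is larger than f by c times the total constraint violation.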
Theorem 8.1 If H is positive-definite and if c > max1≤i≤m λi then the

algorithm is descending with respect to the absolute-value merit function.
Proof: The optimization problem is solved by solving sequentially problems
of the form:
\[ \begin{array}{ll} \mbox{minimize} & (1/2)d'Hd + \dot f(x)'d \\ \mbox{subject to} & \dot g_j(x)'d + g_j(x) \le 0, \quad j = 1, \ldots, m. \end{array} \]
The necessary conditions here are
\[ Hd + \dot f(x) + \sum_{j=1}^m \lambda_j\dot g_j(x) = 0 \qquad (4) \]
\[ \dot g_j(x)'d + g_j(x) \le 0 \qquad (5) \]
\[ \lambda_j[\dot g_j(x)'d + g_j(x)] = 0 \qquad (6) \]
\[ \lambda_j \ge 0. \qquad (7) \]
Let $J(x) = \{j : g_j(x) > 0\}$. Now, for $\alpha > 0$,
\[ Z(x + \alpha d) = f(x + \alpha d) + c\sum_{j=1}^m g_j(x + \alpha d)^+ \]
\[ = f(x) + \alpha\dot f(x)'d + c\sum_{j=1}^m [g_j(x) + \alpha\dot g_j(x)'d]^+ + o(\alpha) \]
\[ = f(x) + \alpha\dot f(x)'d + c\sum_{j=1}^m g_j(x)^+ + \alpha c\sum_{j \in J(x)} \dot g_j(x)'d + o(\alpha) \]
\[ = Z(x) + \alpha\dot f(x)'d + \alpha c\sum_{j \in J(x)} \dot g_j(x)'d + o(\alpha). \]
Here we applied condition (5) in order to infer that $\dot g_j(x)'d \le 0$ if $g_j(x) = 0$.
Using this condition again we get
\[ c\sum_{j \in J(x)} \dot g_j(x)'d \le c\sum_{j \in J(x)} -g_j(x) = -c\sum_{j=1}^m g_j(x)^+. \qquad (8) \]
Using (4) we can infer that
\[ \dot f(x)'d = -d'Hd - \sum_{j=1}^m \lambda_j\dot g_j(x)'d, \]
which by using condition (6) leads to
\[ \dot f(x)'d = -d'Hd + \sum_{j=1}^m \lambda_jg_j(x) \le -d'Hd + \sum_{j=1}^m \lambda_jg_j(x)^+ \le -d'Hd + \max_j\lambda_j\sum_{j=1}^m g_j(x)^+. \qquad (9) \]
Summarizing what we got thus far leads to
\[ Z(x + \alpha d) \le Z(x) + \alpha\Big\{-d'Hd - [c - \max_j\lambda_j]\sum_{j=1}^m g_j(x)^+\Big\} + o(\alpha). \]
The conclusion follows from the assumption that H is positive definite and
the assumption $c \ge \max_j\lambda_j$.
8.6 Enlargement of the feasible region
Consider, again, the problem with inequality constraints only:
\[ \begin{array}{ll} \mbox{minimize} & f(x) \\ \mbox{subject to} & g_i(x) \le 0, \quad i = 1, \ldots, m, \end{array} \]
and its solution with a structured SQP algorithm.
Assume that at the current iteration $x_n = x$ and $H_n = H$. Then one
wants to consider the QP problem:
\[ \begin{array}{ll} \mbox{minimize} & (1/2)d'Hd + \dot f(x)'d \\ \mbox{subject to} & \dot g_i(x)'d + g_i(x) \le 0, \quad i = 1, \ldots, m. \end{array} \]
However, it is possible that this problem is infeasible at the point x, in
which case the original method breaks down. One can consider instead the
problem
\[ \begin{array}{ll} \mbox{minimize} & (1/2)d'Hd + \dot f(x)'d + c\sum_{i=1}^m \xi_i \\ \mbox{subject to} & \dot g_i(x)'d + g_i(x) \le \xi_i, \quad i = 1, \ldots, m, \\ & -\xi_i \le 0, \quad i = 1, \ldots, m, \end{array} \]
which is always feasible.
Theorem 8.2 If H is positive-definite and if $c > \max_{1 \le i \le m}\lambda_i$ then the
algorithm is descending with respect to the absolute-value merit function.
8.7 The Han–Powell method
The Han-Powell method is what is used by Matlab for SQP. It is a
quasi-Newton method, where $H_n$ is updated using the BFGS approach:
\[ H_{n+1} = H_n + \frac{(\Delta l)(\Delta l)'}{(\Delta x)'(\Delta l)} - \frac{H_n(\Delta x)(\Delta x)'H_n}{(\Delta x)'H_n(\Delta x)}, \]
where
\[ \Delta x = x_{n+1} - x_n, \qquad \Delta l = \dot l(x_{n+1}, \lambda_{n+1}) - \dot l(x_n, \lambda_n). \]
It can be shown that $H_{n+1}$ is positive-definite if $H_n$ is and if $(\Delta x)'(\Delta l) > 0$.
8.8 Constrained minimization in Matlab
Write the M-file fun4.m:
function [f,g]=fun4(x)
f = exp(x(1))*(4*x(1)^2+2*x(2)^2+4*x(1)*x(2)+2*x(2)+1);
g(1,1) = 1.5 + x(1)*x(2) - x(1) - x(2);
g(2,1) = -x(1)*x(2) - 10;
and run the Matlab session:
>> x0 = [-1,1];
>> x = constr('fun4', x0)
x =
-9.5474 1.0474
>> [f,g] = fun4(x)
f =
0.0236
g =
1.0e-014 *
0.1110
-0.1776
>> options = [];
>> vlb = [0,0];
>> vlu = [];
>> x = constr('fun4', x0, options, vlb, vlu)
x =
0 1.5000
>> [f,g] = fun4(x)
f =
8.5000
g =
0
-10
Write the M-file grudf4.m:
function [df,dg] = grudf4(x)
f = exp(x(1))*(4*x(1)^2+2*x(2)^2+4*x(1)*x(2)+2*x(2)+1);
df = [f + exp(x(1))*(8*x(1) + 4*x(2)), exp(x(1))*(4*x(1) + 4*x(2) + 2)];
dg = [x(2) - 1, -x(2); x(1) - 1, -x(1)];
and run the session:
>> vlb = [];
>> x = constr('fun4', x0, options, vlb, vlu, 'grudf4')
x =
-9.5474 1.0474
8.9 Constrained minimization in Matlab (using the function fmincon)
Let us consider the problem:
\[ \begin{array}{ll} \mbox{minimize} & f(x) = e^{x_1}(4x_1^2 + 2x_2^2 + 4x_1x_2 + 2x_2 + 1) \\ \mbox{subject to} & 1.5 + x_1x_2 - x_1 - x_2 \le 0 \\ & -x_1x_2 \le 10. \end{array} \]
First create the M-files obj5.m
function f = obj5(x)
f=exp(x(1)) * (4*x(1)^2 + 2*x(2)^2 + 4*x(1)*x(2) + 2*x(2) + 1);
and con5.m:
function [c, ceq] = con5(x)

c = [1.5 + x(1)*x(2) - x(1) - x(2);
-x(1)*x(2) - 10];
ceq = [];
In the command window do:
>> x0 = [-1 1];
>> options = optimset('LargeScale','off','Display','iter');
>> [x,fval,exitflag,output] = fmincon('obj5',x0,[],[],[],[],[],[],...
'con5',options);
max Directional
Iter F-count f(x) constraint Step-size derivative Procedure
1 3 1.8394 0.5 1 0.0486
2 7 1.85127 -0.09197 1 -0.556 Hessian modi
3 11 0.300167 9.33 1 0.17
4 15 0.529834 0.9209 1 -0.965
5 20 0.186965 -1.517 0.5 -0.168
6 24 0.0729085 0.3313 1 -0.0518
7 28 0.0353322 -0.03303 1 -0.0142
8 32 0.0235566 0.003184 1 -6.22e-006
9 36 0.0235504 9.032e-008 1 1.76e-010 Hessian modi
Optimization terminated successfully:
Search direction less than 2*options.TolX and
maximum constraint violation is less than options.TolCon
Active Constraints:
1
2
>> x
x =
-9.5474 1.0474
>> fval
fval =
0.0236
>> [c, ceq] = con5(x)
c =
1.0e-014 *
0.1110
-0.1776
ceq =
[]
>> output.funcCount
ans =
38
The above problem can be solved more efficiently and accurately if gra-
dients are supplied by the user. Create the M-files:
function [f, G] = obj5grad(x)
f = exp(x(1)) * (4*x(1)^2 + 2*x(2)^2 + 4*x(1)*x(2) + 2*x(2) + 1);
% gradient: d/dx1 [exp(x1)*p(x)] = f + exp(x1)*dp/dx1, d/dx2 = exp(x1)*dp/dx2
G = [ f + exp(x(1)) * (8*x(1) + 4*x(2));
exp(x(1))*(4*x(1)+4*x(2)+2)];
and
function [c, ceq, dc, dceq] = con5grad(x)

c = [1.5 + x(1)*x(2) - x(1) - x(2);
-x(1)*x(2) - 10];
dc = [x(2)-1, -x(2);
x(1)-1, -x(1)];
ceq = [];
dceq = [];
In the command window:
>> x0
x0 =
-1 1
>> options = optimset('LargeScale','off');
>> options = optimset(options,'GradObj','on','GradConstr','on');
>> [x,fval,exitflag,output] = fmincon('obj5grad',x0,[],[],[],[],[],[],...
'con5grad',options);
Search direction less than 2*options.TolX and
maximum constraint violation is less than options.TolCon
Active Constraints:
1
2
>> x
x =
-9.5474 1.0474
>> fval
fval =
0.0236
>> [c, ceq] = con5grad(x)
c =
1.0e-014 *
0.1110
-0.1776
ceq =
[]
>> output.funcCount
ans =
20
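The reported solution can be cross-checked by hand, because both constraints are active there. The sketch below is plain Python (an independent check, not a replacement for fmincon): with both constraints holding as equalities, x1 x2 = -10 and x1 + x2 = -8.5, so the two coordinates are the roots of a quadratic.

```python
import math

def objective(x1, x2):
    return math.exp(x1) * (4 * x1 ** 2 + 2 * x2 ** 2 + 4 * x1 * x2 + 2 * x2 + 1)

def constraints(x1, x2):
    return (1.5 + x1 * x2 - x1 - x2, -x1 * x2 - 10)

# Both constraints active: x1*x2 = -10 and x1 + x2 = 1.5 + x1*x2 = -8.5,
# so x1 and x2 are the two roots of t^2 + 8.5 t - 10 = 0.
root = math.sqrt(8.5 ** 2 + 40.0)
x1, x2 = (-8.5 - root) / 2, (-8.5 + root) / 2
fval = objective(x1, x2)
cvals = constraints(x1, x2)
print(x1, x2, fval, cvals)
```

The values agree with the session output above to the displayed precision.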
8.10 Homework
1. Read the help files of the functions constr and fmincon. Redo the ex-
ample given in class with the function fmincon. Compare the properties
of the two functions in QP problems of different magnitude.
2. Let H be a positive-definite matrix and assume that, throughout some
compact set, the quadratic program has a unique solution, such that
the Lagrange multipliers are not larger than c. Let $\{x_n : n \ge 0\}$ be
a sequence generated by the recursion $x_{n+1} = x_n + \alpha_nd_n$, where $d_n$ is
the direction found by solving the QP centered at $x_n$ with H fixed,
and $\alpha_n$ is determined by minimization of the function Z. Show that any
limit point of $\{x_n\}$ satisfies the first order necessary conditions for the
constrained minimization problem.
3. Extend the result in 2 to the case where $H = H_n$ changes, but
$\epsilon\|x\|^2 \le x'H_nx \le c\|x\|^2$ for some $0 < \epsilon < c < \infty$ and for all x and n.
9 Large scale problems
9.1 Basic issues
Large scale problems require special techniques to deal with memory
problems and numeric complications.
Consider the issue in the context of unconstrained minimization of a
function f(x). The basic approaches to minimization of a function can be
described by the following general heuristic: Define a neighborhood N of
the current x. Approximate the function f by a function q over N. The
solution to the minimization problem in q provides a candidate for a new x,
x + d. This candidate is adopted if f(x + d) < f(x). (For example,
$q(d) = f(x) + \dot f(x)'d + (1/2)d'\ddot f(x)d$ and $N = \{d : \|Dd\| \le \epsilon\}$, for D a
diagonal scaling matrix.)
However, when the dimension of the problem is large this approach is
not feasible. The alternative which is used in MATLAB is to choose a
two-dimensional subspace S and to restrict the analysis to that subspace.
The subspace is spanned by the direction of the gradient and either
the Newton direction, i.e. the solution $d_2$ of
\[ \ddot f(x)d_2 = -\dot f(x), \]
or a direction of negative curvature
\[ d_2'\ddot f(x)d_2 < 0 \]
(in order to force convergence).
Large scale algorithms work better if they are provided with the
Hessian. The main memory limitation is the storage of that matrix. In
many applications, however, many of the entries in the matrix are actually
zeros. Room can be saved by storing the matrix in the sparse matrix form.
Hence,
>> sparse(1:5,1:5,1)
ans =
(1,1) 1
(2,2) 1
(3,3) 1
(4,4) 1
(5,5) 1
is a compact way to store the 5 × 5 identity matrix.
9.2 Minimization with no constraints. Hessian provided
Let us consider the function which was analyzed as part of the first project:
\[ f(x) = \sum_{i=1}^{n-1}\left[ (x_i^2)^{(x_{i+1}^2+1)} + (x_{i+1}^2)^{(x_i^2+1)} \right]. \]
Let us first minimize this function with the sparse tridiagonal Hessian
matrix. Start with the M-file brownfgh.m:
function [f,g,H] = brownfgh(x)
% Evaluate the function.

n=length(x); y=zeros(n,1);
i=1:(n-1);
y(i)=(x(i).^2).^(x(i+1).^2+1)+(x(i+1).^2).^(x(i).^2+1);
f=sum(y);
%
% Evaluate the gradient.
if nargout > 1
i=1:(n-1); g = zeros(n,1);
g(i)= 2*(x(i+1).^2+1).*x(i).*((x(i).^2).^(x(i+1).^2))+...
2*x(i).*((x(i+1).^2).^(x(i).^2+1)).*log(x(i+1).^2);
g(i+1)=g(i+1)+...
2*x(i+1).*((x(i).^2).^(x(i+1).^2+1)).*log(x(i).^2)+...
2*(x(i).^2+1).*x(i+1).*((x(i+1).^2).^(x(i).^2));
end
%
% Evaluate the (sparse, symmetric) Hessian matrix
if nargout > 2
v=zeros(n,1);
i=1:(n-1);
v(i)=2*(x(i+1).^2+1).*((x(i).^2).^(x(i+1).^2))+...
4*(x(i+1).^2+1).*(x(i+1).^2).*(x(i).^2).*((x(i).^2).^((x(i+1).^2)-1))+...
2*((x(i+1).^2).^(x(i).^2+1)).*(log(x(i+1).^2));
v(i)=v(i)+4*(x(i).^2).*((x(i+1).^2).^(x(i).^2+1)).*((log(x(i+1).^2)).^2);
v(i+1)=v(i+1)+...
2*(x(i).^2).^(x(i+1).^2+1).*(log(x(i).^2))+...
4*(x(i+1).^2).*((x(i).^2).^(x(i+1).^2+1)).*((log(x(i).^2)).^2)+...
2*(x(i).^2+1).*((x(i+1).^2).^(x(i).^2));
v(i+1)=v(i+1)+4*(x(i).^2+1).*(x(i+1).^2).*(x(i).^2).*((x(i+1).^2).^(x(i).^2-1));
v0=v;
v=zeros(n-1,1);
v(i)=4*x(i+1).*x(i).*((x(i).^2).^(x(i+1).^2))+...
4*x(i+1).*(x(i+1).^2+1).*x(i).*((x(i).^2).^(x(i+1).^2)).*log(x(i).^2);
v(i)=v(i)+ 4*x(i+1).*x(i).*((x(i+1).^2).^(x(i).^2)).*log(x(i+1).^2);
v(i)=v(i)+4*x(i).*((x(i+1).^2).^(x(i).^2)).*x(i+1);
v1=v;
i=[(1:n)';(1:(n-1))'];
j=[(1:n)';(2:n)'];
s=[v0;2*v1];
H=sparse(i,j,s,n,n);
H=(H+H')/2;
end
To better understand the structure of H, note that
>> n = 5;
>> v0=ones(n,1);
>> v1=ones(n-1,1);
>> s=[v0;2*v1];
>> i=[(1:n)';(1:(n-1))'];
>> j=[(1:n)';(2:n)'];
>> H=sparse(i,j,s,n,n);
>> full(H)
ans =
1 2 0 0 0
0 1 2 0 0
0 0 1 2 0
0 0 0 1 2
0 0 0 0 1
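The same construction can be imitated outside MATLAB to see what the triplet form stores. The sketch below is plain Python with a dictionary keyed by (row, column); the helper names are made up for this illustration. It builds the upper bidiagonal matrix shown above and then applies the symmetrization step (H + H')/2 used in brownfgh.m.

```python
def sparse_from_triplets(rows, cols, vals):
    """Store only the nonzero entries, keyed by (row, col); duplicates are summed."""
    S = {}
    for r, c, v in zip(rows, cols, vals):
        S[(r, c)] = S.get((r, c), 0.0) + v
    return S

def symmetrize(S):
    """Return (S + S') / 2 in the same dictionary storage."""
    out = {}
    for (r, c), v in S.items():
        out[(r, c)] = out.get((r, c), 0.0) + v / 2
        out[(c, r)] = out.get((c, r), 0.0) + v / 2
    return out

n = 5
rows = list(range(1, n + 1)) + list(range(1, n))       # i = [(1:n)'; (1:(n-1))']
cols = list(range(1, n + 1)) + list(range(2, n + 1))   # j = [(1:n)'; (2:n)']
vals = [1.0] * n + [2.0] * (n - 1)                     # s = [v0; 2*v1]

U = sparse_from_triplets(rows, cols, vals)
H = symmetrize(U)
full = [[H.get((r, c), 0.0) for c in range(1, n + 1)] for r in range(1, n + 1)]
for row in full:
    print(row)
```

The factor 2 on the off-diagonal values is what makes the symmetrized matrix carry the off-diagonal entries v1 once above and once below the diagonal. Only 3n - 2 numbers are stored instead of n^2.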
Continue with the real thing:
>> n = 1000;
>> xstart = -ones(n,1);
>> xstart(2:2:n,1) = 1;
>> options = optimset('GradObj', 'on', 'Hessian', 'on');
>> [x, fval, exitflag, output] = fminunc('brownfgh',xstart,options);
First-order optimality less than OPTIONS.TolFun, and no negative/zero
curvature detected
>> exitflag
exitflag =
1
>> fval
fval =
2.8709e-017
>> output.iterations
ans =
8
9.3 Minimization with no constraints. Hessian not provided
Now, let us redo the problem, but without the Hessian. The algorithm will
approximate it using sparse finite differences. Note that the gradient
must be provided in large-scale problems. Start with the M-file brownfg.m:
function [f,g] = brownfg(x)

% Evaluate the function.
n=length(x); y=zeros(n,1);
i=1:(n-1);
y(i)=(x(i).^2).^(x(i+1,1).^2+1)+(x(i+1).^2).^(x(i).^2+1);
f=sum(y);
%
% Evaluate the gradient if nargout > 1
if nargout > 1
i=1:(n-1); g = zeros(n,1);
g(i)= 2*(x(i+1).^2+1).*x(i).*((x(i).^2).^(x(i+1).^2))+...
2*x(i).*((x(i+1).^2).^(x(i).^2+1)).*log(x(i+1).^2);
g(i+1)=g(i+1)+...
2*x(i+1).*((x(i).^2).^(x(i+1).^2+1)).*log(x(i).^2)+...
2*(x(i).^2+1).*x(i+1).*((x(i+1).^2).^(x(i).^2));
end
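Since the large-scale method relies on the user-supplied gradient, it is worth verifying the formulas. The sketch below is a plain-Python transcription of the function value and gradient in brownfg.m, compared against central finite differences; the test point and the step size are assumptions chosen for this illustration (the coordinates are kept away from zero because the gradient involves log(x(i)^2)).

```python
import math

def brown(x):
    """Value and gradient of sum_i (x_i^2)^(x_{i+1}^2+1) + (x_{i+1}^2)^(x_i^2+1)."""
    n = len(x)
    f = 0.0
    g = [0.0] * n
    for i in range(n - 1):
        a, b = x[i] ** 2, x[i + 1] ** 2
        f += a ** (b + 1) + b ** (a + 1)
        # Partial derivative of the i-th term with respect to x_i ...
        g[i] += 2 * (b + 1) * x[i] * a ** b + 2 * x[i] * b ** (a + 1) * math.log(b)
        # ... and with respect to x_{i+1}.
        g[i + 1] += (2 * x[i + 1] * a ** (b + 1) * math.log(a)
                     + 2 * (a + 1) * x[i + 1] * b ** a)
    return f, g

def numeric_gradient(fun, x, h=1e-6):
    grad = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        grad.append((fun(xp)[0] - fun(xm)[0]) / (2 * h))
    return grad

x_test = [0.5, -0.7, 0.9, -1.1]          # arbitrary test point (illustrative)
f_test, g_test = brown(x_test)
g_num = numeric_gradient(brown, x_test)
max_error = max(abs(a - b) for a, b in zip(g_test, g_num))
f_start = brown([-1.0, 1.0, -1.0, 1.0])[0]   # same pattern as xstart, with n = 4
print(f_test, max_error, f_start)
```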
The sparsity structure of H must be predetermined and provided.
>> i=[(1:n)';(1:(n-1))'];
>> j=[(1:n)';(2:n)'];
>> v0=ones(n,1);
>> v1=ones(n-1,1);
>> s=[v0;v1];
>> H=sparse(i,j,s,n,n);
>> Hstr = (H + H')/2;
>> spy(Hstr);
Back to the optimization problem:
>> options = optimset('GradObj', 'on', 'HessPattern', Hstr);
>> [x, fval, exitflag, output] = fminunc('brownfg',xstart,options);
First-order optimality less than OPTIONS.TolFun, and no negative/zero
curvature detected
>> exitflag
exitflag =
1
>> fval
fval =
7.4739e-017
>> output.iterations
ans =
8
9.4 Minimization with constraints.
The large-scale method for fmincon can handle equality constraints if no
other constraints exist. Let us add to the problem 100 linear equality
constraints of the form Ax = b, where A is a 100 × 1000 matrix and b is a
100-vector. They are given in the MAT-file browneq.mat.
>> load browneq

>> spy(Aeq)
>> condest(Aeq*Aeq')
ans =
2.9310e+006
The function condest computes a 1-norm condition number estimate. If the
number is large, it indicates that the matrix is close to being singular.
>> options = optimset('GradObj', 'on', 'Hessian','on');
>> [x, fval, exitflag, output] = fmincon('brownfgh',xstart,...
[], [], Aeq, beq, [], [], [], options);
Relative function value changing by less than OPTIONS.TolFun
>> fval
fval =
205.9313
ans =
22
9.5 Project 2
The goal is to minimize the function
\[ f(x) = 1 + \sum_{i=1}^n [(3 - 2x_i)x_i - x_{i-1} - x_{i+1} + 1]^4 + \sum_{i=1}^{n/2} [x_i + x_{i+n/2}]^4, \]
for an n which is a multiple of 4, with $x_0 = x_{n+1} = 0$. Solve this both as a
medium-scale problem (say, n = 8 or n = 16) and as a large-scale problem
(say, n = 800). Try to find, on your machine, the size at which the problem
can no longer be solved as a medium-scale problem and large-scale algorithms
must be applied.
10 Penalty and Barrier Methods
The basic approach in these methods is to solve a sequence of unconstrained

problems. The solutions of these problems converge to the solution of the
original problem.
10.1 Penalty method
Consider the problem
minimize f (x)
subject to gi (x) ≤ 0, i = 1, . . . , m.
Choose a continuous penalty function which is zero inside the feasible set
and positive outside of it. For example,
\[ P(x) = (1/2)\sum_{i=1}^m \max\{0, g_i(x)\}^2. \]
Minimize, for each c, the function
\[ q(c, x) = f(x) + cP(x), \]
and denote the minimizer by $x^*(c)$. When c is increased it is expected that
the solution $x^*(c)$ converges to $x^*$.
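Lemma 10.1 Let $c_{n+1} > c_n$ and write $x_n^* = x^*(c_n)$. Then
\[ q(c_n, x_n^*) \le q(c_{n+1}, x_{n+1}^*) \qquad (10) \]
\[ P(x_n^*) \ge P(x_{n+1}^*) \qquad (11) \]
\[ f(x_n^*) \le f(x_{n+1}^*). \qquad (12) \]
Proof:
\[ q(c_{n+1}, x_{n+1}^*) = f(x_{n+1}^*) + c_{n+1}P(x_{n+1}^*) \ge f(x_{n+1}^*) + c_nP(x_{n+1}^*) \ge f(x_n^*) + c_nP(x_n^*) = q(c_n, x_n^*), \]
which proves (10). Also, since $x_n^*$ minimizes $q(c_n, \cdot)$ and $x_{n+1}^*$ minimizes $q(c_{n+1}, \cdot)$,
\[ f(x_n^*) + c_nP(x_n^*) \le f(x_{n+1}^*) + c_nP(x_{n+1}^*) \]
\[ f(x_{n+1}^*) + c_{n+1}P(x_{n+1}^*) \le f(x_n^*) + c_{n+1}P(x_n^*). \]
Adding these two inequalities yields
\[ (c_{n+1} - c_n)P(x_{n+1}^*) \le (c_{n+1} - c_n)P(x_n^*), \]
which proves (11). Finally,
\[ f(x_{n+1}^*) + c_nP(x_{n+1}^*) \ge f(x_n^*) + c_nP(x_n^*), \]
which, together with (11), proves (12).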
Lemma 10.2 Let $x^*$ be the solution of the original problem. Then for each n
\[ f(x^*) \ge q(c_n, x_n^*) \ge f(x_n^*). \]
Proof:
\[ f(x^*) = f(x^*) + c_nP(x^*) \ge f(x_n^*) + c_nP(x_n^*) \ge f(x_n^*). \]
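The two lemmas can be observed on a one-dimensional problem. The sketch below is plain Python; the problem, minimize x^2 subject to 1 - x ≤ 0, the sequence of values of c, and the golden-section helper are assumptions chosen for this illustration. Here the penalized minimizer is c/(2 + c) in closed form, and the constrained solution is x* = 1 with f(x*) = 1.

```python
def golden_section_min(fun, lo, hi, iters=200):
    """Minimize a unimodal function on [lo, hi] by golden-section search."""
    r = (5 ** 0.5 - 1) / 2
    a, b = lo, hi
    for _ in range(iters):
        c, d = b - r * (b - a), a + r * (b - a)
        if fun(c) < fun(d):
            b = d
        else:
            a = c
    return 0.5 * (a + b)

f = lambda x: x * x                              # objective
P = lambda x: 0.5 * max(0.0, 1.0 - x) ** 2       # penalty for g(x) = 1 - x <= 0

results = []
for c in [1.0, 10.0, 100.0, 1000.0]:
    x_c = golden_section_min(lambda x: f(x) + c * P(x), -5.0, 5.0)
    results.append((c, x_c, f(x_c) + c * P(x_c), P(x_c), f(x_c)))

for c, x_c, q_c, P_c, f_c in results:
    print(c, x_c, q_c, P_c, f_c)
```

In the printed table q and f increase with c while P decreases, and both q and f stay below f(x*) = 1, as the lemmas state. The minimizers approach the feasible set from outside.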
Theorem 10.1 Let $\{x_n^*\}$ be a sequence generated by the penalty method with $c_n \to \infty$.
Then any limit point of the sequence is a solution to the original problem.
Proof: Suppose $\{x_n^* : n \in N\}$ is a convergent subsequence with limit $\bar{x}$. Then, by the continuity of f,
$$\lim_{n \in N} f(x_n^*) = f(\bar{x}). \qquad (13)$$
Let $f^*$ be the optimal value of the original problem. Then, according
to Lemmas 10.1 and 10.2, the sequence of values $q(c_n, x_n^*)$ is nondecreasing
and bounded above by $f^*$. Thus
$$\lim_{n \in N} q(c_n, x_n^*) = q^* \le f^*. \qquad (14)$$
Subtracting (13) from (14) yields
$$\lim_{n \in N} c_n P(x_n^*) = q^* - f(\bar{x}). \qquad (15)$$
Since $P(x_n^*) \ge 0$ and $c_n \to \infty$, this implies $\lim_{n \in N} P(x_n^*) = 0$. By the
continuity of P, $P(\bar{x}) = 0$, thus $\bar{x}$ is feasible for the problem.
To show that $\bar{x}$ is optimal, note that from Lemma 10.2, $f(x_n^*) \le f^*$, and hence
$$f(\bar{x}) = \lim_{n \in N} f(x_n^*) \le f^*.$$
Since $\bar{x}$ is feasible, also $f(\bar{x}) \ge f^*$, so $f(\bar{x}) = f^*$.
10.2 Barrier method
Consider again the problem
$$\text{minimize } f(x) \quad \text{subject to } g_i(x) \le 0, \quad i = 1, \dots, m,$$
and assume that the feasible set is the closure of its interior. A barrier
function is a continuous and positive function over the interior of the feasible set which
goes to ∞ as x approaches the boundary. For example,
$$B(x) = -\sum_{i=1}^{m} \frac{1}{g_i(x)}.$$
The minus sign makes B positive, since $g_i(x) < 0$ in the interior.
For each $c > 0$, solve the problem
$$\text{minimize } r(c, x) = f(x) + \frac{1}{c} B(x) \quad \text{subject to } g_i(x) < 0, \quad i = 1, \dots, m.$$
Note that a method of unconstrained minimization can be used, because
the solution lies in the interior of the feasible set. As c increases, the
solution $x^*(c)$ is expected to converge to $x^*$.
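A minimal MATLAB sketch of the barrier scheme follows. As before, the test problem (minimize $x_1^2 + x_2^2$ subject to $1 - x_1 - x_2 \le 0$), the schedule of values of c, the tolerances and the use of fminsearch are assumptions made for illustration. The barrier is the inverse barrier, written so that it evaluates to Inf at and outside the boundary; this is one simple way to keep an unconstrained search inside the feasible region, not the only one.

```matlab
% Barrier-method sketch (illustrative assumptions throughout):
%   minimize x1^2 + x2^2   subject to   g(x) = 1 - x1 - x2 <= 0.
f = @(x) x(1)^2 + x(2)^2;            % objective
g = @(x) 1 - x(1) - x(2);            % single constraint, interior when g(x) < 0
% Inverse barrier B(x) = -1/g(x) in the interior. Writing it as
% 1/max(0, 0 - g(x)) returns Inf whenever g(x) >= 0, so trial points
% outside the interior are always rejected by the search.
B = @(x) 1 / max(0, 0 - g(x));

x = [1; 1];                          % strictly feasible starting point, g(x) = -1
opts = optimset('TolX', 1e-8, 'TolFun', 1e-8, ...
                'MaxFunEvals', 5000, 'MaxIter', 5000, 'Display', 'off');
cvals = [1 10 100 1000 10000];       % increasing c, so the weight 1/c shrinks
for k = 1:numel(cvals)
    c = cvals(k);
    r = @(x) f(x) + (1 / c) * B(x);  % barrier subproblem r(c, x)
    x = fminsearch(r, x, opts);      % warm start from the previous interior point
    fprintf('c = %6g   x = (%.4f, %.4f)   f = %.4f   g = %.2e\n', ...
            c, x(1), x(2), f(x), g(x));
end
```

In contrast to the penalty method, every iterate here is strictly feasible, and the iterates should approach the solution $(1/2, 1/2)$ from inside the feasible set. For this test problem the distance to the boundary shrinks roughly like $1/\sqrt{c}$, so convergence in c is slower than in the penalty sketch.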
Theorem 10.2 Let $\{x_n\}$ be a sequence generated by the barrier method.
Then any limit point of the sequence is a solution to the original problem.