$35
1 Introduction
In many situations, it is necessary to optimize a given function, i.e., to minimize or maximize it. Most machine learning methods are based on optimizing a function that measures the performance of the system that we want to train.
This function is generically called objective function, because it indirectly defines the objective of the training. Frequently, this function measures how costly are the errors made by the system. In that case, the function is called cost function, and the purpose of training is to minimize it. Since this is the most common case, in this assignment we’ll study function minimization. However, all the conclusions can be applied, with the appropriate changes, to the case of function maximization.
In most cases of practical interest, the function that we want to optimize is very complex. Therefore, solving the system of equations that is obtained by setting to zero the partial derivatives of the function with respect to all variables, is not practicable. In fact, these equations are usually highly nonlinear, and the number of variables is often very large, on the order of hundreds, thousands, or even more. In those cases, iterative optimization methods have to be used.
2 The gradient descent method
One of the simplest and most frequently used optimization methods is the method of gradient descent. Consider a function f (x) that we want to minimize, where the vector x (x1,x2,,xN ) is the set of arguments. The gradient of f, denoted by f , points in the direction that makes f increase fastest. Therefore, it makes sense that, in order to minimize the function, we take steps in the direction of the negative gradient, f , which is the direction that makes f decrease fastest. The gradient method consists of the following steps:
Choose an initial vector x(0) .
Update x iteratively, according to the equation:
𝒙( ) = 𝒙( ) − 𝜂 ∇𝑓[𝒙( )].
The parameter is chosen by the user, and must be positive. It is clear, from the previous equation, that the method consists of a succession of steps, each taken in the direction that f has at the current location. The iterations stop when a given stopping criterion, chosen by the user, is met.
In this lab we’ll study the gradient method in order to gain experience about the way it works. We’ll also study some modifications to this method, which are intended to increase its efficiency.
2.1 Minimization of functions of one variable
We’ll start by studying the gradient method in the simplest situation, which corresponds to minimizing functions of only one variable. Namely, we’ll minimize a function of the form 𝑓(𝑥) = 𝑎 𝑥 /2. The parameter a controls the functions’ curvature.
Use function quad1, which performs the optimization and graphically shows its evolution. The stopping criterion consists in finding a value of f under 0.01 with a maximum of 1000 iterations (these values can be controlled by the input parameters threshold and maxiter).
By default, function quad1, initializes the parameters to the following values:
𝑥( ) = −9, 𝑎 = 1, 𝜂 = 0.1 .
These values can be changed by varying parameters x_o, a and eta.
Parameter anim controls the graphic animation. Setting anim=1 makes the evolution visible as it progresses. This allows us to get a better idea of the evolution, but also makes it take longer. Setting anim=0 shows the plot only at the end of the evolution, which makes it go considerably faster.
Fill the following table with the numbers of iterations needed to optimize the function, for different values of a and (a and eta in quad1 function). If more than 1000 iterations were needed, write “>1000”. If the optimization method diverged, write “div”. In the last two lines, instead of the number of iterations, write the approximate values of that correspond to the fastest optimization and to the threshold of divergence.
a 0.5
a1
a 2
a 5
.001
.01
.03
.1
.3
1
3
Fastest
Divergence threshold
Table 1
What is the relationship between a (the function’s second derivative) and the value of that corresponds to the fastest optimization? Prove that relationship for this class of functions.
What is the relationship between the function’s second derivative and the value of that corresponds to the divergence threshold? Prove that relationship for this class of functions.
Comment on the results from the table.
How many steps correspond to the fastest optimization, for each value of a? Does there exist, for every differentiable function of one variable (even if the function is not quadratic), and for each given starting point 𝐱( ), a value of that optimizes the function in that number of steps? Assume that the function has a global minimum.
2.2. Minimization of functions of more than one variable
When we deal with functions of more than one variable, new phenomena occur in the gradient method. We’ll study them by minimizing functions of two variables.
We’ll start by studying a simple class of functions: quadratic functions of the form 𝑓(𝑥 , 𝑥 ) = (𝑎𝑥 + 𝑥 )⁄2. For these functions, the second derivative with respect to x1 is a, and the second derivative with respect to x2 is 1.
Use function quad2, which performs the optimization and shows the results. The stopping criterion corresponds to finding a value of f smaller than 0.01, with a maximum of 1000 iterations (these values can be controlled by parameters threshold and maxiter).
By default, function quad2 initializes the parameters to the following values:
𝐱( ) = (−9,9), 𝑎 = 2, 𝜂 = 0.1.
These values can be changed through parameters x_o, a and eta.
Observe that, along the trajectory, the steps are always taken in the direction orthogonal to the level curves of f. In fact, the gradient is always orthogonal to these lines.
Fill the first column of the following table for the various values of . Then set a 20(which creates a relatively narrow valley) and fill the second column. Use the same rules for filling the table as in the preceding case. You may find the values for 𝜂that correspond to the fastest optimization and to the threshold of divergence by trial and error.
a 2
a 20
.01
.03
.1
.3
1
3
Fastest
Divergence threshold
Table 2
Comment on the results from the table. What is the qualitative relationship between the valley’s width and the minimum number of iterations that can be achieved? Why?
Is it always possible, for differentiable functions of more than one variable, to achieve, for any given 𝐱( ), the same minimum number of iterations that was reached for functions of one variable?
3. Momentum term
In order to accelerate the optimization in situations in which the function has narrow valleys (situations which are very common in machine learning problems), one of the simplest solutions is to use the so called momentum term. The previous examples showed how the divergence in the gradient descent method is normally oscillatory. The aim of the momentum term is to attenuate the oscillations by using, at each step, a fraction of the previous one. The iterations are described by:
x(n1) x(n) f [x(n)]
x(n1) x(n) x(n1)
or, alternatively, by
x(n1) x(n) (1)f [x(n)]
.
x(n1) x(n) x(n1)
We’ll use this second version.
The parameter should satisfy 0 1. Using 0 corresponds to optimizing without the momentum term. The term 𝛼Δ𝒙( ), in the update equation for Δ𝒙(𝒏 𝟏), attenuates the oscillations and adds a kind of inertia to the process, which explains the name momentum term, given to this term.
The students that are knowledgeable on digital filters, can readily verify that the equation that computes x(n1) corresponds to filtering the gradient with a first order low-pass filter, with a pole at z . This low-pass filtering attenuates rapid oscillations.
Still using the function 𝑓(𝑥 , 𝑥 ) = (𝑎𝑥 + 𝑥 )⁄2, fill the following table, using a 20, and varying the momentum parameter (parameter corresponds to variable alpha). Use the same rules for filling the table as in the preceding cases.
0
.5
.7
.9
.95
.003
.01
.03
.1
.3
1
3
10
Divergence threshold
Table 3 2. Comment on the results from the table.
4. Adaptive step sizes
The previous examples showed how narrow valleys create difficulties in the gradient descent method, and how the momentum term alleviates these problems. However, in complex problems, the optimization can take a long time even when the momentum term is used. Another acceleration technique that is quite effective relies on the use of adaptive step sizes. This technique will not be explained here, given its complexity. Nevertheless, we’ll test its efficiency.
As an example of a function which is difficult to optimize, we’ll use the Rosenbrock function, which is one of the common benchmarks used for testing optimization techniques. This function is given by:
f (x1,x2) (x1 1)2 a(x2 x12)2.
This function has a valley along the parabola x2 x12 , with a minimum at (1,1). The parameter a controls the width of the valley. The original Rosenbrock function uses the value a 100 , which creates a very narrow valley. Initially, we’ll use a 20, which creates a wider valley, so that we can run our tests faster.
Use function rosen, which performs the optimization. The stopping criterion corresponds to having two consecutive iterations with f smaller than 0.001, with a maximum of 1000 iterations.
By default, these function uses the following parameters:
𝑎 = 20, 𝜂 = 0.001, 𝛼 = 0, 𝒙( ) = (−1.5, 1),
Try to find a pair of values for and that leads to a number of iterations below 200 (do it manually, do not use grid search). If, the number of tests is becoming too large, stop and use the best result obtained so far. Write down how many tests you performed in order to find that pair of values, and fill the following table, using the best value that you found for , and also values 10% and 20% higher and lower than the best.
N. of tests
-20%
-10%
Best
+10%
+20%
N. of iterations
Table 4
Basing yourself on the results that you obtained in the table above, give a qualitative explanation of why it is hard to find values of the parameters that yield a relatively small number of iterations.
Note that the total time that it takes to optimize a function corresponds to both the time it takes to perform the tests that needed to find a fast enough optimization, plus the time it takes for that optimization to run.
Next, we’ll test the optimization using adaptive step sizes. Set the following input parameters for function rosen, and fill the table with the numbers of iterations necessary for the optimization under different situations. up = 1.1, down = 0.9, reduce = 0.5
0
.5
.7
.9
.95
.99
.001
.01
.1
1
10
Table 5
Observe how the number of iterations depends little on the value of , contrary to what happened when fixed step sizes were used (note that, in the table above, varies by four orders of magnitude). The relative insensitivity to the value of is due to the fact that the step sizes vary automatically during the minimization (the value given to represents only the initial value of the step size parameter). The little dependency on the initial value of makes it easier to choose values that result in efficient optimization.
Finally, we’ll test the optimization of the Rosenbrock function with the original value of a. Set a=100. Try to find values of and such that the convergence is reached in less than 500 steps, first without adaptive step sizes (up = 1, down = 1, reduce = 1), and then with adaptive step sizes (up = 1.1, down = 0.9, reduce = 0.5). Write down, for each case, the number of tests required to find the final values of and . If, in any case, the number of tests is becoming too large, stop and use the best result obtained so far.
For both cases change the best value of eta by about 10% up and down, without changing
, and write down the corresponding numbers of iterations. Fill table 6 with the values that you obtained.
N. of tests
N. of iterations
Without adaptive step sizes
-10%
final =
+10%
With adaptive step sizes
-10%
final =
+10%
Table 6
Comment briefly on the efficiency of the acceleration methods that you have tested in this assignment.