Please typeset or write your solutions neatly! If we cannot read it, we cannot grade it.
Note: You can use the results we have proved in class – no need to prove them again.
Q 1. Recall the Gauss-Southwell rule for basic descent methods that we saw in class: d_k = −∇_{i_k} f(x_k) e_{i_k}, where i_k = argmax_{1≤i≤n} |∇_i f(x_k)| and e_{i_k} is the vector that has 0 in all coordinates except for coordinate i_k, where it equals 1 (it is the i_k-th standard basis vector). As in class, we assume that f is L-smooth. Prove that there exists α > 0 such that the Gauss-Southwell rule applied with an appropriate step size α_k satisfies:
f(x_{k+1}) ≤ f(x_k) − (α/2) ‖∇f(x_k)‖_2^2.
How would you choose αk? What can you say about the convergence of this method (discuss all three cases we have covered in class: nonconvex and bounded below, convex, strongly convex)? [10pts]
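For intuition on the rule in Q 1 (this sketch is not required for the solution), one Gauss-Southwell step in Matlab might look as follows; the gradient handle gradf, the current iterate x, and the step size alpha are placeholders, not part of the assignment.
g = gradf(x);                    % full gradient at the current iterate x
[~, ik] = max(abs(g));           % ik = argmax_i |grad_i f(x)|
x(ik) = x(ik) - alpha * g(ik);   % move only along coordinate ik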
Q 2. Exercise 8 from Chapter 3 in Recht-Wright. Notation from the exercise: x⋆ = argmin_{x∈Rn} f(x). You don't need to worry about getting the same constants as stated there; being off by constant factors (up to 4) is fine. [15pts]
Q 3 (Bregman Divergence). Bregman divergence of a continuously differentiable function ψ : Rn → R is a function of two points defined by
D_ψ(x, y) = ψ(x) − ψ(y) − ⟨∇ψ(y), x − y⟩.
Equivalently, you can view Bregman divergence as the error in the first-order approximation of a function:
ψ(x) = ψ(y) + ⟨∇ψ(y), x − y⟩ + D_ψ(x, y).
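As a quick (ungraded) illustration of the definition: for the negative entropy ψ(x) = Σ_{i=1}^n x_i log x_i on the positive orthant, we have ∇ψ(y) = (log y_i + 1)_{i=1}^n, and plugging into the definition above gives
D_ψ(x, y) = Σ_{i=1}^n ( x_i log(x_i/y_i) − x_i + y_i ),
which is the (generalized) Kullback–Leibler divergence.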
(i) What is the Bregman divergence of the simple quadratic function ψ(x) = (1/2)‖x − x0‖_2^2, where x0 ∈ Rn is a given point? [5pts]
(ii) Given x0 ∈ Rn and some continuously differentiable ψ : Rn → R, what is the Bregman divergence of the function φ(x) = ψ(x) + ⟨x0, x⟩? [5pts]
(iii) Use Part (ii) and the definition of Bregman divergence to prove the following 3-point identity:
(∀x, y, z ∈ Rn): D_ψ(x, y) = D_ψ(z, y) + ⟨∇ψ(z) − ∇ψ(y), x − z⟩ + D_ψ(x, z). [5pts]
(iv) Suppose I give you the following function: m_k(x) = (1/2)‖x − x0‖_2^2 + Σ_{i=0}^{k} a_i ⟨z_i, x⟩, where x0 ∈ Rn, {a_i}_{i≥0} is a sequence of positive reals, and z_0, ..., z_k are fixed vectors from Rn. Define v_k = argmin_{x∈Rn} m_k(x) and m_k^⋆ = m_k(v_k). Using what you have proved so far, prove the following inequality:
(∀x ∈ Rn): m_k(x) ≥ m_k^⋆ + (1/2)‖x − v_k‖_2^2. [5pts]
Q 4. In class, we have analyzed the following variant of Nesterov’s method for L-smooth convex optimization:
x_k = (A_{k−1}/A_k) y_{k−1} + (a_k/A_k) v_{k−1},
v_k = v_{k−1} − a_k ∇f(x_k)/L,
y_k = (A_{k−1}/A_k) y_{k−1} + (a_k/A_k) v_k,
where the positive coefficients a_k satisfy a_k^2 = A_k = Σ_{i=0}^{k} a_i and L is the smoothness constant of f. We take x0 ∈ Rn to be an arbitrary initial point and y0 = v0 = x0 − ∇f(x0)/L.
Prove that we can state Nesterov’s method in the following equivalent form:
x_k = y_{k−1} + (a_k A_{k−2})/(a_{k−1} A_k) (y_{k−1} − y_{k−2}),      (1)
y_k = x_k − ∇f(x_k)/L.
Hint: It is helpful to first prove that y_k = x_k − ∇f(x_k)/L. [10pts]
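For reference only (not required for the proof), here is a minimal Matlab sketch of the three-sequence recursion above; it follows the coefficient choice a_k^2 = A_k stated above, so a_k = (1 + sqrt(1 + 4 A_{k−1}))/2, and the gradient handle gradf and iteration budget K are placeholders, not part of the assignment.
A_prev = 1;                           % A_0 = a_0 = 1 under the choice a_k^2 = A_k
v = x0 - gradf(x0)/L;  y = v;         % y_0 = v_0 = x_0 - grad f(x_0)/L
for k = 1:K
    a = (1 + sqrt(1 + 4*A_prev))/2;   % positive root of a^2 = A_prev + a
    A = A_prev + a;
    x = (A_prev/A)*y + (a/A)*v;       % x_k
    g = gradf(x);
    v = v - (a/L)*g;                  % v_k
    y = (A_prev/A)*y + (a/A)*v;       % y_k (equals x - g/L when a^2 = A)
    A_prev = A;
end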
Q 5 (Coding Assignment). In the coding assignment, we will compare different optimization methods discussed in class on the following problem instance: min_{x∈Rn} f(x), where f(x) = (1/2)⟨Mx, x⟩ − ⟨b, x⟩, b is a vector whose first coordinate is 1 − 1/n while the remaining coordinates are −1/n, and M is the same matrix we saw in Q 8 of Homework #1. We will take the dimension to be n = 200. Matrix M and vector b can be generated using the following Matlab code:
k = n;
M = diag(2*[ones(k, 1); zeros(n-k, 1)], 0)...
+ diag([-ones(k-1, 1); zeros(n-k, 1)], -1)...
+ diag([-ones(k-1, 1); zeros(n-k, 1)], 1);
M(n,1) = -1; M(1,n) = -1;
b = -1/n * ones(n, 1); b(1) = b(1) + 1;
Observe that you can compute the minimizer x∗ of f given M and b, and thus you can also compute f(x∗); a minimal Matlab sketch for this appears right after the list of algorithms below. It is possible to show that the top eigenvalue of M is L = 4. Implement the following algorithms:
1. Steepest descent with the constant step size αk = 1/L.
2. Steepest descent with exact line search (its step size for the quadratic above is sketched after this list).
3. Lagged steepest descent, defined as follows: let α_k be the exact line search steepest descent step size corresponding to the point x_k. Lagged steepest descent updates the iterates as x_{k+1} = x_k − α_{k−1} ∇f(x_k) (i.e., the step size “lags” by one iteration).
4. Nesterov’s method for smooth convex minimization.
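Assuming f(x) = (1/2)⟨Mx, x⟩ − ⟨b, x⟩ as stated above, a minimal Matlab sketch of the quantities referenced above (the minimizer x∗, the optimal value f(x∗), and the exact line-search step size along the negative gradient at an assumed current iterate x) is:
gradf = @(x) M*x - b;                          % gradient of the quadratic f
x_star = pinv(M)*b;                            % one minimizer (M here has zero row sums, so use pinv or any solution of M*x = b)
f_star = 0.5*(x_star'*M*x_star) - b'*x_star;   % optimal value f(x*)
g = gradf(x);                                  % gradient at the current iterate x
alpha_exact = (g'*g) / (g'*(M*g));             % exact line search: argmin over t of f(x - t*g)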
Initialize all algorithms at x0 = 0. All your plots should show the optimality gap f(x) − f(x∗) (with x = y_k for Nesterov and x = x_k for all other methods) on the y-axis and the iteration count on the x-axis. The y-axis should be shown on a logarithmic scale (use set(gca, 'YScale', 'log') after the figure command in Matlab).
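A minimal sketch of the required plot setup, where the vector gap is assumed to hold the recorded values of f(x) − f(x∗) per iteration for one of the methods and the label is illustrative:
figure; set(gca, 'YScale', 'log'); hold on;   % log-scale y-axis, as required
plot(0:numel(gap)-1, gap);                    % optimality gap vs. iteration count
xlabel('iteration k'); ylabel('f(x) - f(x^*)');
legend('SD:constant');                        % add one entry per algorithm plotted on the same axes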
(i) Use a single plot to compare steepest descent with constant step size, steepest descent with the exact line search, and Nesterov’s algorithm. Use different colors for different algorithms and show a legend with descriptive labels (e.g., SD:constant, SD:exact, and Nesterov). Discuss the results. Do you see what you expect from the analysis we saw in class?
(ii) Use a single plot to compare Nesterov’s algorithm to lagged steepest descent. You should, again, use different colors and a legend. What can you say about lagged steepest descent? How does it compare to Nesterov’s algorithm?
(iii) Modify the output of Nesterov’s algorithm and lagged steepest descent: you should still run the same algorithms, but now your plot at each iteration k should show the lowest function value up to iteration k for each of the two algorithms. Discuss the results.
You should turn in both the code (as a text file) and a pdf with the figures produced by your code together with the appropriate answers to the above questions. [45pts]
Note: Your code needs to compile without any errors to receive any points for the coding assignment.