Optimization for Machine Learning (Homework #3)
Theoretical Problems
1. Consider $x \in \mathbb{R}_+$ and the mirror map
$$h(x) = -\ln x.$$
• Find the Bregman divergence $D_h(x, y)$.
• Consider the function $f(x) = 0.5x - \ln(x + 1)$. What are the smoothness and strong convexity parameters of $f$ with respect to $h$?
• Consider the mirror descent method with mirror map $h$ for minimizing $f(x)$. What is the formula to obtain $x_t$ from $x_{t-1}$ with learning rate $\eta_t$?
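For concreteness, below is a minimal numerical sketch of the generic mirror descent update $\nabla h(x_t) = \nabla h(x_{t-1}) - \eta_t \nabla f(x_{t-1})$. The names (grad_f, grad_h, grad_h_inv) are illustrative, and the quadratic mirror map in the usage example is an assumption that reduces the method to plain gradient descent, not the $h$ of this problem.

    # Sketch of generic mirror descent: the update solves
    #   grad_h(x_t) = grad_h(x_{t-1}) - eta_t * grad_f(x_{t-1}),
    # i.e. x_t = (grad_h)^{-1}( grad_h(x_{t-1}) - eta_t * grad_f(x_{t-1}) ).
    def mirror_descent(x0, grad_f, grad_h, grad_h_inv, etas):
        """Run mirror descent; etas is a sequence of step sizes eta_t."""
        x = x0
        for eta in etas:
            x = grad_h_inv(grad_h(x) - eta * grad_f(x))
        return x

    # Usage with an assumed mirror map h(x) = 0.5*x**2, so grad_h is the
    # identity and the iteration reduces to gradient descent on f:
    x_star = mirror_descent(
        x0=0.5,
        grad_f=lambda x: 0.5 - 1.0 / (x + 1.0),  # gradient of 0.5*x - ln(x+1)
        grad_h=lambda x: x,
        grad_h_inv=lambda y: y,
        etas=[0.5] * 100,
    )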
2. Consider a composite optimization problem
$$\min_x \; f(x) + g(x),$$
with the non-convex regularizer $g(x)$ given by
$$g(x) = \lambda \sum_j \ln(1 + |x_j|),$$
for some $\lambda > 0$. Since the logarithm grows slowly as $|x_j|$ increases, this regularizer induces less bias than $L_1$ regularization. Assume we want to apply the proximal gradient method
$$x \leftarrow \mathrm{prox}_{\eta g}\left[x - \eta \nabla f(x)\right]$$
to solve this problem, where the proximal optimization problem becomes
$$\mathrm{prox}_{\eta g}(z) = \operatorname*{argmin}_{x}\; \frac{1}{2}\|x - z\|^2 + \eta g(x). \qquad (1)$$
• Show that when $\eta > 0$ is sufficiently small, the proximal optimization problem (1) is strongly convex for all $z$. What is the range of such $\eta$?
• Find the closed-form solution for $\mathrm{prox}_{\eta g}(z)$ when $\eta$ is sufficiently small so that (1) is convex.
• If $f(x)$ is $L$-smooth and $2\lambda$-strongly convex, does the proximal gradient algorithm converge? If so, how many iterations are needed to obtain an $\epsilon$-suboptimal primal solution?
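For reference, here is a minimal sketch of the proximal gradient iteration with the coordinate-wise prox of the log penalty solved numerically (the closed form is what the exercise asks you to derive); the names grad_f, lam, and eta are illustrative.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def prox_log_penalty(z, eta, lam):
        """Coordinate-wise prox of g(x) = lam * sum_j ln(1 + |x_j|), solved
        numerically; the minimizer always lies in [-|z_j|, |z_j|]."""
        def prox_1d(zj):
            obj = lambda u: 0.5 * (u - zj) ** 2 + eta * lam * np.log1p(abs(u))
            return minimize_scalar(obj, bounds=(-abs(zj) - 1.0, abs(zj) + 1.0),
                                   method="bounded").x
        return np.array([prox_1d(zj) for zj in z])

    def proximal_gradient(x0, grad_f, eta, lam, n_iters=100):
        """Iterate x <- prox_{eta g}[x - eta * grad_f(x)]."""
        x = np.array(x0, dtype=float)
        for _ in range(n_iters):
            x = prox_log_penalty(x - eta * grad_f(x), eta, lam)
        return x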
3. Consider the optimization problem
$$\min_x \; \sum_{i=1}^n f_i(A_i x) + f_0(A_0 x),$$
where the $A_i$ are matrices. We rewrite it as
$$\min_{x, z} \; \sum_{i=1}^n f_i(z_i) + f_0(z_0) \quad \text{subject to } A_1 x - z_1 = 0, \ldots, A_n x - z_n = 0, \; A_0 x - z_0 = 0.$$
• Write down the Lagrangian function $L(x, z, \alpha)$ with multipliers $\alpha_1, \ldots, \alpha_n$ and $\alpha_0$ corresponding to the $n + 1$ linear constraints.
• Find the dual formulation $\phi_D(\alpha) = \min_{x,z} L(x, z, \alpha)$ in terms of the conjugates $f_i^*$.
• If $n = 1$ and $A_1 = I$, write down the dual objective function in $\alpha_0$ by eliminating $\alpha_1$. Assuming $f_1^*$ is smooth, how do you recover the primal variable $x$ from the dual $\alpha_0$?
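As a reminder of the mechanism at work (a single-constraint instance, not the full answer), minimizing the Lagrangian over a split variable produces a conjugate:
$$\min_z \left\{ f(z) - \alpha^\top z \right\} = -\sup_z \left\{ \alpha^\top z - f(z) \right\} = -f^*(\alpha),$$
so each $z_i$-minimization in $L(x, z, \alpha)$ contributes a $-f_i^*(\alpha_i)$ term, and the $x$-minimization yields a linear stationarity condition linking the multipliers $\alpha_i$.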
4. (4 points) Consider a symmetric positive definite matrix $A$, and let
[definitions of $f$ and $g$ missing in source]
• Find the Fenchel dual of $f(x) + g(x)$.
• Find $\nabla f^*(\alpha)$ and $\nabla g^*(\alpha)$.
• Write down a closed-form formula for the dual ascent method.
• Find the smoothness of $f^*$ and $g^*$ and explain how to set the learning rate for dual ascent.
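For orientation, a minimal sketch of dual ascent under an assumed splitting $\min f(x) + g(z)$ subject to $x = z$; grad_f_conj and grad_g_conj stand for $\nabla f^*$ and $\nabla g^*$, which the problem asks you to derive.

    def dual_ascent(alpha0, grad_f_conj, grad_g_conj, eta, n_iters=200):
        """Ascend the dual phi_D(alpha) = -f*(-alpha) - g*(alpha)."""
        alpha = alpha0
        for _ in range(n_iters):
            # Dual gradient: grad phi_D(alpha) = grad f*(-alpha) - grad g*(alpha).
            alpha = alpha + eta * (grad_f_conj(-alpha) - grad_g_conj(alpha))
        return grad_f_conj(-alpha)  # primal recovery: x = grad f*(-alpha)

A fixed step $\eta \le 1/L_\phi$, where $L_\phi$ is the smoothness constant of the dual objective (the last bullet above), keeps the ascent stable.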
Programming Problem
• Download the data in the mnist sub-directory (which contains class 1 (positive) versus class 7 (negative)).
• Use the Python template “prog-template.py”, and implement the functions marked with ‘# implement’.
• Complete the code and generate the plots.
• Discuss the behaviors of the proximal and dual averaging algorithms in the experiments (a generic sketch of dual averaging follows this list).
• Discuss the impact of different values of µ in the experiments.
• Moreover, can you reduce the degree of oscillation for GD by selecting other learning rates when the objective function is highly non-smooth?
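For orientation only, below is a minimal sketch of regularized dual averaging (RDA), assuming an $\ell_1$-regularized objective $f(x) + \mu \|x\|_1$; the identifiers (grad_f, mu, gamma) and the $\gamma/\sqrt{t}$ proximal scaling are illustrative assumptions, and the template's actual objective, names, and schedule may differ.

    import numpy as np

    def rda_l1(x0, grad_f, mu, gamma, n_iters=1000):
        """Regularized dual averaging (RDA) sketch for f(x) + mu*||x||_1.

        Each step solves
          x_{t+1} = argmin_x { <g_bar_t, x> + mu*||x||_1
                               + (gamma/(2*sqrt(t)))*||x||^2 },
        which has the closed-form soft-thresholding solution used below.
        """
        x = np.array(x0, dtype=float)
        g_bar = np.zeros_like(x)  # running average of gradients
        for t in range(1, n_iters + 1):
            g_bar = ((t - 1) * g_bar + grad_f(x)) / t
            shrunk = np.sign(g_bar) * np.maximum(np.abs(g_bar) - mu, 0.0)
            x = -(np.sqrt(t) / gamma) * shrunk
        return x

Unlike proximal gradient, the iterate here depends on the average of all past gradients rather than only the latest one, which is the behavioral difference the discussion bullets ask about.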