$25
Advanced Topics in Machine Learning
Instructions There are 4 sections. Two assess the knowledge of the theory and two require implementing algorithms. Provide sufficient explanations for all solutions. For the coding parts the following libraries are required: numpy, sklearn, matplotlib.
1 Questions with multiple answers [20%]
Question 1 [4%] Which one of these is a convex function?
a) max{ax + b,x4 − 5,ex2}
b) min{ax + b,x4 − 5,ex2}
c) ax + b + x4 − 5 − ex2.
Question 2 [4%] Consider the function
Indicate which one of the two graphs below represents the subdifferential of f.
Ad 2
94A i:1
2 - -- a r bJ
Figure 1: Question 2
Question 3 [4%] The gradient of f(x) = ⟨Ax,x⟩ + ⟨x,b⟩ + c, where A is a square matrix, not necessarily symmetric is
a) A∗x + Ax + b
b) 2Ax + b
c) A∗ + b, where A∗ denotes the transpose of A.
Question 4 [4%] Let g ∈ Γ0(R). The Fenchel conjugate of f(x) = g(2x) is
a) f∗(u) = g∗(u/2)
b) f∗(u) = g∗(2u)
c) f∗(u) = g∗(u).
Question 5 [4%] Referring to the ridge regression problem
,
and denoting by K the Gram matrix of the data, indicate the solution of the dual problem
a) u¯ = (K + λnId)y
b) u¯ = (K2 + λnId)−1y
c) u¯ = (K + λnId)−1y.
2 Theory on convex analysis and optimization [30%]
Problem 1. [4%] Compute (showing a complete derivation) the Fenchel conjugates of the following functions.
1. f : R → ]−∞,+∞], with .
2. f(x) = x2.
3. ι[0,1] (the indicator function of the interval [0,1]).
Problem 2. [7%] Let f : X → ]−∞,+∞] be a proper convex function.
1. Prove by induction the Jensen’s inequality, that is
,
for all x1,...,xn ∈ X and for all λ1,...,λn ∈ R+ with
2. Using the characterization for differentiable functions prove that the function −log is convex 3. Applying Jensen’s inequality to −log, prove the following arithmetic-geometric inequality
,
for all x1,...,xn ∈ R+.
Problem 3. [4%] Given a polytope C = co(a1,...,am) in X. Prove that the maximum of a convex function f on C is attained at one of the vertices a1,...,am.
Problem 4. [2%] Prove that the function f(x,y) = ∥x − 2y∥2 is (jointly) convex
Problem 5. [4%] Provide minimal sufficient conditions for the existence and uniqueness of minimizers for a convex function f : X → ]−∞,+∞].
Problem 6. [9%] Consider the optimization problem.
,
where ε > 0, A ∈ Rn×d and b ∈ Rn. Solve the following points.
1. Compute the dual problem (using the Fenchel-Rockafellar duality theory). [Hint: put the problem in the form f(x) + g(Ax)]
2. Does strong duality hold? Justify the answer. [Hint: use the qualification condition]
3. Write the KKT conditions.
4. Derive a rate of convergence on the primal iterates from the application of FISTA on the dual problem. [Hint: recall the it is possible to bound the square of the distance to the primal solution by the dual objective values]
3 Solving the Lasso problem [25%]
Consider the following problem
, (1)
where A ∈ Rn×d. Let us denote by ai ∈ Rd and aj ∈ Rn the i-th row and j-column of A respectively. Then the objective function can be written in the form
, (2)
Then the proximal stochastic gradient algorithm is
xk+1 = prox , (PSGA)
2 √
where γk = n/(∥A∥ k + 1) and (ik)k∈N is a sequence of i.i.d. random variables uniformly distributed on {1,...,n}.
Alternatively, problem (1) can be approached via a randomized coordinate proximal gradient algorithm. Indeed, recalling that ∇(1/2)∥Ax − y∥2 = A∗(Ax − y) = (⟨aj,Ax − y⟩)1≤j≤d one can consider the following algorithm
(RCPGA)
,
where (jk)k∈N is a sequence of i.i.d. random variables uniformly distributed on {1,...,d} and γj = n/∥aj∥2.
Generate the data according to the following python code with n = 1000, d = 500, s = 50, σ = 0.06.
def generate_problem(n, d, s, std=0.06):
# Generate xs
# vectors with entries in [0.5, 1] and [-1, -0.5]
# repectively assert s % 2 == 0, "s needs to be divisible by 2" xsp = 0.5 * (np.random.rand(s // 2) + 1) xsn = - 0.5 * (np.random.rand(s // 2) + 1) xsparse = np.hstack([xsp, xsn, np.zeros(d - s)]) random.shuffle(xsparse)
# Generate A
A = np.random.randn(n, d)
# Generate eps y = A @ xsparse + std * np.random.randn(n) return xsparse, A, y
1. Implement algorithm (PSGA), recalling that proxγkλ∥·∥1 acts component-wise as a softthresholding operator with threshold γkλ.
2. Implement algorithm (RCPGA).
3. Choose an appropriate regularization parameter λ and plot the objective function values vs the number of iterations for both the algorithms described above. For algorithm (PSGA) consider also the behavior of the objective values over the sequence of ergodic means, that is,
.
4 Support Vector Machines [25%]
The problem of SVM’s for classification is
, (3)
where (xi,yi)1≤i≤n is the training set, xi ∈ Rd and yi ∈ {−1,1} and Λ: Rd → H is the feature map corresponding to the Gaussian kernel, that is,
. (4)
The dual problem is
, (5)
where Ki,j = K(xi,xj) and ι[0, 1 ] is the indicator function of the interval [0, ]. The problem can
λn be equivalently restated as
, (6)
where (Ky)i,j = yiKi,jyj. Note that the primal solution can be recovered via ) and hence the decision function is
. (7)
Figure 2: A realization of the two-moons dataset.
Generate the data (2D) according to the following python code:
from sklearn.datasets import make_moons
X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
y = 2*y - 1
Choose appropriate values of λ and σ and address the following points.
1. Implement FISTA for solving the dual problem and plot the dual objective function.
2. Implement the randomized coordinate projected gradient algorithm on the dual problem (6) and plot the dual objective function.
3. For each algorithm above, using a contour plot command, plot the decision boundary as well as the two classes.
4. compare the performance of the two approaches in terms of speed and accuracy.