$25
1 Support Vector Machines using SGD
Until now we have implemented linear and logistic regression to do classification. In this exercise we will use the Support Vector Machine (SVM) for classification. As we have seen in the lecture notes, the original optimization problem for the Support Vector Machine (SVM) is given by
N
λ 2
min w (1)
w
where ` : R → R, `(z) := max{0,1 − z} is the hinge loss function. Here for any n, 1 ≤ n ≤ N, the vector xn ∈ RD is the nth data example, and yn ∈ {±1} is the corresponding label.
Problem 1 (SGD for SVM):
Implement stochastic gradient descent (SGD) for the original SVM formulation (1). That is in every iteration, pick one data example n ∈ [N] uniformly at random, and perform an update on w based on the (sub-)gradient of the nth summand of the objective (1). Then iterate by picking the next n.
1. Fill in the notebook functions calculate accuracy(y, X, w) which computes the accuracy on the training/test dataset for any w and calculate primal objective(y, X, w, lambda ) which computes the total primal objective (1).
2. Derive the SGD updates for the original SVM formulation and fill in the notebook function calculate stochastic gradient() which should return the stochastic gradient of the total cost function (loss plus regularizer) with respect to w. Finally, use sgd for svm demo() provided in the template for training.
2 Support Vector Machines using Coordinate Descent
As seen in class, another approach to train SVMs is by considering the dual optimization problem given by
YXX>Yα such that 0 ≤ αn ≤ 1 ∀n (2)
where Y := diag(y), and X ∈ RN×D again collects all N data examples as its rows, as usual. In this approach we optimize over the dual variables α and map the solutions back to the primal vector w.
Problem 2 (Coordinate Descent for SVM):
Derive the coordinate descent algorithm updates for the dual (2) of the SVM formulation. That is, in every iteration, pick a coordinate n ∈ [N] uniformly at random, and fully optimize the objective (2) with respect to that coordinate alone.
After updating that coordinate αn, update the corresponding primal vector w such that the first-order correspondence is maintained, that is that always w = w(α) := λ1X>Yα. Then iterate by picking the next coordinate n.
1. Mathematically derive the coordinate update for one coordinate n (finding the closed-form solution to maximization over just that coordinate), when given α and corresponding w.
2. Fill in the notebook functions calculate coordinate update() which should compute the coordinate update for a single desired coordinate and calculate dual objective() which should return the objective (loss) for the dual problem (2) .
3. Finally train your model using coordinate descent (here ascent) using the given function sgd for svm demo() in the template. Compare to your SGD implementation. Which one is faster? (Compare the training objective values (1) for the w iterates you obtain from each method).
Theory Excercises
Problem 3 (Kernels):
In class we have seen that many kernel functions k(x,x0) can be written as inner products φ(x)>φ(x0) for a suitably chosen feature map φ(·). Let us say that such a kernel function is valid. We further discussed many operations on valid kernel functions that result again in valid kernel functions. Here are two more.
1. Let k1(x,x0) be a valid kernel function. Let f be a polynomial with positive coefficients. Show that is a valid kernel.
2. Show that k(x,x0) = exp(k1(x,x0) is a valid kernel assuming that k1(x,x0) is a valid kernel. HINT: You are allowed to take limits.
2