Machine-Learning Homework 4
Handout date: April 19, 2020

1.     Max of Convex Functions. Consider $m$ convex functions $f_1(x),\dots,f_m(x)$, where $f_i : \mathbb{R}^d \to \mathbb{R}$. Now define a new function $g(x) = \max_i f_i(x)$.

(a)     Prove that $g(x)$ is a convex function.

(b)    Show that a sub-gradient of $g$ at a point $x$ is the gradient of a function $f_i$ (assume $f_i$ is differentiable) for which $f_i(x) = \max\{f_1(x),\dots,f_m(x)\}$.
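For convenience, recall the two standard definitions used in this problem (they are not restated in the handout): a function $g$ is convex if

$$g(\lambda x + (1-\lambda) y) \le \lambda g(x) + (1-\lambda) g(y) \qquad \forall x, y \in \mathbb{R}^d,\ \lambda \in [0,1],$$

and $v$ is a sub-gradient of $g$ at $x$ if

$$g(y) \ge g(x) + v^T (y - x) \qquad \forall y \in \mathbb{R}^d.$$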

2.    $\ell_2$ penalty. Consider the following problem:

$$\min_{w, b, \xi} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i^2 \qquad \text{s.t. } \ y_i(w^T x_i + b) \ge 1 - \xi_i \quad \forall i = 1,\dots,m$$

(a)     Show that adding constraints of the form $\xi_i \ge 0$ does not change the problem. That is, show that the optimal value of the objective is the same whether or not these non-negativity constraints are present.

(b)    What is the Lagrangian of this problem?

(c)     Minimize the Lagrangian with respect to $w, b, \xi$ by setting its derivatives with respect to these variables to 0.

(d)    What is the dual problem?
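For reference, the Lagrangian of a generic problem of the form $\min_z f(z)$ subject to $g_i(z) \le 0$ for $i = 1,\dots,m$ is the standard construction (not restated in the handout)

$$\mathcal{L}(z, \alpha) = f(z) + \sum_{i=1}^{m} \alpha_i\, g_i(z), \qquad \alpha_i \ge 0;$$

part (b) asks for its instantiation for the problem above, with the constraints rewritten as $1 - \xi_i - y_i(w^T x_i + b) \le 0$.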

3.   The Multi-class Hinge-loss. Consider the problem of multi-class prediction where the label $Y$ has $L$ values (e.g., $L = 10$ for MNIST). Denote $[L] = \{1, 2, \dots, L\}$. Assume the inputs are $x \in \mathbb{R}^d$. We will consider classifiers of the form $f(x; w_1,\dots,w_L) = \arg\max_{y \in [L]} w_y \cdot x$, defined by $L$ vectors $w_1,\dots,w_L$. Given an input $x$ and its correct label $y$, the error of the classifier is $\Delta_{zo}(f(x; w_1,\dots,w_L), y)$, where

$$\Delta_{zo}(y', y) = \begin{cases} 1 & y' \ne y \\ 0 & y' = y \end{cases}$$

Since this loss is hard to optimize, we consider another loss, called the multi-class hinge loss, defined as follows:

$$\ell(w_1,\dots,w_L, x, y) = \max_{y' \in [L]} \left[ \Delta_{zo}(y', y) + w_{y'} \cdot x - w_y \cdot x \right].$$

Given a labeled training set $x_1,\dots,x_n \in \mathbb{R}^d$ and $y_1,\dots,y_n \in [L]$, the hinge-loss optimization problem is:

$$\min_{w_1,\dots,w_L} \ \frac{1}{n} \sum_{i=1}^{n} \ell(w_1,\dots,w_L, x_i, y_i).$$

Denote a minimizer of this problem by $w_1^{opt},\dots,w_L^{opt}$ (note there may be multiple such minimizers). A small numerical illustration of $\ell$ is given after this problem.


(a)     Show that $\ell$ is a convex function of $w_1,\dots,w_L$.

(b)    Show that $\ell(w_1,\dots,w_L, x, y) \ge \Delta_{zo}(f(x; w_1,\dots,w_L), y)$ for all values of $w, x, y$.

(c)     Assume that for your training set there exist $w_1^*,\dots,w_L^*$ that achieve zero training error (namely $\Delta_{zo}(f(x_i; w_1^*,\dots,w_L^*), y_i) = 0$ for all $i$). Prove that $w_1^{opt},\dots,w_L^{opt}$ would also have zero training error, namely that $\Delta_{zo}(f(x_i; w_1^{opt},\dots,w_L^{opt}), y_i) = 0$ for all $i$.
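As a small numerical illustration of the definitions in this problem (not part of the original handout), the snippet below computes the multi-class hinge loss and the zero-one loss for a toy example; all names and values are made up for demonstration.

```python
import numpy as np

def multiclass_hinge_loss(W, x, y):
    # W has one row per class; loss = max_{y'} [ 1[y' != y] + w_{y'} . x - w_y . x ]
    scores = W @ x
    margins = scores - scores[y] + (np.arange(len(scores)) != y)
    return float(np.max(margins))

def zero_one_loss(W, x, y):
    return int(np.argmax(W @ x) != y)

# toy example with L = 3 classes and d = 2 (values chosen only for illustration)
W = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
x = np.array([2.0, 1.0])
y = 0
print(multiclass_hinge_loss(W, x, y), zero_one_loss(W, x, y))  # 0.5  0
```

Here the hinge loss (0.5) upper-bounds the zero-one loss (0), which is the general fact part (b) asks you to prove.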

4.     Growth Function of Composition. Let $F_1$ and $F_2$ be two function families, where the range of the functions in $F_1$ is contained in the domain of the functions in $F_2$. Define $F = F_2 \circ F_1$ to be the set of functions which are a composition of a function from $F_1$ and a function from $F_2$. That is,

$$F = F_2 \circ F_1 = \{ f_2 \circ f_1 \mid f_1 \in F_1,\ f_2 \in F_2 \}$$

Prove that

$$\Pi_F(m) \le \Pi_{F_1}(m) \cdot \Pi_{F_2}(m).$$
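Recall (a standard definition, not restated in the handout) that the growth function of a function family $F$ counts the maximal number of distinct behaviors of $F$ on $m$ points:

$$\Pi_F(m) = \max_{x_1,\dots,x_m} \bigl| \{ (f(x_1),\dots,f(x_m)) : f \in F \} \bigr|.$$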

5.     Gradient Descent on a Smooth Function. We say that a continuously differentiable function $f : \mathbb{R}^n \to \mathbb{R}$ is $\beta$-smooth if for all $x, y \in \mathbb{R}^n$

$$f(y) \le f(x) + \nabla f(x)^T (y - x) + \frac{\beta}{2} \|y - x\|^2.$$

In words, $\beta$-smoothness of a function $f$ means that at every point $x$, $f$ is upper bounded by a quadratic function which coincides with $f$ at $x$.

Let $\ell : \mathbb{R}^n \to \mathbb{R}$ be a $\beta$-smooth and non-negative function (i.e., $\ell(x) \ge 0$ for all $x \in \mathbb{R}^n$). Consider the (non-stochastic) gradient descent algorithm applied to $\ell$ with constant step size $\eta > 0$: $x_{t+1} = x_t - \eta \nabla \ell(x_t)$.

Assume that gradient descent is initialized at some point $x_0$. Show that if $\eta < \frac{2}{\beta}$ then

$$\lim_{t \to \infty} \|\nabla \ell(x_t)\| = 0.$$

(Hint: Use the smoothness definition with the points $x_{t+1}$ and $x_t$ to show that $\sum_{t=0}^{\infty} \|\nabla \ell(x_t)\|^2 < \infty$, and recall that for a sequence $(a_n)$, $\sum_{n} a_n < \infty$ implies $\lim_{n \to \infty} a_n = 0$. Note that $\ell$ is not assumed to be convex!)
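As a numerical illustration of the claimed behavior (not a proof, and not part of the original handout), the sketch below runs gradient descent with a constant step size on a non-negative, non-convex, $\beta$-smooth function; the specific function, step size and initialization are chosen arbitrarily for demonstration.

```python
import numpy as np

def f(x):
    # non-negative and non-convex; beta-smooth with beta = 2, since |f''(x)| = |2 cos(2x)| <= 2
    return np.sin(x) ** 2

def grad_f(x):
    return np.sin(2 * x)

eta = 0.4            # constant step size satisfying eta < 2 / beta = 1
x = 1.3              # arbitrary initialization x_0
for _ in range(200):
    x = x - eta * grad_f(x)

# the gradient norm tends to 0 (here the iterates also happen to reach a global minimum)
print(f(x), abs(grad_f(x)))
```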


Programming Assignment
Submission guidelines:

 

•    Download the file skeleton_sgd.py from Moodle. In each of the following questions you should only implement the algorithm in the provided skeleton file. Plots, tables and any other artifacts should be submitted with the theoretical section.

•    In the file skeleton_sgd.py there is a helper function. The function reads the examples labelled 0 and 8 and returns them with 0-1 labels. In case you are unable to read the MNIST data with the provided script, you can download the file from here:

https://github.com/amplab/datasciencesp14/blob/master/lab7/mldata/mnist-original.mat.

•    Your code should be written in Python 3.

•    Make sure to comment out or remove any code which halts code execution, such as matplotlib popup windows.

•    Your code submission should include one file: sgd.py.

1.    (25 points) SGD for Hinge loss. We will continue working with the MNIST data set. The file template (skeleton_sgd.py) contains the code to load the training, validation and test sets for the digits 0 and 8 from the MNIST data. In this exercise we will optimize the Hinge loss (as you have seen in the lecture) using the stochastic gradient descent implementation discussed in class. Namely, at each iteration $t = 1,\dots$ we sample $i$ uniformly, and if $y_i w_t \cdot x_i < 1$, we update:

$$w_{t+1} = (1 - \eta_t) w_t + \eta_t C y_i x_i$$

and $w_{t+1} = (1 - \eta_t) w_t$ otherwise, where $\eta_t = \eta_0 / t$ and $\eta_0$ is a constant. Implement an SGD function that accepts the samples and their labels, $C$, $\eta_0$ and $T$, and runs $T$ gradient updates as specified above (a minimal implementation sketch is given after the sub-questions below). In the questions that follow, make sure your graphs are meaningful.

Consider using set_xlim or set_ylim to concentrate only on a relevant range of values.

(a)     (10 points) Train the classifier on the training set. Use cross-validation on the validation set to find the best $\eta_0$, assuming $T = 1000$ and $C = 1$. For each possible $\eta_0$ (for example, you can search on the log scale $\eta_0 = 10^{-5}, 10^{-4}, \dots, 10^{4}, 10^{5}$ and increase resolution if needed), assess the performance of $\eta_0$ by averaging the accuracy on the validation set across 10 runs. Plot the average accuracy on the validation set as a function of $\eta_0$.

(b)    (5 points) Now, cross-validate on the validation set to find the best $C$ given the best $\eta_0$ you found above. For each possible $C$ (again, you can search on the log scale as in part (a)), average the accuracy on the validation set across 10 runs. Plot the average accuracy on the validation set as a function of $C$.

(c)     (5 points) Using the best $C$ and $\eta_0$ you found, train the classifier, but for $T = 20000$. Show the resulting $w$ as an image.

(d)    (5 points) What is the accuracy of the best classifier on the test set?
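The following is a minimal sketch of the SGD routine described above, assuming the data are NumPy arrays and the labels take values in $\{-1, +1\}$; the function name sgd_hinge and its exact signature are illustrative and are not taken from skeleton_sgd.py.

```python
import numpy as np

def sgd_hinge(data, labels, C, eta_0, T):
    """Hinge-loss SGD sketch (illustrative; the name and signature are assumptions).

    data: (n, d) array of samples; labels: length-n array with values in {-1, +1}.
    Runs T updates with step size eta_t = eta_0 / t and returns the final w.
    """
    n, d = data.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        eta_t = eta_0 / t
        i = np.random.randint(n)          # sample an example uniformly at random
        x_i, y_i = data[i], labels[i]
        if y_i * np.dot(w, x_i) < 1:      # margin violated: take the full update
            w = (1 - eta_t) * w + eta_t * C * y_i * x_i
        else:                             # margin satisfied: only shrink w
            w = (1 - eta_t) * w
    return w
```

Prediction on a new sample $x$ is then $\mathrm{sign}(w \cdot x)$; for parts (a) and (b), the accuracy of the returned $w$ on the validation set can be averaged over 10 such runs for each candidate value of $\eta_0$ or $C$.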

2.    (15 points) SGD for multi-class cross-entropy. The skeleton file contains a second helper function to load the training, validation and test sets for all the digits. In this exercise we will optimize the multi-class cross-entropy loss using SGD. Recall the multi-class cross-entropy loss discussed in the recitation (our classes are $0, 1, \dots, 9$):

$$\ell(w_0,\dots,w_9; x, y) = -\log \frac{e^{w_y \cdot x}}{\sum_{j=0}^{9} e^{w_j \cdot x}}.$$

Derive the gradient update for this case, and implement the appropriate SGD function (a minimal sketch is given after the sub-questions below).

(a)    Train the classifier on the training set. Use cross-validation on the validation set to find the best $\eta_0$, assuming $T = 1000$. For each possible $\eta_0$ (for example, you can search on the log scale $\eta_0 = 10^{-5}, 10^{-4}, \dots, 10^{4}, 10^{5}$ and increase resolution if needed), assess the performance of $\eta_0$ by averaging the accuracy on the validation set across 10 runs. Plot the average accuracy on the validation set as a function of $\eta_0$.

(b)    Using the best $\eta_0$ you found, train the classifier, but for $T = 20000$. Show the resulting $w_0,\dots,w_9$ as images.

(c)     What is the accuracy of the best classifier on the test set?
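The following is a minimal sketch of SGD for the cross-entropy loss above, using the standard gradient $\nabla_{w_j} \ell = (\mathrm{softmax}_j(Wx) - \mathbf{1}[j = y])\, x$; the names sgd_ce and softmax are illustrative, the constant step size is an assumption (a decaying step $\eta_0 / t$, as in the previous exercise, is an equally reasonable choice), and none of this is taken from skeleton_sgd.py.

```python
import numpy as np

def softmax(scores):
    # numerically stable softmax over a vector of class scores
    scores = scores - np.max(scores)
    exp_scores = np.exp(scores)
    return exp_scores / np.sum(exp_scores)

def sgd_ce(data, labels, eta_0, T, n_classes=10):
    """Cross-entropy SGD sketch (illustrative; the name and signature are assumptions).

    data: (n, d) array; labels: length-n array with values in {0, ..., n_classes - 1}.
    Uses a constant step size eta_0 and returns the (n_classes, d) weight matrix W.
    """
    n, d = data.shape
    W = np.zeros((n_classes, d))
    for _ in range(T):
        i = np.random.randint(n)
        x_i, y_i = data[i], int(labels[i])
        grad_scores = softmax(W @ x_i)             # predicted class probabilities
        grad_scores[y_i] -= 1.0                    # gradient of the loss w.r.t. the class scores
        W -= eta_0 * np.outer(grad_scores, x_i)    # gradient step on every row w_j
    return W
```

Prediction is then $\arg\max_j w_j \cdot x$, and the accuracies for parts (a) and (c) are computed as in the previous exercise.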
