(25 points)
1. SGD for Hinge loss. We will continue working with the MNIST data set. The template file (skeleton_sgd.py) contains the code to load the training, validation and test sets for the digits 0 and 8 from the MNIST data. In this exercise we will optimize the hinge loss (as seen in the lecture) using the stochastic gradient descent implementation discussed in class. Namely, at each iteration $t = 1, 2, \ldots$ we sample $i$ uniformly, and if $y_i\, w_t \cdot x_i < 1$, we update:
$$w_{t+1} = (1 - \eta_t)\, w_t + \eta_t C y_i x_i,$$
and $w_{t+1} = (1 - \eta_t)\, w_t$ otherwise, where $\eta_t = \eta_0 / t$ and $\eta_0$ is a constant. Implement an SGD function that accepts the samples and their labels, $C$, $\eta_0$ and $T$, and runs $T$ gradient updates as specified above. In the questions that follow, make sure your graphs are meaningful.
Consider using set_xlim or set_ylim to concentrate only on a relevant range of values.
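For reference, a minimal sketch of such an SGD routine follows; the array shapes, the $\pm 1$ label encoding, and the function name are assumptions for illustration, not the skeleton's actual interface:

```python
import numpy as np

def sgd_hinge(data, labels, C, eta_0, T):
    """SGD for hinge loss with learning rate eta_t = eta_0 / t (a sketch).

    data   -- array of shape (n, d), one sample per row
    labels -- array of n values in {-1, +1}
    Returns the final weight vector w_T.
    """
    n, d = data.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        eta_t = eta_0 / t
        i = np.random.randint(n)            # sample i uniformly
        if labels[i] * np.dot(w, data[i]) < 1:
            w = (1 - eta_t) * w + eta_t * C * labels[i] * data[i]
        else:
            w = (1 - eta_t) * w
    return w
```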
(a) Train the classifier on the training set. Use cross-validation on the validation set to find the best $\eta_0$, assuming $T = 1000$ and $C = 1$. For each possible $\eta_0$ (for example, you can search on the log scale $\eta_0 = 10^{-5}, 10^{-4}, \ldots, 10^{4}, 10^{5}$ and increase resolution if needed), assess the performance of $\eta_0$ by averaging the accuracy on the validation set across 10 runs. Plot the average accuracy on the validation set, as a function of $\eta_0$ (a sketch of this loop appears after this list).
(b) Now, cross-validate on the validation set to find the best $C$ given the best $\eta_0$ you found above. For each possible $C$ (again, you can search on the log scale as in section (a)), average the accuracy on the validation set across 10 runs. Plot the average accuracy on the validation set, as a function of $C$.
(c) Using the best $C$ and $\eta_0$ you found, train the classifier, but for $T = 20000$. Show the resulting $w$ as an image.
(d) What is the accuracy of the best classifier on the test set?
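The cross-validation loop in parts (a)-(b) and the visualization in part (c) could look roughly like the following; train_data, validation_data and their label arrays are assumed names for the skeleton loader's output, and best_C, best_eta stand for the values selected above:

```python
import numpy as np
import matplotlib.pyplot as plt

def accuracy(w, data, labels):
    """Fraction of points whose sign(w . x) matches the +/-1 label."""
    return np.mean(np.sign(data @ w) == labels)

# Part (a): average validation accuracy over 10 runs for each eta_0.
etas = [10.0 ** k for k in range(-5, 6)]
avg_acc = [np.mean([accuracy(sgd_hinge(train_data, train_labels, 1, eta, 1000),
                             validation_data, validation_labels)
                    for _ in range(10)])
           for eta in etas]
plt.semilogx(etas, avg_acc)                 # log scale on the eta_0 axis
plt.xlabel('eta_0')
plt.ylabel('average validation accuracy')
plt.show()

# Part (c): each MNIST image is 28x28, so w can be reshaped and displayed.
w = sgd_hinge(train_data, train_labels, best_C, best_eta, 20000)
plt.imshow(w.reshape(28, 28), interpolation='nearest')
plt.show()
```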
2. SGD for multi-class cross-entropy. The skeleton file contains a second helper function to load the training, validation and test sets for all the digits. In this exercise
we will optimize the multi-class cross-entropy loss using SGD. Recall the multi-class cross-entropy loss discussed in the recitation (our classes are $0, 1, \ldots, 9$):
$$\ell(w_0, \ldots, w_9;\, x, y) = -\log \frac{e^{w_y \cdot x}}{\sum_{j=0}^{9} e^{w_j \cdot x}}.$$
Derive the gradient update for this case, and implement the appropriate SGD function.
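As a sanity check for the derivation: with $p_j = e^{w_j \cdot x} / \sum_k e^{w_k \cdot x}$, the gradient takes the standard form $\nabla_{w_j} \ell = (p_j - \mathbb{1}[j = y])\, x$, which leads to a sketch like the following (function and variable names are placeholders, not the skeleton's API):

```python
import numpy as np

def sgd_ce(data, labels, eta_0, T, num_classes=10):
    """SGD for multi-class cross-entropy with eta_t = eta_0 / t (a sketch)."""
    n, d = data.shape
    W = np.zeros((num_classes, d))          # one weight vector per class
    for t in range(1, T + 1):
        eta_t = eta_0 / t
        i = np.random.randint(n)            # sample i uniformly
        scores = W @ data[i]
        scores -= scores.max()              # stabilize the softmax numerically
        p = np.exp(scores) / np.exp(scores).sum()
        p[int(labels[i])] -= 1.0            # p_j - 1[j == y]
        W -= eta_t * np.outer(p, data[i])   # gradient step on all rows at once
    return W
```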
(a) Train the classifier on the training set. Use cross-validation on the validation set to find the best $\eta_0$, assuming $T = 1000$. For each possible $\eta_0$ (for example, you can search on the log scale $\eta_0 = 10^{-5}, 10^{-4}, \ldots, 10^{4}, 10^{5}$ and increase resolution if needed), assess the performance of $\eta_0$ by averaging the accuracy on the validation set across 10 runs. Plot the average accuracy on the validation set, as a function of $\eta_0$.
(b) Using the best $\eta_0$ you found, train the classifier, but for $T = 20000$. Show the resulting $w_0, \ldots, w_9$ as images.
(c) What is the accuracy of the best classifier on the test set?
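For part (c), the predicted class is the one with the highest score $w_j \cdot x$; a short sketch of the test-set evaluation, again with assumed array names:

```python
import numpy as np

# W is the (10, d) weight matrix returned by sgd_ce above; test_data and
# test_labels are assumed names for the skeleton's all-digits test set.
predictions = np.argmax(test_data @ W.T, axis=1)  # highest-scoring class per sample
print('test accuracy:', np.mean(predictions == test_labels))
```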