1. SVM. In this exercise, we will explore different kernels for SVM and the relation between the parameters of the optimization problem. We will use an existing implementation of SVM: the SVC class from sklearn.svm. This class solves the soft-margin SVM problem. You may assume that the data we will use is separable by a linear separator (i.e., we could have formalized this problem as a hard-margin SVM). In the file skeleton_svm.py you will find two implemented methods:
• get_points - returns training and validation sets of points in 2D, and their labels.
• create_plot - receives a set of points, their labels and a trained SVM model, and creates a plot of the points and the model's separating line. The plot is created in the background using matplotlib; to show it, simply run plt.show() after calling this method.
In the following questions, you will be asked to implement the other methods in this file, using the provided get_points and create_plot.
(a) Implement the method train_three_kernels that uses the training data to train 3 kernel SVM models: linear, quadratic and RBF. For all models, set the penalty constant C to 1000 (a sketch of one possible implementation follows the bullets below).
• How are the 3 separating lines different from each other? You may support your answer with plots of the decision boundary.
• How many support vectors are there in each case?
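For orientation, here is a minimal sketch of what train_three_kernels could look like. It is only a sketch: the argument names and the returned dictionary are assumptions, not the skeleton's actual signature.

from sklearn.svm import SVC

def train_three_kernels(X_train, y_train):
    # Sketch only: the real method may take no arguments and call get_points itself.
    C = 1000
    models = {
        "linear": SVC(kernel="linear", C=C),
        "quadratic": SVC(kernel="poly", degree=2, C=C),
        "rbf": SVC(kernel="rbf", C=C),
    }
    for name, clf in models.items():
        clf.fit(X_train, y_train)
        # n_support_ holds the number of support vectors per class.
        print(name, "support vectors:", clf.n_support_.sum())
    return models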
(b) Implement the method linear_accuracy_per_C that trains a linear SVM model on the training set, while using cross-validation on the validation set to select the best penalty constant from C = 10^{-5}, 10^{-4}, ..., 10^{5}. Plot the accuracy of the resulting model on the training set and on the validation set, as a function of C (a sketch of one possible implementation follows the bullets below).
• What is the best C you found?
• Explain the behavior of error as a function of C. Support your answer with plots of the decision boundary.
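One possible way to implement this method is sketched below; the signature is an assumption (here the training and validation arrays are passed in explicitly).

import numpy as np
from sklearn.svm import SVC

def linear_accuracy_per_C(X_train, y_train, X_val, y_val):
    Cs = 10.0 ** np.arange(-5, 6)            # C = 1e-5, 1e-4, ..., 1e5
    train_acc, val_acc = [], []
    for C in Cs:
        clf = SVC(kernel="linear", C=C).fit(X_train, y_train)
        train_acc.append(clf.score(X_train, y_train))
        val_acc.append(clf.score(X_val, y_val))
    best_C = Cs[int(np.argmax(val_acc))]     # best C = highest validation accuracy
    return Cs, train_acc, val_acc, best_C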
(c) Implement the method rbf_accuracy_per_gamma that trains an RBF SVM model on the training set, while using cross-validation on the validation set to select the best coefficient γ = 1/σ². Start your search on the log scale, e.g., perform a grid search over γ = 10^{-5}, 10^{-4}, ..., 10^{5}, and increase the resolution until you are satisfied. Use C = 10 as the penalty constant for this section. Plot the accuracy of the resulting separating line on the training set and on the validation set, as a function of γ (a sketch follows the bullets below).
• What is the best γ you found?
• How does γ affect the decision rule? Support your answer with plots of the decision boundary.
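A sketch under the same assumptions as in part (b) is given below; after the coarse log-scale pass, you can rerun it with a finer grid around the best value.

import numpy as np
from sklearn.svm import SVC

def rbf_accuracy_per_gamma(X_train, y_train, X_val, y_val, gammas=None):
    if gammas is None:
        gammas = 10.0 ** np.arange(-5, 6)    # coarse grid: 1e-5, ..., 1e5
    train_acc, val_acc = [], []
    for gamma in gammas:
        clf = SVC(kernel="rbf", C=10, gamma=gamma).fit(X_train, y_train)
        train_acc.append(clf.score(X_train, y_train))
        val_acc.append(clf.score(X_val, y_val))
    best_gamma = gammas[int(np.argmax(val_acc))]
    return gammas, train_acc, val_acc, best_gamma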
2. Back-Propagation. In this exercise we will implement the back-propagation algorithm for training a neural network. We will work with the MNIST data set, which consists of 60000 28x28 grayscale images with pixel values between 0 and 1 (0 = white, 1 = black). The optimization problem we consider is that of a neural network with sigmoid activations and the cross-entropy loss. Namely, let x ∈ R^d be the input to the network (in our case d = 784) and denote a_0 = x and n_0 = 784. Then for 0 ≤ t ≤ L − 2, define
z_{t+1} = W_{t+1} a_t + b_{t+1},    a_{t+1} = h(z_{t+1}) ∈ R^{n_{t+1}},
and
z_L = W_L a_{L−1} + b_L,    a_L = e^{z_L} / Σ_i e^{(z_L)_i}    (the softmax of z_L),
where h is the sigmoid function applied element-wise on a vector (recall the sigmoid function h(z) = 1/(1 + e^{−z})), and W_{t+1} ∈ R^{n_{t+1}×n_t}, b_{t+1} ∈ R^{n_{t+1}} (n_t is the number of neurons in layer t). Denote by W the set of all parameters of the network. Then the output of the network (after the softmax) on an input x is given by a_L(x;W) ∈ R^{10} (n_L = 10).
Assume we have an MNIST training data set {(x_i, y_i)}_{i=1}^{m}, where x_i ∈ R^{784} is the 28x28 image given in vectorized form and y_i ∈ R^{10} is a one-hot label, e.g., (0,0,1,0,0,0,0,0,0,0) is the label for an image containing the digit 2. Define the log-loss on a single example (x,y) as ℓ_{(x,y)}(W) = −y · log a_L(x;W), where the logarithm is applied element-wise on the vector a_L(x;W). The loss we want to minimize is then the average over the training set, ℓ(W) = (1/m) Σ_{i=1}^{m} ℓ_{(x_i,y_i)}(W).
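To make the notation concrete, here is a small self-contained numpy sketch of the forward pass and the single-example loss defined above. The names are illustrative only and are unrelated to the provided network.py.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_and_loss(weights, biases, x, y):
    # weights = [W_1, ..., W_L], biases = [b_1, ..., b_L]; x and y are column vectors.
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):   # hidden layers: sigmoid activation
        a = sigmoid(W @ a + b)
    z_L = weights[-1] @ a + biases[-1]            # output layer pre-activation
    z_L = z_L - z_L.max()                         # shift for numerical stability
    a_L = np.exp(z_L) / np.exp(z_L).sum()         # softmax
    loss = -(y * np.log(a_L)).sum()               # cross-entropy with a one-hot y
    return a_L, loss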
The code for this exercise is given in the backprop.zip file on Moodle. The code consists of the following:
(a) data.py: Loads the MNIST data.
(b) network.py: Code for creating and training a neural network.
(c) main.py: Example of loading data, training a neural network and evaluating it on the test set.
(d) mnist.pkl.gz: MNIST data set.
The code in network.py contains the functionality of the training procedure except the code for back-propagation which is missing.
Here is an example of training a one-hidden layer neural network with 40 hidden neurons on a randomly chosen training set of size 10000. The evaluation is performed on a randomly chosen test set of size 5000. It trains for 30 epochs with mini-batch size 10 and constant learning rate 0.1.
training_data, test_data = data.load(train_size=10000,test_size=5000)
net = network.Network([784, 40, 10])
net.SGD(training_data, epochs=30, mini_batch_size=10, learning_rate=0.1, test_data=test_data)
(a) Implement the back-propagation algorithm in the backprop function in the Network class. The function receives as input a 784-dimensional vector x and a one-hot vector y. The function should return a tuple (db, dw) such that db contains a list of the derivatives of ℓ_{(x,y)} with respect to the biases and dw contains a list of the derivatives with respect to the weights. The element dw[i] (starting from 0) should contain the matrix ∂ℓ_{(x,y)}/∂W_{i+1} and db[i] should contain the vector ∂ℓ_{(x,y)}/∂b_{i+1}.
In order to check your code, you can use the finite-difference approximation of a one-dimensional derivative, f′(x) ≈ (f(x + ε) − f(x))/ε for small ε > 0, to verify that your partial derivative calculations are correct. You can use the loss function in the Network class to calculate ℓ_{(x,y)}(W).
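For example, a check of a single weight could look roughly like the sketch below. The attribute and method names (net.weights, net.backprop, net.loss) and the call signature of loss are assumptions about the provided Network class and may need to be adapted.

def check_one_weight(net, x, y, layer, i, j, eps=1e-6):
    # Analytic gradient from your back-propagation implementation.
    db, dw = net.backprop(x, y)
    analytic = dw[layer][i, j]
    # Numerical derivative: perturb a single weight and recompute the loss
    # (assumes loss accepts a list of (x, y) pairs; adapt if it does not).
    loss_orig = net.loss([(x, y)])
    net.weights[layer][i, j] += eps
    loss_plus = net.loss([(x, y)])
    net.weights[layer][i, j] -= eps              # restore the original weight
    numeric = (loss_plus - loss_orig) / eps
    print("analytic:", analytic, "numeric:", numeric)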
(b) Train a one-hidden-layer neural network as in the example given above (i.e., training set of size 10000, one hidden layer of size 40). Plot the training accuracy, training loss (ℓ(W)) and test accuracy across epochs (3 different plots). For the test accuracy you can use the one_label_accuracy function, for the training accuracy use the one_hot_accuracy function, and for the training loss you can use the loss function. All functions are in the Network class. The test accuracy in the final epoch should be above 80%.
(c) Now train the network on the whole training set and test on the whole test set:
training_data, test_data = data.load(train_size=50000,test_size=10000)
net = network.Network([784, 40, 10])
net.SGD(training_data, epochs=30, mini_batch_size=10, learning_rate=0.1, test_data=test_data)
Do not calculate the training accuracy and training loss as in the previous section (this is time consuming). What is the test accuracy in the final epoch (should be above 90%)?
(d) In this section we will train a deep network with 4 hidden layers using gradient descent (i.e., mini-batch size = training set size) and observe the vanishing gradient phenomenon. This occurs when the gradients of the layers closer to the input have lower norms than the gradients of the hidden layers closer to the output. In other words, the early hidden layers are learned more slowly than the later layers. In each epoch, calculate the Euclidean norms of the weight gradients, ‖∂ℓ(W)/∂W_{i+1}‖ (i.e., the norm of dw[i]), for 0 ≤ i ≤ 4. These are the gradients with respect to the whole training set; do not forget to divide by the size of the training set, as in ℓ(W). Train a 4-hidden-layer neural network with 30 hidden neurons in each layer as follows:
training_data, test_data = data.load(train_size=10000,test_size=5000)
net = network.Network([784, 30, 30, 30, 30, 10])
net.SGD(training_data, epochs=30, mini_batch_size=10000, learning_rate=0.1, test_data=test_data)
Plot these norms, for 0 ≤ i ≤ 4, across epochs (one plot). Using the expression of the gradients in the back-propagation algorithm and the derivative of the sigmoid, give a possible explanation for this phenomenon.
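Assuming the averaged gradients of an epoch are available as a list dw in the same format returned by backprop (already divided by the training set size), the norms to plot can be computed as follows.

import numpy as np

def weight_gradient_norms(dw):
    # Euclidean (Frobenius) norm of each layer's weight gradient, i = 0, ..., 4.
    return [float(np.linalg.norm(g)) for g in dw]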
3. Multilayer Perceptron. In this exercise you will implement a multilayer perceptron (MLP) network, and explore how training is affected by the network architecture. We will work with Keras, a popular high-level deep learning framework.
To start, download the files skeleton_mlp_mnist.py and mlp_main.py from Moodle. The file skeleton_mlp_mnist.py is an incomplete implementation of the MLP network and its training. The file mlp_main.py is the main program, which builds the network defined in the other file and trains and evaluates it on the MNIST dataset. It also stores useful graphs, with model accuracy and loss value as the y-axis and the number of epochs[1] or model complexity as the x-axis. See the help menu or the code for more details.
You should add code only at the designated places, as specified in the instructions. In your submission, make sure to include your code, written solution and output graphs.
(a) MLP is a basic neural network architecture that consists of at least three layers of nodes. Except for the input layer, a nonlinear activation function is applied at each layer. Implement the class method build_model_no_skip in skeleton_mlp_mnist.py. This method should build a network according to a given set of parameters. Specifically, it should build the following layers:
• k fully-connected layers of dimensions {d_1, ..., d_k}. For these layers, use the ReLU[2] activation function.
• An output layer whose size equals the number of classes. For this layer, we would like to use softmax[3] as the activation, in order to get a probability distribution over the classes.
In your implementation, create a Sequential model and use the Dense API[4]. Further details are provided in the code.
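A minimal sketch of such a method is shown below, assuming the hidden dimensions arrive as a list and the input dimension and number of classes are known; the actual skeleton may pass its parameters differently.

from keras.models import Sequential
from keras.layers import Dense

def build_model_no_skip(input_dim, hidden_dims, num_classes):
    model = Sequential()
    for i, d in enumerate(hidden_dims):
        if i == 0:
            # The first hidden layer also declares the input dimension.
            model.add(Dense(d, activation="relu", input_shape=(input_dim,)))
        else:
            model.add(Dense(d, activation="relu"))
    # Output layer: softmax yields a probability distribution over the classes.
    model.add(Dense(num_classes, activation="softmax"))
    return model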
Notice how we compile the model with SGD as the optimizer. This is a great benefit of deep learning frameworks, which ship with built-in automatic differentiation and various optimization methods.
(b) Perform a series of 5 experiments of increasing network complexity. Run the program in series mode, and specify an increasing number of hidden layers and hidden units, such that in each experiment the network has more parameters than the one before. For now, we will use relatively shallow networks, with no more than 6 layers and no more than 512 hidden units. Please keep all other parameters at their defaults (batch size, number of epochs, etc.).
Examining the output graph, how does adding more parameters affect the training?
(c) Now we will go deeper. Run the program in single mode, with 80 epochs and 60 hidden layers of 16 units each. Examining the two output graphs, what does the training process look like? Can you give a possible explanation?
(d) A naive implementation of deeper networks does not always work. A common practice for these cases is skip connections, which are connections that skip not one but multiple layers[5]. Concretely, if x is the value of a hidden layer before applying non-linearity σ, and z = F(x) is the value of another layer down the network, then a
skip connection between the two layers is created by replacing σ(z) with σ(x + z) = σ(x + F(x)) (see Figure 1 for an illustration).
Figure 1: Skip connection between two hidden layers; x and z are the layer values before applying the non-linearity σ.
Implement the class method build_model_skip in skeleton_mlp_mnist.py, which creates skip connections between every n layers. The parameter n is determined by the class parameter skips, which is set according to the program argument.
In this question, use the Model functional API[6] and not the Sequential method.
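As a starting point, here is one possible sketch using the functional API. The parameter names are placeholders, the way skip blocks are counted may differ from the intended implementation, and adding tensors requires the connected layers to have the same width (which holds in the 16-unit experiment above).

from keras.models import Model
from keras.layers import Input, Dense, Activation, Add

def build_model_skip(input_dim, hidden_dims, num_classes, skips):
    inputs = Input(shape=(input_dim,))
    a = inputs
    saved = None                                   # pre-activation saved at the start of a skip block
    for i, d in enumerate(hidden_dims):
        z = Dense(d)(a)                            # this layer's pre-activation F(x)
        if saved is not None and (i + 1) % skips == 0:
            z = Add()([saved, z])                  # sigma(x + F(x)): add before the non-linearity
            saved = None
        a = Activation("relu")(z)
        if saved is None:
            saved = z                              # the next skip block starts from here
    outputs = Dense(num_classes, activation="softmax")(a)
    return Model(inputs=inputs, outputs=outputs)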
(e) Run the same experiment from the previous question (c), but now with skip connections between every 5 layers (use the skips parameter). Compare the output graphs to the results in the previous question.