1 Backpropagation with logistic loss [40 points]
Let fw : RD → R be a single-hidden-layer, fully connected, neural network with no biases and parameters w = {W(1),w(2)}, such that its forward pass computes
z(1) = W(1)⊤x,  a(1) = σ(z(1)), (1)
z(2) = w(2)⊤a(1),  ŷ = fw(x) = σ(z(2)). (2)
Here, σ(·) denotes a sigmoid activation. In what follows, we will assume D = 5 and K = 6.
1. [5 points] Implement a function predict(X,W) that given a batch of B samples (an array X of size B×D) and a dictionary of weights (W = {w_1: D × K, w_2: K × 1}) computes the set of outputs ŷ ∈ RB, one prediction per sample. Do not use for loops.
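A minimal NumPy sketch of such a forward pass is given below; the dictionary keys 'w_1' and 'w_2' are assumptions about how the weights are stored.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, W):
    """Forward pass for a batch X of shape (B, D), following eqs. (1)-(2).

    Assumes W['w_1'] has shape (D, K) and W['w_2'] has shape (K, 1).
    """
    z1 = X @ W['w_1']            # (B, K) hidden pre-activations
    a1 = sigmoid(z1)             # (B, K) hidden activations
    z2 = a1 @ W['w_2']           # (B, 1) output logits
    return sigmoid(z2).ravel()   # (B,) one prediction per sample
```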
2. [5 points] Implement a function logistic_loss(y, y_hat) that given a vector of labels y ∈ {0,1}B and a vector of predictions ŷ ∈ RB computes the average logistic loss of a batch. Compute the average loss for the case ŷ = 0 and y = 0. Explain your result. Reminder : Given a true label y ∈ {0,1} and a soft prediction ŷ ∈ [0,1], the logistic loss is defined as
L(y, ŷ) = −y log(ŷ) − (1 − y) log(1 − ŷ). (3)
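For reference, a direct (non-stabilized) NumPy implementation of eq. (3) could look as follows; note that np.log(0) evaluates to −inf, which is relevant when explaining the ŷ = 0, y = 0 case.

```python
import numpy as np

def logistic_loss(y, y_hat):
    """Average logistic loss of a batch, computed directly from eq. (3)."""
    losses = -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)
    return np.mean(losses)
```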
3. [6 points] Implement another function stable_logistic_loss(y, z_2) that given a vector of labels y ∈ {0,1}B and a vector of logits z(2) ∈ RB computes the average logistic loss of a batch. Make sure that your function is stable. Compute the average loss for the case when z(2) = −10¹⁰ · 1 and y = 0. Explain your result. Hint : You can use np.logaddexp.
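One possible stable implementation rewrites the loss in terms of the logits, using the identity L(y, σ(z)) = log(1 + exp(z)) − y·z and evaluating the log-sum-exp with np.logaddexp:

```python
import numpy as np

def stable_logistic_loss(y, z_2):
    """Average logistic loss computed from the logits z_2 (numerically stable)."""
    # log(1 + exp(z)) is evaluated as logaddexp(0, z) to avoid overflow.
    losses = np.logaddexp(0.0, z_2) - y * z_2
    return np.mean(losses)
```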
4. [12 points] Derive analytically, using the backpropagation algorithm, the expressions for the partial derivatives of the loss with respect to the weights, i.e.,
∂L/∂W(1) and ∂L/∂w(2), (4)
when the loss is computed using a single pair (x,y). Recall that we are using the logistic loss, and that the final implementation should be stable.
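As a starting point (a sketch of the structure only, not the full derivation), the chain rule factors the two gradients through the intermediate quantities of equations (1)-(2):

```latex
\frac{\partial L}{\partial w^{(2)}}
  = \frac{\partial L}{\partial z^{(2)}}\,
    \frac{\partial z^{(2)}}{\partial w^{(2)}},
\qquad
\frac{\partial L}{\partial W^{(1)}}
  = \frac{\partial L}{\partial z^{(2)}}\,
    \frac{\partial z^{(2)}}{\partial a^{(1)}}\,
    \frac{\partial a^{(1)}}{\partial z^{(1)}}\,
    \frac{\partial z^{(1)}}{\partial W^{(1)}}.
```

For the logistic loss composed with a sigmoid output, the common factor ∂L/∂z(2) simplifies to ŷ − y, which is the property that makes a numerically stable implementation possible.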
5. [12 points] Implement a function gradient(X,y,W) which computes the gradient (i.e., vector of partial derivatives) of the average loss with respect to all the weights for a batch of B samples. Do not use for loops.
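A vectorized NumPy sketch is given below, reusing the weight-dictionary keys assumed in predict() above:

```python
import numpy as np

def gradient(X, y, W):
    """Gradient of the average logistic loss over a batch of B samples.

    Assumes W['w_1'] has shape (D, K) and W['w_2'] has shape (K, 1);
    returns a dictionary of gradients with matching shapes.
    """
    B = X.shape[0]

    # Forward pass (repeated here so the function is self-contained).
    z1 = X @ W['w_1']                      # (B, K)
    a1 = 1.0 / (1.0 + np.exp(-z1))         # (B, K)
    z2 = a1 @ W['w_2']                     # (B, 1)
    y_hat = 1.0 / (1.0 + np.exp(-z2))      # (B, 1)

    # Backward pass: for logistic loss + sigmoid, dL/dz2 = (y_hat - y) / B.
    dz2 = (y_hat - y.reshape(-1, 1)) / B   # (B, 1)
    grad_w2 = a1.T @ dz2                   # (K, 1)

    da1 = dz2 @ W['w_2'].T                 # (B, K)
    dz1 = da1 * a1 * (1.0 - a1)            # (B, K), sigmoid derivative
    grad_w1 = X.T @ dz1                    # (D, K)

    return {'w_1': grad_w1, 'w_2': grad_w2}
```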
2 Classifying FashionMNIST using neural networks [30 points]
In this exercise you will play with the dataset called FashionMNIST. Some examples are shown in figure 1a. You can use code snippets from the labs. Note that the quantitative results are not as important as the qualitative ones.
1. [4 points] Load the dataset and construct the dataloaders for train, validation and test, and visualize the data. Use 50000 and 10000 images for train and validation, respectively. Which transform is necessary for the samples to be compatible with the models we will create?
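One possible data pipeline is sketched below (batch size and split seed are arbitrary choices of this sketch); transforms.ToTensor() converts the PIL images into float tensors, which is the kind of transform the question refers to.

```python
import torch
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms

transform = transforms.ToTensor()   # PIL image -> float tensor in [0, 1]

train_full = datasets.FashionMNIST('data', train=True, download=True, transform=transform)
test_set   = datasets.FashionMNIST('data', train=False, download=True, transform=transform)

# 50000 / 10000 split for train and validation, as specified above.
train_set, val_set = random_split(train_full, [50000, 10000],
                                  generator=torch.Generator().manual_seed(0))

train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
val_loader   = DataLoader(val_set, batch_size=64, shuffle=False)
test_loader  = DataLoader(test_set, batch_size=64, shuffle=False)
```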
2. [6 points] MultiLayer Perceptron (MLP) : Construct a two-hidden-layer MLP with 100 neurons (per layer), ReLU activation functions and a linear output layer to classify FashionMNIST. Train this model for 20 epochs using the cross-entropy loss and the following optimizers : SGD (lr=0.01), SGD with momentum (lr=0.01, momentum=0.9, nesterov=True), Adam (lr=0.01) and Adam (lr=1). Plot the training and validation learning curves (loss against steps and epochs) on a single plot. Comment on your results.
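A minimal model definition matching this description might look as follows (the training loop itself can follow the lab code):

```python
import torch.nn as nn

mlp = nn.Sequential(
    nn.Flatten(),              # (B, 1, 28, 28) -> (B, 784)
    nn.Linear(28 * 28, 100),
    nn.ReLU(),
    nn.Linear(100, 100),
    nn.ReLU(),
    nn.Linear(100, 10),        # linear output layer; CrossEntropyLoss expects logits
)
```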
3. [6 points] Convolutional Neural Network (CNN) : Construct a CNN with three convolutional layers (kernel size=3) and 16, 32, and 64 channels, respectively; a non-linearity and one max-pooling layer (kernel size=2) after every convolution; and a final fully connected layer. Train this model with the same four configurations as before, and plot the training and validation learning curves on a single plot. Comment on your results.
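A possible sketch of such a CNN is shown below; padding=1 is an assumption made here to keep the spatial sizes easy to track (28 → 14 → 7 → 3).

```python
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(64 * 3 * 3, 10),   # final fully connected layer over the 10 classes
)
```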
4. [6 points] Create a function that computes the number of parameters of a given model. Show the number of parameters for the two models you have used. Do more parameters translate to better performance? Explain.
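One way to implement such a function is to sum the element counts of all trainable parameters:

```python
def count_parameters(model):
    """Number of trainable parameters of a PyTorch model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Example usage with the two models above:
# print(count_parameters(mlp), count_parameters(cnn))
```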
5. [8 points] PermutedFashionMNIST¹ : In this version of the dataset, the pixels are randomly permuted. Visualize the new dataset. Train one MLP and one CNN using SGD with momentum (lr=0.01, momentum=0.9, nesterov=True). What do you observe? Is one model more affected than the other? Explain.
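If the lab code does not already provide the RandomPermutation transform mentioned in the footnote, a possible implementation (with a fixed seed so that every image is permuted in the same way) is sketched below:

```python
import torch

class RandomPermutation:
    """Applies one fixed random permutation of the pixels to every image."""
    def __init__(self, seed=0, size=28 * 28):
        g = torch.Generator().manual_seed(seed)
        self.perm = torch.randperm(size, generator=g)

    def __call__(self, img):
        # Placed after ToTensor, so img is a tensor of shape (C, H, W).
        c, h, w = img.shape
        return img.reshape(c, -1)[:, self.perm].reshape(c, h, w)

# transform = transforms.Compose([transforms.ToTensor(), RandomPermutation()])
```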
6. (Bonus [5 points]) Using any of the tips and tricks seen in class or during the exercise sessions, optimize the validation performance on the original FashionMNIST and report both validation and test performance. Also feel free to explore things beyond the ones seen in class. In any case, explain your decisions. Did you manage to improve the performance? Why?
1. Add the transformation RandomPermutation to the list of transforms. This transform should be the last one.
Figure 1 – Samples and models seen in the exercises: (a) FashionMNIST samples, (b) MultiMNIST example, (c) Multi-Task model.
3 Multi-Task Learning with MultiMNIST [30 points]
In this exercise you will create a dataset with two tasks from scratch and train a multi-task model.
1. [7 points] Create a new sample. Implement the function make_new_sample(x1: torch.Tensor, x2: torch.Tensor) -> torch.Tensor that takes two MNIST samples and creates a new MultiMNIST sample (a code sketch follows the list below). An example is shown in figure 1b. You can use the following procedure :
— Create an empty tensor of size 36 × 36
— Place one image on the top-left corner. Reminder : An MNIST image has dimensions 28 × 28.
— Place the other image on the bottom-right corner. For the overlapping pixels take the max value.
Visualize five samples with corresponding labels.
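A possible sketch of make_new_sample, assuming the inputs are 28 × 28 tensors (e.g. after removing the channel dimension):

```python
import torch

def make_new_sample(x1: torch.Tensor, x2: torch.Tensor) -> torch.Tensor:
    """Combines two 28x28 MNIST images into one 36x36 MultiMNIST image."""
    out = torch.zeros(36, 36)
    out[:28, :28] = x1                              # top-left corner
    out[8:, 8:] = torch.maximum(out[8:, 8:], x2)    # bottom-right, max on overlap
    return out
```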
2. [5 points] Create the new dataset. Repeat the above procedure 60000 and 10000 times to create the full train and test datasets, respectively. The MultiMNIST train images must be created from samples of the MNIST train dataset, and similarly for the test dataset. Hint : Use TensorDataset to store the images and their labels.
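Building on make_new_sample above, one possible construction of the full datasets is sketched below (random pairing with a fixed seed is an assumption of this sketch):

```python
import torch
from torch.utils.data import TensorDataset
from torchvision import datasets, transforms

def make_multimnist(mnist, n_samples, seed=0):
    """Builds a MultiMNIST TensorDataset with one label per task."""
    g = torch.Generator().manual_seed(seed)
    images, y1, y2 = [], [], []
    for _ in range(n_samples):
        i, j = torch.randint(len(mnist), (2,), generator=g).tolist()
        xi, li = mnist[i]
        xj, lj = mnist[j]
        images.append(make_new_sample(xi.squeeze(0), xj.squeeze(0)))
        y1.append(li)
        y2.append(lj)
    return TensorDataset(torch.stack(images), torch.tensor(y1), torch.tensor(y2))

# mnist_train = datasets.MNIST('data', train=True, download=True, transform=transforms.ToTensor())
# multi_train = make_multimnist(mnist_train, 60000)
```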
3. [9 points] Create the Multi-Task model. Adapt the Convolutional Neural Network seen in class to have two outputs by implementing the model in figure 1c. The encoder has two convolutional layers of 10 and 20 channels, respectively, and a Linear layer with 50 output features. Each convolutional layer has kernel size of 5 and is followed by ReLU and Maxpool. Each decoder is a linear layer.
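A sketch of the model in figure 1c, assuming no padding in the convolutions (spatial sizes 36 → 16 after the first conv/pool block and 16 → 6 after the second):

```python
import torch.nn as nn

class MultiTaskCNN(nn.Module):
    """Shared encoder with one linear decoder per task."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 10, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),   # 36 -> 16
            nn.Conv2d(10, 20, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),  # 16 -> 6
            nn.Flatten(),
            nn.Linear(20 * 6 * 6, 50),
        )
        self.decoder1 = nn.Linear(50, 10)   # task 1: top-left digit
        self.decoder2 = nn.Linear(50, 10)   # task 2: bottom-right digit

    def forward(self, x):
        h = self.encoder(x)
        return self.decoder1(h), self.decoder2(h)
```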
4. [9 points] Train the Multi-Task model. Train the multi-task model and report the accuracy and loss per task. Use the average of task losses as your objective. Hint : Adapt the training and predict functions seen in class to handle two predictions, two losses etc.
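The core of the adapted training loop could look as follows (accuracy tracking and per-task reporting are omitted for brevity; batch size, learning rate, and the SGD-with-momentum optimizer are assumptions of this sketch):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train_multitask(model, dataset, epochs=5, lr=0.01):
    """Trains with the average of the two task losses as the objective."""
    loader = DataLoader(dataset, batch_size=64, shuffle=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    for epoch in range(epochs):
        for x, y1, y2 in loader:                # (image, label task 1, label task 2)
            out1, out2 = model(x.unsqueeze(1))  # add the channel dimension
            loss = 0.5 * (criterion(out1, y1) + criterion(out2, y2))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```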