FMAN45 Assignment 3

1           Common Layers and Backpropagation
In this section you will implement forward and backward steps for a few common neural network layers.

Recall that the loss L : X × Y × W → R is a function of the network input x, the true value y and the network parameters w. When you train a neural net you have to evaluate the gradient ∂L/∂w with respect to all the parameters in the network. The algorithm to do this is called backpropagation.

1.1         Dense Layer
Introduction. Let a fixed, given layer in the network f = σ ∘ ℓ : R^m → R^n consist of a linear part ℓ : R^m → R^n and an activation function σ : R^n → R^n.

We represent the linear part ℓ by a weight matrix W and a bias vector b. Let x ∈ R^m be a data point that enters the given layer and let z = ℓ(x) ∈ R^n contain the values of the data point after having passed through the linear part of the layer. Then for all i,

z_i = Σ_{j=1}^{m} W_{ij} x_j + b_i.                                                     (1)

Using matrix notation we have z = Wx+b. Note that W and b are parameters that are trainable. Several such mappings, or layers, are concatenated to form a full network.
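For concreteness, a single forward evaluation of the linear part can be sketched in a few lines of Matlab; the dimensions below are arbitrary example values and not tied to the assignment code:

    % Minimal sketch of the linear part z = W*x + b for one input vector.
    % The dimensions are arbitrary example values.
    m = 4;              % input dimension
    n = 3;              % output dimension
    W = randn(n, m);    % trainable weight matrix
    b = randn(n, 1);    % trainable bias vector
    x = randn(m, 1);    % one input vector
    z = W*x + b;        % pre-activation values, an n-by-1 vector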

Computing the gradients. To derive the equations for backpropagation we want to express the gradient ∂L/∂x with respect to the layer input as an expression containing the gradient ∂L/∂z with respect to the pre-activation value z. Using the chain rule we get

∂L/∂x_j = Σ_{i=1}^{n} (∂L/∂z_i) (∂z_i/∂x_j).                                                      (2)

We also need to compute the gradient with respect to the parameters W and b. To do this, again use the chain rule,

∂L/∂W_{ij} = Σ_{k=1}^{n} (∂L/∂z_k) (∂z_k/∂W_{ij}),                                               (3)

∂L/∂b_i = Σ_{k=1}^{n} (∂L/∂z_k) (∂z_k/∂b_i).                                                      (4)

Exercise 1 (10 points): Derive expressions for ∂L/∂x, ∂L/∂W and ∂L/∂b in terms of ∂L/∂z, W and x. Include a full derivation of your results. The answers should all be given as matrix expressions without any explicit sums.
Checking the gradients. When you have code that uses gradients, it is very important to check that the gradients are correct. A bad way to conclude that the gradient is correct is to inspect the code and convince yourself that it looks right, or to check whether the loss seems to decrease when you run the optimization; a direction can be a descent direction even if it is not the gradient. A much better way is to compare against finite differences using the formula

∂L/∂x_i ≈ (L(x + ε e_i) − L(x − ε e_i)) / (2ε),                                        (5)

where e_i is a vector that is all zero except for position i, where it is 1, and ε is a small number. In the code there is a file tests/test_fully_connected.m where you can test your implementation.
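As an illustration of the idea, a central-difference gradient check as in (5) could look like the sketch below; the quadratic loss and the variable names are placeholders chosen for the example and are not part of the provided code.

    % Sketch of a finite-difference gradient check.
    % loss_fn is an example function handle with a known gradient.
    loss_fn = @(x) 0.5*sum(x.^2);    % example loss; its gradient is x
    x = randn(5, 1);
    grad = x;                        % analytic gradient to be verified
    eps_fd = 1e-6;                   % the small number eps in (5)
    grad_num = zeros(size(x));
    for i = 1:numel(x)
        e_i = zeros(size(x));
        e_i(i) = 1;                  % the basis vector e_i in (5)
        grad_num(i) = (loss_fn(x + eps_fd*e_i) - loss_fn(x - eps_fd*e_i)) / (2*eps_fd);
    end
    rel_err = norm(grad_num - grad) / max(norm(grad), eps);
    fprintf('relative error: %g\n', rel_err);   % should be close to zero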

Computing the gradients of batches. When you are training a neural network it is common to evaluate the network and compute gradients not with respect to just one element but with respect to N elements in a batch. We use superscripts to denote the elements of the batch. For the dense layer above, we have multiple inputs x^(1), x^(2), ..., x^(N) and we wish to compute z^(1) = Wx^(1) + b, z^(2) = Wx^(2) + b, ..., z^(N) = Wx^(N) + b. In the code for the forward pass you can see that the input array is first reshaped to a matrix X where each column contains all values for a single batch element, that is,

X = [x^(1)  x^(2)  ···  x^(N)].                                           (6)

For instance, if x^(1) is an image of size 5×5×3, it is reshaped to a long vector of length 75. We wish to compute

Z = [z^(1)  z^(2)  ···  z^(N)] = [Wx^(1) + b   Wx^(2) + b   ···   Wx^(N) + b].                                           (7)

When we are backpropagating to X we have to compute

∂L/∂X = [∂L/∂x^(1)  ∂L/∂x^(2)  ···  ∂L/∂x^(N)]                                             (8)

using

∂L/∂Z = [∂L/∂z^(1)  ∂L/∂z^(2)  ···  ∂L/∂z^(N)].                                          (9)

Use your expression obtained in Exercise 1 and simplify to matrix operations. For the parameters, we have that both W and b influence all elements z^(i), so for the parameters we compute ∂L/∂W and ∂L/∂b using the chain rule with respect to each element of each z^(i) and sum the contributions. More formally,

∂L/∂W_{jk} = Σ_{i=1}^{N} Σ_{l=1}^{n} (∂L/∂z_l^(i)) (∂z_l^(i)/∂W_{jk}),                                          (10)

∂L/∂b_j = Σ_{i=1}^{N} Σ_{l=1}^{n} (∂L/∂z_l^(i)) (∂z_l^(i)/∂b_j).                                          (11)

Note that the inner sum is the same as the one you computed in the previous exercise, so you just sum the expression you obtained there over all elements in the batch. You can implement the forward and backward passes with for-loops, but it is also possible to use matrix operations to vectorize the code, and for full credit you should do so. Potentially useful functions in Matlab include bsxfun, sub2ind, ind2sub, reshape and repmat.

Using for-loops can slow down your program significantly, since the data is then processed sequentially rather than in parallel. The aim is to parallelise the tasks in backpropagation by vectorising the Matlab code and avoiding a for-loop that runs over all the data points.
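As an illustration of what vectorisation over the batch means for the forward pass in (7), the loop over batch elements can be replaced by a single matrix product; the sketch below uses arbitrary example sizes and assumes X is the m×N matrix from (6).

    % Sketch: vectorised forward pass over a batch.
    % X is m-by-N with one input per column, W is n-by-m, b is n-by-1.
    m = 4; n = 3; N = 5;               % arbitrary example sizes
    W = randn(n, m); b = randn(n, 1);
    X = randn(m, N);
    Z = bsxfun(@plus, W*X, b);         % adds b to every column of W*X
    % On Matlab R2016b and later, implicit expansion allows simply Z = W*X + b;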

Exercise 2 (10 points): Derive expressions for Z, ∂L/∂X, ∂L/∂W and ∂L/∂b in terms of ∂L/∂Z, W, X and b. Include a full derivation of your results. Also add Matlab code that implements these vectorised expressions that you have just derived. In the code there are two files layers/fully_connected_forward.m and layers/fully_connected_backward.m. Implement these functions and check that your implementation is correct by running tests/test_fully_connected.m. For full credit your code should be vectorized over the batch. Include the relevant code in the report.
1.2         ReLU
The most commonly used nonlinearity is the rectified linear unit (ReLU). It is defined by

ReLU : R → R;        x_i ↦ z_i := max(x_i, 0).                               (12)
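Since the ReLU acts elementwise, it can be applied to a whole array at once; the following is a minimal sketch using logical indexing rather than loops, where the array X is just an example input.

    % Sketch: elementwise ReLU on an arbitrary example array, no loops.
    X = randn(3, 4);
    Z = max(X, 0);       % forward pass: max(x_i, 0) applied to every entry
    mask = X > 0;        % logical array marking the entries that were positive
    % The same kind of logical array can be used in the backward pass to
    % select which entries of the incoming gradient are passed on.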

Exercise 3 (10 points): Derive the backpropagation expression for ∂L/∂x in terms of ∂L/∂z. Give a full solution. Implement the layer in layers/relu_forward.m and layers/relu_backward.m. Test it with tests/test_relu.m. For full credit you must use built-in indexing and not use any for-loops. Include the relevant code in the report.
1.3         Softmax Loss
Suppose we have a vector x = [x_1, ..., x_m]^T ∈ R^m. The goal is to classify the input as one out of m classes; x_i is a score for class i, and a larger x_i is a higher score. By using the softmax function we can interpret the scores x_i as probabilities. Let

z_i = e^{x_i} / Σ_{j=1}^{m} e^{x_j}                                                          (13)

be the probability for class i. Now suppose that the ground truth class is c. Since the z_i are now probabilities, we can define the loss L to be minimized as the negative log likelihood,

L = −log(z_c).                                                          (14)
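When implementing (13) and (14), a standard trick is to subtract the largest score before exponentiating, which leaves the probabilities unchanged but avoids overflow. Below is a small sketch for a single score vector, not the batched interface used in the provided layer files.

    % Sketch: softmax probabilities and negative log likelihood for one
    % example score vector x with ground-truth class c.
    x = [2.0; -1.0; 0.5];           % example scores for m = 3 classes
    c = 1;                          % example ground-truth class index
    xs = x - max(x);                % shift scores; does not change (13)
    z = exp(xs) / sum(exp(xs));     % probabilities as in (13)
    L = -log(z(c));                 % negative log likelihood as in (14)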

Exercise 4 (10 points): Compute the expression for ∂L/∂x_i in terms of z_i. Write a full solution. Note that this is the final layer, so there are no gradients to backpropagate. Implement the layer in layers/softmaxloss_forward.m and layers/softmaxloss_backward.m. Test it with tests/test_softmaxloss.m. Make sure that tests/test_gradient_whole_net.m runs correctly when you have implemented all layers. For full credit you should use built-in functions for summation and indexing and not use any for-loops. Include the relevant code in the report.
2           Training a Neural Network
The function we are trying to minimize when we are training a neural net is

L(w) = (1/N) Σ_{i=1}^{N} L(x^(i), y^(i); w),                                        (15)

where w are all the parameters of the network, x^(i) is the input and y^(i) the corresponding ground truth, and L(x^(i), y^(i); w) is the loss for a single example, given for instance by a neural net with a softmax loss in the last layer. In practice, N is very large and we never evaluate the gradient over the entire sum. Instead we evaluate it over a batch with far fewer elements, because the network can be trained much faster if you use batches and update the parameters in every step.
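For example, a random batch of indices can be drawn with randperm; the sketch below uses illustrative names (X_train, y_train, batch_size) that are not those of the starter code.

    % Sketch: drawing a random mini-batch from N training examples.
    N = 1000; batch_size = 16;           % arbitrary example values
    X_train = randn(75, N);              % placeholder data, one column per example
    y_train = randi(10, 1, N);           % placeholder labels
    idx = randperm(N, batch_size);       % indices of the examples in the batch
    X_batch = X_train(:, idx);           % the inputs fed to the network
    y_batch = y_train(idx);              % the corresponding ground truth
    % The gradient of the loss averaged over this batch is then used to
    % update the parameters.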

If you are using gradient descent to train the network, the way the parameters are updated is

w_{n+1} = w_n − α ∂L/∂w,                                               (16)

where α is a hyperparameter called the learning rate. Since we only evaluate the gradient over a few examples, the estimated gradient in (16) might be very noisy. The idea behind gradient descent with momentum is to average the gradient estimates over time and use the smoothed gradient to update the parameters. The update in (16) is modified as

m_n = µ m_{n−1} + (1 − µ) ∂L/∂w,                                             (17)

w_{n+1} = w_n − α m_n,                                                       (18)

where m_n is a moving average of the gradient estimates and µ is a hyperparameter in the range 0 < µ < 1 controlling the smoothness.
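One step of the update (17)-(18) can be sketched as follows; the variable names are illustrative and not those used in training.m, and in the assignment a weight decay term would typically be added to the gradient estimate before the update.

    % Sketch of one gradient-descent-with-momentum step for a parameter
    % vector w; dLdw is the gradient estimated on the current batch.
    alpha = 0.01;                 % learning rate
    mu    = 0.9;                  % momentum coefficient, 0 < mu < 1
    w     = randn(10, 1);         % example parameter vector
    dLdw  = randn(10, 1);         % example gradient estimate
    m_avg = zeros(size(w));       % moving average, initialised to zero
    m_avg = mu*m_avg + (1 - mu)*dLdw;   % equation (17)
    w = w - alpha*m_avg;                % equation (18)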

Exercise 5 (10 points): Implement gradient descent with momentum in training.m. Remember to include weight decay. Include the relevant code in the report.
3           Classifying Handwritten Digits
You must have solved all previous problems before you can start working on this and the next problem.

In mnist_starter.m there is a simple baseline for the MNIST dataset. Read the comments in the code carefully. It reaches about 98 % accuracy on the test set (this varies a bit from run to run). Validate this to make sure that the code you have written so far is correct.

Exercise 6 (25 points): Plot the filters the first convolutional layer learns. Plot a few images that are misclassified. Plot the confusion matrix for the predictions on the test set and compute the precision and the recall for all digits. Write down the number of parameters for all layers in the network. Write comments about all plots and figures.
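As a sketch, given vectors of true and predicted labels (named y_true and y_pred here purely for illustration), the confusion matrix, precision and recall can be computed without any toolbox functions:

    % Sketch: confusion matrix, precision and recall for 10 classes,
    % given label vectors y_true and y_pred with values in 1..10.
    num_classes = 10;
    y_true = randi(num_classes, 100, 1);   % placeholder labels
    y_pred = randi(num_classes, 100, 1);   % placeholder predictions
    C = accumarray([y_true, y_pred], 1, [num_classes, num_classes]);
    % C(i,j) counts examples with true class i that were predicted as class j.
    precision = diag(C) ./ sum(C, 1)';     % correct predictions / all predictions of each class
    recall    = diag(C) ./ sum(C, 2);      % correct predictions / all examples of each class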
4           Classifying Tiny Images
In cifar10_starter.m there is a simple network that can be trained on the CIFAR10 dataset. This baseline gives an accuracy of about 48 % after training for 5000 iterations. Note that it is much harder to get good classification results on this dataset than on MNIST, mainly due to significant intra-class variation.

Exercise 7 (25 points): Do whatever you want to improve the accuracy of the baseline network. Write what you have tried in the report, even experiments that did not work out. For your final model, compute and report all plots and numbers that were required in the previous exercise and comment on them.
