1 [25 points] Neural Network Layer Implementation
In this problem, you will get the opportunity to implement various neural network layers. X and Y represent the input and the output of a layer, respectively. For this problem, we use row-major notation (i.e., each row of X and Y corresponds to one data sample) to be compatible with the multi-dimensional array structure of Numpy. Furthermore, L is the scalar-valued loss function of Y.
For each layer, you should include your derivation of the gradient in your written solution. In addition to the derivations, implement the corresponding forward and backward passes in layers.py. More importantly, please ensure that you use vectorization to make your layer implementations as efficient as possible. You should use autograder.py to help you check the correctness of your implementation. Note, however, that for each layer the autograder only checks the result for one specific example; passing these tests does not necessarily mean that your implementation is 100% correct.
Important Note: In this question, any indices involved (i,j, etc.) in the mathematical notation start from 1, and not 0. In your code, however, you should use 0-based indexing (standard Numpy notation).
(a) [15 points] Fully-Connected Layer
In this question, you will be deriving the gradient of a fully-connected layer. For this problem, consider $X \in \mathbb{R}^{N \times D_{in}}$, where N is the number of samples in a batch. Additionally, consider a dense layer with weight parameters $W \in \mathbb{R}^{D_{in} \times D_{out}}$ and bias parameters $b \in \mathbb{R}^{1 \times D_{out}}$.
As we saw in the lecture, we can compute the output of the layer through a simple matrix multiply operation. Concretely, this can be seen as $Y = XW + B$. In this equation, we denote the output matrix as $Y \in \mathbb{R}^{N \times D_{out}}$ and the bias matrix as $B \in \mathbb{R}^{N \times D_{out}}$, where each row of B is the vector b. (Please note that we are using row-vector notation here to make it compatible with Numpy/PyTorch, but the lecture slides used column-vector notation, which is standard notation for most ML methods. We hope that the distinction is clear from the context.)
Please note that the matrix multiply operation stated above is generally useful for computing batch outputs (i.e., simultaneously computing the output of all N samples in the batch). However, for the purposes of computing the gradient, you might first want to consider the single-sample output of this fully-connected layer, which can be stated in row-major notation as $y^{(n)} = x^{(n)} W + b$, where $y^{(n)} \in \mathbb{R}^{1 \times D_{out}}$ and $x^{(n)} \in \mathbb{R}^{1 \times D_{in}}$ for $1 \le n \le N$. Here, the superscript "(n)" denotes the n-th example, not the layer index, and $x^{(n)}_i$ denotes the i-th dimension of the n-th example $x^{(n)}$.
Note: it might be fruitful for you to consider this expression as a summation and then calculate the gradient for individual parameters for a single-example case. After this, you can try to extend it to a vector/matrix format for the involved parameters.
Now, compute the partial derivatives (in matrix form) $\frac{\partial L}{\partial W} \in \mathbb{R}^{D_{in} \times D_{out}}$, $\frac{\partial L}{\partial b} \in \mathbb{R}^{1 \times D_{out}}$, and $\frac{\partial L}{\partial X} \in \mathbb{R}^{N \times D_{in}}$ in terms of $\frac{\partial L}{\partial Y} \in \mathbb{R}^{N \times D_{out}}$. For technical correctness, you might want to start by writing the gradient in non-vectorized form involving $\frac{\partial L}{\partial y^{(n)}_j}$, where $\frac{\partial L}{\partial y^{(n)}_j}$ is the gradient with respect to the j-th element of the n-th sample of Y, with $1 \le n \le N$ and $1 \le j \le D_{out}$. In addition, you might find the Kronecker delta useful in this derivation.
Hint for $\frac{\partial L}{\partial W}$: Please note that we use "matrix/vector" notation to denote a Jacobian matrix in $\mathbb{R}^{D_{in} \times D_{out}}$ where the element in the i-th row and j-th column is $\frac{\partial L}{\partial W_{i,j}}$, and $W_{i,j}$ is the element in the i-th row and j-th column of W, with $1 \le i \le D_{in}$ and $1 \le j \le D_{out}$. Here, you would probably want to calculate the gradient using the chain-rule formula
$$\frac{\partial L}{\partial W_{i,j}} = \sum_{n=1}^{N} \sum_{k=1}^{D_{out}} \frac{\partial L}{\partial y^{(n)}_k} \frac{\partial y^{(n)}_k}{\partial W_{i,j}},$$
and then vectorize to get $\frac{\partial L}{\partial W}$.
Hint for $\frac{\partial L}{\partial b}$: Please note that $\frac{\partial L}{\partial b}$ is a vector in $\mathbb{R}^{1 \times D_{out}}$. Ideally, you might want to start by using the formula
$$\frac{\partial L}{\partial b_j} = \sum_{n=1}^{N} \sum_{k=1}^{D_{out}} \frac{\partial L}{\partial y^{(n)}_k} \frac{\partial y^{(n)}_k}{\partial b_j}$$
and then move on to derive $\frac{\partial L}{\partial b}$.
Hint for $\frac{\partial L}{\partial X}$: Please note that $\frac{\partial L}{\partial X}$ is a matrix in $\mathbb{R}^{N \times D_{in}}$. In order to derive this gradient, you can start by using the formula
$$\frac{\partial L}{\partial x^{(n)}_i} = \sum_{n'=1}^{N} \sum_{j=1}^{D_{out}} \frac{\partial L}{\partial y^{(n')}_j} \frac{\partial y^{(n')}_j}{\partial x^{(n)}_i}.$$
Following this, you can say that $\frac{\partial L}{\partial x^{(n)}_i} = \sum_{j=1}^{D_{out}} \frac{\partial L}{\partial y^{(n)}_j} \frac{\partial y^{(n)}_j}{\partial x^{(n)}_i}$, because the n-th sample in X is only related to the n-th sample in Y and $\frac{\partial y^{(n')}_j}{\partial x^{(n)}_i} = 0$ for all $n' \ne n$.
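For reference, once the derivation is complete, the vectorized forward and backward passes take only a few lines of Numpy. The sketch below is a minimal example; the function names fc_forward/fc_backward and the cache convention are assumptions, so match whatever interface layers.py actually specifies.

```python
import numpy as np

def fc_forward(X, W, b):
    # X: (N, D_in), W: (D_in, D_out), b: (1, D_out).
    # Y = XW + b; broadcasting adds b to every row, i.e. the matrix B above.
    out = X @ W + b
    cache = (X, W)
    return out, cache

def fc_backward(dout, cache):
    # dout is dL/dY with shape (N, D_out).
    X, W = cache
    dX = dout @ W.T                        # (N, D_in)
    dW = X.T @ dout                        # (D_in, D_out)
    db = dout.sum(axis=0, keepdims=True)   # (1, D_out)
    return dX, dW, db
```

Note how db sums over the batch dimension: every row of B is the same vector b, so each sample's gradient contributes to it.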
(b) [5 points] ReLU
Let $X \in \mathbb{R}^{N \times D}$ be a 2-D array and Y = ReLU(X). In this case, ReLU is applied to X in an element-wise fashion. For an element x ∈ X, the corresponding output is y = ReLU(x) = max(0, x). You can observe that Y will have the same shape as X. Express $\frac{\partial L}{\partial X}$ in terms of $\frac{\partial L}{\partial Y}$, where $\frac{\partial L}{\partial X}$ has the same shape as X. Now, implement the relu_forward and relu_backward methods in layers.py.
Hint: you may need to use the element-wise (Hadamard) product ⊙ to express $\frac{\partial L}{\partial X}$.
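As a reference point, a minimal Numpy sketch of the two methods might look as follows (the exact signatures in layers.py may differ):

```python
import numpy as np

def relu_forward(X):
    # Element-wise max(0, x); the output has the same shape as X.
    out = np.maximum(0, X)
    cache = X
    return out, cache

def relu_backward(dout, cache):
    X = cache
    # dL/dX = dL/dY ⊙ 1[X > 0]: gradient flows only where the input was positive.
    dX = dout * (X > 0)
    return dX
```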
(c) [0 points] Convolution
This section contains some useful information that you will use in Q3 to implement the conv_forward and conv_backward methods. Note that while this particular sub-part is worth 0 points, you will be getting points for implementation of these methods in Q3. Additionally, for better visualization of the convolution operations, you can use the following link: https://bit.ly/3H0Rjhn.
Consider the following 2D arrays/tensors: $a \in \mathbb{R}^{H_a \times W_a}$ and $b \in \mathbb{R}^{H_b \times W_b}$. Note, a is the image and b is the filter, with $H_a > H_b$ and $W_a > W_b$. Now, consider the following definitions.
Valid convolution: In this problem, we can define the valid convolution as follows:
$$(a *_{\text{valid}} b)_{i,j} = \sum_{m=i}^{i+H_b-1} \sum_{n=j}^{j+W_b-1} a_{m,n}\, b_{i-m+H_b,\; j-n+W_b}$$
Here, 1 ≤ i ≤ Ha − Hb + 1 and 1 ≤ j ≤ Wa − Wb + 1. Please note that the convolution operation we discussed in class is valid convolution, and does not involve any zero padding. This operation produces an output of size (Ha − Hb + 1) × (Wa − Wb + 1).
Filtering: Moreover, it might also be useful to consider the filtering operation ∗filt, defined by:
$$(a *_{\text{filt}} b)_{i,j} = \sum_{m=i}^{i+H_b-1} \sum_{n=j}^{j+W_b-1} a_{m,n}\, b_{m-i+1,\; n-j+1} = \sum_{p=1}^{H_b} \sum_{q=1}^{W_b} a_{i+p-1,\; j+q-1}\, b_{p,q}$$
Here, 1 ≤ i ≤ Ha − Hb + 1 and 1 ≤ j ≤ Wa − Wb + 1. Please note that the filtering operation generates an output of size (Ha − Hb + 1) × (Wa − Wb + 1). In summary, the filtering operation is similar to the valid convolution, except that the filter is not flipped when computing the weighted sum. You can think of filtering as an inner product between the Hb ×Wb sized kernel and image patch of the same size inside the input image where we don’t flip the kernel.
Full convolution: Finally, for deriving the gradient through the convolution layer, the full convolution operation may be useful. This operation is defined as follows:
$$(a *_{\text{full}} b)_{i,j} = \sum_{m=i-H_b+1}^{i} \sum_{n=j-W_b+1}^{j} a_{m,n}\, b_{i-m+1,\; j-n+1}$$
Here, 1 ≤ i ≤ Ha+Hb−1 and 1 ≤ j ≤ Wa+Wb−1. The full convolution can be thought of as zero padding a on all sides with one less than the size of the kernel, and then performing valid convolution using the modified input tensor a. Concretely, this means that we will pad a by Hb−1 rows on the top and bottom, followed by Wb −1 columns on the left and right. In the definition of full convolution, am,n = 0 if m < 1 or n < 1 or m > Ha or n > Wa. This operation produces an output of size (Ha+Hb−1)×(Wa+Wb−1).
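As an optional sanity check (not part of the assignment), SciPy's 2-D signal routines line up with these three definitions: convolve2d flips the kernel while correlate2d does not, so ∗valid and ∗full correspond to convolve2d with modes "valid" and "full", and ∗filt corresponds to correlate2d with mode "valid".

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

a = np.arange(16.0).reshape(4, 4)   # image, Ha = Wa = 4
b = np.arange(4.0).reshape(2, 2)    # filter, Hb = Wb = 2

valid = convolve2d(a, b, mode="valid")    # *valid: kernel flipped, (3, 3) output
filt = correlate2d(a, b, mode="valid")    # *filt: no flip, (3, 3) output
full = convolve2d(a, b, mode="full")      # *full: zero-padded, (5, 5) output

# Filtering equals valid convolution with a doubly-flipped kernel.
assert np.allclose(filt, convolve2d(a, b[::-1, ::-1], mode="valid"))
```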
Here, assume the input to the layer to be $X \in \mathbb{R}^{N \times C \times H \times W}$, where N is the number of images in the batch, C is the number of channels, H is the height of the image, and W is the width of the image. Furthermore, consider a convolutional kernel $K \in \mathbb{R}^{F \times C \times H' \times W'}$, where F represents the number of filters present in this layer. The output of this layer is given by $Y \in \mathbb{R}^{N \times F \times H'' \times W''}$, where $H'' = H - H' + 1$ and $W'' = W - W' + 1$.
Backpropagation through the convolution layer: We now consider the output of the layer $Y_{n,f}$, which can be defined as $Y_{n,f} = \sum_{c=1}^{C} X_{n,c} *_{\text{valid}} \bar{K}_{f,c}$. In this case, $\bar{K}_{f,c}$ represents the flipped filter, which can be more concretely defined for each element as $\bar{K}_{f,c,i,j} = K_{f,c,\,H'+1-i,\,W'+1-j}$. This means that the output expression $\sum_{c=1}^{C} X_{n,c} *_{\text{valid}} \bar{K}_{f,c}$ can also be represented by the expression $\sum_{c=1}^{C} X_{n,c} *_{\text{filt}} K_{f,c}$. Therefore, you can implement $Y_{n,f} = \sum_{c=1}^{C} X_{n,c} *_{\text{filt}} K_{f,c}$ in the conv_forward function in the layers.py script.
Finally, considering the upstream gradient to be denoted by $\frac{\partial L}{\partial Y}$, use the following formulae to implement the backward pass in layers.py for Q3. Note: since this question is worth 0 points, you do not need to prove these results; you can directly use them in your implementation:
$$\frac{\partial L}{\partial X_{n,c}} = \sum_{f=1}^{F} \frac{\partial L}{\partial Y_{n,f}} *_{\text{full}} K_{f,c} \qquad \text{and} \qquad \frac{\partial L}{\partial K_{f,c}} = \sum_{n=1}^{N} X_{n,c} *_{\text{filt}} \frac{\partial L}{\partial Y_{n,f}}$$
Note, however, that you can attempt Q6 to earn extra credit, where you will prove the results presented above. Please go to the end of the document to learn more about this.
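To make the formulas concrete, below is a minimal loop-based Numpy sketch of the forward filtering operation and the two backward formulas above. The signatures conv_forward(X, K) and conv_backward(dY, cache) are assumptions (the real layers.py interface may carry stride/padding and a different cache), and for Q3 you will want a more heavily vectorized implementation:

```python
import numpy as np

def conv_forward(X, K):
    # X: (N, C, H, W); K: (F, C, H', W').  Y[n, f] = sum_c X[n, c] *filt K[f, c].
    N, C, H, W = X.shape
    F, _, HK, WK = K.shape
    HO, WO = H - HK + 1, W - WK + 1
    Y = np.zeros((N, F, HO, WO))
    for i in range(HO):
        for j in range(WO):
            patch = X[:, :, i:i + HK, j:j + WK]         # (N, C, HK, WK)
            Y[:, :, i, j] = np.tensordot(patch, K, axes=([1, 2, 3], [1, 2, 3]))
    return Y, (X, K)

def conv_backward(dY, cache):
    # dY: (N, F, H'', W'') = dL/dY.
    X, K = cache
    F, C, HK, WK = K.shape
    _, _, HO, WO = dY.shape
    dX = np.zeros_like(X, dtype=float)
    dK = np.zeros_like(K, dtype=float)
    for i in range(HO):
        for j in range(WO):
            g = dY[:, :, i, j]                           # (N, F)
            # dL/dK_{f,c} = sum_n X_{n,c} *filt dL/dY_{n,f}
            dK += np.tensordot(g, X[:, :, i:i + HK, j:j + WK], axes=([0], [0]))
            # dL/dX_{n,c} = sum_f dL/dY_{n,f} *full K_{f,c}
            dX[:, :, i:i + HK, j:j + WK] += np.tensordot(g, K, axes=([1], [0]))
    return dX, dK
```

The backward pass is written in "scatter" form: accumulating each output position's gradient into the input patch that produced it is equivalent to the full-convolution formula above.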
Implementation note for questions 2-4
To solve questions 2-4, please read through solver.py and familiarize yourself with the API. To build the models, please use the intermediate layers you implemented in layers.py. After doing so, use a Solver instance to train the models. Additionally, please vectorize your code to ensure that your networks can train in a reasonable amount of time.
2 [20 points] Multi-class classification with Softmax
(a) Implement the softmax loss layer softmax_loss in layers.py. This function outputs the softmax loss averaged over all input examples; a reference sketch appears after this list.
(b) Implement softmax multi-class classification using the starter code provided in softmax.py.
(c) Train a two-layer neural network with a softmax loss function on the MNIST dataset using the script digit_classification.py. Identify and report the best setting for the number of hidden units based on performance on the validation set. Please ensure that you try at least 6 different settings for the number of hidden units. Using the optimal setting from the previous step, train your network again, and report the accuracy obtained on the test set.
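For part (a), one common formulation of the averaged softmax loss with a numerically stable log-sum-exp is sketched below. The assumption that softmax_loss(scores, y) returns both the loss and the gradient with respect to the scores is a convention borrowed from similar codebases, not something stated in the problem; check the docstring in layers.py for the exact contract.

```python
import numpy as np

def softmax_loss(scores, y):
    # scores: (N, C) class scores; y: (N,) integer labels in [0, C).
    N = scores.shape[0]
    shifted = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    loss = -log_probs[np.arange(N), y].mean()              # averaged over examples
    dscores = np.exp(log_probs)                            # softmax probabilities
    dscores[np.arange(N), y] -= 1
    dscores /= N
    return loss, dscores
```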
3 [20 points] Convolutional Neural Network for multi-class classification
(a) Implement the forward and backward passes of the max-pooling and conv layers in layers.py; a max-pooling sketch appears after this list.
(b) Implement CNN for softmax multi-class classification using the starter code provided in cnn.py.
(c) Using the hyperparameters specified in the solver instance, train the CNN multi-class classifier on the MNIST dataset with digit_classification.py and report the accuracy obtained on the test set. [Tip: to make the computation more efficient, please consider vectorizing the implementations of the convolution and max-pool layers.]
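For the max-pooling layer of part (a), one minimal sketch is shown below; the names max_pool_forward/max_pool_backward and the (pool_h, pool_w, stride) parameterization are assumptions, and the conv layer itself should follow the formulas given in Q1(c).

```python
import numpy as np

def max_pool_forward(X, pool_h, pool_w, stride):
    # X: (N, C, H, W) -> out: (N, C, HO, WO), taking the max over each window.
    N, C, H, W = X.shape
    HO = (H - pool_h) // stride + 1
    WO = (W - pool_w) // stride + 1
    out = np.zeros((N, C, HO, WO))
    for i in range(HO):
        for j in range(WO):
            win = X[:, :, i * stride:i * stride + pool_h, j * stride:j * stride + pool_w]
            out[:, :, i, j] = win.max(axis=(2, 3))
    return out, (X, pool_h, pool_w, stride)

def max_pool_backward(dout, cache):
    # Routes each upstream gradient entry to the argmax position of its window.
    X, pool_h, pool_w, stride = cache
    dX = np.zeros_like(X, dtype=float)
    N, C, HO, WO = dout.shape
    for i in range(HO):
        for j in range(WO):
            win = X[:, :, i * stride:i * stride + pool_h, j * stride:j * stride + pool_w]
            mask = win == win.max(axis=(2, 3), keepdims=True)
            dX[:, :, i * stride:i * stride + pool_h, j * stride:j * stride + pool_w] += \
                mask * dout[:, :, i, j][:, :, None, None]
    return dX
```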
4 [20 points] Application to Image Captioning
In this problem, you will apply the RNN module that has been partially implemented to build an image captioning model. The link to download data is provided in the comment in image_captioning.py. Please unzip coco_captioning.zip to get the data.
(a) At every timestep we use a fully-connected layer to transform the RNN hidden vector at that timestep into scores for each word in the vocabulary. This is very similar to the fully-connected layer that you implemented in Q1; a sketch appears after this list. Implement the forward pass in the temporal_fc_forward function and the backward pass in the temporal_fc_backward function in rnn_layers.py. [Tip: autograder.py can also help with debugging.]
(b) Now that you have the necessary layers in rnn_layers.py, you can combine them to build an image captioning model. Implement the forward and backward pass of the model in the loss function for the RNN model in rnn.py.
(c) With the RNN code now ready, run the script image_captioning.py to get learning curves of the training loss and the caption samples. Report the learning curves and the caption samples based on your well-trained network.
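For part (a), the temporal layer simply folds the time axis into the batch axis and reuses the Q1 affine map. The shapes below (x: (N, T, D), w: (D, M), b: (M,)) are assumptions borrowed from common starter-code conventions; verify them against the docstrings in rnn_layers.py.

```python
import numpy as np

def temporal_fc_forward(x, w, b):
    # x: (N, T, D) -> out: (N, T, M); the same (w, b) is shared across timesteps.
    N, T, D = x.shape
    M = w.shape[1]
    out = (x.reshape(N * T, D) @ w + b).reshape(N, T, M)
    return out, (x, w)

def temporal_fc_backward(dout, cache):
    # dout: (N, T, M) = dL/dout; fold time into the batch and reuse the FC rules.
    x, w = cache
    N, T, D = x.shape
    M = dout.shape[2]
    dout2 = dout.reshape(N * T, M)
    dx = (dout2 @ w.T).reshape(N, T, D)
    dw = x.reshape(N * T, D).T @ dout2
    db = dout2.sum(axis=0)
    return dx, dw, db
```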
5 [20 points] Transfer Learning
By working on this problem, you will become more familiar with PyTorch and run experiments for two major transfer learning scenarios. The link to download the data is provided in the comment in transfer_learning.py. Please unzip hymenoptera_data.zip to get the data.
(a) Fill in the code in the train_model function which is a general function for model training.
(b) Fill in the code in the visualize_model function to briefly visualize how the trained model performs on validation images.
(c) Fill in the code in the finetune function. Instead of random initialization, we initialize the network with a pre-trained network. The rest of the training proceeds as usual.
(d) Fill in the code in the freeze function. We will freeze the weights for all of the network except the final fully connected layer. This last fully connected layer is replaced with a new one with random weights, and only this layer is trained; see the sketch after this list.
(e) Run the script and report the accuracy on the validation dataset for these two scenarios.
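To illustrate the difference between (c) and (d), here is a hypothetical PyTorch sketch. The backbone choice (resnet18) and the two-class head (hymenoptera_data contains ants and bees) follow the standard PyTorch transfer-learning tutorial; the exact structure expected by transfer_learning.py may differ.

```python
import torch.nn as nn
from torchvision import models

def build_finetune_model(num_classes=2):
    # Initialize from ImageNet weights; ALL parameters stay trainable.
    # (Newer torchvision versions use the weights= argument instead of pretrained=.)
    model = models.resnet18(pretrained=True)
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

def build_frozen_model(num_classes=2):
    # Freeze the pretrained backbone; only the new final layer is trained.
    model = models.resnet18(pretrained=True)
    for param in model.parameters():
        param.requires_grad = False
    model.fc = nn.Linear(model.fc.in_features, num_classes)  # requires_grad=True
    return model
```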
6 [6 points (extra credit)] Convolutional Backward Pass
Consider again the following 2D arrays/tensors: $a \in \mathbb{R}^{H_a \times W_a}$ and $b \in \mathbb{R}^{H_b \times W_b}$. Note, a is the image and b is the filter, with $H_a > H_b$ and $W_a > W_b$. Now, consider again the following definitions.
Valid convolution: In this problem, we can define the valid convolution as follows:
$$(a *_{\text{valid}} b)_{i,j} = \sum_{m=i}^{i+H_b-1} \sum_{n=j}^{j+W_b-1} a_{m,n}\, b_{i-m+H_b,\; j-n+W_b}$$
Here, 1 ≤ i ≤ Ha−Hb+1 and 1 ≤ j ≤ Wa−Wb+1. Please note that the convolution operation we discussed in class is valid convolution, and does not involve any zero padding. This operation produces an output of size (Ha − Hb + 1) × (Wa − Wb + 1).
Filtering: Moreover, it might also be useful to consider the filtering operation ∗filt, defined by:
$$(a *_{\text{filt}} b)_{i,j} = \sum_{m=i}^{i+H_b-1} \sum_{n=j}^{j+W_b-1} a_{m,n}\, b_{m-i+1,\; n-j+1} = \sum_{p=1}^{H_b} \sum_{q=1}^{W_b} a_{i+p-1,\; j+q-1}\, b_{p,q}$$
Here, 1 ≤ i ≤ Ha − Hb + 1 and 1 ≤ j ≤ Wa − Wb + 1. Please note that the filtering operation generates an output of size (Ha − Hb + 1) × (Wa − Wb + 1). In summary, the filtering operation is similar to the valid convolution, except that the filter is not flipped when computing the weighted sum. You can think of filtering as an inner product between the Hb × Wb sized kernel and image patch of the same size inside the input image where we don’t flip the kernel.
Full convolution: Finally, for deriving the gradient through the convolution layer, the full convolution operation may be useful. This operation is defined as follows:
$$(a *_{\text{full}} b)_{i,j} = \sum_{m=i-H_b+1}^{i} \sum_{n=j-W_b+1}^{j} a_{m,n}\, b_{i-m+1,\; j-n+1}$$
Here, 1 ≤ i ≤ Ha +Hb −1 and 1 ≤ j ≤ Wa +Wb −1. The full convolution can be thought of as zero padding a on all sides with one less than the size of the kernel, and then performing valid convolution using the modified input tensor a. Concretely, this means that we will pad a by Hb − 1 rows on the top and bottom, followed by Wb − 1 columns on the left and right. In the definition of full convolution, am,n = 0 if m < 1 or n < 1 or m > Ha or n > Wa. This operation produces an output of size (Ha + Hb − 1) × (Wa + Wb − 1).
Using this information, show that:
$$\frac{\partial L}{\partial X_{n,c}} = \sum_{f=1}^{F} \frac{\partial L}{\partial Y_{n,f}} *_{\text{full}} K_{f,c} \qquad \text{and} \qquad \frac{\partial L}{\partial K_{f,c}} = \sum_{n=1}^{N} X_{n,c} *_{\text{filt}} \frac{\partial L}{\partial Y_{n,f}}$$
Hint for $\frac{\partial L}{\partial X_{n,c}}$: Please note that the gradient $\frac{\partial L}{\partial X_{n,c}} \in \mathbb{R}^{H \times W}$, where the (i,j)-th element of $\frac{\partial L}{\partial X_{n,c}}$ is represented as $\frac{\partial L}{\partial X_{n,c,i,j}}$. Theoretically, this expression measures the change induced in the loss L when you perturb the input element $X_{n,c,i,j}$, which is the pixel value in the i-th row, j-th column, and c-th channel of the n-th sample image in the batch.
In order to derive this gradient expression, start with the following expression:
$$Y_{n,f} = \sum_{c=1}^{C} X_{n,c} *_{\text{filt}} K_{f,c}$$
Now, use the definition of the $*_{\text{filt}}$ operation given in the problem statement (the double-summation expression) to represent $Y_{n,f}$ as a triple summation (in this hint, we assume the inner summation variables to be p and q). Using this triple-summation expression, derive the following partial derivative: $\frac{\partial Y_{n,f,i',j'}}{\partial X_{n,c,i,j}}$.
This should yield a relatively simple expression. Following this, you should use the following chain-rule expression:
$$\frac{\partial L}{\partial X_{n,c,i,j}} = \sum_{f=1}^{F} \sum_{i'=1}^{H''} \sum_{j'=1}^{W''} \frac{\partial L}{\partial Y_{n,f,i',j'}} \frac{\partial Y_{n,f,i',j'}}{\partial X_{n,c,i,j}}$$
Observe that you already have the value of $\frac{\partial Y_{n,f,i',j'}}{\partial X_{n,c,i,j}}$, so this can be substituted into the expression for $\frac{\partial L}{\partial X_{n,c,i,j}}$.
Finally, try to use a change of variables in the inner two summations so that the inner two summations can be used to represent the full convolution operation. This should yield the required expression.
Hint for $\frac{\partial L}{\partial K_{f,c}}$:
Use the same steps as in the previous hint. Please note that you might not need to do a change of summation variables here, and that you should finally try to represent the inner two summations as the filtering operation. Additionally, for the chain-rule step, please use the following expression:
$$\frac{\partial L}{\partial K_{f,c,p,q}} = \sum_{n=1}^{N} \sum_{i=1}^{H''} \sum_{j=1}^{W''} \frac{\partial L}{\partial Y_{n,f,i,j}} \frac{\partial Y_{n,f,i,j}}{\partial K_{f,c,p,q}}$$
This should help you arrive at the final expression.
Credits
Some questions adopted/adapted from http://cs231n.stanford.edu/.