$30
The goal of the first sections in this assignment is to refresh your memory of index notation as it is used in linear algebra. In the final section, vector calculus will be applied to an MLP in order to derive the equations of backpropagation for the basic modules in a vanilla neural network. We will need a good understanding of index manipulation in order to handle calculus with objects of arbitrary rank. The rank of an array refers to the dimensionality of its inherent structure: a scalar s has rank 0, a vector v has rank 1 (vi), a matrix M has a rank of 2 (Mij). Note the number of independent indices. An array of higher rank is often referred to as a tensor . As such, the object T with elements Tijk could be referred to as a 3-rank tensor. As will become clear early on, the most important takeaway of working with tensors is to keep good algebraic hygiene throughout your calculations.
1 Index Gymnastics: Notation
The key to performing calculus with objects from linear algebra is to remember that the algebra in index representation is always the same, no matter how you define the shapes of the gradients. As conventions can change from textbook to textbook and paper to paper, it is a good skill to be able to understand how these equations look at the element-level. We will stick to performing calculations with indices from the start, and resort to the luxury of aesthetics only in the end. Remember that these results need to be coded up, so our priority should go to ease of implementation.
Let us begin with some basic notation. One of the most important objects in our arsenal is the Kronecker delta , which has the power to encode if-statements into our mathematical equations:
.
When used in a sum, this object has the useful property of selecting or sifting the terms that satisfy the equality of its indices. For example, let a ∈ Rn be an arbitrary vector, then
.
Note that i is a dummy index: It can be renamed without any consequences to the truthfulness of the equation. The other index k, is a free index and needs to be present on both sides of the equation. It cannot simply disappear! As calculations become more involved, one needs to carefully keep track of which indices are free and which are summed over. Another crucial observation is the following:
.
This introduces calculus into our set of operations. Note that even though x3 and x7 are both elements of a vector called x, the derivative of one with respect to the other is still 0. They are independent variables which happen to have been collected into the same array.
Another subtle trick is quite an obvious one: indexing. Given a complicated looking object, analyzing it element by element will prove to be very efficient. For example, given matrix M its elements can be accessed by indexing using square brackets as follows: [M]ij = Mij. Observe the relationship between the identity matrix I and the Kronecker delta: [I]ij = δij.
Yet another useful piece of notation is that for the trace of a square matrix S ∈ Rm×m, i.e. tr(S) := Pi Sii. Sometimes it will be useful to introduce the ones-vector 1, which simply has all components equal to unity [1]i = 1. The Hadamard product or element-wise product between two matrices of identical size is given by A ◦ B. The elements of the result are [A ◦ B]ij = AijBij.
2 Index Gymnastics: Examples
Consider the following matrix equation: A = BC. Given the standard definition of matrix multiplication, we can index the whole equation as follows:
[A]ij = [BC]ij
Aij = XBipCpj.
p
Note the introduction of the dummy index p. Also, since the elements are simply numbers, they commute. Now on to some examples involving calculus.
Example 1
Question: Let r = x · a ∈ R for vectors x,a ∈ Rn. What is ?
Solution: We start off by indexing the object under investigation with i and expanding.
.
After writing out the dot product explicitly, we leverage the linearity of the differential operator. Informally put, we can swap the order of the differential operator and the summation symbol. With the Kronecker delta, we note that the only non-zero term in the sum is the one in which k equals i. Without having to predetermine whether gradients should be represented by column or row vectors, we have the unambiguous result: .
Let us pick a shape for our gradient. If we decide to let gradients be column vectors, the result has the pretty form .
In the example above, we could have picked the gradient to be a row vector, in that case: . The only difference between the column and row vector gradients is a transpose operation. This seems quite harmless, but don’t be fooled. These choices become increasingly more important as the objects increase in rank. The main takeaway is to pick a reasonable convention and stick with it. Here, the default will be the column vector representation, unless stated otherwise. (In an assignment or exam you are usually told which one to use. Always read the instructions carefully!)
Example 2
Question: Consider the scalar s = b⊤Xc, where b ∈ Rm, c ∈ Rn, and X ∈ Rm×n. Find .
Solution: Again, choosing a way to index the object and expanding gives the following.
Note that this object requires two indices in order to define a single element. We also required two Kronecker deltas in order to codify the condition for the derivative to equal unity. (The derivative of X31 with respect to X31 is 1, so the indices need to match in top-bottom fashion.) First we sum over q and we are left with the terms in which q equal j. Then we sum over p and obtain the final result,
∂s ∂b⊤Xc ⊤
= = bc , ∂X ∂X
which can be rewritten in terms of an outer product of the two constant vectors. (Write the
final step out in terms of the elements of a matrix in order to convince yourself that this is so.)
Depending on how you choose to approach a problem, you might have to make a choice when casting the elements back into matrix notation. Use the dimensions of the matrices to guide you along the way. In fact, you should always keep a mental note of what type of object you are manipulating throughout the steps of an equation. It helps to write out the dimensions of the different tensors to keep track of what the sizes are of the various terms. This will prove to be helpful when coding everything up. Remember: You can always print the result of numpy.shape or torch.size in order to check that the dimensions of your arrays are what you expect. With this in mind, try the following exercise.
That’s it! Comparing the left hand side with every step in the calculation, you will observe
that there is a conservation of free indices. In other words, if someone asks you for entry , it can be readily evaluated using the result: .
You might have noticed that the previous result was not written in closed-form, but was left in index notation. Closed-form is useful in deep learning because it allows us to "vectorize" our algorithms and use our GPUs to their full potential. So when it is sensible, you should opt for a vectorized expression in your algorithm in order to allow for large batch sizes. In other words: less loops, more speed!
3 Additional Tools
Note that performing the chain rule over a matrix requires to sum over all its elements. Let there be a matrix M with some dependence on a scalar variable t. Then, for some well-defined and continuous function g : Rm×n → R, we have:
.
In your further reading, you might encounter the notion of the Einstein summation convention . Simply put, this alleviates the need to write the summation sign at the front of an expression. The key to working with this convention is to look for repeated indices, which indicates that the index is a dummy index (and it is therefore being summed over). We will not use it in this course as it will not provide additional clarity in solving the problems. You are expected to keep summation signs in all expressions in your work. In NumPy, however, there is a handy implementation of einsum which
could be useful for removing loops from your calculations.
4 MLP Backpropagation
We will look at backpropagation from a modular perspective. In other words, it will be easier to think of a neural network as a series of functions (with or without adjustable parameters θ) rather than as a network with neurons as nodes. In a traditional sketch of a neural network, it is not as easy to see that
within each node, an activation function is being applied to the result of the linear transformation. By making each of these operations a separate module, it will become clear how backpropagation works in the general setting. A simple example of such a modular representation is shown in Figure 1. Note that in the forward pass, certain modules require not only features from the previous layer, but also a set of parameters that are constantly being updated during training. Backpropagation is an algorithm that allows us to update these parameters using gradient descent in order to decrease the loss.
4.1 Evaluating the Gradients
In the forward pass, some input data is injected into a neural network. The features flow through the neural network blissfully, changing dimensionality along the way. The number of features in a hidden layer corresponds to the number of neurons in the corresponding layer. In the final layer, some sort
Figure 1. Example of an MLP represented using modules
of output is generated. In the example of a classification problem, one might consider using softmax in the output layer (as in Figure 1). For training, we will require a loss function L, a measure for how poorly the neural network has performed. The lower the loss, the better the performance of our model on that data. Note that our model has parameters θ. In a traditional linear layer, the parameters are the weights and biases of a linear transformation. Also note that a conventional activation function (e.g. ReLU) has no parameters that need to be optimized in this fashion.
In general, we do not want to send in one data point at a time, but rather multiple in a batch. Let the number of samples in a batch be represented by S and the number of features (or dimensions) in each sample by M. Concatenating all the samples in a single batch as row-vectors, we obtain the feature matrix X ∈ RS×M.
In a simple linear module, the number of features per data point will usually vary. For example, in a linear transformation from a layer with M neurons to the next layer with N neurons, the number of features goes from M to N. In other words, an input to this linear transformation has M elements, and the output has N, which is just like an ordinary matrix multiplication! For one data point z ∈ RM being transformed into v ∈ RN (i.e. batch size of 1) the linear transformation looks like v = Wz + p, where W ∈ RN×M and p ∈ RN. If we transpose this whole equation we get: v⊤ = z⊤W⊤ +p⊤. Note that in programming, the most fundamental array is a list, which is best represented by a row-vector.
Instead we rewrite the equation with row-vectors y and we obtain the much nicer looking: y = xW⊤ + b. Now we can handle multiple data points at once with input feature matrix X ∈ RS×M, the output features are then given by Y = XW⊤ + B ∈ RS×N. The weight matrix is W and the bias row-vector b ∈ R1×N is tiled S times into B ∈ RS×N. (Note that Bij = bj.)
Figure 2. The programming convention used in DL assumes the batch dimension comes first.
For a linear module that receives input features X and has weight and biases given by W and b the forward pass is given by Y = XW⊤ + B. In the backward pass (backpropagation), the gradient of the loss with respect to the output Y will be supplied to this module by the subsequent module.
(a) A linear module (b) An activation function module
Figure 3. Forward and backward passes in the basic modules.
The final module before the loss evaluation is responsible for turning the jumbled-up data into predictions for C categories. Softmax takes an ordered set of numbers (e.g list or vector) as an input, and returns the same sized set with a corresponding "probability" for each element. Therefore, one must have alread ensured that this module receives data with a number of features equal to the number of categories C. We would like to generalize this to a batch of many such ordered lists (row vectors). The softmax module is defined for feature matrices X,Y ∈ RS×C as follows:
Yij = [softmax( .
Finally, we must specify a loss function for training in order to compare the outputs from our final module (e.g. softmax) to our targets T ∈ RS×C, also referred to as labels. The rows are the target row-vectors t ∈ R1×C and are usually one-hot, meaning that all elements are 0 except for the one corresponding to the correct label, which is set to unity. This can be generalized even further such that Pj tj = Pj Tkj = 1, for all samples k. Let us pick the categorical cross entropy. The loss of a sample i in the batch is then given by:
Li := −XTik log(Xik)
k
The final loss is the mean over all the samples in the batch. Therefore, .
Question 1.1 c) Softmax and Loss Modules
i. Consider a softmax module such that Yij = [softmax(X)]ij, where X is the input and Y is the output of the module. Find an expression for in terms of .
ii. The gradient that kicks the whole backpropagation algorithm off is the one for the lossmodule itself. The loss module for the categorical cross entropy takes as input X and returns
. Find a closed form expression for ∂∂LX. Write your answer
in terms of matrix operations (you may use element-wise operations as well).
Answer 1.1 c) Softmax and Loss Modules
The first one is tricky to get into closed form, but remember that you can play around with einsum to get a similar result. The operation in the second answer is element-wise division. i.
ii. X
4.2 NumPy implementation
After discussing the theory, it is time to get some experience by implementing your own neural network with the equations above. For those who are not familiar with Python and NumPy it is highly recommended to going through the NumPy tutorial.
Question 1.2
Implement a multi-layer perceptron using purely NumPy routines. The network should consist of a series of linear layers with ReLU activation functions followed by a final linear layer and softmax activation. As loss function, use the common cross-entropy loss for classification tasks. To optimize your network you will use the mini-batch stochastic gradient descent algorithm. Implement all modules in the files modules.py and mlp_numpy.py by carefully checking the instructions in the files. You can use the provided unittests.py to check your implementation of the modules for bugs.
Part of the success of neural networks is the high efficiency on graphical processing units
(GPUs) through matrix multiplications. Therefore, all of your code should make use of matrix multiplications rather than iterating over samples in the batch or weight rows/columns. Implementing multiplications by iteration will result in a penalty.
Implement training and testing scripts for the MLP inside train_mlp_numpy.py. Using the default parameters provided in this file, you should get an accuracy of around 47 − 48% using ReLU activation function for the entire test set for an MLP with one hidden layer of 128
units. Finally, provide the achieved test accuracy and loss curve for the training on ans-delft for the default values of parameters (one layer, 128 hidden units, 10 epochs, learning rate 0.1, seed 42).
5 PyTorch MLP
The main goal of this part is to make you familiar with PyTorch. PyTorch is a deep learning framework for fast, flexible experimentation. It provides two high-level features:
• Tensor computation (like NumPy) with strong GPU acceleration
• Deep Neural Networks built on a tape-based autodiff system
You can also reuse your favorite python packages such as NumPy, SciPy and Cython to extend PyTorch when needed. Check out Tutorial 2 for an introduction to PyTorch.
Question 2.1
Implement the MLP in mlp_pytorch.py file by following the instructions inside the file. The interface is similar to mlp_numpy.py, except that we also consider adding a batch normalization layer in the network. Implement training and testing procedures for your model in train_mlp_pytorch.py by following instructions inside the file. Using the same parameters as in Question 4.2, you should get similar accuracy on the test set. Again, provide
the achieved test accuracy and loss curve for the training on ans-delft for the default values of parameters (one layer, 128 hidden units, 10 epochs, no batch normalization, learning rate 0.1, seed 42).
5.1 Batch Normalization
We have so far trained very simple MLPs on our dataset only. Now, we want to take a closer look at the effect of normalization in neural networks. Specifically, in mlp_pytorch.py, you are asked to allow adding Batch Normalization to your network. Let’s train a few networks with and without Batch Normalization, and compare their performance.
Question 2.2
Train your PyTorch MLP with the following non-default hyperparameter configurations:
Hidden dimensionalities Number of epochs Learning rate
128 20 0.1
256, 128 20 0.1
512, 256, 128 20 0.1
For each of these 3 hyperparameter configurations, run it with and without using Batch Normalization between the layers. You can implement it in the file compare_batch_norm.py. What do you experience? Compare the models by plotting their validation and training accuracies over epochs. We recommend plotting the training and validation curves in separate plots due to different accuracy scales. Clearly label the axes and curves. Explain the different behaviors of the models, and trends over increasing the network size. What does it tell us about the effect of batch normalization in such networks?
Remark 1: for the training accuracies, it is sufficient to report the mean accuracy over batches that have been trained on in this epoch. This does not necessarily reflect the exact accuracy that the model after the epoch achieves on the training data, but is close enough for our purpose here.
Remark 2: It is usually good practice to run each model multiple times with different seeds to get stable results and remove any ’luck’ in the runs. However, to limit the scope of this assignment, and since the difference should be apparent from a single run, we do not require multiple runs with different seeds.
References
1. Matrix Cookbook: https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook. pdf (This contains all the identities you will ever need, and then some!)
2. Useful list of identities on Wikipedia: https://en.wikipedia.org/wiki/Matrix_ calculus (Note how the results can be cast differently depending on the convention used.)
3. Mathematics for Machine Learning textbook: https://mml-book.github.io/book/ mml-book.pdf (A nice refresher. Go to section 5.5 for a list of some important identities.) 4. Using einsum in NumPy: https://numpy.org/doc/stable/reference/generated/ numpy.einsum.html
5. A guide to numpy.einsum: https://ajcr.net/Basic-guide-to-einsum/