$30
In this assignment, you will study and implement convolutional, recurrent, and graph neural networks (CNNs, RNNs, GNNs). CNNs are the standard solution for images. RNNs are best suited for sequential data processing, such as a sequence of characters, words, or video frames. Their applications are mostly in neural machine translation, speech analysis, and video understanding. GNNs are specifically applied to graph-structured data, like knowledge graphs, molecules, or citation networks.
The assignment consists of three parts. First, you will get familiar with CNNs and how they transfer to different corruption functions on the test set. In the second part, we look at the theory behind vanilla RNNs and another type of RNN called a Long ShortTerm Memory (LSTM) network. Then you will implement an LSTM yourself and test it on generating text. In the final part, you will analyze the forward pass of a graph convolutional neural network and implement your own small GNN. In addition to the coding assignments, the text contains multiple pen and paper questions which you need to answer. We expect each student to hand in their code individually and submit their answers via Ans.
1 Convolutional Neural Networks
1.1 Vanilla CNN
Although standard Multilayer Perceptrons (MLP) achieve good results in some tasks as seen in the previous assignment, they are not as scalable as eventually needed in practice. The disadvantage is that the number of total parameters can grow very quickly, especially when having high dimensional inputs like images. This is inefficient because there is redundancy in such high dimensions. To address the issue, Convolutional Neural Networks (CNNs) have been proposed. Using a shared set of weights at each layer, CNNs are more efficient than MLPs. The intrinsic feature of images, which is translation invariance, can be employed to boost the performance of downstream tasks.
Question 1.1
You are asked to refer to the Jupyter notebook that has been attached to this assignment. In each section of the notebook, you are asked to implement the necessary parts of common layers in a CNN from scratch using Numpy functions. This helps you understand the underlying forward and backward operations that exist in the PyTorch convolution functions.
Hint: you are not required to implement it in an efficient matrix-optimized form but can use for-loops to simplify the implementation.
1.2 Modern Architectures
After the proposal of CNNs, different CNN-based architectures have been introduced, which usually differ in depth, number of filters, ways of applying filters to the input information flow, and etc. Some of the most famous architectures are VGG-11 [1], ResNet32 [2], and DenseNet-121 [3]. The latter two we have already implemented and trained in Tutorial 5. Although state-of-the-art (SOTA) CNN-based methods have achieved even better human performance during the last few years, the amount of transferability, generalization, and robustness of the learned features are still a matter of debate. In this part, by conducting simple experiments, you will scrutinize the learned features more to understand their limitations better. Please refer to the attached python file in which four common corruptions with different severities have been defined. For instance, we experiment with different levels of Gaussian noise and check how much this influences the CNNs prediction accuracy. You have to report the corruption error of each of the
above-mentioned architectures using the following criteria:
𝑓 ( Í5𝑠=1 𝐸𝑠,𝑐𝑓 )
CE𝑐 = Í5𝑠=1 𝐸ResNet-18𝑠,𝑐 ) , (1)
(
𝑓 ( Í5 𝐸𝑠,𝑐𝑓 − 𝐸clean𝑓 )
𝑠=1
RCE𝑐 = Í5 𝐸ResNet-18𝑠,𝑐 − 𝐸cleanResNet-18) , (2)
( 𝑠=1
where 𝐸clean𝑓 denotes the clean dataset top-1 error rate of architecture 𝑓 , and 𝐸𝑠,𝑐𝑓 shows the error after applying severity 𝑠 and corruption 𝑐 to the test dataset. Since different corruption functions impose different severities, which are all aggregated into a summation to show the general robustness of the models, normalizing them with respect to a standard factor is necessary. Therefore, here, all the calculations are normalized based on the performance of ResNet-18 when various corruptions are applied. All the experiments will be performed on CIFAR-10 to keep the computational cost on a reasonable scale.
Question 1.2
(a) (8 points) Implement the training and testing of a ResNet-18 architecture in the provided python templates. Report the test accuracies of the ResNet-18 on the four provided corruption functions Gaussian noise, Gaussian blur, Contrast reduction, and JPEG compression over different severities (1 to 5), including the test accuracy on the original test dataset. Plot the results with accuracy on the y-axis, severity on the x-axis, and different lines for the different corruption functions. Shortly, describe the trends you observe.
(b) (7 points) Redo the experiment for the following network architectures:
• VGG-11
• VGG-11 with Batch Normalization
• ResNet-34
• DenseNet-121
Report the results in terms of the previously defined metrics, CE and RCE, for the four different corruption functions. Explain the trend you observe when comparing network architectures, deeper models, and using Batch Normalization.
Hint: This question requires a considerable amount of computation, and thus it is strongly recommended deploying the code on Lisa for this. The estimated runtime of all models together is about 6hrs, so do not wait until the last minute to run these models. Use the provided ’debug’ model for faster debugging on CPU.
2 Recurrent Neural Networks (Total: 35 points)
2.1 Vanilla RNNs
The vanilla RNN is formalized as follows. Given a sequence of input vectors x(𝑡) for
𝑡 = 1, . . .𝑇, the network computes a sequence of hidden states h(𝑡) and a sequence of output vectors p(𝑡) using the following equations for time steps 𝑡 = 1, . . . ,𝑇:
h(𝑡) = tanh(Wℎ𝑥x(𝑡) + Wℎℎh(𝑡−1) + bℎ) (3)
p(𝑡) = W𝑝ℎh(𝑡) + b𝑝 (4)
As you can see, there are several trainable weight matrices and bias vectors. Wℎ𝑥 denotes the input-to-hidden weight matrix, Wℎℎ is the hidden-to-hidden (or recurrent) weight matrix, W𝑝ℎ represents the hidden-to-output weight matrix and the bℎ and b𝑝 vectors denote the biases. For the first time step 𝑡 = 1, the expression h(𝑡−1) = h(0) is replaced with a special vector h𝑖𝑛𝑖𝑡 that is commonly initialized to a vector of zeros. The output value p(𝑡) depends on the state of the hidden layer h(𝑡) which in its turn depends on all previous state of the hidden layer. Therefore, a recurrent neural network can be seen as a (deep) feed-forward network with shared weights.
To optimize the trainable weights, the gradients of the RNN are computed via backpropagation through time (BPTT). The goal is to calculate the gradients of the loss ℒ with respect to the model parameters Wℎ𝑥, Wℎℎ and W𝑝ℎ (biases omitted). Similar to training a feed-forward network, the weights and biases are updated using SGD or one of its variants. Different from feed-forward networks, recurrent networks can give output predictions yˆ(𝑡) at every time step. In this assignment the outputs will be given by the softmax function, i.e. yˆ(𝑡) = softmax(p(𝑡)). For training, we compute the standard cross-entropy loss at each time step 𝑡:
𝐾
ℒ(𝑡) = − Õy(𝑘𝑡) logyˆ(𝑘𝑡) (5)
𝑘=1
Figure 1. A graphical representation of LSTM memory cells (Olah, 2015 [4])
Where 𝑘 runs over the number of classes. In this expression, y denotes a one-hot vector of length 𝐾 containing true labels. The gradient w.r.t. W𝑝ℎ at time step 𝑡 can be expressed as follows:
𝜕ℒ(𝑡) 𝜕ℒ(𝑡) 𝜕yˆ(𝑡) 𝜕p(𝑡)
= (𝑡) 𝜕p(𝑡) 𝜕W𝑝ℎ (6)
𝜕W𝑝ℎ 𝜕yˆ
For the gradients w.r.t. Wℎℎ and Wℎ𝑥, we need to sum up the contributions of each time step to the gradient. In case of the weight matrix Wℎℎ, this leads to the following expression:
𝜕ℒ(𝑡) 𝜕ℒ(𝑡) 𝜕yˆ(𝑡) 𝜕p(𝑡) Õ𝑡 Ö𝑡 𝜕h(𝑗) 𝜕h(𝑡′)
= (𝑡) 𝜕p(𝑡) 𝜕h(𝑡) 𝑡′=0©𝑗=𝑡′+1 𝜕h(𝑗−1) ª® 𝜕Wℎℎ (7) 𝜕Wℎℎ 𝜕yˆ
« ¬
Note that the product Î𝑡𝑗=𝑡′+1𝜕h(𝑗)/𝜕h(𝑗−1) for 𝑡′ = 𝑡 equals 1 by convention. The gradient w.r.t. Wℎ𝑥 is expressed accordingly.
Question 2.1 (4 points)
Recurrent neural networks can be trained using backpropagation through time. Similar to feed-forward networks, the goal is to compute the gradients of the loss w.r.t. W𝑝ℎ, Wℎℎ and Wℎ𝑥.
(a) (2 points) Expand the expression of the gradient 𝜕𝜕Wℒ(ℎℎ𝑡) for time step 𝑡 = 3, i.e. write out the sum and product in terms of the variables that appear in Equation 7 and simplify where possible. In your solution, every time-index should be replaced by a number.
(b) (2 points) As can be seen from Equation 7, the gradient w.r.t. Wℎℎ depends on all past time steps, whereas the gradient w.r.t. W𝑝ℎ only depends on the current time step. Study the gradient w.r.t. Wℎℎ and explain what problems might occur when training this recurrent network for a large number of time steps. You can use your solution from (a) to support your explanations.
2.2 Long Short-Term Memory (LSTM) network
Training a vanilla RNN for remembering its inputs for an increasing number of time steps is difficult. The problem is that the influence of a given input on the hidden layer (and therefore on the output layer), either decays or blows up exponentially as it unrolls the network. In practice, the vanishing gradient problem is the main shortcoming of vanilla RNNs. As a result, training vanilla RNNs to consistently learn tasks containing delays of more than ∼ 10 time steps between relevant input and target is difficult. To overcome this problem, many different RNN architectures have been suggested. The most widely used variant is the Long Short-Term Memory networks (LSTMs). An LSTM (Figure 1) introduces several gating mechanisms to improve gradient flow for a more stable training procedure. Before continuing, we recommend reading the following blog post to get familiar with the LSTM architecture: Understanding LSTM Networks [4]. In this assignment we will use the following LSTM definition:
g(𝑡) = tanh(W𝑔𝑥x(𝑡) + W𝑔ℎh(𝑡−1) + b𝑔)
(8)
i(𝑡) =𝜎(W𝑖𝑥x(𝑡) + W𝑖ℎh(𝑡−1) + b𝑖)
(9)
f(𝑡) =𝜎(W𝑓 𝑥x(𝑡) + W𝑓 ℎh(𝑡−1) + b𝑓 )
(10)
o(𝑡) =𝜎(W𝑜𝑥x(𝑡) + W𝑜ℎh(𝑡−1) + b𝑜)
(11)
c(𝑡) = g(𝑡) ⊙ i(𝑡) + c(𝑡−1) ⊙ f(𝑡)
(12)
h(𝑡) = tanh(c(𝑡)) ⊙ o(𝑡)
(13)
p(𝑡) = W𝑝ℎh(𝑡) + b𝑝
(14)
yˆ(𝑡) = softmax(p(𝑡)).
(15)
In these equations ⊙ is element-wise multiplication and 𝜎(·) is the sigmoid function. The first six equations are the LSTM’s core part, whereas the last two equations are just the linear output mapping. Note that the LSTM has more weight matrices than the vanilla RNN. As the forward pass of the LSTM is relatively intricate, writing down the correct gradients for the LSTM would involve a lot of derivatives. Fortunately, LSTMs can easily be implemented in PyTorch, and automatic differentiation takes care of the derivatives.
Question 2.2 (6 points)
The LSTM extends the vanilla RNN cell by adding four gating mechanisms. Those gating mechanisms are crucial for successfully training recurrent neural networks.
(a) (4 points) The LSTM has an input modulation gate g(𝑡), input gate i(𝑡), forget gate f(𝑡) and output gate o(𝑡). For each of these gates, write down a brief explanation of their purpose; explicitly discuss the non-linearity they use and motivate why this is a good choice. (up to 3 sentences per gate)
(b) (2 points) Given the LSTM cell as defined by the equations above, let x ∈ R𝑇×𝑑 be an input sample where 𝑇 denotes the sequence length and 𝑑 is the feature dimensionality. Write down the formula for the total number of trainable parameters in the LSTM cell as defined above, in terms of 𝑁ℎ𝑖𝑑𝑑𝑒𝑛, 𝑁𝑖𝑛𝑝𝑢𝑡 and 𝑁𝑜𝑢𝑡𝑝𝑢𝑡 for the dimensionality associated with the hidden cell state, the input and the output respectively.
2.3 LSTMs in PyTorch
In this part of the assignment, you will implement your own LSTM module in PyTorch. In the next part of the assignment, you will use this implementation on text data, where the LSTM learns to predict the next character in a sentence. The inputs to the LSTM will therefore be sequences of characters. Since characters are categorical and do not have a trivial order, they can not simply be represented as integers. It is common to encode categorical input data with the help of embeddings before passing the data through the network. For an explanation of embeddings, see below. For this task, it is not necessary yet to set hyperparameters apart from the initialization of the weights, which is specified below.
Embeddings
As the network does not accept actual characters (‘e’, ‘a’, or whitespace) as inputs, we need to transform these characters to numerical values. As explained previously, one-hot encoding is often preferred over simply (and somewhat arbitrarily) mapping each input to a certain number, e.g., 𝑎 = 1, 𝑏 = 2, etc. The reasoning for this was that the to-be-encoded characters are unrelated. (Certainly 𝑏 ≠ 2 · 𝑎, so this mapping would indeed not make sense.)
However, there are two more things to take into consideration.
1. The characters may not be completely unrelated: We may find 𝑎 “more similar” to 𝑒 (both are vowels) than to 𝑘 or 𝑥. Using one-hot-encoding does not allow these possible differences in similarity.
2. If we have many different characters that are all mapped using one-hot-encoding, the input dimension is also large, requiring bigger (thus slower, less efficient, or harder-to-train) input modules.
For these reasons, we use more general maps called embeddings. From a broader perspective, note that one-hot-encoding is a way of mapping the 𝑁 possible categories to an 𝑁-dimensional space, whereas assigning numbers is a way of mapping the 𝑁 categories to a 1-dimensional space. We can now also think of mapping the 𝑁 possible categories to some 𝑀-dimensional space, where usually 𝑀 ≤ 𝑁. Instead of deciding the actual map manually, we allow the network to train the map, discovering the most practical representation on the go. One can think of starting using 𝑁-dimensional one-hot-encoded vectors and mapping these to an 𝑀-dimensional space using a matrix 𝑊embed ∈ R𝑀×𝑁, to end up with an 𝑀-dimensional input vector per character for the neural network.
Weight initialization
In many cases, LSTMs are simply initialized with small random weight√ s. For this assign√ ment, you should initialize all weights uniformly between −1/ 𝑁ℎ𝑖𝑑𝑑𝑒𝑛 and +1/ 𝑁ℎ𝑖𝑑𝑑𝑒𝑛. Additionally, you should add a value of 1 to the bias of the forget gate. The reason for this is that for learning long-term dependencies, it is good practice to initialize the bias of the forget gate to a larger value, such that the model starts off with remembering old states and learns what to forget (rather than vice versa).
Question 2.3 (7 points)
Implement an LSTM network as specified by Equations 8 - 13 above in the LSTM-class of the file part2/model.py. You do not need to implement the linear output mapping yet (i.e. the computation of p(𝑡)), since it does not belong to the LSTM’s core part. For the text generation in Question 2.4, you will use embeddings to represent each character of the input sequence (for a general explanation, see above). Therefore, you can assume the input to your LSTM module to be the embedding representation of text sequences. The sequence length and batch size are variable.
You are required to implement the model without any high-level PyTorch functions. Specifically, you are allowed to use nn.Parameter and the PyTorch non-linearity functions (e.g. torch.sigmoid), but you are not allowed to use nn.Linear, nn.LSTM or nn.LSTMCell. For the initialization of the weights, see the specifications above. You do not need to implement the backward pass yourself, but instead, you can rely on automatic differentiation.
2.4 Recurrent Nets as Generative Model
In this part, you will use your implementation of the LSTM for the generation of text. The LSTM will learn the local structure in the text by training to predict the next character in a sentence. You will train a one-layer LSTM on sentences from a book and use the model to generate new text. Before starting, we recommend reading the blog post The
Unreasonable Effectiveness of Recurrent Neural Networks.
Given a training sequence x = (𝑥(1), . . . , 𝑥(𝑇)), a recurrent neural network can use its output vectors p = (𝑝(1), . . . , 𝑝(𝑇)) to obtain a sequence of predictions yˆ(𝑡). When working with text input, each element in the prediction sequence is a distribution over the next character given the text available so far. As an example, for the training sequence "Hello world!", the first element of the prediction sequence yˆ(1) would be a distribution over all possible start characters, of which only one dimension resembles the probability 𝑝(’H’). Consequently, the third element yˆ(3) is a distribution over the next character given the past input characters (𝑥(1), 𝑥(2)) = ’He’, where one dimension of the prediction resembles 𝑝(’l’|’He’). Since the network predicts future characters based only on the past, it is an autoregressive model.
The network is trained using the total cross-entropy loss, which can be computed by averaging over all time steps using the target labels y(𝑡).
𝐾
ℒ(𝑡) = − Õy(𝑘𝑡) logyˆ(𝑘𝑡) (16)
𝑘=1
1 Õ (𝑡) (17)
ℒ = ℒ
𝑇
𝑡
Again, 𝑘 runs over all possible classes, where 𝐾 is the vocabulary size (for text, the classes could e.g. be all ASCII characters). In this expression, y denotes a one-hot vector of length 𝐾 containing true labels. Using this sequential loss, you can train a recurrent network to make a prediction at every time step.
After training, the LSTM can generate text character-by-character that will look similar to the original text. For this, the network uses its own output as an input for the next prediction. Just like multilayer perceptrons, LSTM cells can be stacked to create deeper layers for increased expressiveness. Each recurrent layer can be unrolled in time. For this task, we will only use one LSTM layer.
For training, you can use a large text corpus, such as publicly available books. We provide several books in the assets directory. However, you are also free to download other books. We recommend Project Gutenberg as a good source. Make sure you download the books in plain text (.txt) for easy file reading in Python. We provide the TextDataset class for loading the text corpus and drawing batches of example sentences from the text.
The sequence length specifies the length of training sentences, which also limits the number of time steps for backpropagation in time. When setting the sequence length to 30 steps, the gradient signal will never backpropagate more than 30 time steps. As a result, the network cannot learn text dependencies longer than this number of characters.
Teacher Forcing
In the standard RNN setting, the output of a previous module is used as input of the next module. During inference, a “start-of-sequence” token is used as input for the first module, and the network takes it from there, feeding each module with the output of its predecessor. During training, however, we do not have to follow this recipe strictly. In what is known as Teacher Forcing, we do not use the (possibly faulty) intermediate outputs of an RNN as inputs, but instead use the ground truth (the correct answer).
We still compute the loss based on those intermediate outputs, but by using Teacher Forcing, we give the network a fair chance on the subsequent modules to produce the correct output.[1] Note that if we hadn’t, the further we go in the sequence of modules, the more likely it is that everything is messed up due to some faulty outputs early on in the recurrence. Such an accumulation of errors makes it difficult for the model to learn. This is also observed in practice, where models without Teacher Forcing take longer to converge because the first few words need to be correct before successful training on longer sentences can occur.
Question 2.4 a) (8 points)
Study the code and its outputs in part2/dataset.py to sample sentences from the book to train with. Also, have a look at the parameters defined in part2/train.py. You need to implement the corresponding PyTorch code in part2/train.py to make the features work. You may need to tune them depending on your own implementation and chosen dataset. In the file part2/model.py, you need to implement the class TextGenerationModel and its forward-function.
(a) (8 points) Implement a one-layer LSTM network with 1024 hidden dimensions to predict the next character in a sentence by training on sentences from a book. Use your own implementation from Question 2.3 for the LSTM layer. If you were not able to implement the LSTM yourself, you are allowed to use the PyTorch module nn.LSTM. However, if your own implementation of the LSTM is working, using nn.LSTM will lead to point deductions. Note that you will need to transform the input into embeddings (use nn.Embedding) and pass the output of the LSTM module through a linear output layer if the output layer is not included in your LSTM module.
Train the model on sentences of length 𝑇 = 30 from your book of choice.
Define the total loss as the average of cross-entropy losses over all 30 time steps
(Equation 17). Plot the loss and accuracy during training, and report all the relevant hyperparameters that you used and shortly explain why you used them (even if they are the default ones). You can use TensorBoard for creating the plots. For convenience, use only a train set; no validation nor test sets required.
Hint: To train the model efficiently, use teacher forcing, as explained above.
Question 2.4 b) (5 points)
(b) (5 points) Make the network generate new sentences of length 𝑇 = 30 at three different points during training to show the progress of the model (e.g., after 1 epoch, 5 epochs and at the end of the training). You can do this by randomly setting the first character of the sentence and always picking the token with the highest probability predicted by your model. Store your hidden state in between token generations to prevent doing duplicate calculations. Report 5 text samples with different start characters generated by the network over the various stages of training. Carefully study the text generated by your network. What changes do you observe when the training process evolves? For your dataset, some patterns might be better visible when generating sentences longer than 30 characters. Discuss the difference in the quality of sentences generated (e.g., coherency) for a sequence length of less and more than 30 characters.
Question 2.4 c) (5 points)
(c) (5 points) In your current implementation, your next character is always chosen by selecting the one with the highest probability, regardless of the rest of the distribution.a On the opposite, we could also perform random sampling, where we generate an actual sample from the autoregressive distribution learned by the model: this will result in a high diversity of sequences, but they will be meaningless. However, we can interpolate between these two extreme cases by using the outputted distribution together with a temperature parameter 𝜏 in the softmax:
softmax(𝑥˜) = Íexp𝑖 exp(𝑥(˜𝑥/˜𝜏𝑖/)𝜏)
(for details, see Goodfellow et al.; Section 17.5.1).
• Explain the effect of the temperature parameter 𝜏 on the sampling process.
• Extend your current model by adding the temperature parameter 𝜏 to balance the sampling strategy between fully-greedy and fully-random.
• Report generated sentences for temperature values 𝜏 ∈ {0.5, 1.0, 2.0} of your fully trained model. What do you observe for different temperatures? What are the differences with respect to the sentences obtained by greedy sampling?
Note that using one single dataset is sufficient to get full points. However, we encourage you to experiment using different datasets to gain more insight. We suggest starting with relatively small (and possibly with simple language) datasets, such as Grim’s fairy tales, so that you can train the model on your laptop easily. If the network needs training for some hours until convergence, it is advised to run training on the SurfSara cluster. Also, you might want to save the model checkpoints now and then, so you can resume training later or generate new text using the trained model.
aSuch an implementation is also called greedy sampling. More generally, a greedy algorithm will make whatever choice seems best at the moment, without taking into account the resulting future state of the algorithm caused by that choice.
3 Graph Neural Networks (Total: 40 points)
Make sure to check the UvA Deep Learning Tutorial 7 about GNNs (link) before continuing with this part.
3.1 GCN Forward Layer
Graph convolutional neural networks are widely known architectures used to work with graph-structured data, and a particular version (GCN, or Graph Convolutional Network) was first introduced in https://arxiv.org/pdf/1609.02907.pdf. Consider Eq. 18, describing the propagation rule for a layer in the Graph Convolutional Network architecture to answer the following questions.
Where 𝐴ˆ is obtained by:
𝐻(𝑙+1) =𝜎(𝐴𝐻ˆ (𝑙)𝑊(𝑙))
(18)
𝐴ˆ = 𝐷˜ −12 𝐴˜𝐷˜ −12
(19)
𝐴˜ = 𝐴 + 𝐼𝑁
(20)
𝐷˜ 𝑖𝑖 = Õ 𝐴˜𝑖𝑗
(21)
𝑗
In the equations above, 𝐻(𝑙) is the 𝑁 × 𝑑 matrix of activations in the 𝑙-th layer, 𝐴 is the 𝑁 × 𝑁 adjacency matrix of the undirected graph, 𝐼𝑁 is an identity matrix of size 𝑁, and 𝐷˜ is a diagonal matrix used for normalization (you don’t need to care about this normalization step; instead, you should focus on discussing Eq. 18). 𝐴˜ is the adjacency matrix considering self-connections, 𝑁 is the number of nodes in the graph, 𝑑 is the dimension of the feature vector for each node. The adjacency matrix A is usually obtained from data (either by direct vertex attribute or by indirect training). W is a learnable 𝑑(𝑙) × 𝑑(𝑙+1) matrix utilized to change the dimension of feature per vertex during the propagation over the graph.
Question 3.1 (3 points)
Describe how this layer exploits the structural information in the graph data, and how a GCN layer can be seen as performing message passing over the graph.
3.2 Relation of Convolutions and Graphs
As a special case of a graph consider a one-dimensional grid with 𝑁 nodes with periodic boundary (a circulant graph): we label the nodes so that the 𝑖-th node is connected to the (𝑖 − 1 mod 𝑛)-th and (𝑖 + 1 mod 𝑛)-node (Figure 2).
Figure 2. 1D grid with 𝑛 =6 nodes.
Question 3.2 (2 points)
Write the adjacency matrix of the circular graph with 𝑛 nodes. (use dots . . . but
make sure their meaning is not ambiguous)
where
𝐻(𝑙+1) = 𝐹𝜃(𝐻(𝑙)),
(22)
𝑤
0
𝑤−1
0
𝐹𝜃(𝐻) = C(𝜃)𝐻 := ...
0
𝑤1
𝑤1
𝑤0 𝑤−1
. . .
0
0
𝑤1
𝑤0 ...
0
. . .
. . .
0
𝑤1 ...
𝑤−1
0
0
. . .
0
... 𝑤0
𝑤−1
𝑤−1
0
. . .
𝑤1
𝑤0
− − −
− − −
− − −
− − −
− − −
ℎ0 ℎ1 ℎ2
...
ℎ𝑛−2 ℎ𝑛−1
− − −
− − −
− − −
, (23)
− − −
− − −
On this graph, we design the following GNN layer:
where 𝜃= (𝑤0, 𝑤1, 0, . . . , 0, 𝑤−1), and ℎ𝑖 is the feature vector of the node 𝑖 so that 𝐻 is a
𝐷 × 𝑛 matrix. The general form of a permutation equivariant GNN layer is
𝜙(ℎ0, 𝒩0)
𝜙(ℎ1, 𝒩1)
𝐹(𝐻) = ... ,
𝜙(ℎ𝑛−2, 𝒩𝑛−2)
𝜙(ℎ𝑛−1, 𝒩𝑛−1)
Where 𝒩𝑖 is the neighbourhood of the node 𝑖.
Question 3.3 (6 points)
(a) (2 points) What is 𝜙(ℎ𝑖, 𝒩𝑖) for the GNN layer of equation (23)? Write the explicit form of 𝜙 as a function of the nodes features ℎ𝑖.
(b) (2 points) What operation is performed by this GNN layer on a circulant graph with one-dimensional features (𝐻 is a column vector)?
(b) (2 points) Using the definition of C(𝜃) provided by equation (23), what would be the action of 𝐹𝜃˜ for 𝜃˜ = (0, 1, 0, . . . , 0)?
In what follows, you will derive a theorem that is fundamental for many spectralbased convolutional graph neural networks such as GCNs: the (discrete version of the) Convolution Theorem. The theorem states that the Discrete Fourier Transform (DFT) of a convolution of two signals x = (𝑥0, 𝑥1, . . . , 𝑥𝑛−1) is the element-wise product of their DFTs:
C(𝜃)x =Φxˆ ⊙ 𝜃ˆ, (24)
where Φ∗x = xˆ, Φxˆ = x are the DFT and inverse DFT respectively (Φ∗ is the transpose conjugate of Φ), and C(𝜃) is our circulant matrix with filter 𝜃.
The Discrete Fourier Transform
The DFT and the inverse DFT of a signal x = (𝑥0, 𝑥1, . . . , 𝑥𝑛−1) are defined as
𝑛−1 𝑛−1
𝑥ˆ𝑘 Õ 𝑥𝑢𝑒−2𝜋𝑛𝑖𝑘𝑢 𝑥𝑢 Õ 𝑥ˆ𝑘𝑒 2𝜋𝑛𝑖𝑘𝑢 . (25)
𝑛 𝑢=0 𝑛 𝑘=0
Despite the seemingly complicated form of the DFT, the idea behind it is rather simple: the complex numbers 𝑒−2𝜋𝑛𝑖𝑘𝑢 represent phases and frequencies, and the magnitude of 𝑘-th entry of the transformed signal, 𝑥ˆ𝑘, is a measure of similarity of the original signal x and a sinusoid with frequency 𝑛𝑘 .
For the purposes of this assignment it suffices to notice that eq. (25) can be written
in matrix form as
xˆ =Φ∗x x =Φxˆ (26)
with
𝜔𝑛0 𝜔𝑛0 𝜔𝑛0 . . . 𝜔𝑛0 𝜔𝑛1 𝜔𝑛2 𝜔𝑛3 . . . 𝜔𝑛𝑛−1
1 ... ... ... ... ... ,
Φ= √
𝑛 𝜔𝑛(𝑛−2) 𝜔𝑛(𝑛−2)·1 𝜔𝑛(𝑛−2)·2 . . . 𝜔𝑛(𝑛−2)·(𝑛−1)
(𝑛−1) (𝑛−1)·1 (𝑛−1)·2 (𝑛−1)·(𝑛−1
𝜔𝑛 𝜔𝑛 𝜔𝑛 . . . 𝜔𝑛
and 𝜔𝑛 = 𝑒−2𝜋𝑛𝑖𝑘𝑢 . The remarkable fact that is leveraged by many ML algorithms such as GNNs is that the eigenvalues of all circulant matrices are precisely the columns of Φ.
Question 3.4 (6 points)
(a) (3 points) As mentioned in the box, it turns out that all circulant matrices are jointly diagonalizable: they all have the same eigenvectors which are the columns of the DFT matrix Φ. The eigenvalues of a circulant matrix are the entries of 𝜃ˆ: the DFT of 𝜃.
Given this information, show, in one line, that eq. (24) holds.
Hint: think about how diagonalizable matrices can be decomposed once their eigenvalues and eigenvectors are known.
(b) (3 points) The property of two matrices being simultaneously diagonalizable implies another important property that is a fundamental result in linear algebra. Can you state that property that is a direct consequence of matrices being simultaneously diagonalizable, and what that means practically if the two matrices considered are circulant with one of the two being the circulant with filter 𝜃˜ = (0, 1, 0, . . . , 0)?
Hint: think about CNNs.
3.3 Graph Attention Networks
In what follows, we introduce the concept of attention to Graph Neural Networks. We can express the Graph Convolution Layer from equation 18 as follows:
ℎ(𝑖𝑙+1) =𝜎© ∈𝒩(Õ ) p 𝐷1𝑖𝑖𝐷𝑖𝑗 𝑊(𝑙)ℎ(𝑗𝑙)®ª (27)
𝑗 𝑖
« ¬
Notice this is the exact same operation as in eq. 18, we only changed the notation of the nodes. The now equation indexes over each node of the graph such that ℎ(𝑖𝑙) ∈ R𝑑 is a vector of activations for a node 𝑖 in the 𝑙-th layer. 𝒩(𝑖) defines the set of neighbors for that node 𝑖 (including self-connections). Notice this notation shows more clearly the aggregation of neighbor nodes. For simplicity, from now on we will ignore the normalizing factor √ 1 leading to the following equation:
𝐷𝑖𝑖𝐷𝑖𝑗
ℎ(𝑖𝑙+1) =𝜎© Õ 𝑊(𝑙)ℎ(𝑗𝑙)ª® (28)
𝑗∈𝒩(𝑖)
« ¬
Question 3.5 (3 points)
We want our Graph Convolutional Layer from equation 28 to attend over the neighbor nodes of node 𝑖. What would you add to the equation to perform attention? Provide the updated equation and justify your changes. If you get stuck, you can take a look at Graph Attention Networks paper.
3.4 Observing the Suitability of GNNs on Chemical Data
Unlike images or sequences, molecular constituent atoms do not have a canonical physical orientation or index-label ordering. The word canonical implies a natural computational representation. For example, the canonical representation of a sequence begins with the first element and ends with the last. The ambiguity in representing molecules is due to symmetries within the data and has deep implications in physics and chemistry. This can be seen with the benzene ring in Figure 3 where there is no obvious “first” or “last” atom and no particular expected orientation of the molecule. Here, we simply note that some molecular properties, such as energy in a vacuum, are invariant with respect to permutation of indices as well as invariant to rotation and translation in space. Graph Neural Networks allow us to encode these invariances into our neural representation.
In this problem, we will focus on the index ordering of the atomic constituents by creating an MLP and a GNN, which both attempt to predict the atomization energy of the molecule at 0K. There are a few things to note... First, the performance will increase significantly by standard scoring (z-scoring) the labels before training the network. Second, we will only use the atomic elements and bond type information to simplify the problem. This simplification means that classic techniques, like data augmentation with the MLP might perform well on this task. We discuss issues with MLPs, even with data augmentation, for chemical data within the questions.
Figure 3. A “ball-and-sick” model of a benzene ring. Notice that the carbon (black balls) and hydrogen atoms (white balls) are indistinguishable from other atoms of the same atomic number. Source: wikipedia.
This question involves implementing an MLP, and a GNN on a collection of molecular data called QM9 [5]. To help with the task of implementing a graph neural network, we will leverage the library PyTorch Geometric. Please install it into the correct environment using the same version of cuda (or cpu) as your PyTorch installation (see here for an installation guide). PyTorch Geometric uses a sparse representation of edges and an unusual form of batching. Please consult Tutorial 7: Graph Neural Networks before proceeding.
Since it is an implementation question, you will need to fill in the missing pieces from the github repository for the class. There are three relevant files for this assignment data.py, networks.py, train.py. The file data.py is already complete; you will need to provide a solution only in the other files.
Your code will be checked by a unit test which determines if your implementation accurately accomplishes the spirit of the question. Each question must also have a short answer in Ans Delft.
Question 3.6 (20 points)
(a) (2 points) Take a closer look at the QM9 dataset that is downloaded automatically in the data.py file and examine the first few molecules. What’s the problem if we wanted to train a Multilayer Perceptron to predict properties of this dataset, naively? How could we overcome it by leveraging our knowledge of the entire dataset (including test data)?
Hint: Consider if we found a new molecule and wanted to test our network on it.
(b) (7 points) Since we have access to the entire QM9 dataset, we can use our insight from the last question and implement a Multilayer Perceptron to predict the z-scored atomization energy at 0K, known in the QM9 data as U ATOM0 .
Implement a Multilayer Perceptron which predicts this quantity using the one-hot representation of the atomic number data.x[:, :5] and the one-hot representation of the bonds as edge attributes data.edge_attr. Train the MLP. Report the validation loss curve and the final average test loss.
(c) (3 points) U ATOM0 is a permutation invariant quantity. That means, if we encoded the exact same molecule, but changed how we indexed the atoms, we should predict the same output z-scored atomization energy.
Is that what you observe in the Multilayer Perceptron? Permute the test set’s atomic indices only. What is the average estimated test loss of your model?
Hint: We provide a function that permutes the indices of a batch of molecules. Please apply the function for this experiment.
(d) (8 points) Implement a GNN which predicts a z-scored U ATOM0 using only the one-hot representation of the atomic number data.x[:, :5] and the one-hot representation of the bonds as edge attributes data.edge_attr.
We recommend considering the graph convolutions RGCNConv, which considers categorical edge attributes, and / or MFConv, which was designed with molecules in mind. The node features may benefit from an embedding using a linear layer. After a global pooling operation, more layers may be applied, if necessary.
Report the validation loss curve during training. Next, report the test loss, then report the test loss again for molecules with permuted atomic index. What do you notice compared to the MLP? Consider hypothetically testing the GNN on a new molecule that isn’t a part of QM9. what does the GNN input representation allow that an MLP does not?
[1] One can compare this with an exam question that consists of multiple parts, where the answer of part (a) is needed for part (b), and so on. If you get (a) wrong, you are likely to get the rest of the question wrong as well. However, if you are lucky, the question might say, “If you couldn’t find the answer for (a), use: . . . ” This allows you to complete the rest of the question, even though you do lose points on part (a). With Teacher Forcing, you make sure the network trains successfully on the entire question by always filling in the dots with the correct answer (regardless of whether the network found the correct answer or not).