DL1 - Practical 3: Transformers and Deep Generative Models

This assignment contains two parts. First, you will take a closer look at the Transformer architecture. It has already been introduced in Lecture 9 and Tutorial 6, but here we will discuss the theoretical and practical aspects of the architecture in more detail.

The second part, which is the main part of the assignment, will be about Deep Generative Models. Modelling distributions in high-dimensional spaces is difficult. Simple distributions such as multivariate Gaussians or mixture models are not powerful enough to model complicated high-dimensional distributions. The question is: how can we design complicated distributions over high-dimensional data, such as images or audio? In concise notation, how can we model a distribution p(x) = p(x_1, x_2, ..., x_M), where M is the number of dimensions of the input data x? The solution: Deep Generative Models.

Deep generative models come in many flavors, but all share a common goal: to model the probability distribution of the data. Examples of well-known generative models are Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs). In this assignment, we will focus on VAEs [Kingma and Welling, 2014]. The assignment guides you through the theory of a VAE with questions along the way, and finally you will implement a VAE yourself in PyTorch. Note that, although this assignment does contain some explanation of the model, we do not aim to give a complete introduction. The best sources for understanding the models are the lectures, these papers [Goodfellow et al., 2014, Kingma and Welling, 2014, Rezende and Mohamed, 2015], and the hundreds of blog posts that have been written on them ever since.

Throughout this assignment, you will see a new type of box between questions, namely Food for thought boxes. These contain questions that are helpful for understanding the material, but are not essential and not required to submit in the report (no points are assigned to those questions). Still, try to think of a solution for these boxes to gain a deeper understanding of the models.

This assignment contains 50 points: 10 on Transformers, and 40 on VAEs. Only the VAE part contains an implementation task at the end.

Note: for this assignment you are not allowed to use the torch.distributions package. You are, however, allowed to use standard, stochastic PyTorch functions like torch.randn and torch.multinomial, and all other PyTorch functionality (especially from torch.nn). Moreover, try to stay as close as you can to the template files provided as part of the assignment.

1           Attention and Transformers          
In this part, we will discuss theoretical questions with respect to the Transformer architecture. Make sure to check the lecture on Transformers. It is recommended to also check the UvA Deep Learning Tutorial 6 on Transformers and Multi-Head Attention before continuing with this part.

1.1           Attention for Sequence-to-Sequence models
Conventional sequence-to-sequence models for neural machine translation have difficulty handling long-term dependencies between words in sentences. In such models, the neural network compresses all the necessary information of a source sentence into a fixed-length vector. Attention, as introduced by Bahdanau et al. [2015], emerged as a potential solution to this problem. Intuitively, it allows the model to focus on relevant parts of the input while decoding the output, instead of compressing everything into one fixed-length context vector. The attention mechanism is shown in Figure 1. The complete model comprises an encoder, a decoder, and an attention layer.

Unlike conventional sequence-to-sequence models, here the conditional probability of generating the next word in the sequence is conditioned on a distinct context vector c_i for each target word y_i. In particular, the conditional probability of the decoder is defined as:

p(y_i | y_1, ..., y_{i−1}, x_1, ..., x_T) = p(y_i | y_{i−1}, s_i),                (1)

where s_i is the RNN decoder hidden state at time i, computed as s_i = f(s_{i−1}, y_{i−1}, c_i).

The context vector c_i depends on a sequence of annotations (h_1, ..., h_{T_x}) to which the encoder maps the input sentence; it is computed as a weighted sum of these annotations h_j:

c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j.                                            (2)

The weight \alpha_{ij} of each annotation h_j is computed by

\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})},                  (3)

where e_{ij} = a(s_{i−1}, h_j) is an alignment model which scores how well the inputs around position j and the output at position i match. The score is based on the RNN hidden state s_{i−1} (just before emitting y_i, Eq. (1)) and the j-th annotation h_j of the input sentence. The alignment model a is parametrized as a feedforward neural network which is jointly trained with all the other components of the system. Use Figure 1 to understand the introduced notation and the derivation of the equations above.
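As a concrete illustration of Eqs. (1)-(3), here is a minimal PyTorch sketch of the additive alignment model a(s_{i−1}, h_j) and the resulting context vector. The module name, layer sizes and interface are illustrative assumptions, not part of the assignment:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """Minimal sketch of the Bahdanau alignment model a(s_{i-1}, h_j)."""

    def __init__(self, dec_dim, enc_dim, attn_dim):
        super().__init__()
        self.W = nn.Linear(dec_dim, attn_dim, bias=False)   # projects decoder state s_{i-1}
        self.U = nn.Linear(enc_dim, attn_dim, bias=False)   # projects annotations h_j
        self.v = nn.Linear(attn_dim, 1, bias=False)         # reduces to a scalar score e_ij

    def forward(self, s_prev, h):
        # s_prev: [batch, dec_dim], h: [batch, T_x, enc_dim]
        e = self.v(torch.tanh(self.W(s_prev).unsqueeze(1) + self.U(h)))  # scores e_ij, [batch, T_x, 1]
        alpha = F.softmax(e, dim=1)                                      # attention weights (Eq. 3)
        c = (alpha * h).sum(dim=1)                                       # context vector c_i (Eq. 2)
        return c, alpha.squeeze(-1)
```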

1.2           Transformer
The Transformer is the first encoder-decoder model based solely on (self-)attention mechanisms, without using recurrence or convolutions. The key concepts for understanding Transformers are: queries, keys and values; scaled dot-product attention; multi-head attention; and self-attention.

Queries, Keys, Values The Transformer paper redefined the attention mechanism by providing a generic definition based on queries, keys and values. The encoded representation of the input is viewed as a set of key-value pairs, one pair per position of the input sequence. The previously generated output in the decoder is denoted as a query.

Figure 1. The graphical illustration of the attention mechanism proposed in [Bahdanau et al., 2015]

Scaled dot-product attention    In [Vaswani et al., 2017], the authors propose scaled dot-product attention. The input consists of queries and keys of dimension d_k, and values of dimension d_v. The attention is computed as follows:

Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}}) V                            (4)

From this equation, it can be noticed that the dot product of the query with all keys represents a matching score between the two representations. This score is scaled by 1/\sqrt{d_k}, because dot products can grow large in magnitude when d_k is large, pushing the softmax into regions with very small gradients. After applying a softmax we obtain the weights of the values.
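A minimal sketch of Eq. (4) in PyTorch (the optional mask argument is an assumption for completeness, not required by the equation):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Sketch of Eq. 4: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)  # [..., T_q, T_k]
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    attn = F.softmax(scores, dim=-1)      # attention weights over the keys
    return torch.matmul(attn, v), attn    # weighted sum of the values
```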

Multi-head attention Instead of performing a single attention function, it is beneficial to linearly project the Q, K and V, h times with different, learned linear projections to dk, dk and dv dimensions, respectively. The outputs are concatenated and once again projected by an output layer:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O,

where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V).
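A compact sketch of the multi-head wrapper; it reuses the scaled_dot_product_attention helper sketched above, and the single fused projection per Q/K/V is an implementation convenience rather than something prescribed by the assignment:

```python
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Sketch of multi-head attention with h heads of size d_k = d_model / h."""

    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        self.q_proj = nn.Linear(d_model, d_model)   # W^Q for all heads at once
        self.k_proj = nn.Linear(d_model, d_model)   # W^K
        self.v_proj = nn.Linear(d_model, d_model)   # W^V
        self.o_proj = nn.Linear(d_model, d_model)   # W^O

    def forward(self, q, k, v):
        # q, k, v: [batch, seq_len, d_model]
        b = q.size(0)
        def split(x):  # -> [batch, heads, seq_len, d_k]
            return x.view(b, -1, self.num_heads, self.d_k).transpose(1, 2)
        q, k, v = split(self.q_proj(q)), split(self.k_proj(k)), split(self.v_proj(v))
        out, _ = scaled_dot_product_attention(q, k, v)   # reuses the sketch above
        out = out.transpose(1, 2).contiguous().view(b, -1, self.num_heads * self.d_k)
        return self.o_proj(out)                          # Concat(head_1,...,head_h) W^O
```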

Self-attention       Self-attention (also called intra-attention) is a variant of the attention mechanism relating different positions of a single sequence in order to compute a revised representation of the sequence.

Question 1.1 
Consider the encoder-decoder attention introduced in Bahdanau et al. [2015] and the self-attention used in the encoder and decoder of the Transformer architecture [Vaswani et al., 2017]. Compare the two mechanisms by denoting what the queries, keys and values represent. (Word limit: 100)

Question 1.2
Discuss the challenge of long input sequence lengths for the Transformer model by:

Explaining the underlying cause of this challenge.
Describing a way to overcome this challenge. (Word limit: 150)
1.3           Transformer-based Models
Transformers became the de-facto standard not only for NLP tasks but also for vision and multimodal tasks. The family of Transformer-based models grows every day, thus it is important to understand the fundamentals of different models and which tasks they can solve.

Question 1.3 
Describe how you would solve the task of spam classification using Transformers, if no meta-data is available (only the email text itself). Specifically:

Explain what the input and output representations should look like.
Design the training and inference stage.
(Word limit: 200)

2           Variational Autoencoders
VAEs leverage the flexibility of neural networks (NNs) to learn and specify a latent variable model. We will first briefly discuss latent variable models and then dive into VAEs. Mathematically, they are connected to a distribution p(x) over x in the following way: p(x) = \int p(x|z) p(z) dz. This integral is typically too expensive to evaluate. However, in this assignment, you will learn a solution via VAEs.

2.1           Latent Variable Models

A latent variable model is a statistical model that contains both observed and unobserved (i.e. latent) variables. Assume a dataset D = {x_1, ..., x_N}, where x_n ∈ {0,1}^M. For example, x_n can be the pixel values of a binary image. A simple latent variable model for this data is shown in Figure 2, which we can also summarize with the following generative story:

z_n ∼ N(0, I_D)                                                                    (5)
x_n ∼ p_X(f_θ(z_n))                                                                (6)

where f_θ is some function – parameterized by θ – that maps z_n to the parameters of a distribution over x_n. For example, if p_X were a Gaussian distribution, f_θ would map z_n to a mean vector and covariance matrix; if p_X is a product of Bernoulli distributions, we have f_θ : R^D → [0,1]^M. Here, D denotes the dimensionality of the latent space. Likewise, if pixels can take on k discrete values, p_X could be a product of Categorical distributions, so that f_θ : R^D → (p_1, ..., p_k)^M, where p_1, ..., p_k are the event probabilities of a pixel taking each of the k possible values, with p_i ≥ 0 and \sum_{i=1}^{k} p_i = 1. Note that our dataset D does not contain z_n, hence z_n is a latent (or unobserved) variable in our statistical model. In the case of a VAE, a (deep) NN is used for f_θ(·).

Figure 2. Graphical model of the VAE. N denotes the dataset size.
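To make the generative story concrete, here is a minimal sketch of ancestral sampling using only the allowed stochastic functions torch.randn and torch.multinomial. The decoder interface (returning per-pixel Categorical probabilities of shape [B, M, K]) is an assumption for illustration, not the template's API:

```python
import torch

@torch.no_grad()
def sample_images(decoder, batch_size, z_dim):
    """Ancestral sampling from the latent variable model (Eqs. 5-6)."""
    z = torch.randn(batch_size, z_dim)          # z_n ~ N(0, I_D)
    probs = decoder(z)                          # f_theta(z_n): event probabilities per pixel
    B, M, K = probs.shape
    # Draw one value per pixel from its Categorical distribution
    x = torch.multinomial(probs.reshape(B * M, K), num_samples=1)
    return x.reshape(B, M)                      # integer pixel values in {0, ..., K-1}
```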

Food for thought
How does the VAE relate to a standard autoencoder (see e.g. Tutorial 9)?

Are they different in terms of their main purpose? How so?
A VAE is generative. Can the same be said of a standard autoencoder? Why or why not?
Can a VAE be used in place of a standard autoencoder for its purpose you mentioned above?
2.2           Decoder: The Generative Part of the VAE
In the previous section, we described a general graphical model which also applies to VAEs. In this section, we will define a more specific generative model that we will use throughout this assignment. This will later be referred to as the decoding part (or decoder) of a VAE. For this assignment we will assume the pixels of our images x_n in the dataset D are Categorical(p) distributed.

p(z_n) = N(0, I_D)                                                                 (7)

p(x_n | z_n) = \prod_{m=1}^{M} Cat(x_n^{(m)} | f_θ(z_n)_m)                          (8)

where x_n^{(m)} is the m-th pixel of the n-th image in D, and f_θ : R^D → (p_1, ..., p_k)^M is a neural network parameterized by θ that outputs the probabilities of the Categorical distributions for each pixel in x_n. In other words, p = (p_1, ..., p_k) are the event probabilities of a pixel taking each of the k possible values, with p_i ≥ 0 and \sum_{i=1}^{k} p_i = 1.

Now that we have defined the model, we can write out an expression for the log probability of the data D under this model:

\log p(D) = \sum_{n=1}^{N} \log p(x_n) = \sum_{n=1}^{N} \log E_{p(z_n)}[p(x_n | z_n)].       (9)

Evaluating log p(x_n) = log E_{p(z_n)}[p(x_n|z_n)] involves a very expensive integral. However, Equation 9 hints at a method for approximating it, namely Monte-Carlo integration. The log-likelihood can be approximated by drawing samples z_n^{(l)} from p(z_n):

\log p(x_n) = \log E_{p(z_n)}[p(x_n | z_n)]                                                  (10)

            ≈ \log \frac{1}{L} \sum_{l=1}^{L} p(x_n | z_n^{(l)}),   z_n^{(l)} ∼ p(z_n).      (11)

If we increase the number of samples L to infinity, the approximation becomes equal to the actual expectation. Hence, the estimator is unbiased and can be used to approximate log p(x_n) with a sufficiently large number of samples.
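For illustration only, a naive Monte-Carlo estimator along the lines of Eqs. (10)-(11) could look like the sketch below. The decoder interface (returning per-pixel log-probabilities of shape [L, M, K]) is an assumption, and logsumexp is used for numerical stability:

```python
import math
import torch

@torch.no_grad()
def naive_log_px(x, decoder, z_dim, num_samples=1000):
    """Naive MC estimate: log p(x) ~= log (1/L) sum_l p(x | z^(l)), z^(l) ~ p(z).

    `x` is a LongTensor of shape [M] holding integer pixel values.
    """
    z = torch.randn(num_samples, z_dim)                              # samples from the prior p(z)
    log_probs = decoder(z)                                           # [L, M, K]
    idx = x.view(1, -1, 1).expand(num_samples, -1, 1)                # pick the observed value per pixel
    log_px_given_z = log_probs.gather(-1, idx).squeeze(-1).sum(-1)   # log p(x | z^(l)), shape [L]
    # log-mean-exp for numerical stability
    return torch.logsumexp(log_px_given_z, dim=0) - math.log(num_samples)
```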

Question 2.2 (3 points)
Although Monte-Carlo integration with samples from p(z_n) can be used to approximate log p(x_n), it is not used for training VAE-type models, because it is inefficient. In a few sentences, describe why it is inefficient and how this inefficiency scales with the dimensionality of z. (Hint: you may use Figure 3 in your explanation.)

2.3           KL Divergence
Before continuing our discussion about VAEs, we will need to learn about another concept that will help us later: the Kullback-Leibler divergence (KL divergence). It measures how different one probability distribution is from another:

D_{KL}(q || p) = E_{x∼q}\left[\log \frac{q(x)}{p(x)}\right],                        (12)

where q and p are probability distributions in the space of some random variable X.

Question 2.3 (2 points)
Assume that q and p in Equation 12 are univariate Gaussians: q = N(µ_q, σ_q^2) and p = N(µ_p, σ_p^2). Give two examples of (µ_q, σ_q, µ_p, σ_p): one which results in a very small, and one which results in a very large, KL divergence D_{KL}(q||p).

Figure 3. Plot of 2-dimensional latent space and contours of prior and posterior distributions. The red contour shows the prior p(z) which is a Gaussian distribution with zero mean and standard deviation of one. The black points represent samples from the prior p(z). The blue contour shows the posterior distribution p(z|x) for an arbitrary x, which is a complex distribution and here, for example, peaked around (1,1).

In VAEs, we usually set the prior to be a normal distribution with a zero mean and unit variance: p = N(0,1). For this case, we can actually find a closed-form solution of the KL divergence:

D_{KL}(q || p) = E_{z∼q}[\log q(z) − \log p(z)]                                              (13)

             = E_{z∼q}[\log N(z | µ, σ^2) − \log N(z | 0, 1)]                                (14)

             = E_{z∼q}\left[−\frac{1}{2}\log σ^2 − \frac{(z−µ)^2}{2σ^2} + \frac{z^2}{2}\right]   (15)

             = \frac{1}{2}\left(σ^2 + µ^2 − 1 − \log σ^2\right),                             (16)

where q = N(µ, σ^2) and p = N(0, 1).

For simplicity, we skipped a few steps in the derivation (the evaluation of the Gaussian expectations); they are not essential for understanding the VAE. We will need this result for our implementation of the VAE later.
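In code, Equation 16 is essentially a one-liner. A minimal sketch, assuming the encoder outputs the log-variance log σ^2 (a common parameterization, but an assumption here, not something mandated by the assignment):

```python
import torch

def kl_to_standard_normal(mean, log_var):
    """Closed-form KL(N(mu, sigma^2) || N(0, 1)) of Eq. 16, summed over latent dimensions."""
    # 0.5 * (sigma^2 + mu^2 - 1 - log sigma^2) per dimension
    return 0.5 * (torch.exp(log_var) + mean ** 2 - 1.0 - log_var).sum(dim=-1)
```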

2.4           The Encoder: qϕ(zn|xn) - Efficiently evaluating the integral
In Section 2.2, we developed the intuition for why we need the posterior p(z_n|x_n). Unfortunately, the true posterior p(z_n|x_n) is as difficult to compute as p(x_n) itself. To solve this problem, instead of modeling the true posterior p(z_n|x_n), we can learn an approximate posterior distribution, which we refer to as the variational distribution. This variational distribution q(z_n|x_n) is used to approximate the (very expensive) posterior p(z_n|x_n).

Now we have all the tools to derive an efficient bound on the log-likelihood logp(D). We start from Equation 9 where the log-likelihood objective is written, but for simplicity in notation we write the log-likelihood logp(xn) only for a single datapoint.

\log p(x_n) = \log \int p(x_n | z_n) p(z_n) dz_n

            = \log \int p(x_n | z_n) p(z_n) \frac{q(z_n | x_n)}{q(z_n | x_n)} dz_n                     (multiply by q(z_n|x_n)/q(z_n|x_n))

            = \log E_{q(z_n|x_n)}\left[\frac{p(x_n | z_n) p(z_n)}{q(z_n | x_n)}\right]                 (switch expectation distribution)

            ≥ E_{q(z_n|x_n)}\left[\log \frac{p(x_n | z_n) p(z_n)}{q(z_n | x_n)}\right]                 (Jensen's inequality)

            = E_{q(z_n|x_n)}[\log p(x_n | z_n)] + E_{q(z_n|x_n)}\left[\log \frac{p(z_n)}{q(z_n | x_n)}\right]   (re-arranging)

            = \underbrace{E_{q(z_n|x_n)}[\log p(x_n | z_n)] − KL(q(Z|x_n) || p(Z))}_{\text{Evidence Lower Bound (ELBO)}}   (writing 2nd term as KL)   (17)

This is awesome! We have derived a bound on log p(x_n), exactly the thing we want to optimize, where all terms on the right-hand side are computable. Let's put together what we have derived again in a single line:

\log p(x_n) ≥ E_{q(z_n|x_n)}[\log p(x_n|z_n)] − KL(q(Z|x_n) || p(Z)).

The right-hand side of the equation is referred to as the evidence lower bound (ELBO) on the log-probability of the data.

This leaves us with the question: How close is the ELBO to logp(xn)? With an alternate derivation[1], we can find the answer. It turns out the gap between logp(xn) and the ELBO is exactly KL(q(Z|xn)||p(Z|xn)) such that:

\log p(x_n) − KL(q(Z|x_n) || p(Z|x_n)) = E_{q(z_n|x_n)}[\log p(x_n|z_n)] − KL(q(Z|x_n) || p(Z))        (18)

Now, let’s optimize the ELBO. For this, we define our loss as the mean negative lower bound over samples:

L(θ, ϕ) = −\frac{1}{N} \sum_{n=1}^{N} \Big( E_{q_ϕ(z|x_n)}[\log p_θ(x_n | Z)] − KL(q_ϕ(Z|x_n) || p(Z)) \Big)        (19)

Note that we make an explicit distinction between the generative parameters θ and the variational parameters ϕ.

Question 2.4 (3 points)
Explain how you can see from Equation 18 that the right-hand side has to be a lower bound on the log-probability log p(x_n). Why must we optimize the lower bound instead of optimizing the log-probability log p(x_n) directly?

Question 2.5 (3 points)
Now, looking at the two terms on the left-hand side of Equation 18: two things can happen when the lower bound is pushed up. Can you describe what these two things are?

2.5           Specifying the Encoder qϕ(zn|xn)
In the VAE, we have some freedom to choose the distribution q_ϕ(z_n|x_n). In essence, we want to choose something that can closely approximate p(z_n|x_n), but we are also free to select a distribution that makes our life easier. We will do exactly that in this case and choose q_ϕ(z_n|x_n) to be a factored multivariate normal distribution, i.e.,

                                                     qϕ(zn|xn) = N(zn|µϕ(xn),diag(Σϕ(xn))),                                              (20)

where µ_ϕ : R^M → R^D maps an input image to the mean of the multivariate normal over z_n, and Σ_ϕ : R^M → R_{>0}^D maps the input image to the diagonal of the covariance matrix of that same distribution. Moreover, diag(v) maps a K-dimensional (for any K) input vector v to a K × K matrix such that for i, j ∈ {1,...,K}

diag(v)_{ij} = \begin{cases} v_i & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}                 (21)
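For intuition, a minimal sketch of such an amortized Gaussian encoder. An MLP stands in for the CNN of the actual template, and the log-variance output head is again an assumed parameterization:

```python
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Amortized encoder q_phi(z|x) = N(mu_phi(x), diag(Sigma_phi(x))); MLP sketch."""

    def __init__(self, input_dim=28 * 28, hidden_dim=512, z_dim=20):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.mean_head = nn.Linear(hidden_dim, z_dim)      # mu_phi(x)
        self.log_var_head = nn.Linear(hidden_dim, z_dim)   # log of the diagonal of Sigma_phi(x)

    def forward(self, x):
        h = self.net(x)
        return self.mean_head(h), self.log_var_head(h)
```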

Now that we have defined an objective (Equation 19) in terms of an abstract model and a variational approximation, we can put everything together using our model definition (Equations 5 and 6) and the definition of q_ϕ(z_n|x_n) (Equation 20), and write down a single objective which we can minimize.

First, we write down the reconstruction term:

L_n^{recon} = −E_{q_ϕ(z|x_n)}[\log p_θ(x_n | Z)]

            ≈ −\frac{1}{L} \sum_{l=1}^{L} \log p_θ(x_n | z_n^{(l)}),   z_n^{(l)} ∼ q_ϕ(z|x_n),

where we used Monte-Carlo integration to approximate the expectation,

            = −\frac{1}{L} \sum_{l=1}^{L} \sum_{m=1}^{M} \log \text{Cat}(x_n^{(m)} | f_θ(z_n^{(l)})_m).

Remember that f_θ(·) denotes our decoder. Now let p^{(m)} = f_θ(z_n^{(l)})_m, then

\log \text{Cat}(x_n^{(m)} | p^{(m)}) = \sum_{k} x_{n,k}^{(m)} \log p_k^{(m)},

where x_{n,k}^{(m)} = 1 if the m-th pixel of x_n has the value k, and zero otherwise. In other words, the equation above represents the common cross-entropy loss term. When setting L = 1 (i.e. only one sample for z_n), we obtain:

L_n^{recon} = −\sum_{m=1}^{M} \sum_{k} x_{n,k}^{(m)} \log p_k^{(m)},

where p^{(m)} = f_θ(z_n)_m and z_n ∼ q(z_n|x_n). Thus, we can use the cross-entropy loss with respect to the original input x_n to optimize L_n^{recon}.
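In PyTorch, this single-sample reconstruction term is what F.cross_entropy computes when the decoder outputs unnormalized per-pixel logits; a minimal sketch, where the [B, K, H, W] tensor layout is an assumption:

```python
import torch.nn.functional as F

def reconstruction_loss(logits, targets):
    """Single-sample (L = 1) reconstruction term L_n^recon as a categorical cross-entropy.

    `logits`: decoder outputs of shape [B, K, H, W]; `targets`: integer pixel values [B, H, W].
    """
    # F.cross_entropy applies log-softmax internally and computes -sum_k x_k log p_k per pixel
    per_pixel = F.cross_entropy(logits, targets, reduction="none")   # [B, H, W]
    return per_pixel.sum(dim=(1, 2))                                 # sum over pixels -> one value per image
```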

Next, we write down the regularization term:

L_n^{reg} = D_{KL}(N(Z | µ_ϕ(x_n), diag(Σ_ϕ(x_n))) || N(Z | 0, I_D)).

Using the fact that both probability distributions factorize and that the KL divergence of two factorized distributions is a sum of per-dimension KL terms, we can rewrite this as

L_n^{reg} = \sum_{d=1}^{D} D_{KL}(N(Z^{(d)} | µ_ϕ(x_n)_d, Σ_ϕ(x_n)_d) || N(Z^{(d)} | 0, 1)).

Let µ_{nd} = µ_ϕ(x_n)_d and σ_{nd} = Σ_ϕ(x_n)_d, then using the closed-form solution we found in Section 2.3 (Equation 16) we have

L_n^{reg} = \frac{1}{2} \sum_{d=1}^{D} \big( σ_{nd} + µ_{nd}^2 − 1 − \log σ_{nd} \big).

Hence, we can find the regularization term via this simple equation.
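Putting the two terms together yields the per-batch loss of Equation 19. A sketch that reuses reconstruction_loss and kl_to_standard_normal from the sketches above:

```python
def elbo_loss(logits, targets, mean, log_var):
    """Negative ELBO of Eq. 19 for a batch, averaged over the batch dimension."""
    l_recon = reconstruction_loss(logits, targets)     # -E_q[log p(x|z)], single-sample estimate
    l_reg = kl_to_standard_normal(mean, log_var)       # KL(q(z|x) || p(z)), closed form
    return (l_recon + l_reg).mean()
```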

2.6           The Reparametrization Trick
Although we have written down (the terms of) an objective above, we still cannot simply minimize this by taking gradients with regard to θ and ϕ. This is due to the fact that we sample from qϕ(zn|xn) to approximate the Eqϕ(z|xn)[logpθ(xn|Z)] term. Yet, we need to pass the derivative through these samples if we want to compute the gradient of the encoder parameters, i.e., ∇ϕL(θ,ϕ). Our posterior approximation qϕ(zn|xn) is parameterized by ϕ. If we want to train qϕ(zn|xn) to maximize the lower bound, and therefore approximate the posterior, we need to have the gradient of the lower-bound with respect to ϕ.
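The reparameterization trick discussed in the next question amounts to a single line of code; a minimal sketch, assuming the same log-variance parameterization as in the encoder sketch above:

```python
import torch

def reparameterize(mean, log_var):
    """z = mu + sigma * eps with eps ~ N(0, I): gradients flow through mu and sigma."""
    eps = torch.randn_like(mean)                    # parameter-free noise
    return mean + torch.exp(0.5 * log_var) * eps    # sigma = exp(0.5 * log sigma^2)
```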

Figure 4. A VAE architecture on MNIST. The encoder distribution q(z|x) maps the input image into latent space. This latent space should follow a unit Gaussian prior p(z). A sample from q(z|x) is used as input to the decoder p(x|z) to reconstruct the image. Figure taken from this blog. Note that we are using FashionMNIST and not MNIST to train our VAE in this assignment.

Question 2.7 (3 points)
Passing the derivative through samples can be done using the reparameterization trick. In a few sentences, explain why the act of sampling usually prevents us from computing ∇_ϕ L, and how the reparameterization trick solves this problem.

2.7           Putting things together: Building a VAE
Given everything we have discussed so far, we now have an objective (the evidence lower bound or ELBO) and a way to backpropagate to both θ and ϕ (i.e., the reparametrization trick). Thus, we can now implement a VAE in PyTorch to train on FashionMNIST images.

We will model the encoder q(z|x) and decoder p(x|z) by a deep neural network each, and train them to maximize the data likelihood. See Figure 4 for an overview of the components we need to consider in a VAE.

In the code directory part1, you can find the templates to use for implementing the VAE. We provide two versions of the training loop: a template in PyTorch Lightning (train_pl.py), and a template in plain PyTorch (train_torch.py). You can choose which you prefer to implement; you only need to implement one of the two training loop templates. If you followed the tutorial notebooks, you might want to give PyTorch Lightning a try, as it is less work, more structured, and has an automatic saving and logging mechanism. You do not need to know PyTorch Lightning down to the lowest level; a high-level understanding, as given in the introduction of Tutorial 5, is sufficient for implementing the template.

You also need to implement additional functions in utils.py, and the encoder and decoder in the file cnn_encoder_decoder.py. We specified a recommended architecture to start with, but you are allowed to experiment with your own ideas for the models. For the sake of the assignment, it is sufficient to use the recommended architecture to achieve full points. Use the provided unit tests to ensure the correctness of your implementation. Details on the files can be found in the README of part 2.

As a loss objective and test metric, we will use the bits per dimension score (bpd). Bpd is motivated from an information-theory perspective and describes how many bits we would need to encode a particular example under our modeled distribution. You can see it as how many bits we would need to store this image on our computer or send it over a network, given our model. The fewer bits we need, the more likely the example is under our distribution. Hence, we can use bpd as a loss metric to minimize. When we evaluate the bits per dimension on our test dataset, we can judge whether our model generalizes to new samples of the dataset and didn't just memorize the training dataset. In order to calculate the bits per dimension score, we can rely on the negative log-likelihood we obtained from the ELBO, and change the log base (as bits are base 2, while the NLL is computed with the natural logarithm):

bpd = nll \cdot \log_2(e) \cdot \big(d_1 \cdot d_2 \cdot ... \cdot d_K\big)^{-1},

where d1,...,dK are the dimensions of the input excluding any batch dimension. For images, this would be the height, width and channel number. We average over those dimensions in order to have a metric that is comparable across different image resolutions. The nll represents the negative log-likelihood loss L from Equation 19 for a single data point. You should implement this function in utils.py.
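For reference, a minimal sketch of this nats-to-bpd conversion; the function name elbo_to_bpd and the argument layout are assumptions, not necessarily the template's exact signature:

```python
import numpy as np

def elbo_to_bpd(nll, img_shape):
    """Convert a per-image negative ELBO in nats to bits per dimension.

    `img_shape` is the full tensor shape [B, C, H, W]; the batch dim is excluded.
    """
    num_dims = np.prod(img_shape[1:])           # C * H * W
    return nll * np.log2(np.e) / num_dims       # change of log base e -> 2, averaged over dims
```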

Question 2.8 (12 points)
Build a Variational Autoencoder in the provided templates, and train it on the FashionMNIST dataset. Both the encoder and decoder should be implemented as a CNN. For the architecture, you can use the same as used in Tutorial 9 about Autoencoders. Note that you have to adjust the output shape of the decoder to 1 × 28 × 28 for FashionMNIST. You can do this by adjusting the output padding of the first transposed convolution in the decoder. Use a latent space size of z_dim=20. Read the provided README to become familiar with the code template.

In your submission, provide a short description (no more than 8 lines) of the used architectures for the encoder and decoder, any hyperparameters, and your training steps. Additionally, plot the estimated bits per dimension score of the lower bound on the training and validation set as training progresses, and report the final test score. You are allowed to take screenshots of a TensorBoard plot if the axis values are clear.

Note: using the default hyperparameters is sufficient to obtain full points. As a reference, the training loss should start at around 4 bpd, reach below 2.0 after 2 epochs, and end between 1.20-1.25 after 80 epochs.

Question 2.9 (4 points)
Plot 64 samples (8 × 8 grid) from your model at three points throughout training (before training, after training 10 epochs, and after training 80 epochs). You should observe an improvement in the quality of samples. Briefly describe the quality and/or issues of the generated images.

Question 2.10 (4 points)
Train a VAE with a 2-dimensional latent space (z_dim=2 in the code). Use this VAE to plot the data manifold as is done in Figure 4b of [Kingma and Welling, 2014]. This is achieved by taking a two-dimensional grid of points in Z-space and plotting the corresponding decoder outputs f_θ(z). Use the percent point function (ppf, or the inverse CDF) to cover the part of Z-space that has significant density. Implement it in the function visualize_manifold in utils.py, and use a grid size of 20. Do you recognize any patterns in the positions of the different items of clothing?
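As an illustration of the ppf-grid idea (the decoder interface returning per-pixel value logits of shape [B, K, 28, 28] and the grayscale rendering via the expected pixel value are assumptions, not the template's actual API), a rough sketch could look like:

```python
import numpy as np
import torch
from scipy.stats import norm
from torchvision.utils import make_grid

@torch.no_grad()
def visualize_manifold(decoder, grid_size=20):
    """Decode a ppf grid over the 2D latent space and arrange the results in an image grid."""
    # Percent point function (inverse CDF) covers the high-density region of N(0, 1)
    quantiles = np.linspace(0.5 / grid_size, 1.0 - 0.5 / grid_size, grid_size)
    ppf = torch.from_numpy(norm.ppf(quantiles)).float()
    zx, zy = torch.meshgrid(ppf, ppf, indexing="ij")
    z = torch.stack([zx.flatten(), zy.flatten()], dim=-1)       # [grid_size**2, 2]
    probs = torch.softmax(decoder(z), dim=1)                    # assumed decoder output: [B, K, 28, 28] logits
    # Render each image as the expected pixel value, scaled to [0, 1]
    values = torch.arange(probs.size(1), dtype=torch.float32).view(1, -1, 1, 1)
    imgs = (probs * values).sum(dim=1, keepdim=True) / (probs.size(1) - 1)
    return make_grid(imgs, nrow=grid_size)
```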

References
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1409.0473.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations (ICLR), 2014.

Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on Machine Learning, ICML'15, pages 1530–1538. JMLR.org, 2015. URL http://dl.acm.org/citation.cfm?id=3045118.3045281.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.

[1] This derivation is not done here, but can be found in, for instance, Bishop, Section 9.4.
