VAE - Variational Autoencoders

Abstract
Variational autoencoders provide a way to describe an observation in a latent space. In this project, we give an introduction to variational autoencoders and the logic behind their framework, using the mathematical concepts involved in defining them. We cover the basic architecture of an autoencoder, the use of Principal Component Analysis to show that maximising variance minimises reconstruction error, and the development into variational autoencoders and their latent space visualisation via a coding component.

Motivation and Context
We can consider the following problem: suppose there exists a process f that operates on a vector parameter z ∈ Rd, generating an output x ∈ Rn (we assume that d < n). We refer to z as a latent vector.

We now observe x, and we want to figure out a mapping from the n-dimensional space to the d-dimensional space, to get an estimate ẑ of the underlying parameters.

Why do we want to find z?

The latent space of a distribution can be thought of as a space of vectors z where, if x1 and x2 correspond to similar observations, the corresponding vectors z1 and z2 are close to each other in that space. Ideally, we would like each dimension of the latent space to correspond to a parameter that is easily understood by a human, but that may not be possible when dealing with data such as text and images.

Nonetheless, reducing the dimensions of x has its utility, particularly in the context of downstream tasks such as classification and regression. High-dimensional spaces cause issues for models that depend on distance metrics for their outputs, such as K-Nearest Neighbours, and can also degrade linear and logistic regression.

This problem is known as the curse of dimensionality.

Reducing the number of dimensions fed to downstream models lets us exploit the fact that the dimensions in the latent space are meaningful, and avoids the problems of using distances in high-dimensional spaces, improving the performance of downstream models.
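To make the distance problem concrete, here is a small illustrative sketch (not part of the project code; the data, sizes, and seed are arbitrary choices for the demonstration) showing how distances lose contrast as the dimensionality grows:

```python
# Illustrative sketch: distance concentration in high dimensions.
# As dimensionality grows, the nearest and farthest neighbours of a query point
# become almost equally far away, which hurts distance-based models such as k-NN.
import numpy as np

rng = np.random.default_rng(0)

for d in [2, 10, 100, 1000]:
    points = rng.uniform(size=(1000, d))   # 1000 random points in [0, 1]^d
    query = rng.uniform(size=d)
    dists = np.linalg.norm(points - query, axis=1)
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:5d}  relative distance contrast = {contrast:.3f}")
# The contrast shrinks as d grows: distances become less informative.
```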

Generative modelling
A subdomain of machine learning is generative modelling, which aims to solve the more general problem of learning the distribution over variables, simulating the behaviour of data generation in the real world.

We looked at different real-life examples of such modelling:

Meteorologists model the weather using partial differential equations that capture the physics of the atmosphere.
Astronomers model the formation of galaxies using physical laws for celestial and stellar bodies.
Hence, we were determined to dig deeper into this concept for the project, to better understand the concepts taught through an example that is widely used across industries.

Index
Traditional Dimensionality Reduction
Review of Autoencoders
Review of VAEs
Statistical Concepts
Representation Learning Approach
Real-Life Application of VAE
Coding component
Principal Component Analysis
Dimensionality reduction is the process of transforming data from a high-dimensional space into a low-dimensional space while retaining the important patterns of the original data. It works by reducing the number of features that describe the data.

The main purpose of a dimensionality reduction method is to find the best encoder/decoder pair from the set of possible encoders and decoders: one that retains maximum information when encoding and hence results in minimum reconstruction error when decoding.

A main linear method of dimensionality reduction is Principal Component Analysis (PCA). To give a quick explanation, PCA tries to find components that cover most of the variance in the input data. In other words, PCA tries to find the best linear subspace of the initial space such that the error of approximating the data by their projections is minimised.

How PCA works to reduce dimensionality:

Firstly, the initial variables are standardised (centred): X̂ = X − X̄.
Then, the covariance matrix is calculated from the covariances between all possible pairs of input variables: S = (1/N) X̂ᵀX̂.
From the covariance matrix, the eigenvectors and eigenvalues are calculated, and these give the principal components of the data. The principal components can be understood as linear combinations of the initial variables, constructed so that the new variables are uncorrelated.
Following this framework, the maximum possible information from the initial variables is fitted into the first component, then the maximum of the remaining information into the second component, and so on.
As such, the first principal component represents the largest possible variance in the dataset. The second principal component, calculated similarly, represents the next highest variance subject to being perpendicular to (uncorrelated with) the first principal component. This is repeated until p principal components have been calculated for p variables.
As seen in the graph above, 10-dimensional data produces 10 principal components, and the largest share of the explained variance is captured by the first component; after that, the information in each subsequent component keeps decreasing.

When the information is organised into principal components like this, it allows dimensionality reduction without the loss of important information, by discarding the components with low information.
Just as the eigenvectors helped us find the principal components, they also help us choose the components with useful information (high eigenvalue) over the ones with less important information (low eigenvalue). The vectors that represent important information form a matrix of vectors that we call the 'feature vector'.
Since only some of the components are selected, mapping the data onto them gives a correspondingly reduced number of dimensions in the final output.

Finally, the data is reoriented to the axes represented by the principal components using the feature vector, as seen in the equation below:
Output dataset = (Feature vector)ᵀ × (Standardised input dataset)ᵀ

The objective of the process above is maximising the variance in order to minimise the reconstruction loss.
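As a concrete illustration of the steps above, the following minimal NumPy sketch (written for this write-up; the function and variable names are our own assumptions, not the project's code) centres the data, computes the covariance matrix, and projects onto the top-d eigenvectors:

```python
# Minimal PCA sketch following the steps above (illustrative, not the project code).
import numpy as np

def pca(X, d):
    """Project X (N x p) onto its top-d principal components."""
    X_hat = X - X.mean(axis=0)                 # 1. standardise (centre) the variables
    S = (X_hat.T @ X_hat) / X_hat.shape[0]     # 2. covariance matrix S = (1/N) X^T X
    eigvals, eigvecs = np.linalg.eigh(S)       # 3. eigenvalues / eigenvectors of S
    order = np.argsort(eigvals)[::-1]          # sort by decreasing eigenvalue
    W = eigvecs[:, order[:d]]                  # 4./5. keep the top-d components (feature vector)
    return X_hat @ W, W                        # reorient the data onto the new axes

X = np.random.default_rng(0).normal(size=(500, 10))
Z, W = pca(X, d=2)
print(Z.shape)  # (500, 2)
```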

Maximising Variance:

The goal is to maximise the variance of the projected data, so we look for the orthogonal projection of the data onto a lower-dimensional linear space. Consider the case of d = 1 dimension, with a projection direction w1 subject to ∥w1∥ = 1. The variance of the projected data is

w1ᵀ S w1

where S is the covariance matrix.

Introducing a Lagrange multiplier λ1 to ensure the unit norm of w1, the maximum-variance optimisation problem becomes maximising

w1ᵀ S w1 + λ1 (1 − w1ᵀ w1)

Setting the derivative with respect to w1 to zero shows that S w1 = λ1 w1.

Hence, on substitution, the maximum variance in the lower-dimensional space equals the eigenvalue: w1ᵀ S w1 = λ1, so w1 is the eigenvector with the largest eigenvalue.

Overall, for a lower-dimensional space with d dimensions, the principal components are the eigenvectors corresponding to the d largest eigenvalues, as seen above for d = 1.

Minimising Reconstruction Error:

The goal is to minimise the mean squared error between the data points and their linear projections,

J = (1/N) Σ_n ∥x_n − x̃_n∥²

where x̃_n is the reconstruction of x_n from the lower-dimensional latent variable.

We use the properties of an orthonormal basis, wᵢᵀwⱼ = δᵢⱼ, and its completeness, so that any data point can be written as a linear combination of the basis vectors; the reconstruction x̃_n keeps only the first d terms of this expansion.

Computing J with this expansion and substituting in, the error reduces to a sum over the discarded directions, J = Σ_{i>d} wᵢᵀ S wᵢ. Setting its derivative with respect to wᵢ to zero, and normalising to ∥wᵢ∥² = 1 to overcome the trivial solution at wᵢ = 0, shows that the minimum reconstruction error J is achieved by choosing the wᵢ to be eigenvectors of the covariance matrix, S wᵢ = λᵢ wᵢ. The minimum error is then the sum of the eigenvalues of the discarded directions, so we discard the directions with the smallest eigenvalues.

This mirrors the variance-maximisation problem above, which proves that minimising the reconstruction error maximises the retained variance.
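This equivalence can also be checked numerically. The sketch below (illustrative only; the synthetic data and dimensions are arbitrary) verifies that the mean squared reconstruction error of a top-d projection equals the sum of the discarded eigenvalues:

```python
# Numerical check (illustrative): the minimum reconstruction error of a rank-d
# projection equals the sum of the discarded covariance eigenvalues.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10)) @ rng.normal(size=(10, 10))  # correlated data
X_hat = X - X.mean(axis=0)
S = (X_hat.T @ X_hat) / X_hat.shape[0]

eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

d = 3
W = eigvecs[:, :d]                      # top-d eigenvectors (maximum-variance directions)
X_rec = (X_hat @ W) @ W.T               # project down and reconstruct
mse = np.mean(np.sum((X_hat - X_rec) ** 2, axis=1))

print(mse, eigvals[d:].sum())           # the two numbers agree (up to floating point)
```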

Autoencoders
A brief aside into Neural Nets
When discussing Kernels, we discussed the idea of feature extraction. In other words, we use the features ϕ(x) rather than the input features x. A downstream model can then be of the form

f(x;θ) = Wϕ(x) + b

The above model is linear in the features, with parameters θ = (W, b). However, the transformation x → ϕ(x) itself may be nonlinear. As observed in the lecture, identifying the optimal transformation may be difficult in many cases. However, we can try to parameterise the transformation and then learn the new features from the data. Let the new learnable parameters be θ2. Then, we have:

f(x;θ) = Wϕ(x;θ2) + b

Effectively, we have composed 2 functions together. We can then extend this to compose even more functions together. The strength of neural nets lies in this composition of functions.
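As a minimal sketch of this composition (illustrative; the layer sizes and names are assumptions, not the project's network), the PyTorch module below implements f(x; θ) = Wϕ(x; θ2) + b with a learnable, nonlinear ϕ:

```python
# Sketch of f(x; θ) = W ϕ(x; θ2) + b, where ϕ itself is a learnable, nonlinear map.
import torch
import torch.nn as nn

class FeatureModel(nn.Module):
    def __init__(self, n_in, n_hidden, n_out):
        super().__init__()
        # ϕ(x; θ2): a learnable nonlinear transformation of the inputs
        self.phi = nn.Sequential(nn.Linear(n_in, n_hidden), nn.ReLU())
        # final linear model W ϕ(x) + b in the learned feature space
        self.linear = nn.Linear(n_hidden, n_out)

    def forward(self, x):
        return self.linear(self.phi(x))

model = FeatureModel(n_in=4, n_hidden=16, n_out=1)
print(model(torch.randn(8, 4)).shape)  # torch.Size([8, 1])
```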

The Universal Approximation Theorem tells us that, irrespective of the target function or of the number of inputs and outputs, a sufficiently large neural network can approximate it and give a reasonable result.

When encoding data, universality is an important property for achieving optimal results; however, given certain constraints, it can be difficult to achieve.

If we consider weak learners as building blocks, then stacking many such blocks and adding them up helps us approximate any function: g can be written as a composite combination of the individual functions. For any neural network architecture, this means there is a mathematical function y = f(x) that can map the attributes x to the output y, allowing us to approximate any complex true relationship between input and output.

Architecture of an Autoencoder
An Autoencoder consists of 2 sub-networks - an Encoder and Decoder network. The Encoder approximates the transformation f : Rn → Rd. The d-dimensional co-domain of the function f is the latent space of the data.

The decoder is, at its core, a reversed version of the encoder. It approximates the transformation g : Rd → Rn. The decoder maps from the latent space to the reconstruction of our original input x.

Usually, d ≪ n. This leads to an under-complete representation of the data, forcing generalisation. It is possible to take d ≫ n, but that case requires some form of regularisation to prevent overfitting.
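A minimal PyTorch sketch of this architecture is given below; the layer sizes and the input dimension (flattened 28×28 images) are assumptions for illustration, not the project's actual network:

```python
# Minimal autoencoder sketch: encoder f: R^n -> R^d, decoder g: R^d -> R^n (d << n).
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n=784, d=2):
        super().__init__()
        self.encoder = nn.Sequential(          # f: maps the input to the latent space
            nn.Linear(n, 128), nn.ReLU(),
            nn.Linear(128, d),
        )
        self.decoder = nn.Sequential(          # g: maps latent vectors back to input space
            nn.Linear(d, 128), nn.ReLU(),
            nn.Linear(128, n),
        )

    def forward(self, x):
        z = self.encoder(x)                    # latent representation
        return self.decoder(z)                 # reconstruction x_hat

x = torch.randn(16, 784)                       # e.g. flattened 28x28 MNIST images
print(Autoencoder()(x).shape)                  # torch.Size([16, 784])
```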

Training an Autoencoder
Autoencoders are a form of unsupervised learning. This means that the labels of the data are not utilized for training the network. Like most other neural network architectures, autoencoders try to optimize an objective function. The usual objective function for an autoencoder is:

min_θ ∥x̂ − x∥²

where θ represents all the parameters of the neural network, x is the input to the neural net, and x̂ represents the reconstructed x. This loss function, however, looks familiar. It is, in fact, the same objective we saw earlier in PCA - minimizing the reconstruction loss!
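A hedged sketch of a training step for this objective is shown below; it assumes the Autoencoder module sketched above and uses random stand-in data in place of a real dataset such as MNIST:

```python
# Illustrative training loop for the reconstruction objective min_θ ||x_hat - x||^2.
# Uses random stand-in data; in practice this would be e.g. the MNIST training set.
import torch
from torch.utils.data import DataLoader, TensorDataset

model = Autoencoder(n=784, d=2)               # the sketch from the previous section
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.MSELoss()
loader = DataLoader(TensorDataset(torch.rand(1024, 784)), batch_size=64, shuffle=True)

for epoch in range(10):
    for (x,) in loader:                       # labels are not used: unsupervised learning
        x_hat = model(x)
        loss = criterion(x_hat, x)            # reconstruction error ||x_hat - x||^2
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```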

Linear Autoencoders as PCA
Consider the case of a linear autoencoder, i.e. a one-layer encoder and a one-layer decoder with no activation function (it needn't be one-layer; having no activation function is what makes it linear, and we take only 1 layer for the sake of simplicity). Additionally, we consider the case with no bias. We can then represent our autoencoder by x̂ = g(f(x)), where f(x) = Wx and g(z) = Vz, with W ∈ Rd×n and V ∈ Rn×d. The autoencoder objective can then be represented as:

min_{W,V} ∥VWx − x∥²

This is then equivalent to the aforementioned objective for PCA. We can then see that adding nonlinearities via activation functions can be viewed as a non-linear extension of PCA.

Latent Space Visualization - Autoencoder
Since similar points should lie closer to each other in the latent space, and dissimilar points lie far away from each other, we should see natural clustering emerge. Below, we show the latent space obtained by passing the MNIST test dataset to an autoencoder trained on the MNIST train dataset.

Oddly, we do not see an efficient utilisation of the latent space! We see that the points are all clustered along 2 axes, with most of the "weight" being on one axis or the other. This can limit the capacity of the model to interpolate, leading to the decoder having a very poor capacity for generating valid samples from the latent space.

We can see this effect in the generated interpolations from the latent space of the autoencoder, most of which are of poor quality.

To improve the quality of our latent space, for more efficient utilization and better generation outcomes, we need to make a modification to the autoencoder. This comes in the form of Variational Autoencoders.

Variational Autoencoders
In conventional bottleneck autoencoders, we were stymied in generation by the fact that randomly sampling a point in the latent space completely changed the type of reconstruction we received. In other words, we could not guarantee that, for a point z ∈ Rd and decoder g : Rd → Rn with g(z) = x̂, a nearby point z' = z + ϵ (where ϵ ∈ Rd is a small perturbation) generates a value x̂' = g(z') close to x̂ (a property called continuity). To put it simply, a random point close to an encoded point need not encode useful information.

This comes from the design of the bottleneck autoencoder. The normal autoencoder transforms the data input to another vector in the latent space.

This does not encourage the 2 properties we want from our latent space: continuity, and completeness (every point in the latent space corresponds to a meaningful reconstruction).

How do we encourage continuity and completeness?

Effectively, we need some way to teach our network that close points in the latent space should look similar once decoded, to ensure continuity. Also, we want to ensure that the model learns to spread out its encoded outputs throughout the latent space, to ensure completeness.

We do this by no longer mapping each input to a vector. We now attempt to map the inputs to a probability distribution in the latent space. In other words, we now consider each attribute of our latent space as being a probability distribution that we sample from for generation.

Statistical Motivation
Let us consider the following process: Suppose there exists a process that operates on a random variable Z, that generates an outcome x. We can only observe the realizations x of the process, and we wish to infer the properties of Z. To this end, we would want to compute p(z|x).

By Bayes’ Theorem, we have:

p(z|x) = p(x|z) p(z) / p(x)

However, the main problem arises from the term in the denominator, p(x). By the law of total probability, we can see that:

p(x) = ∫ p(x|z) p(z) dz

However, the above integral is intractable (i.e., there is no closed-form solution). This leads us to need another way to calculate the probability p(z|x).

Let us try to approximate p(z|x) by another distribution q(z|x). We consider a form for q(z|x) such that the integral becomes tractable. However, how do we figure out the best values for the parameters of q(z|x)?

The above is a common tactic in variational inference, which utilizes optimization to figure out the best parameters for q(z|x). While Euclidean distance can be used to compare vectors, how do we quantify the difference between 2 distributions?

For this purpose, we use the Kullback-Leibler divergence (KL divergence). The KL divergence between 2 distributions can be seen as a measure of how different the 2 distributions are. We can then see that our optimization objective can be represented as:

min_q KL(q(z|x) || p(z|x))
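To make the notion concrete, the short sketch below (illustrative only) compares a Monte Carlo estimate of the KL divergence between two univariate Gaussians with its known closed-form expression:

```python
# Illustrative sketch: KL divergence between two univariate Gaussians,
# estimated by Monte Carlo and compared against the closed-form expression.
import numpy as np

mu1, sigma1 = 0.0, 1.0          # q = N(mu1, sigma1^2)
mu2, sigma2 = 1.0, 2.0          # p = N(mu2, sigma2^2)

rng = np.random.default_rng(0)
z = rng.normal(mu1, sigma1, size=1_000_000)            # samples from q

log_q = -0.5 * ((z - mu1) / sigma1) ** 2 - np.log(sigma1 * np.sqrt(2 * np.pi))
log_p = -0.5 * ((z - mu2) / sigma2) ** 2 - np.log(sigma2 * np.sqrt(2 * np.pi))
kl_mc = np.mean(log_q - log_p)                          # KL(q||p) = E_q[log q - log p]

kl_exact = np.log(sigma2 / sigma1) + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2) - 0.5
print(kl_mc, kl_exact)                                  # the two values agree closely
```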

Deriving the VAE objective
We can express the KL divergence as:

KL(q(z|x) || p(z|x)) = E_{z∼q(z|x)}[log q(z|x) − log p(z|x)]

Recall that, by Bayes’ Theorem:

p(z|x) = p(x|z) p(z) / p(x)

Therefore:

KL(q(z|x) || p(z|x)) = E_{z∼q(z|x)}[log q(z|x) − log p(x|z) − log p(z)] + log p(x)

Observe that the expectation is taken over values of z while x is fixed. Thus, we can see that the log p(x) term is a constant! Minimizing the KL divergence can therefore be seen as minimizing the following term:

KL(q(z|x) || p(z)) − E_{z∼q(z|x)}[log p(x|z)]

The negative of this term is referred to as the variational lower bound.

We now see that the first term is the KL divergence between q(z|x) and p(z), while the second term is the negative of the expectation of log p(x|z) with respect to q(z|x).

If we assume a Gaussian form for p(x|z) (with fixed variance), we can then show that this expectation term reduces, up to constants, to ∥x̂ − x∥², which is the reconstruction error! In other words, we can re-express our objective as minimizing the sum of the reconstruction error and the KL divergence!

The KL-divergence in this case acts as a regularization term for the variational autoencoder.
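A sketch of this combined objective is given below, assuming a Gaussian decoder (so the log-likelihood term becomes a squared reconstruction error) and a standard-normal prior p(z); the `beta` argument is an extra weighting knob of the kind used in the experiments later, not part of the plain VAE objective:

```python
# Sketch of the VAE objective: reconstruction error plus the KL divergence between
# the encoder's Gaussian q(z|x) = N(mu, sigma^2) and the prior p(z) = N(0, I).
import torch

def vae_loss(x_hat, x, mu, logvar, beta=1.0):
    recon = torch.sum((x_hat - x) ** 2)                          # ||x_hat - x||^2
    # Closed-form KL( N(mu, sigma^2) || N(0, I) ), summed over the latent dimensions.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl                                     # beta weights the regulariser
```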

Trade-off between KL divergence and Reconstruction error
VAEs encode their inputs as distributions rather than as vectors, and these distributions are regularised. With this regularisation term, the model does not encode data points far apart in the latent space, which increases the amount of overlap within the latent space. However, the regularisation term also results in a higher reconstruction error on the training data. So, the two terms have contrasting effects: minimising the reconstruction loss improves the quality of the reconstruction but neglects the shape of the latent space, while the KL divergence regularises the latent space but results in some additional "overlapping" between latent variables. Hence, the trade-off between the reconstruction error and the KL divergence needs to be adjusted.

The objective can be read as a trade-off between two terms: the expected log-likelihood, and the KL distance between q(z|x) and the prior distribution p(z).

Log-likelihood: E_{z∼q(z|x)}[log p(x|z)]
KL divergence: KL(q(z|x) || p(z))

These satisfy

log p(x) − KL(q(z|x) || p(z|x)) = E_{z∼q(z|x)}[log p(x|z)] − KL(q(z|x) || p(z))

so we can observe the trade-off between the two terms: maximisation of the expected log-likelihood and minimisation of the KL divergence.

Components of VAE
An image is fed as input x
The probabilistic encoder compresses the input ’x’ based on the distribution using the mean and standard deviation as a sampled latent vector
The probabilistic decoder then reconstructs and expands the compressed version of the input based on the probability
The output x’ is the reconstructed image of the input
Further comparisons are made between the input and output image and the loss function is represented by the reconstruction loss and the regularizer term.
As a summary, a variational autoencoder works similar to an autoencoder but with refined features and better representation and reconstruction of the input.
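Putting these components together, here is a minimal VAE sketch; the layer sizes and names are illustrative assumptions rather than the project's actual network, and sampling uses the standard reparameterisation z = μ + σ·ε so that the sampling step stays differentiable:

```python
# Minimal VAE sketch tying the components together (illustrative, not the project code).
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, n=784, d=2):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(n, 128), nn.ReLU())
        self.mu = nn.Linear(128, d)          # mean of q(z|x)
        self.logvar = nn.Linear(128, d)      # log-variance of q(z|x)
        self.decoder = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, n))

    def forward(self, x):
        h = self.body(x)                     # probabilistic encoder
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterisation: sample z = mu + sigma * eps, keeping the graph differentiable.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        x_hat = self.decoder(z)              # probabilistic decoder reconstructs the input
        return x_hat, mu, logvar

x = torch.rand(16, 784)
x_hat, mu, logvar = VAE()(x)
print(x_hat.shape, mu.shape)                 # torch.Size([16, 784]) torch.Size([16, 2])
```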

Real-life Applications of VAEs
Image Re-synthesis:
By optimising a VAE, a generative model can be designed over images; such a model can synthesize images, and features in them such as colours and shapes can be modified and re-synthesized.

Compound Generation:
VAEs can be used in different forms of drug discovery; the most common use is to generate new chemical/molecule structures using learned patterns.

Experiments
We attempted to visualize the latent space of the VAE while weighting the KL divergence. We show below the results of training a network at 3 values of the weight for the KL divergence term.

KL Weight = 8e-4
In this case we see that the weight on the KL term is far too high. We can observe that the latent space is over-regularised, with only a few allowed points spread across the space.

The quality of the generation is also not too good:

KL Weight = 5e-5
In this case we see that the weight on the KL term seems appropriate. The points are well spread out, with efficient utilisation of the latent space.

The quality of the generation is also better:

KL Weight = 1e-8
In this case we see that the weight on the KL term is far too low. The VAE now resembles an autoencoder.

The quality of the generation is not great, either:

Appendix
Reflection
Overall, this project has drawn us towards the practical and theoretical aspects of Variational Autoencoders. It further helped us question every step of our approach in terms of the proof and reasoning behind every assumption and claim made. Within our group, each of us was involved in reading and researching deeply about the topic of autoencoders and the mathematical motivation behind variational autoencoders. We spent time discussing the applications of VAEs and how each component is developed with a mathematical purpose for generative modelling, which drew us into deep self-learning throughout the process.

Contributions
Bharath Shankar: Code + Motivation and Context + Autoencoders (Except UAT) + VAE (Except Tradeoff and Components)

Ananya Gupta: Abstract + PCA + UAT + Preliminary Statistics + Tradeoff + Components of VAE + Reflection

Coding Component

Python code using PyTorch

