10-418 Homework 5: Variational Inference

1          Written Questions 
Answer the following questions in the template provided. Then upload your solutions to Gradescope. You may use LaTeX, or print the template, hand-write your answers, and then scan them in. Failure to use the template may result in a penalty. There are 44 points and 19 questions.

1.1        Mean-Field Approximation for Multivariate Gaussians
In this question, we’ll explore how accurate a Mean-Field approximation can be for an underlying multivariate Gaussian distribution.

Assume we have observed data X that was drawn from a 2-dimensional Gaussian distribution p(x; µ, Λ⁻¹):

$$
p(x;\mu,\Lambda^{-1}) = \frac{|\Lambda|^{1/2}}{2\pi}\exp\!\left(-\tfrac{1}{2}(x-\mu)^\top\Lambda(x-\mu)\right) \qquad (1.1)
$$

Note here that we’re using the precision matrix Λ = Σ−1. An additional property of the precision matrix is that it is symmetric, so Λ12 = Λ21. This will make your lives easier for the math to come.
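In particular, since Λ12 = Λ21, the quadratic form in the log-density expands componentwise as follows (terms not involving x are collected into const):

$$
\log p(x;\mu,\Lambda^{-1}) = -\tfrac{1}{2}\Bigl[\Lambda_{11}(x_1-\mu_1)^2 + 2\Lambda_{12}(x_1-\mu_1)(x_2-\mu_2) + \Lambda_{22}(x_2-\mu_2)^2\Bigr] + \text{const}.
$$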

We will approximate this 2-dimensional Gaussian with a mean-field approximation, q(x) = q(x1)q(x2), the product of two 1-dimensional distributions q(x1) and q(x2). For now, we won't assume any form for these distributions.

1.    (1 point) Short Answer: Write down the equation for log p(X). For now, you can leave all of the parameters in terms of vectors and matrices, not their subcomponents.

 

2.    (2 points) Short Answer: Group together everything that involves X1 and remove anything involving X2. We claim that there exists some distribution q∗(X) = q∗(X1)q∗(X2) that minimizes the KL divergence, q∗ = argminq KL(q||p). Further, the component q∗(X1) of this distribution will be proportional to the quantity you find below.

 

It can be shown that this implies that q∗(X1) (and therefore q∗(X2)) is a Gaussian distribution:

$$
q^*(x_1) = \mathcal{N}\!\left(x_1;\; m_1,\; \Lambda_{11}^{-1}\right)
$$

where

$$
m_1 = \mu_1 - \Lambda_{11}^{-1}\Lambda_{12}\bigl(\mathbb{E}[x_2] - \mu_2\bigr).
$$
Using these facts, we’d like to explore how well our approximation can model the underlying distribution.
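By symmetry, the corresponding factor for the second coordinate is

$$
q^*(x_2) = \mathcal{N}\!\left(x_2;\; m_2,\; \Lambda_{22}^{-1}\right),
\qquad
m_2 = \mu_2 - \Lambda_{22}^{-1}\Lambda_{21}\bigl(\mathbb{E}[x_1] - \mu_1\bigr),
$$

and for a positive-definite Λ the unique fixed point of these coupled equations is m1 = µ1, m2 = µ2: the mean-field means coincide with the true means, while the variances are Λ11⁻¹ and Λ22⁻¹.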

3.    Suppose the parameters of the true distribution are µ  and  .

(a)     (1 point) Numerical Answer: What is the value of the mean of the Gaussian for q∗(X1)?

(b)     (1 point) Numerical Answer: What is the value of the variance of the Gaussian for q∗(X1)?

(c)     (1 point) Numerical Answer: What is the value of the mean of the Gaussian for q∗(X2)?

(d)    (1 point) Numerical Answer: What is the value of the variance of the Gaussian for q∗(X2)?

(e)     (2 points) Plot: Provide a computer-generated contour plot to show the result of our approximation q∗(X) and the true underlying Gaussian p(x; µ, Λ⁻¹) for the parameters given above (a plotting sketch is provided at the end of this subsection).

 

4.    Suppose the parameters of the true distribution are µ  and  .

(a)     (1 point) Numerical Answer: What is the value of the mean of the Gaussian for q∗(X1)?

(b)     (1 point) Numerical Answer: What is the value of the variance of the Gaussian for q∗(X1)?

(c)     (1 point) Numerical Answer: What is the value of the mean of the Gaussian for q∗(X2)?

(d)     (1 point) Numerical Answer: What is the value of the variance of the Gaussian for q∗(X2)?

(e)     (2 points) Plot: Provide a computer-generated contour plot to show the result of our approximation q∗(X) and the true underlying Gaussian p(x; µ, Λ⁻¹) for the parameters given above.

 

5.    (1 point) Describe in words how the plots you generated provide insight into the behavior of minimizing KL(q||p) with regard to the low-probability and high-probability regions of the true vs. approximate distributions.
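For the contour plots in 3(e) and 4(e), a minimal plotting sketch along the following lines could be used. The values of mu_true and Lam below are placeholders (substitute the µ and Λ given in each question), and the mean-field factors are taken to be Gaussians with means equal to the true means and variances 1/Λkk, as derived above.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

# Placeholder parameters: substitute the mu and Lambda given in question 3 / question 4.
mu_true = np.array([0.0, 0.0])
Lam = np.array([[1.0, 0.5],
                [0.5, 1.0]])            # precision matrix Lambda
Sigma = np.linalg.inv(Lam)              # covariance of the true Gaussian p(x)

# Mean-field approximation q*(x) = q*(x1) q*(x2): at the fixed point the means
# equal the true means and the variances are 1 / Lambda_kk (see above).
q_mean = mu_true
q_cov = np.diag(1.0 / np.diag(Lam))

xs = np.linspace(mu_true[0] - 3, mu_true[0] + 3, 200)
ys = np.linspace(mu_true[1] - 3, mu_true[1] + 3, 200)
X, Y = np.meshgrid(xs, ys)
grid = np.dstack([X, Y])

plt.contour(X, Y, multivariate_normal(mu_true, Sigma).pdf(grid), colors="blue")
plt.contour(X, Y, multivariate_normal(q_mean, q_cov).pdf(grid), colors="red")
plt.xlabel("x1")
plt.ylabel("x2")
plt.title("True p(x) (blue) vs. mean-field q*(x) (red)")
plt.show()
```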

 

1.2        Variational Inference for Gaussian Mixture Models
Now that we have seen how the mean-field approximation works for a multivariate Gaussian, let’s look at the case of Gaussian Mixture Models. Suppose we have a Bayesian mixture of unit-variance univariate Gaussian distributions. This mixture consists of 2 components each corresponding to a Gaussian distribution, with means µ = {µ1,µ2}. The mean parameters are drawn independently from a Gaussian prior distribution N(0,σ2). The prior variance σ2 is a hyperparameter. Generating an observation xi from this model is done according to the following generative story:

1.    Choose a cluster assignment ci for the observation. The cluster assignment is chosen from a Categorical distribution and indicates which latent cluster xi comes from. Encode ci as a one-hot vector, where [1,0] indicates that xi is assigned to cluster 1 and [0,1] indicates that it is assigned to cluster 2.

2.    Generate xi from the corresponding Gaussian distribution N(cᵢᵀµ, 1).

The complete hierarchical model is as follows:

$$
\begin{aligned}
\mu_k &\sim \mathcal{N}(0,\sigma^2), && k \in \{1,2\} \\
c_i &\sim \mathrm{Categorical}, && i \in [1,n] \\
x_i \mid c_i, \mu &\sim \mathcal{N}(c_i^\top \mu,\, 1), && i \in [1,n]
\end{aligned}
$$

where n is the number of observations generated from the model.
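Written out, and leaving the categorical prior p(ci) unexpanded, the joint density implied by this generative story factorizes as

$$
p(x, c, \mu) = \prod_{k=1}^{2}\mathcal{N}(\mu_k;\, 0, \sigma^2)\;\prod_{i=1}^{n} p(c_i)\,\mathcal{N}\!\bigl(x_i;\, c_i^\top \mu,\, 1\bigr).
$$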

1.    (1 point) What are the observed and latent variables for this model?

 

2.    (1 point) Write down the joint probability of observed and latent variables under this model

 

3.    (3 points) Let’s calculate the ELBO (evidence lower-bound) for this model. Recall that the ELBO is given by the following equation:

$$
\mathrm{ELBO}(q) = \mathbb{E}_q[\log p(x,z)] - \mathbb{E}_q[\log q(z)]
$$

To calculate q(z), we will now use the mean-field assumption. Under this assumption, each latent variable is governed by its own variational factor, resulting in the following probability distribution:

$$
q(\mu, c) = \prod_{k=1}^{2} q(\mu_k; m_k, v_k^2)\; \prod_{i=1}^{n} q(c_i; a_i)
$$

Here q(µk; mk, vk²) is the Gaussian distribution for the k-th mixture component with mean mk and variance vk². q(ci; ai) is the categorical distribution for the i-th observation with assignment probabilities ai (ai is a 2-dimensional vector). Given this assumption, write down the ELBO as a function of the variational parameters m, v², and a.
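Under this factorization the ELBO decomposes term by term; one way to organize the computation (each expectation still has to be written out in terms of m, v², and a) is

$$
\mathrm{ELBO}(m, v^2, a)
= \sum_{k=1}^{2}\mathbb{E}_q[\log p(\mu_k)]
+ \sum_{i=1}^{n}\Bigl(\mathbb{E}_q[\log p(c_i)] + \mathbb{E}_q[\log p(x_i \mid c_i, \mu)]\Bigr)
- \sum_{k=1}^{2}\mathbb{E}_q[\log q(\mu_k)]
- \sum_{i=1}^{n}\mathbb{E}_q[\log q(c_i)].
$$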

 

4.    Now that we have the ELBO formulation, let's try to compute coordinate updates for our latent variables. Remember that the optimal variational density of a latent variable zj is proportional to the exponentiated expected log of its complete conditional given all other latent variables in the model and the observed data. In other words:

$$
q_j(z_j) \propto \exp\!\left\{\mathbb{E}_{-j}\bigl[\log p(z_j \mid z_{-j}, x)\bigr]\right\}
$$

Equivalently, you can also say that the variational density is proportional to the exponentiated expected log of the joint, exp{E−j[log p(zj, z−j, x)]}. This is a valid coordinate update since the expectations on the right-hand side of the equation do not involve zj, due to the mean-field assumption.

(a)    (4 points) Show that the variational update for the cluster-assignment probabilities is

$$
a_{i1} \propto \exp\!\left\{\mathbb{E}[\mu_1]\, x_i - \tfrac{1}{2}\,\mathbb{E}[\mu_1^2]\right\},
$$

where the expectations are taken with respect to q(µ1; m1, v1²).

(Hint: We can write the optimal variational density for the cluster assignment variables as

$$
q(c_i; a_{i1}) \propto \exp\!\left\{\log p(c_i) + \mathbb{E}_{\mu}\bigl[\log p(x_i \mid c_i, \mu);\, m, v^2\bigr]\right\}.
$$

Feel free to drop additive constants.)

 

(b)    (6 points) Show that the variational updates for the k-th mixture component are

$$
m_k = \frac{\sum_{i=1}^{n} a_{ik}\, x_i}{1/\sigma^2 + \sum_{i=1}^{n} a_{ik}}
\qquad \text{and} \qquad
v_k^2 = \frac{1}{1/\sigma^2 + \sum_{i=1}^{n} a_{ik}}.
$$

(Hint: We can write the optimal variational density for the k-th mixture component as

$$
q(\mu_k; m_k, v_k^2) \propto \exp\!\left\{\log p(\mu_k) + \sum_{i=1}^{n}\mathbb{E}_{c_i}\bigl[\log p(x_i \mid c_i, \mu);\, a_i\bigr]\right\}.
$$

Feel free to drop additive constants.)

 

1.3        Running CAVI: Toy Example
Let’s now see this in action!

Recall that the CAVI update algorithm for a Gaussian Mixture Model is as follows:

 

Note that our notation differs slightly, with ϕ corresponding to a and s² corresponding to v². We also have K = 2. Assume initial parameters m = [0.5, 0.5], v² = [1, 1], and aᵢ = [0.3, 0.7] for all i ∈ [1, n], and a sample x = [0.1, −0.3, 1.2, 0.8, −0.5]. Also assume prior variance σ² = 0.01.

Write a Python script implementing the above procedure and run it for 5 epochs. You should submit your code to Autolab as a .tar file named cavi.tar containing a single file cavi.py. You can create that file by running:

tar -cvf cavi.tar cavi.py

from the directory containing your code.
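As a starting point, here is a minimal sketch of what cavi.py might look like. It assumes a uniform prior over cluster assignments (so the log p(ci) term is constant and cancels when the aᵢ are normalized) and uses the coordinate updates from Section 1.2. Within each epoch it updates the assignment probabilities first and then the Gaussian factors; adjust the order if the algorithm from lecture differs (note that with this order the given initial aᵢ are overwritten before being used).

```python
import numpy as np

def cavi(x, m, v2, a, sigma2, epochs=5):
    """CAVI for a 2-component, unit-variance GMM with prior mu_k ~ N(0, sigma2).

    Assumes a uniform prior over cluster assignments, so the log p(c_i) term
    is a constant that cancels when each a_i is normalized.
    """
    x = np.asarray(x, dtype=float)                         # observations, shape (n,)
    m = np.asarray(m, dtype=float)                         # variational means, shape (K,)
    v2 = np.asarray(v2, dtype=float)                       # variational variances, shape (K,)
    a = np.tile(np.asarray(a, dtype=float), (len(x), 1))   # assignment probs, shape (n, K)

    for _ in range(epochs):
        # Assignment update: a_ik proportional to exp(E[mu_k] * x_i - E[mu_k^2] / 2),
        # where E[mu_k] = m_k and E[mu_k^2] = v2_k + m_k^2 under q(mu_k).
        logits = np.outer(x, m) - 0.5 * (v2 + m ** 2)
        a = np.exp(logits - logits.max(axis=1, keepdims=True))
        a /= a.sum(axis=1, keepdims=True)

        # Gaussian-factor update: m_k and v2_k from the a-weighted data.
        denom = 1.0 / sigma2 + a.sum(axis=0)
        m = (a * x[:, None]).sum(axis=0) / denom
        v2 = 1.0 / denom

    return m, v2, a

if __name__ == "__main__":
    x = [0.1, -0.3, 1.2, 0.8, -0.5]
    m, v2, a = cavi(x, m=[0.5, 0.5], v2=[1.0, 1.0], a=[0.3, 0.7],
                    sigma2=0.01, epochs=5)
    print("m  =", m)
    print("v2 =", v2)
    print("a  =", a)
```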

After the fifth epoch, report

1.    (2 points) The variational parameters m.

 
 
2.    (2 points) The variational parameters v².

 
 
3.    (2 points) The variational parameters a.

 

Hints:

1.    Note that the expectation update for a does not depend on µ. (Why?)

2.    The expectation of the square of a Gaussian random variable is E[X²] = Var[X] + (E[X])².

1.4        Variational Inference vs. Monte Carlo Methods
Let’s end with a brief comparison between variational methods and MCMC methods. We have seen that both classes of methods can be used for learning in scenarios involving latent variables, but both have their own sets of advantages and disadvantages. For each of the following statements, specify whether they apply more suitably to VI or MCMC methods:

1.    (1 point) Transforms inference into optimization problems.

  Variational Inference

  MCMC

2.    (1 point) Is easier to integrate with back-propagation.

  Variational Inference

  MCMC

3.    (1 point) Involves more stochasticity.

  Variational Inference

  MCMC

4.    (1 point) Non-parametric.

  Variational Inference

  MCMC

5.    (1 point) Is higher variance under limited computational resources.

  Variational Inference

  MCMC

1.5        Wrap-up Questions
1.    (1 point) Multiple Choice: Did you correctly submit your code to Autolab?

  Yes

  No

2.    (1 point) Numerical answer: How many hours did you spend on this assignment?

 

1.6        Collaboration Policy
After you have completed all other components of this assignment, report your answers to the collaboration policy questions detailed in the Academic Integrity Policies for this course.

1.    Did you receive any help whatsoever from anyone in solving this assignment? If so, include full details including names of people who helped you and the exact nature of help you received.

 

2.    Did you give any help whatsoever to anyone in solving this assignment? If so, include full details including names of people you helped and the exact nature of help you offered.

 

3.    Did you find or come across code that implements any part of this assignment? If so, include full details including the source of the code and how you used it in the assignment.
