STAT215 - Assignment 3



Problem 1: Variational inference.

Standard VI minimizes KL(q(z) ∥ p(z | x)), the Kullback-Leibler divergence from the variational approximation q(z) to the true posterior p(z | x). In this problem we will develop some intuition for this optimization problem. For further reference, see Chapter 10 of Pattern Recognition and Machine Learning by Bishop.

(a)   Let Q = {q(z) : q(z) = ∏_{d=1}^{D} N(zd | md, vd²)} denote the set of Gaussian densities on z ∈ R^D with diagonal covariance matrices. Solve for

q* = argmin_{q ∈ Q} KL(q(z) ∥ N(z | µ, Σ)),

where Σ is an arbitrary covariance matrix.

Your answer here.

(b)   Now solve for the q* ∈ Q that minimizes the KL in the opposite direction, q* = argmin_{q ∈ Q} KL(N(z | µ, Σ) ∥ q(z)). Your answer here.

(c)   Plot the contour lines of your solutions to parts (a) and (b) for the case where

µ = [0, 0]ᵀ,    Σ = [[1, 0.9], [0.9, 1]].
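To produce these plots, a minimal matplotlib/scipy sketch along the following lines may be enough; the diagonal variances v2_a and v2_b are placeholders to be filled in with your solutions to parts (a) and (b):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import multivariate_normal

    mu = np.array([0.0, 0.0])
    Sigma = np.array([[1.0, 0.9], [0.9, 1.0]])

    # Placeholder diagonal variances: replace with the solutions to parts (a) and (b).
    v2_a = np.array([1.0, 1.0])   # TODO: variances of q* minimizing KL(q || p)
    v2_b = np.array([1.0, 1.0])   # TODO: variances of q* minimizing KL(p || q)

    xs = np.linspace(-3, 3, 200)
    X, Y = np.meshgrid(xs, xs)
    grid = np.dstack([X, Y])

    fig, ax = plt.subplots(figsize=(5, 5))
    ax.contour(X, Y, multivariate_normal(mu, Sigma).pdf(grid), colors="k")
    ax.contour(X, Y, multivariate_normal(mu, np.diag(v2_a)).pdf(grid), colors="C0")
    ax.contour(X, Y, multivariate_normal(mu, np.diag(v2_b)).pdf(grid), colors="C1")
    ax.set_aspect("equal")
    plt.show()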

 Problem 2: Variational autoencoders (VAE’s)

In class we derived VAE’s as generative models p(x, z; θ) of observations x ∈ R^P and latent variables z ∈ R^D, with parameters θ. We used variational expectation-maximization to learn the parameters θ that maximize a lower bound on the marginal likelihood,

log p(x; θ) ≥ ∑_{n=1}^{N} E_{q(zn | xn, φ)} [log p(xn, zn; θ) − log q(zn | xn, φ)] ≜ ℒ(θ, φ).

The difference between VAE’s and regular variational expectation-maximization is that we constrained the variational distribution q(z | x, φ) to be a parametric function of the data; for example, we considered,

q(zn | xn, φ) = N(zn | µ(xn; φ), diag([σ1²(xn; φ), . . . , σD²(xn; φ)])),

where µ : R^P → R^D and σd² : R^P → R+ are functions parameterized by φ that take in a datapoint xn and output the means and variances of zn, respectively. In practice, it is common to implement these functions with neural networks. Here we will study VAE’s in some special cases. For further reference, see Kingma and Welling (2019), which is linked on the course website.
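Purely as an illustration of such an amortized variational family, here is a minimal PyTorch sketch of a diagonal-Gaussian encoder; the class name and hidden-layer size are our own choices, not prescribed by the assignment:

    import torch.nn as nn

    class DiagonalGaussianEncoder(nn.Module):
        """Map a datapoint x in R^P to the mean and log-variance of q(z | x, phi)."""

        def __init__(self, P, D, hidden=32):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(P, hidden), nn.Tanh())
            self.mean = nn.Linear(hidden, D)       # mu(x; phi)
            self.log_var = nn.Linear(hidden, D)    # log sigma_d^2(x; phi), d = 1, ..., D

        def forward(self, x):
            h = self.net(x)
            return self.mean(h), self.log_var(h)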

(a)   Consider the linear Gaussian factor model, p(xn, zn; θ) = N(zn | 0, I) N(xn | Azn, V),

where A ∈ R^{P×D}, V ∈ R^{P×P} is a diagonal, positive definite matrix, and θ = (A, V). Solve for the true posterior p(zn | xn, θ).
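As a reference point for the derivation, the standard result for Bayes’ theorem with linear-Gaussian models (e.g., Bishop §2.3.3) may be useful: if p(z) = N(z | m, Λ⁻¹) and p(x | z) = N(x | Az + b, L⁻¹), then

p(z | x) = N(z | C (AᵀL(x − b) + Λm), C),    where C = (Λ + AᵀLA)⁻¹.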

Your answer here.

(b)   Consider the variational family of Gaussian densities with diagonal covariance, as described above, and assume that µ(x; φ) and log σd²(x; φ) are linear functions of x. Does this family contain the true posterior? Find the member of this variational family that maximizes ℒ(θ, φ) for fixed θ. (Hint: use your answer to Problem 1a.)

Your answer here.

(c)   Now consider a simple nonlinear factor model,

p(xn, zn; θ) = N(zn | 0, I) ∏_{p=1}^{P} N(xnp | exp(apᵀ zn), vp),

parameterized by ap ∈ R^D and vp ∈ R+. The posterior is no longer Gaussian, since the mean of xnp is a nonlinear function of the latent variable.¹

Generate a synthetic dataset by sampling N = 1000 datapoints from a D = 1, P = 2 dimensional model with A = [1.2, 1]ᵀ and vp = 0.1 for p = 1, 2. Use the reparameterization trick and automatic differentiation to perform stochastic gradient descent on −ℒ(θ, φ).
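Here is a minimal sketch of the data generation and a single-sample reparameterized Monte Carlo estimate of ℒ(θ, φ), assuming PyTorch and the linear encoder from part (b); the variable names, initialization, and optimizer settings are illustrative, not prescribed:

    import math
    import torch

    # Synthetic data: D = 1, P = 2, A = [1.2, 1]^T, v_p = 0.1.
    torch.manual_seed(0)
    N, D, P = 1000, 1, 2
    A_true = torch.tensor([[1.2], [1.0]])
    v_true = torch.tensor([0.1, 0.1])
    z_true = torch.randn(N, D)
    x = torch.exp(z_true @ A_true.T) + torch.sqrt(v_true) * torch.randn(N, P)

    # Model parameters theta = (A, log v) and linear encoder parameters phi.
    A = torch.randn(P, D, requires_grad=True)
    log_v = torch.zeros(P, requires_grad=True)
    W_mu = torch.zeros(D, P, requires_grad=True)   # mu(x; phi) = x @ W_mu.T + b_mu
    b_mu = torch.zeros(D, requires_grad=True)
    W_s = torch.zeros(D, P, requires_grad=True)    # log sigma^2(x; phi) = x @ W_s.T + b_s
    b_s = torch.zeros(D, requires_grad=True)
    opt = torch.optim.Adam([A, log_v, W_mu, b_mu, W_s, b_s], lr=1e-2)

    LOG2PI = math.log(2 * math.pi)

    def elbo(x):
        mu = x @ W_mu.T + b_mu                     # (N, D)
        log_var = x @ W_s.T + b_s                  # (N, D)
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * log_var) * eps    # reparameterization trick
        mean_x = torch.exp(z @ A.T)                # (N, P)
        log_px = -0.5 * ((x - mean_x) ** 2 / log_v.exp() + log_v + LOG2PI).sum(-1)
        log_pz = -0.5 * (z ** 2 + LOG2PI).sum(-1)
        log_qz = -0.5 * ((z - mu) ** 2 / log_var.exp() + log_var + LOG2PI).sum(-1)
        return (log_px + log_pz - log_qz).sum()    # one-sample estimate of L(theta, phi)

    for step in range(5000):
        opt.zero_grad()
        loss = -elbo(x)                            # SGD on -L(theta, phi)
        loss.backward()
        opt.step()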

Make the following plots:

¹For this particular model, the expectations in ℒ(θ, φ) can still be computed in closed form using the fact that E[eᶻ] = e^{µ + σ²/2} for z ∼ N(µ, σ²).

• A scatter plot of your simulated data (with equal axis limits).

• A plot of ℒ(θ, φ) as a function of SGD iteration.

• A plot of the model parameters (A11, A21, v1, v2) as a function of SGD iteration.

• The approximate Gaussian posterior with mean µ(x; φ) and variance σ1²(x; φ) for x ∈ {(0, 0), (1, 1), (10, 7)}, using the learned parameters φ.

• The true posterior at those points. (Since z is one dimensional, you can compute the true posterior with numerical integration; see the sketch below.)
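For the true-posterior comparison, one possible numerical-integration sketch (assuming numpy/scipy; the grid limits are arbitrary and the function name is ours; the data-generating parameters are plugged in here only for concreteness, substitute the learned θ as appropriate):

    import numpy as np
    from scipy.stats import norm

    def true_posterior(x, a, v, z_grid):
        """p(z | x) on a one-dimensional grid, normalized by a Riemann sum."""
        # log joint: log N(z | 0, 1) + sum_p log N(x_p | exp(a_p z), v_p)
        log_joint = norm.logpdf(z_grid)
        for p in range(len(x)):
            log_joint = log_joint + norm.logpdf(x[p], loc=np.exp(a[p] * z_grid), scale=np.sqrt(v[p]))
        log_joint -= log_joint.max()               # numerical stability
        post = np.exp(log_joint)
        return post / (post.sum() * (z_grid[1] - z_grid[0]))

    z_grid = np.linspace(-6, 6, 2001)
    for x in [(0.0, 0.0), (1.0, 1.0), (10.0, 7.0)]:
        p_z = true_posterior(np.array(x), np.array([1.2, 1.0]), np.array([0.1, 0.1]), z_grid)
        # Plot p_z against z_grid alongside the learned Gaussian approximation.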

Comment on your results. Your results here.

Problem 3: Semi-Markov models

Consider a Markov model as described in class and in, for example, Chapter 13 of Pattern Recognition and Machine Learning by Bishop,

p(z1:T | π, A) = p(z1 | π) ∏_{t=2}^{T} p(zt | zt−1, A),

where zt ∈ {1, . . . , K} denotes the “state,” and

p(z1 = i) = πi

p(zt = j | zt−1 = i,A) = Aij.

We will study the distribution of state durations—the length of time spent in a state before transitioning. Let d ≥ 1 denote the number of time steps before a transition out of state z1. That is, z1 = i, . . . , zd = i for some i, but zd+1 ≠ i.

(a)   Show that p(d | z1 = i, A) = Geom(d | pi), the probability mass function of the geometric distribution. Solve for the parameter pi as a function of the transition matrix A.
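Not a substitute for the derivation, but a quick Monte Carlo sanity check of the duration distribution is easy to set up (assuming numpy; the example transition matrix is arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    A = np.array([[0.9, 0.1], [0.2, 0.8]])   # arbitrary example transition matrix
    K, i = A.shape[0], 0                     # condition on starting in state i

    durations = []
    for _ in range(20000):
        d, state = 1, i
        while True:
            state = rng.choice(K, p=A[state])
            if state != i:
                break
            d += 1
        durations.append(d)

    # Compare this empirical pmf to your candidate Geom(d | p_i) for d = 1, ..., 5.
    print(np.bincount(durations)[1:6] / len(durations))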

Your answer here.

(b)   We can equivalently represent z1:T as a set of states and durations {(z̃n, dn)}_{n=1}^{N}, where z̃n ∈ {1, . . . , K} \ {z̃n−1} denotes the index of the n-th visited state and dn ∈ ℕ denotes the duration spent in that state before transitioning. There is a one-to-one mapping between states/durations and the original state sequence:

z1:T = (z̃1, . . . , z̃1, z̃2, . . . , z̃2, . . . , z̃N, . . . , z̃N),

where z̃1 is repeated d1 times, z̃2 is repeated d2 times, and so on, with z̃N repeated dN times.

Show that the probability mass function of the states and durations is of the form

p({(z̃n, dn)}_{n=1}^{N}) = p(z̃1 | π) [∏_{n=1}^{N−1} p(dn | z̃n, A) p(z̃n+1 | z̃n, A)] p(dN | z̃N, A),

and derive each conditional probability mass function. Your answer here.

(c)   Semi-Markov models replace p(dn | z̃n) with a more flexible duration distribution. For example, consider the model,

p(dn | z̃n) = NB(dn | r, θz̃n),

where r ∈ ℕ and θk ∈ [0, 1] for k = 1, . . . , K. Recall from Assignment 1 that the negative binomial distribution with integer r is equivalent to a sum of r geometric random variables. Use this equivalence to write the semi-Markov model with negative binomial durations as a Markov model on an extended set of states sn ∈ {1, . . . , Kr}. Specifically, write the transition matrix for p(sn | sn−1) and the mapping from sn to zn.
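Once the construction is worked out, a numerical check can be helpful. Below is a hedged numpy sketch of the Kr × Kr transition matrix, under the assumption that each of the r geometric stages of state k self-transitions with probability θk and that between-state transitions use the matrix from part (b); adjust to match the Assignment 1 parameterization of NB(d | r, θ):

    import numpy as np

    def extended_transition_matrix(A, theta, r):
        """Transition matrix on extended states s = k*r + j, where j = 0, ..., r-1
        indexes the geometric 'stage' within state k. Mapping back: z = s // r."""
        K = A.shape[0]
        # Between-state transition matrix from part (b): off-diagonal of A, renormalized.
        A_tilde = A - np.diag(np.diag(A))
        A_tilde = A_tilde / A_tilde.sum(axis=1, keepdims=True)

        P = np.zeros((K * r, K * r))
        for k in range(K):
            for j in range(r):
                s = k * r + j
                P[s, s] = theta[k]                           # stay in the current stage
                if j < r - 1:
                    P[s, s + 1] = 1 - theta[k]               # advance to the next stage
                else:
                    P[s, np.arange(K) * r] = (1 - theta[k]) * A_tilde[k]   # leave state k
        return P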

Your answer here.
