IFT6135 Assignment 3 – Neural Network-based Generative Models

Question 1 (4-4-4-4). One way to enforce autoregressive conditioning is via masking the weight parameters.[1] Consider a two-hidden-layer convolutional neural network without kernel flipping, with kernel size 3 × 3 and padding of 1 on each border (so that an input feature map of size 5 × 5 is convolved into a 5 × 5 output). Define the mask of type A and the mask of type B as



M^A_{::ij} :=  1   if i = 2 and j < 2
               1   if i = 3
               0   elsewhere

M^B_{::ij} :=  1   if i = 2 and j ≤ 2
               1   if i = 3
               0   elsewhere
where the index starts from 1. Masking is achieved by multiplying the kernel with the binary mask (elementwise). Specify the receptive field of the output pixel that corresponds to the third row and the fourth column (index 34 of Figure 1 (Left)) in each of the following 4 cases:

Figure 1 – (Left) 5 × 5 convolutional feature map, with cells labeled 11 through 55 by row and column. (Right) Template answer: an identical 5 × 5 grid with the same labeling, on which the receptive field is to be marked.

1.    If we use M^A for the first layer and M^A for the second layer.

2.    If we use M^A for the first layer and M^B for the second layer.

3.    If we use M^B for the first layer and M^A for the second layer.

4.    If we use M^B for the first layer and M^B for the second layer.

Your answer should look like Figure 1 (Right).


 
[1] . An example of this is the use of masking in the Transformer architecture (Problem 3 of HW2 practical part).
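For intuition only (not required for the written answer), below is a minimal sketch of how such a masked convolution could be implemented in PyTorch, assuming the mask definitions above with the 1-based indices mapped to 0-based tensor indices. The class name, channel sizes, and the empirical receptive-field check are illustrative assumptions, not part of the assignment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedConv2d(nn.Conv2d):
    """3x3 convolution whose kernel is multiplied elementwise by a binary mask.

    Following the definitions above (1-based indices): both mask types keep
    row i = 3 entirely; row i = 2 keeps j < 2 for type A and j <= 2 for type B;
    every other kernel entry is zeroed out.
    """
    def __init__(self, mask_type, in_channels, out_channels):
        super().__init__(in_channels, out_channels, kernel_size=3, padding=1, bias=False)
        assert mask_type in ("A", "B")
        mask = torch.zeros(3, 3)
        mask[2, :] = 1.0        # i = 3 (0-based row 2): fully kept
        mask[1, 0] = 1.0        # i = 2, j = 1: kept in both mask types
        if mask_type == "B":
            mask[1, 1] = 1.0    # i = 2, j = 2: additionally kept in type B
        self.register_buffer("mask", mask)

    def forward(self, x):
        # Elementwise masking of the kernel, then an ordinary (unflipped) conv.
        return F.conv2d(x, self.weight * self.mask, padding=1)

# The receptive field of output pixel (3, 4) can be inspected empirically by
# backpropagating from that location through two stacked masked layers.
net = nn.Sequential(MaskedConv2d("A", 1, 1), MaskedConv2d("A", 1, 1))
x = torch.zeros(1, 1, 5, 5, requires_grad=True)
net(x)[0, 0, 2, 3].backward()          # (3, 4) in 1-based indexing
print((x.grad[0, 0] != 0).int())       # nonzero entries = receptive field
```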

 


Question 2 (6-3-6-3). The reparameterization trick is a standard technique for making samples of a random variable differentiable with respect to the parameters of its distribution. The trick represents the random variable as a simple, deterministic mapping of another random variable drawn from some simple, fixed distribution.[1] If the reparameterization is a bijective function, the induced density of the resulting random variable can be computed using the change-of-variable density formula, whose computation requires evaluating the determinant of the Jacobian of the mapping.

Consider a random vector Z ∈ RK with a density function q(z;φ) and a random vector Z0 ∈ RK having a φ-independent density function q(z0). We want to find a deterministic function g : RK → RK, depending on φ, to transform Z0 such that the induced distribution of the transformation has the same density as Z. Recall the change of density for a bijective, differentiable g: if Z = g(Z0), then the density of Z satisfies

q(z) = q(z0) |det Jz0g(z0)|^{−1},   where z = g(z0).                                          (1)

1.    Assume q(z0) = N(0,IK) and g(z0) = µ + σ ⊙ z0, where µ ∈ RK and σ ∈ RK has strictly positive entries. Note that ⊙ is the element-wise product. Show that g(z0) is distributed as N(µ,diag(σ2)) using Equation (1).

2.    Compute the time complexity of evaluating |detJz0g(z0)| when g(z0) = µ + σ ⊙ z0. Use big O notation and express the time complexity as a function of K.

3.    Assume g(z0) = µ + Sz0, where S is a non-singular K × K matrix. Derive the density of g(z0) using Equation (1).

4.    The time complexity of computing the determinant of a general K × K Jacobian is at least O(K^2.37).[2] Assume instead g(z0) = µ + Sz0 with S being a K × K lower triangular matrix; i.e. Sij = 0 for j > i, and Sii ≠ 0. What is the time complexity of evaluating |detJz0g(z0)|?

 
[1] . More specifically, these mappings should be differentiable wrt the density function’s parameters.
[2] . https://en.wikipedia.org/wiki/Computational_complexity_of_mathematical_operations
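As an optional numerical illustration of the two special cases above (not part of the required derivation), here is a minimal NumPy sketch; K and the particular values of µ, σ, and S are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4
z0 = rng.standard_normal(K)

# Parts 1-2: g(z0) = mu + sigma * z0 (elementwise).  The Jacobian is diag(sigma),
# so |det J| is a product of K scalars, i.e. O(K) time.
mu = rng.standard_normal(K)
sigma = rng.random(K) + 0.1
log_abs_det_diag = np.sum(np.log(sigma))

# Parts 3-4: g(z0) = mu + S @ z0.  For a general S the determinant is expensive,
# but for a lower-triangular S with nonzero diagonal it is just the product of
# the diagonal entries, again O(K).
S = np.tril(rng.standard_normal((K, K)))
np.fill_diagonal(S, np.abs(np.diag(S)) + 0.1)     # enforce S_ii != 0
log_abs_det_tri = np.sum(np.log(np.abs(np.diag(S))))

# Sanity check against NumPy's generic (cubic-time) slogdet routine.
sign, log_abs_det_ref = np.linalg.slogdet(S)
assert np.isclose(log_abs_det_tri, log_abs_det_ref)
```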

 

Question 3 (5-5-6). Consider a latent variable model pθ(x) = ∫ pθ(x|z)p(z)dz, where p(z) = N(0,IK) and z ∈ RK. The encoder network (aka “recognition model”) of a variational autoencoder, qφ(z|x), is used to produce an approximate (variational) posterior distribution over the latent variables z for any input datapoint x.[1] This distribution is trained to match the true posterior by maximizing the evidence lower bound (ELBO):

L(θ,φ;x) = Eqφ[logpθ(x | z)] − DKL(qφ(z | x)||p(z))

Let Q be the family of variational distributions with a feasible set of parameters P; i.e. Q = {q(z;π) : π ∈ P}; for example, π can be the mean and standard deviation of a normal distribution. We assume qφ is parameterized by a neural network (with parameters φ) that outputs the parameters, πφ(x), of the distribution q ∈ Q, i.e. qφ(z|x) := q(z;πφ(x)).

1.    Show that maximizing the expected complete data log likelihood (ECLL)

Eq(z|x)[logpθ(x|z)p(z)]

for a fixed q(z|x), wrt the model parameter θ, is equivalent to maximizing

logpθ(x) − DKL(q(z|x)||pθ(z|x))

This means the maximizer of the ECLL coincides with that of the marginal likelihood only if q(z|x) perfectly matches p(z|x).

2.    Consider a finite training set {xi : i ∈ {1,...,n}}, n being the size of the training set. Let φ∗ := argmaxφ Σi L(θ,φ;xi) be the maximizer of the total ELBO over the training set, with θ held fixed. In addition, for each xi let qi ∈ Q be an “instance-dependent” variational distribution, and denote by qi∗ ∈ Q the maximizer of the corresponding ELBO (again with θ fixed). Compare DKL(qφ∗(z|xi)||pθ(z|xi)) and DKL(qi∗(z)||pθ(z|xi)). Which one is bigger?

3.    Following the previous subquestion, compare the two approaches (amortized qφ∗ vs. instance-dependent qi∗)

(a)     in terms of the bias of estimating the marginal likelihood via the ELBO, in the best-case scenario (i.e. when both approaches are optimal within their respective families)

(b)     from the computational point of view (efficiency)

(c)      in terms of memory (storage of parameters)

 
[1] . Using a recognition model in this way is known as “amortized inference”; this can be contrasted with traditional variational inference approaches (see, e.g., Chapter 10 of Bishop’s Pattern Recognition and Machine Learning), which fit a variational posterior independently for each new datapoint.
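To make the amortized vs. instance-dependent contrast in subquestions 2-3 concrete, here is a minimal PyTorch sketch (illustrative only; the sizes x_dim, K, n, the encoder architecture, and the helper name amortized_q_params are assumptions, not part of the assignment).

```python
import torch
import torch.nn as nn

x_dim, K, n = 32, 8, 1000   # data dim, latent dim, number of training points

# Amortized inference: a single network phi maps any x to the parameters
# pi_phi(x) = (mu, log_sigma) of q_phi(z|x).  Parameter storage is O(|phi|),
# independent of n, and inference on a new x is one forward pass.
encoder = nn.Sequential(nn.Linear(x_dim, 64), nn.ReLU(), nn.Linear(64, 2 * K))

def amortized_q_params(x):
    mu, log_sigma = encoder(x).chunk(2, dim=-1)
    return mu, log_sigma.exp()

mu, sigma = amortized_q_params(torch.randn(5, x_dim))   # works for any batch of x

# Instance-dependent inference: a separate (mu_i, log_sigma_i) is optimized
# (and stored) for every training point x_i, so storage grows as O(n * K) and
# each new datapoint requires its own optimization loop.
instance_mu = nn.Parameter(torch.zeros(n, K))
instance_log_sigma = nn.Parameter(torch.zeros(n, K))
```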

 

Question 4 (8-8). Let p(x,z) be the joint probability of a latent variable model where x and z denote the observed and unobserved variables, respectively. Let q(z|x) be an auxiliary distribution which we call the proposal, and define

LK[q(z|x)] := E_{z1,...,zK ∼ q(z|x)} [ log (1/K) Σ_{k=1}^{K} p(x,zk)/q(zk|x) ],

where z1,...,zK are drawn i.i.d. from q(z|x).

We’ve seen in class that this objective is a tighter lower bound on logp(x) than the evidence lower bound (ELBO), which is equal to L1; that is, L1[q(z|x)] ≤ LK[q(z|x)] ≤ logp(x).

In fact, LK[q(z|x)] can be interpreted as the ELBO with a refined proposal distribution. For zj drawn i.i.d. from q(z|x) with 2 ≤ j ≤ K, define the unnormalized density

q̃(z|x,z2,...,zK) := p(x,z) / [ (1/K) ( p(x,z)/q(z|x) + Σ_{j=2}^{K} p(x,zj)/q(zj|x) ) ].

(Hint: in what follows, you might need to use the fact that if w1,...,wK are random variables that have the same distribution, then K E[w1] = Σi E[wi] = E[Σi wi]. You need to identify such wi’s before applying this fact in each subquestion.)

1.    Show that LK[q(z|x)] = Ez2:K[L1[q̃(z|x,z2,...,zK)]]; that is, the importance-weighted lower bound with K samples is equal to the average ELBO with the unnormalized density as a refined proposal.

2.    Show that qK(z|x) := Ez2:K[q̃(z|x,z2,...,zK)] is in fact a probability density function. Also, show that L1[qK(z|x)] is an even tighter lower bound than LK[q(z|x)]. This implies qK(z|x) is closer to the true posterior p(z|x) than q(z|x), due to resampling, since LK[q(z|x)] ≥ L1[q(z|x)]. (Hint: f(x) := −x log x is concave.)
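As a side note (not needed for the proofs), the K-sample bound defined above can be estimated from samples with a numerically stable log-sum-exp. A minimal sketch, assuming log p(x, zk) and log q(zk|x) are available as tensors; the function name is illustrative.

```python
import math
import torch

def iwae_bound_estimate(log_p_xz, log_q_z):
    """One Monte Carlo sample of L_K.

    log_p_xz, log_q_z: shape-[K] tensors holding log p(x, z_k) and
    log q(z_k | x) for K i.i.d. samples z_k ~ q(z|x).  Returns
    log( (1/K) * sum_k p(x, z_k) / q(z_k | x) ).
    """
    log_w = log_p_xz - log_q_z                        # log importance weights
    K = log_w.shape[0]
    return torch.logsumexp(log_w, dim=0) - math.log(K)
```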

 

Question 5 (5-5-5-6). Normalizing flows are expressive invertible transformations of probability distributions. In this exercise, we will see how to satisfy the invertibility constraint for some families of parameterizations. For the first three subquestions, we assume the function g : R → R maps from the real line to the real line.

1.    Let g(z) = a f(bz + c), where f is the ReLU activation function f(x) = max(0,x). Show that g is non-invertible.

2.    Let g(z) = σ−1(Σi wi σ(ai z + bi)), where Σi wi = 1, wi > 0, ai > 0, and σ(x) = 1/(1 + exp(−x)) is the logistic sigmoid activation function and σ−1 is its inverse. Show that g is strictly monotonically increasing on its domain (−∞,∞), which implies invertibility.

3.    Consider a residual function of the form g(z) = z + f(z). Show that df/dz > −1 implies g is invertible.

4.    Consider the following transformation:

                                                                                         g(z) = z + βh(α,r)(z − z0)                                                       (42)

where z0 ∈ RD, α ∈ R+, β ∈ R, r = ||z − z0||2, and h(α,r) = 1/(α + r). Consider the decomposition z = z0 + r z̃, where z̃ := (z − z0)/r is a unit vector. (i) Given y = g(z), show that β ≥ −α is a sufficient condition for recovering a unique r from equation (42). (ii) Given r and y, show that equation (42) has a unique solution z̃.
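As an optional numerical illustration of subquestions 1 and 2 (a sanity check, not a proof), here is a minimal NumPy sketch; the constants a, b, c and the mixture parameters are arbitrary, and subquestion 2 uses the mixture-of-sigmoids form stated above.

```python
import numpy as np

# Part 1: g(z) = a * relu(b*z + c) maps every z with b*z + c <= 0 to 0,
# so g is not injective and hence not invertible.
a, b, c = 2.0, 1.0, -1.0
zs = np.array([-3.0, -2.0, 0.5])                 # all satisfy b*z + c <= 0
print(a * np.maximum(0.0, b * zs + c))           # -> [0. 0. 0.]

# Part 2: a convex combination of increasing sigmoids, pushed through the
# logit (sigma^{-1}), is strictly increasing in z.
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
w = np.array([0.3, 0.7])                         # w_i > 0, sum to 1
a_i = np.array([1.5, 0.5])                       # a_i > 0
b_i = np.array([-1.0, 2.0])

def g(z):
    s = np.sum(w * sigmoid(a_i * z + b_i))
    return np.log(s) - np.log(1.0 - s)           # sigma^{-1}(s)

grid = np.linspace(-5.0, 5.0, 201)
vals = np.array([g(z) for z in grid])
assert np.all(np.diff(vals) > 0)                 # numerically monotone increasing
```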

 

Question 6 (4-3-6). In this question, we are concerned with analyzing the training dynamics of GANs. Consider the following value function

                                                                                                   V (d,g) = dg                                                                    (43)

with g ∈ R and d ∈ R. We will use this simple example to study the training dynamics of GANs.

1.    Consider gradient descent/ascent with learning rate α as the optimization procedure to iteratively minimize V (d,g) w.r.t. g and maximize V (d,g) w.r.t. d. We apply the gradient descent/ascent updates to d and g simultaneously. What are the update rules for d and g? Write your answer in the following form

[dk+1, gk+1]⊤ = A [dk, gk]⊤

where A is a 2 × 2 matrix; i.e. specify the value of A.

2.    The optimization procedure you found in 6.1 characterizes a map which has a stationary point.[1] What are the coordinates of the stationary point(s)?

3.    Analyze the eigenvalues of A and predict what will happen to d and g as you update them jointly. In other words, predict the behaviour of dk and gk as k → ∞.


 
[1] . A stationary point is a point on the surface of the graph (of the function) where all its partial derivatives are zero (equivalently, the gradient is zero). Source: https://en.wikipedia.org/wiki/Stationary_point
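To get a feel for the dynamics being analyzed (the eigenvalue analysis itself should still be done by hand), here is a minimal simulation sketch of simultaneous gradient ascent on d and descent on g for V(d,g) = dg; the learning rate and the initial point are arbitrary choices.

```python
import numpy as np

alpha = 0.1                      # learning rate
d, g = 1.0, 1.0                  # arbitrary initialization
trajectory = [(d, g)]
for _ in range(200):
    grad_d = g                   # dV/dd for V(d, g) = d * g
    grad_g = d                   # dV/dg
    d, g = d + alpha * grad_d, g - alpha * grad_g   # simultaneous update
    trajectory.append((d, g))

# Inspect how the distance from the origin evolves over the iterations.
radii = [np.hypot(di, gi) for di, gi in trajectory]
print(radii[0], radii[len(radii) // 2], radii[-1])
```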

 

 

 
