Problem 1: Bernoulli GLMs as latent variable models.
Consider a Bernoulli regression model,
w ∼ N(µ, Σ)
yn | xn, w ∼ Bern(f(wᵀxn)) for n = 1, ..., N,
where w and xn are vectors in R^D, yn ∈ {0, 1}, and f : R → [0, 1] is the mean function. In class we studied Newton’s method for finding the maximum a posteriori (MAP) estimate w* = argmax_w p(w | {xn, yn}_{n=1}^N). Now we will consider methods for approximating the full posterior distribution.
(a) Rather than using the logistic function, let the mean function be the normal cumulative distribution function (CDF), or “probit” function,
f(u) = Pr(z ≤ u) where z ∼ N(0, 1)
     = ∫_{−∞}^{u} N(z; 0, 1) dz.
This is called the probit regression model. Show that the likelihood p(yn | xn, w) is a marginal of a joint distribution,
p(yn, zn | xn, w) = I[zn ≥ 0]^{I[yn=1]} · I[zn < 0]^{I[yn=0]} · N(zn | xnᵀw, 1).
«Your answer here.»
(b) Derive the conditional distributions p(w | {xn, yn, zn}_{n=1}^N) and p(zn | xn, yn, w).¹
«Your answer here.»
(c) Gibbs sampling is a Markov chain Monte Carlo (MCMC) method for approximate posterior inference. It works by repeatedly sampling from the conditional distribution of one variable, holding all others fixed. For the probit regression model, this means iteratively performing these two steps:
1. Sample zn ∼ p(zn | xn, yn, w) for n = 1, ..., N, holding w fixed;
2. Sample w ∼ p(w | {xn, yn, zn}_{n=1}^N) holding {zn}_{n=1}^N fixed.
¹Observe that zn is conditionally independent of {xn′, yn′, zn′}_{n′≠n} given w.
Note the similarity to EM: rather than computing a posterior distribution over zn, we draw a sample from it; rather than setting w to maximize the ELBO, we draw a sample from its conditional distribution. It can be shown that this algorithm defines a Markov chain on the space of (w, {zn}Nn=1) whose stationary distribution is the posterior p(w, {zn}Nn=1 | {xn, yn}Nn=1). In other words, repeating these steps infinitely many times would yield samples of w and {zn}Nn=1 drawn from their posterior distribution.
Implement this Gibbs sampling algorithm and test it on a synthetic dataset with D = 2 dimensional covariates and N = 100 data points. Scatter plot your samples of w and, for comparison, plot the true value of w that generated the data. Do your samples look approximately Gaussian distributed? How does the posterior distribution change when you vary N?
«Your figures and captions here.»
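For concreteness, here is a minimal sketch of one way to implement the sampler in Python/NumPy. It assumes the conditionals derived in part (b): zn | xn, yn, w is a unit-variance normal truncated to the positive or negative half-line, and w | {xn, yn, zn} is Gaussian with a precomputable covariance. All function and variable names (gibbs_probit, mu0, Sigma0, and so on) are illustrative, not prescribed by the assignment.

```python
import numpy as np
from scipy.stats import truncnorm

def gibbs_probit(X, y, mu0, Sigma0, n_iters=2000, seed=0):
    """Gibbs sampler for probit regression with prior w ~ N(mu0, Sigma0)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    Sigma0_inv = np.linalg.inv(Sigma0)
    # The conditional covariance of w does not depend on z, so precompute it.
    V = np.linalg.inv(Sigma0_inv + X.T @ X)
    w = np.zeros(D)
    samples = []
    for _ in range(n_iters):
        # 1. Sample z_n | x_n, y_n, w: a N(x_n^T w, 1) variable truncated to
        #    [0, inf) if y_n = 1 and to (-inf, 0) if y_n = 0.
        m = X @ w
        lo = np.where(y == 1, -m, -np.inf)   # bounds in standardized units
        hi = np.where(y == 1, np.inf, -m)
        z = truncnorm.rvs(lo, hi, loc=m, scale=1.0, random_state=rng)
        # 2. Sample w | {x_n, y_n, z_n}: Gaussian with covariance V and mean
        #    V (Sigma0^{-1} mu0 + X^T z).
        w = rng.multivariate_normal(V @ (Sigma0_inv @ mu0 + X.T @ z), V)
        samples.append(w)
    return np.array(samples)

# Synthetic test with D = 2 and N = 100, as in the problem statement.
rng = np.random.default_rng(1)
w_true = np.array([1.5, -1.0])
X = rng.normal(size=(100, 2))
y = (rng.normal(X @ w_true, 1.0) >= 0).astype(int)   # equivalent to Bern(Phi(w^T x))
w_samples = gibbs_probit(X, y, mu0=np.zeros(2), Sigma0=np.eye(2))
```

Discarding the first half of w_samples as burn-in and scatter plotting the rest against w_true gives the requested figure.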
(d) Bonus. There are also auxiliary variable methods for logistic regression, where f(u) = e^u/(1 + e^u). Specifically, we have that,
e^{yn·wᵀxn} / (1 + e^{wᵀxn}) = (1/2) ∫_0^∞ exp{(yn − 1/2) wᵀxn − (1/2) zn (wᵀxn)²} PG(zn; 1, 0) dzn,
where PG(z; b, c) is the density function of the Pólya-gamma (PG) distribution over z ∈ R+ with parameters b and c. The PG distribution has a number of nice properties: it is closed under exponential tilting so that,
e^{−(1/2) z c²} PG(z; b, 0) ∝ PG(z; b, c),
and its expectation is available in closed form,
E[z] = (b / (2c)) tanh(c/2) for z ∼ PG(b, c).
Use these properties to derive an EM algorithm for finding w* = argmax_w p({yn} | {xn}, w). How do the EM updates compare to Newton’s method?
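If you attempt this bonus, the resulting iteration is short enough to sketch. The code below assumes the maximum-likelihood setting above (no prior on w): the E-step evaluates E[zn] = tanh(cn/2)/(2cn) at cn = wᵀxn using the PG mean identity, and the M-step solves the resulting weighted least-squares problem. Names are illustrative.

```python
import numpy as np

def pg_mean(c):
    """E[z] for z ~ PG(1, c), i.e. tanh(c/2) / (2c), with the c -> 0 limit of 1/4."""
    c = np.asarray(c, dtype=float)
    out = np.full_like(c, 0.25)
    nz = np.abs(c) > 1e-8
    out[nz] = np.tanh(c[nz] / 2.0) / (2.0 * c[nz])
    return out

def em_logistic(X, y, n_iters=100):
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(n_iters):
        omega = pg_mean(X @ w)   # E-step: E[z_n | y_n, x_n, w]
        # M-step: maximize sum_n (y_n - 1/2) x_n^T w - (1/2) omega_n (x_n^T w)^2,
        # a weighted least-squares problem in w.
        w = np.linalg.solve(X.T @ (omega[:, None] * X), X.T @ (y - 0.5))
    return w
```

The matrix Xᵀ diag(ω) X that appears in the M-step is a useful starting point for the comparison with Newton’s method that the question asks for.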
Problem 2: Spike sorting with mixture models
As discussed in class, “spike sorting” is ultimately a mixture modeling problem. Here we will study the problem in more detail. Let {yn}Nn=1 represent a collection of spikes. Each yn ∈ RD is a vector containing features of the n-th spike waveform. For example, the features may be projections of the spike waveform onto the top D principal components. We have the following, general model,
zn | π ∼ π
yn | zn, θ ∼ p(yn | θzn).
The label zn ∈ {1, ..., K} indicates which of the K neurons generated the n-th spike waveform. The probability vector π ∈ Δ_K specifies a prior distribution on spike labels, and the parameters θ = {θk}_{k=1}^K determine the likelihood of the spike waveforms yn for each of the K neurons. The goal is to infer a posterior distribution p(zn | yn, π, θ) over labels for each observed spike, and to learn the parameters π* and θ* that maximize the likelihood of the data.
(a) Start with a Gaussian observation model,
yn | zn, θ ∼ N(yn | µzn, Σzn),
where θk = (µk, Σk) includes the mean and covariance for the k-th neuron.
Derive an EM algorithm to compute π*, θ* = argmax_{π,θ} p({yn}_{n=1}^N | π, θ). Start by deriving the “responsibilities” wnk = p(zn = k | yn, π′, θ′) for fixed parameters π′ and θ′. Then use the responsibilities to compute the expected log joint probability,
ℒ(π, θ) = Σ_{n=1}^N E_{p(zn | yn, π′, θ′)}[log p(yn, zn | π, θ)].
Finally, find closed-form expressions for π* and θ* that optimize ℒ(π, θ). «Your answer here.»
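As a reference point, a minimal sketch of the resulting EM loop (E-step responsibilities computed in log space, then the closed-form M-step updates) might look like the following. The initialization and the small diagonal jitter are illustrative choices, not part of the derivation.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def em_gmm(Y, K, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    N, D = Y.shape
    pi = np.ones(K) / K
    mu = Y[rng.choice(N, K, replace=False)]                 # initialize means at data points
    Sigma = np.tile(np.cov(Y.T) + 1e-6 * np.eye(D), (K, 1, 1))
    for _ in range(n_iters):
        # E-step: responsibilities w_nk = p(z_n = k | y_n, pi, theta).
        log_r = np.stack([np.log(pi[k]) + multivariate_normal.logpdf(Y, mu[k], Sigma[k])
                          for k in range(K)], axis=1)
        r = np.exp(log_r - logsumexp(log_r, axis=1, keepdims=True))   # (N, K)
        # M-step: closed-form updates for pi, mu_k, Sigma_k.
        Nk = r.sum(axis=0)
        pi = Nk / N
        mu = (r.T @ Y) / Nk[:, None]
        for k in range(K):
            diff = Y - mu[k]
            Sigma[k] = (r[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
    return pi, mu, Sigma, r
```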
(b) The Gaussian model can be sensitive to outliers, which can cause spikes from one neuron to be split into two clusters. One way to side-step this issue is to replace the Gaussian with a heavier-tailed distribution like the multivariate Student’s t, which has probability density,
p(yn | θzn) = Γ((α0 + D)/2) / (Γ(α0/2) α0^{D/2} π^{D/2} |Σzn|^{1/2}) · [1 + (1/α0)(yn − µzn)ᵀ Σzn⁻¹ (yn − µzn)]^{−(α0 + D)/2}.
We will treat α0 as a fixed hyperparameter.
Like the negative binomial distribution studied in HW1, the multivariate Student’s t can also be represented as an infinite mixture,
p(yn | θzn) = ∫ p(yn, τn | θzn) dτn = ∫ N(yn; µzn, τn⁻¹ Σzn) Gamma(τn; α0/2, α0/2) dτn. We will derive an EM algorithm to find π*, θ* in this model.
First, show that the posterior takes the form
p(τn,zn | yn,π,θ) = p(zn | yn,π,θ) p(τn | zn, yn, θ)
= ∏_{k=1}^K [wnk Gamma(τn | ank, bnk)]^{I[zn = k]},
and solve for the parameters wnk, ank, bnk in terms of yn, π, and θ. «Your answer here.»
(c) Now compute the expected log joint probability,
ℒ(π, θ) = Σ_{n=1}^N E_{p(τn, zn | yn, π′, θ′)}[log p(yn, zn, τn | π, θ)],
using the fact that E[X] = a/b for X ∼ Gamma(a, b). You may omit terms that are constant with respect to π and θ.
«Your answer here.»
(d) Finally, solve for π* and θ* that maximize the expected log joint probability. How does your answer compare to the solution you found in part (a)?
«Your answer here.»
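For comparison with part (a), here is a minimal sketch of the corresponding EM loop. It assumes the Gamma(α0/2, α0/2) scale-mixture representation above, under which the extra E-step quantity is E[τn | zn = k, yn] = ank/bnk = (α0 + D)/(α0 + (yn − µk)ᵀΣk⁻¹(yn − µk)). Names and initialization are illustrative.

```python
import numpy as np
from scipy.special import gammaln, logsumexp

def log_mvt(Y, mu, Sigma, alpha0):
    """Log density of the multivariate Student's t with dof alpha0 and scale Sigma."""
    D = Y.shape[1]
    L = np.linalg.cholesky(Sigma)
    maha = np.sum(np.linalg.solve(L, (Y - mu).T).T ** 2, axis=1)
    return (gammaln((alpha0 + D) / 2) - gammaln(alpha0 / 2)
            - 0.5 * D * np.log(alpha0 * np.pi) - np.sum(np.log(np.diag(L)))
            - 0.5 * (alpha0 + D) * np.log1p(maha / alpha0))

def em_tmm(Y, K, alpha0=4.0, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    N, D = Y.shape
    pi = np.ones(K) / K
    mu = Y[rng.choice(N, K, replace=False)]
    Sigma = np.tile(np.cov(Y.T) + 1e-6 * np.eye(D), (K, 1, 1))
    for _ in range(n_iters):
        # E-step: responsibilities w_nk and E[tau_n | z_n = k, y_n] = a_nk / b_nk.
        log_r = np.stack([np.log(pi[k]) + log_mvt(Y, mu[k], Sigma[k], alpha0)
                          for k in range(K)], axis=1)
        r = np.exp(log_r - logsumexp(log_r, axis=1, keepdims=True))   # (N, K)
        maha = np.stack([np.sum((Y - mu[k]) * np.linalg.solve(Sigma[k], (Y - mu[k]).T).T, axis=1)
                         for k in range(K)], axis=1)                  # (N, K)
        Etau = (alpha0 + D) / (alpha0 + maha)
        # M-step: updates are weighted by w_nk * E[tau_n | z_n = k, y_n].
        Nk = r.sum(axis=0)
        pi = Nk / N
        mu = ((r * Etau).T @ Y) / (r * Etau).sum(axis=0)[:, None]
        for k in range(K):
            diff = Y - mu[k]
            Sigma[k] = ((r[:, k] * Etau[:, k])[:, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
    return pi, mu, Sigma, r
```

Note that the only structural change from part (a) is that each responsibility is paired with a weight E[τn | zn = k, yn] that discounts points far from the cluster center.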
Problem 3: Poisson matrix factorization
Many biological datasets come in the form of matrices of non-negative counts. RNA sequencing data, neural spike trains, and network data (where each entry indicates the number of connections between a pair of nodes) are all good examples. It is common to model these counts as a function of some latent features of the corresponding row and column. Here we consider one such model, which decomposes a count matrix into a superposition of non-negative row and column factors.
Let Y ∈ ℕ^{M×N} denote an observed M × N matrix of non-negative count data. We model this matrix as a function of non-negative row factors U ∈ ℝ_+^{M×K} and column factors V ∈ ℝ_+^{N×K}. Let um ∈ ℝ_+^K and vn ∈ ℝ_+^K
denote the m-th and n-th rows of U and V, respectively. We assume that each observed count ymn is conditionally independent of the others given its corresponding row and column factors. Moreover, we assume a linear Poisson model,
ymn | um, vn ∼ Poisson(umᵀvn).
(Since um and vn are non-negative, the mean parameter is valid.) Finally, assume gamma priors,
umk ∼ Gamma(α0,β0),
vnk ∼ Gamma(α0,β0).
Note that even though the gamma distribution is conjugate to the Poisson, here we have an inner product of two gamma vectors producing one Poisson random variable. The posterior distribution is more complicated. The entries of um are not independent under the posterior due to the “explaining away” effect. Nevertheless, we will derive a mean-field variational inference algorithm to approximate the posterior distribution.
(a) First we will use an augmentation trick based on the additivity of Poisson random variables; i.e. the fact that
y ∼ Poisson(Σ_{k=1}^K λk) ⟺ y = Σ_{k=1}^K yk where yk ∼ Poisson(λk) independently,
for any collection of non-negative rates λ1, ..., λK ∈ R+. Use this fact to write the likelihood p(ymn | um, vn) as a marginal of a joint distribution p(ymn, ȳmn | um, vn) where ȳmn = (ymn1, ..., ymnK) is a length-K vector of non-negative counts. (Hint: this is similar to Problem 1 in that ymn is deterministic given ȳmn.)
«Your answer here.»
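A quick numerical sanity check of the additivity property (not required for the write-up) is easy to run; the factor values below are arbitrary illustrative draws.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.gamma(2.0, 1.0, size=3)   # one illustrative row factor u_m
v = rng.gamma(2.0, 1.0, size=3)   # one illustrative column factor v_n

y_direct = rng.poisson(u @ v, size=100_000)                 # y ~ Poisson(u^T v)
y_aug = rng.poisson(u * v, size=(100_000, 3)).sum(axis=1)   # sum_k y_k with y_k ~ Poisson(u_k v_k)

# Both draws target the same distribution, so their moments should agree.
print(y_direct.mean(), y_aug.mean())   # both approximately u^T v
print(y_direct.var(), y_aug.var())     # Poisson variance equals the mean
```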
(b) Let Ȳ ∈ ℕ^{M×N×K} denote the augmented data array with entries ymnk as above. We will use mean-field variational inference to approximate the posterior as,
p(Ȳ, U, V | Y) ≈ q(Ȳ) q(U) q(V) = [∏_{m=1}^M ∏_{n=1}^N q(ȳmn)] [∏_{m=1}^M ∏_{k=1}^K q(umk)] [∏_{n=1}^N ∏_{k=1}^K q(vnk)].
We will solve for the optimal posterior approximation via coordinate descent on the KL divergence to the true posterior. Recall that holding all factors except for q(ȳmn) fixed, the KL is minimized when
q(ȳmn) ∝ exp{ E_{q(Ȳ¬mn) q(U) q(V)}[log p(Y, Ȳ, U, V)] },
where q(Ȳ¬mn) = ∏_{(m′,n′) ≠ (m,n)} q(ȳm′n′) denotes all variational factors except for the (m, n)-th. Show that the optimal q(ȳmn) is a multinomial of the form,
q(ȳmn) = Mult(ȳmn; ymn, πmn),
and solve for πmn ∈ Δ_K. You should write your answer in terms of expectations with respect to the other variational factors.
«Your answer here.»
(c) Holding all factors but q(umk) fixed, show that the optimal distribution is
q(umk) = Gamma(umk; αmk, βmk).
Solve for αmk, βmk; write your answer in terms of expectations with respect to q(ȳmn) and q(vnk). «Your answer here.»
(d) Use the symmetry of the model to determine the parameters of the optimal gamma distribution for q(vnk), holding q(ȳmn) and q(umk) fixed,
q(vnk) = Gamma(vnk; αnk, βnk).
Solve for αnk, βnk; write your answer in terms of expectations with respect to q(ȳmn) and q(umk). «Your answer here.»
(e) Now that the form of all variational factors has been determined, compute the required expectations (in closed form) to write the coordinate descent updates in terms of the other variational parameters. Use the fact that E[log X] = ψ(α) − log β for X ∼ Gamma(α, β), where ψ is the digamma function.
«Your answer here.»
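Putting parts (b) through (e) together, the full coordinate ascent loop can be written compactly. The sketch below assumes shape/rate parameterizations for all Gamma factors and materializes the dense (M, N, K) array of expected augmented counts purely for clarity; all names are illustrative.

```python
import numpy as np
from scipy.special import digamma

def cavi_poisson_mf(Y, K, alpha0=1.0, beta0=1.0, n_iters=100, seed=0):
    """Mean-field coordinate ascent for the gamma-Poisson matrix factorization model."""
    M, N = Y.shape
    rng = np.random.default_rng(seed)
    # Variational Gamma parameters (shape a, rate b), initialized with a little noise.
    a_u = alpha0 + rng.uniform(0, 0.1, size=(M, K)); b_u = np.full((M, K), beta0)
    a_v = alpha0 + rng.uniform(0, 0.1, size=(N, K)); b_v = np.full((N, K), beta0)
    for _ in range(n_iters):
        # q(ybar_mn) = Mult(y_mn, pi_mn) with pi_mnk proportional to
        # exp(E[log u_mk] + E[log v_nk]), where E[log X] = digamma(a) - log(b).
        log_rho = ((digamma(a_u) - np.log(b_u))[:, None, :]
                   + (digamma(a_v) - np.log(b_v))[None, :, :])        # (M, N, K)
        pi = np.exp(log_rho - log_rho.max(axis=2, keepdims=True))
        pi /= pi.sum(axis=2, keepdims=True)
        Ey = Y[:, :, None] * pi                                       # E[ybar_mnk] = y_mn * pi_mnk
        # q(u_mk) = Gamma(alpha0 + sum_n E[ybar_mnk], beta0 + sum_n E[v_nk]).
        a_u = alpha0 + Ey.sum(axis=1)
        b_u = np.full((M, K), beta0) + (a_v / b_v).sum(axis=0)
        # q(v_nk) = Gamma(alpha0 + sum_m E[ybar_mnk], beta0 + sum_m E[u_mk]).
        a_v = alpha0 + Ey.sum(axis=0)
        b_v = np.full((N, K), beta0) + (a_u / b_u).sum(axis=0)
    return a_u, b_u, a_v, b_v
```

Since E[ymnk] = ymn πmnk vanishes wherever ymn = 0, the multinomial factors only need to be computed at the non-zero entries of Y, which is worth keeping in mind for part (f).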
(f) Suppose that Y is a sparse matrix with only S ≪ MN non-zero entries. What is the complexity of this mean-field coordinate descent algorithm?
«Your answer here.»
Problem 4: Apply Poisson matrix factorization to C. elegans connectomics data
Make a copy of this Colab notebook: https://colab.research.google.com/drive/1ZMwcB6vzVaXz4WJiNT514b7zB5s3_SBk
Use your solutions from Problem 3 to finish the incomplete code cells. Once you’re done, run all the code cells, save the notebook in .ipynb format, print a copy in .pdf format, and submit these files along with the rest of your written assignment.