
Computational Stats – Lab 3: Hastings-Metropolis (and Gibbs) samplers

Exercise 1: Hastings-Metropolis within Gibbs – Stochastic Approximation EM
We observe a group of N independent individuals. For the i-th individual, we have ki measurements yi,j ∈ R. In studies on the progression of diseases, the measurements yi,j can be measures of weight, volume of brain structures, protein concentration, tumoral score, etc. over time. We assume that the measurements yi,j are independent and that yi,j is obtained at time ti,j, with ti,1 < ... < ti,ki.

1.A – A population model for longitudinal data
We wish to model an average progression, as well as individual-specific progressions, of the measurements from the observations (yi,j)i∈⟦1,N⟧, j∈⟦1,ki⟧. To do so, we consider a hierarchical model defined as follows.

i.      We assume that the average trajectory is the straight line which goes through the point p0 at time t0 with velocity v0:

d(t) := p0 + v0 (t − t0)

where

p0 ∼ N(p̄0, σp0²)  ;  t0 ∼ N(t̄0, σt0²)  ;  v0 ∼ N(v̄0, σv0²)

and σp0, σt0, σv0 are fixed variance parameters. Since we restrict ourselves to straight lines, p0 may also be kept fixed.

ii.    For the i-th individual, we assume a trajectory of progression of the form

di(t) := d(αi(t − t0 − τi) + t0 ).

The trajectory of the i-th individual is an affine reparametrization of the average trajectory. This reparametrization, given by t ↦ αi(t − t0 − τi) + t0, characterizes changes in speed and delay in the progression of the i-th individual with respect to the average trajectory. Moreover, we assume that the measurements of the i-th individual satisfy

yi,j = di(ti,j) + εi,j   where   εi,j i.i.d.∼ N(0, σ²)

and

αi = exp(ξi)   where   ξi i.i.d.∼ N(0, σξ²)  ;  τi i.i.d.∼ N(0, στ²).
The parameters of the model are θ = (t0, v0, σξ, στ, σ). For all i ∈ ⟦1,N⟧, zi = (ξi, τi) are random variables called random effects and zpop = (t0, v0) are called fixed effects. The fixed effects are used to model the group progression whereas the random effects model individual progressions. Likewise, we define θi = (σξ, στ, σ) and θpop = (t0, v0).

We consider a Bayesian framework and assume the following priors on the parameters θ:

t0 ∼ N(t̄0, σt0²)  ;  v0 ∼ N(v̄0, σv0²)  ;  σξ², στ², σ² i.i.d.∼ W−1(v, m)

where W−1(v, m) (v > 0, m ∈ N∗) is the inverse-Wishart distribution, with density

fW−1(σ²) ∝ (v/σ)^m exp( −v²/(2σ²) ).

1.     Write the complete log-likelihood log q(y, z, θ) of the previous model and show that the proposed model belongs to the curved exponential family.

2.     Generate synthetic data from the model by taking some reasonable values for the parameters.
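As a sketch, question 2 can be approached as follows in Python. All numerical values (p0, t0, v0, the variances, the number of individuals and the number of visits) are illustrative choices, not values prescribed by the model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameter values (any "reasonable" choice works)
N = 100                          # number of individuals
p0, t0, v0 = 0.3, 70.0, 0.05     # average trajectory: d(t) = p0 + v0 * (t - t0)
sigma_xi, sigma_tau, sigma = 0.3, 2.0, 0.02

# Random effects: alpha_i = exp(xi_i), xi_i ~ N(0, sigma_xi^2), tau_i ~ N(0, sigma_tau^2)
xi = rng.normal(0.0, sigma_xi, N)
tau = rng.normal(0.0, sigma_tau, N)
alpha = np.exp(xi)

data = []
for i in range(N):
    k_i = rng.integers(4, 9)                          # number of visits of individual i
    t = np.sort(t0 + rng.uniform(-10.0, 10.0, k_i))   # times t_{i,1} < ... < t_{i,k_i}
    d_i = p0 + v0 * alpha[i] * (t - t0 - tau[i])      # individual trajectory d_i(t)
    y = d_i + rng.normal(0.0, sigma, k_i)             # noisy measurements y_{i,j}
    data.append((t, y))
```

The pairs (t, y) stored in `data` can then be reused in question 4 to check the estimation algorithm against the known ground-truth parameters.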

1.B – HM-SAEM – Hastings-Metropolis sampler
In order to estimate the parameters of this statistical model – by maximum a posteriori for instance – we will use the SAEM (Stochastic Approximation EM) algorithm. However, this algorithm requires that we are able to sample from the posterior distribution, see Algorithm 2.

We will use the Hastings-Metropolis algorithm to that end, as direct sampling is not possible in our context. Let q(·|z) be the proposal distribution of the algorithm, i.e. the conditional probability of proposing a state z∗ given the current state z, and let π be a density defined on an open set U of Rn. The Hastings-Metropolis algorithm targeting π reads as follows:

 

Algorithm 1: Hastings-Metropolis Sampler

1    Given an initial state z(0)
2    for k = 0 to maxIter do
3        Sample a candidate z∗ ∼ q(·|z(k))
4        Compute the acceptance ratio α(z(k), z∗) = min( 1 , [π(z∗) q(z(k)|z∗)] / [π(z(k)) q(z∗|z(k))] )
5        With probability α(z(k), z∗), set z(k+1) = z∗ ; otherwise set z(k+1) = z(k)

3. Propose a Metropolis-Hastings sampler to sample from the posterior distribution of the latent variable z = (zpop, zi)i∈⟦1,N⟧ = (t0, v0, ξi, τi)i∈⟦1,N⟧ ∈ R2N+2.

A natural choice for the proposal distribution is a multivariate Gaussian distribution N(z, σprop² Id). Since this proposal is symmetric, the acceptance ratio simplifies to α(z, z∗) = min( 1 , π(z∗)/π(z) ). This algorithm is called the Symmetric Random Walk Hastings-Metropolis algorithm.
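A minimal Python sketch of this symmetric random walk sampler; the target log-density, the step size `sigma_prop` and the chain length are placeholders to be adapted to the model.

```python
import numpy as np

def rwmh(log_pi, z0, sigma_prop, n_iter, rng):
    """Symmetric random walk Hastings-Metropolis targeting exp(log_pi)."""
    z = np.atleast_1d(np.asarray(z0, dtype=float))
    chain = np.empty((n_iter, z.size))
    lp = log_pi(z)
    for k in range(n_iter):
        z_star = z + sigma_prop * rng.standard_normal(z.size)  # N(z, sigma_prop^2 Id)
        lp_star = log_pi(z_star)
        # Symmetric proposal: accept with probability min(1, pi(z*) / pi(z))
        if np.log(rng.uniform()) < lp_star - lp:
            z, lp = z_star, lp_star
        chain[k] = z
    return chain

# Sanity check on a standard Gaussian target
rng = np.random.default_rng(1)
chain = rwmh(lambda z: -0.5 * np.sum(z**2), [0.0], 1.0, 20000, rng)
```

Working with log-densities avoids numerical underflow when π(z) is a product of many small terms, as is the case for the complete likelihood of question 1.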

We want to use the Expectation-Maximization algorithm to maximize the likelihood, especially as we have proved in question 1 that the model belongs to the curved exponential family. Nevertheless, the expectation required by the EM algorithm, Q(θ | θ(k)) = E[ log q(y, z, θ) | y, θ(k) ], cannot be computed in closed form here, due to the latent variable z. So we have to use a stochastic version of the EM algorithm, namely the SAEM algorithm: the Expectation step is split into two steps, a Simulation step and a Stochastic Approximation step, see Algorithm 2.

4. Compute the optimal parameters

θ(k) = argmax_{θ∈Θ} { −Φ(θ) + ⟨ Sk | Ψ(θ) ⟩ }

and implement the HM-SAEM in order to find the MAP. In particular, we assume that the MAP exists. Use question 2 to check your algorithm.

For the step-sizes εk, we can choose a burn-in parameter Nb and define

∀k ∈ N,   εk = 1 if k ∈ ⟦1, Nb⟧ ;   εk = (k − Nb)^(−α) otherwise,

where α ∈ (1/2, 1] is necessary to ensure the convergence of the MCMC-SAEM. See [AKT10, AK15].
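This schedule transcribes directly to Python; the value α = 0.65 below is only an illustrative choice within the admissible range (1/2, 1], the standard condition for SAEM convergence.

```python
def step_size(k, n_burn, alpha=0.65):
    """SAEM step-size: 1 during the burn-in phase, (k - n_burn)^(-alpha) afterwards."""
    return 1.0 if k <= n_burn else float(k - n_burn) ** (-alpha)
```

With this choice, the steps are square-summable but not summable after burn-in, which is what the convergence theory of stochastic approximation requires.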

Remark: Contrary to Bayesian inference, where burn-in traditionally refers to a certain number of samples which are discarded, here the term burn-in refers to memoryless approximation steps. In other words, during the burn-in phase, the information contained in z(k) is not used in the approximation of the sufficient statistics. In practice, the burn-in period is chosen to be half of the maximum number of iterations.

Algorithm 2: MCMC-SAEM (for curved exponential family)

 

1    Given data y and initial guess θ(0)
2    Initialization: z(0) = 0, S(0) = 0 and step-sizes (εk)k≥0
3    for k = 0 to maxIter do
4        Simulation: sample z(k+1) with one Hastings-Metropolis transition targeting q(z | y, θ(k))
5        Stochastic approximation: S(k+1) = S(k) + εk ( S(y, z(k+1)) − S(k) )
6        Maximization: θ(k+1) = argmax_{θ∈Θ} { −Φ(θ) + ⟨ S(k+1) | Ψ(θ) ⟩ }

1.C – HMwG-SAEM – Hastings-Metropolis within Gibbs sampler
However, the dimension of the latent variable z may become high if we consider a large cohort, making the posterior distribution of the latent variable difficult to sample. In that case, we can use a Gibbs sampler, which consists in generating an instance from the distribution of each (sub-)variable in turn, conditional on the current values of the other (sub-)variables. Gibbs sampling is applicable when the joint distribution is not known explicitly or is difficult to sample from directly, but the conditional distribution of each variable is known and easy (or at least easier) to sample from.

Consider π, a density defined on an open set U of Rn (n ≥ 2), and denote, for ℓ ∈ ⟦1, n⟧, by πℓ the ℓ-th full conditional of π, given by

πℓ(zℓ | z−ℓ) ∝ π(z)

where z−ℓ = (z1, ..., zℓ−1, zℓ+1, ..., zn). The classical Gibbs sampler then reads as follows:

 

Algorithm 3: Gibbs Sampler

1    Given an initial state z(0)
2    for k = 0 to maxIter do
3        for ℓ = 1 to n do
4            (?) Sample zℓ(k+1) ∼ πℓ( · | z1(k+1), ..., zℓ−1(k+1), zℓ+1(k), ..., zn(k) )

When direct sampling from the full conditionals is not possible, the step (?) is often replaced with a Metropolis-Hastings step. The resulting MCMC algorithm is called a hybrid Gibbs sampler or Metropolis-Hastings within Gibbs sampler.
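One such "one-at-a-time" sweep can be sketched as follows in Python, with placeholder names and step sizes: each coordinate ℓ is updated with a symmetric random-walk Metropolis-Hastings move and, since πℓ(zℓ | z−ℓ) ∝ π(z), the acceptance ratio of each coordinate move can be evaluated with the joint log-density.

```python
import numpy as np

def mhwg_sweep(log_pi, z, sigmas, rng):
    """One Metropolis-Hastings-within-Gibbs sweep over all coordinates of z."""
    z = np.array(z, dtype=float)
    for l in range(z.size):
        z_star = z.copy()
        z_star[l] += sigmas[l] * rng.standard_normal()  # symmetric move on coordinate l
        # pi_l(z_l | z_{-l}) is proportional to pi(z): the joint density suffices
        if np.log(rng.uniform()) < log_pi(z_star) - log_pi(z):
            z = z_star
    return z

# Sanity check on a correlated bivariate Gaussian
rng = np.random.default_rng(2)
prec = np.linalg.inv(np.array([[1.0, 0.5], [0.5, 1.0]]))
log_pi = lambda z: -0.5 * z @ prec @ z
z = np.zeros(2)
samples = np.empty((20000, 2))
for k in range(20000):
    z = mhwg_sweep(log_pi, z, [1.0, 1.0], rng)
    samples[k] = z
```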

5.    Propose a Metropolis-Hastings within Gibbs sampler to sample from the posterior distribution of zi = (ξi, τi).

6.    Likewise, propose a HMwG sampler for the posterior distribution of zpop = (t0, v0).

7.    Implement the HMwG-SAEM in order to find the MAP.

For large datasets, we can improve the sampling step by considering a Block HMwG sampler instead of the "one-at-a-time" HMwG sampler described above. In the Block version, each Metropolis-Hastings step consists in a multivariate symmetric random walk: the Block HMwG sampler simultaneously updates blocks (i.e. sets) of latent variables given the others.

8.    Explain the advantages of a Block Gibbs sampler over a "one-at-a-time" Gibbs sampler for our model.

9.    Implement a Block HMwG sampler by choosing one block for the fixed effects and one block per individual.

The model studied in this exercise is a very simplified version of the model proposed by Jean-Baptiste Schiratti in his PhD thesis. For more details, you can refer to [SACD15, Sch16, COA17].

Exercise 2: Multiplicative Hastings-Metropolis
Let f be a density function on ]−1, 1[. We consider the multiplicative Hastings-Metropolis algorithm [Dut12] defined as follows.

 Let X be the current state of the Markov chain.

(i)       We sample ε from the probability density function f and, independently, a random variable B from a Bernoulli distribution with fixed parameter.

(ii)     If B = 1, we set Y = εX. Otherwise, we set Y = X/ε. Then, we accept the candidate Y with probability α(X, Y), the usual Hastings-Metropolis acceptance ratio.
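A Python sketch under explicit assumptions: the Bernoulli parameter (left unspecified above) is taken equal to 1/2, f is taken uniform on ]−1, 1[, π is a standard Gaussian target, and the acceptance ratio is computed with the generic Hastings-Metropolis formula from the proposal density q(x, y) = (1/2)[ f(y/x)/|x| + f(x/y) |x|/y² ] obtained by a change of variables.

```python
import numpy as np

def f_unif(u):
    """Density f on ]-1, 1[ (uniform: an illustrative choice)."""
    return 0.5 if -1.0 < u < 1.0 else 0.0

def q_density(x, y):
    """Density at y of the proposal from x: mixture of Y = eps*X and Y = X/eps."""
    return 0.5 * (f_unif(y / x) / abs(x) + f_unif(x / y) * abs(x) / y**2)

def mult_mh(pi, x0, n_iter, rng):
    """Multiplicative Hastings-Metropolis chain targeting the density pi."""
    x = x0
    chain = np.empty(n_iter)
    for k in range(n_iter):
        eps = rng.uniform(-1.0, 1.0)                        # eps ~ f
        y = eps * x if rng.uniform() < 0.5 else x / eps     # B ~ Bernoulli(1/2)
        num = pi(y) * q_density(y, x)
        den = pi(x) * q_density(x, y)
        if den > 0 and rng.uniform() < min(1.0, num / den):
            x = y
        chain[k] = x
    return chain

# Sanity check: standard Gaussian target (unnormalized density)
rng = np.random.default_rng(3)
chain = mult_mh(lambda x: np.exp(-0.5 * x * x), 1.0, 30000, rng)
```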

1.    Determine the density q(X, ·) of the jumping (proposal) distribution Y ∼ q(X, ·).

2.    Compute the acceptance ratio α so that the chain admits a given distribution π as invariant distribution.

3.    Implement this sampler for two different target distributions: the first one being a distribution from which we can sample using the inverse transform method, and the second one of your choice.

4.    Evaluate, in each case, the match of your samples with the true distribution.

Exercise 3: Data augmentation
Let f : (x, y) ∈ Rp × Rq ↦ f(x, y) ∈ R+ be a density with respect to the Lebesgue measure on Rp+q. Let us define

fX(x) := ∫ f(x, y) dy   ;   fY(y) := ∫ f(x, y) dx ;

∀y ∈ Y := {y ∈ Rq | fY(y) > 0},   fX|Y(x | y) := f(x, y) / fY(y) ;

∀x ∈ X := {x ∈ Rp | fX(x) > 0},   fY|X(y | x) := f(x, y) / fX(x).

We define a bivariate chain {(Xn, Yn), n ≥ 0} as in the following algorithm.
1    Given an initial value Y0 ∈ Y and X0 ∼ fX|Y( · | Y0)
2    for n = 0 to N − 1 do
3        Sample Xn+1 ∼ fX|Y( · | Yn)
4        Sample Yn+1 ∼ fY|X( · | Xn+1)

 

 

1.    Show that the bivariate process {(Xn, Yn), n ≥ 0} is a Markov chain. Give the expression of its transition kernel as a function of the quantities defined above.

2.    Show that {Yn, n ≥ 0} is a Markov chain: give the expression of its transition kernel and prove that fY(y) dy is invariant for this kernel.

Hereafter, we consider the case when

f(x, y) = (4 / √(2π)) y^(3/2) exp( −y (4 + x²) / 2 ) 1R+(y).



3.    Describe a Gibbs algorithm to approximate the distribution on R × R with density f.

We can use a gamma distribution sampler: numpy.random.gamma in Python, or gamrnd from the Statistics and Machine Learning Toolbox in Matlab. A Matlab toolbox-free gamma generator can also be found in the Handbook of Monte Carlo Methods [KTB13]:

https://people.smp.uq.edu.au/DirkKroese/montecarlohandbook/probdist/.
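A minimal Python sketch of such a Gibbs sampler, assuming f is such that the full conditionals are X | Y = y ∼ N(0, 1/y) and Y | X = x ∼ Gamma(shape 5/2, rate (4 + x²)/2) — conditionals consistent with the gamma-sampler hint above and with the integrand of question 4.

```python
import numpy as np

def gibbs(n_iter, rng, x0=0.0, y0=1.0):
    """Two-component Gibbs sampler for the bivariate density f."""
    x, y = x0, y0
    xs = np.empty(n_iter)
    ys = np.empty(n_iter)
    for n in range(n_iter):
        # Y | X = x ~ Gamma(shape 5/2, rate (4 + x^2)/2); numpy uses the scale parametrization
        y = rng.gamma(2.5, 2.0 / (4.0 + x * x))
        # X | Y = y ~ N(0, 1/y)
        x = rng.normal(0.0, 1.0 / np.sqrt(y))
        xs[n] = x
        ys[n] = y
    return xs, ys

rng = np.random.default_rng(4)
xs, ys = gibbs(30000, rng)
```

Under these conditionals, the marginal of X is proportional to (4 + x²)^(−5/2), i.e. a Student-t distribution with 4 degrees of freedom, which gives a direct way to check the sampler.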

4.    Let H be a bounded function on R. Explain how to approximate

∫R H(x) / (4 + x²)^(5/2) dx

from the output {(Xn, Yn), 0 ≤ n ≤ N} of this Gibbs sampler.
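A hint on the normalization, assuming (consistently with the Gibbs sampler of question 3) that the X-marginal of f is proportional to (4 + x²)^(−5/2): since ∫R (4 + x²)^(−5/2) dx = 1/12, this marginal is fX(x) = 12 (4 + x²)^(−5/2), and the ergodic theorem then suggests the estimator

```latex
\int_{\mathbb{R}} \frac{H(x)}{(4+x^2)^{5/2}}\,dx
  \;=\; \frac{1}{12}\int_{\mathbb{R}} H(x)\, f_X(x)\,dx
  \;=\; \frac{1}{12}\,\mathbb{E}_{f_X}\!\big[H(X)\big]
  \;\approx\; \frac{1}{12(N+1)} \sum_{n=0}^{N} H(X_n).
```

Only the X-component of the output is needed; the Yn serve as auxiliary (augmentation) variables.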

References
[AK15]
Stéphanie Allassonnière and Estelle Kuhn. Convergent stochastic expectation maximization algorithm with efficient sampling in high dimension. application to deformable template model estimation. Computational Statistics & Data Analysis, 91:4 – 19, 2015.
[AKT10]
Stéphanie Allassonnière, Estelle Kuhn, and Alain Trouvé. Construction of bayesian deformable models via a stochastic approximation algorithm: A convergence study. Bernoulli, 16(3):641–678, 08 2010.
[COA17]
Juliette Chevallier, Stéphane Oudard, and Stéphanie Allassonnière. Learning spatiotemporal piecewise-geodesic trajectories from longitudinal manifold-valued data. In Advances in Neural Information Processing Systems 30. 2017.
[Dut12]
Somak Dutta. Multiplicative random walk metropolis-hastings on the real line. Sankhya B, 74(2):315–342, 2012.
[KTB13]
D.P. Kroese, T. Taimre, and Z.I. Botev. Handbook of Monte Carlo Methods. Wiley Series in Probability and Statistics. Wiley, 2013.
[SACD15]
Jean-Baptiste Schiratti, Stéphanie Allassonnière, Olivier Colliot, and Stanley Durrleman. Learning spatiotemporal trajectories from manifold-valued longitudinal data. In Advances in Neural Information Processing Systems 28. 2015.
[Sch16]
Jean-Baptiste Schiratti. Models and algorithms to learn spatiotemporal changes from longitudinal manifold-valued observations. PhD thesis, École polytechnique, 2016.
 
