Starting from:

$30.99

Computational Stats-Lab 4: Improve the Metropolis-Hastings Algorithm Solved

Exercise 1: Adaptive Metropolis-Hastings within Gibbs sampler
MCMC samplers, such as the Metropolis-Hastings algorithm or Gibbs sampler, require that the user specify a transition kernel with a given invariant distribution (the target distribution). These transition kernels usually depend on parameters which are to be given and tuned by the user. In practice, it is often difficult (if not impossible) to find the best parameters for such algorithms given a target distribution. Moreover, if the parameters are not carefully chosen, it may result in a MCMC algorithm performing poorly as in part A. Adaptive MCMC algorithms is a class of MCMC algorithms which address the problem of parameters tuning by updating automatically some of (if not all) the parameters.

1.A – Metropolis-Hastings within Gibbs sampler
We aim to sample the target distribution π, on R2, given by

2 !

where a > 0. We consider a Markov transition kernel P defined by

P =  (P1 + P2)

where Pi((x,y); dx0 × dy0 ) for i = 1,2 is the Markov transition kernel which only updates the i-th component: this update follows a symmetric random walk proposal mechanism and uses a Gaussian distribution with variance σi2.

Implement an algorithm which samples the distribution P1(z;) where z ∈ R2 ; likewise for the distribution P2(z;·). Then, implement an algorithm which samples a chain with kernel P.
Run the algorithm with a = 10 and standard deviations of the proposal distributions chosen as follows: (σ1,σ2) = (3,3). Discuss the performance of the algorithm in this situation.
How could the performance of the above algorithm be improved ? Propose two methods.
1.B – Adaptive Metropolis-Hastings within Gibbs sampler
Let π be a density defined on an open set U of Rd, d > 2. We consider here a Metropolis-Hastings within Gibbs algorithm to sample from the target density π. More precisely, the HM-step is a symmetric random walk one and the proposal distribution is a Gaussian distribution centered at the current state.

As usual, for i ∈ 1,d , let πi denote the i-th full conditional of π, which is given by:

                                        J      K

                                                     x−i = {x1,...,xi−1,xi+1,...,xd}                     ;              πi(xi |x−i) ∝ π(x)

and σi2 the variance of the corresponding proposal distribution.

                           Acceptance rate, HM sampler                                                                                                                              Acceptance rate, adaptative-HM sampler

Figure 1: Mean acceptance rate and contour plot of the density – a = 10

Algorithm 1: Metropolis-Hastings (symmetric random walk) within Gibbs Sampler

In [RR09], the authors propose an adaptive version of the above sampler which automatically adjusts the variances σ12,...,σd2 of the proposal distributions. We proceed as follows:

For each of the variables xi, we create an associate variable `i giving for the logarithm of the standard deviation σi to be used when proposing a normal increment to variable: `i := log(σi) ;
We initialize all `i to zero, which correspond to the unit proposal variance ;
After the j-th (j ∈ N∗) batch of 50 iterations, each variable `i is updated by adding or subtracting an amount δ(j). The adapting attempts to make the acceptance rate of proposals for variable xi as close as possible to 0.24, which is optimal for one-dimensional proposals in certain settings. Specifically, if the acceptance rate for the i-th variable is greater than 0.24, `i is increased with δ(j). Otherwise, if the rate is lower than 0.24, `i is decreased by δ(j). In practice, we (can) take δ(j) := min(0.01,j−1/2).
Implement the adaptative Metropolis-Hastings within Gibbs sampler and test the algorithm on the density π defined in the part A: Using auto-correlation plots (use a built-in function), compare the performance of the algorithm with or without adaptation.
We can also compare the performance of our algorithm on more complicated target densities. For example centered d-dimensional Gaussian N(0,Σ) or "banana"-shaped density as in TP 2:
                     ∀x = (x1,...,xd) ∈ Rd,            fB                                                                                                                                                                               x       .

In practice, you can choose d = 20 and B = 0.1 for the density fB. You may also choose d = 20 for the second one and use the 20 × 20 variance-covariance matrix Σ given in the file http://dept.stat.lsa.umich.edu/~yvesa/tmalaexcov.txt.

To go further. . .

The next improvement of the Metropolis-Hastings algorithm we can make is to consider a drift function in the proposal distribution. Given a positive definite matrix Λ and a scale parameter σ > 0, we consider a proposal distribution of the form:

qσ,.

qσ,Λ is the density (with respect to the Lebesgue measure on Rd) of the d-dimensional Gaussian distribution with mean x+ σ22ΛD(x) and variance-covariance matrix σ2Λ. If D vanishes everywhere (D ≡ 0), the corresponding algorithm is a Metropolis-Hastings Symmetric Random Walk. If the drift D is chosen such that:

                                                     ∀x ∈ Rd,       D

for a constant δ > 0, the corresponding algorithm is a Metropolis Adjusted Langevin Algorithm (MALA). In that case, the proposal distribution includes information on the gradient ∇logπ of the target distribution π. In [Atc06], the author proposes an adaptive version of the MALA algorithm in which the parameters σ and Λ are adjusted automatically.

Exercise 2: Sampling from multimodal distributions
We consider a target distribution π with support U ⊂ Rd (d ∈ N∗). When the target distribution is multimodal, especially with well-separated modes, classical MCMC algorithms can perform very poorly and exhibit poor mixing. Indeed, a Metropolis-Hastings algorithm with local proposal can get stuck for a long time in a local mode of the target distribution.

Figure 2: Mixture of 20 Gaussian distributions

2.A – A toy example
In the following, we consider a target distribution π – taken from [LW01] and plotted at figure 2 – defined on R2 as a mixture of 20 Gaussian distributions. The target distribution writes:

20

                                                       π(x) = Xi=1 2πσw i                          2 

where, ∀i ∈ {1,...,20}, wi = 0.05 and σi = 0.1. The 20 means µi are defined as follows:

             

                                 µ1,...,µ20 =                   2.18    ,    8.67    ,    4.24    ,    8.41    ,    3.93    ,    3.25    ,    1.70    ,

                                                                    5.76         9.59         8.48         1.68         8.82         3.47         0.50

            

                                                                    4.59         6.91         6.87         5.41         2.70         4.98         1.14

                                                                               ,               ,               ,               ,               ,               ,               ,

                                                                    5.60         5.81         5.40         2.65         7.88         3.70         2.39

           !

                                                                    8.33         4.93         1.83         2.26         5.54         1.69

, , , , , . 9.50 1.50 0.09 0.31 6.86 8.11

Write a Metropolis-Hastings Symmetric Random Walk algorithm (you may use your code from previous tutorial classes) to sample from π.
Show that the Metropolis-Hastings algorithm (even the adaptive Metropolis-Hastings algorithm) fails to sample from π.
2.B – Parallel Tempering
The general idea of the Parallel Tempering (PT) [Gey91, ED05] algorithm is to use tempered versions of the distribution π and run parallel Metropolis-Hastings algorithm to sample from these tempered distributions. The tempered distributions are obtained by "warming up" the target distribution π at different temperatures. At each iteration of the algorithm, a swap between two chains (chains running at different temperature levels) is proposed. The Parallel Tempering uses the fast mixing of the chains at high temperature to improve the mixing of the chains at low temperatures.

Let K denote a positive integer. We consider a sequence of temperatures (Ti)1≤i≤K such that:

T1 > T2 > ... > TK = 1.

In the Parallel Tempering algorithm, K chains run in parallel: for i ∈ 1,K , the i-th chain targets the tempered distribution πi := π1/Ti ; the distribution of interest corresponds to the lowest temperature,J K TK = 1. Let N denote the i-th chain, sampling from the tempered distribution πi.

At the n-th iteration of the Parallel Tempering algorithm, a candidate Yn(+1i) for the i-th chain is proposed using the transition kernel P(i)(Xn(i),·) of a Metropolis-Hastings algorithm. The next step consists in proposing a swap between two different chains (running at different temperatures): given

(i,j) ∈ 1,K 2, with i =6                    j, a swap is proposed with probability α(i,j).

                J        K

Implement the Parallel Tempering algorithm.
Algorithm 2: Parallel Tempering

1 For all i ∈ 1,K , initialize X0(i) ;

2 for n = 1 toJ NiterK do
For all i ∈ 1,K , draw Yn(+1i) using the transition kernel P(i)(Xn(i),·) ;
Choose uniformlyJ K (i,j) ∈ 1,K 2, with i =6        j ;
                                                               J        K                                                                 (j)                         (i)

Compute the swap acceptance probability α(i,j) = min 1, πi(Yn(+1i) )πj(Yn(+1j) )         ;
πi(Yn+1)πj(Yn+1)

Draw U ∼ U([0,1]) ;
7                        if U 6 α(i,j) then
8

                                                                     Xn(i+1)   = Yn(+1j)                and          Xn(j+1) = Yn(+1i) ;

9                  else
10

                                                                     Xn(i+1)   = Yn(+1i)                and          Xn(j+1) = Yn(+1j) ;

11                end
12                      For all k ∈ 1,K , k =6        i,j, set Xn(k+1) = Yn(+1k) .

                                   J        K

13 end
In order to illustrate the performance of the algorithm, use your code to sample from the distribution π of Part A. Use the algorithm with K = 5 and with the following temperatures ladder:
(T1,...,T5) = (60, 21.6, 7.7, 2.8, 1). For the Metropolis-Hastings step (line 3), take as proposal distribution the bivariate Gaussian distribution centered at Xn(i), with variance-covariance matrix τi2I2:

∀i ∈ 1,K , Ynwhere τi = 0.25pTi . J           K

The scale parameters τi are tuned to ensure a reasonable acceptance rate in the algorithm.

In practice, the performance of the Parallel Tempering algorithm strongly depends on the choice of the temperatures ladder, the number of chains and the choice of proposal kernels. For most distributions, tuning these parameters may be infeasible. In [MMV13], the authors have proposed an adaptive Parallel Tempering algorithm to address these difficulties.

Exercise 3: Bayesian analysis of a one-way random effects model
We recall that the density of an Inverse Gamma distribution with positive parameters (a,b) is proportional to

x →7 1 exp−b1R+(x) xa+1               x

and especially that we can sample y from the inverse gamma distribution of parameters (a,b) by generating x from a gamma distribution of parameters (a, 1b) and then taking y = x1. We also recall that we can find a Matlab toolbox-free Gamma Generator in the Handbook of Monte Carlo Methods [KTB13]:

https://people.smp.uq.edu.au/DirkKroese/montecarlohandbook/probdist/. Otherwise, we can use directly scipy.stats.invgamma or the Statistics and Machine Learning Toolbox in Matlab.

Let suppose we collect the observations Y = {yi,j, i ∈ 1,N , j ∈ 1,ki } and set k := PiN=1 ki the total number of observations. Let the following random effects model:J  K              J               K

yi,j is a realization of the variable Yi,j where Yi,j = Xi + εi,j ;
The random effects X = {Xi, i ∈ 1,N } are i.i.d from a Gaussian N(µ,σ2) and independent of the errors ε = {εi,j, i ∈ 1,N , j ∈J 1,kKi } ;
                                                             J        K           J        K

The errors ε are i.i.d from the centred Gaussian N(0,τ2) ;
where (µ,σ,τ) are the unknown parameters. Bayesian analysis using this model requires specifying a prior distribution, for which we consider:

 πprior

where α,β and γ are known hyper-parameters.

Write the density of the a posteriori distribution (X,µ,σ2,τ2) — it can be given up to a normalizing constant — e the density of the distribution (Y,X,µ,σ2,τ2).
Implement a Gibbs sampler which updates in turn (σ2,τ2,µ,X) one at a time.
Implement a Block-Gibbs sampler which updates σ2, then τ2 and then the block (X,µ).
Discuss the theoretical performance of these two algorithms.
Test your code on a synthetic dataset Y = {yi,j, i ∈ 1,N , j ∈ 1,ki } generated from the previous model. J K J K
References
[Atc06]
Yves F. Atchadé. An adaptive version for the metropolis adjusted langevin algorithm with a truncated drift. Methodology and Computing in Applied Probability, 8(2):235–254, 2006.
[ED05]
David J Earl and Michael W Deem. Parallel tempering: Theory, applications, and new perspectives. Physical Chemistry Chemical Physics, 7(23):3910–3916, 2005.
[Gey91]
Charles J Geyer. Markov chain monte carlo maximum likelihood. 1991.
[KTB13]
D.P. Kroese, T. Taimre, and Z.I. Botev. Handbook of Monte Carlo Methods. Wiley Series in Probability and Statistics. Wiley, 2013.
[LW01]
Faming Liang and Wing Hung Wong. Real-parameter evolutionary monte carlo with ap-
plications to bayesian mixture models. Journal of the American Statistical Association, 96(454):653–666, 2001.

[MMV13] Błażej Miasojedow, Eric Moulines, and Matti Vihola. An adaptive parallel tempering algorithm. Journal of Computational and Graphical Statistics, 22(3):649–664, 2013.

[RR09] Gareth O. Roberts and Jeffrey S. Rosenthal. Examples of adaptive mcmc. Journal of Computational and Graphical Statistics, 18(2):349–367, 2009.

More products