Computational Stats – Lab 2: Expectation-Maximisation algorithm – Importance sampling

Exercise 1: Discrete distributions
Let $n \in \mathbb{N}^*$ and let $\mathcal{X} = \{x_1,\dots,x_n\}$ be a set of $n$ distinct real numbers. Let $(p_i)_{i \in \llbracket 1,n \rrbracket}$ be a sequence of real numbers such that

$$\forall i \in \llbracket 1,n \rrbracket, \quad p_i > 0 \qquad \text{and} \qquad \sum_{i=1}^{n} p_i = 1.$$

Explain how to generate a random variable $X$ having the discrete distribution on $\mathcal{X}$ given by $(p_i)_{i \in \llbracket 1,n \rrbracket}$:

$$\forall i \in \llbracket 1,n \rrbracket, \quad \mathbb{P}(X = x_i) = p_i.$$

Write (in Python, Julia, Matlab, Octave...) the corresponding algorithm.
Generate a sequence $(X_i)_{i \in \llbracket 1,N \rrbracket}$ of i.i.d. random variables having the same distribution as $X$ for large values of $N$. Compare the empirical distribution to the theoretical distribution of $X$. (In Python, you can use the function numpy.histogram.)
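For illustration, here is a minimal sketch (in Python) of the inverse-CDF method: draw $U$ uniform on $(0,1)$ and return $x_i$ for the smallest index $i$ such that $p_1 + \dots + p_i \ge U$. The function name and test values below are placeholders, not part of the assignment.

```python
import numpy as np

def sample_discrete(x, p, size=1, rng=None):
    """Draw samples from the discrete distribution P(X = x[i]) = p[i]
    using the inverse-CDF (cumulative sum) method."""
    rng = np.random.default_rng() if rng is None else rng
    cdf = np.cumsum(p)                               # cumulative probabilities
    u = rng.random(size)                             # uniforms on (0, 1)
    idx = np.searchsorted(cdf, u)                    # first index with cdf >= u
    idx = np.minimum(idx, len(p) - 1)                # guard against floating-point round-off
    return np.asarray(x)[idx]

# Empirical check against the theoretical distribution
x = np.array([-1.0, 0.5, 2.0, 3.0])
p = np.array([0.1, 0.4, 0.3, 0.2])
samples = sample_discrete(x, p, size=100_000)
freq = np.array([(samples == xi).mean() for xi in x])
print(np.c_[p, freq])                                # theoretical vs. empirical frequencies
```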
Exercise 2: Gaussian mixture model and the EM algorithm
A Gaussian mixture model (GMM) is useful for modelling data that come from one of several groups: the groups might be different from each other, but data points within the same group can be well modelled by a Gaussian distribution. The main issue is to estimate the parameters of the mixture, i.e. to find the most likely ones. Moreover, we aim to determine whether our sample follows a Gaussian mixture distribution or not.

Let us consider an $n$-sample. For each individual, we observe a random variable $X_i$ and assume there is an unobserved variable $Z_i$ which encodes the class of $X_i$. More formally, we consider a mixture of $m$ Gaussians: let $(\alpha_1,\dots,\alpha_m)$ be such that $\sum_{i=1}^{m} \alpha_i = 1$, and consider the following hierarchical model:

$$\forall i \in \llbracket 1,n \rrbracket,\ \forall j \in \llbracket 1,m \rrbracket, \quad \mathbb{P}_\theta(Z_i = j) = \alpha_j$$

and

$$\forall i \in \llbracket 1,n \rrbracket,\ \forall j \in \llbracket 1,m \rrbracket, \quad X_i \mid \theta, \{Z_i = j\} \sim \mathcal{N}(\mu_j, \Sigma_j).$$

Unless otherwise stated, we suppose that m is fixed.

Identify the parameters, denoted $\theta$, of the model and write down the likelihood of $\theta$ given the outcomes $(x_i)_{i \in \llbracket 1,n \rrbracket}$ of the i.i.d. $n$-sample $(X_i)_{i \in \llbracket 1,n \rrbracket}$, i.e. the p.d.f.

$$L(x_1,\dots,x_n;\theta) = \prod_{i=1}^{n} f_\theta(x_i).$$

Sample a set of observations according to a Gaussian mixture distribution, with parameters of your choice. Use the hierarchical model and the first exercise.
Implement the EM algorithm in order to estimate the parameters of this model from your observations and plot the log-likelihood against the number of iterations of the algorithm.
Are the estimated parameters far from the original ones?
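As a starting point, here is a minimal one-dimensional sketch (in Python) that samples from a Gaussian mixture through the hierarchical model and runs a basic EM. The parameter values are illustrative only, and the estimated components may come back in a permuted order.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# --- Sampling from a 1-D Gaussian mixture via the hierarchical model ---
# (illustrative parameter values; any choice works)
alpha = np.array([0.3, 0.5, 0.2])
mu    = np.array([-3.0, 0.0, 4.0])
sigma = np.array([1.0, 0.7, 1.5])

n = 5_000
z = rng.choice(len(alpha), size=n, p=alpha)   # Z_i ~ discrete(alpha), cf. Exercise 1
x = rng.normal(mu[z], sigma[z])               # X_i | Z_i = j ~ N(mu_j, sigma_j^2)

# --- Basic EM for a 1-D Gaussian mixture with m components ---
def em_gmm(x, m, n_iter=200):
    n = len(x)
    a = np.full(m, 1.0 / m)                   # mixture weights alpha_j
    mu_hat = rng.choice(x, size=m, replace=False)
    var = np.full(m, x.var())
    loglik = []
    for _ in range(n_iter):
        # E-step: responsibilities r[i, j] = P(Z_i = j | x_i, theta)
        dens = np.exp(-0.5 * (x[:, None] - mu_hat) ** 2 / var) / np.sqrt(2 * np.pi * var)
        num = a * dens
        loglik.append(np.log(num.sum(axis=1)).sum())
        r = num / num.sum(axis=1, keepdims=True)
        # M-step: update weights, means and variances
        nj = r.sum(axis=0)
        a = nj / n
        mu_hat = (r * x[:, None]).sum(axis=0) / nj
        var = (r * (x[:, None] - mu_hat) ** 2).sum(axis=0) / nj
    return a, mu_hat, var, loglik

a_hat, mu_est, var_est, loglik = em_gmm(x, m=3)
print(a_hat, mu_est, np.sqrt(var_est))        # compare with alpha, mu, sigma

plt.plot(loglik)
plt.xlabel("iteration"); plt.ylabel("log-likelihood"); plt.show()
```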
[Figure omitted: three scatter panels labelled "1 cluster", "2 clusters" and "3 clusters", with crude birth rate ("birth", 0–80) on the horizontal axis.]

Figure 1: Importance of the number of clusters – Crude Birth/Death Rate.

In practice, determining the right number of clusters is an important issue. A good criterion is to minimize the BIC – Bayesian Information Criterion. See for example [Gir15] for more information on the BIC.

$$\widehat{m} = \operatorname*{argmin}_{m \ge 1}\ \left( -\log L(x_1,\dots,x_n;\theta) + \frac{df(m)\,\log(n)}{2} \right)$$

where df is the number of degrees of freedom of the mixture model with m clusters.
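As a rough sketch of the criterion, reusing the em_gmm function and the sample x from the Exercise 2 sketch above, and assuming a one-dimensional mixture, for which $df(m) = 3m - 1$ ($m-1$ weights, $m$ means and $m$ variances):

```python
import numpy as np

def bic_criterion(loglik, df, n):
    """Criterion to minimize: -log L + df * log(n) / 2."""
    return -loglik + df * np.log(n) / 2

n_obs = len(x)                                # x: sample from the Exercise 2 sketch
for m in range(1, 6):
    _, _, _, ll = em_gmm(x, m)                # em_gmm: EM sketch from Exercise 2
    df = 3 * m - 1                            # (m - 1) weights + m means + m variances
    print(m, bic_criterion(ll[-1], df, n_obs))
```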

Application: Download the Crude Birth/Death Rate data – see esa.un.org/unpd/wpp/ for instance – and plot the associated scatter graph. What do you think about using a Gaussian mixture model?
Estimate the parameters $\theta$ for different values of $m$, try to interpret them, and compute the BIC.
Plot the corresponding p.d.f. over the scatter plot. (In Python, you can use plt.contour.)
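Below is a minimal sketch (in Python, using scipy and matplotlib) of overlaying a 2-D mixture p.d.f. on a scatter plot with plt.contour. The mixture parameters and the scatter data here are hypothetical placeholders; in practice, use the parameters estimated by EM on the downloaded birth/death data.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

# Hypothetical 2-D mixture parameters (replace with the EM estimates on the data)
alphas = [0.6, 0.4]
means  = [np.array([20.0, 8.0]), np.array([40.0, 12.0])]
covs   = [np.diag([30.0, 6.0]), np.diag([50.0, 10.0])]

# Hypothetical scatter data (replace with the Crude Birth/Death Rate observations)
rng = np.random.default_rng(2)
data = np.vstack([rng.multivariate_normal(m, c, size=int(300 * a))
                  for a, m, c in zip(alphas, means, covs)])

# Evaluate the mixture p.d.f. on a grid and overlay contours on the scatter plot
g1, g2 = np.meshgrid(np.linspace(data[:, 0].min(), data[:, 0].max(), 200),
                     np.linspace(data[:, 1].min(), data[:, 1].max(), 200))
grid = np.dstack([g1, g2])
pdf = sum(a * multivariate_normal(m, c).pdf(grid) for a, m, c in zip(alphas, means, covs))

plt.scatter(data[:, 0], data[:, 1], s=5)
plt.contour(g1, g2, pdf)
plt.xlabel("birth rate"); plt.ylabel("death rate"); plt.show()
```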

Exercise 3: Importance sampling
Let $p$ be a density on $\mathbb{R}^d$, $d \in \mathbb{N}^*$. Importance Sampling aims at evaluating

$$\mathbb{E}_p[g(X)] = \int g(x)\, p(x)\, dx.$$

Classical Monte Carlo integration requires generating i.i.d. random variables $(X_1,\dots,X_n)$ from $p$ in order to approximate $\mathbb{E}_p[g(X)]$ by $\frac{1}{n}\sum_{i=1}^{n} g(X_i)$. Sampling from distributions other than the original distribution $p$ can reduce the variance of the estimator and the number of samples needed.

Importance sampling is based on the following fundamental equality

$$\mathbb{E}_p[g(X)] = \int g(x)\, p(x)\, dx = \int g(x)\, \frac{p(x)}{q(x)}\, q(x)\, dx = \mathbb{E}_q\!\left[ g(X)\, \frac{p(X)}{q(X)} \right],$$

which holds for any density $q$ such that $\operatorname{Supp}(g \times p) \subset \operatorname{Supp}(q)$. The density $q$ is called the importance density. If $(X_1,\dots,X_n)$ is a sample from $q$, $\mathbb{E}_p[g(X)]$ can therefore be approximated by

$$\frac{1}{n} \sum_{i=1}^{n} \frac{p(X_i)}{q(X_i)}\, g(X_i) = \frac{1}{n} \sum_{i=1}^{n} \omega_i\, g(X_i) \qquad \text{with} \qquad \omega_i = \frac{p(X_i)}{q(X_i)}.$$
The $(\omega_i)_i$ are called importance weights. In Bayesian inference, the density $p$ might be known only up to a normalizing constant. In this case, $\mathbb{E}_p[g(X)]$ can be approximated by

$$\sum_{i=1}^{n} \tilde{\omega}_i\, g(X_i) \qquad \text{where} \qquad \tilde{\omega}_i = \frac{\omega_i}{\sum_{j=1}^{n} \omega_j}.$$

The $(\tilde{\omega}_i)_i$ are called normalized importance weights and do not depend on the normalizing constant of $p$.
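As an illustration, here is a minimal generic sketch (in Python) of the self-normalized importance sampling estimator. The function names are placeholders; the target $p$ only needs to be known up to a constant when the normalized weights are used.

```python
import numpy as np

def importance_sampling(g, log_p, sample_q, log_q, n, rng=None, normalized=True):
    """Estimate E_p[g(X)] with importance density q.

    log_p : log of the (possibly unnormalized) target density p
    sample_q, log_q : sampler and log-density of the importance density q
    """
    rng = np.random.default_rng() if rng is None else rng
    x = sample_q(n, rng)                              # X_1, ..., X_n ~ q
    log_w = log_p(x) - log_q(x)                       # log importance weights
    w = np.exp(log_w - log_w.max())                   # stabilized (rescaled) weights
    if normalized:
        w_tilde = w / w.sum()                         # normalized importance weights
        return np.sum(w_tilde * g(x))
    # unnormalized version: requires p and q to be properly normalized densities
    return np.mean(np.exp(log_w) * g(x))

# Toy check: E_p[X^2] with p = N(0, 1), q = N(0, 2^2); exact value is 1.
est = importance_sampling(
    g=lambda x: x ** 2,
    log_p=lambda x: -0.5 * x ** 2,                        # N(0, 1) up to a constant
    sample_q=lambda n, rng: rng.normal(0.0, 2.0, n),
    log_q=lambda x: -0.5 * (x / 2.0) ** 2 - np.log(2.0),  # N(0, 4) up to a constant
    n=100_000,
)
print(est)
```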

3.A – Poor Importance Sampling
The performance of Importance Sampling depends on the choice of the importance density (or importance function). The "best" importance density $q^*$ is chosen among a parametric family of densities $\mathcal{Q}$. Given a density $q$ on $\mathbb{R}^d$, the approximation is measured in terms of the Kullback-Leibler divergence $K(p \| q)$ given by

$$K(p \| q) = \int \log\!\left( \frac{p(x)}{q(x)} \right) p(x)\, dx,$$

therefore

$$q^* = \operatorname*{argmin}_{q \in \mathcal{Q}} K(p \| q). \tag{$\star$}$$

The parametric family $\mathcal{Q}$ of distributions on $\mathbb{R}^d$ should be chosen large enough to allow a close match with $p$, and such that the optimization problem ($\star$) is feasible. Before studying this optimisation problem, we will illustrate the importance of carefully choosing the distribution $q$ and explore the effects of selecting a poor distribution to cover $p$. We proceed as in [Cev08].

In this section, we will implement importance sampling in order to calculate the expectation of a function $f$ defined by

$$f(x) = 2\sin(x)\, \mathbb{1}_{\mathbb{R}_+}(x)$$

where $x$ is distributed according to a distribution similar to a $\chi$ distribution. We will use a scaled normal distribution $\mathcal{N}(0.8, 1.5)$ as our sampling distribution, where the parameters are chosen so that $p(x) < k\, q(x)$ for all $x \in \mathbb{R}_+$, for some $k \in \mathbb{R}_+$. Let us consider

$$p(x) = x^{1.65 - 1}\, e^{-x^2/2}\, \mathbb{1}_{\mathbb{R}_+}(x) \qquad \text{and} \qquad q(x) = \frac{1}{\sqrt{2\pi(1.5)}}\, e^{-\frac{(x - 0.8)^2}{2(1.5)}}.$$

Note that neither $p$ nor $q$ is a proper distribution here without normalization.

[Figure omitted: plots of $p$ and $q$ for $\mu = 0.8$ (left, $x \in [0,5]$) and $\mu = 6$ (right, $x \in [0,10]$).]

Implement a simple importance sampling scheme for the previous functions.
Be careful when sampling from $q$, which is supported on $\mathbb{R}$, to discard any samples $x < 0$, since $p$ is supported only on $x > 0$.

Compare the estimate and the importance weights for several sample sizes, e.g. $N = 10$, $100$, $10^3$, $10^4$.
Shift the mean of $q$ to $\mu = 6$, so that the centres of mass of the two distributions are far apart, and repeat the experiment.
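A minimal sketch (in Python) under the definitions above follows. It reads $1.5$ as the variance of $q$ (an assumption based on the reconstructed formula), uses self-normalized weights since $p$ is unnormalized, and discards the negative samples as instructed.

```python
import numpy as np

rng = np.random.default_rng(1)

f = lambda x: 2 * np.sin(x) * (x > 0)                      # integrand
p = lambda x: x ** 0.65 * np.exp(-x ** 2 / 2) * (x > 0)    # unnormalized chi-like target

def is_estimate(mu, sigma2, N):
    """Self-normalized importance sampling with q = N(mu, sigma2)."""
    x = rng.normal(mu, np.sqrt(sigma2), N)
    x = x[x > 0]                                           # discard samples outside Supp(p)
    q = np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    w = p(x) / q                                           # importance weights
    return np.sum(w * f(x)) / np.sum(w), w

for N in (10, 100, 1_000, 10_000):
    est, w = is_estimate(mu=0.8, sigma2=1.5, N=N)
    print(N, est, w.max() / w.sum())                       # estimate and largest normalized weight

# Poorly matched importance density: shift the mean to mu = 6
est, w = is_estimate(mu=6.0, sigma2=1.5, N=10_000)
print("mu = 6:", est, w.max() / w.sum())
```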
3.B – Adaptive Importance Sampling
In the following, we choose $\mathcal{Q}$ to be the family of mixtures of $M$ Gaussian distributions on $\mathbb{R}^d$. An element $q \in \mathcal{Q}$ is of the form

$$q(x) = \sum_{i=1}^{M} \alpha_i\, \varphi(x; \mu_i, \Sigma_i)$$

where, for all $i$, $\alpha_i > 0$, $\sum_{i=1}^{M} \alpha_i = 1$, and $(\mu_i, \Sigma_i)$ are the mean and covariance parameters which parametrize the $i$-th Gaussian component of $q$. Because $\mathcal{Q}$ is a parametric family of distributions, the optimization problem ($\star$) can be rewritten as:

$$\text{Find} \quad \theta^* = \operatorname*{argmax}_{\theta = (\alpha_i, \mu_i, \Sigma_i)_{1 \le i \le M}} \int \log\!\left( \sum_{i=1}^{M} \alpha_i\, \varphi(x; \mu_i, \Sigma_i) \right) p(x)\, dx. \tag{$\star\star$}$$

The solution to ($\star\star$) cannot always be obtained in closed form because of the density $p$, which makes exact computation impossible. Population Monte Carlo is an algorithm which aims at approximating this solution $q_{\theta^*}$.

Explain how the EM algorithm can be used to maximize the empirical criterion in step (iii) of the Population Monte Carlo algorithm below.
Derive the parameter updates (one possible form is sketched after the algorithm).

Population Monte Carlo:

The Population Monte Carlo algorithm iterates between the following steps:

(i) Choose mixture parameters $(\alpha^{(0)}, \mu^{(0)}, \Sigma^{(0)})$. This choice of parameters defines an importance density $q^{(0)}$ as follows:

$$\forall x \in \mathbb{R}^d, \quad q^{(0)}(x) = \sum_{i=1}^{M} \alpha_i^{(0)}\, \varphi\!\left(x;\, \mu_i^{(0)}, \Sigma_i^{(0)}\right).$$

(ii) This importance density is used to compute an Importance Sampling estimate of the quantity of interest. Let $(X_1,\dots,X_n)$ be i.i.d. random variables generated from $q^{(0)}$. The exact criterion in ($\star\star$) is approximated using normalized importance weights:

$$\sum_{i=1}^{n} \tilde{\omega}_i \log\!\left( \sum_{j=1}^{M} \alpha_j\, \varphi(X_i; \mu_j, \Sigma_j) \right).$$

(iii) New parameters $(\alpha^{(1)}, \mu^{(1)}, \Sigma^{(1)})$ are obtained by maximizing

$$\sum_{i=1}^{n} \tilde{\omega}_i \log\!\left( \sum_{j=1}^{M} \alpha_j\, \varphi(X_i; \mu_j, \Sigma_j) \right)$$

with respect to $\alpha$, $\mu$ and $\Sigma$. The new parameters define a density $q^{(1)}$.

(iv) We iterate steps (i) to (iii) until convergence.
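For step (iii), one possible form of the weighted EM updates is sketched below (a sketch only, assuming the normalized weights sum to one; these are the standard updates for a weighted Gaussian-mixture log-likelihood and should be re-derived as asked above). E-step (responsibilities under the current parameters):

$$\rho_{ij} = \frac{\alpha_j\, \varphi(X_i; \mu_j, \Sigma_j)}{\sum_{k=1}^{M} \alpha_k\, \varphi(X_i; \mu_k, \Sigma_k)}.$$

M-step (weighted updates):

$$\alpha_j^{\text{new}} = \sum_{i=1}^{n} \tilde{\omega}_i\, \rho_{ij}, \qquad \mu_j^{\text{new}} = \frac{\sum_{i=1}^{n} \tilde{\omega}_i\, \rho_{ij}\, X_i}{\sum_{i=1}^{n} \tilde{\omega}_i\, \rho_{ij}}, \qquad \Sigma_j^{\text{new}} = \frac{\sum_{i=1}^{n} \tilde{\omega}_i\, \rho_{ij}\, (X_i - \mu_j^{\text{new}})(X_i - \mu_j^{\text{new}})^{\top}}{\sum_{i=1}^{n} \tilde{\omega}_i\, \rho_{ij}}.$$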

3.C – Application to a "banana"-shaped density
The target density is based on a Gaussian distribution on $\mathbb{R}^d$ with mean $0$ and covariance matrix $\Sigma = \operatorname{diag}(\sigma_1^2, 1, \dots, 1)$. This density defined on $\mathbb{R}^d$ is twisted by changing the second coordinate $x_2$ to $x_2 + b(x_1^2 - \sigma_1^2)$. If $\Phi(\cdot\,; \mu, \Sigma)$ denotes the density function of the $d$-dimensional Gaussian with mean $\mu$ and covariance $\Sigma$, we have:

$$\forall x = (x_1,\dots,x_d) \in \mathbb{R}^d, \quad p(x) = \Phi\!\left(x_1,\ x_2 + b(x_1^2 - \sigma_1^2),\ x_3,\ \dots,\ x_d;\ 0, \Sigma\right).$$

If we choose $d = 10$, $\sigma_1^2 = 100$ and $b = 0.03$, $p$ results in a banana-shaped density in the first two dimensions.

Using Adaptive Importance Sampling, write an algorithm to explore the density $p$. You may display the results for the banana-shaped density in the first two coordinates.
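As an illustration, here is a minimal sketch (in Python) of the banana-shaped log-density for $d = 10$, $\sigma_1^2 = 100$ and $b = 0.03$, together with a contour plot in the first two coordinates (the remaining coordinates are set to $0$ for display; the plotting ranges are arbitrary choices).

```python
import numpy as np
import matplotlib.pyplot as plt

d, sigma1_sq, b = 10, 100.0, 0.03
cov_diag = np.array([sigma1_sq] + [1.0] * (d - 1))      # Sigma = diag(sigma1^2, 1, ..., 1)

def log_banana(x):
    """Log of the banana-shaped density: Gaussian log-density at the twisted point."""
    x = np.atleast_2d(x)
    y = x.copy()
    y[:, 1] = x[:, 1] + b * (x[:, 0] ** 2 - sigma1_sq)  # twist the second coordinate
    return -0.5 * np.sum(y ** 2 / cov_diag, axis=1) - 0.5 * np.sum(np.log(2 * np.pi * cov_diag))

# Display the density in the first two coordinates (other coordinates set to 0)
x1, x2 = np.meshgrid(np.linspace(-25, 25, 200), np.linspace(-40, 15, 200))
grid = np.zeros((x1.size, d))
grid[:, 0], grid[:, 1] = x1.ravel(), x2.ravel()
plt.contour(x1, x2, np.exp(log_banana(grid)).reshape(x1.shape))
plt.xlabel("x1"); plt.ylabel("x2"); plt.show()
```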
References

[Bie09] Christophe Biernacki. Pourquoi les modèles de mélange pour la classification ? La revue Modulad, 40, 2009.

[Cev08] Volkan Cevher. Importance sampling. Lecture note, Rice University, 2008.

[Gir15] Christophe Giraud. Introduction to High-Dimensional Statistics. Chapman and Hall/CRC, 2015.
