DT2119 Lab 2: Hidden Markov Models with Gaussian Emissions

1           Objective
The objective is to implement the algorithms for evaluation and decoding of Hidden Markov Models (HMMs) combined with Gaussian emission probability distributions. The lab is designed in Python, but the same functions can be obtained in Matlab/Octave or using the Hidden Markov Toolkit (HTK).

2           Task
The overall task is to implement and test methods for isolated word recognition:

•    combine phonetic HMMs into word HMMs using a lexicon

•    implement the forward-backward algorithm,

•    use it to compute the log likelihood of spoken utterances given a Gaussian HMM

•    perform isolated word recognition

•    implement the Viterbi algorithm, and use it to compute Viterbi path and likelihood

•    compare and comment on the Viterbi and forward likelihoods

•    implement the Baum-Welch algorithm to update the parameters of the emission probability distributions

In order to pass the lab, you will need to follow the steps described in this document, and present your results to a teaching assistant in the designated time slots.

3           Data and model set
The speech data used in this lab is similar but not the same as in Lab 1. You can load the array containing speech utterances with the commands:

>>> import numpy as np

>>> data = np.load('lab2_data.npz')['data']

The data also contains MFCC features (lmfcc key), but you are welcome to test how the algorithms perform on MFCCs computed with your own code from Lab 1. Refer to the instructions for Lab 1 for more information about the data structures.

Additionally, the lab2_models.npz file contains the parameters of the models you are going to use to test your functions and the lab2_example.npz file contains an example that can be used for debugging.

3.1         The phonetic models

Load the model file with:

>>> phoneHMMs = np.load('lab2_models.npz')['phoneHMMs'].item()

phoneHMMs is a dictionary with 21 keys, each corresponding to a phonetic model. You can list the model names with:

>>> list(sorted(phoneHMMs.keys()))

['ah', 'ao', 'ay', 'eh', 'ey', 'f', 'ih', 'iy', 'k', 'n', 'ow', 'r', 's',
 'sil', 'sp', 't', 'th', 'uw', 'v', 'w', 'z']

Special cases are sil and sp, which model silence and short pauses. For this exercise you can ignore the sp model; it will become important only in Lab 3. Note that the list is a subset of the phonemes used in the English language, limited to those needed to pronounce the 11 digits.

Each model is an HMM with a single Gaussian emission probability distribution per state and diagonal covariance matrices stored as vectors. The models were trained on the training data of the TIDIGITS database, using the 13 MFCC feature vectors computed as in Lab 1. Each model contains the following keys:

>>> phoneHMMs['ah'].keys()

['name', 'startprob', 'transmat', 'means', 'covars']

key         symbol                           description
name                                         phonetic symbol, sil or sp
startprob   πi = P(z0 = si)                  probability to start in state i
transmat    aij = P(zn = sj | zn−1 = si)     transition probability from state i to state j
means       µid                              array of mean vectors (rows correspond to different states)
covars      σ²id                             array of variance vectors (rows correspond to different states)

If you ignore the sp model, all models have three emitting states. Consequently, the means and covars arrays will be both 3 × 13 in size. Note, however, that both the startprob and transmat arrays have sizes corresponding to four states (startprob is length 4 and transmat is 4 × 4). The reason for this is that we will concatenate these models to form word models, and we therefore want to be able to express the probability to stay (a22) or leave (a23) the last (third) emitting state in each phonetic model. This is illustrated in the following figure and will be clearer in Section 3.2.

[Figure: a phonetic HMM with three emitting states s0, s1, s2 (self-transition probabilities a00, a11, a22) followed by a final non-emitting state s3.]

Note that the last state s3 is non-emitting, meaning that it does not have any output probability distribution associated with it.

3.2             Pronunciation dictionary and word models

The mapping between digits and phonetic models can be obtained with the pronunciation dictionary that you can find in prondict.py, and that looks like this:

prondict = {}
prondict['o'] = ['ow']
prondict['z'] = ['z', 'iy', 'r', 'ow']
prondict['1'] = ['w', 'ah', 'n']
...

Because we are working with recordings of isolated digits, a model of each utterance should also contain initial and final silence:

>>> modellist = {}
>>> for digit in prondict.keys():
...     modellist[digit] = ['sil'] + prondict[digit] + ['sil']

Write the concatHMMs function from proto2.py that, given the set of HMM models and a list of model names, will return a combined model created by the concatenation of the corresponding models. For example:

>>> wordHMMs['o'] = concatHMMs(phoneHMMs, modellist['o'])

Remember that each model in phoneHMMs has an extra state to simplify the concatenation. An illustration of the process in the case of three models is in the following figure:

 

The following figure tries to explain how the transition matrices are combined to form the resulting transition matrix.

 

To be precise, we should remove the last row from the transition matrices of the red and blue models. However, in each original transition matrix the last row is [0.0, 0.0, 0.0, 1.0]. If we copy the values into the new transition matrix in the natural order (red, blue, green), the first element of each copied matrix will overwrite the last element of the previous one, and we obtain the same effect. The last row of the combined transition matrix will also be all zeros except for one element (a99 in this case). A sketch of this copy-with-overlap concatenation is given below.
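Here is a minimal sketch of how concatHMMs could implement this copy-with-overlap idea, assuming each model only carries the name, startprob, transmat, means and covars fields described in Section 3.1. It is not the reference implementation from proto2.py, just one possible layout:

import numpy as np

def concatHMMs(hmmmodels, namelist):
    # number of emitting states in each model and in the combined model
    nemitting = [hmmmodels[name]['means'].shape[0] for name in namelist]
    M = sum(nemitting)

    concat = {}
    concat['name'] = ' '.join(namelist)
    concat['means'] = np.vstack([hmmmodels[name]['means'] for name in namelist])
    concat['covars'] = np.vstack([hmmmodels[name]['covars'] for name in namelist])

    # we can only start in the first model
    concat['startprob'] = np.zeros(M + 1)
    first = hmmmodels[namelist[0]]['startprob']
    concat['startprob'][:len(first)] = first

    # copy each (n+1)x(n+1) transition matrix along the diagonal; the first
    # element of each copied matrix overwrites the last element of the
    # previous one, which is harmless because that last row is [0, 0, 0, 1]
    concat['transmat'] = np.zeros((M + 1, M + 1))
    offset = 0
    for name, n in zip(namelist, nemitting):
        concat['transmat'][offset:offset + n + 1, offset:offset + n + 1] = \
            hmmmodels[name]['transmat']
        offset += n
    return concat

With wordHMMs['o'] = concatHMMs(phoneHMMs, modellist['o']), the combined model has 9 emitting states plus one non-emitting state, so startprob has length 10 and transmat is 10 × 10.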

3.3        The example

Load the example file with:

>>> example = np.load('lab2_example.npz')['example'].item()

This is a dictionary containing all the fields as in the data array, plus the following additional fields:

>>> list(example.keys())

[..., 'obsloglik', 'logalpha', 'logbeta', 'loglik', 'vloglik', 'loggamma', 'logxi']

Here is a description of each field. You will see how to use this information in the remainder of these instructions. All the probabilities described below were obtained using the HMM model wordHMMs['o'] (that is, the model for the digit 'o') on the sequence of MFCC vectors in example['lmfcc']:
 
 
key        idxs    symbol               description
obsloglik  [i,j]   logφj(xi)            observation log likelihood for each Gaussian in wordHMMs['o'], shape: (n_timesteps, n_states)
logalpha   [i,j]   logαi(j)             alpha log probabilities, see definition later, shape: (n_timesteps, n_states)
logbeta    [i,j]   logβi(j)             beta log probabilities, see definition later, shape: (n_timesteps, n_states)
loglik     -       logP(X|θHMM)         log likelihood of the observation sequence X given the HMM model, scalar
vloglik    -       logP(X,Sopt|θHMM)    Viterbi log likelihood of the observation sequence X and the best path given the HMM model, scalar
loggamma   [i,j]   logγi(j)             gamma log probabilities, see definition later, shape: (n_timesteps, n_states)

Figure 1 shows some of the relevant fields in example.

 

Figure 1. Step-by-step probability calculations for the utterance in the example. The utterance contains the digit “oh” spoken by a female speaker. The model parameters are obtained by concatenating the HMM models for sil, ow and sil.

4           HMM Likelihood and Recognition
4.1           Gaussian emission probabilities

The function log_multivariate_normal_density(..., 'diag') from sklearn.mixture can be used to compute

obsloglik[i,j] = logφj(xi) = logN(xi; µj, Σj) = logP(xi|µj, Σj),

that is, the log likelihood of each observation xi under each multivariate Gaussian density with mean vector µj and diagonal covariance matrix Σj. In the case of Gaussian HMMs, φj corresponds to the emission probability model for a single state j.

Verify that you get the same results as in example['obsloglik'] when you apply the log_multivariate_normal_density function to the example['lmfcc'] data, using the Gaussian distributions defined by the wordHMMs['o'] model that you created with your concatHMMs function. Note that in this case we are not using the time dependency structure of the HMM, but only the Gaussian distributions.
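Note that log_multivariate_normal_density has been removed from recent versions of scikit-learn. If it is not available, the lattice can be computed directly from the diagonal-covariance Gaussian log density; the sketch below is one possible equivalent (the helper name log_gaussian_lattice is made up for illustration):

import numpy as np

def log_gaussian_lattice(X, means, covars):
    # X: (n_timesteps, D) feature vectors; means, covars: (n_states, D)
    # log N(x; mu, diag(sigma^2)) written out term by term
    D = X.shape[1]
    return -0.5 * (D * np.log(2 * np.pi)
                   + np.sum(np.log(covars), axis=1)             # log determinant
                   + np.sum((X[:, None, :] - means[None, :, :]) ** 2
                            / covars[None, :, :], axis=2))      # Mahalanobis term

# usage, assuming wordHMMs['o'] was built with concatHMMs as above:
# obsloglik = log_gaussian_lattice(example['lmfcc'],
#                                  wordHMMs['o']['means'], wordHMMs['o']['covars'])
# print(np.allclose(obsloglik, example['obsloglik']))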

Plot the observation log likelihoods of the Gaussians from the HMM model corresponding to the same digit as a test utterance of your choice. What can you say about the order of the Gaussians, and about the time evolution of the likelihood along the utterance? Remember that each utterance starts and ends with silence.

4.2         Forward Algorithm

Write the function forward following the prototype in proto2.py. The function should take as input a lattice of emission probabilities as in the previous section, that is an array containing

obsloglik[i,j] = logφj(xi),

and the model parameters aij and πi. When you compute the log of aij and πi, some of the values will become negative infinity (log(0)). This should not be a problem, because all the formulas remain consistent[1]. The output is the array

logalpha[i,j] = logαi(j),

where i corresponds to time steps and j corresponds to states in the model.

Remember that the forward probability is defined as:

αn(j) = P(x0,...,xn, zn = sj | θ),          (1)

where θ = {Π, A, Φ} are the model parameters. There is a recursion formula (see Appendix A) to obtain αn(j) from αn−1(i) over all the previous states. In the recursion formulae for the forward (and backward) probabilities there is an expression involving the log of a sum of exponentials (log Σ exp(·)). Make use of the logsumexp function from tools2.py to calculate those cases.
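A minimal sketch of such a forward function is shown below. It assumes that the non-emitting state has been stripped from startprob and transmat before taking the log (e.g. startprob[:-1] and transmat[:-1, :-1]), and that logsumexp from tools2.py accepts a one-dimensional array; the exact argument list of the prototype in proto2.py may differ:

import numpy as np
from tools2 import logsumexp   # helper provided with the lab

def forward(log_emlik, log_startprob, log_transmat):
    # log_emlik: (N, M) observation log likelihoods for the M emitting states
    # log_startprob: (M,) log initial probabilities (non-emitting state removed)
    # log_transmat: (M, M) log transition probabilities (non-emitting state removed)
    N, M = log_emlik.shape
    logalpha = np.zeros((N, M))
    logalpha[0] = log_startprob + log_emlik[0]            # initialization
    for n in range(1, N):
        for j in range(M):
            # log-sum-exp over all previous states, then add the emission term
            logalpha[n, j] = logsumexp(logalpha[n - 1] + log_transmat[:, j]) \
                             + log_emlik[n, j]
    return logalpha

For wordHMMs['o'], the inputs would be the obsloglik lattice from Section 4.1 together with np.log(wordHMMs['o']['startprob'][:-1]) and np.log(wordHMMs['o']['transmat'][:-1, :-1]).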

Apply your function to the example['obsloglik'] lattice using the model parameters in your wordHMMs['o'] model and verify that you obtain the same logα array as in example['logalpha'].

Remembering the definition of α in Eqn. 1, derive the formula to compute the likelihood P(X|θ) of the whole sequence X = {x0,...,xN−1}, given the model parameters θ in terms of αn(j).

Hint: if the events Bj,j ∈ [0,M) form a disjoint partition of the sure event, that is:

                 P(Bi,Bj)        =     0,          ∀i,j, i 6= j, and

M−1

X

                       P(Bj)      =     1,

j=0

then it is true that                   , that is, we can marginalize out the variable

Bj.
Convert the formula you have derived into log domain. Verify that the log likelihood obtained this way using the model parameters in wordHMMs['o'] and the observation sequence in example['lmfcc'] is the same as example['loglik'].
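In code, the verification might look like this, assuming logalpha was computed with the forward sketch above for wordHMMs['o'] on example['lmfcc']:

# marginalize over the states at the last time step, in log domain
loglik = logsumexp(logalpha[-1])
print(np.isclose(loglik, example['loglik']))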

Using your formula, score all the 44 utterances in the data array with each of the 11 HMM models in wordHMMs. Do you see any mistakes if you take the maximum likelihood model as winner?
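A possible scoring loop is sketched below; it reuses the helper sketches from the previous examples, and the name of the label field used for checking the winner is an assumption (see the Lab 1 data description for the actual key):

import numpy as np

keys = sorted(wordHMMs.keys())
scores = np.zeros((len(data), len(keys)))
for u, utt in enumerate(data):
    for m, key in enumerate(keys):
        hmm = wordHMMs[key]
        obsloglik = log_gaussian_lattice(utt['lmfcc'], hmm['means'], hmm['covars'])
        logalpha = forward(obsloglik,
                           np.log(hmm['startprob'][:-1]),
                           np.log(hmm['transmat'][:-1, :-1]))
        scores[u, m] = logsumexp(logalpha[-1])

winners = [keys[m] for m in np.argmax(scores, axis=1)]
# compare winners against the digit label stored with each utterance
# (e.g. utt['digit'], if that is the key used in the Lab 1 data structure)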

4.3          Viterbi Approximation

Here you will compute the log likelihood of the observation X given a HMM model and the best sequence of states. This is called the Viterbi approximation. Implement the function viterbi in proto2.py.

In order to recover the best path, you will also need an array storing the best previous path for each time step and state (this needs to be defined only for n ∈ [1,N), that is not for the first time step):

Bn(j) = arg maxi ( logVn−1(i) + log aij ),

where Vn(j) is the Viterbi approximation defined in Appendix A.3.
Consult the course book [1], Section 8.2.3, to see the details of the implementation (note that the book uses indices from 1, instead of 0).
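A minimal sketch of the Viterbi recursion with backtracking, under the same input conventions as the forward sketch above (emitting states only; the argument list of the proto2.py prototype may differ), could look like this:

import numpy as np

def viterbi(log_emlik, log_startprob, log_transmat):
    # same input conventions as the forward sketch above
    N, M = log_emlik.shape
    logV = np.zeros((N, M))
    B = np.zeros((N, M), dtype=int)          # backpointers, only used for n >= 1
    logV[0] = log_startprob + log_emlik[0]
    for n in range(1, N):
        for j in range(M):
            cand = logV[n - 1] + log_transmat[:, j]
            B[n, j] = np.argmax(cand)                     # best previous state
            logV[n, j] = np.max(cand) + log_emlik[n, j]
    # backtrack the best path from the best final state
    path = np.zeros(N, dtype=int)
    path[-1] = np.argmax(logV[-1])
    for n in range(N - 2, -1, -1):
        path[n] = B[n + 1, path[n + 1]]
    return np.max(logV[-1]), path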

Compute the Viterbi log likelihood for wordHMMs['o'] and example['lmfcc'], and verify that you obtain the same result as in example['vloglik'].

Plot the alpha array that you obtained in the previous Section and overlay the best path obtained by Viterbi decoding. Can you explain the reason why the path looks this way?

Using the Viterbi algorithm, score all the 44 utterances in the data with each of the 11 HMM models in wordHMMs. How many mistakes can you count if you take the maximum likelihood model as winner? Compare these results with the results obtained in the previous section.

4.4          Backward Algorithm

Write the function backward following the prototype in proto2.py. Like the function forward, it should take as input a lattice of emission probabilities, that is an array containing

obsloglik[i,j] = logφj(xi).

The output is the array

logbeta[i,j] = logβi(j),

where i corresponds to time steps and j corresponds to states in the model. See Appendix A for the recursion formulae.

In all the cases where there is an expression involving the log of a sum of exponentials (log Σ exp(·)), make use of the logsumexp function in tools2.py.
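A minimal sketch of the backward recursion, mirroring the forward sketch and using the same input conventions (emitting states only, logsumexp over a one-dimensional array; the extra log_startprob argument is kept only for a uniform interface), could look like this:

import numpy as np
from tools2 import logsumexp   # helper provided with the lab

def backward(log_emlik, log_startprob, log_transmat):
    # log_emlik: (N, M), log_transmat: (M, M); log_startprob is unused here
    N, M = log_emlik.shape
    logbeta = np.zeros((N, M))               # initialization: log beta_{N-1}(i) = 0
    for n in range(N - 2, -1, -1):           # fill backwards in time
        for i in range(M):
            logbeta[n, i] = logsumexp(log_transmat[i, :]
                                      + log_emlik[n + 1]
                                      + logbeta[n + 1])
    return logbeta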

Apply your function to the example['lmfcc'] utterance using the model parameters in wordHMMs['o'] and verify that you obtain the same logβ array as in example['logbeta'].

The definition of βn(i) in terms of probabilities of events is given below (where θ = {Π, A, Φ} are the model parameters):

βn(i) = P(xn+1,...,xN−1 | zn = si, θ)

Optional: Derive the formula that computes P(X|θ) using the betas βn(i) instead of the alphas.

Hint 1: only the β0(i) are needed for the calculation.

Hint 2: note that the definition of αn(j) is a joint probability of observations and state at time step n whereas βn(i) is a conditional probability of the observations given the state at time step n.

Hint 3: the calculation will not only involve the betas, but also some of the observation likelihoods φj(xn) and some of the model parameters, πi or aij.

Verify that this method also gives you the same log likelihood, and that both are equal to example['loglik'].
5           HMM Retraining (emission probability distributions)
5.1          State posterior probabilities

Implement the statePosteriors function in proto2.py that calculates the state posteriors γn(i) = P(zn = si|X,θ) given the observation sequence, also called gamma probabilities. See Appendix A for the formulas in log domain. Calculate the state posteriors for the utterance in the example. Verify that for each time step the state posteriors sum to one (in linear domain). Now sum the posteriors (in linear domain) for each state along the time axis. What is the meaning of the values you obtain? What about summing over both states and time steps? Compare this number to the length of the observation sequence.
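A minimal sketch of statePosteriors and of the suggested sanity checks, assuming logalpha and logbeta arrays computed as above (the proto2.py prototype may pass different arguments), could look like this:

import numpy as np
from tools2 import logsumexp   # helper provided with the lab

def statePosteriors(log_alpha, log_beta):
    # gamma in log domain: alpha + beta minus the sequence log likelihood
    loglik = logsumexp(log_alpha[-1])        # log P(X | theta)
    return log_alpha + log_beta - loglik

# sanity checks in linear domain:
# gamma = np.exp(statePosteriors(logalpha, logbeta))
# print(gamma.sum(axis=1))     # ~1 for every time step
# print(gamma.sum(axis=0))     # expected number of frames spent in each state
# print(gamma.sum())           # equals the number of time steps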

5.2               Retraining the emission probability distributions

Write the function updateMeanAndVar in proto2.py that, given a sequence of feature vectors and the array of state posterior probabilities, estimates the new mean and variance vectors for each state. Note that this function has an input argument called varianceFloor with default value 5.0. The reason is that, as we maximise the likelihood, the Gaussian distributions tend to become narrower (variances tending to zero), especially if a Gaussian component is associated with very few data points. To prevent this, after the update you should substitute any variance value that falls below varianceFloor with the value of varianceFloor. In theory the variance floor should be different for each element in the feature vector; in this exercise, for simplicity, we use a single variance floor.
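A minimal sketch of such an update, assuming the gammas are passed in log domain with shape (n_timesteps, n_states) (if your statePosteriors returns linear-domain gammas, drop the exponentiation), could look like this:

import numpy as np

def updateMeanAndVar(X, log_gamma, varianceFloor=5.0):
    # X: (N, D) feature vectors, log_gamma: (N, M) state posteriors in log domain
    gamma = np.exp(log_gamma)                        # (N, M)
    occupation = gamma.sum(axis=0)[:, None]          # (M, 1) expected state counts
    means = gamma.T @ X / occupation                 # (M, D) weighted means
    covars = gamma.T @ (X ** 2) / occupation - means ** 2   # weighted variances
    covars[covars < varianceFloor] = varianceFloor   # apply the variance floor
    return means, covars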

Consider the utterance in data[10], containing the digit “four” spoken by a male speaker. Consider also the model in wordHMMs['4'], with the model parameters for the same digit trained on utterances from all the training speakers. First of all, estimate the log likelihood of the data given the current model (logP(X|θ)), where the model parameters are, as usual:

•    πi, the a priori probability of state si,

•    aij, the transition probability from state si to state sj,

•    µik, the Gaussian mean parameter for state si and feature component k,

•    σ²ik, the Gaussian variance parameter for state si and feature component k.

Keeping πi and aij constant, repeat the following steps until convergence (that is, until the increase in log likelihood falls below a threshold):

1.    Expectation: compute the alpha, beta and gamma probabilities for the utterance, given the current model parameters (using your functions forward(), backward() and statePosteriors()).

2.    Maximization: update the means µik and variances σ²ik given the sequence of feature vectors and the gamma probabilities (using updateMeanAndVar()).

3.    Estimate the likelihood of the data given the new model; if the likelihood has increased, go back to 1.

You can use a maximum number of iterations, for example 20, and a threshold of 1.0 on the increase in log likelihood as stopping criteria. A sketch of this loop is given below.
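A hedged sketch of the whole retraining loop, reusing the helper sketches from the previous sections (log_gaussian_lattice, forward, backward, statePosteriors, updateMeanAndVar), could look like this:

import copy
import numpy as np
from tools2 import logsumexp   # provided with the lab

# work on a copy so that the original model parameters are preserved
hmm = copy.deepcopy(wordHMMs['4'])
X = data[10]['lmfcc']                                   # "four", male speaker
log_startprob = np.log(hmm['startprob'][:-1])
log_transmat = np.log(hmm['transmat'][:-1, :-1])

prev_loglik = -np.inf
for it in range(20):                                    # max 20 iterations
    # Expectation: alpha, beta and gamma for the current parameters
    obsloglik = log_gaussian_lattice(X, hmm['means'], hmm['covars'])
    logalpha = forward(obsloglik, log_startprob, log_transmat)
    logbeta = backward(obsloglik, log_startprob, log_transmat)
    loglik = logsumexp(logalpha[-1])
    print('iteration %d, log likelihood %.2f' % (it, loglik))
    if loglik - prev_loglik < 1.0:                      # stop when the gain is small
        break
    prev_loglik = loglik
    # Maximization: re-estimate the Gaussian means and variances
    log_gamma = statePosteriors(logalpha, logbeta)
    hmm['means'], hmm['covars'] = updateMeanAndVar(X, log_gamma)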

Repeat the same procedure on the same utterance data[10], but starting with the model for nine: wordHMMs['9']. Does it take longer to converge? Does this converge to the same likelihood?

A              Recursion Formulae in Log Domain
A.1          Forward probability

The recursion formulae for the forward probabilities in log domain are given here, where we have used the Python convention with array indices going from 0 to len−1:

logα0(j) = logπj + logφj(x0)

logαn(j) = log Σi exp( logαn−1(i) + log aij ) + logφj(xn),    for n ∈ [1,N)

A.2           Backward probability

The recursion formulae for the backward probabilities in log domain are given here, where we have used the Python convention with array indices going from 0 to len−1:

logβN−1(i) = 0

logβn(i) = log Σj exp( log aij + logφj(xn+1) + logβn+1(j) ),    for n ∈ [0,N−1), computed in decreasing order of n

Note also that the initialization of the βN−1(i) is different from the one defined in the course book [1] (Section 8.2.4) but corresponds to the one used in [2].

A.3           Viterbi approximation

The Viterbi recursion formulas are as follows, where Vn(j) denotes the probability of the best path ending in state j at time step n:

logV0(j) = logπj + logφj(x0)

logVn(j) = maxi ( logVn−1(i) + log aij ) + logφj(xn),    for n ∈ [1,N)

Bn(j) = arg maxi ( logVn−1(i) + log aij )

The Viterbi log likelihood is maxj logVN−1(j), and the best path is recovered by backtracking through Bn(j) starting from arg maxj logVN−1(j).

A.4           State posteriors (gamma)

The gamma probability is the posterior of the state given the whole observation sequence and the model parameters: γn(i) = P(zn = si|X,θ). This can be easily computed from the forward and backward probabilities. In log domain:

logγn(i) = logαn(i) + logβn(i) − log Σj exp( logαN−1(j) ),

where we have used the Python convention of starting indices from 0.

A.5                 Pair of states posteriors (xi, not used in this exercise)

The xi probabilities are the posteriors of being in two subsequent states given the whole observation sequence and the model parameters: ξn(i,j) = P(zn = sj, zn−1 = si|X,θ). In order to compute them you will need the forward and backward probabilities, but also the emission probabilities φj(xn) for each state and feature vector, and the state transition probabilities aij. In log domain:

logξn(i,j) = logαn−1(i) + log aij + logφj(xn) + logβn(j) − log Σk exp( logαN−1(k) )

Note that you can only compute this from the second time step (n = 1) because you need the alpha probabilities for the previous time step.

References
[1]       Xuedong Huang, Alex Acero, and Hsiao-Wuen Hon. Spoken Language Processing: A Guide to Theory, Algorithm and System Development. Prentice Hall PTR, 2001.

[2]       L. R. Rabiner. “A tutorial on hidden Markov models and selected applications in speech recognition”. In: Proceedings of the IEEE 77.2 (Feb. 1989), pp. 257–286.


 
[1] An alternative would be to use the function numpy.ma.log() which will mask log(0) elements, but at the time of writing I could not get this to work properly, so I do not recommend it at this point.
