This assignment introduces you to Gaussian mixture modelling, and two basic tasks in speech technology: speaker identification, in which we try to determine who is talking, and speech recognition, in which we try to determine what was said.
The assignment is divided into two sections. In the first, you will experiment with speaker identification by training mixtures of Gaussians to the acoustic characteristics of individual speakers, and then identify speakers based on these models. In the second section, you will evaluate two speech recognition engines.
The data come from the CSC Deceptive Speech corpus, which was developed by Columbia University, SRI International, and the University of Colorado Boulder. It consists of 32 hours of audio interviews from 32 native speakers of Standard American English (16 male, 16 female) recruited from the Columbia University student population and the community. The purpose of the study was to distinguish deceptive speech from non-deceptive speech using machine learning techniques on features extracted from the corpus.
Data are in /u/cs401/A3/data/; each sub-folder represents speech from one speaker and contains raw audio, pre-computed MFCCs, and orthographic transcripts. Further file descriptions are in Appendix A.
1 Speaker Identification
Speaker identification is the task of correctly identifying speaker $s_c$ from among $S$ possible speakers $s_{i=1..S}$ given an input speech sequence $X$, consisting of a succession of $d$-dimensional real vectors. In the interests of efficiency, $d = 13$ in this assignment. Each vector represents a small 25 ms unit of speech called a frame. Speakers are identified by training data that are ascribed to them. This is a discrete classification task (choosing among several speakers) that uses continuous-valued data (the vectors of real numbers) as input.
Gaussian Mixture Models
Gaussian mixture models are often used to generalize models from sparse data. They can tightly constrain large-dimensional data by using a small number of components but can, with many more components, model arbitrary density distributions. Sometimes, they are simply used because the domain being modelled appears to have multiple modes.
Given $M$ components, GMMs are modelled by a collection of parameters, $\theta = \{\omega_{m=1..M}, \vec{\mu}_{m=1..M}, \Sigma_{m=1..M}\}$, where $\omega_m$ is the probability that an observation is generated by the $m$th component. These are subject to the constraint that $\sum_m \omega_m = 1$ and $0 \le \omega_m \le 1$. Each component is a multivariate Gaussian distribution, which is characterized by that component's mean, $\vec{\mu}_m$, and covariance matrix, $\Sigma_m$. For reasons of computational efficiency, we will reintroduce some independence assumptions by assuming that every component's covariance matrix is diagonal, i.e.:
$$\Sigma_m = \begin{pmatrix} \vec{\Sigma}_m[1] & 0 & \cdots & 0 \\ 0 & \vec{\Sigma}_m[2] & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \vec{\Sigma}_m[d] \end{pmatrix}$$
for some vector $\vec{\Sigma}_m$. Therefore, only $d$ parameters are necessary to characterize a component's (co)variance.
Copyright 2019, Frank Rudzicz. All rights reserved.
1.1 Utility functions
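First, we implement three utility functions in /u/cs401/A3/code/a3_gmm.py. Begin with log_b_m_x, which computes the log observation probability of $\vec{x}_t$ for the $m$th mixture component, i.e., the log of:
$$b_m(\vec{x}_t) = \frac{\exp\left[-\frac{1}{2}\sum_{n=1}^{d}\frac{(\vec{x}_t[n]-\vec{\mu}_m[n])^2}{\vec{\Sigma}_m[n]}\right]}{(2\pi)^{d/2}\sqrt{\prod_{n=1}^{d}\vec{\Sigma}_m[n]}} \tag{1}$$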
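As a rough illustration of the log observation probability for a diagonal-covariance Gaussian, the sketch below works directly in the log domain, which avoids underflow for frames far from the mean. The function name log_gaussian_diag and its argument layout are hypothetical and are not the signature expected in a3_gmm.py.

```python
import numpy as np

def log_gaussian_diag(x, mu, sigma_diag):
    """Log density of a d-dimensional Gaussian with diagonal covariance.

    x, mu and sigma_diag are length-d vectors; sigma_diag holds the
    variances on the diagonal of the covariance matrix (an assumed layout).
    """
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma_diag = np.asarray(sigma_diag, dtype=float)
    d = x.shape[-1]
    # Exponent: -1/2 * sum_n (x[n] - mu[n])^2 / Sigma[n]
    quad = -0.5 * np.sum((x - mu) ** 2 / sigma_diag, axis=-1)
    # Log of the normalizer: d/2 * log(2*pi) + 1/2 * sum_n log Sigma[n]
    log_norm = 0.5 * d * np.log(2.0 * np.pi) + 0.5 * np.sum(np.log(sigma_diag))
    return quad - log_norm
```

The normalizer depends only on the component, not on the frame, so it could be computed once per component and reused across frames.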
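Next, implement log_p_m_x, which is the log probability of $m$ given $\vec{x}_t$ using model $\theta$, i.e., the log of:
$$p(m \mid \vec{x}_t; \theta) = \frac{\omega_m b_m(\vec{x}_t)}{\sum_{k=1}^{M} \omega_k b_k(\vec{x}_t)} \tag{2}$$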
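One possible way to compute the component posteriors without leaving the log domain is the log-sum-exp trick, sketched below. It assumes the log observation probabilities are already available as an (M, T) array; the name log_posteriors and that array layout are illustrative assumptions rather than part of the starter code.

```python
import numpy as np

def log_posteriors(log_bs, omegas):
    """Return log p(m | x_t) for every component m and frame t.

    log_bs: (M, T) array of log b_m(x_t) (assumed layout).
    omegas: length-M vector of mixture weights.
    """
    log_bs = np.asarray(log_bs, dtype=float)
    log_w = np.log(np.asarray(omegas, dtype=float)).reshape(-1, 1)
    log_num = log_w + log_bs  # log(omega_m * b_m(x_t)), shape (M, T)
    # Log-sum-exp over components: subtract the per-frame maximum first
    # so that the exponentials cannot all underflow to zero.
    peak = np.max(log_num, axis=0, keepdims=True)
    log_den = peak + np.log(np.sum(np.exp(log_num - peak), axis=0, keepdims=True))
    return log_num - log_den
```

Exponentiating each column of the result should give values that sum to 1 over the components.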
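Finally, implement logLik, which is the log likelihood of a set of data $X$ of $T$ frames, i.e.:
$$\log P(X; \theta) = \sum_{t=1}^{T} \log p(\vec{x}_t; \theta) \tag{3}$$
where
$$p(\vec{x}_t; \theta) = \sum_{m=1}^{M} \omega_m b_m(\vec{x}_t) \tag{4}$$
and $b_m$ is defined in Equation 1. For efficiency, we just pass $\theta$ and precomputed $b_m(\vec{x}_t)$ to this function.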
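The log likelihood can be sketched as a sum over frames of the log of the weighted mixture density. The version below assumes the precomputed values are passed in the log domain as an (M, T) array; that choice, and the name log_likelihood, are assumptions made for this illustration.

```python
import numpy as np

def log_likelihood(log_bs, omegas):
    """Sum over frames of log sum_m omega_m * b_m(x_t).

    log_bs: (M, T) array of precomputed log b_m(x_t) (assumed layout).
    omegas: length-M vector of mixture weights.
    """
    log_bs = np.asarray(log_bs, dtype=float)
    log_w = np.log(np.asarray(omegas, dtype=float)).reshape(-1, 1)
    log_num = log_w + log_bs
    # Stable log-sum-exp over the component axis, one value per frame.
    peak = np.max(log_num, axis=0)
    log_px = peak + np.log(np.sum(np.exp(log_num - peak), axis=0))
    return float(np.sum(log_px))
```

With a single component of weight 1, this reduces to the sum of that component's log observation probabilities.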
1.2 Training Gaussian mixture models
Now we train an $M$-component GMM for each of the speakers in the data set. Specifically, for each speaker $s$, train the parameters $\theta_s = \{\omega_{m=1..M}, \vec{\mu}_{m=1..M}, \Sigma_{m=1..M}\}$ according to the method described in Appendix B. In all cases, assume that covariance matrices $\Sigma_m$ are diagonal. Start with $M = 8$. You'll be asked to experiment with that in Section 1.4. Complete the function train in /u/cs401/A3/code/a3_gmm.py.
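For orientation, a single update of the usual expectation-maximization procedure for a diagonal-covariance GMM might look like the sketch below. It is not taken from Appendix B and may differ from it in detail; the function name em_step, the array shapes, and the small variance floor are all assumptions of this example.

```python
import numpy as np

def em_step(X, omega, mu, sigma):
    """One EM update for a diagonal-covariance GMM (illustrative only).

    X: (T, d) frames; omega: (M,) weights; mu, sigma: (M, d) means and
    variances. Returns the updated (omega, mu, sigma) and the log
    likelihood of X under the parameters that were passed in.
    """
    T, d = X.shape
    # E-step: log b_m(x_t) for every component and frame, shape (M, T).
    diff = X[None, :, :] - mu[:, None, :]
    log_b = (-0.5 * np.sum(diff ** 2 / sigma[:, None, :], axis=2)
             - 0.5 * d * np.log(2.0 * np.pi)
             - 0.5 * np.sum(np.log(sigma), axis=1)[:, None])
    log_num = np.log(omega)[:, None] + log_b
    peak = np.max(log_num, axis=0, keepdims=True)
    log_px = peak + np.log(np.sum(np.exp(log_num - peak), axis=0, keepdims=True))
    post = np.exp(log_num - log_px)  # p(m | x_t), shape (M, T)
    # M-step: re-estimate weights, means and variances from the posteriors.
    counts = post.sum(axis=1)
    new_omega = counts / T
    new_mu = (post @ X) / counts[:, None]
    new_sigma = (post @ (X ** 2)) / counts[:, None] - new_mu ** 2
    # Variance floor (an assumption of this sketch) to avoid collapse to zero.
    new_sigma = np.maximum(new_sigma, 1e-6)
    return new_omega, new_mu, new_sigma, float(log_px.sum())
```

Calling em_step repeatedly and stopping once the returned log likelihood improves by less than some threshold, or after a fixed number of iterations, gives a complete training loop under these assumptions.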
1.3 Classification with Gaussian mixture models
Now we test each of the test sequences we've already set aside for you in the main function. That is, we check whether the actual speaker is also the most likely speaker, $\hat{s}$:
$$\hat{s} = \operatorname*{argmax}_{s=1,\dots,S} \log P(X; \theta_s) \tag{5}$$
Complete the function test in /u/cs401/A3/code/a3_gmm.py. Run through a train-test cycle, and save the output that this function writes to stdout, using the k = 5 top alternatives, to the file gmmLiks.txt.
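Picking the most likely speaker and the k best alternatives amounts to sorting the per-speaker log likelihoods. The sketch below assumes those scores have been collected in a dictionary keyed by speaker name; the function rank_speakers is hypothetical, and it does not attempt to reproduce the output format expected in gmmLiks.txt.

```python
def rank_speakers(log_liks, k=5):
    """Return the k (speaker, log likelihood) pairs with the highest scores.

    log_liks: dict mapping a speaker name to log P(X; theta_s) for one
    test utterance X (an assumed data layout).
    """
    ranked = sorted(log_liks.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]

# The predicted speaker is the first entry of the ranking.
example_scores = {"speakerA": -1200.5, "speakerB": -980.2, "speakerC": -1010.0}
best_speaker = rank_speakers(example_scores, k=1)[0][0]
```

The scores in example_scores are made-up placeholders, not values from the corpus.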
1.4 Experiments and discussion
Experiment with the settings of $M$ (and, if you wish, the other training settings). For example, what happens to classification accuracy as the number of components decreases? What about when the number of possible speakers, $S$, decreases? You will be marked on the detail with which you empirically answer these questions and whether you can devise one or more additional valid experiments of this type.
Additionally, your report should include short hypothetical answers to the following questions:
• How might you improve the classification accuracy of the Gaussian mixtures, without adding more training data?
• When would your classifier decide that a given test utterance comes from none of the trained speaker models, and how would your classifier come to this decision?
• Can you think of some alternative methods for doing speaker identification that don’t use Gaussian mixtures?
Put your experimental analysis and answers to these questions in the file gmmDiscussion.txt.
2 Speech Recognition
Automatic speech recognition (ASR) is the task of correctly identifying a word sequence given an input speech sequence $X$. To simplify your lives, we have run two popular ASR engines on our data: the open-source and highly customizable Kaldi (specifically, a bi-directional LSTM model trained on the Fisher corpus), and the neither-open-source-nor-particularly-customizable Google Speech API.
We want to see which of Kaldi and Google is more accurate on our data. For each speaker in our data, we have three transcript files: transcripts.txt (the gold-standard transcripts, from humans), transcripts.Kaldi.txt (the ASR output of Kaldi), and transcripts.Google.txt (the ASR output of Google); see Appendix A.
Complete the file at /u/cs401/A3/code/a3_levenshtein.py. Specifically, in the Levenshtein function, accept lists of words r (reference) and h (hypothesis), and return a 4-item list containing the floating-point WER, the number of substitutions, the number of insertions, and the number of deletions, where
$$\mathrm{WER} = \frac{\text{numSubstitutions} + \text{numInsertions} + \text{numDeletions}}{\text{numReferenceWords}}$$
Assume that the cost of a substitution is 0 if the words are identical and 1 otherwise. The costs of insertion and deletion are both 1.
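A word-level Levenshtein distance with these costs can be sketched with a dynamic-programming table followed by a backtrace that counts each edit type. The function name wer_counts, the tie-breaking order in the backtrace, and the handling of an empty reference are assumptions of this example, not requirements of the assignment.

```python
def wer_counts(r, h):
    """Return [WER, substitutions, insertions, deletions] for word lists.

    r: reference words; h: hypothesis words.
    """
    n, m = len(r), len(h)
    # dist[i][j] is the minimum number of edits turning r[:i] into h[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(1, m + 1):
        dist[0][j] = j  # insert all j hypothesis words
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub_cost = 0 if r[i - 1] == h[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + sub_cost,  # match/substitute
                             dist[i - 1][j] + 1,             # deletion
                             dist[i][j - 1] + 1)             # insertion
    # Backtrace from the bottom-right corner, preferring the diagonal move.
    i, j = n, m
    n_sub = n_ins = n_del = 0
    while i > 0 or j > 0:
        sub_cost = 0 if (i > 0 and j > 0 and r[i - 1] == h[j - 1]) else 1
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + sub_cost:
            n_sub += sub_cost
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            n_del += 1
            i -= 1
        else:
            n_ins += 1
            j -= 1
    # Assumption: an empty reference gives WER 0 if h is also empty, else inf.
    if n == 0:
        wer = 0.0 if m == 0 else float("inf")
    else:
        wer = (n_sub + n_ins + n_del) / n
    return [wer, n_sub, n_ins, n_del]
```

Several alignments can share the same minimum distance, so the split among substitutions, insertions, and deletions may depend on the tie-breaking order, while the WER itself does not.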
In the main function, iterate through each of the speakers, and iterate through each line i of their transcripts. For each line, preprocess these transcripts by removing all punctuation (other than [ and ]) and setting the text to lowercase. Output the following to stdout:
[SPEAKER] [SYSTEM] [i] [WER] S:[numSubstitutions], I:[numInsertions], D:[numDeletions]
where [SYSTEM] is either ‘Kaldi’ or ‘Google’.
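The preprocessing step could be sketched as follows. This version deletes every ASCII punctuation character except the square brackets and then lowercases and splits on whitespace; whether removed punctuation should instead become a space is a judgment call that this sketch does not settle, and the name preprocess is hypothetical.

```python
import string

# Every ASCII punctuation character except '[' and ']'.
_PUNCT_TO_DROP = "".join(c for c in string.punctuation if c not in "[]")
_DROP_TABLE = str.maketrans("", "", _PUNCT_TO_DROP)

def preprocess(line):
    """Strip punctuation (keeping [ and ]), lowercase, and split into words."""
    return line.translate(_DROP_TABLE).lower().split()
```

Under this choice, a contraction such as "DON'T" becomes the single token "dont", and bracketed tags such as "[laughter]" pass through unchanged.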
Save this output and put it into asrDiscussion.txt.
On the second-to-last line of asrDiscussion.txt, in free text, summarize your findings by reporting the average and standard deviation of WER for each of Kaldi and Google, separately, over all of these lines. If you want to be fancy, you can compute a statistical test of significance to see if one is better than the other, but you don’t need to.
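Summarizing the per-line WERs for one system might look like the sketch below, which assumes the WER values have been collected into a list. The function name summarize is hypothetical, and the choice of the population standard deviation (ddof=0) rather than the sample standard deviation (ddof=1) is an assumption you may wish to revisit.

```python
import numpy as np

def summarize(wers, ddof=0):
    """Return (mean, standard deviation) of a sequence of WER values."""
    values = np.asarray(wers, dtype=float)
    return float(np.mean(values)), float(np.std(values, ddof=ddof))

# Placeholder values for illustration only, not results from the corpus.
mean_wer, std_wer = summarize([0.0, 0.5, 1.0])
```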
On the last line of asrDiscussion.txt, add a sentence or two describing anything you observe about the types of errors being made by each system, by manually examining the transcript files.