Statistical Natural Language Processing
Sheet IX
EM for Word Sense Disambiguation
In this exercise, you will implement expectation-maximization [1] for word sense disambiguation. You can find the pseudocode on slide 44 of chapter 7.
You will be provided with starter code to help you with the implementation, but you can also write your solution from scratch if you prefer. If you use the starter code, you will have to install the NumPy package on your machine (e.g., with pip install numpy).
The dataset
NLTK has a corpus reader for the Senseval 2 dataset. This dataset contains data for four ambiguous words: hard, line, serve and interest. You can read more here: http://www.nltk.org/howto/corpus.html (search for "senseval"). The provided code loads the dataset for you and converts each sample into a sample object. The sample class has two attributes: context and label. Label is the ground-truth sense of the ambiguous word. As EM is an unsupervised method, we will only use the labels for the final evaluation. Context is the left and right context of the ambiguous word as word IDs: it is a list of integers, which you can use later to index matrices.

instances = senseval.instances(hard_f)
# All training samples as a list.
samples = [sample(inst) for inst in instances]
# Convert contexts to indices so they can be used for indexing.
for sample in samples:
    sample.context_to_index(word_to_id)
Now, "samples" contains all the sample objects for the ambiguous word "hard".
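The starter code already provides the sample class and the word_to_id mapping. Purely as an illustration, a minimal sketch of what they might look like is given below; the fileid 'hard.pos' and all details not shown in the snippet above are assumptions:

from nltk.corpus import senseval

hard_f = 'hard.pos'  # fileid of the Senseval 2 data for "hard" (assumed)

class sample:
    def __init__(self, inst):
        # Keep the words of the left and right context; entries in
        # inst.context are mostly (word, POS-tag) pairs.
        words = (t[0] if isinstance(t, tuple) else t for t in inst.context)
        self.context = [w for w in words if w.isalpha()]
        # Ground-truth sense, used only for the final evaluation.
        self.label = inst.senses[0]

    def context_to_index(self, word_to_id):
        # Replace every context word by its integer ID.
        self.context = [word_to_id[w] for w in self.context if w in word_to_id]

instances = senseval.instances(hard_f)
samples = [sample(inst) for inst in instances]
# Vocabulary mapping built from all contexts.
vocab = sorted({w for s in samples for w in s.context})
word_to_id = {w: i for i, w in enumerate(vocab)}
for s in samples:
    s.context_to_index(word_to_id)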
Implementation
There are three matrices provided for you, which you will have to use for the implementation: priors, probs and C.
• priors: a vector of length K, where K is the number of clusters. Each value corresponds to a cluster prior p(s_k). It is initialized randomly.
• probs: a V x K matrix, where V is the size of the vocabulary. Each column is a conditional probability distribution over the words: probs[i, k] is p(v_i | s_k).
• C: contains the counts of the words in a given context. The matrix has size number of samples x vocabulary size: C[i, j] is C(v_j in c_i), the count of word v_j in context c_i.
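The starter code already creates these three matrices, so the following is only a hypothetical sketch of how they might be set up; the number of clusters K, the random seed and the use of numpy.random are assumptions:

import numpy as np

rng = np.random.default_rng(0)
K = 3                    # number of clusters (assumed; "hard" has three senses in Senseval 2)
V = len(word_to_id)      # vocabulary size
N = len(samples)         # number of training samples

# priors: length-K vector of cluster priors p(s_k), random but normalized to sum to 1.
priors = rng.random(K)
priors /= priors.sum()

# probs: V x K matrix; each column is a distribution over the vocabulary, probs[i, k] = p(v_i | s_k).
probs = rng.random((V, K))
probs /= probs.sum(axis=0, keepdims=True)

# C: (number of samples) x V count matrix; C[i, j] = count of word v_j in context c_i.
C = np.zeros((N, V))
for i, s in enumerate(samples):
    for j in s.context:
        C[i, j] += 1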
To get a better feel for how to use these matrices, read the log_likelihood_np(samples, probs, priors) function.
Get the IDs of the words. We will use this to index the rows of the probs matrix:
context_index = sample.context
words_given_sense = probs[context_index, :]
This is a matrix: the number of rows is the number of words in the context, and the number of columns is the number of senses. We take the product along each column, i.e., over the context words, to get the probability of the context given each sense, p(c_i | s_k):
context_given_sense = np.prod(words_given_sense, axis=0)
This is a vector where each value is a class conditional probability (size is K).
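The lines above are the core of the provided function. A possible version of the whole log_likelihood_np, consistent with this walkthrough, might look as follows; the provided implementation may differ, e.g., by working in log space to avoid underflow:

import numpy as np

def log_likelihood_np(samples, probs, priors):
    ll = 0.0
    for sample in samples:
        context_index = sample.context
        # Rows: words in the context; columns: senses.
        words_given_sense = probs[context_index, :]
        # p(c_i | s_k): product over the context words, one value per sense.
        context_given_sense = np.prod(words_given_sense, axis=0)
        # p(c_i) = sum_k p(s_k) * p(c_i | s_k); accumulate its log.
        ll += np.log(np.dot(priors, context_given_sense))
    return ll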
Exercises
• Briefly describe what EM is and when it is appropriate to use it. Give an example use case (1 point).
• Write the code for the expectation step (3 points).
• Write the code for the maximization step. In order to check your implementation, make sure that the log likelihood increases over the iterations (4 points).
• The code will print the ordered frequency of the labels within each cluster. Each cluster corresponds to one of the senses of "hard". How does the algorithm perform? Briefly explain. (1 point)
• Does EM find the global optimum? Explain why or why not (1 point).
[1] See https://machinelearningmastery.com/expectation-maximization-em-algorithm/