CS114 Programming Assignment 3: N-gram Language Models

The overall goal of this assignment is for you to train the best language model possible given the same limited amount of data. The data folder should contain the training set (train-data.txt), the development set (dev-data.txt), and a lot of jumbled data (jumble-dev). Test sets (for sentences and jumbled data) have been held out and are not given to you.

You are also given train-data_small.txt and test-data_small.txt, the toy corpus from HW2, in the same format. You should only create more models similar to the one in unigram.py, while leaving the other files intact.

Assignment
Unigram model
The unigram model, unigram.py, is given to you; you do not need to (and should not) change anything. However, before creating any new models, you should read the code and understand what it is doing. In particular, note the following:

train(self, trainingSentences) - Trains the model from the supplied collection of trainingSentences. First, we count all the words in the training data. Note that one symbol LanguageModel.STOP is counted at the end of each sentence, and the unknown word LanguageModel.UNK is also given a count of 1.

Then we build self.prob_counter as a scipy.sparse.lil_matrix. For the unigram model, this is not necessary, but for bigram models with large vocabularies, it is impossible to store all of the probabilities in a NumPy array without running out of memory. However, using a sparse matrix (here, in list-of-lists format), we can take advantage of the fact that most of the bigram probabilities are 0, and store only the non-zero probabilities, saving a lot of space. There are a few idiosyncrasies (e.g. division becomes multiplication by the reciprocal), but for the most part, we can use SciPy sparse matrices similarly to NumPy arrays.
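If SciPy sparse matrices are new to you, the short standalone snippet below (not part of the assignment files) illustrates the two points above: only the non-zero entries are stored, and division is written as multiplication by the reciprocal.

    from scipy.sparse import lil_matrix

    # A 1 x V row of mostly-zero counts: only the non-zero entries are stored.
    V = 10
    counts = lil_matrix((1, V))
    counts[0, 2] = 3    # e.g. the word with index 2 was seen 3 times
    counts[0, 7] = 1    # e.g. the word with index 7 was seen once

    total = 4
    # Division becomes multiplication by the reciprocal, as noted above.
    probs = counts * (1 / total)

    print(probs[0, 2])      # 0.75
    print(probs.toarray())  # dense view, for inspection only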

As in PA2, self.word_dict should be used to translate between indices and words; by popular demand, this time we can use a {word:index} dictionary rather than {index:word}. Note the use of self.total to store the denominator, in this case the total number of words in the training data. Finally, we divide by self.total to get the probabilities P(word) in self.prob_counter.
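For reference, here is a minimal sketch of the kind of logic this paragraph describes. It is only an illustration (the stand-in constants and the exact bookkeeping are assumptions); the real unigram.py may differ in details.

    from collections import defaultdict
    from scipy.sparse import lil_matrix

    # Stand-ins for LanguageModel.STOP / LanguageModel.UNK; the real constants
    # live in the assignment's LanguageModel class.
    STOP = "</S>"
    UNK = "<UNK>"

    class UnigramSketch:
        """Illustrative sketch only; the real unigram.py may differ."""

        def train(self, trainingSentences):
            counts = defaultdict(int)
            for sentence in trainingSentences:   # each sentence: a list of words
                for word in sentence:
                    counts[word] += 1
                counts[STOP] += 1                # one STOP counted per sentence
            counts[UNK] += 1                     # UNK given a count of 1

            # {word:index} dictionary, as described above.
            self.word_dict = {word: i for i, word in enumerate(counts)}

            # Total number of words in the training data (the denominator).
            self.total = sum(counts.values())

            # Sparse 1 x |V| row of probabilities; only non-zeros are stored.
            self.prob_counter = lil_matrix((1, len(self.word_dict)))
            for word, count in counts.items():
                # Division written as multiplication by the reciprocal.
                self.prob_counter[0, self.word_dict[word]] = count * (1 / self.total)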

getWordProbability(self, sentence, index) - Returns the probability of the word at index, according to the model, within the specified sentence. Note that if the index is at the end of the sentence, we return the probability of LanguageModel.STOP, and if we have not seen the word at index before, we return the probability of LanguageModel.UNK.
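Continuing the UnigramSketch class above (same caveats), the lookups described here might look roughly like:

        def getWordProbability(self, sentence, index):
            # STOP at the end of the sentence; UNK for words never seen in training.
            if index == len(sentence):
                word = STOP
            elif sentence[index] not in self.word_dict:
                word = UNK
            else:
                word = sentence[index]
            return self.prob_counter[0, self.word_dict[word]]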

generateWord(self, context) - Returns, for a given context, a random word, according to the probabilities in the model. The context is a list of the previous words in the sentence. For the unigram model, the context is ignored, so we simply use numpy.random.choice to generate a word using the unigram probabilities in self.prob_counter.
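A similarly hedged sketch of the sampling step, continuing the same class (assumes numpy is imported as np at the top of the file):

        def generateWord(self, context):
            # The context is ignored for the unigram model.
            words = list(self.word_dict)                 # index i -> i-th inserted word
            probs = self.prob_counter.toarray().ravel()  # dense 1-D probability vector
            return np.random.choice(words, p=probs)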

Bigram model
Now that you are familiar with how the unigram model works, your first task is to create a bigram model in bigram.py. Whereas a unigram model uses probabilities of words P(word), a bigram model uses conditional probabilities P(word|previous word). Therefore, instead of just storing a single probability as self.prob_counter[word], we can set self.prob_counter[previous_word][word] = P(word|previous word). Similarly, self.total[previous_word] stores the count of previous word in the training data.

Some notes:

For a word at the beginning of the sentence (i.e. the index is 0 in getWordProbability, or the context is empty in generateWord), the previous “word” will be LanguageModel.START.

When building a conditional probability table (as in HW2), there is no row for </S> and no column for <S>. Probably the easiest way to account for this in your code is to use a single index to represent LanguageModel.START when used for a row and LanguageModel.STOP when used for a column.
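To make the row/column bookkeeping concrete, here is a hedged sketch of how the conditional counts and the shared START/STOP index might fit together. It is an illustration under simplifying assumptions (each sentence is a list of word strings; UNK handling is omitted), not the required implementation.

    from collections import defaultdict
    from scipy.sparse import lil_matrix

    # Stand-ins for LanguageModel.START / STOP / UNK.
    START = "<S>"
    STOP = "</S>"
    UNK = "<UNK>"

    class BigramSketch:
        """Illustrative sketch only; your bigram.py will differ in details."""

        def train(self, trainingSentences):
            counts = defaultdict(lambda: defaultdict(int))
            for sentence in trainingSentences:
                # START only ever appears as a previous word (a row);
                # STOP only ever appears as a current word (a column).
                prev = START
                for word in sentence + [STOP]:
                    counts[prev][word] += 1
                    prev = word

            # One {word:index} dictionary; START and STOP share a single index,
            # since START is only used for rows and STOP only for columns.
            vocab = set(word for sentence in trainingSentences for word in sentence)
            self.word_dict = {word: i for i, word in enumerate(vocab)}
            shared = len(self.word_dict)
            self.word_dict[START] = shared      # row use only
            self.word_dict[STOP] = shared       # column use only
            self.word_dict[UNK] = shared + 1
            V = len(vocab) + 2

            self.prob_counter = lil_matrix((V, V))   # rows: previous word, cols: word
            self.total = defaultdict(int)            # count of each previous word
            for prev, row in counts.items():
                i = self.word_dict[prev]
                self.total[i] = sum(row.values())
                for word, c in row.items():
                    j = self.word_dict[word]
                    # P(word | prev) = count(prev, word) / count(prev)
                    self.prob_counter[i, j] = c * (1 / self.total[i])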

If you get a divide-by-zero warning when calculating the (dev) test set/jumbled sentence perplexity, that is good! Do not worry about that.

Bigram model with add-k smoothing
Your next task is to create a bigram model with add-k smoothing, in bigram_add_k.py. A naïve implementation would involve calculating the adjusted probabilities

P̂(wn|wn−1) = (count(wn−1, wn) + k) / (count(wn−1) + k|V|)

and storing them in self.prob_counter. Do not do this! (You will run out of memory.) Instead, we will use some tricks:

1. Collect the bigram counts as before, such that self.prob_counter only contains those bigrams that have non-zero actual counts in the training data.

2. Add k|V| to self.total[previous_word], so that after dividing, the values of self.prob_counter have the form count(wn−1, wn) / (count(wn−1) + k|V|).

3. Then, whenever we need a probability (array), we can add the missing k / (count(wn−1) + k|V|) inside of getWordProbability and generateWord.

Otherwise, your procedure should be the same as for the unsmoothed bigram model.
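To see that steps 2 and 3 reproduce the naïve formula, here is a small standalone arithmetic check with toy numbers (not assignment code):

    k = 0.5             # smoothing constant
    V = 100             # vocabulary size |V|
    count_prev = 10     # count(wn-1)
    count_bigram = 3    # count(wn-1, wn)

    # Step 2: self.total holds count(prev) + k|V|, and self.prob_counter holds
    # count(prev, word) / (count(prev) + k|V|) after dividing.
    denominator = count_prev + k * V
    stored = count_bigram / denominator

    # Step 3: the missing k / (count(prev) + k|V|) is added only when a
    # probability is requested, inside getWordProbability / generateWord.
    p = stored + k / denominator

    # Same result as the naive adjusted probability.
    assert abs(p - (count_bigram + k) / (count_prev + k * V)) < 1e-12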

Setting the value of k

Which value of k should you use? When you are first writing your code, you can use k = 1 (i.e., add-one smoothing) for simplicity. After you get your algorithm working, though, you should try to find the value of k that results in the best performance on the dev set, either in terms of minimum perplexity, or maximum accuracy on the jumble task, or both.
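One simple way to run that search, with placeholder helper names (train_add_k_model and dev_perplexity are hypothetical stand-ins for however you train and evaluate in your own setup, not assignment functions):

    candidate_ks = [0.001, 0.01, 0.05, 0.1, 0.5, 1.0]

    best_k, best_ppl = None, float("inf")
    for k in candidate_ks:
        model = train_add_k_model(k)    # hypothetical helper
        ppl = dev_perplexity(model)     # hypothetical helper
        print(f"k={k}: dev perplexity {ppl:.2f}")
        if ppl < best_ppl:
            best_k, best_ppl = k, ppl

    print(f"best k on dev: {best_k} (perplexity {best_ppl:.2f})")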

Bigram model with interpolation
Finally, you should create a bigram model with (simple linear) interpolation, in bigram_interpolation.py. If you implemented bigram.py correctly, you should be able to leave train and getWordProbability (and __init__) as is, only needing to fill in generateWord.
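The quantity being interpolated is just a weighted average of the bigram and unigram estimates. A minimal sketch of the formula (illustration only; the weight names are placeholders, and the class structure in bigram_interpolation.py is already set up for you):

    def interpolated_probability(p_bigram, p_unigram, lambda_bigram, lambda_unigram):
        # Simple linear interpolation; the two lambdas should sum to 1 so the
        # result is a proper probability distribution.
        return lambda_bigram * p_bigram + lambda_unigram * p_unigram

    # e.g. an unseen bigram (P = 0 under the bigram model) still gets mass
    # from the unigram model:
    p = interpolated_probability(0.0, 0.04, lambda_bigram=0.75, lambda_unigram=0.25)
    print(p)    # 0.01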

Tuning the interpolation weights

As above, you should experiment with finding the values of λ that optimize the dev set perplexity and/or jumble task accuracy. Both books mention the EM (expectation-maximization) algorithm, but for now, trial and error is fine.
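A trial-and-error search over the weights can mirror the k search above (again, train_interpolated_model and dev_perplexity are hypothetical placeholder helpers):

    best = None
    for lambda_bigram in [0.3, 0.5, 0.7, 0.9, 0.95]:
        lambda_unigram = 1 - lambda_bigram                               # weights sum to 1
        model = train_interpolated_model(lambda_bigram, lambda_unigram)  # hypothetical helper
        ppl = dev_perplexity(model)                                      # hypothetical helper
        if best is None or ppl < best[0]:
            best = (ppl, lambda_bigram, lambda_unigram)

    ppl, lb, lu = best
    print(f"best weights on dev: bigram={lb}, unigram={lu} (perplexity {ppl:.2f})")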
