CS114 Programming Assignment 4: Part-of-speech Tagging with Hidden Markov Models

You are given pos_tagger.py, and brown.zip, the Brown corpus (of part-of-speech tagged sentences). Sentences are separated into a training set (≈80% of the data) and a development set (≈10% of the data). A testing set (≈10% of the data) has been held out and is not given to you. You are also given data_small.zip, the toy corpus from HW2, in the same format. Each folder (train, dev, train_small, test_small) contains a number of documents, each of which contains a number of tokenized, tagged sentences in the following format: <word>/<tag>. For example:

The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta’s/np$ recent/jj primary/nn election/nn produced/vbd ‘‘/‘‘ no/at evidence/nn ’’/’’ that/cs any/dti irregularities/nns took/vbd place/nn ./.

Assignment
Your task is to implement a supervised hidden Markov model to perform part-of-speech tagging. Specifically, in pos_tagger.py, you should fill in the following functions:

train(self, train_set): This function should, given a folder of training documents, fill in self.initial, self.transition and self.emission (see the sketch after these definitions), such that:

◦ self.initial[POS] = log(P(the initial tag is POS))

◦ self.transition[POS1][POS2] = log(P(the current tag is POS2|the previous tag is POS1))

◦ self.emission[POS][word] = log(P(the current word is word|the current tag is POS))
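
A minimal counting sketch of how these tables can be estimated, assuming the tagged sentences have already been read into lists of (word, tag) pairs; the function and variable names here are illustrative and not part of the starter code (which may instead store the tables as NumPy arrays indexed by tag and word indices):

```python
import math
from collections import Counter, defaultdict

def estimate_log_probabilities(tagged_sentences):
    """tagged_sentences: a list of sentences, each a list of (word, tag) pairs.
    Returns dictionaries of log-probabilities with the same meaning as
    self.initial, self.transition and self.emission."""
    initial_counts = Counter()
    transition_counts = defaultdict(Counter)
    emission_counts = defaultdict(Counter)

    for sentence in tagged_sentences:
        tags = [tag for _, tag in sentence]
        initial_counts[tags[0]] += 1               # how often each tag starts a sentence
        for prev_tag, tag in zip(tags, tags[1:]):
            transition_counts[prev_tag][tag] += 1  # how often POS1 is followed by POS2
        for word, tag in sentence:
            emission_counts[tag][word] += 1        # how often each tag emits each word

    def normalize(counter):
        total = sum(counter.values())
        return {key: math.log(count / total) for key, count in counter.items()}

    initial = normalize(initial_counts)
    transition = {prev: normalize(c) for prev, c in transition_counts.items()}
    emission = {tag: normalize(c) for tag, c in emission_counts.items()}
    return initial, transition, emission
```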

viterbi(self, sentence): This function should, given a sentence (as a list), implement the Viterbi algorithm and return the most likely sequence of POS tags (the backtrace path).

Some notes:

–    The input sentence is a list. A list of what? You will need to translate from words to indices at some point in your code; it is recommended to do this translation outside of the viterbi function, so that the input to the function is a list of indices.

–    Remember that when working in log-space, you should add the logs, rather than multiply the probabilities.

–    Note that operations like + are NumPy universal functions, meaning that they automatically operate element-wise over arrays. This results in a substantial reduction in running time, compared with looping over each element of an array. As such, your viterbi implementation should not contain any for loops that range over states (for loops that range over time steps are fine). A sketch illustrating this appears after these notes.

–    To avoid unnecessary for loops, you can use broadcasting to your advantage. Briefly, broadcasting allows you to operate over arrays with different shapes. For example, to add matrices of shapes (a,1) and (a,b), the single column of the first matrix is copied b times, to form a matrix of shape (a,b). Similarly, to add matrices of shapes (a,b) and (1,b), the single row of the second matrix is copied a times.

–    When performing integer array indexing, the result is an array of lower rank (number of dimensions). For example, if v is a matrix of shape (a,b), then v[:, t-1] is a vector of shape (a,). Broadcasting to a matrix of rank 2, however, results in a matrix of shape (1,a): our column becomes a row. To get a matrix of shape (a,1), you can either use slice indexing instead (i.e. v[:, t-1:t]), append a new axis of None after your integer index (i.e. v[:, t-1, None]), or use the numpy.reshape function (i.e. numpy.reshape(v[:, t-1], (a, 1))).

–    In self.transition, each row represents a previous tag, while each column represents a current tag. Do not mix them up!

–    Finally, you do not have to return the path probability, just the backtrace path.
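
A minimal sketch of how these notes fit together, assuming self.initial, self.transition and self.emission have been converted to NumPy arrays of log-probabilities and the sentence to a list of word indices; the names and exact signature are illustrative, not the starter code's:

```python
import numpy as np

def viterbi(initial, transition, emission, sentence):
    """initial:    (num_tags,)            log P(first tag)
    transition: (num_tags, num_tags)   row = previous tag, column = current tag
    emission:   (num_tags, num_words)  log P(word | tag)
    sentence:   list of word indices
    Returns the most likely sequence of tag indices (the backtrace path)."""
    num_tags = initial.shape[0]
    T = len(sentence)
    v = np.zeros((num_tags, T))            # best log-probability of a path ending in each tag at time t
    backpointer = np.zeros((num_tags, T), dtype=int)

    v[:, 0] = initial + emission[:, sentence[0]]
    for t in range(1, T):                  # looping over time steps is fine
        # (num_tags, 1) + (num_tags, num_tags): broadcasting replaces a loop over states.
        # scores[i, j] = v[i, t-1] + log P(current tag j | previous tag i)
        scores = v[:, t - 1, None] + transition
        backpointer[:, t] = np.argmax(scores, axis=0)
        v[:, t] = np.max(scores, axis=0) + emission[:, sentence[t]]

    # Follow the backpointers from the best final tag back to the start of the sentence.
    path = [int(np.argmax(v[:, T - 1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(backpointer[path[-1], t]))
    return path[::-1]
```

Here v[:, t - 1, None] has shape (num_tags, 1), so adding it to the (num_tags, num_tags) transition matrix broadcasts exactly as described above, and the only explicit loops range over time steps.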

test(self, dev_set): This function should, given a folder of development (or testing) documents, return a dictionary of results such that:

◦ results[sentence_id]['correct'] = correct sequence of POS tags

◦ results[sentence_id]['predicted'] = predicted sequence of POS tags (hint: call the viterbi function)

You can assign each sentence a sentence_id however you want. Your sequences of POS tags can just be lists, e.g. ['at', 'np', ...].
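
A minimal sketch of the expected shape of the returned dictionary (the sentence ids and tag strings below are illustrative):

```python
# Illustrative only: sentence ids and tags depend on your own id scheme and the corpus.
results = {
    0: {'correct':   ['at', 'np-tl', 'nn-tl', 'vbd', '.'],
        'predicted': ['at', 'np-tl', 'nn-tl', 'vbd', '.']},
    1: {'correct':   ['at', 'nn', 'vbd', '.'],
        'predicted': ['at', 'jj', 'vbd', '.']},
}
```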

evaluate(self, results): This function should return the overall accuracy (# of words correctly tagged / total # of words). You don’t have to calculate precision, recall, or F1 score.
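
A minimal sketch of the accuracy computation, written as a standalone function and assuming results has the structure shown above:

```python
def evaluate(results):
    """Overall accuracy: # of words correctly tagged / total # of words."""
    correct = 0
    total = 0
    for sentence_id in results:
        gold = results[sentence_id]['correct']
        predicted = results[sentence_id]['predicted']
        correct += sum(g == p for g, p in zip(gold, predicted))
        total += len(gold)
    return correct / total
```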
