ECE448 - Assignment 4 - HMM POS tagging

For this MP, you will implement part of speech (POS) tagging using an HMM model. Make sure you understand the algorithm before you start writing code; see, e.g., the lectures on Hidden Markov Models and Chapter 8 of Jurafsky and Martin.

General guidelines 
Basic instructions are the same as in previous MPs:

•       Extra credit is available, capped at 10%. 

•       You may use numpy (though it's not needed). You may not use other non-standard modules (including nltk). 

Problem Statement 
The mp4 code reads data from two files. Your tagging function will be given the training data with tags and the test data without tags. Your tagger should use the training data to estimate the probabilities it requires, and then use this model to infer tags for the test input. The main mp4 function will compare these against the correct tags and report your accuracy.

The data is divided into sentences. Your tagger should process each sentence independently.

You will need to write two tagging functions:

•       Baseline 

•       Viterbi: HMM tagger 

The materials 
The code package for the MP contains the following files:

•       mp4.py (do not change) 

•       utils.py (do not change) 

•       baseline.py 

•       viterbi_1.py 

•       test_viterbi/
        o   test_viterbi.py
        o   utils.py (do not change)

•       viterbi_2.py 

•       viterbi_ec.py 

The code package also contains training and development data:

•       Brown corpus: data/brown-training.txt, data/brown-dev.txt 

The provided code converts all words to lowercase. It also adds a START and END tag for each sentence when it loads the sentence. These tags are just for standardization. They will not be considered in accuracy computation.
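For concreteness, the sketches later in this document assume the loaded data looks roughly like the following. The exact structures come from utils.py, so treat the shapes and names here as assumptions, not the confirmed interface:

# Hypothetical illustration of the assumed data shapes (verify against utils.py).
# Training sentences: lists of (word, tag) pairs, with START/END added by the loader.
train_example = [
    [("START", "START"), ("the", "DET"), ("dog", "NOUN"),
     ("barks", "VERB"), (".", "PERIOD"), ("END", "END")],
]

# Test sentences: the same kind of word sequences, without tags.
test_example = [
    ["START", "the", "dog", "barks", ".", "END"],
]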

You should use the training data to train the parameters of your model and the development sets to test its accuracy. We will use a separate (unseen) set of data to test your code after you submit it.

In addition, your code will be tested on a hidden dataset that is not available to you, which has a different number of tags and words from the ones provided. So do NOT hardcode any of your important computations, such as initial probabilities, transition probabilities, emission probabilities, or the number or names of tags. We will inspect code for hardcoded computations/values and will penalize such implementations.

To run the code on the Brown corpus data you need to tell it where the data is and which algorithm to run, either baseline, viterbi_1, viterbi_2, or viterbi_ec:

python3 mp4.py --train data/brown-training.txt --test data/brown-dev.txt --algorithm [baseline, viterbi_1, viterbi_2, viterbi_ec] 

The program will run the algorithm and report three accuracy numbers:

•       Overall accuracy 

•       Accuracy on words that have been seen with multiple different tags

•       Accuracy on unseen words

Many words in our datasets have only one possible tag, so it's very hard to get the tag wrong! This means that even very simple algorithms have high overall accuracy. The other two accuracy numbers will help you see where there is room for improvement.

Tagset 
The following is an example set of 16 part of speech tags. This is the tagset used in the provided Brown corpus. But remember you should not hardcode anything regarding this tagset because we will test your code on two other datasets with a different tagset.

•       ADJ adjective 

•       ADV adverb 

•       IN preposition 

•       PART particle (e.g. after verb, looks like a preposition) 

•       PRON pronoun 

•       NUM number 

•       CONJ conjunction 

•       UH filler, exclamation 

•       TO infinitive 

•       VERB verb 

•       MODAL modal verb 

•       DET determiner 

•       NOUN noun 

•       PERIOD end of sentence punctuation 

•       PUNCT other punctuation 

•       X miscellaneous hard-to-classify items 

Part 1 Baseline tagger
The Baseline tagger considers each word independently, ignoring previous words and tags. For each word w, it counts how many times w occurs with each tag in the training data. When processing the test data, it always gives w the tag that was seen most often. For unseen words, it should guess the tag that is seen most often overall in the training dataset.

A correctly working baseline tagger should get about 93% accuracy on the Brown corpus development set, with over 90% accuracy on multitag words and over 69% on unseen words.

DO NOT ATTEMPT TO IMPROVE THE BASELINE ALGORITHM. IMPLEMENT IT EXACTLY AS DESCRIBED ABOVE. 
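For reference, here is a minimal sketch of the baseline under the assumed data shapes illustrated earlier; the function name and signature are assumptions, not the provided interface:

from collections import Counter, defaultdict

# A minimal baseline sketch, assuming train is a list of sentences of
# (word, tag) pairs and test is a list of word lists.
def baseline(train, test):
    tag_counts = Counter()                      # overall tag frequencies
    word_tag_counts = defaultdict(Counter)      # per-word tag frequencies
    for sentence in train:
        for word, tag in sentence:
            tag_counts[tag] += 1
            word_tag_counts[word][tag] += 1

    most_common_tag = tag_counts.most_common(1)[0][0]

    predictions = []
    for sentence in test:
        tagged = []
        for word in sentence:
            if word in word_tag_counts:
                tag = word_tag_counts[word].most_common(1)[0][0]
            else:
                # Unseen word: guess the globally most frequent tag.
                tag = most_common_tag
            tagged.append((word, tag))
        predictions.append(tagged)
    return predictions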

Part 2 Viterbi_1
IF YOU FIND THIS PART DIFFICULT TO BEGIN: You may find it easier to go to the next section and first implement only the Viterbi algorithm.

The Viterbi tagger should implement the HMM trellis (Viterbi) decoding algorithm as seen in lecture or Jurafsky and Martin. That is, the probability of each tag depends only on the previous tag, and the probability of each word depends only on the corresponding tag. This model will need to estimate three sets of probabilities:

•       Initial probabilities (How often does each tag occur at the start of a sentence?) 

•       Transition probabilities (How often does tag t_b follow tag t_a?)

•       Emission probabilities (How often does tag t yield word w?) 

You can assume that all sentences will begin with a START token, whose tag is START. So your initial probabilities will have a very restricted form, whether you choose to handcode appropriate numbers or learn them from the data. The initial probabilities shown in the textbook/lecture examples will be handled by transition probabilities from the START token to the first real word.
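For example, counting transitions over full sentences (including the START token) makes a separate initial-probability table unnecessary. The helper below is a hedged sketch; its name and the data shapes are assumptions:

from collections import defaultdict

# Hypothetical sketch: because every sentence begins with the START tag,
# counts of (START, first_real_tag) pairs play the role of initial
# probabilities, so no separate initial table is needed.
def count_transitions(train_sentences):
    trans_counts = defaultdict(lambda: defaultdict(int))
    for sentence in train_sentences:
        tags = [tag for _, tag in sentence]
        for prev_tag, cur_tag in zip(tags, tags[1:]):
            trans_counts[prev_tag][cur_tag] += 1
    return trans_counts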

It's helpful to think of your processing in five steps:

•       Count occurrences of tags, tag pairs, tag/word pairs. 

•       Compute smoothed probabilities 

•       Take the log of each probability 

•       Construct the trellis. Notice that for each tag/time pair, you must store not only the probability of the best path but also a backpointer to the previous tag/time pair on that path (see the sketch after this list).

•       Return the best path through the trellis. 
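Below is a sketch of steps 4 and 5 (trellis construction and backtrace). It assumes log-space transition scores log_trans[prev][tag] and an emission function log_emit(tag, word) have already been built from your smoothed counts; both names are placeholders rather than part of the provided code:

import math

# A sketch of trellis construction and backtrace, assuming `words` starts
# with the START token and `tags` includes "START".
def viterbi_decode(words, tags, log_trans, log_emit):
    # v[t][tag] = best log probability of any path ending in `tag` at time t
    # back[t][tag] = previous tag on that best path
    v = [{tag: (0.0 if tag == "START" else -math.inf) for tag in tags}]
    back = [{}]
    for t, word in enumerate(words[1:], start=1):
        v.append({})
        back.append({})
        for tag in tags:
            best_prev, best_score = None, -math.inf
            for prev in tags:
                score = v[t - 1][prev] + log_trans[prev][tag] + log_emit(tag, word)
                if score > best_score:
                    best_prev, best_score = prev, score
            v[t][tag] = best_score
            back[t][tag] = best_prev
    # Follow backpointers from the best final tag to recover the path.
    tag = max(v[-1], key=v[-1].get)
    path = [tag]
    for t in range(len(words) - 1, 0, -1):
        tag = back[t][tag]
        path.append(tag)
    return list(reversed(path))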

You'll need to use smoothing to get good performance. Make sure that your code for computing transition and emission probabilities never returns zero. Use Laplace smoothing to handle zero-probability cases when calculating the initial, transition, and emission probabilities.

For example, to smooth the emission probabilities, consider each tag individually. For a fixed tag T, you need to ensure that P_e(W|T) produces a non-zero number no matter what word W you give it. You can use Laplace smoothing (as in MP 3) to fill in a probability for "UNKNOWN", which will be the return value for all words W that were not seen in the training data. For this initial implementation of Viterbi, use the same Laplace smoothing constant α for all tags.
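A minimal sketch of such emission smoothing, assuming emit_counts[tag][word] and tag_counts[tag] come from the counting step (all names are illustrative):

import math

# Laplace-smoothed emission log probabilities with one shared constant alpha.
def make_log_emit(emit_counts, tag_counts, alpha=1e-5):
    def log_emit(tag, word):
        count = emit_counts[tag].get(word, 0)   # 0 for unseen (word, tag) pairs
        vocab_size = len(emit_counts[tag])      # distinct words seen with this tag
        # The +1 in the denominator reserves probability mass for UNKNOWN words.
        return math.log((count + alpha) /
                        (tag_counts[tag] + alpha * (vocab_size + 1)))
    return log_emit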

This simple version of Viterbi will perform worse than the baseline code on the Brown development dataset. However, you should notice that it does better on multiple-tag words (e.g., over 93%). You should write this simple version of Viterbi under the viterbi_1 function in viterbi_1.py.



test_viterbi.py: write your Viterbi implementation in this file (inside the test_viterbi/ directory) and run it using:

python3 test_viterbi.py 

Once you get this working (if you started Part 2 here), you can extend it to the Brown dataset, where you will additionally need to compute initial tag probabilities, handle unseen words, and implement smoothing correctly.

Part 3 Viterbi_2

The previous Viterbi tagger fails to beat the baseline because it does very poorly on unseen words. It assumes that all tags have similar probability for these words, but we know that a new word is much more likely to have the tag NOUN than (say) CONJ. For this part, you'll improve your emission smoothing to match the real probabilities for unseen words.

Words that appear zero times in the training data (out-of-vocabulary or OOV words) and words that appear once in the training data (hapax words) tend to have similar parts of speech (POS). For this reason, instead of assuming that OOV words are uniformly distributed across all POS, we can get a much better estimate of their distribution by measuring the distribution of hapax words. Extract these words from the training data and calculate the probability of each tag on them. When you do your Laplace smoothing of the emission probabilities for tag T, scale the Laplace smoothing constant by P(T|hapax), i.e., the probability that tag T occurs given that the word was hapax.

Remember that Laplace smoothing acts by reducing probability mass for high-frequency words and re-assigning some of that probability mass to low-frequency words. A large smoothing constant can skew the probability masses a lot, so experiment with small orders of magnitude for this hyperparameter.
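A hedged sketch of the hapax computation, under the same assumed data shapes (all names illustrative):

from collections import Counter

# Measure the tag distribution over hapax words (words occurring exactly
# once in training), then scale the Laplace constant per tag by P(T | hapax).
def hapax_tag_distribution(train):
    word_counts = Counter(word for sentence in train for word, _ in sentence)
    hapax_tags = Counter(tag for sentence in train
                         for word, tag in sentence
                         if word_counts[word] == 1)
    total = sum(hapax_tags.values()) or 1
    return {tag: count / total for tag, count in hapax_tags.items()}

def scaled_alpha(tag, hapax_probs, alpha=1e-5, floor=1e-8):
    # Tags never seen on a hapax word get a tiny floor so their smoothed
    # emission probabilities stay nonzero.
    return alpha * hapax_probs.get(tag, floor)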

This optimized version of the Viterbi code should have significantly better unseen-word accuracy on the Brown development dataset. It should also beat the baseline on overall accuracy. You should write your optimized version of Viterbi under the viterbi_2 function in viterbi_2.py.

The hapax word tag probabilities may be different from one dataset to another. So your viterbi_2 method should compute them dynamically from its training data each time it runs.

Hints
•       Tag 'X' rarely occurs in the dataset. Setting a high value for the Laplace smoothing constant may overly smooth the emission probabilities and break your statistical computations. A small value for the Laplace smoothing constant, e.g. 1e-5, may help. 

•       It's not advisable to use global variables in your implementation, since Gradescope runs a number of different tests within the same Python environment. Global values set during one test will carry over to subsequent tests.

