ASSIGNMENT: You are to design and implement an HMM-based model from scratch. The dataset stores text in the format “token/PoS”, where “token” is the word and “PoS” is the part-of-speech tag of that token in the current context. Treat the PoS tags as the hidden states and the tokens as the visible outputs.
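A minimal parsing sketch, assuming a plain-text file with one sentence per line and whitespace-separated “token/PoS” pairs (the exact file layout is an assumption):

    def read_tagged_sentences(path):
        # Parse one sentence per line; each item looks like "token/PoS".
        sentences = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                pairs = []
                for item in line.split():
                    if "/" not in item:
                        continue  # skip malformed items
                    # rsplit guards against tokens that themselves contain "/"
                    token, pos = item.rsplit("/", 1)
                    pairs.append((token, pos))
                if pairs:
                    sentences.append(pairs)
        return sentences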
Steps to execute:
1. Format all sentences with “start of sentence” and “end of sentence” tokens.
2. Split the data 80:20 into training and testing sets.
3. Create a vocabulary of all the tokens in the training text with their counts (steps 1–3 are sketched in code after this list).
4. Using the training dataset, estimate the initial probabilities (probability of a PoS tag given the “start of sentence” token), the state transition probabilities (probability of a PoS tag given the previous PoS tag), and the emission probabilities (probability of a token given the current PoS tag). Use the Baum–Welch algorithm (an estimation sketch follows this list).
5. Remove the PoS information from the test data (keeping the gold tags aside for evaluation).
6. Now code solutions to the following classical HMM problems:
● Evaluation problem: given the HMM and an observation sequence O, compute the probability that the model M generated O. Do this for all the sentences in the test set and report the probabilities for all utterances (a forward-algorithm sketch follows this list).
● Decoding problem: given the HMM and an observation sequence, compute the most likely sequence of hidden states that produced the observations. Design the greedy/Viterbi solution here. Do this for all the sentences in the test set, verify whether the predicted sequences match the state sequences of the test set, and report the overall accuracy (a Viterbi sketch follows this list).
● For the decoding problem, now design and code the beam-search variation (a sketch follows this list). Run it on all the sentences in the test set, verify whether the predicted sequences match the state sequences of the test set, and report the overall accuracy. Compare with the Viterbi algorithm. Also verify whether beam search with beam_width=1 behaves exactly like the Viterbi algorithm (i.e., the same sequences are generated for a given input text).
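The sketches below are one possible Python implementation, not the required one. First, steps 1–3: bracket each sentence with start/end markers (the names “<s>”/“</s>” are assumptions), split 80:20, and count the training vocabulary:

    import random
    from collections import Counter

    BOS, EOS = "<s>", "</s>"   # assumed start/end pseudo-tokens

    def prepare(sentences, seed=0):
        # Step 1: bracket every sentence with start/end pseudo-tokens,
        # tagged with matching pseudo-states.
        framed = [[(BOS, BOS)] + s + [(EOS, EOS)] for s in sentences]
        # Step 2: shuffle, then split 80:20 into train and test.
        random.Random(seed).shuffle(framed)
        cut = int(0.8 * len(framed))
        train, test = framed[:cut], framed[cut:]
        # Step 3: vocabulary with counts, from the training half only.
        vocab = Counter(tok for sent in train for tok, _ in sent)
        return train, test, vocab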
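Step 4: because the training tags are visible, the initial, transition, and emission tables can be estimated directly by relative-frequency counting (the supervised maximum-likelihood route). Baum–Welch, which the assignment names, re-estimates these same tables by expectation–maximization when the tags are treated as hidden; the counts below are the standard starting point for it. The smoothing floor is an assumption (continuing the names above):

    from collections import Counter, defaultdict

    def estimate(train, smoothing=1e-6):
        trans = defaultdict(Counter)   # PoS -> next-PoS counts
        emit = defaultdict(Counter)    # PoS -> token counts
        for sent in train:
            for (_, pos), (_, nxt) in zip(sent, sent[1:]):
                trans[pos][nxt] += 1
            for tok, pos in sent:
                emit[pos][tok] += 1
        states = sorted(set(emit) - {BOS, EOS})   # real PoS tags only
        vocab = sorted({t for c in emit.values() for t in c})

        def norm(counter, keys):
            # Turn counts into smoothed probabilities over the given keys.
            total = sum(counter.values()) + smoothing * len(keys)
            return {k: (counter[k] + smoothing) / total for k in keys}

        pi = norm(trans[BOS], states)                    # initial: P(PoS | <s>)
        A = {s: norm(trans[s], states) for s in states}  # transition: P(PoS' | PoS)
        B = {s: norm(emit[s], vocab) for s in states}    # emission: P(token | PoS)
        return states, pi, A, B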
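Evaluation problem: the standard solution is the forward algorithm, which sums over all state paths; the sketch works in log space to avoid underflow on long sentences. Reporting log_forward(...) for every test sentence gives the per-utterance probabilities the assignment asks for:

    import math

    def _logsumexp(xs):
        # Numerically stable log(sum(exp(x))) over a list of log-values.
        m = max(xs)
        return m + math.log(sum(math.exp(x - m) for x in xs))

    def log_forward(tokens, states, pi, A, B, unk=1e-8):
        # unk is an assumed floor probability for unseen tokens.
        lp = lambda p: math.log(p if p > 0 else unk)
        alpha = {s: lp(pi[s]) + lp(B[s].get(tokens[0], unk)) for s in states}
        for tok in tokens[1:]:
            alpha = {s: lp(B[s].get(tok, unk)) +
                        _logsumexp([alpha[r] + lp(A[r][s]) for r in states])
                     for s in states}
        return _logsumexp(list(alpha.values()))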
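Decoding problem: a Viterbi sketch that keeps, per state and per position, the best-scoring path and a back-pointer, then traces back from the best final state (same imports and floor as above):

    import math

    def viterbi(tokens, states, pi, A, B, unk=1e-8):
        lp = lambda p: math.log(p if p > 0 else unk)
        score = {s: lp(pi[s]) + lp(B[s].get(tokens[0], unk)) for s in states}
        backptrs = []
        for tok in tokens[1:]:
            new_score, ptr = {}, {}
            for s in states:
                # Best predecessor state for s at this position.
                best = max(states, key=lambda r: score[r] + lp(A[r][s]))
                ptr[s] = best
                new_score[s] = score[best] + lp(A[best][s]) + lp(B[s].get(tok, unk))
            backptrs.append(ptr)
            score = new_score
        # Follow back-pointers from the best final state.
        state = max(states, key=score.get)
        path = [state]
        for ptr in reversed(backptrs):
            state = ptr[state]
            path.append(state)
        return path[::-1]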
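Beam-search variation: instead of one best path per state, keep only the beam_width highest-scoring partial hypotheses overall at each position. Note that a width-1 beam extends a single hypothesis greedily, so comparing its output with Viterbi’s, as the assignment asks, is an empirical check:

    import math

    def beam_search(tokens, states, pi, A, B, beam_width=3, unk=1e-8):
        lp = lambda p: math.log(p if p > 0 else unk)
        # Each hypothesis is (state path so far, accumulated log-score).
        beam = [([s], lp(pi[s]) + lp(B[s].get(tokens[0], unk))) for s in states]
        beam = sorted(beam, key=lambda h: h[1], reverse=True)[:beam_width]
        for tok in tokens[1:]:
            cand = [(path + [s],
                     sc + lp(A[path[-1]][s]) + lp(B[s].get(tok, unk)))
                    for path, sc in beam for s in states]
            # Prune to the beam_width best partial hypotheses.
            beam = sorted(cand, key=lambda h: h[1], reverse=True)[:beam_width]
        return beam[0][0]   # best-scoring full hypothesis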
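A small helper, under the same assumptions, for the overall-accuracy reports asked for in the decoding items; the wiring at the bottom (including the file name "corpus.txt") is hypothetical:

    def tag_accuracy(test, decode):
        # Token-level accuracy of a decoder against the gold tags.
        correct = total = 0
        for sent in test:
            tokens = [t for t, _ in sent[1:-1]]   # strip <s>/</s>
            gold = [p for _, p in sent[1:-1]]
            pred = decode(tokens)
            correct += sum(g == p for g, p in zip(gold, pred))
            total += len(gold)
        return correct / total

    # Example wiring of the pieces above:
    # train, test, vocab = prepare(read_tagged_sentences("corpus.txt"))
    # states, pi, A, B = estimate(train)
    # print(tag_accuracy(test, lambda toks: viterbi(toks, states, pi, A, B)))
    # print(tag_accuracy(test, lambda toks: beam_search(toks, states, pi, A, B, 3)))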