Introduction
Here you will implement and train a Neural Network LM instead of the Markov chain model used in HW2. Your Neural Language Model (NLM) will be evaluated by a downstream task: the text classification you did in HW1. The primary goal is to give you hands-on experience with neural n-gram language models. Understanding how these neural models work will help you understand not just language modeling, but also the pipeline of a common NLP task. This assignment is also more "from scratch" than the previous ones, which will help you prepare for the final project. Make sure you have installed PyTorch and NumPy before working on this assignment.
Neural Language Model
In this LM task, you need to build a vocabulary and learn an optimal word embedding for every word in this vocabulary by training a neural network. You need to compute the loss function on some training data, then update the parameters and the word embeddings with backpropagation. Recall that in the n-gram language model from HW2, given a sequence of words w, we want to compute the probability

    P(w_{k+1} | w_1, ..., w_k)

where w_k is the k-th word of the length-k context sequence. In this model, you should maximize the probability of the correct word w_{k+1} given the latent representation of the context words. Note that your k-grams will come from a corpus where the first k words are the context features and the word w_{k+1} is the training label. Some useful background can be found in the class lecture notes, slides 58-64, CSE597-Wk5-Mtg9-FFNs.pdf (and the associated readings).
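For concreteness, the 3-gram case used in this assignment can be written out as follows (a worked form of the standard Markov approximation; the notation below is illustrative, not taken from the skeleton code):

% 3-gram (trigram) approximation: each word depends only on its two predecessors.
P(w_1, \ldots, w_N) \approx \prod_{k=3}^{N} P(w_k \mid w_{k-2}, w_{k-1})

% Training objective for the NLM: minimize the negative log-likelihood of the
% correct next word given the embedded context.
\mathcal{L}(\theta) = -\sum_{k} \log P_{\theta}(w_{k+1} \mid w_{k-1}, w_k)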
Task: N-Gram Neural Language Model
Modify the code in NLM.py to implement a simple 3-gram language model. Make sure that your loss decreases during training and that the learned word embeddings can be used in the downstream task (text classification) that you finished in HW1. These are the steps you need to complete for this task:
Step 1: Modify the class NgramLM(nn.Module), which defines the NLM model; a minimal sketch is given below.
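The following is a minimal sketch of such a class, assuming a feed-forward architecture over the concatenated context embeddings; the hyperparameter names (embedding_dim, context_size, hidden_dim) are illustrative and need not match the skeleton in NLM.py.

import torch
import torch.nn as nn

class NgramLM(nn.Module):
    def __init__(self, vocab_size, embedding_dim=50, context_size=2, hidden_dim=128):
        super().__init__()
        # One embedding vector per vocabulary word.
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        # The two context words of a 3-gram are concatenated and fed through
        # a small feed-forward network that scores the whole vocabulary.
        self.fc1 = nn.Linear(context_size * embedding_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context_idxs):
        # context_idxs: (batch_size, context_size) tensor of word indices.
        embeds = self.embeddings(context_idxs)            # (batch, context, dim)
        embeds = embeds.view(context_idxs.shape[0], -1)   # flatten the context
        hidden = torch.relu(self.fc1(embeds))
        return self.fc2(hidden)                           # logits over the vocabulary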
Step 2: Use the NgramLM class to implement the training() function; note that you should define the optimizer and the loss function first. You should experiment with different loss functions. You may need a DataLoader to increase the batch size. A sketch of one possible training loop follows.
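This is a minimal training-loop sketch, assuming the 3-gram contexts and target words have already been converted to index tensors; the argument names (contexts, targets, epochs, batch_size, lr) are illustrative assumptions, not the exact signature in NLM.py, and nn.CrossEntropyLoss is just one candidate loss function.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def training(model, contexts, targets, epochs=10, batch_size=64, lr=1e-3):
    # contexts: (N, 2) long tensor of context-word indices; targets: (N,) long tensor.
    loader = DataLoader(TensorDataset(contexts, targets), batch_size=batch_size, shuffle=True)
    criterion = nn.CrossEntropyLoss()                     # one candidate loss function
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    for epoch in range(epochs):
        total_loss = 0.0
        for batch_contexts, batch_targets in loader:
            optimizer.zero_grad()
            logits = model(batch_contexts)                # (batch, vocab_size)
            loss = criterion(logits, batch_targets)
            loss.backward()                               # backpropagation
            optimizer.step()
            total_loss += loss.item()
        print(f"epoch {epoch + 1}: average loss = {total_loss / len(loader):.4f}")
    return model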
Step 3: Finish def output(model, file1, file2). In this function, you need to write the embedding vocabulary to disk in the format of a GloVe embedding file, called embedding.txt, so that it can be used in the downstream classification task. You are to conduct a controlled experiment comparing a condition in which words are initialized with random embeddings of the same dimensionality as the GloVe lexicon against a condition in which you use the GloVe vectors. The random embedding vocabulary has the same words and the same embedding dimension as the n-gram embedding vocabulary, but its embedding vectors are randomly initialized; we call this file random_embedding.txt. A sketch of this output step is shown below.
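Here is a minimal sketch of writing both files in GloVe text format (one word per line, followed by its vector values). It assumes a vocab dictionary mapping words to embedding-matrix rows, which is an extra argument not present in the given output(model, file1, file2) signature.

import torch

def output(model, vocab, file1="embedding.txt", file2="random_embedding.txt"):
    # vocab: dict mapping each word to its row index in the embedding matrix (an assumption).
    weights = model.embeddings.weight.detach()            # learned vectors, (vocab_size, dim)
    random_weights = torch.randn_like(weights)            # same shape, randomly initialized

    with open(file1, "w", encoding="utf-8") as f1, \
         open(file2, "w", encoding="utf-8") as f2:
        for word, idx in vocab.items():
            learned = " ".join(f"{v:.6f}" for v in weights[idx].tolist())
            rand = " ".join(f"{v:.6f}" for v in random_weights[idx].tolist())
            f1.write(f"{word} {learned}\n")               # GloVe format: word v1 v2 ... vd
            f2.write(f"{word} {rand}\n")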
Step 4: Conduct experiments on the text classification task. You are asked to test the performance differences between initializing with random_embedding.txt, embedding.txt, and glove.6B.50d.txt, using the classification model from HW1 (for which a reference solution is provided in this project). Since all three files share the GloVe text format, they can be loaded the same way, as in the sketch below.
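A minimal loader sketch for any of the three embedding files, assuming the standard GloVe text layout; the path in the usage line is an example and should be adjusted to wherever your files live.

import numpy as np

def load_embeddings(path):
    # Each line is: word v1 v2 ... vd (whitespace-separated).
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split()
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

# The same loader works for random_embedding.txt, embedding.txt, and
# glove.6B.50d.txt, so swapping the initialization only requires changing the path.
embeddings = load_embeddings("embedding.txt")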
The zip file associated with this homework has three *.py files (classifier.py, NLM.py, run.py), a glove folder with a *.txt file containing the GloVe vectors, and a data folder with three sizes of review texts and labels. reviews_100.txt and labels_100.txt are small, so they can be used for development (as a dev set) and debugging. reviews_500.txt and labels_500.txt are the dataset for this assignment (both the NLM part and the downstream part), and you will be graded on them. reviews.txt and labels.txt are much bigger files than the others; they will be used to evaluate the bonus part.
Bonus: 10 points extra credit
Training the NLM on a large dataset is very time consuming, so we only use 500 examples from the dataset to train the NLM and test the classification task. However, there are several methods or tricks to reduce the training cost, such as enlarging the training batch size or sampling. If you can train the NLM on the whole dataset within the time limit (1 hour) on a common laptop machine (without a GPU) and get an accuracy > 85% on the classification task, you will get extra credit (10 points).
Note that you can use PyTorch's saving and loading utilities to conduct your experiments (a small sketch follows). Feel free to implement any other helpful functions, but other packages are excluded.
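A small checkpointing sketch, assuming the trained model and the NgramLM class and vocab dictionary from the steps above; the file name ngram_lm.pt is an arbitrary choice.

import torch

# Save the trained parameters (including the embedding matrix).
torch.save(model.state_dict(), "ngram_lm.pt")

# Later: rebuild the architecture and restore the parameters before running
# the downstream experiments.
model = NgramLM(vocab_size=len(vocab))
model.load_state_dict(torch.load("ngram_lm.pt"))
model.eval()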
Questions
1. How did the choice of initialization of word embeddings affect training of the LM and/or the performance of the embeddings in the HW1 classifier?
2. Explain your choice of loss function, based on a comparison with at least one other loss function.