You are becoming a researcher in NLP with Deep Learning through this programming assignment! You will implement a neural network architecture for Reading Comprehension using the recently published Stanford Question Answering Dataset (SQuAD) [1].[1]
SQuAD consists of around 100K question-answer pairs, along with their context paragraphs. The context paragraphs were extracted from a set of Wikipedia articles. Humans generated questions using each paragraph as context, and selected a span from the same paragraph as the target answer. The following is an example of a triplet ⟨question, context, answer⟩:
Question: Why was Tesla returned to Gospić?
Context paragraph: On 24 March 1879, Tesla was returned to Gospić under police guard for not having a residence permit. On 17 April 1879, Milutin Tesla died at the age of 60 after contracting an unspecified illness (although some sources say that he died of a stroke). During that year, Tesla taught a large class of students in his old school, Higher Real Gymnasium, in Gospić.
Answer: not having a residence permit
In the SQuAD task, answering a question is defined as predicting an answer span within a given context paragraph.
In the first sections of this assignment, we describe the starter code that helps with preprocessing and evaluation procedures. After that, we give a few hints as to how you might approach the problem, and point you at some recent papers that have attempted the task. This is an open-ended assignment, where you should be working out how to do high-performance question answering on the SQuAD dataset. To meet the goal of the assignment, we expect you to explore different models by using what you have learned in class combined with your findings in the literature. Finally, we give you some practical tips on model training, and instructions on how to evaluate your model and submit it to our internal leaderboard.
1 Setup
Once you have downloaded the assignment from the course website, place it in a directory of your convenience (here referred to as pa4). Before proceeding, please read the README.md file for additional information about the code setup and requirements. In particular, the starter code assumes a Python 2.7 installation with TensorFlow 0.12.1.[2]
In order to download the dataset and start the preprocessing, run the following commands under the main directory pa4. They will install several Python packages with pip and download about 862MB of GloVe word vectors (see section 1.3 below).
    code/get_started.sh
    python code/qa_data.py
The file qa_data.py takes an argument --glove_dim so that you can specify the GloVe word-embedding dimensionality that you want; it will then process only the dimension you specify.
1.1 Starter Code
The starter code provided for the assignment includes the following folders and files:
code/: a folder containing the starter code:
docker/:
Dockerfile: specification of a Docker image for running your code, necessary for submitting the assignment.[3]
preprocessing/: code to prepare the data to be consumed by the model:
dwr.py: downloads and stores the distributed word representations (word embeddings).
squad_preprocess.py: utilities to download the original dataset, parse the JSON files, extract paragraphs, questions, and answers, and tokenize them. It also splits the train dataset into train and validation.
get_started.sh: the first script to execute; it downloads and preprocesses the dataset.
evaluate.py: the original evaluation script from SQuAD. Your model can import evaluation functions from this file, and you should not change this file.
qa_data.py: the code that reads the preprocessed data and prepares it to be consumed by the model.
qa_model.py: the TensorFlow model definition. It contains the main API for the model architecture. You are free to change any of the code inside this file; you can delete everything and start from scratch.
train.py: responsible for initialization, construction of the model, and building an entry point for the model.
qa_answer.py: takes in a JSON file and outputs another JSON file containing the predictions of your model. You must change the specified section in order to run this file.
data/: a folder hosting the dataset downloads as well as the preprocessed data.
train/: a folder containing the saved TensorFlow models.
log/: a folder for logging purposes.
1.2 Dataset
After the download, the SQuAD dataset is placed in the data/squad folder. SQuAD downloaded files include train and dev files in JSON format:
train-v1.1.json: a train dataset with around 87k question-answer pairs.
dev-v1.1.json: a dev dataset with around 10k question-answer pairs.[4]
Apart from the target answer text, SQuAD also provides the starting position of the answer within the context, measured in characters.
Note that there is no test dataset publicly available: it is kept by the authors of SQuAD to ensure fairness in model evaluations. While developing the model in this assignment, we will treat the dev set as our test set for all purposes, i.e., we won't be using it until initial model development is done. Instead, we split the supplied train dataset into two parts: a 95% slice for training, and the remaining 5% for validation, including hyperparameter search. We refer to these as train.* and val.* in filenames. Finally, you will be able to upload your model as a bundle to CodaLab[5], where you can evaluate it on the unseen test dataset.
1.3 Distributed Word Representations
The get_started.sh script downloads GloVe word embeddings of dimensionality d = 50, 100, 200, and 300 and vocab size of 400k that have been pretrained on Wikipedia 2014 and Gigaword 5. The word vectors are stored in the data/dwr subfolder.
The file qa_data.py will trim the GloVe embedding with the given dimension (by default d = 100) into a much smaller file. Your model only needs to load in that trimmed file. Feel free to remove the data/dwr subfolder after preprocessing is finished.
Consider This [Distributed word representations]: The embeddings used by default have dimensionality d = 100 and were pretrained on a corpus of 6B tokens (Wikipedia + Gigaword). The vocabulary is uncased (all lowercase). Analyze the effect of selecting different embeddings for this task, e.g., other families of algorithms, larger dimensionality, different training corpora, et cetera.
1.4 Data Preprocessing
The starter code provides a preprocessing step that turns the original JSON files into four files: the tokenized questions, the tokenized contexts, the tokenized answers, and the answer spans. Lines in these files are aligned.
Each line in the answer span file contains two numbers: the first number is the index of the first word of the answer in the context paragraph, and the second number is the index of the last word of the answer in the context paragraph.
The first step is to get familiar with the dataset. Explore SQuAD and keep track of the values you may later use to limit, for example, the output size of the model. As a guide, plot histograms for context paragraph lengths, question lengths, and answer lengths.
Consider This: The preprocessing step takes the answer as a sequence of words and transforms it into two numbers. What are different possible ways to represent the answer?
Consider This [Improve tokenization]: The provided preprocessing uses NLTK. This process will result in skipping some triplets. If you want to explore other ways to tokenize the data, you can use other tools, such as Stanford CoreNLP, and try to reduce the number of skipped triplets.
2 Model Implementation
The goal of this assignment is an open-ended exploration of reading comprehension using the SQuAD task. The SQuAD website contains a leaderboard with a set of models that researchers have submitted for evaluation, and we strongly suggest that you spend time familiarizing yourself with different models, identifying key ideas from these papers that you may end up using in your own model. References to the most relevant papers are provided in section 4.
The following sections provide one possible, simple breakdown of the SQuAD reading comprehension task. If you are unfamiliar with the task, reading them can help you make multiple architectural decisions and navigate the literature.
2.1 Problem Setup
In the SQuAD task, the goal is to predict an answer span tuple $\{a_s, a_e\}$ given a question of length $n$, $q = \{q_1, q_2, \ldots, q_n\}$, and a supporting context paragraph $p = \{p_1, p_2, \ldots, p_m\}$ of length $m$. Thus, the model learns a function that, given a pair of sequences $(q, p)$, returns a sequence of two scalar indices $\{a_s, a_e\}$ indicating the start position and end position of the answer in paragraph $p$, respectively. Note that in this task $a_s \le a_e$ and $0 \le a_s, a_e \le m$.
Consider This: There are other formalizations of this problem. For instance, instead of predicting an answer span tuple $\{a_s, a_e\}$, you can predict the real answer word by word as $\{a_1, a_2, \ldots, \langle\text{eos}\rangle\}$, but we leave these alternative formalizations for you to explore.
2.2 Architecture
Using what you have learned in class and from reading about the problem, you should find a way to encode the question and paragraph into a continuous representation. Good models read the paragraph with the question in mind, just like a human would. We refer to this as the conditional understanding of the text.
You are strongly encouraged to get something simple and straightforward working first. A possible simple first model for this might be (a rough sketch in TensorFlow follows after the discussion below):
1. Run a BiLSTM over the question, concatenate the two end hidden vectors, and call that the question representation.
2. Run a BiLSTM over the context paragraph, conditioned on the question representation.
3. Calculate an attention vector over the context paragraph representation based on the question representation.
4. Compute a new vector for each context paragraph position that multiplies the context-paragraph representation with the attention vector.
5. Run a final LSTM that does a 2-class classification of these vectors as O or ANSWER.
Note that this outline only describes a very naive baseline. Step 2 refers to sequence-to-sequence attention: when you compute the hidden state at each paragraph position, you take into account all hidden states of the question. This step creates a mixture of the question and paragraph representations. Step 3 is another attention step, where you compare the last hidden state of the question to all paragraph hidden states computed in Step 2. The attention vector you produce is similar to a "pointer" that points to the most significant paragraph hidden states.
If you prefer more detailed guidance, consider reading MatchLSTM [2] as a good starting point; it is similar to the baseline we are describing here.
Consider This [Question and Context Paragraph representations]: Different ways of fusing the representations of the question and the context paragraph have been addressed in the literature. The most relevant are Dynamic Coattention Network (DCN) [3] and the Bilateral Multi-Perspective Matching [4]. Take key ideas from these and possibly other papers and use them to improve your encoder.