ECE324 - Assignment 5

Subjective/Objective Sentence Classification Using Word Vectors and NLP
This assignment must be done individually. You can find the mark associated with each major section. You will be marked based on the correctness of your implementation, your results, and your answers to the required questions in each section.

Learning Objectives
In this assignment you will:

See how text data is processed using the torchtext and spacy libraries
Make use of pre-trained word vectors as a basis for classifying text
Implement a basic, a convolutional and a recurrent neural network architecture for text classification
Do a full train-validate-test data split.
Build a simple, interactive application using all three models.
1           Sentence Classification - Problem Definition
Natural language processing, as we have discussed it in class, can provide the ability to work with the meaning of written language. As an illustration of that, in this assignment we will build models that try to determine if a sentence is objective (a statement based on facts) or subjective (a statement based on opinions).

In class we have described the concept and method to convert words (and possibly groups of words) into a vector (also called an embedding) that represents the meaning of the word. In this assignment we will make use of word vectors that have already been created (actually, trained), and use them as the basis for the three classifiers that you will build. The word vectors will be brought into your program and used to convert each word into a vector.

When working from text input, we need to introduce a little terminology from the NLP domain: each word is first tokenized, i.e. made into word tokens. This first step has some complexity – for example, “I’m” should be separated into “I” and “am”, while “Los Angeles” should be considered together as a single word/token. After tokenization, each word is converted into an identifying number (which is referred to both as its index or simply as a word token). With this index, the correct word vector can be retrieved from a lookup table, which is referred to as the embedding matrix.
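To make this concrete, below is a minimal sketch of tokenization and index lookup using spacy. The vocabulary dictionary here is made up purely for illustration; in Section 3.2, torchtext builds the real vocabulary for you.

import spacy

nlp = spacy.load('en')  # the English model downloaded in Section 2.1

# Tokenize the example sentence from Figure 1.
tokens = [tok.text for tok in nlp("The fight scenes are fun")]
print(tokens)    # ['The', 'fight', 'scenes', 'are', 'fun']

# A made-up vocabulary mapping each (lower-cased) token to an index;
# torchtext's build_vocab (Section 3.2) constructs this lookup automatically.
stoi = {'<unk>': 0, '<pad>': 1, 'the': 2, 'fight': 3, 'scenes': 4, 'are': 5, 'fun': 6}
indices = [stoi.get(t.lower(), stoi['<unk>']) for t in tokens]
print(indices)   # [2, 3, 4, 5, 6]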

These indices are passed into different neural network models in this assignment to achieve the classification – subjective or objective – as illustrated below:

Figure 1: High-level diagram of the assignment's subjective/objective classifiers. A text sentence (e.g. "The fight scenes are fun") is tokenized and converted into discrete tokens (indices such as 32, 4, 427, 453), which the model maps to an output probability of being subjective.

Note that the first ‘layer’ of the neural network model will actually be the step that converts the index/token into a word vector. (This could have been done on all of the training examples in advance, but that would hugely increase the amount of memory required to store the examples.) From there on, the neural network deals only with the word vectors.

2           Setting Up Your Environment
2.1         Installing Libraries
In addition to PyTorch, we will be using two other libraries:

torchtext (https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html): This package consists of both data processing utilities and popular datasets for natural language, and is compatible with PyTorch. We will be using torchtext to process the text inputs into numerical inputs for our models.
SpaCy (https://spacy.io/): For ‘tokenizing’ English words. A text input is a sequence of symbols (letters, spaces, numbers, punctuation, etc.). The process of tokenization separates the text into units (such as words) that have linguistic significance, as described above in Section 1.
Install these two packages using the following commands:

pip install torchtext spacy
python -m spacy download en

2.2         Dataset
We will use the Subjectivity dataset [2], introduced in the paper by Pang and Lee [5]. The data comes from portions of movie reviews from Rotten Tomatoes [3] (which are all assumed to be subjective) and summaries of movie plots from the Internet Movie Database (IMDB) [1] (which are all assumed to be objective). This approach to labeling the training data as objective or subjective may not be strictly correct, but it will work for our purposes.

3           Preparing the data
3.1         Create train/validation/test splits
The data for this assignment was provided in the file you downloaded from Quercus. It contains the file data.tsv, which is a tab-separated-value (TSV) file. It contains 2 columns, text and label. The text column contains a text string (including punctuation) for each sentence (or fragment or multiple sentences) that is a data sample. The label column contains a binary value {0,1}, where 0 represents the objective class and 1 represents the subjective class.

As discussed in class, we will now use proper data separation, dividing the available data into three datasets: training, validation and test. Write a Python script split_data.py to split data.tsv into 3 files:

train.tsv: this file should contain 64% of the total data
validation.tsv: this file should contain 16% of the total data
test.tsv: this file should contain 20% of the total data
In addition, it is crucial to make sure that there is an equal number of examples from the two classes in each of the train, validation, and test sets. Have your script print out the number of examples in each class for each split, and provide those numbers in your report.

Finally, create a fourth dataset, called overfit.tsv, also with equal class representation, that contains only 50 training examples, for use in debugging your models below.
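A minimal sketch of what split_data.py might look like is given below, assuming data.tsv sits in a data/ folder and using pandas and scikit-learn's train_test_split for the stratified splitting; any approach that produces the required proportions with balanced classes is equally acceptable.

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('data/data.tsv', sep='\t')

# 64% train; the remaining 36% is split into 16% validation and 20% test.
# stratify keeps the subjective/objective classes balanced in every split.
train, rest = train_test_split(df, train_size=0.64, stratify=df['label'], random_state=0)
val, test = train_test_split(rest, train_size=16 / 36, stratify=rest['label'], random_state=0)

# 50-example debugging set with equal class representation (25 per class).
overfit = pd.concat([train[train['label'] == c].head(25) for c in (0, 1)])

for name, split in [('train', train), ('validation', val), ('test', test), ('overfit', overfit)]:
    print(name, split['label'].value_counts().to_dict())   # report these numbers
    split.to_csv('data/' + name + '.tsv', sep='\t', index=False)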

3.2         Process the input data
The torchtext library is very useful for handling natural language text; we will provide the basic processing code to bring in the dataset and prepare it to be converted into word vectors. If you wish to learn more, the following tutorial includes example uses of the library: https://medium.com/@sonicboom8/sentiment-analysis-torchtext-55fb57b1fab8. The code described in this section is already present in the skeleton code file main.py.

Below is a description of that code in the skeleton main.py that preprocesses the data:

1. The Field object tells torchtext how each column in the TSV file will be processed when passed into the TabularDataset object. The following code instantiates two torchtext.data.Field objects, one for the “text” (sentences) column and one for the “label” column of the TSV data:
TEXT = data.Field(sequential=True, lower=True, tokenize='spacy', include_lengths=True)

LABELS = data.Field(sequential=False, use_vocab=False)

Details: https://torchtext.readthedocs.io/en/latest/data.html#torchtext.data.Field

2. Next we load the train, validation, and test datasets, as was done in the previous assignments, using the torchtext method TabularDataset.splits, which is designed specifically for text input. main.py uses the following code, which assumes that the tsv files are in the folder data:
train_data, val_data, test_data = data.TabularDataset.splits(
    path='data/', train='train.tsv', validation='validation.tsv', test='test.tsv',
    format='tsv', skip_header=True,
    fields=[('text', TEXT), ('label', LABELS)])

Details: https://torchtext.readthedocs.io/en/latest/data.html#torchtext.data.TabularDataset

3. Next we need to create objects that can be enumerated (Python-style) in the training loops - these are the objects that produce each batch in the training loop. The items in each batch are accessed using the .text and .label fields that were specified in the Field objects above.
The iterators for the train/validation/test splits created earlier are built using data.BucketIterator, as shown below. This class will ensure that, within a batch, the sentences are as similar in length as possible, to minimize the amount of padding needed.

train_iter, val_iter, test_iter = data.BucketIterator.splits(
    (train_data, val_data, test_data),
    batch_sizes=(args.batch_size, args.batch_size, args.batch_size),
    sort_key=lambda x: len(x.text), device=None,
    sort_within_batch=True, repeat=False)

4. The Vocab object will contain the index (also called the word token) for each unique word in the data set. This is done using the build_vocab function, which looks through all of the sentences in the given datasets:
TEXT.build_vocab(train_data, val_data, test_data)

Details: https://torchtext.readthedocs.io/en/latest/data.html#torchtext.data.Field.build_vocab

4           Baseline Model and Training
In your models.py file, you will first implement and train the baseline model (given below), which was discussed in class. Some of the code below will be re-usable for the other two models.

4.1         Loading GloVe Vector and Using Embedding Layer
As mentioned in Section 1, we will make use of word vectors that have already been created/trained. We will use the GloVe [6] pre-trained word vectors in an “embedding layer” (which is just the lookup table/embedding matrix described earlier) in PyTorch, in two steps:

1. (As given in the skeleton file main.py) Using the vocab object from Section 3.2, item number 4, download (the first time this is run) the GloVe word vectors and load them into the vocab object, as follows:
TEXT.vocab.load_vectors(torchtext.vocab.GloVe(name='6B', dim=100))
vocab = TEXT.vocab

You can see the shape of the complete set of word vectors by printing out the shape of the vectors object as follows; it will be the number of unique words across the datasets by the embedding dimension (word vector size).
print("Shape of Vocab:", TEXT.vocab.vectors.shape)

This loads the word vectors into a GloVe class (see documentation https://torchtext.readthedocs.io/en/latest/vocab.html#torchtext.vocab.GloVe). This GloVe model was trained on six billion words to produce a word vector size of 100, as described in class. The first run will download a rather large 862 MB zip file into a folder named .vector_cache, which might take some time; this file expands into a 3.3 GB set of files, but you will only need one of those files, labelled glove.6B.100d.txt, so you can delete the rest (but don't delete the file glove.6B.100d.txt.pt that will be created by main.py, which is the binary form of the vectors). Note that the .vector_cache folder, because its name starts with a ‘.’, is typically not visible, and you'll have to make it visible with an operating-system-specific view command of some kind (Windows, Mac). Once the vectors are downloaded, your code can access the vocabulary object within the text field object by calling the .vocab attribute on the text field object; a quick way to check the loaded vectors is sketched after step 2 below.

2. The step that converts the input words from an index number (a word token) into the word vector is actually done inside the nn.Module model class. So, when defining the layers in your model class, you must add an embedding layer with the function nn.Embedding.from_pretrained, and pass in vocab.vectors as the argument, where vocab is the Vocab object. The code for this is shown below in the model section, and is given in the skeleton file models.py.
Details: https://pytorch.org/docs/stable/nn.html?highlight=from_pretrained#torch.nn.Embedding.from_pretrained
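As a quick check that the vectors from step 1 loaded correctly, you can look up an individual word's index (word token) and its vector through the vocab object; the word 'fun' below is just an example.

word = 'fun'
index = vocab.stoi[word]          # the index (word token) for 'fun'
vector = vocab.vectors[index]     # the corresponding 100-dimensional GloVe vector
print(word, index, vector.shape)  # expect torch.Size([100])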

4.2         Baseline Model
Figure 2: A simple baseline architecture, illustrated on the sentence "The fight scenes are fun"

The baseline model was discussed in class and is illustrated in Figure 2. It first converts each of the word tokens into a vector using the GloVe word embeddings that were downloaded. It then computes the average of those word embeddings over a given sentence; the idea is that this average represents the meaning of the entire sentence. This average is fed to a fully connected layer which produces a scalar output, with a sigmoid activation (computed inside the BCEWithLogitsLoss loss function) to represent the probability that the sentence is in the subjective class.

The code for this Baseline class is given below, and is also provided in the skeleton file models.py. Read it and make sure you understand it.

class Baseline(nn.Module):

    def __init__(self, embedding_dim, vocab):
        super(Baseline, self).__init__()
        self.embedding = nn.Embedding.from_pretrained(vocab.vectors)
        self.fc = nn.Linear(embedding_dim, 1)

    def forward(self, x, lengths=None):
        # x has shape [sentence length, batch size]
        embedded = self.embedding(x)    # [sent len, batch size, emb dim]
        average = embedded.mean(0)      # average over the sentence length dimension
        output = self.fc(average).squeeze(1)

        # Note - using the BCEWithLogitsLoss loss function performs the
        # sigmoid function *as well as* the binary cross entropy loss
        # computation (these are combined for numerical stability)
        return output
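As an optional sanity check, the model can be instantiated and run on a small fake batch of token indices. This sketch assumes the vocab object from Section 4.1 has been built and loaded with GloVe vectors; the tensor sizes and indices are made up.

import torch
import torch.nn as nn

model = Baseline(embedding_dim=100, vocab=vocab)
loss_fnc = nn.BCEWithLogitsLoss()

# Fake batch: 5 tokens per sentence, batch size of 2 (indices are arbitrary).
dummy_input = torch.randint(low=0, high=len(vocab), size=(5, 2))
dummy_labels = torch.tensor([1.0, 0.0])

logits = model(dummy_input)              # shape: [batch size]
loss = loss_fnc(logits, dummy_labels)
probabilities = torch.sigmoid(logits)    # probability of the subjective class
print(loss.item(), probabilities)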

4.3         Training the Baseline Model
In main.py write a training loop to iterate through the training dataset and train the baseline model. Use the hyperparameters given in Table 1. Note that we have not used the Adam optimizer yet in this course; it will be discussed in a later lecture. The Adam optimizer is invoked the same way as the SGD optimizer, using optim.Adam.

Hyperparameter        Value
Optimizer             Adam
Learning Rate         0.001
Batch Size            64
Number of Epochs      25
Loss Function         BCEWithLogitsLoss()

Table 1: Hyperparameters to Use in Training the Models

The objects train_iter, val_iter, test_iter and overfit_iter described in Section 3.2 are the iterable objects that produce batches of batch_size in each inner training loop step. The iterator yields a torchtext.data.batch.Batch object, from which you can obtain both the text input and the lengths of the sentence sequences from the .text field, as follows, assuming that batch is the object returned by the iterator:
batch_input, batch_input_length = batch.text

Here batch_input is the set of (tokenized, indexed) text sentences in the batch, and batch_input_length holds the length of each sentence.

The details on this object can be found in https://spacy.io/usage/spacy-101#annotations-token.
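Putting these pieces together, one possible shape for the training loop is sketched below. This is only a sketch under the assumptions above (the Baseline model, the iterators from Section 3.2, and the Table 1 hyperparameters); your own loop structure, logging and accuracy bookkeeping may differ.

import torch
import torch.nn as nn
import torch.optim as optim

model = Baseline(embedding_dim=100, vocab=vocab)
loss_fnc = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

for epoch in range(25):
    running_loss, correct, total = 0.0, 0, 0
    for batch in train_iter:
        batch_input, batch_input_length = batch.text
        labels = batch.label.float()

        optimizer.zero_grad()
        logits = model(batch_input, batch_input_length)
        loss = loss_fnc(logits, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        # A sentence is predicted 'subjective' when the sigmoid output exceeds 0.5.
        predictions = (torch.sigmoid(logits) > 0.5).long()
        correct += (predictions == batch.label).sum().item()
        total += batch.label.size(0)

    print('Epoch %d: loss %.4f, accuracy %.4f'
          % (epoch, running_loss / len(train_iter), correct / total))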

4.4         Overfitting to debug
As was done in Assignment 3, debug your model by using only the very small overfit.tsv set (described above, which you'll have to turn into a dataset and iterator as shown in the given code), and see if you can overfit your model and reach a much higher training accuracy than validation accuracy. (The baseline model won't have enough parameters to reach an accuracy of 100%; the CNN and RNN models will.) You will need more than 25 epochs to succeed in overfitting. Recall that the purpose of doing this is to make sure that the input processing and output measurement are working.

It is also recommended that you include some useful logging in the loop to help you keep track of progress, and help in debugging.

Provide the training loss and accuracy plot for the overfit data in your Report.

4.5         Full Training Data
Once you’ve succeeded in overfitting the model, then use the full training dataset to train your model, using the hyper-parameters given in Table 1.

In main.py write an evaluation loop to iterate through the validation dataset to evaluate your model. It is recommended that you call the evaluation function in the training loop (perhaps every epoch or two) to make sure your model isn’t overfitting. Keep in mind if you call the evaluation function too often, it will slow down training.
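A minimal sketch of such an evaluation function is given below, under the same assumptions as the training-loop sketch above; the function name and return values are illustrative.

def evaluate(model, data_iter, loss_fnc):
    model.eval()
    total_loss, correct, total = 0.0, 0, 0
    with torch.no_grad():
        for batch in data_iter:
            batch_input, batch_input_length = batch.text
            logits = model(batch_input, batch_input_length)
            total_loss += loss_fnc(logits, batch.label.float()).item()
            predictions = (torch.sigmoid(logits) > 0.5).long()
            correct += (predictions == batch.label).sum().item()
            total += batch.label.size(0)
    model.train()
    return total_loss / len(data_iter), correct / total

# Example call, made every epoch or two inside the training loop:
# val_loss, val_acc = evaluate(model, val_iter, loss_fnc)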

Give the training and validation loss and accuracy curves vs. epoch in your Report. Then evaluate the model on the test data and report the final test accuracy in your Report.

4.6         Saving and loading your model
In main.py, save the model with the lowest validation error with torch.save(model, 'model_baseline.pt'). You will need to load this file in the next section. See https://pytorch.org/tutorials/beginner/saving_loading_models.html for details on saving and loading.
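One way to do this is sketched below, assuming the evaluate function from the previous section is called every epoch.

best_val_loss = float('inf')
# ... inside the epoch loop, after evaluating on the validation set:
val_loss, val_acc = evaluate(model, val_iter, loss_fnc)
if val_loss < best_val_loss:
    best_val_loss = val_loss
    torch.save(model, 'model_baseline.pt')

# Later (e.g. in the next section), the saved model can be reloaded with:
model = torch.load('model_baseline.pt')
model.eval()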
