CSE354 - Assignment 2 - Natural Language Processing

Data
Part 1: Word Sense Disambiguation (WSD) with One-Hot Neighbors
Part 2: Extracting PCA-Based Word Embeddings
Part 3: WSD with Embeddings
Extra Credit: WSD with Lemma and POS-aware embeddings
Overview 
Goals.
•    Implement reading of NLP data. 
•    Use one-hot encoding for location-specific features.
•    Implement word sense disambiguation.
•    Implement word embeddings to be used as features. 
•    Gain a deeper understanding of lexical semantics through implementing disambiguation and embedding techniques.
General Requirements. You must use Python version 3.6 or later, along with Pytorch 1.4.0. You may integrate any code (e.g. the logistic regression code) from your Assignment 1.
Python Libraries.  No libraries beyond those listed below are permitted.  Of these libraries, you may not use any subcomponents that specifically implement a concept which the instructions indicate you should implement (e.g. a one-hot feature encoding). The project can be completed without any additional libraries; however, if any additional libraries are deemed permissible, they will be listed here:
   torch-1.4.0
  sys
  re
  numpy
  pandas # only for data reading and storage (simpler to not use this)
  csv # though it is just as easy to read the input without it
Submission. 
1.    Place all of your code in a single file, a2_<lastname>_<id>.py, which takes in the training data and testing data filenames as parameters.  
All three parts of your code should run with: 
        python3 a2_LASTNAME_ID.py onesec_train.tsv onesec_test.tsv
2.    Place the output of your code in a single file called  a2_<lastname>_<id>_OUTPUT.txt
After the package imports, add the following line, which sends your print() statements to a file instead of the console:
        sys.stdout = open('a2_lastname_id_OUTPUT.txt', 'w')
Change the file name to include your personal details. If this causes you any issues, you can redirect (>) your results to a text file from your terminal or copy and paste your results into a .txt file.

Data
Ambiguities hiding in plain sight! Have you noticed that the words "language", "process", and "machine", some of the most frequent words mentioned in this course, are quite ambiguous themselves?
You will use a subset of a modern word sense disambiguation corpus, called "onesec" for your training and test data. You can download this subset here:
Training Data
Test Data
You can read about the original dataset here: http://trainomatic.org/onesec. Interestingly, the data was created by exploiting the idea that most Wikipedia categories only use one sense of a particular word within an article ("language" in a computer programming article is always a programming language, whereas in an article about swearing it is about the words one chooses; in an article about a particular country it probably refers to the natural language spoken in that country).
For the purposes of this assignment, there are three pieces of information being used: 
1. The lemma and id -- a unique value for every example ("language.NOUN.000068")
2. The sense -- the target word label; i.e. its sense (e.g. "language%1:10:01::"). 
3. The context -- the context (typically a sentence) in which the target word is mentioned. The target word is surrounded by <head>word</head>. The unigrams are already tokenized and delimited simply by a space. Further, each word is given as original_word/lemma/part-of-speech. For example, in "is/be/AUX", "is" is the original token, "be" is its lemma, and "AUX" is the part of speech (you can ignore the lemma and part of speech unless you do the extra credit).
The files have these 3 fields in a .tsv (tab-separated values) format:
lemma.POS.id <tab> sense <tab> context
Part 1: Word Sense Disambiguation (WSD) with One-Hot Neighbors
1.1 Read the data (15 pts). Refer to the data description above. You will need to store each of the 3 pieces for each record; a short reading sketch follows the list below. In addition:
•    convert the sense (a string) into an integer. Your classifier will need to take an integer in as the label rather than a string. It can be any unique integer per sense. 
•    within the context, remove the lemma and POS (for parts 1 - 3; you may use them for the extra credit). 
•    track the counts of all words, so that it is easy to check if it is in or outside the vocabulary later. You will use the 2000 most frequent words of the training set as the vocabulary throughout this assignment. 
•    make sure you can easily access just the "language", "process", or "machine" examples. You will be making a separate classifier for each. 
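A minimal sketch of one way to structure the reading step is below. The function and variable names are illustrative only (not required), and the head-index tracking shown in the Part 1 hints is omitted here:
    import re
    from collections import Counter

    def read_onesec(filename):
        # Sketch only: parse a onesec .tsv file into per-lemma records of
        # (lemma.POS.id, integer sense label, list of lowercased tokens).
        records = {'process': [], 'machine': [], 'language': []}
        word_counts = Counter()
        sense_to_int = {lemma: {} for lemma in records}  # one sense -> int map per lemma
        with open(filename, encoding='utf-8') as f:
            for line in f:
                fields = line.rstrip('\n').split('\t')
                lemma_id, sense, context = fields[0], fields[1], '\t'.join(fields[2:])
                lemma = lemma_id.split('.')[0]            # "language.NOUN.000068" -> "language"
                senses = sense_to_int[lemma]
                if sense not in senses:                   # assign 0, 1, 2, ... per new sense string
                    senses[sense] = len(senses)
                tokens = []
                for tok in context.split():
                    tok = re.sub(r'</?head>', '', tok)    # drop <head> markers (track the index separately)
                    tokens.append(tok.split('/')[0].lower())  # keep the word; drop lemma and POS
                word_counts.update(tokens)
                records[lemma].append((lemma_id, senses[sense], tokens))
        return records, word_counts, sense_to_int
The 2000-word vocabulary can then be taken from the training counts (e.g. word_counts.most_common(2000)), and the same function can be called again on the test file so there is only one copy of the reading code (see the Part 1 hints).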
1.2 Add one-hot feature encoding (10 pts). Using only the 2000 most frequent words as your vocabulary, make 2 one-hot encodings, one for the word before the target word and one for the word after:
 [one word before] _target word_ [one word after]
to process paperwork
[0 0 0 1 0 ...][1 0 0 0 0 ...]
[0 0 0 1 0 … 1 0 0 0 0 ...]
The data includes the part-of-speech of each word -- please disregard that and only focus on the word itself (you may use it for the extra credit, separately). At the end of this step you should have 4000 features -- the concatenation of both one-hot encodings.
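A sketch of one way to build these features is below; it assumes a word_to_index dictionary over the 2000-word vocabulary (the names are illustrative), and simply leaves out-of-vocabulary neighbors as all zeros:
    import torch

    def one_hot_pair(tokens, head_index, word_to_index):
        # Sketch only: concatenated one-hot vectors for the word before and after the target.
        V = len(word_to_index)                        # 2000
        features = torch.zeros(2 * V)
        before = tokens[head_index - 1] if head_index > 0 else None
        after = tokens[head_index + 1] if head_index < len(tokens) - 1 else None
        if before in word_to_index:
            features[word_to_index[before]] = 1.0     # first 2000 dims: word before
        if after in word_to_index:
            features[V + word_to_index[after]] = 1.0  # last 2000 dims: word after
        return features                               # length 4000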
1.3 Train logistic regression classifiers (10 pts). Make sure to only use the train data, and train a separate classifier for each lemma. You will need logistic regression that can handle more than two outcome classes; this means replacing the loss function with cross-entropy loss (a multivariate version of log loss, where j represents the class from among all classes, V):
        loss = -Σ_{j ∈ V} y_j log(ŷ_j),  where y_j is 1 if j is the true sense (0 otherwise) and ŷ_j is the predicted probability of class j
You will also need to adjust LogReg() to return len(unique_classes) responses rather than 1.
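A minimal sketch of a multi-class version is below, assuming the one-hot features from 1.2 as a float tensor X_train and the integer sense labels as a long tensor y_train (hyperparameters are illustrative, not prescribed):
    import torch
    import torch.nn as nn

    class LogReg(nn.Module):
        # Sketch only: one linear output per sense; CrossEntropyLoss applies the softmax.
        def __init__(self, num_features, num_classes):
            super().__init__()
            self.linear = nn.Linear(num_features, num_classes)

        def forward(self, X):
            return self.linear(X)  # raw scores, one per class

    def train_wsd(X_train, y_train, num_classes, epochs=300, lr=0.1):
        model = LogReg(X_train.shape[1], num_classes)
        loss_fn = nn.CrossEntropyLoss()              # the cross-entropy loss above
        optimizer = torch.optim.SGD(model.parameters(), lr=lr)
        for _ in range(epochs):
            optimizer.zero_grad()
            loss = loss_fn(model(X_train), y_train)  # y_train: integer sense labels
            loss.backward()
            optimizer.step()
        return model
The raw linear outputs per label are one acceptable form for the "predictions" printed in 1.4.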
1.4 Test each classifier on the test set (5 pts). Output the number correct out of the total for each word as well as the predictions (probabilities or linear output per label) for the following examples: 
process.NOUN.000018, process.NOUN.000024, language.NOUN.000008, language.NOUN.000014, machine.NOUN.000004, machine.NOUN.000008
Aim for an accuracy at least as high as the most frequent sense baselines for each:
process
  correct: 141 out of 202
machine
  correct: 138 out of 202
language
  correct: 142 out of 202
Part 1 Hints.
•    The .tsv file is easy to parse by splitting each line on tabs, taking the first element as the lemma.id, the second as the sense label, and the remainder as the context (even if it contains a tab).
•    Use a dictionary to store a mapping of the sense (e.g. "process%1:09:00::") to the integer label for each lemma. Start with 0 as the integer and then increment when encountering a new sense string.
•    The following will find the index of the target word (head) in context and remove the "<head></head>" markings (it assumes you have imported re and that context holds the raw context string):
        headMatch = re.compile(r'<head>([^<]+)</head>')  # matches the contents of <head>
        tokens = context.split()  # get the tokens
        headIndex = -1  # will be set to the index of the target word
        for i in range(len(tokens)):
            m = headMatch.match(tokens[i])
            if m:  # a match: we are at the target token
                tokens[i] = m.groups()[0]
                headIndex = i
        context = ' '.join(tokens)  # turn context back into a string (optional)
•    Use methods to do the data reading and processing. Call them once on the train data, and then again on the test data. This way you only have one copy of data reading code and one place to debug. 
•    You may re-use your logistic regression code from Assignment 1.
•    In PyTorch, CrossEntropyLoss automatically applies the softmax and log loss for you (see the topic (3) class slides on cross-entropy).
•    See the "Useful Ingredients for Assignment 2" handout -- an introduction to the type of structure that can help with your design decisions.
•    Convert all tokens to lowercase just as in assignment 1. 
Example output:
[TESTING UNIGRAM WSD MODELS]
process                                        
  predictions for process.NOUN.000018: ['0.5500', '2.1585', '-0.8698', '-0.2431', '-0.8704', '-0.6612']
  predictions for process.NOUN.000024: ['0.5538', '2.2727', '-0.9298', '-0.2154', '-0.8979', '-0.6981']
  correct: 141 out of 202
machine
  predictions for machine.NOUN.000004: ['-0.7073', '2.3671', '0.6223', '-0.2614', '-0.9226', '-1.0012']
  predictions for machine.NOUN.000008: ['-0.6270', '2.0766', '0.7481', '-0.3050', '-0.8659', '-0.9891']
  correct: 138 out of 202
language
  predictions for language.NOUN.000008: ['-1.0620', '0.5458', '-0.2333', '-0.9103', '2.4244', '-0.7717']
  predictions for language.NOUN.000014: ['-1.0110', '0.5653', '-0.2724', '-0.8846', '2.2993', '-0.7144']
  correct: 142 out of 202
Part 2: Extracting PCA-Based Word Embeddings
2.1 Convert corpus into co-occurrence matrix (15 points). Use the contexts from all 3 training datasets (language, process, and machine). For co-occurrence counts, use the number of times the two words appear in the same document. Restrict to only using the 2000 most frequent words, and add an "<OOV>" column and row to be used for any words that are not among the 2000 most frequent. This matrix should be square and symmetric (number of rows = vocabulary size + 1 = number of columns).
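One possible sketch is below, assuming a word_to_index map over the 2000-word vocabulary and treating each context as a document; counting each co-occurring pair once per document is one reasonable interpretation of the counts:
    import numpy as np

    def cooccurrence_matrix(all_contexts, word_to_index):
        # Sketch only: square, symmetric document-level co-occurrence counts,
        # with the last row/column reserved for <OOV>.
        size = len(word_to_index) + 1   # 2000 vocabulary words + <OOV> = 2001
        oov = size - 1
        M = np.zeros((size, size))
        for tokens in all_contexts:     # each context from the three training sets
            ids = set(word_to_index.get(t, oov) for t in tokens)
            for i in ids:
                for j in ids:
                    M[i, j] += 1        # includes the diagonal; a design choice
        return M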
2.2 Run PCA and extract static, 50-dimensional embeddings (15 points). To do this, you may use torch.svd. SVD only solves PCA if you first standardize your data. Standardizing means you must subtract the mean and divide by the standard deviation for each column:
data = (data - data.mean(dim=1, keepdim=True)) / data.std(dim=1, keepdim=True)
Make sure to take the first 50 dimensions of the U matrix as the low-dimensional representation. Store this in a dictionary so you have:
        {'word1': [embedding1], 'word2': [embedding2], …}
2.3 Find the distance between select words (5 points). Calculate the Euclidean distance between the vectors for the following pairs of words (a short sketch follows the list):
[('language', 'process'),
('machine', 'process'),
('language', 'speak'),
('word', 'words'),
('word', 'the')]
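A short sketch of printing these distances, assuming the embeddings dictionary from 2.2:
    import torch

    pairs = [('language', 'process'), ('machine', 'process'),
             ('language', 'speak'), ('word', 'words'), ('word', 'the')]
    for a, b in pairs:
        dist = torch.dist(embeddings[a], embeddings[b])  # Euclidean (p=2) distance
        print(f"('{a}', '{b}') : {dist.item():.4f}")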
Part 2 Hints.
•    Make sure to store your matrix as a torch tensor before starting to process it. 
•    It may be useful to create and standardize the data within numpy before running svd in torch. 
•    If you've scaled your variables correctly before SVD, then distances should rarely be > 2.0 and often be around 1.0. ('word', 'words') should be closest, and either of ('language', 'process') or ('machine', 'process') should be furthest.
•    See the example output below; the smallest distance should be the one for ('word', 'words').
Example output:
('language', 'process') : 1.4051
('machine', 'process') : 1.4054
('language', 'speak') : 0.9911
('word', 'words') : 0.1048
('word', 'the') : 0.9999
Part 3: WSD with Embeddings
3.1 Extract embedding features (15 pts). Produce 4 embeddings for each target word instance using your PCA embeddings created in Part 2:
[two words before] [one word before] _target_ [one word after] [two words after].
the word process for the
[-0.22, 0.75, -0.42] [-0.0019, 0.0024, 0.0033] [-0.03, 0.02, 0.0004] [-0.22, 0.75, -0.42]
[-0.22, 0.75, -0.42, -0.0019, 0.0024, 0.0033, -0.03, 0.02, 0.0004, -0.22, 0.75, -0.42]
Concatenate all 4 embeddings into a single vector of length 200. 
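A sketch of one way to build the 200-dimensional feature vector, assuming the embeddings dictionary from Part 2 has an '<OOV>' entry (whether out-of-vocabulary words map to '<OOV>' or to zeros is your choice):
    import torch

    def embedding_features(tokens, head_index, embeddings, dims=50):
        # Sketch only: embeddings of the two words before and two words after the target,
        # concatenated; positions past the ends of the context are filled with zeros.
        parts = []
        for offset in (-2, -1, 1, 2):        # the target word itself (offset 0) is not included
            i = head_index + offset
            if 0 <= i < len(tokens):
                parts.append(embeddings.get(tokens[i], embeddings['<OOV>']))
            else:
                parts.append(torch.zeros(dims))
        return torch.cat(parts)              # 4 x 50 = 200 features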
3.2 Rerun logistic regression training using your word embeddings (5 pts). Consider changing the penalty to accommodate the new feature size. 
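If your Assignment 1 code implements the penalty as an explicit term in the loss, you can simply tune its coefficient; alternatively (and equivalently for an L2 penalty), PyTorch optimizers accept a weight_decay argument. A one-line sketch, reusing the model from the Part 1 training sketch and with an illustrative value:
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)  # L2 penalty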
3.3 Test the new logistic regression classifier (5 pts). Output the number correct out of the total for each word as well as the predictions for following examples: 
process.NOUN.000018, process.NOUN.000024, language.NOUN.000008, language.NOUN.000014, machine.NOUN.000004, machine.NOUN.000008
Aim for an accuracy at least as high as the most frequent sense baselines for each:
process
  correct: 141 out of 202
machine
  correct: 138 out of 202
language
  correct: 142 out of 202
Part 3 Hints.
•    Not only might your accuracy increase, but your training time may decrease due to the smaller feature space. 
•    Note: Embedding-based WSD may not work better than unigram-based WSD. Just make sure you are doing better than the most frequent sense baseline above.
•    You can use zeros in place of the embedding when the target word is at the beginning or end of the context. 
Example Output:
process
  predictions for process.NOUN.000018: ['0.1107', '2.7988', '-0.4470', '1.0502', '-2.2774', '-1.1228']
  predictions for process.NOUN.000024: ['0.5495', '3.4219', '-1.3625', '1.4292', '-2.4627', '-1.6034']
  correct: 142 out of 202
machine
  predictions for machine.NOUN.000004: ['-0.4345', '2.9917', '0.8196', '0.2932', '-2.3715', '-1.2307']
  predictions for machine.NOUN.000008: ['-0.0409', '2.4798', '1.1604', '-0.7553', '-0.9312', '-1.8865']
  correct: 140 out of 202
language
  predictions for language.NOUN.000008: ['-1.4646', '0.1316', '0.8124', '-0.6371', '2.4451', '-1.1656']
  predictions for language.NOUN.000014: ['-0.9916', '-0.1459', '-0.3174', '-0.5286', '2.0656', '-0.2421']
  correct: 143 out of 202
Extra Credit: WSD with Lemma and POS-aware embeddings
Create a classifier that uses the lemma and POS information in order to improve your results. For example, one might encode PCA embeddings using the lemma/POS in addition to the token. How to do this is up to you. Please specify in a README.md file that you have attempted the extra credit (up to 10 pts extra).
