CSE354 - Assignment 3 - Natural Language Processing

Updated with example output and more precise 2.3 instructions based on Piazza questions.
Overview
Data
Part 1: Ideas from Medical Ethics for NLP Ethics
Part 2: Create a probabilistic language model
Overview 
Goals.
•    Learn about ethics from a research field with well established ethical principles. 
•    Think critically about ethics in NLP. 
•    Practice communicating about NLP in your own natural language. 
•    Implement ngram counts for language modeling with basic smoothing. 
•    Gain a deeper understanding of probabilistic language modeling by implementing a 3-gram language model. 
General Requirements. Half of the assignment is a written and critical thinking exercise related to ethics. This should be submitted as a 2-page pdf document, single column, 12pt Times New Roman font, single-spaced, 1in margins all around.
The other half is a python assignment. You must use Python version 3.6 or later, along with PyTorch 1.4.0. You may integrate any of your own code (e.g. the logistic regression code from your Assignment 1).
Python Libraries.  No libraries beyond those listed below are permitted. Of these libraries, you may not use any subcomponents that specifically implement a concept which the instructions indicate you should implement. The project can be completed without any additional libraries. However, if any additional libraries are deemed permissible they will be listed here:
  torch-1.4.0
  sys
  re
  numpy as np
  pandas # only for data reading and storage (simpler to not use this)
  csv # though it is just as easy to read the input without this  
Submission. 
1.    Title your 2-page pdf: a3_<lastname>_<id>.pdf.
2.    Place all of your code in a single file, a3_<lastname>_<id>.py, which takes in the training data. Your code should run with:
        python3 a3_LASTNAME_ID.py onesec_train.tsv
3.    Place the output of your code in a single file called a3_<lastname>_<id>_OUTPUT.txt.
After the package imports, add the following line, which sends your print() statements to a file instead of the console:
        sys.stdout = open('a3_lastname_id_OUTPUT.txt', 'w')
Change the file name to include your personal details. If this causes you any issues, you can redirect (>) your results to a text file from your terminal or copy-and-paste your results into the .txt file.
4.    Submit all three of the .pdf, .py, and .txt in Blackboard under Assignment 3.
DO NOT ZIP the files; submit them as 3 independent files.
Academic Integrity.  Copying chunks of code or writing  from other students, websites or other resources outside of materials provided in class is prohibited. You are responsible for both (1) not copying others' work, and (2) making sure your work is not accessible to others. Assignments will be extensively checked for copying of others’ work. Please see the syllabus for additional policies. 
Data
Part I of this assignment will use the Declaration of Helsinki, which you will process with the NLP + Critical Thinking system in your head (i.e. you will not run NLP programs for this, but rather think and write yourself).
Part II of this assignment will use the training data from the same dataset as Assignment 2. You will only need to use the contexts from each example. The dataset description is repeated here:
Ambiguities hiding in plain sight! Have you noticed that the words "language", "process", and "machine", some of the most frequent words mentioned in this course, are quite ambiguous themselves?
You will use a subset of a modern word sense disambiguation corpus, called "onesec" for your training and test data. You can download this subset here:
Training Data
You can read about the original dataset here: http://trainomatic.org/onesec.
For the purposes of this assignment, the only piece of information being used is: 
  The context -- the context (typically a sentence) in which the target word is mentioned. The unigrams are already tokenized and delimited simply by a space. Further, each word is given as original_word/lemma/part-of-speech. For example, in "is/be/AUX", "is" is the original token, "be" is its lemma, and "AUX" is the part of speech (you can ignore the lemma and part of speech unless you do the extra credit).
The files have these 3 fields in a .tsv (tab-separated values) format:
lemma.POS.id <tab> sense <tab> context
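As a quick illustration of this format, here is a minimal sketch of reading one record and splitting a token into its three parts; the sample line and variable names are made up for illustration:

  # Minimal sketch of parsing the .tsv format described above (the sample line is made up;
  # real contexts also contain <head>...</head> tags, which Part 2.1 asks you to strip).
  line = "language.NOUN.000001\tsense_label\tthe/the/DET language/language/NOUN is/be/AUX ambiguous/ambiguous/ADJ"
  lemma_pos_id, sense, context = line.rstrip("\n").split("\t")

  for token in context.split(" "):          # unigrams are delimited by a single space
      word, lemma, pos = token.split("/")   # e.g. "is/be/AUX" -> "is", "be", "AUX"
      # only `word` is needed for Part 2; lemma and pos may be ignored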
Part 1: Ideas from Medical Ethics for NLP Ethics
1.1 Rewrite 4 of the basic principles of the Declaration of Helsinki (20 pts). The declaration outlines principles for ethical work by medical researchers and physicians. The principles came about from decades of both (1) experience from medical practice as well as (2) long-form discussion and debate. More generally, it is an example of principles for interacting with people in a way that seeks minimal harm in exchange for maximal benefit. The field of NLP is likewise in a process of evolving and better defining its own principles for providing benefit with minimal risk of harm. This declaration may contain many ideas that transfer to our domain, while others may not be in our or society's best interest.
Your task here is to rewrite 4 of the principles from section B of the declaration ("Basic principles for all medical research"), such that:
1.    mentions of physicians or researchers are replaced with "[NLP Developer]"
2.    mentions of patients or human subjects are replaced with "[User(s)]"
3.    mentions of research and similar tasks are replaced with "[NLP Application]". 
Here is an example of item 10 rewritten:
10. It is the duty of the [NLP Developer] to protect the life, health, privacy, and dignity of the [user]. 
In order to spread out the principles among the class, the 4 you will rewrite are assigned based on the last digit of your student id. For example, "0" is the last digit of 23848920; "7" the last of 238298347.
last digit    items assigned
0 or 1        11, 15, 20, 26
2 or 3        12, 16, 21, 27
4 or 5        13, 19, 23, 25
6 or 7        14, 16, 17, 24
8 or 9        10, 15, 18, 22
1.2 Examine, in detail, the appropriateness of each of your rewrites. For each of the above items, evaluate whether it is still applicable for NLP. State which aspects are applicable for NLP developers to keep in mind and which aspects are not applicable. Consider whether the principle could be revised further to make it more applicable for NLP. 
Restriction: keep your response to under 100 words per item. 
1.3 Review the transferability of the overall principles. In paragraph form, answer the following questions:
•    Overall, to what extent do you find the principles can be transferred to NLP? 
•    What do you think would transfer well to NLP? 
•    What sorts of harms might the principles miss? 
•    What ideas from the declaration seem overly protective (i.e. unnecessary or potentially stifling of innovation) in the case of NLP?
Restriction: keep your response to under 400 words total. 
We recommend using 2 or 3 paragraphs. 
The full document for part 1 (1.1 - 1.3) should be 2 pages or less.
1.1 and 1.2 Rewrites and Examination example: 
10. It is the duty of the [AI Developer] to protect the life, health, privacy, and dignity of the [user].
This statement [is | is not | is partially] applicable for NLP ethics because … 
1.3. Overall Review:
...
Part 2: Create a probabilistic language model
2.1 Prepare your corpus and vocabulary. 
•    Load all of the contexts from the training data. Treat all the contexts as belonging to one corpus, just like you did in Part 2 of Assignment 2. 
•    Remove the head word xml tags: "<head>" and "</head>"
•    You can throw out the lemma and POS information; we will not use it.
•    Prepend <s> to the beginning and append </s> to the end of each context. These tokens represent "sentence start" and "sentence end" respectively.
•    [<s>, here, is, an, example, sentence, </s>]
•    Create a vocabulary of the 5000 most frequent words, similar to the vocabulary we created in Assignment 2 which used 2000 words. 
•    Lowercase all words (i.e. in order to treat them as case insensitive). A rough sketch of these preparation steps is given below.
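The following is only a rough sketch of these preparation steps, not a required implementation; the variable names, the decision to count <s>/</s> toward the 5000-word vocabulary, and the exact file handling are all assumptions:

  import re
  import sys
  from collections import Counter

  corpus = []  # one token list per context, each wrapped in <s> ... </s>
  with open(sys.argv[1], encoding="utf-8") as f:
      for line in f:
          fields = line.rstrip("\n").split("\t")
          if len(fields) < 3:
              continue
          context = re.sub(r"</?head>", "", fields[2])       # remove <head> and </head> tags
          words = [tok.split("/")[0].lower()                  # keep only the original word, lowercased
                   for tok in context.split() if tok]
          corpus.append(["<s>"] + words + ["</s>"])

  # Vocabulary: the 5000 most frequent words (plus an <OOV> symbol for everything else).
  freqs = Counter(w for sent in corpus for w in sent)
  vocab = {w for w, _ in freqs.most_common(5000)}
  vocab.add("<OOV>")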
2.2 Extract unigram, bigram, and trigram counts. 
•    Treat any word not in the vocabulary as "<OOV>", no matter where it appears.
•    Store unigram counts in a dictionary 
•    unigramCounts[word] = count
•    Store bigram and trigram counts in a dictionary-of-dictionaries:
•    bigramCounts[word1][word2] = count
•    trigramCounts[(word1,word2)][word3] = count 
•    This format will make it easier to calculate probabilities in the next step. (A minimal sketch of building these counts is shown below.)
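As referenced above, here is a minimal sketch of filling these count dictionaries, assuming the `corpus` and `vocab` structures from the sketch in 2.1 (plain dicts or defaultdicts both work):

  from collections import defaultdict

  unigramCounts = defaultdict(int)
  bigramCounts = defaultdict(lambda: defaultdict(int))    # bigramCounts[w1][w2] = count
  trigramCounts = defaultdict(lambda: defaultdict(int))   # trigramCounts[(w1, w2)][w3] = count

  for sent in corpus:
      # map any word outside the vocabulary to <OOV>, wherever it appears
      toks = [w if w in vocab else "<OOV>" for w in sent]
      for i, w in enumerate(toks):
          unigramCounts[w] += 1
          if i >= 1:
              bigramCounts[toks[i - 1]][w] += 1
          if i >= 2:
              trigramCounts[(toks[i - 2], toks[i - 1])][w] += 1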
As a checkpoint, print the counts for the following unigrams, bigrams and trigrams
(yield 0 if not present):
•    unigrams: 'language', 'the', 'formal'
•    bigrams: ('the','language'), ('<OOV>','language'), ('to','process')
•    trigrams: ('specific','formal','languages'), ('to','process','<OOV>'), ('specific','formal','event')
Results may vary from the example output by ±2 (except 0s should be 0; ('<OOV>','language') can be off by ±10).
2.3 Create a method that calculates language model probabilities. Use add-one smoothing and back-off when creating the probabilities. The method calculates the probability of all possible current words w_i given either a single previous word (w_i-1 → a bigram model) or two previous words (w_i-1 and w_i-2 → a trigram model):
  input: wordMinus1, wordMinus2 = None  # wordMinus2 is optional
                                          (additional arguments are fine)
  output: probs  # a dict of probs[word] = P(word | previous word or words)
•    Start by making a list of all possible w_i: these should be all words that ever appear after w_i-1 (i.e. use the bigramCounts). Include <OOV> in the pool of possible words (assuming it appeared; it's ok if it's in w_i-1 or w_i-2).
•    If only a single previous word is provided, then use add-one smoothing on the bigram model to calculate the probability of the next word.
•    If two previous words are provided then calculate the probability of the next word based on both (1) the add-one smoothed trigram model and (2) the add-one smoothed bigram model, combined as follows: 
•    P(w_i | w_i-2, w_i-1) = λ · P_addone(w_i | w_i-2, w_i-1) + (1 - λ) · P_addone(w_i | w_i-1), where λ is an interpolation weight between 0 and 1.
•    This is called “interpolating” the models which tends to produce more robust probabilities in practice. Note: you may not have ever observed a count for the trigram, but with add-one smoothing you can still calculate its non-zero probability. 
•    Make sure to use add-one smoothing for both the trigram and bigram probabilities. Note: because we are not returning probabilities for all words in the vocabulary, with add-one smoothing the sum of the word probabilities returned will usually be less than 1 (because the probabilities for additional vocabulary words are not being returned). A rough sketch of such a method follows.
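Here is one possible sketch of such a method, using the count dictionaries from 2.2. The vocabulary-size constant in the add-one denominator and the equal 0.5/0.5 interpolation weights are assumptions made for illustration, not the required values:

  def next_word_probs(wordMinus1, wordMinus2=None, vocab_size=5001):
      """Return probs[word] = P(word | previous word(s)) with add-one smoothing.
      vocab_size is the |V| term in the add-one denominator (an assumed value here)."""
      following = bigramCounts.get(wordMinus1, {})        # all words ever seen after wordMinus1
      bigram_total = sum(following.values())

      use_trigram = wordMinus2 is not None
      tri = trigramCounts.get((wordMinus2, wordMinus1), {}) if use_trigram else {}
      trigram_total = sum(tri.values())

      probs = {}
      for w, bi_count in following.items():
          p_bi = (bi_count + 1) / (bigram_total + vocab_size)             # add-one smoothed bigram
          if not use_trigram:
              probs[w] = p_bi
          else:
              p_tri = (tri.get(w, 0) + 1) / (trigram_total + vocab_size)  # add-one smoothed trigram
              probs[w] = 0.5 * p_tri + 0.5 * p_bi                         # assumed equal-weight interpolation
      return probs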
As a checkpoint, print the smoothed *conditional* probabilities (i.e. the output of your method for 2.3) for the same bigrams and trigrams as you did in Part 2.2 (do not print for unigrams). Use the interpolation where trigram counts exist (i.e. when P(w_i | w_i-1, w_i-2) is available) or just output P(w_i | w_i-1) when only bigram counts are available.
IMPORTANT: If the final word of the trigram was not a valid possible w_i (i.e. it never occurred after the word before it), then print "Not valid Wi".
Results may vary from the example output by ±0.001.
2.4. Create a method to generate language. 
  input: words  # a list of one or more tokens to start the sentence
  output: full_sentence  # a list of the generated sentence words
          (i.e. a sequence of words, beginning with those provided as input).
•    A basic method to generate language using the language models is discussed in the second paragraph of SLP 3.3 and visualized in pg. 39 of the SLP3 chapter 3 slides. 
Here are some additional details:
•    Start with generating the next word from the bigram model given only <s> as the first word (for this word choice only, do not use interpolation of both the bigram and trigram).
•    After choosing the first word, continue generating, now using the previous two words
•    i.e. use the interpolated bigram and trigram probability that the method from Part 2.3 returns
•    For example, if "walking" was generated after "<s>", you would now query the model with wordMinus2 = "<s>" and wordMinus1 = "walking".
•    The method from Part 2.3 is already restricting possible next words, so there is little to implement here. 
•    Stop once you generate “</s>” or once the sentence reaches max_length words. 
•    Set max_length = 32 words. (A rough sketch of this generation loop is given below.)
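A rough sketch of this loop, assuming the `next_word_probs` method sketched in 2.3 (the function and variable names are illustrative, not required):

  import numpy as np

  def generate_sentence(words, max_length=32):
      sentence = list(words)                    # e.g. ["<s>"] or ["<s>", "language", "is"]
      while sentence[-1] != "</s>" and len(sentence) < max_length:
          if len(sentence) == 1:
              probs = next_word_probs(sentence[-1])                  # only <s> so far: bigram model
          else:
              probs = next_word_probs(sentence[-1], sentence[-2])    # interpolated trigram + bigram
          if not probs:                         # no observed continuation; stop early
              break
          choices = list(probs.keys())
          p = np.array([probs[w] for w in choices])
          p = p / p.sum()                       # re-normalize: returned probs do not sum to 1
          sentence.append(str(np.random.choice(choices, p=p)))
      return sentence

  # Example usage for one of the final checkpoint prompts:
  for _ in range(3):
      print(" ".join(generate_sentence(["<s>", "language", "is"])))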
Final checkpoint: 
•    Generate 3 sentences beginning with "<s>" 
•    Generate 3 sentences beginning with "<s> language is"
•    Generate 3 sentences beginning with "<s> machines"
•    Generate 3 sentences beginning with "<s> they want to process"
Print all of these out. 
Hints
•    Trigram probabilities can easily be computed on the fly, even with add-one smoothing:
      P_addone(w_i | w_i-2, w_i-1) = (count(w_i-2, w_i-1, w_i) + 1) / (count(w_i-2, w_i-1) + |V|), where |V| is the vocabulary size used for smoothing.
•    Make sure to use a sparse representation (i.e. as in the dictionaries mentioned in 2.2) to store the counts. If you were to try to store the trigrams in dense form (i.e. a 3-d version of the co-occurrence matrix of assignment 2) then there would be roughly 125 billion entries (5,001 x 5,001 x 5,001) and this will surely use up memory.
•    The following will randomly select a word from a list given probabilities for each: np.random.choice(['word1', 'word2', 'word3'], p=[0.6, 0.3, 0.1]) 
Since you are not passing all probabilities to the word choice function, you will need to re-normalize the probabilities that you do pass: divide them all by the sum of the probabilities before passing them to np.random.choice().
Example output:
CHECKPOINT 2.2 - counts
  1-grams:
    ('language',) 760
    ('the',) 3828
    ('formal',) 14
  2-grams:
    ('the', 'language') 74
    ('<OOV>', 'language') 62
    ('to', 'process') 6
  3-grams:
    ('specific', 'formal', 'languages') 1
    ('to', 'process', '<OOV>') 0
    ('specific', 'formal', 'event') 0
CHECKPOINT 2.3 - Probs with add-one
  2-grams:
    ('the', 'language') 0.008495695514272768
    ('<OOV>', 'language') 0.005814490078449469
    ('to', 'process') 0.0011058451816745656
  3-grams:
    ('specific', 'formal', 'languages') 0.00039914234943198834
    ('to', 'process', '<OOV>') 0.0014032625855113139
    ('specific', 'formal', 'event') INVALID W_i
FINAL CHECKPOINT - Generated Language
 PROMPT: <s>
<s> the signal followed by approximately coincident <OOV> comic book <OOV> processes , like the interaction and manage information , it gains political organizations a notable <OOV> <OOV> indian languages , software
<s> <OOV> <OOV> in the team uses to washing machines where many ways of wheels were ones for written on complex systems . </s>
<s> formal structures of the board was banned in 1874 , physics and around the two brothers achieved prevalence first version , syntax of many have mixed language dictionary , or japanese
 PROMPT: <s> language is
<s> language is demand can display general internet , changes to paint , including vietnamese , they readily incorporate elements are reported quickly . </s>
<s> language is among the 20th century . </s>
<s> language is supported by john smeaton , various industries . </s>
 PROMPT: <s> machines
<s> machines were later language determine the <OOV> joint decoding we need for all names and they operate any digital technologies . </s>
<s> machines hold and <OOV> game `` before other causes and <OOV> are distinguished by applying methods for operating systems perform actions in canada , promoted the sales `` . </s>
<s> machines : better place precision required . </s>
 PROMPT: <s> they want to process
<s> they want to process engineering design -- verb -- <OOV> . </s>
<s> they want to process away <OOV> `` it had clearly “ power tool used this distinction between brain activity . </s>
<s> they want to process recognized as entrepreneurs . </s>
