1. Description of Task

Your code should accomplish the following tasks:
(1) Read the text file debate.txt. This is the transcript of the latest Texas Senate race debate between Ted Cruz and Beto O'Rourke. The following code does it.
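A minimal sketch (assuming debate.txt is in the working directory, is UTF-8 encoded, and has one paragraph per non-empty line):

In [ ]: # Read the transcript; the encoding and one-paragraph-per-line layout are assumptions.
        with open('debate.txt', encoding='utf-8') as f:
            paragraphs = [line.strip() for line in f if line.strip()]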
(2) Tokenize the content of the file. For this, you need a tokenizer. For example, the following piece of code uses a regular expression tokenizer to return all course numbers in a string. Play with it and edit it. You can change the regular expression and the string to observe different output results.
For tokenizing the Texas Senate debate transcript, let's all use RegexpTokenizer(r'[a-zA-Z]+'). What tokens will it produce? What limitations does it have? (A sketch applying it follows the example below.)
In [ ]: from nltk.tokenize import RegexpTokenizer
        tokenizer = RegexpTokenizer(r'[A-Z]{2,3}[1-9][0-9]{3,3}')
        tokens = tokenizer.tokenize("CSE4334 and CSE5334 are taught together. IE3013 is an undergraduate course.")
        print(tokens)

['CSE4334', 'CSE5334', 'IE3013']
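Applying the required tokenizer to the transcript might look like the sketch below (paragraphs is assumed from step (1); lowercasing is an added assumption, so that tokens later match the lowercase stopword list):

In [ ]: from nltk.tokenize import RegexpTokenizer
        tokenizer = RegexpTokenizer(r'[a-zA-Z]+')
        # One token list per paragraph. The pattern keeps only alphabetic runs,
        # so numbers are dropped and contractions split (e.g., "don't" -> "don", "t").
        tokenized_paragraphs = [tokenizer.tokenize(p.lower()) for p in paragraphs]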
(3) Perform stopword removal on the obtained tokens. NLTK already comes with a stopword list, as a corpus in the "NLTK Data" (http://www.nltk.org/nltk_data/). You need to install this corpus. Follow the instructions at http://www.nltk.org/data.html. You can also find the instructions in this book: http://www.nltk.org/book/ch01.html (Section 1.2, Getting Started with NLTK). Basically, run the following statements in a Python interpreter. A pop-up window will appear. Click "Corpora" and choose "stopwords" from the list.
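In [ ]: import nltk
        nltk.download()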
showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
After the stopword list is downloaded, you will find a file "english" in the folder nltk_data/corpora/stopwords, where nltk_data is the download directory from the step above. The file contains 179 stopwords. nltk.corpus.stopwords will give you this list of stopwords. Try the following piece of code.
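A minimal sketch of that filtering (tokenized_paragraphs is assumed from step (2)):

In [ ]: from nltk.corpus import stopwords
        stop = set(stopwords.words('english'))   # the 179-word English list
        tokenized_paragraphs = [[t for t in para if t not in stop]
                                for para in tokenized_paragraphs]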
(4) Also perform stemming on the obtained tokens. NLTK comes with a Porter stemmer. Try the following code and learn how to use the stemmer.
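A minimal sketch (the names carry over from the previous steps):

In [ ]: from nltk.stem.porter import PorterStemmer
        stemmer = PorterStemmer()
        print(stemmer.stem('running'))   # prints 'run'
        tokenized_paragraphs = [[stemmer.stem(t) for t in para]
                                for para in tokenized_paragraphs]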
(5) Using the tokens, compute the TF-IDF vector for each paragraph. In this assignment, for calculating inverse document frequency, treat debate.txt as the whole corpus and the paragraphs as documents. That is also why we ask you to compute the TF-IDF vectors separately for all the paragraphs, one vector per paragraph.
Use the following equation from the lectures to calculate the term weights, where t is a token, d is a document (i.e., a paragraph), tf_{t,d} is the number of occurrences of t in d, df_t is the number of paragraphs containing t, and N is the total number of paragraphs:

w_{t,d} = (1 + \log_{10} \mathrm{tf}_{t,d}) \times \log_{10} \frac{N}{\mathrm{df}_t}
Note that the TF-IDF vectors should be normalized (i.e., their lengths should be 1).
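A minimal sketch of this computation (names are assumed from the previous steps; math and collections are in the standard library):

In [ ]: import math
        from collections import Counter

        N = len(tokenized_paragraphs)
        # Document frequency: each paragraph contributes at most once per term.
        df = Counter()
        for para in tokenized_paragraphs:
            df.update(set(para))

        def tfidf_vector(tokens):
            tf = Counter(tokens)
            vec = {t: (1 + math.log10(tf[t])) * math.log10(N / df[t]) for t in tf}
            # Normalize to unit length, as required.
            length = math.sqrt(sum(w * w for w in vec.values())) or 1.0
            return {t: w / length for t, w in vec.items()}

        vectors = [tfidf_vector(para) for para in tokenized_paragraphs]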
Represent each TF-IDF vector as a dictionary that maps tokens to weights. The following is a sample TF-IDF vector.
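(The values below are purely illustrative, not computed from the actual transcript; note that the keys are stemmed tokens.)

{'texa': 0.047, 'senat': 0.031, 'border': 0.226, 'health': 0.184, ...}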
(6) Given a query string, calculate the query vector; it is to be normalized as well. Compute the cosine similarity between the query vector and the TF-IDF vector of each paragraph in the transcript, and return the paragraph that attains the highest cosine similarity score.
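A minimal sketch, reusing the names from the earlier steps (the query goes through the same tokenize/stopword/stem pipeline; df and N still come from the transcript, and query terms absent from the corpus are dropped here, an assumption that avoids a zero document frequency):

In [ ]: def cosine(v1, v2):
            # Both vectors have unit length, so the dot product is the cosine similarity.
            return sum(w * v2.get(t, 0.0) for t, w in v1.items())

        def best_paragraph(query):
            q_tokens = [stemmer.stem(t)
                        for t in tokenizer.tokenize(query.lower()) if t not in stop]
            q_tokens = [t for t in q_tokens if t in df]   # drop terms unseen in the corpus
            q_vec = tfidf_vector(q_tokens)                # normalized like the paragraph vectors
            scores = [cosine(q_vec, v) for v in vectors]
            return paragraphs[scores.index(max(scores))]

        # Example call with a hypothetical query:
        # print(best_paragraph('immigration and the border'))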