Yellow Highlight: Issue with running.
Green Highlight: Given names for outputs/inputs.
Notes:
- “Root Folder” is a copy of how my directory looks after running all the commands below. This has been tested with both the pipelined version and the file/path input version, which produce the same output.
- “Individuals” consists of the collection segmented into individual files/articles/documents.
- “Step 1 and Step 2” cover the entire collection.
- “Step 3 to Step 6” use only the first 5 documents to be processed. This can be changed by modifying the ‘5’ parameter in “Step #3” to the desired number of documents, though it would take a while to run on the entire collection :3
Step #1 – Reading the Reuters collection
(block-1-reader.py)
Tested Commands
• python3 ./block-1-reader.py --path reuters21578
• python3 ./block-1-reader.py --path reuters21578 -o collection.json
============================================================
Input: reuters21578 (Corpus’ path)
Output: Creates a JSON file containing the entire collection (the 000 to 021 .sgm files) as a single string.
============================================================
- Changing directory to /reuters21578 to scan for .sgm files.
- open(file_name,'r', errors='ignore'): Opens each file with the errors parameter set to ignore because of a Unicode issue in the 017 .sgm file.
Error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 1519554: invalid start byte
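For reference, here is a minimal sketch of what the notes above describe (not the actual script; the overall structure is my own, and only the --path/-o arguments and the errors='ignore' call come from the notes):

    import argparse, json, os

    parser = argparse.ArgumentParser(description="Read the Reuters-21578 .sgm files")
    parser.add_argument("--path", required=True, help="path to the reuters21578 folder")
    parser.add_argument("-o", "--output", help="optional output file, e.g. collection.json")
    args = parser.parse_args()

    collection = ""
    for file_name in sorted(os.listdir(args.path)):
        if not file_name.endswith(".sgm"):
            continue
        # errors='ignore' skips the invalid byte that breaks the 017 file
        with open(os.path.join(args.path, file_name), "r", errors="ignore") as f:
            collection += f.read()

    output = json.dumps(collection)  # the whole collection as one JSON string
    if args.output:
        with open(args.output, "w") as f:
            f.write(output)
    else:
        print(output)                # printing allows piping into Step #2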
Step #2 – Extracting the raw text of each article from the corpus
(block-2-document-segmenter.py)
Tested Commands
• python3 ./block-2-document-segmenter.py -i collection.json
• python3 ./block-2-document-segmenter.py -i collection.json -o documents.json
• python3 ./block-1-reader.py --path reuters21578 | python3 ./block-2-document-segmenter.py
• python3 ./block-1-reader.py --path reuters21578 | python3 ./block-2-document-segmenter.py -o documents.json
============================================================
- Input: collection.json containing the entire collection, or the pipelined stdin data from Step #1.
- Output: Creates a JSON file with the first 5 documents.
============================================================
- **BeautifulSoup must be installed to parse the documents.
- **JSON is needed to parse the pipelined data
- replace("reuters", "REUTERS") is used to satisfy the assert on <REUTERS>, since the HTML parser lower-cases the tags.
- find_all('reuters',limit=5): Parses only the first 5 documents, as required for the output.
- sys.stdin.isatty(): Checks whether data is being piped in through stdin (isatty() returns False when stdin is a pipe).
- sys.stdin.read(): Read the pipelined data
- str(document): Change the “tag” object to “string”
- enumerate(documents): Used to iterate but keep an index variable
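Putting the bullets above together, a minimal sketch of the segmentation step could look as follows (the dict-of-documents output layout is my own assumption; only the calls named in the notes are taken from them):

    import json, sys
    from bs4 import BeautifulSoup

    # Read either the piped JSON string from Step #1 or collection.json (-i).
    if not sys.stdin.isatty():                    # data is being piped in
        collection = json.loads(sys.stdin.read())
    else:
        with open("collection.json") as f:
            collection = json.loads(f.read())

    soup = BeautifulSoup(collection, "html.parser")
    documents = {}
    for index, document in enumerate(soup.find_all("reuters", limit=5)):
        # html.parser lower-cases the tags, so restore <REUTERS> for the assert
        documents[index] = str(document).replace("reuters", "REUTERS")

    print(json.dumps(documents))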
Step #3 – Extraction
(block-3-extractor.py)
Tested Commands
• python3 ./block-3-extractor.py -i documents.json
• python3 ./block-3-extractor.py -i documents.json -o articles.json
• python3 ./block-1-reader.py --path reuters21578 | python3 ./block-2-document-segmenter.py | python3 ./block-3-extractor.py
• python3 ./block-1-reader.py --path reuters21578 | python3 ./block-2-document-segmenter.py | python3 ./block-3-extractor.py -o articles.json
============================================================
- Input: documents.json contains 5 documents
- Output: 5 dictionaries with ID and Text
============================================================
- **BeautifulSoup must be installed to parse the documents.
- document_id: Consists of ID with whitespace and escape characters
- document_id_arr: Consists of an array with one element (the ID #). Python’s regex library is used to parse the document_id without the escaped characters and whitespace.
- document.body.contents[0]: Used to get the body tag's text.
- str(text): Casts the iterable string to a normal string.
- re.sub(r"[\\]+","",text): Removes the escaped backslashes from the text.
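As a rough illustration of the extraction above, a sketch of one possible implementation follows. Reading the ID from the NEWID attribute and the "id"/"text" key names are my own assumptions; the individual calls mirror the notes.

    import json, re, sys
    from bs4 import BeautifulSoup

    def extract(document_str):
        document = BeautifulSoup(document_str, "html.parser").find("reuters")
        document_id = document.get("newid", "")              # raw ID string
        document_id_arr = re.findall(r"\d+", document_id)    # keep only the ID number
        text = str(document.body.contents[0]) if document.body else ""
        text = re.sub(r"[\\]+", "", text)                    # drop the escaped backslashes
        return {"id": int(document_id_arr[0]), "text": text}

    documents = json.loads(sys.stdin.read())                 # the 5 documents from Step #2
    for document_str in documents.values():
        print(json.dumps(extract(document_str)))             # one dictionary per article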
Step #4 – Tokenisation
(block-4-tokeniser.py)
Tested Commands
• python3 block-4-tokenizer.py -i articles.json
• python3 block-4-tokenizer.py -i articles.json -o tokens.json
• python3 ./block-1-reader.py --path reuters21578 | python3 ./block-2-document-segmenter.py | python3 ./block-3-extractor.py | python3 ./block-4-tokenizer.py
• python3 ./block-1-reader.py --path reuters21578 | python3 ./block-2-document-segmenter.py | python3 ./block-3-extractor.py | python3 ./block-4-tokenizer.py -o tokens.json
============================================================
- Input: articles.json contains dictionaries with ID and Text
- Output: Lists of Tokens [ID, Token]
============================================================
- **NLTK must be installed to tokenise the texts.
- **JSON is needed to parse the pipelined data
- word_tokenize(full_text): Use NLTK to tokenise the text.
- Since the pipelined document needs to be properly formatted into JSON, some replacements are needed.
- data = data.replace("}","},")
- data= re.sub(r"{","[{",data, count=1) #For start of index
- data= re.sub(r"},$","}]",data, count=1)
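A minimal sketch of how the pieces above could fit together (the "id"/"text" key names carry over from the Step #3 sketch and are my own assumption; NLTK's punkt data must be downloaded for word_tokenize):

    import json, re, sys
    from nltk.tokenize import word_tokenize        # requires nltk.download('punkt')

    data = sys.stdin.read()
    # The piped articles arrive as back-to-back dictionaries, so patch them into a JSON array.
    data = data.replace("}", "},")
    data = re.sub(r"{", "[{", data, count=1)       # open the array at the first dictionary
    data = re.sub(r"},$", "}]", data, count=1)     # close the array at the last dictionary
    articles = json.loads(data)

    tokens = []
    for article in articles:
        full_text = article["text"]
        for token in word_tokenize(full_text):
            tokens.append([article["id"], token])  # [ID, Token] pairs

    print(json.dumps(tokens))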
Step #5 – Apply Porter Stemmer
(block-5-stemmer.py)
Tested Commands
• python3 ./block-5-stemmer.py -i tokens.json
• python3 ./block-5-stemmer.py -i tokens.json -o stems.json
• python3 ./block-1-reader.py --path reuters21578 | python3 ./block-2-document-segmenter.py | python3 ./block-3-extractor.py | python3 ./block-4-tokenizer.py | python3 ./block-5-stemmer.py
• python3 ./block-1-reader.py --path reuters21578 | python3 ./block-2-document-segmenter.py | python3 ./block-3-extractor.py | python3 ./block-4-tokenizer.py | python3 ./block-5-stemmer.py -o stems.json
============================================================
- Input: tokens.json contains lists of [ID, Token] pairs
- Output: Lists of Stemmed Tokens [ID, Token]
============================================================
- **NLTK must be installed to get the PorterStemmer
- **JSON is needed to parse the pipelined data
- sys.stdin.readlines(): Reads the piped data line by line to parse the tokens in the JSON array.
- stemmer.stem(token[1]): Stems the token, which is the 2nd element of each pair.
- (int(token[0]), token_stem): Casts the ID, which is of type string, into an int.
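A minimal sketch of the stemming step under the same assumptions (stdin carries the JSON token list printed by Step #4):

    import json, sys
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    data = "".join(sys.stdin.readlines())          # the JSON token list from Step #4
    tokens = json.loads(data)

    stems = []
    for token in tokens:
        token_stem = stemmer.stem(token[1])        # stem the token (2nd element of the pair)
        stems.append((int(token[0]), token_stem))  # cast the ID to an int

    print(json.dumps(stems))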
Step #6 – Given a list of stop words, remove those stop words from text. (block-6-stopwords-removal.py)
Tested Commands
• python3 ./block-6-stopwords-removal.py -i stems.json
• python3 ./block-6-stopwords-removal.py -i stems.json -s stopwords-sample.txt -o results.json
• python3 ./block-1-reader.py --path reuters21578 | python3 ./block-2-document-segmenter.py | python3 ./block-3-extractor.py | python3 ./block-4-tokenizer.py | python3 ./block-5-stemmer.py | python3 ./block-6-stopwords-removal.py
• python3 ./block-1-reader.py --path reuters21578 | python3 ./block-2-document-segmenter.py | python3 ./block-3-extractor.py | python3 ./block-4-tokenizer.py | python3 ./block-5-stemmer.py | python3 ./block-6-stopwords-removal.py -o results.json
============================================================
- Input: stems.json contains lists with ID and stemmed tokens, plus optionally the path to a list of stopwords. “stopwords-sample.txt” consists of a subset of NLTK stopwords.
- Output: Lists of Tokens [ID, Token] without the stopwords.
============================================================
- **NLTK must be installed to get a list of stopwords to remove.
- **JSON is needed to parse the pipelined data
- stop_words = set(stopwords.words("english")): Retrieves the set of English stopwords from the NLTK library. The NLTK stopwords corpus has to be downloaded first!
- if stopwords_list: Checks that the “stopwords” path is valid and the file is not empty. If it is not empty, its entries are read into an array.
- if stem_word not in stopwords_arr: Adds a token only if it is not in that array.
- if not stopwords_arr: Falls back to the default if the path is invalid or the file is empty.
- if stem_word not in stop_words: Checks the token against NLTK’s full stopwords set.
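To tie the bullets above together, here is a minimal sketch of the stop-word removal step. The argument names follow the tested commands; the rest of the structure is my own assumption.

    import argparse, json, sys
    from nltk.corpus import stopwords               # requires nltk.download('stopwords')

    parser = argparse.ArgumentParser()
    parser.add_argument("-i", "--input", help="stems.json (omit when piping)")
    parser.add_argument("-s", "--stopwords", help="optional stop-word file, e.g. stopwords-sample.txt")
    parser.add_argument("-o", "--output", help="optional results.json")
    args = parser.parse_args()

    if args.input:
        with open(args.input) as f:
            stems = json.load(f)
    else:
        stems = json.loads(sys.stdin.read())        # piped output of Step #5

    stopwords_arr = []
    if args.stopwords:                              # custom list, assumed one word per line
        with open(args.stopwords) as f:
            stopwords_arr = [line.strip() for line in f if line.strip()]

    stop_words = set(stopwords.words("english"))    # NLTK's full English stop-word set

    results = []
    for doc_id, stem_word in stems:
        if stopwords_arr:
            if stem_word not in stopwords_arr:
                results.append([doc_id, stem_word])
        elif stem_word not in stop_words:           # default when -s is missing or empty
            results.append([doc_id, stem_word])

    output = json.dumps(results)
    if args.output:
        with open(args.output, "w") as f:
            f.write(output)
    else:
        print(output)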