Big-Data-Processing Assignment 1 Solution

In this assignment you have to write a multi-threaded Python program for the following problem. Make sure you use Python version 3.10 or newer.
You are given a collection of plain-text documents along with their class labels. The documents in a folder correspond to a particular class. The goal is to produce the top k (k is an integer) unique word n-grams from the collection based on their class salience score. A word n-gram is a consecutive sequence of n words that appears in a document. The class salience score of an n-gram is defined as

$$\text{score} = \frac{\text{count of the n-gram in a class}}{\text{\# documents in the class}}$$

Thus, if there are 20 classes and a particular n-gram appears in all of them, the n-gram will have 20 scores (one for each class). The top k will be strictly based on descending order of the n-grams' scores.
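
As a rough illustration of the scoring and ranking described above, here is a minimal sketch. The function names and input shapes are hypothetical and not part of the assignment. It also assumes that "unique" means each n-gram is reported once, with its highest-scoring class; the assignment text does not spell this out, so check the intended interpretation.

```python
import heapq
from collections import Counter


def salience_scores(class_counts, class_doc_totals):
    """Return {(ngram, label): score} for every n-gram seen in every class.

    class_counts:     {label: Counter mapping n-gram -> count in that class}
    class_doc_totals: {label: number of documents in that class} (assumed > 0)
    """
    scores = {}
    for label, counts in class_counts.items():
        n_docs = class_doc_totals[label]
        for ngram, count in counts.items():
            scores[(ngram, label)] = count / n_docs
    return scores


def top_k_unique(scores, k):
    """Keep the best score per n-gram (an assumption), then take the k largest."""
    best = {}
    for (ngram, label), score in scores.items():
        if ngram not in best or score > best[ngram][0]:
            best[ngram] = (score, label)
    ranked = heapq.nlargest(k, best.items(), key=lambda item: item[1][0])
    return [(ngram, label, score) for ngram, (score, label) in ranked]
```

With one class holding 2 documents and 4 occurrences of an n-gram, that n-gram scores 4 / 2 = 2.0 for the class.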

Tokenization rule (breaking documents into words): You must generate words from a document by breaking it on any non-alphanumeric character. You must also lowercase all the words.
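
One possible reading of the tokenization rule is sketched below. The helper names `tokenize` and `ngrams` are illustrative only. The sketch assumes "alphanumeric" means ASCII letters and digits and that empty strings produced by consecutive separators are discarded; neither point is stated in the assignment.

```python
import re

# Assumption: "alphanumeric" is taken to mean ASCII letters and digits.
_NON_ALNUM = re.compile(r"[^A-Za-z0-9]+")


def tokenize(text):
    """Split on runs of non-alphanumeric characters and lowercase each word."""
    return [word.lower() for word in _NON_ALNUM.split(text) if word]


def ngrams(words, n):
    """Return all consecutive n-word sequences as tuples."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
```

Under these assumptions an apostrophe is a separator, so "It's" yields the two words "it" and "s".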

Link to data: https://archive.ics.uci.edu/ml/machine-learning-databases/20newsgroupsmld/20_newsgroups.tar.gz

We will evaluate your program on a Linux system from the command line with the arguments as follows:

python <your-code.py> <path to data directory> <# threads> <value of n for n-gram> <value of k>

The above format is very important for evaluation. Your program must therefore accept its arguments in exactly this order.
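
A minimal sketch of reading the four positional arguments in the required order follows. The argument names (`data_dir`, `threads`, `n`, `k`) and the positivity check are assumptions made for illustration; the assignment fixes only the order.

```python
import argparse


def parse_args(argv=None):
    """Parse <data dir> <# threads> <n> <k>, in that order."""
    parser = argparse.ArgumentParser(description="Top-k n-grams by class salience")
    parser.add_argument("data_dir", help="path to the data directory")
    parser.add_argument("threads", type=int, help="number of threads")
    parser.add_argument("n", type=int, help="value of n for the n-grams")
    parser.add_argument("k", type=int, help="number of n-grams to report")
    args = parser.parse_args(argv)
    # Assumption: all three numeric arguments must be positive.
    if min(args.threads, args.n, args.k) < 1:
        parser.error("threads, n and k must be positive integers")
    return args
```

Calling `parse_args()` with no argument reads from `sys.argv`, which matches an invocation such as `python your-code.py 20_newsgroups 4 2 10`.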

Submission guidelines:
Important notes:
1. No credit will be given if your program does not run or produces wrong output.
2. No credit will be given if your program is not multithreaded.
4. It is your responsibility to check that the file has been submitted successfully.
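
Regarding the multithreading requirement in the notes above, one possible shape for the counting step is sketched here with `concurrent.futures.ThreadPoolExecutor`. This is an assumption about structure, not the required design: the function names are hypothetical, and each "chunk" stands in for the n-grams extracted from one document.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_items(chunk):
    """Count the items (for example, n-grams) in one chunk of work."""
    return Counter(chunk)


def parallel_count(chunks, num_threads):
    """Count every chunk in a thread pool and merge the partial counts."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # Partial counters are merged in the calling thread, so no lock is needed.
        for partial in pool.map(count_items, chunks):
            total.update(partial)
    return total
```

Merging in the calling thread keeps the shared counter free of concurrent writes; an alternative would be a shared counter guarded by a `threading.Lock`.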
