Big-Data-Processing Assignment 3 Solution

You will be given a file of text documents, where each line corresponds to one document. For a given word W, the goal is to find:
1) the top k words positively associated with W;
2) the top k words negatively associated with W.
Association is computed from word co-occurrence in documents using pointwise mutual information (PMI) scores. A word must not contain anything other than English letters. While computing co-occurrences, you must lowercase all words and you must also remove the stopwords available here: https://github.com/terrier-org/terrier-desktop/blob/master/share/stopword-list.txt
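The cleaning rules above (English letters only, lowercasing, stopword removal) can be sketched as a per-document tokenizer; `stopwords` here is assumed to be a set loaded from the linked stopword list:

```python
import re

def tokenize(line, stopwords):
    """Return the set of distinct valid words in one document:
    keep only tokens made entirely of English letters, lowercase
    them, and drop any that appear in the stopword set."""
    words = set()
    for token in line.split():
        if re.fullmatch(r"[A-Za-z]+", token):  # letters only
            w = token.lower()
            if w not in stopwords:
                words.add(w)
    return words
```

Returning a set rather than a list matters here: co-occurrence is counted per document, so a word repeated within one line should count only once.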

PMI(w1, w2) = log2( P(w1, w2) / (P(w1) * P(w2)) )

where P(w1, w2) = co / N and P(w) = m / N, with:
co -> # documents where the two words appear together
m  -> # documents where w appears
N  -> total # documents
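Substituting the counts into the formula, the score simplifies to log2(co * N / (m1 * m2)); a minimal sketch:

```python
import math

def pmi(co, m1, m2, n):
    """PMI(w1, w2) = log2( P(w1, w2) / (P(w1) * P(w2)) ),
    with P(w1, w2) = co/n and P(wi) = mi/n; algebraically
    this reduces to log2(co * n / (m1 * m2))."""
    return math.log2((co * n) / (m1 * m2))
```

For example, with N = 16 documents, each word appearing in 4 documents, and 2 co-occurrences: independence would predict 16 * (4/16) * (4/16) = 1 co-occurrence, so observing 2 gives PMI = log2(2) = 1.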

Your goal is to write a Spark program for the above problem. You may use either Scala or PySpark. Your code must have a main function.

Output format: output must be printed to the screen. First print the list of positively associated words along with their PMI scores, then the list of negatively associated words along with their PMI scores.
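One way to select those two lists from the scored words (a plain-Python sketch; in Spark the same selection could be done with `takeOrdered` on the scored RDD):

```python
def top_k_associations(scores, k):
    """Given {word: pmi_score}, return the k highest-scoring words
    (positive associations) and the k lowest-scoring words
    (negative associations), each as (word, score) pairs."""
    by_score = sorted(scores.items(), key=lambda kv: kv[1])
    positive = list(reversed(by_score[-k:]))  # descending PMI
    negative = by_score[:k]                   # ascending PMI
    return positive, negative
```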

We will evaluate your program on a Linux system from the command line with the following arguments:

spark-submit <your-code> <path to file> <query-word> <k>

where <query-word> is the given word and k is the number of top positively and negatively associated words to report for it. This argument order is important for evaluation, so your program must follow it exactly.
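Argument handling in the driver might look like this (function and script names are illustrative; inside the driver, `sys.argv[0]` is the script path itself):

```python
import sys

def parse_args(argv):
    """Parse <path to file> <query-word> <k> as passed by spark-submit.
    The query word is lowercased to match the tokenized documents."""
    if len(argv) != 4:
        sys.exit("usage: spark-submit <your-code> <path-to-file> <query-word> <k>")
    path, query_word, k = argv[1], argv[2].lower(), int(argv[3])
    return path, query_word, k
```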

Submission guidelines:
Important notes:
1. No credit will be given if your program does not run or produces wrong output.
3. It is your responsibility to check that the file has been submitted successfully.
