Overview
In this assignment, we will write a program to analyze the word frequency in a document. Because the number of words in the document may not be known a priori, we will implement a dynamically doubling array to store the necessary information.
Please read all the directions before writing code, as this write-up contains specific requirements for how the code should be written.
Your Task
There are two files on Moodle. One contains text to be read and analyzed by your program, and is named mobydick.txt. As the name implies, this file contains the full text of Moby Dick. For your convenience, all punctuation has been removed, all words have been converted to lowercase, and the entire document can be read as if it were written on a single line. The other file, called ignoreWords.txt, contains the 50 most common words in the English language, which your program will ignore during analysis.
Your program must take three command line arguments in the following order: a number N, the name of the text file to be read, and the name of the text file with the words that should be ignored. It will read the text (ignoring the words in the second file) and store all distinct words in a dynamically doubling array. (For file I/O, refer to your first homework assignment…) It should then calculate and print the following information:
● The number of times array doubling was required to store all the distinct words
● The number of distinct “non-ignore” words in the file
● The total word count of the file (excluding the ignore words)
● Starting from index N, print the 10 most frequent words along with their probability of occurrence (up to 4 decimal places) from the array. The array should be sorted in decreasing order based on the probability of occurrence of the words. If two words have the same probability, those two words should be listed in alphabetical order.
For example, running your program with the command:
would print the next 10 words starting from index 25, i.e. your program should print the 25th-34th most frequent words, along with their respective probabilities. Keep in mind that these words should not be any of the words from ignoreWords.txt .
The full results would be:
Specifics:
1. Use an array of structs to store the words and their counts
There is an unknown number of words in the file. You will store each distinct word and its count (the number of times it occurs in the document). Because of this, you will need to store these words in a dynamically sized array of structs. The struct must be defined as follows:
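The struct definition itself did not survive in this copy of the write-up. Based on the wordRecord name used later in the handout, a plausible reconstruction is the following; the actual definition in the original handout may differ (for example, a fixed-size char array instead of std::string):

```cpp
#include <string>

// Hypothetical reconstruction of the required struct -- one distinct word
// paired with the number of times it occurs in the document.
struct wordRecord {
    std::string word;  // the distinct word itself
    int count;         // number of occurrences in the document
};
```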
2. Use the array-doubling algorithm to increase the size of your array
Your array will need to grow to fit the number of words in the file. Start with an array size of 100, and double the size whenever the array runs out of free space. You will need to allocate your array dynamically and copy the values from the old array to the new array. (The array-doubling algorithm should be implemented in the main() function.)
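The doubling step described above can be sketched as follows. The helper name doubleArray and the variable names are illustrative only; the assignment requires the logic to live in main(), so treat this as a sketch of the steps, not a required function:

```cpp
#include <string>

struct wordRecord {
    std::string word;
    int count;
};

// Double the capacity of a dynamic wordRecord array: allocate a new array
// twice the size, copy the existing elements over, free the old storage,
// and point the caller's pointer at the new array.
void doubleArray(wordRecord*& arr, int& capacity) {
    int newCapacity = capacity * 2;
    wordRecord* bigger = new wordRecord[newCapacity];
    for (int i = 0; i < capacity; i++) {
        bigger[i] = arr[i];  // copy old contents into the larger array
    }
    delete[] arr;            // release the old allocation
    arr = bigger;
    capacity = newCapacity;
}
```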
3. Ignore the top 50 most common words that are read in from the second file
To get useful information about word frequency, we will ignore the 50 most common words in the English language. These words will be read in from a file, whose name is the third command line argument.
4. Take three command line arguments
Your program must take three command line arguments - a number N which tells your program the starting index to print the next 10 most frequent words, the name of the text file to be read and analyzed, and the name of the text file with the words that should be ignored.
5. Output the Next 10 most frequent words starting from index N
Your program should print the next 10 most frequent words - not including the common words - starting at index N in the sorted array, where N is passed in as a command line argument. If two words have the same frequency, those two words should be listed in alphabetical order.
E.g. If N=5 then print words from index 5-14 in the array sorted in decreasing order of the probabilities of the occurrence of the words.
6. Format your final output this way:
For example, using the command:
you should get the output:
7. You must include the following functions (they will be tested by the autograder):
a. In your main function
i. If the correct number of command line arguments is not passed, print the below statement and exit the program
ii. Get ignore-words/common-words from ignoreWords.txt and store them in an array (call your getIgnoreWords function)
iii. Array-doubling should be done in the main() function
iv. Read words from mobydick.txt and store all distinct words that are not ignore-words in an array of structs
1. Create a dynamic wordRecord array of size 100
2. Add non-ignore words to the array (double the array size if the array is full)
3. Keep track of the number of times the wordRecord array is doubled and the number of distinct non-ignore words
b.
This function should read the words to ignore from the file with the name stored in ignoreWordFileName and store them in the ignoreWords array. You can assume there will be exactly 50 words to ignore. There is no return value. If the file fails to open, print an error message using the below cout statement:
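A sketch of this function is below. The getIgnoreWords name comes from the handout, but the parameter types and the error message text do not survive in this copy, so both are assumptions; use whatever exact cout statement the original assignment specifies:

```cpp
#include <fstream>
#include <iostream>
#include <string>

// Read exactly 50 ignore words from the named file into ignoreWords.
// The error message below is a placeholder -- the autograder will expect
// the exact wording given in the original handout.
void getIgnoreWords(const char* ignoreWordFileName, std::string ignoreWords[]) {
    std::ifstream in(ignoreWordFileName);
    if (in.fail()) {
        std::cout << "Failed to open " << ignoreWordFileName << std::endl;
        return;
    }
    for (int i = 0; i < 50; i++) {  // exactly 50 words are guaranteed
        in >> ignoreWords[i];
    }
    in.close();
}
```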
c.
This function should return whether word is in the ignoreWords array.
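The function's signature is not visible in this copy of the handout, so the name isIgnoreWord below is an assumption; the logic is a straightforward linear scan over the 50-element array:

```cpp
#include <string>

// Return true if word appears anywhere in the 50-element ignoreWords array.
// The name "isIgnoreWord" is assumed -- match whatever signature the
// original handout (and autograder) specifies.
bool isIgnoreWord(const std::string& word, const std::string ignoreWords[]) {
    for (int i = 0; i < 50; i++) {
        if (ignoreWords[i] == word) {
            return true;
        }
    }
    return false;
}
```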
d.
This function should compute the total number of words in the entire document by summing up all the counts of the individual distinct words. The function should return this sum.
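Since the signature is elided here, the name and parameters below are assumptions; the body is a simple sum over the counts of the distinct words:

```cpp
#include <string>

struct wordRecord {
    std::string word;
    int count;
};

// Sum the counts of the first `length` distinct words to get the total
// word count of the document (excluding ignore words). The function name
// is assumed -- match the signature in the original handout.
int getTotalNumberNonIgnoreWords(const wordRecord distinctWords[], int length) {
    int total = 0;
    for (int i = 0; i < length; i++) {
        total += distinctWords[i].count;
    }
    return total;
}
```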
e.
This function should sort the distinctWords array (whose first length elements are initialized) by word count, such that the most frequent words come first. The function does not return anything.
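The required ordering (decreasing count, ties broken alphabetically) can be sketched with a simple selection sort; any in-place sort that produces this ordering works, and the name sortArray below is an assumption since the handout's signature is not visible here:

```cpp
#include <string>
#include <utility>

struct wordRecord {
    std::string word;
    int count;
};

// Sort the first `length` elements: higher counts first; when two words
// have the same count, the alphabetically smaller word comes first.
void sortArray(wordRecord distinctWords[], int length) {
    for (int i = 0; i < length - 1; i++) {
        int best = i;
        for (int j = i + 1; j < length; j++) {
            if (distinctWords[j].count > distinctWords[best].count ||
                (distinctWords[j].count == distinctWords[best].count &&
                 distinctWords[j].word < distinctWords[best].word)) {
                best = j;  // found a word that should sort earlier
            }
        }
        std::swap(distinctWords[best], distinctWords[i]);
    }
}
```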
f.
This function should print the next 10 words starting from index N of the sorted distinctWords array. These 10 words should be printed with their probability of occurrence up to 4 decimal places. The exact format of this printing is given below. The function does not return anything.
The probability of occurrence of a word at position ind in the array is computed using the formula: (Don't forget to cast to float!)
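The formula itself is elided in this copy of the handout; from the surrounding description it is the word's count divided by the total word count, cast to float. A sketch of the printing loop is below. The function name, parameters, and the exact line format (including the separator) are assumptions, so defer to the output format section of the original handout:

```cpp
#include <iomanip>
#include <iostream>
#include <string>

struct wordRecord {
    std::string word;
    int count;
};

// Print the 10 words at indices N..N+9 of the sorted array, each with its
// probability of occurrence to 4 decimal places. The "probability - word"
// line format here is a placeholder -- use the handout's exact format.
void printTenFromN(const wordRecord distinctWords[], int N, int totalNumWords) {
    std::cout << std::fixed << std::setprecision(4);
    for (int ind = N; ind < N + 10; ind++) {
        // Cast to float so the division is not integer division.
        float probability = (float)distinctWords[ind].count / totalNumWords;
        std::cout << probability << " - " << distinctWords[ind].word << std::endl;
    }
}
```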
Output format