Brief Description
In this homework, you will write a search engine as in the 2nd and 3rd Homework. Additionally, you will compare the speed of different data structures for your search engine architecture. The search engines in real life search hundreds of millions web pages to see if they have the words that you have typed, and they can do this for thousands of users at a given time. In order to do these searches really fast, search engines such as Google do a lot of what we call preprocessing of the pages they search; that is, they transform the contents of a web page (which for the purposes of this homework, we will assume it consists of only strings) into a structure that can be searched very fast.
In this homework, you are going to read the input files, and store them into the vector. You will then sort this vector with several sorting algorithms:
1. Quick Sort (selection of pivot=first, pivot=random, and pivot=median),
2. Insertion Sort,
3. Merge Sort (in-place: without using extra memory storage),
4. Heap Sort.
You will need to keep four different copies of the vector, since we want to see which of the sorting algorithms is more efficient. Next, after getting a query from the user, you should search for words in vector with the binary search algorithm (that’s why we sorted them).
Program Flow
You need to get the number of the documents that you need to process first. Then, after getting the names of all documents from the console, you need to store them in a vector. Only the alphabetical characters are considered, and the unique words need to be case insensitive (for example; “izmir” and “Izmir” are the same). The rest (such as digits or punctuation characters) are processed as the separator for the words. For each unique word appearing in the text, you need to insert a new node into the vector.
If the node is already there, you need to update the node with the information about this document. For example; you have the “a.txt” document first and there is only one “the” word in this document. You need to insert “the” word into the data structures. Then, while processing the “b.txt” document, “the” word appears 4 times. So, you need to add this information ({“b.txt”, 4}) into the existing item in your data structures and the item of “the” word becomes {{“a.txt”, 1}, {“b.txt”, 4}}.
After you processed all the documents, you need to get a query from the console which consists of a line of strings (HINT: You may use getline(cin, line)). This line might consist of more than one word. Then, you need to output the document names which all words in the query appear including the number of times that each word appears in that document.
For the timing output, you need to print three different results;
1. Time for search using binary search tree, hash table and binary search on sorted vectors.
2. Time for all sorting algorithms.
3. Speedups for comparing sorting algorithms and comparing search techniques.
For the comparison of the speed of using different search techniques, you need to perform the same operation, processing the query and building the result, without printing the result 100 times and taking the average timing. Hint struct DocumentItem {
string documentName; int count;
}; struct WordItem {
string word;
// List of DocumentItem’s.
};
For timing:
int k = 100; clock_t begin_time = clock(); for (int i = 0; i < k; i++) // QueryDocuments with Binary Search Tree double endBST = float(clock() - begin_time); cout << "\nBinary Search Tree Time: " << endBST / k << "\n"; begin_time = clock(); for (int i = 0; i < k; i++)
// QueryDocuments with Hash Table double endHT = float(clock() - begin_time); cout << "\nHash Table Time: " << endHT / k << "\n"; begin_time = clock(); for (int i = 0; i < k; i++)
// QueryDocuments with Binary Search double endBS = float(clock() - begin_time); cout << "\nBinary Search Time: " << endBS / k << "\n";
Rules
● Every non-alphabetical character is considered as a separator. This rule applies for both the text in input files and queried words.
For example; "face2face" = "face.face" = "face:facE" = "face face"
Input
First, input is the number of documents that is going to be processed. Then the file names will be received from the console. Then the queried words will be taken from the user as one line of words.
Output
After the user enters the query words, you need to print the document names and the number of times each queried word appears in that document.
Lastly you compute the time measurements for each sorting algorithm and print them with the ratio of these timing values.
Sample Run 1
Enter number of input files: 1
Enter 1. file name: a.txt
After preprocessing, the unique word count is 8475. Current load ratio is
0.521571
Enter queried words in one line: above in Document a.txt, above found 12 times. in Document a.txt, above found 12 times. in Document a.txt, above found 12 times.
Binary Search Tree Time: 199420
Hash Table Time: 49840
Binary Search Time: 3652
Quick Sort(median) Time: 7052375
Quick Sort(random) Time: 7067562
Quick Sort(first) Time: 7067666
Merge Sort Time: 89310980
Heap Sort Time: 34407965
Insertion Sort Time: 59291425
Speed Up BST/HST: 4.0012
Speed Up Merge Sort/Quick Sort(Median): 12.6125
Speed Up Heap Sort/Quick Sort(Median): 4.859
Speed Up Insertion Sort/Quick Sort(Median): 8.37314
Speed Up Binary Search / Binary Search Tree Time: 10.926
Speed Up Binary Search / Hash Table Time: 2.711
Sample Run 2
Enter number of input files: 2
Enter 1. file name: a.txt
Enter 2. file name: b.txt
After preprocessing, the unique word count is 15965. Current load ratio is
0.574219
Enter queried words in one line: above aberration in Document a.txt, above found 12 times, aberration found 1 times. in Document a.txt, above found 12 times, aberration found 1 times. in Document a.txt, above found 12 times, aberration found 1 times.
Binary Search Tree Time: 249490
Hash Table Time: 49975
Binary Search Time: 17787
Quick Sort(random) Time: 13987222
Quick Sort(median) Time: 15261330
Quick Sort(first) Time: 17879211
Merge Sort Time: 188395480
Heap Sort Time: 72005655
Insertion Sort Time: 281349125
Speed Up BST/HST: 4.9923
Speed Up Merge Sort/Quick Sort(Median): 12.3446
Speed Up Heap Sort/Quick Sort(Median): 4.71818
Speed Up Insertion Sort/Quick Sort(Median): 18.4354
Speed Up Binary Search Tree Time / Binary Search: 14.02
Speed Up Binary Search / Hash Table Time: 2.8
Sample Run 3
Enter number of input files: 3
Enter 1. file name: a.txt
Enter 2. file name: b.txt
Enter 3. file name: c.txt
After preprocessing, the unique word count is 22320. Current load ratio is
0.802791
Enter queried words in one line: a the-has been
No document contains the given query
No document contains the given query No document contains the given query
Binary Search Tree Time: 440676
Hash Table Time: 149620
Binary Search Time: 24482
Quick Sort(median) Time: 22140735
Quick Sort(first) Time: 22511732
Quick Sort(random) Time: 25140735
Merge Sort Time: 267334760
Heap Sort Time: 102276335
Insertion Sort Time: 3027406830
Speed Up BST/HST: 2.94063
Speed Up Merge Sort/Quick Sort(Median): 12.0743
Speed Up Heap Sort/Quick Sort(Median): 4.61937
Speed Up Insertion Sort/Quick Sort(Median): 136.735
Speed Up Binary Search Tree Time / Binary Search: 18.02 Speed Up Hash Table Time / Binary Search Tree Time: 6.1
Sample Run 4
Enter number of input files: 2
Enter 1. file name: a.txt Enter 2. file name: d.txt
Enter queried words in one line: A:
in Document a.txt, a found 87 times. in Document d.txt, a found 4 times. in Document a.txt, a found 87 times. in Document d.txt, a found 4 times. in Document a.txt, a found 87 times. in Document d.txt, a found 4 times.
Binary Search Tree Time: 99720
Hash Table Time: 49820
Binary Search Time: 9772
Quick Sort(first) Time: 7010690
Quick Sort(random) Time: 7111210
Quick Sort(median) Time: 7230590
Merge Sort Time: 94050125
Heap Sort Time: 36254350
Insertion Sort Time: 60584885
Speed Up BST/HST: 2.00161
Speed Up Merge Sort/Quick Sort(Median): 13.0073
Speed Up Heap Sort/Quick Sort(Median): 5.01402
Speed Up Insertion Sort/Quick Sort(Median): 8.37897
Speed Up Binary Search Tree Time / Binary Search: 10 Speed Up Hash Table Time / Binary Search: 5.02