Brief Description
In this homework, you will write a search engine as in Homework2. Additionally, you will compare the speed of two different data structures for your search engine architecture. Real-life search engines search hundreds of millions of web pages to see if they contain the words that you have typed, and they can do this for thousands of users at a time. In order to perform these searches very fast, search engines such as Google do a lot of what we call preprocessing of the pages they search; that is, they transform the contents of a web page (which, for the purposes of this homework, we will assume consists only of strings) into a structure that can be searched very fast.
In this homework, you are going to preprocess the documents provided to you. For each unique word, you will insert a node into both of your data structures, i.e. a Binary Search Tree (BST) and a Hash Table. Of course, you will also keep track of details such as the name of each document in which the word appears and the number of times the word appears in each document. So, you need to implement a templated BST and a templated Hash Table.
Using both data structures, you need to show the number of times every queried word appears in each document. After that, you will compare their speeds. You may use and modify the data structures and classes you implemented in Homework2.
BST (Binary Search Tree)
You may modify the AVL Tree class that you implemented in Homework2 by removing its balancing (rotation) logic.
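With the rotations removed, the insert and find operations reduce to a plain key comparison walk. A minimal sketch of such an unbalanced templated BST is shown below; the node layout and method names are illustrative, not mandated by the assignment (destructor omitted for brevity):

```cpp
#include <string>

// Sketch of an unbalanced templated BST, assuming the Homework2 AVL class
// with its rotation calls removed. Names here are illustrative.
template <typename Key, typename Value>
struct BSTNode {
    Key key;
    Value value;
    BSTNode *left = nullptr, *right = nullptr;
    BSTNode(const Key &k, const Value &v) : key(k), value(v) {}
};

template <typename Key, typename Value>
class BST {
public:
    // Insert without rebalancing: walk down and attach a new leaf.
    void insert(const Key &k, const Value &v) { root = insert(root, k, v); }

    // Return a pointer to the stored value, or nullptr if the key is absent.
    Value *find(const Key &k) {
        BSTNode<Key, Value> *cur = root;
        while (cur) {
            if (k < cur->key)      cur = cur->left;
            else if (cur->key < k) cur = cur->right;
            else                   return &cur->value;
        }
        return nullptr;
    }

private:
    BSTNode<Key, Value> *root = nullptr;

    BSTNode<Key, Value> *insert(BSTNode<Key, Value> *t, const Key &k, const Value &v) {
        if (!t) return new BSTNode<Key, Value>(k, v);
        if (k < t->key)      t->left  = insert(t->left, k, v);
        else if (t->key < k) t->right = insert(t->right, k, v);
        // Duplicate keys are ignored; the caller updates the existing value.
        return t;
    }
};
```

Note that without rotations the tree can degenerate into a linked list on nearly sorted input, which is one reason the hash table is expected to win the timing comparison.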
HashTable
You are going to implement a hash table data structure adopting any collision-handling strategy, such as separate chaining, linear probing, or quadratic probing. You may use any strategy that accelerates the HashTable operations.
You need to rehash every time the load of the table exceeds the load factor threshold you have chosen for your HashTable. For each rehash operation, you need to print the internal information of the HashTable, such as the previous capacity, current capacity, previous load, and current load. The load is calculated as (item count / capacity).
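The rehash bookkeeping can be sketched as below. Separate chaining, the 0.45 threshold, and growing to the smallest prime at least twice the old capacity are all assumptions for this sketch (they happen to reproduce the 53 → 107 → 223 → 449 capacity sequence of the sample runs, but the choices are yours to make):

```cpp
#include <functional>
#include <iostream>
#include <string>
#include <vector>

// Trial-division primality test; fine for table sizes in this range.
bool isPrime(int n) {
    if (n < 2) return false;
    for (int d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

int nextPrime(int n) {           // smallest prime >= n
    while (!isPrime(n)) ++n;
    return n;
}

// Sketch of a separate-chaining table that only demonstrates the rehash
// step; real entries would be WordItem structs rather than bare strings.
struct ChainingTable {
    std::vector<std::vector<std::string>> buckets;
    int itemCount = 0;
    double maxLoad = 0.45;       // assumed threshold; tune it for speed

    ChainingTable() : buckets(53) {}

    void maybeRehash() {
        double load = double(itemCount) / buckets.size();
        if (load <= maxLoad) return;
        int oldCap = (int)buckets.size();
        int newCap = nextPrime(2 * oldCap);
        std::vector<std::vector<std::string>> old;
        old.swap(buckets);
        buckets.assign(newCap, {});
        // Re-insert every stored word under the new capacity.
        for (auto &chain : old)
            for (auto &word : chain)
                buckets[std::hash<std::string>{}(word) % newCap].push_back(word);
        std::cout << "rehashed...\n"
                  << "previous table size:" << oldCap
                  << ", new table size: " << newCap
                  << ", current unique word count " << itemCount
                  << ", current load factor: " << double(itemCount) / newCap
                  << "\n";
    }
};
```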
Additionally, you need to develop a hash function that aims to produce collision-free hash values for different WordItem structs.
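One common starting point is a Horner's-rule polynomial hash over the word string; since words are already lowercased during preprocessing, the word alone distinguishes WordItem structs. The base 37 and the function itself are a textbook choice used here as an assumption, not a requirement:

```cpp
#include <string>

// Horner's-rule polynomial hash with base 37 (an assumed, common choice).
// Overflow simply wraps modulo 2^64, which is the usual behavior relied on.
unsigned long hashWord(const std::string &word, unsigned long tableSize) {
    unsigned long h = 0;
    for (char c : word)
        h = h * 37 + (unsigned char)c;
    return h % tableSize;
}
```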
Program Flow
First, you need to get the number of documents to preprocess. Then, after getting the names of all documents from the console, you need to preprocess all the text in the documents. Only alphabetical characters are considered, and the unique words are case insensitive (for example, “izmir” and “Izmir” are the same). All other characters (such as digits or punctuation) are treated as separators between words. For each unique word appearing in the text, you need to insert a new node into the tree and the hash table.
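The tokenization rule above can be sketched as follows: every non-alphabetical character ends the current word, and letters are lowercased, so "face2facE" yields the word "face" twice. The function name is illustrative:

```cpp
#include <cctype>
#include <string>
#include <vector>

// Split a line into lowercase words, treating every non-alphabetical
// character as a separator (per the assignment's tokenization rule).
std::vector<std::string> tokenize(const std::string &line) {
    std::vector<std::string> words;
    std::string cur;
    for (char ch : line) {
        if (std::isalpha((unsigned char)ch)) {
            cur += (char)std::tolower((unsigned char)ch);
        } else if (!cur.empty()) {   // a separator ends the current word
            words.push_back(cur);
            cur.clear();
        }
    }
    if (!cur.empty()) words.push_back(cur);
    return words;
}
```

The `(unsigned char)` casts matter: passing a negative `char` to `std::isalpha` or `std::tolower` is undefined behavior.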
If the node is already there, you need to update it with the information about the current document. For example, suppose you have preprocessed the “a.txt” document first and the word “the” appears exactly once in it. You need to insert the word “the” into the data structures. Then, while preprocessing the “b.txt” document, the word “the” appears 4 times. So, you need to add this information ({“b.txt”, 4}) to the existing item in your data structures, and the item for the word “the” becomes {{“a.txt”, 1}, {“b.txt”, 4}}.
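Using the DocumentItem hint struct, the update step from the example above amounts to bumping the count for a document already in the word's list, or appending a new {name, 1} entry otherwise. A minimal sketch (the helper name is illustrative):

```cpp
#include <string>
#include <vector>

struct DocumentItem {
    std::string documentName;
    int count;
};

// Record one occurrence of a word in `doc`: increment the existing entry's
// count, or start a new entry at 1 if this document is not listed yet.
void addOccurrence(std::vector<DocumentItem> &docs, const std::string &doc) {
    for (auto &d : docs)
        if (d.documentName == doc) { ++d.count; return; }
    docs.push_back({doc, 1});
}
```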
After you preprocess all the documents, you need to get a query from the console, which consists of a line of strings (HINT: you may use getline(cin, line)). This line might consist of more than one word. Then, you need to output the names of the documents in which all the words in the query appear, along with the number of times each word appears in that document.
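Only documents containing every queried word are reported, which matches the "No document contains the given query" lines in the sample runs. The sketch below intersects the per-word document sets; a std::map stands in for your BST/HashTable lookup, and the names are illustrative:

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

using Postings = std::map<std::string, int>;   // documentName -> count

// Build the query result string: one line per document that contains
// ALL queried words, in the sample runs' output format.
std::string queryResult(const std::vector<std::string> &words,
                        const std::map<std::string, Postings> &index) {
    std::set<std::string> docs;
    for (std::size_t i = 0; i < words.size(); ++i) {
        auto it = index.find(words[i]);
        if (it == index.end()) { docs.clear(); break; }  // word unseen anywhere
        std::set<std::string> next;
        for (const auto &p : it->second)
            if (i == 0 || docs.count(p.first)) next.insert(p.first);
        docs.swap(next);                                  // running intersection
    }
    if (docs.empty()) return "No document contains the given query\n";
    std::string out;
    for (const auto &d : docs) {
        out += "in Document " + d;
        for (const auto &w : words)
            out += ", " + w + " found " + std::to_string(index.at(w).at(d)) + " times";
        out += ".\n";
    }
    return out;
}
```

Returning a string rather than printing directly also makes the timing rule easy to honor: the 20 timed repetitions build the result without printing it.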
To compare the speed of the two data structures, you need to repeat the same operation, processing the query and building the result (without printing it), 20 times, and report the average timing.
Hint:

struct DocumentItem {
    string documentName;
    int count;
};

struct WordItem {
    string word;
    // List of DocumentItem's.
};
For timing:

int k = 20;
auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < k; i++)
{
    // QueryDocuments(with BST);
}
auto BSTTime = std::chrono::duration_cast<std::chrono::nanoseconds>(
    std::chrono::high_resolution_clock::now() - start);
cout << "\nTime: " << BSTTime.count() / k << "\n";

start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < k; i++)
{
    // QueryDocuments(with hashtable);
}
auto HTTime = std::chrono::duration_cast<std::chrono::nanoseconds>(
    std::chrono::high_resolution_clock::now() - start);
cout << "\nTime: " << HTTime.count() / k << "\n";
Rules
● The HashTable implementation must be faster than the BST implementation. Otherwise, your grade will be 0.
● Every non-alphabetical character is considered a separator (different from HW2). This rule applies to both the text in the input files and the queried words.
For example: "face2face" = "face.face" = "face:facE" = "face face"
● The resulting HashTable load factor (λ) must be larger than 0.25 and lower than 0.9 (0.25 < λ < 0.9), whatever the input size is. Otherwise, your grade will be 0. Pick the upper bound of the load factor that gives you the best results in terms of speed.
Input
The first input is the number of documents to be preprocessed. Then the file names are read from the console. Finally, the queried words are taken from the user as one line of words.
Output
While preprocessing the files, you need to print information about the HashTable every time rehashing occurs, i.e. display the previous capacity, the previous load, the current capacity, the current load, and the number of items in the HashTable. After all preprocessing is finished, you need to print the final item count in the HashTable and the current load factor λ.
After the user enters the query words, you need to print the document names and the number of times each queried word appears in that document.
Lastly, you compute the time it takes to run the query with each data structure you have implemented (BST, Hash Table) and display both timings along with the speedup ratio between them.
Sample Run 1
Enter number of input files: 1
Enter 1. file name: a.txt
rehashed...
previous table size:53, new table size: 107, current unique word count 49, current load factor: 0.448598
rehashed...
previous table size:107, new table size: 223, current unique word count 98, current load factor: 0.434978
rehashed...
previous table size:223, new table size: 449, current unique word count 202, current load factor: 0.447661
rehashed...
previous table size:449, new table size: 907, current unique word count 406, current load factor: 0.446527
rehashed...
previous table size:907, new table size: 1823, current unique word count 818, current load factor: 0.448162
rehashed...
previous table size:1823, new table size: 3659, current unique word count 1642, current load factor: 0.448483
rehashed...
previous table size:3659, new table size: 7321, current unique word count 3295, current load factor: 0.449939
rehashed...
previous table size:7321, new table size: 14653, current unique word count 6590, current load factor: 0.449669
After preprocessing, the unique word count is 8475. Current load ratio is 0.57838
Enter queried words in one line: above
in Document a.txt, above found 12 times.
in Document a.txt, above found 12 times.
Time: 9896
Time: 1797
Speed Up: 5.50544
Sample Run 2
Enter number of input files: 2
Enter 1. file name: a.txt
Enter 2. file name: b.txt
rehashed...
previous table size:53, new table size: 107, current unique word count 49, current load factor: 0.448598
rehashed...
previous table size:107, new table size: 223, current unique word count 98, current load factor: 0.434978
rehashed...
previous table size:223, new table size: 449, current unique word count 202, current load factor: 0.447661
rehashed...
previous table size:449, new table size: 907, current unique word count 406, current load factor: 0.446527
rehashed...
previous table size:907, new table size: 1823, current unique word count 818, current load factor: 0.448162
rehashed...
previous table size:1823, new table size: 3659, current unique word count 1642, current load factor: 0.448483
rehashed...
previous table size:3659, new table size: 7321, current unique word count 3295, current load factor: 0.449939
rehashed...
previous table size:7321, new table size: 14653, current unique word count 6590, current load factor: 0.449669
rehashed...
previous table size:14653, new table size: 29311, current unique word count 13189, current load factor: 0.449933
After preprocessing, the unique word count is 15965. Current load ratio is 0.544676
Enter queried words in one line: above:aberration
in Document a.txt, above found 12 times, aberration found 1 times.
in Document a.txt, above found 12 times, aberration found 1 times.
Time: 14933
Time: 3081
Speed Up: 4.84613
Sample Run 3
Enter number of input files: 3
Enter 1. file name: a.txt
Enter 2. file name: b.txt
Enter 3. file name: c.txt
rehashed...
previous table size:53, new table size: 107, current unique word count 49, current load factor: 0.448598
rehashed...
previous table size:107, new table size: 223, current unique word count 98, current load factor: 0.434978
rehashed...
previous table size:223, new table size: 449, current unique word count 202, current load factor: 0.447661
rehashed...
previous table size:449, new table size: 907, current unique word count 406, current load factor: 0.446527
rehashed...
previous table size:907, new table size: 1823, current unique word count 818, current load factor: 0.448162
rehashed...
previous table size:1823, new table size: 3659, current unique word count 1642, current load factor: 0.448483
rehashed...
previous table size:3659, new table size: 7321, current unique word count 3295, current load factor: 0.449939
rehashed...
previous table size:7321, new table size: 14653, current unique word count 6590, current load factor: 0.449669
rehashed...
previous table size:14653, new table size: 29311, current unique word count 13189, current load factor: 0.449933
After preprocessing, the unique word count is 22320. Current load ratio is 0.761489
Enter queried words in one line: a the-has been
No document contains the given query
No document contains the given query
Time: 5669
Time: 5076
Speed Up: 1.11673
Sample Run 4
Enter number of input files: 2
Enter 1. file name: a.txt
Enter 2. file name: d.txt
rehashed...
previous table size:53, new table size: 107, current unique word count 49, current load factor: 0.448598
rehashed...
previous table size:107, new table size: 223, current unique word count 98, current load factor: 0.434978
rehashed...
previous table size:223, new table size: 449, current unique word count 202, current load factor: 0.447661
rehashed...
previous table size:449, new table size: 907, current unique word count 406, current load factor: 0.446527
rehashed...
previous table size:907, new table size: 1823, current unique word count 818, current load factor: 0.448162
rehashed...
previous table size:1823, new table size: 3659, current unique word count 1642, current load factor: 0.448483
rehashed...
previous table size:3659, new table size: 7321, current unique word count 3295, current load factor: 0.449939
rehashed...
previous table size:7321, new table size: 14653, current unique word count 6590, current load factor: 0.449669
After preprocessing, the unique word count is 8575. Current load ratio is 0.585204
Enter queried words in one line: A:
in Document a.txt, a found 87 times.
in Document d.txt, a found 4 times.
in Document a.txt, a found 87 times.
in Document d.txt, a found 4 times.
Time: 2390
Time: 1975
Speed Up: 1.21002