CS300– Homework 4 Solved

Your shopping cart is empty.

Brief Description
In this homework, you will write a search engine as in the 2nd and 3rd Homework. Additionally, you will compare the speed of different data structures for your search engine architecture. The search engines in real life search hundreds of millions web pages to see if they have the words that you have typed, and they can do this for thousands of users at a given time. In order to do these searches really fast, search engines such as Google do a lot of what we call preprocessing of the pages they search; that is, they transform the contents of a web page (which for the purposes of this homework, we will assume it consists of only strings) into a structure that can be searched very fast.

In this homework, you are going to read the input files, and store them into the vector. You will then sort this vector with several sorting algorithms:

            1.      Quick Sort (selection of pivot=first, pivot=random, and pivot=median),

            2.      Insertion Sort,

            3.      Merge Sort (in-place: without using extra memory storage),

            4.      Heap Sort.

You will need to keep four different copies of the vector, since we want to see which of the sorting algorithms is more efficient. Next, after getting a query from the user, you should search for words in vector with the binary search algorithm (that’s why we sorted them).

Program Flow
You need to get the number of the documents that you need to process first. Then, after getting the names of all documents from the console, you need to store them in a vector. Only the alphabetical characters are considered, and the unique words need to be case insensitive (for example; “izmir” and “Izmir” are the same). The rest (such as digits or punctuation characters) are processed as the separator for the words. For each unique word appearing in the text, you need to insert a new node into the vector.

If the node is already there, you need to update the node with the information about this document. For example; you have the “a.txt” document first and there is only one “the” word in this document. You need to insert “the” word into the data structures. Then, while processing the “b.txt” document, “the” word appears 4 times. So, you need to add this information ({“b.txt”, 4}) into the existing item in your data structures and the item of “the” word becomes {{“a.txt”, 1}, {“b.txt”, 4}}.

After you processed all the documents, you need to get a query from the console which consists of a line of strings (HINT: You may use getline(cin, line)). This line might consist of more than one word. Then, you need to output the document names which all words in the query appear including the number of times that each word appears in that document.

For the timing output, you need to print three different results;

1.     Time for search using binary search tree, hash table and binary search on sorted vectors.

2.     Time for all sorting algorithms.

3.     Speedups for comparing sorting algorithms and comparing search techniques.

For the comparison of the speed of using different search techniques, you need to perform the same operation, processing the query and building the result, without printing the result 100 times and taking the average timing. Hint struct DocumentItem {

     string documentName;      int count;

}; struct WordItem     {

     string word;

     // List of DocumentItem’s.

};

For timing:

    int k = 100;     clock_t begin_time = clock();     for (int i = 0; i < k; i++)         // QueryDocuments with Binary Search Tree     double endBST = float(clock() - begin_time);     cout << "\nBinary Search Tree Time: " << endBST / k << "\n";     begin_time = clock();     for (int i = 0; i < k; i++)

        // QueryDocuments with Hash Table     double endHT = float(clock() - begin_time);     cout << "\nHash Table Time: " << endHT / k << "\n";     begin_time = clock();     for (int i = 0; i < k; i++)

        // QueryDocuments with Binary Search            double endBS = float(clock() - begin_time);     cout << "\nBinary Search Time: " << endBS / k << "\n";

Rules
●    Every non-alphabetical character is considered as a separator. This rule applies for both the text in input files and queried words.

For example; "face2face" = "face.face" = "face:facE" = "face face"

Input
First, input is the number of documents that is going to be processed. Then the file names will be received from the console. Then the queried words will be taken from the user as one line of words.

Output
After the user enters the query words, you need to print the document names and the number of times each queried word appears in that document.

Lastly you compute the time measurements for each sorting algorithm and print them with the ratio of these timing values.

Sample Run 1
Enter number of input files: 1

Enter 1. file name: a.txt

After preprocessing, the unique word count is 8475. Current load ratio is

0.521571

Enter queried words in one line: above in Document a.txt, above found 12 times. in Document a.txt, above found 12 times. in Document a.txt, above found 12 times.

Binary Search Tree Time: 199420

Hash Table Time: 49840

Binary Search Time: 3652

Quick Sort(median) Time: 7052375

Quick Sort(random) Time: 7067562

Quick Sort(first) Time: 7067666

Merge Sort Time: 89310980

Heap Sort Time: 34407965

Insertion Sort Time: 59291425

Speed Up BST/HST: 4.0012

Speed Up Merge Sort/Quick Sort(Median): 12.6125

Speed Up Heap Sort/Quick Sort(Median): 4.859

Speed Up Insertion Sort/Quick Sort(Median): 8.37314

Speed Up Binary Search / Binary Search Tree Time: 10.926

Speed Up Binary Search / Hash Table Time: 2.711

Sample Run 2
Enter number of input files: 2

Enter 1. file name: a.txt

Enter 2. file name: b.txt

After preprocessing, the unique word count is 15965. Current load ratio is

0.574219

Enter queried words in one line: above aberration in Document a.txt, above found 12 times, aberration found 1 times. in Document a.txt, above found 12 times, aberration found 1 times. in Document a.txt, above found 12 times, aberration found 1 times.

Binary Search Tree Time: 249490

Hash Table Time: 49975

Binary Search Time: 17787

Quick Sort(random) Time: 13987222

Quick Sort(median) Time: 15261330

Quick Sort(first) Time: 17879211

Merge Sort Time: 188395480

Heap Sort Time: 72005655

Insertion Sort Time: 281349125

Speed Up BST/HST: 4.9923

Speed Up Merge Sort/Quick Sort(Median): 12.3446

Speed Up Heap Sort/Quick Sort(Median): 4.71818

Speed Up Insertion Sort/Quick Sort(Median): 18.4354

Speed Up Binary Search Tree Time / Binary Search: 14.02

Speed Up Binary Search / Hash Table Time: 2.8

Sample Run 3
Enter number of input files: 3

Enter 1. file name: a.txt

Enter 2. file name: b.txt

Enter 3. file name: c.txt

After preprocessing, the unique word count is 22320. Current load ratio is

0.802791

Enter queried words in one line: a the-has been

No document contains the given query

No document contains the given query No document contains the given query

Binary Search Tree Time: 440676

Hash Table Time: 149620

Binary Search Time: 24482

Quick Sort(median) Time: 22140735

Quick Sort(first) Time: 22511732

Quick Sort(random) Time: 25140735

Merge Sort Time: 267334760

Heap Sort Time: 102276335

Insertion Sort Time: 3027406830

Speed Up BST/HST: 2.94063

Speed Up Merge Sort/Quick Sort(Median): 12.0743

Speed Up Heap Sort/Quick Sort(Median): 4.61937

Speed Up Insertion Sort/Quick Sort(Median): 136.735

Speed Up Binary Search Tree Time / Binary Search: 18.02 Speed Up Hash Table Time / Binary Search Tree Time: 6.1

Sample Run 4
Enter number of input files: 2

Enter 1. file name: a.txt Enter 2. file name: d.txt

Enter queried words in one line: A:

in Document a.txt, a found 87 times. in Document d.txt, a found 4 times. in Document a.txt, a found 87 times. in Document d.txt, a found 4 times. in Document a.txt, a found 87 times. in Document d.txt, a found 4 times.

Binary Search Tree Time: 99720

Hash Table Time: 49820

Binary Search Time: 9772

Quick Sort(first) Time: 7010690

Quick Sort(random) Time: 7111210

Quick Sort(median) Time: 7230590

Merge Sort Time: 94050125

Heap Sort Time: 36254350

Insertion Sort Time: 60584885

Speed Up BST/HST: 2.00161

Speed Up Merge Sort/Quick Sort(Median): 13.0073

Speed Up Heap Sort/Quick Sort(Median): 5.01402

Speed Up Insertion Sort/Quick Sort(Median): 8.37897

Speed Up Binary Search Tree Time / Binary Search: 10 Speed Up Hash Table Time / Binary Search: 5.02

Shopping cart

US$0

CS300– Homework 4 Solved

More products