In this assignment, you will implement and compare the performance characteristics of hash tables.
You will create a hash table with 1019 slots (hereafter denoted as TABLE_SIZE). Why 1019? Hash tables are typically created with a size equal to a prime number, which helps spread keys more uniformly across the slots when using modulus-based hash functions.
You need to implement Hash Tables with combinations of the following hash functions and collision resolution mechanisms:
Hash Functions to implement:
1. h(x) = x mod TABLE_SIZE (mod stands for the modulus operator)
2. h'(x) = Floor(x / TABLE_SIZE) (the Floor() operator returns the integer part of the quotient)
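As a concrete starting point, the two hash functions can be written as plain C++ functions. This is only a sketch: note that for non-negative integers, C++ integer division already performs the Floor() truncation, and that for keys of TABLE_SIZE² or more, h'(x) would exceed the table bounds, so such keys would need further reduction.

```cpp
const int TABLE_SIZE = 1019;  // prime table size specified above

// h(x) = x mod TABLE_SIZE
int h(int x) { return x % TABLE_SIZE; }

// h'(x) = Floor(x / TABLE_SIZE); for non-negative x, C++ integer
// division already discards the fractional part of the quotient.
int h_prime(int x) { return x / TABLE_SIZE; }
```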
Collision Resolution Mechanisms to Implement:
1. Chaining with a Linked List: resolve collisions by chaining the keys that hash to the same slot into a linked list.
2. Chaining with a Balanced BST: the same chaining mechanism, but each chain is stored in a balanced BST instead of a linked list.
3. Linear Probing: implement linear probing with step size = 1.
4. Cuckoo Hashing: here we will explore an alternative open addressing strategy known as cuckoo hashing.
The main idea is to use two hash functions instead of one (we will use the two defined above). We will maintain two hash tables, each associated with one hash function.
Lookup: When searching for an item, we first examine the location given by the first hash function. If the item is not there, we check the location given by the second hash function. Therefore, we need to examine at most two locations before declaring that an item does not exist.
Delete: As with lookup, we check the item's two potential locations and, if found, remove it from the table by setting the value in that slot to -1.
Insert: We first try to insert the item at the location given by the first hash function. If that causes a collision, we try the item's second location. If both locations are full, we displace the element occupying the item's first location, insert the item there, and move the displaced element to its alternative location. As you can imagine, the displaced element's alternative location may be full as well, in which case we displace that occupant in turn. We continue this process of displacing items until a displaced item finds an empty alternative location.
If this process enters a cycle, i.e. the displacements would go on forever, we choose two new hash functions and rebuild the hash table (sometimes called rehashing). To obtain new hash functions, we only need to increase the size of the table: the changed TABLE_SIZE gives us two new hash functions.
More information can be found here: https://en.wikipedia.org/wiki/Cuckoo_hashing
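Putting the three operations together, a minimal cuckoo table might look like the following sketch. The displacement bound MAX_LOOP and the reduction of h' modulo the table size are my assumptions (the raw h'(x) could fall outside the table); insert returns false when it gives up, signalling the caller to enlarge the table and rehash.

```cpp
#include <utility>
#include <vector>

const int EMPTY = -1;      // deleted/empty slots hold -1, as specified above
const int MAX_LOOP = 50;   // displacement bound for cycle detection (an assumption)

// Assumes keys are non-negative, so they never collide with EMPTY.
struct CuckooTable {
    int size;
    std::vector<int> t1, t2;  // one table per hash function
    explicit CuckooTable(int n) : size(n), t1(n, EMPTY), t2(n, EMPTY) {}

    int h1(int x) const { return x % size; }
    int h2(int x) const { return (x / size) % size; }  // reduced mod size to stay in range

    // Examine at most two locations, as described above.
    bool lookup(int x) const { return t1[h1(x)] == x || t2[h2(x)] == x; }

    void remove(int x) {
        if (t1[h1(x)] == x) t1[h1(x)] = EMPTY;
        else if (t2[h2(x)] == x) t2[h2(x)] = EMPTY;
    }

    // Returns false after MAX_LOOP displacements (a probable cycle),
    // in which case the caller should rebuild with a larger table.
    bool insert(int x) {
        for (int i = 0; i < MAX_LOOP; ++i) {
            std::swap(x, t1[h1(x)]);   // claim x's first location, evicting any occupant
            if (x == EMPTY) return true;
            std::swap(x, t2[h2(x)]);   // move the evicted item to its second location
            if (x == EMPTY) return true;
        }
        return false;
    }
};
```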
Performance Evaluation
You will be provided with two datasets of integers to be inserted into the hash tables.
You will compare performance on the three operations of a hash table: insert, lookup, and delete. To this end, you will measure the average time taken for each operation by performing 10 instances of the operation and averaging their times. For example, to measure the time required for an insert, you will perform 10 inserts, record each one's time, and take the average.
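The per-operation timing can be done with the standard <chrono> facilities. A sketch; the helper names are mine, not part of the assignment:

```cpp
#include <chrono>

// Run one operation and return the elapsed time in nanoseconds.
template <typename Op>
double time_op_ns(Op op) {
    auto start = std::chrono::steady_clock::now();
    op();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(end - start).count();
}

// Average over 10 runs, as the assignment prescribes.
template <typename Op>
double average_time_ns(Op op, int runs = 10) {
    double total = 0.0;
    for (int i = 0; i < runs; ++i) total += time_op_ns(op);
    return total / runs;
}
```

For instance, `average_time_ns([&]{ table.insert(next_key()); })` would time 10 inserts; note that each of the 10 runs should use a fresh key so you are not re-inserting the same element.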
You will use both hash functions with each implementation of the hash table. Therefore, the two chaining mechanisms and the linear probing mechanism will have two implementations each, while cuckoo hashing (which uses both hash functions at once) will have only one.
Since the performance of each scheme varies with the load factor of the table, you also need to measure performance at different load factors. Thus, you will not insert the entire dataset at once. Instead, first insert enough elements for the load factor to reach 0.1; then perform 10 inserts, 10 lookups, and 10 deletes and record their average times; only then insert additional elements from the dataset until the load factor reaches 0.2, and so on. In this way, you will record the average performance of each of the three operations at load factors 0.1, 0.2, 0.5, 0.7, 0.9, and 1.0.
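Converting each target load factor into an element count is a single truncated multiplication. A sketch; the driver loop beneath it is pseudocode in comments, and `measure_and_record`, `table`, and `data` are placeholders, not given names:

```cpp
// load factor = occupied slots / TABLE_SIZE, so the element count for a
// target load factor is the truncated product.
int elements_for_load(double load_factor, int table_size = 1019) {
    return static_cast<int>(load_factor * table_size);
}

// Driver sketch:
//   double targets[] = {0.1, 0.2, 0.5, 0.7, 0.9, 1.0};
//   int inserted = 0;
//   for (double t : targets) {
//       while (inserted < elements_for_load(t)) table.insert(data[inserted++]);
//       measure_and_record(table);   // the 10 timed inserts/lookups/deletes
//   }
```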
Note: For cuckoo hashing, you may never reach the higher load factors, because as the table fills up you may need to resize it. In that case, also record the number of times you needed to resize the table for each dataset.
The data obtained from the above process should be plotted as a series of graphs. You will, therefore, have one (time vs. load factor) graph per dataset, per hash function, per collision resolution implementation. All graphs should use the same scale for clarity and ease of visual comparison.
Extra Credit (15% of total)
Create an associative array and compare its performance with the C++ STL implementation of map<string, int>.
Associative arrays are a generalization of standard arrays. Standard arrays use integers for indexing, whereas associative arrays can be indexed by other data types as well. They are typically used to store (key, value) pairs. For those with experience in Python, Python's dictionaries are a kind of associative array, i.e. they store (key, value) pairs.
For instance, suppose we have stored pairs of student names and roll numbers in an array indexed by student name. We would access the roll number of a student named "John" like so: my_assoc_array["John"]. This would give us an integer denoting John's roll number.
Your job is to create an associative array based on hash tables to map strings to integers. The main idea is to apply a hash function to the string and use that to index into a hash table. At that slot in the hash table, we would store the integer value. We will utilize open addressing as the collision resolution mechanism.
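A sketch of such a string-keyed table, using a simple polynomial hash and linear probing as the open addressing scheme. The multiplier 31 and the reuse of the 1019-slot size are my choices, not requirements:

```cpp
#include <string>
#include <vector>

const int ASSOC_SIZE = 1019;  // reusing the prime table size (an assumption)

// Simple polynomial string hash; any reasonable string hash would do.
int hash_string(const std::string& key, int table_size = ASSOC_SIZE) {
    unsigned long h = 0;
    for (unsigned char c : key) h = h * 31 + c;
    return static_cast<int>(h % table_size);
}

struct AssocArray {
    struct Slot { std::string key; int value = 0; bool used = false; };
    std::vector<Slot> slots;
    AssocArray() : slots(ASSOC_SIZE) {}

    // Open addressing via linear probing; assumes the table never fills completely.
    void insert(const std::string& key, int value) {
        int i = hash_string(key);
        while (slots[i].used && slots[i].key != key) i = (i + 1) % ASSOC_SIZE;
        slots[i] = {key, value, true};
    }

    // Returns a pointer to the stored value, or nullptr if the key is absent.
    int* find(const std::string& key) {
        int i = hash_string(key);
        for (int probes = 0; probes < ASSOC_SIZE; ++probes) {
            if (!slots[i].used) return nullptr;
            if (slots[i].key == key) return &slots[i].value;
            i = (i + 1) % ASSOC_SIZE;
        }
        return nullptr;
    }
};
```

One design point to keep in mind: deletion under open addressing cannot simply clear a slot, or later probe sequences would terminate early; a tombstone marker (or re-insertion of the cluster) is needed.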
You will compare the performance of inserting a (key, value) pair and deleting a value in your implementation vs. the C++ STL implementation of map<string, int> on the dataset provided. You can use different hash functions and compare performance.