Published in · 5 min read · Nov 3, 2021
--
Ahh yes, the Hash Table, a Computer Science classic. Often our first thought as software developers when facing an unfamiliar problem is:
“hmmm, can I use a Hash Table to solve this problem?”
Which I mean, is the correct thought! Hash Tables not only make our logic simple, but they also are extremely quick with constant time accessing speeds!
As I wrote the simple Map<String, Integer> my_map = new Map<String, Integer>();
I grew curious about how many lines of code were running underneath-the-hood to make this super convenient data structure work. Which…. led me into a climatic journey in discovering the in’s and out’s of our friend, the Hash Table. Here is a digest of the findings!
Ok ok, I’ll keep this part short and sweet I swear… I won’t doubt your intelligence. There are two parts:
- A Hash Table is simply a block of memory (an array) that contains ‘buckets’ which hold the data we want to map too. Phew, easy right? We call them ‘buckets’ instead of indexes because, in fact, in some implementations, the index contains a Linked List (for collisions).
- A Hash Table is also abstractly-addressable, meaning that instead of using an outright offset like:
Data = (Memory Location of Array) + (array Offset)
we substitute the offset for a user-defined variable and hash it into a unique*** integer.
- **In a perfect world the value is unique
So you might ask yourself, okay Hashing algorithm, easy! The end goal is that we want to take the supplied key and convert it to an Integer. In our example:
Map<String, Integer> my_map = new Map<String, Integer>();
We would want to convert our key of type String
into an Integer so that we can use it as a substitute for our Array_Offset
value. That process it conducted by a hashing algorithm, which will yield a large number back, so we use modulo arithmetic with the current size of the array to prevent an ‘Array out of bounds’ error.
Neat right?! Okay, but now you might ask yourself:
“A Hash Table’s key can be any object! Not just a string, how do we convert any Object into an Integer Array Offset?”
The beauty of a Hash Table is that it can work with any object. How? Well, the only universal trait amongst various objects is that they are all originally made of bits! So with our hash function, we will use bit-wise operations to randomize the object before converting it to an Integer.
There is no ‘standard hashing function’ that is used throughout all OOP languages, but the C++ implementation of unordered_map
uses an algorithm called ‘MurmurHash’. Below is a simple example of how a 32-bit object can be hashed into a 32-bit Integer.
In the above example:
- We take a 32-bit Object, and spit it into 4 groups of 8-bit blocks.
- With each individual group, we perform an
AND
instruction with a random static variableVar1
- Bit shift the result to the right, or a
ROL
Instruction
4. Repeat steps 2
and 3
with a different variable, Var2
5. Combine each individual 8-bit block back into a 32-bit Object
6. Read the result as an Integer.
After hashing the key into an indexable integer, we are ready to insert a value at that memory location (the corresponding bucket).
Remember, the hash table itself is a block of memory, so after a certain amount of inserts, the block of memory will have to recreate itself. This process is relatively expensive — as the data structure will have to allocate a new block of memory, rehash the keys, and copy the values into the new index.
What determines when a hash table is full? A ‘Load Factor Value’.
Load Factor is defined as “The number of entries in the table (n)” Divided by “The Size of the Hash Table (k)”
There’s a check that happens before the insertion of a new element into a hash table. If the load factor is exceeded on the next insert, then the Hash Table will allocate a new block of memory, rehash the keys, and copy all the data back into the hash table.
We want to have some wiggle room in the hash table to avoid collisions, this usually means having the array size greater than the number of entries at all times.
Tada! Magic right?
Overall, the purpose of this article was to explore how the insertion and search time complexity of a standard Hash Table is achieved in constant time. There are plenty of questions that were left unanswered, such as “how to handle collisions with a hash table optimally?” Haha… I’ll let stack overflow answer some of those unanswered questions you may have. I hope however that you can confidently see how constant time-complexity is achieved with hash tables. Below are five key takeaways:
- Hashing functions use bit-wise operations on the Key Value
- The C++ standard library Hashing function is called “Murmurhash”
- The hashing algorithms used for hash tables are not cryptographic algorithms and prioritize speed rather than security.
- A hash table has to rehash it’s values once the load factor is met.
- The standard hashing threshold is 3/4, so a hash table should usually have about 25% memory than storing all elements in an array.