What In The Hell Are Hash Tables

GOTO All What In The Hell Articles

A hash table is a combination of two parts; an array, to store the data, and a hash function that computes the index of the array from a key. The best analogy to a hash table is a dictionary. When you use a dictionary you lookup a word (the key) and find the definition (the value).

Hash tables are an improvement over arrays and linked lists for large data sets. With a hash table you don’t have to search through every element in the collection (O(n) access time). Instead you pay a one time cost (the hash function) to retrieve data (O(1) access time).

 

The Hash Function

The purpose of the hash function is to compute an index from a key. That index is then used to store and retrieve data from the array. Many higher-level languages include some type of hash table and hash function.

As you can see from the diagram above the hash(“string”) function transforms a string into an integer which is then used as the index into the array. This index is used to store or retrieve data in the array.

The amount of time it takes to set or retrieve a value from a hash table is the time it takes to run the hash function plus the time it takes to access an element of the array. For small data sets this could take longer than simply iterating over each element of the array. However, since it is a fixed cost, for larger data sets using a hash table could be much faster.

Assuming we have a function set that takes a key to be hashed as its first argument and a value to be stored as its second argument, we can create the following hash table.

As you can see from the above diagram we store both the key and the value in the table. The reason for this is collisions, which will be explained below.

 

Collisions

A collision occurs when two different keys hash to the same value. Collisions will occur in even the best designed hash tables, however there are a few reasons why they can occur more frequently. The first reason is a poor hash function that does not distribute the indices evenly throughout the table, this will be explained in more detail below. The second reason is that the array is too small. If you have a 10 element array and you are trying to store 100 key/values in the hash table, there will obviously be a lot of collisions.

In the diagram above “Puff” hashed to the same value as “Peter” causing a collision. Both “Peter” and “Puff” are now stored in the same index of the array. A certain amount of collisions are unavoidable in a hash table, so we’ll need a way to deal with them.

The simplest way to deal with collisions is to make each element of the array a linked list. Then, when a collision occurs, we simply search through the linked list to find the data we want. This way the hash function gives us a ‘starting point’ to begin our search.

With a good hash function and an appropriately sized array collisions are a rare occurrence. However it is possible, though very unlikely, that all the key/values will be stored in one index and we will end up doing a search on a linked list.

 

What is a good hash function?

The sole job of the hash function is to return a unique array index for each key. This is not always possible and collisions can be expected to occur when the number of elements in the hash table approaches the square root of the size of the hash table’s array. A hash function that does not provide a uniform distribution of keys will cause a lot of collisions, possibly breaking our hash table’s performance down from O(1) to O(n). The only workaround to an ineffective hash function is to make the hash table’s array much larger than the number of elements that it holds, which obviously wastes space.

 

Hash Table Implementations

Python

Python’s equivalent of a hash table is called a dictionary or dict.

# Creating a dictionary
d = {'a':5, 'b':10}

# Adding key/value pairs
d['c'] = 15

# Get value by key
d['c']
#>>> 15

Chicken Scheme

;; Creating a hash-table from an alist
(define d (alist->hash-table '((a . 5) (b . 10))))

;; Adding key/value pairs
(hash-table-set d 'c 15)

;; Get value by key
(hash-table-ref d 'c)
;>>> 15

 

Performance

Indexing: O(1)
Insertion/Removal: O(1)
Search: O(1)

The performance listed above is what is normally found in specific hash table implementations. There are actually a number of different implementations, each with their own advantages, disadvantages and performance guarantees. Specific implementations will have to be presented in future WITH articles.

 

Links

GOTO Table of Contents

Advertisements