CMPS 144
Hashing and Hash Tables

The goal is to store a collection of <key,value> pairs so that the operations of retrieval (by key), insertion, and removal are efficient, and so that not much memory is wasted (i.e., allocated but not used). These are the same goals as are addressed by the binary search tree (BST) abstract data type (ADT). Done well —in particular, using techniques that ensure a tree's height never exceeds c·lg N for some small constant c (e.g., c≤2)— BSTs guarantee worst-case O(log N) time for each of those operations, which is very good.

It turns out, however, that another approach, called hashing, provides expected O(1) (i.e., constant) time performance for each of these operations, when it is implemented well.

In a BST, each of the retrieve, insert, and remove operations begins by doing a search in the tree, beginning at the root and proceeding to one of the root's children, then one of its grandchildren, etc., etc., based upon comparisons between the search key and the keys in the visited nodes.

The hashing approach seeks to avoid those comparisons. The idea is, instead, to perform a key-to-address translation that yields the index/location, within a table (i.e., the hash table), where the data associated to the specified key can be found (or where it should be placed, if the intent is to do an insertion). The key-to-address translation is referred to as a hash function, and it typically treats the key, no matter its actual data type, as a number (or sequence thereof). There are a number of different kinds of hash function, but a basic requirement of such a function is that it distributes the universe of keys uniformly, at least approximately. That is, for any two addresses/indexes, the number of distinct keys that map into one should be about the same as the number that map into the other.

To illustrate, consider the problem of storing records describing University of Scranton students and employees —past and present— in such a way that retrieval, insertion, and removal can be done quickly. Each record can be viewed as a <key,value> pair in which the key is the person's Royal Number and the data includes name, address, birthdate, etc., etc.

Omitting the leading 'R', a Royal Number is just an 8-digit number. Hence, a straightforward approach would be to store all these records in a table (i.e., array) whose index range [0 .. 99,999,999] corresponds to the range of 8-digit integers. The record describing the student whose R-number is R15467890, for example, would be stored at the array location indexed by 15467890. As accessing an array element (fetching its value or writing a value into it) takes only constant time, it is clear that the obvious implementations of retrieve, insert, and remove run in constant (i.e., O(1)) time.

The problem with this approach is that the size of the key space (i.e., the set of all possible keys) is huge compared to the size of the set of keys that would be represented in the table. (It's likely that, in the history of the University of Scranton, the total number of students and employees is less than a million.) Thus, we would have a table with 100 million locations, but less than one percent of them would be occupied by meaningful data. That would be a major waste of memory.

To fix this problem, we would like to shrink the size of the table to more closely fit the size of the actual data needing to be stored in it. For our example, we could make the table have index range [0 .. 999,999], say. But if we do that, we now need a non-trivial key-to-address translation that, given any R-number, produces a value in the range [0 .. 999,999]. (In effect, each 8-digit number must be translated into a 6-digit number.) There are various ways to do that. One would be to ignore, say, the first and last digits of the R-number. For our example R-number R15467890, the result would be 546789. Another would be to divide the R-number by 1,000,000 and take the remainder to be the address. (In effect, that would be to ignore the two leading digits of the R-number.) Indeed, it is common for hash functions, as their last step, possibly after applying other arithmetic operations to the key, to divide by the size N of the hash table and take a remainder as the result. This makes sense, because the result is then always an integer in the index range [0..N) of the table.
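To make this concrete, here is a minimal sketch, in Java, of the second of these hash functions (the remainder-based one); the class and method names are our own choices for illustration, not anything standard.

   // A minimal sketch (names are illustrative). Maps an 8-digit
   // R-number (given without its leading 'R') to an address in
   // the range [0..N), where N is the size of the hash table.
   public class RNumberHash {

      private static final int TABLE_SIZE = 1_000_000;  // N

      /** Returns the home address of the given R-number. */
      public static int hash(int rNumber) {
         return rNumber % TABLE_SIZE;  // remainder is in [0..N)
      }

      public static void main(String[] args) {
         System.out.println(hash(15467890));  // prints 467890
      }
   }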

But however it is computed, an important point is that the key space K is almost always much larger than the index range [0..N) of the hash table, so any hash function mapping K into [0..N) necessarily maps multiple keys into some table addresses. When two keys map to the same address, we say that they collide. When collisions happen in practice (e.g., on an attempt to insert a <key,value> pair whose key maps to the same address already occupied by some other <key,value> pair), a collision-resolution strategy must be followed to allow the new <key,value> pair to be placed somewhere where it can be found (if later an attempt is made to retrieve it).

There are two basic types of hash tables, ones using separate chaining and others using open addressing.

Separate Chaining

Here, each address of the hash table contains a pointer/reference to a container in which are stored the <key,value> pairs whose key maps to that address. (Typically, this container is a list (also called a "chain", hence the name), but it could be something more elaborate, such as a binary search tree.) An example in which the keys are assumed to be persons' first names and the associated data is a birthdate is shown below. (Because first names are rarely unique, this is an unrealistic example.)

   +---+    +------------------+---+    +-------------------+---+
 0 | *-+--->| Bill, 1995/12/22 | *-+--->| Helen, 1927/10/15 | *-+--x
   +---+    +------------------+---+    +-------------------+---+
 1 | *-+---x
   +---+    +------------------+---+
 2 | *-+--->| Lisa, 2001/05/11 | *-+---x
   +---+    +------------------+---+
 . | . |
 . | . |
 . | . |
   +---+    +--------------------+---+    +------------------+---+
98 | *-+--->| Harold, 1974/04/30 | *-+--->| Mary, 1999/11/14 | *-+--> ...
   +---+    +--------------------+---+    +------------------+---+
99 | *-+---x
   +---+

In this example, we have a 100-element table, some of whose addresses have no associated data (e.g., 1 and 99), some a single <key,value> pair (e.g., 2), and some multiple <key,value> pairs (e.g., 0 and 98).

If the hash function does a good job of distributing the keys fairly evenly among the addresses, then you can count on each address's list having length in the neighborhood of m/N, where N is the size of the address space and m is the number of stored records. (The value m/N is known as the load factor or packing density.)

The algorithms for retrieving, inserting, and removing should be pretty obvious. The first step is to apply the hash function to the search key, which yields the address whose list is then traversed to find the sought <key,value> pair. Insertions and removals are carried out within the list.
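As a sketch of how this might look in Java (assuming String keys; the class and method names here are our own, not from any standard library):

   import java.util.LinkedList;

   /** A minimal sketch of a separately-chained hash table. */
   public class ChainedHashTable<V> {

      private static class Entry<V> {
         String key;  V value;
         Entry(String k, V v) { key = k; value = v; }
      }

      private LinkedList<Entry<V>>[] table;  // table[i] is address i's chain

      @SuppressWarnings("unchecked")
      public ChainedHashTable(int n) {
         table = new LinkedList[n];
         for (int i = 0; i < n; i++) { table[i] = new LinkedList<>(); }
      }

      /** The key-to-address translation. */
      private int hash(String key) {
         return (key.hashCode() & 0x7fffffff) % table.length;
      }

      /** Returns the value paired with key, or null if key is absent. */
      public V retrieve(String key) {
         for (Entry<V> e : table[hash(key)]) {
            if (e.key.equals(key)) { return e.value; }
         }
         return null;
      }

      /** Inserts <key,value>, replacing any existing pair with that key. */
      public void insert(String key, V value) {
         remove(key);
         table[hash(key)].add(new Entry<>(key, value));
      }

      /** Removes the pair with the given key, if present. */
      public void remove(String key) {
         table[hash(key)].removeIf(e -> e.key.equals(key));
      }
   }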

If the intent is to keep the table small, so that the packing density may become larger than, say, 10 or 20, it would make sense for the container associated to each address to take the form of an array or a binary search tree rather than a list/chain. In effect, then, each set of <key,value> pairs whose keys map to the same address would be stored in its own array or tree.

Open Addressing

Here, all the data is stored in the table itself, rather than each table cell having a pointer/reference to a separate list/chain. One of the advantages of this approach —compared to separate chaining— is that no memory space is spent on storing pointers. A big disadvantage is that collision resolution is a much more complicated issue.

The simplest collision resolution strategy is linear probing. The table below provides an illustration. For the sake of brevity, only keys are shown (once again common first names, unrealistically). Next to each name is its parenthesized "home address", i.e., the address to which it is mapped by our hypothetical hash function.

Notice that some records are not stored at their home addresses. Emily and Mia, for example, are at locations 3 and 5, respectively, rather than at their common home address, 2. Neither Ruth's record nor George's is at its home address, either.

Here's how linear probing works: To search for a key, you apply the hash function to it, which yields the key's home address. Starting there, you probe (i.e., access) successive addresses (wrapping around to location zero if necessary) until either the record containing the sought key is encountered or an "empty" address is encountered. In the latter case, it means that there is no record with that key in the table. (Thus, if the operation being carried out is insertion, that is where you would place the record to be inserted.)

   +------------+
 0 | George (9) |
   +------------+
 1 |            |
   +------------+
 2 |  Amy (2)   |
   +------------+
 3 | Emily (2)  |
   +------------+
 4 | Frank (4)  |
   +------------+
 5 |  Mia (2)   |
   +------------+
 6 |            |
   +------------+
 7 |            |
   +------------+
 8 |  Mark (8)  |
   +------------+
 9 |  Ruth (8)  |
   +------------+  

Referring to the example table, it is clear that what happened is that Emily's record was inserted only after Amy's record had already been placed at their common home address, 2. For that reason, Emily's record was placed into the following location, which at the time must have been empty. At some point, possibly before either Amy or Emily was inserted, Frank was inserted at his home address, location 4. After all three were in place, Mia was inserted at location 5, which, even though it is three places from home, was the closest available empty slot. George must have been inserted after Ruth, which is why he is at location 0 rather than at home address 9.
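Here is a minimal sketch, in Java, of search and insertion under linear probing; it assumes String keys, uses null to mark an empty cell, and (to keep the loops simple) assumes the table never becomes completely full.

   /** A minimal sketch of a hash table using linear probing. */
   public class LinearProbingTable {

      private String[] table;  // null marks an empty cell

      public LinearProbingTable(int n) { table = new String[n]; }

      private int hash(String key) {
         return (key.hashCode() & 0x7fffffff) % table.length;  // home address
      }

      /** Returns key's location, or -1 if absent.
       *  (Assumes at least one cell is empty, so the loop terminates.) */
      public int search(String key) {
         int loc = hash(key);
         while (table[loc] != null) {        // probe until an empty cell
            if (key.equals(table[loc])) { return loc; }
            loc = (loc + 1) % table.length;  // wrap around as needed
         }
         return -1;                          // empty cell: key is absent
      }

      /** Places key into the first empty cell at or after its home address. */
      public void insert(String key) {
         int loc = hash(key);
         while (table[loc] != null) {
            if (key.equals(table[loc])) { return; }  // already present
            loc = (loc + 1) % table.length;
         }
         table[loc] = key;
      }
   }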

Successful Search

   key      # probes
   George       2
   Amy          1
   Emily        2
   Frank        1
   Mia          4
   Mark         1
   Ruth         2
            -------
   Total:      13

What sort of performance, in terms of number of probes per search, can we expect from such a hash table? Of course, the larger that number, the slower the retrieval time will tend to be. Let us compute this for our example table. It is useful here to distinguish between successful and unsuccessful searches. The former is a search for a key whose record is present in the table. The latter is a search for a key not present in the table.

Under the assumption that each key is equally likely to be the subject of a search, we imagine that one search is performed upon each key and we add up the numbers of probes those searches require. For any given key, the number of probes needed to find it is one plus its distance from its home address. As tallied in the table above, the total is 13 probes over 7 searches, for an average of 13/7 (about 1.86).

To compute the expected number of probes in an unsuccessful search, we imagine that each address is equally likely to be the home address of the search key and that one unsuccessful search is initiated at each address.

Unsuccessful Search

   Address:   0  1  2  3  4  5  6  7  8  9    Total   Average
   # probes:  2  1  5  4  3  2  1  1  4  3     26       2.6
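These counts can be reproduced mechanically. The following Java helper (our own, matching the representation used in the sketch above) counts the probes in an unsuccessful search that begins at a given home address; applied to each of the ten addresses of the example table, it yields exactly the counts tabulated above.

   /** Probes in an unsuccessful search beginning at address home.
    *  (Assumes at least one cell of table is empty.) */
   static int probesForUnsuccessfulSearch(String[] table, int home) {
      int probes = 1;                  // the probe of home itself
      int loc = home;
      while (table[loc] != null) {     // occupied cell: keep probing
         loc = (loc + 1) % table.length;
         probes++;
      }
      return probes;                   // the last probe found an empty cell
   }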

Not surprisingly, unsuccessful searches generally require more probes than successful ones. Of course, these calculations are with respect only to our example table, and they may not be generally representative. So let us do some analysis. Consider a table whose packing density (load factor) is α = m/N. We make the simplifying assumption that each address is equally likely to be occupied, with probability α.

For each k≥1, define P(k) to be the probability that exactly k probes will occur in an unsuccessful search. Then the expected number of probes in such a search is bounded above by the sum

    Σ  k·P(k)
   k≥1

(If this concept is foreign to you, see the appendix, which provides a short introduction to the notion of expected value.) Each term in the sum is the product of some number of probes and the probability that that many probes will be carried out.

Well, what is P(k)? For an unsuccessful search to end after exactly k probes means that each of the first k−1 probes was to an occupied cell and the k-th probe was to an empty cell. But the probability that a given cell is occupied is α and so the probability that it is empty is 1−α. So we have

P(k)  =  α^(k−1)·(1 − α)  =  α^(k−1) − α^k

(This makes the somewhat unrealistic assumption that the probability that one cell is occupied is independent of the probability that a neighboring cell is occupied.)

Working this out, we get

    Σ  k·P(k)  =  1·P(1) + 2·P(2) + 3·P(3) + ...
   k≥1
               =  1·(α^0 − α^1) + 2·(α^1 − α^2) + 3·(α^2 − α^3) + ...
               =  α^0 − α^1 + 2α^1 − 2α^2 + 3α^2 − 3α^3 + ...
               =  α^0 + α^1 + α^2 + ...
               =  1/(1−α)

(The last step applies the formula for the sum of a geometric series, which is valid because 0 ≤ α < 1.)
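As a sanity check (our own, not part of the analysis), the following Java snippet compares a truncated version of the sum against 1/(1−α) for α = 3/4:

   public class ExpectedProbesCheck {
      public static void main(String[] args) {
         double alpha = 0.75;
         double sum = 0.0;
         for (int k = 1; k <= 1000; k++) {
            double pOfK = Math.pow(alpha, k - 1) * (1 - alpha);  // P(k)
            sum += k * pOfK;
         }
         System.out.println(sum);              // prints 4.0 (approximately)
         System.out.println(1 / (1 - alpha));  // prints 4.0
      }
   }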

This analysis suggests that the expected number of probes in an unsuccessful search is a function purely of the table's packing density α, and not of the table's sheer size. The graph below plots 1/(1−α) against α.

Expected
# probes
10 +
   |
 9 + 
   |
 8 +                                         *
   |
 7 + 
   |
 6 +
   |
 5 + 
   |
 4 +                                   *
   |
 3 +                                    
   |                             *
 2 +                       *
   |                 *
 1 +           *
   |
   +-----------+-----+-----+-----+-----+-----+-----+
              1/4   3/8   1/2   5/8   3/4   7/8    1 

                   α (packing density)


Appendix: Expected Value

   Dice Roll          Win/Lose   Probability
   2,4,6,8,10,12        +$1          1/2
   7                    +$2          1/6
   3,11                 −$4          1/9
   5,9                  −$2          2/9
As an example to illustrate the concept of expected value, imagine a dice game that might be suited for a gambling casino. To play the game, you simply roll a pair of dice. The amount of money that you win or lose is described in the table above. Each row of the table identifies an event (a set of outcomes that can be treated as a single unit), the value associated to it (here, how much is won/lost as a result of that event occurring), and the probability with which that event occurs.

A relevant question is: What is the expected gain/loss on a given playing of the game? That question is answered by multiplying the probability of each event by the value associated to it and adding the results. Here, that sum is

      (1/2)(1) + (1/6)(2) + (1/9)(−4) + (2/9)(−2)
   =    1/2    +    1/3   +   (−4/9)  +   (−4/9)
   =  −1/18

What this calculation shows is that a player can expect to lose $1/18 (i.e., 5.555... cents) each time (s)he plays the game.