Hashing



Introduction

Hashing is a way of storing data that, in an average-case scenario, allows for constant-time storage and retrieval. This can be achieved by using an array.
        Suppose we want to store a sequence of randomly generated numbers, the keys: 5, 17, 37, 20, 42, 3.
        The array A, the hash table, where we want to store the numbers:

                0       1       2       3       4       5       6       7       8
            -------------------------------------------------------------------------
            |       |       |       |       |       |       |       |       |       |
            -------------------------------------------------------------------------

        We need a way of mapping the numbers to the array indices, a hash function, that will let us store the numbers and later recompute the index when
        we want to retrieve them. There is a natural choice for this. Our hash table has 9 fields, and the mod function, which sends every integer to its
        remainder modulo 9, will map an integer to a number between 0 and 8.

             5 mod 9 = 5
            17 mod 9 = 8
            37 mod 9 = 1
            20 mod 9 = 2
            42 mod 9 = 6
             3 mod 9 = 3

        We store the values:

                0       1       2       3       4       5       6       7       8
            -------------------------------------------------------------------------
            |       |  37   |  20   |   3   |       |   5   |  42   |       |  17   |
            -------------------------------------------------------------------------

        In this case, computing the hash value of the number n to be stored, n mod 9, costs a constant amount of time, and so does the actual storage,
        because n is stored directly in an array field.
        If you want to check whether the value n is in the hash table, you also compute n mod 9 and then access the appropriate array field to see
        if n is stored there. Both operations take a constant amount of time.
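        The store-and-lookup scheme described above can be sketched in Java as follows (class and method names are illustrative, not from the text; collisions are ignored here, since they are the topic of the next sections):

```java
// A minimal sketch of the example above: a table of 9 fields,
// hash function n mod 9, constant-time insert and lookup.
// Collisions are NOT handled; a colliding insert would overwrite.
public class SimpleHash {
    static final int TABLE_SIZE = 9;
    static Integer[] table = new Integer[TABLE_SIZE]; // null = empty field

    static int hash(int n) { return n % TABLE_SIZE; }

    static void insert(int n) {          // O(1)
        table[hash(n)] = n;
    }

    static boolean lookup(int n) {       // O(1)
        Integer stored = table[hash(n)];
        return stored != null && stored == n;
    }
}
```

        After inserting 5, 17, 37, 20, 42, and 3, lookup(17) inspects index 8 and succeeds, while lookup(11) inspects index 2, finds 20 there, and reports that 11 is absent.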

        A problem arises if you want to store the number 11 in addition to the numbers already in the hash table. 11 hashes to 2 (= 11 mod 9), and thus it
        would have to be stored in the array field that is already occupied by the number 20. This is called a collision. The fuller the table gets, the more
        likely a collision becomes.
        The load factor measures how full the table is (for the table above, it is 6/9 ≈ 0.67):
 

                                                                number of items stored in the table
                                        load factor = ----------------------------------------------
                                                               number of array fields in the table

        There are several ways of dealing with collisions:


Chained Hashing

         Instead of storing the data item directly in the hash table, each hash table entry contains a reference to a data structure, e.g. a linked list (chain) or a
         balanced search tree. We compute the hash value v of the data item d and then store d in the data structure referred to from the hash table entry at
         index v. In our example, we use a linked list:

                0       1       2       3       4       5       6       7       8
            -------------------------------------------------------------------------
            |   \   |   |   |   |   |   |   |   \   |   |   |   |   |   \   |   |   |
            ------------|-------|-------|---------------|-------|---------------|----
                        |       |       |               |       |               |
                        v       v       v               v       v               v
                       37      11       3               5      42              17
                                |
                                v
                               20

            In the worst-case scenario, all items hash to the same value v. Thus we store them all in the data structure (linked list, balanced BST) referred to
            by hashtable[v], and the data structure attached to hashtable[v] will be nontrivial. In this worst case we have time complexities:
 

                                                    linked list          BST
            ------------------------------------------------------------------
            Complexity for storage                     O(1)             O(log N)
                           lookup                      O(N)             O(log N)

            where N = number of items in the hash table.
 

            It is possible to have a load factor > 1 when we are dealing with chained hashing. And that doesn't necessarily mean that the hash table is overly
            full, since the chain attached to each table entry may not be very long.
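
            The chained scheme can be sketched in Java, here with a linked list at each entry (class and method names are illustrative, not from the text):

```java
import java.util.LinkedList;

// Sketch of chained hashing: each table entry refers to a linked list
// (the chain) holding all items that hash to that index.
public class ChainedHashTable {
    static final int TABLE_SIZE = 9;
    @SuppressWarnings("unchecked")
    static LinkedList<Integer>[] table = new LinkedList[TABLE_SIZE];

    static int hash(int n) { return n % TABLE_SIZE; }

    static void insert(int n) {
        int v = hash(n);
        if (table[v] == null) table[v] = new LinkedList<>();
        table[v].add(n);                        // O(1): append to the chain
    }

    static boolean lookup(int n) {
        LinkedList<Integer> chain = table[hash(n)];
        return chain != null && chain.contains(n);  // O(length of the chain)
    }
}
```

            With this sketch, both 20 and 11 can be stored: they end up together in the chain at index 2, and no collision problem arises.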


Linear Probing

        Instead of attaching a data structure at every entry of the hash table, we store the data directly in the table. But then we need a convention to deal
        with possible collisions. One such convention is to insert the colliding item into the next empty entry after the original one. For example:
 

                0       1       2       3       4       5       6       7       8
            -------------------------------------------------------------------------
            |       |  37   |  20   |   3   |  11   |   5   |  42   |       |  17   |
            -------------------------------------------------------------------------

        Both 20 and 11 hash to 2. 20 was inserted first, so when 11 came along, we had to look for the first empty space in the table after index 2. The field at
        index 3 is already occupied, but the field at index 4 is still free, so we insert 11 there.
        This illustrates how clusters happen. The higher the load factor, the more likely they are to occur. The load factor's limit value in this kind of hash
        table is 1, because you cannot store more items than the table has entries. A load factor of 0.6 is already considered quite high, because it indicates
        that storage and retrieval are becoming inefficient because of the presence of clusters, and thus calls for a resize of the hash table.

        Suppose you do a lookup(11). 11 hashes to 2 and so you inspect hashtable[2]. Since you don't find 11 there, you keep looking at the successive table
        entries until you either find 11 or you come to an empty field.
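
        A minimal Java sketch of insert and lookup with linear probing (names are illustrative; the sketch assumes the table never becomes completely full, otherwise the probing loop in insert would never terminate):

```java
// Linear probing sketch: on a collision, scan forward (wrapping around
// at the end of the table) until an empty field is found. Lookup scans
// forward until it finds the key or hits an empty field.
public class LinearProbing {
    static final int TABLE_SIZE = 9;
    static Integer[] table = new Integer[TABLE_SIZE]; // null = empty field

    static int hash(int n) { return n % TABLE_SIZE; }

    static void insert(int n) {
        int i = hash(n);
        while (table[i] != null)          // probe forward past occupied fields
            i = (i + 1) % TABLE_SIZE;     // (assumes the table is not full!)
        table[i] = n;
    }

    static int lookup(int n) {            // returns the index, or -1 if absent
        int i = hash(n);
        while (table[i] != null) {
            if (table[i] == n) return i;
            i = (i + 1) % TABLE_SIZE;
        }
        return -1;                        // hit an empty field: n is not stored
    }
}
```

        Inserting 5, 17, 37, 20, 42, 3 and then 11 reproduces the table above: lookup(11) probes indices 2 and 3 before finding 11 at index 4.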

Problem:
        Assume now that you want to remove(3) from the hash table. 3 is located at index 3; removing it leaves the field at index 3 empty:

                0       1       2       3       4       5       6       7       8
            -------------------------------------------------------------------------
            |       |  37   |  20   |       |  11   |   5   |  42   |       |  17   |
            -------------------------------------------------------------------------
 

        Next, do a lookup(11). You will stop immediately at the empty field at index 3: the cluster has been broken, and you wrongly conclude that 11 is not
        in the table.

Solution:
        Instead of leaving the place where we removed the item empty, we leave a marker, a tombstone, to indicate that an item has been removed from this
        location. This way a lookup won't stop when it comes across a tombstone.

                0       1       2       3       4       5       6       7       8
            -------------------------------------------------------------------------
            |       |  37   |  20   |   T   |  11   |   5   |  42   |       |  17   |
            -------------------------------------------------------------------------
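
        The tombstone convention can be sketched in Java (names are illustrative; the sketch reserves Integer.MIN_VALUE as the tombstone marker, so that particular value cannot itself be stored as a key):

```java
// Linear probing with tombstones: a removed slot gets a reserved marker
// rather than becoming empty, so lookups keep probing past it.
public class ProbingWithTombstones {
    static final int TABLE_SIZE = 9;
    static final Integer TOMBSTONE = Integer.MIN_VALUE; // reserved marker
    static Integer[] table = new Integer[TABLE_SIZE];   // null = empty field

    static int hash(int n) { return n % TABLE_SIZE; }

    static void insert(int n) {
        int i = hash(n);
        // an empty field OR a tombstone may receive the new item
        while (table[i] != null && !table[i].equals(TOMBSTONE))
            i = (i + 1) % TABLE_SIZE;
        table[i] = n;
    }

    static boolean lookup(int n) {
        int i = hash(n);
        while (table[i] != null) {        // stop only at a truly empty field
            if (table[i] == n) return true;
            i = (i + 1) % TABLE_SIZE;     // tombstones are probed past
        }
        return false;
    }

    static void remove(int n) {
        int i = hash(n);
        while (table[i] != null) {
            if (table[i] == n) { table[i] = TOMBSTONE; return; }
            i = (i + 1) % TABLE_SIZE;
        }
    }
}
```

        With this sketch, after remove(3) a lookup(11) probes past the tombstone at index 3 and still finds 11 at index 4.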

        The time complexity of linear probing depends on the length of the longest cluster. In the worst case, all keys hash to the same value, thus making
        insertion, lookup, and removal O(N) operations. The reason one still uses linear probing is that it is easy to implement.


Keys and Hash Functions:

The most important issue to consider when choosing a hash function is its range: since the result of the hash function will be used as an array index, it must be in the range 0 to TABLE_SIZE-1. Therefore, it is reasonable for the hash function to convert the key to an integer n and to return n mod TABLE_SIZE.

If the keys are integers with well-distributed values (i.e., modding them with TABLE_SIZE is likely to produce results evenly distributed from 0 to TABLE_SIZE-1), then the hash function can just return: key mod TABLE_SIZE. However, if the keys are not well distributed, or if they are strings (or some other non-integer type), then we must convert them to well-distributed integer values (before modding them by TABLE_SIZE).

Let's assume that the keys are strings. Most programming languages provide a way to convert individual characters to integers. In Java, you can just cast a char to an int; for example, to convert the first character in String S to an int, use: (int)(S.charAt(0)).

Once we know how to convert individual characters to integers, we can convert a whole string to an integer in a number of ways. First, though, let's think about which characters in the string we want to use. Remember that we want our hash function to be efficient, and to map different keys to different locations in the hashtable. Unfortunately, there tends to be tension between those two goals; for example, a hash function that only uses the first and last characters in a key will be faster to compute than a hash function that uses all of the characters in the key, but it will also be more likely to map different keys to the same hashtable location.

A reasonable solution is to compromise; e.g., to use N characters of the key for some reasonable value of N ("reasonable" will depend on how long you expect your keys to be and how fast you want your hash function to be). For keys that have more than N characters, there is also a question of which characters to choose. If the keys are likely to differ only at one end or the other (e.g., they are strings like "value1", "value2", etc, that differ only in their last characters), then this decision could make a big difference in how well the hash function "spreads out" the keys in the hashtable.

To simplify our discussion, let's assume that we know the keys won't be too long, so our hash function can simply use all of the characters in the keys.

Here are some ways to combine the integer values for the individual characters to compute a single integer n (remember that the hash function will return n mod TABLE_SIZE):
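
Two commonly used schemes are sketched below (the second multiplies the running value by a small constant before adding each character, which is what Java's own String.hashCode does with the constant 31; the method names are illustrative):

```java
// Two simple ways to combine the character values of a string into a
// single integer n. The hash function would then return n mod TABLE_SIZE.
public class StringHashing {
    // Way 1: sum the character values. Fast, but all anagrams collide:
    // "stop" and "pots" produce the same value.
    static int sumChars(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++)
            n += (int) s.charAt(i);
        return n;
    }

    // Way 2: multiply the running value by a small constant at each step,
    // so the position of each character matters and anagrams no longer
    // collide systematically. Java's String.hashCode uses the constant 31.
    static int polyChars(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++)
            n = 31 * n + s.charAt(i);
        return n;
    }
}
```

Way 2 spreads keys out better at the cost of one extra multiplication per character.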

If you want to hash an instance of a class that is part of the Java library, it is likely that it already comes equipped with a hashCode() method.

Any object can be a key. If you want to hash an instance of a self-defined class, then you should include a hashCode() method in the class definition.
Just like the toString() and equals() methods, this is a helpful utility. You may use a combination of functions to convert your class instance to an integer. Remember that you do not need to worry about the table size inside your key class: the hash table mods the value returned by hashCode() itself.
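
A sketch of such a self-defined key class (Point is a made-up example; note that equals() must be overridden consistently with hashCode(), so that equal objects always produce equal hash codes):

```java
import java.util.Objects;

// A self-defined key class with its own hashCode(). The table size is
// deliberately NOT applied here; the hash table mods the result itself.
public class Point {
    final int x, y;

    Point(int x, int y) { this.x = x; this.y = y; }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Point)) return false;
        Point p = (Point) o;
        return x == p.x && y == p.y;
    }

    @Override
    public int hashCode() {
        return Objects.hash(x, y);   // combines the fields into one integer
    }
}
```

Two Point objects with the same coordinates are now equal and hash to the same table index, which is exactly what a hash table requires of its keys.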


TEST YOURSELF #1

Consider storing the names: George, Amy, Alan, and Sandy in a hashtable of size 10, using the hash function:

where a=A=1, b=B=2, etc. Draw the hashtable that would be produced.




Choosing the Hashtable Size

The best size to choose for the hashtable will depend on:

TEST YOURSELF #2

Consider hashing keys that are strings containing only lower-case letters. The hash function that will be used computes the product of the integer values of the characters in a key, using the scheme: a=0, b=1, c=2, etc. Why is this scheme not as good as using: a=1, b=2, etc?



Summary