CS136, Lecture 33

      1. What to do if you obtain hash clashes?
        1. Open Addressing:
          1. Linear rehashing
          2. Quadratic rehashing
          3. Double Hashing
        2. External chaining.
      2. Analysis

What to do if you obtain hash clashes?

The home address of a key is the location that the hash function returns for that key.

A hash clash occurs when two different keys have the same home location.

There are two main options for getting out of trouble:

  1. Rehash to try to find an open slot (open addressing). This must be done in such a way that one can find the element again quickly!

  2. External chaining.

Suppose
    Object [] table = new Object[TableSize]; 
(If you are willing to rewrite the code, you can use an array of a more specialized type)

1. Open Addressing:

Find the home address of the key (obtained through the hash function). If it is already filled, keep trying new slots until you find an empty one.

There will be three types of entries in the table: empty slots, filled slots, and slots marked as deleted.

Here are some variations:

a. Linear rehashing
Let Probe(i) = (i + 1) % TableSize.

This is about as simple a function as possible. Successive rehashing will eventually try every slot in the table for an empty slot.

Ex. Table to hold strings, TableSize = 26

H(key) = ASCII code of first letter of string - ASCII code of 'A'.

Strings to be input are GA, D, A, G, A2, A1, A3, A4, Z, ZA, E

Look at how they get inserted with linear rehashing:

0   1   2   3   4   5   6   7   8   9   10  ...  25
A   A2  A1  D   A3  A4  GA  G   ZA  E   ..  ...  Z

Primary clustering occurs when more than one key has the same home address. If the same rehashing scheme is used for all keys with the same home address, then any new key will collide with all earlier keys with that home address as it is rehashed.

In the example, this happened with A, A2, A1, A3, A4.

Secondary clustering occurs when a key's home address is occupied by an element that originally hashed to a different home address but was moved there by rehashing.

In the example, this happened with E.
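The insertions in the example can be replayed with a small sketch. The class name LinearProbeDemo and its method names are made up for illustration, and the code assumes every key starts with an upper-case letter (so the hash subtracts 'A') and that the table never fills up.

```java
// Sketch only: LinearProbeDemo and its method names are hypothetical.
class LinearProbeDemo {
    static final int TABLE_SIZE = 26;

    // Home address: first letter minus 'A' (assumes an upper-case first letter).
    static int hash(String key) {
        return key.charAt(0) - 'A';
    }

    // Linear rehash: move on to the next slot, wrapping around at the end.
    static int probe(int i) {
        return (i + 1) % TABLE_SIZE;
    }

    // Inserts key and returns the slot used; assumes the table is not full.
    static int insert(String[] table, String key) {
        int slot = hash(key);
        while (table[slot] != null) {
            slot = probe(slot);
        }
        table[slot] = key;
        return slot;
    }

    // Builds the table from the example's input sequence.
    static String[] buildExample() {
        String[] table = new String[TABLE_SIZE];
        String[] keys = {"GA", "D", "A", "G", "A2", "A1", "A3", "A4", "Z", "ZA", "E"};
        for (String key : keys) {
            insert(table, key);
        }
        return table;
    }
}
```

Calling LinearProbeDemo.buildExample() leaves E in slot 9, five slots past its home address of 4, because every slot in between was already taken by keys with other home addresses.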

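What happens when we delete A2 and then search for A1? If A2's slot were simply made empty, the search for A1 would stop at that empty slot and wrongly report that A1 is not in the table.

We must therefore mark deletions (not just make the slots empty) and then try to reuse the marked slots on later insertions when possible.
This can be quite complex.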

See code in text for how this is handled.
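One possible way to mark deletions is sketched here. This is not the code from the text: the class TombstoneTable, the DELETED sentinel, and the method names are all assumptions made for illustration, and the hash is the first-letter hash from the example with linear rehashing.

```java
// Sketch only: not the textbook's code; all names here are hypothetical.
class TombstoneTable {
    // Sentinel for a slot whose element was deleted. It is a distinct object,
    // so it is compared by identity (==), never with equals().
    private static final String DELETED = new String("<deleted>");

    private final String[] table;

    TombstoneTable(int tableSize) {
        table = new String[tableSize];
    }

    // First-letter hash, reduced into the table's range.
    private int hash(String key) {
        return Math.floorMod(key.charAt(0) - 'A', table.length);
    }

    // Returns the slot holding key, or -1 if the key is absent.
    int find(String key) {
        int slot = hash(key);
        for (int tries = 0; tries < table.length; tries++) {
            String entry = table[slot];
            if (entry == null) return -1;                 // truly empty: stop searching
            if (entry != DELETED && entry.equals(key)) return slot;
            slot = (slot + 1) % table.length;             // deleted or other key: keep probing
        }
        return -1;
    }

    // Inserts key, reusing the first deleted slot on its probe path.
    // Returns false only if no free slot exists.
    boolean insert(String key) {
        if (find(key) >= 0) return true;                  // already present
        int slot = hash(key);
        for (int tries = 0; tries < table.length; tries++) {
            if (table[slot] == null || table[slot] == DELETED) {
                table[slot] = key;
                return true;
            }
            slot = (slot + 1) % table.length;
        }
        return false;
    }

    // Marks the key's slot as deleted rather than emptying it.
    boolean delete(String key) {
        int slot = find(key);
        if (slot < 0) return false;
        table[slot] = DELETED;
        return true;
    }
}
```

With this scheme, deleting A2 leaves a marker in its slot, so a later search for A1 probes past the marker instead of stopping early, and a later insertion of another key with home address 0 can reuse the marked slot.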

Minor variants of linear rehashing would include adding any number k (which is relatively prime to TableSize) to i rather than 1.

If k shares a factor greater than 1 with TableSize (i.e., k is not relatively prime to TableSize), then not all entries of the table will be explored when rehashing. For instance, if TableSize = 100 and k = 50, the Probe function will only explore two slots no matter how many times it is applied to a starting location.

Often the use of k=1 works as well as any other choice.
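The effect of the step size can be checked with a short sketch that counts how many distinct slots Probe(i) = (i + k) % TableSize reaches from a given start. The class and method names are hypothetical, and the code assumes k >= 0 and a start slot inside the table.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch only: StepCoverage and slotsVisited are hypothetical names.
class StepCoverage {
    // Counts the distinct slots reached by repeatedly applying
    // probe(i) = (i + k) % tableSize, beginning at start.
    static int slotsVisited(int start, int k, int tableSize) {
        Set<Integer> seen = new HashSet<>();
        int slot = start;
        while (seen.add(slot)) {          // add() returns false once a slot repeats
            slot = (slot + k) % tableSize;
        }
        return seen.size();
    }
}
```

For TableSize = 100, slotsVisited(0, 50, 100) is 2, while k = 1 or k = 3 (both relatively prime to 100) reach all 100 slots.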

b. Quadratic rehashing
Try (home + j^2) % TableSize on the jth rehash.

This variant helps with secondary clustering but not primary clustering. (Why?) It can also result in instances where in rehashing you don't try all possible slots in table.

For example, suppose the TableSize is 5, with slots 0 to 4. If the home address of an element is 1, then successive rehashings result in 2, 0, 0, 2, 1, 2, 0, 0, ... The slots 3 and 4 will never be examined to see if they have room. This is clearly a disadvantage.
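The sequence above can be reproduced with a few lines of Java. The names QuadraticProbeDemo, probe, and sequence are made up for this sketch, which assumes a non-negative home address and values small enough that j * j does not overflow an int.

```java
// Sketch only: QuadraticProbeDemo and its method names are hypothetical.
class QuadraticProbeDemo {
    // Slot tried on the j-th rehash (j = 1, 2, ...).
    static int probe(int home, int j, int tableSize) {
        return (home + j * j) % tableSize;
    }

    // The slots tried on the first `count` rehashes, in order.
    static int[] sequence(int home, int tableSize, int count) {
        int[] slots = new int[count];
        for (int j = 1; j <= count; j++) {
            slots[j - 1] = probe(home, j, tableSize);
        }
        return slots;
    }
}
```

QuadraticProbeDemo.sequence(1, 5, 8) yields 2, 0, 0, 2, 1, 2, 0, 0, so slots 3 and 4 never appear.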

c. Double Hashing
Rather than computing a uniform jump size for successive rehashes, make it depend on the key by using a different hashing function to calculate the rehash.

E.g. compute delta(Key) = Key % (TableSize -2) + 1, and add delta for successive tries.

If the TableSize is chosen well, this should alleviate both primary and secondary clustering.

For example, suppose the TableSize is 5, and H(n) = n % 5. We calculate delta as above. Thus, while H(1) = H(6) = H(11) = 1, the jump sizes differ since delta(1) = 2, delta(6) = 1, and delta(11) = 3.
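A minimal sketch of the resulting probe sequences follows. The class DoubleHashDemo and its method names are hypothetical, and the code assumes non-negative integer keys and TableSize > 2.

```java
// Sketch only: DoubleHashDemo and its method names are hypothetical.
class DoubleHashDemo {
    // Home address H(n) = n % tableSize (assumes key >= 0).
    static int hash(int key, int tableSize) {
        return key % tableSize;
    }

    // Key-dependent jump size, always between 1 and tableSize - 2.
    static int delta(int key, int tableSize) {
        return key % (tableSize - 2) + 1;
    }

    // The first `count` slots tried for key: the home address, then each rehash.
    static int[] probeSequence(int key, int tableSize, int count) {
        int[] slots = new int[count];
        int slot = hash(key, tableSize);
        int step = delta(key, tableSize);
        for (int i = 0; i < count; i++) {
            slots[i] = slot;
            slot = (slot + step) % tableSize;
        }
        return slots;
    }
}
```

With TableSize = 5, the keys 1, 6, and 11 all start at slot 1 but then diverge: 1, 3, 0, 2, 4 for key 1; 1, 2, 3, 4, 0 for key 6; and 1, 4, 2, 0, 3 for key 11. Because 5 is prime, every jump size visits all five slots.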

2. External chaining.

The idea is to let each slot in the hash table hold as many items as necessary.

The easiest way to do this is to let each slot be the head of a linked list of elements.

Draw picture with strings to be input as GA, D, A, G, A2, A1, A3, A4, Z, ZA, E

The simplest way to represent this is to allocate a table as an array of pointers, with each non-nil entry a pointer to the linked list of elements which got mapped to that slot.

We can organize these lists as ordered, singly or doubly linked, circular, etc.

We can even let them be binary search trees if we want.

Of course, with good hashing functions the lists should be short enough that there is no need for fancy representations (though one may wish to hold them as ordered lists).
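A minimal sketch of external chaining with unordered linked lists is given here. ChainedTable and its method names are hypothetical, the hash is the first-letter hash from the earlier example, and java.util.LinkedList stands in for a hand-built list.

```java
import java.util.LinkedList;

// Sketch only: ChainedTable and its method names are hypothetical.
class ChainedTable {
    // Each non-null entry is the list of keys that hashed to that slot.
    private final LinkedList<String>[] table;

    @SuppressWarnings("unchecked")
    ChainedTable(int tableSize) {
        table = new LinkedList[tableSize];   // generic arrays need the raw type
    }

    // First-letter hash, reduced into the table's range.
    private int hash(String key) {
        return Math.floorMod(key.charAt(0) - 'A', table.length);
    }

    // Adds key to its slot's list unless it is already there.
    void insert(String key) {
        int slot = hash(key);
        if (table[slot] == null) {
            table[slot] = new LinkedList<>();
        }
        if (!table[slot].contains(key)) {
            table[slot].add(key);
        }
    }

    boolean contains(String key) {
        LinkedList<String> chain = table[hash(key)];
        return chain != null && chain.contains(key);
    }

    // Deletion just unlinks the key; no markers are needed.
    boolean delete(String key) {
        LinkedList<String> chain = table[hash(key)];
        return chain != null && chain.remove(key);
    }

    // Number of keys currently chained at slot.
    int chainLength(int slot) {
        return table[slot] == null ? 0 : table[slot].size();
    }
}
```

Inserting GA, D, A, G, A2, A1, A3, A4, Z, ZA, E into a 26-slot table gives a chain of five keys at slot 0 and chains of two keys at slots 6 and 25. No key ever lands in a slot other than its home address.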

See figure 13.12 in text on efficiency

There are some advantages to this over open addressing:

  1. Deleting is not a big deal.

  2. The number of elements stored in the table can be larger than the table size.

  3. It avoids all problems of secondary clustering (though primary clustering can still be a problem)

Analysis

We can measure the performance of various hashing schemes with respect to the "load factor" of the table.

The load factor of a table is defined as a = (number of elements in table) / (size of table)

a = 1 means the table is full, a = 0 means it is empty.

Larger values of a lead to more collisions.

(Note that with external chaining, it is possible to have a > 1).

The following table summarizes the performance of our collision resolution techniques in searching for an element. The value in each entry represents the average number of compares necessary for the search. The first column represents the number of compares if the search is ultimately unsuccessful, while the second represents the case when the item is found:

Strategy            Unsuccessful              Successful
Linear rehashing    1/2 (1 + 1/(1-a)^2)       1/2 (1 + 1/(1-a))
Double hashing      1/(1-a)                   -(1/a) ln(1-a)
External chaining   a + e^(-a)                1 + a/2
Complexity of hashing operations when table has load factor a

The main point to note is that the time for linear rehashing goes up dramatically as a approaches 1.

Double hashing behaves similarly, but not as badly, whereas the cost of external chaining grows only slowly (roughly linearly in a).

In particular, if a = .9, we get

Strategy            Unsuccessful   Successful
Linear rehashing    50.5           5.5
Double hashing      10             ~ 2.6
External chaining   ~ 1.3          1.45
Complexity of hashing operations when table has load factor .9

The differences become greater with heavier loading.
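To see how the costs grow with the load factor, the formulas can be evaluated directly. This sketch uses the hypothetical class HashCost, takes the logarithm to be the natural log, and assumes the usual separate-chaining estimate a + e^(-a) for an unsuccessful search with external chaining.

```java
// Sketch only: HashCost and its method names are hypothetical.
// Each method returns the expected number of compares at load factor a.
class HashCost {
    static double linearUnsuccessful(double a) {
        return 0.5 * (1 + 1 / ((1 - a) * (1 - a)));
    }

    static double linearSuccessful(double a) {
        return 0.5 * (1 + 1 / (1 - a));
    }

    static double doubleUnsuccessful(double a) {
        return 1 / (1 - a);
    }

    // Natural logarithm assumed; requires 0 < a < 1.
    static double doubleSuccessful(double a) {
        return -Math.log(1 - a) / a;
    }

    // Assumed form a + e^(-a), the standard separate-chaining estimate.
    static double chainUnsuccessful(double a) {
        return a + Math.exp(-a);
    }

    static double chainSuccessful(double a) {
        return 1 + a / 2;
    }
}
```

At a = 0.9 these give about 50.5 and 5.5 for linear rehashing, 10 and 2.56 for double hashing, and 1.31 and 1.45 for external chaining. At a = 0.99 the linear rehashing cost for an unsuccessful search climbs to roughly 5000 compares.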

The space requirements (in words of memory) are roughly the same for both techniques, but external chaining is happier with a smaller table (i.e., a higher load factor).

General rule of thumb: for small elements and a small load factor, use open addressing.

If the elements are large, then external chaining gives good performance at a small cost in space.