
Chapter 6

Hashing

6.1 Introduction
We have looked at ways of searching for data in time O(log n). Can we do better?
Suppose we want to find a student record from a registration number. We could try:

final static int MAX_REG = ????; // Largest registration number + 1

Student[] allStudents = new Student[MAX_REG];

// Fill with data
allStudents[12345603] = new Student("John Smith");

// Find a student
Student theStudent = allStudents[12345603];
Extracting an element from an array is O(1).
The only problem with this is that MAX_REG is very large, certainly over 30,000,000.
It is much larger than the number of students. An array of this size is unlikely to fit in
memory.
We can try to get round this by computing a smaller number from the registration
number. If we have 5000 students we could try taking the last four digits of the registration
number. This is an example of a hash code. It can be calculated as regNum%10000.

final int HASH_SIZE = 10000;
Student[] allStudents = new Student[HASH_SIZE];

// Fill with data at the hash code
allStudents[12345603 % HASH_SIZE] = new Student("John Smith");

// Find a student
Student theStudent = allStudents[12345603 % HASH_SIZE];


There is an obvious problem with this method. Two or more students may have the
same last four digits in their registration numbers. This means we would try to put them
in the same place in the array. This is called a collision. We will learn several methods of
getting round this problem.
Collisions are made much more likely in this example by the poor choice of hash code.
The registration number uses the year of entry as the last two digits. This means most
students have codes ending in 04 or 05, so only about 200 of the 10000 array positions are
used much. We will learn how to select a good hash code. For registration numbers, the four
digits before the last two would be better.
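As a rough sketch of this point, the two choices of digits can be compared on a handful of made-up registration numbers. The class name, the method names and the sample numbers are assumptions for illustration only.

```java
import java.util.HashSet;
import java.util.Set;

class RegNumberHash {
    /** Last four digits: includes the two-digit year of entry. */
    static int lastFour(int regNum) {
        return regNum % 10000;
    }

    /** The four digits before the last two: skips the year of entry. */
    static int fourBeforeYear(int regNum) {
        return (regNum / 100) % 10000;
    }

    public static void main(String[] args) {
        // Hypothetical registration numbers, all ending in the year 04 or 05.
        int[] regNums = {12345604, 12355604, 12365605, 12375605, 12385604};
        Set<Integer> withYear = new HashSet<>();
        Set<Integer> withoutYear = new HashSet<>();
        for (int r : regNums) {
            withYear.add(lastFour(r));
            withoutYear.add(fourBeforeYear(r));
        }
        System.out.println("Distinct codes using the last four digits: " + withYear.size());    // 2
        System.out.println("Distinct codes skipping the year digits:   " + withoutYear.size()); // 5
    }
}
```

With these sample numbers the first choice produces only two distinct codes for five students, while the second gives each student a different code.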

6.2 Selecting a hash code


To use hashing we need to convert a key value into an integer in a suitable range. The key
could be an integer as in the example above, or it could be another data type such as a
string.
The method that we use must have the following properties:
1. It must be quick to calculate.

2. It must have the required range of values.

3. It must distribute the data values as evenly as possible over the whole range of hash
values.
A neat way of getting a hash value is to make an object return its own hash code. The
object will not know what size of array is needed, so the code will have to be reduced to
the required range before it can be used.
The Java Object class provides a method public int hashCode(). By default this typically
returns a value derived from the machine address of the object, as an int.
The simplest way to reduce it to the range we want is by taking a remainder. We need
to be careful because hashCode() can return a negative value and Java remainders do not
work as expected with negative values.
8%5 gives 3 as expected, but -8%5 gives -3, not 2.
We can use

int place(Object key)
{
  return Math.abs(key.hashCode()) % ARRAY_SIZE;
}
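The remainder behaviour described above can be checked directly. The sketch below also shows Math.floorMod (in the standard library since Java 8) as an alternative to Math.abs. This is not the method used in these notes; the class name RemainderDemo and the two-argument place method are made up for illustration.

```java
class RemainderDemo {
    /** Reduce a hash code to the range 0..size-1 using Math.floorMod. */
    static int place(Object key, int size) {
        return Math.floorMod(key.hashCode(), size);
    }

    public static void main(String[] args) {
        System.out.println(8 % 5);                 // 3
        System.out.println(-8 % 5);                // -3: the sign follows the left operand
        System.out.println(Math.floorMod(-8, 5));  // 2

        // Math.abs cannot make the most negative int positive,
        // so the remainder can still come out negative in that one case.
        System.out.println(Math.abs(Integer.MIN_VALUE) % 5 < 0);      // true
        System.out.println(Math.floorMod(Integer.MIN_VALUE, 5) >= 0); // true
    }
}
```

The edge case affects only a hash code equal to Integer.MIN_VALUE, so it is rare, but floorMod avoids it without extra tests.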
If we have an object with an equals method declared we should not use the hashCode()
method of Object, as two values that are equal must always return the same hash code.
Instead we should write a hashCode method that overrides the Object method.
Many of the Java library classes have their own hashCode method.

Integer  Returns the int value of the Integer.
Double   Returns an int made from the binary digits of the double value.
String   "abc" returns c+31*b+31*31*a, with an extra factor of 31 for each further
         character. Note that if they just added a+b+c, all anagrams such as "cab"
         would have the same hash value.
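The String formula can be reproduced by hand. This is a small sketch; the class and method names are invented for the example, and only String.hashCode() itself is library code.

```java
class StringHashDemo {
    /** Recompute the String hash by hand: multiply by 31, then add the next character. */
    static int manualHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(manualHash("abc") == "abc".hashCode()); // true
        System.out.println("abc".hashCode());  // 96354 = 99 + 31*98 + 31*31*97
        System.out.println("cab".hashCode());  // 98244, so the anagram gets a different code
    }
}
```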
If you need to write a hashCode method for a class you need to make sure that two
equal objects have the same hash code. The usual way is to combine the hashCode()
values of all the fields that are used by the equals method. You can combine them by
adding, or by multiplying and adding as in the String case, but a better method is to use the
bitwise exclusive OR operation a^b.

public class Thing
{
  private int a, b;
  private String s;

  public boolean equals(Object o)
  {
    if (o == null) return false;     // Should return false for null
    if (getClass() != o.getClass())
      return false;                  // and for o not a Thing
    Thing t = (Thing) o;             // cast, now that the class is known to match
    return a == t.a && s.equals(t.s);
  }

  public int hashCode()
  {
    return a ^ s.hashCode();
  }
}
In the example above, note that the equals method does not look at b. The hashCode
method must not look at b either; otherwise two things with the same a and s but different b
would compare equal but have different hash codes.
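The effect of this rule can be seen with a library hash set. The following is a sketch only: the constructor and the ThingDemo wrapper are additions made for the example, and java.util.HashSet stands in for any hash-based collection.

```java
import java.util.HashSet;
import java.util.Set;

class ThingDemo {
    static class Thing {
        private final int a, b;
        private final String s;

        // Constructor added for this sketch only.
        Thing(int a, int b, String s) {
            this.a = a;
            this.b = b;
            this.s = s;
        }

        @Override
        public boolean equals(Object o) {
            if (o == null || getClass() != o.getClass()) return false;
            Thing t = (Thing) o;
            return a == t.a && s.equals(t.s);   // b is deliberately ignored
        }

        @Override
        public int hashCode() {
            return a ^ s.hashCode();            // uses exactly the fields that equals uses
        }
    }

    public static void main(String[] args) {
        Thing x = new Thing(1, 10, "key");
        Thing y = new Thing(1, 99, "key");      // differs from x only in b
        Set<Thing> set = new HashSet<>();
        set.add(x);
        System.out.println(x.equals(y));                   // true
        System.out.println(x.hashCode() == y.hashCode());  // true
        System.out.println(set.contains(y));               // true
    }
}
```

If hashCode also used b, the last line could print false even though x and y are equal, because the set would look for y in the wrong place.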
It is a good idea to make the length of a hash table a prime number, as this helps
ensure the values are well spread out. The multiplier used in combining hash codes should
also be a different prime number.

6.2.1 Perfect hashing


A perfect hash is one with no collisions. We can find these if the key values are fixed at
the start. A minimal perfect hash is a perfect hash where the table length is equal to the
number of keys; these are hard to find.
For example, a program that needs to recognise a Java keyword could use a perfect hash,
as the keywords are known in advance.

package com328.hash;

import java.util.HashSet;
import java.util.Set;

/** PerfectHash
 * @author C.T. Stretch
 */
public class PerfectHash
{
  /** The maximum size of table to try. */
  final static int MAX_P = 100000;
  /** The table size. */
  static int p;
  /** The hash table. */
  static String[] table;

  /** The data for the table. */
  final static String[] JAVA_WORDS =
  { "abstract", "default", "if", "private", "this",
    "boolean", "do", "implements", "protected", "throw",
    "break", "double", "import", "public", "throws",
    "byte", "else", "instanceof", "return", "transient",
    "case", "extends", "int", "short", "try",
    "catch", "final", "interface", "static", "void",
    "char", "finally", "long", "strictfp", "volatile",
    "class", "float", "native", "super", "while",
    "const", "for", "new", "switch",
    "continue", "goto", "package", "synchronized",
    "true", "false", "null", "enum"
  };

  /** Create and test a perfect hash table.
   * @param args ignored
   */
  public static void main(String[] args)
  {
    p = findModulus();
    if (p == 0) System.exit(0);
    makeTable();
    System.out.println("Testing the hash table");
    lookup("false");
    lookup("java");
  }

  /** Look up a word in the hash table
   */
  static void lookup(String s)
  {
    String x = table[place(s)];
    if (x == null || !x.equals(s)) System.out.println(s + " - Not found");
    else System.out.println(s + " - found at " + place(s));
  }

  /** Make a hash table
   */
  static void makeTable()
  {
    table = new String[p];
    for (int i = 0; i < JAVA_WORDS.length; i++)
    {
      String s = JAVA_WORDS[i];
      table[place(s)] = s;
    }
  }

  /** Find the hashcode for a string
   * @param s the string to code.
   * @return the hash code
   */
  static int place(String s)
  {
    int k = s.hashCode();
    return Math.abs(k) % p;
  }

  /** Find a value of p so that place gives a perfect hash.
   */
  static int findModulus()
  {
    int n = JAVA_WORDS.length;
    System.out.print("Create a perfect hash ");
    System.out.println("table for " + n + " words");
    System.out.println("Trying table sizes from " + n);
    for (int p = JAVA_WORDS.length; p < MAX_P; p++)
    {
      // Collect the distinct places used with this table size.
      Set<Integer> s = new HashSet<Integer>();
      for (int i = 0; i < JAVA_WORDS.length; i++)
      {
        int k = JAVA_WORDS[i].hashCode();
        s.add(Math.abs(k) % p);
      }
      int q = s.size();
      System.out.println("Size " + p + " " + (n - q) + " collisions");
      if (q == n)
      {
        System.out.println("Perfect hash of size " + p);
        return p;
      }
    }
    System.out.println("Perfect hash not found");
    return 0;
  }
}

6.3 Maps and Sets


There are two main uses of hash tables: sets and maps. These are similar to the sorted set
and sorted map types considered before, except that the keys are not kept in order.
We will implement maps, which are probably the most useful. Here we keep two objects
at each place in the table, the key and the value. We can look up the key and find the
corresponding value.
The Java library provides Set and Map interfaces. We will write our own Map interface,
which is easier to implement.

package com328.hash;

/** Simplified map interface.
 * @author chris
 */
public interface Map<K,V>
{
  /** Puts a value at a key.
   * @param key the key.
   * @param value the value for this key.
   */
  void put(K key, V value);

  /** Gets the value from a key.
   * Returns null if not present.
   * @param key the key.
   * @return The value or null
   */
  V get(K key);

  /** Removes the value at a key.
   * @param key the key.
   * @return The value or null
   */
  V remove(K key);

  /** Gets the number of keys
   * @return int the number of keys.
   */
  int size();
}

6.4 Dealing with collisions


Most hashing methods need to deal with collisions.
The likelihood of collisions depends on the load factor, that is, the number of keys divided
by the size of the hash table. If this is over 1 there must be a collision, but even if it is
small there are likely to be some collisions.
A famous example is given by the birthday problem: how many people do you need to
have before there is an even chance that two have the same birthday?
The answer is only 23.
The same calculation shows that with a hash table of size 365 you only need 23 entries
to get an even chance of a collision.
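The calculation is short enough to sketch. It assumes every key is equally likely to land in any table position; the class and method names are made up for the example.

```java
class BirthdayProblem {
    /**
     * Probability that n keys, each placed uniformly at random in a table
     * of the given size, produce at least one collision.
     */
    static double collisionProbability(int n, int tableSize) {
        double noCollision = 1.0;
        for (int i = 0; i < n; i++) {
            // The (i+1)-th key must avoid the i positions already taken.
            noCollision *= (double) (tableSize - i) / tableSize;
        }
        return 1.0 - noCollision;
    }

    public static void main(String[] args) {
        System.out.printf("22 entries: %.3f%n", collisionProbability(22, 365)); // just under 0.5
        System.out.printf("23 entries: %.3f%n", collisionProbability(23, 365)); // just over 0.5
    }
}
```

At 23 entries the load factor is only about 0.06, which shows how early collisions appear.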
There are two standard techniques for dealing with collisions:
Open addressing  If the hash value we calculate is in use, we use another one.
Buckets          Put several items at the same hash value.

6.4.1 Open Addressing


The simplest form of open addressing is: If the table entry is occupied put the entry in
the next unoccupied position. We use a circular array and move round to the start when
we reach the end of the array.
We need to be careful using this technique to ensure there is always free space in the
array, that is, the load factor is always less than 1. If not, we could search the array forever.
In practice you want to keep the load factor less than 0.75 for open addressing, otherwise
it can get slow; 0.5 is better if possible.
To find a key we just start at the hash value and search through until we find the
required key or reach a null value, meaning our key is not in the table.
For example in a hash table of size 12 add items with the given hash codes:
(a,3),(b,4),(c,3),(d,4),(e,2),(f,2).

Index   0   1   2   3   4   5   6   7   8   9  10  11
Key             e   a   b   c   d   f
Hash            2   3   4   3   4   2

Figure 6.1: Open addressing

To search for f we calculate its hash code, 2, then search through from position 2 until we
find f or a null entry.
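The example of Figure 6.1 can be reproduced with a few lines of code. This is a minimal sketch that stores only keys and is given the hash codes directly; the class and method names are invented for the illustration.

```java
import java.util.Arrays;

class LinearProbingDemo {
    /** Insert a key with the given hash code using linear probing; returns the position used. */
    static int insert(String[] table, String key, int hash) {
        int h = hash % table.length;
        while (table[h] != null) {
            h = (h + 1) % table.length;   // move on, wrapping round to the start
        }
        table[h] = key;
        return h;
    }

    /** Search from the hash code until the key or a null entry is found; -1 if absent. */
    static int find(String[] table, String key, int hash) {
        int h = hash % table.length;
        while (table[h] != null) {
            if (table[h].equals(key)) return h;
            h = (h + 1) % table.length;
        }
        return -1;
    }

    /** Build the table of the example: (a,3),(b,4),(c,3),(d,4),(e,2),(f,2) in 12 positions. */
    static String[] buildExample() {
        String[] table = new String[12];
        String[] keys = {"a", "b", "c", "d", "e", "f"};
        int[] hashes = {3, 4, 3, 4, 2, 2};
        for (int i = 0; i < keys.length; i++) {
            insert(table, keys[i], hashes[i]);
        }
        return table;
    }

    public static void main(String[] args) {
        String[] table = buildExample();
        System.out.println(Arrays.toString(table));
        System.out.println("f found at " + find(table, "f", 2));  // f found at 7
    }
}
```

Finding f takes six probes (positions 2 to 7) even though its hash code is 2, which is the cost of the collisions along the way.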
This causes complications when we want to remove an entry. Suppose we were to
remove b and replace it by null. If we were to search for c we would calculate its hash
value of 3 and search starting from position 3. We would stop at 4 as it has a null value
and miss finding c in position 5.
There are two ways to avoid this complication:
1. When we remove an item move things down to fill the gap. This is not as easy as
it sounds. We could move d into position 4, but then searching for f would fail. We
need to scan backward to find the start of the block of filled entries and then scan
forward moving things as often as we can.

2. We put a special value into the b cell that marks it as being available for reuse but
the scan for values skips over it. Such a value is sometimes called a tombstone. We
must be careful that there are still some null values otherwise the searches will never
stop.
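The difference between the naive removal and a tombstone can be shown on the table of Figure 6.1. This is only a sketch: the table is filled in directly rather than by hashing, and the names TombstoneDemo, example and find are made up for the purpose.

```java
class TombstoneDemo {
    /** A unique marker object; it is compared with ==, never with equals. */
    static final String TOMBSTONE = new String("<deleted>");

    /** The table of Figure 6.1, with e a b c d f in positions 2 to 7. */
    static String[] example() {
        String[] t = new String[12];
        t[2] = "e"; t[3] = "a"; t[4] = "b"; t[5] = "c"; t[6] = "d"; t[7] = "f";
        return t;
    }

    /** Linear-probing search that skips tombstones and stops at null; -1 if absent. */
    static int find(String[] table, String key, int hash) {
        int h = hash % table.length;
        while (table[h] != null) {
            if (table[h] != TOMBSTONE && table[h].equals(key)) return h;
            h = (h + 1) % table.length;
        }
        return -1;
    }

    public static void main(String[] args) {
        String[] broken = example();
        broken[4] = null;                           // remove b by storing null
        System.out.println(find(broken, "c", 3));   // -1: the search stops at position 4

        String[] ok = example();
        ok[4] = TOMBSTONE;                          // remove b by storing a tombstone
        System.out.println(find(ok, "c", 3));       // 5: the search skips the tombstone
    }
}
```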
We hope that if we keep the load factor low collisions are rare and normally we find
our key quickly at its hash value, but we must allow for all cases in our code.

package com328.hash;

/** HashMap
 * @author C.T. Stretch
 */
public class HashMap<K,V> implements Map<K,V>
{
  /** The maximum load factor allowed. */
  final static double MAX_LOAD_FACTOR = 0.75;
  /** The smallest table size allowed. */
  final static int MIN_SIZE = 37;
  /** The tombstone marker. */
  private final static Tombstone TOMBSTONE = new Tombstone();
  /** The length of the tables. */
  private int length;
  /** The number of keys. */
  private int size;
  /** The number of positions holding a key or a tombstone. */
  private int used;
  /** The array that holds the keys. */
  private Object[] keyArray;
  /** The array that holds the values. */
  private V[] valueArray;

  /** Constructs a new HashMap.
   * @param length the initial table length.
   */
  @SuppressWarnings("unchecked")
  HashMap(int length)
  {
    if (length < MIN_SIZE)
      length = MIN_SIZE;
    length = nextPrime(length);
    this.length = length;   // the arrays must use the adjusted prime length
    keyArray = new Object[length];
    valueArray = (V[]) (new Object[length]);
  }

  /** Puts a value at a key.
   * @param key the key.
   * @param value the value for this key.
   */
  public void put(K key, V value)
  {
    // Tombstones count towards the load, so some null entries always remain.
    if ((double) used / length > MAX_LOAD_FACTOR)
      increaseLength();
    int i = find(key);
    if (i >= 0)
    {
      valueArray[i] = value;
    }
    else
    {
      int h = Math.abs(key.hashCode()) % length;
      while (keyArray[h] != null && keyArray[h] != TOMBSTONE)
      {
        h = (h + 1) % length;
      }
      if (keyArray[h] == null)
        used++;             // a tombstone position is already counted
      keyArray[h] = key;
      valueArray[h] = value;
      size++;
    }
  }

  /** Removes the value at a key.
   * @param key the key.
   * @return The value or null
   */
  public V remove(K key)
  {
    int i = find(key);
    if (i < 0)
      return null;
    V value = valueArray[i];
    keyArray[i] = TOMBSTONE;
    valueArray[i] = null;
    size--;
    return value;
  }

  /** Gets the value from a key.
   * Returns null if not present.
   * @param key the key.
   * @return The value or null
   */
  public V get(K key)
  {
    int i = find(key);
    if (i < 0)
      return null;
    return valueArray[i];
  }

  /** Gets the number of keys
   * @return int the number of keys.
   */
  public int size()
  {
    return size;
  }

  /** Find the position of a key in the table.
   * @param key the key to find
   * @return the position or -1 if not in.
   */
  private int find(K key)
  {
    int h = Math.abs(key.hashCode()) % length;
    Object obj;
    while ((obj = keyArray[h]) != null)
    {
      // A tombstone never equals a key, so the scan skips over it.
      if (obj.equals(key))
        return h;
      h = (h + 1) % length;
    }
    return -1;
  }

  /** Find the next prime number from n.
   * @param n the lower limit.
   * @return a prime number >= n.
   */
  private static int nextPrime(int n)
  {
    while (!isPrime(n))
      n++;
    return n;
  }

  /** Test if a number is prime and > 2.
   * @param n the number to test.
   * @return true for a prime > 2.
   */
  private static boolean isPrime(int n)
  {
    if (n % 2 == 0)
      return false;
    for (int i = 3; i * i <= n; i += 2)
      if (n % i == 0)
        return false;
    return true;
  }

  /** Increase the size of the table. */
  @SuppressWarnings("unchecked")
  private void increaseLength()
  {
    int oldLength = length;
    length = nextPrime(2 * length);   // keep the table length prime
    Object[] keys = keyArray;
    V[] values = valueArray;
    keyArray = new Object[length];
    valueArray = (V[]) (new Object[length]);
    size = 0;                         // put recounts the keys as it re-inserts them
    used = 0;
    for (int i = 0; i < oldLength; i++)
      if (keys[i] != null && keys[i] != TOMBSTONE)
        put((K) (keys[i]), values[i]);
  }

  /** An inner class marking an unused position. */
  private static class Tombstone
  {
  }
}

Clustering
The collision resolution used above has a problem. Suppose a collision occurs, then two
consecutive entries are used. This makes it twice as likely that another collision will occur
here. This tends to make clusters grow in the hash table, which slows down operations.
To avoid this we can use a more complicated way to find the next entry to try if one is in
use. The previous method used linear probing: try the hash value, then add 1, 2, 3, 4, ... Instead
we can try quadratic probing: add the squares 1, 4, 9, 16, ... This reduces clustering but has a
problem: it does not visit all the entries, so it could miss all the empty ones. A better
variant is to add +1, -1, +4, -4, +9, -9, ... If the table size is a prime that leaves a remainder
of 3 when divided by 4 then this reaches all the entries.
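The claim about which entries are reached can be checked by brute force for small table sizes. The sketch below counts the distinct positions visited starting from position 0; the class and method names are assumptions made for the example.

```java
import java.util.HashSet;
import java.util.Set;

class QuadraticCoverage {
    /** Count the distinct positions reached by adding the squares 0, 1, 4, 9, ... modulo p. */
    static int plainSquares(int p) {
        Set<Integer> seen = new HashSet<>();
        for (int i = 0; i < p; i++) {
            seen.add((i * i) % p);
        }
        return seen.size();
    }

    /** Count the distinct positions reached by adding +1, -1, +4, -4, +9, -9, ... modulo p. */
    static int alternatingSquares(int p) {
        Set<Integer> seen = new HashSet<>();
        for (int i = 0; i < p; i++) {
            int sq = (i * i) % p;
            seen.add(sq);              // +i*i
            seen.add((p - sq) % p);    // -i*i, kept in the range 0..p-1
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // 11 leaves remainder 3 when divided by 4; 13 leaves remainder 1.
        System.out.println(plainSquares(11) + " of 11");        // 6 of 11
        System.out.println(alternatingSquares(11) + " of 11");  // 11 of 11
        System.out.println(plainSquares(13) + " of 13");        // 7 of 13
        System.out.println(alternatingSquares(13) + " of 13");  // 7 of 13
    }
}
```

For the table of size 13 the alternating signs give no improvement, which is why the remainder condition on the prime matters.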
Here is a find method for quadratic probing:

private int find(Object key)
{
  int home = Math.abs(key.hashCode()) % length;
  int h = home;
  int i = 0;        // the number whose square is currently being added
  int sign = -1;
  Object obj;
  while ((obj = keyArray[h]) != null)
  {
    if (obj.equals(key))
      return h;
    if (sign < 0) i++;   // move to the next square after each +/- pair
    sign = -sign;
    h = (home + sign * i * i) % length;   // offsets +1, -1, +4, -4, ... from the home position
    if (h < 0) h += length;
  }
  return -1;
}

6.4.2 Buckets
We can deal with collisions by putting several keys in one table entry. We can use any list
or set structure to hold the entries. We will look at using a singly linked list.
To search for a key we calculate the hash code to find an entry in the hash table. We
then search the list for that table entry to find the required element. Our previous example
would look like:

2 --> f --> e
3 --> c --> a
4 --> d --> b

Figure 6.2: Linked list buckets
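The figure can be reproduced with library lists. This sketch uses java.util.LinkedList in place of a hand-written singly linked list and adds each new key at the front of its bucket; the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

class BucketDemo {
    /** Place (a,3),(b,4),(c,3),(d,4),(e,2),(f,2) into 12 buckets, newest key first. */
    static List<LinkedList<String>> buildExample() {
        List<LinkedList<String>> table = new ArrayList<>();
        for (int i = 0; i < 12; i++) {
            table.add(new LinkedList<>());
        }
        String[] keys = {"a", "b", "c", "d", "e", "f"};
        int[] hashes = {3, 4, 3, 4, 2, 2};
        for (int i = 0; i < keys.length; i++) {
            table.get(hashes[i] % table.size()).addFirst(keys[i]);
        }
        return table;
    }

    public static void main(String[] args) {
        List<LinkedList<String>> table = buildExample();
        for (int i = 0; i < table.size(); i++) {
            if (!table.get(i).isEmpty()) {
                System.out.println(i + " " + table.get(i));
            }
        }
        // Prints:
        // 2 [f, e]
        // 3 [c, a]
        // 4 [d, b]
    }
}
```

No key is displaced into another position, so a search only ever looks at keys that share its hash value.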

Using buckets we can have load factors greater than one, but if the load factor gets too
large operations will slow down.
Here is an example of a map using linked list buckets. An extra field has been added
to the list nodes to hold the value associated with the key.
package com328.hash;

/** HashBucketMap
 * @author C.T. Stretch
 */
public class HashBucketMap<K,V> implements Map<K,V>
{
  /** The table. */
  private Node<K,V>[] nodes;
  /** The length of the table. */
  private int length;
  /** The number of keys. */
  private int size;

  /** Construct a Hash Table
   * @param length the table size
   */
  @SuppressWarnings("unchecked")
  public HashBucketMap(int length)
  {
    this.length = length;
    nodes = (Node<K,V>[]) (new Node[length]);
  }

  /** Puts a value at a key.
   * @param key the key.
   * @param value the value for this key.
   */
  public void put(K key, V value)
  {
    int h = Math.abs(key.hashCode()) % length;
    for (Node<K,V> n = nodes[h]; n != null; n = n.next)
      if (n.key.equals(key))
      {
        n.value = value;   // key already present: replace its value
        return;
      }
    // New key: add a node at the head of the list for this entry.
    nodes[h] = new Node<K,V>(nodes[h], key, value);
    size++;
  }

  /** Gets the value from a key.
   * Returns null if not present.
   * @param key the key.
   * @return The value or null
   */
  public V get(K key)
  {
    int h = Math.abs(key.hashCode()) % length;
    for (Node<K,V> n = nodes[h]; n != null; n = n.next)
      if (n.key.equals(key)) return n.value;
    return null;
  }

  /** Removes the value at a key.
   * @param key the key.
   * @return The value or null
   */
  public V remove(K key)
  {
    int h = Math.abs(key.hashCode()) % length;
    for (Node<K,V> n = nodes[h], p = null; n != null; p = n, n = n.next)
      if (n.key.equals(key))
      {
        if (p == null)
          nodes[h] = n.next;   // removing the head of the list
        else
          p.next = n.next;     // unlink from the middle or end
        size--;
        return n.value;
      }
    return null;
  }

  /** Gets the number of keys
   * @return int the number of keys.
   */
  public int size()
  {
    return size;
  }

  /** Node
   */
  private static class Node<K,V>
  {
    /** Pointer to the next node. */
    Node<K,V> next;
    /** The key. */
    K key;
    /** The value for this key. */
    V value;

    /** The constructor creates a new node.
     * @param next the next node in the list.
     * @param key the key for this node.
     * @param value The value for this key.
     */
    Node(Node<K,V> next, K key, V value)
    {
      this.next = next;
      this.key = key;
      this.value = value;
    }
  }
}
