
Csci 2111: Data and File Structures
Week 10, Lectures 1 & 2

Hashing

March 23 & 28, 2000

Motivation
Sequential Searching can be done in O(N) access time,
meaning that the number of seeks grows in proportion to
the size of the file.
B-Trees improve on this greatly, providing O(log_k N)
access, where k is a measure of the leaf size (i.e., the
number of records that can be stored in a leaf).
What we would like to achieve, however, is an O(1)
access, which means that no matter how big a file grows,
access to a record always takes the same small number of
seeks.
Static Hashing techniques can achieve such performance,
provided that the file does not grow over time.

What is Hashing?
A Hash function is a function h(K) which transforms a
key K into an address.
Hashing is like indexing in that it involves associating
a key with a relative record address.
Hashing, however, is different from indexing in two
important ways:
With hashing, there is no obvious connection
between the key and the location.
With hashing, two different keys may be transformed
to the same address.

Collisions
When two different keys produce the same
address, there is a collision. The keys involved are
called synonyms.
Coming up with a hashing function that avoids
collisions is extremely difficult. It is best to simply
find ways to deal with them.
Possible Solutions:
Spread out the records
Use extra memory
Put more than one record at a single address.

A Simple Hashing Algorithm


Step 1: Represent the key in
numerical form
Step 2: Fold and Add
Step 3: Divide by a prime number
and use the remainder as the
address.
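A minimal sketch of these three steps in Python, assuming string keys; the fold width, the interim prime 19937, and the table size 101 are illustrative choices, not part of the algorithm:

def simple_hash(key, table_size=101):
    """Fold-and-add hashing of a string key."""
    # Step 1: represent the key in numerical form (character codes).
    codes = [ord(c) for c in key]
    # Step 2: fold and add -- combine the codes two at a time so the
    # running sum stays within a manageable range.
    total = 0
    for i in range(0, len(codes), 2):
        pair = codes[i] * 256 + (codes[i + 1] if i + 1 < len(codes) else 0)
        total = (total + pair) % 19937       # arbitrary prime keeps the sum small
    # Step 3: divide by a prime number and use the remainder as the address.
    return total % table_size

print(simple_hash("LOWELL"))                 # an address in the range 0..100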


Hashing Functions and Record Distributions
Records can be distributed among addresses in
different ways: there may be (a) no synonyms (uniform
distribution); (b) only synonyms (worst case); (c) a few
synonyms (happens with random distributions).
Purely uniform distributions are difficult to obtain and
may not be worth searching for.
Random distributions can be easily derived, but they
are not perfect since they may generate a fair number
of synonyms.
We want better hashing methods.

Some Other Hashing Methods


Though there is no hash function that guarantees
better-than-random distributions in all cases, by
taking into consideration the keys that are being
hashed, certain improvements are possible.
Here are some methods that are potentially better
than random (a sketch of one appears after the list):
Examine keys for a pattern
Fold parts of the key
Divide the key by a number
Square the key and take the middle
Radix transformation
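For instance, a minimal sketch of the "square the key and take the middle" (mid-square) method, assuming an integer key and a four-digit address space (both illustrative choices):

def mid_square(key, digits=4):
    """Square the numeric key and take its middle digits as the address."""
    squared = str(key * key).zfill(2 * digits)   # pad so a middle slice exists
    mid = (len(squared) - digits) // 2
    return int(squared[mid:mid + digits])

print(mid_square(453))    # 453 * 453 = 205209 -> padded "00205209" -> 2052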

Predicting the Distribution of Records
When using a random distribution, we can use a
number of mathematical tools to obtain
conservative estimates of how our hashing
function is likely to behave:
Using the Poisson function p(x) = ((r/N)^x e^(-r/N)) / x!
applied to Hashing, we can conclude that:
In general, if there are N addresses, then the
expected number of addresses with x records
assigned to them is Np(x)
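A minimal sketch of this estimate in Python (the function and variable names are my own):

import math

def poisson(r, N, x):
    """Probability that a given address receives exactly x records when
    r records are hashed randomly into N addresses (mean r/N)."""
    mean = r / N
    return (mean ** x) * math.exp(-mean) / math.factorial(x)

def expected_addresses(r, N, x):
    """Expected number of addresses with exactly x records: N * p(x)."""
    return N * poisson(r, N, x)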

Predicting Collisions for a Full File
Suppose you have a hashing function that you
believe will distribute records randomly and you
want to store 10,000 records in 10,000 addresses.
How many addresses do you expect to have no
records assigned to them?
How many addresses should have one, two, and
three records assigned respectively?
How can we reduce the number of overflow
records?
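Under the same Poisson assumption (r = N = 10,000, so the mean r/N is 1), a quick computation answers the first two questions and hints at the third:

import math

r = N = 10_000
for x in range(4):
    p = (r / N) ** x * math.exp(-(r / N)) / math.factorial(x)
    print(f"addresses with {x} record(s): about {N * p:.0f}")

# Prints roughly 3679, 3679, 1839, and 613: about 3,679 addresses stay empty,
# 3,679 hold exactly one record, 1,839 hold two, and 613 hold three.  Since each
# address holds only one record, the 6,321 non-empty addresses absorb 6,321
# records and the remaining 3,679 become overflow records.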

Increasing Memory Space I


Reducing collisions can be done by choosing a good
hashing function or using extra memory.
The question asked here is: how much extra memory
should be used to obtain a given reduction in collisions?
Definition: Packing density refers to the ratio of the
number of records to be stored (r) to the number of
available spaces (N): packing density = r/N.
The packing density gives a measure of how much of the
available space in a file is actually used.

Increasing Memory Space II


The Poisson Distribution allows us to predict the number
of collisions that are likely to occur given a certain packing
density. We use the Poisson Distribution to answer the
following questions:
How many addresses should have no records assigned to
them?
How many addresses should have exactly one record
assigned (no synonym)?
How many addresses should have one record plus one or
more synonyms?
Assuming that only one record can be assigned to each home
address, how many overflow records can be expected?
What percentage of records should be overflow records?
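A minimal sketch of these computations for the one-record-per-address case, again assuming a Poisson model; the helper names and the example packing density of 0.5 are my own choices:

import math

def poisson(mean, x):
    return mean ** x * math.exp(-mean) / math.factorial(x)

def overflow_stats(r, N):
    """Estimate empty addresses, single-record addresses, addresses with
    synonyms, and overflow records for r records in N one-record slots."""
    mean = r / N
    empty = N * poisson(mean, 0)
    exactly_one = N * poisson(mean, 1)
    with_synonyms = N - empty - exactly_one
    overflow = r - (N - empty)           # each non-empty address holds one home record
    return empty, exactly_one, with_synonyms, overflow, 100 * overflow / r

# Packing density 0.5: e.g., 500 records spread over 1,000 addresses.
print(overflow_stats(500, 1000))         # about 21.3% of the records overflow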

Collision Resolution by Progressive Overflow
How do we deal with records that cannot fit into their home
address? A simple approach: Progressive Overflow or
Linear Probing.
If a key, k1, hashes into the same address, a1, as another
key, k2, then look for the first available address, a2,
following a1 and place k1 in a2. If the end of the address
space is reached, then wrap around it.
When searching for a key that is not in the file: if the address
space is not full, an empty address will eventually be reached
(ending the search); if it is full, the search will come back to
where it began.
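A minimal sketch of progressive overflow (linear probing) over an in-memory table standing in for the file's address space; the table size and the stand-in hash function are illustrative:

TABLE_SIZE = 11
table = [None] * TABLE_SIZE              # None marks an empty address

def home_address(key):
    return sum(ord(c) for c in key) % TABLE_SIZE

def insert(key):
    addr = home_address(key)
    for _ in range(TABLE_SIZE):
        if table[addr] is None:          # first available address
            table[addr] = key
            return addr
        addr = (addr + 1) % TABLE_SIZE   # wrap around the address space
    raise RuntimeError("address space is full")

def search(key):
    addr = home_address(key)
    for _ in range(TABLE_SIZE):
        if table[addr] is None:          # empty slot: the key is not in the file
            return None
        if table[addr] == key:
            return addr
        addr = (addr + 1) % TABLE_SIZE
    return None                          # full table: we came back to the start

for k in ["ADAMS", "BATES", "COLES", "DEAN"]:
    insert(k)
print(search("BATES"), search("EVANS"))  # an address, then None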

Search Length when using Progressive Overflow
Progressive Overflow causes extra searches and
thus extra disk accesses.
If there are many collisions, then many records
will be far from home.
Definitions: Search length refers to the number of
accesses required to retrieve a record from
secondary memory. The average search length is
the average number of times you can expect to
have to access the disk to retrieve a record.
Average search length = (Total search
length)/(Total number of records)
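As a small sketch, the average search length can be computed directly from a table and its hash function; the toy table and names below are illustrative:

def average_search_length(table, home_address):
    """Average accesses needed to retrieve each stored record: 1 if the
    record sits at its home address, plus 1 per extra probe."""
    total, count = 0, 0
    for pos, key in enumerate(table):
        if key is None:
            continue
        probes = (pos - home_address(key)) % len(table) + 1
        total += probes
        count += 1
    return total / count if count else 0.0

# Toy table: "b" collided with "a" and was pushed one slot past its home.
toy = ["a", "b", None, "c", None]
print(average_search_length(toy, lambda k: {"a": 0, "b": 0, "c": 3}[k]))
# (1 + 2 + 1) / 3 = 1.33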

Storing More than One Record per Address: Buckets
Definition: A bucket describes a block of records
sharing the same address that is retrieved in one
disk access.
When a record is to be stored or retrieved, its
home bucket address is determined by hashing.
When a bucket is filled, we still have to worry
about the record overflow problem, but this occurs
much less often than when each address can hold
only one record.

Effect of Buckets on Performance


To compute how densely packed a file is, we need
to consider: 1) the number of addresses, N (the number
of buckets); 2) the number of records we can put at
each address, b (the bucket size); and 3) the number
of records, r. Then, Packing Density = r/(bN).
Though the packing density does not change when
halving the number of addresses and doubling the
size of the buckets, the expected number of
overflows decreases dramatically.
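A minimal sketch of that comparison under the Poisson assumption: at an address that receives x > b records, x - b of them overflow (the example keeps the packing density at 0.75 while doubling the bucket size):

import math

def poisson(mean, x):
    return mean ** x * math.exp(-mean) / math.factorial(x)

def expected_overflow(r, N, b, max_x=50):
    """Expected number of overflow records for r records hashed
    randomly into N buckets that each hold up to b records."""
    mean = r / N
    return sum((x - b) * N * poisson(mean, x) for x in range(b + 1, max_x))

# Same packing density (0.75), different bucket sizes:
print(expected_overflow(750, 1000, 1))   # about 222 overflow records (~29.6%)
print(expected_overflow(750, 500, 2))    # about 140 overflow records (~18.7%)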

Making Deletions
Deleting a record from a hashed file is more
complicated than adding a record for two reasons:
The slot freed by the deletion must not be allowed to
hinder later searches
It should be possible to reuse the freed slot for later
additions.
In order to deal with deletions, we use tombstones, i.e.,
a marker indicating that a record once lived there but
no longer does. Tombstones solve both of the problems
caused by deletion.
Insertion of records is slightly different when using
tombstones.
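A minimal sketch of deletion with tombstones, extending the earlier linear-probing idea; the "#" sentinel and table layout are illustrative:

TABLE_SIZE = 11
TOMBSTONE = "#"                          # a record once lived here but no longer does
table = [None] * TABLE_SIZE

def home_address(key):
    return sum(ord(c) for c in key) % TABLE_SIZE

def insert(key):
    addr = home_address(key)
    for _ in range(TABLE_SIZE):
        # A tombstone slot may be reused; a careful insert would first
        # search for the key to avoid storing a duplicate past the tombstone.
        if table[addr] is None or table[addr] == TOMBSTONE:
            table[addr] = key
            return addr
        addr = (addr + 1) % TABLE_SIZE
    raise RuntimeError("address space is full")

def search(key):
    addr = home_address(key)
    for _ in range(TABLE_SIZE):
        if table[addr] is None:          # truly empty: stop searching
            return None
        if table[addr] == key:           # tombstones are stepped over, not matched
            return addr
        addr = (addr + 1) % TABLE_SIZE
    return None

def delete(key):
    addr = search(key)
    if addr is not None:
        table[addr] = TOMBSTONE          # not None: that would hide later synonyms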

Effects of Deletions and Additions on Performance
After a large number of deletions and additions
have taken place, one can expect to find many
tombstones occupying places that could be
occupied by records whose home address precedes
them but that are stored after them.
This degrades the average search length.
There are 3 types of solutions for dealing with this
problem: a) local reorganization during deletions;
b) global reorganization when the average search
length is too large; c) use of a different collision
resolution algorithm.

Other Collision Resolution Techniques
There are a few variations on random hashing that
may improve performance:
Double Hashing: When an overflow occurs, use a
second hashing function to map the record to its
overflow location (a sketch follows this list).
Chained Progressive Overflow: Like progressive
overflow, except that synonyms are linked together with
pointers.
Chaining with a Separate Overflow Area: Like
chained progressive overflow except that overflow
addresses do not occupy home addresses.
Scatter Tables: The hash file contains no records,
only pointers to records; i.e., it is an index.
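A minimal sketch of the first variation, double hashing, where a second hash function supplies the step used to probe for an overflow location; both hash functions are illustrative stand-ins:

TABLE_SIZE = 11                          # prime, so any step size cycles through all slots
table = [None] * TABLE_SIZE

def h1(key):
    return sum(ord(c) for c in key) % TABLE_SIZE

def h2(key):
    # Second hash function: a step in 1..TABLE_SIZE-1, never zero.
    return 1 + sum(ord(c) * 7 for c in key) % (TABLE_SIZE - 1)

def insert(key):
    addr, step = h1(key), h2(key)
    for _ in range(TABLE_SIZE):
        if table[addr] is None:
            table[addr] = key
            return addr
        addr = (addr + step) % TABLE_SIZE   # on overflow, jump by the second hash
    raise RuntimeError("address space is full")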

Pattern of Record Access


If we have some information about what records
get accessed most often, we can optimize their
location so that these records will have short
search lengths.
By doing this, we try to decrease the effective
average search length even if the nominal average
search length remains the same.
This principle is related to the one used in
Huffman encoding.