Hashing

File Strutures by Folk, Zoellick and Riccardi
Chap11. Hashing

SNU-OOPSLA-LAB

File Structure
SNU-OOPSLA Lab
Chapter Objectives
Introduce the concept of hashing
Examine the problem of choosing a good hashing algorithm,

presents a reasonable one in detail, and describe some others
Explore several approaches for reducing collisions and storage of

several records per address
Develop and use mathematical tools for analyzing performance

differences resulting from the use of different hashing techniques
Examine problems associated with file deterioration (record

deletions) and discuss some solutions
Discuss collision resolution techniques
Examine effects of patterns of record access on performance
File Structure
SNU-OOPSLA Lab
Contents(1)
11.1 Introduction
11.2 A Simple Hashing Algorithm
11.3 Hashing Functions and Record Distribution
11.4 How Much Extra Memory Should Be Used?
11.5 Collision Resolution by Progressive Overflow
11.6 Storing More Than One Record per Address: Buckets
11.7 Making Deletions
11.8 Other Collision Resolution Techniques
11.9 Patterns of Record Access
File Structure
SNU-OOPSLA Lab
Overview
Overview(1)
O(1) access to files

Variation of the relative file
Record number for a record is not arbitrary; rather, it is
obtained by a hashing function H applied to the primary
key, H(key)
Record numbers generated should be uniformly and
randomly distributed such that 0 < H(key) < N
File Structure
SNU-OOPSLA Lab
Overview
Overview(2)
A hash function is like a block box that produces an

address every time you drop in a key
All parts of the key should be used by the hashing

function H so that a lot of records with similar keys do not
all hash to the same location
Given two random keys X, Y and N slots,

the probability H(X)=H(Y) is 1/N;
in this case, X and Y are called synonyms
and a collision occurs
File Structure
SNU-OOPSLA Lab
11.1 Introduction
Introduction
Hash function : h(k)
Transforms a key K into an address
Hash vs other index
Sequential search : O(N)

Binary search : O(log2N)
B(B+) Tree index : O(logkN) where k records in an index node
Hash : O(1)
File Structure
SNU-OOPSLA Lab
11.1 Introduction
A Simple Hashing Scheme(1)
Name
BALL
LOWELL
TREE
File Structure
ASCII Code for

First Two Letters
Product
Home
Address
66 65
66 X 65 =
4,290
4,290
76 96
76 X 96 =
6,004
6,004
84 82
84 X 82 =
6,888
6,888
SNU-OOPSLA Lab
11.1 Introduction
A Simple Hashing Scheme(2)
key
K=LOWELL
SNU-OOPSLA Lab
...
File Structure
...
h(K)
Record
Address
0
1
2
3
4 LOWELL . . .
Address
4
5
6
LOWELLs
home
address
8
11.1 Introduction
Idea behind Hash-based Files
Record with hash key i is stored in node i

All record with hash key h are stored in node h
Primary blocks of data level nodes are stored sequentially
Contents of the root node can be expressed by a simple
function: Address of data level node for record with primary
key k = address of node 0 + H(k)
In literature on hash-based files, primary blocks of data level
nodes are called buckets
File Structure
SNU-OOPSLA Lab
11.1 Introduction
e.g. Hash-based File

root node
70
15 50 1
30 51
57
File Structure
10 45 3
11 60
14
SNU-OOPSLA Lab
61 124 40
20 55
15
10
11.1 Introduction
Hashing(1)
Hashing Functions :
Consider a primary key consisting of
a string of 12 letters and
a file with 100,000 slots.
Since 2612 >> 105,
So synonyms (collisions) are inevitable!
File Structure
SNU-OOPSLA Lab
11
11.1 Introduction
Hashing(2)
Possible means of hashing
First, because 12 characters = 12 bytes = 3(32-bit) words,

partition into 3 words and perform folding as follows : either
(a) add modulo 232, or

(b) combine using exclusive OR, or
(c) invent your own method
Next, let R = V mod N

R => Record-number
V => Value obtained in the above
N => Number of Slots
so 0< R < N
If N has many small factors, a poor distribution leading to
many collisions can occur. Normally, N is chosen to be a
primary number
File Structure
SNU-OOPSLA Lab
12
11.1 Introduction
Hashing(3)
If M = number of records, N = number of available slots,

P(k) = probability of k records hashing to the same slot
then P(k) =
M
K
1
N
1- 1
N
k
~ f
~
ek*k!
where f is the loading factor M/N
As f --> 1, we know that p(0)->1/e and p(1) -> 1/e.

The other (1-1/e) of the records must hash into (1-2/e) of the slots,
for an average of 2.4 slot.
So many synonyms!!
File Structure
SNU-OOPSLA Lab
13
11.1 Introduction
Collision
Collision
Situation in which a record is hashed to an address that does not have
sufficient room to store the record
Perfect hashing : impossible!
Different key, same hash value
(Different record, same address)
Solutions
Spread out the records

Use extra memory
Put more than one record at a single address
File Structure
SNU-OOPSLA Lab
14
A Simple Hashing Algorithm

Step 1. Represent the key in numerical form
If the key is a string : take the ASCII code

If the key is a number : nothing to be done
e.g.. LOWELL = 76 79 87 69 76 76 32 32 32 32 32 32
L O W E L L ( 6 blanks )
File Structure
SNU-OOPSLA Lab
15
Step 2 . Fold and Add
Fold
76 79 | 87 69 | 76 76 | 32 32 | 32 32 | 32 32
Add parts into one integer

(Suppose we use 15bit integer expression, 32767 is limit)
7679 + 8769 + 7676 + 3232 + 3232 + 3232
= 33820 > 32767 (overflow!)
Largest addend : 9090 ( ZZ )
Largest allowable result : 32767 - 9090 = 19937

Ensure no intermediate sum exceeds using mod
( 7679 + 8769 )
(16448 + 7676 )
( 4187 + 3232 )
( 7419 + 3232 )
( 10651 + 3232)
File Structure
mod
mod
mod
mod
mod
19937 = 16448
19937 = 4187
19937 = 7419
19937 = 10651
19937 = 13883
SNU-OOPSLA Lab
16
Step 3. Divide by size of the address space
a = s mod n
a : home address
s : the sum produced in step 2
n : the number of addresses in the file
e.g.. a = 13883 mod 100 = 83 (s = 13883, n = 100)
A prime number is usually used for the divisor because primes tend to
distribute remainders much more uniformly than do nonprimes
So, we chose a prime number as close as possible to the desired size

of the address space (eg, a file with 75 records, a good choice for n
is 101, then the file will become 74.3 percent full
File Structure
SNU-OOPSLA Lab
17
11.3 Hashing Functions and Record Distributions
Hashing Functions and Record Distributions

Distributing records among address
Best
Record Address
1
A
2
B
3
4
C
5
D
6
E
7
F
8
9
G
10
File Structure
Worst
Record Address
1
A
2
B
3
4
C
5
D
6
E
7
F
8
9
G
10
SNU-OOPSLA Lab
Acceptable
Record
A
B
C
D
E
F
G
Address
1
2
3
4
5
6
7
8
9
10
18
11.3 Hashing Functions and Record Distributions
Some other hashing methods
Better-than-random
Examine keys for a pattern

Fold parts of the key
Divide the key by a number
When the better-than-random methods do not work ----randomize!
Square the key and take the middle

Radix transformation
File Structure
SNU-OOPSLA Lab
19
How Much Extra Memory Should Be Used?
Packing Density =
# of records
# of spaces
r
N
The more records are packed,

the more likely a collision will occur
File Structure
SNU-OOPSLA Lab
20
Poisson Distribution
p(x) =
(r/N)xe -r/N
x!
(poisson distribution)
N = the number of available addresses

r = the number of records to be stored
x = the number of records assigned to a given address
p(x) : the probability that a given address will have
x records assigned to it after the hashing
function has been applied to all n records
( x records collision )
File Structure
SNU-OOPSLA Lab
21
Predicting Collisions for Different Packing Densities
# of addresses no record assigned : N X P(0)
# of addresses one record assigned : N X P(1)
# of addresses more than two assigned :

N X [P(2) + P(3) + P(4) + ...]
# of overflows : 1 X NP(2) + 2 X NP(3) +...
Percentage of overflow records :
1 X NP(2) + 2 X NP(3) + 3 X NP(4) ...

X 100
N
File Structure
SNU-OOPSLA Lab
22
The larger space, the less overflows

N addresses
r records
r/N
File Structure
Packing
Density(%)
10
20
30
40
50
60
70
80
90
100
SNU-OOPSLA Lab
Synonym as
% of records
4.8
9.4
13.6
17.6
21.4
24.8
28.1
31.2
34.1
36.8
23
Collision Resolution by Progressive Overflow
Progressive overflow (= linear probing)

Insert a new record
1. Take home address if empty

2. Otherwise, next several addresses are searched in sequence,
until an empty one is found
3. If no more space -- wrapping around
File Structure
SNU-OOPSLA Lab
24
Progressive Overflow(Contd)
0
1
Hash
Routine
Address
....
....
Key
York
Novak. . .
Rosen. . .
Jasper. . .
Morely. . .
....
....
File Structure
SNU-OOPSLA Lab
Yorks home
address (busy)
2nd try (busy)
3rd try (busy)
4th try (open)
Yorks actual
address
25
Progressive Overflow(Cont'd)
0
1
2
Hash
Routine
97
Address
99
98
99 Jello...
....
....
File Structure
....
....
Key
Blue
SNU-OOPSLA Lab
Wrapping around
26
Search a record with a hash function value k:

from home address k,
look at successive records,
until Found,
or An open address is encountered
Worst case
When the record does not exist and the file is full
File Structure
SNU-OOPSLA Lab
27
- Search length : # of accesses required to retrieve a record (from
secondary memory)
20
21
22
23
24
25
...
Actual
Address
Adams. . .
Bates. . .
Cole. . .
Dean. . .
Evans. . .
Home Search
Address length
20
21
21
22
20
1
1
2
2
5
...
...
Average search length = total search length

total # of records
File Structure
SNU-OOPSLA Lab
= 2.2
28
With perfect hashing function : average search length = 1

Average search length of no greater than 2.0 are generally
considered acceptable
5
4
Average
search 3
length
1
20
40
60
100
80
Packing density
File Structure
SNU-OOPSLA Lab
29
11.6 Storing More Than One Record per Address : Buckets
Storing More Than One Record per Address :

Buckets
Bucket : a block of records sharing the same address
Home
Key Address
Green 30
30
Hall
Jerk 32
King 33
Land 33
Marx 33
Nutt 33
File Structure
Bucket
address
30
31
32
33
Bucket contents
Green ...
Hall ...
Jenks ...
King...
Land...
Marks...
(Nutt... is an
overflow record)
SNU-OOPSLA Lab
30
Effects of Buckets on Performance
N : # of addresses
b : # of records fit in a bucket
bN : # of available locations for records
Packing density = r/bN
# of overflow records
N X [ 1XP(b+1) + 2XP(b+2) + 3XP(b+3)...]
As the bucket size gets larger, performance continues to

improve
File Structure
SNU-OOPSLA Lab
31
Bucket Implementation
Collision counter =< bucket size
0
/////
/////
/////
/////
/////
/////
/////
BRICE
TROOP
An empty bucket
2
JONES
/////
ARNSWORTH
Two entries
JONES
ARNSWORTH
STOCKTON
A full bucket
File Structure
SNU-OOPSLA Lab
32
Bucket Implementation(Cont'd)
Initializing and Loading
Creating empty space

Use hash values and find the bucket to store
If the home bucket is full, continue to look at successive
buckets
Problems when
No empty space exists

Duplicate keys occur
File Structure
SNU-OOPSLA Lab
33
Making Deletions
The slot freed by the deletion hinders(disturb) later searches

Use tombstones and reuse the freed slots
Home
Record address
Adams
Jones
Morris
Smith
5
6
6
5
5
6
7
8
Delete 5 Adams...
Morris
6 Jones...
7 ######
8 Smith...
Adams...
Jones...
Morris...
Smith...
A tombstone for Morris

File Structure
SNU-OOPSLA Lab
34
Other Collision Resolution Techniques
Double hashing : avoid clustering with a second hash function

for overflow records
Chained progressive overflow : each home address contains

a pointer to the record with the same address
Chaining with a separate overflow area : move all overflow

records to a separate overflow area
Scatter tables : Hash file contains only pointers to records (like

indexing)
File Structure
SNU-OOPSLA Lab
35
Overflow File
When building the file, if a collision occurs, place the

new synonym into a separate area of the file called the
overflow section
This method is not recommended;
if there is a high load factor,

there will either be overflow from this overflow section
or it will be organized sequentially and performance suffers;
if there is a low load factor, much space is wasted
File Structure
SNU-OOPSLA Lab
36
Linear Probing(1)
When a synonym is identified, search forward from the

address given by the hash function (the natural address)
until an empty slot is located, and store this record there
This is an example of open addressing (examining a

predictable sequence of slots for an empty one)
File Structure
SNU-OOPSLA Lab
37
A S E A R C H I
key :
hash :
N G E X A M P L E
1 0 5 1 18 3 8 9 14 7 5 5 1 13 16 12 5
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
A
insertion sequence
S
E
A A
Memory Space
C
H
I
N
G
E E
E E G H I X
A A C A
M
P
File Structure
SNU-OOPSLA Lab
E E G H I X E
38
Rehashing(1)
Another form of open addressing
In linear probing, if synonym occurred, incremented r by 1

and searched next location
In rehashing, use a second hash function for the

displacement: D = (FOLD(key) mod P) + 1, where P < N is
another prime number
This method has the advantage of avoiding congestion,

because each synonym under the first hash function likely
uses a different displacement D, and this examines a
different sequence of slots
File Structure
SNU-OOPSLA Lab
39
Rehashing(where P=3)(2)
N G E X A M P L E
A S E A R C H I
key :
hash :
1 0 5 1 18 3 8 9 14 7 5 5 1 13 16 12 5
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
A
insertion sequence
S
E
A
Memory Space
C
H
I
N
G
A
M
P
File Structure
SNU-OOPSLA Lab
E
40
Chaining without Replacement(1)
Uses pointers to build linked lists within the file;

each linked list contains each set of synonyms;
the head of the list is the record at the natural address of
these synonyms
When a record is added whose natural address is
occupied, it is added to the list whose head is at the
natural address
Linear probing usually used to find where to put a new
synonym, although rehashing could be used as well
File Structure
SNU-OOPSLA Lab
41
A problem is that linked lists can coalesce:

Suppose that H(R1) = H(R2) = H(R4) = i and H(R3) =
H(R5) = i+1, and records are added in the order R1, R2,
R3, R4, R5.
Then the lists for natural addresses i and i+1 coalesce.
Periodic reorganization shortens such lists.
Let FWD and BWD be forward and backward pointers

along these chains (doubly-linked)
File Structure
SNU-OOPSLA Lab
42
R1
i+1
R2
R3
R4
R5
File Structure
SNU-OOPSLA Lab
H
H
43
Chaining with Replacement(4)
Eliminates problem with deletion that caused abandonment to be

necessary
Further reduces search lengths
When inserting a new record,
if the slot at the natural address is occupied by a record for which
it is not the natural address,
then record is relocated so the new record may be replaced at its
natural address
Synonym chains can never coalesce,
so a record can be deleted even if is the head of a chain,
simply by moving the second record on the chain to its natural
address ( ABANDON thus is no longer necessary )
File Structure
SNU-OOPSLA Lab
44
Chaining with Replacement(5)

after R1, R2 added
R1
R2
after R3 hashes to i+1
i
i+1
R1
R3
R2
H
H
finally
i
i+1
R1
R3
R2
R4
R5
H
H
2 chains : R1 - R2 -R4
R3 - R5
File Structure
SNU-OOPSLA Lab
45
Patterns of Record Access
Pareto Principle ( 80/20 Rule of Thumb): 80 % of the

accesses are performed on 20 % of the records!
The concepts of the Vital Few and the Trivial Many

20 % of the fisherman catch 80 % of the fish
20 % of the burglars steal 80 % of the loot
If we know the patterns of record access ahead, we can

do many intelligent and effective things!
Sometimes we can know or guess the access patterns
Very useful hints for file syetms or DBMSs
Intelligent placement of records
fast accesses
less collisions
File Structure
SNU-OOPSLA Lab
46
Lets Review !!!

11.1 Introduction
11.3 Hashing Functions and Record Distribution
11.9 Patterns of Record Access
File Structure
SNU-OOPSLA Lab
47

Hashing

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Hashing

Uploaded by

Copyright:

Available Formats

File Strutures by Folk, Zoellick and Riccardi

Introduce the concept of hashing

Examine the problem of choosing a good hashing algorithm,

Explore several approaches for reducing collisions and storage of

Develop and use mathematical tools for analyzing performance

Examine problems associated with file deterioration (record

Discuss collision resolution techniques

Examine effects of patterns of record access on performance

O(1) access to files

A hash function is like a block box that produces an

All parts of the key should be used by the hashing

Given two random keys X, Y and N slots,

Hash function : h(k)

Transforms a key K into an address

Hash vs other index

Sequential search : O(N)

A Simple Hashing Scheme(1)

ASCII Code for

A Simple Hashing Scheme(2)

Idea behind Hash-based Files

Record with hash key i is stored in node i

e.g. Hash-based File

So synonyms (collisions) are inevitable!

Possible means of hashing

First, because 12 characters = 12 bytes = 3(32-bit) words,

(a) add modulo 232, or

(c) invent your own method

Next, let R = V mod N

If M = number of records, N = number of available slots,

where f is the loading factor M/N

As f --> 1, we know that p(0)->1/e and p(1) -> 1/e.

Spread out the records

11.2 A Simple Hashing Algorithm

A Simple Hashing Algorithm

If the key is a string : take the ASCII code

11.2 A Simple Hashing Algorithm

Step 2 . Fold and Add

Add parts into one integer

Largest addend : 9090 ( ZZ )

Largest allowable result : 32767 - 9090 = 19937

11.2 A Simple Hashing Algorithm

Step 3. Divide by size of the address space

e.g.. a = 13883 mod 100 = 83 (s = 13883, n = 100)

So, we chose a prime number as close as possible to the desired size

11.3 Hashing Functions and Record Distributions

Hashing Functions and Record Distributions

11.3 Hashing Functions and Record Distributions

Some other hashing methods

Examine keys for a pattern

When the better-than-random methods do not work ----randomize!

Square the key and take the middle

11.4 How Much Extra Memory Should Be Used?

How Much Extra Memory Should Be Used?

The more records are packed,

11.4 How Much Extra Memory Should Be Used?

N = the number of available addresses

11.4 How Much Extra Memory Should Be Used?

Predicting Collisions for Different Packing Densities

# of addresses no record assigned : N X P(0)

# of addresses one record assigned : N X P(1)

# of addresses more than two assigned :

# of overflows : 1 X NP(2) + 2 X NP(3) +...

Percentage of overflow records :