
CS 102
File Structures & File Organizations

Chapter 06
Random Access and Hashing

CJD

Random File Organization

Allows access to a record without a sequential search through the file.
There is a relationship between each record's key and its location in an external file.
The location of the record is computed directly from the key value.
There may be no relationship between the logical ordering and the physical ordering of the file.

Relative Files
An implementation of random file organization with fixed record lengths.
Each record in secondary storage is assigned a record number, which designates its position relative to the beginning of the file.
The first record may be record number 0 or 1, depending on the implementation.
record address = relative record number × fixed record length + address of beginning of file
Records can be updated in place (it is not necessary to batch the transactions).
Records can be accessed sequentially in the physical order of storage, but this access is usually meaningless.
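The address formula above can be sketched in Python as follows. This is only an illustration: the record length of 32 bytes, the in-memory file, and the function names are assumptions, and records are numbered from 0.

```python
import io

RECORD_LENGTH = 32  # hypothetical fixed record length in bytes


def record_offset(record_number: int, record_length: int = RECORD_LENGTH, base: int = 0) -> int:
    """record address = relative record number * record length + start of file."""
    return base + record_number * record_length


def write_record(f, record_number: int, payload: bytes) -> None:
    """Update a record in place; the payload is padded or cut to the fixed length."""
    f.seek(record_offset(record_number))
    f.write(payload.ljust(RECORD_LENGTH)[:RECORD_LENGTH])


def read_record(f, record_number: int) -> bytes:
    """Read one record directly, without scanning the records before it."""
    f.seek(record_offset(record_number))
    return f.read(RECORD_LENGTH)


# A small in-memory "relative file" with room for four records.
relative_file = io.BytesIO(b" " * RECORD_LENGTH * 4)
write_record(relative_file, 2, b"CAT")
print(read_record(relative_file, 2).rstrip())
```

The same two functions work on a real file opened in binary read/write mode, since only `seek`, `read`, and `write` are used.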

Direct Mapping Techniques


Requires direct access storage devices (DASD), such as magnetic disks.
Direct mapping techniques translate a record's key to a storage address.
Two techniques: absolute addressing and relative addressing.

Absolute Addressing (1)


Maps the key value of a record to a fixed absolute address in storage.
h(key value) = absolute address
The absolute address of the record is a machine-dependent address.
With cylinder addressing: cylinder #, surface #, record #
With sector addressing: sector #, record #
Reading or writing a record requires the user to know exactly where on the device the record is physically stored.

Absolute Addressing (2)


Relocating the direct file to another part of the disk or to another device requires changing the absolute addresses.
The mapping used may result in many empty addresses.
Remapping may be required to consolidate free space or to add more space.
Rarely used.

Relative Addressing (1)


Maps the key value of a record to an address relative to the address of the first record.
h(key value) = relative address
The first record can be stored anywhere and is assigned a relative address of 0 (sometimes 1, depending on the implementation).
If the file has n records, at least relative addresses 0, 1, 2, ..., n-1 are allocated.
More than n locations may be needed to provide space for potential key values.

Relative Addressing (2)


Machine independent.
Requires fixed record lengths.
The file can be moved easily: just change the address of the first record.
The I/O channel can translate between relative and absolute addresses.

Simplest Mapping Function


key = relative address
Key = 0 is the first record, Key = i is the (i+1)th record, Key = n-1 is the nth record.
The absolute address of Key = i is easy to compute:
Absolute address of first record + ( i * fixed record length )
The function may need to be modified if the first key value is not 0.
If the keys are not dense, space may be wasted (allocated but unused).

Indexing Techniques (1)

Reduces wasted space in a random file organization.
Maps a large domain of sparse key values into a smaller range of relative addresses.
Random access is the objective.
The physical ordering is usually not the same as the logical ordering.
Sequential access of a random file organization becomes difficult, if not impossible.

Indexing Techniques (2)


Different Indexing Techniques : Hashing, Binary Search Trees, B-Trees, Index Tables, Inverted Files, Multilist Files.

Hashing
Calculates a relative address based on the key value of a record.
Maps a large domain of sparse key values into a smaller range of relative addresses.
Alphanumeric characters in the key are first translated into numeric values, for example using their ASCII (e.g. 65 for A), EBCDIC, or Unicode integer equivalents.

Hashing Synonyms
The following terms are synonymous:
hashing
scatter storage techniques
randomizing scheme
key-to-address transformation methods
direct addressing techniques
hash table methods

How many hashing functions ?


With n keys to be hashed into m locations (n ≤ m), the number of hashing functions is m^n.
Each key value can be hashed into any of the m locations.

The number of perfect hashing functions (i.e., one-to-one) is only P(m,n) = mPn = m! / (m-n)!
The 1st key value can be mapped to any of the m locations, the 2nd to any of the remaining m-1 locations, the 3rd to any of the remaining m-2 locations, ..., and the nth to any of the remaining m-n+1 locations.
If m = 25 and n = 15, there are about 9.3 × 10^20 hash functions, of which only about 4.3 × 10^18 are perfect.
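The two counts can be checked with a few lines of Python. The function names are made up for this sketch; `math.perm` (Python 3.8 or later) computes m! / (m-n)!.

```python
import math


def count_hash_functions(n: int, m: int) -> int:
    """Each of the n keys may go to any of the m locations: m^n functions."""
    return m ** n


def count_perfect_hash_functions(n: int, m: int) -> int:
    """One-to-one mappings only: m * (m-1) * ... * (m-n+1) = m! / (m-n)!."""
    return math.perm(m, n)


total = count_hash_functions(15, 25)
perfect = count_perfect_hash_functions(15, 25)
print(f"{total:.1e} hash functions, {perfect:.1e} of them perfect")
```

For m = 25 and n = 15, only a tiny fraction (roughly one in two hundred) of all hash functions is perfect.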

Collisions
Hashing functions do not always produce unique relative addresses.
Collision: two or more key values are mapped to the same relative address.
Synonyms: two or more key values that are mapped to the same relative address.
Collisions may be reduced by allocating more file space than is needed to store the actual keys.

Load Factor
Load factor = number of key values / number of file positions
Question: How many file positions are allocated to a file with 10,000 key values and a load factor of 80%?
Answer: number of key values / load factor = 10,000 / 0.80 = 10,000 * 1.25 = 12,500.
If the load factor is 75%, 25% of the positions will be empty.
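The question above can be turned into a small helper. This sketch takes the load factor as a whole-number percentage (an assumption, made so that the division stays in exact integer arithmetic) and rounds up when the result is not a whole number of positions.

```python
def positions_needed(num_keys: int, load_factor_percent: int) -> int:
    """File positions to allocate so that num_keys gives the stated load factor."""
    # Ceiling division on integers: -(-a // b) rounds a / b upward.
    return -(-num_keys * 100 // load_factor_percent)


print(positions_needed(10_000, 80))
```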


Load Factor Implications


A high load factor results in a dense file; a low load factor results in a sparse file.
A high load factor results in more collisions; a low load factor results in more wasted space. Find a balance!
When space is at a premium, a maximum load factor of 70%-80% is used.
More (CPU and I/O) time is required to perform the mapping and resolve collisions.

Some Hashing Functions


Some hashing functions used in relative file organization:
Prime Number Division Remainder Method
Digit Extraction
Folding
Radix Conversion
Mid-Square

Division Remainder Method


Prime Number Division Remainder Method
The preferred hashing method if the distribution of keys is unknown.
h(k) = k mod p
where k is a key, p is a prime number, and mod is the modulo (or remainder) function.
h(k) ranges from 0 to p-1, so p storage locations are needed.
Example: if p = 97, h(12345) = 12345 mod 97 = 26.
Choosing a prime p reduces the number of collisions.
Also often used in conjunction with other hashing techniques.

Division Remainder Method


h(k) = k mod p (the same modular arithmetic underlies Caesar's cipher in cryptology)
Extending this method gives the linear congruential method:
h(k) = (ak+b) mod p
Best if a and p are relatively prime.
Also used in generating pseudorandom numbers.
Can be further extended into RSA encryption-decryption techniques.
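Both forms of the division remainder method are one-liners in Python. The function names and the sample multiplier a = 3 and offset b = 7 are assumptions chosen for illustration; p = 97 is the prime from the worked example.

```python
def div_rem_hash(key: int, p: int = 97) -> int:
    """h(k) = k mod p, giving a relative address in 0..p-1."""
    return key % p


def linear_congruential_hash(key: int, a: int, b: int, p: int = 97) -> int:
    """h(k) = (a*k + b) mod p; a and p should be relatively prime."""
    return (a * key + b) % p


print(div_rem_hash(12345))                    # the worked example with p = 97
print(linear_congruential_hash(12345, 3, 7))  # illustrative a and b
```

With a = 1 and b = 0 the linear congruential form reduces to the plain division remainder method.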


Digit Extraction
Analyze which digit positions in a key are uniformly (evenly) distributed.
Extract the digits in these positions and use them as a relative address.
Example: if the four most evenly distributed digits in keys of length 9 are the eighth, fifth, third, and second, then h(143445254) = 5434.
Requires that the key values be known so that an analysis can be made.
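A minimal sketch of digit extraction, assuming positions are counted from 1 starting at the leftmost digit (as in the example) and that short keys are padded with leading zeros to the full key length. The function name is hypothetical.

```python
def digit_extraction_hash(key: int, positions: tuple, key_length: int) -> int:
    """Concatenate the digits found at the given 1-based positions of the key."""
    digits = str(key).zfill(key_length)
    return int("".join(digits[p - 1] for p in positions))


# Eighth, fifth, third, and second digits of a 9-digit key.
print(digit_extraction_hash(143445254, (8, 5, 3, 2), 9))
```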

Folding
Folding reduces the length of long keys so that they become relative addresses.
Partition the key into groups of digits, treated as integers.
Combine the groups with an operation (e.g. +, AND, XOR).
Examples: suppose the key value is 9185551472.
Folding in half and adding: 91855 + 51472 = 143327
Folding in thirds: 918 + 555 + 1472 = 2945
Folding alternate digits: 98517 + 15542 = 114059
Other variations include reversing some groups.
The division remainder method may be applied if the result exceeds the number of locations.
Easy to compute, but performs inconsistently.
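The three folding examples can be reproduced with the sketch below. The function names are assumptions, and only addition is shown as the combining operation; group sizes are passed explicitly because "thirds" of a 10-digit key are uneven (3, 3, and 4 digits).

```python
def fold_add(key: int, group_sizes: tuple) -> int:
    """Split the key's digits into consecutive groups and add the groups."""
    digits = str(key)
    total, start = 0, 0
    for size in group_sizes:
        total += int(digits[start:start + size])
        start += size
    return total


def fold_alternate(key: int) -> int:
    """Add the number formed by the odd-position digits to the even-position one."""
    digits = str(key)
    return int(digits[0::2]) + int(digits[1::2])


key = 9185551472
print(fold_add(key, (5, 5)), fold_add(key, (3, 3, 4)), fold_alternate(key))
```

If the folded value is larger than the number of locations, it can be reduced further, for example with `value % p`.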



Radix Conversion
Interpret the key as a number in a different base, convert it to decimal, and truncate the high-order digits.
Example: if k = 14344, the base is 13, and there are up to 10000 relative addresses:
1×13^4 + 4×13^3 + 3×13^2 + 4×13^1 + 4×13^0 = 28,561 + 8,788 + 507 + 52 + 4 = 37,912, so h(14344) = 7912
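A short sketch of radix conversion using Horner's rule. The function name is an assumption, and truncating the high-order digits is done with a remainder, which matches the example when the number of locations is a power of 10.

```python
def radix_conversion_hash(key: int, base: int, locations: int) -> int:
    """Read the key's decimal digits as base-`base` digits, then keep the low-order part."""
    value = 0
    for digit in str(key):
        value = value * base + int(digit)  # Horner's rule
    return value % locations


print(radix_conversion_hash(14344, 13, 10_000))
```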

Mid-Square
The middle c digits are extracted from the key and squared. Excess high-order and/or low-order digits are truncated.
Example: if k = 143445254, c = 3, and there are m = 10000 relative addresses, the middle 3 digits are 445 and 445^2 = 198025, so h(k) = 8025 if the last 4 digits are kept, or h(k) = 9802 if the 2nd to 5th digits from the right are kept.
Alternative: square all of k, then take the middle digits of the square as the address.
Often results in clustered mappings with many collisions.
Works well if the keys do not have leading or trailing zeros and the load factor is low.
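The mid-square example can be sketched as follows. The function and parameter names are assumptions: `keep` is the number of digits kept for the address, and `skip_low` drops that many low-order digits of the square first, which gives the second variant of the example.

```python
def mid_square_hash(key: int, c: int, keep: int, skip_low: int = 0) -> int:
    """Square the middle c digits of the key and keep `keep` digits of the result."""
    digits = str(key)
    start = (len(digits) - c) // 2          # left edge of the middle c digits
    squared = int(digits[start:start + c]) ** 2
    return (squared // 10 ** skip_low) % 10 ** keep


k = 143445254
print(mid_square_hash(k, 3, 4))              # keep the last 4 digits of 445^2
print(mid_square_hash(k, 3, 4, skip_low=1))  # keep the 2nd to 5th digits from the right
```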

Exercise
Determine the relative addresses from 0 to 99 for the following key values using different hashing functions.
Given: key values 24964, 25936, and 32179
D - Prime Number Division Remainder Method (divisor = 97)
E - Digit Extraction (fifth and third positions)
F - Folding (first 3 + last 2)
R - Radix Conversion (base = 12)
M - Mid-Square (2nd, 3rd & 4th squared)

Exercise Answers
                         Hashing Function
Key Values            D    E    F    R    M
24964                35   49   13   56   16
25936                37   69   95   50   49
32179                72   91    0   25   89
39652                76   26   38   58   25
40851                14   18   59   57   25
53455                 8   54   89    5   25
53758                20   87   95    0   25
54603                89   36   49   59    0
63388                47   83   21   36   44
81347                61   73   60    3   56
No. of Collisions*    0    0    1    0    3

Sorted Hash Values


 D    E    F    R    M
 8   18    0    0    0
14   26   13    3   16
20   36   21    5   25
35   49   38   25   25
37   54   49   36   25
47   69   59   50   25
61   73   60   56   44
72   83   89   57   49
76   87   95   58   56
89   91   95   59   89

D - Prime Number Division Remainder Method: quite uniform
E - Digit Extraction (fifth & third positions): quite uniform
F - Folding (first 3 + last 2): 3 clustered on 89-95
R - Radix Conversion (base = 12): 5 (half) clustered on 50-59
M - Mid-Square (2nd, 3rd & 4th squared): 3 collisions, 8 clustered on 0-49

Perfect Hashing Functions (1)

Perfect hashing functions: the ideal hashing technique, with no collisions.
Provides a one-to-one mapping between keys and storage locations.
Example techniques:
Quotient Reduction
Remainder Reduction
Associated Value Hashing
Reciprocal Hashing
Ordered Minimal Perfect Hashing

Perfect Hashing Functions (2)


Require extensive algorithms to analyze the key set, and work for fixed (static) key sets only.
Most file processing environments have variable (dynamic) key sets, so perfect hashing is not applicable.
Applications of perfect hashing:
Reserved-word lookup in a compiler
Frequently occurring words in natural language processing

Rehashing
Some rehashing strategies (techniques for handling collisions):
Linear Probing
2-Pass File Creation
Separate Overflow Area
Double Hashing
Synonym Chaining
Bucket Addressing
Bucket Chaining

Illustration
Consider a group of words that will be placed in a relative file. Define the hash function as follows:
Assign A=1, B=2, ..., Z=26 and convert the values into binary.
Apply XOR to the binary equivalents of the letters of the word.
Convert the result of the XOR into its decimal equivalent.
Normalize the result to fit into m locations (say, by the division remainder method).

Example Hashing
Example: h(THE) = binary(T) xor binary(H) xor binary(E) = 10100₂ xor 01000₂ xor 00101₂ = 11001₂ = 25₁₀
With 5 binary digit positions, XOR can result in 32 different values.
Allocate 41 storage locations with relative addresses 0 to 40.
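The XOR hash of the illustration can be written compactly in Python. The function name is an assumption; the final remainder implements the normalization step and has no effect here, since XOR of 5-bit values is at most 31 and there are 41 locations.

```python
from functools import reduce


def xor_word_hash(word: str, locations: int = 41) -> int:
    """XOR the letter codes A=1 ... Z=26, then normalize into the file size."""
    codes = (ord(ch) - ord("A") + 1 for ch in word.upper())
    return reduce(lambda acc, code: acc ^ code, codes, 0) % locations


for word in ("THE", "OF", "FOR", "THEY"):
    print(word, xor_word_hash(word))
```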


Hashed Values
By applying the above hash function to a set of words, we get the following hashed values:
k      h(k)      k      h(k)      k      h(k)
THE    25        IT     29        NOT    21
OF      9*       WITH    2        NO      1
AND    11        AS     18        TON    21
TO     27        HIS    18        SAYS   24
A       1        ON      1        ARE    22
IN      7        BE      7        BUT     3
THAT    9        AT     21        FROM   22
IS     26        BY     27        OR     29
WAS     5        I       9        HAVE   26
HE     13        THIS    6        AN     15
FOR    27        HAD    13        THEY    0

Linear Probing
Suppose h(k) results in a collision.
Linear probing: store the record in the next available space: check h(k)+1, h(k)+2, h(k)+3, etc.
This results in clustering of records at the end of the file.
Solution: treat the file with locations 0, 1, ..., n-1 as circular: check h(k)+1, h(k)+2, ..., n-2, n-1, 0, 1, ..., h(k)-1 for the first empty location.
There will still be clustering near the h(k) values.
Displacement: a record is displaced away from its computed location.

Searching
How do you search for a record? Linear probe! Check h(k), h(k)+1, h(k)+2, ..., n-2, n-1, 0, 1, ..., h(k)-1.
Locations must be checked in sequence until the record is found, or until the entire file has been checked (if not found).
Alternatives to linear probing with less clustering:
quadratic probing (out of scope)
random probing (out of scope)
double hashing (discussed later)
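Circular linear probing for both storing and searching can be sketched as below, using the XOR word hash and the 41 locations of the illustration. The function names are assumptions. One refinement beyond the slide is also an assumption: the search stops early at a never-used location, which is valid only as long as no records have been deleted.

```python
from functools import reduce

WORDS = ("THE OF AND TO A IN THAT IS WAS HE FOR IT WITH AS HIS ON BE AT BY I "
         "THIS HAD NOT NO TON SAYS ARE BUT FROM OR HAVE AN THEY").split()


def xor_word_hash(word, locations=41):
    codes = (ord(ch) - ord("A") + 1 for ch in word.upper())
    return reduce(lambda acc, code: acc ^ code, codes, 0) % locations


def insert(table, word):
    """Store the word at h(k) or at the next free location, wrapping around."""
    home = xor_word_hash(word, len(table))
    for step in range(len(table)):
        slot = (home + step) % len(table)
        if table[slot] is None:
            table[slot] = word
            return slot
    raise RuntimeError("file is full")


def search(table, word):
    """Probe h(k), h(k)+1, ... circularly; return the location or None."""
    home = xor_word_hash(word, len(table))
    for step in range(len(table)):
        slot = (home + step) % len(table)
        if table[slot] is None:      # never-used location: the word is absent
            return None
        if table[slot] == word:
            return slot
    return None


table = [None] * 41
for w in WORDS:
    insert(table, w)
print(search(table, "FROM"), search(table, "BUT"), search(table, "CAT"))
```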

Example
ix  data1  data2  data3  data4  data5  synonym of
 0                              THEY
 1  A      A      A      A      A
 2         WITH   WITH   WITH   WITH
 3                ON     ON     ON     1
 4                              NO     1
 5  WAS    WAS    WAS    WAS    WAS
 6                              THIS
 7  IN     IN     IN     IN     IN
 8                BE     BE     BE     7
 9  OF     OF     OF     OF     OF
10  THAT   THAT   THAT   THAT   THAT   9
11  AND    AND    AND    AND    AND
12                              I      9
13  HE     HE     HE     HE     HE
14                              HAD    13
15                              BUT    3
16                              AN     15
17
18         AS     AS     AS     AS
19                HIS    HIS    HIS    18
20
21                AT     AT     AT
22                              NOT    21
23                              TON    21
24                              SAYS
25  THE    THE    THE    THE    THE
26  IS     IS     IS     IS     IS
27  TO     TO     TO     TO     TO
28         FOR    FOR    FOR    FOR    27
29         IT     IT     IT     IT
30                              BY     27
31                              ARE    22
32                              FROM   22
33                              OR     29
34                              HAVE   26
35

footnotes
In the example table above:
data1 = after the 10th word (HE) is stored; THAT collides with OF
data2 = after AS is stored; FOR collides with TO
data3 = after AT is stored; HIS collides with AS; ON collides with A, but the next location holds WITH; BE collides with IN
data4 = all sets of words; BUT has the biggest displacement, from 3 to 15, followed by FROM, from 22 to 32.

Linear Probing Summary


Average number of probes (per Knuth)
Load     Average no. of probes   Average no. of probes
Factor   (successful)            (unsuccessful)
0.10      1.056                    1.118
0.20      1.125                    1.281
0.30      1.214                    1.520
0.40      1.333                    1.889
0.50      1.500                    2.500
0.60      1.750                    3.625
0.70      2.167                    6.060
0.80      3.000                   13.000
0.90      5.500                   50.500
0.95     10.500                  200.500

2-Pass File Creation


Note: basic linear probing is a one-pass algorithm.
First pass: compute the value of the hash function for each record, and store the records that are the first to hash into each position. Synonyms are initially stored in an output file with their function values.
Second pass: store the synonyms in the next available positions.

2-Pass File Creation


More records are stored at their hashed addresses.
Good for records with no synonyms.
Works well if the records are known before storage (static instead of dynamic).
Unfortunately, some records will be stored farther from their hashed addresses than in linear probing.
But on average, it results in faster access.
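The effect of the two passes can be seen by loading the word list of the illustration both ways and counting the records that end up at their hashed address. This is a sketch under assumptions: the function names are made up, the synonym "output file" is an in-memory list, and the second pass uses circular linear probing.

```python
from functools import reduce

WORDS = ("THE OF AND TO A IN THAT IS WAS HE FOR IT WITH AS HIS ON BE AT BY I "
         "THIS HAD NOT NO TON SAYS ARE BUT FROM OR HAVE AN THEY").split()


def xor_word_hash(word, locations=41):
    codes = (ord(ch) - ord("A") + 1 for ch in word.upper())
    return reduce(lambda acc, code: acc ^ code, codes, 0) % locations


def probe_insert(table, word):
    """Circular linear probing starting at the hashed address."""
    home = xor_word_hash(word, len(table))
    for step in range(len(table)):
        slot = (home + step) % len(table)
        if table[slot] is None:
            table[slot] = word
            return
    raise RuntimeError("file is full")


def one_pass(words, m=41):
    table = [None] * m
    for w in words:
        probe_insert(table, w)
    return table


def two_pass(words, m=41):
    table, synonyms = [None] * m, []
    for w in words:                      # first pass: home positions only
        home = xor_word_hash(w, m)
        if table[home] is None:
            table[home] = w
        else:
            synonyms.append(w)           # set aside for the second pass
    for w in synonyms:                   # second pass: next available positions
        probe_insert(table, w)
    return table


def stored_at_home(table):
    return sum(1 for slot, w in enumerate(table)
               if w is not None and xor_word_hash(w, len(table)) == slot)


print(stored_at_home(one_pass(WORDS)), stored_at_home(two_pass(WORDS)))
```

With this word list, the two-pass load keeps every first-arriving key at its hashed address, because no displaced record can occupy another key's home position before that key is stored.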


Separate Overflow Area


Use a separate overflow area in addition to the prime area.
Store each synonym in the next available position in the overflow area.
Eliminates displacement, but requires a sequential search of the overflow area for synonyms.
Works well for records with no synonyms.
Works well if there are few synonyms.
Works poorly if there are many synonyms.

Double Hashing
A record is initially hashed to a relative file position in the prime area.
If that position is not available, a second hash function is applied and its value is added to the first hashed value, modulo the size of the overflow area.
The new value is the relative position in the overflow area for the synonym, if available.
If the position in the overflow area is not available, there is a secondary collision.
Linear probing is applied to secondary collisions in the overflow area.

Double Hashing
If a separate overflow area is not available, then probe using the sequence h1(k), h1(k)+h2(k), h1(k)+2 h2(k), h1(k)+3 h2(k), ..., all modulo m locations.
Performance:
Requires fewer probes than linear probing.
Requires additional computation time for the second hashing function.
The overflow-area variant needs a separate overflow area.
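The probe sequence without a separate overflow area can be sketched as follows. The slides do not specify the second hash function, so h2(k) = 1 + (k mod (m-2)) is an assumption chosen because it is never 0; m = 11 is likewise an arbitrary small prime for the demonstration.

```python
def h1(key: int, m: int) -> int:
    return key % m


def h2(key: int, m: int) -> int:
    """Hypothetical second hash: a step size in 1..m-2, never 0."""
    return 1 + key % (m - 2)


def probe_sequence(key: int, m: int) -> list:
    """h1, h1+h2, h1+2*h2, ... all modulo m (covers every slot when m is prime)."""
    start, step = h1(key, m), h2(key, m)
    return [(start + i * step) % m for i in range(m)]


def insert(table: list, key: int) -> int:
    for slot in probe_sequence(key, len(table)):
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("file is full")


table = [None] * 11
print(insert(table, 24964), insert(table, 16))  # both keys hash to the same h1
```

Because synonyms usually have different step sizes, they follow different probe paths, which is why double hashing clusters less than linear probing.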

Double Hashing Summary


Average number of probes (LP = linear probing, DH = double hashing)
Load     LP             LP               DH             DH
Factor   (successful)   (unsuccessful)   (successful)   (unsuccessful)
0.10      1.056            1.118         1.054            1.111
0.20      1.125            1.281         1.116            1.250
0.30      1.214            1.520         1.189            1.429
0.40      1.333            1.889         1.277            1.667
0.50      1.500            2.500         1.386            2.000
0.60      1.750            3.625         1.527            2.500
0.70      2.167            6.060         1.720            3.333
0.80      3.000           13.000         2.012            5.000
0.90      5.500           50.500         2.558           10.000
0.95     10.500          200.500         3.153           20.000

Synonym Chaining
A record is hashed to a location.
It is stored in the computed location if available.
Otherwise, it is stored in a displaced location using any of the previous methods.
What's new?
Each record is extended to contain a pointer.
Each pointer points to the displaced location of the next synonym, if any.
The synonyms form a linked list.

Synonym Chaining
Performance:
Requires more space (for the pointers) than the previous methods.
Searches only go through the linked synonyms in the file or overflow area.
Reduces probing into locations that do not hold synonyms.

Synonym Chaining Summary


Average number of probes (LP = linear probing, DH = double hashing, SC w/ SOA = synonym chaining with separate overflow area)
Load     LP       LP          DH       DH          SC w/ SOA   SC w/ SOA
Factor   found    not found   found    not found   found       not found
0.10      1.056     1.118     1.054     1.111      1.050       1.005
0.20      1.125     1.281     1.116     1.250      1.100       1.019
0.30      1.214     1.520     1.189     1.429      1.150       1.041
0.40      1.333     1.889     1.277     1.667      1.200       1.070
0.50      1.500     2.500     1.386     2.000      1.250       1.107
0.60      1.750     3.625     1.527     2.500      1.300       1.149
0.70      2.167     6.060     1.720     3.333      1.350       1.197
0.80      3.000    13.000     2.012     5.000      1.400       1.249
0.90      5.500    50.500     2.558    10.000      1.450       1.307
0.95     10.500   200.500     3.153    20.000      1.475       1.337
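The tabulated averages appear to follow the standard closed-form approximations usually credited to Knuth, as functions of the load factor a. The pairing of formulas with columns below is an assumption checked only against the numbers in the table, which it reproduces to about two decimal places; the function names are made up.

```python
import math


def lp_found(a):      # linear probing, successful search
    return 0.5 * (1 + 1 / (1 - a))


def lp_not_found(a):  # linear probing, unsuccessful search
    return 0.5 * (1 + 1 / (1 - a) ** 2)


def dh_found(a):      # double hashing, successful search
    return -math.log(1 - a) / a


def dh_not_found(a):  # double hashing, unsuccessful search
    return 1 / (1 - a)


def sc_found(a):      # synonym chaining with separate overflow area, successful
    return 1 + a / 2


def sc_not_found(a):  # synonym chaining with separate overflow area, unsuccessful
    return a + math.exp(-a)


for a in (0.10, 0.50, 0.80, 0.95):
    row = (lp_found(a), lp_not_found(a), dh_found(a),
           dh_not_found(a), sc_found(a), sc_not_found(a))
    print(f"{a:.2f} " + " ".join(f"{x:8.3f}" for x in row))
```

The formulas make the trend in the table explicit: linear probing degrades like 1/(1-a)^2 for unsuccessful searches, while chaining grows only linearly with the load factor.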

Bucket Addressing
Allocate a bucket for each hashed position.
Each bucket should be large enough to store all the synonyms that hash to it.
This requires knowing the maximum number of synonyms, so it works well for static files only.
A lot of space is wasted on empty positions or on positions with few synonyms.
For dynamic files, how do you handle bucket overflows?

Approaches to Overflows
Consecutive Spill Addressing
If a bucket overflows, store the synonyms in the next available bucket.
However, this has the same problems as linear probing.
Requires maintenance of a directory of nonfull buckets.

Approaches to Overflows
Bucket Chaining
On overflow, allocate a new bucket and chain the synonymous buckets.
Overflow buckets may be the same size as the original bucket or may contain only one record.
All overflow buckets may be stored in a single separate overflow area.
Synonym chaining is the special case in which the bucket size is 1.
Improves on the performance of synonym chaining but adds algorithmic complexity.

Reducing the Bucket Size


(Cecil): The bucket does not need to store the entire record; it only needs to store a pointer into a file where records are added sequentially. The bucket also needs a pointer to the next record in the bucket. This could save space and give quicker access.

Deletions
Care must be taken when deleting.
One approach: only mark records as deleted.
A marked record does not prematurely terminate search probes.
It may be replaced by a newly added record.
An excessive number of deleted records will require moving synonyms forward in the search sequence.

Directories
An organized set of key value and relative address pairs.
To find a record in a relative file, locate its key value in the directory and then use the indicated address to find the record in storage.
Organization of a directory: as a table or as a tree.

Directory Table
The directory is implemented as a table (file) of key values and addresses.
The key values in the directory are sorted to allow binary search.
The records in the relative file are not sorted in logical order.
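A directory table lookup can be sketched with a binary search over the sorted keys. The concrete addresses are assumptions made for illustration: the animal-name keys follow the deck's directory example, with i = 5 and n = 7 chosen so that every relative address is a plain integer.

```python
from bisect import bisect_left

# Sorted (key, relative address) pairs: the directory table.
DIRECTORY = [("APE", 4), ("BAT", 7), ("CAT", 3), ("COW", 1),
             ("DOG", 6), ("EEL", 5), ("ZEBRA", 2)]

# The relative file itself is in no particular logical order.
RELATIVE_FILE = {1: "COW", 2: "ZEBRA", 3: "CAT", 4: "APE",
                 5: "EEL", 6: "DOG", 7: "BAT"}

KEYS = [key for key, _ in DIRECTORY]


def lookup(key: str):
    """Binary-search the directory; return the relative address or None."""
    pos = bisect_left(KEYS, key)
    if pos < len(KEYS) and KEYS[pos] == key:
        return DIRECTORY[pos][1]
    return None


address = lookup("DOG")
print(address, RELATIVE_FILE[address])
```

Reading the directory from top to bottom also yields the records in key order, which the relative file alone cannot provide.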


Directory Table
A Directory Table and Relative File
Directory Table          Relative File
Key      Address         Rel. Adr.   Record
APE      i-1             1           COW
BAT      n               2           ZEBRA
CAT      3               3           CAT
:        :               :           :
COW      1               i-1         APE
DOG      i+1             i           EEL
EEL      i               i+1         DOG
:        :               :           :
ZEBRA    2               n           BAT

Directory Tree
A binary search tree of key and address pairs.
A Directory Tree and Relative File
Tree nodes (key, address): APE, i-1; BAT, n; CAT, 3; COW, 1; DOG, i+1; EEL, i; ZEBRA, 2
Relative file: the same records and relative addresses as shown with the directory table.

Directories Performance (1)


Instead of a hash function, a search of the table or tree is required.
After the directory entry is found, the address is obtained directly.
Keys are machine-independent.
Relative addresses are integers. Actual addresses are easily computed.
The directory provides a logical (sequential) ordering of the keys.
Random access is provided by the directory without having to sequentially search the file.

Directories Performance (2)


If the relative file is reorganized:
the key values are not affected
the directory structure remains intact
address entries must be changed, but this does not affect the directory structure
If records are added or deleted:
the directory must be adjusted
this is slow for tables, unless the entries are chained
efficient algorithms exist for adjusting directory trees

Indexed Files
Directories pave the way for Indexed Sequential File Organizations, discussed next.

End
