1 9/23/2007
TOPICS
• Basic concepts
• Hashing
• B+-tree
INTRODUCTION
• Review
Software Architecture of a DBMS
Query Parser
Query Optimizer
Query Interpreter
Index structures
Abstraction of records
Implementation of σ
• Emp table:
SS#  Name    Age  Salary  dno
1    Joe     24   20000   2
2    Mary    20   25000   3
3    Bob     22   27000   4
4    Kathy   30   30000   5
5    Shideh  4    4000    1
TERMINOLOGY
INTRODUCTION (Cont…)
• Motivation: Speed up queries that reference only a small portion of the records in a file.
• Analogy: Catalog cards in a library (more than one index).
• Evaluation:
1. Access time (find)
2. Insertion time (find + add)
3. Deletion time (find + delete)
4. Space overhead
• Search-key: The attribute (or set of attributes) used to look up records in a file.
• Primary index: The index whose search key specifies the sequential order of the records within a file.
• Secondary index: The index whose search key does not specify the sequential order of the records within a file.
INTRODUCTION (Cont…)
• Example: a data file sorted on State, with a primary index on State (left) and a secondary index on Name (right)
Primary index    Data file                           Secondary index
(State)          State       Name     Age  Other     (Name)
Alaska           Alaska      Bob      12   ...       Alice
Alaska           Alaska      George   28             Bob
Arizona          Arizona     David    48             Charles
California       California  Hellen   20             David
California       California  Jack     37             David
Florida          Florida     Frank    10             Frank
Florida          Florida     Charles  4              George
Indiana          Indiana     Joe      12             Hellen
Ohio             Ohio        Alice    23             Jack
Ohio             Ohio        David    36             Joe
INTRODUCTION (Cont…)
• Example: a data file sorted on State, with a primary index on State (left) and a secondary index on Name (right)
Primary index    Data file                           Secondary index
(State)          State       Name     Age  Else      (Name)
Alaska           Alaska      Bob      12   ...       Alice
Alaska           Alaska      George   28             Bob
Boston           Boston      David    48             Charles
California       California  Hellen   20             David
California       California  Jack     37             David
Florida          Florida     Frank    10             Frank
Florida          Florida     Charles  4              George
Indiana          Indiana     Joe      12             Hellen
Ohio             Ohio        Alice    23             Jack
Ohio             Ohio        David    36             Joe
Dense Index Files
• Dense index: An index record appears for every search-key value in the file.
Example of Sparse Index Files
Multilevel Index
HASHING
Hash function:
• K: the set of all search-key values
• V: the set of all bucket addresses
• h: K → V
• K is large (perhaps infinite), but the set of search-key values actually stored in the database is much smaller than K.
• Fast lookup: To find Ki, search the bucket at address h(Ki).
HASHING (Cont…)
• Example:
– K = salary (set of all 6-digit integers)
– V = 1000 buckets addressed from 0 to 999
– h(k) = k mod 1000
SELECT name
FROM personnel
WHERE salary = 120100
• To find a salary of 120,100, we search bucket number 100.
• Hashing is only appropriate for exact-match queries.
• A bad hash function maps the values to only a subset of (or a few) buckets (e.g., h(k) = k mod 10).
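The bucket computation in this example can be sketched in a few lines of Python. This is only an illustration: the helper name `buckets_used` and the sampled range of salaries are assumptions, not part of the slides.

```python
def h(k, num_buckets=1000):
    """Hash function from the example: bucket address = k mod 1000."""
    return k % num_buckets


def bad_h(k):
    """The 'bad' hash function from the example: only ten distinct results."""
    return k % 10


def buckets_used(hash_fn, keys):
    """Count how many distinct bucket addresses a hash function produces."""
    return len({hash_fn(k) for k in keys})


# Hypothetical workload: a sample of 6-digit salaries.
salaries = range(100_000, 1_000_000, 7)

print(h(120_100))                     # bucket searched for salary 120,100
print(buckets_used(h, salaries))      # buckets reached with k mod 1000
print(buckets_used(bad_h, salaries))  # buckets reached with k mod 10
```

With 1000 buckets available, k mod 10 can reach at most ten of them, so the remaining buckets stay empty while those ten overflow.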
HEAP FILE ORGANIZATION
Bob, 21, 3.7, CS Kane, 19, 3.8, ME Louis, 32, 4, LS Chris, 22, 3.9, CS
Mary, 24, 3, ECE Lam, 22, 2.8, ME Martha, 29, 3.8, CS Chad, 28, 2.3, LS
Tom, 20, 3.2, EE Chang, 18, 2.5, CS James, 24, 3.1, ME Leila, 20, 3.5, LS
Kathy, 18, 3.8, LS Vera, 17, 3.9, EE Pat, 19, 2.8, EE Shideh, 16, 4, CS
Non-Clustered Hash Index
• A non-clustered hash index on the age attribute with 4 buckets.
• h(age) = age % B
Bucket 0: (24, (1,2)) (20, (4,3)) (32, (3,1)) (16, (4,4)) (20, (1,3)) (24, (3,3)) (28, (4,2))
Bucket 1: (21, (1,1)) (17, (2,4)) (29, (3,2))
Bucket 2: (18, (1,4)) (22, (2,2)) (22, (4,1)) (18, (2,3))
Bucket 3: (19, (2,1)) (19, (3,4))
Data file:
Bob, 21, 3.7, CS Kane, 19, 3.8, ME Louis, 32, 4, LS Chris, 22, 3.9, CS
Mary, 24, 3, ECE Lam, 22, 2.8, ME Martha, 29, 3.8, CS Chad, 28, 2.3, LS
Tom, 20, 3.2, EE Chang, 18, 2.5, CS James, 24, 3.1, ME Leila, 20, 3.5, LS
Kathy, 18, 3.8, LS Vera, 17, 3.9, EE Pat, 19, 2.8, EE Shideh, 16, 4, CS
Clustered Hash Index
• A clustered hash index on the age attribute with 4 buckets.
• h(age) = age % B
Bucket 0: Mary, 24, 3, ECE; Shideh, 16, 4, CS; Louis, 32, 4, LS; Leila, 20, 3.5, LS; Tom, 20, 3.2, EE; James, 24, 3.1, ME; Chad, 28, 2.3, LS
Bucket 1: Bob, 21, 3.7, CS; Vera, 17, 3.9, EE; Martha, 29, 3.8, CS
Bucket 2: Kathy, 18, 3.8, LS; Lam, 22, 2.8, ME; Chris, 22, 3.9, CS; Chang, 18, 2.5, CS
Bucket 3: Kane, 19, 3.8, ME; Pat, 19, 2.8, EE
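The difference between the two organizations can be sketched in Python. The sketch assumes that each column of the heap file figure is one page and that index pointers are (page, slot) pairs, which is how the pointers in the non-clustered figure read; the variable names are hypothetical.

```python
# Heap file from the slides: four pages of four records (name, age, gpa, major).
pages = [
    [("Bob", 21, 3.7, "CS"), ("Mary", 24, 3.0, "ECE"),
     ("Tom", 20, 3.2, "EE"), ("Kathy", 18, 3.8, "LS")],
    [("Kane", 19, 3.8, "ME"), ("Lam", 22, 2.8, "ME"),
     ("Chang", 18, 2.5, "CS"), ("Vera", 17, 3.9, "EE")],
    [("Louis", 32, 4.0, "LS"), ("Martha", 29, 3.8, "CS"),
     ("James", 24, 3.1, "ME"), ("Pat", 19, 2.8, "EE")],
    [("Chris", 22, 3.9, "CS"), ("Chad", 28, 2.3, "LS"),
     ("Leila", 20, 3.5, "LS"), ("Shideh", 16, 4.0, "CS")],
]

B = 4  # number of buckets, h(age) = age % B

non_clustered = {b: [] for b in range(B)}  # buckets hold (age, (page, slot))
clustered = {b: [] for b in range(B)}      # buckets hold the records themselves

for page_no, page in enumerate(pages, start=1):
    for slot_no, record in enumerate(page, start=1):
        age = record[1]
        non_clustered[age % B].append((age, (page_no, slot_no)))
        clustered[age % B].append(record)

for b in range(B):
    print(b, non_clustered[b])
```

In the non-clustered case a lookup reads an index bucket and then follows each pointer to the heap file; in the clustered case the bucket already contains the matching records.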
Non-Clustered Hash Index
• A non-clustered hash index on the age attribute with 4 buckets.
• h(age) = age % B
• Pointers are page-ids
Directory: 0 → 500, 1 → 1001, 2 → 706, 3 → 101
Bucket 0 (page 500): (24, (1,2)) (20, (4,3)) (32, (3,1)) (16, (4,4)) (20, (1,3)) (24, (3,3)) (28, (4,2))
Bucket 1 (page 1001): (21, (1,1)) (17, (2,4)) (29, (3,2))
Bucket 2 (page 706): (18, (1,4)) (22, (2,2)) (22, (4,1)) (18, (2,3))
Bucket 3 (page 101): (19, (2,1)) (19, (3,4))
Data file:
Bob, 21, 3.7, CS Kane, 19, 3.8, ME Louis, 32, 4, LS Chris, 22, 3.9, CS
Mary, 24, 3, ECE Lam, 22, 2.8, ME Martha, 29, 3.8, CS Chad, 28, 2.3, LS
Tom, 20, 3.2, EE Chang, 18, 2.5, CS James, 24, 3.1, ME Leila, 20, 3.5, LS
Kathy, 18, 3.8, LS Vera, 17, 3.9, EE Pat, 19, 2.8, EE Shideh, 16, 4, CS
Clustered Hash Index (SEQUENTIAL LAYOUT)
• A clustered hash index on the age attribute with 4 buckets.
• h(age) = age % 4
• When the number of buckets is known in advance, the system may assume a sequentially laid out file to eliminate the need for the hash directory.
Bucket 0: Mary, 24, 3, ECE; Louis, 32, 4, LS; Tom, 20, 3.2, EE; Chad, 28, 2.3, LS
Bucket 0 overflow: Shideh, 16, 4, CS; Leila, 20, 3.5, LS; James, 24, 3.1, ME
Bucket 1: Bob, 21, 3.7, CS; Vera, 17, 3.9, EE; Martha, 29, 3.8, CS
Bucket 2: Kathy, 18, 3.8, LS; Lam, 22, 2.8, ME; Chris, 22, 3.9, CS; Chang, 18, 2.5, CS
Bucket 3: Kane, 19, 3.8, ME; Pat, 19, 2.8, EE
[Figure: hash directory mapping each bucket number (0, 1, 2, ..., M-2, M-1) to a block address on disk]
Example of Non-Clustered Hash Index
[Figure: main buckets chained to overflow buckets; each bucket entry holds a record pointer]
HASHING (Cont…)
• Loading factor
– B = # of buckets, S = # of records per bucket, R = # of records in the relation
– loading factor = R / (B × S)
– The loading factor should not exceed 80%; if it does, double B and re-hash.
• Why might a bucket overflow?
– Heavy loading of the file
– Poor hash functions
– Statistical peculiarities
• What if a bucket overflows?
– Chaining: chain an empty bucket to the bucket that overflows.
– Open addressing: If bucket h(k) is full, store the record in bucket h(k) + 1; if that is also full, try h(k) + 2, and so on.
– Two hash functions: If bucket h(k) is full, store the record in bucket h'(k).
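The loading-factor rule and the open-addressing strategy above can be combined in a small Python sketch. It is a simplified in-memory model under stated assumptions: the class name `HashFile`, the hash function k mod B, wrap-around probing, and the absence of deletions are all choices made for this illustration.

```python
class HashFile:
    """Toy hash file: B buckets of S slots each, open addressing on overflow."""

    def __init__(self, B=4, S=2, max_load=0.8):
        self.B, self.S, self.max_load = B, S, max_load
        self.buckets = [[] for _ in range(B)]
        self.R = 0  # number of records stored

    def load_factor(self):
        return self.R / (self.B * self.S)

    def _place(self, key):
        # Try bucket h(k), then h(k) + 1, h(k) + 2, ... (wrapping around).
        home = key % self.B
        for step in range(self.B):
            bucket = self.buckets[(home + step) % self.B]
            if len(bucket) < self.S:
                bucket.append(key)
                return
        raise RuntimeError("all buckets are full")

    def _rehash(self, new_B):
        # Double the bucket count and recompute every record's bucket.
        old_keys = [k for bucket in self.buckets for k in bucket]
        self.B = new_B
        self.buckets = [[] for _ in range(new_B)]
        for k in old_keys:
            self._place(k)

    def insert(self, key):
        # Double B first if this insert would push the load factor past 80%.
        if (self.R + 1) / (self.B * self.S) > self.max_load:
            self._rehash(2 * self.B)
        self._place(key)
        self.R += 1

    def lookup(self, key):
        home = key % self.B
        for step in range(self.B):
            bucket = self.buckets[(home + step) % self.B]
            if key in bucket:
                return True
            if len(bucket) < self.S:
                return False  # the probe sequence ends at a non-full bucket
        return False


ages = [21, 19, 32, 22, 24, 22, 29, 28, 20, 18, 24, 20, 18, 17, 19, 16]
f = HashFile(B=4, S=2)
for age in ages:
    f.insert(age)
print(f.B, f.load_factor())
```

Stopping a lookup at the first non-full bucket is only valid because this sketch never deletes records; a real implementation would need deletion markers or a different probing rule.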
HASHING (Cont…)
• Problem: The file grows and shrinks over time. How, then, should one choose the hash function?
1. Based on current file size: performance degrades as the DB grows
2. Based on anticipated file size: wastes space initially (and reduces buffer hits)
3. Periodic reorganization: time consuming
3.1. Choose new hash function
3.2. Recompute hash value on every record
3.3. Generate new bucket assignments
• Solution:
– Dynamic hash functions: dynamic modification of h to accommodate growth and shrinkage of the DB (e.g., extendible hashing).
HASHING (Cont…)
Extendible hashing
• Choose a hash function h that produces a b-bit binary number (e.g., b = 32).
• The directory has a header that contains its depth, d.
• Each directory entry points to a hash bucket.
• Buckets are created on demand, as records are inserted.
• Each bucket contains a local depth used to find data.
[Figure: directory of depth 2 with entries 00, 01, 10, 11; labels: directory depth, bucket (local depth 1), siblings]
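A minimal Python sketch of the structure described above follows. It makes several assumptions that the slides do not fix: the directory is indexed by the low-order d bits of the hash value (textbook presentations often use the high-order bits instead), buckets hold a fixed number of keys, and a `max_depth` guard stops splitting once all b bits are in use. The class and method names are hypothetical.

```python
class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.keys = []


class ExtendibleHash:
    def __init__(self, capacity=2, max_depth=32, hash_fn=hash):
        self.capacity = capacity      # keys per bucket (assumed fixed)
        self.max_depth = max_depth    # b, the number of usable hash bits
        self.hash_fn = hash_fn
        self.global_depth = 0         # directory depth d
        self.directory = [Bucket(0)]  # 2**d entries pointing to buckets

    def _index(self, key):
        # Use the low-order d bits of the hash value as the directory index.
        return self.hash_fn(key) & ((1 << self.global_depth) - 1)

    def lookup(self, key):
        return key in self.directory[self._index(key)].keys

    def insert(self, key):
        while True:
            bucket = self.directory[self._index(key)]
            if len(bucket.keys) < self.capacity:
                bucket.keys.append(key)
                return
            self._split(bucket)  # then retry with the updated directory

    def _split(self, bucket):
        if bucket.local_depth == self.max_depth:
            raise OverflowError("bucket cannot be split further")
        if bucket.local_depth == self.global_depth:
            # Double the directory; entries i and i + 2**d share a bucket.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        bit = 1 << bucket.local_depth  # the newly significant hash bit
        bucket.local_depth += 1
        sibling = Bucket(bucket.local_depth)
        # Entries whose new bit is 1 now point to the sibling bucket.
        for i, b in enumerate(self.directory):
            if b is bucket and i & bit:
                self.directory[i] = sibling
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys:
            target = sibling if self.hash_fn(k) & bit else bucket
            target.keys.append(k)


table = ExtendibleHash(capacity=2, hash_fn=lambda x: x)
for key in range(16):
    table.insert(key)
print(table.global_depth, len({id(b) for b in table.directory}))
```

Only the bucket that overflows is split, so most insertions touch a single bucket; the directory doubles only when the overflowing bucket's local depth already equals the directory depth.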
Use of Extendable Hash Structure: Example
Example (Cont.)
Hash structure after insertion of Mianus record
Example (Cont.)
Hash structure after insertion of three Perryridge records
HASHING (Cont…)
• Extendible hashing:
The insertion algorithm of extendible hashing might crash when
Example
• Suppose that we are using extendable hashing on a file that contains records with the following search-key values:
Show the extendable hash structure for this file if the hash function is h(x) = x mod 8 and buckets can hold three records.
B+-TREE
[Figure: B+-tree levels from top to bottom: root, internal nodes, leaf nodes, data file]
B+-TREE (Cont…)
• Non-clustered: Leaf nodes contain the pairs (P, K), where P is a pointer to the record in the file and K is a search-key.
B+-TREE (Cont…)
• Leaf nodes
P1 K1 P2 ... Pn-1 Kn-1 Pn
– Maintain between ⌈(n-1)/2⌉ and n-1 values per leaf.
– If i < j then Ki < Kj
5 7 10 (n = 4)
5 7 10 15 17 18
B+-TREE (Cont…)
• Internal nodes
– Maintain between ⌈n/2⌉ and n pointers per internal node.
– The root is an exception: it must have more than one pointer.
– Suppose a node with m pointers and 2 <= i < m:
1. Pi points to the subtree containing search-key values < Ki and >= Ki-1.
2. Pm points to the subtree containing search-key values >= Km-1.
3. P1 points to the subtree containing search-key values < K1.
5 7 10
2 3 5 6 10 10 11
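The pointer rules for internal nodes translate directly into a search routine. The Python sketch below is illustrative only: the `Node` class is a hypothetical in-memory layout, and the sample tree reuses the keys of the lookup example that appears later in these slides.

```python
from bisect import bisect_right


class Node:
    """A B+-tree node; `children` is None for a leaf (hypothetical layout)."""

    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children


def find_leaf(root, key):
    """Follow pointers from the root to the leaf that may hold `key`.

    Returns the leaf and the number of nodes visited on the way.
    """
    node, visited = root, 1
    while node.children is not None:
        # bisect_right counts the keys <= key; that count is the 0-based
        # position of the pointer whose subtree covers K(i-1) <= key < K(i).
        node = node.children[bisect_right(node.keys, key)]
        visited += 1
    return node, visited


def contains(root, key):
    leaf, _ = find_leaf(root, key)
    return key in leaf.keys


leaves = [Node([4, 7]), Node([10, 20]), Node([30, 40]),
          Node([41, 47]), Node([50, 52])]
root = Node([30], [Node([8], leaves[:2]), Node([41, 50], leaves[2:])])

leaf, visited = find_leaf(root, 47)
print(leaf.keys, visited)
print(contains(root, 8))
```

Note that 8 appears as a separator in an internal node but not in any leaf, so the search reports it as absent; only leaf entries correspond to records.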
To calculate the order n of a B+-tree
• Suppose that the search key field is V = 9 bytes long, the block size is B = 512 bytes, a record pointer is Pr = 7 bytes, and a block pointer is P = 6 bytes.
1. Calculate order of the internal nodes
2. Calculate order of the leaf nodes
• The leaf nodes of the B+-tree have the same number of values and pointers, except that the pointers are data pointers plus one next pointer.
• (nleaf * (Pr + V)) + P <= B
• (nleaf * (7 + 9)) + 6 <= 512
• (16 * nleaf) <= 506
• nleaf = 31
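The same arithmetic can be checked with a short Python sketch. The leaf calculation follows the inequality above. The internal-node calculation is not worked out in these slides; the sketch assumes the usual layout of n block pointers and n-1 keys per internal node, i.e. n*P + (n-1)*V <= B. The function names are hypothetical.

```python
def leaf_order(block_size, key_size, record_ptr_size, block_ptr_size):
    """Largest nleaf with nleaf * (Pr + V) + P <= B."""
    return (block_size - block_ptr_size) // (record_ptr_size + key_size)


def internal_order(block_size, key_size, block_ptr_size):
    """Largest n with n * P + (n - 1) * V <= B (assumed node layout)."""
    return (block_size + key_size) // (block_ptr_size + key_size)


B, V, Pr, P = 512, 9, 7, 6
print(leaf_order(B, V, Pr, P))   # 31
print(internal_order(B, V, P))   # 34
```

Under that assumption an internal node with 34 pointers uses 34*6 + 33*9 = 501 bytes, while 35 pointers would need 516 bytes and no longer fit in a 512-byte block.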
B+-TREE (Cont…)
• Lookup
Root: [30]
Internal nodes: [8] [41 50]
Leaf nodes: [4 7] [10 20] [30 40] [41 47] [50 52]
Data pages: [4 7 10 20] [30 40 41 47] [50 52]
– Find 7: 4 IOs
– Find 4-20: 4 IOs (assuming primary index), 8 IOs (assuming secondary index)
– More than 10% selection: it is more efficient to do a sequential scan (do not use the secondary index).
– Example: 10,000 records, select 1000 of them, 1000 records per disk page:
(Sequential search: 10 IOs, Secondary index: potentially 1000+ IOs)
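The comparison in the last example is simple arithmetic, sketched here in Python. The function names are hypothetical, and the secondary-index figure is a worst case that assumes every selected record lies on a page not already in memory.

```python
import math


def sequential_scan_ios(num_records, records_per_page):
    """A full scan reads every data page exactly once."""
    return math.ceil(num_records / records_per_page)


def secondary_index_worst_case_ios(num_selected, index_ios=0):
    """Worst case: one data-page IO per selected record, plus index IOs."""
    return index_ios + num_selected


print(sequential_scan_ios(10_000, 1_000))      # 10
print(secondary_index_worst_case_ios(1_000))   # 1000
```

Any IOs spent traversing the index come on top of the 1000 data-page reads, which is why the slide says "1000+".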
B+-TREE (Cont…)
• Analysis
– “B” in B+-tree stands for Balanced, i.e., the length of every path from the root to a leaf node is the same.
– Hence, good performance for lookup, insertion, and deletion.
– If K is the number of search-key values in a file, then the path length is at most ⌈log_{⌈n/2⌉}(K)⌉.
– If K = 1,000,000 and 10 <= n <= 100, then at most 3 to 9 nodes are accessed.
– Insertion and deletion should not destroy the balance of the tree.
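B+-TREE (Cont…)
n = 4; internal nodes: 2 to 4 pointers; leaf nodes: 2 to 3 values
Initial tree: root [8 25]; leaves [4 7] [10 20] [30 40]
Insert 41: root [8 25]; leaves [4 7] [10 20] [30 40 41]
Insert 47: leaf [30 40 41 47] overflows and splits into [30 40] and [41 47]; 41 is copied up
Result: root [8 25 41]; leaves [4 7] [10 20] [30 40] [41 47]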
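B+-TREE (Cont…)
Insert 50: root [8 25 41]; leaves [4 7] [10 20] [30 40] [41 47 50]
Insert 52: leaf [41 47 50 52] overflows and splits into [41 47] and [50 52]; 50 is copied up
The node [8 25 41 50] would need 5 pointers, so it splits and 41 moves up into a new root
Result: root [41]; internal nodes [8 25] [50]; leaves [4 7] [10 20] [30 40] [41 47] [50 52]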
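B+-TREE (Cont…)
Before: root [30]; internal nodes [8] [41 50]; leaves [4 7] [10 20] [30 40] [41 47] [50 52]
Delete 20: leaf [10] underflows and merges with its sibling [4 7]
After the leaf merge: root [30]; internal nodes [8] [41 50]; leaves [4 7 10] [30 40] [41 47] [50 52]
The left internal node is left with a single pointer, so it merges with its sibling and 30 is pulled down from the root
Result: root [30 41 50]; leaves [4 7 10] [30 40] [41 47] [50 52]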
Example
• Construct a B+-tree for the following set of values: