Mechanism for
Separating logical organization of files from their physical
organization
The physical file may be unorganized, but there may be logically
ordered indexes for accessing it
Conducting efficient searches on the files
Indexes hold information about location of records with specific
values
Index structures (hopefully) fit in main memory so index searching is
fast
Indexing is an issue for both file systems and DBMSs
Remember that databases are eventually mapped to files (e.g., each
relation stored in one file)
In DBMSs, the query processor accesses the index structures in
processing a query
Types of Indexes
Single-Level Indexes on
Unordered Files
The physical records are not ordered; the index
provides a logical order
Physical files are called entry-sequenced since the
order of physical records is that of entry order.
Append records to the physical file as they are inserted
Build an index (primary and/or secondary) on this file
Deletion/update of physical records requires
reorganization of the file and of the
primary index
Primary Index on Unordered Files
St. Id.  Name     Major  Yr.
10567    J. Doe   CS     3
75623    J. Wong  BA     3
Operations
Record addition
Append the record to the end; Insert a record to the
appropriate place in the index
Requires reorganization of the index
Record deletion
Delete the physical record using any feasible technique
Delete the index record and reorganize
Record updates
If key field is affected, treat as delete/add
If key field is not affected, no problem.
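The three operations above can be sketched in Python as follows. This is a minimal illustration, not the slides' own implementation: the class name `EntrySequencedFile`, its attribute names, and the use of a tombstone (`None`) for deleted records are all assumptions. The sample records come from the student table shown earlier.

```python
import bisect

class EntrySequencedFile:
    """Unordered (entry-sequenced) data file plus a sorted in-memory primary index."""

    def __init__(self):
        self.records = []   # physical file: records kept in arrival order
        self.keys = []      # index: key values, kept sorted
        self.rrns = []      # index: record number matching each entry of self.keys

    def add(self, key, record):
        pos = bisect.bisect_left(self.keys, key)
        if pos < len(self.keys) and self.keys[pos] == key:
            raise KeyError(f"duplicate primary key {key}")
        self.records.append(record)                    # append to the end of the file
        self.keys.insert(pos, key)                     # index entry goes in sorted position
        self.rrns.insert(pos, len(self.records) - 1)

    def find(self, key):
        pos = bisect.bisect_left(self.keys, key)       # binary search on the index
        if pos < len(self.keys) and self.keys[pos] == key:
            return self.records[self.rrns[pos]]
        return None

    def delete(self, key):
        pos = bisect.bisect_left(self.keys, key)
        if pos == len(self.keys) or self.keys[pos] != key:
            return False
        self.records[self.rrns[pos]] = None            # assumption: mark the slot as deleted
        del self.keys[pos]                             # remove the index entry
        del self.rrns[pos]
        return True

    def update(self, key, record, new_key=None):
        if new_key is None or new_key == key:
            pos = bisect.bisect_left(self.keys, key)
            self.records[self.rrns[pos]] = record      # key unchanged: index untouched
        else:
            self.delete(key)                           # key changed: treat as delete/add
            self.add(new_key, record)

f = EntrySequencedFile()
f.add(75623, ("J. Wong", "BA", 3))
f.add(10567, ("J. Doe", "CS", 3))
```

After these two additions the physical file is in entry order (Wong, then Doe) while the index lists the keys in sorted order, which is the separation of physical and logical organization described at the start.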
Primary Index on Ordered Files
10567 J. Doe CS 3
11589 T. Allen BA 2
15973 M. Smith CS 3
Clustering Index on Ordered Files
Clustering index (on Major)   Data file (ordered by Major)
BA                            11589  T. Allen   BA  2
                              75623  J. Wong    BA  3
BS                            29579  B. Zimmer  BS  1
CS                            10567  J. Doe     CS  3
                              15973  M. Smith   CS  3
                              84920  S. Allen   CS  4
ME                            34596  T. Atkins  ME  4
                              96256  P. Wright  ME  2
There should be no primary index if a clustering index is
to exist.
Secondary Index
St. Id. Name Major Yr.
10567 J. Doe CS 3
Problems With Inverted Files
Binary Search Tree
kf
fb sd
cl hn ps ws
ax de ft jd nr rf tk yj
Linked representation (child fields hold keys):
KEY          kf  fb  sd  cl  hn  ps  ws  ax  de  ft  jd  nr  rf  tk  yj
Left-child   fb  cl  ps  ax  ft  nr  tk  -   -   -   -   -   -   -   -
Right-child  sd  hn  ws  de  jd  rf  yj  -   -   -   -   -   -   -   -
Array representation (child fields hold record numbers; -1 means no child):
Record no.   0   1   2   3   4   5   6   7   8   9   10  11  12  13  14
KEY          kf  fb  sd  cl  hn  ps  ws  ax  de  ft  jd  nr  rf  tk  yj
Left-child   1   3   5   7   9   11  13  -1  -1  -1  -1  -1  -1  -1  -1
Right-child  2   4   6   8   10  12  14  -1  -1  -1  -1  -1  -1  -1  -1
Insertion into BSTs
kf
fb sd
cl hn ps ws
ax de ft jd nr rf tk yj
lv
Record no.   0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15
KEY          kf  fb  sd  cl  hn  ps  ws  ax  de  ft  jd  nr  rf  tk  yj  lv
Left-child   1   3   5   7   9   11  13  -1  -1  -1  -1  15  -1  -1  -1  -1
Right-child  2   4   6   8   10  12  14  -1  -1  -1  -1  -1  -1  -1  -1  -1
Cost of insertion: O(log_2 N) for a balanced tree, the same as search
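The array representation can be exercised with a short Python sketch. The class name `ArrayBST` and its method names are illustrative assumptions; the keys are the ones from the example tree. New keys are appended at the next free record number, and only one child field of an existing record is changed.

```python
class ArrayBST:
    """Binary search tree stored as parallel arrays; -1 marks a missing child."""

    def __init__(self):
        self.key, self.left, self.right = [], [], []

    def insert(self, k):
        # The new record is appended at the end of the arrays.
        self.key.append(k)
        self.left.append(-1)
        self.right.append(-1)
        new = len(self.key) - 1
        node = 0
        while new != 0:
            # Equal keys are sent to the right; duplicates are not treated specially.
            links = self.left if k < self.key[node] else self.right
            if links[node] == -1:
                links[node] = new      # the only existing field that changes
                break
            node = links[node]
        return new

    def search(self, k):
        """Return (record number or -1, number of nodes visited)."""
        node = 0 if self.key else -1
        visited = 0
        while node != -1:
            visited += 1
            if k == self.key[node]:
                return node, visited
            node = self.left[node] if k < self.key[node] else self.right[node]
        return -1, visited

tree = ArrayBST()
for k in "kf fb sd cl hn ps ws ax de ft jd nr rf tk yj".split():
    tree.insert(k)
lv_record = tree.insert("lv")   # becomes the left child of nr (record 11)
```

Inserting `lv` follows the path kf, sd, ps, nr and then fills the empty left-child field of nr, so the search for the insertion point costs the same number of comparisons as a failed search.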
Binary Trees
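Advantage:
Sorting is avoided
Disadvantage:
Balance problem: search cost is O(log_2 N) only while the tree stays
balanced; keys inserted in sorted order (A, B, C) form a degenerate chain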
AVL Tree
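An AVL tree is a binary search tree such that the
heights of the two subtrees at any vertex differ by
at most one
[Figure: AVL tree built by inserting B, C, G, E, F, D, A in that order; before A is added, the root is E, with children C and F and leaves B, D, G.]
Height-balanced: there is a relationship between
the number of nodes and the height of the tree
In AVL trees, if there are N nodes and the height is h (a single node has height 0),
the following relationship holds
\[ \frac{5+2\sqrt{5}}{5}\left(\frac{1+\sqrt{5}}{2}\right)^{h} + \frac{5-2\sqrt{5}}{5}\left(\frac{1-\sqrt{5}}{2}\right)^{h} - 1 \;\le\; N \;\le\; 2^{h+1} - 1 \]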
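The bounds can be checked numerically. The sketch below assumes that a single node has height 0 and that the lower bound is the closed form (5+2√5)/5 · ((1+√5)/2)^h + (5−2√5)/5 · ((1−√5)/2)^h − 1; it compares that expression with the recurrence for the sparsest AVL tree, whose two subtrees have heights h−1 and h−2. The function names are illustrative.

```python
from math import sqrt

def min_avl_nodes(h):
    """Fewest nodes in an AVL tree of height h, via N(h) = N(h-1) + N(h-2) + 1."""
    a, b = 1, 2                      # N(0) = 1, N(1) = 2
    for _ in range(h):
        a, b = b, a + b + 1
    return a

def closed_form_lower(h):
    """Closed-form lower bound on the node count for height h."""
    s5 = sqrt(5)
    phi, psi = (1 + s5) / 2, (1 - s5) / 2
    return (5 + 2 * s5) / 5 * phi ** h + (5 - 2 * s5) / 5 * psi ** h - 1

def max_nodes(h):
    """A perfect binary tree of height h has 2^(h+1) - 1 nodes."""
    return 2 ** (h + 1) - 1

table = [(h, min_avl_nodes(h), round(closed_form_lower(h)), max_nodes(h))
         for h in range(6)]
```

Because the minimum node count grows exponentially with h, the height of an AVL tree with N nodes is O(log N), which is the search guarantee the balance condition buys.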
Features:
Height balancing guarantees a certain level of search
performance [O(log_2 N)]
Height balancing requires the tree to be adjusted by
rotations – in AVL trees, the rotations are localized
Performance
Search performance: O(log_2 N)
Update performance: O(log_2 N)
Solves problem #3 – index is not maintained in
sorted order.
Paged Binary Search Tree
Tackle the problem of high seek time but fast transfer
time by grouping index nodes into pages
Assuming you can store 7 index nodes on a page:
63 index nodes require two levels of pages;
4 levels of pages can accommodate 4,095 nodes (which would require
12 seeks with an unpaged binary search tree)
The basic idea
A balanced M-ary search tree such that both search and
maintenance can be done in O(log_M N)
Place a vertex and its descendants down to a fixed level
(log_2 M) into a single page (cluster) such that the search
on the whole subtree of log_2 M levels can be done in
one disk access. Therefore, the average search and
maintenance can be done in O(log_2 N / log_2 M) = O(log_M N)
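The page arithmetic above can be reproduced with a few lines of Python. This is a sketch under the assumption that each page holds a complete binary subtree (so 7 nodes per page means 3 binary levels per page); the function names are illustrative.

```python
def binary_levels(n_nodes):
    """Levels in a perfectly balanced binary tree holding n_nodes nodes."""
    return n_nodes.bit_length()          # equals ceil(log2(n_nodes + 1))

def page_levels(n_nodes, nodes_per_page):
    """Page accesses on a root-to-leaf path when each page holds a full subtree."""
    levels_per_page = (nodes_per_page + 1).bit_length() - 1   # 7 nodes -> 3 levels
    return -(-binary_levels(n_nodes) // levels_per_page)      # ceiling division

small = (binary_levels(63), page_levels(63, 7))       # 6 binary levels, 2 page levels
large = (binary_levels(4095), page_levels(4095, 7))   # 12 binary levels, 4 page levels
```

With 4,095 nodes the unpaged tree needs up to 12 seeks per search, while the paged layout needs 4, one per page level.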
Perspective
Assume
N = 20,000,000 records
M = 512 records.
Then log_M N = log_512 20,000,000 ≈ 2.69
This implies that it takes three disk accesses to retrieve
a record from a file of 20 million records.
Difficulties
Balancing
Maintenance cost
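The figure of roughly 2.69 page levels for 20 million records can be verified directly. The variable names below are illustrative; the values of N and M are the ones assumed on this slide.

```python
from math import ceil, log

N = 20_000_000        # records in the file
M = 512               # records (keys) per page

levels = log(N) / log(M)      # log base M of N, by change of base
accesses = ceil(levels)       # a partial level still costs a whole disk access
```

Rounding up is what turns the fractional depth into the three disk accesses quoted above.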
B-Trees
A B-tree of order m is a paged multiway search tree
such that
Each page contains a maximum of m-1 keys
Each page, except the root, contains at least ⌈m/2⌉ − 1 keys
Root has at least 2 descendants unless it is the only node
A non-leaf page with k keys has k+1 descendants
All the leaves appear at the same level
Build the tree bottom-up to maintain balance
Split & promotion for overflow during insertion
Redistribution & concatenation for underflow during
deletion
B-Tree Structure
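Interior node layout: P1 K1 D1 P2 K2 D2 P3 K3 D3 P4
Pi: tree pointer (to a child page)
Di: data pointer (to the record with key Ki)
Every key X in the subtree under P3 satisfies K2 < X < K3
Leaf node: all tree pointers are -1, e.g. -1 Ki -1 Kj -1 Kr -1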
B-Tree Properties
Insertion
Insert the key into an appropriate leaf page (by
search)
When overflow: split and promotion
Split the overflow page into two pages
Promote a key to a parent page
If the promotion in the previous step causes
additional overflow, then repeat the split-promotion
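The insertion steps above can be sketched as a small in-memory B-tree. This is an illustrative sketch rather than the slides' implementation: the names `Page`, `BTree`, `inorder`, and `leaf_depths` are assumptions, data pointers are omitted, duplicate keys are not handled, and the middle key of an overflowing page is the one promoted. The key sequence is borrowed from the insertion example in these slides, but the resulting page contents need not match the figures.

```python
import bisect

class Page:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []    # empty list for a leaf page

class BTree:
    def __init__(self, order):
        self.order = order                # a page holds at most order - 1 keys
        self.root = Page()

    def insert(self, key):
        promoted = self._insert(self.root, key)
        if promoted:                      # the root split: the tree grows one level
            mid, right = promoted
            self.root = Page([mid], [self.root, right])

    def _insert(self, page, key):
        pos = bisect.bisect_left(page.keys, key)
        if page.children:                 # interior page: descend to find the leaf
            promoted = self._insert(page.children[pos], key)
            if promoted is None:
                return None
            mid, right = promoted         # absorb the key promoted from below
            page.keys.insert(pos, mid)
            page.children.insert(pos + 1, right)
        else:
            page.keys.insert(pos, key)    # leaf page: insert in sorted position
        if len(page.keys) < self.order:
            return None                   # no overflow
        return self._split(page)

    def _split(self, page):
        m = len(page.keys) // 2
        mid = page.keys[m]                # key promoted to the parent
        right = Page(page.keys[m + 1:], page.children[m + 1:])
        page.keys = page.keys[:m]
        page.children = page.children[:m + 1]
        return mid, right

def inorder(page):
    """Keys of the subtree rooted at page, in sorted order."""
    if not page.children:
        return list(page.keys)
    out = []
    for i, child in enumerate(page.children):
        out.extend(inorder(child))
        if i < len(page.keys):
            out.append(page.keys[i])
    return out

def leaf_depths(page, depth=1):
    """Set of depths at which leaf pages occur (one element if balanced)."""
    if not page.children:
        return {depth}
    return set().union(*(leaf_depths(c, depth + 1) for c in page.children))

tree = BTree(order=4)
for k in "CDSTAMPIBWNGURKE":
    tree.insert(k)
```

Because a split only ever pushes a key upward and a new level is created only at the root, every leaf stays at the same depth, which `leaf_depths` can confirm.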
Insertion Example
[Figure: insertion sequence into a B-tree whose pages are numbered in order of creation. Recoverable panels: (a) a single leaf page holding C D S; (b) insert T, which splits the page and creates root page 2; (c) insert A; (f) insert N, after which the root holds D N S over leaf pages 0, 3, 4, 1; (g) insert G, U, R; then insert K, which adds leaf page 5, overflows root page 2, and ends with N promoted into a new root (page 7) above pages 2 and 6; (h) insert E. The exact key contents of each page are not recoverable from the extracted text.]
Operations
Search
Recursively traverse the tree
What is the search performance?
What is the depth of the B-tree?
B-tree of order m with N keys and depth d
Best case: maximum number of descendants at each node
\[ N = m^{d} - 1 \]
Worst case: minimal number of descendants at each node
\[ N = 2 \left\lceil \frac{m}{2} \right\rceil^{d-1} - 1 \]
Theorem
\[ \log_{m}(N+1) \;\le\; d \;\le\; \log_{\lceil m/2 \rceil}\!\left(\frac{N+1}{2}\right) + 1 \]
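The depth bounds can be evaluated for concrete values. The sketch assumes that the fullest tree of depth d holds m^d − 1 keys and the sparsest holds 2⌈m/2⌉^(d−1) − 1 keys, which is what the N + 1 terms in the theorem imply, and that the order is at least 3. The function names and the sample values N = 1,000,000 and m = 512 are illustrative.

```python
from math import log

def max_keys(order, depth):
    """Keys in a B-tree of the given depth when every page is full."""
    return order ** depth - 1

def min_keys(order, depth):
    """Keys when the root has 1 key and every other page is at minimum fill."""
    half = (order + 1) // 2              # ceil(order / 2)
    return 2 * half ** (depth - 1) - 1

def depth_bounds(n_keys, order):
    """(lower, upper) bounds on depth d for a B-tree with n_keys keys."""
    half = (order + 1) // 2
    lower = log(n_keys + 1, order)
    upper = log((n_keys + 1) / 2, half) + 1
    return lower, upper

lower, upper = depth_bounds(1_000_000, 512)
```

For these sample values the two bounds bracket a single integer, so such a tree must have depth 3 regardless of how full its pages are.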
Deletions
Search B-tree to find the key to be deleted
Swap the key with its immediate successor, if the key is
not in a leaf page
Note only keys in a leaf may be deleted
When underflow: redistribution or concatenation
Redistribute keys among an adjacent sibling page, the parent
page, and the underflow page if possible (need a rich sibling)
Otherwise, concatenate with an adjacent page, demoting a key
from the parent page to the newly formed page.
If the demotion causes underflow, repeat
redistribution-concatenation
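The two underflow remedies can be sketched on plain key lists. The function names are illustrative assumptions, pages are modeled as Python lists, and child pointers are ignored (so the sketch applies to leaf pages). The sample keys follow the redistribution and concatenation examples in these slides.

```python
def redistribute(left, separator, right):
    """Even out two sibling pages through the parent's separator key.

    Returns (new_left, new_separator, new_right).
    """
    pool = left + [separator] + right     # all keys involved, already in order
    mid = len(pool) // 2
    return pool[:mid], pool[mid], pool[mid + 1:]

def concatenate(left, separator, right):
    """Merge two sibling pages, demoting the separator from the parent."""
    return left + [separator] + right

# Deleting R leaves page 7 with only S; its sibling V W X Y Z is "rich".
page7, parent_key, page8 = redistribute(["S"], "U", ["V", "W", "X", "Y", "Z"])

# Deleting A leaves page 3 with only C; its sibling E F has nothing to spare.
merged = concatenate(["C"], "D", ["E", "F"])
```

Redistribution leaves the number of pages unchanged and only replaces one key in the parent, whereas concatenation removes a key from the parent and can therefore cause the parent itself to underflow.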
Deletion Example – Simple
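[Figure: (a) Delete J. Root page 0; interior pages 1 (D H) and 2 (Q U); leaf pages 3 to 8 hold A C, E F, I J K, N O P, R S, V W X Y Z. J is removed from leaf page 5, which still holds enough keys (I K), so nothing else changes.]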
Deletion Example – Exchange
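[Figure: (a) Delete M. M sits in the root (page 0), so it is exchanged with its immediate successor N from leaf page 6; M is then deleted from the leaf, leaving O P there and N in the root.]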
Deletion Example – Redistribution
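[Figure: (a) Delete R. Leaf page 7 is left with only S and underflows. Its sibling page 8 (V W X Y Z) has keys to spare, so keys are redistributed through parent page 2: U moves down into page 7 and W moves up, giving Q W in page 2, S U V in page 7, and X Y Z in page 8.]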
Deletion Example – Concatenation
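[Figure: (a) Delete A. Leaf page 3 is left with only C and underflows; its sibling page 4 (E F) has no keys to spare. Pages 3 and 4 are concatenated with D demoted from parent page 1, forming C D E F.]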
Deletion Example – Propagation
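[Figure: (a) Delete A (final form). Demoting D leaves page 1 with only H, so the underflow propagates upward. Page 1 is concatenated with its sibling page 2 (Q W) and the key N demoted from the root, forming a new root H N Q W over leaf pages 3, 5, 6, 7, 8 (C D E F, I K, O P, S U V, X Y Z).]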
Improvements
Redistribution during insertion
A way of avoiding, or at least postponing, the creation of a
new page by redistributing overflow keys into sibling
pages
Improves space utilization from 67% to 86%
B*-trees
If redistribution takes place during insertion, then at the time of
overflow at least one other sibling is full
Two-to-three split
Distribute all keys in two full pages into three sibling pages evenly
Each page contains at least ⌊(2m − 1)/3⌋ keys
Special handling of the root
Virtual B-trees
B-trees that use RAM page buffers
Buffer strategies
Keep the root page
Retain the pages of higher levels
LRU (the least recently used page in the buffer is replaced
by a new page)
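The LRU strategy can be sketched with `collections.OrderedDict`. The class name `PageBuffer`, the `read_page` callback, and the `disk_reads` counter are illustrative assumptions; pinning the root page or the upper levels, as the other strategies suggest, is not modeled here.

```python
from collections import OrderedDict

class PageBuffer:
    """Fixed-size page buffer that evicts the least recently used page."""

    def __init__(self, capacity, read_page):
        self.capacity = capacity
        self.read_page = read_page        # callable: page number -> page contents
        self.pages = OrderedDict()        # ordered from least to most recently used
        self.disk_reads = 0

    def get(self, page_no):
        if page_no in self.pages:
            self.pages.move_to_end(page_no)       # hit: mark as most recently used
            return self.pages[page_no]
        self.disk_reads += 1                      # miss: fetch the page from disk
        page = self.read_page(page_no)
        self.pages[page_no] = page
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)        # evict the least recently used page
        return page

buf = PageBuffer(capacity=2, read_page=lambda n: f"page-{n}")
for n in (0, 1, 0, 2, 1):
    buf.get(n)
```

In this access pattern the second request for page 0 is served from the buffer, while page 1 is evicted before it is requested again, so four of the five requests reach the disk.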
Indexed Sequential Access
Primary problem:
Efficient sequential access and indexed search
(dual mode applications)
Possible solutions:
Sorted files:
good for sequential accesses
unacceptable performance for random access
maintenance costs too high
B-trees:
good for indexed search
very slow for sequential accesses (tree traversal)
maintenance costs low
B+ trees: a file with a B-tree structure + a sequence set
Sequence Sets
Maintenance of Sequence Sets
Changes to blocks
Goal: keep blocks at least half full
Accommodates variable length records
Example
head is B2
Block  Key range        Next block
B1     camp – dutton    B6
B2     adams – berne    B4
B3     faber – folk     B5
B4     bolen – cage     B1
B5     folks – gaddis   -1
B6     embry – evans    B3
Separators shown in the index set: bo, cam, f, folks
Logical order of the blocks: adams – berne, bolen – cage, camp – dutton, embry – evans, faber – folk, folks – gaddis
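The example can be traced in Python: follow the block links from the head to recover the logical order, then compute a shortest separator for each block boundary. This is a sketch under assumptions: the dictionary layout and the function names are illustrative, `None` stands in for the -1 end marker, and plain string comparison is used as the key order.

```python
# block name -> (first key, last key, next block in logical order)
blocks = {
    "B1": ("camp", "dutton", "B6"),
    "B2": ("adams", "berne", "B4"),
    "B3": ("faber", "folk", "B5"),
    "B4": ("bolen", "cage", "B1"),
    "B5": ("folks", "gaddis", None),
    "B6": ("embry", "evans", "B3"),
}

def logical_order(blocks, head):
    """Follow the next-block links starting from the head block."""
    order, current = [], head
    while current is not None:
        order.append(current)
        current = blocks[current][2]
    return order

def shortest_separator(last_of_left, first_of_right):
    """Shortest prefix of first_of_right that sorts after last_of_left."""
    for i in range(1, len(first_of_right) + 1):
        prefix = first_of_right[:i]
        if prefix > last_of_left:
            return prefix
    return first_of_right

order = logical_order(blocks, "B2")
separators = [
    shortest_separator(blocks[a][1], blocks[b][0])
    for a, b in zip(order, order[1:])
]
```

The computed separators include bo, cam, f, and folks, matching the ones visible in the example; the sketch also produces e for the dutton/embry boundary, which does not appear in the extracted slide text.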
Maintenance of B+ Trees
Index Set Maintenance
Index set block size
Index node size = sequence set physical block size
Block size that is best for sequence sets is also best for index
set
Easier to build virtual simple prefix B+ trees
Index set blocks and sequence set blocks are intermingled in
the same file
Variable-order B-tree
Separators are variable length; Can store variable
number of them in a block
How to identify begin-end points of separators for
binary search?
Index Node Structure
Separator count
Total length of separators
index to separators
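One possible reading of this node layout is sketched below: the separators are stored back to back, and a table of starting offsets (the "index to separators") marks where each one begins, which is what makes binary search over variable-length separators possible. The class name `IndexNode`, its attribute names, and the sample separators are illustrative assumptions; child pointers are reduced to a child position.

```python
class IndexNode:
    """Index set block holding variable-length separators."""

    def __init__(self, separators):
        seps = sorted(separators)
        self.count = len(seps)                 # separator count
        self.text = "".join(seps)              # separators stored back to back
        self.total_length = len(self.text)     # total length of separators
        self.offsets = []                      # index to separators (start positions)
        pos = 0
        for s in seps:
            self.offsets.append(pos)
            pos += len(s)

    def separator(self, i):
        """Recover separator i from its start offset and the next one."""
        end = self.offsets[i + 1] if i + 1 < self.count else self.total_length
        return self.text[self.offsets[i]:end]

    def child_for(self, key):
        """Binary search: position of the child to follow for key.

        Keys smaller than separator i go to child i; keys greater than or
        equal to it go further right.
        """
        lo, hi = 0, self.count
        while lo < hi:
            mid = (lo + hi) // 2
            if self.separator(mid) <= key:
                lo = mid + 1
            else:
                hi = mid
        return lo

node = IndexNode(["bo", "cam", "f", "folks"])
```

Here the stored text is `bocamffolks`, and the offset table is what lets `separator(i)` find the begin and end points of each entry without scanning.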
Building a Simple Prefix B+ Tree
B+ Trees
Differences Between
B-tree and B+ Tree
Node information content
In B-trees, all the pages contain the keys and
information (or a pointer to it)
In the B+ tree, the keys and information are contained in
the sequence set
Tree structure
B+ tree is usually shallower than a B-tree
Access speed
Ordered sequential access is faster in B+ trees