
Indexing

 Mechanism for
 Separating logical organization of files from their physical
organization
 The physical file may be unorganized, but there may be logically
ordered indexes for accessing it
 Conducting efficient searches on the files
 Indexes hold information about location of records with specific
values
 Index structures (hopefully) fit in main memory so index searching is
fast
 Indexing is an issue for both file systems and DBMSs
 Remember that databases are eventually mapped to files (e.g., each
relation stored in one file)
 In DBMSs, the query processor accesses the index structures in
processing a query
Types of Indexes

 Indexes on ordered vs unordered files


 Dense vs sparse indexes
 Primary indexes vs secondary indexes
 Single-level vs multi-level
 Single-level ones allow binary search on the index table
and access to the physical records/blocks directly from
the index table
 Multi-level ones are tree-structured and require a more
elaborate search algorithm

Single-Level Indexes on
Unordered Files
 The physical records are not ordered; the index
provides a logical order
 Physical files are called entry-sequenced since the
order of physical records is that of entry order.
 Append records to the physical file as they are inserted
 Build an index (primary and/or secondary) on this file
 Deletion or update of physical records requires
reorganization of the file and of the
primary index

Primary Index on Unordered Files
Index (sorted by St. Id.): 10567, 11589, 15973, 29579, 34596, 75623, 84920, 96256

Physical file (entry order):
St. Id.  Name       Major  Yr.
10567    J. Doe     CS     3
15973    M. Smith   CS     3
96256    P. Wright  ME     2
29579    B. Zimmer  BS     1
11589    T. Allen   BA     2
84920    S. Allen   CS     4
34596    T. Atkins  ME     4
75623    J. Wong    BA     3
Operations
 Record addition
 Append the record to the end; Insert a record to the
appropriate place in the index
 Requires reorganization of the index
 Record deletion
 Delete the physical record using any feasible technique
 Delete the index record and reorganize
 Record updates
 If key field is affected, treat as delete/add
 If key field is not affected, no problem.
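The add and delete operations above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the class name `EntrySequencedFile` and its methods are hypothetical, keys are assumed unique, and the "file" is an in-memory list.

```python
import bisect

class EntrySequencedFile:
    """Records are appended in arrival order; a sorted index maps key -> slot."""

    def __init__(self):
        self.records = []   # physical file: one slot per record, None = deleted
        self.keys = []      # index keys, kept sorted so binary search works
        self.slots = []     # slots[i] is the record position for keys[i]

    def add(self, key, record):
        self.records.append(record)              # append to the end of the file
        i = bisect.bisect_left(self.keys, key)   # position in the sorted index
        self.keys.insert(i, key)                 # shifting entries is the "reorganization"
        self.slots.insert(i, len(self.records) - 1)

    def find(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.records[self.slots[i]]
        return None

    def delete(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i == len(self.keys) or self.keys[i] != key:
            return False
        self.records[self.slots[i]] = None       # mark the physical slot as deleted
        del self.keys[i], self.slots[i]          # remove the index entry
        return True

f = EntrySequencedFile()
for key, rec in [(10567, "J. Doe"), (15973, "M. Smith"), (96256, "P. Wright"),
                 (29579, "B. Zimmer"), (11589, "T. Allen")]:
    f.add(key, rec)
```

An update that changes the key would be a `delete` followed by an `add`; an update that leaves the key alone only touches `records`.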

Primary Index on Ordered Files

 Physical records may be kept ordered on the primary key
 The index is ordered but only one index record for
each block
 Reduces the index requirement, enabling binary
search over the values (without having to read all
of the file to perform binary search).
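A lookup through a one-entry-per-block index might look like the sketch below. The block size of three records and the names `blocks`, `index` and `lookup` are assumptions made for illustration; the index stores the first key of each block.

```python
import bisect

# Ordered file split into blocks (three records per block is an assumption).
blocks = [
    [(10567, "J. Doe"), (11589, "T. Allen"), (15973, "M. Smith")],
    [(29579, "B. Zimmer"), (34596, "T. Atkins"), (75623, "J. Wong")],
    [(84920, "S. Allen"), (96256, "P. Wright")],
]
# Sparse index: one entry per block, holding the block's first key.
index = [block[0][0] for block in blocks]

def lookup(key):
    # Binary search the small index for the last block whose first key <= key.
    b = bisect.bisect_right(index, key) - 1
    if b < 0:
        return None
    for k, name in blocks[b]:      # a single block read, then a scan inside it
        if k == key:
            return name
    return None
```

Only the three index entries are searched; the data file itself is touched once per lookup.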

Primary Index on Ordered Files
Index (one entry per block): 10567, 29579, 84920

Physical file (ordered by St. Id.):
St. Id.  Name       Major  Yr.
10567    J. Doe     CS     3
11589    T. Allen   BA     2
15973    M. Smith   CS     3
29579    B. Zimmer  BS     1
34596    T. Atkins  ME     4
75623    J. Wong    BA     3
84920    S. Allen   CS     4
96256    P. Wright  ME     2

Also called inverted file index.
Clustering Index on Ordered Files
Clustering index (on Major): BA, BS, CS, ME

Physical file (ordered by Major):
St. Id.  Name       Major  Yr.
11589    T. Allen   BA     2
75623    J. Wong    BA     3
29579    B. Zimmer  BS     1
10567    J. Doe     CS     3
15973    M. Smith   CS     3
84920    S. Allen   CS     4
34596    T. Atkins  ME     4
96256    P. Wright  ME     2

There should be no primary index if a clustering index is to exist.
Secondary Index

 In addition to the primary index, establish indexes
on non-key attributes to facilitate faster access
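 Secondary indexes typically point to the primary
index rather than directly to the physical records
 Advantage:
 Record deletion and update cause less work (only the primary index tracks record locations)
 Disadvantage:
 Less efficient (every access also goes through the primary index)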
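A small sketch of a secondary index that stores primary keys instead of record positions. The names `primary`, `by_major` and `find_by_major` are hypothetical, and both indexes are plain in-memory dictionaries.

```python
from collections import defaultdict

# Entry-sequenced file: (St. Id., Name, Major)
records = [(10567, "J. Doe", "CS"), (15973, "M. Smith", "CS"), (96256, "P. Wright", "ME"),
           (29579, "B. Zimmer", "BS"), (11589, "T. Allen", "BA")]

primary = {}                    # primary key -> position in the file
by_major = defaultdict(list)    # secondary key -> list of primary keys
for pos, (sid, name, major) in enumerate(records):
    primary[sid] = pos
    by_major[major].append(sid)

def find_by_major(major):
    # Two steps: secondary index -> primary keys, primary index -> positions.
    return [records[primary[sid]][1] for sid in by_major.get(major, [])]
```

If a record moves within the file, only `primary` needs updating; the price is the extra lookup on every secondary-key access.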

Secondary Index
Secondary indexes: on Major (BA, BS, CS, ME) and on Yr. (1, 2, 3, 4)
Primary index (sorted by St. Id.): 10567, 11589, 15973, 29579, 34596, 75623, 84920, 96256

Physical file (entry order):
St. Id.  Name       Major  Yr.
10567    J. Doe     CS     3
15973    M. Smith   CS     3
96256    P. Wright  ME     2
29579    B. Zimmer  BS     1
11589    T. Allen   BA     2
84920    S. Allen   CS     4
34596    T. Atkins  ME     4
75623    J. Wong    BA     3
Problems With Inverted Files

 Inverted file indexes cannot be maintained in main
memory as the database sizes and number of
keywords increase.
 Binary search over indexes that are stored in
secondary storage is expensive.
 Too many seeks and I/Os.
 Maintaining the indexes in sorted key order is
difficult and expensive.

Binary Search Tree

 Binary search is effective (for in-memory
searches) but the sorting is too expensive.
 Sorting is not necessary for binary search.
 The purpose of sorting is to find the central item in a
part of the list, and this can be done by indexing.
 This can be accomplished by building a binary tree of
the keys such that left child of a node has a smaller
value and the right child has a larger value.

Binary Search Tree

 A binary tree is a tree such that
 Each child of a vertex is distinguished as either a left
child or a right child, and
 No vertex has more than one left (right) child.
 A binary search tree for a set S is a labeled binary
tree such that
 For each vertex u in the left subtree of v, u < v ,
 For each vertex u in the right subtree of v , u > v , and
 For each element a in S, there exists exactly one vertex
v in the tree such that a = v.
Binary Search Tree

 Consider the following set of keys
ax cl de fb ft hn jd kf nr ps rf sd tk ws yj

              kf
      fb              sd
  cl      hn      ps      ws
ax  de  ft  jd  nr  rf  tk  yj
Binary Search Tree
Links by key:
KEY           kf  fb  sd  cl  hn  ps  ws  ax  de  ft  jd  nr  rf  tk  yj
Left-child    fb  cl  ps  ax  ft  nr  tk   -   -   -   -   -   -   -   -
Right-child   sd  hn  ws  de  jd  rf  yj   -   -   -   -   -   -   -   -

Links by array index (-1 means no child):
Index          0   1   2   3   4   5   6   7   8   9  10  11  12  13  14
KEY           kf  fb  sd  cl  hn  ps  ws  ax  de  ft  jd  nr  rf  tk  yj
Left-child     1   3   5   7   9  11  13  -1  -1  -1  -1  -1  -1  -1  -1
Right-child    2   4   6   8  10  12  14  -1  -1  -1  -1  -1  -1  -1  -1
Insertion into BSTs
Inserting lv: lv > kf, lv < sd, lv < ps, lv < nr, so lv becomes the left child of nr (index 11) and is stored at index 15.

Index          0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
KEY           kf  fb  sd  cl  hn  ps  ws  ax  de  ft  jd  nr  rf  tk  yj  lv
Left-child     1   3   5   7   9  11  13  -1  -1  -1  -1  15  -1  -1  -1  -1
Right-child    2   4   6   8  10  12  14  -1  -1  -1  -1  -1  -1  -1  -1  -1

Cost of insertion: log2 N, the same as search
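The array representation of the tree can be exercised with a short sketch. The function names `search` and `insert` are illustrative; the three parallel lists mirror the KEY, Left-child and Right-child rows, with -1 meaning "no child".

```python
keys = ["kf", "fb", "sd", "cl", "hn", "ps", "ws",
        "ax", "de", "ft", "jd", "nr", "rf", "tk", "yj"]
left = [1, 3, 5, 7, 9, 11, 13] + [-1] * 8
right = [2, 4, 6, 8, 10, 12, 14] + [-1] * 8

def search(key):
    """Return the index of key, or -1 if it is absent."""
    node = 0
    while node != -1 and keys[node] != key:
        node = left[node] if key < keys[node] else right[node]
    return node

def insert(key):
    """Append key as a new leaf, link it from its parent, return its index."""
    node = 0
    while True:
        if key == keys[node]:
            return node                        # already present
        links = left if key < keys[node] else right
        if links[node] == -1:                  # free link found: attach here
            keys.append(key)
            left.append(-1)
            right.append(-1)
            links[node] = len(keys) - 1
            return links[node]
        node = links[node]

new_index = insert("lv")
```

No existing entry moves: the new key is appended at the end of the arrays and one link in its parent is updated, which is why sorting is avoided.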
Binary Trees

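 Advantage:
 Sorting is avoided
 Disadvantage:
 Balance problem: keys that arrive in sorted order (e.g., A, B, C) produce a chain rather than a balanced tree
 The log2 N search cost holds only while the tree stays balanced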
AVL Tree
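 An AVL tree is a binary search tree such that the
heights of the two subtrees at any vertex differ by
at most one
[Figure: AVL tree built by inserting the keys B, C, G, E, F, D, A in that order. Root E; E has children C and F; C has children B and D; F has right child G; B has left child A]
 Height-balanced: there is a relationship between
the number of nodes and the height of the tree
 In AVL trees, if there are N nodes and the height is h
(counted in edges, so a single node has h = 0),
the following relationship holds
$$\frac{5+2\sqrt{5}}{5}\left(\frac{1+\sqrt{5}}{2}\right)^{h} + \frac{5-2\sqrt{5}}{5}\left(\frac{1-\sqrt{5}}{2}\right)^{h} - 1 \le N \le 2^{h+1} - 1$$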
AVL Tree
 Features:
 Height balancing guarantees a certain level of search
performance [O(log2N)]
 Height balancing requires tree to be adjusted by
rotations – in AVL trees, the rotations are localized
 Performance
 Search performance: O(log2N)
 Update performance: O(log2N)
 Solves problem #3 (maintaining the index in sorted key order
is difficult and expensive): the index is not kept in sorted order.
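The O(log2 N) guarantee follows from how sparse an AVL tree of a given height can be. The sketch below is a numerical check rather than a proof: it assumes height is counted in edges (a single node has height 0) and compares the recurrence for the sparsest AVL tree with a Fibonacci-style closed form. The function names are illustrative.

```python
import math

def min_nodes(h):
    """Fewest nodes in an AVL tree of height h: N(h) = N(h-1) + N(h-2) + 1."""
    a, b = 1, 2                  # sparsest trees of height 0 and 1
    for _ in range(h):
        a, b = b, a + b + 1
    return a

def closed_form(h):
    """Closed form of the same quantity, using the golden ratio."""
    s = math.sqrt(5)
    phi, psi = (1 + s) / 2, (1 - s) / 2
    return (5 + 2 * s) / 5 * phi ** h + (5 - 2 * s) / 5 * psi ** h - 1

def max_nodes(h):
    return 2 ** (h + 1) - 1      # a completely full binary tree
```

Because the minimum node count grows exponentially in h, the height of an AVL tree with N nodes is bounded by a constant multiple of log2 N.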

Paged Binary Search Tree
 Tackle the problem of high seek time but fast transfer
time by collecting index nodes on pages
 Assuming you can store 7 index nodes on a page:
 63 index nodes require two levels of pages;
 4 levels can accommodate 4,095 nodes (which would require
12 seeks in the case of binary search)
 The basic idea
 A balanced M-ary search tree such that both search and
maintenance can be done at O(log_M N)
 Place a vertex and its descendants down to a fixed level
(log2 M) into a single page (cluster) such that the search
on the whole subtree of log2 M levels can be done in
one disk access. Therefore, the average search and
maintenance can be done at O(log2 N / log2 M) = O(log_M N)
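The page arithmetic can be checked with a few lines. The helper names are hypothetical; the only assumption is that a page holding k keys has k + 1 child pages, so a 7-key page (three binary levels) fans out to 8 pages.

```python
def nodes_in_paged_tree(keys_per_page, page_levels):
    """Total index nodes held by a full paged tree with the given number of page levels."""
    fanout = keys_per_page + 1                        # child pages per page
    pages = sum(fanout ** level for level in range(page_levels))
    return pages * keys_per_page

def binary_levels(n):
    """Levels in a balanced binary tree with n nodes (one seek per level if unpaged)."""
    return n.bit_length()
```

With 7 keys per page, two page levels give 9 pages and four page levels give 585 pages; an unpaged binary tree over the same 4,095 nodes is 12 levels deep.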
Paged Binary Search Tree
 Perspective
 Assume
 N = 20,000,000 records
 M = 512 records per page
 Then log_M N = log_512 20,000,000 ≈ 2.69
 This implies that it takes three disk accesses to retrieve
a record from a file of 20 million records.
 Difficulties
 Balancing
 Maintenance cost
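The figure of three disk accesses can be reproduced directly. The variable names are illustrative; the binary-tree comparison is added only to show the contrast with an unpaged tree.

```python
import math

N = 20_000_000      # records in the file
M = 512             # fan-out of the paged tree

levels = math.log(N, M)                        # log base M of N
accesses = math.ceil(levels)                   # whole page levels to visit
binary_accesses = math.ceil(math.log2(N))      # one node per access, unpaged
```

Rounding `levels` up gives the number of page reads per lookup, against roughly 25 node visits for a balanced binary tree of the same size.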

B-Trees
 A B-tree of order m is a paged binary search tree
such that
 Each page contains a maximum of m-1 keys
 Each page, except the root, contains at least ⌈m/2⌉ − 1 keys
 Root has at least 2 descendants unless it is the only node
 A non-leaf page with k keys has k+1 descendants
 All the leaves appear at the same level
 Build the tree bottom-up to maintain balance
 Split & promotion for overflow during insertion
 Redistribution & concatenation for underflow during
deletion
B-Tree Structure

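Internal node layout: P1 K1 D1 P2 K2 D2 P3 K3 D3 P4
 Pi: tree pointer (to a child page)
 Ki: key
 Di: data pointer (to the record with key Ki)
 Every key X in the subtree under P3 satisfies K2 < X < K3
Leaf node layout: -1 Ki -1 Kj -1 Kr -1 (tree pointers are null)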
B-Tree Properties

Given a B-tree of order m
 Root page
 1 ≤ keys ≤ m − 1
 2 ≤ descendents ≤ m
 Other pages
 ⌈m/2⌉ − 1 ≤ keys ≤ m − 1
 ⌈m/2⌉ ≤ descendents ≤ m
 K1 < K2 < … < Km-1
 Leaf pages
 all at the same level; tree pointers are null
Operations

 Insertion
 Insert the key into an appropriate leaf page (by
search)
 When overflow: split and promotion
 Split the overflow page into two pages
 Promote a key to a parent page
 If the promotion in the previous step causes
additional overflow, then repeat the split-
promotion
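The split-and-promote steps can be sketched in a few dozen lines. This is a simplified in-memory illustration, not a disk-based implementation: the `Page` and `BTree` classes are hypothetical, a page is allowed to overflow by one key before it is split, and the key at position `order // 2` is the one promoted (an assumption about which key goes up).

```python
import bisect

class Page:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []   # empty list for a leaf page

class BTree:
    def __init__(self, order):
        self.order = order               # a page holds at most order - 1 keys
        self.root = Page()

    def insert(self, key):
        promoted = self._insert(self.root, key)
        if promoted is not None:         # the root itself split: grow a new root
            mid_key, right = promoted
            self.root = Page([mid_key], [self.root, right])

    def _insert(self, page, key):
        i = bisect.bisect_left(page.keys, key)
        if i < len(page.keys) and page.keys[i] == key:
            return None                  # key already present
        if not page.children:            # leaf page: place the key here
            page.keys.insert(i, key)
        else:
            promoted = self._insert(page.children[i], key)
            if promoted is None:
                return None
            mid_key, right = promoted    # child split: absorb the promoted key
            page.keys.insert(i, mid_key)
            page.children.insert(i + 1, right)
        if len(page.keys) < self.order:
            return None                  # no overflow
        return self._split(page)

    def _split(self, page):
        mid = self.order // 2
        mid_key = page.keys[mid]                          # key to promote
        right = Page(page.keys[mid + 1:], page.children[mid + 1:])
        page.keys = page.keys[:mid]                       # left half stays in place
        page.children = page.children[:mid + 1]
        return mid_key, right

tree = BTree(4)
for k in "CSDTAMPIBWNGURKE":
    tree.insert(k)
```

The recursion returns the promoted key and the new right page to the caller, so a chain of overflows repeats the same split step one level higher each time, ending with a new root if the old root splits.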

Insertion Example
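Pages hold at most three keys; a fourth key forces a split. Pages are numbered in order of creation.
(a) Initial tree: page 0 = [C D S]
(b) Insert T: page 0 overflows and splits into page 0 = [C D] and page 1 = [T]; S is promoted to a new root, page 2 = [S]
(c) Insert A: page 0 = [A C D]
(d) Insert M: page 0 overflows and splits into page 0 = [A C] and page 3 = [M]; D is promoted, root page 2 = [D S]
(e) Insert P, I, B, W: page 0 = [A B C], page 3 = [I M P], page 1 = [T W]
(f) Insert N: page 3 overflows and splits into page 3 = [I M] and page 4 = [P]; N is promoted, root page 2 = [D N S]
(g) Insert G, U, R: page 0 = [A B C], page 3 = [G I M], page 4 = [P R], page 1 = [T U W]
(h) Insert K: page 3 overflows and splits into page 3 = [G I] and page 5 = [M]; K is promoted. Page 2 then holds D K N S and overflows as well; it splits into page 2 = [D K] and page 6 = [S], and N is promoted to a new root, page 7 = [N]
(i) Insert E: page 3 = [E G I]
Final tree: root page 7 = [N]; page 2 = [D K] over pages 0, 3, 5; page 6 = [S] over pages 4, 1
Leaves, left to right: [A B C] [E G I] [M] [P R] [T U W]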
Operations
 Search
 Recursively traverse the tree
 What is the search performance?
 What is the depth of the B-tree?
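 B-tree of order m with N keys and depth d (a tree with N keys has N + 1 null descendents below its leaves)
 Best case: maximum number of descendents at each node
$$N + 1 = m^{d}$$
 Worst case: minimal number of descendents at each node
$$N + 1 = 2 \times \left\lceil \frac{m}{2} \right\rceil^{d-1}$$
 Theorem
$$\log_{m}(N+1) \le d \le \log_{\lceil m/2 \rceil}\left(\frac{N+1}{2}\right) + 1$$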
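The depth bounds can be evaluated for concrete values. The sketch below is illustrative: it uses integer arithmetic instead of logarithms to avoid floating-point rounding at exact powers, and the function name `depth_bounds` is hypothetical.

```python
def depth_bounds(m, n):
    """Smallest and largest possible depth of a B-tree of order m with n keys (n >= 1)."""
    half = -(-m // 2)                  # ceil(m / 2)
    lo = 1
    while m ** lo < n + 1:             # smallest d with m^d >= n + 1
        lo += 1
    hi = 1
    while 2 * half ** hi <= n + 1:     # largest d with 2 * half^(d-1) <= n + 1
        hi += 1
    return lo, hi

bounds_large = depth_bounds(512, 1_000_000)
bounds_small = depth_bounds(4, 16)
```

For order 512 and a million keys both bounds coincide, so the depth is fixed regardless of insertion order; for small orders the gap between best and worst case is visible.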

Operations
 Deletions
 Search B-tree to find the key to be deleted
 Swap the key with its immediate successor, if the key is
not in a leaf page
 Note only keys in a leaf may be deleted
 When underflow: redistribution or concatenation
 Redistribute keys among an adjacent sibling page, the parent
page, and the underflow page if possible (need a rich sibling)
 Otherwise, concatenate with an adjacent page, demoting a key
from the parent page to the newly formed page.
 If the demotion causes underflow, repeat
redistribution-concatenation
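Deletion Example
A page underflows when fewer than two keys remain. Initial tree:
 page 0 (root) = [M]
 page 1 = [D H] over leaves 3 = [A C], 4 = [E F], 5 = [I J K]
 page 2 = [Q U] over leaves 6 = [N O P], 7 = [R S], 8 = [V W X Y Z]
(a) Simple: delete J. Leaf 5 becomes [I K]; no underflow
(b) Exchange: delete M. M is in the root, so it is swapped with its immediate successor N and then removed from leaf 6; root = [N], leaf 6 = [O P]
(c) Redistribution: delete R. Leaf 7 = [S] underflows; sibling 8 is rich, so keys are redistributed through the parent: leaf 7 = [S U V], page 2 = [Q W], leaf 8 = [X Y Z]
(d) Concatenation: delete A. Leaf 3 = [C] underflows and sibling 4 has no key to spare; D is demoted and the two leaves are concatenated: leaf 3 = [C D E F], page 1 = [H]
(e) Propagation: page 1 = [H] now underflows and its sibling, page 2 = [Q W], has no key to spare; N is demoted from the root and pages 1 and 2 are concatenated
Final form: page 1 (root) = [H N Q W] over leaves 3 = [C D E F], 5 = [I K], 6 = [O P], 7 = [S U V], 8 = [X Y Z]; the tree is one level shorter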
Improvements
 Redistribution during insertion
 A way of avoiding, or at least postponing, the creation of a
new page by redistributing overflow keys into its sibling
pages
 Improves space utilization from 67% to 86%
 B*-trees
 If redistribution takes place during insertion, then at the time of
overflow at least one other sibling is full
 Two-to-three split
 Distribute all keys in two full pages into three sibling pages evenly
 Each page contains at least ⌊(2m − 1)/3⌋ keys
 Special handling of the root
Improvements

 Virtual B-trees
 B-trees that use RAM page buffers
 Buffer strategies
 Keep the root page
 Retain the pages of higher levels
 LRU (the least recently used page in the buffer is replaced
by a new page)
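An LRU page buffer can be sketched with an ordered dictionary. The class `PageBuffer` and the `read_page` callback are hypothetical stand-ins for whatever actually reads a page from disk.

```python
from collections import OrderedDict

class PageBuffer:
    def __init__(self, capacity, read_page):
        self.capacity = capacity
        self.read_page = read_page        # hypothetical function: page number -> page
        self.pages = OrderedDict()        # least recently used entry comes first
        self.disk_reads = 0

    def get(self, page_no):
        if page_no in self.pages:
            self.pages.move_to_end(page_no)    # mark as most recently used
            return self.pages[page_no]
        self.disk_reads += 1                   # miss: fetch the page
        page = self.read_page(page_no)
        self.pages[page_no] = page
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)     # evict the least recently used page
        return page

buf = PageBuffer(2, lambda n: f"page-{n}")
for n in [0, 1, 0, 2, 0, 1]:
    buf.get(n)
```

Because every search starts at the root, the root page is touched most often and tends to stay in the buffer under this policy, which approximates the "keep the root page" strategy without special-casing it.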

Indexed Sequential Access
 Primary problem:
 Efficient sequential access and indexed search
(dual mode applications)
 Possible solutions:
 Sorted files:
 good for sequential accesses
 unacceptable performance for random access
 maintenance costs too high
 B-trees:
 good for indexed search
 very slow for sequential accesses (tree traversal)
 maintenance costs low
 B+ trees: a file with a B-tree structure + a sequence set
Sequence Sets

 Arrange the file into blocks
 Usually clusters or pages
 Records within blocks are sorted
 Blocks are logically ordered
 Using a linked list
 If each block contains b records,
then sequential access can be
performed in N/b disk accesses

Example (head = 2):
Block  Records        Next
1      D, E, F, G     4
2      A, B, C        1
3      J, K, L, M, N  -1
4      H, I           3
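Sequential access over the linked blocks can be sketched as a walk along the next-block links. The dictionary layout and the function name `scan` are assumptions for illustration; each dictionary lookup stands for one block read.

```python
# block number -> (sorted records, number of the next block; -1 ends the list)
blocks = {
    1: (["D", "E", "F", "G"], 4),
    2: (["A", "B", "C"], 1),
    3: (["J", "K", "L", "M", "N"], -1),
    4: (["H", "I"], 3),
}
head = 2

def scan(blocks, head):
    """Return all records in key order and the number of block reads used."""
    out, accesses, current = [], 0, head
    while current != -1:
        records, current = blocks[current]   # one block read, then follow the link
        accesses += 1
        out.extend(records)
    return out, accesses

records, accesses = scan(blocks, head)
```

The physical block order (1, 2, 3, 4) never matters; only the links do, which is what allows a block to be split or merged without moving the others.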

Maintenance of Sequence Sets
 Changes to blocks
 Goal: keep blocks at least half full
 Accommodates variable length records

File updates  Problems   Solutions
insertion     overflow   split w/o promotion
deletion      underflow  redistribution, concatenation

 Choice of block size
 The bigger the better
 Restricted by size of RAM, buffer, access speed
Indexed Access to Sequence Sets

 Keys of last record in each block
 Similar to what you find on dictionary pages
 Creates a one-level index
 Has to be able to fit in memory
 binary search
 index maintenance
 Separator: a shortest string that separates keys in
two consecutive blocks
 Shorter entries increase the size of the index that can be maintained
in memory
Example
head is B2
Block  Keys             Next
B1     camp – dutton    B6
B2     adams – berne    B4
B3     faber – folk     B5
B4     bolen – cage     B1
B5     folks – gaddis   -1
B6     embry – evans    B3

Block #  Last key  Separator
2        berne     bo
4        cage      cam
1        dutton    e
6        evans     f
3        folk      folks
5        gaddis
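The separators in the table can be derived mechanically. A minimal sketch, assuming a separator is the shortest prefix of the next block's first key that is still greater than the previous block's last key; the function name is hypothetical.

```python
def shortest_separator(last_key, first_key):
    """Shortest prefix s of first_key such that last_key < s <= first_key."""
    for n in range(1, len(first_key) + 1):
        prefix = first_key[:n]
        if prefix > last_key:
            return prefix
    raise ValueError("keys must be in strictly increasing order")

# (last key of a block, first key of the following block), in logical order
boundaries = [("berne", "bolen"), ("cage", "camp"), ("dutton", "embry"),
              ("evans", "faber"), ("folk", "folks")]
separators = [shortest_separator(a, b) for a, b in boundaries]
```

A search key that compares less than a separator belongs in a block to its left; a key greater than or equal to it belongs to the right.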
The Simple Prefix B+ Trees

 Use separators to build a B-tree on top of sequence sets
 B-tree is called index set

[Figure: index set (separators shown: bo, cam, f, folks) above the sequence set blocks adams – berne, bolen – cage, camp – dutton, embry – evans, faber – folk, folks – gaddis]
Maintenance of B+ Trees

 Updates are first made to the sequence set and then
changes to the index set are made if necessary
 If blocks are split, add a new separator
 If blocks are concatenated, remove a separator
 If records in the sequence set are redistributed, change the
value of the separator

Index Set Maintenance
 Index set block size
 Index node size = sequence set physical block size
 Block size that is best for sequence sets is also best for index
set
 Easier to build virtual simple prefix B+ trees
 Index set blocks and sequence set blocks are intermingled in
the same file
 Variable-order B-tree
 Separators are variable length; Can store variable
number of them in a block
 How to identify begin-end points of separators for
binary search?
Index Node Structure

Separator count: 5
Total length of separators: 12
Separators: bocameffolks (bo, cam, e, f, folks)
Index to separators: 00 02 05 06 07
Relative block numbers: B00 B01 B02 B03 B04 B05 B06
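The index to separators is what makes binary search possible over variable-length entries. The sketch below is an assumption-laden illustration: it keeps the node fields in Python variables instead of a packed block, and since five separators delimit six children it uses only B00 to B05.

```python
separators = "bocameffolks"            # all separators concatenated
offsets = [0, 2, 5, 6, 7]              # index to separators: start of each one
blocks = ["B00", "B01", "B02", "B03", "B04", "B05"]

def separator(i):
    """Slice out the i-th separator using the offsets table."""
    end = offsets[i + 1] if i + 1 < len(offsets) else len(separators)
    return separators[offsets[i]:end]

def child_for(key):
    """Binary search: count the separators <= key, which selects the child block."""
    lo, hi = 0, len(offsets)
    while lo < hi:
        mid = (lo + hi) // 2
        if separator(mid) <= key:
            lo = mid + 1
        else:
            hi = mid
    return blocks[lo]
```

The fixed-width offsets give constant-time access to the middle separator, so the search never has to scan the concatenated string.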
Building a Simple Prefix B+ Tree

 Using insertion procedure
 Splitting and redistribution are expensive
 Loading
 Presort the sequence set
 Construct a B-tree of the index set to the sequence set
as sequence set blocks are formed

B+ Trees

 B+ trees differ from simple prefix B+ trees in that
the separators are the keys themselves
 Reasons:
 Additional overhead of index maintenance with variable
length separators may not be warranted
 The keys over which indexing is performed may not
lend themselves to being separated by anything less than
a key or the overhead of processing may be too great
 Student id numbers
 Social insurance numbers

Differences Between
B-tree and B+ Tree
 Node information content
 In B-trees, all the pages contain the keys and
information (or a pointer to it)
 In the B+ tree, the keys and information are contained in
the sequence set
 Tree structure
 B+ tree is usually shallower than a B-tree
 Access speed
 Ordered sequential access is faster in B+ trees
