CSF-469 - L5

BITS Pilani
Dr.Aruna Malapati
BITS Pilani Asst Professor
Department of CSIS
Hyderabad Campus
BITS Pilani
Hyderabad Campus
Dictionaries and Tolerant Retrieval

Todays Agenda
Dictionary data structures
Tolerant retrieval
Wild card queries
Spelling correction
Soundex
BITS Pilani, Hyderabad Campus

Dictionary data structures for
Inverted Indexes
The dictionary data structures stores the term
Vocabulary, Document Frequency, Pointers to each
posting listsin what Data Structure?

Storing Dictionaries
For each term, we need to store a couple of items:

document frequency
pointer to postings list
...
Assume for the time being that we can store this

information in a fixed-length entry.
Assume that we store these entries in an array.

Storing Dictionaries (2)
An Array of
structures
char(20) int Postings*
space needed: 20 bytes 4/8 bytes 4/8 bytes
How do we look up an element in this array at query time?
Remember: these dictionaries can be huge, scanning is not an

option BITS Pilani, Hyderabad Campus
Dictionary Data Structures
Two main classes of data structures: hash tables and

search trees.
Criteria for when to use hash tables vs. trees:
Is there a fixed number of terms or will it keep growing?
What are the relative frequencies with which various keys

will be accessed?
How many terms are we likely to have?

Hash Tables
Each vocabulary term is hashed into an integer.
Try to avoid collisions
At query time, do the following: hash query term, resolve

collisions, locate entry in fixed-width array.
Pros:
Lookup in a hash table is faster than lookup in a tree.
Cons:
No prefix search (all terms starting with automat) [tolerant retrieval]
Need to rehash everything periodically if vocabulary keeps growing.

Trees
Trees solve the prefix problem (find all terms starting

With automat).
Simplest tree: binary tree
Binary trees are problematic:

Only balanced trees allow efficient retrieval
Rebalancing binary trees is expensive
Use B-trees
Solves the prefix problem.
Slower O(log M) [ and this requires the balanced tree]
Rebalancing is expensive
Binary Tree

B-Tree
Definition: Every internal node has a number of children

in the interval [a,b] where a,b are appropriate natural
numbers, e.g.,[2,4].

Root
M-N
A-K
A-G H-K A-K M-N
H-J Ka-Ke
An-As
B-G
An-Ar Go-Gu
Assam B-C
Andhra Arunachal Goa

Bihar Chhattisgarh Gujarath
Pradesh Pradesh
Wildcard Queries
mon*: find all docs containing any term beginning with

mon
Easy with B-tree dictionary: retrieve all terms t in the

range: mon t < moo
* mon: find all docs containing any term ending with mon
Maintain an additional tree for terms backwards
Then retrieve all terms t in the range: nom t < non
How do we enumerate all the terms meeting the wild card query pro*cent?

Query Processing
At this point, we have an enumeration of all terms in the

dictionary that match the wildcard query.
We still have to look up the postings for each

enumerated term.
E.g., consider the query: Jayasur*a, Sana* Jayasur*
This may result in the execution of many Boolean AND

queries.

Wildcards in Middle of Term
Example: m*nchen
We could look up m*and*nchen in the B-tree and

intersect the two term sets.
Expensive (there are probably thousands and thousands

of terms beginning with m)
Alternative: permuterm index
Basic idea: Rotate every wildcard query, so that the *

occurs at the end.

Permuterm index
For term hello, index under:
hello$,ello$h,llo$he,lo$hel,o$hell,$hello where$ is a
special symbol.
Queries: where X=hel and Y=o
X lookup on X$ X* lookup on $ X*
X*Y lookup on Y$ X* Y*Z ???

K-gram index
A K-gram is sequence of K characters .

More space-efficient than permuterm index
Enumerate all character k-grams (sequence of k
characters) occurring in a term 2-grams are called
bigrams
Example: from April is the cruelest month
$a,ap,pr,ri,il,l$,$i,is,s$,$t,th,he,e$,$c,cr,ru,
ue,el,le,es,st,t$, $m,mo,on,nt,h$
$ is a special word boundary symbol.

Maintain an inverted index from bigrams to the terms
that contain the bigram
2-gram examples
Dictionary : All possible K-grams generated form the terms

Postings : All terms containing that K-gram sorted in lexicographic
order
Postings are terms in the inverted index

ap
April Apple map
Query: mon*
$m
mace madden
Bigrams for query: $m, mo,on
mo
among amortize $m AND mo AND on
on
along among All terms staring with m and containing
Mo and containing on
2-gram index for the term moon
Processing wild card query
As before, we must execute a Boolean query for each

enumerated, filtered term.
Wild-cards can result in expensive query execution (very
large disjunctions)
pyth* AND prog*
If you encourage laziness people will respond!
Search
Type your search terms, use * if you need to.
E.g., Alex* will match Alexander.

Summary
Arrays, Hash tables or tress are commonly used as

dictionary data structures.
B-tree dictionary is most suitable if the wild card queries

are to be answered.
K-grams indexes are suitable to answer complex

queries.

CSF-469 - L5

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

CSF-469 - L5

Uploaded by

Copyright:

Available Formats

BITS Pilani

Dictionaries and Tolerant Retrieval

Dictionary data structures

Wild card queries

BITS Pilani, Hyderabad Campus

BITS Pilani, Hyderabad Campus

For each term, we need to store a couple of items:

pointer to postings list

Assume for the time being that we can store this

Assume that we store these entries in an array.

BITS Pilani, Hyderabad Campus

char(20) int Postings*

space needed: 20 bytes 4/8 bytes 4/8 bytes

How do we look up an element in this array at query time?

Remember: these dictionaries can be huge, scanning is not an

Two main classes of data structures: hash tables and

Criteria for when to use hash tables vs. trees:

Is there a fixed number of terms or will it keep growing?

What are the relative frequencies with which various keys

How many terms are we likely to have?

BITS Pilani, Hyderabad Campus

Each vocabulary term is hashed into an integer.

Try to avoid collisions

At query time, do the following: hash query term, resolve

Need to rehash everything periodically if vocabulary keeps growing.

BITS Pilani, Hyderabad Campus

Trees solve the prefix problem (find all terms starting

Simplest tree: binary tree

Binary trees are problematic:

Rebalancing binary trees is expensive

Slower O(log M) [ and this requires the balanced tree]

BITS Pilani, Hyderabad Campus

Definition: Every internal node has a number of children

BITS Pilani, Hyderabad Campus

A-G H-K A-K M-N

Andhra Arunachal Goa

mon*: find all docs containing any term beginning with

Easy with B-tree dictionary: retrieve all terms t in the

Then retrieve all terms t in the range: nom t < non

BITS Pilani, Hyderabad Campus

At this point, we have an enumeration of all terms in the

We still have to look up the postings for each

E.g., consider the query: Jayasur*a, Sana* Jayasur*

This may result in the execution of many Boolean AND

BITS Pilani, Hyderabad Campus

We could look up m*and*nchen in the B-tree and

Expensive (there are probably thousands and thousands

Alternative: permuterm index

Basic idea: Rotate every wildcard query, so that the *

BITS Pilani, Hyderabad Campus

For term hello, index under:

Queries: where X=hel and Y=o

X*Y lookup on Y$ X* Y*Z ???

BITS Pilani, Hyderabad Campus

A K-gram is sequence of K characters .

$ is a special word boundary symbol.

Dictionary : All possible K-grams generated form the terms

Postings are terms in the inverted index

As before, we must execute a Boolean query for each

If you encourage laziness people will respond!

BITS Pilani, Hyderabad Campus

Arrays, Hash tables or tress are commonly used as

B-tree dictionary is most suitable if the wild card queries

K-grams indexes are suitable to answer complex

E.g., consider the query: Jayasura, Sana Jayasur*

We could look up mandnchen in the B-tree and

XY lookup on Y$ X Y*Z ???