You are on page 1of 20

BITS Pilani

Dr.Aruna Malapati
BITS Pilani Asst Professor
Department of CSIS
Hyderabad Campus
BITS Pilani
Hyderabad Campus

Dictionaries and Tolerant Retrieval


Todays Agenda

Dictionary data structures

Tolerant retrieval

Wild card queries

Spelling correction

Soundex

BITS Pilani, Hyderabad Campus


Dictionary data structures for
Inverted Indexes
The dictionary data structures stores the term
Vocabulary, Document Frequency, Pointers to each
posting listsin what Data Structure?

BITS Pilani, Hyderabad Campus


Storing Dictionaries

For each term, we need to store a couple of items:


document frequency

pointer to postings list

...

Assume for the time being that we can store this


information in a fixed-length entry.

Assume that we store these entries in an array.

BITS Pilani, Hyderabad Campus


Storing Dictionaries (2)

An Array of
structures

char(20) int Postings*

space needed: 20 bytes 4/8 bytes 4/8 bytes

How do we look up an element in this array at query time?

Remember: these dictionaries can be huge, scanning is not an


option BITS Pilani, Hyderabad Campus
Dictionary Data Structures

Two main classes of data structures: hash tables and


search trees.

Criteria for when to use hash tables vs. trees:

Is there a fixed number of terms or will it keep growing?

What are the relative frequencies with which various keys


will be accessed?

How many terms are we likely to have?

BITS Pilani, Hyderabad Campus


Hash Tables

Each vocabulary term is hashed into an integer.

Try to avoid collisions

At query time, do the following: hash query term, resolve


collisions, locate entry in fixed-width array.

Pros:
Lookup in a hash table is faster than lookup in a tree.

Cons:
No prefix search (all terms starting with automat) [tolerant retrieval]

Need to rehash everything periodically if vocabulary keeps growing.

BITS Pilani, Hyderabad Campus


Trees

Trees solve the prefix problem (find all terms starting


With automat).

Simplest tree: binary tree

Binary trees are problematic:


Only balanced trees allow efficient retrieval

Rebalancing binary trees is expensive

Use B-trees
Solves the prefix problem.

Slower O(log M) [ and this requires the balanced tree]

Rebalancing is expensive
BITS Pilani, Hyderabad Campus
Binary Tree

BITS Pilani, Hyderabad Campus


B-Tree

Definition: Every internal node has a number of children


in the interval [a,b] where a,b are appropriate natural
numbers, e.g.,[2,4].

BITS Pilani, Hyderabad Campus


Root

M-N
A-K

A-G H-K A-K M-N

H-J Ka-Ke
An-As

B-G

An-Ar Go-Gu
Assam B-C

Andhra Arunachal Goa


Bihar Chhattisgarh Gujarath
Pradesh Pradesh
Wildcard Queries

mon*: find all docs containing any term beginning with


mon

Easy with B-tree dictionary: retrieve all terms t in the


range: mon t < moo

* mon: find all docs containing any term ending with mon
Maintain an additional tree for terms backwards

Then retrieve all terms t in the range: nom t < non

How do we enumerate all the terms meeting the wild card query pro*cent?

BITS Pilani, Hyderabad Campus


Query Processing

At this point, we have an enumeration of all terms in the


dictionary that match the wildcard query.

We still have to look up the postings for each


enumerated term.

E.g., consider the query: Jayasur*a, Sana* Jayasur*

This may result in the execution of many Boolean AND


queries.

BITS Pilani, Hyderabad Campus


Wildcards in Middle of Term

Example: m*nchen

We could look up m*and*nchen in the B-tree and


intersect the two term sets.

Expensive (there are probably thousands and thousands


of terms beginning with m)

Alternative: permuterm index

Basic idea: Rotate every wildcard query, so that the *


occurs at the end.

BITS Pilani, Hyderabad Campus


Permuterm index

For term hello, index under:

hello$,ello$h,llo$he,lo$hel,o$hell,$hello where$ is a
special symbol.

Queries: where X=hel and Y=o

X lookup on X$ X* lookup on $ X*

X*Y lookup on Y$ X* Y*Z ???

BITS Pilani, Hyderabad Campus


K-gram index

A K-gram is sequence of K characters .


More space-efficient than permuterm index
Enumerate all character k-grams (sequence of k
characters) occurring in a term 2-grams are called
bigrams
Example: from April is the cruelest month

$a,ap,pr,ri,il,l$,$i,is,s$,$t,th,he,e$,$c,cr,ru,
ue,el,le,es,st,t$, $m,mo,on,nt,h$

$ is a special word boundary symbol.


Maintain an inverted index from bigrams to the terms
that contain the bigram
BITS Pilani, Hyderabad Campus
2-gram examples

Dictionary : All possible K-grams generated form the terms


Postings : All terms containing that K-gram sorted in lexicographic
order

Postings are terms in the inverted index


ap
April Apple map

Query: mon*
$m
mace madden
Bigrams for query: $m, mo,on
mo
among amortize $m AND mo AND on
on
along among All terms staring with m and containing
Mo and containing on
2-gram index for the term moon
BITS Pilani, Hyderabad Campus
Processing wild card query

As before, we must execute a Boolean query for each


enumerated, filtered term.
Wild-cards can result in expensive query execution (very
large disjunctions)
pyth* AND prog*

If you encourage laziness people will respond!

Search
Type your search terms, use * if you need to.
E.g., Alex* will match Alexander.

BITS Pilani, Hyderabad Campus


Summary

Arrays, Hash tables or tress are commonly used as


dictionary data structures.

B-tree dictionary is most suitable if the wild card queries


are to be answered.

K-grams indexes are suitable to answer complex


queries.

BITS Pilani, Hyderabad Campus

You might also like