You are on page 1of 54

Sequence Alignment and Searching By: Sarika goyal

What is the purpose of sequence alignment?

Identification of homology and homologous sites in related sequences Inference of evolutionary history that lead to the differences in observed sequences

The Problem

Biological problem Finding a way to compare and represent similarity or dissimilarity between biomolecular sequences (DNA, RNA or amino acid)

The Problem

Computational problem Finding a way to perform inexact or approximate matching of subsequences within strings of characters Sequence comparison and alignment is a central problem in computational biology: High sequence similarity usually => structural or functional similarity

Substring and subsequence

Example: xyz is a subsequence within axayaz, but NOT a substring Characters in a substring must be contiguous

Types of comparisons and alignment methods

According to sequence Coverage: According to number of sequences: TWO SEQUENCES (Pairwise alignment)

LOCAL

GLOBAL

Database search against query sequences BLAST algorithm

Comparison of two sequences; First step in multiple sequence alignment

THREE OR MORE SEQUENCES (Multiple alignment)

Defining consensus sequences, protein structural motifs and domains, regulatory elements in DNA etc.

Determination of conserved residues and domains; Introductory step in molecular phylogenetic analysis

Introduction to sequence alignment


Given two text strings: First string = a b c d e Second string = a c d e f a reasonable alignment would be a b c d e a - c d e f

For the sequences gctgaacg and ctataatc: An uninformative alignment:


-------gctgaacg ctataatc-------

An alignment without gaps


gctgaacg ctataatc

An alignment with gaps


We must choose criteria so that algorithm can choose the best alignment.

gctga-a--cg --ct-ataatc

And another
gctg-aa-cg -ctataatc-

The dotplot (1)

A simple picture that gives an overview of the similarities between two sequences
Dotplot showing identities between sequences (DOROTHYHODGKIN) and (DOROTHYCROWFOOTHODGKIN):

Letters corresponding to isolated matches are shown in non-bold type. The longest matching regions, shown in boldface, are DOROTHY and HODGKIN. Shorter matching regions, such as the OTH of dorOTHy and RO of doROthy and cROwfoot, are noise.

The dotplot (2)

Dotplot showing identities between a repetitive sequence (ABRACADABRACADABRA) and itself. The repeats appear on several subsidiary diagonals parallel to the main diagonal.

The dotplot (3)

Dotplot showing identities between the palindromic sequence MAX I STAY AWAY AT SIX AM and itself. The palindrome reveals itself as a stretch of matches perpendicular to the main diagonal.

Dotplots and sequence alignments


Any path through the dotplot from upper left to lower right passes through a succession of cells, each of which picks out a pair of positions, one from the row and one from the column, that correspond in the alignment; or that indicates a gap in one of the sequences. The path need not pass through filled-in points only. However, the more filled-in points on the diagonal segments of the path, the more matching residues in the alignment.

Corrseponding alignment: DOROTHY--------HODGKIN DOROTHYCROWFOOTHODGKIN

Measures of sequence similarity

Two measures of distance between two character strings:

The Hamming distance, defined between two strings of equal length, is the number of positions with mismatching characters. The Levenshtein, or edit distance, between two strings of not necessarily equal length, is the minimal number of 'edit operations' required to change one string into the other, where an edit operation is a deletion, insertion or alteration of a single character in either sequence.

agtc cgta ag-tcc cgctca

Hamming distance = 2

Levenshtein distance = 3

The Edit Distance between two strings

Definition: The edit distance between two strings is defined as the minimum number of edit operations insertions, deletions, & substitutions needed to transform the first string into the second. For emphasis, note that matches are not counted. Example: AATT and AATG Distance = 1 (edit operation of substitution)

String alignment

An edit transcript is a way to represent a particular transformation of one string into another Emphasizes point mutations in the model An alignment displays a relationship between two strings Global alignment means for each string, entire string is involved in the alignment Examples:
AAGCA AA _C_

Sequence diversion

Sequences may have diverged from common ancestor through mutations: Substitution (AAGC AAGT) Insertion (AAG AAGT) Deletion (AAGC AAG)

Scoring schemes

In molecular biology, certain changes are more likely to occur than others

Amino acid substitutions tend to be conservative In nucleotide sequences, transitions are more frequent than transversions

-> We want to give different weights to different edit operations Example: a DNA substitution matrix:

a g c t

a 20 10 5 5

g 10 20 5 5

c 5 5 20 10

t 5 5 10 20

BLAST the workhorse of bioinformatics


http://www.ncbi.nlm.nih.gov/BLAST

BLAST = Basic local alignment search tool When you have a nucleotide or protein sequence that you want to search against sequence databases

to determine what the sequence is to find related sequences (homologs)

Different BLAST programs

DNA potentially encodes six proteins

DNA can be translated into six potential proteins


5 CAT CAA 5 ATC AAC 5 TCA ACT
5 CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3 3 GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5

5 GTG GGT 5 TGG GTA 5 GGG TAG

Four components to a BLAST search

(1) Choose the sequence (query) (2) Select the BLAST program (3) Choose the database to search (4) Choose optional parameters Then click BLAST

Step 1: Choose your sequence

Sequence can be input in FASTA format or as accession number

Example of the FASTA format for a BLAST query

Step 2: Choose the BLAST program Program Input blastn blastp blastx tblastn tblastx DNA protein DNA protein DNA Database DNA protein protein DNA DNA

1 1 6 6 36

Step 2: Choose the BLAST program

Step 3: choose the database


nr = non-redundant (most general database) dbest = database of expressed sequence tags dbsts = database of sequence tag sites pdb = sequences derived from 3d structure of proteins Patents = Nucleotide sequences derived from patent division of GenBank.

Step 4a: Select optional search parameters

You can... choose the organism to search turn filtering on/off change the substitution matrix change the expect (e) value change the word size change the output format

Step 4a: Select optional search parameters

CD search

Step 4a: Select optional search parameters

Entrez! Filter Expect Word size Scoring matrix

organism

filtering

Step 4b: optional formatting parameters

Alignment view Descriptions Alignments

page 97

BLAST format options

Alignments Views

Pairwise Standard BLAST alignment in pairs of query sequence and datab match.
Query: 251

tgaccggtaacgaccgcaccctggacgtcatggcgctggatgtggtgtggacggcgga 3

|||||||||| ||||||| |||||||| |||||| |||||||||||||| |||||||


Sbjct: 248575 tgaccggtaaagaccgcagcttggacgtgatggcgatggatgtggtgtggacagcgga 248634

program

query database

High scores low e values

Algorithm for Blast

How a BLAST search works

The central idea of the BLAST algorithm is to confine attention to segment pairs that contain a word pair of length W with a score of at least T. Altschul et al. (1990)

BLAST algorithm principles


(Basic Local Alignment Search Tool) Main idea:
1.

query

Construct a dictionary of all the words in the query Initiate a local alignment for each word match between query and DB DB

2.

BLAST Original Version


Dictionary: All words of length k Alignment initiated between words of alignment score T Alignment: Ungapped extensions until score below statistical threshold Output: All local alignments with score > statistical threshold

Background

The BLAST Algorithm

3 Stages

Preprocessing of the query Generation of hits Extension of the hits

Background

Step 1: Preprocessing of the query

Speed gained by minimizing search space Alignments require word hits ( word size = W )
Sequence 1

Sequence 2

word hits

Background

Step 1: Preprocessing of the query (Contd.)


RGD KGD QGD RGE EGD

17 14 13 13 12 12 12 12 11 11 11 11 11 11 11 11 11

Threshold score = T Neighborhood words of RGD W and T modulate speed and sensitivity

HGD NGD

RGN AGD MGD RAD RGQ RGS RND RSD SGD TGD

T=12

Background

Step 2: Generation of hits

A hit is made with one or several successive pairs of similar words. All the possible hits between the query sequence and sequences from databases are calculated in this way.
query

Background

Step 3: Extension

Each hit is extended in both directions. Extension is terminated when the maximum score drops below X.
scan DB

query

Background

The BLAST Algorithm: Summary

BLAST Original Version


Example: k = 4, T=4 The matching word GGTC initiates an alignment Extension to the left and right with no gaps Output: GTAAGGTCC GTTAGGTCC A C G A A G T A A G G T C C A G T

C C C T T C C T G G A T T G C G A

BLAST results: List of hits

Evalue - number of unrelated databank sequences expected to yield same or higher score by pure chance

BLAST results: High scoring pairs (HSPs)

Fundamental unit of the BLAST algorithm output

HSP (high scoring pair)

Aligned fragments of query and detected sequence with similarity score exceeding a set cutoff value

E-value

(Expectation)

Score = 61 (27.8 bits), Expect = 1.8e-65, Sum P(4) = 1.8e-65 Identities = 10/17 (58%), Positives = 16/17 (94%) Query: 81 SGDLSMLVLLPDEVSDL 97 +GD+SM +LLPDE++D+ Sbjct: 259 AGDVSMFLLLPDEIADV 275

HSP

BLAST Confidence measures


Score and bit-score : depend on scoring method E-value (Expect value) : number of unrelated database sequences expected to yield same or higher score by pure chance

The Expect value is used as a convenient way to create a significance threshold for reporting results. When the Expect value is increased from the default value of 10, a larger list with more low-scoring hits can be reported. (E-value approching zero => significant alignment) An E value of 1 assigned to a hit can be interpreted as in a database of the current size one might expect to see 1 match with a similar score simply by chance.

Statistical Terminology

True positive: A hit returned from a database search which is homologous with the query sequence. GOOD False positive: A hit returned from a database search which is not homologous with the query sequence. BAD True negative: A sequence which is not homologous with the query sequence is not returned from database search. GOOD False negative: A sequence which is homologous with the query sequence is not returned from database search. BAD Sensitivity: A program which is sensitive picks up on most true positive. Selectivity: A program which is selective does not include false positives.

Conclusion

Treat BLAST searches as scientific experiments Dont use the default parameters

Default changes from time to time

BLAST is quite complicated. But its very useful.

Thanks

You might also like