Gene finding tools. Serial analysis of gene expression. Paralogues and gene displacement.
GLIMMER
Glimmer (Gene Locator and Interpolated Markov ModelER) is a system for finding genes in microbial DNA, especially the genomes of bacteria, archaea, and viruses. It uses interpolated Markov models (IMMs) to identify coding regions and distinguish them from non-coding DNA. The IMM approach combines Markov models from 1st through 8th order, weighting each model according to its predictive power. Glimmer uses 3-periodic non-homogeneous Markov models in its IMMs. Glimmer is the primary microbial gene finder used at The Institute for Genomic Research (TIGR), where it was first developed, and has been used to annotate the complete genomes of over 80 bacterial species from TIGR and dozens (possibly hundreds) from other labs. Its analyses of some of these genomes are available at the Comprehensive Microbial Resource Site.
TWAIN
TWAIN is a new syntenic gene finder which employs a Generalized Pair Hidden Markov Model (GPHMM) to predict genes in two closely related eukaryotic genomes simultaneously. It utilizes the MUMmer package to perform approximate alignment before applying a GPHMM based on an enhanced version of the TigrScan gene finder. TWAIN consists of two components: (1) ROSE, the Region Of Synteny Extractor, which identifies contiguous regions likely to contain one or more syntenic genes, and (2) OASIS, a generalized pair hidden Markov model (GPHMM) for predicting genes in the regions identified by ROSE. The system utilizes approximate alignments constructed by the PROmer and NUCmer programs in the MUMmer package to assess approximate alignment scores efficiently.
GLIMMER HMM
GlimmerHMM is a new gene finder based on a Generalized Hidden Markov Model (GHMM). Although the gene finder conforms to the overall mathematical framework of a GHMM, it additionally incorporates splice site models adapted from the GeneSplicer program and a decision tree adapted from GlimmerM. It also utilizes Interpolated Markov Models for the coding and noncoding models. Currently, GlimmerHMM's GHMM structure includes introns of each phase, intergenic regions, and four types of exons (initial, internal, final, and single).
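The interpolation idea behind the IMMs used by Glimmer and GlimmerHMM can be illustrated with a minimal sketch. It is not Glimmer's implementation: the class name, the maximum order of 3, the pseudocount, and the count-based weight `total / (total + confidence_scale)` are all assumptions chosen for brevity, and the 3-periodic structure is left out. The sketch only shows how predictions from several orders can be blended so that contexts seen more often in training receive more weight.

```python
from collections import defaultdict
import math


class InterpolatedMarkovModel:
    """Toy interpolated Markov model over the alphabet A, C, G, T.

    Assumption: each order is weighted by how often its context was seen
    in training. This is an illustrative rule, not Glimmer's actual one.
    """

    def __init__(self, max_order=3, pseudo=1.0, confidence_scale=10.0):
        self.max_order = max_order
        self.pseudo = pseudo
        self.confidence_scale = confidence_scale
        # counts[k][context][base] = times `base` followed the k-mer `context`
        self.counts = [defaultdict(lambda: defaultdict(int))
                       for _ in range(max_order + 1)]

    def train(self, sequences):
        for seq in sequences:
            for i, base in enumerate(seq):
                for k in range(self.max_order + 1):
                    if i >= k:
                        self.counts[k][seq[i - k:i]][base] += 1

    def _order_prob(self, k, context, base):
        """Smoothed probability from the order-k model, plus the context count."""
        table = self.counts[k].get(context, {})
        total = sum(table.values())
        prob = (table.get(base, 0) + self.pseudo) / (total + 4 * self.pseudo)
        return prob, total

    def prob(self, context, base):
        """Blend orders 0..max_order, trusting well-observed contexts more."""
        p, _ = self._order_prob(0, "", base)
        for k in range(1, self.max_order + 1):
            if len(context) < k:
                break
            pk, total = self._order_prob(k, context[-k:], base)
            weight = total / (total + self.confidence_scale)
            p = weight * pk + (1.0 - weight) * p
        return p

    def log_score(self, seq):
        """Log-probability of a sequence under the interpolated model."""
        return sum(
            math.log(self.prob(seq[max(0, i - self.max_order):i], seq[i]))
            for i in range(len(seq))
        )


imm = InterpolatedMarkovModel(max_order=3)
imm.train(["ACG" * 20])
print(imm.log_score("ACGACG") > imm.log_score("TTTTTT"))
```

In a gene finder, one such model would typically be trained on coding sequence and another on non-coding sequence, and a candidate region would be assigned to whichever model scores it higher.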
GENEZILLA
GeneZilla is a state-of-the-art program for computational prediction of protein-coding genes in eukaryotic DNA, and is based on the Generalized Hidden Markov Model (GHMM) framework, similar to GENSCAN and GENIE. It is highly reconfigurable and includes software for retraining by the end-user. It is written in highly optimized C++ and runs under most UNIX/Linux platforms. The run time and memory requirements are linear in the sequence length, and are in general much better than those of competing systems, due to GeneZilla's novel decoding algorithm. Graph-theoretic representations of the high scoring open reading frames are provided, allowing for exploration of suboptimal gene models. It utilizes Interpolated Markov Models (IMMs) and Maximal Dependence Decomposition (MDD), includes states for signal peptides, branch points, TATA boxes, and CAP sites, and will soon model CpG islands as well.
GeneZilla Architecture
GeneZilla's state-transition diagram is essentially the same as that of GENSCAN. Unlike many GHMM-based gene finders, GeneZilla can model the different types of exons (initial, internal, final, and single) using different content sensors.
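The four exon types mentioned above follow directly from an exon's position within a gene. The sketch below labels the exons of one gene model; the function name and the coordinate tuples are illustrative assumptions and are not taken from GeneZilla or GlimmerHMM. In a gene finder that uses separate content sensors, each label could then select its own sensor.

```python
def classify_exons(exons):
    """Label the exons of one gene as initial, internal, final, or single.

    `exons` is a list of (start, end) pairs given in transcription order.
    This is an illustrative helper, not part of any gene finder's API.
    """
    n = len(exons)
    if n == 1:
        return [(exons[0], "single")]
    labelled = []
    for i, exon in enumerate(exons):
        if i == 0:
            label = "initial"
        elif i == n - 1:
            label = "final"
        else:
            label = "internal"
        labelled.append((exon, label))
    return labelled


for exon, label in classify_exons([(100, 250), (400, 520), (700, 910)]):
    print(exon, label)
```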
GENE SPLICER
GeneSplicer is a fast, flexible system for detecting splice sites in the genomic DNA of various eukaryotes. The system has been trained and tested successfully on Plasmodium falciparum (malaria), Arabidopsis thaliana, human, Drosophila, and rice. Training data sets for human and Arabidopsis thaliana are included. GeneSplicer can be run directly through the GeneSplicer Web Interface, or the complete system, including source code, can be downloaded.
ExAlt
ExAlt is a software program designed to predict alternatively spliced overlapping exons in genomic sequence. The program works in several ways depending on the available input. ExAlt can use information about existing gene structure as well as sequence conservation to improve the precision of its predictions. ExAlt can also make predictions when only a single genomic sequence is available. ExAlt has been extensively tested on Drosophila melanogaster, but can be adapted to run on other species.
JIGSAW
JIGSAW is a program designed to use the output from gene finders, splice site prediction programs and sequence alignments to predict gene models. The program provides an automated way to take advantage of the many successful methods for computational gene prediction and can provide substantial improvements in accuracy over an individual gene prediction program. JIGSAW is available for all species.
Using JIGSAW
A training set is given to JIGSAW, which consists of example output from an automated gene structure annotation pipeline along with sequence coordinates of known genes. JIGSAW compares the pipeline's predicted genes to the example known genes to record the prediction accuracy of each combination of evidence. A non-linear model is built to estimate the accuracy of the different combinations of evidence found in new data. JIGSAW pieces together the gene structure models most likely to be accurate based on statistics collected in the training set. JIGSAW then predicts gene models for a user-supplied genomic sequence. The main interface is a simple "evidence list" file, which lists the file name of each prediction program's output, its file format, and the type of evidence. JIGSAW reads several coordinate-based file formats, including GFF.
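The training step described above, recording how accurate each combination of evidence is, can be pictured with a small frequency table. This is only a sketch of the idea: JIGSAW builds a non-linear model whose details are not given here, and the function names and the evidence source names (`finderA`, `finderB`, `alignment`) are hypothetical.

```python
from collections import defaultdict


def train_combination_accuracy(examples):
    """Estimate accuracy for each combination of evidence sources.

    `examples` is an iterable of (sources, is_correct) pairs, where
    `sources` names the programs supporting a predicted feature and
    `is_correct` says whether that feature matches a known gene.
    """
    counts = defaultdict(lambda: [0, 0])  # combination -> [correct, total]
    for sources, is_correct in examples:
        combo = frozenset(sources)
        counts[combo][1] += 1
        if is_correct:
            counts[combo][0] += 1
    return {combo: correct / total
            for combo, (correct, total) in counts.items()}


def score_feature(accuracy, sources, default=0.0):
    """Look up the estimated accuracy for the evidence behind a new feature."""
    return accuracy.get(frozenset(sources), default)


# Hypothetical training data: which sources agreed, and whether they were right.
training = [
    ({"finderA", "finderB"}, True),
    ({"finderA", "finderB"}, True),
    ({"finderA"}, False),
    ({"finderA"}, True),
    ({"finderB", "alignment"}, True),
]
accuracy = train_combination_accuracy(training)
print(score_feature(accuracy, {"finderA", "finderB"}))
print(score_feature(accuracy, {"finderA"}))
```

A plain lookup table like this cannot generalize to combinations it has never seen, which is one reason a fitted model is preferable to raw frequencies.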
RBSfinder
RBSfinder is also a program from TIGR. It searches for probable ribosome binding sites in the vicinity of the beginning of genes. Based on its findings, RBSfinder sometimes proposes a different start coordinate for the ORF. In most cases RBSfinder appears to improve the results from Glimmer. When RBSfinder proposes a different start, both Glimmer's original prediction and RBSfinder's alternative gene start are included in the results.
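The idea of moving a predicted start to one with a better ribosome binding site can be sketched as follows. This is not RBSfinder's algorithm: the consensus `AGGAGG` (a Shine-Dalgarno-like motif), the match-counting score, the window sizes, and the function names are all assumptions made for illustration. The sketch handles the forward strand only and does not check for in-frame stop codons between candidate starts.

```python
SD_CONSENSUS = "AGGAGG"            # assumed Shine-Dalgarno-like consensus
START_CODONS = {"ATG", "GTG", "TTG"}


def rbs_score(seq, start, window=20):
    """Best number of consensus matches in the `window` bases upstream of `start`.

    `start` is the 0-based index of the first base of the start codon.
    """
    upstream = seq[max(0, start - window):start]
    m = len(SD_CONSENSUS)
    best = 0
    for i in range(len(upstream) - m + 1):
        matches = sum(a == b for a, b in zip(upstream[i:i + m], SD_CONSENSUS))
        best = max(best, matches)
    return best


def propose_start(seq, start, max_shift=30, window=20):
    """Return (start, score) for the in-frame start codon with the best RBS score.

    The original start is kept unless a candidate scores strictly higher.
    """
    best_start, best_score = start, rbs_score(seq, start, window)
    for shift in range(-max_shift, max_shift + 1, 3):  # stay in frame
        cand = start + shift
        if cand < 0 or cand + 3 > len(seq):
            continue
        if seq[cand:cand + 3] not in START_CODONS:
            continue
        score = rbs_score(seq, cand, window)
        if score > best_score:
            best_start, best_score = cand, score
    return best_start, best_score


# Toy sequence: a start at index 4 with no upstream motif, and an
# in-frame start at index 22 preceded by the consensus motif.
seq = "CCCC" + "ATG" + "CCC" + "AGGAGG" + "CCCCCC" + "ATG" + "AAACCCGGGTAA"
print(propose_start(seq, 4))
```

Returning both the original and the proposed coordinate, as the text describes, would let a downstream annotator keep both alternatives.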
SERIAL ANALYSIS OF GENE EXPRESSION (SAGE)
recognition sequence for a second restriction enzyme, is added to the exposed ends of the retained cDNA fragments. The second enzyme liberates a short sequence of the original cDNA, which is 14 base pairs in length and is called a SAGE tag. Tags are harvested, polymerized, and sequenced. The sequence of a SAGE tag can uniquely identify a transcript, and quantification techniques reveal how often a tag appears, which gives a measurement of a gene's expression.
SAGE experiments proceed as follows:
1. Isolate the mRNA of an input sample (e.g. a tumour).
2. Extract a small chunk of sequence from a defined position of each mRNA molecule.
3. Link these small pieces of sequence together to form a long chain (or concatemer).
4. Clone these chains into a vector which can be taken up by bacteria.
5. Sequence these chains using modern high-throughput DNA sequencers.
Comparison to DNA microarrays
The general goal of the technique is similar to that of the DNA microarray. However, SAGE is a sequence-based sampling technique. Observations are not based on hybridization, which results in more quantitative, digital values. In addition, the mRNA sequences do not need to be known a priori, so genes or gene variants which are not known can be discovered. Microarray experiments are much cheaper to perform, however, so large-scale studies do not typically use SAGE.
Applications
1. Although SAGE was originally conceived for use in cancer studies, it has been successfully used to describe the transcriptome of other diseases and in a wide variety of organisms.
2. One of the major strengths of SAGE is the electronic nature of the database, allowing direct comparisons of libraries in silico by different investigators. For example, a normal human heart SAGE library is available on the CGAP Web site for gene expression queries, and a normal adult mouse heart SAGE library gene expression profile has recently been reported. Therefore, if both heart SAGE libraries were available on an internet platform similar to the CGAP Web site, it might be possible for investigators to determine species similarities or differences in heart gene expression profiles. Because there is no such SAGE cardiovascular Web site, in some instances individual authors have made their SAGE tags available for download and analysis.
3. There are a number of areas in cardiovascular biology where the SAGE technique may be useful. These areas include stem cell biology, cardiovascular development, angiogenesis, atherosclerosis, and lipid regulation. Some exploratory SAGE studies have already been reported for human hematopoietic stem cells, hyperlipidemic ApoE3-Leiden mice, and endothelial cells exposed to atherogenic stimulus. In the future, the SAGE technique could assist in finding new targets of important transcriptional factors such as Nkx2-5 in cardiogenesis, where the number of cells may be limiting. With the burgeoning population of congestive heart failure (CHF) patients, more insight is needed into the pathogenetic mechanisms of CHF. Potentially, SAGE libraries could be made from human endomyocardial biopsy specimens, but tissue heterogeneity may undermine the gene expression signals. It may be more informative to study the temporal changes in gene expression using controlled animal models of CHF, where more tissue material is available for processing. A refined candidate gene list could then be used in the diagnosis and prognosis of larger numbers of patient samples in a microarray format.
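The tag extraction and counting that underlie SAGE quantification, described above, can be sketched in a few lines. Several details are assumptions made for illustration rather than facts from the text: the anchoring site `CATG`, the choice to start the tag at the 3'-most anchoring site and include the site itself in the 14 bases, and all function names. Real protocols also handle ditags, linkers, and sequencing errors, which are ignored here.

```python
from collections import Counter


def extract_tag(transcript, anchor="CATG", tag_length=14):
    """Return the tag starting at the 3'-most anchor site, or None.

    Assumption: the tag includes the anchor site itself.
    """
    pos = transcript.rfind(anchor)
    if pos == -1:
        return None
    tag = transcript[pos:pos + tag_length]
    return tag if len(tag) == tag_length else None


def count_tags(transcripts, anchor="CATG", tag_length=14):
    """Count how often each tag occurs in a sample of transcripts."""
    tags = (extract_tag(t, anchor, tag_length) for t in transcripts)
    return Counter(t for t in tags if t is not None)


def tags_per_million(counts):
    """Normalize raw counts so libraries of different sizes can be compared."""
    total = sum(counts.values())
    return {tag: 1e6 * n / total for tag, n in counts.items()}


# Toy library: three copies of one transcript and one copy of another.
library = ["AAA" + "CATG" + "A" * 10 + "TTT"] * 3 + ["GGG" + "CATG" + "C" * 10 + "GGG"]
counts = count_tags(library)
for tag, tpm in tags_per_million(counts).items():
    print(tag, counts[tag], tpm)
```

Because each library reduces to a table of tag counts, two libraries can be compared in silico by normalizing both and comparing the values tag by tag.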
GENE DISPLACEMENT
Comparative genomics has revealed many examples in which the same function is performed by unrelated or distantly related proteins in different cellular lineages. In some
cases, this has been explained by the replacement of the original gene by a paralogue or non-homologue, a phenomenon known as non-orthologous gene displacement. Such gene displacement probably occurred early on in the history of proteins involved in DNA replication, repair, recombination and transcription (DNA informational proteins), i.e. just after the divergence of archaea, bacteria and eukarya from the last universal cellular ancestor (LUCA). This would explain why many DNA informational proteins are not orthologues between the three domains of life. However, in many cases, the origin of the displacing genes is obscure, as they do not even have detectable homologues in another domain. I suggest here that the original cellular DNA informational proteins have often been replaced by proteins of viral or plasmid origin. As viral and plasmid-encoded proteins are usually very divergent from their cellular counterparts, this would explain the puzzling phylogenies and distribution of many DNA informational proteins between the three domains of life.