METHODS FOR ORF PREDICTION
By: KARAMVEER, M.Sc. Life Sciences with Specialisation in Bioinformatics (2015-17)
What is gene prediction?
From a genomic DNA sequence, we want to predict the regions that encode a protein: the genes.
Gene finding is about detecting these coding regions and inferring the gene structure, starting from genomic DNA sequences.
An open reading frame (ORF) is the part of a reading frame that has the potential to code for a protein or peptide. An ORF is a continuous stretch of codons that does not contain a stop codon.
We need to distinguish coding from non-coding regions using properties specific to each type of DNA region.
Gene finding is not an easy task!
DNA sequence signals have low information content. It is difficult to discriminate real signals from noise (degenerate and highly nonspecific signals).
Gene structure can be complex (sparse exons, alternative splicing, ...).
DNA signals may vary between organisms.
Identifying ORFs
A simple first step in gene finding.
Translate the genomic sequence in all six reading frames.
Identify the stop codons in each frame.
Regions without stop codons are called open reading frames (ORFs).
Locate and tag all of the likely ORFs in a sequence.
The longest ORF starting from a methionine codon is a good prediction of a protein-coding sequence.
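The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular tool: it assumes the standard genetic code (stop codons TAA, TAG, TGA; start codon ATG), reports only ORFs that run from an ATG to an in-frame stop, and ignores ORFs that run off the end of the sequence. The names `find_orfs`, `reverse_complement`, and `min_codons` are hypothetical.

```python
# Stop and start codons of the standard genetic code (an assumption of this sketch).
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=2):
    """Yield (strand, frame, start, end, orf) for ATG-to-stop ORFs in six frames.

    start and end are 0-based offsets on the scanned strand; end is exclusive
    and includes the stop codon. min_codons counts the codons before the stop.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # position of the first ATG since the last stop
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == START_CODON:
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        yield strand, frame, start, i + 3, s[start:i + 3]
                    start = None


# Toy sequence made up for illustration.
orfs = list(find_orfs("CCATGAAATTTTAGGC"))
longest = max(orfs, key=lambda orf: len(orf[4]))
print(longest)  # ('+', 2, 2, 14, 'ATGAAATTTTAG')
```

Because the scan keeps the first ATG after each stop codon, every reported ORF is the longest one that begins with a methionine codon inside its stop-free region.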
NCBI ORF Finder
ORF Finder is a graphical analysis tool that finds all open reading frames of a selectable minimum size in a user's sequence or in a sequence already in the database.
The tool identifies all open reading frames using the standard or alternative genetic codes.
The deduced amino acid sequence can be saved in various formats and searched against the sequence database using the BLAST server.
ORF Finder should be helpful in preparing complete and accurate sequence submissions.
Current gene prediction methods
Homology-based method
Based on the sequence similarity of a query sequence to annotated genes present in databases.
Given a database of sequences from other organisms:
Search for the query sequence in this database.
Identify database sequences (known genes) that resemble the query sequence.
If the identified sequences are genes, the query sequence is probably (putatively) a gene.
BLAST
Basic Local Alignment Search Tool.
A well-known search tool in this category.
Strengths: able to identify biologically relevant genes; accuracy.
Weakness: cannot identify genes that code for proteins not present in the database.
Only about 50% of genes can be found by homology to other known genes or proteins.
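The idea of a homology search can be illustrated with a small local-alignment scorer. The sketch below is not BLAST's algorithm: BLAST uses a seeded heuristic for speed, whereas this uses exact Smith-Waterman dynamic programming with a linear gap penalty. The scoring values, the function names, and the toy database entries (`gene_a`, `gene_b`) are all assumptions made for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def best_hit(query, database):
    """Return (name, score) of the database entry most similar to the query."""
    hits = ((name, smith_waterman_score(query, seq)) for name, seq in database.items())
    return max(hits, key=lambda hit: hit[1])


# Hypothetical toy database; real searches run against annotated sequence databases.
toy_database = {
    "gene_a": "TTTTACGTACGTTTTT",
    "gene_b": "GGGGGGGG",
}
print(best_hit("ACGTACGT", toy_database))  # ('gene_a', 16)
```

A query with no resembling entry still returns a best hit, only with a low score, so a real pipeline would apply a score or significance threshold before calling the query a putative gene.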
Homology methods: Genewise
Uses HMMs to compare DNA sequences to protein sequences at the level of the DNA's conceptual translation, while allowing for sequencing errors and introns.
Principle: the exon model used in Genewise is an HMM with three base states (match, insert, delete), with additional transitions between states to account for frameshifts. Intron states have been added to the base model.
Genewise directly compares HMM profiles of proteins or domains to the gene-structure HMM.
Genewise is a powerful tool, but time-consuming.
It requires strong similarity (>70% identity) to produce good predictions.
Genewise is part of the Wise2 package: http://www.ebi.ac.uk/Wise2/.
Ab initio method
Computational prediction that uses only the most elementary information.
Can predict both eukaryotic and prokaryotic genes.
Predicts genes based on the given sequence alone.
It relies on two major features associated with genes:
Gene signals
Gene content
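Gene content is commonly summarized with compositional statistics of the sequence itself. As a rough sketch, the following computes two such statistics, GC fraction and hexamer counts. The function names and the `step` parameter are hypothetical, and a real gene finder would compare such counts against models trained on known coding and non-coding regions.

```python
from collections import Counter


def gc_fraction(seq):
    """Return the fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def hexamer_counts(seq, step=1):
    """Count 6-base words; step=3 counts only words aligned to one frame."""
    seq = seq.upper()
    return Counter(seq[i:i + 6] for i in range(0, len(seq) - 5, step))


print(gc_fraction("GGCCAATT"))        # 0.5
print(hexamer_counts("ATGATGATG", step=3))
```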
Methods for signal detection
Hidden Markov Models (HMMs): HMMs use a probabilistic framework to infer the probability that a sequence corresponds to a real signal.
Neural Networks (NNs): NNs are trained with positive and negative examples and discover the features that distinguish the two sets. The gene structure information is separated into several classes of features, such as hexamer frequencies, splice sites, and GC composition.
Example: an NN for acceptor sites, the perceptron (Horton and Kanehisa, 1992).
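To make the idea of training on positive and negative examples concrete, here is a minimal single-layer perceptron over one-hot encoded DNA windows. It is a generic textbook perceptron, not a reconstruction of the Horton and Kanehisa model, and the six-base windows are made-up toy data in which the positive examples carry an AG dinucleotide at a fixed position. All names are hypothetical.

```python
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA window as a flat 0/1 vector, four entries per position."""
    return [1.0 if base == b else 0.0 for base in seq.upper() for b in BASES]


def predict(weights, bias, x):
    """Return 1 (signal) if the weighted sum is positive, else 0."""
    score = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if score > 0 else 0


def train_perceptron(examples, max_epochs=1000, lr=1.0):
    """Train on (sequence, label) pairs; label 1 = signal, 0 = background."""
    n = len(one_hot(examples[0][0]))
    weights, bias = [0.0] * n, 0.0
    for _ in range(max_epochs):
        errors = 0
        for seq, label in examples:
            x = one_hot(seq)
            delta = label - predict(weights, bias, x)
            if delta:  # update only on misclassified examples
                errors += 1
                weights = [w + lr * delta * xi for w, xi in zip(weights, x)]
                bias += lr * delta
        if errors == 0:  # every training example is classified correctly
            break
    return weights, bias


# Made-up toy windows: positives have "AG" at positions 2-3 (0-based).
TOY_EXAMPLES = [
    ("TCAGGT", 1), ("CTAGGA", 1), ("TTAGAT", 1), ("CCAGGC", 1),
    ("TCACGT", 0), ("CTTGGA", 0), ("AGTCCT", 0), ("GGCTAA", 0),
]
weights, bias = train_perceptron(TOY_EXAMPLES)
print([predict(weights, bias, one_hot(seq)) for seq, _ in TOY_EXAMPLES])
```

The toy set is linearly separable, so the perceptron learning rule is guaranteed to reach zero training errors. Real splice-site data is noisier, and performance would have to be measured on held-out sequences.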
Ab initio methods: GRAIL
A neural network that recognizes coding potential.
Incorporates genomic context information (splice junctions, start and stop codons, poly-A signals).
Not appropriate for sequences without genomic context.
http://compbio.ornl.gov
Organisms: Human, Mouse, Drosophila, Arabidopsis, and E. coli
Performance evaluation
The accuracy of a prediction program can be evaluated using parameters such as sensitivity and specificity.
To describe sensitivity and specificity accurately, four quantities are used:
true positive (TP), a correctly predicted feature;
false positive (FP), an incorrectly predicted feature;
false negative (FN), a missed feature;
true negative (TN), the correctly predicted absence of a feature.
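From these four counts, the measures can be computed directly. The sketch below uses the standard statistical definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). Note that parts of the gene prediction literature instead report TP / (TP + FP) under the name "specificity"; that quantity is included here as `positive_predictive_value`. The function names and the example counts are made up for illustration.

```python
def sensitivity(tp, fn):
    """Fraction of real features that were predicted: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tn, fp):
    """Fraction of non-features correctly left unpredicted: TN / (TN + FP)."""
    return tn / (tn + fp) if (tn + fp) else 0.0


def positive_predictive_value(tp, fp):
    """Fraction of predictions that are real features: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical counts for one prediction run.
tp, fp, fn, tn = 80, 10, 20, 90
print(sensitivity(tp, fn))                # 0.8
print(specificity(tn, fp))                # 0.9
print(positive_predictive_value(tp, fp))
```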