prediction methods for ORF

METHODS FOR ORF PREDICTION

BY: KARAMVEER, M.Sc. LIFE SCIENCES WITH SPECIALISATION IN BIOINFORMATICS (2015-17)

WELCOME

WHAT IS GENE PREDICTION ?

From a genomic DNA sequence, we want to predict the regions that encode a protein: the genes.

• Gene finding is about detecting these coding regions and inferring the gene structure, starting from genomic DNA sequences.

An open reading frame (ORF) is the part of a reading frame that has the potential to code for a protein or peptide. An ORF is a continuous stretch of codons that does not contain a stop codon.

• We need to distinguish coding from non-coding regions using properties specific to each type of DNA region.

• Gene finding is not an easy task!

• DNA sequence signals have low information content.

• It is difficult to discriminate real signals from noise (degenerate and highly unspecific signals);

• Gene structure can be complex (sparse exons, alternative splicing, ...);

• DNA signals may vary in different organisms.

https://en.wikipedia.org/wiki/Reading_frame

https://en.wikipedia.org/wiki/Codons

https://en.wikipedia.org/wiki/Stop_codon

GENE COMPONENTS

IDENTIFYING ORFS

• A simple first step in gene finding.

• Translate the genomic sequence in all six reading frames.

• Identify the stop codons in each frame.

• Regions without stop codons are called “open reading frames” or ORFs.

• Locate and tag all of the likely ORFs in a sequence.

• The longest ORF starting from a methionine codon is a good prediction of a protein-encoding sequence.
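The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the algorithm of any particular tool: it assumes the standard stop codons (TAA, TAG, TGA), treats ATG as the only start codon, and reports only ORFs that run from a start codon to an in-frame stop. The names `find_orfs`, `orfs_in_frame`, and `min_len` are chosen here for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}   # standard genetic code (assumption)
START_CODON = "ATG"                   # methionine
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame):
    """Yield (start, end) for each ATG...stop ORF in one frame (0, 1 or 2)."""
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == START_CODON:
            start = i                 # first methionine since the last stop
        elif start is not None and codon in STOP_CODONS:
            yield start, i + 3        # end is exclusive and includes the stop
            start = None


def find_orfs(seq, min_len=30):
    """Scan all six frames; return ORFs of at least min_len nt, longest first.

    Each result is (strand, frame, start, end, orf_sequence). Coordinates on
    the "-" strand refer to the reverse-complemented sequence.
    """
    seq = seq.upper()
    found = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame):
                if end - start >= min_len:
                    found.append((strand, frame, start, end, s[start:end]))
    return sorted(found, key=lambda orf: orf[3] - orf[2], reverse=True)


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATTTGGGTAACC", min_len=9):
        print(orf)   # ('+', 2, 2, 17, 'ATGAAATTTGGGTAA')
```

Sorting by length puts the longest ORF first, which matches the heuristic that the longest ORF starting from a methionine codon is the best candidate.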

NCBI ORF FINDER

• The ORF Finder is a graphical analysis tool that finds all open reading frames of a selectable minimum size in a user’s sequence or in a sequence already in the database.

• This tool identifies all open reading frames using the standard or alternative genetic codes.

• The deduced amino acid sequence can be saved in various formats and searched against the sequence database using the BLAST server.

• The ORF Finder should be helpful in preparing complete and accurate sequence submissions.

CURRENT GENE PREDICTION METHODS

HOMOLOGY BASED METHOD

• Based on the sequence similarity of a query sequence to annotated genes present in databases.

• Given a database of sequences from other organisms, search for the query sequence in this database.

• Identify database sequences (known genes) that resemble the query sequence.

• If the identified sequences are genes, the query sequence is probably (putatively) a gene.

BLAST

• Basic Local Alignment Search Tool.

• A well-known search tool in this category.

• Strengths: accuracy; able to identify biologically relevant genes.

• Weakness: cannot identify genes that code for proteins not present in the database.

• Only about 50% of genes can be found by homology to other known genes or proteins.
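To make the idea of a local alignment concrete, here is a small Python sketch of the Smith-Waterman algorithm. This is not how BLAST works internally (BLAST uses a faster seed-and-extend heuristic); it is the exact dynamic-programming method that such heuristics approximate. The scoring values (match +2, mismatch -1, gap -2) and the function name `smith_waterman` are assumptions chosen for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]      # H[i][j]: best score ending at a[i-1], b[j-1]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(smith_waterman("TTTACGTGGG", "CCACGTCC"))   # (8, 'ACGT', 'ACGT')
```

In a homology search, a high-scoring local alignment between the query and a known gene is the evidence that the query is putatively a gene.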

HOMOLOGY METHODS: GENEWISE

• Uses HMMs to compare DNA sequences to protein sequences at the level of their conceptual translation, regardless of sequencing errors and introns.

• Principle:

• The exon model used in GeneWise is an HMM with 3 base states (match, insert, delete), with additional transitions between states to account for frame-shifts.

• Intron states have been added to the base model.

• GeneWise directly compares HMM profiles of proteins or domains to the gene structure HMM.

• GeneWise is a powerful tool, but time-consuming.

• Requires strong similarity (>70% identity) to produce good predictions.

• GeneWise is part of the Wise2 package: http://www.ebi.ac.uk/Wise2/.

AB INITIO METHOD

• Computational prediction that uses only the most elementary information.

• Can predict both eukaryotic and prokaryotic genes.

• Predicts genes based on the given sequence alone.

• It works on two major features associated with genes:

1. Gene signals

2. Gene content

METHODS FOR SIGNAL DETECTION

• Hidden Markov Models (HMMs):

• HMMs use a probabilistic framework to infer the probability that a sequence corresponds to a real signal.

• Neural Networks (NNs):

• NNs are trained with positive and negative examples. NNs “discover” the features that distinguish the two sets.

• The gene structure information is separated into several classes of features such as hexamer frequencies, splice sites, and GC composition.

Example: NN for acceptor sites, the perceptron (Horton and Kanehisa, 1992).
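Two of the feature classes named above, GC composition and hexamer frequencies, can be computed directly from a sequence. The Python sketch below shows one plausible way to do so. The function names `gc_content` and `kmer_frequencies` are illustrative, and counting overlapping k-mers is an assumption; a real gene finder would compare such frequencies against values learned from known coding and non-coding regions.

```python
from collections import Counter


def gc_content(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def kmer_frequencies(seq, k=6):
    """Relative frequency of every overlapping k-mer (hexamers when k=6)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if total == 0:                    # sequence shorter than k
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


if __name__ == "__main__":
    print(gc_content("ATGC"))                    # 0.5
    print(kmer_frequencies("ATGATGATG", k=3))    # ATG: 3/7, TGA: 2/7, GAT: 2/7
```

Features like these would form the input vector to a neural network or the emission statistics of an HMM state.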

AB INITIO METHODS: GRAIL

• Neural network recognizing coding potential

• Incorporates genomic context information (splice junctions, start and stop codons, poly-A signals)

• Not appropriate for sequences without genomic context

• http://compbio.ornl.gov

• Human, Mouse, Drosophila, Arabidopsis, and E. coli

PERFORMANCE EVALUATION

• The accuracy of a prediction program can be evaluated using parameters such as sensitivity and specificity.

• To describe the concepts of sensitivity and specificity accurately, four quantities are used:

• true positive (TP), which is a correctly predicted feature;

• false positive (FP), which is an incorrectly predicted feature;

• false negative (FN), which is a missed feature;

• true negative (TN), which is the correctly predicted absence of a feature.
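From these four counts the two measures follow directly. The Python sketch below uses the standard statistical definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). Note that much of the gene-prediction literature reports TP / (TP + FP) under the name "specificity"; that quantity is computed separately here as `precision`. The per-nucleotide labelling (1 = coding, 0 = non-coding) and all function names are assumptions for illustration.

```python
def confusion_counts(actual, predicted):
    """Count (TP, FP, FN, TN) from two equal-length sequences of 0/1 labels."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if a and p:
            tp += 1      # feature present and predicted
        elif p:
            fp += 1      # predicted but not present
        elif a:
            fn += 1      # present but missed
        else:
            tn += 1      # absent and not predicted
    return tp, fp, fn, tn


def sensitivity(tp, fn):
    """Fraction of real features that were found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tn, fp):
    """Fraction of true absences that were correctly left unpredicted."""
    return tn / (tn + fp) if (tn + fp) else 0.0


def precision(tp, fp):
    """Fraction of predictions that are correct (often called 'specificity'
    in gene-prediction papers)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


if __name__ == "__main__":
    actual    = [1, 1, 1, 0, 0, 0, 0, 1]
    predicted = [1, 1, 0, 0, 0, 1, 0, 1]
    tp, fp, fn, tn = confusion_counts(actual, predicted)
    print(tp, fp, fn, tn)                                   # 3 1 1 3
    print(sensitivity(tp, fn), specificity(tn, fp), precision(tp, fp))
```

Because non-coding positions vastly outnumber coding ones in most genomes, TN tends to dominate, which is one reason the TP / (TP + FP) form is popular in this field.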

Conclusion

REFERENCES

https://www.google.co.in/search?q=gene+components&biw=1366&bih=623&source=lnms&tbm=isch&sa=X&sqi=2&ved=0CAYQ_AUoAWoVChMIld-fy7_4yAIVwh-UCh1dfwEb#tbm=isch&q=rbs+in+prokaryotic+gene&imgrc=p4VQkhXIIG_DsM%3A

http://www.aun.edu.eg/molecular_biology/Procedure%20Bioinformatics22.23-4-2015/Xiong%20-%20Essential%20Bioinformatics%20send%20by%20Amira.pdf




Thank You…
