GENE PREDICTION

Page 1: Gene Prediction Ppt

GENE PREDICTION

Page 2: Gene Prediction Ppt

TOPICS

INTRODUCTION
TWO APPROACHES FOR GENE PREDICTION
CLASSIFICATION OF GENE PREDICTION
METHODOLOGY FOR GENE PREDICTION
TOOLS AND SERVERS FOR GENE PREDICTION
CONCLUSION
REFERENCES

Page 3: Gene Prediction Ppt

INTRODUCTION

Gene finding typically refers to the area of computational biology that is concerned with algorithmically identifying stretches of sequence, usually genomic DNA, that are biologically functional. This especially includes protein-coding genes, but may also include other functional elements such as RNA genes and regulatory regions.

Gene finding is one of the first and most important steps in understanding the genome of a species once it has been sequenced.

Gene prediction aims to identify the regions of genomic DNA that encode proteins.

Page 4: Gene Prediction Ppt

IDENTIFICATION

mRNA: mRNA is isolated from the organism after the introns have been spliced out, and is then reverse transcribed into a cDNA copy.

Mature mRNA contains no introns, so the cDNA represents only the expressed (exonic) sequence of the gene.

EST (expressed sequence tag): a fragment of about 200 to 500 bases of a gene's mRNA sequence, sequenced from a random collection of mRNA-derived (cDNA) fragments, often from the 5' or 3' end.

Page 5: Gene Prediction Ppt

[Figure: flow of genetic information. Labels: DNA, RNA, cDNA, protein, phenotype. Steps shown: [1] transcription, [2] RNA processing (splicing), [3] RNA export, [4] RNA surveillance.]

Page 6: Gene Prediction Ppt
Page 7: Gene Prediction Ppt

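Signals: Pre-mRNA Splicing

[Figure: genomic DNA (start codon to stop codon) is transcribed into pre-mRNA (Cap- ... -Poly(A)), which is spliced into mRNA (Cap- ... -Poly(A)) and translated into protein. Each intron lies between exons and is bounded by splice sites: a donor site (GT) and an acceptor site (AG).]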

Page 8: Gene Prediction Ppt

Overview of gene prediction strategies

What sequence signals can be used?
Transcription: TF binding sites, promoter, initiation site, terminator
Processing signals: splice donor/acceptor sites, polyA signal
Translation: start (AUG = Met) and stop (UGA, UAA, UAG) codons, ORFs, codon usage

What other types of information can be used?
cDNAs and ESTs (experimental data, pairwise alignment)
Homology (sequence comparison, BLAST)

Page 9: Gene Prediction Ppt

Finding Eukaryotic Genes Computationally

Gene finding based on homology evidence: BLAST, FASTA, BLAT, etc.

Content-based methods: CpG islands, GC content, hexamer repeats, composition statistics, codon frequencies

Feature-based methods: donor sites, acceptor sites, promoter sites, start/stop codons, polyA signals, feature lengths

Similarity-based methods: sequence homology, EST searches

Pattern-based methods: HMMs, artificial neural networks

Most effective is a combination of all of the above!

Page 10: Gene Prediction Ppt

Scheme of a eukaryotic gene

Page 11: Gene Prediction Ppt

Gene prediction approaches

Focused on individual features:
Coding regions (ORFs)
Splice sites
Promoters
Codon bias
CpG islands
GC content

Page 12: Gene Prediction Ppt

Six Frames in a DNA Sequence

Start codon: ATG
Stop codons: TAA, TAG, TGA

5' GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG 3'

3' CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC 5'

[Figure: each of the two strands is read in three reading frames, giving six frames in total.]

Stop codons: 3 out of 64 codons, or roughly 1 in 20 (64/3 ≈ 21), so a long stretch without a stop codon is unlikely to occur by chance.
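The six-frame scan described on this slide can be sketched in a few lines of Python. This is only an illustrative sketch, not the method of any tool named in this presentation: the function names, the `min_codons` cutoff and the rule "first ATG up to the next in-frame stop" are assumptions made for the example.

```python
# Illustrative six-frame ORF scan (hypothetical helper names, not from any tool).
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, offset, min_codons):
    """Yield (start, end) of ATG...stop ORFs in one reading frame of seq.

    Coordinates are 0-based on seq; end is exclusive and includes the stop codon.
    """
    start = None
    for i in range(offset, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == START_CODON:
            start = i  # first ATG since the previous stop opens a candidate ORF
        elif start is not None and codon in STOP_CODONS:
            if (i + 3 - start) // 3 >= min_codons:
                yield start, i + 3
            start = None


def six_frame_orfs(seq, min_codons=10):
    """Return (strand, frame, start, end) tuples for all six reading frames.

    Minus-strand coordinates refer to the reverse-complemented sequence.
    """
    seq = seq.upper()
    found = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            for start, end in orfs_in_frame(s, offset, min_codons):
                found.append((strand, offset + 1, start, end))
    return found


if __name__ == "__main__":
    print(six_frame_orfs("CCATGAAATTTGGGTAACC", min_codons=3))
```

Because a stop codon turns up by chance about once every 21 codons, raising `min_codons` is the simplest way to filter out short ORFs that are unlikely to be real genes.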

Page 13: Gene Prediction Ppt

CpG Islands

CpG islands are regions of the genome with a higher frequency of CG dinucleotides (not base pairs!) than the rest of the genome.

CpG islands often occur near the beginning of genes, which may be related to the binding of the transcription factor Sp1.
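A minimal sketch of how CpG enrichment might be measured in a sliding window. The thresholds used here (200-base window, GC content above 50%, observed/expected CpG ratio above 0.6) are commonly cited criteria and are an assumption of this example, not something stated on the slide; the function names are hypothetical.

```python
# Illustrative CpG-island window scan; thresholds are assumed, not from the slides.
def cpg_stats(window):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA string."""
    window = window.upper()
    n = len(window)
    c = window.count("C")
    g = window.count("G")
    cpg = window.count("CG")  # CG dinucleotides on one strand
    gc_fraction = (c + g) / n if n else 0.0
    expected = c * g / n if n else 0.0  # CpG count expected from base composition
    ratio = cpg / expected if expected else 0.0
    return gc_fraction, ratio


def cpg_island_windows(seq, size=200, step=50, min_gc=0.5, min_ratio=0.6):
    """Return (start, end) of windows that pass both thresholds."""
    hits = []
    for start in range(0, max(len(seq) - size + 1, 0), step):
        gc, ratio = cpg_stats(seq[start:start + size])
        if gc > min_gc and ratio > min_ratio:
            hits.append((start, start + size))
    return hits
```

Overlapping windows that pass would normally be merged into a single island; that step is left out to keep the sketch short.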

Page 14: Gene Prediction Ppt

Splice sites are conserved (can be an important signal)

Page 15: Gene Prediction Ppt

Gene prediction: Eukaryotes vs prokaryotes

Gene prediction is easier in microbial genomes.

Why?
Smaller genomes
Simpler gene structures
More sequenced genomes (for comparative approaches)!

Methods?
Previously: mostly HMM-based
Now: similarity-based methods, because so many genomes are available

Page 16: Gene Prediction Ppt

PROCEDURE FOR GENE PREDICTION

Obtain the new genomic DNA sequence.

Translate it in all six reading frames and compare the translations to a protein sequence database.

Perform a database similarity search against an EST database of the same organism, or against cDNA sequences if available.

Use a gene prediction program to locate genes.

Analyze the regulatory sequences of the predicted genes.

Page 17: Gene Prediction Ppt

Integrated methods: Hidden Markov Models
Fully probabilistic, so can do proper statistics
Can estimate the parameters from labeled data
Can give confidence values

Semi- or Generalized HMMs
A state explains a subsequence (e.g. a whole exon), rather than a single base
Transitions between states occur at features detected by other methods (e.g. splice site consensus)

Page 18: Gene Prediction Ppt

Hidden Markov Models

Hidden Markov Models (HMMs) allow us to model complex sequences, in which the character emission probabilities depend upon the state

Think of an HMM as a probabilistic or stochastic sequence generator, and what is hidden is the current state of the model

Page 19: Gene Prediction Ppt

HMM Details

An HMM is completely defined by its:
State-to-state transition matrix ()
Emission matrix (H)
State vector (x)

We want to determine the probability of any specific (query) sequence having been generated by the model.

Two algorithms are typically used for the likelihood calculation:
Viterbi (probability of the single most likely state path)
Forward (total probability, summed over all state paths)
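To make the two calculations concrete, here is a toy two-state HMM (coding vs. noncoding) in plain Python. All probabilities below are made-up illustrative values, and the state names and function names are assumptions of this sketch; real gene finders use far richer models and work in log space to avoid numerical underflow.

```python
# Toy HMM with invented parameters, for illustration only.
STATES = ("coding", "noncoding")
START = {"coding": 0.5, "noncoding": 0.5}          # initial state vector
TRANS = {                                           # state-to-state transitions
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT = {                                            # per-state emission probabilities
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def forward(seq):
    """Total probability of seq, summed over all state paths."""
    f = {s: START[s] * EMIT[s][seq[0]] for s in STATES}
    for ch in seq[1:]:
        f = {s: EMIT[s][ch] * sum(f[p] * TRANS[p][s] for p in STATES)
             for s in STATES}
    return sum(f.values())


def viterbi(seq):
    """Most probable state path and its joint probability with seq."""
    v = {s: START[s] * EMIT[s][seq[0]] for s in STATES}
    back = []  # one dict of back-pointers per position after the first
    for ch in seq[1:]:
        pointers, nxt = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: v[p] * TRANS[p][s])
            pointers[s] = best
            nxt[s] = v[best] * TRANS[best][s] * EMIT[s][ch]
        back.append(pointers)
        v = nxt
    last = max(STATES, key=lambda s: v[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, v[last]
```

With these parameters a GC-rich query is labeled as coding throughout, and the Viterbi probability never exceeds the Forward probability, because Forward sums over every path, including the Viterbi one.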

Page 20: Gene Prediction Ppt

GRAIL

GRAIL stands for Gene Recognition and Analysis Internet Link.

Introduced by Uberbacher & Mural (1991).

One of the first techniques developed for gene prediction.

GRAIL uses a neural network (NN) to recognize coding potential in fixed-length windows of about 100 bases, without looking for additional features such as splice junctions or start and stop codons; the prediction depends only on the sequence itself.

The improved version, GRAIL 2, looks for these additional features and predicts by taking genomic context into account.

XGRAIL is the client-server application; it runs on Unix platforms.

URL: http://compbio.ornl.gov/tools/index.html

Page 21: Gene Prediction Ppt

FGENEH/FGENES

Developed by Victor Solovyev and colleagues.

It predicts internal exons by looking for structural features such as donor and acceptor splice sites.

The method uses linear discriminant analysis, a mathematical technique that allows data from multiple experiments to be combined.

Server: Sanger Centre web server.
URL: http://genomic.sanger.ac.uk/gf/gf.html

Example: human BAC clone RG346p16 of chromosome 7 (GenBank accession no. AC002416). The protein product is output in FASTA format.

Page 22: Gene Prediction Ppt
Page 23: Gene Prediction Ppt

MZEF

MZEF stands for Michael Zhang's Exon Finder, developed at Cold Spring Harbor Laboratory.

It is based on quadratic discriminant analysis (QDA).

MZEF predicts internal coding exons and does not give any other information.

QDA combines the results of two types of prediction:
1. Splice sites
2. Exon length

Page 24: Gene Prediction Ppt

•Prediction is based on exon length and exon-intron boundaries.

•The program can be downloaded from the CSHL FTP site for Unix, or accessed through a web front end.

•URL: http://www.cshl.org/genefinder

Page 25: Gene Prediction Ppt
Page 26: Gene Prediction Ppt

GENSCAN

Developed by Chris Burge and Sam Karlin.

Predicts complete gene structures.

Its high-probability exon predictions are mostly used in the design of PCR primers for cDNA amplification.

GENSCAN relies on a probabilistic model; the algorithm can report "optimal exons" as well as "suboptimal exons".

Optimal exons: the sequences with the highest probability (0.99, i.e. 97.5%).

Suboptimal exons: sequences with an acceptable probability (0.56, i.e. 62%).

URL: http://genes.mit.edu/GENSCAN.html

Page 27: Gene Prediction Ppt
Page 28: Gene Prediction Ppt
Page 29: Gene Prediction Ppt
Page 30: Gene Prediction Ppt
Page 31: Gene Prediction Ppt

GENEID

Finds exons based on coding potential.

Introduced by Guigó et al. (1992).

GENEID uses position weight matrices to assess whether a stretch of sequence represents a splice site or a start or stop codon.

Its output is more selective: the user can request only the exon types needed.

Output of internal exons only
Output of terminal exons only
Output of all exons

URL: http://www.imim.es/geneid.html
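The idea of scoring a candidate signal with a position weight matrix can be sketched as follows. This is a generic log-odds PWM, not GENEID's actual matrices or code: the training sites, the pseudocount, the uniform background of 0.25 and the function names are all assumptions made for illustration.

```python
import math

# Hypothetical aligned donor-like sites, used only to build an example matrix.
EXAMPLE_SITES = ["GTAAGT", "GTGAGT", "GTAAGA", "GTAAGT"]


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds position weight matrix from equal-length aligned sites."""
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        total = len(column) + 4 * pseudocount
        pwm.append({
            base: math.log2((column.count(base) + pseudocount) / total / background)
            for base in "ACGT"
        })
    return pwm


def score(pwm, window):
    """Sum the per-position log-odds scores for one candidate window."""
    return sum(col[base] for col, base in zip(pwm, window))


def best_hit(pwm, seq):
    """Return (score, position) of the best-scoring window in seq.

    Assumes seq is at least as long as the matrix.
    """
    width = len(pwm)
    return max((score(pwm, seq[i:i + width]), i)
               for i in range(len(seq) - width + 1))
```

A positive score means the window looks more like the training sites than like background sequence; in practice a score threshold is chosen to trade sensitivity against specificity.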

Page 32: Gene Prediction Ppt

BEST METHOD OF PREDICTION

Cold Spring Harbor Laboratory compared gene prediction tools to determine the best one.

Website: "Banbury Cross".

For each tool, every prediction has four possible outcomes (true positive, true negative, false positive, false negative), from which the following values are computed:

Sensitivity: the fraction of actual coding regions that are correctly predicted as coding.

Specificity: the overall fraction of the predictions that is correct.

To combine sensitivity and specificity into a single value, a correlation coefficient (CC) is computed.

CC ranges from -1 (the prediction is always wrong) through 0 to +1 (the prediction is always right).

Page 33: Gene Prediction Ppt

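Sensitivity: $Sn = \frac{TP}{TP + FN}$

Specificity: $Sp = \frac{TP}{TP + FP}$

Correlation coefficient: $CC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{PP \cdot PN \cdot AP \cdot AN}}$

where PP and PN are the predicted positives and negatives, and AP and AN are the actual positives and negatives.

Result: the best overall exon finder was MZEF, and the best gene structure predictor was GENSCAN.

CC for MZEF: 0.79
CC for GENSCAN: 0.86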
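As a small worked example, the three measures can be computed from the four outcome counts. The function name and the sample counts are hypothetical; the correlation coefficient is written in its standard form, with TP*TN minus FP*FN in the numerator and the square root of PP*PN*AP*AN in the denominator.

```python
import math


def accuracy_measures(tp, tn, fp, fn):
    """Return (sensitivity, specificity, correlation coefficient).

    Specificity follows the gene-prediction convention TP / (TP + FP).
    No guard against empty classes: a zero denominator raises ZeroDivisionError.
    """
    sn = tp / (tp + fn)
    sp = tp / (tp + fp)
    pp, pn = tp + fp, tn + fn  # predicted positives / negatives
    ap, an = tp + fn, tn + fp  # actual positives / negatives
    cc = (tp * tn - fp * fn) / math.sqrt(pp * pn * ap * an)
    return sn, sp, cc


if __name__ == "__main__":
    # Made-up counts: 80 coding bases found, 20 missed, 20 false calls.
    print(accuracy_measures(tp=80, tn=900, fp=20, fn=20))
```

A perfect prediction (no false positives or false negatives) gives CC = 1, and a prediction that is wrong everywhere gives CC = -1.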

Page 34: Gene Prediction Ppt

CONCLUSION

Gene prediction aims to identify the regions of genomic DNA that encode proteins.

Gene finding based on homology evidence: BLAST, FASTA, BLAT, etc.

Content-based methods: CpG islands, GC content, hexamer repeats, composition statistics, codon frequencies

Feature-based methods: donor sites, acceptor sites, promoter sites, start/stop codons, polyA signals, feature lengths

Similarity-based methods: sequence homology, EST searches

Pattern-based methods: HMMs, artificial neural networks

Tools based on these methods: GRAIL, FGENES, GENSCAN, MZEF, GENEID.

The best-performing tools in the comparison were MZEF and GENSCAN.

Page 35: Gene Prediction Ppt

REFERENCES

Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, by Andreas D. Baxevanis

Bioinformatics: Sequence and Genome Analysis, by David W. Mount

Google search tool

Wikipedia search tool

Page 36: Gene Prediction Ppt

THANK YOU FOR YOUR PATIENCE!