Gene Prediction Alejandro Giorgetti

Computational Approaches to Gene Finding · molsim.sci.univr.it/2013_bioinfo2/geneFinding_ale.pdf · 2012-01-27

Annotation

• Defining the biological role of a molecule in all its complexity and mapping this knowledge onto the relevant gene products encoded by genomes.

• Computational gene prediction is the cornerstone upon which a genome annotation is built.

1. Basic Information

• What types of predictions can we make?

• What are they based on?

• Each gene prediction is a hypothesis waiting to be tested, and the results of testing then inform our next set of hypotheses

Bioinformatics as Extrapolation

• Computational gene finding is a process of:

– Identifying common phenomena in known genes

– Building a computational framework/model that can accurately describe the common phenomena

– Using the model to scan uncharacterized sequence to identify regions that match the model, which become putative genes

– Testing and validating the predictions

Different Types of Gene Finding

• RNA genes
  – tRNA, rRNA, snRNA, snoRNA, microRNA
• Protein-coding genes
  – Prokaryotic
    • No introns, simpler regulatory features
  – Eukaryotic
    • Exon-intron structure
    • Complex regulatory features

Approaches to Gene Finding

• Direct
  – Exact or near-exact matches of EST, cDNA, or proteins from the same, or a closely related, organism
• Indirect
  1. Look for something that looks like one gene (homology)
  2. Look for something that looks like all genes (ab initio)
  3. Hybrid, combining homology and ab initio (and perhaps even direct) methods

Prokaryotes

• Identify long Open Reading Frames (ORFs), likely to code for proteins.
  – Identification of the start codon ATG that maximizes the length of the ORF.
  – Stop codons (UAA, UAG, UGA in the mRNA; TAA, TAG, TGA in the DNA)
  – Pribnow box (TATAAT), the -35 sequence or ribosomal binding sites: transcriptional and translational start sites.
  – Codon bias (codon usage).
• GLIMMER, GeneMark: 90% for both sensitivity and specificity.
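
The "longest ORF" idea above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions (forward strand only, ATG as the sole start codon, no codon-bias or promoter signals); the function name `find_orfs`, the `min_codons` parameter and the example sequence are hypothetical and are not how GLIMMER or GeneMark work internally.

```python
# Minimal ORF scan on the forward strand of a DNA string (illustrative only).
STOP_CODONS = {"TAA", "TAG", "TGA"}  # DNA spelling of UAA, UAG, UGA
START_CODON = "ATG"


def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for the longest ORF ending at each stop codon.

    Coordinates are 0-based and end-exclusive; the stop codon is included.
    min_codons counts codons from the ATG up to, but not including, the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        first_atg = None  # leftmost in-frame ATG seen since the last stop
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if codon == START_CODON and first_atg is None:
                first_atg = pos
            elif codon in STOP_CODONS:
                if first_atg is not None and (pos - first_atg) // 3 >= min_codons:
                    orfs.append((first_atg, pos + 3, frame))
                first_atg = None  # a stop closes the current reading frame
    return orfs


# Hypothetical toy sequence: CC ATG AAA TGG GCT TTT TAA CC
example = "CCATGAAATGGGCTTTTTAACC"
orfs = find_orfs(example)
```

Taking the leftmost ATG is the "greedy" choice; a real gene finder would weigh alternative start sites using the signals listed above.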

Pieces of a (Eukaryotic) Gene (on the genome)

[Figure: double-stranded genomic DNA (5’→3’ over 3’→5’), ~1-100 Mbp, with a zoomed-in gene region of ~1-1000 kbp]

• exons (CDS & UTR): ~10^2-10^3 bp
• introns: ~10^2-10^5 bp
• polyadenylation site
• promoter: ~10^3 bp
• enhancers: ~10^1-10^2 bp
• other regulatory sequences: ~10^1-10^2 bp

What is it about genes that we can measure (and model)?

• Most of our knowledge is biased towards protein-coding characteristics
  – ORF (Open Reading Frame): a sequence defined by an in-frame AUG and stop codon, which in turn defines a putative amino acid sequence.
  – Codon Usage: most frequently measured by CAI (Codon Adaptation Index)
• Other phenomena
  – Nucleotide frequencies and correlations:
    • value and structure
  – Functional sites:
    • splice sites, promoters, UTRs, polyadenylation sites

A simple measure: ORF length

[Figure: Comparison of Annotation and Spurious ORFs in S. cerevisiae]

Basrai MA, Hieter P, and Boeke J. Genome Research 1997 7:768-771

Codon Adaptation Index (CAI)

$$
\mathrm{CAI} = \prod_{i \,\in\, \text{codons}} \frac{f_{\text{codon}_i}}{f(\text{codon}_i)_{\max}}
$$

• Parameters are empirically determined by examining a “large” set of example genes
• This is not perfect
  – Genes sometimes have unusual codons for a reason
  – The predictive power is dependent on length of sequence
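
A minimal sketch of the calculation, assuming that f(codon)_max is the frequency of the most used synonymous codon for the same amino acid. The codon families and reference counts below are made-up toy values, and the function names are hypothetical. The sketch returns the geometric mean of the per-codon ratios, which is how CAI is commonly defined so that the score does not shrink with gene length; the plain product in the formula above is this value raised to the number of codons.

```python
import math

# Hypothetical toy data: synonymous codon families for three amino acids and
# invented codon counts from a "reference set" of example genes.
SYNONYMS = {
    "K": ["AAA", "AAG"],
    "F": ["TTT", "TTC"],
    "G": ["GGT", "GGC", "GGA", "GGG"],
}
REFERENCE_COUNTS = {
    "AAA": 80, "AAG": 20,
    "TTT": 30, "TTC": 60,
    "GGT": 50, "GGC": 25, "GGA": 10, "GGG": 5,
}


def relative_adaptiveness(counts, synonyms):
    """w(codon) = f(codon) / max f over its synonymous codons."""
    weights = {}
    for family in synonyms.values():
        f_max = max(counts[c] for c in family)
        for codon in family:
            weights[codon] = counts[codon] / f_max
    return weights


def cai(cds, weights):
    """Geometric mean of w over the codons of cds that have a weight.

    Assumes every weight is > 0 (true for the toy counts above).
    """
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    logs = [math.log(weights[c]) for c in codons if c in weights]
    if not logs:
        raise ValueError("no scorable codons")
    return math.exp(sum(logs) / len(logs))


weights = relative_adaptiveness(REFERENCE_COUNTS, SYNONYMS)
best = cai("AAATTCGGT", weights)   # every codon is the preferred one
mixed = cai("AAGTTTGGT", weights)  # two rarer codons lower the score
```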

General Things to Remember about (Protein-coding) Gene Prediction Software

• It is, in general, organism-specific
• It works best on genes that are reasonably similar to something seen previously
• It finds protein-coding regions far better than non-coding regions
• In the absence of external (direct) information, alternative forms will not be identified
• It is imperfect! (It’s biology, after all…)

2. Classes of Information

• Extrinsic information: cDNA, ESTs, protein products; evidence that is not itself a genome sequence.

• Intrinsic information: de novo gene predictors use the sequences of one or more genomes as their only input. Ab initio predictors do not use informant genomes.

Extrinsic Information

• BLAST family, FASTA, etc.
  – Pros: fast, statistically well founded
  – Cons: no understanding/model of gene structure
• BLAT, Sim4, EST_GENOME, etc.
  – Pros: gene structure is incorporated
  – Cons: non-canonical splicing, slower than BLAST

Intrinsic Information

• The most ab initio approach would be a program that simulates the transcription, splicing and post-processing of a transcript using only the information present in the cell.

• Our understanding is still rudimentary, so we must rely on metrics derived from known genes and features.
  – Signal sensors
  – Content sensors

Intrinsic Information: Signals

• Nucleic acid motifs that are recognized by the cellular machinery.
  • Start and stop codons
  • Donor and acceptor splice sites
  • Splicing enhancers
  • Silencer elements
  • Transcription start and termination sites
  • Polyadenylation signals
  • Proximal and distal promoter sequences
• Signals modeled as Position Weight Matrices (PWM)
  • Capture the intrinsic variability characteristics.
  • Derived from a set of aligned sequences which are functionally related.
  • Sliding windows.
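
As a rough sketch of how a PWM is derived from aligned sites and slid along a sequence: the aligned "sites" below are invented examples, and the uniform background of 0.25, the pseudocount and the function names `build_pwm` and `scan` are assumptions made for the illustration.

```python
import math

BASES = "ACGT"


def build_pwm(aligned_sites, background=0.25, pseudocount=1.0):
    """Log-odds PWM: one dict per column mapping base -> log2(p / background)."""
    length = len(aligned_sites[0])
    pwm = []
    for col in range(length):
        counts = {b: pseudocount for b in BASES}
        for site in aligned_sites:
            counts[site[col]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2(counts[b] / total / background) for b in BASES})
    return pwm


def scan(sequence, pwm):
    """Score every window of len(pwm) bases; returns a list of (offset, score)."""
    width = len(pwm)
    scores = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        scores.append((start, score))
    return scores


# Hypothetical aligned donor-site-like examples (not real training data).
sites = ["GTAAGT", "GTGAGT", "GTAAGA", "GTAAGT"]
pwm = build_pwm(sites)
hits = scan("CCAGGTAAGTCC", pwm)
best_offset, best_score = max(hits, key=lambda h: h[1])
```

In practice a score threshold decides which windows are reported as candidate signals.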

Intrinsic Information: Content

• Proper detection of the start and end codons is still a challenge.
• Coding bias in sequence composition:
  • Uneven usage of amino acids in real proteins
  • Uneven usage of synonymous codons
• Coding statistics are functions that compute a real number related to the likelihood that a given DNA sequence codes for protein.
  • Codon or di-codon usage
  • Periodicity in base occurrence (GRAIL)
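
One simple coding statistic of this kind is a codon-usage log-odds score. The sketch below assumes a uniform background over all 64 codons and uses two invented "training" sequences; the function names are hypothetical and this is not the measure used by any particular program.

```python
import math
from collections import Counter


def codon_frequencies(coding_seqs, pseudocount=1.0):
    """Estimate codon frequencies from in-frame coding sequences."""
    counts = Counter()
    for seq in coding_seqs:
        for i in range(0, len(seq) - len(seq) % 3, 3):
            counts[seq[i:i + 3]] += 1
    codons = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"]
    total = sum(counts.values()) + pseudocount * len(codons)
    return {c: (counts[c] + pseudocount) / total for c in codons}


def coding_score(window, freqs, frame=0):
    """Average log2-odds per codon of `window` read in `frame`, versus a
    background in which all 64 codons are equally likely."""
    codons = [window[i:i + 3] for i in range(frame, len(window) - 2, 3)]
    if not codons:
        return 0.0
    return sum(math.log2(freqs[c] * 64) for c in codons) / len(codons)


training = ["ATGGCTGCTGAAGCTTAA", "ATGGAAGCTGCTGAATAA"]  # hypothetical
freqs = codon_frequencies(training)
window = "GCTGAAGCTGCT"
scores = [coding_score(window, freqs, frame) for frame in range(3)]
```

Here the window scores positive only in the frame whose codons resemble the training set, which is the behaviour a content sensor relies on.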

Conservation

• When one or more informant genomes are available it is possible to detect the characteristic conservation pattern of coding sequence and use it as an orthogonal measure for coding potential.

• TWINSCAN
• GeneID
• SGP2
• CONTRAST
• Genscan
• Fgenesh

3. Some Mathematical Concepts and Definitions

• Models

• Bayesian Statistics

• Markov Models & Hidden Markov Models

Models in Computational (Molecular) Biology

• In gene finding, models can best be thought of as “sequence generators” (e.g., Hidden Markov Models) or “sequence classifiers” (e.g., Neural Networks)

• The better (and usually more complex) a model is, the better the performance is likely to be

Models of Sequence Generation: Markov Chains

• Different types of structure components (exons or introns) are each characterized by a state

• The gene model is thought to be generated by a state machine: going from 5’ to 3’, each base is generated with an emission probability conditioned on the current state, and each move from one state to another is governed by a transition probability (an intron can only follow an exon, the reading frames of two adjacent exons should be compatible, etc.)

• All the parameters of the emission probabilities and transition probabilities are learned.

Basic HMM

Described by
• a set of possible states: start, exon, donor, intron, acceptor, stop
• a set of possible observations: A, C, G, T
• a transition probability matrix
• an emission probability matrix (frequencies of nucleotides occurring in a particular state)
• initial state probabilities

Models of Sequence Generation: Markov Chains

• A Markov chain is a model for stochastic generation of sequential phenomena

• Every position in a chain is equivalent

• The order of the Markov chain is the number of previous positions on which the current position depends
  – e.g., in nucleic acid sequence, 0-order is mononucleotide, 1st-order is dinucleotide, 2nd-order is trinucleotide, etc.

• The model parameters are the frequencies of the elements at each position (possibly as a function of preceding elements)

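Markov Chains as Models of Sequence Generation

For a sequence $s = s_1 s_2 s_3 s_4 \cdots s_N$:

• 0th-order

$$P_0(s) = p(s_1)\cdot p(s_2)\cdot p(s_3)\cdots = \prod_{i=1}^{N} p(s_i)$$

• 1st-order

$$P_1(s) = p(s_1)\cdot p(s_2 \mid s_1)\cdot p(s_3 \mid s_2)\cdots = p(s_1)\cdot\prod_{i=2}^{N} p(s_i \mid s_{i-1})$$

• 2nd-order

$$P_2(s) = p(s_1 s_2)\cdot p(s_3 \mid s_1 s_2)\cdot p(s_4 \mid s_2 s_3)\cdots = p(s_1 s_2)\cdot\prod_{i=3}^{N} p(s_i \mid s_{i-2} s_{i-1})$$

Example, for $s = \mathtt{ttacggt}$:

$$P_0(s) = p(t)\cdot p(t)\cdot p(a)\cdot p(c)\cdot p(g)\cdots = \prod_{i=1}^{N} p(s_i)$$

$$P_1(s) = p(t)\cdot p(t \mid t)\cdot p(a \mid t)\cdot p(c \mid a)\cdots = p(s_1)\cdot\prod_{i=2}^{N} p(s_i \mid s_{i-1})$$

$$P_2(s) = p(tt)\cdot p(a \mid tt)\cdot p(c \mid ta)\cdot p(g \mid ac)\cdots = p(s_1 s_2)\cdot\prod_{i=3}^{N} p(s_i \mid s_{i-2} s_{i-1})$$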
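
The products for a chain of a given order can be sketched as follows. This is an illustrative sketch, not a reference implementation: the function names, the pseudocount and the uniform fallback for unseen contexts are assumptions, and the initial term (p(s_1) for 1st order, p(s_1 s_2) for 2nd order) is left out to keep the code short.

```python
import math
from collections import defaultdict


def train_markov(sequences, order, alphabet="acgt", pseudocount=1.0):
    """Return {context: {symbol: p(symbol | context)}} for a fixed order."""
    counts = defaultdict(lambda: {a: pseudocount for a in alphabet})
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    model = {}
    for context, row in counts.items():
        total = sum(row.values())
        model[context] = {a: row[a] / total for a in alphabet}
    return model


def log_prob(seq, model, order, alphabet="acgt"):
    """Log of the product over positions i >= order of p(s_i | previous symbols).

    The initial term for the first `order` symbols is omitted, and unseen
    contexts fall back to a uniform distribution (both are simplifications).
    """
    uniform = 1.0 / len(alphabet)
    total = 0.0
    for i in range(order, len(seq)):
        row = model.get(seq[i - order:i])
        total += math.log(row[seq[i]] if row else uniform)
    return total


model0 = train_markov(["ttacggt"], order=0)  # mononucleotide frequencies
model1 = train_markov(["ttacggt"], order=1)  # dinucleotide transitions
lp0 = log_prob("tt", model0, order=0)
lp1 = log_prob("ta", model1, order=1)
```

For order 0 the context is the empty string, so the same code covers the mononucleotide case.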

Hidden Markov Models

In general, sequences are not monolithic, but can be made up of discrete segments

Hidden Markov Models (HMMs) allow us to model complex sequences, in which the character emission probabilities depend upon the state

Think of an HMM as a probabilistic or stochastic sequence generator, and what is hidden is the current state of the model

HMM Details

An HMM is completely defined by its:
• State-to-state transition matrix (Φ)
• Emission matrix (H)
• State vector (x)

We want to determine the probability of any specific (query) sequence having been generated by the model; with multiple models, we then use Bayes’ rule to determine the best model for the sequence

Two algorithms are typically used for the likelihood calculation:
• Viterbi
• Forward

Models are trained with known examples
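
The model-comparison step can be written out explicitly. The notation here is an assumption added for illustration: $M_1, \dots, M_K$ are the candidate models, $P(s \mid M_k)$ is the likelihood of the query sequence $s$ under model $M_k$, and $P(M_k)$ is a prior over models.

```latex
P(M_k \mid s) \;=\; \frac{P(s \mid M_k)\,P(M_k)}{\sum_{j=1}^{K} P(s \mid M_j)\,P(M_j)}
\qquad\text{and, for two models,}\qquad
\log\frac{P(M_1 \mid s)}{P(M_2 \mid s)}
  \;=\; \log\frac{P(s \mid M_1)}{P(s \mid M_2)} \;+\; \log\frac{P(M_1)}{P(M_2)}
```

With equal priors the last term vanishes and the best model is simply the one with the highest likelihood.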

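The HMM Matrices: Φ and H

$$
\Phi = \begin{bmatrix}
0 & 0 & 0 & 0 \\
0.5 & 0.998 & 0.002 & 0 \\
0.5 & 0.001 & 0.996 & 0 \\
0 & 0.001 & 0.002 & 0
\end{bmatrix}
\qquad
H = \begin{bmatrix}
0.28 & 0.22 & 0.25 & 0.25 \\
0.32 & 0.18 & 0.18 & 0.32
\end{bmatrix}
$$

• $x_m(i)$ = probability of being in state $m$ at position $i$
• $H(m, y_i)$ = probability of emitting character $y_i$ in state $m$
• $\Phi_{mk}$ = probability of transition from state $k$ to state $m$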

A more realistic (and complex) HMM model for Gene Prediction (Genie)

Kulp, D., PhD Thesis, UCSC 2003

Scoring an HMM: Viterbi, Forward, and Forward-Backward

• Two algorithms are typically used for the likelihood calculation: Viterbi and Forward
• Viterbi is an approximation
  – The probability of the sequence is determined by using the most likely mapping of the sequence to the model
  – In many cases good enough (gene finding, e.g.), but not always
• Forward is the rigorous calculation
  – The probability of the sequence is determined by summing over all mappings of the sequence to the model
• Forward-Backward produces a probabilistic map of the model to the sequence
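
The difference between the two calculations can be seen in a small sketch. The two-state model below is hypothetical (the state names, initial probabilities and transition values are invented for illustration), and plain probabilities are used instead of log-space arithmetic, which is only safe for short sequences.

```python
# Hypothetical two-state HMM: an AT-rich state and a uniform state.
STATES = ("at_rich", "uniform")
INITIAL = {"at_rich": 0.5, "uniform": 0.5}
# TRANSITION[k][m] = probability of moving from state k to state m
# (note: this is indexed from-then-to, the transpose of a matrix whose
# entry (m, k) is the transition from k to m).
TRANSITION = {
    "at_rich": {"at_rich": 0.9, "uniform": 0.1},
    "uniform": {"at_rich": 0.1, "uniform": 0.9},
}
EMISSION = {
    "at_rich": {"A": 0.32, "C": 0.18, "G": 0.18, "T": 0.32},
    "uniform": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}


def forward(seq):
    """P(seq | model): sums over all state paths."""
    f = {m: INITIAL[m] * EMISSION[m][seq[0]] for m in STATES}
    for symbol in seq[1:]:
        f = {m: EMISSION[m][symbol] * sum(f[k] * TRANSITION[k][m] for k in STATES)
             for m in STATES}
    return sum(f.values())


def viterbi(seq):
    """Return (probability of the single best path, that path)."""
    v = {m: INITIAL[m] * EMISSION[m][seq[0]] for m in STATES}
    back = []  # one dict of back-pointers per position after the first
    for symbol in seq[1:]:
        pointers, new_v = {}, {}
        for m in STATES:
            best_k = max(STATES, key=lambda k: v[k] * TRANSITION[k][m])
            pointers[m] = best_k
            new_v[m] = v[best_k] * TRANSITION[best_k][m] * EMISSION[m][symbol]
        back.append(pointers)
        v = new_v
    last = max(STATES, key=lambda m: v[m])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return v[last], path


p_best, path = viterbi("AAAA")
p_all = forward("AAAA")
```

The Viterbi value can never exceed the Forward value, because the best path is one of the terms in the Forward sum.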

Eukaryotic Gene Prediction: GRAIL II: Neural network based prediction

(Uberbacher and Mural 1991; Uberbacher et al. 1996)

Open Challenges in Predicting Eukaryotic (Protein-Coding) Genes

• Alternative Processing of Transcripts
  – Splice variants, Start/stop variants
• Overlapping Genes
  – Mostly UTRs or intronic, but coding is possible
• Non-canonical functional elements
  – Splicing without GT-AG
• UTR predictions
  – Especially with introns
• Small (mini) exons

Open Challenges in Predicting Prokaryotic (Protein-Coding) Genes

• Start site prediction
  – Most algorithms are greedy, taking the largest ORF
• Overlapping Genes
  – This can be very problematic, esp. with use of Viterbi-like algorithms
• Non-canonical coding

Eukaryotic gene prediction tools and web servers

• Genscan (ab initio), GenomeScan (hybrid)
  – http://genes.mit.edu/
• Twinscan (hybrid)
  – http://genes.cs.wustl.edu/
• FGENESH (ab initio)
  – http://www.softberry.com/berry.phtml?topic=gfind
• GeneMark.hmm (ab initio)
  – http://opal.biology.gatech.edu/GeneMark/eukhmm.cgi
• MZEF (ab initio)
  – http://rulai.cshl.org/tools/genefinder/
• GrailEXP (hybrid)
  – http://grail.lsd.ornl.gov/grailexp/
• GeneID (hybrid)
  – http://www1.imim.es/geneid.html

Prokaryotic Gene Prediction

• Glimmer
  – http://www.tigr.org/~salzberg/glimmer.html
• GeneMark
  – http://opal.biology.gatech.edu/GeneMark/gmhmm2_prok.cgi
• Critica
  – http://www.ttaxus.com/index.php?pagename=Software
• ORNL Annotation Pipeline
  – http://compbio.ornl.gov/GP3/pro.shtml

Non-protein Coding Gene Tools and Information

• tRNA
  – tRNAscan-SE
    • http://www.genetics.wustl.edu/eddy/tRNAscan-SE/
  – FAStRNA
    • http://bioweb.pasteur.fr/seqanal/interfaces/fastrna.html
• snoRNA
  – snoRNA database
    • http://rna.wustl.edu/snoRNAdb/
• microRNA
  – Sfold
    • http://www.bioinfo.rpi.edu/applications/sfold/index.pl
  – SIRNA
    • http://bioweb.pasteur.fr/seqanal/interfaces/sirna.html

Assessing performance: Sensitivity and Specificity

• Testing of predictions is performed on sequences where the gene structure is known
• Sensitivity is the fraction of known genes (or bases or exons) correctly predicted
  – “Am I finding the things that I’m supposed to find?”
• Specificity is the fraction of predicted genes (or bases or exons) that correspond to true genes
  – “What fraction of my predictions are true?”
• In general, increasing one decreases the other

Graphic View of Specificity and Sensitivity

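$$
Sn = \frac{\text{TruePositive}}{\text{AllTrue}} = \frac{\text{TruePositive}}{\text{TruePositive} + \text{FalseNegative}}
$$

$$
Sp = \frac{\text{TruePositive}}{\text{AllPositive}} = \frac{\text{TruePositive}}{\text{TruePositive} + \text{FalsePositive}}
$$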
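
At the nucleotide level, Sn and Sp can be computed from two sets of positions. The coordinates below are hypothetical and the function name is an assumption; returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def sensitivity_specificity(true_positions, predicted_positions):
    """Return (Sn, Sp) with Sn = TP/(TP+FN) and Sp = TP/(TP+FP)."""
    true_set, pred_set = set(true_positions), set(predicted_positions)
    tp = len(true_set & pred_set)   # coding bases that were predicted
    fn = len(true_set - pred_set)   # coding bases that were missed
    fp = len(pred_set - true_set)   # predicted bases that are not coding
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp


annotated = range(100, 200)  # hypothetical annotated coding bases
predicted = range(150, 230)  # hypothetical predicted coding bases
sn, sp = sensitivity_specificity(annotated, predicted)
```

Note that "specificity" as defined here, TP/(TP+FP), is the quantity that other fields usually call precision or positive predictive value.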

Quantifying the tradeoff: Correlation Coefficient

$$
CC = \frac{(TP)(TN) - (FP)(FN)}{\sqrt{(AN)(PP)(AP)(PN)}}
$$

$$
AN = TN + FP; \quad AP = TP + FN; \quad PP = TP + FP; \quad PN = TN + FN
$$
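
A direct transcription of the correlation coefficient into code, as a sketch. The counts in the example are invented, and returning 0.0 when the denominator is zero (where CC is undefined) is a convention assumed here.

```python
import math


def correlation_coefficient(tp, tn, fp, fn):
    """CC = (TP*TN - FP*FN) / sqrt(AP * AN * PP * PN)."""
    ap, an = tp + fn, tn + fp  # actual positives, actual negatives
    pp, pn = tp + fp, tn + fn  # predicted positives, predicted negatives
    denominator = math.sqrt(ap * an * pp * pn)
    if denominator == 0:
        return 0.0  # undefined case; 0.0 is a convention, not a result
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts for a 1000-base test sequence.
cc = correlation_coefficient(tp=50, tn=870, fp=30, fn=50)
```

CC ranges from -1 (every call wrong) through 0 (no better than chance) to 1 (perfect prediction).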

Specificity/Sensitivity Tradeoffs

• Ideal Distribution of Scores

[Figure: histogram of count (0-1200) versus score (arbitrary units, 0-50) for two series, random sequence and true sites]

• More Realistically…

[Figure: histogram with the same axes and the same two series, random sequence and true sites]
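
When the two score distributions overlap, the choice of score cutoff trades sensitivity against specificity. The sketch below uses small invented score lists and a hypothetical `sweep` helper; reporting Sp as 1.0 when nothing is predicted is a convention assumed here.

```python
def sweep(true_scores, random_scores, thresholds):
    """For each cutoff t, call everything scoring >= t a site and
    return a list of (t, Sn, Sp) tuples."""
    rows = []
    for t in thresholds:
        tp = sum(s >= t for s in true_scores)
        fn = len(true_scores) - tp
        fp = sum(s >= t for s in random_scores)
        sn = tp / (tp + fn)
        sp = tp / (tp + fp) if tp + fp else 1.0  # no predictions: convention
        rows.append((t, sn, sp))
    return rows


true_scores = [22, 25, 28, 30, 33, 35, 38, 41]      # hypothetical true sites
random_scores = [5, 8, 10, 12, 15, 18, 24, 27, 31]  # hypothetical random sequence
table = sweep(true_scores, random_scores, [10, 20, 30, 40])
```

With these toy scores, raising the cutoff lowers Sn and raises Sp, which is the tradeoff the overlapping histograms depict.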