Upload
others
View
3
Download
0
Embed Size (px)
Citation preview
Gene Prediction
Alejandro Giorgetti
Annotation
• Defining the bological role of a molecule in all itscomplexity and mapping this knowledge onto the relevant gene products encoded by genomes.
• Computational gene prediction is the cornerstoneupon which a genome annotation is built
1. Basic Information
• What types of predictions can we make?
• What are they based on?
• Each gene prediction is a hypothesis waiting to betested and the results of testing then inform our nextset of hypothesis
Bioinformatics asExtrapolation
• Computational gene finding is a process of:
– Identifying common phenomena in known genes
– Building a computational framework/model that can accurately describe the common phenomena
– Using the model to scan uncharacterized sequence toidentify regions that match the model, which becomeputative genes
– Test and validate the predictions
Different Types of Gene Finding
• RNA genes– tRNA, rRNA, snRNA, snoRNA, microRNA
• Protein coding genes– Prokaryotic
• No introns, simpler regulatory features– Eukaryotic
• Exon-intron structure• Complex regulatory features
Approaches to Gene Finding
• Direct– Exact or near-exact matches of EST, cDNA, or Proteins
from the same, or closely related organism
• Indirect1. Look for something that looks like one gene (homology)
1. Look for something that looks like all genes (ab initio)
1. Hybrid, combining homology and ab initio (and perhaps even direct) methods
Prokaryotes• Identify long Open Reading Frames (ORF), likely to
code for proteins.– Identification of start codon ATG that maximizes the length
of the ORF.– End codons (UAA,UAG, UGA)– Pribnow box (TATAAT), the -35 sequence or ribosomal
binding sites: transcriptional and translational start sites.– Codon bias (codon usage).
• GLIMMER, GeneMark. 90% for both sensitivity and specificity.
Pieces of a (Eukaryotic) Gene
(on the genome)
5’
3’
3’
5’
~ 1-100 Mbp
5’
3’
3’
5’……
……
~ 1-1000 kbp
exons (cds & utr) / introns(~ 102-103 bp) (~ 102-105 bp)
Polyadenylationsite
promoter (~103 bp)
enhancers (~101-102 bp) other regulatory sequences(~ 101-102 bp)
What is it about genes thatwe can measure (and
model)?• Most of our knowledge is biased towards protein-
coding characteristics– ORF (Open Reading Frame): a sequence defined by in-
frame AUG and stop codon, which in turn defines a putative amino acid sequence.
– Codon Usage: most frequently measured by CAI (Codon Adaptation Index)
• Other phenomena– Nucleotide frequencies and correlations:
• value and structure– Functional sites:
• splice sites, promoters, UTRs, polyadenylation sites
A simple measure: ORF length Comparison of Annotation and Spurious ORFs in S. cerevisiae
Basrai MA, Hieter P, and Boeke J Genome Research 1997 7:768-771
Codon Adaptation Index(CAI)
CAI= ∏i= codons
f codon i
f �codoni�max
• Parameters are empirically determined by examininga “large” set of example genes
• This is not perfect– Genes sometimes have unusual codons for a reason– The predictive power is dependent on length of sequence
General Things to Rememberabout (Protein-coding) Gene
Prediction Software• It is, in general, organism-specific
• It works best on genes that are reasonably similar tosomething seen previously
• It finds protein coding regions far better than non-codingregions
• In the absence of external (direct) information, alternative forms will not be identified
• It is imperfect! (It’s biology, after all…)
2. Classes of Information
• Extrinsic information: cDNA, EST, protein products. Not itself a genome sequence.
• Intrinsic information: De novo gene predictors usingthe sequences of one or more genomes as their onlyinput. Ab initio: do not use informant genomes.
Extrinsic Information
• BLAST family, FASTA, etc.– Pros: fast, statistically well founded– Cons: no understanding/model of gene structure
• BLAT, Sim4, EST_GENOME, etc.– Pros: gene structure is incorporated– Cons: non-canonical splicing, slower than blast
Intrinsic Information
• The most ab initio would be a program that simulates the transcription, splicing and postprocessing of a transcript using only the information present in the cell.
• Rudimentary understanding: we must rely on metrics derived from known genes and features.– Signal sensors– Content sensors
Intrinsic Information: Signals• Nucleic acid motifs that are recognized by the cellular
machinery.• Start and stop codons.• Donor and acceptor splice sites• Splicing enhancers• Silencer elements• Transcription start and termination sites• Polyadenylation signals• Proximal and distal promoter sequences
• Signals modeled as Position Weight Matrices (PWM)• Capture the intrinsic variability characteristics.• Derived from a set of aligned sequences which are funtinally
related.• Sliding windows.
Intrinsic Information: Content
• Proper detection the start and end codons is still a challenge. Coding bias in sequence composition.
• Uneven usage of aa in real proteins.• Uneven usage of synonymous codons• Coding statistcs as functions that compute a real number
related to the likelihood that a given DNA sequence codes for protein.
• Codon or di-codon usage• Periodicity in base occurrence (GRAIL).
Conservation
• When one or more informant genomes are available it is possible to detect the characteristic conservation pattern of coding sequence and use it as an orthogonal measure for coding potential.
• TWINSCAN• GeneID• SGP2• CONTRAST• Genscan• Fgenesh
2. Some MathematicalConcepts and Definitions
• Models
• Bayesian Statistics
• Markov Models & Hidden Markov Models
Models in Computational(Molecular) Biology
• In gene finding, models can best be thought of as “sequence generators” (e.g., Hidden Markov Models) or “sequence classifiers” (e.g., Neural Networks)
• The better (and usually more complex) a model is, the better the performance is likely to be
Models of Sequence Generation:Markov Chains
• Different types of structure components (exons or introns) are characterized by a state
• The gene model is thought to be generated by a state machine: starting form 5’ to 3’,each base is generated by a emission probability conditioned on the current state, and the transition from one state to an other: transition probability (an intron can only follow an exon, reading frames of to adjacent exonsshould be compatible, etc.)
• All the parameters of the emission probabilities and transition probabilities are learned.
Basic HMM
Described bya set of possible states: start, exon, donor, intron, acceptor, stop set of possible observations: A,C,G,Ttransition probability matrixemision probability matrix (frequencies of nucleotidesoccurring in a particular state)initial state probabilities.
Models of SequenceGeneration:
Markov Chains
• A Markov chain is a model for stochastic generation of sequential phenomena
• Every position in a chain is equivalent
• The order of the Markov chain is the number of previous positions on which the current position depends– e.g., in nucleic acid sequence, 0-order is mononucleotide, 1st-order
is dinucleotide, 2nd-order is trinucleotide, etc.
• The model parameters are the frequencies of the elements at each position (possibly as a function of preceding elements)
Markov Chains as Models ofSequence Generation
• 0th-order
• 1st-order
• 2nd-order
( ) ( ) ( ) ( ) ( ) ( )∏=
−⋅=⋅⋅=N
iii sssssssssP
211231211 |pp|p|pp L
( ) ( ) ( ) ( ) ( ) ( )∏=
−−⋅=⋅⋅=N
iiii ssssssssssssssP
31221324213212 |pp|p|pp L
( ) ( ) ( ) ( ) ( )∏=
=⋅⋅=N
iisssssP
13210 pppp L
L4321 sssss =
( ) ( ) ( ) ( ) ( ) ( ) ( )∏=
−⋅=⋅⋅⋅=N
iii sssactatttsP
2111 |pp|p|p|pp L
( ) ( ) ( ) ( ) ( ) ( ) ( )∏=
−−⋅=⋅⋅⋅=N
iiii sssssacgtacttattsP
312212 |pp|p|p|pp L
( ) ( ) ( ) ( ) ( ) ( ) ( )∏=
=⋅⋅⋅⋅=N
iisgcattsP
10 pppppp L
ttacggts =
Hidden Markov Models
In general, sequences are not monolithic, but can be made up of discrete segments
Hidden Markov Models (HMMs) allow us to model complex sequences, in which the character emissionprobabilities depend upon the state
Think of an HMM as a probabilistic or stochastic sequence generator, and what is hidden is the current state of the model
HMM DetailsAn HMM is completely defined by its:
State-to-state transition matrix (Φ)Emission matrix (H)State vector (x)
We want to determine the probability of any specific (query) sequence having been generated by the model; with multiple models, we then use Bayes’ rule to determine the best model for the sequence
Two algorithms are typically used for the likelihood calculation:ViterbiForward
Models are trained with known examples
The HMM Matrixes: Φ and H
⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢
⎣
⎡
=Φ
0002.0001.000996.0001.05.00002.0998.05.00000
⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢
⎣
⎡
=
32.018.018.032.0
25.025.022.028.0
H
xm(i) = probability of being in state m at position i;
H(m,yi) = probability of emitting character yi in state m;
Φmk = probability of transition from state k to m.
A more realistic (and complex) HMM model for Gene Prediction (Genie)
Kulp, D., PhD Thesis, UCSC 2003
Scoring an HMM: Viterbi, Forward, and Forward-
BackwardTwo algorithms are typically used for the likelihood
calculation: Viterbi and ForwardViterbi is an approximation;
The probability of the sequence is determined by using the most likely mapping of the sequence to the model
in many cases good enough (gene finding, e.g.), but not always
Forward is the rigorous calculation;The probability of the sequence is determined by summing over
all mappings of the sequence to the model
Forward-Backward produces a probabilistic map of the model to the sequence
Eukaryotic Gene Prediction: GRAIL II: Neural network based
prediction
(Uberbacher and Mural 1991; Uberbacher et al. 1996)
Open Challenges in Predicting Eukaryotic
(Protein-Coding) Genes• Alternative Processing of Transcripts
– Splice variants, Start/stop variants
• Overlapping Genes– Mostly UTRs or intronic, but coding is possible
• Non-canonical functional elements– Splice w/o GT-AG,
• UTR predictions– Especially with introns
• Small (mini) exons
Open Challenges in Predicting Prokaryotic
(Protein-Coding) Genes• Start site prediction
– Most algorithms are greedy, taking the largest ORF
• Overlapping Genes– This can be very problematic, esp. with use of Viterbi-like
algorithms
• Non-canonical coding
Eukaryotic gene predictiontools and web servers
• Genscan (ab initio), GenomeScan (hybrid) – (http://genes.mit.edu/)
• Twinscan (hybrid)– (http://genes.cs.wustl.edu/)
• FGENESH (ab initio)– (http://www.softberry.com/berry.phtml?topic=gfind)
• GeneMark.hmm (ab initio)– (http://opal.biology.gatech.edu/GeneMark/eukhmm.cgi)
• MZEF (ab initio)– (http://rulai.cshl.org/tools/genefinder/)
• GrailEXP (hybrid)– (http://grail.lsd.ornl.gov/grailexp/)
• GeneID (hybrid)– (http://www1.imim.es/geneid.html)
Prokaryotic Gene Prediction
• Glimmer– http://www.tigr.org/~salzberg/glimmer.html
• GeneMark– http://opal.biology.gatech.edu/GeneMark/gmhmm2_prok.
cgi
• Critica– http://www.ttaxus.com/index.php?pagename=Software
• ORNL Annotation Pipeline– http://compbio.ornl.gov/GP3/pro.shtml
Non-protein Coding Gene Tools and Information
• tRNA– tRNA-ScanSE
• http://www.genetics.wustl.edu/eddy/tRNAscan-SE/– FAStRNA
• http://bioweb.pasteur.fr/seqanal/interfaces/fastrna.html
• snoRNA– snoRNA database
• http://rna.wustl.edu/snoRNAdb/
• microRNA– Sfold
• http://www.bioinfo.rpi.edu/applications/sfold/index.pl– SIRNA
• http://bioweb.pasteur.fr/seqanal/interfaces/sirna.html
Assessing performance:Sensitivity and Specificity
Testing of predictions is performed on sequences where the gene structure is known
Sensitivity is the fraction of known genes (or bases or exons) correctly predicted“Am I finding the things that I’m supposed to find”
Specificity is the fraction of predicted genes (or bases or exons) that correspond to true genes“What fraction of my predictions are true?”
In general, increasing one decreases the other
Graphic View of Specificity and Sensitivity
iveFalseNegatveTruePositiveTruePositi
AllTrueveTruePositiSn
+==
iveFalsePositveTruePositiveTruePositi
eAllPositivveTruePositiSp
+==
Quantifying the tradeoff:Correlation Coefficient
( )( ) ( )( )[ ]( )( )( )( )
FNTNPNFPTPPPFNTPAPFPTNAN
PNAPPPANFNFPTNTPCC
+=+=+=+=
−=
;;;
Specificity/Sensitivity Tradeoffs
• Ideal Distribution of Scores
0
200
400
600
800
1000
1200
0 5 10 15 20 25 30 35 40 45 50score (arb units)
coun
t
random sequence true sites
• More Realistically…
0
200
400
600
800
1000
1200
0 10 20 30 40 50
score (arb units)
coun
t
random sequence true sites