Computational Approaches to Gene Findingmolsim.sci.univr.it/2013_bioinfo2/geneFinding_ale.pdf · 2012-01-27 · Approaches to Gene Finding •Direct – Exact or near-exact matches

Gene Prediction

Alejandro Giorgetti

Annotation

• Defining the bological role of a molecule in all itscomplexity and mapping this knowledge onto the relevant gene products encoded by genomes.

• Computational gene prediction is the cornerstoneupon which a genome annotation is built

1. Basic Information

• What types of predictions can we make?

• What are they based on?

• Each gene prediction is a hypothesis waiting to betested and the results of testing then inform our nextset of hypothesis

Bioinformatics asExtrapolation

• Computational gene finding is a process of:

– Identifying common phenomena in known genes

– Building a computational framework/model that can accurately describe the common phenomena

– Using the model to scan uncharacterized sequence toidentify regions that match the model, which becomeputative genes

– Test and validate the predictions

Different Types of Gene Finding

• RNA genes– tRNA, rRNA, snRNA, snoRNA, microRNA

• Protein coding genes– Prokaryotic

• No introns, simpler regulatory features– Eukaryotic

• Exon-intron structure• Complex regulatory features

Approaches to Gene Finding

• Direct– Exact or near-exact matches of EST, cDNA, or Proteins

from the same, or closely related organism

• Indirect1. Look for something that looks like one gene (homology)

1. Look for something that looks like all genes (ab initio)

1. Hybrid, combining homology and ab initio (and perhaps even direct) methods

Prokaryotes• Identify long Open Reading Frames (ORF), likely to

code for proteins.– Identification of start codon ATG that maximizes the length

of the ORF.– End codons (UAA,UAG, UGA)– Pribnow box (TATAAT), the -35 sequence or ribosomal

binding sites: transcriptional and translational start sites.– Codon bias (codon usage).

• GLIMMER, GeneMark. 90% for both sensitivity and specificity.

Pieces of a (Eukaryotic) Gene

(on the genome)

5’

3’

3’

5’

~ 1-100 Mbp

5’

3’

3’

5’……

……

~ 1-1000 kbp

exons (cds & utr) / introns(~ 102-103 bp) (~ 102-105 bp)

Polyadenylationsite

promoter (~103 bp)

enhancers (~101-102 bp) other regulatory sequences(~ 101-102 bp)

What is it about genes thatwe can measure (and

model)?• Most of our knowledge is biased towards protein-

coding characteristics– ORF (Open Reading Frame): a sequence defined by in-

frame AUG and stop codon, which in turn defines a putative amino acid sequence.

– Codon Usage: most frequently measured by CAI (Codon Adaptation Index)

• Other phenomena– Nucleotide frequencies and correlations:

• value and structure– Functional sites:

• splice sites, promoters, UTRs, polyadenylation sites

A simple measure: ORF length Comparison of Annotation and Spurious ORFs in S. cerevisiae

Basrai MA, Hieter P, and Boeke J Genome Research 1997 7:768-771

Codon Adaptation Index(CAI)

CAI= ∏i= codons

f codon i

f �codoni�max

• Parameters are empirically determined by examininga “large” set of example genes

• This is not perfect– Genes sometimes have unusual codons for a reason– The predictive power is dependent on length of sequence

General Things to Rememberabout (Protein-coding) Gene

Prediction Software• It is, in general, organism-specific

• It works best on genes that are reasonably similar tosomething seen previously

• It finds protein coding regions far better than non-codingregions

• In the absence of external (direct) information, alternative forms will not be identified

• It is imperfect! (It’s biology, after all…)

2. Classes of Information

• Extrinsic information: cDNA, EST, protein products. Not itself a genome sequence.

• Intrinsic information: De novo gene predictors usingthe sequences of one or more genomes as their onlyinput. Ab initio: do not use informant genomes.

Extrinsic Information

• BLAST family, FASTA, etc.– Pros: fast, statistically well founded– Cons: no understanding/model of gene structure

• BLAT, Sim4, EST_GENOME, etc.– Pros: gene structure is incorporated– Cons: non-canonical splicing, slower than blast

Intrinsic Information

• The most ab initio would be a program that simulates the transcription, splicing and postprocessing of a transcript using only the information present in the cell.

• Rudimentary understanding: we must rely on metrics derived from known genes and features.– Signal sensors– Content sensors

Intrinsic Information: Signals• Nucleic acid motifs that are recognized by the cellular

machinery.• Start and stop codons.• Donor and acceptor splice sites• Splicing enhancers• Silencer elements• Transcription start and termination sites• Polyadenylation signals• Proximal and distal promoter sequences

• Signals modeled as Position Weight Matrices (PWM)• Capture the intrinsic variability characteristics.• Derived from a set of aligned sequences which are funtinally

related.• Sliding windows.

Intrinsic Information: Content

• Proper detection the start and end codons is still a challenge. Coding bias in sequence composition.

• Uneven usage of aa in real proteins.• Uneven usage of synonymous codons• Coding statistcs as functions that compute a real number

related to the likelihood that a given DNA sequence codes for protein.

• Codon or di-codon usage• Periodicity in base occurrence (GRAIL).

Conservation

• When one or more informant genomes are available it is possible to detect the characteristic conservation pattern of coding sequence and use it as an orthogonal measure for coding potential.

• TWINSCAN• GeneID• SGP2• CONTRAST• Genscan• Fgenesh

2. Some MathematicalConcepts and Definitions

• Models

• Bayesian Statistics

• Markov Models & Hidden Markov Models

Models in Computational(Molecular) Biology

• In gene finding, models can best be thought of as “sequence generators” (e.g., Hidden Markov Models) or “sequence classifiers” (e.g., Neural Networks)

• The better (and usually more complex) a model is, the better the performance is likely to be

Models of Sequence Generation:Markov Chains

• Different types of structure components (exons or introns) are characterized by a state

• The gene model is thought to be generated by a state machine: starting form 5’ to 3’,each base is generated by a emission probability conditioned on the current state, and the transition from one state to an other: transition probability (an intron can only follow an exon, reading frames of to adjacent exonsshould be compatible, etc.)

• All the parameters of the emission probabilities and transition probabilities are learned.

Basic HMM

Described bya set of possible states: start, exon, donor, intron, acceptor, stop set of possible observations: A,C,G,Ttransition probability matrixemision probability matrix (frequencies of nucleotidesoccurring in a particular state)initial state probabilities.

Models of SequenceGeneration:

Markov Chains

• A Markov chain is a model for stochastic generation of sequential phenomena

• Every position in a chain is equivalent

• The order of the Markov chain is the number of previous positions on which the current position depends– e.g., in nucleic acid sequence, 0-order is mononucleotide, 1st-order

is dinucleotide, 2nd-order is trinucleotide, etc.

• The model parameters are the frequencies of the elements at each position (possibly as a function of preceding elements)

Markov Chains as Models ofSequence Generation

• 0th-order

• 1st-order

• 2nd-order

( ) ( ) ( ) ( ) ( ) ( )∏=

−⋅=⋅⋅=N

iii sssssssssP

211231211 |pp|p|pp L

( ) ( ) ( ) ( ) ( ) ( )∏=

−−⋅=⋅⋅=N

iiii ssssssssssssssP

31221324213212 |pp|p|pp L

( ) ( ) ( ) ( ) ( )∏=

=⋅⋅=N

iisssssP

13210 pppp L

L4321 sssss =

( ) ( ) ( ) ( ) ( ) ( ) ( )∏=

−⋅=⋅⋅⋅=N

iii sssactatttsP

2111 |pp|p|p|pp L

( ) ( ) ( ) ( ) ( ) ( ) ( )∏=

−−⋅=⋅⋅⋅=N

iiii sssssacgtacttattsP

312212 |pp|p|p|pp L

( ) ( ) ( ) ( ) ( ) ( ) ( )∏=

=⋅⋅⋅⋅=N

iisgcattsP

10 pppppp L

ttacggts =

Hidden Markov Models

In general, sequences are not monolithic, but can be made up of discrete segments

Hidden Markov Models (HMMs) allow us to model complex sequences, in which the character emissionprobabilities depend upon the state

Think of an HMM as a probabilistic or stochastic sequence generator, and what is hidden is the current state of the model

HMM DetailsAn HMM is completely defined by its:

State-to-state transition matrix (Φ)Emission matrix (H)State vector (x)

We want to determine the probability of any specific (query) sequence having been generated by the model; with multiple models, we then use Bayes’ rule to determine the best model for the sequence

Two algorithms are typically used for the likelihood calculation:ViterbiForward

Models are trained with known examples

The HMM Matrixes: Φ and H

⎥⎥⎥⎥

⎦

⎤

⎢⎢⎢⎢

⎣

⎡

=Φ

0002.0001.000996.0001.05.00002.0998.05.00000

⎥⎥⎥⎥

⎦

⎤

⎢⎢⎢⎢

⎣

⎡

=

32.018.018.032.0

25.025.022.028.0

H

xm(i) = probability of being in state m at position i;

H(m,yi) = probability of emitting character yi in state m;

Φmk = probability of transition from state k to m.

A more realistic (and complex) HMM model for Gene Prediction (Genie)

Kulp, D., PhD Thesis, UCSC 2003

Scoring an HMM: Viterbi, Forward, and Forward-

BackwardTwo algorithms are typically used for the likelihood

calculation: Viterbi and ForwardViterbi is an approximation;

The probability of the sequence is determined by using the most likely mapping of the sequence to the model

in many cases good enough (gene finding, e.g.), but not always

Forward is the rigorous calculation;The probability of the sequence is determined by summing over

all mappings of the sequence to the model

Forward-Backward produces a probabilistic map of the model to the sequence

Eukaryotic Gene Prediction: GRAIL II: Neural network based

prediction

(Uberbacher and Mural 1991; Uberbacher et al. 1996)

Open Challenges in Predicting Eukaryotic

(Protein-Coding) Genes• Alternative Processing of Transcripts

– Splice variants, Start/stop variants

• Overlapping Genes– Mostly UTRs or intronic, but coding is possible

• Non-canonical functional elements– Splice w/o GT-AG,

• UTR predictions– Especially with introns

• Small (mini) exons

Open Challenges in Predicting Prokaryotic

(Protein-Coding) Genes• Start site prediction

– Most algorithms are greedy, taking the largest ORF

• Overlapping Genes– This can be very problematic, esp. with use of Viterbi-like

algorithms

• Non-canonical coding

Eukaryotic gene predictiontools and web servers

• Genscan (ab initio), GenomeScan (hybrid) – (http://genes.mit.edu/)

• Twinscan (hybrid)– (http://genes.cs.wustl.edu/)

• FGENESH (ab initio)– (http://www.softberry.com/berry.phtml?topic=gfind)

• GeneMark.hmm (ab initio)– (http://opal.biology.gatech.edu/GeneMark/eukhmm.cgi)

• MZEF (ab initio)– (http://rulai.cshl.org/tools/genefinder/)

• GrailEXP (hybrid)– (http://grail.lsd.ornl.gov/grailexp/)

• GeneID (hybrid)– (http://www1.imim.es/geneid.html)

http://genes.mit.edu/

http://genes.cs.wustl.edu/

http://www.softberry.com/berry.phtml?topic=gfind

http://opal.biology.gatech.edu/GeneMark/eukhmm.cgi

http://rulai.cshl.org/tools/genefinder/

http://grail.lsd.ornl.gov/grailexp/

http://www1.imim.es/geneid.html

Prokaryotic Gene Prediction

• Glimmer– http://www.tigr.org/~salzberg/glimmer.html

• GeneMark– http://opal.biology.gatech.edu/GeneMark/gmhmm2_prok.

cgi

• Critica– http://www.ttaxus.com/index.php?pagename=Software

• ORNL Annotation Pipeline– http://compbio.ornl.gov/GP3/pro.shtml

http://www.tigr.org/~salzberg/glimmer.html

http://opal.biology.gatech.edu/GeneMark/gmhmm2_prok.cgi

http://opal.biology.gatech.edu/GeneMark/gmhmm2_prok.cgi

http://www.ttaxus.com/index.php?pagename=Software

http://compbio.ornl.gov/GP3/pro.shtml

Non-protein Coding Gene Tools and Information

• tRNA– tRNA-ScanSE

• http://www.genetics.wustl.edu/eddy/tRNAscan-SE/– FAStRNA

• http://bioweb.pasteur.fr/seqanal/interfaces/fastrna.html

• snoRNA– snoRNA database

• http://rna.wustl.edu/snoRNAdb/

• microRNA– Sfold

• http://www.bioinfo.rpi.edu/applications/sfold/index.pl– SIRNA

• http://bioweb.pasteur.fr/seqanal/interfaces/sirna.html

http://www.genetics.wustl.edu/eddy/tRNAscan-SE/

http://bioweb.pasteur.fr/seqanal/interfaces/fastrna.html

http://rna.wustl.edu/snoRNAdb/

http://www.bioinfo.rpi.edu/applications/sfold/index.pl

http://bioweb.pasteur.fr/seqanal/interfaces/sirna.html

Assessing performance:Sensitivity and Specificity

Testing of predictions is performed on sequences where the gene structure is known

Sensitivity is the fraction of known genes (or bases or exons) correctly predicted“Am I finding the things that I’m supposed to find”

Specificity is the fraction of predicted genes (or bases or exons) that correspond to true genes“What fraction of my predictions are true?”

In general, increasing one decreases the other

Graphic View of Specificity and Sensitivity

iveFalseNegatveTruePositiveTruePositi

AllTrueveTruePositiSn

+==

iveFalsePositveTruePositiveTruePositi

eAllPositivveTruePositiSp

+==

Quantifying the tradeoff:Correlation Coefficient

( )( ) ( )( )[ ]( )( )( )( )

FNTNPNFPTPPPFNTPAPFPTNAN

PNAPPPANFNFPTNTPCC

+=+=+=+=

−=

;;;

Specificity/Sensitivity Tradeoffs

• Ideal Distribution of Scores

0

200

400

600

800

1000

1200

0 5 10 15 20 25 30 35 40 45 50score (arb units)

coun

t

random sequence true sites

• More Realistically…

0

200

400

600

800

1000

1200

0 10 20 30 40 50

score (arb units)

coun

t

random sequence true sites

Documents

Computational Approaches to Gene Findingmolsim.sci.univr.it/2013_bioinfo2/geneFinding_ale.pdf · 2012-01-27 · Approaches to Gene Finding •Direct – Exact or near-exact matches