VL Algorithmische Bioinformatik (19710), WS2015/2016, Week 7 – remainder from Monday's lecture
Tim Conrad AG Medical Bioinformatics Institut für Mathematik & Informatik, Freie Universität Berlin
Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016
Lecture topics
Part 1: Background Basics (4)
1. The Nucleic Acid World
2. Protein Structure
3. Dealing with Databases
Part 2: Sequence Alignments (3)
4. Producing and Analyzing Sequence Alignments
5. Pairwise Sequence Alignment and Database Searching
6. Patterns, Profiles, and Multiple Alignments
Part 3: Evolutionary Processes (3)
7. Recovering Evolutionary History
8. Building Phylogenetic Trees
Part 4: Genome Characteristics (4)
9. Revealing Genome Features
10. Gene Detection and Genome Annotation
Part 5: Secondary Structures (4)
11. Obtaining Secondary Structure from Sequence
12. Predicting Secondary Structures
Part 6: Tertiary Structures (4)
13. Modeling Protein Structure
14. Analyzing Structure-Function Relationships
Part 7: Cells and Organisms (8)
15. Proteome and Gene Expression Analysis
16. Clustering Methods and Statistics
17. Systems Biology
GenScan
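[Figure: GENSCAN state diagram with states E0, E1, E2, I0, I1, I2, Einit, Eterm, single-exon gene, 5' UTR, 3' UTR, poly-A signal, promoter and intergenic region, shown above a gene schematic with exons Ex1 to Ex5, introns In1 to In4, flanking 5' and 3' UTRs, and the GT/AG dinucleotides at the intron boundaries.]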
Slide by Ron Pinter
62001 AGGACAGGTA CGGCTGTCAT CACTTAGACC TCACCCTGTG GAGCCACACC
62051 CTAGGGTTGG CCAATCTACT CCCAGGAGCA GGGAGGGCAG GAGCCAGGGC
62101 TGGGCATAAA AGTCAGGGCA GAGCCATCTA TTGCTTACAT TTGCTTCTGA
62151 CACAACTGTG TTCACTAGCA ACCTCAAACA GACACCATGG TGCACCTGAC
62201 TCCTGAGGAG AAGTCTGCCG TTACTGCCCT GTGGGGCAAG GTGAACGTGG
62251 ATGAAGTTGG TGGTGAGGCC CTGGGCAGGT TGGTATCAAG GTTACAAGAC
62301 AGGTTTAAGG AGACCAATAG AAACTGGGCA TGTGGAGACA GAGAAGACTC
62351 TTGGGTTTCT GATAGGCACT GACTCTCTCT GCCTATTGGT CTATTTTCCC
62401 ACCCTTAGGC TGCTGGTGGT CTACCCTTGG ACCCAGAGGT TCTTTGAGTC
62451 CTTTGGGGAT CTGTCCACTC CTGATGCTGT TATGGGCAAC CCTAAGGTGA
62501 AGGCTCATGG CAAGAAAGTG CTCGGTGCCT TTAGTGATGG CCTGGCTCAC
62551 CTGGACAACC TCAAGGGCAC CTTTGCCACA CTGAGTGAGC TGCACTGTGA
62601 CAAGCTGCAC GTGGATCCTG AGAACTTCAG GGTGAGTCTA TGGGACCCTT
62651 GATGTTTTCT TTCCCCTTCT TTTCTATGGT TAAGTTCATG TCATAGGAAG
62701 GGGAGAAGTA ACAGGGTACA GTTTAGAATG GGAAACAGAC GAATGATTGC
62751 ATCAGTGTGG AAGTCTCAGG ATCGTTTTAG TTTCTTTTAT TTGCTGTTCA
62801 TAACAATTGT TTTCTTTTGT TTAATTCTTG CTTTCTTTTT TTTTCTTCTC
62851 CGCAATTTTT ACTATTATAC TTAATGCCTT AACATTGTGT ATAACAAAAG
62901 GAAATATCTC TGAGATACAT TAAGTAACTT AAAAAAAAAC TTTACACAGT
62951 CTGCCTAGTA CATTACTATT TGGAATATAT GTGTGCTTAT TTGCATATTC
63001 ATAATCTCCC TACTTTATTT TCTTTTATTT TTAATTGATA CATAATCATT
63051 ATACATATTT ATGGGTTAAA GTGTAATGTT TTAATATGTG TACACATATT
63101 GACCAAATCA GGGTAATTTT GCATTTGTAA TTTTAAAAAA TGCTTTCTTC
63151 TTTTAATATA CTTTTTTGTT TATCTTATTT CTAATACTTT CCCTAATCTC
63201 TTTCTTTCAG GGCAATAATG ATACAATGTA TCATGCCTCT TTGCACCATT
63251 CTAAAGAATA ACAGTGATAA TTTCTGGGTT AAGGCAATAG CAATATTTCT
63301 GCATATAAAT ATTTCTGCAT ATAAATTGTA ACTGATGTAA GAGGTTTCAT
63351 ATTGCTAATA GCAGCTACAA TCCAGCTACC ATTCTGCTTT TATTTTATGG
63401 TTGGGATAAG GCTGGATTAT TCTGAGTCCA AGCTAGGCCC TTTTGCTAAT
63451 CATGTTCATA CCTCTTATCT TCCTCCCACA GCTCCTGGGC AACGTGCTGG
63501 TCTGTGTGCT GGCCCATCAC TTTGGCAAAG AATTCACCCC ACCAGTGCAG
63551 GCTGCCTATC AGAAAGTGGT GGCTGGTGTG GCTAATGCCC TGGCCCACAA
63601 GTATCACTAA GCTCGCTTTC TTGCTGTCCA ATTTCTATTA AAGGTTCCTT
63651 TGTTCCCTAA GTCCAACTAC TAAACTGGGG GATATTATGA AGGGCCTTGA
63701 GCATCTGGAT TCTGCCTAAT AAAAAACATT TATTTTCATT GCAATGATGT
GENSCAN (Burge & Karlin)
Naïve Approach
[Figure: naive approach. A hidden Markov model with the states Exon, Intron and UTR; each state emits A, C, G and T from its own distribution, and a transition table links turn i to turn i+1.]
Slide by Ron Pinter
GENSCAN components
[Figure: GENSCAN state diagram (E0, E1, E2, I0, I1, I2, Einit, Eterm, single-exon gene, 5' UTR, 3' UTR, poly-A signal, promoter, intergenic region) annotated with per-state emission tables and the inter-state transitions.]
Slide by Ron Pinter
GenScan Characteristics
Designed to predict complete gene structures
• Introns and exons, promoter sites, polyadenylation signals
– Incorporates:
• Descriptions of transcriptional, translational and splicing signals
• Length distributions (explicit state duration HMMs)
• Compositional features of exons, introns, intergenic and C+G regions
– Larger predictive scope:
• Deals with partial and complete genes
• Ability to predict multiple genes in a sequence
• Ability to predict consistent sets of genes occurring on either or both strands of the DNA
Genscan Model
• Based on a generalized HMM (GHMM)
• Models both strands at once
Eukaryotic Gene Structure
GenScan States
• N – intergenic region
• P – promoter
• F – 5' untranslated region
• Esngl – single exon (intronless): translation start -> stop codon
• Einit – initial exon: translation start -> donor splice site
• Ik – phase k intron: 0 – between codons; 1 – after the first base of a codon; 2 – after the second base of a codon
• Ek – phase k internal exon: acceptor splice site -> donor splice site
• Eterm – terminal exon: acceptor splice site -> stop codon
• T – 3' untranslated region
• A – poly-A signal
Four main components of the model:
• A vector π of initial state probabilities
• A matrix T of state transition probabilities
• A set of length distributions ƒ for the different states
• A set of sequence generating models P, one for each of the states
GenScan model parameters
• Uses a training set of 238 multi-exon genes and 142 single-exon genes from GenBank to compute the parameters:
• Initial state probabilities
• Transition probabilities
• State length distributions
• Probabilistic models for the states
– The states correspond to different functional units of a gene, e.g. promoter regions and exons.
– Transitions ensure that the order in which the model marches through the states is biologically consistent.
– Length distributions take into account that different functional units have different lengths.
Initial and transition probability
Based on C+G content, the initial and transition probability distributions are estimated separately for each of four categories: I (<43% C+G), II (43-51%), III (51-57%) and IV (>57%).
To simplify the model, the initial probabilities of the exon, polyadenylation signal and promoter states are set to zero.
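As a small illustration of the four C+G categories, the sketch below assigns a sequence to a group from its C+G percentage. The function name `cg_category` is hypothetical, and the handling of a value of exactly 51% is an assumption, since the slide lists that boundary in two groups.

```python
def cg_category(seq: str) -> str:
    """Return the C+G category (I to IV) of a DNA sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    cg_percent = 100.0 * sum(base in "CG" for base in seq) / len(seq)
    if cg_percent < 43:
        return "I"
    if cg_percent < 51:      # assumption: exactly 51% is counted as group III
        return "II"
    if cg_percent <= 57:     # the slide gives IV as strictly above 57%
        return "III"
    return "IV"


print(cg_category("ACGT" * 5))  # 50% C+G
```

A parameter set estimated for the returned group would then be used for the whole sequence.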
State length distribution
• Intron and intergenic lengths are modeled as geometric distributions with a parameter q estimated for each C+G group.
• 5' UTR and 3' UTR lengths are modeled as geometric distributions with mean values of 769 bp and 457 bp, respectively.
• Exon length: L = 3c + i, where c is the number of complete codons and i is the phase of the subsequent intron (0, 1 or 2).
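The length distributions above can be sketched as follows. This is an illustration only: the parameterization (lengths start at 1, and the per-base exit probability is taken as 1/mean) is an assumption, and the function names are hypothetical.

```python
def geometric_pmf(d: int, mean: float) -> float:
    """P(length = d) for a geometric distribution on d = 1, 2, ... with the given mean."""
    if d < 1:
        return 0.0
    q = 1.0 / mean  # assumed per-base probability of leaving the state
    return (1.0 - q) ** (d - 1) * q


def exon_length(complete_codons: int, next_intron_phase: int) -> int:
    """Exon length L = 3c + i, with i the phase of the subsequent intron."""
    if next_intron_phase not in (0, 1, 2):
        raise ValueError("intron phase must be 0, 1 or 2")
    return 3 * complete_codons + next_intron_phase


# 3' UTR example using the mean of 457 bp quoted above
print(geometric_pmf(457, mean=457.0))
print(exon_length(40, 2))
```

A geometric distribution is the length distribution an ordinary HMM state with a self-loop produces, which is why it needs only one parameter per state.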
State Models
• Coding regions: inhomogeneous 3-periodic fifth-order Markov model (Borodovsky & McIninch, 1993, Comp. Chem. 17, 123-133)
• Non-coding states (introns, 5' UTR, 3' UTR, intergenic regions): homogeneous fifth-order Markov model
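A minimal sketch of the two kinds of state model, assuming a simple count-based estimator with pseudocounts. With `period=3` the transition table depends on the codon position (inhomogeneous, 3-periodic); with `period=1` the same class gives a homogeneous model. The class name, the pseudocount and the assumption that each sequence starts at codon position 0 are illustrative choices, not GENSCAN's actual training procedure.

```python
import math
from collections import defaultdict


class PeriodicMarkovModel:
    """k-th order Markov model with one transition table per frame."""

    def __init__(self, order=5, period=3, pseudocount=1.0):
        self.order = order
        self.period = period
        self.pseudocount = pseudocount
        # counts[frame][context][base] = number of observations
        self.counts = [defaultdict(lambda: defaultdict(float)) for _ in range(period)]

    def train(self, sequences):
        for seq in sequences:
            for i in range(self.order, len(seq)):
                context = seq[i - self.order:i]
                self.counts[i % self.period][context][seq[i]] += 1.0

    def log_prob(self, seq):
        """Log probability of seq given its first `order` bases."""
        total = 0.0
        for i in range(self.order, len(seq)):
            row = self.counts[i % self.period].get(seq[i - self.order:i], {})
            num = row.get(seq[i], 0.0) + self.pseudocount
            den = sum(row.values()) + 4 * self.pseudocount
            total += math.log(num / den)
        return total


# Toy training data, made up for illustration only
coding = PeriodicMarkovModel(order=5, period=3)
coding.train(["ATGGCCAAG" * 30])
noncoding = PeriodicMarkovModel(order=5, period=1)
noncoding.train(["ACGT" * 60])

query = "ATGGCCAAG" * 3
print(coding.log_prob(query) - noncoding.log_prob(query))  # log-odds for "coding"
```

Comparing the two log probabilities for the same window is the basic coding versus non-coding signal that such models provide.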
Signal models used by GenScan
• WMM = weight matrix model (Staden, 1984, Nucl. Acids Res. 12, 505-519)
– For transcriptional and translational signals (translation initiation, polyA signals, TATA box etc.)
– The polyA signal is modeled as a 6 bp WMM with AATAAA as the consensus sequence (uses annotated data from GenBank)
• WAM = weight array model (Zhang & Marr, 1993, Comp. Appl. Biol. Sci. 9(5): 499-509)
– Assumes some dependencies between adjacent positions in the sequence (= first-order Markov model)
– Used for the pyrimidine-rich region and the splice acceptor site
• MDD = maximal dependency decomposition model (Burge, 1997, J. Mol. Biol. 268: 78-94)
– Used for donor splice sites
– Basically a decision tree, using a WAM at each level
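To make the difference between a WMM and a WAM concrete, here is a small sketch that estimates both from aligned example sites and scores a candidate window. The training sites are made up for illustration (they are not GenBank data), and the function names and the pseudocount are assumptions.

```python
import math

BASES = "ACGT"


def train_wmm(sites, pseudocount=1.0):
    """WMM: independent base frequencies for each position of the signal."""
    wmm = []
    for pos in range(len(sites[0])):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos]] += 1.0
        total = sum(counts.values())
        wmm.append({b: counts[b] / total for b in BASES})
    return wmm


def wmm_log_prob(wmm, window):
    return sum(math.log(column[base]) for column, base in zip(wmm, window))


def train_wam(sites, pseudocount=1.0):
    """WAM: P(base at pos | base at pos - 1), with one table per position."""
    first = train_wmm([site[0] for site in sites], pseudocount)[0]
    tables = []
    for pos in range(1, len(sites[0])):
        counts = {p: {b: pseudocount for b in BASES} for p in BASES}
        for site in sites:
            counts[site[pos - 1]][site[pos]] += 1.0
        tables.append({p: {b: counts[p][b] / sum(counts[p].values()) for b in BASES}
                       for p in BASES})
    return first, tables


def wam_log_prob(wam, window):
    first, tables = wam
    score = math.log(first[window[0]])
    for pos in range(1, len(window)):
        score += math.log(tables[pos - 1][window[pos - 1]][window[pos]])
    return score


# Hypothetical aligned poly-A signal examples
sites = ["AATAAA"] * 8 + ["ATTAAA"] * 2
wmm, wam = train_wmm(sites), train_wam(sites)
print(wmm_log_prob(wmm, "AATAAA"), wam_log_prob(wam, "AATAAA"))
```

The WMM multiplies one independent probability per position, while the WAM conditions each position on its left neighbour, which is the first-order dependency mentioned above.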
Transcription and translation signals
• Translation initiation signal: 12 bp WMM beginning 6 bp before the start codon.
• Translation termination signal: one of the three stop codons is generated, and the next three nucleotides are generated according to a WMM.
• Promoter model: a TATA-containing promoter with probability 0.7 and a TATA-less promoter with probability 0.3, because roughly 30% of eukaryotic promoters lack a TATA signal.
– The TATA-containing promoter is modeled using a 15 bp TATA-box WMM and an 8 bp cap site WMM.
– TATA-less promoters are modeled simply as intergenic-null regions of 40 bp in length.
Maximal Dependence Decomposition (MDD)
A special type of signal sensor based on decision trees was introduced by the program GENSCAN (Burge, 1998).
Rather than using one weight array matrix for all splice sites, MDD differentiates between splice sites in the training set based on the bases around the AG/GT consensus.
Starting at the root of the tree, we apply predicates over base identities at positions in the sensor window, which determine the path followed as we descend the tree.
At the leaves of the tree are weight matrices specific to signal variants as characterized by the predicates in the tree. Therefore, each leaf has a different WAM trained from a different subset of splice sites.
MDD – Maximal Dependency Decomposition
What about dependencies between non-adjacent nucleotides? Procedure:
[Figure: MDD decision tree for donor splice sites. The root holds a base frequency table (A%, C%, G%, T%) over the positions -4 to 3 and a "no dependencies" branch; successive splits are labelled G5 versus (A|C|T)5, G5G-1 versus G5(A|C|T)-1, and G5G-1A-2 versus G5G-1(C|G|T)-2, and each leaf carries its own A/C/G/T frequency table.]
MDD method
Constructing the MDD tree is done by performing a number of χ² tests of independence between cells in the sensor window. At each bifurcation in the tree we select the predicate of maximal dependence.
• Dependencies among non-adjacent positions are captured by the chi-square statistic between the consensus indicator Ki at position i and the nucleotide Nj at position j, i ≠ j.
• If strong dependencies are detected (χ² ≥ 16.3, for the cutoff P = 0.001, 3 df), then partition the data into two subsets: one containing the sequences with the consensus Ki, and another one with the remaining sequences.
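The χ² test between Ki and Nj can be sketched as below. The statistic is computed on the 2 x 4 contingency table (site matches the consensus at position i or not, versus the base at position j). The example sites are invented for illustration, and the function names are hypothetical; only the cutoff of 16.3 comes from the slide.

```python
BASES = "ACGT"
CUTOFF = 16.3  # threshold quoted above for P = 0.001 with 3 degrees of freedom


def chi_square_ki_nj(sites, i, j, consensus_base):
    """Chi-square statistic between 'consensus at i' (yes/no) and the base at j."""
    table = [[0.0] * 4 for _ in range(2)]
    for site in sites:
        row = 0 if site[i] == consensus_base else 1
        table[row][BASES.index(site[j])] += 1.0
    n = float(len(sites))
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][c] + table[1][c] for c in range(4)]
    chi2 = 0.0
    for r in range(2):
        for c in range(4):
            expected = row_totals[r] * col_totals[c] / n
            if expected > 0:  # skip cells that cannot be observed
                chi2 += (table[r][c] - expected) ** 2 / expected
    return chi2


def split_on_consensus(sites, i, consensus_base):
    """Partition the sites into those with and without the consensus at position i."""
    with_k = [s for s in sites if s[i] == consensus_base]
    without_k = [s for s in sites if s[i] != consensus_base]
    return with_k, without_k


# Invented toy data: position 0 and position 2 are perfectly coupled
dependent_sites = ["GCA"] * 20 + ["ACT"] * 20
chi2 = chi_square_ki_nj(dependent_sites, i=0, j=2, consensus_base="G")
if chi2 >= CUTOFF:
    with_k, without_k = split_on_consensus(dependent_sites, 0, "G")
    print(chi2, len(with_k), len(without_k))
```

In a full MDD construction this test would be repeated for every pair of positions, and the split would be made at the position with the strongest total dependence before recursing on each subset.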
Application of model
• A precise probabilistic model of what a gene/genomic sequence looks like is specified in advance.
• Then, given a sequence, determine which of the vast number of possible gene structures has the highest likelihood.
GENSCAN components
[Figure: GENSCAN state diagram (E0, E1, E2, I0, I1, I2, Einit, Eterm, single-exon gene, 5' UTR, 3' UTR, poly-A signal, promoter, intergenic region) annotated with its components: the sequence generating models P1, P2, P3, P4 and the set of length distributions f1, f2, f3, f4, for example fintron(10) = 0 and fintron(350) = 0.03.]
Slide by Ron Pinter
Given a sequence S and a parse Фi:
A C G C G A C T A G G C G C A G G T C T A … G A T
Exon0 Intron0 Exon0 Intron1 Exon1 3'UTR
we can calculate P(S, Фi).
Slide by Ron Pinter
Definitions: for a fixed sequence length L we define:
ФL – the set of all possible parses of length L
SL – the set of all possible DNA sequences of length L
ΩL = ФL × SL – the sample space; the model assigns a probability density to each parse/sequence pair in it.
Using GENSCAN
A C G C G A C T A G G C G C A G G T C T A … G A T
Exon0 Intron0 Exon0 Intron1 Exon1 3'UTR
[Figure: GENSCAN state diagram with the sequence generating models P1, P2, P3, P4 and the set of length distributions f1, f2, f3, f4.]
$$P(S, \Phi_i) = \pi_{q_1} f_{q_1}(d_1) P_{q_1}(s_1) \times A_{q_1 \to q_2} f_{q_2}(d_2) P_{q_2}(s_2) \times \cdots \times A_{q_{k-1} \to q_k} f_{q_k}(d_k) P_{q_k}(s_k)$$
Here A_{q→q'} denotes the entry of the state transition matrix T for moving from state q to state q'.
Slide by Ron Pinter
Predicting the Gene Structure
The Genscan model (essentially of semi-Markov type) can be formulated as an explicit state duration HMM. The model generates a "parse" Ф, consisting of:
• An ordered set of states q = q1, q2, …, qn
• An associated set of lengths (durations) d = d1, d2, …, dn
• Using probabilistic models for each of the state types, the model then generates a DNA sequence S of length L = ∑ di (i = 1...n)
Generating a parse corresponding to a sequence of length L:
• Choose the initial state q1 according to an initial distribution π on the states.
• Generate a length (state duration) d1 for the state q1, conditional on the value of q1, from the length distribution fq1.
• Generate a sequence segment s1 of length d1, conditional on d1 and q1, according to the sequence generating model for state type q1.
• Generate the subsequent state q2, conditional on the value of q1, from the (first-order Markov) state transition matrix T.
This process is repeated until the sum of the state durations first equals or exceeds the length L. The sequence generated is the concatenation of the sequence segments, S = s1 s2 … sn.
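The generative procedure above can be sketched in a few lines. The two-state toy model (states "exon" and "intron", their mean lengths and base compositions) is entirely hypothetical and only serves to make the loop runnable; it is not the GENSCAN parameter set.

```python
import random


def generate_parse(L, pi, T, sample_length, emit, rng):
    """Generate (states, durations, sequence) until the total duration reaches L."""
    states, durations, segments = [], [], []
    state = rng.choices(list(pi), weights=list(pi.values()))[0]  # q1 from pi
    total = 0
    while True:
        d = sample_length(state, rng)          # duration from f_q
        states.append(state)
        durations.append(d)
        segments.append(emit(state, d, rng))   # segment from P_q
        total += d
        if total >= L:
            break
        nxt = T[state]                         # next state from the transition matrix
        state = rng.choices(list(nxt), weights=list(nxt.values()))[0]
    return states, durations, "".join(segments)


# Hypothetical toy model
pi = {"exon": 0.5, "intron": 0.5}
T = {"exon": {"intron": 1.0}, "intron": {"exon": 1.0}}
MEAN_LENGTH = {"exon": 6, "intron": 10}
BASE_WEIGHTS = {"exon": [1, 2, 2, 1], "intron": [2, 1, 1, 2]}  # order A, C, G, T


def sample_length(state, rng):
    """Geometric duration with the state's mean length."""
    d = 1
    while rng.random() >= 1.0 / MEAN_LENGTH[state]:
        d += 1
    return d


def emit(state, d, rng):
    return "".join(rng.choices("ACGT", weights=BASE_WEIGHTS[state], k=d))


rng = random.Random(0)
states, durations, seq = generate_parse(50, pi, T, sample_length, emit, rng)
print(list(zip(states, durations)))
print(seq)
```

Because the duration is drawn explicitly for each state, any length distribution can be plugged in for `sample_length`, which is the point of the explicit state duration formulation.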
• Given a sequence, find the most probable gene structure compatible with the sequence
• For a given sequence S, the conditional probability of a particular parse Φi using Bayes’ Rule is:
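$$P(\Phi_i \mid S) = \frac{P(\Phi_i, S)}{P(S)} = \frac{P(\Phi_i, S)}{\sum_{\Phi_j \in \Phi_L} P(\Phi_j, S)}$$
where the joint probability of the parse and the sequence is
$$P(\Phi_i, S) = \pi_{q_1} f_{q_1}(d_1) P_{q_1}(s_1) \times A_{q_1 \to q_2} f_{q_2}(d_2) P_{q_2}(s_2) \times \cdots \times A_{q_{k-1} \to q_k} f_{q_k}(d_k) P_{q_k}(s_k)$$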
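As an illustration of the joint probability of a parse and a sequence, the sketch below evaluates the product term by term in log space. The toy model (two states, geometric lengths, independent base probabilities) and all names are assumptions made for the example.

```python
import math


def parse_log_prob(parse, seq, pi, A, f, P):
    """log P(S, Phi) for parse = [(state, duration), ...] that covers seq exactly."""
    if sum(d for _, d in parse) != len(seq):
        raise ValueError("durations must sum to the sequence length")
    logp, start, prev = 0.0, 0, None
    for state, d in parse:
        segment = seq[start:start + d]
        # pi for the first state, a transition probability afterwards
        logp += math.log(pi[state] if prev is None else A[prev][state])
        logp += math.log(f[state](d)) + math.log(P[state](segment))
        start += d
        prev = state
    return logp


# Hypothetical toy model
EXON_P = {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}
INTRON_P = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
pi = {"exon": 0.5, "intron": 0.5}
A = {"exon": {"intron": 1.0}, "intron": {"exon": 1.0}}
f = {"exon": lambda d: 0.5 ** d,                     # geometric, mean 2
     "intron": lambda d: 0.75 ** (d - 1) * 0.25}     # geometric, mean 4
P = {"exon": lambda s: math.prod(EXON_P[b] for b in s),
     "intron": lambda s: math.prod(INTRON_P[b] for b in s)}

logp = parse_log_prob([("exon", 2), ("intron", 3)], "GCATT", pi, A, f, P)
print(math.exp(logp))
```

Working with logarithms avoids numerical underflow, which matters for real sequences where thousands of factors are multiplied.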
Prediction:
In order to parse a given sequence S (i.e. predict genes in S), we find the parse with maximum likelihood, i.e. max P(Фi | S):
$$P(\Phi_i \mid S) = \frac{P(\Phi_i, S)}{P(S)} = \frac{P(\Phi_i, S)}{\sum_{\Phi_j \in \Phi_L} P(\Phi_j, S)}$$
Slide by Ron Pinter
Algorithm
• Find the most probable parse, Фopt, using a Viterbi-like algorithm
• Find P(S) using a "forward" algorithm
• Run time: at worst, quadratic in the number of possible state transitions
– In practice, it grows approximately linearly with sequence length for sequences of several kb or more
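A Viterbi-like recursion for an explicit state duration model can be sketched as follows. This is a generic semi-Markov Viterbi under simplifying assumptions (the parse must cover the sequence exactly, durations are capped at `max_len`, and the toy model is hypothetical); it is not the optimized algorithm used by GENSCAN itself.

```python
import math

NEG_INF = float("-inf")


def semi_markov_viterbi(seq, states, log_pi, log_A, log_f, log_P, max_len):
    """Return (best log P(S, Phi), parse) with parse = [(state, duration), ...]."""
    n = len(seq)
    # best[t][q]: best log score of a parse of seq[:t] whose last segment is in state q
    best = [{q: NEG_INF for q in states} for _ in range(n + 1)]
    back = [{q: None for q in states} for _ in range(n + 1)]
    for t in range(1, n + 1):
        for q in states:
            for d in range(1, min(max_len, t) + 1):
                seg = log_f(q, d) + log_P(q, seq[t - d:t])
                if d == t:   # first segment of the parse
                    options = [(log_pi(q) + seg, None)]
                else:        # extend the best parse ending at t - d in state p
                    options = [(best[t - d][p] + log_A(p, q) + seg, p) for p in states]
                for score, p in options:
                    if score > best[t][q]:
                        best[t][q], back[t][q] = score, (d, p)
    # Traceback from the best final state
    q = max(states, key=lambda s: best[n][s])
    score, parse, t = best[n][q], [], n
    while t > 0:
        d, p = back[t][q]
        parse.append((q, d))
        t, q = t - d, p
    return score, parse[::-1]


# Hypothetical toy model
STATES = ("exon", "intron")
BASE_P = {"exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
          "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}
EXIT_Q = {"exon": 0.25, "intron": 0.25}


def log_pi(q):
    return math.log(0.5)


def log_A(p, q):
    return 0.0 if p != q else NEG_INF   # states must alternate


def log_f(q, d):
    return (d - 1) * math.log(1.0 - EXIT_Q[q]) + math.log(EXIT_Q[q])


def log_P(q, segment):
    return sum(math.log(BASE_P[q][b]) for b in segment)


score, parse = semi_markov_viterbi("GCGCATAT", STATES, log_pi, log_A, log_f, log_P, max_len=8)
print(score, parse)
```

Replacing the maximization by a sum over the same recursion gives the "forward" quantity P(S); in this naive form the cost grows with the sequence length, the maximum duration and the square of the number of states.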
GENSCAN: Comparison with Other Gene Finders

              Accuracy per nucleotide   Accuracy per exon
Method        Sn     Sp     AC          Sn     Sp     (Sn+Sp)/2   ME     WE
GENSCAN       0.93   0.93   0.91        0.78   0.81   0.8         0.09   0.05
FGENEH        0.77   0.85   0.78        0.61   0.61   0.61        0.15   0.11
GeneID        0.63   0.81   0.67        0.44   0.45   0.45        0.28   0.24
GeneParser2   0.66   0.79   0.66        0.35   0.39   0.37        0.29   0.17
GenLang       0.72   0.75   0.69        0.5    0.49   0.5         0.21   0.21
GRAILII       0.72   0.84   0.75        0.36   0.41   0.38        0.25   0.1
SORFIND       0.71   0.85   0.73        0.42   0.47   0.45        0.24   0.14
Xpound        0.61   0.82   0.68        0.15   0.17   0.16        0.32   0.13

• Sn = Sensitivity
• Sp = Specificity
• AC = Approximate Correlation
• ME = Missing Exons
• WE = Wrong Exons
Source: GENSCAN Performance Data, http://genes.mit.edu/Accuracy.html
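For reference, the nucleotide-level measures can be computed from the counts of true/false positives and negatives as sketched below. The definitions follow the convention commonly used in gene-finder benchmarks (Sp as TP / (TP + FP), AC derived from the average of four conditional probabilities); that the table above uses exactly these definitions is an assumption, and the counts in the example are invented.

```python
def nucleotide_accuracy(tp, fp, tn, fn):
    """Return (Sn, Sp, AC) from per-nucleotide prediction counts."""
    sn = tp / (tp + fn)   # fraction of true coding bases that were predicted
    sp = tp / (tp + fp)   # fraction of predicted coding bases that are correct
    acp = 0.25 * (tp / (tp + fn) + tp / (tp + fp) + tn / (tn + fp) + tn / (tn + fn))
    ac = (acp - 0.5) * 2  # rescale to the range -1 .. 1
    return sn, sp, ac


def exon_accuracy(correct, annotated, predicted, missing, wrong):
    """Return (Sn, Sp, ME, WE) from exon counts."""
    return (correct / annotated, correct / predicted,
            missing / annotated, wrong / predicted)


print(nucleotide_accuracy(tp=90, fp=10, tn=890, fn=10))
print(exon_accuracy(correct=8, annotated=10, predicted=9, missing=1, wrong=1))
```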
Genscan’s View of a Gene
E = Exon
I = Intron
A = Polyadenylation signal
P = Promoter
F, T = UTR
N = Intergenic sequence
Burge & Karlin, J. Mol. Biol. 268:78, 1997
Not mentioned
• Reverse strand states
• C+G%
• Coding / non-coding detection
• Branch point detection
• Expected vs. observed AG composition
• And more…
TwinScan
TWINSCAN
• Reason for developing TWINSCAN
– GENSCAN performed poorly on HTG sequences.
• What are HTG sequences?
– High-throughput genomic sequences
– Usually 100-200 kb in length
– Contain an unknown number of genes
• TWINSCAN is designed specifically for HTG sequences
TWINSCAN
• TWINSCAN was developed by Brent's group at Washington University in 2001.
– Website: http://mblab.wustl.edu/software.html
– Original paper: Korf, I., P. Flicek, D. Duan, and M.R. Brent. Integrating genomic homology into gene structure prediction. Bioinformatics 17: S140-148. (2001)
TWINSCAN
• Based on GENSCAN
• Differences:
– The GENSCAN model does not account for evolutionary conservation.
– TWINSCAN extends the probability model of GENSCAN and utilizes the pattern of evolutionary conservation.
TWINSCAN Activity Diagram
TWINSCAN
• Use the GENSCAN model topology for parsing.
[Figure: GENSCAN state topology (E0, E1, E2, I0, I1, I2, Einit, Eterm, single-exon gene, 5' UTR, 3' UTR, poly-A signal, promoter, intergenic region) with the legend entries "c-state" and "d-state".]
Component models of c-states
- Fifth-Order Markov Chain
- Weight Matrix Model (WMM)
- Weight Array Model (WAM)
- Maximal Dependence Decomposition (MDD)
- Conservation Models
Conservation Sequence
• Symbolic representation of conservation: "." unaligned, "|" matched, ":" mismatched

Example:

position          1 2 3 4 5 6 7 8 9
target sequence   G A A T T C C G T

Alignment:

target position           3 4 5   6 7 8 9
target alignment          A T T - C C G T
alignment match symbols   | |     | |   |
informant alignment       A T C A C C - T

The resulting conservation sequence is:

position                1 2 3 4 5 6 7 8 9
target sequence         G A A T T C C G T
conservation sequence   . . | | : | | : |
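The construction of the conservation sequence in the example can be written down directly. The sketch assumes a single alignment block and follows the convention visible in the example: gaps in the target are skipped (they have no target position), and a gap in the informant counts as a mismatch. The function name is hypothetical.

```python
def conservation_sequence(target, aligned_target, aligned_informant, start):
    """Build the conservation string for `target`.

    `start` is the 1-based target position of the first aligned target base.
    """
    conservation = ["."] * len(target)      # unaligned by default
    pos = start - 1
    for t_base, i_base in zip(aligned_target, aligned_informant):
        if t_base == "-":
            continue                        # insertion in the informant only
        conservation[pos] = "|" if t_base == i_base else ":"
        pos += 1
    return "".join(conservation)


print(conservation_sequence("GAATTCCGT", "ATT-CCGT", "ATCACC-T", start=3))
# prints ..||:||:|
```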
TWINSCAN
• The state-specific sequence model combines
– the probability model for generating a specific DNA sequence from any given state.
– the probability model for generating a conservation sequence from any given state.
• Possible states
– Coding state, UTR state, intron/intergenic state
– Translation initiation and termination sites
– Splice donor and acceptor sites
• For instance, given that positions i to j form an exon:
$$\Pr(T_{i,j}, C_{i,j} \mid E_{i,j}) = \Pr(T_{i,j} \mid E_{i,j}) \Pr(C_{i,j} \mid E_{i,j})$$
where T is the target sequence segment and C is the corresponding conservation sequence segment.
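The factorization above means that the two models can simply be scored separately and their log probabilities added. The sketch uses position-independent (zeroth-order) models for both sequences purely to keep the example short; the probabilities are invented, and TWINSCAN's actual models are of higher order.

```python
import math


def joint_log_prob(target_segment, conservation_segment, dna_probs, cons_probs):
    """log Pr(T, C | state) = log Pr(T | state) + log Pr(C | state)."""
    if len(target_segment) != len(conservation_segment):
        raise ValueError("segments must have the same length")
    log_t = sum(math.log(dna_probs[b]) for b in target_segment)
    log_c = sum(math.log(cons_probs[c]) for c in conservation_segment)
    return log_t + log_c


# Hypothetical exon-state parameters
EXON_DNA = {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}
EXON_CONS = {"|": 0.8, ":": 0.15, ".": 0.05}

conserved = joint_log_prob("ATGGCC", "||||||", EXON_DNA, EXON_CONS)
unaligned = joint_log_prob("ATGGCC", "......", EXON_DNA, EXON_CONS)
print(conserved, unaligned)
```

Under these made-up parameters the same DNA segment scores much higher as an exon when it is conserved in the informant, which is the extra evidence TWINSCAN adds on top of GENSCAN.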
TWINSCAN vs GENSCAN
Compared to GENSCAN, TWINSCAN shows a notable improvement in exon sensitivity and specificity and a dramatic improvement in exact gene sensitivity and specificity.
Tim Conrad
AG Medical Bioinformatics
www.medicalbioinformatics.de
More information online at medicalbioinformatics.de/teaching
Thank you!
Further questions