VL Algorithmische BioInformatik (19710), WS 2015/2016
Week 7: remainder from Monday

Tim Conrad, AG Medical Bioinformatics, Institut für Mathematik & Informatik, Freie Universität Berlin

Lecture topics

Part 1: Background Basics (4)
1. The Nucleic Acid World
2. Protein Structure
3. Dealing with Databases

Part 2: Sequence Alignments (3)
4. Producing and Analyzing Sequence Alignments
5. Pairwise Sequence Alignment and Database Searching
6. Patterns, Profiles, and Multiple Alignments

Part 3: Evolutionary Processes (3)
7. Recovering Evolutionary History
8. Building Phylogenetic Trees

Part 4: Genome Characteristics (4)
9. Revealing Genome Features
10. Gene Detection and Genome Annotation

Part 5: Secondary Structures (4)
11. Obtaining Secondary Structure from Sequence
12. Predicting Secondary Structures

Part 6: Tertiary Structures (4)
13. Modeling Protein Structure
14. Analyzing Structure-Function Relationships

Part 7: Cells and Organisms (8)
15. Proteome and Gene Expression Analysis
16. Clustering Methods and Statistics
17. Systems Biology

GenScan

[Figure: GENSCAN state diagram with the states E0, E1, E2, I0, I1, I2, Einit, Eterm, single exon gene, 5’ UTR, 3’ UTR, poly-A signal, promoter and intergenic region. Below it, a gene with exons Ex1 to Ex5 and introns In1 to In4, flanked by the 5’ UTR and 3’ UTR; introns begin with GT and end with AG. Slide by Ron Pinter]

62001 AGGACAGGTA CGGCTGTCAT CACTTAGACC TCACCCTGTG GAGCCACACC

62051 CTAGGGTTGG CCAATCTACT CCCAGGAGCA GGGAGGGCAG GAGCCAGGGC

62101 TGGGCATAAA AGTCAGGGCA GAGCCATCTA TTGCTTACAT TTGCTTCTGA

62151 CACAACTGTG TTCACTAGCA ACCTCAAACA GACACCATGG TGCACCTGAC

62201 TCCTGAGGAG AAGTCTGCCG TTACTGCCCT GTGGGGCAAG GTGAACGTGG

62251 ATGAAGTTGG TGGTGAGGCC CTGGGCAGGT TGGTATCAAG GTTACAAGAC

62301 AGGTTTAAGG AGACCAATAG AAACTGGGCA TGTGGAGACA GAGAAGACTC

62351 TTGGGTTTCT GATAGGCACT GACTCTCTCT GCCTATTGGT CTATTTTCCC

62401 ACCCTTAGGC TGCTGGTGGT CTACCCTTGG ACCCAGAGGT TCTTTGAGTC

62451 CTTTGGGGAT CTGTCCACTC CTGATGCTGT TATGGGCAAC CCTAAGGTGA

62501 AGGCTCATGG CAAGAAAGTG CTCGGTGCCT TTAGTGATGG CCTGGCTCAC

62551 CTGGACAACC TCAAGGGCAC CTTTGCCACA CTGAGTGAGC TGCACTGTGA

62601 CAAGCTGCAC GTGGATCCTG AGAACTTCAG GGTGAGTCTA TGGGACCCTT

62651 GATGTTTTCT TTCCCCTTCT TTTCTATGGT TAAGTTCATG TCATAGGAAG

62701 GGGAGAAGTA ACAGGGTACA GTTTAGAATG GGAAACAGAC GAATGATTGC

62751 ATCAGTGTGG AAGTCTCAGG ATCGTTTTAG TTTCTTTTAT TTGCTGTTCA

62801 TAACAATTGT TTTCTTTTGT TTAATTCTTG CTTTCTTTTT TTTTCTTCTC

62851 CGCAATTTTT ACTATTATAC TTAATGCCTT AACATTGTGT ATAACAAAAG

62901 GAAATATCTC TGAGATACAT TAAGTAACTT AAAAAAAAAC TTTACACAGT

62951 CTGCCTAGTA CATTACTATT TGGAATATAT GTGTGCTTAT TTGCATATTC

63001 ATAATCTCCC TACTTTATTT TCTTTTATTT TTAATTGATA CATAATCATT

63051 ATACATATTT ATGGGTTAAA GTGTAATGTT TTAATATGTG TACACATATT

63101 GACCAAATCA GGGTAATTTT GCATTTGTAA TTTTAAAAAA TGCTTTCTTC

63151 TTTTAATATA CTTTTTTGTT TATCTTATTT CTAATACTTT CCCTAATCTC

63201 TTTCTTTCAG GGCAATAATG ATACAATGTA TCATGCCTCT TTGCACCATT

63251 CTAAAGAATA ACAGTGATAA TTTCTGGGTT AAGGCAATAG CAATATTTCT

63301 GCATATAAAT ATTTCTGCAT ATAAATTGTA ACTGATGTAA GAGGTTTCAT

63351 ATTGCTAATA GCAGCTACAA TCCAGCTACC ATTCTGCTTT TATTTTATGG

63401 TTGGGATAAG GCTGGATTAT TCTGAGTCCA AGCTAGGCCC TTTTGCTAAT

63451 CATGTTCATA CCTCTTATCT TCCTCCCACA GCTCCTGGGC AACGTGCTGG

63501 TCTGTGTGCT GGCCCATCAC TTTGGCAAAG AATTCACCCC ACCAGTGCAG

63551 GCTGCCTATC AGAAAGTGGT GGCTGGTGTG GCTAATGCCC TGGCCCACAA

63601 GTATCACTAA GCTCGCTTTC TTGCTGTCCA ATTTCTATTA AAGGTTCCTT

63651 TGTTCCCTAA GTCCAACTAC TAAACTGGGG GATATTATGA AGGGCCTTGA

63701 GCATCTGGAT TCTGCCTAAT AAAAAACATT TATTTTCATT GCAATGATGT

GENSCAN (Burge & Karlin)


Naïve Approach

[Figure: naive HMM with three states (Exon, Intron, UTR). Each state emits A, C, G or T according to its own emission table (B1, B2, B3), and a 3 × 3 matrix A gives the transitions between states #1, #2, #3 from the i-th turn to the (i+1)-th turn. The numeric entries cannot be recovered reliably from the extracted text. Slide by Ron Pinter]

GENSCAN components

[Figure: the transition matrix A and the emission tables B1 to B4 attached to the GENSCAN state diagram (E0, E1, E2, I0, I1, I2, Einit, Eterm, single exon gene, 5’ UTR, 3’ UTR, poly-A signal, promoter, intergenic region), with the inter-state transitions marked. The numeric entries cannot be recovered reliably from the extracted text. Slide by Ron Pinter]

GenScan Characteristics

• Designed to predict complete gene structures
– Introns and exons, promoter sites, polyadenylation signals

• Incorporates:
– Descriptions of transcriptional, translational and splicing signals
– Length distributions (explicit state duration HMMs)
– Compositional features of exons, introns, intergenic and C+G regions

• Larger predictive scope
– Deals with partial and complete genes
– Ability to predict multiple genes in a sequence
– Ability to predict consistent sets of genes occurring on either or both strands of the DNA

Genscan Model

• Based on a generalized HMM (GHMM)

• Models both strands at once

Eukaryotic Gene Structure


GenScan States

• N – intergenic region
• P – promoter
• F – 5’ untranslated region
• Esngl – single exon, intronless (translation start -> stop codon)
• Einit – initial exon (translation start -> donor splice site)
• Ik – phase k intron: 0 – between codons; 1 – after the first base of a codon; 2 – after the second base of a codon
• Ek – phase k internal exon (acceptor splice site -> donor splice site)
• Eterm – terminal exon (acceptor splice site -> stop codon)
• T – 3’ untranslated region
• A – poly-A signal

Four main components of model:

• A vector π of initial state probabilities

• A matrix of state transition probabilities T

• A set of length distributions ƒ for different states

• A set of sequence generating models P for each of the states


GenScan model parameters

• A training set of 238 multi-exon genes and 142 single-exon genes from GenBank is used to compute the parameters:

• Initial state probabilities
• Transition probabilities
• State length distributions
• Probabilistic models for the states

– The states correspond to different functional units of a gene, e.g. promoter regions, exons

– Transitions ensure that the order in which the model marches through the states is biologically consistent

– Length distributions take into account that different functional units have different lengths


Initial and transition probability

The initial and transition probability distributions depend on C+G content. They are estimated separately for each of four categories: I (<43% C+G), II (43-51%), III (51-57%) and IV (>57%).

To simplify the model, the initial probabilities of the exon, polyadenylation signal and promoter states are set to zero.
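The four composition categories can be sketched as a small lookup. This is only an illustration: the function name `cg_category` is hypothetical, and the slide does not say which class a sequence with exactly 43%, 51% or 57% C+G falls into, so the boundary handling below is an assumption.

```python
def cg_category(seq: str) -> str:
    """Return the C+G composition class (I to IV) of a DNA sequence."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no A/C/G/T bases")
    cg_percent = 100.0 * sum(b in "CG" for b in bases) / len(bases)
    # Assumption: exact boundary values go to the class shown here;
    # the slide only gives the ranges <43, 43-51, 51-57, >57.
    if cg_percent < 43:
        return "I"
    if cg_percent < 51:
        return "II"
    if cg_percent <= 57:
        return "III"
    return "IV"
```

A gene finder built this way would pick one parameter set (initial and transition probabilities, length parameter q) per class before parsing the sequence.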


State length distribution

• Intron and intergenic lengths are modeled as geometric distributions with a parameter q estimated for each C+G group.

• 5’ UTR and 3’ UTR lengths are modeled as geometric distributions with mean values of 769 bp and 457 bp, respectively.

• Exon length: L = 3c + i (c is the number of complete codons, i is the phase of the subsequent intron: 0, 1 or 2)
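A minimal sketch of the two length ingredients named above, under stated assumptions: the geometric distribution is parameterised as P(d) = q(1 - q)^(d - 1) for d = 1, 2, ..., so that the mean is 1/q (the slide does not fix the parameterisation), and all function names are hypothetical.

```python
def geometric_pmf(d: int, q: float) -> float:
    """P(length = d) for d = 1, 2, ... with per-base stopping probability q."""
    if d < 1:
        return 0.0
    return q * (1.0 - q) ** (d - 1)


def q_from_mean(mean_length: float) -> float:
    """Under this parameterisation the mean length is 1/q."""
    return 1.0 / mean_length


def codons_and_phase(exon_length: int) -> tuple[int, int]:
    """Split an exon length L into (c, i) with L = 3c + i and i in {0, 1, 2}."""
    return divmod(exon_length, 3)


# 5' UTR with a mean length of 769 bp, as stated on the slide.
q_utr5 = q_from_mean(769)
```

For example, `codons_and_phase(127)` splits a 127 bp exon into 42 complete codons followed by one extra base.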


State Models

• Coding regions: inhomogeneous 3-periodic fifth-order Markov model (Borodovsky & McIninch, 1993, Comp. Chem. 17: 123-133)

• Non-coding states (introns, 5’UTR, 3’UTR, intergenic regions): homogeneous fifth-order Markov model
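To make "3-periodic" concrete, the sketch below estimates P(base | previous k bases, codon position) from in-frame training sequences and scores a sequence with it. It is a toy illustration, not GENSCAN's training code: the function names, the pseudocount smoothing and the convention that index 0 of each training sequence is codon position 0 are all assumptions. Dropping the codon-position key gives the homogeneous model used for non-coding states.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_periodic_markov(seqs, order=5, pseudocount=1.0):
    """Estimate P(base | previous `order` bases, position mod 3) from in-frame sequences."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in seqs:
        for pos in range(order, len(seq)):
            context = seq[pos - order:pos]
            counts[(pos % 3, context)][seq[pos]] += 1.0

    def prob(frame, context, base):
        # Unseen (frame, context) pairs fall back to the uniform pseudocount estimate.
        row = counts.get((frame, context), {})
        total = sum(row.values()) + pseudocount * len(ALPHABET)
        return (row.get(base, 0.0) + pseudocount) / total

    return prob


def log_prob(seq, prob, order=5):
    """Log-probability of seq[order:] given the first `order` bases of seq."""
    return sum(
        math.log(prob(pos % 3, seq[pos - order:pos], seq[pos]))
        for pos in range(order, len(seq))
    )
```

With order 5 there are 3 × 4^5 contexts with four outcomes each, which is why a reasonably large training set is needed.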


Signal models used by GenScan

• WMM = weight matrix model (Staden, 1984, Nucl. Acids Res. 12: 505-519)

– For transcriptional and translational signals (translation initiation, polyA signals, TATA box etc.)

– The polyA signal is modeled as a 6 bp WMM with AATAAA as the consensus sequence (uses annotated data from GenBank)

• WAM = weight array model (Zhang & Marr, 1993, Comput. Appl. Biosci. 9(5): 499-509)

– Assumes dependencies between adjacent positions in the sequence (= first-order Markov model)

– Used for the pyrimidine-rich region and the splice acceptor site

• MDD = maximal dependence decomposition model (Burge, 1997, J. Mol. Biol. 268: 78-94)

– Used for donor splice sites

– Basically a decision tree, using a WAM at each level
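As an illustration of the simplest of these models, the sketch below builds a WMM from aligned sites and scans a sequence for the best-scoring window. The training sites are made-up toy data around the AATAAA consensus, the uniform background and the pseudocount are assumptions, and the function names are hypothetical.

```python
import math


def build_wmm(sites, pseudocount=1.0):
    """Column-wise base probabilities from equal-length aligned signal sites."""
    width = len(sites[0])
    wmm = []
    for pos in range(width):
        column = [site[pos] for site in sites]
        total = len(column) + 4 * pseudocount
        wmm.append({b: (column.count(b) + pseudocount) / total for b in "ACGT"})
    return wmm


def wmm_log_odds(window, wmm, background=0.25):
    """Sum of per-position log-odds scores against a uniform background."""
    return sum(math.log(col[base] / background) for base, col in zip(window, wmm))


def best_hit(seq, wmm):
    """Start index and score of the highest-scoring window in seq."""
    width = len(wmm)
    scored = [(wmm_log_odds(seq[i:i + width], wmm), i)
              for i in range(len(seq) - width + 1)]
    score, start = max(scored)
    return start, score


# Toy polyA-like training sites (invented for this example).
TOY_SITES = ["AATAAA", "AATAAA", "ATTAAA", "AATAAA"]
toy_wmm = build_wmm(TOY_SITES)
```

A WAM differs only in that each column stores probabilities conditioned on the previous base, so the score of position k depends on the pair of bases at positions k-1 and k.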


Transcription and translation signal

• Translation initiation signal: a 12 bp WMM beginning 6 bp upstream of the start codon.

• Translation termination signal: one of the three stop codons is generated and the next three nucleotides are generated according to a WMM.

• Promoter model: a TATA-containing promoter with probability 0.7 and a TATA-less promoter with probability 0.3, because about 30% of eukaryotic promoters lack a TATA signal.

– The TATA-containing promoter is modeled using a 15 bp TATA-box WMM and an 8 bp cap site WMM.

– TATA-less promoters are modeled simply as intergenic-null regions of 40 bp in length.


Maximal Dependence Decomposition (MDD)

A special type of signal sensor based on decision trees was introduced by the program GENSCAN (Burge, 1998).

Rather than using one weight array model for all splice sites, MDD partitions the splice sites in the training set according to the bases around the AG/GT consensus.

Starting at the root of the tree, we apply predicates over base identities at positions in the sensor window, which determine the path followed as we descend the tree.

At the leaves of the tree are weight matrices specific to the signal variants characterized by the predicates in the tree. Each leaf therefore has a different WAM, trained on a different subset of splice sites.

MDD – Maximal Dependency Decomposition

What about dependencies between non-adjacent nucleotides? Procedure:

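[Figure: MDD decision tree for donor splice sites. A table of base composition (A%, C%, G%, T%) per window position sits at the root. The sites are split successively into G5 versus (A|C|T)5, then G5G-1 versus G5(A|C|T)-1, then G5G-1A-2 versus G5G-1(C|G|T)-2, until no dependencies remain. Each leaf carries its own weight matrix with rows A, C, G, T over the window positions -4 to 3. The numeric entries cannot be recovered reliably from the extracted text.]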


MDD method

Constructing the MDD tree is done by performing a number of χ² tests of independence between cells in the sensor window. At each bifurcation in the tree we select the predicate of maximal dependence.

• Dependencies among non-adjacent positions are captured by the χ² statistic between the consensus indicator Ki at position i and the nucleotide Nj at position j, i ≠ j.

• If strong dependencies are detected (χ² ≥ 16.3, the cutoff for P = 0.001 with 3 df), partition the data into two subsets: one containing the sequences with the consensus Ki, and another one with the remaining sequences.
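The test in the first bullet is a χ² test on a 2 × 4 contingency table (consensus or not at position i, against the four bases at position j), which is where the 3 degrees of freedom come from. A minimal sketch, with hypothetical function names and with 0-based positions into equal-length aligned sites:

```python
def chi_square(sites, i, j, consensus_i):
    """Chi-square statistic for independence of K_i (consensus at i or not) and N_j."""
    bases = "ACGT"
    table = {True: dict.fromkeys(bases, 0), False: dict.fromkeys(bases, 0)}
    for site in sites:
        table[site[i] == consensus_i][site[j]] += 1
    n = len(sites)
    row_total = {k: sum(table[k].values()) for k in table}
    col_total = {b: table[True][b] + table[False][b] for b in bases}
    stat = 0.0
    for k in table:
        for b in bases:
            expected = row_total[k] * col_total[b] / n
            if expected > 0:  # empty rows/columns contribute nothing
                stat += (table[k][b] - expected) ** 2 / expected
    return stat


def split_on_consensus(sites, i, consensus_i):
    """Partition sites into those with and without the consensus base at position i."""
    with_consensus = [s for s in sites if s[i] == consensus_i]
    without = [s for s in sites if s[i] != consensus_i]
    return with_consensus, without


CUTOFF = 16.3  # chi-square cutoff from the slide (P = 0.001, 3 df)
```

Applying `split_on_consensus` recursively to each subset, and fitting a separate weight model to every subset where no further strong dependency is found, yields the tree described above.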


Application of model

• A precise probabilistic model of what a gene/genomic sequence looks like is specified in advance.

• Then, determine which of the vast number of possible gene structures has the highest likelihood given a sequence.


GENSCAN components

[Figure: the GENSCAN state diagram (intergenic region, promoter, 5’ UTR, Einit, E0, E1, E2, I0, I1, I2, Eterm, single exon gene, 3’ UTR, poly-A) annotated with its components: the transition matrix A, the sequence generating models P1, P2, P3, P4, and the set of length distributions f1, f2, f3, f4, for example fintron(10) = 0 and fintron(350) = .03. The numeric matrix entries cannot be recovered reliably from the extracted text. Slide by Ron Pinter]

Using GENSCAN

Given a sequence S and a parse Фi:

A C G C G A C T A G G C G C A G G T C T A … G A T
Exon0 Intron0 Exon0 Intron1 Exon1 3’UTR

we can calculate P(S, Фi).

Definitions: for a fixed sequence length L we define:

• ФL: the set of all possible parses of length L

• SL: the set of all possible DNA sequences of length L

• ΩL = ФL × SL: the space of parse/sequence pairs, on which the model defines a probability for each pair

Slide by Ron Pinter

[Figure: the example sequence A C G C G A C T A G G C G C A G G T C T A … G A T with its parse (Exon0, Intron0, Exon0, Intron1, Exon1, 3’UTR), mapped onto the GENSCAN components: the transition matrix A, the sequence generating models P1, P2, P3, P4, the length distributions f1, f2, f3, f4, and the state diagram. Slide by Ron Pinter]

$$P(S, \Phi_i) = \pi_{q_1} f_{q_1}(d_1) P_{q_1}(s_1) \times A_{q_1 \to q_2} f_{q_2}(d_2) P_{q_2}(s_2) \times \cdots \times A_{q_{k-1} \to q_k} f_{q_k}(d_k) P_{q_k}(s_k)$$

Predicting the Gene Structure

The Genscan model (essentially of semi-Markov type) can be formulated as an explicit state duration HMM. The model generates a “parse” Ф, consisting of:

• An ordered set of states q = q1, q2, …, qn

• An associated set of lengths (durations) d = d1, d2, …, dn

Using probabilistic models of each of the state types, the model generates a DNA sequence S of length $L = \sum_{i=1}^{n} d_i$.


Generating a parse corresponding to a sequence length L:

• Choose the initial state q1 according to an initial distribution π on the states.

• Generate a length (state duration) d1 for the state q1 from the length distribution f of that state.

• Generate a sequence segment s1 of length d1 according to the sequence generating model for state type q1.

• Generate the subsequent state q2, conditional on the value of q1, from the (first-order Markov) state transition matrix T.

This process is repeated until the sum of the state durations first equals or exceeds the length L. The generated sequence is the concatenation of the sequence segments, S = s1 s2 … sn.
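The generative procedure can be sketched directly. Everything in the toy model below is hypothetical (two states, geometric lengths, independent bases); it only mirrors the four steps and the stopping rule, not GENSCAN's actual states or parameters.

```python
import random


def geometric_length(q):
    """Sampler for lengths d = 1, 2, ... with P(d) = q * (1 - q)**(d - 1)."""
    def sample(rng):
        d = 1
        while rng.random() >= q:
            d += 1
        return d
    return sample


def iid_bases(weights):
    """Sampler emitting d independent bases; weights are given for A, C, G, T."""
    def sample(d, rng):
        return "".join(rng.choices("ACGT", weights=weights, k=d))
    return sample


def generate_parse(L, pi, T, length_of, emit, rng):
    """Generate states, durations and sequence until the total duration reaches L."""
    states, durations, segments = [], [], []
    names = list(pi)
    q = rng.choices(names, weights=[pi[s] for s in names])[0]   # initial state from pi
    while True:
        d = length_of[q](rng)                                   # duration from f_q
        segments.append(emit[q](d, rng))                        # segment from P_q
        states.append(q)
        durations.append(d)
        if sum(durations) >= L:                                 # first time >= L: stop
            break
        successors = list(T[q])
        q = rng.choices(successors, weights=[T[q][s] for s in successors])[0]
    return states, durations, "".join(segments)


# Hypothetical two-state toy model: N (intergenic-like) and E (exon-like).
PI = {"N": 1.0, "E": 0.0}
T = {"N": {"E": 1.0}, "E": {"N": 1.0}}
LENGTH_OF = {"N": geometric_length(0.05), "E": geometric_length(0.1)}
EMIT = {"N": iid_bases([3, 2, 2, 3]), "E": iid_bases([2, 3, 3, 2])}

states, durations, seq = generate_parse(60, PI, T, LENGTH_OF, EMIT, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the sketch reproducible without touching the global random state.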



• Given a sequence, find the most probable gene structure compatible with the sequence

• For a given sequence S, the conditional probability of a particular parse Φi, using Bayes’ rule, is:

$$P(\Phi_i \mid S) = \frac{P(\Phi_i, S)}{P(S)} = \frac{P(\Phi_i, S)}{\sum_{\Phi_j \in \Phi_L} P(\Phi_j, S)}$$


$$P(\Phi_i, S) = \pi_{q_1} f_{q_1}(d_1) P_{q_1}(s_1) \times A_{q_1 \to q_2} f_{q_2}(d_2) P_{q_2}(s_2) \times \cdots \times A_{q_{k-1} \to q_k} f_{q_k}(d_k) P_{q_k}(s_k)$$
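The product is easiest to evaluate in log space. The sketch below assumes a parse is given as a list of (state, duration) pairs with matching sequence segments; the container layout and the function name are assumptions made for this example, not part of GENSCAN.

```python
import math


def parse_log_prob(parse, segments, pi, A, f, P):
    """log P(Phi, S) for parse [(state, duration), ...] and matching sequence segments.

    pi: dict state -> initial probability
    A:  dict state -> dict state -> transition probability
    f:  dict state -> function(duration) -> probability
    P:  dict state -> function(segment) -> probability
    """
    logp = 0.0
    prev = None
    for (q, d), s in zip(parse, segments):
        if len(s) != d:
            raise ValueError("segment length must equal the state duration")
        # First factor uses pi, every later factor uses the transition A[prev][q].
        logp += math.log(pi[q] if prev is None else A[prev][q])
        logp += math.log(f[q](d)) + math.log(P[q](s))
        prev = q
    return logp
```

Working with logarithms avoids numerical underflow, since each P(s) is itself a product over hundreds or thousands of bases.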


Prediction:

Find the parse with maximum likelihood, i.e. max P(Фi | S)

In order to parse a given sequence S (i.e. predict genes in S) we compute:

$$P(\Phi_i \mid S) = \frac{P(\Phi_i, S)}{P(S)} = \frac{P(\Phi_i, S)}{\sum_{\Phi_j \in \Phi_L} P(\Phi_j, S)}$$

Slide by Ron Pinter

Algorithm

• Find the most probable parse, Фopt, using a Viterbi-like algorithm

• Find P(S) using a “forward” algorithm

• Run time: at worst, quadratic in the number of possible state transitions

– In practice it grows approximately linearly with sequence length for sequences of several kb or more
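A "Viterbi-like" search over parses can be sketched as dynamic programming over segment end positions. This is a generic explicit-duration (semi-Markov) Viterbi under simplifying assumptions, not GENSCAN's implementation: the parse must cover the sequence exactly, segment lengths are capped at `max_len`, and the two-state toy model at the bottom is invented for the example.

```python
import math

NEG_INF = float("-inf")


def semi_markov_viterbi(seq, states, log_pi, log_A, log_f, log_P, max_len):
    """Most probable segmentation of seq into (state, duration) pairs, with its log-score."""
    n = len(seq)
    # best[t][q]: best log-score of a parse of seq[:t] whose last segment is in state q
    best = [{q: NEG_INF for q in states} for _ in range(n + 1)]
    back = [{q: None for q in states} for _ in range(n + 1)]
    for t in range(1, n + 1):
        for q in states:
            for d in range(1, min(max_len, t) + 1):
                seg = log_f(q, d) + log_P(q, seq[t - d:t])
                if d == t:   # this segment is the first one in the parse
                    candidates = [(log_pi(q) + seg, None)]
                else:        # extend the best parse ending at t - d in some state p
                    candidates = [(best[t - d][p] + log_A(p, q) + seg, p) for p in states]
                for score, p in candidates:
                    if score > best[t][q]:
                        best[t][q] = score
                        back[t][q] = (d, p)
    q = max(states, key=lambda s: best[n][s])
    score, parse, t = best[n][q], [], n
    while t > 0:             # trace the segments back from the end
        d, p = back[t][q]
        parse.append((q, d))
        t, q = t - d, p
    return parse[::-1], score


# Toy model: an A/T-rich state and a G/C-rich state that must alternate.
STATES = ["AT", "GC"]
EMISSION = {"AT": {"A": 0.45, "T": 0.45, "G": 0.05, "C": 0.05},
            "GC": {"A": 0.05, "T": 0.05, "G": 0.45, "C": 0.45}}


def toy_log_pi(q):
    return math.log(0.5)


def toy_log_A(p, q):
    return 0.0 if p != q else NEG_INF   # self-transitions are not allowed


def toy_log_f(q, d):
    return math.log(0.25) + (d - 1) * math.log(0.75)   # geometric lengths


def toy_log_P(q, segment):
    return sum(math.log(EMISSION[q][b]) for b in segment)
```

The three nested loops make this version cost on the order of (sequence length) × (maximum segment length) × (number of states)², before scoring the segments; replacing `max` with a log-sum over the same recursion gives the "forward" value P(S).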


GENSCAN: Comparison with Other Gene Finders

              Accuracy per nucleotide   Accuracy per exon
Method        Sn     Sp     AC          Sn     Sp     (Sn+Sp)/2   ME     WE
GENSCAN       0.93   0.93   0.91        0.78   0.81   0.8         0.09   0.05
FGENEH        0.77   0.85   0.78        0.61   0.61   0.61        0.15   0.11
GeneID        0.63   0.81   0.67        0.44   0.45   0.45        0.28   0.24
GeneParser2   0.66   0.79   0.66        0.35   0.39   0.37        0.29   0.17
GenLang       0.72   0.75   0.69        0.5    0.49   0.5         0.21   0.21
GRAILII       0.72   0.84   0.75        0.36   0.41   0.38        0.25   0.1
SORFIND       0.71   0.85   0.73        0.42   0.47   0.45        0.24   0.14
Xpound        0.61   0.82   0.68        0.15   0.17   0.16        0.32   0.13

• Sn = Sensitivity
• Sp = Specificity
• AC = Approximate Correlation
• ME = Missing Exons
• WE = Wrong Exons

Source: GENSCAN Performance Data, http://genes.mit.edu/Accuracy.html
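For the per-nucleotide columns, the two basic measures can be computed from a true and a predicted annotation of the same sequence. The sketch assumes the convention common in the gene-finding literature, Sn = TP/(TP+FN) and Sp = TP/(TP+FP); the slide does not spell the formulas out, and the string encoding ('C' for coding, 'N' for non-coding) and the function name are made up for this example.

```python
def nucleotide_accuracy(true_labels: str, predicted_labels: str):
    """Per-nucleotide (Sn, Sp) from two equal-length strings of 'C' (coding) / 'N'."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("annotations must have the same length")
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == "C" and p == "C" for t, p in pairs)
    fn = sum(t == "C" and p != "C" for t, p in pairs)
    fp = sum(t != "C" and p == "C" for t, p in pairs)
    # Assumed convention: Sp is the fraction of predicted coding bases that are correct.
    sn = tp / (tp + fn) if tp + fn else float("nan")
    sp = tp / (tp + fp) if tp + fp else float("nan")
    return sn, sp
```

The per-exon columns use the same two ratios, but count an exon as a true positive only when both of its boundaries are predicted exactly.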


Genscan’s View of a Gene

• E = Exon
• I = Intron
• A = Polyadenylation signal
• P = Promoter
• F, T = UTR
• N = Intergenic sequence

Burge & Karlin, J. Mol. Biol. 268:78, 1997

Not mentioned

• Reverse strand states
• C+G%
• Coding / non-coding detection
• Branch point detection
• Expected vs. observed AG composition
• And more…

TwinScan


TWINSCAN

• Reason for developing TWINSCAN
– GENSCAN performed poorly on HTG sequences.

• What are HTG sequences?
– High-throughput genomic sequences
– Usually 100-200 kb in length
– Contain an unknown number of genes

• TWINSCAN is designed specifically for HTG sequences


TWINSCAN

• TWINSCAN was developed by Brent’s group at Washington University, 2001.
– Website: http://mblab.wustl.edu/software.html
– Original paper: Korf, I., P. Flicek, D. Duan, and M.R. Brent. Integrating genomic homology into gene structure prediction. Bioinformatics 17: S140-148. (2001)


TWINSCAN

• Based on GENSCAN

• Differences:
– The GENSCAN model does not account for evolutionary conservation.
– TWINSCAN extends the probability model of GENSCAN and utilizes the pattern of evolutionary conservation.


TWINSCAN Activity Diagram


TWINSCAN

• Use the GENSCAN model topology for parsing.

[Figure: the GENSCAN state diagram (E0, E1, E2, I0, I1, I2, Einit, Eterm, single exon gene, 5’ UTR, 3’ UTR, poly-A signal, promoter, intergenic region), with each state marked as either a c-state or a d-state]


Component models of c-states

- Fifth Order Markov Chain
- Weight Matrix Model (WMM)
- Weight Array Model (WAM)
- Maximal Dependence Decomposition (MDD)
- Conservation Models


Conservation Sequence

• The conservation sequence is a symbolic representation of the alignment, with one symbol per target base: . unaligned, | matched, : mismatched

Example:

position               1 2 3 4 5 6 7 8 9
target sequence        G A A T T C C G T

Alignment:

target position        3 4 5   6 7 8 9
target alignment       A T T - C C G T
match symbols          | |     | |   |
informant alignment    A T C A C C - T

The resulting conservation sequence is:

position               1 2 3 4 5 6 7 8 9
target sequence        G A A T T C C G T
conservation sequence  . . | | : | | : |
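The example can be reproduced with a few lines of code. The sketch handles a single pairwise alignment only; the function name and argument layout are assumptions, and the rule that a gap in the informant counts as a mismatch is read off the example (target position 8), not stated as a general rule on the slide.

```python
def conservation_sequence(target, start, target_aln, informant_aln):
    """Conservation symbols for `target`, given one gapped pairwise alignment.

    start: 1-based target position of the first aligned target base.
    """
    cons = ["."] * len(target)          # every base starts out as unaligned
    pos = start - 1
    for t, inf in zip(target_aln, informant_aln):
        if t == "-":
            continue                    # gap in the target: no target base to annotate
        cons[pos] = "|" if t == inf else ":"   # a gap in the informant counts as a mismatch
        pos += 1
    return "".join(cons)


result = conservation_sequence("GAATTCCGT", 3, "ATT-CCGT", "ATCACC-T")
print(" ".join(result))
```

Because the conservation sequence has exactly one symbol per target base, it can be modeled position by position alongside the target sequence itself.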


TWINSCAN

• The state-specific sequence model combines
– the probability model for generating a specific DNA sequence from a given state
– the probability model for generating a conservation sequence from a given state

• Possible states
– Coding state, UTR state, intron/intergenic state
– Translation initiation and termination sites
– Splice donor and acceptor sites

• For instance, given that positions i to j form an exon, the target sequence T and the conservation sequence C are treated as independent given the state:
Pr(Ti,j, Ci,j | Ei,j) = Pr(Ti,j | Ei,j) Pr(Ci,j | Ei,j)
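In log space the factorization becomes a sum of two independent scores. The sketch below uses position-independent toy distributions for both factors purely to show the structure; TWINSCAN's real component models are higher-order, and every number and name here is invented for illustration.

```python
import math


def joint_log_prob(target_segment, conservation_segment, p_base, p_cons):
    """log Pr(T, C | state) = log Pr(T | state) + log Pr(C | state)."""
    if len(target_segment) != len(conservation_segment):
        raise ValueError("target and conservation segments must have equal length")
    log_t = sum(math.log(p_base[b]) for b in target_segment)
    log_c = sum(math.log(p_cons[c]) for c in conservation_segment)
    return log_t + log_c


# Toy parameters: exons are assumed to be better conserved than introns.
UNIFORM_BASES = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
EXON_CONS = {"|": 0.80, ":": 0.15, ".": 0.05}
INTRON_CONS = {"|": 0.30, ":": 0.30, ".": 0.40}
```

With these toy numbers a fully matched segment scores higher under the exon model than under the intron model, even though the DNA factor is identical, which is the effect the conservation models add on top of GENSCAN.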


TWINSCAN vs GENSCAN

Compared to GENSCAN, TWINSCAN shows a notable improvement in exon sensitivity and specificity and a dramatic improvement in exact-gene sensitivity and specificity.


Tim Conrad
AG Medical Bioinformatics
www.medicalbioinformatics.de

More information online at medicalbioinformatics.de/teaching

Thank you!

Further questions
