97
Gene prediction in prokaryotes Dmitrij Frishman Technische Universität München

Gene Prediction Bacterial

Embed Size (px)

Citation preview

Page 1: Gene Prediction Bacterial

Gene prediction in prokaryotes!

Dmitrij Frishman!Technische Universität München!

Page 2: Gene Prediction Bacterial

Gene prediction: a mind map!

Page 3: Gene Prediction Bacterial
Page 4: Gene Prediction Bacterial

Prokarya!

Page 5: Gene Prediction Bacterial

From DeRisi et al., 1997

>gb|L43967|MGEN Mycoplasma genitalium G37 complete genome TAAGTTATTATTTAGTTAATACTTTTAACAATATTATTAAGGTATTTAAAAAATACTATT ATAGTATTTAACATAGTTAAATACCTTCCTTAATACTGTTAAATTATATTCAATCAATAC ATATATAATATTATTAAAATACTTGATAAGTATTATTTAGATATTAGACAAATACTAATT TTATATTGCTTTAATACTTAATAAATACTACTTATGTATTAAGTAAATATTACTGTAATA CTAATAACAATATTATTACAATATGCTAGAATAATATTGCTAGTATCAATAATTACTAAT ATAGTATTAGGAAAATACCATAATAATATTTCTACATAATACTAAGTTAATACTATGTGT AGAATAATAAATAATCAGATTAAAAAAATTTTATTTATCTGAAACATATTTAATCAATTG AACTGATTATTTTCAGCAGTAATAATTACATATGTACATAGTACATATGTAAAATATCAT TAATTTCTGTTATATATAATAGTATCTATTTTAGAGAGTATTAATTATTACTATAATTAA GCATTTATGCTTAATTATAAGCTTTTTATGAACAAAATTATAGACATTTTAGTTCTTATA ATAAATAATAGATATTAAAGAAAATAAAAAAATAGAAATAAATATCATAACCCTTGATAA CCCAGAAATTAATACTTAATCAAAAATGAAAATATTAATTAATAAAAGTGAATTGAATAA AATTTTGGGAAAAAATGAATAACGTTATTATTTCCAATAACAAAATAAAACCACATCATT CATATTTTTTAATAGAGGCAAAAGAAAAAGAAATAAACTTTTATGCTAACAATGAATACT TTTCTGTCAAATGTAATTTAAATAAAAATATTGATATTCTTGAACAAGGCTCCTTAATTG TTAAAGGAAAAATTTTTAACGATCTTATTAATGGCATAAAAGAAGAGATTATTACTATTC AAGAAAAAGATCAAACACTTTTGGTTAAAACAAAAAAAACAAGTATTAATTTAAACACAA TTAATGTGAATGAATTTCCAAGAATAAGGTTTAATGAAAAAAACGATTTAAGTGAATTTA ATCAATTCAAAATAAATTATTCACTTTTAGTAAAAGGCATTAAAAAAATTTTTCACTCAG TTTCAAATAATCGTGAAATATCTTCTAAATTTAATGGAGTAAATTTCAATGGATCCAATG GAAAAGAAATATTTTTAGAAGCTTCTGACACTTATAAACTATCTGTTTTTGAGATAAAGC AAGAAACAGAACCATTTGATTTCATTTTGGAGAGTAATTTACTTAGTTTCATTAATTCTT TTAATCCTGAAGAAGATAAATCTATTGTTTTTTATTACAGAAAAGATAATAAAGATAGCT TTAGTACAGAAATGTTGATTTCAATGGATAACTTTATGATTAGTTACACATCGGTTAATG AAAAATTTCCAGAGGTAAACTACTTTTTTGAATTTGAACCTGAAACTAAAATAGTTGTTC AAAAAAATGAATTAAAAGATGCACTTCAAAGAATTCAAACTTTGGCTCAAAATGAAAGAA CTTTTTTATGCGATATGCAAATTAACAGTTCTGAATTAAAAATAAGAGCTATTGTTAATA ATATCGGAAATTCTCTTGAGGAAATTTCTTGTCTTAAATTTGAAGGTTATAAACTTAATA TTTCTTTTAACCCAAGTTCTCTATTAGATCACATAGAGTCTTTTGAATCAAATGAAATAA ATTTTGATTTCCAAGGAAATAGTAAGTATTTTTTGATAACCTCTAAAAGTGAACCTGAAC TTAAGCAAATATTGGTTCCTTCAAGATAATGAATCTTTACGATCTTTTAGAACTACCAAC TACAGCATCAATAAAAGAAATAAAAATTGCTTATAAAAGATTAGCAAAGCGTTATCACCC TGATGTAAATAAATTAGGTTCGCAAACTTTTGTTGAAATTAATAATGCTTATTCAATATT AAGTGATCCTAACCAAAAGGAAAAATATGATTCAATGCTGAAAGTTAATGATTTTCAAAA TCGCATCAAAAATTTAGATATTAGTGTTAGATGACATGAAAATTTCATGGAAGAACTCGA ACTTCGTAAGACCTGAGAATTTGATTTTTTTTCATCTGATGAAGATTTCTTTTATTCTCC

From DNA sequence... ...to structure and function of gene products

Page 6: Gene Prediction Bacterial

Linear representation of the Buchnera sp.!

Page 7: Gene Prediction Bacterial

Circular genome map showing the position and orientation of known genes, pseudogenes and repetitive sequences!

From the outside: •!circles 1 and 2 (clockwise and anticlockwise) genes on the - and + strands, respectively •!circles 3 and 4, pseudogenes •!5 and 6, M. leprae specific genes •!7, repeat sequences •!8, G+C content •!9, G/C bias (G+C)/(G-C)

Genes are color coded using the following functional categories: lipid metabolism (dark grey); intermediary metabolism and respiration (yellow); information pathways (red); regulatory proteins (light blue); conserved hypothetical proteins (orange); proteins of unknown function (light green); insertion sequences and phage related functions (pink); stable RNAs (dark blue); cell wall and cell processes (green); PE and PPE protein families (magenta); virulence, detoxification, adaptation (brown).

© Cole et al., 2001

Page 8: Gene Prediction Bacterial

Gene prediction flowchart!

Obtain new genomic

DNA sequence

1. Translate in all six reading frames and compare to proten sequence database

2. Perform database similarity search of expressed sequence tag (EST) database of same organism, or cDNA sequences if available

Use gene prediction program to

locate genes

Analyze regulatory

sequence in the gene

© D. Mount, 2001

Page 9: Gene Prediction Bacterial

In bacteria!

© Chris Burge

Page 10: Gene Prediction Bacterial

Gene Expression I: Transcription!

Page 11: Gene Prediction Bacterial

Ribosome!!! Translation is handled by a molecular complex,

ribosome, which consists of both proteins & ribosomal RNA (rRNA)!

!! Ribosome reads mRNA & the translation starts at a start codon (the translation start site)!

!! With help of tRNA, each codon is translated to an amino acid!

!! Translation stops once ribosome reads a stop codon (the translation stop site)!

Page 12: Gene Prediction Bacterial

The Genetic Code!

Page 13: Gene Prediction Bacterial

The structure of a bacterial gene. Single gene transcript.!

ATG TAA

transcript

Coding sequence

Transcription start site

Promoter Ribosome binding site/translational start (ATG)

hisG

Translational end

Transcriptional terminator

Page 14: Gene Prediction Bacterial

ATG TAA

transcript

Transcription start site

hisG

Translational end

hisH

Multigene bacterial operon

-One promoter, one transcriptional stop; multiple translational starts and stops

TAA

Page 15: Gene Prediction Bacterial

!! Promoters are DNA segments upstream of transcripts that initiate transcription

!! Promoter attracts RNA Polymerase to the transcription start site

5’ Promoter 3’

© www.bioalgorithms.info

Page 16: Gene Prediction Bacterial

Promoter Structure in Prokaryotes (E.coli)!

Transcription starts at offset 0.

•! Pribnow Box (-10)

•! Gilbert Box (-30)

•! Ribosomal Binding Site (+10)

© www.bioalgorithms.info

Page 17: Gene Prediction Bacterial

Figure 13.14 Genomes 3 (© Garland Science 2007)

Page 18: Gene Prediction Bacterial

The Shine-Dalgarno sequence!

!! The ribosome binds to the messenger RNA through baseparing to the 30S ribosomal subunit.!

!! The binding site is the Shine-Dalgarno sequence (SD).!

!! The SD is a purine-rich sequence (consensus sequence: AGGAG) at the 5' end of most prokaryotic mRNAs. !

!! The SD is found 5-10 basepairs upstream from the start codon.!

Page 19: Gene Prediction Bacterial

Ribosome binding sites!

A logo of the ribosome-binding sites and start codon in E.coli genes

Probability distribution of spacer length between RBS and the gene start

© Besemer et al., 2001

Page 20: Gene Prediction Bacterial
Page 21: Gene Prediction Bacterial

>AE006641 GTATACTCTTCTTCCCTATACATTGTCGCAGCAAGCTTAGTTTCTTTAGCCTCTCTGCTTTCATTATTAC TTATAATCTTAATAGCAAGGAGACATATGATAGAGTATTTCTATATGATTCCTTCGTTCGTTTATATGAA CTTTATTGTCGCACTAAACTTCACTGCAATATTTTTAGAGTTAATAAGAGCACCTAGAGTGTGGGTAAAA ACTGAAAGAAGTGCCAAGGTTACGGGGGAGGTCATGGGATGATAACTGAATTTTTACTTAAAAAGAAATT AGAAGAACATTTAAGCCATGTAAAGGAAGAGAATACGATATATGTAACAGATTTAGTAAGATGCCCCAGA AGAGTAAGATATGAGAGTGAATACAAGGAGCTTGCAATCTCTCAGGTTTACGCGCCTTCAGCTATTTTAG GGGACATATTGCATCTCGGTCTTGAAAGCGTATTAAAAGGGAACTTTAATGCAGAAACTGAAGTTGAAAC TCTGAGAGAAATTAACGTCGGAGGTAAAGTTTATAAAATTAAAGGAAGAGCCGATGCAATAATTAGAAAT GACAACGGGAAGAGTATTGTAATTGAGATAAAAACTTCTAGAAGTGATAAAGGATTACCTCTAATTCATC ATAAAATGCAGCTACAGATATATTTATGGTTATTTAGTGCAGAAAAAGGTATACTAGTTTACATAACTCC AGATAGGATAGCTGAGTATGAAATAAACGAACCTTTAGATGAAGCAACAATAGTAAGACTTGCAGAGGAT ACAATAATGTTACAAAACTCACCTAGATTCAACTGGGAATGTAAATATTGCATATTTTCCGTCATTTGCC CAGCTAAACTAACCTAAAATTAAAATCTCTCATCGATATAATTAAATTGTGCACACTAGACCAGTAGTTG CCACAATAGCTGGGAGTGACAGTGGAGGAGGTGCTGGATTACAGGCTGATCTAAAGACGTTTAGCGCATT AGGAGTTTTTGGTACAACAATAATAACCGGTTTAACAGCACAGAATACAAGAACAGTTACAAAAGTATTA GAGATACCATTAGATTTCATTGAAGCTCAGTTTGATGCGGTTTGCCTAGATTTACATCCAACTCACGCCA AAACTGGAATGTTAGCTTCTGGTAAAGTGGTAGAACTTGTACTGAGAAAAATTAGAGAGTATAACATAAA ACTAGTTTTAGATCCAGTGATGGTTGCGAAATCTGGATCATTATTGGTAACAGAGGATATCTCGGAGCAA ATAAAAAAGGCGATGAAGGAGGCCATAATATCTACTCCAAACAGATATGAAGCTGAGATAATAAATAAGA CAAAGATTAATAGTCAAGATGATGTTATAAAAGCGGCAAGGGAAATTTATTCTAAGTATGGGAATGTTGT AGTTAAAGGATTTAATGGAGTAGATTACGCCATAATTGACGGAGAAGAAATAGAGTTAAAAGGTGATTAC ATCAGTACTAAAAATACACATGGTAGTGGAGACGTATTTTCTGCCTCCATAACTGCATATCTTGCCTTGG GATACAAACTTAAAGATGCATTAATAAGAGCTAAAAAATTCGCTACAATGACAGTCAAATACGGTTTGGA CTTAGGAGGAGGATATGGACCAGTAGATCCCTTTGCCCCTATAGAGTCCATAGTGAAGAGAGAAGAAGGA AGAAATCAGCTAGAAAACTTACTTTGGTACTTAGAGTCTAATCTTAACGTTATACTTAAACTAATTAACG

Can you spot the gene?!

/ (ATG|TTG|GTG)((…)*?)(TGA|TAG|TAA)/

Page 22: Gene Prediction Bacterial

Bacterial gene finding: relatively easy!

!! Dense Genomes!!! Short intergenic regions!!! Uninterrupted ORFs!!! Conserved signals!!! Abundant comparative information!!! Complete genomes!

© Chuck Stuben, 2005

Page 23: Gene Prediction Bacterial

Indicators of protein coding regions in bacterial DNA!

!! Intrinsic evidence!"! Sufficient ORF length. Long ORFs rarely occur by chance.!

"! Specific patterns of codon usage that are different from triplet frequencies in non-coding regions ("coding potential").!

"! The presence of ribosome binding sites (RBS) in the (-20)...(-1) region upstream of the start codon that help to direct ribosomes to the correct translation start positions. A part of the RBS is formed by the purine-rich Shine-Dalgarno (SD) sequence which is complementary to the 3’ end of the 16S rRNA. !

!! Extrinsic evidence!"! Similarity to known, especially experimentally characterized, gene products!

Page 24: Gene Prediction Bacterial

Gene-Finding Strategies!

Page 25: Gene Prediction Bacterial

Intrinsic approaches!

!! GeneMark (Borodovsky & McIninch, 1993)!"! non-homogeneous Markov models for DNA regions that code for proteins or are complementary to

them!"! homogeneous Markov models for non-coding regions !"! coding capacity of sliding windows is deduced through A Bayesian decision rule!

!! GLIMMER (Salzberg et al., 1998)!"! interpolated Markov models!"! takes into account DNA oligomers of varying length dependent on the local composition of the

sequence!!! EcoParse (Krohg et al., 1994)!

"! hidden Markov models!"! finds the maximum likelihood parse of a DNA sequence into coding and non-coding regions !"! no sliding windows used!

!! Hatzigeorgiou and Fickett !"! A program for gene recognition relying solely on ORF length and RBS!

!! FgenesB (Softberry, Inc.)!

Page 26: Gene Prediction Bacterial

Figure 5.1 Genomes 3 (© Garland Science 2007)

Page 27: Gene Prediction Bacterial

!! stop codons !!! start codons!

GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG

CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC

GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG

CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC

© www.bioalgorithms.info

Page 28: Gene Prediction Bacterial

Figure 5.2 Genomes 3 (© Garland Science 2007)

Page 29: Gene Prediction Bacterial

http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=c!

Page 30: Gene Prediction Bacterial
Page 31: Gene Prediction Bacterial
Page 32: Gene Prediction Bacterial
Page 33: Gene Prediction Bacterial

!! Detect potential coding regions by looking at ORFs!"! A genome of length n is comprised of (n/3) codons!"! Stop codons break genome into segments between consecutive Stop codons!"! The subsegments of these that start from the Start codon (ATG) are ORFs!

#!

3n3n3n

Genomic Sequence

Open reading frame

ATG TGA

© www.bioalgorithms.info

Page 34: Gene Prediction Bacterial

!! Long open reading frames may be a gene "! At random, we should expect one stop codon every

(64/3) ~= 21 codons "! However, genes are usually much longer than this

!! A basic approach is to scan for ORFs whose length exceeds certain threshold "! This is naïve because some genes (e.g. some neural

and immune system genes) are relatively short

© www.bioalgorithms.info

Page 35: Gene Prediction Bacterial

Open reading frames of "100 bp encoded on a 10-kb fragment of the Escherichia coli K12 genome from 3435250 to 3445250!

The six horizontal lines represent frames 1, 2, 3, !1, !2, and !3, respectively.

© Koonin & Galperin, 2003

Page 36: Gene Prediction Bacterial

© Russell F. Doolittle, 2002

Page 37: Gene Prediction Bacterial

Estimated over-annotation of genes in sequenced genomes!

© Skovgaard et al., 2001

Page 38: Gene Prediction Bacterial

Protein length distributions for Escherichia coli!

© Skovgaard et al., 2001

Page 39: Gene Prediction Bacterial

Protein length distributions for Aeropyrum pernix!

© Skovgaard et al., 2001

Page 40: Gene Prediction Bacterial

© Besemer et al., 2001

Page 41: Gene Prediction Bacterial

Intrinsic approach!

Page 42: Gene Prediction Bacterial

Statictical properties of protein coding regions!

!! Factors that contribute to the unequal usage of codons in a coding sequence:!"! Unequal use of amino acids!"! Unequal number of codons for different amino acids!"! Codon preference!

!! If reading frame 1 encodes a protein, it will influence the following factors:!"! The amino acid composition in both the coding frame and the other two

frames (frames 2 and 3)!"! The codon composition of all three frames!"! Positional base frequency: the frequency with which each of the four

bases occupies each of the three positions within codons!

© Staden, 1990

Page 43: Gene Prediction Bacterial

The Genetic Code!

Page 44: Gene Prediction Bacterial

SwissProt protein sequence database!

Page 45: Gene Prediction Bacterial

© Staden, 1990

Page 46: Gene Prediction Bacterial

© Staden, 1990

Page 47: Gene Prediction Bacterial

© Staden, 1990

Page 48: Gene Prediction Bacterial

Examining positional base frequency!

!! For each base measure the difference in its abundance in each position!"! E.g., relative abundance of base T in positions 1,2,3 of codons!"! The greater difference, the more likely that the sequence is coding!"! Uneven positional base frequencies method!

!! Study preferences for certain bases to occupy particular positions in codons!"! Compare the observed positional base frequencies with an expected

distribution!"! Positional base preference method !

© Staden, 1990

Page 49: Gene Prediction Bacterial

Uneven positional base frequencies method (Fickett, 1982)!

!! Nij: the number of times base i occurs in a position j of a codon!

!! Eij=(Ni1+Ni2+Ni3): expected value for each base in each position of a codon, assuming an even distribution!

!! D=!|Eij-Nij| : divergence from even

positional usage!!! Slide a window along the sequence,

one codon at a time!!! Calculate D for each position in which

a window is placed!

Coding Noncoding

EMBL database, 1984

© Staden, 1990

Page 50: Gene Prediction Bacterial

© Staden, 1990

Page 51: Gene Prediction Bacterial

Codon usage table!

Source: http://www.kazusa.or.jp/codon/

Page 52: Gene Prediction Bacterial

Source: http://www.kazusa.or.jp/codon/

Page 53: Gene Prediction Bacterial

TESTCODE (Fickett, 1982)!!! In protein coding regions every third base tends to be the same one much more often than by chance alone due

to non-random use of codons!!! Method:!

"! Count the number of each base at every third position starting at positions 1,2, and 3, and going to the end of the sequence window!

"! The assymetry statistic for each base is calculated as the ratio of the maximum count of the three possible reading frames devided by the minimum count for the same base plus 1!

"! The frequency of each base is the window is also calcutated!"! The resulting assymetry and frequency scores are converted to probabilities of being found in a coding region!

Plot of the nonrandomness of every third base in the sequence

Coding

Noncoding

© D.Mount, 2001

Page 54: Gene Prediction Bacterial

HMM for gene finding!

!! HMMs are able to model grammar!!! Many problems in biological sequence analysis have a

grammatical structure!!! Example: the grammar of eukaryotic gene structure (simplified)!

"! Exons and introns are „words“!"! Sentences: exon-intron-exon-intron..intron-exon!"! Sentences can never end with an intron!"! Exon can never follow an exon without an intron in between!"! Other constraints…!

© Krogh, 1998

Page 55: Gene Prediction Bacterial

HMM architecture for a parser for E.coli DNA with a simple intergenic model!

Central state •!Generates no nucleotides •!Connects all models

Main states

Insertions

Deletions Generate no nucleotides

The thickness of the arrows indicates the fraction of sequences making a given transition.

Insert state •!Produces random sequences from the base distribution of intergenic regions in the training set

© Krogh, 1994

Page 56: Gene Prediction Bacterial

How HMM generates a sequence of nucleotides!

!! Process: random walk starting in the middle of any of the HMMs. !"! begin at any state (e.g. central) and enter any of the rings!"! each such state transition has an associated probability!"! transitions out of the central state are chosen at random according to these probabilities (they sum to one)!

#! Example: a transition leading to the AAC codon model HMM generates the three nucleotides AAC with very high probability and then, with probability 1, makes the transition back to the central state.!

"! subsequently, a new transition out of the central state is selected randomly and independently of the previous transition!

!! Choosing one of the 61 codon models repeatedly results in a 'random gene‘!!! Gene termination!

"! entry into one of the rings below the central state!"! low probability (roughly: the number of intergenic regions divided by the number of codons in a typical

contig of E.coli DNA.)!"! one stop codon HMM generates both TAA and TGA, each according to its frequency of occurrence in E.

coli, and the other TAG!!! Intergenic region!

"! produced independently and at random by looping in the state labelled 'Intergene model‘!!! Start codon HMM!

"! generates either ATG, GTG or TTG, each with the appropriate probability (TTG is very rare in E. coli)!!! A transition is made back to the central state and the whole process repeated !!! Result: a sequence of nucleotides that is statistically similar to a contig of E. coli DNA consisting

of a collection of genes interspersed with intergenic regions.!

© Krogh, 1994

Page 57: Gene Prediction Bacterial

Finding the most likely random walk!

!! A dynamic programming method known as the Viterbi algorithm!!! Generates a parse of the contig: labels genes in the DNA by

identifying portions of the path that begin with the start codon at the end of the intergenic ring, pass through several amino acid codon HMMs, and return to one of the stop codons at the beginning of the intergenic ring!

!! The model parses a gene in one direction only and thus finds all genes on the direct strand!

!! To locate genes on the opposite strand, the reverse complement is parsed!

© Krogh, 1994

Page 58: Gene Prediction Bacterial

The gene model!!! The role of the codon HMMs is similar to codon usage statistics!!! Simplification: the codons in a gene are random and independent!!! The probability that a region is coding is simply the product of the probabilities of

the individual codons!!! The probability of an open reading frame (ORF) consisting of codons cl, c2,...ck and

excluding start and stop codons is!

Codon probability

ORF gene index

Contig length © Krogh, 1994

Page 59: Gene Prediction Bacterial

© Krogh, 1994

Page 60: Gene Prediction Bacterial

Gene index in E.coli!!! Typical genes have I<1!!! Average(I) = 0.935!!! 16% greater than 0.96!!! 7% greater than 0.98!!! 2% greater than 1.0!!! Used to rank predictions and

resolve ambiguities!

© Krogh, 1994

Page 61: Gene Prediction Bacterial

A parser with a complex intergenic model!

© Krogh, 1994

Page 62: Gene Prediction Bacterial

Traning data!

© Krogh, 1994

Page 63: Gene Prediction Bacterial

© Krogh, 1994

Page 64: Gene Prediction Bacterial

Modeling inter-codon dependencies: GeneMark!

!! Fifth-order Markov model (Lukashin and Borodovsky 1998)!"! uses sequence information from the previous five bases!"! the frequency of hexamers is used to differentiate between

coding and noncoding sequences!"! limitation: there must be many representatives of each

hexameric sequence in genes!

© D.Mount, 2001

Page 65: Gene Prediction Bacterial

GeneMark!!! Uses a Markov chain model to represent the statistics of coding and noncoding reading

frames!!! Uses dicodon statistics to identify coding regions!

•! Fifth-order Markov chains consist of terms P(a|x1x2x3x4x5) which represent the probability of the sixth base of sequence x being a given that the previous five bases were x1x2x3x4x5.

•! These terms must be defined for all possible pentamers with the general sequence b1b2b3b4b5(training data) •! When there are sufficient data for good statistics:

P(ab1b2b3b4b5) =nb1b2b3b4b5anb1b2b3b4b5a

a=A ,C ,G,T"

Where

nb1b2b3b4b5a is the number of times the sequence b1b2b3b4b5a occurs in the training data

•! GeneMark assumes each coding reading frame has unique dicodon statistics, labeled P1(a|b1b2b3b4b5), P2(a|b1b2b3b4b5), … P6(a|b1b2b3b4b5). Note: reading frames are identified by the codon position of the last base of the pentamer!

•! Non-coding regions are assumed to have the same statistics in all six reading frames Pnc(a|b1b2b3b4b5). •! These sets of parameters form seven distinct models, which can be used to estimate the likelihood that any given sequence is

coding or noncoding, and if coding, which reading frame is involved.

Page 66: Gene Prediction Bacterial

The probability of obtaining a sequence x=x1x2x3x4x5x6x7x8x9 in a coding region in the translated reading frame 3 (i.e. x1x2x3 is a translated codon):

P(x|3) = P2(x1x2x3x4x5)P2(x6|x1x2x3x4x5)P3(x7|x2x3x4x5x6)P1(x8|x3x4x5x6x7)P2(x9|x4x5x6x7x8)

where the first term is the probability of finding the pentamer x1x2x3x4x5 in the coding frame 2, which will result in x9 being in reading frame 3.

Note the cycling of successive terms through P1(a|…), P2(a|…), and P3(a|…). Such models are called periodic, phased, or inhomogeneous Markov models.

The probability of obtaining the sequence x on the complementary strand to the coding sequence can be obtained using cycling terms P4(a|…), P5(a|…), and P6(a|…).

Using the Bayes formula, the likelihood that the segment of sequence x is in coding frame 3:

P 3 x( ) =P(x 3)P 3( )

P(x nc)P nc( ) + P(x m)P m( )m=1

6

"

Where P(3) and P(nc) are the a priori probabilities of the coding frame 3 and noncoding models, respectively, and m is the index of the coding frame.

In GeneMark P(nc) – the a priori probability of the sequence being noncoding – was assumed to be !, and P(1)-P(6) assumed all to be 1/12. Sliding windows of 96 nucleotides were scored in steps of 12 nucleotides. If P(i|x), where i=1,6 exceeds a chosen threshold, the window is predicted to be in coding frame i.

Page 67: Gene Prediction Bacterial

Different types of markov models that output nucleotide sequence!

Homogeneous fifth-order Markov model, with the five states i-5 to i-1 generating state i. Each state corresponds to a nucleotide.

Three-periodic fifth-order Markov models, each modeling a different DNA reading frame. The probabilities are dependent on the position of the base within the codon. Each state is labeled with the codon position of the represented base

Page 68: Gene Prediction Bacterial

GeneMark text output!

Page 69: Gene Prediction Bacterial

GeneMark: graphical output!

Page 70: Gene Prediction Bacterial

Gene classes in Escherichia coli!

!! Three classes: class I, II, and III!"! differ not only in the statistical but in the biological sense!

!! Class I genes!"! intermediate codon usage bias!"! maintain a low or intermediate level of expression,!"! some genes may occasionally be expressed at a very high level in environmentally triggered

(rare) conditions!!! Class II genes!

"! high codon usage bias!"! highly expressed under exponential growth conditions!

!! Class III genes,!"! low codon usage bias!"! mainly belong to plasmids and insertion sequences!"! also includes genes coding for fimbriae, major pili, many membrane proteins, restriction

endonucleases and lambdoid phage lysogeny control proteins!"! can be expressed at a fairly high level!

© Médigue et al., 1991

Page 71: Gene Prediction Bacterial

© Philippe & Douady, 2003

Page 72: Gene Prediction Bacterial
Page 73: Gene Prediction Bacterial

Dealing with different gene pools!

!! Protein-coding sequences in a bacterial genome may not be homogeneous in their compositional features!

!! Program using models trained on the bulk set of protein-coding sequences will be insensitive in finding genes of minor inhomogeneity classes!

!! If preliminary information on gene classes is available, class-specific models of protein- coding regions improve the performance of the gene-finding method!

!! Possible solution: !"! use a set of ‘‘long’’ ORFs to obtain parameters of Markov models of protein-coding and

noncoding regions !"! initial models used to score the putative gene sequences and to form the cluster seeds for the

class-specific training sets.!"! run clusterization procedure until convergence!"! obtain several sets of presumably coding sequences with more homogeneous compositional

features!

© Hayes & Borodovsky, 1998

Page 74: Gene Prediction Bacterial

Hidden Markov model of a prokaryotic nucleotide sequence used in the GeneMark.hmm algorithm!

© Lukashin and Mark Borodovsky, 1998

Page 75: Gene Prediction Bacterial

Modeling inter-codon dependencies: Glimmer!

!! Interpolated Markov model (Salzberg et al. 1998)!"! finds a sufficient number of patterns by searching for the longest possible

patterns that are represented in the known gene sequences up to a length of eight bases. !

"! if there are not enough hexameric sequences, then pentamers or smaller may be more highly represented!

"! in other cases many representative patterns even longer than six bases may be found !

"! the longer the patterns, the more accurate the prediction!"! combines probability estimates from the different-sized patterns, giving

emphasis to longer patterns and weighting more heavily the patterns that are well represented in the training sequences!

Page 76: Gene Prediction Bacterial

A case for the extrinsic approach!

!! In real life putative coding regions predicted by intrinsic methods are verified by similarity searches!

!! 60-80% of genes in newly sequenced organisms have known counterparts in other species. In at least 30% of the cases reliable global alignments with well characterized proteins can be obtained!

Page 77: Gene Prediction Bacterial

ORPHEUS: Gene prediction in bacterial genomes!

!! Incorporation of extrinsic and intrinsic information!!! Consideration of coding potential and ribosome-

binding sites!!! High accuracy!!! Precise identification of gene starts!!! Completely automatic prediction!

© Frishman et al., 1998

Page 78: Gene Prediction Bacterial

Outline of the algorithm!Step I Step II Step III Step IV

High-stringency DNA-protein search (DPS, Huang, 1996)

DPS local alignments

Contig sequence

Extend local alignments Upstream: first start codon Downstream: stop codon

Codon usage

Similarity-based seed ORFs

Contig sequence DPS alignments

Extend local alignments Upstream: first start codon Downstream: stop codon

Similarity-based and ab initio ORFs with a single possible start

Find other ORFs with high coding potential

Alignment of -20-1 regions upstream from the start codon

RBS detection

RBS weight matrix

Contig sequence DPS alignments

Codon usage

Extend local alignments Upstream: first start codon Downstream: stop codon

Contig sequence DPS alignments

Codon usage RBS weight matrix

Find other ORFs with high coding potential

Full set of predicted genes

Extend all ORFs upstream till the best start is found

Page 79: Gene Prediction Bacterial

Example DPS output!

Page 80: Gene Prediction Bacterial

Outline of the algorithm!Step I Step II Step III Step IV

High-stringency DNA-protein search (DPS, Huang, 1996)

DPS local alignments

Contig sequence

Extend local alignments Upstream: first start codon Downstream: stop codon

Codon usage

Similarity-based seed ORFs

Contig sequence DPS alignments

Extend local alignments Upstream: first start codon Downstream: stop codon

Similarity-based and ab initio ORFs with a single possible start

Find other ORFs with high coding potential

Alignment of -20-1 regions upstream from the start codon

RBS detection

RBS weight matrix

Contig sequence DPS alignments

Codon usage

Extend local alignments Upstream: first start codon Downstream: stop codon

Contig sequence DPS alignments

Codon usage RBS weight matrix

Find other ORFs with high coding potential

Full set of predicted genes

Extend all ORFs upstream till the best start is found

Page 81: Gene Prediction Bacterial

Calculation of the coding potential!

Primary coding potential

W(abc) = log F(abc), where F(abc) is its genomic frequency Statistical weight of the codon abc

Normalized coding potential

Average codon weight

Standard deviation

Coding quality of a DNA fragment

G(b) – genomic frequency of the base b

To avoid the influence of the local base composition and gene shadows

Q(a1b1c1…anbncn ) = W (akbkck )k=1

n

"

R=Q"µn# n

µ = G(b1)G(b2)G(b3)W (b1b2b3)b1b2b3=AAA

TTT"

"2 = G(b1)G(b2)G(b3)[W (b1b2b3)#µ]2b1b2b3=AAA

TTT$

"(a1b1c1…anbncn)=R(a1b1c1…anbncn)#max{R(c0a1b1…cn#1anbn),R(b1c1a2…bncnan+1)}

Page 82: Gene Prediction Bacterial

Outline of the algorithm!Step I Step II Step III Step IV

High-stringency DNA-protein search (DPS, Huang, 1996)

DPS local alignments

Contig sequence

Extend local alignments Upstream: first start codon Downstream: stop codon

Codon usage

Similarity-based seed ORFs

Contig sequence DPS alignments

Extend local alignments Upstream: first start codon Downstream: stop codon

Similarity-based and ab initio ORFs with a single possible start

Find other ORFs with high coding potential

Alignment of -20-1 regions upstream from the start codon

RBS detection

RBS weight matrix

Contig sequence DPS alignments

Codon usage

Extend local alignments Upstream: first start codon Downstream: stop codon

Contig sequence DPS alignments

Codon usage RBS weight matrix

Find other ORFs with high coding potential

Full set of predicted genes

Extend all ORFs upstream till the best start is found

Page 83: Gene Prediction Bacterial

Alignment of the -20-1 gene regions!

1 AGAAAACGACAAGGAAAGGTATTTG 2 ATCTGAAGGGGGATTTTGGAGAATG 3 GTGAAAAATTGGAGGGAAACTCATG 4 CAATTAGAGAGGAGAATTCGATATG 5 AAAGCGAAGGACTCGGCGAGTAATG 6 ACATACCCTGCAAGGATGATTAATG 7 AAAAGGAAGGGAGGTCTATCTCATG 8 TTAAATATGGTGGTGGAAACAGATG 9 TTTCGAAAGGATTGTTTATAAAATG 10 GACTGAATATTTAAACAATTATGTG 11 TAAATAGAAGGAGGCGCACAAAATG ...

376 ACCGTTATGGAGGGATACATAAATG 377 TAACTGAATTTAAAGGAGGTTCATG 378 CGTTTAAAATGCATAATAAGGAGTG 379 TGGAAAACAGGGAGAGATCATAATG 380 TGCAAACTAACGGGGGGATAATATG 381 AAAAGAAAAAGGAGATGGGAGTATG 382 ATACAAGGTCTTTCGGGAGGCCTTG 383 ATCATGATCGGTCATATTTTAGATG 384 CGAATGTAAACATGTAGCAAGGGTG 385 AAATTCGGGAGAGTGAAGCGAGATG 386 ACAAACAGAATTCAGGTGAGACATG

L is the expected length of the Shine-Dalgarno box

Positional information content

F(b,j) - positional nucleotide frequencies in the initial alignment G(B) - genomic frequency of the base b

Initial position of the RBS signal

H( j)= F(b, j)log F(b, j) G(b)( )b=A

T"

H(k) j="20..."L# $ # # # # # # maxk= j

j+L"1%

Page 84: Gene Prediction Bacterial

The RBS weight matrix!

Positional nucleotide weights

Where maxN(B,k) is the frequency of the consensus nucleotide

W (b,k)= log N(b,k)+0.5maxbN(b,k)+0.5"

#

$ $ $

%

&

' ' '

Nucleotide Nucleotide position in the window

1 2 3 4 5 6

A 0.000 -0.909 -0.845 0.000 -0.909 -0.511

C -0.856 -1.000 -0.943 -0.923 -0.999 -0.748

G -0.804 0.000 0.000 -0.709 0.000 0.000

T -0.765 -0.897 -0.980 -0.868 -0.962 0.725

Consensus A G G A G G

Page 85: Gene Prediction Bacterial

RBS weight matrix for B.subtilis !

Nucleotide Nucleotide position in the window

1 2 3 4 5 6

A 0.000 -0.909 -0.845 0.000 -0.909 -0.511

C -0.856 -1.000 -0.943 -0.923 -0.999 -0.748

G -0.804 0.000 0.000 -0.709 0.000 0.000

T -0.765 -0.897 -0.980 -0.868 -0.962 0.725

Consensus A G G A G G

Page 86: Gene Prediction Bacterial

Finding the RBS positions in each sequence!

The Shine-Dalgarno signal score

1 AGAAAACGACAAGGAAAGGTATTTG -1.763 2 ATCTGAAGGGGGATTTTGGAGAATG -0.965 3 GTGAAAAATTGGAGGGAAACTCATG -0.765 4 CAATTAGAGAGGAGAATTCGATATG -0.511 5 AAAGCGAAGGACTCGGCGAGTAATG -1.884 6 ACATACCCTGCAAGGATGATTAATG -1.264 7 AAAAGGAAGGGAGGTCTATCTCATG -0.914 8 TTAAATATGGTGGTGGAAACAGATG -1.793 9 TTTCGAAAGGATTGTTTATAAAATG -1.847 10 GACTGAATATTTAAACAATTATGTG ------ 11 TAAATAGAAGGAGGCGCACAAAATG -0.110 ...

376 ACCGTTATGGAGGGATACATAAATG -0.925 377 TAACTGAATTTAAAGGAGGTTCATG -0.325 378 CGTTTAAAATGCATAATAAGGAGTG -2.061 379 TGGAAAACAGGGAGAGATCATAATG -1.315 380 TGCAAACTAACGGGGGGATAATATG -1.710 381 AAAAGAAAAAGGAGATGGGAGTATG -0.511 382 ATACAAGGTCTTTCGGGAGGCCTTG -1.159 383 ATCATGATCGGTCATATTTTAGATG ------ 384 CGAATGTAAACATGTAGCAAGGGTG -2.171 385 AAATTCGGGAGAGTGAAGCGAGATG -1.570 386 ACAAACAGAATTCAGGTGAGACATG -1.704

"(b1…bL)= W (bk ,k)k=1

L#

Page 87: Gene Prediction Bacterial

Automatically derived positional preference of the Shine-Dalgarno box in B.subtilis!

-0,30

-0,25

-0,20

-0,15

-0,10

-0,05

0,00

-22

-21

-20

-19

-18

-17

-16

-15

-14

-13

-12

-11

-10

-9 -8 -7

Window position relative to the start codon

Posi

tion

wei

ght

Page 88: Gene Prediction Bacterial

Incorporation of SD positional preference !

Positional weights

M - a possible position of the SD box within the RBS region N(M) the number of time this position occurs Nmax - the count of the most frequent position.

The strength of the RBS signal

V(M )= log N(M )+0.5Nmax+0.5"

#

$ $

%

&

' '

"(b1…bL ) = W (bk,k)k=1

L

# +V (M)

Page 89: Gene Prediction Bacterial

Outline of the algorithm!Step I Step II Step III Step IV

High-stringency DNA-protein search (DPS, Huang, 1996)

DPS local alignments

Contig sequence

Extend local alignments Upstream: first start codon Downstream: stop codon

Codon usage

Similarity-based seed ORFs

Contig sequence DPS alignments

Extend local alignments Upstream: first start codon Downstream: stop codon

Similarity-based and ab initio ORFs with a single possible start

Find other ORFs with high coding potential

Alignment of -20-1 regions upstream from the start codon

RBS detection

RBS weight matrix

Contig sequence DPS alignments

Codon usage

Extend local alignments Upstream: first start codon Downstream: stop codon

Contig sequence DPS alignments

Codon usage RBS weight matrix

Find other ORFs with high coding potential

Full set of predicted genes

Extend all ORFs upstream till the best start is found

Page 90: Gene Prediction Bacterial

Start codon selection procedure!

B.subtilis competence gene comA (Weinrauch et al., 1989)

Start codon position in the genome

RBS signal strength ( )

Coding potential quality ( )

Putative translation initiation region

3252280 -4.171 2.090 gaatcctcattgtaaaattatcGTG 3252331 -3.423 1.031 tctaggcggcgaggtcaatgggATG 3252364 -3.684 0.927 ctcgtcatatgatctcattttaATG 3252445 -2.388 0.484 aattttggaaacggattcgaatTTG 3252463 -3.738 0.312 catggaaggcaccaagacaattTTG 3252484 -3.950 0.777 gattgatgaccatccggctgtcATG 3252508 -2.706 0.780 ggaaaacatgaaaaagatactaGTG 3252523 -0.718 0.844 agtgagtaaaagggaggaaaacATG 3252544 -2.502 0.698 cttttttataaaatggaaaagaGTG

Page 91: Gene Prediction Bacterial

Fragment of the C.trachomatis genome with genes predicted by similarity (red)

and ab initio (blue) !

Page 92: Gene Prediction Bacterial

Comparison of the gene prediction results with the sets of sequences from the PIR-International and the genome sequencing projects!

Percentage of correctly identified

genes

(true positives)

Percentage of correct starts for

correctly identified genes

Percentage of correctly predicted genes

with correct starts using "leftmost

ATG" procedure

Dataset

Length>100 Length > 35 Length>100 Length > 35 Length>100 Length > 35

SUBPIR 93.3% - 80.2% - 61% -

ECOPIR 96.3% - 65.8% - 68.5% -

SUBGEN 98.9% 88.9% 79.1% 69.9% 64.4% 58.9%

ECOGEN 99.1% 87.1% 70.0% 61.3% 78.2% 71.7%

Page 93: Gene Prediction Bacterial

Organism GCcontent,%

Status Gene prediction method Information content ofthe RBS alignment

Helicobacter pylori 39 C GeneMark/GeneSmith 4.35Campylobacter jejuni 31 P - 3.67Enterococcus faecalis 36 P - 3.6Bacillus subtilis 43.5 C GeneMark/additional

searches for RBS signals3.47

Streptococcus pyogenes 36 P - 3.33Streptococcus pneumoniae 40 P - 3.24Aquifex aeolicus 43.4 C Similarity/Critica 3.13M. thermoautotrophicum 49.5 C GenomeBrowser 3.11Haemophilus influenzae 38 C GeneMark 3.06Methanococcus jannaschii 31.4 C GeneMark 3.00Pyrococcus horikoshii OT3 42 C Simple ORF extraction

and similarity analysis2.93

Chlamydia trachomatis 42 P - 2.84Neisseria gonorrhoeae 52 P - 2.73Pyrococcus furiosus 41 P - 2.73Archaeoglobus fulgidus 48.5 C GeneSmith/Critica 2.72Escherichia coli 50.8 C Manual analysis 2.55Neisseria meningitidis 52 P - 2.62A. actinomycetemcomitans 44 P - 2.50Treponema pallidum 52.8 C Glimmer 2.32Synechocystis sp. 48 C ??? 2.26Deinococcus radiodurans 66 P - 2.2Yersinia pestis 48 P - 2.18Borrelia burgdorferi 28.6 C Glimmer 2.08Pseudomonas aeruginosa 67 P - 2.08Mycobacterium leprae 58 P - 2.03Streptomyces coelicolor 72 P - 1.98Mycobacterium tuberculosis 65.6 C TbParse 1.93

Page 94: Gene Prediction Bacterial

Dependence of the gene start prediction accuracy (selected genes)!

40,0

50,0

60,0

70,0

80,0

90,0

100,0

1,5 2,0 2,5 3,0 3,5 4,0

Information content of the RBS alignment

Cor

rect

ly p

redi

cted

gen

e st

arts

, %

M.tuberculosis P.aeruginosa

Synechosystis sp.

N.gonorrhae

E.coli

H.influenzaeB.subtilis

S.pneumoniae

M.thermoautotrophicum

Page 95: Gene Prediction Bacterial

Comparison of the experimentally

determined and predicted RBS in C.jejuni !

RBS Spacer Start codonttAGGAt 5 TTGaAAGGct 3 ATGtAAGGAt 5 ATGcAAGGAt 6 ATGtAAGGAg 7 ATGaAAGGAc 5 ATGaAAGGAa 4 ATGcgAGGAc 2 ATGAAGGca 3 ATGagAGGAg 6 GTGaAAGGAa 6 ATGaAAGGAg 5 ATGaAAGGAg 5 ATGttAGGtg 6 ATGtcAGGAg 7 ATGtAAGGAa 6 ATGcAAGGAg 5 GTGaAAGGAt 5 ATGctAGcAc 5 ATGaAAGGAg 5 ATGaAAGGta 9 ATG

RBS Spacer Start codonAAAGGA 6 ATGACATGA 13 TTGAAAGAA 6 ATGAAAGGA 7 ATGAAAGGA 6 ATGAAAGGT 5 ATGAAAGGA 8 ATGAAAGGA 6 ATGAAAGGA 6 ATGTTAGGA 7 ATGAAAGGT 9 ATGAAAGGA 5 ATGAAAGGA 6 TTGAAAGGA 6 ATGAAAGAA 16 ATGAAAAGA 10 GTGAAAGAA 6 TTGAAAGGA 7 ATGAGAGGA 3 ATGAAAGGA 6 ATGCAAAGA 1 ATGAAAGGA 6 ATGCAAGGG 6 ATGATAAGA 6 TTGAGAAGA 13 TTGAAAGGA 6 ATGTTAGGA 6 ATGCAAGGA 7 ATGGAAGGA 5 ATGAAAGGA 2 ATGAAGAGA 9 ATGAAACAA 3 GTGAAAAGA 13 ATGGAAGGT 16 GTG

Orpheus

Wösten et al. (1998)

Page 96: Gene Prediction Bacterial

The C.jejuni RBS weight matrix!

Position in thewindow

1 2 3 4 5 6

A 0 0 0 -0.626 -0-806 0C -0.788 -0.916 -1.079 -0.871 -1.017 -1.017G -0.840 -0.862 -1.022 0 0 -0.960T -0.840 -0.811 -1.079 -0.871 -1.017 -0.854Consensus A A A G G A

Page 97: Gene Prediction Bacterial

Dependence of the gene start prediction accuracy on the RBS quality (complete genomes)!

40,0

50,0

60,0

70,0

80,0

90,0

100,0

1,5 2 2,5 3 3,5 4

Information content of the RBS alignment

Cor

rect

ly p

redi

cted

gen

e st

arts

, %

B.subtilis

A.aeolicus

M.thermoautotropicum

H.influenzae

M.jannaschii

P. horikishii

C.trachomatis

A.fulgidis

E.coli

T.pallidumSynechosystis sp.

B.burgdorferi

M.tuberculosis