A markovian approach for the analysis of the gene structure
C. MelodeLima (1), L. Guéguen (1), C. Gautier (1) and D. Piau (2)
(1) Biométrie et Biologie Evolutive UMR CNRS 5558, Université Claude Bernard Lyon 1, France
(2) Institut Camille Jordan UMR CNRS 5208, Université Claude Bernard Lyon 1, France
PRABI
Contents
• Introduction
• HMM for the genomic structure of DNA sequences
• Discrimination method based on HMM
• Conclusion
• Direction of research
Introduction
• Intensive sequencing
• Genes represent only 3% of the human genome
Markovian models are widely used for the identification of genes
We propose an analysis of the structural properties of genes, using a discrimination method based on HMMs
Hidden Markov model
Advantages:
• Each state represents a different type of region in the sequence
• The complexity of the algorithm is linear with respect to the length of the sequence
Drawback: the distribution of the sojourn time in a given state is geometric
The empirical distribution of the length of the exons is not geometric!
HMM for the genomic structure of DNA sequences
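Structure of the HMM model
[Diagram: two states, CDS and No CDS; transitions t1 and t2 between the states, self-transitions 1-t1 and 1-t2]
Base probabilities in one state: A: pA, C: pC, G: pG, T: pT
Base probabilities in the other state: A: qA, C: qC, G: qG, T: qT
CDS: coding sequence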
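The linear complexity mentioned in the introduction can be illustrated on this two-state structure. The following is a minimal sketch, not the authors' implementation: it assumes order-0 emissions (the slides use order 5), assigns the p probabilities to CDS and the q probabilities to No CDS, and uses made-up numerical values for t1, t2 and the base probabilities. The function name `forward_loglik` is hypothetical.

```python
import math

STATES = ("CDS", "NoCDS")

def forward_loglik(seq, start, trans, emit):
    """Log-likelihood of seq under the HMM, computed in one left-to-right pass."""
    loglik = 0.0
    alpha = {}
    for i, base in enumerate(seq):
        if i == 0:
            alpha = {s: start[s] * emit[s][base] for s in STATES}
        else:
            alpha = {
                s: emit[s][base] * sum(alpha[r] * trans[r][s] for r in STATES)
                for s in STATES
            }
        # Rescale at every position to avoid underflow on long sequences.
        scale = sum(alpha.values())
        loglik += math.log(scale)
        alpha = {s: v / scale for s, v in alpha.items()}
    return loglik

# Hypothetical parameter values, chosen only for illustration.
t1, t2 = 0.01, 0.005
start = {"CDS": 0.5, "NoCDS": 0.5}
trans = {
    "CDS": {"CDS": 1 - t1, "NoCDS": t1},
    "NoCDS": {"NoCDS": 1 - t2, "CDS": t2},
}
emit = {
    "CDS": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},    # pA, pC, pG, pT (assumed)
    "NoCDS": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},  # qA, qC, qG, qT (assumed)
}

print(forward_loglik("ATGGCGTAA", start, trans, emit))
```

Each base is visited once and the work per base depends only on the number of states, so the running time grows linearly with the sequence length.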
• Model of order 5
HMM for the genomic structure of DNA sequences
Several biological properties of DNA sequences were taken into account
• Model of order 5
Hidden states: S_{t-6}, S_{t-5}, S_{t-4}, S_{t-3}, S_{t-2}, S_{t-1}, S_t
Observed bases: X_{t-6}, X_{t-5}, X_{t-4}, X_{t-3}, X_{t-2}, X_{t-1}, X_t
HMM for the genomic structure of DNA sequences
Several biological properties of DNA sequences were taken into account
• Length distributions of exons and introns according to their position in genes. States of the model:
* Intergenic region
* Single exon
* Initial exon, Initial intron
* Internal exon, Internal intron
* Terminal exon, Terminal intron
• Direct and reverse strands
HMM for the genomic structure of DNA sequences
Several biological properties of DNA sequences were taken into account
• Codons:
[Diagram: the Exon state, with transition probabilities p and 1-p, is split into three states (frame 0, frame 1, frame 2), each with transitions labelled p and 1-p]
HMM for the genomic structure of DNA sequences
Several biological properties of DNA sequences were taken into account
Length of a hidden state
The sojourn time in an HMM state must follow a geometric law
[Diagram: CDS state with self-transition probability p and exit probability 1-p]
T: sojourn time in a given state
T follows a geometric law
HMM for the genomic structure of DNA sequences
Time of stay in state CDS    Probability
1                            1-p
2                            p (1-p)
3                            p^2 (1-p)
…                            …
n                            p^(n-1) (1-p)
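The table above can be reproduced numerically. This is a small sketch under the same convention as the slide (p is the probability of staying in the state, 1-p the probability of leaving it); the function names are hypothetical.

```python
def sojourn_pmf(n, p):
    """P(T = n) = p**(n-1) * (1-p): stay n-1 times, then leave the state."""
    if n < 1:
        return 0.0
    return p ** (n - 1) * (1 - p)

def mean_sojourn(p):
    """Expected sojourn time of a geometric law with stay probability p."""
    return 1.0 / (1.0 - p)

# The probabilities decrease with n, so the most likely length is always 1.
for n in (1, 2, 3, 10):
    print(n, sojourn_pmf(n, 0.9))
```

Whatever the value of p, the probability is highest at n = 1, which is why a single state cannot reproduce a length distribution whose mode is far from 1.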
Method: estimation of the length of a region
• Geometric laws do not fit the empirical distribution of exon lengths
[Figure: probability as a function of the length of the internal exons]
• We suggest replacing a single state with a succession of states
[Diagram: one State replaced by State 1 followed by State 2]
• Good fit with sums of 5 geometric random variables
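Chaining r states with the same self-transition probability makes the total length a sum of r geometric random variables. The sketch below is an illustration under the assumption that every state in the chain shares the same stay probability `p_stay` (a name introduced here); it computes the resulting length distribution both by explicit convolution and with the closed form of the negative binomial law.

```python
from math import comb

def chain_length_pmf(n, r, p_stay):
    """P(total length = n) for r chained states, each left with probability 1 - p_stay."""
    if n < r:
        return 0.0
    return comb(n - 1, r - 1) * p_stay ** (n - r) * (1 - p_stay) ** r

def convolve_geometric(r, p_stay, n_max):
    """Same distribution on 0..n_max, obtained by convolving r geometric laws."""
    geo = [0.0] + [p_stay ** (n - 1) * (1 - p_stay) for n in range(1, n_max + 1)]
    pmf = geo[:]
    for _ in range(r - 1):
        new = [0.0] * (n_max + 1)
        for a in range(1, n_max + 1):
            for b in range(1, n_max + 1 - a):
                new[a + b] += pmf[a] * geo[b]
        pmf = new
    return pmf

# With 5 chained states the distribution is no longer maximal at the minimum length.
print(chain_length_pmf(5, 5, 0.9), chain_length_pmf(20, 5, 0.9))
```

With r = 1 the formula reduces to the geometric law of a single state; with r = 5 the mode moves away from the shortest lengths.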
Method: estimation of the length of a region
• Data: Human genome
* extracted from HOVERGEN
• Different length distributions:
* Sum of n geometric laws of equal parameter, with n = 1..7
* Sum of 2 or 3 geometric laws of different parameters
• For each region:
* We choose the parameters that minimize the Kolmogorov-Smirnov distance
* We do not use the maximum likelihood method
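The selection by Kolmogorov-Smirnov distance can be sketched as a grid search. This is only an illustration for the equal-parameter case: the grid of candidate values, the function names and the parameterization by the exit probability `q` are assumptions, and the slides do not say how the minimization was actually carried out.

```python
from math import comb

def model_cdf(r, q, n_max):
    """CDF on 1..n_max of a sum of r geometric laws sharing exit probability q."""
    cdf = [0.0] * (n_max + 1)
    total = 0.0
    for n in range(r, n_max + 1):
        total += comb(n - 1, r - 1) * (1 - q) ** (n - r) * q ** r
        cdf[n] = total
    return cdf

def empirical_cdf(lengths, n_max):
    """Empirical CDF of observed lengths (integers between 1 and n_max)."""
    counts = [0] * (n_max + 1)
    for x in lengths:
        counts[x] += 1
    cdf = [0.0] * (n_max + 1)
    total = 0
    for n in range(1, n_max + 1):
        total += counts[n]
        cdf[n] = total / len(lengths)
    return cdf

def ks_distance(cdf_a, cdf_b):
    """Largest absolute gap between two CDFs defined on the same grid."""
    return max(abs(a - b) for a, b in zip(cdf_a, cdf_b))

def fit_by_ks(lengths, r_values=range(1, 8), q_grid=None):
    """Return (r, q, distance) minimizing the KS distance over a parameter grid."""
    if q_grid is None:
        q_grid = [i / 200 for i in range(1, 200)]  # assumed search grid
    n_max = max(lengths)
    target = empirical_cdf(lengths, n_max)
    best = None
    for r in r_values:
        for q in q_grid:
            d = ks_distance(target, model_cdf(r, q, n_max))
            if best is None or d < best[2]:
                best = (r, q, d)
    return best
```

A call such as `fit_by_ks(exon_lengths)`, where `exon_lengths` is a hypothetical list of observed lengths, returns the number of geometric laws, the exit probability and the distance reached.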
HMM for the genomic structure of DNA sequences
Results: Estimation of the length of a region
[Figure: probability as a function of the length of the initial exon, comparing maximum likelihood estimation and Kolmogorov-Smirnov estimation]
The model fits the empirical distribution very well
HMM for the genomic structure of DNA sequences
Results: Estimation of the length distribution of internal exons
[Figure: probability as a function of the length of the internal exons]
Sum of 5 geometric laws, p = 1/26
HMM for the genomic structure of DNA sequences
Results: Estimation of the length distribution of intronless genes
Many small genes with single exons are pseudogenes
Sum of 2 geometric laws, p = 1/440
Discrimination method based on HMM
Method: A model for initial, internal, terminal exons
• Emission probabilities for each state are estimated by the frequencies of words with 6 letters (model of order 5)
• Discrimination method to test the homogeneity between regions:
HMM1: Initial exon, HMM2: Internal exon
D = { log P(S | HMM1) - log P(S | HMM2) } / |S| (Eq. 1)
S is the test sequence of length |S|
The sequence is characterized by the HMM with the best likelihood
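Eq. 1 can be sketched in a few lines. In this simplified illustration each "HMM" is reduced to a single-state Markov chain of order k whose emission probabilities come from word counts of length k+1, smoothed with a pseudocount; the reduction to one state, the pseudocount and all function names are assumptions, not the authors' implementation.

```python
import math

ALPHABET = "ACGT"

def train_markov(seqs, order=5, pseudo=1.0):
    """Count words of length order+1 (context of `order` bases plus the next base)."""
    counts = {}
    for s in seqs:
        for i in range(order, len(s)):
            if s[i] not in ALPHABET:
                continue  # skip ambiguous bases such as N
            ctx = counts.setdefault(s[i - order:i], dict.fromkeys(ALPHABET, 0.0))
            ctx[s[i]] += 1.0
    return {"order": order, "pseudo": pseudo, "counts": counts}

def log_likelihood(seq, model):
    """log P(seq | model), conditioning on the first `order` bases."""
    k, pseudo = model["order"], model["pseudo"]
    ll = 0.0
    for i in range(k, len(seq)):
        ctx = model["counts"].get(seq[i - k:i], {})
        num = ctx.get(seq[i], 0.0) + pseudo
        den = sum(ctx.values()) + pseudo * len(ALPHABET)
        ll += math.log(num / den)
    return ll

def discrimination_score(seq, model1, model2):
    """D of Eq. 1: positive when model1 explains seq better than model2."""
    return (log_likelihood(seq, model1) - log_likelihood(seq, model2)) / len(seq)
```

Dividing by |S| makes scores comparable between test sequences of different lengths.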
Quality of the decision: we want to know whether the models are well adapted to their regions (HMMs are compared pairwise)
[Diagram: a set of N initial exon sequences goes through the decision; N1 are classified as initial exons and N-N1 as internal exons]
Discrimination method based on HMM
Each model is characterized by the frequency of sequence recognition
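The frequency of sequence recognition is the ratio N1/N of the decision diagram. A minimal sketch, assuming the decision is taken from the sign of the score D of Eq. 1 and that a score of exactly zero counts as not recognized (both are assumptions); the function name is hypothetical.

```python
def recognition_frequency(scores):
    """Fraction N1/N of test sequences whose score D favours the first model."""
    n1 = sum(1 for d in scores if d > 0)
    return n1 / len(scores)

# Four hypothetical scores: two sequences are assigned to the first model.
print(recognition_frequency([0.2, -0.1, 0.05, -0.3]))
```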
Results: Comparison of different HMMs on different test sequences
Internal exon ≈ Terminal exon
Initial exon ≠ Internal exon
Initial exon ≠ Terminal exon
Discrimination method based on HMM
To determine the break point in first exon sequences, we consider different HMMs:
[Diagram: the initial exon HMM is cut at position k into HMM Start (before k) and HMM End (after k)]
The HMM representing the initial exon was split into 2 HMMs around the kth base
• A “Start” HMM is trained on the first k bases
• An “End” HMM is trained on the remaining bases
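Preparing the two training sets for a given k amounts to cutting every initial exon at its kth base. A minimal sketch with a hypothetical function name; how exons shorter than k were handled is not stated in the slides, so here they contribute only to the "Start" set.

```python
def split_at_k(initial_exons, k):
    """Return (start_parts, end_parts) used to train the Start and End models."""
    start_parts = [seq[:k] for seq in initial_exons]
    # Exons with no base after position k have nothing to give to the End set.
    end_parts = [seq[k:] for seq in initial_exons if len(seq) > k]
    return start_parts, end_parts

starts, ends = split_at_k(["ACGTAC", "ACG"], 4)
print(starts, ends)
```

Repeating the split for several values of k and comparing the resulting models is one way to look for the position of the break.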
Discrimination method based on HMM
Results: Break in the homogeneity of the first coding exon
[Figure legend: M_EI80 compared with the other models]
Discrimination method based on HMM
Results: Initial exons
[Figure: HMM Start versus HMM End on initial exons. With signal peptide (SignalP): 25% and 75%. Without signal peptide: 90% and 10%.]
HMM Start characterizes the signal peptide well
Conclusion
Modelling of the exon length distribution:
• The model has relatively few parameters
* Sum of 5 geometric laws of the same parameter (internal exons)
* Sum of 3 geometric laws of different parameters (terminal exons)
• Sums of geometric laws fit the distribution of exon lengths well
Discrimination method based on HMM:
• Bad annotation of the intronless genes in the database
• Homogeneity between internal and terminal exons
• Break of homogeneity of the initial exon around the 80th base: signal peptide
Direction of research
Markovian models for the analysis of the organization of genomes
[Figure (Versteeg 2003): chromosome 9 profiles of gene density, GC content, intron size, repeated elements and gene expression]
Direction of research
Structure superposition in genomes
A chromosome:
• Isochore level
• Gene level
• Exon-intron level (intron, exon)
• Codon level (acc gcc agt tac ccc aga)
Direction of research
Scan the genome
– Build 3 HMMs adapted to the organization structure of each of the 3 isochore classes H, L, M:
H = [72%, 100%]
M = ]56%, 72%[
L = [0%, 56%]
– Human chromosomes are divided into overlapping 100 kb segments. Two successive segments overlap by half of their length.
– Bayesian approach: for each segment and for each model (H, L and M), we compute the probability P[Model | Segment]
Each segment is characterized by the model with the best probability
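The scanning procedure (100 kb segments overlapping by half, then P[Model | Segment] for the models H, L and M) can be sketched as follows. This is an illustration only: it assumes that a log-likelihood log P[Segment | Model] is already available for each model, uses a uniform prior unless one is supplied, and all function names are hypothetical.

```python
import math

def half_overlapping_windows(length, size=100_000):
    """(start, end) coordinates of segments of `size` bases overlapping by half."""
    step = size // 2
    return [(start, min(start + size, length))
            for start in range(0, max(length - step, 1), step)]

def posterior(loglik, prior=None):
    """P[Model | Segment] from log P[Segment | Model], via Bayes' rule."""
    models = list(loglik)
    if prior is None:
        prior = {m: 1.0 / len(models) for m in models}  # assumed uniform prior
    joint = {m: loglik[m] + math.log(prior[m]) for m in models}
    top = max(joint.values())
    # Log-sum-exp keeps the normalization stable for very negative values.
    norm = top + math.log(sum(math.exp(v - top) for v in joint.values()))
    return {m: math.exp(v - norm) for m, v in joint.items()}

def best_model(loglik, prior=None):
    """Model with the highest posterior probability for one segment."""
    post = posterior(loglik, prior)
    return max(post, key=post.get)

print(half_overlapping_windows(250_000))
print(best_model({"H": -1000.0, "M": -1002.0, "L": -1010.0}))
```

Working with log-likelihoods matters here because the likelihood of a 100 kb segment is far too small to be represented directly as a floating-point number.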
Direction of research
Results: Human chromosome 1
[Figure: distribution of isochores along human chromosome 1 (Model H, Model M, Model L), shown with gene density and G+C content]
Direction of research
Comparative Genomic Analysis
Comparing the human genome with genomes of different organisms can be useful to:
• better understand the structure and function of human genes
• study evolutionary changes among organisms
• help to identify the genes that are conserved among species
Human, Chimpanzee, Mouse, Chicken, Tetraodon