A markovian approach for the analysis of the gene structure

C. MelodeLima1, L. Guéguen1, C. Gautier1 and D. Piau2

1 Biométrie et Biologie Evolutive UMR CNRS 5558, Université Claude Bernard Lyon 1, France

2 Institut Camille Jordan UMR CNRS 5208, Université Claude Bernard Lyon 1, France

PRABI

Contents

• Introduction

• HMM for the genomic structure of DNA sequences

• Discrimination method based on HMM

• Conclusion

• Direction of research

Introduction

• Intensive sequencing

• Genes represent only 3% of the human genome

Markovian models are widely used for the identification of genes

We propose an analysis of the structural properties of genes, using a discrimination method based on HMMs

Hidden Markov model

Advantages:

• Each state represents a different type of region in the sequence

• The complexity of the algorithm is linear with respect to the length of the sequence

Drawback: The distribution of the sojourn time in a given state is geometric

The empirical distribution of the length of the exons is not geometric!

HMM for the genomic structure of DNA sequences

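Structure of the HMM

[Figure: two hidden states, CDS and non-CDS. Self-transition probabilities: 1-t1 and 1-t2. Transition probabilities between the two states: t1 and t2. Each state has its own base probabilities: (A: pA, C: pC, G: pG, T: pT) and (A: qA, C: qC, G: qG, T: qT).]

CDS: coding sequence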
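The linear complexity of HMM algorithms can be illustrated on this two-state structure. The following is a minimal sketch of the scaled forward recursion, not the authors' implementation: all numerical values are hypothetical, the uniform start distribution is an assumption, and t1 is read here as the probability of leaving the CDS state.

```python
import math

STATES = ("CDS", "non-CDS")

# Hypothetical parameter values, chosen only for illustration.
t1, t2 = 0.01, 0.005  # assumed: probabilities of leaving CDS / non-CDS
TRANS = {
    "CDS": {"CDS": 1 - t1, "non-CDS": t1},
    "non-CDS": {"CDS": t2, "non-CDS": 1 - t2},
}
EMIT = {
    "CDS": {"A": 0.22, "C": 0.28, "G": 0.28, "T": 0.22},
    "non-CDS": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
}
START = {"CDS": 0.5, "non-CDS": 0.5}  # assumed uniform start


def forward_log_likelihood(seq):
    """Return log P(seq) using the scaled forward recursion.

    The loop visits each base once, so the cost grows linearly
    with the length of the sequence.
    """
    alpha = {s: START[s] * EMIT[s][seq[0]] for s in STATES}
    scale = sum(alpha.values())
    alpha = {s: a / scale for s, a in alpha.items()}
    loglik = math.log(scale)
    for base in seq[1:]:
        alpha = {
            s: EMIT[s][base] * sum(alpha[r] * TRANS[r][s] for r in STATES)
            for s in STATES
        }
        scale = sum(alpha.values())
        alpha = {s: a / scale for s, a in alpha.items()}
        loglik += math.log(scale)
    return loglik
```

Rescaling the forward variables at each position avoids numerical underflow on long sequences while keeping a single pass over the bases.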

Several biological properties of DNA sequences were taken into account

• Model of order 5

[Figure: hidden states S(t-6), ..., S(t) and observed bases X(t-6), ..., X(t). In a model of order 5, the base X(t) depends on the state S(t) and on the five preceding bases X(t-1), ..., X(t-5).]

• Length distributions of exons and introns according to their position in genes:

[Figure: states of the model: intergenic region, single exon, initial exon, initial intron, internal exon, internal intron, terminal exon, terminal intron]

• Direct and reverse strands

• Codons:

[Figure: the exon state, with transition probabilities p and 1-p, is split into three states (frame 0, frame 1, frame 2), with transitions labelled p and 1-p]

Length of a hidden state

The sojourn time in an HMM state must follow a geometric law

[Figure: state CDS with self-transition probability p and exit probability 1-p]

T: sojourn time in a given state. T follows a geometric law:

Time of stay in state CDS    Probability
1                            1-p
2                            p (1-p)
3                            p^2 (1-p)
…                            …
n                            p^(n-1) (1-p)
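The geometric law above can be checked numerically. This is a small sketch in which the value of p is hypothetical; it shows that the probabilities sum to 1, that the mean sojourn time is 1/(1-p), and that the most probable length is always 1, which is the property that conflicts with observed exon lengths.

```python
def sojourn_pmf(n, p):
    """P(T = n) for a state with self-transition probability p, n >= 1."""
    return p ** (n - 1) * (1 - p)


p = 0.9  # hypothetical self-transition probability
lengths = range(1, 2001)  # truncation; the remaining tail mass is negligible
probs = [sojourn_pmf(n, p) for n in lengths]

total = sum(probs)
mean = sum(n * q for n, q in zip(lengths, probs))
mode = max(lengths, key=lambda n: sojourn_pmf(n, p))
```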

Method: estimation of the length of a region

• Geometric laws do not fit the empirical distribution of the length of exons

[Figure: empirical distribution of the length of the internal exons; axes: length of the internal exons, probability]

• We suggest replacing a single state by a succession of states (State → State 1, State 2)

• Good fit with sums of 5 geometric random variables
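A chain of k states with the same self-transition probability gives a sojourn time that is a sum of k geometric random variables. The sketch below uses the standard closed form for that sum; the function name and the parameter values are illustrative assumptions, and `stay` plays the role of p in the geometric law table.

```python
import math


def sum_geometric_pmf(n, k, stay):
    """P(L = n) when L is the sum of k independent geometric sojourn times.

    Each sojourn time T satisfies P(T = m) = stay**(m - 1) * (1 - stay),
    so L >= k and P(L = n) = C(n-1, k-1) * stay**(n-k) * (1-stay)**k.
    """
    if n < k:
        return 0.0
    return math.comb(n - 1, k - 1) * stay ** (n - k) * (1 - stay) ** k


k, stay = 5, 0.95  # hypothetical values
support = range(1, 3000)
mean = sum(n * sum_geometric_pmf(n, k, stay) for n in support)
mode = max(support, key=lambda n: sum_geometric_pmf(n, k, stay))
```

Unlike a single geometric law, the most probable length is no longer the shortest one, so the distribution has the humped shape seen in exon length histograms.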

Method: estimation of the length of a region

• Data: Human genome

* extracted from HOVERGEN

• Different length distributions:

* Sum of n geometric laws of equal parameter, with n = 1..7

* Sum of 2 or 3 geometric laws of different parameters

For each region:

* We choose parameters that minimize the Kolmogorov-Smirnov distance

* We do not use the maximum likelihood method
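The fitting criterion can be sketched as a grid search that minimizes the Kolmogorov-Smirnov distance between the empirical and model cumulative distributions. This is an illustration under stated assumptions, not the authors' procedure: only the equal-parameter case is covered, the search grid is arbitrary, and the data are synthetic lengths drawn from a known model rather than HOVERGEN sequences.

```python
import math
import random


def model_cdf(k, stay, n_max):
    """CDF on 1..n_max of a sum of k geometric laws (self-transition prob. stay)."""
    cdf, total = [], 0.0
    for n in range(1, n_max + 1):
        if n >= k:
            total += math.comb(n - 1, k - 1) * stay ** (n - k) * (1 - stay) ** k
        cdf.append(total)
    return cdf


def empirical_cdf(lengths, n_max):
    """Empirical CDF on 1..n_max of a list of positive integer lengths."""
    counts = [0] * (n_max + 1)
    for x in lengths:
        counts[x] += 1
    cdf, total = [], 0
    for n in range(1, n_max + 1):
        total += counts[n]
        cdf.append(total / len(lengths))
    return cdf


def ks_distance(f, g):
    """Largest absolute difference between two CDFs on the same grid."""
    return max(abs(a - b) for a, b in zip(f, g))


def fit_by_ks(lengths, ks=range(1, 8), grid=None):
    """Return (distance, k, stay) minimizing the KS distance over the grid."""
    grid = grid or [i / 100 for i in range(50, 100)]
    n_max = max(lengths)
    emp = empirical_cdf(lengths, n_max)
    return min(
        (ks_distance(emp, model_cdf(k, stay, n_max)), k, stay)
        for k in ks
        for stay in grid
    )


def sample_length(k, stay, rng):
    """Draw one length from a chain of k states (synthetic data only)."""
    total = 0
    for _ in range(k):
        total += 1
        while rng.random() < stay:
            total += 1
    return total


rng = random.Random(0)
lengths = [sample_length(3, 0.9, rng) for _ in range(2000)]
distance, k_hat, stay_hat = fit_by_ks(lengths)
```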

Results: Estimation of the length of a region

[Figure: distribution of the length of the initial exon (axes: length of the initial exon, probability), with the maximum likelihood estimation and the Kolmogorov-Smirnov estimation]

The model fits the empirical distribution very well

Results: Estimation of the length distribution of internal exons

[Figure: distribution of the length of the internal exons; axes: length of the internal exons, probability]

Sum of 5 geometric laws, p = 1/26

Results: Estimation of the length distribution of intronless genes

Many small genes with single exons are pseudogenes

Sum of 2 geometric laws, p = 1/440


Discrimination method based on HMM

Method: A model for initial, internal, terminal exons

• Emission probabilities for each state are estimated by the frequencies of words with 6 letters (model of order 5)
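Estimating order-5 emission probabilities from 6-letter word frequencies can be sketched as follows. The function name and the pseudocount are assumptions added for illustration (the pseudocount avoids zero probabilities for unseen words); the slides do not state how unseen words are handled.

```python
from collections import Counter


def train_markov(sequences, order=5, pseudocount=1.0):
    """Estimate P(base | previous `order` bases) from counts of words of length order + 1."""
    word_counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - order):
            word_counts[seq[i:i + order + 1]] += 1

    def prob(context, base):
        # Frequency of the word context+base among all words starting with context.
        total = sum(word_counts[context + b] for b in "ACGT")
        return (word_counts[context + base] + pseudocount) / (total + 4 * pseudocount)

    return prob


# Toy training set; a real model would be trained on annotated exon sequences.
emission = train_markov(["ACGTACGTACGT"])
p_c = emission("ACGTA", "C")
```

With order 5 there are 4^5 = 1024 contexts and 4 bases per context, so each state needs counts for 4096 words.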

• Discrimination method to test the homogeneity between regions:

HMM1: Initial exon

HMM2: Internal exon

D = [ log P(S | HMM1) - log P(S | HMM2) ] / |S| (Eq. 1)

S is the test sequence of length |S|

The sequence is characterized by the HMM with the best likelihood

Quality of the decision: We want to know if models are well adapted to their regions (HMMs are compared pairwise)

[Figure: the N initial exon sequences are submitted to the decision; N1 are recognized as initial exons and N-N1 as internal exons]

Each model is characterized by the frequency of sequence recognition
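The decision rule of Eq. 1 can be sketched in a few lines. To keep the example self-contained, the likelihood is computed with a toy model of independent bases that stands in for log P(S | HMM); the base probabilities are hypothetical, and reading the frequency of recognition as N1/N is an assumption.

```python
import math


def log_likelihood(seq, base_probs):
    """Toy log-likelihood with independent bases; stands in for log P(S | HMM)."""
    return sum(math.log(base_probs[b]) for b in seq)


def discrimination_score(seq, model1, model2):
    """D of Eq. 1: log-likelihood difference divided by the sequence length."""
    return (log_likelihood(seq, model1) - log_likelihood(seq, model2)) / len(seq)


def recognition_frequency(sequences, model1, model2):
    """Fraction of sequences better explained by model1 (D > 0), read as N1/N."""
    n1 = sum(1 for s in sequences if discrimination_score(s, model1, model2) > 0)
    return n1 / len(sequences)


# Hypothetical models: one GC-rich, one AT-rich.
gc_rich = {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}
at_rich = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}

d = discrimination_score("GGCC", gc_rich, at_rich)
freq = recognition_frequency(["GGCC", "AATT", "GCGC", "AAAT"], gc_rich, at_rich)
```

Dividing by |S| makes scores comparable between sequences of different lengths, which matters because exon lengths vary widely.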

Results: Comparison of different HMMs on different test sequences

Internal exon ≈ Terminal exon

Initial exon ≠ Internal exon

Initial exon ≠ Terminal exon

To determine the break point in first exon sequences, we consider different HMMs:

[Figure: the initial exon HMM is split at position k into HMM Start and HMM End]

The HMM representing the initial exon was split into 2 HMMs around the kth base

• A “Start” HMM is trained on the first k bases

• An “End” HMM is trained on the remaining bases
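Preparing the two training sets for a candidate break point k can be sketched as below. The function name is hypothetical, and the handling of exons shorter than k (they contribute only to the "Start" set) is an assumption that the slides do not specify.

```python
def split_at(sequences, k):
    """Split each sequence into its first k bases and the remaining bases.

    Sequences no longer than k contribute only to the start set (assumption).
    """
    start_set = [s[:k] for s in sequences if len(s) > 0]
    end_set = [s[k:] for s in sequences if len(s) > k]
    return start_set, end_set


# Toy initial exon sequences; real data would be annotated first coding exons.
exons = ["ACGTACGT", "ACG"]
start_set, end_set = split_at(exons, 4)
```

Repeating the split for several values of k and training a "Start" and an "End" model on each pair of sets gives the family of models that are then compared.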

Results: Break in the homogeneity of the first coding exon

[Figure: model M_EI80 compared with the other models]

Results: Initial exons

[Figure: initial exons recognized by HMM Start and by HMM End. Values shown: 25% and 75% with signal peptide (SignalP); 90% and 10% without signal peptide]

HMM Start characterizes the signal peptide well

Conclusion

Modelling of the exon length distribution:

• Sums of geometric laws fit the distribution of exon lengths well

Sum of 5 geometric laws of the same parameter (internal exons)

Sum of 3 geometric laws of different parameters (terminal exons)

• The model has relatively few parameters

Discrimination method based on HMM:

• Poor database annotation of the intronless genes

• Homogeneity between internal and terminal exons

• Break in the homogeneity of the initial exon around the 80th base: signal peptide


Direction of research

Markovian models for the analysis of the organization of genomes

[Figure: chromosome 9 (Versteeg 2003): gene density, GC content, size of introns, repeated elements, gene expression]

Structure superposition in genomes

A chromosome

Isochore level

Gene level

Exon-intron level

Codon level

intron, exon

acc gcc agt tac ccc aga

Scan the genome

– Build 3 HMMs adapted to the organization of each of the 3 isochore classes H, M, L

H = [72%, 100%]

M = ]56%, 72%[

L = [0%, 56%]

– Human chromosomes are divided into overlapping 100 kb segments. Two successive segments overlap by half of their length.

– Bayesian approach: for each segment and for each model (H, L and M), we compute the probability P[Model | Segment]

Each segment is characterized by the model with the highest probability
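The scan can be sketched in two steps: cutting a chromosome into half-overlapping segments, then turning the three segment likelihoods into posterior probabilities with Bayes' rule. This is a minimal sketch, not the authors' code: the uniform prior over the models, the dropping of a final incomplete segment, and the log-likelihood values are all assumptions.

```python
import math


def overlapping_segments(length, size=100_000):
    """Yield (start, end) for segments of `size` that overlap by half their length.

    A final incomplete segment is dropped (assumption).
    """
    step = size // 2
    start = 0
    while start + size <= length:
        yield start, start + size
        start += step


def posterior(log_likelihoods, priors=None):
    """P[Model | Segment] from log P[Segment | Model], by Bayes' rule.

    Uniform priors are assumed when none are given.
    """
    models = list(log_likelihoods)
    priors = priors or {m: 1 / len(models) for m in models}
    logs = {m: log_likelihoods[m] + math.log(priors[m]) for m in models}
    top = max(logs.values())  # subtract the maximum to avoid underflow
    weights = {m: math.exp(v - top) for m, v in logs.items()}
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}


segments = list(overlapping_segments(250_000))
# Hypothetical log-likelihoods of one segment under the three models.
post = posterior({"H": -10.0, "M": -12.0, "L": -11.0})
best = max(post, key=post.get)
```

Working with log-likelihoods and subtracting the maximum before exponentiating keeps the computation stable for 100 kb segments, where raw likelihoods are far below floating-point range.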

Results: Human chromosome 1

[Figure: along human chromosome 1: distribution of isochores (models H, M and L), gene density, and G+C content]

Comparative Genomic Analysis

Comparing the human genome with genomes of different organisms can be useful to:

• better understand the structure and function of human genes

• study evolutionary changes among organisms

• identify the genes that are conserved among species

Species: Human, Chimpanzee, Mouse, Chicken, Tetraodon