A markovian approach for the analysis of the gene structure

C. MelodeLima (1), L. Guéguen (1), C. Gautier (1) and D. Piau (2)

(1) Biométrie et Biologie Evolutive, UMR CNRS 5558, Université Claude Bernard Lyon 1, France
(2) Institut Camille Jordan, UMR CNRS 5208, Université Claude Bernard Lyon 1, France

PRABI



Contents

• Introduction

• HMM for the genomic structure of DNA sequences

• Discrimination method based on HMM

• Conclusion

• Direction of research


Introduction

• Intensive sequencing

• Genes represent only 3% of the human genome

Markovian models are widely used for the identification of genes

We propose an analysis of the structural properties of genes, using a discrimination method based on HMMs


Introduction

Hidden Markov model

Advantages:

• Each state represents a different type of region in the sequence

• The complexity of the algorithm is linear with respect to the length of the sequence

Drawback: the distribution of the sojourn time in a given state is geometric

The empirical distribution of the length of the exons is not geometric!


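HMM for the genomic structure of DNA sequences

Structure of the HMM model

[Figure: two-state HMM with hidden states CDS and No CDS; transition probabilities t1, t2, 1-t1 and 1-t2; base emission probabilities (A: pA, C: pC, G: pG, T: pT) for one state and (A: qA, C: qC, G: qG, T: qT) for the other]

CDS: coding sequence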
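
As a rough illustration of this two-state structure, the following Python sketch simulates a CDS / No CDS chain. The slides name t1, t2 and the p and q base probabilities but give no values, so every number below is a placeholder. Attaching t1 and the p probabilities to the CDS state is also an assumption about how the diagram should be read.

```python
import random

# Hypothetical values: the slides give no numbers for t1, t2, p or q.
T1 = 0.01   # assumed probability of leaving CDS at each base
T2 = 0.005  # assumed probability of leaving No CDS at each base
EMISSION = {
    "CDS":    {"A": 0.22, "C": 0.28, "G": 0.30, "T": 0.20},  # pA, pC, pG, pT
    "No CDS": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},  # qA, qC, qG, qT
}

def simulate(length, seed=0):
    """Sample hidden states and bases from the two-state HMM."""
    rng = random.Random(seed)
    state = "No CDS"
    states, bases = [], []
    for _ in range(length):
        probs = EMISSION[state]
        base = rng.choices(list(probs), weights=list(probs.values()))[0]
        states.append(state)
        bases.append(base)
        # Switch state with probability t1 (from CDS) or t2 (from No CDS).
        leave = T1 if state == "CDS" else T2
        if rng.random() < leave:
            state = "No CDS" if state == "CDS" else "CDS"
    return states, "".join(bases)

states, sequence = simulate(2000)
```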


HMM for the genomic structure of DNA sequences

Several biological properties of DNA sequences were taken into account

• Model of order 5

[Figure: dependency graph between the hidden states S_t, S_{t-1}, ..., S_{t-6} and the observed bases X_t, X_{t-1}, ..., X_{t-6}]
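
Assuming "order 5" means that the base emitted at position t depends on the current hidden state and on the five preceding bases (a reading that matches the later use of 6-letter word frequencies), the emission probability can be written as:

```latex
P\left(X_t = x_t \mid S_t = s,\; X_{t-1} = x_{t-1},\, \dots,\, X_{t-5} = x_{t-5}\right)
```

Under that reading, each state has $4^5 = 1024$ contexts with 3 free probabilities each, that is $3 \times 4^5 = 3072$ free emission parameters per state.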


HMM for the genomic structure of DNA sequences

Several biological properties of DNA sequences were taken into account

• Length distributions of exons and introns according to their position in genes:

[Figure: HMM states: intergenic region, single exon, initial exon, initial intron, internal exon, internal intron, terminal intron, terminal exon]

• Direct and reverse strands


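HMM for the genomic structure of DNA sequences

Several biological properties of DNA sequences were taken into account

• Codons:

[Figure: an exon state with transition labels p and 1-p, expanded into three states (frame 0, frame 1, frame 2) whose transitions carry the labels p and 1-p]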


HMM for the genomic structure of DNA sequences

Length of a hidden state

The sojourn time in an HMM state must follow a geometric law

[Figure: state CDS with self-transition probability p and exit probability 1-p]

T: sojourn time in a given state. T follows a geometric law.

Geometric law

Time of stay in state CDS    Probability
1                            1-p
2                            p (1-p)
3                            p^2 (1-p)
…                            …
n                            p^(n-1) (1-p)
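
The table above can be checked numerically with a small Python sketch. It uses the same convention as the slide (p is the probability of staying in the state); the value of p is a placeholder, not a figure from the talk.

```python
def geometric_pmf(n, p):
    """P(T = n) for a state with self-transition probability p."""
    if n < 1:
        return 0.0
    return p ** (n - 1) * (1 - p)

def mean_sojourn(p):
    """Expected sojourn time, 1 / (1 - p)."""
    return 1.0 / (1.0 - p)

p = 0.99  # hypothetical stay probability, not a value from the slides
probabilities = [geometric_pmf(n, p) for n in range(1, 6)]
```

Because the probabilities decrease with n for any p in (0, 1), the most likely sojourn time is always 1. A single state therefore cannot reproduce a length distribution whose peak lies away from 1.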


HMM for the genomic structure of DNA sequences

Method: estimation of the length of a region

• A geometric law does not fit the empirical distribution of the length of exons

[Figure: probability against length of the internal exons]

• We suggest:

[Figure: diagram relating a single state to State 1 and State 2]

• Good fit with sums of 5 geometric random variables

[Figure: probability against length of the internal exons, with the fitted distribution]
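
To see why a sum of geometric variables can fit where a single one cannot, here is a minimal Python sketch. It assumes the sum arises from chaining k identical states, each with stay probability p, so that the total time follows a negative binomial law. The function names and the value of p are illustrative only.

```python
from math import comb

def sum_geometric_pmf(n, k, p):
    """P(T = n) when T is the sum of k geometric sojourn times,
    each with stay probability p (a chain of k identical states)."""
    if n < k:
        return 0.0
    return comb(n - 1, k - 1) * p ** (n - k) * (1 - p) ** k

def mode(k, p, n_max=5000):
    """Most probable total length, found by direct search up to n_max."""
    return max(range(1, n_max + 1), key=lambda n: sum_geometric_pmf(n, k, p))

p = 0.95  # hypothetical stay probability
single_state_mode = mode(1, p)  # a single geometric law peaks at length 1
five_state_mode = mode(5, p)    # a chain of 5 states peaks at a longer length
```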


HMM for the genomic structure of DNA sequences

Method: estimation of the length of a region

• Data: Human genome

* extracted from HOVERGEN

• Different length distributions:

* Sum of 1 to 7 geometric laws of equal parameter

* Sum of 2 or 3 geometric laws of different parameters

For each region:

* We choose the parameters that minimize the Kolmogorov-Smirnov distance

* We do not use the maximum likelihood method
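
The fitting step for the equal-parameter case could look like the Python sketch below. It is only an outline under stated assumptions: the Kolmogorov-Smirnov distance is taken as the largest gap between the empirical and model cumulative distributions over integer lengths, q denotes the exit probability of each geometric law, and the coarse grid search is a stand-in for whatever optimizer the authors used.

```python
from math import comb

def model_cdf(k, q, n_max):
    """CDF on 1..n_max of a sum of k geometric laws with exit probability q."""
    cdf, total = [], 0.0
    for n in range(1, n_max + 1):
        if n >= k:
            total += comb(n - 1, k - 1) * (1 - q) ** (n - k) * q ** k
        cdf.append(total)
    return cdf

def ks_distance(lengths, k, q):
    """Largest gap between empirical and model CDFs over integer lengths."""
    n_max = max(lengths)
    counts = [0] * (n_max + 1)
    for length in lengths:
        counts[length] += 1
    empirical, running = [], 0
    for n in range(1, n_max + 1):
        running += counts[n]
        empirical.append(running / len(lengths))
    return max(abs(e - m) for e, m in zip(empirical, model_cdf(k, q, n_max)))

def fit_by_ks(lengths, ks=range(1, 8), grid_size=200):
    """Grid search over the number of laws k and the exit probability q."""
    candidates = [(k, i / (grid_size + 1))
                  for k in ks for i in range(1, grid_size + 1)]
    return min(candidates, key=lambda kq: ks_distance(lengths, kq[0], kq[1]))

# Made-up lengths, only to show the call
toy_lengths = [8, 10, 11, 12, 13, 14, 15, 15, 16, 17, 18, 20, 22, 25, 30]
best_k, best_q = fit_by_ks(toy_lengths)
```

For real exon lengths, where q is small, a finer or logarithmic grid would be needed.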


HMM for the genomic structure of DNA sequences

Results: Estimation of the length of a region

[Figure: probability against length of the initial exon, with curves for maximum likelihood estimation and Kolmogorov-Smirnov estimation]


HMM for the genomic structure of DNA sequences

Results: Estimation of the length distribution of internal exons

[Figure: probability against length of the internal exons, with the fitted sum of 5 geometric laws, p = 1/26]

The model fits the empirical distribution very well
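
As a quick consistency check, suppose that p = 1/26 denotes the per-base probability of leaving each of the 5 chained states. This is an assumption: the geometric law slide uses p for the probability of staying, under which reading the value would not make sense. The expected internal exon length under the fitted model would then be:

```latex
\mathbb{E}[T] = \sum_{i=1}^{5} \mathbb{E}[T_i] = 5 \times \frac{1}{1/26} = 130 \text{ bases}
```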


HMM for the genomic structure of DNA sequences

Results: Estimation of the length distribution of intronless genes

Many small genes with single exons are pseudogenes

Sum of 2 geometric laws, p = 1/440



Discrimination method based on HMM

Method: A model for initial, internal, terminal exons

• Emission probabilities for each state are estimated by the frequencies of words with 6 letters (model of order 5)

• Discrimination method to test the homogeneity between regions:

[Figure: a test sequence is scored by HMM1 (initial exon) and HMM2 (internal exon)]

D = [ log P(S | HMM1) - log P(S | HMM2) ] / |S|   (Eq. 1)

S is the test sequence of length |S|

The sequence is characterized by the HMM with the best likelihood
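
Eq. 1 can be sketched in Python as follows. The sketch assumes that each region-specific HMM reduces to an order-5 Markov chain over the bases, trained from 6-letter word counts. The pseudocount and the choice to condition on the first five bases are additions made here to keep the example well defined; they are not taken from the slides.

```python
from collections import defaultdict
from math import log

ALPHABET = "ACGT"
ORDER = 5

def train(sequences, pseudocount=1.0):
    """Estimate P(base | previous 5 bases) from 6-letter word counts.
    The pseudocount is an assumption, added to avoid log(0)."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(ORDER, len(seq)):
            counts[seq[i - ORDER:i]][seq[i]] += 1

    def prob(context, base):
        row = counts.get(context, {})
        total = sum(row.values()) + pseudocount * len(ALPHABET)
        return (row.get(base, 0.0) + pseudocount) / total

    return prob

def log_likelihood(seq, prob):
    """log P(S | model), conditioning on the first 5 bases."""
    return sum(log(prob(seq[i - ORDER:i], seq[i]))
               for i in range(ORDER, len(seq)))

def discrimination(seq, prob1, prob2):
    """D from Eq. 1: positive values favour model 1."""
    return (log_likelihood(seq, prob1) - log_likelihood(seq, prob2)) / len(seq)

# Toy training sets, made up for the example
model1 = train(["ACGTACGTACGTACGTACGT"])
model2 = train(["AAAAAAAAAAAAAAAAAAAA"])
d = discrimination("ACGTACGTACGT", model1, model2)
```

Dividing by |S| makes scores comparable across test sequences of different lengths.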


Discrimination method based on HMM

Quality of the decision: we want to know whether the models are well adapted to their regions (HMMs are compared pairwise)

[Figure: a set of N initial exon sequences goes through the decision step, which labels N1 of them as initial exons and N-N1 as internal exons]

Each model is characterized by the frequency of sequence recognition
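
The recognition frequency N1 / N can be computed from the D scores of the test sequences, as in this short Python sketch. It assumes the decision rule "D > 0 means HMM1", with ties going to HMM2; the scores are made up.

```python
def recognition_frequency(scores):
    """Fraction N1 / N of test sequences with D > 0, i.e. assigned to HMM1."""
    if not scores:
        raise ValueError("need at least one score")
    n1 = sum(1 for d in scores if d > 0)
    return n1 / len(scores)

# Hypothetical D values for N = 5 initial exon sequences
d_values = [0.12, 0.03, -0.05, 0.20, 0.01]
frequency = recognition_frequency(d_values)  # 4 of the 5 are assigned to HMM1
```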


Discrimination method based on HMM

Results: Comparison of different HMMs on different test sequences

• Internal exon ≈ Terminal exon

• Initial exon ≠ Internal exon

• Initial exon ≠ Terminal exon


Discrimination method based on HMM

Results: Break in the homogeneity of the first coding exon

To determine the break point in first exon sequences, we consider different HMMs:

The HMM representing the initial exon was split into 2 HMMs around the kth base

• A “Start” HMM is trained on the first k bases

• An “End” HMM is trained on the remaining bases

[Figure: the initial exon HMM split at position k into HMM Start and HMM End]
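
Preparing the two training sets for a given k might look like the Python sketch below. The function name and the example sequences are hypothetical, and how exons shorter than k were handled in the study is not stated, so the treatment here is an assumption.

```python
def split_at(sequences, k):
    """Return (start_set, end_set): the first k bases and the remainder.
    Sequences with no bases after position k contribute nothing to end_set."""
    start_set = [seq[:k] for seq in sequences]
    end_set = [seq[k:] for seq in sequences if len(seq) > k]
    return start_set, end_set

exons = ["ATGGCTCTGTGGATGCGC", "ATGAAACGC"]  # made-up sequences
start_set, end_set = split_at(exons, k=9)
```

Repeating the split for several values of k and comparing the resulting models is one way to locate the break point.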


Discrimination method based on HMM

Results: Break in the homogeneity of the first coding exon

[Figure: comparison of the model M_EI80 with the other models]


Discrimination method based on HMM

Results: Initial exons

[Figure: recognition by HMM Start and HMM End of initial exons with a signal peptide (SignalP; values shown: 25%, 75%) and without a signal peptide (values shown: 90%, 10%)]

HMM Start characterizes the signal peptide well


Conclusion

Modelling of the exon length distribution:

• Sums of geometric laws fit the distribution of exon lengths well

Sum of 5 geometric laws of the same parameter (internal exons)

Sum of 3 geometric laws of different parameters (terminal exons)

• The model has relatively few parameters

Discrimination method based on HMM:

• Poor annotation of intronless genes in the database

• Homogeneity between internal and terminal exons

• Break in the homogeneity of the initial exon around the 80th base (signal peptide)



Direction of research

Markovian models for the analysis of the organization of genomes

[Figure: chromosome 9 (Versteeg 2003), with tracks for gene density, GC content, size of introns, repeated elements and gene expression]


Direction of research

Structure superposition in genomes

[Figure: nested levels of organization of a chromosome: isochore level, gene level, exon-intron level (intron, exon) and codon level (acc gcc agt tac ccc aga)]


Direction of research

Scan the genome

– Build 3 HMMs adapted to the organization structure of each of the 3 isochore classes H, L, M

H = [72%, 100%]

M = ]56%, 72%[

L = [0%, 56%]

– Human chromosomes are divided into overlapping 100 kb segments. Two successive segments overlap by half of their length.

– Bayesian approach: for each segment and for each model (H, L and M), we compute the probability P[Model | Segment]

The segment is characterized by the model with the best probability
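
The scanning procedure can be outlined in Python as below. This is a sketch under assumptions: the segment log-likelihoods are taken as given (they would come from the three HMMs), the prior over H, L and M is uniform unless specified, and the numeric values are invented for the example.

```python
from math import exp

def windows(length, size=100_000):
    """Start/end coordinates of segments overlapping by half their length.
    A trailing stretch shorter than half a segment is ignored."""
    step = size // 2
    last_start = max(length - size, 0)
    return [(s, min(s + size, length)) for s in range(0, last_start + 1, step)]

def posterior(log_likelihoods, priors=None):
    """P[Model | Segment] from log P(Segment | Model), via Bayes' rule.
    A uniform prior is assumed when none is given."""
    models = list(log_likelihoods)
    if priors is None:
        priors = {m: 1.0 / len(models) for m in models}
    shift = max(log_likelihoods.values())  # avoids underflow in exp
    weights = {m: priors[m] * exp(log_likelihoods[m] - shift) for m in models}
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}

def best_model(log_likelihoods):
    """Model with the highest posterior probability for one segment."""
    post = posterior(log_likelihoods)
    return max(post, key=post.get)

segments = windows(250_000)
# Made-up log-likelihoods for one segment under the three models
label = best_model({"H": -138200.0, "M": -138150.0, "L": -138400.0})
```

Subtracting the largest log-likelihood before exponentiating matters here, because log-likelihoods of 100 kb segments are far too negative to exponentiate directly.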


Direction of research

Results: Human chromosome 1

[Figure: distribution of isochores along human chromosome 1 as assigned by Model H, Model M and Model L, shown with gene density and G+C content]


Direction of research

Comparative Genomic Analysis

Comparing the human genome with genomes of different organisms can be useful to:

• better understand the structure and function of human genes

• study evolutionary changes among organisms

• help to identify the genes that are conserved among species


Direction of research

[Figure: species compared: Human, Chimpanzee, Mouse, Chicken, Tetraodon]