Yaroslav Ryabov Lognormal Pattern of Exon size distributions in Eukaryotic genomes


Page 1: Yaroslav Ryabov Lognormal Pattern of Exon size distributions in Eukaryotic genomes

Yaroslav Ryabov

Lognormal Pattern of Exon Size Distributions in Eukaryotic Genomes

Page 2: Yaroslav Ryabov Lognormal Pattern of Exon size distributions in Eukaryotic genomes

Outline

• Global and local approaches to the analysis of genetic information
• Vocabulary of contemporary genetics: exons and introns
• What can we learn from exon size distributions? Lognormal pattern and two classes of exons
• How can we model exon size distributions of real genomes?
• What could be the biological reason for the observed pattern of exon size distributions?
• Conclusions

Page 3: Yaroslav Ryabov Lognormal Pattern of Exon size distributions in Eukaryotic genomes

Analyzing Genetic Information

Local approach: analyzing details of the nucleotide sequence
• Watson and Crick, 1953
• Marshall Nirenberg, 1968
• Basic Local Alignment Search Tool (BLAST), 1990

Global approach: analyzing properties of the entire genome
• Gregor Johann Mendel, 1866
• Recombination of inherited properties
• Frequency of mutations
• Size of genome, etc.

Page 4: Yaroslav Ryabov Lognormal Pattern of Exon size distributions in Eukaryotic genomes

Analyzing Genetic Information in Post-Genomic Era

More than 60 animal genomes with complete annotations

Back to the global approach?

Page 6: Yaroslav Ryabov Lognormal Pattern of Exon size distributions in Eukaryotic genomes

Prokaryote and Eukaryote

• Prokaryote: pro + karyon = "before" + "nucleus"
• Eukaryote: eu + karyon = "good" + "nucleus"

Page 7: Yaroslav Ryabov Lognormal Pattern of Exon size distributions in Eukaryotic genomes

Gene expression

DNA → (transcription with RNA polymerase) → mRNA → (translation with ribosome) → Protein

Page 8: Yaroslav Ryabov Lognormal Pattern of Exon size distributions in Eukaryotic genomes

Splicing in Eukaryotes

DNA → (transcription with RNA polymerase) → transcript containing exons and introns → (splicing) → mRNA

Page 9: Yaroslav Ryabov Lognormal Pattern of Exon size distributions in Eukaryotic genomes

Prokaryote vs Eukaryote

Prokaryote
• Most of the DNA code is used to produce some cellular product
• Short genomes: ~ 1 000 exons
• Long exons: ~ 1 000 base pairs

Eukaryote
• A substantial fraction of the DNA consists of "silent" regions; genes are divided into exons and introns
• Long genomes: ~ 100 000 exons
• Short exons: ~ 100 base pairs

Page 10: Yaroslav Ryabov Lognormal Pattern of Exon size distributions in Eukaryotic genomes

Prokaryote vs Eukaryote: Exon size distributions

[Figure: two histograms of number of counts (in thousands) versus exon length (in 1000 base pairs, range 0.0 to 2.0). Panels: Homo sapiens (vertical axis up to 80) and Flavobacteria bacterium (vertical axis up to 0.20).]

Page 11: Yaroslav Ryabov Lognormal Pattern of Exon size distributions in Eukaryotic genomes

What can we learn from exon size distributions?

[Figure: the same two histograms of number of counts (in thousands) versus exon length (in 1000 base pairs, range 0.0 to 2.0) for Homo sapiens and Flavobacteria bacterium.]

• Flavobacteria bacterium: 2 592 exons, 1 075 b.p. mean exon size
• Homo sapiens: 299 298 exons, 267 b.p. mean exon size
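The two pairs of numbers on this slide can be compared with a few lines of arithmetic. The sketch below is only an illustration: it assumes the larger exon count belongs to Homo sapiens, and it treats the quoted mean exon sizes as exact, so the totals are approximate. The dictionary layout and function name are hypothetical.

```python
# Exon counts and mean exon sizes as quoted on the slide.
# Assumption: the larger exon count belongs to Homo sapiens.
genomes = {
    "Flavobacteria bacterium": {"exons": 2_592, "mean_exon_bp": 1_075},
    "Homo sapiens": {"exons": 299_298, "mean_exon_bp": 267},
}


def total_exonic_length(genome):
    """Approximate total exon length: number of exons times mean exon size."""
    return genome["exons"] * genome["mean_exon_bp"]


for name, genome in genomes.items():
    print(f"{name}: about {total_exonic_length(genome):,} b.p. in exons")

# Ratio of exon counts between the two genomes (roughly two orders of magnitude).
count_ratio = genomes["Homo sapiens"]["exons"] / genomes["Flavobacteria bacterium"]["exons"]
print(f"exon count ratio: {count_ratio:.0f}")
```

The eukaryotic genome has roughly a hundred times more exons, while its mean exon is about four times shorter.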

Page 12: Yaroslav Ryabov Lognormal Pattern of Exon size distributions in Eukaryotic genomes

Poisson Process

Assumption: the probability of splitting DNA at any location and at any time is constant (constant frequency of splitting).

Consequence: the lengths of intervals between splitting points obey an exponential distribution.

[Figure: histogram of number of counts (in thousands, up to 80) versus exon length (1000 b.p., range 0.0 to 2.0).]
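The link between a constant splitting probability and exponentially distributed interval lengths can be checked numerically. The following is a minimal sketch, not the author's code: it assumes a hypothetical sequence of 1 000 000 b.p. with 10 000 split points placed uniformly at random, and it tests two signatures of an exponential distribution (standard deviation equal to the mean, and a fraction of about exp(-1) of intervals longer than the mean).

```python
import random
import statistics


def interval_lengths(genome_length, n_splits, seed=0):
    """Drop split points uniformly at random along a sequence and return
    the lengths of the intervals between neighbouring split points."""
    rng = random.Random(seed)
    points = sorted(rng.uniform(0.0, genome_length) for _ in range(n_splits))
    return [right - left for left, right in zip(points, points[1:])]


# Hypothetical numbers, chosen only for illustration.
gaps = interval_lengths(genome_length=1_000_000, n_splits=10_000)

mean_gap = statistics.mean(gaps)
# Exponential distribution: standard deviation / mean is close to 1.
cv = statistics.stdev(gaps) / mean_gap
# Exponential distribution: about exp(-1) ~ 0.37 of intervals exceed the mean.
frac_longer = sum(gap > mean_gap for gap in gaps) / len(gaps)

print(f"mean interval: {mean_gap:.1f} b.p.")
print(f"std / mean: {cv:.2f}")
print(f"fraction longer than the mean: {frac_longer:.2f}")
```

Uniformly placed points approximate a Poisson process only when there are many of them, which is why the sketch uses a large number of split points.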

Page 13: Yaroslav Ryabov Lognormal Pattern of Exon size distributions in Eukaryotic genomes

Poisson Process, in logarithmic scale

Assumption: the probability of splitting DNA at any location and at any time is constant (constant frequency of splitting).

Consequence: the lengths of intervals between splitting points obey an exponential distribution.

[Figure: histogram of number of counts (in thousands, up to 40) versus Log10(Exon Length), range 0 to 4.]
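An exponential distribution has no peak on a linear axis, but its histogram does show one on a logarithmic axis, located near the logarithm of the mean length. The sketch below illustrates this under stated assumptions: lengths are drawn from an exponential distribution with a mean of 1 075 b.p. (the mean exon size quoted earlier for the bacterial genome), and the bin width of 0.1 and sample size are arbitrary choices.

```python
import math
import random
from collections import Counter


def log10_histogram_mode(mean_length, n_exons=100_000, bin_width=0.1, seed=0):
    """Draw exponentially distributed lengths, histogram log10(length),
    and return the centre of the fullest bin."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_exons):
        length = rng.expovariate(1.0 / mean_length)
        if length <= 0.0:
            continue  # guard against log10(0)
        counts[math.floor(math.log10(length) / bin_width)] += 1
    fullest_bin = max(counts, key=counts.get)
    return (fullest_bin + 0.5) * bin_width


mode = log10_histogram_mode(mean_length=1_075)
print(f"fullest log10 bin is centred near {mode:.2f}")
print(f"log10 of the mean length is {math.log10(1_075):.2f}")
```

The peak appears because equal-width bins in Log10(length) cover ever wider ranges of length, so the count per bin is proportional to length times the exponential density.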

Page 14: Yaroslav Ryabov Lognormal Pattern of Exon size distributions in Eukaryotic genomes

Poisson Process vs Kolmogoroff Process

• Poisson process: exponential distribution
• Kolmogoroff process: lognormal distribution

[Figure: two histograms of number of counts (in thousands, up to 40) versus Log10(Exon Length), range 0 to 4, one for each process.]

Page 15: Yaroslav Ryabov Lognormal Pattern of Exon size distributions in Eukaryotic genomes

Lognormal distribution is a consequence of the Central Limit Theorem

• Sum of random variables → Normal distribution
• Product of random variables → Lognormal distribution
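The statement about products follows because the logarithm of a product is a sum of logarithms, so the Central Limit Theorem applies to the logarithm. A minimal numerical sketch is given below; the choice of 50 factors drawn uniformly from 0.5 to 1.5 is an arbitrary assumption made only for illustration.

```python
import math
import random
import statistics


def skewness(values):
    """Sample skewness: close to 0 for a symmetric (e.g. normal) sample."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


rng = random.Random(1)
n_samples, n_factors = 5_000, 50

products = []
for _ in range(n_samples):
    product = 1.0
    for _ in range(n_factors):
        product *= rng.uniform(0.5, 1.5)  # arbitrary positive random factor
    products.append(product)

# ln(product) is a sum of 50 random terms, so it is approximately normal.
log_products = [math.log(p) for p in products]

skew_raw = skewness(products)
skew_log = skewness(log_products)
print(f"skewness of the products: {skew_raw:.2f}")
print(f"skewness of their logarithms: {skew_log:.2f}")
```

The products themselves are strongly skewed with a long right tail, while their logarithms are nearly symmetric, which is the defining property of a lognormal distribution.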

Page 16: Yaroslav Ryabov Lognormal Pattern of Exon size distributions in Eukaryotic genomes

Kolmogoroff process (1941)

[Pages 16 to 22: step-by-step build of the splitting diagram, ending with a plot of number of counts versus Log(Exon Size) on which the peak position M and the peak width 2σ are marked.]

Assumption: the probability of splitting any exon is independent of exon size.

Consequence: the lengths of exons obey a lognormal distribution.
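A splitting process of this kind can be sketched in a few lines. The code below is an illustration under stated assumptions, not the model used in the talk: every piece splits with the same probability regardless of its size, and the split point is taken uniformly along the piece (the slides do not specify the split-point distribution). The starting population, split probability and number of steps are hypothetical.

```python
import math
import random
import statistics


def kolmogoroff_splitting(lengths, split_probability, n_steps, seed=0):
    """At every step each piece splits in two with the same probability,
    independent of its size; the split point is uniform along the piece."""
    rng = random.Random(seed)
    pieces = [float(x) for x in lengths]
    for _ in range(n_steps):
        next_pieces = []
        for length in pieces:
            if rng.random() < split_probability:
                fraction = rng.random()
                next_pieces.extend((length * fraction, length * (1.0 - fraction)))
            else:
                next_pieces.append(length)
        pieces = next_pieces
    return pieces


# Hypothetical starting point: 100 pieces of 1 000 000 b.p. each.
start = [1_000_000.0] * 100
pieces = kolmogoroff_splitting(start, split_probability=0.5, n_steps=12)

# Each final length is the initial length times a product of random fractions,
# so ln(length) is a sum of random terms.
log_lengths = [math.log(x) for x in pieces if x > 0.0]
print(f"number of pieces: {len(pieces)}")
print(f"mean of ln(length): {statistics.mean(log_lengths):.2f}")
print(f"std of ln(length): {statistics.stdev(log_lengths):.2f}")
```

Splitting conserves the total length, and the histogram of ln(length) approaches a bell shape only gradually as the number of steps grows.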

Page 23: Yaroslav Ryabov Lognormal Pattern of Exon size distributions in Eukaryotic genomes

Real Genomes: Two Lognormal Peaks

[Figures: histograms of number of counts (thousands) versus Ln(Exon Size), range 1 to 10, each fitted with a two-Gaussian (TwoGauss) model.]

Homo sapiens
• peak A: M = 4.80 ± 0.01, σ = 0.85 ± 0.01
• peak B: M = 5.44 ± 0.09, σ = 3.10 ± 0.20
• Fit report (Data: SatDatHS_B, Model: TwoGauss): Chi^2/DoF = 182733.65479, R^2 = 0.99722, y0 = 0 ± 0, A = 28378.88917 ± 679.3328, w = 0.85204 ± 0.01438, xc = 4.80305 ± 0.00554, A1 = 15303.67377 ± 838.0008, w1 = 3.09794 ± 0.15891, xc1 = 5.43988 ± 0.0879

Drosophila melanogaster
• peak A: M = 5.07 ± 0.01, σ = 0.53 ± 0.02
• peak B: M = 5.78 ± 0.01, σ = 2.10 ± 0.02
• Fit report (Data: StaDR_B, Model: TwoGauss): Chi^2/DoF = 6036.76567, R^2 = 0.99806, y0 = 0 ± 0, A = 10674.50108 ± 120.44604, w = 2.09918 ± 0.02005, xc = 5.77986 ± 0.01405, A1 = 1461.13835 ± 83.49117, w1 = 0.53408 ± 0.02349, xc1 = 5.07437 ± 0.00955

Pan troglodytes (chimpanzee)
• peak A: M = 4.81 ± 0.01, σ = 0.82 ± 0.02
• peak B: M = 5.08 ± 0.07, σ = 3.13 ± 0.18
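Because the horizontal axis is Ln(Exon Size), a peak position M corresponds to an exon length of exp(M) base pairs. The short sketch below converts the peak positions quoted on this slide; the dictionary layout and function name are hypothetical, and the fit uncertainties are ignored.

```python
import math

# Peak positions M (in units of ln of exon size) as quoted on the slide.
peak_positions = {
    "Homo sapiens": {"A": 4.80, "B": 5.44},
    "Drosophila melanogaster": {"A": 5.07, "B": 5.78},
    "Pan troglodytes": {"A": 4.81, "B": 5.08},
}


def peak_in_base_pairs(m):
    """Exon length, in base pairs, at which a peak sits on the Ln axis."""
    return math.exp(m)


for species, peaks in peak_positions.items():
    for label, m in peaks.items():
        print(f"{species}, peak {label}: M = {m:.2f} -> about {peak_in_base_pairs(m):.0f} b.p.")
```

For Homo sapiens this places the two peaks at roughly 120 and 230 base pairs.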

Page 24: Yaroslav Ryabov Lognormal Pattern of Exon size distributions in Eukaryotic genomes

[Figures: histograms of number of counts (thousands) versus Ln(Exon Size), range 2 to 10, one panel per species, each with a two-peak fit.]

• Anopheles gambiae (mosquito): peak A M = 5.07 ± 0.01, σ = 0.55 ± 0.03; peak B M = 5.53 ± 0.01, σ = 1.89 ± 0.02
• Bos taurus (cow): peak A M = 4.59 ± 0.05, σ = 2.06 ± 0.12; peak B M = 4.83 ± 0.01, σ = 0.77 ± 0.03
• Caenorhabditis elegans (worm): peak A M = 4.75 ± 0.02, σ = 0.71 ± 0.05; peak B M = 5.22 ± 0.03, σ = 1.46 ± 0.03
• Canis familiaris (dog): peak A M = 4.62 ± 0.05, σ = 2.10 ± 0.12; peak B M = 4.84 ± 0.01, σ = 0.77 ± 0.02
• Danio rerio (zebrafish): peak A M = 4.50 ± 0.05, σ = 2.49 ± 0.11; peak B M = 4.83 ± 0.01, σ = 0.77 ± 0.02
• Gallus gallus (chicken): peak A M = 4.56 ± 0.06, σ = 2.21 ± 0.13; peak B M = 4.83 ± 0.01, σ = 0.76 ± 0.03
• Macaca mulatta (rhesus macaque): peak A M = 4.81 ± 0.01, σ = 0.82 ± 0.02; peak B M = 4.86 ± 0.05, σ = 2.73 ± 0.16
• Mus musculus (house mouse): peak A M = 4.82 ± 0.01, σ = 0.84 ± 0.02; peak B M = 5.38 ± 0.10, σ = 3.09 ± 0.20
• Tetraodon nigroviridis (spotted green pufferfish): peak A M = 4.85 ± 0.01, σ = 0.76 ± 0.02; peak B M = 5.00 ± 0.03, σ = 1.95 ± 0.10

Page 25: Yaroslav Ryabov Lognormal Pattern of Exon size distributions in Eukaryotic genomes

Parameters of the two-lognormal-peaks model: M, location of peak maximum

Ryabov & Gribskov, Nucleic Acids Research, 36, 2756-2763 (2008)

Page 26: Yaroslav Ryabov Lognormal Pattern of Exon size distributions in Eukaryotic genomes

Parameters of the two-lognormal-peaks model: σ, peak width

Ryabov & Gribskov, Nucleic Acids Research, 36, 2756-2763 (2008)

Page 27: Yaroslav Ryabov Lognormal Pattern of Exon size distributions in Eukaryotic genomes

Ryabov & Gribskov, Nucleic Acids Research, 36, 2756-2763 (2008)

Parameters of the two-lognormal-peaks model: peak area, narrow and wide, %

Page 28: Yaroslav Ryabov Lognormal Pattern of Exon size distributions in Eukaryotic genomes

Summary of Observed Facts

• Exon length distributions of the studied eukaryotic genomes can be fitted by a sum of two lognormal distributions

• Parameters of those two peaks follow two distinct patterns: changes in peak width and relative peak area correlate with the complexity of the species

• This may indicate the presence of two different classes of exons with different evolutionary histories

Page 29: Yaroslav Ryabov Lognormal Pattern of Exon size distributions in Eukaryotic genomes

How can we model exon size distributions of real genomes?

Growth of total genome length

Duplications in genome code

Merging exons together (intron loss)

Exon loss

Page 30: Yaroslav Ryabov Lognormal Pattern of Exon size distributions in Eukaryotic genomes

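Parameters of Kolmogoroff Process as model parameters for an elementary "exon modification" event

If any exon of length [symbol not preserved] has probability [symbol not preserved] to be modified during time interval [symbol not preserved], and [symbol not preserved] is the average number of exons with sizes [expression not preserved], with [expressions not preserved],

then the process converges to a lognormal distribution with peak position [expression not preserved] and peak width [expression not preserved].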

Page 31: Yaroslav Ryabov Lognormal Pattern of Exon size distributions in Eukaryotic genomes

Parameters of Q(k) function

[Figure: three plots of the dQ(k) distribution versus k (range 0 to 2), one for each type of elementary event:]
• Exons splitting (decreasing exon length)
• Increasing exon length
• Exons duplicating

Page 32: Yaroslav Ryabov Lognormal Pattern of Exon size distributions in Eukaryotic genomes

Modeling exon size distributions in real genomes

Initial exon size distribution: a mockup of a bacterial genome with 500 exons and a mean exon length of 1000 bp

[Figure: distribution density versus Log(Exon Length).]

Page 33: Yaroslav Ryabov Lognormal Pattern of Exon size distributions in Eukaryotic genomes

Modeling exon size distributions in real genomes

15 000 time steps for p=0.001

[Figure, built up over pages 33 to 35: distribution density (up to 4x10^4) versus Log(Exon Length), range 2 to 10, with inset plots of dQ(k) versus k (range 0.0 to 2.0).]
• Wide peak: Q()=1.200
• Narrow peak: Q()=1.263

Page 36: Yaroslav Ryabov Lognormal Pattern of Exon size distributions in Eukaryotic genomes

Modeling exon size distributions in real genomes

50 000 time steps for p=0.001

[Figure, built up over pages 36 to 38: distribution density (up to 6x10^3) versus Log(Exon Length), range 2 to 10, with inset plots of dQ(k) versus k (range 0.0 to 2.0).]
• Wide peak: Q()=1.049
• Narrow peak: Q()=1.010
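The run lengths and Q values quoted for the simulations can be turned into a rough estimate of how many exons each run is expected to produce. The sketch below rests on assumptions that the slides do not confirm: the value labeled Q() is read as the mean number of exons left after one modification event, every exon is modified independently with probability p per time step, and each run starts from the 500-exon mock genome. The function name is hypothetical.

```python
def expected_exon_count(initial_exons, mean_exons_per_event, p, n_steps):
    """Expected number of exons after n_steps, assuming each exon is modified
    with probability p per step and an event leaves mean_exons_per_event
    exons on average (an assumed reading of the Q() value on the slides)."""
    growth_per_step = 1.0 + p * (mean_exons_per_event - 1.0)
    return initial_exons * growth_per_step ** n_steps


# (label, assumed mean exons per event, number of time steps) from the slides.
runs = [
    ("wide peak, 15 000 steps", 1.200, 15_000),
    ("narrow peak, 15 000 steps", 1.263, 15_000),
    ("wide peak, 50 000 steps", 1.049, 50_000),
    ("narrow peak, 50 000 steps", 1.010, 50_000),
]

for label, q_total, n_steps in runs:
    count = expected_exon_count(500, q_total, p=0.001, n_steps=n_steps)
    print(f"{label}: about {count:.0f} exons expected")
```

Under these assumptions a value only slightly above 1 still yields substantial growth over tens of thousands of steps, because the growth compounds at every step.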

Page 39: Yaroslav Ryabov Lognormal Pattern of Exon size distributions in Eukaryotic genomes

What could be the biological reason for the two exon peaks?

[Figure: number of counts versus Log(Exon Size), with the narrow peak and the wide peak marked.]

Page 40: Yaroslav Ryabov Lognormal Pattern of Exon size distributions in Eukaryotic genomes

What could be the biological reason for the two exon peaks?

Narrow peak
• Holds approximately constant position
• Has approximately constant width
• Has greater relative occupation for complex organisms

Wide peak
• Changes position
• Width increases with increasing complexity of an organism
• Has greater relative occupation for simple organisms

Page 41: Yaroslav Ryabov Lognormal Pattern of Exon size distributions in Eukaryotic genomes

Two ways of Gene Expression Regulation

[Diagram, built up over pages 41 and 42: DNA (with introns) → mRNA → Protein, annotated with the two regulation routes:]
• Alternative splicing
• Untranslated regions of mRNA

Page 43: Yaroslav Ryabov Lognormal Pattern of Exon size distributions in Eukaryotic genomes

Conclusions

• Analysis of global properties of eukaryotic genomes reveals two distinct peaks in the statistical distribution of exon sizes

• The observed distribution can be fitted by a sum of two lognormal distributions, which may imply that the two peaks originated from two different exon splitting pathways described within the general framework of the Kolmogoroff splitting process

• The two observed peaks of exons could be correlated with the phenomenon of alternative splicing and with exons contributing to untranslated regions of mRNA. This suggests that the observed separation of exons into two different classes may originate from two different ways of regulating protein expression.

Page 44: Yaroslav Ryabov Lognormal Pattern of Exon size distributions in Eukaryotic genomes

Acknowledgments

Michael Gribskov

Alexander Berezhkovskii