Information Theory in Biology
Thomas D. Schneider, Ph.D.
Molecular Information Theory Group
Center for Cancer Research
Gene Regulation and Chromosome Biology Laboratory
National Cancer Institute
Frederick, MD 21702-1201
Information Theory: One-Minute Lesson

number of symbols (M)    number of bits (B)    example
2                        1                     H T
4                        2                     00 01 10 11; A T G C
8                        3

$M = 2^B$, $B = \log_2 M$
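The relation between the number of symbols and the number of bits can be checked with a few lines of Python. This is a minimal sketch; the function name `bits_for_symbols` is an illustrative assumption and does not come from the slides.

```python
import math


def bits_for_symbols(m: int) -> float:
    """Return B = log2(M), the bits needed to pick one of M equally likely symbols."""
    if m < 1:
        raise ValueError("number of symbols must be at least 1")
    return math.log2(m)


# The three rows of the slide's table: M = 2, 4, 8.
for m in (2, 4, 8):
    print(f"M = {m}: B = {bits_for_symbols(m)} bits")
```

With four symbols, such as the four DNA bases, the same call gives 2 bits per symbol.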
El Duomo, Florence, Italy
T7 RNA polymerase + DNA
http://www.ebi.ac.uk/pdbe/entry/pdb/1qln/portfolio/?view=entry_index#ad-image-0
Sequence Logo
Schneider & Stephens, Nucl. Acids Res. 18: 6097-6100, 1990

6 of 17 sites:
1 ttattaatacaactcactataaggagag
2 aaatcaatacgactcactatagagggac
3 cggttaatacgactcactataggagaac
4 gaagtaatacgactcagtatagggacaa
5 taattaattgaactcactaaagggagac
6 cgcttaatacgactcactaaaggagaca

[Figure: sequence logo of 17 bacteriophage T7 RNA polymerase binding sites; x-axis: positions -21 to +6 (5′ to 3′); y-axis: information in bits, 0 to 2. Scale marks on successive builds: 2 bits/base, 1 bit/base, 0 bits/base.]
Sequence Logo: Individual Information
Schneider & Stephens, Nucl. Acids Res. 18: 6097-6100, 1990

Site  Sequence                      Individual information (bits)
1     ttattaatacaactcactataaggagag  33.3
2     aaatcaatacgactcactatagagggac  37.4
3     cggttaatacgactcactataggagaac  34.4
4     gaagtaatacgactcagtatagggacaa  33.1
5     taattaattgaactcactaaagggagac  30.1
6     cgcttaatacgactcactaaaggagaca  29.1

[Figure: sequence logo of 17 bacteriophage T7 RNA polymerase binding sites; positions -21 to +6 (5′ to 3′), 0 to 2 bits.]
Sequence Logo and Sequence Walker
Schneider & Stephens, Nucl. Acids Res. 18: 6097-6100, 1990

[Figure: sequence logo of 17 bacteriophage T7 RNA polymerase binding sites, with a sequence walker labeled 29.1 bits (the value listed for site 6, cgcttaatacgactcactaaaggagaca).]

Sequence Walker: Patent 5,867,402
Sequence Logo and Sequence Walker and Rsequence
Schneider & Stephens, Nucl. Acids Res. 18: 6097-6100, 1990

[Figure: sequence logo of 17 bacteriophage T7 RNA polymerase binding sites, shown with the individual information of the listed sites.]

Rsequence is the average: 35.0 ± 0.6 bits = “area under the logo”
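The "area under the logo" can be illustrated with a short Python sketch. It computes the information at each aligned position as 2 bits minus the observed uncertainty, then sums over positions. Two assumptions are worth stating loudly: only the six sequences printed on the slide are used (not all 17 sites), and no small-sample correction is applied, so the total is not expected to reproduce the 35.0 ± 0.6 bits reported for the full set. The function name is illustrative.

```python
import math
from collections import Counter

# The six T7 RNA polymerase binding sites printed on the slide (6 of 17).
SITES = [
    "ttattaatacaactcactataaggagag",
    "aaatcaatacgactcactatagagggac",
    "cggttaatacgactcactataggagaac",
    "gaagtaatacgactcagtatagggacaa",
    "taattaattgaactcactaaagggagac",
    "cgcttaatacgactcactaaaggagaca",
]


def information_per_position(sites):
    """Return 2 - H(l) in bits for each aligned column (no sample-size correction)."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    profile = []
    for l in range(length):
        counts = Counter(site[l] for site in sites)
        n = sum(counts.values())
        # Shannon uncertainty of the bases observed in this column.
        h = -sum((c / n) * math.log2(c / n) for c in counts.values())
        profile.append(2.0 - h)
    return profile


profile = information_per_position(SITES)
r_sequence = sum(profile)  # the area under the (uncorrected) logo
print(f"uncorrected total for these six sites: {r_sequence:.1f} bits")
```

A column in which all six sites carry the same base contributes the full 2 bits; a column with all four bases equally represented contributes 0 bits.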
Rfrequency
Information required to find a set of binding sites in a genome

G = # of potential binding sites = genome size in some cases
γ = number of binding sites on genome

$R_{frequency} = H_{before} - H_{after} = \log_2 G - \log_2 \gamma = -\log_2(\gamma / G)$

16 positions, 1 site: $\log_2(16/1) = 4$ bits
16 positions, 2 sites: $\log_2(16/2) = 3$ bits
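The two worked cases on this slide follow directly from the formula. A minimal Python sketch, with the function name `r_frequency` chosen here for illustration:

```python
import math


def r_frequency(g: int, gamma: int) -> float:
    """Return Rfrequency = log2(G / gamma) in bits.

    g is the number of potential binding positions and gamma is the
    number of binding sites among them.
    """
    if g <= 0 or gamma <= 0 or gamma > g:
        raise ValueError("require 0 < gamma <= g")
    return math.log2(g / gamma)


print(r_frequency(16, 1))  # 16 positions, 1 site
print(r_frequency(16, 2))  # 16 positions, 2 sites
```

Doubling the number of sites at fixed G lowers the information needed to find them by exactly one bit.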
RNA Splicing
[Figure: DNA, copy DNA (transcription), RNA, spliced RNA; labels: exon, intron, donor, acceptor.]
Donor and acceptor logos
[Figure: sequence logos of the donor and acceptor splice sites, drawn 5′ to 3′ across exon, intron, exon.]
Rsequence and Rfrequency for Splice Acceptors

Rsequence
[Figure: sequence logo of splice acceptor sites.]
• Information at binding site sequences (area under sequence logo)
• from: binding site sequences
• 9.4 bits per site

Rfrequency
[Figure: intron and exon with donor and acceptor marked.]
• Information needed to locate the sites
• from: size of genome and number of sites (length of intron+exon)
• 9.7 bits per site

Rsequence/Rfrequency = 0.97
Rsequence = Rfrequency Hypothesis
Hypothesis: The information in binding site patterns is just sufficient for the sites to be found in the genome
Rsequence versus Rfrequency

Binding Site              Total Pattern      Information needed to     Pattern Info / Location Info
Recognizer [1]            Information        Locate Site in Genome     = Rsequence / Rfrequency
                          = Rsequence        = Rfrequency
                          (bits)             (bits)
Spliceosome acceptor [2]  9.35 ± 0.12        9.66                      0.97 ± 0.01
Spliceosome donor         7.92 ± 0.09        9.66                      0.82 ± 0.01
Ribosome                  11.0               10.6                      1.0
λ cI/cro                  17.7 ± 1.6         19.3                      0.9 ± 0.1
LexA                      21.5 ± 1.7         18.4                      1.2 ± 0.1
TrpR                      23.4 ± 1.9         20.3                      1.2 ± 0.1
LacI                      19.2 ± 2.8         21.9                      0.9 ± 0.1
ArgR                      16.4               18.4                      0.9
O (λ Origin)              20.9               19.9                      1.0
Ara C                     19.3               19.3                      1.0
Transcription at TATA [3] 3.3                ∼3                        ∼1
T7 Promoter               35.4               16.5                      2.1

[1] T. D. Schneider, G. D. Stormo, L. Gold, and A. Ehrenfeucht. J. Mol. Biol., 188:415-431, 1986.
[2] R. M. Stephens and T. D. Schneider. J. Mol. Biol., 228:1124-1136, 1992.
[3] F. E. Penotti. J. Mol. Biol., 213:37-52, 1990.
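The last column of the table is the ratio of the two preceding columns. The sketch below recomputes it for four rows copied from the table; because the inputs are already rounded, the results are only expected to agree with the listed ratios to within rounding. The dictionary layout and names are illustrative assumptions.

```python
# (Rsequence, Rfrequency, listed ratio) in bits, copied from the table above.
ROWS = {
    "Spliceosome acceptor": (9.35, 9.66, 0.97),
    "Spliceosome donor": (7.92, 9.66, 0.82),
    "LexA": (21.5, 18.4, 1.2),
    "T7 Promoter": (35.4, 16.5, 2.1),
}


def ratio(r_sequence: float, r_frequency: float) -> float:
    """Return Rsequence / Rfrequency."""
    return r_sequence / r_frequency


for name, (r_seq, r_freq, listed) in ROWS.items():
    print(f"{name}: computed {ratio(r_seq, r_freq):.2f}, listed {listed}")
```

A ratio near 1 means the pattern carries about as much information as is needed to locate the sites; the T7 promoter row stands out with a ratio near 2.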
Rsequence versus Rfrequency - meaning

The information in the binding site pattern (Rsequence) is close to the information needed to find the binding sites (Rfrequency).

But for a species in a stable environment:
• size of genome (G) is fixed (e.g. E. coli has $4.7 \times 10^6$ bp)
• number of binding sites (γ) is fixed (e.g. there are ∼50 E. coli LexA sites)

so $R_{frequency} = \log_2(G / \gamma)$ is fixed

Rsequence must evolve towards Rfrequency!
Evolution of Binding Sites

• Rfrequency is fixed relative to Rsequence
• Does Rsequence evolve toward Rfrequency?

Set up a Computer Model, ‘Ev’: A population of “creatures” with
• genomes containing 4 bases (A, C, G, T)
• a defined genome size (G)
• predetermined binding site locations (γ) (to fix the frequency of sites)
  (together, G and γ mean that Rfrequency is fixed)
• a recognizer gene encoded in the sequence: use a weight matrix
How A Weight Matrix Works

Sequence matrix, s(b, l, j) for sequence j

base b  position l
        C     A     G     G     T     C     T     G     C     A
        −3    −2    −1    0     1     2     3     4     5     6
A       0     1     0     0     0     0     0     0     0     1
C       1     0     0     0     0     1     0     0     1     0
G       0     0     1     1     0     0     0     1     0     0
T       0     0     0     0     1     0     1     0     0     0

Individual information weight matrix, Riw(b, l)

base b  position l
        −3    −2    −1    0     1     2     3     4     5     6
A       +0.4  +1.3  −1.4  −8.8  −5.8  +1.1  +1.5  −1.8  −0.7  +0.0
C       +0.6  −0.8  −2.4  −7.8  −5.5  −3.7  −1.6  −2.2  −0.5  −0.2
G       −0.6  −1.0  +1.6  +2.0  −6.2  +0.7  −1.1  +1.7  −0.3  +0.4
T       −1.0  −0.9  −1.7  −5.8  +2.0  −3.4  −1.6  −2.2  +0.9  −0.5

[Figure: Sequence Walker for 5′ c a g g t c t g c a 3′]
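The two matrices combine by multiplying them element by element and summing, which picks out one weight per position. A minimal Python sketch using the numbers from the slide; the names `RIW`, `one_hot`, and `individual_information` are illustrative choices, not taken from the talk.

```python
BASES = "ACGT"

# Individual information weights Riw(b, l) for positions -3..6, from the slide.
RIW = {
    "A": [+0.4, +1.3, -1.4, -8.8, -5.8, +1.1, +1.5, -1.8, -0.7, +0.0],
    "C": [+0.6, -0.8, -2.4, -7.8, -5.5, -3.7, -1.6, -2.2, -0.5, -0.2],
    "G": [-0.6, -1.0, +1.6, +2.0, -6.2, +0.7, -1.1, +1.7, -0.3, +0.4],
    "T": [-1.0, -0.9, -1.7, -5.8, +2.0, -3.4, -1.6, -2.2, +0.9, -0.5],
}


def one_hot(sequence):
    """Build the sequence matrix s(b, l): 1 where base b occurs at position l."""
    return {b: [1 if ch == b else 0 for ch in sequence] for b in BASES}


def individual_information(sequence, weights):
    """Sum s(b, l) * Riw(b, l) over all bases and positions (in bits)."""
    sequence = sequence.upper()
    width = len(weights["A"])
    if len(sequence) != width:
        raise ValueError(f"sequence must be {width} bases long")
    s = one_hot(sequence)
    return sum(s[b][l] * weights[b][l] for b in BASES for l in range(width))


print(f"{individual_information('caggtctgca', RIW):.1f} bits")
```

For the example sequence the selected weights are +0.6, +1.3, +1.6, +2.0, +2.0, −3.7, −1.6, +1.7, −0.5, and +0.0, which add up to about 3.4 bits.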
Unevolved Ev Creature

[Figure: genome map of an unevolved Ev creature.]
• “blue” gene weight matrix: 6 bp wide
• Genome positions available: G = 256 bases
• γ = 16 binding sites
• $R_{frequency} = \log_2(256/16) = 4$ bits
• Marks on the map: found real site, missed real site, found wrong site
Evolution Cycle

[Figure: evolution cycle diagram; labels: evaluate, sort, kill, replicate, mutate, selection.]

• EVALUATE each creature
  • translate the recognizer gene into a weight matrix
  • scan the weight matrix across the genome
  • count the number of mistakes:
    missing a site at a right place
    finding a site at a wrong place
• SORT the creatures by their mistakes
• REPLICATE: the best creatures are duplicated and replace the worst ones
• MUTATE all genomes randomly
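The cycle above can be sketched as a toy Python program. This is not the Ev program itself: the gene encoding (two bases per weight, giving integers from -8 to 7), the site spacing, the single point mutation per creature, and scanning only the region after the gene are all simplifying assumptions made for illustration. Only the genome size (256), the matrix width (6), and the number of sites (16) come from the slides.

```python
import random

BASES = "ACGT"
G = 256                          # genome length (from the slide)
WIDTH = 6                        # weight matrix width in bases (from the slide)
GENE_LEN = WIDTH * 4 * 2 + 2     # assumed encoding: 2 bases per weight + 2 for a threshold
SITES = set(range(64, 64 + 16 * 12, 12))   # 16 fixed site positions (assumed spacing)


def decode(pair):
    """Map two bases to an integer in -8..7 (assumed encoding)."""
    return BASES.index(pair[0]) * 4 + BASES.index(pair[1]) - 8


def translate(genome):
    """Translate the recognizer gene into a weight matrix and a threshold."""
    weights = [
        [decode(genome[2 * (4 * l + b): 2 * (4 * l + b) + 2]) for b in range(4)]
        for l in range(WIDTH)
    ]
    threshold = decode(genome[GENE_LEN - 2:GENE_LEN])
    return weights, threshold


def count_mistakes(genome):
    """Scan the matrix across the genome; count missed sites and wrong hits."""
    weights, threshold = translate(genome)
    mistakes = 0
    for pos in range(GENE_LEN, G - WIDTH + 1):
        score = sum(weights[l][BASES.index(genome[pos + l])] for l in range(WIDTH))
        found = score >= threshold
        if found != (pos in SITES):
            mistakes += 1
    return mistakes


def mutate(genome, rng):
    """Replace one randomly chosen position with a random base."""
    i = rng.randrange(len(genome))
    return genome[:i] + rng.choice(BASES) + genome[i + 1:]


def evolve(generations=50, population=16, seed=0):
    """Run the evaluate, sort, replicate, mutate cycle; return best mistakes per generation."""
    rng = random.Random(seed)
    creatures = ["".join(rng.choice(BASES) for _ in range(G)) for _ in range(population)]
    history = []
    for _ in range(generations):
        creatures.sort(key=count_mistakes)               # EVALUATE and SORT
        history.append(count_mistakes(creatures[0]))
        half = population // 2
        creatures[population - half:] = creatures[:half]  # REPLICATE best over worst
        creatures = [mutate(c, rng) for c in creatures]   # MUTATE all genomes
    return history


print(evolve(generations=20, population=8, seed=1))
```

Because every genome is mutated after replication, the best score can get worse from one generation to the next; selection only biases the population toward fewer mistakes over time.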
Evolved Ev Creature
Evolution of Binding Sites

[Figure: sequence logos of the 16 evolving binding sites, one per 100 generations; positions -5 to +10 (5′ to 3′), 0 to 2 bits.]

Generation   Rs (bits)
100          -0.1 ± 0.5
200          -0.0 ± 0.5
300          1.6 ± 0.5
400          2.6 ± 0.5
500          2.7 ± 0.5
600          2.7 ± 0.5
700          3.7 ± 0.5
800          3.2 ± 0.5
900          4.5 ± 0.5
1000         4.9 ± 0.5
1100         5.0 ± 0.5
1200         4.5 ± 0.5
1300         4.2 ± 0.5
1400         4.4 ± 0.5
1500         3.9 ± 0.5
1600         4.9 ± 0.5
1700         4.7 ± 0.5
1800         3.7 ± 0.5
1900         4.7 ± 0.5
2000         5.2 ± 0.5

[Figure: Mistakes of Best Organism versus Generation (0 to 2000); y-axis 0 to 20; regions labeled "selection" and "no selection".]

[Figure: Information (bits per site) versus Generation (0 to 2000); y-axis -1.0 to 6.0; curves for Rfrequency and Rsequence; regions labeled "selection" and "no selection".]
Human Splice Junction Information Curves

[Figure: sequence conservation in bits per base for donor and acceptor sites; x-axis: Position L (in bases); y-axis: Rs(l) = Rsequence(l), information in bits, 0.0 to 2.0.]

Donor
l     a     c     g     t     Rs(l)
-9    481   444   497   371   0.01
-8    461   461   453   422   -0.00
-7    434   457   530   378   0.01
-6    498   433   452   416   0.00
-5    474   489   398   438   0.00
-4    492   527   446   334   0.02
-3    603   673   301   222   0.14
-2    1071  263   219   246   0.39
-1    169   85    1404  141   0.90
0     0     2     1789  8     1.95
1     8     10    6     1775  1.88
2     974   35    738   43    0.75
3     1274  152   215   149   0.68
4     128   95    1468  97    1.04
5     270   304   348   812   0.16
6     428   369   564   296   0.04
7     319   492   408   424   0.02
8     332   494   413   384   0.01
9     302   437   463   408   0.02
10    281   419   424   342   0.02
11    274   382   436   369   0.02
12    273   414   400   355   0.02

Acceptor
l     a     c     g     t     Rs(l)
-25   309   392   208   386   0.04
-24   282   420   216   388   0.04
-23   273   392   220   424   0.05
-22   290   411   214   397   0.04
-21   309   400   194   416   0.05
-20   248   412   203   461   0.08
-19   256   388   225   482   0.07
-18   213   442   246   455   0.08
-17   201   453   183   535   0.15
-16   193   471   199   517   0.14
-15   182   459   212   533   0.14
-14   158   476   193   619   0.21
-13   135   486   197   656   0.25
-12   122   511   174   673   0.29
-11   121   500   152   732   0.34
-10   110   483   149   775   0.37
-9    131   534   175   800   0.33
-8    156   583   192   727   0.27
-7    172   656   172   674   0.27
-6    165   653   158   726   0.30
-5    123   691   110   788   0.43
-4    127   590   103   914   0.46
-3    394   499   432   411   0.00
-2    67    1253  20    397   0.92
-1    1717  10    6     10    1.86
0     9     11    1720  4     1.87
1     449   231   871   187   0.26
2     426   308   413   590   0.04
3     424   426   419   468   0.00
4     400   487   440   410   0.00
5     424   502   347   464   0.01
6     394   527   424   392   0.01
7     441   431   485   380   0.00

C A G

https://alum.mit.edu/www/toms/papers/splice/

• The consensus sequences match . . .
Human Splice Junction Information Curves
Donor
Sequence
Conservation→
inbitsper
base
Acceptor
l
a
c
g
t
Rs
(l)
Rs(l
) =
Rseq
uenc
e(l)
, In
form
atio
n in
bit
s
Position L (in bases)
0.0
1.0
2.0
-9 481 444 497 371 0.01
-8 461 461 453 422 -0.00
-7 434 457 530 378 0.01
-6 498 433 452 416 0.00
-5 474 489 398 438 0.00
-4 492 527 446 334 0.02
-3 603 673 301 222 0.14
-2 1071 263 219 246 0.39
-1 169 85 1404 141 0.90
0 0 2 1789 8 1.95
1 8 10 6 1775 1.88
2 974 35 738 43 0.75
3 1274 152 215 149 0.68
4 128 95 1468 97 1.04
5 270 304 348 812 0.16
6 428 369 564 296 0.04
7 319 492 408 424 0.02
8 332 494 413 384 0.01
9 302 437 463 408 0.02
10 281 419 424 342 0.02
11 274 382 436 369 0.02
12 273 414 400 355 0.02
l
a
c
g
t
Rs
(l)
Rs(l
) =
Rseq
uenc
e(l)
, In
form
atio
n in
bit
s
Position L (in bases)
0.0
1.0
2.0
-25 309 392 208 386 0.04
-24 282 420 216 388 0.04
-23 273 392 220 424 0.05
-22 290 411 214 397 0.04
-21 309 400 194 416 0.05
-20 248 412 203 461 0.08
-19 256 388 225 482 0.07
-18 213 442 246 455 0.08
-17 201 453 183 535 0.15
-16 193 471 199 517 0.14
-15 182 459 212 533 0.14
-14 158 476 193 619 0.21
-13 135 486 197 656 0.25
-12 122 511 174 673 0.29
-11 121 500 152 732 0.34
-10 110 483 149 775 0.37
-9 131 534 175 800 0.33
-8 156 583 192 727 0.27
-7 172 656 172 674 0.27
-6 165 653 158 726 0.30
-5 123 691 110 788 0.43
-4 127 590 103 914 0.46
-3 394 499 432 411 0.00
-2 67 1253 20 397 0.92
-1 1717 10 6 10 1.86
0 9 11 1720 4 1.87
1 449 231 871 187 0.26
2 426 308 413 590 0.04
3 424 426 419 468 0.00
4 400 487 440 410 0.00
5 424 502 347 464 0.01
6 394 527 424 392 0.01
7 441 431 485 380 0.00
C A G —
https://alum.mit.edu/www/toms/papers/splice/
• The consensus sequences match . . .
Human Splice Junction Information Curves
Donor
Sequence
Conservation→
inbitsper
base
Acceptor
l
a
c
g
t
Rs
(l)
Rs(l
) =
Rseq
uenc
e(l)
, In
form
atio
n in
bit
s
Position L (in bases)
0.0
1.0
2.0
-9 481 444 497 371 0.01
-8 461 461 453 422 -0.00
-7 434 457 530 378 0.01
-6 498 433 452 416 0.00
-5 474 489 398 438 0.00
-4 492 527 446 334 0.02
-3 603 673 301 222 0.14
-2 1071 263 219 246 0.39
-1 169 85 1404 141 0.90
0 0 2 1789 8 1.95
1 8 10 6 1775 1.88
2 974 35 738 43 0.75
3 1274 152 215 149 0.68
4 128 95 1468 97 1.04
5 270 304 348 812 0.16
6 428 369 564 296 0.04
7 319 492 408 424 0.02
8 332 494 413 384 0.01
9 302 437 463 408 0.02
10 281 419 424 342 0.02
11 274 382 436 369 0.02
12 273 414 400 355 0.02
l
a
c
g
t
Rs
(l)
Rs(l
) =
Rseq
uenc
e(l)
, In
form
atio
n in
bit
s
Position L (in bases)
0.0
1.0
2.0
-25 309 392 208 386 0.04
-24 282 420 216 388 0.04
-23 273 392 220 424 0.05
-22 290 411 214 397 0.04
-21 309 400 194 416 0.05
-20 248 412 203 461 0.08
-19 256 388 225 482 0.07
-18 213 442 246 455 0.08
-17 201 453 183 535 0.15
-16 193 471 199 517 0.14
-15 182 459 212 533 0.14
-14 158 476 193 619 0.21
-13 135 486 197 656 0.25
-12 122 511 174 673 0.29
-11 121 500 152 732 0.34
-10 110 483 149 775 0.37
-9 131 534 175 800 0.33
-8 156 583 192 727 0.27
-7 172 656 172 674 0.27
-6 165 653 158 726 0.30
-5 123 691 110 788 0.43
-4 127 590 103 914 0.46
-3 394 499 432 411 0.00
-2 67 1253 20 397 0.92
-1 1717 10 6 10 1.86
0 9 11 1720 4 1.87
1 449 231 871 187 0.26
2 426 308 413 590 0.04
3 424 426 419 468 0.00
4 400 487 440 410 0.00
5 424 502 347 464 0.01
6 394 527 424 392 0.01
7 441 431 485 380 0.00
C A G — G
https://alum.mit.edu/www/toms/papers/splice/
• The consensus sequences match . . .
Human Splice Junction Information Curves
Donor
Sequence
Conservation→
inbitsper
base
Acceptor
l
a
c
g
t
Rs
(l)
Rs(l
) =
Rseq
uenc
e(l)
, In
form
atio
n in
bit
s
Position L (in bases)
0.0
1.0
2.0
-9 481 444 497 371 0.01
-8 461 461 453 422 -0.00
-7 434 457 530 378 0.01
-6 498 433 452 416 0.00
-5 474 489 398 438 0.00
-4 492 527 446 334 0.02
-3 603 673 301 222 0.14
-2 1071 263 219 246 0.39
-1 169 85 1404 141 0.90
0 0 2 1789 8 1.95
1 8 10 6 1775 1.88
2 974 35 738 43 0.75
3 1274 152 215 149 0.68
4 128 95 1468 97 1.04
5 270 304 348 812 0.16
6 428 369 564 296 0.04
7 319 492 408 424 0.02
8 332 494 413 384 0.01
9 302 437 463 408 0.02
10 281 419 424 342 0.02
11 274 382 436 369 0.02
12 273 414 400 355 0.02
l
a
c
g
t
Rs
(l)
Rs(l
) =
Rseq
uenc
e(l)
, In
form
atio
n in
bit
s
Position L (in bases)
0.0
1.0
2.0
-25 309 392 208 386 0.04
-24 282 420 216 388 0.04
-23 273 392 220 424 0.05
-22 290 411 214 397 0.04
-21 309 400 194 416 0.05
-20 248 412 203 461 0.08
-19 256 388 225 482 0.07
-18 213 442 246 455 0.08
-17 201 453 183 535 0.15
-16 193 471 199 517 0.14
-15 182 459 212 533 0.14
-14 158 476 193 619 0.21
-13 135 486 197 656 0.25
-12 122 511 174 673 0.29
-11 121 500 152 732 0.34
-10 110 483 149 775 0.37
-9 131 534 175 800 0.33
-8 156 583 192 727 0.27
-7 172 656 172 674 0.27
-6 165 653 158 726 0.30
-5 123 691 110 788 0.43
-4 127 590 103 914 0.46
-3 394 499 432 411 0.00
-2 67 1253 20 397 0.92
-1 1717 10 6 10 1.86
0 9 11 1720 4 1.87
1 449 231 871 187 0.26
2 426 308 413 590 0.04
3 424 426 419 468 0.00
4 400 487 440 410 0.00
5 424 502 347 464 0.01
6 394 527 424 392 0.01
7 441 431 485 380 0.00
C A G — G T
https://alum.mit.edu/www/toms/papers/splice/
• The consensus sequences match . . .
Human Splice Junction Information Curves
Donor
Sequence
Conservation→
inbitsper
base
Acceptor
l
a
c
g
t
Rs
(l)
Rs(l
) =
Rseq
uenc
e(l)
, In
form
atio
n in
bit
s
Position L (in bases)
0.0
1.0
2.0
-9 481 444 497 371 0.01
-8 461 461 453 422 -0.00
-7 434 457 530 378 0.01
-6 498 433 452 416 0.00
-5 474 489 398 438 0.00
-4 492 527 446 334 0.02
-3 603 673 301 222 0.14
-2 1071 263 219 246 0.39
-1 169 85 1404 141 0.90
0 0 2 1789 8 1.95
1 8 10 6 1775 1.88
2 974 35 738 43 0.75
3 1274 152 215 149 0.68
4 128 95 1468 97 1.04
5 270 304 348 812 0.16
6 428 369 564 296 0.04
7 319 492 408 424 0.02
8 332 494 413 384 0.01
9 302 437 463 408 0.02
10 281 419 424 342 0.02
11 274 382 436 369 0.02
12 273 414 400 355 0.02
l
a
c
g
t
Rs
(l)
Rs(l
) =
Rseq
uenc
e(l)
, In
form
atio
n in
bit
s
Position L (in bases)
0.0
1.0
2.0
-25 309 392 208 386 0.04
-24 282 420 216 388 0.04
-23 273 392 220 424 0.05
-22 290 411 214 397 0.04
-21 309 400 194 416 0.05
-20 248 412 203 461 0.08
-19 256 388 225 482 0.07
-18 213 442 246 455 0.08
-17 201 453 183 535 0.15
-16 193 471 199 517 0.14
-15 182 459 212 533 0.14
-14 158 476 193 619 0.21
-13 135 486 197 656 0.25
-12 122 511 174 673 0.29
-11 121 500 152 732 0.34
-10 110 483 149 775 0.37
-9 131 534 175 800 0.33
-8 156 583 192 727 0.27
-7 172 656 172 674 0.27
-6 165 653 158 726 0.30
-5 123 691 110 788 0.43
-4 127 590 103 914 0.46
-3 394 499 432 411 0.00
-2 67 1253 20 397 0.92
-1 1717 10 6 10 1.86
0 9 11 1720 4 1.87
1 449 231 871 187 0.26
2 426 308 413 590 0.04
3 424 426 419 468 0.00
4 400 487 440 410 0.00
5 424 502 347 464 0.01
6 394 527 424 392 0.01
7 441 431 485 380 0.00
C A G — G T
https://alum.mit.edu/www/toms/papers/splice/
• The consensus sequences match . . .
• BUT the information curves (sequence conservation) differ!
Human Splice Junction Information Curves
[Figure: donor and acceptor information curves. Vertical axis: Rs(l) = Rsequence(l), sequence conservation in bits per base (0.0 to 2.0). Horizontal axis: Position L (in bases). Letters are drawn into the graph in proportion to their frequency. The plotted data are tabulated below.]

Donor
   l      a      c      g      t   Rs(l)
  -9    481    444    497    371    0.01
  -8    461    461    453    422   -0.00
  -7    434    457    530    378    0.01
  -6    498    433    452    416    0.00
  -5    474    489    398    438    0.00
  -4    492    527    446    334    0.02
  -3    603    673    301    222    0.14
  -2   1071    263    219    246    0.39
  -1    169     85   1404    141    0.90
   0      0      2   1789      8    1.95
   1      8     10      6   1775    1.88
   2    974     35    738     43    0.75
   3   1274    152    215    149    0.68
   4    128     95   1468     97    1.04
   5    270    304    348    812    0.16
   6    428    369    564    296    0.04
   7    319    492    408    424    0.02
   8    332    494    413    384    0.01
   9    302    437    463    408    0.02
  10    281    419    424    342    0.02
  11    274    382    436    369    0.02
  12    273    414    400    355    0.02

Acceptor
   l      a      c      g      t   Rs(l)
 -25    309    392    208    386    0.04
 -24    282    420    216    388    0.04
 -23    273    392    220    424    0.05
 -22    290    411    214    397    0.04
 -21    309    400    194    416    0.05
 -20    248    412    203    461    0.08
 -19    256    388    225    482    0.07
 -18    213    442    246    455    0.08
 -17    201    453    183    535    0.15
 -16    193    471    199    517    0.14
 -15    182    459    212    533    0.14
 -14    158    476    193    619    0.21
 -13    135    486    197    656    0.25
 -12    122    511    174    673    0.29
 -11    121    500    152    732    0.34
 -10    110    483    149    775    0.37
  -9    131    534    175    800    0.33
  -8    156    583    192    727    0.27
  -7    172    656    172    674    0.27
  -6    165    653    158    726    0.30
  -5    123    691    110    788    0.43
  -4    127    590    103    914    0.46
  -3    394    499    432    411    0.00
  -2     67   1253     20    397    0.92
  -1   1717     10      6     10    1.86
   0      9     11   1720      4    1.87
   1    449    231    871    187    0.26
   2    426    308    413    590    0.04
   3    424    426    419    468    0.00
   4    400    487    440    410    0.00
   5    424    502    347    464    0.01
   6    394    527    424    392    0.01
   7    441    431    485    380    0.00

Consensus at both junctions: C A G - G T
https://alum.mit.edu/www/toms/papers/splice/
• The consensus sequences match . . .
• BUT the information curves (sequence conservation) differ!
• Put letters into the graph proportional to their frequency!
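The Rs(l) column can be approximately reproduced from the base counts. The sketch below assumes Rs(l) = 2 - H(l), where H(l) is the Shannon uncertainty of the four base frequencies at position l. It leaves out any small-sample correction, which should be tiny for about 1800 sites, and the function name `rsequence` is chosen here for illustration only.

```python
import math

def rsequence(counts):
    """Information at one position in bits: log2(4) minus the uncertainty H."""
    n = sum(counts)
    # A base that never occurs contributes nothing to H (0 * log 0 is taken as 0).
    h = -sum((c / n) * math.log2(c / n) for c in counts if c > 0)
    return math.log2(4) - h

# Three donor rows from the table, as (a, c, g, t) counts.
donor_rows = {
    -3: (603, 673, 301, 222),
    0: (0, 2, 1789, 8),
    1: (8, 10, 6, 1775),
}
for position, counts in donor_rows.items():
    print(position, round(rsequence(counts), 2))
```

Under these assumptions the three rows come out close to the tabulated 0.14, 1.95 and 1.88 bits.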
Human Splice Junction Information Curves
[Figure: the donor and acceptor information curves redrawn as sequence logos. Panels: "1799 Human donor sites" (positions -9 to 12) and "1744 Human acceptor sites" (positions -25 to 7), each read 5′ to 3′ with a vertical axis from 0 to 2 bits.]
https://alum.mit.edu/www/toms/papers/logo/
• That’s how and why we invented sequence logos!
Splice Junction Sequence Logos
[Figure: donor and acceptor sequence logos placed on a 5′ exon, intron, exon 3′ diagram, with a hypothesized "proto" site logo shown alongside the donor and acceptor logos.]
• 90% of the splice junction information is on the intron side
• Hypothesis: donor and acceptor sites had a common ancestor that duplicated
• They evolved to put the information into the intron. This avoids affecting the proteins.
https://alum.mit.edu/www/toms/papers/splice/
3D Sequence Logos for tRNA Correlations
• tRNA reads RNA to make protein
• Correlations can be measured in bits!
• 3D Sequence logo
• OBSERVED: tRNA stems
https://alum.mit.edu/www/toms/papers/correlogo/
Nucleic Acids Res. 2006 34:W405-11
Sequence Walker example: rrnB P1
[Figure: sequence walkers drawn on both strands of the rrnB P1 region, coordinates 4164240 to 4164380. The top strand and the walker labels are listed row by row.]
5′ ggagctgaacaattattgcccgttttacagcgttacggcttcga 3′
   Fis 12.0 bits
5′ aacgctcgaaaaactggcagttttaggctgatttggttgaatgttgcgcggtca 3′
   Fis 5.3 bits, Fis 10.4 bits
5′ gaaaattattttaaatttcctcttgtcaggccggaataactccctataatg 3′
   distalUP 6.6 bits, proximalUP 4.4 bits, p35 5.5 bits, p10 8.4 bits
   p35-(22)-p10 4164377: Gap 2.3 bits
   proximalUP-(27)-p10 4164377: Gap 3.4 bits
   distalUP-(45)-p10 4164377: Gap 5.4 bits
   distalUP-proximalUP-p35-p10 4164377: total 13.7 bits
rrnB P1
Complex Sequence Walker Example
• σ70 promoters have a −35 and a −10
• Using information theory we discovered that stress-response σ38 promoters do not have a −35
• Instead, they have a −10 and two UP elements
• σ38 promoter talA P1 is complex!
Important Discovery
[Figure: sequence logos for "1744 Human acceptor sites" (positions -20 to 2) and "1799 Human donor sites" (positions -3 to 5), each read 5′ to 3′ with a vertical axis from 0 to 2 bits.]
• Area under a sequence logo is the total information.
• How is that related to the binding energy?
• Information gained for energy dissipated
• Isothermal efficiency
• My most important discovery: Molecules are often 70% efficient
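The first bullet can be made concrete with the donor data tabulated on the Human Splice Junction Information Curves slide. This minimal sketch adds up the Rs(l) column over the whole tabulated range, positions -9 to 12, which is wider than the logo shown here. The helper name `total_information` is an illustrative choice, not something from the talk.

```python
# Rs(l) values for donor positions -9 .. 12, copied from the donor table.
donor_rs = [0.01, -0.00, 0.01, 0.00, 0.00, 0.02, 0.14, 0.39, 0.90, 1.95, 1.88,
            0.75, 0.68, 1.04, 0.16, 0.04, 0.02, 0.01, 0.02, 0.02, 0.02, 0.02]

def total_information(rs_values):
    """Area under the logo: the sum of the per-position information, in bits."""
    return sum(rs_values)

print(round(total_information(donor_rs), 2))
```

For the tabulated donor values the sum is about 8.08 bits per site.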
Information of EcoRI DNA Binding
[Figure: sequence logo of EcoRI sites, positions 0 to 5 reading 5′ G A A T T C 3′; vertical axis 0 to 2 bits.]
• EcoRI - restriction enzyme
• EcoRI binds DNA at 5′ GAATTC 3′
• information required: 6 bases × 2 bits per base = 12 bits
Energy Dissipation by EcoRI
• Measured specific binding constant:
  $K_{spec} = 1.6 \times 10^{5}$
• Average energy dissipated by one molecule as it binds:
  $\Delta G^{\circ}_{spec} = -k_B T \ln K_{spec}$ (joules per binding)
• The Second Law of Thermodynamics as a conversion factor:
  $E_{min} = k_B T \ln 2$ (joules per bit)
• Number of bits that could have been selected:
  $R_{energy} = -\Delta G^{\circ}_{spec} / E_{min}$
  $= (k_B T \ln K_{spec}) / (k_B T \ln 2)$
  $= \log_2 K_{spec}$ ⇐ SO SIMPLE!
  $= 17.3$ bits per binding
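The chain of equalities above can be checked numerically. In this sketch the temperature of 298 K is an assumption (the talk does not give one); it cancels in the final ratio, which is why the result depends on the binding constant alone.

```python
import math

K_SPEC = 1.6e5        # measured specific binding constant (from the slide)
K_B = 1.380649e-23    # Boltzmann constant in joules per kelvin
T = 298.0             # assumed temperature in kelvin, not stated in the talk

delta_g = -K_B * T * math.log(K_SPEC)   # joules per binding
e_min = K_B * T * math.log(2)           # joules per bit
r_energy = -delta_g / e_min             # bits per binding

print(round(r_energy, 1))               # 17.3
print(math.isclose(r_energy, math.log2(K_SPEC)))
```

Changing `T` alters `delta_g` and `e_min` but leaves `r_energy` unchanged.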
Information/Energy = Efficiency of EcoRI = 70%
[Figure: sequence logo of EcoRI sites, positions 0 to 5 reading 5′ G A A T T C 3′; vertical axis 0 to 2 bits.]
EcoRI could have made 17.3 binary choices . . . but it only made 12 choices.
Efficiency is ‘WORK’ DONE / ENERGY DISSIPATED:
  (12 bits per binding) / (17.3 bits per binding) = 0.7
The efficiency is 70%.
18 out of 19 DNA binding proteins give ∼70% efficiency.
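As a small worked version of the ratio above, the following sketch recomputes the efficiency from the two quantities given in the talk: 6 bases at 2 bits each, and the binding constant 1.6 × 10^5. The function name `measured_efficiency` is a label used here for illustration.

```python
import math

def measured_efficiency(r_sequence_bits, k_spec):
    """Bits actually selected divided by bits the dissipated energy could buy."""
    return r_sequence_bits / math.log2(k_spec)

r_sequence = 6 * math.log2(4)                    # 6 bases x 2 bits per base
efficiency = measured_efficiency(r_sequence, 1.6e5)
print(round(efficiency, 2))                      # 0.69
```

The unrounded ratio is a little under 0.7, which the slide reports as 70%.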
Rhodopsin Shape Change
[Figure: rhodopsin structures labeled "Dark State" and "After Photon - Light State", with a photon (hν) and branches labeled 70% and 30%.]
P. Scheerer et al. Nature 455, 497-502 (2008) doi:10.1038/nature07330
Tom’s Model of Muscle Mechanism
[Figure: state diagram in which each state is drawn with five numbered parts (1 to 5). Labels: ATP, prime, before state, operation, "power stroke", after state forward, after state degenerate, 70% success, 30% failure, heat, heat + ADP + Pi.]
Efficiency of Muscle
• Experiments by Kushmerick’s lab since (at least) 1969
• new work: 2008, 2011
• Weight lifting gives work done
• NMR coil gives ATP = energy used
• Efficiency: 0.68 ± 0.09
http://dx.doi.org/10.1113/jphysiol.2007.146829
http://dx.doi.org/10.1242/jeb.052985
Why are molecular machines 70% efficient?
[Figure: sequence logo of EcoRI sites, positions 0 to 5 reading 5′ G A A T T C 3′; vertical axis 0 to 2 bits.]
70% efficiency appears widely in biology:
• DNA - protein binding
• rhodopsin
• muscle
• other systems
Why 70% efficiency?
Information theory explanation
Lock and Key
Like a key in a lock, which has many independent pins, it takes many numbers to describe the vibrational state of a molecular machine.
[Figure: lock pins labeled G A A T T C.]
Gaussians
[Figure: bell curve, $p(x) \propto e^{-x^2}$ plotted against x.]
• Pin motion x has a Gaussian distribution:
  $p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
  μ = mean, σ = standard deviation
• Gaussian distributions are generated by the sum of many small random variables
• Drunkard’s walk: Galton’s quincunx device!
http://www.youtube.com/watch?v=xDIyAOBa_yU
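The quincunx claim can be checked without simulation. This sketch assumes an idealized device in which a ball goes left or right with equal probability at each of `n` rows, so the number of right-steps is binomial. It compares that exact probability with the Gaussian formula above, using mean n/2 and standard deviation √n/2. The function names are illustrative.

```python
import math

def quincunx_probability(n_rows, k_right):
    """Exact probability of k_right right-steps in n_rows fair left/right choices."""
    return math.comb(n_rows, k_right) / 2 ** n_rows

def gaussian_pdf(x, mu, sigma):
    """The Gaussian density p(x) from the slide."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

n = 100
mu, sigma = n / 2, math.sqrt(n) / 2
for k in (40, 45, 50, 55, 60):
    print(k, round(quincunx_probability(n, k), 4), round(gaussian_pdf(k, mu, sigma), 4))
```

With 100 rows the two columns should agree to about three decimal places, which is the sense in which many small random steps add up to a Gaussian.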
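Two Gaussians
$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \propto e^{-x^2}$ (1)
$p(y) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y-\mu)^2}{2\sigma^2}} \propto e^{-y^2}$ (2)
[Figure: a point p(x,y) at distance r from the origin, with legs x and y. credit: http://en.wikipedia.org/wiki/Pythagoras]
$p(x, y) = p(x) \times p(y)$ (3)
$\propto e^{-x^2} \times e^{-y^2}$ (4)
$\propto e^{-(x^2+y^2)}$ (5)
$\propto e^{-r^2}$ (6)
If p(x, y) is a constant, then r is a constant.
Circular distribution!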
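Steps (3) to (6) can be illustrated numerically. The sketch below assumes two independent Gaussians with mean 0 and the same standard deviation (1 here, an arbitrary choice) and evaluates the joint density at several points that all lie at one radius r. The names `joint_density` and `densities` are illustrative.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """One-dimensional Gaussian density, equations (1) and (2)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def joint_density(x, y):
    """Equation (3): independent coordinates multiply."""
    return gaussian_pdf(x) * gaussian_pdf(y)

r = 1.5
densities = []
for angle_deg in (0, 30, 45, 90, 200):
    angle = math.radians(angle_deg)
    densities.append(joint_density(r * math.cos(angle), r * math.sin(angle)))
    print(angle_deg, densities[-1])
```

Every angle gives the same density up to floating-point rounding, so contours of constant p(x, y) are circles.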
1 Dimension
Energy
States
1 dimension is too simple!
Bowls in 2 Dimensions
Spheres in 3 Dimensions
N Dimensional Sphere
Spheres tighten in high dimensions
[Figure: Normalized Density (0.00 to 1.00) plotted against Normalized Radius (0.00 to 3.00), with one curve each for D=1, D=2, D=4 and D=8.]
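One way to see the tightening is to look at the radius of a point whose D coordinates are independent unit-variance Gaussians; that radius follows a chi distribution. The sketch below is an illustration under that assumption, and it does not reproduce the exact normalization used in the plot. It prints the spread of the radius relative to its mean, using the standard chi-distribution mean and the fact that the mean squared radius equals D.

```python
import math

def radius_mean_and_std(d):
    """Mean and standard deviation of the radius of d unit-variance Gaussians."""
    mean = math.sqrt(2) * math.gamma((d + 1) / 2) / math.gamma(d / 2)
    std = math.sqrt(d - mean ** 2)   # E[r^2] = d for unit-variance coordinates
    return mean, std

def relative_spread(d):
    mean, std = radius_mean_and_std(d)
    return std / mean

for d in (1, 2, 4, 8, 64):
    print(d, round(relative_spread(d), 3))
```

The relative spread shrinks as D grows, so in high dimensions nearly all of the probability sits in a thin shell: the "sphere".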
Good Sphere Packing
Good packing of spheres gives a molecule the capacity to make selections efficiently
N Dimensional Sphere Separation
[Figure: a Degenerate Sphere and a Forward Sphere, with distances labeled $\sqrt{Noise}$ and $\sqrt{Power}$.]
Energy dissipated to escape the Degenerate Sphere must exceed the Noise:
$\sqrt{Power} > \sqrt{Noise}$
Theoretical Isothermal Efficiency
T. D. Schneider, Nucleic Acids Research (2010) 38: 5995-6006
• For molecular states of molecules with $d_{space}$ ‘parts’, $P_y$ energy is dissipated for noise $N_y$ and
  $C_y = d_{space} \log_2(P_y/N_y + 1)$ ← machine capacity
  $\epsilon_t = \frac{\ln(P_y/N_y + 1)}{P_y/N_y}$ ← isothermal efficiency
[Figure: $\epsilon_t$ plotted against $P_y/N_y$ from 0 to 4, with efficiency values 1.00, 0.69, 0.55, 0.46 and 0.40 marked on the vertical axis. Labels: "Second Law upper bound" and "Isothermal Efficiency upper bound".]
The curve is an upper bound
• If $P_y/N_y = 1$ the efficiency is 70%!
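The marked values on the curve follow directly from the efficiency formula. This minimal sketch evaluates both formulas from the slide; the function names `machine_capacity` and `efficiency_bound` are illustrative labels, and the ratio passed in is $P_y/N_y$.

```python
import math

def machine_capacity(d_space, p_over_n):
    """C_y = d_space * log2(P_y/N_y + 1), in bits per operation."""
    return d_space * math.log2(p_over_n + 1)

def efficiency_bound(p_over_n):
    """epsilon_t = ln(P_y/N_y + 1) / (P_y/N_y)."""
    return math.log(p_over_n + 1) / p_over_n

for ratio in (1, 2, 3, 4):
    print(ratio, round(efficiency_bound(ratio), 2))
```

At a ratio of 1 the bound is ln 2, about 0.69. That is the 70% figure, and the bound falls as the ratio increases.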
Acknowledgements
• Mentors:
  • Larry Gold (graduate school mentor)
  • Andrej Ehrenfeucht (information theory)
• Students:
  • Mike Stephens (logos, splicing)
  • John Spouge (mathematics)
  • Paul C. Anagnostopoulos (Evolution model)
  • Bruce Shapiro and Eckart Bindewald (3D logo)
  • Kevin Franco (σ38)
  • Ding Jin (σ38)
• Useful discussions with
  • Jeff Strathern
  • Amar Klar
  • Kemi Abolude
  • Susan Lauffer
  • Cedric Cagliero
  • Zhi-Ming Zheng
  • Mark Lewandoski
• This research was supported by the Intramural Research Program of the NIH, National Cancer Institute, Center for Cancer Research.
Information theory: the mathematics of biology
[Figure: sequence logos for 1744 Human acceptor sites, 1799 Human donor sites and EcoRI sites, each with a vertical axis from 0 to 2 bits.]
https://alum.mit.edu/www/toms/
Version
version = 1.89 of kanpurtalk.tex 2016 Oct 16