Eukaryotic Comparative Genomics
Barak Cohen
June 2018 GEP Alumni Workshop
Last Update: 12/23/2020
Detecting Conserved Sequences
[Photos: Motoo Kimura, Charles Darwin]
Evolution of Neutral DNA
ATGCCCGTTGGAATTTTTTGGGTAA
ATGCCCGTTGGAATTTTTTGGGTAA
*************************
[Figure: substitutions accumulating at individual sites in the two copies]
Evolution of Non-Neutral DNA
ATTTTTGGCGACCCCAAAAAGGCTTAACC
ATTTTTGGCGACCCCAAAAAGGCTTAACC
*****************************
[Figure: substitutions accumulating at individual sites in the two copies]
Multi-Species Alignment
ATGTGGCGCAGCCTGTGCCAGCTGGACGATCGA
ATGTAGCCTAGCCAGTGCCAGCTGGACGATCGA
GTACATCGATAGCTTAGAATGCTGGACGATCTC
GTACGTCGATAGCATAGAATGCTGGACGATCTC
 *    *     *   *   ***********
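The row of asterisks under the alignment marks the columns where all four sequences carry the same base. A minimal sketch of how such a row could be computed is given here; the function name `conservation_line` is made up for illustration, and treating a gap character as non-conserved is an assumption.

```python
def conservation_line(aligned_seqs):
    """Return a string with '*' under every column where all sequences agree."""
    if len({len(s) for s in aligned_seqs}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    marks = []
    for column in zip(*aligned_seqs):
        # Conserved means one distinct character in the column, and not a gap.
        conserved = len(set(column)) == 1 and column[0] != "-"
        marks.append("*" if conserved else " ")
    return "".join(marks)


# The four aligned sequences from the slide.
alignment = [
    "ATGTGGCGCAGCCTGTGCCAGCTGGACGATCGA",
    "ATGTAGCCTAGCCAGTGCCAGCTGGACGATCGA",
    "GTACATCGATAGCTTAGAATGCTGGACGATCTC",
    "GTACGTCGATAGCATAGAATGCTGGACGATCTC",
]
stars = conservation_line(alignment)
for seq in alignment:
    print(seq)
print(stars)
```

For this alignment the result has four isolated conserved columns followed by one run of eleven, the same pattern as the asterisks on the slide.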
How to do Comparative Genomics
1. Choose species to analyze
2. Align sequences
3. Identify stretches of highly conserved nucleotides
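Step 3 can be sketched in a few lines once an alignment is in hand. The example below is only an illustration: it reuses the four-sequence alignment from the Multi-Species Alignment slide, and the function name `conserved_stretches` and the `min_length` threshold of 8 columns are assumptions, not values from the slides.

```python
def conserved_stretches(aligned_seqs, min_length=8):
    """Return (start, end) pairs, 0-based and end-exclusive, for runs of at
    least min_length consecutive columns identical in every sequence."""
    columns = list(zip(*aligned_seqs))
    stretches = []
    run_start = None
    for i, column in enumerate(columns):
        identical = len(set(column)) == 1 and column[0] != "-"
        if identical and run_start is None:
            run_start = i                      # a new run begins here
        elif not identical and run_start is not None:
            if i - run_start >= min_length:    # keep only long enough runs
                stretches.append((run_start, i))
            run_start = None
    # Close a run that extends to the end of the alignment.
    if run_start is not None and len(columns) - run_start >= min_length:
        stretches.append((run_start, len(columns)))
    return stretches


alignment = [
    "ATGTGGCGCAGCCTGTGCCAGCTGGACGATCGA",
    "ATGTAGCCTAGCCAGTGCCAGCTGGACGATCGA",
    "GTACATCGATAGCTTAGAATGCTGGACGATCTC",
    "GTACGTCGATAGCATAGAATGCTGGACGATCTC",
]
long_runs = conserved_stretches(alignment)                # hypothetical threshold of 8
all_runs = conserved_stretches(alignment, min_length=1)   # every conserved column
print(long_runs)
print(all_runs)
```

Requiring perfect identity in every column is the simplest possible rule; the binomial method later in the deck replaces it with a statistical test.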
Choose species
• Closely Related Species
  – align well
  – not many changes
• Distantly Related Species
  – hard to align
  – lots of changes
[Figure: phylogenetic tree of yeast species]
Species shown: S. cerevisiae, S. paradoxus, S. cariocanus, S. mikatae, S. kudriavzevii, S. bayanus, S. pastorianus, S. servazzii, S. unisporus, S. exiguus, S. dairenensis, S. castellii, S. kluyveri, Kluyveromyces lactis, Schizosaccharomyces pombe
Divergence-time labels on the tree: ~10 Mya, ~20 Mya, ~150 Mya, >350 Mya
Case Study: Coding vs. Non-Coding
• Coding DNA
  - codes for protein
  - triplet code
  - open reading frame (ORF)
  - tends to be long (50-500 bp)
  - highly constrained
• Non-Coding DNA
  - regulatory functions
  - short (5-15 bp)
  - degenerate
  - variable spacing
[Figure: ORF, ATG… …TAA]
CASE 1: Non-Coding
GAL4: ATG… …TAA
paradoxus  TCTTCTGAGACAGCATCACTTCTTCTTNTTTTTTACATAACTTATTCTTCTATAATTTTC
cerevisiae TCCTTTGAGACAGCATTCGCCCAGTATTTTTTTTATTCTACA-AACCTTCTATAATTT-C
** * *********** * * ******* ** * ************ *

paradoxus  AACGTATTTACATAGTTCTGTATCAGTTTAATCACCATAATATTGTTTTCCCTCAACTAA
cerevisiae AAAGTATTTACATAATTCTGTATCAGTTTAATCACCATAATATCGTTTTCT-----TTGT
** *********** **************************** ****** *

paradoxus  TGAATGCAATTAGATTTTCTTATTGTTCCCTCGCGGCTTTTTTTTGTTTTATAATCTATT
cerevisiae TTAGTGCAATTAATTTTTCCTATTGTTACTTCG-GGCCTTTTTCTGTTTTATGAGCTATT
* * ******** ***** ******* * *** *** ***** ******** * *****

paradoxus  TTTTCCGTCATTTCTTCCCCAGATTTCCAACTTCATCTCCAGATTGTGTCTATGTAATGC
cerevisiae TTTTCCGTCATC-CTTCCCCAGATTTTCAGCTTCATCTCCAGATTGTGTCTACGTAATGC
*********** ************* ** ********************** *******

paradoxus  ATGCTATCATATTGAGAAAAGATAGAGAAACAACCCTCCTGAAAAATGAAGCTACTGTCT
cerevisiae ACGCCATCATTTTAAGAGAGGACAGAGAAGCAAGCCTCCTGAAAGATGAAGCTACTGTCT
* ** ***** ** *** * ** ****** *** ********** ***************
Closely-related sequences are uninformative
GAL4: ATG…
Distantly-related sequences do not align
cerevisiae ACTTACCAT-CAAC-CATAGATGGGTAAAC---GGTTAGTAACTAGGAACACGAT
castellii  AGA-GTCAAACTTTTCGT—ATA--TATATATAATATGTCTGATTGCTGGTT---T
* ** * * * * * * * * *
Noncoding (Promoter)
GAL4: ATG…
Multiple sequence alignments reveal conserved elements
cerevisiae   TGAGACAGCAT-CACTTCTT-CTTNTTTTTTACATAACTTATTCTTCTATAATTTTCAAC
mikatae      TGAGACAGCATTCACTTCTTTCTTTTTTTTTACATATCTTATTCTTCTATAATTTTCAAC
bayanus      TGAGACAGCATTCGCCCAGT--ATTTTTTTTAT-TCTACAAACCTTCTATAATTT-CAAA
kudriavzevii TGAGACTGCACTCCC--------TCTTCCTTTC------------TCCATAACTT---AC
****** *** * * * ** ** ** **** ** *

paradoxus    GTATTTACATAGTTCTGTATCAGTTTAATCACCATAAT------ATTGTTTTCCCTCAAC
kluyveri     GTATTTACATAGTTCTGTATCAGTTTAATCACCATAAT------ATTGTTTTCCCTCAAC
cerevisiae   GTATTTACATAATTCTGTATCAGTTTAATCACCATAAT------ATCGTTTTCTTTGT--
bayanus      TTATTTACATAGTTTTGTATCAGTTTAATCACCATAATCGTAACACCGTTTTACCTCACC
********** ** *********************** * ***** *

paradoxus    TAATGAATGCAATTAGATTTTC-TTATTGTTCCC-TCGCGGCTTTTTTTTGTTTTATAAT
kluyveri     TAATGAATGCAATTAGATTTTCCTTATTGTTCCCCTCGCGGCTTTTTTTTGTTTTATAAT
cerevisiae   ---TTAGTGCAATTAATTTTTC-CTATTGTTACT-TCG-GGCCTTTTTCTGTTTTATGAG
bayanus      TGATGCGGG--A---ATCCTTC-AGACCGTTCTC-TCGCGC-------------------
* * * *** * *** *** *

paradoxus    -CTATTTTTTCCGTCATTTCTTCCCC-AGATTTCCAACTTCAT-CTCCAGATTGTGTCTA
kluyveri     ACTATTTTTTCCGTCATTTCTTCCCCCAGATTTCCAACTTCATACTCCAGATTGTGTCTA
cerevisiae   -CTATTTTTTCCGTCATC-CTTCCCC-AGATTTTCAGCTTCAT-CTCCAGATTGTGTCTA
bayanus      -CTTTTTTTTTCGTCATTTCTTCCCC-AGATCTACAACTTTAA-CTCCAGACGGTGTATA
** ****** ****** ******* **** * ** *** * ******* **** **

paradoxus    TGTAATGCATGCTATCATATTGAGAAAAGATAGAGAAACAACCCTCCTGAAAAATGAAGC
kluyveri     TGTAATGCATGCTATCATATTGAGAAAAGATAGAGAAACAACCCTCCTGAAAAATGAAGC
cerevisiae   CGTAATGCACGCCATCATTTTAAGAGAGGACAGAGAAGCAAGCCTCCTGAAAGATGAAGC
bayanus      GGCAGTACAAGCAGTGCTTTTGGGAAGAGGCAAAGCTGCAGACCTCGAGAACAATGAAGC
* * * ** ** * * ** ** * * ** ** **** *** *******
Elements labeled on the alignment: UAS1, UAS2, UES, MIG1, MIG1
GAL4: ATG…
CASE 2: Coding
CLN3: ATG… …TAA
Closely-related sequences are uninformative
Less distantly related species are not informative either
Distantly-related species reveal functional protein domains
Identification of Multi-Species Conserved Regions (MCS)
Margulies et al (2003) Gen. Res. 13:2507-18
Human cccattcttttccaagtgtctccg--cctgcagcgattaggttagaaagcatttctctct
Chimp cccattcttttccaagtgtctccg--cctgcagcgattaggttagaaagcatttctctct
Mouse ttcagtcgtttcccagtgtctctga-cattcagagactactttagtaagcattt-tctct
Rat   tcagtccttccctggcatctccag-cactcaa-gactactttagtaagcattt-tctctg
Dog   tcaatgactttcccagtctcttctactgggaagagattaggttgcaaatcatttttctct
* * * * * * **
How can we decide if this region is “conserved?”
It's like flipping coins (really)
Binomial-Based Method for Detecting Conserved Sequences
p = probability that a site is the same between human and mouse by chance alone (Kimura); q = 1 - p
For an alignment N base pairs long with n identities, calculate the cumulative binomial probability as:
$$P = \sum_{i=n}^{N} \binom{N}{i} \, p^{i} \, q^{N-i}$$
Margulies et al (2003) Gen. Res. 13:2507-18
Human:  AATGG
Mouse:  AATCG
Status: CCCDC
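The coin-flipping analogy can be made concrete with a short calculation. The sketch below counts identities in the five-base human/mouse example and sums the binomial probabilities of seeing n or more identities out of N sites. Taking the upper tail (n or more) is the usual reading of "cumulative" here but is an assumption, and the value `p_chance = 0.5` is a placeholder for illustration only; in practice p would come from a neutral substitution model, as the slide indicates.

```python
from math import comb


def count_identities(seq_a, seq_b):
    """Number of aligned positions where the two sequences carry the same base."""
    return sum(a == b for a, b in zip(seq_a, seq_b))


def binomial_tail(N, n, p):
    """Probability of n or more identities among N sites when each site
    matches independently with probability p."""
    q = 1.0 - p
    return sum(comb(N, i) * p**i * q**(N - i) for i in range(n, N + 1))


human = "AATGG"
mouse = "AATCG"
N = len(human)
n = count_identities(human, mouse)   # 4 of the 5 sites match
p_chance = 0.5                       # placeholder value, NOT from the slides
prob = binomial_tail(N, n, p_chance)
print(N, n, prob)
```

A small probability means that many identities would rarely arise by chance alone, which is the sense in which a window is called conserved.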
Large sequencing projects are underway
Tree Topology Influences Power
[Figure: a star phylogeny and an actual phylogeny for species A, B, C, D, E and F]
Challenges in larger genomes
1) Deciding on the neutral rate of substitution
2) Local differences in the neutral rate of substitution
3) Multiple hypothesis testing
4) Repeat sequences and uneven base composition
PhastCons and the UCSC Genome Browser
[Figure: browser view of OLIG2 and of the region 100 kb upstream of OLIG2]
Motif Searching Across Several Multiple Alignments
[Figure: multiple alignments of Species 1, Species 2 and Species 3 for each of Gene 1, Gene 2, Gene 3, … Gene N]
Information Content
EcoR1:  GAATTC GAATTC GAATTC GAATTC GAATTC GAATTC GAATTC
Random: GCCTAC ACATTC TCATTC CGACTC GAATTC ATATCG GAAATG
Rap1:   TGTATGGGTG TGTTCGGATT TGCATGGGTG TGTACAGGTG TGTATGGATG TGTTCGGGTT TGTATGGGTG
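One way to put a number on the difference between these site collections is the per-column information content, 2 bits minus the Shannon entropy of the column. The sketch below assumes equal background frequencies for the four bases and applies no small-sample correction; both are simplifications, and the function name `information_content` is made up for this example. The site lists are the EcoR1 and random sets as read from the slide.

```python
from collections import Counter
from math import log2


def information_content(sites):
    """Information in bits for each column of a set of equal-length sites,
    assuming a uniform background (maximum of 2 bits per column)."""
    per_column = []
    for column in zip(*sites):
        counts = Counter(column)
        total = sum(counts.values())
        # Shannon entropy of the observed base frequencies in this column.
        entropy = -sum((c / total) * log2(c / total) for c in counts.values())
        per_column.append(2.0 - entropy)
    return per_column


ecori_sites = ["GAATTC"] * 7
random_sites = ["GCCTAC", "ACATTC", "TCATTC", "CGACTC",
                "GAATTC", "ATATCG", "GAAATG"]

ic_ecori = information_content(ecori_sites)
ic_random = information_content(random_sites)
print(sum(ic_ecori), sum(ic_random))
```

The invariant EcoR1 site reaches the maximum of 2 bits at each of its six positions, while the random collection scores lower; a degenerate site such as Rap1 would be expected to fall in between.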
Weight Matrix Model of TATA Box
A:  -8  10  -1   2   1  -8
C: -10  -9  -3  -2  -1 -12
G:  -7  -9  -1  -1  -4  -9
T:  10  -6   9   0  -1  11
G. Stormo
… A C T A T A A T G T …
Window C T A T A A: Score = (-10) + (-6) + (-1) + 0 + 1 + (-8) = -24
… A C T A T A A T G T …
Window T A T A A T: Score = 10 + 10 + 9 + 2 + 1 + 11 = 43
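The two scores on these slides come from sliding the six-column matrix along the sequence one base at a time and summing the matrix entry for the base found at each position of the window. A minimal sketch of that scan follows; the matrix values and the sequence are taken from the slides, while the names `score_window` and `scan` are made up for this example.

```python
# TATA box weight matrix from the slide: one list of six column weights per base.
WEIGHTS = {
    "A": [-8, 10, -1, 2, 1, -8],
    "C": [-10, -9, -3, -2, -1, -12],
    "G": [-7, -9, -1, -1, -4, -9],
    "T": [10, -6, 9, 0, -1, 11],
}
WIDTH = 6


def score_window(window):
    """Sum the weight of each base at its position within the window."""
    return sum(WEIGHTS[base][i] for i, base in enumerate(window))


def scan(sequence):
    """Score every window of length WIDTH; returns (start, score) pairs."""
    return [
        (start, score_window(sequence[start:start + WIDTH]))
        for start in range(len(sequence) - WIDTH + 1)
    ]


sequence = "ACTATAATGT"
scores = scan(sequence)
best_start, best_score = max(scores, key=lambda pair: pair[1])
for start, score in scores:
    print(start, sequence[start:start + WIDTH], score)
```

The window CTATAA scores -24 and the window TATAAT, one base further along, scores 43, which is the highest score in this sequence.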
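N(b,i) = number of times base b occurs at position i of the aligned sites
F(b,i) = frequency of base b at position i
S(b,i) = log[F(b,i)/P(b)], where P(b) is the background frequency of base b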
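A weight matrix can be built from a set of aligned sites by following these three quantities in order: count, convert to frequency, take the log ratio against the background. The sketch below makes several assumptions that the slide does not specify: a pseudocount of 1 per base (so that unseen bases do not give log of zero), a uniform background P(b) = 0.25, and log base 2. The function name `log_odds_matrix` is made up for this example.

```python
from math import log2

BASES = "ACGT"


def log_odds_matrix(sites, background=None, pseudocount=1.0):
    """Build S(b,i) = log2(F(b,i) / P(b)) from equal-length aligned sites."""
    if background is None:
        background = {b: 0.25 for b in BASES}   # assumed uniform background
    width = len(sites[0])
    matrix = {b: [] for b in BASES}
    for i in range(width):
        column = [site[i] for site in sites]
        total = len(column) + pseudocount * len(BASES)
        for b in BASES:
            n_bi = column.count(b)                # N(b,i)
            f_bi = (n_bi + pseudocount) / total   # F(b,i), with pseudocount
            matrix[b].append(log2(f_bi / background[b]))
    return matrix


sites = ["GAATTC"] * 7   # the EcoR1 example sites from the Information Content slide
S = log_odds_matrix(sites)
for b in BASES:
    print(b, [round(v, 2) for v in S[b]])
```

Bases that are over-represented at a position relative to the background get positive scores and under-represented bases get negative scores, the same sign pattern seen in the TATA box matrix.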
Now we can compare motifs to each other
A:   4  -3   5  -6  -2  -5
C:   2  -1  -2  11  -1  -1
G: -10   8   2  -4   2  -3
T:  -3   2   1   2  -3  15

A:   3  -2   2   1   3   1
C:   3  -1  -2   7  -2  -1
G:  -8   6   3  -2   2  -2
T:  -1   1   1   4  -3   9
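The slide does not say which similarity measure is used to compare two motifs. As one simple possibility, the sketch below computes the Pearson correlation between corresponding entries of two weight matrices of the same width. The choice of Pearson correlation is an assumption made for illustration, and the two matrices are read off the slide as two 4 x 6 tables with rows in the order A, C, G, T, which is itself a reconstruction.

```python
from math import sqrt

# Rows are A, C, G, T; columns are motif positions (as reconstructed from the slide).
MOTIF_1 = [
    [4, -3, 5, -6, -2, -5],
    [2, -1, -2, 11, -1, -1],
    [-10, 8, 2, -4, 2, -3],
    [-3, 2, 1, 2, -3, 15],
]
MOTIF_2 = [
    [3, -2, 2, 1, 3, 1],
    [3, -1, -2, 7, -2, -1],
    [-8, 6, 3, -2, 2, -2],
    [-1, 1, 1, 4, -3, 9],
]


def pearson(matrix_a, matrix_b):
    """Pearson correlation between the entries of two same-shaped matrices."""
    xs = [v for row in matrix_a for v in row]
    ys = [v for row in matrix_b for v in row]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


similarity = pearson(MOTIF_1, MOTIF_2)
print(round(similarity, 3))
```

A value near 1 indicates that the two matrices favor and disfavor the same bases at the same positions. Motifs of different widths or with shifted positions would need an alignment step first, which this sketch leaves out.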
MAGMA: unaligned motif finding in multispecies conserved regions
[Figure: sequences from Species 1, Species 2 and Species 3 for each of Gene 1, Gene 2, Gene 3, … Gene N]
*Ihuegbu, Stormo, & Buhler, JCB 19:139, 2012