Upload
others
View
15
Download
0
Embed Size (px)
Citation preview
بنام خداDNA Sequencing
سید ابوالفضل مطهری
Sequencing Problem
Sequencing Problem
S = X1X2X3X4 . . . XG�1XG
Sequencing Problem
finding the sequence
S = X1X2X3X4 . . . XG�1XG
Sequencing Problem
finding the sequenceor finding some parts of the sequence
S = X1X2X3X4 . . . XG�1XG
Sequencing Problem
finding the sequenceor finding some parts of the sequence
DNA Sequencing sequence is the genome
S = X1X2X3X4 . . . XG�1XG
Genome
Genome
Human Genome consists of 23 chromosomes
Genome
Human Genome consists of 23 chromosomesEach chromosome contains a DNA sequence
Genome
Human Genome consists of 23 chromosomes Each chromosome contains a DNA sequence The total length isG = 3.2⇥ 109
Genome
DexoyriboNucleic Acid (DNA)
pPhosphate
Sugar
Bases
GuanineCytosine
AdenineThymine
T A
C G
DexoyriboNucleic Acid (DNA)
pPhosphate
Sugar
Bases
GuanineCytosine
AdenineThymine
T A
C G
pCovalent
DexoyriboNucleic Acid (DNA)
pPhosphate
Sugar
Bases
GuanineCytosine
AdenineThymine
T A
C G
pCovalent
G C
A THydrogen
Single Stranded DNAp
Single Stranded DNAp p
Single Stranded DNAp p p
Single Stranded DNAp p p p
Single Stranded DNAp p p p p
Single Stranded DNAp p p p p
TA
CG G
TACGG.....
Fred Sanger
1918-2013
Won the Nobel Prize for Chemistry twice Sequencing Insulin DNA Sequencing
Phi X 174 Bactriophage
First Genome to be Sequenced Genome Size: 5386 bp Fred Sanger (1977)
Phi X 174 Bactriophage
First Genome to be Sequenced Genome Size: 5386 bp Fred Sanger (1977)
GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAAAAATTATCTTGATAAAGCAGGAATTACTACTGCTTGTTTACGAATTAAATCGAAGTGGACTGCTGGCGGAAAATGAGAAAATTCGACCTATCCTTGCGCAGCTCGAGAAGCTCTTACTTTGCGACCTTTCGCCATCAACTAACGATTCTGTCAAAAACTGACGCGTTGGATGAGGAGAAGTGGCTTAATATGCTTGGCACGTTCGTCAAGGACTGGTTTAGATATGAGTCACATTTTGTTCATGGTAGAGATTCTCTTGTTGACATTTTAAAAGAGCGTGGATTACTATCTGAGTCCGATGCTGTTCAACCACTAATAGGTAAGAAATCATGAGTCAAGTTACTGAACAATCCGTACGTTTCCAGACCGCTTTGGCCTCTATTAAGCTCATTCAGGCTTCTGCCGTTTTGGATTTAACCGAAGATGATTTCGATTTTCTGACGAGTAACAAAGTTTGGATTGCTACTGACCGCTCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCTGGACTTTGTGGGATACCCTCGCTTTCCTGCTCCTGTTGAGTTTATTGCTGCCGTCATTGCTTATTATGTTCATCCCGTCAACATTCAAACGGCCTGTCTCATCATGGAAGGCGCTGAATTTACGGAAAACATTATTAATGGCGTCGAGCGTCCGGTTAAAGCCGCTGAATTGTTCGCGTTTACCTTGCGTGTACGCGCAGGAAACACTGACGTTCTTACTGACGCAGAAGAAAACGTGCGTCAAAAATTACGTGCGGAAGGAGTGATGTAATGTCTAAAGGTAAAAAACGTTCTGGCGCTCGCCCTGGTCGTCCGCAGCCGTTGCGAGGTACTAAAGGCAAGCGTAAAGGCGCTCGTCTTTGGTATGTAGGTGGTCAACAATTTTAATTGCAGGGGCTTCGGCCCCTTACTTGAGGATAAATTATGTCTAATATTCAAACTGGCGCCGAGCGTATGCCGCATGACCTTTCCCATCTTGGCTTCCTTGCTGGTCAGATTGGTCGTCTTATTACCATTTCAACTACTCCGGTTATCGCTGGCGACTCCTTCGAGATGGACGCCGTTGGCGCTCTCCGTCTTTCTCCATTGCGTCGTGGCCTTGCTATTGACTCTACTGTAGACATTTTTACTTTTTATGTCCCTCATCGTCACGTTTATGGTGAACAGTGGATTAAGTTCATGAAGGATGGTGTTAATGCCACTCCTCTCCCGACTGTTAACACTACTGGTTATATTGACCATGCCGCTTTTCTTGGCACGATTAACCCTGATACCAATAAAATCCCTAAGCATTTGTTTCAGGGTTATTTGAATATCTATAACAACTATTTTAAAGCGCCGTGGATGCCTGACCGTACCGAGGCTAACCCTAATGAGCTTAATCAAGATGATGCTCGTTATGGTTTCCGTTGCTGCCATCTCAAAAACATTTGGACTGCTCCGCTTCCTCCTGAGACTGAGCTTTCTCGCCAAATGACGACTTCTACCACATCTATTGACATTATGGGTCTGCAAGCTGCTTATGCTAATTTGCATACTGACCAAGAACGTGATTACTTCATGCAGCGTTACCATGATGTTATTTCTTCATTTGGAGGTAAAACCTCTTATGACGCTGACAACCGTCCTTTACTTGTCATGCGCTCTAATCTCTGGGCATCTGGCTATGATGTTGATGGAACTGACCAAACGTCGTTAGGCCAGTTTTCTGGTCGTGTTCAACAGACCTATAAACATTCTGTGCCGCGTTTCTTTGTTCCTGAGCATGGCACTATGTTTACTCTTGCGCTTGTTCGTTTTCCGCCTACTGCGACTAAAGAGATTCAGTACCTTAACGCTAAAGGTGCTTTGACTTATACCGATATTGCTGGCGACCCTGTTTTGTATGGCAACTTGCCGCCGCGTGAAATTTCTATGAAGGATGTTTTCCGTTCTGGTGATTCGTCTAAGAAGTTTAAGATTGCTGAGGGTCAGTGGTATCGTTATGCGCCTTCGTATGTTTCTCCTGCTTATCACCTTCTTGAAGGCTTCCCATTCATTCAGGAACCGCCTTCTGGTGATTTGCAAGAACGCGTACTTATTCGCCACCATGATTATGACCAGTGTTTCCAGTCCGTTCAGTTGTTGCAGTGGAATAGTCAGGTTAAATTTAATGTGACCGTTTATCGCAATCTGCCGACCACTCGCGATTCAATCATGACTTCGTGATAAAAGATTGAGTGTGAGGTTATAACGCCGAAGCGGTAAAAATTTTAATTTTTGCCGCTGAGGGGTTGACCAAGCGAAGCGCGGTAGGTTTTCTGCTTAGGAGTTTAATCATGTTTCAGACTTTTATTTCTCGCCATAATTCAAACTTTTTTTCTGATAAGCTGGTTCTCACTTCTGTTACTCCAGCTTCTTCGGCACCTGTTTTACAGACACCTAAAGCTACATCGTCAACGTTATATTTTGATAGTTTGACGGTTAATGCTGGTAATGGTGGTTTTCTTCATTGCATTCAGATGGATACATCTGTCAACGCCGCTAATCAGGTTGTTTCTGTTGGTGCTGATATTGCTTTTGATGCCGACCCTAAATTTTTTGCCTGTTTGGTTCGCTTTGAGTCTTCTTCGGTTCCGACTACCCTCCCGACTGCCTATGATGTTTATCCTTTGAATGGTCGCCATGATGGTGGTTATTATACCGTCAAGGACTGTGTGACTATTGACGTCCTTCCCCGTACGCCGGGCAATAACGTTTATGTTGGTTTCATGGTTTGGTCTAACTTTACCGCTACTAAATGCCGCGGATTGGTTTCGCTGAATCAGGTTATTAAAGAGATTATTTGTCTCCAGCCACTTAAGTGAGGTGATTTATGTTTGGTGCTATTGCTGGCGGTATTGCTTCTGCTCTTGCTGGTGGCGCCATGTCTAAATTGTTTGGAGGCGGTCAAAAAGCCGCCTCCGGTGGCATTCAAGGTGATGTGCTTGCTACCGATAACAATACTGTAGGCATGGGTGATGGTATTAAATCTGCCATTCAAGGCTCTAATGTTCCTAACCCTGATGAGGCCGCCCCTAGTTTTGTTTCTGGTGCTATGGCTAAAGCTGGTAAAGGACTTCTTGAAGGTACGTTGCAGGCTGGCACTTCTGCCGTTTCTGATAAGTTGCTTGATTTGGTTGGACTTGGTGGCAAGTCTGCCGCTGATAAAGGAAAGGATACTCGTGATTATCTTGCTGCTGCATTTCCTGAGCTTAATGCTTGGGAGCGTGCTGGTGCTGATGCTTCCTCTGCTGGTATGGTTGACGCCGGATTTGAGAATCAAAAAGAGCTTACTAAAATGCAACTGGACAATCAGAAAGAGATTGCCGAGATGCAAAATGAGACTCAAAAAGAGATTGCTGGCATTCAGTCGGCGACTTCACGCCAGAATACGAAAGACCAGGTATATGCACAAAATGAGATGCTTGCTTATCAACAGAAGGAGTCTACTGCTCGCGTTGCGTCTATTATGGAAAACACCAATCTTTCCAAGCAACAGCAGGTTTCCGAGATTATGCGCCAAATGCTTACTCAAGCTCAAACGGCTGGTCAGTATTTTACCAATGACCAAATCAAAGAAATGACTCGCAAGGTTAGTGCTGAGGTTGACTTAGTTCATCAGCAAACGCAGAATCAGCGGTATGGCTCTTCTCATATTGGCGCTACTGCAAAGGATATTTCTAATGTCGTCACTGATGCTGCTTCTGGTGTGGTTGATATTTTTCATGGTATTGATAAAGCTGTTGCCGATACTTGGAACAATTTCTGGAAAGACGGTAAAGCTGATGGTATTGGCTCTAATTTGTCTAGGAAATAACCGTCAGGATTGACACCCTCCCAATTGTATGTTTTCATGCCTCCAAATCTTGGAGGCTTTTTTATGGTTCGTTCTTATTACCCTTCTGAATGTCACGCTGATTATTTTGACTTTGAGCGTATCGAGGCTCTTAAACCTGCTATTGAGGCTTGTGGCATTTCTACTCTTTCTCAATCCCCAATGCTTGGCTTCCATAAGCAGATGGATAACCGCATCAAGCTCTTGGAAGAGATTCTGTCTTTTCGTATGCAGGGCGTTGAGTTCGATAATGGTGATATGTATGTTGACGGCCATAAGGCTGCTTCTGACGTTCGTGATGAGTTTGTATCTGTTACTGAGAAGTTAATGGATGAATTGGCACAATGCTACAATGTGCTCCCCCAACTTGATATTAATAACACTATAGACCACCGCCCCGAAGGGGACGAAAAATGGTTTTTAGAGAACGAGAAGACGGTTACGCAGTTTTGCCGCAAGCTGGCTGCTGAACGCCCTCTTAAGGATATTCGCGATGAGTATAATTACCCCAAAAAGAAAGGTATTAAGGATGAGTGTTCAAGATTGCTGGAGGCCTCCACTATGAAATCGCGTAGAGGCTTTGCTATTCAGCGTTTGATGAATGCAATGCGACAGGCTCATGCTGATGGTTGGTTTATCGTTTTTGACACTCTCACGTTGGCTGACGACCGATTAGAGGCGTTTTATGATAATCCCAATGCTTTGCGTGACTATTTTCGTGATATTGGTCGTATGGTTCTTGCTGCCGAGGGTCGCAAGGCTAATGATTCACACGCCGACTGCTATCAGTATTTTTGTGTGCCTGAGTATGGTACAGCTAATGGCCGTCTTCATTTCCATGCGGTGCACTTTATGCGGACACTTCCTACAGGTAGCGTTGACCCTAATTTTGGTCGTCGGGTACGCAATCGCCGCCAGTTAAATAGCTTGCAAAATACGTGGCCTTATGGTTACAGTATGCCCATCGCAGTTCGCTACACGCAGGACGCTTTTTCACGTTCTGGTTGGTTGTGGCCTGTTGATGCTAAAGGTGAGCCGCTTAAAGCTACCAGTTATATGGCTGTTGGTTTCTATGTGGCTAAATACGTTAACAAAAAGTCAGATATGGACCTTGCTGCTAAAGGTCTAGGAGCTAAAGAATGGAACAACTCACTAAAAACCAAGCTGTCGCTACTTCCCAAGAAGCTGTTCAGAATCAGAATGAGCCGCAACTTCGGGATGAAAATGCTCACAATGACAAATCTGTCCACGGAGTGCTTAATCCAACTTACCAAGCTGGGTTACGACGCGACGCCGTTCAACCAGATATTGAAGCAGAACGCAAAAAGAGAGATGAGATTGAGGCTGGGAAAAGTTACTGTAGCCGACGTTTTGGCGGCGCAACCTGTGACGACAAATCTGCTCAAATTTATGCGCGCTTCGATAAAAATGATTGGCGTATCCAACCTGCA
Lambda Phage
Genome Size: 48,502 bp Fred Sanger, et al (1982) Method: Shotgun Sequencing
Shotgun Sequencing
Shotgun Sequencing
Reads
Shotgun Sequencing
Reads
Hearken to the reed-flute
Shotgun Sequencing
Reads
Hearken to the reed-flute reed-flute, how it complains,
it complains, Lamenting its
Lamenting its banishment frombanishment from its home
Shotgun Sequencing
Reads
Hearken to the reed-flute
reed-flute, how it complains,
it complains, Lamenting its
Lamenting its banishment frombanishment from its home
Shotgun Sequencing
Reads
Hearken to the reed-flutereed-flute, how it complains,
it complains, Lamenting its
Lamenting its banishment frombanishment from its home
Shotgun Sequencing
Reads
Hearken to the reed-flutereed-flute, how it complains,it complains, Lamenting its
Lamenting its banishment frombanishment from its home
Shotgun Sequencing
Reads
Hearken to the reed-flutereed-flute, how it complains,it complains, Lamenting itsLamenting its banishment from
banishment from its home
Shotgun Sequencing
Reads
Hearken to the reed-flutereed-flute, how it complains,it complains, Lamenting itsLamenting its banishment frombanishment from its home
Human Genome
DNA Shotgun Sequencing
{A,C,G, T}
DNA Shotgun Sequencing
{A,C,G, T}
Reads
DNA Shotgun Sequencing
{A,C,G, T}
Reads
⇡
DNA Shotgun Sequencing
{A,C,G, T}
Reads
ACCGTCCTTAAGG
⇡
DNA Shotgun Sequencing
{A,C,G, T}
Reads
ACCGTCCTTAAGG
⇡
ACCGTCGTTAAGG
DNA Shotgun Sequencing
{A,C,G, T}
Reads
ACCGTCCTTAAGG
⇡
ACCGTCGTTAAGG
ACCGTCGTTAAGG
DNA Shotgun Sequencing
{A,C,G, T}
Reads
⇡
GTTTCCTAAGTGGACCGTCGTTAAGG
ACTGCTTCGTCCTA
CCTTAAGCGGTTCACTGCTTCGTCCTA
DNA Shotgun Sequencing
Starting Big Projects
Starting Big Projects
Genome of s. cerevisiae Size of Genome: 12 Mb 16 Chromosomes Finished 1996
1989
Starting Big Projects
1990E. Coli: 4.6 Mb M. Capricolum: 1 Mb C. elegans: 100 Mb
Genome of s. cerevisiae Size of Genome: 12 Mb 16 Chromosomes Finished 1996
1989
Starting Big Projects
1990E. Coli: 4.6 Mb M. Capricolum: 1 Mb C. elegans: 100 Mb
1990Human Genome Project (HGP) Size of Genome: 3.2 Gb 23 Chromosomes Finished 2003
Genome of s. cerevisiae Size of Genome: 12 Mb 16 Chromosomes Finished 1996
1989
Strategy: Hierarchical SG
Strategy: Hierarchical SG
50-200 kbp
Random Fragments
Strategy: Hierarchical SG
Tiling path
Physical Map
50-200 kbp
Random Fragments
Strategy: Hierarchical SG
Tiling path
Physical Map
50-200 kbp
Random Fragments
Shotgun
500 bp
H Influenzae
The first sequenced free-living organism Genome Size: 1.8 Mb Strategy: Whole Genome Shotgun Sequencing The Institute for Genomic Research TIGR Finished 1995 Done in 13 months Cost: 50 cents per base
Whole Genome Shotgun Sequencing
Whole Genome Shotgun Sequencing
Shotgun
Number of Reads: N Read Length: L ATCCGGTACGGATCGTACA
HGP FINISHED
There is Never Enough Sequencing
courtesy: Batzoglou
There is Never Enough Sequencing
100 million species
courtesy: Batzoglou
There is Never Enough Sequencing
100 million species
7 billion individuals courtesy: Batzoglou
There is Never Enough Sequencing
100 million species
7 billion individuals
Somatic mutations (e.g., HIV, cancer)
courtesy: Batzoglou
Sequencing Growth
Cost of one human genome • 2004: $30,000,000 • 2008: $100,000 • 2010: $10,000 • 2013: $4,000
(today) • ???: $300
courtesy: Batzoglou
Next Generation Sequencing
Very High Throughput Higher error rate Shorter Length Very Cheap
Computational Challenges
courtesy: Batzoglou
Computational ChallengesHow to assemble the reads?
courtesy: Batzoglou
Computational ChallengesHow to assemble the reads?For given genome size G, what are the requirements
courtesy: Batzoglou
Computational ChallengesHow to assemble the reads?For given genome size G, what are the requirements
Number of Reads: N
courtesy: Batzoglou
Computational ChallengesHow to assemble the reads?For given genome size G, what are the requirements
Number of Reads: NLength of Reads: L
courtesy: Batzoglou
Computational ChallengesHow to assemble the reads?For given genome size G, what are the requirements
Number of Reads: NLength of Reads: L
N
L
Possible or not
courtesy: Batzoglou
Coverage Condition
Coverage Condition
Ncov
is the minimum number of reads required to cover the sequence
Coverage Condition
Ncov
is the minimum number of reads required to cover the sequence
N
L
Ncov
impossible
Repeat Condition
courtesy: Tse
Repeat Condition
L-spectrum is given
Repeat Condition
L-spectrum is given
ATAGCCTAGCGAT
Repeat Condition
L-spectrum is given
ATAGCCTAGCGATATAGC
TAGCCAGCCTGCCTA
CCTAGCTAGC
TAGCGAGCGA
GCGAT
Repeat Condition
de Bruijn Graph Add a node for each (L-1)- mer in a read Add edges for adjacent (L-1)-mers
Repeat Condition
TAGC
AGCC
AGCG
GCCC
GCGA
CCCT CCTA
CTAG
ATAG
CGAT
de Bruijn Graph Add a node for each (L-1)- mer in a read Add edges for adjacent (L-1)-mers
Repeat Condition
TAGC
AGCC
AGCG
GCCC
GCGA
CCCT CCTA
CTAG
ATAG
CGAT
de Bruijn Graph Add a node for each (L-1)- mer in a read Add edges for adjacent (L-1)-mers
Find an Eulerian Path
Repeat Condition
ATAGCCCTAGCGAT
TAGC
AGCC
AGCG
GCCC
GCGA
CCCT CCTA
CTAG
ATAG
CGAT
de Bruijn Graph Add a node for each (L-1)- mer in a read Add edges for adjacent (L-1)-mers
Find an Eulerian Path
Repeat Condition
A
B
A
Repeat Condition
A
B
A
A AB B
Interleaved repeats
Repeat Condition
A
B
A
A AB B
Interleaved repeats
A A A
Triple repeat
Repeat Condition
A
B
A
A AB B
Interleaved repeats
A A A
Triple repeat
Lcritis the minimum read length where there exists a unique solution for the problem
Repeat Condition
N
L
L > Lcrit
Lcrit
impo
ssib
le
Putting together
N
Ncov
L̄Normalized Read Length
Normalized number of reads
Putting together
N
Ncov
L̄
1covered
not covered
Normalized Read Length
Normalized number of reads
Putting together
N
Ncov
L̄
1covered
not covered
L̄crit
triple orinterleaved repeats
no triple orinterleaved repeats
Normalized Read Length
Normalized number of reads
Putting together
N
Ncov
L̄
1covered
not covered
L̄crit
triple orinterleaved repeats
no triple orinterleaved repeats
Normalized Read Length
Normalized number of reads
what about the rest?
I. Find two reads with the largest overlap and merge them into one read.II. Repeat 1 until only one contig remains.
Algorithm
The Greedy Algorithm
what about the rest?
I. Find two reads with the largest overlap and merge them into one read.II. Repeat 1 until only one contig remains.
Algorithm
The Greedy Algorithm
what about the rest?
I. Find two reads with the largest overlap and merge them into one read.II. Repeat 1 until only one contig remains.
Algorithm
The Greedy Algorithm
what about the rest?
I. Find two reads with the largest overlap and merge them into one read.II. Repeat 1 until only one contig remains.
Algorithm
The Greedy Algorithm
what about the rest?
I. Find two reads with the largest overlap and merge them into one read.II. Repeat 1 until only one contig remains.
Algorithm
The Greedy Algorithm
what about the rest?
I. Find two reads with the largest overlap and merge them into one read.II. Repeat 1 until only one contig remains.
Algorithm
The Greedy Algorithm
what about the rest?
I. Find two reads with the largest overlap and merge them into one read.II. Repeat 1 until only one contig remains.
Algorithm
The Greedy Algorithm
what about the rest?
I. Find two reads with the largest overlap and merge them into one read.II. Repeat 1 until only one contig remains.
Algorithm
The Greedy Algorithm
what about the rest?
I. Find two reads with the largest overlap and merge them into one read.II. Repeat 1 until only one contig remains.
Algorithm
The Greedy Algorithm
what about the rest?
I. Find two reads with the largest overlap and merge them into one read.II. Repeat 1 until only one contig remains.
Algorithm
The Greedy Algorithm
what about the rest?
I. Find two reads with the largest overlap and merge them into one read.II. Repeat 1 until only one contig remains.
Algorithm
The Greedy Algorithm
RANDOM Sequence
N
Ncov
L̄
1
RANDOM Sequence
N
Ncov
L̄
1
Is feasible
Motahari, Bresler, Tse (2012)
Human Chromosome 19
500 1000 1500 2000 2500 3000 3500 4000 45000
1
2
3
4
5
6
7
8
9
10
N
Ncov
L̄crit
Longest triple repeat Longest interleaved repeat
Longest repeat
(Bresler, Bresler & T. 12)
Human Chromosome 19
500 1000 1500 2000 2500 3000 3500 4000 45000
1
2
3
4
5
6
7
8
9
10
N
Ncov
L̄crit
Longest triple repeat Longest interleaved repeat
Longest repeat
Greedy Algorithm
Modified de Bruijn Algorithm
(Bresler, Bresler & T. 12)
WHY?
(Bresler, Bresler & T. 12)
WHY?
0 1000 2000 3000 40000
5
10
15
data
i.i.d. fit
log ( number of repeats) Human chromosome 19
(Bresler, Bresler & T. 12)