DNA Sequencing - Sharificpc.sharif.edu/acmicpc14/docs/talks/Motahari - DNA Sequencing.pdfDNA...

Preview:

Citation preview

بنام خداDNA Sequencing

سید ابوالفضل مطهری

Sequencing Problem

Sequencing Problem

S = X1X2X3X4 . . . XG�1XG

Sequencing Problem

finding the sequence

S = X1X2X3X4 . . . XG�1XG

Sequencing Problem

finding the sequenceor finding some parts of the sequence

S = X1X2X3X4 . . . XG�1XG

Sequencing Problem

finding the sequenceor finding some parts of the sequence

DNA Sequencing sequence is the genome

S = X1X2X3X4 . . . XG�1XG

Genome

Genome

Human Genome consists of 23 chromosomes

Genome

Human Genome consists of 23 chromosomesEach chromosome contains a DNA sequence

Genome

Human Genome consists of 23 chromosomes Each chromosome contains a DNA sequence The total length isG = 3.2⇥ 109

Genome

DexoyriboNucleic Acid (DNA)

pPhosphate

Sugar

Bases

GuanineCytosine

AdenineThymine

T A

C G

DexoyriboNucleic Acid (DNA)

pPhosphate

Sugar

Bases

GuanineCytosine

AdenineThymine

T A

C G

pCovalent

DexoyriboNucleic Acid (DNA)

pPhosphate

Sugar

Bases

GuanineCytosine

AdenineThymine

T A

C G

pCovalent

G C

A THydrogen

Single Stranded DNAp

Single Stranded DNAp p

Single Stranded DNAp p p

Single Stranded DNAp p p p

Single Stranded DNAp p p p p

Single Stranded DNAp p p p p

TA

CG G

TACGG.....

Fred Sanger

1918-2013

Won the Nobel Prize for Chemistry twice Sequencing Insulin DNA Sequencing

Phi X 174 Bactriophage

First Genome to be Sequenced Genome Size: 5386 bp Fred Sanger (1977)

Phi X 174 Bactriophage

First Genome to be Sequenced Genome Size: 5386 bp Fred Sanger (1977)

GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAAAAATTATCTTGATAAAGCAGGAATTACTACTGCTTGTTTACGAATTAAATCGAAGTGGACTGCTGGCGGAAAATGAGAAAATTCGACCTATCCTTGCGCAGCTCGAGAAGCTCTTACTTTGCGACCTTTCGCCATCAACTAACGATTCTGTCAAAAACTGACGCGTTGGATGAGGAGAAGTGGCTTAATATGCTTGGCACGTTCGTCAAGGACTGGTTTAGATATGAGTCACATTTTGTTCATGGTAGAGATTCTCTTGTTGACATTTTAAAAGAGCGTGGATTACTATCTGAGTCCGATGCTGTTCAACCACTAATAGGTAAGAAATCATGAGTCAAGTTACTGAACAATCCGTACGTTTCCAGACCGCTTTGGCCTCTATTAAGCTCATTCAGGCTTCTGCCGTTTTGGATTTAACCGAAGATGATTTCGATTTTCTGACGAGTAACAAAGTTTGGATTGCTACTGACCGCTCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCTGGACTTTGTGGGATACCCTCGCTTTCCTGCTCCTGTTGAGTTTATTGCTGCCGTCATTGCTTATTATGTTCATCCCGTCAACATTCAAACGGCCTGTCTCATCATGGAAGGCGCTGAATTTACGGAAAACATTATTAATGGCGTCGAGCGTCCGGTTAAAGCCGCTGAATTGTTCGCGTTTACCTTGCGTGTACGCGCAGGAAACACTGACGTTCTTACTGACGCAGAAGAAAACGTGCGTCAAAAATTACGTGCGGAAGGAGTGATGTAATGTCTAAAGGTAAAAAACGTTCTGGCGCTCGCCCTGGTCGTCCGCAGCCGTTGCGAGGTACTAAAGGCAAGCGTAAAGGCGCTCGTCTTTGGTATGTAGGTGGTCAACAATTTTAATTGCAGGGGCTTCGGCCCCTTACTTGAGGATAAATTATGTCTAATATTCAAACTGGCGCCGAGCGTATGCCGCATGACCTTTCCCATCTTGGCTTCCTTGCTGGTCAGATTGGTCGTCTTATTACCATTTCAACTACTCCGGTTATCGCTGGCGACTCCTTCGAGATGGACGCCGTTGGCGCTCTCCGTCTTTCTCCATTGCGTCGTGGCCTTGCTATTGACTCTACTGTAGACATTTTTACTTTTTATGTCCCTCATCGTCACGTTTATGGTGAACAGTGGATTAAGTTCATGAAGGATGGTGTTAATGCCACTCCTCTCCCGACTGTTAACACTACTGGTTATATTGACCATGCCGCTTTTCTTGGCACGATTAACCCTGATACCAATAAAATCCCTAAGCATTTGTTTCAGGGTTATTTGAATATCTATAACAACTATTTTAAAGCGCCGTGGATGCCTGACCGTACCGAGGCTAACCCTAATGAGCTTAATCAAGATGATGCTCGTTATGGTTTCCGTTGCTGCCATCTCAAAAACATTTGGACTGCTCCGCTTCCTCCTGAGACTGAGCTTTCTCGCCAAATGACGACTTCTACCACATCTATTGACATTATGGGTCTGCAAGCTGCTTATGCTAATTTGCATACTGACCAAGAACGTGATTACTTCATGCAGCGTTACCATGATGTTATTTCTTCATTTGGAGGTAAAACCTCTTATGACGCTGACAACCGTCCTTTACTTGTCATGCGCTCTAATCTCTGGGCATCTGGCTATGATGTTGATGGAACTGACCAAACGTCGTTAGGCCAGTTTTCTGGTCGTGTTCAACAGACCTATAAACATTCTGTGCCGCGTTTCTTTGTTCCTGAGCATGGCACTATGTTTACTCTTGCGCTTGTTCGTTTTCCGCCTACTGCGACTAAAGAGATTCAGTACCTTAACGCTAAAGGTGCTTTGACTTATACCGATATTGCTGGCGACCCTGTTTTGTATGGCAACTTGCCGCCGCGTGAAATTTCTATGAAGGATGTTTTCCGTTCTGGTGATTCGTCTAAGAAGTTTAAGATTGCTGAGGGTCAGTGGTATCGTTATGCGCCTTCGTATGTTTCTCCTGCTTATCACCTTCTTGAAGGCTTCCCATTCATTCAGGAACCGCCTTCTGGTGATTTGCAAGAACGCGTACTTATTCGCCACCATGATTATGACCAGTGTTTCCAGTCCGTTCAGTTGTTGCAGTGGAATAGTCAGGTTAAATTTAATGTGACCGTTTATCGCAATCTGCCGACCACTCGCGATTCAATCATGACTTCGTGATAAAAGATTGAGTGTGAGGTTATAACGCCGAAGCGGTAAAAATTTTAATTTTTGCCGCTGAGGGGTTGACCAAGCGAAGCGCGGTAGGTTTTCTGCTTAGGAGTTTAATCATGTTTCAGACTTTTATTTCTCGCCATAATTCAAACTTTTTTTCTGATAAGCTGGTTCTCACTTCTGTTACTCCAGCTTCTTCGGCACCTGTTTTACAGACACCTAAAGCTACATCGTCAACGTTATATTTTGATAGTTTGACGGTTAATGCTGGTAATGGTGGTTTTCTTCATTGCATTCAGATGGATACATCTGTCAACGCCGCTAATCAGGTTGTTTCTGTTGGTGCTGATATTGCTTTTGATGCCGACCCTAAATTTTTTGCCTGTTTGGTTCGCTTTGAGTCTTCTTCGGTTCCGACTACCCTCCCGACTGCCTATGATGTTTATCCTTTGAATGGTCGCCATGATGGTGGTTATTATACCGTCAAGGACTGTGTGACTATTGACGTCCTTCCCCGTACGCCGGGCAATAACGTTTATGTTGGTTTCATGGTTTGGTCTAACTTTACCGCTACTAAATGCCGCGGATTGGTTTCGCTGAATCAGGTTATTAAAGAGATTATTTGTCTCCAGCCACTTAAGTGAGGTGATTTATGTTTGGTGCTATTGCTGGCGGTATTGCTTCTGCTCTTGCTGGTGGCGCCATGTCTAAATTGTTTGGAGGCGGTCAAAAAGCCGCCTCCGGTGGCATTCAAGGTGATGTGCTTGCTACCGATAACAATACTGTAGGCATGGGTGATGGTATTAAATCTGCCATTCAAGGCTCTAATGTTCCTAACCCTGATGAGGCCGCCCCTAGTTTTGTTTCTGGTGCTATGGCTAAAGCTGGTAAAGGACTTCTTGAAGGTACGTTGCAGGCTGGCACTTCTGCCGTTTCTGATAAGTTGCTTGATTTGGTTGGACTTGGTGGCAAGTCTGCCGCTGATAAAGGAAAGGATACTCGTGATTATCTTGCTGCTGCATTTCCTGAGCTTAATGCTTGGGAGCGTGCTGGTGCTGATGCTTCCTCTGCTGGTATGGTTGACGCCGGATTTGAGAATCAAAAAGAGCTTACTAAAATGCAACTGGACAATCAGAAAGAGATTGCCGAGATGCAAAATGAGACTCAAAAAGAGATTGCTGGCATTCAGTCGGCGACTTCACGCCAGAATACGAAAGACCAGGTATATGCACAAAATGAGATGCTTGCTTATCAACAGAAGGAGTCTACTGCTCGCGTTGCGTCTATTATGGAAAACACCAATCTTTCCAAGCAACAGCAGGTTTCCGAGATTATGCGCCAAATGCTTACTCAAGCTCAAACGGCTGGTCAGTATTTTACCAATGACCAAATCAAAGAAATGACTCGCAAGGTTAGTGCTGAGGTTGACTTAGTTCATCAGCAAACGCAGAATCAGCGGTATGGCTCTTCTCATATTGGCGCTACTGCAAAGGATATTTCTAATGTCGTCACTGATGCTGCTTCTGGTGTGGTTGATATTTTTCATGGTATTGATAAAGCTGTTGCCGATACTTGGAACAATTTCTGGAAAGACGGTAAAGCTGATGGTATTGGCTCTAATTTGTCTAGGAAATAACCGTCAGGATTGACACCCTCCCAATTGTATGTTTTCATGCCTCCAAATCTTGGAGGCTTTTTTATGGTTCGTTCTTATTACCCTTCTGAATGTCACGCTGATTATTTTGACTTTGAGCGTATCGAGGCTCTTAAACCTGCTATTGAGGCTTGTGGCATTTCTACTCTTTCTCAATCCCCAATGCTTGGCTTCCATAAGCAGATGGATAACCGCATCAAGCTCTTGGAAGAGATTCTGTCTTTTCGTATGCAGGGCGTTGAGTTCGATAATGGTGATATGTATGTTGACGGCCATAAGGCTGCTTCTGACGTTCGTGATGAGTTTGTATCTGTTACTGAGAAGTTAATGGATGAATTGGCACAATGCTACAATGTGCTCCCCCAACTTGATATTAATAACACTATAGACCACCGCCCCGAAGGGGACGAAAAATGGTTTTTAGAGAACGAGAAGACGGTTACGCAGTTTTGCCGCAAGCTGGCTGCTGAACGCCCTCTTAAGGATATTCGCGATGAGTATAATTACCCCAAAAAGAAAGGTATTAAGGATGAGTGTTCAAGATTGCTGGAGGCCTCCACTATGAAATCGCGTAGAGGCTTTGCTATTCAGCGTTTGATGAATGCAATGCGACAGGCTCATGCTGATGGTTGGTTTATCGTTTTTGACACTCTCACGTTGGCTGACGACCGATTAGAGGCGTTTTATGATAATCCCAATGCTTTGCGTGACTATTTTCGTGATATTGGTCGTATGGTTCTTGCTGCCGAGGGTCGCAAGGCTAATGATTCACACGCCGACTGCTATCAGTATTTTTGTGTGCCTGAGTATGGTACAGCTAATGGCCGTCTTCATTTCCATGCGGTGCACTTTATGCGGACACTTCCTACAGGTAGCGTTGACCCTAATTTTGGTCGTCGGGTACGCAATCGCCGCCAGTTAAATAGCTTGCAAAATACGTGGCCTTATGGTTACAGTATGCCCATCGCAGTTCGCTACACGCAGGACGCTTTTTCACGTTCTGGTTGGTTGTGGCCTGTTGATGCTAAAGGTGAGCCGCTTAAAGCTACCAGTTATATGGCTGTTGGTTTCTATGTGGCTAAATACGTTAACAAAAAGTCAGATATGGACCTTGCTGCTAAAGGTCTAGGAGCTAAAGAATGGAACAACTCACTAAAAACCAAGCTGTCGCTACTTCCCAAGAAGCTGTTCAGAATCAGAATGAGCCGCAACTTCGGGATGAAAATGCTCACAATGACAAATCTGTCCACGGAGTGCTTAATCCAACTTACCAAGCTGGGTTACGACGCGACGCCGTTCAACCAGATATTGAAGCAGAACGCAAAAAGAGAGATGAGATTGAGGCTGGGAAAAGTTACTGTAGCCGACGTTTTGGCGGCGCAACCTGTGACGACAAATCTGCTCAAATTTATGCGCGCTTCGATAAAAATGATTGGCGTATCCAACCTGCA

Lambda Phage

Genome Size: 48,502 bp Fred Sanger, et al (1982) Method: Shotgun Sequencing

Shotgun Sequencing

Shotgun Sequencing

Reads

Shotgun Sequencing

Reads

Hearken to the reed-flute

Shotgun Sequencing

Reads

Hearken to the reed-flute reed-flute, how it complains,

it complains, Lamenting its

Lamenting its banishment frombanishment from its home

Shotgun Sequencing

Reads

Hearken to the reed-flute

reed-flute, how it complains,

it complains, Lamenting its

Lamenting its banishment frombanishment from its home

Shotgun Sequencing

Reads

Hearken to the reed-flutereed-flute, how it complains,

it complains, Lamenting its

Lamenting its banishment frombanishment from its home

Shotgun Sequencing

Reads

Hearken to the reed-flutereed-flute, how it complains,it complains, Lamenting its

Lamenting its banishment frombanishment from its home

Shotgun Sequencing

Reads

Hearken to the reed-flutereed-flute, how it complains,it complains, Lamenting itsLamenting its banishment from

banishment from its home

Shotgun Sequencing

Reads

Hearken to the reed-flutereed-flute, how it complains,it complains, Lamenting itsLamenting its banishment frombanishment from its home

Human Genome

DNA Shotgun Sequencing

{A,C,G, T}

DNA Shotgun Sequencing

{A,C,G, T}

Reads

DNA Shotgun Sequencing

{A,C,G, T}

Reads

DNA Shotgun Sequencing

{A,C,G, T}

Reads

ACCGTCCTTAAGG

DNA Shotgun Sequencing

{A,C,G, T}

Reads

ACCGTCCTTAAGG

ACCGTCGTTAAGG

DNA Shotgun Sequencing

{A,C,G, T}

Reads

ACCGTCCTTAAGG

ACCGTCGTTAAGG

ACCGTCGTTAAGG

DNA Shotgun Sequencing

{A,C,G, T}

Reads

GTTTCCTAAGTGGACCGTCGTTAAGG

ACTGCTTCGTCCTA

CCTTAAGCGGTTCACTGCTTCGTCCTA

DNA Shotgun Sequencing

Starting Big Projects

Starting Big Projects

Genome of s. cerevisiae Size of Genome: 12 Mb 16 Chromosomes Finished 1996

1989

Starting Big Projects

1990E. Coli: 4.6 Mb M. Capricolum: 1 Mb C. elegans: 100 Mb

Genome of s. cerevisiae Size of Genome: 12 Mb 16 Chromosomes Finished 1996

1989

Starting Big Projects

1990E. Coli: 4.6 Mb M. Capricolum: 1 Mb C. elegans: 100 Mb

1990Human Genome Project (HGP) Size of Genome: 3.2 Gb 23 Chromosomes Finished 2003

Genome of s. cerevisiae Size of Genome: 12 Mb 16 Chromosomes Finished 1996

1989

Strategy: Hierarchical SG

Strategy: Hierarchical SG

50-200 kbp

Random Fragments

Strategy: Hierarchical SG

Tiling path

Physical Map

50-200 kbp

Random Fragments

Strategy: Hierarchical SG

Tiling path

Physical Map

50-200 kbp

Random Fragments

Shotgun

500 bp

H Influenzae

The first sequenced free-living organism Genome Size: 1.8 Mb Strategy: Whole Genome Shotgun Sequencing The Institute for Genomic Research TIGR Finished 1995 Done in 13 months Cost: 50 cents per base

Whole Genome Shotgun Sequencing

Whole Genome Shotgun Sequencing

Shotgun

Number of Reads: N Read Length: L ATCCGGTACGGATCGTACA

HGP FINISHED

There is Never Enough Sequencing

courtesy: Batzoglou

There is Never Enough Sequencing

100 million species

courtesy: Batzoglou

There is Never Enough Sequencing

100 million species

7 billion individuals courtesy: Batzoglou

There is Never Enough Sequencing

100 million species

7 billion individuals

Somatic mutations (e.g., HIV, cancer)

courtesy: Batzoglou

Sequencing Growth

Cost of one human genome • 2004: $30,000,000 • 2008: $100,000 • 2010: $10,000 • 2013: $4,000

(today) • ???: $300

courtesy: Batzoglou

Next Generation Sequencing

Very High Throughput Higher error rate Shorter Length Very Cheap

Computational Challenges

courtesy: Batzoglou

Computational ChallengesHow to assemble the reads?

courtesy: Batzoglou

Computational ChallengesHow to assemble the reads?For given genome size G, what are the requirements

courtesy: Batzoglou

Computational ChallengesHow to assemble the reads?For given genome size G, what are the requirements

Number of Reads: N

courtesy: Batzoglou

Computational ChallengesHow to assemble the reads?For given genome size G, what are the requirements

Number of Reads: NLength of Reads: L

courtesy: Batzoglou

Computational ChallengesHow to assemble the reads?For given genome size G, what are the requirements

Number of Reads: NLength of Reads: L

N

L

Possible or not

courtesy: Batzoglou

Coverage Condition

Coverage Condition

Ncov

is the minimum number of reads required to cover the sequence

Coverage Condition

Ncov

is the minimum number of reads required to cover the sequence

N

L

Ncov

impossible

Repeat Condition

courtesy: Tse

Repeat Condition

L-spectrum is given

Repeat Condition

L-spectrum is given

ATAGCCTAGCGAT

Repeat Condition

L-spectrum is given

ATAGCCTAGCGATATAGC

TAGCCAGCCTGCCTA

CCTAGCTAGC

TAGCGAGCGA

GCGAT

Repeat Condition

de Bruijn Graph Add a node for each (L-1)- mer in a read Add edges for adjacent (L-1)-mers

Repeat Condition

TAGC

AGCC

AGCG

GCCC

GCGA

CCCT CCTA

CTAG

ATAG

CGAT

de Bruijn Graph Add a node for each (L-1)- mer in a read Add edges for adjacent (L-1)-mers

Repeat Condition

TAGC

AGCC

AGCG

GCCC

GCGA

CCCT CCTA

CTAG

ATAG

CGAT

de Bruijn Graph Add a node for each (L-1)- mer in a read Add edges for adjacent (L-1)-mers

Find an Eulerian Path

Repeat Condition

ATAGCCCTAGCGAT

TAGC

AGCC

AGCG

GCCC

GCGA

CCCT CCTA

CTAG

ATAG

CGAT

de Bruijn Graph Add a node for each (L-1)- mer in a read Add edges for adjacent (L-1)-mers

Find an Eulerian Path

Repeat Condition

A

B

A

Repeat Condition

A

B

A

A AB B

Interleaved repeats

Repeat Condition

A

B

A

A AB B

Interleaved repeats

A A A

Triple repeat

Repeat Condition

A

B

A

A AB B

Interleaved repeats

A A A

Triple repeat

Lcritis the minimum read length where there exists a unique solution for the problem

Repeat Condition

N

L

L > Lcrit

Lcrit

impo

ssib

le

Putting together

N

Ncov

L̄Normalized Read Length

Normalized number of reads

Putting together

N

Ncov

1covered

not covered

Normalized Read Length

Normalized number of reads

Putting together

N

Ncov

1covered

not covered

L̄crit

triple orinterleaved repeats

no triple orinterleaved repeats

Normalized Read Length

Normalized number of reads

Putting together

N

Ncov

1covered

not covered

L̄crit

triple orinterleaved repeats

no triple orinterleaved repeats

Normalized Read Length

Normalized number of reads

what about the rest?

I. Find two reads with the largest overlap and merge them into one read.II. Repeat 1 until only one contig remains.

Algorithm

The Greedy Algorithm

what about the rest?

I. Find two reads with the largest overlap and merge them into one read.II. Repeat 1 until only one contig remains.

Algorithm

The Greedy Algorithm

what about the rest?

I. Find two reads with the largest overlap and merge them into one read.II. Repeat 1 until only one contig remains.

Algorithm

The Greedy Algorithm

what about the rest?

I. Find two reads with the largest overlap and merge them into one read.II. Repeat 1 until only one contig remains.

Algorithm

The Greedy Algorithm

what about the rest?

I. Find two reads with the largest overlap and merge them into one read.II. Repeat 1 until only one contig remains.

Algorithm

The Greedy Algorithm

what about the rest?

I. Find two reads with the largest overlap and merge them into one read.II. Repeat 1 until only one contig remains.

Algorithm

The Greedy Algorithm

what about the rest?

I. Find two reads with the largest overlap and merge them into one read.II. Repeat 1 until only one contig remains.

Algorithm

The Greedy Algorithm

what about the rest?

I. Find two reads with the largest overlap and merge them into one read.II. Repeat 1 until only one contig remains.

Algorithm

The Greedy Algorithm

what about the rest?

I. Find two reads with the largest overlap and merge them into one read.II. Repeat 1 until only one contig remains.

Algorithm

The Greedy Algorithm

what about the rest?

I. Find two reads with the largest overlap and merge them into one read.II. Repeat 1 until only one contig remains.

Algorithm

The Greedy Algorithm

what about the rest?

I. Find two reads with the largest overlap and merge them into one read.II. Repeat 1 until only one contig remains.

Algorithm

The Greedy Algorithm

RANDOM Sequence

N

Ncov

1

RANDOM Sequence

N

Ncov

1

Is feasible

Motahari, Bresler, Tse (2012)

Human Chromosome 19

500 1000 1500 2000 2500 3000 3500 4000 45000

1

2

3

4

5

6

7

8

9

10

N

Ncov

L̄crit

Longest triple repeat Longest interleaved repeat

Longest repeat

(Bresler, Bresler & T. 12)

Human Chromosome 19

500 1000 1500 2000 2500 3000 3500 4000 45000

1

2

3

4

5

6

7

8

9

10

N

Ncov

L̄crit

Longest triple repeat Longest interleaved repeat

Longest repeat

Greedy Algorithm

Modified de Bruijn Algorithm

(Bresler, Bresler & T. 12)

WHY?

(Bresler, Bresler & T. 12)

WHY?

0 1000 2000 3000 40000

5

10

15

data

i.i.d. fit

log ( number of repeats) Human chromosome 19

(Bresler, Bresler & T. 12)

Recommended