
Genome Assembly: a brief introduction

Slides courtesy of Mihai Pop, Art Delcher, and Steven Salzberg


Shotgun DNA Sequencing (Technology)

[Figure: DNA target sample → SHEAR → SIZE SELECT (e.g., 10 Kbp ± 8% std. dev.) → LIGATE & CLONE into vector → SEQUENCE from primers to obtain end reads (mates), 550 bp]


Whole Genome Shotgun Sequencing

– Collect 10X sequence in a 1-to-1 ratio of two types of read pairs, short (2 Kbp) and long (10 Kbp): ~35 million reads for Human.

– Collect another 20X in clone coverage of 50 Kbp end sequence pairs (figure labels: BAC 5', BAC 3'): ~1.2 million pairs for Human.

– Early simulations showed that if repeats were considered black boxes, one could still cover 99.7% of the genome unambiguously.

+ single highly automated process
+ only three library constructions
– assembly is much more difficult


Sequencing Factory


Celera’s Sequencing Factory (circa 2001)

• 300 ABI 3700 DNA sequencers
• 50 production staff
• 20,000 sq. ft. of wet lab
• 20,000 sq. ft. of sequencing space
• 800 tons of A/C (160,000 cfm)
• $1 million / year for electrical service
• $10 million / month for reagents


Human Data (April 2000)

• Collected 27.27 million reads = 5.11X coverage
• 21.04 million are paired (77%) = 10.52 million pairs

| Insert size | Pairs | True * | Std. dev. |
|---|---|---|---|
| 2 Kbp | 5.045 M | 98.6% | < 6% |
| 10 Kbp | 4.401 M | 98.6% | < 8% |
| 50 Kbp | 1.071 M | 90.0% | < 15% |

* validated against finished Chrom. 21 sequence

• The clones cover the genome 38.7X
• Data is from 5 individuals (one at roughly 3X, 4 others at 0.5X)
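The clone coverage figure can be checked with a little arithmetic: each pair spans one insert, so the bases spanned by clones are the sum of insert size times number of pairs. The sketch below uses only the library numbers quoted on this slide; the genome size is not given in the slides, so it is derived here from the quoted 38.7X rather than assumed.

```python
# Library table from the slide: insert size (bp) -> number of read pairs.
libraries = {
    2_000: 5.045e6,
    10_000: 4.401e6,
    50_000: 1.071e6,
}

def clone_bases(libs):
    """Total bases spanned by all clones (insert size x number of pairs)."""
    return sum(size * pairs for size, pairs in libs.items())

def clone_coverage(libs, genome_size):
    """Clone coverage = bases spanned by clones / genome size."""
    return clone_bases(libs) / genome_size

total_pairs = sum(libraries.values())
spanned = clone_bases(libraries)
# Genome size implied by the quoted 38.7X clone coverage (derived, not stated in the slides).
implied_genome = spanned / 38.7

print(f"pairs: {total_pairs / 1e6:.3f} M")
print(f"bases spanned by clones: {spanned / 1e9:.2f} Gbp")
print(f"implied genome size: {implied_genome / 1e9:.2f} Gbp")
```

The three libraries add up to about 10.52 million pairs, which agrees with the paired-read count on the slide.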


Pairs Give Order & Orientation

• Assembly without pairs results in contigs whose order and orientation are not known.

• Pairs, especially groups of corroborating ones, link the contigs into scaffolds where the size of gaps is well characterized.

[Figure: reads assembled into contigs with a consensus (15-30 Kbp); a 2-pair of mates, whose mean & std. dev. is known, links contigs into a scaffold]
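As a rough illustration of why pairs characterize gap sizes, the sketch below estimates the gap between two contigs from the mate pairs that span it. The coordinate convention, the function name `gap_estimate`, and the example numbers are all assumptions made for illustration; real scaffolders use more careful statistics.

```python
from math import sqrt
from statistics import mean

def gap_estimate(len_a, links, lib_mean, lib_std):
    """Estimate the gap between contig A (left) and contig B (right).

    links: list of (start_in_a, end_in_b) for mate pairs spanning the gap,
    where start_in_a is where the clone starts on A and end_in_b is where
    it ends on B. Returns (estimated gap, std. dev. of the estimate).
    """
    if not links:
        raise ValueError("need at least one spanning mate pair")
    # Clone length minus the parts of the clone lying inside A and inside B.
    estimates = [lib_mean - (len_a - start) - end for start, end in links]
    # Averaging n independent pairs shrinks the uncertainty by sqrt(n).
    return mean(estimates), lib_std / sqrt(len(links))

# Hypothetical 10 Kbp library with 8% std. dev. and two corroborating pairs.
gap, sd = gap_estimate(len_a=20_000,
                       links=[(15_000, 3_000), (16_000, 4_200)],
                       lib_mean=10_000, lib_std=800)
print(gap, round(sd, 1))
```

Two corroborating pairs already reduce the uncertainty below that of a single pair, which is why groups of pairs are preferred.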


Anatomy of a WGS Assembly

[Figure labels: chromosome with STS markers; STS-mapped scaffolds; contigs; gaps (mean & std. dev. known); read pairs (mates); consensus; reads (of several haplotypes); SNPs; external “reads”]


Assembly gaps

sequencing gap - we know the order and orientation of the contigs and have at least one clone spanning the gap

physical gap - no information known about the adjacent contigs, nor about the DNA spanning the gap

Sequencing gaps

Physical gaps


Shotgun sequencing statistics


Typical contig coverage

[Figure: reads stacked along a contig, with per-position coverage ranging from 1 to 6]

Imagine raindrops on a sidewalk


Lander-Waterman statistics

L = read length
T = minimum detectable overlap
G = genome size
N = number of reads
c = coverage (NL / G)
σ = 1 – T/L

$E(\#\text{islands}) = N e^{-c\sigma}$

$E(\text{island size}) = L\left(\dfrac{e^{c\sigma} - 1}{c} + 1 - \sigma\right)$

contig = island with 2 or more reads


Example

Genome size: 1 Mbp; read length: 600; detectable overlap: 40

| c | N | #islands | #contigs | bases not in any read | bases not in contigs |
|---|---|---|---|---|---|
| 1 | 1,667 | 655 | 614 | 698 | 367,806 |
| 3 | 5,000 | 304 | 250 | 121 | 49,787 |
| 5 | 8,334 | 78 | 57 | 20 | 6,735 |
| 8 | 13,334 | 7 | 5 | 1 | 335 |
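The Lander-Waterman expectations are easy to evaluate directly. The following sketch implements the two formulas from the previous slide (the function name `lander_waterman` is just a label chosen here) and evaluates them for the parameters of this example; truncating the expected number of islands reproduces the #islands column.

```python
import math

def lander_waterman(G, L, T, N):
    """Return (coverage, expected #islands, expected island size)."""
    c = N * L / G                 # coverage
    sigma = 1 - T / L             # fraction of a read not needed for overlap detection
    islands = N * math.exp(-c * sigma)
    island_size = L * ((math.exp(c * sigma) - 1) / c + 1 - sigma)
    return c, islands, island_size

for n_reads in (1_667, 5_000, 8_334, 13_334):
    c, islands, size = lander_waterman(G=1_000_000, L=600, T=40, N=n_reads)
    print(f"c={c:.1f}  N={n_reads:>6}  "
          f"E(#islands)={islands:7.1f}  E(island size)={size:10.0f}")
```

The expected island size grows roughly exponentially with coverage, which is why a few extra X of coverage closes most gaps.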


Experimental data

| X coverage | # ctgs | % > 2X | avg ctg size (L-W) | max ctg size | # ORFs |
|---|---|---|---|---|---|
| 1 | 284 | 54 | 1,234 (1,138) | 3,337 | 526 |
| 3 | 597 | 67 | 1,794 (4,429) | 9,589 | 1,092 |
| 5 | 548 | 79 | 2,495 (21,791) | 17,977 | 1,398 |
| 8 | 495 | 85 | 3,294 (302,545) | 64,307 | 1,762 |
| complete | 1 | 100 | 1.26 M | 1.26 M | 1,329 |

Caveat: numbers are based on artificially chopping up the genome of Wolbachia pipientis dMel


Assembly paradigms

• Overlap-layout-consensus
  – greedy (TIGR Assembler, phrap, CAP3...)
  – graph-based (Celera Assembler, Arachne)

• Eulerian path (especially useful for short read sequencing)


TIGR Assembler/phrap

Greedy

• Build a rough map of fragment overlaps

• Pick the highest-scoring overlap

• Merge the two fragments

• Repeat until no more merges can be done
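The greedy loop above can be sketched in a few lines. This is a toy version under strong assumptions: overlaps must be exact suffix-prefix matches, the "score" is simply the overlap length, and reverse complements are ignored. TIGR Assembler and phrap score inexact overlaps, so this is not their actual method; `greedy_assemble` and `min_len` are names made up for the sketch.

```python
def suffix_prefix_overlap(a, b, min_len):
    """Length of the longest suffix of a equal to a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the two fragments with the longest exact overlap."""
    frags = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = suffix_prefix_overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            return frags  # no more merges can be done
        merged = frags[i] + frags[j][k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (i, j)] + [merged]

print(greedy_assemble(["ACCTGAAG", "GAAGCTGA", "CTGAACCA"]))
# ['ACCTGAAGCTGAACCA']
```

Because each merge is chosen locally, a greedy assembler can join reads across a repeat incorrectly; that weakness motivates the graph-based approaches listed earlier.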


Overlap-layout-consensus

Main entity: read
Relationship between reads: overlap

[Figure: overlap graph on reads 1-9, the resulting layout of the reads, and a consensus example with the sequences ACCTGA, ACCTGA, AGCTGA, ACCAGA]
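The consensus step can be illustrated with a per-column majority vote. Reading the sequence string on this slide as four aligned 6-mers is an assumption about the original figure, and real consensus callers also handle gaps and quality values; the sketch only shows the basic idea.

```python
from collections import Counter

def consensus(aligned_reads):
    """Majority vote per column over equal-length, already aligned reads."""
    if len({len(r) for r in aligned_reads}) != 1:
        raise ValueError("reads must be aligned to the same length")
    return "".join(Counter(col).most_common(1)[0][0]
                   for col in zip(*aligned_reads))

# Assumed split of the slide's sequence string into four aligned rows.
rows = ["ACCTGA", "ACCTGA", "AGCTGA", "ACCAGA"]
print(consensus(rows))  # ACCTGA
```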


Paths through graphs and assembly

• Hamiltonian circuit: visit each node (city) exactly once, returning to the start

[Figure: a graph on nodes A-I, shown next to the same nodes laid out along the genome]


Implementation details


Overlap between two sequences

…AGCCTAGACCTACAGGATGCGCGGACACGTAGCCAGGAC
CAGTACTTGGATGCGCTGACACGTAGCTTATCCGGT…

overlap (19 bases), overhang (6 bases)

overlap - region of similarity between sequences
overhang - un-aligned ends of the sequences

The assembler screens merges based on:
• length of overlap
• % identity in overlap region
• maximum overhang size.

% identity = 18/19 = 94.7%
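The three screening quantities can be computed directly from the two sequences on this slide. The sketch assumes an ungapped overlap whose start positions are already known (found by inspection here); the function names and the thresholds in `passes_screen` are hypothetical and chosen only so that the example passes.

```python
def overlap_stats(a, b, a_start, b_start, length):
    """Ungapped comparison of a[a_start:a_start+length] with b[b_start:b_start+length]."""
    seg_a = a[a_start:a_start + length]
    seg_b = b[b_start:b_start + length]
    matches = sum(x == y for x, y in zip(seg_a, seg_b))
    return {
        "length": length,
        "identity": matches / length,
        # unaligned tail of a past the end of the overlap
        "overhang": len(a) - (a_start + length),
    }

def passes_screen(stats, min_len=19, min_identity=0.94, max_overhang=6):
    """Hypothetical thresholds; a real assembler exposes these as parameters."""
    return (stats["length"] >= min_len
            and stats["identity"] >= min_identity
            and stats["overhang"] <= max_overhang)

a = "AGCCTAGACCTACAGGATGCGCGGACACGTAGCCAGGAC"
b = "CAGTACTTGGATGCGCTGACACGTAGCTTATCCGGT"
stats = overlap_stats(a, b, a_start=14, b_start=8, length=19)
print(f"{stats['identity']:.1%}", stats["overhang"], passes_screen(stats))
```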


All pairs alignment

• Needed by the assembler
• Try all pairs – must consider ~ n² pairs
• Smarter solution: only n x coverage (e.g. 8) pairs are possible
  – Build a table of k-mers contained in sequences (single pass through the genome)
  – Generate the pairs from k-mer table (single pass through k-mer table)

[Figure: a k-mer shared between reads, with reads labeled A-I]
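A minimal sketch of the k-mer table idea follows: one pass over the reads builds the table, and one pass over the table emits candidate pairs. It ignores reverse complements and does not filter very frequent k-mers, both of which a practical overlapper would need; the name `candidate_pairs` is made up for this example.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(reads, k):
    """Return pairs of read indices that share at least one k-mer."""
    table = defaultdict(set)            # k-mer -> indices of reads containing it
    for idx, read in enumerate(reads):  # single pass through the reads
        for i in range(len(read) - k + 1):
            table[read[i:i + k]].add(idx)
    pairs = set()
    for ids in table.values():          # single pass through the k-mer table
        pairs.update(combinations(sorted(ids), 2))
    return pairs

reads = ["ACCTGAAG", "GAAGCTGA", "CTGAACCA", "TTTTTTTT"]
print(sorted(candidate_pairs(reads, k=4)))
# [(0, 1), (0, 2), (1, 2)]
```

Only the candidate pairs then need a full alignment, so the read with no shared k-mer is never compared against the others.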


Assembly Pipeline

Trim & Screen → Overlapper → Unitiger → Scaffolder → Repeat Rez I, II

Overlapper: find all overlaps ≥ 40bp allowing 6% mismatch.

[Figure: an overlap between fragments A and B implies either a TRUE overlap OR a REPEAT-INDUCED overlap]


Assembly Pipeline

Trim & Screen → Overlapper → Unitiger → Scaffolder → Repeat Rez I, II

Unitiger: compute all overlap-consistent sub-assemblies: unitigs (Uniquely Assembled Contigs)


OVERLAP GRAPH

Edge types (between fragments A and B):
• Regular dovetail
• Prefix dovetail
• Suffix dovetail

[Figure: example graph; edges are annotated with deltas of overlaps]


The Unitig Reduction

1. Remove “Transitively Inferrable” Overlaps:

[Figure: fragments A, B and C shown before and after the reduction]
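One way to picture this step: if A overlaps B, B overlaps C, and A also overlaps C, the A-C edge carries no extra information and can be dropped. The sketch below works on a plain directed graph and checks only edge existence; it is a simplification, since a real unitigger would also verify that the overlap lengths are consistent before removing an edge.

```python
def remove_transitive_edges(graph):
    """graph: dict node -> set of successor nodes. Returns a reduced copy."""
    reduced = {node: set(succs) for node, succs in graph.items()}
    for a, succs in graph.items():
        for b in succs:
            for c in graph.get(b, ()):
                # a -> c can be inferred from a -> b and b -> c
                if c in succs and c != b:
                    reduced[a].discard(c)
    return reduced

overlaps = {"A": {"B", "C"}, "B": {"C"}, "C": set()}
print(remove_transitive_edges(overlaps))
```

After the reduction only the A-B and B-C edges remain in this example.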


The Unitig Reduction

2. Collapse “Unique Connector” Overlaps:

[Figure: fragments A and B joined by a unique connector overlap are collapsed into one node; figure labels: 412, 352, 45]
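A sketch of the collapsing step, under the assumption that a "unique connector" is an edge a → b where a has no other successor and b has no other predecessor. Chains of such edges are merged into a single list of fragments; the function name and graph encoding are illustrative, not the Celera Assembler's data structures.

```python
def collapse_unique_connectors(graph):
    """graph: dict node -> set of successors. Returns a list of chains (lists of nodes)."""
    preds = {n: set() for n in graph}
    for a, succs in graph.items():
        for b in succs:
            preds.setdefault(b, set()).add(a)

    def unique_next(a):
        succs = graph.get(a, set())
        if len(succs) == 1:
            (b,) = succs
            if len(preds[b]) == 1:
                return b
        return None

    def unique_prev(b):
        if len(preds[b]) == 1:
            (a,) = preds[b]
            if len(graph.get(a, set())) == 1:
                return a
        return None

    chains, seen = [], set()
    heads = [n for n in preds if unique_prev(n) is None]
    for node in heads + list(preds):  # chain heads first, then leftover cycles
        if node in seen:
            continue
        chain = []
        while node is not None and node not in seen:
            chain.append(node)
            seen.add(node)
            node = unique_next(node)
        chains.append(chain)
    return chains

g = {"A": {"B"}, "B": {"C"}, "C": {"D", "E"}, "D": set(), "E": set()}
print(collapse_unique_connectors(g))
# [['A', 'B', 'C'], ['D'], ['E']]
```

The chain stops at C because C has two successors, which is the kind of branch a repeat boundary produces.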


Identifying Unique DNA Stretches

Discriminator Statistic is the log-odds ratio of the probability that a unitig is unique DNA versus 2-copy DNA.

[Figure: arrival intervals of reads in a unique DNA unitig and in a repetitive DNA unitig]

[Figure: distributions of the statistic for repetitive and for unique unitigs on an axis from -10 through 0 to +10; "Definitely Repetitive" toward -10, "Don’t Know" around 0, "Definitely Unique" toward +10]
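A log-odds statistic of this kind can be sketched under a Poisson model of read arrivals: a unique unitig of a given length expects a certain number of reads, and a collapsed 2-copy repeat expects twice as many. The slides do not give the formula, so the derivation in the docstring, the function names, and the example numbers are assumptions; the ±10 cutoffs are taken from the axis labels on the slide.

```python
import math

def discriminator(unitig_len, n_reads, total_reads, genome_size):
    """Log-odds (natural log) that a unitig is unique rather than 2-copy DNA.

    With lam = unitig_len * total_reads / genome_size expected reads for unique
    DNA and 2 * lam for 2-copy DNA, the Poisson likelihood ratio simplifies to
    lam - n_reads * ln(2).
    """
    lam = unitig_len * total_reads / genome_size
    return lam - n_reads * math.log(2)

def classify(score, cutoff=10):
    if score >= cutoff:
        return "definitely unique"
    if score <= -cutoff:
        return "definitely repetitive"
    return "don't know"

# Hypothetical 10 Kbp unitig in a 1 Mbp genome sequenced with 13,334 reads.
for k in (133, 190, 266):
    score = discriminator(10_000, k, 13_334, 1_000_000)
    print(k, round(score, 2), classify(score))
```

A unitig holding about the expected number of reads scores well above +10, while one holding twice as many scores well below -10.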


Assembly Pipeline

Trim & Screen → Overlapper → Unitiger → Scaffolder → Repeat Rez I, II

Scaffolder: scaffold U-unitigs with confirmed pairs (mated reads)


Assembly Pipeline

Trim & Screen → Overlapper → Unitiger → Scaffolder → Repeat Rez I, II

Repeat Rez I, II: fill repeat gaps with doubly anchored positive unitigs (Unitig>0)


REPEATS


Handling repeats

1. Repeat detection
  – pre-assembly: find fragments that belong to repeats
    • statistically (most existing assemblers)
    • repeat database (RepeatMasker)
  – during assembly: detect "tangles" indicative of repeats (Pevzner, Tang, Waterman 2001)
  – post-assembly: find repetitive regions and potential mis-assemblies
    • Reputer, RepeatMasker
    • "unhappy" mate-pairs (too close, too far, mis-oriented)

2. Repeat resolution
  – find DNA fragments belonging to the repeat
  – determine correct tiling across the repeat
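The "unhappy" mate-pair check mentioned under post-assembly detection can be sketched as a simple classifier. The coordinate convention (outer end of each read, '+' read to the left of its '-' mate), the three-standard-deviation window, and the function name are assumptions made for this example.

```python
def classify_mate_pair(left, right, lib_mean, lib_std, n_std=3):
    """left/right: (position, strand) of the two mates, left having the smaller position.

    position is the outer end of each read on the assembled sequence and
    strand is '+' or '-'. Expected layout: left on '+', right on '-'.
    """
    (lpos, lstrand), (rpos, rstrand) = left, right
    if (lstrand, rstrand) != ("+", "-"):
        return "mis-oriented"
    distance = rpos - lpos
    if distance < lib_mean - n_std * lib_std:
        return "too close"
    if distance > lib_mean + n_std * lib_std:
        return "too far"
    return "happy"

# Hypothetical 10 Kbp library with 8% std. dev.
lib = dict(lib_mean=10_000, lib_std=800)
print(classify_mate_pair((1_000, "+"), (11_200, "-"), **lib))  # happy
print(classify_mate_pair((1_000, "+"), (4_000, "-"), **lib))   # too close
print(classify_mate_pair((1_000, "+"), (20_000, "-"), **lib))  # too far
print(classify_mate_pair((1_000, "+"), (11_000, "+"), **lib))  # mis-oriented
```

Clusters of pairs that are too close suggest a collapsed repeat, while mis-oriented pairs point to a rearrangement.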


Statistical repeat detection

Significant deviations from average coverage flagged as repeats.
- frequent k-mers are ignored
- “arrival” rate of reads in contigs compared with theoretical value (e.g., 800 bp reads & 8x coverage - reads "arrive" every 100 bp)

Problem 1: assumption of uniform distribution of fragments - leads to false positives
- non-random libraries
- poor clonability regions

Problem 2: repeats with low copy number are missed - leads to false negatives
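The arrival-rate arithmetic in the example is just read length divided by coverage. The sketch below computes that spacing and a simple observed-to-expected ratio for a contig; the ratio and the names are illustrative simplifications and not the statistic any particular assembler uses.

```python
def expected_spacing(read_len, coverage):
    """Average distance between consecutive read starts: L / c."""
    return read_len / coverage

def coverage_ratio(contig_len, n_reads, read_len, coverage):
    """Observed number of reads divided by the number expected for unique sequence."""
    expected_reads = contig_len / expected_spacing(read_len, coverage)
    return n_reads / expected_reads

print(expected_spacing(800, 8))            # 100.0
print(coverage_ratio(5_000, 100, 800, 8))  # 2.0, twice the expected read density
```

A ratio near 2 is what a collapsed two-copy repeat would produce, but as Problem 1 notes, uneven libraries can produce the same signal in unique sequence.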


Mis-assembled repeats

[Figure: three kinds of mis-assembly caused by repeat copies (I, II, III, IV) between unique segments (a-f): collapsed tandem, excision, and rearrangement]