1. 2 Genome sequence assembly Assembly concepts and methods (some slides courtesy of Mihai Pop)

1

2

Genome sequence assemblyAssembly concepts and methods

(some slides courtesy of Mihai Pop)

3

Building a library

• Break DNA into random fragments (8-10x coverage)

Actual situation

4

Building a library

• Break DNA into random fragments (8-10x coverage)• Sequence the ends of the fragments

– Amplify the fragments in a vector– Sequence 800-1000 (500-700) bases at each end of the fragment

5

Assembling the fragments

6

Forward-reverse constraints• The sequenced ends are facing towards each other • The distance between the two fragments is known

(within certain experimental error)

Clone

Insert

F R

FR

I II

R

I

F

II

F

II

R

I

7

Building Scaffolds

• Break DNA into random fragments (8-10x coverage)

• Sequence the ends of the fragments

• Assemble the sequenced ends

• Build scaffolds

8

Assembly gaps

sequencing gap - we know the order and orientation of the contigs and have at least one clone spanning the gap

physical gap - no information known about the adjacent contigs, nor about the DNA spanning the gap

Sequencing gaps

Physical gaps

9

Unifying view of assembly

Assembly

Scaffolding

10

Shotgun sequencing statistics

11

Typical contig coverage

1 2 3 4 5 6 Coverage

Contig

Reads

Imagine raindrops on a sidewalk

12

Lander-Waterman statistics

L = read lengthT = minimum detectable overlapG = genome sizeN = number of readsc = coverage (NL / G)σ = 1 – T/L

E(#islands) = Ne-cσ E(island size) = L((ecσ – 1) / c + 1 – σ)contig = island with 2 or more reads

13

Example

c N #islands #contigs bases not in any read

bases not in contigs

1 1,667 655 614 698 367,806

3 5,000 304 250 121 49,787

5 8,334 78 57 20 6,735

8 13,334 7 5 1 335

Genome size: 1 Mbp Read Length: 600 Detectable overlap: 40

14

Experimental data

X coverage

# ctgs % > 2X avg ctg size (L-W) max ctg size # ORFs

1 284 54 1,234 (1,138) 3,337 526

3 597 67 1,794 (4,429) 9,589 1,092

5 548 79 2,495 (21,791) 17,977 1,398

8 495 85 3,294 (302,545) 64,307 1,762

complete 1 100 1.26 M 1.26 M 1,329

Caveat: numbers based on artificially chopping upthe genome of Wolbachia pipientis dMel

15

Read coverage vs. Clone coverage

4 kbp

1 kbp

Read coverage = 8X

Clone (insert) coverage = 16

2X coverage in BAC-ends implies 100x coverage by BACs

(1 BAC clone = approx. 100kbp)

16

Assembly paradigms

• Overlap-layout-consensus– greedy (TIGR Assembler, phrap, CAP3...)– graph-based (Celera Assembler, Arachne)

• Eulerian path (especially useful for short read sequencing)

17

TIGR Assembler/phrap

Greedy

• Build a rough map of fragment overlaps

• Pick the largest scoring overlap

• Merge the two fragments

• Repeat until no more merges can be done

18

Overlap-layout-consensusMain entity: readRelationship between reads: overlap

12

3

45

6

78

9

1 2 3 4 5 6 7 8 9

1 2 3

1 2 3

1 2 3 12

3

1 3

2

13

2

ACCTGAACCTGAAGCTGAACCAGA

19

Paths through graphs and assembly

• Hamiltonian circuit: visit each node (city) exactly once, returning to the start

A

B D C

E

H G

I

F

A

B

C

D H

I

F

G

E

Genome

20

Implementation details

21

Overlap between two sequences

…AGCCTAGACCTACAGGATGCGCGGACACGTAGCCAGGAC CAGTACTTGGATGCGCTGACACGTAGCTTATCCGGT…

overlap (19 bases) overhang (6 bases)

overhangoverlap - region of similarity between regionsoverhang - un-aligned ends of the sequences

The assembler screens merges based on: • length of overlap• % identity in overlap region• maximum overhang size.

% identity = 18/19 % = 94.7%

22

All pairs alignment• Needed by the assembler• Try all pairs – must consider ~ n2 pairs• Smarter solution: only n x coverage (e.g. 8) pairs

are possible– Build a table of k-mers contained in sequences (single

pass through the genome)– Generate the pairs from k-mer table (single pass

through k-mer table)

k-mer

A

B

C

D H

I

F

G

E

23

24

REPEATS

25

1

2

3

4

5

6

7

8

9

10

11

12

13

RptA RptB

1

2

3

4

5 7

10

11

12

13

86

9

26

Non-repetitive overlap graph

1

2

3

4

5,9 7

86,10

11

12

13

27

Handling repeats1. Repeat detection

– pre-assembly: find fragments that belong to repeats• statistically (most existing assemblers)• repeat database (RepeatMasker)

– during assembly: detect "tangles" indicative of repeats (Pevzner, Tang, Waterman 2001)

– post-assembly: find repetitive regions and potential mis-assemblies. • Reputer, RepeatMasker• "unhappy" mate-pairs (too close, too far, mis-oriented)

2. Repeat resolution– find DNA fragments belonging to the repeat– determine correct tiling across the repeat

28

Statistical repeat detectionSignificant deviations from average coverage flagged as repeats.

- frequent k-mers are ignored- “arrival” rate of reads in contigs compared with theoretical value

(e.g., 800 bp reads & 8x coverage - reads "arrive" every 100 bp)

Problem 1: assumption of uniform distribution of fragments - leads to false positives

non-random librariespoor clonability regions

Problem 2: repeats with low copy number are missed - leads to false negatives

29

Mis-assembled repeats

a b c

a c

b

a b c d

I II III

I

II

III

a

b c

d

b c

a b d c e f

I II III IV

I III II IV

a d b e c f

a

collapsed tandem excision

rearrangement

“chimeras” “mates”

How do you align the green pieces?

ribosomal RNA repeats, Ames Porton strain

An assembly puzzle: contradictory data

(discovered after publication)

Puzzle solution

Tandem duplication

Reference: Ames ‘ancestor’ strain

Ames Porton Down strain

PortonStrain

Ft Detrick

AttackStrain

Victim

Plasmids cured

Porton Down

Lab C Lab DLab B

? ?

?

Floridaisolate

Porton1

Porton2

?

…

“Ames”isolate

1981

1982

1998 20012001

UC Berkeley

Anthrax attack strain history

(Ames ancestor)

Cut of assembly 67452 from GBX0130.contig (11 bases)Cut at ungapped consensus offset 15986 (from 1), +/- 5 positions:Cut at gapped consensus offset 16009 (from 1) 11 positions

Ungapped consensus TGAATGCACACGapped consensus TGAATGCACAC T G A A T G C A C A C

Covering reads: GBZEI27TF TGAATGCACAC 26 30 34 36 33 36 36 37 36 36 36 GBXEZ08TR TGAATGCACAC 27 30 33 35 41 37 36 23 36 36 36 GBZDA09TF TGAATGCACAC 26 18 35 31 26 20 29 19 36 36 36

Summary info:P-value (10^q) -7.9 -7.8 -10.2 -10.2 -10.0 -9.3 -10.1 -7.9 -10.8 -10.8 -10.8 Quality Class 5 3 3 3 3 3 3 3 3 3 3 Coverage depth 3 3 3 3 3 3 3 3 3 3 3Homogeneity 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0

Cut of assembly 6264 from GBA0117.contig (11 bases)Cut at ungapped consensus offset 725687 (from 1), +/- 5 positions:Cut at gapped consensus offset 730468 (from 1) 11 positions

Ungapped consensus TGAATACACACGapped consensus TGAATACACAC T G A A T A C A C A C

Covering reads: GBIFW80TF TGAATA-ACAC 34 33 15 16 13 09 00 11 11 11 15 GBICA33TR TGAATACACAC 34 35 34 35 35 34 34 36 36 36 36 GBIFQ32TR TGAATACACAC 10 13 12 18 24 13 21 12 13 14 13 GBICH40TF TGAATACACAC 36 36 36 32 32 32 32 32 31 31 36 GBICU19TR TGAATACACAC 21 30 33 18 19 24 21 11 36 10 29

Summary info:P-value (10^q) -42.8 -45.4 -45.5 -45.3 -46.7 -44.4 -46.7 -43.9 -46.8 -45.1 -42.7 Quality Class 1 1 1 1 1 1 1 1 1 1 1 Coverage depth 16 16 16 16 16 16 15 16 16 16 16Homogeneity 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0

GBX(g)

GBA(a)

Not shown

Probability of base-callingerror

b. anthracis SNP CS-1

Documents

1. 2 Genome sequence assembly Assembly concepts and methods (some slides courtesy of Mihai Pop)