33
1

1. 2 Genome sequence assembly Assembly concepts and methods (some slides courtesy of Mihai Pop)

  • View
    229

  • Download
    0

Embed Size (px)

Citation preview

Page 1: 1. 2 Genome sequence assembly Assembly concepts and methods (some slides courtesy of Mihai Pop)

1

Page 2: 1. 2 Genome sequence assembly Assembly concepts and methods (some slides courtesy of Mihai Pop)

2

Genome sequence assemblyAssembly concepts and methods

(some slides courtesy of Mihai Pop)

Page 3: 1. 2 Genome sequence assembly Assembly concepts and methods (some slides courtesy of Mihai Pop)

3

Building a library

• Break DNA into random fragments (8-10x coverage)

Actual situation

Page 4: 1. 2 Genome sequence assembly Assembly concepts and methods (some slides courtesy of Mihai Pop)

4

Building a library

• Break DNA into random fragments (8-10x coverage)• Sequence the ends of the fragments

– Amplify the fragments in a vector– Sequence 800-1000 (500-700) bases at each end of the fragment

Page 5: 1. 2 Genome sequence assembly Assembly concepts and methods (some slides courtesy of Mihai Pop)

5

Assembling the fragments

Page 6: 1. 2 Genome sequence assembly Assembly concepts and methods (some slides courtesy of Mihai Pop)

6

Forward-reverse constraints• The sequenced ends are facing towards each other • The distance between the two fragments is known

(within certain experimental error)

Clone

Insert

F R

FR

I II

R

I

F

II

F

II

R

I

Page 7: 1. 2 Genome sequence assembly Assembly concepts and methods (some slides courtesy of Mihai Pop)

7

Building Scaffolds

• Break DNA into random fragments (8-10x coverage)

• Sequence the ends of the fragments

• Assemble the sequenced ends

• Build scaffolds

Page 8: 1. 2 Genome sequence assembly Assembly concepts and methods (some slides courtesy of Mihai Pop)

8

Assembly gaps

sequencing gap - we know the order and orientation of the contigs and have at least one clone spanning the gap

physical gap - no information known about the adjacent contigs, nor about the DNA spanning the gap

Sequencing gaps

Physical gaps

Page 9: 1. 2 Genome sequence assembly Assembly concepts and methods (some slides courtesy of Mihai Pop)

9

Unifying view of assembly

Assembly

Scaffolding

Page 10: 1. 2 Genome sequence assembly Assembly concepts and methods (some slides courtesy of Mihai Pop)

10

Shotgun sequencing statistics

Page 11: 1. 2 Genome sequence assembly Assembly concepts and methods (some slides courtesy of Mihai Pop)

11

Typical contig coverage

1 2 3 4 5 6 Coverage

Contig

Reads

Imagine raindrops on a sidewalk

Page 12: 1. 2 Genome sequence assembly Assembly concepts and methods (some slides courtesy of Mihai Pop)

12

Lander-Waterman statistics

L = read lengthT = minimum detectable overlapG = genome sizeN = number of readsc = coverage (NL / G)σ = 1 – T/L

E(#islands) = Ne-cσ E(island size) = L((ecσ – 1) / c + 1 – σ)contig = island with 2 or more reads

Page 13: 1. 2 Genome sequence assembly Assembly concepts and methods (some slides courtesy of Mihai Pop)

13

Example

c N #islands #contigs bases not in any read

bases not in contigs

1 1,667 655 614 698 367,806

3 5,000 304 250 121 49,787

5 8,334 78 57 20 6,735

8 13,334 7 5 1 335

Genome size: 1 Mbp Read Length: 600 Detectable overlap: 40

Page 14: 1. 2 Genome sequence assembly Assembly concepts and methods (some slides courtesy of Mihai Pop)

14

Experimental data

X coverage

# ctgs % > 2X avg ctg size (L-W) max ctg size # ORFs

1 284 54 1,234 (1,138) 3,337 526

3 597 67 1,794 (4,429) 9,589 1,092

5 548 79 2,495 (21,791) 17,977 1,398

8 495 85 3,294 (302,545) 64,307 1,762

complete 1 100 1.26 M 1.26 M 1,329

Caveat: numbers based on artificially chopping upthe genome of Wolbachia pipientis dMel

Page 15: 1. 2 Genome sequence assembly Assembly concepts and methods (some slides courtesy of Mihai Pop)

15

Read coverage vs. Clone coverage

4 kbp

1 kbp

Read coverage = 8X

Clone (insert) coverage = 16

2X coverage in BAC-ends implies 100x coverage by BACs

(1 BAC clone = approx. 100kbp)

Page 16: 1. 2 Genome sequence assembly Assembly concepts and methods (some slides courtesy of Mihai Pop)

16

Assembly paradigms

• Overlap-layout-consensus– greedy (TIGR Assembler, phrap, CAP3...)– graph-based (Celera Assembler, Arachne)

• Eulerian path (especially useful for short read sequencing)

Page 17: 1. 2 Genome sequence assembly Assembly concepts and methods (some slides courtesy of Mihai Pop)

17

TIGR Assembler/phrap

Greedy

• Build a rough map of fragment overlaps

• Pick the largest scoring overlap

• Merge the two fragments

• Repeat until no more merges can be done

Page 18: 1. 2 Genome sequence assembly Assembly concepts and methods (some slides courtesy of Mihai Pop)

18

Overlap-layout-consensusMain entity: readRelationship between reads: overlap

12

3

45

6

78

9

1 2 3 4 5 6 7 8 9

1 2 3

1 2 3

1 2 3 12

3

1 3

2

13

2

ACCTGAACCTGAAGCTGAACCAGA

Page 19: 1. 2 Genome sequence assembly Assembly concepts and methods (some slides courtesy of Mihai Pop)

19

Paths through graphs and assembly

• Hamiltonian circuit: visit each node (city) exactly once, returning to the start

A

B D C

E

H G

I

F

A

B

C

D H

I

F

G

E

Genome

Page 20: 1. 2 Genome sequence assembly Assembly concepts and methods (some slides courtesy of Mihai Pop)

20

Implementation details

Page 21: 1. 2 Genome sequence assembly Assembly concepts and methods (some slides courtesy of Mihai Pop)

21

Overlap between two sequences

…AGCCTAGACCTACAGGATGCGCGGACACGTAGCCAGGAC CAGTACTTGGATGCGCTGACACGTAGCTTATCCGGT…

overlap (19 bases) overhang (6 bases)

overhangoverlap - region of similarity between regionsoverhang - un-aligned ends of the sequences

The assembler screens merges based on: • length of overlap• % identity in overlap region• maximum overhang size.

% identity = 18/19 % = 94.7%

Page 22: 1. 2 Genome sequence assembly Assembly concepts and methods (some slides courtesy of Mihai Pop)

22

All pairs alignment• Needed by the assembler• Try all pairs – must consider ~ n2 pairs• Smarter solution: only n x coverage (e.g. 8) pairs

are possible– Build a table of k-mers contained in sequences (single

pass through the genome)– Generate the pairs from k-mer table (single pass

through k-mer table)

k-mer

A

B

C

D H

I

F

G

E

Page 23: 1. 2 Genome sequence assembly Assembly concepts and methods (some slides courtesy of Mihai Pop)

23

Page 24: 1. 2 Genome sequence assembly Assembly concepts and methods (some slides courtesy of Mihai Pop)

24

REPEATS

Page 25: 1. 2 Genome sequence assembly Assembly concepts and methods (some slides courtesy of Mihai Pop)

25

1

2

3

4

5

6

7

8

9

10

11

12

13

RptA RptB

1

2

3

4

5 7

10

11

12

13

86

9

Page 26: 1. 2 Genome sequence assembly Assembly concepts and methods (some slides courtesy of Mihai Pop)

26

Non-repetitive overlap graph

1

2

3

4

5,9 7

86,10

11

12

13

Page 27: 1. 2 Genome sequence assembly Assembly concepts and methods (some slides courtesy of Mihai Pop)

27

Handling repeats1. Repeat detection

– pre-assembly: find fragments that belong to repeats• statistically (most existing assemblers)• repeat database (RepeatMasker)

– during assembly: detect "tangles" indicative of repeats (Pevzner, Tang, Waterman 2001)

– post-assembly: find repetitive regions and potential mis-assemblies. • Reputer, RepeatMasker• "unhappy" mate-pairs (too close, too far, mis-oriented)

2. Repeat resolution– find DNA fragments belonging to the repeat– determine correct tiling across the repeat

Page 28: 1. 2 Genome sequence assembly Assembly concepts and methods (some slides courtesy of Mihai Pop)

28

Statistical repeat detectionSignificant deviations from average coverage flagged as repeats.

- frequent k-mers are ignored- “arrival” rate of reads in contigs compared with theoretical value

(e.g., 800 bp reads & 8x coverage - reads "arrive" every 100 bp)

Problem 1: assumption of uniform distribution of fragments - leads to false positives

non-random librariespoor clonability regions

Problem 2: repeats with low copy number are missed - leads to false negatives

Page 29: 1. 2 Genome sequence assembly Assembly concepts and methods (some slides courtesy of Mihai Pop)

29

Mis-assembled repeats

a b c

a c

b

a b c d

I II III

I

II

III

a

b c

d

b c

a b d c e f

I II III IV

I III II IV

a d b e c f

a

collapsed tandem excision

rearrangement

Page 30: 1. 2 Genome sequence assembly Assembly concepts and methods (some slides courtesy of Mihai Pop)

“chimeras” “mates”

How do you align the green pieces?

ribosomal RNA repeats, Ames Porton strain

An assembly puzzle: contradictory data

(discovered after publication)

Page 31: 1. 2 Genome sequence assembly Assembly concepts and methods (some slides courtesy of Mihai Pop)

Puzzle solution

Tandem duplication

Reference: Ames ‘ancestor’ strain

Ames Porton Down strain

Page 32: 1. 2 Genome sequence assembly Assembly concepts and methods (some slides courtesy of Mihai Pop)

PortonStrain

Ft Detrick

AttackStrain

Victim

Plasmids cured

Porton Down

Lab C Lab DLab B

? ?

?

Floridaisolate

Porton1

Porton2

?

“Ames”isolate

1981

1982

1998 20012001

UC Berkeley

Anthrax attack strain history

(Ames ancestor)

Page 33: 1. 2 Genome sequence assembly Assembly concepts and methods (some slides courtesy of Mihai Pop)

Cut of assembly 67452 from GBX0130.contig (11 bases)Cut at ungapped consensus offset 15986 (from 1), +/- 5 positions:Cut at gapped consensus offset 16009 (from 1) 11 positions

Ungapped consensus TGAATGCACACGapped consensus TGAATGCACAC T G A A T G C A C A C

Covering reads: GBZEI27TF TGAATGCACAC 26 30 34 36 33 36 36 37 36 36 36 GBXEZ08TR TGAATGCACAC 27 30 33 35 41 37 36 23 36 36 36 GBZDA09TF TGAATGCACAC 26 18 35 31 26 20 29 19 36 36 36

Summary info:P-value (10^q) -7.9 -7.8 -10.2 -10.2 -10.0 -9.3 -10.1 -7.9 -10.8 -10.8 -10.8 Quality Class 5 3 3 3 3 3 3 3 3 3 3 Coverage depth 3 3 3 3 3 3 3 3 3 3 3Homogeneity 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0

Cut of assembly 6264 from GBA0117.contig (11 bases)Cut at ungapped consensus offset 725687 (from 1), +/- 5 positions:Cut at gapped consensus offset 730468 (from 1) 11 positions

Ungapped consensus TGAATACACACGapped consensus TGAATACACAC T G A A T A C A C A C

Covering reads: GBIFW80TF TGAATA-ACAC 34 33 15 16 13 09 00 11 11 11 15 GBICA33TR TGAATACACAC 34 35 34 35 35 34 34 36 36 36 36 GBIFQ32TR TGAATACACAC 10 13 12 18 24 13 21 12 13 14 13 GBICH40TF TGAATACACAC 36 36 36 32 32 32 32 32 31 31 36 GBICU19TR TGAATACACAC 21 30 33 18 19 24 21 11 36 10 29

Summary info:P-value (10^q) -42.8 -45.4 -45.5 -45.3 -46.7 -44.4 -46.7 -43.9 -46.8 -45.1 -42.7 Quality Class 1 1 1 1 1 1 1 1 1 1 1 Coverage depth 16 16 16 16 16 16 15 16 16 16 16Homogeneity 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0

GBX(g)

GBA(a)

Not shown

Probability of base-callingerror

b. anthracis SNP CS-1