View
229
Download
0
Tags:
Embed Size (px)
Citation preview
1
2
Genome sequence assemblyAssembly concepts and methods
(some slides courtesy of Mihai Pop)
3
Building a library
• Break DNA into random fragments (8-10x coverage)
Actual situation
4
Building a library
• Break DNA into random fragments (8-10x coverage)• Sequence the ends of the fragments
– Amplify the fragments in a vector– Sequence 800-1000 (500-700) bases at each end of the fragment
5
Assembling the fragments
6
Forward-reverse constraints• The sequenced ends are facing towards each other • The distance between the two fragments is known
(within certain experimental error)
Clone
Insert
F R
FR
I II
R
I
F
II
F
II
R
I
7
Building Scaffolds
• Break DNA into random fragments (8-10x coverage)
• Sequence the ends of the fragments
• Assemble the sequenced ends
• Build scaffolds
8
Assembly gaps
sequencing gap - we know the order and orientation of the contigs and have at least one clone spanning the gap
physical gap - no information known about the adjacent contigs, nor about the DNA spanning the gap
Sequencing gaps
Physical gaps
9
Unifying view of assembly
Assembly
Scaffolding
10
Shotgun sequencing statistics
11
Typical contig coverage
1 2 3 4 5 6 Coverage
Contig
Reads
Imagine raindrops on a sidewalk
12
Lander-Waterman statistics
L = read lengthT = minimum detectable overlapG = genome sizeN = number of readsc = coverage (NL / G)σ = 1 – T/L
E(#islands) = Ne-cσ E(island size) = L((ecσ – 1) / c + 1 – σ)contig = island with 2 or more reads
13
Example
c N #islands #contigs bases not in any read
bases not in contigs
1 1,667 655 614 698 367,806
3 5,000 304 250 121 49,787
5 8,334 78 57 20 6,735
8 13,334 7 5 1 335
Genome size: 1 Mbp Read Length: 600 Detectable overlap: 40
14
Experimental data
X coverage
# ctgs % > 2X avg ctg size (L-W) max ctg size # ORFs
1 284 54 1,234 (1,138) 3,337 526
3 597 67 1,794 (4,429) 9,589 1,092
5 548 79 2,495 (21,791) 17,977 1,398
8 495 85 3,294 (302,545) 64,307 1,762
complete 1 100 1.26 M 1.26 M 1,329
Caveat: numbers based on artificially chopping upthe genome of Wolbachia pipientis dMel
15
Read coverage vs. Clone coverage
4 kbp
1 kbp
Read coverage = 8X
Clone (insert) coverage = 16
2X coverage in BAC-ends implies 100x coverage by BACs
(1 BAC clone = approx. 100kbp)
16
Assembly paradigms
• Overlap-layout-consensus– greedy (TIGR Assembler, phrap, CAP3...)– graph-based (Celera Assembler, Arachne)
• Eulerian path (especially useful for short read sequencing)
17
TIGR Assembler/phrap
Greedy
• Build a rough map of fragment overlaps
• Pick the largest scoring overlap
• Merge the two fragments
• Repeat until no more merges can be done
18
Overlap-layout-consensusMain entity: readRelationship between reads: overlap
12
3
45
6
78
9
1 2 3 4 5 6 7 8 9
1 2 3
1 2 3
1 2 3 12
3
1 3
2
13
2
ACCTGAACCTGAAGCTGAACCAGA
19
Paths through graphs and assembly
• Hamiltonian circuit: visit each node (city) exactly once, returning to the start
A
B D C
E
H G
I
F
A
B
C
D H
I
F
G
E
Genome
20
Implementation details
21
Overlap between two sequences
…AGCCTAGACCTACAGGATGCGCGGACACGTAGCCAGGAC CAGTACTTGGATGCGCTGACACGTAGCTTATCCGGT…
overlap (19 bases) overhang (6 bases)
overhangoverlap - region of similarity between regionsoverhang - un-aligned ends of the sequences
The assembler screens merges based on: • length of overlap• % identity in overlap region• maximum overhang size.
% identity = 18/19 % = 94.7%
22
All pairs alignment• Needed by the assembler• Try all pairs – must consider ~ n2 pairs• Smarter solution: only n x coverage (e.g. 8) pairs
are possible– Build a table of k-mers contained in sequences (single
pass through the genome)– Generate the pairs from k-mer table (single pass
through k-mer table)
k-mer
A
B
C
D H
I
F
G
E
23
24
REPEATS
25
1
2
3
4
5
6
7
8
9
10
11
12
13
RptA RptB
1
2
3
4
5 7
10
11
12
13
86
9
26
Non-repetitive overlap graph
1
2
3
4
5,9 7
86,10
11
12
13
27
Handling repeats1. Repeat detection
– pre-assembly: find fragments that belong to repeats• statistically (most existing assemblers)• repeat database (RepeatMasker)
– during assembly: detect "tangles" indicative of repeats (Pevzner, Tang, Waterman 2001)
– post-assembly: find repetitive regions and potential mis-assemblies. • Reputer, RepeatMasker• "unhappy" mate-pairs (too close, too far, mis-oriented)
2. Repeat resolution– find DNA fragments belonging to the repeat– determine correct tiling across the repeat
28
Statistical repeat detectionSignificant deviations from average coverage flagged as repeats.
- frequent k-mers are ignored- “arrival” rate of reads in contigs compared with theoretical value
(e.g., 800 bp reads & 8x coverage - reads "arrive" every 100 bp)
Problem 1: assumption of uniform distribution of fragments - leads to false positives
non-random librariespoor clonability regions
Problem 2: repeats with low copy number are missed - leads to false negatives
29
Mis-assembled repeats
a b c
a c
b
a b c d
I II III
I
II
III
a
b c
d
b c
a b d c e f
I II III IV
I III II IV
a d b e c f
a
collapsed tandem excision
rearrangement
“chimeras” “mates”
How do you align the green pieces?
ribosomal RNA repeats, Ames Porton strain
An assembly puzzle: contradictory data
(discovered after publication)
Puzzle solution
Tandem duplication
Reference: Ames ‘ancestor’ strain
Ames Porton Down strain
PortonStrain
Ft Detrick
AttackStrain
Victim
Plasmids cured
Porton Down
Lab C Lab DLab B
? ?
?
Floridaisolate
Porton1
Porton2
?
…
“Ames”isolate
1981
1982
1998 20012001
UC Berkeley
Anthrax attack strain history
(Ames ancestor)
Cut of assembly 67452 from GBX0130.contig (11 bases)Cut at ungapped consensus offset 15986 (from 1), +/- 5 positions:Cut at gapped consensus offset 16009 (from 1) 11 positions
Ungapped consensus TGAATGCACACGapped consensus TGAATGCACAC T G A A T G C A C A C
Covering reads: GBZEI27TF TGAATGCACAC 26 30 34 36 33 36 36 37 36 36 36 GBXEZ08TR TGAATGCACAC 27 30 33 35 41 37 36 23 36 36 36 GBZDA09TF TGAATGCACAC 26 18 35 31 26 20 29 19 36 36 36
Summary info:P-value (10^q) -7.9 -7.8 -10.2 -10.2 -10.0 -9.3 -10.1 -7.9 -10.8 -10.8 -10.8 Quality Class 5 3 3 3 3 3 3 3 3 3 3 Coverage depth 3 3 3 3 3 3 3 3 3 3 3Homogeneity 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
Cut of assembly 6264 from GBA0117.contig (11 bases)Cut at ungapped consensus offset 725687 (from 1), +/- 5 positions:Cut at gapped consensus offset 730468 (from 1) 11 positions
Ungapped consensus TGAATACACACGapped consensus TGAATACACAC T G A A T A C A C A C
Covering reads: GBIFW80TF TGAATA-ACAC 34 33 15 16 13 09 00 11 11 11 15 GBICA33TR TGAATACACAC 34 35 34 35 35 34 34 36 36 36 36 GBIFQ32TR TGAATACACAC 10 13 12 18 24 13 21 12 13 14 13 GBICH40TF TGAATACACAC 36 36 36 32 32 32 32 32 31 31 36 GBICU19TR TGAATACACAC 21 30 33 18 19 24 21 11 36 10 29
Summary info:P-value (10^q) -42.8 -45.4 -45.5 -45.3 -46.7 -44.4 -46.7 -43.9 -46.8 -45.1 -42.7 Quality Class 1 1 1 1 1 1 1 1 1 1 1 Coverage depth 16 16 16 16 16 16 15 16 16 16 16Homogeneity 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
GBX(g)
GBA(a)
Not shown
Probability of base-callingerror
b. anthracis SNP CS-1