3. Lecture WS 2003/04
Bioinformatics III 1
Whole Genome Shotgun Assembly
Two strategies for sequencing:
clone-by-clone approach
whole-genome shotgun approach
(Celera, Gene Myers).
Shotgun sequencing was
introduced by F. Sanger et al.
(1977) and has remained the
mainstay of genome sequence
assembly for nearly 25 years now.
ED Green, Nat Rev Genet 2, 573 (2001)
Automatic sequencing
[Figure: shotgun sequencing workflow. Whole genome or BAC/cosmid clone -> DNA fragmentation (sonic disruption, nebulization) -> small fragments (1.0 - 2.0 kb) -> clone library (pUC18) -> DNA sequencing (random clones) -> partial assembly (contigs) -> finishing (quality, both strands coverage, gap filling) -> final consensus sequence]
Automated Sequencing
Nearly all automated sequencing is done using the enzymatic dideoxy
chain-termination method of Sanger (1977).
Separation of fragments by gel electrophoresis.
Readout of fragments labeled with fluorescent dyes.
Computer analysis of gel images:
- lane tracking – identify the gel lane boundaries
- lane profiling – sum each of the 4 signals across the lane width to create a profile
- trace processing – deconvolve and smooth the signal estimates and reduce noise
- base calling – translate the processed trace into a sequence of bases
The program Phred is the quasi-standard for the last step (base calling).
Base Calling - Phred
B. Ewing, L. Hillier, M.C. Wendl, P. Green. Base-calling of automated sequencer traces using Phred.
I. Accuracy assessment. Genome Res 8, 175-185 (1998).
B. Ewing, P. Green. Base-calling of automated sequencer traces using Phred.
II. Error probabilities. Genome Res 8, 186-194 (1998).
The processed traces are displayed as chromatograms of 4 curves of different color, each curve representing the signal of 1 of the 4 bases.
Base Calling - Phred
Idealized traces would consist of evenly spaced, nonoverlapping peaks.
Real traces deviate from this ideal due to imperfections of the sequencing
reactions, of gel electrophoresis, and of trace processing.
The first 50 or so peaks and peaks over 500 or so are particularly noisy.
Quality:
high – no ambiguities
medium – some ambiguities
poor – low confidence
Base Calling Algorithm
1 Locate Predicted Peaks
find the idealized locations of the base peaks using Fourier methods.
2 Locate Observed Peaks
scan the 4 trace arrays for concave regions, i.e. positions i satisfying
2 v(i) ≥ v(i+1) + v(i-1)
3 Match Observed and Predicted Peaks
a) find easy matches
b) use dynamic programming to align those peaks not matched in a)
c) match remaining observed peaks that seem to represent genuine bases
4 Find missed Peaks
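Step 2 can be illustrated with a minimal sketch that scans one trace array for maximal runs of positions satisfying the concavity condition. The function name and the toy trace are assumptions for illustration; this is not Phred's actual code, which also has to handle baseline noise (a flat baseline satisfies the condition trivially).

```python
def concave_regions(v):
    """Return (first, last) index pairs of maximal runs of interior
    positions i with 2*v[i] >= v[i+1] + v[i-1]."""
    regions = []
    start = None
    for i in range(1, len(v) - 1):
        if 2 * v[i] >= v[i + 1] + v[i - 1]:
            if start is None:
                start = i          # a new concave run begins here
        elif start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:          # run reaches the last interior point
        regions.append((start, len(v) - 2))
    return regions


# Toy trace for one of the four channels (hypothetical values).
trace = [0, 1, 4, 9, 10, 9, 4, 1, 0, 0, 2, 6, 7, 6, 2, 0]
print(concave_regions(trace))  # [(3, 5), (11, 13)]
```

Each reported region brackets one candidate observed peak, which would then be matched against the predicted peak locations of step 1.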
Phred quality values
q = -10 log10(p)
where
q - quality value
p - estimated error probability for a base call
Examples:
q = 20 means p = 10^-2 (1 error in 100 bases)
q = 40 means p = 10^-4 (1 error in 10,000 bases)
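The relation between quality value and error probability can be checked with a few lines of Python. The function names are chosen here for illustration only.

```python
import math


def phred_quality(p):
    """Quality value q for an estimated error probability p (0 < p <= 1)."""
    return -10.0 * math.log10(p)


def error_probability(q):
    """Inverse mapping: error probability p for a quality value q."""
    return 10.0 ** (-q / 10.0)


def expected_errors(qualities):
    """Expected number of miscalled bases in a read with these qualities."""
    return sum(error_probability(q) for q in qualities)


print(round(phred_quality(0.01)))   # 20
print(error_probability(40))        # approximately 1e-4
```

Summing the per-base error probabilities, as in `expected_errors`, gives a rough estimate of how many errors to expect in a whole read.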
Phred
Phred performs several tasks:
a. Reads trace files – compatible with most file formats: SCF (standard
chromatogram format), ABI (373/377/3700), ESD (MegaBACE) and LI-COR.
b. Calls bases – attributes a base for each identified peak with a lower error
rate than the standard base calling programs.
c. Assigns quality values to the bases – a “Phred value” based on an error
rate estimation calculated for each individual base.
d. Creates output files – base calls and quality values are written to output
files.
Whole genome assembly: problem description
The goal is to reconstruct an unknown source sequence (the genome) over the
alphabet {A, C, G, T}, given many random short segments of that sequence, the
shotgun reads.
A read is a subsequence of nucleotides of length around 500, taken from a
random place in the genome.
The orientation of the read is either forward or reverse complement.
Reads contain two kinds of errors: base substitutions and indels.
Base substitutions occur with a frequency of ca. 0.5 – 2%.
Indels occur roughly 10 times less frequently.
Reads can come from short plasmid inserts (2-12 kb), cosmids (40 kb)
or BACs (150 kb).
Batzoglou PhD thesis (2002)
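The read model described above can be sketched as a small simulation. This is an illustrative toy under stated assumptions: the function names are made up, only substitution errors are modelled (indels are omitted for brevity), and reads are drawn uniformly at random.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Reverse complement of a DNA string over {A, C, G, T}."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))


def sample_read(genome, read_len, sub_rate, rng):
    """Draw one read: random start, substitution errors at rate sub_rate,
    then forward or reverse-complement orientation with equal probability.
    Returns (read, start, is_forward)."""
    start = rng.randrange(len(genome) - read_len + 1)
    bases = list(genome[start:start + read_len])
    for i, base in enumerate(bases):
        if rng.random() < sub_rate:
            bases[i] = rng.choice([b for b in "ACGT" if b != base])
    read = "".join(bases)
    is_forward = rng.random() < 0.5
    return (read if is_forward else reverse_complement(read)), start, is_forward


rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(5000))
read, start, is_forward = sample_read(genome, 500, 0.01, rng)
print(len(read), start, is_forward)
```

The assembler sees only the read strings, not `start` or `is_forward`; recovering those is the assembly problem.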
Whole Genome Assemblers
TIGR Assembler G.G. Sutton et al., Genome Sci Technol 1, 9-19 (1995)
PHRAP P. Green (1996)
Celera Assembler
CAP3 X. Huang, A. Madan, Genome Res 9, 868-877 (1999)
RePS J. Wang et al. Genome Res 12, 824-831 (2002)
Phusion (Sanger) J.C. Mullikin, Z. Ning, Genome Res 13, 81-90 (2003)
Arachne (Whitehead/MIT)
Euler (UCSD, USC) P.A. Pevzner, H. Tang, M.S. Waterman, RECOMB (2001)
Most assemblers follow the same approach:
overlap – layout – consensus
CAP3 Assembler
Removal of poor end regions of reads
Computation of overlaps between reads
Removal of false overlaps
Construction of contigs
Construction of multiple sequence
alignments and generation of
consensus sequences
CAP3: Clipping of Low-Quality Regions
Use base quality values (from Phred) and sequence similarities to
compute 5' and 3' clipping positions of reads.
Definition of good regions of a read:
- any sufficiently long region of high-quality values that is similar
to a region of another read, OR
- any sufficiently long region that is highly similar to a good high-quality
region of another read
Computation of the 5' and 3' clipping positions of read f. Read f has high local similarities to reads g and h. A pair of broken lines shows the start and end positions of a similarity. A thick line indicates the high-quality region of a read.
Huang, Madan, Genome Res 9, 868 (1999)
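Only the quality half of the definition above is easy to show in isolation. The following sketch finds the longest run of bases whose quality is at least a threshold; the function name and the threshold of 20 are assumptions, and CAP3 additionally uses similarity to other reads, which is not modelled here.

```python
def longest_high_quality_region(quals, min_q=20):
    """Return the half-open interval (start, end) of the longest run of
    quality values >= min_q, or None if there is no such run."""
    best = None
    start = None
    # A sentinel of -infinity closes a run that reaches the end of the read.
    for i, q in enumerate(list(quals) + [float("-inf")]):
        if q >= min_q:
            if start is None:
                start = i
        elif start is not None:
            if best is None or i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return best


quals = [5, 8, 25, 30, 32, 10, 28, 29, 4]
print(longest_high_quality_region(quals))  # (2, 5)
```

The returned interval is a candidate pair of 5' and 3' clipping positions before similarity information is taken into account.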
Celera – compartmentalized shotgun assembler
Uses preliminary data from both
human genome assembly projects.
Huson et al. Bioinformatics 17, S132 (2001)
Arachne program
by Serafin Batzoglou (MIT, PhD thesis 2000)
(i) create graph G of overlaps between pairs of reads of shotgun data
(ii) process G for the purpose of constructing supercontigs of mapped reads.
Batzoglou et al. Genome Res 12, 177 (2002)
Earmuff links
An important variation of whole-genome shotgun sequencing obtains
reads from both ends of an insert, forward and backward.
Since inserts are size-selected, the approximate distance of the pair
of reads obtained from the ends of a fragment is known.
These will be called earmuff links.
Arachne: creation of overlap graph
List of reads R = (r1, ..., rN), where N is the number of reads.
Each read ri has length li < 1000.
If two reads ri and rj are taken from the endpoints of the same clone (earmuff link),
ri has a link to rj at a specified distance dij.
First: create graph G of overlaps (edges) between pairs of reads (nodes).
Pairs of reads in R need to be aligned.
Since N can be very large, N^2 alignments are infeasible.
Create a table of occurrences of k-mers (strings of length k) in the reads and
count the number of k-mer matches for each pair of reads.
Then perform pairwise alignments between pairs of reads that contain
more than a cutoff number of common k-mers.
Batzoglou PhD thesis (2002)
Arachne: table of k-mer occurrences
Find number of k-mer matches in the forward or reverse complement direction
between each pair of reads in R.
(1) obtain all triplets (r,t,v)
r = read in R
t = index of a k-mer occurring in r
v = direction of occurrence (forward or reverse complement)
(2) sort the set of triplets according to the k-mer indices t
(3) use the sorted list to create a table T of quadruplets (ri, rj, f, v), where ri and rj are
reads that contain at least one common k-mer, v is a direction, and f is the
number of k-mers in common between ri and rj in direction v.
Batzoglou PhD thesis (2002)
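The three steps above can be sketched in a few lines of Python. This is a toy version under stated assumptions: the k-mer string itself stands in for the index t, everything is held in memory, and to avoid counting each match twice only forward occurrences of the lower-numbered read are paired with occurrences of the higher-numbered read. The function names are illustrative, not Arachne's.

```python
from collections import defaultdict
from itertools import groupby

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    return "".join(COMPLEMENT[b] for b in reversed(seq))


def kmer_triplets(reads, k):
    """Step (1): all triplets (r, t, v) for forward and reverse-complement k-mers."""
    triplets = []
    for r, seq in enumerate(reads):
        rc = reverse_complement(seq)
        for i in range(len(seq) - k + 1):
            triplets.append((r, seq[i:i + k], "fwd"))
            triplets.append((r, rc[i:i + k], "rev"))
    return triplets


def kmer_match_table(reads, k):
    """Steps (2) and (3): sort by k-mer, then count shared k-mers per read pair."""
    triplets = sorted(kmer_triplets(reads, k), key=lambda x: x[1])
    counts = defaultdict(int)
    for _, group in groupby(triplets, key=lambda x: x[1]):
        occurrences = [(r, v) for r, _, v in group]
        for ri, vi in occurrences:
            if vi != "fwd":
                continue
            for rj, vj in occurrences:
                if ri < rj:
                    counts[(ri, rj, vj)] += 1
    return [(ri, rj, f, v) for (ri, rj, v), f in sorted(counts.items())]


reads = ["AAACCC", "ACCCTT", "CAAGGG"]
print(kmer_match_table(reads, 3))
# [(0, 1, 2, 'fwd'), (0, 2, 1, 'rev'), (1, 2, 3, 'rev')]
```

Read 2 is the reverse complement of CCCTTG, so it shares k-mers with reads 0 and 1 only in the reverse-complement direction. Only pairs whose count f exceeds the cutoff would then be passed on to pairwise alignment.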
Arachne: table of k-mer occurrences
Batzoglou PhD thesis (2002)
Here: k = 3
Arachne: table of k-mer occurrences
If a k-mer occurs "too often", it is likely part of a repeat sequence,
and we should not use it for detecting overlap.
Implementation
(1) find k-mer occurrences (r,t,v) and sort them into 64 files according to the
first three nucleotides of each k-mer.
(2) For i = 1, ..., 64:
load file i into memory, sort according to t, store the sorted file.
end
(3) load the 64 sorted files into memory sequentially, create table T incrementally.
In practice, k = 8 to 24.
Batzoglou PhD thesis (2002)
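Two details of this implementation can be made concrete: the 64 files correspond to the 4^3 possible three-nucleotide prefixes, and over-represented k-mers are identified by simple counting. The sketch below is an in-memory illustration with hypothetical function names and an arbitrary cutoff; it counts forward occurrences only.

```python
from collections import Counter

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}


def bucket_index(kmer):
    """Map the first three nucleotides of a k-mer to a bucket number 0..63."""
    a, b, c = (BASE_INDEX[x] for x in kmer[:3])
    return 16 * a + 4 * b + c


def frequent_kmers(reads, k, max_count):
    """Return the set of k-mers that occur more than max_count times."""
    counts = Counter(
        seq[i:i + k] for seq in reads for i in range(len(seq) - k + 1)
    )
    return {t for t, c in counts.items() if c > max_count}


reads = ["ACACAC", "TTACAC", "GGGACA"]
print(bucket_index("AAA"), bucket_index("TTT"))   # 0 63
print(frequent_kmers(reads, 3, max_count=3))      # {'ACA'}
```

K-mers returned by `frequent_kmers` would be skipped when building T, so that repeats do not generate a flood of candidate read pairs.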
Arachne: pairwise read alignments
Perform pairwise alignments between reads that contain more than a cutoff
number of common k-mers. When k-mers that are too common (occurring more often
than a second cutoff) are excluded, it is guaranteed that only O(N) pairwise
alignments will be performed.
Only a small number of base substitutions and indels is allowed in an
overlapping region of two aligned reads.
Use a dynamic programming alignment that disallows deviations of more than
a few characters.
Output of the alignment algorithm:
for reads ri, rj, quadruplets (b1, b2, e1, e2) of the beginning positions b1, b2
and end positions e1, e2 of the detected overlap region.
If a significant overlap region is detected, (ri, rj, b1, b2, e1, e2) becomes a link
in the overlap graph G.
Batzoglou PhD thesis (2002)
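A dynamic programming alignment that "disallows deviations of more than a few characters" is commonly realised as a banded alignment. The sketch below computes a banded edit distance between two strings as a stand-in; it is not Arachne's alignment routine, and the function name and band width are assumptions. For clarity it allocates a full row per step, whereas a real implementation would store only the band.

```python
def banded_edit_distance(a, b, band):
    """Edit distance between a and b, restricted to DP cells with
    |i - j| <= band. Returns None if no alignment fits inside the band."""
    INF = float("inf")
    n, m = len(a), len(b)
    if abs(n - m) > band:
        return None
    prev = [j if j <= band else INF for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [INF] * (m + 1)
        for j in range(max(0, i - band), min(m, i + band) + 1):
            if j == 0:
                cur[j] = i                      # i deletions from a
                continue
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,    # match or substitution
                         prev[j] + 1,           # gap in b
                         cur[j - 1] + 1)        # gap in a
        prev = cur
    return prev[m] if prev[m] != INF else None


print(banded_edit_distance("ACGTACGT", "ACGTTCGT", 2))  # 1 (one substitution)
print(banded_edit_distance("ACGTACGT", "ACGACGT", 2))   # 1 (one deletion)
print(banded_edit_distance("AAAA", "TTTTTTTT", 2))      # None
```

Restricting the computation to a band of width w makes the alignment cost proportional to w times the overlap length instead of the product of the two read lengths.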
Correcting errors in reads
Batzoglou et al. Genome Res 12, 177 (2002)
Shown is a portion of a multiple alignment between 5 reads. A base T of quality 30 is aligned to bases C, some of which are of quality greater than 30. The base T is subsequently changed to a base C of quality 30.
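The example in the figure can be mimicked with a simple quality-weighted vote over one alignment column. The voting rule, the ratio threshold, and the choice to keep the original quality value are assumptions made for this sketch; the slide only shows the outcome for one column, not Arachne's exact criterion.

```python
from collections import defaultdict


def correct_column(column, min_ratio=3.0):
    """column: list of (base, quality) pairs from one alignment column.
    A base is replaced by the best-supported base if that base's summed
    quality is at least min_ratio times the summed quality of the
    disagreeing base. The original quality value is kept."""
    support = defaultdict(int)
    for base, q in column:
        support[base] += q
    winner = max(support, key=support.get)
    corrected = []
    for base, q in column:
        if base != winner and support[winner] >= min_ratio * support[base]:
            corrected.append((winner, q))
        else:
            corrected.append((base, q))
    return corrected


column = [("C", 35), ("C", 40), ("T", 30), ("C", 25), ("C", 38)]
print(correct_column(column))
# [('C', 35), ('C', 40), ('C', 30), ('C', 25), ('C', 38)]
```

With a weakly supported consensus (for example two reads that disagree with similar qualities) nothing is changed, which avoids erasing true differences between repeat copies.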
Partial alignments
Three partial alignments of length k = 6 between a pair of reads
coalesce to yield a single full alignment of length 19.
Vertical bars denote matching bases, whereas x's denote mismatches.
This illustrates the commonly occurring situation where an extended
k-mer hit is a full alignment between two reads.
Batzoglou et al. Genome Res 12, 177 (2002)
Ambiguity created by the presence of repeats
In the absence of sequencing errors and repeats it would be
simple to retrieve all retrievable pairwise distances of reads
and to construct G.
In the presence of repeats, a link between two reads in G
does not necessarily imply true overlap. A "repeat link" is a link
in G between two reads that come from different regions in
the genome and overlap in a repeated segment.
Batzoglou PhD thesis (2002)
Arachne: processing of overlap graph
Some of the repetition in the genome is efficiently masked before the creation
of G by throwing away k-mers of high frequency when building T.
Furthermore some heuristic algorithms are used to detect and delete
repetitive links (not discussed here).
Batzoglou PhD thesis (2002)
Merging contigs
Batzoglou PhD thesis (2002)
Sequence contigs are formed by merging together pairs of reads that can be
merged without ambiguity.
In practice the situation
is much worse than shown
here. Repeats are not
100% conserved between
copies.
Sequence contigs
Batzoglou PhD thesis (2002)
Using paired pairs of overlaps to merge reads
Arachne searches for instances of
two plasmids of similar insert size
with sequence overlaps occurring at
both ends ("paired pairs").
Batzoglou et al. Genome Res 12, 177 (2002)
(A) A paired pair of overlaps. The top
two reads are end sequences from one
insert, and the bottom two reads are end
sequences from another.
The two overlaps must not imply too
large a discrepancy between the insert
lengths.
(B) Initially, the top two pairs of reads
are merged. Then the third pair of
reads is merged in, based on having
an overlap with one of the top two left
reads, an overlap with one of the top
two right reads, and consistent insert
lengths. The bottom pair is similarly
merged.
Bottom: collections of paired pairs are
merged into contigs, and
consensus sequences are formed.
Detection of repeat contigs
Contig R is linked to contigs A and
B to the right. The distances
estimated between R and A and between
R and B are such that A and B cannot
be positioned without substantial
overlap between them. If there is
no corresponding detected overlap
between A and B, then R is
probably a repeat linking to two
unique regions to the right.
Batzoglou et al. Genome Res 12, 177 (2002)
Some of the identified contigs are repeat contigs, in which nearly identical
sequences from distinct regions are collapsed together. They can be detected because
(a) repeat contigs usually have an unusually high depth of coverage, and
(b) they typically have conflicting links to other contigs.
After marking repeat contigs, the remaining
contigs should represent the correctly
assembled sequence.
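Criterion (a) can be sketched as a comparison of each contig's read depth with the overall depth. The data layout, the function name, and the factor of 2 are assumptions for illustration; criterion (b), conflicting links, is not modelled here.

```python
def flag_high_coverage(contigs, factor=2.0):
    """contigs: dict name -> (contig_length, total_bases_of_reads_in_contig).
    Returns names whose depth exceeds factor times the overall depth."""
    total_len = sum(length for length, _ in contigs.values())
    total_bases = sum(bases for _, bases in contigs.values())
    overall_depth = total_bases / total_len
    return sorted(
        name
        for name, (length, bases) in contigs.items()
        if bases / length > factor * overall_depth
    )


contigs = {
    "A": (10000, 80000),   # depth 8
    "B": (8000, 60000),    # depth 7.5
    "R": (1000, 30000),    # depth 30, suspicious
}
print(flag_high_coverage(contigs))  # ['R']
```

A collapsed two-copy repeat is expected to show roughly twice the typical depth, which motivates a threshold of that order.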
Supercontig creation and gap filling
(A) A supercontig is constructed by successively
linking pairs of contigs that share at least two
forward-reverse links. Here, 3 contigs are
joined into one supercontig.
The layout now consists of a number
of supercontigs with interleaved gaps.
Most gaps belong to regions marked
as repeat contigs, some correspond
to regions of insufficient shotgun reads.
(B) Arachne attempts to fill gaps by using paths of
contigs. The first gap in the supercontig
shown here is filled with one contig, and the
second gap is filled by a path consisting of two
contigs.
Batzoglou et al. Genome Res 12, 177 (2002)
Unmarked contigs = unique contigs.
Iteratively merge contigs into supercontigs.
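The grouping rule in (A), joining contigs that share at least two forward-reverse links, can be sketched with a union-find structure. This toy version only decides which contigs belong together; it ignores contig order, orientation, and the distance consistency that a real assembler has to check. All names are illustrative.

```python
from collections import Counter


def build_supercontigs(contigs, links, min_links=2):
    """contigs: iterable of contig names.
    links: list of (contig_x, contig_y), one entry per forward-reverse read pair.
    Returns sorted groups of contigs connected by >= min_links shared links."""
    link_counts = Counter(tuple(sorted(pair)) for pair in links)
    parent = {c: c for c in contigs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for (x, y), n in link_counts.items():
        if x != y and n >= min_links:
            parent[find(x)] = find(y)

    groups = {}
    for c in parent:
        groups.setdefault(find(c), []).append(c)
    return sorted(sorted(g) for g in groups.values())


contigs = ["c1", "c2", "c3", "c4"]
links = [("c1", "c2"), ("c2", "c1"), ("c2", "c3"), ("c2", "c3"), ("c3", "c4")]
print(build_supercontigs(contigs, links))  # [['c1', 'c2', 'c3'], ['c4']]
```

Requiring two links rather than one guards against a single chimeric or misplaced read pair joining unrelated contigs.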
Contig assembly
If (a,b) and (a,c) overlap, then
(b,c) are expected to overlap.
Moreover, one can calculate that
shift(b,c) = shift(a,c) - shift(a,b).
A repeat boundary is detected
toward the right of read a if there
is no overlap (b,c), nor any path
of reads x1, ..., xk such that (b,x1),
(x1,x2), ..., (xk,c) are all overlaps
and shift(b,x1) + ... + shift(xk,c)
≈ shift(a,c) - shift(a,b).
Batzoglou et al. Genome Res 12, 177 (2002)
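The consistency test above can be sketched as a bounded path search in a table of overlap shifts. The data layout (a dict from ordered read pairs to shifts), the tolerance, and the hop limit are assumptions for illustration only.

```python
def has_consistent_path(shifts, b, c, target, tol=2, max_hops=4):
    """shifts: dict (x, y) -> shift for every detected overlap from x to y.
    True if b reaches c directly or via at most max_hops - 1 intermediate
    reads with a total shift within tol of target."""
    stack = [(b, 0, 0, {b})]
    while stack:
        node, total, hops, seen = stack.pop()
        for (x, y), s in shifts.items():
            if x != node or y in seen:
                continue
            if y == c:
                if abs(total + s - target) <= tol:
                    return True
                continue
            if hops + 1 < max_hops:
                stack.append((y, total + s, hops + 1, seen | {y}))
    return False


def repeat_boundary(shifts, a, b, c, tol=2):
    """True if no overlap or path from b to c matches shift(a,c) - shift(a,b)."""
    target = shifts[(a, c)] - shifts[(a, b)]
    return not has_consistent_path(shifts, b, c, target, tol)


consistent = {("a", "b"): 100, ("a", "c"): 250, ("b", "c"): 150}
missing = {("a", "b"): 100, ("a", "c"): 250}
via_x = {("a", "b"): 100, ("a", "c"): 250, ("b", "x"): 60, ("x", "c"): 91}
print(repeat_boundary(consistent, "a", "b", "c"))  # False
print(repeat_boundary(missing, "a", "b", "c"))     # True
print(repeat_boundary(via_x, "a", "b", "c"))       # False
```

In the third case the direct overlap (b,c) is absent, but the path through x has total shift 151, close enough to the expected 150, so no repeat boundary is reported.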
Consistency of forward-reverse links
(A) The distance d(A,B) (length of gap or negated length of
overlap) between two linked contigs A and B can be
estimated using the forward-reverse linked reads between them.
(B) The distance d(B,C) between two contigs B, C that are
linked to the same contig A can be estimated from their
respective distances to the linked contig.
Batzoglou et al. Genome Res 12, 177 (2002)
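Both estimates can be written down under a simple coordinate model, which is an assumption of this sketch rather than something stated on the slide: contig A lies to the left of B, the forward read starts at offset `start_in_a` in A, the reverse read's outer end lies at offset `end_in_b` in B, and the insert has known length. For (B), contigs B and C are both assumed to lie to the right of A, with B closer.

```python
from statistics import mean


def gap_from_link(insert_len, len_a, start_in_a, end_in_b):
    """d(A,B) implied by one forward-reverse link; negative means overlap.
    insert = (part of insert inside A) + gap + (part of insert inside B)."""
    return insert_len - (len_a - start_in_a) - end_in_b


def estimate_gap(links, len_a):
    """Average the estimate over all links (insert_len, start_in_a, end_in_b)."""
    return mean(gap_from_link(i, len_a, s, e) for i, s, e in links)


def gap_between_right_neighbours(d_ab, d_ac, len_b):
    """d(B,C) when B and C both lie to the right of A and B comes first."""
    return d_ac - d_ab - len_b


links = [(4000, 8000, 1500), (4000, 9000, 2700)]
d_ab = estimate_gap(links, len_a=10000)
print(d_ab)                                             # 400
print(gap_between_right_neighbours(d_ab, 3000, 2000))   # 600
```

Because insert lengths are only approximately known, each single-link estimate carries an error of the order of the insert size variation; averaging over several links reduces it.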
Types of misassemblies
(A) Three types of simple minor misassemblies
are shown: insertions, deletions, and hanging ends. In all
cases, a contiguous segment (of a
contig or the genome) of less than
10 kb does not align in the expected
location (with the genome or contig).
(B) More misassemblies.
First, two pieces of a contig align to
distant parts of the genome.
Second, adjacent contigs in a
supercontig are aligned to distant
parts of the genome.
Batzoglou et al. Genome Res 12, 177 (2002)
Filling gaps in supercontigs
(A) Contigs A and B are connected by a path p of contigs X1, ..., Xk. The
distance dp(A,B) between A and B (along the path p) is the length of
the sequence in the path that does not overlap A and B.
(B) Contigs Y1 and Y2 share forward-reverse links with the supercontig
S. These links position them in the vicinity of the gap between A and
B. Therefore, Y1 and Y2 will be used as possible stepping points in
the path closing the gap from A to B.
Batzoglou et al. Genome Res 12, 177 (2002)
Detection of chimeric reads
Reads l1, l2, l3, r1, r2, and r3, and the absence of a read n (having long overlaps on
both sides of a point x) suggest that read c may be chimeric, consisting of the
juxtaposition of two disparate genomic segments: one corresponding to the part
of c before x, and one corresponding to the part of c after x.
Note that reads l3 and r3 extend slightly beyond x, as often happens for real
chimeric reads.
Batzoglou et al. Genome Res 12, 177 (2002)
Contig Coverage and Read Usage
Batzoglou et al. Genome Res 12, 177 (2002)
Characterization of Contigs
Batzoglou et al. Genome Res 12, 177 (2002)
Characterization of Supercontigs
Batzoglou et al. Genome Res 12, 177 (2002)
Base Pair Accuracy
Batzoglou et al. Genome Res 12, 177 (2002)
Misassemblies
Batzoglou et al. Genome Res 12, 177 (2002)
Computational Performance
Batzoglou et al. Genome Res 12, 177 (2002)
Comparison of different assemblers
Pevzner, Tang, Waterman PNAS 98, 9748 (2001)
You should look out for:
- smallest number of contigs and of misassembled contigs
- highest possible coverage by contigs
- lowest possible coverage by misassembled contigs
There is no error-free assembler to date
Pevzner, Tang, Waterman PNAS 98, 9748 (2001)
Comparative analysis of EULER, PHRAP, CAP, and TIGR assemblers (NM sequencing project). Every box corresponds to a contig in NM assembly produced by these programs with colored boxes corresponding to assembly errors. Boxes in the IDEAL assembly correspond to islands in the read coverage. Boxes of the same color show misassembled contigs. Repeats with similarity higher than 95% are indicated by numbered boxes at the solid line showing the genome. To check the accuracy of the assembled contigs, we fit each assembled contig into the genomic sequence. Inability to fit a contig into the genomic sequence indicates that the contig is misassembled. For example, PHRAP misassembles 17 contigs in the NM sequencing project, each contig containing from two to four fragments from different parts of the genome.
“Biologists "pay" for these errors at the
time-consuming finishing step.”
What comes next? Finishing the genome
Usually, the assembly of shotgun data is finished with a number of contigs
with some remaining gaps.
Also, within each contig there are some regions of high error rate.
The goal of the finishing phase is then to get a single continuous contig
with low error rate.
"Finishers" apply ad hoc rules to decide where additional data is necessary.
This experimental data may then be generated in experiments using
different chemistry or higher coverage.
Autofinish (phrap group) is a program to help humans with deciding
which new reads to get.
Human experts are only rarely needed ...
D. Gordon, C. Desmarais, P. Green, Genome Res, 11, 614 (2001)