Chap. 4 FRAGMENT ASSEMBLY OF DNA
Introduction to Computational Molecular Biology
Chapter 4
4.1 Biological Background The ideal case
The answer is known to be approximately 10 bases long.
The consensus sequence is TTACCGTGC. Therefore the answer has 9 bases, which is close to the expected length.
The four sequences
ACCGT
CGTGC
TTAC
TACCGT
Fragment assembly (the last row is the consensus)
--ACCGT--
----CGTGC
TTAC-----
-TACCGT--
TTACCGTGC
4.1 Biological Background Substitution
There was a substitution error in the second position of the last fragment, where A was replaced by G.
The consensus is still correct because of majority voting.
The four sequences
ACCGT
CGTGC
TTAC
TGCCGT
Fragment assembly (the last row is the consensus)
--ACCGT--
----CGTGC
TTAC-----
-TGCCGT--
TTACCGTGC
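The majority vote behind the consensus can be sketched as follows. This is only an illustration, not the procedure of any particular assembler: the function name `consensus` is hypothetical, leading and trailing `-` characters are assumed to be padding (no vote), internal `-` characters are assumed to be gaps that do vote, and ties are broken arbitrarily.

```python
from collections import Counter


def consensus(layout):
    """Column-wise majority vote over the rows of a layout.

    Assumptions (not from the slides): '-' at the ends of a row is padding,
    '-' inside a fragment is a gap that takes part in the vote.
    """
    width = max(len(row) for row in layout)
    votes = [Counter() for _ in range(width)]
    for row in layout:
        start = len(row) - len(row.lstrip("-"))  # column of the first base
        fragment = row.strip("-")                # drop end padding only
        for offset, symbol in enumerate(fragment):
            votes[start + offset][symbol] += 1
    # A column covered by no fragment is reported as '-'.
    return "".join(v.most_common(1)[0][0] if v else "-" for v in votes)


layout_with_substitution = [
    "--ACCGT--",
    "----CGTGC",
    "TTAC-----",
    "-TGCCGT--",
]
print(consensus(layout_with_substitution))
```

In the third column the two A's outvote the erroneous G, which is why the substitution does not reach the consensus.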
4.1 Biological Background Insertion
There was an insertion error in the second position of the second fragment. Base A appeared where there should be none.
The consensus is still correct.
The four sequences
ACCGT
CAGTGC
TTAC
TACCGT
Fragment assembly (the last row is the consensus)
--ACC-GT--
----CAGTGC
TTAC------
-TACC-GT--
TTACC-GTGC
4.1 Biological Background Deletion
There was a deletion of the third (or fourth) base in the last fragment.
The consensus is still correct.
The four sequences
ACCGT
CGTGC
TTAC
TACGT
Fragment assembly (the last row is the consensus)
--ACCGT--
----CGTGC
TTAC-----
-TAC-GT--
TTACCGTGC
4.1 Biological Background Chimera
The last fragment in this input set is a chimera.
The input sequences
ACCGT
CGTGC
TTAC
TACCGT
TTATGC
Fragment assembly (consensus of the first four fragments, then the chimera)
--ACCGT--
----CGTGC
TTAC-----
-TACCGT--
TTACCGTGC
TTA---TGC
4.1 Biological Background Unknown Orientation
Fragments can come from either of the two DNA strands, and we generally do not know to which strand a particular fragment belongs.
We do know, however, that whatever the strand, the sequence is read from 5' to 3'.
Because the strands are complementary and have opposite orientations, using a fragment (a substring of one strand) is equivalent to using its reverse complement (a substring of the other).
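A minimal sketch of the reverse complement operation, assuming sequences are uppercase strings over the alphabet A, C, G, T. The function name `reverse_complement` is chosen here for illustration.

```python
# Watson-Crick pairing: A<->T, C<->G.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse, so the result again reads 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("ACTACG"))
```

A fragment and its reverse complement carry the same information, so an assembler may use whichever of the two fits the layout.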
4.1 Biological Background Fragment assembly with unknown orientation
Initially we do not know the orientation of fragments.
Input     Orientation   Layout
CACGT     →             CACGT--------
ACGT      →             -ACGT--------
ACTACG    ←             --CGTAGT-----
GTACT     ←             -----AGTAC---
ACTGA     →             --------ACTGA
CTGA      →             ---------CTGA
Answer                  CACGTAGTACTGA
Fragments marked ← appear in the layout as their reverse complements.
4.1 Biological Background Fragment assembly with unknown orientation
Repeated regions
Repeated regions, or repeats, are sequences that appear two or more times in the target molecule.
If the level of similarity between two copies of a repeat is high enough, the differences can be mistaken for base call errors.
The blocks marked X1 and X2 are approximately the same sequence.
[Figure: target sequence containing the two repeat blocks X1 and X2]
4.1 Biological Background Fragment assembly with unknown orientation
The kinds of problems (Repeats)
Target sequence leading to ambiguous assembly because of repeats of the form XXX.
A X B X C X D
A X C X B X D
4.1 Biological Background Fragment assembly with unknown orientation
The kinds of problems (Repeats)
Target sequence leading to ambiguous assembly because of repeats of the form XYXY.
A X B Y C X D Y E
A X D Y C X B Y E
4.1 Biological Background Fragment assembly with unknown orientation
The kinds of problems (Repeats)
Inverted repeats, which are repeated regions in opposite strands, can also occur and are potentially more dangerous.
[Figure: target sequence with inverted repeat; the two copies of X lie on opposite strands, and the figure is annotated "Rotate 180°"]
4.2 Models Shortest Common Superstring
Problem: Shortest Common Superstring (SCS)
Input: A collection F of strings.
Output: A shortest possible string S such that for every f ∈ F, S is a superstring of f.
Example
F = {ACT, CTA, AGT}
S = ACTAGT is the SCS of F.
CTA is a substring of S.
ACT---
-CTA--
---AGT
ACTAGT
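For a collection this small the SCS can be found by brute force. The sketch below tries every ordering of the fragments and merges neighbours with their maximal overlap. It is an exponential-time illustration only, and the helper names `overlap` and `shortest_common_superstring` are hypothetical.

```python
from itertools import permutations


def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def shortest_common_superstring(fragments):
    """Exhaustive search; only practical for a handful of fragments."""
    unique = list(dict.fromkeys(fragments))
    # Fragments contained in another fragment never change the answer.
    frags = [f for f in unique if not any(f != g and f in g for g in unique)]
    if not frags:
        return ""
    best = None
    for order in permutations(frags):
        s = order[0]
        for f in order[1:]:
            s += f[overlap(s, f):]  # append only the non-overlapping tail
        if best is None or len(s) < len(best):
            best = s
    return best


print(shortest_common_superstring(["ACT", "CTA", "AGT"]))
```

Of the six orderings of F, only ACT, CTA, AGT reaches length 6.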
4.2 Models Shortest Common Superstring
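Problem with the SCS model
[Figure: target sequence with two copies of a long repeat X]
Target sequence with long repeat that contains many fragments. Fragments that fall entirely inside X need to appear only once in a shortest superstring, so the SCS may collapse the two copies of the repeat into one.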
4.2 Models Reconstruction
The Reconstruction model deals with errors and unknown orientation.
Substring edit distance
S(b) = the set of all substrings of b
d = the classical edit distance
$d_s(a, b) = \min_{s \in S(b)} d(a, s)$
The distance is asymmetric: in general $d_s(a, b) \neq d_s(b, a)$.
4.2 Models Reconstruction
Example Optimal alignment for substring edit distance, which does not
charge for end deletions in the first string.
-----GC-GATAG----
CAGTCGCTGATCGTACG
ds(a,b)=2
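The substring edit distance can be computed with the usual edit-distance dynamic program, changed so that the first row is all zeros (the match may start anywhere in b) and the answer is the minimum of the last row (the match may end anywhere in b). This is a sketch with unit costs assumed for substitutions, insertions and deletions. The function name is chosen here.

```python
def substring_edit_distance(a, b):
    """Minimum edit distance between a and any substring of b (unit costs assumed)."""
    prev = [0] * (len(b) + 1)          # empty prefix of a matches anywhere for free
    for i in range(1, len(a) + 1):
        curr = [i] + [0] * len(b)      # a[:i] against the empty string costs i
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j - 1] + cost,  # match or substitution
                          prev[j] + 1,         # base of a left unmatched
                          curr[j - 1] + 1)     # extra base of b inside the match
        prev = curr
    return min(prev)                   # the match may end at any position of b


a = "GCGATAG"
b = "CAGTCGCTGATCGTACG"
print(substring_edit_distance(a, b))
```

Swapping the arguments shows the asymmetry: every substring of the short string is far from the long one.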
4.2 Models Reconstruction
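Error tolerance ε: f is an approximate substring of S at error level ε when $d_s(f, S) \le \varepsilon |f|$, that is, on average ε errors are permitted for each base in f.
Problem: Reconstruction
Input: A collection F of strings and an error tolerance ε between 0 and 1.
Output: A shortest possible string S such that for every f ∈ F,
$\min(d_s(f, S), d_s(\bar{f}, S)) \le \varepsilon |f|$
where $\bar{f}$ denotes the reverse complement of f.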
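The output condition of the Reconstruction model can be checked for a candidate string S as sketched below. The check only verifies that every fragment, or its reverse complement, is an approximate substring of S within the tolerance. It does not search for a shortest S. All function names are hypothetical and unit edit costs are assumed.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def substring_edit_distance(a, b):
    """Minimum unit-cost edit distance between a and any substring of b."""
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j - 1] + cost, prev[j] + 1, curr[j - 1] + 1)
        prev = curr
    return min(prev)


def satisfies_reconstruction(s, fragments, eps):
    """True if each fragment fits s in at least one orientation within eps * |f| errors."""
    for f in fragments:
        d = min(substring_edit_distance(f, s),
                substring_edit_distance(reverse_complement(f), s))
        if d > eps * len(f):
            return False
    return True


# Fragments of unknown orientation, no errors allowed.
print(satisfies_reconstruction(
    "CACGTAGTACTGA",
    ["CACGT", "ACGT", "ACTACG", "GTACT", "ACTGA", "CTGA"],
    0.0))
# One substitution in a 6-base fragment needs a tolerance of at least 1/6.
print(satisfies_reconstruction("TTACCGTGC", ["ACCGT", "CGTGC", "TTAC", "TGCCGT"], 0.2))
```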
4.2 Models Multicontig
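A layout is a t-contig when its weakest link (the smallest overlap between consecutive fragments) is at least t. Three layouts for F = {TAATG, TGTAA, GTAC}:
3-contig (GTAC is left as a separate contig)
--TAATG
TGTAA--
GTAC
2-contig (GTAC is left as a separate contig)
TAATG---
---TGTAA
GTAC
1-contig
TGTAA-----
--TAATG---
------GTAC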
4.3 Algorithms Overlap multigraph
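Fragments: a = TACGA, b = ACCC, c = CTAAAG, d = GACA
[Figure: overlap multigraph on the nodes a, b, c, d with edge weights 1, 1, 1, 1 and 2]
Overlap between fragments c and d: the single base G (weight 1).
PATH1 = dbc
GACA--------
---ACCC-----
------CTAAAG
PATH2 = abcd
a = TACGA-----------
b = ----ACCC--------
c = -------CTAAAG---
d = ------------GACA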
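The edges of the overlap multigraph for the fragments a = TACGA, b = ACCC, c = CTAAAG and d = GACA can be enumerated as sketched here. The function `overlap_multigraph` is a hypothetical helper. It lists every positive suffix-prefix overlap length, so a pair of fragments may carry more than one edge. Zero-weight edges, which exist between every ordered pair, are left out for brevity.

```python
def overlap_multigraph(fragments):
    """Map (u, v) to the list of k > 0 with: last k bases of u == first k bases of v."""
    edges = {}
    for u, fu in fragments.items():
        for v, fv in fragments.items():
            if u == v:
                continue
            weights = [k for k in range(1, min(len(fu), len(fv)) + 1)
                       if fu.endswith(fv[:k])]
            if weights:
                edges[(u, v)] = weights
    return edges


fragments = {"a": "TACGA", "b": "ACCC", "c": "CTAAAG", "d": "GACA"}
for (u, v), weights in sorted(overlap_multigraph(fragments).items()):
    print(u, "->", v, weights)
```

The path d, b, c uses two edges of weight 1 and spells GACACCCTAAAG, while a, b, c, d uses three edges of weight 1 and spells TACGACCCTAAAGACA.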
4.3 Algorithms The greedy
Looking for shortest common superstrings is the same as looking for Hamiltonian paths of maximum weight in a directed multigraph.
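The equivalence rests on a simple length identity, written out here under the assumption that no fragment is a substring of another. For a Hamiltonian path $P = f_1 \to f_2 \to \dots \to f_n$ whose edge from $f_i$ to $f_{i+1}$ has weight $w(f_i, f_{i+1})$, the superstring $S(P)$ spelled by the path has length

```latex
|S(P)| \;=\; \sum_{i=1}^{n} |f_i| \;-\; \sum_{i=1}^{n-1} w(f_i, f_{i+1})
```

The first sum does not depend on the path, so minimizing $|S(P)|$ is the same as maximizing the total weight of the path. For the five fragments w, z, u, x, y of the example, the fragment lengths add up to 13 + 8 + 10 + 8 + 12 = 51. A path of total weight 15 therefore gives a superstring of length 36, and a path of total weight 14 gives one of length 37.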
4.3 Algorithms The greedy
Example
S = AGTATTGGCAATCGATGCAAACCTTTTGGCAATCACT
w=AGTATTGGCAATC
z=AATCGATG
u=ATGCAAACCT
x=CCTTTTGG
y=TTGGCAATCACT
This solution has length 36 and is generated by the Greedy algorithm. However, its weakest link is zero.
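A minimal sketch of a greedy strategy: repeatedly merge the two strings with the largest suffix-prefix overlap. It assumes that no fragment is a substring of another and that the input is not empty. Ties are broken by scan order, which is an arbitrary choice made here, and the function names are hypothetical.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(fragments):
    """Merge the best-overlapping pair until a single string remains."""
    strings = list(fragments)
    while len(strings) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(strings):
            for j, b in enumerate(strings):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        merged = strings[best_i] + strings[best_j][best_k:]
        strings = [s for idx, s in enumerate(strings)
                   if idx not in (best_i, best_j)] + [merged]
    return strings[0]


w, z, u, x, y = ("AGTATTGGCAATC", "AATCGATG", "ATGCAAACCT",
                 "CCTTTTGG", "TTGGCAATCACT")
result = greedy_superstring([w, z, u, x, y])
print(result, len(result))
```

On these fragments the first merge joins w and y, which overlap in the nine bases TTGGCAATC. That leaves the chain z, u, x with nothing to overlap at either end, so the last merge is a plain concatenation. This is the zero-weight weakest link.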
4.3 Algorithms Acyclic
Hamiltonian path w → z → u → x → y
Overlap weights along the path: 4, 3, 3, 4
This solution has length 37. Its weakest link is 3.
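The weights, the length and the weakest link of a path can be recomputed as sketched below. The visiting order w, z, u, x, y is inferred from the weights 4 3 3 4, and the helper names are hypothetical.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def follow_path(path):
    """Return (superstring, link weights) for fragments visited in the given order."""
    weights = [overlap(a, b) for a, b in zip(path, path[1:])]
    superstring = path[0]
    for fragment, k in zip(path[1:], weights):
        superstring += fragment[k:]   # skip the bases already covered by the link
    return superstring, weights


w, z, u, x, y = ("AGTATTGGCAATC", "AATCGATG", "ATGCAAACCT",
                 "CCTTTTGG", "TTGGCAATCACT")

s, weights = follow_path([w, z, u, x, y])
print(len(s), weights, min(weights))

# One order consistent with the greedy merges: w then y, followed by z, u, x.
s_greedy, weights_greedy = follow_path([w, y, z, u, x])
print(len(s_greedy), weights_greedy, min(weights_greedy))
```

The second path is one base shorter, but only because of the nine-base link between w and y. Its weakest link carries no overlap at all, whereas every link of the first path is supported by at least three bases.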
4.4 Heuristics Alignment and consensus
Suppose we have a path f → g → h
f = CATAGTC
g = TAACTAT
h = AGACTATCC
Alignment (the last row is the consensus)
CATAGTC-----
--TA-ACTAT--
---AGACTATCC
CATAGACTATCC
4.4 Heuristics Alignment and consensus
Two layouts for the same sequences
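ACT-GG    ACT-GG
ACTTGG    ACTTGG
AC-TGG    AC-TGG
ACT-GG    ACT-GG
AC-TGG    AC-TGG
ACTTGG    ACTTGG
Columns 3 and 4 of each row, where the gaps are placed:
T-    T-
TT    TT
-T    -T
T-    T-
-T    -T
TT    TT
The layouts are compared using sum-of-pairs scoring.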
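Sum-of-pairs scoring adds up, column by column, a score for every pair of rows. The sketch below assumes a simple scheme that is not taken from the slides: +1 for two equal bases, -1 for a mismatch or a base against a gap, and 0 for two gaps. The second layout in the usage example is a hypothetical realignment written for this illustration, not one recovered from the slides.

```python
from itertools import combinations


def pair_score(x, y):
    """Assumed scheme: gap-gap 0, equal bases +1, anything else -1."""
    if x == "-" and y == "-":
        return 0
    return 1 if x == y else -1


def sum_of_pairs(rows):
    """Sum pair_score over all pairs of rows in every column (rows of equal length)."""
    total = 0
    for column in zip(*rows):
        total += sum(pair_score(x, y) for x, y in combinations(column, 2))
    return total


layout = ["ACT-GG", "ACTTGG", "AC-TGG", "ACT-GG", "AC-TGG", "ACTTGG"]
# Hypothetical alternative: the same sequences with every gap in column 3.
realigned = ["AC-TGG", "ACTTGG", "AC-TGG", "AC-TGG", "AC-TGG", "ACTTGG"]
print(sum_of_pairs(layout), sum_of_pairs(realigned))
```

Under this scheme, placing the gaps consistently in one column scores higher than scattering them over two columns. A consensus step can use the score in this way to prefer one layout over another.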