Slide 1
String Matching
String matching: the definition of the problem (text, pattern) depends on what we have: text or patterns
• Exact matching:
• Approximate matching:
• 1 pattern ---> The algorithm depends on |p| and |Σ|
• k patterns ---> The algorithm depends on k, |p| and |Σ|
• The text ---> Data structure for the text (suffix tree, ...)
• The patterns ---> Data structures for the patterns
• Dynamic programming
• Sequence alignment (pairwise and multiple)
• Extensions
• Regular expressions
• Probabilistic search:
• Sequence assembly: hash algorithm
• Hidden Markov Models
Slide 2
Approximate string matching
For instance, given the sequence
CTACTACTTACGTGACTAATACTGATCGTAGCTAC
search for the pattern ACTGA allowing one error.
But what is the meaning of "one error"?
Slide 3
Edit distance
We accept three types of errors:
1. Mismatch: ACCGTGAT ---> ACCGAGAT
2. Insertion: ACCGTGAT ---> ACCGATGAT
3. Deletion: ACCGTGAT ---> ACCGGAT
Insertions and deletions together are called indels.
The edit distance d between two strings is the minimum number of
substitutions, insertions and deletions needed to transform the
first string into the second one.
d(ACT,ACT) =
d(ACT,AC) =
d(ACT,C) =
d(ACT,) =
d(AC,ATC) =
d(ACTTG,ATCTG) =
Slide 4
Edit distance
We accept three types of errors:
1. Mismatch: ACCGTGAT ---> ACCGAGAT
2. Insertion: ACCGTGAT ---> ACCGATGAT
3. Deletion: ACCGTGAT ---> ACCGGAT
Insertions and deletions together are called indels.
The edit distance d between two strings is the minimum number of
substitutions, insertions and deletions needed to transform the
first string into the second one.
d(ACT,ACT) = 0
d(ACT,AC) = 1
d(ACT,C) = 2
d(ACT,) = 3
d(AC,ATC) = 1
d(ACTTG,ATCTG) = 2
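The six values above can be checked mechanically. The following is a minimal sketch, not the algorithm presented in the slides: it applies the definition recursively to prefixes of the two strings and caches the results. The function name edit_distance is an assumption introduced here for illustration.

```python
from functools import lru_cache


def edit_distance(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions and deletions turning a into b."""

    @lru_cache(maxsize=None)
    def d(i: int, j: int) -> int:
        # d(i, j) is the distance between the prefixes a[:i] and b[:j].
        if i == 0:
            return j  # insert the j characters of b[:j]
        if j == 0:
            return i  # delete the i characters of a[:i]
        substitution = d(i - 1, j - 1) + (a[i - 1] != b[j - 1])  # free if equal
        deletion = d(i - 1, j) + 1
        insertion = d(i, j - 1) + 1
        return min(substitution, deletion, insertion)

    return d(len(a), len(b))


# The six pairs from the slide.
for a, b in [("ACT", "ACT"), ("ACT", "AC"), ("ACT", "C"),
             ("ACT", ""), ("AC", "ATC"), ("ACTTG", "ATCTG")]:
    print(f"d({a},{b}) = {edit_distance(a, b)}")
```

Running it prints the distances 0, 1, 2, 3, 1 and 2, in that order.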
Slide 5
Edit distance and alignment of strings
Given
d(ACT,ACT)=0
d(ACT,AC)=1
d(ACTTG,ATCTG)=2
which is the best alignment in each case?
ACT and ACT:
    ACT
    ACT
ACT and AT:
    ACT
    A-T
ACTTG and ATCTG:
    ACTTG        ACT-TG
    ATCTG        A-TCTG
The edit distance is related to the best alignment of the strings:
the alignment suggests the substitutions, insertions and deletions
that transform one string into the other.
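To make the link between alignments and edit operations concrete, here is a small sketch that counts the operations an alignment suggests: every column whose two symbols differ is one substitution, insertion or deletion. The helper name alignment_cost and the use of "-" as the gap symbol are assumptions made for this example.

```python
def alignment_cost(top: str, bottom: str) -> int:
    """Count the edit operations suggested by an alignment written on two rows."""
    if len(top) != len(bottom):
        raise ValueError("both rows of an alignment must have the same length")
    cost = 0
    for x, y in zip(top, bottom):
        if x == "-" and y == "-":
            raise ValueError("a column cannot contain two gaps")
        if x != y:
            # A gap on either row is an indel; two different letters are a mismatch.
            cost += 1
    return cost


print(alignment_cost("ACT", "ACT"))        # identical strings
print(alignment_cost("ACT", "A-T"))        # one deletion
print(alignment_cost("ACTTG", "ATCTG"))    # two mismatches
print(alignment_cost("ACT-TG", "A-TCTG"))  # one deletion and one insertion
```

Both alignments of ACTTG and ATCTG cost 2, so a best alignment is not necessarily unique.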
Slide 6
Edit distance and alignment of strings
But what is the distance between the strings ACGCTATGCTATACG and
ACGGTAGTGACGC, and what is the best alignment between them?
This problem was first discussed in 1966, and the algorithm was
proposed in 1968 and 1970, using the technique called dynamic
programming.
Slide 7
Edit distance and alignment of strings
    C T A C T A C T A C G T

A
C
T
G
A
Slide 9
Edit distance and alignment of strings
    C T A C T A C T A C G T

A
C           *
T
G
A
The cell marked * (row C, fifth column) contains the distance between AC and CTACT.
Slide 10
Edit distance and alignment of strings
    C T A C T A C T A C G T

A
C
T
G
A
?
Slide 11
Edit distance and alignment of strings
    C T A C T A C T A C G T
  0
A
C
T
G
A
?
Slide 12
Edit distance and alignment of strings
    C T A C T A C T A C G T
  0 1
A
C
T
G
A
Alignment for the cell with value 1:
    -
    C
?
Slide 13
Edit distance and alignment of strings
    C T A C T A C T A C G T
  0 1 2
A
C
T
G
A
Alignment for the cell with value 2:
    --
    CT
?
Slide 14
Edit distance and alignment of strings
    C T A C T A C T A C G T
  0 1 2 3 4 5 6 7 8
A
C
T
G
A
Alignment for the cell with value 6:
    ------
    CTACTA
Slide 15
Edit distance and alignment of strings
    C T A C T A C T A C G T
  0 1 2 3 4 5 6 7 8
A ?
C ?
T ?
G
A
Slide 16
Edit distance and alignment of strings
    C T A C T A C T A C G T
  0 1 2 3 4 5 6 7 8
A 1
C 2
T 3
G
A
Alignment for the cell with value 3:
    ACT
    ---
Slide 17
Edit distance and alignment of strings
    C T A C T A C T A C G T
  0 1 2 3 4 5 6 7 8
A 1
C 2
T 3
G
A
BA(AC,CTAC) = best of:
    BA(AC,CTA) followed by the column (- over C)
    BA(A,CTA) followed by the column (C over C)
    BA(A,CTAC) followed by the column (C over -)
d(AC,CTAC) = min { d(AC,CTA)+1, d(A,CTA), d(A,CTAC)+1 }
Slide 18
Edit distance and alignment of strings
    C T A C T A C T A C G T
  0 1 2 3 4 5 6 7 8
A 1
C 2
T 3
G
A
d(AC,CTACT) = min { d(AC,CTAC)+1, d(A,CTAC)+1, d(A,CTACT)+1 }
Here the diagonal term d(A,CTAC) also costs +1, because the last
characters C and T differ.
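The cell-by-cell recurrence can be sketched in code as follows. This is an illustrative implementation under stated assumptions, not the code behind the slides: rows are indexed by prefixes of the first string, columns by prefixes of the second, and the function names edit_matrix and best_alignment are invented for this example. The traceback returns one best alignment; others with the same cost may exist.

```python
def edit_matrix(p: str, t: str) -> list[list[int]]:
    """d[i][j] is the edit distance between the prefixes p[:i] and t[:j]."""
    rows, cols = len(p) + 1, len(t) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # first column: delete the i characters of p[:i]
    for j in range(cols):
        d[0][j] = j  # first row: insert the j characters of t[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i][j - 1] + 1,                           # column (- over t[j-1])
                d[i - 1][j - 1] + (p[i - 1] != t[j - 1]),  # column (p[i-1] over t[j-1])
                d[i - 1][j] + 1,                           # column (p[i-1] over -)
            )
    return d


def best_alignment(p: str, t: str) -> tuple[str, str]:
    """Walk back from the last cell to recover one best alignment."""
    d = edit_matrix(p, t)
    i, j = len(p), len(t)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (p[i - 1] != t[j - 1]):
            top.append(p[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
        elif j > 0 and d[i][j] == d[i][j - 1] + 1:
            top.append("-"); bottom.append(t[j - 1]); j -= 1
        else:
            top.append(p[i - 1]); bottom.append("-"); i -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


m = edit_matrix("ACTGA", "CTACTACT")
print(m[0])             # first row: 0 to 8
print(m[2][4], m[2][5]) # d(AC,CTAC) and d(AC,CTACT)
top, bottom = best_alignment("ACGCTATGCTATACG", "ACGGTAGTGACGC")
print(top)
print(bottom)
```

Filling the matrix takes time proportional to the product of the two string lengths, which is the price of considering every pair of prefixes exactly once.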
Slide 19
Edit distance and alignment of strings Connect to
http://alggen.lsi.upc.es/docencia/ember/leed/Tfc1.htm and use the
global method.
Slide 20
Edit distance and alignment of strings
How can this algorithm be applied to approximate search, that is,
to K-approximate string searching?
Slide 21
K-approximate string searching
    C T A C T A C T A C G T A C T G G T G A A

A
C
T
G
A
This cell
Slide 22
K-approximate string searching
    C T A C T A C T A C G T A C T G G T G A A

A
C
T
G
A
This cell gives the distance between (ACTGA, CTGTA), but we are
only interested in the last characters.
Slide 24
K-approximate string searching
    * * * * * * C T A C G T A C T G G T G A A

A
C
T
G
A
This cell gives the distance between (ACTGA, CTGTA), but we are
only interested in the last characters, no matter where they
appear in the text, so the leading characters are shown as *.
Slide 25
K-approximate string searching
    * * * * * * C T A C G T A C T G G T G A A
  0
A
C
T
G
A
This cell gives the distance between (ACTGA, CTGTA), but we are
only interested in the last characters, no matter where they
appear in the text, so the first cell is set to 0.
Slide 27
K-approximate string searching
    C T A C T A C T A C G T A C T G G T G A A
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A
C
T
G
A
This cell gives the distance between (ACTGA, CTGTA), but we are
only interested in the last characters, no matter where they
appear in the text, so every cell of the first row is set to 0.
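With the first row set to 0, the same recurrence turns into a search procedure: each cell of the last row holds the smallest distance between the pattern and some substring of the text ending at that column. The sketch below is one possible way to code this idea, not the implementation behind the linked page; the name k_approximate_search and the choice to report (end position, distance) pairs with 1-based end positions are assumptions of this example. It keeps only two columns of the matrix at a time.

```python
def k_approximate_search(pattern: str, text: str, k: int) -> list[tuple[int, int]]:
    """Return (end, distance) for every text position where the pattern
    matches some substring ending there with at most k errors.
    End positions are 1-based (the number of text characters consumed)."""
    m = len(pattern)
    # Column for the empty text prefix: d(pattern[:i], "") = i.
    prev = list(range(m + 1))
    matches = []
    for end, c in enumerate(text, start=1):
        cur = [0]  # first row is 0: a match may start anywhere in the text
        for i in range(1, m + 1):
            cur.append(min(
                prev[i - 1] + (pattern[i - 1] != c),  # match or mismatch
                prev[i] + 1,                          # text character left unmatched
                cur[i - 1] + 1,                       # pattern character left unmatched
            ))
        if cur[m] <= k:  # last row: whole pattern aligned, ending at this column
            matches.append((end, cur[m]))
        prev = cur
    return matches


text = "CTACTACTTACGTGACTAATACTGATCGTAGCTAC"
for end, dist in k_approximate_search("ACTGA", text, 1):
    print(f"match ending at position {end} with {dist} error(s)")
```

Neighbouring end positions around an exact occurrence are typically reported too, since shifting the end by one character costs a single indel; a real tool would usually group such overlapping hits.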
Slide 28
K-approximate string searching Connect to
http://alggen.lsi.upc.es/docencia/ember/leed/Tfc1.htm and use the
semi-global method.