Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
Recuperació de la informació
•Modern Information Retrieval (1999)
Ricardo-Baeza Yates and Berthier Ribeiro-Neto
•Flexible Pattern Matching in Strings (2002)
Gonzalo Navarro and Mathieu Raffinot
•http://www-igm.univ-mlv.fr/~lecroq/string/index.html
Algorismes de:
Cerca de patrons (exacta i aproximada)
(String matching i Pattern matching)
Indexació de textos:
Suffix trees, Suffix arrays
String Matching
String matching: definition of the problem (text,pattern)
depends on what we have: text or patterns• Exact matching:
• Approximate matching:
• 1 pattern ---> The algorithm depends on |p| and | |
• k patterns ---> The algorithm depends on k, |p| and | |
• The text ----> Data structure for the text (suffix tree, ...)
• The patterns ---> Data structures for the patterns
• Dynamic programming
• Sequence alignment (pairwise and multiple)
• Extensions
• Regular Expressions
• Probabilistic search:
• Sequence assembly: hash algorithm
Hidden Markov Models
Exact string matching: one pattern (text on-line)
Experimental efficiency (Navarro & Raffinot)
2 4 8 16 32 64 128 256
64
32
16
8
4
2
| |
Long. pattern
Horspool
BNDMBOM
BNDM : Backward Nondeterministic Dawg Matching
BOM : Backward Oracle Matching
w
Multiple string matching
5 10 15 20 25 30 35 40 45
8
4
2
| |Wu-Manber
SBOMlmin
(5 strings)
5 10 15 20 25 30 35 40 45
8
4
2
Wu-Manber
SBOM
(10 strings)
Ad AC
5 10 15 20 25 30 35 40 45
8
4
2
Wu-Manber
SBOM (1000 strings)
Ad AC
5 10 15 20 25 30 35 40 45
8
4
2
Wu-Manber
SBOM
(100 strings)
Ad AC
Trie
Construct the trie of
GTATGTA,GTAT,TAATA,GTGTA
Trie
Construct the trie of
GTATGTA,GTAT,TAATA,GTGTA
G
G
AT
TT A
Trie
Construct the trie of
GTATGTA,GTAT,TAATA,GTGTA
G
G
AT
TT A
Trie
Construct the trie of
GTATGTA,GTAT,TAATA,GTGTA
A
G
G
AT
TT
T A
A
AA T
Trie
Construct the trie of
GTATGTA,GTAT,TAATA,GTGTA
T A
G
G
AT
TT
T
G
A
A
AA T
Which is the cost?
Set Horspool algorithm
Text :
Patterns:
By suffixes
• Which is the next position of the window?
• How the comparison is made?
a
Trie of all inverse patterns
We shift until a is aligned with the first a in the trie not longer than lmin, or lmin
Set Horspool algorithm
Search for ATGTATG,TATG,ATAAT,ATGTG
T A
G
G
AT
TT
T
G
A
A
AA T
1. Construct the trie of GTATGTA,
GTAT, TAATA i GTGTA
2. Determine lmin=
Set Horspool algorithm
Search for ATGTATG,TATG,ATAAT,ATGTG
T A
G
G
AT
TT
T
G
A
A
AA T
1. Construct the trie of GTATGTA,
GTAT, TAATA i GTGTA
2. Determine lmin=4A 1
C 4 (lmin)
G
T
3. Determine the shift table
Set Horspool algorithm
Search for ATGTATG,TATG,ATAAT,ATGTG
T A
G
G
AT
TT
T
G
A
A
AA T
1. Construct the trie of GTATGTA,
GTAT, TAATA i GTGTA
2. Determine lmin=4A 1
C 4 (lmin)
G 2
T
3. Determine the shift table
Set Horspool algorithm
Search for ATGTATG,TATG,ATAAT,ATGTG
4. Find the patterns
T A
G
G
AT
TT
T
G
A
A
AA T
1. Construct the trie of GTATGTA,
GTAT, TAATA i GTGTA
2. Determine lmin=4A 1
C 4 (lmin)
G 2
T 1
3. Determine the shift table
Set Horspool algorithm
Search for ATGTATG,TATG,ATAAT,ATGTG
T A
G
G
AT
TT
T
G
A
A
AA T
text: ACATGCTATGTGACA…
A 1
C 4 (lmin)
G 2
T 1
Set Horspool algorithm
Search for ATGTATG,TATG,ATAAT,ATGTG
T A
G
G
AT
TT
T
G
A
A
AA T
text: ACATGCTATGTGACA…
A 1
C 4 (lmin)
G 2
T 1
Set Horspool algorithm
Search for ATGTATG,TATG,ATAAT,ATGTG
T A
G
G
AT
TT
T
G
A
A
AA T
text: ACATGCTATGTGACA…
A 1
C 4 (lmin)
G 2
T 1
Set Horspool algorithm
Search for ATGTATG,TATG,ATAAT,ATGTG
T A
G
G
AT
TT
T
G
A
A
AA T
text: ACATGCTATGTGACA…
A 1
C 4 (lmin)
G 2
T 1
Set Horspool algorithm
Search for ATGTATG,TATG,ATAAT,ATGTG
T A
G
G
AT
TT
T
G
A
A
AA T
text: ACATGCTATGTGACA…
A 1
C 4 (lmin)
G 2
T 1
Set Horspool algorithm
Search for ATGTATG,TATG,ATAAT,ATGTG
T A
G
G
AT
TT
T
G
A
A
AA T
text: ACATGCTATGTGACA…
A 1
C 4 (lmin)
G 2
T 1
Set Horspool algorithm
Search for ATGTATG,TATG,ATAAT,ATGTG
T A
G
G
AT
TT
T
G
A
A
AA T
text: ACATGCTATGTGACA…
A 1
C 4 (lmin)
G 2
T 1
…
Set Horspool algorithm
Search for ATGTATG,TATG,ATAAT,ATGTG
T A
G
G
AT
TT
T
G
A
A
AA T
text: ACATGCTATGTGACA…
A 1
C 4 (lmin)
G 2
T 1
Is the expected length of the shifts related with the
number of patterns?
…
As more patterns we search for,
shorter shifts we do!
AA 1
AC 3 (LMIN-L+1)
AG
AT
CA
CC
CG
…
2 símbols
Set Horspool algorithm Wu-Manber algorithm
How the length of shifts can be increased?
By reading blocks of symbols instead of only one!
Given ATGTATG,TATG,ATAAT,ATGTG
A 1
C 4 (lmin)
G 2
T 1
1 símbol
AA 1
AC 3 (LMIN-L+1)
AG
AT
CA
CC
CG
…
2 símbols
Set Horspool algorithm Wu-Manber algorithm
How the length of shifts can be increased?
By reading blocks of symbols instead of only one!
Given ATGTATG,TATG,ATAAT,ATGTG
A 1
C 4 (lmin)
G 2
T 1
1 símbol
3
AA 1
AC 3 (LMIN-L+1)
AG
AT
CA
CC
CG
…
2 símbols
Set Horspool algorithm Wu-Manber algorithm
How the length of shifts can be increased?
By reading blocks of symbols instead of only one!
Given ATGTATG,TATG,ATAAT,ATGTG
A 1
C 4 (lmin)
G 2
T 1
1 símbol
31
AA 1
AC 3 (LMIN-L+1)
AG
AT 1
CA
CC
CG
…
2 símbols
Set Horspool algorithm Wu-Manber algorithm
How the length of shifts can be increased?
By reading blocks of symbols instead of only one!
Given ATGTATG,TATG,ATAAT,ATGTG
A 1
C 4 (lmin)
G 2
T 1
1 símbol
3
3
3
3
AA 1
AC 3 (LMIN-L+1)
AG 3
AT 1
CA 3
CC 3
CG 3
…
2 símbols
Set Horspool algorithm Wu-Manber algorithm
How the length of shifts can be increased?
By reading blocks of symbols instead of only one!
Given ATGTATG,TATG,ATAAT,ATGTG
AA 1
AT 1
GT 1
TA 2
TG 2
A 1
C 4 (lmin)
G 2
T 1
1 símbol
Wu-Manber algorithm
Search for ATGTATG,TATG,ATAAT,ATGTG
T A
A
G
G
AT
TT
T
G
A
A
AA T
text: ACATGCTATGTGACATAATA
AA 1
AT 1
GT 1
TA 2
TG 2
Wu-Manber algorithm
Search for ATGTATG,TATG,ATAAT,ATGTG
T A
A
G
G
AT
TT
T
G
A
A
AA T
text: ACATGCTATGTGACATAATA
AA 1
AT 1
GT 1
TA 2
TG 2
Wu-Manber algorithm
Search for ATGTATG,TATG,ATAAT,ATGTG
T A
A
G
G
AT
TT
T
G
A
A
AA T
text: ACATGCTATGTGACATAATA
AA 1
AT 1
GT 1
TA 2
TG 2
Wu-Manber algorithm
Search for ATGTATG,TATG,ATAAT,ATGTG
T A
A
G
G
AT
TT
T
G
A
A
AA T
text: ACATGCTATGTGACATAATA
…
AA 1
AT 1
GT 1
TA 2
TG 2
log|Σ| 2*lmin*k
But given k patterns,
how many symbols
we should take ?
Multiple string matching
5 10 15 20 25 30 35 40 45
8
4
2
| |Wu-Manber
SBOMlmin
(5 strings)
5 10 15 20 25 30 35 40 45
8
4
2
Wu-Manber
SBOM
(10 strings)
Ad AC
5 10 15 20 25 30 35 40 45
8
4
2
Wu-Manber
SBOM (1000 strings)
Ad AC
5 10 15 20 25 30 35 40 45
8
4
2
Wu-Manber
SBOM
(100 strings)
Ad AC
BOM algorithm (Backward Oracle Matching)
• Which is the next position of the window?
• How the comparison is made?
Text :
Pattern : Automata: Factor Oracle
Check if the suffix is a factor of any pattern
The position determined by the last character of the text
with a transition in the automata
Factor Oracle of k strings
How can we build the Factor Oracle of
GTATGTA, GTAA, TAATA i GTGTA ?
T A
A
GG AT TT
T
A
G
A
1,4
32
A
Factor Oracle of k strings
Given the Factor Oracle of GTATGTA
G T
Factor Oracle of k strings
Given the Factor Oracle of GTATGTA
G AT
T
Factor Oracle of k strings
Given the Factor Oracle of GTATGTA
G AT T
T
A
Factor Oracle of k strings
Given the Factor Oracle of GTATGTA
GG AT T
T
A
Factor Oracle of k strings
Given the Factor Oracle of GTATGTA
GG AT T
T
A
G
T
Factor Oracle of k strings
Given the Factor Oracle of GTATGTA
GG AT TT
T
A
G
… we insert GTAA
A
1
Factor Oracle of k strings
…inserting GTAA
GG AT TT
T
A
G
A
1
A
2
Factor Oracle of k strings
Given the AFO of GTATGTA and GTAA
A
GG AT TT
T
A
G
A
… we insert TAATA
1
2
Factor Oracle of k strings
… inserting TAATA
T
A
GG AT TT
T
A
G
A
A1
2
A
3
Factor Oracle of k strings
Given the AFO of GTATGTA, GTAA and TAATA
T A
A
GG AT TT
T
A
G
A
1
32
A
…we insert GTGTA
Factor Oracle of k strings
…inserting GTGTA
T A
A
GG AT TT
T
A
G
A
1
32
A
Factor Oracle of k strings
This is the Automata Factor Oracle of
GTATGTA, GTAA, TAATA and GTGTA
T A
A
GG AT TT
T
A
G
A
1,4
32
A
SBOM algorithm
• Which is the next position of the window?
• How the comparison is made?
Text :
Pattern : Automata: Factor Oracle (Inverse patterns of length lmin)
Check if the suffix is a factor of any pattern
The position determined by the last character of the text
with a transition in the automata
SBOM algorithm: example
… the we build the Automata Factor Oracle of
GTATG, GTAAT, TAATA and GTGTA
of length lmin=5
We search for the patterns
ATGTATG, TAATG,TAATAAT i AATGTG
GG AT TT
TA
G A
T A
A1 4
2 3
A
SBOM algorithm: example
Search for ATGTATG, TAATG,TAATAAT i AATGTG
GG AT TT
TA
G A
T A
A1 4
2 3
text: ACATGCTAGCTATAATAATGTATG
A
SBOM algorithm: example
Search for ATGTATG, TAATG,TAATAAT i AATGTG
GG AT TT
TA
G A
T A
A1 4
2 3
text: ACATGCTAGCTATAATAATGTATG
A
SBOM algorithm: example
Search for ATGTATG, TAATG,TAATAAT i AATGTG
GG AT TT
TA
G A
T A
A1 4
2 3
text: ACATGCTAGCTATAATAATGTATG
A
SBOM algorithm: example
Search for ATGTATG, TAATG,TAATAAT i AATGTG
GG AT TT
TA
G A
T A
A1 4
2 3
text: ACATGCTAGCTATAATAATGTATG
A
SBOM algorithm: example
Search for ATGTATG, TAATG,TAATAAT i AATGTG
GG AT TT
TA
G A
T A
A1 4
2 3
text: ACATGCTAGCTATAATAATGTATG
A
SBOM algorithm: example
Search for ATGTATG, TAATG,TAATAAT i AATGTG
GG AT TT
TA
G A
T A
A1 4
2 3
text: ACATGCTAGCTATAATAATGTATG
A
SBOM algorithm: example
Search for ATGTATG, TAATG,TAATAAT i AATGTG
GG AT TT
TA
G A
T A
A1 4
2 3
text: ACATGCTAGCTATAATAATGTATG
A
SBOM algorithm: example
Search for ATGTATG, TAATG,TAATAAT i AATGTG
GG AT TT
TA
G A
T A
A1 4
2 3
text: ACATGCTAGCTATAATAATGT…
A
Multiple string matching
5 10 15 20 25 30 35 40 45
8
4
2
| |Wu-Manber
SBOMlmin
(5 strings)
5 10 15 20 25 30 35 40 45
8
4
2
Wu-Manber
SBOM
(10 strings)
Ad AC
5 10 15 20 25 30 35 40 45
8
4
2
Wu-Manber
SBOM (1000 strings)
Ad AC
5 10 15 20 25 30 35 40 45
8
4
2
Wu-Manber
SBOM
(100 strings)
Ad AC