68
Recuperació de la informació rn Information Retrieval (1999) Ricardo-Baeza Yates and Berthier Ribeiro-Net ible Pattern Matching in Strings (2002) Gonzalo Navarro and Mathieu Raffinot tic.ppurl.com/chmview-V1JRYFF-BnMAZgFqD1NVOlZ0VzMMZgdqUDABMwI9BWc=/0001.html rithms on strings (2001) M. Crochemore, C. Hancart and T. Lecroq ://www-igm.univ-mlv.fr/~lecroq/string/index

Recuperació de la informació

Embed Size (px)

DESCRIPTION

Recuperació de la informació. Modern Information Retrieval (1999) Ricardo-Baeza Yates and Berthier Ribeiro-Neto Flexible Pattern Matching in Strings (2002) Gonzalo Navarro and Mathieu Raffinot http://static.ppurl.com/chmview-V1JRYFF-BnMAZgFqD1NVOlZ0VzMMZgdqUDABMwI9BWc=/0001.html - PowerPoint PPT Presentation

Citation preview

Page 1: Recuperació de la informació

Recuperació de la informació

•Modern Information Retrieval (1999)Ricardo-Baeza Yates and Berthier Ribeiro-Neto

•Flexible Pattern Matching in Strings (2002)Gonzalo Navarro and Mathieu Raffinot

http://static.ppurl.com/chmview-V1JRYFF-BnMAZgFqD1NVOlZ0VzMMZgdqUDABMwI9BWc=/0001.html

•Algorithms on strings (2001)M. Crochemore, C. Hancart and T. Lecroq

•http://www-igm.univ-mlv.fr/~lecroq/string/index.html

Page 2: Recuperació de la informació

String Matching

String matching: definition of the problem (text,pattern) depends on what we have: text or patterns• Exact matching:

• Approximate matching:

• 1 pattern ---> The algorithm depends on |p| and || • k patterns ---> The algorithm depends on k, |p| and ||

• The text ----> Data structure for the text (suffix tree, ...)

• The patterns ---> Data structures for the patterns

• Dynamic programming • Sequence alignment (pairwise and multiple)

• Extensions • Regular Expressions

• Probabilistic search:

• Sequence assembly: hash algorithm

Hidden Markov Models

Page 3: Recuperació de la informació

Extended string matching

• Classes of characters: when in some DNA files or patterns there are new characters as N or R that means N={A,C,G,T} and R={G,A}.

• Bounded length gaps: we find pattern as ATx(2,3)TA where x(2,3) means any 2 or 3 characters.

• Optional characters: we find pattern as AC?ACT?T?A where C? means that C may or may not appear in the text.

• Wild cards: we find pattern as AT*TA where * means an arbitrary long string.

• Repeatable characters: we find pattern as AT[TA]*AT where [TA]* means that TA can appear zero or more times..

Page 4: Recuperació de la informació

Classes of characters

There are characters in the text that represent sets of simbols

1. Classes of characters in the tetx.

There are characters in the pattern that represent sets of simbols

2. Classes of characters in the pattern.

There are classes of characters represented by onesymbol. For instace the IUPAC code for the

DNA alphabet is:R = {G,A} Y = {T,C} K = {G,T} M = {A,C} S = {G,C} W = {A,T}

B = {G,T,C } D = {G,A,T} H = {A,C,T} V = {G,C,A} N = {A,G,C,T} (any)

Page 5: Recuperació de la informació

Extended alphabets

First part

Classes in the text

Page 6: Recuperació de la informació

Classes in the text: Brute force algorithm

Text : over 2|∑|

Pattern over From left to right: prefix

• Which is the next position of the window?

• How the comparison is made?

Pattern :

Text :

The window is shifted only one cell

We need the operation: belongs to a set ?

?

Page 7: Recuperació de la informació

Classes in the text: Brute force algorithm

Every subset of is represented by a string of bits of length | |.

When || < computer word

For instance, given the DNA alphabet ={A,C,G,T} : I(A)=(1,0,0,0), I(C)=(0,1,0,0),... I(R)=I(G,A)=( , , , )

Page 8: Recuperació de la informació

Classes in the text: Brute force algorithm

Every subset of is represented by a string of bits of length | |.

When || < computer word

For instance, given the DNA alphabet ={A,C,G,T} : I(A)=(1,0,0,0), I(C)=(0,1,0,0),... I(R)=I(G,A)=(1,0,1,0)...I(N)=( , , , )

Page 9: Recuperació de la informació

Classes in the text: Brute force algorithm

Every subset of is represented by a string of bits of length | |.

When || < computer word

For instance, given the DNA alphabet ={A,C,G,T} : I(A)=(1,0,0,0), I(C)=(0,1,0,0),... I(R)=I(G,A)=(1,0,1,0)...I(N)=(1,1,1,1)

Then the operation “A belongs to set X” is made with ...

Page 10: Recuperació de la informació

Classes in the text: Brute force algorithm

Every subset of is represented by a string of bits of length | |.

G T A R T R N A G G A ...A T G T A

A T G T A

When || < computer word

For instance, given the DNA alphabet ={A,C,G,T} : I(A)=(1,0,0,0), I(C)=(0,1,0,0),... I(R)=I(G,A)=(1,0,1,0)...I(N)=(1,1,1,1)

Then the operation “A belongs to set X” is made with I(A) and I(X) >0

I(A) & I(T)>0

Page 11: Recuperació de la informació

Classes in the text: Brute force algorithm

Every subset of is represented by a string of bits of length | |.

G T A R T R N A G G A ...A T G T A

A T G T A

A T G T A

When || < computer word

For instance, given the DNA alphabet ={A,C,G,T} : I(A)=(1,0,0,0), I(C)=(0,1,0,0),... I(R)=I(G,A)=(1,0,1,0)...I(N)=(1,1,1,1)

Then the operation “A belongs to set X” is made with I(A) and I(X) >0

I(A) & I(T)>0

I(A) & I(R)>0 I(T) & I(T)>0 I(G) & I(R)>0 I(T) & I(A)>0

Page 12: Recuperació de la informació

Classes in the text: Brute force algorithm

Every subset of is represented by a string of bits of length | |.

G T A R T R N A G G A ...A T G T A

A T G T A

A T G T A

When || < computer word

For instance, given the DNA alphabet ={A,C,G,T} : I(A)=(1,0,0,0), I(C)=(0,1,0,0),... I(R)=I(G,A)=(1,0,1,0)...I(N)=(1,1,1,1)

Then the operation “A belongs to set X” is made with I(A) and I(X) >0

I(A) & I(T)>0

I(A) & I(R)>0 I(T) & I(T)>0 I(G) & I(R)>0 I(T) & I(A)>0

I(A) & I(N)>0 I(T) & I(R)>0

...Which is the cost?

Page 13: Recuperació de la informació

Classes in the text

Experimental efficiency (Navarro & Raffinot)

2 4 8 16 32 64 128 256

64

32

16

8

4

2

| |

Long. pattern

Horspool

BNDMBOM

BNDM : Backward Nondeterministic Dawg Matching

BOM : Backward Oracle Matching

w

Page 14: Recuperació de la informació

Classes in the text: Horspool algorithm

Text :

Pattern :Sufix search

• Which is the next position of the window?

• How the comparison is made?

Pattern :

Text : a

Shift until the next ocurrence of “a” (or “t”,”r”,…) in the pattern:

aa a

a a a

We need a shift table with the extended alphabet.

Page 15: Recuperació de la informació

Classes in the text :Horspool example

Given the pattern ATGTA

• The shift table is:

A 4C 5G 2T 1R ?…N ?

Page 16: Recuperació de la informació

Classes in the text :Horspool example

Given the pattern ATGTA

• The shift table is:

A 4C 5G 2T 1R 2…N ?

Page 17: Recuperació de la informació

Classes in the text :Horspool example

Given the pattern ATGTA

• The shift table is:

A 4C 5G 2T 1R 2…N 1

text : G T A R T R N A A G G A …A T G T A

A T G T A

A T G T A

Page 18: Recuperació de la informació

Classes in the text :Horspool example

Given the pattern ATGTA

• The shift table is:

A 4C 5G 2T 1R 2…N 1

text : G T A R T R N A A G G A ...A T G T A

A T G T A

A T G T A A T G T A

Page 19: Recuperació de la informació

Classes in the text

Experimental efficiency (Navarro & Raffinot)

2 4 8 16 32 64 128 256

64

32

16

8

4

2

| |

Long. pattern

Horspool

BNDMBOM

BNDM : Backward Nondeterministic Dawg Matching

BOM : Backward Oracle Matching

w

Page 20: Recuperació de la informació

Text :

Pattern :

Search for suffixes of T that are factors of the pattern

Classes in the text: BNDM algorithm

• Which is the next position of the window ?

• How the comparison is made?

…that is denoted as

D2 = 1 0 0 0 1 0 0

Depends on the value of the leftmost bit of D

Once the next character x is read D3 = D2<<1 & B(x)

B(x): mask of x in the pattern P. For instance, if B(x) = ( 0 0 1 1 0 0 0)

D = (0 0 0 1 0 0 0) & (0 0 1 1 0 0 0 ) = (0 0 0 1 0 0 0 )

x

Page 21: Recuperació de la informació

Classes in the text : BNDM example

Given the pattern ATGTA

• The masks of bits are

B(A) = ( 1 0 0 0 1 ) B(R)=( )B(C) = ( 0 0 0 0 0 ) …B(G) = ( 0 0 1 0 0 ) B(N)=( )B(T) = ( 0 1 0 1 0 )

Page 22: Recuperació de la informació

Classes in the text : BNDM example

Given the pattern ATGTA B(A) = ( 1 0 0 0 1 ) B(R)=(1 0 1 0 1)B(C) = ( 0 0 0 0 0 ) …B(G) = ( 0 0 1 0 0 ) B(N)=( )B(T) = ( 0 1 0 1 0 )

• The masks of bits are

Page 23: Recuperació de la informació

Classes in the text : BNDM example

Given the pattern ATGTA

• text : G T A R T R N A G G A C G ...A T G T A

A T G T A

B(A) = ( 1 0 0 0 1 ) B(R)=(1 0 1 0 1)B(C) = ( 0 0 0 0 0 ) …B(G) = ( 0 0 1 0 0 ) B(N)=(1 1 1 1 1)B(T) = ( 0 1 0 1 0 )

D1 = ( 0 1 0 1 0 )D2 = ( 1 0 1 0 0 ) & ( 1 0 1 0 1 ) = ( 1 0 1 0 0 )D2 = ( 0 1 0 0 0 ) & ( 1 0 0 0 1 ) = ( 0 0 0 0 0 )

• The masks of bits are

Page 24: Recuperació de la informació

Classes in the text : BNDM example

Given the pattern ATGTA

• text : G T A R T R N A G G A C G ...A T G T A

A T G T A

B(A) = ( 1 0 0 0 1 ) B(R)=(1 0 1 0 1)B(C) = ( 0 0 0 0 0 ) …B(G) = ( 0 0 1 0 0 ) B(N)=(1 1 1 1 1)B(T) = ( 0 1 0 1 0 )

D1 = ( 0 1 0 1 0 )D2 = ( 1 0 1 0 0 ) & ( 1 0 1 0 1 ) = ( 1 0 1 0 0 )

D1 = ( 1 0 0 0 1 )D2 = ( 0 0 0 1 0 ) & ( 1 1 1 1 1 ) = ( 0 0 0 1 0 )

D2 = ( 0 1 0 0 0 ) & ( 1 0 0 0 1 ) = ( 0 0 0 0 0 )

D3 = ( 0 0 1 0 0 ) & ( 1 0 1 0 1 ) = ( 0 0 1 0 0 )D4 = ( 0 1 0 0 0 ) & ( 0 1 0 1 0 ) = ( 0 1 0 0 0 )D5 = ( 1 0 0 0 0 ) & ( 1 0 1 0 1 ) = ( 1 0 0 0 0)

• The masks of bits are

Page 25: Recuperació de la informació

Classes in the text : BNDM example

Given the pattern ATGTA

• text : G T A R T R N A G G A C G ...A T G T A

A T G T A

A T G T A

B(A) = ( 1 0 0 0 1 ) B(R)=(1 0 1 0 1)B(C) = ( 0 0 0 0 0 ) …B(G) = ( 0 0 1 0 0 ) B(N)=(1 1 1 1 1)B(T) = ( 0 1 0 1 0 )

D1 = ( 0 1 0 1 0 )D2 = ( 1 0 1 0 0 ) & ( 1 0 1 0 1 ) = ( 1 0 1 0 0 )

D1 = ( 1 0 0 0 1 )D2 = ( 0 0 0 1 0 ) & ( 1 1 1 1 1 ) = ( 0 0 0 1 0 )

D2 = ( 0 1 0 0 0 ) & ( 1 0 0 0 1 ) = ( 0 0 0 0 0 )

D3 = ( 0 0 1 0 0 ) & ( 1 0 1 0 1 ) = ( 0 0 1 0 0 )D4 = ( 0 1 0 0 0 ) & ( 0 1 0 1 0 ) = ( 0 1 0 0 0 )D5 = ( 1 0 0 0 0 ) & ( 1 0 1 0 1 ) = ( 1 0 0 0 0)

• The masks of bits are

Page 26: Recuperació de la informació

Classes in the text

Experimental efficiency (Navarro & Raffinot)

2 4 8 16 32 64 128 256

64

32

16

8

4

2

| |

Long. pattern

Horspool

BNDMBOM

BNDM : Backward Nondeterministic Dawg Matching

BOM : Backward Oracle Matching

w

Page 27: Recuperació de la informació

BOM algorithm (Backward Oracle Matching)

• Which is the next position of the window?

• How the comparison is made?

Text :

Pattern : Automata: Factor Oracle

Check if the suffix is a factor

The position determined by the last character of the text with a transition in the automata

Page 28: Recuperació de la informació

Classes in the text: BOM example

The we build the AFO of the inverse pattern of ATGTATG

… and we try to find… : G T A R T R N A A T G…A T G T A T G

GG AT T ATTA

G

It’s not possible any improvement!

Page 29: Recuperació de la informació

Multiple string matching

5 10 15 20 25 30 35 40 45

8

4

2

| |Wu-Manber

SBOMlmin

(5 strings)

5 10 15 20 25 30 35 40 45

8

4

2

Wu-Manber

SBOM(10 strings)

Ad AC

5 10 15 20 25 30 35 40 45

8

4

2

Wu-Manber

SBOM (1000 strings)

Ad AC

5 10 15 20 25 30 35 40 45

8

4

2

Wu-ManberSBOM

(100 strings)

Ad AC

Page 30: Recuperació de la informació

Classes in the text: Set Horspool algorithm

Text :

Patterns:

By suffixes

• Which is the next position of the window?

• How the comparison is made?

Trie of all inverse patterns

?

Page 31: Recuperació de la informació

Set Horspool algorithm

Search for ATGTATG,TATG,ATAAT,ATGTG

4. Find the patterns

T A

G

GA

TTT

T

G

A

A

AA T

1. Construct the trie of GTATGTA, GTAT, TAATA i GTGTA

2. Determine lmin=4A 1C 4 (lmin)G 2T 1

3. Determine the shift table

Page 32: Recuperació de la informació

Classes in the text: Set Horspool

Search for the patterns ATGTATG,TATG,ATAAT,ATGTG

T A

A

G

GA

TTT

T

G

A

A

AA T

text: ARTGNCTATGTGACA…

It’s not possible any improvement!

Page 33: Recuperació de la informació

Multiple string matching

5 10 15 20 25 30 35 40 45

8

4

2

| |Wu-Manber

SBOMlmin

(5 strings)

5 10 15 20 25 30 35 40 45

8

4

2

Wu-Manber

SBOM(10 strings)

Ad AC

5 10 15 20 25 30 35 40 45

8

4

2

Wu-Manber

SBOM (1000 strings)

Ad AC

5 10 15 20 25 30 35 40 45

8

4

2

Wu-ManberSBOM

(100 strings)

Ad AC

Page 34: Recuperació de la informació

Classes in the text: SBOM algorithm

• Which is the next position of the window?

• How the comparison is made?

Text :

Pattern : Automata: Factor Oracle (Inverse patterns of length lmin)

Check if the suffix is a factor of any pattern

The position determined by the last character of the text with a transition in the automata

Page 35: Recuperació de la informació

Classes in the text: SBOM example

Search for the patterns ATGTATG, TAATG,TAATAAT i AATGTG

GG AT TTTA

G A

T A

A1 4

2 3

text: ACATN C TAGC TA TA ATAATGTATG

A

It’s not possible any improvement!

Page 36: Recuperació de la informació

Extended alphabets

Classes in the:text pattern

Horspool ✓BNDM ✓BOM ✗

Set-Horspool ✗SBOM ✗

Page 37: Recuperació de la informació

Extended search

Second part

Classes in the pattern

Page 38: Recuperació de la informació

Classes in the pattern: Brute force algorithm

Text : over From left to right: prefix

• Which is the next position of the window?

• How the comparison is made?

Pattern :

Text :

The window is shifted only one cell

We need the operation: belongs to a set ?

?

Pattern : over 2|∑|

Page 39: Recuperació de la informació

Classes in the pattern: Brute force algorithm

Every subset is represented by a string of bits of length | |.

G T A C T A G A G G A C G T A T G T A C T G ...A T N T R

A T N T R

When || < computer word

For instance, given the DNA alphabet ={A,C,G,T} : I(A)=(1,0,0,0), I(C)=(0,1,0,0),... I(R)=(1,0,1,0,),..., I(N)=(1,1,1,1)

Then the operation “A belongs to set X” is made with I(A) and I(X) >0

I(T) and I(R) >0

I(A) and I(R) >0 I(T) and I(T) >0 I(C) and I(N) >0 I(A) and I(T) >0 …

Page 40: Recuperació de la informació

Classes in the text

Experimental efficiency (Navarro & Raffinot)

2 4 8 16 32 64 128 256

64

32

16

8

4

2

| |

Long. pattern

Horspool

BNDMBOM

BNDM : Backward Nondeterministic Dawg Matching

BOM : Backward Oracle Matching

w

Page 41: Recuperació de la informació

Classes in the pattern: Horspool algorithm

Text :

Pattern :Sufix search

• Which is the next position of the window?

• How the comparison is made?

Pattern :

Text : a

Shift until the next ocurrence of “a” in the pattern:

aa a

a a a

We need a preprocessing phase to construct the shift table.

Page 42: Recuperació de la informació

Classes in the pattern: Horspool example

Given the pattern ATNTR

• The shift table is:

A C G T

Page 43: Recuperació de la informació

Classes in the pattern: Horspool example

Given the pattern ATNTR

• The shift table is:

A 2C G T

Page 44: Recuperació de la informació

Classes in the pattern: Horspool example

Given the pattern ATNTR

• The shift table is:

A 2C 2G T

Page 45: Recuperació de la informació

Classes in the pattern: Horspool example

Given the pattern ATNTR

• The shift table is:

A 2C 2G 2T

Page 46: Recuperació de la informació

Classes in the pattern: Horspool example

Given the pattern ATNTR

• The shift table is:

A 2C 2G 2T 1

text : G T A C T A G A T A T G A G ...A T N T R

A T N T R

A T N T R

A T N T R A T N T R

Page 47: Recuperació de la informació

Classes in the pattern: Horspool example

Given the pattern ATNTR

• The shift table is:

A 2C 2G 2T 1

text : G T A C T A G A T A T G A G ...A T N T R

A T N T R

A T N T R

A T N T R A T N T R

A T G T A

Shorter shifts!

Page 48: Recuperació de la informació

Classes in the text

Experimental efficiency (Navarro & Raffinot)

2 4 8 16 32 64 128 256

64

32

16

8

4

2

| |

Long. pattern

Horspool

BNDMBOM

BNDM : Backward Nondeterministic Dawg Matching

BOM : Backward Oracle Matching

w

Page 49: Recuperació de la informació

Text :

Pattern :

Search for suffixes of T that are factors of the pattern

Classes in the text: BNDM algorithm

• Which is the next position of the window ?

• How the comparison is made?

…that is denoted as

D2 = 1 0 0 0 1 0 0

Depends on the value of the leftmost bit of D

Once the next character x is read D3 = D2<<1 & B(x)

B(x): mask of x in the pattern P. For instance, if B(x) = ( 0 0 1 1 0 0 0)

D = (0 0 0 1 0 0 0) & (0 0 1 1 0 0 0 ) = (0 0 0 1 0 0 0 )

x

Page 50: Recuperació de la informació

Classes in the pattern : BNDM example

Given the pattern ATNTR

• The masks of bits of symbols are

B(A) = ( )B(C) = ( )B(G) = ( )B(T) = ( )

Page 51: Recuperació de la informació

Classes in the pattern : BNDM example

Given the pattern ATNTR

• The masks of bits of symbols are

B(A) = ( 1 0 1 0 1 )B(C) = ( )B(G) = ( )B(T) = ( )

Page 52: Recuperació de la informació

Classes in the pattern : BNDM example

Given the pattern ATNTR

• The masks of bits of symbols are

B(A) = ( 1 0 1 0 1 )B(C) = ( 0 0 1 0 0 )B(G) = ( )B(T) = ( )

Page 53: Recuperació de la informació

Classes in the pattern : BNDM example

Given the pattern ATNTR

• The masks of bits of symbols are

B(A) = ( 1 0 1 0 1 )B(C) = ( 0 0 1 0 0 )B(G) = ( 0 0 1 0 1 )B(T) = ( )

Page 54: Recuperació de la informació

Classes in the pattern : BNDM example

Given the pattern ATNTR

• text : G T A C T A G A G G A C G T A T G T A C T G ...A T N T R

A T N T R

A T N T R

• The masks of bits of symbols are

B(A) = ( 1 0 1 0 1 )B(C) = ( 0 0 1 0 0 )B(G) = ( 0 0 1 0 1 )B(T) = ( 0 1 1 1 0 )

D1 = ( 0 1 1 1 0 )D2 = ( 1 1 1 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 1 0 0 )

D1 = ( 0 0 1 0 1 )D2 = ( 0 1 0 1 0 ) & ( 0 0 1 0 1 ) = ( 0 0 0 0 0 )

D3 = ( 0 1 0 0 0 ) & ( 1 0 1 0 1 ) = ( 0 0 0 0 0 )

D1 = ( 1 0 1 0 1 )D2 = ( 0 1 0 1 0 ) & ( 0 1 1 1 0 ) = ( 0 1 0 1 0 )

D3 = ( 1 0 1 0 0 ) & ( 0 0 1 0 1 ) = ( 0 0 1 0 0 )D4 = ( 0 1 0 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 0 0 0 ) A T N T R

Page 55: Recuperació de la informació

Classes in the text

Experimental efficiency (Navarro & Raffinot)

2 4 8 16 32 64 128 256

64

32

16

8

4

2

| |

Long. pattern

Horspool

BNDMBOM

BNDM : Backward Nondeterministic Dawg Matching

BOM : Backward Oracle Matching

w

Page 56: Recuperació de la informació

BOM algorithm (Backward Oracle Matching)

• Which is the next position of the window?

• How the comparison is made?

Text :

Pattern : Automata: Factor Oracle

Check if the suffix is a factor

The position determined by the last character of the text with a transition in the automata

Page 57: Recuperació de la informació

Classes in the pattern: BOM example

• Given the pattern ATGTATG, the AFO is

GG AT T ATTA

G

We should apply the SBOM algorithm!

but for the patter ATNTRTG?

Page 58: Recuperació de la informació

Multiple string matching

5 10 15 20 25 30 35 40 45

8

4

2

| |Wu-Manber

SBOMlmin

(5 strings)

5 10 15 20 25 30 35 40 45

8

4

2

Wu-Manber

SBOM(10 strings)

Ad AC

5 10 15 20 25 30 35 40 45

8

4

2

Wu-Manber

SBOM (1000 strings)

Ad AC

5 10 15 20 25 30 35 40 45

8

4

2

Wu-ManberSBOM

(100 strings)

Ad AC

Page 59: Recuperació de la informació

Set Horspool algorithm

Text :

Patterns:

By suffixes

• Which is the next position of the window?

• How the comparison is made?

a

Trie of all inverse patterns

We shift until a is aligned with the first a in the trie not longer than lmin, or lmin

Page 60: Recuperació de la informació

Set Horspool algorithm

Search for ATNTARG,RTGR,NTTNAR,ATRTG

4. Find the patterns

1. Construct the trie of the 46 possible inverse patterns

2. Determine lmin=4

A 1C 2G 1T 1

3. Determine the shift table

Page 61: Recuperació de la informació

Multiple string matching

5 10 15 20 25 30 35 40 45

8

4

2

| |Wu-Manber

SBOMlmin

(5 strings)

5 10 15 20 25 30 35 40 45

8

4

2

Wu-Manber

SBOM(10 strings)

Ad AC

5 10 15 20 25 30 35 40 45

8

4

2

Wu-Manber

SBOM (1000 strings)

Ad AC

5 10 15 20 25 30 35 40 45

8

4

2

Wu-ManberSBOM

(100 strings)

Ad AC

Page 62: Recuperació de la informació

SBOM algorithm

• Which is the next position of the window?

• How the comparison is made?

Text :

Pattern : Automata: Factor Oracle (Inverse patterns of length lmin)

Check if the suffix is a factor of any pattern

The position determined by the last character of the text with a transition in the automata

Page 63: Recuperació de la informació

Classes in the patterns: SBOM example

the Automata Factor Oracle of all 21 possible patterns is built …

Given the patternsATGNARG, TRATR,TAATAAT i ANTNTGR

Page 64: Recuperació de la informació

Extended alphabets

Classes in the:text pattern

Horspool ✓ ✓BNDM ✓ ✓BOM ✗ ≈

Set-Horspool ✗ ≈ SBOM ✗ ≈

Page 65: Recuperació de la informació

Extended string matching

• Classes of characters: when in some DNA files or patterns there are new characters as N or R that means N={A,C,G,T} and R={G,A}.

• Bounded length gaps: we find pattern as ATx(2,3)TA where x(2,3) means any 2 or 3 characters.

• Optional characters: we find pattern as AC?ACT?T?A where C? means that C may or may not appear in the text.

• Wild cards: we find pattern as AT*TA where * means an arbitrary long string.

• Repeatable characters: we find pattern as AT[TA]*AT where [TA]* means that TA can appear zero or more times..

Page 66: Recuperació de la informació

Bounded length gaps : BNDM example

Given the pattern ATx(2,3)TA

• The masks of bits are

B(A) = ( 1 0 1 1 1 0 1 ) B(C) = ( 0 0 1 1 1 0 0 )B(G) = ( 0 0 1 1 1 0 0 )B(T) = ( 0 1 1 1 1 1 0 )

Page 67: Recuperació de la informació

Bounded length gaps : BNDM example

Given the pattern ATx(2,3)TA

• text : A T A G T A G A G T ...

D2 = ( 0 1 1 1 0 1 0 ) & ( 0 1 1 1 1 1 0 ) = ( 0 1 1 1 0 1 0 )D3 = ( 1 1 1 0 1 0 0 ) & ( 0 0 1 1 1 0 0 ) = ( 0 0 1 0 1 0 0 )

• The masks of bits are• The masks of bits are

B(A) = ( 1 0 1 1 1 0 1 ) B(C) = ( 0 0 1 1 1 0 0 )B(G) = ( 0 0 1 1 1 0 0 )B(T) = ( 0 1 1 1 1 1 0 )

D1 = ( 1 0 1 1 1 0 1 )

D4 = ( 0 1 0 1 0 0 0 ) & ( 1 0 1 1 1 0 1 ) = ( 0 0 0 1 0 0 0 )

D5 = ( 0 0 1 0 0 0 0 ) & ( 0 1 1 1 1 1 0) = ( 0 0 1 0 0 0 0 )

D6 = ( 0 1 0 0 0 0 0 ) & ( 1 0 1 1 1 0 1) = ( 0 0 0 0 0 0 0 ) ?

Page 68: Recuperació de la informació

Bounded length gaps : BNDM example

Given the pattern ATx(2,3)TA • text : A T A G T A G A G T ...

D2 = ( 0 1 1 1 0 1 0 ) & ( 0 1 1 1 1 1 0 ) = ( 0 1 1 1 0 1 0 )D3 = ( 1 1 1 0 1 0 0 ) & ( 0 0 1 1 1 0 0 ) = ( 0 0 1 0 1 0 0 )

D1 = ( 1 0 1 1 1 0 1 )

D4 = ( 0 1 0 1 0 0 0 ) & ( 1 0 1 1 1 0 1 ) = ( 0 0 0 1 0 0 0 )D5 = ( 0 0 1 0 0 0 0 ) & ( 0 1 1 1 1 1 0) = ( 0 0 1 0 0 0 0 )

D6 = ( 0 1 0 0 0 0 0 ) & ( 1 0 1 1 1 0 1) = ( 0 0 0 0 0 0 0 ) ?AT***TA

Let’s see the automaton:

ε0 1 0 0 0 0 0 -0 0 0 1 0 0 0

0 0 1 1 0 0 0

= F= I

D [ F - (I & D) ] & ¬ F

( 0 0 1 1 0 0 0 ) 1 1 1 1