27
String Matching String matching: definition of the problem (text,pattern)depends on what we have: text or patterns Exact matching: Approximate matching: 1 pattern ---> The algorithm depends on |p| and || k patterns ---> The algorithm depends on k, |p| and || The text ----> Data structure for the text (suffix tree, ...) The patterns ---> Data structures for the patterns Dynamic programming Sequence alignment (pairwise and multiple) Extensions Regular Expressions Probabilistic search: Sequence assembly: hash algorithm Hidden Markov Models

String Matching String matching: definition of the problem (text,pattern) depends on what we have: text or patterns Exact matching: Approximate matching:

  • View
    252

  • Download
    2

Embed Size (px)

Citation preview

Page 1: String Matching String matching: definition of the problem (text,pattern) depends on what we have: text or patterns Exact matching: Approximate matching:

String Matching

String matching: definition of the problem (text,pattern) depends on what we have: text or patterns• Exact matching:

• Approximate matching:

• 1 pattern ---> The algorithm depends on |p| and || • k patterns ---> The algorithm depends on k, |p| and ||

• The text ----> Data structure for the text (suffix tree, ...)

• The patterns ---> Data structures for the patterns

• Dynamic programming • Sequence alignment (pairwise and multiple)

• Extensions • Regular Expressions

• Probabilistic search:

• Sequence assembly: hash algorithm

Hidden Markov Models

Page 2: String Matching String matching: definition of the problem (text,pattern) depends on what we have: text or patterns Exact matching: Approximate matching:

Approximate string matching

For instance, given the sequence

CTACTACTTACGTGACTAATACTGATCGTAGCTAC…

search for the pattern ACTGA allowing one error…

… but what is the meaning of “one error”?

Page 3: String Matching String matching: definition of the problem (text,pattern) depends on what we have: text or patterns Exact matching: Approximate matching:

Edit distance

We accept three types of errors:

The edit distance d between two strings is the minimum number of

substitutions,insertions and deletionsneeded to transform the first string into the second one

d(ACT,ACT)= d(ACT,AC)= d(ACT,C)=d(ACT,)= d(AC,ATC)= d(ACTTG,ATCTG)=

3. Deletion: ACCGTGAT ACCGGAT

2. Insertion: ACCGTGAT ACCGATGAT

1. Mismatch: ACCGTGAT ACCGAGAT

Indel

Page 4: String Matching String matching: definition of the problem (text,pattern) depends on what we have: text or patterns Exact matching: Approximate matching:

Edit distance

We accept three types of errors:

The edit distance d between two strings is the minimum number of

substitutions,insertions and deletionsneeded to transform the first string into the second one

3. Deletion: ACCGTGAT ACCGGAT

2. Insertion: ACCGTGAT ACCGATGAT

1. Mismatch: ACCGTGAT ACCGAGAT

d(ACT,ACT)= d(ACT,AC)= d(ACT,C)=d(ACT,)= d(AC,ATC)= d(ACTTG,ATCTG)=

Indel

0 1 23 1 2

Page 5: String Matching String matching: definition of the problem (text,pattern) depends on what we have: text or patterns Exact matching: Approximate matching:

Edit distance and alignment of strings

• ACT and ACT : ACT ACT

• ACTTG and ATCTG:

• ACT and AT: ACT A -T

ACTTG ATCTG

ACT - TGA - TCTG

Given d(ACT,ACT)=0 d(ACT,AC)=1 d(ACTTG,ATCTG)=2which is the best alignment in every case?

The Edit distance is related with the best alignment of strings

Then, the alignment suggest the substitutions, insertions and deletions to transform one string into the other

Page 6: String Matching String matching: definition of the problem (text,pattern) depends on what we have: text or patterns Exact matching: Approximate matching:

Edit distance and alignment of strings

But which is the distance between the strings

ACGCTATGCTATACG and ACGGTAGTGACGC?

… and the best alignment between them?

1966 was the first time this problem was discussed…

and the algorithm was proposed in 1968,1970,…

using the technique called “Dynamic programming”

Page 7: String Matching String matching: definition of the problem (text,pattern) depends on what we have: text or patterns Exact matching: Approximate matching:

Edit distance and alignment of strings

C T A C T A C T A C G T

ACTGA

Page 8: String Matching String matching: definition of the problem (text,pattern) depends on what we have: text or patterns Exact matching: Approximate matching:

Edit distance and alignment of strings

C T A C T A C T A C G T

ACTGA

Page 9: String Matching String matching: definition of the problem (text,pattern) depends on what we have: text or patterns Exact matching: Approximate matching:

Edit distance and alignment of strings

C T A C T A C T A C G T ACTGA

The cell contains the distance between AC and CTACT.

Page 10: String Matching String matching: definition of the problem (text,pattern) depends on what we have: text or patterns Exact matching: Approximate matching:

Edit distance and alignment of strings

C T A C T A C T A C G T A C T GA

?

Page 11: String Matching String matching: definition of the problem (text,pattern) depends on what we have: text or patterns Exact matching: Approximate matching:

Edit distance and alignment of strings

C T A C T A C T A C G T 0 A C T GA

?

Page 12: String Matching String matching: definition of the problem (text,pattern) depends on what we have: text or patterns Exact matching: Approximate matching:

Edit distance and alignment of strings

C T A C T A C T A C G T 0 1 A C T GA

-C

?

Page 13: String Matching String matching: definition of the problem (text,pattern) depends on what we have: text or patterns Exact matching: Approximate matching:

Edit distance and alignment of strings

C T A C T A C T A C G T 0 1 2 A C T GA

- -CT

?

Page 14: String Matching String matching: definition of the problem (text,pattern) depends on what we have: text or patterns Exact matching: Approximate matching:

Edit distance and alignment of strings

C T A C T A C T A C G T 0 1 2 3 4 5 6 7 8 …A C T GA

- - - - - -CTACTA

Page 15: String Matching String matching: definition of the problem (text,pattern) depends on what we have: text or patterns Exact matching: Approximate matching:

Edit distance and alignment of strings

C T A C T A C T A C G T 0 1 2 3 4 5 6 7 8 …A ?C ?T ?GA

Page 16: String Matching String matching: definition of the problem (text,pattern) depends on what we have: text or patterns Exact matching: Approximate matching:

Edit distance and alignment of strings

C T A C T A C T A C G T 0 1 2 3 4 5 6 7 8 …A 1C 2T 3G…A

ACT - - -

Page 17: String Matching String matching: definition of the problem (text,pattern) depends on what we have: text or patterns Exact matching: Approximate matching:

C T A C T A C T A C G T 0 1 2 3 4 5 6 7 8 …A 1C 2T 3GA

C T A C T A C T A C G T A C TGA

Edit distance and alignment of strings

BA(AC,CTA) -C

BA(A,CTA)CC

BA(A,CTAC)C -

BA(AC,CTAC)= best

d(AC,CTAC)=min

d(AC,CTA)+1

d(A,CTA)

d(A,CTAC)+1

Page 18: String Matching String matching: definition of the problem (text,pattern) depends on what we have: text or patterns Exact matching: Approximate matching:

Edit distance and alignment of strings

Connect to

http://alggen.lsi.upc.es/docencia/ember/leed/Tfc1.htm

and use the global method.

Page 19: String Matching String matching: definition of the problem (text,pattern) depends on what we have: text or patterns Exact matching: Approximate matching:

Edit distance and alignment of strings

How this algorithm can be applied

to the approximate search?

to the K-approximate string searching?

Page 20: String Matching String matching: definition of the problem (text,pattern) depends on what we have: text or patterns Exact matching: Approximate matching:

K-approximate string searching

C T A C T A C T A C G T A C T G G T G A A …

ACTGA

This cell …

Page 21: String Matching String matching: definition of the problem (text,pattern) depends on what we have: text or patterns Exact matching: Approximate matching:

K-approximate string searching

C T A C T A C T A C G T A C T G G T G A A …

ACTGA

This cell gives the distance between (ACTGA, CT…GTA)…

…but we only are interested in the last characters

Page 22: String Matching String matching: definition of the problem (text,pattern) depends on what we have: text or patterns Exact matching: Approximate matching:

K-approximate string searching

C T A C T A C T A C G T A C T G G T G A A …

ACTGA

This cell gives the distance between (ACTGA, CT…GTA)…

…but we only are interested in the last characters

Page 23: String Matching String matching: definition of the problem (text,pattern) depends on what we have: text or patterns Exact matching: Approximate matching:

K-approximate string searching

* * * * * * C T A C G T A C T G G T G A A …

ACTGA

This cell gives the distance between (ACTGA, CT…GTA)…

…but we only are interested in the last characters…

…no matter where they appears in the text, then…

Page 24: String Matching String matching: definition of the problem (text,pattern) depends on what we have: text or patterns Exact matching: Approximate matching:

K-approximate string searching

* * * * * * C T A C G T A C T G G T G A A … 0ACTGA

This cell gives the distance between (ACTGA, CT…GTA)…

…but we only are interested in the last characters…

…no matter where they appears in the text, then…

Page 25: String Matching String matching: definition of the problem (text,pattern) depends on what we have: text or patterns Exact matching: Approximate matching:

K-approximate string searching

* * * * * * C T A C G T A C T G G T G A A … 0ACTGA

This cell gives the distance between (ACTGA, CT…GTA)…

…but we only are interested in the last characters…

…no matter where they appears in the text, then…

Page 26: String Matching String matching: definition of the problem (text,pattern) depends on what we have: text or patterns Exact matching: Approximate matching:

K-approximate string searching

C T A C T A C T A C G T A C T G G T G A A … 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0ACTGA

This cell gives the distance between (ACTGA, CT…GTA)…

…but we only are interested in the last characters…

…no matter where they appears in the text, then

Page 27: String Matching String matching: definition of the problem (text,pattern) depends on what we have: text or patterns Exact matching: Approximate matching:

K-approximate string searching

Connect to

http://alggen.lsi.upc.es/docencia/ember/leed/Tfc1.htm

and use the semi-global method.