
String Matching


Page 1: String Matching

String Matching

String matching: definition of the problem (text, pattern)

It depends on what we have: text or patterns

• Exact matching:

• Approximate matching:

• 1 pattern ---> The algorithm depends on |p| and |Σ|

• k patterns ---> The algorithm depends on k, |p| and |Σ|

• The text ----> Data structure for the text (suffix tree, ...)

• The patterns ---> Data structures for the patterns

• Dynamic programming

• Sequence alignment (pairwise and multiple)

• Extensions

• Regular Expressions

• Probabilistic search: Hidden Markov Models

• Sequence assembly: hash algorithm

Page 2: String Matching

2.2 Pairwise alignment

Given two DNA sequences A (a1a2...an) and B (b1b2...bm) over the alphabet {a,c,t,g},

we say that A* and B*, over {a,c,t,g,-}, are aligned iff

i) A* and B* become A and B if gaps ( - ) are removed.

ii) |A*| = |B*|

iii) For all i, it is not possible that a*i = b*i = - (no column holds two gaps).
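The three conditions can be checked mechanically. The following is a minimal sketch, not part of the original slides; the function name `is_alignment` and the use of Python strings for the sequences are assumptions made for illustration.

```python
GAP = "-"

def is_alignment(a, b, a_star, b_star):
    """Check conditions (i)-(iii) for a candidate alignment (a_star, b_star).

    Hypothetical helper: the name and signature are illustrative only.
    """
    # (i) removing the gaps must give back the original sequences
    if a_star.replace(GAP, "") != a or b_star.replace(GAP, "") != b:
        return False
    # (ii) both gapped sequences must have the same length
    if len(a_star) != len(b_star):
        return False
    # (iii) no column may consist of two gaps
    return all(not (x == GAP and y == GAP) for x, y in zip(a_star, b_star))

print(is_alignment("act", "at", "act", "a-t"))    # True
print(is_alignment("act", "at", "ac-t", "a--t"))  # False: third column is (-,-)
```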

Which is the best alignment?

How many alignments of two sequences exist?

MALIG (an example)

Page 3: String Matching

2.2 Number of alignments

Given two DNA sequences A (a1a2...an) and B (b1b2...bm) there are:

#(a1a2...an, b1b2...bm) =

      #(a1a2...an-1, b1b2...bm)      those that end with (an, -)
    + #(a1a2...an, b1b2...bm-1)      those that end with (-, bm)
    + #(a1a2...an-1, b1b2...bm-1)    those that end with (an, bm)

          b1          b2   b3
    a1    #(a1,b1)
    a2
    a3

Page 4: String Matching

2.2 Number of alignments

Given two DNA sequences A (a1a2...an) and B (b1b2...bm) there are:

#(a1a2...an, b1b2...bm) =

      #(a1a2...an-1, b1b2...bm)      those that end with (an, -)
    + #(a1a2...an, b1b2...bm-1)      those that end with (-, bm)
    + #(a1a2...an-1, b1b2...bm-1)    those that end with (an, bm)

          -   b1   b2   b3
    -     1    1    1    1
    a1    1
    a2    1
    a3    1

Page 5: String Matching

2.2 Number of alignments

Given two DNA sequences A (a1a2...an) and B (b1b2...bm) there are:

#(a1a2...an, b1b2...bm) =

      #(a1a2...an-1, b1b2...bm)      those that end with (an, -)
    + #(a1a2...an, b1b2...bm-1)      those that end with (-, bm)
    + #(a1a2...an-1, b1b2...bm-1)    those that end with (an, bm)

          -   b1   b2   b3
    -     1    1    1    1
    a1    1    3    ?
    a2    1    ?
    a3    1

Page 6: String Matching

2.2 Number of alignments

Given two DNA sequences A (a1a2...an) and B (b1b2...bm) there are:

#(a1a2...an, b1b2...bm) =

      #(a1a2...an-1, b1b2...bm)      those that end with (an, -)
    + #(a1a2...an, b1b2...bm-1)      those that end with (-, bm)
    + #(a1a2...an-1, b1b2...bm-1)    those that end with (an, bm)

          -   b1   b2   b3
    -     1    1    1    1
    a1    1    3    5    7
    a2    1    5    ?
    a3    1    7

Page 7: String Matching

2.2 Number of alignments

Given two DNA sequences A (a1a2...an) and B (b1b2...bm) then:

#(a1a2...an, b1b2...bm) =

      #(a1a2...an-1, b1b2...bm)      those that end with (an, -)
    + #(a1a2...an, b1b2...bm-1)      those that end with (-, bm)
    + #(a1a2...an-1, b1b2...bm-1)    those that end with (an, bm)

          -   b1   b2   b3
    -     1    1    1    1
    a1    1    3    5    7
    a2    1    5   13   25
    a3    1    7   25   63

But what is the asymptotic value?
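The recurrence fills the table cell by cell, starting from borders of 1 (against an empty sequence only gaps are possible). A minimal sketch, assuming a hypothetical function name `count_alignments` that takes the two sequence lengths:

```python
def count_alignments(n, m):
    """Number of alignments of a sequence of length n with one of length m."""
    # table[i][j] = #(a1...ai, b1...bj); the borders are 1
    table = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = (table[i - 1][j]         # those that end with (ai, -)
                           + table[i][j - 1]       # those that end with (-, bj)
                           + table[i - 1][j - 1])  # those that end with (ai, bj)
    return table[n][m]

print([count_alignments(n, n) for n in range(1, 4)])  # [3, 13, 63]
```

The printed values are the diagonal of the table for three-letter sequences.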

Page 8: String Matching

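2.2 Asymptotic value

As

$$\#(a_1 a_2 \dots a_n,\ b_1 b_2 \dots b_n) \;>\; \sum_{k=0}^{n} \binom{n}{k}\binom{n}{k} \;=\; \binom{2n}{n}$$

and

$$n! \sim n^n e^{-n} \quad \text{(Stirling approximation)}$$

so that

$$\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \sim \frac{(2n)^{2n} e^{-2n}}{n^{2n} e^{-2n}} = 2^{2n}$$

then, asymptotically,

$$\#(a_1 a_2 \dots a_n,\ b_1 b_2 \dots b_n) > 2^{2n}$$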
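The bounds can be checked numerically for small n. The sketch below is an illustration rather than part of the slides: `count_alignments` is a hypothetical helper that applies the recurrence for the number of alignments, and `math.comb` supplies the central binomial coefficient. It shows that the count always exceeds the binomial bound, while the comparison with 2^(2n) is only asymptotic: it fails for n = 3 (63 < 64) and holds from n = 4 on (321 > 256).

```python
from math import comb

def count_alignments(n, m):
    """Number of alignments of sequences of lengths n and m (same recurrence)."""
    table = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = table[i - 1][j] + table[i][j - 1] + table[i - 1][j - 1]
    return table[n][m]

# Columns: n, number of alignments, binomial(2n, n), 2^(2n)
for n in (1, 2, 3, 4, 10):
    print(n, count_alignments(n, n), comb(2 * n, n), 4 ** n)
```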

Page 9: String Matching

2.2 Best alignment

How can an alignment be scored?

catcactactgacgactatcgtagcgcggctatacatctacgccaa- ctac-t-gtgtagatcgccgg
c- tgactgc--acgactatcgt- attgcggctacacactacgcacaactactgtatgtcgc-cgg----
* * *** * * ** * ******* * * **** **** ******* * **** ** * ***

How can the best alignment be found?

• Match: favorable

• Mismatch: unfavorable

• Gap: worst case

Then we assign a score to each case, for example +1 for a match, -1 for a mismatch and -2 for a gap.
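With the example values (1 for a match, -1 for a mismatch, -2 for a gap), scoring a given alignment is a sum over its columns. A minimal sketch; the function name `score_alignment` and the toy sequences are assumptions for illustration:

```python
def score_alignment(a_star, b_star, match=1, mismatch=-1, gap=-2):
    """Sum the column scores of an alignment given as two gapped strings."""
    if len(a_star) != len(b_star):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(a_star, b_star):
        if x == "-" or y == "-":
            total += gap        # gap: worst case
        elif x == y:
            total += match      # match: favorable
        else:
            total += mismatch   # mismatch: unfavorable
    return total

# Columns: (a,a) +1, (c,-) -2, (t,t) +1, (-,c) -2, (g,g) +1
print(score_alignment("act-g", "a-tcg"))  # -1
```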

Page 10: String Matching

2.2 Best alignment

         C  T  A  C  T  A  C  T  A  C  G  T
    A
    C                 *
    T
    G
    A

The marked cell (*) contains the score of the best alignment of AC and CTACT.
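Each cell can be filled from its three neighbours (above, left, diagonal), which is the standard dynamic programming scheme for global alignment, often called Needleman-Wunsch. The sketch below assumes the example scores 1, -1 and -2 and adds an extra first row and column for the empty prefix; the function name `global_score_matrix` is hypothetical.

```python
def global_score_matrix(a, b, match=1, mismatch=-1, gap=-2):
    """S[i][j] = score of the best global alignment of a[:i] and b[:j]."""
    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap          # a[:i] against the empty sequence
    for j in range(1, m + 1):
        S[0][j] = j * gap          # the empty sequence against b[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = S[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            S[i][j] = max(diagonal,            # last column is (ai, bj)
                          S[i - 1][j] + gap,   # last column is (ai, -)
                          S[i][j - 1] + gap)   # last column is (-, bj)
    return S

S = global_score_matrix("ACTGA", "CTACTACTACGT")
print(S[2][5])  # best alignment of AC and CTACT: two matches, three gaps -> -4
```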

Page 11: String Matching

Best alignment

[Score matrix: columns accaccacaccacaacgagcata … acctgagcgatat, rows acc..t]

Given the maximum score, how can the best alignment be found?

• Quadratic cost in space and time

• Sequences of up to 10,000 bp in length

Download alggen tool
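One way to recover the alignment from the maximum score is to walk back from the bottom-right cell, at each step choosing a neighbour that explains the current value. This is a sketch under the same assumed scores (1, -1, -2); the function name `global_align` is hypothetical, and when several best alignments exist it returns only one of them. The full matrix is kept in memory, which is where the quadratic cost in space comes from.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, a_star, b_star) for one best global alignment."""
    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = S[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            S[i][j] = max(diagonal, S[i - 1][j] + gap, S[i][j - 1] + gap)

    # Traceback: follow, backwards, a move that produced each cell value.
    i, j = n, m
    a_star, b_star = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch):
            a_star.append(a[i - 1]); b_star.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            a_star.append(a[i - 1]); b_star.append("-")
            i -= 1
        else:
            a_star.append("-"); b_star.append(b[j - 1])
            j -= 1
    return S[n][m], "".join(reversed(a_star)), "".join(reversed(b_star))

print(global_align("ACT", "AT"))  # (0, 'ACT', 'A-T')
```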

Page 12: String Matching

2.2 Some slides revisited

We have developed the theory according to the following principles:

1) Both sequences have a similar length (global).

2) The model of gaps is linear.

If there are k consecutive gaps, the penalty is k·(-2).

Page 13: String Matching

2.2 Semiglobal pairwise alignment

Assume that we have two sequences of different lengths, S1 and S2.

It is meaningless to introduce gaps until both sequences have similar length ….

The most probable alignment should be

[Figure: S1 and S2 aligned, with the initial gaps and the final gaps marked]

How can these alignments be found?

Page 14: String Matching

2.2 Semiglobal pairwise alignment

         C  T  A  C  T  A  C  T  A  C  G  T
    A
    C
    T

Initial gaps

Note that

Final gaps

Page 15: String Matching

2.2 Semiglobal pairwise alignment

         -  C  T  A  C  T  A  C  T  A  C  G  T
    -    0  0  0  0  0  0  0  0  0  0  0  0  0
    A
    C
    T

Given a cell

The cell contains the score of the best alignment of CTA with the empty sequence.

Page 16: String Matching

2.2 Semiglobal pairwise alignment

         -  C  T  A  C  T  A  C  T  A  C  G  T
    -    0  0  0  0  0  0  0  …
    A
    C
    T

The contribution of the initial gaps is disregarded, then

         -  C  T  A  C  T  A  C  T  A  C  G  T
    -    0  0  0  0  0  0  0  …
    A    1
    C    2
    T    3

but, what happens with the final gaps?

Page 17: String Matching

2.2 Semiglobal pairwise alignment

         -  C  T  A  C  T  A  C  T  A  C  G  T
    -    0  0  0  0  0  0  0  …
    A    1
    C    2
    T    3

How does the algorithm search for the best alignment?

… by checking the last row for the best score.

Practice with the alggen tool.
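Both ideas, a first row of zeros so that initial gaps cost nothing and a scan of the last row so that final gaps cost nothing, fit in a few lines. This is a sketch under assumptions: the scores 1, -1, -2 are the earlier example values, the function name `semiglobal_score` is hypothetical, and the first column is filled with ordinary gap penalties (the short sequence must be aligned in full), which may differ from the values shown on the slide.

```python
def semiglobal_score(short, long, match=1, mismatch=-1, gap=-2):
    """Best score of `short` aligned in full against some stretch of `long`.

    Returns (score, end), where `end` is the column of the last row
    holding the best score (ties resolved to the leftmost column).
    """
    n, m = len(short), len(long)
    S = [[0] * (m + 1) for _ in range(n + 1)]   # first row stays 0: free initial gaps
    for i in range(1, n + 1):
        S[i][0] = i * gap                        # assumption: gaps inside `long` are charged
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = S[i - 1][j - 1] + (match if short[i - 1] == long[j - 1] else mismatch)
            S[i][j] = max(diagonal, S[i - 1][j] + gap, S[i][j - 1] + gap)
    # Free final gaps: take the best value anywhere in the last row.
    end = max(range(m + 1), key=lambda j: S[n][j])
    return S[n][end], end

print(semiglobal_score("ACT", "CTACTACTACGT"))  # (3, 5): ACT matches CTACT's last three letters
```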

Page 18: String Matching

2.2 Affine-gap model score

Given the following alignments that have the same score …

a g t a c c c c g t a g
a g t - c c - - g t a -

a g t a c c c c g t a g
a g t - c - c - g t a -

a g t a c c c c g t a g
a g t - c - - c g t a -

a g t a c c c c g t a g
a g t - - c c - g t a -

a g t a c c c c g t a g
a g t - - c - c g t a -

a g t a c c c c g t a g
a g t - - - c c g t a -

Which is the most reliable case from a biological point of view?

Page 19: String Matching

2.2 Affine-gap model score

Then, how can we distinguish between consecutive gaps and separated gaps?

a g t a c c c c g t a g
a g t - - c - c g t a -

a g t a c c c c g t a g
a g t - - - c c g t a -

By penalizing the opening of a gap (OG) more heavily than its extension (EG), for instance, -10 and -0.5.

Then, the penalty of k consecutive gaps becomes OG + (k-1) EG, which is an affine-gap function.

How is the best alignment found?
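The effect of the affine penalty can be seen by scoring the two alignments above. This is a minimal sketch: the function name `affine_alignment_score` is hypothetical, OG = -10 and EG = -0.5 are the values quoted above, and the match and mismatch scores of 1 and -1 are assumed from the earlier example.

```python
def affine_alignment_score(a_star, b_star, match=1, mismatch=-1,
                           gap_open=-10.0, gap_extend=-0.5):
    """Score an alignment where a run of k gaps costs OG + (k-1)*EG."""
    total = 0.0
    previous = None            # which sequence had a gap in the previous column
    for x, y in zip(a_star, b_star):
        if x == "-":
            current = "a"
        elif y == "-":
            current = "b"
        else:
            current = None
        if current is None:
            total += match if x == y else mismatch
        elif current == previous:
            total += gap_extend    # the run of gaps continues
        else:
            total += gap_open      # a new run of gaps starts
        previous = current
    return total

top = "agtaccccgtag"
print(affine_alignment_score(top, "agt--c-cgta-"))  # -22.5 (three runs of gaps)
print(affine_alignment_score(top, "agt---ccgta-"))  # -13.0 (two runs of gaps)
```

Both alignments have eight matches and four gaps, so they tie under the linear model; the affine model prefers the one with fewer, longer runs.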

Page 20: String Matching

2.2 Affine-gap model score

         C  T  A  C  T  A  C  T  A  C  G  T
    A
    C
    T
    G
    A

Smallest arrows: refer to the introduction of an opening gap.

Largest arrows: refer to the introduction of an extension gap.

But from which cell do the largest arrows originate?
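An extension can only continue an alignment that already ends in a gap, so a single score per cell is not enough to tell where an extension arrow may start. One common answer, usually attributed to Gotoh and not necessarily the one the slides have in mind, keeps three scores per cell, one for each kind of last column. The sketch below assumes the scores 1 and -1 for match and mismatch and -10 and -0.5 for opening and extension; the function name `affine_global_score` is hypothetical.

```python
NEG_INF = float("-inf")

def affine_global_score(a, b, match=1, mismatch=-1,
                        gap_open=-10.0, gap_extend=-0.5):
    """Best global alignment score when k consecutive gaps cost OG + (k-1)*EG."""
    n, m = len(a), len(b)
    # M: last column is (ai, bj); X: last column is (ai, -); Y: last column is (-, bj)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Opening arrows come from the other two states,
            # extension arrows from the same gap state.
            X[i][j] = max(M[i - 1][j] + gap_open,
                          Y[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend)
            Y[i][j] = max(M[i][j - 1] + gap_open,
                          X[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend)
    return max(M[n][m], X[n][m], Y[n][m])

print(affine_global_score("AAA", "A"))  # -9.5: one match, one run of two gaps
```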

Page 21: String Matching

2.2 Local alignment

Given two sequences, we can consider the alignments of all their substrings…

…how can the best of them be found?

Two questions arise:

- how can the alignments be compared?

- how can the best one be selected?

Page 22: String Matching

2.2 Local alignment

Given a path

Imagine the graph of the scores: can the best subalignments be detected?

[Score matrix: columns accaccacaccacaacgagcata … acctgagcgatat, rows acc..t]

… It suffices to compare the value of each cell with zero!
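Comparing each cell with zero means that a subalignment whose running score drops below zero is abandoned and a new one may start at that cell. This is the scheme commonly known as Smith-Waterman. A minimal sketch, assuming the example scores 1, -1 and -2 and a hypothetical function name `local_alignment_score`:

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return (best score, cell) over all alignments of substrings of a and b."""
    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]   # borders are 0: start anywhere
    best, best_cell = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = S[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The comparison with zero restarts the subalignment at this cell.
            S[i][j] = max(0, diagonal, S[i - 1][j] + gap, S[i][j - 1] + gap)
            if S[i][j] > best:
                best, best_cell = S[i][j], (i, j)
    return best, best_cell

print(local_alignment_score("TTACGTT", "GGACGGG"))  # (3, (5, 5)): the shared ACG
```

The best subalignment ends at the cell holding the highest value, and a traceback from that cell stops at the first zero it meets.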