Bioinformatics Algorithms and Data Structures

UNIVERSITY OF SOUTH CAROLINA

College of Engineering & Information Technology

Chapter 2: KMP Algorithm

Lecturer: Dr. Rose

Slides by: Dr. Rose

January 30, 2007




KMP Algorithm

• Preliminaries:
  – KMP can be easily explained in terms of finite state machines.
  – KMP has an easily proved linear bound.
  – KMP is usually not the method of choice.




KMP Algorithm

• Recall that the naïve approach to string matching is Θ(mn) in the worst case, where n = |P| and m = |T|.
• How can we reduce this complexity?
  – Avoid redundant comparisons
  – Use larger shifts
    • Boyer-Moore good suffix rule
    • Boyer-Moore extended bad character rule
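For reference, the naïve method can be sketched as follows. This is an illustrative baseline only; the function name naive_search is an assumption, not something from the slides. Positions are reported 1-based to match the slides' notation.

```python
def naive_search(pattern, text):
    """Return the 1-based start positions of all occurrences of pattern in text."""
    n, m = len(pattern), len(text)
    hits = []
    for start in range(m - n + 1):          # try every alignment of P under T
        k = 0
        while k < n and text[start + k] == pattern[k]:
            k += 1                          # up to n comparisons per alignment
        if k == n:
            hits.append(start + 1)          # convert to 1-based positions
    return hits


print(naive_search("abcxabcde", "xyabcxabcxabcdefeg"))  # [7]
```

Every alignment restarts the comparison at P(1) and always shifts by exactly one place, which is where the redundant comparisons come from.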




KMP Algorithm

• KMP finds larger shifts by recognizing patterns in P.
  – Let sp_i(P) denote the length of the longest proper suffix of P[1..i] that matches a prefix of P.
  – By definition sp_1 = 0 for any string.
  – Q: Why does this make sense?
  – A: The only proper suffix of the one-character string P[1..1] is the empty string.




KMP Algorithm

• Example: P = abcaeabcabd
  – P[1..2] = ab, hence sp_2 = 0
  – P[1..3] = abc, hence sp_3 = 0
  – P[1..4] = abca, hence sp_4 = 1
  – P[1..5] = abcae, hence sp_5 = 0
  – P[1..6] = abcaea, hence sp_6 = 1




KMP Algorithm

• Example Continued
  – P[1..7] = abcaeab, hence sp_7 = 2
  – P[1..8] = abcaeabc, hence sp_8 = 3
  – P[1..9] = abcaeabca, hence sp_9 = 4
  – P[1..10] = abcaeabcab, hence sp_10 = 2
  – P[1..11] = abcaeabcabd, hence sp_11 = 0
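The values in this example can be checked mechanically. The sketch below applies the definition of sp_i directly (a brute-force check, not the linear-time method); the function name sp_values and the padded index 0 are assumptions made for illustration.

```python
def sp_values(pattern):
    """Return a list sp with sp[i] = sp_i(P) for i = 1..n; sp[0] is padding (0)."""
    n = len(pattern)
    sp = [0] * (n + 1)
    for i in range(1, n + 1):
        # Try proper suffix lengths of P[1..i] from longest to shortest.
        for length in range(i - 1, 0, -1):
            if pattern[i - length:i] == pattern[:length]:
                sp[i] = length
                break
    return sp


print(sp_values("abcaeabcabd")[1:])  # [0, 0, 0, 1, 0, 1, 2, 3, 4, 2, 0]
```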




KMP Algorithm

• Like the L/L′ concept for Boyer-Moore, there is an analogous sp_i/sp′_i concept.

• Let sp′_i(P) denote the length of the longest proper suffix of P[1..i] that matches a prefix of P, with the added condition that characters P(i + 1) and P(sp′_i + 1) are unequal.

• Example: P = abcdabce, sp′_7 = 3

• Obviously sp′_i(P) ≤ sp_i(P), since the latter is less restrictive.
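A brute-force sketch of the stronger definition, for checking small examples by hand. The name sp_prime_values is an assumption. For i = n there is no character P(n + 1), so the extra condition is treated as satisfied; this is a choice made here, not something the slide states at this point.

```python
def sp_prime_values(pattern):
    """Return a list spp with spp[i] = sp'_i(P) for i = 1..n; spp[0] is padding (0)."""
    n = len(pattern)
    spp = [0] * (n + 1)
    for i in range(1, n + 1):
        for length in range(i - 1, 0, -1):
            suffix_matches = pattern[i - length:i] == pattern[:length]
            # 0-based: P(i+1) is pattern[i], P(length+1) is pattern[length].
            next_differs = i == n or pattern[i] != pattern[length]
            if suffix_matches and next_differs:
                spp[i] = length
                break
    return spp


print(sp_prime_values("abcdabce")[7])  # 3
print(sp_prime_values("abab")[3])      # 0, although sp_3 = 1 for this string
```

The second call shows a case where sp′_i is strictly smaller than sp_i: the suffix "a" of "aba" matches the prefix "a", but P(4) and P(2) are both b.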




KMP Algorithm

• KMP Shift Rule:
  1. Mismatch case:
     • Let position i+1 in P and position k in T be the first mismatch in a left-to-right scan.
     • Shift P to the right, aligning P[1..sp′_i] with T[k–sp′_i..k–1].
  2. Match case:
     • If no mismatch is found, an occurrence of P has been found.
     • Shift P by n – sp′_n places to continue searching for other occurrences.




KMP Algorithm

• Observations:
  – The prefix P[1..sp′_i] of the shifted P is aligned with the substring of T that it is known to match.
  – Subsequent character matching proceeds from position sp′_i + 1.
  – Unlike Boyer-Moore, the matched substring is not compared again.
  – The shift rule based on sp′_i guarantees that the exact same mismatch won’t occur at sp′_i + 1, but doesn’t guarantee that P(sp′_i + 1) = T(k).




KMP Algorithm

• Example: P = abcxabcde
  – If a mismatch occurs at position 8, P will be shifted 4 positions to the right.
  – Q: Where did the 4 position shift come from?
  – A: The number of positions is given by i – sp′_i. In this example i = 7 and sp′_7 = 3, so the shift is 7 – 3 = 4.
  – Notice that we know the amount of shift without knowing anything about T other than that there was a mismatch at position 8.




KMP Algorithm

• Example Continued: P = abcxabcde
  – After the shift, P[1..3] lines up with T[k–3..k–1].
  – Since it is known that P[1..3] must match T[k–3..k–1], no comparison is needed.
  – The scan continues from P(4) & T(k).

• Advantages of KMP Shift Rule
  1. P is often shifted by more than 1 character (i – sp′_i).
  2. The left-most sp′_i characters in the shifted P are known to match the corresponding characters in T.




KMP Algorithm

Full Example: T = xyabcxabcxadcdqfeg, P = abcxabcde

Assume that we have already shifted past the first two positions in T.

T: xyabcxabcxadcdqfeg
P:   abcxabcde
     12345678          d != x at P(8), shift 4 places
P:       abcxabcde
            ^          start again from position 4
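The single step shown in this example can be reproduced with a small sketch. Both function names are assumptions, and sp′_i is computed by brute force from its definition to keep the block short. A mismatch at P(1) (i = 0) is given a shift of 1, which is an assumption consistent with the special case handled later by the full algorithm.

```python
def sp_prime(pattern, i):
    """Brute-force sp'_i from the definition (0 <= i <= n, 1-based i)."""
    n = len(pattern)
    for length in range(i - 1, 0, -1):
        is_border = pattern[i - length:i] == pattern[:length]
        # 0-based: P(i+1) is pattern[i], P(length+1) is pattern[length].
        differs = i == n or pattern[i] != pattern[length]
        if is_border and differs:
            return length
    return 0


def first_mismatch_shift(pattern, text, start):
    """Scan P against T with P(1) under T(start).

    Returns (i, k, shift): i characters matched, the mismatch is at T(k),
    and P should move right by shift places. Returns None on a full match.
    """
    n = len(pattern)
    i = 0
    while i < n and start - 1 + i < len(text) and text[start - 1 + i] == pattern[i]:
        i += 1
    if i == n:
        return None
    k = start + i
    shift = max(1, i - sp_prime(pattern, i))   # at least one place when i = 0
    return i, k, shift


print(first_mismatch_shift("abcxabcde", "xyabcxabcxadcdqfeg", 3))  # (7, 10, 4)
```

The result says that 7 characters matched, the mismatch is at T(10), and P moves 4 places, so P(4) ends up under T(10) as in the picture.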




Preprocessing for KMP

Approach: show how to derive sp′ values from Z values.

Definition: Position j > 1 maps to i if i = j + Z_j(P) – 1

  – Recall that Z_j(P) denotes the length of the Z-box starting at position j.
  – This says that j maps to i if i is the right end of a Z-box starting at j.
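A sketch of how the Z values and the "maps to" relation might be computed. The Z computation below is the usual linear-time Z-box method; the function names, the 1-based padding, and the choice to list only non-empty Z-boxes in maps_to are assumptions made for illustration.

```python
def z_values(pattern):
    """Return a list z with z[j] = Z_j(P) for j = 2..n (1-based); z[0], z[1] are 0."""
    n = len(pattern)
    z0 = [0] * n                     # 0-based working array
    left = right = 0                 # rightmost Z-box so far is pattern[left:right]
    for k in range(1, n):
        if k < right:
            z0[k] = min(right - k, z0[k - left])   # reuse a value from inside the box
        while k + z0[k] < n and pattern[k + z0[k]] == pattern[z0[k]]:
            z0[k] += 1               # extend by explicit comparison
        if k + z0[k] > right:
            left, right = k, k + z0[k]
    return [0] + z0                  # shift indices so that result[j] = Z_j


def maps_to(pattern):
    """Return {j: i} for each j > 1 with Z_j > 0, where i = j + Z_j - 1."""
    z = z_values(pattern)
    return {j: j + z[j] - 1 for j in range(2, len(pattern) + 1) if z[j] > 0}


print(maps_to("abcxabcde"))  # {5: 7}
```

For P = abcxabcde the only non-empty Z-box starts at position 5 and has length 3, so 5 maps to 7.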





Theorem. For any i > 1, sp′_i(P) = Z_j = i – j + 1,

where j > 1 is the smallest position that maps to i.

If there is no such j then sp′_i(P) = 0.

Similarly for sp:

For any i > 1, sp_i(P) = i – j + 1,

where j, i ≥ j > 1, is the smallest position that maps to i or beyond.

If there is no such j then sp_i(P) = 0.





Given the theorem from the preceding slide, the sp′_i and sp_i values can be computed in linear time from the Z_j values:

    For i = 1 to n { sp′_i = 0; }
    For j = n downto 2 {
        i = j + Z_j(P) – 1;
        sp′_i = Z_j;
    }
    sp_n(P) = sp′_n(P);
    For i = n – 1 downto 2 {
        sp_i(P) = max[sp_{i+1}(P) – 1, sp′_i(P)];
    }
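The two loops above translate almost line for line into Python. This is a sketch under the assumption that Z values come from a standard Z-box routine; the names z_values and sp_tables are illustrative.

```python
def z_values(pattern):
    """z[j] = Z_j(P) for j = 2..n (1-based); z[0] and z[1] are 0."""
    n = len(pattern)
    z0 = [0] * n
    left = right = 0
    for k in range(1, n):
        if k < right:
            z0[k] = min(right - k, z0[k - left])
        while k + z0[k] < n and pattern[k + z0[k]] == pattern[z0[k]]:
            z0[k] += 1
        if k + z0[k] > right:
            left, right = k, k + z0[k]
    return [0] + z0


def sp_tables(pattern):
    """Return (sp_prime, sp), both 1-based lists with index 0 set to 0."""
    n = len(pattern)
    z = z_values(pattern)
    sp_prime = [0] * (n + 1)
    # Right to left, so the smallest j that maps to i is written last and wins.
    for j in range(n, 1, -1):
        if z[j] > 0:
            sp_prime[j + z[j] - 1] = z[j]
    sp = [0] * (n + 1)
    sp[n] = sp_prime[n]
    for i in range(n - 1, 1, -1):
        sp[i] = max(sp[i + 1] - 1, sp_prime[i])
    return sp_prime, sp


sp_prime, sp = sp_tables("abcaeabcabd")
print(sp[1:])        # [0, 0, 0, 1, 0, 1, 2, 3, 4, 2, 0]
print(sp_prime[1:])  # [0, 0, 0, 1, 0, 0, 0, 0, 4, 2, 0]
```

The sp values agree with the earlier worked example for P = abcaeabcabd; the sp′ values are zero wherever the character after the suffix equals the character after the prefix.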





Defn. Failure function F′(i) = sp′_{i-1} + 1, 1 ≤ i ≤ n + 1, sp′_0 = 0

(similarly F(i) = sp_{i-1} + 1, 1 ≤ i ≤ n + 1, sp_0 = 0)

• Idea:
  – We maintain a pointer i in P and c in T.
  – After a mismatch at P(i+1) with T(c), shift P to align P(sp′_i + 1) with T(c), i.e., i = sp′_i + 1.
  – Special case 1: the mismatch is at P(1), so set i = F′(1) = 1 & c = c + 1.
  – Special case 2: we find P in T, shift n – sp′_n spaces, i.e., i = F′(n + 1) = sp′_n + 1.
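A minimal sketch of the failure function as a table lookup. It assumes the sp′ values are already available as a list indexed 0..n with sp′_0 = 0; the function name and the unused None at index 0 are illustrative choices.

```python
def failure_function(sp_prime):
    """Given sp_prime[i] = sp'_i for i = 0..n, return F' with F'[i] for i = 1..n+1.

    Index 0 of the result is unused (None) so that indices match the slides.
    """
    n = len(sp_prime) - 1
    return [None] + [sp_prime[i - 1] + 1 for i in range(1, n + 2)]


# sp' values for P = abcxabcde: only sp'_7 = 3 is non-zero.
f_prime = failure_function([0, 0, 0, 0, 0, 0, 0, 3, 0, 0])
print(f_prime[8])   # 4
print(f_prime[1])   # 1
```

F′(8) = 4 is the value used in the earlier example: after a mismatch at P(8), comparison resumes at P(4).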




Full KMP Algorithm

Preprocess P to find F′(k) = sp′_{k-1} + 1 for k from 1 to n + 1.

    c = 1; p = 1;
    While c + (n – p) ≤ m {
        While p ≤ n and P(p) = T(c) {
            p = p + 1;
            c = c + 1;
        }
        If (p = n + 1) then report an occurrence of P at position c – n of T.
        If (p = 1) then c = c + 1;
        p = F′(p);
    }
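The pseudocode above can be turned into a runnable sketch as follows. Function names are assumptions; preprocessing derives sp′ from Z values, and the pointers p and c are kept 1-based to mirror the slide.

```python
def z_values(pattern):
    """z[j] = Z_j(P) for j = 2..n (1-based); z[0] and z[1] are 0."""
    n = len(pattern)
    z0 = [0] * n
    left = right = 0
    for k in range(1, n):
        if k < right:
            z0[k] = min(right - k, z0[k - left])
        while k + z0[k] < n and pattern[k + z0[k]] == pattern[z0[k]]:
            z0[k] += 1
        if k + z0[k] > right:
            left, right = k, k + z0[k]
    return [0] + z0


def failure_table(pattern):
    """F'[k] = sp'_{k-1} + 1 for k = 1..n+1; index 0 is unused."""
    n = len(pattern)
    z = z_values(pattern)
    sp_prime = [0] * (n + 1)
    for j in range(n, 1, -1):
        if z[j] > 0:
            sp_prime[j + z[j] - 1] = z[j]
    return [0] + [sp_prime[k - 1] + 1 for k in range(1, n + 2)]


def kmp_search(pattern, text):
    """Return the 1-based start positions of all occurrences of pattern in text."""
    n, m = len(pattern), len(text)
    if n == 0:
        return []                      # empty pattern: nothing to report in this sketch
    f = failure_table(pattern)
    hits = []
    c = p = 1
    while c + (n - p) <= m:
        while p <= n and pattern[p - 1] == text[c - 1]:
            p += 1
            c += 1
        if p == n + 1:
            hits.append(c - n)         # occurrence ends at T(c - 1)
        if p == 1:
            c += 1                     # mismatch at P(1): just move on in T
        p = f[p]
    return hits


print(kmp_search("abcxabcde", "xyabcxabcxabcdefeg"))  # [7]
print(kmp_search("aa", "aaaa"))                       # [1, 2, 3]
```

Note that c never moves backward, and each outer iteration either advances c or strictly decreases p, which is the basis of the linear bound.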




Full KMP Algorithm

T: xyabcxabcxabcdefeg
P: abcxabcde
   ^ 1   a != x

– p != n + 1, so no occurrence is reported.
– p = 1, so c = 2 and p = F′(1) = 1.




Full KMP Algorithm

T: xyabcxabcxabcdefeg
P:  abcxabcde
    ^ 1   a != y

– p != n + 1, so no occurrence is reported.
– p = 1, so c = 3 and p = F′(1) = 1.




Full KMP Algorithm

T: xyabcxabcxabcdefeg
P:   abcxabcde
     12345678   d != x at P(8)

– p != n + 1, so no occurrence is reported.
– p = 8, so c is not changed (c = 10) and p = F′(8) = 4.




Full KMP Algorithm

T: xyabcxabcxabcdefeg
P:       abcxabcde
            456789

– The scan resumes with p = 4, c = 10.
– P(4) through P(9) all match, so p = n + 1 and c = 16.
– An occurrence of P is reported at position c – n = 16 – 9 = 7 of T.
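The four steps of this walkthrough can be replayed with a small tracing sketch. The function name kmp_trace and the tuple layout are assumptions; the F′ table for P = abcxabcde is written out by hand from sp′_7 = 3 (all other sp′ values are 0).

```python
def kmp_trace(pattern, text, f_prime):
    """Run the KMP loop and record one tuple per outer iteration.

    Each tuple is (p_at_stop, c_after, p_after, found) where found is the
    1-based occurrence position reported in that iteration, or None.
    """
    n, m = len(pattern), len(text)
    c = p = 1
    steps = []
    while c + (n - p) <= m:
        while p <= n and pattern[p - 1] == text[c - 1]:
            p += 1
            c += 1
        found = c - n if p == n + 1 else None
        stopped_at = p
        if p == 1:
            c += 1
        p = f_prime[p]
        steps.append((stopped_at, c, p, found))
    return steps


# F'(k) for k = 1..10; index 0 is unused.
F_PRIME = [None, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1]

for step in kmp_trace("abcxabcde", "xyabcxabcxabcdefeg", F_PRIME):
    print(step)
# (1, 2, 1, None)
# (1, 3, 1, None)
# (8, 10, 4, None)
# (10, 16, 1, 7)
```

The third tuple is the mismatch at P(8) with c left at 10 and p reset to 4; the fourth is the full match reported at position 7.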




Real-Time KMP

• Q: What is meant by real-time algorithms?
• A: Typically these are algorithms that are meant to interact synchronously in the real world.
  – This implies a known fixed turn-around time for processing a task.
  – Many embedded scheduling systems are examples involving real-time algorithms.
  – For KMP this means that we require a constant bound on the time spent processing each character of T.




Real-Time KMP

• Q: Why is KMP not real-time?
• A: For any mismatched character in T, we may try matching it several times.
  – Recall that sp′_i only guarantees that P(i + 1) and P(sp′_i + 1) differ.
  – There is NO guarantee that P(sp′_i + 1) and T(k) match.

• We need to ensure that a mismatch at T(k) does NOT entail additional matches at T(k).

• This means that we have to compute sp′_i values with respect to all characters in the alphabet Σ, since any could appear in T.




Real-Time KMP

• Define: sp′_(i,x)(P) to be the length of the longest proper suffix of P[1..i] that matches a prefix of P, with the added condition that character P(sp′_(i,x) + 1) is x.

• This will tell us exactly what shift to use for each possible mismatch.

• A mismatched character T(k) will never be involved in subsequent comparisons.




Real-Time KMP

• Q: How do we know that the mismatched character T(k) will never be involved in subsequent comparisons?

• A: Because the shift will move P so that either the matching character aligns with T(k) or P is shifted past T(k).

• This results in a real-time version of KMP.
• Let’s consider how we can find the sp′_(i,x)(P) values in linear time.




Real-Time KMP

Thm. For P(i + 1) ≠ x, sp′_(i,x)(P) = i – j + 1
  – Here j is the smallest position such that j maps to i and P(Z_j + 1) = x.
  – If there is no such j then sp′_(i,x)(P) = 0.

    For i = 1 to n { sp′_(i,x) = 0 for every character x; }
    For j = n downto 2 {
        i = j + Z_j(P) – 1;
        x = P(Z_j + 1);
        sp′_(i,x) = Z_j;
    }
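One way to sketch this preprocessing in Python. Storing only the non-zero entries in a dictionary keyed by (i, x), with missing keys meaning 0, is an assumption made here to avoid allocating a full n × |Σ| table; the function names are illustrative.

```python
def z_values(pattern):
    """z[j] = Z_j(P) for j = 2..n (1-based); z[0] and z[1] are 0."""
    n = len(pattern)
    z0 = [0] * n
    left = right = 0
    for k in range(1, n):
        if k < right:
            z0[k] = min(right - k, z0[k - left])
        while k + z0[k] < n and pattern[k + z0[k]] == pattern[z0[k]]:
            z0[k] += 1
        if k + z0[k] > right:
            left, right = k, k + z0[k]
    return [0] + z0


def sp_prime_by_char(pattern):
    """Return {(i, x): sp'_(i,x)} holding only the non-zero entries."""
    n = len(pattern)
    z = z_values(pattern)
    table = {}
    # Right to left, so the smallest qualifying j (longest Z-box) is written last.
    for j in range(n, 1, -1):
        if z[j] == 0:
            continue
        i = j + z[j] - 1          # right end of the Z-box starting at j
        x = pattern[z[j]]         # P(Z_j + 1) in 0-based indexing
        table[(i, x)] = z[j]
    return table


print(sp_prime_by_char("abcxabcde"))  # {(7, 'x'): 3}
```

For P = abcxabcde the single entry says: after matching 7 characters, if the mismatching text character is x, shift so that P[1..3] stays matched and P(4) = x sits under it.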




Real-Time KMP

• Notice how this works:
  – Starting from the right
    • Find i, the right end of the Z-box associated with j.
    • Find x, the character immediately following the prefix corresponding to this Z-box.
    • Set sp′_(i,x) = Z_j, the length of this Z-box.
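The slides do not spell out the search loop for the real-time variant, so the following is only one possible reading: a sketch in which each text character is handled with a constant number of operations. The function names are assumptions, as is the handling of a zero table entry (one extra comparison against P(1), which keeps the work per character bounded).

```python
def z_values(pattern):
    """z[j] = Z_j(P) for j = 2..n (1-based); z[0] and z[1] are 0."""
    n = len(pattern)
    z0 = [0] * n
    left = right = 0
    for k in range(1, n):
        if k < right:
            z0[k] = min(right - k, z0[k - left])
        while k + z0[k] < n and pattern[k + z0[k]] == pattern[z0[k]]:
            z0[k] += 1
        if k + z0[k] > right:
            left, right = k, k + z0[k]
    return [0] + z0


def realtime_search(pattern, text):
    """Return 1-based start positions of pattern in text, O(1) work per text char."""
    n = len(pattern)
    if n == 0:
        return []
    z = z_values(pattern)
    table = {}                                 # (i, x) -> sp'_(i,x), non-zero only
    for j in range(n, 1, -1):
        if z[j] > 0:
            table[(j + z[j] - 1, pattern[z[j]])] = z[j]
    hits = []
    matched = 0                                # characters of P currently matched
    for k, x in enumerate(text, start=1):
        if matched < n and pattern[matched] == x:
            matched += 1                       # T(k) extends the current match
        else:
            b = table.get((matched, x), 0)
            if b > 0 or pattern[0] == x:
                matched = b + 1                # T(k) lines up with P(b + 1) = x
            else:
                matched = 0                    # P is shifted past T(k)
        if matched == n:
            hits.append(k - n + 1)
    return hits


print(realtime_search("abcxabcde", "xyabcxabcxabcdefeg"))  # [7]
print(realtime_search("abab", "abababxabab"))              # [1, 3, 8]
```

Each text character triggers at most one table lookup and at most two character comparisons, so no character of T is revisited.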

