UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 2: KMP Algorithm Lecturer: Dr. Rose Slides by: Dr. Rose January 30, 2007




KMP Algorithm

• Preliminaries:
  – KMP can be easily explained in terms of finite state machines.
  – KMP has an easily proved linear bound.
  – KMP is usually not the method of choice.


KMP Algorithm

• Recall that the naïve approach to string matching is Θ(mn) in the worst case, where m = |T| and n = |P|.

• How can we reduce this complexity?
  – Avoid redundant comparisons
  – Use larger shifts
    • Boyer-Moore good suffix rule
    • Boyer-Moore extended bad character rule
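
For reference, the naïve approach can be sketched as follows. This is an illustrative Python sketch, not code from the slides; the function name `naive_search` and the comparison counter are assumptions added to make the Θ(mn) behaviour visible.

```python
def naive_search(T, P):
    """Return (1-based match positions, number of character comparisons).

    Hypothetical helper for illustration: tries every alignment of P in T
    and re-compares from the first character of P each time.
    """
    m, n = len(T), len(P)
    positions, comparisons = [], 0
    for start in range(m - n + 1):
        k = 0
        while k < n:
            comparisons += 1
            if T[start + k] != P[k]:
                break
            k += 1
        if k == n:
            positions.append(start + 1)  # report 1-based position
    return positions, comparisons
```

Every alignment restarts at P(1) and the pattern only ever moves by one place, which is exactly the redundancy that larger shifts are meant to remove.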


KMP Algorithm

• KMP finds larger shifts by recognizing patterns in P.
  – Let sp_i(P) denote the length of the longest proper suffix of P[1..i] that matches a prefix of P.
  – By definition sp_1 = 0 for any string.
  – Q: Why does this make sense?
  – A: The only proper suffix of a one-character string is the empty string.


KMP Algorithm

• Example: P = abcaeabcabd
  – P[1..2] = ab, hence sp_2 = 0
  – P[1..3] = abc, hence sp_3 = 0
  – P[1..4] = abca, hence sp_4 = 1
  – P[1..5] = abcae, hence sp_5 = 0
  – P[1..6] = abcaea, hence sp_6 = 1


KMP Algorithm

• Example Continued
  – P[1..7] = abcaeab, hence sp_7 = 2
  – P[1..8] = abcaeabc, hence sp_8 = 3
  – P[1..9] = abcaeabca, hence sp_9 = 4
  – P[1..10] = abcaeabcab, hence sp_10 = 2
  – P[1..11] = abcaeabcabd, hence sp_11 = 0
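
The values in this example can be checked directly against the definition. The sketch below is a deliberately naive, quadratic-or-worse Python check, not the linear-time preprocessing; the name `sp_values_naive` is an assumption introduced here.

```python
def sp_values_naive(P):
    """Return sp with sp[i] = sp_i(P) for i = 1..n (index 0 is unused).

    Straight from the definition: the longest proper suffix of P[1..i]
    that is also a prefix of P.
    """
    n = len(P)
    sp = [0] * (n + 1)
    for i in range(1, n + 1):
        # Try proper suffix lengths from longest (i - 1) down to 1.
        for length in range(i - 1, 0, -1):
            if P[i - length:i] == P[:length]:
                sp[i] = length
                break
    return sp

print(sp_values_naive("abcaeabcabd")[1:])
```

For i = 1 the inner loop has nothing to try, so sp_1 = 0, in line with the definition.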


KMP Algorithm

• As with the analogous pair of definitions for Boyer-Moore, there is a weaker value sp_i and a stronger value sp'_i.

• Let sp'_i(P) denote the length of the longest proper suffix of P[1..i] that matches a prefix of P, with the added condition that characters P(i + 1) and P(sp'_i + 1) are unequal.

• Example: P = abcdabce, sp'_7 = 3

Obviously sp'_i(P) <= sp_i(P), since the latter is less restrictive.
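
The stronger definition can also be checked by brute force. This is a sketch under one stated assumption: for i = n there is no character P(n + 1), so the extra condition is treated as satisfied (which agrees with the preprocessing step that sets sp_n = sp'_n). The name `strong_sp_naive` is hypothetical.

```python
def strong_sp_naive(P):
    """Return spp with spp[i] = sp'_i(P) for i = 1..n (index 0 is unused)."""
    n = len(P)
    spp = [0] * (n + 1)
    for i in range(1, n + 1):
        for length in range(i - 1, 0, -1):
            suffix_matches = P[i - length:i] == P[:length]
            # 1-based P(i + 1) and P(length + 1) are P[i] and P[length] here.
            # Assumption: at i = n the "next character differs" test is vacuous.
            next_differs = (i == n) or (P[i] != P[length])
            if suffix_matches and next_differs:
                spp[i] = length
                break
    return spp

print(strong_sp_naive("abcdabce")[7])
```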


KMP Algorithm

• KMP Shift Rule:
  1. Mismatch case:
     • Let position i+1 in P and position k in T be the first mismatch in a left-to-right scan.
     • Shift P to the right, aligning P[1..sp'_i] with T[k-sp'_i..k-1].
  2. Match case:
     • If no mismatch is found, an occurrence of P has been found.
     • Shift P by n - sp'_n places to continue searching for other occurrences.


KMP Algorithm

• Observations:
  – After the shift, the prefix P[1..sp'_i] of P is aligned with the substring T[k-sp'_i..k-1] that it is known to match.
  – Subsequent character matching proceeds from position sp'_i + 1 of P.
  – Unlike Boyer-Moore, the matched substring is not compared again.
  – The shift rule based on sp'_i guarantees that the exact same mismatch won't occur at sp'_i + 1, but it doesn't guarantee that P(sp'_i + 1) = T(k).


KMP Algorithm

• Example: P = abcxabcde
  – If a mismatch occurs at position 8, P will be shifted 4 positions to the right.
  – Q: Where did the 4-position shift come from?
  – A: The number of positions is given by i - sp'_i; in this example i = 7, sp'_7 = 3, and 7 - 3 = 4.
  – Notice that we know the amount of shift without knowing anything about T other than that there was a mismatch at position 8 of P.


KMP Algorithm

• Example Continued: P = abcxabcde
  – After the shift, P[1..3] lines up with T[k-3..k-1].
  – Since it is known that P[1..3] must match T[k-3..k-1], no comparison is needed.
  – The scan continues from P(4) and T(k).

• Advantages of KMP Shift Rule
  1. P is often shifted by more than 1 character (i - sp'_i).
  2. The left-most sp'_i characters in the shifted P are known to match the corresponding characters in T.
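
The shift amount i - sp'_i depends only on P, so it can be computed without looking at T. The sketch below illustrates this; the helper names `strong_sp` and `kmp_shift` are assumptions, sp'_i is computed by brute force for clarity, and a mismatch at position 1 is treated as a shift of one place.

```python
def strong_sp(P, i):
    """sp'_i(P) straight from the definition (1-based i); illustration only."""
    n = len(P)
    for length in range(i - 1, 0, -1):
        if P[i - length:i] == P[:length] and (i == n or P[i] != P[length]):
            return length
    return 0

def kmp_shift(P, mismatch_pos):
    """Shift used when the first mismatch is at 1-based position mismatch_pos of P."""
    i = mismatch_pos - 1      # characters of P matched before the mismatch
    if i == 0:
        return 1              # nothing matched: move P one place to the right
    return i - strong_sp(P, i)

print(kmp_shift("abcxabcde", 8))  # i = 7, sp'_7 = 3
```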


KMP Algorithm

Full Example: T = xyabcxabcxadcdqfeg, P = abcxabcde

Assume that we have already shifted past the first two positions in T.

    T: xyabcxabcxadcdqfeg
    P:   abcxabcde
         12345678          d != x at position 8 of P, shift 4 places

    T: xyabcxabcxadcdqfeg
    P:       abcxabcde
                ^          start again from position 4 of P


Preprocessing for KMP

Approach: show how to derive sp' values from Z values.

Definition: Position j > 1 maps to i if i = j + Z_j(P) - 1.
  – Recall that Z_j(P) denotes the length of the Z-box starting at position j.
  – This says that j maps to i if i is the right end of the Z-box starting at j.
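
As a concrete illustration of the "maps to" relation, the following sketch computes Z values with the standard linear-time Z algorithm and lists which positions map where. The function names are assumptions; indices are kept 1-based to match the slides, and positions with Z_j = 0 are skipped because they have no Z-box.

```python
def z_values(P):
    """Return Z with Z[j] = Z_j(P) for 1-based j = 2..n (Z[0], Z[1] unused)."""
    n = len(P)
    Z = [0] * (n + 1)
    l = r = 0  # 1-based ends of the rightmost Z-box found so far
    for j in range(2, n + 1):
        if j <= r:
            # j lies inside a known Z-box: reuse the value at its mirror position.
            Z[j] = min(r - j + 1, Z[j - l + 1])
        # Extend by explicit comparison: P(Z_j + 1) against P(j + Z_j).
        while j + Z[j] <= n and P[Z[j]] == P[j + Z[j] - 1]:
            Z[j] += 1
        if j + Z[j] - 1 > r:
            l, r = j, j + Z[j] - 1
    return Z

def maps_to(P):
    """Return {j: i} for every j > 1 with a non-empty Z-box ending at i."""
    Z = z_values(P)
    return {j: j + Z[j] - 1 for j in range(2, len(P) + 1) if Z[j] > 0}

print(maps_to("abcxabcde"))
```

For P = abcxabcde the only Z-box starts at position 5 with length 3, so position 5 maps to position 7.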


Preprocessing for KMP

Theorem. For any i > 1, sp'_i(P) = Z_j = i - j + 1,

where j > 1 is the smallest position that maps to i.

If there is no such j, then sp'_i(P) = 0.

Similarly for sp:

For any i > 1, sp_i(P) = i - j + 1,

where j, with i ≥ j > 1, is the smallest position that maps to i or beyond.

If there is no such j, then sp_i(P) = 0.


Preprocessing for KMP

Given the theorem from the preceding slide, the sp'_i and sp_i values can be computed in linear time using the Z_j values:

For i = 1 to n { sp'_i = 0; }
For j = n downto 2 {
    i = j + Z_j(P) - 1;
    sp'_i = Z_j;
}
sp_n(P) = sp'_n(P);
For i = n - 1 downto 2 {
    sp_i(P) = max[sp_{i+1}(P) - 1, sp'_i(P)];
}
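
A runnable version of this preprocessing might look like the sketch below. It is an illustration rather than the lecturer's code: the names `z_values` and `sp_from_z` are assumptions, lists are padded so that index i corresponds to the 1-based position i, and positions with Z_j = 0 are skipped since they contribute nothing.

```python
def z_values(P):
    """Z[j] = Z_j(P) for 1-based j = 2..n, via the standard Z algorithm."""
    n = len(P)
    Z = [0] * (n + 1)
    l = r = 0
    for j in range(2, n + 1):
        if j <= r:
            Z[j] = min(r - j + 1, Z[j - l + 1])
        while j + Z[j] <= n and P[Z[j]] == P[j + Z[j] - 1]:
            Z[j] += 1
        if j + Z[j] - 1 > r:
            l, r = j, j + Z[j] - 1
    return Z

def sp_from_z(P):
    """Return (spp, sp) with spp[i] = sp'_i and sp[i] = sp_i for i = 1..n."""
    n = len(P)
    Z = z_values(P)
    spp = [0] * (n + 1)
    # Right-to-left, so the smallest j mapping to i is written last and wins.
    for j in range(n, 1, -1):
        if Z[j] > 0:
            spp[j + Z[j] - 1] = Z[j]
    sp = [0] * (n + 1)
    sp[n] = spp[n]
    for i in range(n - 1, 1, -1):
        sp[i] = max(sp[i + 1] - 1, spp[i])
    return spp, sp

spp, sp = sp_from_z("abcaeabcabd")
print(spp[1:], sp[1:])
```

Scanning j from n down to 2 is what makes "smallest j" automatic: any later assignment to the same i comes from a smaller j and overwrites the earlier one.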


Preprocessing for KMP

Defn. Failure function F'(i) = sp'_{i-1} + 1, 1 ≤ i ≤ n + 1, with sp'_0 = 0

(similarly F(i) = sp_{i-1} + 1, 1 ≤ i ≤ n + 1, with sp_0 = 0)

• Idea:
  – We maintain a pointer i in P and a pointer c in T.
  – After a mismatch of P(i+1) with T(c), shift P to align P(sp'_i + 1) with T(c), i.e., set the pointer in P to F'(i + 1) = sp'_i + 1.
  – Special case 1: on a mismatch at position 1 of P, set i = F'(1) = 1 and c = c + 1.
  – Special case 2: when we find P in T, shift n - sp'_n places, i.e., i = F'(n + 1) = sp'_n + 1.
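
The failure function is just a re-indexing of the sp' values, as this small sketch shows. The function name and the list layout (index 0 holding sp'_0 = 0) are assumptions made for illustration; the sp' values for P = abcxabcde are entered by hand.

```python
def failure_function(spp):
    """Given spp[i] = sp'_i for i = 0..n (spp[0] = 0), return F with
    F[i] = F'(i) = sp'_{i-1} + 1 for i = 1..n+1 (F[0] is unused)."""
    n = len(spp) - 1
    F = [0] * (n + 2)
    for i in range(1, n + 2):
        F[i] = spp[i - 1] + 1
    return F

# sp' values for P = abcxabcde: only sp'_7 = 3 is non-zero.
spp = [0, 0, 0, 0, 0, 0, 0, 3, 0, 0]
F = failure_function(spp)
print(F[8])  # pointer in P after a mismatch at position 8
```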


Full KMP Algorithm

Preprocess P to find F'(k) = sp'_{k-1} + 1 for k from 1 to n + 1
c = 1; p = 1;
While c + (n - p) ≤ m {
    While p ≤ n and P(p) = T(c) {
        p = p + 1;
        c = c + 1;
    }
    If (p = n + 1) then
        report an occurrence of P at position c - n of T.
    If (p = 1) then c = c + 1;
    p = F'(p);
}
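
The pseudocode translates almost line for line into Python. The sketch below is one possible rendering, not the lecturer's implementation: the names `z_values`, `strong_sp` and `kmp_search` are assumptions, sp' is derived from Z values, and the pointers c and p stay 1-based so the loop mirrors the slide.

```python
def z_values(P):
    """Z[j] = Z_j(P) for 1-based j = 2..n, via the standard Z algorithm."""
    n = len(P)
    Z = [0] * (n + 1)
    l = r = 0
    for j in range(2, n + 1):
        if j <= r:
            Z[j] = min(r - j + 1, Z[j - l + 1])
        while j + Z[j] <= n and P[Z[j]] == P[j + Z[j] - 1]:
            Z[j] += 1
        if j + Z[j] - 1 > r:
            l, r = j, j + Z[j] - 1
    return Z

def strong_sp(P):
    """spp[i] = sp'_i(P) for i = 0..n, with spp[0] = 0."""
    n = len(P)
    Z = z_values(P)
    spp = [0] * (n + 1)
    for j in range(n, 1, -1):
        if Z[j] > 0:
            spp[j + Z[j] - 1] = Z[j]
    return spp

def kmp_search(T, P):
    """Return the 1-based positions of all occurrences of P in T."""
    n, m = len(P), len(T)
    if n == 0:
        return []  # assumption: an empty pattern reports nothing
    spp = strong_sp(P)
    F = [0] + [spp[k - 1] + 1 for k in range(1, n + 2)]  # F[k] = F'(k)
    occurrences = []
    c = p = 1
    while c + (n - p) <= m:
        while p <= n and P[p - 1] == T[c - 1]:
            p += 1
            c += 1
        if p == n + 1:
            occurrences.append(c - n)
        if p == 1:
            c += 1
        p = F[p]
    return occurrences

print(kmp_search("xyabcxabcxabcdefeg", "abcxabcde"))
```

The pointer c never moves backwards, which is the basis of the linear bound on the number of comparisons.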


Full KMP Algorithm

    T: xyabcxabcxabcdefeg
    P: abcxabcde
       ^                   p = 1, c = 1: a != x

• p != n + 1, so no occurrence is reported.
• p = 1, so c = 2.
• p = F'(1) = 1.


Full KMP Algorithm

    T: xyabcxabcxabcdefeg
    P:  abcxabcde
        ^                  p = 1, c = 2: a != y

• p != n + 1, so no occurrence is reported.
• p = 1, so c = 3.
• p = F'(1) = 1.


Full KMP Algorithm

    T: xyabcxabcxabcdefeg
    P:   abcxabcde
         12345678          d != x at p = 8, c = 10

• p != n + 1, so no occurrence is reported.
• p = 8, so c is not changed.
• p = F'(8) = 4.


Full KMP Algorithm

    T: xyabcxabcxabcdefeg
    P:       abcxabcde
                456789     comparisons resume at p = 4, c = 10

• Positions 4 through 9 of P all match, so the inner loop ends with p = n + 1 = 10 and c = 16.
• p = n + 1, so an occurrence of P is reported at position c - n = 7 of T.


Real-Time KMP

• Q: What is meant by real-time algorithms?
• A: Typically these are algorithms that are meant to interact synchronously in the real world.
  – This implies a known fixed turn-around time for processing a task.
  – Many embedded scheduling systems are examples involving real-time algorithms.
  – For KMP this means that we require a constant time for processing all strings of length n.


Real-Time KMP

• Q: Why is KMP not real-time?
• A: For any mismatched character in T, we may try matching it several times.
  – Recall that sp'_i only guarantees that P(i + 1) and P(sp'_i + 1) differ.
  – There is NO guarantee that P(sp'_i + 1) and T(k) match.

• We need to ensure that a mismatch at T(k) does NOT entail additional comparisons at T(k).

• This means that we have to compute sp'_i values with respect to all characters in the alphabet Σ, since any of them could appear in T.


Real-Time KMP

• Define: sp'_(i,x)(P) to be the length of the longest proper suffix of P[1..i] that matches a prefix of P, with the added condition that character P(sp'_(i,x) + 1) is x.

• This will tell us exactly what shift to use for each possible mismatch.

• A mismatched character T(k) will never be involved in subsequent comparisons.


Real-Time KMP

• Q: How do we know that the mismatched character T(k) will never be involved in subsequent comparisons?

• A: Because the shift either aligns a character of P that matches T(k) with T(k), or moves P entirely past T(k).

• This results in a real-time version of KMP.
• Let's consider how we can find the sp'_(i,x)(P) values in linear time.


Real-Time KMP

Thm. For P(i + 1) ≠ x, sp'_(i,x)(P) = i - j + 1
  – Here j is the smallest position such that j maps to i and P(Z_j + 1) = x.
  – If there is no such j, then sp'_(i,x)(P) = 0.

For i = 1 to n { sp'_(i,x) = 0 for every character x; }
For j = n downto 2 {
    i = j + Z_j(P) - 1;
    x = P(Z_j + 1);
    sp'_(i,x) = Z_j;
}


Real-Time KMP

• Notice how this works:
  – Starting from the right:
    • Find i, the right end of the Z-box associated with j.
    • Find x, the character immediately following the prefix corresponding to this Z-box.
    • Set sp'_(i,x) = Z_j, the length of this Z-box.

For i = 1 to n { sp'_(i,x) = 0 for every character x; }
For j = n downto 2 {
    i = j + Z_j(P) - 1;
    x = P(Z_j + 1);
    sp'_(i,x) = Z_j;
}
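
The table-filling loop can be sketched in Python as follows. This is an illustrative sketch only: the names `z_values` and `realtime_table` are assumptions, the table is stored as one dictionary per position so that absent entries stand for 0, and positions with Z_j = 0 are skipped because they have no Z-box.

```python
def z_values(P):
    """Z[j] = Z_j(P) for 1-based j = 2..n, via the standard Z algorithm."""
    n = len(P)
    Z = [0] * (n + 1)
    l = r = 0
    for j in range(2, n + 1):
        if j <= r:
            Z[j] = min(r - j + 1, Z[j - l + 1])
        while j + Z[j] <= n and P[Z[j]] == P[j + Z[j] - 1]:
            Z[j] += 1
        if j + Z[j] - 1 > r:
            l, r = j, j + Z[j] - 1
    return Z

def realtime_table(P):
    """Return table where table[i].get(x, 0) = sp'_(i,x)(P) for i = 1..n."""
    n = len(P)
    Z = z_values(P)
    table = [dict() for _ in range(n + 1)]  # table[0] is unused
    for j in range(n, 1, -1):          # right to left: the smallest j wins
        if Z[j] > 0:
            i = j + Z[j] - 1           # right end of the Z-box starting at j
            x = P[Z[j]]                # P(Z_j + 1) in 1-based terms
            table[i][x] = Z[j]
    return table

table = realtime_table("abcxabcde")
print(table[7])
```

For P = abcxabcde the only non-zero entry is sp'_(7,x) = 3: after matching seven characters, a text character x at T(k) lets P shift so that P(4) = x lines up with T(k), and any other mismatched character moves P entirely past T(k).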