Princeton University COS 423 Theory of Algorithms Spring 2002 Kevin Wayne String Searching Reference: Chapter 19, Algorithms in C by R. Sedgewick. Addison Wesley, 1990.


Princeton University • COS 423 • Theory of Algorithms • Spring 2002 • Kevin Wayne

String Searching

Reference: Chapter 19, Algorithms in C by R. Sedgewick. Addison Wesley, 1990.


Strings

String. Sequence of characters over some alphabet.

– binary { 0, 1 }
– ASCII, UNICODE

Some applications. Word processors. Virus scanning. Text information retrieval systems. (Lexis, Nexis) Digital libraries. Natural language processing. Specialized databases. Computational molecular biology. Web search engines.


String Searching

Parameters. N = # characters in text. M = # characters in pattern. Typically, N >> M.

– e.g., N = 1 million, M = 1 hundred

Search text:    n n e e n l e d e n e e n e e d l e n l d
Search pattern: n e e d l e

Successful search:
n n e e n l e d e n e e n e e d l e n l d
                        n e e d l e


Brute Force

Brute force. Check for pattern starting at every text position.

#include <string.h>

int brutesearch(char p[], char t[]) {
    int i, j;
    int M = strlen(p);   // pattern length
    int N = strlen(t);   // text length

    for (i = 0; i < N; i++) {
        for (j = 0; j < M; j++) {
            if (t[i+j] != p[j]) break;   // t[N] is '\0', so this stops at end of text
        }
        if (j == M) return i;            // found at offset i
    }
    return -1;                           // not found
}

Brute Force String Search


Analysis of Brute Force

Analysis of brute force. Running time depends on pattern and text.
– can be slow when strings repeat themselves
Worst case: MN comparisons.
– too slow when M and N are large

Search pattern: a a a a a b
Search text:    a a a a a a a a a a a a a a a a a a a a b

[Figure: the pattern aligned under the text at successive offsets.]
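To make the MN worst case concrete, the sketch below counts character comparisons for the repetitive example (pattern aaaaab against a run of a's ending in b). The function name `brutesearch_count` and its `comparisons` out-parameter are illustrative additions, not part of the original routine.

```c
#include <string.h>

/* Brute-force search that also reports how many character comparisons it made.
   `comparisons` is a hypothetical out-parameter added for illustration. */
int brutesearch_count(const char p[], const char t[], long *comparisons) {
    int M = (int) strlen(p);   /* pattern length */
    int N = (int) strlen(t);   /* text length */
    int i, j;

    *comparisons = 0;
    for (i = 0; i + M <= N; i++) {          /* only offsets where p fits */
        for (j = 0; j < M; j++) {
            (*comparisons)++;
            if (t[i + j] != p[j]) break;
        }
        if (j == M) return i;               /* found at offset i */
    }
    return -1;                              /* not found */
}
```

With M = 6 and N = 21 as in the example, every one of the 16 candidate offsets costs 6 comparisons, so the count is M(N - M + 1) = 96.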


How To Save Comparisons

How to avoid recomputation? Pre-analyze search pattern. Ex: suppose that first 5 characters of pattern are all a's.

– If t[0..4] matches p[0..4] then t[1..4] matches p[0..3].
– no need to check i = 1, j = 0, 1, 2, 3
– saves 4 comparisons

Need better ideas in general.

Search pattern: a a a a a b
Search text:    a a a a a a a a a a a a a a a a a a a a b


Knuth-Morris-Pratt

KMP algorithm. Use knowledge of how search pattern repeats itself. Build FSA from pattern. Run FSA on text. O(M + N) worst-case running time.

Search pattern: a a b a a a

[Figure: FSA for the search pattern, states 0 through 6; state 6 is the accept state.]

Search text: a a a b a a b a a a b

[Figure: the FSA for a a b a a a run on the search text, one character per step.]


Knuth-Morris-Pratt

KMP algorithm. Use knowledge of how search pattern repeats itself. Build FSA from pattern. Run FSA on text. O(M + N) worst-case running time.

– FSA simulation takes O(N) time
– can build FSA in O(M) time with cleverness
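The O(N) simulation can be sketched as a table lookup per text character. This is a minimal illustration, assuming the two-letter alphabet {a, b} and a transition table transcribed by hand from the FSA for the pattern aabaaa; the names `dfa` and `fsasearch` are hypothetical.

```c
#include <string.h>

/* Transition table for the pattern aabaaa over the alphabet {a, b}.
   dfa[state][0] is the move on 'a', dfa[state][1] the move on 'b'.
   Assumption: the text contains only 'a' and 'b' (anything else is treated as 'a'). */
static const int dfa[6][2] = {
    {1, 0}, {2, 0}, {2, 3}, {4, 0}, {5, 0}, {6, 3}
};

/* Run the FSA over t, one table lookup per character.
   Returns the offset of the first match, or -1 if state 6 is never reached. */
int fsasearch(const char t[]) {
    int N = (int) strlen(t);
    int state = 0, i;

    for (i = 0; i < N; i++) {
        state = dfa[state][t[i] == 'b'];
        if (state == 6) return i - 6 + 1;   /* accept state: match ends at i */
    }
    return -1;
}
```

On the search text aaabaabaaab the machine reaches the accept state after the tenth character, which corresponds to a match at offset 4.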


FSA Representation

FSA used in KMP has special property. Upon character match, go forward one state. Only need to keep track of where to go upon character mismatch.

– go to state next[j] if character mismatches in state j

Search pattern: a a b a a a

state j    0  1  2  3  4  5
a          1  2  2  4  5  6
b          0  0  3  0  0  3
next[j]    0  0  2  0  0  3

[Figure: FSA for the search pattern, states 0 through 6; state 6 is the accept state.]


KMP Algorithm

Given the FSA, string search is easy. The array next[] contains next FSA state if character mismatches.

#include <string.h>

int kmpsearch(char p[], char t[], int next[]) {
    int i, j = 0;
    int M = strlen(p);   // pattern length
    int N = strlen(t);   // text length

    for (i = 0; i < N; i++) {
        if (t[i] == p[j]) j++;           // char match
        else j = next[j];                // char mismatch
        if (j == M) return i - M + 1;    // found
    }

    return -1;                           // not found
}

KMP String Search


FSA Construction for KMP

FSA construction for KMP. FSA builds itself!

Example. Building FSA for aabaaabb. State 6. p[0..5] = aabaaa
– assume you know state for p[1..5] = abaaa: X = 2
– if next char is b (match): go forward: 6 + 1 = 7
– if next char is a (mismatch): go to state for abaaaa: X + 'a' = 2
– update X to state for p[1..6] = abaaab: X + 'b' = 3


Example. Building FSA for aabaaabb. State 7. p[0..6] = aabaaab
– assume you know state for p[1..6] = abaaab: X = 3
– if next char is b (match): go forward: 7 + 1 = 8
– if next char is a (mismatch): go to state for abaaaba: X + 'a' = 4
– update X to state for p[1..7] = abaaabb: X + 'b' = 0

[Figure: completed FSA for a a b a a a b b, states 0 through 8.]

FSA Construction for KMP

FSA construction for KMP. FSA builds itself!

Crucial insight. To compute transitions for state n of FSA, it suffices to have:
– FSA for states 0 to n-1
– state X that FSA ends up in with input p[1..n-1]

To compute state X' that FSA ends up in with input p[1..n], it suffices to have:
– FSA for states 0 to n-1
– state X that FSA ends up in with input p[1..n-1]
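The insight can be sketched as a construction of the full transition table, one state at a time. This is an illustration under stated assumptions, not the slides' own routine: a two-letter alphabet {a, b}, a hypothetical bound `MAXM` on the pattern length, and a hypothetical function name `builddfa`.

```c
#include <string.h>

#define MAXM 64   /* assumed upper bound on pattern length, for illustration */

/* Build dfa[j][c] for a pattern over {a, b}; c = 0 for 'a', 1 for 'b'.
   Assumes 1 <= strlen(p) <= MAXM. Returns the pattern length M,
   which is also the number of the accept state. */
int builddfa(const char p[], int dfa[MAXM][2]) {
    int M = (int) strlen(p);
    int j, c, X = 0;              /* X: state the FSA reaches on input p[1..j-1] */

    dfa[0][0] = dfa[0][1] = 0;    /* state 0: a mismatch stays in state 0 */
    dfa[0][p[0] == 'b'] = 1;      /* state 0: a match goes forward */
    for (j = 1; j < M; j++) {
        for (c = 0; c < 2; c++)
            dfa[j][c] = dfa[X][c];      /* mismatch: behave like state X */
        dfa[j][p[j] == 'b'] = j + 1;    /* match: go forward one state */
        X = dfa[X][p[j] == 'b'];        /* update X to the state for p[1..j] */
    }
    return M;
}
```

Each state costs constant work for a fixed alphabet, which is where the O(M) construction time comes from. For aabaaabb the sketch gives state 6 the transitions a: 2, b: 7 and state 7 the transitions a: 4, b: 8, in agreement with the worked example.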

FSA Construction for KMP

Search pattern: a a b a a a b b

state j    0  1  2  3  4  5  6  7
a          1  2  2  4  5  6  2  4
b          0  0  3  0  0  3  7  8

j   pattern[1..j]    X   next[j]
0                    0   0
1   a                1   0
2   a b              0   2
3   a b a            1   0
4   a b a a          2   0
5   a b a a a        2   3
6   a b a a a b      3   2
7   a b a a a b b    0   4

[Figure: the FSA grows by one state per step, from states 0 and 1 up to states 0 through 8.]

FSA Construction for KMP

Code for FSA construction in KMP algorithm.

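#include <string.h>

void kmpinit(char p[], int next[]) {
    int j, X = 0, M = strlen(p);
    next[0] = 0;

    for (j = 1; j < M; j++) {
        if (p[X] == p[j]) {      // p[j] agrees with the match transition out of X
            next[j] = next[X];
            X = X + 1;
        }
        else {                   // mismatch char for state j is the match char for X
            next[j] = X + 1;
            X = next[X];
        }
    }
}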

FSA Construction for KMP
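Putting construction and search together, a minimal end-to-end sketch might look like the following. It repeats the two routines so that it compiles on its own, and it assumes a two-letter alphabet, since next[j] stores a single mismatch transition per state. The wrapper `kmpfind` and its 64-character limit are hypothetical conveniences.

```c
#include <string.h>

/* next[j] is the state to enter when the character read in state j is not p[j].
   Valid for a two-letter alphabet, where the mismatching character is determined. */
void kmpinit(const char p[], int next[]) {
    int j, X = 0, M = (int) strlen(p);
    next[0] = 0;
    for (j = 1; j < M; j++) {
        if (p[X] == p[j]) { next[j] = next[X]; X = X + 1;   }
        else              { next[j] = X + 1;   X = next[X]; }
    }
}

/* Run the FSA on the text; return the offset of the first match, or -1. */
int kmpsearch(const char p[], const char t[], const int next[]) {
    int i, j = 0;
    int M = (int) strlen(p), N = (int) strlen(t);
    for (i = 0; i < N; i++) {
        if (t[i] == p[j]) j++;            /* char match: go forward one state */
        else j = next[j];                 /* char mismatch */
        if (j == M) return i - M + 1;     /* accept state reached */
    }
    return -1;
}

/* Hypothetical wrapper: build next[] and search. Assumes strlen(p) <= 64. */
int kmpfind(const char p[], const char t[]) {
    int next[64];
    if (strlen(p) > 64) return -1;        /* pattern too long for this sketch */
    kmpinit(p, next);
    return kmpsearch(p, t, next);
}
```

For the pattern aabaaabb, `kmpinit` fills next[] with 0 0 2 0 0 3 2 4, the same values as the construction table.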


Specialized KMP Implementation

Specialized C program for aabaaabb pattern.

Ultimate search program for aabaaabb pattern. Machine language version of above.

int kmpsearch(char t[]) {
    int i = 0;
    // each goto target is next[j] for that state
    // no end-of-text check: assumes the pattern occurs in t
    s0: if (t[i++] != 'a') goto s0;
    s1: if (t[i++] != 'a') goto s0;
    s2: if (t[i++] != 'b') goto s2;
    s3: if (t[i++] != 'a') goto s0;
    s4: if (t[i++] != 'a') goto s0;
    s5: if (t[i++] != 'a') goto s3;
    s6: if (t[i++] != 'b') goto s2;
    s7: if (t[i++] != 'b') goto s4;
    return i - 8;
}

Hardwired FSA for aabaaabb


Summary of KMP

KMP summary. Build FSA from pattern. Run FSA on text. O(M + N) worst case string search. Good efficiency for patterns and texts with much repetition.
– binary files
– graphics formats
Less useful for text strings. On-line algorithm.
– virus scanning
– Internet spying


History of KMP

History of KMP. Inspired by theorem of Cook that says O(M + N) algorithm should be possible. Discovered in 1976 independently by two groups. Knuth-Pratt. Morris was hacker trying to build an editor.
– annoying problem that you needed a buffer when performing text search
Resolved theoretical and practical problems. Surprise when it was discovered. In hindsight, seems like right algorithm.


Boyer-Moore

Boyer-Moore algorithm (1974). Right-to-left scanning.

– find offset i in text by moving left to right.
– compare pattern to text by moving right to left.

Text:    a s t r i n g s s e a r c h c o n s i s t i n g o f
Pattern: s t i n g

[Figure: pattern aligned under the text at successive offsets; legend: mismatch, match, no comparison.]


Boyer-Moore

Boyer-Moore algorithm (1974). Right-to-left scanning. Heuristic 1: advance offset i using "bad character rule."

– upon mismatch of text character c, look up j = index[c]
– increase offset i so that jth character of pattern lines up with text character c

Text:    a s t r i n g s s e a r c h c o n s i s t i n g o f
Pattern: s t i n g

[Figure: pattern aligned under the text at successive offsets; legend: mismatch, match, no comparison.]

Index
c          g   i   n   s   t   *
index[c]   4   2   3   0   1  -1
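The bad character rule on its own can be sketched as below. This is a simplified illustration that omits the suffix heuristic; the function name `bmsearch_badchar` and the 256-entry table size `R` are assumptions, and when the rule would not move the pattern forward the sketch falls back to a shift of one.

```c
#include <string.h>

#define R 256   /* assumed alphabet size: one entry per unsigned char value */

/* Boyer-Moore with the bad character rule only (no suffix rule).
   Returns the offset of the first match, or -1. */
int bmsearch_badchar(const char p[], const char t[]) {
    int index[R];
    int M = (int) strlen(p), N = (int) strlen(t);
    int i, j, c, shift;

    for (c = 0; c < R; c++) index[c] = -1;        /* character not in pattern */
    for (j = 0; j < M; j++)
        index[(unsigned char) p[j]] = j;          /* rightmost occurrence in pattern */

    for (i = 0; i <= N - M; i += shift) {         /* offset moves left to right */
        for (j = M - 1; j >= 0 && p[j] == t[i + j]; j--)
            ;                                     /* compare right to left */
        if (j < 0) return i;                      /* whole pattern matched */
        shift = j - index[(unsigned char) t[i + j]];
        if (shift < 1) shift = 1;                 /* always advance */
    }
    return -1;
}
```

For the pattern sting the table holds s: 0, t: 1, i: 2, n: 3, g: 4 and -1 for every other character, matching the index table above, and the search on the example text examines only a fraction of the text characters before finding the match at offset 19.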


Boyer-Moore

Boyer-Moore algorithm (1974). Right-to-left scanning. Heuristic 1: advance offset i using "bad character rule."

– extremely effective for English text
Heuristic 2: use KMP-like suffix rule.
– effective with small alphabets
– different rules lead to different worst-case behavior

Text:    x x x x x x x b a b x x x x x x x x x x x x x x
Pattern: x c a b d a b d a b

[Figure: shift of the pattern under the bad character heuristic and under the strong good suffix rule.]


Boyer-Moore

Boyer-Moore analysis. O(N / M) average case if given letter usually doesn't occur in string.
– English text: 10 character search string, 26 char alphabet
– time decreases as pattern length increases
– sublinear in input size!
O(M + N) worst-case with Galil variant.
– proof is quite difficult


Karp-Rabin

Idea: use hashing. Compute hash function for each text position. No explicit hash table!

– just compare with pattern hash

Example. Hash "table" size = 97.

Search text:    3 1 4 1 5 9 2 6 5 3 5 8 9 7 9 3 2 3 8 4 6
Search pattern: 5 9 2 6 5            59265 % 97 = 95

3 1 4 1 5            31415 % 97 = 84
  1 4 1 5 9          14159 % 97 = 94
    4 1 5 9 2        41592 % 97 = 76
      1 5 9 2 6      15926 % 97 = 18
        5 9 2 6 5    59265 % 97 = 95


Karp-Rabin

Idea: use hashing. Compute hash function for each text position.

Problems. Need full compare on hash match to guard against collisions.
– 59265 % 97 = 95
– 59362 % 97 = 95
Hash function depends on M characters.
– running time on search miss = MN


Karp-Rabin

Key idea: fast to compute hash function of adjacent substrings. Use previous hash to compute next hash. O(1) time per hash, except first one.

Example. Pre-compute: 10000 % 97 = 9. Previous hash: 41592 % 97 = 76. Next hash: 15926 % 97.

Observation.
15926      = (41592 - (4 * 10000)) * 10 + 6
15926 % 97 = ((41592 - (4 * 10000)) * 10 + 6) % 97
           = ((76 - 4 * 9) * 10 + 6) % 97
           = 406 % 97
           = 18
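The update step can be written as a small helper. This is a sketch with hypothetical names (`roll_hash` and its parameters); it adds q before the final reduction so the intermediate value cannot go negative, a detail the worked example does not need because 76 - 36 is already positive.

```c
/* One rolling-hash step: drop the high-order digit `out`, append the
   low-order digit `in`. `high` is radix^(M-1) % q, precomputed once.
   Assumes 0 <= prev < q, 0 <= high < q, and values small enough not to overflow int. */
int roll_hash(int prev, int out, int in, int high, int radix, int q) {
    int h = (prev - (out * high) % q + q) % q;   /* remove high-order digit */
    return (h * radix + in) % q;                 /* insert low-order digit */
}
```

With radix 10, q = 97 and high = 10000 % 97 = 9, the call `roll_hash(76, 4, 6, 9, 10, 97)` reproduces the step from 41592 to 15926 and yields 18; one more step, dropping 1 and appending 5, yields 95, the hash of 59265.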


#include <string.h>

#define q 3355439   // table size
#define d 256       // radix

int rksearch(char p[], char t[]) {
    int i, j, dM = 1, h1 = 0, h2 = 0;
    int M = strlen(p), N = strlen(t);
    if (M > N) return -1;                        // pattern longer than text

    for (j = 1; j < M; j++)                      // precompute d^(M-1) % q
        dM = (d * dM) % q;
    for (j = 0; j < M; j++) {
        h1 = (h1*d + (unsigned char) p[j]) % q;  // hash of pattern
        h2 = (h2*d + (unsigned char) t[j]) % q;  // hash of first text window
    }

    for (i = M; ; i++) {
        if (h1 == h2) return i - M;              // hash match (not verified)
        if (i >= N) break;                       // last window already checked
        h2 = (h2 + d*q - (unsigned char) t[i-M]*dM) % q;  // remove high order digit
        h2 = (h2*d + (unsigned char) t[i]) % q;           // insert low order digit
    }
    return -1;                                   // not found
}

Karp-Rabin (Sedgewick, p. 290)


Karp-Rabin

Karp-Rabin algorithm. Choose table size at RANDOM to be huge prime. Expected running time is O(M + N). O(MN) worst-case, but this is (unbelievably) unlikely.

Randomized algorithms. Monte Carlo: don't check for collisions.
– algorithm can be wrong but running time guaranteed linear
Las Vegas: if collision, start over with new random table size.
– algorithm always correct, but running time is expected linear

Advantages. Extends to 2d patterns and other generalizations.


String Search Analysis

How to test algorithm?

Bad idea: use random pattern on random text. Example: binary text.

– N possible patterns.
– probability of search success < N / 2^M.
– NOT FOUND is simple algorithm that usually works

Better idea: Search for fixed pattern in fixed text. Use random position for successful search. Use random perturbation for unsuccessful search.
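One way to set up the "random position for successful search" case is to plant the pattern into a writable copy of the fixed text. The helper below is a sketch only; the name `plant_pattern` is hypothetical, and it relies on the standard `rand()` generator, which the caller is expected to seed with `srand()`.

```c
#include <stdlib.h>
#include <string.h>

/* Overwrite part of the writable text t with pattern p at a random offset.
   Returns the chosen offset, or -1 if p is longer than t.
   A correct search routine must then report a match at or before that offset. */
int plant_pattern(char t[], const char p[]) {
    size_t N = strlen(t), M = strlen(p);
    int pos;

    if (M > N) return -1;
    pos = (int) ((size_t) rand() % (N - M + 1));   /* offsets 0 .. N-M */
    memcpy(t + pos, p, M);                         /* plant the pattern */
    return pos;
}
```

A test harness could call this on a fresh copy of the text before each run and compare every algorithm's answer against a trusted routine such as brute force.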