46
240-301 Comp. Eng. Lab III (Software), Pattern Matching Pattern Matching Pattern Matching 1 a b a c a a b 2 3 4 a b a c a b a b a c a b Dr. Andrew Davison WiG Lab (teachers room), CoE [email protected] 0-301, Computer Engineering Lab III (Software) T: P: Semester 1, 2006-2007

Pattern Matching

Embed Size (px)

Citation preview

Page 1: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 1

Pattern MatchingPattern Matching

1

a b a c a a b

234

a b a c a b

a b a c a b

Dr. Andrew DavisonWiG Lab (teachers room), [email protected]

240-301, Computer Engineering Lab III (Software)

T:

P:

Semester 1, 2006-2007

Page 2: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 2

OverviewOverview

1. What is Pattern Matching?1. What is Pattern Matching?

2. The Brute Force Algorithm2. The Brute Force Algorithm

3. The Boyer-Moore Algorithm3. The Boyer-Moore Algorithm

4. The Knuth-Morris-Pratt Algorithm4. The Knuth-Morris-Pratt Algorithm

5. More Information5. More Information

Page 3: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 3

1. What is Pattern Matching?1. What is Pattern Matching?

Definition:Definition:– given a text string T and a pattern string P, find tgiven a text string T and a pattern string P, find t

he pattern inside the texthe pattern inside the text T: “the rain in spain stays mainly on the plain”T: “the rain in spain stays mainly on the plain” P: “n th”P: “n th”

Applications:Applications:– text editors, Web search engines (e.g. Google), itext editors, Web search engines (e.g. Google), i

mage analysismage analysis

Page 4: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 4

String ConceptsString Concepts

Assume S is a string of size m.Assume S is a string of size m.

A A substringsubstring S[i .. j] of S is the string S[i .. j] of S is the string fragment between indexes i and j.fragment between indexes i and j.

A A prefixprefix of S is a substring S[0 .. i] of S is a substring S[0 .. i] A A suffixsuffix of S is a substring S[i .. m-1] of S is a substring S[i .. m-1]

– i is any index between 0 and m-1 i is any index between 0 and m-1

Page 5: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 5

ExamplesExamples

Substring S[1..3] == "ndr"Substring S[1..3] == "ndr"

All possible prefixes of S:All possible prefixes of S:– "andrew", "andre", "andr", "and", "an”, "a""andrew", "andre", "andr", "and", "an”, "a"

All possible suffixes of S:All possible suffixes of S:– "andrew", "ndrew", "drew", "rew", "ew", "w""andrew", "ndrew", "drew", "rew", "ew", "w"

a n d r e wS

0 5

Page 6: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 6

2. The Brute Force Algorithm2. The Brute Force Algorithm

Check each position in the text T to see if thCheck each position in the text T to see if the pattern P starts in that positione pattern P starts in that position

a n d r e wT:

r e wP:

a n d r e wT:

r e wP:

. . . .P moves 1 char at a time through T

Page 7: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 7

Brute Force in JavaBrute Force in Java

public static int brute(String text,String pattern) { int n = text.length(); // n is length of text int m = pattern.length(); // m is length of pattern int j; for(int i=0; i <= (n-m); i++) { j = 0; while ((j < m) && (text.charAt(i+j) == pattern.charAt(j)) ) j++; if (j == m) return i; // match at i } return -1; // no match} // end of brute()

Return index where pattern starts, or -1

Page 8: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 8

UsageUsagepublic static void main(String args[]){ if (args.length != 2) { System.out.println("Usage: java BruteSearch <text> <pattern>"); System.exit(0); } System.out.println("Text: " + args[0]); System.out.println("Pattern: " + args[1]);

int posn = brute(args[0], args[1]); if (posn == -1) System.out.println("Pattern not found"); else System.out.println("Pattern starts at posn "

+ posn);}

Page 9: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 9

AnalysisAnalysis

Brute force pattern matching runs in time Brute force pattern matching runs in time O(mn) in the worst case.O(mn) in the worst case.

But most searches of ordinary text take But most searches of ordinary text take O(m+n), which is very quick. O(m+n), which is very quick.

continued

Page 10: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 10

The brute force algorithm is fast when the alThe brute force algorithm is fast when the alphabet of the text is large phabet of the text is large – e.g. A..Z, a..z, 1..9, etc.e.g. A..Z, a..z, 1..9, etc.

It is slower when the alphabet is smallIt is slower when the alphabet is small– e.g. 0, 1 (as in binary files, image files, etc.)e.g. 0, 1 (as in binary files, image files, etc.)

continued

Page 11: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 11

Example of a worst case:Example of a worst case:– T: "aaaaaaaaaaaaaaaaaaaaaaaaaah"T: "aaaaaaaaaaaaaaaaaaaaaaaaaah"– P: "aaah"P: "aaah"

Example of a more average case:Example of a more average case:– T: "a string searching example is standard"T: "a string searching example is standard"– P: "store"P: "store"

Page 12: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 12

3. The Boyer-Moore Algorithm3. The Boyer-Moore Algorithm

The Boyer-Moore pattern matching The Boyer-Moore pattern matching algorithm is based on two techniques.algorithm is based on two techniques.

1. The 1. The looking-glasslooking-glass technique technique– find P in T by moving find P in T by moving backwardsbackwards through P, through P,

starting at its endstarting at its end

Page 13: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 13

2. The 2. The character-jumpcharacter-jump technique technique– when a mismatch occurs at T[i] == xwhen a mismatch occurs at T[i] == x– the character in pattern P[j] is not the the character in pattern P[j] is not the

same as T[i]same as T[i]

There are 3 possible There are 3 possible cases, tried in order.cases, tried in order.

x aTi

b aP

j

Page 14: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 14

Case 1Case 1 If P contains x somewhere, then try to If P contains x somewhere, then try to

shift Pshift P right to align the last occurrence right to align the last occurrence of x in P with T[i].of x in P with T[i].

x aTi

b aP

j

x c

x aTinew

b aP

jnew

x c

? ?

and move i and j right, soj at end

Page 15: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 15

Case 2Case 2 If P contains x somewhere, but a shift right If P contains x somewhere, but a shift right

to the last occurrence is to the last occurrence is notnot possible, then possible, thenshift Pshift P right by 1 character to T[i+1]. right by 1 character to T[i+1].

a xTi

a xP

j

c w

a xTinew

a xP

jnew

c w

?

and move i and j right, soj at end

x

x is after j position

x

Page 16: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 16

Case 3Case 3 If cases 1 and 2 do not apply, then If cases 1 and 2 do not apply, then shiftshift P to P to

align P[0] with T[i+1].align P[0] with T[i+1].

x aTi

b aP

j

d c

x aTinew

b aP

jnew

d c

? ?

and move i and j right, soj at end

No x in P

?

0

Page 17: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 17

Boyer-Moore Example (1)Boyer-Moore Example (1)

1

a p a t t e r n m a t c h i n g a l g o r i t h m

r i t h m

r i t h m

r i t h m

r i t h m

r i t h m

r i t h m

r i t h m

2

3

4

5

6

7891011

T:

P:

Page 18: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 18

Last Occurrence FunctionLast Occurrence Function

Boyer-Moore’s algorithm preprocesses the Boyer-Moore’s algorithm preprocesses the pattern P and the alphabet A to build a last pattern P and the alphabet A to build a last occurrence function L()occurrence function L()– L() maps all the letters in A to integersL() maps all the letters in A to integers

L(x) is defined as:L(x) is defined as: // x is a letter in // x is a letter in AA– the largest index i such that P[i] == x, orthe largest index i such that P[i] == x, or– -1 if no such index exists-1 if no such index exists

Page 19: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 19

L() ExampleL() Example

A = {a, b, c, d}A = {a, b, c, d} P: "abacab"P: "abacab"

-1354L(x)

dcbax

a b a c a b

0 1 2 3 4 5

P

L() stores indexes into P[]

Page 20: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 20

NoteNote

In Boyer-Moore code, L() is calculated wheIn Boyer-Moore code, L() is calculated when the pattern P is read in.n the pattern P is read in.

Usually L() is stored as an arrayUsually L() is stored as an array– something like the table in the previous slidesomething like the table in the previous slide

Page 21: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 21

Boyer-Moore Example (2)Boyer-Moore Example (2)

1

a b a c a a b a d c a b a c a b a a b b

234

5

6

7

891012

a b a c a b

a b a c a b

a b a c a b

a b a c a b

a b a c a b

a b a c a b1113

11335544LL((xx))

ddccbbaaxx

T:

P:

Page 22: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 22

Boyer-Moore in JavaBoyer-Moore in Java

public static int bmMatch(String text, String pattern)

{ int last[] = buildLast(pattern); int n = text.length(); int m = pattern.length(); int i = m-1;

if (i > n-1) return -1; // no match if pattern is // longer than text :

Return index where pattern starts, or -1

Page 23: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 23

int j = m-1; do { if (pattern.charAt(j) == text.charAt(i)) if (j == 0) return i; // match else { // looking-glass technique i--; j--; } else { // character jump technique int lo = last[text.charAt(i)]; //last occ i = i + m - Math.min(j, 1+lo); j = m - 1; } } while (i <= n-1);

return -1; // no match } // end of bmMatch()

Page 24: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 24

public static int[] buildLast(String pattern) /* Return array storing index of last occurrence of each ASCII char in pattern. */ { int last[] = new int[128]; // ASCII char set

for(int i=0; i < 128; i++) last[i] = -1; // initialize array

for (int i = 0; i < pattern.length(); i++) last[pattern.charAt(i)] = i;

return last; } // end of buildLast()

Page 25: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 25

UsageUsage public static void main(String args[]) { if (args.length != 2) { System.out.println("Usage: java BmSearch <text> <pattern>"); System.exit(0); } System.out.println("Text: " + args[0]); System.out.println("Pattern: " + args[1]);

int posn = bmMatch(args[0], args[1]); if (posn == -1) System.out.println("Pattern not found"); else System.out.println("Pattern starts at posn " + posn); }

Page 26: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 26

AnalysisAnalysis Boyer-Moore worst case running time is Boyer-Moore worst case running time is

O(nm + A)O(nm + A)

But, Boyer-Moore is fast when the alphabet (A) is But, Boyer-Moore is fast when the alphabet (A) is large, slow when the alphabet is small.large, slow when the alphabet is small.– e.g. good for English text, poor for binarye.g. good for English text, poor for binary

Boyer-Moore is Boyer-Moore is significantly faster than brute forcesignificantly faster than brute force for searching English text.for searching English text.

Page 27: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 27

Worst Case ExampleWorst Case Example

T: "aaaaa…a"T: "aaaaa…a" P: "baaaaa"P: "baaaaa"

11

1

a a a a a a a a a

23456

b a a a a a

b a a a a a

b a a a a a

b a a a a a

7891012

131415161718

192021222324

T:

P:

Page 28: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 28

4. The KMP Algorithm4. The KMP Algorithm

The Knuth-Morris-Pratt (KMP) algorithm lThe Knuth-Morris-Pratt (KMP) algorithm looks for the pattern in the text in a ooks for the pattern in the text in a left-to-rileft-to-rightght order (like the brute force algorithm). order (like the brute force algorithm).

But it shifts the pattern more intelligently thBut it shifts the pattern more intelligently than the brute force algorithm.an the brute force algorithm.

continued

Page 29: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 29

If a mismatch occurs between the text and pIf a mismatch occurs between the text and pattern P at P[j], what is the attern P at P[j], what is the mostmost we can shif we can shift the pattern to avoid wasteful comparisons?t the pattern to avoid wasteful comparisons?

AnswerAnswer: the largest prefix of P[0 .. j-1] that i: the largest prefix of P[0 .. j-1] that is a suffix of P[1 .. j-1]s a suffix of P[1 .. j-1]

Page 30: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 30

ExampleExample

T:

P:

jnew = 2

j = 5

i

Page 31: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 31

WhyWhy

Find largest prefix (start) of:Find largest prefix (start) of:"a b a a b""a b a a b" ( P[0..j-1] )( P[0..j-1] )

which is suffix (end) of:which is suffix (end) of:"b a a b""b a a b" ( p[1 .. j-1] )( p[1 .. j-1] )

Answer: "a b"Answer: "a b" Set j = 2 // the new j valueSet j = 2 // the new j value

j == 5

Page 32: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 32

KMP Failure FunctionKMP Failure Function

KMP preprocesses the pattern to find KMP preprocesses the pattern to find matches of prefixes of the pattern with the matches of prefixes of the pattern with the pattern itself.pattern itself.

j = mismatch position in P[]j = mismatch position in P[] k = position before the mismatch (k = j-1).k = position before the mismatch (k = j-1). The The failure functionfailure function F(k) is defined as the F(k) is defined as the

sizesize of the largest prefix of P[0..k] that is also of the largest prefix of P[0..k] that is also a suffix of P[1..k].a suffix of P[1..k].

Page 33: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 33

P: "abaaba"P: "abaaba"

j: j: 012345012345

In code, F() is represented by an array, like In code, F() is represented by an array, like the table.the table.

Failure Function ExampleFailure Function Example

F(k) is the size of the largest prefix.

1

3

2

4210j

100F(j)

k

F(k)

(k == j-1)

Page 34: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 34

Why is F(4) == 2?Why is F(4) == 2?

F(4) meansF(4) means– find the size of the largest prefix of P[0..4] that find the size of the largest prefix of P[0..4] that

is also a suffix of P[1..4]is also a suffix of P[1..4]

= find the size largest prefix of "abaab" that = find the size largest prefix of "abaab" that is also a suffix of "baab"is also a suffix of "baab"

= find the size of "ab"= find the size of "ab"

= 2= 2

P: "abaaba"

Page 35: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 35

Knuth-Morris-Pratt’s algorithm modifies the brute-Knuth-Morris-Pratt’s algorithm modifies the brute-force algorithm.force algorithm.– if a mismatch occurs at P[j] if a mismatch occurs at P[j]

(i.e. P[j] != T[i]), then (i.e. P[j] != T[i]), then k = j-1; k = j-1; j = F(k); // obtain the new j j = F(k); // obtain the new j

Using the Failure FunctionUsing the Failure Function

Page 36: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 36

KMP in JavaKMP in Java

public static int kmpMatch(String text, String pattern)

{ int n = text.length(); int m = pattern.length();

int fail[] = computeFail(pattern);

int i=0; int j=0; :

Return index where pattern starts, or -1

Page 37: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 37

while (i < n) { if (pattern.charAt(j) == text.charAt(i)) { if (j == m - 1) return i - m + 1; // match i++; j++; } else if (j > 0) j = fail[j-1]; else i++; } return -1; // no match } // end of kmpMatch()

Page 38: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 38

public static int[] computeFail(String pattern)

{ int fail[] = new int[pattern.length()]; fail[0] = 0;

int m = pattern.length(); int j = 0; int i = 1; :

Page 39: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 39

while (i < m) { if (pattern.charAt(j) == pattern.charAt(i)) { //j+1 chars match fail[i] = j + 1; i++; j++; } else if (j > 0) // j follows matching prefix j = fail[j-1]; else { // no match fail[i] = 0; i++; } } return fail; } // end of computeFail() Similar code

to kmpMatch()

Page 40: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 40

UsageUsage public static void main(String args[]) { if (args.length != 2) { System.out.println("Usage: java KmpSearch <text> <pattern>"); System.exit(0); } System.out.println("Text: " + args[0]); System.out.println("Pattern: " + args[1]);

int posn = kmpMatch(args[0], args[1]); if (posn == -1) System.out.println("Pattern not found"); else System.out.println("Pattern starts at posn " + posn); }

Page 41: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 41

ExampleExample

1

a b a c a a b a c a b a c a b a a b b

7

8

19181715

a b a c a b

1614

13

2 3 4 5 6

9

a b a c a b

a b a c a b

a b a c a b

a b a c a b

10 11 12

c

0

3

1

4210k

100F(k)

T:

P:

Page 42: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 42

Why is F(4) == 1?Why is F(4) == 1?

F(4) meansF(4) means– find the size of the largest prefix of P[0..4] that find the size of the largest prefix of P[0..4] that

is also a suffix of P[1..4]is also a suffix of P[1..4]

= find the size largest prefix of "abaca" that = find the size largest prefix of "abaca" that is also a suffix of "baca"is also a suffix of "baca"

= find the size of "a"= find the size of "a"

= 1= 1

P: "abacab"

Page 43: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 43

KMP AdvantagesKMP Advantages

KMP runs in optimal time: O(m+n)KMP runs in optimal time: O(m+n)– very fastvery fast

The algorithm never needs to move backwaThe algorithm never needs to move backwards in the input text, Trds in the input text, T– this makes the algorithm good for processing vethis makes the algorithm good for processing ve

ry large files that are read in from external deviry large files that are read in from external devices or through a network streamces or through a network stream

Page 44: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 44

KMP DisadvantagesKMP Disadvantages

KMP doesn’t work so well as the size of the KMP doesn’t work so well as the size of the alphabet increasesalphabet increases– more chance of a mismatch (more possible mismore chance of a mismatch (more possible mis

matches)matches)– mismatches tend to occur early in the pattern, bmismatches tend to occur early in the pattern, b

ut KMP is faster when the mismatches occur latut KMP is faster when the mismatches occur laterer

Page 45: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 45

KMP ExtensionsKMP Extensions

The basic algorithm doesn't take into accouThe basic algorithm doesn't take into account the letter in the text that caused the mismnt the letter in the text that caused the mismatch.atch.

a a ab b

a a ab b a

x

a a ab b a

T:

P:Basic KMP

does not do this.

Page 46: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 46

5. More Information5. More Information Algorithms in C++Algorithms in C++

Robert SedgewickRobert SedgewickAddison-Wesley, 1992Addison-Wesley, 1992– chapter 19, String Searchingchapter 19, String Searching

Online Animated Algorithms:Online Animated Algorithms:– http://www.ics.uci.edu/~goodrich/dsa/ 11strings/demos/pattern/

– http://www-sr.informatik.uni-tuebingen.de/~buehler/BM/BM1.html

– http://www-igm.univ-mlv.fr/~lecroq/string/

This book isin the CoE library.