48
240-301 Comp. Eng. Lab III (Software), Pattern Matching Pattern Matching Pattern Matching 1 a b a c a a b 2 3 4 a b a c a b a b a c a b Dr. Andrew Davison WiG Lab (teachers room), CoE [email protected] 0-301, Computer Engineering Lab III (Software) T: P: Semester 1, 2006-2007

Pattern Matching

Embed Size (px)

DESCRIPTION

240-301, Computer Engineering Lab III (Software). Semester 1, 200 6 -200 7. Pattern Matching. Dr. Andrew Davison WiG Lab (teachers room) , CoE [email protected] .psu.ac.th. T:. P:. Overview. 1. What is Pattern Matching? 2. The Brute Force Algorithm 3. The Knuth-Morris-Pratt Algorithm - PowerPoint PPT Presentation

Citation preview

Page 1: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 1

Pattern MatchingPattern Matching

1

a b a c a a b

234

a b a c a b

a b a c a b

Dr. Andrew DavisonWiG Lab (teachers room), [email protected]

240-301, Computer Engineering Lab III (Software)

T:

P:

Semester 1, 2006-2007

Page 2: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 2

OverviewOverview

1. What is Pattern Matching?1. What is Pattern Matching?

2. The Brute Force Algorithm2. The Brute Force Algorithm

3. The Knuth-Morris-Pratt Algorithm3. The Knuth-Morris-Pratt Algorithm

4. The Boyer-Moore Algorithm4. The Boyer-Moore Algorithm

5. More Information5. More Information

Page 3: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 3

1. What is Pattern Matching?1. What is Pattern Matching?

Definition:Definition:– given a text string T and a pattern string P, find tgiven a text string T and a pattern string P, find t

he pattern inside the texthe pattern inside the text T: “the rain in spain stays mainly on the plain”T: “the rain in spain stays mainly on the plain” P: “n th”P: “n th”

Applications:Applications:– text editors, Web search engines (e.g. Google), itext editors, Web search engines (e.g. Google), i

mage analysismage analysis

Page 4: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 4

String ConceptsString Concepts

Assume S is a string of size m.Assume S is a string of size m.

S = xx11xx2 2 … … xxmm

A A prefixprefix of S is a substring S[1 .. of S is a substring S[1 .. kk-1]-1] A A suffixsuffix of S is a substring S[ of S is a substring S[k-k-1 .. m]1 .. m]

– k is any index between 1 and m k is any index between 1 and m – S[0] is null character S[0] is null character

Page 5: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 5

ExamplesExamples

All possible prefixes of S:All possible prefixes of S:– “ “””, , “a", "an", "and", "an“a", "an", "and", "andrdr”, "a”, "andrendre““, ,

All possible suffixes of S:All possible suffixes of S:– ““””, , ““ww", “ew", “rew", “", “ew", “rew", “ddrew", “rew", “ndrndrew” ew”

a n d r e wS

0 5

Page 6: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 6

2. The Brute Force Algorithm2. The Brute Force Algorithm

Check each position in the text T to see if thCheck each position in the text T to see if the pattern P starts in that positione pattern P starts in that position

a n d r e wT:

r e wP:

a n d r e wT:

r e wP:

. . . .P moves 1 char at a time through T

Page 7: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 7

Brute Force in JavaBrute Force in Java

public static int brute(String text,String pattern) { int n = text.length(); // n is length of text int m = pattern.length(); // m is length of pattern int j; for(int i=0; i <= (n-m); i++) { j = 0; while ((j < m) && (text.charAt(i+j) == pattern.charAt(j)) ) j++; if (j == m) return i; // match at i } return -1; // no match} // end of brute()

Return index where pattern starts, or -1

Page 8: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 8

UsageUsagepublic static void main(String args[]){ if (args.length != 2) { System.out.println("Usage: java BruteSearch <text> <pattern>"); System.exit(0); } System.out.println("Text: " + args[0]); System.out.println("Pattern: " + args[1]);

int posn = brute(args[0], args[1]); if (posn == -1) System.out.println("Pattern not found"); else System.out.println("Pattern starts at posn "

+ posn);}

Page 9: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 9

AnalysisAnalysis

Brute force pattern matching runs in time Brute force pattern matching runs in time O(mn) in the worst case.O(mn) in the worst case.

But most searches of ordinary text take But most searches of ordinary text take O(m+n), which is very quick. O(m+n), which is very quick.

continued

Page 10: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 10

The brute force algorithm is fast when the alThe brute force algorithm is fast when the alphabet of the text is large phabet of the text is large – e.g. A..Z, a..z, 1..9, etc.e.g. A..Z, a..z, 1..9, etc.

It is slower when the alphabet is smallIt is slower when the alphabet is small– e.g. 0, 1 (as in binary files, image files, etc.)e.g. 0, 1 (as in binary files, image files, etc.)

continued

Page 11: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 11

Example of a worst case:Example of a worst case:– T: "aaaaaaaaaaaaaaaaaaaaaaaaaah"T: "aaaaaaaaaaaaaaaaaaaaaaaaaah"– P: "aaah"P: "aaah"

Example of a more average case:Example of a more average case:– T: "a string searching example is standard"T: "a string searching example is standard"– P: "store"P: "store"

Page 12: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 12

22. The KMP Algorithm. The KMP Algorithm

The Knuth-Morris-Pratt (KMP) algorithm lThe Knuth-Morris-Pratt (KMP) algorithm looks for the pattern in the text in a ooks for the pattern in the text in a left-to-rileft-to-rightght order (like the brute force algorithm). order (like the brute force algorithm).

But it shifts the pattern more intelligently thBut it shifts the pattern more intelligently than the brute force algorithm.an the brute force algorithm.

continued

Page 13: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 13

Donald E. Knuth

Donald Ervin Knuth (born January 10, 1938) is a computer scientist and Professor Emeritus at Stanford University. He is the author of the seminal multi-volume work The Art of Computer Programming.[3] Knuth has been called the "father" of the analysis of algorithms. He contributed to the development of the rigorous analysis of the computational complexity of algorithms and systematized formal mathematical techniques for it. In the process he also popularized the asymptotic notation.

Page 14: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 14

If a mismatch occurs between the text and pIf a mismatch occurs between the text and pattern P at P[j], what is the attern P at P[j], what is the mostmost we can shif we can shift the pattern to avoid wasteful comparisons?t the pattern to avoid wasteful comparisons?

AnswerAnswer: the largest prefix of P[: the largest prefix of P[11 .. j-1] that i .. j-1] that is a suffix of P[1 .. j-1]s a suffix of P[1 .. j-1]

Page 15: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 15

ExampleExample

T:

P:

jnew = 3

j = 6

i

Page 16: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 16

Page 17: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 17

WhyWhy

Find largest prefix (start) of:Find largest prefix (start) of:"a b a a b""a b a a b" ( P[( P[11..j-1] )..j-1] )

which is suffix (end) of:which is suffix (end) of:““a a b a a b"b a a b" ( p[1 .. j-1] )( p[1 .. j-1] )

Answer: "a b"Answer: "a b" Set j = Set j = 33 // the new j value // the new j value

j == 5

Page 18: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 18

KMP KMP BorderBorder Function Function

KMP preprocesses the pattern to find matches of KMP preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself.prefixes of the pattern with the pattern itself.

j = mismatch position in P[]j = mismatch position in P[] k = position before the mismatch (k = j-1).k = position before the mismatch (k = j-1). The The border functionborder function b(k) is defined as the b(k) is defined as the sizesize of of

the largest prefix of P[1..k] that is also a suffix of the largest prefix of P[1..k] that is also a suffix of P[1..k].P[1..k].

The other name: The other name: failure functionfailure function

Page 19: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 19

P: "abaaba"P: "abaaba"

j: j: 123451234566

In code, b() is represented by an array, like In code, b() is represented by an array, like the table.the table.

BorderBorder Function Example Function Example

b(k) is the size of the largest border.

1

4

2

5321j

100F(j)

k

b(k)

(k == j-1)

Page 20: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 20

Why is Why is bb((55) == 2?) == 2?

bb((55) means) means– find the size of the largest prefix of P[find the size of the largest prefix of P[11....55] that ] that

is also a suffix of P[1..is also a suffix of P[1..55]]

= find the size largest prefix of "abaab" that = find the size largest prefix of "abaab" that is also a suffix of "baab"is also a suffix of "baab"

= find the size of "ab"= find the size of "ab"

= 2= 2

P: "abaaba"

Page 21: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 21

Knuth-Morris-Pratt’s algorithm modifies the brute-Knuth-Morris-Pratt’s algorithm modifies the brute-force algorithm.force algorithm.– if a mismatch occurs at P[j] if a mismatch occurs at P[j]

(i.e. P[j] != T[i]), then (i.e. P[j] != T[i]), then k = j-1; k = j-1; j = b(k) + 1; // obtain the new j j = b(k) + 1; // obtain the new j

Using the Using the BorderBorder Function Function

Page 22: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 22

KMP in JavaKMP in Java

public static int kmpMatch(String text, String pattern)

{ int n = text.length(); int m = pattern.length();

int fail[] = computeFail(pattern);

int i=0; int j=0; :

Return index where pattern starts, or -1

Page 23: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 23

while (i < n) { if (pattern.charAt(j) == text.charAt(i)) { if (j == m - 1) return i - m + 1; // match i++; j++; } else if (j > 0) j = fail[j-1]; else i++; } return -1; // no match } // end of kmpMatch()

Page 24: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 24

public static int[] computeFail(String pattern)

{ int fail[] = new int[pattern.length()]; fail[0] = 0;

int m = pattern.length(); int j = 0; int i = 1; :

Page 25: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 25

while (i < m) { if (pattern.charAt(j) == pattern.charAt(i)) { //j+1 chars match fail[i] = j + 1; i++; j++; } else if (j > 0) // j follows matching prefix j = fail[j-1]; else { // no match fail[i] = 0; i++; } } return fail; } // end of computeFail() Similar code

to kmpMatch()

Page 26: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 26

UsageUsage public static void main(String args[]) { if (args.length != 2) { System.out.println("Usage: java KmpSearch <text> <pattern>"); System.exit(0); } System.out.println("Text: " + args[0]); System.out.println("Pattern: " + args[1]);

int posn = kmpMatch(args[0], args[1]); if (posn == -1) System.out.println("Pattern not found"); else System.out.println("Pattern starts at posn " + posn); }

Page 27: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 27

ExampleExample

1

a b a c a a b a c a b a c a b a a b b

7

8

19181715

a b a c a b

1614

13

2 3 4 5 6

9

a b a c a b

a b a c a b

a b a c a b

a b a c a b

10 11 12

c

0

4

1

5321k

100b(k)

T:

P:

Page 28: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 28

Why is Why is bb(4) == 1?(4) == 1?

bb(4) means(4) means– find the size of the largest prefix of P[find the size of the largest prefix of P[11....55] that ] that

is also a suffix of P[1..is also a suffix of P[1..55]]

= find the size largest prefix of "abaca" that = find the size largest prefix of "abaca" that is also a suffix of "baca"is also a suffix of "baca"

= find the size of "a"= find the size of "a"

= 1= 1

P: "abacab"

Page 29: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 29

KMP AdvantagesKMP Advantages

KMP runs in optimal time: O(m+n)KMP runs in optimal time: O(m+n)– very fastvery fast

The algorithm never needs to move backwaThe algorithm never needs to move backwards in the input text, Trds in the input text, T– this makes the algorithm good for processing vethis makes the algorithm good for processing ve

ry large files that are read in from external deviry large files that are read in from external devices or through a network streamces or through a network stream

Page 30: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 30

KMP DisadvantagesKMP Disadvantages

KMP doesn’t work so well as the size of the KMP doesn’t work so well as the size of the alphabet increasesalphabet increases– more chance of a mismatch (more possible mismore chance of a mismatch (more possible mis

matches)matches)– mismatches tend to occur early in the pattern, bmismatches tend to occur early in the pattern, b

ut KMP is faster when the mismatches occur latut KMP is faster when the mismatches occur laterer

Page 31: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 31

KMP ExtensionsKMP Extensions

The basic algorithm doesn't take into accouThe basic algorithm doesn't take into account the letter in the text that caused the mismnt the letter in the text that caused the mismatch.atch.

a a ab b

a a ab b a

x

a a ab b a

T:

P:Basic KMP

does not do this.

Page 32: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 32

3. The Boyer-Moore Algorithm3. The Boyer-Moore Algorithm

The Boyer-Moore pattern matching The Boyer-Moore pattern matching algorithm is based on two techniques.algorithm is based on two techniques.

1. The 1. The looking-glasslooking-glass technique technique– find P in T by moving find P in T by moving backwardsbackwards through P, through P,

starting at its endstarting at its end

Page 33: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 33

2. The 2. The character-jumpcharacter-jump technique technique– when a mismatch occurs at T[i] == xwhen a mismatch occurs at T[i] == x– the character in pattern P[j] is not the the character in pattern P[j] is not the

same as T[i]same as T[i]

There are 3 possible There are 3 possible cases, tried in order.cases, tried in order.

x aTi

b aP

j

Page 34: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 34

Case 1Case 1 If P contains x somewhere, then try to If P contains x somewhere, then try to

shift Pshift P right to align the last occurrence right to align the last occurrence of x in P with T[i].of x in P with T[i].

x aTi

b aP

j

x c

x aTinew

b aP

jnew

x c

? ?

and move i and j right, soj at end

Page 35: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 35

Case 2Case 2 If P contains x somewhere, but a shift right If P contains x somewhere, but a shift right

to the last occurrence is to the last occurrence is notnot possible, then possible, thenshift Pshift P right by 1 character to T[i+1]. right by 1 character to T[i+1].

a xTi

a xP

j

c w

a xTinew

a xP

jnew

c w

?

and move i and j right, soj at end

x

x is after j position

x

Page 36: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 36

Case 3Case 3 If cases 1 and 2 do not apply, then If cases 1 and 2 do not apply, then shiftshift P to P to

align P[align P[11] with T[i+1].] with T[i+1].

x aTi

b aP

j

d c

x aTinew

b aP

jnew

d c

? ?

and move i and j right, soj at end

No x in P

?

1

Page 37: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 37

Boyer-Moore Example (1)Boyer-Moore Example (1)

1

a p a t t e r n m a t c h i n g a l g o r i t h m

r i t h m

r i t h m

r i t h m

r i t h m

r i t h m

r i t h m

r i t h m

2

3

4

5

6

7891011

T:

P:

Page 38: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 38

Last Occurrence FunctionLast Occurrence Function

Boyer-Moore’s algorithm preprocesses the Boyer-Moore’s algorithm preprocesses the pattern P and the alphabet A to build a last pattern P and the alphabet A to build a last occurrence function L()occurrence function L()– L() maps all the letters in A to integersL() maps all the letters in A to integers

L(x) is defined as:L(x) is defined as: // x is a letter in // x is a letter in AA– the largest index i such that P[i] == x, orthe largest index i such that P[i] == x, or– -1 if no such index exists-1 if no such index exists

Page 39: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 39

L() ExampleL() Example

A = {a, b, c, d}A = {a, b, c, d} P: "abacab"P: "abacab"

-1465L(x)

dcbax

a b a c a b

1 2 3 4 5 6

P

L() stores indexes into P[]

Page 40: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 40

NoteNote

In Boyer-Moore code, L() is calculated wheIn Boyer-Moore code, L() is calculated when the pattern P is read in.n the pattern P is read in.

Usually L() is stored as an arrayUsually L() is stored as an array– something like the table in the previous slidesomething like the table in the previous slide

Page 41: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 41

Boyer-Moore Example (2)Boyer-Moore Example (2)

1

a b a c a a b a d c a b a c a b a a b b

234

5

6

7

891012

a b a c a b

a b a c a b

a b a c a b

a b a c a b

a b a c a b

a b a c a b1113

11446655LL((xx))

ddccbbaaxx

T:

P:

Page 42: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 42

Boyer-Moore in JavaBoyer-Moore in Java

public static int bmMatch(String text, String pattern)

{ int last[] = buildLast(pattern); int n = text.length(); int m = pattern.length(); int i = m-1;

if (i > n-1) return -1; // no match if pattern is // longer than text :

Return index where pattern starts, or -1

Page 43: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 43

int j = m-1; do { if (pattern.charAt(j) == text.charAt(i)) if (j == 0) return i; // match else { // looking-glass technique i--; j--; } else { // character jump technique int lo = last[text.charAt(i)]; //last occ i = i + m - Math.min(j, 1+lo); j = m - 1; } } while (i <= n-1);

return -1; // no match } // end of bmMatch()

Page 44: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 44

public static int[] buildLast(String pattern) /* Return array storing index of last occurrence of each ASCII char in pattern. */ { int last[] = new int[128]; // ASCII char set

for(int i=0; i < 128; i++) last[i] = -1; // initialize array

for (int i = 0; i < pattern.length(); i++) last[pattern.charAt(i)] = i;

return last; } // end of buildLast()

Page 45: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 45

UsageUsage public static void main(String args[]) { if (args.length != 2) { System.out.println("Usage: java BmSearch <text> <pattern>"); System.exit(0); } System.out.println("Text: " + args[0]); System.out.println("Pattern: " + args[1]);

int posn = bmMatch(args[0], args[1]); if (posn == -1) System.out.println("Pattern not found"); else System.out.println("Pattern starts at posn " + posn); }

Page 46: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 46

AnalysisAnalysis Boyer-Moore worst case running time is Boyer-Moore worst case running time is

O(nm + A)O(nm + A)

But, Boyer-Moore is fast when the alphabet (A) is But, Boyer-Moore is fast when the alphabet (A) is large, slow when the alphabet is small.large, slow when the alphabet is small.– e.g. good for English text, poor for binarye.g. good for English text, poor for binary

Boyer-Moore is Boyer-Moore is significantly faster than brute forcesignificantly faster than brute force for searching English text.for searching English text.

Page 47: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 47

Worst Case ExampleWorst Case Example

T: "aaaaa…a"T: "aaaaa…a" P: "baaaaa"P: "baaaaa"

11

1

a a a a a a a a a

23456

b a a a a a

b a a a a a

b a a a a a

b a a a a a

7891012

131415161718

192021222324

T:

P:

Page 48: Pattern Matching

240-301 Comp. Eng. Lab III (Software), Pattern Matching 48

5. More Information5. More Information Algorithms in C++Algorithms in C++

Robert SedgewickRobert SedgewickAddison-Wesley, 1992Addison-Wesley, 1992– chapter 19, String Searchingchapter 19, String Searching

Online Animated Algorithms:Online Animated Algorithms:– http://www.ics.uci.edu/~goodrich/dsa/ 11strings/demos/pattern/

– http://www-sr.informatik.uni-tuebingen.de/~buehler/BM/BM1.html

– http://www-igm.univ-mlv.fr/~lecroq/string/

This book isin the CoE library.