data structure & algorithms - pattern matching

Data Structures and Algorithms II, Lecture 10:

Pattern Matching Algorithms

• Motivation
• Representing Patterns
• A Simple Pattern Matching Algorithm

Pattern Matching: Motivation

For many applications we want tools to find whether a given string matches some criteria, e.g.,

• Find all filenames that end with .cpp
• Check whether the word entered is yes, y, Yes, Y, or a variant.
• Search for a line in a file containing both “data” and “abstraction”.

These can typically be done by checking whether strings match a pattern.

We need:

• A notation to describe these patterns.
• An algorithm to check whether a pattern matches a given string.

Representing Patterns: Regular Expressions

Patterns can be represented using just two special characters:

• “|” represents alternatives.
  ab|cd matches ab or cd.
  (a|bc)d matches ad or bcd.
• “*” allows repetition (0 or more times).
  ab* matches a, ab, abb, abbb, etc.
  a(bc)* matches a, abc, abcbc, etc.

Brackets may be used, as illustrated above. * has higher priority than |. Simple concatenation has priority in between, so a|bc means (a or (b followed by c)) but ab* means (a followed by (b*)).

Further characters are often used for conciseness:

• “?” matches any character at all.
  a?b matches aab, abb, azb, ...
• “+” allows 1 or more repetitions.
  ab+ matches ab, abb, abbb, etc.
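The notation above maps closely onto the syntax accepted by C++'s std::regex, with one difference worth noting: the standard library writes "." for "any character", whereas these slides use "?" (in std::regex, "?" means "optional"). The following is a minimal sketch for trying the slide's examples; the helper names matchesWhole and checkSlideExamples are made up for illustration.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true if the WHOLE string s matches the pattern.
// std::regex_match only succeeds when the entire string is matched,
// which is the sense of "matches" used on the slide.
bool matchesWhole(const std::string& pattern, const std::string& s) {
    return std::regex_match(s, std::regex(pattern));
}

// Usage sketch: the examples from the slide.
void checkSlideExamples() {
    assert(matchesWhole("ab|cd", "ab"));
    assert(matchesWhole("(a|bc)d", "bcd"));
    assert(matchesWhole("ab*", "a"));        // zero repetitions of b
    assert(matchesWhole("a(bc)*", "abcbc"));
    assert(matchesWhole("a.b", "azb"));      // "." plays the role of the slide's "?"
    assert(matchesWhole("ab+", "abb"));
    assert(!matchesWhole("ab+", "a"));       // "+" needs at least one b
}
```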

Regular Expressions

Given the following regular expressions, which of the example strings do you think each would match?

• (c | de)*
  1. cd
  2. ccc
  3. cdede
• (a*b | c+)
  1. b
  2. aaaa
  3. ccc
  4. ab
• ?*(ie | ei)?*
  1. ii
  2. piece
  3. sheik

Regular Expressions and Finite State Machines (FSMs)

• We can represent regular expressions in terms of a network of nodes and connections between them.
• These nodes represent states, and the connections represent transitions between them. The nodes in our pattern matcher capture the state “in which a certain character in the pattern has been successfully matched”.
• The network is referred to as a finite-state machine (and has many applications in CS).
• In particular, it is a non-deterministic finite state machine, as it will need to have choice nodes. You won’t be able to immediately determine which route to take in the network just by traversing the string.

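Example:

[Figure: a finite state machine with states numbered 0 (start) to 8 (finish), containing character nodes a, b, c, d and e and two choice nodes. The italicised labels in the figure (start, finish, choice and the state numbers) are just for explanation or for referring to it.]

A string matches if you can traverse the network from start to finish, matching all the characters in the string (e.g., bcdeee, ad).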

Implementing the Machine

• The finite state machine (FSM) suggests a good way to represent patterns so that pattern matching algorithms can be easily implemented.
• We could represent our FSM using a general graph ADT (discussed later). But we never allow a node to have more than two neighbours, so a simpler data structure is possible.
• Each state has one or two successor states, so use an N x 2 array, where N is the number of states, and store in it the indices of the successor states.
• We also need an array to store the characters in the nodes; let the content be “?” for choice nodes.

For the example FSM in the earlier slide we’d have:

next[0][0]=1    ch[1]='?'
next[1][0]=2    ch[2]='a'
next[1][1]=3    ch[3]='b'
next[2][0]=5    etc.
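A minimal sketch of this two-array representation in C++. The names Fsm, Successors, start, finish, isChoice and makeExampleFsm, and the use of -1 for "no second successor", are assumptions made for illustration. Because the listing above stops at "etc.", the sketch uses the fully listed table from the worked example later in the lecture (states 0 to 6, accepting abd or cd).

```cpp
#include <array>
#include <cassert>
#include <vector>

constexpr int NONE = -1;  // assumed sentinel: state has no such successor

using Successors = std::array<int, 2>;

struct Fsm {
    std::vector<Successors> next;  // next[s][0], next[s][1]: successor state indices
    std::vector<char> ch;          // ch[s]: character in state s, '?' for a choice node
    int start;
    int finish;

    bool isChoice(int s) const { return ch[s] == '?'; }
};

// Table from the worked example later in the lecture.
Fsm makeExampleFsm() {
    Fsm m;
    m.next = {Successors{1, NONE}, Successors{2, 4},    Successors{3, NONE},
              Successors{5, NONE}, Successors{5, NONE}, Successors{6, NONE},
              Successors{NONE, NONE}};
    m.ch = {' ', '?', 'a', 'b', 'c', 'd', ' '};  // start and finish carry no character
    m.start = 0;
    m.finish = 6;
    return m;
}

// Usage sketch: state 1 is the choice between 'a' (state 2) and 'c' (state 4).
void checkExampleFsm() {
    const Fsm m = makeExampleFsm();
    assert(m.isChoice(1));
    assert(m.next[1][0] == 2 && m.next[1][1] == 4);
    assert(m.next[5][0] == m.finish);
}
```

Storing successor indices rather than pointers keeps the machine in two flat arrays, which is the point the slide makes about not needing a general graph ADT.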

Algorithm

We’re now ready for an algorithm for pattern matching.

• We need a data structure that allows us to keep track, as we go through the string, of which characters are legal according to the pattern.
• We use a special list structure for this: it contains the possible states in the FSM corresponding to the current and next character in the string being analysed.
• This is updated as we simultaneously go through the FSM and move along the string: based on the possible states corresponding to the current character in the string, we can determine from the FSM the possible states for the next character in the string.

A variant of the stack/queue is used as the ADT: a double ended queue allows nodes to be put on the front or the end of the queue. dq.put adds items to the end, while dq.push adds items to the start.

Split the queue into two halves with a special character “+”, e.g., (1 2 + 5 6).

States before “+” represent possible states corresponding to the current character.

States after “+” represent possible states corresponding to the next character.
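A minimal sketch of such a queue built on std::deque. The operation names put, push and pop follow the slides; the class name StateQueue, the constant SCAN, and the choice of -1 to stand in for the “+” marker are assumptions for illustration.

```cpp
#include <cassert>
#include <deque>

constexpr int SCAN = -1;  // assumed stand-in for the "+" marker

class StateQueue {
public:
    void put(int s)  { items_.push_back(s); }   // add to the end
    void push(int s) { items_.push_front(s); }  // add to the start
    int pop() {                                  // remove from the start
        assert(!items_.empty());
        const int s = items_.front();
        items_.pop_front();
        return s;
    }
    bool empty() const { return items_.empty(); }

private:
    std::deque<int> items_;
};
```

For example, starting from an empty queue, put(SCAN), push(2), push(1), put(5), put(6) yields the queue (1 2 + 5 6) from the slide: states 1 and 2 are candidates for the current character, states 5 and 6 for the next one.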

Outline of Algorithm

The main step in the algorithm is:

Look at the current character, and a possible current state in the FSM.

If the state in the FSM is a choice node, that means the current character might correspond to either of the choices, so put them on the queue as possibilities for the current character.

If the state in the FSM contains a character matching the current character, then the next character should match the next state in the FSM, so put that next state on the end of the queue (as a possibility for the next character).

e.g.,

queue = (1 +)      str="ad"
  1 is a choice node
queue = (2 3 +)
  2 corresponds to 'a'
  state 2 has successor 5
queue = (3 + 5)

Pattern Matching Algorithm

(j = position in string)

dq.put('+'); j=0;
state=next[start][0];

While not at end of string or pattern:

• if (state=='+') { j++; dq.put('+'); }
  (finished dealing with char j, so move on.)
• else if (ch[state]==str[j])
  dq.put(next[state][0]);
  (char matches one in pattern, so put next state on queue.)
• else if (ch[state]=='?')
  { dq.push(next[state][1]); dq.push(next[state][0]); }
  (choice node, so put alternatives on front of queue; pushing next[state][1] first leaves next[state][0] at the front.)
• state=dq.pop();
  (remove state from dq)
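The loop above can be sketched as a runnable C++ function. This is one possible reading of the slide, not a definitive implementation. The names fsmMatch, Successors and checkWorkedExample, the parameters start and finish, and the use of -1 for the “+” marker are all assumptions. The sketch also makes three choices the slide leaves open: it tests for a choice node before comparing characters (so a literal "?" in the input is not confused with a choice node), it requires the whole string to match, and it assumes the machine contains no cycle made up only of choice nodes, which would keep the loop running forever.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

using Successors = std::array<int, 2>;

// Returns true if the whole of str is accepted by the machine (next, ch).
bool fsmMatch(const std::vector<Successors>& next, const std::string& ch,
              int start, int finish, const std::string& str) {
    const int SCAN = -1;   // plays the role of '+'
    std::deque<int> dq;
    dq.push_back(SCAN);    // dq.put('+')
    std::size_t j = 0;     // position in str
    int state = next[start][0];

    while (true) {
        if (state == SCAN) {
            // All candidate states for str[j] have been handled.
            if (dq.empty()) return false;  // no state can match the next character
            ++j;
            dq.push_back(SCAN);
        } else if (state == finish) {
            if (j == str.size()) return true;  // end of pattern and end of string
            // Pattern ended before the string did: dead end for this path.
        } else if (ch[state] == '?') {
            // Choice node: both alternatives are candidates for str[j].
            dq.push_front(next[state][1]);
            dq.push_front(next[state][0]);
        } else if (j < str.size() && ch[state] == str[j]) {
            // Character matches: successor is a candidate for str[j + 1].
            dq.push_back(next[state][0]);
        }
        state = dq.front();  // state = dq.pop()
        dq.pop_front();
    }
}

// Usage sketch with the table from the worked example (accepts "abd" or "cd").
void checkWorkedExample() {
    const std::vector<Successors> next = {
        Successors{1, -1}, Successors{2, 4},  Successors{3, -1}, Successors{5, -1},
        Successors{5, -1}, Successors{6, -1}, Successors{-1, -1}};
    const std::string ch = " ?abcd ";  // ch[1] == '?' marks the choice node
    assert(fsmMatch(next, ch, 0, 6, "abd"));
    assert(fsmMatch(next, ch, 0, 6, "cd"));
    assert(!fsmMatch(next, ch, 0, 6, "ab"));
}
```

The queue always holds exactly one marker while a real state is being processed, so taking the front element at the bottom of the loop never reads from an empty queue.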

Example

Suppose:

next[0][0]=1    ch[1]='?'    str="abd"
next[1][0]=2    ch[2]='a'
next[1][1]=4    ch[3]='b'
next[2][0]=3    ch[4]='c'
next[3][0]=5    ch[5]='d'
next[4][0]=5    node 6 = finish
next[5][0]=6

Working through the algorithm we have:

dq         j   state
(+)        0   1
(2 4 +)    0   1      (as ch[1]='?')
(4 +)      0   2      (after dq.pop)
(4 + 3)    0   2      (as ch[2]=str[0])
(+ 3)      0   4      (after dq.pop; ch[4] does not match str[0])
(3)        0   +      (after dq.pop)
(3 +)      1   +      (state='+')
(+)        1   3      (after dq.pop)
(+ 5)      1   3      (ch[3]=str[1])
(5)        1   +
(5 +)      2   +
(+)        2   5
(+ 6)      2   5
(6)        2   +
(6 +)      3   +
(+)        3   6

... and we’re at the end of the string and the last state in the pattern, so the match succeeds.

Summary

• Regular expressions are equivalent to finite state machines, where each state in the machine represents a character.
• One algorithm for pattern matching involves:
  – Transforming a regular expression into an FSM, and representing this using arrays.
  – Searching the network using a search method which uses a double ended queue, and divides it so we know which states correspond to options for the current character, and which for the next character.
  – This is a fairly simple and efficient algorithm for a difficult problem. Efficiency O(MN*N) where M = number of states; N = length of string.