Regular Expressions into Finite Automata Anne Bruggemann-Klein Presenting: Rutie Mesing


Outline
Building the Glushkov automaton in O((size of E)^2)
Defining the star normal form
Building the Glushkov automaton in O(size of E) for deterministic regular expressions
Strong and weak unambiguity
Quadratic-time decision algorithm for weak unambiguity

General definitions
E – a regular expression
L(E) – the language specified by the regular expression E
The size of a regular expression E: the number of symbols it contains, including syntactic symbols such as brackets, +, ., and *
The size of an NFA: the number of its transitions

pos(E), χ(x)
Example: (a+b)*a(ab)* is marked as (a1+b2)*a3(a4b5)*
pos(E) – the set of subscripted symbols (positions) in an expression E
x, y, z are used to denote positions; a, b, c are used for elements of Σ
For a position x, χ(x) is the corresponding symbol of Σ

Position sets: first(E), last(E) - inductive definition
[E = ∅ or ε]
    first(E) = last(E) = ∅
[E = x]
    first(E) = last(E) = {x}
[E = F + G]
    first(E) = first(F) ∪ first(G)
    last(E) = last(F) ∪ last(G)
[E = FG]
    first(E) = first(F) ∪ first(G) if ε ∈ L(F), first(F) otherwise
    last(E) = last(F) ∪ last(G) if ε ∈ L(G), last(G) otherwise
[E = F*]
    first(E) = first(F)
    last(E) = last(F)

Position sets: follow(E,x) - inductive definition
[E = ∅ or ε]
    E has no positions
[E = x]
    follow(E,x) = ∅
[E = F + G]
    follow(E,x) = follow(F,x) if x ∈ pos(F)
    follow(E,x) = follow(G,x) if x ∈ pos(G)
[E = FG]
    follow(E,x) = follow(F,x) if x ∈ pos(F) \ last(F)
    follow(E,x) = follow(F,x) ∪ first(G) if x ∈ last(F)
    follow(E,x) = follow(G,x) if x ∈ pos(G)
[E = F*]
    follow(E,x) = follow(F,x) if x ∈ pos(F) \ last(F)
    follow(E,x) = follow(F,x) ∪ first(F) if x ∈ last(F)
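
As a rough illustration, the inductive definitions of first, last and follow can be transcribed almost case by case into recursive Python functions. The tuple encoding of expressions and all function names below are assumptions made for this sketch and do not come from the slides. The direct recursion recomputes sets many times, so it shows the definitions, not an efficient algorithm.

```python
# Hypothetical encoding (not from the slides): expressions are nested tuples
#   ("empty",)  ("eps",)  ("sym", letter, position)
#   ("+", F, G)  (".", F, G)  ("*", F)

def nullable(e):
    """True iff the empty word is in L(e)."""
    tag = e[0]
    if tag in ("eps", "*"):
        return True
    if tag == "+":
        return nullable(e[1]) or nullable(e[2])
    if tag == ".":
        return nullable(e[1]) and nullable(e[2])
    return False  # "empty" and "sym"

def pos(e):
    """Set of positions occurring in e."""
    if e[0] == "sym":
        return {e[2]}
    return set().union(*(pos(sub) for sub in e[1:]))

def first(e):
    tag = e[0]
    if tag == "sym":
        return {e[2]}
    if tag == "+":
        return first(e[1]) | first(e[2])
    if tag == ".":
        return (first(e[1]) | first(e[2])) if nullable(e[1]) else first(e[1])
    if tag == "*":
        return first(e[1])
    return set()  # "empty" and "eps"

def last(e):
    tag = e[0]
    if tag == "sym":
        return {e[2]}
    if tag == "+":
        return last(e[1]) | last(e[2])
    if tag == ".":
        return (last(e[1]) | last(e[2])) if nullable(e[2]) else last(e[2])
    if tag == "*":
        return last(e[1])
    return set()  # "empty" and "eps"

def follow(e, x):
    """follow(e, x) for a position x of e."""
    tag = e[0]
    if tag == "sym":
        return set()
    if tag == "+":
        return follow(e[1], x) if x in pos(e[1]) else follow(e[2], x)
    if tag == ".":
        f, g = e[1], e[2]
        if x in pos(g):
            return follow(g, x)
        if x in last(f):
            return follow(f, x) | first(g)
        return follow(f, x)
    if tag == "*":
        f = e[1]
        return (follow(f, x) | first(f)) if x in last(f) else follow(f, x)
    raise ValueError("expression has no positions")

# The marked expression (a1+b2)*a3(a4b5)* from the pos(E) slide.
a1, b2, a3 = ("sym", "a", 1), ("sym", "b", 2), ("sym", "a", 3)
a4, b5 = ("sym", "a", 4), ("sym", "b", 5)
E = (".", (".", ("*", ("+", a1, b2)), a3), ("*", (".", a4, b5)))
```

For this E, the functions give first(E) = {1, 2, 3}, last(E) = {3, 5}, follow(E, 1) = {1, 2, 3} and follow(E, 3) = {4}.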

The Glushkov Automaton (NFA)

ME = (QE ∪ {qI}, Σ, δE, qI, FE)
QE = pos(E)
For a ∈ Σ, let δE(qI, a) = {x | x ∈ first(E), χ(x) = a}
For x ∈ pos(E), a ∈ Σ, let δE(x, a) = {y | y ∈ follow(E,x), χ(y) = a}
FE = last(E) ∪ {qI} if ε ∈ L(E), last(E) otherwise
Proposition 2.1: L(ME) = L(E)

Example: (a*+ba)* is marked as (a1*+b2a3)*
[Figure: state diagram of the Glushkov automaton, with states 1, 2, 3 and transitions labeled a and b]
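
The construction of ME from first, last and follow can be sketched as follows for the example (a1*+b2a3)*. The function names, the dictionary representation of the transition function, and the string "qI" for the initial state are assumptions of this sketch. The position sets passed in were worked out by hand from the inductive definitions and are not taken from the slides.

```python
def glushkov(first, last, follow, chi, nullable):
    """Return (delta, finals); delta maps (state, letter) to a set of positions."""
    delta = {}
    # Transitions out of the initial state lead to the first positions.
    for x in first:
        delta.setdefault(("qI", chi[x]), set()).add(x)
    # Transitions out of a position x lead to the positions in follow(E, x).
    for x, successors in follow.items():
        for y in successors:
            delta.setdefault((x, chi[y]), set()).add(y)
    # qI is final exactly when the empty word belongs to L(E).
    finals = set(last) | ({"qI"} if nullable else set())
    return delta, finals

def accepts(delta, finals, word):
    """Simulate the NFA on word by tracking the set of reachable states."""
    states = {"qI"}
    for a in word:
        states = set().union(*(delta.get((q, a), set()) for q in states))
    return bool(states & finals)

# Hand-computed sets for (a1*+b2a3)* (an assumption of this example).
chi = {1: "a", 2: "b", 3: "a"}
delta, finals = glushkov(first={1, 2}, last={1, 3},
                         follow={1: {1, 2}, 2: {3}, 3: {1, 2}},
                         chi=chi, nullable=True)
```

With these inputs, accepts(delta, finals, "aaba") holds, while "b" and "bb" are rejected, which matches the language of (a*+ba)*.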

The canonical method (O(n^3)) for computing first, last & follow
Converting E into a syntax tree
    Leaves are labeled with ∅, ε, or positions of E
    Internal nodes are labeled with +, . or *
    Building time: O(n) (n = size of E)
    Each node v in the syntax tree corresponds to a subexpression Ev of E.
Postorder traversal of the syntax tree, computing:
    nullable(v): Boolean – can Ev produce ε?
    first(v), last(v): subsets of pos(E)
    For each x ∈ pos(E) there is a global variable follow(x): a subset of pos(E)
Total time: O(n^3)

case
    v is a node labeled ∅:
        nullable(v) := false; first(v) := ∅; last(v) := ∅;
    v is a node labeled ε:
        nullable(v) := true; first(v) := ∅; last(v) := ∅;
    v is a node labeled x:
        nullable(v) := false; follow(x) := ∅; first(v) := {x}; last(v) := {x};
    v is a node labeled +:
        nullable(v) := nullable(leftchild) or nullable(rightchild);
        first(v) := first(leftchild) ∪ first(rightchild);    (first/last union)
        last(v) := last(leftchild) ∪ last(rightchild);    (first/last union)
    v is a node labeled . :
        nullable(v) := nullable(leftchild) and nullable(rightchild);
        for each x in last(leftchild) do
            follow(x) := follow(x) ∪ first(rightchild);    (follow union, concatenation case)
        if nullable(leftchild) then first(v) := first(leftchild) ∪ first(rightchild)    (first/last union)
        else first(v) := first(leftchild);
        if nullable(rightchild) then last(v) := last(leftchild) ∪ last(rightchild)    (first/last union)
        else last(v) := last(rightchild);
    v is a node labeled *:
        nullable(v) := true;
        for each x in last(child) do
            follow(x) := follow(x) ∪ first(child);    (follow union, star case)
        first(v) := first(child); last(v) := last(child);
end case;
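
A minimal Python sketch of this postorder computation follows. It assumes a hypothetical tuple encoding of the syntax tree and uses Python sets for first, last and follow, so it illustrates the case analysis without reproducing any particular time bound. The names analyse and E are introduced only for this example.

```python
# Hypothetical syntax-tree encoding (not from the slides):
#   ("empty",)  ("eps",)  ("sym", letter, position)
#   ("+", left, right)  (".", left, right)  ("*", child)

def analyse(e, follow):
    """Postorder visit of e: return (nullable, first, last) and fill follow."""
    tag = e[0]
    if tag == "empty":
        return False, set(), set()
    if tag == "eps":
        return True, set(), set()
    if tag == "sym":
        x = e[2]
        follow[x] = set()              # the global variable follow(x)
        return False, {x}, {x}
    if tag == "+":
        n1, f1, l1 = analyse(e[1], follow)
        n2, f2, l2 = analyse(e[2], follow)
        return n1 or n2, f1 | f2, l1 | l2
    if tag == ".":
        n1, f1, l1 = analyse(e[1], follow)
        n2, f2, l2 = analyse(e[2], follow)
        for x in l1:                   # follow union, concatenation case
            follow[x] |= f2
        return n1 and n2, (f1 | f2) if n1 else f1, (l1 | l2) if n2 else l2
    if tag == "*":
        _, f, l = analyse(e[1], follow)
        for x in l:                    # follow union, star case
            follow[x] |= f
        return True, f, l
    raise ValueError("unknown node label: %r" % (tag,))

# The example (a1*+b2a3)* from the Glushkov automaton slide.
a1, b2, a3 = ("sym", "a", 1), ("sym", "b", 2), ("sym", "a", 3)
E = ("*", ("+", ("*", a1), (".", b2, a3)))
follow = {}
is_nullable, first, last = analyse(E, follow)
```

After the call, first is {1, 2}, last is {1, 3}, and follow maps 1 to {1, 2}, 2 to {3} and 3 to {1, 2}.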

Lemma 2.5
The following invariant holds after node v has been visited.
1. nullable(v) is true if and only if ε ∈ L(Ev).
2. first(v) = first(Ev), last(v) = last(Ev).
Furthermore, if node v has been visited but the parent of v has not, then
3. follow(x) = follow(Ev, x) for x ∈ pos(Ev).
In particular, for the root node v0,
1. first(v0) = first(E), last(v0) = last(E).
2. follow(x) = follow(E, x), for x ∈ pos(E).

Observations
All unions in the first/last computations and in the follow update of the concatenation case are disjoint, because pos(F) ∩ pos(G) = ∅.
Only the unions in the follow update of the star case are not necessarily disjoint.
Example: E = (a*b*)*, H = a*b*
Elements of first(H) are added to follow(H,x) for x ∈ last(H), but some elements of first(H) may already belong to follow(H,x) for some x ∈ last(H).
O(n^3) for computing first(E), last(E) and follow(E,x)

Computing first, last & follow in a better time bound (O(n^2))
General strategy: we only consider expressions for which all unions, including the follow unions of the star case, are disjoint.

Such expressions are in star normal form (SNF).

Then we show that our algorithm runs in time O(size(ME)) for expressions E in star normal form.

Finally, we show why the restriction to star normal form is justified.

Star Normal Form - Definition
A regular expression E is in star normal form if, for each starred subexpression H* of E, the SNF conditions
    follow(H, last(H)) ∩ first(H) = ∅
    and ε ∉ L(H)
hold. Here follow(H, last(H)) denotes the union of follow(H,x) over all x ∈ last(H).
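
One way to test the SNF conditions is to check them during a single postorder pass: when a star node is reached, the follow sets of its child H have been computed but not yet extended by the star, so both conditions can be read off directly. The sketch below assumes a hypothetical tuple encoding of expressions; the function names are illustrative only.

```python
# Hypothetical encoding (not from the slides):
#   ("empty",)  ("eps",)  ("sym", letter, position)
#   ("+", F, G)  (".", F, G)  ("*", H)

def snf_visit(e, follow):
    """Return (nullable, first, last, ok); ok turns False once a condition fails."""
    tag = e[0]
    if tag == "empty":
        return False, set(), set(), True
    if tag == "eps":
        return True, set(), set(), True
    if tag == "sym":
        follow[e[2]] = set()
        return False, {e[2]}, {e[2]}, True
    if tag == "*":
        n, f, l, ok = snf_visit(e[1], follow)
        # At this point follow[x] equals follow(H, x) for the child H, so
        # "ε not in L(H)" and "follow(H, last(H)) ∩ first(H) = ∅" can be tested.
        ok = ok and not n and all(not (follow[x] & f) for x in l)
        for x in l:
            follow[x] |= f
        return True, f, l, ok
    n1, f1, l1, ok1 = snf_visit(e[1], follow)
    n2, f2, l2, ok2 = snf_visit(e[2], follow)
    if tag == "+":
        return n1 or n2, f1 | f2, l1 | l2, ok1 and ok2
    # Remaining case: concatenation.
    for x in l1:
        follow[x] |= f2
    return n1 and n2, (f1 | f2) if n1 else f1, (l1 | l2) if n2 else l2, ok1 and ok2

def is_star_normal_form(e):
    return snf_visit(e, {})[3]

a1, b2 = ("sym", "a", 1), ("sym", "b", 2)
```

For instance, (a1*b2*)* fails the test because a1*b2* contains ε, while (a1+b2)* passes.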

Lemma 2.7
Let E be a regular expression in star normal form. ME can be computed from E in time O(size(E) + size(ME)).
Proof
A first/last union takes constant time (list concatenation).
Follow unions (concatenation and star cases): Observation:
    For any subexpression F of a subexpression G of E and x ∈ pos(F): follow(F,x) ⊆ follow(G,x) ⊆ follow(E,x)
    The run time for a follow union at a node v and for a position x is proportional to the number of positions in follow(Ev,x) that are not already in follow(F,x) for any proper subexpression F of Ev.
Total run time spent in follow unions: the sum over x ∈ pos(E) of |follow(E,x)| (disjoint unions, SNF),
which is less than or equal to the number of transitions in ME.

Why the restriction to star normal form is justified
Theorem 3.1
For each regular expression E, there is a regular expression E• such that
    ME = ME• (Glushkov automaton)
    E• is in star normal form
    E• can be computed from E in linear time.

From starred expression E* to E°*
Goal: SNF conditions fulfilled for E°
Observation
After removing from ME all “feedback” transitions leading from final states (apart from qI) to states that qI is directly connected to, and changing qI to be non-final, the resulting NFA is the Glushkov automaton of an expression E° with follow(E°, last(E°)) ∩ first(E°) = ∅.

Example: E = (a1*b2*)*, E° = (a1+b2)
[Figure: Glushkov automata of E and of E°, each with states 1 and 2 and transitions labeled a and b]

E° - inductive definition
[E = ∅ or ε]
    E° = ∅
[E = a]
    E° = E
[E = F + G]
    E° = F° + G°
[E = FG]
    E° = FG if ε ∉ L(F), ε ∉ L(G)
    E° = F°G if ε ∉ L(F), ε ∈ L(G)
    E° = FG° if ε ∈ L(F), ε ∉ L(G)
    E° = F° + G° (!) if ε ∈ L(F), ε ∈ L(G)
[E = F*]
    E° = F° (!)



Lemma 3.3
1. size(E°) ≤ size(E).
2. ε ∉ L(E°).
3. pos(E°) = pos(E).
4. first(E°) = first(E), last(E°) = last(E).
5. follow(E°, x) = follow(E, x), for all x ∈ pos(E) \ last(E).
6. follow(E°, x) = follow(E, x) \ first(E), for all x ∈ last(E); hence follow(E°, last(E°)) ∩ first(E°) = ∅.
7. follow(E°*, x) = follow(E*, x), for all x ∈ pos(E).
8. ME* = ME°*
The proof is by induction on E. Claims 7 and 8 follow directly from 5 and 6.

From E to E•
If we substitute each starred subexpression H* of E with H°*, proceeding bottom-up in E, we can expect to get an expression E• in star normal form with ME = ME•.

E• - inductive definition



[E = ∅, ε or a]
    E• = E
[E = F + G]
    E• = F• + G•
[E = FG]
    E• = F•G•
[E = F*]
    E• = F•°*
Example: E = (a*b*)*
    E• = ((a*b*)•°)* = ((a*b*)°)* = (a+b)*
    ME = ME•
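
The two inductive definitions can be sketched as a pair of mutually dependent Python functions, here called circ (for the ° operation) and bullet (for the • operation). The names and the tuple encoding are assumptions of this sketch. Because nullable is recomputed at every concatenation, the sketch does not achieve the linear time bound stated in Theorem 3.1.

```python
# Hypothetical encoding (not from the slides):
#   ("empty",)  ("eps",)  ("sym", letter, position)
#   ("+", F, G)  (".", F, G)  ("*", F)

def nullable(e):
    """True iff the empty word is in L(e)."""
    tag = e[0]
    if tag in ("eps", "*"):
        return True
    if tag == "+":
        return nullable(e[1]) or nullable(e[2])
    if tag == ".":
        return nullable(e[1]) and nullable(e[2])
    return False  # "empty" and "sym"

def circ(e):
    """E°: drop the empty word and the feedback from last(E) to first(E)."""
    tag = e[0]
    if tag in ("empty", "eps"):
        return ("empty",)
    if tag == "sym":
        return e
    if tag == "+":
        return ("+", circ(e[1]), circ(e[2]))
    if tag == "*":
        return circ(e[1])
    f, g = e[1], e[2]                  # concatenation FG
    if nullable(f) and nullable(g):
        return ("+", circ(f), circ(g))
    if nullable(g):                    # ε not in L(F), ε in L(G)
        return (".", circ(f), g)
    if nullable(f):                    # ε in L(F), ε not in L(G)
        return (".", f, circ(g))
    return e

def bullet(e):
    """E•: replace every starred subexpression H* by H°*, bottom-up."""
    tag = e[0]
    if tag in ("empty", "eps", "sym"):
        return e
    if tag == "*":
        return ("*", circ(bullet(e[1])))
    return (tag, bullet(e[1]), bullet(e[2]))

a1, b2 = ("sym", "a", 1), ("sym", "b", 2)
E = ("*", (".", ("*", a1), ("*", b2)))   # (a1*b2*)*
```

Applied to the example, bullet(E) returns the encoding of (a1+b2)*, in agreement with the slide.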

Lemma 3.5
L(E•) = L(E)
size(E•) ≤ size(E)
pos(E•) = pos(E)
first(E•) = first(E), last(E•) = last(E)
follow(E•, x) = follow(E, x), for x ∈ pos(E)
qI ∈ FE• if and only if qI ∈ FE
These claims imply the first part of Theorem 3.1, ME• = ME

E• is in SNF
The proof is by induction on the size of E.
The star case: [E = F*] E• = F•°*
    The SNF conditions hold for F•° (Lemma 3.3)
    F•°• is in SNF, by induction hypothesis
    Need to show that F•°• = F•°
SNF conditions for a starred subexpression H*:
    follow(H, last(H)) ∩ first(H) = ∅
    ε ∉ L(H)

Lemma 3.6
(1) E°° = E°
(2) E•° = E°•
(3) E•• = E•
Proof – by induction on E. The star case [E = F*]:
(1) E°° = F°° = F° = E°
    (def; induction; def)
(2) E•° = (F•°*)° = F•°° = F•° = F°• = E°•
    (def; def; (1); induction; def)
(3) E•• = (F•°*)• = (F•°•°)* = (F••°°)* = (F•°)* = E•
    (def; def; (2); induction & (1); def)

Compute E• from E in linear time
For each subexpression H of E, we need H• and H•° for computing E•.
H• and H•° are computed simultaneously during the postorder traversal.
It remains to prove that only a constant amount of time is spent at each node.

Conclusions so far
Theorem 3.9
The Glushkov automaton ME can be computed from a regular expression E in time linear in size(E) + size(ME).
Proof
    E• is computed from E in linear time.
    E• is in star normal form.
    ME = ME• can be computed from E• in time O(size(E•) + size(ME•)).

Deterministic regular expression

A regular expression E is deterministic if the corresponding NFA ME is deterministic.

Theorem 3.11
1. It can be decided in linear time whether a regular expression E is deterministic.
2. If E is deterministic, then the deterministic finite automaton ME can be computed from E in linear time.

Theorem 3.11 - Proof
E is deterministic if and only if E• is (isomorphic Glushkov automata), so we can assume that E is in star normal form.
We start to compute first(E), last(E), and follow(E,x) for x ∈ pos(E) incrementally, keeping track of the follow(E,x) in a |pos(E)| × |Σ| matrix.

E = (a1+b2)*
    pos   a    b
    1     a1   b2
    2     a1   b2
E is deterministic

E = (a1+b2)*a3
    pos   a         b
    1     a1 & a3   b2
    2     a1 & a3   b2
    3
E is nondeterministic
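
The table-based test can be sketched in Python as below: for every state (qI and each position) the successors are entered into one row indexed by letter, and a second entry in the same cell means that ME is nondeterministic. The sketch assumes a hypothetical tuple encoding, works on any expression using set unions, and makes no attempt to meet the linear time bound of Theorem 3.11.

```python
# Hypothetical encoding (not from the slides):
#   ("empty",)  ("eps",)  ("sym", letter, position)
#   ("+", F, G)  (".", F, G)  ("*", F)

def positions_info(e, follow, chi):
    """Postorder pass: return (nullable, first, last); fill follow and chi."""
    tag = e[0]
    if tag == "empty":
        return False, set(), set()
    if tag == "eps":
        return True, set(), set()
    if tag == "sym":
        letter, x = e[1], e[2]
        chi[x] = letter
        follow[x] = set()
        return False, {x}, {x}
    if tag == "*":
        _, f, l = positions_info(e[1], follow, chi)
        for x in l:
            follow[x] |= f
        return True, f, l
    n1, f1, l1 = positions_info(e[1], follow, chi)
    n2, f2, l2 = positions_info(e[2], follow, chi)
    if tag == "+":
        return n1 or n2, f1 | f2, l1 | l2
    for x in l1:                       # concatenation
        follow[x] |= f2
    return n1 and n2, (f1 | f2) if n1 else f1, (l1 | l2) if n2 else l2

def is_deterministic(e):
    follow, chi = {}, {}
    _, first, _ = positions_info(e, follow, chi)
    # One table row per state: qI uses first(E), a position x uses follow(E, x).
    rows = {"qI": first, **follow}
    for successors in rows.values():
        row = {}
        for y in successors:
            if chi[y] in row:          # two successors reached by the same letter
                return False
            row[chi[y]] = y
    return True

a1, b2, a3 = ("sym", "a", 1), ("sym", "b", 2), ("sym", "a", 3)
star = ("*", ("+", a1, b2))            # (a1+b2)*
```

Here is_deterministic(star) holds, while the encoding of (a1+b2)*a3 is rejected because a1 and a3 compete on the letter a.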

Ambiguity in automata and expressions

Unambiguous NFA – definition: for each word w, there is at most one path from the initial state to a final state that spells out w.
Weakly unambiguous
    Intuition: each word of L(E) has a unique path through E
    Definition: a regular expression E is weakly unambiguous if and only if the NFA ME is unambiguous.
Strongly unambiguous
    Intuition: each word of L(E) can be uniquely decomposed into subwords of E

Strongly unambiguous
[E = ∅, ε or a] E is strongly unambiguous.
[E = F + G] E is strongly unambiguous if F and G are strongly unambiguous and L(F) and L(G) are disjoint.
[E = FG] E is strongly unambiguous if F and G are strongly unambiguous and the concatenation of L(F) and L(G) is unambiguous.
[E = F*] E is strongly unambiguous if F is strongly unambiguous and the star of L(F) is unambiguous.
The concatenation L·L' is unambiguous if, for all v, w ∈ L and v', w' ∈ L', vv' = ww' implies v = w and v' = w'.
L* is unambiguous if, for all v1, ..., vm ∈ L and w1, ..., wn ∈ L with m, n ≥ 0, v1…vm = w1…wn implies m = n and vi = wi for 1 ≤ i ≤ m.
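
For finite languages, the definition of an unambiguous concatenation can be checked by brute force, which may help to make the condition concrete. The function name and the representation of languages as Python sets of strings are assumptions of this sketch; it does not apply to infinite languages.

```python
from itertools import product

def concatenation_is_unambiguous(L1, L2):
    """True iff every word of L1·L2 has exactly one split v·v' with v in L1, v' in L2."""
    seen = set()
    for v, v2 in product(L1, L2):
        word = v + v2
        if word in seen:               # a second, different pair spells the same word
            return False
        seen.add(word)
    return True
```

For example, {a, ab}·{b, bb} is ambiguous because abb = a·bb = ab·b, whereas {a, b}·{c} is unambiguous.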

Strongly unambiguous - in terms of automata
Let M’E be the NFA recognizing L(E) according to any of the standard constructions.
Lemma 4.5
E is strongly unambiguous if and only if M’E is unambiguous.
Lemma 4.6
If E is strongly unambiguous, then E is weakly unambiguous.
Proof
    Elimination of ε-transitions transforms M’E into ME.
    Different paths in M’E spelling out a word w correspond to different paths in ME doing the same.
    Unambiguity of M’E (Lemma 4.5) implies unambiguity of ME.

Lemma 4.7 – weakly unambiguous
[E = ∅, ε or a] E is weakly unambiguous.
[E = F + G] E is weakly unambiguous if and only if F and G are weakly unambiguous and at most ε is in both L(F) and L(G).
[E = FG] E is weakly unambiguous if and only if F and G are weakly unambiguous and the concatenation of L(F) and L(G) is unambiguous.
[E = F*] Let follow(F, last(F)) ∩ first(F) = ∅ and ε ∉ L(F). Then E is weakly unambiguous if and only if F is weakly unambiguous and the star of L(F) is unambiguous.

Epsilon Normal Form
Epsilon normal form condition: no subexpression of E denotes the empty word ambiguously.
[E = ∅, ε or a] E is in epsilon normal form.
[E = F + G] E is in epsilon normal form if F and G are in epsilon normal form and ε ∉ L(F) ∩ L(G).
[E = FG] E is in epsilon normal form if F and G are in epsilon normal form.
[E = F*] E is in epsilon normal form if F is in epsilon normal form and ε ∉ L(F).
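
The inductive definition of epsilon normal form can be sketched as a recursive check, reading the side conditions as ε ∉ L(F) ∩ L(G) for a sum and ε ∉ L(F) for a star. The tuple encoding and function names are assumptions of this sketch, and nullable is recomputed at each node, so the sketch is not linear time.

```python
# Hypothetical encoding (not from the slides):
#   ("empty",)  ("eps",)  ("sym", letter, position)
#   ("+", F, G)  (".", F, G)  ("*", F)

def nullable(e):
    """True iff the empty word is in L(e)."""
    tag = e[0]
    if tag in ("eps", "*"):
        return True
    if tag == "+":
        return nullable(e[1]) or nullable(e[2])
    if tag == ".":
        return nullable(e[1]) and nullable(e[2])
    return False  # "empty" and "sym"

def is_epsilon_normal_form(e):
    tag = e[0]
    if tag in ("empty", "eps", "sym"):
        return True
    if tag == "*":
        # The star already contributes ε, so F itself must not produce it.
        return is_epsilon_normal_form(e[1]) and not nullable(e[1])
    both = is_epsilon_normal_form(e[1]) and is_epsilon_normal_form(e[2])
    if tag == "+":
        # ε may come from at most one of the two alternatives.
        return both and not (nullable(e[1]) and nullable(e[2]))
    return both                        # concatenation

a1, b2 = ("sym", "a", 1), ("sym", "b", 2)
```

Under this reading a1*+b2* is not in epsilon normal form, while a1*b2* and a1+ε are.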

Strongly unambiguous expressions are in star and in epsilon normal form
Lemma 4.10
If E* is strongly unambiguous, then follow(E, last(E)) ∩ first(E) = ∅.
Proof
    Assume that there exist x ∈ last(E), y ∈ follow(E,x) ∩ first(E), z ∈ last(E).
    x is a final state in ME (and so is z).
    x1...xn x y y1…ym z is a path through ME.
    But this path is also the composition of two paths through ME.
    This makes L(E)* ambiguous.

Theorem 4.9
E is strongly unambiguous if and only if
1. E is weakly unambiguous
2. E is in star normal form
3. E is in epsilon normal form
Proof
    For expressions in star and epsilon normal form, weak and strong unambiguity are identical (using Lemma 4.7).
    Strongly unambiguous expressions are in star and in epsilon normal form (Lemma 4.10).

Test for weak unambiguity in quadratic time

Theorem 4.11
Regular expressions in epsilon normal form can be tested for weak unambiguity in quadratic time.
Proof
    Let E be in epsilon normal form.
    E can be transformed into star normal form E• in linear time without changing the Glushkov automaton.
    E• is also in epsilon normal form.
    E is weakly unambiguous if and only if E• is, if and only if E• is strongly unambiguous.
    Strong unambiguity of expressions can be decided in quadratic time.

Open problems
It is easy to see that a regular expression can be tested for epsilon normal form in linear time.
Can a given regular expression be transformed into epsilon normal form in linear time?
Our transformation into star normal form can deal with starred subexpressions.
Hence, the crucial point is how expressions E = F+G with ε ∈ L(F) ∩ L(G) can be handled.
A straightforward approach would eliminate the empty string either from L(F) or from L(G).
This opens up another question:
Is there a linear time algorithm transforming a regular expression E into an expression E’ with L(E’) = L(E) \ {ε}?

The End
