
Harvard CS 121 and CSCI E-121 Lecture 5:

Regular Expressions

Harry Lewis

September 17, 2013

• Reading: Sipser, §1.3.


Is the Subset Construction Optimal?

The subset construction shows that any n-state NFA can be implemented as a 2^n-state DFA.

NFA states    DFA states
4             16
10            1024
100           2^100
1000          2^1000 ≫ the number of particles in the universe

How can we simulate this construction on the fly on an ordinary digital computer?

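NFA states: 1, . . . , n

DFA state: a bit vector of length n, e.g. 0 1 0 . . . 1, with positions 1, 2, . . . , n. Bit i is 1 exactly when NFA state i is in the current set of states.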
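The bit-vector idea can be sketched in Python as follows. This is a minimal sketch, not the lecture's own code: the encoding (states numbered from 0, a dictionary `delta` from (state, symbol) to a bitmask, no ε-moves) and the names `step`, `accepts`, and `SECOND_FROM_RIGHT` are assumptions made for illustration.

```python
from typing import Dict, Tuple

# Hypothetical encoding: NFA states are 0..n-1 (the slide numbers them 1..n),
# and delta maps (state, symbol) to a bitmask of successor states.
# Epsilon moves are left out to keep the sketch short.
Delta = Dict[Tuple[int, str], int]


def step(current: int, symbol: str, delta: Delta, n: int) -> int:
    """Advance the set of active NFA states (one bit per state) by one symbol."""
    nxt = 0
    for q in range(n):
        if (current >> q) & 1:  # NFA state q is in the current set
            nxt |= delta.get((q, symbol), 0)
    return nxt


def accepts(word: str, delta: Delta, n: int, start: int, accepting: int) -> bool:
    """Run the NFA on word, tracking a single DFA state as a bit vector."""
    current = 1 << start
    for symbol in word:
        current = step(current, symbol, delta, n)
    return (current & accepting) != 0


# Example: 3-state NFA for "the second symbol from the right is an a".
SECOND_FROM_RIGHT = {
    (0, "a"): 0b011,  # stay in state 0, or guess this a is second from the right
    (0, "b"): 0b001,
    (1, "a"): 0b100,
    (1, "b"): 0b100,
}
```

Only the DFA states actually visited on the input are ever computed, so the cost per input symbol is proportional to n rather than to 2^n.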



Is this construction the best we can do?

Could there be a construction that always produces an n^2-state DFA, for example?

Theorem: For every n ≥ 1, there is a language L_n such that

1. There is an (n + 1)-state NFA recognizing L_n.

2. There is no DFA recognizing L_n with fewer than 2^n states.

Conclusion: For finite automata, nondeterminism provides an exponential savings over determinism (in the worst case).

(The bound can be tightened.)



Proving that exponential blowup is sometimes unavoidable

(Could there be a construction that always produces an n^2-state DFA, for example?)

Consider (for some fixed n, say n = 17)

L_n = {w ∈ {a, b}∗ : the nth symbol from the right end of w is an a}

• There is an (n + 1)-state NFA that accepts L_n.

• There is no DFA that accepts L_n and has fewer than 2^n states.
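Both claims can be explored experimentally for small n. The sketch below builds one natural (n + 1)-state NFA for L_n and counts the subsets reached by the subset construction. The state numbering and the function names are assumptions for illustration. The count shows only that this particular construction yields 2^n states, not that no smaller DFA exists; that lower bound is what the fooling argument proves.

```python
def nth_from_right_nfa(n):
    """(n+1)-state NFA for L_n: state 0 guesses that the current a is nth from the end."""
    delta = {(0, "a"): {0, 1}, (0, "b"): {0}}
    for i in range(1, n):
        delta[(i, "a")] = {i + 1}
        delta[(i, "b")] = {i + 1}
    return delta, 0, {n}  # transitions, start state, accepting states


def nfa_accepts(word, delta, start, accepting):
    """Simulate the NFA by tracking the set of states it could be in."""
    current = {start}
    for symbol in word:
        current = {r for q in current for r in delta.get((q, symbol), ())}
    return bool(current & accepting)


def reachable_dfa_states(delta, start):
    """Subset construction, generating only the subsets reachable from {start}."""
    start_set = frozenset({start})
    seen = {start_set}
    frontier = [start_set]
    while frontier:
        subset = frontier.pop()
        for symbol in "ab":
            nxt = frozenset(r for q in subset for r in delta.get((q, symbol), ()))
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen
```

For example, `len(reachable_dfa_states(nth_from_right_nfa(4)[0], 0))` counts the DFA states produced for n = 4.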



A “Fooling Argument”

• Suppose a DFA M has fewer than 2^n states, and L(M) = L_n.

• There are 2^n strings of length n.

• By the pigeonhole principle, two such strings x ≠ y must drive M to the same state q.

• Suppose x and y differ at the kth position from the right end (one has a, the other has b), where k = 1, 2, . . . , or n.

• M must treat x a^(n−k) and y a^(n−k) identically (accept both or reject both). But these strings differ at position n from the right end, so exactly one of them is in L_n.

• So L(M) ≠ L_n, a contradiction. QED.
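The argument is constructive enough to run. Given a DFA that is too small, the sketch below searches for two colliding strings of length n and pads them as in the proof. The DFA encoding (a total transition dictionary keyed by (state, symbol)) and the names `run`, `in_L_n`, `fooling_witness`, and `ENDS_IN_A` are assumptions for illustration.

```python
from itertools import product


def run(delta, start, word):
    """Return the state a DFA reaches from start after reading word."""
    state = start
    for symbol in word:
        state = delta[(state, symbol)]
    return state


def in_L_n(word, n):
    """Membership in L_n: the nth symbol from the right end is an a."""
    return len(word) >= n and word[-n] == "a"


def fooling_witness(delta, start, n):
    """Find x != y of length n that collide, and return (x + pad, y + pad)."""
    seen = {}
    for letters in product("ab", repeat=n):
        x = "".join(letters)
        q = run(delta, start, x)
        if q in seen:
            y = seen[q]
            # k = a position, counted from the right, where x and y differ
            k = next(i for i in range(1, n + 1) if x[-i] != y[-i])
            pad = "a" * (n - k)
            return x + pad, y + pad
        seen[q] = x
    return None  # no collision: the DFA must have at least 2^n states


# A 2-state DFA (accepts strings ending in a), far too small for L_3.
ENDS_IN_A = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 0}
```

The two returned strings drive the DFA to the same state, yet exactly one of them is in L_n, so the DFA cannot recognize L_n.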



Illustration of the fooling argument

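[Figure: x and y are strings of length n that differ at position k from the right end (one has a, the other has b); each drives M from state q0 to state q. The padded strings x a^(n−k) and y a^(n−k) have different symbols n positions from the right, yet both leave M in the same state p.]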

• x and y are different strings

(so there is a position k where one has a and the other has b)

• But both strings drive M from the start state q0 to the same state q



What the argument proves

• This shows that the subset construction is within a factor of 2 of being optimal.

• In fact it is optimal, i.e., as good as we can do in the worst case.

• Still, in many cases, the “generate-states-as-needed” method yields a DFA with ≪ 2^n states

(e.g. if the NFA was deterministic to begin with!)



Regular Expressions

• Let Σ = {a, b}. The regular expressions over Σ are certain expressions formed using the symbols {a, b, (, ), ε, ∅, ∪, ◦, ∗}

• We use red for the strings under discussion (the object language) and black for the ordinary notation we are using for doing mathematics (the metalanguage).

• Construction Rules (= inductive/recursive definition):

1. a, b, ε, ∅ are regular expressions (of size 1)

2. If R1 and R2 are REs (of size s1 and s2), then (R1◦R2), (R1∪R2), and (R1∗) are REs (of sizes s1 + s2 + 3, s1 + s2 + 3, and s1 + 3, respectively).

• Examples:

(a ◦ b) ((((a ◦ (b∗)) ◦ c) ∪ ((b∗) ◦ a))∗) (∅∗)



What REs Do

• Regular expressions (which are strings) represent languages (which are sets of strings), via the function L:

(1) L(a) = {a}
(2) L(b) = {b}
(3) L(ε) = {ε}
(4) L(∅) = ∅
(5) L((R1◦R2)) = L(R1) ◦ L(R2)
(6) L((R1∪R2)) = L(R1) ∪ L(R2)
(7) L((R1∗)) = L(R1)∗

• Example: L(((a∗) ◦ (b∗))) = {a}∗ ◦ {b}∗

• L(·) is called the semantics of the expression.
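The inductive definition of regular expressions and of L translates almost line for line into a recursive program. The sketch below is one possible encoding, not part of the lecture: the class names (`Sym`, `Empty`, `Cat`, `Alt`, `Star`) are assumptions, ε is represented as `Sym("")`, and because L(R) can be infinite, `language` returns only the strings of length at most `max_len`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sym:          # a single symbol such as a or b; Sym("") stands for ε
    char: str


@dataclass(frozen=True)
class Empty:        # ∅
    pass


@dataclass(frozen=True)
class Cat:          # (R1 ◦ R2)
    left: object
    right: object


@dataclass(frozen=True)
class Alt:          # (R1 ∪ R2)
    left: object
    right: object


@dataclass(frozen=True)
class Star:         # (R1∗)
    inner: object


def size(r):
    """Size as defined by the construction rules (two parentheses plus the operator)."""
    if isinstance(r, (Sym, Empty)):
        return 1
    if isinstance(r, Star):
        return size(r.inner) + 3
    return size(r.left) + size(r.right) + 3


def language(r, max_len):
    """The strings of L(r) that have length at most max_len."""
    if isinstance(r, Empty):
        return set()
    if isinstance(r, Sym):
        return {r.char} if len(r.char) <= max_len else set()
    if isinstance(r, Alt):
        return language(r.left, max_len) | language(r.right, max_len)
    if isinstance(r, Cat):
        return {u + v
                for u in language(r.left, max_len)
                for v in language(r.right, max_len)
                if len(u + v) <= max_len}
    # Star: keep concatenating until no new short strings appear.
    base = language(r.inner, max_len)
    result = {""}
    frontier = {""}
    while frontier:
        frontier = {u + v for u in frontier for v in base
                    if len(u + v) <= max_len} - result
        result |= frontier
    return result
```

For instance, `language(Cat(Star(Sym("a")), Star(Sym("b"))), 2)` enumerates the short strings of {a}∗ ◦ {b}∗, and `size(Cat(Sym("a"), Sym("b")))` gives the size of (a ◦ b).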



Syntactic Shorthand

• Omit many parentheses, because union and concatenation of languages are associative. For example,

for any languages L1, L2, L3:

(L1L2)L3 = L1(L2L3)

and therefore for any regular expressions R1, R2, R3,

L(((R1◦R2)◦R3)) = L((R1◦(R2◦R3)))

• Omit ◦ symbol

• Drop the distinction between red and black, between object language and metalanguage.



Semantic equivalence

The following are equivalent:

((ab)c) (a(bc)) abc

or strictly speaking

((a ◦ b) ◦ c) (a ◦ (b ◦ c))

• Equivalent means:

“same semantics—same L(·)-value—maybe different syntax”



More syntactic sugar

• By convention, ∗ takes precedence over ◦, which takes precedence over ∪.

So a ∪ bc∗ is equivalent to (a ∪ (b ◦ (c∗))).

• Σ is shorthand for a ∪ b (or the analogous RE for whatever alphabet is in use).



Examples of Regular Languages

Strings ending in a = Σ∗a

Strings containing the substring abaab = ?

Strings of even length = (aa ∪ ab ∪ ba ∪ bb)∗

Strings with even # of a’s = (b ∪ ab∗a)∗ = b∗(ab∗ab∗)∗

Strings with ≤ two a’s = ?

Strings of form x1x2 · · ·xk, k ≥ 0, each xi ∈ {aab, aaba, aaa} = ?

Decimal numerals, no leading zeroes = 0 ∪ ((1 ∪ . . . ∪ 9)(0 ∪ . . . ∪ 9)∗)

All strings with an even # of a’s and an even # of b’s = (b ∪ ab∗a)∗ ∩ (a ∪ ba∗b)∗, but this isn’t a regular expression (∩ is not one of the RE operations)
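Several of the expressions above can be sanity-checked by brute force with Python's `re` module. This is a sketch under stated assumptions: the translation into `re` syntax (| for ∪, [ab] for Σ, `re.fullmatch` for membership) is ours, and agreement on all strings up to a fixed length is evidence, not a proof of equivalence.

```python
import re
from itertools import product


def words(max_len, alphabet="ab"):
    """All strings over the alphabet of length at most max_len."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)


# (pattern in Python re syntax, plain-Python description of the language)
CLAIMS = [
    (r"[ab]*a", lambda w: w.endswith("a")),
    (r"(aa|ab|ba|bb)*", lambda w: len(w) % 2 == 0),
    (r"(b|ab*a)*", lambda w: w.count("a") % 2 == 0),
    (r"b*(ab*ab*)*", lambda w: w.count("a") % 2 == 0),
]

DECIMAL = r"0|[1-9][0-9]*"  # decimal numerals with no leading zeroes


def mismatches(max_len=8):
    """Pairs (pattern, word) on which a pattern disagrees with its description."""
    return [(pattern, w)
            for pattern, spec in CLAIMS
            for w in words(max_len)
            if bool(re.fullmatch(pattern, w)) != spec(w)]
```

An empty result from `mismatches()` means every listed pattern agrees with its description on all strings of length at most 8.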



Equivalence of REs and FAs

Recall: we call a language regular if there is a finite automaton that recognizes it.

Theorem: For every regular expression R, L(R) is regular.

Proof:

Induct on the construction of regular expressions (“structural induction”).

Base Case: R is a, b, ε, or ∅

[Figure: three automata for the base case, accepting {σ} (for a symbol σ), ∅, and {ε}, respectively.]


Equivalence of REs and FAs, continued

Inductive Step: If R1 and R2 are REs and L(R1) and L(R2) are regular (inductive hyp.), then so are:

L((R1◦R2)) = L(R1) ◦ L(R2)

L((R1∪R2)) = L(R1) ∪ L(R2)

L((R∗1)) = L(R1)∗


Proof is constructive (actually produces the equivalent finite automaton, not just proves its existence).
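To illustrate what "constructive" means here, the sketch below turns a regular expression into an NFA with ε-moves by recursion on its structure. It uses a Thompson-style construction (fresh start and accept states glued together with ε-moves), which is one standard way to realize the closure properties and may differ in detail from the construction used in lecture. The tuple encoding of expressions and all names are assumptions for illustration.

```python
from itertools import count

EPS = ""  # label used for ε-moves

# Hypothetical encoding of REs as nested tuples:
#   ("sym", "a"), ("eps",), ("empty",), ("cat", R1, R2), ("alt", R1, R2), ("star", R1)


def to_nfa(regex):
    """Return (start, accept, edges) for an ε-NFA recognizing L(regex)."""
    edges = []        # list of (source, label, target)
    fresh = count()   # supplies new state names 0, 1, 2, ...

    def build(r):
        start, accept = next(fresh), next(fresh)
        kind = r[0]
        if kind == "sym":
            edges.append((start, r[1], accept))
        elif kind == "eps":
            edges.append((start, EPS, accept))
        elif kind == "empty":
            pass  # no path from start to accept
        elif kind == "cat":
            s1, a1 = build(r[1])
            s2, a2 = build(r[2])
            edges.extend([(start, EPS, s1), (a1, EPS, s2), (a2, EPS, accept)])
        elif kind == "alt":
            s1, a1 = build(r[1])
            s2, a2 = build(r[2])
            edges.extend([(start, EPS, s1), (start, EPS, s2),
                          (a1, EPS, accept), (a2, EPS, accept)])
        elif kind == "star":
            s1, a1 = build(r[1])
            edges.extend([(start, EPS, accept), (start, EPS, s1),
                          (a1, EPS, s1), (a1, EPS, accept)])
        else:
            raise ValueError(f"unknown node kind: {kind}")
        return start, accept

    start, accept = build(regex)
    return start, accept, edges


def closure(states, edges):
    """All states reachable from the given states using ε-moves only."""
    seen, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for src, label, dst in edges:
            if src == q and label == EPS and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def accepts(nfa, word):
    start, accept, edges = nfa
    current = closure({start}, edges)
    for ch in word:
        moved = {dst for src, label, dst in edges if src in current and label == ch}
        current = closure(moved, edges)
    return accept in current


# (a ∪ ε)(aa ∪ bb)∗
EXAMPLE = ("cat",
           ("alt", ("sym", "a"), ("eps",)),
           ("star", ("alt", ("cat", ("sym", "a"), ("sym", "a")),
                            ("cat", ("sym", "b"), ("sym", "b")))))
```

Calling `accepts(to_nfa(EXAMPLE), "abb")` runs the resulting automaton on a sample string. The NFA is larger than a hand-built one, since each operator adds two states.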



Example Conversion of a RE to a FA

(a ∪ ε)(aa ∪ bb)∗



Converting Finite Automata to Regular Expressions

Theorem: For every regular language L, there is a regular expression R such that L(R) = L.

Proof:

Define generalized NFAs (GNFAs) (of interest only for this proof)

• Transitions labelled by regular expressions (rather than symbols).

• One start state q_start and only one accept state q_accept.

• Exactly one transition from q_i to q_j for every two states q_i ≠ q_accept and q_j ≠ q_start (including self-loops).



NFAs to GNFAs

Lemma: For every NFA N, there is an equivalent GNFA G.

• Add new start state, new accept state. Transitions?

• If multiple transitions between two states, combine. How?

• If no transition between two states, add one. With what label?



GNFAs to REs

Lemma: For every GNFA G, there is an equivalent RE R.

• By induction on the number of states k of G.

• Base case: k = 2. Set R to be the label of the transition from q_start to q_accept.

• Inductive Hypothesis: Suppose every GNFA G of k or fewer states has an equivalent RE (where k ≥ 2).

• Induction Step: Given a (k + 1)-state GNFA G, we will construct an equivalent k-state GNFA G′.

Rip: Remove a state q_r (other than q_start, q_accept).

Repair: Augment labels on all transitions q_i → q_j to also include strings that could have followed the transitions q_i → q_r → q_j.



Ripping and repairing GNFAs: details

Given a (k + 1)-state GNFA G (k ≥ 2), we construct an equivalent k-state GNFA G′ as follows.

For any two (not necessarily distinct) states q_i, q_j, let R_ij be the regular expression labeling the transition q_i → q_j.

Rip: Remove a state q_r (other than q_start, q_accept).

Repair: For every two states q_i, q_j such that q_i ∉ {q_accept, q_r} and q_j ∉ {q_start, q_r}, simultaneously put R_ij ∪ R_ir (R_rr)∗ R_rj on transition q_i → q_j.

Argue that L(G′) = L(G); since G′ has k states, L(G′) is generated by a regular expression by the IH.



Example: The even a’s and even b’s language

All strings with an even # of a’s and an even # of b’s = (b ∪ ab∗a)∗ ∩ (a ∪ ba∗b)∗

but this isn’t a regular expression

So let’s build a DFA and convert it to a regular expression!
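The rip-and-repair procedure can be sketched in Python and applied to a four-state DFA for this language. Everything here is an assumption made for illustration rather than the lecture's worked example: regular expressions are written as Python `re` strings (with `None` for ∅ and `""` for ε), the DFA states are parity pairs, and the order in which states are ripped is arbitrary, so the resulting expression may look different from one derived by hand.

```python
import re
from itertools import product


# Regular expressions are Python `re` strings; None stands for ∅ and "" for ε.
def union(r, s):
    if r is None:
        return s
    if s is None:
        return r
    return r if r == s else f"(?:{r}|{s})"


def concat(r, s):
    return None if r is None or s is None else r + s


def star(r):
    return "" if not r else f"(?:{r})*"  # ∅∗ = ε∗ = ε


def dfa_to_regex(states, delta, start, accepting):
    """Convert a DFA to a regular expression by ripping out one state at a time."""
    S, A = "q_start", "q_accept"  # fresh states; assumed not to clash with `states`
    label = {(S, start): ""}      # missing entries mean ∅
    for q in accepting:
        label[(q, A)] = ""
    for (q, symbol), target in delta.items():
        label[(q, target)] = union(label.get((q, target)), symbol)
    remaining = list(states)
    while remaining:
        rip = remaining.pop()
        loop = star(label.get((rip, rip)))
        for i in [S] + remaining:
            for j in remaining + [A]:
                detour = concat(concat(label.get((i, rip)), loop),
                                label.get((rip, j)))
                label[(i, j)] = union(label.get((i, j)), detour)
    return label.get((S, A))


def agrees(regex, spec, max_len):
    """Check the regex against a Python predicate on all short strings over {a, b}."""
    return all(bool(re.fullmatch(regex, "".join(w))) == spec("".join(w))
               for n in range(max_len + 1)
               for w in product("ab", repeat=n))


# DFA state (i, j): i = (# of a's) mod 2, j = (# of b's) mod 2.
STATES = [(0, 0), (0, 1), (1, 0), (1, 1)]
DELTA = {}
for (i, j) in STATES:
    DELTA[((i, j), "a")] = (1 - i, j)
    DELTA[((i, j), "b")] = (i, 1 - j)

EVEN_EVEN = dfa_to_regex(STATES, DELTA, (0, 0), {(0, 0)})
```

Printing `EVEN_EVEN` shows an expression built only from union, concatenation, and star, with no ∩. The `agrees` helper can compare it against the direct definition on all strings up to a chosen length.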
