Harvard CS 121 and CSCI E-121 Lecture 5:
Regular Expressions
Harry Lewis
September 17, 2013
• Reading: Sipser, §1.3.
Harvard CS 121 & CSCI E-121 September 17, 2013
Is the Subset Construction Optimal?
The subset construction shows that any n-state NFA can be implemented as a 2^n-state DFA.
NFA States    DFA States
4             16
10            1024
100           2^100
1000          2^1000 ≫ the number of particles in the universe
How to simulate this construction on the fly on an ordinary digital computer?
NFA states: 1, . . . , n
DFA state: a bit vector of length n, e.g. 1 0 1 0 . . . 1, with one bit for each of the NFA states 1, 2, . . . , n
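The bit-vector idea above can be sketched as follows. This is a minimal illustration, not the lecture's own code: it assumes NFA states are numbered 0..n-1 (rather than 1..n), that there are no ε-transitions, and the names simulate and DELTA are hypothetical.

```python
def simulate(delta, start, accepting, w):
    """Run an NFA (no epsilon-moves) on w, storing the current state set in one int.

    delta maps (state, symbol) to an iterable of next states; states are 0..n-1.
    """
    current = 1 << start  # bit q is 1 iff NFA state q is currently possible
    for symbol in w:
        nxt = 0
        for q in range(current.bit_length()):
            if current >> q & 1:
                for target in delta.get((q, symbol), ()):
                    nxt |= 1 << target
        current = nxt
    return any(current >> q & 1 for q in accepting)


# Hypothetical example NFA: the second symbol from the right is an a.
DELTA = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "a"): {2}, (1, "b"): {2}}
print(simulate(DELTA, 0, {2}, "bab"))
```

Only one n-bit vector is stored at any time, so the 2^n DFA states are never written down explicitly.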
Is this construction the best we can do?
Could there be a construction that always produces an n^2-state DFA, for example?
Theorem: For every n ≥ 1, there is a language L_n such that
1. There is an (n + 1)-state NFA recognizing L_n.
2. There is no DFA recognizing L_n with fewer than 2^n states.
Conclusion: For finite automata, nondeterminism provides an exponential savings over determinism (in the worst case).
(The bound can be tightened.)
Proving that exponential blowup is sometimes unavoidable
(Could there be a construction that always produces an n^2-state DFA, for example?)
Consider (for some fixed n, say n = 17)
L_n = {w ∈ {a, b}∗ : the nth symbol from the right end of w is an a}
• There is an (n + 1)-state NFA that accepts L_n.
• There is no DFA that accepts L_n and has < 2^n states.
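As a sanity check on the first bullet, here is a sketch of one natural (n + 1)-state NFA for L_n, together with a generate-states-as-needed subset construction. The state numbering 0..n and the function names are assumptions made for this illustration. Counting reachable subsets shows only that the subset construction produces 2^n states on this NFA; it does not by itself prove that no smaller DFA exists.

```python
def nfa_for_Ln(n):
    """State 0 loops on a and b; on an a it may guess "this is the nth symbol from the end"."""
    delta = {(0, "a"): {0, 1}, (0, "b"): {0}}
    for i in range(1, n):
        delta[(i, "a")] = {i + 1}
        delta[(i, "b")] = {i + 1}
    return delta, 0, {n}  # transition table, start state, accepting states


def reachable_subsets(delta, start):
    """Subset construction, generating only the subsets reachable from {start}."""
    start_set = frozenset({start})
    seen = {start_set}
    frontier = [start_set]
    while frontier:
        subset = frontier.pop()
        for symbol in "ab":
            nxt = frozenset(t for q in subset for t in delta.get((q, symbol), ()))
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


for n in range(1, 6):
    delta, start, _ = nfa_for_Ln(n)
    print(n, len(reachable_subsets(delta, start)))
```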
A “Fooling Argument”
• Suppose a DFA M has < 2^n states, and L(M) = L_n.
• There are 2^n strings of length n.
• By the pigeonhole principle, two such strings x ≠ y must drive M to the same state q.
• Suppose x and y differ at the kth position from the right end (one has a, the other has b)
(k = 1, 2, . . . , or n)
• M must treat x a^(n−k) and y a^(n−k) identically (accept both or reject both). These strings differ at position n from the right end, so exactly one of them is in L_n.
• So L(M) ≠ L_n, contradiction. QED.
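The key step of the argument, that appending a^(n−k) separates x from y, can be checked by brute force for small n. The following is a sketch with hypothetical helper names; it verifies the distinguishing suffix only and is not a proof of the theorem.

```python
from itertools import product


def in_Ln(w, n):
    """True iff the nth symbol from the right end of w is an a."""
    return len(w) >= n and w[-n] == "a"


def distinguishing_suffix(x, y, n):
    """For x != y of length n, return a^(n-k), where x and y differ at position k from the right."""
    for k in range(1, n + 1):
        if x[-k] != y[-k]:
            return "a" * (n - k)
    raise ValueError("x and y are equal")


n = 4
strings = ["".join(chars) for chars in product("ab", repeat=n)]
ok = all(
    in_Ln(x + distinguishing_suffix(x, y, n), n) != in_Ln(y + distinguishing_suffix(x, y, n), n)
    for x in strings
    for y in strings
    if x != y
)
print(ok)
```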
Illustration of the fooling argument
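[Figure: x and y are strings of length n that differ at position k from the right end (one has a, the other has b); each drives M from state q0 to state q. Appending a^(n−k) gives x a^(n−k) and y a^(n−k), which have different symbols n positions from the right, yet M ends in the same state p on both.]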
• x and y are different strings
(so there is a position k where one has a and the other has b)
• But both strings drive M from the start state q0 to the same state q
What the argument proves
• This shows that the subset construction is within a factor of 2 of being optimal
• In fact it is optimal, i.e., as good as we can do in the worst case.
• Still, in many cases, the “generate-states-as-needed” method yields a DFA with ≪ 2^n states
(e.g. if the NFA was deterministic to begin with!)
Regular Expressions
• Let Σ = {a, b}. The regular expressions over Σ are certain expressions formed using the symbols {a, b, (, ), ε, ∅, ∪, ◦, ∗}
• We use red for the strings under discussion (the object language) and black for the ordinary notation we are using for doing mathematics (the metalanguage).
• Construction Rules (= inductive/recursive definition):
1. a, b, ε, ∅ are regular expressions (of size 1)
2. If R1 and R2 are REs (of size s1 and s2), then (R1◦R2), (R1∪R2), and (R1∗) are REs (of sizes s1 + s2 + 3, s1 + s2 + 3, and s1 + 3, respectively).
• Examples:
(a ◦ b) ((((a ◦ (b∗)) ◦ c) ∪ ((b∗) ◦ a))∗) (∅∗)
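The inductive definition translates directly into a recursive data structure. The sketch below is one possible encoding, chosen for this illustration only: atoms are one-character strings, and compound REs are tuples tagged "concat", "union", or "star".

```python
def size(r):
    """Size of a regular expression, following the construction rules."""
    if isinstance(r, str):            # rule 1: a, b, ε, ∅ have size 1
        return 1
    if r[0] == "star":                # rule 2: (R1∗) has size s1 + 3
        return size(r[1]) + 3
    return size(r[1]) + size(r[2]) + 3  # (R1◦R2) and (R1∪R2): s1 + s2 + 3


def show(r):
    """Write the expression as a fully parenthesized string."""
    if isinstance(r, str):
        return r
    if r[0] == "star":
        return f"({show(r[1])}∗)"
    symbol = "◦" if r[0] == "concat" else "∪"
    return f"({show(r[1])}{symbol}{show(r[2])})"


example = ("concat", "a", "b")
print(show(example), size(example))
```

With this encoding the size of an expression equals the number of symbols in its fully parenthesized string, which is where the "+ 3" comes from: two parentheses and one operator.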
What REs Do
• Regular expressions (which are strings) represent languages (which are sets of strings), via the function L:
(1) L(a) = {a}
(2) L(b) = {b}
(3) L(ε) = {ε}
(4) L(∅) = ∅
(5) L((R1◦R2)) = L(R1) ◦ L(R2)
(6) L((R1∪R2)) = L(R1) ∪ L(R2)
(7) L((R1∗)) = L(R1)∗
• Example: L(((a∗) ◦ (b∗))) = {a}∗ ◦ {b}∗
• L(·) is called the semantics of the expression.
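Since L(R) is usually infinite, a program can only compute a finite slice of it. The sketch below computes the strings of length at most m in L(R), one clause per rule of the definition. The tuple encoding ("concat", "union", "star" tags, one-character atoms) and the name lang are assumptions of this illustration.

```python
def lang(r, m):
    """Return the set of strings of length <= m in L(r)."""
    if r == "∅":
        return set()
    if r == "ε":
        return {""}
    if isinstance(r, str):                      # a single alphabet symbol
        return {r} if m >= 1 else set()
    if r[0] == "union":
        return lang(r[1], m) | lang(r[2], m)
    if r[0] == "concat":
        return {x + y for x in lang(r[1], m) for y in lang(r[2], m) if len(x + y) <= m}
    # star: keep concatenating until no new string of length <= m appears
    base = lang(r[1], m)
    result = {""}
    frontier = {""}
    while frontier:
        frontier = {x + y for x in frontier for y in base if len(x + y) <= m} - result
        result |= frontier
    return result


a_star_b_star = ("concat", ("star", "a"), ("star", "b"))
print(sorted(lang(a_star_b_star, 2)))
```

Truncating at length m is safe because any string of length at most m in L(R1) ◦ L(R2) or L(R1)∗ is built from pieces that are themselves of length at most m.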
Syntactic Shorthand
• Omit many parentheses, because union and concatenation of languages are associative. For example, for any languages L1, L2, L3:
(L1L2)L3 = L1(L2L3)
and therefore for any regular expressions R1, R2, R3,
L(((R1◦R2)◦R3)) = L((R1◦(R2◦R3)))
• Omit ◦ symbol
• Drop the distinction between red and black, between object language and metalanguage.
Semantic equivalence
The following are equivalent:
((ab)c) (a(bc)) abc
or strictly speaking
((a ◦ b) ◦ c) (a ◦ (b ◦ c))
• Equivalent means:
“same semantics—same L(·)-value—maybe different syntax”
More syntactic sugar
• By convention, ∗ takes precedence over ◦, which takes precedence over ∪.
So a ∪ bc∗ is equivalent to (a ∪ (b ◦ (c∗))).
• Σ is shorthand for a ∪ b (or the analogous RE for whatever alphabet is in use).
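The precedence convention can be made concrete with a small recursive-descent parser that restores the omitted parentheses and ◦ symbols. This is a sketch under stated assumptions: every symbol is a single character, repeated operators group to the left, the Σ shorthand is not expanded, and the function name parenthesize is hypothetical.

```python
def parenthesize(text):
    """Rewrite a shorthand RE as a fully parenthesized one (∗ binds tighter than ◦, ◦ tighter than ∪)."""
    tokens = [ch for ch in text.replace("*", "∗") if not ch.isspace()]
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def parse_union():                 # lowest precedence
        nonlocal pos
        left = parse_concat()
        while peek() == "∪":
            pos += 1
            left = f"({left}∪{parse_concat()})"
        return left

    def parse_concat():                # juxtaposition means ◦
        left = parse_star()
        while peek() not in (None, "∪", ")"):
            left = f"({left}◦{parse_star()})"
        return left

    def parse_star():                  # highest precedence
        nonlocal pos
        token = peek()
        if token is None or token in "∪∗)":
            raise ValueError(f"unexpected token at position {pos}")
        pos += 1
        if token == "(":
            inner = parse_union()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
        else:
            inner = token
        while peek() == "∗":
            pos += 1
            inner = f"({inner}∗)"
        return inner

    result = parse_union()
    if pos != len(tokens):
        raise ValueError(f"unexpected token at position {pos}")
    return result


print(parenthesize("a ∪ bc∗"))
```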
Examples of Regular Languages
Strings ending in a = Σ∗a
Strings containing the substring abaab = ?
Strings of even length = (aa ∪ ab ∪ ba ∪ bb)∗
Strings with even # of a’s = (b ∪ ab∗a)∗
= b∗(ab∗ab∗)∗
Strings with ≤ two a’s = ?
Strings of form x1x2 · · ·xk, k ≥ 0, each xi ∈ {aab, aaba, aaa} = ?
Decimal numerals, no leading zeroes = 0 ∪ ((1 ∪ . . . ∪ 9)(0 ∪ . . . ∪ 9)∗)
All strings with an even # of a’s and an even # of b’s = (b ∪ ab∗a)∗ ∩ (a ∪ ba∗b)∗ but this isn’t a regular expression
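Claims like these can be spot-checked by brute force over all short strings. The sketch below uses Python's re module as a stand-in for the lecture's notation, writing ∪ as | and Σ as (a|b); the helper name agrees and the length bound of 8 are choices made for this illustration. Passing the check is evidence, not a proof.

```python
import re
from itertools import product


def agrees(pattern, predicate, max_len=8):
    """Check that pattern matches exactly the strings over {a, b} of length <= max_len satisfying predicate."""
    compiled = re.compile(pattern)
    for length in range(max_len + 1):
        for chars in product("ab", repeat=length):
            w = "".join(chars)
            if (compiled.fullmatch(w) is not None) != predicate(w):
                return False
    return True


def even_as(w):
    return w.count("a") % 2 == 0


print(agrees(r"(a|b)*a", lambda w: w.endswith("a")))
print(agrees(r"(aa|ab|ba|bb)*", lambda w: len(w) % 2 == 0))
print(agrees(r"(b|ab*a)*", even_as))
print(agrees(r"b*(ab*ab*)*", even_as))
```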
Equivalence of REs and FAs
Recall: we call a language regular if there is a finite automaton that recognizes it.
Theorem: For every regular expression R, L(R) is regular.
Proof:
Induct on the construction of regular expressions (“structural induction”).
Base Case: R is a, b, ε, or ∅
[Figure: three small automata, one that accepts {σ}, one that accepts ∅, and one that accepts {ε}.]
Equivalence of REs and FAs, continued
Inductive Step: If R1 and R2 are REs and L(R1) and L(R2) are regular (inductive hyp.), then so are:
L((R1◦R2)) = L(R1) ◦ L(R2)
L((R1∪R2)) = L(R1) ∪ L(R2)
L((R1∗)) = L(R1)∗
(By the closure properties of the regular languages).
Proof is constructive (actually produces the equivalent finite automaton, not just proves its existence).
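To illustrate what "constructive" means here, the sketch below builds an NFA with ε-transitions by recursion on the structure of the RE and then runs it. It uses a Thompson-style variant with a fresh start and accept state per subexpression, which may differ in detail from the closure constructions used in the course; the tuple encoding of REs and all function names are assumptions of this illustration.

```python
from itertools import count


def build(r, fresh, edges):
    """Add an NFA fragment for r to edges; return its (start, accept) states. Label None means ε."""
    s, t = next(fresh), next(fresh)
    if r == "∅":
        pass                                   # base case: no path from s to t
    elif r == "ε":
        edges.append((s, None, t))             # base case: ε-move
    elif isinstance(r, str):
        edges.append((s, r, t))                # base case: one symbol σ
    else:
        s1, t1 = build(r[1], fresh, edges)
        if r[0] == "star":
            edges.extend([(s, None, t), (s, None, s1), (t1, None, s1), (t1, None, t)])
        else:
            s2, t2 = build(r[2], fresh, edges)
            if r[0] == "union":
                edges.extend([(s, None, s1), (s, None, s2), (t1, None, t), (t2, None, t)])
            else:                              # concat
                edges.extend([(s, None, s1), (t1, None, s2), (t2, None, t)])
    return s, t


def closure(states, edges):
    """All states reachable from states by ε-moves."""
    seen, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for src, label, dst in edges:
            if src == q and label is None and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def accepts(r, w):
    edges = []
    start, accept = build(r, count(), edges)
    current = closure({start}, edges)
    for ch in w:
        step = {dst for src, label, dst in edges if src in current and label == ch}
        current = closure(step, edges)
    return accept in current


# (a ∪ ε)(aa ∪ bb)∗
EXAMPLE = ("concat", ("union", "a", "ε"),
           ("star", ("union", ("concat", "a", "a"), ("concat", "b", "b"))))
print(accepts(EXAMPLE, "abb"), accepts(EXAMPLE, "ab"))
```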
Example Conversion of a RE to a FA
(a ∪ ε)(aa ∪ bb)∗
Converting Finite Automata to Regular Expressions
Theorem: For every regular language L, there is a regular expression R such that L(R) = L.
Proof:
Define generalized NFAs (GNFAs) (of interest only for this proof)
• Transitions labelled by regular expressions (rather than symbols).
• One start state q_start and only one accept state q_accept.
• Exactly one transition from q_i to q_j for every two states q_i ≠ q_accept and q_j ≠ q_start (including self-loops).
NFAs to GNFAs
Lemma: For every NFA N , there is an equivalent GNFA G.
• Add new start state, new accept state. Transitions?
• If multiple transitions between two states, combine. How?
• If no transition between two states, add one. With what label?
GNFAs to REs
Lemma: For every GNFA G, there is an equivalent RE R.
• By induction on the number of states k of G.
• Base case: k = 2. Set R to be the label of the transition from q_start to q_accept.
• Inductive Hypothesis: Suppose every GNFA G of k or fewer states has an equivalent RE (where k ≥ 2).
• Induction Step: Given a (k + 1)-state GNFA G, we will construct an equivalent k-state GNFA G′.
Rip: Remove a state q_r (other than q_start, q_accept).
Repair: Augment labels on all transitions q_i → q_j to also include strings that could have followed the transitions q_i → q_r → q_j.
Ripping and repairing GNFAs: details
Given a (k + 1)-state GNFA G (k ≥ 2), we construct an equivalent k-state GNFA G′ as follows.
For any two (not necessarily distinct) states q_i, q_j, let R_ij be the regular expression labeling the transition q_i → q_j.
Rip: Remove a state q_r (other than q_start, q_accept).
Repair: For every two states q_i, q_j such that q_i ∉ {q_accept, q_r} and q_j ∉ {q_start, q_r}, simultaneously put R_ij ∪ R_ir (R_rr)∗ R_rj on the transition q_i → q_j.
Argue that L(G′) = L(G); L(G′) is generated by a regular expression by the IH, and hence so is L(G).
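The rip-and-repair step can be sketched in code. The version below makes several assumptions for the sake of a runnable illustration: labels are Python re pattern strings rather than formal REs, the empty language ∅ is represented by None (a missing transition), ε by the empty string, and the names to_regex, union, concat, and star are hypothetical. It is applied to a 4-state DFA that tracks the parity of the number of a's and b's, then checked by brute force on short strings.

```python
import re
from itertools import product


def union(r, s):
    if r is None:
        return s
    if s is None:
        return r
    return f"(?:{r}|{s})"


def concat(r, s):
    return None if r is None or s is None else r + s


def star(r):
    return "" if r is None or r == "" else f"(?:{r})*"   # ∅∗ = ε∗ = ε


def to_regex(states, delta, start, accepting):
    """Convert an NFA (delta: (state, symbol) -> set of states) to a pattern by ripping states."""
    S, F = "q_start", "q_accept"           # new start and accept states
    R = {}

    def add(i, j, label):                  # multiple transitions are combined with union
        R[(i, j)] = union(R.get((i, j)), label)

    add(S, start, "")
    for q in accepting:
        add(q, F, "")
    for (q, symbol), targets in delta.items():
        for t in targets:
            add(q, t, symbol)

    remaining = list(states)
    while remaining:
        r = remaining.pop()                # rip q_r
        loop = star(R.get((r, r)))
        for i in [S] + remaining:          # repair every q_i -> q_j
            for j in remaining + [F]:
                detour = concat(concat(R.get((i, r)), loop), R.get((r, j)))
                label = union(R.get((i, j)), detour)
                if label is not None:
                    R[(i, j)] = label
        for key in [k for k in R if r in k]:
            del R[key]
    return R.get((S, F))


# State q encodes parities: bit 1 (value 2) for a's, bit 0 (value 1) for b's.
DELTA = {(q, "a"): {q ^ 2} for q in range(4)}
DELTA.update({(q, "b"): {q ^ 1} for q in range(4)})
PATTERN = to_regex(range(4), DELTA, 0, {0})
print(PATTERN)
```

Updating labels in place is consistent with the "simultaneously" requirement here, because the repair of q_i → q_j reads only that transition's own old label and labels on transitions into or out of q_r, none of which are modified during the loop.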
Example: The even a’s and even b’s language
All strings with an even # of a’s and an even # of b’s = (b ∪ ab∗a)∗ ∩ (a ∪ ba∗b)∗
but this isn’t a regular expression
So let’s build a DFA and convert it to a regular expression!