Foundations of Software Design
Lecture 23: Finite Automata and Context-Free Grammars
Marti Hearst, Fall 2002
Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html

Outline
• Finite State Automata
• Why regular expressions are not enough
• Context Free Grammars
• Next Time: Relationship to Programming Languages and Compilers

Adapted from Jurafsky & Martin 2000
Three Equivalent Representations
• Finite automata
• Regular expressions
• Regular languages
Each can describe the others.
Theorem:
For every regular expression, there is a deterministic finite-state automaton that defines the same language, and vice versa.

Finite Automata
• An FA is similar to a compiler in that:
  – A compiler recognizes legal programs in some (source) language.
  – A finite-state machine recognizes legal strings in some language.
• Example: Pascal identifiers
  – sequences of one or more letters or digits, starting with a letter:
[State graph: start state S goes to accepting state A on letter; A loops to itself on letter | digit]

Finite-Automata State Graphs
[Figure: the graphical symbol for each of the following]
• The start state
• An accepting state
• A transition, labeled with an input symbol such as a
• A state

Finite Automata
• Transition: $s_1 \xrightarrow{a} s_2$
• Is read: in state $s_1$, on input “a”, go to state $s_2$
• At end of input:
  – If in an accepting state => accept
  – Otherwise => reject
• If no transition is possible (got stuck) => reject
• FSA = Finite State Automaton

Language defined by FSA
• The language defined by an FSA is the set of strings accepted by the FSA.
  – In the language of the FSM shown below:
    • x, tmp2, XyZzy, position27.
  – Not in the language of the FSM shown below:
    • 123, a?, 13apples.
[State graph: start state S goes to accepting state A on letter; A loops to itself on letter | digit]

Example: Integer Literals
• FSA that accepts integer literals with an optional + or - sign:
• Note: two different edges from S to A
• /(\+|-)?[0-9]+/
[State graph: start state S goes to A on + and on -; S and A each go to B on digit; B loops to itself on digit; B is the accepting state]
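The pattern for integer literals can be tried out directly. The following is a minimal sketch using Python's standard `re` module; the names `INTEGER_LITERAL` and `is_integer_literal` are assumptions made for illustration and do not come from the lecture.

```python
import re

# Optional sign followed by one or more digits, as in the pattern above.
INTEGER_LITERAL = re.compile(r"[+-]?[0-9]+")

def is_integer_literal(text: str) -> bool:
    """Return True when the whole string is an optionally signed integer."""
    return INTEGER_LITERAL.fullmatch(text) is not None

for sample in ["42", "+7", "-003", "", "+", "4-2"]:
    print(repr(sample), is_integer_literal(sample))
```

`fullmatch` is used so that the entire input must be consumed, which mirrors the automaton only accepting at the end of input.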

Example:
• FSA that accepts three-letter English words that begin with p and end with d or t.
• Here I use the convenient notation of making the state name match the input that has to be on the edge leading to that state.
[State graph with states named p; a, i, o, u; and d, t]

Practice
Write an automaton that accepts Java identifiers
– One or more letters, digits, or underscores, starting with a letter or an underscore.
– Start with the regexp

Formal Definition
• A finite automaton is a 5-tuple $(\Sigma, Q, \delta, q, F)$ where:
  – An input alphabet $\Sigma$
  – A set of states $Q$
  – A start state $q$
  – A set of accepting states $F \subseteq Q$
  – $\delta$ is the state transition function: $Q \times \Sigma \to Q$ (i.e., encodes transitions $\text{state} \xrightarrow{\text{input}} \text{state}$)
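One way to make the 5-tuple concrete is to store each component as a field. The sketch below is an illustration under stated assumptions: the class name `FiniteAutomaton`, its field names, and the choice to store the transition function as a partial dictionary (a missing entry means "stuck") are all hypothetical. The example instance is the identifier automaton from the earlier slides.

```python
from dataclasses import dataclass
import string

@dataclass(frozen=True)
class FiniteAutomaton:
    alphabet: frozenset      # Sigma: the input alphabet
    states: frozenset        # Q: the set of states
    delta: dict              # transition function: (state, symbol) -> state
    start: str               # q: the start state
    accepting: frozenset     # F: accepting states, a subset of Q

    def accepts(self, text: str) -> bool:
        state = self.start
        for symbol in text:
            # A missing entry means the machine is stuck, so reject.
            if (state, symbol) not in self.delta:
                return False
            state = self.delta[(state, symbol)]
        return state in self.accepting

# Identifier automaton: S --letter--> A, A --letter | digit--> A.
letters = string.ascii_letters
digits = string.digits
delta = {("S", c): "A" for c in letters}
delta.update({("A", c): "A" for c in letters + digits})

identifiers = FiniteAutomaton(
    alphabet=frozenset(letters + digits),
    states=frozenset({"S", "A"}),
    delta=delta,
    start="S",
    accepting=frozenset({"A"}),
)

print(identifiers.accepts("tmp2"))
print(identifiers.accepts("13apples"))
```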

How to Implement an FSA
A table-driven approach:
• Table:
  – one row for each state in the machine, and
  – one column for each possible character.
• Table[j][k]
  – which state to go to from state j on character k,
  – an empty entry corresponds to the machine getting stuck.

The table-driven program for a Deterministic FSA
state = S                          // S is the start state
repeat {
    k = next character from the input
    if (k == EOF) {                // the end of input
        if state is a final state then accept
        else reject
    }
    state = T[state, k]
    if state == empty then reject  // got stuck
}
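The loop above can be sketched in Python for the integer-literal automaton (states S, A, B). This is only an illustration: the names `T`, `column_for`, and `accepts` are assumptions, and the table has one column per character class (sign, digit) rather than one per character, a common compression that the slides do not mention.

```python
# States of the integer-literal automaton: S (start), A (after a sign), B (accepting).
S, A, B = 0, 1, 2
STUCK = None                      # an empty table entry
SIGN, DIGIT = 0, 1                # column indexes, one per character class

# T[state][column] is the next state, or STUCK when there is no transition.
T = [
    [A,     B],      # from S: a sign leads to A, a digit leads to B
    [STUCK, B],      # from A: only a digit is allowed
    [STUCK, B],      # from B: further digits stay in B
]
FINAL_STATES = {B}

def column_for(ch):
    """Map a character to its table column, or None if it has no column."""
    if ch in "+-":
        return SIGN
    if ch in "0123456789":
        return DIGIT
    return None

def accepts(text):
    state = S
    for ch in text:
        col = column_for(ch)
        if col is None:
            return False               # character outside the alphabet
        state = T[state][col]
        if state is STUCK:
            return False               # got stuck
    return state in FINAL_STATES       # end of input

print([accepts(s) for s in ["123", "+5", "-", "1+2"]])
```

Changing the language then means changing only the table and the set of final states; the driver loop stays the same.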

Regular expressions are not enough.
• You can write an automaton that accepts the specific strings:
  – “a”, “(a)”, “((a))”, and “(((a)))”
• But you can’t write one for this (in the general case):
  – “a”, “(a)”, “((a))”, “(((a)))”, …, $(^k a )^k$
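A recognizer for the general case has to remember how many opening parentheses it has seen, and that count has no upper bound, which is exactly what a fixed, finite set of states cannot provide. The sketch below uses an integer counter instead; the function name `is_nested_a` is an assumption made for illustration.

```python
def is_nested_a(text: str) -> bool:
    """Accept exactly the strings made of k "(", then "a", then k ")", for k >= 0."""
    depth = 0
    i = 0
    # Count the opening parentheses; the count has no upper bound.
    while i < len(text) and text[i] == "(":
        depth += 1
        i += 1
    # Exactly one "a" must follow.
    if i >= len(text) or text[i] != "a":
        return False
    i += 1
    # Each closing parenthesis must pair up with one counted opening.
    while i < len(text) and text[i] == ")":
        depth -= 1
        i += 1
    return i == len(text) and depth == 0

print([is_nested_a(s) for s in ["a", "((a))", "((a)", "(a))"]])
```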

Regular expressions are not enough.
• What programs are generated by?
  digit+ ( ( “+” | “-” | “*” | “/” ) digit+ )*
• [Note: the Perl-style replacement operators such as \1 are not part of the definition of regular expressions.]
• What important properties does this regular expression fail to express?
  – Regexes are not good at showing
    • Precedence
    • Nesting
    • Recursion
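To see what the expression does and does not capture, it can be transcribed into Python's `re` syntax. This is a rough sketch; the names `EXPRESSION` and `matches` are assumptions, and the transcription uses a character class for the four operators.

```python
import re

# digit+ ( ("+" | "-" | "*" | "/") digit+ )*  in Python regex syntax
EXPRESSION = re.compile(r"[0-9]+(?:[+\-*/][0-9]+)*")

def matches(text: str) -> bool:
    return EXPRESSION.fullmatch(text) is not None

print(matches("1+2*3"))     # flat operator/operand sequence
print(matches("(1+2)*3"))   # parentheses are outside the pattern
print(matches("1+"))        # dangling operator

# The match only says the string is well formed; it returns no structure,
# so nothing records that 2*3 should be grouped before adding 1.
print(EXPRESSION.fullmatch("1+2*3").group(0))
```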

Context-Free Grammars

Derivation
1. Begin with a string consisting of the start symbol “S”
2. Replace any non-terminal $X$ in the string by the right-hand side of some production $X \to Y_1 \cdots Y_n$
3. Repeat (2) until there are no non-terminals in the string

Example
Define a language to recognize strings of balanced parentheses: $\{\, (^i \, )^i \mid i \ge 0 \,\}$
The grammar:
$S \to ( S )$
$S \to \varepsilon$
Recognizes these strings: (), (()), (((()))), …
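The two productions translate almost line for line into a recursive function. The following is a minimal sketch, assuming the hypothetical names `parse_s` and `in_language`; it follows the grammar as written, so it accepts only fully nested strings such as (()) and not sequences such as ()().

```python
def parse_s(text: str, pos: int = 0) -> int:
    """Apply S -> ( S ) | epsilon at position pos.

    Returns the position just after the text derived from S.
    Raises ValueError when "(" S has no matching ")".
    """
    if pos < len(text) and text[pos] == "(":
        # Production S -> ( S )
        pos = parse_s(text, pos + 1)
        if pos >= len(text) or text[pos] != ")":
            raise ValueError(f"expected ')' at position {pos}")
        return pos + 1
    # Production S -> epsilon: consume nothing.
    return pos

def in_language(text: str) -> bool:
    try:
        return parse_s(text) == len(text)
    except ValueError:
        return False

print([in_language(s) for s in ["", "()", "(())", "(()", "())"]])
```

The recursion depth plays the role of the unbounded counter that a finite automaton lacks.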

Derivation Rules
$S \to ( S )$
$S \to \varepsilon$
is the same as
$S \to ( S ) \mid \varepsilon$

Arithmetic Example
Simple arithmetic expressions:
$E \to E + E \mid E * E \mid (E) \mid \mathrm{id}$
Some valid strings in the language:
• id
• (id)
• (id) * id
• id + id
• id * id
• id * (id)

Arithmetic Example (Cont.)
Another valid string in the language:
• id * id + id

Derivations and Parse Trees
A derivation is a sequence of productions: $S \to \cdots \to \cdots$
A derivation can be drawn as a tree
– Start symbol is the tree’s root
– For a production $X \to Y_1 \cdots Y_n$, add children $Y_1 \cdots Y_n$ to node $X$
– Stop when every leaf is a terminal (no non-terminals are left to expand)
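
Right-most Derivation Example
Input: id * id + id
$E$
$\to E + E$
$\to E + \mathrm{id}$
$\to E * E + \mathrm{id}$
$\to E * \mathrm{id} + \mathrm{id}$
$\to \mathrm{id} * \mathrm{id} + \mathrm{id}$
[Parse tree: the root E has children E, +, E; the left E has children E, *, E, each deriving id; the right E derives id]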

Left-most and Right-most Derivations
• The example is a right-most derivation
  – At each step, replace the right-most non-terminal
• There is an equivalent notion of a left-most derivation
$E$
$\to E + E$
$\to E + \mathrm{id}$
$\to E * E + \mathrm{id}$
$\to E * \mathrm{id} + \mathrm{id}$
$\to \mathrm{id} * \mathrm{id} + \mathrm{id}$
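The difference between the two derivation orders can be sketched in a few lines of Python. The function name `derive` and its `rightmost` flag are assumptions made for illustration. The sketch handles only the single non-terminal E of the arithmetic grammar and does not check that each right-hand side is a real production.

```python
NONTERMINALS = {"E"}

def derive(start, productions, rightmost=True):
    """Apply productions in order, always to the left-most or right-most non-terminal.

    productions is a list of right-hand sides, each a list of symbols.
    Returns every sentential form, from the start symbol onward.
    """
    form = [start]
    steps = [form]
    for rhs in productions:
        positions = [i for i, sym in enumerate(form) if sym in NONTERMINALS]
        if not positions:
            raise ValueError("no non-terminal left to replace")
        i = positions[-1] if rightmost else positions[0]
        form = form[:i] + rhs + form[i + 1:]
        steps.append(form)
    return steps

# Right-most derivation of id * id + id, following the steps listed above.
right = derive("E", [
    ["E", "+", "E"],
    ["id"],
    ["E", "*", "E"],
    ["id"],
    ["id"],
])
for form in right:
    print(" ".join(form))
```

Calling `derive` with `rightmost=False` and the productions in the order E + E, E * E, id, id, id reaches the same final string through different intermediate forms.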

A formal definition of CFGs
• A CFG consists of
  – A set of terminals $T$
  – A set of non-terminals $N$
  – A start symbol $S$ (a non-terminal)
  – A set of productions: $X \to Y_1 Y_2 \cdots Y_n$, where $X \in N$ and $Y_i \in T \cup N$
Derivations and Parse Trees
• Note that right-most and left-most derivations have the same parse tree
• The difference is the order in which branches are added

Example
• The program:
  x * y + z
• Input to parser:
  ID TIMES ID PLUS ID
  we’ll write tokens as follows:
  id * id + id
• Output of parser:
  the parse tree
[Parse tree: the root E has children E, +, E; the left E has children E, *, E, each deriving id; the right E derives id]

Ambiguity
• Grammar
  $E \to E + E \mid E * E \mid (E) \mid \mathrm{id}$
• String
  id * id + id
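
Ambiguity (Cont.)
This string has two parse trees
[Parse tree 1: the root E has children E, *, E; the left E derives id; the right E has children E, +, E, each deriving id]
[Parse tree 2: the root E has children E, +, E; the left E has children E, *, E, each deriving id; the right E derives id]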

Ambiguity (Cont.)
• A grammar is ambiguous if, for some string (the following three conditions are equivalent):
  – there is more than one parse tree
  – there is more than one right-most derivation
  – there is more than one left-most derivation
• Ambiguity is BAD
  – Makes the meaning of some programs ill-defined
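The first condition can be checked mechanically for the grammar E → E + E | E * E | (E) | id by counting parse trees. The sketch below is one possible approach, not a general parser: the function name `count_parse_trees` is an assumption, and the counting is hard-wired to this one grammar.

```python
from functools import lru_cache

def count_parse_trees(tokens):
    """Count parse trees for tokens under E -> E + E | E * E | ( E ) | id."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(lo, hi):
        # Number of ways E can derive tokens[lo:hi].
        if hi - lo == 1:
            return 1 if tokens[lo] == "id" else 0
        total = 0
        # E -> ( E )
        if hi - lo >= 3 and tokens[lo] == "(" and tokens[hi - 1] == ")":
            total += count(lo + 1, hi - 1)
        # E -> E + E and E -> E * E: try every operator as the root.
        for mid in range(lo + 1, hi - 1):
            if tokens[mid] in ("+", "*"):
                total += count(lo, mid) * count(mid + 1, hi)
        return total

    return count(0, len(tokens)) if tokens else 0

print(count_parse_trees("id * id + id".split()))
print(count_parse_trees("( id * id ) + id".split()))
```

A count above one for any string is enough to show that the grammar is ambiguous.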

Dealing with Ambiguity
• There are several ways to handle ambiguity
• Most direct method is to rewrite grammar unambiguously:
  $E \to E' + E \mid E'$
  $E' \to \mathrm{id} * E' \mid \mathrm{id} \mid (E)$
• Enforces precedence of * over +
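To see how the rewritten grammar forces * to bind tighter than +, here is a small recursive-descent sketch with one function per non-terminal. It assumes the productions read E → E' + E | E' and E' → id * E' | id | (E), and the names `parse`, `parse_e`, and `parse_e_prime` are hypothetical. The output is a nested tuple standing in for the parse tree.

```python
def parse(tokens):
    """Parse tokens with E -> E' + E | E' and E' -> id * E' | id | ( E ).

    Returns a nested tuple (operator, left, right), or "id" for a leaf.
    """
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expect(token):
        nonlocal pos
        if peek() != token:
            raise SyntaxError(f"expected {token!r} at token {pos}")
        pos += 1

    def parse_e():
        left = parse_e_prime()
        if peek() == "+":            # E -> E' + E
            expect("+")
            return ("+", left, parse_e())
        return left                  # E -> E'

    def parse_e_prime():
        if peek() == "(":            # E' -> ( E )
            expect("(")
            inner = parse_e()
            expect(")")
            return inner
        expect("id")
        if peek() == "*":            # E' -> id * E'
            expect("*")
            return ("*", "id", parse_e_prime())
        return "id"                  # E' -> id

    tree = parse_e()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree

print(parse("id * id + id".split()))
print(parse("id + id * id".split()))
```

In both calls the multiplication ends up below the addition in the tree, whichever operator comes first in the input.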

Properties of CFGs
• Membership in a language is “yes” or “no”
• Form of the grammar is important
  – Different grammars can generate the same language
• Need an “implementation” of CFGs,
  – i.e. the parser
  – we’ll create the parser using a parser generator
    • available generators: yacc, javacc

CFGs vs. Regexes
• CFGs are better at expressing recursive structure
  – They are often described using trees …
  – … which we know have a recursive structure
• Example:
  – if E then S1 else S2
  – if E1 then if E2 then S1 else S2 else S3
  – if E1 then (if E2 then S1 else S2) else S3

Regular Languages vs CFGs
• Every regex can be expressed as a CFG
• The converse is not true
  – BUT regexes cover much of what you need
• When to use which?
  – According to Aho, Sethi, and Ullman ’86:
    – Regexes are more concise and easier to understand
    – More efficient analyzers can be constructed from regexes
    – It is often useful to separate the structure of a language into lexical and nonlexical parts
      • Lexical processed with regex
      • The structure processed with CFGs
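The lexical half of that split can be sketched with regular expressions alone. The example below turns the earlier program x * y + z into the token names ID TIMES ID PLUS ID. The token names LPAREN, RPAREN, and SKIP, the identifier pattern, and the function name `tokenize` are assumptions added for illustration.

```python
import re

# One named group per token kind; ID, PLUS, and TIMES follow the earlier example.
TOKEN_PATTERN = re.compile(r"""
    (?P<ID>     [A-Za-z][A-Za-z0-9]* )
  | (?P<PLUS>   \+ )
  | (?P<TIMES>  \* )
  | (?P<LPAREN> \( )
  | (?P<RPAREN> \) )
  | (?P<SKIP>   \s+ )
""", re.VERBOSE)

def tokenize(source):
    """Split source text into token names using regular expressions only."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_PATTERN.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r} at {pos}")
        if match.lastgroup != "SKIP":
            tokens.append(match.lastgroup)
        pos = match.end()
    return tokens

print(tokenize("x * y + z"))
```

The resulting token list is what a grammar-based parser would then consume, so nesting and precedence are handled by the CFG rather than by the regular expressions.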