Foundations of Software Design
Lecture 23: Finite Automata and Context-Free Grammars
Marti Hearst, Fall 2002



Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html

Outline

• Finite State Automata
• Why regular expressions are not enough
• Context Free Grammars
• Next Time: Relationship to Programming Languages and Compilers


Adapted from Jurafsky & Martin 2000

Three Equivalent Representations

Finite automata, regular expressions, and regular languages: each can describe the others.

Theorem:

For every regular expression, there is a deterministic finite-state automaton that defines the same language, and vice versa.


Finite Automata

• An FA is similar to a compiler in that:
  – A compiler recognizes legal programs in some (source) language.
  – A finite-state machine recognizes legal strings in some language.

• Example: Pascal Identifiers
  – sequences of one or more letters or digits, starting with a letter:

    S --letter--> A
    A --letter | digit--> A      (S is the start state, A is accepting)
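The two-state machine for Pascal identifiers can be sketched directly in Python. This is only an illustration: the function name `is_pascal_identifier` is made up here, and "letter" and "digit" are assumed to mean ASCII letters and digits.

```python
def is_pascal_identifier(text: str) -> bool:
    """Run the two-state automaton (start state S, accepting state A)."""
    state = "S"
    for ch in text:
        is_letter = ch.isascii() and ch.isalpha()
        is_digit = ch.isascii() and ch.isdigit()
        if state == "S" and is_letter:
            state = "A"          # S --letter--> A
        elif state == "A" and (is_letter or is_digit):
            state = "A"          # A --letter | digit--> A
        else:
            return False         # no transition available: the machine is stuck
    return state == "A"          # accept only if the input ends in state A
```

The empty string is rejected because the machine never leaves S, which is not an accepting state.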


Finite-Automata State Graphs

• The start state

• An accepting state

• A transition (an edge labeled with its input symbol, e.g. a)

• A state


Finite Automata

• Transition: s1 --a--> s2
• Is read: In state s1 on input “a” go to state s2
• If end of input
  – If in accepting state => accept
  – Otherwise => reject
• If no transition possible (got stuck) => reject
• FSA = Finite State Automaton


Language defined by FSA

• The language defined by an FSA is the set of strings accepted by the FSA.
  – in the language of the FSM shown below:
    • x, tmp2, XyZzy, position27.
  – not in the language of the FSM shown below:
    • 123, a?, 13apples.

    S --letter--> A
    A --letter | digit--> A      (S is the start state, A is accepting)
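Since regular expressions and finite automata define the same languages, membership can also be checked with a regular expression. The sketch below assumes the pattern `[A-Za-z][A-Za-z0-9]*` describes the same language as the machine above; `in_language`, `accepted`, and `rejected` are illustrative names.

```python
import re

# Assumed pattern for the identifier language: one letter, then letters or digits.
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9]*")

def in_language(text: str) -> bool:
    """True if the whole string (not just a prefix) matches the pattern."""
    return IDENTIFIER.fullmatch(text) is not None

# The example strings from the slide.
accepted = ["x", "tmp2", "XyZzy", "position27"]
rejected = ["123", "a?", "13apples"]
```

Calling `in_language` on each string in `accepted` returns True, and on each string in `rejected` returns False.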


Example: Integer Literals

• FSA that accepts integer literals with an optional + or - sign:
• Note – two different edges from S to A
• /(\+|-)?[0-9]+/

    S --+--> A
    S -- - --> A
    S --digit--> B
    A --digit--> B
    B --digit--> B      (S is the start state, B is accepting)
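The integer-literal automaton can be written as a lookup table keyed by (state, input). This is a sketch under the assumption that the edges are the five listed above; the names `TRANSITIONS`, `ACCEPTING`, and `accepts_integer_literal` are illustrative.

```python
# (state, input class) -> next state. Any missing entry means "stuck".
TRANSITIONS = {
    ("S", "+"): "A",
    ("S", "-"): "A",
    ("S", "digit"): "B",
    ("A", "digit"): "B",
    ("B", "digit"): "B",
}
ACCEPTING = {"B"}

def accepts_integer_literal(text: str) -> bool:
    state = "S"
    for ch in text:
        # Collapse the ten digit characters into one input class.
        symbol = "digit" if "0" <= ch <= "9" else ch
        state = TRANSITIONS.get((state, symbol))
        if state is None:
            return False          # got stuck
    return state in ACCEPTING     # a sign with no digits ends in A and is rejected
```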


Example:

• FSA that accepts three letter English words that begin with p and end with d or t.
• Here I use the convenient notation of making the state name match the input that has to be on the edge leading to that state.

[State graph: states named p, a, o, u, i, t, and d]


Practice

Write an automaton that accepts Java identifiers
  – One or more letters, digits, or underscores, starting with a letter or an underscore.
  – Start with the regexp


Formal Definition

• A finite automaton is a 5-tuple (Σ, Q, δ, q, F) where:
  – An input alphabet Σ
  – A set of states Q
  – A start state q
  – A set of accepting states F ⊆ Q
  – δ is the state transition function: Q × Σ → Q
    (i.e., δ encodes each transition from a state, on an input symbol, to a state)
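The 5-tuple maps naturally onto a small record type. The sketch below is one possible encoding, not a standard one: the class name `FiniteAutomaton`, its field names, and the example machine `ends_in_b` are all assumptions made for illustration, and δ is stored as a dictionary where a missing key means there is no transition.

```python
from dataclasses import dataclass

@dataclass
class FiniteAutomaton:
    alphabet: frozenset    # the input alphabet
    states: frozenset      # the set of states Q
    delta: dict            # transition function as (state, symbol) -> state
    start: str             # the start state q
    accepting: frozenset   # the accepting states F, a subset of Q

    def accepts(self, text: str) -> bool:
        state = self.start
        for symbol in text:
            state = self.delta.get((state, symbol))
            if state is None:
                return False              # no transition: reject
        return state in self.accepting

# Hypothetical example machine: strings over {a, b} that end in "b".
ends_in_b = FiniteAutomaton(
    alphabet=frozenset("ab"),
    states=frozenset({"q0", "q1"}),
    delta={
        ("q0", "a"): "q0", ("q0", "b"): "q1",
        ("q1", "a"): "q0", ("q1", "b"): "q1",
    },
    start="q0",
    accepting=frozenset({"q1"}),
)
```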


How to Implement an FSA

A table-driven approach:
• table:
  – one row for each state in the machine, and
  – one column for each possible character.
• Table[j][k]
  – which state to go to from state j on character k,
  – an empty entry corresponds to the machine getting stuck.


The table-driven program for a Deterministic FSA

state = S                       // S is the start state
repeat {
    k = next character from the input
    if (k == EOF) {             // the end of input
        if state is a final state then accept
        else reject
    }
    state = T[state, k]
    if state = empty then reject    // got stuck
}
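The pseudocode above translates almost line for line into Python. The sketch below assumes the identifier automaton from earlier in the lecture and, to keep the table small, uses one column per character class rather than one per character; all names (`TABLE`, `column`, `run`) are illustrative.

```python
# Column indices (assumption: one column per character class, not per character).
LETTER, DIGIT, OTHER = 0, 1, 2
# Row indices for the two states of the identifier automaton.
S, A = 0, 1
EMPTY = None                      # an empty table entry

TABLE = [
    # LETTER  DIGIT  OTHER
    [A,       EMPTY, EMPTY],      # row for state S
    [A,       A,     EMPTY],      # row for state A
]
FINAL_STATES = {A}

def column(ch: str) -> int:
    if ch.isascii() and ch.isalpha():
        return LETTER
    if ch.isascii() and ch.isdigit():
        return DIGIT
    return OTHER

def run(text: str) -> bool:
    state = S                                 # S is the start state
    for ch in text:                           # leaving the loop is the EOF case
        state = TABLE[state][column(ch)]
        if state is EMPTY:
            return False                      # got stuck
    return state in FINAL_STATES              # accept only in a final state
```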


Regular expressions are not enough.

• You can write an automaton that accepts the specific strings:
  – “a”, “(a)”, “((a))”, and “(((a)))”
• But you can’t write one for this (in the general case):
  – “a”, “(a)”, “((a))”, “(((a)))”, … “(^k a )^k”
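One way to see the difference: recognizing (^k a )^k for every k requires remembering k, and a finite automaton has only a fixed number of states to remember with. The sketch below uses an integer counter that can grow without bound, which is exactly the resource an FSA lacks. The function name is made up for illustration.

```python
def matches_nested_a(text: str) -> bool:
    """Accept exactly the strings made of k '(' then 'a' then k ')', for k >= 0."""
    k = 0
    while k < len(text) and text[k] == "(":
        k += 1                    # count the opening parentheses (unbounded)
    # The rest of the string must be 'a' followed by exactly k closing parentheses.
    return text == "(" * k + "a" + ")" * k
```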


Regular expressions are not enough.

• What programs are generated by?
    digit+ ( ( “+” | “-” | “*” | “/” ) digit+ )*
• [Note: the Perl-style replacement operators such as \1 are not part of the definition of regular expressions.]
• What important properties does this regular expression fail to express?
  – Regex’s are not good at showing
    • Precedence
    • Nesting
    • Recursion
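The expression above can be tried out with Python's `re` module. This is a sketch: the pattern is a direct transcription with `[0-9]` assumed for `digit`, and `is_flat_expression` is an illustrative name.

```python
import re

# digit+ ( ("+" | "-" | "*" | "/") digit+ )*
EXPRESSION = re.compile(r"[0-9]+([+\-*/][0-9]+)*")

def is_flat_expression(text: str) -> bool:
    return EXPRESSION.fullmatch(text) is not None
```

`is_flat_expression("1+2*3")` is True, but the match says nothing about `*` binding tighter than `+`, and `is_flat_expression("(1+2)*3")` is False because the pattern has no way to describe nested parentheses.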


Context-Free Grammars


Derivation

1. Begin with a string consisting of the start symbol “S”

2. Replace any non-terminal X in the string by the right-hand side of some production X → Y1 … Yn

3. Repeat (2) until there are no non-terminals in the string


Example

Define a language to recognize strings of balanced parentheses:

    { (^i )^i | i ≥ 0 }

The grammar:

    S → ( S )
    S → ε

Recognizes these strings: () (()) (((()))) …
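A derivation in this grammar can be traced step by step. The sketch below assumes the two productions S → ( S ) and S → ε, and records every intermediate string; `derive_balanced` is an illustrative name.

```python
def derive_balanced(depth: int) -> list:
    """Return each string in a derivation of `depth` nested parentheses."""
    form = "S"
    steps = [form]
    for _ in range(depth):
        form = form.replace("S", "(S)", 1)   # apply S -> ( S )
        steps.append(form)
    form = form.replace("S", "", 1)          # apply S -> ε (the empty string)
    steps.append(form)
    return steps
```

For example, `derive_balanced(2)` returns `["S", "(S)", "((S))", "(())"]`.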


Derivation Rules

    S → ( S )
    S → ε

is the same as

    S → ( S ) | ε


Arithmetic Example

Simple arithmetic expressions:

    E → E + E | E * E | (E) | id

Some valid strings in the language:

    id            id + id
    (id)          id * id
    (id) * id     id * (id)

    id * id + id
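Membership in this language can be checked by trying each production on every way of splitting the token sequence. The sketch below is a small memoized recognizer for the grammar E → E + E | E * E | (E) | id, not a technique from the lecture; the function names and the simple tokenization are assumptions.

```python
from functools import lru_cache

def in_expression_language(text: str) -> bool:
    # Assumed tokenization: pad parentheses with spaces, then split on whitespace.
    tokens = tuple(text.replace("(", " ( ").replace(")", " ) ").split())

    @lru_cache(maxsize=None)
    def derives_E(i: int, j: int) -> bool:
        """True if tokens[i:j] can be derived from E."""
        if j - i == 1:
            return tokens[i] == "id"                              # E -> id
        if (j - i >= 3 and tokens[i] == "(" and tokens[j - 1] == ")"
                and derives_E(i + 1, j - 1)):
            return True                                           # E -> ( E )
        return any(                                               # E -> E + E | E * E
            tokens[k] in ("+", "*") and derives_E(i, k) and derives_E(k + 1, j)
            for k in range(i + 1, j - 1)
        )

    return len(tokens) > 0 and derives_E(0, len(tokens))
```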


Derivations and Parse Trees

A derivation is a sequence of productions applied starting from the start symbol S

A derivation can be drawn as a tree
  – Start symbol is the tree’s root
  – For a production X → Y1 … Yn add children Y1 … Yn to node X
  – Stop when every leaf is a terminal


Right-most Derivation Example

Input: id * id + id

    E
    ⇒ E + E
    ⇒ E + id
    ⇒ E * E + id
    ⇒ E * id + id
    ⇒ id * id + id

Parse tree:

            E
          / | \
         E  +  E
       / | \   |
      E  *  E  id
      |     |
      id    id


Left-most and Right-most Derivations

• The example is a right-most derivation
  – At each step, replace the right-most non-terminal
• There is an equivalent notion of a left-most derivation

    E
    ⇒ E + E
    ⇒ E + id
    ⇒ E * E + id
    ⇒ E * id + id
    ⇒ id * id + id
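The two derivation orders can be compared by replaying a list of production choices against either the right-most or the left-most E. This is a sketch for the grammar E → E + E | E * E | (E) | id; the dictionary keys and the function name `derive` are illustrative.

```python
# Right-hand sides of the productions for E, as token lists.
PRODUCTIONS = {
    "E+E": ["E", "+", "E"],
    "E*E": ["E", "*", "E"],
    "(E)": ["(", "E", ")"],
    "id": ["id"],
}

def derive(choices, order):
    """Apply each chosen production to the right-most or left-most E in turn."""
    form = ["E"]
    steps = [" ".join(form)]
    for choice in choices:
        positions = [i for i, symbol in enumerate(form) if symbol == "E"]
        i = positions[-1] if order == "rightmost" else positions[0]
        form[i:i + 1] = PRODUCTIONS[choice]      # replace that E by the right-hand side
        steps.append(" ".join(form))
    return steps

rightmost = derive(["E+E", "id", "E*E", "id", "id"], "rightmost")
leftmost = derive(["E+E", "E*E", "id", "id", "id"], "leftmost")
```

`rightmost` reproduces the six steps listed above; `leftmost` passes through `E * E + E` and `id * E + E` instead, and both end at `id * id + id`.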


A formal definition of CFGs

• A CFG consists of
  – A set of terminals T
  – A set of non-terminals N
  – A start symbol S (a non-terminal)
  – A set of productions:

    X → Y1 Y2 … Yn    where X ∈ N and Yi ∈ T ∪ N
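The four components can be held in a small record with a check that every production has the required shape. This is a sketch: the class name `Grammar`, the method `is_well_formed`, and the requirement that T and N be disjoint (a usual convention, not stated on the slide) are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Grammar:
    terminals: set       # T
    nonterminals: set    # N
    start: str           # S, which must be a non-terminal
    productions: list    # pairs (X, [Y1, ..., Yn])

    def is_well_formed(self) -> bool:
        symbols = self.terminals | self.nonterminals
        return (
            self.start in self.nonterminals
            # Assumed convention: a symbol is either a terminal or a non-terminal.
            and self.terminals.isdisjoint(self.nonterminals)
            # Every production has X in N and each Yi in T ∪ N.
            and all(
                lhs in self.nonterminals and all(y in symbols for y in rhs)
                for lhs, rhs in self.productions
            )
        )

# The arithmetic grammar from the lecture: E -> E + E | E * E | ( E ) | id
arithmetic = Grammar(
    terminals={"+", "*", "(", ")", "id"},
    nonterminals={"E"},
    start="E",
    productions=[
        ("E", ["E", "+", "E"]),
        ("E", ["E", "*", "E"]),
        ("E", ["(", "E", ")"]),
        ("E", ["id"]),
    ],
)
```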


Derivations and Parse Trees

• Note that right-most and left-most derivations have the same parse tree

• The difference is the order in which branches are added


Example

• The program:
    x * y + z
• Input to parser:
    ID TIMES ID PLUS ID
  we’ll write tokens as follows:
    id * id + id
• Output of parser:
    the parse tree

            E
          / | \
         E  +  E
       / | \   |
      E  *  E  id
      |     |
      id    id
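The step from program text to parser input is the job of a lexer built from regular expressions. The sketch below produces the token names used above; the token patterns and the names `TOKEN_SPEC` and `tokenize` are assumptions for illustration, covering only identifiers, `*`, and `+`.

```python
import re

# (token name, pattern) pairs; SKIP marks whitespace that produces no token.
TOKEN_SPEC = [
    ("ID", r"[A-Za-z][A-Za-z0-9]*"),
    ("TIMES", r"\*"),
    ("PLUS", r"\+"),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(program: str) -> list:
    tokens = []
    pos = 0
    while pos < len(program):
        match = TOKEN_RE.match(program, pos)
        if match is None:
            raise ValueError(f"unexpected character {program[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":
            tokens.append(match.lastgroup)   # the name of the pattern that matched
        pos = match.end()
    return tokens
```

`tokenize("x * y + z")` returns `["ID", "TIMES", "ID", "PLUS", "ID"]`.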


Ambiguity

• Grammar
    E → E + E | E * E | (E) | id

• String
    id * id + id


Ambiguity (Cont.)

This string has two parse trees

        E
      / | \
     E  *  E
     |   / | \
     id E  +  E
        |     |
        id    id

            E
          / | \
         E  +  E
       / | \   |
      E  *  E  id
      |     |
      id    id
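The number of parse trees for a string can be counted mechanically. The sketch below counts trees for the grammar E → E + E | E * E | (E) | id by summing over every production and every split point; it is an illustration, not a method from the lecture, and `count_parse_trees` is a made-up name.

```python
from functools import lru_cache

def count_parse_trees(text: str) -> int:
    # Assumed tokenization: pad parentheses with spaces, then split on whitespace.
    tokens = tuple(text.replace("(", " ( ").replace(")", " ) ").split())

    @lru_cache(maxsize=None)
    def count(i: int, j: int) -> int:
        """Number of parse trees deriving tokens[i:j] from E."""
        if j - i == 1:
            return 1 if tokens[i] == "id" else 0              # E -> id
        total = 0
        if j - i >= 3 and tokens[i] == "(" and tokens[j - 1] == ")":
            total += count(i + 1, j - 1)                      # E -> ( E )
        for k in range(i + 1, j - 1):
            if tokens[k] in ("+", "*"):                       # E -> E + E | E * E
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(tokens)) if tokens else 0
```

`count_parse_trees("id * id + id")` returns 2, matching the two trees above, while a fully parenthesized string such as `"(id * id) + id"` has exactly one.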


Ambiguity (Cont.)

• A grammar is ambiguous if for some string (the following three conditions are equivalent)
  – it has more than one parse tree
  – there is more than one right-most derivation
  – there is more than one left-most derivation
• Ambiguity is BAD
  – Makes the meaning of some programs ill-defined


Dealing with Ambiguity

• There are several ways to handle ambiguity
• Most direct method is to rewrite grammar unambiguously

    E → E' + E | E'
    E' → id * E' | id | (E)

• Enforces precedence of * over +
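Because the rewritten grammar is unambiguous, it can be turned into a simple recursive-descent parser with one function per non-terminal. The sketch below builds nested tuples as a stand-in for the parse tree; the function names and the tuple representation are assumptions for illustration.

```python
def parse(text: str):
    # Assumed tokenization: pad parentheses with spaces, then split on whitespace.
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    tree, pos = parse_E(tokens, 0)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token at position {pos}")
    return tree

def parse_E(tokens, pos):
    # E -> E' + E | E'
    left, pos = parse_E_prime(tokens, pos)
    if pos < len(tokens) and tokens[pos] == "+":
        right, pos = parse_E(tokens, pos + 1)
        return ("+", left, right), pos
    return left, pos

def parse_E_prime(tokens, pos):
    # E' -> id * E' | id | ( E )
    if pos < len(tokens) and tokens[pos] == "id":
        if pos + 1 < len(tokens) and tokens[pos + 1] == "*":
            right, pos = parse_E_prime(tokens, pos + 2)
            return ("*", "id", right), pos
        return "id", pos + 1
    if pos < len(tokens) and tokens[pos] == "(":
        inner, pos = parse_E(tokens, pos + 1)
        if pos < len(tokens) and tokens[pos] == ")":
            return inner, pos + 1
    raise SyntaxError(f"unexpected input at position {pos}")
```

`parse("id * id + id")` returns `("+", ("*", "id", "id"), "id")`: the multiplication is grouped first, and there is no second tree to choose from. Following the grammar exactly as written, a parenthesized expression cannot be followed by `*`, so `parse("(id) * id")` raises `SyntaxError`.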


Properties of CFGs

• Membership in a language is “yes” or “no”
• Form of the grammar is important
  – Different grammars can generate the same language
• Need an “implementation” of CFG’s,
  – i.e. the parser
  – we’ll create the parser using a parser generator
    • available generators: yacc, javacc


CFGs vs. Regex’s

• CFGs are better at expressing recursive structure
  – They are often described using trees …
  – … which we know have a recursive structure
• Example:
  – if E then S1 else S2
  – if E1 then if E2 then S1 else S2 else S3
  – if E1 then (if E2 then S1 else S2) else S3


Regular Languages vs CFGs

• Every regex can be expressed as a CFG
• The converse is not true
  – BUT regex’s cover much of what you need
• When to use which?
  – According to Aho, Sethi, and Ullman ’86:
  – Regex’s are more concise and easier to understand
  – More efficient analyzers can be constructed from regex’s
  – It is often useful to separate the structure of a language into lexical and nonlexical parts
    • Lexical processed with regex
    • The structure processed with CFGs