Jun 20, 2022 Recognizers

22-Nov-15 Recognizers. 2 Parsers and recognizers Given a grammar (say, in BNF) and a string, A recognizer will tell whether the string belongs to the



Apr 20, 2023

Recognizers


Parsers and recognizers

Given a grammar (say, in BNF) and a string:

A recognizer will tell whether the string belongs to the language defined by the grammar

A parser will try to build a tree corresponding to the string, according to the rules of the grammar

Input string    Recognizer result    Parser result
2 + 3 * 4       true                 (a parse tree; the figure is not reproduced here)
2 + 3 *         false                Error


Building a recognizer

One way of building a recognizer from a grammar is called recursive descent

Recursive descent is pretty easy to implement, once you figure out the basic ideas
Recursive descent is a great way to build a “quick and dirty” recognizer or parser
Production-quality parsers use much more sophisticated and efficient techniques

In the following slides, I’ll talk about how to do recursive descent, and give some examples in Java


Review of BNF

“Plain” BNF

< > indicate a nonterminal that needs to be further expanded, for example, <number>
Symbols not enclosed in < > are terminals; they represent themselves, for example, if, while, (
The symbol ::= means “is defined as”
The symbol | means “or”; it separates alternatives, for example,
    <add_operator> ::= + | -

Extended BNF

[ ] enclose an optional part of the rule. Example:
    <if statement> ::= if ( <condition> ) <statement> [ else <statement> ]
{ } mean the enclosed part can be repeated zero or more times. Example:
    <parameter list> ::= ( ) | ( { <parameter> , } <parameter> )


Recognizing simple alternatives, I

Consider the following BNF rule:
    <add_operator> ::= + | -
That is, an add operator is a plus sign or a minus sign

To recognize an add operator, we need to get the next token, and test whether it is one of these characters
If it is a plus or a minus, we simply return true
But what if it isn’t?
We not only need to return false, but we also need to put the token back, because it doesn’t belong to us, and some other grammar rule probably wants it

Our tokenizer needs to be able to take back tokens
Usually, it’s enough to be able to put just one token back
More complex grammars may require the ability to put back several tokens
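
The put-back requirement described above can be sketched in a few lines. This is only a minimal sketch, not the tokenizer these slides use: the class name SimpleTokenizer, the rule that tokens are separated by whitespace, and the use of null to mark end of input are all assumptions made for illustration.

```java
class SimpleTokenizer {
    private final String[] tokens;  // all tokens, split on whitespace (an assumption)
    private int position = 0;       // index of the next token to hand out

    SimpleTokenizer(String input) {
        String trimmed = input.trim();
        tokens = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    /** Returns the next token, or null at end of input. */
    String next() {
        String t = position < tokens.length ? tokens[position] : null;
        position++;  // advance even at end of input so backUp() stays symmetric
        return t;
    }

    /** Puts the most recently returned token back. */
    void backUp() {
        if (position == 0) throw new IllegalStateException("nothing to put back");
        position--;
    }

    public static void main(String[] args) {
        SimpleTokenizer tk = new SimpleTokenizer("2 + 3");
        String first = tk.next();  // "2"
        tk.backUp();               // give it back
        System.out.println(first.equals(tk.next()));  // prints true
    }
}
```

Because this sketch keeps every token in an array, it happens to allow backing up more than once in a row; a tokenizer that reads from a stream would have to buffer tokens explicitly to offer the same thing.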


Recognizing simple alternatives, II

Our rule is <add_operator> ::= + | -

Our method for recognizing an <add_operator> (which we will call addOperator) looks like this:

    public boolean addOperator() {
        Get the next token, call it t
        If t is a “+”, return true
        If t is a “-”, return true
        If t is anything else,
            put the token back
            return false
    }


Java code

    public boolean addOperator() {
        Token t = myTokenizer.next();
        if (t.type == TokenType.OPERATOR && t.text.equals("+")) {
            return true;
        }
        if (t.type == TokenType.OPERATOR && t.text.equals("-")) {
            return true;
        }
        myTokenizer.backUp();
        return false;
    }

While this code isn’t particularly long or hard to read, we are going to have a lot of very similar methods


Helper methods

Remember the DRY principle: Don’t Repeat Yourself
If we turn each BNF production directly into Java, we will be writing a lot of very similar code
We should write some auxiliary or “helper” methods to hide some of the details for us

First helper method:
    private boolean operator(String expectedOperator)
Gets the next token and tests whether it matches the expectedOperator
If it matches, returns true
If it doesn’t match, puts the token back and returns false

We’ll look more closely at this method in a moment


Recognizing simple alternatives, III

Our rule is <add_operator> ::= + | -

Our pseudocode is:

    public boolean addOperator() {
        Get the next token, call it t
        If t is a “+”, return true
        If t is a “-”, return true
        If t is anything else,
            put the token back
            return false
    }

Thanks to our helper method, our actual Java code is:

    public boolean addOperator() {
        return operator("+") || operator("-");
    }


First implementation of symbol

Here’s what operator does:
Gets a token
Makes sure that the token is an operator
Compares the text of the operator to the desired text
If all the above is satisfied, returns true
Else (if not satisfied) puts the token back, and returns false

    private boolean operator(String text) {
        Token t = tokenizer.next();
        if (t.type == TokenType.OPERATOR && t.text.equals(text)) {
            return true;
        } else {
            tokenizer.backUp();
            return false;
        }
    }


Implementing symbol

We can implement method groupingSymbol the same way
All this code will look pretty much alike
The only difference is in checking for the type:
    t.type == TokenType.OPERATOR
changes to
    t.type == TokenType.GROUPING_SYMBOL
The DRY principle suggests we should use a helper method for these two methods, operator and groupingSymbol

    private boolean operator(String expectedText) {
        return nextTokenMatches(TokenType.OPERATOR, expectedText);
    }

    private boolean groupingSymbol(String expectedText) {
        return nextTokenMatches(TokenType.GROUPING_SYMBOL, expectedText);
    }


nextTokenMatches #1

The nextTokenMatches method should:
Get a token
Compare types and values
Return true if the token is as expected
Put the token back and return false if it doesn’t match

    private boolean nextTokenMatches(TokenType type, String expectedText) {
        Token t = tokenizer.next();
        if (type == t.type() && t.text.equals(expectedText)) {
            return true;
        } else {
            tokenizer.backUp();
            return false;
        }
    }


nextTokenMatches #2

The previous method is fine for symbols, but what if we only care about the type?
For example, we want to get a number, any number
We need to compare only type, not value

    private boolean nextTokenMatches(TokenType type, String expectedText) {  // omit this parameter
        Token t = tokenizer.next();
        if (type == t.type() && t.text.equals(expectedText))  // omit this test
            return true;
        else
            tokenizer.backUp();
        return false;
    }

It’s easier to overload nextTokenMatches than to combine the two versions, and both versions are fairly short, so we are probably better off with the code duplication
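
The two overloads can sit side by side as in the sketch below. It is an illustration under stated assumptions, not the code from these slides: the class name OverloadDemo, the inlined whitespace-splitting tokenizer, the EOF token type, and the use of a type field (rather than a type() method) on Token are all made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

class OverloadDemo {
    enum TokenType { NUMBER, OPERATOR, EOF }

    static class Token {
        final TokenType type;
        final String text;
        Token(TokenType type, String text) { this.type = type; this.text = text; }
    }

    // Hypothetical stand-in tokenizer: whitespace-separated numbers and operators.
    private final List<Token> tokens = new ArrayList<>();
    private int position = 0;

    OverloadDemo(String input) {
        for (String s : input.trim().split("\\s+")) {
            if (s.isEmpty()) continue;
            TokenType type = s.matches("\\d+") ? TokenType.NUMBER : TokenType.OPERATOR;
            tokens.add(new Token(type, s));
        }
    }

    private Token next() {
        Token t = position < tokens.size() ? tokens.get(position)
                                           : new Token(TokenType.EOF, "");
        position++;  // advance even at end of input so backUp() stays symmetric
        return t;
    }

    private void backUp() { position--; }

    // Version 1: type and text must both match.
    private boolean nextTokenMatches(TokenType type, String expectedText) {
        Token t = next();
        if (type == t.type && t.text.equals(expectedText)) return true;
        backUp();
        return false;
    }

    // Version 2 (the overload): only the type must match.
    private boolean nextTokenMatches(TokenType type) {
        Token t = next();
        if (type == t.type) return true;
        backUp();
        return false;
    }

    boolean operator(String expectedText) {
        return nextTokenMatches(TokenType.OPERATOR, expectedText);
    }

    boolean number() {
        return nextTokenMatches(TokenType.NUMBER);
    }

    public static void main(String[] args) {
        OverloadDemo d = new OverloadDemo("42 + 7");
        System.out.println(d.number() && d.operator("+") && d.number());  // prints true
    }
}
```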


addOperator reprise

    public boolean addOperator() {
        return operator("+") || operator("-");
    }

    private boolean operator(String expectedText) {
        return nextTokenMatches(TokenType.OPERATOR, expectedText);
    }

    private boolean nextTokenMatches(TokenType type, String expectedText) {
        Token t = tokenizer.next();
        if (type == t.type() && t.text.equals(expectedText)) {
            return true;
        }
        tokenizer.backUp();
        return false;
    }


Sequences, I

Suppose we want to recognize a grammar rule in which one thing follows another, for example,
    <empty_list> ::= “[” “]”
(I put quotes around these brackets to distinguish them from the EBNF metasymbols for “optional”)

The code for this would be fairly simple...

    public boolean emptyList() {
        return groupingSymbol("[") && groupingSymbol("]");
    }

...except for one thing...
What happens if we get a “[” and don’t get a “]”?
The above method won’t work. Why not?
Only the second call to groupingSymbol fails, so only one token (the one after the “[”) gets pushed back; the “[” itself stays consumed


Sequences, II

The grammar rule is <empty_list> ::= “[” “]”
And the token string contains [ 5 ]

Solution #1: Write a backUp method that can push back more than one token at a time
    This will allow you to put back both the “[” and the “5”
    You have to be very careful of the order in which you return tokens
    This is a good use for a Stack

Solution #2: Call it an error
    You might be able to get away with this, depending on the grammar
    For example, for any reasonable grammar, (2 + 3 +) is clearly an error

Solution #3: Change the grammar
    Tricky, and may not be possible

Solution #4: Combine rules
    See the next slide


Implementing a fancier backUp()

java.io.StreamTokenizer does almost everything you need in a tokenizer, so in many cases you don’t need to write your own
    But it has a really ugly user interface

Its pushBack() method only “puts back” a single token
    If you need more than that, you can extend StreamTokenizer

To push back more than one token, you need to either:
    Make your tokenizer keep track of the last several tokens (and have a backUp(int n) method), or
    Expect the calling program to tell you what tokens to push back (with a backUp(Token t) method)

Plus, you will have to override nextToken()
    Inside your nextToken() method, you can call super.nextToken() to get the next never-before-seen token
    Your nextToken() method will also have to do something about nval and sval, such as provide methods to get these values
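
One way the first option (keep track of past tokens and offer backUp(int n)) might look is sketched below. The class name BackupTokenizer and its Saved helper are hypothetical; only StreamTokenizer, its nextToken() and pushBack() methods, and its ttype, nval and sval fields come from java.io. The sketch makes two design choices of its own: it remembers every token ever read rather than just the last several, and it rethrows IOException as UncheckedIOException so callers need no try/catch.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class BackupTokenizer extends StreamTokenizer {
    /** Snapshot of one token: ttype plus the nval and sval that went with it. */
    private static final class Saved {
        final int ttype;
        final double nval;
        final String sval;
        Saved(int ttype, double nval, String sval) {
            this.ttype = ttype;
            this.nval = nval;
            this.sval = sval;
        }
    }

    private final List<Saved> history = new ArrayList<>();  // every token read so far
    private int position = 0;  // how many tokens are currently "consumed"

    BackupTokenizer(Reader reader) { super(reader); }

    @Override
    public int nextToken() {
        if (position == history.size()) {
            try {
                super.nextToken();  // a never-before-seen token
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            history.add(new Saved(ttype, nval, sval));
        }
        // Replay the token at the current position, restoring nval and sval too.
        Saved s = history.get(position++);
        ttype = s.ttype;
        nval = s.nval;
        sval = s.sval;
        return ttype;
    }

    /** Puts back the last n tokens, so they will be returned again in order. */
    void backUp(int n) {
        if (n < 0 || n > position) throw new IllegalArgumentException("cannot back up " + n);
        position -= n;
    }

    @Override
    public void pushBack() {
        if (position > 0) backUp(1);  // route single push-backs through the history
    }

    public static void main(String[] args) {
        BackupTokenizer tk = new BackupTokenizer(new StringReader("[ 5 ]"));
        tk.nextToken();  // "["
        tk.nextToken();  // 5
        tk.backUp(2);    // put back both
        System.out.println((char) tk.nextToken());  // prints [
    }
}
```

Keeping the whole history is simple but uses memory proportional to the input; a bounded buffer of the last several tokens would be closer to what the slide describes.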


Sequences, III

Suppose the grammar really says
    <list> ::= “[” “]” | “[” <number> “]”

Now your pseudocode should look something like this:

    public boolean list() {
        if first token is “[” {
            if second token is “]” return true
            else if second token is a number {
                if third token is “]” return true
                else error
            }
        }
        else put back first token and return false
    }


Sequences, IV

Another possibility is to revise the grammar (but make sure the new grammar is equivalent to the old one!)

Old grammar:
    <list> ::= “[” “]” | “[” <number> “]”
New grammar:
    <list> ::= “[” <rest_of_list>
    <rest_of_list> ::= “]” | <number> “]”

New pseudocode:

    public boolean list() {
        if first token is “[” {
            if restOfList() return true
        }
        else put back first token
    }

    private boolean restOfList() {
        if first token is “]”, return true
        if first token is a number and second token is a “]”, return true
        else return false
    }
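
The revised grammar can be turned into Java roughly as follows. This is a sketch under assumptions rather than the course's code: the class name ListRecognizer, the whitespace-separated string tokens, and the empty string as an end-of-input marker are all hypothetical. It also picks "Solution #2" for the awkward cases: once a “[” has been consumed, anything that cannot finish the list is reported as an error instead of being pushed back.

```java
class ListRecognizer {
    private final String[] tokens;  // whitespace-separated tokens (an assumption)
    private int position = 0;

    ListRecognizer(String input) {
        String trimmed = input.trim();
        tokens = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    private String next() {
        String t = position < tokens.length ? tokens[position] : "";  // "" marks end of input
        position++;  // advance even at end of input so backUp() stays symmetric
        return t;
    }

    private void backUp() { position--; }

    private boolean symbol(String expected) {
        if (next().equals(expected)) return true;
        backUp();
        return false;
    }

    private boolean number() {
        if (next().matches("\\d+")) return true;
        backUp();
        return false;
    }

    /** <list> ::= "[" <rest_of_list> */
    boolean list() {
        if (!symbol("[")) return false;  // symbol() already put the token back
        if (restOfList()) return true;
        throw new IllegalStateException("Malformed list after '['");
    }

    /** <rest_of_list> ::= "]" | <number> "]" */
    private boolean restOfList() {
        if (symbol("]")) return true;
        if (number()) {
            if (symbol("]")) return true;
            throw new IllegalStateException("Expected ']' after number");
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(new ListRecognizer("[ 5 ]").list());  // prints true
    }
}
```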


Simple sequences in Java

Suppose you have this rule:
    <factor> ::= ( <expression> )
A good way to do this is often to test whether the grammar rule is not met

    public boolean factor() {
        if (symbol("(")) {
            if (!expression()) error("Error in parenthesized expression");
            if (!symbol(")")) error("Unclosed parenthetical expression");
            return true;
        }
        return false;
    }

To do this, you need to be careful that the “(” is not the start of some other production that can be used where a factor can be used
    In other words, be sure that if you get a “(” it must start a factor
Also, error(String) must throw an Exception. Why?


false vs. error

When should a method return false, and when should it report an error?
false means that this method did not recognize its input
Report an error if you know that something has gone wrong
    In other words, you know that no other method will recognize the input, either

    public boolean ifStatement() {
        if you don’t see “if”, return false  // could be some other kind of statement
        if you don’t see a condition, return an error  // “if” is a keyword that must start an if statement

If you see if, and it isn’t followed by a condition, there is nothing else that it could be

This isn’t completely mechanical; you have to decide


Sequences and alternatives

Here’s the real grammar rule for <factor>:
    <factor> ::= <name>
               | <number>
               | ( <expression> )

And here’s the actual code:

    public boolean factor() {
        if (name()) return true;
        if (number()) return true;
        if (groupingSymbol("(")) {
            if (!expression()) error("Error in parenthesized expression");
            if (!groupingSymbol(")")) error("Unclosed parenthetical expression");
            return true;
        }
        return false;
    }


Recursion, I

Here’s an unfortunate (but legal!) grammar rule:
    <expression> ::= <term> | <expression> + <term>
Here’s some code for it:

    public boolean expression() {
        if (term()) return true;
        if (!expression()) return false;
        if (!addOperator()) return false;
        if (!term()) error("Error in expression after '+' ");
        return true;
    }

Do you see the problem?
We aren’t recurring with a simpler case, therefore, we have an infinite recursion
Our grammar rule is left recursive (the recursive part is the leftmost thing in the definition)


Recursion, II

Here’s our unfortunate grammar rule again:
    <expression> ::= <term> | <expression> + <term>
Here’s an equivalent, right recursive rule:
    <expression> ::= <term> [ + <expression> ]
Here’s some (much happier!) code for it:

    public boolean expression() {
        if (!term()) return false;
        if (!addOperator()) return true;
        if (!expression()) error("Error in expression after '+' ");
        return true;
    }

This works for the Recognizer, but will cause problems later
    We’ll cross that bridge when we come to it


Extended BNF—optional parts

Extended BNF uses brackets to indicate optional parts of rules
Example:
    <if_statement> ::= if <condition> <statement> [ else <statement> ]

Pseudocode for this example:

    public boolean ifStatement() {
        if you don’t see “if”, return false
        if you don’t see a condition, return an error
        if you don’t see a statement, return an error
        if you see an “else” {
            if you see a “statement”, return true
            else return an error
        }
        else return true;
    }
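
The pseudocode above translates to Java fairly directly. The sketch below is illustrative only: the slides do not define <condition> or <statement>, so it assumes a condition is the single token true or false and a statement is either the placeholder token stmt or a nested if statement. The class name, the string-based tokenizer, and the choice of IllegalStateException for errors are likewise assumptions.

```java
class IfStatementRecognizer {
    private final String[] tokens;  // whitespace-separated tokens (an assumption)
    private int position = 0;

    IfStatementRecognizer(String input) {
        String trimmed = input.trim();
        tokens = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    private String next() {
        String t = position < tokens.length ? tokens[position] : "";  // "" marks end of input
        position++;  // advance even at end of input so backUp() stays symmetric
        return t;
    }

    private void backUp() { position--; }

    private boolean keyword(String expected) {
        if (next().equals(expected)) return true;
        backUp();
        return false;
    }

    private void error(String message) {
        throw new IllegalStateException(message);
    }

    // Assumed stand-in rule: <condition> ::= true | false
    private boolean condition() {
        return keyword("true") || keyword("false");
    }

    // Assumed stand-in rule: <statement> ::= stmt | <if_statement>
    private boolean statement() {
        return keyword("stmt") || ifStatement();
    }

    /** <if_statement> ::= if <condition> <statement> [ else <statement> ] */
    boolean ifStatement() {
        if (!keyword("if")) return false;  // could be some other kind of statement
        if (!condition()) error("Expected a condition after 'if'");
        if (!statement()) error("Expected a statement after the condition");
        if (keyword("else")) {  // the optional part
            if (!statement()) error("Expected a statement after 'else'");
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(new IfStatementRecognizer("if true stmt else stmt").ifStatement());  // prints true
    }
}
```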


Extended BNF—zero or more

Extended BNF uses braces to indicate parts of a rule that can be repeated
Example:
    <expression> ::= <term> { + <term> }

Pseudocode for example:

    public boolean expression() {
        if you don’t see a term, return false
        while you see a “+” {
            if you don’t see a term, return an error
        }
        return true
    }
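
Putting the repetition rule together with the earlier ideas (put-back, false versus error, recursion) gives a small but complete recognizer. Again this is a sketch, not the slides' code: the rule for <term> is an assumption (a number or a parenthesized expression), as are the class name, the whitespace-separated string tokens, and the matchesWholeInput() helper that checks nothing is left over.

```java
class ExpressionRecognizer {
    private final String[] tokens;  // whitespace-separated tokens (an assumption)
    private int position = 0;

    ExpressionRecognizer(String input) {
        String trimmed = input.trim();
        tokens = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    private String next() {
        String t = position < tokens.length ? tokens[position] : "";  // "" marks end of input
        position++;  // advance even at end of input so backUp() stays symmetric
        return t;
    }

    private void backUp() { position--; }

    private boolean symbol(String expected) {
        if (next().equals(expected)) return true;
        backUp();
        return false;
    }

    private boolean number() {
        if (next().matches("\\d+")) return true;
        backUp();
        return false;
    }

    private void error(String message) {
        throw new IllegalStateException(message);
    }

    /** <expression> ::= <term> { + <term> } */
    boolean expression() {
        if (!term()) return false;
        while (symbol("+")) {  // zero or more repetitions
            if (!term()) error("Expected a term after '+'");
        }
        return true;
    }

    /** Assumed rule: <term> ::= <number> | ( <expression> ) */
    private boolean term() {
        if (number()) return true;
        if (symbol("(")) {
            if (!expression()) error("Error in parenthesized expression");
            if (!symbol(")")) error("Unclosed parenthetical expression");
            return true;
        }
        return false;
    }

    /** True only if the entire input is a single expression. */
    boolean matchesWholeInput() {
        return expression() && position >= tokens.length;
    }

    public static void main(String[] args) {
        System.out.println(new ExpressionRecognizer("( 2 + 3 ) + 4").matchesWholeInput());  // prints true
    }
}
```

Note that a trailing “+” with no term after it is reported as an error (an exception), while leftover tokens the grammar does not know about simply make matchesWholeInput() return false.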


Back to parsers

A parser is like a recognizer
The difference is that, when a parser recognizes something, it does something about it
Usually, what a parser does is build a tree

If the thing that is being parsed is a program, then
    You can write another program that “walks” the tree and executes the statements and expressions as it finds them
        Such a program is called an interpreter
    You can write another program that “walks” the tree and produces code in some other language (usually assembly language) that does the same thing
        Such a program is called a compiler
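
To make the step from recognizer to parser concrete, here is a small sketch in which each recognizing method returns a tree node (or null) instead of true or false, followed by a tiny tree-walking interpreter. Everything named here (TreeParserDemo, Node, evaluate, show) is hypothetical, and the grammar is cut down to <expression> ::= <term> { + <term> } with <term> assumed to be just a number.

```java
class TreeParserDemo {
    /** A tree node: a number leaf (no children) or a "+" node with two children. */
    static class Node {
        final String value;
        final Node left, right;
        Node(String value, Node left, Node right) {
            this.value = value;
            this.left = left;
            this.right = right;
        }
    }

    private final String[] tokens;  // whitespace-separated tokens (an assumption)
    private int position = 0;

    TreeParserDemo(String input) {
        String trimmed = input.trim();
        tokens = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    private String next() {
        String t = position < tokens.length ? tokens[position] : "";  // "" marks end of input
        position++;
        return t;
    }

    private void backUp() { position--; }

    private boolean symbol(String expected) {
        if (next().equals(expected)) return true;
        backUp();
        return false;
    }

    /** Assumed rule: <term> ::= <number>. Returns null if no term starts here. */
    private Node term() {
        String t = next();
        if (t.matches("\\d+")) return new Node(t, null, null);
        backUp();
        return null;
    }

    /** <expression> ::= <term> { + <term> }. Returns null if no expression starts here. */
    Node expression() {
        Node tree = term();
        if (tree == null) return null;
        while (symbol("+")) {
            Node right = term();
            if (right == null) throw new IllegalStateException("Expected a term after '+'");
            tree = new Node("+", tree, right);  // grow the tree to the left
        }
        return tree;
    }

    /** A tiny interpreter: walks the tree and computes its value. */
    static int evaluate(Node n) {
        if (n.left == null) return Integer.parseInt(n.value);
        return evaluate(n.left) + evaluate(n.right);
    }

    /** Renders the tree in prefix form so its shape is visible. */
    static String show(Node n) {
        if (n.left == null) return n.value;
        return "(" + n.value + " " + show(n.left) + " " + show(n.right) + ")";
    }

    public static void main(String[] args) {
        Node tree = new TreeParserDemo("2 + 3 + 4").expression();
        System.out.println(show(tree));      // prints (+ (+ 2 3) 4)
        System.out.println(evaluate(tree));  // prints 9
    }
}
```

A compiler would walk the same tree but emit instructions for each node instead of computing a value.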


Conclusions

If you start with a BNF definition of a language,
    You can write a recursive descent recognizer to tell you whether an input string “belongs to” that language (is a valid program in that language)
        Writing such a recognizer is a “cookbook” exercise: you just follow the recipe and it works (hopefully)
    You can write a recursive descent parser to create a parse tree representing the program
        The parse tree can later be used to execute the program

BNF is purely syntactic
    BNF tells you what is legal, and how things are put together
    BNF has nothing to say about what things actually mean


The End