Upload
others
View
8
Download
0
Embed Size (px)
Citation preview
2/8/2008 Prof. Hilfinger CS164 Lecture 8 1
Introduction to Parsing
Lecture 8Adapted from slides by G. Necula and R. Bodik
2/8/2008 Prof. Hilfinger CS164 Lecture 8 2
Outline
• Limitations of regular languages• Parser overview• Context-free grammars (CFG’s)• Derivations• Syntax-Directed Translation
2/8/2008 Prof. Hilfinger CS164 Lecture 8 3
Languages and Automata
• Formal languages are very important in CS– Especially in programming languages
• Regular languages– The weakest formal languages widely used– Many applications
• We will also study context-free languages
2/8/2008 Prof. Hilfinger CS164 Lecture 8 4
Limitations of Regular Languages
• Intuition: A finite automaton that runs longenough must repeat states
• Finite automaton can’t remember # of times ithas visited a particular state
• Finite automaton has finite memory– Only enough to store in which state it is– Cannot count, except up to a finite limit
• E.g., language of balanced parentheses is notregular: { (i )i | i ≥ 0}
2/8/2008 Prof. Hilfinger CS164 Lecture 8 5
The Structure of a Compiler
Source Tokens
Interm.Language
Lexicalanalysis
Parsing
CodeGen.
MachineCode
Today westart
Optimization
2/8/2008 Prof. Hilfinger CS164 Lecture 8 6
The Functionality of the Parser
• Input: sequence of tokens from lexer
• Output: abstract syntax tree of the program
2/8/2008 Prof. Hilfinger CS164 Lecture 8 7
Example
• Pyth: if x == y: z =1 else: z = 2
• Parser input: IF ID == ID : ID = INT ↵ ELSE : ID = INT ↵
• Parser output (abstract syntax tree):
IF-THEN-ELSE
== = =
ID ID ID ID INTINT
2/8/2008 Prof. Hilfinger CS164 Lecture 8 8
Why A Tree?
• Each stage of the compiler has two purposes:– Detect and filter out some class of errors– Compute some new information or translate the
representation of the program to make thingseasier for later stages
• Recursive structure of tree suits recursivestructure of language definition
• With tree, later stages can easily find “theelse clause”, e.g., rather than having to scanthrough tokens to find it.
2/8/2008 Prof. Hilfinger CS164 Lecture 8 9
Comparison with Lexical Analysis
Syntax treeSequence oftokens
Parser
Sequence oftokens
Sequence ofcharacters
Lexer
OutputInputPhase
2/8/2008 Prof. Hilfinger CS164 Lecture 8 10
The Role of the Parser
• Not all sequences of tokens are programs . . .• . . . Parser must distinguish between valid and
invalid sequences of tokens
• We need– A language for describing valid sequences of tokens– A method for distinguishing valid from invalid
sequences of tokens
2/8/2008 Prof. Hilfinger CS164 Lecture 8 11
Programming Language Structure
• Programming languages have recursive structure• Consider the language of arithmetic expressions
with integers, +, *, and ( )• An expression is either:
– an integer– an expression followed by “+” followed by expression– an expression followed by “*” followed by expression– a ‘(‘ followed by an expression followed by ‘)’
• int , int + int , ( int + int) * int are expressions
2/8/2008 Prof. Hilfinger CS164 Lecture 8 12
Notation for Programming Languages
• An alternative notation: E → int E → E + E E → E * E E → ( E )
• We can view these rules as rewrite rules– We start with E and replace occurrences of E with
some right-hand side• E → E * E → ( E ) * E → ( E + E ) * E → … → (int + int) * int
2/8/2008 Prof. Hilfinger CS164 Lecture 8 13
Observation
• All arithmetic expressions can be obtained bya sequence of replacements
• Any sequence of replacements forms a validarithmetic expression
• This means that we cannot obtain ( int ) )by any sequence of replacements. Why?
• This set of rules is a context-free grammar
2/8/2008 Prof. Hilfinger CS164 Lecture 8 14
Context-Free Grammars
• A CFG consists of– A set of non-terminals N
• By convention, written with capital letter in these notes– A set of terminals T
• By convention, either lower case names or punctuation– A start symbol S (a non-terminal)– A set of productions
• Assuming E ∈ N E → ε , or E → Y1 Y2 ... Yn where Yi ∈ N ∪ T
2/8/2008 Prof. Hilfinger CS164 Lecture 8 15
Examples of CFGs
Simple arithmetic expressions: E → int E → E + E E → E * E E → ( E )– One non-terminal: E– Several terminals: int, +, *, (, )
• Called terminals because they are never replaced– By convention the non-terminal for the first
production is the start one
2/8/2008 Prof. Hilfinger CS164 Lecture 8 16
The Language of a CFG
Read productions as replacement rules:
X → Y1 ... YnMeans X can be replaced by Y1 ... Yn
X → εMeans X can be erased (replaced with empty string)
2/8/2008 Prof. Hilfinger CS164 Lecture 8 17
Key Idea
1. Begin with a string consisting of the startsymbol “S”
2. Replace any non-terminal X in the string by aright-hand side of some production X → Y1 … Yn
3. Repeat (2) until there are only terminals inthe string
4. The successive strings created in this wayare called sentential forms.
2/8/2008 Prof. Hilfinger CS164 Lecture 8 18
The Language of a CFG (Cont.)
More formally, may write
X1 … Xi-1 Xi Xi+1… Xn → X1 … Xi-1 Y1 … Ym Xi+1 … Xn
if there is a production
Xi → Y1 … Ym
2/8/2008 Prof. Hilfinger CS164 Lecture 8 19
The Language of a CFG (Cont.)
Write X1 … Xn →* Y1 … Ym
if X1 … Xn → … → … → Y1 … Ym
in 0 or more steps
2/8/2008 Prof. Hilfinger CS164 Lecture 8 20
The Language of a CFG
Let G be a context-free grammar with startsymbol S. Then the language of G is:
L(G) = { a1 … an | S →* a1 … an and every ai
is a terminal }
2/8/2008 Prof. Hilfinger CS164 Lecture 8 21
Examples:
• S → 0 also written as S → 0 | 1 S → 1
Generates the language { “0”, “1” }• What about S → 1 A A → 0 | 1• What about S → 1 A A → 0 | 1 A• What about S → ε | ( S )
2/8/2008 Prof. Hilfinger CS164 Lecture 8 22
Pyth Example
A fragment of Pyth:
Compound → while Expr: Block | if Expr: Block ElsesElses → ε | else: Block | elif Expr: Block Elses Block → Stmt_List | Suite
(Formal language papers use one-character non-terminals, but we don’t have to!)
2/8/2008 Prof. Hilfinger CS164 Lecture 8 23
Derivations and Parse Trees
• A derivation is a sequence of sentential formsresulting from the application of a sequence ofproductions
S → … → …
• A derivation can be represented as a parse tree– Start symbol is the tree’s root– For a production X → Y1 … Yn add children Y1, …, Yn to node X
2/8/2008 Prof. Hilfinger CS164 Lecture 8 24
Derivation Example
• Grammar E → E + E | E * E | (E) | int• String int * int + int
2/8/2008 Prof. Hilfinger CS164 Lecture 8 25
Derivation Example (Cont.)
E→ E + E→ E * E + E→ int * E + E→ int * int + E→ int * int + int
2/8/2008 Prof. Hilfinger CS164 Lecture 8 26
Derivation in Detail (1)
E E
2/8/2008 Prof. Hilfinger CS164 Lecture 8 27
Derivation in Detail (2)
E
E E+
E→ E + E
2/8/2008 Prof. Hilfinger CS164 Lecture 8 28
Derivation in Detail (3)
E
E
E E
E+
*
E→ E + E→ E * E + E
2/8/2008 Prof. Hilfinger CS164 Lecture 8 29
Derivation in Detail (4)
E
E
E E
E+
*
int
E→ E + E→ E * E + E→ int * E + E
2/8/2008 Prof. Hilfinger CS164 Lecture 8 30
Derivation in Detail (5)
E
E
E E
E+
*
intint
E→ E + E→ E * E + E→ int * E + E→ int * int + E
2/8/2008 Prof. Hilfinger CS164 Lecture 8 31
Derivation in Detail (6)
E
E
E E
E+
int*
intint
E→ E + E→ E * E + E→ int * E + E→ int * int + E→ int * int + int
2/8/2008 Prof. Hilfinger CS164 Lecture 8 32
Notes on Derivations
• A parse tree has– Terminals at the leaves– Non-terminals at the interior nodes
• A left-right traversal of the leaves is theoriginal input
• The parse tree shows the association ofoperations, the input string does not !– There may be multiple ways to match the input– Derivations (and parse trees) choose one
2/8/2008 Prof. Hilfinger CS164 Lecture 8 33
The Payoff: parser as a translator
syntax-directed translation
parser
syntax + translation rules(typically hardcoded in the parser)
stream of tokens
ASTs, orassembly code
2/8/2008 Prof. Hilfinger CS164 Lecture 8 34
Mechanism of syntax-directed translation
• syntax-directed translation is done byextending the CFG– a translation rule is defined for each production
givenX d A B c
the translation of X is defined recursively using• translation of nonterminals A, B• values of attributes of terminals d, c• constants
2/8/2008 Prof. Hilfinger CS164 Lecture 8 35
To translate an input string:
1. Build the parse tree.2. Working bottom-up
• Use the translation rules to compute the translation of eachnonterminal in the tree
Result: the translation of the string is the translation of the parsetree's root nonterminal.
Why bottom up?• a nonterminal's value may depend on the value of the symbols
on the right-hand side,• so translate a non-terminal node only after children
translations are available.
2/8/2008 Prof. Hilfinger CS164 Lecture 8 36
Example 1: Arithmetic expression to value
Syntax-directed translation rules:
E E + T E1.trans = E2.trans + T.trans
E T E.trans = T.transT T * F T1.trans = T2.trans * F.transT F T.trans = F.transF int F.trans = int.valueF ( E ) F.trans = E.trans
2/8/2008 Prof. Hilfinger CS164 Lecture 8 37
Example 1: Bison/Yacc Notation
E : E + T { $$ = $1 + $3; }
T : T * F { $$ = $1 * $3; }
F : int { $$ = $1; }F : ‘(‘ E ‘) ‘ { $$ = $2; }
• KEY: $$ : Semantic value of left-hand side
$n : Semantic value of nth symbol onright-hand side
2/8/2008 Prof. Hilfinger CS164 Lecture 8 38
Example 1 (cont): Annotated Parse Tree
Input: 2 * (4 + 5)E (18)
T (18)
F (9)T (2)
F (2)E (9)
T (5)
F (5)
E (4)
T (4)
F (4)
*
)
+
(
int (2)
int (4)int (5)
2/8/2008 Prof. Hilfinger CS164 Lecture 8 39
Example 2: Compute the type of an expression
E -> E + E if $1 == INT and $3 == INT: $$ = INTelse: $$ = ERROR
E -> E and E if $1 == BOOL and $3 == BOOL: $$ = BOOLelse: $$ = ERROR
E -> E == E if $1 == $3 and $2 != ERROR: $$ = BOOLelse: $$ = ERROR
E -> true $$ = BOOLE -> false $$ = BOOLE -> int $$ = INTE -> ( E ) $$ = $2
2/8/2008 Prof. Hilfinger CS164 Lecture 8 40
Example 2 (cont)
• Input: (2 + 2) == 4
E (INT)
E (BOOL)
E (INT)
int (INT)
==
E (INT) E (INT)+
int (INT)int (INT)
E (INT)( )
2/8/2008 Prof. Hilfinger CS164 Lecture 8 41
Building Abstract Syntax Trees
• Examples so far, streams of tokens translatedinto– integer values, or– types
• Translating into ASTs is not very different
2/8/2008 Prof. Hilfinger CS164 Lecture 8 42
AST vs. Parse Tree
• AST is condensed form of a parse tree– operators appear at internal nodes, not at leaves.– "Chains" of single productions are collapsed.– Lists are "flattened".– Syntactic details are omitted
• e.g., parentheses, commas, semi-colons
• AST is a better structure for later compilerstages– omits details having to do with the source language,– only contains information about the essential
structure of the program.
2/8/2008 Prof. Hilfinger CS164 Lecture 8 43
Example: 2 * (4 + 5) Parse tree vs. AST
E
int (2)
*
+2
54
T
FT
F E
TF
ETF
*
)
+
(
int (5)
int (4)
2/8/2008 Prof. Hilfinger CS164 Lecture 8 44
AST-building translation rules
E E + T $$ = new PlusNode($1, $3)
E T $$ = $1
T T * F $$ = new TimesNode($1, $3)
T F $$ = $1
F int $$ = new IntLitNode($1)
F ( E ) $$ = $2
2/8/2008 Prof. Hilfinger CS164 Lecture 8 45
Example: 2 * (4 + 5): Steps in Creating AST
E
int (2)
T
FT
F E
TF
ETF
*
)
+
(
int (5)
int (4)
+54
*2
+54
2 (Only some of thesemantic values areshown)
2/8/2008 Prof. Hilfinger CS164 Lecture 8 46
Leftmost and Rightmost Derivations
E→ E + E→ E * E + E→ int * E + E→ int * int + E→ int * int + intLeftmost derivation:
always act on leftmostnon-terminal
E→ E + E→ E + int→ E * E + int→ E * int + int→ int * int + intRightmost derivation:always act on rightmostnon-terminal
2/8/2008 Prof. Hilfinger CS164 Lecture 8 47
rightmost Derivation in Detail (1)
E E
2/8/2008 Prof. Hilfinger CS164 Lecture 8 48
rightmost Derivation in Detail (2)
E
E E+
E→ E + E
2/8/2008 Prof. Hilfinger CS164 Lecture 8 49
rightmost Derivation in Detail (3)
E
E E+
int
E→ E + E→ E + int
2/8/2008 Prof. Hilfinger CS164 Lecture 8 50
rightmost Derivation in Detail (4)
E
E
E E
E+
int*
E→ E + E→ E + int→ E * E + int
2/8/2008 Prof. Hilfinger CS164 Lecture 8 51
rightmost Derivation in Detail (5)
E
E
E E
E+
int*
int
E→ E + E→ E + int→ E * E + int→ E * int + int
2/8/2008 Prof. Hilfinger CS164 Lecture 8 52
rightmost Derivation in Detail (6)
E
E
E E
E+
int*
intint
E→ E + E→ E + int→ E * E + int→ E * int + int→ int * int + int
2/8/2008 Prof. Hilfinger CS164 Lecture 8 53
Aside: Canonical Derivations
• Take a look at that last derivation in reverse.• The active part (red) tends to move left to
right.• We call this a reverse rightmost or canonical
derivation.• Comes up in bottom-up parsing. We’ll return
to it in a couple of lectures.
2/8/2008 Prof. Hilfinger CS164 Lecture 8 54
Derivations and Parse Trees
• For each parse tree there is exactly oneleftmost and one rightmost derivation
• The difference is the order in which branchesare added, not the structure of the tree.
2/8/2008 Prof. Hilfinger CS164 Lecture 8 55
Summary of Derivations
• We are not just interested in whether s ∈ L(G)• Also need derivation (or parse tree) and AST.• Parse trees slavishly reflect the grammar.• Abstract syntax trees abstract from the grammar,
cutting out detail that interferes with later stages.• A derivation defines a parse tree
– But one parse tree may have many derivations• Derivations drive translation (to ASTs, etc.)• Leftmost and rightmost derivations most important in
parser implementation