SYNTAX ANALYSIS
OVERVIEW
The role of Syntax Analyzer
Context Free grammars
Derivations, parse tree and Ambiguity
Eliminating Ambiguity
Elimination of Left Recursion
Left factoring
Top-Down parsing
THE ROLE OF THE PARSER
Main task:
read the string of tokens produced by the lexical analyzer and verify that the string of token names can be generated by the grammar of the source language.
Position of parser in compiler model:
[Figure: the lexical analyzer reads the source program and returns a token each time the parser issues a get-next-token request; the parser builds a parse tree, which the rest of the front end turns into an intermediate representation. Both phases consult the symbol table.]
Grammars offer significant advantages to both language designers and compiler writers.
A grammar gives a precise syntactic specification of a programming language.
A properly designed grammar imparts a structure to a PL that is useful for translation of source programs into correct object code and for the detection of errors.
Languages evolve over a period of time, acquiring new constructs and performing additional tasks; a grammar makes it easier to add such new constructs.
For certain classes of grammars, we can automatically construct an efficient parser that determines if a source program is well formed.
Three general types of parsers for grammars
Universal parsing methods
Cocke-Younger-Kasami algorithm
Earley’s algorithm
Too inefficient to use in production compilers
Parsers commonly used in compilers are either top-down or bottom-up, and scan the input one symbol at a time.
Most efficient top-down and bottom-up methods work only on subclasses of grammars.
LL and LR grammars are expressive enough to describe most syntactic constructs in PLs
SYNTACTIC ERROR HANDLING
Common programming errors can occur at different levels:
Lexical: misspelling an identifier, keyword, or operator
Syntactic: an arithmetic expression with unbalanced parentheses
Semantic: an operator applied to an incompatible operand (type mismatch)
Logical: an infinitely recursive call
Much of error detection and recovery can be handled by parsing methods.
Detection of semantic and logical errors at compile time is a difficult task.
The error handler in a parser has simple-to-state goals:
Report the presence of errors clearly and accurately.
Recover from each error quickly enough to detect subsequent errors.
Add minimal overhead to the processing of correct programs.
LL and LR parsing methods detect an error as soon as possible.
ERROR RECOVERY STRATEGIES
Panic mode: discard input symbols until a token from a designated set of synchronizing tokens is found.
Phrase level: perform local correction (e.g., replacement of characters) on the remaining input.
Error productions: augment the grammar with productions that generate common erroneous constructs.
Global correction: algorithms that choose a minimal sequence of changes to obtain a globally least-cost correction.
CONTEXT FREE GRAMMARS
PL constructs have an inherently recursive structure that can be defined by CFGs
A CFG consists of terminals, non-terminals, a start symbol, and productions.
Terminals: id, (, ), *, /, +, -
Non-terminals: expression, term, factor
Start symbol: expression
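The components listed above can be sketched in Python. The productions below are an assumption (the standard expression grammar implied by these terminals and non-terminals), since the slide's production list was not preserved:

```python
# A CFG represented as a mapping from each non-terminal to its list of
# productions; each production is a list of grammar symbols.
# Assumed productions: expression -> expression + term | expression - term | term
#                      term -> term * factor | term / factor | factor
#                      factor -> ( expression ) | id
grammar = {
    "expression": [["expression", "+", "term"],
                   ["expression", "-", "term"],
                   ["term"]],
    "term": [["term", "*", "factor"],
             ["term", "/", "factor"],
             ["factor"]],
    "factor": [["(", "expression", ")"], ["id"]],
}

start_symbol = "expression"
nonterminals = set(grammar)
# Every symbol that appears in a body but has no productions is a terminal.
terminals = {sym for prods in grammar.values()
             for prod in prods for sym in prod} - nonterminals
```

Deriving the terminal set from the productions, rather than listing it separately, keeps the two from drifting apart.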
DERIVATIONS
A derivation corresponds to the construction of a parse tree: starting from the start symbol, a non-terminal is repeatedly replaced using one of its production rules.
Any string of grammar symbols is a sentential form if it is derivable from the start symbol.
A sentence in a grammar is a sentential form with
no non-terminals.
Left most derivation
the leftmost non-terminal is replaced in the sentential
form
Right most derivation:
the rightmost non-terminal is replaced in the sentential
form
PARSE TREE
A parse tree is a graphical representation of a derivation that filters out the order in which productions are applied to replace non-terminals.
Yield of a parse tree:
the leaves of the parse tree, read from left to right, constitute a sentential form.
AMBIGUITY
A grammar that produces more than one parse tree for some sentence is said to be ambiguous.
An ambiguous grammar is one that produces more than one leftmost or rightmost derivation for the same sentence.
ELIMINATING AMBIGUITY
Consider the following grammar of expressions consisting of plus, minus, and digits:
E → E + E | E - E | 0 | 1 | … | 9
When an operand such as 5 has operators to its left and right, conventions are needed for deciding which operator applies to that operand.
ASSOCIATIVITY OF OPERATORS
In PLs, the operators +, -, *, / are left-associative.
An operand with plus signs on both sides of it
belongs to the operator to its left.
Exponentiation and assignment are right-
associative.
PRECEDENCE OF OPERATORS
Rules defining the relative precedence of operators are needed when more than one kind of operator is present.
*, / have higher precedence than +, -.
9+5*2 is interpreted as 9+(5*2) and not (9+5)*2.
Consider the following "dangling-else" grammar:
stmt → if expr then stmt
     | if expr then stmt else stmt
     | other
According to this grammar, the compound statement
if E1 then S1 else if E2 then S2 else S3
has a unique parse tree. The grammar is ambiguous, however, since the string
if E1 then if E2 then S1 else S2
has two parse trees.
The dangling-else grammar can be rewritten into an unambiguous grammar by matching each else with the closest unmatched then:
stmt → matched_stmt | open_stmt
matched_stmt → if expr then matched_stmt else matched_stmt | other
open_stmt → if expr then stmt | if expr then matched_stmt else open_stmt
ELIMINATION OF LEFT RECURSION
A grammar is left-recursive if it has a non-terminal A such that there is a derivation A ⇒+ Aα for some string α.
Immediate left recursion is a production of the form A → Aα.
The left-recursive productions
A → Aα | β
can be replaced by the non-left-recursive productions
A → βA'
A' → αA' | ε
without changing the strings derivable from A.
In general, immediate left recursion among any number of A-productions
A → Aα1 | Aα2 | … | Aαm | β1 | β2 | … | βn
can be eliminated as
A → β1A' | β2A' | … | βnA'
A' → α1A' | α2A' | … | αmA' | ε
The non-left-recursive grammar for the expression grammar
E → E + T | T    T → T * F | F    F → (E) | id
is
E → TE'    E' → +TE' | ε
T → FT'    T' → *FT' | ε
F → (E) | id
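The transformation for immediate left recursion can be sketched in code. This is a minimal sketch; the `"eps"` marker and the primed-name convention are illustrative choices:

```python
def eliminate_immediate_left_recursion(nt, productions):
    """Split the nt-productions into left-recursive ones (A -> A alpha)
    and the rest (A -> beta), then rewrite as
    A -> beta A'   and   A' -> alpha A' | eps."""
    recursive = [p[1:] for p in productions if p[:1] == [nt]]  # the alphas
    others = [p for p in productions if p[:1] != [nt]]         # the betas
    if not recursive:
        return {nt: productions}          # nothing to eliminate
    new_nt = nt + "'"
    return {
        nt: [beta + [new_nt] for beta in others],
        new_nt: [alpha + [new_nt] for alpha in recursive] + [["eps"]],
    }

# E -> E + T | T   becomes   E -> T E'  and  E' -> + T E' | eps
g = eliminate_immediate_left_recursion("E", [["E", "+", "T"], ["T"]])
```

Applying the same call to the T-productions yields T → FT', T' → *FT' | ε in the same way.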
ALGORITHM TO ELIMINATE LEFT-RECURSION
Arrange the non-terminals in some order A1, A2, …, An. For each i, replace every production of the form Ai → Ajγ with j < i by substituting in the bodies of the Aj-productions, then eliminate the immediate left recursion among the Ai-productions.
The resulting non-left-recursive grammar may have ϵ-productions.
Example: consider the following grammar
S → Aa | b
A → Ac | Sd | ε
The algorithm yields the following non-left-recursive grammar
S → Aa | b
A → bdA' | A'
A' → cA' | adA' | ε
LEFT FACTORING
The left-factoring transformation is useful for producing a grammar suitable for predictive parsing.
If A → αβ1 | αβ2 are two A-productions whose bodies share a common prefix α, the left-factored productions for A become
A → αA'
A' → β1 | β2
ALGORITHM:
Example: Consider the grammar
Left factored grammar becomes
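A single left-factoring step can be sketched in code. The example grammar at the end, S → iEtS | iEtSeS | a, is the usual dangling-else abstraction and is an assumption here, since the slide's own example was not preserved:

```python
def left_factor_step(nt, productions):
    """One step of left factoring: pull out the longest prefix shared
    by at least two alternatives of nt:
    A -> a b1 | a b2   =>   A -> a A'  and  A' -> b1 | b2."""
    prefix = []
    for i in range(len(productions)):
        for j in range(i + 1, len(productions)):
            a, b = productions[i], productions[j]
            k = 0
            while k < min(len(a), len(b)) and a[k] == b[k]:
                k += 1
            if k > len(prefix):
                prefix = a[:k]            # longest common prefix so far
    if not prefix:
        return {nt: productions}          # already left-factored
    new_nt = nt + "'"
    factored, rest = [], []
    for p in productions:
        if p[:len(prefix)] == prefix:
            factored.append(p[len(prefix):] or ["eps"])  # empty tail -> eps
        else:
            rest.append(p)
    return {nt: [prefix + [new_nt]] + rest, new_nt: factored}

# Hypothetical example: S -> iEtS | iEtSeS | a
g = left_factor_step("S", [["i", "E", "t", "S"],
                           ["i", "E", "t", "S", "e", "S"],
                           ["a"]])
```

The result is S → iEtSS' | a and S' → ε | eS, matching the general rule above with α = iEtS.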
NON-CONTEXT FREE LANGUAGE CONSTRUCTS
L1 = { wcw | w is in (a|b)* } is not context free
Since L1 is not context free, checking that identifiers are declared before their use, as required in languages like C and Java (which allow identifiers of arbitrary length), cannot be enforced by a context-free grammar alone; such checks are deferred to semantic analysis.
L2 = { anbmcndm | n >= 1 and m >= 1 } is not context free
Here, L2 abstracts the problem of checking that the number of formal parameters in the declaration of a function agrees with the number of actual parameters in a use of the function.
TOP-DOWN PARSING
Finding leftmost derivation for an input string.
The key problem is to determine the production to be
applied for a non-terminal.
Backtracking may be involved in finding the correct production for a non-terminal.
For most PL constructs, backtracking can be avoided by applying suitable transformations to the grammar.
RECURSIVE DESCENT PARSING
A recursive-descent parser finds a leftmost derivation for the input string, constructing the parse tree from the root and creating its nodes in preorder.
Recursive-descent parsing may involve backtracking. Consider the grammar
S → cAd
A → ab | a
The derivation for the string w = cad involves backtracking.
[Figure: three snapshots of the parse tree for w = cad. The parser expands S → cAd, first tries A → ab; matching b against the input d fails, so it backtracks and succeeds with A → a.]
Recursive-descent parsing consists of a set of procedures, one for each non-terminal.
Issues:
May require backtracking, i.e., repeated scans over the input.
A left-recursive grammar can cause the parser to go into an infinite loop.
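The grammar S → cAd, A → ab | a from above can be handled by a backtracking recursive-descent parser. A minimal sketch, using one procedure per non-terminal and generators to enumerate the alternatives (so that a failed choice can be retried):

```python
def parse_S(s, pos):
    """S -> c A d: yield every input position reachable after matching S."""
    if pos < len(s) and s[pos] == "c":
        for after_A in parse_A(s, pos + 1):
            if after_A < len(s) and s[after_A] == "d":
                yield after_A + 1

def parse_A(s, pos):
    """A -> ab | a: yielding both possibilities, in order, is what
    lets the caller backtrack when the first choice fails."""
    if s[pos:pos + 2] == "ab":
        yield pos + 2
    if s[pos:pos + 1] == "a":
        yield pos + 1

def accepts(s):
    # The string is in the language iff some parse of S consumes it all.
    return any(end == len(s) for end in parse_S(s, 0))
```

For w = cad, the alternative A → ab fails (b does not match d), and the generator then supplies A → a, which succeeds, mirroring the backtracking step in the figure above.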
FIRST AND FOLLOW SETS
If α is any string of grammar symbols,
FIRST(α) is the set of terminals that begin the strings derived
from α.
For any non-terminal A
FOLLOW(A) is the set of terminals that can appear
immediately to the right of A in some sentential form.
FIRST()
To compute FIRST(X) for all grammar symbols X, apply the following rules until no more terminals or ϵ can be added to any FIRST(X):
1. If X is a terminal, then FIRST(X) = {X}.
2. If X → ϵ is a production, add ϵ to FIRST(X).
3. If X → Y1Y2…Yk is a production, add to FIRST(X) every symbol of FIRST(Y1) other than ϵ; if ϵ is in FIRST(Y1), also add the non-ϵ symbols of FIRST(Y2), and so on; if ϵ is in all of FIRST(Y1), …, FIRST(Yk), add ϵ to FIRST(X).
FOLLOW()
To compute FOLLOW(A) for all non-terminals A, apply the following rules until nothing can be added to any FOLLOW set:
1. Place $ in FOLLOW(S), where S is the start symbol and $ is the input end marker.
2. If there is a production A → αBβ, add everything in FIRST(β) except ϵ to FOLLOW(B).
3. If there is a production A → αB, or a production A → αBβ where FIRST(β) contains ϵ, add everything in FOLLOW(A) to FOLLOW(B).
EXAMPLE:
In the expression grammar
E → TE'    E' → +TE' | ϵ
T → FT'    T' → *FT' | ϵ
F → (E) | id
The FIRST sets are
FIRST(F) = FIRST(T)=FIRST(E)={(, id}
FIRST(E’) ={+, ϵ}
FIRST(T’) = {*, ϵ}
The FOLLOW sets are
FOLLOW(E) = FOLLOW(E’) = {), $}
FOLLOW(T) = FOLLOW(T’) = {+, ), $}
FOLLOW(F) = {+, *, ), $}
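The rules above are naturally computed as a fixed point: keep applying them until no set grows. A sketch for the non-left-recursive expression grammar, whose results agree with the FIRST and FOLLOW sets listed above:

```python
EPS = "eps"
grammar = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], [EPS]],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], [EPS]],
    "F":  [["(", "E", ")"], ["id"]],
}
NT = set(grammar)

def first_of(sym, first):
    # FIRST of one symbol: the symbol itself if terminal, its set otherwise.
    return first[sym] if sym in NT else {sym}

def compute_first():
    first = {A: set() for A in NT}
    changed = True
    while changed:
        changed = False
        for A, prods in grammar.items():
            for prod in prods:
                acc = set()
                for sym in prod:
                    if sym == EPS:
                        acc.add(EPS)
                        break
                    f = first_of(sym, first)
                    acc |= f - {EPS}
                    if EPS not in f:
                        break
                else:
                    acc.add(EPS)          # every symbol in the body is nullable
                if not acc <= first[A]:
                    first[A] |= acc
                    changed = True
    return first

def compute_follow(first):
    follow = {A: set() for A in NT}
    follow["E"].add("$")                  # rule 1: $ into FOLLOW(start)
    changed = True
    while changed:
        changed = False
        for A, prods in grammar.items():
            for prod in prods:
                for i, B in enumerate(prod):
                    if B not in NT:
                        continue
                    add, nullable = set(), True
                    for sym in prod[i + 1:]:
                        f = first_of(sym, first)
                        add |= f - {EPS}  # rule 2: FIRST of what follows B
                        if EPS not in f:
                            nullable = False
                            break
                    if nullable:
                        add |= follow[A]  # rule 3: B can end an A-derivation
                    if not add <= follow[B]:
                        follow[B] |= add
                        changed = True
    return follow

FIRST = compute_first()
FOLLOW = compute_follow(FIRST)
```

Running both computations reproduces FIRST(E') = {+, ϵ}, FOLLOW(F) = {+, *, ), $}, and the rest of the sets shown above.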
PREDICTIVE PARSING
The goal of predictive parsing is to construct a top-down parser that never backtracks.
Obtain a grammar that can be parsed by a recursive-descent parser that needs no backtracking.
Given an input symbol a and a non-terminal A to be expanded, we must be able to choose the unique alternative among the A-productions that derives a string beginning with a.
To do so, we apply the following transformations to the grammar:
Eliminate left recursion
Perform left factoring
LL(1) GRAMMARS
Class of LL(1) grammars: Left-to-right scan, Leftmost derivation, 1 symbol of lookahead.
LL(1) grammars aid in automatic construction of predictive parser.
A grammar G is LL(1) if and only if whenever A → α | β are two distinct productions in G, the following conditions hold:
FIRST(α) and FIRST(β) are disjoint sets.
If ϵ is in FIRST(β), then FIRST(α) and FOLLOW(A) are disjoint sets, and likewise if ϵ is in FIRST(α).
CONSTRUCTION OF A PREDICTIVE PARSING TABLE
For each production A → α of the grammar:
1. For each terminal a in FIRST(α), add A → α to M[A, a].
2. If ϵ is in FIRST(α), add A → α to M[A, b] for every terminal b in FOLLOW(A); if ϵ is in FIRST(α) and $ is in FOLLOW(A), add A → α to M[A, $] as well.
Every undefined entry of M is an error entry.
EXAMPLE: PREDICTIVE PARSING TABLE FOR
EXPRESSION GRAMMAR
For every LL(1) grammar, each parsing-table entry
uniquely identifies a production or signals an error.
For some grammars, the table may have entries that are multiply defined.
If a grammar is left-recursive or ambiguous, then at least one entry of the table is multiply defined.
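Table construction can be sketched directly from the FIRST and FOLLOW sets of the expression-grammar example (hardcoded below from the sets given earlier), recording any multiply defined entry as a conflict:

```python
EPS, END = "eps", "$"
# FIRST and FOLLOW sets copied from the expression-grammar example above.
FIRST = {"E": {"(", "id"}, "E'": {"+", EPS}, "T": {"(", "id"},
         "T'": {"*", EPS}, "F": {"(", "id"}}
FOLLOW = {"E": {")", END}, "E'": {")", END}, "T": {"+", ")", END},
          "T'": {"+", ")", END}, "F": {"+", "*", ")", END}}
grammar = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], [EPS]],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], [EPS]],
    "F":  [["(", "E", ")"], ["id"]],
}

def first_of_string(prod):
    # FIRST of a whole production body, per rule 3 of the FIRST computation.
    out = set()
    for sym in prod:
        f = {EPS} if sym == EPS else FIRST.get(sym, {sym})
        out |= f - {EPS}
        if EPS not in f:
            return out
    out.add(EPS)
    return out

def build_table():
    table, conflicts = {}, []
    for A, prods in grammar.items():
        for prod in prods:
            f = first_of_string(prod)
            lookaheads = f - {EPS}
            if EPS in f:
                lookaheads |= FOLLOW[A]   # includes $ when $ is in FOLLOW(A)
            for a in lookaheads:
                if (A, a) in table:
                    conflicts.append((A, a))   # multiply defined entry
                table[(A, a)] = prod
    return table, conflicts

TABLE, CONFLICTS = build_table()
```

For this LL(1) grammar the conflict list comes back empty; running the same construction on a left-recursive or ambiguous grammar would populate it.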
Although left-recursion elimination and left-factoring
are easy to do, there are some grammars for which
no amount of alteration will produce an LL(1)
grammar.
EXAMPLE: PREDICTIVE PARSING TABLE FOR
DANGLING-ELSE PROBLEM
The predictive parsing table for the left-factored dangling-else grammar contains a multiply defined entry (for the else-part non-terminal on input else), reflecting the underlying ambiguity.
TABLE-DRIVEN PREDICTIVE PARSING
PREDICTIVE PARSING ALGORITHM
A table-driven predictive parser keeps a stack of grammar symbols, initially the start symbol on top of $. At each step it compares the top-of-stack symbol X with the current input symbol a: if X is a terminal, it must match a, and both are consumed; if X is a non-terminal, the parser replaces X by the body of the production in M[X, a], or reports an error if that entry is empty.
EXAMPLE:
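A sketch of the table-driven parsing loop. The table entries below are the usual ones for the non-left-recursive expression grammar and are an assumption, since the slide's table was not preserved ([] denotes the ϵ-production):

```python
END = "$"
# Assumed predictive parsing table for:
# E -> TE'   E' -> +TE' | eps   T -> FT'   T' -> *FT' | eps   F -> (E) | id
TABLE = {
    ("E", "id"): ["T", "E'"], ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"], ("E'", ")"): [], ("E'", END): [],
    ("T", "id"): ["F", "T'"], ("T", "("): ["F", "T'"],
    ("T'", "+"): [], ("T'", "*"): ["*", "F", "T'"],
    ("T'", ")"): [], ("T'", END): [],
    ("F", "id"): ["id"], ("F", "("): ["(", "E", ")"],
}
NONTERMINALS = {"E", "E'", "T", "T'", "F"}

def parse(tokens):
    """Table-driven predictive parsing: pop X from the stack; a terminal
    must match the lookahead, a non-terminal is expanded via TABLE."""
    stack = [END, "E"]           # start symbol on top of $
    tokens = tokens + [END]
    i = 0
    while stack:
        X = stack.pop()
        a = tokens[i]
        if X not in NONTERMINALS:           # terminal or $
            if X != a:
                return False
            i += 1
        else:
            rhs = TABLE.get((X, a))
            if rhs is None:
                return False                # blank table entry: error
            stack.extend(reversed(rhs))     # push body, leftmost on top
    return i == len(tokens)
```

On input id + id * id the loop performs exactly the leftmost derivation E ⇒ TE' ⇒ FT'E' ⇒ id T'E' ⇒ …, one table lookup per expansion and no backtracking.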