
Syntax Analysis

• Programming Language Syntax: Syntax Specifications
• Stages in Translation: Processing Programs, Syntax Analysis, Semantic Analysis, Lexical Analyzer, Code Generation
• Regular Expressions
• Finite Automata
• Grammar Types: Unrestricted, Context-Free, Context-Sensitive, Regular, BNF, EBNF
• Derivation: Parse Tree
• Grammar Issues: Ambiguous Grammars, Grammar Transformations, Syntax Diagram
• Recursive Descent Process, Shift-Reduce Parsing
• Concrete and Abstract Syntax
• LL Grammar
• LR Grammar: SLR, LALR
• Programming the Scanner and Parser

Coverage

• Syntax defines the structure of the language.
• Syntax helps in:
  − Language design and language comprehension
  − Implementing the compiler, the software specification, and the language system as a whole
  − Verifying program correctness
• Definitions
  − Constructs: Strings that belong to the language
  − Syntax: The form or structure of the expressions, statements, and program units
  − Semantics: The meaning of the expressions, statements, and program units, that is, what happens when a program segment is executed
  − Pragmatics: Tools provided by the translator to help in debugging and interacting with the operating system

Programming Language Syntax

• Lexeme: The lowest-level syntactic unit of a language (e.g., sum, begin)
• Token: A category of lexemes (e.g., identifiers)
• Any compiler needs recognizers for the syntax of the language.
• Notations of Expressions
  − Infix notation: the operator symbol appears between the operands
  − Prefix or Polish notation: the operator symbol appears before the operands
  − Postfix, Suffix, or Reverse Polish notation: the operator symbol appears after the operands
  − Mixfix notation: operations that don't fit into the previous notations, like if-then-else


• Associativity in Expressions
  − Left-associative: Expressions with the same operator, or with operators of the same precedence, are grouped from left to right.
    • Example: +, -, * and /
  − Right-associative: Expressions with the same operator, or with operators of the same precedence, are grouped from right to left.
    • Example: assignment and exponentiation
• Expression Trees and their Evaluation
  − An expression is represented as a tree whose root indicates the result of the expression.
  − Traversing a tree can be done in many ways:
    • In-order traversal: All the nodes in the left subtree are visited first, then the root node is visited, and finally the nodes in the right subtree are visited.
    • Post-order traversal: All the nodes in the left and right subtrees are visited before the root node is visited.


• Expression Trees and their Evaluation
  − Traversing a tree can be done in many ways:
    • Pre-order traversal: The root node is visited first, and then the nodes of the left and right subtrees are visited.
    • Breadth-first traversal: Traversal proceeds level by level, visiting all nodes at one level before moving to the next level. It is also called level-order traversal.
    • Depth-first traversal: Traversal goes as deep as possible into one subtree before rising to the next subtree. Pre-order, in-order, and post-order traversals are all depth-first; the visiting order of a plain depth-first traversal matches pre-order.
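The traversals above can be sketched on a small expression tree. This is an illustrative sketch only: the `Node` class and the sample expression (2 + 3) * 4 are assumptions, not part of the slides. Pre-order yields prefix notation, in-order yields infix (without parentheses), and post-order yields postfix.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical expression-tree node: an operator or an operand."""
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def pre_order(n):
    # Root first, then left and right subtrees (prefix notation).
    return [] if n is None else [n.value] + pre_order(n.left) + pre_order(n.right)


def in_order(n):
    # Left subtree, root, right subtree (infix notation, no parentheses).
    return [] if n is None else in_order(n.left) + [n.value] + in_order(n.right)


def post_order(n):
    # Both subtrees before the root (postfix notation).
    return [] if n is None else post_order(n.left) + post_order(n.right) + [n.value]


def level_order(root):
    # Breadth-first: finish one level before moving to the next.
    out = []
    queue = deque([root] if root is not None else [])
    while queue:
        n = queue.popleft()
        out.append(n.value)
        queue.extend(c for c in (n.left, n.right) if c is not None)
    return out


# Sample tree for (2 + 3) * 4
tree = Node("*", Node("+", Node("2"), Node("3")), Node("4"))
print(" ".join(pre_order(tree)))    # * + 2 3 4
print(" ".join(in_order(tree)))     # 2 + 3 * 4
print(" ".join(post_order(tree)))   # 2 3 + 4 *
print(" ".join(level_order(tree)))  # * + 4 2 3
```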


• Evaluation of Expressions
  − Applicative Order Evaluation (strict or eager evaluation): Evaluation is bottom-up, which means the processing starts from the leaves and moves towards the root.
  − Normal Order Evaluation: An expression is evaluated only when it is needed in the computation of the result.
    • Example: the call Addition(5+2), where Addition(Y) {int Y; Y = Y + 2;}
    • Here, Y is replaced with 5+2 instead of the addition being done first.
  − Lazy Evaluation (delayed evaluation): Evaluation is postponed until the value is really needed.
    • Frequently used in functional languages.
  − Block Order Evaluation: The evaluation of an expression that contains a declaration.
    • Example: In Pascal, a function can contain a block expression that includes a variable declaration.


• Evaluation of Expressions
  − Short Circuit Evaluation: When evaluating Boolean (logical) expressions, the result can sometimes be obtained by evaluating only part of the expression.
    • AND (X AND Y): If both X and Y are "1", then the result is "1". Otherwise, the result is "0". If X is "0", the result is "0" and Y need not be evaluated.
    • OR (X OR Y): If either or both X and Y are "1", then the result is "1". Otherwise, the result is "0". If X is "1", the result is "1" and Y need not be evaluated.
    • XOR (X XOR Y): If only one of them (X or Y) is "1", then the result is "1". Otherwise, the result is "0". Both operands must always be evaluated.
    • NOT (X): If X is "1", then the result is "0". If X is "0", then the result is "1".
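A small sketch of short-circuit evaluation, using Python's `and` and `or`, which short-circuit. The helper `operand` and the `calls` log are hypothetical names introduced only to record which operands actually get evaluated.

```python
calls = []


def operand(name, value):
    """Hypothetical helper: record that an operand was evaluated, then return it."""
    calls.append(name)
    return value


# AND: X is false, so Y is never evaluated.
calls.clear()
and_result = operand("X", False) and operand("Y", True)
and_calls = list(calls)

# OR: X is true, so Y is never evaluated.
calls.clear()
or_result = operand("X", True) or operand("Y", False)
or_calls = list(calls)

# XOR (written as != on booleans): both operands are always evaluated.
calls.clear()
xor_result = operand("X", True) != operand("Y", True)
xor_calls = list(calls)

print(and_result, and_calls)  # False ['X']
print(or_result, or_calls)    # True ['X']
print(xor_result, xor_calls)  # False ['X', 'Y']
```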


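Compilation Process

[Figure: Compilation process. Stages shown: source program, scanner, parser, semantic analysis, intermediate code generation, code generation, optimization (optional). Intermediate products shown: tokens, parse tree, abstract syntax tree, intermediate code, machine code. The scanner and parser are grouped as syntax analysis, and a symbol table is also shown.]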

• Syntax analysis has a low-level part and a high-level part.
• Low-level part (scanner or lexical analyzer):
  − Mostly done using finite automata
  − Input symbols are scanned and grouped into meaningful units called tokens.
  − Tokens are formed by the principle of longest substring (maximum match), using a lookahead pointer.
• High-level part (parser or syntax analyzer):
  − Done using Backus-Naur Form (BNF) or context-free grammar
  − Tokens are grouped into syntactic units like expressions, statements and declarations, and checked for whether they conform to the grammatical rules of the language.
• Identification of reserved words: Use a lookup table (symbol table)
• if statement: "if" "(" "y" "<" "5" ")" …
• y is a variable, < is an operator, …
• Tokens are classified as keywords, operators, identifiers, literals, etc.
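The scanner behaviour described above (longest match, plus a lookup table for reserved words) can be sketched as follows. This is a minimal sketch under stated assumptions: the token classes, the `KEYWORDS` set, and the function name `scan` are hypothetical and not taken from the slides. Python's regex alternation picks the first alternative that matches rather than the longest, so the two-character operators are listed before the one-character ones to get maximum match.

```python
import re

# Hypothetical reserved-word lookup table.
KEYWORDS = {"if", "else", "while"}

# Hypothetical token classes; longer operators come first so "<=" wins over "<".
TOKEN_SPEC = [
    ("NUMBER", r"[0-9]+"),
    ("IDENT", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("OP", r"<=|>=|==|!=|[<>=+\-*/]"),
    ("PUNCT", r"[(){};]"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def scan(text):
    """Group input characters into (token class, lexeme) pairs."""
    tokens, pos = [], 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at position {pos}")
        kind, lexeme = match.lastgroup, match.group()
        pos = match.end()
        if kind == "SKIP":
            continue  # whitespace separates tokens but is not a token
        if kind == "IDENT" and lexeme in KEYWORDS:
            kind = "KEYWORD"  # reserved words are identified by table lookup
        tokens.append((kind, lexeme))
    return tokens


print(scan("if (y < 5) ifx <= 10"))
```

Note that `ifx` is scanned as one identifier, not as the keyword `if` followed by `x`, because the identifier pattern consumes the longest possible lexeme before the keyword table is consulted.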

Compilation Process

• Parser
  − The parser should find all syntax errors and produce the parse tree.
  − Parsing algorithms:
    • Top-down: recursive descent (a coded implementation) and LL parser (a table-driven implementation)
    • Bottom-up: LR parser
• Why separate the syntax analysis into scanner and parser?
  − Simplicity: Separating them makes the parser simpler.
  − Efficiency: The separation makes it possible to optimize the lexical analyzer.
  − Portability: Even though parts of the lexical analyzer might not be portable, the parser can always be made portable.

Compilation Process

• Semantic analysis (contextual analysis) is required to make sure that the data types match.
• Semantic analysis works in synchronization with the syntax analysis.
• Contextual analysis is used to answer the following:
  − Has the variable been declared earlier?
  − Does the declared type match the way the variable is used?
  − Has the variable been initialized in advance?
  − Is the reference to the array within the bounds of the array?
  − …
• Code generation
  − Converting the program into executable machine code
  − Stages: intermediate code generation and code generation

Compilation Process

• A regular expression is used to represent the information required by the lexical analyzer.
• Regular Expression Definitions: The rules of a language L(E), defined over the alphabet of the language, are expressed using a regular expression E.
  − Alternation: If a and b are regular expressions, then (a+b) is also a regular expression.
  − Concatenation (or Sequencing): If a and b are regular expressions, then (a.b) is also a regular expression.
  − Kleene Closure: If a is a regular expression, then a* means zero or more repetitions of a.
  − Positive Closure: If a is a regular expression, then a⁺ means one or more repetitions of a.
  − Empty: The empty expression denotes a language with no strings.
  − Atom: An atom denotes a language with only one string.

Regular Expressions


• Regular expression to match integers and floating-point numbers
  − To match a digit: [0-9]
  − To match one or more digits: [0-9]+
  − To support both signed and unsigned integers: -?[0-9]+
    • -? indicates the presence or absence of a minus sign
  − Floating-point representation: optional digits before the dot and at least one digit after it
    • ([0-9]*\.[0-9]+)
  − Exponent part: the character "e", in lower or upper case
    • "e" is followed by an optional + or - sign, which is followed by an integer.
    • ([eE][-+]?[0-9]+)?
    • The question mark at the end indicates that the exponent part is not compulsory.
  − Complete expression: -?(([0-9]+)|([0-9]*\.[0-9]+)([eE][-+]?[0-9]+)?)
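The complete expression can be tried out with Python's `re` module. The helper name `is_number` is an assumption made for this sketch. Because concatenation binds tighter than alternation, the optional exponent belongs only to the floating-point branch, so an input such as `1e5` is rejected by this particular pattern.

```python
import re

# The pattern from the slide, written without the extra spaces.
NUMBER = re.compile(r"-?(([0-9]+)|([0-9]*\.[0-9]+)([eE][-+]?[0-9]+)?)")


def is_number(text):
    """Return True when the whole string matches the number pattern."""
    return NUMBER.fullmatch(text) is not None


for sample in ["42", "-7", "3.14", ".5", "-2.5e10", "6.02E+23", "1e5", "3.", "abc"]:
    print(f"{sample!r}: {is_number(sample)}")
```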

Regular Expressions

• Finite automata are computing devices that accept or recognize the language represented by a given regular expression.
• Finite Automata Definitions
  − Alphabet (Σ): An alphabet is a finite, non-empty set of symbols. Symbols are written as lower-case Latin letters and are treated as atoms that cannot be subdivided further. Ex. Σ = {a,b,c}
  − String or Word: A string is a sequence of symbols formed from a single alphabet.
    • Given the alphabet Σ = {a,b,c}, some of the strings that can be formed are: a, abc, aa, abcabcabc
  − Empty String (ε): The empty string is the string composed of zero symbols. It is a string over every alphabet, but it is not itself a symbol of the alphabet.
  − Size of a String: The size of a string is the number of symbols present in the string.
    • Size of the string ab: |ab| = 2
    • Size of the empty string: |ε| = 0
    • Size of the string b: |b| = 1

Finite Automata

• Finite Automata Definitions
  − Concatenation of Strings: Strings can be combined to form a new string.
    • S1 = abc and S2 = def: S1S2 = abcdef and S2S1 = defabc
    • Concatenating the empty string: S1ε = εS1 = abcε = εabc = abc = S1
    • The empty string is the identity element for string concatenation.
  − Languages (L): A language is a set of strings, possibly infinite, over a given alphabet. Σ = {a,b,c}, Language L = {aⁿbⁿcⁿ | n ≥ 0}
    • In this example, the numbers of a's, b's and c's are the same.
  − Power of an alphabet:
    • Written Σⁿ, the power of order n
    • The order n is the number of symbols in each string of the set: Σⁿ contains every string of length n over Σ.
    • For the alphabet Σ = {a,b,c}:
      − Σ⁰ = {ε}
      − Σ¹ = {a, b, c}
      − Σ² = {aa, bb, cc, ab, ba, ac, ca, bc, cb}
      − Σ³ = {aaa, bbb, ccc, aab, bba, aac, cca, …}

Finite Automata

• Finite Automata Definitions
  − Closure of an alphabet:
    • Reflexive-transitive closure (Kleene closure):
      − Strings of zero or more symbols.
      − Σ* = Σ⁰ ∪ Σ¹ ∪ Σ² ∪ Σ³ ∪ … = {ε, a, b, c, aa, bb, cc, ab, …}
    • Transitive closure (positive closure):
      − Strings of one or more symbols.
      − Σ⁺ = Σ¹ ∪ Σ² ∪ Σ³ ∪ … = {a, b, c, aa, bb, cc, ab, …}
    • Any language defined on the given alphabet is a subset of the reflexive-transitive closure of the alphabet.
      − For every language L over Σ, L ⊆ Σ*
  − Empty Language:
    • The empty language is one that has no strings in it.
    • L = {} is an empty language.
    • L = {ε} is not an empty language because it is made up of one string, the empty string.
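The power and closure definitions can be illustrated by enumerating strings up to a bounded length, since Σ* and Σ⁺ themselves are infinite. The function names `power`, `star_up_to` and `plus_up_to` are assumptions made for this sketch, and the empty string ε is represented by Python's `""`.

```python
from itertools import product

SIGMA = {"a", "b", "c"}


def power(sigma, n):
    """All strings of exactly n symbols over sigma (the n-th power of the alphabet)."""
    return {"".join(symbols) for symbols in product(sorted(sigma), repeat=n)}


def star_up_to(sigma, max_len):
    """Finite slice of the Kleene closure: strings of length 0..max_len."""
    result = set()
    for n in range(max_len + 1):
        result |= power(sigma, n)
    return result


def plus_up_to(sigma, max_len):
    """Finite slice of the positive closure: strings of length 1..max_len."""
    return star_up_to(sigma, max_len) - {""}


print(power(SIGMA, 0))               # {''}  (only the empty string)
print(sorted(power(SIGMA, 2)))       # the 9 strings of length 2
print(len(star_up_to(SIGMA, 2)))     # 13 = 1 + 3 + 9
print(len(plus_up_to(SIGMA, 2)))     # 12 = 3 + 9
```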

Finite Automata

• Finite Automata Representation
  − Circle: state; Arrow: transition; Double circle: final state
  − States are labeled with numbers
  − Arrows are labeled with a transition symbol or ε

Finite Automata

[Figure 2.2. NFA for ε]
[Figure 2.3. NFA for t]
[Figure 2.4. NFA for XY]
[Figure 2.5. NFA for X|Y]

• DFA (Deterministic Finite Automata) vs NFA (Non-deterministic Finite Automata)
  − In a DFA, empty transitions (ε) are not allowed. Also, from any state s there should be only one edge labeled a, for each input symbol a.
• Convert from NFA to DFA
  − Find ε-closure of s:
    • Add s (the node itself) to its ε-closure, i.e., ε-closure(s) = {s} initially.
    • Reachable with empty transition: If there is a node t in ε-closure(s), and there exists an edge labeled ε from t to u, then u is also added to ε-closure(s) if u is not there already. Continue until no more nodes can be added to ε-closure(s).

Finite Automata

[Figure 2.6. NFA for X*]

• Convert from NFA to DFA
  − State transition:
    • From the initial ε-closure, find the transitions on each terminal present in the given regular expression.
    • Example: If there is a node t in the current set, and there exists an edge with a non-empty label from t to u, then u is added to the new set if it is not there already. From u, add all the nodes that can be reached using ε-transitions.
  − A transition table is drawn based on the states and inputs.
  − Optimization of the transition table can be done as follows:
    • Partition the set of states into non-final and final states.
    • Within the non-final states:
      − A state whose transition goes outside the group is separated from the group.
      − If there are states with the same transitions on all the inputs, keep one of those states and replace the other entries with the preserved one.
      − Check for dead states. A dead state is a non-final state whose transitions end up in the same state irrespective of the input.

Finite Automata

• Transitions for (m | n)*mnn
• Find ε-closure: Starting from 0, using ε-transitions, we can reach 0, 1, 2, 4 and 7. A = {0, 1, 2, 4, 7}.
• Transition of m on set A gives {3, 8}.
  − From node 3, we can reach 6, 7, 1, 2 and 4 using ε-transitions. From node 8, no further ε-transition is possible.
  − So ε-closure({3, 8}) = B = {1, 2, 3, 4, 6, 7, 8}.
• Transition of n on set A gives C = {1, 2, 4, 5, 6, 7}.
• Transition of n on set B gives D = {1, 2, 4, 5, 6, 7, 9}.
• Transition of n on set D gives E = {1, 2, 4, 5, 6, 7, 10}.
• Applying the transition of m on set C gives B again. We stop here because any further transition leads only to sets that have already been found.
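The ε-closure and transition steps above can be sketched in code. The NFA edges below are an assumption: the original figure is not available here, so they were reconstructed to be consistent with the state sets A to E listed above (they follow the usual Thompson-style construction for (m | n)*mnn). The names `eps_closure`, `move` and `subset_construction` are likewise illustrative.

```python
EPS = ""  # label used here for ε-transitions

# Assumed NFA for (m | n)*mnn, states 0..10, reconstructed from the sets A..E.
NFA = {
    (0, EPS): {1, 7}, (1, EPS): {2, 4},
    (2, "m"): {3}, (4, "n"): {5},
    (3, EPS): {6}, (5, EPS): {6}, (6, EPS): {1, 7},
    (7, "m"): {8}, (8, "n"): {9}, (9, "n"): {10},
}
START, ALPHABET = 0, "mn"


def eps_closure(states):
    """All states reachable from `states` using only ε-transitions."""
    closure, stack = set(states), list(states)
    while stack:
        t = stack.pop()
        for u in NFA.get((t, EPS), ()):
            if u not in closure:
                closure.add(u)
                stack.append(u)
    return frozenset(closure)


def move(states, symbol):
    """States reachable from `states` on one non-empty input symbol."""
    return {u for t in states for u in NFA.get((t, symbol), ())}


def subset_construction():
    """Build DFA states (named A, B, ...) and the DFA transition table."""
    start = eps_closure({START})
    names = {start: "A"}
    table, work = {}, [start]
    while work:
        current = work.pop(0)
        for symbol in ALPHABET:
            target = eps_closure(move(current, symbol))
            if not target:
                continue
            if target not in names:  # a new DFA state was discovered
                names[target] = chr(ord("A") + len(names))
                work.append(target)
            table[(names[current], symbol)] = names[target]
    return names, table


names, table = subset_construction()
for state_set, name in sorted(names.items(), key=lambda item: item[1]):
    print(name, sorted(state_set), "m ->", table[(name, "m")], "n ->", table[(name, "n")])
```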

Finite Automata - Example

[Figure: NFA for (m | n)*mnn, with states 0 to 10 and transitions labeled m and n]

• Transition Table
• Non-final states: (ABCD); final state: (E).
• With the non-final states:
  − On input m, all of them go to B, so they stay in one group.
  − On input n, states A, B, and C move to members of the group (ABCD) but D goes to E. So, split (ABCD) into (ABC) and (D).
  − In (ABC), on input n, states A and C go to C but B goes to D. So, split them into (AC) and (B).
  − In (AC), both states have the same transitions. Thus, keep only one of them (A).
  − Check for dead states. In our example, there is no dead state.
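The splitting procedure above can be sketched as partition refinement. The transition table in the code is partly an assumption: the rows for A to D follow the text, while the row for the final state E (m to B, n to C) is not stated on the slide and is assumed here from the usual NFA for (m | n)*mnn. The function name `minimize` is also illustrative.

```python
# DFA transition table; the row for E is an assumption (see lead-in).
DFA = {
    "A": {"m": "B", "n": "C"},
    "B": {"m": "B", "n": "D"},
    "C": {"m": "B", "n": "C"},
    "D": {"m": "B", "n": "E"},
    "E": {"m": "B", "n": "C"},
}
FINAL = {"E"}


def minimize(dfa, final):
    """Split groups until states in a group agree on every input."""
    partition = [set(dfa) - set(final), set(final)]  # non-final / final
    while True:
        def group_of(state):
            # Index of the group that currently contains `state`.
            return next(i for i, group in enumerate(partition) if state in group)

        refined = []
        for group in partition:
            buckets = {}
            for state in sorted(group):
                # Which group does each input lead to?
                signature = tuple(group_of(dfa[state][a]) for a in sorted(dfa[state]))
                buckets.setdefault(signature, set()).add(state)
            refined.extend(buckets.values())
        if len(refined) == len(partition):  # nothing was split: done
            return refined
        partition = refined


groups = minimize(DFA, FINAL)
print(sorted(sorted(group) for group in groups))
```

States that end up in the same group (here A and C) can be merged into one state of the minimized DFA.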

Finite Automata - Example

• Terminal Symbols: Atomic or non-divisible symbols in any language
• Non-terminal Symbols (variable symbols, syntactic categories, syntactic variables, or abstractions): A single non-terminal symbol can have more than one right-hand side (RHS), with the alternatives separated by a vertical bar (|).
• Start symbol (distinguished symbol): The variable symbol for the basic category that is being defined
• Production or Rewriting Rules: Rules that define the structure of the constructs. They define how to write any variable symbol using terminal and non-terminal symbols. A rule has a left-hand side (LHS) that derives a right-hand side (RHS) made up of terminal and non-terminal symbols.

Grammar Types - Definitions

• Grammar: A grammar is a finite non-empty set of rules.
• Syntactic lists: Lists of a syntactic nature can be represented using recursion.
  − <ident_list> → ident | ident, <ident_list>
• Derivation: The process of repeatedly applying the rules, starting from the start symbol, until there are no more non-terminal symbols to expand.

Grammar Types - Definitions

• Unrestricted Grammar:
  − Also called recursively enumerable grammar, phrase-structure grammar, or Type 0 grammar
  − There is no restriction on the right-hand side of the production rule.
  − At least one non-terminal symbol must be present on the left side of the production rule.
  − α → β, where α ∈ (V ∪ T)⁺ and β ∈ (V ∪ T)*
  − V: finite set of variable symbols
  − T: finite set of terminal symbols
  − Example: S → ACaB; Ca → aaC

Grammar Types

• Context-Sensitive Grammar:
  − Also called Type 1 grammar
  − Requires that the right side of the production rule must not have fewer symbols than the left side
  − Called context-sensitive because any replacement of a variable depends on what surrounds it
  − αAβ → αγβ
    • where A ∈ V, α, β ∈ (V ∪ T)* and γ ∈ (V ∪ T)⁺
  − Example: Things b → b Thing; Thing c → Other b c

Grammar Types

• Context-Free Grammar:
  − Also called Type 2 grammar
  − Developed by Noam Chomsky during the mid-1950s
  − The left side of a production rule is a single variable symbol and the right side is a combination of terminal and variable symbols.
  − A production rule takes the form A → α, where A ∈ V and α ∈ (V ∪ T)*
  − Example: Fraction → Digit; Fraction → Digit Fraction

Grammar Types

• Regular Grammar:
  − Also called restrictive grammar or Type 3 grammar
  − Each production rule is restricted to have only one terminal, or one terminal and one variable, on the right side.
  − Regular grammars are classified as right-linear or left-linear grammars.
  − Right-linear grammar
    • A → xB or A → x, where A ∈ V, B ∈ V, and x ∈ T
  − Left-linear grammar
    • A → Bx or A → x, where A ∈ V, B ∈ V, and x ∈ T
  − Regular expressions vs context-free grammar:
    • Lexical rules are simple in nature, so representing them does not need a notation as powerful as context-free grammar.
    • Regular expressions can be used to build recognizers for any regular language.

Grammar Types

• Backus-Naur Form (BNF):
  − Invented by John Backus to describe Algol 58
  − Described as a metalanguage because it is a language that is used to describe another language
  − Considered equivalent to context-free grammar
  − Abstractions are used to represent various classes of syntactic structures; they act like non-terminal symbols.
• To represent a while statement:
  − <while_stmt> → while ( <logic_expr> ) <stmt>
• Reasons for using BNF to describe syntax:
  − BNF provides a clear and concise syntax description.
  − The parser can be based directly on the BNF.
  − Parsers based on BNF are easier to handle.

Grammar Types

• Extended BNF (EBNF):
  − BNF's notation + regular expressions
  − Different notations persist:
    • Optional parts: Denoted with the subscript opt or enclosed in square brackets.
      − <proc_call> → ident ( <expr_list> )opt
      − <proc_call> → ident [ ( <expr_list> ) ]
    • Alternative parts:
      − Pipe (|) indicates an either-or choice
      − Grouping of the choices is done with square brackets or parentheses.
      − <term> → <term> [+ | -] const
      − <term> → <term> (+ | -) const
    • Repetitions (0 or more) are put in braces ({ })
      − An asterisk indicates zero or more occurrences of the item.
      − The presence or absence of the asterisk means the same here, because the braces themselves indicate zero or more occurrences of the item.
      − <ident> → letter {letter | digit}*
      − <ident> → letter {letter | digit}

Grammar Types

• Differences between BNF and EBNF notations
  − BNF:
    • <expr> → <expr> + <term> | <expr> - <term> | <term>
    • <term> → <term> * <factor> | <term> / <factor> | <factor>
  − EBNF:
    • <expr> → <term> {[+ | -] <term>}*
    • <term> → <factor> {[ * | / ] <factor>}*
• The EBNF version starts from the final replacement of <expr> by <term> and expresses the repetition with braces, so its right-hand side contains no <expr> entry.

Grammar Types

• Apply the grammar to the start symbol <program> and continue to expand until there are no more non-terminal symbols left on the right-hand side.
• Methods of Derivation
  − Leftmost derivation is a process by which the leftmost non-terminal in each sentential form is expanded.
  − Parse tree or derivation tree
    • A top-down parser keeps the start symbol as the root of the tree. Then, it replaces every variable symbol with a string of terminal symbols.
    • A bottom-up parser begins with the terminal symbols. These terminal symbols are matched with the right-hand side of a production rule and are replaced with the corresponding variable symbol on the left-hand side of the production rule.
    • Parse trees can be used to attach the semantics of a construct to its syntactic structure; this is called syntax-directed semantics.

Derivation

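• Given the regular grammar S ::= aS | bS | a | b, check whether the grammar can derive strings of the form aⁿbⁿ.
  − Let's try a¹b¹: S ⇒ aS ⇒ ab
  − Let's try a²b²: S ⇒ aS ⇒ aaS ⇒ aabS ⇒ aabb
  − Let's try a³b³: S ⇒ aS ⇒ aaS ⇒ aaaS ⇒ aaabS ⇒ aaabbS ⇒ aaabbb
  − We are able to derive the required form using this regular grammar.
  − Note that the grammar also derives every other non-empty string of a's and b's (for example, ba), so it does not generate only strings of the form aⁿbⁿ.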

Derivation - Example

• Ambiguities in Grammar
  − A grammar is said to be ambiguous if it generates a sentential form that has two or more distinct parse trees.
  − Ex. If statement with dangling else.

Grammar Issues

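[Figure: Two distinct parse trees for a nested if statement with a dangling else, built from the forms "if ( Expression ) Statement" and "if ( Expression ) Statement else Statement". In one tree the else belongs to the outer If Statement; in the other it belongs to the inner If Statement.]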

• Left Factorization:
  − The options on the right side of the given rule begin with the same element.
    • N → XY | XZ becomes N → X (Y | Z)
• Elimination of Left Recursion:
  − The first element on the right-hand side leads back to the left-hand side of the rule.
    • N → X | NY becomes N → XY*
  − The expansion of NY can terminate only if we replace N with X.
  − If N → X is used without any use of N → NY, then there will be no Y.
    • N ⇒ NY ⇒ NYY ⇒ XYY
• Substitution of Non-terminal Symbols:
  − A non-terminal symbol on the right-hand side of a rule can be replaced using another rule.
    • N → X and M → N can be changed to N → X and M → X

Grammar Transformations

• Also called syntax charts or railroad diagrams
• Developed by Niklaus Wirth in 1970
• Used to visualize rules in the form of diagrams
• Used to represent EBNF notations and not BNF notations
• Variables are represented by rectangles and terminal symbols are represented by circles (sometimes ovals)
• Each production rule is represented as a directed graph whose vertices are symbols

Syntax Diagram

• There is a subprogram for each non-terminal in the grammar; it parses the sentences that are generated by that non-terminal.
• To proceed with the correct grammatical rule, we match each terminal symbol on the right-hand side with the next input token.
  − If there is a match, we continue further.
  − Otherwise, an error is generated or other rules are tried.
• If a non-terminal has more than one RHS, we determine which one to parse as follows:
  − Choose the correct RHS based on the next token (lookahead).
  − The next token is compared with the first token that can be generated by each RHS until a match is found.
  − If there is no match, it is considered a syntax error.
• Shift-Reduce Parsing: With the given grammar and input string, we repeatedly reduce a substring that matches the right-hand side of a rule to that rule's left-hand side, until we attain the start symbol of the grammar.
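The recursive descent process described above, with one subprogram per non-terminal and the choice of RHS made from the next token, can be sketched for an EBNF-style expression grammar: expr → term {(+ | -) term}, term → factor {(* | /) factor}. The rule factor → number | ( expr ), the class name `Parser`, and the decision to evaluate the expression while parsing are assumptions made for this sketch.

```python
import re


def tokenize(text):
    # Simplified scanner: numbers, operators, parentheses.
    # Characters that match nothing are silently skipped in this sketch.
    return re.findall(r"[0-9]+|[-+*/()]", text)


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        # One-token lookahead; None marks the end of input.
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, token):
        if self.peek() != token:
            raise SyntaxError(f"expected {token!r}, found {self.peek()!r}")
        self.pos += 1

    def expr(self):
        # expr -> term {(+ | -) term}
        value = self.term()
        while self.peek() in ("+", "-"):
            op = self.peek()
            self.pos += 1
            rhs = self.term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term(self):
        # term -> factor {(* | /) factor}
        value = self.factor()
        while self.peek() in ("*", "/"):
            op = self.peek()
            self.pos += 1
            rhs = self.factor()
            value = value * rhs if op == "*" else value / rhs
        return value

    def factor(self):
        # factor -> number | ( expr )   (assumed rule)
        token = self.peek()
        if token == "(":
            self.pos += 1
            value = self.expr()
            self.expect(")")
            return value
        if token is not None and token.isdigit():
            self.pos += 1
            return int(token)
        raise SyntaxError(f"unexpected token {token!r}")


def parse(text):
    parser = Parser(tokenize(text))
    value = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return value


print(parse("2 + 3 * 4"))    # 14
print(parse("(2 + 3) * 4"))  # 20
print(parse("8 - 3 - 2"))    # 3
```

The loops in `expr` and `term` make the operators left-associative, which is why `8 - 3 - 2` evaluates as (8 - 3) - 2.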

Recursive Descent Parsing

• Concrete Syntax:
  − Defines the structure of all the parts of a program, like arithmetic expressions, assignments, loops, functions, definitions, etc.
  − Context-free grammars, BNF, EBNF, etc. are of the concrete syntax type.
    • Assignment → Identifier = Expression;
    • Expression → Term | Expression + Term
• Abstract Syntax:
  − Generated by the parser and used to link the syntax and semantics of a program
  − Unlike concrete syntax, abstract syntax provides only the essential syntactic elements and does not describe how they are structured.
    • Statement = Assignment | Loop
    • Assignment = Variable target; Expression source
• Ambiguity occurs in concrete syntax but not in abstract syntax.

Concrete and Abstract Syntax

• Identification Tables
  − Also called symbol tables
  − A dictionary-type data structure that stores identifier names along with their corresponding attributes
• The organization of the identification table depends on the "block structure" used in different languages:
  − Monolithic block structure: e.g. BASIC, COBOL
  − Flat block structure: e.g. Fortran
  − Nested block structure: used in the modern "block-structured" programming languages (e.g. Algol, Pascal, C, C++, Scheme, Java, …)
• Monolithic Block Structure:
  − A single block is used for the entire program.
  − Every identifier is visible throughout the entire program.
  − The scope of each identifier is the whole program, and an identifier cannot be declared twice.

Symbol Table

• Flat Block Structure:
  − The whole block area is divided into several disjoint blocks.
  − Declarations can be local or global.
  − Identifiers can be redefined in another block.
  − A local declaration is given higher priority than a global declaration.
• Nested Block Structure:
  − Blocks may be nested one within another.
  − The scope of an identifier depends on the level of nesting.
  − An identifier cannot be defined more than once at the same level within the same block.

Symbol Table

• Unordered list: Data can be stored in an array or a linked list.
• Ordered list:
  − Entries in the list are ordered.
  − Searching is faster.
  − Insertion of data into the list is an expensive process.
• Binary Search Tree:
  − Searching takes O(log(n)) time when the tree is balanced.
• Hash Table:
  − Most commonly used option
  − Data can be accessed in constant time on average.
  − Storing data is not time consuming.
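A hash-table symbol table for a nested block structure can be sketched as a stack of dictionaries, one per open block. The class and method names (`SymbolTable`, `enter_block`, `exit_block`, `declare`, `lookup`) and the attribute values are hypothetical, chosen only for this illustration.

```python
class SymbolTable:
    """Sketch of an identification table for nested blocks (stack of hash tables)."""

    def __init__(self):
        self.scopes = [{}]  # the outermost (global) block

    def enter_block(self):
        self.scopes.append({})

    def exit_block(self):
        self.scopes.pop()

    def declare(self, name, attributes):
        scope = self.scopes[-1]
        if name in scope:
            # Same identifier twice at the same level is not allowed.
            raise KeyError(f"{name!r} is already declared in this block")
        scope[name] = attributes

    def lookup(self, name):
        # Innermost declaration wins: search from the current block outwards.
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        return None  # undeclared identifier


table = SymbolTable()
table.declare("x", {"type": "int"})
table.enter_block()
table.declare("x", {"type": "float"})   # redefinition in an inner block
print(table.lookup("x"))                 # {'type': 'float'}
table.exit_block()
print(table.lookup("x"))                 # {'type': 'int'}
print(table.lookup("y"))                 # None
```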

Symbol Table Structure

• The first L in LL specifies a left-to-right scan of the input.
• The second L specifies that a leftmost derivation is generated.
• The first step towards using an LL grammar is the elimination of common prefixes. Note: α and β can match zero or more elements.
  − The form is B → αβ1 | αβ2 | … | αβm | Xm+1 | Xm+2 | … | Xm+n
  − Replace it with
    • B → αB1 | Xm+1 | Xm+2 | … | Xm+n
    • B1 → β1 | β2 | … | βm
• Convert the grammar into an unambiguous one
  − Make sure the rules obey precedence and associativity.
  − Start from the terminals and move from high precedence to low precedence.
    • Consider the grammar: E → E + E | E * E | (E) | id
  − Select the terminals and name them differently.
    • Factor → (E) | id
  − The * operator has higher priority than the + operator, so E → E * E is considered next.

LL Grammar

• Convert the grammar into an unambiguous one
  • Consider the grammar: E → E + E | E * E | (E) | id
  − * has higher priority than +, so select E → E * E next.
    • To provide the link between E * E and Factor, use the pipe (|) operator.
    • With no link, the non-terminal will never become a terminal.
    • Give the new name “Term” to the element.
    • Term → Term * Factor | Factor
  − Then, consider E → E + E and change it in the same way.
    • Expression → Expression + Term | Term
  − So, F → (E) | id; T → T * F | F; E → E + T | T
• Remove left recursion
  − If A → Aα1 | Aα2 | … | Aαm | β1 | β2 | … | βn
  − where no βi begins with A (for E → E + T | T: A is E, α is +T and β is T)
  − replace the above with:
    • A → β1A' | β2A' | … | βnA'
    • A' → α1A' | α2A' | … | αmA' | ε
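The left-recursion removal rule above can be sketched as a small function. The representation is an assumption made for this sketch: each alternative is a list of symbols, the empty list stands for ε, and the function name `eliminate_left_recursion` is hypothetical. Only immediate left recursion (A → Aα) is handled.

```python
def eliminate_left_recursion(nonterminal, alternatives):
    """Rewrite A -> A a1 | ... | b1 | ... as A -> b A', A' -> a A' | ε."""
    # The alphas: what follows A in each left-recursive alternative.
    alphas = [alt[1:] for alt in alternatives if alt and alt[0] == nonterminal]
    # The betas: alternatives that do not begin with A.
    betas = [alt for alt in alternatives if not alt or alt[0] != nonterminal]
    if not alphas:
        return {nonterminal: alternatives}  # nothing to do
    new_symbol = nonterminal + "'"
    return {
        nonterminal: [beta + [new_symbol] for beta in betas],
        new_symbol: [alpha + [new_symbol] for alpha in alphas] + [[]],  # [] is ε
    }


print(eliminate_left_recursion("E", [["E", "+", "T"], ["T"]]))
# {'E': [['T', "E'"]], "E'": [['+', 'T', "E'"], []]}
print(eliminate_left_recursion("T", [["T", "*", "F"], ["F"]]))
# {'T': [['F', "T'"]], "T'": [['*', 'F', "T'"], []]}
```

Applied to E → E + T | T and T → T * F | F, this gives E → TE', E' → +TE' | ε, T → FT', T' → *FT' | ε.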

LL Grammar

• Consider the grammar
  • E → TE'; E' → +TE' | ε; T → FT'; T' → *FT' | ε; F → (E) | id
• FIRST & FOLLOW
  − FIRST:
    • If X is a terminal, then FIRST(X) is {X}.
    • If X is a non-terminal and X → a… is a production whose right-hand side begins with the terminal a, then add a to FIRST(X). If X → ε is a production, then add ε to FIRST(X).
    • If X → Y1Y2…Yk is a production, then for all i such that all of Y1,…,Yi-1 are non-terminals and FIRST(Yj) contains ε for j = 1,2,…,i-1, add every non-ε symbol in FIRST(Yi) to FIRST(X). If ε is in FIRST(Yj) for all j = 1,2,…,k, then add ε to FIRST(X).
  − The third rule of FIRST applies to E → TE', where T → FT' and F → (E) | id. Thus, what is in FIRST(F) will be in FIRST(E) and FIRST(T).
    • FIRST(E) = FIRST(T) = FIRST(F) = {(, id}
    • FIRST(E') = {+, ε}
    • FIRST(T') = {*, ε}
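The FIRST rules can be sketched as a fixed-point computation over the grammar above. The dictionary representation of the grammar and the names `first_of_sequence` and `compute_first` are assumptions made for this sketch; any symbol that is not a key of the grammar is treated as a terminal.

```python
EPS = "ε"

# The grammar from the slide; each alternative is a list of symbols.
GRAMMAR = {
    "E": [["T", "E'"]],
    "E'": [["+", "T", "E'"], [EPS]],
    "T": [["F", "T'"]],
    "T'": [["*", "F", "T'"], [EPS]],
    "F": [["(", "E", ")"], ["id"]],
}


def first_of_sequence(symbols, first):
    """FIRST of a string Y1 Y2 ... Yk, given the FIRST sets computed so far."""
    result = set()
    for symbol in symbols:
        # Rule 1: FIRST of a terminal (or of ε) is the symbol itself.
        symbol_first = first[symbol] if symbol in first else {symbol}
        result |= symbol_first - {EPS}
        if EPS not in symbol_first:
            return result  # this symbol cannot vanish, so stop here
    result.add(EPS)  # every symbol can derive ε
    return result


def compute_first(grammar):
    first = {nonterminal: set() for nonterminal in grammar}
    changed = True
    while changed:  # repeat until no FIRST set grows any more
        changed = False
        for nonterminal, productions in grammar.items():
            for production in productions:
                new = first_of_sequence(production, first)
                if not new <= first[nonterminal]:
                    first[nonterminal] |= new
                    changed = True
    return first


for nonterminal, symbols in compute_first(GRAMMAR).items():
    print(f"FIRST({nonterminal}) = {sorted(symbols)}")
```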

LL Grammar

• FIRST & FOLLOW
  − FOLLOW: (α is any string of grammar symbols; β can also be ε.)
    • Place $ in FOLLOW(X), where X is the start symbol.
    • If there is a production A → αBβ with β ≠ ε, then everything in FIRST(β) except ε is in FOLLOW(B).
    • If there is a production A → αB, or a production A → αBβ where FIRST(β) contains ε, then everything in FOLLOW(A) is in FOLLOW(B).
  • For FOLLOW, apply the first rule to the whole grammar, then apply the second rule to the whole grammar, and so on.
  • Note: Refer to notes for verbal explanation for FIRST & FOLLOW rules

LL Grammar

[Figure: Second rule of FOLLOW. For A → αBβ with β ≠ ε: FOLLOW(B) receives FIRST(β), except ε.]
[Figure: Third rule of FOLLOW. For A → αB, or A → αBβ under the condition that FIRST(β) contains ε: FOLLOW(B) receives FOLLOW(A).]

• FIRST & FOLLOW
  − FOLLOW(E) = FOLLOW(E') = {), $}
  − FOLLOW(T) = FOLLOW(T') = {+, ), $}
  − FOLLOW(F) = {+, *, ), $}
• Generating the parsing table
  − A grammar whose parsing table has no multiply-defined entries is said to be LL(1). α is any string of grammar symbols; it can also be ε.
  1. For each production A → α of the grammar, do steps 2 & 3.
  2. For each terminal a in FIRST(α), add A → α to M[A,a].
  3. If ε is in FIRST(α), add A → α to M[A,b] for each terminal b in FOLLOW(A). If ε is in FIRST(α) and $ is in FOLLOW(A), add A → α to M[A,$].
    − Note: Here, M[A,b] indicates the corresponding cell in the table, whose row corresponds to the non-terminal A and whose column corresponds to the terminal b.
  4. Make each undefined entry of M an error.
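The FOLLOW rules and the table-generation steps can be sketched together for the same grammar. As before, the grammar representation and all function names (`first_of_sequence`, `compute_first`, `compute_follow`, `build_table`) are assumptions made for this sketch; undefined table entries are simply absent from the dictionary and stand for errors.

```python
EPS, END = "ε", "$"
GRAMMAR = {
    "E": [["T", "E'"]],
    "E'": [["+", "T", "E'"], [EPS]],
    "T": [["F", "T'"]],
    "T'": [["*", "F", "T'"], [EPS]],
    "F": [["(", "E", ")"], ["id"]],
}
START = "E"


def first_of_sequence(symbols, first):
    result = set()
    for symbol in symbols:
        symbol_first = first[symbol] if symbol in first else {symbol}
        result |= symbol_first - {EPS}
        if EPS not in symbol_first:
            return result
    result.add(EPS)  # also covers the empty sequence
    return result


def compute_first(grammar):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, productions in grammar.items():
            for production in productions:
                new = first_of_sequence(production, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first


def compute_follow(grammar, start, first):
    follow = {nt: set() for nt in grammar}
    follow[start].add(END)  # rule 1: $ follows the start symbol
    changed = True
    while changed:
        changed = False
        for head, productions in grammar.items():
            for production in productions:
                for i, symbol in enumerate(production):
                    if symbol not in grammar:
                        continue  # FOLLOW is defined for non-terminals only
                    rest = first_of_sequence(production[i + 1:], first)
                    new = rest - {EPS}       # rule 2: FIRST(β) except ε
                    if EPS in rest:          # rule 3: β is empty or can vanish
                        new |= follow[head]
                    if not new <= follow[symbol]:
                        follow[symbol] |= new
                        changed = True
    return follow


def build_table(grammar, first, follow):
    table = {}
    for head, productions in grammar.items():
        for production in productions:
            body_first = first_of_sequence(production, first)
            targets = body_first - {EPS}     # step 2
            if EPS in body_first:            # step 3 (includes $ if present)
                targets |= follow[head]
            for terminal in targets:
                if (head, terminal) in table:
                    raise ValueError("multiply-defined entry: grammar is not LL(1)")
                table[(head, terminal)] = production
    return table


FIRST = compute_first(GRAMMAR)
FOLLOW = compute_follow(GRAMMAR, START, FIRST)
TABLE = build_table(GRAMMAR, FIRST, FOLLOW)
for (nonterminal, terminal), production in sorted(TABLE.items()):
    print(f"M[{nonterminal}, {terminal}] = {nonterminal} -> {' '.join(production)}")
```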

LL Grammar

• LR stands for a Left-to-right scan of the input that produces a Rightmost derivation in reverse.
• Most powerful shift-reduce parsing technique
  − Non-backtracking shift-reduce parsing that can detect a syntactic error as soon as possible
• Represented as LR(k), where k indicates the number of look-ahead symbols
• LR(1) means a look-ahead of one symbol: only the next input element is considered, and nothing that follows it.
• Can parse all grammars that can be parsed with predictive parsers, like LL(1) grammars
• Types of LR parsers:
  − SLR – Simple LR parser
  − LR – Most general LR parser
  − LALR – Intermediate LR parser (look-ahead LR parser)
• All the types use the same algorithm but with different parsing tables.

LR Grammar

• LR parser configuration: (S0 X1 S1 ... Xm Sm, ai ai+1 ... an $), which includes the stack values and the rest of the input
  − Xi is a grammar symbol
  − Si is a state
  − ai is an input symbol
• The initial stack contains just S0

LR Grammar

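[Figure 2.11. LR Parsing. The input buffer a1 ... ai ... an $ and a stack holding S0 X1 S1 ... Xm-1 Sm-1 Xm Sm (with Sm on top) are used by the LR parsing algorithm, which consults an action table (indexed by state and by terminal or $; four different actions) and a goto table (indexed by state and non-terminal; each item is a state number).]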

• The parser takes its action using Sm and ai
• shift s: shifts the next input symbol ai and the state s onto the stack
  − (S0 X1 S1 ... Xm Sm, ai ai+1 ... an $) → (S0 X1 S1 ... Xm Sm ai s, ai+1 ... an $)
• reduce A → β (or rn, where n is a production number)
  − pop r grammar symbols, together with their states, from the stack, where r is the length of β. This is done so that we can replace the right-hand side of the rule with its left-hand side.
  − then push A and s, where s = goto[Sm-r, A]. Here, m-r indicates that r items have been taken off the stack.
  − (S0 X1 S1 ... Xm Sm, ai ai+1 ... an $) → (S0 X1 S1 ... Xm-r Sm-r A s, ai ... an $)
  − The output is the reducing production rule, reduce A → β
• Accept: Parsing is successfully completed.
• Error: The parser has detected an error. This happens when the corresponding entry in the action table is empty.
• GOTO takes a state and a grammar symbol as arguments and produces a state.

LR Grammar

• Closure: If I is a set of LR(0) items for a grammar G, then closure(I) is the set of LR(0) items constructed from I by the two rules:
  1. Initially, every LR(0) item in I is added to closure(I).
  2. If A → α.Bβ is in closure(I) and B → γ is a production rule of G, then B → .γ will be in closure(I). Here, B is a non-terminal; α and β can be anything, or even empty.
• The second rule is applied until no more LR(0) items can be added to closure(I).

Grammar:
  E' → E
  E → E+T
  E → T
  T → T*F
  T → F
  F → (E)
  F → id

Check for a non-terminal after the dot; if there is one, add its productions.

closure({E' → .E}) = { E' → .E, E → .E+T, E → .T, T → .T*F, T → .F, F → .(E), F → .id }
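The closure operation can be sketched as a worklist loop over LR(0) items. The item representation (head, body, dot position) and the helper names `closure` and `show` are assumptions made for this sketch, which uses the augmented expression grammar E' → E, E → E+T | T, T → T*F | F, F → (E) | id.

```python
# Augmented expression grammar as (head, body) pairs.
GRAMMAR = [
    ("E'", ("E",)),
    ("E", ("E", "+", "T")),
    ("E", ("T",)),
    ("T", ("T", "*", "F")),
    ("T", ("F",)),
    ("F", ("(", "E", ")")),
    ("F", ("id",)),
]
NONTERMINALS = {head for head, _ in GRAMMAR}


def closure(items):
    """Closure of a set of LR(0) items; an item is (head, body, dot position)."""
    result = set(items)            # rule 1: start with the items themselves
    work = list(items)
    while work:
        head, body, dot = work.pop()
        # Rule 2: a non-terminal right after the dot pulls in its productions.
        if dot < len(body) and body[dot] in NONTERMINALS:
            for h, b in GRAMMAR:
                if h == body[dot]:
                    item = (h, b, 0)
                    if item not in result:
                        result.add(item)
                        work.append(item)
    return frozenset(result)


def show(item):
    """Format an item with a dot, for example E -> . E + T."""
    head, body, dot = item
    return f"{head} -> {' '.join(body[:dot] + ('.',) + body[dot:])}"


I0 = closure({("E'", ("E",), 0)})
for line in sorted(show(item) for item in I0):
    print(line)
```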

Phases of LR Grammar Processing

• GOTO: If I is a set of LR(0) items and X is a grammar symbol (terminal or non-terminal), then goto(I,X) is defined as follows:
  − If A → α.Xβ is in I, then every item in closure({A → αX.β}) will be in goto(I,X).

Example:
  I = { E' → .E, E → .E+T, E → .T, T → .T*F, T → .F, F → .(E), F → .id }
  goto(I,E) = { E' → E., E → E.+T }  Move the dot one step further, past E.
  goto(I,T) = { E → T., T → T.*F }  Move the dot one step further, past T.
  goto(I,F) = { T → F. }  Move the dot one step further, past F.
  goto(I,() = { F → (.E), E → .E+T, E → .T, T → .T*F, T → .F, F → .(E), F → .id }  After the dot moves past (, it is followed by a non-terminal, so add the closure of that non-terminal.
  goto(I,id) = { F → id. }  Move the dot one step further, past id.



• Canonical LR(0) collection: this is needed to create the SLR parsing table.

C is { closure({S' → .S}) }
repeat the following until no more sets of LR(0) items can be added to C:
    for each I in C and each grammar symbol X
        if goto(I,X) is not empty and not in C
            add goto(I,X) to C

• The goto function is a DFA on the sets in C.

For I1, we look at I0 and use the symbol E. I2 and I3 are obtained using transitions on the symbols T and F.


• For I4, we have moved the dot past the open parenthesis. As the dot is now followed by E (a non-terminal), we need to add all the items for E (E → .E+T and E → .T) from I0. Since non-terminals (T and F) still follow the dot in these items, we add their items as well.
• I5 is obtained by the transition on id from I0. Then, the transition on + from I1 gives I6.
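The goto function and the canonical collection can be sketched together. This is a minimal sketch under assumptions: items are (lhs, rhs, dot) triples, and the symbol order in `SYMBOLS` is a choice made here so that the state numbers come out as I0 to I11 in the order described above; a different order gives the same sets with different numbers.

```python
GRAMMAR = [
    ("E'", ("E",)),
    ("E", ("E", "+", "T")),
    ("E", ("T",)),
    ("T", ("T", "*", "F")),
    ("T", ("F",)),
    ("F", ("(", "E", ")")),
    ("F", ("id",)),
]
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}
SYMBOLS = ["E", "T", "F", "(", "id", "+", "*", ")"]  # assumed visiting order


def closure(items):
    """Close a set of LR(0) items, each a (lhs, rhs, dot) triple."""
    result = set(items)
    worklist = list(result)
    while worklist:
        lhs, rhs, dot = worklist.pop()
        if dot < len(rhs) and rhs[dot] in NONTERMINALS:
            for b_lhs, b_rhs in GRAMMAR:
                new_item = (b_lhs, b_rhs, 0)
                if b_lhs == rhs[dot] and new_item not in result:
                    result.add(new_item)
                    worklist.append(new_item)
    return frozenset(result)


def goto(items, symbol):
    """Move the dot past `symbol` wherever possible, then take the closure."""
    moved = {(lhs, rhs, dot + 1)
             for lhs, rhs, dot in items
             if dot < len(rhs) and rhs[dot] == symbol}
    return closure(moved)  # empty when no item has the dot before `symbol`


def canonical_collection():
    """Return the list of item sets and the (state, symbol) -> state transitions."""
    states = [closure({("E'", ("E",), 0)})]
    transitions = {}
    i = 0
    while i < len(states):          # states appended below are visited later
        for symbol in SYMBOLS:
            target = goto(states[i], symbol)
            if not target:
                continue
            if target not in states:
                states.append(target)
            transitions[(i, symbol)] = states.index(target)
        i += 1
    return states, transitions


states, transitions = canonical_collection()
print(len(states))                                     # 12
print(transitions[(0, "id")], transitions[(1, "+")])   # 5 6
```

The `transitions` dictionary is the DFA mentioned above: its terminal edges become shift entries and its non-terminal edges become goto entries.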


[Figure 2.12. SLR Transitions: the DFA over the item sets I0 to I11, with edges labelled by the grammar symbols E, T, F, id, (, ), + and *.]

1. Construct the canonical collection of sets of LR(0) items for G'.
   C ← {I0,...,In}
2. Create the parsing action table as follows:
  • If a is a terminal, A → α.aβ is in Ii and goto(Ii,a) = Ij, then action[i,a] is shift j.
  • If A → α. is in Ii, then action[i,a] is reduce A → α for all a in FOLLOW(A), where A ≠ S'. In the table, the reduction is written using the sequence number of A → α in the grammar.
    − Note: there is no element after the dot; α can be anything, even empty.
  • If S' → S. is in Ii, then action[i,$] is accept. Here, with E as the start symbol S, E' → E. produces the accept entry.
  • If these rules generate any conflicting actions, the grammar is not SLR(1).
3. Create the parsing goto table:
  • For all non-terminals A, if goto(Ii,A) = Ij then goto[i,A] = j.
4. All entries not defined by (2) and (3) are errors.
5. The initial state of the parser is the one containing S' → .S.
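Steps 1 to 4 can be sketched in one function. This is a minimal sketch under stated assumptions: items are hypothetical (production number, dot) pairs so that reduce entries can carry the production number directly, the FOLLOW sets are written out by hand rather than computed, and the symbol order is chosen so that state numbers match I0 to I11.

```python
# Index = production number; production 0 is the augmented rule E' -> E.
GRAMMAR = [
    ("E'", ("E",)),
    ("E", ("E", "+", "T")),
    ("E", ("T",)),
    ("T", ("T", "*", "F")),
    ("T", ("F",)),
    ("F", ("(", "E", ")")),
    ("F", ("id",)),
]
NONTERMINALS = {"E'", "E", "T", "F"}
SYMBOLS = ["E", "T", "F", "(", "id", "+", "*", ")"]  # assumed visiting order
# FOLLOW sets for this grammar, written out by hand (an assumption of the sketch).
FOLLOW = {
    "E": {"+", ")", "$"},
    "T": {"+", "*", ")", "$"},
    "F": {"+", "*", ")", "$"},
}


def closure(items):
    """Close a set of LR(0) items, each a (production number, dot) pair."""
    result, worklist = set(items), list(items)
    while worklist:
        prod, dot = worklist.pop()
        rhs = GRAMMAR[prod][1]
        if dot < len(rhs) and rhs[dot] in NONTERMINALS:
            for n, (lhs, _) in enumerate(GRAMMAR):
                if lhs == rhs[dot] and (n, 0) not in result:
                    result.add((n, 0))
                    worklist.append((n, 0))
    return frozenset(result)


def goto(items, symbol):
    """Move the dot past `symbol`, then take the closure."""
    return closure({(p, d + 1) for p, d in items
                    if d < len(GRAMMAR[p][1]) and GRAMMAR[p][1][d] == symbol})


def build_slr_table():
    """Return (action, goto_table, number of states) for the grammar above."""
    states = [closure({(0, 0)})]
    action, goto_table = {}, {}

    def set_action(i, a, entry):
        # A second, different entry for the same cell is a conflict.
        if action.setdefault((i, a), entry) != entry:
            raise ValueError(f"conflict in state {i} on {a!r}: not SLR(1)")

    i = 0
    while i < len(states):
        for x in SYMBOLS:
            target = goto(states[i], x)
            if not target:
                continue
            if target not in states:
                states.append(target)
            j = states.index(target)
            if x in NONTERMINALS:
                goto_table[(i, x)] = j          # step 3
            else:
                set_action(i, x, f"s{j}")       # shift entries
        for p, d in states[i]:
            lhs, rhs = GRAMMAR[p]
            if d == len(rhs):                   # dot at the end of the item
                if p == 0:
                    set_action(i, "$", "acc")
                else:
                    for a in FOLLOW[lhs]:
                        set_action(i, a, f"r{p}")
        i += 1
    return action, goto_table, len(states)


action, goto_table, n_states = build_slr_table()
print(action[(0, "id")], action[(1, "+")], action[(2, "+")], goto_table[(0, "E")])
```

Any (state, terminal) pair missing from `action` is an error entry, as in step 4.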

LR Grammar – Create SLR Parsing Table


• 1) E → E+T 2) E → T 3) T → T*F
• 4) T → F 5) F → (E) 6) F → id
• The entry s5 at (row, column) = (0, id) arises because, in Figure 2.12, I0 moves to I5 on id. So, action[0, id] = shift 5.
• The entry s6 at (1, +) arises because, in Figure 2.12, I1 moves to I6 on +. The remaining shift entries are filled in the same way.



LR Grammar – Given an input id * id + id


• An SLR(1) grammar is called an SLR grammar for short.
• An SLR grammar is always unambiguous, but not every unambiguous grammar is an SLR grammar.
• The parsing table of an SLR grammar has neither of these conflicts:
  − Shift/reduce conflict: a state in which the parser cannot decide whether to shift or to reduce for a terminal.
  − Reduce/reduce conflict: a state in which the parser cannot decide whether to reduce using production rule i or j for a terminal.
• Motivation for the canonical LR(1) parsing table:
  − In the SLR method, state i makes a reduction by A → α when the current token is a, if A → α. is in Ii and a is in FOLLOW(A).
  − In some situations, βA cannot be followed by the terminal a in a right-sentential form when βα and the state i are on top of the stack. Making the reduction in this case is not correct.

SLR(1) Grammar


• LR(1) item
  − In order to avoid invalid reductions, the states need to carry more information. This information is added as a terminal symbol, the second component of an item.
  − An LR(1) item is defined as A → α.β,a where a is the look-ahead of the LR(1) item (a is a terminal or the end-marker). When β (in the LR(1) item A → α.β,a) is not empty, the look-ahead has no effect.
  − When β is empty (A → α.,a), the reduction by A → α is done only if the next input symbol is a (not for every terminal in FOLLOW(A)).
  − A state will contain A → α.,a1 ... A → α.,an where {a1,...,an} ⊆ FOLLOW(A)

SLR(1) Grammar


• Canonical collection of LR(1) items: similar to LR(0), but with slight changes in closure and goto.
• closure(I), where I is a set of LR(1) items, is defined as follows:
  − Every LR(1) item in I is in closure(I).
  − If A → α.Bβ,a is in closure(I) and B → γ is a production rule of G, then B → .γ,b is in closure(I) for each terminal b in FIRST(βa).
    • B is the symbol next to the dot. The rules of any non-terminal that follows the dot are included in the closure.
    • β is what follows B, so the look-ahead b comes from FIRST(β), with a included when β is empty or can derive the empty string. α and β can be anything, even empty.
• If I is a set of LR(1) items and X is a grammar symbol (terminal or non-terminal), then goto(I,X) is defined as follows:
  − If A → α.Xβ,a is in I, then every item in closure({A → αX.β,a}) is in goto(I,X).
  − goto moves the dot one step forward.

SLR(1) Grammar


• Rule numbering starts at 1, but the initial production S' → S is excluded from the numbering.

SLR(1) Grammar


• In I0: in the item S' → .S,$, $ is the symbol that follows S'. As the dot is followed by a non-terminal (S), we need to add its rules (S → .L=R,$ and S → .R,$) as well.
  − S' → .S,$ matches A → α.Bβ,a and S → .L=R,$ matches B → .γ,b. $ is added as the look-ahead because β is empty [so FIRST(β) is also empty] and FIRST(a) = FIRST($) = $. The dot is then followed by L and R, so we add their rules too. The dot stays at the beginning of the right-hand side in the added rules.
• In I0: the items L → .*R,$/= and L → .id,$/= carry two look-aheads. The look-ahead = comes from FIRST(β) = FIRST(=R) when A → α.Bβ,a is matched with S → .L=R,$. The look-ahead $ comes from FIRST(a) = FIRST($) when it is matched with R → .L,$, where β is empty.
• In I0: R → .L,$ does not have = as a look-ahead because A → α.Bβ,a is matched with S → .R,$, where β is empty and a is $.
• Transitions are determined by moving the dot across a terminal or non-terminal. The transition from I0 to I1 is on S.
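The construction of I0 can be checked with a short LR(1) closure. This is a minimal sketch under assumptions: the grammar S' → S, S → L=R | R, L → *R | id, R → L is reconstructed from the items quoted above, items are hypothetical (lhs, rhs, dot, look-ahead) tuples, and `first_of` relies on this grammar having no empty productions.

```python
GRAMMAR = [
    ("S'", ("S",)),
    ("S", ("L", "=", "R")),
    ("S", ("R",)),
    ("L", ("*", "R")),
    ("L", ("id",)),
    ("R", ("L",)),
]
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}


def first_sets():
    """FIRST of each non-terminal (valid only without empty productions)."""
    first = {n: set() for n in NONTERMINALS}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in GRAMMAR:
            head = rhs[0]
            new = first[head] if head in NONTERMINALS else {head}
            if not new <= first[lhs]:
                first[lhs] |= new
                changed = True
    return first


FIRST = first_sets()


def first_of(beta, a):
    """FIRST(beta a): with no empty productions, only the first symbol matters."""
    if not beta:
        return {a}
    head = beta[0]
    return FIRST[head] if head in NONTERMINALS else {head}


def closure1(items):
    """Close a set of LR(1) items, each a (lhs, rhs, dot, look-ahead) tuple."""
    result, worklist = set(items), list(items)
    while worklist:
        lhs, rhs, dot, a = worklist.pop()
        if dot < len(rhs) and rhs[dot] in NONTERMINALS:
            for b in first_of(rhs[dot + 1:], a):
                for b_lhs, b_rhs in GRAMMAR:
                    item = (b_lhs, b_rhs, 0, b)
                    if b_lhs == rhs[dot] and item not in result:
                        result.add(item)
                        worklist.append(item)
    return frozenset(result)


i0 = closure1({("S'", ("S",), 0, "$")})

# Group look-aheads per core item, as in the "$/=" notation.
lookaheads = {}
for lhs, rhs, dot, a in i0:
    lookaheads.setdefault((lhs, rhs), set()).add(a)
for (lhs, rhs), las in sorted(lookaheads.items()):
    print(f"{lhs} -> .{' '.join(rhs)}, {'/'.join(sorted(las))}")
```

The two L items end up with the look-aheads $ and =, while R → .L keeps only $, which matches the explanation above.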

SLR(1) Grammar


1. Construct the canonical collection of sets of LR(1) items for G'.
   C ← {I0,...,In}
2. Create the parsing action table as follows:
  • If a is a terminal, A → α.aβ,b is in Ii and goto(Ii,a) = Ij, then action[i,a] is shift j.
  • If A → α.,a is in Ii, then action[i,a] is reduce A → α, where A ≠ S'.
  • If S' → S.,$ is in Ii, then action[i,$] is accept.
  • If these rules generate any conflicting actions, the grammar is not LR(1).
3. Create the parsing goto table:
  • For all non-terminals A, if goto(Ii,A) = Ij then goto[i,A] = j.
4. All entries not defined by (2) and (3) are errors.
5. The initial state of the parser is the one containing S' → .S,$.

LR(1) Parsing Table Construction


• LALR stands for LookAhead LR.
• LALR tables are smaller than canonical LR(1) parsing tables; an LALR table has the same number of states as the SLR table for the same grammar.
• The LALR parser is obtained by shrinking the canonical LR(1) parser. This shrinking process should not produce a reduce/reduce conflict.
• The core of a set of LR(1) items is the set of first components of its items, which excludes the look-ahead.
  − For example, in S → L.=R,$ the core part is S → L.=R
• If more than one set of LR(1) items has the same core, we merge those sets into a single state.
• Creating the LALR parsing table
  − Create the canonical LR(1) collection of sets of LR(1) items for the given grammar.
  − Find all sets that have the same core. Replace the sets having the same core with a single set, which is their union.
    • C = {I0,...,In} → C' = {J1,...,Jm} where m ≤ n
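The merging step can be sketched independently of how the LR(1) collection was built. This is a minimal sketch, assuming items are hypothetical (lhs, rhs, dot, look-ahead) tuples; the three item sets in the example are hand-written for illustration and are not the full collection of any grammar.

```python
def core(state):
    """The core of a set of LR(1) items: the items with look-aheads removed."""
    return frozenset((lhs, rhs, dot) for lhs, rhs, dot, _ in state)


def merge_by_core(states):
    """Merge item sets with equal cores.

    Returns the merged sets and a map from old state number to new state
    number, which is what the action and goto entries must be renumbered with.
    """
    merged = []          # mutable sets, in order of first appearance
    index_of_core = {}
    renumber = {}
    for old_index, state in enumerate(states):
        c = core(state)
        if c not in index_of_core:
            index_of_core[c] = len(merged)
            merged.append(set())
        new_index = index_of_core[c]
        merged[new_index] |= state   # union of the look-aheads
        renumber[old_index] = new_index
    return [frozenset(s) for s in merged], renumber


# Hand-written example: the first and third sets share the core { L -> id. }.
lr1_states = [
    frozenset({("L", ("id",), 1, "="), ("L", ("id",), 1, "$")}),
    frozenset({("R", ("L",), 1, "$")}),
    frozenset({("L", ("id",), 1, "$")}),
]
lalr_states, renumber = merge_by_core(lr1_states)
print(len(lalr_states), renumber)  # 2 {0: 0, 1: 1, 2: 0}
```

Because the merged set is a union, the number of sets can only stay the same or shrink, which is the m ≤ n relation above.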

LALR Grammar


• Creating the LALR parsing table
  − Create the parsing tables (action and goto tables) in the same way as for the LR(1) parser.
    • Note that if J = I1 ∪ ... ∪ Ik, then since I1,...,Ik have the same core, the cores of goto(I1,X),...,goto(Ik,X) must also be the same.
    • So, goto(J,X) = K where K is the union of all sets of items having the same core as goto(I1,X).
  − If no conflict is introduced, the grammar is an LALR(1) grammar. (Merging can introduce reduce/reduce conflicts but not shift/reduce conflicts.)
• Ambiguous grammars produce conflicts
  − Consider this ambiguous grammar: E → E + E | E * E | (E) | id
  − Produce the parsing table

LALR Grammar


• Errors are detected by consulting the parsing action table
  − The goto table is not used to detect errors
• A canonical LR (LR(1)) parser will not make any reduction before announcing an error, but SLR and LALR parsers might make several reductions before indicating an error
• Panic-mode error recovery in an LR parser
  − When faced with an error, pop entries from the stack until a state s that has a goto on a particular non-terminal A is found
  − Discard zero or more input symbols until a symbol a is found that is in FOLLOW(A)
  − The parser can now push the non-terminal A and the state goto[s,A], and proceed with parsing
• Phrase-level error recovery in an LR parser
  − Each empty entry in the action table is associated with a specific error routine that reflects the most likely error in that case
  − The routine can insert symbols into, or delete symbols from, the stack
  − This is useful in handling a missing operand, an unbalanced right parenthesis, etc.

Error Recovery in LR Grammar


• For the scanner:
  − lex (A Lexical Analyzer Generator): generates code in the C language
  − Variants of lex: flex, AT&T lex, Abraxas Pclex, MKS Lex, POSIX Lex, jflex, …
• For the parser:
  − yacc ("Yet Another Compiler Compiler", with AT&T Yacc, Berkeley Yacc and GNU Bison as variants)
  − Accent: Check for conflicts
• Programming with lex/flex
  − File name: filename.l
  − Does not generate executable code, but generates the C routine called yylex()
  − We need to write a program that calls yylex() to run the lexer
• Lex programs are divided into three sections: definitions section, rules section and user subroutines section
  − The start and end of the rules section are indicated using "%%"
  − The user subroutines section is optional

Programming the Scanner and Parser


• In the definitions section, the part enclosed by %{ and %} is copied as is into the generated C program
• C language comments can also be added outside the definitions section
• When comments are used outside the %{ and %} block, they must be indented with whitespace.
• Rules section
  − Maps patterns to actions
  − If an action consists of more than one statement, the statements are grouped with braces.
• User subroutines section
  − Contains the user's subroutines
  − The subroutine that calls yylex() is copied as is into the C program
• Internal variables of LEX/FLEX:
  − yylval: this variable contains the value of the token.
  − yyleng: this variable contains the length of the string the lexer has recognized.



• Internal variables of LEX/FLEX:
  − yyin: the input stream from which the lexer reads. By default yyin is set to stdin.
  − yylex(): function that runs the lexer.
  − yywrap(): function that yylex() calls when it reaches the end of the input file; it returns 1 when there is no more input.
  − input(), output() and unput(): input() reads the next character from the input, unput() pushes a character back onto the input, and output() writes a character to the output.
  − Start state: start states are declared using %s in the definitions section.
  − ECHO: this macro writes the matched text to the current output file yyout. It is equivalent to: fprintf(yyout, "%s", yytext);
  − REJECT: used as an action to put back the text matched by the pattern and look for the next best match.



• Programming with Yacc/Bison
  − Generates an LALR(1) parser
  − Being an LALR(1) parser, yacc can look only one token ahead, so a grammar that needs more than one token of look-ahead produces conflicts
  − The program structure in Yacc is similar to that of Lex
  − Definitions section: definitions, C code and associativity rules are specified.
  − Yacc calls the yylex routine repeatedly to get tokens and then applies the specified rules
  − As Lex returns tokens to Yacc, both programs need to agree on what the tokens are
• Definitions section in yacc: %token NUMBER
  − In the lex program:
    • extern int yylval;
    • %%
    • [0-9]+ {yylval = atoi(yytext); return NUMBER; }
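The contract between scanner and parser can be illustrated outside lex. The following is only a Python analogue, not lex code: it assumes a hypothetical generator `tokens` that yields (token, value) pairs, where the pair plays the role of the token code returned by yylex() together with the value left in yylval. Treating every other non-blank character as its own token is an assumption of this sketch.

```python
import re

# One alternative per rule: a run of digits is a NUMBER,
# any other non-blank character stands for itself.
TOKEN_PATTERN = re.compile(r"\s*(?:(?P<NUMBER>[0-9]+)|(?P<OP>.))")


def tokens(text):
    """Yield (token, value) pairs, mirroring `return NUMBER` plus yylval."""
    text = text.rstrip()   # guarantees a non-blank character remains to match
    pos = 0
    while pos < len(text):
        match = TOKEN_PATTERN.match(text, pos)
        pos = match.end()
        if match.group("NUMBER") is not None:
            # Like: yylval = atoi(yytext); return NUMBER;
            yield ("NUMBER", int(match.group("NUMBER")))
        else:
            yield (match.group("OP"), None)


result = list(tokens("12 + 7"))
print(result)  # [('NUMBER', 12), ('+', None), ('NUMBER', 7)]
```

The parser side only sees the token kind when choosing a rule; the attached value is what a semantic action later reads.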



• Programming with Yacc/Bison
  − In the yacc program, do the following:
    • Specify the variables.
      − %union {int ival; double cost;}
    • Connect the values to the returned tokens.
      − %token <ival> INDEX
      − %token <cost> NUMBER
    • Specify the type for the non-terminals. For example, the non-terminal expression carries a value of type cost.
      − %type <cost> expression
    • Associativity and precedence rules are specified in the definitions section of the yacc program.
      − %left '-' '+'
      − %left '*' '/'
      − %nonassoc UMINUS



• Programming with Yacc/Bison
  − expression: expression '+' NUMBER {$$ = $1 + $3; }
  − | expression '-' NUMBER {$$ = $1 - $3; }
  − ;
    • $1 represents the value of the first symbol on the right-hand side, $2 represents the operator and $3 represents the value of the third symbol. The left-hand side is represented using $$.
  − Using the union and yylval, only a single value can be passed between the lexer and the parser. So, use a symbol table to pass multiple values
  − Errors are reported using the yyerror() function.
  − When compiling the C programs generated by Lex and Yacc, we use the -ly option of the C compiler. The yacc library provides default main() and yyerror() functions.
• Compilation and execution on the Linux platform.
  − Compile the lex program: lex filename.l
  − Compile the yacc program: yacc -d filename.y
  − Compile the C program: gcc -o output y.tab.c lex.yy.c -ly -ll
  − Run the program: ./output



• Compilation and execution on the Windows platform.
  − Make sure that flex (flex.exe), bison (bison.exe) and tcc (Tiny C Compiler, or any C compiler) are installed.
  − Compile the lex program: flex filename.l
  − Compile the yacc program:
    • bison -d filename.y
    • bison -d filename.y -b y
  − Compile the generated C program (using Tiny C Compiler, tcc):
    • tcc -o output.exe y.tab.c lex.yy.c yyerror.c libyywrap.c yyinit.c main.c yyaccpt.c
  − Run the program: output.exe
• Programming with Accent and Amber
  − After writing the lex program, we need to write the accent program
  − Rules have a left-hand side and a right-hand side separated by a colon
  − The first symbol defined in the grammar is the start symbol, and the grammar is context-free



• Accent
  − Parameters can be specified as in (inherited attributes) and out (synthesized attributes), with "<" and ">" enclosing them
  − Statements written within %prelude { … } are copied literally into the generated C program

• Given the grammar (R stands for root, E stands for expression, T stands for term and F stands for factor; id is a terminal, represented by the token NUMBER):
  • R → E
  • E → E + T | T
  • T → T * F | F
  • F → (E) | id

%token NUMBER;
root: expression<n> { printf("Final = %d\n", n);};
expression<n>: expression<x> '+' term<y> { *n = x + y;} | term<n> ;
term<n> : term<x> '*' factor<y> { *n = x * y; } | factor<n>;
factor<n> : '(' expression<n> ')' | NUMBER<n> ;


• Programming with Accent
  − Compilation and execution on Linux
    • lex filename.l
    • accent filename.acc
    • gcc -o output yygrammar.c lex.yy.c entire.c
    • Check for ambiguity using Amber:
      − accent filename.acc
      − gcc -o output -O3 yygrammar.c amber.c
      − output examples 1000
  − Compilation and execution on Windows
    • flex filename.l
    • accent filename.acc
    • tcc -o output.exe yygrammar.c lex.yy.c entire.c yyerror.c libyywrap.c main.c yyinit.c yyaccpt.c
    • Check for ambiguity using Amber:
      − accent filename.acc
      − tcc -o output.exe yygrammar.c amber.c
      − output examples 1000

