P16CS32 – Compiler Design
Date : 18.08.2020
Topic : Introduction about Compiler
What is a Compiler?
o A compiler is a translator that converts a high-level language into machine language.
o The high-level language is written by a developer; the machine language can be understood by the processor.
o The compiler also reports errors in the program.
o The main purpose of a compiler is to translate code written in one language into another without changing the meaning of the program.
o When you execute a program written in an HLL programming language, execution happens in two parts.
o In the first part, the source program is compiled and translated into an object program (low-level language).
o In the second part, the object program is translated into the target program by the assembler.
Execution process of source program in compiler
Source Program → Object Program
Object Program → Target Program
Language Processing System
We have learnt that any computer system is made of hardware and software. The
hardware understands a language, which humans cannot understand. So we write programs in
high-level language, which is easier for us to understand and remember. These programs are
then fed into a series of tools and OS components to get the desired code that can be used by
the machine. This is known as Language Processing System.
Compiler
Assembler
The high-level language is converted into binary language in various phases. A compiler is a
program that converts high-level language to assembly language. Similarly, an assembler is a
program that converts the assembly language to machine-level language.
Let us first understand how a program, using C compiler, is executed on a host machine.
o User writes a program in C language (high-level language).
o The C compiler compiles the program and translates it to assembly program (low-
level language).
o An assembler then translates the assembly program into machine code (object).
o A linker tool is used to link all the parts of the program together for execution
(executable machine code).
o A loader loads all of them into memory and then the program is executed.
Before diving straight into the concepts of compilers, we should understand a few other
tools that work closely with compilers.
Preprocessor
A preprocessor, generally considered a part of the compiler, is a tool that produces input for the compiler. It deals with macro processing, augmentation, file inclusion, language extension, etc.
Interpreter
An interpreter, like a compiler, translates high-level language into low-level machine language. The difference lies in the way they read the source code or input. A compiler reads the whole source code at once, creates tokens, checks semantics, generates intermediate code, and translates the whole program, possibly in many passes. In contrast, an interpreter reads a statement from the input, converts it to intermediate code, executes it, then takes the next statement in sequence. If an error occurs, an interpreter stops execution and reports it, whereas a compiler reads and checks the whole program even if it encounters several errors.
Assembler
An assembler translates assembly language programs into machine code. The output
of an assembler is called an object file, which contains a combination of machine instructions
as well as the data required to place these instructions in memory.
Linker
A linker is a computer program that links and merges various object files together in order to make an executable file. These files may have been produced by separate assemblers. The major tasks of a linker are to search for and locate referenced modules/routines in a program and to determine the memory locations where these codes will be loaded, so that the program instructions have absolute references.
Compilers and Interpreters
• “Compilation”
– Translation of a program written in a source language into a semantically
equivalent program written in a target language
• “Interpretation”
– Performing the operations implied by the source program
Loader
The loader is a part of the operating system and is responsible for loading executable files into memory and executing them. It calculates the size of a program (instructions and data) and creates memory space for it. It initializes various registers to initiate execution.
Cross-compiler
A compiler that runs on platform (A) and is capable of generating executable code for
platform (B) is called a cross-compiler.
Source-to-source Compiler
A compiler that takes the source code of one programming language and translates
it into the source code of another programming language is called a source-to-source
compiler.
COMPILER DESIGN –ARCHITECTURE
A compiler can broadly be divided into two phases based on the way they compile.
Analysis Phase
Known as the front end of the compiler, the analysis phase reads the source program, divides it into core parts, and then checks for lexical, syntax, and semantic errors. The analysis phase generates an intermediate representation of the source program and a symbol table, which are fed to the synthesis phase as input.
Synthesis Phase
Known as the back-end of the compiler, the synthesis phase generates the target
program with the help of intermediate source code representation and symbol table.
A compiler can have many phases and passes.
o Pass : A pass refers to the traversal of a compiler through the entire program.
o Phase : A phase of a compiler is a distinguishable stage, which takes input from the
previous stage, processes and yields output that can be used as input for the next
stage. A pass can have more than one phase.
P16CS32 – Compiler Design
Date : 21.08.2020
Topic : Input Buffering, Specification of Tokens, Regular
Expressions, Transition Diagram
Input Buffering
The lexical analyzer (LA) scans the characters of the source program one at a time to discover tokens. Because a large amount of time can be consumed scanning characters, specialized buffering techniques have been developed to reduce the amount of overhead required to process an input character.
Buffering techniques:
1. Buffer pairs
2. Sentinels
The lexical analyzer scans the characters of the source program one at a time to discover tokens. Often, however, many characters beyond the next token may have to be examined before the next token itself can be determined. For this and other reasons, it is desirable for the lexical analyzer to read its input from an input buffer. The figure shows a buffer divided into two halves of, say, 100 characters each. One pointer marks the beginning of the token being discovered. A lookahead pointer scans ahead of the beginning point until the token is discovered. We view the position of each pointer as being between the character last read and the character next to be read.
In practice, each buffering scheme adopts one convention: a pointer is either at the symbol last read or at the symbol it is ready to read.
The distance which the lookahead pointer may have to travel past the actual token may be large. For example, in a PL/I program we may see DECLARE (ARG1, ARG2, … ARGn) without knowing whether DECLARE is a keyword or an array name until we see the character that follows the right parenthesis. In either case, the token itself ends at the second E. If the lookahead pointer travels beyond the buffer half in which it began, the other half must be loaded with the next characters from the source file. Since the buffer shown in the figure above is of limited size, there is an implied constraint on how much lookahead can be used before the next token is discovered. In the above example, if the lookahead traveled into the left half and all the way through the left half to the middle, we could not reload the right half, because we would lose characters that had not yet been grouped into tokens. While we can make the buffer larger if we choose, or use another buffering scheme, we cannot ignore the fact that lookahead is limited.
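The buffer-pair scheme with a sentinel can be sketched in Python. The class name, the half-buffer size, and the use of '\0' as the sentinel are assumptions of this example; production scanners implement this in C with raw pointers for speed:

```python
EOF = "\0"   # sentinel character, assumed not to occur in the source text
BUF = 16     # size of each buffer half (illustrative; real lexers use e.g. 4096)

class TwoBufferReader:
    """Buffer-pair input with a sentinel terminating each half."""
    def __init__(self, text):
        self.text, self.pos = text, 0       # stands in for the source file
        self.halves = [None, None]
        self.cur, self.i = 0, 0             # current half and index within it
        self._fill(0)

    def _fill(self, half):
        # Load the next BUF characters and append the sentinel.
        chunk = self.text[self.pos:self.pos + BUF]
        self.pos += len(chunk)
        self.halves[half] = chunk + EOF

    def next_char(self):
        c = self.halves[self.cur][self.i]
        if c != EOF:
            self.i += 1
            return c
        if self.i < BUF:                    # sentinel before the half is full:
            return EOF                      # real end of input
        other = 1 - self.cur                # end of this half: reload the other
        self._fill(other)
        self.cur, self.i = other, 0
        return self.next_char()
```

The sentinel lets the common case test only one condition (is this character the sentinel?) instead of two (end of half? end of input?).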
Specifications of Tokens
Let us understand how the language theory undertakes the following terms:
Alphabets
Any finite set of symbols is an alphabet: {0,1} is the set of binary alphabets, {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is the set of hexadecimal alphabets, and {a-z, A-Z} is the set of English-language alphabets.
Strings
Any finite sequence of alphabets is called a string. The length of a string is the total number of occurrences of alphabets; e.g., the length of the string tutorialspoint is 14, denoted |tutorialspoint| = 14. A string having no alphabets, i.e., a string of zero length, is known as an empty string and is denoted by ε (epsilon).
Special Symbols
A typical high-level language contains the following symbols:-
Arithmetic symbols : Addition (+), Subtraction (-), Modulo (%), Multiplication (*), Division (/)
Punctuation : Comma (,), Semicolon (;), Dot (.), Arrow (->)
Assignment : =
Special assignment : +=, /=, *=, -=
Comparison : ==, !=, <, <=, >, >=
Preprocessor : #
Language
A language is considered as a finite set of strings over some finite set of alphabets.
Computer languages are considered as finite sets, and mathematically set operations can be
performed on them. Finite languages can be described by means of regular expressions.
The lexical analyzer needs to scan and identify only a finite set of valid
string/token/lexeme that belong to the language in hand. It searches for the pattern defined by the
language rules.
REGULAR EXPRESSIONS
Regular expressions have the capability to express finite languages by defining a pattern
for finite strings of symbols. The grammar defined by regular expressions is known as regular
grammar.
The language defined by regular grammar is known as regular language.
Regular expression is an important notation for specifying patterns. Each pattern matches
a set of strings, so regular expressions serve as names for a set of strings. Programming language
tokens can be described by regular languages. The specification of regular expressions is an
example of a recursive definition. Regular languages are easy to understand and have efficient
implementation.
There are a number of algebraic laws that are obeyed by regular expressions, which can
be used to manipulate regular expressions into equivalent forms.
A token is either a single string or one of a collection of strings of a certain type. If we view the set of strings in each token class as a language, we can use the regular-expression notation to describe tokens.
Consider an identifier, which is defined to be a letter followed by zero or more letters or digits. In regular expression notation we would write:
Identifier = letter (letter | digit)*
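This definition can be tested directly with Python's re module; the pattern below is a direct transcription of letter (letter | digit)*, and fullmatch anchors it to the whole string:

```python
import re

# letter (letter | digit)* from the definition above
identifier = re.compile(r"[A-Za-z][A-Za-z0-9]*")

print(bool(identifier.fullmatch("count1")))   # True: a letter, then letters/digits
print(bool(identifier.fullmatch("1count")))   # False: must start with a letter
```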
Here are the rules that define the regular expression over alphabet.
Recognition of tokens:
We have learned how to express patterns using regular expressions. Now, we must study how to take the patterns for all the needed tokens and build a piece of code that examines the input string and finds a prefix that is a lexeme matching one of the patterns.
stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term
     | term
term → id
     | number
In addition, we assign the lexical analyzer the job of stripping out white space, by recognizing the "token" ws defined by:
ws → (blank | tab | newline)+
Here, blank, tab, and newline are abstract symbols that we use to express the ASCII characters of the same names. Token ws is different from the other tokens in that, when we recognize it, we do not return it to the parser, but rather restart the lexical analysis from the character that follows the white space. It is the following token that gets returned to the parser.
Representing valid tokens of a language in regular expression
If x is a regular expression, then:
• x* means zero or more occurrences of x,
o i.e., it can generate { ε, x, xx, xxx, xxxx, … }
• x+ means one or more occurrences of x,
o i.e., it can generate { x, xx, xxx, xxxx, … }, or x.x*
• x? means at most one occurrence of x,
o i.e., it can generate either {x} or {ε}.
Representing occurrence of symbols using regular expressions
letter = [a – z] or [A – Z]
digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 or [0-9]
sign = [ + | - ]
Representing language tokens using regular expressions
Decimal = (sign)?(digit)+
Identifier = (letter)(letter | digit)*
relop → < | > | <= | >= | == | <>

Lexeme   Token   Attribute
<        relop   LT
<=       relop   LE
==       relop   EQ
<>       relop   NE
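The patterns ws, id, number, and relop can be combined into a small scanner. The sketch below uses Python regular expressions; the group names, function name, and demo input are illustrative:

```python
import re

# One named alternative per token class. Listing <= before < makes the
# regex engine prefer the longer relop at the same position.
TOKEN = re.compile(r"""
    (?P<ws>[ \t\n]+)
  | (?P<relop><=|>=|==|<>|<|>)
  | (?P<id>[A-Za-z][A-Za-z0-9]*)
  | (?P<number>\d+)
""", re.VERBOSE)

def tokens(text):
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"illegal character {text[pos]!r}")
        pos = m.end()
        if m.lastgroup != "ws":       # ws is recognized but never returned
            yield (m.lastgroup, m.group())

print(list(tokens("count <= 100")))
# [('id', 'count'), ('relop', '<='), ('number', '100')]
```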
The only problem left with the lexical analyzer is how to verify the validity of a regular
expression used in specifying the patterns of keywords of a language. A well-accepted solution is
to use finite automata for verification.
TRANSITION DIAGRAM:
A transition diagram has a collection of nodes or circles, called states. Each state represents a condition that could occur during the process of scanning the input looking for a lexeme that matches one of several patterns.
Edges are directed from one state of the transition diagram to another. Each edge is labeled by a symbol or set of symbols.
If we are in some state s and the next input symbol is a, we look for an edge out of state s labeled by a. If we find such an edge, we advance the forward pointer and enter the state of the transition diagram to which that edge leads.
Some important conventions about transition diagrams are:
1. Certain states are said to be accepting, or final. These states indicate that a lexeme has been found, although the actual lexeme may not consist of all positions between the lexemeBegin and forward pointers. We always indicate an accepting state by a double circle.
2. In addition, if it is necessary to retract the forward pointer one position, then we shall additionally place a * near that accepting state.
3. One state is designated the start state, or initial state; it is indicated by an edge labeled "start" entering from nowhere. The transition diagram always begins in the start state, before any input symbols have been used.
As an intermediate step in the construction of a lexical analyzer, we first produce a stylized flowchart, called a transition diagram. Positions in a transition diagram are drawn as circles and are called states.
The above TD is for an identifier, defined to be a letter followed by any number of letters or digits. A sequence of transition diagrams can be converted into a program to look for the tokens specified by the diagrams. Each state gets a segment of code.
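A sketch of how each state "gets a segment of code", here collapsed into a single table-driven loop in Python; the state numbers and function names are illustrative:

```python
# Transition diagram for: letter (letter | digit)*
# state 0 --letter--> state 1; state 1 --letter/digit--> state 1; 1 is accepting.
def kind(c):
    if c.isalpha():
        return "letter"
    if c.isdigit():
        return "digit"
    return "other"

TABLE = {
    (0, "letter"): 1,
    (1, "letter"): 1,
    (1, "digit"): 1,
}

def scan_identifier(text, begin):
    """Return the identifier lexeme starting at `begin`, or None."""
    state, forward = 0, begin
    while forward < len(text) and (state, kind(text[forward])) in TABLE:
        state = TABLE[(state, kind(text[forward]))]
        forward += 1
    # The forward pointer stops on the first character past the lexeme.
    return text[begin:forward] if state == 1 else None

print(scan_identifier("sum1 = 0", 0))   # 'sum1' (stops before the blank)
```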
FINITE AUTOMATA
• A recognizer for a language is a program that takes a string x, and answers “yes” if x is a
sentence of that language, and “no” otherwise.
• We call the recognizer of the tokens as a finite automaton.
Finite set of states (Q)
Finite set of input symbols (Σ)
One Start state (q0)
Set of final states (qf)
Transition function (δ)
The transition function (δ) maps a state and an input symbol to a state: δ : Q × Σ → Q.
Finite Automata Construction
Let L(r) be a regular language recognized by some finite automata (FA).
States : States of FA are represented by circles. State names are written inside circles.
Start state : The state from where the automata starts is known as the start state. Start state
has an arrow pointed towards it.
Intermediate states : All intermediate states have at least two arrows; one pointing to and
another pointing out from them.
Final state : If the input string is successfully parsed, the automaton is expected to be in this state. A final state is represented by double circles. It may have any odd number of arrows pointing to it and an even number of arrows pointing out from it; the number of incoming arrows is one greater than the number of outgoing arrows, i.e., odd = even + 1.
Transition : The transition from one state to another state happens when a desired symbol in
the input is found. Upon transition, automata can either move to the next state or stay in the same
state. Movement from one state to another is shown as a directed arrow, where the arrows point
to the destination state. If automata stays on the same state, an arrow pointing from a state to
itself is drawn.
Example : We assume an FA that accepts any three-digit binary value ending in the digit 1. FA = {Q(q0, qf), Σ(0,1), q0, qf, δ}
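A sketch of this FA as a transition table in Python; the state names q0, q1, q2, qf and the explicit dead state are illustrative:

```python
# DFA sketch: exactly three binary digits, the last of which is 1.
# q0..q2 count the digits read; qf accepts; any other move rejects.
DELTA = {
    ("q0", "0"): "q1", ("q0", "1"): "q1",
    ("q1", "0"): "q2", ("q1", "1"): "q2",
    ("q2", "0"): "dead", ("q2", "1"): "qf",
}

def accepts(s):
    state = "q0"
    for c in s:
        # Missing entries (e.g. a fourth digit) fall into the dead state.
        state = DELTA.get((state, c), "dead")
    return state == "qf"

print(accepts("101"), accepts("100"), accepts("1011"))  # True False False
```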
TYPES OF FINITE AUTOMATON
• A finite automaton can be: deterministic (DFA) or non-deterministic (NFA)
• This means that we may use a deterministic or non-deterministic automaton as a lexical
analyzer.
• Both deterministic and non-deterministic finite automaton recognize regular sets.
– deterministic – faster recognizer, but it may take more space
– non-deterministic – slower, but it may take less space
– Deterministic automata are widely used in lexical analyzers.
• First, we define regular expressions for tokens; Then we convert them into a DFA to get a
lexical analyzer for our tokens.
Non-Deterministic Finite Automaton (NFA)
• A non-deterministic finite automaton (NFA) is a mathematical model that consists of:
o S - a set of states
o Σ - a set of input symbols (alphabet)
o move - a transition function move to map state-symbol pairs to sets of states.
o s0 - a start (initial) state
o F- a set of accepting states (final states)
• ε- transitions are allowed in NFAs. In other words, we can move from one state to
another one without consuming any symbol.
• An NFA accepts a string x if and only if there is a path from the starting state to one of the accepting states such that the edge labels along this path spell out x.
Example:
Deterministic Finite Automaton (DFA)
• A Deterministic Finite Automaton (DFA) is a special form of a NFA.
• No state has ε- transition
• For each symbol a and state s, there is at most one labeled edge a leaving s. i.e. transition
function is from pair of state-symbol to state (not set of states)
Example:
P16CS32 – Compiler Design
Date : 20.08.2020
Topic : Lexical Analysis
What is Lexical Analysis?
Lexical analysis is the first phase of the compiler, also known as scanning. It converts the high-level input program into a sequence of tokens.
The lexical analyzer breaks the source text into a series of tokens. It removes any extra spaces or comments written in the source code.
Programs that perform lexical analysis are called lexical analyzers or lexers. A lexer contains a tokenizer or scanner. If the lexical analyzer detects that a token is invalid, it generates an error. It reads character streams from the source code, checks for legal tokens, and passes the data to the syntax analyzer when it demands it.
LEXICAL ANALYSIS
• reads and converts the input into a stream of tokens to be analyzed by parser.
• lexeme : a sequence of characters which comprises a single token.
• Lexical Analyzer →Lexeme / Token → Parser
Removal of White Space and Comments
• Remove white space(blank, tab, new line etc.) and comments
Constants
• Constants: For a while, consider only integers
What's a lexeme?
A lexeme is a sequence of characters that are included in the source program according to
the matching pattern of a token. It is nothing but an instance of a token.
In other words, the sequence of characters matched by a pattern to form the corresponding token, or a sequence of input characters that comprises a single token, is called a lexeme.
eg- “float”, “abs_zero_Kelvin”, “=”, “-”, “273”, “;” .
What's a token?
The token is a sequence of characters which represents a unit of information in the source
program.
Example of tokens:
• Type token (id, number, real, . . . )
• Punctuation tokens (IF, void, return, . . . )
• Alphabetic tokens (keywords)
• Keywords : Examples - for, while, if, etc.
• Identifiers : Examples - variable names, function names, etc.
• Operators : Examples - '+', '++', '-', etc.
• Separators : Examples - ';', ',', etc.
(eg.) c=a+b*5;
Lexemes and tokens
Lexemes Tokens
c identifier
= assignment symbol
a identifier
+ + (addition symbol)
b identifier
* * (multiplication symbol)
5 5 (number)
What is Pattern?
A pattern is a description of the form that the lexemes of a token may take. In the case of a keyword used as a token, the pattern is simply the sequence of characters that forms the keyword.
• Examples 1:
for input 31 + 28, output(token representation)?
input : 31 + 28
output: <num, 31> <+, > <num, 28>
num, + : tokens
31, 28 : attribute values (or lexemes) of the integer token num
• Examples 2:
input : count = count + increment;
output : <id,1> <=, > <id,1> <+, > <id, 2>;
Symbol table
tokens    attributes (lexemes)
<id,1>    count
<id,2>    increment
How Lexical Analyzer functions
1. Tokenization i.e. Dividing the program into valid tokens.
2. Remove white space characters.
3. Remove comments.
4. It also provides help in generating error messages by providing row numbers and column
numbers.
For example, consider the program
int main()
{
// 2 variables
int a, b; a = 10; return 0;
}
'int' 'main' '(' ')' '{' 'int' 'a' ',' 'b' ';'
'a' '=' '10' ';' 'return' '0' ';' '}'
Above are the valid tokens.
As another example, consider the printf statement below; each valid token is numbered:

printf ( "WELCOME" ) ;
   1   2     3     4 5

There are 5 valid tokens in this printf statement.
Exercise 1:
Count number of tokens :
int main()
{
int a = 10, b = 20;
printf("sum is :%d",a+b);
return 0;
}
Answer: Total number of token: 27.
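The answer can be checked mechanically. The rough regex tokenizer below is an illustration, not a full C lexer; it follows the convention used here of counting a string literal as one token:

```python
import re

CODE = '''int main()
{
int a = 10, b = 20;
printf("sum is :%d",a+b);
return 0;
}'''

# One alternative per token class; whitespace matches no alternative
# and is therefore skipped by findall.
TOKEN = re.compile(r'"[^"]*"|[A-Za-z_]\w*|\d+|==|[-+*/=;,(){}]')
toks = TOKEN.findall(CODE)
print(len(toks))  # 27
```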
Exercise 2:
Count number of tokens :
int max(int i);
• The lexical analyzer first reads int, finds it to be valid, and accepts it as a token.
• max is read next and found to be a valid function name after reading (.
• int is also a token, then i as another token, and finally ;.
Answer: Total number of tokens: 7
int, max, (, int, i, ), ;
Lexical Analyzer Architecture: OR The role of the Lexical Analyzer:-
The main task of lexical analysis is to read input characters in the code and produce
tokens.
The lexical analyzer scans the entire source code of the program. It identifies each token one by one.
Scanners are usually implemented to produce tokens only when requested by a parser. Here is how this works:
(Figure: the parser issues getNextToken to the lexical analyzer, which reads the source program and returns the next token; both components consult the symbol table, and the parser's output goes on to semantic analysis.)
1. "Get next token" is a command which is sent from the parser to the lexical analyzer.
2. On receiving this command, the lexical analyzer scans the input until it finds the next
token.
3. It returns the token to Parser.
Lexical Analyzer skips whitespaces and comments while creating these tokens. If any error is
present, then Lexical analyzer will correlate that error with the source file and line number.
The lexical analyzer performs the tasks given below:
• Helps to enter identified tokens into the symbol table
• Removes white spaces and comments from the source program
• Correlates error messages with the source program
• Helps to expand macros if they are found in the source program
• Reads input characters from the source program
Example of Lexical Analysis, Tokens, Non-Tokens
Consider the following code that is fed to Lexical Analyzer
#include <stdio.h>
int maximum(int x, int y) {
// This will compare 2 numbers
if (x > y)
return x;
else {
return y;
}
}
Examples of Tokens created
Lexeme    Token
int       Keyword
maximum   Identifier
(         Operator
int       Keyword
x         Identifier
,         Operator
int       Keyword
y         Identifier
)         Operator
{         Operator
if        Keyword
Examples of Nontokens
Type Examples
Comment // This will compare 2 numbers
Pre-processor directive #include <stdio.h>
Pre-processor directive #define NUMS 8,9
Macro NUMS
Whitespace \n \b \t
Need of Lexical Analyzer
• Simplicity of design of the compiler: the removal of white spaces and comments lets the syntax analyzer work on clean syntactic constructs.
• Compiler efficiency is improved: specialized buffering techniques for reading characters speed up the compilation process.
• Compiler portability is enhanced
Issues in Lexical Analysis
Lexical analysis is the process of producing tokens from the source program. It has the following
issues:
• Lookahead
• Ambiguities
Lookahead
Lookahead is required to decide when one token will end and the next token will begin. Simple examples with lookahead issues are i vs. if, and = vs. ==. Therefore a way to describe the lexemes of each token is required.
A way is needed to resolve ambiguities:
• Is if two variables i and f, or the keyword if?
• Is == two equal signs =, =, or a single ==?
• arr(5, 4) vs. fn(5, 4) in Ada (array reference syntax and function call syntax are similar).
Hence, the amount of lookahead to be considered, and a way to describe the lexemes of each token, are also needed.
Regular expressions are one of the most popular ways of representing tokens.
Ambiguities
The lexical analysis programs written with lex accept ambiguous specifications and choose the
longest match possible at each input point. Lex can handle ambiguous specifications. When more
than one expression can match the current input, lex chooses as follows:
• The longest match is preferred.
• Among rules which matched the same number of characters, the rule given first is preferred.
Lexical Errors
• A character sequence that cannot be scanned into any valid token is a lexical error.
• Lexical errors are uncommon, but they still must be handled by a scanner.
• Misspelling of identifiers, keywords, or operators is considered a lexical error.
Usually, a lexical error is caused by the appearance of some illegal character, mostly at the
beginning of a token.
Lexical error handling approaches
Lexical errors can be handled by the following actions:
• Deleting one character from the remaining input.
• Inserting a missing character into the remaining input.
• Replacing a character by another character.
• Transposing two adjacent characters.
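These four repair actions can be sketched as a candidate generator in Python, which tries every single-character edit of an invalid lexeme and keeps those that form a known token; the function name and the keyword set are illustrative:

```python
def single_edit_repairs(s, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """All strings reachable from s by one delete, insert, replace, or transpose."""
    out = set()
    for i in range(len(s)):
        out.add(s[:i] + s[i + 1:])                 # delete one character
        for c in alphabet:
            out.add(s[:i] + c + s[i + 1:])         # replace one character
    for i in range(len(s) + 1):
        for c in alphabet:
            out.add(s[:i] + c + s[i:])             # insert one character
    for i in range(len(s) - 1):
        out.add(s[:i] + s[i + 1] + s[i] + s[i + 2:])  # transpose adjacent pair
    return out

keywords = {"if", "while", "for", "return"}
print(single_edit_repairs("whille") & keywords)    # {'while'}
```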
Error Recovery Schemes
• Panic mode recovery
• Local correction
❖ Source text is changed around the error point in order to get a correct text.
❖ Analyzer will be restarted with the resultant new text as input.
• Global correction
❖ It is an enhanced panic mode recovery.
❖ Preferred when local correction fails.
Panic mode recovery
In panic mode recovery, unmatched patterns are deleted from the remaining input, until the
lexical analyzer can find a well-formed token at the beginning of what input is left.
(e.g.) For instance, suppose the string fi is encountered for the first time in a C program in the context:
fi (a == f(x))
A lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier. Since fi is a valid lexeme for the token id, the lexical analyzer will return the token id to the parser.
Local correction
Local correction performs deletion/insertion and/or replacement of any number of symbols at the error detection point.
(e.g.) In Pascal, given c[i] '=';, the scanner deletes the first quote because it cannot legally follow the closing bracket, and the parser replaces the resulting '=' by an assignment statement.
Most of the errors are corrected by local correction.
(eg.) The effects of lexical error recovery might well create a later syntax error, handled by the
parser. Consider
· · · for $tnight · · ·
The $ terminates scanning of for. Since no valid token begins with $, it is deleted. Then tnight is
scanned as an identifier.
In effect, it results in
· · · fortnight · · ·
which will cause a syntax error. Such false errors are unavoidable, though a syntactic error-repair may help.
Advantages of Lexical analysis
• The lexical analyzer method is used by programs like compilers, which can use the parsed data from a programmer's code to create a compiled binary executable.
• It is used by web browsers to format and display a web page with the help of parsed data from JavaScript, HTML, and CSS.
• A separate lexical analyzer helps you to construct a specialized and potentially more efficient processor for the task.
Disadvantage of Lexical analysis
• You need to spend significant time reading the source program and partitioning it in the
form of tokens
• Some regular expressions are quite difficult to understand compared to PEG or EBNF
rules
• More effort is needed to develop and debug the lexer and its token descriptions
• Additional runtime overhead is required to generate the lexer tables and construct the
tokens
P16CS32 – Compiler Design
Date : 24.08.2020
Topic : Finite Automata and Conversion NFA into DFA
FINITE AUTOMATA
• A recognizer for a language is a program that takes a string x, and answers "yes" if x is a sentence of that language, and "no" otherwise.
• We call the recognizer of the tokens a finite automaton, with a finite set of states (Q), a finite set of input symbols (Σ), one start state (q0), a set of final states (qf), and a transition function (δ).
• In other words, lexical analysis is essentially a process of recognizing different tokens from the source program. This process of recognition can be accomplished by building a classical model called a Finite State Machine (FSM) or a Finite Automaton (FA).
Conversion of NFA to DFA
We can convert from an NFA to DFA using subset Construction. To perform this operation, let
us define two functions:
• The €- closure function takes a state and returns the set of states reachable from it based on
(one or more) €-Transitions. Note that this will always include the state itself. We should be
able to get from a state to any state in its €- closure without consuming any input.
• The function move takes a state and a character, and returns the states reachable by one
transition on this character.
We can generate both these functions to apply t sets of states by taking the union of the
application to individual states.
Eg. If A, B and C are states, move ({A,B,C},’a’) = move(A,’a’) U move(B,’a’) U move(C,’a’).
ALGORITHM: (The subset construction of a DFA from an NFA)
INPUT: An NFA N.
OUTPUT: A DFA D accepting the same language as N.
Method:
OPERATION       DESCRIPTION
ε-closure(s)    Set of NFA states reachable from NFA state s on ε-transitions alone.
ε-closure(T)    Set of NFA states reachable from some NFA state s in set T on ε-transitions
                alone; = ∪ (for s in T) ε-closure(s).
move(T,a)       Set of NFA states to which there is a transition on input symbol a from
                some state s in T.
The start state of D is ε-closure(s0), and the accepting states of D are all those sets of
N's states that include at least one accepting state of N. Initially, ε-closure(s0) is the
only state in Dstates, and it is unmarked.
The subset construction algorithm
1. Create the start state of the DFA by taking the ε-closure of the start state of the NFA.
2. Perform the following for the new DFA state:
   For each possible input symbol:
     ▪ Apply move to the newly created state and the input symbol; this will
       return a set of states.
     ▪ Apply the ε-closure to this set of states, possibly resulting in a new set.
     ▪ This set of NFA states will be a single state in the DFA.
3. Each time we generate a new DFA state, we must apply step 2 to it. The process is
   complete when applying step 2 does not yield any new states.
4. The accepting states of the DFA are those which contain any of the accepting states of the NFA.
while (there is an unmarked state T in Dstates)
{
    mark T;
    for (each input symbol a)
    {
        U = ε-closure(move(T, a));
        if (U is not in Dstates)
            add U as an unmarked state to Dstates;
        Dtran[T, a] = U;
    }
}
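The loop above can be sketched as a small Python function, reusing the dict encoding of an NFA from before (a (state, symbol) → set-of-states dict, empty string for ε). DFA states are represented as frozensets of NFA states; all names are illustrative.

```python
EPS = ""

def eps_closure(T, delta):
    """States reachable from the set T on epsilon edges alone."""
    stack, closure = list(T), set(T)
    while stack:
        t = stack.pop()
        for u in delta.get((t, EPS), set()):
            if u not in closure:
                closure.add(u)
                stack.append(u)
    return frozenset(closure)

def subset_construction(delta, s0, nfa_final, alphabet):
    """Return (start state, transition table, accepting states) of the DFA."""
    start = eps_closure({s0}, delta)
    dstates, unmarked, dtran = {start}, [start], {}
    while unmarked:                      # while there is an unmarked state T
        T = unmarked.pop()               # mark T
        for a in alphabet:
            moved = set()
            for s in T:                  # move(T, a)
                moved |= delta.get((s, a), set())
            U = eps_closure(moved, delta)
            if U and U not in dstates:
                dstates.add(U)
                unmarked.append(U)       # add U as an unmarked state
            if U:
                dtran[(T, a)] = U
    dfinal = {D for D in dstates if D & nfa_final}
    return start, dtran, dfinal
```

Running this on the (a | b)*abb NFA used later in these notes reproduces the five DFA states A–E of the worked example.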
Computing ε-closure(T)
push all states of T onto stack;
initialize ε-closure(T) to T;
while (stack is not empty)
{
    pop t, the top element, off stack;
    for (each state u with an edge from t to u labeled ε)
        if (u is not in ε-closure(T))
        {
            add u to ε-closure(T);
            push u onto stack;
        }
}
Simulating an NFA
S = ε-closure(s0);
c = nextChar();
while (c != eof)
{
    S = ε-closure(move(S, c));
    c = nextChar();
}
if (S ∩ F != ∅) return "yes";
else return "no";
Conversion of NFA into DFA:
The algorithm used for constructing a DFA from an NFA is called the subset construction
algorithm.
NFA N for (a | b)*abb (input: NFA, output: DFA)
[NFA transition diagram omitted. Its states are 0–10: state 0 has ε-edges to states 1 and 7;
state 1 has ε-edges to 2 and 4; 2 –a→ 3 and 4 –b→ 5; states 3 and 5 have ε-edges to 6;
state 6 has ε-edges back to 1 and 7; and 7 –a→ 8 –b→ 9 –b→ 10, with state 10 the accepting
state.]
The starting state of the DFA is ε-closure(s0):
ε-closure(0) = {0, 1, 2, 4, 7} = A. This state is called unmarked.

1. Mark state A = {0, 1, 2, 4, 7}:
   For input symbol a: move(A, a) = {3, 8};
   ε-closure({3, 8}) = {1, 2, 3, 4, 6, 7, 8} = B (new, unmarked)
   For input symbol b: move(A, b) = {5};
   ε-closure({5}) = {1, 2, 4, 5, 6, 7} = C (new, unmarked)

2. Mark state B = {1, 2, 3, 4, 6, 7, 8}:
   For input symbol a: move(B, a) = {3, 8};
   ε-closure({3, 8}) = {1, 2, 3, 4, 6, 7, 8} = B
   For input symbol b: move(B, b) = {5, 9};
   ε-closure({5, 9}) = {1, 2, 4, 5, 6, 7, 9} = D (new, unmarked)

3. Mark state C = {1, 2, 4, 5, 6, 7}:
   For input symbol a: move(C, a) = {3, 8};
   ε-closure({3, 8}) = {1, 2, 3, 4, 6, 7, 8} = B
   For input symbol b: move(C, b) = {5};
   ε-closure({5}) = {1, 2, 4, 5, 6, 7} = C

4. Mark state D = {1, 2, 4, 5, 6, 7, 9}:
   For input symbol a: move(D, a) = {3, 8};
   ε-closure({3, 8}) = {1, 2, 3, 4, 6, 7, 8} = B
   For input symbol b: move(D, b) = {5, 10};
   ε-closure({5, 10}) = {1, 2, 4, 5, 6, 7, 10} = E (new, unmarked)

5. Mark state E = {1, 2, 4, 5, 6, 7, 10}:
   For input symbol a: move(E, a) = {3, 8};
   ε-closure({3, 8}) = {1, 2, 3, 4, 6, 7, 8} = B
   For input symbol b: move(E, b) = {5};
   ε-closure({5}) = {1, 2, 4, 5, 6, 7} = C

[Intermediate DFA transition diagrams omitted; the completed DFA is summarized in the
transition table below.]
Transition table:
If we continue this process until no unmarked sets remain, all the states of the DFA are
marked. State A is the start state, and state E, which contains state 10 of the NFA, is the
only accepting state. Note that this DFA has one more state than the minimal DFA: states A
and C have the same move function, and so can be merged. The transition table below shows
the DFA for (a | b)*abb.
NFA State            DFA State   a   b
{0,1,2,4,7}          A           B   C
{1,2,3,4,6,7,8}      B           B   D
{1,2,4,5,6,7}        C           B   C
{1,2,4,5,6,7,9}      D           B   E
{1,2,4,5,6,7,10}     E           B   C

Difference between NFA and DFA
1. Every DFA is an NFA, but not vice versa.
2. NFA and DFA have the same power: each NFA can be translated into an equivalent DFA.
3. There can be multiple final states in both DFA and NFA.
4. NFA is more of a theoretical concept; the DFA is what is used for lexical analysis in a
compiler.
5. The transition function of a non-deterministic finite automaton (delta) is multi-valued,
whereas for a DFA it is single-valued.
6. Checking membership is easy with a deterministic finite automaton, whereas it is harder
with a non-deterministic finite automaton.
7. Construction of a non-deterministic finite automaton is very easy, whereas construction
of a DFA is more difficult.
8. The space required for a deterministic finite automaton may be more, whereas for a
non-deterministic finite automaton it may be less.
9. Backtracking is allowed in a deterministic finite automaton, but it is not possible in
every case in a non-deterministic finite automaton.
10. In a DFA, every state has exactly one transition for each input symbol; in an NFA, a
state may have zero, one, or several transitions on a symbol.
11. An NFA built by Thompson's construction has only one final state, but the DFA obtained
from it can have more than one final state.
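The transition table computed above can be exercised directly. A hypothetical Python driver for the resulting DFA ('A' is the start state and 'E' the only accepting state; the encoding is invented for this sketch):

```python
# The DFA transition table for (a | b)*abb, written as a Python dict.
dtran = {
    ("A", "a"): "B", ("A", "b"): "C",
    ("B", "a"): "B", ("B", "b"): "D",
    ("C", "a"): "B", ("C", "b"): "C",
    ("D", "a"): "B", ("D", "b"): "E",
    ("E", "a"): "B", ("E", "b"): "C",
}

def run_dfa(text, start="A", accepting=("E",)):
    """Follow exactly one edge per input character; accept if the run
    ends in an accepting state."""
    state = start
    for c in text:
        state = dtran[(state, c)]
    return state in accepting
```

Unlike the NFA simulation, each step is a single table lookup — this is the "faster recognizer, more space" side of the trade-off mentioned at the start of the section.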
P16CS32 – Compiler Design
Date : 19.08.2020 Topic: Phases of the Compiler
Cousins of the Compiler
1. Pre Processor
The output of a preprocessor may be given as the input to a compiler. The tasks performed
by the preprocessor are given below:
Macro Processing – A preprocessor allows the user to define macros in the program. A macro is
a set of instructions which can be used repeatedly in the program. Macro processing is thus a
task done by preprocessors.
File Inclusion – The preprocessor also allows the user to include the header files which may
be required by the program. For example: #include <stdio.h>
By this statement the header file stdio.h is included, and the user can make use of the
functions defined in this header file. This task of the preprocessor is called file inclusion.
Rational Preprocessors – These preprocessors augment older languages with modern
flow-of-control and data-structuring facilities. Such preprocessors provide the user with
built-in macros for constructs like while-loops or if-statements.
2. ASSEMBLER
Programmers found it difficult to write or read programs in machine language. They
began to use a mnemonic (symbol) for each machine instruction, which they would
subsequently translate into machine language. Such a mnemonic machine language is now called
an assembly language.
Programs known as assemblers were written to automate the translation of assembly
language into machine language. The input to an assembler program is called the source
program; the output is a machine language translation (object program).
3. Loader and Link Editors
Linker is a computer program that links and merges various object files together in order
to make an executable file. All these files might have been compiled by separate assemblers.
The major task of a linker is to search and locate referenced modules/routines in a program
and to determine the memory locations where these codes will be loaded, making the program
instructions have absolute references.
Loader is a part of the operating system and is responsible for loading executable files
into memory and executing them. It calculates the size of a program (instructions and data)
and creates memory space for it. It initializes various registers to initiate execution.
INTERPRETER
An interpreter is a program that appears to execute a source program as if it were
machine language.
Languages such as BASIC, SNOBOL, and LISP can be translated using interpreters. Java also
uses an interpreter.
The process of interpretation can be carried out in following phases.
o Lexical analysis
o Syntax analysis
o Semantic analysis
o Direct Execution
Advantages:
o Modification of the user program can easily be made and implemented as execution
proceeds.
o The types of objects that a variable denotes may change dynamically.
o Debugging a program and finding errors is a simplified task for a program used for
interpretation.
o The interpreter for the language makes it machine independent.
Disadvantages:
o The execution of the program is slower.
o Memory consumption is more.
TRANSLATOR
A translator is a program that takes as input a program written in one language and
produces as output a program in another language. Besides program translation, the translator
performs another very important role: error detection. Any violation of the HLL specification
is detected and reported to the programmer. The important roles of a translator are:
1. Translating the HLL program input into an equivalent machine language program.
2. Providing diagnostic messages wherever the programmer violates the specification of the HLL.
LIST OF COMPILERS
1. Ada compilers
2. ALGOL compilers
3. BASIC compilers
4. C# compilers
5. C compilers
6. C++ compilers
7. COBOL compilers
8. Common Lisp compilers
9. ECMAScript interpreters
10. Fortran compilers
11. Java compilers
12. Pascal compilers
13. PL/I compilers
14. Python compilers
15. Smalltalk compilers
PHASES OF COMPILER
A compiler operates in phases. A phase is a logically interrelated operation that takes the
source program in one representation and produces output in another representation.
There are two phases of compilation.
o Analysis (Machine Independent / Language Dependent) – Front end
o Synthesis (Machine Dependent / Language Independent) – Back end
Compiler passes:
A collection of phases is done only once (single pass) or multiple times (multi pass)
• Single pass: usually requires everything to be defined before being
used in source program
• Multi pass: compiler may have to keep entire program representation
in memory
The compilation process is partitioned into a number of sub-processes called 'phases'.
A compiler can have many phases and passes.
Pass : A pass refers to the traversal of a compiler through the entire program.
Phase : A phase of a compiler is a distinguishable stage, which takes input from the previous
stage, processes and yields output that can be used as input for the next stage. A pass can
have more than one phase.
STRUCTURE OF THE COMPILER DESIGN
The compilation process is a sequence of various phases. Each phase takes input from its
previous stage, has its own representation of source program, and feeds its output to the next
phase of the compiler. Let us understand the phases of a compiler.
Lexical Analysis:-
The first phase, the scanner, works as a text scanner. This phase scans the source code as a
stream of characters and converts it into meaningful lexemes. The lexical analyzer represents
these lexemes in the form of tokens as:
<token-name, attribute-value>
The LA or scanner reads the source program one character at a time, carving the source
program into a sequence of atomic units called tokens.
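The <token-name, attribute-value> pairs can be illustrated with a toy scanner. The token set and the regular expressions below are invented for this example (note that `finditer` simply skips over characters matched by no pattern, which a real scanner would report as errors):

```python
import re

# Hypothetical token classes; order matters (id before op, etc.)
TOKEN_SPEC = [
    ("id",   r"[A-Za-z_]\w*"),   # identifiers
    ("num",  r"\d+"),            # integer literals
    ("op",   r"[+\-*/=]"),       # single-character operators
    ("skip", r"\s+"),            # whitespace, discarded
]
PATTERN = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(src):
    """Carve `src` into a sequence of (token-name, attribute-value) pairs."""
    tokens = []
    for m in PATTERN.finditer(src):
        if m.lastgroup != "skip":
            tokens.append((m.lastgroup, m.group()))
    return tokens
```

For example, `tokenize("sum = a + 42")` yields the pairs ("id","sum"), ("op","="), ("id","a"), ("op","+"), ("num","42") — exactly the stream of tokens the parser consumes next.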
Syntax Analysis:-
The next phase is called the syntax analysis or parsing. It takes the token produced by
lexical analysis as input and generates a parse tree (or syntax tree). In this phase, token
arrangements are checked against the source code grammar, i.e., the parser checks if the
expression made by the tokens is syntactically correct.
Semantic Analysis:-
Semantic analysis checks whether the parse tree constructed follows the rules of the
language; for example, that values are assigned between compatible data types, and that a
string is not added to an integer. The semantic analyzer also keeps track of identifiers,
their types and expressions, and whether identifiers are declared before use. The semantic
analyzer produces an annotated syntax tree as output.
Intermediate Code Generations:-
The intermediate code generation uses the structure produced by the syntax analyzer to
create a stream of simple instructions. Many styles of intermediate code are possible. One
common style uses instructions with one operator and a small number of operands. The output of
the syntax analyzer is some representation of a parse tree. The intermediate code generation
phase transforms this parse tree into an intermediate language representation of the source
program.
After semantic analysis, the compiler generates an intermediate code of the source code
for the target machine. It represents a program for some abstract machine. It is in between the
high-level language and the machine language. This intermediate code should be generated in
such a way that it makes it easier to be translated into the target machine code.
An intermediate representation of the final machine language code is produced. This
phase bridges the analysis and synthesis phases of translation.
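The "one operator, few operands" style mentioned above is three-address code. A toy generator can sketch how a syntax tree is flattened into it; the tuple encoding of the tree and all names are invented for this illustration.

```python
import itertools

_temps = itertools.count(1)  # fresh temporary names t1, t2, ...

def gen_tac(node, code):
    """Flatten a nested (op, left, right) tuple into three-address code,
    appending instructions to `code` and returning the name holding the
    node's value."""
    if isinstance(node, str):        # a leaf: identifier or constant
        return node
    op, left, right = node
    l = gen_tac(left, code)
    r = gen_tac(right, code)
    temp = f"t{next(_temps)}"
    code.append(f"{temp} = {l} {op} {r}")
    return temp

# b + c * 60, as a nested tuple standing in for a syntax tree
code = []
root = gen_tac(("+", "b", ("*", "c", "60")), code)
```

After this runs, `code` holds the two instructions "t1 = c * 60" and "t2 = b + t1": each has one operator and at most three addresses, which is what makes the representation easy to optimize and to translate to machine code.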
Code Optimization:-
The next phase does code optimization of the intermediate code. Optimization can be
assumed as something that removes unnecessary code lines, and arranges the sequence of
statements in order to speed up the program execution without wasting resources (CPU,
memory). This optional phase improves the intermediate code so that the output runs faster
and takes less space. Its output is another intermediate-code program that does the same job
as the original, but in a way that saves time and/or space.
a. Local Optimization:-
There are local transformations that can be applied to a program to make an
improvement. For example,
If A > B goto L2
Goto L3
L2 :
This can be replaced by the single statement
If A <= B goto L3
Another important local optimization is the elimination of common sub-expressions
A := B + C + D
E := B + C + F
Might be evaluated as
T1 := B + C
A := T1 + D
E := T1 + F
taking advantage of the common sub-expression B + C.
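Common sub-expression elimination can be sketched on three-address code directly. The tuple encoding (dest, op, arg1, arg2) and the helper below are invented for this illustration; a real optimizer would also have to respect reassignments within the block.

```python
def eliminate_cse(code):
    """Local CSE on a list of (dest, op, arg1, arg2) tuples: drop any
    instruction whose (op, arg1, arg2) was already computed, and rewrite
    later uses of its destination to the earlier result."""
    seen = {}    # (op, arg1, arg2) -> name already holding that value
    alias = {}   # redundant destination -> replacement name
    out = []
    for dest, op, a1, a2 in code:
        a1, a2 = alias.get(a1, a1), alias.get(a2, a2)  # use earlier temps
        key = (op, a1, a2)
        if key in seen:
            alias[dest] = seen[key]      # dest is redundant; drop it
        else:
            seen[key] = dest
            out.append((dest, op, a1, a2))
    return out
```

Applied to the naive code for the example above (T1 := B + C; A := T1 + D; T2 := B + C; E := T2 + F), it removes the second B + C and rewrites E to use T1, matching the optimized form shown in the text.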
b. Loop Optimization:-
Another important source of optimization concerns increasing the speed of loops.
A typical loop improvement is to move a computation that produces the same result each time
around the loop to a point in the program just before the loop is entered.
Code Generation:-
Code Generator produces the object code by deciding on the memory locations for data,
selecting code to access each datum and selecting the registers in which each computation is to
be done. Many computers have only a few high speed registers in which computations can be
performed quickly. A good code generator would attempt to utilize registers as efficiently as
possible.
The last phase of translation is code generation. A number of optimizations to reduce the
length of machine language program are carried out during this phase. The output of the code
generator is the machine language program of the specified computer.
In this phase, the code generator takes the optimized representation of the intermediate
code and maps it to the target machine language. The code generator translates the intermediate
code into a sequence of (generally) re-locatable machine code. Sequence of instructions of
machine code performs the task as the intermediate code would do.
Table Management OR Book-keeping OR Symbol Table:-
A compiler needs to collect information about all the data objects that appear in the
source program.
The information about data objects is collected by the early phases of the compiler-
lexical and syntactic analyzers. The data structure used to record this information is called as
Symbol Table.
It is a data-structure maintained throughout all the phases of a compiler. All the
identifiers’ names along with their types are stored here. The symbol table makes it easier for the
compiler to quickly search the identifier record and retrieve it. The symbol table is also used for
scope management.
This is the portion to keep the names used by the program and records essential
information about each. The data structure used to record this information called a ‘Symbol
Table’.
Error Handling:-
One of the most important functions of a compiler is the detection and reporting of errors
in the source program. The error message should allow the programmer to determine exactly
where the errors have occurred. Errors may occur in any of the phases of a compiler. Whenever
a phase of the compiler discovers an error, it must report the error to the error handler,
which issues an appropriate diagnostic message. Both the table-management and error-handling
routines interact with all phases of the compiler.
The error handler is invoked when a flaw in the source program is detected. The output of the
LA is a stream of tokens, which is passed to the next phase, the syntax analyzer or parser.
The SA groups the tokens together into syntactic structures such as expressions. Expressions
may further be combined to form statements. The syntactic structure can be regarded as a tree
whose leaves are the tokens; such trees are called parse trees.
The parser has two functions. It checks if the tokens from the lexical analyzer occur in
patterns that are permitted by the specification of the source language. It also imposes on
the tokens a tree-like structure that is used by the subsequent phases of the compiler.
For example, if a program contains the expression A+/B, then after lexical analysis this
expression might appear to the syntax analyzer as the token sequence id+/id. On seeing the /,
the syntax analyzer should detect an error situation, because the presence of these two
adjacent binary operators violates the formation rules of an expression. Syntax analysis makes
explicit the hierarchical structure of the incoming token stream by identifying which parts of
the token stream should be grouped together.
Example,
A/B*C has two possible interpretations:
1. divide A by B and then multiply by C, or
2. multiply B by C and then use the result to divide A.
Each of these two interpretations can be represented in terms of a parse tree.
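The two readings can be checked concretely with numeric values: in general (A/B)*C and A/(B*C) differ, and a grammar must pick one. Python's grammar, like C's, makes / and * left-associative, so the unparenthesized expression takes the first reading.

```python
# Resolving the ambiguity of A/B*C by trying both parse-tree shapes.
A, B, C = 12.0, 4.0, 3.0
left_first  = (A / B) * C   # interpretation 1: divide, then multiply -> 9.0
right_first = A / (B * C)   # interpretation 2: multiply, then divide -> 1.0

# The language's associativity rule picks interpretation 1:
assert A / B * C == left_first
```

Each interpretation corresponds to a different parse tree, which is exactly why the grammar (not the lexer) must fix operator precedence and associativity.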
Compilation Process of a source code through phases
Simply says:
• Lexical analysis (Scanning)
• Syntax Analysis (Parsing)
• Syntax Directed Translation
• Intermediate Code Generation
• Run-time environments
• Code Generation
• Machine Independent Optimization
The Phases of a Compiler
Compiler-Construction Tools
• Software development tools are available to implement one or more compiler phases
o Scanner generators
o Parser generators
o Syntax-directed translation engines
o Automatic code generators
o Data-flow engines
Compiler-Construction Tools
The compiler writer, like any software developer, can profitably use modern software
development environments containing tools such as language editors, debuggers, version
managers, profilers, test harnesses, and so on. In addition to these general software-development
tools, other more specialized tools have been created to help implement various phases of a
compiler.
These tools use specialized languages for specifying and implementing specific
components, and many use quite sophisticated algorithms. The most successful tools are those
that hide the details of the generation algorithm and produce components that can be easily
integrated into the remainder of the compiler. Some commonly used compiler-construction tools
include
1. Parser generators that automatically produce syntax analyzers from a grammatical
description of a programming language.
2. Scanner generators that produce lexical analyzers from a regular-expression description of
the tokens of a language.
3. Syntax-directed translation engines that produce collections of routines for walking a parse
tree and generating intermediate code.
4. Code-generator generators that produce a code generator from a collection of rules for
translating each operation of the intermediate language into the machine language for a target
machine.
5. Data-flow analysis engines that facilitate the gathering of information about how values are
transmitted from one part of a program to each other part. Data-flow analysis is a key part of
code optimization.
6. Compiler-construction toolkits that provide an integrated set of routines for constructing
various phases of a compiler.