Chap 5: Compilers


DESCRIPTION

Chap 5: Compilers. Basic compiler functions: a high-level programming language is usually described in terms of a grammar, and this grammar specifies the syntax of legal statements in the language. (PowerPoint presentation.)

Page 1: Chap 5

Chap 5

Compilers

Page 2: Chap 5

Basic compiler Functions

(1) A high-level programming language is usually described in terms of a grammar. This grammar specifies the syntax of legal statements in the language.

Page 3: Chap 5

For the purposes of compiler construction, a high-level programming language is usually described in terms of a grammar. This grammar specifies the form of legal statements in the language: an assignment statement, for example, might be defined by the grammar as a variable name, followed by an assignment operator (:=), followed by an expression. The compiler must match the statements written by the programmer against the structures defined by the grammar, and generate the appropriate object code for each statement.

Basic compiler Functions

Page 4: Chap 5

Basic compiler Functions

(2) A source program statement is a sequence of tokens, which may be thought of as the fundamental building blocks of the language.

Page 5: Chap 5

Tokens may be thought of as the fundamental building blocks of the language. A token may be, for example, a keyword, a variable name, an integer, an arithmetic operator, etc. The task of scanning the source statement, recognizing and classifying the various tokens, is known as lexical analysis. The part of the compiler that performs this function is commonly called the scanner.

Basic compiler Functions
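The scanning process described above can be sketched in a few lines of Python. The token classes and patterns below are illustrative assumptions, not the coding scheme of Fig. 5.5:

```python
import re

# Illustrative token patterns (assumed, not from Fig. 5.5): keywords,
# identifiers, integers, the assignment operator, and single-character operators.
TOKEN_SPEC = [
    ("KEYWORD", r"\b(?:READ|WRITE|BEGIN|END|FOR|TO|DO|VAR)\b"),
    ("ID",      r"[A-Za-z][A-Za-z0-9]*"),
    ("INT",     r"[0-9]+"),
    ("ASSIGN",  r":="),
    ("OP",      r"[+\-*/(),;]"),
    ("SKIP",    r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def scan(statement):
    """Return the list of (token_type, text) pairs for one source statement."""
    tokens = []
    for m in MASTER.finditer(statement):
        if m.lastgroup != "SKIP":          # drop whitespace between tokens
            tokens.append((m.lastgroup, m.group()))
    return tokens
```

For example, `scan("READ(VALUE)")` yields the keyword READ, the two parentheses, and the identifier VALUE as four separate tokens.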

Page 6: Chap 5

The source statements must then be recognized as language constructs described by the grammar. This process is called syntactic analysis, or parsing, and is performed by the part of the compiler usually called the parser. The last step in the basic translation process is the generation of object code.

Thus the compilation process consists of scanning, parsing, and code generation.

Basic compiler Functions

Page 7: Chap 5

Basic compiler Functions

(3) Character strings enclosed between the angle brackets < and > are called nonterminal symbols. Entries not enclosed in angle brackets are terminal symbols of the grammar (i.e., tokens).

Page 8: Chap 5

Basic compiler Functions

(4) A grammar describes the syntax of statements in the language, but not the semantics, or meaning, of the various statements. As an example of the difference between syntax and semantics, consider the statements I := J + K and I := X + Y.

Page 9: Chap 5

If X and Y are REAL variables while I, J, and K are INTEGER variables, the two statements have identical syntax. Each is an assignment statement; the value to be assigned is given by an expression consisting of two variables separated by the operator +. However, the first statement specifies that the variables in the expression are to be added using integer arithmetic operations, while the second requires floating-point addition, with the result converted from floating point to integer before being assigned. These two statements would therefore be compiled into very different sequences of machine instructions during code generation.

Basic compiler Functions

Page 10: Chap 5

Basic compiler Functions(5)

Compiler functions consist of lexical analysis, syntactic analysis, and code generation:

  • Lexical analysis involves scanning the program to be compiled and recognizing the tokens that make up the source statements.

  • During syntactic analysis, the source statements written by the programmer are recognized as language constructs described by the grammar being used.

  • When the parser recognizes a portion of the source program according to some rule of the grammar, the corresponding routine, called a semantic routine, is executed.

Page 11: Chap 5

Grammars for programming languages are often written in Backus-Naur Form (BNF). A BNF grammar consists of a set of rules, each of which defines the syntax of some construct in the programming language.

<read> ::= READ(<id-list>)

This rule defines a READ statement, denoted in the grammar as <read>. The symbol ::= can be read “is defined to be.” On the left of this symbol is the language construct being defined, and on the right is a description of the syntax being defined for it.

Basic compiler Functions

Page 12: Chap 5

Character strings enclosed between the angle brackets < and > are called nonterminal symbols. Entries not enclosed in angle brackets are terminal symbols of the grammar. The tokens READ, (, and ) are terminal symbols; the blank spaces in the rule are included only to improve readability.

Basic compiler Functions

Page 13: Chap 5
Page 14: Chap 5

<id-list> ::= id | <id-list>, id

This rule says that an <id-list> may consist of the token id, or of an <id-list> followed by the token “,” (comma), followed by a token id. Note that <id-list> is defined partially in terms of itself: an <id-list> may be a single id, such as ALPHA, or an <id-list> followed by a comma and another id. It is often convenient to display the analysis of a source statement in terms of the grammar as a tree, called the parse tree, or syntax tree. Rule 10 gives a definition of an <exp>:

Basic compiler Functions

Page 15: Chap 5
Page 16: Chap 5

The parse tree for statement 14 from Fig. 5.1 illustrates the basis for performing this sort of syntactic analysis in a compiler.

Note that SUMSQ DIV 100 and MEAN * MEAN must be calculated first, since these intermediate results are the operands for the – operation; this order of evaluation is implied by the way the rules of the grammar are constructed.

If more than one possible parse tree exists for a statement, the grammar is said to be ambiguous. Unambiguous grammars are preferred in compiler construction, so that there is no doubt about what object code should be generated.

Basic compiler Functions

Page 17: Chap 5

The scanner reads the program to be compiled and recognizes the tokens that make up the source statements. Keywords, operators, and identifiers might be defined by grammar rules such as

<letter> ::= A | B | C | D | … | Z
<digit> ::= 0 | 1 | 2 | 3 | … | 9

and a parser would then interpret a sequence of such characters as an instance of the construct <ident>. However, a special-purpose routine such as the scanner can perform this same function much more efficiently.

Basic compiler Functions
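The point that a direct check recognizes identifiers more efficiently than parsing with grammar rules can be illustrated in Python. This sketch implements the <letter> and <digit> character classes above directly:

```python
# Character classes taken from the <letter> and <digit> rules above.
LETTERS = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
DIGITS = set("0123456789")

def is_ident(s):
    """An identifier must begin with a <letter> and continue with
    letters or digits -- checked directly, with no grammar machinery."""
    if not s or s[0] not in LETTERS:
        return False
    return all(c in LETTERS or c in DIGITS for c in s[1:])
```

A single linear pass over the characters replaces the repeated rule applications a parser would need.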

Page 18: Chap 5

The scanner generally recognizes both single- and multiple-character tokens directly. For example, the character string READ would be recognized as a single token rather than as a sequence of four tokens R, E, A, D; the latter approach creates considerably more work for the parser.

The output of the scanner consists of a sequence of tokens. For efficiency, each token is usually represented by some fixed-length code, such as an integer, rather than as a variable-length character string.

Basic compiler Functions

Page 19: Chap 5

When the token being scanned is a keyword or an operator, such a coding scheme gives sufficient information. For tokens such as identifiers and integers, the scanner supplies a token specifier along with the type code.

Figure 5.6 shows the output from a scanner for the program in Fig. 5.1, using the coding scheme in Fig. 5.5.

The scanner usually operates as a procedure that is called by the parser when it needs another token; this does not mean that the entire program is scanned at one time, before any other processing.

Basic compiler Functions

Page 20: Chap 5

The scanner reads the lines of the source program as needed, and may also be responsible for printing the source listing, handling the source statements before parsing begins. The scanner must take into account any special format required of the source statements. In FORTRAN, for example, a number in columns 1-5 of a source statement should be interpreted as a statement number, not as an integer.

The interpretation of tokens may also vary from one part of the program to another.

Basic compiler Functions

Page 21: Chap 5

   Figure 5.6 Lexical scan of the program from Fig. 5.1

Page 22: Chap 5

In a statement such as DO 10 I = 1, 100 the scanner would recognize DO as a keyword, 10 as a statement number, I as an identifier, etc. However, in the statement

DO 10 I = 1

there is no loop at all: blanks are not significant in FORTRAN, so this is an assignment to the variable DO10I. The scanner cannot tell from the first characters alone whether they represent keywords or variable names defined by the programmer.

Basic compiler Functions

Page 23: Chap 5

The scanner might interact with the parser so that the parser can tell it the proper interpretation of each word, or it might simply place both identifiers and keywords into a single class of tokens, leaving the task of distinguishing between them to the parser.

Basic compiler Functions

Page 24: Chap 5

A finite automaton consists of a set of states and a set of transitions between states, driven by the input characters. One of the states is designated as the starting state, and one or more states are designated as final states. A finite automaton is often represented graphically, as illustrated in Fig. 5.7(a). The automaton begins in the starting state and follows transitions as characters are read; it stops when there is no transition from its current state on the next character to be scanned.

Basic compiler Functions

Page 25: Chap 5

The moves made by the automaton in recognizing the first input string are shown in Fig. 5.7(b).

Basic compiler Functions

Figure 5.7 Graphical representation of a finite automaton

Page 26: Chap 5

The first a causes a transition from State 1 to State 2, and scanning continues from state to state as each subsequent character is read. This automaton recognizes tokens of the form abc…abc…, where the grouping abc is repeated one or more times, and the c within each grouping may also be repeated.

Basic compiler Functions
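A finite automaton like this is easy to drive from a transition table. The states and transitions below are an illustrative guess at an automaton for tokens of the form abc…abc…, not a copy of Fig. 5.7:

```python
# Table-driven recognizer for tokens of the form abc...abc..., where the
# grouping abc repeats one or more times and the trailing c may repeat.
# (Assumed states/transitions for illustration, not those of Fig. 5.7.)
TRANSITIONS = {(0, "a"): 1, (1, "b"): 2, (2, "c"): 3, (3, "c"): 3, (3, "a"): 1}
FINAL_STATES = {3}

def accepts(string):
    state = 0                                # state 0 is the starting state
    for ch in string:
        if (state, ch) not in TRANSITIONS:   # no transition: automaton stops
            return False
        state = TRANSITIONS[(state, ch)]
    return state in FINAL_STATES
```

The table makes the automaton's behavior explicit: scanning stops as soon as no transition exists for the current state and input character.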

Page 27: Chap 5

In a typical language, an identifier begins with a letter and may continue with any sequence of letters and digits. The notation A-Z on a transition indicates that the transition is taken on any character in that range.

Figure 5.8 shows a finite automaton that recognizes identifiers of this type. A stricter definition might disallow identifiers that begin or end with an underscore, or that contain two consecutive underscores.

Basic compiler Functions

Page 28: Chap 5

Each automaton seen so far was designed to recognize a single type of token. Figure 5.9 shows a finite automaton that can recognize all of the tokens listed in Fig. 5.5. Note that all identifiers and keywords are recognized with one final state (State 2); a table-lookup operation could then be used to distinguish keywords from identifiers. A separate check could be made to ensure that identifiers conform to the form permitted by the language definition.

Basic compiler Functions

Page 29: Chap 5

For example, after scanning the characters of an identifier such as VAR, the scanner may need to perform a check to see whether the string being recognized is actually the keyword “END.”. If the check fails, the scanner could, in effect, back up to State 2 (recognizing the identifier VAR instead).

Figure 5.10(a) shows a typical algorithm to recognize such a token; Fig. 5.10(b) gives the tabular representation of the finite automaton of Fig. 5.8(b), beginning at State 1.

Basic compiler Functions

Page 30: Chap 5

Figure 5.10 Token recognition using (a) algorithmic code (b) tabular representation of finite automaton

Page 31: Chap 5

During syntactic analysis, the source statements are recognized as language constructs described by the grammar being used. Parsing techniques are divided into two general classes, bottom-up and top-down, according to the way in which the parse tree is constructed. In the following sections we describe one bottom-up method and one top-down method, and show the application of these techniques to our example program.

Basic compiler Functions

Page 32: Chap 5

The bottom-up technique we consider is called the operator precedence method. It is based on examining pairs of consecutive operators and deciding which operation should be performed first. For example, multiplication and division are performed before addition and subtraction, which corresponds to the precedence relations

+ < *

* > +

Operator-precedence parsing
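The use of such precedence relations can be sketched with a small table and a stack. The tiny table below is an assumption covering only + - * / plus an end marker $; it is not the full precedence matrix of Fig. 5.11:

```python
# Assumed precedence levels for a toy expression grammar (not Fig. 5.11).
LEVEL = {"+": 1, "-": 1, "*": 2, "/": 2, "$": 0}

def relation(left, right):
    """'<' if the right-hand operator binds tighter, else '>'
    (equal levels give '>', i.e. left associativity)."""
    return "<" if LEVEL[left] < LEVEL[right] else ">"

def reduce_order(ops):
    """Given the operators of a statement left to right, return the order
    in which an operator-precedence parser would reduce them."""
    order, stack = [], ["$"]
    for op in ops + ["$"]:
        # A '>' relation means the operator on the stack is reduced now.
        while stack[-1] != "$" and relation(stack[-1], op) == ">":
            order.append(stack.pop())
        stack.append(op)
    return order
```

For the operators of A + B * C - D, `reduce_order(["+", "*", "-"])` returns `["*", "+", "-"]`: the * is reduced first, just as it appears lowest in the parse tree.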

Page 33: Chap 5

In the expression A + B * C – D, the subexpression B * C is to be computed before either of the other operations is performed. In terms of the parse tree, the * operation appears at a lower level than does either + or –.

An operator-precedence parser locates such lower-level operations and interprets them in terms of the rules of the grammar; when the root of the parse tree is reached, the analysis is complete.

Operator-precedence parsing

Page 34: Chap 5

The operator precedence method uses a table of precedence relations between the operators of the grammar. In this context, the term operator is taken to mean any terminal symbol (i.e., any token), so there are also precedence relations involving tokens such as BEGIN. Note that these precedence relations do not follow the ordinary rules for comparisons: in some contexts the token ; has the higher precedence, while in others the token END has the higher precedence.

Operator-precedence parsing

Page 35: Chap 5

There may be no precedence relation at all between a pair of tokens, which means that the two tokens cannot appear together in any legal statement. If such a combination occurs during parsing, it should be recognized as a syntax error.

There are algorithmic methods for constructing a precedence matrix like that of Fig. 5.11 from a grammar [Aho et al. (1988)].

Operator-precedence parsing

Page 36: Chap 5
Page 37: Chap 5

Figure 5.12 shows the operator-precedence parsing of the READ statement from line 9 of the program in Fig. 5.1, one token at a time. By comparing each pair of operators, the parser has identified the portion of the statement delimited by the relations ⋖ and ⋗ to be interpreted in terms of the grammar: the id is recognized as a <factor> according to Rule 12 of the grammar.

Operator-precedence parsing

Page 38: Chap 5

When such a portion is recognized, it does not matter which nonterminal symbol is being recognized; the parser simply replaces it by an arbitrary nonterminal, <N1>. The parse tree that corresponds to this interpretation appears to the right.

In practice, the parser generally uses a stack to save the tokens that have been scanned but not yet parsed while it works through the statement to be recognized.

The parsing continues in this way until the complete READ statement has been identified; the parser has then identified the syntax of the statement, which is the goal of the parsing process.

Operator-precedence parsing

Page 39: Chap 5

A step-by-step parsing of the assignment statement from line 14 of the program in Fig. 5.1 proceeds in the same way. At each step, the parser determines the next portion of the statement to be recognized, that is, the first portion delimited by ⋖ and ⋗. Once this portion has been determined, it is interpreted according to some rule of the grammar. As in Fig. 5.3, the id SUMSQ is interpreted first as a <factor>, represented by the single nonterminal <N1>.

Operator-precedence parsing

Page 40: Chap 5
Page 41: Chap 5

The operator precedence method is one of the earliest bottom-up parsing methods. The ideas behind the operator precedence technique were later developed into a more general method known as shift-reduce parsing, in which the two actions that can be taken are shift (push the current token onto the stack) and reduce (replace tokens on top of the stack by the nonterminal they form).

For example, the parser shifts when it encounters the token BEGIN, pushing it onto the stack. The shift action is also applied to the next three tokens; then the reduce action is invoked, and a set of tokens from the top of the stack is reduced to a nonterminal, which will itself be reduced later as part of the READ statement.

Operator-precedence parsing

Page 42: Chap 5

The shift action roughly corresponds to the action taken when an operator-precedence parser encounters the relations ⋖ and ≐; the reduce action roughly corresponds to the action taken when an operator-precedence parser encounters the relation ⋗.

Operator-precedence parsing

Page 43: Chap 5

The top-down method we consider is known as recursive descent. A recursive-descent parser is made up of a procedure for each nonterminal symbol of the grammar. When a procedure is called, it attempts to find a substring of the input, beginning with the current token, that can be interpreted as the nonterminal with which the procedure is associated; when it succeeds, it advances the token pointer past the substring it has just recognized. For example, the procedure for <read> first examines the next two input tokens, looking for READ and (. If these are found, the procedure for <read> then calls the procedure for <id-list>. If that succeeds, the <read> procedure examines the next input token, looking for ). If all these tests are successful, the procedure returns an indication of success to its caller.

Operator-precedence parsing

Page 44: Chap 5

A complication arises when several alternatives are defined by the grammar for a nonterminal. The procedure must decide which of the alternatives to try, usually by looking at the next input token. The procedure for <id-list>, for example, cannot decide between its two alternatives, since both id and <id-list> can begin with id. Worse, the alternative beginning with <id-list> would cause an immediate recursive call, which leads to an unending loop; this problem is called immediate left recursion.

Operator-precedence parsing

Page 45: Chap 5

To avoid left recursion, the grammar is rewritten in an extended form in which the terms between { and } may be omitted, or repeated one or more times. In this form, <id-list> is defined as an id followed by zero or more occurrences of “, id”. A recursive-descent parse of the READ statement can then be based on the grammar in Fig. 5.15.

Operator-precedence parsing
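A recursive-descent parser for the two rules above might look like the following sketch. The token representation as (type, value) pairs is an assumption:

```python
# Recursive-descent sketch for the rules
#   <read>    ::= READ ( <id-list> )
#   <id-list> ::= id { , id }
# following the extended (brace-repetition) form of the grammar.

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens          # list of (type, value) pairs (assumed)
        self.pos = 0

    def current(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else ("EOF", "")

    def match(self, expected_type):
        """Consume the current token if its type matches; report success."""
        if self.current()[0] == expected_type:
            self.pos += 1
            return True
        return False

    def read_stmt(self):
        # <read> ::= READ ( <id-list> )
        return (self.match("READ") and self.match("LPAREN")
                and self.id_list() and self.match("RPAREN"))

    def id_list(self):
        # <id-list> ::= id { , id } -- a loop instead of left recursion
        if not self.match("ID"):
            return False
        while self.current()[0] == "COMMA":
            self.match("COMMA")
            if not self.match("ID"):
                return False
        return True
```

Note how the `{ , id }` repetition becomes a simple while loop, which is exactly why the grammar is rewritten before recursive descent is applied.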

Page 46: Chap 5

A graphic representation of the recursive-descent parsing process for the statement being analyzed shows the procedure calls as they occur. In part (ii), READ has called IDLIST, which has examined the token id; READ has then examined the next input token. Note that the parse tree is constructed beginning at the root, hence the term top-down parsing.

Operator-precedence parsing

Page 47: Chap 5

The code-generation technique we describe involves a set of routines, one for each rule or alternative rule of the grammar. When the parser recognizes a portion of the source program according to some rule of the grammar, the corresponding routine is executed; such routines are called semantic routines, since the processing performed is related to the meaning we associate with the corresponding construct in the language. In more complex compilers, the code-generation routines might instead build an intermediate form of the program, from which the compiler would attempt to generate more efficient object code.

Operator-precedence parsing

Page 48: Chap 5

Note that the parsing technique affects code generation: the operator-precedence method, for example, ignores certain nonterminals.

The generation of object code for a SIC/XE machine makes use of two data structures for working storage: a list and a stack. Items inserted into the list are removed in the order of their insertion, first in-first out; items pushed onto the stack are removed in the opposite order, last in-first out. The items in the list and the stack are token specifiers: the name of an identifier (or a pointer to its symbol-table entry), or the value of an integer, such as #100.

Operator-precedence parsing

Page 49: Chap 5

As the routines generate segments of object code for the compiled program, LOCCTR is updated to reflect the next available address in the compiled program.

The parse tree for this statement is repeated for convenience in Fig. 5.18(a). In an operator-precedence parse, a recognized substring of the input is replaced by the nonterminal <N1>. In a recursive-descent parse, the recognition occurs when a procedure returns to its caller: the parser recognizes the id VALUE as an <id-list>, and then the complete statement as a <read>.

Operator-precedence parsing

Page 50: Chap 5

The object code for the READ statement consists of a call to a subroutine XREAD, part of a standard library associated with the compiler, which may be called by any program that wants to perform a READ operation.

Since XREAD may be used to perform any READ operation, parameters are placed immediately after the JSUB that calls it: a value that specifies the number of variables that will be assigned values by the READ, followed by the addresses of these variables. Here the parameter word specifies that one variable is to be read.

Operator-precedence parsing

Page 51: Chap 5

Figure 5.18

Page 52: Chap 5

Figure 5.18(b) shows a set of routines that might be used to accomplish this code generation; the routine for <id-list> corresponds to Rule 6 of the grammar in Fig. 5.2. In either case, the code-generation routines maintain the list, updating it to reflect each insertion. After the statement has been parsed, the list contains the token specifiers for all the identifiers that are part of the <id-list>.

Operator-precedence parsing

Page 53: Chap 5

Figure 5.19 shows the code-generation process for the assignment statement on line 14 of Fig. 5.1. The parser first recognizes the id SUMSQ as a <factor> and a <term>; then it recognizes SUMSQ DIV 100 as a <term>; and so on.

As each construct is recognized, the corresponding code-generation routine creates the corresponding object code.

Operator-precedence parsing

Page 54: Chap 5

Fig 5.19 1/4

Page 55: Chap 5

Fig 5.19 2/4

Page 56: Chap 5

Fig 5.19 3/4

Page 57: Chap 5

Fig 5.19 4/4

Page 58: Chap 5

Machine-dependent compiler features

The purpose of a compiler is to translate programs written in a high-level programming language into the machine language for some computer. Since the syntax of programs written in such a language is largely machine-independent, machine-dependent compiler features are related chiefly to the generation and optimization of the object code.

Many compilers first produce an intermediate form of the program being compiled. It is easier to analyze and manipulate this intermediate form for the purpose of code optimization than it would be to perform the corresponding operations on either the source program or the object code.

Page 59: Chap 5

One common way of representing a program in an intermediate form for code analysis and optimization uses quadruples of the form operation, op1, op2, result. For example, the statement SUM := SUM + VALUE could be represented with quadruples as

+ , SUM , VALUE , i1
:= , i1 , , SUM

where the intermediate result (SUM + VALUE) is given the name i1.

Machine-dependent         compiler features
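The quadruple representation is easy to experiment with. The small interpreter below is a sketch that executes the two quadruples for SUM := SUM + VALUE; only the + and := operations are handled:

```python
# Quadruples as (operation, op1, op2, result) tuples.
# SUM := SUM + VALUE becomes two quadruples; i1 names the intermediate result.
quads = [
    ("+",  "SUM", "VALUE", "i1"),
    (":=", "i1",  None,    "SUM"),
]

def run(quads, env):
    """Execute quadruples over a variable environment (a dict).
    Intermediate results (i1, i2, ...) live in a separate temp table."""
    temps = {}
    def val(x):
        return temps[x] if x in temps else env.get(x, x)
    for op, a, b, r in quads:
        if op == "+":
            temps[r] = val(a) + val(b)       # result goes to a temporary
        elif op == ":=":
            env[r] = val(a)                  # assignment stores into a variable
    return env
```

Running with SUM = 37 and VALUE = 5 leaves SUM = 42, matching the meaning of the source statement.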

Page 60: Chap 5

The quadruples can be rearranged to eliminate redundant load and store operations, and the intermediate results can be assigned to registers or temporary variables to make their use as efficient as possible. The order of the quadruples then determines the order in which the corresponding object code instructions are executed. PARAM quadruples specify the parameters for operations such as the calls generated for READ or WRITE.

Machine-dependent         compiler features

Page 61: Chap 5

One of the most important considerations in performing machine-dependent code optimization is the use of registers as instruction operands.

Machine instructions that use registers as operands are usually faster than the corresponding instructions that refer to locations in memory. Therefore, when a value is used repeatedly in the program, or is needed as an intermediate result, it can be assigned to some register. This approach also avoids unnecessary movement of values between memory and registers.

Machine-dependent         compiler features

Page 62: Chap 5

Figure 5.22 Intermediate code for the program from Fig. 5.1

Page 63: Chap 5

For example, VALUE is used once in quadruple 7 and twice in quadruple 9; if it is kept in a register, the generated code needs to fetch this value from memory only once, and the register can then be used by the code generated from quadruple 9. Similarly, quadruple 16 stores the value of i5 into the variable MEAN; this value could still be available in a register when it is needed in quadruple 18. Note that there are eight intermediate results (ij) in Fig. 5.22.

Machine-dependent         compiler features

Page 64: Chap 5

Before a register is reassigned, the compiler must make sure that the value it contains is no longer needed, or is the value of some variable already stored in memory.

The compiler must also consider the control flow of the program. For example, quadruple 1 in Fig. 5.22 assigns the value 0 to SUM. It might appear that this value could be kept in some register for later use as an operand in quadruple 7; however, this is not necessarily the case: the J operation in quadruple 14 jumps to quadruple 4, and if control reaches quadruple 7 in this way, the value of SUM may not be in the designated register.

Machine-dependent         compiler features

Page 65: Chap 5

To deal with this problem, the control flow is analyzed in terms of basic blocks. A basic block is a sequence of quadruples with one entry point, which is at the beginning of the block, and one exit point, which is at the end of the block, with no jumps within the block. In Fig. 5.23, for example, block A contains quadruples 1-3, and block B contains quadruple 4. The figure also shows a representation of the control flow of the program: an arrow from block X to block Y indicates that control can pass directly from the last quadruple of X to the first quadruple of Y. By analyzing such a flow graph for the program, the compiler can determine which register assignments remain valid from one basic block to another.

Another optimization involves rearranging quadruples before machine code is generated.

Machine-dependent         compiler features

Page 66: Chap 5

Figure 5.24 Rearrangement of quadruples for code optimization

Page 67: Chap 5

In Fig. 5.24(a), the intermediate result i1 is calculated first and stored in temporary variable T1; it must later be reloaded in order to subtract the value of i2 from it. If the quadruples are rearranged so that i1 is computed immediately before the subtraction, its value is still available in a register, and the store and reload are eliminated.

This rearrangement is illustrated in Fig. 5.24(b). Similar rearrangement can produce better loop-control instructions and generally more efficient object code.

Machine-dependent         compiler features

Page 68: Chap 5

If each INTEGER variable occupies one word of memory, the compiler must allocate one word per element to store an array. For example, if an array is declared as

B : ARRAY [0..3, 1..6] OF INTEGER

a total of 4 * 6 = 24 words is needed.

Machine-dependent         compiler features

Page 69: Chap 5

If elements that have the same value of the first subscript are stored in contiguous locations, the array is stored in row-major order; if elements with the same value of the second subscript are stored together, it is stored in column-major order.

Most high-level languages store arrays using row-major order, but most FORTRAN compilers store arrays in column-major order.

Machine-dependent         compiler features

Page 70: Chap 5

Indexed addressing would then be used to access the desired array element. For a reference with constant subscripts, the displacement relative to the starting address of the array can be computed at compilation time; in the example, it is given by 5 * 3 = 15, since the element is word 5 of the array and each word occupies 3 bytes.

If the subscripts are not known until the program runs, the compiler must generate object code to perform this calculation during execution.

Machine-dependent         compiler features
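The row-major displacement calculation for the declared array B can be written out directly. This is a sketch; 3-byte words, as on SIC, are assumed:

```python
# Row-major displacement for B : ARRAY [0..3, 1..6] OF INTEGER, assuming
# 3-byte words as on SIC. The bounds come from the declaration above.
LOW1, HIGH1 = 0, 3    # first subscript bounds
LOW2, HIGH2 = 1, 6    # second subscript bounds
WORD = 3              # bytes per INTEGER word (assumed SIC word size)

def displacement(s1, s2):
    """Byte offset of B[s1, s2] from the start of the array, row-major order."""
    cols = HIGH2 - LOW2 + 1                        # 6 elements per row
    return ((s1 - LOW1) * cols + (s2 - LOW2)) * WORD
```

For B[2, 5], `displacement(2, 5)` skips the two complete rows 0 and 1 (12 words) plus 4 words of row 2, giving a byte offset of 48.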

Page 71: Chap 5

Code to perform such a calculation is illustrated in Fig. 5.26. The value placed in the index register must account for skipping over complete rows (row 0 and row 1) before arriving at the beginning of row 2.

Machine-dependent         compiler features

Page 72: Chap 5

Figure 5.26 Code generation for array references

Page 73: Chap 5

Figure 5.26 also illustrates the generation of code to perform such an array reference.

To generate this code, the compiler must know the type of the elements in the array, the number of dimensions declared, and the lower and upper limit for each subscript.

Machine-dependent         compiler features

Page 74: Chap 5

If ROWS and COLUMNS are not known at compilation time, the compiler cannot directly generate code like that in Fig. 5.26. Instead, the lower and upper bounds for each subscript are recorded at execution time in a descriptor for the array, which also gives the number of dimensions, the type of the array elements, and a pointer to the beginning of the array. The values from the descriptor are then used to perform the address calculation during execution.

Machine-dependent         compiler features

Page 75: Chap 5

One important type of code optimization is the elimination of common subexpressions. Consider, for example, the statements in Fig. 5.27(a): the term 2*J is a common subexpression. An optimizing compiler rewrites the code so that this multiplication is performed only once and the result is used in both places.

Machine-dependent         code optimization
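The elimination can be sketched over quadruples. This toy version is an assumption: it ignores reassignment of operands, so it is valid only for a straight-line region in which the operands do not change:

```python
# Common-subexpression elimination over quadruples (op, op1, op2, result):
# if two quadruples apply the same operation to the same operands, delete the
# second and redirect references to its result. (Sketch: assumes the operands
# are not reassigned between the two quadruples.)
def eliminate_common(quads):
    seen = {}      # (op, op1, op2) -> result that already holds the value
    rename = {}    # deleted result name -> surviving result name
    out = []
    for op, a, b, r in quads:
        a, b = rename.get(a, a), rename.get(b, b)   # apply earlier renames
        key = (op, a, b)
        if key in seen:
            rename[r] = seen[key]    # drop this quad; reuse the earlier result
        else:
            seen[key] = r
            out.append((op, a, b, r))
    return out
```

Applied to the pair of quadruples (5) and (12) discussed in this section, the second * #2 J quadruple is deleted and later references to i10 become references to i3.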

Page 76: Chap 5

Common subexpressions are usually detected through analysis of an intermediate form of the program; such an intermediate form is shown in Fig. 5.27(b). Quadruples 5 and 12 are the same except for the name of the intermediate result produced. Since the value of the operand J does not change between them, quadruple 12 can be removed and every reference to its result (i10) replaced with a reference to the result of quadruple 5.

Machine-dependent         code optimization

Page 77: Chap 5

After this substitution, quadruples 6 and 13 differ only in the name of the result, so they too are equivalent; the same applies to quadruples 3 and 4. Removing these redundant quadruples in turn gives the code shown in Fig. 5.27(c); the remaining quadruples are left unchanged, except for the substitutions of result names.

Another common source of code optimization is the removal of loop invariants: these are values that do not change from one iteration of the loop to the next.

Machine-dependent         code optimization

Page 78: Chap 5

One example of a loop-invariant computation is the term 2*J in Fig. 5.27(a): the operand J does not change in value during the execution of the loop. Thus we can move quadruple 5 in Fig. 5.27(c) to a point immediately before the loop is entered.

The sequence of quadruples that results from these modifications reduces the number of quadruples within the body of the loop from 14 to 11, and the number of operations required for one execution of the FOR loop from 141 to 114.

Machine-dependent         code optimization

Page 79: Chap 5

Some optimization, of course, can be obtained by rewriting the source program: the statements in Fig. 5.27(a) could have been written so that 2*J is computed only once. However, many common subexpressions arise from calculating relative addresses from subscript values and are not visible in the source program. For example, the optimizations involving quadruples 3, 4, 10, and 11 in Fig. 5.27(b) could not be achieved by any rewriting of the source statement.

(5) * #2 J i3
(12) * #2 J i10

Figure 5.27(a)

Machine-dependent         code optimization

Page 80: Chap 5

Another source of optimization is the substitution of a more efficient operation for a less efficient one. Figure 5.28(b) shows a representation of the statements in Fig. 5.28(a) as a series of quadruples in which the exponentiation is carried out as a series of multiplications.

The value of the exponentiation for the current iteration can instead be found by multiplying the value for the previous iteration by 2, which is much more efficient than performing the whole series of multiplications again. This transformation is called reduction in strength. A similar transformation applies to the subscript calculation, which would otherwise perform one multiplication for each array reference.

Machine-dependent         code optimization
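The idea of reduction in strength for the exponentiation can be shown side by side with the naive version. This is a sketch of the transformation, not the quadruple-level algorithm:

```python
# Reduction in strength for 2 ** I inside a loop: instead of recomputing the
# exponentiation each iteration, multiply the previous value by 2.
def powers_naive(n):
    # One exponentiation (a series of multiplications) per iteration.
    return [2 ** i for i in range(1, n + 1)]

def powers_reduced(n):
    out, p = [], 1
    for _ in range(n):
        p = p * 2          # strength-reduced: one multiplication per iteration
        out.append(p)
    return out
```

Both versions compute the same sequence; the reduced form replaces each exponentiation by a single multiplication.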

Page 81: Chap 5

The subscript calculation can likewise be accomplished by adding 3 to the previous displacement, which is an improvement if addition is faster than multiplication on the target machine.

Figure 5.28 shows the quadruples after these two reductions in strength. An algorithm for performing this sort of transformation can be found in Aho et al. (1988).

Machine-dependent         code optimization

Page 82: Chap 5

(2) EXP #2 I i3

Computations whose operand values are known at compilation time can be performed by the compiler itself; this is known as folding. Other optimizations include converting a loop into straight-line code (loop unrolling) and merging the bodies of loops (loop jamming).

Machine-dependent         code optimization

Page 83: Chap 5

In the simplest scheme, all variables, as well as temporary storage locations (including the one used to save the return address), are assigned fixed addresses within the program. This kind of assignment is usually called static allocation.

When a procedure may be called recursively, static allocation cannot be used. Suppose SUB stores its return address from register L at a fixed location within SUB. If SUB then calls itself recursively, as in Fig. 5.29(c), the return address for invocation 2 overwrites the return address for invocation 1, and there is no possibility of ever making a correct return to MAIN.

Machine-dependent         code optimization

Page 84: Chap 5

Likewise, the variables used by SUB may be set to new values by the recursive call; this destroys the previous contents of these variables.

With a dynamic storage allocation technique, all the storage used by SUB, its variables, temporaries, return address, register save areas, etc., is kept in an activation record that is created for each call. When the procedure is called recursively, another activation record is created.

Machine-dependent         code optimization

Page 85: Chap 5

Activation records are typically allocated on a stack, with the record for the current invocation at the top of the stack. This process is illustrated in Fig. 5.30. First, corresponding to Fig. 5.29(a), the procedure MAIN has been called; its activation record appears on the stack. A register, B, indicates the starting address of this current activation record; since this record is the first, the pointer to the previous record is null. The first words of the record hold this control information, and the following words contain the values of variables used by the procedure.

Machine-dependent         code optimization

Page 86: Chap 5

Next, MAIN has called the procedure SUB: a new activation record has been created, with register B set to indicate this new current record. SUB has then called itself recursively, and another activation record has been created.

When a procedure returns to its caller, the current activation record is deleted. The pointer PREV in the deleted record is used to reestablish the previous activation record as the current one, and execution continues; this is what happens when SUB returns from the recursive call.

Machine-dependent         code optimization
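The push-and-pop discipline just described can be sketched directly. The field names below are illustrative, not those of Fig. 5.30:

```python
# Sketch of dynamic storage allocation: each call pushes an activation record
# (PREV pointer, return address, local variables) onto a run-time stack.
# Field names are illustrative assumptions, not those of Fig. 5.30.
stack = []      # the run-time stack of activation records
current = None  # index of the current record (plays the role of register B)

def call(return_addr):
    """Prologue: create a new activation record and make it current."""
    global current
    stack.append({"PREV": current, "RET": return_addr, "locals": {}})
    current = len(stack) - 1

def ret():
    """Epilogue: delete the current record, restore the caller's record."""
    global current
    record = stack.pop()
    current = record["PREV"]       # reestablish the previous record
    return record["RET"]           # resume at the saved return address
```

Because each recursive call gets its own record, each saved return address survives, which is exactly what static allocation fails to provide.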

Page 87: Chap 5

Figure 5.30 Recursive invocation of a procedure using automatic storage allocation

Page 88: Chap 5

Within a procedure, each variable is assigned an address that is relative to the beginning of the activation record. Since the address of the current activation record is kept in register B, a reference to a variable is translated as an instruction that uses base relative addressing.

The code that creates a new activation record at each procedure call is often called a prologue for the procedure; the code that deletes the current activation record on return is called an epilogue.

Machine-dependent         code optimization

Page 89: Chap 5

With this scheme, a variable no longer occupies a fixed location in the program; it occupies a fixed location in an activation record, accessed with relative addressing. Some languages also allow storage to be allocated on request during execution: a support procedure associated with the compiler requests an area of storage of the required size from the operating system, and the storage is obtained from an area of free storage called a heap.

In all of these cases, the association of an address with a variable is made when the procedure is executed.

Machine-dependent         code optimization

Page 90: Chap 5

In some languages a program can be divided into units called blocks, each of which can declare its own identifiers; units such as procedures and functions play this role in such block-structured languages. Here we use the terms procedure and block interchangeably.

In compiling a program written in a block-structured language, it is convenient to number the blocks as shown in Fig. 5.31(a) and to build a table that describes the block structure, recording the nesting level of each block as one greater than that of the surrounding block.

Machine-dependent         code optimization

Page 91: Chap 5

Figure 5.31 Nesting of blocks in a source program

Page 92: Chap 5

When a reference to an identifier is encountered, the symbol table is searched for a definition of that identifier in the current block, and then in the surrounding blocks; if the outermost block is reached without finding a definition of the identifier, then the reference is an error.

In practice, the definitions of an identifier are often chained together in the symbol table, and the search follows the chain of definitions for that identifier until it reaches the appropriate entry.

Machine-dependent         code optimization

Page 93: Chap 5

Most block-structured languages make use of automatic storage allocation: the variables declared by a block are stored in an activation record that is created each time the block is entered. If a statement refers to a variable that is declared within the current block, the variable is present in the current activation record. However, the variable may instead be declared in some surrounding block; in that case, the most recent activation record for that block must be located to access the variable.

Machine-dependent         code optimization

Page 94: Chap 5

Access to variables in surrounding blocks uses a data structure called a display, illustrated in Fig. 5.32. The initial situation is shown in Fig. 5.32(a): the stack contains activation records for the invocations of A, B, and C, and the display contains pointers to the activation records for C and for the surrounding blocks.

If C now calls itself recursively, a reference to a variable declared by C should use the most recent activation record, so the display pointer for C is changed accordingly. Note that there is no display pointer to the older activation record for C.

Machine-dependent         code optimization

Page 95: Chap 5

Figure 5.32 Use of display for procedures in Fig. 5.31

Page 96: Chap 5

The resulting situation is shown in Fig. 5.32(c). The display now contains only two pointers, to the activation records for D and A: procedure D cannot refer to variables in B or C, only to its own variables and to those declared by blocks that surround D in the source program. Thus the rules for referring to variables in a block-structured program are enforced by the display.

Machine-dependent         code optimization

Page 97: Chap 5

Earlier we sketched a simple one-pass compilation scheme for the Pascal language. In that design, the lexical scanner was called whenever the parser needed another token. One-pass compilation is possible for such a language because declarations of variables must appear in the program before the statements that use them.

Division into passes

Page 98: Chap 5

X := Y * Z

Unless it already knows whether the variables X, Y, and Z are of type INTEGER or REAL, the compiler cannot generate code for this statement; thus a language that allows forward references to data items cannot be compiled in one pass.

In other situations a one-pass design might be preferred: for example, when programs are compiled frequently but each compiled program is executed only once or twice, the speed of compilation is the main concern.

Division into passes

Page 99: Chap 5

An interpreter processes a source program written in a high-level language, just as a compiler does. The main difference is that an interpreter executes a version of the source program directly, instead of translating it into machine code.

The interpreter usually first translates the program into an internal form; it then executes the operations specified by the program. During this phase, an interpreter can be viewed as a set of subroutines driven by that internal form.

Division into passes

Page 100: Chap 5

Execution of a program by an interpreter is much slower than execution of the machine code produced by a compiler. The advantage of an interpreter over a compiler, however, is in the debugging facilities that can easily be provided. Also, if the compiled program would consist mostly of calls to library routines anyway, little execution speed is lost; in such cases, an interpreter might be preferred because of its speed of translation. An interpreter likewise handles more naturally storage whose structure is determined during execution, not by the nesting of blocks in the source program.

Division into passes

Page 101: Chap 5

P-code compilers (also called bytecode compilers) are closely related to interpreters. With a P-code compiler, the intermediate form produced is the machine language for a hypothetical computer, the P-machine. The operation of a P-code compiler is illustrated in Fig. 5.33.

The main advantage of this approach is portability of software. The compiler need not generate different code for different computers, because the P-code object program can be executed on any machine that has a P-code interpreter; the P-code is simply interpreted there.

Division into passes
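A P-code interpreter can be very small. The "P-machine" below is a made-up four-instruction stack machine, standing in only as an illustration of the scheme in Fig. 5.33:

```python
# A toy P-code interpreter. The instruction set (PUSH/ADD/MUL/PRINT) is an
# invented stand-in for real P-code; each instruction is a tuple.
def interpret(pcode):
    stack = []
    for instr in pcode:
        op = instr[0]
        if op == "PUSH":
            stack.append(instr[1])
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "MUL":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == "PRINT":
            print(stack.pop())
    return stack

# The same P-code runs unchanged wherever this interpreter exists:
program = [("PUSH", 2), ("PUSH", 3), ("ADD",), ("PUSH", 4), ("MUL",)]
```

The portability argument is visible here: the compiler emits `program` once, and only the small interpreter need be rewritten for each target machine.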

Page 102: Chap 5

The design of a P-machine and the associated P-code is often related to the requirements of the language being compiled. The P-code object program is often much smaller than a corresponding machine-code program would be. On the other hand, the interpretation of a P-code program may be much slower than the execution of the equivalent machine code.

Division into passes

Page 103: Chap 5

Such tools are often called compiler generators or translator-writing systems. The operation of a typical compiler-compiler is illustrated in Fig. 5.34. The user supplies the lexical rules and the grammar defining the source language, and the compiler-compiler uses these to generate a scanner and a parser directly. The user also supplies semantic routines, which the parser invokes when it recognizes the language construct described by the associated rule.

Compilers that are generated in this way tend to require more memory and compile programs more slowly than handwritten compilers.

Division into passes

Page 104: Chap 5

Figure 5.33 Translation and execution using a P-code compiler

Figure 5.34 Automated compiler construction using a compiler-compiler