
CS 1622 - University of Pittsburgh
people.cs.pitt.edu/~mock/cs1622/lectures/lecture02.pdf


CS 1622 Lecture 2 1

CS 1622

Lecture 2: Lexical Analysis

CS 1622 Lecture 2 2

Lecture 2

• Review of last lecture and finish up overview
• The first compiler phase: lexical analysis
• Reading: Chapter 2 in text (by 1/18)

CS 1622 Lecture 2 3

The Front End

Responsibilities:
• Recognize legal and illegal programs
• Report errors meaningfully
• Produce IR and initial storage map
• Shape the code for the back end

Typically automatically constructed:
• From a lexical specification
• Based on finite automata (meet theory)
• Very well understood

[Diagram: source code → Scanner → tokens → Parser → IR; errors are reported]


CS 1622 Lecture 2 4

The scanner:
• Maps characters into tokens, the basic lexical units
  x = y + z becomes <id> <assign> <id> <binop> <id>
• Lexeme = string that matches the token
  x, y, and z are lexemes that match <id>
• Some tokens have attributes
  <id, x> or <binop, plus>
• Eliminates whitespace
• In some languages performs preprocessing (in C this is done by the preprocessor)

[Diagram: source code → Scanner → tokens → Parser → IR; errors are reported]
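A minimal sketch of the character-to-token mapping described above, written in Python. The token names follow the slide (<id>, <assign>, <binop>); the regular expressions, the "minus" attribute name, and the function name scan are assumptions made for illustration.

```python
import re

# Hypothetical lexical specification: (token name, pattern for its lexemes).
TOKEN_SPEC = [
    ("id", r"[A-Za-z][A-Za-z0-9]*"),
    ("assign", r"="),
    ("binop", r"[+\-]"),
    ("ws", r"\s+"),  # recognized, then eliminated
]
BINOP_NAMES = {"+": "plus", "-": "minus"}


def scan(text):
    """Return a list of (token, attribute) pairs for the input text."""
    tokens = []
    pos = 0
    while pos < len(text):
        # Find the first pattern that matches at the current position.
        for name, pattern in TOKEN_SPEC:
            m = re.compile(pattern).match(text, pos)
            if m:
                break
        else:
            raise ValueError(f"illegal character {text[pos]!r} at position {pos}")
        lexeme = m.group()
        pos = m.end()
        if name == "ws":
            continue  # whitespace never reaches the parser
        if name == "id":
            tokens.append(("id", lexeme))  # attribute: the lexeme itself
        elif name == "binop":
            tokens.append(("binop", BINOP_NAMES[lexeme]))
        else:
            tokens.append((name, None))  # no attribute needed
    return tokens


print(scan("x = y + z"))
```

The lexemes x, y, and z all map to the same token <id>; only the attribute tells them apart.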

CS 1622 Lecture 2 5

The parser:
• Recognizes syntactic structure & errors
• Directs semantic analysis (type checking)
• Builds IR for source program
• For some languages (more precisely: grammars) can be easily built by hand
• More flexible: use parser generators
  Can change language more easily
• Typically very fast
• Well understood theory ("push-down automata")

[Diagram: source code → Scanner → tokens → Parser → IR; errors are reported]

CS 1622 Lecture 2 6

Grammars

• A concise and precise way to specify languages
• For context-free grammars can build efficient parsers
• Can typically write a CFG for a programming language
• Tool of choice for specifying syntactic structure


CS 1622 Lecture 2 7

Grammars

Formally, a grammar G = (S, N, T, P)
• S is the start symbol
• N is a set of non-terminal symbols
• T is a set of terminal symbols or words
• P is a set of productions or rewrite rules (P : N → (N ∪ T)*), each rewriting one non-terminal into a string of terminals and non-terminals

CS 1622 Lecture 2 8

CFG Example

1. goal → expr
2. expr → expr op term
3.      | term
4. term → number
5.      | id
6. op   → +
7.      | -

S = goal
T = { number, id, +, - }
N = { goal, expr, term, op }
P = { 1, 2, 3, 4, 5, 6, 7 }

CS 1622 Lecture 2 9

Deriving Sentences

Production   Result
             goal
1            expr
2            expr op term
5            expr op y
7            expr - y
2            expr op term - y
4            expr op 2 - y
6            expr + 2 - y
3            term + 2 - y
5            x + 2 - y

To recognize a valid sentence for some CFG, we reverse this process and build up a parse.
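The derivation can be replayed mechanically. The sketch below is an illustration, not part of the lecture: it numbers the productions as in the CFG example and always rewrites the rightmost non-terminal, which is the order the table appears to follow. It works on token classes, so the result reads id + number - id rather than x + 2 - y. The names PRODUCTIONS and derive are made up here.

```python
# Productions numbered as in the CFG example: number -> (left side, right side).
PRODUCTIONS = {
    1: ("goal", ["expr"]),
    2: ("expr", ["expr", "op", "term"]),
    3: ("expr", ["term"]),
    4: ("term", ["number"]),
    5: ("term", ["id"]),
    6: ("op", ["+"]),
    7: ("op", ["-"]),
}
NONTERMINALS = {"goal", "expr", "term", "op"}


def derive(start, rule_numbers):
    """Apply each rule to the rightmost non-terminal; return every sentential form."""
    form = [start]
    forms = [list(form)]
    for n in rule_numbers:
        lhs, rhs = PRODUCTIONS[n]
        # Position of the rightmost non-terminal in the current form.
        i = max(k for k, sym in enumerate(form) if sym in NONTERMINALS)
        if form[i] != lhs:
            raise ValueError(f"rule {n} does not apply to {form[i]}")
        form[i:i + 1] = rhs  # rewrite that one symbol
        forms.append(list(form))
    return forms


steps = derive("goal", [1, 2, 5, 7, 2, 4, 6, 3, 5])
for f in steps:
    print(" ".join(f))
```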


CS 1622 Lecture 2 10

Parse Tree

x + 2 - y

goal
  expr
    expr
      expr
        term
          <id,x>
      op
        +
      term
        <number,2>
    op
      -
    term
      <id,y>

Lots of superfluous detail.

CS 1622 Lecture 2 11

Abstract Syntax Tree (AST)

        -
       / \
      +   <id,y>
     / \
<id,x>   <number,2>

The AST summarizes grammatical structure, without including detail about the derivation.

• This is much more concise
• ASTs are one form of intermediate representation (IR)
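One possible in-memory form of this AST, sketched in Python. The class names Id, Num, and BinOp and the helper functions are assumptions for illustration; the lecture does not prescribe a representation.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Id:
    name: str


@dataclass
class Num:
    value: int


@dataclass
class BinOp:
    op: str
    left: "Node"
    right: "Node"


Node = Union[Id, Num, BinOp]

# AST for x + 2 - y: '-' is the root and '+' is its left child.
ast = BinOp("-", BinOp("+", Id("x"), Num(2)), Id("y"))


def count_nodes(node):
    """Count the nodes of the tree."""
    if isinstance(node, BinOp):
        return 1 + count_nodes(node.left) + count_nodes(node.right)
    return 1


def evaluate(node, env):
    """Evaluate the tree, looking up identifiers in env."""
    if isinstance(node, Num):
        return node.value
    if isinstance(node, Id):
        return env[node.name]
    left, right = evaluate(node.left, env), evaluate(node.right, env)
    return left + right if node.op == "+" else left - right


print(count_nodes(ast))
print(evaluate(ast, {"x": 5, "y": 3}))
```

The tree has one node per operator and operand; the goal, expr, term, and op nodes of the parse tree are gone.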

CS 1622 Lecture 2 12

The Back End - instruction selection

Responsibilities:
• Translates IR to target code
• Selects target instructions for IR (trivial for RISC)
• Allocates machine resources (registers, memory)
• Typically implemented manually
  For CISC, some automated pattern-matching approaches
• Lots of hand-crafting done for good back ends: must know target architecture well!

[Diagram: back-end pipeline with phases Instruction Selection, Register Allocation, and Instruction Scheduling; IR flows in and between phases, machine code comes out, errors are reported]


CS 1622 Lecture 2 13

Back end - instruction scheduling

• Avoid hardware stalls and interlocks
• Use all functional units productively
• Can increase lifetime of variables
• Optimal scheduling is NP-Complete in nearly all cases, but good heuristic techniques are well understood

[Diagram: back-end pipeline with phases Instruction Selection, Register Allocation, and Instruction Scheduling]

CS 1622 Lecture 2 14

Back end - register allocation

• Have each value in a register when it is used
• Manage a limited set of resources
• Can change instruction choices & insert LOADs & STOREs
• Optimal allocation is NP-Complete, so compilers approximate

[Diagram: back-end pipeline with phases Instruction Selection, Register Allocation, and Instruction Scheduling]

CS 1622 Lecture 2 15

Traditional Three-pass Compiler

The middle end:
• Analyzes IR and rewrites (or transforms) IR
• Primary goal is to reduce running time of the compiled code
• May also improve space, power consumption, …
• Must preserve "meaning" of the code

[Diagram: source code → Front End → IR → Middle End → IR → Back End → machine code; errors are reported]


CS 1622 Lecture 2 16

The Optimizer

• Discover & propagate some constant value
• Move a computation to a less frequently executed place
• Specialize some computation based on context
• Discover a redundant computation & remove it
• Remove useless or unreachable code
• Encode an idiom in some particularly efficient form

[Diagram: IR → Opt 1 → IR → Opt 2 → IR → Opt 3 → IR → ... → Opt n → IR; errors are reported]
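As a small illustration of an IR-to-IR rewrite, here is a sketch of constant folding, one simple way of discovering constant values. The IR is a hypothetical one chosen for brevity: an expression is an int constant, a variable name (str), or a tuple (op, left, right) with op being "+" or "-". The function name fold is also an assumption.

```python
def fold(expr):
    """Constant folding: evaluate operator nodes whose operands are both constants."""
    if not isinstance(expr, tuple):
        return expr  # a constant or a variable name: nothing to do
    op, left, right = expr
    left, right = fold(left), fold(right)  # fold the operands first
    if isinstance(left, int) and isinstance(right, int):
        return left + right if op == "+" else left - right
    return (op, left, right)  # rebuild the node with the folded operands


print(fold(("-", ("+", 2, 3), "y")))
```

The rewritten expression computes the same value as the original for every y, so the "meaning" of the code is preserved while one addition disappears from the running program.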

CS 1622 Lecture 2 17

The Scanner: Overview

Task:
• translate the sequence of characters to a corresponding sequence of tokens, essentially grouping characters into words and removing irrelevant characters, e.g., white space

Each time the scanner is called, it should
• find the longest sequence of characters in the input, starting with the current character, that corresponds to a token, and
• return that token.
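The longest-match rule can be sketched as follows. The token set is hypothetical and chosen so that one token (<) is a prefix of another (<=); the names PATTERNS, next_token, and scan_all are assumptions. Ties between patterns go to the one listed first.

```python
import re

# Hypothetical token set; "lt" is deliberately listed before "le".
PATTERNS = [
    ("lt", re.compile(r"<")),
    ("le", re.compile(r"<=")),
    ("assign", re.compile(r"=")),
    ("id", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("number", re.compile(r"[0-9]+")),
]
WHITESPACE = re.compile(r"\s*")


def next_token(text, pos):
    """Return (token, lexeme, new_pos) for the longest match starting at pos."""
    pos = WHITESPACE.match(text, pos).end()  # skip irrelevant characters
    if pos == len(text):
        return ("eof", "", pos)
    best = None
    for name, pattern in PATTERNS:
        m = pattern.match(text, pos)
        if m and (best is None or m.end() > best[2]):
            best = (name, m.group(), m.end())  # strictly longer match wins
    if best is None:
        raise ValueError(f"illegal character {text[pos]!r} at position {pos}")
    return best


def scan_all(text):
    """Call the scanner repeatedly until the input is used up."""
    tokens, pos = [], 0
    while True:
        name, lexeme, pos = next_token(text, pos)
        if name == "eof":
            return tokens
        tokens.append((name, lexeme))


print(scan_all("x<=10"))
```

On the input <= the pattern for < also matches, but the two-character match is longer, so the scanner returns the le token.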

CS 1622 Lecture 2 18

How to write a scanner?

• write it from scratch, or
• automatically generate it with a scanner generator: lex or flex (produce C code), or jlex (produces Java code)
• input to a scanner generator: one regular expression for each token
• output of a scanner generator: a finite state machine
• so, you need to understand:
  • regular expressions
  • finite automata


CS 1622 Lecture 2 19

Lexical analyzers

Goals:
• To simplify specification & implementation of scanners
• To understand the underlying techniques and technologies

[Diagram: specifications → Scanner Generator → tables or code → Scanner; source code → Scanner → parts of speech]

CS 1622 Lecture 2 20

Regular Expressions to Finite Automata

Generating a scanner:

[Diagram: Lexical Specification → Regular expressions → NFA → DFA → Table-driven Implementation of DFA]

CS 1622 Lecture 2 21

Recognizing words

Example: "begin"

    c = next char;
    if c != 'b' then error;
    c = next char;
    if c != 'e' then error;
    c = next char;
    if c != 'g' then error;
    ….

s0 →b s1 →e s2 →g s3 →i s4 →n s5

Transition diagrams serve as abstractions for the code that would be written: finite automata
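The chain of states s0 … s5 can be turned into code directly. In this sketch (the function name and the use of an integer for the state are assumptions), state i means that the first i letters of "begin" have been read.

```python
def recognizes_begin(text):
    """Return True only if text is exactly the word 'begin'."""
    word = "begin"
    state = 0  # s0
    for c in text:
        if state < len(word) and c == word[state]:
            state += 1  # follow the edge labeled c to the next state
        else:
            return False  # no edge labeled c leaves this state: error
    return state == len(word)  # accept only if we stopped in s5


print(recognizes_begin("begin"), recognizes_begin("begi"))
```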


CS 1622 Lecture 2 22

Finite Automata

• A compiler recognizes legal programs in some (source) language.
• A finite-state machine recognizes legal strings in some language.
• Example: Identifiers
  sequences of one or more letters or digits, starting with a letter:

[Diagram: S → A on letter; A → A on letter | digit; A is the accepting state]

CS 1622 Lecture 2 23

Finite-Automata State Graphs

• A state

• The start state

• An accepting/final state

• A transition (an edge labeled with an input symbol, e.g. a)

CS 1622 Lecture 2 24

Finite Automata

Transition
    s1 →a s2
is read: in state s1 on input "a" go to state s2

• If end of input
  • If in accepting state => accept
  • Otherwise => reject
• If no transition possible => reject


CS 1622 Lecture 2 25

Language defined by FSM

• The language defined by a FSM is the set of strings accepted by the FSM.
• in the language of the identifier FSM (slide 22): x, tmp2, XyZzy, position27.
• not in the language of the identifier FSM (slide 22): 123, a?, 13apples.
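The identifier FSM is small enough to code by hand. The sketch below assumes the two states S and A from the identifier example, with A accepting, and restricts letters and digits to ASCII; the function name is made up.

```python
def is_identifier(text):
    """Run the two-state identifier FSM over text."""
    state = "S"  # start state
    for c in text:
        is_letter = c.isascii() and c.isalpha()
        is_digit = c.isascii() and c.isdigit()
        if state == "S" and is_letter:
            state = "A"
        elif state == "A" and (is_letter or is_digit):
            state = "A"
        else:
            return False  # no transition possible: reject
    return state == "A"  # end of input: accept only in the accepting state


for s in ["x", "tmp2", "XyZzy", "position27", "123", "a?", "13apples"]:
    print(s, is_identifier(s))
```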

CS 1622 Lecture 2 26

Example: Integer Literals

FA that accepts integer literals with an optional + or - sign:

[Diagram: S → A on + or -; S → B on digit; A → B on digit; B → B on digit; B is the accepting state]

CS 1622 Lecture 2 27

Formal FSA Definition

A finite automaton is a 5-tuple (Σ, S, δ, s0, SF) where:
• Σ is an input alphabet
• S is a set of states
• s0 is a start state
• SF ⊆ S is a set of accepting states
• δ is the state transition function: S × Σ → S
  (i.e., encodes transitions state →input state)


CS 1622 Lecture 2 28

FA for the integer-literal example

Σ = {digit, +, -}
A set of states S = {S, A, B}
A start state s0 = S
A set of accepting states SF ⊆ S = {B}
δ is the state transition function:
    (S, digit) → B
    (S, +) → A
    (S, -) → A
    (B, digit) → B
    (A, digit) → B

CS 1622 Lecture 2 29

Two kinds of Automata

Deterministic (DFA):
• No state has more than one outgoing edge with the same label.

Non-Deterministic (NFA):
• States may have more than one outgoing edge with the same label.
• Edges may be labeled with ε (epsilon), the empty string.
• The automaton can take an ε-transition without looking at the current input character.

CS 1622 Lecture 2 30

Example of NFA

integer-literal example:

[Diagram: S → A on +, -, or ε; A → B on digit; B → B on digit; B is the accepting state]


CS 1622 Lecture 2 31

Non-deterministic automata (NFA)

• often simpler (e.g. smaller) than DFA
• can be in multiple states at the same time
• An NFA accepts a string if there exists a sequence of moves
  • starting in the start state,
  • ending in a final state,
  • that consumes the entire string.
• Think about it as pursuing all choices in parallel or having an oracle that says what to do.
• Example: run the integer-literal NFA on input "+75".
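"Pursuing all choices in parallel" can be sketched by tracking the set of states the NFA could be in. The transition table below is an assumption reconstructed from the integer-literal example (S → A on +, -, or ε; A → B and B → B on digit); the names EPS, classify, eps_closure, and trace are made up for this sketch.

```python
EPS = ""  # label used here for ε-transitions

# (state, label) -> set of possible next states (assumed from the example NFA).
NFA = {
    ("S", "+"): {"A"},
    ("S", "-"): {"A"},
    ("S", EPS): {"A"},
    ("A", "digit"): {"B"},
    ("B", "digit"): {"B"},
}
START, ACCEPTING = "S", {"B"}


def classify(c):
    """Map an input character to the edge label it matches."""
    return "digit" if c in "0123456789" else c


def eps_closure(states):
    """All states reachable from `states` using only ε-transitions."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in NFA.get((s, EPS), set()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure


def trace(text):
    """Return the set of possible states before and after each character."""
    current = eps_closure({START})
    history = [sorted(current)]
    for c in text:
        moved = set()
        for s in current:
            moved |= NFA.get((s, classify(c)), set())
        current = eps_closure(moved)
        history.append(sorted(current))
    return history


def accepts(text):
    return bool(set(trace(text)[-1]) & ACCEPTING)


print(trace("+75"), accepts("+75"))
```

Under these assumptions the NFA starts in {S, A}, moves to {A} on +, then to {B} on each digit, and accepts because B is a final state.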

CS 1622 Lecture 2 32

Equivalence of DFA and NFA

Theorem:
For every non-deterministic finite-state machine M, there exists a deterministic machine M' such that M and M' accept the same language.

• Why is the theorem important for scanner generation?
• Theorem is not enough: what do we need for automatic scanner generation?
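One standard way to make the theorem constructive is the subset construction, in which each DFA state stands for a set of NFA states. The slides do not name an algorithm at this point, so the sketch below is offered as an assumption about what is needed; the NFA table is the one assumed for the integer-literal example, and all names are made up.

```python
EPS = ""  # label used here for ε-transitions

NFA = {
    ("S", "+"): {"A"},
    ("S", "-"): {"A"},
    ("S", EPS): {"A"},
    ("A", "digit"): {"B"},
    ("B", "digit"): {"B"},
}
ALPHABET = ["+", "-", "digit"]
START, ACCEPTING = "S", {"B"}


def eps_closure(states):
    """All NFA states reachable from `states` using only ε-transitions."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in NFA.get((s, EPS), set()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def subset_construction():
    """Build a DFA whose states are sets of NFA states."""
    start = eps_closure({START})
    dfa = {}  # (DFA state, label) -> DFA state
    worklist, seen = [start], {start}
    while worklist:
        current = worklist.pop()
        for label in ALPHABET:
            moved = set()
            for s in current:
                moved |= NFA.get((s, label), set())
            if not moved:
                continue  # no edge: the DFA gets stuck on this label
            target = eps_closure(moved)
            dfa[(current, label)] = target
            if target not in seen:
                seen.add(target)
                worklist.append(target)
    accepting = {q for q in seen if q & ACCEPTING}
    return start, dfa, accepting


start, dfa, accepting = subset_construction()
for (state, label), target in dfa.items():
    print(sorted(state), label, sorted(target))
```

For this NFA the construction yields three DFA states, {S, A}, {A}, and {B}, with the same five transitions as the deterministic integer-literal FA.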

CS 1622 Lecture 2 33

How to Implement a FSM

A table-driven approach:
• table:
  • one row for each state in the machine, and
  • one column for each possible character.
• Table[j][k]
  • which state to go to from state j on character k,
  • an empty entry corresponds to the machine getting stuck.


CS 1622 Lecture 2 34

The table-driven program for a DFA

    state = S   // S is the start state
    repeat {
        k = next character from the input
        if k == EOF then   // end of input
            if state is a final state then accept
            else reject
        state = T[state, k]
        if state == empty then reject   // got stuck
    }
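A runnable version of this loop, sketched in Python for the integer-literal DFA. To keep the table small it uses one column per input class (digit, +, -) instead of one per character, which is an assumption of this sketch, as are the names T, column, and run.

```python
STATES = {"S": 0, "A": 1, "B": 2}
COLUMNS = {"digit": 0, "+": 1, "-": 2}
EMPTY = None  # an empty entry: the machine gets stuck

# One row per state, one column per input class.
T = [
    # digit  +      -
    [2,      1,     1],      # S
    [2,      EMPTY, EMPTY],  # A
    [2,      EMPTY, EMPTY],  # B
]
FINAL = {STATES["B"]}


def column(c):
    """Map a character to its table column, or None if it is not in the alphabet."""
    if c.isascii() and c.isdigit():
        return COLUMNS["digit"]
    return COLUMNS.get(c)


def run(text):
    state = STATES["S"]  # S is the start state
    for c in text:  # leaving the loop corresponds to reaching EOF
        k = column(c)
        if k is None:
            return False  # character outside the alphabet
        state = T[state][k]
        if state is EMPTY:
            return False  # got stuck
    return state in FINAL  # end of input: accept only in a final state


for s in ["+75", "42", "-", "4+2"]:
    print(s, run(s))
```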

CS 1622 Lecture 2 35

Generating a scanner

[Diagram: Lexical Specification → Regular expressions → NFA → DFA → Table-driven Implementation of DFA]

CS 1622 Lecture 2 36

Regular Expressions

• FA's are not a good way to specify tokens: diagrams are hard to write down
• regular expressions are another specification technique
  • a compact way to define a language that can be accepted by an automaton
  • used as the input to a scanner generator
    • define each token, and
    • define white-space, comments, etc.; these do not correspond to tokens, but must be recognized and ignored.


CS 1622 Lecture 2 37

Example: Simple identifier

English: A letter, followed by zero or more letters or digits.
RE: letter . (letter | digit)*

Operators:
    |   means "or"
    .   means "followed by" (usually just use position)
    *   means zero or more instances
    ()  are used for grouping
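For comparison, the same expression can be written in the notation of Python's re module, where concatenation is expressed by position and a character class stands in for letter and digit. Restricting letters to ASCII is an assumption of this sketch.

```python
import re

# letter . (letter | digit)*  with letter = [A-Za-z] and digit = [0-9]
IDENTIFIER = re.compile(r"[A-Za-z]([A-Za-z]|[0-9])*")


def is_identifier(s):
    """True if the whole string matches the identifier expression."""
    return IDENTIFIER.fullmatch(s) is not None


print(is_identifier("tmp2"), is_identifier("13apples"))
```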

CS 1622 Lecture 2 38

Operands of a regular expression

• Operands are same as labels on the edges of an FSM
  • single characters, or
  • the special character ε (the empty string)
• "letter" is a shorthand for a | b | c | ... | z | A | ... | Z
• "digit" is a shorthand for 0 | 1 | … | 9
• sometimes we put the characters in quotes
  • necessary when denoting the characters | . *

CS 1622 Lecture 2 39

Precedence of | . * operators.

Regular Expression Operator   Analogous Arithmetic Operator   Precedence
|                             plus                            lowest
.                             times                           middle
*                             exponentiation                  highest

Consider regular expressions:
    letter.letter | digit*
    letter.(letter | digit)*
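The effect of the parentheses can be checked experimentally. Python's re module uses the same relative precedence for these three operators, so the two expressions can be transcribed directly; the ASCII character classes and the variable names are assumptions of this sketch.

```python
import re

LETTER, DIGIT = "[A-Za-z]", "[0-9]"

# letter.letter | digit*   groups as   (letter.letter) | (digit*)
without_parens = re.compile(f"{LETTER}{LETTER}|{DIGIT}*")
# letter.(letter | digit)*
with_parens = re.compile(f"{LETTER}({LETTER}|{DIGIT})*")

for s in ["ab", "a1", "x27", "123", ""]:
    print(repr(s),
          without_parens.fullmatch(s) is not None,
          with_parens.fullmatch(s) is not None)
```

The first expression accepts exactly two letters or any run of digits (including the empty string); the second accepts identifiers.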


CS 1622 Lecture 2 40

Examples

Describe (in English) the language defined by each of the following regular expressions:

    letter (letter | digit*)

    digit digit* "." digit digit*

CS 1622 Lecture 2 41

Example: Integer Literals

• An integer literal with an optional sign can be defined in English as:
  "(nothing or + or -) followed by one or more digits"
• The corresponding regular expression is:
  (+|-|epsilon).(digit.digit*)
• A new convenient operator '+':
  digit.digit* is the same as digit+, which means "one or more digits"
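A quick check that the long form and the shorthand describe the same strings, transcribed into Python's re notation. The literal + has to be escaped there because + is also the "one or more" operator; the names LONG and SHORT are made up for this sketch.

```python
import re

# (+|-|epsilon).(digit.digit*) : the empty alternative plays the role of epsilon.
LONG = re.compile(r"(\+|-|)([0-9][0-9]*)")
# Same language using the optional sign and the "one or more" operator.
SHORT = re.compile(r"[+-]?[0-9]+")

samples = ["+75", "-3", "42", "", "+", "7-", "1.5"]
for s in samples:
    print(repr(s), LONG.fullmatch(s) is not None, SHORT.fullmatch(s) is not None)
```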

CS 1622 Lecture 2 42

Language Defined by a Regular Expression

• Recall: language = set of strings
• Language defined by an automaton / RE
  • the set of strings accepted by the automaton
  • the set of strings that match the expression.

Regular Exp. Corresponding Set of Strings

epsilon {""}

a {"a"}

a.b.c {"abc"}

a | b | c {"a", "b", "c"}

(a | b | c)* {"", "a", "b", "c", "aa", "ab", ..., "bccabb" ...}


CS 1622 Lecture 2 43

REs describe regular languages

Patterns form a regular language

*** any finite language is regular ***

Regular Expression (RE) (over alphabet Σ)

ε is a RE denoting the set {ε}

If a is in Σ, then a is a RE denoting {a}

If x and y are REs denoting L(x) and L(y) then

x is an RE denoting L(x); y is a RE denoting L(y);

x | y is an RE denoting L(x) ∪ L(y)

xy is an RE denoting L(x)L(y)

x* is an RE denoting L(x)*

Can combine RE to form other REs