Code Generator Translator Architecture Parser Tokenizer string of characters (source code) string of...

Preview:

Citation preview

CodeGenerator

Translator Architecture

ParserTokenizer

string ofcharacters

(source code)

string oftokens

abstractprogram

string ofintegers

(object code)

A Tokenizing Machine

Another greatmachine!

Ready to Dispense?

In(Chars)

Out(Tokens)

Tokenizing Machine Continued…

Ready toDispense?

In Out

3: Tokenscome outhere

2: Light comes on

1: Charactersgo in here

Tokenizing Machine Continued… Type

BL_Tokenizing_Machine_Kernel is modeled by (

buffer : string of character ready_to_dispense : boolean

) constraint …

Initial Value (empty_string, false)

Tokenizing Machine Continued…

Operations m.Insert (ch) m.Dispense (token_text, token_kind) m.Is_Ready_To_Dispense () m.Flush_A_Token (token_text, token_kind) m.Size ()

A State-Transition View

not readyto dispense

readyto dispense

Insert

Flush_A_Token

Insert

Size

Size

Is_Ready_To_DispenseIs_Ready_To_Dispense

Dispense

Tokenizing BL Programs

Token Types KEYWORD IDENTIFIER CONDITION WHITE_SPACE COMMENT ERROR

A Very Useful Extensionprocedure_body Get_Next_Token ( alters Character_IStream& str, produces Text& token_text, produces Integer& token_kind ){

}

while (not self.Is_Ready_To_Dispense () and not str.At_EOS ()) { object Character ch; str.Read (ch); self.Insert (ch); } if (self.Is_Ready_To_Dispense ()) { self.Dispense (token_text, token_kind); } else { self.Flush_A_Token (token_text, token_kind); }

Another Useful Extensionprocedure_body Get_Next_Non_Separator_Token ( alters Character_IStream& str, produces Text& token_text, produces Integer& token_kind ){

}

self.Get_Next_Token (str, token_text, token_kind); while ((token_kind == WHITE_SPACE) or (token_kind == COMMENT)) { self.Get_Next_Token (str, token_text, token_kind); }

How Does Insert Work?

buffer: “PROGRAM”ready_to_dispense: false

buffer: “PROGRAM ”ready_to_dispense: true

#m:

m:

Here’s anothercharacter.

‘ ’

The Specification of Insert

procedure Insert ( preserves Character ch ) is_abstract; /*! requires self.ready_to_dispense = false ensures self.buffer = #self.buffer * <ch> and self.ready_to_dispense = IS_COMPLETE_TOKEN_TEXT (#self.buffer, ch) !*/

An Important Math Operation

math definition IS_COMPLETE_TOKEN_TEXT (

s: string of character

c: character

): boolean is

( s is in OK_STRINGS and

s * <c> is not in OK_STRINGS ) or

( <c> is in PREFIX (OK_STRINGS) and

s * <c> is not in PREFIX (OK_STRINGS) )

s is a complete“valid” token

c can start a “valid” token, buts*<c> doesn’t start a “valid” token

Other Math Definitions

OK_STRINGS ={s: string of character (IS_KEYWORD (s))} union{s: string of character (IS_IDENTIFIER (s))} union{s: string of character (IS_CONDITION_NAME (s))} union{s: string of character (IS_WHITE_SPACE (s))} union{s: string of character (IS_COMMENT (s))}

PREFIX (s_set) ={x: string of character (there exists y: string of character (x * y is in s_set))}

PREFIX Examples

s_set = {“abc”} PREFIX (s_set) =

{“”, “a”, “ab”, “abc”}

s_set = {“abc”, “de”} PREFIX (s_set) =

{“”, “a”, “ab”, “abc”, “d”, “de”}

Tokenizing Machine: Implementation

Obvious Representation Text buffer_rep Boolean token_ready

Insert (ch)? check if IS_COMPLETE_TOKEN_TEXT

(self[buffer_rep], ch), andset self[token_ready] accordingly

append ch at end of self[buffer_rep]

Tokenizing Machine: Implementation Continued…

Dispense (token_text, token_kind)? set token_text to all but the last

character of self[buffer_rep] set token_kind to the value of

WHICH_KIND (token_text) set self[token_ready] to false

Tokenizing Machine: Implementation Continued…

How do we “check if IS_COMPLETE_TOKEN_TEXT (self[buffer_rep], ch)”?

How do we determine“WHICH_KIND (token_text)”?

How do we do these things quickly?

Making Decisions Quickly

Keep track of the “state” of the buffer by adding one field to the representation: Text buffer_rep Boolean token_ready Integer buffer_state

Possible Buffer States

How many interestingly different buffer states do you think there may be?

Let’s start enumerating them…

Buffer States Continued… Initial state (empty buffer) How many states after inserting

the first character? ‘B’, ‘D’, ‘E’, ‘I’, ‘P’, ‘T’, ‘W’, ‘n’, ‘r’, ‘t’,

identifier (any other letter) white_space (‘ ’, ‘\n’, ‘\t’) comment (‘#’) error (any other character)

Buffer States Continued…

How many states after inserting the second character? “BE”, “DO”, “EL”, “EN”, “IF”, “IN”, “IS”,

“PR”, “TH”, “WH”, “ne”, “ra”, “tr”, identifier (any other id character)

white_space (‘ ’, ‘\n’, ‘\t’) comment (any other character but ‘\n’) error (any character that cannot start a

new “good” token)

A State Transition Diagram:Transitions Out of ‘empty’ Only

B

DE I

PT

W

n

r

t

identifierwhite_space

comment

error

empty

‘B’

‘D’

‘E’ ‘I’‘P’

‘T’

‘W’

‘n’

‘r’

‘t’

any otherletter

‘ ’,’\n’,’\t’‘#’

any othercharacter

Structure of Body of Insertcase_select (self[buffer_state]){ case empty: // case for buffer = empty_string case B: // case for buffer = “B” case D: // case for buffer = “D” case E: // case for buffer = “E” . . . case error: // case for buffer holding an error token}

A Simplified View

Buffer States EMPTY_BS ID_OR_KEYWORD_OR_CONDITION_BS WHITE_SPACE_BS COMMENT_BS ERROR_BS

The State Transition Diagram

EMPTY_BS

ID_OR_KEYWORD_OR_CONDITION_BS

COMMENT_BS

WHITE_SPACE_BS

ERROR_BS

‘ ’, ‘\n’, ‘\t’

‘#’

‘a’..’z’,‘A’..’Z’

any othercharacter

any characterexcept ‘\n’

‘ ’, ‘\n’, ‘\t’

‘a’..’z’,‘A’..’Z’,‘0’..’9’,

‘-’

any character except‘a’..’z’, ‘A’..’Z’, ‘ ’, ‘\n’, ‘\t’, ‘#’

Useful Private Functions Is_White_Space_Character (ch) Is_Digit_Character (ch) Is_Alphabetic_Character (ch) Is_Identifier_Character (ch) Can_Start_Token (ch) Id_Or_Keyword_Or_Condition (t) Buffer_Type (ch) Token_Kind (bs, str)

Recommended