Lexical Analyzer
The lexical analyzer translates the source program into a stream of lexical tokens.
Source program:
  a stream of characters
  the character set varies from language to language (ASCII or Unicode, or …)
Lexical token:
  a compiler-internal data structure that represents the occurrence of a terminal symbol
  the representation varies from compiler to compiler
Example
Recall the min-ML language in “code3”:
prog -> decs
decs -> dec; decs
      |
dec  -> val id = exp
      | val _ = printInt exp
exp  -> id
      | num
      | exp + exp
      | true
      | false
      | if (exp) then exp else exp
      | (exp)
Example
Source program:
val x = 3;
val y = 4;
val z = if (2) then (x) else y;
val _ = printInt z;
After lexical analysis:
VAL IDENT(x) ASSIGN INT(3) SEMICOLON
VAL IDENT(y) ASSIGN INT(4) SEMICOLON
VAL IDENT(z) ASSIGN IF LPAREN INT(2) RPAREN THEN LPAREN IDENT(x) RPAREN ELSE IDENT(y) SEMICOLON
VAL UNDERSCORE ASSIGN PRINTINT IDENT(z) SEMICOLON EOF
Lexer Implementation
Options:
  Write a lexer by hand from scratch
    boring, error-prone, and too much work
    see dragon book sec 3.4
  Automatic lexer generator
    quick and easy
Regular Expressions
How to specify a lexer?
  Develop another language: regular expressions
What’s a lexer-generator?
  Another compiler…
Basic Definitions
Alphabet: the character set (say ASCII or Unicode)
String: a finite sequence of characters from the alphabet
Language: a set of strings
  finite or infinite
  say the C language
Regular Expression (RE)
Construction by induction:
  each c \in alphabet is an RE: a denotes {a}
  the empty string \eps is an RE: \eps denotes {\eps}
  for REs M and N, M|N is an RE: (a|b) = {a, b}
  for REs M and N, MN is an RE: (a|b)(c|d) = {ac, ad, bc, bd}
  for an RE M, M* is an RE (Kleene closure): (a|b)* = {\eps, a, aa, b, ab, abb, baa, …}
Example
C’s identifier:
  starts with a letter (“_” counts as a letter)
  followed by zero or more letters or digits
Building the RE step by step:
  (…) (…)
  (_|a|b|…|z|A|B|…|Z) (…)
  (_|a|b|…|z|A|B|…|Z)(_|a|b|…|z|A|B|…|Z|0|…|9)
  (_|a|b|…|z|A|B|…|Z)(_|a|b|…|z|A|B|…|Z|0|…|9)*
It’s really error-prone and tedious…
Syntax Sugar
More syntax sugar:
  [a-z] == a|b|…|z
  e+ == one or more of e
  e? == zero or one of e
  “a*” == the literal string a* (the * is not an operator here)
  e{i, j} == at least i and at most j repetitions of e
  . == any char except \n
All these can be translated into core RE
Example Revisited
C’s identifier:
  starts with a letter (“_” counts as a letter)
  followed by zero or more letters or digits
  (…) (…)
  (_|a|b|…|z|A|B|…|Z) (…)
  (_|a|b|…|z|A|B|…|Z)(_|a|b|…|z|A|B|…|Z|0|…|9)
  [_a-zA-Z][_a-zA-Z0-9]*
What about the keyword “if”?
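As a quick check of the sugared pattern, here is a minimal sketch using Python's standard `re` module (Python's regex syntax happens to accept this pattern unchanged). The helper name `is_identifier` is made up for illustration.

```python
import re

# The identifier pattern from the slide, in Python's regex syntax.
IDENT = re.compile(r"[_a-zA-Z][_a-zA-Z0-9]*")

def is_identifier(s: str) -> bool:
    """True when the whole string s matches the identifier pattern."""
    return IDENT.fullmatch(s) is not None

print(is_identifier("_tmp1"))   # a leading underscore counts as a letter
print(is_identifier("9lives"))  # a leading digit is rejected
print(is_identifier("if"))      # the keyword "if" also matches the pattern
```

The last call shows why a keyword such as "if" needs extra care: the identifier pattern alone cannot tell it apart from an ordinary name.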
Ambiguous Rule
A single RE is not ambiguous
But in a language, there may be many REs:
  [_a-zA-Z][_a-zA-Z0-9]*
  “if”
So, for a string, which RE to match?
Ambiguous Rule
Two conventions:
Longest match: The regular expression that matches the longest string takes precedence.
Rule Priority: The regular expressions identifying tokens are written down in sequence. If two regular expressions match the same (longest) string, the first regular expression in the sequence takes precedence.
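The two conventions can be sketched in a few lines of Python. This is an illustration only, not how a generated lexer works internally: it tries every rule at the current position with the standard `re` module. The rule list `RULES` and the function `next_token` are hypothetical names.

```python
import re

# Hypothetical rule list, in priority order: earlier rules win ties.
RULES = [
    ("IF", re.compile(r"if")),
    ("ID", re.compile(r"[_a-zA-Z][_a-zA-Z0-9]*")),
    ("NUM", re.compile(r"[0-9]+")),
]

def next_token(text, pos):
    """Return (token_name, lexeme) for the best rule at pos, or None."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        # A strictly longer match replaces the current best (longest match);
        # on equal length the earlier rule is kept (rule priority).
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

print(next_token("if", 0))    # ('IF', 'if')
print(next_token("iffy", 0))  # ('ID', 'iffy')
print(next_token("if2", 0))   # ('ID', 'if2')
```

For the input "if" both rules match two characters, so the first rule wins; for "iffy" the identifier rule matches more characters and therefore takes precedence.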
Lexer Generator History
Lexical analysis was once a performance bottleneck
  certainly not true today!
As a result, early research investigated methods for efficient lexical analysis
While the performance concerns are largely irrelevant today, the tools resulting from this research are still in wide use
History: A long-standing goal
In this early period, a considerable amount of study went into the goal of creating an automatic compiler generator (aka compiler-compiler)
declarative compiler specification -> [compiler-compiler] -> compiler
History: Unix and C
Around 1970 at Bell Labs, Ritchie, Thompson, and others were developing Unix
A key part of this project was the development of C and a compiler for it
Johnson et al., in 1968, proposed the use of finite state machines for lexical analysis [CACM 11(12), 1968]
  read the accompanying paper on course page
Lex, developed at Bell Labs by Lesk and Schmidt in 1975, realized a part of the compiler-compiler goal by automatically generating fast lexical analyzers
The Lex tool
The original Lex generated lexers written in C (C in C)
Today every major language has its own lex tool(s):
  sml-lex, ocamllex, JLex, C#lex, …
Our topic next:
  sml-lex
  concepts and techniques apply to other tools
SML-Lex Specification
Lexical specification consists of 3 parts (yet another programming language):
User Declarations (plain SML types, values, functions)
%%
SML-LEX Definitions (RE abbreviations, special stuff)
%%
Rules (association of REs with tokens) (each token will be represented in plain SML)
User Declarations
User Declarations:
  User can define various values that are available to the action fragments.
  Two values must be defined in this section:
    type lexresult
      type of the value returned by each rule action.
    fun eof ()
      called by the lexer when the end of the input stream (EOF) is reached.
SML-LEX Definitions
ML-LEX Definitions:
  User can define regular expression abbreviations:
    digits = [0-9]+;
    letter = [a-zA-Z];
  Define multiple lexers to work together. Each is given a unique name:
    %s lex1 lex2 lex3;
Rules
Rules:
  <lexerList> regularExp => (action) ;
  A rule consists of a pattern and an action:
    Pattern is a regular expression.
    Action is a fragment of ordinary SML code.
    Longest match & rule priority are used for disambiguation.
  Rules may be prefixed with the list of lexers that are allowed to use this rule.
Rules
Rule actions can use any value defined in the User Declarations section, including
  type lexresult
    type of value returned by each rule action
  val eof : unit -> lexresult
    called by lexer when end of input stream reached
Special variables:
  yytext: input substring matched by regular expression
  yypos: file position of the beginning of matched string
  continue (): doesn’t return token; recursively calls lexer
Example #1
(* A language called Toy *)
prog -> word prog
->
word -> symbol
-> number
symbol -> [_a-zA-Z][_0-9a-zA-Z]*
number -> [0-9]+
Example #1
(* Lexer Toy, see the accompanying code for detail *)
datatype token = Symbol of string * int
               | Number of string * int
exception End
type lexresult = unit
fun eof () = raise End
fun output x = …;
%%
letter = [_a-zA-Z];
digit = [0-9];
ld = {letter}|{digit};
symbol = {letter} {ld}*;
number = {digit}+;
%%
<INITIAL>{symbol} => (output (Symbol (yytext, yypos)));
<INITIAL>{number} => (output (Number (yytext, yypos)));
Example #2
(* Expression Language
* C-style comment, i.e. /* … */
*)
prog -> stms
stms -> stm; stms
->
stm -> id = e
-> print e
e -> id
-> num
-> e bop e
-> (e)
bop -> + | - | * | /
Example #2
(* All terminals *)
id  num  print  =  ;  (  )  +  -  *  /
Example #2 in Lex
(* Expression language, see the accompanying code
 * for detail.
 * Part 1: user code
 *)
datatype token = Id of string * int
               | Number of string * int
               | Print of string * int
               | Plus of string * int
               | … (* all other stuffs *)
exception End
type lexresult = unit
fun eof () = raise End
fun output x = …;
Example #2 in Lex, cont’
(* Expression language, see the accompanying code
 * for detail.
 * Part 2: lex definition
 *)
%%
%s COMMENT;
letter = [_a-zA-Z];
digit = [0-9];
ld = {letter}|{digit};
sym = {letter} {ld}*;
num = {digit}+;
ws = [\ \t];
nl = [\n];
Example #2 in Lex, cont’
(* Expression language, see the accompanying code
 * for detail.
 * Part 3: rules
 *)
%%
<INITIAL>{ws} => (continue ());
<INITIAL>{nl} => (continue ());
<INITIAL>"+" => (output (Plus (yytext, yypos)));
<INITIAL>"-" => (output (Minus (yytext, yypos)));
<INITIAL>"*" => (output (Times (yytext, yypos)));
<INITIAL>"/" => (output (Divide (yytext, yypos)));
<INITIAL>"(" => (output (Lparen (yytext, yypos)));
<INITIAL>")" => (output (Rparen (yytext, yypos)));
<INITIAL>"=" => (output (Assign (yytext, yypos)));
<INITIAL>";" => (output (Semi (yytext, yypos)));
Example #2 in Lex, cont’
(* Expression language, see the accompanying code
 * for detail.
 * Part 3: rules cont’
 *)
<INITIAL>"print" => (output (Print (yytext, yypos)));
<INITIAL>{sym} => (output (Id (yytext, yypos)));
<INITIAL>{num} => (output (Number (yytext, yypos)));
<INITIAL>"/*" => (YYBEGIN COMMENT; continue ());
<COMMENT>"*/" => (YYBEGIN INITIAL; continue ());
<COMMENT>{nl} => (continue ());
<COMMENT>. => (continue ());
<INITIAL>. => (error (…));
Lex Implementation
Lex accepts regular expressions (along with others)
So SML-lex is a compiler from RE to a lexer
Internal: RE -> NFA -> DFA -> table-driven algorithm
Finite-state Automata (FA)
Input string -> M -> {Yes, No}
M = (Σ, S, q0, F, δ)
  Σ: input alphabet
  S: state set
  q0: initial state
  F: final states
  δ: transition function
DFA example
Which strings of as and bs are accepted?
Transition function:
  { (q0,a) -> q1, (q0,b) -> q0,
    (q1,a) -> q2, (q1,b) -> q1,
    (q2,a) -> q2, (q2,b) -> q2 }
[Diagram: states 0, 1, 2; 0 -a-> 1 -a-> 2; self-loops on b at 0 and at 1, and on a,b at 2]
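Running a DFA is a single loop over the input. The sketch below encodes the transition function above as a Python dictionary. The set of final states did not survive in the text, so `FINAL = {"q2"}` is an assumption made for illustration; under it, the accepted strings are exactly those containing at least two a's.

```python
# Transition function copied from the slide.
DELTA = {
    ("q0", "a"): "q1", ("q0", "b"): "q0",
    ("q1", "a"): "q2", ("q1", "b"): "q1",
    ("q2", "a"): "q2", ("q2", "b"): "q2",
}
START = "q0"
FINAL = {"q2"}  # assumption: the final-state marking is not in the text

def accepts(s):
    """Run the DFA over s (a string of 'a' and 'b') and report acceptance."""
    state = START
    for ch in s:
        state = DELTA[(state, ch)]
    return state in FINAL

print(accepts("babab"))  # True: two a's have been seen
print(accepts("bbab"))   # False: only one a
```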
RE -> NFA: Thompson algorithm
Break RE down to atoms
  construct small NFAs directly for atoms
  inductively construct larger NFAs from small NFAs
Easy to implement
  a small recursion algorithm
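The recursion can be sketched as follows. This is a Thompson-style construction under stated assumptions, not the exact construction from any textbook: regular expressions are given as nested tuples (a hypothetical AST of `char`, `eps`, `alt`, `cat`, `star` nodes), every node gets a fresh start and accept state, and `matches` simulates the resulting NFA only to show that the fragments compose correctly.

```python
import itertools

EPS = None  # label used for epsilon edges

class NFA:
    def __init__(self):
        self.edges = []                 # list of (src, label, dst)
        self._ids = itertools.count()   # fresh state numbers

    def new_state(self):
        return next(self._ids)

    def build(self, node):
        """Return (start, accept) of the NFA fragment for an AST node."""
        kind = node[0]
        s, a = self.new_state(), self.new_state()
        if kind == "char":              # atom: a single character
            self.edges.append((s, node[1], a))
        elif kind == "eps":             # atom: the empty string
            self.edges.append((s, EPS, a))
        elif kind == "alt":             # M|N: branch into either fragment
            for sub in node[1:]:
                ss, sa = self.build(sub)
                self.edges += [(s, EPS, ss), (sa, EPS, a)]
        elif kind == "cat":             # MN: chain the two fragments
            ms, ma = self.build(node[1])
            ns, na = self.build(node[2])
            self.edges += [(s, EPS, ms), (ma, EPS, ns), (na, EPS, a)]
        elif kind == "star":            # M*: skip the fragment or loop through it
            ms, ma = self.build(node[1])
            self.edges += [(s, EPS, ms), (ma, EPS, ms), (s, EPS, a), (ma, EPS, a)]
        else:
            raise ValueError(kind)
        return s, a

def closure(nfa, states):
    """All states reachable from `states` through epsilon edges only."""
    todo, seen = list(states), set(states)
    while todo:
        q = todo.pop()
        for src, label, dst in nfa.edges:
            if src == q and label is EPS and dst not in seen:
                seen.add(dst)
                todo.append(dst)
    return seen

def matches(node, text):
    """Build the NFA for node and simulate it on text."""
    nfa = NFA()
    start, accept = nfa.build(node)
    current = closure(nfa, {start})
    for ch in text:
        moved = {d for (s, l, d) in nfa.edges if s in current and l == ch}
        current = closure(nfa, moved)
    return accept in current

a_or_b = ("alt", ("char", "a"), ("char", "b"))
print(matches(("star", a_or_b), "abba"))  # True
```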
Example
%%
letter = [_a-zA-Z];
digit = [0-9];
id = {letter} ({letter}|{digit})* ;
%%
<INITIAL>"if" => (IF (yytext, yypos));
<INITIAL>{id} => (Id (yytext, yypos));
(* Equivalent to:
 * "if" | {id}
 *)
NFA -> DFA: Subset construction algorithm
(* subset construction: workList algorithm *)
q0 <- e-closure (n0)
Q <- {q0}
workList <- {q0}
while (workList != \phi)
  remove q from workList
  foreach (character c)
    t <- e-closure (move (q, c))
    D[q, c] <- t
    if (t \not\in Q)
      add t to Q and workList
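A minimal Python sketch of the worklist algorithm above. The NFA it runs on is reconstructed from the state sets of the "if" / {id} example worked later in these slides, so its exact edges are an assumption; `EPS_EDGES`, `CHAR_EDGES`, `move`, and `subset_construction` are names chosen for illustration. An empty target set is treated as "no transition" rather than stored as a DFA state.

```python
import string

LETTER = set("_" + string.ascii_letters)
LD = LETTER | set(string.digits)

# Assumed NFA: 0 is the start state, 4 accepts "if", 8 accepts an identifier.
EPS_EDGES = {0: {1, 5}, 2: {3}, 6: {7}, 7: {8}}
CHAR_EDGES = [(1, {"i"}, 2), (3, {"f"}, 4), (5, LETTER, 6), (7, LD, 7)]

def e_closure(states):
    """States reachable from `states` using epsilon edges only."""
    result, todo = set(states), list(states)
    while todo:
        for s in EPS_EDGES.get(todo.pop(), ()):
            if s not in result:
                result.add(s)
                todo.append(s)
    return frozenset(result)

def move(states, c):
    """States reachable from `states` by one edge labelled with c."""
    return {dst for (src, chars, dst) in CHAR_EDGES if src in states and c in chars}

def subset_construction(start, alphabet):
    q0 = e_closure({start})
    Q, D, work = {q0}, {}, [q0]
    while work:
        q = work.pop()
        for c in alphabet:
            t = e_closure(move(q, c))
            if not t:
                continue  # no transition on c: treated as the error case
            D[q, c] = t
            if t not in Q:
                Q.add(t)
                work.append(t)
    return q0, Q, D

q0, Q, D = subset_construction(0, sorted(LD))
print(sorted(sorted(q) for q in Q))
# [[0, 1, 5], [2, 3, 6, 7, 8], [4, 7, 8], [6, 7, 8], [7, 8]]
```

The five sets printed are the DFA states q0, q1, q3, q2, and q4 of the worked example.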
NFA -> DFA: e-closure
(* e-closure: fixpoint algorithm *)
(* Dragon Fig 3.33 gives a DFS-like algorithm.
 * Here we give a recursive version. (Simpler)
 *)
X <- \phi
fun eps (t) =
  X <- X ∪ {t}
  foreach (s \in one-eps(t))
    if (s \not\in X)
    then eps (s)
NFA -> DFA: e-closure
(* e-closure: fixpoint algorithm *)
(* Dragon Fig 3.33 gives a DFS-like algorithm.
 * Here we give a recursive version. (Simpler)
 *)
fun e-closure (T) =
  X <- T
  foreach (t \in T)
    X <- X ∪ eps(t)
NFA -> DFA: e-closure
(* e-closure: fixpoint algorithm *)
(* A BFS-like algorithm. *)
X <- empty;
fun e-closure (T) =
  Q <- T
  X <- T
  while (Q not empty)
    q <- deQueue (Q)
    foreach (s \in one-eps(q))
      if (s \not\in X)
        enQueue (Q, s)
        X <- X ∪ {s}
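Both formulations of the closure translate almost line for line into Python. The sketch below puts the recursive version next to the BFS version so they can be compared on the same input. The epsilon-edge map `one_eps` is a hypothetical example (it contains a cycle, to show that both versions terminate), not the NFA from the slides.

```python
from collections import deque

# Hypothetical epsilon edges: one_eps[s] holds the states one epsilon step from s.
one_eps = {0: {1, 5}, 1: set(), 5: {6}, 6: {0}}

def e_closure_recursive(T):
    X = set()
    def eps(t):
        X.add(t)
        for s in one_eps.get(t, ()):
            if s not in X:
                eps(s)
    for t in T:
        eps(t)
    return X

def e_closure_bfs(T):
    X, Q = set(T), deque(T)
    while Q:
        q = Q.popleft()
        for s in one_eps.get(q, ()):
            if s not in X:   # each state is enqueued at most once
                X.add(s)
                Q.append(s)
    return X

print(e_closure_recursive({0}) == e_closure_bfs({0}))  # True
```

The recursive version is shorter, but its recursion depth grows with the length of the longest epsilon chain; the queue-based version avoids that at the cost of a little bookkeeping.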
Example
<INITIAL>"if" => (IF (yytext, yypos));
<INITIAL>{id} => (Id (yytext, yypos));
[NFA diagram: states 0 to 8; edges labelled i, f, [_a-zA-Z], and [_a-zA-Z0-9]]
Example
q0 = {0, 1, 5}    Q = {q0}
D[q0, "i"] = {2, 3, 6, 7, 8}    Q <- Q ∪ {q1}
D[q0, _] = {6, 7, 8}    Q <- Q ∪ {q2}
D[q1, "f"] = {4, 7, 8}    Q <- Q ∪ {q3}
[Diagram: the NFA (states 0 to 8) with the DFA built so far: q0 -i-> q1, q1 -f-> q3, q0 -_-> q2]
Example
D[q1, _] = {7, 8}    Q <- Q ∪ {q4}
D[q2, _] = {7, 8}    \in Q
D[q3, _] = {7, 8}    \in Q
D[q4, _] = {7, 8}    \in Q
[Diagram: the NFA (states 0 to 8) with the DFA built so far: q0 -i-> q1, q1 -f-> q3, q0 -_-> q2, and q1, q2, q3, q4 each -_-> q4]
Example
q0 = {0, 1, 5}    q1 = {2, 3, 6, 7, 8}
q2 = {6, 7, 8}    q3 = {4, 7, 8}    q4 = {7, 8}
[Diagram: the NFA (states 0 to 8) and the resulting DFA, with edges
  q0 -"i"-> q1, q0 -(letter-"i")-> q2,
  q1 -"f"-> q3, q1 -(ld-"f")-> q4,
  q2 -ld-> q4, q3 -ld-> q4, q4 -ld-> q4]
Table-driven Algorithm
Conceptually, an FA is a directed graph
Pragmatically, many different strategies to encode an FA:
  Matrix (adjacency matrix)
    sml-lex
  Array of list (adjacency list)
  Hash table
  Jump table (switch statements)
    flex
Balance between time and space
Example
[DFA diagram: q0 -"i"-> q1, q0 -(letter-"i")-> q2, q1 -"f"-> q3, q1 -(ld-"f")-> q4, q2 -ld-> q4, q3 -ld-> q4, q4 -ld-> q4]
state\char   "i"   "f"   letter-"i"-"f"   …   other
q0           q1    q2    q2               …   error
q1           q4    q3    q4               …   error
q2           q4    q4    q4               …   error
q3           q4    q4    q4               …   error
q4           q4    q4    q4               …   error
<INITIAL>"if" => (IF (yytext, yypos));
<INITIAL>{id} => (Id (yytext, yypos));
state    q0   q1   q2   q3   q4
action        Id   Id   IF   Id
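The two tables are enough to drive a lexer. The sketch below encodes them as Python dictionaries and scans for the longest match by remembering the last state that had an action. The `digit` column is an assumption standing in for the elided "…" column (digits may continue an identifier but cannot start one); `char_class` and `scan` are hypothetical helper names.

```python
import string

LETTER = set("_" + string.ascii_letters)
DIGIT = set(string.digits)

def char_class(c):
    """Map a character to a column of the transition table."""
    if c == "i":
        return "i"
    if c == "f":
        return "f"
    if c in LETTER:
        return "letter"
    if c in DIGIT:
        return "digit"   # assumed column, see the lead-in
    return "other"

ALL = ["i", "f", "letter", "digit"]
# Missing entries mean "error".
TABLE = {
    "q0": {"i": "q1", "f": "q2", "letter": "q2"},
    "q1": {"i": "q4", "f": "q3", "letter": "q4", "digit": "q4"},
    "q2": {c: "q4" for c in ALL},
    "q3": {c: "q4" for c in ALL},
    "q4": {c: "q4" for c in ALL},
}
ACTION = {"q1": "Id", "q2": "Id", "q3": "IF", "q4": "Id"}

def scan(text, pos):
    """Longest match starting at pos: (action, lexeme), or None on error."""
    state, last, i = "q0", None, pos
    while i < len(text):
        state = TABLE[state].get(char_class(text[i]))
        if state is None:        # error entry: stop and fall back
            break
        i += 1
        if state in ACTION:      # remember the most recent accepting point
            last = (ACTION[state], text[pos:i])
    return last

print(scan("if", 0))      # ('IF', 'if')
print(scan("if8 x", 0))   # ('Id', 'if8')
```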
DFA Minimization: Hopcroft’s Algorithm
[DFA diagram: q0 -"i"-> q1, q0 -(letter-"i")-> q2, q1 -"f"-> q3, q1 -(ld-"f")-> q4, q2 -ld-> q4, q3 -ld-> q4, q4 -ld-> q4]
state    q0   q1   q2   q3   q4
action        Id   Id   IF   Id
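DFA Minimization: Hopcroft’s Algorithm
[Diagram: minimized DFA with q2 and q4 merged: q0 -"i"-> q1, q0 -(letter-"i")-> {q2, q4}, q1 -"f"-> q3, q1 -(ld-"f")-> {q2, q4}, q3 -ld-> {q2, q4}, {q2, q4} -ld-> {q2, q4}]
state    q0   q1   {q2, q4}   q3
action        Id   Id         IF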
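The merge of q2 and q4 can be reproduced with a short partition-refinement sketch. Note the swap: this is the simpler Moore-style refinement, not Hopcroft's worklist algorithm, although both arrive at the same minimal DFA. States start out grouped by their action and are split until states in one block agree on the block reached for every symbol. The `digit` symbol is an assumed stand-in for the elided column of the transition table, and `minimize` is a hypothetical name.

```python
SYMBOLS = ["i", "f", "letter", "digit"]  # character classes; "digit" is assumed

# Five-state DFA from the example; missing entries mean "error".
DELTA = {
    "q0": {"i": "q1", "f": "q2", "letter": "q2"},
    "q1": {"i": "q4", "f": "q3", "letter": "q4", "digit": "q4"},
    "q2": {s: "q4" for s in SYMBOLS},
    "q3": {s: "q4" for s in SYMBOLS},
    "q4": {s: "q4" for s in SYMBOLS},
}
ACTION = {"q0": None, "q1": "Id", "q2": "Id", "q3": "IF", "q4": "Id"}

def minimize(delta, action, symbols):
    """Return the blocks of equivalent states as sorted lists."""
    # Initial partition: states with the same action share a block label.
    block = {q: action[q] for q in delta}
    while True:
        # Signature: own block, then the block reached on each symbol.
        sig = {
            q: (block[q],) + tuple(
                block.get(delta[q].get(s), "ERROR") for s in symbols
            )
            for q in delta
        }
        # Signatures only ever split blocks, so equal counts mean stability.
        if len(set(sig.values())) == len(set(block.values())):
            break
        block = sig
    groups = {}
    for q in sorted(delta):
        groups.setdefault(sig[q], []).append(q)
    return sorted(groups.values())

print(minimize(DELTA, ACTION, SYMBOLS))
# [['q0'], ['q1'], ['q2', 'q4'], ['q3']]
```

q1 stays separate from q2 and q4 even though all three carry the action Id, because on "f" it reaches the IF state q3 while the others reach q4.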
Summary
A Lexer:
  input: stream of characters
  output: stream of tokens
Writing lexers by hand is boring, so we use a lexer generator: ml-lex
  RE -> NFA -> DFA -> table-driven algorithm
Moral: don’t underestimate your theory classes!
  great application of cool theory developed in mathematics.
  we’ll see more cool apps as the course progresses