Lexical Analysis, Regular Expressions & Finite State Machines

Processing English

• Consider the following two sentences:
  • Hi, I am 22 years old. I come from Alabama.
  • 22 come Alabama I, old from am. Hi years I.
• Are they both correct?
  • How do you know?
  • Same words, numbers and punctuation
• What did you do first?
  1. Find words, numbers and punctuation
  2. Then, check order (grammar rules)

Finding Words and Numbers

• How did you find words, numbers and punctuation?
  • You have a definition of what each is, or looks like
  • For example, what is a number? a word?
• Although you are a bit more agile, the process was:
  1. Start with the first character
  2. If letter, assume word; if digit, assume number
  3. Scan left to right, one character at a time, until a separator (space, comma, etc.)
  4. Recognize word or number
  5. If no more characters, done; otherwise return to 1
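The five steps above can be sketched in Python. This is a minimal illustration rather than anything from the slides: the function name `scan` and the labels WORD, NUMBER and PUNCT are assumptions made for the example.

```python
def scan(text):
    """Split text into (kind, value) pairs, one character at a time."""
    tokens = []
    i = 0
    while i < len(text):                      # step 5: stop when nothing is left
        ch = text[i]                          # step 1: first unread character
        if ch.isspace():
            i += 1                            # spaces only separate tokens
        elif ch.isalpha() or ch.isdigit():
            # step 2: guess the kind from the first character
            kind = "WORD" if ch.isalpha() else "NUMBER"
            start = i
            # step 3: scan left to right until a separator
            while i < len(text) and text[i].isalnum():
                i += 1
            tokens.append((kind, text[start:i]))   # step 4: recognize it
        else:
            tokens.append(("PUNCT", ch))      # any other character stands alone
            i += 1
    return tokens


first = scan("Hi, I am 22 years old. I come from Alabama.")
second = scan("22 come Alabama I, old from am. Hi years I.")
print(first)
print(sorted(first) == sorted(second))
```

Both sentences yield the same tokens in a different order, so the second comparison prints `True`: telling the sentences apart is the job of the grammar check, not of the scanner.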

Processing Code

How do you process the following?
What are the main parts in which to break the input?

    void quote() {
        print( "To iterate is human, to recurse divine." + " - L. Peter Deutsch");
    }

    Schemes:
        childOf(X,Y)
        marriedTo(X,Y)
    Facts:
        marriedTo('Zed','Bea').
        marriedTo('Jack','Jill').
        childOf('Jill','Zed').
        childOf('Sue','Jack').
    Rules:
        childOf(X,Y) :- childOf(X,Z), marriedTo(Y,Z).
        marriedTo(X,Y) :- marriedTo(Y,X).
    Queries:
        marriedTo('Bea','Zed')?
        childOf('Jill','Bea')?

    def addABC(x):
        s = "ABC"
        return x + s
    addABC(input("String: "))

Example

    def addABC ( x ) :
        s = "ABC"
        return x + s
    addABC ( input ( "String: " ) )

What are the Parts?

• They are called TOKENS
• Process similar to English processing
• Lexical Analysis
  • Input: A program in some language
  • Output: A list of tokens (type, value, location)

Example Revisited

Sample Input:

    def addABC(x):
        s = "ABC"
        return x + s
    addABC(input("String: "))

Sample Output:

    (FUNDEF,"def",1)
    (ID,"addABC",1)
    (LEFT_PAREN,"(",1)
    (ID,"x",1)
    (RIGHT_PAREN,")",1)
    (COLON,":",1)
    (ID,"s",2)
    (ASSIGN,"=",2)
    (STRING,"'ABC'",2)
    (FUNRET,"return",3)
    (ID,"x",3)
    (OPERATOR,"+",3)
    (ID,"s",3)
    (ID,"addABC",4)
    (LEFT_PAREN,"(",4)
    …
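One way to produce such a token list is a table of regular expressions tried at each position. The sketch below uses Python's `re` module; the token names come from the sample output, while `Token`, `TOKEN_SPEC` and `tokenize` are names assumed for this example, and only the handful of token kinds that appear in the sample are covered.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    type: str
    value: str
    line: int


# Order matters: keywords are tried before the general identifier pattern.
TOKEN_SPEC = [
    ("FUNDEF",      r"def\b"),
    ("FUNRET",      r"return\b"),
    ("ID",          r"[A-Za-z_]\w*"),
    ("STRING",      r'"[^"\n]*"'),
    ("LEFT_PAREN",  r"\("),
    ("RIGHT_PAREN", r"\)"),
    ("COLON",       r":"),
    ("ASSIGN",      r"="),
    ("OPERATOR",    r"\+"),
    ("NEWLINE",     r"\n"),
    ("SKIP",        r"[ \t]+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Return the list of (type, value, line) tokens found in source."""
    tokens, line, pos = [], 1, 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} on line {line}")
        if m.lastgroup == "NEWLINE":
            line += 1                     # newlines only advance the location
        elif m.lastgroup != "SKIP":       # blanks and tabs produce no token
            tokens.append(Token(m.lastgroup, m.group(), line))
        pos = m.end()
    return tokens


program = 'def addABC(x):\n    s = "ABC"\n    return x + s\naddABC(input("String: "))\n'
for token in tokenize(program):
    print(token)
```

In this sketch a STRING token keeps its quotes exactly as they were written in the input, so the value is `"ABC"` with double quotes.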

Program Compilation

• Lexical analysis is the first step of the process

[Figure: compilation pipeline. Program → Compiler → Code. Inside the compiler: Program → Lexical Analyzer → Tokens (keywords, string literals, variables, …) → Parser (syntax analysis; error messages) → Internal Data → Code Generator → Code, or Interpreter (executed directly).]

Token Specification

• Regular Expressions
  • Pattern description for strings
  • Concatenation: abc -> “abc”
  • Boolean OR: ab|ac -> “ab”, “ac”
  • Kleene closure: ab* -> “a”, “ab”, “abbb”, etc.
  • Optional: ab?c -> “ac”, “abc”
  • One or more: ab+ -> “ab”, “abbb”, etc.
  • Group using ()
    • (a|b)c -> “ac”, “bc”
    • (a|b)*c -> “c”, “ac”, “bc”, “bac”, “abaaabbbabbaaaaac”, etc.
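These operators can be tried directly with Python's `re` module, whose syntax for them agrees with the notation above. The sketch below pairs each pattern with strings it should accept and reject; the rejected strings and the helper name `matches` are additions made for this example.

```python
import re

# Each pattern from the list above, with strings it should accept and reject.
CASES = {
    "abc":     (["abc"], ["ab", "abcc"]),
    "ab|ac":   (["ab", "ac"], ["a", "abc"]),
    "ab*":     (["a", "ab", "abbb"], ["", "b", "aab"]),
    "ab?c":    (["ac", "abc"], ["abbc"]),
    "ab+":     (["ab", "abbb"], ["a"]),
    "(a|b)c":  (["ac", "bc"], ["c", "abc"]),
    "(a|b)*c": (["c", "ac", "bc", "bac", "abaaabbbabbaaaaac"], ["", "ca"]),
}


def matches(pattern, text):
    """True when the whole string, not just a prefix, fits the pattern."""
    return re.fullmatch(pattern, text) is not None


for pattern, (accepted, rejected) in CASES.items():
    assert all(matches(pattern, s) for s in accepted)
    assert not any(matches(pattern, s) for s in rejected)
    print(f"{pattern:8} accepts {accepted}")
```

`re.fullmatch` is used instead of `re.match` because a token pattern has to describe the entire string, not only its beginning.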

RegEx Extensions

• Exactly n: a³b+ -> “aaab”, “aaabb”, …
• [A-Z] = A|B|…|Z
• [ABC] = A|B|C
• [~aA] = any character but “a” or “A”
• \ = escape character (e.g., \* -> “*”)
• Whitespace characters
  • \s, \t, \n, \v
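Regex tools spell some of these extensions differently from the slide. In Python's `re` module, for instance, "exactly three a's" is written `a{3}` and a negated character class is written `[^aA]`. The table and the helper `check` below are assumptions made for this sketch, which tests each extension against a few sample strings.

```python
import re

# (slide notation, Python pattern, strings accepted, strings rejected)
EXTENSIONS = [
    ("a^3 b+", r"a{3}b+", ["aaab", "aaabb"], ["aab", "aaaab"]),
    ("[A-Z]",  r"[A-Z]",  ["A", "Q", "Z"],   ["q", "AB"]),
    ("[ABC]",  r"[ABC]",  ["A", "B", "C"],   ["D"]),
    ("[~aA]",  r"[^aA]",  ["b", "Z", "7"],   ["a", "A"]),
    (r"\*",    r"\*",     ["*"],             ["a"]),
    (r"\s",    r"\s",     [" ", "\t", "\n", "\v"], ["x"]),
]


def check(pattern, accepted, rejected):
    """True when every accepted string matches in full and no rejected one does."""
    ok = all(re.fullmatch(pattern, s) for s in accepted)
    return ok and not any(re.fullmatch(pattern, s) for s in rejected)


for notation, pattern, accepted, rejected in EXTENSIONS:
    verdict = "ok" if check(pattern, accepted, rejected) else "MISMATCH"
    print(f"{notation:8} as {pattern:8} {verdict}")
```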

Token Recognition

• Finite State Machine
  • A DFSM is a 5-tuple (Σ, S, s0, δ, F)
    • Σ: finite, non-empty set of symbols (input alphabet)
    • S: finite, non-empty set of states
    • s0: member of S designated as start state
    • δ: state-transition function δ: S × Σ -> S
    • F: subset of S (final states, may be empty)
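The 5-tuple maps almost directly onto a small data structure. The sketch below is one possible encoding, not a prescribed one: the class name `DFSM`, its field names, the function `accepts` and the state names are all assumptions. The example machine recognizes ab*, with a "dead" trap state added so that δ is defined for every (state, symbol) pair.

```python
from typing import NamedTuple


class DFSM(NamedTuple):
    alphabet: frozenset      # Σ
    states: frozenset        # S
    start: str               # s0
    delta: dict              # δ, keyed by (state, symbol)
    finals: frozenset        # F


def accepts(machine, text):
    """Run the machine over text and report whether it ends in a final state."""
    state = machine.start
    for symbol in text:
        if symbol not in machine.alphabet:
            return False
        state = machine.delta[(state, symbol)]
    return state in machine.finals


AB_STAR = DFSM(
    alphabet=frozenset("ab"),
    states=frozenset({"start", "seen_a", "dead"}),
    start="start",
    delta={
        ("start", "a"): "seen_a", ("start", "b"): "dead",
        ("seen_a", "a"): "dead",  ("seen_a", "b"): "seen_a",
        ("dead", "a"): "dead",    ("dead", "b"): "dead",
    },
    finals=frozenset({"seen_a"}),
)

print([w for w in ["", "a", "ab", "abbb", "ba", "aab"] if accepts(AB_STAR, w)])
# prints ['a', 'ab', 'abbb']
```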

FSM & RegEx

• abc

• a(b|c)

• ab*

• (a(b?c))+

[Figure: one state diagram per expression above, with transitions labeled a, b and c.]

Note the special double-circle designation of a final/accepting state.
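The four machines can also be written down as transition tables. The sketch below is one possible set of tables: the state numbering is an assumption and need not match the slide's diagrams. Missing table entries mean "no arrow", which rejects the input, and each table is compared with Python's `re` module on every string over {a, b, c} up to length 5.

```python
import re
from itertools import product


def accepts(delta, finals, text, start=0):
    """Follow the arrows in delta; a missing arrow rejects the input."""
    state = start
    for symbol in text:
        state = delta.get((state, symbol))
        if state is None:
            return False
    return state in finals


# pattern -> (transition table, set of final states); state 0 is the start state
MACHINES = {
    "abc":       ({(0, "a"): 1, (1, "b"): 2, (2, "c"): 3}, {3}),
    "a(b|c)":    ({(0, "a"): 1, (1, "b"): 2, (1, "c"): 2}, {2}),
    "ab*":       ({(0, "a"): 1, (1, "b"): 1}, {1}),
    "(a(b?c))+": ({(0, "a"): 1, (1, "b"): 2, (1, "c"): 3,
                   (2, "c"): 3, (3, "a"): 1}, {3}),
}

for pattern, (delta, finals) in MACHINES.items():
    for length in range(6):
        for letters in product("abc", repeat=length):
            word = "".join(letters)
            expected = re.fullmatch(pattern, word) is not None
            assert accepts(delta, finals, word) == expected
    print(f"{pattern}: table agrees with the regular expression")
```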

Finite State Transducer

• Extended FSM:
  • Γ: finite, non-empty set of symbols (output alphabet)
  • δ: state-transition function δ: S × Σ -> S × Γ
• FST consumes input symbols and emits output symbols
• Lexical analyzer
  • consume raw characters
  • emit tokens
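A transducer differs from a plain FSM only in that each transition also emits an output symbol. The sketch below is a toy illustration, not the course's lexer: it assumes a two-state machine (OUT, IN) that reads characters classified as "space" or "char" and emits B for the first character of a token, I for a later character inside it, and O for a character outside any token. All names are assumptions.

```python
def classify(ch):
    """Map a raw character to a symbol of the input alphabet Σ."""
    return "space" if ch.isspace() else "char"


# δ: S × Σ -> S × Γ with S = {OUT, IN}, Σ = {space, char}, Γ = {B, I, O}
DELTA = {
    ("OUT", "space"): ("OUT", "O"),
    ("OUT", "char"):  ("IN",  "B"),   # first character of a token
    ("IN",  "char"):  ("IN",  "I"),   # still inside the same token
    ("IN",  "space"): ("OUT", "O"),   # the token has ended
}


def run_fst(delta, start, text):
    """Consume text one character at a time, emitting one symbol per step."""
    state, output = start, []
    for ch in text:
        state, emitted = delta[(state, classify(ch))]
        output.append(emitted)
    return "".join(output)


print(run_fst(DELTA, "OUT", "ab  c"))   # prints BIOOB
```

A lexical analyzer follows the same pattern, except that it emits a whole token when it leaves a token state rather than one symbol per character.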

CS 236 Coolness Factor!

• Design our own language
  • Subset of Datalog (LP-like)
• Build an interpreter for our language
  • Lexical Analyzer (Project 1)
  • Parser (Project 2)
  • Interpreter (Projects 3 and 4)
  • Optimization (Project 5)

Designing a Language

• Define the tokens
  • Elements of the language, punctuation, etc.
  • For example, what are they in C++?
• Recognize the tokens (lexical analysis)
• Define the grammar
  • Forms of correct sentences
  • For example, what are they in C++?
• Recognize the grammar (parsing)
• Interpret and execute the program
• C++ is a bit too complicated for us…

Varied World Views

    fct personlist siblings(person x) {
        return x’s siblings
    }

    fct int square(int x) {
        return x * x
    }

    fct boolean succeeds(person x) {
        if studies(x) return T else return F
    }

    fct boolean sibling(person x, person y) {
        if y is x’s sibling return T else return F
    }

    fct boolean square(int x, int y) {
        if y == x * x return T else return F
    }

    fct boolean succeeds(person x) {
        if studies(x) return T else return F
    }

• Look-up table or oracle
• No concerns with efficiency

Logic Programming

• Assume: all functions are Boolean
• Compute using facts and rules
  • Facts are the known true values of the functions
  • Rules express relations among functions
• Example: studies(x), succeeds(x)
  • Facts: studies(Matt), studies(Jenny)
  • Rule: succeeds(x) :- studies(x)
• Closed-world Assumption: anything that cannot be established from the facts and rules is taken to be false

Logic Programming

• Computing is like issuing queries
  • First check if it can be answered with facts
  • Second check if rules can be applied
• Examples
  • studies(Alex)?
    • NO (neither facts nor rules to establish it)
  • studies(Matt)?
    • YES (there is a fact about that)
  • succeeds(Jenny)?
    • YES (no fact, but a rule that if Jenny studies then she succeeds, and a fact that Jenny studies)
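The two-step lookup can be sketched in a few lines of Python. This is a deliberately small illustration under stated assumptions: predicates take one argument, each rule has a single predicate in its body, and the names `FACTS`, `RULES` and `query` are invented for the example. Rules that refer to each other in a cycle would make this naive version recurse forever.

```python
# Facts: studies(Matt), studies(Jenny)
FACTS = {("studies", "Matt"), ("studies", "Jenny")}

# Rules as (head, body) pairs: succeeds(x) :- studies(x)
RULES = [("succeeds", "studies")]


def query(predicate, person):
    """Answer predicate(person)? from facts first, then from rules."""
    if (predicate, person) in FACTS:          # first: is there a matching fact?
        return True
    # second: does some rule with this head have a body that holds?
    # If nothing applies, the closed-world assumption makes the answer False.
    return any(query(body, person) for head, body in RULES if head == predicate)


print(query("studies", "Alex"))     # prints False
print(query("studies", "Matt"))     # prints True
print(query("succeeds", "Jenny"))   # prints True
```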

Functions of Several Arguments

• Examples
  • loves(x,y), parent(x,y), inclass(x,y)
  • loves(x,y) :- married(x,y)
• Computing
  • parent(Christophe, Samuel)?
    • Yes, if there is a fact that matches
  • parent(Christophe, X)?
    • Yes, if there is a value of X that would cause it to match a fact – return value of X
  • loves(X, Y)?
    • Yes, if there are values of X and Y that would make this true, either by matching a fact or via rules (e.g., married(Christophe, Isabelle)) – return values of X and Y
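Queries with variables can be answered by matching the query against each fact and recording what every variable would have to be. In the sketch below the two facts are assumed purely for illustration (the slides only say "if there is a fact that matches"), rules are limited to a body with the same argument list as the head, and `Var`, `match` and `query` are hypothetical names.

```python
class Var:
    """A query variable such as X or Y."""
    def __init__(self, name):
        self.name = name


# Assumed facts, chosen to mirror the examples above
FACTS = {
    ("parent", "Christophe", "Samuel"),
    ("married", "Christophe", "Isabelle"),
}
# loves(x,y) :- married(x,y)
RULES = {"loves": "married"}


def match(args, fact_args):
    """Return the variable bindings that make args equal fact_args, else None."""
    bindings = {}
    for arg, value in zip(args, fact_args):
        if isinstance(arg, Var):
            if bindings.setdefault(arg.name, value) != value:
                return None          # same variable would need two values
        elif arg != value:
            return None              # constants differ
    return bindings


def query(predicate, *args):
    """Return one bindings dict per way the query can be made true."""
    answers = []
    for name, *fact_args in FACTS:
        if name == predicate and len(fact_args) == len(args):
            bindings = match(args, fact_args)
            if bindings is not None:
                answers.append(bindings)
    if predicate in RULES:           # fall back to the rule's body
        answers.extend(query(RULES[predicate], *args))
    return answers


print(query("parent", "Christophe", "Samuel"))   # prints [{}]
print(query("parent", "Christophe", Var("X")))   # prints [{'X': 'Samuel'}]
print(query("loves", Var("X"), Var("Y")))
```

An empty list means "No"; a list containing an empty dictionary means "Yes" with nothing to report, because the query had no variables.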

When We Are Done

Sample Program:

    Schemes:
        snap(S,N,A,P)
        csg(C,S,G)
        cn(C,N)
        ncg(N,C,G)

    Facts:
        snap('12345','C. Brown','12 Apple St.','555-1234').
        snap('22222','P. Patty','56 Grape Blvd','555-9999').
        snap('33333','Snoopy','12 Apple St.','555-1234').
        csg('CS101','12345','A').
        csg('CS101','22222','B').
        csg('CS101','33333','C').
        csg('EE200','12345','B+').
        csg('EE200','22222','B').

    Rules:
        cn(C,N) :- snap(S,N,A,P),csg(C,S,G).
        ncg(N,C,G) :- snap(S,N,A,P),csg(C,S,G).

    Queries:
        cn('CS101',Name)?
        ncg('Snoopy',Course,Grade)?

Sample Execution:

    cn('CS101',Name)? Yes(3)
        Name='C. Brown'
        Name='P. Patty'
        Name='Snoopy'

    ncg('Snoopy',Course,Grade)? Yes(1)
        Course='CS101', Grade='C'
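Both rules amount to joining the snap and csg facts on the shared variable S and keeping some of the columns. The sketch below reproduces the sample execution with Python set comprehensions; it only illustrates what the rules mean and is not how the course interpreter is built. The names `SNAP`, `CSG`, `NCG` and `CN` are assumptions.

```python
SNAP = [
    ("12345", "C. Brown", "12 Apple St.", "555-1234"),
    ("22222", "P. Patty", "56 Grape Blvd", "555-9999"),
    ("33333", "Snoopy", "12 Apple St.", "555-1234"),
]
CSG = [
    ("CS101", "12345", "A"), ("CS101", "22222", "B"), ("CS101", "33333", "C"),
    ("EE200", "12345", "B+"), ("EE200", "22222", "B"),
]

# ncg(N,C,G) :- snap(S,N,A,P),csg(C,S,G): join the two relations on S
NCG = {(n, c, g) for (s, n, a, p) in SNAP for (c, s2, g) in CSG if s == s2}

# cn(C,N) :- snap(S,N,A,P),csg(C,S,G): the same join, keeping only C and N
CN = {(c, n) for (n, c, g) in NCG}

# cn('CS101',Name)?
print(sorted(n for (c, n) in CN if c == "CS101"))
# prints ['C. Brown', 'P. Patty', 'Snoopy']

# ncg('Snoopy',Course,Grade)?
print(sorted((c, g) for (n, c, g) in NCG if n == "Snoopy"))
# prints [('CS101', 'C')]
```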

Demo…

Project 1: Lexical Analyzer

Sample Input:

    Queries:
        IsInRoomAtDH('Snoopy',R,'M',H)
    #SchemesFactsRules
    .

Sample Output:

    (QUERIES,"Queries",1)
    (COLON,":",1)
    (ID,"IsInRoomAtDH",2)
    (LEFT_PAREN,"(",2)
    (STRING,"'Snoopy'",2)
    (COMMA,",",2)
    (ID,"R",2)
    (COMMA,",",2)
    (STRING,"'M'",2)
    (COMMA,",",2)
    (ID,"H",2)
    (RIGHT_PAREN,")",2)
    (COMMENT,"#SchemesFactsRules",3)
    (PERIOD,".",4)
    Total Tokens = 14

Define and find the tokens

Basic FST for Project 1

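[Figure: state diagram of the basic FST for Project 1. States: start, white space, string, ident. or keywd., ":" or ":-", eof, error. Edge labels: ' (opens and closes a string); <character (except <cr> and <eof>)>; <cr> or <eof> (to error); <space> | <tab> | <cr>; <letter>; <letter> | <digit>; ":" followed by "-"; <eof>; <any other char> (to error). Annotation: special check for keywords (Schemes, Facts, Rules, Queries).]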

Implementing a FST

State in Variable

    state = START;
    input = readChar();
    while (state != ACCEPT) {
        if (state == START) {
            if (input == QUOTE) {
                input = readChar();
                state = STRING;
            } else if (input == ...) {
                ... other kinds of tokens ...
            }
        } else if (state == STRING) {
            if (input == QUOTE) {
                input = readChar();
                state = ACCEPT;
            } else {
                input = readChar();
                state = STRING;
            }
        }
    }

State in Position in Code

    input = readChar();
    // begin in START state
    if (input == QUOTE) {
        input = readChar();
        // now in STRING state
        while (input != QUOTE) {
            input = readChar();
            // stay in STRING state
        }
        input = readChar();
        // now in ACCEPT state
    } else if (input == ...) {
        ... other kinds of tokens ...
    }
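For comparison, here is a runnable Python sketch of the "state in variable" style, restricted to the string token. It adds an ERROR state for a line break or end of input inside a string, which the listings above leave out; the function name `read_string` and the use of `None` to stand for end of input are assumptions.

```python
START, STRING, ACCEPT, ERROR = "START", "STRING", "ACCEPT", "ERROR"
QUOTE = "'"


def read_string(text):
    """Return the quoted string at the start of text, or None on error."""
    chars = iter(text)

    def read_char():
        return next(chars, None)      # None plays the role of <eof>

    state, lexeme = START, ""
    ch = read_char()
    while state not in (ACCEPT, ERROR):
        if state == START:
            state = STRING if ch == QUOTE else ERROR
        elif state == STRING:
            if ch == QUOTE:
                state = ACCEPT        # closing quote ends the token
            elif ch is None or ch == "\n":
                state = ERROR         # unterminated string
        if state != ERROR:
            lexeme += ch              # keep the character just consumed
            ch = read_char()
    return lexeme if state == ACCEPT else None


print(read_string("'Snoopy',R,'M',H)"))   # prints 'Snoopy'
print(read_string("'unterminated"))       # prints None
```

Keeping the state in a variable makes it easy to add more states as table-like branches, while the "state in position in code" style reads more naturally for a single token kind.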