Upload
others
View
52
Download
0
Embed Size (px)
Citation preview
Regular expressions as types:Bit-coded regular expression parsing
Fritz Henglein
Department of Computer ScienceUniversity of CopenhagenEmail: [email protected]
WG 2.8 Meeting, Marble Falls, 2011-03-07
Joint work with Lasse Nielsen, DIKU
Regular expression
Definition (Regular expression)
A regular expression (RE) over finite alphabet A is an expression ofthe form
E ,F ::= 0 | 1 | a | E |F | EF | E∗
where a ∈ A
Used in bioinformatics, compilers (lexical analysis, control flowanalysis), logic, natural language processing, program verification,protocol specification, query processing, security, XML accesspaths and document types, operating systems, scripting ofsearching, matching and substitution in texts or semi-structureddata (Perl) . . .
2
Language interpretation of regular expressions
Definition (Language interpretation)
The language interpretation of a regular expression E is the set ofstrings L[[E ]] defined by
L[[0]] = ∅L[[1]] = {ε}L[[a]] = {a}
L[[E |F ]] = L[[E ]] ∪ L[[F ]]L[[EF ]] = L[[E ]]� L[[F ]]L[[E∗]] =
⋃i≥0(L[[E ]])i
where S � T = {s t | s ∈ S ∧ t ∈ T}, E 0 = {ε},E i+1 = E E i .
3
Kleene’s Theorem
Theorem (Kleene 1956)
A language is regular if and only it is denoted by a regularexpression under its language interpretation.
4
What is regular expression “matching”?
Given regular expression and input string, return . . . what?
1 yes or no (membership testing)
2 zero or one substring matches for each regular subexpression(PCRE)
3 any finite number of substring matches for each regularsubexpression (regular expression types)
4 a parse tree
5
What is regular expression “matching”?
1 Membership testing = language interpretation.
2 PCRE: Only one match under a Kleene star (typically the last)
3 RET: Matches under two Kleene stars not grouped
4 Parsing: Each Kleene star yields a list of matches (thus parsetree).
Note:
Increasing structure: Lower level matching outputconstructible from higher level matching output, in particularfrom parsing.
Classical automata theory (e.g. minimal DFA construction)only sound for membership testing.
6
Practice
PCRE-style programming1:
Group matching: Does the RE match and where do (some of)its sub-REs match in the string?
Substitution: Replace matched substrings by specified otherstrings
Extensions: Backreferences, look-ahead, look-behind,...
Lazy vs. greedy matching, possessive quantifiers, atomicgrouping
Optimization
Observe: Language interpretation (yes/no) inappropriate, needmore refined interpretation
1in Perl and such7
Example
((ab)(c|d)|(abc))*.
Match against abdabc .For each parenthesized group a substring is returned.a
PCRE POSIX
$1 = abc abc$2 = ab ε$3 = c ε$4 = ε abc
aOr special null-value
8
Intermezzo: Optimization??Optimizing regular expressions = rewriting them to equivalentform that is more efficient for matching.2
Cox (2007)
Perl-compliant regular expressions (what you get in Perl,Python, Ruby, Java) use backtracking parsing.Does not handle “problematic” regular expressions: E ∗ whereE contains ε – may crash at run-time (stack overflow).
2Friedl, Mastering Regular Expressions, chapter 6: Crafting an efficientexpression9
Why discrepancy between theory and practice?
Theory is extensional: About regular languages.
Does this string match the regular expression? Yes or no?
Practice is intensional: About regular expressions asgrammars.
Does this string match the regular expression and if sohow—which parts of the string match which parts of the RE?
Ideally: Regular expression matching = parsing +“catamorphic” processing of syntax tree
Reality:
Naive backtracking matching, orfinite automaton + opportunistic instrumentation to get someparsing information (TCL (?), Laurikari 2000, Cox 2010).
10
Regular expression parsing
Regular expression parsing: Construct parse tree for givenstring.
Representation of parse tree: Regular expression as type
Example
Parse abdabc according to ((ab)(c|d)|(abc))*.
p1 = [inl ((a, b), inr d), inr (a, (b, c))]
p2 = [inl ((a, b), inr d), inl ((a, b), inl c)]
p1, p2 have type ((a× b)× (c + d) + a× (b × c)) list .
Compare with regular expression ((ab)(c|d)|(abc))* .
The elements of type E correspond to the syntax trees forstrings parsed according to regular expression E !
11
Type interpretation
Definition (Type interpretation)
The type interpretation T [[.]] compositionally maps a regularexpression E to the corresponding simple type:
T [[0]] = ∅ empty typeT [[1]] = {()} unit typeT [[a]] = {a} singleton type
T [[E | F ]] = T [[E ]] + T [[F ]] sum typeL[[E F ]] = T [[E ]]× T [[F ]] product typeT [[E ∗]] = {[v1, . . . , vn] | vi ∈ T [[E ]]} list type
12
Flattening
Definition
The flattening function flat(.) : Val(A)→ Seq(A) is defined asfollows:
flat(()) = ε flat(a) = aflat(inl v) = flat(v) flat(inr w) = flat(w)
flat((v ,w)) = flat(v) flat(w)
flat([v1, . . . , vn]) = flat(v1) . . . flat(vn)
Example
flat([inl ((a, b), inr d), inr (a, (b, c))]) = abdabc
flat([inl ((a, b), inr d), inl ((a, b), inl c)]) = abdabc
13
Regular expressions as types
Informally:
string s with syntax tree p according to regular expression E∼=
string flat(v) of value v element of simple type E
Theorem
L[[E ]] = {flat(v) | v ∈ T [[E ]]}
14
Membership testing versus parsing
Example
E = ((ab)(c|d)|(abc))* Ed = (ab(c|d))*
Ed is unambiguous: If v ,w ∈ T [[Ed ]] and flat(v) = flat(w)then v = w . (Each string in Ed has exactly one syntax tree.)
E is ambiguous. (Recall p1 and p2.)
E and Ed are equivalent: L[[E ]] = L[[Ed ]]
Ed “represents” the minimal deterministic finite automatonfor E .
Matching (membership testing): Easy—use Ed .
But: How to parse according to E using Ed?
15
Bit coding
General idea:
Have nondeterministic machine/algorithm M with no input,generating all elements of a set
Use sequence of choices as representation of output (moduloM)
For regular languages:
Record binary choices for expanding a regular expression Einto a particular string s.
The sequence of choices (as bits) to drive machine toparticular output s as the bit coding of s under E .
16
Bit coding: Example
Example
Recall syntax trees p1, p2 for abdabc underE = ((a× b)× (c + d) + a× (b × c))∗.
p1 = [inl ((a, b), inr d), inr (a, (b, c))]
p2 = [inl ((a, b), inr d), inl ((a, b), inl c)]
We can code them by storing only their inl , inr occurrences:
code(p1) = 011
code(p2) = 0100
17
Bit decoding
There is a linear-time polytypic function decode that canreconstitute the syntax trees.
Theorem
decodeE (codeE (v)) = v for all v ∈ T [[E ]].
Example
decodeE (011) = [inl ((a, b), inr d), inr (a, (b, c))]
decodeE (0100) = [inl ((a, b), inr d), inl ((a, b), inl c)]
18
Why bit coding?
Bit coding of string s under E
represents a syntax tree of s
takes at most as much space as |s| and often a lot less(depending on E )
can be combined with statistical compression for textcompression
19
Bit coded regular expression parsing
Problem:
Input: string s and regular expression E .Output: (some) parse tree p such that flat(p) = s.
Goal: Output bit coding codeE (p) instead.
Dual advantage:
Less space used for output.Output faster to compute.
How to do that? Mark the “turns” in Thompson NFA (theyyield the bit coding)
20
DFASIM algorithm: Outline
1 RE to NFA: Build Thompson-style NFA with suitable outputbits
2 NFA to DFA: Perform extended DFA construction (only forstates required by input string), with (multiple) bit sequenceannotations on edges
3 Traverse accepting path from right to left to construct bitcoding by concatenating bit sequences.
21
Thompson-style NFA generation with output bits
E NFA Extended NFA
00
1
0
1
10 0
a0 1
a0 1
a/
E F0 1
E2
F0 1
E2
F
E | F
0
2
1
4F
3E
5 0
2/1
1
/0
4F
3E
5/
/
E ∗0
3
1
2
E0
3
/1
1/0
2
E
/
22
Benchmark examples
1: \w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*
([,;]\s*\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*)*
2: $?(\d{1,3},?(\d{3},?)*\d{3}(\.\d{0,2})?|\d{1,3}(\.\d{0,2})?|\.\d{1,2}?)
4: [A-Za-z0-9](([ \.\-]?[a-zA-Z0-9]+)*)@([A-Za-z0-9]+)
(([\.\-]?[a-zA-Z0-9]+)*)\.([A-Za-z][A-Za-z]+)
5: (\w|-)+@((\w|-)+\.)+(\w|-)+
6: [+-]?([0-9]*\.?[0-9]+|[0-9]+\.?[0-9]*)([eE][+-]?[0-9]+)?
7: ((\w|\d|\-|\.)+)@{1}(((\w|\d|\-){1,67})|((\w|\d|\-)+\.(\w|\d|\-){1,67}))
\.((([a-z]|[A-Z]|\d){2,4})(\.([a-z]|[A-Z]|\d){2})?)
8: (([A-Za-z0-9]+ +)|([A-Za-z0-9]+\-+)|([A-Za-z0-9]+\.+)|([A-Za-z0-9]+\++))*
[A-Za-z0-9]+@((\w+\-+)|(\w+\.))*\w{1,63}\.[a-zA-Z]{2,6}
9: (([a-zA-Z0-9 \-\.]+)@([a-zA-Z0-9 \-\.]+)\.([a-zA-Z]{2,5}){1,25})+
([;.](([a-zA-Z0-9 \-\.]+)@([a-zA-Z0-9 \-\.]+)\.([a-zA-Z]{2,5}){1,25})+)*
10: ((\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*)\s*[,]{0,1}\s*)+
From Veanes, de Halleaux, Tillman (2010)
23
Benchmark experiments (without #3)
0
10
20
30
40
50
60
70
80
90
0 2000 4000 6000 8000 10000
tim
e
n
Example #1
FrCa (s)DFA (s)
Precompiled DFA (ms)DFASIM (ms)
0
100
200
300
400
500
600
700
0 2000 4000 6000 8000 10000
tim
e
n
Example #2
Backtracking (ms)FrCa (ms)DFA (ms)
Precompiled DFA (ms)DFASIM (ms)
0
200
400
600
800
1000
1200
1400
1600
1800
2000
0 2000 4000 6000 8000 10000
tim
e
n
Example #4
FrCa (ms)DFA (s)
Precompiled DFA (ms)DFASIM (ms)
0
1000
2000
3000
4000
5000
6000
7000
8000
0 2000 4000 6000 8000 10000
tim
e
n
Example #5
FrCa (ms)DFA (ms)
Precompiled DFA (ms)DFASIM (ms)
0
20
40
60
80
100
120
140
160
180
0 2000 4000 6000 8000 10000
tim
e
n
Example #6
Backtracking (s)FrCa (ms)DFA (ms)
Precompiled DFA (ms)DFASIM (ms)
0
500
1000
1500
2000
2500
3000
3500
4000
4500
5000
0 2000 4000 6000 8000 10000
tim
e
n
Example #7
FrCa (ms)DFASIM (ms)
0
500
1000
1500
2000
2500
3000
3500
0 2000 4000 6000 8000 10000
tim
e
n
Example #8
FrCa (ms)DFASIM (ms)
0
10000
20000
30000
40000
50000
60000
70000
80000
90000
0 2000 4000 6000 8000 10000
tim
e
n
Example #9
FrCa (ms)DFASIM (ms)
0
2000
4000
6000
8000
10000
12000
0 2000 4000 6000 8000 10000
tim
e
n
Example #10
FrCa (ms)DFASIM (ms)
24
Regular expression algorithms compared
FrCa: Based on Frisch, Cardelli (2004), right-to-left firstphase, left-to-right second phase.
DFASIM: As above.
DFA: As DFASIM, but staged. Extended DFA for completeextended Thomson-NFA generated, before application toinput.
Precompiled DFA: As DFA, but extended DFA specialized (inC++) and compiled.
Backtracking: PCRE-style backtracking parser.
All algorithms:
generate bit codes;
coded in C++
25
Benchmark experiment #1
0
10
20
30
40
50
60
70
80
90
0 2000 4000 6000 8000 10000
tim
e
n
Example #1
FrCa (s)DFA (s)
Precompiled DFA (ms)DFASIM (ms)
26
Benchmark experiment #2
0
100
200
300
400
500
600
700
0 2000 4000 6000 8000 10000
tim
e
n
Example #2
Backtracking (ms)FrCa (ms)DFA (ms)
Precompiled DFA (ms)DFASIM (ms)
27
Benchmark experiment #4
0
200
400
600
800
1000
1200
1400
1600
1800
2000
0 2000 4000 6000 8000 10000
tim
e
n
Example #4
FrCa (ms)DFA (s)
Precompiled DFA (ms)DFASIM (ms)
28
Benchmark experiment #5
0
1000
2000
3000
4000
5000
6000
7000
8000
0 2000 4000 6000 8000 10000
tim
e
n
Example #5
FrCa (ms)DFA (ms)
Precompiled DFA (ms)DFASIM (ms)
29
Benchmark experiment #6
0
20
40
60
80
100
120
140
160
180
0 2000 4000 6000 8000 10000
tim
e
n
Example #6
Backtracking (s)FrCa (ms)DFA (ms)
Precompiled DFA (ms)DFASIM (ms)
30
Benchmark experiment #7
0
500
1000
1500
2000
2500
3000
3500
4000
4500
5000
0 2000 4000 6000 8000 10000
tim
e
n
Example #7
FrCa (ms)DFASIM (ms)
31
Benchmark experiment #8
0
500
1000
1500
2000
2500
3000
3500
0 2000 4000 6000 8000 10000
tim
e
n
Example #8
FrCa (ms)DFASIM (ms)
32
Benchmark experiment #9
0
10000
20000
30000
40000
50000
60000
70000
80000
90000
0 2000 4000 6000 8000 10000
tim
e
n
Example #9
FrCa (ms)DFASIM (ms)
33
References
Henglein, Nielsen, “Regular Expression Containment:Coinductive Axiomatization and ComputationalInterpretation”, POPL 2011
Nielsen, Henglein, “Bit-coded Regular Expression Parsing”,LATA 2011
34
Related workFrisch, Cardelli (2004): Regular types corresponding to regularexpressions, linear-time parsing for REs;Hosoya et al. (2000-): Regular expression types, properextension of regular types (!), axiomatization of treecontainmentAanderaa (1965), Salomaa (1966), Krob (1990), Pratt(1990), Kozen (1994, 2008), Grabmeyer (2005), Rutten et al.(2008): RE axiomatizations (extensional)Rutten et al. (1998-): Coalgebraic approach to systems,including finite automata, extensionalBrandt/Henglein (1998): Coinduction rule and computationalinterpretation for recursive typesCameron (1988), Jansson, Jeuring (1999): Bit coding forCFGs and algebraic typesCox (2010): RE2 regular expression library, TCL RE library(appear to be state of the Perl/POSIX-style “regex” libraries)
35
Questions?
36
Future work
Construction of minimal extended NFAs: All
Regular expression parsing with projection (throwing subtreesaway)
Regular expression parsing with catamorphic postprocessing(substituting subtrees)
Regular expression library as practical alternatives to PCRE,RE2 and Tcl, etc., with improved expressiveness, semanticsand performance.
37