Regular expressions as types: Bit-coded regular expression ... · Regular expressions as types: Bit-coded regular expression parsing Fritz Henglein Department of Computer Science

Regular expressions as types:Bit-coded regular expression parsing

Fritz Henglein

Department of Computer ScienceUniversity of CopenhagenEmail: [email protected]

WG 2.8 Meeting, Marble Falls, 2011-03-07

Joint work with Lasse Nielsen, DIKU

Regular expression

Definition (Regular expression)

A regular expression (RE) over finite alphabet A is an expression ofthe form

E ,F ::= 0 | 1 | a | E |F | EF | E∗

where a ∈ A

Used in bioinformatics, compilers (lexical analysis, control flowanalysis), logic, natural language processing, program verification,protocol specification, query processing, security, XML accesspaths and document types, operating systems, scripting ofsearching, matching and substitution in texts or semi-structureddata (Perl) . . .

2

Language interpretation of regular expressions

Definition (Language interpretation)

The language interpretation of a regular expression E is the set ofstrings L[[E ]] defined by

L[[0]] = ∅L[[1]] = {ε}L[[a]] = {a}

L[[E |F ]] = L[[E ]] ∪ L[[F ]]L[[EF ]] = L[[E ]]� L[[F ]]L[[E∗]] =

⋃i≥0(L[[E ]])i

where S � T = {s t | s ∈ S ∧ t ∈ T}, E 0 = {ε},E i+1 = E E i .

3

Kleene’s Theorem

Theorem (Kleene 1956)

A language is regular if and only it is denoted by a regularexpression under its language interpretation.

4

What is regular expression “matching”?

Given regular expression and input string, return . . . what?

1 yes or no (membership testing)

2 zero or one substring matches for each regular subexpression(PCRE)

3 any finite number of substring matches for each regularsubexpression (regular expression types)

4 a parse tree

5

What is regular expression “matching”?

1 Membership testing = language interpretation.

2 PCRE: Only one match under a Kleene star (typically the last)

3 RET: Matches under two Kleene stars not grouped

4 Parsing: Each Kleene star yields a list of matches (thus parsetree).

Note:

Increasing structure: Lower level matching outputconstructible from higher level matching output, in particularfrom parsing.

Classical automata theory (e.g. minimal DFA construction)only sound for membership testing.

6

Practice

PCRE-style programming1:

Group matching: Does the RE match and where do (some of)its sub-REs match in the string?

Substitution: Replace matched substrings by specified otherstrings

Extensions: Backreferences, look-ahead, look-behind,...

Lazy vs. greedy matching, possessive quantifiers, atomicgrouping

Optimization

Observe: Language interpretation (yes/no) inappropriate, needmore refined interpretation

1in Perl and such7

Example

((ab)(c|d)|(abc))*.

Match against abdabc .For each parenthesized group a substring is returned.a

PCRE POSIX

$1 = abc abc$2 = ab ε$3 = c ε$4 = ε abc

aOr special null-value

8

Intermezzo: Optimization??Optimizing regular expressions = rewriting them to equivalentform that is more efficient for matching.2

Cox (2007)

Perl-compliant regular expressions (what you get in Perl,Python, Ruby, Java) use backtracking parsing.Does not handle “problematic” regular expressions: E ∗ whereE contains ε – may crash at run-time (stack overflow).

2Friedl, Mastering Regular Expressions, chapter 6: Crafting an efficientexpression9

Why discrepancy between theory and practice?

Theory is extensional: About regular languages.

Does this string match the regular expression? Yes or no?

Practice is intensional: About regular expressions asgrammars.

Does this string match the regular expression and if sohow—which parts of the string match which parts of the RE?

Ideally: Regular expression matching = parsing +“catamorphic” processing of syntax tree

Reality:

Naive backtracking matching, orfinite automaton + opportunistic instrumentation to get someparsing information (TCL (?), Laurikari 2000, Cox 2010).

10

Regular expression parsing

Regular expression parsing: Construct parse tree for givenstring.

Representation of parse tree: Regular expression as type

Example

Parse abdabc according to ((ab)(c|d)|(abc))*.

p1 = [inl ((a, b), inr d), inr (a, (b, c))]

p2 = [inl ((a, b), inr d), inl ((a, b), inl c)]

p1, p2 have type ((a× b)× (c + d) + a× (b × c)) list .

Compare with regular expression ((ab)(c|d)|(abc))* .

The elements of type E correspond to the syntax trees forstrings parsed according to regular expression E !

11

Type interpretation

Definition (Type interpretation)

The type interpretation T [[.]] compositionally maps a regularexpression E to the corresponding simple type:

T [[0]] = ∅ empty typeT [[1]] = {()} unit typeT [[a]] = {a} singleton type

T [[E | F ]] = T [[E ]] + T [[F ]] sum typeL[[E F ]] = T [[E ]]× T [[F ]] product typeT [[E ∗]] = {[v1, . . . , vn] | vi ∈ T [[E ]]} list type

12

Flattening

Definition

The flattening function flat(.) : Val(A)→ Seq(A) is defined asfollows:

flat(()) = ε flat(a) = aflat(inl v) = flat(v) flat(inr w) = flat(w)

flat((v ,w)) = flat(v) flat(w)

flat([v1, . . . , vn]) = flat(v1) . . . flat(vn)

Example

flat([inl ((a, b), inr d), inr (a, (b, c))]) = abdabc

flat([inl ((a, b), inr d), inl ((a, b), inl c)]) = abdabc

13

Regular expressions as types

Informally:

string s with syntax tree p according to regular expression E∼=

string flat(v) of value v element of simple type E

Theorem

L[[E ]] = {flat(v) | v ∈ T [[E ]]}

14

Membership testing versus parsing

Example

E = ((ab)(c|d)|(abc))* Ed = (ab(c|d))*

Ed is unambiguous: If v ,w ∈ T [[Ed ]] and flat(v) = flat(w)then v = w . (Each string in Ed has exactly one syntax tree.)

E is ambiguous. (Recall p1 and p2.)

E and Ed are equivalent: L[[E ]] = L[[Ed ]]

Ed “represents” the minimal deterministic finite automatonfor E .

Matching (membership testing): Easy—use Ed .

But: How to parse according to E using Ed?

15

Bit coding

General idea:

Have nondeterministic machine/algorithm M with no input,generating all elements of a set

Use sequence of choices as representation of output (moduloM)

For regular languages:

Record binary choices for expanding a regular expression Einto a particular string s.

The sequence of choices (as bits) to drive machine toparticular output s as the bit coding of s under E .

16

Bit coding: Example

Example

Recall syntax trees p1, p2 for abdabc underE = ((a× b)× (c + d) + a× (b × c))∗.

p1 = [inl ((a, b), inr d), inr (a, (b, c))]

p2 = [inl ((a, b), inr d), inl ((a, b), inl c)]

We can code them by storing only their inl , inr occurrences:

code(p1) = 011

code(p2) = 0100

17

Bit decoding

There is a linear-time polytypic function decode that canreconstitute the syntax trees.

Theorem

decodeE (codeE (v)) = v for all v ∈ T [[E ]].

Example

decodeE (011) = [inl ((a, b), inr d), inr (a, (b, c))]

decodeE (0100) = [inl ((a, b), inr d), inl ((a, b), inl c)]

18

Why bit coding?

Bit coding of string s under E

represents a syntax tree of s

takes at most as much space as |s| and often a lot less(depending on E )

can be combined with statistical compression for textcompression

19

Bit coded regular expression parsing

Problem:

Input: string s and regular expression E .Output: (some) parse tree p such that flat(p) = s.

Goal: Output bit coding codeE (p) instead.

Dual advantage:

Less space used for output.Output faster to compute.

How to do that? Mark the “turns” in Thompson NFA (theyyield the bit coding)

20

DFASIM algorithm: Outline

1 RE to NFA: Build Thompson-style NFA with suitable outputbits

2 NFA to DFA: Perform extended DFA construction (only forstates required by input string), with (multiple) bit sequenceannotations on edges

3 Traverse accepting path from right to left to construct bitcoding by concatenating bit sequences.

21

Thompson-style NFA generation with output bits

E NFA Extended NFA

00

1

0

1

10 0

a0 1

a0 1

a/

E F0 1

E2

F0 1

E2

F

E | F

0

2

1

4F

3E

5 0

2/1

1

/0

4F

3E

5/

/

E ∗0

3

1

2

E0

3

/1

1/0

2

E

/

22

Benchmark examples

1: \w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*

([,;]\s*\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*)*

2: $?(\d{1,3},?(\d{3},?)*\d{3}(\.\d{0,2})?|\d{1,3}(\.\d{0,2})?|\.\d{1,2}?)

4: [A-Za-z0-9](([ \.\-]?[a-zA-Z0-9]+)*)@([A-Za-z0-9]+)

(([\.\-]?[a-zA-Z0-9]+)*)\.([A-Za-z][A-Za-z]+)

5: (\w|-)+@((\w|-)+\.)+(\w|-)+

6: [+-]?([0-9]*\.?[0-9]+|[0-9]+\.?[0-9]*)([eE][+-]?[0-9]+)?

7: ((\w|\d|\-|\.)+)@{1}(((\w|\d|\-){1,67})|((\w|\d|\-)+\.(\w|\d|\-){1,67}))

\.((([a-z]|[A-Z]|\d){2,4})(\.([a-z]|[A-Z]|\d){2})?)

8: (([A-Za-z0-9]+ +)|([A-Za-z0-9]+\-+)|([A-Za-z0-9]+\.+)|([A-Za-z0-9]+\++))*

[A-Za-z0-9]+@((\w+\-+)|(\w+\.))*\w{1,63}\.[a-zA-Z]{2,6}

9: (([a-zA-Z0-9 \-\.]+)@([a-zA-Z0-9 \-\.]+)\.([a-zA-Z]{2,5}){1,25})+

([;.](([a-zA-Z0-9 \-\.]+)@([a-zA-Z0-9 \-\.]+)\.([a-zA-Z]{2,5}){1,25})+)*

10: ((\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*)\s*[,]{0,1}\s*)+

From Veanes, de Halleaux, Tillman (2010)

23

Benchmark experiments (without #3)

0

10

20

30

40

50

60

70

80

90

0 2000 4000 6000 8000 10000

tim

e

n

Example #1

FrCa (s)DFA (s)

Precompiled DFA (ms)DFASIM (ms)

0

100

200

300

400

500

600

700

0 2000 4000 6000 8000 10000

tim

e

n

Example #2

Backtracking (ms)FrCa (ms)DFA (ms)


0

200

400

600

800

1000

1200

1400

1600

1800

2000

0 2000 4000 6000 8000 10000

tim

e

n

Example #4

FrCa (ms)DFA (s)


0

1000

2000

3000

4000

5000

6000

7000

8000

0 2000 4000 6000 8000 10000

tim

e

n

Example #5

FrCa (ms)DFA (ms)


0

20

40

60

80

100

120

140

160

180

0 2000 4000 6000 8000 10000

tim

e

n

Example #6

Backtracking (s)FrCa (ms)DFA (ms)


0

500

1000

1500

2000

2500

3000

3500

4000

4500

5000

0 2000 4000 6000 8000 10000

tim

e

n

Example #7

FrCa (ms)DFASIM (ms)

0

500

1000

1500

2000

2500

3000

3500

0 2000 4000 6000 8000 10000

tim

e

n

Example #8


0

10000

20000

30000

40000

50000

60000

70000

80000

90000

0 2000 4000 6000 8000 10000

tim

e

n

Example #9


0

2000

4000

6000

8000

10000

12000

0 2000 4000 6000 8000 10000

tim

e

n

Example #10


24

Regular expression algorithms compared

FrCa: Based on Frisch, Cardelli (2004), right-to-left firstphase, left-to-right second phase.

DFASIM: As above.

DFA: As DFASIM, but staged. Extended DFA for completeextended Thomson-NFA generated, before application toinput.

Precompiled DFA: As DFA, but extended DFA specialized (inC++) and compiled.

Backtracking: PCRE-style backtracking parser.

All algorithms:

generate bit codes;

coded in C++

25

Benchmark experiment #1

0

10

20

30

40

50

60

70

80

90

0 2000 4000 6000 8000 10000

tim

e

n

Example #1

FrCa (s)DFA (s)


26


0

100

200

300

400

500

600

700

0 2000 4000 6000 8000 10000

tim

e

n

Example #2

Backtracking (ms)FrCa (ms)DFA (ms)


27


0

200

400

600

800

1000

1200

1400

1600

1800

2000

0 2000 4000 6000 8000 10000

tim

e

n

Example #4

FrCa (ms)DFA (s)


28


0

1000

2000

3000

4000

5000

6000

7000

8000

0 2000 4000 6000 8000 10000

tim

e

n

Example #5

FrCa (ms)DFA (ms)


29


0

20

40

60

80

100

120

140

160

180

0 2000 4000 6000 8000 10000

tim

e

n

Example #6

Backtracking (s)FrCa (ms)DFA (ms)


30


0

500

1000

1500

2000

2500

3000

3500

4000

4500

5000

0 2000 4000 6000 8000 10000

tim

e

n

Example #7


31


0

500

1000

1500

2000

2500

3000

3500

0 2000 4000 6000 8000 10000

tim

e

n

Example #8


32


0

10000

20000

30000

40000

50000

60000

70000

80000

90000

0 2000 4000 6000 8000 10000

tim

e

n

Example #9


33

References

Henglein, Nielsen, “Regular Expression Containment:Coinductive Axiomatization and ComputationalInterpretation”, POPL 2011

Nielsen, Henglein, “Bit-coded Regular Expression Parsing”,LATA 2011

34

Related workFrisch, Cardelli (2004): Regular types corresponding to regularexpressions, linear-time parsing for REs;Hosoya et al. (2000-): Regular expression types, properextension of regular types (!), axiomatization of treecontainmentAanderaa (1965), Salomaa (1966), Krob (1990), Pratt(1990), Kozen (1994, 2008), Grabmeyer (2005), Rutten et al.(2008): RE axiomatizations (extensional)Rutten et al. (1998-): Coalgebraic approach to systems,including finite automata, extensionalBrandt/Henglein (1998): Coinduction rule and computationalinterpretation for recursive typesCameron (1988), Jansson, Jeuring (1999): Bit coding forCFGs and algebraic typesCox (2010): RE2 regular expression library, TCL RE library(appear to be state of the Perl/POSIX-style “regex” libraries)

35

Questions?

36

Future work

Construction of minimal extended NFAs: All

Regular expression parsing with projection (throwing subtreesaway)

Regular expression parsing with catamorphic postprocessing(substituting subtrees)

Regular expression library as practical alternatives to PCRE,RE2 and Tcl, etc., with improved expressiveness, semanticsand performance.

37

Documents

Regular expressions as types: Bit-coded regular expression ... · Regular expressions as types: Bit-coded regular expression parsing Fritz Henglein Department of Computer Science