[ 1 ] May 11, 2010 C. Brabrand & J. G. Thomsen REGULAR EXPRESSIONS COPLAS DIKU, Denmark

Pattern Matching on Strings using Regular Expressions

Claus Brabrand [ [email protected] ]
IT University of Copenhagen

Jakob G. Thomsen [ [email protected] ]
Aarhus University

Num = 0 | [1-9][0-9]*
Email = [a-z]+ "@" [a-z]+ ("." [a-z]+ )*


Outline

Pattern Matching (intro & motiv): The Chomsky Hierarchy (1956)
Regular Expressions: The Recording Construction
Ambiguity: Disambiguation
Type Inference
Usage and Examples
Evaluation and Conclusion


Introduction & Motivation

Pattern matching is an indispensable problem
Many applications need to "parse" dynamic input

1) URLs:

http://first.dk/index.php?id=141&view=details
protocol host path query-string (list of key-value pairs)

2) Log Files:

13/02/2010 66.249.65.107 get /support.html
20/02/2010 42.116.32.64 post /search.html

3) DBLP:

<article>
  <title>Three Models for the...</title>
  <author>Noam Chomsky</author>
  <year>1956</year>
</article>


The Chomsky Hierarchy (1956)

Language classes (+formalisms):

Type-3 regular expressions "enough" for: URLs, log files, DBLP, ...

"Trade" (excess) expressivity for: declarativity, simplicity, and static safety !


Type-0: java.net.URL

Turing-Complete programming (e.g., Java)
[ "unrestricted grammars" (e.g., rewriting systems) ]

Cyclomatic complexity (of official "java.net.URL"):

88 bug reports on Sun's Bug Repository !
Bug reports span more than a decade !


Type-1: Context-Sensitivity

Not a widely used (or studied?) formalism

Presumably because: Restricts expressivity w/o offering extra safety?

- ? -


Type-2: Context-Free Grammars

Conceptually harder than regexps
Essentially (Type-3) Regular Expressions + recursion

The ultimate end-all scientific argument: We d:

regexps 12 times more popular ! (conjecture!)


Type-?: Regexp Capture Groups

Capturing groups (Perl, PHP, Java regex, ...):
Syntax: (R) (i.e., in parentheses)

Back-references:
Syntax: \7 (i.e., "index of" capturing group)

Beyond regularity !
(a*)b\1 is non-regular: \( \{\, a^n\, b\, a^n \mid n \ge 0 \,\} \)

In fact, not even context-free !!!
(.*).\1 is non-context-free: \( \{\, \omega\, \alpha\, \omega \mid \alpha \in \Sigma,\ \omega \in \Sigma^* \,\} \)


Type-?: Regexp Capture Groups

Interpretation with back-tracking:
NP-complete (exponential worst-case) :-(

regexp \( (a?)^n a^n \) vs. string \( a^n \)

1 minute vs. 0.02 msecs

3,000,000:1 on strings of length 29 !!!
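The pathological family above can be tried out with java.util.regex, which is a back-tracking matcher. This is a minimal sketch, assuming the regexp is n optional a's followed by n mandatory a's and the string is n a's; the class and method names are made up for illustration, and no timings are claimed since they depend on the machine and JVM.

```java
import java.util.regex.Pattern;

final class BacktrackingDemo {
    // Builds the regexp (a?)^n a^n, e.g. n = 3 gives "a?a?a?aaa".
    static String pathologicalRegexp(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) sb.append("a?");
        for (int i = 0; i < n; i++) sb.append('a');
        return sb.toString();
    }

    // Builds the string a^n.
    static String input(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) sb.append('a');
        return sb.toString();
    }

    // The match always succeeds: every "a?" must end up matching the empty string.
    static boolean matches(int n) {
        return Pattern.matches(pathologicalRegexp(n), input(n));
    }

    public static void main(String[] args) {
        // n is kept small here; larger n may take very long with a back-tracking matcher.
        for (int n = 5; n <= 20; n += 5) {
            long start = System.nanoTime();
            boolean ok = matches(n);
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.println("n=" + n + " matched=" + ok + " in " + micros + " us");
        }
    }
}
```

The greedy `a?` first consumes an `a`, so the matcher has to undo those choices one by one before the mandatory a's fit; that is where the back-tracking cost comes from.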


Type-3: Regular Expressions

Closure properties:
Union
Concatenation
Iteration
Restriction
Intersection
Complement
...

Decidability properties:
...
...
Containment: \( L(R) \subseteq L(R') \)
Ambiguity
...
...

Declarative ! Safe ! Simple !


Regular Expressions

Syntax:

Semantics:

where:

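\( L_1 L_2 \) is concatenation (i.e., \( \{\, \omega_1 \omega_2 \mid \omega_1 \in L_1,\ \omega_2 \in L_2 \,\} \))

\( L^* = \bigcup_{i \ge 0} L^i \) where \( L^0 = \{ \varepsilon \} \) and \( L^i = L\, L^{i-1} \)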
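The two language operations can be played with on finite languages. A minimal sketch, assuming languages are represented as sets of strings; `LanguageOps` and `starUpTo` are illustrative names, and since L* is infinite in general only the finite union L^0 ∪ ... ∪ L^max is computed.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;

final class LanguageOps {
    // L1 L2 = { w1 w2 | w1 in L1, w2 in L2 }
    static Set<String> concat(Set<String> l1, Set<String> l2) {
        Set<String> out = new TreeSet<>();
        for (String w1 : l1)
            for (String w2 : l2)
                out.add(w1 + w2);
        return out;
    }

    // L^0 = { "" } (the empty string) and L^i = L L^(i-1)
    static Set<String> power(Set<String> l, int i) {
        if (i == 0) return new TreeSet<>(Collections.singleton(""));
        return concat(l, power(l, i - 1));
    }

    // Finite approximation of L*: the union of L^0 .. L^max.
    static Set<String> starUpTo(Set<String> l, int max) {
        Set<String> out = new TreeSet<>();
        for (int i = 0; i <= max; i++) out.addAll(power(l, i));
        return out;
    }

    public static void main(String[] args) {
        Set<String> l = new TreeSet<>(Arrays.asList("a", "bc"));
        System.out.println(power(l, 2));    // [aa, abc, bca, bcbc]
        System.out.println(starUpTo(l, 2)); // also contains "", "a" and "bc"
    }
}
```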


Common Extensions (sugar)

Any character (aka, dot): "." as c1|c2|...|cn, ci ∈ Σ

Character ranges: "[a-z]" as a|b|...|z

One-or-more regexps: "R+" as RR*

Optional regexp: "R?" as ε|R

Various repetitions; e.g.: "R{2,3}" as RRR?
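The desugarings can be sanity-checked by comparing each sugared regexp with its expansion. A minimal sketch using java.util.regex, assuming R = a and testing only the strings ε, a, aa, ... up to a chosen length; this is a spot check, not a proof of equivalence, and `SugarCheck` is an illustrative name.

```java
import java.util.regex.Pattern;

final class SugarCheck {
    // True if both regexps give the same answer on a^0, a^1, ..., a^maxLen.
    static boolean agree(String sugared, String desugared, int maxLen) {
        Pattern p = Pattern.compile(sugared);
        Pattern q = Pattern.compile(desugared);
        StringBuilder s = new StringBuilder();
        for (int len = 0; len <= maxLen; len++) {
            if (p.matcher(s).matches() != q.matcher(s).matches()) return false;
            s.append('a');
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(agree("a+", "aa*", 6));      // R+ as RR*
        System.out.println(agree("a?", "|a", 6));       // R? as (empty)|R
        System.out.println(agree("a{2,3}", "aaa?", 6)); // R{2,3} as RRR?
    }
}
```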


Recording

Syntax: <x = R>
"x" is a recording identifier (it "remembers" the substring it matches)

Semantics:

Example (simplified emails):

<user = [a-z]+ > "@" <domain = [a-z]+ ("." [a-z]+)* >

Matching against string:

"[email protected]"

yields:

user = "obama" & domain = "whitehouse.gov"

NB: cannot use DFAs / NFAs !
- only recognition (yes / no)
- not how (i.e., "the structure")
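For comparison, the same extraction can be approximated in plain Java with named capturing groups. This is a sketch with java.util.regex, not the recording construction of the talk; the class name is illustrative and the input string is an assumption chosen to be consistent with the yielded values user = "obama" and domain = "whitehouse.gov".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EmailRecording {
    // Named groups stand in for the recordings <user = ...> and <domain = ...>.
    private static final Pattern EMAIL = Pattern.compile(
        "(?<user>[a-z]+)@(?<domain>[a-z]+(?:\\.[a-z]+)*)");

    // Returns { user, domain }, or null if the whole string does not match.
    static String[] match(String s) {
        Matcher m = EMAIL.matcher(s);
        if (!m.matches()) return null;
        return new String[] { m.group("user"), m.group("domain") };
    }

    public static void main(String[] args) {
        String[] r = match("obama@whitehouse.gov");
        System.out.println("user = " + r[0] + " & domain = " + r[1]);
    }
}
```

Here the group names are only strings looked up at run time; a misspelled name fails when `group` is called, not at compile time.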


Recording (structured)

Another example (with nested recordings):

<date = <day = [0-9]{2} > "/" <month = [0-9]{2} > "/" <year = [0-9]{4} >>

Matching against string:

"26/06/1992"

yields:

date = 26/06/1992
date.day = 26
date.month = 06
date.year = 1992


Recording (structured, lists)

Yet another example (yielding lists):

<name = [a-z]+ > " & " <name = [a-z]+ >

Matching against string:

"obama & bush"

yields a list structure:

name = [obama,bush]

Also (recordings under star):

( <name = [a-z]+ > "\n" )*

<name = [a-z]+ > (" & " <name = [a-z]+ > )*
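Plain java.util.regex has no list-valued groups: a capturing group under a star keeps only its last iteration. The sketch below shows that behaviour and one possible workaround (validate the whole string, then scan for the names); `NameList` and its methods are illustrative names, not part of the talk's tool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NameList {
    private static final Pattern WHOLE = Pattern.compile("([a-z]+)(?: & ([a-z]+))*");
    private static final Pattern NAME = Pattern.compile("[a-z]+");

    // Group 2 sits under a star, so only its last iteration is kept.
    static String lastOnly(String s) {
        Matcher m = WHOLE.matcher(s);
        return m.matches() ? m.group(2) : null;
    }

    // Collects every name: check the overall shape first, then scan for names.
    static List<String> names(String s) {
        List<String> out = new ArrayList<>();
        if (!WHOLE.matcher(s).matches()) return out;
        Matcher m = NAME.matcher(s);
        while (m.find()) out.add(m.group());
        return out;
    }

    public static void main(String[] args) {
        System.out.println(names("obama & bush"));              // [obama, bush]
        System.out.println(lastOnly("obama & bush & clinton")); // clinton
    }
}
```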


Abstract Syntax Trees (ASTs)


Ambiguity

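Definition: R ambiguous iff

\( \exists\, T, T' \in AST_R :\ T \neq T' \ \wedge\ \|T\| = \|T'\| \)

where \( \|\cdot\| : AST \to \Sigma^* \) (the flattening) is:

[Figure: ASTs T and T' (labels: T, R, T', R', =)]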


Characterization of Ambiguity

Theorem: R unambiguous iff

NB: sound & complete !

R* = ε | RR*


Examples

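Ambiguous:

a|a : \( L(a) \cap L(a) = \{ a \} \neq \emptyset \)

a*a* : \( L(a^*) \bowtie L(a^*) = \{ a^n \} \neq \emptyset \)

Unambiguous:

a|aa : \( L(a) \cap L(aa) = \emptyset \)

a*ba* : \( L(a^*) \bowtie L(ba^*) = \emptyset \)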


Ambiguity Examples

a?b+|(ab)*

*** ambiguous choice: a?b+ <-|-> (ab)*
    shortest ambiguous string: "ab"

(a|ab)(ba|a)

*** ambiguous concatenation: (a|ab) <--> (ba|a)
    shortest ambiguous string: "aba"

(aa|aaa)*

*** ambiguous star: (aa|aaa)*
    shortest ambiguous string: "aaaaa"
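The three reported strings can be double-checked by brute force. The sketch below counts, for one given string, the ways in which a choice, a concatenation, or a star can parse it at the top level, using java.util.regex only as a yes/no matcher for the sub-expressions. It is not the static ambiguity analysis of the talk, it ignores ambiguity nested inside the sub-expressions, and all names are illustrative.

```java
import java.util.regex.Pattern;

final class AmbiguityCount {
    // Choice R1|R2: s is ambiguous if both alternatives match it.
    static boolean ambiguousChoice(String r1, String r2, String s) {
        return Pattern.matches(r1, s) && Pattern.matches(r2, s);
    }

    // Concatenation R1 R2: number of splits s = xy with x in L(R1) and y in L(R2).
    static int concatSplits(String r1, String r2, String s) {
        int n = 0;
        for (int i = 0; i <= s.length(); i++)
            if (Pattern.matches(r1, s.substring(0, i)) && Pattern.matches(r2, s.substring(i)))
                n++;
        return n;
    }

    // Star R*: number of ways to cut s into non-empty pieces that each match R.
    static long starSplits(String r, String s) {
        long[] ways = new long[s.length() + 1];
        ways[0] = 1; // the empty prefix has exactly one (empty) cut
        for (int end = 1; end <= s.length(); end++)
            for (int start = 0; start < end; start++)
                if (ways[start] > 0 && Pattern.matches(r, s.substring(start, end)))
                    ways[end] += ways[start];
        return ways[s.length()];
    }

    public static void main(String[] args) {
        System.out.println(ambiguousChoice("a?b+", "(ab)*", "ab")); // true
        System.out.println(concatSplits("a|ab", "ba|a", "aba"));    // 2
        System.out.println(starSplits("aa|aaa", "aaaaa"));          // 2
    }
}
```

A count above 1 (or `true` for choice) means the string has more than one parse at that operator; "aaaa" gives a count of 1 for the star example, consistent with "aaaaa" being the shortest ambiguous string.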


Disambiguation

1) Manual rewriting:
Always possible :-)
Tedious :-(
Error-prone :-(
Not structure-preserving :-(

2) Restriction: R1 - R2
And then encode...:
R^C as: Σ* - R
R1 & R2 as: (R1^C | R2^C)^C

3) Disambiguators:
From characterization:
concat: 'L', 'R'
choice: '|L', '|R'
star: '*L', '*R'
(partial-order on ASTs)

4) Default disamb:
concat, choice, and star are all left-biased (by default) !
(Our tool does this)
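The restriction operator and the two encodings built on it can be imitated in java.util.regex with a negative lookahead. This is only an emulation for whole-string matching (via `Pattern.matches`), assuming the operand regexps contain no capturing-group back-references; it is not how the talk's tool implements restriction, and the names are illustrative.

```java
import java.util.regex.Pattern;

final class Restriction {
    // R1 - R2: match R1, but reject the input if R2 matches all of it.
    static String minus(String r1, String r2) {
        return "(?!(?:" + r2 + ")\\z)(?:" + r1 + ")";
    }

    // R^C as: Sigma* - R  (Sigma* is written (?s:.*) so that it also covers newlines)
    static String complement(String r) {
        return minus("(?s:.*)", r);
    }

    // R1 & R2 as: (R1^C | R2^C)^C
    static String and(String r1, String r2) {
        return complement("(?:" + complement(r1) + ")|(?:" + complement(r2) + ")");
    }

    public static void main(String[] args) {
        String ident = minus("[a-z]+", "if|else"); // lower-case words except two keywords
        System.out.println(Pattern.matches(ident, "foo")); // true
        System.out.println(Pattern.matches(ident, "if"));  // false
        System.out.println(Pattern.matches(and("a*b*", "(ab)*"), "ab")); // true
    }
}
```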


Type Inference

Type Inference: R : (L,S)


Examples (Type Inference)

Regexp:

Person = <name = [a-z]+ > " (" <age = [0-9]+ > ")"

compile (our tool):

class Person { // auto-generated
  String name;
  int age;
  static Person match(String s) { ... }
  public String toString() { ... }
}

Usage:

String s = "obama (48)";
Person p = Person.match(s);
print(p.name + " is " + p.age + "y old");
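To make the idea of the generated class concrete, here is a hand-written stand-in built on java.util.regex. It is a sketch under assumptions, not the tool's actual output: the class name `PersonSketch`, the null result on a failed match, and the `toString` format are all choices made for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hand-written stand-in for the auto-generated Person class (hypothetical).
final class PersonSketch {
    private static final Pattern P =
        Pattern.compile("(?<name>[a-z]+) \\((?<age>[0-9]+)\\)");

    final String name; // [a-z]+ is typed as a String
    final int age;     // [0-9]+ is typed as an int

    private PersonSketch(String name, int age) {
        this.name = name;
        this.age = age;
    }

    // Returns null when s is not of the form: name (age)
    static PersonSketch match(String s) {
        Matcher m = P.matcher(s);
        if (!m.matches()) return null;
        return new PersonSketch(m.group("name"), Integer.parseInt(m.group("age")));
    }

    @Override
    public String toString() {
        return name + " (" + age + ")";
    }

    public static void main(String[] args) {
        PersonSketch p = PersonSketch.match("obama (48)");
        System.out.println(p.name + " is " + p.age + "y old");
    }
}
```

`Integer.parseInt` throws on digit strings too large for an int; how the real tool handles that case is not stated in the slides.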


Examples (Type Inference)

Regexp:

Person = <name = [a-z]+ > " (" <age = [0-9]+ > ")"
People = ( $Person "\n" )*

compile (our tool):

class People { // auto-generated
  String[] name;
  int[] age;
  static People match(String s) { ... }
  public String toString() { ... }
}

Usage:

String s = "obama (48)\nbush (63)\n";
People p = People.match(s);
println("Second name is " + p.name[1]);


Examples (Type Inference)

Regexp:

Person = <name = [a-z]+ > " (" <age = [0-9]+ > ")"
People = ( <person = $Person > "\n" )* ;

compile (our tool):

class People { // auto-generated
  Person[] person;
  class Person { // nested class
    String name;
    int age;
  }
  ...
}

Usage:

String s = "obama (48)\nbush (63)\n";
People people = People.match(s);
for (Person p : people.person)
  println(p.name);
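The nested, list-valued result can likewise be imitated by hand. The sketch below is an assumption-laden stand-in for the generated People class: it uses a `List` instead of an array, returns null on a failed match, and relies on java.util.regex; `PeopleSketch` is an illustrative name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hand-written stand-in for the auto-generated People class (hypothetical).
final class PeopleSketch {
    static final class Person { // nested class, as in the slide
        final String name;
        final int age;
        Person(String name, int age) { this.name = name; this.age = age; }
    }

    // One Person followed by a newline; the "\n" is a literal newline in the pattern.
    private static final String ENTRY = "([a-z]+) \\(([0-9]+)\\)\n";
    private static final Pattern WHOLE = Pattern.compile("(?:" + ENTRY + ")*");
    private static final Pattern ONE = Pattern.compile(ENTRY);

    final List<Person> person = new ArrayList<>();

    // Returns null unless s is a sequence of newline-terminated Person entries.
    static PeopleSketch match(String s) {
        if (!WHOLE.matcher(s).matches()) return null;
        PeopleSketch people = new PeopleSketch();
        Matcher m = ONE.matcher(s);
        while (m.find())
            people.person.add(new Person(m.group(1), Integer.parseInt(m.group(2))));
        return people;
    }

    public static void main(String[] args) {
        PeopleSketch people = PeopleSketch.match("obama (48)\nbush (63)\n");
        for (Person p : people.person) System.out.println(p.name);
    }
}
```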


URLs

URLs:

"http://www.google.com/search?q=record&hl=en"
protocol host path query-string

Regexp:

Host = <host = [a-z]+ ("." [a-z]+ )* > ;
Path = <path = [a-z/.]* > ;
Query = <query = [a-z&=]* > ;
URL = "http://" $Host "/" $Path "?" $Query ;

Query string further structured (list of key-value pairs):

KeyVal = <key = [a-z]* > "=" <val = [a-z]* > ;
Query = $KeyVal ("&" $KeyVal)* ;


URLs (Usage Example)

Regexp:

Host = <host = [a-z]+ ("." [a-z]+ )* > ;
Path = <path = [a-z/.]* > ;
KeyVal = <key = [a-z]* > "=" <val = [a-z]* > ;
Query = $KeyVal ("&" $KeyVal)* ;
URL = "http://" $Host "/" $Path "?" $Query ;

Usage (example):

String s = "http://www.google.com/search?q=record";
URL url = URL.match(s);
print("Host is: " + url.host);
if (url.key.length>0)
  print("1st key: " + url.key[0]);
for (String val : url.val)
  println("value = " + val);
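A rough plain-Java counterpart of the URL regexp, for readers who want to run something. It is a sketch with java.util.regex under assumptions: `UrlSketch` is an illustrative name, lists replace the arrays of the slide, and the key-value pairs are collected in a second pass because Java groups under a star keep only their last match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hand-written stand-in for a generated URL class (hypothetical).
final class UrlSketch {
    private static final Pattern URL = Pattern.compile(
        "http://(?<host>[a-z]+(?:\\.[a-z]+)*)"
        + "/(?<path>[a-z/.]*)"
        + "\\?(?<query>[a-z]*=[a-z]*(?:&[a-z]*=[a-z]*)*)");
    private static final Pattern KEYVAL = Pattern.compile("([a-z]*)=([a-z]*)");

    final String host;
    final String path;
    final List<String> key = new ArrayList<>();
    final List<String> val = new ArrayList<>();

    private UrlSketch(String host, String path) {
        this.host = host;
        this.path = path;
    }

    // Returns null when s does not have the shape http://host/path?k=v&k=v...
    static UrlSketch match(String s) {
        Matcher m = URL.matcher(s);
        if (!m.matches()) return null;
        UrlSketch url = new UrlSketch(m.group("host"), m.group("path"));
        Matcher kv = KEYVAL.matcher(m.group("query"));
        while (kv.find()) { // second pass: one iteration per key-value pair
            url.key.add(kv.group(1));
            url.val.add(kv.group(2));
        }
        return url;
    }

    public static void main(String[] args) {
        UrlSketch url = UrlSketch.match("http://www.google.com/search?q=record&hl=en");
        System.out.println("Host is: " + url.host);
        System.out.println("keys = " + url.key + ", values = " + url.val);
    }
}
```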


Log Files

Format:

13/02/2010 66.249.65.107 /support.html
20/02/2010 42.116.32.64 /search.html
...

Regexp:

Date = <date = <day = $Day > "/" <month = $Month > "/" <year = [0-9]{4} > > ;
IP = <ip = [0-9]{1,3} ("." [0-9]{1,3} ){3} > ;
Entry = <entry = $Date " " $IP " " $Path "\n" > ;
Log = $Entry * ;

Usage:

Log log = Log.match(log_file);
for (Entry e : log.entry)
  if (e.date.month == 02 && e.date.day == 29)
    print("Access on LEAP YEAR from IP# " + e.ip);
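The log-file usage can be approximated in plain Java as follows. This is a sketch with java.util.regex, assuming the Day and Month definitions 0?[1-9] | [1-2][0-9] | 30 | 31 and 0?[1-9] | 10 | 11 | 12 and the Path definition [a-z/.]* used elsewhere in the talk; `LogSketch` is an illustrative name and the sample entries in `main` are made up. Unlike the talk's approach, `find` simply skips text that does not match instead of rejecting the file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LogSketch {
    // One log entry: date, IP and path, terminated by a newline.
    private static final Pattern ENTRY = Pattern.compile(
        "(?<day>0?[1-9]|[1-2][0-9]|30|31)/(?<month>0?[1-9]|10|11|12)/(?<year>[0-9]{4})"
        + " (?<ip>[0-9]{1,3}(?:\\.[0-9]{1,3}){3})"
        + " (?<path>[a-z/.]*)\n");

    // Returns the IPs of all entries dated 29 February.
    static List<String> leapDayIps(String logFile) {
        List<String> ips = new ArrayList<>();
        Matcher m = ENTRY.matcher(logFile);
        while (m.find()) {
            int day = Integer.parseInt(m.group("day"));
            int month = Integer.parseInt(m.group("month"));
            if (month == 2 && day == 29) ips.add(m.group("ip"));
        }
        return ips;
    }

    public static void main(String[] args) {
        // Made-up sample data in the format shown above.
        String log = "29/02/2008 66.249.65.107 /support.html\n"
                   + "20/02/2010 42.116.32.64 /search.html\n";
        System.out.println(leapDayIps(log)); // [66.249.65.107]
    }
}
```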


Log Files (cont'd, ambiguity)

Assume we forgot "/" (between day & month):

Regexp:

Day = 0?[1-9] | [1-2][0-9] | 30 | 31 ;
Month = 0?[1-9] | 10 | 11 | 12 ;

Date = <date = <day = $Day > // no slash !
               <month = $Month > "/" <year = [0-9]{4} > > ;

Error:

*** ambiguous concatenation: <day> <--> <month>
    shortest ambiguous string: "101"

Ambiguity:

i.e. "1/01" (January 1) vs. "10/1" (January 10) :-)
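The two readings of "101" can be enumerated mechanically. A minimal brute-force sketch, assuming the Day and Month definitions from this slide; it lists every way to cut a string into a Day followed directly by a Month, and is a run-time check on one string rather than the static analysis that produced the error message. `DayMonthSplits` is an illustrative name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class DayMonthSplits {
    private static final Pattern DAY = Pattern.compile("0?[1-9]|[1-2][0-9]|30|31");
    private static final Pattern MONTH = Pattern.compile("0?[1-9]|10|11|12");

    // Lists every way to read s as a Day immediately followed by a Month.
    static List<String> splits(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i <= s.length(); i++) {
            String day = s.substring(0, i);
            String month = s.substring(i);
            if (DAY.matcher(day).matches() && MONTH.matcher(month).matches())
                out.add(day + "/" + month);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(splits("101"));  // [1/01, 10/1]
        System.out.println(splits("1301")); // [13/01]
    }
}
```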


DBLP (Format)

DBLP (XML) Format:

<article>
  <author>Noam Chomsky</author>
  <title>Three Models for the Description of Language</title>
  <year>1956</year>
  <journal>IRE Transactions on Information Theory</journal>
</article>
<article>
  <author>Claus Brabrand</author>
  <author>Jakob G Thomsen</author>
  <title>Typed and Unambiguous Pattern Matching on Strings using Regular Expressions</title>
  <year>2010</year>
  <note>Submitted</note>
</article>
...


DBLP (Regexp)

DBLP Regexp:

Author = "<author>" <author = [a-z]* > "</author>" ;
Title = "<title>" <title = [a-z]* > "</title>" ;
Article = "<article>" $Author* $Title .* "</article>" ;
DBLP = <pub = $Article > * ;

Ambiguity !:

*** ambiguous star: <pub>*
    shortest ambiguous string: "<article><title></title></article><article><title></title></article>"

EITHER 2 publications (.* = "") OR 1 publication (.* = gray part, i.e., "</article><article><title></title>") !!!


DBLP (Disambiguated)

DBLP Regexp:

Author = "<author>" <author = [a-z]* > "</author>" ;
Title = "<title>" <title = [a-z]* > "</title>" ;
Article = "<article>" $Author* $Title .* "</article>" ;
DBLP = <pub = $Article > * ;

Disambiguated (using "(R1-R2)"):

Article = "<article>" $Author* $Title (.* - (.* "</article>" .*)) "</article>" ;

Unambiguous! :-)
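The effect of the restriction can be observed in java.util.regex by replacing the plain ".*" with a tail that may not contain "</article>". The sketch below emulates the restriction with a negative lookahead, which is a swapped-in technique and not the (R1-R2) operator of the talk; `DblpSketch` and its fields are illustrative names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DblpSketch {
    private static final String HEAD =
        "<article>(?:<author>[a-z]*</author>)*<title>[a-z]*</title>";

    // Ambiguous version: ".*" may run across "</article>" into the next publication.
    static final Pattern LOOSE = Pattern.compile(HEAD + ".*</article>");

    // Restricted version: no position inside the tail may start "</article>".
    static final Pattern STRICT = Pattern.compile(HEAD + "(?:(?!</article>).)*</article>");

    // Counts successive non-overlapping matches of p in s.
    static int count(Pattern p, String s) {
        int n = 0;
        Matcher m = p.matcher(s);
        while (m.find()) n++;
        return n;
    }

    public static void main(String[] args) {
        String s = "<article><title></title></article>"
                 + "<article><title></title></article>";
        System.out.println(count(LOOSE, s));  // 1: greedy ".*" swallows the second article
        System.out.println(count(STRICT, s)); // 2
    }
}
```

With the loose pattern Java silently picks one of the two parses (the greedy one); the point of the talk is that such an ambiguity is reported before the regexp is ever run.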


DBLP (Usage Example)

DBLP Regexp:

Author = "<author>" <author = [a-z]* > "</author>" ;
Title = "<title>" <title = [a-z]* > "</title>" ;
Article = "<article>" $Author* $Title .* "</article>" ;
DBLP = <article = $Article > * ;

Usage (example):

DBLP dblp = DBLP.match(readXMLfile("DBLP.xml"));
for (Article a : dblp.article)
  print("Title: " + a.title);


Evaluation

Evaluation summary:

[ MatMult ] [ NP-Complete ] [ Frisch&Cardelli'04 ]

Also, (Type-3) regexps expressive "enough" for: URLs, Log files, DBLP, ...


Type-3 vs. Type-0 (URLs)

Regexps vs. Java:

Regexps are 8 times more concise !


java.util.regex vs. Our approach

Efficiency (on DBLP):

java.util.regex:
Exponential \( O(2^{|\omega|}) \): 2,500 chars in 2 mins !

In contrast; ours:
Linear (on DBLP): 1,200,000 chars in 6 secs !

[Chart labels: 2 mins, 10 msecs]


Related Work

Recording (with lists in general):
"x as R" in XDuce; "x::R" in CDuce; and "x@R" in Scala and HaRP

Ambiguity:
[Book+Even+Greibach+Ott'71] and [Hosoya'03] for XDuce, but indirectly via NFAs, not directly (syntax-directed)

Disambiguation:
[Vansummeren'06], but with global, not local disambiguation

Type inference:
Exact type inference in XDuce & CDuce (soundness+completeness proof in [Vansummeren'06]), but not for stand-alone and non-intrusive usage (Java)


Conclusion

For string pattern matching, it is possible to:

"trade (excess) expressivity for safety+simplicity"
i.e., ambiguity checking and type inference !
+ stand-alone & non-intrusive language integration (Java) !

In conclusion:

We conclude that if regular expressions are sufficiently expressive, they provide a simple, declarative, and safe means for pattern matching on strings, capable of extracting highly structural information in a statically type-safe and unambiguous manner.


</Talk>

Questions ? Complaints ?

[ http://www.cs.au.dk/~gedefar/reg-exp-rec/ ]