Regexes SoftFluent day 03/10/2013 Pablo Fernandez Duran

Regexes in .NET

Reg-what?

• Regular expressions

• Describing a search pattern

• Find and replace operations

• Dating back to the 1950s

• Regular language, formal language …

• Different flavors -> PCRE (Perl Compatible Regular Expressions)

• Now… not so regular

regex

regexp

reg-exp

regexps

reg-exps

regexes

regexen

^reg-?ex(?(?<=-ex)p|p?)(?(?<=x)e[sn]|s)?$
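The pattern above can be checked against the seven spellings with a few lines of C#. This is only a sketch: the class and method names are illustrative and not part of the original slides.

```csharp
using System.Text.RegularExpressions;

public static class SpellingDemo
{
    // Pattern copied from the slide. Both (?(...)yes|no) constructs are
    // conditionals whose test is a look-behind assertion.
    private static readonly Regex Spelling =
        new Regex(@"^reg-?ex(?(?<=-ex)p|p?)(?(?<=x)e[sn]|s)?$");

    // Hypothetical helper: true for regex, regexp, reg-exp, regexps,
    // reg-exps, regexes and regexen.
    public static bool IsKnownSpelling(string word)
    {
        return Spelling.IsMatch(word);
    }
}
```

Spellings that are not in the list, such as "reg-ex" or "regexs", are rejected.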

var re = new RegExp(/.*/); // js

var re = new Regex(".*"); // .NET

• We have a problem.

• Let’s use regexes!

• Now we have two problems.

What about you?

• Can you read regexes?

^[0-9]\w*$

• Can you really read regexes?

^[^)(]*\((?>[^()]+|\((?<p>)|\)(?<-p>))*(?(p)(?!))\)[^)(]*$
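For readers who cannot decode the second pattern at a glance, a small C# sketch shows what it accepts: text containing one parenthesised group whose inner parentheses are balanced. The class and method names are illustrative assumptions.

```csharp
using System.Text.RegularExpressions;

public static class ParenthesesDemo
{
    // Pattern copied from the slide. (?<p>) pushes onto the "p" stack for every
    // inner "(", (?<-p>) pops for every inner ")", and (?(p)(?!)) fails the
    // match if anything is still on the stack.
    private static readonly Regex Balanced = new Regex(
        @"^[^)(]*\((?>[^()]+|\((?<p>)|\)(?<-p>))*(?(p)(?!))\)[^)(]*$");

    // Hypothetical helper wrapping the slide's pattern.
    public static bool MatchesBalancedGroup(string input)
    {
        return Balanced.IsMatch(input);
    }
}
```

With this sketch, "f(a, g(b))" matches, while "f(a, g(b)" and "f(a))" do not.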

Language overview

• Character classes

• Character group: [abc]

• Negation: [^a1]

• Range: [C-F] or [2-6A-D]

• Subtraction: [A-Z-[B]]

• Shorthands: . (wildcard), \w (word character), \d (decimal digit), \s (whitespace)

• Negated shorthands: \W (not \w), \D (not \d), \S (not \s)

• Anchors

• ^ (beginning of string or line), $ (end of string or line), \b (word boundary), \B (not \b)
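A short C# sketch of two of the constructs listed above, character class subtraction and the \b anchor. The helper names are made up for illustration.

```csharp
using System.Text.RegularExpressions;

public static class CharacterClassDemo
{
    // [a-z-[aeiou]] is character class subtraction:
    // lowercase ASCII letters minus the vowels.
    public static bool IsLowercaseConsonants(string s)
    {
        return Regex.IsMatch(s, @"^[a-z-[aeiou]]+$");
    }

    // \b anchors on word boundaries and \d matches a decimal digit,
    // so only stand-alone runs of digits are counted.
    public static int CountNumbers(string s)
    {
        return Regex.Matches(s, @"\b\d+\b").Count;
    }
}
```

For example, "rhythm" passes the consonant check and "rhyme" does not. In "10 apples, 20 pears, x30" only two numbers are counted, because there is no word boundary between "x" and "30".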

Language overview

• Quantifiers

• Range: {n,m}, {n,}

• Zero or more: * (can be written {0,})

• One or more: + (can be written {1,})

• Zero or one: ? (can be written {0,1})

• Greedy vs lazy

• Greedy: match as many repetitions as possible (the default)

• Lazy: match as few repetitions as possible

• *?, +?, ??, {n,m}?
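The difference between greedy and lazy quantifiers is easiest to see on the same input. A minimal C# sketch, with illustrative names:

```csharp
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // Greedy: .+ takes as many characters as it can before the last ">".
    public static string FirstTagGreedy(string html)
    {
        return Regex.Match(html, "<.+>").Value;
    }

    // Lazy: .+? stops at the first ">" that lets the match succeed.
    public static string FirstTagLazy(string html)
    {
        return Regex.Match(html, "<.+?>").Value;
    }
}
```

On the input `<b>bold</b>`, the greedy version returns the whole string, while the lazy version returns only `<b>`.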

Language overview

• Grouping constructs

• Capturing group: (subexpression)

• Named group: (?<group_name>subexpression)

• Non-capturing group: (?:subexpression)

• Balancing groups: (?<name1-name2>subexpression)

• Lookaround assertions (zero-width)

• Positive look-ahead: (?=subexpression)

• Negative look-ahead: (?!subexpression)

• Positive look-behind: (?<=subexpression)

• Negative look-behind: (?<!subexpression)
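A small C# sketch of a named group, a positive look-ahead and a negative look-behind. The helper names and sample patterns are assumptions made for illustration.

```csharp
using System.Text.RegularExpressions;

public static class GroupingDemo
{
    // Named groups: read the year out of a yyyy-MM-dd date, or null if no match.
    public static string YearOf(string date)
    {
        Match m = Regex.Match(date, @"^(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})$");
        return m.Success ? m.Groups["year"].Value : null;
    }

    // Positive look-ahead: digits followed by "px", without consuming "px".
    public static string PixelValue(string css)
    {
        return Regex.Match(css, @"\d+(?=px)").Value;
    }

    // Negative look-behind: "bar" not immediately preceded by "foo".
    public static bool HasBarWithoutFoo(string s)
    {
        return Regex.IsMatch(s, @"(?<!foo)bar");
    }
}
```

Because lookarounds are zero-width, `PixelValue("width: 120px")` returns "120" and leaves "px" out of the match.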

Language overview

• Backreference constructs

• \groupnumber or \k<groupname>

• Alternation constructs

• (expression1|…|expressionN)

• (?(expression)yes|no): expression is tested as a zero-width assertion

• (?(group)yes|no): tests whether the named or numbered group has captured
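A backreference and a group-based conditional can be sketched in C# as follows. The helper names and both patterns are illustrative assumptions, not taken from the slides.

```csharp
using System.Text.RegularExpressions;

public static class BackreferenceDemo
{
    // \k<word> must repeat exactly what the named group "word" captured,
    // so this finds a word that is immediately repeated.
    public static bool HasDoubledWord(string s)
    {
        return Regex.IsMatch(s, @"\b(?<word>\w+)\s+\k<word>\b");
    }

    // (?(open)yes): a closing quote is required only if the optional
    // opening quote was captured by the group "open".
    public static bool IsOptionallyQuoted(string s)
    {
        return Regex.IsMatch(s, @"^(?<open>"")?\w+(?(open)"")$");
    }
}
```

The conditional accepts `abc` and `"abc"` but rejects `"abc` and `abc"`, which plain alternation can only express by writing the body twice.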

Format/Comment your code

Just as you do when you write code…

public static void C(string an, string pn, string n, string nn) { RegexCompilationInfo[] re ={ new RegexCompilationInfo(pn, RegexOptions.Compiled, n, nn, true) };System.Reflection.AssemblyName asn = new System.Reflection.AssemblyName(); asn.Name = an;Regex.CompileToAssembly(re, asn); }

Regexes can have inline comments:

(?#comment)

And they can be written across multiple lines, with # starting an end-of-line comment (don’t forget the IgnorePatternWhitespace option):

Before:

^[^()]*((?<g>\()[^()]*)*((?<-g>\))[^()]*)*[^()]*(?(g)(?!))$

After:

^                 #start
[^()]*            #everything but ()
(
    (?<g>\()      #opening group (
    [^()]*        #everything but ()
)*
(
    (?<-g>\))     #closing group )
    [^()]*        #everything but ()
)*
[^()]*            #everything but ()
(?(g)             #if opening group remaining
    (?!))         #then make match fail
$                 #end
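In C#, the commented form fits naturally in a verbatim string, as long as RegexOptions.IgnorePatternWhitespace is passed. A sketch, with illustrative class and method names:

```csharp
using System.Text.RegularExpressions;

public static class CommentedPatternDemo
{
    // Same pattern as the "After" version above. Whitespace in the pattern is
    // ignored and # starts a comment that runs to the end of the line.
    private static readonly Regex Pattern = new Regex(@"
        ^                   # start
        [^()]*              # everything but ()
        (
            (?<g>\()        # opening group (
            [^()]*          # everything but ()
        )*
        (
            (?<-g>\))       # closing group )
            [^()]*          # everything but ()
        )*
        [^()]*              # everything but ()
        (?(g)               # if an opening group remains
            (?!))           # then make the match fail
        $                   # end
        ", RegexOptions.IgnorePatternWhitespace);

    // Hypothetical helper wrapping the pattern.
    public static bool MatchesSlidePattern(string input)
    {
        return Pattern.IsMatch(input);
    }
}
```

With this pattern, "(a(b)c)" matches, while "((a)" and "(a))" are rejected.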

In .NET / C#

• A class to know: System.Text.RegularExpressions.Regex

• Represents the Regex engine

• A pattern is tightly coupled to the regex engine

• All regular expressions must be compiled (sooner or later)

• Initialization can be an expensive process

Regex options

• None

• IgnoreCase

• Multiline

• Singleline

• ExplicitCapture

• Compiled

• IgnorePatternWhitespace

• RightToLeft

• ECMAScript

• CultureInvariant

http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regexoptions.aspx
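Options are flags and can be combined with |. The following C# sketch illustrates Multiline, IgnoreCase and Singleline; the helper names are made up for the example.

```csharp
using System.Text.RegularExpressions;

public static class OptionsDemo
{
    // Multiline makes ^ and $ match at every line, not only at the ends of
    // the input. IgnoreCase is combined with it using the | operator.
    public static int CountLinesStartingWith(string text, string word)
    {
        string pattern = "^" + Regex.Escape(word);
        return Regex.Matches(text, pattern,
            RegexOptions.Multiline | RegexOptions.IgnoreCase).Count;
    }

    // Without Singleline the wildcard . does not match \n; with it, it does.
    public static bool DotCrossesLines(string text, RegexOptions options)
    {
        return Regex.IsMatch(text, "a.b", options);
    }
}
```

Despite the similar names, Multiline only affects the anchors ^ and $, and Singleline only affects the wildcard.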

Instance or static method calls?

• Both provide the same matching/replacing methods

• Static method calls cache the patterns they use (15 by default)

• Manage the cache size using Regex.CacheSize

• Only static calls use caching (since .NET 2.0)

Instance or static method calls?

• new Regex(pattern).IsMatch(email)

vs.

• Regex.IsMatch(email, pattern)

Data from:

http://blogs.msdn.com/b/bclteam/archive/2010/06/25/optimizing-regular-expression-performance-part-i-working-with-the-regex-class-and-regex-objects.aspx
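The two calling styles, and the Regex.CacheSize property, can be sketched in C# like this. The class, the helper names and the digit pattern are illustrative assumptions.

```csharp
using System.Text.RegularExpressions;

public static class CachingDemo
{
    // Instance style: the pattern is parsed once, when the field is
    // initialized, and the same Regex object is reused for every input.
    private static readonly Regex Digits = new Regex(@"^\d+$");

    public static bool IsDigitsInstance(string s)
    {
        return Digits.IsMatch(s);
    }

    // Static style: the engine looks the pattern up in its internal cache
    // on each call and builds it only when it is not cached.
    public static bool IsDigitsStatic(string s)
    {
        return Regex.IsMatch(s, @"^\d+$");
    }

    // Raise the static cache size when many distinct patterns go through
    // static calls. Never shrinks the cache in this sketch.
    public static int GrowCache(int size)
    {
        if (size > Regex.CacheSize)
        {
            Regex.CacheSize = size;
        }
        return Regex.CacheSize;
    }
}
```

Creating a new Regex inside a hot loop gets neither benefit: the pattern is rebuilt on every iteration and the static cache is not used.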

Interpreted or compiled

• Interpreted:

• opcodes created on initialization (static or instance).

• opcodes converted to MSIL and executed by the JIT when the method is called.

• Startup time reduced but slower execution time

• Compiled (RegexOptions.Compiled):

• regex converted to MSIL code.

• MSIL code executed by the JIT when the method is called.

• Execution time reduced but slower startup time.

• Compiled at design time:

• Regex.CompileToAssembly

• The regex is fixed and used only in instance calls.

• Startup and execution time are both reduced at run time, but the compilation must be done at design time.
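A common way to use RegexOptions.Compiled is to pay the compilation cost once, in a static field, and reuse the instance. The C# sketch below assumes a hypothetical word-counting helper. It deliberately avoids Regex.CompileToAssembly, which to my knowledge is only supported on the .NET Framework.

```csharp
using System.Text.RegularExpressions;

public static class CompiledDemo
{
    // Compiled to MSIL once, when the class is first used. Worth it only
    // when the same regex is executed many times.
    private static readonly Regex Word =
        new Regex(@"\b\w+\b", RegexOptions.Compiled);

    public static int CountWords(string text)
    {
        return Word.Matches(text).Count;
    }
}
```

For a pattern that runs only a handful of times, the interpreted default is usually the cheaper choice because it skips the compilation step.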

Interpreted or compiled

Data from:

http://blogs.msdn.com/b/bclteam/archive/2010/06/25/optimizing-regular-expression-performance-part-i-working-with-the-regex-class-and-regex-objects.aspx

Tools

• Regex Design

• Expresso

• The Regex Coach

• RegexBuddy (not free)

• Rex (Microsoft Research)

• Visual Studio

Bonus

• Mail::RFC822::Address: regexp-based address validation http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html

• A regular expression to check for prime numbers:

^1?$|^(11+?)\1+$

http://montreal.pm.org/tech/neil_kandalgaonkar.shtml

• RegEx match open tags except XHTML self-contained tags (stackoverflow)

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
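The prime-number pattern works on numbers written in unary (n as a string of n ones), and it actually matches the lengths that are not prime. A C# sketch, with an illustrative helper name:

```csharp
using System.Text.RegularExpressions;

public static class PrimeDemo
{
    // ^1?$ matches lengths 0 and 1. ^(11+?)\1+$ matches any length that can
    // be split into two or more equal groups of at least two ones, that is,
    // any composite length. A number is prime when the pattern does NOT match.
    public static bool IsPrime(int n)
    {
        if (n < 0)
        {
            return false; // new string('1', n) would throw for negative n
        }
        return !Regex.IsMatch(new string('1', n), @"^1?$|^(11+?)\1+$");
    }
}
```

This is a curiosity rather than a practical test: the backtracking cost grows quickly with n.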

Regex optimization

• Timeouts

• Consider the input source

• Capture only when necessary

• Factorization

• Backtracking

“In general, a Nondeterministic Finite Automaton (NFA) engine like the .NET Framework regular expression engine places the responsibility for crafting efficient, fast regular expressions on the developer.”
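The first bullet can be sketched in C# with the Regex constructor that takes a match timeout (available since .NET 4.5). The helper name and the decision to treat a timeout as "no match" are assumptions made for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TimeoutDemo
{
    // Returns false when the input does not match OR when matching exceeds
    // the time budget. Treating a timeout as "no match" is a choice of this
    // sketch; real code may prefer to log or rethrow.
    public static bool SafeIsMatch(string input, string pattern, TimeSpan timeout)
    {
        try
        {
            return new Regex(pattern, RegexOptions.None, timeout).IsMatch(input);
        }
        catch (RegexMatchTimeoutException)
        {
            return false;
        }
    }
}
```

A pattern with nested quantifiers such as `^(a+)+$`, run against a long string of "a" followed by "!", is the classic case where backtracking can explode. With a timeout the call returns instead of hanging, which matters most when the input comes from users.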

Questions?