
Advanced Regex Tutorial—Regex Syntax


10/11/2014 Advanced Regex Tutorial—Regex Syntax

http://www.rexegg.com/regex-disambiguation.html

Reducing (? … ) Syntax Confusion

What the (? … )
A question mark inside a parenthesis: so many uses! I thought I would bring them all together in one place.

I don't know the fine details of the history of regular expressions. Stephen Kleene and Ken Thompson, who started them, obviously wanted something very compact. Maybe they were into hieroglyphs, maybe they were into cryptography, or maybe that was just the way you did things when you only had a few kilobytes of RAM.

The heroes who expanded regular expressions (such as Henry Spencer and Larry Wall) followed in these footsteps. One of the things that make regexes hard to read for beginners is that many points of syntax that serve vastly different purposes all start with the same two characters:

(?

In the regex tutorials and books I have read, these various points of syntax are introduced in stages. But (?: … ) looks a lot like (?= … ), so that at some point they are bound to clash in the mind of the regex apprentice. To facilitate study, I have pulled all the (? … ) usages I know about into one place. I'll start by pointing out three confusing couples; details of usage will follow.

Jumping Points
For easy navigation, here are some jumping points to various sections of the page:

✽ Confusing Couples
✽ Lookahead and Lookbehind: (?= … ), (?! … ), (?<= … ), (?<! … )
✽ Non-Capturing Groups: (?: … ) and (?is: … )
✽ Atomic Groups: (?> … )
✽ Named Capture: (?<foo> … ) and (?P<foo> … )
✽ Inline Modifiers: (?isx-m)
✽ Subroutines: (?1)
✽ Recursion: (?R)
✽ Conditionals: (?(A)B) and (?(A)B|C)
✽ Pre-Defined Subroutines: (?(DEFINE)(<foo> … )(<bar> … )) and (?&foo)
✽ Branch Reset: (?| … )
✽ Inline Comments: (?# … )


Confusing Couples

Confusing Couple #1: (?: … ) and (?= … )
These false twins have very different jobs. (?: … ) contains a non-capturing group, while (?= … ) is a lookahead.

Confusing Couple #2: (?<= … ) and (?> … )
(?<= … ) is a lookbehind, so (?> … ) must be a lookahead, right? Not so. (?> … ) contains an atomic group. The actual lookahead marker is (?= … ). More about all these guys below.

Confusing Couple #3: (?(1) … ) and (?1)
This pair is delightfully confusing. The first is a conditional expression that tests whether Group 1 has been captured. The second is a subroutine call that matches the sub-pattern contained within the capturing parentheses of Group 1.

Now that these three "big ones" are out of the way, let's drill into the syntax.


Lookarounds: (?<= … ) and (?= … ),

               (?<! … ) and (?! … )

Collectively, lookbehinds and lookaheads are known as lookarounds. This section gives you basic examples of the syntax, but further down the track I encourage you to read the dedicated regex lookaround page, as it covers subtleties that need to be grasped if you'd like lookaheads and lookbehinds to become your trusted friends.

In the meantime, if there is one thing you should remember, it is this: a lookahead or a lookbehind does not "consume" any characters on the string. This means that after the lookahead or lookbehind's closing parenthesis, the regex engine is left standing on the very same spot in the string from which it started looking: it hasn't moved. From that position, the engine can start matching characters again, or, why not, look ahead (or behind) for something else, a useful technique, as we'll later see.

Here is how the syntax works.

Lookahead After the Match: \d+(?= dollars)
Sample Match: 100 in 100 dollars
Explanation: \d+ matches the digits 100, then the lookahead (?= dollars) asserts that at that position in the string, what immediately follows is the characters " dollars".

Lookahead Before the Match: (?=\d+ dollars)\d+
Sample Match: 100 in 100 dollars
Explanation: The lookahead (?=\d+ dollars) asserts that at the current position in the string, what follows is digits then the characters " dollars". If the assertion succeeds, the engine matches the digits with \d+.


Note that this pattern achieves the same result as \d+(?= dollars) from above, but it is less efficient because \d+ is matched twice. A better use of looking ahead before matching characters is to validate multiple conditions in a password.
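To make the password idea a little more concrete, here is a minimal sketch using Python's standard re module. The policy itself (at least eight characters, one digit, one lowercase and one uppercase letter) and the names PASSWORD_RE and is_valid are assumptions made for this illustration, not rules taken from this page.

```python
import re

# Hypothetical policy for illustration only.
# Each lookahead checks one condition from the start of the string
# without consuming anything; .{8,} then does the actual matching.
PASSWORD_RE = re.compile(r"^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,}$")

def is_valid(candidate: str) -> bool:
    """Return True when every lookahead condition and the length check hold."""
    return PASSWORD_RE.match(candidate) is not None

print(is_valid("Regex4Life"))  # digit, lowercase, uppercase, ten characters
print(is_valid("regex4life"))  # lacks an uppercase letter
```

Because none of the lookaheads moves the engine, all three conditions are tested from the same starting position, and they can be listed in any order.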

Negative Lookahead After the Match: \d+(?! dollars)
Sample Match: 100 in 100 pesos
Explanation: \d+ matches 100, then the negative lookahead (?! dollars) asserts that at that position in the string, what immediately follows is not the characters " dollars".

Negative Lookahead Before the Match: (?!\d+ dollars)\d+
Sample Match: 100 in 100 pesos
Explanation: The negative lookahead (?!\d+ dollars) asserts that at the current position in the string, what follows is not digits then the characters " dollars". If the assertion succeeds, the engine matches the digits with \d+.

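Note that on this sample the pattern achieves the same result as \d+(?! dollars) from above, but it is less efficient because \d+ is matched twice. The two are not interchangeable in general: against 100 dollars, \d+(?! dollars) backtracks and matches 10, whereas (?!\d+ dollars)\d+ finds no match at all. A better use of looking ahead before matching characters is to validate multiple conditions in a password.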

Lookbehind Before the Match: (?<=USD)\d{3}
Sample Match: 100 in USD100
Explanation: The lookbehind (?<=USD) asserts that at the current position in the string, what precedes is the characters "USD". If the assertion succeeds, the engine matches three digits with \d{3}.

Lookbehind After the Match: \d{3}(?<=USD\d{3})
Sample Match: 100 in USD100
Explanation: \d{3} matches 100, then the lookbehind (?<=USD\d{3}) asserts that at that position in the string, what immediately precedes is the characters "USD" then three digits.

Note that this pattern achieves the same result as (?<=USD)\d{3} from above, but it is less efficient because \d{3} is matched twice.

Negative Lookbehind Before the Match: (?<!USD)\d{3}
Sample Match: 100 in JPY100
Explanation: The negative lookbehind (?<!USD) asserts that at the current position in the string, what precedes is not the characters "USD". If the assertion succeeds, the engine matches three digits with \d{3}.

Negative Lookbehind After the Match: \d{3}(?<!USD\d{3})
Sample Match: 100 in JPY100
Explanation: \d{3} matches 100, then the negative lookbehind (?<!USD\d{3}) asserts that at that position in the string, what immediately precedes is not the characters "USD" then three digits.

Note that this pattern achieves the same result as (?<!USD)\d{3} from above, but it is less efficient because \d{3} is matched twice.
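The four lookaround forms above can be tried out with a short sketch in Python's standard re module. The helper name first_match is an assumption introduced for this example; the patterns and sample strings are the ones used in this section.

```python
import re

def first_match(pattern: str, text: str):
    """Return the text of the first match, or None when nothing matches."""
    m = re.search(pattern, text)
    return m.group() if m else None

# Positive lookahead: digits that are followed by " dollars"
print(first_match(r"\d+(?= dollars)", "100 dollars"))
# Negative lookahead: digits that are not followed by " dollars"
print(first_match(r"\d+(?! dollars)", "100 pesos"))
# Positive lookbehind: three digits preceded by "USD"
print(first_match(r"(?<=USD)\d{3}", "USD100"))
# Negative lookbehind: three digits not preceded by "USD"
print(first_match(r"(?<!USD)\d{3}", "JPY100"))

# A lookaround consumes nothing: the match span covers the digits only.
print(re.search(r"\d+(?= dollars)", "100 dollars").span())
```

In each case only the digits end up in the match; the text inspected by the lookaround stays available for whatever comes next in the pattern.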


Support for Lookarounds
All major engines have some form of support for lookarounds, with some important differences. For instance, JavaScript doesn't support lookbehind, though it supports lookahead (one of the many blotches on its regex scorecard). Ruby 1.8 suffered from the same condition.

Lookbehind: Fixed-Width / Constrained Width / Infinite Width
One important difference is whether lookbehind accepts variable-width patterns.

✽ At the moment, I am aware of only three engines that allow infinite repetition within a lookbehind, as in (?<=\s*): .NET, Matthew Barnett's outstanding regex module for Python, whose features far outstrip those of the standard re module, and the JGSoft engine used by Jan Goyvaerts' software such as EditPad Pro.

✽ Java accepts quantifiers within lookbehind, as long as the length of the matching strings falls within a pre-determined range. For instance, (?<=cats?) is valid because it can only match strings of three or four characters. Likewise, (?<=A{1,10}) is valid.

✽ PCRE (C, PHP, R …), Java and Ruby 2+ allow lookbehinds to contain alternations that match strings of different but pre-determined lengths (such as (?<=cat|raccoon)).

✽ Perl and Python require a lookbehind to match strings of a fixed length, so (?<=cat|raccoon) will not work.
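For Python's standard re module, the fixed-width rule can be observed directly, since the module rejects an offending pattern when it is compiled. The sketch below is only an illustration; the helper name compiles is an assumption made for this example.

```python
import re

def compiles(pattern: str) -> bool:
    """Return True when the standard re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Both branches are three characters wide, so the width is fixed.
print(compiles(r"(?<=cat|dog)s"))
# The branches differ in width, which the standard re module rejects.
print(compiles(r"(?<=cat|raccoon)s"))
# Unbounded repetition inside a lookbehind is rejected as well.
print(compiles(r"(?<=\s*)x"))
```

Checking patterns at compile time like this is a cheap way to find out what a given engine tolerates before relying on it.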

To master lookarounds, there is a bit more you should really know. For these finer details, visit the lookaround page.


Non-Capturing Groups: (?: … )

In regex as in the (2+3)*(5-2) of arithmetic, parentheses are often needed to group components of an expression together. For instance, the above operation yields 15. Without the parentheses, because the * operator has higher precedence than the + and -, 2+3*5-2 is interpreted as 2+(3*5)-2, yielding… er… 15 (a happy coincidence).

In regex, normal parentheses not only group parts of a pattern, they also capture the sub-match to a capture group. This is often tremendously useful. At other times, you do not need the overhead.

In .NET, this capturing behavior of parentheses can be overridden by the (?n) flag or the RegexOptions.ExplicitCapture option. But in all flavors, .NET included, it is far more common to use (?: … ), which is the syntax for a non-capturing group. Watch out, as the syntax closely resembles that for a lookahead (?= … ).

For instance (?:Bob|Chloe) matches Bob or Chloe, but the name is not captured.

Within a non-capturing group, you can still use capture groups. For instance, (?:Bob says: (\w+)) would match Bob says: Go and capture Go in Group 1.


Likewise, you can capture the content of a non-capturing group by surrounding it with parentheses. For instance, ((?:Bob|Chloe)\d\d) would capture "Chloe44".
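The three non-capturing examples above can be checked with a short sketch in Python's standard re module. The variable names and the sample sentence "Hello Chloe" are assumptions made for this illustration.

```python
import re

# (?:Bob|Chloe) groups the alternation but creates no capture group.
plain = re.search(r"(?:Bob|Chloe)", "Hello Chloe")
print(plain.group(), plain.groups())

# A capture group nested inside a non-capturing group is still Group 1.
nested = re.search(r"(?:Bob says: (\w+))", "Bob says: Go")
print(nested.group(1))

# Wrapping the non-capturing group in parentheses captures the whole piece.
wrapped = re.search(r"((?:Bob|Chloe)\d\d)", "Chloe44")
print(wrapped.group(1))
```

The empty groups() tuple in the first case is the visible sign that nothing was captured, even though the group did its grouping job.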

Mode Modifiers within Non-Capture Groups
On all engines that support inline modifiers such as (?i), except Python, you can blend the non-capture group syntax with mode modifiers. Here are some examples:
✽ (?i:Bob|Chloe) This non-capturing group is case-insensitive.
✽ (?ism:^BEGIN.*?END) This non-capturing group matches everything between "begin" and "end" (case-insensitive), allowing such content to span multiple lines (the s modifier), starting at the beginning of any line (the m modifier allows the ^ anchor to match the beginning of any line).
✽ (?i-sm:^BEGIN.*?END) As above, but turns off the "s" and "m" modifiers.

See below for more on inline modifiers.


Atomic Groups: (?> … )

An atomic group is an expression that becomes solid as a block once the regex leaves the closing parenthesis. If the regex fails later down the string and needs to backtrack, a regular group containing a quantifier would give up characters one at a time, allowing the engine to try other matches. Likewise, if the group contained an alternation, the engine would try the next branch. An atomic group won't do that: it's all or nothing.

Example 1: With Alternation
(?>A|.B)C
This will fail against ABC, whereas (?:A|.B)C would have succeeded. After matching the A in the atomic group, the engine tries to match the C but fails. Because it is atomic, it is unable to try the .B part of the alternation, which would also succeed, and allow the final token C to match.

Example 2: With Quantifier
(?>A+)[A-Z]C
This will fail against AAC, whereas (?:A+)[A-Z]C would have succeeded. After matching the AA in the atomic group, the engine tries to match the [A-Z], succeeds by matching the C, then tries to match the token C but fails as the end of the string has been reached. Because the group is atomic, it is unable to give up the second A, which would allow the rest of the pattern to match.

If, before the atomic group, there were other options to which the engine can backtrack (such as quantifiers or alternations), then the whole atomic group can be given up in one go.

When are Atomic Groups Important?
When a series of characters only makes sense as a block, using an atomic group can prevent needless backtracking. This is explored in the section on possessive quantifiers. In such situations atomic groups can be useful, but not necessarily mission-critical.

On the other hand, there are situations where atomic groups can save your pattern from disaster. They are particularly useful:


✽ In order to avoid the Lazy Trap with patterns that contain lazy quantifiers whose token can eat the delimiter
✽ To avoid certain forms of the Explosive Quantifier Trap

Supported Engines, and Workaround
Atomic groups are supported in most of the major engines: .NET, Perl, PCRE and Ruby. For engines that don't support atomic grouping syntax, such as Python and JavaScript, see the well-known pseudo-atomic group workaround.
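One commonly used emulation, assumed here to be the kind of workaround the text refers to, is a capturing lookahead followed by a back-reference: (?=(A+))\1. Once a lookahead has succeeded the engine does not backtrack into it, so the captured text behaves like an atomic block. The sketch below uses Python's standard re module and replays the two examples from this section; the variable names are assumptions made for the illustration.

```python
import re

# Example 2 (quantifier): a regular group can give back its second A.
regular_quant = re.search(r"(?:A+)[A-Z]C", "AAC")
# Emulated atomic group: the lookahead captures AA, \1 consumes it,
# and the engine cannot shorten the capture afterwards.
atomic_quant = re.search(r"(?=(A+))\1[A-Z]C", "AAC")
print(regular_quant is not None, atomic_quant is not None)

# Example 1 (alternation): a regular group can fall back to the .B branch.
regular_alt = re.search(r"(?:A|.B)C", "ABC")
# Emulated atomic group: once the A branch has matched, .B is never tried.
atomic_alt = re.search(r"(?=(A|.B))\1C", "ABC")
print(regular_alt is not None, atomic_alt is not None)
```

Note that the emulation spends one capture group number, which matters if the pattern uses numbered back-references elsewhere.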

Alternate Syntax: Possessive Quantifier
When an atomic group only contains a token with a quantifier, an alternate syntax (in engines that support it) is a possessive quantifier, where a + is added to the quantifier. For instance,
✽ (?>A+) is equivalent to A++
✽ (?>A*) is equivalent to A*+
✽ (?>A?) is equivalent to A?+
✽ (?>A{…,…}) is equivalent to A{…,…}+

This works in Perl, PCRE, Java and Ruby 2+. For more, see the possessive quantifiers section of the quantifiers page.

Non-Capturing
Atomic groups are non-capturing, though as with other non-capturing groups, you can place the group inside another set of parentheses to capture the group's entire match; and you can place parentheses inside the atomic group to capture a section of the match.

Watch out, as the atomic group syntax is confusingly similar to the lookbehind syntax (?<= … ).


Named Capture: (?<foo> … ),

                 (?P<foo> … ) and (?P=foo)

When you cut and paste a piece of a pattern, Group 3 can suddenly become Group 1. That's a problem if you were using a back-reference \3 or replacement $3.

One way around this problem is named capture groups. The syntax varies across engines (see Naming Groups, and referring back to them, for the gory details). It's worth noting that named groups also have a number that obeys the left-to-right numbering rules, and can be referenced by their number as well as their name.

In short, the two capturing flavors are (?<foo> … ) and (?P<foo> … ). For instance, in the right engines,

^(?<intpart>\d+)\.(?<decpart>\d+)$
or
^(?P<intpart>\d+)\.(?P<decpart>\d+)$
would both match a string containing a decimal number such as 12.22, storing the integer portion to a group named intpart, and storing the decimal portion to a group named decpart.


To create a back-reference to the intpart group in the pattern, depending on the engine, you'll use \k<intpart> or (?P=intpart). To insert the named group in a replacement string, depending on the engine, you'll either use ${intpart}, \g<intpart>, $+{intpart} or the group number \1. For the gory details, see Naming Groups, and referring back to them.
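In Python's standard re module, the applicable forms are (?P<foo> … ) for the group, (?P=foo) for the back-reference and \g<foo> in the replacement string. The sketch below illustrates them; the names DECIMAL_RE, REPEAT_RE and swapped, as well as the group name word, are assumptions made for this example.

```python
import re

# Named groups, Python flavor: (?P<name> … )
DECIMAL_RE = re.compile(r"^(?P<intpart>\d+)\.(?P<decpart>\d+)$")
m = DECIMAL_RE.match("12.22")
print(m.group("intpart"), m.group("decpart"))
# A named group still has a number and can be read through it.
print(m.group(1), m.group(2))

# In a replacement string, \g<name> inserts the named group.
swapped = DECIMAL_RE.sub(r"\g<decpart>.\g<intpart>", "12.22")
print(swapped)

# In the pattern itself, (?P=name) is a back-reference to the named group.
REPEAT_RE = re.compile(r"(?P<word>\w+) (?P=word)")
print(REPEAT_RE.search("Hey Hey") is not None)
print(REPEAT_RE.search("Hey Ho") is not None)
```

If the pattern is later rearranged, the names keep pointing at the right pieces, which is the whole appeal of the feature.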

To name, or not to name?
I'll admit that I don't use named groups a whole lot, but some people love them.

Sure, named captures are bulkier than a quick (capture) and reference to \1, but they can save hassles in expressions that contain many groups.

Do they make your patterns easier to read? That's subjective. For my part, if the regex is short, I always prefer numbered groups. And if it is long, I would rather read a regex with numbered groups and good comments in free-spacing mode than a one-liner with named groups.


Inline Modifiers: (?isx-m)

All popular regex flavors apart from JavaScript support inline modifiers, which allow you to tell the engine, in a pattern, to change how to interpret the pattern. For instance, (?i) turns on case-insensitivity. Except in Python, (?-i) turns it off.

If a modifier appears at the head of the pattern, it modifies the matching mode for the whole pattern, unless it is later turned off. But (except in Python) a modifier can appear in mid-pattern, in which case it only affects the portion of the pattern that follows.

Modifiers can be combined: for instance, (?ix) turns on both case-insensitive and free-spacing mode. (?ix-s) does the same, but also turns off single-line (a.k.a. DOTALL) mode.

Summary of inline modifiers
✽ (?i) turns on case-insensitive mode.

✽ Except in Ruby, (?s) activates "single-line mode", a.k.a. DOTALL mode, allowing the dot to match line break characters. In Ruby, the same function is served by (?m).

✽ Except in Ruby, (?m) activates "multi-line mode", which allows the caret ^ and dollar $ assertions to match at the beginning and end of lines. In Ruby, (?m) does what (?s) does in other flavors: it activates DOTALL mode.

✽ (?x) turns on the free-spacing mode (a.k.a. whitespace mode or comment mode). This allows you to write your regex on multiple lines (like on the example on the home page) with comments preceded by a #. Warning: You will usually want to make sure that (?x) appears immediately after the quote character that starts the pattern string. For instance, if you try placing it on a newline because it would look better, the engine will try matching the newline characters before it activates free-spacing mode.

✽ In .NET, (?n) turns on "named capture only" mode, which means that regular parentheses are treated as non-capture groups.


✽ In Java, (?d) turns on "Unix lines" mode, which means that the dot and the anchors ^ and $ only care about line break characters when they are line feeds \n.
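The modifiers shared by most flavors can be tried with Python's standard re module, as in the sketch below. Every modifier is placed at the very head of its pattern, in line with the text's remarks about Python. The variable names and sample strings are assumptions made for this illustration.

```python
import re

# (?i): case-insensitive matching
case_hit = re.search(r"(?i)bob", "BOB")
print(case_hit is not None)

# (?s): DOTALL mode, the dot also matches line breaks
dotall_hit = re.search(r"(?s)BEGIN.*END", "BEGIN\nEND")
plain_miss = re.search(r"BEGIN.*END", "BEGIN\nEND")
print(dotall_hit is not None, plain_miss is not None)

# (?m): multi-line mode, ^ matches at the start of every line
line_starts = re.findall(r"(?m)^\w+", "one\ntwo")
print(line_starts)

# (?x): free-spacing mode; the modifier sits right after the opening quotes
NUMBER_RE = re.compile(r"""(?x)   # free-spacing mode for the whole pattern
    (\d+)     # integer part
    \.        # literal dot
    (\d+)     # decimal part
""")
print(NUMBER_RE.search("12.22").groups())
```

The last pattern follows the warning above: (?x) comes immediately after the opening quotes, so no newline has to be matched before free-spacing mode is active.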

Combining Non-Capture Group with Inline Modifiers
As we saw in the section on non-capture groups, you can blend mode modifiers into the non-capture group syntax in all engines that support inline modifiers, except Python. For instance, (?i:bob) is a non-capturing group with the case-insensitive flag turned on. It matches strings such as "bob" and "boB".

But don't get carried away: you cannot blend inline modifiers with any random bit of regex syntax. For instance, the following are all illegal: (?i=bob), (?iP<name>bob) and (?i>bob).

Using Inline Modifiers in the Middle of a Pattern
Usually, you'll use your inline modifiers at the start of the regex string to set the mode for the entire pattern. However, changing modes in the middle of a pattern can be useful, so I'll give you two examples.

(\b[A-Z]+\b)(?i).*?\b\1\b
This ensures that an upper-case word is repeated somewhere in the string, in any letter-case. First we capture an upper-case word to Group 1 (for instance DOG), then we set case-insensitive mode, then .*? matches any characters up to the back-reference \1, which could be dog or dOg. As a neat variation, (\b[A-Z]+\b).*?\b(?=[a-z]+\b)(?i)\1\b ensures that the back-reference is in lower-case.

^(\w+)\b.*\r?\n(?s).*?\b\1\b
This ensures that the first word of the string is repeated on a different line. First we capture a word to Group 1, then we get to the end of the line with .*, match a line break, then set DOTALL mode, allowing the .*? to match across lines, which brings us to our back-reference \1.


Subroutines: (?1) and (?&foo)

As you well know by now, when you create a capture group such as (\d+), you can then create a back-reference to that group (for instance \1 for Group 1) to match the very characters that were captured by the group. For instance, (\w+) \1 matches Hey Hey.

In Perl, PCRE (C, PHP, R …) and Ruby 1.9+, you can also repeat the actual pattern defined by a capture group. In Perl and PCRE, the syntax to repeat the pattern of Group 1 is (?1) (in Ruby 2+, it is \g<1>).

For instance,
(\w+) (?1)
will match Hey Ho. The parentheses in (\w+) not only capture Hey to Group 1, they also define Subroutine 1, whose pattern is \w+. Later, (?1) is a call to subroutine 1. The entire regex is therefore equivalent to (\w+) \w+

Subroutines can make long expressions much easier to look at and far less prone to copy-paste errors.
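Python's standard re module has no (?1) syntax, so the sketch below does not demonstrate subroutine calls themselves. Instead it shows the difference between repeating a pattern and repeating a captured text, using ordinary string composition as a stand-in for the subroutine. The names WORD, TWO_WORDS_RE and BACKREF_RE are assumptions made for this illustration.

```python
import re

# Hypothetical fragment playing the role of "subroutine 1".
WORD = r"\w+"

# Reusing the fragment yields (\w+) \w+, the pattern that
# (\w+) (?1) is said to be equivalent to.
TWO_WORDS_RE = re.compile(rf"({WORD}) {WORD}")

# A back-reference repeats the captured text, not the pattern.
BACKREF_RE = re.compile(r"(\w+) \1")

print(TWO_WORDS_RE.fullmatch("Hey Ho") is not None)   # pattern repeated
print(BACKREF_RE.fullmatch("Hey Ho") is not None)     # text not repeated
print(BACKREF_RE.fullmatch("Hey Hey") is not None)    # text repeated
```

String composition gives the same copy-paste safety as a subroutine for simple reuse, but it cannot express a group that calls itself.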

Relative Subroutines
Instead of referring to a subroutine by its number, you can refer to the relative position of its defining group, counting left or right from the current position in the pattern. For instance, (?-1) refers to the last defined subroutine, and (?+1) refers to the next defined subroutine. Therefore,

(\w+) (?-1) and (?+1) (\w+)
are both equivalent to our first example with numbered group 1. In Ruby 2+, for relative subroutine calls, you would use \g<-1> and \g<+1>.

Named Subroutines
Instead of using numbered groups, you can use named groups. In that case, in Perl and PHP the syntax for the subroutine call will be (?&group_name). In Ruby 2+ the syntax is \g<some_word>. For instance,
(?<some_word>\w+) (?&some_word)
is equivalent to our first example with numbered group 1.

Pre-Defined Subroutines
So far, when we defined our subroutines, we also matched something. For instance, (\w+) defines subroutine 1 but also immediately matches some word characters. It so happens that Perl and PCRE have terrific syntax that allows you to pre-define a subroutine without initially matching anything. This syntax is extremely useful to build large, modular expressions. We will look at it in the corresponding section: Pre-Defined Subroutines: (?(DEFINE)(<foo> … )(<bar> … ))

Subroutines and Recursion
If you place a subroutine such as (?1) within the very capture group to which it refers (Group 1 in this case), then you have a recursive expression. For instance, the regex ^(A(?1)?Z)$ contains a recursive sub-pattern, because the call (?1) to subroutine 1 is embedded in the parentheses that define Group 1.

If you try to trace the matching path of this regex in your mind, you will see that it matches strings like AAAZZZ, strings which start with any number of letters A and end with letters Z that perfectly balance the As. After you open the parenthesis, the A matches an A… then the optional (?1)? opens another parenthesis and tries to match an A… and so on.

We'll look at recursion syntax in the next section. There is also a page dedicated to recursion.

Warning
Note that the (?1) syntax looks confusingly similar to the (?(1) … ) found in conditionals.


Recursive Expressions: (?R) … and old friends

A recursive pattern allows you to repeat an expression within itself any number of times. This is quite handy to match patterns where some tokens on the left must be balanced by some tokens on the right.

Recursive calls are available in PCRE (C, PHP, R…), Perl, Ruby 2+ and the alternate regex module for Python.

Recursion of the Entire Pattern: (?R)
To repeat the entire pattern, the syntax in Perl and PCRE is (?R). In Ruby, it is \g<0>.

For instance,
A(?R)?Z
matches strings or substrings such as AAAZZZ, where a number of letters A at the start are perfectly balanced by a number of letters Z at the end. The initial token A matches an A… Then the optional (?R)? tries to repeat the whole pattern right there, and therefore attempts the token A to match an A… and so on.

Recursion of a Subroutine: (?1) and (?-1)
You also have recursion when a subroutine calls itself. For instance, in
^(A(?1)?Z)$
subroutine 1 (defined by the outer parentheses) contains a call to itself. This regex matches entire strings such as AAAZZZ, where a number of letters A at the start are perfectly balanced by a number of letters Z at the end.

As we saw in the section on subroutines, you can also call a subroutine by the position of its defining group relative to the current position in the pattern. Therefore,
^(A(?-1)?Z)$
performs exactly like the above regex.

There is much more to be said about recursion. See the page dedicated to recursive regex patterns.
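Python's standard re module does not offer recursion, so the sketch below does not run a recursive regex. It is a plain recursive function that mirrors the matching path described above for A(?R)?Z: match an A, optionally recurse, then match a Z. The function names match_nested and is_balanced are assumptions made for this illustration.

```python
from typing import Optional

def match_nested(text: str, pos: int = 0) -> Optional[int]:
    """Mimic A(?R)?Z at pos: return the index just past the match, or None."""
    if pos >= len(text) or text[pos] != "A":
        return None                          # the token A must match first
    inner_end = match_nested(text, pos + 1)  # optional recursive call
    after = inner_end if inner_end is not None else pos + 1
    # Skipping a recursion that succeeded cannot help here: the next
    # character is then an A, which the token Z could not match anyway.
    if after < len(text) and text[after] == "Z":
        return after + 1                     # this Z balances the A above
    return None

def is_balanced(text: str) -> bool:
    """Mimic the anchored form ^(A(?1)?Z)$: the whole string must match."""
    return match_nested(text) == len(text)

print(is_balanced("AAAZZZ"))
print(is_balanced("AAZZZ"))
```

Each level of the call stack corresponds to one nested attempt at the pattern, which is a helpful picture to keep in mind when tracing a recursive regex.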


Conditionals: (?(A)B) and (?(A)B|C)

This section covers the basics on conditional syntax. For more, you'll want to explore the page dedicated to regex conditionals.

In (?(A)B), condition A is evaluated. If it is true, the engine must match pattern B. In the full form (?(A)B|C), when condition A is not true, the engine must match pattern C. Conditionals therefore allow you to inject some if(…) then {…} else {…} logic into your patterns.

Typically, condition A will be that a given capture group has been set. For instance, (?(1)}) says: If capture Group 1 has been set, match a closing curly brace. This would be useful in
^({)?\d+(?(1)})$
This pattern matches a string of digits that may or may not be embedded in curly braces. The optional capture Group 1 ({)? captures an opening brace. Later, the conditional checks if capture 1 was set, and if so it matches the closing brace. Likewise, (?(foo)…) checks if the capture group named foo has been set.

Let's expand this example to use the "else" part of the syntax:
^(?:({)|")\d+(?(1)}|")$
This pattern matches strings of digits that are either embedded in double quotes or in curly braces. The non-capture group (?:({)|") matches the opening delimiter, capturing it to Group 1 if it is a curly brace. After matching the digits, (?(1)}|") checks whether Group 1 was set. If so, we match a closing curly brace. If not, we match a double quote.
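Both conditional patterns can be tried with Python's standard re module, which supports the (?(1)yes|no) form. In the sketch below the curly braces are escaped, which is a precaution rather than something the original patterns require; the names BRACED_RE and QUOTED_OR_BRACED_RE are assumptions made for this illustration.

```python
import re

# If Group 1 captured an opening brace, require the closing brace.
BRACED_RE = re.compile(r"^(\{)?\d+(?(1)\})$")

# "Else" branch: without an opening brace, require a closing double quote.
QUOTED_OR_BRACED_RE = re.compile(r'^(?:(\{)|")\d+(?(1)\}|")$')

for sample in ["123", "{123}", "{123", "123}"]:
    print(sample, BRACED_RE.match(sample) is not None)

for sample in ['"12"', "{12}", '{12"', '"12}']:
    print(sample, QUOTED_OR_BRACED_RE.match(sample) is not None)
```

The mismatched samples show the point of the conditional: the closing delimiter that is accepted depends on which opening delimiter was captured.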

Lookaround in Conditions
In (?(A)B), the condition you'll most frequently see is a check as to whether a capture group has been set. In .NET, PCRE and Perl (but not Python and Ruby), you can also use lookarounds:
\b(?(?<=5D:)\d{5}|\d{10})\b
If the prefix 5D: can be found, the pattern will match five digits. Otherwise, it will match ten digits.


Needless to say, that is not the only way to perform this task.

Checking if a relative capture group was set
(?(1)A) checks whether Group 1 was set. In PCRE, instead of hard-coding the group number, we can also check whether a group at a relative position to the current position in the pattern has been set: for instance, (?(-1)A) checks whether the previous group has been set. Likewise, (?(+1)A) checks whether the next capture group has been set. (This last scenario would be found within a larger repeating group, so that on the second pass through the pattern, the next capture group may indeed have been set on the previous pass.)

Checking if a recursion level was reached
This is not the place to be talking in depth about recursion, which has a section above and a dedicated page, but for completeness I should mention two other uses of conditionals, available in Perl and PCRE:

✽ (?(R)A) tests whether the regex engine is currently working within a recursion depth (reached from a recursive call to the whole pattern or a subroutine).
✽ (?(R1)A) tests whether the current recursion level has been reached by a recursive call to subroutine 1.
See examples here.

Availability of Regex Conditionals
Conditionals are available in PCRE, Perl, .NET, Python, and Ruby 2+. In other engines, the work of a conditional can usually be handled by the careful use of lookarounds.

Similar Syntax
Note that the (?(1)B) syntax can look confusingly similar to (?1) which stands for a regex subroutine, where the regex pattern defined by Group 1 must be matched.


Pre-Defined Subroutines: (?(DEFINE)(<foo> … )(<bar> … ))

                      and (?&foo)

Available in Perl and PCRE (and therefore C, PHP, R…), pre-defined subroutines allow you to produce regular expressions that are beautifully modular and start to feel like clean procedural code.

Within a (?(DEFINE) … ) block, you can pre-define one or several named subroutines without matching any characters at that time. You can even pre-define subroutines based on other subroutines. When you get to the matching part of the regex, this allows you to match complex expressions with compact and readable syntax, and to match the same kind of expressions in multiple places without needing to repeat your regex code.

This makes your regex more maintainable, both because it is easier to understand and because you don't need to fix a sub-pattern in multiple places.

But an example is worth a thousand words, so let's dive in. If you like, you can play with the pattern and sample text in this online demo.

A quick note first: in case you wonder what the \ are all about, they simply match one space character. The regex is in free-spacing mode; the x flag is implied but could be made part of the pattern using the (?x) modifier. In free-spacing mode, spaces that you do want to match must either be escaped as in \ or specified inside a character class as in [ ].

    (?(DEFINE)                 # start DEFINE block
      # pre-define quant subroutine
      (?<quant>many|some|five)

      # pre-define adj subroutine
      (?<adj>blue|large|interesting)

      # pre-define object subroutine
      (?<object>cars|elephants|problems)

      # pre-define noun_phrase subroutine
      (?<noun_phrase>(?&quant)\ (?&adj)\ (?&object))

      # pre-define verb subroutine
      (?<verb>borrow|solve|resemble)
    )                          # end DEFINE block

    ##### The regex matching starts here #####
    (?&noun_phrase)\ (?&verb)\ (?&noun_phrase)

This regex would match phrases such as:✽ five blue elephants solve many interesting problems✽ many large problems resemble some interesting cars

Note that the portion that does the matching is extremely compact and readable:(?&noun_phrase)\ (?&verb)\ (?&noun_phrase) The subroutine noun_phrase is called twice: there is no need to paste a large repeated regex sub-pattern,and if we decide to change the definition of noun_phrase, that immediately trickles to the two placeswhere it is used.

Note also that noun_phrase itself is built by assembling smaller blocks: its code (?&quant)\ (?&adj)\ (?&object) uses the quant, adj and object subroutines.
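Python's built-in re module supports neither (?(DEFINE) … ) nor (?&name) calls, so the pattern above will not compile there. As a rough analogue (not the same feature), the sketch below gets similar modularity by assembling the pattern from string fragments. The fragment variables are hypothetical names that mirror the subroutines above, and the third sample sentence is my own.

```python
import re

# Fragments mirroring the quant, adj, object and verb subroutines.
# Non-capturing groups keep each alternation self-contained.
quant = r"(?:many|some|five)"
adj = r"(?:blue|large|interesting)"
obj = r"(?:cars|elephants|problems)"
verb = r"(?:borrow|solve|resemble)"

# noun_phrase is built from the smaller fragments, like the subroutine.
noun_phrase = rf"(?:{quant} {adj} {obj})"

# The matching part reuses noun_phrase twice.
sentence = re.compile(rf"{noun_phrase} {verb} {noun_phrase}")

samples = [
    "five blue elephants solve many interesting problems",
    "many large problems resemble some interesting cars",
    "five elephants solve problems",  # no adjective, so no match
]
results = [bool(sentence.fullmatch(s)) for s in samples]
print(results)
```

Changing the definition of noun_phrase in one place changes both uses, which is the maintainability benefit described above. The difference is that the fragments are pasted in as text before compilation rather than called by the regex engine.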

With this kind of modularity, you can build regex cathedrals. There is a beautiful example on the page with the regex to match numbers in plain English.

A Note on Group Numbering
Please be mindful that each named subroutine consumes one capture group number, so if you use capture groups later in the regex, remember to count from left to right. The gory details are on the page about Capture Group Numbering & Naming.

Branch Reset: (?| … )

If you've read the page about Capture Group Numbering & Naming, you'll remember that capture groups get numbered from left to right. Therefore, if you have two sets of capturing parentheses, they have two group numbers. Sometimes, you might wish that these two sets of parentheses could capture to the same numbered group.

Perl and PCRE (and therefore C, PHP, R…) have a feature that lets you reuse a group number when capturing parentheses are present on different sides of an alternation.

This is rather abstract, so let's take an example. Let's say you want to match a number, but only in three situations:
✽ If it follows an A, as in A00
✽ If it precedes a B, as in 11B
✽ If it is sandwiched between C and D, as in C22D

This poses no problem using lookahead and lookbehind, but the branch reset syntax (?| … ) gives you another, potentially more readable, option:
(?|A(\d+)|(\d+)B|C(\d+)D)
After the initial (?|, which introduces a branch reset, the group has a three-piece alternation (two |). Each of those pieces contains a capture group (\d+). The number of all of those capture groups is the same: Group 1.

You are not limited to one group. For instance, if you are also interested in capturing a potential suffix after the number (which can happen in the situations 11B and C22D), place another set of parentheses wherever you find a suffix:
(?|A(\d+)|(\d+)(B)|C(\d+)(D))
Using this regex to match the string A00 11B C22D, you obtain these groups:

Match   Group 1: Number   Group 2: Suffix
-----   ---------------   ---------------
A00     00                (not set)
11B     11                B
C22D    22                D
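For engines without branch reset, the lookaround option mentioned earlier can be sketched in Python, whose built-in re module has no (?| … ) syntax. This is my own reading of that option, not a pattern from the original page: the lookarounds check the surrounding letters without consuming them, so the overall match is the number itself and no capture group is needed. The trailing 33 in the sample string is an addition of mine, included to show a number that fits none of the three situations.

```python
import re

# Three alternatives, one per situation:
#   (?<=A)\d+         a number that follows an A
#   \d+(?=B)          a number that precedes a B
#   (?<=C)\d+(?=D)    a number sandwiched between C and D
number = re.compile(r"(?<=A)\d+|\d+(?=B)|(?<=C)\d+(?=D)")

found = number.findall("A00 11B C22D 33")
print(found)  # ['00', '11', '22']
```

Note that this version cannot return the suffix in a second group the way the two-group branch reset pattern does, because the letters are never part of the match.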

How Useful is Branch Reset?
When I first read about branch reset in the PCRE documentation a few years ago, I was excited and certain I'd use it often. Since then, I've written several thousand regular expression patterns, but I've used branch reset less than a handful of times. It's probably my fault for always jumping on other ways to do things first, but this leaves me with a sense that the feature is not all that useful after all.

That being said, on rare occasions, it's just the most direct and elegant way of doing things.

Let's look at one more example, less contrived than the first, which was pared down in order to explain the feature.

A Branch Reset Example: Tokenization with Variable Formats
To me, this is an example where branch reset seems to offer benefits over competing idioms.

Suppose you want to parse strings such as
song:"Sweet Home Alabama" fruit:apple color:blue motto:"Don't Worry"
into pairs of keys and values. When the value following the colon is between quotes, you only want the inside of the quotes. Therefore, you expect something like:

Group 1   Group 2
-------   -------
song      Sweet Home Alabama
fruit     apple
color     blue
motto     Don't Worry

This branch reset regex will get you there:

(\S+):(?|([^"\s]+)|"([^"]+))
Group 1 (\S+) is a straight capture group that captures the key. In the branch reset, the two sets of capturing parentheses allow you to capture different kinds of values in different formats to the same group, i.e. Group 2. You can check the group captures in the right pane of this online regex demo.

To me, this alternative with a conditional and a lookbehind…
(\S+):"?((?(?<!")[^"\s]+|[^"]+))
…feels a little less satisfying. But hey, it works too.
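Neither of the two patterns above compiles in Python's built-in re module, which lacks branch reset and does not accept a lookbehind as the condition of a conditional. As a workaround sketch under that assumption, the code below lets the quoted and unquoted values land in two different groups (2 and 3) and merges them afterwards. Unlike the branch reset pattern, this variant also consumes the closing quote. The variable names are mine.

```python
import re

# Group 1: key. Group 2: quoted value (without quotes). Group 3: bare value.
pair = re.compile(r'(\S+):(?:"([^"]+)"|([^"\s]+))')

text = 'song:"Sweet Home Alabama" fruit:apple color:blue motto:"Don\'t Worry"'

# Only one of groups 2 and 3 is set for each match, so "or" picks it.
pairs = [(m.group(1), m.group(2) or m.group(3)) for m in pair.finditer(text)]

for key, value in pairs:
    print(key, "->", value)
```

The extra merging step is exactly the bookkeeping that branch reset saves you: with (?| … ) both kinds of value would already sit in Group 2.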

Inline Comments: (?# … )

By now you must be familiar with the free-spacing mode, which makes it possible to unroll long regexes and comment them, as in the many code boxes on this site. To turn on free-spacing for an entire pattern, the syntax varies:
✽ the (?x) modifier works in .NET, Perl, PCRE, Java, Python and Ruby.
✽ the x flag can be added after the pattern delimiter in Perl, PHP and Ruby.
✽ .NET lets you turn on the RegexOptions.IgnorePatternWhitespace option.
✽ Python lets you turn on re.VERBOSE.

What if you only want to insert a single comment without turning on free-spacing mode for the entire pattern? In Perl, PCRE (and therefore C, PHP, R…), Python and Ruby, you can write an inline comment with this syntax: (?# … )

For instance, in:

(?# the year)\d{4}

\d{4} matches four digits, while (?# the year) tells you what we are trying to match.
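Since Python is one of the engines listed as supporting inline comments, here is a quick check with its built-in re module. The sample sentence is made up for the illustration.

```python
import re

# The (?# ... ) group is discarded by the engine; only \d{4} does any matching.
year = re.compile(r"(?# the year)\d{4}")

years = year.findall("Released in 1999 and revised in 2014.")
print(years)  # ['1999', '2014']
```

No flag is needed here: the comment works without turning on free-spacing mode.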

How useful is this? Not very. I almost never use this feature: when I want comments, I just turn on free-spacing mode for the whole regex.

Ask Rex

1-7 of 7 Threads

Duncan – UK
March 12, 2014 - 02:40
Subject: Removing Confusion Around (? Regex Syntax

This topic is very well written and much appreciated. Distills large works like Friedl's book into an easily digestible quarter of an hour. I look forward to reading the rest!

xtello – France
February 19, 2014 - 08:03
Subject: RE: Your banner regex

Thanks Rex, you really made me laugh!! I see you always have the same excellent sense of humor as in your (brilliant) articles & tutorials! Thank you for this great site and for the joke :) (and for the new regex)

Greetings from (the south of) France! Xavier Tello

Reply to xtello
Rex
February 21, 2014 - 10:45
Subject: RE: Your banner regex

Hi Xavier, Thank you for your very kind encouragements! If only everyone could be like you. When the technology becomes available, would you mind if I get back in touch in order to clone you? Wishing you a fun weekend, Rex

xtello – France
February 17, 2014 - 10:07
Subject: Your banner regex

I looked at the regex displayed in your banner… Applying this regex to the string [spoiler] will produce [spoiler] (if I'm not wrong!). What's this easter egg? ;-)

Reply to xtello
Rex
February 17, 2014 - 16:37
Subject: RE: Your banner regex

Hi Xavier, Thank you for writing, it was a treat to hear from you. Wow, you are the first person to notice! In fact, you made me change the banner to satisfy your sense of completion (and make it harder for the next guy).
> What's this easter egg?
This Easter Egg (pun intended, I presume) is that you are the grand winner of a secret contest. From the time I launched the site, I had planned that the first person to discover this would win a free trip to the South of France. You won!!! :) :) :) Wishing you a beautiful day, Rex

Nicolas – Brussels
August 05, 2013 - 10:09
Subject: Little question about capture

Hi Andy. Thank you for all these articles, they are amazing! I learn a lot with this website. So glad to have found it! Like they said: Best resource on internet :)

I tried some of your examples, and I'm stuck with one of them: (?:(\()|-)\d{6}(?(1)\)). When I'm trying "(111111)" with "preg_match_all", it captures "(". Do you think it's possible to bypass this capture? When I use "-222222", it catches an empty string… And I don't understand why. Could you please explain this? Thank you Andy! And again: Nice work!

Reply to Nicolas
Rex
August 05, 2013 - 18:56
Subject: RE: Little question about capture

Hi Nicolas,

Run this:
$regex='~(?:(\()|-)\d{6}(?(1)\))~';
$string='(such as "(444444)"), or it is preceded by a minus sign (such as "-333333").';
preg_match_all($regex,$string,$m);
var_dump( $m );

You will see that the MATCHES are (444444) and -333333
The CAPTURES are "(" and "". The captured left par is what makes the ?(1) work later in the regex. Let me know if this is still unclear.

Aravind P S
May 03, 2013 - 17:39
Subject: Great Work man.

I enjoyed reading this article and learnt a lot. Thanks for your wonderful work. :)

Vin – Switzerland
November 28, 2012 - 21:05
Subject: Brilliant

Best resource I've found yet on regular expressions. Much appreciate the work you put into this. Why not create an eBook that could be downloaded? I for one would willingly cough up a few dollars. Regards Vin

Reply to Vin
Andy
December 02, 2012 - 09:03
Subject: Re: Brilliant

Hi Vin, Thank you very much for your encouragements, and also for your suggestion. I've been itching to make a print-on-demand book with the lowest price possible, to make it easy to read offline. Will probably do that as soon as they extend the length of a day to 49 hours. Wishing you a fun weekend, Andy

Skrell
November 22, 2012 - 08:21
Subject: amazing

These articles you post on regular expressions are among the best I've found on the entire internet! No joke! Much appreciated!!!

Reply to Skrell
Andy
November 22, 2012 - 21:13
Subject: Re: amazing

Hi Skrell, thank you very much for your supportive comment. I'm glad to know that someone likes these pages! They took weeks to write and I've been surprised by how little time visitors have spent on them. To enjoy a certain presentation of technical information I guess we must be of like minds at least in some small way. :) Wishing you a fun end of the week, -A
