REGEX Extended

8/2/2019 REGEX Extended


Metacharacters

1. The 12 punctuation characters that make regular expressions work their magic are $ ( ) * + . ? [ \ ^ { |

2. Notably absent from the list are ], -, and }. The first two become metacharacters only after an unescaped [, and the } only after an unescaped {.

3. If you want your regex to match any of these characters literally, escape it by placing a backslash in front of it.
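
As a minimal JavaScript sketch of rule 3, the helper below puts a backslash in front of every metacharacter so that arbitrary text is matched literally. The name escapeRegex is chosen here for illustration and is not a built-in; the helper also escapes ] and }, which is harmless.

```javascript
// Hypothetical helper: escapes regex metacharacters in arbitrary text.
function escapeRegex(text) {
  // "$&" in the replacement re-inserts the matched character after the backslash.
  return text.replace(/[$()*+.?[\\\]^{|}]/g, "\\$&");
}

const literal = "1+1=2 (really?)";
const pattern = new RegExp(escapeRegex(literal));

console.log(escapeRegex(literal));                        // 1\+1=2 \(really\?\)
console.log(pattern.test("proof: 1+1=2 (really?)"));      // true
console.log(new RegExp(escapeRegex("a.b")).test("axb"));  // false, the dot is literal
```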



Matching literal string

Any regular expression that does not include any of the dozen characters $ ( ) * + . ? [ \ ^ { | simply matches itself.

By default, regular expressions are case sensitive: regex matches regex but not Regex, REGEX, or ReGeX.

Turn on case insensitivity with the (?i) mode modifier in .NET, as in (?i)regex or sensitive(?i)caseless(?-i)sensitive (local mode modifiers), or by setting the /i flag when creating the regex in JavaScript.



Matching non printable characters

Representation   Meaning           Hex    Flavors
\a               bell              0x07   .NET
\e               escape            0x1B   .NET
\f               form feed         0x0C   .NET, JScript
\n               new line          0x0A   .NET, JScript
\r               carriage return   0x0D   .NET, JScript
\t               horizontal tab    0x09   .NET, JScript
\v               vertical tab      0x0B   .NET, JScript

Variations:

Using \cA through \cZ, you can match one of the 26 control characters that occupy positions 1 through 26 in the ASCII table.

A lowercase \x followed by exactly two hexadecimal digits matches a single character in the ASCII set (the digits are written in uppercase by convention; both flavors also accept lowercase).
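
A short JavaScript sketch of the two variations, assuming the horizontal tab and line feed as sample characters (the variable names are illustrative):

```javascript
// \x09 and \cI both denote the horizontal tab (ASCII 9), like \t.
const tabByHex = /\x09/;
const tabByControl = /\cI/;
// \x0A is the line feed, the same character as \n.
const lineFeedByHex = /\x0A/;

console.log(tabByHex.test("a\tb"));      // true
console.log(tabByControl.test("a\tb"));  // true
console.log(lineFeedByHex.test("a\nb")); // true
console.log(tabByHex.test("a b"));       // false, a space is not a tab
```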



Matching [$"'\n\d/\\] :

C# - "[$\"'\n\\d/\\\\]"
- double quotes and backslashes must be escaped with a backslash.
Note: "\n" is a string with a literal line break, which free-spacing mode may ignore as whitespace. "\\n" is a string with the regex token \n, which matches a newline.

C# - @"[$""'\n\d/\\]"
- to include a double quote in a verbatim string, double it up.
Note: @"\n" is always the regex token \n, which matches a newline; verbatim strings do not support \n at the string level.

JavaScript - /[$"'\n\d\/\\]/
- Simply place your regular expression between two forward slashes.
- If any forward slashes occur within the regular expression itself, escape those with a backslash.



Creating Regular Expression Objects

C#:
try
{
    Regex regexObj = new Regex("UserInput", RegexOptions.Compiled);
}
catch (ArgumentException ex)
{
    // Syntax error in the regular expression
}

Note: a regex created with RegexOptions.Compiled can run up to 10 times faster than one created without this option, because the regular expression is compiled down to CIL.

JavaScript:
var myregexp = /regex pattern/;
var myregexp = new RegExp(userinput);



Match One of Many Characters

[ ] - a character class matches a single character out of a list of possible characters

^ (caret) - negates the character class if you place it immediately after the opening bracket

- (hyphen) - creates a range when it is placed between two characters (order given by the ASCII or Unicode character table)

Examples:

o Hexadecimal character : [a-fA-F0-9]

o Nonhexadecimal character : [^a-fA-F0-9]

o Characters group : [aeiou]
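
The three example classes can be tried in JavaScript as follows. This is only a sketch; isHexString is a hypothetical helper introduced for illustration.

```javascript
const hexDigit = /[a-fA-F0-9]/;      // one hexadecimal character
const nonHexDigit = /[^a-fA-F0-9]/;  // one character that is not hexadecimal
const vowel = /[aeiou]/;             // one lowercase vowel

// Hypothetical helper: true when every character of s is a hex digit.
function isHexString(s) {
  return s.length > 0 && !nonHexDigit.test(s);
}

console.log(isHexString("DEADbeef01")); // true
console.log(isHexString("0xFF"));       // false, "x" is not a hex digit
console.log("rhythm".search(vowel));    // -1, no vowel found
```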



Shorthands

Six regex tokens that consist of a backslash and a letter form shorthand character classes. Each lowercase shorthand has an associated uppercase shorthand with the opposite meaning.

Token   Matches                                                     Opposite
\d      a single digit                                              \D (same as [^\d])
\w      a single word character                                     \W
\s      any whitespace character (spaces, tabs, and line breaks)    \S

Note - In JavaScript \w is always identical to [a-zA-Z0-9_]. In .NET it also includes letters and digits from all other scripts (Cyrillic, Thai, etc.).
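
The JavaScript side of the note can be checked with a small sketch like the one below; the sample strings are assumptions chosen for illustration.

```javascript
const identifier = /^\w+$/;

console.log(identifier.test("snake_case9")); // true
console.log(identifier.test("caf\u00E9"));   // false: é is outside [a-zA-Z0-9_]
console.log(/^\W$/.test("\u00E9"));          // true: \W is the opposite of \w
console.log("a1b22".replace(/\D/g, ""));     // "122", every non-digit removed
```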



Matching any character

Solution   Matches                                 Flavor          Notes
.          any character, except line breaks       .NET, JScript   .NET: the "dot matches line breaks" option must not be set
.          any character, including line breaks    .NET            .NET: the "dot matches line breaks" option must be set [1] (RegexOptions.Singleline)
[\s\S]     any character, including line breaks    JScript [2]

[1] You can also place a mode modifier at the start of the regular expression: (?s) is the mode modifier for "dot matches line breaks" mode in .NET.

[2] An alternative solution is needed for JavaScript engines that predate the ES2018 /s flag and therefore have no "dot matches line breaks" option ([\d\D] and [\w\W] have the same effect).
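
A quick JavaScript sketch of the difference between the dot and the [\s\S] workaround, using an assumed two-line sample string:

```javascript
const text = "first\nsecond";

const dot = /first.second/;            // the dot stops at the line break
const anyChar = /first[\s\S]second/;   // [\s\S] also matches the line break
const anyCharAlt = /first[\d\D]second/;

console.log(dot.test(text));        // false
console.log(anyChar.test(text));    // true
console.log(anyCharAlt.test(text)); // true
```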



Match Something at the Start and/or

the End of a Line (1)

Solution: \A
Flavor: .NET
Matches: at the very start of the subject text, before the first character (to test whether the subject text begins with the text you want to match)
Note: the A must be uppercase

Solution: ^
Flavor: .NET, JScript
Matches: equivalent to \A, as long as you do not turn on the "^ and $ match at line breaks" option; otherwise it will match at the very start of each line
Note: .NET: the "^ and $ match at line breaks" option is RegexOptions.Multiline

Solution: \Z, \z
Flavor: .NET
Matches: at the very end of the subject text, after the last character (to test whether the subject text ends with the text you want to match)
Note: \Z and \z differ only when the last character in your subject text is a line break. In that case, \Z can match at the very end of the subject text, after the final line break, as well as immediately before that line break, while \z matches only at the very end.

Solution: $
Flavor: .NET, JScript
Matches: equivalent to \Z, as long as you do not turn on the "^ and $ match at line breaks" option; otherwise it will match at the very end of each line
Note: .NET: the "^ and $ match at line breaks" option is RegexOptions.Multiline

Anchors - ^, $, \A, \Z, and \z - match at certain positions, effectively anchoring the regular expression match at those positions.


Match Something at the Start and/or

the End of a Line (2)

Examples:

^alpha (.NET, JavaScript) - matches alpha at the start of the subject text if "^ and $ match at line breaks" is not set, or at the start of each line otherwise

\Aalpha (.NET) - matches alpha at the start of the subject text

omega$ (.NET, JavaScript) - matches omega at the end of the subject text if "^ and $ match at line breaks" is not set, or at the end of each line otherwise

omega\Z (.NET) - matches omega at the end of the subject text


Match Something at the Start and/or

the End of a Line (3)

Combining two anchors:

\A\Z matches the empty string, as well as

the string that consists of a single newline

\A\z matches only the empty string

^$ matches each empty line in the subject

text (in ^ and $ match at line breaks mode)

Note - In .NET, if you cannot turn on ^ and $ match at line breaks mode outside

the regular expression, you can place (?m) mode modifier at the start of the

regular expression
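
In JavaScript the same mode is the /m flag. The sketch below uses an assumed four-line sample string to show how ^ and $ change meaning with and without it.

```javascript
const subject = "alpha\n\nomega\nalpha";

// Without /m, ^ matches only at the very start of the subject text.
const startsWithoutM = subject.match(/^alpha/g);
// With /m, ^ also matches after each line break.
const startsWithM = subject.match(/^alpha/gm);
// ^$ in /m mode finds each empty line.
const emptyLines = subject.match(/^$/gm);

console.log(startsWithoutM.length);       // 1
console.log(startsWithM.length);          // 2
console.log(emptyLines.length);           // 1
console.log(/omega$/.test(subject));      // false, omega is not at the end of the text
console.log(/omega$/m.test(subject));     // true, omega ends a line
```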


Regular Expression Options (C#)

None - Specifies that no options are set.

IgnoreCase - Specifies case-insensitive matching.

Multiline - Multiline mode. Changes the meaning of ^ and $ so they match at the beginning and end, respectively, of any line, and not just the beginning and end of the entire string (caret and dollar match at line breaks).

ExplicitCapture - Specifies that the only valid captures are explicitly named or numbered groups of the form (?<name>...). This allows unnamed parentheses to act as noncapturing groups without the syntactic clumsiness of the expression (?:...).

Compiled - Specifies that the regular expression is compiled to an assembly. This yields faster execution but increases startup time.

Singleline - Specifies single-line mode. Changes the meaning of the dot (.) so it matches every character (instead of every character except \n) (dot matches line breaks).

IgnorePatternWhitespace - Eliminates unescaped white space from the pattern and enables comments marked with # (free-spacing).

RightToLeft - Specifies that the search will be from right to left instead of from left to right.

ECMAScript - Enables ECMAScript-compliant behavior for the expression (JavaScript flavor). This value can be used only in conjunction with the IgnoreCase, Multiline, and Compiled values; using it with any other value results in an exception. Its most important effect is that \w and \d are restricted to ASCII characters, as they are in JavaScript.

CultureInvariant - Specifies that cultural differences in language are ignored.


Setting Regular Expression Options

C#:
Regex regexObj = new Regex("regex pattern",
    RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase |
    RegexOptions.Singleline | RegexOptions.Multiline);

JavaScript:
var myregexp = /regex pattern/im;

Regex Options
1. Free-spacing: Not supported by JavaScript.
2. Case insensitive: /i
3. Dot matches line breaks: Not supported by JavaScript before ES2018, which added the /s flag.
4. Caret and dollar match at line breaks: /m
5. Additional language-specific options: /g applies a regular expression repeatedly to the same string.


Test Whether a Match Can Be Found

Within a Subject String

C#:
bool foundMatch = false;
try {
    foundMatch = Regex.IsMatch(subjectString, UserInput);
} catch (ArgumentNullException ex) {
    // Cannot pass null as the regular expression or subject string
} catch (ArgumentException ex) {
    // Syntax error in the regular expression
}

or

bool foundMatch = Regex.IsMatch(subjectString, "regex pattern");

Note: @"\Aregex pattern\Z" - the regex must match the subject string entirely

JavaScript:
if (/regex pattern/.test(subjectString)) {
    // Successful match
} else {
    // Match attempt failed
}

Note: /^regex pattern$/.test(subjectString) - the regex must match the subject string entirely


Retrieve the Matched Text

C#:
Regex regexObj = new Regex(@"\d+");
string resultString = regexObj.Match(subjectString).Value;

Note:
1. regexObj.Match("123456", 3, 2) tries to find a match in "45"
2. regexObj.Match(subjectString).Index - position of the match in the subject string
3. regexObj.Match(subjectString).Length - length of the match

JavaScript:
var result = subject.match(/\d+/);
if (result) {
    result = result[0];
} else {
    result = '';
}

var matchstart = -1;
var matchlength = -1;
var match = /\d+/.exec(subject);
if (match) {
    matchstart = match.index;
    matchlength = match[0].length;
}


Match Whole Words

\b - word boundary - matches at the start or the end of a word, in three positions: before the first character of the subject if it is a word character, after the last character of the subject if it is a word character, and between a word character and a nonword character.

Example: \bdog\b - The first \b requires the d to occur at the very start of the string, or after a nonword character. The second \b requires the g to occur at the very end of the string, or before a nonword character (line break characters are nonword characters). It matches dog in My dog is stupid, but not in I will build a doghouse.

\B matches at every position in the subject text where \b does not match, that is, at every position that is not at the start or end of a word.

Example: \Bcat\B matches cat in scatter, but not in My cat is lazy, category, or bobcat.

Note: to match cat everywhere except as a whole word, use alternation to combine \Bcat and cat\B into \Bcat|cat\B
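
The examples above behave the same way in JavaScript; the following sketch simply runs them against the sample sentences from this slide.

```javascript
const wholeDog = /\bdog\b/;
const innerCat = /\Bcat\B/;
const notWholeCat = /\Bcat|cat\B/;

console.log(wholeDog.test("My dog is stupid"));        // true
console.log(wholeDog.test("I will build a doghouse")); // false
console.log(innerCat.test("scatter"));                 // true
console.log(innerCat.test("category"));                // false
console.log(notWholeCat.test("category"));             // true
console.log(notWholeCat.test("bobcat"));               // true
console.log(notWholeCat.test("My cat is lazy"));       // false
```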



Unicode Code Points, Properties,

Blocks, and Scripts (1)

Solution: \u2122
Matches: Unicode code point
Flavor: .NET, JScript
Note:
- a code point is one entry in the Unicode character database (\u2122 is the trademark sign)
- the \u syntax requires exactly four hexadecimal digits (U+0000 through U+FFFF)

Solution: \p{Sc}
Matches: Unicode property or category
Flavor: .NET
Note:
\p{L} - Any kind of letter from any language
\p{M} - A character intended to be combined with another character (accents etc.)
\p{Z} - Any kind of whitespace or invisible separator
\p{S} - Math symbols, currency signs etc.
\p{N} - Any kind of numeric character in any script
\p{P} - Any kind of punctuation character
\p{C} - Invisible control characters and unused code points

Solution: \p{IsGreekExtended}
Matches: Unicode block
Flavor: .NET
Note: other blocks include Basic Latin, Greek and Coptic, Cyrillic, Katakana, etc. .NET names blocks with the Is prefix (\p{IsBasicLatin}, \p{IsCyrillic}, \p{IsKatakana}); the \p{InBasic_Latin} spelling belongs to other flavors.

Solution: \P{M}\p{M}*
Matches: Unicode grapheme
Flavor: .NET
Note: a grapheme is a base character plus any combining marks; à can be encoded as the single code point \u00E0 or as \u0061\u0300


Unicode Code Points, Properties,

Blocks, and Scripts (2)

The uppercase \P is the negated variant of the lowercase \p. Example: \P{Sc} matches any character that does not have the Currency Symbol Unicode property.

The JavaScript flavor does not support Unicode categories, blocks, or scripts (ES2018 later added \p{...} escapes for categories and scripts under the /u flag). Instead, you can list the characters that are in the category, block, or script in a character class. Alternative versions for:

Blocks - [\u1F00-\u1FFF] is equivalent to \p{IsGreekExtended}

Categories and scripts - create a character class with all the code points from the specific category or script

See also: http://www.unicode.org/


Character class subtractions in .NET

General form: [class-[subtract]]

Examples:

1. [a-zA-Z0-9-[g-zG-Z]] matches a single hexadecimal character.

2. [\p{IsThai}-[\P{N}]] matches any of the 10 Thai digits.

\p{IsThai} matches any character in the Thai block

\P{N} matches any character that doesn't have the Number property


Match One of Several Alternatives

The vertical bar, or pipe symbol, splits the regular expression into multiple alternatives.

Example: apply Mary|Jane|Sue to "Mary, Jane, and Sue went to Mary's house"; the match Mary is immediately found at the start of the string.

The order of the alternatives in the regex matters only when two of them can match at the same position in the string. In that case, leave the most general alternative last in the enumeration.
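
A small JavaScript sketch of both points. The Jane/Janet pair is an assumed example of two alternatives that can match at the same position.

```javascript
const sentence = "Mary, Jane, and Sue went to Mary's house";
const names = sentence.match(/Mary|Jane|Sue/g);
console.log(names); // [ 'Mary', 'Jane', 'Sue', 'Mary' ]

// The leftmost alternative that matches wins, even if a later one is longer.
const generalFirst = /Jane|Janet/.exec("Janet arrived")[0];
const generalLast = /Janet|Jane/.exec("Janet arrived")[0];
console.log(generalFirst); // Jane
console.log(generalLast);  // Janet
```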


Group and Capture Parts of the Match

A capturing group is a pair of parentheses that captures the text matched by the part of the regular expression inside it.

Example: \b(\d\d\d\d)-(\d\d)-(\d\d)\b
1. Has three capturing groups: (\d\d\d\d), (\d\d) and (\d\d)
2. During the matching process the regular expression engine stores the part of the text matched by each capturing group

Applied to the subject string 2012-10-02, the groups hold 2012, 10, and 02.

Noncapturing groups: (?: opens a noncapturing group.

In .NET you can specify mode modifiers in the group (example: (?i: ) is a case-insensitive noncapturing group); this form is not available in the JScript flavor.

Benefits:
- You can add them to an existing regex without upsetting the references to numbered capturing groups
- Performance - a capturing group adds unnecessary overhead that you can eliminate by using a noncapturing group

Note: parts of the match can be named: \b(?<year>\d\d\d\d)-(?<month>\d\d)-(?<day>\d\d)\b or \b(?'year'\d\d\d\d)-(?'month'\d\d)-(?'day'\d\d)\b (only .NET).


Match Previously Matched Text Again

Steps

1. Capture a text in a group

2. Match the same text anywhere in the regex using a backreference (backslash followed by a number)

Example: \b\d\d(\d\d)-\1-\1\b matches 2009-09-09, 2010-10-10, 2011-11-11, 2012-12-12, etc.

Note: you can name a backreference: \b\d\d(?<name>\d\d)-\k<name>-\k<name>\b
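
The numbered backreference works the same way in JavaScript. The sketch below shows that the month and day must repeat the last two digits of the year, so 2012-09-09 does not qualify (the sample dates are assumptions).

```javascript
const magicDate = /\b\d\d(\d\d)-\1-\1\b/;

console.log(magicDate.test("2012-12-12")); // true: group 1 holds "12"
console.log(magicDate.test("2009-09-09")); // true: group 1 holds "09"
console.log(magicDate.test("2012-09-09")); // false: "09" differs from "12"

const found = "Dates: 2009-09-09 and 2012-09-09".match(/\b\d\d(\d\d)-\1-\1\b/g);
console.log(found); // [ '2009-09-09' ]
```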


Retrieve Part of the Matched Text

C#:
string resultString = Regex.Match(subjectString, "http://([a-z0-9.-]+)").Groups[1].Value;

string resultString = Regex.Match(subjectString,
    "http://(?<domain>[a-z0-9.-]+)").Groups["domain"].Value;

JavaScript:
var result = "";
var match = /http:\/\/([a-z0-9.-]+)/.exec(subject);
if (match) {
    result = match[1];
} else {
    result = '';
}


Retrieve a List of All Matches

C#:
Regex regexObj = new Regex(@"\d+");
MatchCollection matchlist = regexObj.Matches(subjectString);

JavaScript:
var list = subject.match(/\d+/g);

Note:
- the /g flag tells the match() function to iterate over all matches in the string and put them into an array
- with the /g flag, string.match() returns only the matched text and does not provide any further details about each match (such as its position or capturing groups)


Iterate over All Matches

C#:
Match matchResult = Regex.Match(subjectString, @"\d+");
while (matchResult.Success) {
    // Here you can process the match stored in matchResult
    matchResult = matchResult.NextMatch();
}

JavaScript:
var regex = /\d+/g;
var match = null;
while (match = regex.exec(subject)) {
    // Don't let browsers such as Firefox get stuck in an infinite loop
    if (match.index == regex.lastIndex) regex.lastIndex++;
    // Here you can process the match stored in the match variable
}

Note: exec() should set lastIndex to the first character after the match. If the match is zero characters long, the next match attempt begins at the position of the match just found, resulting in an infinite loop; incrementing lastIndex prevents this.


Repeat Part of the Regex a Certain

Number of Times

\b\d{100}\b - a decimal number with 100 digits

\b[a-f0-9]{1,8}\b - a 32-bit hexadecimal number

\b[a-f0-9]{1,8}h?\b - a 32-bit hexadecimal number with an optional h suffix

\b\d*\.\d+(e\d+)? - a floating-point number with an optional integer part, a mandatory fractional part, and an optional exponent

Token   Result                                          Notes
{n}     repeats the preceding regex token n times
{n,m}   variable repetition (between n and m times)     \d{0,1} matches zero or one digit, same as \d?
{n,}    infinite repetition, n or more times            \d{1,} matches one or more digits, same as \d+; \d{0,} matches zero or more digits, same as \d*

+, *, ? - greedy quantifiers
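
A short JavaScript sketch of these quantifiers. As an assumption for validation purposes, the hexadecimal pattern is anchored with ^ and $ instead of \b and made case insensitive.

```javascript
// Assumed validation variant of the hex pattern: whole string, any letter case.
const hex32 = /^[a-f0-9]{1,8}h?$/i;
console.log(hex32.test("DEADBEEF"));  // true, 8 hex digits
console.log(hex32.test("1Fh"));       // true, optional h suffix
console.log(hex32.test("123456789")); // false, 9 digits exceed {1,8}

// The floating-point pattern from the slide, applied to a sample string.
const float = /\b\d*\.\d+(e\d+)?/;
console.log("x = 3.14e10".match(float)[0]); // 3.14e10

// {1,} and + are interchangeable.
console.log(/^\d{1,}$/.test("2019") === /^\d+$/.test("2019")); // true
```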


Choose Minimal or Maximal Repetition (1)

Lazy quantifiers
- a lazy quantifier repeats as few times as it has to, stores one backtracking position, and allows the regex to continue
- the regex goes ahead only one character at a time, each time checking whether the following text can be matched

You can make any quantifier lazy by placing a question mark after it: *?, +?, ??, and {7,42}?

Example:

The very first task is to find the beginning of a paragraph.
Then you have to find the end of the paragraph

.* (greedy)
vs
.*? (lazy)
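
To make the contrast concrete, here is a JavaScript sketch that assumes paragraphs delimited by <p> and </p> tags in a single-line sample string; the tags and the sample text are assumptions made for this illustration.

```javascript
const html = "<p>one</p><p>two</p>";

// Greedy: .* runs to the end of the line, then backtracks to the LAST </p>.
const greedy = html.match(/<p>.*<\/p>/)[0];
// Lazy: .*? expands one character at a time and stops at the FIRST </p>.
const lazy = html.match(/<p>.*?<\/p>/)[0];
const allParagraphs = html.match(/<p>.*?<\/p>/g);

console.log(greedy);        // <p>one</p><p>two</p>
console.log(lazy);          // <p>one</p>
console.log(allParagraphs); // [ '<p>one</p>', '<p>two</p>' ]
```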


Choose Minimal or Maximal Repetition (2)

Possessive quantifiers
- a possessive quantifier tries to repeat as many times as possible
- it will never give back, not even when giving back is the only way that the remainder of the regular expression could match
- it does not keep backtracking positions

You can make any quantifier possessive by placing a plus sign after it: *+, ++, ?+, and {7,42}+

Note: neither .NET nor JavaScript supports the possessive quantifier syntax; in .NET, use the equivalent atomic group.

Atomic group (not available in JScript)
- a noncapturing group, with the extra job of refusing to backtrack
- the opening bracket simply consists of the three characters (?>

Possessive quantifier   Equivalent atomic group
\b\d++\b                \b(?>\d+)\b
\w++\d++                (?>\w+)(?>\d+)


Test for a Match Without Adding It to

the Overall Match

Lookaround - checks whether certain text can be matched without actually matching it:

1. lookbehind

positive : (?<=a)b matches a "b" that is preceded by an "a"

negative : (?<!a)b matches a "b" that is not preceded by an "a"

2. lookahead

positive : q(?=u) matches a "q" that is followed by a "u"

negative : q(?!u) matches a "q" not followed by a "u"

Note: JavaScript supports only lookahead (lookbehind was added in ES2018)
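
The two lookahead examples can be sketched in JavaScript as follows; the sample text "Iraq quit" is an assumption. Note that the u is checked but never becomes part of the match.

```javascript
const text = "Iraq quit";

const qWithoutU = /q(?!u)/.exec(text); // the q at the end of "Iraq"
const qWithU = /q(?=u)/.exec(text);    // the q at the start of "quit"

console.log(qWithoutU.index); // 3
console.log(qWithU.index);    // 5
console.log(qWithU[0]);       // q (the u is not included in the match)
```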


Match One of Two Alternatives Based

on a Condition

(?(1)then|else) - checks whether the first capturing group has already matched something (.NET; not available in JavaScript)

Examples:

1. \b(?:(?:(one)|(two)|(three))(?:,|\b)){3,}(?(1)|(?!))(?(2)|(?!))(?(3)|(?!))

(?(1)|(?!)) - if capturing group 1 has matched
- then empty regex (always passes)
- else empty negative lookahead (?!) (always fails)

2. (a)?b(?(1)c|d) matches the same text as abc|bd


Insert Literal Text into the

Replacement Text (1)

Key characters:

\ - a literal character that does not need to be escaped

$ - needs to be escaped only when it is followed by a digit, &, `, ', _, +, {, or $; to escape a dollar sign, precede it with another dollar sign.

Example: $%\*$1\1 => $%\*$$1\1

Note: $1 is a backreference to a capturing group and $& refers to the whole regex match; \1 stays literal text in .NET and JavaScript replacement strings.


Insert Literal Text into the

Replacement Text (2)

Examples:

1. Regular expression: http:\S+
Replacement: $&

2. Regular expression: \b(\d{3})(\d{3})(\d{4})\b
Replacement: ($1) $2-$3

3. Regular expression: \b(?<g1>\d{3})(?<g2>\d{3})(?<g3>\d{4})\b
Replacement: (${g1}) ${g2}-${g3}

Note: .NET and JavaScript leave backreferences to groups that don't exist as literal text in the replacement.
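
A JavaScript sketch of these replacement tokens. The sample strings, the angle brackets around $&, and the 3-3-4 digit grouping for the phone number are assumptions made for illustration.

```javascript
// $& re-inserts the whole match, here wrapped in literal angle brackets.
const linked = "visit http://example.com/a today".replace(/http:\S+/, "<$&>");
console.log(linked); // visit <http://example.com/a> today

// $1, $2, $3 insert the text of the numbered capturing groups.
const phone = "1234567890".replace(/\b(\d{3})(\d{3})(\d{4})\b/, "($1) $2-$3");
console.log(phone); // (123) 456-7890

// $$ is an escaped dollar sign, so "$$$1" yields a literal $ followed by group 1.
const price = "cost 5".replace(/(\d+)/, "$$$1");
console.log(price); // cost $5
```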


Replace All Matches

C#:
Regex regexObj = new Regex("pattern");
string resultString = regexObj.Replace(subjectString, replacement, count);

Example: Replace(subject, replacement, 3) replaces only the first three regular expression matches, and further matches are ignored.

JavaScript:
result = subject.replace(/before/g, "after");

Note: if you want to replace all regex matches in the string, set the /g flag when creating your regular expression object; if you don't use the /g flag, only the first match will be replaced.


Replace Matches Reusing Parts of the

Match

C#:
string resultString = Regex.Replace(subjectString, @"(\w+)=(\w+)", "$2=$1");

or

Regex regexObj = new Regex(@"(\w+)=(\w+)");
string resultString = regexObj.Replace(subjectString, "$2=$1");

With named groups:

Regex regexObj = new Regex(@"(?<left>\w+)=(?<right>\w+)");
string resultString = regexObj.Replace(subjectString, "${right}=${left}");

JavaScript:
result = subject.replace(/(\w+)=(\w+)/g, "$2=$1");


Replace Matches with Replacements

Generated in Code

C#:
Regex regexObj = new Regex(@"\d+");
string resultString = regexObj.Replace(subjectString,
    new MatchEvaluator(ComputeReplacement));

public String ComputeReplacement(Match matchResult) {
    int t = int.Parse(matchResult.Value) * 2;
    return t.ToString();
}

JavaScript:
var result = subject.replace(/\d+/g,
    function(match) { return match * 2; }
);

Note: the replacement function may accept one or more parameters: the first parameter will be set to the text matched by the regular expression. If the regular expression has capturing groups, the second parameter will hold the text matched by the first capturing group, the third parameter gives you the text of the second capturing group, and so on.
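
As a sketch of the parameter order described in the note, the JavaScript below swaps the two sides of each assignment inside a replacement function; the sample strings and the parameter names left and right are assumptions.

```javascript
// Parameters: whole match first, then one parameter per capturing group.
const swapped = "a=1, b=2".replace(/(\w+)=(\w+)/g, function (match, left, right) {
  return right + "=" + left;
});
console.log(swapped); // 1=a, 2=b

// With no capturing groups, only the matched text is needed.
const doubled = "3 apples and 10 pears".replace(/\d+/g, function (match) {
  return String(Number(match) * 2);
});
console.log(doubled); // 6 apples and 20 pears
```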


Split a string

C#:
string[] splitArray = Regex.Split(subjectString, "regex pattern");

JavaScript:
var list = [];
var regex = /regex pattern/g;
var match = null;
var lastIndex = 0;
while (match = regex.exec(subject)) {
    // Don't let browsers such as Firefox get stuck in an infinite loop
    if (match.index == regex.lastIndex) regex.lastIndex++;
    // Add the text before the match
    list.push(subject.substring(lastIndex, match.index));
    lastIndex = match.index + match[0].length;
}
// Add the remainder after the last match
list.push(subject.substring(lastIndex));


Search Line by Line

C#:
string[] lines = Regex.Split(subjectString, "\r?\n");
Regex regexObj = new Regex("regex pattern");
for (int i = 0; i < lines.Length; i++) {
    if (regexObj.IsMatch(lines[i])) {
        // The regex matches lines[i]
    } else {
        // The regex does not match lines[i]
    }
}

JavaScript:
var lines = subject.split(/\r?\n/);
var regexp = /regex pattern/;
for (var i = 0; i < lines.length; i++) {
    if (lines[i].match(regexp)) {
        // The regex matches lines[i]
    } else {
        // The regex does not match lines[i]
    }
}


Validation and Formatting (1)

Email address (use case-insensitive matching)
^[\w!#$%&'*+/=?`{|}~^-]+(?:\.[\w!#$%&'*+/=?`{|}~^-]+)*@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}$

International Phone Numbers
^\+(?:[0-9]\x20?){6,14}[0-9]$

Validate Traditional Date Formats
^(?:(0?2)/([12][0-9]|0?[1-9])|(0?[469]|11)/(30|[12][0-9]|0?[1-9])|(0?[13578]|1[02])/(3[01]|[12][0-9]|0?[1-9]))/((?:[0-9]{2})?[0-9]{2})$

Limit the Number of Lines in Text
^(?:(?:\r\n?|\n)?[^\r\n]*){0,5}$

Validate Affirmative Responses
^(?:1|t(?:rue)?|y(?:es)?|ok(?:ay)?)$
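
Two of these patterns, sketched in JavaScript. The /i flag on the affirmative-response pattern and the sample inputs are assumptions made for this illustration.

```javascript
const affirmative = /^(?:1|t(?:rue)?|y(?:es)?|ok(?:ay)?)$/i;
const intlPhone = /^\+(?:[0-9]\x20?){6,14}[0-9]$/;

console.log(affirmative.test("Yes"));  // true
console.log(affirmative.test("okay")); // true
console.log(affirmative.test("yep"));  // false

console.log(intlPhone.test("+44 20 7946 0958")); // true, 12 digits after the plus
console.log(intlPhone.test("+12"));              // false, fewer than 7 digits
console.log(intlPhone.test("44 20 7946 0958"));  // false, leading + is missing
```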


Validation and Formatting (2)

Find Words Near Each Other
\b(?:word1\W+(?:\w+\W+){0,5}?word2|word2\W+(?:\w+\W+){0,5}?word1)\b

Remove Duplicate Lines
^(.*)(?:(?:\r?\n|\r)\1)+$ replaced with $1

Validating URL
^((https?|ftp)://|(www|ftp)\.)[a-z0-9-]+(\.[a-z0-9-]+)+([/?].*)?$

Extracting the Query from a URL
^[^?#]+\?([^#]+)

Validate Windows Paths
^(?:[a-z]:|\\\\[a-z0-9_.$]+\\[a-z0-9_.$]+)\\(?:[^\\/:*?"<>|\r\n]+\\)*[^\\/:*?"<>|\r\n]*$
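
Two of these recipes, sketched in JavaScript. The /gm flags on the duplicate-line pattern and the sample inputs are assumptions made for this illustration; the duplicate-line pattern only collapses adjacent identical lines.

```javascript
// Remove adjacent duplicate lines: /m lets ^ and $ match at line breaks.
const deduped = "a\na\nb\nb\nb\nc".replace(/^(.*)(?:(?:\r?\n|\r)\1)+$/gm, "$1");
console.log(JSON.stringify(deduped)); // "a\nb\nc"

// Extract the query: everything between the first ? and an optional #fragment.
const queryMatch = /^[^?#]+\?([^#]+)/.exec("http://example.com/page?x=1&y=2#top");
console.log(queryMatch[1]); // x=1&y=2

// A URL without a query yields no match at all.
console.log(/^[^?#]+\?([^#]+)/.exec("http://example.com/page#top")); // null
```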
