Regex startup

Regular Expressions for Beginners

Srikanth Modegunta

Introduction

Also referred to as Regex or RegExp

Used to match patterns in text

− Ex: maven and maeven can both be matched with the regex “mae?ven”

Regular expressions are processed by a piece of software called a “regular expression engine”

Most languages support Regex

− Ex: Perl, Java, C# etc.
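
As a quick check of the maven/maeven example, here is a minimal sketch using Python's built-in re module. The choice of Python and the extra test word "maeeven" are assumptions made for illustration; the slides do not prescribe a language.

```python
import re

# 'e?' makes the letter 'e' after 'ma' optional, so both spellings match.
pattern = re.compile(r"mae?ven")

words = ["maven", "maeven", "maeeven"]
# fullmatch() requires the whole word to fit the pattern.
results = {word: bool(pattern.fullmatch(word)) for word in words}
print(results)
```

The third word fails because '?' allows at most one optional 'e'.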

Introduction (Contd..)

Used wherever text processing is required.

Simple XML or HTML processing can use Regex, since it comes down to pattern matching.

− We will see how to match an xml or html tag.

Automation of tasks

− Ex: if a mail subject contains “<operation> <some task name> <command>” then start processing the task.

Text editors adding comments to functions automatically (replacing a pattern with some text)

− Ex: replace

− “sub subroutine(parameters){<statements>}” by

/* this is a sample subroutine*/

sub subroutine(parameters){<statements>}
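
The comment-insertion example can be sketched as a search-and-replace. This is only an illustration in Python's re module; the pattern used to recognise a subroutine is an assumption, not something the slides define.

```python
import re

source = "sub subroutine(parameters){<statements>}"

# Assumed shape of a subroutine: 'sub', a name, '(...)', then '{...}'.
pattern = re.compile(r"sub\s+\w+\([^)]*\)\{.*?\}")

# \g<0> in the replacement stands for the whole matched text,
# so the comment line is placed in front of the original definition.
commented = pattern.sub(r"/* this is a sample subroutine*/\n\g<0>", source)
print(commented)
```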

Meta Characters

The following are the meta characters: \ | ( ) [ { ^ $ * + ? .

Meta Characters (Contd..)

Character Meaning

* 0 or more

+ 1 or more

? 0 or 1 (optional)

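. Any single character except new-line

^ Start of line (start of the string unless multiline mode is on). Inside a set it negates: [^abc] means any character other than 'a', 'b' or 'c'

$ End of line (end of the string unless multiline mode is on)

\A Start of string

\Z End of string (in Perl and Java, also just before a final new-line)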

Meta Characters (Contd..)

Character Meaning

{ } Used when we know how many times the pattern repeats. Ex: a{2,5} matches 'a' repeated a minimum of 2 and a maximum of 5 times.

| Says 'or' in patterns. Ex: cat|dog|mouse

( ) Used to group patterns and capture the matched text

[ ] Exactly one character from the set
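
A few of the meta characters from the table can be tried out as follows. This is a small sketch in Python's re module; the sample strings are made up for illustration.

```python
import re

text = "cat\ndog\nmouse"

# '|' is alternation; with re.MULTILINE, '^' and '$' match at every line.
animals = re.findall(r"^cat$|^dog$", text, flags=re.MULTILINE)

# \A matches only at the very start of the string, even in multiline mode.
first = re.findall(r"\Acat|\Adog", text, flags=re.MULTILINE)

# a{2,5}: between 2 and 5 consecutive 'a' characters.
runs = re.findall(r"a{2,5}", "a aa aaaaaaa")

# A meta character is matched literally when escaped with a backslash.
literal = re.findall(r"\$\d+\.\d+", "price: $4.50")

print(animals, first, runs, literal)
```

The run of seven 'a's is split into a match of five followed by a match of two, because {2,5} never takes more than five.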

Quantifiers

To specify the quantity

− Ex: ear, eaaaar: the quantity of 'a' is 1 and 4 in these two cases.

If a pattern is repeated then we need to use quantifiers to match that repeated pattern.

To match the above case we use the following regex

− ea+r means 'a' can come 1 or more times

Quantifiers (Contd..)

* 0 or more times (hungry matching). Ex: ca* matches c, ca, caa, caaa etc. It matches even if there is no 'a' at all, and otherwise takes as many consecutive 'a's as are present.

+ 1 or more times (hungry matching). Ex: ca+ matches ca, caa, caaa etc.

{n} Exactly n times. Ex: ca{4}r matches caaaar

{m,} At least m times, with no upper limit (hungry matching). Ex: ca{2,}r matches only if 'a' repeats 2 or more times.

{m,n} At least m and at most n times (hungry matching). Ex: ca{2,3}r matches only if 'a' repeats 2 or 3 times.

Hungry matching (also called greedy matching) means the pattern matches the maximum possible text. Ex: for ca{0,4} and the text “caaaa”, the whole text matches, i.e. all 4 'a's are matched.
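
The hungry (greedy) behaviour can be observed directly. A minimal sketch in Python's re module follows; the input strings are illustrative assumptions.

```python
import re

# Greedy quantifiers take as many repetitions as the text allows.
star = re.match(r"ca*", "caaaa").group()

# '*' also accepts zero repetitions, so a lone 'c' still matches.
star_none = re.match(r"ca*", "cxyz").group()

# An upper bound caps the match even when more 'a's are available.
bounded = re.match(r"ca{0,4}", "caaaaaa").group()

# {2,} means two or more; {4} means exactly four.
at_least_two = [bool(re.fullmatch(r"ca{2,}r", w)) for w in ("car", "caar", "caaaaar")]
exactly_four = bool(re.fullmatch(r"ca{4}r", "caaaar"))

print(star, star_none, bounded, at_least_two, exactly_four)
```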

Quantifiers (Contd..)

*? Lazy matching, i.e. it matches 0 or more times but stops at the shortest match. Ex: if the text is “caaaaaa” then “ca*?” will match only 'c'.

+? Lazy matching, i.e. it matches 1 or more times but stops at the shortest match. Ex: if the text is “caaaaaa” then “ca+?” will match only 'ca'.

?? Lazy matching, i.e. it matches 0 or 1 times but prefers 0. Ex: if the text is “ca” then “ca??” will match only 'c'.

{min,}? {n}? {min,max}? Lazy versions of the counted quantifiers ({n}? behaves the same as {n})

Lazy matching means the pattern matches the minimum possible text. Ex: for ca{0,4}? and the text “caaaa”, only “c” is matched.
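
The lazy examples can be checked with a short sketch in Python's re module. The HTML-like string at the end is an added illustration, not from the slides; it shows that a lazy quantifier still expands as far as the rest of the pattern requires.

```python
import re

text = "caaaaaa"
lazy_star = re.match(r"ca*?", text).group()
lazy_plus = re.match(r"ca+?", text).group()
lazy_optional = re.match(r"ca??", "ca").group()
lazy_range = re.match(r"ca{0,4}?", "caaaa").group()

# Greedy '.*' runs to the last '</b>'; lazy '.*?' stops at the first one.
html = "<b>one</b><b>two</b>"
greedy = re.findall(r"<b>.*</b>", html)
lazy = re.findall(r"<b>.*?</b>", html)

print(lazy_star, lazy_plus, lazy_optional, lazy_range, greedy, lazy)
```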

Character Sets

Matches one character among the set of characters

[abcd] is the same as [a-d]

[a-di-l] is the same as [abcdijkl]

[^abcd] matches any character other than a, b, c, d

Quantifiers can be applied to character sets

− [a-z]+ matches the string 'hello' in 'hello1234E'
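
The character-set rules can be verified with a small sketch in Python's re module; the alphabet string used for the comparison is an assumption for illustration.

```python
import re

# A quantified set: the first run of lowercase letters.
first_word = re.search(r"[a-z]+", "hello1234E").group()

# Two ranges in one set are equivalent to listing every character.
sample = "abcdefghijklmn"
ranges = re.findall(r"[a-di-l]", sample)
listed = re.findall(r"[abcdijkl]", sample)

# A leading '^' inside the set negates it.
negated = re.findall(r"[^abcd]", "abcdef")

print(first_word, ranges == listed, negated)
```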

Characters for Matching

Common character class shorthands

[a-zA-Z0-9_] \w

[0-9] \d

[ \t\n\r\f] \s (whitespace)

[^a-zA-Z0-9_] \W

[^0-9] \D

[^ \t\n\r\f] \S

\b Word boundary

\B Anywhere other than a word boundary
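
The shorthands and the word boundary can be tried out as below. This is a sketch in Python's re module with made-up sample text; note that Python's \w and \d also accept non-ASCII letters and digits by default, which the sample avoids.

```python
import re

text = "user_1 paid 42 dollars"
words = re.findall(r"\w+", text)    # letters, digits and '_'
digits = re.findall(r"\d+", text)   # runs of digits
parts = re.split(r"\s+", "a \t b\nc")  # split on any whitespace run

# \b requires a word boundary, \B requires that there is none.
sample = "cat concat cat's"
whole = [m.start() for m in re.finditer(r"\bcat\b", sample)]
inside = [m.start() for m in re.finditer(r"\Bcat\b", sample)]

print(words, digits, parts, whole, inside)
```

Here \bcat\b finds the standalone "cat" and the one in "cat's", while \Bcat\b finds only the "cat" at the end of "concat".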

Simple Matching

Ex: match a mail id such as modegunta.srikanth@gmail.com

− Mail id should not start with a number or a special symbol

− Mail id can start with _

− Mail id can have '.' in the middle

− Should end with @<domain>.com or @<domain>.co.in

Pattern:

− [a-zA-Z_][a-zA-Z_\.]+@\w+\.(com|co\.in)

− Meta characters must be escaped in the pattern to match them as normal characters
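
The mail-id pattern can be wrapped in a small check. This is a sketch in Python's re module; the helper name is_valid_mail_id and the extra test addresses are hypothetical. The pattern is deliberately simple (for instance, it does not allow digits before the '@') and is not a complete e-mail validator.

```python
import re

# The pattern from the slide, unchanged.
MAIL_ID = re.compile(r"[a-zA-Z_][a-zA-Z_\.]+@\w+\.(com|co\.in)")


def is_valid_mail_id(candidate: str) -> bool:
    """Return True when the whole string fits the slide's mail-id rules."""
    # fullmatch() is used so that a bad prefix or suffix is not ignored.
    return MAIL_ID.fullmatch(candidate) is not None


print(is_valid_mail_id("modegunta.srikanth@gmail.com"))
print(is_valid_mail_id("1user@gmail.com"))
```

Using search() instead of fullmatch() would accept "1user@gmail.com", because the match could simply start at the 'u'.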

Modifiers

Modifier Meaning

i Case insensitive

g Global matching (in perl)

m Multiline matching (^ and $ match at each line)

s Dot all ('.' also matches \n)

x Extended regex pattern (whitespace and comments are allowed inside the pattern for a pretty format, ref: perl)

e (Used when replacing) evaluate the replacement as an expression (ref: perl)
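
The modifiers above are written in Perl notation. As a rough sketch of the same ideas in Python's re module (an assumed mapping, not an exact equivalence): i, m, s and x correspond to the flags re.IGNORECASE, re.MULTILINE, re.DOTALL and re.VERBOSE; Python has no g or e modifier, so findall() and a replacement function are used here as the nearest counterparts.

```python
import re

text = "Hello\nhello world"

# i: ignore case. findall() returns every match, similar to Perl's g.
case_insensitive = re.findall(r"hello", text, flags=re.IGNORECASE)

# m: '^' matches at the start of every line.
line_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)

# s: '.' matches a new-line only when DOTALL is set.
dot_default = re.search(r"Hello.hello", text)
dot_all = re.search(r"Hello.hello", text, flags=re.DOTALL)

# x: whitespace and comments inside the pattern are ignored.
date = re.compile(r"""
    (\d{4})   # year
    -         # literal dash
    (\d{2})   # month
""", re.VERBOSE)
year_month = date.match("2014-08").groups()

# Closest to Perl's e: compute the replacement with a function.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "3 apples, 10 pears")

print(case_insensitive, line_starts, dot_default, bool(dot_all), year_month, doubled)
```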

Grouping

Groups can be captured using parentheses

− (<pattern>)

− Saves the text identified by the group into a back reference (we will see it later)

Groups capture part of the text in the matching pattern

− Ex: take a simple xml element

− <(\w+)>.*?<\/\1>

− Here \1 is a back reference

Java has a “group(int)” method in the “java.util.regex.Matcher” class.
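
Reading a captured group can be sketched as follows. Python's re module is used here as an assumption; its match.group(n) plays the same role as the Java method mentioned above. The sample strings are made up.

```python
import re

pattern = re.compile(r"<(\w+)>.*?<\/\1>")

match = pattern.search("say <b>hi</b> and <i>bye</i>")
whole = match.group(0)  # group 0 is always the entire match
tag = match.group(1)    # group 1 is the text captured by (\w+)

# The closing tag must repeat group 1, so mismatched tags do not match.
mismatch = pattern.search("<b>hi</i>")

print(whole, tag, mismatch)
```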

Grouping Example

If the command is

− /sbin/service <service-name> <command>

− ([^\s]+)\s+([\w-]+)\s+(start|stop|status)

− Group 0 = matched pattern

− Group 1 = “/sbin/service”

− Group 2 = <service-name>

− Group 3 = <command>

− Command can be start, stop or status
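
The group numbering for the service command can be checked with a short sketch in Python's re module. The service name network-manager and the rejected command reload are made-up inputs. The service-name set is written as [\w-] because \w already includes '_' and Python rejects a range that starts with \w.

```python
import re

COMMAND = re.compile(r"([^\s]+)\s+([\w-]+)\s+(start|stop|status)")

m = COMMAND.fullmatch("/sbin/service network-manager status")
groups = (m.group(0), m.group(1), m.group(2), m.group(3))

# Anything other than start, stop or status is rejected.
rejected = COMMAND.fullmatch("/sbin/service httpd reload")

print(groups, rejected)
```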

Back References

A back reference stores the part of the string matched by the part of the regular expression inside the parentheses

If a string occurs multiple times in the input, we can use a back reference to match the repeats

Ex: an xml/html start-tag should have a matching end-tag

If we capture the start-tag name in the first group, we can put the end-tag name as a back reference (\1)

Back references example

For example take the xml tag

− <root id="E12">test</root>

− <([\w\-\_]+)\s*([^\<\>]+)?>\w+<\/\1> matches the xml element

− Group 0: <root id="E12">test</root>

− Group 1: root

− Group 2: id="E12"

− \1 in the regex pattern is the back reference to group 1.
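
The back reference example can be run as a sketch in Python's re module. The two extra inputs (an element without attributes and one with a mismatched end-tag) are assumptions added for illustration.

```python
import re

# The pattern from the slide, unchanged.
ELEMENT = re.compile(r"<([\w\-\_]+)\s*([^\<\>]+)?>\w+<\/\1>")

m = ELEMENT.fullmatch('<root id="E12">test</root>')
name, attributes = m.group(1), m.group(2)

# Group 2 is optional; it is None when the tag has no attributes.
no_attributes = ELEMENT.fullmatch("<root>test</root>").group(2)

# \1 must repeat the start-tag name, so a different end-tag fails.
mismatched = ELEMENT.fullmatch('<root id="E12">test</node>')

print(name, attributes, no_attributes, mismatched)
```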

Parentheses without capturing groups

If capturing is not required for the parenthesized patterns

− Use ?: inside the group: (?:<pattern>)

− (text1|text2|text3) matches any one of text1, text2 and text3, and captures it as a group

− (?:text1|text2|text3) matches the same, but will not be a group
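
The difference between capturing and non-capturing parentheses can be seen in a short sketch using Python's re module; the "food" strings are made up for illustration.

```python
import re

capturing = re.search(r"(cat|dog|mouse) food", "dog food")
non_capturing = re.search(r"(?:cat|dog|mouse) food", "dog food")

# Both match the same text, but only the first one stores a group.
captured_groups = capturing.groups()
uncaptured_groups = non_capturing.groups()

# findall() returns the groups when there are any, else the whole match.
with_group = re.findall(r"(cat|dog) food", "cat food, dog food")
without_group = re.findall(r"(?:cat|dog) food", "cat food, dog food")

print(captured_groups, uncaptured_groups, with_group, without_group)
```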

Look ahead and Look behind

Positive look-ahead

− \w+(?=:) does not select all words; it selects only words that come just before ':'

Negative look-ahead

− \w+(?!:) selects word characters that are not followed by ':'. Note that for “key:” it still matches “ke”; use \b\w+\b(?!:) to skip the whole word.

In case of look-ahead, once the pattern has matched, the regex engine looks ahead for the filtering pattern.

Positive look-behind

− (?<=a)b selects a 'b' that follows 'a'

Negative look-behind

− (?<!a)b selects a 'b' that doesn't follow 'a'

In case of look-behind, the regex engine looks behind the match position for the filtering pattern.

The filtering pattern itself is not part of the matched text.
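
The four look-around forms can be compared side by side. This is a minimal sketch in Python's re module; the sample strings "key: value" and "ab cb" are assumptions for illustration.

```python
import re

text = "key: value"

# Positive look-ahead: the ':' is required but not included in the match.
before_colon = re.findall(r"\w+(?=:)", text)

# Negative look-ahead alone backtracks to 'ke', since 'ke' is followed by 'y'.
not_before_colon = re.findall(r"\w+(?!:)", text)

# Word boundaries force whole words, so 'key' is skipped completely.
whole_words = re.findall(r"\b\w+\b(?!:)", text)

# Look-behind checks the character just before the match position.
sample = "ab cb"
after_a = [m.start() for m in re.finditer(r"(?<=a)b", sample)]
not_after_a = [m.start() for m in re.finditer(r"(?<!a)b", sample)]

print(before_colon, not_before_colon, whole_words, after_a, not_after_a)
```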

References:

1) http://www.regular-expressions.info/tutorial.html

2) Thinking in Java, 4th Edition, Chapter: Strings, page 392

Thank You
