Regular Expression

Mohsen Mollanoori

What is RegeX?
“A notation to describe regular languages.”
“Not necessarily (and not usually) regular”
“A Powerful String Processing Tool”
“A pattern that can be matched against a string”
“A Language But Not A Language”

What Does RegeX Do?
String Processing
  Matching Strings against a Specific Pattern
  Splitting Strings
  Changing Substrings
  Extracting Substrings

What Programming Languages Support RegeX?
Almost All of Them
  Perl, Java, .Net (C#, VB.Net, …), PHP, Ruby, JavaScript, …
And even Many IDEs, Editors & Utilities
  grep, eclipse, Visual Studio .Net, vim, emacs, …

The Notation

Symbol  Meaning                                      Example
.       Any single char                              /.at/ matches “cat”, “bat”, “pat”, “mat”
*       Zero or more occurrences of preceding char   /a*b/ matches “b”, “aaaaab”
+       One or more occurrences of preceding char    /a+b/ matches “ab”, “aaaaab”
?       Zero or one occurrence of preceding char     /a?b/ matches “ab” and “b”

Example 1
String: “Term”, “Term1”, “Term2”
Pattern: /Term./
Result: “Term1”, “Term2”

Example 2
String: “Term”, “Term1”, “Term2”
Pattern: /Term.?/
Result: “Term”, “Term1”, “Term2”

Example 3
String: “Term”, “Term1”, “Term2”
Pattern: /Term1?/
Result: “Term”, “Term1”

Example 4
String: “Term1”, “Term11”, “Term2”, “Term”
Pattern: /Term1+/
Result: “Term1”, “Term11”

Example 5
String: “Term1”, “Term11”, “Term2”, “Term”
Pattern: /Term1*/
Result: “Term1”, “Term11”, “Term”
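The results in Examples 1 to 5 count a string only when the whole string fits the pattern. A minimal Java sketch of that check follows; the class name NotationDemo and the helper fullMatches are hypothetical names chosen for illustration, and the sketch relies on String.matches, which requires the entire string to match.

```java
import java.util.ArrayList;
import java.util.List;

class NotationDemo {
    // Hypothetical helper: returns the candidates that the pattern matches in full.
    static List<String> fullMatches(String regex, String... candidates) {
        List<String> result = new ArrayList<>();
        for (String s : candidates) {
            if (s.matches(regex)) {   // String.matches requires the whole string to match
                result.add(s);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(fullMatches("Term.", "Term", "Term1", "Term2"));            // Example 1
        System.out.println(fullMatches("Term.?", "Term", "Term1", "Term2"));           // Example 2
        System.out.println(fullMatches("Term1?", "Term", "Term1", "Term2"));           // Example 3
        System.out.println(fullMatches("Term1+", "Term1", "Term11", "Term2", "Term")); // Example 4
        System.out.println(fullMatches("Term1*", "Term1", "Term11", "Term2", "Term")); // Example 5
    }
}
```

With a partial search (Matcher.find) instead of a full match, “Term2” would also be reported for /Term1?/, because it contains “Term”.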

Character Classes

Example          Meaning
[pnm]            “p” or “n” or “m”
[Qq]             “Q” or “q”
[A-Z]            Upper Case Letters
[A-Za-z]         Letters
[^A-Z]           Every char EXCEPT A-Z
[A-Z&&[^C-E]]    A-Z but NOT C-E (intersection, Java syntax)

Example 6
String: “CAT”, “Cat”, “cat”
Pattern: /[Cc]at/
Result: “Cat”, “cat”

Example 7
String: “CAT”, “Cat”, “cat”
Pattern: /[Cc][Aa][Tt]/
Result: “CAT”, “Cat”, “cat”

Example 8
String: “Term”, “Term1”, “Term2”
Pattern: /[A-Za-z]+/
Result: “Term”

Example 9
String: “Term”, “Term1”, “Term222”
Pattern: /.*[0-9]+/
Result: “Term1”, “Term222”

Example 10
String: “Term”, “Term1”, “Term222”
Pattern: /[^0-9]+/
Result: “Term”
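The character-class examples can be checked the same way. The sketch below is illustrative only; the class name CharClassDemo and the helper wholeMatch are hypothetical, and it uses the static Pattern.matches(regex, input) call, which tests the whole input.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // Hypothetical helper: true when the whole input matches the pattern.
    static boolean wholeMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(wholeMatch("[Cc]at", "Cat"));        // true
        System.out.println(wholeMatch("[Cc]at", "CAT"));        // false, "AT" is upper case
        System.out.println(wholeMatch("[A-Z&&[^C-E]]", "B"));   // true
        System.out.println(wholeMatch("[A-Z&&[^C-E]]", "D"));   // false, D is excluded
        System.out.println(wholeMatch("[^0-9]+", "Term"));      // true
        System.out.println(wholeMatch("[^0-9]+", "Term1"));     // false, contains a digit
    }
}
```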

Repeating Chars (Intervals)

Example   Description
a{3}      Matches “aaa”
a{3,5}    Matches “aaa”, “aaaa”, “aaaaa”
a{3,}     Matches “aaa”, “aaaa”, …

Predefined Character Classes

Class   Description
\d      Digit
\D      Non-digit
\s      Whitespace
\S      Non-whitespace
\w      Word character (letter, digit, or underscore)
\W      Non-word character
\b      Word boundary
\B      Non-word boundary
\A      The beginning of the input
\z      The end of the input

Example 11
String: “This is some text !”
Pattern: /is/
Result: two matches, the “is” inside “This” and the word “is”

Example 12
String: “This is some text !”
Pattern: /\bis\b/
Result: one match, the standalone word “is”
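The difference between Examples 11 and 12 can be seen by counting matches. This is a small sketch, not part of the original slides; BoundaryDemo and countMatches are hypothetical names, and the count comes from repeated calls to Matcher.find.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {
    // Hypothetical helper: counts non-overlapping matches of regex in input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "This is some text !";
        System.out.println(countMatches("is", text));        // 2: inside "This" and the word "is"
        System.out.println(countMatches("\\bis\\b", text));  // 1: only the standalone word
    }
}
```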

Example 13: Variable Names
Pattern: /[A-Za-z]\w{0,15}/

Groups
email addresses:
/[A-Za-z0-9_]+@.+\.\w+/
/([A-Za-z0-9_]+)@(.+)\.(\w+)/
$1  Username
$2  Server
$3  Domain
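In Java the same groups are read with Matcher.group(n) rather than $1, $2, $3. The sketch below assumes the grouped pattern shown above; the class GroupDemo, the helper splitEmail, and the sample address are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    private static final Pattern EMAIL =
            Pattern.compile("([A-Za-z0-9_]+)@(.+)\\.(\\w+)");

    // Hypothetical helper: returns {username, server, domain}, or null if there is no match.
    static String[] splitEmail(String email) {
        Matcher m = EMAIL.matcher(email);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] parts = splitEmail("user_1@mail.example.com");  // made-up address
        System.out.println("User = " + parts[0]
                + ", Server = " + parts[1]
                + ", Domain = " + parts[2]);
    }
}
```

Because (.+) is greedy, the server group takes everything up to the last dot, so the domain group holds only the final label.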

RegeX & Perl
open(IN, "File.txt");          # open file
while ($line = <IN>) {         # read line by line
    # groups: $1 = username, $2 = server, $3 = domain
    if ($line =~ /([A-Za-z0-9_]+)@(.+)\.(\w+)/) {
        print 'User = ', $1, ", Server = ", $2, "\n";
    }
}
close(IN);

RegeX & Ruby

open('in.txt', 'r').readlines.each do |line|

puts line if line =~ /^([a-z0-9_]+)@(.+)\.(.+)$/i

end

RegeX & Java
java.util.regex.Pattern
java.util.regex.Matcher
java.util.Scanner
java.lang.String
  replaceAll(regex, replacement)
  replaceFirst(regex, replacement)
  matches(regex)
  split(regex)

Example 16
String email = readEmailFromSomewhere();
if (email.matches("([A-Za-z0-9_]+)@(.+)\\.(\\w+)")) {
    System.out.println("valid email");
} else {
    System.out.println("invalid email");
}

Example 17

String str = "098 123-456-789";

String[] nums = str.split("[\\s-]");

for (String num : nums) {

System.out.println(num);

}

Example 18
// Remove Tags from HTML
String html = "<html><head><title>This is a title.</title></head>"
    + "<body>This is <b>body</b> of a <i>HTML</i> file"
    + "!</body></html>";
String text = html.replaceAll("<[^>]+>", " ");
String normalizedText = text.replaceAll("\\s+", " ");
System.out.println(normalizedText);

Example 19
// hyperlink URLs
String html = "<html>Please Visit http://myhomepage.com</html>";
html = html.replaceAll("https?://([-.A-Za-z]+)", "<a href='$0'>$1</a>");
System.out.println(html);

Example 20: Convert MixedCase to underlined_format

String MixedCase = "ThisIsSomeTextInMixedCaseFormat";

String temp = MixedCase.replaceAll("([a-z])([A-Z])", "$1_$2");

String underlined_format = temp.toLowerCase();

System.out.println(underlined_format);

// result: this_is_some_text_in_mixed_case_format

Convert underlined_format to MixedCase

?
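One possible answer to the question above, offered as a sketch rather than the author's solution: find each letter that starts the string or follows an underscore and replace it with its upper-case form. The class CaseDemo and the method toMixedCase are hypothetical names; the loop uses Matcher.appendReplacement and Matcher.appendTail from java.util.regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaseDemo {
    // Hypothetical helper: converts underlined_format to MixedCase.
    static String toMixedCase(String underlined) {
        // A lower-case letter at the start of the input or right after an underscore.
        Matcher m = Pattern.compile("(?:^|_)([a-z])").matcher(underlined);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Drops the underscore (if any) and upper-cases the captured letter.
            m.appendReplacement(sb, m.group(1).toUpperCase());
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toMixedCase("this_is_some_text_in_mixed_case_format"));
        // prints ThisIsSomeTextInMixedCaseFormat
    }
}
```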

Example 21: The Pipe Sign
Find Strings of 0s & 1s that have an even number of 1s or an even number of 0s

str = '110100101'

puts str =~ /^((1*01*0)*1*|(0*10*1)*0*)$/ ? 'Yes' : 'No'

Example 22: Finding Unintentionally Repeated Words

text = 'hello, this is some some text!'

?

Back References
\i refers to the i-th matched group
Example: /(.)\1/ matches “aa”, “bb”, “11”, “##”

Example 22: Finding Unintentionally Repeated Words

text = 'hello, this is some some text!'

if text =~ /(\b\w+\b)\W+\1/

puts $1 + " is repeated more than once"

end

# some is repeated more than once
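The same back-reference idea carries over to Java, where \1 is written "\\1" inside a string literal. This is a sketch under stated assumptions: RepeatDemo and firstRepeatedWord are hypothetical names, and the pattern is a slight variation of the one above, with a trailing \b so that “is island” is not reported as a repeat.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatDemo {
    // Hypothetical helper: returns the first word that is immediately repeated, or null.
    static String firstRepeatedWord(String text) {
        // \1 must repeat exactly what group 1 captured; \b keeps it a whole word.
        Matcher m = Pattern.compile("\\b(\\w+)\\W+\\1\\b").matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String word = firstRepeatedWord("hello, this is some some text!");
        System.out.println(word + " is repeated more than once");
    }
}
```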

You Don't Even Need to Write Code
An Editor that supports RegeX
eclipse find/replace dialog box
Microsoft VS.NET Quick Replace
Use Regular Expression

Extracting Timestamps From a log file
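The same extraction can also be done in code. The sketch below is only an illustration: the log lines and the HH:MM:SS timestamp format are assumptions, since the slides do not show the actual file, and TimestampDemo and timestamps are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TimestampDemo {
    // Hypothetical helper: collects every HH:MM:SS-shaped token found in the log text.
    static List<String> timestamps(String log) {
        Matcher m = Pattern.compile("\\b\\d{2}:\\d{2}:\\d{2}\\b").matcher(log);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Made-up log content, assumed format.
        String log = "10:15:02 INFO started\n10:15:07 WARN slow response\n";
        System.out.println(timestamps(log));
    }
}
```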

Some Rewriting System!

rewrite(input)
    temp = input
    do
        before = temp
        temp = rewrite temp using rule1
        temp = rewrite temp using rule2
        ...
        after = temp
    while (before != after)
    return temp

XML

<Students >

<Student faculty="Computer Engineering" student-id="8017024">

<Name first="Mohsen" last="Mollanoori"/>

<Terms >

<Term num="1">

<Lesson name="Statistics" mark="10"/>

<Lesson name="Math" mark="10"/>

</Term>

</Terms>

</Student>

</Students>

MML

@Students

{

@Student(faculty="Computer Engineering" student-id="8017024")

{

@Name(first="Mohsen" last="Mollanoori");

@Terms

{

@Term(num="1")

{

@Lesson(name="Statistics" mark="10");

@Lesson(name="Math1" mark="10");

}

}

}

}

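Example 23: MML 2 XML
do {
    before = mml;
    // rule 1: @Name(attrs); becomes <Name attrs/>
    mml = mml.replaceAll(
        "@([A-Za-z]+)(\\(([^)]*)\\))?;",
        "<$1 $3/>");
    // rule 2: @Name(attrs){body} becomes <Name attrs>body</Name>, innermost blocks first
    mml = mml.replaceAll(
        "@([A-Za-z]+)(\\(([^)]*)\\))?\\{([^\\{\\}]*)\\}",
        "<$1 $3>$4</$1>");
    after = mml;
} while (!before.equals(after));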
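For reference, the loop above can be wrapped into a small runnable program. This is a sketch, not the author's full code: the class MmlDemo, the method toXml, and the one-line MML input are hypothetical, while the two rewrite rules are the ones shown in Example 23.

```java
class MmlDemo {
    // Hypothetical wrapper: applies the two rewrite rules until nothing changes.
    static String toXml(String mml) {
        String before;
        do {
            before = mml;
            // Rule 1: @Name(attrs); becomes <Name attrs/>
            mml = mml.replaceAll(
                    "@([A-Za-z]+)(\\(([^)]*)\\))?;",
                    "<$1 $3/>");
            // Rule 2: @Name(attrs){body} becomes <Name attrs>body</Name>.
            // The body may not contain braces, so innermost blocks are rewritten first.
            mml = mml.replaceAll(
                    "@([A-Za-z]+)(\\(([^)]*)\\))?\\{([^\\{\\}]*)\\}",
                    "<$1 $3>$4</$1>");
        } while (!before.equals(mml));
        return mml;
    }

    public static void main(String[] args) {
        // Made-up, whitespace-free MML input.
        String mml = "@Students{@Student(id=\"1\"){@Name(first=\"Mohsen\");}}";
        System.out.println(toXml(mml));
    }
}
```

When an element has no attribute list, group 3 does not take part in the match and Java substitutes an empty string for $3, which is why such elements come out with a trailing space, as in `<Students >`.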

Example 24: Remove Text from XML (Keep Tags Only)
Is this Correct?

String xml = "<b><em>Text Here</em></b>";

xml = xml.replaceAll(">[^<]*<", "");

Matches: “><”, “>Text Here<”, “><” (the ‘>’ and ‘<’ delimiters are consumed along with the text)

Result: “<bem/em/b>”

Look Ahead & Look Behind

String xml = "<b><em>Text Here</em></b>";

xml = xml.replaceAll("(?<=>)[^<]*(?=<)", "");

Look Behind using ?<= to see a ‘>’
Look Ahead using ?= to see a ‘<’
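A side-by-side sketch of the two patterns makes the difference concrete. The class LookaroundDemo and its two method names are hypothetical; both methods use only String.replaceAll.

```java
class LookaroundDemo {
    // Consuming version: the '>' and '<' are part of each match and are removed too.
    static String stripTextConsuming(String xml) {
        return xml.replaceAll(">[^<]*<", "");
    }

    // Lookaround version: '>' and '<' are only checked, not consumed, so the tags survive.
    static String stripText(String xml) {
        return xml.replaceAll("(?<=>)[^<]*(?=<)", "");
    }

    public static void main(String[] args) {
        String xml = "<b><em>Text Here</em></b>";
        System.out.println(stripTextConsuming(xml));  // <bem/em/b>
        System.out.println(stripText(xml));           // <b><em></em></b>
    }
}
```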

Example 25: Over Matching
xml = "<a> aaa </a><b> bbb </b>";

xml = xml.replaceFirst(">.*<", "");

Match: “> aaa </a><b> bbb <”

Result: xml = "<a/b>";

Greedy & Non Greedy

Greedy    Non Greedy
*         *?
+         +?
?         ??
{a,b}     {a,b}?

Example 26: Solution to Over Matching
xml = "<a> aaa </a><b> bbb </b>";

xml = xml.replaceFirst(">.*?<", "");

Match: “> aaa <”

Result: xml = "<a/a><b> bbb </b>";

Example 27
String xml = "aabb";
xml = xml.replaceAll(".{2,3}", "-");
System.out.println(xml); // result: -b

String xml = "aabb";
xml = xml.replaceAll(".{2,3}?", "-");
System.out.println(xml); // result: --

Example 28
String xml = "aabb";
xml = xml.replaceAll(".?", "-");
System.out.println(xml); // result: -----

String xml = "aabb";
xml = xml.replaceAll(".??", "-");
System.out.println(xml); // result: -a-a-b-b-

Further Reading & Works
“Teach Yourself Regular Expressions in 10 Minutes”, Sams Publishing, February 28, 2004, ISBN: 0-672-32566-7
“Mastering Regular Expressions, 3rd Edition”, by Jeffrey E. F. Friedl, O'Reilly, August 2006, ISBN: 0-596-52812-4
Java Regular Expression Documents
Practice, Practice, Practice

Thanks