Regular Expression Mohsen Mollanoori


What is RegeX?
“A notation to describe regular languages.”
“Not necessarily (and not usually) regular”
“A Powerful String Processing Tool”
“A pattern that can be matched against a string”
“A Language But Not A Language”

What Does RegeX Do?
String Processing
  Matching Strings against a Specific Pattern
  Splitting Strings
  Changing Substrings
  Extracting Substrings
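
The four operations listed above can be sketched in Java, one of the languages these slides use later. This is only an illustration: the class name, the sample strings, and the digit pattern are assumptions, not part of the original slides.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StringProcessingDemo {
    // Matching: does the whole string fit the pattern?
    static boolean isNumber(String s) {
        return s.matches("[0-9]+");
    }

    // Splitting: cut the string at every comma.
    static String[] splitOnComma(String s) {
        return s.split(",");
    }

    // Changing: replace every run of digits with '#'.
    static String maskDigits(String s) {
        return s.replaceAll("[0-9]+", "#");
    }

    // Extracting: return the first run of digits, or null if there is none.
    static String firstNumber(String s) {
        Matcher m = Pattern.compile("[0-9]+").matcher(s);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(isNumber("2024"));
        System.out.println(Arrays.toString(splitOnComma("a,b,c")));
        System.out.println(maskDigits("room 12, floor 3"));
        System.out.println(firstNumber("room 12, floor 3"));
    }
}
```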

What Programming Languages Support RegeX?
Almost All of Them
  Perl
  Java
  .NET (C#, VB.NET, …)
  PHP
  Ruby
  JavaScript
  …
And Even Many IDEs, Editors & Utilities
  grep
  Eclipse
  Visual Studio .NET
  vim
  emacs
  …

The Notation
Symbol  Meaning                                      Example
.       Any single char                              /.at/ matches “cat”, “bat”, “pat”, “mat”
*       Zero or more occurrences of preceding char   /a*b/ matches “b”, “aaaaab”
+       One or more occurrences of preceding char    /a+b/ matches “ab”, “aaaaab”
?       Zero or one occurrence of preceding char     /a?b/ matches “ab” and “b”
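
A small Java sketch of the four symbols, assuming whole-string matching via String.matches. Java patterns are written as plain strings without the /…/ delimiters. The class and helper names are illustrative only.

```java
class NotationDemo {
    // True when the pattern matches the entire input (String.matches semantics).
    static boolean full(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(full(".at", "cat"));   // true:  '.' stands for any single char
        System.out.println(full("a*b", "b"));     // true:  '*' allows zero 'a's
        System.out.println(full("a+b", "b"));     // false: '+' needs at least one 'a'
        System.out.println(full("a?b", "aab"));   // false: '?' allows at most one 'a'
    }
}
```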

Example 1
String: “Term”, “Term1”, “Term2”
Pattern: /Term./
Result: “Term1”, “Term2”
(In Examples 1 to 10, a string counts as a result only when the pattern matches the whole string.)

Example 2
String: “Term”, “Term1”, “Term2”
Pattern: /Term.?/
Result: “Term”, “Term1”, “Term2”

Example 3
String: “Term”, “Term1”, “Term2”
Pattern: /Term1?/
Result: “Term”, “Term1”

Example 4
String: “Term1”, “Term11”, “Term2”, “Term”
Pattern: /Term1+/
Result: “Term1”, “Term11”

Example 5
String: “Term1”, “Term11”, “Term2”, “Term”
Pattern: /Term1*/
Result: “Term1”, “Term11”, “Term”

Character Classes
Example         Meaning
[pnm]           “p” or “n” or “m”
[Qq]            “Q” or “q”
[A-Z]           Upper case letters
[A-Za-z]        Letters
[^A-Z]          Every char EXCEPT A-Z
[A-Z&&[^C-E]]   A-Z but NOT C-E (intersection syntax, as in Java; not every engine supports it)
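
The character classes above can be tried out with a short Java sketch. The filter helper and the candidate strings are assumptions made for illustration. Each candidate is a single character, so whole-string matching tests exactly one class.

```java
import java.util.ArrayList;
import java.util.List;

class CharClassDemo {
    // Keep only the candidates that the pattern matches in full.
    static List<String> filter(String regex, String... candidates) {
        List<String> kept = new ArrayList<>();
        for (String c : candidates) {
            if (c.matches(regex)) {
                kept.add(c);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(filter("[pnm]", "p", "q", "m"));          // [p, m]
        System.out.println(filter("[^A-Z]", "A", "a", "7"));         // [a, 7]
        System.out.println(filter("[A-Z&&[^C-E]]", "B", "D", "F"));  // [B, F]
    }
}
```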

Example 6
String: “CAT”, “Cat”, “cat”
Pattern: /[Cc]at/
Result: “Cat”, “cat”

Example 7
String: “CAT”, “Cat”, “cat”
Pattern: /[Cc][Aa][Tt]/
Result: “CAT”, “Cat”, “cat”

Example 8
String: “Term”, “Term1”, “Term2”
Pattern: /[A-Za-z]+/
Result: “Term”

Example 9
String: “Term”, “Term1”, “Term222”
Pattern: /.*[0-9]+/
Result: “Term1”, “Term222”

Example 10
String: “Term”, “Term1”, “Term222”
Pattern: /[^0-9]+/
Result: “Term”

Repeating Chars (Intervals)
Example   Description
a{3}      Matches “aaa”
a{3,5}    Matches “aaa”, “aaaa”, “aaaaa”
a{3,}     Matches “aaa”, “aaaa”, …

Predefined Character Classes
Class   Description
\d      Digit
\D      Non-digit
\s      Whitespace
\S      Non-whitespace
\w      Word char (letter, digit, or underscore)
\W      Non-word char
\b      Word boundary
\B      Non-word boundary
\A      The beginning of the input
\z      The end of the input

Example 11
String: “This is some text !”
Pattern: /is/
Result: two matches, the “is” inside “This” and the word “is”

Example 12
String: “This is some text !”
Pattern: /\bis\b/
Result: one match, the word “is” only
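
Examples 11 and 12 can be checked with a short Java sketch that lists where each pattern is found in the text. The class and method names are assumptions. Only Pattern, Matcher.find and Matcher.start from java.util.regex are used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // Collect the start index of every place the pattern is found in the text.
    static List<Integer> findStarts(String regex, String text) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        String text = "This is some text !";
        System.out.println(findStarts("is", text));        // [2, 5]
        System.out.println(findStarts("\\bis\\b", text));  // [5]
    }
}
```

Index 2 is the “is” inside “This”. The word boundary \b rules it out because the preceding “h” is also a word character.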

Example 13
Variable Names
Pattern: /[A-Za-z]\w{0,15}/
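
A minimal Java sketch of the variable-name pattern, assuming the whole candidate must match. The pattern allows one letter followed by up to 15 word characters, so names are 1 to 16 characters long. The class name and sample names are illustrative.

```java
class VariableNameDemo {
    // Pattern from Example 13; backslashes are doubled inside a Java string.
    static final String NAME = "[A-Za-z]\\w{0,15}";

    static boolean isValidName(String s) {
        return s.matches(NAME);
    }

    public static void main(String[] args) {
        System.out.println(isValidName("counter_1"));          // true
        System.out.println(isValidName("1counter"));           // false: starts with a digit
        System.out.println(isValidName("abcdefghijklmnopq"));  // false: 17 characters
    }
}
```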

Groups
Email addresses:
/[A-Za-z0-9_]+@.+\.\w+/
/([A-Za-z0-9_]+)@(.+)\.(\w+)/
$1  Username
$2  Server
$3  Domain
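
The grouped pattern can be exercised in Java, where groups are read with Matcher.group(n) instead of $1, $2, $3. This is a sketch only: the class name and the sample address are assumptions.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmailGroupsDemo {
    // The grouped pattern from the slide, with backslashes doubled for Java.
    static final Pattern EMAIL = Pattern.compile("([A-Za-z0-9_]+)@(.+)\\.(\\w+)");

    // Returns {username, server, domain}, or null when the input does not match.
    static String[] parts(String email) {
        Matcher m = EMAIL.matcher(email);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        // Prints [user_1, mail.example, com]
        System.out.println(Arrays.toString(parts("user_1@mail.example.com")));
    }
}
```

Because (.+) is greedy, the server group runs up to the last dot, so only the final label ends up in the domain group.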

RegeX & Perl
open(IN, "File.txt") or die "Cannot open File.txt: $!";  # open file
while ($line = <IN>) {  # read line by line
    if ($line =~ /([A-Za-z0-9_]+)@(.+)\.(\w+)/) {
        print 'User = ', $1, ", Server = ", $2, "\n";
    }
}
close(IN);


RegeX & Ruby

open('in.txt', 'r').readlines.each do |line|

puts line if line =~ /^([a-z0-9_]+)@(.+)\.(.+)$/i

end

RegeX & Java
java.util.regex.Pattern
java.util.regex.Matcher
java.util.Scanner
java.lang.String
  replaceAll(regex, replacement)
  replaceFirst(regex, replacement)
  matches(regex)
  split(regex)
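
The String methods are shown in the examples that follow. Here is a brief sketch of how the Pattern, Matcher and Scanner classes listed above are typically used. The class name, helper names and sample inputs are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class JavaRegexApiDemo {
    // A Pattern is compiled once and can be reused for many inputs.
    private static final Pattern WORD = Pattern.compile("[A-Za-z]+");

    // Pattern + Matcher: iterate over every match in the text.
    static List<String> words(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    // Scanner: use a regex as the token delimiter.
    static List<String> tokens(String text, String delimiterRegex) {
        List<String> out = new ArrayList<>();
        try (Scanner sc = new Scanner(text).useDelimiter(delimiterRegex)) {
            while (sc.hasNext()) {
                out.add(sc.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("one, two; three"));   // [one, two, three]
        System.out.println(tokens("a1b22c", "[0-9]+")); // [a, b, c]
    }
}
```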

Example 16
String email = readEmailFromSomewhere();
if (email.matches("([A-Za-z0-9_]+)@(.+)\\.(\\w+)")) {
    System.out.println("valid email");
} else {
    System.out.println("invalid email");
}


Example 17

String str = "098 123-456-789";

String[] nums = str.split("[\\s-]");

for (String num : nums) {

System.out.println(num);

}

Example 18
// Remove tags from HTML
String html = "<html><head><title>This is a title.</title></head>"
        + "<body>This is <b>body</b> of a <i>HTML</i> file"
        + "!</body></html>";
String text = html.replaceAll("<[^>]+>", " ");
String normalizedText = text.replaceAll("\\s+", " ");
System.out.println(normalizedText);

Example 19
// Hyperlink URLs
String html = "<html>Please Visit http://myhomepage.com</html>";
html = html.replaceAll("https?://([-.A-Za-z]+)", "<a href='$0'>$1</a>");
System.out.println(html);

Example 20
Convert MixedCase to underlined_format

String MixedCase = "ThisIsSomeTextInMixedCaseFormat";

String temp = MixedCase.replaceAll("([a-z])([A-Z])", "$1_$2");

String underlined_format = temp.toLowerCase();

System.out.println(underlined_format);

// result: this_is_some_text_in_mixed_case_format


Convert underlined_format to MixedCase

?
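
One possible answer, sketched in Java; it is not taken from the original slides. It assumes lower-case words separated by single underscores, and it uses a non-capturing group (?:…), which the slides do not cover. The class and method names are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SnakeToMixedCase {
    // A lower-case letter at the start of the string or right after an underscore.
    private static final Pattern WORD_START = Pattern.compile("(?:^|_)([a-z])");

    static String toMixedCase(String underlined) {
        Matcher m = WORD_START.matcher(underlined);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Replace "_x" (or a leading "x") with the upper-case letter alone.
            m.appendReplacement(sb, m.group(1).toUpperCase());
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints ThisIsSomeTextInMixedCaseFormat
        System.out.println(toMixedCase("this_is_some_text_in_mixed_case_format"));
    }
}
```

A plain replaceAll cannot change the case of a captured group, which is why the sketch builds the result with appendReplacement and appendTail.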

Example 21
The Pipe Sign
Find strings of 0s & 1s that have an even number of 1s or an even number of 0s
str = '110100101'
puts(str =~ /^((1*01*0)*1*|(0*10*1)*0*)$/ ? 'Yes' : 'No')

Example 22
Finding Unintentionally Repeated Words

text = 'hello, this is some some text!'

?

Back References
\i refers to the i-th matched group
Example: /(.)\1/ matches “aa”, “bb”, “11”, “##”
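
A minimal Java sketch of the back reference above. The backslash is doubled inside the Java string. The class name, method name and test words are assumptions.

```java
import java.util.regex.Pattern;

class BackReferenceDemo {
    // (.) captures one character; \1 requires the same character again.
    private static final Pattern DOUBLED = Pattern.compile("(.)\\1");

    // True when the text contains the same character twice in a row.
    static boolean hasDoubledChar(String text) {
        return DOUBLED.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(hasDoubledChar("aa"));    // true
        System.out.println(hasDoubledChar("ab"));    // false
        System.out.println(hasDoubledChar("book"));  // true, because of "oo"
    }
}
```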

Example 22
Finding Unintentionally Repeated Words
text = 'hello, this is some some text!'
if text =~ /\b(\w+)\W+\1\b/
  puts $1 + " is repeated more than once"
end
# some is repeated more than once

You Don't Even Need to Write Code
Use an editor that supports RegeX
[Screenshot: Eclipse find/replace dialog box]

Microsoft VS.NET Quick Replace
[Screenshot: Quick Replace dialog with the “Use Regular Expression” option, extracting timestamps from a log file]

Some Rewriting System!
rewrite(input)
    temp = input
    do
        before = temp
        temp = rewrite temp using rule1
        temp = rewrite temp using rule2
        ...
        after = temp
    while (before != after)
    return temp
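
The loop above can be sketched in Java with replaceAll as the rewriting step. The two rules here are invented for illustration: they delete empty "()" and "[]" pairs, so a string of balanced brackets shrinks to the empty string. The class and method names are also assumptions.

```java
import java.util.Arrays;
import java.util.List;

class RewriteDemo {
    // Each rule is a {regex, replacement} pair; both rules are illustrative.
    static final List<String[]> RULES = Arrays.asList(
            new String[] { "\\(\\)", "" },   // rule1: drop an empty "()" pair
            new String[] { "\\[\\]", "" });  // rule2: drop an empty "[]" pair

    // Apply every rule in order, and repeat until a full pass changes nothing.
    static String rewrite(String input) {
        String temp = input;
        String before;
        do {
            before = temp;
            for (String[] rule : RULES) {
                temp = temp.replaceAll(rule[0], rule[1]);
            }
        } while (!before.equals(temp));
        return temp;
    }

    public static void main(String[] args) {
        System.out.println(rewrite("([()])[]").isEmpty()); // true: balanced input reduces to ""
        System.out.println(rewrite("([)]"));               // ([)] is left unchanged
    }
}
```

Repeating the pass matters because one pass only removes the innermost pairs; each later pass exposes and removes the next layer.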


XML

<Students >

<Student faculty="Computer Engineering" student-id="8017024">

<Name first="Mohsen" last="Mollanoori"/>

<Terms >

<Term num="1">

<Lesson name="Statistics" mark="10"/>

<Lesson name="Math" mark="10"/>

</Term>

</Terms>

</Student>

</Students>


MML

@Students

{

@Student(faculty="Computer Engineering" student-id="8017024")

{

@Name(first="Mohsen" last="Mollanoori");

@Terms

{

@Term(num="1")

{

@Lesson(name="Statistics" mark="10");

@Lesson(name="Math1" mark="10");

}

}

}

}

Example 23
MML 2 XML
do {
    before = mml;
    mml = mml.replaceAll(
        "@([A-Za-z]+)(\\(([^)]*)\\))?;",
        "<$1 $3/>");
    mml = mml.replaceAll(
        "@([A-Za-z]+)(\\(([^)]*)\\))?\\{([^\\{\\}]*)\\}",
        "<$1 $3>$4</$1>");
    after = mml;
} while (!before.equals(after));

Example 24
Remove Text from XML (Keep Tags Only)
Is this Correct?
String xml = "<b><em>Text Here</em></b>";
xml = xml.replaceAll(">[^<]*<", "");
Match: every “>…<” span, angle brackets included: “><”, “>Text Here<”, “><”
Result: “<bem/em/b>”

Look Ahead & Look Behind
String xml = "<b><em>Text Here</em></b>";
xml = xml.replaceAll("(?<=>)[^<]*(?=<)", "");
Look behind using ?<= to see a ‘>’
Look ahead using ?= to see a ‘<’
Result: “<b><em></em></b>”
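
The difference between consuming the angle brackets and only looking at them can be seen by running both patterns side by side. This is a sketch only; the class and method names are assumptions.

```java
class LookaroundDemo {
    // Consuming version: the '>' and '<' are part of the match and get removed.
    static String stripConsuming(String xml) {
        return xml.replaceAll(">[^<]*<", "");
    }

    // Lookaround version: '>' and '<' are only checked, never consumed.
    static String stripWithLookaround(String xml) {
        return xml.replaceAll("(?<=>)[^<]*(?=<)", "");
    }

    public static void main(String[] args) {
        String xml = "<b><em>Text Here</em></b>";
        System.out.println(stripConsuming(xml));       // <bem/em/b>
        System.out.println(stripWithLookaround(xml));  // <b><em></em></b>
    }
}
```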

Example 25
Over Matching
xml = "<a> aaa </a><b> bbb </b>";
xml = xml.replaceFirst(">.*<", "");
Match: “> aaa </a><b> bbb <”
Result: xml = "<a/b>";

Greedy & Non Greedy
Greedy   Non Greedy
*        *?
+        +?
?        ??
{a,b}    {a,b}?
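
A short Java sketch of the greedy and non-greedy forms in the table, comparing the first match each one finds. The helper name and the sample patterns are assumptions chosen for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDemo {
    // Return the first substring the pattern finds, or null if there is none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String xml = "<a> aaa </a><b> bbb </b>";
        System.out.println(firstMatch("<.*>", xml));        // greedy: the whole string
        System.out.println(firstMatch("<.*?>", xml));       // non-greedy: <a>
        System.out.println(firstMatch("a{2,3}", "aaaa"));   // greedy: aaa
        System.out.println(firstMatch("a{2,3}?", "aaaa"));  // non-greedy: aa
    }
}
```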

Example 26
Solution to Over Matching
xml = "<a> aaa </a><b> bbb </b>";
xml = xml.replaceFirst(">.*?<", "");
Match: “> aaa <”
Result: xml = "<a/a><b> bbb </b>";

Example 27
String xml = "aabb";
xml = xml.replaceAll(".{2,3}", "-");
System.out.println(xml); // result = '-b'

String xml = "aabb";
xml = xml.replaceAll(".{2,3}?", "-");
System.out.println(xml); // result = '--'

Example 28
String xml = "aabb";
xml = xml.replaceAll(".?", "-");
System.out.println(xml); // result: -----

String xml = "aabb";
xml = xml.replaceAll(".??", "-");
System.out.println(xml); // result: -a-a-b-b-

Further Reading & Works
“Teach Yourself Regular Expressions in 10 Minutes”, Sams Publishing, February 28, 2004, ISBN: 0-672-32566-7
“Mastering Regular Expressions, 3rd Edition”, by Jeffrey E. F. Friedl, O'Reilly, August 2006, ISBN: 0-596-52812-4
Java Regular Expression Documents
Practice, Practice, Practice

Thanks