
Chapter 11: Regular Expressions and Matching

• The match operator has the following form.

m/pattern/

• A pattern can be an ordinary string or a generalized string containing metacharacters.

• The binding operator, =~, is used to "bind" the matching operator onto a string.

"yesterday" =~ m/yes/

• Here the pattern is an ordinary three character string.

• The entire expression evaluates to a Boolean value, true (1) in this case since the pattern yes is a substring of "yesterday".

• Since matching expressions result in Boolean values, they are usually used in a conditional.

$str = "yesterday";
if ($str =~ m/yes/) {
    print "The pattern yes was found in $str.\n";
}

• For demonstration, we will usually show only the matching expression.

Example:

$str = "yesterday";
$str =~ m/ester/   # true
$str =~ m/Ester/   # false
$str =~ m/yet/     # false
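The three matches above can be run as a small script. This is a minimal sketch; the %found hash and the !! idiom for normalizing a match result to a plain Boolean are illustrative choices, not part of the slides.

```perl
use strict;
use warnings;

my $str = "yesterday";

# Each match yields true or false; !! forces a plain Boolean value.
my %found = (
    ester => !!($str =~ m/ester/),   # substring is present
    Ester => !!($str =~ m/Ester/),   # matching is case sensitive
    yet   => !!($str =~ m/yet/),     # the letters must be consecutive
);

for my $pattern (sort keys %found) {
    printf "%-5s %s\n", $pattern, $found{$pattern} ? "true" : "false";
}
```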

Some notes:

• The !~ operator is the negated form of the binding operator. It returns true if the match does not find the pattern in the string. We will more often use =~.

if ($response !~ m/yes/) {
    print "yes was not found in your response.\n";
}

• The matching operator can be simplified syntactically. For example, the following two expressions are equivalent.

$str =~ m/yes/
$str =~ /yes/

• The match operator can be bound not only onto string literals and variables, but also onto expressions that evaluate to strings. Because =~ binds more tightly than the concatenation operator, such an expression must be parenthesized.

$str1 = "wilde";
$str2 = "beest";
($str1.$str2) =~ /debe/   # true

Example: A server-side "platform sniff" done by matching against the HTTP_USER_AGENT environment variable.

• This example features the first pattern which is not merely a sequence of characters. The match

$info =~ /(Unix|Linux)/

is true if either Unix or Linux is a substring of whatever is stored in the $info variable. See source file os.cgi.
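A short sketch of binding the match onto an expression, and of the alternation used in the platform sniff. Because =~ has higher precedence than the . operator, the concatenation is parenthesized here. The user-agent text is a made-up stand-in for $ENV{HTTP_USER_AGENT}, and this is not the contents of os.cgi.

```perl
use strict;
use warnings;

my $str1 = "wilde";
my $str2 = "beest";

# Parentheses make the concatenation happen before the match.
my $joined_matches = (($str1 . $str2) =~ /debe/) ? 1 : 0;

# Without them, only $str2 is tested and the (false) result is appended to $str1.
my $unparenthesized = $str1 . $str2 =~ /debe/;

# Hypothetical user-agent text standing in for $ENV{HTTP_USER_AGENT}.
my $info = "Mozilla/5.0 (X11; Linux x86_64)";
my $is_unix_like = ($info =~ /(Unix|Linux)/) ? 1 : 0;

print "joined: $joined_matches, unparenthesized: '$unparenthesized', unix-like: $is_unix_like\n";
```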


• A regular expression is a set of rules which define a generalized string.

• For simplicity we call regular expressions patterns.

• The syntax for a pattern is /pattern/ .

• A pattern is like a double quoted string in that variables are interpolated and escape sequences are interpreted.

• But a pattern is much more powerful than a string and can contain wildcards, character classes, and quantifiers, just to name a few features which make patterns (regular expressions) much more general than ordinary strings.

Metacharacters

• Characters which have special meaning in patterns are called metacharacters.

[ ] ( ) { } | \ + ? . * ^ $

• To be used literally inside a pattern, a metacharacter must be escaped with a backslash.

if ($sentence =~ m/\?/) {
    print "Your sentence seems to be a question.\n";
}

Normal characters

• These include ordinary ASCII characters which are not metacharacters.

• Normal characters include letters, numbers, the underscore, and a few other characters such as @ % & = ; : , which are not reserved metacharacters in patterns. (Because patterns interpolate variables, an @ that is followed by a name must still be escaped.)

• Normal characters need not be escaped when testing for matches.

if ($sentence =~ m/;/) {
    print "Your sentence seems to contain an independent clause.\n";
}

Escaped characters

• Escaping in patterns works just like escaping characters in ordinary strings.

• For example, \* stands for one *, and \( stands for one (.

• The following tests whether $str contains the three character string "(b)".

$str =~ /\(b\)/

Example values for $str which would yield true and false values in the above match:

true: "(b)" , "(a)(b)(c)"
false: "(ab)" , "( b )"
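The escaped-parenthesis test can be checked against the example values as follows. This is a sketch only. The second half uses Perl's built-in quotemeta function, which the slides do not cover, as one way to escape metacharacters when the literal text arrives in a variable.

```perl
use strict;
use warnings;

my @candidates = ("(b)", "(a)(b)(c)", "(ab)", "( b )");

# \( and \) match literal parentheses instead of opening a group.
my @hits = grep { /\(b\)/ } @candidates;

# quotemeta backslash-escapes the metacharacters in an arbitrary string.
my $literal   = quotemeta("(b)");
my @same_hits = grep { /$literal/ } @candidates;

print join(" , ", @hits), "\n";
```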

Escape sequences that stand for one character

• Some escaped characters stand literally for only one character, like escaped metacharacters.

• Some stand for one invisible character, such as a whitespace character. Just like with ordinary strings, \n stands for one newline character, and \t stands for one tab character.

• The following tests whether $str contains two consecutive newline characters.

$str =~ /\n\n/
true: "a\n\nb" , "a\n\n\n\tb"
false: "\na\n" , "a\n \nb"

Escape sequences that stand for a class of characters

• These represent only one character in a pattern, but that one character matches any character in the specified group.

\d   any digit 0 through 9
\D   any character that is not a digit
\w   any alphanumeric character: letter, digit, or underscore (w comes from word)
\W   any character that is not alphanumeric (opposite of \w)
\s   one whitespace character (blank space, tab, or newline)
\S   one non-whitespace character (opposite of \s)

• The following tests whether $str contains a four character sequence that looks like a year in the 1900s.

$str =~ /19\d\d/
true: "1921" , "34192176"
false: "191a" , "34192-76"

• The following tests whether $str contains a non-whitespace character (i.e., it is not the empty string or merely a sequence of whitespace characters).

$str =~ /\S/
true: "x" , "()"
false: "" , " " , "\n"
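The two class-based tests can be wrapped in small helper subroutines and run over the example values. The names has_1900s_year and is_nonblank are hypothetical, chosen only for this sketch.

```perl
use strict;
use warnings;

# Returns 1 when the string contains something shaped like a 1900s year.
sub has_1900s_year { my ($s) = @_; return $s =~ /19\d\d/ ? 1 : 0 }

# Returns 1 when the string has at least one non-whitespace character.
sub is_nonblank { my ($s) = @_; return $s =~ /\S/ ? 1 : 0 }

my @year_results  = map { has_1900s_year($_) } ("1921", "34192176", "191a", "34192-76");
my @blank_results = map { is_nonblank($_) } ("x", "()", "", " ", "\n");

print "years:    @year_results\n";
print "nonblank: @blank_results\n";
```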

Wildcard

• A period . stands for any one character, except a newline.

• The following tests whether $str contains a three character substring that is c and t with anything in between, except a newline.

$str =~ /c.t/
true: "cat" , "arc&tangent"
false: "ct" , "cart" , "arc\ntangent"

Escape sequences that match locations

• These escape sequences do not actually represent a character in a pattern. Rather, they represent locations within the string being matched.

\A   beginning of string
\Z   end of string or before a final newline character
\z   end of string
\b   word boundary
\B   not a word boundary (for example, a location between two \w type characters)

• The following tests whether $str begins with T.

$str =~ /\AT/
true: "Tom" , "The beest"
false: "tom" , "AT&T"

• The following tests whether $str begins with The.

$str =~ /\AThe/
true: "Thelma" , "The beest"
false: "That" , "the beest"

• The following tests whether $str contains the word cat but not as part of any bigger word.

$str =~ /\bcat\b/
true: "cat" , "my cat"
false: "cats" , "concatenate"


Note: When matching locations, the escape sequence does not "use up" a character. That is, an expression such as

$str =~ /ing\z/

only tests for the three character string ing at the end of $str.
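A brief sketch of the location escapes in use. The sample string "matching\n" is an assumption added here to show the difference between \z and \Z, which only appears when the string ends in a newline.

```perl
use strict;
use warnings;

my @starts_with_the = grep { /\AThe/ }   ("Thelma", "The beest", "That", "the beest");
my @whole_word_cat  = grep { /\bcat\b/ } ("cat", "my cat", "cats", "concatenate");

# \z demands the very end of the string; \Z also tolerates one trailing newline.
my $line        = "matching\n";
my $strict_end  = ($line =~ /ing\z/) ? 1 : 0;
my $lenient_end = ($line =~ /ing\Z/) ? 1 : 0;

print "strict: $strict_end, lenient: $lenient_end\n";
```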

Character Classes

• Square brackets [] in a pattern define a class.

• The whole class matches only one character, and only if the character belongs to the class.

• The following tests whether $str contains a three-character string beginning with one of r, b, or c, and followed by at.

$str =~ /[rbc]at/
true: "rat" , "bat" , "cat" , "concatenate" , "battery"
false: "mat" , "at"

• The escape sequences \d, \w, and \s and their opposites can be used inside a class.

• A dash (-) can be used between two characters to denote a range of characters.

• For example, the class [\dA-F] stands for one character that is either a numeric digit or one of the upper case letters A-F. It is equivalent to [0123456789ABCDEF].

• The following tests whether $str contains a two-digit hexadecimal number as formatted in query string encoding.

$str =~ /%[\dA-F][\dA-F]/
true: "%0A" , "data=Hi,%0A%0Dmy name is..."
false: "%0a" , "%3"
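As a quick check of the stated equivalence, the sketch below filters the example values with both the range form and the spelled-out form of the class. The array names are illustrative.

```perl
use strict;
use warnings;

my @samples = ("%0A", "data=Hi,%0A%0Dmy name is...", "%0a", "%3");

# Range form of the class.
my @with_range = grep { /%[\dA-F][\dA-F]/ } @samples;

# Spelled-out form; for these inputs it selects the same strings.
my @spelled_out = grep { /%[0123456789ABCDEF][0123456789ABCDEF]/ } @samples;

print scalar(@with_range), " of ", scalar(@samples), " samples contain an encoded byte\n";
```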

Alternatives

• The | character serves like an or by creating alternatives.

• The following tests whether $str contains any of the three patterns.

$str =~ /cat|dog|ferret/
true: "cat" , "dog" , "ferret" , "my cat" , "cats and dogs" , "doggedly"
false: "hamster" , "dodge the cart"

• The alternatives are tested from left to right.

• The alternatives themselves can be more complicated patterns.

Grouping and Capturing

• Parentheses () are used for grouping in patterns.

• The following tests whether $str contains one of the three alternatives, then a space, then food.

$str =~ /(cat|dog|ferret) food/
true: "cat food" , "dog food" , "ferret food" , "I like cat food and dog food"
false: "cats food" , "rat food" , "dogfood"

• With several alternatives, it is often desirable to capture which of the alternatives caused the successful match. That is, a mere truth value indicating a match doesn't indicate which match actually occurred.

• The special, built-in variables $1, $2, $3, … automatically capture the text matched by each parenthesized group in a successful match.

$str = "Do you have ferret food?";
$str =~ /(cat|dog|ferret) food/

• Here, $1 is assigned the value "ferret" since that alternative provides the match. The pattern has only one group, so $2, $3, … are undefined.

• If more than one match is present, only the left-most match is recorded, since the string is scanned from left to right and matching stops at the first success.

$str = "Do you have dog food or ferret food?";
$str =~ /(cat|dog|ferret) food/

• Here, "dog" is assigned to $1. Nothing is assigned to $2, even though there is a second match in the string, because the pattern contains only one group.

• Multiple groups can populate more of the special variables.

$str = "Purina cat chow";
$str =~ /(cat|dog|ferret) (food|chow)/

• $1 is assigned the value "cat" and $2 is assigned the value "chow". Captured matches are assigned into the special variables starting from the left-most group.

• Groups can be collected into a larger group.

$str = "Purina cat chow";
$str =~ /((cat|dog|ferret) (food|chow))/

• $1 is assigned "cat chow" , $2 is assigned "cat" , and $3 is assigned "chow". Groups are numbered by the position of their opening parenthesis, counting from the left.
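The nested-group example can be sketched as below, copying $1, $2, and $3 into named variables right after the match succeeds. The variable names are illustrative. The last statement relies on a match returning its captures when evaluated in list context.

```perl
use strict;
use warnings;

my $str = "Purina cat chow";
my ($whole, $animal, $product);

if ($str =~ /((cat|dog|ferret) (food|chow))/) {
    # Groups are numbered by their opening parenthesis, left to right.
    ($whole, $animal, $product) = ($1, $2, $3);
}

# In list context a match returns its captures directly.
my ($first_animal) = "Do you have dog food or ferret food?" =~ /(cat|dog|ferret) food/;

print "$whole | $animal | $product | $first_animal\n";
```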

Note: After a successful match, the special capturing variables remain available outside the pattern.

if ($data =~ /(cat|dog|ferret) (food|chow)/) {
    print "The match <b>$1 $2</b> was found.";
}

So if $data is "Purina cat chow is now", then the print statement would generate:

The match <b>cat chow</b> was found.

The variables keep the captured matches until the end of the enclosing block or until their values are replaced by data captured in another successful match.

Other special variables

• There is some degree of "capturing" even when grouping is not used.

$` (prematch - the part before the match)
$& (match - the matched part)
$' (postmatch - the part after the match)

• After this is executed

"I like cats and bats." =~ /[rbc]at/

$& contains "cat"
$` contains "I like "
$' contains "s and bats."

• In general, the original string is equivalent to the concatenation of the three special variables.

$` . $& . $'
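A small sketch that copies the three special variables after the match above and confirms that they concatenate back to the original string. The names $pre, $match, and $post are illustrative.

```perl
use strict;
use warnings;

my $text = "I like cats and bats.";
my ($pre, $match, $post);

if ($text =~ /[rbc]at/) {
    # Prematch, match, and postmatch of the left-most successful match.
    ($pre, $match, $post) = ($`, $&, $');
}

my $rebuilt = $pre . $match . $post;
print "[$pre][$match][$post]\n";
```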

Quantifiers

+       occurrence one or more times (consecutively)
?       occurrence zero or one times (consecutively)
*       occurrence zero or more times (consecutively)
{n}     occurrence exactly n times (consecutively)
{n,}    occurrence at least n times (consecutively)
{n,m}   occurrence at least n and at most m times (consecutively)

• A quantifier is always put after the character (or class of characters, or parenthesized group) to be quantified.

/x+/ -- matches one or more x's in a row
/[aeiou]{3}/ -- matches any three vowels in a row
/c.*t/ -- matches a c followed by a t with 0 or more of any character in between

• The following tests whether $str contains at least one b character in between an a and c.

$str =~ /ab+c/
true: "abc" , "abbc" , "abbbc" , "aabcc"
false: "ac" , "aBc"

• The following tests whether $str contains a sequence of exactly 3 b characters in between an a and c.

$str =~ /ab{3}c/
true: "abbbc" , "aabbbcc"
false: "abbc" , "abbbbc"

• The following tests whether $str contains a sequence of at least 2 b characters in between an a and c.

$str =~ /ab{2,}c/
true: "abbc" , "abbbc" , "aabbbbcc"
false: "abc" , "aBBc"
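To compare the quantifiers side by side, the sketch below runs several of them over one list of inputs. The input list and array names are assumptions made for illustration.

```perl
use strict;
use warnings;

my @inputs = ("ac", "abc", "abbc", "abbbc", "abbbbc");

my @zero_or_one   = grep { /ab?c/ }     @inputs;   # at most one b
my @one_or_more   = grep { /ab+c/ }     @inputs;   # at least one b
my @exactly_three = grep { /ab{3}c/ }   @inputs;   # exactly three b's
my @two_or_more   = grep { /ab{2,}c/ }  @inputs;   # two or more b's
my @two_to_three  = grep { /ab{2,3}c/ } @inputs;   # two or three b's

print "zero or one:   @zero_or_one\n";
print "one or more:   @one_or_more\n";
print "exactly three: @exactly_three\n";
print "two or more:   @two_or_more\n";
print "two to three:  @two_to_three\n";
```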

• It gets interesting when quantifiers are mixed with the special character classes.

• The following tests to see if $str contains an alphanumeric word (chunk of consecutive alphanumeric characters).

$str =~ /\w+/
true: "beest" , "1234" , "R2D2" , "x" , "##xyz##"
false: "####" , "" , " "

• The following tests to see if $str contains one or more consecutive digits (i.e., whether there is an integer inside).

$str =~ /\d+/
true: "1" , "121 Elm. St." , "R2D2" , "##1##" , "3.14"
false: "a" , "####" , "" , " "

• The following tests to see if $str contains a substring that looks like a (possibly negative) integer. That is, does $str contain zero or one - characters, followed by one or more consecutive digits?

$str =~ /-?\d+/
true: "2" , "-2" , "-3.14" , "3-21.7" , "4-x"
false: "xyx" , "x-y"

• The following tests to see if there is at least one whitespace character in $str.

$str =~ /\s+/
true: " " , " " , " xyy" , "The End"
false: "" , "xyz" , "TheEnd"
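To see which substring the integer pattern actually finds, the sketch below wraps it in a capturing group and returns the first match. The helper name first_integer is hypothetical. Note that any string containing a digit matches, including "4-x".

```perl
use strict;
use warnings;

# Returns the first integer-looking substring, or undef when there is none.
sub first_integer {
    my ($s) = @_;
    my ($number) = $s =~ /(-?\d+)/;
    return $number;
}

my %first = map { ($_, first_integer($_)) } ("2", "-3.14", "3-21.7", "4-x", "x-y");

for my $input (sort keys %first) {
    my $shown = defined $first{$input} ? $first{$input} : "(no match)";
    print "$input => $shown\n";
}
```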

• The following matches any two digit hexadecimal number. That is, it matches any occurrence of two consecutive characters from the class [0123456789abcdefABCDEF].

/[\da-fA-F]{2}/

• The quantified pattern is equivalent to the longer pattern /[\da-fA-F][\da-fA-F]/.

• For the next example, suppose we have dates that are roughly formatted, but in the general form

month_name day_number, year

• We wish to create a pattern capable of factoring out inconsistent formatting and capturing the three date parts. For example, it should handle both dates below.

jan 1,2002
MARCH 22, 02


• The following tests whether $date contains (a group of one or more letters, lower or upper-case), followed by one or more spaces, followed by (a group of one or more digits), followed by a comma and then zero or more spaces, followed by (a group of one or more digits).

$date =~ /([a-zA-Z]+)\s+(\d+),\s*(\d+)/

• Since there are three groups, the month is captured into $1, the day into $2, and the year into $3.
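The date pattern can be applied to both sample dates as sketched below. The captures are copied into a hash as soon as the match succeeds. The @parsed array and its keys are illustrative names.

```perl
use strict;
use warnings;

my @parsed;
for my $date ("jan 1,2002", "MARCH 22, 02") {
    if ($date =~ /([a-zA-Z]+)\s+(\d+),\s*(\d+)/) {
        # Copy the captures right away; a later successful match would replace them.
        push @parsed, { month => $1, day => $2, year => $3 };
    }
}

printf "%s / %s / %s\n", @{$_}{qw(month day year)} for @parsed;
```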


Quantifiers are greedy by default

• That means a quantified pattern will attempt to match as much as possible. ("Matching is greedy.")

• The following expression tests for a < character, followed by one or more of anything (wildcard), followed by a > character.

"<h1>Title</h1>" =~ /<.+>/

• The quantifier's greediness passes up "<h1>", which would otherwise be a match. So the pattern matches the whole string in this case.
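The extent of a greedy match is easier to see when the matched text is captured. The sketch below adds capturing parentheses for that purpose and, for contrast, runs the same pattern with a ? placed after the quantifier, which asks for the shortest possible match instead.

```perl
use strict;
use warnings;

my $html = "<h1>Title</h1>";

my ($greedy) = $html =~ /(<.+>)/;    # .+ takes as much as it can
my ($lazy)   = $html =~ /(<.+?>)/;   # .+? stops at the first >

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```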

• To overcome the greediness (match as little as possible), an extra ? character is placed after the quantifier.

• For example, to find HTML tags, the pattern <.+?> would be used. It basically says: test for a < character followed by one or more of anything until the first > character is found.

• The following would only match "<h1>".

"<h1>Title</h1>" =~ /<.+?>/

Command modifiers

• The behavior of the matching operator can be altered by using a command modifier, which is placed after the operator.

string_expression =~ /pattern/command_modifier

Case insensitive matching

• The command modifier i specifies that the matching should be done in a case insensitive fashion.

if ($str =~ /be/i) {
    print "The string contains either be, Be, bE, or BE.";
}
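A short sketch contrasting the same pattern with and without the i modifier. The word list is made up for illustration.

```perl
use strict;
use warnings;

my @words = ("be", "Be", "bE", "BE", "wildeBEest", "cat");

my @sensitive   = grep { /be/ }  @words;   # lowercase "be" only
my @insensitive = grep { /be/i } @words;   # any mix of upper and lower case

print "case sensitive:   @sensitive\n";
print "case insensitive: @insensitive\n";
```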