jason-noble
^[Rr]egular [Ee]xpressions$
Introduction
Vocabulary
• Regular expression / Regex / Regexp
  – ‘Regex’ is pronounced ‘Reg’ (as in register) ‘Ex’ (as in FedEx)
• Matching
  – Saying a regex “matches a string” means it matches somewhere in the string, not necessarily the whole string
Regular Expressions
• Composed of two types of characters
  – Metacharacters / Special characters
    • * ? ^ $ . [ ]
  – Literal characters
    • a b c d
Egrep tool
• Allows you to use Regular Expressions to find words that match
• Available for Macs, PCs and Linux
• cat /usr/share/dict/words | egrep '…'
• See http://regex.info/egrep.html if you don’t have it preinstalled
My first regex
• cat /usr/share/dict/words | egrep 'cat'
  – Matches any word with a ‘c’ followed by an ‘a’ followed by a ‘t’
    • bobcat
    • cat
    • catwalk
    • scatter
• Simple regex, only uses Literal chars
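The same search can be tried without a dictionary file. This is a minimal sketch that swaps egrep for Python's `re` module (an assumption, since the slides use egrep) and uses a small made-up word list. `re.search` succeeds when the pattern matches anywhere in the string, which mirrors how egrep prints any line containing a match.

```python
import re

# Hypothetical stand-in for /usr/share/dict/words.
words = ["bobcat", "cat", "catwalk", "scatter", "dog", "cart"]

# Keep every word in which 'c', 'a', 't' appear consecutively.
matches = [w for w in words if re.search(r"cat", w)]
print(matches)  # ['bobcat', 'cat', 'catwalk', 'scatter']
```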
Metacharacters: ^ and $
• ^ matches the beginning of a line
• $ matches the end of a line (EOL)
  – ^cat (start of line followed by ‘c’ then ‘a’ then ‘t’)
    • cat
    • catwalk
  – cat$ (‘c’ followed by ‘a’ then ‘t’ followed by EOL)
    • bobcat
    • cat
  – ^cat$ (start of line followed by ‘c’ then ‘a’ then ‘t’ then EOL)
    • cat
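To see the anchors narrow the result set, the three patterns can be run against one word list. This sketch assumes Python's `re` module in place of egrep and treats each word as one line; the word list is illustrative.

```python
import re

words = ["bobcat", "cat", "catwalk", "scatter"]

# ^ pins the match to the start, $ pins it to the end.
starts_with = [w for w in words if re.search(r"^cat", w)]
ends_with = [w for w in words if re.search(r"cat$", w)]
exactly = [w for w in words if re.search(r"^cat$", w)]

print(starts_with)  # ['cat', 'catwalk']
print(ends_with)    # ['bobcat', 'cat']
print(exactly)      # ['cat']
```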
How to read regex
• Read each character one at a time
• ^bat
  – Start of line followed by ‘b’ then ‘a’ then ‘t’
• rat$
  – ‘r’ then ‘a’ then ‘t’ followed by end of line
• ^dog$
  – Start of line followed by ‘d’ then ‘o’ then ‘g’ then EOL
More simple regexes
• ^
  – Start of line (so it matches every line)
• ^$
  – Start of line followed by end of line (so it matches only empty lines)
• $
  – End of line (so it matches every line)
• ^foot$
  – Start of line followed by ‘f’ then ‘o’ then ‘o’ then ‘t’ followed by EOL
Character Classes [ ]
• Matches one of the characters in the [ ]
  – [ae]
    • Matches ‘a’ or ‘e’
  – [aeiouy]
    • Matches any vowel (counting ‘y’)
  – ^gr[ae]y$
    • Start of line followed by ‘g’ then ‘r’ then ‘a’ or ‘e’ then ‘y’ followed by end of line
    • grey or gray
Character Classes cont.
• [Ss]
  – Matches upper or lower case ‘S’
• [123456]
  – Matches any of the digits listed
• [Hh][123456]
  – Matches H1, h2, h3, H4, etc.
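A character class always consumes exactly one character, which a quick check makes visible. The sketch below assumes Python's `re` module rather than egrep, and the sample words and tag names are made up for illustration.

```python
import re

spellings = ["gray", "grey", "griy", "greay"]
# [ae] stands for exactly one character, so 'greay' (two vowels) fails.
accepted = [w for w in spellings if re.search(r"^gr[ae]y$", w)]

tags = ["H1", "h2", "H7", "h6", "P1"]
# One of 'H'/'h' followed by one digit from 1 to 6.
headings = [t for t in tags if re.search(r"[Hh][123456]", t)]

print(accepted)  # ['gray', 'grey']
print(headings)  # ['H1', 'h2', 'h6']
```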
Special characters in [ ]’s
• - (dash) denotes a range
  – [1-6] is the same as [123456]
  – [a-f] is the same as [abcdef]
• Ranges can be mixed with literals
  – [0-9a-fA-F_!.?]
    • Any digit, upper or lower case ‘a’, ‘b’, ‘c’, ‘d’, ‘e’, ‘f’, underscore, exclamation mark, period or question mark
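The claim that a range is shorthand for the spelled-out class can be checked exhaustively over printable ASCII. This is a sketch using Python's `re` and `string` modules (an assumption; the slides use egrep), with `re.fullmatch` testing one character at a time.

```python
import re
import string

# Every printable ASCII character, tested one at a time.
chars = string.printable

explicit = {c for c in chars if re.fullmatch(r"[123456]", c)}
ranged = {c for c in chars if re.fullmatch(r"[1-6]", c)}

# Ranges and literals mixed in a single class.
mixed = {c for c in chars if re.fullmatch(r"[0-9a-fA-F_!.?]", c)}

print(explicit == ranged)  # True
print(len(mixed))          # 26
```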
Negated character class [^ ]
• ^ inside of [ ] means “not any of these”
  – [^1-6]
    • Any character other than 1, 2, 3, 4, 5, 6
  – [^a-fA-F]
    • Any character other than A-F (upper or lower)
  – The ^ must be the first character inside [ ]
    • [^c] (Matches any character but ‘c’)
    • [c^] (Matches a ‘c’ or ‘^’)
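A negated class still has to match one character; it does not mean "the character is absent". The sketch below, assuming Python's `re` module in place of egrep, shows that point and the effect of where the ^ sits.

```python
import re

# [^1-6] needs some character that is NOT a digit from 1 to 6.
only_low_digits = re.search(r"[^1-6]", "123456")  # no such character
has_other = re.search(r"[^1-6]", "1237")          # finds the '7'

# The position of ^ inside [ ] decides its meaning.
negated = re.search(r"[^c]", "ccc")  # nothing other than 'c' to match
literal = re.search(r"[c^]", "^")    # here ^ is an ordinary character

print(only_low_digits)    # None
print(has_other.group())  # 7
print(negated)            # None
print(literal.group())    # ^
```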
Translating regex practice
• List of words that have ‘q’ followed by a character other than ‘u’
  – q[^u]
• List of words with ‘f’ followed by an ‘i’ or ‘o’ followed by ‘r’ then ‘e’
  – f[io]re
• Line starts with ‘Qu’ or ‘qu’ followed by an ‘e’ followed by any letter between ‘p’ and ‘t’
  – ^[Qq]ue[p-t]
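Running the three translations against a few words shows one subtlety: q[^u] requires a character after the ‘q’, so a word that ends in ‘q’ is not listed. This is a sketch with Python's `re` module standing in for egrep and hand-picked sample words.

```python
import re

q_words = ["Iraqi", "Iraq", "quit", "qintar"]
# 'Iraq' ends in 'q', so there is no following character for [^u].
q_not_u = [w for w in q_words if re.search(r"q[^u]", w)]

f_words = ["fire", "fore", "before", "fare"]
fire_fore = [w for w in f_words if re.search(r"f[io]re", w)]

que_words = ["question", "Queen", "query", "queue"]
# After 'que' the next letter must fall between 'p' and 't'.
que_p_to_t = [w for w in que_words if re.search(r"^[Qq]ue[p-t]", w)]

print(q_not_u)     # ['Iraqi', 'qintar']
print(fire_fore)   # ['fire', 'fore', 'before']
print(que_p_to_t)  # ['question', 'query']
```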
Metacharacter: . (dot)
• Matches any character
• c.t
  – ‘c’ followed by any character followed by ‘t’
    • cat
    • cot
    • c8t
• Period inside of [ ]’s matches a period
  – [a.c]
    • Matches ‘a’, ‘.’ or ‘c’
Periods cont.
• 03.19.76
  – Matches ‘03’ followed by any char then ‘19’ then any char then ‘76’
    • 03-19-76
    • 03/19/76
    • 03.19.76
    • 03 19 76
    • 03319876
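The date example can be checked directly, along with one way to tighten it. The sketch assumes Python's `re` module instead of egrep; the class [-/.] is an illustrative alternative that is not in the slides and relies on a period being literal inside [ ].

```python
import re

dates = ["03-19-76", "03/19/76", "03.19.76", "03 19 76", "03319876", "031976"]

# Each dot accepts any single character, including a digit.
loose = [d for d in dates if re.search(r"03.19.76", d)]

# Illustrative tighter version: only '-', '/' or '.' as separators.
strict = [d for d in dates if re.search(r"03[-/.]19[-/.]76", d)]

print(loose)   # ['03-19-76', '03/19/76', '03.19.76', '03 19 76', '03319876']
print(strict)  # ['03-19-76', '03/19/76', '03.19.76']
```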
Alternatives: | (pipe)
• Pipes allow you to specify alternatives
• grey|gray
  – Matches on grey or gray
• Use parentheses to constrain alternatives
  – gr(e|a)y
• Within [ ]’s, | is a normal character
  – [a|b]
    • Matches ‘a’ or ‘|’ or ‘b’
Pipes (cont.)
• Use parentheses to constrain
  – gre|ay
    • matches ‘gre’ or ‘ay’
  – gr(e|a)y
    • matches ‘gr’ followed by ‘e’ or ‘a’ then ‘y’
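The difference between the grouped and ungrouped forms shows up as soon as unrelated words are in the input. This is a sketch with Python's `re` module in place of egrep and a made-up word list.

```python
import re

words = ["grey", "gray", "great", "okay", "gruy"]

# Without parentheses the alternatives are 'gre' and 'ay'.
ungrouped = [w for w in words if re.search(r"gre|ay", w)]

# With parentheses only the single letter alternates.
grouped = [w for w in words if re.search(r"gr(e|a)y", w)]

print(ungrouped)  # ['grey', 'gray', 'great', 'okay']
print(grouped)    # ['grey', 'gray']
```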
Regex practice
• Match “First Street” or “1st street”
  – (First|1st) [Ss]treet
  – (Fir|1)st [Ss]treet
    • These are equivalent, which is better?
• Match “toothbrush” or “hairbrush”
  – (tooth|hair)brush
^ or $ and alternation
• Be careful when using ^ or $ with alternation
• ^From|Subject|Date:
  – Start of line followed by ‘From’ OR
  – ‘Subject’ anywhere in the line OR
  – ‘Date:’ anywhere in the line
• ^(From|Subject|Date):
  – Start of line followed by ‘From’ or ‘Subject’ or ‘Date’ followed by ‘:’
• Safer to use ()’s to group your alternatives
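A handful of header-like lines makes the pitfall concrete. The sketch assumes Python's `re` module rather than egrep, and the sample lines are invented for illustration.

```python
import re

lines = [
    "From: alice@example.com",
    "Subject: hello",
    "Re: Subject to change",
    "Date: Monday",
    "Not From: here",
]

# ^ binds only to 'From' and ':' only to 'Date'.
ungrouped = [l for l in lines if re.search(r"^From|Subject|Date:", l)]

# Grouping makes ^ and ':' apply to all three alternatives.
grouped = [l for l in lines if re.search(r"^(From|Subject|Date):", l)]

print(len(ungrouped))  # 4, including 'Re: Subject to change'
print(len(grouped))    # 3
```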
Case insensitive match
• Matches are case sensitive by default
  – [Ff]rom will match From but not FRom
• Use egrep’s -i option to do a case insensitive match
• Most languages have a case insensitive match as well
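As one example of a language-level option, Python's `re` module accepts the `re.IGNORECASE` flag, which plays roughly the role of egrep's -i. The input string below is made up.

```python
import re

line = "FRom: bob"

# The class only covers the first letter, so 'FRom' is missed.
sensitive = re.search(r"[Ff]rom", line)

# The flag makes every letter match regardless of case.
insensitive = re.search(r"from", line, re.IGNORECASE)

print(sensitive)            # None
print(insensitive.group())  # FRom
```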
Quantifiers: ?
• ? metacharacter means optional
  – colou?r
    • matches color or colour
    • ‘c’ then ‘o’ then ‘l’ then ‘o’ then optionally ‘u’ then ‘r’
• Match July or Jul followed by fourth, 4th or 4
  – (July|Jul) (fourth|4th|4)
  – July? (fourth|4th|4)
  – July? (fourth|4(th)?)
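The three date patterns are meant to be interchangeable, which can be checked by running each against the same inputs. This sketch assumes Python's `re` module and uses `re.fullmatch` so the whole string has to match; the sample strings are illustrative.

```python
import re

patterns = [
    r"(July|Jul) (fourth|4th|4)",
    r"July? (fourth|4th|4)",
    r"July? (fourth|4(th)?)",
]
dates = ["July fourth", "Jul 4th", "July 4", "Jul fourth", "June 4", "Ju 4"]

# One list of accepted strings per pattern.
accepted = [[d for d in dates if re.fullmatch(p, d)] for p in patterns]

colors = [w for w in ["color", "colour", "colouur", "colr"]
          if re.fullmatch(r"colou?r", w)]

print(accepted[0] == accepted[1] == accepted[2])  # True
print(colors)                                     # ['color', 'colour']
```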
Quantifiers: + and *
• + (plus)
  – One or more of the previous item
• * (star)
  – Zero or more of the previous item
• b[0-9]*a
  – ba
  – b9999a
  – b999999999999999a
Summary of Quantifiers
Quantifier   Minimum required   Maximum to try   Meaning
?            none               1                zero or one occurrence
*            none               no limit         zero or more occurrences
+            1                  no limit         one or more occurrences
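The only difference between * and + is whether zero occurrences are allowed, which the ‘ba’ case isolates. This is a sketch using Python's `re.fullmatch` (an assumption; the slides use egrep) on a few sample strings.

```python
import re

samples = ["ba", "b9999a", "b999999999999999a", "b9x9a", "a"]

# * lets the digit run be empty, so 'ba' is accepted.
star = [s for s in samples if re.fullmatch(r"b[0-9]*a", s)]

# + insists on at least one digit, so 'ba' is rejected.
plus = [s for s in samples if re.fullmatch(r"b[0-9]+a", s)]

print(star)  # ['ba', 'b9999a', 'b999999999999999a']
print(plus)  # ['b9999a', 'b999999999999999a']
```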
Escaping metacharacters
• Use \ (backslash) to escape metacharacters
  – \. matches ‘.’
  – . matches any character
• c.t matches cat
• c\.t does not match cat
• \(cat\) matches ‘(cat)’ not ‘cat’
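The escaping rules can be confirmed in a few lines. The sketch assumes Python's `re` module in place of egrep; `re.escape` is a Python convenience that adds the backslashes for you and is not part of the slides.

```python
import re

dot_any = bool(re.search(r"c.t", "cat"))            # unescaped dot
dot_escaped_cat = bool(re.search(r"c\.t", "cat"))   # needs a real '.'
dot_escaped_dot = bool(re.search(r"c\.t", "c.t"))
parens_literal = bool(re.search(r"\(cat\)", "(cat)"))
parens_missing = bool(re.search(r"\(cat\)", "cat"))

# re.escape builds the escaped pattern from a plain string.
escaped = re.escape("c.t")

print(dot_any, dot_escaped_cat, dot_escaped_dot)  # True False True
print(parens_literal, parens_missing)             # True False
print(escaped)                                    # c\.t
```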
More practice
• Match chat, cat, chart
  – ch?ar?t
  – c[h]?a[r]?t
• Start of line then ‘M’ then one or more ‘a’ followed by ‘st’ and zero or more ‘b’
  – ^M[a]+st[b]*
• Lines ending with one or more ‘c’ followed by a ‘t’ then zero or one ‘e’
  – [c]+t[e]?$
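These exercises are easy to test, and doing so shows two details: ch?ar?t also accepts ‘cart’, and "zero or one" is written with ?, whereas * would allow any number of trailing ‘e’. The sketch assumes Python's `re` module and made-up sample words.

```python
import re

words = ["chat", "cat", "chart", "cart", "coat"]
short_form = [w for w in words if re.fullmatch(r"ch?ar?t", w)]
class_form = [w for w in words if re.fullmatch(r"c[h]?a[r]?t", w)]

m_words = ["Mast", "Maaastbb", "Mst", "mast"]
m_matches = [w for w in m_words if re.search(r"^M[a]+st[b]*", w)]

endings = ["act", "acte", "actee"]
zero_or_one = [w for w in endings if re.search(r"[c]+t[e]?$", w)]
zero_or_more = [w for w in endings if re.search(r"[c]+t[e]*$", w)]

print(short_form)    # ['chat', 'cat', 'chart', 'cart']
print(m_matches)     # ['Mast', 'Maaastbb']
print(zero_or_one)   # ['act', 'acte']
print(zero_or_more)  # ['act', 'acte', 'actee']
```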
More practice
• ^[Mm][^a-np-z]ney$
  – Start of line then ‘M’ or ‘m’ then any character other than a-n or p-z then ‘ney’ followed by end of line
  – Money, money, m3ney
• ^be.*(bob|ted)$
  – Start of line followed by ‘be’ followed by zero or more characters followed by ‘bob’ or ‘ted’ followed by end of line
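Both patterns can be exercised on a few words to confirm the reading. This is a sketch with Python's `re` module standing in for egrep; the word lists are illustrative.

```python
import re

m_words = ["Money", "money", "m3ney", "Maney", "honey", "Mooney"]
# The second character may be anything except lowercase a-n or p-z.
money_like = [w for w in m_words if re.search(r"^[Mm][^a-np-z]ney$", w)]

b_words = ["bebob", "bested", "belted", "bobbed", "best", "unbelted"]
# .* may match nothing at all, so 'bebob' qualifies.
be_words = [w for w in b_words if re.search(r"^be.*(bob|ted)$", w)]

print(money_like)  # ['Money', 'money', 'm3ney']
print(be_words)    # ['bebob', 'bested', 'belted']
```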
More practice
• Match truck, firetruck but not dumptruck
  – ^(fire)?truck$
• $0.99, $599.95, $1000.45, $5000
  – \$[0-9]+(\.[0-9][0-9])?$
• 404-555-1212, 404.555.1212, (404) 555-1212
  – ^[()0-9]+.[0-9]+.[0-9]+$
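Testing these three answers shows that they accept the intended inputs, and also that the phone pattern is loose: its unescaped dots accept any separator. The sketch assumes Python's `re` module instead of egrep, and the extra inputs are made up to probe the edges.

```python
import re

trucks = ["truck", "firetruck", "dumptruck"]
truck_ok = [t for t in trucks if re.search(r"^(fire)?truck$", t)]

prices = ["$0.99", "$599.95", "$1000.45", "$5000", "$5.5", "$.99"]
# Cents are optional, but if present there must be exactly two digits.
price_ok = [p for p in prices if re.search(r"\$[0-9]+(\.[0-9][0-9])?$", p)]

phones = ["404-555-1212", "404.555.1212", "(404) 555-1212", "1x2y3", "call me"]
# Each '.' accepts any character, so '1x2y3' slips through.
phone_ok = [p for p in phones if re.search(r"^[()0-9]+.[0-9]+.[0-9]+$", p)]

print(truck_ok)  # ['truck', 'firetruck']
print(price_ok)  # ['$0.99', '$599.95', '$1000.45', '$5000']
print(phone_ok)  # ['404-555-1212', '404.555.1212', '(404) 555-1212', '1x2y3']
```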