View
219
Download
2
Embed Size (px)
Citation preview
Administrivia
• Anyone try installing Perl yet?– Active State Perl– http://www.activestate.com– install the free version– (not the trial versions)
Background Reading
• Textbook– Chapter 2: Regular Expressions and Automata– Section 2.1: Regular Expressions
• Perl Regular Expressions (RE)– perlrequick - Perl regular expressions quick start
• http://perldoc.perl.org/perlrequick.html
– perlretut - Perl regular expressions tutorial• http://perldoc.perl.org/perlretut.html
Regular Expressions
• regular expressions are used in string pattern-matching – important tool in automated searching
– formally equivalent to finite-state automata (FSA) and regular grammars
• popular implementations– Unix grep
• command line program• returns lines matching a regular expression• standard part of all Unix-based systems
– including MacOS X (command-line interface in Terminal)
• many shareware/freeware implementations available for Windows XP– just Google and see...
– grep functionality is built into many programming languages• e.g. Perl
– wildcard search in Microsoft Word• limited version of regular expressions (not full power)• with differences in notation
Regular Expressions
• Historical note – grep:
• name comes from Unix ed command – g/re/p– “search globally for lines
matching the regular expression, and print them”
– [Source: http://en.wikipedia.org/wiki/Grep]
– ed• is an obscure and difficult-to-
use text edit program on Unix systems
– doesn’t need a screen display– would work on an ancient
teletype
wikipedia
Regular Expressions
• Formally– a regular expression
(regexp) is formed from:• an alphabet
(= set of characters)
• operators
– a regexp is shorthand for a set of strings
• (possibly infinite set)
• (strings are of finite length)
• Formally– a set of strings is
called a language– a language that can
be defined by a regular expression is called a regular language
• (not all languages are regular)
Regular Expressions
• alphabet– e.g. {a,b,c,...,z}
set of lower case English letters
Note: case is important
• operators– a single symbol
a– an exactly n
occurrences of
a, n a positive integer
– an a3 aaa
– a* zero or more occurrences of a
– a+ one or more occurrences of a
– concatenation• two regexps may be
concatenated, the resulting string is also a regexp
• e.g. abc
– disjunction• infix operator: |• (vertical bar)• e.g. a|b
– parentheses may be used for disambiguation
• e.g. gupp(y|ies)
Regular Expressions
• Technically, a+ is not necessary– aa* = a+
“a concatenated with a* (zero or more occurrences of a)”
= “one or more occurrences of a”
• Disjunction– [set of characters]
set of characters enclosed in square brackets means match one of the characters
– e.g. [aeiou]
matches any of the vowels a, e, i, o or u but not d
– dash (-) shorthand for a range
– e.g. [a-e]
matches a, b, c, d or e
Regular Expressions
• Range defined over a computer character set– typically ASCII– originally a 7 bit
character set – 2^7 = 128 (0-127)
different characters– ASCII = American
Standard Code for Information Interchange
– e.g. [A-z] [0-9A-Za-z]
• http://www.idevelopment.info/data/Programming/ascii_table/PROGRAMMING_ascii_table.shtml
Regular Expressions
• Perl uses the same notation as grep (textbook also uses grep notation)
• More shorthand– question mark (?) means the
previous regexp is optional– e.g. colou?r– matches color or colour– metacharacters or operators
like ? have a function– to match a question mark,
escape it using a backslash (\)– e.g. why\?– ? in Microsoft Word means
match any character
• More shorthand– period (.) stands for any
character (except newline)
– e.g. e.tmatches eat as well as eet
– caret sign (^) as the first character of a range of characters [^set of characters]means don’t match any of the characters mentioned (after the caret)
– e.g. [^aeiou]– any character except for one
of the vowels listed
Regular Expressions
• Text files in Unix consists of sequences of lines separated by a newline character (LF = line feed)
• Typically, text files are read a line at a time by programs
• Matching in Perl and grep is line-oriented(can be changed in Perl)
• Differences in platforms for line breaking:– Unix: LF– Windows (DOS): CR LF– MacOS (X): CR
• affects binary file transfers
Regular Expressions
• Line-oriented metacharacters:– caret (^) at the beginning of a
regexp string matches the “beginning of a line”
– e.g. ^The
matches lines beginning with
the sequence The– Note: the caret is very
overloaded...• [^ab]
• a^b
– dollar sign ($) at the end of a regexp string matches the “end of the line”
– e.g. end\.$– matches lines ending in
the sequence end.– e.g. ^$
matches blank lines only– e.g. ^ $
matches lines contains exactly one space
Regular Expressions
• Word-oriented metacharacters:– a word is any sequence of digits [0-9],
underscores (_) and letters [a-zA-Z]– (historical reasons for this)– \b matches a word boundary, e.g. a
space or beginning or end of a line or a non-word character
– e.g. the– matches the, they, breathe and other– but \bthe will only match the and they– the\b will match the and breathe– \bthe\b will only match the– (\< and \> can also be used to match the
beginning and end of a word)
– e.g. \b99– matches 99 but not 299– also matches $99
Regular Expressions
• Note: – definition of word does not
include characters with accent marks
• extended ASCII character set– 8 bit– characters 128-255– does not include non-Roman
characters
• towards multilingual computing– Unicode– one massive table
Regular Expressions
• Range abbreviations:– \d (digit) = [0-9]– \s (whitespace character) =
space (SP), tab (HT), carriage return (CR), newline (LF) or form feed (FF)
– \w (word character) = [0-9a-zA-Z_]
• uppercase versions denote negation– e.g. \W means a non-word
character \D means a non-digit
• Repetition abbreviations:– a? a optional– a* zero or more a’s– a+ one or more a’s– a{n,m} between n and m a’s– a{n,} at least n a’s– a{n} exactly n a’s– e.g. \d{7,}
matches numbers with at least 7 digits
– e.g. \d{3}-\d{4}– matches 7 digit telephone
numbers with a separating dash
Perl• Run from the command line in Windows
– Start > Run...– cmd (brings up command line interpreter)
• Running a Perl program:– perl -help (gives options)
– perl filename.pl
(runs Perl command file filename.pl)
– perl filename.pl inputfile.txt
(runs Perl command file filename.pl, inputfile.txt is supplied to filename.pl)
e.g. filename.pl reads and processes input file inputfile.txt
Perl
• Example Perl program (match.pl) to read in a text file and print lines matching a regexp enclosed by /.../
• Example input file (text.txt)
• Commandperl match.pl text.txt
open (F,$ARGV[0]) or die "$ARGV[0] not found!\n";
while (<F>) {
print $_ if (/The/);
}
This is a test.The cat sat on the mat.These shoes are made for walking.Otherwise, I thought it was cold.45
Perl
• Program:open (F,$ARGV[0]) or die "$ARGV[0]
not found!\n";
while (<F>) {
print $_ if (/The/);
}
– while (<F>) first evaluates <F> – <F> reads in a line from the file
referenced by F and places the line in the program variable $_
– then it executes the program code between the curly braces
– then it goes back and reads another line
– it does this repeatedly while <F> produces a valid line – if we reach the end of the file, the while loop stops
– print $_ if (/.../); is conditional code that means print the contents of variable $_ if the regexp between the /.../ can be found in $_
• Program explained:– open the file referenced in $AGRV[0] for input
– $AGRV[0] is the first command line argument following the program name– F is the file descriptor associated with the opened file
– if there is a problem opening the file, e.g. file doesn’t exist, program execution dies and prints the value of the string enclosed in double quotes "$ARGV[0] not found!\n"