20
LING/C SC/PSYC 438/538 Computational Linguistics Sandiway Fong Lecture 2: 8/23

LING/C SC/PSYC 438/538 Computational Linguistics Sandiway Fong Lecture 2: 8/23

  • View
    219

  • Download
    2

Embed Size (px)

Citation preview

LING/C SC/PSYC 438/538Computational Linguistics

Sandiway Fong

Lecture 2: 8/23

Administrivia

• Anyone try installing Perl yet?– Active State Perl– http://www.activestate.com– install the free version– (not the trial versions)

Background Reading

• Textbook– Chapter 2: Regular Expressions and Automata– Section 2.1: Regular Expressions

• Perl Regular Expressions (RE)– perlrequick - Perl regular expressions quick start

• http://perldoc.perl.org/perlrequick.html

– perlretut - Perl regular expressions tutorial• http://perldoc.perl.org/perlretut.html

Regular Expressions

• regular expressions are used in string pattern-matching – important tool in automated searching

– formally equivalent to finite-state automata (FSA) and regular grammars

• popular implementations– Unix grep

• command line program• returns lines matching a regular expression• standard part of all Unix-based systems

– including MacOS X (command-line interface in Terminal)

• many shareware/freeware implementations available for Windows XP– just Google and see...

– grep functionality is built into many programming languages• e.g. Perl

– wildcard search in Microsoft Word• limited version of regular expressions (not full power)• with differences in notation

Regular Expressions

• Historical note – grep:

• name comes from Unix ed command – g/re/p– “search globally for lines

matching the regular expression, and print them”

– [Source: http://en.wikipedia.org/wiki/Grep]

– ed• is an obscure and difficult-to-

use text edit program on Unix systems

– doesn’t need a screen display– would work on an ancient

teletype

wikipedia

Regular Expressions

• Formally– a regular expression

(regexp) is formed from:• an alphabet

(= set of characters)

• operators

– a regexp is shorthand for a set of strings

• (possibly infinite set)

• (strings are of finite length)

• Formally– a set of strings is

called a language– a language that can

be defined by a regular expression is called a regular language

• (not all languages are regular)

Regular Expressions

• alphabet– e.g. {a,b,c,...,z}

set of lower case English letters

Note: case is important

• operators– a single symbol

a– an exactly n

occurrences of

a, n a positive integer

– an a3 aaa

– a* zero or more occurrences of a

– a+ one or more occurrences of a

– concatenation• two regexps may be

concatenated, the resulting string is also a regexp

• e.g. abc

– disjunction• infix operator: |• (vertical bar)• e.g. a|b

– parentheses may be used for disambiguation

• e.g. gupp(y|ies)

Regular Expressions

• Technically, a+ is not necessary– aa* = a+

“a concatenated with a* (zero or more occurrences of a)”

= “one or more occurrences of a”

• Disjunction– [set of characters]

set of characters enclosed in square brackets means match one of the characters

– e.g. [aeiou]

matches any of the vowels a, e, i, o or u but not d

– dash (-) shorthand for a range

– e.g. [a-e]

matches a, b, c, d or e

Regular Expressions

• Range defined over a computer character set– typically ASCII– originally a 7 bit

character set – 2^7 = 128 (0-127)

different characters– ASCII = American

Standard Code for Information Interchange

– e.g. [A-z] [0-9A-Za-z]

• http://www.idevelopment.info/data/Programming/ascii_table/PROGRAMMING_ascii_table.shtml

Regular Expressions: Microsoft Word

• terminology:– wildcard search

Regular Expressions: Microsoft Word

Note: zero or more times is missing in Microsoft Word

Regular Expressions

• Perl uses the same notation as grep (textbook also uses grep notation)

• More shorthand– question mark (?) means the

previous regexp is optional– e.g. colou?r– matches color or colour– metacharacters or operators

like ? have a function– to match a question mark,

escape it using a backslash (\)– e.g. why\?– ? in Microsoft Word means

match any character

• More shorthand– period (.) stands for any

character (except newline)

– e.g. e.tmatches eat as well as eet

– caret sign (^) as the first character of a range of characters [^set of characters]means don’t match any of the characters mentioned (after the caret)

– e.g. [^aeiou]– any character except for one

of the vowels listed

Regular Expressions

• Text files in Unix consists of sequences of lines separated by a newline character (LF = line feed)

• Typically, text files are read a line at a time by programs

• Matching in Perl and grep is line-oriented(can be changed in Perl)

• Differences in platforms for line breaking:– Unix: LF– Windows (DOS): CR LF– MacOS (X): CR

• affects binary file transfers

Regular Expressions

• Line-oriented metacharacters:– caret (^) at the beginning of a

regexp string matches the “beginning of a line”

– e.g. ^The

matches lines beginning with

the sequence The– Note: the caret is very

overloaded...• [^ab]

• a^b

– dollar sign ($) at the end of a regexp string matches the “end of the line”

– e.g. end\.$– matches lines ending in

the sequence end.– e.g. ^$

matches blank lines only– e.g. ^ $

matches lines contains exactly one space

Regular Expressions

• Word-oriented metacharacters:– a word is any sequence of digits [0-9],

underscores (_) and letters [a-zA-Z]– (historical reasons for this)– \b matches a word boundary, e.g. a

space or beginning or end of a line or a non-word character

– e.g. the– matches the, they, breathe and other– but \bthe will only match the and they– the\b will match the and breathe– \bthe\b will only match the– (\< and \> can also be used to match the

beginning and end of a word)

– e.g. \b99– matches 99 but not 299– also matches $99

Regular Expressions

• Note: – definition of word does not

include characters with accent marks

• extended ASCII character set– 8 bit– characters 128-255– does not include non-Roman

characters

• towards multilingual computing– Unicode– one massive table

Regular Expressions

• Range abbreviations:– \d (digit) = [0-9]– \s (whitespace character) =

space (SP), tab (HT), carriage return (CR), newline (LF) or form feed (FF)

– \w (word character) = [0-9a-zA-Z_]

• uppercase versions denote negation– e.g. \W means a non-word

character \D means a non-digit

• Repetition abbreviations:– a? a optional– a* zero or more a’s– a+ one or more a’s– a{n,m} between n and m a’s– a{n,} at least n a’s– a{n} exactly n a’s– e.g. \d{7,}

matches numbers with at least 7 digits

– e.g. \d{3}-\d{4}– matches 7 digit telephone

numbers with a separating dash

Perl• Run from the command line in Windows

– Start > Run...– cmd (brings up command line interpreter)

• Running a Perl program:– perl -help (gives options)

– perl filename.pl

(runs Perl command file filename.pl)

– perl filename.pl inputfile.txt

(runs Perl command file filename.pl, inputfile.txt is supplied to filename.pl)

e.g. filename.pl reads and processes input file inputfile.txt

Perl

• Example Perl program (match.pl) to read in a text file and print lines matching a regexp enclosed by /.../

• Example input file (text.txt)

• Commandperl match.pl text.txt

open (F,$ARGV[0]) or die "$ARGV[0] not found!\n";

while (<F>) {

print $_ if (/The/);

}

This is a test.The cat sat on the mat.These shoes are made for walking.Otherwise, I thought it was cold.45

Perl

• Program:open (F,$ARGV[0]) or die "$ARGV[0]

not found!\n";

while (<F>) {

print $_ if (/The/);

}

– while (<F>) first evaluates <F> – <F> reads in a line from the file

referenced by F and places the line in the program variable $_

– then it executes the program code between the curly braces

– then it goes back and reads another line

– it does this repeatedly while <F> produces a valid line – if we reach the end of the file, the while loop stops

– print $_ if (/.../); is conditional code that means print the contents of variable $_ if the regexp between the /.../ can be found in $_

• Program explained:– open the file referenced in $AGRV[0] for input

– $AGRV[0] is the first command line argument following the program name– F is the file descriptor associated with the opened file

– if there is a problem opening the file, e.g. file doesn’t exist, program execution dies and prints the value of the string enclosed in double quotes "$ARGV[0] not found!\n"