Introduction to Regular Expressions Christine Moulen MIT Libraries ELUNA 2014

What is a regular expression?

Regular expressions are:
› A language or syntax that lets you specify patterns for matching, e.g., filenames or strings
› Used to identify the files or lines you want to work with
› Used inside of substitution functions to change the contents of a string

Command line examples

ls 14*
› * is a wildcard here, not regex
› 14 followed by zero or more of any character

ls 14[0-1][0-9]*
› [0-1] and [0-9] are character classes, written the same way as in regex, each specifying a single character within the list of characters from 0 to 1, and 0 to 9, respectively

ls 14[0-1][0-9][0-3][0-9]*
› 6 digits that look like a date YYMMDD, mostly
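The same character classes can be used inside a Perl pattern. As a minimal sketch, with made-up file names, the following filters a list roughly the way the last ls command would. The ^ anchor is an addition: a shell wildcard always matches the whole name, while a regex can match anywhere in the string. The sketch also shows why the result is only "mostly" a date, since 141332 passes even though there is no month 13 or day 32.

```perl
use strict;
use warnings;

# Hypothetical file names, made up for illustration.
my @names = ('140415.txt', '141332.txt', '142001.txt', '14.txt');

# Same character classes as the ls example, anchored at the start of the name.
my @date_like = grep { /^14[0-1][0-9][0-3][0-9]/ } @names;

print "$_\n" for @date_like;
```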

More command line examples

mv [b-z]* $data_scratch
› An alphabetical class, which depending on your system might match the lower case letters from b through z, OR a mix of upper and lower case: b C c D d ... Z z

grep 'MIT01$' sysnos.txt
› Find lines that end ($) with MIT01
› ^ can be used to match at the beginning of a line
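For comparison, the two anchors behave the same way in Perl. This is only a sketch: the array stands in for the lines of sysnos.txt, and its contents are invented.

```perl
use strict;
use warnings;

# Hypothetical contents of sysnos.txt, one entry per line.
my @lines = ("000123456MIT01\n", "000123457MIT50\n", "MIT01 000123458\n");

# $ matches at the end of the line, just before the trailing newline.
my @ends_with = grep { /MIT01$/ } @lines;

# ^ matches at the beginning of the line.
my @starts_with = grep { /^MIT01/ } @lines;

print "ends:   $_" for @ends_with;
print "starts: $_" for @starts_with;
```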

UNIX/Linux editors

In vi, you can use regular expressions with the s/// substitution operator

With emacs, use M-x query-replace-regexp
› Replace $ with MIT01
› Take a list of system numbers and make it valid input to an Aleph service by adding the library code to the end of each line
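The same edit can be sketched in Perl. The system numbers below are invented; MIT01 is the library code from the slide. In vi the equivalent command would be :%s/$/MIT01/.

```perl
use strict;
use warnings;

# Hypothetical system numbers, one per line.
my @sysnos = ("000123456\n", "000123457\n");

foreach my $line (@sysnos) {
    # $ matches at the end of the line (before the newline), so the
    # replacement text is inserted there and nothing is removed.
    $line =~ s/$/MIT01/;
}

print @sysnos;
```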

Matching example in Perl

Look through a MARC file in Aleph sequential format for lines with tag 260
› 001234567 260 L $$aCambridge$$bMIT Press

if ($matched =~ m/^\d{9}\s260.+/) { ... }
› $matched is the while loop variable representing the line we're working on
› =~ is the pattern binding operator, used with the matching (m), substitution (s), and translation (tr) operators
› m// is the pattern matching operator

m/^\d{9}\s260.+/

^ start at the beginning of the line

\d Perl-speak for the digits character class

{9} a quantifier. Find exactly 9 of \d

\s Perl-speak for the whitespace char class

260 the MARC tag I'm looking for

. any character

+ a quantifier. Find 1 or more of .
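Put together, the pattern can be tried on a few lines. This is a minimal sketch: the sample lines are invented and typed the way the example line appears on the slide, and a foreach over an array stands in for the while loop that would read a real file.

```perl
use strict;
use warnings;

# Hypothetical sample lines in the style of the slide's example.
my @lines = (
    '001234567 245 L $$aIntroduction to regular expressions',
    '001234567 260 L $$aCambridge$$bMIT Press',
    '001234568 260 L $$aBoston$$bExample Press',
);

my @found;
foreach my $matched (@lines) {
    # 9 digits at the start, one whitespace character, the tag 260,
    # then at least one more character.
    if ($matched =~ m/^\d{9}\s260.+/) {
        push @found, $matched;
    }
}

print "$_\n" for @found;
```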

Working with MARC fields

Look for deleted records
› LDR position 05 is d
› $my_LDR =~ /LDR L .....d/

Look for e-resource records
› $my_245 =~ /\$\$h\[electronic resource\]/

Look for OCLC numbers
› $my_035 =~ /(\(OCoLC\)\d{8,10})/
› Note the double use of () here: the outer pair groups and captures, while the escaped \( and \) match literal parentheses
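A short sketch of the OCLC pattern in use. The 035 line is invented, and $oclc is a name chosen here for illustration.

```perl
use strict;
use warnings;

# Hypothetical 035 line; single quotes keep Perl from interpolating $$.
my $my_035 = '001234567 035 L $$a(OCoLC)12345678';

my $oclc;
if ($my_035 =~ /(\(OCoLC\)\d{8,10})/) {
    # $1 holds the text matched inside the outer, unescaped parentheses.
    $oclc = $1;
    print "$oclc\n";
}
```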

Counting up records at the end of a script

if ($hash{$tmp} =~ m/SKIP/ || $hash{$tmp} =~ m/NEW/) {
    # A bare m// tests the current line in $_
    $new_count++  if (m/ FMT L /);
    $skip_count++ if (m/ FMT L / && $hash{$tmp} =~ m/SKIP/);
    $bre_count++  if (m/ FMT L / && $hash{$tmp} =~ m/SKIP Brief/);
    $bks_count++  if (m/ FMT L / && $hash{$tmp} =~ m/SKIP Books24x7/);
    $eebo_count++ if (m/ FMT L / && $hash{$tmp} =~ m/SKIP EEBO/);
    $epda_count++ if (m/ FMT L / && $hash{$tmp} =~ m/SKIP EPDA/);
    $sta_count++  if (m/ FMT L / && $hash{$tmp} =~ m/SKIP STA/);
}

Substitution example in Perl

We have a browse index of URLs

An Aleph browse index only sorts the first 69 characters of the field

When we have many URLs from the same site, we need to get the unique part closer to the beginning

Following is an SFX OpenURL from the MARCit! service

This OpenURL ...

http://owens.mit.edu/sfx_local?url_ver=Z39.88-2004&ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&rfr_id=info:sid/sfxit.com:opac_856&url_ctx_fmt=info:ofi/fmt:kev:mtx:ctx&sfx.ignore_date_threshold=1&rft.object_id=3710000000092335&svc_val_fmt=info:ofi/fmt:kev:mtx:sch_svc&

... becomes this.

http://owens.mit.edu/sfx_local?rft.object_id=3710000000092335&url_ver=Z39.88-2004&ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&rfr_id=info:sid/sfxit.com:opac_856&url_ctx_fmt=info:ofi/fmt:kev:mtx:ctx&sfx.ignore_date_threshold=1&svc_val_fmt=info:ofi/fmt:kev:mtx:sch_svc&

The substitution expression

$my_856 =~ s/(^.*sfx_local\?)(.*)(rft\.object_id\=\d{1,}\&)(.*$)/$1$3$2$4/;

s is the substitution operator
› s/pattern/replacement/ replaces whatever matches the pattern with the replacement

Parentheses used here to group different sections of the pattern, and then re-arrange them

s/(^.*sfx_local\?)(.*)(rft\.object_id\=\d{1,}\&)(.*$)/$1$3$2$4/

$1 The first matched parenthetical section

^.*sfx_local\? From the beginning, anything up to and including sfx_local?

? is a special character and is escaped here to get a literal question mark

$2 The 2nd matched parenthetical section

.* Any number of any character; it is greedy, so it runs up to the last place where the next part of the pattern can still match

s/(^.*sfx_local\?)(.*)(rft\.object_id\=\d{1,}\&)(.*$)/$1$3$2$4/

Now change the order from $1$2$3$4 to $1$3$2$4

$3 The 3rd parenthetical section

rft\.object_id\=\d{1,}\&

rft.object_id= followed by one or more digits and an ampersand

. is escaped with \ because it is a special character; = and & are not special in a Perl pattern, so escaping them is harmless but not required

{1,} is a quantifier meaning one or more, the same as +

$4 The 4th and final parenthetical section

.*$ Any number of any character to the end
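To check the rearrangement, the substitution can be run on the OpenURL from the earlier slide. This is a sketch only: the URL is typed in as a single-quoted string so that Perl does not interpolate anything in it.

```perl
use strict;
use warnings;

# The OpenURL from the slide, single-quoted so nothing is interpolated.
my $my_856 = 'http://owens.mit.edu/sfx_local?url_ver=Z39.88-2004&ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&rfr_id=info:sid/sfxit.com:opac_856&url_ctx_fmt=info:ofi/fmt:kev:mtx:ctx&sfx.ignore_date_threshold=1&rft.object_id=3710000000092335&svc_val_fmt=info:ofi/fmt:kev:mtx:sch_svc&';

# $1 = everything up to sfx_local?, $2 = the parameters before the object id,
# $3 = the object id parameter, $4 = everything after it.
$my_856 =~ s/(^.*sfx_local\?)(.*)(rft\.object_id\=\d{1,}\&)(.*$)/$1$3$2$4/;

print "$my_856\n";
```

After the substitution, rft.object_id comes directly after sfx_local?, which puts the unique part of the URL inside the first 69 characters.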

Parsing thesis notes

Thesis degree, year, and department are stored in a single free text MARC field 502

We have applied some structure to this, but it has varied over time

In DSpace, we want to get these 3 bits into separate fields, so the note is parsed on the way from MARC to Dublin Core

Parsing thesis notes

$MIT = 'Massachusetts Institute of Technology\.?|M\.\s?I\.\s?T\.';
› ? is the zero or one quantifier.
› | match the pattern alternative before or after this

$Dept = '[Dd]epartment\s[Oo]f|[dD]ept\.\s+[Oo]f';
› A few small character classes, to allow for case variation, and Department vs Dept.

Parsing thesis notes

$Month = 'January|February|March|April|May|June|July|August|September|October|November|December';
› match any one month name when $Month is used inside a pattern
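A minimal sketch of $Month inside a pattern. The pattern itself (a month name, whitespace, a 4-digit year) is hypothetical and not taken from the slides; the sample text is borrowed from one of the thesis notes. The parentheses around $Month matter: without them, the | alternatives would split the whole pattern rather than just the month names.

```perl
use strict;
use warnings;

my $Month = 'January|February|March|April|May|June|July|August|September|October|November|December';

# Sample text; the pattern below is an illustration, not the deck's own.
my $note = 'Dept. of Linguistics and Philosophy, February 2004.';

my ($month, $year);
if ($note =~ /($Month)\s+(\d{4})/) {
    # $1 = the month name that matched, $2 = the year
    ($month, $year) = ($1, $2);
    print "$month $year\n";
}
```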

Thesis. 1975. Sc.D.--Massachusetts Institute of Technology. Dept. of Mechanical Engineering

/^Thesis\.\s+(\d+)\.?\s+([\w\.\s]+)--($MIT)\.?\s+($Dept)?\s*(.+)$/o

/^Thesis\. Begin with Thesis.

\s+ 1 or more spaces

(\d+) 1 or more digits = $1

\.? 0 or 1 period

\s+ 1 or more spaces

([\w\.\s]+) 1 or more word chars, periods, spaces = $2

-- two literal hyphens

($MIT) something matching $MIT = $3

Thesis. 1975. Sc.D.--Massachusetts Institute of Technology. Dept. of Mechanical Engineering

/^Thesis\.\s+(\d+)\.?\s+([\w\.\s]+)--($MIT)\.?\s+($Dept)?\s*(.+)$/o

\.? 0 or 1 period

\s+ 1 or more spaces

($Dept)? 0 or 1 strings matching $Dept = $4

\s* 0 or more spaces

(.+)$ anything left to the end = $5

/o An option. Compile the expression only once. The variables $MIT and $Dept are not going to change
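The whole expression can be tried against the sample note. In this sketch the pattern, $MIT, and $Dept are copied from the slides; the names given to the five captures ($year, $degree, and so on) are chosen here for illustration.

```perl
use strict;
use warnings;

my $MIT  = 'Massachusetts Institute of Technology\.?|M\.\s?I\.\s?T\.';
my $Dept = '[Dd]epartment\s[Oo]f|[dD]ept\.\s+[Oo]f';

my $note = 'Thesis. 1975. Sc.D.--Massachusetts Institute of Technology. Dept. of Mechanical Engineering';

my ($year, $degree, $school, $dept_label, $dept_name);
if ($note =~ /^Thesis\.\s+(\d+)\.?\s+([\w\.\s]+)--($MIT)\.?\s+($Dept)?\s*(.+)$/o) {
    ($year, $degree, $school, $dept_label, $dept_name) = ($1, $2, $3, $4, $5);
    print "year:       $year\n";
    print "degree:     $degree\n";
    print "school:     $school\n";
    print "department: $dept_name\n";
}
```

Because ($Dept)? is optional, $4 can be undefined for notes that have no "Dept. of" text, so real code would need to test it before use.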

More thesis examples

Massachusetts Institute of Technology. Dept. of Economics. Thesis. 1968. Ph.D.

Massachusetts Institute of Technology, Dept. of Civil Engineering, Thesis. 1965. Sc. D.

/^($MIT)(\.|,)?\s+($Dept)?\s*([\w\s\.,]+)\s+Thesis.\s*(\d{4})\.?\s*(.*)$/o
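As a sketch, the expression above can be run over both sample notes. $MIT and $Dept are the definitions from the earlier slide; the hash keys dept, year, and degree are names made up here.

```perl
use strict;
use warnings;

my $MIT  = 'Massachusetts Institute of Technology\.?|M\.\s?I\.\s?T\.';
my $Dept = '[Dd]epartment\s[Oo]f|[dD]ept\.\s+[Oo]f';

my @notes = (
    'Massachusetts Institute of Technology. Dept. of Economics. Thesis. 1968. Ph.D.',
    'Massachusetts Institute of Technology, Dept. of Civil Engineering, Thesis. 1965. Sc. D.',
);

my @parsed;
foreach my $note (@notes) {
    if ($note =~ /^($MIT)(\.|,)?\s+($Dept)?\s*([\w\s\.,]+)\s+Thesis.\s*(\d{4})\.?\s*(.*)$/o) {
        # $4 = department name, $5 = year, $6 = degree
        push @parsed, { dept => $4, year => $5, degree => $6 };
    }
}

printf "%s | %s | %s\n", @{$_}{qw(dept year degree)} for @parsed;
```

The department capture keeps its trailing period or comma ("Economics." and "Civil Engineering,"), so a later step would need to trim that punctuation.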

More thesis examples

Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Aeronautics and Astronautics, 1973.

Thesis (Sc. D.)--Massachusetts Institute of Technology, Dept. of Aeronautics an Astronautics.

Thesis. (M.S.)--Sloan School of Management, 1983.

Thesis (Sc. D.)--Massachusetts Institute of Technology, Dept. of Mechanical Engineering, 1951.

Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Linguistics and Philosophy, February 2004.

/^Thesis\.?\s*\(([^\)]*)\)(\s*--?\s*|\s+)?(($MIT)[\.,]?)?\s*($Dept)?\s*(.*)(,\s+(\d{4}))?\.?$/o
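One point worth testing when adapting this expression: (.*) is greedy and everything after it is optional, so the trailing year may stay inside the free-text capture instead of landing in its own group. The sketch below uses simplified, hypothetical patterns that keep only the tail of the expression, to compare greedy (.*) with non-greedy (.*?). It assumes Perl 5.10 or later for the // operator.

```perl
use strict;
use warnings;

# Only the tail of a thesis note, to isolate the behaviour being compared.
my $tail = 'Aeronautics and Astronautics, 1973.';

# Free text, then an optional ", YYYY", then an optional period.
my ($greedy_text, $greedy_year) = $tail =~ /^(.*)(?:,\s+(\d{4}))?\.?$/;
my ($lazy_text,   $lazy_year)   = $tail =~ /^(.*?)(?:,\s+(\d{4}))?\.?$/;

printf "greedy: text='%s' year=%s\n", $greedy_text, $greedy_year // 'undef';
printf "lazy:   text='%s' year=%s\n", $lazy_text,   $lazy_year   // 'undef';
```

With the greedy form the year group does not take part in the match; with the non-greedy form the text stops before the comma and the year is captured separately.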

More thesis examples

Thesis (Ph. D.)--Joint Program in Oceanography/Applied Ocean Science and Engineering (Massachusetts Institute of Technology, Dept. of Earth, Atmospheric, and Planetary Sciences; and the Woods Hole Oceanographic Institution), 2013.

/^Thesis\.?\s*\(([^\)]*)\)(\s*--(Joint Program in ([\w\.\s]+)\((($MIT)[\.,]?)?\s*($Dept)?\s*([\w,;\s]+)\)))(,\s+(\d{4}))?\.?$/o

Questions?

[email protected]