Regular expressions 4
Day 9 - 9/15/14
LING 3820 & 6820
Natural Language Processing
Harry Howard
Tulane University
Course organization
15-Sept-2014, NLP, Prof. Howard, Tulane University
http://www.tulane.edu/~howard/LING3820/
The syllabus is under construction.
http://www.tulane.edu/~howard/CompCultEN/
The quiz was the review.
Review
4.3.4. Summary table
meta-character   matches              name           notes
a|b              a or b               disjunction
(ab)             a and b              grouping       only outputs what is in (); (?:ab) for rest of pattern
[ab]             a or b               range          [a-z] lowercase, [A-Z] uppercase, [0-9] digits
[^a]             all but a            negation
a{m,n}           from m to n of a     repetition
a{n}             exactly n of a       repetition
^a               a at start of S
a$               a at end of S
a+               one or more of a
a+?              lazy +
a*               zero or more of a    Kleene star
a*?              lazy *
a?               with or without a    optionality
a??              lazy ?
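A few rows of the table can be tried out on the sample string used later in these slides. This is only a sketch: the variable names and the particular patterns are chosen here for illustration and do not come from the course notes.

```python
import re

S = '''This above all: to thine own self be true,
And it must follow, as the night the day,
Thou canst not then be false to any man.'''

# Disjunction: match either whole word
either = re.findall(r'\bthe\b|\bto\b', S)

# Grouping: findall() returns only what is inside the capturing ()
captured = re.findall(r't(h[a-z]+)', S)

# Non-capturing (?:...) groups without changing what is returned
uncaptured = re.findall(r't(?:h[a-z]+)', S)

# Greedy + takes as much as it can; lazy +? takes as little as it can
line = 'to thine own self be true'
greedy = re.findall(r't.+e', line)
lazy = re.findall(r't.+?e', line)

print(either)      # ['to', 'the', 'the', 'to']
print(captured)    # ['hine', 'he', 'he', 'hen']
print(uncaptured)  # ['thine', 'the', 'the', 'then']
print(greedy)      # ['to thine own self be true']
print(lazy)        # ['to thine', 'true']
```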
There is a bit more to say.
§4. Regular expressions 4
Open Spyder
Sample string
>>> import re
>>> S = '''This above all: to thine own self be true,
... And it must follow, as the night the day,
... Thou canst not then be false to any man.'''
4.4. Character classes
class   abbreviates        name                   notes
\w      [a-zA-Z0-9_]       alphanumeric           it’s really alphanumeric and underscore, but we are lazy
\W      [^a-zA-Z0-9_]      not alphanumeric
\d      [0-9]              digit
\D      [^0-9]             not a digit
\s      [ \t\v\n\r\f]      whitespace
\S      [^ \t\v\n\r\f]     not whitespace
\t                         horizontal tab
\v                         vertical tab
\n                         newline
\r                         carriage return
\f                         form-feed
\b                         word boundary
\B                         not a word boundary
\A      ^
\Z      $
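The classes in the table can be checked on a short string. The string `sample` below is made up for this sketch (it borrows the course numbers from the title slide), and the variable names are illustrative only.

```python
import re

# Hypothetical test string with a tab and several spaces
sample = 'Day 9\tLING 3820 & 6820'

digits = re.findall(r'\d+', sample)       # runs of digits
spaces = re.findall(r'\s', sample)        # each whitespace character
words = re.findall(r'\w+', sample)        # runs of alphanumerics/underscore
non_words = re.findall(r'[^\w\s]', sample)  # neither alphanumeric nor whitespace
first = re.findall(r'\A\w+', sample)      # a word at the very start
last = re.findall(r'\d+\Z', sample)       # digits at the very end

print(digits)     # ['9', '3820', '6820']
print(spaces)     # [' ', '\t', ' ', ' ', ' ']
print(words)      # ['Day', '9', 'LING', '3820', '6820']
print(non_words)  # ['&']
print(first)      # ['Day']
print(last)       # ['6820']
```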
4.4.2. Raw string notation with r''
Python processes the string that holds a regular expression just like any other string, before the regular expression itself is interpreted. This can lead to unexpected results with class meta-characters, because the backslash that they incorporate is also used by Python for its own escape sequences.
For instance, we just met a class meta-character \b, which marks the edge of a word. It will be extremely useful for us, but it happens to overlap with Python’s own escape sequence for the backspace character, \b.
Raw text
The way to resolve this ambiguity is to prefix an r to the string that holds the regular expression. The r marks the string as raw text, so Python does not process its backslashes as escape sequences. The examples below use the raw text notation:
>>> re.findall(r'\b\w\w\b', S)
['to', 'be', 'it', 'as', 'be', 'to']
>>> re.findall(r'\b\w{2}\b', S)
['to', 'be', 'it', 'as', 'be', 'to']
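To see why the r matters, the same pattern can be written with and without it. This is a small sketch; the variable names are illustrative and the word 'to' is picked only because it occurs in the sample string.

```python
import re

S = '''This above all: to thine own self be true,
And it must follow, as the night the day,
Thou canst not then be false to any man.'''

# Without r, Python turns '\b' into a single backspace character
plain_length = len('\b')    # 1
# With r, the backslash and the b are kept as two characters
raw_length = len(r'\b')     # 2

# The plain pattern looks for backspace + 'to' + backspace, which S lacks
without_r = re.findall('\bto\b', S)
# The raw pattern looks for 'to' between word boundaries
with_r = re.findall(r'\bto\b', S)

print(plain_length, raw_length)  # 1 2
print(without_r)                 # []
print(with_r)                    # ['to', 'to']
```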
More raw text
As a further illustration, what do you think are the non-alphanumeric characters in the Shakespeare text?
>>> re.findall(r'\W', S)
[' ', ' ', ':', ' ', ' ', ' ', ' ', ' ', ' ', ',', '\n', ' ', ' ', ' ', ',', ' ', ' ', ' ', ' ', ' ', ',', '\n', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '.']
Practice
4.3.5. Further practice of variable-length matching
4.6. Further practice
Practice with answers on a different page
There is a bit more to say.
§5. Lists 1
Introduction
In working with re.findall(), you have seen many instances of a collection of strings held within square brackets, such as the one below:
>>> S = '''This above all: to thine own self be true,
... And it must follow, as the night the day,
... Thou canst not then be false to any man.'''
>>> re.findall(r'\b[a-zA-Z]{4}\b', S)
['This', 'self', 'true', 'must', 'Thou', 'then']
Definition of list
A list in Python is a sequence of objects delimited by square brackets, []. The objects are separated by commas. Consider this sentence from Shakespeare’s A Midsummer Night’s Dream represented as a list:
>>> L = ['Love', 'looks', 'not', 'with', 'the', 'eyes', ',', 'but', 'with', 'the', 'mind', '.']
>>> type(L)
>>> type(L[0])
L is a list of strings. You may think that a string is also a list of characters, and you would be correct for ordinary English, but in pythonic English, the word ‘list’ refers exclusively to a sequence of objects delimited by square brackets.
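The distinction between a list and a string can be checked directly. The sketch below assumes Python 3 and uses isinstance() instead of reading the output of type(); the variable names are illustrative.

```python
L = ['Love', 'looks', 'not', 'with', 'the', 'eyes', ',',
     'but', 'with', 'the', 'mind', '.']

is_list = isinstance(L, list)      # the whole sequence is a list
item_is_str = isinstance(L[0], str)  # each item is a string

# A string can be converted to a list of its characters,
# but the string and the list are still different objects
chars = list('not')
same = ('not' == chars)

print(is_list, item_is_str)  # True True
print(chars)                 # ['n', 'o', 't']
print(same)                  # False
```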
An example with numerical objects
>>> i = 2
>>> type(i)
>>> I = [0,1,i,3]
>>> type(I)
>>> type(I[0])
>>> n = 2.3
>>> type(n)
>>> N = [2.0,2.1,2.2,n]
>>> type(N)
>>> type(N[0])
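The types that these commands report can be collected in one pass. This is a sketch for Python 3; the helper list `names` is introduced here only for illustration.

```python
i = 2
I = [0, 1, i, 3]
n = 2.3
N = [2.0, 2.1, 2.2, n]

# Name of the type of each object, in the order of the commands above
names = [type(x).__name__ for x in (i, I, I[0], n, N, N[0])]

print(names)  # ['int', 'list', 'int', 'float', 'list', 'float']
```

A list is a list whatever it holds; only the items have the numerical types.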
Most of the string methods work just as well on lists
>>> len(L)
>>> sorted(L)
>>> set(L)
>>> sorted(set(L))
>>> len(sorted(set(L)))
>>> L+'!'
>>> len(L+'!')
>>> L*2
>>> len(L*2)
>>> L.count('the')
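Not every command in this group succeeds, which is presumably why the slide says "most". The sketch below, written for Python 3 with illustrative variable names, shows the ones that work and catches the one that does not.

```python
L = ['Love', 'looks', 'not', 'with', 'the', 'eyes', ',',
     'but', 'with', 'the', 'mind', '.']

length = len(L)              # number of items, not characters
unique = sorted(set(L))      # duplicates removed, then sorted
doubled = L * 2              # repetition works as it does for strings
the_count = L.count('the')   # counts whole items equal to 'the'

# Concatenation needs a list on both sides; a list plus a string fails
try:
    L + '!'
    concat_error = None
except TypeError as err:
    concat_error = err

extended = L + ['!']         # wrapping '!' in a list makes it work

print(length, len(unique), len(doubled), the_count)  # 12 10 24 2
print(type(concat_error).__name__)                   # TypeError
print(len(extended))                                 # 13
```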
String methods work on lists, cont.
>>> L.count('Love')
>>> L.count('love')
>>> L.index('with')
>>> L.rindex('with')
>>> L[2:]
>>> L[:2]
>>> L[-2:]
>>> L[:-2]
>>> L[2:-2]
>>> L[-2:2]
>>> L[:]
>>> L[:-1]+['!']
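Two of these commands behave differently from their string counterparts: lists have no rindex() method, and a slice whose start lies after its end is empty. The sketch below assumes Python 3; the workaround for the last position of an item is one possible approach, not something taken from the course notes.

```python
L = ['Love', 'looks', 'not', 'with', 'the', 'eyes', ',',
     'but', 'with', 'the', 'mind', '.']

love_count = L.count('love')       # counting is case-sensitive
first_with = L.index('with')       # position of the first 'with'
has_rindex = hasattr(L, 'rindex')  # strings have rindex(), lists do not

# One way to find the last 'with': search the reversed list
last_with = len(L) - 1 - L[::-1].index('with')

tail = L[-2:]                      # the last two items
empty = L[-2:2]                    # start is after end, so nothing is left
exclaimed = L[:-1] + ['!']         # swap the final '.' for '!'

print(love_count, first_with, has_rindex, last_with)  # 0 3 False 8
print(tail)           # ['mind', '.']
print(empty)          # []
print(exclaimed[-2:]) # ['mind', '!']
```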
Q1
MIN 5.0 AVG 9.5 MAX 10.0
More on lists
Next time