
Basic Text Processing - people.cs.georgetown.edu


Page 1

Basic Text Processing

Regular Expressions
SLP3 slides (Jurafsky & Martin)

Regular expressions
• A formal language for specifying text strings
• How can we search for any of these?
  • woodchuck
  • woodchucks
  • Woodchuck
  • Woodchucks

Page 2

Regular Expressions: Disjunctions
• Letters inside square brackets []
• Ranges [A-Z]

Pattern        Matches
[wW]oodchuck   Woodchuck, woodchuck
[1234567890]   Any digit

Pattern   Matches                  Example
[A-Z]     An uppercase letter      Drenched Blossoms
[a-z]     A lowercase letter       my beans were impatient
[0-9]     A single digit           Chapter 1: Down the Rabbit Hole

Regular Expressions: Negation in Disjunction
• Negations [^Ss]
• Carat means negation only when first in []

Pattern   Matches                   Example
[^A-Z]    Not an uppercase letter   Oyfn pripetchik
[^Ss]     Neither ‘S’ nor ‘s’       “I have no exquisite reason”
[^e^]     Neither e nor ^           Look here
a^b       The pattern a carat b     Look up a^b now
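These character classes can be tried directly in Python's re module; a minimal sketch (the sample text is invented for illustration):

```python
import re

# Character classes: [wW]oodchuck matches either capitalization.
text = "Woodchuck and woodchucks; call 867-5309."
print(re.findall(r"[wW]oodchuck", text))   # ['Woodchuck', 'woodchuck']
print(re.findall(r"[0-9]", text))          # each digit, one at a time

# The carat negates only when it is first inside []; elsewhere it is
# literal, and outside [] it must be escaped to match a literal '^'.
print(re.findall(r"a\^b", "Look up a^b now"))  # ['a^b']
```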

Page 3

Regular Expressions: More Disjunction
• Woodchucks is another name for groundhog!
• The pipe | for disjunction

Pattern                       Matches
groundhog|woodchuck
yours|mine                    yours, mine
a|b|c  (equivalent to [abc])
[gG]roundhog|[Ww]oodchuck

Regular Expressions: ? * + .
(Stephen C. Kleene, namesake of Kleene * and Kleene +)

Pattern   Matches
colou?r   Optional previous char       color, colour
oo*h!     0 or more of previous char   oh! ooh! oooh! ooooh!
o+h!      1 or more of previous char   oh! ooh! oooh! ooooh!
baa+                                   baa, baaa, baaaa, baaaaa
beg.n                                  begin begun begun beg3n
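A quick re sketch of these quantifiers (the sample strings are invented):

```python
import re

# '?' makes the previous character optional.
print(re.findall(r"colou?r", "color colour"))      # ['color', 'colour']

# '+' means one or more of the previous character.
print(re.findall(r"o+h!", "oh! ooh! oooh!"))       # ['oh!', 'ooh!', 'oooh!']

# '.' matches any single character.
print(re.findall(r"beg.n", "begin begun beg3n"))   # ['begin', 'begun', 'beg3n']
```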

Page 4

Regular Expressions: Anchors ^ $

Pattern      Matches
^[A-Z]       Palo Alto
^[^A-Za-z]   1 “Hello”
\.$          The end.
.$           The end? The end!

Example
• Find me all instances of the word “the” in a text.
  the                        Misses capitalized examples
  [tT]he                     Incorrectly returns other or theology
  [^a-zA-Z][tT]he[^a-zA-Z]
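The same progression in Python's re module; the \b word-boundary form at the end is a common refinement beyond the slide's pattern (the sample sentence is invented):

```python
import re

text = "The other day, the theology student said THE word."
# Naive: misses 'The', matches inside 'other' and 'theology'.
print(re.findall(r"the", text))
# The slide's fix: require a non-letter on each side.
print(re.findall(r"[^a-zA-Z][tT]he[^a-zA-Z]", text))
# \b word boundaries avoid consuming the neighboring characters.
print(re.findall(r"\b[tT]he\b", text))   # ['The', 'the']
```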

Page 5

Refer to http://people.cs.georgetown.edu/nschneid/cosc272/f17/02_py-notes.html and links on that page for further regex notation, and advice for using regexes in Python 3.

Errors

• The process we just went through was based on fixing two kinds of errors
  • Matching strings that we should not have matched (there, then, other)
    • False positives (Type I)
  • Not matching things that we should have matched (The)
    • False negatives (Type II)

Page 6

Errors cont.

• In NLP we are always dealing with these kinds of errors.
• Reducing the error rate for an application often involves two antagonistic efforts:
  • Increasing accuracy or precision (minimizing false positives)
  • Increasing coverage or recall (minimizing false negatives)
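The trade-off can be made concrete by scoring a pattern against hand-labeled tokens; the tiny corpus and gold labels below are invented for illustration:

```python
import re

# Toy corpus: which tokens are truly the word "the"? (invented labels)
tokens = ["The", "other", "the", "theology", "then", "THE"]
gold = {0, 2, 5}                       # indices that really are "the"
pred = {i for i, t in enumerate(tokens) if re.fullmatch(r"[tT]he", t)}

tp = len(gold & pred)                  # true positives
precision = tp / len(pred)             # high precision = few false positives
recall = tp / len(gold)                # high recall = few false negatives
print(precision, recall)               # the pattern misses "THE": recall < 1
```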

Summary

• Regular expressions play a surprisingly large role
  • Sophisticated sequences of regular expressions are often the first model for any text processing task
• For many hard tasks, we use machine learning classifiers
  • But regular expressions are used as features in the classifiers
  • Can be very useful in capturing generalizations

Page 7

Basic Text Processing

Regular Expressions

Basic Text Processing

Word tokenization

Page 8

Text Normalization
• Every NLP task needs to do text normalization:
  1. Segmenting/tokenizing words in running text
  2. Normalizing word formats
  3. Segmenting sentences in running text

How many words?
• I do uh main- mainly business data processing
  • Fragments, filled pauses
• Seuss’s cat in the hat is different from other cats!
  • Lemma: same stem, part of speech, rough word sense
    • cat and cats = same lemma
  • Wordform: the full inflected surface form
    • cat and cats = different wordforms

Page 9

How many words?
they lay back on the San Francisco grass and looked at the stars and their

• Type: an element of the vocabulary.
• Token: an instance of that type in running text.
• How many?
  • 15 tokens (or 14)
  • 13 types (or 12) (or 11?)
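A whitespace-split sketch of the token/type counts for the slide's sentence (whitespace tokenization is itself a simplification):

```python
# Token and type counts for the slide's example sentence.
text = ("they lay back on the San Francisco grass "
        "and looked at the stars and their")
tokens = text.split()
print(len(tokens))        # 15 tokens
print(len(set(tokens)))   # 13 types: "the" and "and" each occur twice
# 12 types if "San Francisco" counts as one name; fewer still with lemmas.
```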

How many words?

N = number of tokens
V = vocabulary = set of types
|V| is the size of the vocabulary

                                  Tokens = N    Types = |V|
Switchboard phone conversations   2.4 million   20 thousand
Shakespeare                       884,000       31 thousand
Google N-grams                    1 trillion    13 million

Church and Gale (1990): |V| > O(N½)

Page 10

Tokenization

Normal written conventions sometimes do not reflect the “logical” organization of textual symbols. For example, some punctuation marks are written adjacent to the previous or following word, even though they are not part of it. (The details vary according to language and style guide!)

Given a string of raw text, a tokenizer adds logical boundaries between separate word/punctuation tokens (occurrences) not already separated by spaces:

Daniels made several appearances as C-3PO on numerous TV shows and commercials, notably on a Star Wars-themed episode of The Donny and Marie Show in 1977, Disneyland’s 35th Anniversary.

Daniels made several appearances as C-3PO on numerous TV shows and commercials , notably on a Star Wars - themed episode of The Donny and Marie Show in 1977 , Disneyland ’s 35th Anniversary .

To a large extent, this can be automated by rules. But there are always difficult cases.

Nathan Schneider ANLP (COSC/LING-272) Lecture 2 27


Tokenization in NLTK

>>> nltk.word_tokenize("Daniels made several appearances as C-3PO on numerous TV shows and commercials, notably on a Star Wars-themed episode of The Donny and Marie Show in 1977, Disneyland's 35th Anniversary.")
['Daniels', 'made', 'several', 'appearances', 'as', 'C-3PO', 'on', 'numerous', 'TV', 'shows', 'and', 'commercials', ',', 'notably', 'on', 'a', 'Star', 'Wars-themed', 'episode', 'of', 'The', 'Donny', 'and', 'Marie', 'Show', 'in', '1977', ',', 'Disneyland', "'s", '35th', 'Anniversary', '.']


Page 11

Tokenization in NLTK


English tokenization conventions vary somewhat, e.g. with respect to:

• clitics (contracted forms): ’s, n’t, ’re, etc.

• hyphens in compounds like president-elect (fun fact: this convention changed between versions of the Penn Treebank!)


Issues in Tokenization

• Finland’s capital → Finland  Finlands  Finland’s ?
• what’re, I’m, isn’t → what are, I am, is not
• Hewlett-Packard → Hewlett Packard ?
• state-of-the-art → state of the art ?
• Lowercase → lower-case  lowercase  lower case ?
• San Francisco → one token or two?
• m.p.h., PhD. → ??
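One illustrative way to handle a few of these cases with a single regex; this is a toy sketch, not NLTK's or the Penn Treebank's actual rules:

```python
import re

# Toy tokenizer: split off clitics and punctuation, keep hyphenated
# compounds and dotted abbreviations together. Alternation order matters.
PATTERN = re.compile(r"""
      [A-Za-z]+(?:\.[A-Za-z]+)+\.?   # abbreviations like m.p.h., Ph.D.
    | [A-Za-z]+(?:-[A-Za-z]+)+       # hyphenated compounds stay whole
    | [A-Za-z]+(?=n't)               # word stem before an n't clitic
    | n't | '(?:s|re|m|ve|ll|d)      # clitics become their own tokens
    | [A-Za-z]+                      # ordinary words
    | \d+                            # numbers
    | [^\w\s]                        # punctuation, one mark at a time
    """, re.VERBOSE)

print(PATTERN.findall("Hewlett-Packard isn't in Finland's capital."))
# ['Hewlett-Packard', 'is', "n't", 'in', 'Finland', "'s", 'capital', '.']
```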

Page 12

Tokenization: language issues
• French
  • L'ensemble → one token or two?
    • L ? L’ ? Le ?
    • Want l’ensemble to match with un ensemble
• German noun compounds are not segmented
  • Lebensversicherungsgesellschaftsangestellter
  • ‘life insurance company employee’
  • German information retrieval needs compound splitter

Tokenization: language issues
• Chinese and Japanese: no spaces between words:
  • 莎拉波娃现在居住在美国东南部的佛罗里达。
  • 莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达
  • Sharapova now lives in US southeastern Florida
• Further complicated in Japanese, with multiple alphabets intermingled
  • Dates/amounts in multiple formats

フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
Katakana  Hiragana  Kanji  Romaji
End-user can express query entirely in hiragana!

Page 13

Word Tokenization in Chinese

• Also called Word Segmentation
• Chinese words are composed of characters
  • Characters are generally 1 syllable and 1 morpheme.
  • Average word is 2.4 characters long.
• Standard baseline segmentation algorithm:
  • Maximum Matching (also called Greedy)

Basic Text Processing

Word tokenization

Page 14

Basic Text Processing

Word Normalization and Stemming

Normalization
• Need to “normalize” terms
  • Information Retrieval: indexed text & query terms must have same form.
    • We want to match U.S.A. and USA
• We implicitly define equivalence classes of terms
  • e.g., deleting periods in a term
• Alternative: asymmetric expansion:
  • Enter: window    Search: window, windows
  • Enter: windows   Search: Windows, windows, window
  • Enter: Windows   Search: Windows
• Potentially more powerful, but less efficient

Page 15

Case folding
• Applications like IR: reduce all letters to lower case
  • Since users tend to use lower case
  • Possible exception: upper case in mid-sentence?
    • e.g., General Motors
    • Fed vs. fed
    • SAIL vs. sail
• For sentiment analysis, MT, information extraction
  • Case is helpful (US versus us is important)

Lemmatization
• Reduce inflections or variant forms to base form
  • am, are, is → be
  • car, cars, car's, cars' → car
  • the boy's cars are different colors → the boy car be different color
• Lemmatization: have to find correct dictionary headword form
• Machine translation
  • Spanish quiero (‘I want’), quieres (‘you want’) same lemma as querer ‘want’
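A toy dictionary-based lemmatizer, illustrating why plain headword lookup does not scale (the table below is hand-built; real lemmatizers use morphological analysis):

```python
# Hand-built headword table covering the slide's examples (invented scope).
LEMMAS = {"am": "be", "are": "be", "is": "be",
          "cars": "car", "car's": "car", "cars'": "car",
          "quiero": "querer", "quieres": "querer"}

def lemmatize(word):
    # Unlisted forms pass through unchanged, which is exactly the
    # problem: "boy's" and "colors" stay uninflected here.
    return LEMMAS.get(word.lower(), word.lower())

print([lemmatize(w) for w in "the boy's cars are different colors".split()])
```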

Page 16

Morphology
• Morphemes:
  • The small meaningful units that make up words
  • Stems: The core meaning-bearing units
  • Affixes: Bits and pieces that adhere to stems
    • Often with grammatical functions

Stemming
• Reduce terms to their stems in information retrieval
• Stemming is crude chopping of affixes
  • language dependent
  • e.g., automate(s), automatic, automation all reduced to automat.

for example compressed and compression are both accepted as equivalent to compress.

for exampl compress and compress ar both accept as equival to compress
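A deliberately crude suffix chopper in the spirit of, but much simpler than, the Porter stemmer (the suffix list and length guard here are invented; Porter uses ordered, conditioned rewrite rules):

```python
def crude_stem(word):
    """Repeatedly chop a known suffix off the end, keeping a minimum
    stem length, until nothing more applies."""
    changed = True
    while changed:
        changed = False
        for suffix in ("ion", "ed", "ing", "es", "s", "e"):
            if suffix == "s" and word.endswith("ss"):
                continue                      # keep 'compress' intact
            if word.endswith(suffix) and len(word) - len(suffix) >= 4:
                word = word[: -len(suffix)]
                changed = True
                break
    return word

for w in ("compressed", "compression", "automates", "automation", "example"):
    print(w, "->", crude_stem(w))
# compressed/compression -> compress; automates/automation -> automat;
# example -> exampl. ('automatic' would need an 'ic' rule, as Porter has.)
```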

Page 17

Basic Text Processing

Word Normalization and Stemming

Basic Text Processing

Sentence Segmentation and Decision Trees

Page 18

Sentence Segmentation
• !, ? are relatively unambiguous
• Period “.” is quite ambiguous
  • Sentence boundary
  • Abbreviations like Inc. or Dr.
  • Numbers like .02% or 4.3
• Build a binary classifier
  • Looks at a “.”
  • Decides EndOfSentence/NotEndOfSentence
  • Classifiers: hand-written rules, regular expressions, or machine learning
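One way to sketch such hand-written rules; the abbreviation list and heuristics are invented for illustration, not a production segmenter:

```python
import re

ABBREVS = {"Dr.", "Inc.", "Mr.", "Mrs.", "St."}   # tiny hand-built list

def is_eos(text, i):
    """Decide whether the '.' at index i ends a sentence (rule sketch)."""
    token = text[:i + 1].split()[-1]          # word the period is attached to
    if token in ABBREVS:
        return False                          # abbreviation, e.g. Dr.
    if re.search(r"\d\.\d", text[max(0, i - 1):i + 2]):
        return False                          # decimal number, e.g. 4.3
    return True

text = "Dr. Smith paid $4.35 yesterday. She left."
dots = [i for i, c in enumerate(text) if c == "."]
print([is_eos(text, i) for i in dots])        # [False, False, True, True]
```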

Determining if a word is end-of-sentence: a Decision Tree

Page 19

More sophisticated decision tree features
• Case of word with “.”: Upper, Lower, Cap, Number
• Case of word after “.”: Upper, Lower, Cap, Number
• Numeric features
  • Length of word with “.”
  • Probability(word with “.” occurs at end-of-s)
  • Probability(word after “.” occurs at beginning-of-s)

Implementing Decision Trees
• A decision tree is just an if-then-else statement
• The interesting research is choosing the features
• Setting up the structure is often too hard to do by hand
  • Hand-building only possible for very simple features, domains
  • For numeric features, it’s too hard to pick each threshold
• Instead, structure usually learned by machine learning from a training corpus
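The "decision tree is just an if-then-else statement" idea can be sketched with the features above; the thresholds here are invented, which is exactly what makes hand-building hard:

```python
def eos_decision(word_with_dot, next_word, p_end, p_begin):
    """Nested if-then-else over the slide's features: case and length of
    the word with '.', case of the next word, and two probabilities.
    All thresholds are invented for illustration."""
    if word_with_dot.istitle() and len(word_with_dot) <= 4:
        # Short capitalized token: likely an abbreviation (Dr., Inc.)
        if p_end < 0.1:
            return "NotEOS"
    if next_word[:1].isupper() and p_begin > 0.5:
        return "EOS"
    return "EOS" if p_end > 0.5 else "NotEOS"

print(eos_decision("Dr.", "Smith", p_end=0.01, p_begin=0.9))       # NotEOS
print(eos_decision("yesterday.", "She", p_end=0.8, p_begin=0.9))   # EOS
```

In practice these thresholds are exactly what a learner would set from a training corpus, as the slide notes.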

Page 20

Decision Trees and other classifiers
• We can think of the questions in a decision tree
  • As features that could be exploited by any kind of classifier
    • Logistic regression
    • SVM
    • Neural Nets
    • etc.

Basic Text Processing

Sentence Segmentation and Decision Trees