Fall 2001 EE669: Natural Language Processing

Lecture 4: Corpus-Based Work (Chapter 4 of Manning and Schütze)

Wen-Hsiang Lu (盧文祥), Department of Computer Science and Information Engineering, National Cheng Kung University

2008/10/13 (Slides from Dr. Mary P. Harper, http://min.ecn.purdue.edu/~ee669/)



What is a Corpus?

(1) A collection of texts, especially if complete and self-contained: the corpus of Anglo-Saxon verse.

(2) In linguistics and lexicography, a body of texts, utterances, or other specimens considered more or less representative of a language, and usually stored as an electronic database.

Currently, computer corpora may store many millions of running words, whose features can be analyzed by means of tagging and the use of concordancing programs.

[from The Oxford Companion to the English Language, ed. McArthur & McArthur 1992]


Corpus-Based Work

• Text corpora are usually big, often representative samples of some population of interest. For example, the Brown Corpus collected by Kucera and Francis was designed as a representative sample of written American English. Balance of subtypes (e.g., genre) is often desired.

• Corpus work involves collecting a large number of counts from corpora that need to be accessed quickly.

• There exists some software for processing corpora (see useful links on course homepage).


Taxonomies of Corpora

• Media: printed, electronic text, digitized audio, video, OCR text, etc.

• Raw (plain text) vs. Annotated (use a markup scheme to add codes to the file, e.g., part-of-speech tags)

• Language variables:

– monolingual vs. multilingual

– original vs. translation


Major Suppliers of Corpora

• Linguistic Data Consortium (LDC): http://www.ldc.upenn.edu

• European Language Resources Association (ELRA): http://www.icp.grenet.fr/ELRA/

• Oxford Text Archive (OTA): http://ota.ahds.ac.uk

• Child Language Data Exchange System (CHILDES): http://childes.psy.cmu.edu/

• International Computer Archive of Modern English (ICAME): http://nora.hd.uib.no/icame.html


Software

• Text Editors: e.g., emacs

• Regular Expressions: to identify patterns in text (equivalent to a finite state machine; can process text in linear time).

• Programming Languages: C, C++, Java, Perl, Prolog, etc.

• Programming Techniques:

– Data structures like hash tables are useful for mapping words to numbers.

– Need counts to calculate probabilities (two pass: emit tokens and then count later, e.g., CMU-Cambridge Statistical Language Modeling toolkit).
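A minimal Python sketch of the two programming techniques above: a hash table (here a `dict`) that maps words to integer ids, and a counting pass from which relative-frequency probabilities are computed. The function names and the toy sentence are illustrative assumptions, not part of any course software or of the CMU-Cambridge toolkit.

```python
from collections import Counter


def build_vocab(tokens):
    """Map each distinct word to an integer id, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def unigram_probs(tokens):
    """Relative-frequency estimate P(w) = count(w) / N."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}


# Toy input: the tokens are assumed to have been emitted by an earlier pass.
tokens = "the cat sat on the mat".split()
vocab = build_vocab(tokens)
probs = unigram_probs(tokens)
print(vocab)
print(probs["the"])
```

Keeping token emission separate from counting mirrors the two-pass idea: the counting step does not need to know how the tokens were produced.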


Challenges for Corpus Building

• Low-level formatting issues: dealing with junk and case

• What is a word? -- Tokenization

• To stem or not to stem? tokenization → token (or maybe toke)

• What is a sentence, and how can we detect its boundaries?


Low-Level Formatting Issues

• Junk Formatting/Content: Examples include document headers and separators, typesetter codes, tables and diagrams, garbled data in the file. Problems arise if data was obtained using OCR (unrecognized words). May need to remove junk content before any processing begins.

• Uppercase and Lowercase: Should we keep the case or not? The, the, and THE should all be treated as the same token but White in George White and white in white snow should be treated as distinct tokens. What about sentence initial capitalization (to downcase or not to downcase)?


Tokenization: What is a Word?

• Early in processing, we must divide the input text into meaningful units called tokens (e.g., words, numbers, punctuation).

• Tokenization is the process of breaking input from a text character stream into tokens to be normalized and saved (see Sampson’s 1995 book English for the Computer by Oxford University Press for a carefully designed and tested set of tokenization rules).

• A graphic word token (Kucera and Francis):

– A string of contiguous alphanumeric characters with space on either side which may include hyphens and apostrophes, but no other punctuation marks.

– Problems: Micro$oft or :-)
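The graphic word definition can be tried out with a regular expression. The sketch below is one possible reading of it, assuming ASCII letters and digits only; the pattern, the function names, and the sample sentence are illustrative assumptions. It flags whitespace-delimited chunks that fall outside the definition, such as a price or the smiley.

```python
import re

# A chunk counts as a graphic word if it consists of letters and digits,
# optionally with hyphens and apostrophes inside, and nothing else.
GRAPHIC_WORD = re.compile(r"[A-Za-z0-9][A-Za-z0-9'-]*")


def is_graphic_word(chunk):
    """True if the whole chunk fits the graphic word pattern."""
    return GRAPHIC_WORD.fullmatch(chunk) is not None


def split_on_space(text):
    """Split on white space and report which chunks are graphic words."""
    return [(chunk, is_graphic_word(chunk)) for chunk in text.split()]


result = split_on_space("The so-called dog's dinner cost $22.50 :-)")
for chunk, ok in result:
    print(chunk, ok)
```

Chunks that fail the test are exactly the cases a real tokenizer needs extra rules for.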


Some of the Problems: Period

• Words are not always separated from other tokens by white space. For example, periods may signal an abbreviation (do not separate) or the end of sentence (separate?).

– Abbreviations (haplology): etc. St. Dr.

• A single capital followed by a period, e.g., A. B. C.

• A sequence of letter-period-letter-period’s such as U.S., m.p.h.

• Mt. St. Wash.

– End of sentence? I live on Burt St.
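The abbreviation patterns listed above can be written as regular expressions. The following is a rough sketch, assuming a tiny hand-made abbreviation list; the names `KNOWN_ABBREVS` and `classify_period_token` are hypothetical. Note that it only classifies the token: it does not decide whether a period such as the one in "Burt St." also ends the sentence.

```python
import re

SINGLE_CAPITAL = re.compile(r"[A-Z]\.")                 # A.  B.  C.
LETTER_PERIOD_SEQ = re.compile(r"(?:[A-Za-z]\.){2,}")   # U.S.  m.p.h.
KNOWN_ABBREVS = {"etc.", "St.", "Dr.", "Mt.", "Wash."}  # illustrative list only


def classify_period_token(token):
    """Label a whitespace-delimited token according to its final period."""
    if not token.endswith("."):
        return "no final period"
    if SINGLE_CAPITAL.fullmatch(token):
        return "initial"
    if LETTER_PERIOD_SEQ.fullmatch(token):
        return "letter-period sequence"
    if token in KNOWN_ABBREVS:
        return "known abbreviation"
    return "word + period"


for tok in ["A.", "U.S.", "m.p.h.", "St.", "dog."]:
    print(tok, "->", classify_period_token(tok))
```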


Some of the Problems: Apostrophes

• How should contractions and clitics be regarded? One or two tokens?

– I’ll or I ’ll

– The dog’s food or The dog ’s food

– The boys’ club

• From the perspective of parsing, I’ll needs to be separated into two tokens because there is no category that combines nouns and verbs together.
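A small sketch of the two-token treatment, assuming a short illustrative list of English clitics; the pattern and the function name are hypothetical. A bare plural possessive such as boys' is left as one token here, which is only one of the possible choices.

```python
import re

# Host word followed by an apostrophe (straight or curly) and a clitic.
CLITIC = re.compile(
    r"^(?P<host>.+?)(?P<clitic>['’](?:ll|s|re|ve|d|m))$", re.IGNORECASE
)


def split_clitic(token):
    """Return [host, clitic] if the token ends in a known clitic, else [token]."""
    match = CLITIC.match(token)
    if match:
        return [match.group("host"), match.group("clitic")]
    return [token]


for tok in ["I'll", "dog's", "boys'", "dog"]:
    print(tok, "->", split_clitic(tok))
```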


Some of the Problems: Hyphens

• How should we deal with hyphens? Are hyphenated words composed of one or multiple tokens? Usage:

1. Typographical, to improve the right margins of a document: typically the hyphens should be removed since breaks occur at syllable boundaries; however, the hyphen may be part of the word too.

2. Lexical hyphens: inserted before or after small word formatives (e.g., co-operate, so-called, pro-university).

3. Word grouping: Take-it-or-leave-it, once-in-a-lifetime, text-based, etc.

• How many lexemes will you allow?

– Data base, data-base, database

– Cooperate, Co-operate

– Mark-up, mark up
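Two of the hyphen cases above can be illustrated in a few lines of Python. This is a sketch under simple assumptions, and both function names are hypothetical: the first removes hyphens at line breaks, the second collapses spelling variants onto one key. The first function also shows the risk noted above, since it removes a genuine lexical hyphen that happens to fall at a line break.

```python
import re


def join_line_break_hyphens(text):
    """Remove a hyphen that is followed by a line break inside a word."""
    return re.sub(r"(\w)-\n\s*(\w)", r"\1\2", text)


def lexeme_key(form):
    """Collapse case, hyphens, and spaces so spelling variants share a key."""
    return re.sub(r"[-\s]+", "", form.lower())


print(join_line_break_hyphens("tokeni-\nzation"))  # typographical hyphen
print(join_line_break_hyphens("text-\nbased"))     # lexical hyphen is lost too
print({lexeme_key(f) for f in ["Data base", "data-base", "database"]})
```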


Some of the Problems: Hyphens

• Authors may not be consistent with hyphenation, e.g., cooperate and co-operate may appear in the same document.

• Dashes can be used as punctuation without separating them from words with space: I am happy-Bill is not.


Different Formats in Text Pattern


Some of the Problems: Homographs

• In some cases, lexemes have overlapping forms (homographs) as in:

– I saw the dog.

– When you saw the wood, please wear safety goggles.

– The saw is sharp.

• These forms will need to be distinguished for part-of-speech tagging.


Some of the Problems: No space between Words

• There are no separators between words in languages like Chinese, so English tokenization methods are irrelevant.

• Waterloo is located in the south of Canada.

• Compounds in German:

Lebensversicherungsgesellschaftsangestellter


Some of the Problems: Spaces within Words

• Sometimes spaces occur in the middle of something that we would prefer to call a single token:

– Phone number: 765 494 3654

– Names: Mr. John Smith, New York, U. S. A.

– Verb plus particle: work out, make up


Some of the Problems: Multiple Formats

• Numbers (format plus ambiguous separator):

– English: 123,456.78

• [0-9]+([,][0-9]+)*([.][0-9]+)?

– French: 123 456,78

• [0-9]+([ ][0-9]+)*([,][0-9]+)?

• There are also multiple formats for:

– Dates

– Phone numbers

– Addresses

– Names
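The sketch below checks number patterns in the spirit of the two expressions above with Python's `re` module and converts a match to a float. The pattern names and `parse_number` are hypothetical, and the patterns deliberately do not enforce groups of three digits. The last call shows why the separator is ambiguous: the same string is a valid number under both conventions, with different values.

```python
import re

ENGLISH_NUMBER = re.compile(r"[0-9]+(?:,[0-9]+)*(?:\.[0-9]+)?")  # 123,456.78
FRENCH_NUMBER = re.compile(r"[0-9]+(?: [0-9]+)*(?:,[0-9]+)?")    # 123 456,78


def parse_number(text, style):
    """Convert a number written in the given style ('en' or 'fr') to a float."""
    if style == "en":
        if not ENGLISH_NUMBER.fullmatch(text):
            raise ValueError(f"not an English-style number: {text!r}")
        return float(text.replace(",", ""))
    if style == "fr":
        if not FRENCH_NUMBER.fullmatch(text):
            raise ValueError(f"not a French-style number: {text!r}")
        return float(text.replace(" ", "").replace(",", "."))
    raise ValueError(f"unknown style: {style!r}")


print(parse_number("123,456.78", "en"))
print(parse_number("123 456,78", "fr"))
print(parse_number("123,456", "en"), parse_number("123,456", "fr"))
```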


Morphology: What Should I Put in My Dictionary?

• Should all word forms be stored in the lexicon? Probably ok for English (little morphology) but not for Czech or German (lots of forms!)

• Stemming: Strip off affixes and leave the stem (lemma).

– Not that helpful in English (from an IR point of view).

– Perhaps more useful for other languages or in other contexts.

• Treating multi-word expressions as a single token can help.


What is a Sentence?

• Something ending with a ‘.’, ‘?’ or ‘!’. True in 90% of the cases.

• Sentences may be split up by other punctuation marks (e.g., : ; --).

• Sentences may be broken up, as in: “You should be here,” she said, “before I know it!”

• Quote marks may be at the very end of the sentence.

• Identifying sentence boundaries can involve heuristic methods that are hand-coded. There have also been efforts to automate sentence-boundary detection.


Heuristic Algorithm

• Place putative sentence boundaries after all occurrences of . ? !

• Move the boundary after following quotation marks, if any.

• Disqualify a period boundary in the following circumstances:

– If it is preceded by a known abbreviation of a sort that does not normally occur word finally, but is commonly followed by a capitalized proper name, such as Prof. or vs.


Heuristic Algorithm (cont.)

– If it is preceded by a known abbreviation and not followed by an uppercase word. This will deal correctly with most usage of abbreviations like etc. or Jr. which can occur sentence medially or finally.

• Disqualify a boundary with a ? or ! if:

– It is followed by a lowercase letter (or a known name).

• Regard other putative sentence boundaries as sentence boundaries.
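The heuristic can be sketched in Python over whitespace-delimited tokens. This is a simplified illustration, not a reference implementation: the two abbreviation sets are tiny hypothetical lists, the "known name" exception for ? and ! is omitted, and a period inside a token (as in p.m. written without a list entry) is not handled specially.

```python
# Hypothetical abbreviation lists, for illustration only.
TITLE_ABBREVS = {"Prof.", "Dr.", "Mr.", "Mrs.", "vs."}  # rarely sentence-final
OTHER_ABBREVS = {"etc.", "Jr.", "St.", "Inc."}          # may end a sentence

CLOSERS = "\"'”’)"   # quotation marks that may follow the punctuation
OPENERS = "\"'“‘("   # quotation marks that may precede the next word


def split_sentences(text):
    """Split text into sentences using the hand-coded heuristic."""
    tokens = text.split()
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        # Looking past closing quotes moves the boundary after them.
        core = tok.rstrip(CLOSERS)
        if not core or core[-1] not in ".?!":
            continue  # no putative boundary here
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        next_char = nxt.lstrip(OPENERS)[:1]
        if core[-1] == ".":
            if core in TITLE_ABBREVS:
                continue  # e.g. Prof. or vs.
            if core in OTHER_ABBREVS and not next_char.isupper():
                continue  # e.g. etc. used sentence medially
        elif next_char.islower():
            continue      # ? or ! followed by a lowercase letter
        sentences.append(" ".join(current))
        current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


text = ('Prof. Smith arrived. "Is he here?" she asked. '
        'They sell pens, paper, etc. at the shop. I live on Burt St. It is quiet.')
for sentence in split_sentences(text):
    print(sentence)
```

The quality of the result depends almost entirely on the abbreviation lists, which is one motivation for the adaptive approach.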


Adaptive Sentence Boundary Detect

• The group included Dr. J. M. Freeman and T. Boone Pickens Jr.

• David D. Palmer, Marti A. Hearst, Adaptive Sentence Boundary Disambiguation, Technical Report 97/94, UC Berkeley: 98-99% correct

• The part-of-speech probabilities of the tokens surrounding a punctuation mark are input to a feed forward neural network, and the network’s output activation value indicates the role of the punctuation.


Adaptive Sentence Boundary Detect (cont.)

• To avoid a processing cycle (POS tagging itself usually presupposes sentence boundaries), instead of assigning a single POS to each word, the algorithm uses the prior probabilities of all 20 POS categories for that word.

• Input: k*20, where k is the number of words of context surrounding an instance of an end-of-sentence punctuation mark.

• K hidden units with sigmoid squashing activation function.

• 1 output unit indicates the result of the function.
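The shape of this network can be sketched as a forward pass in plain Python. Everything concrete here is an assumption made for illustration: the weights are random and untrained, the number of hidden units is a parameter, the context size of 6 words is arbitrary, and training (for example by backpropagation) is omitted. Only the layout follows the slide: k*20 inputs, sigmoid hidden units, one output.

```python
import math
import random

N_POS = 20  # prior probabilities for 20 POS categories per context word


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_network(k, n_hidden, seed=0):
    """Random, untrained weights; the last weight in each row is the bias."""
    rng = random.Random(seed)
    n_in = k * N_POS
    hidden = [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)]
              for _ in range(n_hidden)]
    output = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden + 1)]
    return hidden, output


def forward(network, context):
    """context: k vectors, each holding 20 prior POS probabilities."""
    hidden, output = network
    x = [p for word in context for p in word]  # flatten to k*20 inputs
    h = [sigmoid(w[-1] + sum(wi * xi for wi, xi in zip(w, x))) for w in hidden]
    return sigmoid(output[-1] + sum(wi * hi for wi, hi in zip(output, h)))


k = 6
context = [[1.0 / N_POS] * N_POS for _ in range(k)]  # uniform priors as dummy input
activation = forward(make_network(k, n_hidden=k), context)
print(activation)  # a value in (0, 1); a threshold would decide "boundary" or not
```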


Marking up Data: Mark-up Schemes

• Plain text corpora are useful, but more can be learned if information is added.

– Boundaries for sentences, paragraphs, etc.

– Lexical tags

– Syntactic Structure

– Semantic Representation

– Semantic class

• Different Mark-up schemes:

– COCOA format (header information in texts, e.g., author, date, title): uses angle brackets, with the first letter indicating the broad semantics of the field.

– Standard Generalized Markup Language or SGML (related: HTML, TEI, XML)


SGML Examples

• <p> <s> This book does not delve very deeply into SGML. </s> … <s> In XML, such empty elements may be specifically marked by ending the tag name with a forward slash character. </s> </p>

• <utt speak="Mary" date="now"> SGML can be very useful. </utt>

• Character and Entity codes: begin with ampersand and end with semicolon

– &#x3C; is the less than symbol → < is the less than symbol

– r&eacute;sum&eacute; → résumé
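Character and entity references, and sentence mark-up of this kind, can be explored with Python's standard library. This sketch uses the HTML entity decoder and the XML parser as stand-ins, which is an assumption: neither is a full SGML parser, and the small marked-up paragraph is made up for the example.

```python
import html
import xml.etree.ElementTree as ET

# Numeric character reference and named entity reference.
decoded_lt = html.unescape("&#x3C; is the less than symbol")
decoded_word = html.unescape("r&eacute;sum&eacute;")
print(decoded_lt)
print(decoded_word)

# Sentence boundaries marked with <s> inside a paragraph <p>.
doc = ET.fromstring("<p><s>SGML can be very useful.</s><s>So can XML.</s></p>")
sentences = [s.text for s in doc.iter("s")]
print(sentences)
```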


Marking up Data: Grammatical Coding

• Tagging corresponds to indicating the various conventional parts of speech. Tagging can be done automatically (we will talk about that in a later lecture).

• Different Tag Sets have been used, e.g., Brown Tag Set, University of Lancaster Tag Set, Penn Treebank Tag Set, British National Corpus (CLAWS*), Czech National Corpus

• The Design of a Tag Set:

– Target Features: useful information on the grammatical class

– Predictive Features: useful for predicting behavior of other words in context (e.g., distinguish modals and auxiliary verbs from regular verbs)


Penn Treebank Set

• Pronoun: PRP, PRP$, WP, WP$, EX

• Verb: VB, VBP, VBZ, VBD, VBG, VBN (have, be, and do are not distinguished)

• Infinitive marker (to): TO

• Preposition to: TO

• Other prepositions: IN

• Punctuation: . ; , - $ ( ) `` ’’

• FW, SYM, LS

• Adjective: JJ, JJR, JJS

• Cardinal: CD

• Adverb: RB, RBR, RBS, WRB

• Conjunction: CC, IN (subordinating and that)

• Determiner: DT, PDT, WDT

• Noun: NN, NNS, NNP, NNPS (no distinction for adverbial)


Tag Sets

• General definition:

– Tags can be represented as a vector: (c1, c2, ..., cn)

– Thought of as a flat list T = {t_i}, i = 1..n, with some assumed 1:1 mapping T ↔ (C1, C2, ..., Cn)

• English tagsets:

– Penn treebank (45) (VBZ: Verb, Pres, 3, sg; JJR: Adj., Comp.)

– Brown Corpus (87), Claws c5 (62), London-Lund (197)


Tag Sets for other Languages

• Differences:

– Larger number of tags

– categories covered (POS, Number, Case, Negation,...)

– level of detail

– presentation (short names vs. structured (“positional”))

• Example:

– Czech: AGFS3----1A----

Position labels: POS, SUBPOS, GENDER, NUMBER, CASE, POSSG, POSSN, PERSON, TENSE, DCOMP, NEG, VOICE, VAR
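A positional tag can be unpacked by pairing each character with the name of its position. The sketch below assumes the 15-character layout of the example tag; the slide names only 13 positions, so the two slots labelled `RESERVE1` and `RESERVE2` (placed between VOICE and VAR) are an assumption, as is the function name. A hyphen is read as "not applicable" and dropped.

```python
# Position names from the slide, plus two assumed reserve slots.
POSITIONS = [
    "POS", "SUBPOS", "GENDER", "NUMBER", "CASE",
    "POSSG", "POSSN", "PERSON", "TENSE", "DCOMP",
    "NEG", "VOICE", "RESERVE1", "RESERVE2", "VAR",
]


def decode_positional_tag(tag):
    """Map each filled position of a positional tag to its category name."""
    if len(tag) != len(POSITIONS):
        raise ValueError(f"expected {len(POSITIONS)} characters, got {len(tag)}")
    return {name: value for name, value in zip(POSITIONS, tag) if value != "-"}


decoded = decode_positional_tag("AGFS3----1A----")
print(decoded)
```

Compared with short tag names such as VBZ, the positional form makes the vector view of a tag explicit: each character is one component.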


Sentence Length Distribution
