School of Computing, FACULTY OF ENGINEERING. Word Bi-grams and PoS Tags. COMP3310 Natural Language Processing. Eric Atwell, Language Research Group (with thanks to Katja Markert, Marti Hearst, and other contributors)


School of Computing, FACULTY OF ENGINEERING

Word Bi-grams and PoS Tags

COMP3310 Natural Language Processing

Eric Atwell, Language Research Group

(with thanks to Katja Markert, Marti Hearst, and other contributors)


Reminder

FreqDist counts of tokens and their distribution can be useful

Eg find main characters in Gutenberg texts

Eg compare word-lengths in different languages

Humans can predict the next word …

N-gram models are based on counts in a large corpus

Auto-generate a story ... (but gets stuck in a local maximum)

Grammatical trends: modal verb distribution predicts genre


Why do puns make us groan?

He drove his expensive car into a tree and found out how the Mercedes bends.

Isn't the Grand Canyon just gorges?

Time flies like an arrow. Fruit flies like a banana.


Predicting Next Words

One reason puns make us groan is they play on our assumptions of what the next word will be – human language processing involves predicting the most probable next word

They also exploit

• homophony (a kind of homonymy) – same sound, different spelling and meaning (bends, Benz; gorges, gorgeous)

• polysemy – same spelling, different meaning

NLP programs can also make use of word-sequence modeling


Auto-generate a Story

How to fix this (the generator getting stuck)? Use a random number generator.


Auto-generate a Story

The choice() function chooses one item randomly from a list (from random import *)
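A minimal sketch of the idea, in plain Python rather than NLTK so that it runs without any corpus download. The toy sentence and the helper names (build_followers, generate, most_frequent) are assumptions made for illustration; in the course, ConditionalFreqDist plays the role of the followers table. It contrasts always picking the most frequent next word, which loops, with picking at random using choice().

```python
import random
from collections import Counter, defaultdict


def build_followers(tokens):
    """Map each word to the list of words seen immediately after it."""
    followers = defaultdict(list)
    for first, second in zip(tokens, tokens[1:]):
        followers[first].append(second)
    return dict(followers)


def generate(followers, start, length, pick):
    """Generate up to `length` words; `pick` selects the next word
    from the list of observed followers of the current word."""
    words = [start]
    while len(words) < length and words[-1] in followers:
        words.append(pick(followers[words[-1]]))
    return words


def most_frequent(candidates):
    """Greedy choice: the follower that was observed most often."""
    return Counter(candidates).most_common(1)[0][0]


# Hypothetical toy corpus, standing in for a Gutenberg text.
toy_tokens = "the cat sat on the mat and the dog sat on the cat".split()
followers = build_followers(toy_tokens)

# Greedy generation repeats the same cycle of words.
greedy_story = generate(followers, "the", 8, most_frequent)
# Random generation: a seeded generator keeps the run repeatable.
random_story = generate(followers, "the", 8, random.Random(0).choice)

print(" ".join(greedy_story))  # the cat sat on the cat sat on
print(" ".join(random_story))
```

Because a follower appears in the list once per occurrence, choice() picks frequent followers more often than rare ones, so the output still reflects the bigram counts.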


Part-of-Speech Tagging: Terminology

Tagging

• The process of associating labels with each token in a text, using an algorithm to select a tag for each word, eg

Hand-coded rules

Statistical taggers

Brill (transformation-based) tagger

Hybrid tagger: combination, eg by “vote”

Tags

• The labels

Tag Set

• The collection of tags used for a particular task, eg Brown or LOB tagset


Example from the GENIA corpus

Typically a tagged text is a sequence of white-space separated word/tag tokens:

These/DT findings/NNS should/MD be/VB useful/JJ for/IN therapeutic/JJ strategies/NNS and/CC the/DT development/NN of/IN immunosuppressants/NNS targeting/VBG the/DT CD28/NN costimulatory/NN pathway/NN ./.
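A short sketch of how such word/tag text could be turned into (word, tag) pairs. The function name parse_tagged is an assumption for illustration; NLTK offers nltk.tag.str2tuple for converting a single token. Splitting on the last "/" keeps a token such as ./. intact.

```python
def parse_tagged(text, sep="/"):
    """Split white-space separated word/tag tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # rpartition splits on the LAST separator, so "./." gives (".", ".")
        word, _, tag = token.rpartition(sep)
        pairs.append((word, tag))
    return pairs


tagged = parse_tagged("These/DT findings/NNS should/MD be/VB useful/JJ ./.")
print(tagged)
```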


What does Tagging do?

Collapses Distinctions

• Lexical identity may be discarded

• e.g., all personal pronouns tagged with PRP

Introduces Distinctions

• Ambiguities may be resolved

• e.g. deal tagged with NN or VB

Helps in classification and prediction


Significance of Parts of Speech

A word’s POS tells us a lot about the word and its neighbors:

• Limits the range of meanings (deal), pronunciation (OBject as a noun vs obJECT as a verb) or both (wind)

• Helps in stemming

• Limits the range of following words

• Can help select nouns from a document for summarization

• Basis for partial parsing (chunked parsing)

• Parsers can build trees directly on the POS tags instead of maintaining a lexicon


Choosing a tagset

The choice of tagset greatly affects the difficulty of the problem

Need to strike a balance between

• Getting better information about context (favours more, finer-grained tags)

• Making it possible for classifiers to do their job (favours fewer tags)


Some of the best-known Tagsets

Brown corpus: 87 tags

• (more when tags are combined, eg isn’t)

LOB corpus: 132 tags

Penn Treebank: 45 tags

Lancaster UCREL C5 (used to tag the BNC): 61 tags

Lancaster C7: 145 tags


The Brown Corpus

An early digital corpus (1961)

• Francis and Kučera, Brown University

Contents: 500 texts, each about 2000 words long

• From American books, newspapers, magazines

• Representing genres:

• Science fiction, romance fiction, press reportage, scientific writing, popular lore


help(nltk.corpus.brown)

>>> help(nltk.corpus.brown)
 |  paras(self, fileids=None, categories=None)
 |
 |  raw(self, fileids=None, categories=None)
 |
 |  sents(self, fileids=None, categories=None)
 |
 |  tagged_paras(self, fileids=None, categories=None, simplify_tags=False)
 |
 |  tagged_sents(self, fileids=None, categories=None, simplify_tags=False)
 |
 |  tagged_words(self, fileids=None, categories=None, simplify_tags=False)
 |
 |  words(self, fileids=None, categories=None)
 |


nltk.corpus.brown

>>> nltk.corpus.brown.words()

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

>>> nltk.corpus.brown.tagged_words()

[('The', 'AT'), ('Fulton', 'NP-TL'), ...]

>>> nltk.corpus.brown.tagged_sents()

[[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), …
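The tagged output is a list of (word, tag) tuples, so ordinary Python tools apply to it. The sketch below copies the first tagged words shown above into a literal list (so it runs without downloading the corpus) and counts the tags; the variable names are assumptions for illustration. Newer NLTK releases appear to take a tagset= argument in place of simplify_tags, so check help() for the installed version.

```python
from collections import Counter

# The opening of the first Brown sentence, copied from the output above.
sample_tagged_sent = [
    ("The", "AT"), ("Fulton", "NP-TL"), ("County", "NN-TL"),
    ("Grand", "JJ-TL"), ("Jury", "NN-TL"), ("said", "VBD"),
    ("Friday", "NR"), ("an", "AT"), ("investigation", "NN"),
]

# Count how often each tag occurs.
tag_counts = Counter(tag for _, tag in sample_tagged_sent)

# Recover the plain words by dropping the tags.
words_only = [word for word, _ in sample_tagged_sent]

print(tag_counts)
print(words_only)
```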


Penn Treebank

First large syntactically annotated corpus

1 million words from the Wall Street Journal

Part-of-speech tags and syntax trees


help(nltk.corpus.treebank)

 |  parsed(*args, **kwargs)
 |      @deprecated: Use .parsed_sents() instead.
 |
 |  parsed_sents(self, files=None)
 |
 |  raw(self, files=None)
 |
 |  read(*args, **kwargs)
 |      @deprecated: Use .raw() or .sents() or .tagged_sents() or
 |      .parsed_sents() instead.
 |
 |  sents(self, files=None)
 |
 |  tagged(*args, **kwargs)
 |      @deprecated: Use .tagged_sents() instead.
 |
 |  tagged_sents(self, files=None)
 |
 |  tagged_words(self, files=None)


How hard is POS tagging?

Number of tags    Number of word types
1                 35340
2                 3760
3                 264
4                 61
5                 12
6                 2
7                 1

In the Brown corpus:

• 12% of word types are ambiguous

• 40% of word tokens are ambiguous
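A sketch of how a table like this could be computed from any list of (word, tag) pairs. The function name ambiguity_profile, the toy data and the decision to lower-case words before grouping are all assumptions for illustration; the figures above come from the slide, not from this code.

```python
from collections import Counter, defaultdict


def ambiguity_profile(tagged_words):
    """Return (histogram, share of ambiguous types, share of ambiguous tokens).

    histogram[n] is the number of word types that occur with n distinct tags.
    """
    tags_by_type = defaultdict(set)
    for word, tag in tagged_words:
        # Assumption: case is ignored when grouping tokens into types.
        tags_by_type[word.lower()].add(tag)

    histogram = Counter(len(tags) for tags in tags_by_type.values())
    ambiguous = {w for w, tags in tags_by_type.items() if len(tags) > 1}
    ambiguous_tokens = sum(1 for w, _ in tagged_words if w.lower() in ambiguous)

    type_share = len(ambiguous) / len(tags_by_type)
    token_share = ambiguous_tokens / len(tagged_words)
    return histogram, type_share, token_share


# Hypothetical toy data; with NLTK installed, brown.tagged_words() has the same shape.
toy = [("the", "DT"), ("race", "NN"), ("to", "TO"),
       ("race", "VB"), ("the", "DT"), ("deal", "NN")]
histogram, type_share, token_share = ambiguity_profile(toy)
print(dict(histogram), type_share, token_share)
```

Ambiguous words tend to be frequent words, which is why the token percentage is so much higher than the type percentage.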


Tagging with lexical frequencies

Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN

People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN

Problem: assign a tag to race given its lexical frequency

Solution: choose the tag with the greater probability

• P(race|VB)

• P(race|NN)

Actual estimates from the Switchboard corpus:

• P(race|NN) = 0.00041

• P(race|VB) = 0.00003

This suggests we should always tag race/NN (correct 41/44 = 93% of the time, but wrong for to/TO race/VB in the first example)
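A minimal sketch of a tagger that chooses by lexical frequency only. It counts how often each word occurs with each tag and always assigns the most frequent one, one common way to implement this baseline. The training counts (41 NN, 3 VB) are hypothetical, chosen only to mirror the 41/44 ratio above, and the function name train_lexical_tagger is an assumption.

```python
from collections import Counter, defaultdict


def train_lexical_tagger(tagged_words):
    """Return a dict mapping each word to its most frequent tag."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word][tag] += 1
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}


# Hypothetical counts for "race", mirroring the 41/44 ratio on the slide.
training = [("race", "NN")] * 41 + [("race", "VB")] * 3
best_tag = train_lexical_tagger(training)

# Accuracy of always choosing the most frequent tag, on the same data.
correct = sum(1 for word, tag in training if best_tag[word] == tag)
accuracy = correct / len(training)

print(best_tag["race"], f"{accuracy:.0%}")  # NN 93%
```

The tagger ignores context entirely, so every verbal use of race is mis-tagged; that is the motivation for the better taggers to come.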


Reminder

Puns play on our assumptions of the next word…

… eg they present us with an unexpected homonym (bends)

ConditionalFreqDist() counts word-pairs: word bigrams

Used for story generation, speech recognition, …

Parts of Speech: groups words into grammatical categories

… and separates different functions of a word

In English, many words are ambiguous: 2 or more PoS-tags

Very simple tagger: choose by lexical probability (only)

Better PoS-taggers: to come…