
Page 1:

CS388: Natural Language Processing Lecture 4: Sequence Models

Eunsol Choi

Parts of this lecture adapted from Greg Durrett, Yejin Choi, Yoav Artzi

Page 2:

Logistics


‣ HW1 due today midnight

‣ HW2 will be released tomorrow, due September 30th

‣ Materials needed to do HW2 will be covered by next Tuesday

Page 3:

Sequence Models


‣ Topics for next three lectures and HW2

‣ We will return to neural sequence models in a few weeks

Page 4:

Overview

‣ Sequence Modeling Problems in NLP

‣ Generative Model: Hidden Markov Models (HMM)

‣ Discriminative Models: Maximum Entropy Markov Models (MEMM), Conditional Random Fields (CRF)

‣ Unsupervised Learning: Expectation Maximization

Page 5:

Reading


‣ Collins: HMMs → generative sequence tagging model

‣ Collins: MEMMs → discriminative sequence tagging models

‣ Collins: EM → Expectation Maximization

‣ J&M: Chapter 8 (optional): covers both HMM and MEMM

Page 6:

The Structure of Language

‣ Language is tree-structured:

I ate the spaghetti with chopsticks    I ate the spaghetti with meatballs

‣ But a labelled sequence can provide shallow analysis:

I/PRP ate/VBD the/DT spaghetti/NN with/IN chopsticks/NNS    I/PRP ate/VBD the/DT spaghetti/NN with/IN meatballs/NNS

Page 7:

Sequence Modeling Problems in NLP


‣ Part-of-speech (POS) tagging

I/PRP ate/VBD the/DT spaghetti/NN with/IN chopsticks/NNS    I/PRP ate/VBD the/DT spaghetti/NN with/IN meatballs/NNS

‣ Named Entity Recognition (NER): segment text into spans with certain properties (person, organization, …)

[Germany]LOC 's representative to the [European Union]ORG 's veterinary committee [Werner Zwingman]PER said on Wednesday consumers should…

Germany/BL 's/NA representative/NA to/NA the/NA European/BO Union/CO 's/NA veterinary/NA committee/NA Werner/BP Zwingman/CP said/NA on/NA Wednesday/NA consumers/NA should/NA…

Page 8:

Parts of Speech

Slide credit: Dan Klein

‣ Categorization of words into types

Page 9:

CC    conjunction, coordinating                      and both but either or
CD    numeral, cardinal                              mid-1890 nine-thirty 0.5 one
DT    determiner                                     a all an every no that the
EX    existential there                              there
FW    foreign word                                   gemeinschaft hund ich jeux
IN    preposition or conjunction, subordinating      among whether out on by if
JJ    adjective or numeral, ordinal                  third ill-mannered regrettable
JJR   adjective, comparative                         braver cheaper taller
JJS   adjective, superlative                         bravest cheapest tallest
MD    modal auxiliary                                can may might will would
NN    noun, common, singular or mass                 cabbage thermostat investment subhumanity
NNP   noun, proper, singular                         Motown Cougar Yvette Liverpool
NNPS  noun, proper, plural                           Americans Materials States
NNS   noun, common, plural                           undergraduates bric-a-brac averages
POS   genitive marker                                ' 's
PRP   pronoun, personal                              hers himself it we them
PRP$  pronoun, possessive                            her his mine my our ours their thy your
RB    adverb                                         occasionally maddeningly adventurously
RBR   adverb, comparative                            further gloomier heavier less-perfectly
RBS   adverb, superlative                            best biggest nearest worst
RP    particle                                       aboard away back by on open through
TO    "to" as preposition or infinitive marker       to
UH    interjection                                   huh howdy uh whammo shucks heck
VB    verb, base form                                ask bring fire see take
VBD   verb, past tense                               pleaded swiped registered saw
VBG   verb, present participle or gerund             stirring focusing approaching erasing
VBN   verb, past participle                          dilapidated imitated reunified unsettled
VBP   verb, present tense, not 3rd person singular   twist appear comprise mold postpone
VBZ   verb, present tense, 3rd person singular       bases reconstructs marks uses
WDT   WH-determiner                                  that what whatever which whichever
WP    WH-pronoun                                     that what whatever which who whom
WP$   WH-pronoun, possessive                         whose
WRB   WH-adverb                                      however whenever where why

Main Tags

Page 10:

POS Tagging

‣ The POS tagging problem is to determine the POS tag for a particular instance of a word.

‣ Many words have more than one POS, depending on their context:

The back door = JJ (adjective)
On my back = NN (noun)
Win the voters back = RB (adverb)
Promised to back the bill = VB (verb)

Page 11:

Sources of Information


‣ Knowledge of neighboring words:

Time flies like an arrow; Fruit flies like a banana

‣ Knowledge of word probabilities:
‣ the, a, an are almost always articles
‣ man is frequently a noun, rarely a verb

‣ About 40% of word tokens are ambiguous
‣ If we choose the most frequent tag for each word, we get over 90% accuracy

Page 12:

What is this good for?

‣ Preprocessing step for syntactic parsers

‣ Domain-independent disambiguation for other tasks

‣ (Very) shallow information extraction: write regular expressions like (Det) Adj* N+ over the output to find phrases

Page 13:

POS tag sets in different languages

[Petrov et al., 2012]

Page 14:

Universal POS Tag Set

‣ Universal POS tag set (~12 tags); cross-lingual models work well!

[Gillick et al., 2016]

Page 15:

Today

‣ Sequence Modeling Problems in NLP

‣ Hidden Markov Models (HMM)

‣ Inference (Viterbi)

‣ HMM parameter estimation

Page 16:

Classic Solution: Hidden Markov Models

‣ Input x = (x1, ..., xn), Output y = (y1, ..., yn)

Page 17:

Two simplifying assumptions


‣ Markov assumption (the future is conditionally independent of the past given the present):

P(yi | y1, y2, ⋯, yi−1) = P(yi | yi−1)

‣ Independence assumption (each word depends only on its own tag):

P(xi | y, x1, …, xi−1) = P(xi | yi)

Page 18:

HMM for POS


The Georgia branch had taken on loan commitments …

DT NNP NN VBD VBN RP NN NNS

‣ States = {DT, NNP, NN, ... } are the POS tags

‣ Observations = V are words

‣ Transition distribution q(yi | yi−1) models the tag sequences

‣ Emission distribution e(xi | yi) models words given their POS tags
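‣ Putting the two distributions together (following the Collins notes listed under Reading), the joint probability of a sentence x and tag sequence y factorizes as:

P(x, y) = q(STOP | yn) ∏i=1..n q(yi | yi−1) e(xi | yi),   with y0 = START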

Page 19:

HMM Learning and Inference


‣ Learning: maximum likelihood estimates of the transition distribution q and emission distribution e

‣ Inference: the Viterbi algorithm

Page 20:

Learning: Maximum Likelihood


‣ Supervised learning for estimating transitions and emissions
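‣ With tagged training data, the maximum-likelihood estimates are simply normalized counts (a standard result; notation as in the Collins notes):

q(yi | yi−1) = count(yi−1, yi) / count(yi−1)
e(x | y) = count(y → x) / count(y)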

‣ Any concerns about the quality of any of these estimates?

Sparsity again!

Page 21:

Learning: Low frequency Words


‣ Dealing with low-frequency words: an example from named-entity recognition. The following word classes were used for infrequent words [Bikel et al., 1999]:

Word class               Example                  Intuition
twoDigitNum              90                       Two-digit year
fourDigitNum             1990                     Four-digit year
containsDigitAndAlpha    A8956-67                 Product code
containsDigitAndDash     09-96                    Date
containsDigitAndSlash    11/9/89                  Date
containsDigitAndComma    23,000.00                Monetary amount
containsDigitAndPeriod   1.00                     Monetary amount, percentage
othernum                 456789                   Other number
allCaps                  BBN                      Organization
capPeriod                M.                       Person name initial
firstWord                first word of sentence   No useful capitalization information
initCap                  Sally                    Capitalized word
lowercase                can                      Uncapitalized word
other                    ,                        Punctuation marks, all other words
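As a rough illustration, such a mapping might be implemented as below; the class names follow the table, while the function name and the ordering of the checks are assumptions for illustration:

    import re

    def word_class(word, is_first_word=False):
        """Map an infrequent word to a coarse class (after Bikel et al., 1999)."""
        if re.fullmatch(r"\d{2}", word): return "twoDigitNum"
        if re.fullmatch(r"\d{4}", word): return "fourDigitNum"
        has_digit = bool(re.search(r"\d", word))
        if has_digit and re.search(r"[A-Za-z]", word): return "containsDigitAndAlpha"
        if has_digit and "-" in word: return "containsDigitAndDash"
        if has_digit and "/" in word: return "containsDigitAndSlash"
        if has_digit and "," in word: return "containsDigitAndComma"
        if has_digit and "." in word: return "containsDigitAndPeriod"
        if word.isdigit(): return "othernum"
        if word.isalpha() and word.isupper(): return "allCaps"
        if re.fullmatch(r"[A-Z]\.", word): return "capPeriod"
        if is_first_word: return "firstWord"
        if word[:1].isupper(): return "initCap"
        if word.islower(): return "lowercase"
        return "other"

The order of the checks matters: "23,000.00" contains both a comma and a period, and the table's ordering sends it to containsDigitAndComma.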

Page 22:

Inference (Decoding)

‣ Inference problem: given input x = (x1, ..., xn), find the output y = (y1, ..., yn) maximizing

argmax_y P(y | x) = argmax_y P(y, x) / P(x)

‣ We can list all possible y and then pick the best one! Any problems? (There are |T|^n possible sequences for a tagset of size T.)

Page 23:

Inference (Decoding)

‣ First solution: beam search
‣ A beam is a set of partial hypotheses
‣ Start with a single empty trajectory
‣ At each step, consider all continuations, discard most, keep the top k

‣ But this does not guarantee the optimal answer… (see the sketch below)
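A minimal sketch of beam-search decoding for an HMM tagger, assuming nested dicts q[prev][tag] and e[tag][word] of transition and emission scores (all names here are illustrative, not from the slides):

    def beam_search(words, tags, q, e, k=3):
        """Keep the k best partial tag sequences at each position."""
        beam = [([], 1.0)]  # start with a single empty trajectory
        for w in words:
            candidates = []
            for seq, score in beam:
                prev = seq[-1] if seq else "START"
                for t in tags:  # consider all continuations
                    candidates.append((seq + [t], score * q[prev][t] * e[t][w]))
            # discard most, keep the top k
            beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        return beam[0][0]  # best surviving hypothesis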

Page 24:

The Viterbi Algorithm


‣ Dynamic program for computing the max score of a sequence of length i ending in tag yi
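‣ In the π notation of the following slides, this is the standard Viterbi recurrence:

π(i, yi) = max over yi−1 of [ π(i−1, yi−1) · q(yi | yi−1) · e(xi | yi) ],   with π(0, START) = 1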

‣ Now this is an efficient algorithm!

Page 25:

The Viterbi Algorithm

‣ Dynamic program for computing π(i, yi) (for all i)

‣ Iterative computation, for i = 1 … n:
‣ Store the score π(i, yi)
‣ Store a back-pointer to the best previous tag

A code sketch follows below.
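A minimal sketch in the same spirit, under the same assumed q and e dicts as the beam-search sketch above:

    def viterbi(words, tags, q, e):
        """Exact decoding: highest-scoring tag sequence under the HMM."""
        pi = [{"START": 1.0}]   # pi[i][t]: best score of a length-i prefix ending in tag t
        bp = [{}]               # bp[i][t]: back-pointer to the best previous tag
        for i, w in enumerate(words, start=1):
            pi.append({}); bp.append({})
            for t in tags:
                best = max(pi[i - 1], key=lambda p: pi[i - 1][p] * q[p][t])
                pi[i][t] = pi[i - 1][best] * q[best][t] * e[t][w]
                bp[i][t] = best
        n = len(words)
        out = [max(pi[n], key=pi[n].get)]   # best final tag (STOP transition omitted for brevity)
        for i in range(n, 1, -1):           # follow back-pointers
            out.append(bp[i][out[-1]])
        return out[::-1]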

Page 26:

Time flies like an arrow; Fruit flies like a banana


Pages 27–35:

[Figure: a Viterbi trellis for "Fruit flies like bananas", with columns for positions 1–4, rows for tags {N, V, IN}, and START and STOP states. Across the slides the scores π(i, y) are filled in left to right: the position-1 scores are {0.03, 0.01, 0}, position 2 {0.005, 0.007, 0}, position 3 {0.0007, 0.0003, 0.0001}, and position 4 {0.00003, 0.00001, 0}; the final slides repeat the completed trellis.]

Page 36:

‣ Why does this find the max p(·)? What is the runtime?


Page 37:

The Viterbi Algorithm: Runtime


‣ Linear in sentence length
‣ Polynomial in the number of possible tags

‣ Total runtime: O(nT²) for n words and T tags (each of the nT trellis cells maximizes over T predecessors)

‣ Would there be any scenarios where we would choose beam search?

Page 38:

Tagsets in Different Languages


‣ The T² factor in the Viterbi runtime for tagsets of different sizes:

294² = 86,436

45² = 2,025

11² = 121

Page 39:

Trigram HMM Taggers

‣ Trigram model: states are tag pairs, e.g. y1 = (<S>, NNP), y2 = (NNP, VBZ), …

‣ P((VBZ, NN) | (NNP, VBZ)): more context! Captures noun-verb-noun (S-V-O) order

‣ Tradeoff between model capacity and data size (sparsity)
‣ Trigrams are a "sweet spot" for POS tagging
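In the pair-state view, the Viterbi recurrence generalizes directly (same notation as before; a sketch, not from the slides):

π(i, (yi−1, yi)) = max over yi−2 of [ π(i−1, (yi−2, yi−1)) · q(yi | yi−2, yi−1) · e(xi | yi) ]

so the total runtime grows from O(nT²) to O(nT³).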

Page 40:

HMM POS Tagging

‣ Baseline: assign each word its most frequent tag: ~90% accuracy

‣ Trigram HMM: ~95% accuracy / 55% on unknown words

‣ TnT tagger (Brants 1998, tuned HMM): 96.2% accuracy / 86.0% on unks

‣ State-of-the-art (BiLSTM-CRFs): 97.5% / 89%+

Slide credit: Dan Klein

Page 41:

Can we do better?


‣ HMM is a generative model; estimation relies on counting! Reminds you of something?

‣ Can we build a discriminative model, incorporating rich features?

Page 42:

Named Entity Recognition (NER)

Barack/B-PER Obama/I-PER will/O travel/O to/O Hangzhou/B-LOC today/O for/O the/O G20/B-ORG meeting/O ./O

‣ BIO tagset: begin, inside, outside (of an entity span); a conversion sketch follows below

‣ Sequence of tags: should we use an HMM? Why might an HMM not do so well here?

‣ Lots of O's

‣ Insufficient features/capacity with multinomials (especially for unknown words)
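A minimal sketch of producing BIO tags from span annotations (function and variable names are illustrative):

    def spans_to_bio(tokens, spans):
        """spans: list of (start, end, label) token offsets, end exclusive."""
        tags = ["O"] * len(tokens)
        for start, end, label in spans:
            tags[start] = "B-" + label            # begin the span
            for i in range(start + 1, end):
                tags[i] = "I-" + label            # inside the span
        return tags

    sent = "Barack Obama will travel to Hangzhou today for the G20 meeting .".split()
    print(spans_to_bio(sent, [(0, 2, "PER"), (5, 6, "LOC"), (9, 10, "ORG")]))
    # ['B-PER', 'I-PER', 'O', 'O', 'O', 'B-LOC', 'O', 'O', 'O', 'B-ORG', 'O', 'O']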

Page 43:

Emission Features for NER

[Leicestershire]LOC is a nice place to visit…

I took a vacation to [Boston]LOC

[Apple]ORG released a new version…

According to the [New York Times]ORG…

[Texas]LOC governor [Greg Abbott]PER said…

[Leonardo DiCaprio]PER won an award…

Page 44:

Emission Features for NER

‣ Context features
‣ Words before/after

‣ Word features
‣ Capitalization
‣ Word shape
‣ Prefixes/suffixes
‣ Lexical indicators

‣ Word clusters

(Examples: Leicestershire, Boston, Apple released a new version…, According to the New York Times…)
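A rough sketch of feature templates of this kind for a position i; the feature-name strings and helper logic are illustrative assumptions:

    def emission_features(words, i):
        """Sparse features for tagging word i, as a list of feature-name strings."""
        w = words[i]
        shape = "".join("X" if c.isupper() else "x" if c.islower()
                        else "d" if c.isdigit() else c for c in w)
        return [
            "word=" + w.lower(),
            "shape=" + shape,                              # e.g. Boston -> Xxxxxx
            "prefix3=" + w[:3], "suffix3=" + w[-3:],
            "is_capitalized=" + str(w[:1].isupper()),
            "prev=" + (words[i - 1].lower() if i > 0 else "<S>"),              # word before
            "next=" + (words[i + 1].lower() if i + 1 < len(words) else "</S>"),  # word after
        ]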

Page 45:

Maximum Entropy Markov Models (MEMM)


‣ Log-linear model for the sequence tagging problem:

P(y | x) = ∏i P(yi | y1, …, yi−1, x)   (chain rule)
        ≈ ∏i P(yi | yi−1, x)   (independence assumption)

‣ Learning: train as a discrete log-linear model p(yi | yi−1, x1, …, xn)

‣ Scoring: p(yi | yi−1, x) ∝ exp(w · f(x, i, yi, yi−1)), written out below
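With its normalizer written out (the standard MEMM local model; f is a feature function such as the NER emission features above):

p(yi | yi−1, x) = exp(w · f(x, i, yi, yi−1)) / Σy′ exp(w · f(x, i, y′, yi−1))

Because each local distribution is normalized per position, decoding can reuse the same Viterbi recurrence, with q(yi | yi−1) · e(xi | yi) replaced by p(yi | yi−1, x).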