CS388: Natural Language Processing Lecture 4: Sequence Models
Eunsol Choi
Parts of this lecture adapted from Greg Durrett, Yejin Choi, Yoav Artzi
Logistics
‣ HW1 due today midnight
‣ HW2 will be released tomorrow, due September 30th
‣ Materials needed to do HW2 will be covered by next Tuesday
Sequence Models
‣ Topics for next three lectures and HW2
‣ We will return to neural sequence models in a few weeks
Overview
‣ Sequence Modeling Problems in NLP
‣ Generative Model: Hidden Markov Models (HMM)
‣ Discriminative Models: Maximum Entropy Markov Models (MEMM), Conditional Random Fields (CRF)
‣ Unsupervised Learning: Expectation Maximization
Reading
‣ Collins: HMMs → generative sequence tagging model
‣ Collins: MEMMs → discriminative sequence tagging models
‣ Collins: EM → Expectation Maximization
‣ J&M: Chapter 8 (optional)
‣ Covers both HMM and MEMM
The Structure of Language
‣ Language is tree-structured
I ate the spaghetti with chopsticks / I ate the spaghetti with meatballs
‣ But labeled sequences can provide a shallow analysis
I/PRP ate/VBZ the/DT spaghetti/NN with/IN chopsticks/NNS
I/PRP ate/VBZ the/DT spaghetti/NN with/IN meatballs/NNS
Sequence Modeling Problems in NLP
‣ Part-of-Speech (POS) Tagging
I/PRP ate/VBZ the/DT spaghetti/NN with/IN chopsticks/NNS
I/PRP ate/VBZ the/DT spaghetti/NN with/IN meatballs/NNS
‣ Named Entity Recognition (NER): segment text into spans with certain properties (person, organization, …)
[Germany]LOC 's representative to the [European Union]ORG 's veterinary committee [Werner Zwingman]PER said on Wednesday consumers should…
Germany/BL 's/NA representative/NA to/NA the/NA European/BO Union/CO 's/NA veterinary/NA committee/NA Werner/BP Zwingman/CP said/NA on/NA Wednesday/NA consumers/NA should/NA…
(B = begin, C = continue; L/O/P = location/organization/person; NA = not an entity)
Parts of Speech
Slide credit: Dan Klein
‣ Categorization of words into types
CC    conjunction, coordinating: and both but either or
CD    numeral, cardinal: mid-1890 nine-thirty 0.5 one
DT    determiner: a all an every no that the
EX    existential there: there
FW    foreign word: gemeinschaft hund ich jeux
IN    preposition or conjunction, subordinating: among whether out on by if
JJ    adjective or numeral, ordinal: third ill-mannered regrettable
JJR   adjective, comparative: braver cheaper taller
JJS   adjective, superlative: bravest cheapest tallest
MD    modal auxiliary: can may might will would
NN    noun, common, singular or mass: cabbage thermostat investment subhumanity
NNP   noun, proper, singular: Motown Cougar Yvette Liverpool
NNPS  noun, proper, plural: Americans Materials States
NNS   noun, common, plural: undergraduates bric-a-brac averages
POS   genitive marker: ' 's
PRP   pronoun, personal: hers himself it we them
PRP$  pronoun, possessive: her his mine my our ours their thy your
RB    adverb: occasionally maddeningly adventurously
RBR   adverb, comparative: further gloomier heavier less-perfectly
RBS   adverb, superlative: best biggest nearest worst
RP    particle: aboard away back by on open through
TO    "to" as preposition or infinitive marker: to
UH    interjection: huh howdy uh whammo shucks heck
VB    verb, base form: ask bring fire see take
VBD   verb, past tense: pleaded swiped registered saw
VBG   verb, present participle or gerund: stirring focusing approaching erasing
VBN   verb, past participle: dilapidated imitated reunified unsettled
VBP   verb, present tense, not 3rd person singular: twist appear comprise mold postpone
VBZ   verb, present tense, 3rd person singular: bases reconstructs marks uses
WDT   WH-determiner: that what whatever which whichever
WP    WH-pronoun: that what whatever which who whom
WP$   WH-pronoun, possessive: whose
WRB   WH-adverb: however whenever where why
Main Tags
POS Tagging
The back door = JJ (adjective)
On my back = NN (noun)
Win the voters back = RB (adverb)
Promised to back the bill = VB (verb)
‣ The POS tagging problem is to determine the POS tag for a particular instance of a word.
‣ Many words have more than one POS, depending on their context
Sources of Information
‣ Knowledge of neighboring words
‣ Knowledge of word probabilities
‣ the, a, an are almost always articles
‣ man is frequently a noun, rarely a verb
Time flies like an arrow; Fruit flies like a banana
‣ If we choose the most frequent tag, over 90% accuracy
‣ About 40% of word tokens are ambiguous
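As a concrete reference point, here is a minimal sketch of that most-frequent-tag baseline in Python (all names are illustrative, not from the lecture):

```python
from collections import Counter, defaultdict

# Most-frequent-tag baseline: for each word, always predict the tag it
# was seen with most often in training (~90% token accuracy).
def train_baseline(tagged_sentences):
    counts = defaultdict(Counter)
    for sent in tagged_sentences:            # sent: list of (word, tag)
        for word, tag in sent:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag_baseline(lexicon, words, default="NN"):
    # Unknown words get a default tag, a big source of the remaining errors.
    return [lexicon.get(w, default) for w in words]
```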
What is this good for?
‣ Preprocessing step for syntactic parsers
‣ Domain-independent disambiguation for other tasks
‣ (Very) shallow information extraction:
‣ Write regular expressions like (Det) Adj* N+ over the output to find phrases
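To make the regex idea concrete, here is a minimal sketch that matches the (Det) Adj* N+ pattern against a tag sequence, using the PTB tags DT/JJ/NN in place of the abstract Det/Adj/N (illustrative code, not from the lecture):

```python
import re

# Find "(Det) Adj* N+" noun phrases by matching a regex over the
# space-joined tag string, then mapping match offsets back to tokens.
def np_chunks(tagged):                     # tagged: list of (word, tag)
    tag_string = " ".join(tag for _, tag in tagged) + " "
    spans = []
    # optional determiner, any adjectives, one or more nouns
    for m in re.finditer(r"\b(DT )?(JJ )*(NN[PS]* )+", tag_string):
        start = tag_string[: m.start()].count(" ")
        end = start + m.group(0).count(" ")
        spans.append(" ".join(w for w, _ in tagged[start:end]))
    return spans

print(np_chunks([("the", "DT"), ("big", "JJ"), ("dog", "NN"), ("ran", "VBD")]))
# ['the big dog']
```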
POS tag sets in different languages
[Petrov et al., 2012]
Universal POS Tag Set
‣ Universal POS tagset (~12 tags); cross-lingual models work well! [Gillick et al., 2016]
Today
‣ Sequence Modeling Problems in NLP
‣ Hidden Markov Models (HMM)
‣ Inference (Viterbi)
‣ HMM parameter estimation
Classic Solution: Hidden Markov Models
‣ Input x = (x1, …, xn), output y = (y1, …, yn)
Two simplifying assumptions
‣ Markov assumption (the future is conditionally independent of the past given the present):
P(y_i | y_1, y_2, …, y_{i−1}) = P(y_i | y_{i−1})
‣ Independence assumption (each word depends only on its own tag):
P(x_i | x, y) = P(x_i | y_i)
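Together, these two assumptions yield the standard HMM factorization of the joint probability (a stop transition is often added at the end):

```latex
P(x, y) = \prod_{i=1}^{n} P(y_i \mid y_{i-1}) \, P(x_i \mid y_i)
```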
HMM for POS
The Georgia branch had taken on loan commitments …
DT NNP NN VBD VBN RP NN NNS
‣ States = {DT, NNP, NN, ... } are the POS tags
‣ Observations = V are words
‣ Transition distribution models the tag sequences
‣ Emission distribution models words given their POS
q(y_i | y_{i−1}): transition distribution
e(x_i | y_i): emission distribution
HMM Learning and Inference
‣ Learning: maximum likelihood estimates of transition q and emission e
‣ Inference: Viterbi algorithm
Learning: Maximum Likelihood
‣ Supervised learning for estimating transitions and emissions:
q(y' | y) = count(y, y') / count(y), e(x | y) = count(x tagged y) / count(y)
‣ Any concerns about the quality of any of these estimates?
Sparsity again!
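A minimal sketch of these count-based estimates, assuming training data arrives as lists of (word, tag) pairs (function and variable names are mine, not the course's):

```python
from collections import Counter

def estimate_hmm(tagged_sentences):
    """MLE for an HMM: q(y_i | y_{i-1}) and e(x_i | y_i) from counts."""
    trans, emit, tag_count = Counter(), Counter(), Counter()
    for sent in tagged_sentences:          # sent: list of (word, tag)
        prev = "<S>"
        tag_count[prev] += 1
        for word, tag in sent:
            trans[(prev, tag)] += 1
            emit[(tag, word)] += 1
            tag_count[tag] += 1
            prev = tag
        trans[(prev, "</S>")] += 1         # stop transition
    # Normalize counts into conditional probabilities.
    q = {k: v / tag_count[k[0]] for k, v in trans.items()}
    e = {k: v / tag_count[k[0]] for k, v in emit.items()}
    return q, e
```

The sparsity concern shows up directly here: any (tag, word) pair unseen in training gets probability zero, which motivates the word classes on the next slide.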
Learning: Low-Frequency Words
Dealing with Low-Frequency Words: An Example [Bikel et al., 1999] (named-entity recognition)

Word class              Example                  Intuition
twoDigitNum             90                       two-digit year
fourDigitNum            1990                     four-digit year
containsDigitAndAlpha   A8956-67                 product code
containsDigitAndDash    09-96                    date
containsDigitAndSlash   11/9/89                  date
containsDigitAndComma   23,000.00                monetary amount
containsDigitAndPeriod  1.00                     monetary amount, percentage
othernum                456789                   other number
allCaps                 BBN                      organization
capPeriod               M.                       person name initial
firstWord               first word of sentence   no useful capitalization information
initCap                 Sally                    capitalized word
lowercase               can                      uncapitalized word
other                   ,                        punctuation marks, all other words
‣ Used the word classes above for infrequent words [Bikel et al., 1999]
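A minimal sketch of such a word-class mapping, approximating (not exactly reproducing) the Bikel et al. classes above:

```python
import re

def word_class(word: str, is_first_word: bool = False) -> str:
    """Map a rare word to a coarse class, checked in priority order."""
    if re.fullmatch(r"\d{2}", word):
        return "twoDigitNum"
    if re.fullmatch(r"\d{4}", word):
        return "fourDigitNum"
    if any(c.isdigit() for c in word):
        if any(c.isalpha() for c in word):
            return "containsDigitAndAlpha"
        if "-" in word:
            return "containsDigitAndDash"
        if "/" in word:
            return "containsDigitAndSlash"
        if "," in word:
            return "containsDigitAndComma"
        if "." in word:
            return "containsDigitAndPeriod"
        return "othernum"
    if word.isalpha() and word.isupper():
        return "allCaps"
    if re.fullmatch(r"[A-Z]\.", word):
        return "capPeriod"
    if is_first_word:
        return "firstWord"
    if word[:1].isupper():
        return "initCap"
    if word.islower():
        return "lowercase"
    return "other"
```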
Inference (Decoding)
‣ Inference problem:
‣ We can list all possible y and then pick the best one!
‣ Any problems?
‣ Input x = (x1, …, xn), output y = (y1, …, yn)
[HMM diagram: tag chain y1 → y2 → … → yn, each tag emitting a word x1, x2, …, xn]
argmax_y P(y | x) = argmax_y P(y, x) / P(x) = argmax_y P(y, x)
‣ First solution: beam search
‣ A beam is a set of partial hypotheses
‣ Start with a single empty trajectory
‣ At each step, consider all continuations, discard most, keep top K
‣ But this does not guarantee the optimal answer…
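A minimal sketch of beam search for this tagging setting, reusing dictionary-based q and e as in the estimation sketch earlier (illustrative, not the lecture's code):

```python
import math

def beam_search(words, tags, q, e, k=5, start="<S>"):
    """Keep only the top-k partial tag sequences at each position.
    Faster than exact search, but the optimum can fall off the beam."""
    beam = [([start], 0.0)]                  # (partial tags, log score)
    for word in words:
        candidates = []
        for tags_so_far, score in beam:
            for y in tags:
                p = q.get((tags_so_far[-1], y), 0.0) * e.get((y, word), 0.0)
                if p > 0.0:
                    candidates.append((tags_so_far + [y], score + math.log(p)))
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beam[0][0][1:] if beam else []    # drop the start symbol
```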
The Viterbi Algorithm
‣ Dynamic program for computing the max score of a tag sequence of length i ending in tag y_i
‣ Now this is an efficient algorithm!
‣ Dynamic program for computing π(i, y_i), the max score of a length-i prefix ending in tag y_i (for all i)
‣ Iterative computation:
‣ For i = 1 … n:
‣ Store score: π(i, y_i) = max over y_{i−1} of π(i−1, y_{i−1}) · q(y_i | y_{i−1}) · e(x_i | y_i)
‣ Store back-pointer: bp(i, y_i) = the argmax of the same expression
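A minimal sketch of this dynamic program in log space, with dictionary-based q and e as in the earlier sketches (illustrative names; runtime is O(nT²)):

```python
import math

def viterbi(words, tags, q, e, start="<S>", stop="</S>"):
    """pi[i][y] = max log score of a length-i prefix ending in tag y."""
    n = len(words)
    pi = [{start: 0.0}]
    bp = [{}]
    for i in range(1, n + 1):
        pi.append({})
        bp.append({})
        for y in tags:
            ey = e.get((y, words[i - 1]), 0.0)
            if ey == 0.0:
                continue
            best_prev, best = None, -math.inf
            for yp, score in pi[i - 1].items():
                qy = q.get((yp, y), 0.0)
                if qy > 0.0 and score + math.log(qy) > best:
                    best, best_prev = score + math.log(qy), yp
            if best_prev is not None:
                pi[i][y] = best + math.log(ey)
                bp[i][y] = best_prev
    # Transition into STOP, then follow back-pointers right to left.
    best_y, best = None, -math.inf
    for y, score in pi[n].items():
        qs = q.get((y, stop), 0.0)
        if qs > 0.0 and score + math.log(qs) > best:
            best, best_y = score + math.log(qs), y
    if best_y is None:
        return []                      # no valid path under q and e
    path = [best_y]
    for i in range(n, 1, -1):
        path.append(bp[i][path[-1]])
    return list(reversed(path))
```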
The Viterbi Algorithm
Time flies like an arrow; Fruit flies like a banana
[Viterbi trellis walkthrough: "Fruit flies like bananas" with tag set {N, V, IN} plus START and STOP states. Successive slides fill in the score columns π(1, ·) through π(4, ·) left to right (values such as 0.03, 0.01, 0, …, 0.00003), storing a back-pointer at each cell; the best tag sequence is recovered by following back-pointers from STOP.]
Why does this find the max p(.)? What is the runtime?
The Viterbi Algorithm: Runtime
‣ Linear in sentence length n
‣ Quadratic in the number of possible tags T (each cell considers every previous tag)
‣ Total runtime: O(nT²)
‣ Would there be any scenarios where we would choose beam search?
Tagsets in Different Languages
294² = 86,436
45² = 2,025
11² = 121
Trigram HMM Taggers
‣ Trigram model over tag pairs: y1 = (<S>, NNP), y2 = (NNP, VBZ), …
‣ P((VBZ, NN) | (NNP, VBZ)): more context! Captures noun-verb-noun (S-V-O) patterns
‣ Tradeoff between model capacity and data size (sparsity)
‣ Trigrams are a "sweet spot" for POS tagging
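One way to picture the trigram model: each Viterbi state becomes a pair of tags, so the bigram machinery above carries over unchanged (a sketch under that assumption, not the lecture's code):

```python
from itertools import product

# A trigram HMM reuses bigram Viterbi by treating each state as a
# *pair* of tags: q((y2, y) | (y1, y2)) = q_tri(y | y1, y2).
def trigram_states(tags):
    return [("<S>", "<S>")] + list(product(tags, repeat=2))

def pair_transition(q_tri, prev_state, tag):
    y1, y2 = prev_state
    return q_tri.get((y1, y2, tag), 0.0)   # next state becomes (y2, tag)
```

Note the cost: the state space grows from T to T², which is exactly the capacity-versus-sparsity tradeoff on this slide.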
HMM POS Tagging
‣ Baseline: assign each word its most frequent tag: ~90% accuracy
‣ Trigram HMM: ~95% accuracy / 55% on unknown words
‣ TnT tagger (Brants 2000, tuned HMM): 96.2% accuracy / 86.0% on unks
Slide credit: Dan Klein
‣ State-of-the-art (BiLSTM-CRFs): 97.5% / 89%+
Can we do better?
‣ HMM is a generative model; estimation relies on counting!
‣ Does this remind you of something?
‣ Can we build a discriminative model, incorporating rich features?
Named Entity Recognition (NER)
[Barack Obama]PER will travel to [Hangzhou]LOC today for the [G20]ORG meeting.
B-PER I-PER O O O B-LOC O O O B-ORG O O
‣ BIO tagset: begin, inside, outside
‣ Sequence of tags — should we use an HMM?
‣ Why might an HMM not do so well here?
‣ Lots of O's
‣ Insufficient features/capacity with multinomials (especially for unks)
Emission Features for NER
[Leicestershire]LOC is a nice place to visit…
I took a vacation to [Boston]LOC
[Apple]ORG released a new version…
According to the [New York Times]ORG…
[Texas]LOC governor [Greg Abbott]PER said…
[Leonardo DiCaprio]PER won an award…
Emission Features for NER
‣ Context features: words before/after
‣ Word features: capitalization, word shape, prefixes/suffixes, lexical indicators
‣ Word clusters
Examples: Leicestershire, Boston, Apple released a new version…, According to the New York Times…
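A minimal sketch of an emission feature extractor along these lines (feature names are illustrative, not from the lecture):

```python
def ner_features(words, i):
    """Features for the word at position i: word identity, shape,
    affixes, and the neighboring-word context."""
    w = words[i]
    return {
        "word": w.lower(),
        "capitalized": w[0].isupper(),
        "shape": "".join("X" if c.isupper() else "x" if c.islower()
                         else "d" if c.isdigit() else c for c in w),
        "prefix3": w[:3].lower(),
        "suffix3": w[-3:].lower(),     # e.g., "-shire" hints at LOC
        "prev_word": words[i - 1].lower() if i > 0 else "<S>",
        "next_word": words[i + 1].lower() if i + 1 < len(words) else "</S>",
    }
```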
Maximum Entropy Markov Models (MEMM)
‣ Log-linear model for the sequence tagging problem
‣ Chain rule: P(y | x) = ∏_{i=1}^{n} P(y_i | y_1, …, y_{i−1}, x_1, …, x_n)
‣ Independence assumption: P(y_i | y_1, …, y_{i−1}, x_1, …, x_n) = P(y_i | y_{i−1}, x_1, …, x_n)
‣ Learning: train as a discrete log-linear model p(y_i | y_{i−1}, x_1, …, x_n)
‣ Scoring: p(y_i | y_{i−1}, x) ∝ exp(w · f(x, i, y_i, y_{i−1}))
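A minimal sketch of the MEMM's local distribution: score each candidate tag with a weighted feature sum, then normalize with a softmax (all names are illustrative, not the lecture's code):

```python
import math

def memm_local_probs(weights, feats, x, i, prev_tag, tags):
    """p(y_i | y_{i-1}, x): softmax over w . f(x, i, y, prev_tag)."""
    scores = {}
    for y in tags:
        scores[y] = sum(weights.get(f, 0.0) for f in feats(x, i, y, prev_tag))
    log_z = math.log(sum(math.exp(s) for s in scores.values()))
    return {y: math.exp(s - log_z) for y, s in scores.items()}

def feats(x, i, y, prev_tag):
    """Example feature templates: word identity, suffix, previous tag."""
    w = x[i]
    return [f"w={w}&y={y}", f"suf3={w[-3:]}&y={y}",
            f"prev={prev_tag}&y={y}", f"cap={w[0].isupper()}&y={y}"]
```

Decoding can then reuse Viterbi over these local distributions, since the model still factors over adjacent tag pairs.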