Sequence Models I


Wei Xu (many slides from Greg Durrett, Dan Klein, Vivek Srikumar, Chris Manning, Yoav Artzi)

Administrivia

‣ Project 1 is out, due on Sep 20 (next Monday!).

‣ Reading: Eisenstein 7.0-7.4, Jurafsky + Martin Chapter 8

This Lecture

‣ Sequence modeling

‣ HMMs for POS tagging

‣ Viterbi, forward-backward

‣ HMM parameter estimation

Linguistic Structures

‣ Language is tree-structured

I ate the spaghetti with chopsticks        I ate the spaghetti with meatballs

‣ Understanding syntax fundamentally requires trees; the sentences have the same shallow analysis

I ate the spaghetti with chopsticks        I ate the spaghetti with meatballs
PRP VBZ DT NN IN NNS                       PRP VBZ DT NN IN NNS

Linguistic Structures

‣ Language is sequentially structured: interpreted in an online way

Tanenhaus et al. (1995)

POS Tagging

Ghana's ambassador should have set up the big meeting in DC yesterday.

‣ What tags are out there?

NNP POS NN MD VB VBN RP DT JJ NN IN NNP NN .

POS Tagging

Slide credit: Dan Klein

POS Tagging

Slide credit: Yoav Artzi

POS Tagging

Fed raises interest rates 0.5 percent

Candidate tags:  Fed: VBD/VBN/NNP   raises: VBZ/NNS   interest: VB/VBP/NN   rates: VBZ/NNS   0.5: CD   percent: NN

I'm 0.5% interested in the Fed's raises!

I hereby increase interest rates 0.5%

Fed raises interest rates 0.5 percent

Candidate tags:  Fed: VBD/VBN/NNP   raises: VBZ/NNS   interest: VB/VBP/NN   rates: VBZ/NNS   0.5: CD   percent: NN

‣ Other paths are also plausible but even more semantically weird…
‣ What governs the correct choice? Word + context
‣ Word identity: most words have <= 2 tags, many have one (percent, the)
‣ Context: nouns start sentences, nouns follow verbs, etc.

What is this good for?

‣ Text-to-speech: record, lead

‣ Preprocessing step for syntactic parsers

‣ Domain-independent disambiguation for other tasks

‣ (Very) shallow information extraction

Sequence Models

‣ Input x = (x1, ..., xn)    Output y = (y1, ..., yn)

‣ POS tagging: x is a sequence of words, y is a sequence of tags

‣ Today: generative models P(x, y); discriminative models next time

Hidden Markov Models

‣ Input x = (x1, ..., xn)    Output y = (y1, ..., yn)

‣ Model the sequence of y as a Markov process

y1 → y2 → y3

‣ Markov property: the future is conditionally independent of the past given the present: P(y3 | y1, y2) = P(y3 | y2)

‣ If y are tags, this roughly corresponds to assuming that the next tag only depends on the current tag, not anything before

‣ Lots of mathematical theory about how Markov chains behave

Hidden Markov Models

‣ Input x = (x1, ..., xn)    Output y = (y1, ..., yn)

y1 → y2 → … → yn    (e.g., NNP  VBZ  …  NN)
|     |          |
x1    x2   …    xn    (e.g., Fed  raises  …  percent)

Hidden Markov Models

y1 → y2 → … → yn
|     |          |
x1    x2   …    xn

‣ Input x = (x1, ..., xn)    Output y = (y1, ..., yn)

P(y, x) = P(y1) ∏_{i=2..n} P(yi | yi-1) ∏_{i=1..n} P(xi | yi)
          [initial distribution]  [transition probabilities]  [emission probabilities]

‣ P(x|y) is a distribution over all words in the vocabulary, not a distribution over features (but could be!)

‣ Multinomials: tag × tag transitions, tag × word emissions

‣ Observation (x) depends only on current state (y)
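To make the factorization concrete, here is a minimal sketch (not from the slides) of scoring a tagged sentence under an HMM with plain Python dictionaries; the parameter containers `init`, `trans`, and `emit` are assumed names for the three sets of probabilities estimated elsewhere.

```python
import math

def hmm_log_joint(tags, words, init, trans, emit):
    """log P(y, x) = log P(y1) + sum_i log P(yi | yi-1) + sum_i log P(xi | yi).

    init[tag], trans[prev_tag][tag], emit[tag][word] are assumed to hold
    probabilities (e.g., smoothed normalized counts)."""
    logp = math.log(init[tags[0]])
    for prev, cur in zip(tags, tags[1:]):          # transition terms, i = 2..n
        logp += math.log(trans[prev][cur])
    for tag, word in zip(tags, words):             # emission terms, i = 1..n
        logp += math.log(emit[tag][word])
    return logp
```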

Transitions in POS Tagging

‣ Dynamics model: P(y1) ∏_{i=2..n} P(yi | yi-1)

Fed raises interest rates 0.5 percent .
Candidate tags:  Fed: VBD/VBN/NNP   raises: VBZ/NNS   interest: VB/VBP/NN   rates: VBZ/NNS   0.5: CD   percent: NN

‣ P(y1 = NNP): likely because start of sentence

‣ P(y2 = VBZ | y1 = NNP): likely because verb often follows noun

‣ P(y3 = NN | y2 = VBZ): direct object follows verb, other verb rarely follows past tense verb (main verbs can follow modals though!)

NNP = proper noun, singular; VBZ = verb, 3rd person singular present; NN = noun, singular or mass

Estimating Transitions

‣ Similar to Naive Bayes estimation: maximum likelihood solution = normalized counts (with smoothing) read off supervised data

Fed  raises  interest  rates  0.5  percent  .
NNP  VBZ     NN        NNS    CD   NN       .

‣ How to smooth?

‣ One method: smooth with unigram distribution over tags

P(tag | tag−1) = (1 − λ) P̂(tag | tag−1) + λ P̂(tag)

P̂ = empirical distribution (read off from data)

‣ P(tag | NN) = (0.5 ., 0.5 NNS)
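A sketch of how these counts and the interpolation above might be computed from tagged sentences; `tagged_sentences` (lists of (word, tag) pairs) and the smoothing weight `lam` are assumed names, not part of the slides.

```python
from collections import Counter, defaultdict

def estimate_transitions(tagged_sentences, lam=0.1):
    """P(tag | prev) = (1 - lam) * P_hat(tag | prev) + lam * P_hat(tag),
    where P_hat are normalized counts from the supervised data."""
    bigram = defaultdict(Counter)   # counts of prev_tag -> tag
    unigram = Counter()             # counts of tag
    for sent in tagged_sentences:
        tags = [tag for _, tag in sent]
        unigram.update(tags)
        for prev, cur in zip(tags, tags[1:]):
            bigram[prev][cur] += 1

    total = sum(unigram.values())
    p_unigram = {t: c / total for t, c in unigram.items()}

    trans = {}
    for prev, counts in bigram.items():
        n = sum(counts.values())
        trans[prev] = {t: (1 - lam) * counts[t] / n + lam * p_unigram[t]
                       for t in unigram}
    return trans
```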

Emissions in POS Tagging

‣ Emissions P(x|y) capture the distribution of words occurring with a given tag

‣ P(word | NN) = (0.05 person, 0.04 official, 0.03 interest, 0.03 percent, …)

‣ When you compute the posterior for a given word's tags, the distribution favors tags that are more likely to generate that word

‣ How should we smooth this?

Fed  raises  interest  rates  0.5  percent  .
NNP  VBZ     NN        NNS    CD   NN       .

Estimating Emissions

Fed  raises  interest  rates  0.5  percent
NNP  VBZ     NN        NNS    CD   NN

‣ P(word | NN) = (0.5 interest, 0.5 percent): hard to smooth!

‣ Fancy techniques from language modeling, e.g. look at type fertility: P(tag | word) is flatter for some kinds of words than for others

‣ Alternative: use Bayes' rule

P(word | tag) = P(tag | word) P(word) / P(tag)

‣ Can interpolate with a distribution looking at word shape, P(word shape | tag) (e.g., P(capitalized word of length >= 8 | tag))

‣ P(word | tag) can be a log-linear model; we'll see this in a few lectures
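A sketch of how the Bayes' rule formulation above could be realized from counts; the function and variable names are illustrative. With unsmoothed estimates it collapses to plain relative frequencies, which is the point: P(tag | word) is the factor where smoothing of the kind described above would be applied.

```python
from collections import Counter, defaultdict

def emissions_via_bayes(tagged_sentences):
    """P(word | tag) = P(tag | word) * P(word) / P(tag), all estimated from counts."""
    word_counts, tag_counts = Counter(), Counter()
    tag_given_word = defaultdict(Counter)
    for sent in tagged_sentences:
        for word, tag in sent:
            word_counts[word] += 1
            tag_counts[tag] += 1
            tag_given_word[word][tag] += 1

    n = sum(word_counts.values())
    p_word = {w: c / n for w, c in word_counts.items()}
    p_tag = {t: c / n for t, c in tag_counts.items()}

    emit = defaultdict(dict)
    for word, counts in tag_given_word.items():
        for tag, c in counts.items():
            p_tag_given_word = c / word_counts[word]   # this is where smoothing would go
            emit[tag][word] = p_tag_given_word * p_word[word] / p_tag[tag]
    return emit
```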

Inference in HMMs

‣ Input x = (x1, ..., xn)    Output y = (y1, ..., yn)

y1 → y2 → … → yn
|     |          |
x1    x2   …    xn

P(y, x) = P(y1) ∏_{i=2..n} P(yi | yi-1) ∏_{i=1..n} P(xi | yi)

‣ Inference problem: argmax_y P(y|x) = argmax_y P(y, x) / P(x)

‣ Exponentially many possible y here!

‣ Solution: dynamic programming (possible because of Markov structure!)

‣ Many neural sequence models depend on the entire previous tag sequence, need to use approximations like beam search

Viterbi Algorithm

Slide credit: Vivek Srikumar


‣ best (partial) score for a sequence ending in state s


Viterbi Algorithm

Slide credit: Dan Klein

‣ “Think about” all possible immediate prior state values. Everything before that has already been accounted for by earlier stages.
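The Viterbi slides above are presented graphically; the following is a hedged sketch of the standard recurrence they describe, keeping for each position and state the best log score of any partial tag sequence ending in that state (cf. the note above) plus a backpointer. The `init`/`trans`/`emit` dictionaries and the tiny floor for unseen words are assumptions for this sketch, not part of the slides.

```python
import math

def viterbi(words, tags, init, trans, emit):
    """Return the highest-scoring tag sequence under the HMM (log-space Viterbi)."""
    # score[t][s]: best log prob of any tag sequence ending in state s at position t
    score = [{s: math.log(init[s]) + math.log(emit[s].get(words[0], 1e-10)) for s in tags}]
    back = [{}]
    for t in range(1, len(words)):
        score.append({})
        back.append({})
        for s in tags:
            # "think about" every possible immediate prior state, keep the best
            best_prev = max(tags, key=lambda p: score[t - 1][p] + math.log(trans[p][s]))
            score[t][s] = (score[t - 1][best_prev] + math.log(trans[best_prev][s])
                           + math.log(emit[s].get(words[t], 1e-10)))
            back[t][s] = best_prev
    # follow backpointers from the best final state
    last = max(tags, key=lambda s: score[-1][s])
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))
```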

Forward-Backward Algorithm

‣ In addition to finding the best path, we may want to compute marginal probabilities of paths: P(yi = s | x)

P(yi = s | x) = Σ_{y1, ..., yi-1, yi+1, ..., yn} P(y | x)

‣ What did Viterbi compute? P(ymax | x) = max_{y1, ..., yn} P(y | x)

‣ Can compute marginals with dynamic programming as well, using an algorithm called forward-backward

Forward-Backward Algorithm

Slide credit: Dan Klein

P(y3 = 2 | x) = (sum of all paths through state 2 at time 3) / (sum of all paths)

‣ Easiest and most flexible to do one pass to compute the forward probabilities α and one pass to compute the backward probabilities β

Forward-Backward Algorithm

‣ Initial: α1(s) = P(s) P(x1 | s)

‣ Recurrence: αt(st) = Σ_{st-1} αt-1(st-1) P(st | st-1) P(xt | st)

‣ Same as Viterbi but summing instead of maxing!

‣ These quantities get very small! Store everything as log probabilities

Forward-Backward Algorithm

‣ Initial: βn(s) = 1

‣ Recurrence: βt(st) = Σ_{st+1} βt+1(st+1) P(st+1 | st) P(xt+1 | st+1)

‣ Big differences: count emission for the next time step (not current one)

Forward-Backward Algorithm

α1(s) = P(s) P(x1 | s)        αt(st) = Σ_{st-1} αt-1(st-1) P(st | st-1) P(xt | st)
βn(s) = 1                     βt(st) = Σ_{st+1} βt+1(st+1) P(st+1 | st) P(xt+1 | st+1)

‣ Big differences: count emission for the next time step (not current one)

Forward-Backward Algorithm

α1(s) = P(s) P(x1 | s)        αt(st) = Σ_{st-1} αt-1(st-1) P(st | st-1) P(xt | st)
βn(s) = 1                     βt(st) = Σ_{st+1} βt+1(st+1) P(st+1 | st) P(xt+1 | st+1)

P(s3 = 2 | x) = α3(2) β3(2) / Σ_i α3(i) β3(i)

‣ What is the denominator here? P(x)
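A sketch of the two recurrences and the marginal above, using the same assumed `init`/`trans`/`emit` dictionaries as the earlier sketches. For readability it works with raw probabilities and normalizes at the end; as the slide warns, a real implementation would use log probabilities (or rescale each column of α and β) to avoid underflow.

```python
def forward_backward_marginals(words, tags, init, trans, emit):
    """Marginals P(y_t = s | x) = alpha_t(s) * beta_t(s) / sum_s' alpha_t(s') * beta_t(s')."""
    n = len(words)
    # Forward: alpha_1(s) = P(s) P(x1 | s); alpha_t(s) = sum_{s'} alpha_{t-1}(s') P(s | s') P(x_t | s)
    alpha = [{s: init[s] * emit[s].get(words[0], 1e-10) for s in tags}]
    for t in range(1, n):
        alpha.append({s: sum(alpha[t - 1][p] * trans[p][s] for p in tags)
                         * emit[s].get(words[t], 1e-10) for s in tags})
    # Backward: beta_n(s) = 1; beta_t(s) = sum_{s'} beta_{t+1}(s') P(s' | s) P(x_{t+1} | s')
    beta = [dict.fromkeys(tags, 1.0) for _ in range(n)]
    for t in range(n - 2, -1, -1):
        beta[t] = {s: sum(beta[t + 1][q] * trans[s][q] * emit[q].get(words[t + 1], 1e-10)
                          for q in tags) for s in tags}
    # Combine: the denominator is the same at every position and equals P(x)
    marginals = []
    for t in range(n):
        z = sum(alpha[t][s] * beta[t][s] for s in tags)
        marginals.append({s: alpha[t][s] * beta[t][s] / z for s in tags})
    return marginals
```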

HMM POS Tagging

‣ Baseline: assign each word its most frequent tag: ~90% accuracy

‣ Trigram HMM: ~95% accuracy / 55% on unknown words

Slide credit: Dan Klein
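As a point of reference for the ~90% number, here is a small sketch of the most-frequent-tag baseline (names are illustrative): tag each word with its most frequent training tag, falling back to the globally most common tag for unknown words.

```python
from collections import Counter, defaultdict

def train_most_frequent_tag(tagged_sentences):
    """Baseline tagger: remember each word's most frequent training tag."""
    counts = defaultdict(Counter)
    all_tags = Counter()
    for sent in tagged_sentences:
        for word, tag in sent:
            counts[word][tag] += 1
            all_tags[tag] += 1
    fallback = all_tags.most_common(1)[0][0]   # used for unknown words
    lookup = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    return lambda words: [lookup.get(w, fallback) for w in words]
```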

Trigram Taggers

‣ Trigram model: y1 = (<S>, NNP), y2 = (NNP, VBZ), …

‣ P((VBZ, NN) | (NNP, VBZ)): more context! Noun-verb-noun, S-V-O

Fed  raises  interest  rates  0.5  percent
NNP  VBZ     NN        NNS    CD   NN

‣ Tradeoff between model capacity and data size; trigrams are a “sweet spot” for POS tagging (see the sketch below)
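One way to see how the tag-pair states plug into the same Viterbi machinery, as a hedged sketch (helper names are illustrative): enumerate pair states and only allow transitions whose shared tag matches, scoring them with the trigram probability P(new tag | previous two tags).

```python
from itertools import product

def trigram_states(tags, start="<S>"):
    """States of a trigram HMM are tag pairs: (<S>, NNP), (NNP, VBZ), ..."""
    return [(start, t) for t in tags] + list(product(tags, repeat=2))

def transition_allowed(prev_state, next_state):
    """A transition (a, b) -> (b', c) is only allowed when b == b';
    its score is then P(c | a, b), i.e., the trigram probability."""
    return prev_state[1] == next_state[0]
```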

HMM POS Tagging

‣ Baseline: assign each word its most frequent tag: ~90% accuracy

‣ Trigram HMM: ~95% accuracy / 55% on unknown words

‣ TnT tagger (Brants 1998, tuned HMM): 96.2% accuracy / 86.0% on unks

‣ State-of-the-art (BiLSTM-CRFs): 97.5% / 89%+ on unks

Slide credit: Dan Klein

https://arxiv.org/pdf/cs/0003055.pdf

Errors

official knowledge     made up the story     recently sold shares
JJ/NN    NN            VBD RP/IN DT NN       RB VBD/VBN NNS

Slide credit: Dan Klein / Toutanova + Manning (2000)    (NN NN: tax cut, art gallery, …)

Remaining Errors

‣ Underspecified/unclear, gold standard inconsistent/wrong: 58%

‣ Lexicon gap (word not seen with that tag in training): 4.5%
‣ Unknown word: 4.5%
‣ Could get right: 16% (many of these involve parsing!)

‣ Difficult linguistics: 20%

They set up absurd situations, detached from reality        VBD/VBP? (past or present?)

a $10 million fourth-quarter charge against discontinued operations        adjective or verbal participle? JJ/VBN?

Manning 2011 “Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics?”

Other Languages

Petrov et al. 2012

Other Languages

‣ Universal POS tagset (~12 tags), cross-lingual model works as well as tuned CRF using external resources

Gillick et al. 2016

Byte-to-Span

Zero-shot Cross-lingual Transfer Learning

‣ Models are trained on annotated English data, then directly applied to Arabic texts for POS tagging.

Lan, Chen, Xu, Ritter 2020

Zero-shot Cross-lingual Transfer Learning

‣ Models are trained on annotated English data, then directly applied to Arabic texts for POS tagging.    Lan, Chen, Xu, Ritter 2020

Next Up

‣ CRFs: feature-based discriminative models

‣ Named entity recognition
